Lecture 10 - ending CUDA
Even though, when programming CUDA, we have virtually unlimited threads, there are physical hardware limits. These are:
- The maximum number of blocks per grid dimension is 65535 (for sm_20)
- The maximum number of threads per block is 512, or 1024 for sm_20 and later
- If we launch the maximum of 65535 blocks with, say, 128 threads per block, the total number of threads is:
	- 65535 * 128 = 8388480 threads
Once we reach this limit, we must increase the number of operations per thread: each thread can process 4 elements, 8 elements, and so on.
We also have to make sure the threads access data that is spatially and temporally local between each other, in order to maximize cache hits.
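One common way to give each thread more work while keeping neighbouring threads on neighbouring addresses is a grid-stride loop. A minimal sketch (the kernel name and signature here are assumptions, not the lecture's exact code):

```cuda
// Each thread handles several elements by striding over the array
// in steps of the total grid size. Consecutive threads still touch
// consecutive addresses on each pass, keeping accesses coalesced.
__global__ void vecAddMulti(const float *a, const float *b, float *c, int n)
{
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];                // one element per iteration
}
```

With this pattern, the same kernel works no matter how `n` compares to the number of threads launched.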
![[2.1IntroToGPU-Cuda.pdf#page=23]]
Notice above that:
- The `i += ...` part advances `i` by the total number of threads in the grid (`gridDim.x * blockDim.x`).
- Here we are running `gridDim.x` blocks per grid dimension.
- For example, if we call `vecAdd<<<100, 256>>>` then `blockDim.x == 256` and `gridDim.x == 100`.
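Concretely, that example launch might look like the following host-side sketch (the pointer names and vector length are assumptions):

```cuda
// Hypothetical launch matching the example above: 100 blocks of 256 threads.
// Inside the kernel, gridDim.x == 100 and blockDim.x == 256, so the
// grid stride gridDim.x * blockDim.x works out to 25600.
int n = 1 << 20;                         // assumed vector length
vecAdd<<<100, 256>>>(d_a, d_b, d_c, n);  // d_* are device pointers
cudaDeviceSynchronize();                 // wait for the kernel to finish
```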
Note that the code above will have worse memory accesses due to more cache misses. The diagram below shows this difference:
Branch divergence occurs when threads within the same warp take different sides of a conditional (ie: some take a branch and some don't). This is a problem because the warp must then execute both paths one after the other, masking off the inactive threads on each path, which slows down the operations.
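A small sketch of the difference (the kernel names and the even/odd split are illustrative assumptions):

```cuda
// Divergent: threads in the same 32-thread warp disagree on the branch,
// so the warp runs both paths serially with threads masked off.
__global__ void divergentKernel(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)            // even/odd split divides every warp
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

// Uniform per warp: all 32 threads of any given warp take the same side,
// so each warp executes only one path and nothing is serialized.
__global__ void uniformKernel(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)     // whole warps take the same path
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
```

Both kernels compute per-element results; only the warp-level branching pattern differs, which is what determines the slowdown.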