How Threads and Blocks Are Organized in CUDA, and How to Choose the Number of Threads and Blocks for an Application
This article explains how to choose, for a particular
application, a fixed number of threads per block and a variable number of
blocks per grid.
The dimension and size of blocks per grid and the dimension
and size of threads per block are both important factors. The multidimensional
aspect of these parameters allows easier mapping of multidimensional problems
to CUDA and does not play a role in performance. As a result, this section
discusses size but not dimension. When choosing the first execution
configuration parameter—the number of blocks per grid, or grid size—the
primary concern is keeping the entire GPU busy. The number of blocks in a grid should
be larger than the number of multiprocessors so that all multiprocessors have
at least one block to execute. Furthermore, there should be multiple active
blocks per multiprocessor so that blocks that aren’t waiting for a __syncthreads() can keep the hardware
busy. This recommendation is subject to resource availability; therefore, it
should be determined in the context of the second execution parameter (the
number of threads per block, or block size) as well as shared
memory usage.
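The grid-size reasoning above can be sketched as plain host-side arithmetic. This is only an illustration under stated assumptions: the helper names `gridSizeFor` and `blocksPerMultiprocessor` are hypothetical, the block size is fixed by the caller, and the multiprocessor count would normally be queried from the device rather than hard-coded.

```cpp
#include <cassert>

// Blocks needed to cover n elements with a fixed block size (ceiling division).
// Hypothetical helper: with a fixed number of threads per block, the number of
// blocks is the part that varies with the problem size.
inline int gridSizeFor(int n, int threadsPerBlock) {
    assert(n > 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Average number of blocks each multiprocessor receives from this grid.
// A value well above 1 means every multiprocessor has work and there are
// spare blocks to run while others wait.
inline double blocksPerMultiprocessor(int gridSize, int numMultiprocessors) {
    assert(gridSize > 0 && numMultiprocessors > 0);
    return static_cast<double>(gridSize) / numMultiprocessors;
}
```

For example, `gridSizeFor(1000000, 256)` gives 3907 blocks, which on an assumed device with 16 multiprocessors is roughly 244 blocks per multiprocessor, comfortably more than one each.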
When choosing the block size, it is important to remember
that multiple concurrent blocks can reside on a multiprocessor, so occupancy is
not determined by block size alone. In particular, a larger block size does not
imply a higher occupancy. For example, on a device of compute capability 1.1 or
lower, a kernel with a maximum block size of 512 threads results in an
occupancy of 66 percent because the maximum number of threads per
multiprocessor on such a device is 768. Hence, only a single block can be
active per multiprocessor. However, a kernel with 256 threads per block on such
a device can result in 100 percent occupancy with three resident active blocks.
Higher occupancy does not always equate to better performance. For example,
improving occupancy from 66 percent to 100 percent generally does not translate
to a similar increase in performance. A lower occupancy kernel will have more
registers available per thread than a higher occupancy kernel, which may result
in less register spilling to local memory. Typically, once an occupancy of 50
percent has been reached, additional increases in occupancy do not translate
into improved performance.
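The 66 percent and 100 percent figures above follow from simple division. The sketch below reproduces that arithmetic, assuming occupancy is limited only by the thread and block counts per multiprocessor; register and shared memory limits are deliberately ignored. The struct `SmLimits` and both function names are hypothetical, and the example limits (768 threads and 8 blocks per multiprocessor) are the values for compute capability 1.1 as described in the text.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits; real values come from the device.
struct SmLimits {
    int maxThreadsPerSM;  // e.g. 768 on compute capability 1.1
    int maxBlocksPerSM;   // e.g. 8 on compute capability 1.x
};

// Resident blocks per multiprocessor, limited only by thread and block counts
// (registers and shared memory are not modeled here).
inline int residentBlocks(const SmLimits& sm, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return std::min(sm.maxBlocksPerSM, sm.maxThreadsPerSM / threadsPerBlock);
}

// Occupancy = resident threads / maximum resident threads, in the range [0, 1].
inline double occupancy(const SmLimits& sm, int threadsPerBlock) {
    int threads = residentBlocks(sm, threadsPerBlock) * threadsPerBlock;
    return static_cast<double>(threads) / sm.maxThreadsPerSM;
}
```

With `SmLimits{768, 8}`, a block size of 512 leaves room for one block (512/768, about 66 percent), while 256 fits three blocks (768/768, 100 percent). Under the same assumptions a block size of 64 is capped at 8 blocks, so it also lands at about 66 percent, which shows that smaller is not automatically better either.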
There are many such factors involved in selecting block
size, and inevitably some experimentation is required. However, a few rules of
thumb should be followed:
· Threads per block should be a multiple of warp size to avoid wasting computation on under-populated warps and to facilitate coalescing.
· A minimum of 64 threads per block should be used, and only if there are multiple concurrent blocks per multiprocessor.
· Between 128 and 256 threads per block is a better choice and a good initial range for experimentation with different block sizes.
· Use several (3 to 4) smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance. This is particularly beneficial to kernels that frequently call __syncthreads().
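The first and third rules of thumb can be turned into a small helper for generating block sizes to experiment with. This is a sketch, not a tuning procedure: the names `roundUpToWarp`, `idleLanes`, and `candidateBlockSizes` are hypothetical, and the warp size of 32 is assumed rather than queried from the device.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // assumed warp size; query the device in real code

// Round a requested block size up to the next multiple of the warp size.
inline int roundUpToWarp(int threads) {
    assert(threads > 0);
    return ((threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Lanes in the last warp that would do no useful work for this block size.
inline int idleLanes(int threads) {
    return roundUpToWarp(threads) - threads;
}

// Candidate block sizes in the suggested 128..256 range, in warp-sized steps.
inline std::vector<int> candidateBlockSizes() {
    std::vector<int> sizes;
    for (int t = 128; t <= 256; t += kWarpSize) {
        sizes.push_back(t);
    }
    return sizes;
}
```

A block of 100 threads, for instance, occupies four warps and leaves 28 lanes idle, whereas 128 threads fills the same four warps completely. The candidate list (128, 160, 192, 224, 256) is only a starting point for timing the kernel at each size.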
Note that when a thread block allocates more
registers than are available on a multiprocessor, the kernel launch fails,
as it will when too much shared memory or too many threads are requested.
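The three failure conditions in the note above can be checked ahead of time with a rough feasibility test. The sketch below assumes simplified limits: `LaunchLimits`, `KernelUsage`, and `launchFits` are hypothetical names, register and shared memory allocation granularity is ignored, and the actual limits and per-kernel usage would have to come from the device and the compiler.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical device limits; real values must be queried from the device.
struct LaunchLimits {
    int maxThreadsPerBlock;      // threads allowed in one block
    int registersPerSM;          // registers available on one multiprocessor
    std::size_t sharedMemPerSM;  // shared memory on one multiprocessor, bytes
};

// Hypothetical description of what one block of a kernel asks for.
struct KernelUsage {
    int threadsPerBlock;
    int registersPerThread;
    std::size_t sharedMemPerBlock;  // bytes
};

// True if a single block fits on a multiprocessor under the simplified model:
// not too many threads, not too many registers, not too much shared memory.
inline bool launchFits(const LaunchLimits& lim, const KernelUsage& k) {
    assert(k.threadsPerBlock > 0 && k.registersPerThread >= 0);
    bool threadsOk = k.threadsPerBlock <= lim.maxThreadsPerBlock;
    bool registersOk =
        static_cast<long long>(k.threadsPerBlock) * k.registersPerThread
        <= lim.registersPerSM;
    bool sharedOk = k.sharedMemPerBlock <= lim.sharedMemPerSM;
    return threadsOk && registersOk && sharedOk;
}
```

As an illustration with assumed limits of 512 threads per block, 8192 registers, and 16384 bytes of shared memory, a block of 256 threads using 16 registers each fits (4096 registers), while 512 threads at 32 registers each would need 16384 registers and the launch would be expected to fail.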
We have covered the basics of threads and blocks; now it is time to go into detail. This link provides all the details on threads and blocks.
Hello sir, I have a question about CUDA programming:
1. Where are these CUDA programs used, for example on GPUs for object detection? If there are such uses, please give us code for them.
2. For CUDA code like vector addition and matrix multiplication, what purpose is it used for, and where?
Thank you, I'm eagerly waiting for your response.
😁