THREAD AND BLOCK HEURISTICS in CUDA Programming
How are threads and blocks organized in CUDA, and how do you decide the number of threads and blocks for an application?
This article explains how, for a particular application, you can decide on a fixed number of threads per block and a variable number of blocks in a grid.
The dimension and size of blocks per grid and the dimension
and size of threads per block are both important factors. The multidimensional
aspect of these parameters allows easier mapping of multidimensional problems
to CUDA and does not play a role in performance. As a result, this section
discusses size but not dimension. When choosing the first execution
configuration parameter—the number of blocks per grid, or grid size—the
primary concern is keeping the entire GPU busy. The number of blocks in a grid should
be larger than the number of multiprocessors so that all multiprocessors have
at least one block to execute. Furthermore, there should be multiple active
blocks per multiprocessor so that blocks that aren’t waiting for a __syncthreads() can keep the hardware
busy. This recommendation is subject to resource availability; therefore, it
should be determined in the context of the second execution parameter—the
number of threads per block, or block size— as well as shared
memory usage.
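As a rough illustration, a common pattern is to fix the block size and derive the grid size from the problem size, so the grid grows with the data and keeps every multiprocessor busy. The sketch below is a minimal example under assumed values (a hypothetical element count N and a simple 1D kernel); it shows one way to size the grid, not the only one.

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical 1D kernel: each thread handles one element (assumption for this sketch).
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard the last, partially filled block
        data[i] *= factor;
}

int main()
{
    const int N = 1 << 20;              // assumed problem size
    const int threadsPerBlock = 256;    // fixed block size, a multiple of the warp size
    // Round up so every element gets a thread; the grid size scales with N,
    // which gives every multiprocessor blocks to work on for large problems.
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, N);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("launched %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    return 0;
}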
When choosing the block size, it is important to remember
that multiple concurrent blocks can reside on a multiprocessor, so occupancy is
not determined by block size alone. In particular, a larger block size does not
imply a higher occupancy. For example, on a device of compute capability 1.1 or
lower, a kernel with a maximum block size of 512 threads results in an
occupancy of 66 percent because the maximum number of threads per
multiprocessor on such a device is 768. Hence, only a single block can be
active per multiprocessor. However, a kernel with 256 threads per block on such
a device can result in 100 percent occupancy with three resident active blocks.
Note, however, that higher occupancy does not always equate to better performance. For example,
improving occupancy from 66 percent to 100 percent generally does not translate
to a similar increase in performance. A lower occupancy kernel will have more
registers available per thread than a higher occupancy kernel, which may result
in less register spilling to local memory. Typically, once an occupancy of 50
percent has been reached, additional increases in occupancy do not translate
into improved performance.
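Rather than working out occupancy by hand, the CUDA runtime's occupancy calculator API can suggest a block size for a given kernel. The sketch below queries cudaOccupancyMaxPotentialBlockSize for a hypothetical kernel myKernel; treat the returned value as a starting point for experimentation, not a guarantee of best performance.

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel used only to query occupancy (assumption for this sketch).
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    int minGridSize = 0;   // minimum grid size needed to reach full occupancy
    int blockSize   = 0;   // suggested block size

    // Ask the runtime for an occupancy-maximizing block size for myKernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    // How many blocks of that size can be resident on one multiprocessor?
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel, blockSize, 0);

    printf("suggested block size: %d, resident blocks per SM: %d\n",
           blockSize, numBlocks);
    return 0;
}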
There are many such factors involved in selecting block
size, and inevitably some experimentation is required. However, a few rules of
thumb should be followed:
· Threads per block should be a multiple of the warp size to avoid wasting computation on under-populated warps and to facilitate coalescing.
· A minimum of 64 threads per block should be used, and only if there are multiple concurrent blocks per multiprocessor.
· Between 128 and 256 threads per block is a better choice and a good initial range for experimentation with different block sizes.
· Use several (3 to 4) smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance. This is particularly beneficial to kernels that frequently call __syncthreads().
Note that when a thread block allocates more
registers than are available on a multiprocessor, the kernel launch fails,
as it will when too much shared memory or too many threads are requested.
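A failed launch of this kind can be detected by checking the error status right after the launch. The sketch below is a minimal example: it deliberately requests more threads per block than current devices allow (the limit is 1024) purely to demonstrate the cudaGetLastError / cudaDeviceSynchronize checking pattern; the kernel itself is a placeholder.

#include <cuda_runtime.h>
#include <cstdio>

// Trivial placeholder kernel; the interesting part is the error check below.
__global__ void emptyKernel() { }

int main()
{
    // Deliberately over-sized block (2048 threads) so the launch is rejected.
    emptyKernel<<<1, 2048>>>();

    // Catch configuration errors: too many threads, registers, or shared memory per block.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Catch errors that only surface while the kernel executes.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel execution failed: %s\n", cudaGetErrorString(err));

    return 0;
}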
We have covered the basics of threads and blocks; now it is time to go into more detail. This link provides full details on threads and blocks.
Got Questions?
Feel free to ask any questions; I'd be happy to walk you through them step by step!