Threads and Blocks in Detail in CUDA


As an engineer, I like C because it is relatively low-level compared to other languages. This lets me infer how the C code is handled by the processor so I can make on-the-fly judgments about the efficiency of a program. For the same reason, I need a mental model of how a CUDA device is organized and how its parts operate.
With a single processor executing a program, it’s easy to tell what’s going on and where it’s happening. With a CUDA device, not so much. There seem to be a lot of things going on at once in a lot of different places. CUDA organizes a parallel computation using the abstractions of threads, blocks and grids for which I provide these simple definitions:
Thread: This is just an execution of a kernel with a given index. Each thread uses its index to access elements in an array (see the kernel in my first CUDA program, and the short sketch just after these definitions) such that the collection of all threads cooperatively processes the entire data set.
Block: This is a group of threads. There’s not much you can say about the order in which the threads within a block execute: they could run concurrently or serially, and in no particular order. You can coordinate the threads, somewhat, using the __syncthreads() function, which makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.
Grid: This is a group of blocks. There’s no synchronization at all between the blocks.
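To make these definitions concrete, here is a minimal kernel sketch of my own (the array name, the length N and the doubling operation are illustrative assumptions, not code from the example program later in this post). Each thread folds its block and thread indices into one global index, and __syncthreads() acts as a barrier for the threads in its block:

// Each thread combines its block index and thread index into a unique global index.
__global__ void scale_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one index per thread
  if (idx < N)
    a[idx] = 2.0f * a[idx];                         // each thread handles one element
  __syncthreads();  // barrier: every thread in this block waits here for the others
                    // (shown only to illustrate per-block synchronization)
}

A launch such as scale_array <<< n_blocks, n_threads_per_block >>> (a_d, N) then creates the grid of blocks and threads described above.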
But where do threads, blocks and grids actually get executed? With respect to Nvidia’s G80 GPU chip, it appears the computation is distributed as follows:
Grid → GPU: An entire grid is handled by a single GPU chip.
Block → MP: The GPU chip is organized as a collection of multiprocessors (MPs), with each multiprocessor responsible for handling one or more blocks in a grid. A block is never divided across multiple MPs.
Thread → SP: Each MP is further divided into a number of stream processors (SPs), with each SP handling one or more threads in a block.
(Some universities have extended CUDA to work on multicore CPUs by assigning a grid to the CPU with each core executing the threads in one or more blocks. I don’t have a link to this work, but it is mentioned here.)
In combination with the hierarchy of processing units, the G80 also provides a memory hierarchy:
Global memory: This memory is built from a bank of SDRAM chips connected to the GPU chip. Any thread in any MP can read or write to any location in the global memory. Sometimes this is called device memory.
Texture cache: This is a memory within each MP that can be filled with data from the global memory so it acts like a cache. Threads running in the MP are restricted to read-only access of this memory.
Constant cache: This is a read-only memory within each MP.
Shared memory: This is a small memory within each MP that can be read/written by any thread in a block assigned to that MP. (A short sketch after this list shows how shared and constant memory appear in a kernel.)
Registers: Each MP has a number of registers that are shared between its SPs.
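As a rough sketch of how these memory spaces appear in a kernel (the names, the 256-thread block size and the scaling operation are my own illustration, not code from this post’s example program):

__constant__ float coeff;       // constant memory: read-only in the kernel,
                                // filled from the host with cudaMemcpyToSymbol()

__global__ void scale_with_tile(float *a)  // a points into global (device) memory
{
  __shared__ float tile[256];   // shared memory: one copy per block; assumes 256 threads per block
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = a[idx];   // stage a value from slow global memory into fast shared memory
  __syncthreads();              // wait until every thread in the block has loaded its value
  a[idx] = coeff * tile[threadIdx.x];  // idx and the temporaries live in registers
}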
As usual, the upper levels of the memory hierarchy provide larger, slower storage (access times of 400-600 cycles) while the lower levels are smaller and faster (access times of several cycles or less, I think). How large are these memories? You can find out by running the DeviceQuery application included in the CUDA SDK. This is the result I get for my Nvidia card:
There is 1 device supporting CUDA

Device 0: "GeForce 8600 GTS"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 268173312 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1458000 kilohertz

Test PASSED
The GPU device is hooked to 256 MB of SDRAM that provides the global memory. Each MP in the GPU has access to 16 KB of shared memory and 8192 registers (I’m not sure what the register width is). And there is 64 KB of constant memory for all the MPs in the GPU (I’m not sure why that is not reported per-MP like the shared memory is).
DeviceQuery also shows the limits on the sizes of blocks and grids. A block is one-, two- or three-dimensional with the maximum sizes of the x, y and z dimensions being 512, 512 and 64, respectively, and such that x × y × z ≤ 512, which is the maximum number of threads per block. Blocks are organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension. The main limitation here is the maximum of 512 threads per block, imposed primarily by the small number of registers that can be allocated across all the threads running in all the blocks assigned to an MP. The thread limit constrains the amount of cooperation between threads because only threads within the same block can synchronize with each other and exchange data through the fast shared memory in an MP.
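If you would rather read these limits from inside a program than run DeviceQuery, the runtime API exposes them through cudaGetDeviceProperties(). Here is a minimal sketch (the fields printed are just a selection of my choosing; the cudaDeviceProp structure has many more members):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query device 0
  printf("Global memory:         %lu bytes\n", (unsigned long)prop.totalGlobalMem);
  printf("Constant memory:       %lu bytes\n", (unsigned long)prop.totalConstMem);
  printf("Shared memory/block:   %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
  printf("Registers per block:   %d\n", prop.regsPerBlock);
  printf("Warp size:             %d\n", prop.warpSize);
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("Max block dimensions:  %d x %d x %d\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  printf("Max grid dimensions:   %d x %d x %d\n",
         prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  return 0;
}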
Between the memory sizes and the thread/block/grid dimensions is the warp size. What is a warp? I think the definition that applies to CUDA is “threads in a fabric running lengthwise”. The warp size is the number of threads running concurrently on an MP. In actuality, the threads are running both in parallel and pipelined. At the time this was written, each MP contains eight SPs and the fastest instruction takes four cycles. Therefore, each SP can have four instructions in its pipeline for a total of 8 × 4 = 32 instructions being executed concurrently. Within a warp, the threads all have sequential indices so there is a warp with indices 0..31, the next with indices 32..63 and so on up to the total number of threads in a block.
The homogeneity of the threads in a warp has a big effect on the computational throughput. If all the threads are executing the same instruction, then all the SPs in an MP can execute the same instruction in parallel. But if one or more threads in a warp are executing a different instruction from the others, then the warp has to be partitioned into groups of threads based on the instructions being executed, after which the groups are executed one after the other. This serialization reduces the throughput as the threads become more and more divergent and split into smaller and smaller groups. So it pays to keep the threads as homogeneous as possible.
How the threads access global memory also affects the throughput. Things go much faster if the GPU can coalesce several global addresses into a single burst access over the wide data bus that goes to the external SDRAM. Conversely, reading/writing scattered memory addresses requires multiple accesses to the SDRAM, which slows things down. To help the GPU combine multiple accesses, the addresses generated by the threads in a warp must be sequential with respect to the thread indices, i.e. thread N must access address Base + N, where Base is a pointer to elements of type T aligned to 16 × sizeof(T) bytes.
A program that shows some of these performance effects is given below.
01  // example2.cpp : Defines the entry point for the console application.
02  //
03
04  #include "stdafx.h"
05
06  #include <stdio.h>
07  #include <cuda.h>
08  #include <cutil.h>
09
10  // Kernel that executes on the CUDA device
11  __global__ void square_array(float *a, int N)
12  {
13  #define STRIDE       32
14  #define OFFSET        0
15  #define GROUP_SIZE  512
16    int n_elem_per_thread = N / (gridDim.x * blockDim.x);
17    int block_start_idx = n_elem_per_thread * blockIdx.x * blockDim.x;
18    int thread_start_idx = block_start_idx
19              + (threadIdx.x / STRIDE) * n_elem_per_thread * STRIDE
20              + ((threadIdx.x + OFFSET) % STRIDE);
21    int thread_end_idx = thread_start_idx + n_elem_per_thread * STRIDE;
22    if(thread_end_idx > N) thread_end_idx = N;
23    int group = (threadIdx.x / GROUP_SIZE) & 1;
24    for(int idx=thread_start_idx; idx < thread_end_idx; idx+=STRIDE)
25    {
26      if(!group) a[idx] = a[idx] * a[idx];
27      else       a[idx] = a[idx] + a[idx];
28    }
29  }
30
31  // main routine that executes on the host
32  int main(void)
33  {
34    float *a_h, *a_d;  // Pointers to host & device arrays
35    const int N = 1<<25;  // Make a big array with 2^25 = 32M elements
36    size_t size = N * sizeof(float);
37    a_h = (float *)malloc(size);        // Allocate array on host
38    cudaMalloc((void **) &a_d, size);   // Allocate array on device
39    // Initialize host array and copy it to CUDA device
40    for (int i=0; i<N; i++) a_h[i] = (float)i;
41    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
42    // Create timer for timing CUDA calculation
43    unsigned int timer = 0;
44    cutCreateTimer( &timer );
45    // Set number of threads and blocks
46    int n_threads_per_block = 1<<9;  // 512 threads per block
47    int n_blocks = 1<<10;  // 1024 blocks
48    // Do calculation on device
49    cutStartTimer( timer );  // Start timer
50    square_array <<< n_blocks, n_threads_per_block >>> (a_d, N);
51    cudaThreadSynchronize();  // Wait for square_array to finish on CUDA
52    cutStopTimer( timer );  // Stop timer
53    // Retrieve result from device and store it in host array
54    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
55    // Print some of the results and the CUDA execution time
56    for (int i=0; i<N; i+=N/50) printf("%d %f\n", i, a_h[i]);
57    printf("CUDA execution time = %f ms\n", cutGetTimerValue( timer ));
58    // Cleanup
59    free(a_h); cudaFree(a_d);
60  }
This program is just a modification of my first CUDA program. The size of the array on line 35 is increased from ten to 32M elements so the kernel will execute long enough to get an accurate reading with the timer (created on line 44). The number of blocks and the number of threads per block are increased (lines 46 and 47) because that provides a large pool of available computations the GPU can execute while waiting for memory accesses to complete.
The timer is started (line 49) and the kernel is initiated (line 50). Because the CUDA device works in parallel with the CPU, a call to the cudaThreadSynchronize() subroutine is needed to halt the CPU until the kernel has completed (line 51). Then the timer can be stopped (line 52). The execution time for the kernel is output on line 57.
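The cut*Timer() calls come from the SDK’s cutil library rather than from CUDA itself. If you would rather avoid that dependency, the same interval can be measured with CUDA events. This is a sketch of my own, meant as a drop-in replacement for the timing code around the kernel launch (n_blocks, n_threads_per_block, a_d and N come from the surrounding program):

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);                       // mark the start on the device
square_array <<< n_blocks, n_threads_per_block >>> (a_d, N);
cudaEventRecord(stop, 0);                        // mark the end on the device
cudaEventSynchronize(stop);                      // wait for the kernel and the stop event to finish
cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds
printf("CUDA execution time = %f ms\n", elapsed_ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);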
The kernel (lines 11-29) has been modified so that each thread loops over a set of array elements using a memory access pattern that can be changed along with the homogeneity of the threads. The number of array elements handled by each thread is calculated on line 16 by dividing the array size by the total number of threads (which is the product of the number of blocks and the block dimension). Then, each block of threads is assigned a starting address for the contiguous block of array elements that its threads will process (line 17). Each thread in a block is assigned a starting address within its block determined by the thread index and the addressing stride length and offset. For example, if there are 512 total array elements handled by eight threads with the stride and offset set to four and zero, respectively, then the addresses accessed by each thread are:
thread 0:   0   4   8  12  16  20  24 ... 248 252
thread 1:   1   5   9  13  17  21  25 ... 249 253
thread 2:   2   6  10  14  18  22  26 ... 250 254
thread 3:   3   7  11  15  19  23  27 ... 251 255
thread 4: 256 260 264 268 272 276 280 ... 504 508
thread 5: 257 261 265 269 273 277 281 ... 505 509
thread 6: 258 262 266 270 274 278 282 ... 506 510
thread 7: 259 263 267 271 275 279 283 ... 507 511
In this example, you can see that threads 0-3 are accessing sequential addresses in their section of the global memory, so their accesses can be coalesced. The same applies to threads 4-7. But the accesses for threads 0-3 and 4-7 combined cannot be coalesced because together they are not sequential. So the GPU has to make separate memory bursts for each group of threads.
Using the same example, if the offset is increased to one then the addresses for each thread become:
thread 0:   1   5   9  13  17  21  25 ... 249 253
thread 1:   2   6  10  14  18  22  26 ... 250 254
thread 2:   3   7  11  15  19  23  27 ... 251 255
thread 3:   0   4   8  12  16  20  24 ... 248 252
thread 4: 257 261 265 269 273 277 281 ... 505 509
thread 5: 258 262 266 270 274 278 282 ... 506 510
thread 6: 259 263 267 271 275 279 283 ... 507 511
thread 7: 256 260 264 268 272 276 280 ... 504 508
Now the addresses are still in the same range, but they are no longer sequential with respect to the thread indices.
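If you want to inspect an access pattern before timing it on the device, the kernel’s index arithmetic is easy to reproduce on the host. This small stand-alone sketch (my own code, with the eight-thread, 512-element example above hard-coded into a single block) prints the address tables shown above:

#include <stdio.h>

#define N_ELEMS   512
#define N_THREADS   8
#define STRIDE      4
#define OFFSET      0   /* change to 1 to reproduce the second table */

int main(void)
{
  int n_elem_per_thread = N_ELEMS / N_THREADS;
  for (int tid = 0; tid < N_THREADS; tid++)
  {
    /* same arithmetic as lines 16-21 of the kernel, with a single block */
    int start = (tid / STRIDE) * n_elem_per_thread * STRIDE
              + ((tid + OFFSET) % STRIDE);
    int end = start + n_elem_per_thread * STRIDE;
    printf("thread %d:", tid);
    for (int idx = start; idx < end; idx += STRIDE)
      printf(" %d", idx);
    printf("\n");
  }
  return 0;
}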
A third macro parameter, GROUP_SIZE (line 15), is used to direct the flow of execution in the for loop (lines 24-28). Setting the group size to four, for example, sets the group variable (line 23) to zero for threads with indices [0..3], [8..11], [16..19] … which makes these threads square the array elements (line 26). The remaining threads, which have group set to one, just double the array elements (line 27). This difference in operations prevents the MP from running the threads in both groups in parallel.
By playing with the STRIDE, OFFSET and GROUP_SIZE macros (lines 13-15), we can vary the memory access patterns and thread homogeneity and see their effect on execution times for the kernel. The following trials were done using the Release version of the program with execution times averaged over 1000 runs:
Trial   STRIDE   OFFSET   GROUP_SIZE   Execution Time
#1          32        0          512          14.5 ms
#2          16        0          512          14.7 ms
#3           8        0          512          86.0 ms
#4          32        1          512          93.5 ms
#5          32        0           16          14.4 ms
#6          32        0            8          22.1 ms
#7           8        1            8          85.5 ms
Trial #1 uses the most advantageous settings with the stride set equal to the warp size and the offset set to zero so that each thread in a warp generates sequential addresses that track the thread indices. The threads in a block are placed into a maximal group of 512 so they are all executing exactly the same instructions.
In trial #2, decreasing the stride to half the warp size has no effect on the kernel’s execution time. Nvidia mentions that many operations in the MP proceed on a half-warp basis, but you can’t count on this behavior in future versions of their devices.
Further decreasing the stride in trial #3 expands the gaps between thread addresses and reduces the amount of address coalescing the GPU can do, to the point where the execution time increases by a factor of six.
Restoring the stride to the warp size but edging the offset up to one in trial #4 causes the same problem with coalescing addresses. So addresses that do not track the thread index cause performance problems as great as using non-sequential addresses.
In trial #5, the stride and offset are returned to their best settings while the size of the thread groups executing the same instruction is reduced to 16 threads. No performance decrease is seen at half-warp size.
Decreasing the groups to eight threads in trial #6 does cause the MPs to execute the groups sequentially. This causes only a 50% increase in run-time because the two groups still have many instructions that they can execute in common.
Finally, setting all the parameters to their worst-case values in trial #7 does not lead to a multiplicative increase in execution time because the effects of serializing the thread groups are hidden by the increased memory access times caused by the non-sequential addresses.
Here’s the source code for this example if you want to try it. (Note: I had to move the cutil32D.dll and cutil32.dll libraries from the CUDA SDK directory into my C:\CUDA\bin directory so my example2.exe program could find them. You may have to do the same.)
So now I’ve gotten an idea of how the GPU executes code and some of the factors that affect its performance. (For reference and greater detail, please read the CUDA Programming Guide Version 1.1.) It’s important not to have a bunch of divergent conditional operations in the threads and not to jump around from place to place in global memory. In fact, given the long access times, it’s probably better to minimize global memory traffic and make use of on-chip registers and shared memory instead. I’ll take a look at these memories in a future post.


References
CUDA Programming Guide Version 1.1.
