Pages

Tuesday, 22 January 2013

Threads and Blocks in Detail in CUDA


As an engineer, I like C because it is relatively low-level compared to other languages. This lets me infer how the C code is handled by the processor so I can make on-the-fly judgments about the efficiency of a program. For the same reason, I need a mental model of how a CUDA device is organized and how its parts operate.
With a single processor executing a program, it’s easy to tell what’s going on and where it’s happening. With a CUDA device, not so much. There seem to be a lot of things going on at once in a lot of different places. CUDA organizes a parallel computation using the abstractions of threads, blocks and grids for which I provide these simple definitions:
Thread: This is just an execution of a kernel with a given index. Each thread uses its index to access elements in array (see the kernel in my first CUDA program) such that the collection of all threads cooperatively processes the entire data set.
Block: This is a group of threads. There’s not much you can say about the execution of threads within a block – they could execute concurrently or serially and in no particular order. You can coordinate the threads, somewhat, using the _syncthreads() function that makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.
Grid: This is a group of blocks. There’s no synchronization at all between the blocks.
But where do threads, blocks and grids actually get executed? With respect to Nvidia’s G80 GPU chip, it appears the computation is distributed as follows:
Grid → GPU: An entire grid is handled by a single GPU chip.
Block → MP: The GPU chip is organized as a collection of multiprocessors (MPs), with each multiprocessor responsible for handling one or more blocks in a grid. A block is never divided across multiple MPs.
Thread → SP: Each MP is further divided into a number of stream processors (SPs), with each SP handling one or more threads in a block.
(Some universities have extended CUDA to work on multicore CPUs by assigning a grid to the CPU with each core executing the threads in one or more blocks. I don’t have a link to this work, but it is mentioned here.)
In combination with the hierarchy of processing units, the G80 also provides a memory hierarchy:
Global memory: This memory is built from a bank of SDRAM chips connected to the GPU chip. Any thread in any MP can read or write to any location in the global memory. Sometimes this is calleddevice memory.
Texture cache: This is a memory within each MP that can be filled with data from the global memory so it acts like a cache. Threads running in the MP are restricted to read-only access of this memory.
Constant cache: This is a read-only memory within each MP.
Shared memory: This is a small memory within each MP that can be read/written by any thread in a block assigned to that MP.
Registers: Each MP has a number of registers that are shared between its SPs.
As usual, the upper levels of the memory hierarchy provide larger, slower storage (access times of 400-600 cycles) while the lower levels are smaller and faster (access times of several cycles or less, I think). How large are these memories? You can find out by running the DeviceQuery application included in the CUDA SDK. This is the result I get for my Nvidia card:
There is 1 device supporting CUDA

Device 0: "GeForce 8600 GTS"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 268173312 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1458000 kilohertz

Test PASSED
The GPU device is hooked to 256 MB of SDRAM that provides the global memory. Each MP in the GPU has access to 16 KB of shared memory and 8192 registers (I’m not sure what the register width is). And there is 64 KB of constant memory for all the MPs in the GPU (I’m not sure why that is not reported per-MP like the shared memory is).
DeviceQuery also shows the limits on the sizes of blocks and grids. A block is one-, two- or three-dimensional with the maximum sizes of the xy and z dimensions being 512, 512 and 64, respectively, and such that x × y × z ≤ 512, which is the maximum number of threads per block. Blocks are organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension. The primary limitation here is the maximum of 512 threads per block, primarily imposed by the small number of registers that can be allocated across all the threads running in all the blocks assigned to an MP. The thread limit constrains the amount of cooperation between threads because only threads within the same block can synchronize with each other and exchange data through the fast shared memory in an MP.
Between the memory sizes and the thread/block/grid dimensions is the warp size. What is a warp? I think the definition that applies to CUDA is “threads in a fabric running lengthwise”. The warp size is the number of threads running concurrently on an MP. In actuality, the threads are running both in parallel and pipelined. At the time this was written, each MP contains eight SPs and the fastest instruction takes four cycles. Therefore, each SP can have four instructions in its pipeline for a total of 8 × 4 = 32 instructions being executed concurrently. Within a warp, the threads all have sequential indices so there is a warp with indices 0..31, the next with indices 32..63 and so on up to the total number of threads in a block.
The homogeneity of the threads in a warp has a big effect on the computational throughput. If all the threads are executing the same instruction, then all the SPs in an MP can execute the same instruction in parallel. But if one or more threads in a warp is executing a different instruction from the others, then the warp has to be partitioned into groups of threads based on the instructions being executed, after which the groups are executed one after the other. This serialization reduces the throughput as the threads become more and more divergent and split into smaller and smaller groups. So it pays to keep the threads as homogenous as possible.
How the threads access global memory also affects the throughput. Things go much faster if the GPU can coalesce several global addresses into a single burst access over the wide data bus that goes to the external SDRAM. Conversely, reading/writing separated memory addresses requires multiple accesses to the SDRAM which slows things down. To help the GPU combine multiple accesses, the addresses generated by the threads in a warp must be sequential with respect to the thread indices, i.e. thread N must access address Base + N where Base is a pointer of type T and is aligned to 16 × sizeof(T) bytes.
A program that shows some of these performance effects is given below.
01// example2.cpp : Defines the entry point for the console application.
02//
03
04#include "stdafx.h"
05
06#include <stdio.h>
07#include <cuda.h>
08#include <cutil.h>
09
10// Kernel that executes on the CUDA device
11__global__ void square_array(float *a, int N)
12{
13#define STRIDE       32
14#define OFFSET        0
15#define GROUP_SIZE  512
16  int n_elem_per_thread = N / (gridDim.x * blockDim.x);
17  int block_start_idx = n_elem_per_thread * blockIdx.x * blockDim.x;
18  int thread_start_idx = block_start_idx
19            + (threadIdx.x / STRIDE) * n_elem_per_thread * STRIDE
20            + ((threadIdx.x + OFFSET) % STRIDE);
21  int thread_end_idx = thread_start_idx + n_elem_per_thread * STRIDE;
22  if(thread_end_idx > N) thread_end_idx = N;
23  int group = (threadIdx.x / GROUP_SIZE) & 1;
24  for(int idx=thread_start_idx; idx < thread_end_idx; idx+=STRIDE)
25  {
26    if(!group) a[idx] = a[idx] * a[idx];
27    else       a[idx] = a[idx] + a[idx];
28  }
29}
30
31// main routine that executes on the host
32int main(void)
33{
34  float *a_h, *a_d;  // Pointer to host & device arrays
35  const int N = 1<<25;  // Make a big array with 2**N elements
36  size_t size = N * sizeof(float);
37  a_h = (float *)malloc(size);        // Allocate array on host
38  cudaMalloc((void **) &a_d, size);   // Allocate array on device
39  // Initialize host array and copy it to CUDA device
40  for (int i=0; i<N; i++) a_h[i] = (float)i;
41  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
42  // Create timer for timing CUDA calculation
43  unsigned int timer = 0;
44  cutCreateTimer( &timer );
45  // Set number of threads and blocks
46  int n_threads_per_block = 1<<9;  // 512 threads per block
47  int n_blocks = 1<<10;  // 1024 blocks
48  // Do calculation on device
49  cutStartTimer( timer );  // Start timer
50  square_array <<< n_blocks, n_threads_per_block >>> (a_d, N);
51  cudaThreadSynchronize();  // Wait for square_array to finish on CUDA
52  cutStopTimer( timer );  // Stop timer
53  // Retrieve result from device and store it in host array
54  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
55  // Print some of the results and the CUDA execution time
56  for (int i=0; i<N; i+=N/50) printf("%d %f\n", i, a_h[i]);
57  printf("CUDA execution time = %f ms\n",cutGetTimerValue( timer ));
58  // Cleanup
59  free(a_h); cudaFree(a_d);
60}
This program is just a modification of my first CUDA program The size of the array on line 35 is increased from ten to 32M elements so the kernel will execute long enough to get an accurate reading with the timer (declared on line 44). The number of blocks and threads per block are increased (lines 46 and 47) because that provides a large pool of available computations the GPU can execute while waiting for memory accesses to complete.
The timer is started (line 49) and the kernel is initiated (line 50). Because the CUDA device works in parallel with the CPU, a call to the cudaThreadSynchronize() subroutine is needed to halt the CPU until the kernel has completed (line 51). Then the timer can be stopped (line 52). The execution time for the kernel is output on line 57.
The kernel (lines 11-29) has been modified so that each thread loops over a set of array elements using a memory access pattern that can be changed along with the homogeneity of the threads. The number of array elements handled by each thread is calculated on line 16 by dividing the array size by the total number of threads (which is the product of the number of blocks and the block dimension). Then, each block of threads is assigned a starting address for the contiguous block of array elements that its threads will process (line 17). Each thread in a block is assigned a starting address within its block determined by the thread index and the addressing stride length and offset. For example, if there are 512 total array elements handled by eight threads with the stride and offset set to four and zero, respectively, then the addresses accessed by each thread are:
thread 0:   0   4   8  12  16  20  24 ... 248 252
thread 1:   1   5   9  13  17  21  25 ... 249 253
thread 2:   2   6  10  14  18  22  26 ... 250 254
thread 3:   3   7  11  15  19  23  27 ... 251 255
thread 4: 256 260 264 268 272 276 280 ... 504 508
thread 5: 257 261 265 269 273 277 281 ... 505 509
thread 6: 258 262 266 270 274 278 282 ... 506 510
thread 7: 259 263 267 271 275 279 283 ... 507 511
In this example, you can see that threads 0-3 are accessing sequential addresses in their section of the global memory, so their accesses can be coalesced. The same applies to threads 4-7. But, the accesses for threads 0-3 and 4-7 combined cannot be coalesced because then they are not sequential. So the GPU has to make a separate memory bursts for each group of threads.
Using the same example, if the offset is increased to one then the addresses for each thread become:
thread 0:   1   5   9  13  17  21  25 ... 249 253
thread 1:   2   6  10  14  18  22  26 ... 250 254
thread 2:   3   7  11  15  19  23  27 ... 251 255
thread 3:   0   4   8  12  16  20  24 ... 248 252
thread 4: 257 261 265 269 273 277 281 ... 505 509
thread 5: 258 262 266 270 274 278 282 ... 506 510
thread 6: 259 263 267 271 275 279 283 ... 507 511
thread 7: 256 260 264 268 272 276 280 ... 504 508
Now the addresses are still in the same range, but they are no longer sequential with respect to the thread indices.
A third macro parameter, GROUP_SIZE (line 15), is used to direct the flow of execution in the for loop (lines 24-28). Setting the group size to four, for example, sets the group variable (line 23) to zero for threads with indices [0..3], [8..11], [16..19] … which makes these threads square the array elements (line 26). The remaining threads, which have group set to one, just double the array elements (line 27). This difference in operations prevents the MP from running the threads in both groups in parallel.
By playing with the STRIDE, OFFSET and GROUP macros (lines 13-15), we can vary the memory access patterns and thread homogeneity and see their effect on execution times for the kernel. The following trials were done using the Release version of the program with execution times averaged over 1000 runs:
TrialSTRIDEOFFSETGROUP_SIZEExecution Time
#132051214.5 ms
#216051214.7 ms
#38051286.0 ms
#432151293.5 ms
#53201614.4 ms
#6320822.1 ms
#781885.5 ms
Trial #1 uses the most advantageous settings with the stride set equal to the warp size and the offset set to zero so that each thread in a warp generates sequential addresses that track the thread indices. The threads in a block are placed into a maximal group of 512 so they are all executing exactly the same instructions.
In trial #2, decreasing the stride to half the warp size has no effect on the kernel’s execution time. Nvidia mentions that many operations in the MP proceed on a half-warp basis, but you can’t count on this behavior in future versions of their devices.
Further decreasing the stride in trial #3 expands the gaps between thread addresses and reduces the amount of address coalescing the GPU can do, to the point where the execution time increases by a factor of six.
Restoring the stride back to the warp size but edging the offset up to one in trial #4 causes the same problem with coalescing addresses. So addresses that do not track the thread index cause performance problems as great as using non-sequential addresses.
In trial #5, the stride and offset are returned to their best settings while the size of thread groups executing the same instruction are reduced to 16 threads. No performance decrease is seen at half-warp size.
Decreasing the groups to eight threads in trial #6 does cause the MPs to execute the groups sequentially. This causes only a 50% increase in run-time because the two groups still have many instructions that they can execute in common.
Finally, setting all the parameters to their worst-case values in trial #7 does not lead to a multiplicative increase in execution times because the effects of serializing the thread groups is hidden by the increased memory access times caused by the non-sequential addresses.
Here’s the source code for this example if you want to try it. (Note: I had to move the cutil32D.dll and cutil32.dll libraries from the CUDA SDK directory into my C:\CUDA\bin directory so my example2.exe program could find it. You may have to do the same.)
So now I’ve gotten an idea of how the GPU is executing code and some of the factors that affect it’s performance. (For reference and greater detail, please read the CUDA Programming Guide Version 1.1.) It’s important not to have a bunch of divergent conditional operations in the threads and not to jump around from place-to-place in global memory. In fact, given the long access times, it’s probably better to minimize global memory traffic and make use of on-chip registers and shared memory instead. I’ll take a look at these memories in a future post


Got Questions?
 Feel free to ask me any question because I'd be happy to walk you through step by step!

References
CUDA Programming Guide Version 1.1.

 For Contact us….. Click on Contact us Tab

12 comments:

  1. it’s really nice and meaningful. it’s really cool blog. Linking is very useful thing.you have really helped lots of people who visit the blog and provide them useful information.

    Hadoop Training in HRBR Layout
    Hadoop Training in Kalyan Nagar
    Best Hadoop Training in Kalyan Nagar Bangalore

    ReplyDelete

Help us to improve our quality and become contributor to our blog