Pages

Tuesday, 29 January 2013

What is “Constant Memory” in CUDA | Constant Memory in CUDA



We have talked about the Global memory and shared memory in previous article(s), we also some example like Vector Dot Product  which demonstrate how to use shared memory. CUDA architecture provides another kind of memory which we call Constant Memory. 

In this article; we answer following questions.



1.    What is Constant memory in CUDA?
2.    Why constant memory?
3.    Where the constant memory resides?
4.    How does Constant memory speed up you in CUDA code performance?
5.    How does Constant memory works in CUDA?
6.    How to use Constant memory in CUDA?
7.    Where to use and where should not use Constant memory in CUDA?
8.    Performance consideration of constant memory.




Motivation

Previously, we discussed how modern GPUs are equipped with enormous amounts of arithmetic processing power. In fact, the computational advantage graphics processors have over CPUs helped precipitate the initial interest in using  graphics processors for general-purpose computing. With hundreds of arithmetic units on the GPU, often the bottleneck is not the arithmetic throughput of the chip but rather the memory bandwidth of the chip. There are so many ALUs on graphics processors that sometimes we just can’t keep the input coming to them fast enough to sustain such high rates of computation. So, it is worth investigating means by which we can reduce the amount of memory traffic required for a given problem.

What is Constant Memory?

The CUDA language makes available another kind of memory known as constant memory. As the name may indicate, we use constant memory for data that will not change over the course of a kernel execution.

Why Constant Memory?

NVIDIA hardware provides 64KB of constant memory that it treats differently than it treats standard global memory. In some situations, using constant memory rather than global memory will reduce the required memory bandwidth.


Where the Constant memory resides in GPU ?

Fig will shown you, where the constant memory resides in GPU 



How does Constant memory speed up your CUDA code performance?

There is a total of 64 KB constant memory on a device. The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache.

Advantage along with disadvantage

For all threads of a half warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address. Accesses to different addresses by threads within a half warp are serialized, so cost scales linearly with the number of different addresses read by all threads within a half warp.
The constant memory space resides in device memory and is cached in the constant cache mentioned in Sections F.3.1 and F.4.1 see CUDA C Programming Guide for details.


How does Constant memory works in CUDA?
Working of Constant memory is divided in Three steps which are as follows.

For devices of compute capability 1.x;

Step 1: A constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.

Step 2: A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.

Final Step: The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.

Alternatively, on devices of compute capability 2.x, programs use the LoaD Uniform (LDU) operation; see Section F.4.4 of the CUDA C Programming Guide for details.

How to use Constant memory in CUDA?

We talked about;

Where is constant memory?
·         Data is stored in the device global memory
·         Read data through multiprocessor constant cache
·         64KB constant memory and 8KB cache for each multiprocessor.
            How about the Performance?
§  Optimized when warp of threads read same location
§  4 bytes per cycle through broadcasting to warp of threads
§  Serialized when warp of threads read in different locations
§  Very slow when cache miss (read data from global memory)
§  Access latency can range from one to hundreds clock cycles
 Now we learn how to declare and use constant memory.

Declaration of constant memory

We declare constant memory using __constant__ keyword. Constant memory always declared in File scope (global variable). So,The mechanism for declaring memory constant is similar to the one we used of declaring a buffer as shared memory.

Example

__constant__ float cst_ptr [size];

Here we declare array of constant memory with name “cst_ptr” of size “size”.

Note:
It must be declare out of the main body and the kernel.

Way to use Constant memory

The instruction cudaMemcpyToSymbol must be used in order to copy the values to the kernel.
Syntax

cudaError_t cudaMemcpytoSymbol  (const char * symbol, const void * src, size_t count ,
                                  size_t offset=0, enum cudaMemcpyKind )


Example:
//copy data from host to constant memory
cudaMemcpyToSymbol (cst_ptr, host_ptr, data_size );

Since we dint explicitly wrote the cudaMemcpyKind, its default value is cudaMemcpyHostToDevice




Note:
The variables in constant memory are not necessary to be declared in the kernel invocation.

For example: “M” is the constant memory declared outside

__global__ void kernel (float *a, int N)
{
              int idx = blockIdx.x * blockDim.x + threadIdx.x;
             
             if (idx<N)
             {
                      a[idx] = a[idx] *M;
              }
}


Complete Example of constant memory in cuda

Constant_memory_in_cuda.cu

//declare constant memory
__constant__ float cangle[360];

int main(int argc,char** argv)
{
         int size=3200;
         float* darray;
         float hangle[360];

          //allocate device memory
         cudaMalloc ((void**)&darray,sizeof(float)*size);
         
         //initialize allocated memory
        cudaMemset (darray,0,sizeof(float)*size);

         //initialize angle array on host
       for(int loop=0;loop<360;loop++)
                    hangle[loop] = acos( -1.0f )* loop/ 180.0f;

        //copy host angle data to constant memory
       cudaMemcpyToSymbol    (  cangle,  hangle,   sizeof(float)*360  );
   
        test_kernel  <<<  size/64  ,64  >>>  (darray);
     
       //free device memory
       cudaFree(darray);
  return 0;
}


__global__ void test_kernel(float* darray)
{
      int index;
    
      //calculate each thread global index
     Index = blockIdx.x * blockDim.x + threadIdx.x;
   
    #pragma unroll 10
   for(int loop=0;loop<360;loop++)
                 darray[index]= darray [index] + cangle [loop] ;
return;

}

Difference between cudaMemcpy () and cudaMemcpyToSymbol ()

cudaMemcpyToSymbol () is the special version  of cudaMemcpy () when we copy from host memory to constant memory on the GPU. The only differences between
cudaMemcpyToSymbol () and cudaMemcpy () using cudaMemcpyHostToDevice are that cudaMemcpyToSymbol () copies to constant memory and cudaMemcpy () copies to global memory.

Where to use and should not use Constant memory in CUDA?

As per as I guide;
Use Constant memory in the following cases
o   When you know, your input data will not change during the execution
o   When you know, your all thread will access data from same part of memory
As shown in above example; every thread in a block access same address space pointed out by “cangle” array.

Should not use Constant memory in the following cases
o   When you know, your input data will be change during the execution
o   When you know, your all thread will not access data from same part of memory.
o   When your data is not read only. For example “output” memory space should not be constant.   

In essence, you should not use constant memory when every thread in a block don’t access same address space. For example; you have an array of 3,000 elements and you breaks this element to lunch sufficient number of threads in a block. So, each thread in a block will access different element of an array as counter part of the above example where each thread reads same data controlled by “loop” variable, therefore each thread will access same data.   


Performance consideration of constant memory

Declaring memory as __constant__ constrains our usage to be read-only. In taking on this constraint, we expect to get something in return. As I previously mentioned, reading from constant memory can conserve memory bandwidth when compared to reading the same data from global memory. There are two reasons why reading from the 64KB of constant memory can save bandwidth over standard reads of global memory:


A single read from constant memory can be broadcast to other “nearby” threads, effectively saving up to 15 reads.


Constant memory is cached, so consecutive reads of the same address will not incur any additional memory traffic.


What do we mean by the word nearby?

To answer this question, we will need to explain the concept of a warp. For those readers who are more familiar with Star Trek than with weaving, a warp in this context has nothing to do with the speed of travel through space. In the world of weaving, a warp refers to the group of threads being woven together into fabric. In the CUDA Architecture, a warp refers to a collection of 32 threads that are “woven together” and get executed in lockstep. At every line in your program, each thread in a warp executes the same instruction on different data [More? follow this link].

When it comes to handling constant memory, NVIDIA hardware can broadcast a single memory read to each half-warp. A half-warp—not nearly as creatively named as a warp—is a group of 16 threads: half of a 32-thread warp. If every thread in a half-warp requests data from the same address in constant memory, your GPU will generate only a single read request and subsequently broadcast the data to every thread. If you are reading a lot of data from constant memory, you will generate only 1/16 (roughly 6 percent) of the memory traffic as you would when using global memory. But the savings don’t stop at a 94 percent reduction in bandwidth when reading constant memory! Because we have committed to leaving the memory unchanged, the hardware can aggressively cache the constant data on the GPU.
So after the first read from an address in constant memory, other half-warps requesting the same address, and therefore hitting the constant cache, will generate no additional memory traffic.

After caching the data, every other thread avoids generating memory traffic as a result of one of the two constant memory benefits:


 It receives the data in a half-warp broadcast.


 It retrieves the data from the constant memory cache.


Unfortunately, there can potentially be a downside to performance when using constant memory. The half-warp broadcast feature is in actuality a double-edged sword. Although it can dramatically accelerate performance when all 16 threads are reading the same address, it actually slows performance to a crawl when all 16 threads read different addresses.

The Final words on Performance consideration of constant memory

The trade-off to allowing the broadcast of a single read to 16 threads is that the 16 threads are allowed to place only a single read request at a time. For example, if all 16 threads in a half-warp need different data from constant memory, the 16 different reads get serialized, effectively taking 16 times the amount of time to place the request. If they were reading from conventional global memory, the request could be issued at the same time. In this case, reading from constant memory would probably be slower than using global memory.

Summary of the Article

In this article we read about constant memory in context of CUDA programming. We started talking about Why (What is) constant memory and how to declare & use constant memory in CUDA and end our discussion with Performance consideration of constant memory in CUDA.

Got Questions?
Feel free to ask me any question because I'd be happy to walk you through step by step! 

 Want to Contact us? Click here

Monday, 28 January 2013

How to Overlap Data Transfers in CUDA C/C++| Parallel data transfer and computation


In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let’s learn about streams.

CUDA Streams

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently. [Want to know more? Follow this   post]

The default stream

All device operations (kernels and data transfers) in CUDA run in a stream. When no stream is specified, the default stream (also called the “null stream”) is used. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device: no operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.
Let’s look at some simple code examples that use the default stream, and discuss how operations progress from the perspective of the host as well as the device.

cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a)
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

In the code above, from the perspective of the device, all three operations are issued to the same (default) stream and will execute in the order that they were issued. From the perspective of the host, the implicit data transfers are blocking or synchronous transfers, while the kernel launch is asynchronous. Since the host-to-device data transfer on the first line is synchronous, the CPU thread will not reach the kernel call on the second line until the host-to-device transfer is complete. Once the kernel is issued, the CPU thread moves to the third line, but the transfer on that line cannot begin due to the device-side order of execution.
The asynchronous behavior of kernel launches from the host’s perspective makes overlapping device and host computation very simple. We can modify the code to add some independent CPU computation as follows.

cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a)
myCpuFunction(b)
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

In the above code, as soon as the increment() kernel is launched on the device the CPU thread executes myCpuFunction(), overlapping its execution on the CPU with the kernel execution on the GPU. Whether the host function or device kernel completes first doesn’t affect the subsequent device-to-host transfer, which will begin only after the kernel completes. From the perspective of the device, nothing has changed from the previous example; the device is completely unaware of myCpuFunction().

Non-default streams

Non-default streams in CUDA C/C++ are declared, created, and destroyed in host code as follows.

cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1)
result = cudaStreamDestroy(stream1)

To issue a data transfer to a non-default stream we use the cudaMemcpyAsync() function, which is similar to the cudaMemcpy() function discussed in the previous post, but takes a stream identifier as a fifth argument.

result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1)         

cudaMemcpyAsync() is non-blocking on the host, so control returns to the host thread immediately after the transfer is issued. There are cudaMemcpy2DAsync() and cudaMemcpy3DAsync() variants of this routine which can transfer 2D and 3D array sections asynchronously in the specified streams.
To issue a kernel to a non-default stream we specify the stream identifier as a fourth execution configuration parameter (the third execution configuration parameter allocates shared device memory, which is discussed here; use 0 for now).

increment<<<1,N,0,stream1>>>(d_a)

Synchronization with streams

Since all operations in non-default streams are non-blocking with respect to the host code, you will run across situations where you need to synchronize the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize(), which blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill, and can really hurt performance due to stalling the entire device and host thread.
The CUDA stream API has multiple less severe methods of synchronizing the host with a stream. The function cudaStreamSynchronize(stream) can be used to block the host thread until all previously issued operations in the specified stream have completed. The function cudaStreamQuery(stream) tests whether all operations issued to the specified stream have completed, without blocking host execution. The functions cudaEventSynchronize(event) and cudaEventQuery(event) act similar to their stream counterparts, except that their result is based on whether a specified event has been recorded rather than whether a specified stream is idle. You can also synchronize operations within a single stream on a specific event using cudaStreamWaitEvent(event) (even if the event is recorded in a different stream, or on a different device!).

Overlapping Kernel Execution and Data Transfers

Earlier we demonstrated how to overlap kernel execution in the default stream with execution of code on the host. But our main goal in this post is to show you how to overlap kernel execution with data transfers. There are several requirements for this to happen.

The device must be capable of “concurrent copy and execution”. This can be queried from the deviceOverlap field of a cudaDeviceProp struct, or from the output of the deviceQuery sample included with the CUDA SDK/Toolkit. Nearly all devices with compute capability 1.1 and higher have this capability.

The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams.

The host memory involved in the data transfer must be pinned memory.


So let’s modify our simple host code from above to use multiple streams and see if we can achieve any overlap. In the modified code, we break up the array of size N into chunks of streamSize elements. Since the kernel operates independently on all elements, each of the chunks can be processed independently. The number of (non-default) streams used is nStreams=N/streamSize. There are multiple ways to implement the domain decomposition of the data and processing; one is to loop over all the operations for each chunk of the array as in this example code.


for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, stream[i]);
  kernel<<>>(d_a, offset);
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, stream[i]);
}


Another approach is to batch similar operations together, issuing all the host-to-device transfers first, followed by all kernel launches, and then all device-to-host transfers, as in the following code.

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset],
                  streamBytes, cudaMemcpyHostToDevice, stream[i]);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  kernel<<>>(d_a, offset);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&a[offset], &d_a[offset],
                  streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}


Both asynchronous methods shown above yield correct results, and in both cases dependent operations are issued to the same stream in the order in which they need to be executed. But the two approaches perform very differently depending on the specific generation of GPU used. On a Tesla C1060 (compute capability 1.3) running the test code (from Github) gives the following results.

Device : Tesla C1060
 
Time for sequential transfer and execute (ms ): 12.92381
  max error : 2.3841858E -07
Time for asynchronous V1 transfer and execute (ms ): 13.63690 
  max error : 2.3841858E -07
Time for asynchronous V2 transfer and execute (ms ): 8.84588
  max error : 2.3841858E -07

On a Tesla C2050 (compute capability 2.0) we get the following results.

Device : Tesla C2050
 
Time for sequential transfer and execute (ms ): 9.984512
  max error : 1.1920929e -07
Time for asynchronous V1 transfer and execute (ms ): 5.735584 
  max error : 1.1920929e -07
Time for asynchronous V2 transfer and execute (ms ): 7.597984
  max error : 1.1920929e -07

Here the first time reported is the sequential transfer and kernel execution using blocking transfers, which we use as a baseline for asynchronous speedup comparison. Why do the two asynchronous strategies perform differently on different architectures? To decipher these results we need to understand a bit more about how CUDA devices schedule and execute tasks. CUDA devices contain engines for various tasks, which queue up operations as they are issued. Dependencies between tasks in different engines are maintained, but within any engine all external dependencies are lost; tasks in each engine’s queue are executed in the order they are issued. The C1060 has a single copy engine and a single kernel engine. A time line for the execution of our example code on a C1060 is shown in the following diagram.



In the schematic we assume that the time required for the host-to-device transfer, kernel execution, and device-to-host transfer are approximately the same (the kernel code was chosen in order to achieve this). As expected for the sequential kernel, there is no overlap in any of the operations. For the first asynchronous version of our code the order of execution in the copy engine is: H2D stream(1), D2H stream(1), H2D stream(2), D2H stream(2), and so forth. This is why we do not see any speed-up when using the first asynchronous version on the C1060: tasks were issued to the copy engine in an order that precludes any overlap of kernel execution and data transfer. For version two, however, where all the host-to-device transfers are issued before any of the device-to-host transfers, overlap is possible as indicated by the lower execution time. From our schematic, we expect the execution of asynchronous version 2 to be 8/12 of the sequential version, or 8.7 ms which is confirmed in the timing results given previously.
On the C2050, two features interact to cause a behavior difference from the C1060. The C2050 has two copy engines, one for host-to-device transfers and another for device-to-host transfers, as well as a single kernel engine. The following diagram illustrates execution of our example on the C2050.


Having two copy engines explains why asynchronous version 1 achieves good speed-up on the C2050: the device-to-host transfer of data in stream[i] does not block the host-to-device transfer of data in stream[i+1] as it did on the C1060 because there is a separate engine for each copy direction on the C2050. The schematic predicts the execution time to be cut in half relative to the sequential version, and this is roughly what our timing results showed.
But what about the performance degradation observed in asynchronous version 2 on the C2050? This is related to the C2050′s ability to concurrently run multiple kernels. When multiple kernels are issued back-to-back in different (non-default) streams, the scheduler tries to enable concurrent execution of these kernels and as a result delays a signal that normally occurs after each kernel completion (which is responsible for kicking off the device-to-host transfer) until all kernels complete. So, while there is overlap between host-to-device transfers and kernel execution in the second version of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers. The schematic predicts an overall time for the asynchronous version 2 to be 9/12 of the time for the sequential version, or 7.5 ms, and this is confirmed by our timing results.
A more detailed description of the example used in this post is available in CUDA Fortran Asynchronous Data Transfers. The good news is that for devices with compute capability 3.5 (the K20 series), the Hyper-Q feature eliminates the need to tailor the launch order, so either approach above will work. We will discuss using Kepler features in a future post, but for now, here are the results of running the sample code on a Tesla K20c GPU. As you can see, both asynchronous methods achieve the same speedup over the synchronous code.
Device : Tesla K20c
Time for sequential transfer and execute (ms): 7.101760
  max error : 1.1920929e -07
Time for asynchronous V1 transfer and execute (ms): 3.974144 
  max error : 1.1920929e -07
Time for asynchronous V2 transfer and execute (ms): 3.967616 
  max error : 1.1920929e -07

Summary

This post and theprevious one discussed how to optimize data transfers between the host and device. The previous post focused on how to minimize the time for executing such transfers, and this post introduced streams and how to use them to mask data transfer time by concurrently executing copies and kernels.
In a post dealing with streams I should mention that while using the default stream is convenient for developing code—synchronous code is simpler—eventually your code should use non-default streams. This is especially important when writing libraries. If code in a library uses the default stream, there is no chance for the end user to overlap data transfers with library kernel execution.
Now you know how to move data efficiently between the host and device, so we’ll look at how to access data efficiently from within kernels in the next post.



Got Questions?
 Feel free to ask me any question because I'd be happy to walk you through step by step!

 For Contact us….. Click on Contact us Tab