Pages

Friday, 8 March 2013

Optimizing histogram CUDA code | Optimization in histogram in CUDA


Continuing from Part1 
                                                          Part2

Optimizing histogram CUDA code | Optimization in histogram in CUDA

We would like to adapt the histogram computation example to run on the GPU. If our input array is large enough, it might save a considerable amount of time to have different threads examining different parts of the buffer. Having different threads read different parts of the input should be easy enough. After all, it’s very similar to things we have seen in many previous posts. The problem with computing a histogram from the input data arises from the fact that multiple threads may want to increment the same bin of the output histogram at the same time. In this situation, we will need to use atomic increments to avoid this situation.
I’ll assume you know about atomicAdd () function, so I’ll not describe this in this post.

Our main() routine looks very similar to the CPU version, although we will need to add some of the CUDA C plumbing in order to get input to the GPU and results from the GPU. However, we start exactly as we did on the CPU:


#define SIZE (100*1024*1024) // 100 MB
int main( void ) {
unsigned char *buffer = (unsigned char*)big_random_block( SIZE );


We will be interested in measuring how our code performs, so we initialize events for timing exactly like we always have.


cudaEvent_t start, stop;
checkCudaErrors( cudaEventCreate( &start ) );
checkCudaErrors ( cudaEventCreate( &stop ) );
checkCudaErrors ( cudaEventRecord( start, 0 ) );


After setting up our input data and events, we look to GPU memory. We will need to allocate space for our random input data and our output histogram. After allocating the input buffer, we copy the array we generated with big_random_block() to the GPU. Likewise, after allocating the histogram, we initialize it to zero just like we did in the CPU version.


// allocate memory on the GPU for the file's data
unsigned char *dev_buffer;
unsigned int *dev_histo;
checkCudaErrors( cudaMalloc( (void**)&dev_buffer, SIZE ) );
checkCudaErrors( cudaMemcpy( dev_buffer, buffer, SIZE,
cudaMemcpyHostToDevice ) );
checkCudaErrors( cudaMalloc( (void**)&dev_histo,256 * sizeof( long ) ) );
checkCudaErrors( cudaMemset( dev_histo, 0,256 * sizeof( int ) ) );


You may notice that we slipped in a new CUDA runtime function, cudaMemset(). This function has a similar signature to the standard C function memset(), and the two functions behave nearly identically. The difference in signature is between these functions is that cudaMemset() returns an error code while the C library function memset() does not. This error code will inform the caller whether anything bad happened while attempting to set GPU memory. Aside from the error code return, the only difference is that cudaMemset() operates on GPU
memory while memset() operates on host memory.

After initializing the input and output buffers, we are ready to compute our histogram. You will see how we prepare and launch the histogram kernel momentarily. For the time being, assume that we have computed the histogram on the GPU. After finishing, we need to copy the histogram back to the CPU, so we allocate a 256-entry array and perform a copy from device to host.


unsigned int histo[256];
checkCudaErrors ( cudaMemcpy( histo, dev_histo,256 * sizeof( int ),
cudaMemcpyDeviceToHost ) );


At this point, we are done with the histogram computation so we can stop our timers and display the elapsed time. Just like the previous event code, this is identical to the timing code we’ve used for several previous articles.


// get stop time, and display the timing results
checkCudaErrors ( cudaEventRecord( stop, 0 ) );
checkCudaErrors ( cudaEventSynchronize( stop ) );
float elapsedTime;
checkCudaErrors ( cudaEventElapsedTime( &elapsedTime,start, stop ) );
printf( "Time to generate: %3.1f ms\n", elapsedTime );



At this point, we could pass the histogram as input to another stage in the algorithm, but since we are not using the histogram for anything else, we will simply verify that the computed GPU histogram matches what we get on the CPU. First, we verify that the histogram sum matches what we expect. This is identical to the CPU code shown here:


long histoCount = 0;
for (int i=0; i<256; i++) {
        histoCount += histo[i];
}
printf( "Histogram Sum: %ld\n", histoCount );


To fully verify the GPU histogram, though, we will use the CPU to compute the
To fully verify the GPU histogram, though, we will use the CPU to compute the same histogram. The obvious way to do this would be to allocate a new histogram array, compute a histogram from the input using the code from Part1 : CPU Histogram Computation, and, finally, ensure that each bin in the GPU and CPU version match. But rather than allocate a new histogram array, we’ll opt to start with the GPU histogram and compute the CPU histogram “in reverse.” By computing the histogram “in reverse,” we mean that rather than starting at zero and incrementing bin values when we see data elements, we will start with the GPU histogram and decrement the bin’s value when the CPU sees data elements. Therefore, the CPU has computed the same histogram as the GPU if and only if every bin has the value zero when we are finished. In some sense, we are computing the difference between these two histograms. The code will look remarkably like the CPU histogram computation but with a decrement operator instead of an increment operator.



long histoCount = 0;
for (int i=0; i<256; i++) {
           histoCount += histo[i];
}
printf( "Histogram Sum: %ld\n", histoCount );


As usual, the finale involves cleaning up our allocated CUDA events, GPU memory, and host memory.


checkCudaErrors ( cudaEventDestroy( start ) );
checkCudaErrors ( cudaEventDestroy( stop ) );
checkCudaErrors ( cudaFree( dev_histo ) ) ;
checkCudaErrors ( cudaFree( dev_buffer ) ) ;
free( buffer );
return 0;
}


Now, be concentrate while reading further;

Before, we assumed that we had launched a kernel that computed our histogram and then pressed on to discuss the aftermath. Our kernel launch is slightly more complicated than usual because of performance concerns. Because the histogram contains 256 bins, using 256 threads per block proves convenient as well as results in high performance. But we have a lot of flexibility in terms of the number of blocks we launch. For example, with 100MB of data, we have 104,857,600 bytes of data. We could launch a single block and have each thread examine 409,600 data elements. Likewise, we could launch 409,600 blocks and have each thread examine a single data element.
As you might have guessed, the optimal solution is at a point between these two extremes. By running some performance experiments, optimal performance is achieved when the number of blocks we launch is exactly twice the number of multiprocessors our GPU contains. For example, a GeForce GTX 280 has 30 multiprocessors, so our histogram kernel happens to run fastest on a GeForce GTX 280 when launched with 60 parallel blocks.

In Querying Device Properties, we discussed a method for querying various properties of the hardware on which our program is running. We will need to use one of these device properties if we intend to dynamically size our launch based on our current hardware platform. To accomplish this, we will use the following code segment. Although you haven’t yet seen the kernel implementation, you should still be able to follow what is going on.





cudaDeviceProp prop;
checkCudaErrors( cudaGetDeviceProperties( &prop, 0 ) );
int blocks = prop.multiProcessorCount;
histo_kernel_optimized <<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );


Since our walk-through of main() has been somewhat fragmented, here is the entire routine from start to finish:

Histogram.cu


#define SIZE (100*1024*1024) // 100 MB

// Write kernel code here



int main( void ) {
unsigned char *buffer = (unsigned char*)big_random_block( SIZE );

cudaEvent_t start, stop;
checkCudaErrors( cudaEventCreate( &start ) );
checkCudaErrors ( cudaEventCreate( &stop ) );
checkCudaErrors ( cudaEventRecord( start, 0 ) );


// allocate memory on the GPU for the file's data
unsigned char *dev_buffer;
unsigned int *dev_histo;
checkCudaErrors( cudaMalloc( (void**)&dev_buffer, SIZE ) );
checkCudaErrors( cudaMemcpy( dev_buffer, buffer, SIZE,
cudaMemcpyHostToDevice ) );
checkCudaErrors( cudaMalloc( (void**)&dev_histo,256 * sizeof( long ) ) );
checkCudaErrors( cudaMemset( dev_histo, 0,256 * sizeof( int ) ) );

cudaDeviceProp prop;
checkCudaErrors( cudaGetDeviceProperties( &prop, 0 ) );
int blocks = prop.multiProcessorCount;
histo_kernel_optimized <<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );

unsigned int histo[256];
checkCudaErrors ( cudaMemcpy( histo, dev_histo,256 * sizeof( int ),
cudaMemcpyDeviceToHost ) );

// get stop time, and display the timing results
checkCudaErrors ( cudaEventRecord( stop, 0 ) );
checkCudaErrors ( cudaEventSynchronize( stop ) );
float elapsedTime;
checkCudaErrors ( cudaEventElapsedTime( &elapsedTime,start, stop ) );
printf( "Time to generate: %3.1f ms\n", elapsedTime );

// verify that we have the same counts via CPU

long histoCount = 0;
for (int i=0; i<256; i++) {
           histoCount += histo[i];
}
printf( "Histogram Sum: %ld\n", histoCount );

// verify that we have the same counts via CPU
for (int i=0; i<SIZE; i++)
histo[buffer[i]]--;
for (int i=0; i<256; i++) {
if (histo[i] != 0)
printf( "Failure at %d!\n", i );
}


checkCudaErrors ( cudaEventDestroy( start ) );
checkCudaErrors ( cudaEventDestroy( stop ) );
checkCudaErrors ( cudaFree( dev_histo ) ) ;
checkCudaErrors ( cudaFree( dev_buffer ) ) ;
free( buffer );
return 0;
}



Optimization step 3
HISTOGRAM KERNEL USING GLOBAL MEMORY ATOMICS

And now for the fun part: the GPU code that computes the histogram! The kernel that computes the histogram itself needs to be given a pointer to the input data array, the length of the input array, and a pointer to the output histogram. The first thing our kernel needs to compute is a linearized offset into the input data array. Each thread will start with an offset between 0 and the number of threads minus 1. It will then stride by the total number of threads that have been launched. We hope you remember this technique; we used the same logic to add vectors of arbitrary length.




__global__ void
histo_kernel_optimized4( unsigned char *buffer,
long size,
unsigned int *histo ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;


Once each thread knows its starting offset i and the stride it should use, the code walks through the input array incrementing the corresponding histogram bin.


while (i < size)
{
          atomicAdd( &(histo[buffer[i]]), 1 );
              i += stride;
}

}

The highlighted line represents the way we use atomic operations in CUDA C. The call atomicAdd( addr, y ); generates an atomic sequence of operations that read the value at address addr, adds y to that value, and stores the result back to the memory address addr. The hardware guarantees us that no other thread can read or write the value at address addr while we perform these operations, thus ensuring predictable results. In our example, the address in
question is the location of the histogram bin that corresponds to the current byte. If the current byte is buffer[i], just like we saw in the CPU version, the corresponding histogram bin is histo[buffer[i]]. The atomic operation needs the address of this bin, so the first argument is therefore &(histo[buffer[i]]). Since we simply want to increment the value in that bin by one, the second argument is 1.

So after that entire hullabaloo, our GPU histogram computation is fairly similar to the corresponding CPU version.



__global__ void
histo_kernel_optimized4( unsigned char *buffer,
long size,
unsigned int *histo ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
  while (i < size)
  {
          atomicAdd( &(histo[buffer[i]]), 1 );
              i += stride;
  }

}


However, we need to save the celebrations for later. After running this example, we discover that a GeForce GTX 285 can construct a histogram from 100MB of input data in 1.752 seconds. If you read the section on CPU-based histograms (Part1), you will realize that this performance is terrible. In fact, this is more than four times slower than the CPU version! But this is why we always measure our baseline performance. It would be a shame to settle for such a low-performance implementation simply because it runs on the GPU.

Since we do very little work in the kernel, it is quite likely that the atomic operation on global memory is causing the problem. Essentially, when thousands of threads are trying to access a handful of memory locations, a great deal of contention for our 256 histogram bins can occur. To ensure atomicity of the increment operations, the hardware needs to serialize operations to the same memory location. This can result in a long queue of pending operations, and any performance gain we might have had will vanish. We will need to improve the algorithm itself in order to recover this performance.

Happy to read Further Part3

3 comments:

  1. I think it is wrong to measure time from host to device and device to host alongside the time of kernel invocation.

    ReplyDelete
  2. Great written text. Let as many valuable texts as possible. Emergency Dentist London

    ReplyDelete
  3. I always spent my half an hour to read this webpage’s content every day along with a cup of coffee.

    ReplyDelete

Help us to improve our quality and become contributor to our blog