Part2
Optimizing histogram CUDA code |
Optimization in histogram in CUDA
We would
like to adapt the histogram computation example to run on the GPU. If our input
array is large enough, it might save a considerable amount of time to have
different threads examining different parts of the buffer. Having different threads
read different parts of the input should be easy enough. After all, it’s very similar
to things we have seen in many previous posts. The problem with computing a
histogram from the input data arises from the fact that multiple threads may
want to increment the same bin of the output histogram at the same time. In
this situation, we will need to use atomic increments to avoid this situation.
I’ll
assume you know about atomicAdd () function, so I’ll not describe this
in this post.
Our main()
routine looks very similar to the CPU version, although we will need to add
some of the CUDA C plumbing in order to get input to the GPU and results from
the GPU. However, we start exactly as we did on the CPU:
#define SIZE (100*1024*1024) // 100 MB
int main( void ) {
unsigned char *buffer = (unsigned char*)big_random_block( SIZE );
|
We will be
interested in measuring how our code performs, so we initialize events for
timing exactly like we always have.
cudaEvent_t
start, stop;
checkCudaErrors( cudaEventCreate(
&start )
);
checkCudaErrors ( cudaEventCreate(
&stop )
);
checkCudaErrors ( cudaEventRecord(
start, 0 ) );
|
After
setting up our input data and events, we look to GPU memory. We will need to
allocate space for our random input data and our output histogram. After
allocating the input buffer, we copy the array we generated with big_random_block() to the GPU.
Likewise, after allocating the histogram, we initialize it to zero just like we
did in the CPU version.
// allocate memory on the GPU for the file's data
unsigned char *dev_buffer;
unsigned int *dev_histo;
checkCudaErrors( cudaMalloc( (void**)&dev_buffer,
SIZE ) );
checkCudaErrors( cudaMemcpy(
dev_buffer, buffer,
SIZE,
cudaMemcpyHostToDevice
) );
checkCudaErrors( cudaMalloc( (void**)&dev_histo,256
* sizeof( long ) ) );
checkCudaErrors( cudaMemset(
dev_histo, 0,256
* sizeof( int ) ) );
|
You may
notice that we slipped in a new CUDA runtime function, cudaMemset().
This function has a similar signature to the standard C function memset(),
and the two functions behave nearly identically. The difference in signature is
between these functions is that cudaMemset()
returns an error code while the C library function memset()
does not. This error code will inform the caller whether anything bad happened
while attempting to set GPU memory. Aside from the error code return, the only
difference is that cudaMemset()
operates on GPU
After
initializing the input and output buffers, we are ready to compute our
histogram. You will see how we prepare and launch the histogram kernel
momentarily. For the time being, assume that we have computed the histogram on
the GPU. After finishing, we need to copy the histogram back to the CPU, so we
allocate a 256-entry array and perform a copy from device to host.
unsigned int histo[256];
checkCudaErrors
( cudaMemcpy(
histo, dev_histo,256
* sizeof( int ),
cudaMemcpyDeviceToHost
) );
|
At this
point, we are done with the histogram computation so we can stop our timers and
display the elapsed time. Just like the previous event code, this is identical
to the timing code we’ve used for several previous articles.
// get stop time, and display the timing results
checkCudaErrors
( cudaEventRecord(
stop, 0 ) );
checkCudaErrors
( cudaEventSynchronize( stop ) );
float elapsedTime;
checkCudaErrors
( cudaEventElapsedTime( &elapsedTime,start, stop ) );
printf( "Time to
generate: %3.1f ms\n",
elapsedTime );
|
At this
point, we could pass the histogram as input to another stage in the algorithm, but
since we are not using the histogram for anything else, we will simply verify
that the computed GPU histogram matches what we get on the CPU. First, we
verify that the histogram sum matches what we expect. This is identical to the CPU
code shown here:
long histoCount = 0;
for (int
i=0; i<256; i++) {
histoCount +=
histo[i];
}
printf( "Histogram Sum:
%ld\n", histoCount );
|
To fully
verify the GPU histogram, though, we will use the CPU to compute the
To fully
verify the GPU histogram, though, we will use the CPU to compute the same
histogram. The obvious way to do this would be to allocate a new histogram array,
compute a histogram from the input using the code from Part1 : CPU Histogram
Computation, and, finally, ensure that each bin in the GPU and CPU version
match. But rather than allocate a new histogram array, we’ll opt to start with
the GPU histogram and compute the CPU histogram “in reverse.” By computing the
histogram “in reverse,” we mean that rather than starting at zero and incrementing
bin values when we see data elements, we will start with the GPU histogram and decrement
the bin’s value when the CPU sees data elements. Therefore, the CPU has
computed the same histogram as the GPU if and only if every bin has the value
zero when we are finished. In some sense, we are computing the difference
between these two histograms. The code will look remarkably like the CPU
histogram computation but with a decrement operator instead of an increment
operator.
long histoCount = 0;
for (int
i=0; i<256; i++) {
histoCount += histo[i];
}
printf( "Histogram Sum:
%ld\n", histoCount );
|
As usual,
the finale involves cleaning up our allocated CUDA events, GPU memory, and host
memory.
checkCudaErrors
( cudaEventDestroy(
start ) );
checkCudaErrors
( cudaEventDestroy(
stop ) );
checkCudaErrors
( cudaFree(
dev_histo ) )
;
checkCudaErrors
( cudaFree(
dev_buffer ) )
;
free( buffer );
return 0;
}
|
Now, be
concentrate while reading further;
Before,
we assumed that we had launched a kernel that computed our histogram and then
pressed on to discuss the aftermath. Our kernel launch is slightly more
complicated than usual because of performance concerns. Because the histogram
contains 256 bins, using 256 threads per block proves convenient as well as
results in high performance. But we have a lot of flexibility in terms of the
number of blocks we launch. For example,
with 100MB of data, we have 104,857,600 bytes of data. We could launch a single
block and have each thread examine 409,600 data elements. Likewise, we could
launch 409,600 blocks and have each thread examine a single data element.
As
you might have guessed, the optimal solution is at a point between these two
extremes. By running some performance experiments, optimal performance is
achieved when the number of blocks we launch is exactly twice the number of multiprocessors
our GPU contains. For example, a GeForce GTX 280 has 30 multiprocessors, so our
histogram kernel happens to run fastest on a GeForce GTX 280 when launched with
60 parallel blocks.
In Querying
Device Properties, we discussed a method for querying
various properties of the hardware on which our program is running. We will
need to use one of these device properties if we intend to dynamically size our
launch based on our current hardware platform. To accomplish this, we will use
the following code segment. Although you haven’t yet seen the kernel
implementation, you should still be able to follow what is going on.
cudaDeviceProp
prop;
checkCudaErrors( cudaGetDeviceProperties(
&prop,
0 ) );
int blocks = prop.multiProcessorCount;
histo_kernel_optimized <<<blocks*2,256>>>(
dev_buffer, SIZE,
dev_histo );
|
Since our
walk-through of main() has been somewhat fragmented, here is the entire routine
from start to finish:
Histogram.cu
#define SIZE (100*1024*1024) // 100 MB
// Write kernel code here
int main( void ) {
unsigned char *buffer = (unsigned char*)big_random_block( SIZE );
cudaEvent_t
start, stop;
checkCudaErrors( cudaEventCreate(
&start )
);
checkCudaErrors ( cudaEventCreate(
&stop )
);
checkCudaErrors ( cudaEventRecord(
start, 0 ) );
// allocate memory on the GPU for the file's data
unsigned char *dev_buffer;
unsigned int *dev_histo;
checkCudaErrors( cudaMalloc( (void**)&dev_buffer,
SIZE ) );
checkCudaErrors( cudaMemcpy(
dev_buffer, buffer,
SIZE,
cudaMemcpyHostToDevice
) );
checkCudaErrors( cudaMalloc( (void**)&dev_histo,256
* sizeof( long ) ) );
checkCudaErrors( cudaMemset(
dev_histo, 0,256
* sizeof( int ) ) );
cudaDeviceProp
prop;
checkCudaErrors( cudaGetDeviceProperties(
&prop,
0 ) );
int blocks = prop.multiProcessorCount;
histo_kernel_optimized <<<blocks*2,256>>>(
dev_buffer, SIZE,
dev_histo );
unsigned int histo[256];
checkCudaErrors
( cudaMemcpy(
histo, dev_histo,256
* sizeof( int ),
cudaMemcpyDeviceToHost
) );
// get stop time, and display the timing results
checkCudaErrors
( cudaEventRecord(
stop, 0 ) );
checkCudaErrors
( cudaEventSynchronize( stop ) );
float elapsedTime;
checkCudaErrors
( cudaEventElapsedTime( &elapsedTime,start, stop ) );
printf( "Time to
generate: %3.1f ms\n",
elapsedTime );
// verify that we have the same counts via CPU
long histoCount = 0;
for (int
i=0; i<256; i++) {
histoCount += histo[i];
}
printf( "Histogram Sum:
%ld\n", histoCount );
// verify that we have the same counts via CPU
for (int
i=0; i<SIZE; i++)
histo[buffer[i]]--;
for (int
i=0; i<256; i++) {
if (histo[i] != 0)
printf( "Failure at
%d!\n", i );
}
checkCudaErrors
( cudaEventDestroy(
start ) );
checkCudaErrors
( cudaEventDestroy(
stop ) );
checkCudaErrors
( cudaFree(
dev_histo ) )
;
checkCudaErrors
( cudaFree(
dev_buffer ) )
;
free( buffer );
return 0;
}
|
Optimization step 3
HISTOGRAM KERNEL USING GLOBAL
MEMORY ATOMICS
And now
for the fun part: the GPU code that computes the histogram! The kernel that
computes the histogram itself needs to be given a pointer to the input data
array, the length of the input array, and a pointer to the output histogram. The
first thing our kernel needs to compute is a linearized offset into the input data
array. Each thread will start with an offset between 0 and the number of threads
minus 1. It will then stride by the total number of threads that have been launched.
We hope you remember this technique; we used the same logic to add
vectors of arbitrary length.
__global__
void
histo_kernel_optimized4( unsigned char *buffer,
long size,
unsigned int *histo ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
|
Once each
thread knows its starting offset i and the stride it should use, the code walks
through the input array incrementing the corresponding histogram bin.
while (i < size)
{
atomicAdd(
&(histo[buffer[i]]), 1 );
i +=
stride;
}
}
|
The
highlighted line represents the way we use atomic operations in CUDA C. The
call atomicAdd( addr, y ); generates
an atomic sequence of operations that read the value at address addr, adds y to
that value, and stores the result back to the memory address addr. The hardware
guarantees us that no other thread can read or write the value at address addr
while we perform these operations, thus ensuring predictable results. In our
example, the address in
question
is the location of the histogram bin that corresponds to the current byte. If
the current byte is buffer[i], just
like we saw in the CPU version, the corresponding histogram bin is histo[buffer[i]]. The atomic operation
needs the address of this bin, so the first argument is therefore &(histo[buffer[i]]). Since we simply
want to increment the value in that bin by one, the second argument is 1.
So after that
entire hullabaloo, our GPU histogram computation is fairly similar to the
corresponding CPU version.
__global__
void
histo_kernel_optimized4( unsigned char *buffer,
long size,
unsigned int *histo ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
while (i < size)
{
atomicAdd(
&(histo[buffer[i]]), 1 );
i +=
stride;
}
}
|
However,
we need to save the celebrations for later. After running this example, we
discover that a GeForce GTX 285 can construct a histogram from 100MB of input
data in 1.752 seconds. If you read the section on CPU-based histograms (Part1), you will realize that this performance
is terrible. In fact, this is more than four times slower than the CPU version!
But this is why we always measure our baseline performance. It would be a shame
to settle for such a low-performance implementation simply because it runs on
the GPU.
Since
we do very little work in the kernel, it is quite likely that the atomic
operation on global memory is causing the problem. Essentially, when thousands
of threads are trying to access a handful of memory locations, a great deal of
contention for our 256 histogram bins can occur. To ensure atomicity of the
increment operations, the hardware needs to serialize operations to the same
memory location. This can result in a long queue of pending operations, and any
performance gain we might have had will vanish. We
will need to improve the algorithm itself in order to recover this performance.
Happy to
read Further Part3
I think it is wrong to measure time from host to device and device to host alongside the time of kernel invocation.
ReplyDeleteGreat written text. Let as many valuable texts as possible. Emergency Dentist London
ReplyDeleteI always spent my half an hour to read this webpage’s content every day along with a cup of coffee.
ReplyDelete