Continuing from Part 2.
Part 3
Further Optimization in histogram CUDA code | Fast implementation of histogram in CUDA
Optimization step 4
HISTOGRAM KERNEL USING SHARED AND GLOBAL MEMORY ATOMICS
Up to this step, our performance on the GPU was quite poor compared to the CPU version. One possible optimization that came to mind was to start using shared memory. If you are not familiar with the shared memory concept, feel free to follow this link, where I explain shared memory in detail.
I'm pretty sure every reader must be wondering how this optimization came to my mind. Well, the answer is trivial: when we use global memory, the memory traffic is terrible. You can use the Visual Profiler to profile your code and learn about its performance, or you can build a performance metric for it; follow this link for that.
Apart from shared memory, there is another way to optimize, which we discussed in “Optimization step 3”: launching two-dimensional blocks and threads. However, I'd rather optimize my code with shared memory first; then I'll configure my block and grid dimensions.
Motivation for shared memory
Ironically, although the atomic operations cause this performance degradation, alleviating the slowdown actually involves using more atomics, not fewer. The core problem was not the use of atomics so much as the fact that thousands of threads were competing for access to a relatively small number of memory addresses. To address this issue, we will split our histogram computation into two phases.
Phase 1: Computing the histogram in shared memory
In phase one, each parallel block will compute a separate histogram of the data that its constituent threads examine. Since each block does this independently, we can compute these histograms in shared memory, saving us the time of sending each write off-chip to DRAM. Doing this does not free us from needing atomic operations, though, since multiple threads within the block can still examine data elements with the same value. However, the fact that only 256 threads will now be competing for 256 addresses will reduce contention compared to the global version, where thousands of threads were competing.
The first phase then involves allocating and zeroing a shared memory buffer to hold each block's intermediate histogram. Recall from the shared memory article that, since the subsequent step will involve reading and modifying this buffer, we need a __syncthreads() call to ensure that every thread's write has completed before progressing.
__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                                         unsigned int *histo )
{
    // Each block keeps a private 256-bin histogram in shared memory.
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;  // 256 threads, 256 bins: one thread zeroes one bin
    __syncthreads();        // every bin must be zeroed before counting starts
After zeroing the histogram, the next step is remarkably similar to our original GPU histogram. The sole differences here are that we use the shared memory buffer temp[] instead of the global memory buffer histo[], and that we need a subsequent call to __syncthreads() to ensure the last of our writes have been committed.
    // Grid-stride loop: thread i handles elements i, i+offset, i+2*offset, ...
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[buffer[i]], 1 );  // shared memory atomic: low contention
        i += offset;
    }
    __syncthreads();  // all partial counts committed before the merge
Phase 2: Merging histograms
The last step in our modified histogram example requires that we merge each block's temporary histogram into the global buffer histo[]. Suppose we split the input in half and two threads look at different halves and compute separate histograms. If thread A sees byte 0xFC 20 times in the input and thread B sees byte 0xFC 5 times, the byte 0xFC must have appeared 25 times in the input. Likewise, each bin of the final histogram is just the sum of the corresponding bin in thread A's histogram and thread B's histogram. This logic extends to any number of threads, so merging every block's histogram into a single final histogram involves adding each entry in the block's histogram to the corresponding entry in the final histogram. For all the reasons we've seen already, this needs to be done atomically:
    // Global memory atomic: one add per bin per block, not per input element.
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
Important note
Since we have decided to use 256 threads and have 256 histogram bins, each thread atomically adds a single bin to the final histogram's total. If these numbers didn't match, this phase would be more complicated. Note that we have no guarantees about what order the blocks add their values to the final histogram, but since integer addition is commutative, we will always get the same answer provided that the additions occur atomically.
And with this, our two-phase histogram computation kernel is complete. Here it is from start to finish:
__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                                         unsigned int *histo )
{
    // Phase 1: build a per-block histogram in shared memory.
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[buffer[i]], 1 );
        i += offset;
    }
    __syncthreads();

    // Phase 2: merge the per-block histogram into the global one.
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
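For completeness, here is a minimal host-side sketch of how this kernel might be launched. The input size, the block-count heuristic, and the setup code are my own illustrative assumptions, not part of the original article:

#include <cuda_runtime.h>

#define SIZE (100 * 1024 * 1024)   // assumed input size, for illustration only

int main()
{
    unsigned char *d_buffer;
    unsigned int  *d_histo;

    cudaMalloc( (void**)&d_buffer, SIZE );
    // ... fill d_buffer with input data, e.g. cudaMemcpy from the host ...

    cudaMalloc( (void**)&d_histo, 256 * sizeof(unsigned int) );
    cudaMemset( d_histo, 0, 256 * sizeof(unsigned int) );  // global bins start at zero

    // Twice the number of multiprocessors is a common block-count heuristic
    // for this kind of grid-stride kernel.
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    int blocks = prop.multiProcessorCount * 2;

    // 256 threads per block: one thread per histogram bin.
    histo_kernel_optimized5<<< blocks, 256 >>>( d_buffer, SIZE, d_histo );
    cudaDeviceSynchronize();

    cudaFree( d_buffer );
    cudaFree( d_histo );
    return 0;
}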
This version of our histogram example improves dramatically over the previous GPU version. Adding the shared memory component drops our running time on a GeForce GTX 285 to 0.057 seconds. Not only is this significantly better than the version that used global memory atomics only, but it also beats our original CPU implementation (from 0.416 seconds to 0.057 seconds), a greater than sevenfold boost in speed. So despite the early setback in adapting the histogram to a GPU implementation, our version that uses both shared and global atomics should be considered a success.
Can we go faster?
Yes, we can. After profiling the above code, I found that it has bank conflicts. If we can remove the bank conflicts, the running time can drop by a factor of 2, 4, 8, or even more. I leave this task to the reader for now; I'll present a bank-conflict-free histogram computation in a subsequent article.
Now, what about the competition? In that online competition, the number of bins was 1024 and the size was 1000*1024 = 1,024,000 elements. My naïve implementation (optimization 3) performed well at 16.xx ms; the optimization step 4 implementation took 8.xx ms (2x faster); and the optimization step 5 implementation took 1.xx ms, which is 8 times faster than the global memory version and 16 times faster than the naïve implementation.
My kernel invocations looked like this (note that the kernel takes the element count as its second parameter and the histogram as its third):

// with naïve implementation
naive_kernel<<< blocks*8, 1024 >>>( d_vals, d_histo, numElems );

// with optimized implementation
OptimizedStep5<<< blocks*8, 1024, 1024*sizeof(int) >>>( d_vals, numElems, d_histo );
You may have noticed that I launched 1024 threads with a dynamic shared memory allocation (see the third argument of the kernel launch configuration), since the number of bins was 1024.
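For reference, the 1024-bin kernel body would look almost identical to histo_kernel_optimized5, except that the shared buffer is declared extern so its size comes from that third launch argument. The following is only a sketch; the parameter types are assumptions inferred from the invocation above:

__global__ void OptimizedStep5( unsigned int *vals, long size,
                                unsigned int *histo )
{
    // Sized at launch time by the third <<< >>> argument: 1024*sizeof(int).
    extern __shared__ unsigned int temp[];

    temp[threadIdx.x] = 0;   // 1024 threads zero 1024 bins
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[vals[i]], 1 );
        i += offset;
    }
    __syncthreads();

    atomicAdd( &histo[threadIdx.x], temp[threadIdx.x] );
}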
Further problem statement
If the number of bins and the number of threads in a block do not match, the problem becomes more complicated. As above, suppose the number of bins is 1024 but I can launch only 512 threads in a block; how do we solve the problem then? I've covered this approach in Part 4 of this article.
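Until then, here is one hypothetical direction (my own sketch, not necessarily the approach Part 4 takes): let each thread own several bins via a stride loop, so 512 threads can still zero and merge 1024 bins.

// Hypothetical sketch: numBins may exceed blockDim.x (e.g. 1024 bins, 512 threads).
__global__ void histo_more_bins_than_threads( unsigned int *vals, long size,
                                              unsigned int *histo, int numBins )
{
    extern __shared__ unsigned int temp[];  // numBins entries, sized at launch

    // Each thread zeroes bins threadIdx.x, threadIdx.x + blockDim.x, ...
    for (int bin = threadIdx.x; bin < numBins; bin += blockDim.x)
        temp[bin] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[vals[i]], 1 );
        i += offset;
    }
    __syncthreads();

    // Merge with the same stride loop over bins.
    for (int bin = threadIdx.x; bin < numBins; bin += blockDim.x)
        atomicAdd( &histo[bin], temp[bin] );
}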