Continuing from Part 2.
Part 3
Further Optimization in histogram CUDA code | Fast implementation of histogram in CUDA
Optimization step 4
HISTOGRAM KERNEL USING SHARED AND GLOBAL MEMORY ATOMICS
Up to this step, our performance on the GPU was quite poor compared to the CPU version. One possible optimization that came to mind was to start using shared memory. If you are not familiar with the shared memory concept, feel free to follow this link, where I explain shared memory in detail.
I'm pretty sure every reader must be wondering how this optimization came to my mind. Well, the answer is trivial: when we use global memory, the memory traffic is terrible. You can use the Visual Profiler to profile your code and learn about its performance, or you can build a performance metric for it; follow this link for that.
Apart from shared memory, there is another way to optimize, which we discussed in “Optimization step 3”: launching two-dimensional blocks and threads. However, I'd rather optimize my code with shared memory first; then I'll configure my block and grid dimensions.
Motivation for shared memory
Ironically, although the atomic operations cause this performance degradation, alleviating the slowdown actually involves using more atomics, not fewer. The core problem was not the use of atomics so much as the fact that thousands of threads were competing for access to a relatively small number of memory addresses. To address this issue, we will split our histogram computation into two phases.
Phase 1: Computing the histogram in shared memory
In phase one, each parallel block will compute a separate histogram of the data that its constituent threads examine. Since each block does this independently, we can compute these histograms in shared memory, saving us the time of sending each write off-chip to DRAM. Doing this does not free us from needing atomic operations, though, since multiple threads within the block can still examine data elements with the same value. However, the fact that only 256 threads will now be competing for 256 addresses will reduce contention compared to the global version, where thousands of threads were competing.
The first phase then involves allocating and zeroing a shared memory buffer to hold each block's intermediate histogram. Recall from the shared memory article that, since the subsequent step will involve reading and modifying this buffer, we need a __syncthreads() call to ensure that every thread's write has completed before progressing.
__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                                         unsigned int *histo )
{
    // Each block keeps a private 256-bin histogram in shared memory.
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;  // 256 threads, 256 bins: one thread zeroes one bin
    __syncthreads();        // every bin must be zeroed before counting starts
After zeroing the histogram, the next step is remarkably similar to our original GPU histogram. The sole differences here are that we use the shared memory buffer temp[] instead of the global memory buffer histo[], and that we need a subsequent call to __syncthreads() to ensure the last of our writes have been committed.
    // Grid-stride loop: thread i handles elements i, i+offset, i+2*offset, ...
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[buffer[i]], 1 );  // shared memory atomic: low contention
        i += offset;
    }
    __syncthreads();  // all partial counts committed before the merge
Phase 2: Merging histograms
The last step in our modified histogram example requires that we merge each block's temporary histogram into the global buffer histo[]. Suppose we split the input in half and two threads look at different halves and compute separate histograms. If thread A sees byte 0xFC 20 times in the input and thread B sees byte 0xFC 5 times, the byte 0xFC must have appeared 25 times in the input. Likewise, each bin of the final histogram is just the sum of the corresponding bin in thread A's histogram and thread B's histogram. This logic extends to any number of threads, so merging every block's histogram into a single final histogram involves adding each entry in the block's histogram to the corresponding entry in the final histogram. For all the reasons we've seen already, this needs to be done atomically:
    // Global memory atomic: one add per bin per block, not per input element.
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
Important note
Since we have decided to use 256 threads and have 256 histogram bins, each thread atomically adds a single bin to the final histogram's total. If these numbers didn't match, this phase would be more complicated. Note that we have no guarantees about what order the blocks add their values to the final histogram, but since integer addition is commutative, we will always get the same answer provided that the additions occur atomically.
And with this, our two-phase histogram computation kernel is complete. Here it is from start to finish:
__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                                         unsigned int *histo )
{
    // Phase 1: build a per-block histogram in shared memory.
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[buffer[i]], 1 );
        i += offset;
    }
    __syncthreads();

    // Phase 2: merge the per-block histogram into the global one.
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
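For completeness, here is a minimal host-side sketch of how this kernel might be launched. The input size, the block-count heuristic, and the setup code are my own illustrative assumptions, not part of the original article:

#include <cuda_runtime.h>

#define SIZE (100 * 1024 * 1024)   // assumed input size, for illustration only

int main()
{
    unsigned char *d_buffer;
    unsigned int  *d_histo;

    cudaMalloc( (void**)&d_buffer, SIZE );
    // ... fill d_buffer with input data, e.g. cudaMemcpy from the host ...

    cudaMalloc( (void**)&d_histo, 256 * sizeof(unsigned int) );
    cudaMemset( d_histo, 0, 256 * sizeof(unsigned int) );  // global bins start at zero

    // Twice the number of multiprocessors is a common block-count heuristic
    // for this kind of grid-stride kernel.
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    int blocks = prop.multiProcessorCount * 2;

    // 256 threads per block: one thread per histogram bin.
    histo_kernel_optimized5<<< blocks, 256 >>>( d_buffer, SIZE, d_histo );
    cudaDeviceSynchronize();

    cudaFree( d_buffer );
    cudaFree( d_histo );
    return 0;
}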
This version of our histogram example improves dramatically over the previous GPU version. Adding the shared memory component drops our running time on a GeForce GTX 285 to 0.057 seconds. Not only is this significantly better than the version that used global memory atomics only, but it also beats our original CPU implementation (from 0.416 seconds to 0.057 seconds), a greater than sevenfold boost in speed. So despite the early setback in adapting the histogram to a GPU implementation, our version that uses both shared and global atomics should be considered a success.
Can we go faster?
Yes, we can. After profiling the above code, I found that it has bank conflicts. If we can remove the bank conflicts, the running time can drop by a factor of 2, 4, 8, or even more. I leave this task to the reader for now; I'll present a bank-conflict-free histogram computation in a subsequent article.
Now, what about the competition? In that online competition, the number of bins was 1024 and the size was 1000*1024 = 1,024,000 elements. My naïve implementation (optimization 3) performed well at 16.xx ms; the optimization step 4 implementation took 8.xx ms (2x faster); and the optimization step 5 implementation took 1.xx ms, which is 8 times faster than the global memory version and 16 times faster than the naïve implementation.
My kernel invocations looked like this (note that the kernel takes the element count as its second parameter and the histogram as its third):

// with naïve implementation
naive_kernel<<< blocks*8, 1024 >>>( d_vals, d_histo, numElems );

// with optimized implementation
OptimizedStep5<<< blocks*8, 1024, 1024*sizeof(int) >>>( d_vals, numElems, d_histo );
You may have noticed that I launched 1024 threads with a dynamic shared memory allocation (see the third argument of the kernel launch configuration), since the number of bins was 1024.
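For reference, the 1024-bin kernel body would look almost identical to histo_kernel_optimized5, except that the shared buffer is declared extern so its size comes from that third launch argument. The following is only a sketch; the parameter types are assumptions inferred from the invocation above:

__global__ void OptimizedStep5( unsigned int *vals, long size,
                                unsigned int *histo )
{
    // Sized at launch time by the third <<< >>> argument: 1024*sizeof(int).
    extern __shared__ unsigned int temp[];

    temp[threadIdx.x] = 0;   // 1024 threads zero 1024 bins
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[vals[i]], 1 );
        i += offset;
    }
    __syncthreads();

    atomicAdd( &histo[threadIdx.x], temp[threadIdx.x] );
}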
Further problem statement
If the number of bins and the number of threads in a block do not match, the problem becomes more complicated. As above, suppose the number of bins is 1024 but I can launch only 512 threads in a block; how do we solve the problem then? I've covered this approach in Part 4 of this article.
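Until then, here is one hypothetical direction (my own sketch, not necessarily the approach Part 4 takes): let each thread own several bins via a stride loop, so 512 threads can still zero and merge 1024 bins.

// Hypothetical sketch: numBins may exceed blockDim.x (e.g. 1024 bins, 512 threads).
__global__ void histo_more_bins_than_threads( unsigned int *vals, long size,
                                              unsigned int *histo, int numBins )
{
    extern __shared__ unsigned int temp[];  // numBins entries, sized at launch

    // Each thread zeroes bins threadIdx.x, threadIdx.x + blockDim.x, ...
    for (int bin = threadIdx.x; bin < numBins; bin += blockDim.x)
        temp[bin] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[vals[i]], 1 );
        i += offset;
    }
    __syncthreads();

    // Merge with the same stride loop over bins.
    for (int bin = threadIdx.x; bin < numBins; bin += blockDim.x)
        atomicAdd( &histo[bin], temp[bin] );
}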