Optimization in histogram CUDA code: When number of Bins not equal to max threads in a block
Continuing
from Part3
Part4
Optimization in histogram CUDA code
When number of Bins not equal to
max threads in a block
As
I already mentioned at last section of Part3 that when the number of bins not
equal to the maximum number of thread in a block, than problem become more
complicated to solve. In this part of the article we’ll concentrate how to
solve this problem efficiently.
In order to solve this problem, it is very necessary to understand how we wrote our kernel where number of bins is equal to no. of threads in a block. So let examine them.
In order to solve this problem, it is very necessary to understand how we wrote our kernel where number of bins is equal to no. of threads in a block. So let examine them.
Our
previous kernel is look like this;
__global__
void histo_kernel_optimized5( unsigned char *buffer, long size,
unsigned int *histo )
{
Line 1: __shared__
unsigned int
temp[256];
Line 2: temp[threadIdx.x] = 0;
Line 3: __syncthreads();
Line 4: int i = threadIdx.x + blockIdx.x * blockDim.x;
Line 5: int offset = blockDim.x * gridDim.x;
Line 6: while (i < size)
{
Line 7: atomicAdd( &temp[buffer[i]], 1);
Line 8: i += offset;
}
Line 9: __syncthreads();
Line 10: atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
|
And
kernel invocation was like this;
histo_kernel_optimized <<<blocks*2,256>>>(
dev_buffer, SIZE,
dev_histo );
|
So
in kernel; at Line 1: we are allocating
memory in shared memory refere as temp; the
memory size is equal to number of bins (256).
At
Line 2: We fill the temp array with zero, so effectively we
are filling 256 elements as 0 in shared memory.
At
Line 3: all threads wait for Line 2
task.
Line 4 and 5: These
lines are trivial we are calculating absolute id and offset according to 256
threads (which is equal to number of bins).
Line 6:
In line 6 we loop over all the element of buffer till our absolute id touch
Size value.
Line 7:
We are extracting the value located in buffer and increment the count in shared
memory array temp.
Line 8:
Incrementing absolute id by offset
Line 9:
threads wait for all other thread In a block.
Line 10:
finally we update our final result.
Now In order to change the above code for Bincount = 1024
(say), we need to care that each element will touch once. Since, at the kernel
invocation we are lunching 256 threads in a block and our number of bins is
1024 then shared memory size should be equal to number bins, so it should be
1024, it implies that line 1 will change
as below.
Line 1: __shared__
unsigned int
temp[1024];
|
What
about the Line 2; if threads = 256 then only 256 element will affect but our
bins count was 1024 then all should be initialize by 0. Then how we achieve
this?
Since our memory pattern was look like this;
Thread_Id update at location
0 0
1 1
2 2
.
.
255 255
|
The
entire element followed by 255 means from 256 to 1023 was not affect at all. One
method is to allow only last thread to update from 255 to 1023 but by doing
this we are forcing last threads to does more work than other threads. So we
can take advantage of other threads. One way to do this, let say last thread
fill 1023 location then second last thread should update 1022 location and so
on then pattern will look like this;
Thread_Id update at location
0 0,256,512,768
1 1,257,513,769
2 2,258,514,770
.
.
255 255,511,767,1023
|
So,
Line 2 will change and look like this.
temp[threadIdx.x + 0] = 0;
temp[threadIdx.x + 256] = 0;
temp[threadIdx.x +
512] = 0;
temp[threadIdx.x +
768] = 0;
|
Now
what about the line from 3 to 9, does anything will change? No, Since loop is
working fine and it will update accordingly. For example let say at absolute id
= 10, if the buffer value is 678 then corresponding value will update.
And
the line 10 has to be change accordingly as line 2 changed;
atomicAdd( &(histo[threadIdx.x + 0]), temp[threadIdx.x + 0] );
atomicAdd( &(histo[threadIdx.x +
256]), temp[threadIdx.x + 256] );
atomicAdd( &(histo[threadIdx.x +
512]), temp[threadIdx.x + 512] );
atomicAdd( &(histo[threadIdx.x +
768]), temp[threadIdx.x + 768] );
|
Hence
our final kernel will look like this;
__global__
void histo_kernel_optimized5( unsigned char *buffer, long size,
unsigned int *histo )
{
__shared__ unsigned
int temp[1024];
temp[threadIdx.x + 0] = 0;
temp[threadIdx.x
+ 256] = 0;
temp[threadIdx.x + 512] = 0;
temp[threadIdx.x + 768] = 0;
__syncthreads();
int i = threadIdx.x + blockIdx.x * blockDim.x;
int offset = blockDim.x * gridDim.x;
while (i < size)
{
atomicAdd( &temp[buffer[i]], 1);
i +=
offset;
}
__syncthreads();
atomicAdd(
&(histo[threadIdx.x + 0]), temp[threadIdx.x + 0] );
atomicAdd( &(histo[threadIdx.x
+ 256]), temp[threadIdx.x + 256] );
atomicAdd( &(histo[threadIdx.x
+ 512]), temp[threadIdx.x + 512] );
atomicAdd( &(histo[threadIdx.x
+ 768]), temp[threadIdx.x + 768] );
}
|
All
the updated lines are colored red in above code and we are done, our kernel
invocation does not change.
Now
time for quiz, if you’ll run your CUDA code in GPU GTX 280 with CC = 1.3 having
30 multiprocessor. Then how many element per thread in above code you are
operating? You can answer in comment box.
Hint:
Blocks = 2*30 = 60 and threads= 256 in a block so total threads =
blocks*threads;
I hope you
must like this article and have learned histogram
computation with GPU using CUDA.
Got
Questions?
Feel
free to ask me any question by commenting below because I'd be happy to walk
you through step by step!
Where are the shared memory bank conflicts in this final optimized5 version?
ReplyDeleteEach read and write seems to be done as a multiple of 16 on unsigned int (4 byte) data, so they should be ok, shouldn't they?
Experimentally, extending
Delete__shared__ unsigned int temp[1024];
to be something like
__shared__ unsigned int temp[1249];
gives it a significant speedup.
But I don't quite understand it an an intuitive level.
@Anonymous,
DeleteThere are bank conflicts; Consider this scenario;
Buffer values: 1 5 78 65 34 34 23 54........98
threads: 0 1 2 3 4 5 6 7 8 ........15
Each thread in a half warp access some random location, which may or may not be hit in required bank, for example, above Buffer is a snap shot of Total element, 16 elements consecutively for say half warp. thread 0 hitting bank 1 instead of bank 0 and thread 1 hitting bank 5 and so on....
Bank conflict are Bound To be happen, we can't ride off from it. It totally depends of data in Buffer, if data is too random such that each thread can't hit its corresponding Bank then we have 16 way bank conflict, which cause 16 time slower execution of above code.
DeleteOnly possible way of hiding bank conflict is either data is follow some non-bank conflict order or Data is sorted in some order.
You can check CUDA SDK example, ("http://docs.nvidia.com/cuda/cuda-samples/index.html#imaging")->Cuda Histogram
64 bin implementation which is bank conflict free while 256 bin example which cause bank conflict.
@Anonymous "Experimentally, extending"
ReplyDeleteYes you are correct, i did Experiment in my that online competition course;
i got great speed up, with __shared__ unsigned int temp[1024]; time was 1.2xxxx ms
instead of allocating __shared__ unsigned int temp[1024]; i allocate
__shared__ unsigned int temp[1024 + 255]; | since i have lunched 256 threads
got .81xxxx ms
the reason is Bank conflict reduction. After doing some padding in shared memory, bank conflict got reduced very much.
Thanks for the replies Nitin, that clarifies things.
DeleteI see now the potential bank conflicts in access of "temp[buffer[i]]" where the address being read depends on the data in the buffer[i]... I guess I was just considering all the other memory access of the form "variable[threadIdx.x + offset]" and overlooked the main one inside the while loop.
You are most Welcome ... if you still confuse somewhere, you can go through this article
ReplyDeletehttp://cuda-programming.blogspot.com/2013/02/bank-conflicts-in-shared-memory-in-cuda.html
what happens when your input values are greater than your shared memory array? For instance, in the line: atomicAdd( &temp[buffer[i]], 1), what if buffer[i] = 10,233? if temp is only size 1024, that will produce an out of memory exception.
ReplyDeleteNo, it is not possible, the random data is always in boundary of Bins either 0-255 or 0-1023,
DeleteBigRandomData function always generate in boundary.
All really great and very informative.
ReplyDeleteparaprotex
I have been searching for a useful post like this on salesforce course details, it is highly helpful for me and I have a great experience with this Salesforce Training who are providing certification and job assistance. Salesforce CRM certification in Noida
ReplyDeleteVery interesting, good job and thanks for sharing such a good blog. Your article is so convincing that I never stop myself to say something about it. You’re doing a great job. Keep it up. Check this out
ReplyDeleteHACCP Certification
Found your post interesting to read. I can’t wait to see your post soon. Good Luck for the upcoming update
ReplyDeleteFSC Certification
Great article, I learn new information from your article
ReplyDeleteCE Marking
This is the one of the most important information for me. And I am feeling glad reading your article. The article is really excellent ?
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteAsk him any Questions about herpes
ReplyDelete[…] It’s a crazy miracle product. Read my review here. […]
I was cured of herpes simplex virus his contact ___________Robinsonbuckler11@gmail. com.......................
this is the best herbal remedy…
Ana apa, ya postingan iki pancen apik lan aku wis sinau akeh babagan babagan blog. matur nuwun
ReplyDelete