Optimization in histogram CUDA code: when the number of bins is not equal to the max threads in a block
Part 4 (continuing from Part 3)
As I already mentioned at the end of Part 3, when the number of bins is not equal to the maximum number of threads in a block, the problem becomes more complicated to solve. In this part of the article we'll concentrate on how to solve this problem efficiently.
In order to solve it, it is necessary to understand how we wrote the kernel for the case where the number of bins equals the number of threads in a block. So let's examine it.
Our previous kernel looked like this:
__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                                         unsigned int *histo )
{
    /* Line 1 */  __shared__ unsigned int temp[256];
    /* Line 2 */  temp[threadIdx.x] = 0;
    /* Line 3 */  __syncthreads();
    /* Line 4 */  int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Line 5 */  int offset = blockDim.x * gridDim.x;
    /* Line 6 */  while (i < size)
                  {
    /* Line 7 */      atomicAdd( &temp[buffer[i]], 1 );
    /* Line 8 */      i += offset;
                  }
    /* Line 9 */  __syncthreads();
    /* Line 10 */ atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
And the kernel invocation looked like this:
histo_kernel_optimized5<<<blocks*2, 256>>>( dev_buffer, SIZE, dev_histo );
So, in the kernel:
Line 1: We allocate a shared memory array referred to as temp; its size equals the number of bins (256).
Line 2: We fill the temp array with zeros; since each of the 256 threads writes one element, all 256 shared memory elements are initialized to 0.
Line 3: All threads wait until the initialization in Line 2 is complete.
Lines 4 and 5: These lines are trivial; we calculate the absolute thread id and the grid-wide offset for the 256 threads per block (which equals the number of bins).
Line 6: We loop over all the elements of the buffer until the absolute id reaches size.
Line 7: We read the value stored in the buffer and atomically increment the corresponding count in the shared memory array temp.
Line 8: We increment the absolute id by the offset.
Line 9: Each thread waits for all other threads in its block.
Line 10: Finally, we add the per-block partial histogram into the final result in global memory.
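For reference, here is a minimal sequential CPU version of the same 256-bin histogram (my own illustration, not code from the series); the kernel above essentially parallelizes this loop across threads and blocks:
// CPU reference sketch: the same 256-bin histogram computed sequentially.
// The function name cpu_histogram is illustrative only.
void cpu_histogram( const unsigned char *buffer, long size, unsigned int *histo )
{
    for (int bin = 0; bin < 256; ++bin)
        histo[bin] = 0;          // plays the role of Lines 1-3 (initialization)

    for (long i = 0; i < size; ++i)
        ++histo[buffer[i]];      // plays the role of Lines 6-8, without atomics
}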
Now, in order to change the above code for a bin count of 1024 (say), we need to make sure that every shared memory element is touched exactly once. Since the kernel invocation launches 256 threads per block and the number of bins is 1024, the shared memory size must equal the number of bins, i.e. 1024. This implies that Line 1 changes as below.
/* Line 1 */  __shared__ unsigned int temp[1024];
What about Line 2? With 256 threads, only 256 elements would be touched, but our bin count is 1024 and all of the elements must be initialized to 0. How do we achieve this?
With the original Line 2, the memory access pattern looks like this:
Thread_Id    updates location
0            0
1            1
2            2
...
255          255
All the elements after index 255, that is, locations 256 to 1023, are not touched at all. One method is to allow only the last thread to update locations 256 to 1023, but by doing this we force the last thread to do far more work than the other threads. So we can take advantage of the other threads instead. One way to do this: let the last thread also fill location 1023, the second-to-last thread fill location 1022, and so on, so that each thread initializes four locations at a stride of 256. The pattern will then look like this:
Thread_Id    updates locations
0            0, 256, 512, 768
1            1, 257, 513, 769
2            2, 258, 514, 770
...
255          255, 511, 767, 1023
So Line 2 will change to look like this:
temp[threadIdx.x + 0]   = 0;
temp[threadIdx.x + 256] = 0;
temp[threadIdx.x + 512] = 0;
temp[threadIdx.x + 768] = 0;
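For other bin counts, the same idea can also be written as a strided loop; this is a small sketch of my own (using an assumed BIN_COUNT constant that is not in the original code), equivalent to the four statements above when BIN_COUNT is 1024 and the block has 256 threads:
// Sketch: strided zero-initialization that works for any BIN_COUNT.
// BIN_COUNT is an assumed compile-time constant, not from the original post.
for (int j = threadIdx.x; j < BIN_COUNT; j += blockDim.x)
    temp[j] = 0;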
Now, what about Lines 3 to 9: does anything change there? No, since the loop already works fine and updates the right bin. For example, if at absolute id = 10 the buffer value is 678, then the corresponding element temp[678] is incremented.
And Line 10 has to change in the same way Line 2 did:
atomicAdd( &(histo[threadIdx.x + 0]),   temp[threadIdx.x + 0] );
atomicAdd( &(histo[threadIdx.x + 256]), temp[threadIdx.x + 256] );
atomicAdd( &(histo[threadIdx.x + 512]), temp[threadIdx.x + 512] );
atomicAdd( &(histo[threadIdx.x + 768]), temp[threadIdx.x + 768] );
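Like the initialization, this final merge can be written as a strided loop for arbitrary bin counts (again a sketch of my own, using the assumed BIN_COUNT constant):
// Sketch: each thread flushes its strided share of the per-block histogram
// into the global result. BIN_COUNT is the same assumed constant as above.
for (int j = threadIdx.x; j < BIN_COUNT; j += blockDim.x)
    atomicAdd( &histo[j], temp[j] );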
Hence our final kernel will look like this:
__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                                         unsigned int *histo )
{
    __shared__ unsigned int temp[1024];
    temp[threadIdx.x + 0]   = 0;
    temp[threadIdx.x + 256] = 0;
    temp[threadIdx.x + 512] = 0;
    temp[threadIdx.x + 768] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[buffer[i]], 1 );
        i += offset;
    }
    __syncthreads();

    atomicAdd( &(histo[threadIdx.x + 0]),   temp[threadIdx.x + 0] );
    atomicAdd( &(histo[threadIdx.x + 256]), temp[threadIdx.x + 256] );
    atomicAdd( &(histo[threadIdx.x + 512]), temp[threadIdx.x + 512] );
    atomicAdd( &(histo[threadIdx.x + 768]), temp[threadIdx.x + 768] );
}
The updated lines are the shared memory declaration, the four initialization statements, and the four final atomicAdd statements. With that we are done, and our kernel invocation does not change.
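For completeness, here is a minimal host-side sketch of how such a kernel could be launched. The buffer size, the simple rand()-based data generator, and the variable names are assumptions for illustration, not code from the original series; the blocks*2 launch and the use of the multiprocessor count follow the invocation shown earlier.
// Host-side sketch (assumed names and sizes; error checking omitted for brevity).
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    const long SIZE = 100 * 1024 * 1024;      // assumed input size
    const int  BINS = 1024;

    // With an unsigned char input buffer the values stay in 0-255; the series'
    // own BigRandomData routine likewise keeps values inside the bin range.
    unsigned char *buffer = (unsigned char*)malloc( SIZE );
    for (long i = 0; i < SIZE; ++i)
        buffer[i] = (unsigned char)(rand() % 256);

    unsigned char *dev_buffer;
    unsigned int  *dev_histo;
    cudaMalloc( (void**)&dev_buffer, SIZE );
    cudaMalloc( (void**)&dev_histo,  BINS * sizeof(unsigned int) );
    cudaMemcpy( dev_buffer, buffer, SIZE, cudaMemcpyHostToDevice );
    cudaMemset( dev_histo, 0, BINS * sizeof(unsigned int) );

    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    int blocks = prop.multiProcessorCount;    // e.g. 30 on a GTX 280

    histo_kernel_optimized5<<<blocks * 2, 256>>>( dev_buffer, SIZE, dev_histo );

    unsigned int histo[BINS];
    cudaMemcpy( histo, dev_histo, BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost );
    printf( "histo[0] = %u\n", histo[0] );

    cudaFree( dev_buffer );
    cudaFree( dev_histo );
    free( buffer );
    return 0;
}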
Now it's time for a quiz: if you run your CUDA code on a GTX 280 GPU (compute capability 1.3) with 30 multiprocessors, how many elements per thread does the above code operate on? You can answer in the comment box.
Hint: blocks = 2*30 = 60 and threads = 256 per block, so total threads = blocks * threads.
I hope you liked this article and have learned how to compute a histogram on the GPU using CUDA.
Got questions?
Feel free to ask me any question by commenting below; I'd be happy to walk you through it step by step!
Where are the shared memory bank conflicts in this final optimized5 version? Each read and write seems to be done as a multiple of 16 on unsigned int (4-byte) data, so they should be OK, shouldn't they?
Experimentally, extending
__shared__ unsigned int temp[1024];
to something like
__shared__ unsigned int temp[1249];
gives it a significant speedup. But I don't quite understand it at an intuitive level.
@Anonymous,
There are bank conflicts. Consider this scenario:
Buffer values: 1 5 78 65 34 34 23 54 ... 98
Threads:       0 1  2  3  4  5  6  7 ... 15
Each thread in a half warp accesses an effectively random location, which may or may not fall in the bank you would want. For example, the buffer above is a snapshot of 16 consecutive elements, one per thread of a half warp: thread 0 hits bank 1 instead of bank 0, thread 1 hits bank 5, and so on.
Bank conflicts are bound to happen here; we cannot get rid of them. It depends entirely on the data in the buffer: if the data is random enough that threads land on the same banks, we can get up to a 16-way bank conflict, which makes that access up to 16 times slower.
The only way to avoid the bank conflicts is for the data to follow some conflict-free order, or for the data to be sorted in some order.
You can check the CUDA SDK samples ("http://docs.nvidia.com/cuda/cuda-samples/index.html#imaging" -> CUDA Histogram): the 64-bin implementation is bank conflict free, while the 256-bin example does cause bank conflicts.
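To make the bank mapping concrete, here is a small sketch of my own (assuming compute capability 1.x, where shared memory has 16 banks of 4-byte words, so element i of an unsigned int array falls in bank i % 16); it prints the bank hit by each of the sample values above:
// Sketch: bank hit by temp[buffer[i]] for the sample data in this comment.
// Assumes CC 1.x: 16 banks, one 4-byte word per bank, bank = index % 16.
#include <stdio.h>

int main( void )
{
    int buffer[8] = { 1, 5, 78, 65, 34, 34, 23, 54 };   // sample values from above
    for (int tid = 0; tid < 8; ++tid)
        printf( "thread %d -> temp[%d] -> bank %d\n",
                tid, buffer[tid], buffer[tid] % 16 );
    return 0;
}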
@Anonymous ("Experimentally, extending ...")
Yes, you are correct; I did that experiment in the online competition of that course.
With __shared__ unsigned int temp[1024]; the time was 1.2xxxx ms.
Instead of allocating __shared__ unsigned int temp[1024]; I allocated
__shared__ unsigned int temp[1024 + 255]; (since I had launched 256 threads)
and got 0.81xxxx ms.
The reason is bank conflict reduction: after padding the shared memory, the bank conflicts were reduced considerably.
Thanks for the replies Nitin, that clarifies things.
I see now the potential bank conflicts in the access "temp[buffer[i]]", where the address being read depends on the data in buffer[i]... I guess I was just considering all the other memory accesses of the form "variable[threadIdx.x + offset]" and overlooked the main one inside the while loop.
You are most welcome... If you are still confused somewhere, you can go through this article:
http://cuda-programming.blogspot.com/2013/02/bank-conflicts-in-shared-memory-in-cuda.html
What happens when your input values are greater than your shared memory array? For instance, in the line atomicAdd( &temp[buffer[i]], 1 ), what if buffer[i] = 10,233? If temp is only of size 1024, that will produce an out-of-bounds access.
No, that is not possible: the random data always stays within the boundary of the bins, either 0-255 or 0-1023; the BigRandomData function always generates values within that boundary.