Optimization in histogram CUDA code: when the number of bins is not equal to the max threads in a block


Continuing from Part 3

       Part 4
Optimization in histogram CUDA code


When the number of bins is not equal to the max threads in a block


As I mentioned in the last section of Part 3, when the number of bins is not equal to the maximum number of threads in a block, the problem becomes more complicated to solve. In this part of the article we'll concentrate on how to solve this problem efficiently.
In order to solve it, it is necessary to understand how we wrote the kernel for the case where the number of bins is equal to the number of threads in a block. So let's examine it.
Our previous kernel looked like this:


__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                               unsigned int *histo )
{
Line 1:     __shared__ unsigned int temp[256];
Line 2:     temp[threadIdx.x] = 0;
Line 3:     __syncthreads();

Line 4:     int i = threadIdx.x + blockIdx.x * blockDim.x;
Line 5:     int offset = blockDim.x * gridDim.x;
Line 6:     while (i < size)
            {
Line 7:              atomicAdd( &temp[buffer[i]], 1);
Line 8:              i += offset;
           }
Line 9:     __syncthreads();


Line 10:    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}


And the kernel invocation looked like this:

histo_kernel_optimized5 <<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );
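
For context, here is a minimal host-side sketch of how such a launch can be set up (the full program is in Part 3; host_buffer is an assumed name, and deriving blocks from multiProcessorCount simply mirrors the hint at the end of this article):

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop, 0 );
int blocks = prop.multiProcessorCount;              // e.g. 30 on a GTX 280

unsigned char *dev_buffer;
unsigned int  *dev_histo;
cudaMalloc( (void**)&dev_buffer, SIZE );
cudaMemcpy( dev_buffer, host_buffer, SIZE, cudaMemcpyHostToDevice );
cudaMalloc( (void**)&dev_histo, 256 * sizeof(unsigned int) );
cudaMemset( dev_histo, 0, 256 * sizeof(unsigned int) );

histo_kernel_optimized5<<< blocks * 2, 256 >>>( dev_buffer, SIZE, dev_histo );

unsigned int histo[256];
cudaMemcpy( histo, dev_histo, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost );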


So, in the kernel:
At Line 1 we allocate an array in shared memory, referred to as temp; its size is equal to the number of bins (256).
At Line 2 we fill the temp array with zeros; since there are 256 threads, we effectively set all 256 elements in shared memory to 0.
At Line 3 all threads wait until the Line 2 task is complete.
Lines 4 and 5 are trivial: we calculate the absolute thread id and the offset (stride) based on 256 threads per block (which is equal to the number of bins).
At Line 6 we loop over the elements of the buffer until the absolute id reaches the value of size.
At Line 7 we read the value stored in the buffer and increment the corresponding count in the shared-memory array temp.
At Line 8 we increment the absolute id by the offset.
At Line 9 each thread waits for all the other threads in its block.
At Line 10 we finally add the per-block counts into the final result.

Now, in order to change the above code for BinCount = 1024 (say), we need to take care that every bin is still handled exactly once. Since we launch 256 threads per block at the kernel invocation and our number of bins is 1024, the shared-memory size should be equal to the number of bins, i.e. 1024. This implies that Line 1 changes as below.


Line 1:     __shared__ unsigned int temp[1024];


What about Line 2? If there are 256 threads, then only 256 elements would be affected, but our bin count is 1024 and all of them must be initialized to 0. So how do we achieve this?
Our memory access pattern looked like this:
Thread_Id            update at location
0                                              0
1                                              1
2                                              2
.
.
255                                         255

All the elements after index 255, i.e. from 256 to 1023, were not affected at all. One method is to let only the last thread update locations 256 to 1023, but by doing this we would force the last thread to do much more work than the others. So we can take advantage of the other threads instead. One way to do this is to stride by the block size: thread 0 handles locations 0, 256, 512 and 768, thread 1 handles 1, 257, 513 and 769, and so on, with the last thread filling 255, 511, 767 and 1023. The pattern then looks like this:
Thread_Id                     update at location
0                                              0,256,512,768
1                                              1,257,513,769
2                                              2,258,514,770
.
.
255                                         255,511,767,1023


So, Line 2 will change and look like this.

temp[threadIdx.x + 0] = 0;
temp[threadIdx.x + 256] = 0;
temp[threadIdx.x + 512] = 0;
temp[threadIdx.x + 768] = 0;
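
If you prefer not to unroll this by hand, the same initialization can also be written as a short strided loop that works for any bin count (this loop form is my own illustration, not the code used above):

for (int bin = threadIdx.x; bin < 1024; bin += blockDim.x)
    temp[bin] = 0;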


Now, what about Lines 3 to 9? Does anything change there? No: the loop works fine as it is and will update the correct bin. For example, if at absolute id = 10 the buffer value is 678, then the corresponding bin temp[678] is incremented. (Note that for more than 256 bins the buffer elements must be of a type wider than unsigned char, e.g. unsigned short, to be able to hold such values.)
Line 10, however, has to change in the same way Line 2 did:

atomicAdd( &(histo[threadIdx.x + 0]), temp[threadIdx.x + 0] );
atomicAdd( &(histo[threadIdx.x + 256]), temp[threadIdx.x + 256] );
atomicAdd( &(histo[threadIdx.x + 512]), temp[threadIdx.x + 512] );
atomicAdd( &(histo[threadIdx.x + 768]), temp[threadIdx.x + 768] );


Hence our final kernel looks like this:


__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                               unsigned int *histo )
{   
    __shared__ unsigned int temp[1024];
  
     temp[threadIdx.x + 0] = 0;
     temp[threadIdx.x + 256] = 0;
     temp[threadIdx.x + 512] = 0;
     temp[threadIdx.x + 768] = 0;
     __syncthreads();

     int i = threadIdx.x + blockIdx.x * blockDim.x;
     int offset = blockDim.x * gridDim.x;
     while (i < size)
     {
              atomicAdd( &temp[buffer[i]], 1);
              i += offset;
     }
     __syncthreads();


  
    atomicAdd( &(histo[threadIdx.x + 0]), temp[threadIdx.x + 0] );
    atomicAdd( &(histo[threadIdx.x + 256]), temp[threadIdx.x + 256] );
    atomicAdd( &(histo[threadIdx.x + 512]), temp[threadIdx.x + 512] );
    atomicAdd( &(histo[threadIdx.x + 768]), temp[threadIdx.x + 768] );

}


The updated lines are the shared-memory declaration, the four initialization lines and the four final atomicAdd lines, and we are done; our kernel invocation does not change.
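
For reference, the same kernel can also be written so that it works for any bin count, using strided loops instead of unrolling by hand. The sketch below is my own generalization (BIN_COUNT and the wider buffer type are assumptions, not part of the original code):

#define BIN_COUNT 1024   // assumed: can be any bin count, not only a multiple of the block size

__global__ void histo_kernel_general( const unsigned int *buffer, long size,
                               unsigned int *histo )
{
    __shared__ unsigned int temp[BIN_COUNT];

    // each thread clears several bins, stride = number of threads in the block
    for (int bin = threadIdx.x; bin < BIN_COUNT; bin += blockDim.x)
        temp[bin] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        atomicAdd( &temp[buffer[i]], 1 );
        i += offset;
    }
    __syncthreads();

    // each thread accumulates the same bins it cleared
    for (int bin = threadIdx.x; bin < BIN_COUNT; bin += blockDim.x)
        atomicAdd( &(histo[bin]), temp[bin] );
}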
Now it is time for a quiz: if you run this CUDA code on a GTX 280 GPU (CC = 1.3, with 30 multiprocessors), how many elements per thread does the above code operate on? You can answer in the comment box.

Hint: blocks = 2*30 = 60 and threads = 256 in a block, so total threads = blocks*threads.
I hope you liked this article and have learned histogram computation on the GPU using CUDA.
Got Questions?
Feel free to ask me any question by commenting below because I'd be happy to walk you through step by step! 
Want to Contact us? Click here

Comments:

  1. Where are the shared memory bank conflicts in this final optimized5 version?

    Each read and write seems to be done as a multiple of 16 on unsigned int (4 byte) data, so they should be ok, shouldn't they?

    Replies
    1. Experimentally, extending

      __shared__ unsigned int temp[1024];

      to be something like

      __shared__ unsigned int temp[1249];

      gives it a significant speedup.

      But I don't quite understand it at an intuitive level.

    2. @Anonymous,
      There are bank conflicts. Consider this scenario:

      Buffer values: 1 5 78 65 34 34 23 54........98
      threads: 0 1 2 3 4 5 6 7 8 ........15
      Each thread in a half-warp accesses some data-dependent location, which may or may not hit the required bank. For example, the buffer above is a snapshot of the total data, 16 consecutive elements for one half-warp: thread 0 hits bank 1 instead of bank 0, thread 1 hits bank 5, and so on....

    3. Bank conflicts are bound to happen here; we can't get rid of them. It depends entirely on the data in the buffer: if the data is so random that the threads can't each hit their corresponding bank, we get a 16-way bank conflict, which makes the above code up to 16 times slower.
      The only way of avoiding bank conflicts is for the data to follow some conflict-free order, or for the data to be sorted in some order.
      You can check the CUDA SDK examples ("http://docs.nvidia.com/cuda/cuda-samples/index.html#imaging")->CUDA Histogram:
      the 64-bin implementation is bank-conflict free, while the 256-bin example can cause bank conflicts.
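
      To make that idea concrete, here is a rough sketch (my own illustration, not the SDK code) of a per-thread sub-histogram layout for 64 bins: each thread owns its own column of counters, so the counting loop needs no atomics and the threads of a warp never collide on a shared-memory bank. It assumes 64 threads per block and a GPU with at least 16 KB of shared memory available to the block.

      #define BIN64 64

      __global__ void histo64_private( const unsigned char *buffer, long size,
                                       unsigned int *histo )
      {
          // sub[bin][thread]: one private column of counters per thread
          __shared__ unsigned int sub[BIN64][BIN64];

          for (int b = 0; b < BIN64; ++b)
              sub[b][threadIdx.x] = 0;                     // clear own column
          __syncthreads();

          for (long i = threadIdx.x + (long)blockIdx.x * blockDim.x; i < size;
               i += (long)blockDim.x * gridDim.x)
              ++sub[buffer[i] & (BIN64 - 1)][threadIdx.x]; // mask only as a safety net
          __syncthreads();

          // merge the 64 private columns; one thread per bin, staggered start
          unsigned int sum = 0;
          for (int k = 0; k < BIN64; ++k)
              sum += sub[threadIdx.x][(threadIdx.x + k) % BIN64];
          atomicAdd( &histo[threadIdx.x], sum );
      }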

  2. @Anonymous "Experimentally, extending"
    Yes, you are correct; I did this experiment in that online competition course
    and got a great speedup. With __shared__ unsigned int temp[1024]; the time was 1.2xxxx ms.
    Instead of allocating __shared__ unsigned int temp[1024]; I allocated
    __shared__ unsigned int temp[1024 + 255]; (since I have launched 256 threads)
    and got .81xxxx ms.
    The reason is bank-conflict reduction: after doing some padding in shared memory, the bank conflicts were reduced very much.

    Replies
    1. Thanks for the replies Nitin, that clarifies things.

      I see now the potential bank conflicts in access of "temp[buffer[i]]" where the address being read depends on the data in the buffer[i]... I guess I was just considering all the other memory access of the form "variable[threadIdx.x + offset]" and overlooked the main one inside the while loop.

  3. You are most welcome... if you are still confused somewhere, you can go through this article:
    http://cuda-programming.blogspot.com/2013/02/bank-conflicts-in-shared-memory-in-cuda.html

  4. what happens when your input values are greater than your shared memory array? For instance, in the line: atomicAdd( &temp[buffer[i]], 1), what if buffer[i] = 10,233? if temp is only size 1024, that will produce an out of memory exception.

    Replies
    1. No, it is not possible: the random data always stays within the bin boundaries, either 0-255 or 0-1023;
      the BigRandomData function always generates values inside that boundary.

