# Computing Histogram on CUDA | CUDA code for Histogram computation | Computing Histogram on GPU using CUDA

Recently, during solving online competition problem on image processing I found that certain sub-problem need to solve which can be solve using histogram computation, So I started learning histogram computation and how efficiently I can code in CUDA.
But, although it is very easy problem to solve in CPU with single threaded system and hard to do efficient parallelism, you must be wonder why? O.K., I let you know during describing problem and its solution in CUDA as efficient as possible.
I've divided this post in Four consecutive parts.  Part 1
So let’s start Part1, In this part we’ll learn basic implementation of Histogram computation.
Oftentimes, algorithms require the computation of a histogram of some set of data. If you haven’t had any experience with histograms in the past, that’s not a big deal. Essentially, given a data set that consists of some set of elements, a histogram represents a count of the frequency of each element. For example, if we created a histogram of the letters in the phrase Programming with CUDA C, we would end up with the result shown in Fig 1.

Although simple to describe and understand, computing histograms of data arises surprisingly often in computer science. It’s used in algorithms for image processing, data compression, computer vision, machine learning, audio encoding, and many others. We will use histogram computation as the algorithm for the following code examples.

Computing Histogram on CPU

Because the computation of a histogram may not be familiar to all readers, we’ll start with an example of how to compute a histogram on the CPU. This example will also serve to illustrate how computing a histogram is relatively simple in a single-threaded CPU application. The application will be given some large stream of data. In an actual application, the data might signify anything from pixel colors to audio samples, but in our sample application, it will be a stream of randomly generated bytes. We can create this random stream of bytes using a utility function we have provided called big_random_block(). In our application, we create 100MB of random data.

 #define SIZE (100*1024*1024) // 100 MB int main( void ) { unsigned char *buffer = (unsigned char*)big_random_block( SIZE );

Since each random 8-bit byte can be any of 256 different values (from 0x00 to 0xFF), our histogram needs to contain 256 bins in order to keep track of the number of times each value has been seen in the data. We create a 256-bin array and initialize all the bin counts to zero.

 unsigned int histo; for (int i=0; i<256; i++)       histo[i] = 0;

Once our histogram has been created and all the bins are initialized to zero, we need to tabulate the frequency with which each value appears in the data contained in buffer[]. The idea here is that whenever we see some value z in the array buffer[], we want to increment the value in bin z of our histogram. This way, we’re counting the number of times we have seen an occurrence of the value z.

If buffer[i] is the current value we are looking at, we want to increment the count we have in the bin numbered buffer[i]. Since bin buffer[i] is located at histo[buffer[i]], we can increment the appropriate counter in a single line of code.

 histo[buffer[i]]++;

We do this for each element in buffer[] with a simple for() loop:

 for (int i=0; i

At this point, we’ve completed our histogram of the input data. In a full application, this histogram might be the input to the next step of computation. In our simple example, however, this is all we care to compute, so we end the application by verifying that all the bins of our histogram sum to the expected value.

 long histoCount = 0; for (int i=0; i<256; i++) {            histoCount += histo[i]; } printf( "Histogram Sum: %ld\n", histoCount );

If you’ve followed closely, you will realize that this sum will always be the same, regardless of the random input array. Each bin counts the number of times we have seen the corresponding data element, so the sum of all of these bins should be the total number of data elements we’ve examined. In our case, this will be the value SIZE.
And needless to say (but we will anyway), we clean up after ourselves and return.

 free( buffer ); return 0; }

I have tested this code on my benchmark machine, a Core 2 Duo, the histogram of this 100MB array of data can be constructed in 0 .416 seconds. This will provide a baseline performance for the GPU version we intend to write.

Computing Histogram on GPU

We would like to adapt the histogram computation example to run on the GPU. If our input array is large enough, it might save a considerable amount of time to have different threads examining different parts of the buffer. Having different threads read different parts of the input should be easy enough. After all, it’s very similar to things we have seen in many previous posts. The problem with computing a histogram from the input data arises from the fact that multiple threads may want to increment the same bin of the output histogram at the same time. In this situation, we will need to use atomic increments to avoid this situation.
I’ll assume you know about atomicAdd () function, so I’ll not describe this in this post.

For the GPU implementation, we’ll go step by step, try first porting CPU code to CUDA then try to optimize it step by step.

A very naïve implementation will look like this;

step 1

 __global__ void naiveHistoKernel (const char *const buffer , int *histo) {       int i = threadIdx.x ;       atomicAdd (&histo[buffer[i]] , 1 ); }

With kernel calling as;

 naiveHistKernel <<<1 , 1024>>> (buffer ,  histo) ;

But the bad bad thing regarding this naïve implementation is we can only have 1024 element in buffer at most (as per as the recent GPU) or 512 element (after replacing <<<, 512>>>) on old GPU) .
But if we have more elements then we need to divide them in grid and blocks. Like in below example we does;
Optimization step 1

 __global__ void naiveHistoKernel_2 (const char *const buffer , int *histo) {       int id = blockidx.x * blockDim.x + threadIdx.x ;       atomicAdd (&histo[buffer[id]] , 1 ); }

In this implementation we divide total element in single dimension grid and blocks.

Kernel calling will look like this;

 naiveHistKernel_2 <<>> (buffer ,  histo) ;
Or
 naiveHistKernel_2 <<>> (buffer ,  histo) ;

But if N*1024 > 65535 then it will make problem for us. Since in current GPU you can only lunch 65535 blocks in a direction.
Optimization step 2

One possible solution for this we can lunch grid in two dimension like;

 __global__ void naiveHistoKernel_3 (const char *const buffer , int *histo, int Nby2) {       int id_x = blockidx.x * blockDim.x + threadIdx.x ;       int id_y = blockidx.y * blockDim.y + threadIdx.y ;       int absoulteId = id_y*Nby2 + id_x ;               atomicAdd (&histo[buffer[absoulteId]] , 1 ); } ……………….. // I’m assuming division will lead to even number int Nby2_ = N/2 ;  dim3 blocks (16,16); dim3 grid(Nby2_ / 16 , Nby2_/16) ;  naiveHistKernel_3 <<>> (buffer ,  histo, Nby2_ ) ;

It will perform well but not as much we want, even still having limitation;
Further we’ll discuss a good algorithm for this; happy to read further; Part2

1. Hi in the text you refer "For example, with 100MB of data, we have 104,857,600 bytes of data. We could launch a single block and have each thread examine 409,600 data elements. Likewise, we could launch 409,600 blocks and have each thread examine a single data element." Could you please explain a litle more?

2. under what header file does big_random_block comes

3. So many lunches!, was this written when you were really really hungry in the afternoon? Nice tutorial though!

4. nice article for beginners.thank you.
javacodegeeks
welookups

5. This comment has been removed by the author.

Help us to improve our quality and become contributor to our blog