Computing Histogram on CUDA  CUDA code for Histogram computation  Computing Histogram on GPU using CUDA
In this article we’ll
learn about histogram computation on GPU using CUDA architecture.
Recently, during solving
online competition problem on image processing I found that certain subproblem
need to solve which can be solve using histogram computation, So I started
learning histogram computation and how efficiently I can code in CUDA.
But, although it is very
easy problem to solve in CPU with single threaded system and hard to do
efficient parallelism, you must be wonder why? O.K., I let you know during
describing problem and its solution in CUDA as efficient as possible.
I've divided this post in Four consecutive
parts.
Part 1
So let’s start Part1,
In this part we’ll learn basic implementation of Histogram computation.
Oftentimes, algorithms
require the computation of a histogram of some set of data. If
you haven’t had any experience with histograms in the past, that’s not a big
deal. Essentially, given a data set that consists of some set of elements, a
histogram represents a count of the frequency of each element. For example, if
we created a histogram of the letters in the phrase Programming with
CUDA C, we would end up with the result shown in Fig 1.
Although simple to
describe and understand, computing histograms of data arises surprisingly often
in computer science. It’s used in algorithms for image processing, data
compression, computer vision, machine learning, audio encoding, and many
others. We will use histogram computation as the algorithm for the following
code examples.
Computing Histogram on
CPU
Because the computation
of a histogram may not be familiar to all readers, we’ll start with an example
of how to compute a histogram on the CPU. This example will also serve to
illustrate how computing a histogram is relatively simple in a singlethreaded
CPU application. The application will be given some large stream of data. In an
actual application, the data might signify anything from pixel colors to audio
samples, but in our sample application, it will be a stream of randomly
generated bytes. We can create this random stream of bytes using a utility
function we have provided called big_random_block(). In our
application, we create 100MB of random data.
#define SIZE (100*1024*1024) // 100 MB
int main( void ) {
unsigned char *buffer = (unsigned char*)big_random_block( SIZE );

Since each random 8bit
byte can be any of 256 different values (from 0x00 to 0xFF), our histogram
needs to contain 256 bins in order to keep track of the number
of times each value has been seen in the data. We create a 256bin array and
initialize all the bin counts to zero.
unsigned int histo[256];
for (int i=0; i<256; i++)
histo[i] = 0;

Once our histogram has
been created and all the bins are initialized to zero, we need to tabulate the
frequency with which each value appears in the data contained in buffer[]. The
idea here is that whenever we see some value z in the array buffer[], we
want to increment the value in bin z of our histogram. This way, we’re counting
the number of times we have seen an occurrence of the value z.
If buffer[i] is
the current value we are looking at, we want to increment the count we have in
the bin numbered buffer[i]. Since bin buffer[i] is
located at histo[buffer[i]], we can increment the appropriate
counter in a single line of code.
histo[buffer[i]]++;

We do this for each
element in buffer[] with a simple for() loop:
for (int i=0; i<SIZE; i++)
histo[buffer[i]]++;

At this point, we’ve
completed our histogram of the input data. In a full application, this
histogram might be the input to the next step of computation. In our simple
example, however, this is all we care to compute, so we end the application by
verifying that all the bins of our histogram sum to the expected value.
long histoCount = 0;
for (int i=0; i<256; i++) {
histoCount += histo[i];
}
printf( "Histogram Sum: %ld\n", histoCount );

If you’ve followed
closely, you will realize that this sum will always be the same, regardless of
the random input array. Each bin counts the number of times we have seen the
corresponding data element, so the sum of all of these bins should be the total
number of data elements we’ve examined. In our case, this will be the value
SIZE.
And needless to say (but
we will anyway), we clean up after ourselves and return.
free( buffer );
return 0;
}

I have tested this code
on my benchmark machine, a Core 2 Duo, the histogram of this 100MB array of
data can be constructed in 0 .416 seconds. This will provide a baseline
performance for the GPU version we intend to write.
Computing Histogram on
GPU
We would like to adapt
the histogram computation example to run on the GPU. If our input array is
large enough, it might save a considerable amount of time to have different
threads examining different parts of the buffer. Having different threads read
different parts of the input should be easy enough. After all, it’s very
similar to things we have seen in many previous posts. The problem with
computing a histogram from the input data arises from the fact that multiple
threads may want to increment the same bin of the output histogram at the same
time. In this situation, we will need to use atomic increments to avoid this
situation.
I’ll assume you know
about atomicAdd () function, so I’ll not describe this in
this post.
For the GPU
implementation, we’ll go step by step, try first porting CPU code to CUDA then
try to optimize it step by step.
A very naïve
implementation will look like this;
step 1
__global__ void
naiveHistoKernel (const char *const buffer , int *histo)
{
int i = threadIdx.x ;
atomicAdd (&histo[buffer[i]] , 1 );
}

With kernel calling as;
naiveHistKernel <<<1 , 1024>>> (buffer , histo) ;

But the bad bad thing
regarding this naïve implementation is we can only have 1024 element in buffer
at most (as per as the recent GPU) or 512 element (after replacing <<<1 , 512>>>) on old GPU) .
But if we have more
elements then we need to divide them in grid and blocks. Like in below example
we does;
Optimization step 1
__global__ void
naiveHistoKernel_2 (const char *const buffer , int *histo)
{
int id = blockidx.x
* blockDim.x + threadIdx.x ;
atomicAdd (&histo[buffer[id]] , 1 );
}

In this implementation
we divide total element in single dimension grid and blocks.
Kernel calling will look
like this;
naiveHistKernel_2 <<<N/512 , 512>>> (buffer , histo) ;

Or
naiveHistKernel_2 <<<N/1024 , 1024>>> (buffer , histo) ;

But if N*1024 > 65535
then it will make problem for us. Since in current GPU you can only lunch 65535
blocks in a direction.
Optimization step 2
One possible solution
for this we can lunch grid in two dimension like;
__global__ void
naiveHistoKernel_3 (const char *const buffer , int *histo, int Nby2)
{
int id_x = blockidx.x
* blockDim.x + threadIdx.x ;
int id_y = blockidx.y
* blockDim.y + threadIdx.y ;
int absoulteId = id_y*Nby2
+ id_x ;
atomicAdd (&histo[buffer[absoulteId]] , 1 );
}
………………..
// I’m assuming division will lead to even number
int Nby2_ = N/2 ;
dim3 blocks (16,16);
dim3 grid(Nby2_ / 16 , Nby2_/16) ;
naiveHistKernel_3 <<<grid , blocks>>> (buffer , histo,
Nby2_ ) ;

It will perform well but
not as much we want, even still having limitation;
Hi in the text you refer "For example, with 100MB of data, we have 104,857,600 bytes of data. We could launch a single block and have each thread examine 409,600 data elements. Likewise, we could launch 409,600 blocks and have each thread examine a single data element." Could you please explain a litle more?
ReplyDeleteunder what header file does big_random_block comes
ReplyDeleteSo many lunches!, was this written when you were really really hungry in the afternoon? Nice tutorial though!
ReplyDeletenice article for beginners.thank you.
ReplyDeletejavacodegeeks
welookups
This comment has been removed by the author.
ReplyDelete