Vector Dot Product in CUDA C; CUDA C Program for Vector Dot Product
We have divided this article into two parts. Part 1 describes how we arrive at the code, and Part 2 lists the code along with more explanation of threads and blocks.
Jump to Part 1
Let's continue our discussion.
Part 2
In this part we cover thread and block heuristics. For the basics of threads and blocks, please go through this link.
After we have completed a step, we
have the same restriction we did after computing all the pairwise products.
Before we can read the values we just stored in cache[], we need to ensure that every thread that needs to write to
cache[] has already done so. The __syncthreads() after the assignment
ensures this
condition is met.
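For reference, here is the reduction loop being described; it is the same loop that appears in the full listing at the end of this post. Note that the __syncthreads() sits outside the if(), so every thread in the block reaches the barrier:
int i = blockDim.x / 2;
while ( i != 0 )
{
    // each step folds the upper half of cache[] into the lower half
    if ( cacheIndex < i )
        cache[cacheIndex] += cache[cacheIndex + i];
    // wait for every thread to finish its addition before halving i
    __syncthreads();
    i /= 2;
}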
After termination of this while()
loop, each block has but a single number remaining. This number is sitting in
the first entry of cache[] and is the sum of every pairwise product the threads
in that block computed. We then store this single value to global memory and
end our kernel:
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}
Why do we do this global store only
for the thread with cacheIndex == 0?
Well, since there is only one number that needs writing to global memory, only
a single thread needs to perform this operation. Conceivably, every thread could perform this
write and the program would still work, but doing so would create an
unnecessarily large amount of memory traffic to write a single value. For
simplicity, we chose the thread with index 0, though you could conceivably have
chosen any cacheIndex to write cache[0] to global memory. Finally,
since each block will write exactly one value to the global array c[], we can simply index it by blockIdx.
We are left with an array c[], each
entry of which contains the sum produced by one of the parallel blocks. The
last step of the dot product is to sum the entries of c[]. Even though the dot
product is not fully computed, we exit the kernel and return control to the
host at this point. But why do we return to the host before the computation is
complete?
Previously, we referred to an
operation like a dot product as a reduction. Roughly speaking, this is
because we produce fewer output data elements than we input. In the case of a
dot product, we always produce exactly one output, regardless of the size of
our input. It turns out that a massively parallel machine like a GPU tends to
waste its resources when performing the last steps of a reduction, since
the size of the data set is so small
at that point; it is hard to utilize 480 arithmetic units to add 32 numbers!
For this reason, we return control to
the host and let the CPU finish the final step of the addition, summing the
array c[]. In a larger application, the GPU would now be free to start another
dot product or work on another large computation.
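In the listing below, that final step is just a short host-side loop over the per-block partial sums (the prose calls this array c[]; the full listing names it V3_H):
// finish the reduction on the CPU: c[] holds one partial sum per block
float sum = 0;
for ( int i = 0 ; i < blocksPerGrid ; i++ )
    sum += c[i];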
However, in this example, we are done
with the GPU. In explaining this
example, we broke with tradition and jumped right into the actual kernel
computation. We hope you will have no trouble understanding the
body of main() up to the kernel call,
since it is overwhelmingly similar to what we have shown before.
Final Words: How We Calculate the Number of Blocks for a Limited Number of Threads per Block
But what if we are given a very short
list and 32 blocks of 256 threads apiece is too many? If we have N data
elements, we need only N threads in order to compute our dot product. So in
this case, we need the smallest multiple of threadsPerBlock that is greater
than or equal to N. We have seen this once before when we were adding vectors.
In this case, we get the smallest multiple of threadsPerBlock that is greater
than or equal to N by computing (N+(threadsPerBlock-1)) / threadsPerBlock. As
you may be able to tell, this is actually a fairly common trick in integer
math, so it is worth digesting this even if you spend most of your time working
outside the CUDA C realm.
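As a quick illustration of the trick, here is a small, stand-alone example using a hypothetical N of 1000 elements:
#include <cstdio>

int main ( )
{
    const int threadsPerBlock = 256;
    const int N = 1000;   // hypothetical element count, not a multiple of 256

    // integer "ceiling division": rounds N/threadsPerBlock up to the next whole block
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    printf( "%d elements -> %d blocks of %d threads\n", N, blocks, threadsPerBlock );
    // prints: 1000 elements -> 4 blocks of 256 threads
    return 0;
}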
Therefore, the number of blocks we
launch should be either 32 or (N+(threadsPerBlock-1)) / threadsPerBlock,
whichever value is smaller.
const int blocksPerGrid =
    imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );
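The imin() used here is simply an integer minimum, picking the smaller of the two candidate block counts. If your code does not already have one, a one-line macro (as used in the full listing below) is enough:
// integer minimum of two values
#define imin(a,b) ((a)<(b)?(a):(b))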
If you
found all our explanatory interruptions bothersome, here is the entire source
listing, sans commentary:
#include <iostream>
using namespace std;

#define imin(x,y) ((x)<(y)?(x):(y))
#define N (33*1024)
#define ThreadPerBlock 256

// smallest multiple of ThreadPerBlock that is greater than or equal to N, capped at 32 blocks
#define blockPerGrid imin( 32 , (N+ThreadPerBlock-1) / ThreadPerBlock )

__global__ void Vector_Dot_Product ( const float *V1 , const float *V2 , float *V3 )
{
    __shared__ float cache[ThreadPerBlock];

    float temp = 0;
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const unsigned int cacheIndex = threadIdx.x;

    // each thread accumulates the pairwise products it is responsible for
    while ( tid < N )
    {
        temp += V1[tid] * V2[tid];
        tid  += blockDim.x * gridDim.x;
    }

    cache[cacheIndex] = temp;
    __syncthreads();

    // shared-memory reduction: halve the number of active threads each step
    int i = blockDim.x / 2;
    while ( i != 0 )
    {
        if ( cacheIndex < i )
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    // one thread per block writes that block's partial sum to global memory
    if ( cacheIndex == 0 )
        V3[blockIdx.x] = cache[0];
}

int main ( )
{
    float *V1_H , *V2_H , *V3_H;
    float *V1_D , *V2_D , *V3_D;

    V1_H = new float [N];
    V2_H = new float [N];
    V3_H = new float [blockPerGrid];

    cudaMalloc ( (void **)&V1_D , N*sizeof(float) );
    cudaMalloc ( (void **)&V2_D , N*sizeof(float) );
    cudaMalloc ( (void **)&V3_D , blockPerGrid*sizeof(float) );

    for ( int i = 0 ; i < N ; i++ )
    {
        V1_H[i] = i;
        V2_H[i] = i*2;
    }

    cudaMemcpy ( V1_D , V1_H , N*sizeof(float) , cudaMemcpyHostToDevice );
    cudaMemcpy ( V2_D , V2_H , N*sizeof(float) , cudaMemcpyHostToDevice );

    Vector_Dot_Product <<< blockPerGrid , ThreadPerBlock >>> ( V1_D , V2_D , V3_D );

    // copy back one partial sum per block (blockPerGrid floats, not N)
    cudaMemcpy ( V3_H , V3_D , blockPerGrid*sizeof(float) , cudaMemcpyDeviceToHost );

    cout << "\n Vector Dot Product is : ";

    // finish the reduction on the CPU by summing the per-block partial sums
    float sum = 0;
    for ( int i = 0 ; i < blockPerGrid ; i++ )
        sum += V3_H[i];

    cout << sum << endl;

    cudaFree ( V1_D );
    cudaFree ( V2_D );
    cudaFree ( V3_D );

    delete [] V1_H;
    delete [] V2_H;
    delete [] V3_H;

    return 0;
}
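As a quick sanity check (not part of the original listing), you can recompute the dot product on the host and compare it against the GPU result. These lines can be added to main() just before the cudaFree() calls; because the kernel accumulates in single precision, expect the two values to agree only to a few significant digits at this problem size:
// CPU reference for the same data; accumulate in double to limit rounding error
double reference = 0.0;
for ( int i = 0 ; i < N ; i++ )
    reference += (double)V1_H[i] * V2_H[i];
cout << " CPU reference is      : " << reference << endl;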
Got Questions?
Feel free to ask me any question; I'd be happy to walk you through it step by step!