CUDA program for Vector Addition for Long Vector

Posted by Unknown at 01:25 | 6 comments

GPU SUMS OF A LONGER VECTOR

In the previous article, we noted that the hardware limits the number of blocks in a single launch to 65,535. Similarly, the hardware limits the number of threads per block with which we can launch a kernel. Specifically, this number cannot exceed the value specified by the maxThreadsPerBlock field of the device

properties structure we looked at in previous articles. For many of the graphics processors currently available, this limit is 512 threads per block, so how would we use a thread-based approach to add two vectors of size greater than 512? We will have to use a combination of threads and blocks to accomplish this.

As before, this will require two changes: We will have to change the index computation within the kernel, and we will have to change the kernel launch itself. Now that we have multiple blocks and threads, the indexing will start to look similar to the standard method for converting from a two-dimensional index space to a linear space.

int tid = threadIdx.x + blockIdx.x * blockDim.x;

This assignment uses a new built-in variable, blockDim. This variable is a constant for all blocks and stores the number of threads along each dimension of the block. Since we are using a one-dimensional block, we refer only to blockDim.x. If you recall, gridDim stored a similar value, but it stored the number of blocks along each dimension of the entire grid. Moreover, gridDim is two-dimensional, whereas blockDim is actually three-dimensional. That is, the CUDA runtime allows you to launch a two-dimensional grid of blocks where each block is a three-dimensional array of threads. Yes, this is a lot of dimensions, and it is unlikely you will regularly need the five degrees of indexing freedom afforded you, but they are available if so desired.

Indexing the data in a linear array using the previous assignment actually is quite intuitive. If you disagree, it may help to think about your collection of blocks of threads spatially, similar to a two-dimensional array of pixels. We depict this arrangement in below Figure

If the threads represent columns and the blocks represent rows, we can get a unique index by taking the product of the block index with the number of threads in each block and adding the thread index within the block.

int offset = x + y * DIM;

Here, DIM is the block dimension (measured in threads), y is the block index, and x is the thread index within the block. Hence, we arrive at the index:

tid = threadIdx.x + blockIdx.x * blockDim.x.

The other change is to the kernel launch itself. We still need N parallel threads to launch, but we want them to launch across multiple blocks so we do not hit the 512-thread limitation imposed upon us. One solution is to arbitrarily set the block size to some fixed number of threads; for this example, let’s use 128 threads per block. Then we can just launch N/128 blocks to get our total of N threads running.

Be Careful

The wrinkle here is that N/128 is an integer division. This implies that if N were 127, N/128 would be zero, and we will not actually compute anything if we launch zero threads. In fact, we will launch too few threads whenever N is not an exact multiple of 128. This is bad. We actually want this division to round up.

There is a common trick to accomplish this in integer division without calling ceil(). We actually compute (N+127)/128 instead of N/128. Either you can take our word that this will compute the smallest multiple of 128 greater than or equal to N or you can take a moment now to convince yourself of this fact. We have chosen 128 threads per block and therefore use the following kernel

launch:

Vector_Addition <<< ( N + 127 ) / 128, 128 >>> ( dev _ a, dev _ b, dev _ c );

More Careful

Because of our change to the division that ensures we launch enough threads, we will actually now launch too many threads when N is not an exact multiple of 128. But there is a simple remedy to this problem, and our kernel already takes care of it. We have to check whether a thread’s offset is actually between 0 and N before we use it to access our input and output arrays:

if (tid < N)

c[tid] = a[tid] + b[tid];

Thus, when our index overshoots the end of our array, as will always happen when we launch a nonmultiple of 128, we automatically refrain from performing the calculation. More important, we refrain from reading and writing memory off the end of our array.

Seeking for Code? We prefer you to write your own code after making those changes.
But don’t worry; here is the code that you are seeking for J

#include <stdio.h>

#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )

static void HandleError( cudaError_t err, const char *file, int line )

{

if (err != cudaSuccess)

{

printf( "%s in %s at line %d\n", cudaGetErrorString( err ),

file, line );

exit( EXIT_FAILURE );

}

const int N = 2048;

// CUDA Kernel for Vector Addition

__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)

{

//Get the id of thread within a block

unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x ;

while ( tid < N ) // check the boundry condition for the threads

{

dev_c [tid] = dev_a[tid] + dev_b[tid] ;

tid+= blockDim.x * gridDim.x ;

}

int main (void)

{

//Host array

int Host_a[N], Host_b[N], Host_c[N];

//Device array

int *dev_a , *dev_b, *dev_c ;

//Allocate the memory on the GPU

HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );

HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );

HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );

//fill the Host array with random elements on the CPU

for ( int i = 0; i <N ; i++ )

{

Host_a[i] = -i ;

Host_b[i] = i*i ;

}

//Copy Host array to Device array

HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));

HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));

//Make a call to GPU kernel

Vector_Addition <<< (N+127)/128, 128 >>> (dev_a , dev_b , dev_c ) ;

//Copy back to Host array from Device array

HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));

//Display the result

for ( int i = 0; i<N; i++ )

printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;

//Free the Device array memory

cudaFree (dev_a) ;

cudaFree (dev_b) ;

cudaFree (dev_c) ;

system("pause");

return 0 ;

}

Explanation:

Now that we understand the principle behind this implementation, we just need to understand how we determine the initial index value for each parallel thread and how we determine the increment. We want each parallel thread to start on a different data index, so we just need to take our thread and block indexes and linearize them as we saw in the “GPU Sums of a Longer Vector” section. Each thread will start at an index given by the following:

int tid = threadIdx.x + blockIdx.x * blockDim.x;

After each thread finishes its work at the current index, we need to increment each of them by the total number of threads running in the grid. This is simply the number of threads per block multiplied by the number of blocks in the grid, or blockDim.x * gridDim.x. Hence, the increment step is as follows:

tid += blockDim.x * gridDim.x;

Have you find/point out any problem with this implementation ? yes, yes there is a problem with this implementation.

The only remaining piece is to fix the launch itself. If you remember, we took this detour because the launch add<<<(N+127)/128,128>>>( dev_a, dev_b, dev_c ) will fail when (N+127)/128 is greater than 65,535.

How will you solve this problem, in over to ride the code?
Soon we'll provide the solution for that, till than THINK

Got Questions?
Feel free to ask me any question because I'd be happy to walk you through step by step!

Reference and External Links

CUDA C Programming Guide
Programming Massively Parallel Processors By David B. Kirk and Wen-mei W.Hwu

https://developer.nvidia.com/content/easy-introduction-cuda-c-and-c

6 comments:

Anonymous3 January 2014 at 00:44
I did not understand "tid += blockDim.x * gridDim.x;". I checked some other books and none of them are thinking in this manner as you are doing . What is wrong with the following :

.....................
.....................
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if(tid < N )
{
dev_c [tid] = dev_a[tid] + dev_b[tid] ;
}

It will be nice to hear from you over this issue.

Thanks
Sajjad
ReplyDelete
Replies
Anonymous4 June 2014 at 07:06
copied from cuda by example
ReplyDelete
Replies
Anonymous23 July 2017 at 23:28
can you represent a multidimensonal thread block in a pictorial form?
ReplyDelete
Replies
Clarence7 September 2018 at 03:48
This publication is excellent.
ReplyDelete
Replies
Lola23 September 2018 at 10:31
I keep my fingers crossed for you and for creating new texts.
ReplyDelete
Replies

Add comment

Help us to improve our quality and become contributor to our blog

CUDA Programming

Prefer Your Language

Search This Blog

Tags

CUDA program for Vector Addition for Long Vector

6 comments:

Recent Post

About Me

Total Pageviews

Labels

Blog Archive

Labels

Like Us

Cloud

Admin

CUDA Programming

Prefer Your Language

Search This Blog

Tags

Related Posts

Share This

CUDA program for Vector Addition for Long Vector

6 comments:

Recent Post

About Me

Subscribe To

Total Pageviews

Labels

Blog Archive

Labels

Like Us

Cloud

Admin