Pages

Tuesday, 15 January 2013

CUDA program for Vector Addition for Long Vector



 GPU SUMS OF A LONGER VECTOR
In the previous article, we noted that the hardware limits the number of blocks in a single launch to 65,535. Similarly, the hardware limits the number of threads per block with which we can launch a kernel. Specifically, this number cannot exceed the value specified by the maxThreadsPerBlock field of the device
properties structure we looked at in previous articles. For many of the graphics processors currently available, this limit is 512 threads per block, so how would we use a thread-based approach to add two vectors of size greater than 512? We will have to use a combination of threads and blocks to accomplish this.

As before, this will require two changes: We will have to change the index computation within the kernel, and we will have to change the kernel launch itself. Now that we have multiple blocks and threads, the indexing will start to look similar to the standard method for converting from a two-dimensional index space to a linear space.


int tid = threadIdx.x + blockIdx.x * blockDim.x;

This assignment uses a new built-in variable, blockDim. This variable is a constant for all blocks and stores the number of threads along each dimension of the block. Since we are using a one-dimensional block, we refer only to blockDim.x. If you recall, gridDim stored a similar value, but it stored the number of blocks along each dimension of the entire grid. Moreover, gridDim is two-dimensional, whereas blockDim is actually three-dimensional. That is, the CUDA runtime allows you to launch a two-dimensional grid of blocks where each block is a three-dimensional array of threads. Yes, this is a lot of dimensions, and it is unlikely you will regularly need the five degrees of indexing freedom afforded you, but they are available if so desired.
Indexing the data in a linear array using the previous assignment actually is quite intuitive. If you disagree, it may help to think about your collection of blocks of threads spatially, similar to a two-dimensional array of pixels. We depict this arrangement in below Figure



If the threads represent columns and the blocks represent rows, we can get a unique index by taking the product of the block index with the number of threads in each block and adding the thread index within the block.

int offset = x + y * DIM;

Here, DIM is the block dimension (measured in threads), y is the block index, and x is the thread index within the block. Hence, we arrive at the index:


tid = threadIdx.x + blockIdx.x * blockDim.x.

The other change is to the kernel launch itself. We still need N parallel threads to launch, but we want them to launch across multiple blocks so we do not hit the 512-thread limitation imposed upon us. One solution is to arbitrarily set the block size to some fixed number of threads; for this example, let’s use 128 threads per block. Then we can just launch N/128 blocks to get our total of N threads running.

Be Careful
The wrinkle here is that N/128 is an integer division. This implies that if N were 127, N/128 would be zero, and we will not actually compute anything if we launch zero threads. In fact, we will launch too few threads whenever N is not an exact multiple of 128. This is bad. We actually want this division to round up.

There is a common trick to accomplish this in integer division without calling ceil(). We actually compute (N+127)/128 instead of N/128. Either you can take our word that this will compute the smallest multiple of 128 greater than or equal to N or you can take a moment now to convince yourself of this fact. We have chosen 128 threads per block and therefore use the following kernel
launch:


Vector_Addition <<< ( N + 127 ) / 128, 128 >>> ( dev _ a, dev _ b, dev _ c );

More Careful

Because of our change to the division that ensures we launch enough threads, we will actually now launch too many threads when N is not an exact multiple of 128. But there is a simple remedy to this problem, and our kernel already takes care of it. We have to check whether a thread’s offset is actually between 0 and N before we use it to access our input and output arrays:


if (tid < N)
             c[tid] = a[tid] + b[tid];

Thus, when our index overshoots the end of our array, as will always happen when we launch a nonmultiple of 128, we automatically refrain from performing the calculation. More important, we refrain from reading and writing memory off the end of our array.

Seeking for Code? We prefer you to write your own code after making those changes.
But don’t worry; here is the code that you are seeking for J



#include <stdio.h>

#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )

static void HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess)
      {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}



const int N = 2048;

// CUDA Kernel for Vector Addition
__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)
{
      //Get the id of thread within a block
      unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x ;
     

      while ( tid < N ) // check the boundry condition for the threads
      {          
            dev_c [tid] = dev_a[tid] + dev_b[tid] ;

            tid+= blockDim.x * gridDim.x ;
      }
}

int main (void)
{

      //Host array
      int Host_a[N], Host_b[N], Host_c[N];

      //Device array
      int *dev_a , *dev_b, *dev_c ;

      //Allocate the memory on the GPU
      HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );

      //fill the Host array with random elements on the CPU
      for ( int i = 0; i <N ; i++ )
      {
            Host_a[i] = -i ;
            Host_b[i] = i*i ; 
      }

      //Copy Host array to Device array
      HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));
      HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));

      //Make a call to GPU kernel
      Vector_Addition <<< (N+127)/128, 128  >>> (dev_a , dev_b , dev_c ) ;

      //Copy back to Host array from Device array
      HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));

      //Display the result
      for ( int i = 0; i<N; i++ )
                  printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;

      //Free the Device array memory
      cudaFree (dev_a) ;
      cudaFree (dev_b) ;
      cudaFree (dev_c) ;

      system("pause");
      return 0 ;

}

Explanation:

Now that we understand the principle behind this implementation, we just need to understand how we determine the initial index value for each parallel thread and how we determine the increment. We want each parallel thread to start on a different data index, so we just need to take our thread and block indexes and linearize them as we saw in the “GPU Sums of a Longer Vector” section. Each thread will start at an index given by the following:



int tid = threadIdx.x + blockIdx.x * blockDim.x;


After each thread finishes its work at the current index, we need to increment each of them by the total number of threads running in the grid. This is simply the number of threads per block multiplied by the number of blocks in the grid, or blockDim.x * gridDim.x. Hence, the increment step is as follows:


tid += blockDim.x * gridDim.x;


Have you find/point out any problem with this implementation ? yes, yes there is a problem with this implementation.
The only remaining piece is to fix the launch itself. If you remember, we took this detour because the launch add<<<(N+127)/128,128>>>( dev_a, dev_b, dev_c ) will fail when (N+127)/128 is greater than 65,535.


How will you solve this problem, in over to ride the code?
Soon we'll provide the solution for that, till than THINK



Got Questions?
Feel free to ask me any question because I'd be happy to walk you through step by step!  



Reference and External Links

6 comments:

  1. I did not understand "tid += blockDim.x * gridDim.x;". I checked some other books and none of them are thinking in this manner as you are doing . What is wrong with the following :

    .....................
    .....................
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < N )
    {
    dev_c [tid] = dev_a[tid] + dev_b[tid] ;
    }


    It will be nice to hear from you over this issue.

    Thanks
    Sajjad

    ReplyDelete
    Replies
    1. I agree! I have been looking around the blog trying to figure out what the advantages are. I think it has something to do with efficiency and making threads to more work, but OP isn't clear why this makes it more efficient. I would also appreciate an explanation.

      Delete
  2. copied from cuda by example

    ReplyDelete
  3. can you represent a multidimensonal thread block in a pictorial form?

    ReplyDelete
  4. This publication is excellent.

    ReplyDelete
  5. I keep my fingers crossed for you and for creating new texts.

    ReplyDelete

Help us to improve our quality and become contributor to our blog