Pages

Monday, 14 January 2013

Vector Addition in CUDA (CUDA C/C++ program for Vector Addition)

We will contrive a simple example to illustrate threads and how we use them to code with CUDA C. Imagine having two lists of numbers where we want to sum corresponding elements of each list and store the result in a third list. If you have any background in linear algebra, you will recognize this operation as summing two vectors.






















This article gives you detailed description about Vector Addition in CUDA C. This article will also let you know how to lunch kernel in CUDA with variable number of threads with limitation.

If you are not clear with idea of thread and block architecture and how to decide. please go through this link 

CPU vector sums

First we’ll look at one way this addition can be accomplished with traditional C code:

#define N 10
void add ( int *a, int *b, int *c )
{
              int tid = 0; // this is CPU zero, so we start at zero
            
              while  (tid < N)
              {
                      c[tid] = a[tid] + b[tid];
                       tid += 1;  // we have one CPU, so we increment by one
               }
}

int main( void )
{
             int a[N], b[N], c[N];
         
           // fill the arrays 'a' and 'b' on the CPU
           for (int i=0; i<N; i++)
           {
                      a [i] = -i;
                      b[i] = i * i;
            }
         add( a, b, c );

       // display the results
           for (int i=0; i<N; i++)
         {
                printf( "%d + %d = %d\n", a[i], b[i], c[i] );
          }
  return 0;
}


Most of the articles bears almost no explanation, but we will briefly look at the add() function to explain why we overly complicated it. One would typically code this in a slightly simpler
manner, like so:

void add( int *a, int *b, int *c )
 {
               for (i=0; i < N; i++)
               {
                      c[i] = a[i] + b[i];
               }
}


Our slightly more convoluted method was intended to suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores. For example, with a dual-core processor, one could change the increment to 2 and have one core initialize the loop with tid = 0 and another with tid = 1. The first core would add the even-indexed elements, and the second core would add the odd indexed elements. This amounts to executing the following code on each of the two CPU cores:

CPU CORE 1

void add ( int *a, int *b, int *c )
{
              int tid = 0;
            
              while  (tid < N)
              {
                      c[tid] = a[tid] + b[tid];
                       tid += 1;
               }
}




CPU CORE 2

void add ( int *a, int *b, int *c )
{
              int tid = 1;
            
              while  (tid < N)
              {
                      c[tid] = a[tid] + b[tid];
                       tid += 2;
               }
}
Of course, doing this on a CPU would require considerably more code than we have included in this example. You would need to provide a reasonable amount of infrastructure to create the worker threads that execute the function add() as well as make the assumption that each thread would execute in parallel, a scheduling assumption that is unfortunately not always true.





GPU vector sums

We can accomplish the same addition very similarly on a GPU by writing add() as a device function. This should look similar to code you saw in the previous chapter. But before we look at the device code, we present main(). Although the GPU implementation of main() is different from the corresponding CPU version, nothing here should look new:



Using only Threads 

#include <stdio.h>

#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )

static void HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess)
      {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}



const short N = 10 ;

// CUDA Kernel for Vector Addition
__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)
{
      //Get the id of thread within a block
      unsigned short tid = threadIdx.x ;
     
      if ( tid < N ) // check the boundry condition for the threads
            dev_c [tid] = dev_a[tid] + dev_b[tid] ;

}


int main (void)
{

      //Host array
      int Host_a[N], Host_b[N], Host_c[N];

      //Device array
      int *dev_a , *dev_b, *dev_c ;

      //Allocate the memory on the GPU
      HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );

      //fill the Host array with random elements on the CPU
      for ( int i = 0; i <N ; i++ )
      {
            Host_a[i] = -i ;
            Host_b[i] = i*i ; 
      }

      //Copy Host array to Device array
      HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));
      HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));

      //Make a call to GPU kernel
      Vector_Addition <<< 1, N  >>> (dev_a , dev_b , dev_c ) ;

      //Copy back to Host array from Device array
      HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));

      //Display the result
      for ( int i = 0; i<N; i++ )
                  printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;

      //Free the Device array memory
      cudaFree (dev_a) ;
      cudaFree (dev_b) ;
      cudaFree (dev_c) ;

      system("pause");
      return 0 ;

}

In this code, we are launching with a number in the angle brackets :

Vector_Addition<<<1,N>>>( dev _ a, dev _ b, dev _ c );

What gives?

Recall that we left those two numbers in the angle brackets unexplained; we stated vaguely that they were parameters to the runtime that describe how to launch the kernel. Well, the first number in those parameters represents the number of parallel blocks in which we would like the device to execute our kernel. In this case, we’re passing the value N for this parameter. For example, if we launch with Vector_Addition<<<1,2>>>(), you can think of the runtime creating two copies of the kernel and running them in parallel. We call each of these parallel invocations a thread. With Vector_Addition<<<1,256>>>(), you would get 256 threads running on the GPU. Parallel programming has never been easier. But this raises an excellent question: The GPU runs N copies of our kernel code, but how can we tell from within the code which block is currently running? This question brings us to the second new feature of the example, the kernel code itself. Specifically, it brings us to the variable threadIdx.x:

If you recall the CPU-based example with which we began, you will recall that we needed to walk through indices from 0 to N-1 in order to sum the two vectors. Since the runtime system is already launching a kernel where each block will have one of these indices, nearly all of this work has already been done for us. Because we’re something of a lazy lot, this is a good thing. It affords us more time to blog, probably about how lazy we are.

The last remaining question to be answered is, why do we check whether tid is less than N? It should always be less than N, since we’ve specifically launched our kernel such that this assumption holds. But our desire to be lazy also makes us paranoid about someone breaking an assumption we’ve made in our code. Breaking code assumptions means broken code.

Using only blocks



#include <stdio.h>

#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )

static void HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess)
      {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}



const short N = 10 ;

// CUDA Kernel for Vector Addition
__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)
{
      //Get the id of thread within a block
      unsigned short tid = blockIdx.x ;
     
      if ( tid < N ) // check the boundry condition for the threads
            dev_c [tid] = dev_a[tid] + dev_b[tid] ;

}


int main (void)
{

      //Host array
      int Host_a[N], Host_b[N], Host_c[N];

      //Device array
      int *dev_a , *dev_b, *dev_c ;

      //Allocate the memory on the GPU
      HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );

      //fill the Host array with random elements on the CPU
      for ( int i = 0; i <N ; i++ )
      {
            Host_a[i] = -i ;
            Host_b[i] = i*i ; 
      }

      //Copy Host array to Device array
      HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));
      HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));

      //Make a call to GPU kernel
      Vector_Addition <<< N, 1  >>> (dev_a , dev_b , dev_c ) ;

      //Copy back to Host array from Device array
      HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));

      //Display the result
      for ( int i = 0; i<N; i++ )
                  printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;

      //Free the Device array memory
      cudaFree (dev_a) ;
      cudaFree (dev_b) ;
      cudaFree (dev_c) ;

      system("pause");
      return 0 ;

}


Note
If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line const int  N 10 to 100 or 500 to launch tens of thousands of parallel thread. Be warned, though: No dimension of your launch of blocks may exceed 512 (or 1024). This is simply a hardware-imposed limit, so you will start to see failures if you attempt launches with more blocks than this. In the next articles, we will see how to work within this limitation.

Using blocks and Threads both
For Better Implementation Read this article 


Got Questions?
Feel free to ask me any question because I'd be happy to walk you through step by step!  


Reference and External Links

34 comments:

  1. hello there,
    I'm just a newbie in cuda programming and my program shows that my cpu is faster than the gpu, do you happen to know anything about it?? thank you

    ReplyDelete
  2. Hello there,

    There may be lots of Reason for that, some of them are
    1. Your gpu is not busy all the time; you should lunchat leas 192 threads to make your gpu busy
    2. your transferring more data but you computing less on GPU; transferring more data will cause time increase
    3. your thread block and grid block did not configure optimally; try to fine tune with your block size and grid size

    you may calculate the how much time you are speding on computation and transfering data from gpu to cpu
    use this link for performance matrix :
    http://cuda-programming.blogspot.in/2013/01/how-to-implement-performance-metrics-in.html

    you can also profile your code in visual profiler to get to know about kernel time excuation
    Thank you

    ReplyDelete
    Replies
    1. you can also refere this link (in the post; at last)
      Using blocks and Threads both
      For Better Implementation Read this article

      Delete
  3. thanks for the reply i figured out what my problem was, and the link you shared was really helpful thanks bro.....

    ReplyDelete
    Replies
    1. It sounds good to hear from you that your problem got solved, most welcome, Keep visiting us....
      Please put your suggestion in comment box , thank you

      Delete
  4. I think there's a mistake on the "CPU CORE 1" snippet above. It shoud be tid+=2, instead of tid+=1. After all you're moving forward 2 elements at a time on both cpu cores.

    ReplyDelete
  5. Code shows NO Cuda Capable device found.
    If I remove HANDLE ERROR and execute I get the output but garbage data.

    ReplyDelete
  6. The note you have at the bottom is incorrect. The maximum gird size is 65535x65535x65535. Each block may have no more than 1024 threads total.

    ReplyDelete
  7. Nitin bhaiya...You copied most of it from CUDA by example book...Anyway good going bhaiya...

    ReplyDelete
  8. This blog awesome and i learn a lot about programming from here.The best thing about this blog is that you doing from beginning to experts level.

    ReplyDelete
  9. If loop iteration which sums corresponding elements of the two arrays and places the sum in the third array (C[i] = A[i] + B[i]). and addition for two large double array (>=1K)
    How to initialize for 1K input ?

    ReplyDelete
  10. Nitin bhaiya can I call you bhaiya ??

    ReplyDelete
  11. Such news should be made more widely available.

    ReplyDelete
  12. 44E63D91EFAxelE332C772C430 December 2024 at 18:28

    A3E7FCCB5E
    takipçi

    ReplyDelete

Help us to improve our quality and become contributor to our blog