
Monday, 14 January 2013

CUDA C code for Addition of Two Array elements




Problem statement: You are given two arrays of integer or real values. Your task is to add the corresponding elements of the two arrays and store each sum in a third array.
Solution: The main task is to write a CUDA kernel for this, and writing the kernel is not hard. Let's see how.
Say each array has N elements, where N is given by "arraySize" (here it is 5; change it as needed). Besides main, the program has two functions: the kernel, addKernel, and a helper named addWithCuda(…), which allocates memory on the device and invokes the kernel. dev_a and dev_b are the device arrays that hold the input elements, and dev_c is the device array that stores the element-wise sum of dev_a and dev_b.
We launch arraySize threads, one per array element.
Adding corresponding elements is then easy: each thread only needs to know its own ID. We store the thread's index within the block in "i" and add the two array elements at that index.

    int i = threadIdx.x;
    c[i] = a[i] + b[i];

We launch only one block, with the number of threads equal to arraySize.
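A single block can hold only a limited number of threads (1024 on current GPUs), so a one-block launch works only for small arrays. As a rough sketch of how the same idea could extend to larger arrays, the kernel below computes a global index from the block ID and thread ID and checks it against the array length. The names addKernelGrid, blocksFor, and launchAddGrid, and the block size of 256, are assumptions made for this illustration; they are not part of the program in this post.

```cuda
#include <cuda_runtime.h>

// Hypothetical multi-block variant of addKernel: one thread per element.
__global__ void addKernelGrid(int *c, const int *a, const int *b, int n)
{
    // Global index = offset of this block + index of the thread inside it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // The last block may contain more threads than there are elements left.
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Number of blocks needed to cover n elements, rounded up.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Launches the kernel on device pointers that are already allocated and
// filled, and returns any error reported by the launch itself.
cudaError_t launchAddGrid(int *dev_c, const int *dev_a, const int *dev_b, int n)
{
    const int threadsPerBlock = 256;  // assumed block size, tune as needed

    if (n <= 0) {
        return cudaSuccess;  // nothing to add; a grid of 0 blocks is invalid
    }

    addKernelGrid<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dev_c, dev_a, dev_b, n);
    return cudaGetLastError();
}
```

With 5 elements this launches one block of 256 threads, of which only the first five do any work. With 1000 elements it launches four blocks, and the bounds check keeps the extra 24 threads in the last block from writing past the end of the arrays.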
Here is the complete code for adding the elements of two arrays into a third array.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    // (It replaces the deprecated cudaThreadExit.)
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

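    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, (unsigned int)size>>>(dev_c, dev_a, dev_b);

    // Check for errors reported by the launch itself (for example, an
    // invalid configuration or a binary built for the wrong GPU architecture).
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered while the kernel was running.
    // (It replaces the deprecated cudaThreadSynchronize.)
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }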

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
   
    return cudaStatus;
}
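
The helper above repeats the same check after every runtime call. One common way to shorten it is a small macro, sketched here under the assumption that the enclosing function keeps a local cudaStatus variable and an Error label, as addWithCuda does. CHECK_OR_GOTO and copyToDevice are hypothetical names chosen for this illustration; they are not CUDA runtime functions.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper macro: runs a CUDA runtime call, stores the result in
// the local variable cudaStatus, and jumps to the local label Error on failure.
#define CHECK_OR_GOTO(call)                                              \
    do {                                                                 \
        cudaStatus = (call);                                             \
        if (cudaStatus != cudaSuccess) {                                 \
            fprintf(stderr, "%s failed: %s\n", #call,                    \
                    cudaGetErrorString(cudaStatus));                     \
            goto Error;                                                  \
        }                                                                \
    } while (0)

// Hypothetical example: allocate a device buffer and copy a host array into it.
// On success *dev_out points to the device copy; the caller must cudaFree it.
cudaError_t copyToDevice(int **dev_out, const int *host, size_t size)
{
    cudaError_t cudaStatus;
    int *dev = 0;

    CHECK_OR_GOTO(cudaMalloc((void**)&dev, size * sizeof(int)));
    CHECK_OR_GOTO(cudaMemcpy(dev, host, size * sizeof(int), cudaMemcpyHostToDevice));

    *dev_out = dev;
    return cudaSuccess;

Error:
    cudaFree(dev);  // freeing a null pointer is a harmless no-op
    return cudaStatus;
}
```

The macro keeps the cleanup path in one place, just like the goto Error pattern in addWithCuda, and it prints the failing call together with the message from cudaGetErrorString instead of a fixed string.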


Need an explanation? Leave a comment.
Got Questions?
Feel free to ask me any question because I'd be happy to walk you through step by step!  


6 comments:

  1. What are these functions doing?

    1. I have a C code similar to this and need to convert it to code. Can you help if I post it?

  2. This program is very similar to the corresponding C program for adding the elements of an array.

  3. Dear Sir:
    When I run your code, I got all elements in c array equal to 0. Could you give me a hint how to solve this problem? Thanks in advance.

    Jay Chen

