
Wednesday, 9 January 2013

How to Reverse an Array Using Multiple Blocks; CUDA C/C++

Basic Programming Module, You are at Level - 1

Problem Statement :

Given an input array {a0, a1, …, an-1} in pointer d_a, store the reversed array {an-1, an-2, …, a0} in pointer d_b.
Start from the “reverseArray_multiblock” template.
Multiple 256-thread blocks are launched: to reverse an array of size N, launch N/256 blocks.
Part 1: Compute the number of blocks to launch.
Part 2: Implement the kernel reverseArrayBlock(). Note that you must now compute both the reversed location within the block and the reversed offset to the start of the block.

Problem Solution :

Explanation:

Part 1 of 2:  Compute number of blocks needed based on array size and desired block size

Computing the number of blocks is easy. The problem statement fixes the block size at 256 threads, and each thread handles one element. So for an array of N elements, where N is a multiple of 256, the grid needs N/256 blocks. In the code the array size is dimA and the block size is numThreadsPerBlock, hence
 int numBlocks = dimA/numThreadsPerBlock ;
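The division above is exact only because dimA is a multiple of the block size. As a minimal sketch of the more general case, the hypothetical helper blocksFor() below (not part of the original template) rounds up, so a partial last block is still launched. It is plain host-side C++ and can sit in the same .cu file. A kernel launched this way would also need a bounds check on the index, which the kernel in this post does not have.

```cpp
#include <cassert>

// Hypothetical helper (not in the original template): number of blocks
// needed to cover n elements with threadsPerBlock threads each, rounding up.
int blocksFor(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For the array in this post, blocksFor(256 * 1024, 256) gives 1024, the same as the plain division; for 1000 elements it gives 4 rather than 3.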

Part 2 of 2: Implement the kernel

Now the task is just to map each thread to the element it needs. This kernel does not swap in place: each thread reads one element of d_a and writes one element of d_b, so it needs two indices. In the code below, old_id is the thread's own global index in the array, which is easy to compute.

int bx = blockIdx.x , tx = threadIdx.x ; 
int old_id = blockDim.x * bx+ tx ;

Here "bx" is the index of the current block and "tx" is the index of the current thread within that block.

Now the more challenging task is to compute the index of the second element, right?
gridDim.x is the number of blocks in the grid along the X direction and blockDim.x is the number of threads in a block, so (blockDim.x * gridDim.x) gives the total number of threads in the grid, and (blockDim.x * gridDim.x) - 1 is the global index of the last thread (do not confuse this with threadIdx.x, which only runs from 0 to blockDim.x - 1 within a block). Because we launch one thread per element, this is also the index of the last array element. Now the task becomes easy: the mirrored index is the last index minus old_id. :)

  int new_id = (blockDim.x * gridDim.x) - 1 -  old_id ; 
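The problem statement asks for two pieces: the reversed offset to the start of the block and the reversed location within the block. As a rough host-side check (plain C++, with hypothetical function names that are not in the original template), the sketch below computes new_id both ways and suggests that the one-line formula above is simply the sum of those two pieces.

```cpp
#include <cassert>

// One-line form used in the post: last global index minus old_id.
int newIdDirect(int bx, int tx, int blockDim, int gridDim)
{
    int old_id = blockDim * bx + tx;
    return blockDim * gridDim - 1 - old_id;
}

// Two-part form from the problem statement (hypothetical helper):
// start of the mirrored block plus the mirrored position inside it.
int newIdTwoPart(int bx, int tx, int blockDim, int gridDim)
{
    assert(bx >= 0 && bx < gridDim && tx >= 0 && tx < blockDim);
    int reversedBlockStart = blockDim * (gridDim - 1 - bx);
    int reversedInBlock = blockDim - 1 - tx;
    return reversedBlockStart + reversedInBlock;
}
```

With 4 blocks of 256 threads, thread 0 of block 0 maps to index 1023 in both forms, and thread 255 of block 3 maps to index 0.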



Here is the complete code:



// includes, system
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part 2 of 2: implement the kernel
__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    // global index of this thread: block offset plus position within the block
    int bx = blockIdx.x, tx = threadIdx.x;
    int old_id = blockDim.x * bx + tx;

    // gridDim.x is the number of blocks in the grid along X, so
    // blockDim.x * gridDim.x - 1 is the index of the last element
    int new_id = (blockDim.x * gridDim.x) - 1 - old_id;

    d_b[old_id] = d_a[new_id];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)
    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Part 1 of 2: compute number of blocks needed based on array size and desired block size
    int numBlocks = dimA/numThreadsPerBlock ;

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);

    reverseArrayBlock<<< dimGrid, dimBlock >>>( d_b, d_a );

    // block until the device has completed
    // (cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize)
    cudaDeviceSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors.  Good work!
    printf("Correct!\n");

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}
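
To try the index mapping without a GPU, here is a minimal host-side sketch that replays the kernel's arithmetic in plain C++ loops, one iteration per (block, thread) pair. The function name reverseOnHost and the use of std::vector are assumptions for illustration; they are not part of the original template.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU stand-in for reverseArrayBlock: loops over every
// (block, thread) pair and applies the same two index formulas.
std::vector<int> reverseOnHost(const std::vector<int>& a, int gridDim, int blockDim)
{
    // one "thread" per element, as in the post
    assert(static_cast<int>(a.size()) == gridDim * blockDim);
    std::vector<int> b(a.size());
    for (int bx = 0; bx < gridDim; ++bx) {
        for (int tx = 0; tx < blockDim; ++tx) {
            int old_id = blockDim * bx + tx;
            int new_id = blockDim * gridDim - 1 - old_id;
            b[old_id] = a[new_id];
        }
    }
    return b;
}
```

Calling reverseOnHost({0, 1, 2, 3, 4, 5}, 3, 2) returns {5, 4, 3, 2, 1, 0}, which is the same property the assert loop in main() checks.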




Feel free to comment...


