Basic Programming Module, You are at Level - 1
Problem Statement :
Given an input array {a0, a1, …, an-1} in pointer d_a, store the reversed array {an-1, an-2, …, a0} in pointer d_b.
Start from the “reverseArray_multiblock” template.
Multiple 256-thread blocks are launched: to reverse an array of size N, launch N/256 blocks.
Part 1: Compute the number of blocks to launch.
Part 2: Implement the kernel reverseArrayBlock(). Note that now you must compute both the reversed location within the block and the reversed offset to the start of the block.
Problem Solution :
Explanation:
Part 1 of 2: Compute number of blocks needed based on array size and desired block size
Computing the number of blocks is straightforward. The problem statement fixes the block size at 256 threads, and each thread handles exactly one element, so an array of N elements needs N/256 blocks. This assumes N is a multiple of 256, which holds here because dimA = 256 * 1024.
Hence:
int numBlocks = dimA/numThreadsPerBlock ;
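The integer division above is exact only because dimA is a multiple of the block size. As a minimal sketch of the more general case (not part of the original exercise), the host-side helper below rounds up so that a partial final block is still counted. The name blocks_needed is hypothetical, and the code is plain C so it can be checked without a GPU.

```c
#include <assert.h>

/* Hypothetical helper: number of blocks needed to cover n elements with
   threads_per_block threads each, rounding up (ceiling division). */
int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For dimA = 256 * 1024 and 256 threads per block this gives 1024, the same value as dimA/numThreadsPerBlock. If the size were not a multiple of 256, the kernel in this post would also need a bounds check and would have to mirror around the array length rather than blockDim.x * gridDim.x.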
Part 2 of 2: Implement the kernel
Now the task is to map each thread to the elements it works on. Each thread reads one element of d_a and writes one element of d_b, so it needs two indices. In the code below, old_id is the thread's global index, which is the position it writes in d_b and is easy to compute:
int bx = blockIdx.x , tx = threadIdx.x ;
int old_id = blockDim.x * bx+ tx ;
Here "bx" is the index of the current block within the grid and "tx" is the index of the current thread within its block.
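To make the mapping concrete, here is a small host-side sketch in plain C that mimics how blockIdx.x and threadIdx.x combine into a global index. The function names global_index and covers_each_element_once, and the loop-based emulation, are assumptions for illustration only. On the GPU the blocks and threads run in parallel rather than in nested loops.

```c
#include <assert.h>

/* Hypothetical host-side stand-in for: blockDim.x * blockIdx.x + threadIdx.x */
int global_index(int block_dim, int bx, int tx)
{
    assert(tx >= 0 && tx < block_dim);
    return block_dim * bx + tx;
}

/* Emulate a launch of grid_dim blocks with block_dim threads each and count
   how often every element index is produced. The caller supplies hits[],
   which must hold grid_dim * block_dim ints. Returns 1 if every index in
   [0, grid_dim * block_dim) is produced exactly once, 0 otherwise. */
int covers_each_element_once(int grid_dim, int block_dim, int *hits)
{
    int total = grid_dim * block_dim;
    for (int i = 0; i < total; ++i)
        hits[i] = 0;
    for (int bx = 0; bx < grid_dim; ++bx)
        for (int tx = 0; tx < block_dim; ++tx)
            hits[global_index(block_dim, bx, tx)]++;
    for (int i = 0; i < total; ++i)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

With 256 threads per block, thread 5 of block 2 gets global index 256 * 2 + 5 = 517. Because every index is produced exactly once, no two threads write the same element of d_b.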
The more challenging part is computing the second index.
gridDim.x is the number of blocks in the grid along the x-direction and blockDim.x is the number of threads in a block, so (blockDim.x * gridDim.x) is the total number of threads in the grid and (blockDim.x * gridDim.x) - 1 is the global index of the last thread. Do not confuse this with threadIdx.x, which only runs from 0 to blockDim.x - 1 within a block. Because the launch uses exactly one thread per element, this is also the index of the last array element, so the mirrored index follows directly:
int new_id = (blockDim.x * gridDim.x) - 1 - old_id ;
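The problem statement asks for two pieces, the reversed location within the block and the reversed offset to the start of the block, while the line above computes the mirrored index in one step. The host-side sketch below, in plain C with hypothetical function names, writes out both forms so they can be compared, and emulates the kernel with nested loops on a small array. It is an illustration under those assumptions, not the template's reference solution.

```c
#include <assert.h>

/* One-step form used in this post: (blockDim.x * gridDim.x) - 1 - old_id */
int mirrored_index(int grid_dim, int block_dim, int bx, int tx)
{
    int old_id = block_dim * bx + tx;
    return block_dim * grid_dim - 1 - old_id;
}

/* Two-step form from the problem statement (names are hypothetical):
   reversed offset to the start of the block + reversed location in it. */
int mirrored_index_two_step(int grid_dim, int block_dim, int bx, int tx)
{
    int reversed_block_offset = block_dim * (grid_dim - 1 - bx);
    int reversed_tx = block_dim - 1 - tx;
    return reversed_block_offset + reversed_tx;
}

/* Sequential emulation of the kernel body: b[old_id] = a[new_id]. */
void reverse_emulated(int *b, const int *a, int grid_dim, int block_dim)
{
    for (int bx = 0; bx < grid_dim; ++bx) {
        for (int tx = 0; tx < block_dim; ++tx) {
            int old_id = block_dim * bx + tx;
            int new_id = mirrored_index(grid_dim, block_dim, bx, tx);
            /* both forms must agree for every thread */
            assert(new_id == mirrored_index_two_step(grid_dim, block_dim, bx, tx));
            b[old_id] = a[new_id];
        }
    }
}
```

The two forms agree because blockDim.x * (gridDim.x - 1 - bx) + (blockDim.x - 1 - tx) simplifies to blockDim.x * gridDim.x - 1 - (blockDim.x * bx + tx). With 2 blocks of 4 threads, the first thread of block 0 reads element 7 and the last thread of block 1 reads element 0.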
Here is the complete code:
// includes, system
#include <stdio.h>
#include <stdlib.h>   // malloc, free, exit
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char *msg);

// Part 2 of 2: implement the kernel
__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    int bx = blockIdx.x, tx = threadIdx.x;
    // global index of this thread: the position it writes in d_b
    int old_id = blockDim.x * bx + tx;
    // gridDim.x gives the number of blocks in the grid in the x dimension,
    // so blockDim.x * gridDim.x is the total number of threads (one per element)
    int new_id = (blockDim.x * gridDim.x) - 1 - old_id;
    d_b[old_id] = d_a[new_id];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointers for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Part 1 of 2: compute number of blocks needed based on array size and desired block size
    // (exact because dimA is a multiple of numThreadsPerBlock)
    int numBlocks = dimA / numThreadsPerBlock;

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc((void **) &d_a, memSize);
    cudaMalloc((void **) &d_b, memSize);

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock >>>(d_b, d_a);

    // block until the device has completed
    // (cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize)
    cudaDeviceSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy(h_a, d_b, memSize, cudaMemcpyDeviceToHost);

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i);
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");
    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
Feel free to comment...