Handling CUDA error messages

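The following code defines a cudasafe routine that can be used to handle CUDA error messages. Wrap each CUDA runtime call in cudasafe; if the call does not return cudaSuccess, the routine prints the supplied message with the error code and exits.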
    #include <cmath>
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Abort with a message if a CUDA runtime call did not succeed.
    void cudasafe( cudaError_t error, const char* message)
    {
       if(error!=cudaSuccess) { fprintf(stderr,"ERROR: %s : %i\n",message,error); exit(-1); }
    }

    int main() {
       float *a_d; // pointer to device memory; a.k.a. GPU
       int block_size, n=10;

       // allocate array on device
       cudasafe( cudaMalloc((void **)&a_d,n*n*sizeof(float)), "cudaMalloc" );
       block_size=22;
       dim3 dimBlock( block_size, block_size );
       dim3 dimGrid( ceil(float(n)/float(dimBlock.x)), ceil(float(n)/float(dimBlock.y)) );

       // release the device memory
       cudasafe( cudaFree(a_d), "cudaFree" );
       return 0;
    }
See Example 1 for another way to check for error messages with CUDA.
Common Errors
*  error: a host function call can not be configured - This error means that you tried to launch a routine as a kernel to be executed on the device, but you did not declare that routine with __global__.
*  Invalid Configuration Argument - This error means that the dimension of either the specified grid of blocks (dimGrid) or the number of threads in a block (dimBlock) is incorrect. In such a case, the dimension is either zero or larger than the device allows. This error is most likely to appear when the dimensions are computed at run time.
*  Too Many Resources Requested for Launch - This error means that the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem.
*  Unspecified launch failure - This error means that CUDA does not know what the problem was. This is the worst error to get because you do not know where to look to correct the error. One way to look at this error message is to mentally translate it to "segmentation fault" for the host code.
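The two causes listed above for Invalid Configuration Argument (a zero dimension, or a block that is too large) can be checked on the host before launching. The following is a minimal sketch, not a definitive implementation: LaunchDims and launch_config_ok are hypothetical names, and the thread limit is passed in as an assumption. On a real device the limit can be read from the maxThreadsPerBlock field filled in by cudaGetDeviceProperties. The sketch does not check the device's limits on grid dimensions.

```cpp
#include <cassert>

// Hypothetical stand-in for the x and y fields of dimGrid and dimBlock.
struct LaunchDims {
    unsigned grid_x, grid_y;
    unsigned block_x, block_y;
};

// True when no dimension is zero and the block holds at most
// max_threads_per_block threads (an assumed, device-specific limit).
bool launch_config_ok(const LaunchDims& d, unsigned max_threads_per_block)
{
    assert(max_threads_per_block > 0);
    if (d.grid_x == 0 || d.grid_y == 0 || d.block_x == 0 || d.block_y == 0)
        return false;
    // Widen before multiplying so a very large block cannot overflow.
    const unsigned long long threads =
        static_cast<unsigned long long>(d.block_x) * d.block_y;
    return threads <= max_threads_per_block;
}
```

With a limit of 512 threads per block, a 22 x 22 block passes this check, while a 23 x 23 block or any configuration containing a zero dimension does not.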
Example 1
Suppose you used the following piece of code in your program to check for error messages.
    // Abort with a message if the most recent CUDA call or kernel launch failed.
    void Check_CUDA_Error(const char *message)
    {
       cudaError_t error = cudaGetLastError();
       if(error!=cudaSuccess) {
          fprintf(stderr,"ERROR: %s: %s\n", message, cudaGetErrorString(error) );
          exit(-1);
       }
    }

    int main(int argc, char** argv)
    {
       :
       :
       block_size=23;
       dim3 dimBlock(block_size,block_size);
       dim3 dimGrid( ceil(float(N)/float(dimBlock.x)), ceil(float(N)/float(dimBlock.y)));
       assign_d<<<dimGrid, dimBlock>>>(a_d,N);
       Check_CUDA_Error("Kernel Execution Failed!");
       :
       :
       return 0;
    }
Without the call to Check_CUDA_Error, this piece of code would fail with no warning as to the cause. Remember, once you launch the kernel, it operates asynchronously with the CPU. The kernel would fail without telling you, and the CPU would continue to compute whatever was left in the program. By checking the error message, you can see that the kernel failed with Invalid Configuration Argument. In this case, we know the number of threads in the block is not zero. However, there are 23 x 23 = 529 threads in the block, which exceeds the capability of the GPU, shown in the Getting information about the GPU tutorial to be 512 threads per block. Reducing the block to 22 threads per side (484 threads in total) lets the code run correctly.
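The arithmetic behind the choice of 22 can be written out as a small host-side helper. This is only a sketch under stated assumptions: largest_square_block_side is a hypothetical name, and the per-block thread limit is supplied by the caller rather than queried from the device.

```cpp
#include <cassert>

// Largest side s such that an s x s block holds at most
// max_threads_per_block threads (an assumed, device-specific limit).
unsigned largest_square_block_side(unsigned max_threads_per_block)
{
    assert(max_threads_per_block > 0);
    unsigned long long side = 1;
    // Grow the side while the next larger square still fits within the limit.
    while ((side + 1) * (side + 1) <= max_threads_per_block)
        ++side;
    return static_cast<unsigned>(side);
}
```

For the limit of 512 quoted above, the helper returns 22, because 22 x 22 = 484 fits and 23 x 23 = 529 does not.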
Example 2
Suppose you took the code from the Laplace Solver Program tutorial, and modified it so that instead of:
    if(i>0 && i<N-1 && j>0 && j<N-1) { B[index] = 0.25*(  A[index1] + A[index2] + A[index3] + A[index4] ); }
you now had a new array, a masking array. This masking array is set to zero on the boundaries of the array, and one on the interior. This way the interior is computed, and the boundary conditions are left alone.
    if(mask[index]) { B[index] = 0.25*(  A[index1] + A[index2] + A[index3] + A[index4] ); }
However, when you run the code, you occasionally get the dreaded unspecified launch failure error. Sometimes when you run the code it works fine; sometimes it fails. The problem is that you are accessing an array out of bounds, which is giving you the error.
When the program is executed, a number of threads are created. These threads are grouped together in thread blocks. Suppose you say you want 16 x 16 threads per block, and the grid on which you are solving the Laplace Equation is 45 x 45. The grid has 2,025 points. Each block has 256 threads. Dividing the size of the grid by the number of threads per block means that you will need 7.9 blocks. Of course, you cannot have a partial block, so the number is rounded up to 8 blocks. That means that you have 2,048 threads, while you need only 2,025. (Really you only need 43 x 43 = 1,849 threads, since no computation takes place on the boundary.) The extra threads do no work in the first code block, because the test on i and j excludes them. However, in the second code block, with the masking array, those extra threads will read the mask array beyond its bounds. The result is non-deterministic. Sometimes it may succeed; sometimes it may fail with the error unspecified launch failure.
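The thread count above, and one possible remedy, can be sketched on the host. The code below is an illustration under assumptions, not the Laplace Solver Program itself: blocks_needed, GuardResult and run_masked_update are hypothetical names, and the loop plays the role of the launched threads, one iteration per thread. The remedy shown is a bounds guard, which tests the index against the array size before the mask is read.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover total_points, rounding up.
int blocks_needed(int total_points, int threads_per_block)
{
    assert(total_points >= 0 && threads_per_block > 0);
    return (total_points + threads_per_block - 1) / threads_per_block;
}

struct GuardResult {
    int updated;  // threads that passed the guard and had a nonzero mask entry
    int skipped;  // threads whose index fell outside the mask array
};

// Host-side stand-in for the masked kernel: each iteration acts as one thread.
GuardResult run_masked_update(const std::vector<int>& mask, int threads_launched)
{
    GuardResult r = {0, 0};
    const int size = static_cast<int>(mask.size());
    for (int index = 0; index < threads_launched; ++index) {
        if (index >= size) { ++r.skipped; continue; }  // bounds guard
        if (mask[index]) { ++r.updated; }  // the update of B[index] would go here
    }
    return r;
}
```

For the 45 x 45 grid, blocks_needed(2025, 256) gives 8 blocks, so 2,048 threads are launched and the guard skips the 23 surplus threads. If the kernel is instead launched with a two-dimensional grid of 16 x 16 blocks, as in the dimGrid expression used earlier, ceil(45/16) = 3 blocks per side gives 9 blocks and 2,304 threads, so the number of surplus threads is larger still and the guard matters even more.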
