Handling CUDA error messages
The cudasafe routine in the following piece of code can be used to handle CUDA error messages.
    #include <cmath>
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Print the name of the failing call and its CUDA error code, then abort.
    void cudasafe( cudaError_t error, const char* message )
    {
        if(error!=cudaSuccess) { fprintf(stderr,"ERROR: %s : %i\n",message,error); exit(-1); }
    }

    int main() {
        float *a_d; // pointer to device memory; a.k.a. GPU
        int block_size, n=10;

        // allocate arrays on device
        cudasafe( cudaMalloc((void **)&a_d,n*n*sizeof(float)), "cudaMalloc" );
        block_size=22;
        dim3 dimBlock( block_size, block_size );
        // enough blocks to cover n points in each direction
        dim3 dimGrid( ceil(float(n)/float(dimBlock.x)), ceil(float(n)/float(dimBlock.y)) );

        cudasafe( cudaFree(a_d), "cudaFree" );

        return(0);
    }
See Example 1 for another way to check for error messages with CUDA.
error: a host function call can not be configured - This compiler error means that you tried to call a routine as if it were a kernel to be executed on the device, but you forgot to put __global__ in front of that routine.
Invalid Configuration Argument - This error means that the dimensions of either the grid of blocks (dimGrid) or the threads in a block (dimBlock) are incorrect: a dimension is either zero or larger than the device allows. It is most likely to occur when you determine the dimensions dynamically.
Too Many Resources Requested for Launch - This error means that the launch needs more registers than are available on the multiprocessor. Reduce the number of threads per block to solve the problem.
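As a rough illustration of how far the block size has to come down, the sketch below estimates an upper bound on threads per block from a register budget. It is an approximation under stated assumptions, not a statement of how the hardware allocates registers: the helper name max_threads_for_registers is hypothetical, the per-thread register count is assumed to come from compiling with nvcc --ptxas-options=-v, and the budget is assumed to come from the regsPerBlock field of cudaDeviceProp. Real devices allocate registers in coarser units, so the true limit may be lower.

```cpp
// Hypothetical helper (not from the original post): rough upper bound on the
// number of threads per block that fits within a register budget.
//   regs_per_block  : register budget, e.g. cudaDeviceProp::regsPerBlock
//   regs_per_thread : registers used by one thread of the kernel
//   warp_size       : threads per warp, e.g. cudaDeviceProp::warpSize
// The result is rounded down to a whole number of warps. Returns 0 when no
// full warp fits or when an argument is not positive.
int max_threads_for_registers(int regs_per_block, int regs_per_thread, int warp_size)
{
    if (regs_per_block <= 0 || regs_per_thread <= 0 || warp_size <= 0) {
        return 0;
    }
    const int threads = regs_per_block / regs_per_thread;  // integer division
    return (threads / warp_size) * warp_size;               // whole warps only
}
```

With made-up numbers, a budget of 16,384 registers and a kernel that uses 40 registers per thread, max_threads_for_registers(16384, 40, 32) returns 384, so a 16 x 16 block (256 threads) would fit while a 22 x 22 block (484 threads) would not.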
Unspecified launch failure - This error means that CUDA does not know what the problem was. This is the worst error to get because you do not know where to look to correct it. One way to read this error message is as the device-side counterpart of a "segmentation fault" in host code.
Suppose you used the following piece of code in your program to check for error messages.
    // Report the most recent CUDA error, if any, and abort.
    void Check_CUDA_Error(const char *message)
    {
        cudaError_t error = cudaGetLastError();
        if(error!=cudaSuccess) {
            fprintf(stderr,"ERROR: %s: %s\n", message, cudaGetErrorString(error) );
            exit(-1);
        }
    }

    int main(int argc, char** argv)
    {
        :
        :
        block_size=23;
        dim3 dimBlock(block_size,block_size);
        dim3 dimGrid( ceil(float(N)/float(dimBlock.x)), ceil(float(N)/float(dimBlock.y)));
        assign_d<<<dimGrid, dimBlock>>>(a_d,N);
        // check right after the launch; the launch itself returns nothing
        Check_CUDA_Error("Kernel Execution Failed!");
        :
        :
        return 0;
    }
Without the call to Check_CUDA_Error, this piece of code would fail without a warning as to the cause. Remember, once you launch the kernel, it operates asynchronously with the CPU. The kernel would fail and not tell you, but the CPU would continue to compute whatever was left in the program. By checking the error message, you can see that the kernel failed with Invalid Configuration Argument. In this case, we know the number of threads in the block is not zero. However, there are 23 x 23 = 529 threads in the block, which exceeds the capability of the GPU, which was shown in the Getting information about the GPU tutorial to be 512 threads per block. By reducing the number of threads to 22 per side of the block (484 threads total in the block), the code will run correctly.
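The same arithmetic can be checked on the host before the kernel is launched. The following is a minimal sketch rather than part of the original program: block_fits and largest_square_side are hypothetical helper names, and the limit of 512 is the figure quoted above. On a real device the limit would be read from the maxThreadsPerBlock field of cudaDeviceProp, and the per-dimension limits in maxThreadsDim would also need checking, which this sketch leaves out.

```cpp
// Hypothetical helper (not from the original post): true if a square block of
// block_side x block_side threads is non-empty and stays within the device's
// limit on the total number of threads per block.
bool block_fits(int block_side, int max_threads_per_block)
{
    if (block_side <= 0) {
        return false;  // a zero-sized block is also an invalid configuration
    }
    // Multiply in a wider type so a large block_side cannot overflow int.
    const long long threads = static_cast<long long>(block_side) * block_side;
    return threads <= max_threads_per_block;
}

// Hypothetical helper: largest side of a square block that still fits.
// Returns 0 if not even a 1 x 1 block fits.
int largest_square_side(int max_threads_per_block)
{
    int side = 0;
    while (block_fits(side + 1, max_threads_per_block)) {
        ++side;
    }
    return side;
}
```

With a limit of 512, block_fits(23, 512) is false and block_fits(22, 512) is true, and largest_square_side(512) returns 22, which matches the block size that makes the example run.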
Suppose you took the code from the Laplace Solver Program tutorial, and modified it so that instead of:

    if(i>0 && i<N-1 && j>0 && j<N-1) { B[index] = 0.25*( A[index1] + A[index2] + A[index3] + A[index4] ); }

you now had a new array, a masking array. This masking array is set to zero on the boundaries of the array, and one on the interior. This way the interior is computed, and the boundary conditions are left alone.

    if(mask[index]) { B[index] = 0.25*( A[index1] + A[index2] + A[index3] + A[index4] ); }
However, when you run the code, you occasionally get the dreaded unspecified launch failure error. Sometimes when you run the code it works fine; sometimes it fails. The problem is that you are accessing an array out of bounds. When the program is executed, a number of threads are created, and these threads are grouped together in thread blocks. Suppose you ask for 16 x 16 threads per block, and the grid on which you are solving the Laplace Equation is 45 x 45. The grid has 2,025 points. Each block has 256 threads. Dividing the number of grid points by the number of threads per block means that you will need 7.9 blocks. Of course, you cannot have a partial block, so the number is rounded up to 8 blocks. That means that you have 2,048 threads, while you need only 2,025. (Really you only need 1,849 threads, since no computation takes place on the boundary and the interior is 43 x 43 points.) In the first code block, the extra threads are excluded by the test on i and j, so they go unused. In the second code block, with the masking array, those extra threads will be accessing the mask array beyond its bounds. The result is non-deterministic: sometimes it may succeed; sometimes it may fail with the error unspecified launch failure.
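The thread count and the usual remedy, a bounds test ahead of the mask lookup, can be sketched on the host. This is an illustration under assumptions, not the tutorial's kernel: it uses the simple linear indexing implied by the arithmetic above, index stands in for the global thread index, and the names blocks_needed, make_mask, thread_may_update and count_updates are hypothetical. In a kernel, the same index test would be placed in front of the if(mask[index]) line.

```cpp
#include <vector>

// Number of blocks needed to cover `points` work items, rounded up.
int blocks_needed(int points, int threads_per_block)
{
    return (points + threads_per_block - 1) / threads_per_block;
}

// Mask for an n x n grid: zero on the boundary, one in the interior.
std::vector<int> make_mask(int n)
{
    std::vector<int> mask(static_cast<std::size_t>(n) * n, 0);
    for (int i = 1; i < n - 1; ++i) {
        for (int j = 1; j < n - 1; ++j) {
            mask[static_cast<std::size_t>(i) * n + j] = 1;
        }
    }
    return mask;
}

// Host-side stand-in for the guarded kernel test. The bounds check comes
// first, so a surplus thread never reads mask[] past its end.
bool thread_may_update(int index, int n, const std::vector<int>& mask)
{
    if (index < 0 || index >= n * n) {
        return false;  // surplus thread: do nothing
    }
    return mask[static_cast<std::size_t>(index)] != 0;
}

// Walks over every launched "thread" and counts those that update a point.
int count_updates(int n, int threads_per_block)
{
    const std::vector<int> mask = make_mask(n);
    const int launched = blocks_needed(n * n, threads_per_block) * threads_per_block;
    int updates = 0;
    for (int index = 0; index < launched; ++index) {
        if (thread_may_update(index, n, mask)) {
            ++updates;
        }
    }
    return updates;
}
```

For the 45 x 45 grid with 256 threads per block, blocks_needed(45 * 45, 256) returns 8, so 2,048 threads are launched for 2,025 points. The 23 surplus indices are rejected by the bounds test before the mask is ever read.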