Pages

Wednesday, 23 January 2013

Performance of sqrt in CUDA



Taking the square root of a floating point number is essential in many engineering applications. Whether you are doing nBody simulations, simulating molecules, or linear algebra, the ability to accurately and quickly perform thousands or even millions of square root operations is essential. Unfortunately, the square root functions on most CPUs are very time consuming, even with specialized SSE instructions. Fortunately enough, GPUs have specialized hardware to perform such square root operations extremely fast. CUDA, NVidia’s solution to extremely high performance parallel computing, puts the onboard specialized hardware to full use, and easily outperforms modern Intel or AMD CPUs by a factor of over a hundred.

Example problem

The example problem for this article is a reduced nbody problem. We are given a sets of (x,y) coordinates. For this article, we can have anywhere from 512 to 65536 coordinates, or elements. In order to simplify this example as much as possible, all need to do is calculate the sum of distances to all other points, for each point. In a simulation problem such as this, the algorithm complexity of if the order N squared, which means we will need vast computational power as N, or the number of elements, is increased.

CPU approach
This problem is very simple, and can be approached easily with C or C++. Below is some sample code. Note that it has not been optimized for SSE, but it is debatable as to whether or not some compilers will be able to automatically vectorize such code.

void getDistCPU(float *h_result, float *h_X, float *h_Y, int nElems)
{      
        for (int i=0 ; i < nElems; i++)
        {
               float localX = h_X[i];
               float localY = h_Y[i];

               float accumDist = 0;
               for (int j=0; j < nElems; j++)
               {

                       float curX = h_X[j]-localX;
                       float curY = h_Y[j]-localY;

                       accumDist += sqrt((curX * curX) + (curY*curY));
               }
               h_result[i] = accumDist;
        }
}


CUDA approach

Because CUDA code is basically C code, it has extreme similarity to the CPU code. However, instead of having two imbedded for loops, it is easier to simply have each CUDA thread calculate the result for one element. Thus, the CUDA kernel only needs one for loop.

__global__ void CalculateDist(float *pX, float *pY, float *pResult,
                                             int totalElements)
{
        int index = blockIdx.x * blockDim.x + threadIdx.x;

       
        float localX = pX[index];
        float localY = pY[index];

        float accumulatedDist = 0;
       
        for (int i=0; i < totalElements; i++)
        {
               float curX = pX[i];
               float curY = pY[i];

               curX = curX - localX;  //curX is now distance
               curY = curY - localY;

               accumulatedDist += sqrt((curX * curX) + (curY*curY));
        }
        pResult[index] = accumulatedDist;
}



Performance
Of course, performance is much faster on a GPU than it is on a CPU. The card used for this experiment is an underclocked GTX 280. The CPU is a 2.66Ghz Core 2 Duo, however only one core is being utilized for this experiment. For 65536 elements, the CUDA code executed over 172 times as fast as CPU code. In fact, even for very small numbers of elements, 256 for example, the CUDA code was still able to outperform the CPU by 40%. For these calculations, all overhead, including copying memory from the host to the GPU and executing the kernel are taken into consideration. Clearly, if your computational algorithm requires many square roots to be performed, the performance of CUDA will far exceed that of a CPU.






 Got Questions?
Feel free to ask me any question because I'd be happy to walk you through step by step!
For Contact us….. Click on Contact us Tab

3 comments:

  1. Where is the full code. Can you please show the full cuda code programs.

    ReplyDelete
    Replies
    1. #include

      __global__ void square(float * d_out, float * d_in){
      int idx = threadIdx.x;
      float f = d_in[idx];
      d_out[idx] = f * f;
      }

      int main(int argc, char ** argv) {
      const int ARRAY_SIZE = 64;
      const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

      // generate the input array on the host
      float h_in[ARRAY_SIZE];
      for (int i = 0; i < ARRAY_SIZE; i++) {
      h_in[i] = float(i);
      }
      float h_out[ARRAY_SIZE];

      // declare GPU memory pointers
      float * d_in;
      float * d_out;

      // allocate GPU memory
      cudaMalloc((void**) &d_in, ARRAY_BYTES);
      cudaMalloc((void**) &d_out, ARRAY_BYTES);

      // transfer the array to the GPU
      cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

      // launch the kernel
      square<<>>(d_out, d_in);

      // copy back the result array to the CPU
      cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

      // print out the resulting array
      for (int i =0; i < ARRAY_SIZE; i++) {
      printf("%f", h_out[i]);
      printf(((i % 4) != 3) ? "\t" : "\n");
      }

      cudaFree(d_in);
      cudaFree(d_out);

      return 0;
      }

      Delete

  2. This is an informative post and it is very useful and knowledgeable. therefore, I would like to thank you for the efforts you have made in writing this article.
    iphone app training course
    iphone app training course in bangalore
    ios app development in hyderabad

    ReplyDelete

Help us to improve our quality and become contributor to our blog