Performance of sqrt in CUDA



Taking the square root of a floating point number is essential in many engineering applications. Whether you are doing n-body simulations, simulating molecules, or doing linear algebra, the ability to perform thousands or even millions of square root operations quickly and accurately is essential. Unfortunately, the square root functions on most CPUs are quite time consuming, even with specialized SSE instructions. Fortunately, GPUs have dedicated hardware to perform square root operations extremely fast. CUDA, NVIDIA's platform for high performance parallel computing, puts this onboard hardware to full use, and on this kind of workload it can outperform a modern Intel or AMD CPU by a factor of more than a hundred.
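
As a quick illustration of the square root functions CUDA exposes in device code, here is a minimal sketch (the kernel name and the choice to sum the results are mine, purely for illustration). sqrtf is the standard single precision square root, __fsqrt_rn is the correctly rounded intrinsic, and rsqrtf is the fast reciprocal square root; compiling with nvcc's -use_fast_math flag additionally trades some accuracy of sqrtf for speed.

__global__ void sqrtVariants(const float *in, float *out, int n)
{
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float x = in[i];

        float a = sqrtf(x);        // single precision square root
        float b = __fsqrt_rn(x);   // intrinsic, round-to-nearest result
        float c = rsqrtf(x);       // fast reciprocal square root, 1/sqrt(x)

        out[i] = a + b + c;        // combine so nothing is optimized away
}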

Example problem

The example problem for this article is a reduced n-body problem. We are given a set of (x, y) coordinates; for this article the number of coordinates, or elements, ranges from 512 to 65536. To keep the example as simple as possible, all we need to do is calculate, for each point, the sum of its distances to all other points: result[i] = sum over j of sqrt((x_j - x_i)^2 + (y_j - y_i)^2). The algorithmic complexity is of order N squared, which means we will need vast computational power as N, the number of elements, increases.

CPU approach
This problem is very simple and can be coded directly in C or C++; sample code is shown below. Note that it has not been hand-optimized for SSE, and it is debatable whether a compiler can automatically vectorize such code.

void getDistCPU(float *h_result, float *h_X, float *h_Y, int nElems)
{
        for (int i = 0; i < nElems; i++)
        {
                float localX = h_X[i];
                float localY = h_Y[i];

                float accumDist = 0;
                for (int j = 0; j < nElems; j++)
                {
                        // Distance from point i to point j, accumulated into the sum.
                        float curX = h_X[j] - localX;
                        float curY = h_Y[j] - localY;

                        accumDist += sqrt((curX * curX) + (curY * curY));
                }
                h_result[i] = accumDist;
        }
}
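
For reference, a minimal host-side harness along the following lines can drive and time the CPU version. The array size, the random initialization, and the use of clock() for timing are my own assumptions and are not part of the original post.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* getDistCPU as defined above */

int main(void)
{
        const int nElems = 65536;
        float *h_X      = (float *)malloc(nElems * sizeof(float));
        float *h_Y      = (float *)malloc(nElems * sizeof(float));
        float *h_result = (float *)malloc(nElems * sizeof(float));

        // Fill the coordinates with arbitrary values in [0, 1).
        for (int i = 0; i < nElems; i++)
        {
                h_X[i] = rand() / (float)RAND_MAX;
                h_Y[i] = rand() / (float)RAND_MAX;
        }

        clock_t start = clock();
        getDistCPU(h_result, h_X, h_Y, nElems);
        clock_t end = clock();

        printf("CPU time: %.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);

        free(h_X);
        free(h_Y);
        free(h_result);
        return 0;
}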


CUDA approach

Because CUDA code is essentially C code, the kernel is very similar to the CPU code. However, instead of two nested for loops, it is easier to have each CUDA thread calculate the result for one element. Thus, the CUDA kernel only needs a single for loop.

__global__ void CalculateDist(float *pX, float *pY, float *pResult,
                              int totalElements)
{
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        // Guard against threads past the end of the array when the grid
        // size is not an exact multiple of the element count.
        if (index >= totalElements)
                return;

        float localX = pX[index];
        float localY = pY[index];

        float accumulatedDist = 0;

        for (int i = 0; i < totalElements; i++)
        {
                float curX = pX[i] - localX;   // x distance to point i
                float curY = pY[i] - localY;   // y distance to point i

                accumulatedDist += sqrt((curX * curX) + (curY * curY));
        }
        pResult[index] = accumulatedDist;
}
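
The post does not show the host code that launches this kernel. A minimal sketch along the following lines would allocate device memory, copy the coordinates over, launch CalculateDist, and copy the result back; the wrapper name getDistGPU, the block size of 256, and the variable names are my own assumptions.

void getDistGPU(float *h_result, float *h_X, float *h_Y, int nElems)
{
        size_t bytes = nElems * sizeof(float);
        float *d_X, *d_Y, *d_result;

        // Allocate device memory and copy the input coordinates to the GPU.
        cudaMalloc((void **)&d_X, bytes);
        cudaMalloc((void **)&d_Y, bytes);
        cudaMalloc((void **)&d_result, bytes);
        cudaMemcpy(d_X, h_X, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_Y, h_Y, bytes, cudaMemcpyHostToDevice);

        // One thread per element; round the grid size up.
        int blockSize = 256;
        int gridSize  = (nElems + blockSize - 1) / blockSize;
        CalculateDist<<<gridSize, blockSize>>>(d_X, d_Y, d_result, nElems);

        // Copy the per-element distance sums back to the host.
        cudaMemcpy(h_result, d_result, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_X);
        cudaFree(d_Y);
        cudaFree(d_result);
}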



Performance
Of course, performance is much better on a GPU than on a CPU. The card used for this experiment is an underclocked GTX 280; the CPU is a 2.66 GHz Core 2 Duo, of which only one core is used. For 65536 elements, the CUDA code executed over 172 times as fast as the CPU code. In fact, even for a very small number of elements, 256 for example, the CUDA code still outperformed the CPU by 40%. These measurements include all overhead, including copying memory from the host to the GPU and launching the kernel. Clearly, if your computational algorithm requires many square roots, the performance of CUDA will far exceed that of a CPU.
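
The post does not show how the timings were collected. One common way to measure the GPU side, including the memory transfers, is with CUDA events, as in the sketch below; getDistGPU refers to the hypothetical host wrapper sketched earlier and is my own name, not part of the original code.

cudaEvent_t start, stop;
float elapsedMs = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
getDistGPU(h_result, h_X, h_Y, nElems);   // transfers + kernel + copy back
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait until the GPU work is done

cudaEventElapsedTime(&elapsedMs, start, stop);
printf("GPU time (including transfers): %.3f ms\n", elapsedMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);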






Got Questions?
Feel free to ask me any question; I'd be happy to walk you through it step by step!
To contact us, click on the Contact Us tab.

3 comments:

  1. Where is the full code? Can you please show the full CUDA program?

    Replies
    1. #include <stdio.h>

      __global__ void square(float *d_out, float *d_in)
      {
          int idx = threadIdx.x;
          float f = d_in[idx];
          d_out[idx] = f * f;
      }

      int main(int argc, char **argv)
      {
          const int ARRAY_SIZE = 64;
          const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

          // generate the input array on the host
          float h_in[ARRAY_SIZE];
          for (int i = 0; i < ARRAY_SIZE; i++) {
              h_in[i] = float(i);
          }
          float h_out[ARRAY_SIZE];

          // declare GPU memory pointers
          float *d_in;
          float *d_out;

          // allocate GPU memory
          cudaMalloc((void **)&d_in, ARRAY_BYTES);
          cudaMalloc((void **)&d_out, ARRAY_BYTES);

          // transfer the array to the GPU
          cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

          // launch the kernel with one block of ARRAY_SIZE threads
          square<<<1, ARRAY_SIZE>>>(d_out, d_in);

          // copy back the result array to the CPU
          cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

          // print out the resulting array
          for (int i = 0; i < ARRAY_SIZE; i++) {
              printf("%f", h_out[i]);
              printf(((i % 4) != 3) ? "\t" : "\n");
          }

          cudaFree(d_in);
          cudaFree(d_out);

          return 0;
      }


