Performance of sqrt in CUDA
Taking the square root of a floating
point number is essential in many engineering applications. Whether you are
doing nBody simulations, simulating molecules, or linear algebra, the ability
to accurately and quickly perform thousands or even millions of square root
operations is essential. Unfortunately, the square root functions on most CPUs
are very time consuming, even with specialized SSE instructions. Fortunately
enough, GPUs have specialized hardware to perform such square root operations
extremely fast. CUDA, NVidia’s solution to extremely high performance parallel
computing, puts the onboard specialized hardware to full use, and easily
outperforms modern Intel or AMD CPUs by a factor of over a hundred.
Example
problem
The example problem for this article
is a reduced nbody problem. We are given a sets of (x,y) coordinates. For this
article, we can have anywhere from 512 to 65536 coordinates, or elements. In
order to simplify this example as much as possible, all need to do is calculate
the sum of distances to all other points, for each point. In a simulation
problem such as this, the algorithm complexity of if the order N squared, which
means we will need vast computational power as N, or the number of elements, is
increased.
CPU
approach
This problem is very simple, and can
be approached easily with C or C++. Below is some sample code. Note that it has
not been optimized for SSE, but it is debatable as to whether or not some
compilers will be able to automatically vectorize such code.
void getDistCPU(float *h_result, float *h_X, float *h_Y, int nElems)
{
for (int
i=0 ; i <
nElems; i++)
{
float
localX = h_X[i];
float
localY = h_Y[i];
float
accumDist =
0;
for (int
j=0; j <
nElems; j++)
{
float
curX = h_X[j]-localX;
float
curY = h_Y[j]-localY;
accumDist += sqrt((curX * curX) + (curY*curY));
}
h_result[i] =
accumDist;
}
}
CUDA
approach
Because CUDA code is basically C
code, it has extreme similarity to the CPU code. However, instead of having two
imbedded for loops, it is easier to simply have each CUDA thread calculate the
result for one element. Thus, the CUDA kernel only needs one for loop.
__global__ void
CalculateDist(float *pX, float *pY, float *pResult,
int
totalElements)
{
int index =
blockIdx.x * blockDim.x + threadIdx.x;
float localX = pX[index];
float localY = pY[index];
float accumulatedDist = 0;
for (int i=0; i < totalElements; i++)
{
float curX = pX[i];
float curY = pY[i];
curX = curX -
localX; //curX is now distance
curY = curY -
localY;
accumulatedDist += sqrt((curX * curX) + (curY*curY));
}
pResult[index] =
accumulatedDist;
}
|
Performance
Of course, performance is much
faster on a GPU than it is on a CPU. The card used for this experiment is an
underclocked GTX 280. The CPU is a 2.66Ghz Core 2 Duo, however only one core is
being utilized for this experiment. For 65536 elements, the CUDA code executed
over 172 times as fast as CPU code. In fact, even for very small numbers of
elements, 256 for example, the CUDA code was still able to outperform the CPU
by 40%. For these calculations, all overhead, including copying memory from the
host to the GPU and executing the kernel are taken into consideration. Clearly,
if your computational algorithm requires many square roots to be performed, the
performance of CUDA will far exceed that of a CPU.
Feel
free to ask me any question because I'd be happy to walk you through step by
step!
For
Contact us….. Click on Contact us Tab
Where is the full code. Can you please show the full cuda code programs.
ReplyDelete#include
Delete__global__ void square(float * d_out, float * d_in){
int idx = threadIdx.x;
float f = d_in[idx];
d_out[idx] = f * f;
}
int main(int argc, char ** argv) {
const int ARRAY_SIZE = 64;
const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
// generate the input array on the host
float h_in[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
h_in[i] = float(i);
}
float h_out[ARRAY_SIZE];
// declare GPU memory pointers
float * d_in;
float * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
// transfer the array to the GPU
cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel
square<<>>(d_out, d_in);
// copy back the result array to the CPU
cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
// print out the resulting array
for (int i =0; i < ARRAY_SIZE; i++) {
printf("%f", h_out[i]);
printf(((i % 4) != 3) ? "\t" : "\n");
}
cudaFree(d_in);
cudaFree(d_out);
return 0;
}
ReplyDeleteThis is an informative post and it is very useful and knowledgeable. therefore, I would like to thank you for the efforts you have made in writing this article.
iphone app training course
iphone app training course in bangalore
ios app development in hyderabad