How to Avoid cudaMalloc() for Intermediate Calculations: Using cudaGetSymbolAddress() in CUDA



Hello readers, today we'll look at a very interesting optimization technique that I used in an online CUDA programming competition. Surprisingly, my run time dropped by about 2 ms. You may be thinking that is not a big optimization, but let me tell you: if you are computing something on the GPU that needs a lot of intermediate memory, and you have been allocating that memory with cudaMalloc, then cudaGetSymbolAddress() can noticeably speed up your code.

Well, it depends on the application, but wherever possible we can use this function as an optimization.
In this tutorial we'll follow these steps:

Step 1: Gather information about cudaGetSymbolAddress()
Step 2: Examples of cudaGetSymbolAddress()
Step 3: Where to use and where not to use cudaGetSymbolAddress()
Step 4: How does it optimize your code?

OK, so let's start our discussion of cudaGetSymbolAddress().

cudaGetSymbolAddress() in CUDA
In many cases, computing a function on the GPU requires a lot of intermediate memory allocation. cudaGetSymbolAddress() is a way out of paying for that allocation at run time.
Frankly, this function does not allocate memory for you; it gives you a handle (a device pointer) to memory that was already allocated statically. Surprising, right?

OK, let's start with its prototype:

cudaError_t cudaGetSymbolAddress (void **devPtr, const void *symbol);


This function writes into *devPtr the address of the given symbol on the device. The symbol must be a variable that resides in global or constant memory space. (In CUDA releases before 5.0, the symbol could also be passed as a character string naming such a variable; that string form has since been removed.)
For more, see the cudaGetSymbolAddress entry in the CUDA Runtime API reference.
The call is made from host code. The following examples will show you how to use this function.
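
As a quick first look, here is a minimal sketch (the variable names scale and bias are my own illustration) showing that symbols in constant and global memory are queried the same way:

#include <cuda_runtime.h>
#include <cstdio>

__constant__ float scale = 2.0f;   // symbol in constant memory
__device__   float bias  = 1.0f;   // symbol in global memory

int main()
{
    float *scalePtr, *biasPtr;

    // Both kinds of symbol are accepted by cudaGetSymbolAddress().
    cudaGetSymbolAddress((void **)&scalePtr, scale);
    cudaGetSymbolAddress((void **)&biasPtr,  bias);

    printf("scale @ %p, bias @ %p\n", (void *)scalePtr, (void *)biasPtr);
    return 0;
}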

Examples of cudaGetSymbolAddress()
A simple example
Example1.cu
#include"cuda.h"

__device__ double param = 0.5;

int main()
{
double* h_example;
h_example = (double*)malloc(10*sizeof(double));

for(int i = 0; i < 10; i++)
{
    h_example[i] = i;
}

double* d_example;
cudaMalloc(&d_example, 10*sizeof(double));
cudaMemcpy(d_example, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);

double* paramptr;
cudaGetSymbolAddress(&paramptr, (const char *)"param");
return 0;

}

In the example above, the key lines are the __device__ declaration and the cudaGetSymbolAddress() call. According to the function's definition, the variable must reside in either global or constant memory; in our case it resides in global memory:


__device__ double param = 0.5;


This is a static allocation of the variable "param"; to get its device pointer we use the following line:

cudaGetSymbolAddress((void **)&paramptr, param);

In CUDA releases before 5.0, this could also be written with the symbol name as a string:

cudaGetSymbolAddress((void **)&paramptr, "param"); // legacy string form, removed in CUDA 5.0

Now you can refer to this memory from your device/kernel code through "paramptr", for example by passing it to a kernel, as in the sketch below.
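
Here is a minimal complete sketch of that idea (the kernel name scaleByParam is my own illustration, not part of the API):

#include <cuda_runtime.h>
#include <cstdio>

__device__ double param = 0.5;

// Hypothetical kernel: doubles the __device__ variable through the pointer.
__global__ void scaleByParam(double *p)
{
    *p *= 2.0;
}

int main()
{
    double *paramptr;
    cudaGetSymbolAddress((void **)&paramptr, param);

    scaleByParam<<<1, 1>>>(paramptr);
    cudaDeviceSynchronize();

    double result;
    cudaMemcpy(&result, paramptr, sizeof(double), cudaMemcpyDeviceToHost);
    printf("param = %f\n", result);   // prints 1.000000
    return 0;
}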
Still, Example1.cu is not very informative about cudaGetSymbolAddress on its own; Example2.cu will show you how it works with arrays.

Example2.cu
#include"cuda.h"

__device__ double param [10]; //Line 1

int main()
{
double* h_example;
h_example = (double*)malloc(10*sizeof(double));

for(int i = 0; i < 10; i++)
{
    h_example[i] = i;
}

double* d_example;
cudaMalloc(&d_example, 10*sizeof(double));
cudaMemcpy(d_example, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);

double* paramptr;
cudaGetSymbolAddress((void **)&paramptr, param); //Line 2
return 0;

}

The only real change from Example1.cu is that param is now an array:

__device__ double param [10];


Now we are statically allocating 10*sizeof(double) bytes, and with Line 2 we get a device pointer to that memory.

Now we can use this pointer in a kernel call, as demonstrated by Example3.cu:
#include"cuda.h"

__device__ double param [10]; //Line 1

__global__ void
Kernel (double *ptr)
{
//do some task with ptr

}
int main()
{
double* h_example;
h_example = (double*)malloc(10*sizeof(double));

for(int i = 0; i < 10; i++)
{
    h_example[i] = i;
}

double* d_example;
cudaMalloc(&d_example, 10*sizeof(double));
cudaMemcpy(d_example, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);

double* paramptr;
cudaGetSymbolAddress((void **)&paramptr, param); //Line 2

//call kernel
int numberofthreads = 128 ;
int numberofblocks = 20 ;
Kernel<<<numberofblocks, numberofthreads >>> (paramptr) ;

return 0;

}


You can also copy data from the host into the device array through this pointer with cudaMemcpy, just as you would with memory from cudaMalloc:

cudaMemcpy(paramptr, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);
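
Alternatively (not used in the examples above, but a standard runtime API call), cudaMemcpyToSymbol copies to the symbol directly, so you do not even need the pointer:

// Same copy, addressing the symbol directly; the offset parameter defaults
// to 0 and the kind defaults to cudaMemcpyHostToDevice.
cudaMemcpyToSymbol(param, h_example, 10*sizeof(double));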



Where to use and where not to use cudaGetSymbolAddress()
It is important to know where to use this function and where not to. There is no hard and fast rule, but my recommendation is to use it only for intermediate buffers whose size is known at compile time.
For example, if your application needs a reduction over its input data (say, a sum reduction), you first reduce each block's data and store these partial results in an intermediate array, then apply the reduction again to that intermediate array to get the final value. Allocating this intermediate array with cudaMalloc may cost up to 1 ms, but with a static allocation and cudaGetSymbolAddress you can save 0.9 ms or more; the address lookup itself takes only microseconds. A sketch of this pattern follows.
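
Here is a minimal sketch of that pattern (the sizes, names, and reduction kernel are my own illustration, not the competition code): the per-block partial sums land in a statically declared __device__ array, so no cudaMalloc is needed for the intermediate result.

#include <cuda_runtime.h>
#include <cstdio>

#define N        2560
#define THREADS  128
#define BLOCKS   (N / THREADS)   // 20 blocks

// Static intermediate array: one partial sum per block, no cudaMalloc needed.
__device__ float partialSums[BLOCKS];

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float cache[THREADS];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    // standard shared-memory tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];
}

int main()
{
    float h_in[N];
    for (int i = 0; i < N; i++) h_in[i] = 1.0f;   // sum should equal N

    float *d_in;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // Pointer to the static intermediate array instead of a second cudaMalloc.
    float *d_partial;
    cudaGetSymbolAddress((void **)&d_partial, partialSums);

    blockSum<<<BLOCKS, THREADS>>>(d_in, d_partial, N);       // pass 1: per-block sums
    blockSum<<<1, THREADS>>>(d_partial, d_partial, BLOCKS);  // pass 2: BLOCKS <= THREADS

    float sum;
    cudaMemcpy(&sum, d_partial, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f (expected %d)\n", sum, N);

    cudaFree(d_in);
    return 0;
}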

How does it optimize your code?
The answer to this question is straightforward. If you allocate memory with cudaMalloc, you are allocating dynamically at run time, and each call has a measurable cost. But if you can estimate in advance how much memory a particular array needs, you can allocate it statically, which is fast compared with dynamic allocation because the memory is set up when the module is loaded rather than during your computation.
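
If you want to check this on your own hardware, here is a simple timing sketch (buffer size and names are my own; the numbers I quoted above will vary by GPU and driver). Both calls are synchronous on the host, so host-side clocks are sufficient:

#include <cuda_runtime.h>
#include <cstdio>
#include <chrono>

__device__ double staticBuf[1 << 20];   // static alternative: allocated at module load

int main()
{
    cudaFree(0);   // force CUDA context creation so it isn't counted below

    using clock = std::chrono::steady_clock;

    // Time a dynamic allocation of the same size.
    double *dynBuf;
    auto t0 = clock::now();
    cudaMalloc(&dynBuf, (1 << 20) * sizeof(double));
    auto t1 = clock::now();

    // Time the address lookup for the equally sized static buffer.
    double *staticPtr;
    auto t2 = clock::now();
    cudaGetSymbolAddress((void **)&staticPtr, staticBuf);
    auto t3 = clock::now();

    printf("cudaMalloc:           %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("cudaGetSymbolAddress: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count());

    cudaFree(dynBuf);
    return 0;
}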

I hope you liked this article.

Got Questions?
Feel free to ask me any question; I'd be happy to walk you through it step by step!
