How to avoid cudaMalloc() in intermediate calculations | How to use cudaGetSymbolAddress() in CUDA
Hello readers, today we'll look at an interesting optimization technique that I used in an online CUDA programming competition. Surprisingly, it reduced my running time by 2 ms. You may think that is not a big gain, but it matters: if you are computing something on the GPU that needs many intermediate buffers allocated with cudaMalloc, then cudaGetSymbolAddress() can speed up your code.
Of course, it depends on the application. But wherever it applies, this function is worth using.
In this tutorial we'll follow these steps:
Step 1: Gather information about cudaGetSymbolAddress()
Step 2: Example of cudaGetSymbolAddress()
Step 3: Where to use and where not to use cudaGetSymbolAddress()
Step 4: How it optimizes your code?
OK, let's start our discussion of cudaGetSymbolAddress().
cudaGetSymbolAddress() in CUDA
In many cases, computing a function on the GPU requires a lot of intermediate memory. cudaGetSymbolAddress() is a way out of allocating that memory again and again.
To be precise, this function does not allocate memory for you; it gives you a handle (a device pointer) to memory that was already allocated statically. Surprising, right?
OK, let's start with its prototype:
cudaError_t cudaGetSymbolAddress(void **devPtr, const void *symbol);
This function returns in *devPtr the address of the symbol 'symbol' on the device. The symbol must be a variable that resides in global or constant memory space. Older toolkits declared the parameter as const char *symbol and also accepted a character string naming such a variable, but that form was deprecated in CUDA 4.1 and removed in CUDA 5.0, so pass the variable itself.
For more, refer to the CUDA Runtime API documentation.
This call is made in host code. The following examples show how to use this function.
Example of cudaGetSymbolAddress()
A simple example
Example1.cu
#include <cstdlib>
#include <cuda_runtime.h>

// Statically allocated variable in device global memory
__device__ double param = 0.5;

int main()
{
    // Host array with some sample data
    double* h_example = (double*)malloc(10*sizeof(double));
    for(int i = 0; i < 10; i++)
    {
        h_example[i] = i;
    }
    // Ordinary dynamic allocation, shown for comparison
    double* d_example;
    cudaMalloc((void **)&d_example, 10*sizeof(double));
    cudaMemcpy(d_example, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);
    // Get a device pointer to the statically allocated variable
    double* paramptr;
    cudaGetSymbolAddress((void **)&paramptr, param);
    cudaFree(d_example);
    free(h_example);
    return 0;
}
In the above example, the key lines are the declaration of param and the call to cudaGetSymbolAddress. By definition the variable must reside in either global or constant memory; in our case it resides in global memory:
__device__ double param = 0.5;
This is a static allocation of the variable "param". To get its device pointer we use the following line:
cudaGetSymbolAddress((void **)&paramptr, param);
On toolkits older than CUDA 5.0 this line could also be written with the name of the symbol as a string:
cudaGetSymbolAddress((void **)&paramptr, (const char *)"param"); // string form, removed in CUDA 5.0
Now you can pass "paramptr" to your kernels to refer to this variable in device code.
Example1.cu does not tell us much about cudaGetSymbolAddress, so Example2.cu shows the same idea with arrays.
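Example2.cu
#include <cstdlib>
#include <cuda_runtime.h>

__device__ double param [10]; //Line 1

int main()
{
    // Host array with some sample data
    double* h_example = (double*)malloc(10*sizeof(double));
    for(int i = 0; i < 10; i++)
    {
        h_example[i] = i;
    }
    // Ordinary dynamic allocation, shown for comparison
    double* d_example;
    cudaMalloc((void **)&d_example, 10*sizeof(double));
    cudaMemcpy(d_example, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);
    // Get a device pointer to the statically allocated array
    double* paramptr;
    cudaGetSymbolAddress((void **)&paramptr, param); //Line 2
    cudaFree(d_example);
    free(h_example);
    return 0;
}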
The only real change in the code is this line:
__device__ double param [10];
Now we are statically allocating 10*sizeof(double) bytes, and with Line 2 we get a pointer to this memory.
We can then pass this pointer to a kernel call, as demonstrated by Example3.cu.
#include <cstdlib>
#include <cuda_runtime.h>

__device__ double param [10]; //Line 1

__global__ void Kernel (double *ptr)
{
    //do some task with ptr
}

int main()
{
    // Host array with some sample data
    double* h_example = (double*)malloc(10*sizeof(double));
    for(int i = 0; i < 10; i++)
    {
        h_example[i] = i;
    }
    // Ordinary dynamic allocation, shown for comparison
    double* d_example;
    cudaMalloc((void **)&d_example, 10*sizeof(double));
    cudaMemcpy(d_example, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);
    // Get a device pointer to the statically allocated array
    double* paramptr;
    cudaGetSymbolAddress((void **)&paramptr, param); //Line 2
    //call kernel
    int numberofthreads = 128 ;
    int numberofblocks = 20 ;
    Kernel<<<numberofblocks, numberofthreads>>>(paramptr);
    // Wait for the kernel to finish before releasing resources
    cudaDeviceSynchronize();
    cudaFree(d_example);
    free(h_example);
    return 0;
}
You can also copy data from the host to the device array with cudaMemcpy, just as you would with memory obtained from cudaMalloc:
cudaMemcpy(paramptr, h_example, 10*sizeof(double), cudaMemcpyHostToDevice);
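To see the whole round trip in one place, here is a minimal sketch. The kernel name scale and the helper run_roundtrip are made up for illustration and are not part of the examples above; the sketch assumes a CUDA 5.0 or newer toolkit. It copies host data into the static array through the pointer, modifies it in a kernel, and copies it back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 10

// Statically allocated device array in global memory
__device__ double param[N];

// Hypothetical kernel for illustration: doubles every element in place
__global__ void scale(double *ptr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ptr[i] *= 2.0;
}

// Hypothetical helper: returns 0 on success, 1 on a CUDA error or a wrong result
int run_roundtrip()
{
    double h_in[N], h_out[N];
    for (int i = 0; i < N; i++)
        h_in[i] = i;

    // Get the device address of the static array (no allocation happens here)
    double *paramptr = NULL;
    if (cudaGetSymbolAddress((void **)&paramptr, param) != cudaSuccess)
        return 1;

    // Host -> device copy through the pointer
    if (cudaMemcpy(paramptr, h_in, N * sizeof(double), cudaMemcpyHostToDevice) != cudaSuccess)
        return 1;

    scale<<<1, 128>>>(paramptr, N);
    if (cudaGetLastError() != cudaSuccess)
        return 1;

    // Device -> host copy; this waits for the kernel in the default stream
    if (cudaMemcpy(h_out, paramptr, N * sizeof(double), cudaMemcpyDeviceToHost) != cudaSuccess)
        return 1;

    for (int i = 0; i < N; i++)
        if (h_out[i] != 2.0 * h_in[i])
            return 1;
    return 0;
}

int main()
{
    printf("%s\n", run_roundtrip() == 0 ? "ok" : "failed");
    return 0;
}
```

Note that nothing is freed for param: static device memory lives as long as the program's device code is loaded.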
Where to use and where not to use cudaGetSymbolAddress()
It is necessary to know where to use this function and where not to. There is no hard and fast rule, but my recommendation is to use this function only for intermediate operations.
For example, suppose your application needs a reduction over the input data (say, a sum reduction). You first reduce the data of each block and store these partial results in an intermediate array, then apply the reduction again to that intermediate array to get the final value. Allocating this intermediate array with cudaMalloc may cost up to 1 ms. With a static allocation and cudaGetSymbolAddress you can save 0.9 ms or more of that: the lookup takes only a small fraction of a millisecond, probably microseconds.
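The two-stage sum reduction described above could be sketched as follows. This is only an illustration under some assumptions: the names partial, total, blockSum and sumOnDevice are invented here, the block size is a fixed power of two, MAX_BLOCKS is an assumed upper bound on the number of blocks, and n is assumed to be greater than zero. The input buffer is still allocated with cudaMalloc because its size is only known at run time; only the intermediate storage is static.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128      // threads per block, must be a power of two
#define MAX_BLOCKS 64    // assumed upper bound on the number of blocks

// Static intermediate storage: one partial sum per block, plus the final value
__device__ float partial[MAX_BLOCKS];
__device__ float total;

// Each block sums a slice of 'in' and writes one value to out[blockIdx.x]
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float cache[THREADS];
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        sum += in[i];
    cache[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];
}

// Hypothetical helper: sums n floats (n > 0); returns -1.0f on a CUDA error
float sumOnDevice(const float *h_in, int n)
{
    float *d_in = NULL, *d_partial = NULL, *d_total = NULL;
    float h_total = -1.0f;
    int blocks = (n + THREADS - 1) / THREADS;
    if (blocks > MAX_BLOCKS)
        blocks = MAX_BLOCKS;

    if (cudaMalloc((void **)&d_in, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    bool ok = cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess
           // No cudaMalloc for the intermediate data: just look up the addresses
           && cudaGetSymbolAddress((void **)&d_partial, partial) == cudaSuccess
           && cudaGetSymbolAddress((void **)&d_total, total) == cudaSuccess;
    if (ok) {
        blockSum<<<blocks, THREADS>>>(d_in, d_partial, n);     // stage 1: per-block sums
        blockSum<<<1, THREADS>>>(d_partial, d_total, blocks);  // stage 2: sum of partial sums
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    return ok ? h_total : -1.0f;
}

int main()
{
    float h_in[1000];
    for (int i = 0; i < 1000; i++)
        h_in[i] = 1.0f;
    printf("sum = %.1f\n", sumOnDevice(h_in, 1000));
    return 0;
}
```

The trade-off is that MAX_BLOCKS has to be chosen at compile time, so this fits best when the size of the intermediate array can be bounded in advance.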
How it optimizes your code?
The answer to this question is straightforward. When you allocate memory with cudaMalloc, you are allocating it dynamically at run time. If you can estimate in advance how much memory a particular array will need, you can allocate it statically with __device__, which is faster than dynamic allocation: the memory is reserved when the program's device code is loaded, and cudaGetSymbolAddress only looks up its address.
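If you want to check the difference on your own machine, a rough timing sketch could look like the one below. The names scratch, msSince and compareAllocation are hypothetical, and the measurement is a single sample taken with a host timer. The actual numbers depend on the GPU, driver and toolkit, and the first access to a symbol may include one-time loading costs, so treat the output as an indication only.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024

// Statically allocated scratch buffer in device global memory
__device__ double scratch[N];

// Milliseconds elapsed since 'start'
static double msSince(std::chrono::steady_clock::time_point start)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
}

// Hypothetical helper: times both approaches once; returns 0 on success, 1 on a CUDA error
int compareAllocation(double *mallocMs, double *symbolMs)
{
    cudaFree(0);  // force context creation so it is not counted below

    // Dynamic allocation at run time
    double *d_dyn = NULL;
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t e1 = cudaMalloc((void **)&d_dyn, N * sizeof(double));
    *mallocMs = msSince(t0);

    // Address lookup for the static buffer
    double *d_static = NULL;
    auto t1 = std::chrono::steady_clock::now();
    cudaError_t e2 = cudaGetSymbolAddress((void **)&d_static, scratch);
    *symbolMs = msSince(t1);

    if (e1 == cudaSuccess)
        cudaFree(d_dyn);
    return (e1 == cudaSuccess && e2 == cudaSuccess) ? 0 : 1;
}

int main()
{
    double mallocMs = 0.0, symbolMs = 0.0;
    if (compareAllocation(&mallocMs, &symbolMs) != 0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("cudaMalloc:           %f ms\n", mallocMs);
    printf("cudaGetSymbolAddress: %f ms\n", symbolMs);
    return 0;
}
```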
I hope you liked this article.
Got Questions?
Feel free to ask me any
question because I'd be happy to walk you through step by step!