Vector Addition in CUDA (CUDA C/C++ program for Vector Addition)
This article gives you detailed description about Vector Addition in CUDA C. This article will also let you know how to lunch kernel in CUDA with variable number of threads with limitation.
If you are not clear with idea of thread and block architecture and how to decide. please go through this link
CPU vector sums
First we’ll look at one way this addition can be accomplished with traditional C code:
#define N 10
void add ( int *a, int *b, int *c )
{
int tid = 0; // this is CPU zero, so we start at zero
while (tid < N)
{
c[tid] = a[tid] + b[tid];
tid += 1; // we have one CPU, so we increment by one
}
}
int main( void )
{
int a[N], b[N], c[N];
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++)
{
a [i] = -i;
b[i] = i * i;
}
add( a, b, c );
// display the results
for (int i=0; i<N; i++)
{
printf( "%d + %d = %d\n", a[i], b[i], c[i] );
}
return 0;
}
|
Most of the articles bears almost no explanation, but we will briefly look at the add() function to explain why we overly complicated it. One would typically code this in a slightly simpler
manner, like so:
void add( int *a, int *b, int *c )
{
for (i=0; i < N; i++)
{
c[i] = a[i] + b[i];
}
}
|
Our slightly more convoluted method was intended to suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores. For example, with a dual-core processor, one could change the increment to 2 and have one core initialize the loop with tid = 0 and another with tid = 1. The first core would add the even-indexed elements, and the second core would add the odd indexed elements. This amounts to executing the following code on each of the two CPU cores:
CPU CORE 1
void add ( int *a, int *b, int *c )
{
int tid = 0;
while (tid < N)
{
c[tid] = a[tid] + b[tid];
tid += 1;
}
}
|
CPU CORE 2
void add ( int *a, int *b, int *c )
{
int tid = 1;
while (tid < N)
{
c[tid] = a[tid] + b[tid];
tid += 2;
}
}
|
Of course, doing this on a CPU would require considerably more code than we have included in this example. You would need to provide a reasonable amount of infrastructure to create the worker threads that execute the function add() as well as make the assumption that each thread would execute in parallel, a scheduling assumption that is unfortunately not always true.
GPU vector sums
We can accomplish the same addition very similarly on a GPU by writing add() as a device function. This should look similar to code you saw in the previous chapter. But before we look at the device code, we present main(). Although the GPU implementation of main() is different from the corresponding CPU version, nothing here should look new:
Using only Threads
#include <stdio.h>
#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )
static void HandleError( cudaError_t err, const char *file, int line )
{
if (err != cudaSuccess)
{
printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
file, line );
exit( EXIT_FAILURE );
}
}
const short N = 10 ;
// CUDA Kernel for Vector Addition
__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)
{
//Get the id of thread within a block
unsigned short tid = threadIdx.x ;
if ( tid < N ) // check the boundry condition for the threads
dev_c [tid] = dev_a[tid] + dev_b[tid] ;
}
int main (void)
{
//Host array
int Host_a[N], Host_b[N], Host_c[N];
//Device array
int *dev_a , *dev_b, *dev_c ;
//Allocate the memory on the GPU
HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );
HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );
HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );
//fill the Host array with random elements on the CPU
for ( int i = 0; i <N ; i++ )
{
Host_a[i] = -i ;
Host_b[i] = i*i ;
}
//Copy Host array to Device array
HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));
HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));
//Make a call to GPU kernel
Vector_Addition <<< 1, N >>> (dev_a , dev_b , dev_c ) ;
//Copy back to Host array from Device array
HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));
//Display the result
for ( int i = 0; i<N; i++ )
printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;
//Free the Device array memory
cudaFree (dev_a) ;
cudaFree (dev_b) ;
cudaFree (dev_c) ;
system("pause");
return 0 ;
}
|
In this code, we are launching with a number in the angle brackets :
Vector_Addition<<<1,N>>>( dev _ a, dev _ b, dev _ c );
What gives?
Recall that we left those two numbers in the angle brackets unexplained; we stated vaguely that they were parameters to the runtime that describe how to launch the kernel. Well, the first number in those parameters represents the number of parallel blocks in which we would like the device to execute our kernel. In this case, we’re passing the value N for this parameter. For example, if we launch with Vector_Addition<<<1,2>>>(), you can think of the runtime creating two copies of the kernel and running them in parallel. We call each of these parallel invocations a thread. With Vector_Addition<<<1,256>>>(), you would get 256 threads running on the GPU. Parallel programming has never been easier. But this raises an excellent question: The GPU runs N copies of our kernel code, but how can we tell from within the code which block is currently running? This question brings us to the second new feature of the example, the kernel code itself. Specifically, it brings us to the variable threadIdx.x:
If you recall the CPU-based example with which we began, you will recall that we needed to walk through indices from 0 to N-1 in order to sum the two vectors. Since the runtime system is already launching a kernel where each block will have one of these indices, nearly all of this work has already been done for us. Because we’re something of a lazy lot, this is a good thing. It affords us more time to blog, probably about how lazy we are.
The last remaining question to be answered is, why do we check whether tid is less than N? It should always be less than N, since we’ve specifically launched our kernel such that this assumption holds. But our desire to be lazy also makes us paranoid about someone breaking an assumption we’ve made in our code. Breaking code assumptions means broken code.
Using only blocks
Using only blocks
#include <stdio.h>
#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )
static void HandleError( cudaError_t err, const char *file, int line )
{
if (err != cudaSuccess)
{
printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
file, line );
exit( EXIT_FAILURE );
}
}
const short N = 10 ;
// CUDA Kernel for Vector Addition
__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)
{
//Get the id of thread within a block
unsigned short tid = blockIdx.x ;
if ( tid < N ) // check the boundry condition for the threads
dev_c [tid] = dev_a[tid] + dev_b[tid] ;
}
int main (void)
{
//Host array
int Host_a[N], Host_b[N], Host_c[N];
//Device array
int *dev_a , *dev_b, *dev_c ;
//Allocate the memory on the GPU
HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );
HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );
HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );
//fill the Host array with random elements on the CPU
for ( int i = 0; i <N ; i++ )
{
Host_a[i] = -i ;
Host_b[i] = i*i ;
}
//Copy Host array to Device array
HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));
HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));
//Make a call to GPU kernel
Vector_Addition <<< N, 1 >>> (dev_a , dev_b , dev_c ) ;
//Copy back to Host array from Device array
HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));
//Display the result
for ( int i = 0; i<N; i++ )
printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;
//Free the Device array memory
cudaFree (dev_a) ;
cudaFree (dev_b) ;
cudaFree (dev_c) ;
system("pause");
return 0 ;
}
|
Note
If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line const int N 10 to 100 or 500 to launch tens of thousands of parallel thread. Be warned, though: No dimension of your launch of blocks may exceed 512 (or 1024). This is simply a hardware-imposed limit, so you will start to see failures if you attempt launches with more blocks than this. In the next articles, we will see how to work within this limitation.
Using blocks and Threads both
For Better Implementation Read this article
Using blocks and Threads both
For Better Implementation Read this article
Got Questions?
Feel free to ask me any question because I'd be happy to walk you through step by step!
Reference and External Links
Feel free to ask me any question because I'd be happy to walk you through step by step!
Reference and External Links
hello there,
ReplyDeleteI'm just a newbie in cuda programming and my program shows that my cpu is faster than the gpu, do you happen to know anything about it?? thank you
Hello there,
ReplyDeleteThere may be lots of Reason for that, some of them are
1. Your gpu is not busy all the time; you should lunchat leas 192 threads to make your gpu busy
2. your transferring more data but you computing less on GPU; transferring more data will cause time increase
3. your thread block and grid block did not configure optimally; try to fine tune with your block size and grid size
you may calculate the how much time you are speding on computation and transfering data from gpu to cpu
use this link for performance matrix :
http://cuda-programming.blogspot.in/2013/01/how-to-implement-performance-metrics-in.html
you can also profile your code in visual profiler to get to know about kernel time excuation
Thank you
you can also refere this link (in the post; at last)
DeleteUsing blocks and Threads both
For Better Implementation Read this article
thanks for the reply i figured out what my problem was, and the link you shared was really helpful thanks bro.....
ReplyDeleteIt sounds good to hear from you that your problem got solved, most welcome, Keep visiting us....
DeletePlease put your suggestion in comment box , thank you
I think there's a mistake on the "CPU CORE 1" snippet above. It shoud be tid+=2, instead of tid+=1. After all you're moving forward 2 elements at a time on both cpu cores.
ReplyDeleteCode shows NO Cuda Capable device found.
ReplyDeleteIf I remove HANDLE ERROR and execute I get the output but garbage data.
The note you have at the bottom is incorrect. The maximum gird size is 65535x65535x65535. Each block may have no more than 1024 threads total.
ReplyDeleteNitin bhaiya...You copied most of it from CUDA by example book...Anyway good going bhaiya...
ReplyDeleteThis blog awesome and i learn a lot about programming from here.The best thing about this blog is that you doing from beginning to experts level.
ReplyDeleteIf loop iteration which sums corresponding elements of the two arrays and places the sum in the third array (C[i] = A[i] + B[i]). and addition for two large double array (>=1K)
ReplyDeleteHow to initialize for 1K input ?
Nitin bhaiya can I call you bhaiya ??
ReplyDeleteSuch news should be made more widely available.
ReplyDeletebitlis
ReplyDeletekastamonu
çorum
van
sakarya
1Z42İ
uşak
ReplyDeletevan
hakkari
elazığ
bingöl
7VW8RD
görüntülü show
ReplyDeleteücretlishow
LT0D
ankara parça eşya taşıma
ReplyDeletetakipçi satın al
antalya rent a car
antalya rent a car
ankara parça eşya taşıma
5JPJKQ
kastamonu evden eve nakliyat
ReplyDeleteantep evden eve nakliyat
balıkesir evden eve nakliyat
erzincan evden eve nakliyat
ankara evden eve nakliyat
MZNWLN
1C6CC
ReplyDeleteGümüşhane Lojistik
Van Parça Eşya Taşıma
Antalya Parça Eşya Taşıma
Ağrı Lojistik
Mersin Evden Eve Nakliyat
504BD
ReplyDeleteKırklareli Parça Eşya Taşıma
Iğdır Evden Eve Nakliyat
Zonguldak Lojistik
Mamak Fayans Ustası
Çerkezköy Halı Yıkama
Sweat Coin Hangi Borsada
Van Şehirler Arası Nakliyat
Mardin Lojistik
İzmir Şehir İçi Nakliyat
E1596
ReplyDeleteSiirt Parça Eşya Taşıma
Anc Coin Hangi Borsada
Mardin Parça Eşya Taşıma
Kırşehir Parça Eşya Taşıma
Kütahya Parça Eşya Taşıma
Batman Parça Eşya Taşıma
Kars Evden Eve Nakliyat
Muğla Şehirler Arası Nakliyat
Silivri Çatı Ustası
3294E
ReplyDeletereferans kodu
3E649
ReplyDeleteaksaray tamamen ücretsiz sohbet siteleri
düzce mobil sohbet
giresun goruntulu sohbet
nevşehir bedava sohbet uygulamaları
elazığ rastgele canlı sohbet
kırşehir bedava görüntülü sohbet sitesi
tunceli bedava görüntülü sohbet sitesi
maraş tamamen ücretsiz sohbet siteleri
canlı sohbet bedava
2554E
ReplyDeletebalıkesir sohbet odaları
aydın telefonda görüntülü sohbet
konya mobil sesli sohbet
hakkari telefonda görüntülü sohbet
random görüntülü sohbet
kütahya en iyi ücretsiz sohbet siteleri
ağrı parasız görüntülü sohbet uygulamaları
kadınlarla sohbet et
kırşehir bedava sohbet
6B418
ReplyDeleterastgele görüntülü sohbet uygulamaları
artvin nanytoo sohbet
şırnak rastgele sohbet
kayseri canlı görüntülü sohbet
bedava sohbet siteleri
siirt canlı sohbet sitesi
adana sesli görüntülü sohbet
antep bedava sohbet chat odaları
artvin kadınlarla rastgele sohbet
BC14E
ReplyDeleteAmasya Telefonda Kadınlarla Sohbet
bedava sohbet uygulamaları
Ardahan Sesli Sohbet Odası
canlı görüntülü sohbet siteleri
isparta ücretsiz sohbet uygulaması
istanbul canlı sohbet odası
Sakarya Tamamen Ücretsiz Sohbet Siteleri
giresun en iyi ücretsiz sohbet siteleri
kilis görüntülü sohbet kadınlarla
FDD18
ReplyDeleteUrfa Sesli Mobil Sohbet
mardin sesli sohbet sitesi
rastgele sohbet odaları
canlı sohbet et
Tokat Sohbet Odaları
yozgat canlı görüntülü sohbet odaları
bayburt sohbet sitesi
adana parasız sohbet siteleri
aksaray yabancı görüntülü sohbet uygulamaları
FEC27
ReplyDeletekütahya parasız görüntülü sohbet uygulamaları
Karabük Canlı Sohbet Odaları
Antep Canlı Sohbet Siteleri
Hakkari Bedava Görüntülü Sohbet Sitesi
bolu sohbet chat
erzurum mobil sohbet et
Adıyaman Canlı Görüntülü Sohbet Odaları
Edirne En İyi Ücretsiz Sohbet Uygulamaları
telefonda kadınlarla sohbet
0C7D9
ReplyDeleteSiirt En İyi Ücretsiz Görüntülü Sohbet Siteleri
bursa canlı görüntülü sohbet odaları
Antalya Rastgele Sohbet Odaları
mardin canlı sohbet odası
ankara canli goruntulu sohbet siteleri
ordu sesli sohbet siteler
burdur en iyi görüntülü sohbet uygulaması
diyarbakır en iyi ücretsiz sohbet uygulamaları
bayburt en iyi sesli sohbet uygulamaları
A6635
ReplyDeleteGörüntülü Sohbet
Kwai Beğeni Hilesi
Twitch İzlenme Satın Al
Gate io Borsası Güvenilir mi
Görüntülü Sohbet
Twitch İzlenme Hilesi
Tiktok Beğeni Satın Al
Binance Neden Tercih Edilir
Star Atlas Coin Hangi Borsada
55F88
ReplyDeleteCoin Nasıl Alınır
Lunc Coin Hangi Borsada
Pepecoin Coin Hangi Borsada
Bitcoin Nasıl Alınır
Coin Madenciliği Nasıl Yapılır
Chat Gpt Coin Hangi Borsada
Coin Üretme
Okex Borsası Güvenilir mi
Soundcloud Dinlenme Hilesi
63F75
ReplyDeleteledger wallet
safepal
looksrare
defilama
ledger desktop
poocoin
defillama
dexscreener
trezor suite
85D82
ReplyDeleteledger desktop
zkswap
chainlist
ledger wallet
solflare
raydium
aave
layerzero
phantom
A3E7FCCB5E
ReplyDeletetakipçi