Basics of CUDA Programming: Part 5
Kernels
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute it for a given kernel call is specified using the <<<…>>> execution configuration syntax. Each thread that executes the kernel is given a thread ID, unique within its block, that is accessible inside the kernel through the built-in threadIdx variable.
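Syntax:
Kernel_Name<<< GridSize, BlockSize, SMEMSize, Stream >>>(arg, ...);
Where:
GridSize : the number of blocks in the grid.
BlockSize : the number of threads in each block.
SMEMSize : the number of bytes of shared memory allocated dynamically per block at run time (optional; defaults to 0).
Stream : the stream on which the kernel will execute (optional; defaults to the default stream).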
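To make the four launch parameters concrete, here is a minimal sketch that passes all of them. The kernel ReverseInBlock and the host helper ReverseOnDevice are hypothetical names invented for this illustration, and the sketch assumes n fits in a single block (at most 1024 threads on current GPUs).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reverses n values within one block via dynamic shared memory.
__global__ void ReverseInBlock(float* data, int n)
{
    extern __shared__ float tile[];   // sized by the third launch argument (SMEMSize)
    int i = threadIdx.x;
    if (i < n) tile[i] = data[i];
    __syncthreads();                  // every thread must finish writing tile first
    if (i < n) data[i] = tile[n - 1 - i];
}

// Hypothetical host helper: reverses hData in place on the GPU.
cudaError_t ReverseOnDevice(float* hData, int n)
{
    size_t bytes = n * sizeof(float);
    float* dData = nullptr;
    cudaStream_t stream = nullptr;

    cudaError_t err = cudaStreamCreate(&stream);
    if (err != cudaSuccess) return err;

    err = cudaMalloc((void**)&dData, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(dData, hData, bytes, cudaMemcpyHostToDevice, stream);
    if (err == cudaSuccess) {
        // GridSize = 1 block, BlockSize = n threads, SMEMSize = bytes, Stream = stream
        ReverseInBlock<<<1, n, bytes, stream>>>(dData, n);
        err = cudaGetLastError();     // reports invalid launch configurations
    }
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(hData, dData, bytes, cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess)
        err = cudaStreamSynchronize(stream);  // wait for the copies and the kernel

    cudaFree(dData);
    cudaStreamDestroy(stream);
    return err;
}

int main()
{
    const int n = 8;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaError_t err = ReverseOnDevice(h, n);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) printf("%g ", h[i]);
    printf("\n");
    return 0;
}
```

If SMEMSize and Stream are left out, as in VecAdd<<<1, N>>>, the kernel gets no dynamic shared memory and runs on the default stream.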
Example:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
Here, each of the N threads that execute VecAdd() performs one pair-wise addition.
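The host code around the kernel call is elided in the example. The following is one possible way to fill it in, not the original author's code: the helper name AddOnDevice, the vector length of 256, and the input values are assumptions made for illustration. Since the kernel indexes only by threadIdx.x and is launched with one block, n must stay within the per-block thread limit (1024 on current GPUs).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel definition, same as in the example
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical host helper: copies the inputs to the GPU, launches VecAdd
// with one block of n threads, and copies the result back.
cudaError_t AddOnDevice(const float* hA, const float* hB, float* hC, int n)
{
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc((void**)&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        VecAdd<<<1, n>>>(dA, dB, dC);   // kernel invocation with n threads
        err = cudaGetLastError();       // reports invalid launch configurations
    }
    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);                       // cudaFree on a null pointer is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return err;
}

int main()
{
    const int N = 256;                  // assumed length; must fit in one block
    float A[N], B[N], C[N];
    for (int i = 0; i < N; ++i) {
        A[i] = (float)i;
        B[i] = 2.0f * i;
    }

    cudaError_t err = AddOnDevice(A, B, C, N);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("C[0] = %g, C[%d] = %g\n", C[0], N - 1, C[N - 1]);
    return 0;
}
```

The file can be compiled with nvcc. Vectors longer than one block need more blocks and an index that also uses blockIdx, which this sketch deliberately leaves out.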
You may be wondering how a grid is organized in terms of blocks, and a block in terms of threads; read this post.
Feel free to comment...
References
CUDA C Programming Guide
CUDA; Nvidia
int* dev_dynamic_ptr;
cudaMalloc((void**)&dev_dynamic_ptr, dynamic_size);
OpenMP program
#include <iostream>
#include <vector>
#include <omp.h>

int main() {
    const int SIZE = 1001; // Array size is 1001 to include 0 to 1000
    const int NUM_THREADS = 10;
    const int CHUNK = SIZE / NUM_THREADS; // Elements per thread; the last thread also takes the remainder
    int sum = 0; // For the total sum of the array
    std::vector<float> averages(NUM_THREADS, 0.0f); // To store averages computed by each thread

    // Initialize the array with values 0 to 1000
    std::vector<int> array(SIZE);
    for (int i = 0; i < SIZE; ++i) {
        array[i] = i;
    }

    // Assumes the runtime actually provides all NUM_THREADS threads
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num(); // Get the thread ID
        int start = id * CHUNK;
        // The last thread runs to SIZE so that element 1000 is not skipped
        int end = (id == NUM_THREADS - 1) ? SIZE : start + CHUNK;
        int thread_sum = 0; // Sum for each thread
        for (int i = start; i < end; ++i) {
            thread_sum += array[i];
        }
        float thread_average = static_cast<float>(thread_sum) / (end - start);
        averages[id] = thread_average; // Store the average for the thread

        #pragma omp atomic
        sum += thread_sum; // Update the global sum atomically
    }

    // Output the result array (averages) to console
    std::cout << "Averages: ";
    for (float avg : averages) {
        std::cout << avg << " ";
    }
    std::cout << std::endl;

    // Output the sum of the whole array
    std::cout << "Sum of array: " << sum << std::endl;
    return 0;
}