
Tuesday, 25 December 2012

What is a Kernel in CUDA Programming

Basics of CUDA Programming: Part 5


CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

A kernel is defined using the __global__ declaration specifier and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.


Kernel_Name<<< GridSize,  BlockSize,  SMEMSize,   Stream  >>>(arg,..);

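GridSize   : the number of thread blocks in the grid.
BlockSize  : the number of threads in each block.
SMEMSize   : the number of bytes of shared memory allocated dynamically per block at launch time (optional; defaults to 0).
Stream     : the stream on which the kernel will execute (optional; defaults to 0, the default stream).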
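To make the four launch parameters concrete, here is a minimal sketch that uses all of them. The kernel ReverseInBlock, the array size n, and the variable names are hypothetical and chosen only for illustration; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reverses an array of n ints inside a single block,
// staging the data in dynamically sized shared memory.
__global__ void ReverseInBlock(int* data, int n)
{
    extern __shared__ int tile[];   // size comes from the SMEMSize launch parameter
    int i = threadIdx.x;
    if (i < n) tile[i] = data[i];
    __syncthreads();                // every thread in the block reaches this barrier
    if (i < n) data[i] = tile[n - 1 - i];
}

int main()
{
    const int n = 64;               // assumed size, small enough for one block
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(dev, host, n * sizeof(int), cudaMemcpyHostToDevice, stream);

    // GridSize = 1 block, BlockSize = n threads,
    // SMEMSize = n * sizeof(int) bytes, Stream = stream
    ReverseInBlock<<<1, n, n * sizeof(int), stream>>>(dev, n);

    cudaMemcpyAsync(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait for the copy back to finish

    printf("first = %d, last = %d\n", host[0], host[n - 1]);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    return 0;
}
```

When the last two parameters are left out, as in VecAdd<<<1, N>>>, the kernel gets no dynamic shared memory and runs on the default stream.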

Sample Example:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Here, each of the N threads that execute VecAdd() performs one pair-wise addition.
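The sample leaves out the host-side setup. One way to fill it in is sketched below, assuming N = 256 and host arrays named hA, hB and hC (none of these come from the original sample). Because the launch uses a single block, N cannot exceed the per-block thread limit, which is 1024 on current GPUs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256   // assumed value; must fit in one block

// Kernel definition: one thread per element
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) {
        hA[i] = (float)i;
        hB[i] = 2.0f * i;
    }

    // Allocate device memory and copy the inputs over
    size_t bytes = N * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Kernel invocation with N threads in a single block
    VecAdd<<<1, N>>>(dA, dB, dC);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This blocking copy also waits for the kernel to finish
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // Check the result on the host
    int errors = 0;
    for (int i = 0; i < N; ++i) {
        if (hC[i] != hA[i] + hB[i]) ++errors;
    }
    printf("%d mismatches\n", errors);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return errors == 0 ? 0 : 1;
}
```

The pointers passed to the kernel must be device pointers (dA, dB, dC here); passing the host arrays directly would not work with plain cudaMalloc-style memory management.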
You may be wondering how a grid is organized in terms of blocks, and how a block is organized in terms of threads; read this post.

Feel free to comment...

CUDA C Programming Guide
CUDA; Nvidia


  1. int* dev_dynamic_ptr;
    cudaMalloc((void**)&dev_dynamic_ptr, dynamic_size);

  2. #include <iostream>
    #include <vector>
    #include <omp.h>

    int main() {
        const int SIZE = 1001; // Array size is 1001 to include 0 to 1000
        const int NUM_THREADS = 10;
        const int CHUNK = SIZE / NUM_THREADS; // 100 elements per thread
        int sum = 0; // For the total sum of the array
        std::vector<float> averages(NUM_THREADS, 0); // To store averages computed by each thread

        // Initialize the array with values 0 to 1000
        std::vector<int> array(SIZE);
        for (int i = 0; i < SIZE; ++i) {
            array[i] = i;
        }

        // Assumes the runtime actually grants all NUM_THREADS threads
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            int id = omp_get_thread_num(); // Get the thread ID
            int start = id * CHUNK;
            // The last thread also takes the leftover element (index 1000)
            int end = (id == NUM_THREADS - 1) ? SIZE : (id + 1) * CHUNK;
            int thread_sum = 0; // Sum for each thread

            for (int i = start; i < end; ++i) {
                thread_sum += array[i];
            }

            float thread_average = static_cast<float>(thread_sum) / (end - start);
            averages[id] = thread_average; // Store the average for the thread

            #pragma omp atomic
            sum += thread_sum; // Update the global sum atomically
        }

        // Output the result array (averages) to console
        std::cout << "Averages: ";
        for (float avg : averages) {
            std::cout << avg << " ";
        }
        std::cout << std::endl;

        // Output the sum of the whole array
        std::cout << "Sum of array: " << sum << std::endl;

        return 0;
    }


