
Basics of CUDA Programming: Part 1

CUDA programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more CUDA-enabled NVIDIA GPU devices.
While NVIDIA GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution.
However, the device is based on a distinctly different design from the host system, and to use CUDA effectively it is important to understand those differences and how they determine the performance of CUDA applications.
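
To make the host/device split concrete, here is a minimal sketch of a single CUDA program that contains both sides: a __global__ kernel that executes on the GPU device and an ordinary main() that executes on the CPU host and launches it. The kernel name and the launch configuration are only illustrative, and device-side printf assumes a GPU of compute capability 2.0 or later.

#include <cstdio>
#include <cuda_runtime.h>

// Device code: this function runs on the GPU. Every launched thread
// executes the same body, distinguished by its thread and block indices.
__global__ void helloFromDevice()
{
    printf("Hello from GPU thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

// Host code: this runs on the CPU and launches work onto the device.
int main()
{
    helloFromDevice<<<2, 4>>>();   // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();       // wait for the device to finish
    return 0;
}

Compiled with nvcc, the host part of this file runs on the CPU while the kernel body runs on the GPU.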



HOST AND DEVICE

In the world of parallel programming with CUDA, your CPU is known as the host and your GPU is known as the device.
The primary differences between the two are in the threading model and in their separate physical memories:



Threading resources. Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading). By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (for more information, see Section F.1 of the CUDA C Programming Guide). On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
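
These per-device numbers can be read at run time with the CUDA runtime API. The short sketch below (assuming a single GPU at device index 0) queries the warp size, the number of multiprocessors, and the maximum resident threads per multiprocessor, and multiplies the last two to estimate the maximum number of concurrently active threads:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Warp size:               %d\n", prop.warpSize);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max concurrent threads:  %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}

For a GPU with 16 multiprocessors and 1536 threads per multiprocessor, the last line works out to 24,576, matching the figure above.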



Threads. Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimize latency for one or two threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.
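
To see what "a large number of concurrent, lightweight threads" looks like in practice, the sketch below (the kernel and sizes are illustrative) launches roughly a million threads, one per array element, grouped into blocks of 256 threads, i.e. 8 warps per block. Launching that many threads is cheap precisely because the GPU keeps their state resident and switches among warps without an operating-system context switch.

#include <cuda_runtime.h>

// Each thread handles exactly one element of the array.
__global__ void incrementKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final partial block
        data[i] += 1;
}

int main()
{
    const int N = 1 << 20;                          // about one million elements
    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMemset(d_data, 0, N * sizeof(int));

    const int threadsPerBlock = 256;                // 8 warps of 32 threads per block
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<blocks, threadsPerBlock>>>(d_data, N);
    cudaDeviceSynchronize();                        // wait for all warps to finish

    cudaFree(d_data);
    return 0;
}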



RAM. The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, items in the host memory must occasionally be communicated across the bus to the device memory, or vice versa, as described in a later post.
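
As a brief preview, a typical consequence of the separate memories is the allocate/copy/compute/copy-back pattern. The sketch below (array size and variable names are illustrative) allocates a buffer on the device, copies host data across the PCIe bus to it, and copies the result back; kernels would operate on the device copy in between.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N = 256;
    const size_t bytes = N * sizeof(float);

    // Host memory and device memory are distinct allocations.
    float h_data[N];
    for (int i = 0; i < N; ++i)
        h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                                 // allocate on the device

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device over PCIe
    // ... kernels would work on d_data here ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host over PCIe

    cudaFree(d_data);
    printf("Round trip complete, h_data[10] = %.1f\n", h_data[10]);
    return 0;
}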

In short, these are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming.

Feel free to comment...

References:
CUDA C Programming Guide






