
Tuesday, 22 January 2013

What is a warp in CUDA?

This article describes what a warp is in CUDA, starting with how the warp size was decided and ending with the effect of warps on performance.

What is a warp? I think the definition that applies to CUDA is “threads in a fabric running lengthwise”. The warp size is the number of threads running concurrently on a multiprocessor (MP). In practice, the threads run both in parallel and pipelined. At the time this was written, each MP contained eight scalar processor cores (SPs) and the fastest instruction took four cycles. Each SP can therefore have four instructions in its pipeline, for a total of 8 × 4 = 32 instructions being executed concurrently. The threads within a warp have sequential indices: the first warp holds indices 0..31, the next holds 32..63, and so on up to the total number of threads in the block.
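The index layout described above can be sketched in a few lines of plain C++. This is only a host-side illustration, not device code: the names `kWarpSize`, `warp_index`, `lane_index` and `warps_per_block` are made up for this example, and the warp size of 32 is taken from the 8 × 4 arithmetic in the paragraph above.

```cpp
// Assumption: 32 threads per warp, i.e. 8 SPs x 4 pipeline stages as derived in the text.
constexpr int kSpsPerMp = 8;
constexpr int kPipelineDepth = 4;
constexpr int kWarpSize = kSpsPerMp * kPipelineDepth;

// Warp that a thread belongs to, given its linear index within the block.
// Threads 0..31 land in warp 0, threads 32..63 in warp 1, and so on.
constexpr int warp_index(int thread_index) {
    return thread_index / kWarpSize;
}

// Position of the thread inside its warp (0..31).
constexpr int lane_index(int thread_index) {
    return thread_index % kWarpSize;
}

// Number of warps needed to cover a block of the given size.
// Rounds up, so a block of 100 threads occupies 4 warps, the last one only partly filled.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}
```

For example, `warp_index(33)` is 1 and `lane_index(33)` is 1, so thread 33 is the second thread of the second warp. Inside a real kernel the same arithmetic would normally be applied to the thread's index in the block.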

The homogeneity of the threads in a warp has a big effect on computational throughput. If all the threads are executing the same instruction, then all the SPs in an MP can execute that instruction in parallel. But if one or more threads in a warp are executing a different instruction from the others, the warp has to be partitioned into groups of threads based on the instructions being executed, and the groups are then executed one after the other. This serialization reduces throughput more and more as the threads diverge and split into smaller and smaller groups. So it pays to keep the threads as homogeneous as possible.
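The cost of divergence can be illustrated with a small host-side model in plain C++. It is a rough sketch under stated assumptions, not a description of what the hardware actually does: `serialized_passes` and `relative_throughput` are hypothetical helpers, each thread is represented only by an integer naming the instruction path it wants to take, and every serialized group is assumed to cost the same.

```cpp
#include <set>
#include <vector>

// Hypothetical model: path_of_thread[i] names the instruction path that
// thread i of one warp wants to execute next.
// Returns the number of groups the warp is split into, which in this model
// equals the number of passes executed one after the other.
inline int serialized_passes(const std::vector<int>& path_of_thread) {
    std::set<int> distinct(path_of_thread.begin(), path_of_thread.end());
    return static_cast<int>(distinct.size());
}

// Throughput relative to a fully homogeneous warp.
// Assumption: every pass takes the same time, so N groups give 1/N throughput.
inline double relative_throughput(const std::vector<int>& path_of_thread) {
    const int passes = serialized_passes(path_of_thread);
    return passes == 0 ? 0.0 : 1.0 / passes;
}
```

In this model, a warp whose 32 threads branch on `thread_index % 2` splits into two groups and runs at half throughput, while branching on `(thread_index / 32) % 2` sends every thread of a warp down the same path and keeps it at one pass.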



  1. What are MP and SP short for? Multiprocessor? I ran the deviceQuery sample, and it says my graphics card, a 650, has ( 2) Multiprocessors, (192) CUDA Cores/MP. So can the GPU have at most 64 threads running concurrently?

  2. "Wrap" should be renamed to "Warp" :D

