This article describes warps in CUDA, starting with how the warp size was
decided and ending with the effect that warps have on performance.
What is a warp? I think
the definition that applies to CUDA is the weaving one: “threads in a fabric
running lengthwise”. The warp size is the number of threads running concurrently
on a multiprocessor (MP). In practice, the threads run both in parallel and
pipelined. At the time this was written, each MP contained eight scalar
processors (SPs) and the fastest instruction took four cycles. Therefore, each
SP could have four instructions in its pipeline, for a total of 8 × 4 = 32
instructions being executed concurrently. Within a warp, the threads have
consecutive indices: the first warp holds indices 0..31, the next holds 32..63,
and so on up to the total number of threads in a block.
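The index layout described above comes down to integer division and remainder. Here is a minimal host-side C++ sketch of that arithmetic, assuming a warp size of 32 and a one-dimensional block. The names kWarpSize, warp_index, lane_index and warps_per_block are made up for illustration and are not part of the CUDA API. Inside a kernel, the linear index would typically come from threadIdx.x and the warp size from the built-in warpSize variable.

```cpp
#include <cassert>

// Warp size assumed by this article (32 threads). On a real device it is
// better to query the value than to hard-code it.
const int kWarpSize = 32;

// Which warp a thread belongs to, given its linear index within the block.
inline int warp_index(int thread_index) {
    assert(thread_index >= 0);
    return thread_index / kWarpSize;
}

// Position of the thread inside its warp (0..31), often called the lane.
inline int lane_index(int thread_index) {
    assert(thread_index >= 0);
    return thread_index % kWarpSize;
}

// Number of warps needed to cover a block. A partially filled last warp
// still counts as a whole warp.
inline int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}
```

For example, thread 63 falls in warp 1 at lane 31, and a block of 100 threads needs 4 warps, the last of which is only partly filled.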
The homogeneity of the
threads in a warp has a big effect on the computational throughput. If all the
threads are executing the same instruction, then all the SPs in an MP can
execute the same instruction in parallel. But if one or more threads in a warp
are executing a different instruction from the others, then the warp has to be
partitioned into groups of threads based on the instructions being executed,
after which the groups are executed one after the other. This serialization
reduces the throughput as the threads become more and more divergent and split
into smaller and smaller groups. So it pays to keep the threads as homogeneous
as possible.
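To make the cost of divergence concrete, here is a rough host-side C++ model of the partitioning described above. It is only a sketch under simplifying assumptions: each thread is given a hypothetical integer label for the path it takes, and the warp is assumed to need one serialized pass per distinct label. The helper names serialized_groups, paths_by_parity and paths_by_warp are invented for this example, and real hardware scheduling is more involved than this.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Simplified model: the warp is split into one group per distinct path,
// and the groups execute one after another. path_of_thread[i] is a
// hypothetical label for the branch that thread i takes.
inline int serialized_groups(const std::vector<int>& path_of_thread) {
    std::set<int> distinct(path_of_thread.begin(), path_of_thread.end());
    return static_cast<int>(distinct.size());
}

// Branching on the parity of the thread index: neighbouring threads in the
// same warp take different paths.
inline std::vector<int> paths_by_parity(int first_thread, int warp_size) {
    assert(warp_size > 0);
    std::vector<int> paths;
    for (int i = first_thread; i < first_thread + warp_size; ++i) {
        paths.push_back(i % 2);
    }
    return paths;
}

// Branching on the warp index: every thread in a given warp takes the
// same path, so the warp stays homogeneous.
inline std::vector<int> paths_by_warp(int first_thread, int warp_size) {
    assert(warp_size > 0);
    std::vector<int> paths;
    for (int i = first_thread; i < first_thread + warp_size; ++i) {
        paths.push_back((i / warp_size) % 2);
    }
    return paths;
}
```

In this model a condition on the parity of the thread index splits a 32-thread warp into two groups, while a condition on the warp index leaves it as one group. That is the reason to align branch conditions with warp boundaries where the algorithm allows it.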
Got questions?
Feel free to ask me any question. I'd be happy to walk you through step by
step!
What do MP and SP stand for? Multiprocessor? I ran the deviceQuery sample and it says my graphics card, a 650, has (2) Multiprocessors and (192) CUDA Cores/MP. Does that mean the GPU can have at most 64 threads running concurrently?
"Wrap" should be renamed to "Warp" :D