Using CUDA Warp-Level Primitives

NVIDIA GPUs execute threads in groups of 32, known as warps, in SIMT (Single Instruction, Multiple Thread) fashion. Many CUDA programs achieve high performance by taking advantage of warp execution.

A new technical blog post shows how to use primitives introduced in CUDA 9 to make warp-level programming safe and effective.

Figure: Part of a warp-level parallel reduction using __shfl_down_sync().
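As a rough sketch of the pattern the figure illustrates, here is a warp-level sum reduction built on __shfl_down_sync(). The kernel name and the single-block assumption are illustrative, not taken from the post; the full mask 0xffffffff says that all 32 lanes of the warp participate.

```cuda
#include <cstdio>

// Reduce the 32 values held by a warp down to one sum in lane 0.
// Each iteration halves the number of lanes holding partial sums.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Illustrative kernel: assumes a single block whose size is a
// multiple of 32, and an *out value initialized to zero.
__global__ void reduceKernel(const int *in, int *out) {
    int val = in[threadIdx.x];
    val = warpReduceSum(val);
    if (threadIdx.x % 32 == 0)   // lane 0 of each warp
        atomicAdd(out, val);
}
```

Passing the mask explicitly is the key change from the older, deprecated __shfl_down(): the _sync variants make the set of participating threads part of the call, so the compiler cannot silently break implicit warp-synchronous assumptions.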

While warp execution delivers high performance behind the scenes, many CUDA programs can achieve even higher performance through explicit warp-level programming. Parallel programs often use collective communication operations, such as parallel reductions and scans. CUDA C++ supports such collective operations by providing warp-level primitives and Cooperative Groups collectives.
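For comparison, a minimal sketch of the same reduction expressed with Cooperative Groups (the kernel name is ours; tiled_partition, thread_block_tile, and shfl_down are part of the cooperative_groups API):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative kernel: assumes a single block whose size is a
// multiple of 32, and an *out value initialized to zero.
__global__ void cgReduceKernel(const int *in, int *out) {
    // Partition the thread block into tiles of 32 threads (one warp each).
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int val = in[block.thread_rank()];

    // tile.shfl_down() is the Cooperative Groups counterpart of
    // __shfl_down_sync(); the tile itself defines which threads participate.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        val += tile.shfl_down(val, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, val);
}
```

The Cooperative Groups form trades the explicit lane mask for a typed group object, which makes the set of cooperating threads visible in the code rather than implicit in a bit pattern.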

Read more >
