Kickstart in CUDA (by ai newz)
@ai_newz
I will briefly share some links that I used to learn CUDA programming (disclaimer: I had only a C and C++ background), plus a short summary of some key definitions that are not always clear from the official docs.
List of tutorials that I used to kickstart in CUDA programming
1. https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
2. https://docs.nvidia.com/cuda/cuda-c-programming-guide
3. https://developer.nvidia.com/blog/even-easier-introduction-cuda/
4. Best practices: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#what-is-this-document
5. Lectures from Oxford (Prof. Mike Giles):
Lectures 1-4 were the most useful for me and were enough to start programming my own CUDA kernels.
- Lecture 0: CPUs and GPUs
- Lecture 1: an introduction to CUDA
- Lecture 2: different memory and variable types
- Lecture 3: control flow and synchronisation
- Lecture 4: warp shuffles, and reduction / scan operations
- Lecture 5: libraries and tools
- Lecture 6: odds and ends
- Lecture 7: tackling a new application
Some useful notes based on CUDA docs
- Kernel launches are asynchronous: control returns to the CPU immediately, so you need cudaDeviceSynchronize() (or an implicitly synchronizing call such as cudaMemcpy()) before using the results on the host.
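A minimal sketch of what this means in practice (the kernel and buffer names here are mine, just for illustration):

__global__ void fill(float *d_out) {
    d_out[threadIdx.x] = 1.0f;   // each thread writes one element
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    fill<<<1, 32>>>(d_out);      // returns immediately; the GPU runs in the background
    // ... the CPU is free to do other work here ...
    cudaDeviceSynchronize();     // block the CPU until the kernel has finished
    cudaFree(d_out);
    return 0;
}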

total number of threads = gridDim.x * blockDim.x
The grid consists of gridDim.x blocks, each containing blockDim.x threads.
N parallel blocks:
add<<<N,1>>>(...) launches N copies of add(), i.e. N parallel blocks with one thread each.
A block can be split into parallel threads:
add<<<N,T>>>(...) launches N parallel blocks with T parallel threads each.
- blockDim.x equals the number of threads per block
Basically, we launch a kernel with foo<<<NUM_BLOCKS,N_THREADS_PER_BLOCK>>>(...) or foo<<<NUM_BLOCKS,BLOCK_SIZE>>>(...), which means we run NUM_BLOCKS blocks in parallel, each with N_THREADS_PER_BLOCK threads.
- Essentially it will run NUM_BLOCKS * N_THREADS_PER_BLOCK copies of foo() in parallel.
- blockDim.x := N_THREADS_PER_BLOCK
- Use blockIdx.x to access block index
- Use threadIdx.x to access thread index within block
- Assign elements to threads by indexing:
int index = threadIdx.x + blockIdx.x * blockDim.x;
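For example, here is a complete vector-add kernel putting this together (names are illustrative; the guard against index >= n is needed because the grid is rounded up to a whole number of blocks):

__global__ void add(const float *a, const float *b, float *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)               // threads in the last block may fall off the end
        c[index] = a[index] + b[index];
}

// launch enough blocks of BLOCK_SIZE threads to cover all n elements:
// add<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_a, d_b, d_c, n);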
Unlike parallel blocks, threads have mechanisms to:
- Communicate
- Synchronize
Within a block, threads share data via shared memory:
- Declaration example: __shared__ float x[100];
- Data is not visible to threads in other blocks
- Shared memory is extremely fast on-chip memory, in contrast to device memory, referred to as global memory.
- Use __syncthreads() as a barrier to prevent data hazards; it synchronizes all threads in the block (see the sketch after this list).
- __threadfence(): wait until all global and shared memory writes by the calling thread are visible to:
- all threads in the block, for shared data
- all threads on the device, for global data
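A minimal sketch of shared memory plus __syncthreads() (my own toy example: each block reverses its own chunk, assuming the kernel is launched with exactly 100 threads per block):

__global__ void reverse_chunk(float *d) {
    __shared__ float x[100];                 // one copy per block, on-chip
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    x[t] = d[base + t];                      // stage the chunk in shared memory
    __syncthreads();                         // barrier: all writes to x[] are done
    d[base + t] = x[blockDim.x - 1 - t];     // now safe to read other threads' data
}

Without the barrier, a thread could read x[blockDim.x - 1 - t] before the owning thread has written it, which is exactly the data hazard __syncthreads() prevents.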
Warps:
- Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time.
- Warp divergence can lead to a big loss of parallel efficiency; it's one of the first things I look out for in a new application.
- In the worst case, you effectively lose a factor of 32× in performance if one thread needs an expensive branch while the rest do nothing.
Example: processing a long list of elements where, depending on run-time values, a few require very expensive processing
- GPU implementation:
- first process the list to build two sub-lists of “simple” and “expensive” elements
- then process the two sub-lists separately (two different kernels).
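A sketch of the classification step (the threshold predicate, the names, and the two follow-up kernels are all hypothetical):

__global__ void classify(const float *in, int n, float threshold,
                         int *simple, int *n_simple,
                         int *expensive, int *n_expensive) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    if (in[i] < threshold)                        // cheap run-time test
        simple[atomicAdd(n_simple, 1)] = i;       // append index to the "simple" sub-list
    else
        expensive[atomicAdd(n_expensive, 1)] = i; // append to the "expensive" sub-list
}

// afterwards: process_simple<<<...>>>(...) and process_expensive<<<...>>>(...);
// within each kernel all threads of a warp take the same path, so no divergence.

(The sub-lists come out unordered, which is usually fine for this pattern.)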
Warp shuffles:
- Warp shuffles are a fast mechanism for moving data between threads in the same warp, without going through shared memory.
- The lane ID is a thread's position within its warp (threadIdx.x % 32 for 1D blocks).
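As an illustration, a warp-level sum using shuffles (a standard pattern; the mask 0xffffffff assumes all 32 lanes are active):

__device__ float warp_sum(float val) {
    // each step pulls a value from the lane `offset` positions higher;
    // after 5 halving steps, lane 0 holds the sum over the whole warp
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // the full sum is valid in lane 0
}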
If you liked this post, please subscribe to my channel @ai_newz to see more like this!