Kickstart in CUDA (by ai newz)

@ai_newz

I will briefly share the links that I used to learn CUDA programming (disclaimer: I had only a C and C++ background). I also provide a short summary of some key definitions that are not always clear from the official docs.

List of tutorials that I used to kickstart in CUDA programming

1. https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

2. https://docs.nvidia.com/cuda/cuda-c-programming-guide

3. https://developer.nvidia.com/blog/even-easier-introduction-cuda/

4. Best practices: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#what-is-this-document

5. Lectures from Oxford (Prof. Mike Giles):
Lectures 1-4 were the most useful for me and were enough to start programming my own CUDA kernels.

Some useful notes based on CUDA docs

- Kernel launches are asynchronous: control returns to the CPU immediately, before the kernel has finished.
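
A minimal sketch of what that means in practice (the kernel name hello is just an illustrative placeholder): the launch returns right away, so the host has to synchronize explicitly before it can rely on the kernel having finished.

#include <cstdio>

// Trivial kernel: each thread prints its block and thread index.
__global__ void hello() {
    printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // asynchronous: control returns to the CPU right away
    cudaDeviceSynchronize();  // block the host until the kernel has finished
    return 0;
}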

Grid of Thread Blocks

grid size (total number of threads) = gridDim.x * blockDim.x

The grid consists of gridDim.x blocks, each containing blockDim.x threads.


N parallel blocks:

add<<<N,1>>>(...) launches N copies of add(), i.e. N parallel blocks with one thread each.

A block can be split into parallel threads:

add<<<N,T>>>(...) will launch N parallel blocks with T parallel threads each.

  • blockDim.x equals the number of threads per block

int index = threadIdx.x + blockIdx.x * blockDim.x;
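
For instance, a minimal vector-add kernel built around exactly this index (the kernel name add and the length parameter n are illustrative):

// Each thread handles one element; the bounds check covers the case where
// the total number of threads exceeds n.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}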


Basically, we call a kernel with foo<<<NUM_BLOCKS,N_THREADS_PER_BLOCK>>>(...) (or, equivalently, foo<<<NUM_BLOCKS,BLOCK_SIZE>>>(...)), which means we will run NUM_BLOCKS blocks in parallel, each with N_THREADS_PER_BLOCK threads; a launch sketch follows the list below.

  • Essentially it will run NUM_BLOCKS * N_THREADS_PER_BLOCK copies of foo() in parallel.
  • blockDim.x := N_THREADS_PER_BLOCK
  • Use blockIdx.x to access block index
  • Use threadIdx.x to access thread index within block
  • Assign elements to threads by indexing:
    int index = threadIdx.x + blockIdx.x * blockDim.x;
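
On the host side, the number of blocks is usually rounded up so that every element gets a thread. A sketch of the launch, assuming the add kernel above and device pointers d_a, d_b, d_c already allocated with cudaMalloc:

int n = 1 << 20;                                      // number of elements (illustrative)
const int BLOCK_SIZE = 256;                           // threads per block
int NUM_BLOCKS = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;   // round up so all n elements are covered

add<<<NUM_BLOCKS, BLOCK_SIZE>>>(d_a, d_b, d_c, n);    // NUM_BLOCKS * BLOCK_SIZE threads in total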

Unlike parallel blocks, threads have mechanisms to:

- Communicate

- Synchronize


Within a block, threads share data via shared memory:

  • Declaration example: __shared__ float x[100];
  • Data is not visible to threads in other blocks
  • Shared memory is extremely fast on-chip memory, in contrast to device memory, referred to as global memory.
  • Use __syncthreads() as a barrier. Use it to prevent data hazards. It syncs all threads in the block.
  • __threadfence() waits until all global and shared memory writes made by the calling thread are visible to:
    - all threads in the block, for shared data
    - all threads on the device, for global data
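
A hedged sketch that combines both ideas: a per-block sum reduction that stages data in shared memory and uses __syncthreads() between steps (the kernel name block_sum is illustrative, and it assumes a launch with 256 threads per block):

// Each block reduces up to 256 consecutive elements of `in` into out[blockIdx.x].
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float cache[256];                  // one slot per thread in the block
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                              // make all writes visible to the whole block

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                          // barrier before the next step
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];               // one partial sum per block
}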

Warps:

  • Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time.
  • Warp divergence can lead to a big loss of parallel efficiency; it is one of the first things I look out for in a new application.
  • In the worst case, you effectively lose a factor of 32× in performance if one thread takes an expensive branch while the rest do nothing.
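
A hedged illustration of the kind of data-dependent branch that causes divergence; the threshold and the dummy loop just stand in for a rare, expensive case:

// Threads of the same warp can take different paths here. While the rare
// expensive branch runs, the threads on the cheap path sit idle.
__global__ void divergent(const float *x, float *y, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;

    if (x[i] > 0.999f) {                 // rare, expensive case
        float acc = x[i];
        for (int k = 0; k < 1000; ++k)   // placeholder for costly per-element work
            acc = sinf(acc) + x[i];
        y[i] = acc;
    } else {
        y[i] = 2.0f * x[i];              // cheap case
    }
}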

Example: processing a long list of elements where, depending on run-time values, a few require very expensive processing.

- GPU implementation:

  • first process the list to build two sub-lists of “simple” and “expensive” elements,
  • then process the two sub-lists separately (two different kernels); see the sketch after this list.
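
A hedged sketch of that pattern with two hypothetical kernels, process_simple and process_expensive; it assumes a first pass has already built the two index sub-lists (d_simple_idx, d_expensive_idx) on the device, e.g. with a compaction step:

// Each kernel now handles a uniform workload, so warps inside a kernel
// no longer diverge between cheap and expensive paths.
__global__ void process_simple(float *data, const int *idx, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[idx[i]] *= 2.0f;                  // placeholder cheap work
}

__global__ void process_expensive(float *data, const int *idx, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    float acc = data[idx[i]];
    for (int k = 0; k < 1000; ++k) acc = sinf(acc);   // placeholder expensive work
    data[idx[i]] = acc;
}

// Host side (illustrative launch configuration):
// process_simple<<<(n_simple + 255) / 256, 256>>>(d_data, d_simple_idx, n_simple);
// process_expensive<<<(n_expensive + 255) / 256, 256>>>(d_data, d_expensive_idx, n_expensive);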

Warp shuffles:

  • Warp shuffles are a faster mechanism for moving data between threads in the same warp.
  • The lane ID is a thread's position within the warp (threadIdx.x % 32 for 1D blocks).
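
A hedged sketch of a warp-level sum using __shfl_down_sync (available since CUDA 9); after the loop, lane 0 holds the sum over all 32 lanes:

// Sum a value across the 32 threads of a warp without touching shared memory.
__device__ float warp_sum(float val) {
    const unsigned mask = 0xffffffffu;              // all 32 lanes participate
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset); // pull the value from lane (laneID + offset)
    return val;                                     // result is valid on lane 0
}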


If you liked this post, please subscribe to my channel @ai_newz to see more like this!

