CUDA study
Published:
Following CUDA 编程 (樊哲勇).
Outline
- Useful resources and examples
- official docs: https://docs.nvidia.com/cuda
- programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide
- best practices: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide
- optimizations for specified gpu series:
- https://docs.nvidia.com/cuda/kepler-tuning-guide
- https://docs.nvidia.com/cuda/maxwell-tuning-guide
- https://docs.nvidia.com/cuda/pascal-tuning-guide
- https://docs.nvidia.com/cuda/volta-tuning-guide
- https://docs.nvidia.com/cuda/turing-tuning-guide * CUDA apis
- runtime: https://docs.nvidia.com/cuda/cuda-runtime-api
- driver: https://docs.nvidia.com/cuda/cuda-driver-api
- math: https://docs.nvidia.com/cuda/cuda-math-api * CUDA examples
- Environment: windows10 + cuda 11.5 + VS 2022 (C++)
- Path: C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.5\
- Compile examples: nvcc -I”C:/ProgramData/NVIDIA Corporation/CUDA Samples/v11.5/common/inc” -allow-unsupported-compiler .\bandwidthTest.cu
- -allow-unsupported-compiler is used because cuda v11.5 only support VS 2017 & 2019
- cuda中的线程组织
- 核函数 (kernel)
- kernel function用__global__开头,表示global memory
- isotream 在kernel code内不支持 * «<grid_size, block_size»>
- 线程层级:
- grid/网格 > block/线程块 > thread/线程
- 每个 grid 包含 grid_size 个 block,每个 block 包含 block_size 个 thread
- 开普勒架构开始:block_size最大为1024,grid_size最大为$2^{31}-1$,总计线程数 ~ 2万亿
- grid/block size:
- gridDim.x = grid_size
- blockDim.x = block_size
- block/thread id:
- blockIdx.x = 在网格中的线程块指标,范围 0 ~ gridDim.x - 1
- threadIdx.x = 在线程块中线程指标,范围 0 ~ blockDim.x - 1
- 线程绝对ID = blockDim.x * blockIdx.x + threadIdx.x
example 01:
#include <stdio.h> #include <iostream> __global__ void hello_world_from_gpu() { const int bid = blockIdx.x; const int tid = threadIdx.x; int id = blockDim.x * bid + tid; printf("Hello World from GPU, bid = %d, tid = %d, id = %d!\n", bid, tid, id); // iostream is not supported // std::cout << "test cout at GPU" << std::endl; } int main(void) { printf("Hello World from CPU!\n"); std::cout << "test cout at CPU" << std::endl; // <<<grid_size, block_size>>> // each grid contains grid_size blocks // each block contains block_size threads hello_world_from_gpu<<<2, 4>>>(); // synchronize between host and device, refresh flush cudaDeviceSynchronize(); return 0; }
- 多维网格
- blockIdx 和 threadIdx 都是结构体 unit3 类型变量,包含x, y, z三个成员,即 blockIdx.x, blockIdx.y, blockIdx.z; threadIdx.x, threadIdx.y, threadIdx.z
unit3 定义在vector_types.h
struct __device_builtin__ unit3 { unsigned int x, y, z; } typedef __device_builtin__ struct unit3 unit3;
- 类似地,gridDim 和 blockDim 都是结构体 dim3 的变量,包含x, y, z三个成员。
dim3 同样定义在vector_types.h
- 这些内建变量只在核函数有效,满足 i = x, y, z
- blockIdx.i 范围为 0 ~ gridDim.i - 1
- threadIdx.i 范围为 0 ~ blockDim.i - 1
- «<grid_size, block_size»> 中,gridDim 和 blockDim 中未被指定的成员默认为1 (类比 grid_size, block_size 均为int整数的一维情况)
*对于高维情况,可以用结构体 dim3 定义高维的网格/grid和线程块/block
- 三维
dim3 grid_size(Gx, Gy, Gz); dim3 block_size(Bx, By, Bz); - 二维
dim3 grid_size(Gx, Gy); dim3 block_size(Bx, By);
*高维网格和线程块本质还是一维的,
* 线程指标对应 int tid = threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x; * 等效网格块指标 int bid = blockIdx.z * gridDim.x * gridDim.y + blockIdx.y * gridDim.x + blockIdx.x; * 复合线程索引, i = x, y, z int ni = blockDim.i * blockIdx.i + threadIdx.i;