CUDA study

2 minute read

Published:

Following CUDA 编程 (樊哲勇).

Outline

  1. Useful resources and examples
  2. cuda中的线程组织
    • 核函数 (kernel)
    • kernel function用__global__开头,表示global memory
    • isotream 在kernel code内不支持 * «<grid_size, block_size»>
    • 线程层级:
      • grid/网格 > block/线程块 > thread/线程
      • 每个 grid 包含 grid_size 个 block,每个 block 包含 block_size 个 thread
      • 开普勒架构开始:block_size最大为1024,grid_size最大为$2^{31}-1$,总计线程数 ~ 2万亿
    • grid/block size:
      • gridDim.x = grid_size
      • blockDim.x = block_size
    • block/thread id:
      • blockIdx.x = 在网格中的线程块指标,范围 0 ~ gridDim.x - 1
      • threadIdx.x = 在线程块中线程指标,范围 0 ~ blockDim.x - 1
      • 线程绝对ID = blockDim.x * blockIdx.x + threadIdx.x
    • example 01:

      #include <stdio.h>
      #include <iostream>
      
       __global__ void hello_world_from_gpu()
      {
          const int bid = blockIdx.x;
          const int tid = threadIdx.x;
          int id = blockDim.x * bid + tid;
          printf("Hello World from GPU, bid = %d, tid = %d, id = %d!\n", bid, tid, id);
          // iostream is not supported
          // std::cout << "test cout at GPU" << std::endl;
      }
      
      int main(void)
      {
          printf("Hello World from CPU!\n");
          std::cout << "test cout at CPU" << std::endl;
          // <<<grid_size, block_size>>>
          // each grid contains grid_size blocks
          // each block contains block_size threads
          hello_world_from_gpu<<<2, 4>>>();
          // synchronize between host and device, refresh flush
          cudaDeviceSynchronize();
          return 0;
      }  
      
    • 多维网格
      • blockIdx 和 threadIdx 都是结构体 unit3 类型变量,包含x, y, z三个成员,即 blockIdx.x, blockIdx.y, blockIdx.z; threadIdx.x, threadIdx.y, threadIdx.z
      • unit3 定义在vector_types.h

        struct __device_builtin__ unit3
        {
          unsigned int x, y, z;
        }
        typedef __device_builtin__ struct unit3 unit3;
        
      • 类似地,gridDim 和 blockDim 都是结构体 dim3 的变量,包含x, y, z三个成员。
      • dim3 同样定义在vector_types.h

      • 这些内建变量只在核函数有效,满足 i = x, y, z
        • blockIdx.i 范围为 0 ~ gridDim.i - 1
        • threadIdx.i 范围为 0 ~ blockDim.i - 1
      • «<grid_size, block_size»> 中,gridDim 和 blockDim 中未被指定的成员默认为1 (类比 grid_size, block_size 均为int整数的一维情况)

      *对于高维情况,可以用结构体 dim3 定义高维的网格/grid和线程块/block

      • 三维
        dim3 grid_size(Gx, Gy, Gz); dim3 block_size(Bx, By, Bz);
      • 二维
        dim3 grid_size(Gx, Gy); dim3 block_size(Bx, By);

      *高维网格和线程块本质还是一维的,

      * 线程指标对应
      int tid = threadIdx.z * blockDim.x * blockDim.y +
               threadIdx.y * blockDim.x + threadIdx.x;
          
      * 等效网格块指标
      int bid = blockIdx.z * gridDim.x * gridDim.y +
                blockIdx.y * gridDim.x + blockIdx.x;
      
      * 复合线程索引, i = x, y, z
      int ni = blockDim.i * blockIdx.i + threadIdx.i;