+++
title = 'GPU programming'
+++

# GPU computing
GPUs offer higher performance and better power efficiency than CPUs, but they are not easy to program well/efficiently.

CPU:
- few complex cores
- lots of on-chip memory & control logic

GPU:
- many simple cores
- little memory & control

NVIDIA GPU architecture:
- many slim cores, grouped into 'multiprocessors' (SMs) with shared memory
- various memory spaces -- on-chip, off-chip global, and separate CPU and GPU memories
- works as an accelerator, with the CPU as the host
- the CPU offloads work to the GPU; symmetric multi-threading
- execution model is SIMT: single instruction, multiple threads (with a hardware scheduler)

## Programming GPUs
- The kernel is the parallel program that runs on the device.
- Host code manages the parallel program (launch, memory, results).

Low-level models: CUDA, OpenCL, and variations.

CUDA:
- mapping of hardware into the programming model
- thread hierarchy that maps to cores, configurable at the logical level
- memory spaces mapping to physical memory spaces, usable through variable scopes and types
- symmetric multithreading: write code for a single thread, instantiate as many as needed
- SIMT: single instruction on multiple threads at the same time
- thread hierarchy:
  - each thread executes the same kernel code
  - one thread runs on one core
  - threads are grouped into thread blocks; only threads in the same block can cooperate
  - all thread blocks are logically organised in a grid (1D/2D/3D); the grid specifies how many instances of the kernel run
- parallelization for GPUs: map data/work to threads and write the computation for one thread; organise threads in blocks, and blocks in a grid

CUDA program organisation:
- device code: GPU code, i.e. kernels and GPU functions
  - sequential: write for one thread, execute for all
- host code: CPU code
  - instantiates the grid and runs the kernel
  - handles memory management
- host-device communication is explicit/implicit via PCIe or NVLink

Example of code:

```c
// GPU device function (callable only from GPU code)
__device__ void myProcess(int *dataGPU) {
    // code
}

// GPU kernel code
__global__ void myKernel(int n, int *dataGPU) {
    myProcess(dataGPU);
}

// CPU (host) code
int main(int argc, const char **argv) {
    // myData must point to device memory; launch 100 blocks of 10 threads
    myKernel<<<100, 10>>>(1000, myData);
}
```

Compiling CUDA:
- use nvcc
- nvcc separates the source code into device code (GPU) and host code (CPU)

Execution flow loop:
1. GPU memory allocation
2. Transfer data CPU → GPU
3. CPU calls GPU kernel
4. GPU kernel executes
5. Transfer data GPU → CPU
6. GPU memory release

Creating a CUDA application (steps 2 and 4 are illustrated in the sketch after this list):
1. Identify the function to offload
2. Determine the mapping of operations and data to threads
3. Write kernels & device functions (sequential, per-thread)
4. Determine block geometry, i.e. threads per block and blocks per grid
5. Write host code: memory initialization, kernel launch, inspect results
6. Optimize kernels
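As a concrete illustration of steps 2 and 4, here is a minimal sketch (not from the lecture): it maps a 2D image to a 2D grid of 16×16 thread blocks. The `invert` kernel, the `launch` helper, the image dimensions, and the block size are assumptions chosen for the example; the index arithmetic and the ceiling division for the grid size are the standard pattern.

```c
// Illustrative sketch: one thread per pixel of a WIDTH x HEIGHT image.
#define WIDTH  1920
#define HEIGHT 1080

__global__ void invert(unsigned char *img, int width, int height) {
    // Step 2: map each thread to one (x, y) pixel via block and thread indices.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard: the grid may overshoot the data
        img[y * width + x] = 255 - img[y * width + x];
}

void launch(unsigned char *deviceImg) {
    // Step 4: block geometry. Ceiling division ensures every pixel is covered
    // even when the image size is not a multiple of the block size.
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((WIDTH  + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (HEIGHT + threadsPerBlock.y - 1) / threadsPerBlock.y);
    invert<<<blocksPerGrid, threadsPerBlock>>>(deviceImg, WIDTH, HEIGHT);
}
```

The `if (x < width && y < height)` guard plays the same role as the `if (i < n)` check in the vector-add kernel below: extra threads launched by the rounded-up grid simply do nothing.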
GPU data transfer models:
- explicit allocation and management: memory is allocated twice (host and device) and copied explicitly on demand; allows asynchronous copies
- implicit unified memory: allocated once, implicitly moved/copied when needed, potentially prefetched

## Example: vector add
First, the sequential code:

```c
void vector_add(int size, float *a, float *b, float *c) {
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

What does each thread compute? One addition per thread, and each thread works on a different element. To find out which one, compute the mapping of the grid to the data.

Example with CUDA:

```c
// GPU kernel code
// compute vector sum c = a + b
// each thread does one pair-wise addition
__global__ void vector_add(int n, float *a, float *b, float *c) {
    int i = threadIdx.x + blockDim.x * blockIdx.x; // mapping of thread to element
    if (i < n) c[i] = a[i] + b[i];                 // guard against the extra threads
}

// Host (CPU) code
int main() {
    int N = 5000;
    int size = N * sizeof(float);
    float *hostA = (float *) malloc(size);
    float *hostB = (float *) malloc(size);
    float *hostC = (float *) malloc(size);
    float *deviceA, *deviceB, *deviceC;

    // initialize A, B arrays

    // allocate device memory
    cudaMalloc(&deviceA, size);
    cudaMalloc(&deviceB, size);
    cudaMalloc(&deviceC, size);

    // transfer data from host to device
    cudaMemcpy(deviceA, hostA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, size, cudaMemcpyHostToDevice);

    // launch N/256 + 1 blocks of 256 threads each
    vector_add<<< N/256 + 1, 256 >>>(N, deviceA, deviceB, deviceC);

    // transfer result back from device to host
    cudaMemcpy(hostC, deviceC, size, cudaMemcpyDeviceToHost);

    cudaFree(deviceA);
    cudaFree(deviceB);
    cudaFree(deviceC);
    free(hostA);
    free(hostB);
    free(hostC);
}
```

With OpenACC:

```c
void vector_add(int size, float *a, float *b, float *c) {
    #pragma acc kernels copyin(a[0:size], b[0:size]) copyout(c[0:size])
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

## Execution model
### Task queue & GigaThread engine
Host: tasks for the GPU are pushed into a queue (the "default stream") and execute in order.

Device: the GigaThread engine manages the GPU workload and dispatches blocks to multiprocessors (SMs):
- blocks are divided into warps (groups of 32 threads)
- per SM, warps are submitted to warp schedulers, which issue instructions

Scheduling: mapping and ordering application blocks onto hardware resources.

Context switching: swapping the state and data of blocks that replace each other.

Block scheduling:
- a block runs on one SM, to completion, without preemption
- block execution order is undefined; any order must give a valid result
- the same application runs correctly on SMs with different numbers of cores (performance may differ)

Warp scheduling:
- blocks are divided into warps ("wavefronts" on AMD)
- threads in a warp execute in lock-step
  - same operation in every cycle
- warps are mapped onto cores; the number of concurrent warps per SM is limited by hardware resources
- undefined warp execution order
- very fast context switching: stalled warps are immediately replaced, with no fairness guarantee
- if all warps are stalled, no instruction can be issued and performance is lost
  - the main reason for stalling is waiting on global memory
  - if global memory must be read often, maximize occupancy: active warps divided by the maximum number of active warps
  - resources are allocated for the entire block
  - potential occupancy limiters: register usage, shared memory usage, block size
  - to see which resources a kernel uses, pass compiler flags: `nvcc -Xptxas -v`
- divergence (see the sketch after this list):
  - when each thread does something different, the worst case is 1/32 of the performance
  - the impact depends on whether branching is data-dependent or ID-dependent: if ID-dependent, consider changing the grouping of threads; if data-dependent, consider sorting
  - non-diverging warps pay no performance penalty, so branches by themselves are not expensive
  - best to avoid divergence at warp level with logical operators, lazy decisions, etc.
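The ID-dependent case can be made concrete with a minimal sketch (not from the lecture); the kernels and the placeholder device functions `doEven`/`doOdd` are illustrative assumptions. Both kernels apply one operation to even-indexed elements and another to odd-indexed elements, but only the second keeps the branch condition uniform within each warp of 32 threads.

```c
// Placeholder per-element operations (illustrative assumptions).
__device__ float doEven(float x) { return x * 2.0f; }
__device__ float doOdd (float x) { return x + 1.0f; }

// Diverging: within every warp, even and odd lanes take different branches,
// so the hardware serializes the two paths (roughly half the throughput).
__global__ void interleaved(float *data, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        if (i % 2 == 0) data[i] = doEven(data[i]);
        else            data[i] = doOdd(data[i]);
    }
}

// Regrouped: even-numbered warps process even elements, odd-numbered warps
// process odd elements. The branch condition is constant within each warp,
// so no warp diverges -- at the cost of a slightly more complex mapping.
__global__ void regrouped(float *data, int n) {
    int i    = threadIdx.x + blockDim.x * blockIdx.x;
    int warp = i / 32, lane = i % 32;
    int elem = (warp / 2) * 64 + 2 * lane + (warp % 2); // even or odd element
    if (elem < n) {
        if (warp % 2 == 0) data[elem] = doEven(data[elem]);
        else               data[elem] = doOdd(data[elem]);
    }
}
```

Note that the regrouped version trades stride-1 accesses for stride-2 accesses, so whether it actually wins depends on how expensive the two branches are.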
## Memory spaces
Multiple device memory scopes:
- per-thread local memory
- per-SM shared memory: each block has its own shared memory, which works like an explicit software cache; the data is accessible to all threads in the same block
  - declared with e.g. `__shared__ float arr[128];`
  - not coherent: `__syncthreads()` is required to make writes visible to the other threads
  - good to use when the caching pattern is not regular, or when data is reused
- device/global memory: the GPU frame buffer, usable by any thread
  - allocated and initialized by the host program (`cudaMalloc()`, `cudaMemcpy()`)
  - persistent: values are retained between kernels
  - not coherent: writes by other threads might not be visible until the kernel finishes
- constant memory: fast memory for read-only data
  - defined as a global variable, e.g. `__constant__ float speed_of_light = 0.299792458f;`, or initialized with `cudaMemcpyToSymbol`
  - read-only for the GPU; cannot be accessed directly by the host
- texture memory: read-only, initialized with special constructs on the CPU and used with special functions on the GPU
- registers: hold thread-local scalars and constant-size arrays; stored in the SM register file

Unified/managed memory:
- all processors see a single coherent memory image with a common address space
- no explicit memory copy calls are needed
- performance issues: page faults, scheduling of copies, synchronization
- behaviour is hardware-generation dependent

To avoid expensive data movement: keep most operations on the device, including initialization; prefetch managed memory when needed; or use explicit memory copies.

Prefetching: move data to the GPU before it is needed.

Advise: establish the location where data resides, and only copy data on demand to a device where it does not reside.

The host (CPU) manages device (GPU) memory and copies it back and forth.

Example with unified memory:

```c
#include <stdio.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMallocManaged(&ret, 1000 * sizeof(int));
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    cudaDeviceSynchronize();   // wait for the kernel before the host touches ret
    for (i = 0; i < 1000; ++i) printf("%d ", ret[i]);
    cudaFree(ret);
}
```

Example with explicit copies:

```c
#include <stdio.h>
#include <stdlib.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMalloc(&ret, 1000 * sizeof(int));
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    int *host_ret = (int *) malloc(1000 * sizeof(int));
    // cudaMemcpy waits for the kernel to finish, then copies device -> host
    cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDefault);
    for (i = 0; i < 1000; ++i) printf("%d ", host_ret[i]);
    free(host_ret); cudaFree(ret);
}
```

Memory coalescing:
- combining multiple memory accesses into one transaction
- group memory accesses into as few memory transactions as possible
- stride-1 access patterns are preferred
- a structure of arrays is often better than an array of structures (see the sketch below)
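A minimal sketch (not from the lecture) of the last point; the `Particle` struct and the field names are assumptions made up for the example. In the AoS kernel, consecutive threads read `x` fields that sit `sizeof(struct Particle)` bytes apart, which spreads a warp's accesses over many transactions; in the SoA kernel, a warp reads 32 consecutive floats, which coalesces into one or very few transactions.

```c
// Array of structures: thread i touches aos[i].x, so consecutive threads
// access addresses 16 bytes apart -> strided, poorly coalesced.
struct Particle { float x, y, z, w; };

__global__ void scale_x_aos(struct Particle *aos, int n, float s) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) aos[i].x *= s;
}

// Structure of arrays: thread i touches x[i], so a warp reads/writes
// 32 consecutive floats -> stride-1, coalesced.
__global__ void scale_x_soa(float *x, int n, float s) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) x[i] *= s;
}
```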