+++
title = 'GPU programming'
+++

# GPU computing
GPUs offer higher performance and better power efficiency than CPUs, but they are not easy to program well/efficiently.

CPU:
- few complex cores
- lots of on-chip memory & control logic

GPU:
- many simple cores
- little memory & control

NVIDIA GPU architecture:
- many slim cores, grouped into 'multiprocessors' (SMs) with shared memory
- various memory spaces -- on-chip, off-chip global, and separate CPU and GPU memories
- works as an accelerator, with the CPU as the host
- the CPU offloads work to the GPU; symmetric multi-threading
- execution model is SIMT: single instruction, multiple threads (with a hardware scheduler)

## Programming GPUs
- The kernel is the parallel program that runs on the device.
- Host code manages the parallel program (launch, memory, results).

Low-level models: CUDA, OpenCL, and variations.

CUDA:
- mapping of hardware into the programming model
- thread hierarchy that maps to cores, configurable at the logical level
- memory spaces mapping to physical memory spaces, usable through variable scopes and types
- symmetric multithreading: write code for a single thread, instantiate as many as needed
- SIMT: single instruction on multiple threads at the same time
- thread hierarchy:
  - each thread executes the same kernel code
  - one thread runs on one core
  - threads are grouped into thread blocks; only threads in the same block can cooperate
  - all thread blocks are logically organised in a grid (1D/2D/3D); the grid specifies how many instances of the kernel run
- parallelization for GPUs: map data/work to threads and write the computation for one thread; organise threads in blocks, and blocks in a grid

CUDA program organisation:
- device code: GPU code, i.e. kernels and GPU functions
  - sequential: write for one thread, execute for all
- host code: CPU code
  - instantiates the grid and runs the kernel
  - handles memory management
- host-device communication is explicit/implicit via PCIe or NVLink

Example of code:

```c
// GPU device function (callable only from GPU code)
__device__ void myProcess(int *dataGPU) {
    // code
}

// GPU kernel code
__global__ void myKernel(int n, int *dataGPU) {
    myProcess(dataGPU);
}

// CPU (host) code
int main(int argc, const char **argv) {
    // myData must point to device memory; launch 100 blocks of 10 threads
    myKernel<<<100, 10>>>(1000, myData);
}
```

Compiling CUDA:
- use nvcc
- nvcc separates the source code into device code (GPU) and host code (CPU)

Execution flow loop:
1. GPU memory allocation
2. Transfer data CPU → GPU
3. CPU calls GPU kernel
4. GPU kernel executes
5. Transfer data GPU → CPU
6. GPU memory release

Creating a CUDA application (steps 2 and 4 are illustrated in the sketch after this list):
1. Identify the function to offload
2. Determine the mapping of operations and data to threads
3. Write kernels & device functions (sequential, per-thread)
4. Determine block geometry, i.e. threads per block and blocks per grid
5. Write host code: memory initialization, kernel launch, inspect results
6. Optimize kernels
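As a concrete illustration of steps 2 and 4, here is a minimal sketch (not from the lecture): it maps a 2D image to a 2D grid of 16×16 thread blocks. The `invert` kernel, the `launch` helper, the image dimensions, and the block size are assumptions chosen for the example; the index arithmetic and the ceiling division for the grid size are the standard pattern.

```c
// Illustrative sketch: one thread per pixel of a WIDTH x HEIGHT image.
#define WIDTH  1920
#define HEIGHT 1080

__global__ void invert(unsigned char *img, int width, int height) {
    // Step 2: map each thread to one (x, y) pixel via block and thread indices.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard: the grid may overshoot the data
        img[y * width + x] = 255 - img[y * width + x];
}

void launch(unsigned char *deviceImg) {
    // Step 4: block geometry. Ceiling division ensures every pixel is covered
    // even when the image size is not a multiple of the block size.
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((WIDTH  + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (HEIGHT + threadsPerBlock.y - 1) / threadsPerBlock.y);
    invert<<<blocksPerGrid, threadsPerBlock>>>(deviceImg, WIDTH, HEIGHT);
}
```

The `if (x < width && y < height)` guard plays the same role as the `if (i < n)` check in the vector-add kernel below: extra threads launched by the rounded-up grid simply do nothing.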
GPU data transfer models:
- explicit allocation and management: memory is allocated twice (host and device) and copied explicitly on demand; allows asynchronous copies
- implicit unified memory: allocated once, implicitly moved/copied when needed, potentially prefetched

## Example: vector add
First, the sequential code:

```c
void vector_add(int size, float *a, float *b, float *c) {
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

What does each thread compute? One addition per thread, and each thread works on a different element. To find out which one, compute the mapping of the grid to the data.

Example with CUDA:

```c
// GPU kernel code
// compute vector sum c = a + b
// each thread does one pair-wise addition
__global__ void vector_add(int n, float *a, float *b, float *c) {
    int i = threadIdx.x + blockDim.x * blockIdx.x; // mapping of thread to element
    if (i < n) c[i] = a[i] + b[i];                 // guard against the extra threads
}

// Host (CPU) code
int main() {
    int N = 5000;
    int size = N * sizeof(float);
    float *hostA = (float *) malloc(size);
    float *hostB = (float *) malloc(size);
    float *hostC = (float *) malloc(size);
    float *deviceA, *deviceB, *deviceC;

    // initialize A, B arrays

    // allocate device memory
    cudaMalloc(&deviceA, size);
    cudaMalloc(&deviceB, size);
    cudaMalloc(&deviceC, size);

    // transfer data from host to device
    cudaMemcpy(deviceA, hostA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, size, cudaMemcpyHostToDevice);

    // launch N/256 + 1 blocks of 256 threads each
    vector_add<<< N/256 + 1, 256 >>>(N, deviceA, deviceB, deviceC);

    // transfer result back from device to host
    cudaMemcpy(hostC, deviceC, size, cudaMemcpyDeviceToHost);

    cudaFree(deviceA);
    cudaFree(deviceB);
    cudaFree(deviceC);
    free(hostA);
    free(hostB);
    free(hostC);
}
```

With OpenACC:

```c
void vector_add(int size, float *a, float *b, float *c) {
    #pragma acc kernels copyin(a[0:size], b[0:size]) copyout(c[0:size])
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

## Execution model
### Task queue & GigaThread engine
Host: tasks for the GPU are pushed into a queue (the "default stream") and execute in order.

Device: the GigaThread engine manages the GPU workload and dispatches blocks to multiprocessors (SMs):
- blocks are divided into warps (groups of 32 threads)
- per SM, warps are submitted to warp schedulers, which issue instructions

Scheduling: mapping and ordering application blocks onto hardware resources.

Context switching: swapping the state and data of blocks that replace each other.

Block scheduling:
- a block runs on one SM, to completion, without preemption
- block execution order is undefined; any order must give a valid result
- the same application runs correctly on SMs with different numbers of cores (performance may differ)

Warp scheduling:
- blocks are divided into warps ("wavefronts" on AMD)
- threads in a warp execute in lock-step
  - same operation in every cycle
- warps are mapped onto cores; the number of concurrent warps per SM is limited by hardware resources
- undefined warp execution order
- very fast context switching: stalled warps are immediately replaced, with no fairness guarantee
- if all warps are stalled, no instruction can be issued and performance is lost
  - the main reason for stalling is waiting on global memory
  - if global memory must be read often, maximize occupancy: active warps divided by the maximum number of active warps
  - resources are allocated for the entire block
  - potential occupancy limiters: register usage, shared memory usage, block size
  - to see which resources a kernel uses, pass compiler flags: `nvcc -Xptxas -v`
- divergence (see the sketch after this list):
  - when each thread does something different, the worst case is 1/32 of the performance
  - the impact depends on whether branching is data-dependent or ID-dependent: if ID-dependent, consider changing the grouping of threads; if data-dependent, consider sorting
  - non-diverging warps pay no performance penalty, so branches by themselves are not expensive
  - best to avoid divergence at warp level with logical operators, lazy decisions, etc.
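The ID-dependent case can be made concrete with a minimal sketch (not from the lecture); the kernels and the placeholder device functions `doEven`/`doOdd` are illustrative assumptions. Both kernels apply one operation to even-indexed elements and another to odd-indexed elements, but only the second keeps the branch condition uniform within each warp of 32 threads.

```c
// Placeholder per-element operations (illustrative assumptions).
__device__ float doEven(float x) { return x * 2.0f; }
__device__ float doOdd (float x) { return x + 1.0f; }

// Diverging: within every warp, even and odd lanes take different branches,
// so the hardware serializes the two paths (roughly half the throughput).
__global__ void interleaved(float *data, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        if (i % 2 == 0) data[i] = doEven(data[i]);
        else            data[i] = doOdd(data[i]);
    }
}

// Regrouped: even-numbered warps process even elements, odd-numbered warps
// process odd elements. The branch condition is constant within each warp,
// so no warp diverges -- at the cost of a slightly more complex mapping.
__global__ void regrouped(float *data, int n) {
    int i    = threadIdx.x + blockDim.x * blockIdx.x;
    int warp = i / 32, lane = i % 32;
    int elem = (warp / 2) * 64 + 2 * lane + (warp % 2); // even or odd element
    if (elem < n) {
        if (warp % 2 == 0) data[elem] = doEven(data[elem]);
        else               data[elem] = doOdd(data[elem]);
    }
}
```

Note that the regrouped version trades stride-1 accesses for stride-2 accesses, so whether it actually wins depends on how expensive the two branches are.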
## Memory spaces
Multiple device memory scopes:
- per-thread local memory
- per-SM shared memory: each block has its own shared memory, which works like an explicit software cache; the data is accessible to all threads in the same block
  - declared with e.g. `__shared__ float arr[128];`
  - not coherent: `__syncthreads()` is required to make writes visible to the other threads
  - good to use when the caching pattern is not regular, or when data is reused
- device/global memory: the GPU frame buffer, usable by any thread
  - allocated and initialized by the host program (`cudaMalloc()`, `cudaMemcpy()`)
  - persistent: values are retained between kernels
  - not coherent: writes by other threads might not be visible until the kernel finishes
- constant memory: fast memory for read-only data
  - defined as a global variable, e.g. `__constant__ float speed_of_light = 0.299792458f;`, or initialized with `cudaMemcpyToSymbol`
  - read-only for the GPU; cannot be accessed directly by the host
- texture memory: read-only, initialized with special constructs on the CPU and used with special functions on the GPU
- registers: hold thread-local scalars and constant-size arrays; stored in the SM register file

Unified/managed memory:
- all processors see a single coherent memory image with a common address space
- no explicit memory copy calls are needed
- performance issues: page faults, scheduling of copies, synchronization
- behaviour is hardware-generation dependent

To avoid expensive data movement: keep most operations on the device, including initialization; prefetch managed memory when needed; or use explicit memory copies.

Prefetching: move data to the GPU before it is needed.

Advise: establish the location where data resides, and only copy data on demand to a device where it does not reside.

The host (CPU) manages device (GPU) memory and copies it back and forth.

Example with unified memory:

```c
#include <stdio.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMallocManaged(&ret, 1000 * sizeof(int));
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    cudaDeviceSynchronize();   // wait for the kernel before the host touches ret
    for (i = 0; i < 1000; ++i) printf("%d ", ret[i]);
    cudaFree(ret);
}
```

Example with explicit copies:

```c
#include <stdio.h>
#include <stdlib.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMalloc(&ret, 1000 * sizeof(int));
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    int *host_ret = (int *) malloc(1000 * sizeof(int));
    // cudaMemcpy waits for the kernel to finish, then copies device -> host
    cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDefault);
    for (i = 0; i < 1000; ++i) printf("%d ", host_ret[i]);
    free(host_ret); cudaFree(ret);
}
```

Memory coalescing:
- combining multiple memory accesses into one transaction
- group memory accesses into as few memory transactions as possible
- stride-1 access patterns are preferred
- a structure of arrays is often better than an array of structures (see the sketch below)
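A minimal sketch (not from the lecture) of the last point; the `Particle` struct and the field names are assumptions made up for the example. In the AoS kernel, consecutive threads read `x` fields that sit `sizeof(struct Particle)` bytes apart, which spreads a warp's accesses over many transactions; in the SoA kernel, a warp reads 32 consecutive floats, which coalesces into one or very few transactions.

```c
// Array of structures: thread i touches aos[i].x, so consecutive threads
// access addresses 16 bytes apart -> strided, poorly coalesced.
struct Particle { float x, y, z, w; };

__global__ void scale_x_aos(struct Particle *aos, int n, float s) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) aos[i].x *= s;
}

// Structure of arrays: thread i touches x[i], so a warp reads/writes
// 32 consecutive floats -> stride-1, coalesced.
__global__ void scale_x_soa(float *x, int n, float s) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) x[i] *= s;
}
```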