commit c2c30f46c76c4d55f6e867c0e04c52a1236b46f4
parent f46ddba72420911c8cc8e17a0c195581668fce7d
Author: Alex Balgavy <alex@balgavy.eu>
Date: Fri, 12 Mar 2021 13:28:42 +0100
PMMS GPU notes
Diffstat:
2 files changed, 271 insertions(+), 0 deletions(-)
diff --git a/content/programming-multi-core-and-many-core-systems/_index.md b/content/programming-multi-core-and-many-core-systems/_index.md
@@ -7,3 +7,4 @@ title = "Programming Multi-Core and Many-Core Systems"
3. [Lecture 3](lecture-3)
4. [Lecture 4](lecture-4)
5. [Lecture 5](lecture-5)
+6. [Lectures on GPU programming](lectures-gpu)
diff --git a/content/programming-multi-core-and-many-core-systems/lectures-gpu.md b/content/programming-multi-core-and-many-core-systems/lectures-gpu.md
@@ -0,0 +1,270 @@
++++
+title = 'GPU programming'
++++
+
+# GPU computing
+GPUs offer higher performance and better power efficiency than CPUs, but they are not easy to program well/efficiently.
+
+CPU:
+- few complex cores
+- lots of on-chip memory & control logic
+
+GPU:
+- many simple cores
+- little memory & control
+
+NVIDIA GPU architecture
+- many slim cores, grouped into sets called 'streaming multiprocessors' (SMs) with shared memory
+- various memory spaces -- on-chip, off-chip global, separate CPU and GPU memories
+- work as accelerators, with CPU as the host
+- CPU offloads work to GPU, symmetric multi-threading
+- execution model SIMT: single instruction, multiple threads (with hardware scheduler)
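+
+As a concrete illustration of these architecture parameters (SM count, warp size, memory sizes), the host can query them at runtime. This is a minimal sketch, not lecture code; it only assumes the standard CUDA runtime call `cudaGetDeviceProperties`:
+
+```c
+#include <stdio.h>
+#include <cuda_runtime.h>
+
+int main() {
+  cudaDeviceProp prop;
+  cudaGetDeviceProperties(&prop, 0);  // properties of device 0
+
+  printf("Device: %s\n", prop.name);
+  printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
+  printf("Warp size: %d\n", prop.warpSize);
+  printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
+  printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
+  return 0;
+}
+```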
+
+## Programming GPUs
+- The kernel is the parallel program, which runs on the GPU (device code).
+- Host code manages the parallel program: it launches kernels and manages memory.
+
+Low-level models: CUDA, OpenCL, and variations
+
+CUDA:
+- mapping of hardware into programming model
+ - thread hierarchy that maps to cores, configurable at logical level
+ - memory spaces mapping to physical memory spaces, usable through variable scopes and types
+- symmetric multithreading: write code for single thread, instantiate as many as needed
+- SIMT: single instruction on multiple threads at the same time
+- thread hierarchy:
+ - each thread executes same kernel code
+ - one thread runs on one core
+ - threads grouped into thread blocks, where only those in same block can cooperate
+ - all thread blocks logically organised in grid (1D/2D/3D). the grid specifies how many instances run the kernel
+- parallelization for GPUs: map data/work to threads, write the computation for one thread, organise threads in blocks, and blocks in grids (as in the sketch below)
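+
+A minimal sketch of this mapping (the kernel names and data are made up, not lecture code): each thread combines its thread and block indices into a global index, in 1D or 2D.
+
+```c
+// 1D mapping: one thread per element of a vector of length n
+__global__ void scale1D(int n, float *data) {
+  int i = threadIdx.x + blockIdx.x * blockDim.x;
+  if (i < n) data[i] *= 2.0f;
+}
+
+// 2D mapping: one thread per element of a rows x cols matrix
+__global__ void scale2D(int rows, int cols, float *data) {
+  int col = threadIdx.x + blockIdx.x * blockDim.x;
+  int row = threadIdx.y + blockIdx.y * blockDim.y;
+  if (row < rows && col < cols) data[row * cols + col] *= 2.0f;
+}
+
+// Host side: organise threads in blocks, and blocks in a grid
+// scale1D<<< (n + 255) / 256, 256 >>>(n, d_data);
+// dim3 block(16, 16);
+// dim3 grid((cols + 15) / 16, (rows + 15) / 16);
+// scale2D<<< grid, block >>>(rows, cols, d_data);
+```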
+
+CUDA program organisation:
+- device code: GPU code, i.e. kernels and GPU functions
+ - sequential, write for one thread & execute for all
+- host code: CPU code
+ - instantiate grid, run the kernel
+ - handles memory management
+- host-device communication is explicit or implicit, via PCIe or NVLink
+
+Example of code:
+
+```c
+// GPU kernel code
+__global__ void myKernel(int n, int *dataGPU) {
+ myProcess(dataGPU);
+}
+
+// GPU device code
+__device__ void myProcess(int *dataGPU) {
+ // code
+}
+
+// CPU code
+int main(int argc, const char **argv) {
+ int *myData = NULL; // assumed to point to GPU memory (e.g. allocated with cudaMalloc)
+ myKernel<<<100, 10>>>(1000, myData);
+}
+```
+
+Compiling CUDA:
+- use `nvcc`
+- separates source code into device code (GPU) and host code (CPU)
+
+Execution flow loop:
+1. GPU memory allocation
+2. Transfer data CPU → GPU
+3. CPU calls GPU kernel
+4. GPU kernel executes
+5. Transfer data GPU → CPU
+6. GPU memory release
+
+Creating a CUDA application
+1. Identify function to offload
+2. Determine mapping of operations and data to threads
+3. Write kernels & device functions (sequential, per-thread)
+4. Determine block geometry, i.e. threads per block and blocks per grid
+5. Write host code: memory initialization, kernel launch, inspect results
+6. Optimize kernels
+
+GPU data transfer models
+- explicit allocation and management: allocated twice, copied explicitly on demand, allows asynchronous copies
+- implicit unified memory: allocated once, implicitly moved/copied when needed, potentially prefetched
+
+## Example: vector add
+First, the sequential code:
+
+```c
+void vector_add(int size, float *a, float *b, float *c) {
+ for (int i = 0; i < size; ++i) {
+ c[i] = a[i] + b[i];
+ }
+}
+```
+
+What does each thread compute? One addition per thread; each thread works on a different element. To find out which one, compute the mapping from the grid to the data.
+
+Example with CUDA:
+
+```c
+#include <stdlib.h> // for malloc/free
+
+// GPU kernel code
+// compute vector sum c = a+b
+// each thread does one pair-wise addition
+__global__ void vector_add(int n, float *a, float *b, float *c) {
+ int i = threadIdx.x + blockDim.x * blockIdx.x; // mapping of thread to element
+ if (i < n) c[i] = a[i] + b[i];
+}
+
+// Host CPU code
+int main() {
+ int N = 5000;
+ int size = N * sizeof(float);
+ float *hostA = (float *) malloc(size);
+ float *hostB = (float *) malloc(size);
+ float *hostC = (float *) malloc(size);
+
+ // initialize A, B arrays
+
+ // allocate device memory
+ float *deviceA, *deviceB, *deviceC;
+ cudaMalloc(&deviceA, size);
+ cudaMalloc(&deviceB, size);
+ cudaMalloc(&deviceC, size);
+
+ // transfer data from host to device
+ cudaMemcpy(deviceA, hostA, size, cudaMemcpyHostToDevice);
+ cudaMemcpy(deviceB, hostB, size, cudaMemcpyHostToDevice);
+
+ // launch N/256+1 blocks of 256 threads each
+ vector_add<<< N/256+1, 256 >>>(N, deviceA, deviceB, deviceC);
+
+ // transfer result back from device to host
+ cudaMemcpy(hostC, deviceC, size, cudaMemcpyDeviceToHost);
+
+ cudaFree(deviceA);
+ cudaFree(deviceB);
+ cudaFree(deviceC);
+ free(hostA);
+ free(hostB);
+ free(hostC);
+}
+```
+
+With OpenACC:
+
+```c
+void vector_add(int size, float *a, float *b, float *c) {
+ #pragma acc kernels copyin(a[0:size], b[0:size]) copyout(c[0:size])
+ for (int i = 0; i < size; ++i) {
+ c[i] = a[i] + b[i];
+ }
+}
+```
+
+## Execution model
+### Task queue & GigaThread engine
+Host: tasks for the GPU are pushed into a queue (the "default stream") and executed in order
+
+Device: GigaThread engine manages GPU workload, dispatches blocks to multiprocessors (SMs)
+- blocks divided into warps (groups of 32 threads)
+- per SM, warps submitted to warp schedulers, which issue instructions
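+
+A small sketch of the in-order default stream (the kernels are hypothetical, not lecture code): kernel launches return immediately on the host, but the GPU executes the queued work in submission order.
+
+```c
+#include <cuda_runtime.h>
+
+// two hypothetical kernels; the GPU runs them in the order they were queued
+__global__ void stepA(int *data) { data[threadIdx.x] += 1; }
+__global__ void stepB(int *data) { data[threadIdx.x] *= 2; }
+
+int main() {
+  int *d_data;
+  cudaMalloc(&d_data, 256 * sizeof(int));
+  cudaMemset(d_data, 0, 256 * sizeof(int));
+
+  // both launches go into the default stream and return immediately on the host
+  stepA<<< 1, 256 >>>(d_data);
+  stepB<<< 1, 256 >>>(d_data);  // starts only after stepA has finished
+
+  cudaDeviceSynchronize();      // host waits until all queued work has completed
+  cudaFree(d_data);
+  return 0;
+}
+```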
+
+Scheduling: mapping and ordering application blocks on hardware resources
+
+Context switching: swapping state and data of blocks that replace each other
+
+Block scheduling:
+- one block runs on one SM, to completion without preemption
+- undefined block execution order, any should be valid
+- same application runs correctly on SMs with different numbers of cores (performance may differ)
+
+Warp scheduling:
+- blocks divided into warps ("wavefronts" in AMD)
+- threads in warp execute in lock-step
+ - same operation in every cycle
+- warps mapped onto cores, concurrent warps per SM limited by hardware resources
+- undefined warp execution order
+- very fast context switching: stalled warps are immediately replaced (no fairness guarantee)
+- if all warps are stalled, no instructions are issued, so performance is lost
+ - the main reason for stalls is waiting on global memory
+ - if global memory must be read often, maximize occupancy: active warps divided by maximum active warps
+ - resources allocated for entire block
+ - potential occupancy limiters: register usage, shared memory usage, block size
+ - to figure out used resources, use compiler flags: `nvcc -Xptxas -v`
+- divergence:
+ - when each thread does something different, worst case is 1/32 performance
+ - depends on whether branching is data-dependent or ID-dependent: if ID-dependent, consider changing the grouping of threads; if data-dependent, consider sorting
+ - non-diverging warps have no performance penalty, so branches there are not expensive
+ - best to avoid divergence at the warp level with logical operators, lazy decisions, etc. (see the sketch below)
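+
+A sketch of ID-dependent divergence and one way to regroup the work (hypothetical kernels, not lecture code): in the first kernel, threads of the same warp take different branches; in the second, the branch granularity is a whole warp, so no warp diverges (note that this changes which elements get which operation, i.e. the grouping of work).
+
+```c
+// Diverging: odd and even threads of the same warp take different branches,
+// so in the worst case the two paths are serialized
+__global__ void diverging(float *data) {
+  int i = threadIdx.x + blockIdx.x * blockDim.x;
+  if (i % 2 == 0) data[i] *= 2.0f;
+  else            data[i] += 1.0f;
+}
+
+// Non-diverging: branch on the warp index instead of the thread index,
+// so all 32 threads of a warp take the same path
+__global__ void nonDiverging(float *data) {
+  int i = threadIdx.x + blockIdx.x * blockDim.x;
+  int warpId = threadIdx.x / 32;
+  if (warpId % 2 == 0) data[i] *= 2.0f;
+  else                 data[i] += 1.0f;
+}
+```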
+
+## Memory spaces
+Multiple device memory scopes:
+- per-thread local memory
+- per-SM shared memory: each block has its own shared memory, like an explicit software cache; data is accessible to all threads in the same block
+ - declared with e.g. `__shared__ float arr[128];`
+ - not coherent, `__syncthreads()` required to make writes visible to other threads
+ - good to use when the caching pattern is not regular, or when there is data reuse (see the sketch after this list)
+- device/global memory: GPU frame buffer, any thread can use
+ - allocated and initialized by host program (`cudaMalloc()`, `cudaMemcpy()`)
+ - persistent, values are retained between kernel launches
+ - not coherent, writes by other threads might not be visible until kernel finishes
+- constant memory: fast memory for read-only data
+ - defined as a global variable: `__constant__ float speed_of_light = 0.299792458f;`, or initialized with `cudaMemcpyToSymbol`
+ - read-only for GPU, cannot be accessed directly by host
+- texture memory: read-only, initialized with special constructs on CPU and used with special functions on GPU
+- registers: store thread-local scalars and constant-size arrays, located in the SM register file
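+
+A sketch of shared memory as an explicit software cache (kernel name and block size are made up, not lecture code; assumes the input length is a multiple of the block size): each block stages a tile in shared memory, synchronizes, and then every thread reuses an element loaded by another thread.
+
+```c
+#define BLOCK 128
+
+// reverse each block-sized tile of the input
+__global__ void reverseTile(float *in, float *out) {
+  __shared__ float tile[BLOCK];   // one tile per block, visible to all its threads
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+
+  tile[threadIdx.x] = in[i];      // each thread loads one element
+  __syncthreads();                // make all writes visible to the whole block
+
+  out[i] = tile[blockDim.x - 1 - threadIdx.x];  // reuse another thread's element
+}
+```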
+
+Unified/managed memory
+- all processors see single coherent memory image with common address space
+- no explicit memory copy calls needed
+- performance issues with page faults, scheduling of copies, sync
+- behaviour is hardware-generation dependent
+
+To avoid these performance issues:
+- avoid expensive data movement, keeping most operations (including initialization) on the device
+- prefetch managed memory when needed
+- use explicit memory copies
+
+Prefetching: move data to GPU prior to needing it
+
+Advise (`cudaMemAdvise`): establish the location where data resides; only copy data on demand to a device where it does not reside (see the sketch below)
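+
+A sketch of prefetching and advising managed memory on hardware that supports it (device 0, sizes, and the kernel are assumptions, not lecture code): `cudaMemAdvise` sets a preferred location so pages are only copied on demand elsewhere, and `cudaMemPrefetchAsync` moves the data to the GPU before the kernel needs it.
+
+```c
+#include <cuda_runtime.h>
+
+__global__ void scale(int n, float *data) {
+  int i = threadIdx.x + blockIdx.x * blockDim.x;
+  if (i < n) data[i] *= 2.0f;
+}
+
+int main() {
+  int n = 1 << 20;
+  size_t bytes = n * sizeof(float);
+  float *data;
+  cudaMallocManaged(&data, bytes);
+  for (int i = 0; i < n; ++i) data[i] = 1.0f;  // initialized on the host
+
+  // advise: prefer that the data resides on device 0
+  cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0);
+
+  // prefetch: move the pages to device 0 before the kernel needs them
+  cudaMemPrefetchAsync(data, bytes, 0);
+
+  scale<<< (n + 255) / 256, 256 >>>(n, data);
+  cudaDeviceSynchronize();
+  cudaFree(data);
+  return 0;
+}
+```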
+
+Host (CPU) manages device (GPU) memory, and copies it back and forth.
+
+Example with unified memory:
+
+```c
+#include <stdio.h>
+
+__global__ void AplusB(int *ret, int a, int b) {
+ ret[threadIdx.x] = a + b + threadIdx.x;
+}
+
+int main() {
+ int *ret, i;
+ cudaMallocManaged(&ret, 1000 * sizeof(int));
+ AplusB<<< 1, 1000 >>>(ret, 10, 100);
+ cudaDeviceSynchronize();
+ for (i = 0; i < 1000; ++i) printf("%d ", ret[i]);
+ cudaFree(ret);
+}
+```
+
+Example with explicit copies
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+
+__global__ void AplusB(int *ret, int a, int b) {
+ ret[threadIdx.x] = a + b + threadIdx.x;
+}
+
+int main() {
+ int *ret, i;
+ cudaMalloc(&ret, 1000 * sizeof(int));
+ AplusB<<< 1, 1000 >>>(ret, 10, 100);
+ int *host_ret = (int *) malloc(1000 * sizeof(int));
+ cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDefault);
+ for (i = 0; i < 1000; ++i) printf("%d ", host_ret[i]);
+ free(host_ret); cudaFree(ret);
+}
+```
+
+Memory coalescing
+- combining multiple memory accesses into one transaction.
+- group memory accesses in as few memory transactions as possible.
+- stride-1 access patterns are preferred.
+- structure of arrays is often better than array of structures (see the sketch below).
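+
+A sketch of the difference (hypothetical particle types, not lecture code): with an array of structures, consecutive threads access memory with a large stride; with a structure of arrays, they access consecutive addresses, which coalesces into few transactions.
+
+```c
+// Array of structures: thread i reads p[i].x, so consecutive threads access
+// memory with a stride of sizeof(Particle) -- poorly coalesced
+struct Particle { float x, y, z, w; };
+
+__global__ void scaleAoS(int n, Particle *p) {
+  int i = threadIdx.x + blockIdx.x * blockDim.x;
+  if (i < n) p[i].x *= 2.0f;
+}
+
+// Structure of arrays: thread i reads x[i], so consecutive threads access
+// consecutive addresses (stride 1) -- coalesced into few transactions
+struct Particles { float *x, *y, *z, *w; };
+
+__global__ void scaleSoA(int n, Particles p) {
+  int i = threadIdx.x + blockIdx.x * blockDim.x;
+  if (i < n) p.x[i] *= 2.0f;
+}
+```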