+++
title = 'Exam Review Notes'
+++

# Exam review notes

These are the notes I wrote while revising, sort of a summary of everything.

## Optimizing code on a single-thread CPU
- do computations/memory accesses once if the result is needed repeatedly
- replace expensive operations with cheaper ones, e.g. `16*x` ⇒ `x << 4` (strength reduction)
- inline functions (manually, or with compiler `-O1` and higher)
- use `restrict` if memory is only accessed through one pointer

## In-core parallelism: SIMD vectorization
- same instruction across a vector of data at once
- autovectorization with `-O2` and higher or specific flags, if
  - the loop count doesn't change after the loop starts
  - there are no data dependencies (`#pragma GCC ivdep`)
  - check the compiler reports!
- data alignment leads to better performance
  - `float A[3] __attribute__((aligned(16)))`
  - `#pragma vector aligned`
- hand-vectorize using intrinsics: unroll the loop by the intended SIMD width, then use vector operations (see the sketch after this list)
- locality: use low-stride memory accesses, reuse data, keep loop bodies small
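
A minimal sketch of the hand-vectorization idea, assuming SSE (128-bit registers, four floats per operation), 16-byte-aligned arrays, and an `n` that is a multiple of 4; the function name is made up. The `restrict` qualifiers tie in the single-pointer tip from the first section.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* c[i] = a[i] + b[i], unrolled by the SIMD width (4 floats per __m128).
   Assumes 16-byte-aligned pointers and n % 4 == 0. */
void add_sse(float *restrict c, const float *restrict a,
             const float *restrict b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);            /* aligned 128-bit load */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));   /* four additions in one instruction */
    }
}
```
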
## OpenMP (`-fopenmp`)
- loop parallelization if there are no data dependencies: `#pragma omp parallel for` (combines the `parallel` and `for` pragmas)
- types of variables
  - private: one per thread, no communication between threads
  - shared: one for all threads, communication between threads and sections
  - by default, only the loop variables of parallel loops are private
  - firstprivate: initialized with the master thread's value
  - lastprivate: value from the last iteration/section is propagated out
- data race: behavior depends on the execution order of threads
- critical section: block of statements where only one thread runs at a time
  - all unnamed critical sections sync with each other, all with the same name sync with each other
- reduction clause for sum, max, min, product, etc. (see the first sketch after this section)
- the `if` clause lets you parallelize only when there is sufficient workload
- scheduling in loops:
  - static/block: loop divided into `nthreads` chunks
  - static, size 1: iterations assigned to threads round-robin ("cyclic")
  - static, size n: loop divided into chunks of n iterations, assigned round-robin ("block-cyclic")
  - dynamic, size n: loop divided into chunks of n iterations, assigned to threads on demand
  - guided, size n: chunk size decreases exponentially with each assignment, chunks assigned on demand (n = minimum chunk size)
  - runtime: choose the schedule at run time
  - static preferable for uniform workloads, dynamic otherwise (guided usually best)
  - watch out for sync barriers!
- collapse clause collapses perfectly nested loops
- `#pragma omp parallel` starts/terminates threads, `for` doesn't
  - so use larger `parallel` blocks!
- use `nowait` to drop implicit barriers where they aren't needed, e.g. on the last `for` in a region, since `parallel` already has a barrier at its end
- `barrier` construct creates a sync barrier where all threads wait
- `single` assigns a block to one thread, with a barrier after it (unless `nowait`)
- `master` assigns a block to the master thread, without a sync barrier (unless you add `barrier`)
- `sections`: each `section` is executed by exactly one thread, threads execute different code
- `threadprivate` vars: accessible like globals, but each thread has its own copy
  - initialize to the master's value with the `copyin` clause
- nested regions: a new team of threads is started, the creating thread becomes its master
- `#pragma omp task` spawns an async task for a block, the original thread continues (see the second sketch after this section)
  - wait for tasks with `#pragma omp taskwait`
- thread affinity
  - bind threads to places (a place is typically a core or hardware thread)
  - env variable `OMP_PLACES="{0:4},{4:4},{8:4},{12:4}"` defines 4 places with 4 execution units each
- busy wait vs. suspension controlled via `OMP_WAIT_POLICY`
- atomic operations with `#pragma omp atomic read|write|update|capture`
  - no guarantees on operational behaviour beyond the atomicity of the single operation
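
A minimal sketch of a parallel loop with the `reduction` and `schedule` clauses (compile with `-fopenmp`); the dot-product function is just a placeholder workload.

```c
/* Each thread gets a private partial sum; OpenMP combines them at the end of the loop. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++)   /* the loop variable i is private automatically */
        sum += a[i] * b[i];
    return sum;                   /* combined value, after the implicit barrier */
}
```
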
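And a sketch of the task construct: one enclosing `parallel` region, a `single` thread spawns the tasks, the whole team executes them, and `taskwait` joins them. `process()` is a stand-in for real work.

```c
#include <stdio.h>
#include <omp.h>

void process(int i) {   /* placeholder work item */
    printf("item %d done by thread %d\n", i, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel   /* start the team once: larger parallel block */
    #pragma omp single     /* one thread creates the tasks ... */
    {
        for (int i = 0; i < 8; i++) {
            #pragma omp task firstprivate(i)   /* ... any thread may execute them */
            process(i);
        }
        #pragma omp taskwait   /* wait for all tasks spawned above */
    }
    return 0;
}
```
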
## Pthreads (`-pthread`)
- the process terminates when the initial thread terminates
  - all threads terminate when the initial thread terminates
- create with `pthread_create(...)`
- wait for termination with `pthread_join()` -- otherwise you get zombie threads
- attributes: `pthread_attr_init()`, `pthread_attr_destroy()`
  - joinable (default) or detached
  - process scope or system scope
    - process: library threads, more efficient, but the whole process blocks on OS calls
    - system: kernel threads, scheduled by the OS, but expensive context switches
- critical sections
  - mutexes: only one thread holds the lock, only the owner of the lock can unlock it
    - `pthread_mutex_trylock()` always returns immediately, returning `EBUSY` if the mutex is already locked
    - hierarchical locking: acquire mutexes in a fixed ascending order; a thread may only request a lock at a higher level than the locks it already holds
  - condition variables: event-driven critical sections (see the sketch after this section)
    - allow threads to suspend themselves & trigger other threads
    - suspend with `pthread_cond_wait()`
    - wake up one thread with `pthread_cond_signal()` or all with `pthread_cond_broadcast()`
  - spin locks: no wake-up needed, but they consume resources while waiting
  - semaphores: wait until the semaphore can be decremented, or increment it
    - no ownership, any thread can operate on a semaphore
  - monitors: shared data & a set of functions to manipulate it & a mutex lock
  - reader-writer locks: allow concurrent reads, exclusive writes
  - problems with locking: low composability & priority inversion (a low-priority thread holds the lock but doesn't make progress because of its low priority)
  - software transactional memory: write, check for concurrent writes, roll back and redo if needed
    - `__transaction_atomic { ... }`
    - must only contain code without side effects
- thread-specific data: global to the thread, accessible from anywhere by that thread
- `pthread_once()` runs a function at most once
- `pthread_cancel()` cancels threads
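
A minimal mutex + condition variable sketch (compile with `-pthread`): one thread produces a value, the main thread suspends on the condition variable until it is signalled. The names and the value are placeholders.

```c
#include <pthread.h>
#include <stdio.h>

/* One shared value guarded by a mutex; a condition variable signals "ready". */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int value = 0;
static int has_value = 0;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    value = 42;                      /* produce inside the critical section */
    has_value = 1;
    pthread_cond_signal(&ready);     /* wake up one waiting thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, producer, NULL);

    pthread_mutex_lock(&lock);
    while (!has_value)                        /* always re-check the predicate after waking */
        pthread_cond_wait(&ready, &lock);     /* atomically unlocks, suspends, relocks */
    printf("got %d\n", value);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);                  /* otherwise: zombie thread */
    return 0;
}
```
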
## Common pitfalls
- use of libraries: library functions may have hidden internal state
  - library interfaces should be re-entrant
  - functions should be mathematical, free of side effects
- dynamic memory management:
  - allocation is expensive: access to the shared heap must be synchronized (so one lock per `malloc` and `free`)
- false sharing
  - two threads frequently access independent, unrelated data
  - the independent data is coincidentally mapped to nearby virtual addresses (the same cache line)
- virtual memory management
  - ccNUMA: cache-coherent non-uniform memory access
  - a memory page is allocated close to the processor that first accesses it ("first touch")

## GPU programming
- GPUs have many simple cores, little memory, little control logic
- sets of cores are grouped into multiprocessors (SMs)

### CUDA: programming GPUs
- SIMT (single instruction, multiple threads)
- parallelism:
  - a thread processor runs a thread
  - a multiprocessor runs a block of threads
  - the device runs a grid of blocks
  - grid = the number of instances of the kernel
- device code (GPU code) does the computation
- host code (CPU code) instantiates the grid, launches the kernel, manages memory
- kernel function prefixed with `__global__`
- GPU-only function prefixed with `__device__`
- launch kernel: `kernelFunc<<<thread_blocks, threads_per_block>>>(args)` (see the first sketch at the end of these notes)
  - geometry can be `dim3`
- built-in device code vars:
  - `dim3 gridDim`: grid dimension in blocks
  - `dim3 blockDim`: block dimension in threads
  - `dim3 blockIdx`: the block index in the grid
  - `dim3 threadIdx`: the thread index in the block
  - global thread index: `(blockIdx.x * blockDim.x) + threadIdx.x`
  - what's the max? `cudaGetDeviceProperties()`
- memory operations
  - explicit:
    - allocate both CPU and GPU buffers
    - `cudaMalloc`, `cudaMemset`, `cudaFree`
    - `cudaMemcpy`, last arg is an enum stating the copy direction
      - blocks the CPU until the copy is done (unless `cudaMemcpyAsync()`)
      - doesn't start until all previous CUDA calls complete
  - managed/unified memory
    - allocate once with `cudaMallocManaged`

### Execution model
- the GigaThread engine sends blocks to SMs (multiprocessors)
- blocks are divided into warps of 32 threads
- undefined block execution order, any order should give a valid result
- stalled warps are immediately replaced (mainly when waiting on global memory)
- try to maximize occupancy: `(active warps)/(max active warps)`
  - each thread block consumes registers & shared memory
- threads in a warp execute in lock-step (same instruction, different data)
  - divergence: when threads do different things (worst case 1/32 of peak performance)

### Memory spaces
- registers: stored in the SM register file, not persistent after the kernel ends
- constant: global `__constant__` variable
  - initialized with `cudaMemcpyToSymbol`
  - read-only for the GPU, inaccessible for the host
- global memory: `__device__` variables, or buffers allocated by the host with `cudaMalloc` and filled with `cudaMemcpy`
  - persistent, values retained between kernels
  - not coherent: writes by other threads might not be visible
- unified memory: single coherent address space
  - `cudaMallocManaged`, no explicit copies needed
  - move data to the GPU before it's needed: `cudaMemPrefetchAsync`
  - use `cudaMemAdvise` to establish the preferred data location
  - but it is slower than manual management
- memory coalescing
  - group memory accesses into as few memory transactions as possible
  - stride-1 access patterns are best
- shared memory: `__shared__ type var`
  - size known at compile time
  - all threads in a block see the same memory (one copy per block)
  - not initialized, threads fill it
  - not persistent, data lost when the kernel finishes
  - not coherent, you have to `__syncthreads()` (see the second sketch at the end of these notes)
  - L1 and shared memory are allocated in the same space, may need to be configured
  - best to partition data into subsets and handle each subset with one block and its shared memory
- pinned (page-locked) host memory may be a good option for performance

### Shared variables
- use atomics to avoid data races (e.g. `atomicAdd`)
- best to limit their use because they are expensive

### Consistency & synchronization
- host-device
  - `cudaMemcpy` is blocking
  - a kernel launch is not, use `cudaDeviceSynchronize`
- memory consistency
  - no ordering guarantees between warps and blocks
  - global and shared memory are not consistent
  - `__syncthreads()` for block-level sync
  - a global barrier is only possible using multiple kernel launches

### CUDA streams
- a stream is a sequence of operations
- the default stream is synchronizing
- you can create non-default streams: `cudaStreamCreate`, `cudaStreamDestroy`
  - pass the stream as the 4th geometry param at kernel launch (see the third sketch at the end of these notes)
- sync one stream (`cudaStreamSynchronize`) or all streams (`cudaDeviceSynchronize`)

### CUDA events
- a way to query the progress of work in a stream (like markers)
- create and destroy: `cudaEventCreate`, `cudaEventDestroy`
- add an event to a stream: `cudaEventRecord`
- sync on an event: `cudaEventSynchronize` (wait for everything before the event to happen)
- query an event: `cudaEventQuery`
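
A minimal vector-add sketch pulling together the kernel qualifier, launch geometry, global thread index, and explicit memory management described above; error checking is omitted and the problem size is a placeholder.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];                   // guard: the grid may overshoot n
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                               // device (GPU) buffers
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // blocking copies
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                                    // threads per block
    int blocks = (n + threads - 1) / threads;             // blocks in the grid
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);           // asynchronous launch

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel is done
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```
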
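A sketch of the shared-memory pattern: each block loads its slice of the input into `__shared__` memory (coalesced, stride-1), synchronizes with `__syncthreads()`, and reduces it to one partial sum. Assumes 256 threads per block (a power of two); the kernel name is made up.

```cuda
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float buf[256];                      // one copy per block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index, stride-1 load below
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                // shared memory is not coherent without this

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction inside the block
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();                            // every phase needs a block-level barrier
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];               // one partial sum per block; combine on the
}                                                   // host or with a second kernel launch
```
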
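A sketch of a non-default stream plus events for timing, reusing the `add` kernel and device buffers from the first sketch; the stream goes in the 4th launch parameter (the 3rd is the dynamic shared-memory size).

```cuda
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n);  // defined in the first sketch

// Launch 'add' in its own stream and measure the elapsed GPU time with events.
float timed_add(const float *d_a, const float *d_b, float *d_c, int n)
{
    cudaStream_t s;
    cudaEvent_t start, stop;
    cudaStreamCreate(&s);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256, blocks = (n + threads - 1) / threads;
    cudaEventRecord(start, s);                          // marker: work issued before the kernel
    add<<<blocks, threads, 0, s>>>(d_a, d_b, d_c, n);   // 4th launch parameter = stream
    cudaEventRecord(stop, s);                           // marker: work issued after the kernel

    cudaEventSynchronize(stop);                         // wait for everything before 'stop'
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);             // milliseconds between the two markers

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s);
    return ms;
}
```
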