lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

commit 41e2f99661497d1e61d8f2444f8a78fe9cfd9169
parent 5d349a7d603dc8eb55b2637e5142d59cbaf8acd7
Author: Alex Balgavy <alex@balgavy.eu>
Date:   Mon, 22 Mar 2021 00:05:52 +0100

PMMS: exam review notes

Diffstat:
 M content/programming-multi-core-and-many-core-systems/_index.md            |    2 ++
 A content/programming-multi-core-and-many-core-systems/exam-review-notes.md |  209 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 211 insertions(+), 0 deletions(-)

diff --git a/content/programming-multi-core-and-many-core-systems/_index.md b/content/programming-multi-core-and-many-core-systems/_index.md
@@ -8,3 +8,5 @@ title = "Programming Multi-Core and Many-Core Systems"
 4. [Lecture 4](lecture-4)
 5. [Lecture 5](lecture-5)
 6. [Lectures on GPU programming](lectures-gpu)
+
+[Exam review notes](exam-review-notes)

diff --git a/content/programming-multi-core-and-many-core-systems/exam-review-notes.md b/content/programming-multi-core-and-many-core-systems/exam-review-notes.md
@@ -0,0 +1,209 @@
+++
title = 'Exam Review Notes'
+++

# Exam review notes
These are the notes I wrote while revising -- a summary of everything.

## Optimizing code on a single-threaded CPU
- do computations/memory accesses once if the result is needed repeatedly
- replace expensive operations with cheaper ones, e.g. `16*x` ⇒ `x << 4` (reduction in strength)
- inline functions (manually, or with compiler `-O1` and higher)
- use `restrict` if memory is only accessed through one pointer

## In-core parallelism: SIMD vectorization
- same instruction applied across a vector of data at once
- autovectorization with `-O2` and higher or specific flags, if
  - the loop count doesn't change after the loop starts
  - there are no data dependencies (assert this with `#pragma GCC ivdep`)
- check the compiler reports!
- data alignment leads to better performance
  - `float A[3] __attribute__((aligned(16)));`
  - `#pragma vector aligned`
- hand-vectorize using intrinsics: unroll the loop by the intended SIMD width, then use vector operations
- locality: use low-stride memory accesses, reuse data, keep loop bodies small

## OpenMP (`-fopenmp`)
- loop parallelization if there are no data dependencies: `#pragma omp parallel for` (combines the `parallel` and `for` pragmas)
- types of variables
  - private: one per thread, no communication between threads
  - shared: one for all threads, communication between threads and sections
  - only the loop variables of parallel loops are private by default
  - firstprivate: initialized with the master thread's value
  - lastprivate: the value from the last iteration/section is propagated out
- data race: behavior depends on the execution order of threads
- critical section: block of statements where only one thread runs at a time
  - all unnamed critical sections are synced with each other, as are all with the same name
- reduction clause for sum, max, min, product, etc. (used in the sketch after this section)
- the if clause lets you parallelize conditionally, only when there is sufficient workload
- scheduling in loops:
  - static/block: loop divided into `nthreads` chunks
  - static size 1: iterations assigned to threads in round-robin ("cyclic")
  - static size n: loop divided into chunks of n iterations, assigned in round-robin ("block-cyclic")
  - dynamic size n: loop divided into chunks of n iterations, assigned to threads on demand
  - guided size n: chunk size decreases exponentially with each assignment, chunks assigned on demand (n == minimum)
  - runtime: choose the scheduling at runtime
- static is preferable for uniform workloads, dynamic otherwise (guided is usually best)
- watch out for sync barriers!
- the collapse clause collapses perfectly nested loops
- `#pragma omp parallel` starts/terminates threads, `for` doesn't
  - so have larger `parallel` blocks!
  - use `nowait` to get rid of redundant barriers, e.g. on a `for` at the end of a region, since `parallel` already has a barrier
- the `barrier` construct creates a sync barrier where all threads wait
- `single` assigns a block to one thread, with a barrier after it (unless `nowait`)
  - `master` assigns a block to the master thread, without a sync barrier (unless you add an explicit `barrier`)
- `sections`: each `section` is executed by exactly one thread; threads execute different code
- `threadprivate` vars: accessible like globals, but different for each thread
  - initialize to the master's value with the `copyin` clause
- nested regions: a new team of threads is started, the creator becomes its master
- `#pragma omp task` spawns an async task on a block, the original thread continues
  - wait for tasks with `#pragma omp taskwait`
- thread affinity
  - bind threads to places (typically cores or hardware threads)
  - env variable `OMP_PLACES="{0:4},{4:4},{8:4},{12:4}"` defines 4 places with 4 execution units each
- busy wait vs suspension is controlled via `OMP_WAIT_POLICY`
- atomic operations with `#pragma omp atomic read|write|update|capture`
  - only the single operation is atomic, no further ordering guarantees
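
To make the core constructs concrete, here is a minimal sketch (not from the lectures; the problem size and array name are made up) of a parallel loop with `schedule` and `reduction`, plus a `critical` section. Compile with `gcc -fopenmp`.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000   /* hypothetical problem size */

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* The loop variable i is implicitly private; a is shared.
       reduction(+:sum) gives each thread a private copy of sum and
       combines the copies at the implicit barrier at the end of the loop. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    /* Only one thread at a time may execute a critical section. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %d of %d done\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    printf("sum = %f\n", sum);
    return 0;
}
```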

## Pthreads (`-pthread`)
- when the initial thread terminates, the process and all remaining threads terminate with it
- create threads with `pthread_create(...)`
- wait for termination with `pthread_join()` -- otherwise you get zombie threads
- attributes: `pthread_attr_init()`, `pthread_attr_destroy()`
  - joinable (default) or detached
  - process scope or system scope
    - process: library threads, more efficient, but the whole process blocks on OS calls
    - system: kernel threads, scheduled by the OS, but expensive context switches
- critical sections
  - mutexes: only one thread holds the lock, and only the owner can unlock it (see the sketch after this section)
    - `pthread_mutex_trylock()` always returns immediately, returning `EBUSY` if the mutex is already locked
    - hierarchical locking: assign levels to mutexes and acquire them in ascending order; a thread may only lock a mutex at a higher level than the locks it already holds (prevents deadlock)
  - condition variables: event-driven critical sections
    - allow threads to suspend themselves & trigger other threads
    - suspend with `pthread_cond_wait()`
    - wake up one thread with `pthread_cond_signal` or all with `pthread_cond_broadcast`
  - spin locks: no wake-up needed, but they consume resources while waiting
  - semaphores: wait (block until the semaphore can be decremented) or post (increment it)
    - no ownership, any thread can operate on a semaphore
  - monitors: shared data, a set of functions to manipulate it, and a mutex lock
  - reader-writer locks: allow concurrent reads, exclusive writes
  - problems with locking: low composability & priority inversion (a low-priority thread holds a lock a higher-priority thread needs, but is itself preempted and makes no progress)
  - software transactional memory: write speculatively, check for concurrent writes, roll back and redo if needed
    - `__transaction_atomic { ... }`
    - must only contain code without side effects
- thread-specific data: global per thread, accessible from anywhere within that thread
- `pthread_once()` runs a function at most once
- `pthread_cancel()` cancels threads
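
A minimal sketch (not from the lectures) of `pthread_create`/`pthread_join` with a mutex-protected counter; the thread count and the `worker` function are made-up names for illustration. Compile with `gcc -pthread`.

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4                                 /* hypothetical thread count */

static long counter = 0;                           /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* Critical section: only the thread holding the mutex
           may update the shared counter. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);

    /* Join to reclaim the threads -- otherwise they stay around as zombies. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    printf("counter = %ld\n", counter);            /* expect NTHREADS * 100000 */
    return 0;
}
```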

## Common pitfalls
- use of libraries: library functions may have hidden internal state
  - library interfaces should be re-entrant
  - functions should be mathematical, free of side effects
- dynamic memory management:
  - allocation is expensive: access to the shared heap must be synchronized (a lock is taken on every `malloc` and `free`)
- false sharing
  - two threads frequently access independent, unrelated data
  - the data happens to be mapped to nearby virtual addresses that fall in the same cache line
- virtual memory management
  - ccNUMA: cache-coherent non-uniform memory access
  - a memory page is allocated close to the processor that first accesses it (first touch)

## GPU programming
- GPUs have many simple cores, little memory, little control logic
- sets of cores are grouped into multiprocessors (SMs)

### CUDA: programming GPUs
- SIMT (single instruction, multiple threads)
- parallelism:
  - a thread processor runs a thread
  - a multiprocessor runs a block of threads
  - the device runs a grid of blocks
  - the grid = all instances of the kernel
- device code (GPU code) does the computation
- host code (CPU code) instantiates the grid, launches the kernel, manages memory
- kernel functions are prefixed with `__global__`
- GPU-only functions are prefixed with `__device__`
- launch a kernel: `kernelFunc<<<thread_blocks, threads_per_block>>>(args)` (see the sketch after this section)
  - the geometry can be `dim3`
- built-in device code vars:
  - `dim3 gridDim`: grid dimension in blocks
  - `dim3 blockDim`: block dimension in threads
  - `dim3 blockIdx`: the block index in the grid
  - `dim3 threadIdx`: the thread index in the block
- global thread index: `(blockIdx.x * blockDim.x) + threadIdx.x`
- what's the maximum geometry? query it with `cudaGetDeviceProperties()`
- memory operations
  - explicit:
    - allocate both CPU and GPU buffers
    - `cudaMalloc`, `cudaMemset`, `cudaFree`
    - `cudaMemcpy`, the last arg is an enum stating the copy direction
      - blocks the CPU until the copy is done (unless `cudaMemcpyAsync()`)
      - doesn't start until all previous CUDA calls complete
  - managed/unified memory
    - allocate once with `cudaMallocManaged`
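
A minimal sketch (not from the lectures) of the host/device split and explicit memory management described above; `vecAdd` and the buffer names are made up. Compile with `nvcc`.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Device code: one thread computes one output element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)                                       /* guard against overshoot */
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Host buffers */
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    /* Device buffers (explicit memory management) */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch: a grid of blocks, each block a group of threads. */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    /* cudaMemcpy blocks until the kernel and the copy are done. */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[123] = %f\n", h_c[123]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```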

### Execution model
- the GigaThread engine sends blocks to SMs (multiprocessors)
- blocks are divided into warps of 32 threads
  - block execution order is undefined; any order must produce a valid result
  - stalled warps are immediately replaced (mainly when waiting on global memory)
- try to maximize occupancy: `(active warps)/(max active warps)`
  - each thread block consumes registers & shared memory
- threads in a warp execute in lock-step (same instruction, different data)
- divergence: when threads in a warp take different branches, the paths execute one after the other (worst case 1/32 of peak performance)

### Memory spaces
- registers: stored in the SM register file, not persistent after the kernel ends
- constant: global `__constant__` variable
  - initialized with `cudaMemcpyToSymbol`
  - read-only for the GPU; not directly accessible from host code
- global memory: set up by the host with `cudaMalloc` and `cudaMemcpy`
  - persistent, values are retained between kernels
  - not coherent: writes by other threads might not be visible
- unified memory: a single coherent address space
  - `cudaMallocManaged`, no explicit copies needed
  - move data to the GPU before it's needed: `cudaMemPrefetchAsync`
  - use `cudaMemAdvise` to establish data location
  - but it is slower than manual management
- memory coalescing
  - group memory accesses into as few memory transactions as possible
  - stride-1 access patterns are best
- shared memory: `__shared__ type var`
  - size known at compile time
  - all threads in a block see the same memory (one instance per block)
  - not initialized, threads fill it
  - not persistent, data is lost when the kernel finishes
  - not coherent, you have to `__syncthreads()`
  - L1 and shared memory are allocated in the same on-chip space, the split may need to be configured
  - best to partition data into subsets and handle each subset with one block and its shared memory
- pinned (page-locked) host memory may be a good option for transfer performance

### Shared variables
- use atomics to avoid data races (e.g. `atomicAdd`)
- best to limit their use because they are expensive

### Consistency & synchronization
- host-device
  - `cudaMemcpy` is blocking
  - a kernel launch is not, use `cudaDeviceSynchronize`
- memory consistency
  - no ordering guarantees between warps and blocks
  - global and shared memory are not consistent
  - `__syncthreads()` for block-level sync
  - a global barrier is only possible via multiple kernel launches

### CUDA streams
- a stream is a sequence of operations
- the default stream is synchronizing
- you can create non-default streams: `cudaStreamCreate`, `cudaStreamDestroy`
- pass the stream as the 4th parameter of the `<<< >>>` launch configuration (see the sketch at the end of these notes)
- sync one stream (`cudaStreamSynchronize`) or all streams (`cudaDeviceSynchronize`)

### CUDA events
- a way to query the progress of work in a stream (like markers)
- create and destroy: `cudaEventCreate`, `cudaEventDestroy`
- add an event to a stream: `cudaEventRecord`
- sync on an event: `cudaEventSynchronize` (wait for everything before the event to happen)
- query an event: `cudaEventQuery`
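
A minimal sketch (not from the lectures) combining unified memory, a non-default stream, and events to time a kernel; `scale` and the sizes are made up. Compile with `nvcc`.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main(void) {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged((void **)&x, n * sizeof(float)); /* unified memory: no explicit copies */
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    /* A non-default stream and two events to time the work in it. */
    cudaStream_t stream;
    cudaEvent_t start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);                    /* marker before the kernel */
    /* The stream is the 4th launch parameter (after grid, block, shared mem size). */
    scale<<<(n + 255) / 256, 256, 0, stream>>>(x, 2.0f, n);
    cudaEventRecord(stop, stream);                     /* marker after the kernel */

    cudaEventSynchronize(stop);                        /* wait for everything up to 'stop' */
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaDeviceSynchronize();                           /* device fully idle before host touches x */
    printf("kernel took %.3f ms, x[0] = %f\n", ms, x[0]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(x);
    return 0;
}
```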