+++
title = 'Exam Review Notes'
+++
# Exam review notes
These are the notes I wrote while revising, sort of a summary of everything.

## Optimizing code on a single-threaded CPU
- do computations/memory accesses once if repeatedly needed
- replace expensive operations with cheaper ones, e.g. `16*x` ⇒ `x << 4` (strength reduction)
- inline functions (manually, or with compiler `-O1` and higher)
- use `restrict` if memory is only accessed through one pointer (see the sketch after this list)
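
A minimal C sketch of a few of these ideas (the function names are made up for illustration):

```c
#include <stddef.h>

/* `restrict` promises that `in` and `out` don't alias, so the compiler
   can keep values in registers and vectorize the loop. */
void scale16(const float *restrict in, float *restrict out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 16.0f;
}

/* Strength reduction: the multiplication by 16 becomes a shift. */
static inline int times16(int x) {
    return x << 4;
}
```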

## In-core parallelism: SIMD vectorization
- same instruction applied across a vector of data at once
- autovectorization with `-O2` and higher or specific flags, if
    - loop count doesn't change after it starts
    - no data dependencies (`#pragma GCC ivdep`)
- check compiler reports!
- data alignment leads to better performance
    - `float A[3] __attribute__((aligned(16)))`
    - `#pragma vector aligned`
- hand-vectorize using intrinsics: unroll loop by intended SIMD width, then use vector operations (see the sketch after this list)
- locality: use low-stride memory accesses, reuse data, keep loop bodies small
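
A hand-vectorization sketch with SSE intrinsics (not from the lecture), assuming `n` is a multiple of the SIMD width of 4 floats and both arrays are 16-byte aligned:

```c
#include <immintrin.h>

/* a[i] += b[i], with the loop unrolled by the SIMD width of 4. */
void add_vec(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);            /* aligned load of 4 floats */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&a[i], _mm_add_ps(va, vb));   /* vector add, store back */
    }
}
```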

## OpenMP (`-fopenmp`)
- loop parallelization if no data dependencies: `#pragma omp parallel for` (combines `parallel` and `for` pragmas; see the sketch at the end of this section)
- types of variables
    - private: one for each thread, no communication between threads
    - shared: one for all threads, communication between threads and sections
        - by default all variables are shared; only loop variables of parallel loops are private
    - firstprivate: initialized with the master thread's value
    - lastprivate: value from the last iteration/section propagated out
- data race: behavior depends on execution order of threads
- critical section: block of statements where only one thread runs at a time
    - all unnamed critical sections sync with each other, all with the same name sync with each other
- `reduction` clause for sum, max, min, product, etc.
- `if` clause lets you parallelize only when there is sufficient workload
- scheduling in loops:
    - static/block: loop divided into `nthreads` chunks
    - static size 1: iterations assigned to threads in round-robin ("cyclic")
    - static size n: loop divided into chunks of n iterations assigned in round-robin ("block-cyclic")
    - dynamic size n: loop divided into chunks of n iterations, assigned to threads on demand
    - guided size n: chunk size decreases exponentially with each assignment, chunks assigned on demand (n is the minimum chunk size)
    - runtime: choose scheduling at runtime
- static preferable for uniform workloads, dynamic otherwise (guided usually best)
- watch out for sync barriers!
- `collapse` clause merges perfectly nested loops into one iteration space
- `#pragma omp parallel` starts/terminates threads, `for` doesn't
    - so prefer larger `parallel` blocks!
    - use `nowait` to remove implicit barriers, e.g. when the enclosing `parallel` already has a barrier at its end
- `barrier` construct creates a sync barrier where all threads wait
- `single` assigns a block to one thread, with a barrier after (unless `nowait`)
    - `master` assigns a block to the master thread, without a sync barrier (unless you add `barrier`)
- `sections`: each `section` executed by exactly one thread, threads execute different code
- `threadprivate` vars: accessible like globals, but each thread has its own copy
    - initialize to the master's value with the `copyin` clause
- nested regions: a new team of threads is started, the creator becomes its master
- `#pragma omp task` spawns an asynchronous task for a block, the original thread continues
    - wait for tasks with `#pragma omp taskwait`
- thread affinity
    - bind a thread to a place (core, usually a hardware thread)
    - env variable `OMP_PLACES="{0:4},{4:4},{8:4},{12:4}"` defines 4 places with 4 execution units each
- busy wait vs suspension controlled via `OMP_WAIT_POLICY`
- atomic operations with `#pragma omp atomic read|write|update|capture`
    - no guarantees on operational behaviour
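
A sketch (a toy program, not from the lecture) pulling a few of these together: a parallel loop with `reduction` and `schedule` clauses, plus tasks spawned from a `single` block; compile with `-fopenmp`:

```c
#include <omp.h>
#include <stdio.h>

/* Parallel sum with a reduction: each thread gets a private copy of
   `sum`; the copies are combined when the loop ends. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(void) {
    double a[1000], b[1000];
    for (int i = 0; i < 1000; i++) { a[i] = 1.0; b[i] = 2.0; }
    printf("dot = %f\n", dot(a, b, 1000));

    #pragma omp parallel
    #pragma omp single                         /* one thread creates the tasks */
    {
        for (int i = 0; i < 4; i++) {
            #pragma omp task firstprivate(i)   /* asynchronous task, i captured at creation */
            printf("task %d on thread %d\n", i, omp_get_thread_num());
        }
        #pragma omp taskwait                   /* wait for all four tasks */
    }
    return 0;
}
```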

## Pthreads (`-pthread`)
- process terminates when the initial thread terminates
- all threads terminate when the initial thread terminates
- create with `pthread_create(...)` (see the sketch at the end of this section)
- wait for termination with `pthread_join()` -- otherwise you get zombie threads
- attributes: `pthread_attr_init()`, `pthread_attr_destroy()`
    - joinable (default) or detached
    - process scope or system scope
        - process: library (user-level) threads, more efficient, but the whole process blocks on OS calls
        - system: kernel threads, scheduled by the OS, but expensive context switches
- critical sections
    - mutexes: only one thread holds the lock, only the owner of the lock can unlock it
        - `pthread_mutex_trylock()` always returns immediately, with `EBUSY` if already locked
        - hierarchical locking: lock mutexes in ascending order; a mutex can only be locked if all lower-level locks are already held
    - condition variables: event-driven critical sections
        - allow threads to suspend & trigger other threads
        - suspend with `pthread_cond_wait()`
        - wake up one thread with `pthread_cond_signal` or all with `pthread_cond_broadcast`
    - spin locks: no wake-up needed, but they consume CPU while waiting
    - semaphores: wait (block until the semaphore can be decremented) or post (increment it)
        - no ownership, any thread can operate on the semaphore
    - monitors: shared data, a set of functions to manipulate it, and a mutex lock
    - reader-writer locks: allow concurrent reads, exclusive writes
    - problems with locking are low composability & priority inversion (a low-priority thread holds the lock but isn't scheduled, so higher-priority threads waiting on it make no progress)
    - software transactional memory: write, check for concurrent writes, roll back and redo if needed
        - `__transaction_atomic { ... }`
        - must only contain code without side effects
- thread-specific data: global-like data with one copy per thread, accessible from anywhere in that thread
- `pthread_once()` runs a function at most once
- `pthread_cancel()` cancels a thread
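
A minimal create/join/mutex sketch (a toy program, not from the lecture):

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* critical section: one thread at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);            /* join so the threads don't become zombies */
    pthread_join(t2, NULL);
    printf("%ld\n", counter);          /* 200000 */
    return 0;
}
```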

## Common pitfalls
- utilisation of libraries: library functions may have hidden internal state
    - library interfaces should be re-entrant
    - functions should be mathematical, free of side effects
- dynamic memory management:
    - allocation is expensive: access to the shared heap must be synchronized (a lock is taken on every `malloc` and `free`)
- false sharing (see the sketch after this list)
    - two threads frequently access independent, unrelated data
    - the independent data is coincidentally mapped to nearby virtual addresses, so it ends up in the same cache line
- virtual memory management
    - ccNUMA: cache-coherent non-uniform memory access
    - a memory page is allocated close to the processor that first accesses it (first touch)
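
A sketch of the usual fix for false sharing: pad (or align) the per-thread data so each piece occupies its own cache line (assumes 64-byte cache lines):

```c
/* Two threads update counters[0] and counters[1] respectively. Without the
   padding, both longs would share one cache line and the threads would keep
   invalidating each other's caches even though the data is independent. */
struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];        /* pad the struct to a full cache line */
};

struct padded_counter counters[2];      /* counters[0] and counters[1] are on separate lines */
```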

## GPU programming
- GPUs have many simple cores, little memory, little control logic
- sets of cores are grouped into multiprocessors (SMs)

### CUDA: programming GPUs
- SIMT (single instruction multiple threads)
- parallelism:
    - a thread processor runs a thread
    - a multiprocessor runs a block of threads
    - the device runs a grid of blocks
    - grid == number of instances of the kernel
- device code (GPU code) does computation
- host code (CPU code) instantiates the grid, launches the kernel, manages memory
- kernel function prefixed with `__global__`
- GPU-only function prefixed with `__device__`
- launch kernel: `kernelFunc<<<thread_blocks, threads_per_block>>>(args)` (see the sketch after this list)
    - geometry can be `dim3`
- built-in device code vars:
    - `dim3 gridDim`: grid dimension in blocks
    - `dim3 blockDim`: block dimension in threads
    - `dim3 blockIdx`: the block index in the grid
    - `dim3 threadIdx`: the thread index in the block
- global thread index: `(blockIdx.x * blockDim.x) + threadIdx.x`
- what's the max? query with `cudaGetDeviceProperties()`
- memory operations
    - explicit:
        - allocate both CPU and GPU buffers
        - `cudaMalloc`, `cudaMemset`, `cudaFree`
        - `cudaMemcpy`, last arg is an enum stating the copy direction
            - blocks the CPU until the copy is done (unless `cudaMemcpyAsync()`)
            - doesn't start until all previous CUDA calls complete
    - managed/unified memory
        - allocate once with `cudaMallocManaged`
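
A minimal launch sketch with explicit memory management (a toy program, host-side data omitted for brevity):

```cuda
#include <cuda_runtime.h>

/* Each thread handles one element; guard against threads past the end. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    /* ...fill host arrays and cudaMemcpy them to d_a/d_b (cudaMemcpyHostToDevice)... */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;        /* round up to cover every element */
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();                         /* kernel launch is asynchronous */
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```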

### Execution model
- GigaThread engine sends blocks to SMs (multiprocessors)
- blocks are divided into warps of 32 threads
    - undefined block execution order; any order should be valid
    - stalled warps are immediately replaced (mainly when waiting on global memory)
- try to maximize occupancy: `(active warps)/(max active warps)`
    - each thread block consumes registers & shared memory
- threads in a warp execute in lock-step (same instruction, different data)
- divergence: when threads in a warp take different paths (worst case 1/32 of peak performance)

### Memory spaces
- registers: stored in the SM register file, not persistent after the kernel ends
- constant: global `__constant__` variable
    - initialized with `cudaMemcpyToSymbol`
    - read-only for the GPU, inaccessible for the host
- global: `__device__` variables, or buffers allocated by the host
    - set up by host with `cudaMalloc` and `cudaMemcpy`
    - persistent, values retained between kernels
    - not coherent: writes by other threads might not be visible
- unified memory: single coherent address space
    - `cudaMallocManaged`, no explicit copies needed
    - move data to the GPU before it's needed: `cudaMemPrefetchAsync`
    - use `cudaMemAdvise` to establish data location
    - but it is slower than manual management
- memory coalescing
    - group memory accesses into as few memory transactions as possible
    - stride-1 access patterns are best
- shared memory: `__shared__ type var` (see the sketch after this list)
    - size known at compile time
    - all threads in a block see the same memory (one copy per block)
    - not initialized, threads fill it
    - not persistent, data lost when the kernel finishes
    - not coherent, have to `__syncthreads()`
    - L1 cache and shared memory are allocated in the same space, may need to configure the split
    - best to partition data into subsets and handle each subset with a block and its shared memory
- pinned host memory may be a good option for performance
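
A shared-memory sketch: a block-wise sum that stages data in `__shared__` memory and synchronizes with `__syncthreads()` (assumes the kernel is launched with 256 threads per block):

```cuda
/* One partial sum per block: each block stages its slice of `in` into shared
   memory, reduces it in a tree, and thread 0 writes the block's result. */
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                        /* one copy per block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;        /* threads fill shared memory */
    __syncthreads();                                   /* make the writes visible to the whole block */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     /* tree reduction within the block */
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}
```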

### Shared variables
- use atomics to avoid data races (e.g. `atomicAdd`, see the sketch below)
- best to limit their use because they're expensive
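
A toy `atomicAdd` sketch (a histogram kernel; assumes every key is a valid bin index):

```cuda
__global__ void histogram(const int *keys, int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[keys[i]], 1);   /* concurrent increments of the same bin don't race */
}
```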

### Consistency & synchronization
- host-device
    - `cudaMemcpy` is blocking
    - kernel launch is not, use `cudaDeviceSynchronize`
- memory consistency
    - no ordering guarantees between warps and blocks
    - global and shared memory not consistent
    - `__syncthreads()` for block-level sync
    - global barrier only using multiple kernel launches

### CUDA streams
- sequence of operations
- default stream is synchronizing
- can create non-default streams: `cudaStreamCreate`, `cudaStreamDestroy`
- pass the stream as the 4th launch-configuration parameter: `kernel<<<blocks, threads, sharedBytes, stream>>>` (see the sketch below)
- sync one stream (`cudaStreamSynchronize`) or all streams (`cudaDeviceSynchronize`)
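
A minimal non-default-stream sketch (toy kernel, data left uninitialized for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);                            /* non-default stream */
    scale<<<(n + 255) / 256, 256, 0, s>>>(d_x, n);   /* stream as the 4th launch parameter */
    cudaStreamSynchronize(s);                        /* wait for this stream only */
    cudaStreamDestroy(s);
    cudaFree(d_x);
    return 0;
}
```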

### CUDA events
- a way to query the progress of work in a stream (like markers; see the timing sketch below)
- create and destroy: `cudaEventCreate`, `cudaEventDestroy`
- add an event in a stream: `cudaEventRecord`
- sync on an event: `cudaEventSynchronize` (wait for everything before the event to happen)
- query an event: `cudaEventQuery`
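
A common use of events is timing a kernel with two markers; `cudaEventElapsedTime` isn't mentioned above, but it is part of the same API:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void bump(float *x) { x[threadIdx.x] += 1.0f; }

int main(void) {
    float *d_x;
    cudaMalloc((void **)&d_x, 256 * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                  /* marker before the kernel (default stream) */
    bump<<<1, 256>>>(d_x);
    cudaEventRecord(stop);                   /* marker after the kernel */
    cudaEventSynchronize(stop);              /* wait for everything before 'stop' to finish */
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  /* elapsed time between the two markers */
    printf("kernel took %f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```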