+++
title = 'User mode'
+++
# User mode
## Privilege separation
Operating system's task:
- safe and efficient multiplexing of system resources to competing apps
- CPU provides management and safety via privileged instructions
- OS uses this to manage user apps

Privilege separation on x86:
- 4 rings: 0 (kernel), 1, 2, 3 (user processes)
- segmentation:
    - memory divided into several segments
    - each pointer deref has segment associated with it:
        - code pointer: CS (code segment)
        - heap pointer: data segment (DS)
        - local variable pointer: stack segment (SS)
    - x86_64 pretends not to have segments: all segments point to same memory, more or less
    - but, low-order 2 bits of CS register determine current ring
- switching x86 rings
    - no instruction to set CS directly
    - traditional way is using interrupts:
        - user → kernel
            - interrupt CPU (`int` instruction)
            - CPU switches to interrupt handler in kernel
        - kernel → user
            - kernel interrupt handler returns (`iret` instruction)
            - CPU restores user mode state, including CS
- Global Descriptor Table
    - defines segments, among other things
    - code segment selector is an offset into the GDT, and so is the data segment selector
    - GDTR register points to GDT
    - 64-bit GDT entries (crossed out parts are legacy):
        - DPL: descriptor privilege level. Which rings can access the segment.

![GDT entry](gdt-entry.png)

- Address spaces and user processes:
    - different page tables create different address spaces
    - address spaces isolate processes from each other
    - most modern OSes provide a one-to-one mapping between address spaces and user processes

## Security
Meltdown (rogue data cache load):
- bypasses supervisor bit
- when CPU tries to read data that it can't read, it will fault
- modern CPUs speculate on what might happen after an operation
- CPU speculates on the read and generates the fault only later, so the data leaves a trace in the cache
- you end up with arbitrary kernel reads from user mode
- Kernel page table isolation:
    - kernel just isn't mapped into app address space, except for a small area for interrupt handlers
    - kernel has its own address space
    - but it impacts performance: switch page tables, flush TLB, etc.

Foreshadow (L1TF):
- bypasses present bit
- should also set address to 0 when unmapping a page

MDS (RIDL):
- bypasses address bits
- flush CPU buffers before `iretq`

Defending against kernel attacks:
- SMEP: supervisor mode execution protection
- SMAP: supervisor mode access protection
- ASLR: address space layout randomization
    - map user mode code and data in random locations in virtual address space

## Interrupts
- events "interrupting" execution flow
- kernel handles interrupt before program execution continues
- external: key presses, network packets
    - device signals CPU by setting a pin using an electrical signal
    - most hardware interrupts can be masked (disabled), using IF in EFLAGS register
- internal: divide by zero, page fault, system call

Most software interrupts are synchronous: they occur directly before/after an instruction

Most hardware interrupts are asynchronous: they can come at any time, so proper masking is important

During an interrupt:
1. CPU elevates privilege level and switches to kernel stack
2. Some user context (e.g. `rip`) is saved
3. function is called to handle interrupt ("interrupt service routine")
    - on x86, interrupt descriptor table (IDT) shows how to handle various interrupts
        - IDTR register points to IDT (set with `lidt` instruction)
        - IDT has max 256 entries (interrupt vector is 1 byte)
        - first 32 entries are exceptions
        - 16 external interrupts can be remapped using APIC unit
    - calling interrupt handler:
        - interrupt vector used as index into IDT, which has interrupt gates
        - type of gate can be interrupt or trap
        - interrupt gate clears IF in EFLAGS to mask further interrupts
        - jump to interrupt handler: set CS to segment selector (changing ring), set RIP to offset (jumps to interrupt handler)
    - switch stack:
        - kernel stack pointer stored in Task State Segment (TSS)
        - task register (TR) contains index in the GDT that specifies where TSS is (can set with `ltr` instruction)
        - glue pieces of base address together to find address of TSS
        - TSS contains stack pointers for each ring
        - load RSP0 from TSS into RSP
        - set stack segment to null
    - store calling context
        - old register values stored on kernel stack, to be restored later
    - returning from interrupts
        - `iret` returns to the location at the time of the interrupt
        - pops right amount from stack, restores stack, returns to last `rip`, drops privilege
4. CPU restores some of user context and drops privilege level

Dealing with livelocks (when the CPU is doing work, but not useful work):
- do as little as you can in interrupt handler, schedule work for later
- reduce number of interrupts: use hardware demuxing, poll instead of large number of interrupts

## System calls
Kernel support for servicing user apps.

Originally only issued using the `int` instruction.
Kernel-user communication dictated by calling convention.

Each OS specifies its own calling convention.
- x86 Linux:
    - `int 0x80` to issue system call (legacy 32-bit entry; x86_64 code normally uses `syscall`)
    - `%rax` has syscall number
    - arguments specified in `%rdi`, `%rsi`, `%rdx`, `%r10`, `%r8`, and `%r9` (x86_64 convention; the 32-bit `int 0x80` ABI uses `%ebx`, `%ecx`, `%edx`, `%esi`, `%edi`, `%ebp`)
    - kernel places return value in `%rax`

`syscall`:
- caches the system call entry point in a special CPU register instead of looking it up in the IDT
- `syscall`/`sysret` and dedicated registers
- requires setup through MSR (model-specific registers) via `rdmsr`/`wrmsr`
- on `syscall`:
    - saves RFLAGS in `%r11`, masks RFLAGS
    - saves `%rip` in `%rcx` (and switches `%rip` to handler)
    - switches CS, SS, and privilege level to ring 0
- on `sysret`:
    - restores RFLAGS from `%r11`
    - restores `%rip` from `%rcx`
    - switches CS, SS, privilege level to ring 3

VDSO:
- in 32-bit x86, fast syscall instruction (`sysenter`) was optional in CPU
- need to retain legacy `int` support
- so kernel figures out what CPU supports using `cpuid` instruction
- VDSO: virtual syscall page with optimal system call instruction
    - replaces `int 0x80` with a `call <addr>`
    - allows for fixed return point
    - looks like a dynamic shared object
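
The register shuffling that `syscall` and `sysret` perform (listed earlier in this section) can be traced with a toy model. This is only a sketch: the `Cpu` class, `do_syscall`, and `do_sysret` are hypothetical names invented for illustration, the handler address and mask values are made up, and details such as the CS/SS selector loads and the flag bits that `sysret` forces are left out. The `lstar` and `fmask` fields stand in for the `IA32_LSTAR` and `IA32_FMASK` MSRs that the kernel would program with `wrmsr`.

```python
from dataclasses import dataclass

IF_FLAG = 1 << 9  # interrupt-enable flag (IF) in RFLAGS


@dataclass
class Cpu:
    """Toy CPU state: only the fields the syscall path touches."""
    rip: int
    rflags: int
    cpl: int = 3      # current privilege level (ring)
    rcx: int = 0
    r11: int = 0
    lstar: int = 0    # kernel entry point (models the IA32_LSTAR MSR)
    fmask: int = 0    # RFLAGS bits to clear on entry (models IA32_FMASK)


def do_syscall(cpu: Cpu, insn_len: int = 2) -> None:
    """Model the user -> kernel transition made by `syscall`."""
    cpu.rcx = cpu.rip + insn_len  # return address: instruction after `syscall`
    cpu.r11 = cpu.rflags          # save RFLAGS
    cpu.rflags &= ~cpu.fmask      # mask RFLAGS (here: clear IF)
    cpu.rip = cpu.lstar           # continue at the kernel handler
    cpu.cpl = 0                   # CS/SS now describe ring 0


def do_sysret(cpu: Cpu) -> None:
    """Model the kernel -> user transition made by `sysret` (simplified)."""
    cpu.rip = cpu.rcx             # resume after the `syscall` instruction
    cpu.rflags = cpu.r11          # restore RFLAGS
    cpu.cpl = 3                   # CS/SS now describe ring 3


# Illustrative values only: addresses and flags are assumptions.
cpu = Cpu(rip=0x401000, rflags=0x202, lstar=0xFFFFFFFF81000000, fmask=IF_FLAG)
do_syscall(cpu)
print("in kernel:", hex(cpu.rip), "ring", cpu.cpl, "rflags", hex(cpu.rflags))
do_sysret(cpu)
print("back in user:", hex(cpu.rip), "ring", cpu.cpl, "rflags", hex(cpu.rflags))
```

The model shows why user code must treat `%rcx` and `%r11` as clobbered across a system call, and why the kernel has to preserve both until `sysret`: they are the only copy of the user's return address and flags.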