lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

index.md (6428B)


      1 +++
      2 title = 'User mode'
      3 +++
      4 # User mode
      5 ## Privilege separation
      6 Operating system's task:
      7 - safe and efficient multiplexing of system resources to competing apps
      8 - CPU provides management and safety via privileged instructions
      9 - OS uses this to manage user apps
     10 
     11 Privilege separation on x86:
     12 - 4 rings: 0 (kernel), 1, 2, 3 (user processes)
     13 - segmentation:
     14     - memory divided into several segments
     15     - each pointer deref has segment associated with it:
     16         - code pointer: code segment (CS)
     17         - heap pointer: data segment (DS)
     18         - local variable pointer: stack segment (SS)
     19     - x86_64 pretends not to have segments -- all segments point to same memory, more or less
     20         - but, low-order 2 bits of CS register determine current ring
     21 - switching x86 rings
     22     - no instruction to set CS directly
     23     - traditional way is using interrupts:
     24         - user → kernel
     25             - interrupt CPU (`int` instruction)
     26             - CPU switches to interrupt handler in kernel
     27         - kernel → user
     28             - kernel interrupt handler returns (`iret` instruction)
     29             - CPU restores user mode state, including CS
     30 - Global Descriptor Table
     31     - defines segments, among other things
     32     - the code segment selector (CS) is an offset into the GDT, and so is the data segment selector (DS)
     33     - GDTR register points to GDT
     34     - 64-bit GDT entries (crossed out parts are legacy):
     35         - DPL: descriptor privilege level. Which rings can access the segment.
     36 
     37     ![GDT entry](gdt-entry.png)
     38 
     39 - Address spaces and user processes:
     40     - different page tables create different address spaces
     41     - address spaces isolate processes from each other
     42     - most modern OSes provide a one-to-one mapping between address spaces and user processes
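
To make the selector and descriptor fields above concrete, here is a minimal sketch that decodes them with bit operations. The function names are made up for illustration, and the sample values (`0x33`, `0x10`, and the two descriptors) are assumed typical values for a 64-bit Linux system, not something taken from these notes.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit segment selector (e.g. the value in CS) into its fields."""
    return {
        "ring": selector & 0b11,                        # low-order 2 bits: privilege level
        "table": "LDT" if selector & 0b100 else "GDT",  # bit 2: table indicator
        "index": selector >> 3,                         # remaining bits: descriptor index
    }


def descriptor_dpl(descriptor: int) -> int:
    """Extract the DPL field (bits 45-46) from a 64-bit GDT entry."""
    return (descriptor >> 45) & 0b11


# Assumed example values: user and kernel code selectors and descriptors.
USER_CS, KERNEL_CS = 0x33, 0x10
USER_CODE_DESC = 0x00AFFA000000FFFF
KERNEL_CODE_DESC = 0x00AF9A000000FFFF

if __name__ == "__main__":
    print(decode_selector(USER_CS))
    print(decode_selector(KERNEL_CS))
    print(descriptor_dpl(USER_CODE_DESC), descriptor_dpl(KERNEL_CODE_DESC))
```

Multiplying the selector's index by 8 (the size of one GDT entry) gives the byte offset into the GDT, which is why the selector can be read as an offset with the low 3 bits reused for the ring and table indicator.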
     43 
     44 ## Security
     45 Meltdown (rogue data cache load):
     46 - bypasses supervisor bit
     47 - when CPU tries to read data that it can't read, it will fault
     48 - modern CPUs speculate on what might happen after an operation
     49 - CPU speculates on the read, generates fault only later, so the data will be in the cache
     50 - you end up with arbitrary kernel reads from user mode
     51 - Kernel page table isolation:
     52     - kernel just isn't mapped into app address space, except for a small area for interrupt handlers
     53     - kernel has its own address space
     54     - but it impacts performance: switch page tables, flush TLB, etc.
     55 
     56 Foreshadow (L1TF)
     57 - bypasses present bit
     58 - mitigation: also set the address bits to 0 when unmapping a page
     59 
     60 MDS (RIDL):
     61 - bypasses address bits
     62 - mitigation: flush CPU buffers before returning to user mode (`iretq`)
     63 
     64 Defending kernel attacks:
     65 - SMEP: supervisor mode execution prevention
     66 - SMAP: supervisor mode access prevention
     67 - ASLR: address space layout randomization
     68     - map user mode code and data in random locations in virtual address space
     69 
     70 ## Interrupts
     71 - events "interrupting" execution flow
     72 - kernel handles interrupt before program execution continues
     73 - external: key presses, network packets
     74     - device signals CPU by setting a pin using an electrical signal
     75     - most hardware interrupts can be masked (disabled), using IF in EFLAGS register
     76 - internal: divide by zero, page fault, system call
     77 
     78 Most software interrupts are synchronous: they occur directly before or after an instruction
     79 
     80 Most hardware interrupts are asynchronous: they can arrive at any time, so proper masking is important
     81 
     82 During an interrupt:
     83 1. CPU elevates privilege level and switches to kernel stack
     84 2. Some user context (e.g. `rip`) is saved
     85 3. function is called to handle interrupt ("interrupt service routine")
     86     - on x86, interrupt descriptor table (IDT) shows how to handle various interrupts
     87     - IDTR register points to IDT (set with `lidt` instruction)
     88     - IDT has at most 256 entries (the interrupt vector is a 1-byte integer)
     89     - first 32 entries are exceptions
     90     - 16 external interrupts can be remapped using APIC unit
     91     - calling interrupt handler:
     92         - interrupt vector used as index into IDT, which has interrupt gates
     93             - type of gate can be interrupt or trap
     94             - interrupt gate clears IF in EFLAGS to mask further interrupts
     95         - jump to interrupt handler: set CS to segment selector (changing ring), set RIP to offset (jumps to interrupt handler)
     96         - switch stack:
     97             - kernel stack pointer stored in Task State Segment (TSS)
     98             - task register (TR) contains index in the GDT that specifies where TSS is (can set with `ltr` instruction)
     99             - glue pieces of base address together to find address of TSS
    100             - TSS contains stack pointers for each ring
    101             - load RSP0 from TSS into RSP
    102             - set stack segment to null
    103         - store calling context
    104             - old register values stored on kernel stack, to be restored later
    105     - returning from interrupts
    106         - `iret` returns to the location that was executing when the interrupt occurred
    107         - pops right amount from stack, restores stack, returns to last `rip`, drops privilege
    108 4. CPU restores some of user context and drops privilege level
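
The steps above can be sketched as a toy model of interrupt delivery. This is a simplification under stated assumptions: the `Gate` and `Cpu` classes and all addresses are hypothetical, the kernel stack is modelled as a Python list, and only a few registers are tracked (real hardware also saves SS and may push an error code).

```python
from dataclasses import dataclass, field

IF_BIT = 1 << 9  # interrupt flag in EFLAGS/RFLAGS


@dataclass
class Gate:
    selector: int          # code segment selector loaded into CS
    offset: int            # handler address loaded into RIP
    is_trap: bool = False  # trap gates leave IF untouched


@dataclass
class Cpu:
    rip: int
    rsp: int
    cs: int
    rflags: int
    rsp0: int                                  # ring 0 stack pointer from the TSS
    idt: dict = field(default_factory=dict)    # vector -> Gate
    kernel_stack: list = field(default_factory=list)

    def interrupt(self, vector: int) -> None:
        if not 0 <= vector <= 255:
            raise ValueError("interrupt vector must fit in one byte")
        gate = self.idt[vector]                # vector is the index into the IDT
        saved = (self.rip, self.cs, self.rflags, self.rsp)
        self.rsp = self.rsp0                   # switch to the kernel stack
        self.kernel_stack.append(saved)        # store calling context
        if not gate.is_trap:
            self.rflags &= ~IF_BIT             # interrupt gate masks further interrupts
        self.cs, self.rip = gate.selector, gate.offset  # new CS changes the ring

    def iret(self) -> None:
        # Restore the saved context, which also drops privilege via the old CS.
        self.rip, self.cs, self.rflags, self.rsp = self.kernel_stack.pop()


if __name__ == "__main__":
    cpu = Cpu(rip=0x401000, rsp=0x7FFD0000, cs=0x33, rflags=0x202,
              rsp0=0xFFFF800000010000)
    cpu.idt[14] = Gate(selector=0x10, offset=0xFFFFFFFF81000000)
    cpu.interrupt(14)
    print(hex(cpu.rip), cpu.cs & 0b11, bool(cpu.rflags & IF_BIT))
    cpu.iret()
    print(hex(cpu.rip), cpu.cs & 0b11, bool(cpu.rflags & IF_BIT))
```

The ring is never set directly in this model: it changes only because the gate supplies a new CS value on entry and `iret` restores the old one on exit.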
    109 
    110 Dealing with livelocks (when the CPU is doing work, but not useful work):
    111 - do as little as you can in interrupt handler, schedule work for later
    112 - reduce number of interrupts: use hardware demuxing, poll instead of large number of interrupts
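
The first point, doing little in the handler and deferring the rest, might look like the sketch below. The `DeferredWork` class and its method names are hypothetical; real kernels have their own mechanisms for this, which these notes do not name.

```python
from collections import deque


class DeferredWork:
    """Toy split between a short interrupt handler and later processing."""

    def __init__(self) -> None:
        self.pending = deque()  # events recorded by the handler
        self.processed = []     # events handled outside interrupt context

    def interrupt_handler(self, event) -> None:
        # Do as little as possible: record the event and return.
        self.pending.append(event)

    def run_deferred(self, budget: int) -> int:
        # Process at most `budget` events, so that a flood of interrupts
        # cannot monopolise the CPU with this work either.
        done = 0
        while self.pending and done < budget:
            self.processed.append(self.pending.popleft())
            done += 1
        return done


if __name__ == "__main__":
    work = DeferredWork()
    for packet in ("a", "b", "c"):
        work.interrupt_handler(packet)
    print(work.run_deferred(budget=2), list(work.pending))
```

The budget is one possible design choice: bounding the deferred work per round leaves CPU time for the applications that are supposed to consume the data.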
    113 
    114 ## System calls
    115 Kernel support for servicing user apps.
    116 
    117 Originally issued only using the `int` instruction.
    118 Kernel-user communication dictated by calling convention.
    119 
    120 Each OS specifies its own calling convention.
    121 - x86 Linux:
    122     - `int 0x80` to issue system call on 32-bit x86; x86-64 uses the `syscall` instruction with the registers below
    123     - `%rax` has syscall number
    124     - arguments specified in `%rdi`, `%rsi`, `%rdx`, `%r10`, `%r8`, and `%r9`
    125     - kernel places return value in `%rax`
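
As a small illustration of the convention above, the sketch below builds the register assignment for a system call. The helper `syscall_registers` is hypothetical, the buffer address is a made-up value, and the syscall number 1 for `write` is the x86-64 Linux number (an addition not stated in these notes).

```python
ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


def syscall_registers(number: int, *args: int) -> dict:
    """Return the register contents the kernel expects for a system call."""
    if len(args) > len(ARG_REGISTERS):
        raise ValueError("at most six arguments are passed in registers")
    regs = {"rax": number}                  # syscall number goes in rax
    regs.update(zip(ARG_REGISTERS, args))   # arguments fill registers in order
    return regs


if __name__ == "__main__":
    # write(fd=1, buf=0x1000, count=5); 0x1000 is a placeholder address.
    print(syscall_registers(1, 1, 0x1000, 5))
```

After the call returns, the same `rax` register that carried the syscall number holds the kernel's return value.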
    126 
    127 `syscall`
    128 - caching IDT entry for system call in special CPU register
    129 - `syscall`/`sysret` and dedicated registers
    130 - requires setup through MSR (model-specific registers) via `rdmsr`/`wrmsr`
    131 - on `syscall`:
    132     - saves RFLAGS in `%r11`, masks RFLAGS
    133     - saves `%rip` in `%rcx` (and switches `%rip` to handler)
    134     - switches CS, SS, and privilege level to ring 0
    135 - on `sysret`:
    136     - restores RFLAGS from `%r11`
    137     - restores `%rip` from `%rcx`
    138     - switches CS, SS, privilege level to ring 3
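
The register shuffling on `syscall` and `sysret` can be modelled in a few lines. This is a sketch: the `Regs` class is hypothetical, the CS/SS switch is reduced to a single `cpl` field, and the parameters `lstar` and `fmask` stand for the handler address and flag mask that the kernel programs into MSRs (named after the LSTAR and FMASK MSRs, which these notes do not mention by name).

```python
from dataclasses import dataclass

IF_BIT = 1 << 9  # interrupt flag in RFLAGS


@dataclass
class Regs:
    rip: int
    rflags: int
    rcx: int = 0
    r11: int = 0
    cpl: int = 3  # current privilege level (ring)


def do_syscall(r: Regs, lstar: int, fmask: int) -> None:
    r.r11 = r.rflags      # save RFLAGS in r11
    r.rflags &= ~fmask    # mask RFLAGS
    r.rcx = r.rip         # save the return address in rcx
    r.rip = lstar         # jump to the kernel's handler
    r.cpl = 0             # CS/SS now describe ring 0


def do_sysret(r: Regs) -> None:
    r.rflags = r.r11      # restore RFLAGS
    r.rip = r.rcx         # return to the saved address
    r.cpl = 3             # CS/SS now describe ring 3


if __name__ == "__main__":
    regs = Regs(rip=0x401002, rflags=0x202)
    do_syscall(regs, lstar=0xFFFFFFFF81000000, fmask=IF_BIT)
    print(hex(regs.rip), hex(regs.rflags), regs.cpl)
    do_sysret(regs)
    print(hex(regs.rip), hex(regs.rflags), regs.cpl)
```

Because `rcx` and `r11` are overwritten, user code cannot rely on them across a system call, and no memory or stack access is needed on entry, which is part of what makes this path faster than `int`.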
    139 
    140 VDSO:
    141 - in 32-bit x86, fast syscall instruction (`sysenter`) was optional in CPU
    142 - need to retain legacy `int` support
    143 - so kernel figures out what CPU supports using `cpuid` instruction
    144 - VDSO: virtual syscall page with optimal system call instruction
    145     - replaces `int 0x80` with a `call <addr>`
    146     - allows for fixed return point
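
Since the VDSO looks like a shared object mapped into the process, it shows up in the process's memory map. The sketch below finds it in text formatted like Linux's `/proc/self/maps`; the helper name `find_vdso` and the sample line (including its addresses) are made up for illustration.

```python
def find_vdso(maps_text: str):
    """Return (start, end) of the [vdso] mapping, or None if it is absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = (int(part, 16) for part in fields[0].split("-"))
            return start, end
    return None


# Hypothetical sample in the /proc/<pid>/maps format.
SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521      /usr/bin/example
7ffd1c5f2000-7ffd1c5f4000 r-xp 00000000 00:00 0   [vdso]
"""

if __name__ == "__main__":
    print(find_vdso(SAMPLE_MAPS))
    # On a Linux machine, the real mapping can be inspected with:
    #   find_vdso(open("/proc/self/maps").read())
```

The mapping has no backing file (the last column is a bracketed name rather than a path), which reflects that the kernel supplies the page itself.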
    147     - looks like a dynamic shared object