lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

+++
title = 'Introduction'
+++

# Introduction
## Why binary analysis?
- Code improvement: performance/security (maybe the source code has been lost)
- Vulnerabilities: find exploits, pentest
- Malware: what does it do, how can we be safe, how can we stop it?

Static analysis: staring at the bytes and trying to see what they mean
- can be prevented by obfuscation, packing, encryption

Dynamic analysis: running the program and observing what it does
- can be prevented by anti-debugging techniques
- incomplete, maybe not all functionality actually runs

## Getting code from binary
Disassembler:
- interprets binary files and decodes their instructions
    - assembly instructions map to sequences of bytes
    - but the opposite direction (bytes back to instructions) is not easy
- practical limitations
    - overlapping instructions
        - on e.g. x86, instructions have variable length
        - start addresses of instructions are not known in advance
        - depending on which byte you start disassembling from, you might get different instructions
    - desynchronisation: how do you distinguish data from code?
- practical approaches
    - linear sweep (objdump, gdb, windbg):
        1. start at the beginning of the `.text` section
        2. disassemble one instruction after the other
        3. assume that a well-behaved compiler tightly packs instructions
    - recursive traversal (IDA, OllyDbg):
        1. start at program entry point
        2. disassemble one instruction after the other until a control flow instruction
        3. recursively follow the instruction's targets (e.g. addresses of `jmp`)
            - pros: better at handling interleaved data and code
            - cons: coverage, what to do with indirect jumps?
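To make the difference between the two approaches concrete, here is a minimal sketch on a made-up instruction set. Everything in it is an assumption for illustration: the opcodes in `OPCODES`, the helpers `decode` and `fmt`, and the 5-byte `program` are hypothetical and do not correspond to x86 or to how objdump or IDA work internally. The program jumps over one data byte that happens to look like the first byte of a 2-byte instruction.

```python
# Toy variable-length instruction set (hypothetical, for illustration only).
# opcode -> (mnemonic, total length in bytes)
OPCODES = {
    0x01: ("nop", 1),
    0x02: ("push", 2),  # push imm8
    0x03: ("jmp", 2),   # jmp addr8 (absolute, unconditional)
    0x04: ("jz", 2),    # jz addr8 (absolute, conditional)
    0xFF: ("ret", 1),
}


def decode(code, offset):
    """Decode one instruction; return (mnemonic, length, operand) or None if invalid."""
    if offset >= len(code) or code[offset] not in OPCODES:
        return None
    name, length = OPCODES[code[offset]]
    if offset + length > len(code):
        return None
    operand = code[offset + 1] if length == 2 else None
    return name, length, operand


def fmt(name, operand):
    """Render an instruction as text."""
    return name if operand is None else f"{name} {operand}"


def linear_sweep(code):
    """Decode instructions back to back from offset 0, assuming there is no data."""
    listing, offset = {}, 0
    while offset < len(code):
        insn = decode(code, offset)
        if insn is None:
            listing[offset] = "(bad)"  # undecodable byte: skip it and carry on
            offset += 1
            continue
        name, length, operand = insn
        listing[offset] = fmt(name, operand)
        offset += length
    return listing


def recursive_traversal(code, entry=0):
    """Follow control flow from the entry point; only reachable bytes are decoded."""
    listing, worklist = {}, [entry]
    while worklist:
        offset = worklist.pop()
        while offset not in listing:
            insn = decode(code, offset)
            if insn is None:
                break
            name, length, operand = insn
            listing[offset] = fmt(name, operand)
            if name in ("jmp", "jz"):
                worklist.append(operand)  # follow the jump target later
            if name in ("jmp", "ret"):
                break                     # no fall-through after these
            offset += length
    return dict(sorted(listing.items()))


program = bytes([
    0x03, 0x03,  # offset 0: jmp 3
    0x02,        # offset 2: data byte that happens to look like "push"
    0x01,        # offset 3: nop
    0xFF,        # offset 4: ret
])

print("linear sweep:       ", linear_sweep(program))
print("recursive traversal:", recursive_traversal(program))
```

Under these assumptions, linear sweep decodes the data byte at offset 2 as `push 1` and thereby swallows the real `nop` at offset 3 (desynchronisation), while recursive traversal follows the `jmp` and only decodes offsets 0, 3 and 4. Comparing `decode(program, 2)` with `decode(program, 3)` also shows the overlapping-instruction problem: the byte at offset 3 is an operand in one reading and an opcode in the other.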

Decompilation:
- issues:
    - structure lost, data types lost, no semantic information
    - no one-to-one mapping between source code and assembly blocks
- types of analysis:
    - static analysis: examine without running, could in principle tell us everything the program could do
        - levels of analysis
            - program level: tools like `strings` (strings used in program), `readelf` (examine structure of binary), `ldd` (shared libraries used), `nm` (symbols in a program), `file` (identify file type), `cat /proc/<pid>/maps` (show memory mappings)
            - instruction level: disassemblers like IDA Pro
        - limitations:
            - in principle undecidable
            - binary may be obfuscated/encrypted
            - doesn't scale to large real-world programs because of analysis cost
            - needs to model library/system calls and the environment
            - hard to deal with indirect addressing and compiler optimizations
    - dynamic analysis: run and observe, tells us what the program does in a given environment with a particular input
        - containment is important (but maybe that changes the program's behaviour)
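As an example of program-level static analysis, the core idea behind `strings` can be sketched in a few lines: scan the raw bytes for runs of printable ASCII characters. This is only an approximation of the real tool. The function name `extract_strings`, the default minimum length of 4, and the `sample` bytes are assumptions made for this sketch.

```python
import re


def extract_strings(data: bytes, min_len: int = 4):
    """Return (offset, text) for each run of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


# Hypothetical bytes resembling a fragment of a binary: magic number,
# an interpreter path, a too-short run ("hi"), and a symbol name.
sample = b"\x7fELF\x02\x01\x00/lib64/ld-linux-x86-64.so.2\x00\x00hi\x00puts\x00"

for offset, text in extract_strings(sample):
    print(offset, text)
```

Runs shorter than the minimum (here `ELF` and `hi`) are dropped, which filters out most accidental matches in machine code. The offsets play the same role as the decimal offsets printed by `strings -t d`.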

Analyzing a binary:

<table>
    <tr>
        <th></th>
        <th>Application level</th>
        <th>Instruction level</th>
    </tr>
    <tr>
        <td>Static analysis</td>
        <td>
        <ul>
        <li>Identify file type: <code>file foo</code></li>
        <li>Extract strings: <code>strings -a -t d foo</code></li>
        <li>Identify libraries and imported symbols
            <ul>
                <li><code>ldd</code> - list shared libraries</li>
                <li><code>nm</code> - list symbols, unless stripped</li>
            </ul>
        </li>
        </ul>
        </td>
        <td>
        <ul>
            <li>Tracking control flow</li>
            <li>Path slices</li>
            <li>Data flow graphs</li>
            <li>Value set analysis</li>
            <li>Symbolic execution</li>
        </ul>
        </td>
    </tr>
    <tr>
        <td>Dynamic analysis</td>
        <td>
        <ul>
        <li>General info about the process: <code>/proc/&lt;pid&gt;/maps</code></li>
        <li>Library/system call trace
        <ul>
        <li><code>strace</code> - reveal system calls</li>
        <li><code>ltrace</code> - like strace, but for calls into dynamically linked libraries</li>
        </ul></li>
        <li>Network monitoring, e.g. <code>netstat</code> (connections) or <code>tcpdump</code> (packet sniffer)</li>
        </ul>
        </td>
        <td>
        <ul>
            <li>Improve accuracy of static analyses</li>
            <li>Dynamic information flow tracking, e.g. input and variable types</li>
            <li>Function call monitoring</li>
            <li>Combination of symbolic and dynamic execution</li>
        </ul>
        </td>
    </tr>
</table>
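For the application-level dynamic side, the memory map of a running process is plain text and easy to parse. The sketch below splits one line of `/proc/<pid>/maps` into its main fields. The names `Mapping` and `parse_maps_line` are made up for this example, and the sample line only imitates the format (address range, permissions, offset, device, inode, optional path).

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps into its main fields."""
    # Fields: address range, perms, offset, dev, inode, optional pathname
    fields = line.split(maxsplit=5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""  # anonymous mappings have no path
    return Mapping(start, end, fields[1], path)


# Illustrative sample line, not taken from a real process
sample = "00400000-00452000 r-xp 00000000 08:02 173521      /usr/bin/dbus-daemon"
m = parse_maps_line(sample)
print(hex(m.start), hex(m.end), m.perms, m.path)
```

On Linux, the same function could be applied to every line of `/proc/self/maps` to see which regions of the current process are writable or executable.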

## What's a binary?
Common file formats:
- PE (Windows)
- ELF (Linux and others)

The format defines things like what the file looks like on disk and what it should look like in memory.

It contains info about the machine to run it on, whether it is an executable or a library, the entry point, the sections, and what should be writable and what should be executable.

![Binary format](binary-format.png)
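Some of this information sits in the first bytes of the file. As a rough sketch for ELF, the code below reads the class (32/64-bit), file type, target machine and entry point from the header. The function `parse_elf_header`, the lookup tables (which cover only a few common values) and the hand-built `header` bytes are assumptions for this example. It is not a full ELF parser.

```python
import struct

# Only a few common values; anything else is shown as a hex number.
E_TYPES = {1: "relocatable", 2: "executable", 3: "shared object", 4: "core dump"}
E_MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def parse_elf_header(data: bytes) -> dict:
    """Read a few fields from the start of an ELF file."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big endian
    # After the 16-byte e_ident: e_type, e_machine, e_version, e_entry
    # (e_entry is 8 bytes wide in 64-bit files, 4 bytes in 32-bit files)
    fmt = endian + ("HHIQ" if is_64bit else "HHII")
    e_type, e_machine, _e_version, e_entry = struct.unpack_from(fmt, data, 16)
    return {
        "bits": 64 if is_64bit else 32,
        "type": E_TYPES.get(e_type, hex(e_type)),
        "machine": E_MACHINES.get(e_machine, hex(e_machine)),
        "entry": hex(e_entry),
    }


# Hand-built header for the demo: 64-bit, little endian, executable, x86-64
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
header = ident + struct.pack("<HHIQ", 2, 62, 1, 0x401000)

info = parse_elf_header(header)
print(info)
```

On a Linux machine the function could instead be fed the first 64 bytes of a real binary, and the result compared with the output of `readelf -h`.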

## What's malware?
Executable that
- hates debuggers and VMs
- hates being analyzed
- does bad things
- is frequently controlled in a centralised or peer-to-peer fashion (botnet)
- is often *packed*:
    - compressed to reduce size on disk
    - may have anti-debugging techniques
    - can't say that all packed binaries are malware, because legitimate software can also be packed
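The notes do not name a way to recognise packing. One widely used heuristic, added here as an assumption rather than something from the lecture, is byte entropy: compressed or encrypted data looks close to random, so its entropy approaches 8 bits per byte, while ordinary code and text score lower. In the sketch below, `shannon_entropy` and the sample data are made up for illustration, and `zlib` merely stands in for a packer's compression step.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for perfectly uniform bytes."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


# Hypothetical "unpacked" content: structured, low-entropy text
plain = "\n".join(f"item {i}: value {i * i}" for i in range(5000)).encode()
# Stand-in for packing: the same content after compression
packed = zlib.compress(plain)

print(f"plain:  {len(plain):6d} bytes, entropy {shannon_entropy(plain):.2f}")
print(f"packed: {len(packed):6d} bytes, entropy {shannon_entropy(packed):.2f}")
```

High entropy only suggests that a section is compressed or encrypted. In line with the last bullet above, it says nothing by itself about whether the binary is malicious.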