+++
title = 'Introduction'
+++

# Introduction
## Why binary analysis?
- Code improvement: performance/security (maybe the source code has been lost)
- Vulnerabilities: find exploits, pentest
- Malware: what does it do, how can we be safe, how can we stop it?

Static analysis: staring at the bytes and trying to see what they mean
- can be hindered by obfuscation, packing, encryption

Dynamic analysis: running the program and observing what it does
- can be hindered by anti-debugger techniques
- incomplete, maybe not all functionality actually runs

## Getting code from binary
Disassembler:
- interprets binary files and decodes their instructions
- assembly instructions map to sequences of bytes
    - but the opposite direction is not easy
- practical limitations
    - overlapping instructions
        - on e.g. x86, instructions have variable length
        - start addresses of instructions are not known in advance
        - depending on which byte you start disassembling from, you might get different instructions
    - desynchronisation: how do you distinguish data from code?
- practical approaches
    - linear sweep (objdump, gdb, windbg):
        1. start at the `.text` section
        2. disassemble one instruction after the other
        3. assume that a well-behaved compiler tightly packs instructions
    - recursive traversal (IDA, OllyDbg):
        1. start at the program entry point
        2. disassemble one instruction after the other until a control flow instruction
        3. recursively follow the instruction's targets (e.g. addresses of `jmp`)
        - pros: better at handling interleaved data and code
        - cons: coverage, what to do with indirect jumps?
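
The difference between the two approaches is easiest to see on a made-up instruction set. The sketch below is only an illustration, not how objdump or IDA are implemented: the opcode table `OPCODES`, the function names and the five-byte program `CODE` are all hypothetical, and jump operands are assumed to be absolute one-byte offsets.

```python
# Toy, hypothetical variable-length instruction set:
# opcode -> (mnemonic, total instruction length in bytes).
OPCODES = {
    0x01: ("nop", 1),
    0x02: ("push", 2),   # push imm8
    0x03: ("jmp", 2),    # unconditional jump to absolute offset
    0x04: ("jz", 2),     # conditional jump to absolute offset
    0xFF: ("ret", 1),
}

def decode(code, off):
    """Decode one instruction at `off`; return (mnemonic, operand, length)."""
    name, length = OPCODES.get(code[off], (".byte", 1))
    if off + length > len(code):      # operand would run past the end
        name, length = ".byte", 1
    if name == ".byte":               # unknown opcode: emit it as raw data
        return name, code[off], 1
    operand = code[off + 1] if length == 2 else None
    return name, operand, length

def linear_sweep(code):
    """Decode from the first byte to the last, one instruction after the other."""
    out, off = {}, 0
    while off < len(code):
        name, operand, length = decode(code, off)
        out[off] = (name, operand)
        off += length
    return out

def recursive_traversal(code, entry=0):
    """Decode from `entry`, following jump targets instead of raw byte order."""
    out, worklist = {}, [entry]
    while worklist:
        off = worklist.pop()
        while 0 <= off < len(code) and off not in out:
            name, operand, length = decode(code, off)
            out[off] = (name, operand)
            if name in ("jmp", "jz"):
                worklist.append(operand)      # visit the jump target later
            if name in ("jmp", "ret", ".byte"):
                break                         # no fall-through after these
            off += length
    return dict(sorted(out.items()))

CODE = bytes([
    0x03, 0x03,  # 0: jmp 3   (skips the data byte)
    0x02,        # 2: data byte that happens to equal the push opcode
    0x01,        # 3: nop     (real code)
    0xFF,        # 4: ret
])
```

On `CODE`, `linear_sweep` decodes the data byte at offset 2 as `push 1`, swallowing the real `nop` at offset 3 (desynchronisation). `recursive_traversal` follows the `jmp` and finds `nop` and `ret`, never touching the data byte. With an indirect jump the target would not be in the instruction bytes, which is exactly the coverage problem listed under cons.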

Decompilation:
- issues:
    - structure lost, data types lost, no semantic information
    - no one-to-one mapping between source code and assembly blocks
- types of analysis:
    - static analysis: examine without running, could in principle tell us everything the program could do
        - levels of analysis
            - program level: tools like `strings` (strings used in program), `readelf` (examine structure of binary), `ldd` (shared libraries used), `nm` (symbols in a program), `file`, `cat /proc/<pid>/maps` (show memory mappings of a running process)
            - instruction level: disassemblers like IDA Pro
        - limitations: in principle undecidable, binary may be obfuscated/encrypted, cost doesn't scale to huge real-world programs, needs to model library/system calls and the environment, hard to deal with indirect addressing and compiler optimizations
    - dynamic analysis: run and observe, tells us what the program does in a given environment with a particular input
        - containment is important (but the program may behave differently when contained)

Analyzing a binary:

<table>
    <tr>
        <th></th>
        <th>Application level</th>
        <th>Instruction level</th>
    </tr>
    <tr>
        <td>Static analysis</td>
        <td>
            <ul>
                <li>Identify file type: <code>file foo</code></li>
                <li>Extract strings: <code>strings -a -t d foo</code></li>
                <li>Identify libraries and imported symbols
                    <ul>
                        <li><code>ldd</code> - list shared libraries</li>
                        <li><code>nm</code> - list symbols, unless stripped</li>
                    </ul>
                </li>
            </ul>
        </td>
        <td>
            <ul>
                <li>Tracking control flow</li>
                <li>Path slices</li>
                <li>Data flow graphs</li>
                <li>Value set analysis</li>
                <li>Symbolic execution</li>
            </ul>
        </td>
    </tr>
    <tr>
        <td>Dynamic analysis</td>
        <td>
            <ul>
                <li>General info about the process: <code>/proc/&lt;pid&gt;/maps</code></li>
                <li>Library/system call trace
                    <ul>
                        <li><code>strace</code> - reveal system calls</li>
                        <li><code>ltrace</code> - like <code>strace</code>, but for calls into dynamically linked libraries</li>
                    </ul>
                </li>
                <li>Network activity: <code>netstat</code> (open connections) or a sniffer like <code>tcpdump</code></li>
            </ul>
        </td>
        <td>
            <ul>
                <li>Improve accuracy of static analyses</li>
                <li>Dynamic information flow tracking, e.g. input and variable types</li>
                <li>Function call monitoring</li>
                <li>Combination of symbolic and dynamic execution</li>
            </ul>
        </td>
    </tr>
</table>

## What's a binary?
Common file formats:
- PE (Windows)
- ELF (Linux and others)

The format defines things like what the file looks like on disk and what it should look like once loaded into memory.

It contains info about the machine to run it on, whether it's an executable or a library, the entry point, the sections, and which parts should be writable and which executable.

![Binary format](binary-format.png)

## What's malware?
Executable that
- hates debuggers and VMs
- hates being analyzed
- does bad things
- is frequently controlled in a centralised or peer-to-peer fashion (botnet)
- is often *packed*:
    - compressed to reduce size on disk
    - may have anti-debugging techniques
    - packing alone doesn't mean a binary is malware, because normal software can also be packed
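
One heuristic often used to spot packed files is byte entropy: compressed or encrypted data looks close to random, so its entropy approaches 8 bits per byte, while plain code and text usually score lower. These notes do not prescribe this check; it is an assumption added for illustration, and `byte_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over every byte value that actually occurs.
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())
```

A high score only suggests compression or encryption. As noted above, it says nothing by itself about whether the file is malicious.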