Binary Analysis and Rewriting Arvind Ayyangar Niranjan Hasabnis Alireza Saberi Tung Tran R. Sekar Stony Brook University Min Gyung Kang Stephen McCamant.
Post on 13-Jan-2016
Binary Analysis and Rewriting
Arvind Ayyangar, Niranjan Hasabnis, Alireza Saberi, Tung Tran, R. Sekar
Stony Brook University
Min Gyung Kang, Stephen McCamant, Pongsin Poosankam, Dawn Song
UC Berkeley
Motivation
• A popular approach for protecting applications from an untrusted OS is to rely on a trusted VMM
• Binary translation is one of the commonly used implementation technologies in VMMs
  • QEMU, earlier versions of VMware, …
  • Benefits: no need for hardware support, applicable to COTS binaries, whole system can be instrumented
• Unfortunately, existing binary translators are unsuited for enforcing higher-level properties
  • Information flow, control-flow integrity, object-granularity memory safety, …
  • They incur very high overheads (4x to 10x slowdown), or are simply unable to express certain properties
Our Approach
• Develop novel static-analysis-based methods to overcome the drawbacks of today’s techniques
  • Robust, scalable static analysis of low-level code
    • From different compilers, or hand-coded assembly
  • Accurate disassembly of binary code
    • Indirect control-flow transfers, non-standard call/return conventions, mixing of data and code, …
  • Accurate reasoning about key properties
    • Dynamic taint analysis
Robust and scalable static analysis of low-level code

Static analysis of low-level code
• Scalability: requires modular analysis
  • Analyze functions individually, compose results
  • Avoids repeated analysis of same code (esp. libraries)
• Strength: requires accurate reasoning about variables (esp. local variables)
• Challenges in low-level binary code
  • Difficult to identify parameter passing in optimized code
    • Missing pushes, parameter passing via registers, …
  • Difficult to distinguish local variables from other accesses
    • Caller/callee-saved registers, stack pointer conventions, …
Static analysis of low-level code
• To solve these challenges, previous approaches
  • make optimistic assumptions, or rely on compiler idioms
  • often fail on optimized code and/or large programs
  • don’t work for other compilers, or hand-written assembly
• Our solution: develop a new approach that
  • Uses systematic analysis to reduce assumptions/heuristics
  • Accurately tracks local variables by analyzing values held in registers and on the stack
Stack Analysis
• Analyzes one function at a time
• Examines the use of the stack to
  • Determine parameters
    • Number of them, whether in registers or on stack
  • Determine caller- and callee-saved registers
  • Summarize effect on parameters
    • Preservation of SP, return to caller, changes in parameter or register contents, …
[Figure: stack at entry to ƒ, with ESP pointing at the return address]
Abstract Interpretation for Stack Analysis
Function under analysis, <ƒ>:
    push %ebp
    mov %esp, %ebp
    sub $16, %esp
Lattice values at function entry:
    EBP = Base_BP + [0,0]
    ESP = Base_SP + [0,0]
    Activation record: offset 0 is Base_SP
Lattice values after push %ebp:
    EBP = Base_BP + [0,0]
    ESP = Base_SP + [-4,-4]
    Activation record: offset 0 is Base_SP; offset -4 holds Base_BP + [0,0]
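The lattice values on these slides have the form base + [lo, hi]. The following is a minimal sketch of such a value, assuming a symbolic base name paired with an integer interval. The class name `AbsVal`, the `TOP` element, and the join rule are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AbsVal:
    """Abstract value base + [lo, hi]; base=None stands for 'unknown' (top)."""
    base: Optional[str]
    lo: int = 0
    hi: int = 0

    def add(self, k: int) -> "AbsVal":
        # Adding a constant shifts both interval bounds.
        if self.base is None:
            return self
        return AbsVal(self.base, self.lo + k, self.hi + k)

    def join(self, other: "AbsVal") -> "AbsVal":
        # Merge values arriving from two control-flow paths.
        if self.base is None or self.base != other.base:
            return TOP
        return AbsVal(self.base, min(self.lo, other.lo), max(self.hi, other.hi))

    def __str__(self) -> str:
        return "TOP" if self.base is None else f"{self.base} + [{self.lo},{self.hi}]"


TOP = AbsVal(None)

esp_entry = AbsVal("Base_SP")              # ESP at function entry
esp_after_push = esp_entry.add(-4)         # push %ebp
esp_after_sub = esp_after_push.add(-16)    # sub $16, %esp
print(esp_after_push)                      # Base_SP + [-4,-4]
print(esp_after_push.join(esp_after_sub))  # Base_SP + [-20,-4]
```

Joining two values with the same base widens the interval, while joining values with different bases gives up and returns `TOP`. A real analysis would need a more careful treatment of loops.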
Stack Analysis (contd)
<f>:
    push %ebp
    mov %esp, %ebp
    sub $16, %esp
    mov 8(%ebp), %eax
    add $3, %eax
    mov %eax, 8(%ebp)
    mov $7, -12(%ebp)
    mov 12(%ebp), %edx
    mov %edx, -8(%ebp)
    leave
    ret
Summary for f:
• No change to ESP
• Two input parameters on stack
• EAX, EDX, arg1 changed as shown
• Others unchanged
[Figure: abstract values of EBP, EAX, EDX, and ESP, and of the caller and callee frames (args, Ret Addr, saved Base_BP + [0,0], locals), with values including Base_SP + [-4,-4], arg1 + [3,3], arg2 + [0,0], 7, and Base_SP + [-20,-20]]
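The summary for f can be reproduced with a small abstract interpreter over straight-line code. This is a sketch under stated assumptions: values are (symbol, offset) pairs, the tuple-based instruction encoding and the names `analyze`, `eax0`, and `edx0` are hypothetical, and 4-byte stack slots are assumed. It is not the authors' tool.

```python
def analyze(instrs):
    """Abstractly execute straight-line x86 code.

    Abstract values are (symbol, offset) pairs, i.e. symbol + [offset, offset].
    """
    init = {"esp": ("Base_SP", 0), "ebp": ("Base_BP", 0),
            "eax": ("eax0", 0), "edx": ("edx0", 0)}
    regs = dict(init)
    stack = {0: ("RetAddr", 0)}   # keyed by offset from Base_SP
    params = set()                # caller-frame slots read by the function

    def slot(disp, reg):
        base, off = regs[reg]
        assert base == "Base_SP", "address is not stack-relative"
        return off + disp

    def read(s):
        if s > 0 and s not in stack:   # first read of an incoming argument
            stack[s] = (f"arg{s // 4}", 0)
            params.add(s)
        return stack[s]

    for ins in instrs:
        op = ins[0]
        if op == "push":               # push %r
            s = slot(-4, "esp")
            stack[s] = regs[ins[1]]
            regs["esp"] = ("Base_SP", s)
        elif op == "mov":              # mov %src, %dst
            regs[ins[2]] = regs[ins[1]]
        elif op in ("add", "sub"):     # add/sub $imm, %r
            sym, off = regs[ins[2]]
            delta = ins[1] if op == "add" else -ins[1]
            regs[ins[2]] = (sym, off + delta)
        elif op == "load":             # mov disp(%base), %dst
            regs[ins[3]] = read(slot(ins[1], ins[2]))
        elif op == "store":            # mov %src, disp(%base)
            stack[slot(ins[2], ins[3])] = regs[ins[1]]
        elif op == "storei":           # mov $imm, disp(%base)
            stack[slot(ins[2], ins[3])] = ("const", ins[1])
        elif op == "leave":            # mov %ebp, %esp ; pop %ebp
            s = slot(0, "ebp")
            regs["ebp"] = stack[s]
            regs["esp"] = ("Base_SP", s + 4)
        elif op == "ret":              # checked before the return address is popped
            assert regs["esp"] == ("Base_SP", 0)
            assert stack[0] == ("RetAddr", 0)

    return {
        "esp_preserved": regs["esp"] == init["esp"],
        "stack_params": len(params),
        "changed_regs": {r: v for r, v in regs.items() if v != init[r]},
        "changed_args": {f"arg{s // 4}": v for s, v in stack.items()
                         if s > 0 and v != (f"arg{s // 4}", 0)},
    }


F = [
    ("push", "ebp"), ("mov", "esp", "ebp"), ("sub", 16, "esp"),
    ("load", 8, "ebp", "eax"), ("add", 3, "eax"), ("store", "eax", 8, "ebp"),
    ("storei", 7, -12, "ebp"), ("load", 12, "ebp", "edx"),
    ("store", "edx", -8, "ebp"), ("leave",), ("ret",),
]
summary = analyze(F)
print(summary)
```

Under these assumptions the result agrees with the slide: ESP is preserved, two stack parameters are read, EAX becomes arg1 + 3, EDX becomes arg2, and the arg1 slot is overwritten with arg1 + 3.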
Stack Analysis: Preliminary results
[Figure: analysis time (seconds, 0 to 300) versus program size (K instructions, 0 to 600) for FTP, pdftops, Gimp, XMMS, and Apache]
Static disassembly of binary code
Background: Disassembly Techniques
• Linear sweep algorithm
  • Start with program entry point, proceed to disassemble instructions sequentially
  • Key assumption: all instructions appear one after the next, without any gaps
    • Violated in most code (presence of data or padding)
• Recursive traversal algorithm
  • After a control-flow transfer instruction (CTI), proceed to disassemble target address
  • For conditional CTI and non-CTI, proceed to disassemble next instruction
  • Key problems
    • Code reached only through indirect CTIs
    • Functions that don’t return in the usual way
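Recursive traversal can be sketched on a toy program. The instruction table below is hypothetical: each address maps to a mnemonic, a length, and a direct branch target. No real instruction decoding is performed, and calls are assumed to return, which is exactly the assumption that fails for functions that don't return in the usual way.

```python
# Toy program: address -> (mnemonic, length, direct target or None).
# Addresses 3..4 hold data; address 10 is code reached only indirectly.
PROGRAM = {
    0: ("cmp", 1, None),
    1: ("jcc", 1, 5),      # conditional CTI: follow target and fall-through
    2: ("jmp", 1, 7),      # unconditional CTI: follow target only
    5: ("add", 2, None),
    7: ("call", 1, 9),     # call: follow target, assume it returns
    8: ("ret", 1, None),
    9: ("ret", 1, None),
    10: ("nop", 1, None),
}


def recursive_traversal(program, entry):
    """Return the set of instruction addresses reached from entry."""
    seen, worklist = set(), [entry]
    while worklist:
        addr = worklist.pop()
        if addr in seen or addr not in program:
            continue
        seen.add(addr)
        mnemonic, length, target = program[addr]
        if target is not None:               # direct CTI: follow the target
            worklist.append(target)
        if mnemonic not in ("jmp", "ret"):   # everything else falls through
            worklist.append(addr + length)
    return seen


reached = recursive_traversal(PROGRAM, 0)
print(sorted(reached))   # [0, 1, 2, 5, 7, 8, 9]
```

The data at addresses 3 and 4 is never decoded, which a linear sweep could not guarantee. The instruction at address 10 is missed because nothing reaches it through a direct transfer.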
Our Approach for Disassembly
• Assumption
  • No code obfuscation
• Non-assumptions
  • Function prologue and epilogue patterns
  • Compiler idioms or (lack of) optimizations
• Approach
  • Use recursive traversal
  • Use stack analysis to compute/verify return targets
  • Develop new analysis to determine targets of indirect control-flow transfers
Our Approach: Type inference
• Key insight: code pointer values don’t undergo arithmetic or other transformations
• Implication: values assigned to code pointers must represent indirect CTI targets
• Achieves much better results than data flow analysis
  • Avoids global def-use problem, which is very hard in low-level languages
• Compute a set of possible code addresses and a set of definite code addresses
  • Code at definite code addresses can be safely disassembled
  • Code at addresses outside the possible set can be safely relocated
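One crude way to over-approximate the set of possible code addresses is to scan data for constants that fall inside the code section. This constant scan is a swapped-in stand-in, not the type inference the slides describe. The function names, the 32-bit little-endian layout, the aligned-word restriction, and the section bounds are all assumptions for illustration.

```python
import struct


def possible_code_pointers(data: bytes, text_start: int, text_end: int):
    """Collect 32-bit little-endian words in data that point into [text_start, text_end)."""
    candidates = set()
    for off in range(0, len(data) - 3, 4):   # aligned words only (assumption)
        (word,) = struct.unpack_from("<I", data, off)
        if text_start <= word < text_end:
            candidates.add(word)
    return candidates


def can_relocate(addr: int, possible: set) -> bool:
    # Code that no stored value can point to may be moved without fixing up pointers.
    return addr not in possible


# Hypothetical data section: two in-range values, two out-of-range values.
data = struct.pack("<4I", 0x08048100, 7, 0x08048230, 0xDEADBEEF)
possible = possible_code_pointers(data, 0x08048000, 0x08049000)
print(sorted(hex(a) for a in possible))   # ['0x8048100', '0x8048230']
```

Because the scan is conservative, it can only add spurious candidates. It never shows that an address is definitely code, which is why a separate definite set is needed before disassembling.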
Static Disassembly: Preliminary Results
Analysis of disassembler on 'ls' binary
Analysis                         Disassembled code   Reachable code not disassembled
Recursive Traversal              2.7%                85%
Compiler idioms and heuristics   87%                 1%
Function pointer analysis        88%                 0%
Static Disassembly: Preliminary Results (contd)
Application   Size (KB)   Disassembled code   Reachable code not disassembled
pdftops       14          97%                 0%
chroot        26          85%                 0%
chmod         39          87%                 0%
cat           43          92%                 0%
ls            96          88%                 0%
dhclient      411         81%                 4%
Gap in dhclient due to incomplete implementation, dealing with global arrays
DTA++: Improving accuracy of Dynamic Taint Analysis [NDSS 2011]
Under-tainting and Over-tainting
• Results vary based on which values are considered to depend on others:
  • Too few dependencies lead to under-tainting
  • Too many dependencies lead to over-tainting
Key Idea
• Under-tainting occurs when control flow state represents (almost) all of the information in inputs
• Key idea: propagate taint only for control dependencies that would cause under-tainting (culprit implicit flows)

    char output[256];
    char input = next_in();
    long len = 0;
    /* The branch outcome reveals input exactly, but only constants are */
    /* assigned below, so data-flow tainting leaves output and len untainted. */
    if (input == '{') {
        output[0] = '\\';
        output[1] = '{';
        len = 2;
    }
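The effect of the branch on taint can be sketched by mimicking the snippet with one taint bit per variable. The function `run` and its `propagate_control` flag are hypothetical names for illustration. The flag stands in for a taint propagation rule attached to this one branch and is not the DTA++ implementation.

```python
def run(input_char, input_tainted=True, propagate_control=False):
    """Mimic the slide's snippet, tracking one taint bit per variable."""
    output = ["\0", "\0"]
    taint = {"input": input_tainted, "output[0]": False,
             "output[1]": False, "len": False}
    length = 0
    if input_char == "{":
        # Only constants are assigned here, so data-flow tainting adds no taint.
        # With the extra rule, writes inside the branch inherit the taint of
        # the branch condition.
        branch_taint = taint["input"] if propagate_control else False
        output[0], taint["output[0]"] = "\\", branch_taint
        output[1], taint["output[1]"] = "{", branch_taint
        length, taint["len"] = 2, branch_taint
    return "".join(output[:length]), taint


out, data_flow_only = run("{")
_, with_rule = run("{", propagate_control=True)
print(repr(out), data_flow_only["output[1]"], with_rule["output[1]"])
```

With data-flow propagation alone, the output contains a copy of the tainted input character but carries no taint. That is the under-tainting the slide refers to.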
DTA++ Approach Overview
• Hypothesis: under-tainting occurs at just a few locations in a program (culprit branches)
• Approach: find these locations in advance, and construct new taint propagation rules for them
• Assumption: we are given test inputs that demonstrate the under-tainting
Approach Details
• Under-tainting detection predicate
  • Given a (partial) execution trace t, φ(t) holds if t contains a culprit implicit flow
  • Implementation
    • Use symbolic execution to count how many other inputs could take the same execution path as t
    • Few or none → φ(t) = true
• Search for culprit branches
  • Find shortest prefix of t that satisfies φ
    • The last instruction in the prefix is the culprit
  • Remove culprit, repeat the search to find others
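The search can be sketched on a toy program with a single input byte. Brute-force enumeration of all 256 inputs replaces symbolic execution here, which only works because the input space is tiny. The branch names, the `trace` function, and the threshold of zero other inputs for "few or none" are assumptions for illustration.

```python
def trace(c):
    """Branch outcomes taken by a hypothetical byte-processing program."""
    return [("b0", c < 128), ("b1", c == ord("{")), ("b2", c % 2 == 0)]


def phi(prefix, ignored, inputs, threshold=0):
    """True if at most `threshold` other inputs follow the same path as prefix."""
    want = [step for step in prefix if step[0] not in ignored]
    same = 0
    for c in inputs:
        got = [s for s in trace(c)[:len(prefix)] if s[0] not in ignored]
        same += got == want
    return same - 1 <= threshold   # exclude the traced input itself


def find_culprits(c, inputs=range(256)):
    """Repeatedly find the shortest prefix satisfying phi; its last branch is a culprit."""
    t, ignored = trace(c), set()
    while True:
        for n in range(1, len(t) + 1):
            if t[n - 1][0] not in ignored and phi(t[:n], ignored, inputs):
                ignored.add(t[n - 1][0])   # remove the culprit and search again
                break
        else:
            return sorted(ignored)


print(find_culprits(ord("{")))   # ['b1']
print(find_culprits(5))          # []
```

For the input '{', the prefix ending at branch b1 is followed by no other input, so b1 is reported. Once b1 is ignored, no remaining prefix is that specific and the search stops.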
DTA++ Results: Diagnosis Time
Program Description   # of Culprit Implicit Flows Detected & Fixed   Time for Diagnosis
WordPad, RTF          1                                              0.26s
MS Word 2003, RTF     24                                             31m 5.26s
AbiWord, HTML         1                                              14.29s
AngelWriter, HTML     3                                              0.63s
AurelEdit, RTF        1                                              0.76s
VNU Editor, RTF       1                                              0.34s
IntelliEdit, RTF      1                                              0.40s
CryptEdit, RTF        1                                              0.23s
DTA++ Results: Over-tainting
Summary and Future Work
• Develop novel static-analysis-based methods to overcome the drawbacks of today’s techniques
  • Robust, scalable static analysis of low-level code
  • Accurate disassembly of binary code
  • Accurate reasoning about key properties
    • Dynamic taint analysis
• Future work
  • Experimentation and evaluation of stack analysis and disassembly
  • Robust and efficient binary instrumentation for information flow and related properties
  • Application to hostile OS defense