AUTOMATING RETURN ORIENTED ATTACKS
ON x86 ARCHITECTURE
School of Computer and Communication Sciences
Programming Methods Group
École Polytechnique Fédérale de Lausanne
A degree project submitted in partial fulfilment
of the requirements for the degree of
Master of Computer Science of
Alen Stojanov
supervised by:
Prof. Dr. Martin Odersky
Prof. Dr. Michael Franz
Lausanne, EPFL, 2012
Hard work is never in vain.
It always finds its way to pay off.
— Timko Stojanov, my father
Acknowledgements
I would like to express my utmost gratitude to my external thesis advisor, Prof. Dr. Michael Franz, for allowing me to join his team in the Secure Systems and Software Laboratory at the University of California, Irvine, for his expertise and kindness, and most of all, for ensuring a positive and encouraging environment for performing research.
This work would not have been completed without help and support.
I am greatly indebted to Dr. Per Larsen and Dr. Stefan Brunthaler for their trust in me and for providing me with the opportunity to acquire the knowledge and expertise needed to contribute to the research project. Gaining their trust and friendship made my work so much smoother.
I would like to express my appreciation to Prof. Dr. Martin Odersky for giving me the opportunity to participate in this research project, as well as for being my internal supervisor at EPFL.
Lausanne, 16 March 2012 A. S.
Abstract
Return oriented programming (ROP) is an exploit technique that avoids code injection by reusing existing code to induce arbitrary behavior in a program. ROP attacks are conducted by chaining available instruction sequences (gadgets) ending in a “return” instruction. While the construction of ROP attacks has been automated, these approaches rely on searching for gadgets using predefined sequences that operate on a fixed set of registers, on the grounds that large and widely distributed chunks of binary code are likely to contain them. As a result, libraries and operating system kernels have been targeted as gadget providers.
We propose an automatic gadget construction targeting stand-alone executables, without relying on libraries or the system kernel. Because the number of available gadgets may be limited, stand-alone executables are likely to offer only instructions operating on distinct registers. Consequently, chaining instructions so that the result of one instruction is used by subsequent instructions can be achieved only by moving data across registers. For that purpose, we build a graph representing register manipulation instruction sequences (mov, xchg, add, sub, etc.). Each register represents a node, and each data movement across registers represents an edge. The strongly connected components in the graph provide the available registers, and the shortest paths among those registers describe instruction chaining with minimal data movements. Customizing the gadget search to the available registers increases the flexibility when automatically constructing attacks, allowing the attacks to be applied to stand-alone executables, while minimal data movements help optimize the generated attacks.
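The graph construction above can be illustrated with a minimal Python sketch. It assumes each register-manipulating gadget has already been reduced to a (source, destination) pair; the gadget list is hypothetical, and Kosaraju's algorithm and breadth-first search (every move costing one gadget) are standard choices made for this sketch, not necessarily those of the actual implementation.

```python
from collections import deque


def build_graph(transfers):
    """Map each register to the set of registers it can move data into."""
    graph = {}
    for src, dst in transfers:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph


def strongly_connected_components(graph):
    """Kosaraju's algorithm: DFS finish order, then DFS on the reversed graph."""
    seen, order = set(), []

    def visit(node):
        seen.add(node)
        for succ in graph[node]:
            if succ not in seen:
                visit(succ)
        order.append(node)

    for node in graph:
        if node not in seen:
            visit(node)

    reverse = {node: set() for node in graph}
    for node, succs in graph.items():
        for succ in succs:
            reverse[succ].add(node)

    assigned, components = set(), []

    def collect(node, component):
        assigned.add(node)
        component.add(node)
        for pred in reverse[node]:
            if pred not in assigned:
                collect(pred, component)

    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components


def shortest_transfer(graph, src, dst):
    """BFS: fewest register-to-register moves from src to dst, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for succ in sorted(graph[node]):
            if succ not in prev:
                prev[succ] = node
                queue.append(succ)
    return None


# Hypothetical gadgets: a mov-like move gives one edge, an xchg gives both directions.
transfers = [("eax", "ebx"), ("ebx", "ecx"), ("ecx", "eax"),
             ("edx", "eax"), ("esi", "edi"), ("edi", "esi")]
graph = build_graph(transfers)
```

In this toy graph, eax, ebx and ecx form one component, so a result in ecx can reach ebx through eax in two moves, while no gadget can move data into edx.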
In order to implement the conditional jump, we use the value of eax to modify esp. Since eax holds either 64 or 0, we can add eax to esp. This results either in proceeding to the next gadget address on the stack, or in continuing 64/4 = 16 addresses later. However, esp cannot be modified directly, and we need 5 gadgets to set its value. We take this into consideration and offset esp by five 32-bit addresses via ebx. Note that at position #25, esp is added into ebp, which thereby obtains the stack address of that instruction.
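The branch arithmetic can be checked with a small Python model. It treats esp as a byte offset into a list of 32-bit gadget addresses and assumes a hypothetical branchless way (negate, then AND) of turning a 0/1 condition into 0 or 64; the five-address correction applied via ebx is not modelled.

```python
MASK32 = 0xFFFFFFFF
WORD = 4  # bytes per 32-bit stack slot


def flag_to_offset(flag, offset=64):
    """Turn a 0/1 condition into 0 or `offset` without branching.

    Hypothetical helper: negating 1 yields 0xFFFFFFFF and negating 0 yields 0,
    so AND-ing with the offset selects either the offset or nothing.
    """
    return (-flag & MASK32) & offset


def take_branch(chain, esp, eax):
    """Model `add esp, eax`; the following ret fetches the address at the new esp."""
    esp = (esp + eax) & MASK32
    return esp, chain[esp // WORD]


chain = [0x1000 + i for i in range(32)]  # hypothetical gadget addresses
```

With eax = 0 the chain continues at its next slot, while eax = 64 skips 64 / 4 = 16 slots ahead.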
Jump-oriented programming (JOP) is another variation of return-oriented programming that has also been proven Turing-complete. The new approach eliminates the reliance on the stack and on return instructions seen in return-oriented programming. Unlike ROP, where after the execution of a particular gadget the control flow returns to the stack to follow the next address, JOP performs a unidirectional control-flow transfer to its target. Instead of ending in a return instruction, each gadget in a JOP attack ends in a jmp instruction. The attack relies on a dispatcher gadget, which essentially maintains a virtual program counter that advances execution from one gadget to the next, as illustrated in Figure 2.3.
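The dispatcher mechanism can be sketched as a Python simulation. Gadget names, the state dictionary and the one-entry stride of the virtual program counter are assumptions for illustration; on x86 a dispatcher would typically advance a register by the size of a table entry and jump through it.

```python
def run_jop(dispatch_table, gadgets, state):
    """Simulate a JOP dispatcher loop.

    `dispatch_table` lists functional gadget names in execution order;
    `gadgets` maps each name to a function that updates `state`.
    """
    vpc = 0  # virtual program counter kept by the dispatcher gadget
    while vpc < len(dispatch_table):
        target = dispatch_table[vpc]  # dispatcher loads the next table entry,
        vpc += 1                      # advances the virtual program counter,
        gadgets[target](state)        # and jumps to the functional gadget,
        # which ends in a jmp back to the dispatcher (the next loop iteration).
    return state


gadgets = {
    "load5": lambda s: s.update(eax=5),
    "double": lambda s: s.update(eax=s["eax"] * 2),
}
```

For example, the dispatch table ["load5", "double", "double"] leaves eax = 20 without a single return instruction being executed.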
2.3 Architectural aspects
That return-oriented programming represents a universal threat is easily shown by the fact that ROP attacks are not limited to architectures with variable-length instructions such as x86. RISC architectures such as SPARC share almost no properties with x86. As SPARC has fixed-width instructions and alignment is enforced on instruction fetch, unintended instructions can no longer be used when crafting return-oriented attacks, which significantly reduces the set of available gadgets. However, stack overflows are still possible on SPARC,
and the rich set of registers, combined with the 4000 return instructions in the Solaris libc implementation, was shown to be enough to create ROP attacks [2].
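The role of instruction alignment can be illustrated with a short Python sketch. It uses the x86 encoding of mov eax, 0xc3 (b8 c3 00 00 00) followed by ret (c3): the c3 inside the immediate is an unintended return, usable only because x86 can start decoding at any byte. On a fixed-width, alignment-enforcing architecture such as SPARC, execution can only begin at intended instruction boundaries, so such bytes cannot be used.

```python
RET = 0xC3  # x86 near return opcode


def unintended_returns(code, boundaries):
    """Offsets of ret bytes that do not start an intended instruction."""
    return [i for i, byte in enumerate(code) if byte == RET and i not in boundaries]


# mov eax, 0xc3 (b8 c3 00 00 00) followed by ret (c3)
code = bytes([0xB8, 0xC3, 0x00, 0x00, 0x00, 0xC3])
intended = {0, 5}  # instruction start offsets from a linear disassembly
```

Here the only unintended return sits at offset 1, inside the immediate operand of the mov.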
Similarly to SPARC, the threat of ROP has also been extended to another RISC architecture, namely ARM [3], affecting many smartphones and other mobile and embedded devices. Although the implementation and focus of that work was on Windows Mobile, the technique can be ported to any other operating system running on this architecture. A real-world use case of ROP was also demonstrated when the AVC Advantage voting machines, a Harvard-architecture design used in past elections in the United States, were shown to be exploitable [6]. Other affected architectures include Atmel AVR [4] and PowerPC [5].
2.4 Defence Techniques
The discovery of the first buffer overflow attack created the need to develop adequate protection mechanisms to address the potential vulnerabilities. As the sophistication of the attacks increased, many techniques were developed to mitigate code reuse attacks. Against strong adversaries, however, the effectiveness of those systems can still be challenged, showing that ROP and its variations remain a valid threat, especially in environments where those systems are not deployed.
2.4.1 W⊕X and ASLR
Stack smashing attacks were quite popular when they were initially discovered, as exploiting those vulnerabilities was quite simple. As soon as the attacker was able to change the return address on the stack, it was also possible to inject arbitrary machine code on the stack and direct the machine to execute the injected code. To address this issue, the “W⊕X” protection model emerged, marking memory regions as either writable or executable, but never both. The model therefore prevented attackers from injecting attack code, as diverting the control flow of the program into a writable region would cause a processor exception. Being a sufficiently strong mitigation, the model was soon adopted by CPU manufacturers under the name NX bit (No eXecute), and was incorporated into many modern operating systems. Intel introduced the XD bit, AMD Enhanced Virus Protection, and ARM the XN bit. Microsoft implemented Data Execution Prevention (DEP) [38], supporting software- and hardware-based data protection; Linux adopted PaX [7] to utilize the NX bit, as well as to emulate it on architectures where it was not supported; and Red Hat introduced ExecShield [9]. Mac OS X, FreeBSD, OpenBSD, NetBSD, Solaris and Android also incorporated the protection mechanism in their implementations.
W⊕X as a protection model was only able to provide security until code reuse attacks appeared. To mitigate these new approaches, address space layout randomization (ASLR) [8] was introduced. The model randomly arranges the positions of libraries, heap and stack within the process address space. The random offsets of the dynamic libraries, as well as of the heap and stack, hinder the creation of ROP attacks, especially return-to-libc attacks, by making it harder to predict the positions of the gadgets used in the attacks. Similarly to W⊕X, ASLR was incorporated as part of PaX in Linux and in OpenBSD, Microsoft shipped an implementation alongside DEP, and Mac OS X, iOS and Android followed. Although ASLR appears to be a very strong mitigation against code reuse attacks, it was later demonstrated that it can be bypassed by a derandomization attack [39], converting any standard buffer overflow attack into one that affects systems where ASLR is enabled. The derandomization attack exploited the fact that ASLR does not randomize the stack layout, brute-forcing the attack to pinpoint the location of libc. Considering that it took only 216 seconds to compromise Apache, the derandomization attack can be used to locate the positions of gadgets before initiating a ROP attack.
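The cost of brute-forcing a randomized base address can be estimated with a short calculation. The source does not state the amount of entropy, so the 16 bits used in the test are an assumption; the sketch contrasts a layout that stays fixed across attempts (each failed guess eliminates one candidate) with one that is re-randomized on every attempt.

```python
def expected_probes(entropy_bits, rerandomized):
    """Expected number of guesses needed to hit a randomized base address."""
    candidates = 2 ** entropy_bits
    if rerandomized:
        # Each attempt is an independent 1/N guess: geometric distribution, mean N.
        return float(candidates)
    # Layout fixed across attempts: each failed guess rules out one candidate,
    # so the successful guess is uniform over 1..N and the mean is (N + 1) / 2.
    return (candidates + 1) / 2
```

Under the 16-bit assumption, a fixed layout needs about 32 768 probes on average, and re-randomization only doubles that, which is why brute forcing remains practical.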
2.4.2 Return-Less Kernels
Building an operating system with a return-less kernel was an attempt to remove the threat of return-oriented attacks aimed at escalating access privileges in the operating system [10]. The authors used a compiler-based approach to generate the FreeBSD kernel without return instructions and without return opcodes. Instead of using the traditional calling convention, where the return address is pushed on the stack, a return index is pushed on the stack. This index corresponds to an entry in a centralized return address table containing all valid return addresses permitted in the kernel. When a return is invoked, the return index is popped from the stack, and control is transferred to the corresponding return address. As there is a finite number of call instructions within the kernel, the return address table is static, and can therefore be pre-generated according to the locations of the call instructions.
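The return index mechanism can be modelled in a few lines of Python. The class name and addresses are hypothetical; the point is that a popped index can only resolve to an entry of the static table, so an overwritten stack slot cannot redirect control to an arbitrary address.

```python
class ReturnIndexTable:
    """Hypothetical model of a centralized, static return address table."""

    def __init__(self, return_sites):
        # One entry per call site: the address where execution resumes after the call.
        self.sites = list(return_sites)
        self.index_of = {addr: i for i, addr in enumerate(self.sites)}

    def call(self, stack, return_site):
        # Push the index of the return site instead of the raw address.
        stack.append(self.index_of[return_site])

    def ret(self, stack):
        # Pop the index and resolve it through the table; anything else is rejected.
        index = stack.pop()
        if not 0 <= index < len(self.sites):
            raise ValueError(f"invalid return index {index:#x}")
        return self.sites[index]


table = ReturnIndexTable([0xC0101000, 0xC0102000])
```

Overwriting the slot with a valid index can still only transfer control to a legitimate return site, which is why attackers turn to the jmp instructions discussed next.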
The implementation systematically replaces each call and return instruction with code that pushes a return index on the stack and restores the corresponding address stored in the return address table, effectively removing all return instructions within the kernel. However, the control flow of the kernel is now redirected using jmp instructions, resulting in an increased number of jmp instructions in the kernel binary. As the approach does not ensure resilience against jump-oriented attacks, the increased number of jmp instructions enriches the set of
JOP gadgets, providing additional opportunities to JOP attackers. Although this system defeated the return-oriented rootkits known at that time, it does not provide a foolproof and generic protection against all variations of ROP attacks.
2.4.3 HyperCrop
HyperCrop [11] is a hypervisor-based, i.e., Virtual Machine Monitor (VMM), approach to defeating x86 return-oriented programming attacks, built on top of the Xen hypervisor [40]. In this context, the VMM is used to intercept stack writes that occur during program execution and to inspect the content of the stack to detect the threat of a ROP attack. In the initial step, gadget addresses are extracted from the binary or library protected by the system. These addresses form the set of gadget addresses that could potentially be used by an attacker in a ROP scenario. During the execution of the program, 400 bytes are copied from the top of the stack, which correspond to 100 32-bit entries. Each of the 100 entries is then cross-referenced against the predefined set of potential gadget addresses. If the ratio of entries found in this set to the total number of entries is above a carefully chosen threshold, the system assumes the potential threat of a ROP attack.
The evaluation of the system reported a performance overhead of 1.4, suggesting that the system is practical for defending against ROP attacks. However, the approach has several limitations. As the binary sizes of different programs and libraries vary according to their code bases, the threshold used as a heuristic to detect a potential ROP attack must be calculated and normalized for each individual program or library. Furthermore, each time a program or library is updated and recompiled, the set of potential gadget addresses must be updated, and, as the cardinality of this set can change, the threshold must be recalculated. This makes the system tedious to use and infeasible to deploy on a large scale.
The second problem is the heuristic based on a predetermined threshold. As the system only analyses 100 entries from the stack, an attacker can use the so-called esp lifting technique [31] to increase the value of esp with a single gadget. Normally, each return advances esp by one entry, to the next gadget address on the stack; if esp is instead increased by a particular value x, the CPU will execute the gadget whose address resides at stack position esp+x. This technique essentially jumps over parts of the stack, creating holes that can be filled with bogus values. If the attack is crafted such that those bogus values do not correspond to gadget addresses, the ratio of potential gadget addresses to total stack entries can be reduced below the threshold, bypassing the HyperCrop
detection.
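The detection heuristic and the esp lifting bypass can be reproduced with a toy Python model. The gadget addresses, the 0.5 threshold and the padding of four bogus words per gadget are assumptions; the sketch also simplifies by assuming every gadget in the lifted chain advances esp past its own hole.

```python
def gadget_ratio(stack_words, gadget_addresses, window=100):
    """Fraction of the top `window` stack entries that are known gadget addresses."""
    top = stack_words[:window]
    if not top:
        return 0.0
    hits = sum(1 for word in top if word in gadget_addresses)
    return hits / len(top)


def looks_like_rop(stack_words, gadget_addresses, threshold, window=100):
    """Flag the stack when the gadget ratio exceeds the threshold."""
    return gadget_ratio(stack_words, gadget_addresses, window) > threshold


gadget_addresses = [0x08048000 + 0x10 * i for i in range(100)]  # hypothetical
gadget_set = set(gadget_addresses)

plain_chain = gadget_addresses[:100]
lifted_chain = []
for address in gadget_addresses[:20]:
    lifted_chain.append(address)
    lifted_chain.extend([0x41414141] * 4)  # hole skipped by an esp-lifting gadget
```

The plain chain scores a ratio of 1.0, whereas the lifted chain of the same length scores 0.2 and passes unnoticed under the assumed threshold.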
Finally, HyperCrop is defenceless against jump-oriented programming attacks, since these attacks do not rely on stack values to retain control flow.
2.4.4 ROPdefender
ROPdefender [12] is a neat technique able to defend against traditional ROP attacks using a binary instrumentation framework [41]. The implementation is built on top of Pin, utilizing its VM emulation unit and just-in-time compiler unit to maintain a shadow stack, similar to the one used in StackGhost [42]. When a call or return instruction occurs, ROPdefender uses the inspection routines provided by Pin to intercept the instruction. Once a call is encountered, the return address is pushed on the shadow stack. When a return instruction occurs, a check is enforced between the return address the stack pointer points to (i.e., the return address on the program stack) and the saved return address on top of the shadow stack. If the values do not match, a ROP attack is detected.
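The shadow stack check can be sketched in Python. This models only the comparison logic, not the Pin instrumentation, and ignores complications such as setjmp/longjmp; the class and exception names are hypothetical.

```python
class RopDetected(Exception):
    pass


class ShadowStack:
    """Toy model of a shadow stack checked on every call and return."""

    def __init__(self):
        self._saved = []

    def on_call(self, return_address):
        # call: keep a private copy of the return address
        self._saved.append(return_address)

    def on_return(self, stack_top):
        # ret: the address on the program stack must equal the saved copy
        expected = self._saved.pop() if self._saved else None
        if stack_top != expected:
            raise RopDetected(f"return to {stack_top:#x}, expected {expected}")
        return stack_top
```

A ROP chain returns to gadget addresses that were never pushed by a call, so the very first gadget return triggers the mismatch.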
Since the fundamental condition that makes return-oriented programming possible is that return addresses are written on the same stack where arbitrary data is placed, keeping a copy of the return addresses on a shadow stack is quite an elegant solution against ROP attacks. However, the greatest drawback of ROPdefender is its performance overhead, which in the worst case amounts to a slowdown of 3.54 times compared to normal execution of a program. Although most of it results from the overhead of Pin itself, the only feasible deployment scenario on a large scale would be an implementation of the ROPdefender logic at the hardware level. Moreover, similarly to HyperCrop, ROPdefender does not provide a mechanism to defend against ROP attacks based on indirect jumps. Since indirect jumps do not disrupt the calling sequence inside a running program, the shadow-stack verification of return addresses is futile against them.
2.4.5 G-Free
G-Free [13] is another system aiming to address the wide range of common code reuse attacks. It focuses on securing the libc library to prevent return-into-libc attacks, by preventing the attacker from reusing existing fragments of code as basic building blocks. The protection addresses both intended and unintended instructions. The protection for the latter is achieved using code rewriting techniques that remove unintended instructions by aligning instructions with alignment sleds,
that is, sufficiently long instruction sequences that have no effect when executed. Combined with the removal of occurrences of the c3 and c2 bytes, the system reduces the number of unintended gadgets in libc. To protect intended instructions from being reused as gadgets, the system incorporates return address protection, introducing an instruction header that encrypts the return address pushed on the stack and an instruction footer that decrypts the return address before the return instruction executes.
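The effect of return address encryption can be illustrated with a toy Python model. The XOR scheme and the way the key is chosen are assumptions made for this sketch, not a description of G-Free's actual instruction sequences.

```python
import secrets

MASK32 = 0xFFFFFFFF


class ReturnAddressGuard:
    """Toy model of return address encryption with a per-process XOR key."""

    def __init__(self, key=None):
        self.key = secrets.randbits(32) if key is None else key & MASK32

    def header(self, stack):
        # Function entry: encrypt the return address on top of the stack.
        stack[-1] ^= self.key

    def footer(self, stack):
        # Before ret: decrypt it again, so ret sees the original address.
        stack[-1] ^= self.key
```

If an attacker overwrites the encrypted slot with a gadget address, the footer turns it into that address XOR the unknown key, so the return lands somewhere the attacker cannot predict.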
The system also provides protection against indirect jumps and demonstrates solid protection against traditional ROP and all its variations, with a very small performance overhead of about 5.6% in the worst case. Nevertheless, it is difficult to judge how the system performs on control-flow-intensive benchmarks, since the performance evaluation addressed use cases where control flow was not the most crucial part (I/O-bound and kernel-based workloads). Although G-Free provides a comprehensive solution to defend against ROP attacks, it has been demonstrated only as a prototype providing ROP protection within the implementation of libc. The low deployment of the principle across software on different architectures, operating systems and end-user applications leaves the threat of ROP at the heart of security problems.
2.4.6 Control Flow Integrity
Control Flow Integrity [14] is a code rewriting technique that incorporates lightweight static verification to instrument runtime checks preventing code reuse attacks. As changing the control flow of the program is the essence of ROP-based exploits, this technique ensures that a program follows its control flow graph (CFG), generated ahead of time. The implementation is based on a modified XFI [43] that supports so-called label instructions, which mark the beginning of a particular function without affecting its semantics (prefetch instructions). During the execution of a program, the system checks whether the target of each branch carries a valid label, and when a function returns, it also checks whether the return pointer points to a valid return address, constraining the binary to follow the expected control flow.
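The label check can be sketched in Python. The label IDs and addresses are hypothetical; in CFI they would be derived from the CFG and embedded in the binary, and the check would be inlined before each indirect transfer.

```python
class CfiViolation(Exception):
    pass


def checked_transfer(target, expected_label, labels):
    """Allow an indirect transfer only if `target` carries the expected label."""
    if labels.get(target) != expected_label:
        raise CfiViolation(f"transfer to {target:#x} lacks label {expected_label}")
    return target


# Hypothetical label IDs derived from the CFG: both callees of one indirect
# call site share label 17; the return site after that call carries label 42.
labels = {0x1000: 17, 0x1040: 17, 0x2004: 42}
```

A transfer into the middle of a function, for example to 0x1003, finds no label and is rejected.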
The rewriting engine of XFI analyses the binary and finds all branching instructions. Based on this analysis, the system performs the code rewriting, instrumenting the branching instructions to enforce the CFG. Unfortunately, a drawback resulting from the binary analysis is that the system addresses intended instructions only. A ROP attack crafted from gadgets consisting of unintended instructions bypasses the instrumented checks, eliminating the ability to verify that a return address points to a valid label instruction. Furthermore, the performance overhead introduced by this system (∼45% in the worst case) is an additional reason why this system has not seen any
significant production deployment.
A follow-up work sharing the same fundamentals as CFI is the control flow locking (CFL) system [15]. Unintended instructions are removed using an existing technique, software fault isolation [44]. Instead of introducing label instructions to detect a control flow violation before it occurs, CFL lazily detects the violation after the transfer occurs. This is done by performing a lock operation before each indirect control flow transfer, with a corresponding unlock operation present at valid destinations in binaries and static libraries. This work fills in the gaps left by the initial CFI work. Its low performance overhead (∼23% in the worst case) is competitive with other defence mechanisms dealing with code reuse attacks. Although the technique can easily be ported to support dynamically linked libraries, extending the defence against code reuse attacks to them, the low deployment of the system keeps ROP attacks plausible.
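The lazy detection of CFL can be illustrated with a toy Python model of a single lock word. The edge-class values are hypothetical; the sketch shows that jumping to a gadget without unlock code is only detected at the next lock operation.

```python
class CflViolation(Exception):
    pass


class ControlFlowLock:
    """Toy model of control flow locking with a single lock word (0 = unlocked)."""

    def __init__(self):
        self.value = 0

    def lock(self, edge_class):
        # Before an indirect transfer: the previous transfer must have been unlocked.
        if self.value != 0:
            raise CflViolation(f"lock still held with value {self.value}")
        self.value = edge_class

    def unlock(self, edge_class):
        # At a valid destination: the lock must hold the expected value.
        if self.value != edge_class:
            raise CflViolation(f"expected {edge_class}, found {self.value}")
        self.value = 0
```

A gadget reached through a hijacked transfer executes without unlocking, so the violation surfaces when the following indirect transfer tries to take the lock.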
3 Related Work
Section 2.4 gives an overview of the available defence techniques against ROP attacks, their flaws and their deployment rates, and clearly shows that return-oriented programming is a crowded, important research area. It also demonstrates the inability to provide a comprehensive defence mechanism against all ROP attacks, as the mechanisms developed along the way were only piecemeal defences addressing each new variant of code reuse attack as it appeared.
On the other hand, the evolution of defence techniques was also continuously prompting new attack techniques designed to defeat the widely deployed security systems. As a result, the attacks were becoming increasingly sophisticated and harder to create. At the same time, the diversity of architectures, operating systems, distributions and applications was an additional incentive for attackers to develop tools that automate the creation of code reuse attacks, enabling them to target wide groups of users on different systems.
Recent work on ROP attacks clearly sheds light on the danger these attacks pose. While defence techniques were being developed, attackers also used tricks and methods to disguise the generated attacks, creating polymorphic variants, packed payloads, and even ASCII-printable payloads [36] to evade non-ASCII filtering. In the following sections, we describe the current tools and methods for creating automatic ROP attacks that are portable across different systems and architectures. As attackers can potentially apply unlimited imagination to fork different ideas, we only address
is the only related work that provides fully automatic generation of return oriented
attacks, targeting the Windows kernel. As the logic of the rootkit is implemented using
return-oriented programming, we focus on the generation of the attack. The system consists of three modules:
1. Constructor. This module proceeds in two steps. In the first step, it searches for single instructions followed by a free branch instruction. The search focuses on a predefined set of instructions characterised as useful instructions. In the second step, the constructor chains gadgets together to form structured gadgets designated to perform basic operations (logical, arithmetic, control flow, stack manipulation and bitwise operations). In order to find all the gadgets necessary to perform an operation, as well as to control which registers are modified when a particular gadget is used, a set of CPU registers is specified, called the working registers. Finally, the constructor merges gadgets whose instructions operate on the working registers to perform a particular operation.
2. Compiler. This building block takes the gadgets provided by the constructor and a program written in higher-level source code, and produces the attack payload. The source code is written in a dedicated C-like language. In this stage, the compiler chains the gadgets generated by the constructor to implement the semantics given in the source code.
3. Loader. The compiler generates an exploit containing addresses relative to the program image. The loader adjusts the offsets of these addresses and resolves them into absolute addresses.
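The loader's relocation step can be sketched in Python. The image base and offsets are hypothetical; the sketch only adds the base to each relative offset and serializes the result as little-endian 32-bit words, as they would appear on the stack.

```python
import struct


def resolve(relative_chain, image_base):
    """Turn image-relative gadget offsets into absolute 32-bit addresses."""
    return [(image_base + offset) & 0xFFFFFFFF for offset in relative_chain]


def pack_payload(addresses):
    """Serialize addresses as little-endian 32-bit words, as laid out on the stack."""
    return struct.pack(f"<{len(addresses)}I", *addresses)
```

For instance, offsets 0x10 and 0x20 against a base of 0x400000 become 0x400010 and 0x400020, serialized as the bytes 10 00 40 00 20 00 40 00.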
The system was shown to automatically generate exploits across different Windows kernels, with good runtime performance and rather small payloads. However, this technique is by no means a comprehensive solution for automating ROP attacks. Its most considerable drawback is the search for single-instruction gadgets. Because of its sheer size, the binary code of a typical OS kernel is very likely to contain a huge set of single-instruction gadgets, so it is possible to match all the useful instructions defined in the context of that work. Nevertheless, this abundance can only be expected in OS kernels and potentially in common libraries. On the other hand, the binary code of some widely used applications does not provide even the basic set of arithmetic, bitwise or logical single-instruction gadgets. Looking closely at the
implementation details of the constructor, we note that the working register set is restricted to three registers only, namely eax, ecx and edx. This significantly reduces the flexibility of the gadget search and gadget chaining, as it is very hard to find gadgets operating exclusively on the specified registers when the system is used outside the OS kernel.
4 Automatic Return Oriented Programming
Chapter 3 gives an overview of the available techniques and approaches for the automatic generation of code reuse attacks. Each of these techniques focuses on a set of common libraries or operating system kernels. The advantage of creating attacks on the large chunks of machine code found in kernels and common libraries is the resulting rich set of gadgets. Therefore, most of those techniques rely on predefined sets of template instructions and registers, looking for single-instruction gadgets and assuming that the common library or operating system kernel is likely to contain all the required gadgets. However, this approach does not apply to the reduced sets of gadgets often found in small libraries and stand-alone executables.
Our focus in this work is on stand-alone executables. As previously discussed, the traditional way of crafting a ROP attack is to invoke mprotect to disable the W⊕X protection and inject arbitrary code, or to invoke execve to start a remote shell on the target machine. Although these functions are available across different operating systems and distributions, a stand-alone binary does not necessarily need to be linked against system libraries. Therefore, exploiting a vulnerability requires making use of the gadgets already available in the binary. To make the best of the available gadgets, we observe:
1. If there is no gadget available that performs a particular logical or arithmetic operation, we might be able to find a suitable substitution.
2. If there is no gadget and no substitution, then that particular operation will not be available in the ROP attack.
3. If there are only a few gadgets available, then gadget chaining requires moving data between registers, such that the result of one operation can be used as an input to the next operation.
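The first two observations can be expressed as a lookup with fallback in Python. The tuple representation and the substitution table are hypothetical; a real system would derive substitutes from the semantic classification of the extracted gadgets.

```python
# Hypothetical substitution table over (mnemonic, destination, operand) tuples.
SUBSTITUTES = {
    "inc": lambda reg: [("add", reg, 1), ("sub", reg, -1)],
    "dec": lambda reg: [("sub", reg, 1), ("add", reg, -1)],
}


def find_operation(mnemonic, reg, available):
    """Return a gadget implementing `mnemonic` on `reg`, a substitute, or None."""
    candidates = [(mnemonic, reg)]
    candidates += SUBSTITUTES.get(mnemonic, lambda r: [])(reg)
    for candidate in candidates:
        if candidate in available:
            return candidate
    return None  # second observation: the operation is unavailable in the attack


available = {("add", "eax", 1), ("inc", "ecx")}
```

Here inc eax falls back to add eax, 1, inc ecx is found directly, and inc ebx has neither a gadget nor a substitute.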
The first observation suggests that if a gadget with inc functionality is not available in the executable, it might be possible to substitute it with a gadget with add functionality. However, if neither inc nor add gadgets are available in the executable, then this operation will be unavailable in the attack, according to the second observation. To illustrate the third observation, we assume that an executable is given containing the gadgets in Figure 4.1. We would like to use these three gadgets to calculate:
edx = edx - ebp - ecx
0xb9268970: add ebp, ecx; ret
0x42295cee: sub eax, edx; inc [esi+0x5D5B]; ret
0x42295cee: mov ecx, eax; ret
Figure 4.1: Available gadgets
Although the simplest implementation of this calculation is sub ebp, edx; sub ecx, edx, those gadgets are not available in the executable. Thus, the sum of ebp and ecx is calculated first, and then the result is subtracted from edx. However, to do this, we need to pass the result of the first computation as input to the next gadget using a mov instruction gadget (Figure 4.2).
(1) 0xb9268970: add ebp, ecx; ret
(3) 0x42295cee: sub eax, edx; inc [esi+0x5D5B]; ret
(2) 0x42295cee: mov ecx, eax; ret
Figure 4.2: Simple gadget chaining
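The chain in Figure 4.2 can be checked with a small register-level simulation in Python. The model reads each gadget's operands in (source, destination) order, which is the order under which the chain computes the stated result, and assumes the side-effecting inc operates on a 32-bit memory word; the register values are arbitrary.

```python
MASK32 = 0xFFFFFFFF


def run_chain(regs, mem):
    """Execute the chain of Figure 4.2, reading operands as (source, destination)."""
    regs, mem = dict(regs), dict(mem)
    # (1) 0xb9268970: add ebp, ecx; ret  ->  ecx = ecx + ebp
    regs["ecx"] = (regs["ecx"] + regs["ebp"]) & MASK32
    # (2) mov ecx, eax; ret  ->  eax = ecx
    regs["eax"] = regs["ecx"]
    # (3) sub eax, edx; inc [esi+0x5D5B]; ret  ->  edx = edx - eax ...
    regs["edx"] = (regs["edx"] - regs["eax"]) & MASK32
    # ... plus the side effect on memory (assumed to be a 32-bit increment)
    addr = (regs["esi"] + 0x5D5B) & MASK32
    mem[addr] = (mem.get(addr, 0) + 1) & MASK32
    return regs, mem


regs, mem = run_chain({"eax": 0, "ecx": 3, "edx": 100, "ebp": 7, "esi": 0x1000}, {})
```

Besides the intended result in edx, the chain clobbers ecx and eax and increments a memory word at esi+0x5D5B, which is exactly the kind of side effect that must be taken into account (in a real attack, esi would have to point into writable memory).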
There are eight 32-bit general-purpose CPU registers. Since each operand of a two-operand instruction can be any of the 8 registers, there are 8 × 8 = 64 possible register combinations for every computation (arithmetic, logical, etc.). The number increases significantly when one of the operands is a memory operand, taking into consideration the base register, displacement value, etc. To be able to chain as many gadgets as possible, we need to know the data transfer relations between each pair of registers in the system, as well as memory. These data transfer relations also depend on the gadgets available in the target executable. However, a reduced set of gadgets contains significantly fewer single-instruction gadgets, so gadgets of several instructions must be considered. Consequently, the use of multi-instruction gadgets can cause unwanted side effects, which must be eliminated or taken into consideration.
In the following sections, we describe a system that automates the creation of ROP attacks in stand-alone executables on the x86 architecture, using Linux as the operating system. The system consists of several phases:
1. In the initial phase, we disassemble the target executable and extract the raw bytes of the available gadgets.
2. The next phase considers the side effects imposed by the gadgets and classifies the extracted gadgets according to their semantics.
3. We generate the register transfer graph to describe the data transfer relations between the CPU registers and memory.
4. Finally, in the last phase, we use the register transfer graph to provide a comprehensive method for gadget chaining.
Being aware of how the CPU registers are interconnected, our system can introduce flexibility in the set of CPU registers used in the ROP attack, avoiding the predefined register sets seen in other systems. In addition, the flexibility in the register set narrows the search for gadgets performing particular operations and contributes towards the automatic creation of ROP attacks by avoiding operation templates.
4.1 Process Image Analysis
The first step towards automating the gadget search is extracting the machine code of the target program. This can be done by disassembling the executable file of the program, or by disassembling the binary code of the program once it is loaded into memory. Linux (like many other Unix-like operating systems) provides a mechanism
that describes each region of contiguous virtual memory in a process or thread. Owing to the simplicity of this mechanism (reading /proc/[pid]/maps), we disassemble the program once it is loaded into memory.
Figure 4.3: Memory layout of a Linux process (regions shown: text segment, data segment, BSS segment, heap, memory mapping and stack, with address 0x00000000 marked)
When a Linux process is loaded into memory, the OS creates different sections containing executable code and static or dynamic data (illustrated in Figure 4.3). It randomizes the stack, heap, and shared libraries, but not the program image (text, data and bss segments) [8]. Therefore, the address offsets of the gadgets found in the text section remain unchanged even when ASLR is enabled. On the other hand, when ASLR is disabled, the shared libraries are always mapped to the same regions, so the library gadgets have constant address offsets on every program invocation.
Programs can be explicitly compiled as position-independent executables (PIEs) and loaded at different positions in memory, and many third-party applications (including Mozilla Firefox) are deployed as PIEs (having the program logic wrapped in a shared library). However, modern distributions compile only a selected group of programs as PIEs, because doing so introduces a performance overhead at runtime.
In our implementation, we rely on the text segment remaining unchanged and search
for gadgets in this section. For systems where ASLR is disabled, the implementation
takes all executable sections into consideration. The mapped regions of a process are
listed by reading /proc/[pid]/maps; for example, the regions of the init process are
obtained by reading /proc/1/maps.
Using process tracing (ptrace), we control the execution of the target program and
inspect its internal state. The PT_READ_I request allows reading from any mapped
region of the running process. In this context, the text segment is simply the
/sbin/init region marked as r-xp. Once the content of this section is read, the data is
passed to the gadget search algorithm. On systems where ASLR is disabled, all
sections marked as r-xp are taken into consideration.
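The region selection described above can be sketched as follows. This is a minimal Python sketch, assuming the standard /proc/[pid]/maps line layout (address range, permissions, offset, device, inode and an optional path); the names parse_maps and gadget_regions are hypothetical, and reading the selected bytes through ptrace (PT_READ_I) is not shown.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int
    end: int
    perms: str
    path: str


def parse_maps(text):
    """Parse the contents of /proc/[pid]/maps into Region records."""
    regions = []
    for line in text.splitlines():
        # address perms offset dev inode [path]; the path may be absent
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) == 6 else ""
        regions.append(Region(start, end, fields[1], path))
    return regions


def gadget_regions(regions, binary_path, aslr_enabled):
    """Select the executable (r-xp) regions to scan for gadgets.

    With ASLR enabled only the program's own text segment has a fixed
    address; with ASLR disabled every executable mapping is usable.
    """
    executable = [r for r in regions if r.perms == "r-xp"]
    if aslr_enabled:
        return [r for r in executable if r.path == binary_path]
    return executable


# Usage on a live process (Linux only, may require privileges):
# with open("/proc/1/maps") as f:
#     regions = gadget_regions(parse_maps(f.read()), "/sbin/init", True)
```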
4.2 Gadget Locations
The gadget search algorithm operates on chunks of binary data. The algorithm locates
c3 and c2 bytes (the opcodes of ret and ret imm16), as well as pop reg; jmp reg
instruction pairs, and backtracks up to a user-defined threshold of bytes to find valid
x86 instructions. The pseudo-code of the search algorithm is given below:
Algorithm 1 Gadget Search
Input: process segment as binary stream seg
Output: list of gadgets
1: gadgets ← ∅
2: for byte pos in seg do
3:   if is_valid_suffix(pos) then
4:     for i := 1 to threshold do
5:       if x86_disasm(pos − i, pos) then
6:         add the gadget into gadgets
7:       end if
8:     end for
9:   end if
10: end for
11: return gadgets
A threshold is necessary because otherwise the accumulated side effects of the
encountered instructions would make it infeasible to chain the gadgets. The
implementation depends on the libdisasm library. In this context, x86_disasm, used on
line 5 of Algorithm 1, provides basic disassembly of Intel x86 instructions from a
binary stream.
Each gadget obtained in the list has a unique address. However, two or more gadgets
might perform identical instructions. To avoid repetitive gadgets, the machine code of
each gadget is hashed, and a test for existence is performed before the gadget is added.
Algorithm 2 is_valid_suffix
1: if pos[0] = 0xc3 or pos[0] = 0xc2 then
2:   return true
3: end if
4: if (pos[0] & 0xf8) = 0x58 and pos[1] = 0xff and (pos[2] & 0xf8) = 0xe0 then
5:   if (pos[0] & 7) = (pos[2] & 7) then
6:     return true
7:   end if
8: end if
9: return false
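A minimal Python sketch of Algorithms 1 and 2, together with the hash-based deduplication, follows. It is not the thesis implementation: libdisasm's x86_disasm is replaced by a hypothetical toy decoder (toy_disasm) that only accepts a few one-byte IA-32 opcodes, and all function names are illustrative.

```python
import hashlib

# Hypothetical stand-in for libdisasm's x86_disasm: it only "decodes" a few
# one-byte IA-32 opcodes (inc/dec/push/pop r32 and nop), which is enough to
# illustrate the backtracking search.
ONE_BYTE_OPCODES = set(range(0x40, 0x60)) | {0x90}


def toy_disasm(code):
    """Return True if `code` decodes entirely into known instructions."""
    return len(code) > 0 and all(b in ONE_BYTE_OPCODES for b in code)


def suffix_length(seg, pos):
    """Algorithm 2: length of the free branch at seg[pos], or 0 if none."""
    if seg[pos] == 0xC3:                             # ret
        return 1
    if seg[pos] == 0xC2 and pos + 2 < len(seg):      # ret imm16
        return 3
    if pos + 2 < len(seg):
        p0, p1, p2 = seg[pos], seg[pos + 1], seg[pos + 2]
        # pop reg; jmp reg with the same register (ModRM byte 0xE0 | reg)
        if ((p0 & 0xF8) == 0x58 and p1 == 0xFF
                and (p2 & 0xF8) == 0xE0 and (p0 & 7) == (p2 & 7)):
            return 3
    return 0


def find_gadgets(seg, base, threshold=4, disasm=toy_disasm):
    """Algorithm 1 plus deduplication of identical machine code by hash."""
    gadgets, seen = [], set()
    for pos in range(len(seg)):
        n = suffix_length(seg, pos)
        if n == 0:
            continue
        # backtrack up to `threshold` bytes before the free branch
        for i in range(1, min(threshold, pos) + 1):
            if disasm(seg[pos - i:pos]):
                code = bytes(seg[pos - i:pos + n])
                digest = hashlib.sha1(code).digest()
                if digest not in seen:
                    seen.add(digest)
                    gadgets.append((base + pos - i, code))
    return gadgets
```

For example, in the bytes 90 58 c3 00 58 c3 loaded at 0x1000, the search reports pop eax; ret at 0x1001 and nop; pop eax; ret at 0x1000, while the second pop eax; ret at 0x1004 is dropped as a duplicate.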
The threshold defined in the gadget search algorithm can span the bytes of several
instructions, and since the search algorithm backtracks from a free branch instruction,
it very often happens that one gadget is contained in another, such that they differ only
in the first few instructions. Therefore, in order to use every possible gadget, each
gadget is treated as a single-instruction gadget: only the first instruction is taken into
consideration, regardless of the number of instructions that follow it up to the free
branch instruction. However, treating each gadget as a single-instruction gadget is only
feasible if the side effects of the other instructions are taken into consideration.
4.3 Considering Gadget Side-Effects
In order to eliminate the side effects of each gadget, we must determine the change
of state that each instruction of the gadget causes. Consider the following example
(taken from Section 2.1.2):
0xB7F9E479: mov %edi, %edx; incl 0x5D5B14C4(%ebx); ret
In order to use the gadget above as a single-instruction gadget, considering only
mov %edi, %edx, we must ensure that incl 0x5D5B14C4(%ebx) increments the value at a
known writable location. Assuming that location 0x00000001 is writable and does not
hold data used in the ROP attack, we can only use the gadget if ebx is initialized to
0x00000001 - 0x5D5B14C4 = -0x5D5B14C3. We can resolve this by using a pop gadget
to initialize ebx with this value.
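The value loaded by the pop gadget follows from 32-bit wrap-around arithmetic, as the short sketch below shows (base_for_target is a hypothetical helper name):

```python
MASK32 = 0xFFFFFFFF


def base_for_target(displacement, target):
    """Value a base register must hold so that disp(%reg) addresses `target`.

    32-bit address arithmetic wraps around, so the result is reduced modulo
    2**32 (the same bit pattern as the negative number).
    """
    return (target - displacement) & MASK32


# incl 0x5D5B14C4(%ebx) should increment the writable address 0x00000001
ebx = base_for_target(0x5D5B14C4, 0x00000001)
print(hex(ebx))  # prints 0xa2a4eb3d, i.e. -0x5D5B14C3 as a signed 32-bit value
```

The same computation applies to every memory clobbering instruction: the pop gadget places this value in the base register before the gadget runs.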
To address as many of the known gadget side effects as possible, we focus on the
following instruction types:
1. Memory Clobbering Instructions. These side effects are similar to the example
above. Generally speaking, memory clobbering occurs when a gadget contains an
instruction between the first instruction and the free branch instruction whose
target operand is a memory location. The memory location is referenced through
CPU registers. Therefore, the side effects of such instructions can be removed by
inserting pop gadgets that initialize these registers so that the memory dereference
occurs at a known location. Since the data segment has constant offsets (as shown
in Section 4.1), several memory locations in this section can be used as known
clobbered locations. The only exception to this rule is memory clobbering through
a dereference of the esp register, which is addressed in the third category.
2. Register Clobbering Instructions. Similar to the previous case, instructions of this
type have a register as the target operand. In this case, we do not eliminate the
side effects; we only mark the register clobbered by the instruction. In special
cases (mostly occurring in unintended instructions), where the target operand of
the first instruction in the gadget is a register that is clobbered later, the gadget
is disregarded. In any case, if the clobbered register is esp, the gadget is also
disregarded.
3. Stack Modification Instructions. The side effects of this category are the most
critical, since instructions of this type modify the stack, where the logic of the
ROP attack resides. Instructions that decrease the value of the stack pointer
(push, pusha, etc.) cannot be used, because they modify the part of the stack
where the following gadget addresses are placed; gadgets containing them are
therefore disregarded. Gadgets that increase the value of the stack pointer can be
used by placing dummy values on the stack. Note that gadgets containing a pop,
popa, etc. that is not the first instruction also fall into the second category: apart
from placing additional dummy values on the stack, the clobbered register must
be taken into consideration.
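The three categories can be sketched as a single pass over the instructions that follow the first one. This assumes a hypothetical, heavily simplified instruction model (a mnemonic plus its destination operand), and it conservatively disregards gadgets that write through esp, which is an assumption of the sketch rather than a rule stated above.

```python
from dataclasses import dataclass, field

# Hypothetical instruction model: (mnemonic, destination), where destination is
# a register name, a ("mem", base_register, displacement) tuple, or None when
# the instruction writes nothing.
STACK_DECREASING = {"push", "pusha", "pushad", "pushf", "pushfd"}


@dataclass
class SideEffects:
    clobbered: set = field(default_factory=set)       # registers marked as clobbered
    memory_bases: list = field(default_factory=list)  # (register, disp) to aim at a known location
    dummy_words: int = 0                              # filler values to place on the stack


def analyse(gadget):
    """Classify the side effects of the instructions after the first one.

    Returns None when the gadget must be disregarded.
    """
    (_, first_dst), rest = gadget[0], gadget[1:]
    effects = SideEffects()
    for mnemonic, dst in rest:
        if mnemonic in STACK_DECREASING:    # category 3: would overwrite the chain
            return None
        if isinstance(dst, tuple):          # category 1: memory clobbering
            _, base, disp = dst
            if base == "esp":               # stack dereference (assumed unusable)
                return None
            effects.memory_bases.append((base, disp))
        elif dst is not None:               # category 2: register clobbering
            if dst == "esp":
                return None
            effects.clobbered.add(dst)
        if mnemonic == "pop":               # category 3: consumes one stack word
            effects.dummy_words += 1
    if first_dst in effects.clobbered:      # result of the first instruction is lost
        return None
    return effects
```

For the gadget mov %edi, %edx; incl 0x5D5B14C4(%ebx); ret, the sketch reports one memory base, ebx with displacement 0x5D5B14C4, which must be initialized by a pop gadget as described earlier.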
4.4 Gadget Classification
Once each side effect is taken into consideration, the gadget can be classified
according to its first instruction. In our implementation we classify the gadgets into
13 groups:
1. Stack manipulation: pop
2. Data transfer gadgets: mov, xchg, les, lea, etc.
3. Arithmetic gadgets: add, sub, inc, dec, shr, shl, etc.
4. Logic: and, or, xor, not, etc.
5. Control flow manipulation: jmp, call, etc.
6. Interrupts: int, int3, iret, into, bound, etc.
7. Comparison: cmp, test, etc.
8. System calls: in, out, wait, ins, etc.
9. Bit manipulation: btr, etc.
10. Flag manipulation: cld, clc, stc, std, cmc, etc.
11. Floating point unit manipulation: fadd, fsub, fmul, fdiv, etc.
12. String manipulation: movs, cmps, scas, lods, etc.
13. Other: nop
As this work targets stand-alone binaries, even a rudimentary inspection of the lists of
gadgets obtained from stand-alone binaries reveals that the number of gadgets in the
first 6 groups significantly exceeds the number of gadgets in the other groups. This is
because most commonly used binaries do not extensively use floating point
instructions, nor do they call interrupts or even system calls. Consequently, the
implementation in this work completely disregards gadgets in groups 8 to 13.
The most frequently found gadgets are pop reg gadgets, in most cases covering each of
the 8 general-purpose CPU registers. Data transfer gadgets are less frequent. Finally,
almost every stand-alone binary has at least one gadget of each of the arithmetic and
logic types. The gadget classification notably narrows the gadget search space and is
used in the next step, namely the generation of the register transfer graph.
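The classification by first instruction can be sketched as a lookup table. The table below only contains the mnemonics named in the list above (the "etc." entries are not filled in), and classify and bucket are hypothetical helper names.

```python
# Mnemonic table for the 13 groups, abridged to the mnemonics named in the text.
GROUPS = {
    1: {"pop"},
    2: {"mov", "xchg", "les", "lea"},
    3: {"add", "sub", "inc", "dec", "shr", "shl"},
    4: {"and", "or", "xor", "not"},
    5: {"jmp", "call"},
    6: {"int", "int3", "iret", "into", "bound"},
    7: {"cmp", "test"},
    8: {"in", "out", "wait", "ins"},
    9: {"btr"},
    10: {"cld", "clc", "stc", "std", "cmc"},
    11: {"fadd", "fsub", "fmul", "fdiv"},
    12: {"movs", "cmps", "scas", "lods"},
    13: {"nop"},
}
USED_GROUPS = range(1, 8)  # groups 8 to 13 are disregarded


def classify(first_mnemonic):
    """Return the group number of a gadget, keyed on its first instruction."""
    for group, mnemonics in GROUPS.items():
        if first_mnemonic in mnemonics:
            return group
    return None


def bucket(gadgets):
    """Group (address, first_mnemonic) pairs, dropping the disregarded groups."""
    buckets = {g: [] for g in USED_GROUPS}
    for address, mnemonic in gadgets:
        group = classify(mnemonic)
        if group in buckets:
            buckets[group].append(address)
    return buckets
```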
4.5 Building the Register Transfer Graph
The register transfer graph (RTG) is a directed graph representing the data movement
between CPU registers and memory. Each node in the graph represents either a CPU
register or a memory location referenced by a register, with or without a particular
displacement. Each edge in the graph represents a gadget containing an instruction
that passes data from one node to the other. The edge labels represent the instruction
used to transfer the data between the nodes, and the registers listed in brackets are
the registers clobbered as a result of using the gadget represented by the edge.
Figure 4.4: Apache/2.2.17 (Linux/SUSE) register transfer graph (nodes: CPU registers such as eax, ebx, ecx, edx, esi, edi, ebp, esp, al and dl, and memory addresses such as ecx + 0x2be8, ecx + 0x2344, edx + 0x8 and eax + 0x8; edges: data movement, labelled with the gadget instruction and the clobbered registers in brackets)
The register transfer graph incorporates the data transfer, arithmetic, and logical
gadgets as follows:
• Data transfer gadgets. Once xchg and mov gadgets are found, they are inserted
directly into the graph as edges.
• Arithmetic gadgets. When add and sub gadgets are found, they are placed in
the graph only if the target can be initialized with 0. For the sake of simplicity,
only pop gadgets are considered as initialization gadgets. The initialization
gadgets are used prior to the use of the data transfer gadget.
• Logical gadgets. Similar to the previous group, logical gadgets are also used
to transfer data between registers, but only if the target can be initialized. If an
or gadget is used, the target is initialized with 0; if an and gadget is used, the
target is initialized with 0xFFFFFFFF; and if a xor gadget is used, the target is
initialized with 0.
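A minimal sketch of these edge construction rules, assuming a hypothetical gadget tuple format, follows. Two points are choices of the sketch rather than of the text: xchg is inserted in both directions, since it moves data both ways, and sub is omitted, because subtracting from a zero-initialized target yields the negated source rather than a copy.

```python
from collections import defaultdict

# Value the destination must hold beforehand so that `dst = dst OP src`
# leaves a copy of src in dst (None: no initialization needed).
INIT_VALUE = {"mov": None, "xchg": None, "add": 0, "or": 0, "xor": 0, "and": 0xFFFFFFFF}


def build_rtg(gadgets, pop_registers):
    """Build a register transfer graph.

    gadgets: iterable of (mnemonic, src, dst, clobbered) tuples (hypothetical
    format; src/dst are register names or (base, displacement) memory nodes).
    pop_registers: registers for which a `pop reg` gadget exists.
    Returns {src: [(dst, mnemonic, clobbered, init_value)]}.
    """
    rtg = defaultdict(list)
    for mnemonic, src, dst, clobbered in gadgets:
        if mnemonic not in INIT_VALUE:
            continue
        init = INIT_VALUE[mnemonic]
        if init is not None and dst not in pop_registers:
            continue  # the target cannot be initialized with a pop gadget
        rtg[src].append((dst, mnemonic, frozenset(clobbered), init))
        if mnemonic == "xchg":  # xchg moves data in both directions
            rtg[dst].append((src, mnemonic, frozenset(clobbered), init))
    return dict(rtg)
```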
The resulting graph contains every data transfer between two nodes; hence, two nodes
might be connected by more than one edge. In order to optimize the generated graph,
the edges causing redundant side effects must be eliminated. In addition, several
memory nodes may load data into the same CPU register, and vice versa. The number
of these nodes can be reduced to only the most relevant memory nodes. We illustrate
this process in the following two sections.
4.5.1 Register Clobbering Edges
When two nodes in the graph are connected by more than one edge, we look closely at
the register clobbering imposed by the edge gadgets.
Figure 4.5: Edges with different sets of registers (edx and eax connected by mov [ebp, ecx, edi] and mov [ebp, ecx, ebx])
Figure 4.6: Edges having a subset of clobbered registers (edx and eax connected by or [ebp], mov [ebp, ecx] and mov [ebp, ecx, ebx])
To remove the redundant edges, we distinguish three cases:
• Two edges clobber different sets of registers. This case is illustrated in Figure
4.5. The use of each edge generates different side effects; therefore, both edges
are kept in the graph.
• The set of clobbered registers of one edge is a subset of the clobbered registers
of another edge. In this case, the edge with the larger set of clobbered registers
imposes redundant side effects and is therefore disregarded. Note that an edge
with no clobbered registers is just a special case of this rule. In the context of
Figure 4.6, edge "or [ebp]" will be the only edge considered for the data transfers
between edx and eax.
• Two edges have equal sets of clobbered registers. In this case, the data transfer
gadgets (xchg and mov) take precedence over the other gadgets, because they do
not require initialization and therefore perform the data transfer more efficiently.
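The three cases can be sketched as a filter over the parallel edges between one pair of nodes. This is a Python sketch with a hypothetical function name; when two edges that are not data transfer gadgets have equal clobber sets, it simply keeps the first one, a tie-break the text does not specify.

```python
DATA_TRANSFER = {"mov", "xchg"}


def prune_parallel_edges(edges):
    """Drop redundant edges between one pair of nodes.

    edges: list of (mnemonic, clobbered_registers) for the same (src, dst).
    Returns the surviving edges, with clobber sets as frozensets.
    """
    kept = []
    for mnemonic, clobbered in edges:
        clobbered = frozenset(clobbered)
        redundant = False
        survivors = []
        for other_mnemonic, other in kept:
            if other < clobbered:
                redundant = True   # an edge with strictly fewer side effects exists
            elif other == clobbered:
                if mnemonic in DATA_TRANSFER and other_mnemonic not in DATA_TRANSFER:
                    continue       # mov/xchg takes precedence: drop the old edge
                redundant = True   # keep the existing, equally good edge
            elif clobbered < other:
                continue           # the old edge imposes redundant side effects
            survivors.append((other_mnemonic, other))
        kept = survivors if redundant else survivors + [(mnemonic, clobbered)]
    return kept
```

Applied to the three edges of Figure 4.6, only ("or", {ebp}) survives; applied to the two edges of Figure 4.5, both are kept.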
4.5.2 Memory Transfer Nodes
Redundant memory transfer nodes arise when several gadgets are found that transfer
data from memory to a register.
Figure 4.7: Memory transfer nodes transferring data to one register (mov edges into eax from ecx + 0x70, ecx + 0x2344, ecx + 0x2be8 and a further ecx-relative node)
To reduce the number of created memory nodes, we group the memory nodes by the
register through which they are dereferenced. Within each group, only the node with
the lowest absolute displacement is taken into consideration, and the rest are
disregarded. The choice of the lowest displacement is only for convenience, since
accessing the memory value of that node requires initializing the dereferencing
register by subtracting the displacement value.
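The reduction can be sketched as follows, assuming memory nodes are represented as hypothetical (base_register, displacement) pairs:

```python
def reduce_memory_nodes(memory_nodes):
    """Keep one memory node per dereferencing register.

    memory_nodes: iterable of (base_register, displacement) pairs.
    Returns {base_register: displacement} with the lowest |displacement|.
    """
    best = {}
    for base, disp in memory_nodes:
        if base not in best or abs(disp) < abs(best[base]):
            best[base] = disp
    return best
```

For the ecx-relative nodes of Figure 4.7, only ecx + 0x70 is kept.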
4.6 Discovering Register Candidates
Once the complete graph is generated, it captures all data movements between
registers and memory. This makes it possible to chain gadgets by transferring the
result of one gadget to the input of another. Therefore, in order to narrow the search
to gadgets whose execution can be chained, we need to find the set of register
candidates. This set ensures that data can be transferred from each register to every
other register in the set. In terms of graph theory, this set of registers is a strongly
connected component of the graph. Below we provide the pseudo-code of Tarjan's
algorithm, which efficiently calculates the strongly connected components:
Algorithm 3 Tarjan
Input: start vertex v, stack S and graph G = (V, E)
  v.index ← index
  v.lowlink ← index
  index ← index + 1
  S.push(v)
  for (v, w) ∈ E do
    if w.index is undefined then
      Tarjan(w)
      v.lowlink ← min(v.lowlink, w.lowlink)
    else if w ∈ S then
      v.lowlink ← min(v.lowlink, w.index)
    end if
  end for
  if v.lowlink = v.index then
    start a new strongly connected component
    repeat
      w ← S.pop()
      add w to current strongly connected component
    until w = v
    output the current strongly connected component
  end if
Note that every memory node in the graph can be connected to any other memory
node, in either direction. This is because data stored in memory can be accessed by
both kinds of nodes: those transferring data from a register to memory and those
transferring it from memory to a register. We slightly adjust the algorithm to take this
case into consideration. Whenever memory is used to pass data from one register to
another, the data is written to a known location reserved in the data segment of the
process image.
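A Python version of Algorithm 3 with the memory adjustment is sketched below. Merging all memory nodes into one node MEM is one way (assumed here, not necessarily the thesis implementation) to let every memory node reach every other memory node; the register candidates are then taken from the component containing the most registers. Node representations and function names are hypothetical.

```python
MEM = "MEM"  # single node standing in for all memory nodes


def collapse_memory(edges):
    """Adjacency map in which every memory node is merged into MEM.

    Registers are strings; memory nodes are (base_register, displacement)
    tuples. Merging is equivalent to connecting every memory node to every
    other memory node in both directions.
    """
    graph = {}
    for src, dst in edges:
        src = MEM if isinstance(src, tuple) else src
        dst = MEM if isinstance(dst, tuple) else dst
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph


def tarjan_scc(graph):
    """Tarjan's algorithm: return the strongly connected components."""
    index, lowlink, on_stack = {}, {}, set()
    stack, components = [], []

    def strongconnect(v):
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return components


def register_candidates(edges):
    """Registers of the component that contains the most registers."""
    best = set()
    for component in tarjan_scc(collapse_memory(edges)):
        registers = component - {MEM}
        if len(registers) > len(best):
            best = registers
    return best
```

With the edges ecx → (esi, 0), (edi, 8) → edx and edx → ecx, the two memory nodes differ, yet ecx and edx become candidates because data written through esi can be read back through edi.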
4.7 Encapsulation
The biggest challenge in chaining the gadgets is taking every side effect into
consideration and ensuring a deterministic state of the CPU registers, the stack and
memory. As registers get clobbered even by a simple data movement, it is very difficult
to keep the state of the system in the registers. Instead, we propose a method that
maintains the state in main memory. This section shows the real strength of the
register transfer graph, which it uses extensively.
As shown in Section 4.1, the data section has constant offsets, so it can be used as a
place to maintain the state of our system. Once we have found the set of register
candidates, we consider the gadgets operating on memory operands and register
candidates. Each of these gadgets is then encapsulated with gadgets from the RTG, so
that the operation performed by the initial gadget reads its input from memory and
writes the result back to memory. The RTG provides gadgets that:
• Read data from memory, by finding the shortest paths from a memory node to
the source and target operands of the gadget being encapsulated.
• Write data to memory, by finding the shortest path from the target operand of
the gadget being encapsulated to a memory node.
The encapsulation process provides a mechanism for creating virtual registers that
reside in the data section. Since the gadgets used only clobber CPU registers and
reserved addresses in the data section, each use of an encapsulated gadget only
modifies the values of the virtual registers. Furthermore, encapsulation requires only
one available gadget per instruction, which can then be encapsulated to work with any
memory location. We call the encapsulated gadgets virtual instructions.
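The read and write paths can be found with a breadth-first search, since each edge corresponds to one gadget. This is a simplified sketch with hypothetical names and data layout (an adjacency map from node to (neighbour, gadget_address) pairs): it minimizes the number of gadgets and ignores clobbered registers, pop gadgets and initialization values, all of which a real implementation must take into account.

```python
from collections import deque


def shortest_path(rtg, start, goal):
    """BFS over the RTG; returns the gadget addresses along the path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while previous[node] is not None:
                node, address = previous[node]
                path.append(address)
            return path[::-1]
        for neighbour, address in rtg.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = (node, address)
                queue.append(neighbour)
    return None


def shortest_of(paths):
    """Shortest path among those found, or None."""
    found = [p for p in paths if p is not None]
    return min(found, key=len) if found else None


def encapsulate(rtg, memory_nodes, sources, target, gadget_address):
    """Gadget addresses of a virtual instruction: reads, gadget, write."""
    chain = []
    for operand in sources:  # read: memory node -> operand register
        read = shortest_of(shortest_path(rtg, m, operand) for m in memory_nodes)
        if read is None:
            return None
        chain += read
    write = shortest_of(shortest_path(rtg, target, m) for m in memory_nodes)
    if write is None:
        return None
    return chain + [gadget_address] + write
```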
To illustrate gadget encapsulation, we assume that the target application is Pidgin
and that the RTG has already been generated, as shown in Figure 4.8. Furthermore, we
assume that an addition should be performed on two numbers available in the data
section (2 and 3), with the result stored at another location. Finally, we assume that
only one add gadget is available:
0xb92689782: add %ebp, %ebx; ret;
Figure 4.8: Initial state (the Pidgin RTG over eax, ebx, ecx, edx, esi, edi, ebp, esp, al and cl, the add gadget at 0xb92689782, and the data segment holding the values 2 and 3)
Figure 4.9 shows how the operands ebp and ebx are loaded through the RTG by reading
data from memory referenced by eax.
Figure 4.9: Read from memory to ebp and ebx
Once the data is loaded into the gadget operands, the add gadget can be executed. The
result stored in ebx is then transferred back to memory, using the RTG and referencing
ecx (Figure 4.10).
The product of the encapsulation is the virtual instruction. It is represented as an
interleaved list of gadget addresses and values, and consists of three parts: the
addresses of the gadgets corresponding to the edges along the shortest paths from a
memory node to the source operands of the encapsulated gadget; the address of the
encapsulated gadget; and the addresses of the gadgets corresponding to the edges
along the shortest path from the target operand of the encapsulated gadget to a
memory node.
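The interleaved list can be serialized into 32-bit stack words as in the sketch below. It assumes that the values consumed by a gadget (pop operands or dummy filler words) are placed directly after that gadget's address, which matches how a ret-terminated gadget finds its operands on the stack; the function name and all addresses in the test are hypothetical.

```python
import struct


def serialize(virtual_instruction):
    """Pack an interleaved list of gadget addresses and stack values.

    virtual_instruction: list of (gadget_address, [values consumed by the gadget]).
    Each 32-bit word is stored little-endian, as on x86.
    """
    words = []
    for address, values in virtual_instruction:
        words.append(address)
        words.extend(values)  # e.g. pop operands or dummy filler words
    return b"".join(struct.pack("<I", w & 0xFFFFFFFF) for w in words)
```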
Figure 4.10: Write the result back to memory (the data segment now holds 2, 3 and the result 5)
5 Evaluation
To evaluate our approach for automating return oriented programming attacks, we
used OpenSUSE Linux 11.4 (x86) with W⊕X and ASLR enabled. ROP attacks are usually
conducted against vulnerable applications to take control of a computer from another
host (remote exploitation). Therefore, the executables chosen for the evaluation are
primarily popular network applications, available in almost all Linux distributions. A
set of 20 applications, ranging from 10 KB to 50 MB, was compiled.
The first metric investigated is the number of register candidates, i.e. the strongly
connected component cardinality (SCCC) of the RTG. The plot in Figure 5.1 shows that
the number of register candidates increases with the size of the binary. It also
indicates that many popular 32-bit Linux x86 servers and clients already provide a set
of at least 3 available registers, which in many cases is sufficient to build ROP attacks.
Figure 5.1: Number of CPU candidates increases with the size of the binary (available x86 CPU registers, i.e. SCCC from 0 to 8, plotted against binary section size on a logarithmic scale from 10^4.5 to 10^7.5 bytes, for cupsd, yast2, amarok, skype, vlc, acroread, firefox, master, opera, mysqld, smbd, avahi, pidgin, dhclient, filezilla, Xorg, chrome, apache2, rpcbind and sshd)
In Figure 5.2, we compare the set of register candidates with the fixed set of registers
(eax, ecx and edx) used in the generation of ROP-based kernel rootkits [19]. The results
show that the set of register candidates changes across binaries, making it infeasible
to use a fixed set of registers to perform automatic construction of ROP attacks.
Binary      SCCC  FSR     Binary    SCCC  FSR
chrome      8     ✓       yast2     4     ×
acroread    8     ✓       sshd      5     ×
skype       8     ✓       cupsd     3     ×
opera       8     ✓       apache2   3     ×
smbd        8     ✓       avahi     3     ×
mysqld      8     ✓       amarok    0     ×
filezilla   8     ✓       rpcbind   2     ×
Xorg        7     ✓       firefox   1     ×
dhclient    6     ×       master    0     ×
pidgin      5     ×       vlc       1     ×
(SCCC: number of register candidates; FSR: feasible with the fixed set of registers eax, ecx and edx)
Figure 5.2: Register candidates and feasibility with fixed set of registers
A high number of register candidates only suggests that gadget chaining will be
feasible. In order to evaluate the degree of automation of gadget chaining, we evaluate
which logical, arithmetic, comparison and control flow modification gadgets can be
encapsulated, as shown in Figure 5.3.
Instructions checked: add, sub, inc, dec, and, or, xor, not, neg, jmp, test, cmp
Binary      Encapsulated (of 12)  Other
acroread    12                    call, mul, shl, shr, rol, ror
skype       7                     call
opera       12                    call, mul, shl, shr, rol, ror
smbd        12                    call
mysqld      12                    call, mul, shl, shr, rol, ror
filezilla   12                    call
Xorg        11                    call
dhclient    11                    call
pidgin      11                    call
yast2       11                    call, shl, shr, rol, ror
sshd        10                    -
cupsd       10                    call
apache2     9                     call
avahi       8                     call
amarok      0                     -
rpcbind     3                     -
firefox     4                     -
master      1                     -
vlc         2                     -
Figure 5.3: Encapsulation on different instructions
Figure 5.3 indicates, for each binary, which instructions can be encapsulated into a
virtual instruction. The results confirm that almost any binary above 400 KB can
provide a sufficient set of virtual instructions to build a ROP attack.
6 Conclusion
6.1 Contribution
Return Oriented Programming is an interesting and crowded research area, where new
techniques are continuously explored to increase the sophistication and automation of
computer attacks. Chapter 2 gives an overview of the fundamentals of the ROP
technique and its variations, and provides an in-depth practical example of a ROP
attack. We showed that most of the known defence techniques either do not provide
full protection against all ROP variations or are not widely deployed, leaving software
across different architectures vulnerable to this type of attack.
Our contribution is the development of techniques to automate the creation of ROP
attacks within stand-alone binaries on Linux (x86) systems. The related work (Chapter
3) indicates that most existing techniques target shared libraries or operating system
kernels, where a large number of gadgets is available. In Chapter 4 we discussed why
this approach is not applicable to stand-alone binaries, as a result of the reduced set
of gadgets available in the machine code of stand-alone executables. Therefore, we
proposed methods to automatically generate virtual instructions by encapsulating
gadgets with extra instructions, such that they operate on memory locations instead
of registers or the stack. We carefully took into consideration each side effect caused
by the use of the available gadgets having single or multiple instructions,
guaranteeing that the generated virtual instructions handle all potential side effects.
We built the encapsulation process on a register transfer graph, which is used to
analyse the data movements between the CPU registers and memory, and to ensure
that the user is able to read from memory, perform an arbitrary operation and write
the result back to memory. Finally, in Chapter 5 we showed that this approach is able
to encapsulate most of the arithmetic and logic instructions, as well as control flow
and comparison instructions, which in most cases is sufficient to create a ROP attack.
6.2 Future Work
To give a short insight into our future plans, we divide our ideas into two parts:
points that will be our focus in the short run, and goals that we aim to investigate in
the long run.
In the short run:
• Currently the system lacks a proper mechanism to provide automatic condi-
tional branching. This is due to the fact that conditional branching is usually
done by exploiting the CF flag, to provide comparison of the type a < b. The
system should be extended to support gadget chaining to perform conditional
jumping and automatic calculation of the jump offsets when virtual instructions
are used.
In the long run:
• While most of the related work concentrates on providing a Turing complete
set of instructions, our approach tries to encapsulate as many operations as
possible, without explicitly considering Turing completeness. Since the result of
this method is a set of virtual instructions and virtual registers, it offers the
flexibility to incorporate different models of Turing completeness, abstracted at
the level of virtual instructions. This is especially useful when a particular virtual
instruction is not available, since the system can still ensure Turing completeness
by choosing a different or less complicated model.
• Being able to provide virtual instructions and virtual registers (located in
memory), the system could potentially also implement a virtual stack, as well as
wrappers for simplified function calls in the ROP attack.
• Finally, the ultimate use case of this work would be a compiler able to create
fully automatic ROP attacks from arbitrary code written in a dedicated language.
This compiler would ideally use the virtual instructions, registers and stack as an
assembly abstraction to compile the dedicated language into an attack payload.
Bibliography
[1] H. Shacham, “The geometry of innocent flesh on the bone: return-into-libc
without function calls (on the x86),” in ACM Conference on Computer and Com-
munications Security, pp. 552–561, 2007.
[2] E. Buchanan, R. Roemer, H. Shacham, and S. Savage, “When good instructions
go bad: generalizing return-oriented programming to RISC,” in ACM Conference
on Computer and Communications Security, pp. 27–38, 2008.
[3] T. Kornau, “Return Oriented Programming for the ARM Architecture,” Master’s
thesis, University Ruhr Bochum, December 2009.
[4] A. Francillon, D. Perito, and C. Castelluccia, “Defending embedded systems
against control flow attacks,” in Proceedings of the first ACM workshop on Secure
execution of untrusted code, SecuCode ’09, (New York, NY, USA), pp. 19–26, ACM,