OPTIMIZING WEB APPLICATION FUZZING WITH GENETIC ALGORITHMS AND LANGUAGE THEORY
BY
SCOTT MICHAEL SEAL
A Thesis Submitted to the Graduate Faculty of
WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES
in Partial Fulfillment of the Requirements
for the Degree of
MASTER OF SCIENCE
Computer Science
May 2016
Winston-Salem, North Carolina
Approved By:
Errin Fulp, Ph.D., Advisor
William Turkett, Ph.D., Chair
David John, Ph.D.
Acknowledgments
There are many people who helped to make this thesis possible, and they deserve thanks and gratitude I could never fully provide. First and foremost, thanks to my family for being supportive over the past—what should have been two but then became three—years. To Errin Fulp, whose patience in the beginning, middle and end of my academic career kept me afloat, who sparked my interest in computer security, who set me up with career and academic opportunities I did not deserve, who bailed me out of moderately serious (albeit laughable, and ridiculous) trouble...thank you. This research would have never happened if your door were not always open. Thanks to the Wake Forest Computer Science Department, in (no) particular (order): Jennifer Burg, Daniel Canas, Sam Cho, Don Gage, David John, Paul Pauca, Stan Thomas, and William Turkett. Thanks to Todd Torgersen for patiently teaching me things I was already supposed to know, and for spending his free time helping me flesh out the ideas that took this research from “eh” to worthwhile. Finally, a special personal thank-you to Sarah Reehl for putting up with my complaining, reading numerous iterations of this document, and for not supporting my near-daily urge to leave this work unfinished.
Abstract
The widespread availability and use of computing and internet resources require software developers to implement secure development standards and rigorous testing to prevent vulnerabilities. Due to human fallibility, programming errors and logical inconsistencies abound—thus, conventions for testing software are required to ensure Confidentiality, Integrity, and Availability of sensitive user data. A combination of manual inspection and automated analysis of programs is necessary to achieve this goal. Because of the massive size of many codebases, especially considering the incorporation of third-party software and infrastructure, thorough manual code review by security experts is not always an option. Therefore, effective automated methods for testing software systems are essential.
Fuzz testing is a popular technique for automating the discovery of bugs and security errors in software systems ranging from UNIX utilities to web applications. Although mutation and generation-based fuzzing have been in use for many years, fuzzers that intelligently manage test case generation are actively being researched. In particular, optimally testing web applications with limited feedback remains elusive. This research presents a use of Evolutionary Algorithms to generate test cases which expose vulnerabilities in web applications. This thesis utilizes grammatically analyzed positive examples of injection strings related to a common web vulnerability in order to build a set of attack grammars that guide fitness metrics and test case generation. In lieu of a manually written, exhaustive attack grammar, the set of attack grammars is automatically derived from positive examples. The efficacy of this algorithm is compared to other methods of solution generation, such as Markov Model Monte Carlo. Finally, two types of Evolutionary Algorithms (a Genetic Algorithm with heuristic-based repopulation criteria and CHC) are implemented in the fuzzing framework, and evaluated according to their ability to effectively narrow the search space. The results demonstrate that Evolutionary Algorithms with grammar-based heuristics are able to find solutions that are grammatically similar to a corpus of positive examples, yet still unique.
Chapter 1: Introduction
As computing technologies broaden their reach to individual consumers, commu-
nities, and corporate industries, providing security and privacy of those resources
becomes an important priority. Because of the expansive reach of internet services,
and the growing interconnectivity of our world, users expect software companies to
provide Confidentiality, Integrity, and Availability of sensitive data. According to a
US census report on computer and internet use in the United States, 2013 saw 83.8
percent of American households with ownership of a personal computer, and 73.5
percent with a high speed internet connection [23]. The International Telecommuni-
cations Union, an agency under the supervision of the United Nations, found that in
2015, 3.2 billion people are connected to the internet—a 600 percent increase since
the year 2000 [28]. With the advent of cloud computing infrastructures and browser
based applications replacing traditional desktop applications, security in the Internet
application sphere is a paramount concern of a corporation’s security posture. Thus,
the development of strategies for discovering errors and vulnerabilities in software is
an open area of both academic and industry research.
Computer security concerns have long been a part of the equation in delivering
useful and reliable software systems to end users. The first class of vulnerabilities
to emerge in consumer software systems were related to memory corruption bugs
in programs, such as Buffer/Heap Overflows and format-string vulnerabilities [22].
These vulnerabilities were the result of logical programming errors that allowed an
attacker to arbitrarily write to or read from memory which should be unavailable,
potentially leading to a full compromise of the system [14]. Despite the severity of such
bugs, early on, the impact of these errors was not substantial to end users. However,
as time progressed, and users began to trust software companies with private and
commercial data, those vulnerabilities proved very damaging, and forced developers
to adopt secure coding practices.
The modern landscape of cybersecurity involves many of the same vulnerability
pitfalls of the past 30 years, as well as a litany of new attack surfaces due to the
ubiquity of Internet activity and mobile devices. Remote code execution vulner-
abilities in web services, such as SQL injection and Cross Site Scripting (XSS), have
exploded in frequency in the past decade, and have dire implications for users and cor-
porations. Figure 1.1 shows the top four most common threats to web applications as
determined by the Open Web Application Security Project (OWASP) [41]. Because
of the widespread availability of software services, and the integration of third-party
tools and libraries, it becomes a nontrivial problem to manage the security posture of
a given application. In order to maintain a tenable consumer market base, software
companies must devote resources and manpower to develop standard procedures for
securing software. One such standard, the Microsoft Secure Development Life-cycle,
shown in Figure 1.2, demonstrates that security awareness is required in all stages
of the development process—from training developers and technicians to the design,
implementation, and maintenance of production software [8].
Software companies and service providers are responsible for rigorously uncov-
ering and fixing bugs in software, and rely on a variety of tools and techniques to
accomplish this task. Testing software for errors is vital to the software development
life-cycle. Techniques such as manual code review and static analysis provide some
insight into the behavior of an application. Unfortunately, they are insufficient when
codebases become too large, or if visibility into source code or program internals is
limited. Automated testing seeks to fill this void by quickly testing input vectors of
applications and monitoring their handling of that input. This process faces some
Figure 1.1: OWASP Top 10 web vulnerabilities shows the frequency of web-based injection attacks, and the importance of defending against them
unique challenges—first, it must seek out all the possible branches of execution (cov-
erage). Second, it should also be equipped to determine if a program reaches an
unsafe state (or crashes). Third, it should develop a process for crafting the input in
an intelligent way to stress test input vectors without exhausting too much time.
1.1 Fuzz Testing for Vulnerability Discovery
One of the most effective strategies for auditing software services for vulnerabilities
is fuzz testing. Fuzzing is a technique for automatically generating crafted input,
sending it to an application, and monitoring the behavior of that application in order
to ascertain if a given input causes undefined (or nefarious) behavior [44]. The tech-
nique was developed by Barton Miller et al. at the University of Wisconsin. Miller’s
research team developed programs that crafted randomized input to test how com-
mand line utilities commonly found in UNIX-based operating systems would react.
They were able to uncover errors in 25-33 percent of UNIX utilities [47], ushering in
a new paradigm in software testing.
Figure 1.2: The steps of the Microsoft Secure Development Life-cycle
From its inception in the late 1980s as a technique for testing UNIX utilities for
defects [47], fuzz testing has become an established technique for discovering bugs
and security vulnerabilities in software. Fuzzers have been able to uncover serious
vulnerabilities in everything from file parsers and language interpreters to network
protocols and binaries [44]. Different fuzzing strategies have advantages based on the
target in question—mutation-based fuzzing, in which various portions of a valid
test case are mutated (such as bit flipping for binary data), can uncover different sorts
of software errors than Generation-based fuzzing types, which create new test cases
based on a model of expected input. Although simple fuzzing strategies still uncover
software bugs today, the advent of modern security measures for various vulnerabil-
ity classes has forced researchers to abandon simplistic fuzzing strategies in favor of
intelligent systems that optimally traverse the input search space. Modern fuzzing
strategies involve taking steps to reverse engineer the target, in order to discover
possible execution paths [42]. This situation is not always tenable—because of the
feedback limitations of black-box fuzzing for targets such as web applications, alter-
native instrumentation techniques are required. Because the search space involved in
fuzz testing is vast and often difficult to define, guided search using heuristics is a
reasonable alternative.
1.2 Fuzzing with Evolutionary Algorithms
This thesis explores a fuzzing strategy that addresses this problem using Evolutionary
Algorithms, guided by fitness metrics based on the lexical and semantic structures
represented in a corpus of positive examples. Classical fuzzing methods seek to gen-
erate fuzz data that targets boundary and format assumptions of input parsed by a
target program. For this problem, the fundamental structures of known-bad attack
strings are analyzed, and their components are used for the creation of new test cases.
Evolutionary Algorithms are a good fit for optimizing fuzzing, because they can be
easily incorporated in the process of input generation, and aid in the reduction of the
search space. Scoring candidate solutions based on semantic structure gives an Evo-
lutionary Algorithm enough structure to avoid perpetuating nonsensical candidate
solutions while allowing for freedom to discover new payloads that uncover software
vulnerabilities. The goals and advantages of the framework developed by this thesis
include:
1. An intelligent reduction of search space and combinatoric complexity
Enumerating all possible permutations representing positive examples for even
reasonably sized EA candidates quickly becomes infeasible. Grouping n-tuples
of symbols into production rules of a grammar that accept candidates of a given
type allows for more intuitive evaluation of evolved payloads.
2. The generation of unique exploit candidate solutions This system uses
an Evolutionary Algorithm to generate and select candidates through previously
discovered solutions. Using the searching conventions at the core of Evolution-
ary Algorithms, coupled with the language-based fitness metrics derived from a
corpus, this system attempts to generate payloads that are unique, but seman-
tically similar to positive examples of known bad payloads.
3. Development of a language-theoretic basis for guiding an EA fuzzing
framework The fitness function at the core of the proof of concept fuzzing
framework uses grammars representative of curated positive examples. These
productions approximately describe the entire corpus of known-bad injection
strings. The utilization of these grammars combined with traditional application
response monitoring provides a formal manner by which to verify the fitness of a
given solution, and can be extended to other languages and frameworks capable
of semantic analysis.
The research described in this thesis is significant for several reasons. First, the
generation of semantic-structure groups learned from positive examples will guide the
Evolutionary Algorithm based on lexical tendencies of known exploitative input. Sec-
ond, the approach can use those semantic structure groups to both measure fitness
and generate payloads, creating more intuitive search guidance for the Evolutionary
Algorithm fuzzing framework. Lastly, this research represents a step towards auto-
matically building formally expressible grammars that can generate good candidate
solutions, and can be extended to test other software systems and protocols.
Consider a large scale end-user web application service that utilizes both front-
end browser-based technologies (Javascript) and back-end data stores (e.g., MySQL).
As the size of the code base increases, manual code review may fail to identify even
commonplace errors. Although static analysis and manual code review can reme-
diate the existence of some vulnerabilities, the source code of an application is not
always available. Black-box fuzzing—the term used to describe a testing scenario in
which the source code of an application is unavailable—requires less overhead than
grey/white-box methods (which have some or complete visibility into application in-
ternals, respectively). The key limitation of black-box fuzzing is its lack of source code
knowledge, and limited visibility into application internals. This research intends to
optimize black-box fuzzing by using the learned grammar structures of positive exam-
ples to promote semantically intuitive solutions and suppress non-conforming input
generated by the Evolutionary Algorithm.
The fuzz testing campaign is managed by an Evolutionary Algorithm (EA), a
searching strategy inspired by principles of biological evolution. Evolutionary Algo-
rithms search for better solutions by discovering new candidates through the recom-
bination of “fit” solutions in a population. Each chromosome (a possible solution)
consists of either a sequence of grammar symbols representing an injection string,
or grammar transitions that generate a given injection string. The Evolutionary Al-
gorithm utilizes selection, crossover, and mutation to perpetuate new generations of
chromosomes. The central idea that makes evolutionary algorithms effective is that
chromosomes with higher fitness scores will be more likely to produce offspring (i.e.,
fitter chromosomes are probabilistically more likely to be selected for creating the
next generation). Crossover involves combining two chromosomes in a manner to
produce offspring for the next generation. In the context of our system, chromosomes
will recombine grammar symbol groups or transitions in order to guide searching the
input space based on the semantic information the chromosomes encode. Mutation of
chromosomes is modeled by randomly changing symbols or grammar productions
in a given chromosome. Mutation is a necessary convention for maintaining diversity
within a population to avoid stagnation or convergence on a suboptimal plateau in
the search space.
1.3 Contributions
The research outlined in this document contributes in the following ways:
1. Produces a framework for fuzzing at the application-level using Evolutionary
Algorithms (EA) to evaluate and create input.
2. Explores the utilization of grammar-based fitness evaluation for guiding an Evo-
lutionary Algorithm-based fuzzer.
3. Analyzes the effectiveness of the proof of concept solution compared to other
fuzzing methods, such as brute force, hand-selected payloads, and Markov
Model Monte Carlo input generation.
The thesis proceeds as follows: the second chapter discusses the history and compo-
nents of fuzz testing, various techniques for uncovering vulnerabilities with automated
testing, and the advantages and limitations of those methods. The third chapter de-
scribes Evolutionary Algorithms through discussion of their various manifestations
and use cases. The fourth chapter outlines the incorporation of EAs into fuzzing
frameworks, and the strategies that represent the proof of concept approach to EA-
based application fuzzing. Chapter five explains the testing environment by which
our methods are evaluated, including an analysis of the results. Finally, chapter six
draws conclusions based on the results of our testing, and discusses avenues of future
research.
Chapter 2: Fuzzing
2.1 History of Fuzzing and its Fundamental Components
The process of testing software for logical errors and security bugs has long been a
part of the recommended software development life-cycle [38]. Most nontrivial systems
depend on a vast number of moving parts. From the design and development stages
all the way to production deployment and maintenance, many di↵erent people write
many lines of code, creating a situation in which errors are unavoidable. In order to
alleviate the impact of serious bugs in actively developed software, rigorous testing
of software is vital. For example, regression analysis ensures that new features and
changes made to an existing codebase do not introduce new (or previously observed)
software errors. Static code analysis performs source code checking to detect common
errors that introduce security-related vulnerabilities. Both of these approaches are
necessary weapons in the testing arsenal, but sometimes fail to account for unexpected
input, or detect incorrect implementations of correct processes. Automated input
testing fills this void by sending crafted input to a running example of the program,
monitoring the program’s response for undefined behavior or traversal into an unsafe
or inoperable state. Although dynamic analysis of software can be time consuming
and resource intensive, it allows developers to cover the spectrum of necessary tests
for ensuring software security when used in concert with static and manual analysis.
One such methodology of this testing ilk is referred to as fuzzing.
Fuzz testing was officially formalized in an academic setting by Barton Miller
et al. at the University of Wisconsin-Madison [47]. The idea of sending random
input to popular UNIX utilities was first explored when a thunderstorm tampered
with Miller’s remote connection, sending random character input to his terminal and
crashing the UNIX programs in use. Because the thunderstorm interfered with the
modem used for the remote session, Miller called the idea “fuzzing” [50]. Inspired
by the phenomenon, Dr. Miller designed a lab for the graduate students in his oper-
ating systems course, instructing them to write programs which send random input
to common UNIX utilities and monitor the results. One student’s submission un-
covered parsing vulnerabilities in many command line utilities that were at the time
considered stable. Thereafter, a software testing group formally explored fuzzing
and demonstrated the widespread input handling errors that plagued many UNIX
programs [47]. Today, fuzzing is an indispensable tool for security researchers and
software development teams responsible for testing and quality assurance. In order
to successfully deliver a fuzz testing campaign, fundamental components must be
in place for the creation of input data and monitoring program behavior. General
steps involved in fuzz testing methods are shown in Figure 2.1. Sutton et al., in
their comprehensive text on fuzzing methodologies, stipulate that although fuzzing
methodologies vary widely, following certain guidelines is more likely to guarantee
results [44].
The first step involved in the fuzzing process is to identify a target for the testing
campaign. In the literature, the target in question is often referred to as a “System
Under Test” (SUT) [39]. Although it is fair to say that any software system that
receives and processes user input is fair game, fuzzing is most effective when input
directly affects the program’s state in a measurable way. For example, fuzzing is
less e↵ective for identifying problems in cryptographic methods, because monitoring
vulnerabilities is difficult, as opposed to fuzzing file format parsers, where the effects
of the input (a crafted file) on the application are easy to monitor [44]. Fuzzers
are especially good at identifying implementation problems in parsers—programmers
Figure 2.1: The general steps involved in fuzzing campaigns
often make assumptions that expose applications to risks in edge cases, which lead
to problems in everything from memory corruption to remote code execution of web
apps [43,51]. The best targets for fuzzing require knowledge of the protocol, by what
means input is interpreted, and a reliable way of determining an unsafe state
(more on this later).
The second step in the fuzzing process requires one to identify input channels
of a target application [44]. Identifying the input channels of a target application is
vital to reliably carry out a fuzzing campaign—otherwise, there would be no reason to
conduct fuzz testing in the first place. Although this point is obvious, the more subtle
implication of this step is that when designing a fuzzer, it is vital to have knowledge
of the application’s points of input (which are, in essence, its attack surface). Often-
times, software errors are the result of incomplete or incorrect assumptions about
input data, and fuzzers are an effective technique for testing those errors. To this
point, model-based security testing is often combined with fuzzing, since modeling
the flow of an application reveals its attack surface effectively [20]. Identifying the
set of inputs for a given target also informs the manner in which data generation
is conducted—knowing the protocols involved or the structure of input allows for
intelligent design decisions (e.g., one would fuzz file format parsers in a completely
different way than an input login form for a web page). Furthermore, even with
simple fuzzing techniques such as random mutation, it is vital to preserve fields of
input that an application interprets in order to be effective [44].
In order to make certain that a given testing campaign is properly searching the
input space within a reasonable amount of time and computing resources, the task of
data generation is at the heart of an effective fuzzing strategy. However thorough it
may be, enumerating all possible inputs for a given target would be untenable from
the standpoint of time and computing power. Target information and a knowledge of
the attack surface (determined by the inputs available for a given target application)
guide the components available for data input generation [44]. As an example, for
file format parsers and network protocols, specifications for the format of data will
be agreed upon as standard and (hopefully) documented in an RFC or equivalent
text. Even for proprietary protocols, it is possible to capture traffic or examples and
make assumptions about the underlying structure [44]. It is important to generate
negative test cases that the SUT will evaluate as structurally expected input—Holler
et al., in their framework for fuzzing web browser Javascript interpreters, were able to
uncover numerous bugs in Javascript interpreters by requiring their fuzzer to generate
test cases that were syntactically valid [36]. The efficacy of protocol fuzzers falls in
the same category, as the preservation of packet structure is vital for testing the
underlying processing frameworks instead of being cast aside by basic error checking.
In order to achieve quality test results, the data generation stage must be sensitive
to the protocol at hand. For network protocols in particular, RFC information will
allow a developer to determine the structure of a packet (and a stream of packets). The
various structural components of the protocol will be the places in which to insert
generated data, so it is vital for the fuzzer to comply with the standard set for field
widths and sequential order.
After those steps are completed, the crafted data is sent to the application. Sutton
et al. explain in great detail a variety of input types where fuzz-generated data
should be injected for a given target [44]. For local applications such as desktop
binaries, command-line arguments and environmental variables are prime channels
by which to submit fuzzed data. Remote fuzzing campaigns, such as fuzzing an
FTP session or web applications, involve slightly more setup—fuzzer-generated input
sent over networks is often done by a tool such as scapy or using a combination of
browser emulator and an HTML parsing library [1,12,13]. Effective fuzzing campaigns
require information gathered from target application analysis in order to ensure the
application does not ignore the crafted input because of failure to meet the required
data specifications. In fact, when a feedback loop is limited, as in the case of remote fuzz
testing, it is important to carefully format test cases. In the case of our research, instead
of relying on application feedback to develop properly structured test cases, semantic
structures from positive examples are evaluated and used to intelligently guide test
case generation. The efficiency of this approach is discussed in future chapters.
Once the fuzzer’s crafted input has been sent to the application, it is necessary
to monitor the SUT for its response and ensure that the results are thoroughly recorded
for further analysis. In the ideal environment, a monitoring harness is watching the
execution of the target system, and is alerted when exceptions, crashes, or interrupts
are invoked. In the event that a fuzzer sends a negative test case that causes an
error, for the sake of reproducibility, metadata regarding the program’s state as well
as the crafted input itself should be logged for analysis [44]. The human element
of fuzzing in this stage is involved in analyzing these events to identify whether or
not the root cause of an error has security implications. Although utilities such as
!exploitable are useful for attempting to measure exceptions and crash states according
to their potential exploitability, they are still limited [52]. These methods are adept at
discovering exploit potential in memory corruption errors, but there is a fundamental
limit to what computer programs can determine regarding the security posture of
another system. At this stage, the results of the fuzzing campaign are best left for
human judgment and inspection.
The efficiency of a fuzzer is typically measured according to how well it finds bugs,
and this is directly correlated to a fuzzer’s ability to explore the execution paths of a
target application. This is referred to as “code coverage” and is a common metric for
measuring the success of a fuzzer. Sutton et al. remark that this is a measurement
of the “amount of process state a fuzzer induces a target’s process to reach and
execute” [44]. Fuzzers attempt to find vulnerabilities and crash states according to
mishandling of user input, and the best way to measure if a fuzzer is likely to uncover
such a bug is to measure the percentage of execution paths that are covered. Another
important piece of measuring the efficacy of a fuzz testing campaign or technique
is at the monitoring stage—error detection is vital to determining if a given input
causes a crash via a debugger or other heuristics. Finally, resource constraints are
ever-present, and require the developers of fuzzers to design and implement efficient
code. All of these things together are used as a measuring stick for fuzzing tools and
techniques.
2.2 Present Day Fuzzing
2.2.1 General Purpose Tools and Techniques
Miller et al.’s first paper on testing UNIX utility reliability with random
input demonstrated the simplicity of software testing at that time. Before then, soft-
ware systems were primarily tested for accuracy and efficiency down “happy paths”,
or under execution environments that were expected according to the specifications
of the software itself. Although these testing methodologies are able to determine
whether or not a piece of code performs calculations correctly and efficiently with
expected input, they do nothing to examine the manner in which a program handles
malformed (or malicious) data. The arrival of fuzz testing brought with it a new
mindset on testing approaches, and the responsibility of programs to reliably and
safely handle input.
In concert with Miller et al.’s approach, the first fuzzing tools were concerned
with injecting random data as input to applications, monitoring for exceptions and
crashes [44]. In the early stages, this method was very effective: most software was
not written defensively, or with any sort of security mindset. Therefore, fuzz testing
was adept at triggering errors such as memory access violations because it placed
programmer assumptions under stress through unexpected input. Although today
these methods would be considered elementary, they set the stage for academic and
industry research for intelligent, informed software testing. Mutation-based fuzzing is
the name for the process of taking a valid input sample (one which would be correctly
parsed by the program in question) and mutating it semi-randomly, and using it as
a negative test case against a system under test (SUT). The primary advantage of
this method is speed: minimal setup is required, and because little or no time is
spent modeling the structure of the data, fuzzing campaigns are executed relatively
quickly. File formats and plaintext protocols with easily identifiable field values are
prime targets for mutation fuzzing. One such manifestation of Mutation-based fuzzing
tools is zzuf : this fuzzer intercepts network traffic and file formats and performs “bit
flipping” on program input, which literally is the act of randomly changing a variable’s
bit from “0” to “1” (or vice versa) [18]. The disadvantage of this system is that the
success of this method is contingent upon the quality of the available samples [6].
Because of the chaotic nature of randomly flipping bits of input, the fuzzer runs the
risk of mangling the test case past the point at which the target program will even
accept it. The deficiencies of pure Mutation-based fuzzing prompted researchers to
develop fuzzers that made use of data modeling and other analytical approaches.
The other main category of fuzz testing strategies is Generation-based fuzzers.
Generation-based fuzzers seek to make a model of the data accepted by a target
application, and comply with its specification (or violate it, but intelligently) when
injecting crafted input. Sutton et al. refer to Mutation-based fuzzers as “dumb brute
force”, and Generation-based fuzzers as “intelligent brute force”: although the input
crafting mechanisms for both methods rely on randomly changing input, Generation-
based fuzzers go to great lengths to ensure a test case follows the specification of a data
model [44]. Peach, a popular Generation-based fuzzing tool, requires a user to create
data models for the framework to use as guides for its fuzzing engines [11]. These
“peach pits” are the basis for the input generation phase of the fuzzing campaign.
An example is shown in Figure 2.2. The Peach fuzzing framework requires an XML-
structured description of the data model, as well as type-specific information useful
for fuzz testing campaigns (e.g., field length, expected content type, delimiters, etc.).
Although clearly Generation-based fuzzing requires more work up front to model the
data in question, its efforts are not without results. Miller and Peterson demonstrated
that for modern targets, Generation-based fuzzing was much more efficient, able to
Figure 2.2: An example excerpt of a peach pit used for Generation-based fuzzing [11]
achieve coverage 76 percent better than mutation-based fuzzing of the same targets
[48].
One of the most effective methods of boosting the effectiveness of these algo-
rithms involves boundary checking. Boundary checking is the act of testing values
at the edges of expressible values in order to trigger vulnerabilities such as Integer
Overflows, which lead to access violations of memory and can lead to a full compro-
mise of the target system [44]. When these tests are included with random mutations
and inside fields specified by a data model in Generation-based fuzzing, the likeli-
hood of triggering vulnerabilities increases exponentially. A figure from Sutton et
al. demonstrating common boundaries to test is shown in Figure 2.3. This example
shows a group of interesting values for testing the boundaries of certain data type
widths (MAX32, for example, referring to the maximum value for a 32-bit integer).
Test case generation for classical fuzzing seeks to create values with high potential to
cause problems or expose improper programmer assumptions, and boundary testing
with the values shown in Figure 2.3 is an effective means to this end.
Figure 2.3: Boundary testing recommendations according to Sutton et al. [44]
2.2.2 Modern Fuzzing
Modern fuzzing techniques still follow the same basic techniques described above, but
with more refined metrics for providing intelligent searching of execution paths and
data generation. Reverse engineering binaries and file format parsers to enumerate
their execution paths has recently been a standard method for optimizing fuzzing. For
local binaries, it is possible to attach debuggers to running processes not only to
monitor changes to the target application, but also to make sense of its execution paths [44].
Seagle, in his framework for file format fuzzing, made use of a reverse engineering
framework which found the execution paths of his target and fed that information
back to the fitness function of his Genetic Algorithm [53]. The class of fuzz testing
techniques that attach debuggers to running processes and use that channel to in-
ject test cases is called in-memory fuzzing. In-memory fuzzers are very effective for
their visibility into application internals, and the fact that the possibility exists to
set a breakpoint, save the machine state, execute crafted input, and return to the
previous place, allowing for fine-grained control of executing the program with sent
data and monitoring its response [44]. Vincenzo Iozzo demonstrates that expensive
reverse engineering operations for preprocessing can be avoided by applying function
hooking during the fuzzing process, and pruning execution paths explored through
the combination of real-time debugging and heuristics. By measuring the cyclomatic
complexity of a given function and performing loop detection, fuzzers can use
these heuristics to search program execution paths more efficiently, thereby achieving
more effective code coverage [40]. Other approaches to fuzzing involve measuring the
influence of injected data into program state, known as taint analysis. This method is
simply the process of marking process states where untrusted data has been injected
or evaluated, in order to help fuzzing heuristics make better decisions regarding data
generation and code coverage. Bekrar et al. demonstrate that traditional fuzzing
frameworks informed by metaheuristics and taint analysis allow for more efficient
determination of exploitable bugs [19]. Iozzo also relies on this method for his frame-
work, which allows for the measuring of data’s propagation through the execution of
a program [40].
2.2.3 Web Application Fuzzing
Application level fuzzing for web applications has uncovered bugs ranging from mem-
ory corruption vulnerabilities in underlying system software, to data exfiltration and
session hijacking via SQL injection and Cross-Site Scripting (XSS) [44]. For the pur-
poses of this thesis, a web application or web service (used interchangeably in this
document) is a computer program that is executed on a remote server which responds
to clients that connect via the HTTP protocol. The web application targets of inter-
est in this research are those which receive and process input from a client. Fuzzing
web applications for injection vulnerabilities is intuitive, because the process of iden-
tifying a target’s attack surface is trivially easy. Furthermore, although application
internals are not usually available, much work has been done in curating sets of effec-
tive injection strings for a wide range of web vulnerabilities [49]. Despite its intuitive
nature from the standpoint of target identification and input generation, application-
level fuzzing of web services suffers from a fatal flaw—most of the time, visibility into
a target application’s internals is limited or nonexistent. This means that measuring a
fuzzer based on code coverage requires monitoring capabilities that are not available.
Even still, the amount of work required to set up such a system must force testers and
researchers to question whether or not other testing methods (or manual review) are
more suited to the task at hand. Thus, the monitoring requirement of fuzzing must
be modified in order to determine whether or not a vulnerability has been uncovered.
At the application level, this could simply involve parsing the resulting HTML for a
desired response, such as in Duchene et al. [26].
Most web application fuzzing frameworks contain a crawler that finds webpages
and potentially vulnerable input forms, and a set of known bad payloads that encom-
pass typical exploit vectors, such as directory traversal, SQL injection and Cross-Site
Scripting (XSS) [44]. Tools such as Burp Proxy have the ability to spider an appli-
cation and apply a set of test cases to a given input form [2]. Another widely popular
web application security tool is w3af, a web service scanner that uses a curated set of
well-known injections to test for common web application vulnerabilities [17]. In the
same vein, the now inactive JBroFuzz provided a general purpose GUI-based fuzzing
framework that covered a range of scanning techniques for discovering web applica-
tion vulnerabilities [9]. Most of these tools are adept at covering low-hanging fruit,
and do not provide any exploitation information—they can only determine if a given
entry point is potentially exploitable.
A large majority of research questions explored regarding fuzzing and web appli-
cation targets involve testing a web service for client-side Javascript execution bugs.
Tripp et al., in their research on optimizing Cross-Site Scripting vulnerability test-
ing, parsed the output following the execution of a negative test case, and used the
information to prune their set of test cases, culling a large set of positive examples
based on the response of the web app [57]. Their research shows that the previously sig-
nificant problem of having a limited feedback loop for fuzzing web applications can
be mitigated with heuristics and response analysis that guide input generation. By charac-
terizing their large corpus of Cross-Site Scripting (XSS) examples according to their
tokens, injected payloads that were filtered or otherwise rejected inform the next test
cases attempted. Tripp et al. demonstrate the ability to uncover vulnerabilities by
e�ciently pruning the space of test cases from their original corpus [57].
The technique of using heuristics to generate test cases is very similar to taint
analysis, and is a popular method of information gathering in the monitoring stage of
the fuzzing cycle. Model inference testing in the fuzz testing space attempts
to determine the impact of a negative test case upon a target application. This
information can be utilized as feedback for the intelligent generation of new test cases.
Duchene et al. used taint analysis along with a grammar-based genetic algorithm to
uncover Cross-Site Scripting (XSS) vulnerabilities [26]. Wang et al. used a hidden
Markov model based on Bayesian probability distributions to generate test cases
for uncovering Cross-Site Scripting (XSS) vulnerabilities. Their work theorizes that
injections are the combination of attack vector and payload, making the primary goal
to determine the attack vector necessary for injecting a payload. Similar to Tripp et
al.’s research focus of tree pruning based on application response to injection, Wang
et al. attempt to learn from the target application’s response to crafted input and
probabilistically generate new injection payloads according to a Bayesian probability
distribution [59]. Although the results of their methods contained numerous false
positives, their approach has merit—building a probabilistic model for generating
tokens allows for search space flexibility not offered by Tripp et al. [59]. This covers one
of the main disadvantages of common fuzzing frameworks, which were not historically
designed to intelligently guide the manner in which test cases are generated.
Figure 2.4: A vulnerable input form that can be exploited using SQL injection
An example of a PHP script vulnerable to SQL injection is listed in Figure 2.4,
courtesy of a purposely vulnerable web application made for testing and training
called DVWA [5]. In this case, a PHP script receives the id parameter from a GET
request. The user controlled data in the $id parameter is evaluated on the server
machine as raw SQL, allowing for a malicious user to execute raw commands. This
can lead to data exfiltration, or even complete compromise of the backend machine.
In particular, this case demonstrates a system oblivious to security concerns. Most
modern web applications contain some measure of security, in the form of blacklists or
regular expressions that attempt to filter out and/or detect malicious input. Hansen
and Patterson show the ineffectiveness of using regular languages and pattern-based
filters to defeat malicious input that is by nature context-free [34]. In the spirit of
their development of a language theoretic basis for security, this research attempts
to guide fuzzing based on the lexical structure of positive examples. By focusing on
approximating the “attack language” for a given class of vulnerability, this research
explores the use of language theory to optimize fuzz testing, and move towards a
verifiable language-based reasoning for the e↵ectiveness of certain test cases.
2.3 Fuzzing and Genetic Algorithms
Evolutionary Algorithms are a prime candidate for optimizing fuzz testing by in-
fluencing input creation in an intelligent way. Many researchers have successfully
incorporated Genetic Algorithms into their fuzzing frameworks for targets ranging
from file formats to network services and web applications [26, 42, 56]. Over the past
decade, fuzz testing frameworks which utilize evolutionary algorithms to intelli-
gently guide input generation have seen tremendous success in both academic and
applied settings [32, 53]. Thanks in no small part to the power of modern hardware
and the availability of distributed systems, the complexity concerns that once rendered the
use of genetic algorithms inefficient are no longer prohibitive [53].
Sherri Sparks et al. proved the value of Genetic Algorithms in optimizing solution
space searching for fuzzers in the development of their program called SIDEWINDER
[56]. After disassembling the System Under Test, execution paths of interest are enu-
merated based on whether or not they contain an unsafe function call [56]. Then
subgraphs containing these functions of interest (because of their propensity to be in-
volved in unsafe operations) are separated for further analysis. The next step involves
their Genetic Algorithm—each chromosome encodes production rules of a Context-
Free Grammar (CFG), which are then used in conjunction with probabilities of path
traversal across known problematic subgraphs [56]. At every point of execution for
a given negative test case, the probabilities of going to the next node in the graph
are calculated. Fitness is boosted if new edges are explored, and the Markov Model
heuristic used at the core of their fitness function is updated [56]. This research
demonstrates the efficacy of using Genetic Algorithms and Context-Free Grammars
to create new test cases based on program feedback. Similarly, DeMott et al. per-
formed grey-box evolutionary fuzzing on targets, but used a more traditional Genetic
Algorithm search heuristic [25]. Grey-box fuzzing assumes that source code for a tar-
get application is unknown, but binary internals (including assembly code) are available
for analysis. In the same manner as Sparks et al., DeMott et al. first reverse engi-
neer the target application to locate and categorize its execution paths [25].
Based on a valid sample of a test case, it builds a population and performs traditional
Genetic Algorithm operations on individual chromosomes. Fitness is scored based on
how many branches of execution a given test case follows, and the distance between
the current node of execution and a desired target (determined during static analysis).
Test cases that find new branches are especially promoted among a pool of candidate
solutions [25]. The work of Sparks et al. and DeMott et al. represents an important
step in using Genetic Algorithms and a heuristics feedback loop to optimize fuzzing
strategies.
Roger Seagle explored the efficacy of incorporating a nonstandard Genetic Algo-
rithm called CHC to perform file format fuzzing [53]. His fitness heuristics combine
execution graph heuristics as well as considerations regarding function characteristics
(e.g., the number of arguments and local variables, number of assembly instructions,
etc.) [53]. A catalogue of the fitness function considerations is shown in Figure 2.5.
The resulting work, a distributed fuzzing framework entitled “Mamba” is a collec-
tion of Genetic Algorithm-based fuzzing strategies that were able to find more unique
defects than comparable file format fuzzing tools [53].
As mentioned in the previous section, Fabien Duchene et al. demonstrate the abil-
Figure 2.5: Fitness heuristic categories considered by Seagle [53]
ity of Genetic Algorithms alongside model inference and taint analysis to produce fuzz data that
uncovered Cross-Site Scripting (XSS) attacks in a reliable manner. Cross-Site Script-
ing vulnerabilities emerge when front-end web code does not safely escape dynamic
HTML and Javascript. Using a crafted input string, an attacker can execute code that
can be used to perform session hijacking and remote code execution [4]. Duchene et al.
used a Genetic Algorithm and taint-based heuristics to perform fuzzing on a variety
of purposefully vulnerable testing websites [26]. Similar to Sparks et al., this research
encoded chromosomes of a population as productions of a manually written “attack
grammar” tailored to uncovering Cross-Site Scripting (XSS) vulnerabilities. In this
document, the term “attack grammar” describes a grammar used for the creation of
strings with a propensity towards uncovering a class of vulnerabilities related to input
injection (e.g., SQL or LDAP injection, Cross-site Scripting, etc.). This grammar can
be developed manually by an expert, or inferred from positive examples. Although
it is impossible to exhaustively account for all the strings of a given language which
uncover injection vulnerabilities, human intuition (or automatic inference from posi-
tive examples) can approximate the fundamental grammatical components involved.
Duchene et al. succeeded in outperforming such tools as w3af and JBroFuzz [26].
An excerpt of the attack grammar is shown in Figure 2.6 [26]. The open problem
of automatically deriving attack grammars discussed in this work is the basis for the
research discussed in this thesis.
Figure 2.6: An excerpt of a manually-written attack grammar for finding Cross-Site Scripting vulnerabilities [26]
2.4 Grammar Fuzzing
The term Grammar Fuzzing refers to the subset of fuzzers whose targets are language
parsers, compilers, and runtimes. Grammar fuzzing has been successful in uncovering
a high number of web browser bugs and vulnerabilities due to incorrect Javascript
parser implementations. Fuzz testing against language interpreters has a long history
of success at finding parser vulnerabilities in software. One of the most frequent
targets of this type of fuzzing is the web browser. Zalewski’s mangleme browser fuzzer
is one of many tools aimed at testing a browser’s Javascript parser for implementation
bugs [44]. The mangleme fuzzer was a browser fuzzer designed to cause crash states
as a result of improper handling of HTML input. Guo et al. developed a technique
for testing Javascript parser engines by taking fragments of valid Javascript code,
and reorganizing them to produce negative test cases [33]. Holler et al. developed
a similar tool for testing web browser Javascript parsing engines in Firefox [36].
Their fuzzing framework was incorporated into a regression testing suite for Firefox.
The fuzzer executed code against a new release by generating negative test cases that
were produced by recombining fragments of Javascript code to make new test cases
by which to test a given interpreter [36]. Using this method, their team uncovered
160 bugs in the Mozilla Firefox browser’s Javascript parsing engine [36]. In parallel,
but unrelated work, Yang et al. developed a language fuzzing tool called “blendfuzz”,
which used “grammar aware mutation” to take valid test cases and rearrange valid
subgraphs to produce test cases that test a language interpreter’s correctness [61].
2.5 Advantages and Limitations
Fuzz testing is an effective technique for uncovering errors in software systems that
process user-controlled input. This has special gravity in the context of a program’s
security posture—oftentimes, fuzzers find bugs that lead to an attacker being able to
craft input that can exfiltrate sensitive data, cause denial-of-service conditions, or lead to remote code
execution.
One of its most important advantages is the speed with which fuzzing frameworks
can find software bugs and vulnerabilities. Despite the propensity for long
execution times, it is still much faster than employing humans to comb through huge
codebases in search of bugs. Furthermore, operating tests on a live manifestation of
an application uncovers bugs that are lost in manual code review and other testing
tasks subject to human fallibility. The modern day fuzzing landscape brings with it
a variety of frameworks and tools with specialized targets, allowing developers to
easily incorporate fuzzing into their software testing methodologies. Fuzzers are adept
at finding the “low-hanging fruit” of vulnerability classes, and are well suited for general purpose
vulnerability assessment. Fuzzers also add a dimension to software testing by creating
inputs that humans would not be likely to conjure up themselves. This research
demonstrates the usefulness of software testing that combines heuristics which encode
“intuition”, with the freedom of Genetic Algorithm solution searching.
Fuzzing is not without its limitations, however. For starters, fuzzing techniques
are only capable of alerting when an exception or crash state is triggered—not whether
or not an exploit is in progress. This makes fuzzers ineffective regarding the
discovery of complicated, multi-step vulnerabilities [44]. Another limiting aspect of
fuzzing is the fact that general purpose crash analysis, for the most part, remains
a manual human endeavor. Despite humans being able to intuit the exploitability
of a given software bug, there is a fundamental limit to the ability of computer
programs to determine whether or not a software error represents a vulnerability
that can be exploited. With respect to Generation-based fuzzers, or any fuzzers
dependent on a data model, complexity increases exponentially as a function of input
specification. In other words, the more complex the data model becomes, the more a
fuzzer will consume design and computing resources. Oftentimes, ensuring an efficient
fuzz testing campaign can be impossible because the search space can be too large to
enumerate. Thus search space optimizers such as Genetic Algorithms have recently
become en vogue.
Fuzz testing for software defects and vulnerabilities has been proven to work across
a variety of targets and protocol specifications since the late 1980s [44]. Although it
has limitations—not the least of which, the monitoring of applications for undefined
behavior and the analysis of negative test cases post-campaign—fuzzing has secured
itself as a mainstay technique for security researchers and software testers. Rudi-
mentary fuzzing techniques, such as random Mutation-based and Generation-based
fuzzing are surprisingly effective at uncovering critical vulnerability classes. Modern
research focuses on informing fuzzing methods with heuristics to intelligently guide
test case generation, in order to combat dumb fuzzing’s inability to learn from the
SUT’s response to previous test cases.
Chapter 3: Evolutionary Algorithms
3.1 History
Evolutionary Algorithms describe the set of algorithms—typically, focused on func-
tion optimization and search space reduction—whose fundamental components are
inspired by phenomena observable in evolutionary processes in the natural world.
The ideas and inspiration that ushered in the emergence of evolutionary computing
go back to the mid-20th century. Alan Turing, in “Computing Machinery and In-
telligence”, the same text in which he famously proposes his “imitation game”, specu-
lates about a scenario in which a machine would be modeled after “the mind of a child”, with
the ability to receive sensory input, learn from stimuli, and use that prior information
to make inferences and conclusions regarding new encounters [58]. His description
of the learning capabilities of machines is steeped in the language of evolution, and
the fact that his analogy uses this language helps explain the rise of evolutionary
algorithms, and frames the way computer scientists conceived problem spaces in the
mid-20th century. Evolutionary Algorithms describe the subset of optimization algo-
rithms that attempt to solve search problems by modeling them after processes found
in natural evolution. Inspired by the works of Charles Darwin, computer scientists
began to research methods by which to mimic the evolutionary processes for solv-
ing mathematical problems. The evolutionary processes underway which
promote good qualities in species and suppress undesirable traits can be emulated
in computing, and can be used to efficiently optimize algorithms—especially those
concerning complex search spaces.
The first recorded examples of modeling evolutionary principles to solve com-
putational problems are found in work by Friedberg et al. in the late 1950s, which was
concerned with “finding a program that calculates a given input-output function” [24].
Bremermann’s work in 1962 showed early use of “simulated evolution” for the task
of numerical optimization functions [24]. In the mid-1960s, Lawrence Fogel and John
Holland published groundbreaking, established research in evolutionary programming
and genetic algorithms (respectively), setting the stage for the formalization of this
subfield of machine learning [24]. Since then, Evolutionary Algorithms have been ap-
plied to a wide arrange of optimization problems in numerical methods, engineering,
and computer security.
3.2 Genetic Algorithms
The term Genetic Algorithm describes the subset of Evolutionary Algorithms that
mimic evolutionary conventions to solve optimization problems. John H. Holland,
widely considered the "Father of Genetic Algorithms", was inspired by the works of
Darwin and by the ability of natural evolutionary processes to find solutions to
biological problems. Genetic Algorithms attempt to solve problems by first establishing
a group of solutions (the population), which in essence represents the "gene pool" of a
given solution space. For each individual candidate solution (chromosome), a fitness
function evaluates how well it solves the problem at hand (or whether a correct
solution has been found), and a fitness value is assigned. The fitness scores (which,
calculated by the fitness function, are numerical representations of how well a given
chromosome solves the problem in question) determine selection, the process by which
a chromosome is chosen for the creation of the next generation's chromosomes. A new
population is then created by the crossover operation, where members of the current
population are selected and recombined with other chromosomes. In typical scenarios,
chromosomes with high fitness scores are more likely to be selected for crossover
("survival of the fittest"), to ensure the genotype of a high-scoring candidate will
propagate to the next generation [45]. Pseudocode for the algorithm is shown in
Algorithm 1.
Algorithm 1 Genetic Algorithm
1: procedure GeneticAlgorithm(popsize, numgens)    ▷ GA run with popsize chromosomes and numgens generations
2:   initialize population()
3:   calculate fitness()
4:   while n ≠ numgens do
5:     selected parents ← select(population)
6:     CPOP ← crossover(selected parents)
7:     mutate operator(CPOP)
8:     population ← CPOP
9:     calculate fitness()
10:    n ← n + 1
11:  end while
12: end procedure
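To make Algorithm 1 concrete, the following is a minimal Python sketch of the same loop applied to a toy bit-string problem. The population size, chromosome length, and one-counting objective are illustrative placeholders, not the encoding or fitness function used by this thesis's framework.

import random

def calculate_fitness(chrom):
    # Toy objective: count the ones in the bit string
    return sum(chrom)

def select(population):
    # Fitness-proportional choice of two parents (+1 avoids a zero total weight)
    weights = [calculate_fitness(c) + 1 for c in population]
    return random.choices(population, weights=weights, k=2)

def crossover(parents):
    # Single-point crossover producing two children
    p1, p2 = parents
    point = random.randrange(1, len(p1))
    return [p1[:point] + p2[point:], p2[:point] + p1[point:]]

def mutate_operator(population, rate=0.01):
    # Flip each bit with a small probability
    for chrom in population:
        for i in range(len(chrom)):
            if random.random() < rate:
                chrom[i] ^= 1

def genetic_algorithm(popsize=20, numgens=50, length=16):
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(popsize)]
    for _ in range(numgens):
        children = []
        while len(children) < popsize:
            children.extend(crossover(select(population)))
        mutate_operator(children)
        population = children[:popsize]
    return max(population, key=calculate_fitness)

print(genetic_algorithm())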
The success of Genetic Algorithms is explained by Holland's Schema Theorem,
sometimes referred to as the "Fundamental Theorem of Genetic Algorithms". Mitchell
remarks that its popular interpretation states that "short, low-order schemas", which
are groups of characteristics found in chromosomes, "whose average fitness remains
above the mean will receive exponentially increasing numbers of samples...over
time" [45]. Schemas describe a set of strings with common values at certain positions,
and represent the presence (or absence) of sub-components within a set of chromosomes.
The theorem describes the power of the crossover and mutation operators of Genetic
Algorithms to propagate good information, and to undergo enough deviation via
mutation (referred to as population diversity) to guide search space exploration in an
intelligent manner [45]. Although the Genetic Algorithm is designed to calculate
fitness for entire chromosomes, the implication is that the building blocks of those
chromosomes (schemas) are being evaluated as well, a phenomenon referred to by
Holland as "implicit parallelism" [35,45]. Mitchell clarifies that the effect of selection
based on fitness is a gradual preference for instances of schemas with above-average
implicit fitness scores [45]. This is the basic explanation for why Genetic Algorithms
excel at optimizing certain search space problems: by managing a pool of solutions
with enough schemata (i.e., building blocks) represented, the propagation of new
chromosomes via selection and crossover (based on fitness scores) will pass on schemas
with high fitness scores. When these high-performing schemas are combined with other
high-performing schemas, the likelihood of happening upon an optimal solution
increases. Finally, mutation ensures that the gene pool maintains the diversity
necessary to explore possible solutions. In this way, Genetic Algorithms are useful for
performing intelligent test case generation for fuzzers: by implicitly recombining
groups of substrings (schemas), it is possible to generate unique solutions, even in a
multimodal search space.
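As an illustration of what the Schema Theorem tracks, the short helper below (purely illustrative, not part of the thesis implementation) counts how many chromosomes in a population are instances of a schema, where '*' marks a "don't care" position.

def matches_schema(chromosome, schema):
    # Fixed positions must agree; '*' matches anything
    return all(s == '*' or s == c for s, c in zip(schema, chromosome))

population = ["1010", "1100", "0110", "1000"]
print(sum(matches_schema(c, "1**0") for c in population))  # 3 instances of schema 1**0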
3.2.1 Genetic Algorithm Components
The basic operators of Genetic Algorithms seek to mimic biological phenomena observed
in natural life processes. The selection operator determines the manner in which
chromosomes are selected for the reproduction operations that create the following
generation of candidate solutions [45]. Selection is typically informed by the fitness
scores of individual chromosomes: chromosomes with high fitness scores are more likely
to reproduce, following the "survival of the fittest" motif in biological evolution [45].
In some nonstandard Genetic Algorithms, such as CHC, selection is performed in a
purely random fashion [29]. Most algorithms, however, let fitness influence the manner
in which chromosomes are selected for reproduction. A common method (and the one
used in the proof of concept for this thesis) is elitism, in which the strongest
chromosomes are always selected for reproduction [45]. This method ensures that the
schemata found in high-fitness chromosomes live on to future generations.
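A minimal sketch of elitist selection as described above, assuming fitness scores have already been computed; the function name and the fraction of elites retained are illustrative choices rather than values prescribed by this thesis.

def elitist_select(population, fitness_scores, elite_fraction=0.25):
    # Keep the top-scoring chromosomes as guaranteed parents for the next generation
    ranked = sorted(zip(population, fitness_scores), key=lambda pair: pair[1], reverse=True)
    n_elites = max(1, int(len(population) * elite_fraction))
    return [chrom for chrom, _ in ranked[:n_elites]]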
Once selection chooses a pair of parents, the crossover operator is the method by
which new children are created for the next population [45]. The efficiency of crossover
methods is largely contingent upon the task at hand: single-point crossover, for
example, selects a split point between two parent chromosomes and builds two children
from the combination of one parent's first half and the other's second [45]. Other
methods include multipoint crossover and uniform crossover, which swaps half of the
differing bits between two parents. These methods are shown in pictographic form in
Figure 3.1. Holland's Schema Theorem supposes the power of Genetic Algorithms
comes from crossover's way of propagating building blocks of above-average-fitness
schemata to future generations [35].

Figure 3.1: Three traditional crossover methods for creating new chromosomes [3]
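Single-point crossover appears in the sketch following Algorithm 1; uniform crossover, which decides position by position whether to swap the parents' values, can be sketched as follows (illustrative function name, chromosomes treated as plain lists).

import random

def uniform_crossover(p1, p2):
    # At each position, swap the parents' values with probability 0.5
    c1, c2 = list(p1), list(p2)
    for i in range(len(c1)):
        if random.random() < 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2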
Finally, mutation is the process by which a given value in a chromosome is randomly
changed, analogous to chromosome mutations found in biological processes [45].
Mutation can involve flipping the values at certain bit positions of a chromosome, or
altering values in other types of chromosome encodings. Mitchell describes mutation as
an "insurance policy" against particular chromosome values becoming fixed and never
being evaluated as candidates for change [45]. Holland posits that mutation is required
to maintain diversity across positions in a given chromosome representation.
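For non-binary encodings like the grammar-production chromosomes used later in this thesis, mutation can be sketched as replacing a randomly chosen gene with another symbol from the same alphabet; the names and the mutation rate below are illustrative.

import random

def mutate(chromosome, alphabet, rate=0.05):
    # With small probability, replace a gene with a random symbol from the alphabet
    return [random.choice(alphabet) if random.random() < rate else gene
            for gene in chromosome]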
3.3 CHC
CHC is a nonstandard version of the pure Genetic Algorithm developed by
Eshelman [29]. The method was developed to counteract the main disadvantage to which
Genetic Algorithms are disposed: in multimodal search spaces, GAs will often fixate
on a local optimum and cease to continue searching. The steps of the CHC algorithm
differ slightly from those of regular Genetic Algorithms: crossover only occurs when the
difference between two selected parents is high enough [29]. The CHC algorithm
requires the crossover technique to be Half-Uniform Crossover (HUX), which swaps
half of the differing bits of the two parents during crossover. New generations are
created from the highest-scoring n chromosomes among the parent population and the
children. Over time, the chromosomes will all begin to have the same encoding, and no
more children will be created. When that threshold is hit enough times, a cataclysmic
mutation operator is invoked [29]. This form of mutation takes the chromosome with
the highest fitness and, using it as a template, creates new chromosomes by mutating
35 percent of the selected chromosome's encoded bits [29]. Pseudocode for the
algorithm is shown in Algorithm 2. Because CHC has a built-in convention by which to
escape plateaus at local minima or maxima, it tends to explore more of the search
space than traditional Genetic Algorithms. This makes it a good candidate for finding
solutions in multimodal search spaces. A distinct disadvantage of CHC, however, lies
in its tendency to spend less time in a given search area than traditional Genetic
Algorithms, leading to potentially missed solutions.
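The two CHC-specific operators described above can be sketched compactly in Python; the 35 percent rate mirrors the text, the chromosomes are assumed to be bit lists, and the function names are illustrative.

import random

def hamming(p1, p2):
    # Number of positions at which the two parents differ
    return sum(a != b for a, b in zip(p1, p2))

def hux(p1, p2):
    # Half-Uniform Crossover: swap exactly half of the differing positions
    diff = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
    c1, c2 = list(p1), list(p2)
    for i in random.sample(diff, len(diff) // 2):
        c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def cataclysmic_mutation(best, popsize, rate=0.35):
    # Rebuild the population from the best chromosome, flipping 35 percent of its bits
    new_population = [list(best)]
    for _ in range(popsize - 1):
        new_population.append([(1 - g) if random.random() < rate else g for g in best])
    return new_population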
3.4 Problem Domain
Evolutionary algorithms have been applied as an optimization strategy for a wide
variety of computing tasks. Holland's seminal work demonstrates the use of Genetic
Algorithms to solve the famous Prisoner's Dilemma [35,45].
Algorithm 2 CHC Algorithm
1: procedure CHC(population size, numgens)
2:   initialize population()
3:   threshold ← L/4    ▷ L is chromosome length
4:   while n ≠ numgens do
5:     for i in population size/2 do
6:       select parents p1, p2 without replacement
7:       if Hamming(p1, p2) > threshold then
8:         CPOP ← HUX(p1, p2)    ▷ Half-Uniform Crossover of p1, p2
9:       end if
10:    end for
11:    if sizeof(CPOP) == 0 then
12:      threshold ← threshold - 1
13:    else
14:      calculate fitness(CPOP)
15:      population ← best N individuals from (population + CPOP)
16:    end if
17:    if threshold < 0 then
18:      population ← cataclysmic mutation(population)
19:      threshold ← L/4
20:    end if
21:    n ← n + 1
22:  end while
23: end procedure
In the Prisoner's Dilemma, two individuals are detained for colluding in criminal
activity and are held in separate cells with no means of communication [45]. The
authorities offer each prisoner the following deal: if a prisoner confesses and agrees to
testify against his or her partner, the punishment doled out for the crime is
lessened [45]. However, if both parties admit to the crime and testify, the leniency
previously offered is nullified. If neither testifies against the other, each receives a
moderate jail sentence [45]. Axelrod sought to determine whether Genetic Algorithms
could help decide the best strategy for each individual prisoner (which many
tournaments showed was simply "TIT FOR TAT", a repetition of the choice made by
the other prisoner) [45]. Given the proper conditions, Axelrod showed that Genetic
Algorithms were able to find solutions that scored higher than "TIT FOR TAT" [45].
This demonstrates the somewhat inexplicable ability of Genetic Algorithms to
propagate the building blocks of good solutions to create new ones which humans may
not consider.
An example use of Genetic Algorithms to solve engineering problems is found in
Hornby et al.'s implementation of the algorithm to automatically perform antenna
design [37]. Previously, antenna design was done manually and consumed a great deal
of human design resources; producing quality antenna designs requires an expert
because of the vast amount of knowledge involved [37]. In response, Hornby et al.
implemented an Evolutionary Algorithm which found novel antenna designs that
outperformed human-generated solutions, according to the voltage standing wave
ratio and gain values of frequencies [37]. Evolutionary Algorithms have been applied
to problems as varied as financial portfolio optimization, game-theoretic problems
such as the Prisoner's Dilemma, and even the development of walking methods for
computer figures [21,31].
3.5 Advantages and Limitations
Genetic Algorithms are useful for optimization across a wide variety of problems and
domains. One of the main advantages of these algorithms is described by the Schema
Theorem previously discussed [35,45]. Genetic Algorithms excel at search space
optimization for problems with nondeterministic solutions. They are also easy to
conceptualize and implement, so once design decisions are established, Genetic
Algorithms are simple to incorporate into a variety of optimization schemes. The vast
set of parameters involved in tuning Genetic Algorithms is both a blessing and a curse.
De Jong remarks that, oftentimes, poorly tuned parameters do not produce suboptimal
results [24]. This can lead to a great deal of frustration, however, when
underperformance is observed, as parameter tuning does not necessarily map
deterministically to improved results.
The main disadvantage of Evolutionary Algorithms, and, really, of any class of
optimization algorithms, is based on the "No Free Lunch" (NFL) theorem [60]. Simply
put, the NFL theorem states that there "cannot exist any algorithm for solving all
(e.g. optimization) problems that is generally (on average) superior" to any other
optimization algorithm [24,60]. Another disadvantage concerns the fact that the
stochastic processes at the center of many Evolutionary Algorithms rely on random
number generation, and the bias associated with poor pseudorandom number
generation can lead to problems. Furthermore, many search landscapes are
multimodal, meaning more than one optimal solution exists, and evolutionary
algorithms often have trouble in multimodal search spaces [24]. However,
heuristic-based measures can be taken to ensure reasonably good search performance.
The complexity of chromosome representations and Genetic Algorithm operators can
become unwieldy and ineffectual without proper constraints [45]. This research
demonstrates the effects of that consideration, as CHC underperforms because its
implementation is inefficient under the constraints of variable-length chromosomes
with complex representations. Finally, accurate and precise fitness functions can prove
difficult to formulate given the nature of many real-world problems [45].
All told, Evolutionary Algorithms are a useful optimization strategy for certain
types of search space problems, and have a direct, positive effect when incorporated
into the data generation aspect of fuzz testing frameworks. The remainder of this
research concerns their application to fuzz testing of web applications, with specific
focus on evolving payloads to exploit SQL injections.
Chapter 4: Evolutionary Algorithm
Web Fuzzing Framework
4.1 Approach
As previously discussed, the use of Evolutionary Algorithms to intelligently reduce
the search space of fuzzing campaigns has proven effective across a wide range of
targets [25,42,56]. The manner in which input should be crafted can be modeled well
as a search problem, making Genetic Algorithms good candidates for guiding test case
generation. Sparks et al. modeled their chromosomes as productions of a grammar that
generated series of opcodes, used to uncover vulnerabilities in an FTP program [56].
In the web application sphere, Duchene et al. found success revealing Cross-Site
Scripting (XSS) vulnerabilities through a combination of taint analysis and an
evolutionary algorithm whose chromosome representations were productions of an
"attack grammar" for XSS [26]. One limitation of this approach, however, is that it
required an expert to manually write the attack grammar used to generate payloads
for the Evolutionary Algorithm [26]. This research explores techniques by which to
automatically derive grammars for an attack language by analyzing the lexical
structures of positive examples and curating a set of productions that represent every
string found in the corpus. The goal is to amalgamate a group of grammar production
rules, grouped together by a "fingerprinting" algorithm for identifying SQL injection
examples [30], and use them to score fitness and/or to represent chromosomes
according to production rules.
4.1.1 Preprocessing
The purpose of the preprocessing phase of the EA fuzzing framework is to build a set
of attack grammars which encompass the lexical structure of the positive examples, to
record the frequency of n-tuple groups of SQL tokens in the corpus, and to find the
frequency of transitions between n-tuple groups of tokens in the positive examples.
Analyzing positive examples of SQL injections allows the set of attack grammars
available to our Genetic Algorithm to be constructed. First, positive examples are
procured: the sample corpus for this set of experiments comes from Søen's "Forced
Evolution" database and from Click Security's "Data Hacking" repository [46,55].
The elements of the corpus were chosen to cover a wide range of different SQL
injection attacks, including boolean-based and UNION-based [10]. Boolean-based SQL
injections attempt to insert a boolean statement into a SQL query that will always
evaluate to true, thereby returning (exfiltrating) data that the query should not
return. UNION-based SQL injections, on the other hand, attempt to match the output
structure of a given SQL query to exfiltrate data from other tables, server-specific
values, or other sensitive information. Galbreath's training set of SQL injections was
used in early stages of the project, but was not used for the experiments outlined in
this document [30]. Although the set of examples from Galbreath's libinjection library
was high quality, it favored UNION-based attacks too heavily for our purposes, and
represented more fingerprints for which the framework could perform fitness
calculations than could be used in a reasonable amount of time.
Once the corpus has been curated, the preprocessing stage kicks off by lexing each
positive example into its SQL token representation. This research uses sqlparse, a
non-validating SQL lexer/parser [16]. Rather than writing a parser for our purposes,
sqlparse was chosen because it does not require a valid SQL string and has robust
tokenization capabilities. The first quality is especially important because our positive
examples are merely fragments of SQL statements that represent malicious intent.
Tokens are arranged into one-, two-, or three-tuple groups and assigned a production
rule in the attack grammar according to their position within the original positive
example (a figure explaining this process is shown below). The frequencies of the
n-tuple groups are recorded and used for fitness metrics. In addition, the frequencies
of transitions between n-tuple groups are recorded for use by the fitness function as
well as by the Markov Model Monte Carlo algorithm. The information used by the
Evolutionary Algorithms tested in this research is grouped into one-, two-, and
three-tuples of SQL tokens. A SQL token merely represents the symbolic value of a
literal string according to the SQL language specification. The reason three-tuples
were chosen as the maximum group of lexical tokens for a given terminal is based on
research by Mike Sconzo and Brian Wylie, whose work on data science for security was
presented at Shmoocon in 2014 [46]. They demonstrate that 3-gram groupings of SQL
tokens carry enough information to determine whether or not a given string has
malicious intent [46]. Although they were approaching the problem of SQL injection
detection, the idea pertains to fuzzing as well: instead of relying on a human to write
an attack grammar based on known types of injections, the approach of this thesis
requires tokenization, and then a grouping of these tokens based on their position.
The idea is that the Genetic Algorithm will be able to move these n-tuple groups into
different orders (via crossover) while still preserving attack grammar information.
A visual representation of an extract of the production tree is shown in Figure 4.2.
After the positive examples have been broken down into their semantic tokens, n-tuple
groupings, and n-tuple transition densities, the attack grammars that represent the
corpus are constructed. A visual manifestation of this process is shown in Figure 4.1.
Figure 4.1: Flow graph of the preprocessing stage (Corpus of Positive Examples → Lex and Tokenize Positive Example → Build n-tuples Table → Record Markov Transitions → Apply Fingerprint → Construct/Update Grammars)
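The preprocessing steps summarized in Figure 4.1 can be sketched roughly as follows, using sqlparse for tokenization. The way token types are abbreviated, the non-overlapping 3-tuple grouping, and the dictionary structures are simplifications chosen for illustration, not the framework's exact implementation.

from collections import Counter, defaultdict
import sqlparse

def lex_tokens(example):
    # Flatten a (possibly invalid) SQL fragment into its token-type names
    tokens = sqlparse.parse(example)[0].flatten()
    return [str(tok.ttype).split('.')[-1] for tok in tokens if not tok.is_whitespace]

def build_tables(corpus, n=3):
    ngram_counts = Counter()             # frequency of n-tuple token groups
    transitions = defaultdict(Counter)   # Markov transition counts between groups
    for example in corpus:
        tokens = lex_tokens(example)
        groups = [tuple(tokens[i:i + n]) for i in range(0, len(tokens), n)]
        ngram_counts.update(groups)
        for current, following in zip(groups, groups[1:]):
            transitions[current][following] += 1
    return ngram_counts, transitions

corpus = ["' OR 1=1 --", "' UNION SELECT name, password FROM users --"]
counts, transitions = build_tables(corpus)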
4.1.2 Attack Grammars
The final phase of preprocessing involves creating a set of grammars whose productions
end in terminals represented as n-tuple groups of SQL tokens. This forms the basis of
the proposed method's chromosome representation and fitness function calculations.
The attack grammars are separated based on the "fingerprint" value of each positive
example, as calculated by Galbreath's libinjection software [30]. A fingerprint is
calculated by approximating the type of SQL injection attack based on the tokens
present in a given string [30]. This ensures that exploit strings with similar structural
components are grouped together, and their grammar productions are grouped
accordingly.
More formally, the algorithm derives a set of grammars that represent our corpus
of positive examples:
Attack_G = {G_0, G_1, ..., G_{n-1}},    (4.1)

where each grammar G_i contains production rules that generate strings that have the
same fingerprint, which classifies them according to the semantic structure of one
or more positive examples. Formally, each grammar is a 4-tuple:
G_i = (V, Σ, R, S)    (4.2)

V is a finite set of non-terminals (variables). In this research's proof-of-concept
implementation, the variables correspond to positional indices at which groups of
n-tuples appear. For a given index of a fingerprint, multiple n-tuples are potential
productions for the grammar. Σ is the set of terminals, the actual components that
comprise a valid string of the language described by the grammar; the terminals in
this implementation are one-, two-, or three-tuples of SQL tokens. S is the start
variable, and R is the set of production rules from S that derive terminals [54].
The production rules are purposefully crude in order to limit the time spent deriving
strings of a given grammar, and to explore the efficacy of using sets of simple
grammars to approximate an attack language. Each grammar can be described as
follows:

G_fp = { G_fp(s) | s ∈ L(G_fp) and G_fp accepts s }    (4.3)

Figure 4.2: Example extract of the parse tree derived from positive examples of SQL injection tokens. The start symbol branches into one subtree per fingerprint (fp0, fp1, ..., fp(n-1)); each fingerprint's positional variables (0fp0, 1fp0, 2fp0, ..., (m-1)fp0) derive token groups such as (SINGLE, DDL, PUNCT), (INT, COMPARISON, INT), and (DML, ERROR).
The set of grammars does not seek to accurately encompass the SQL language
specification; instead, the goal is to approximate the structures of positive examples
well enough to codify the semantic components of an attack language (in the language
of genetics, the phenotypical information available).
Value: ' x OR 1 = 1'
Fingerprint: sn&10
SQL Token Representation: Error Name Keyword Integer Comparison Integer Single
Grammar Productions: S → 0 1 2;  0 → Error Name Keyword;  1 → Integer Comparison Integer;  2 → Single

Table 4.1: An example preprocessing of a positive example
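The grammar-construction step illustrated in Table 4.1 amounts to mapping each positional index of a fingerprint to the set of token groups observed there. A rough sketch, assuming the fingerprint has already been computed (for example, by libinjection) and using illustrative data structures:

from collections import defaultdict

# grammars[fingerprint][position] is the set of n-tuple productions observed at that index
grammars = defaultdict(lambda: defaultdict(set))

def update_grammar(fingerprint, token_groups):
    # Each positional index becomes a variable whose productions are the observed token groups
    for position, group in enumerate(token_groups):
        grammars[fingerprint][position].add(group)

# Mirroring the example in Table 4.1
update_grammar("sn&10", [("Error", "Name", "Keyword"),
                         ("Integer", "Comparison", "Integer"),
                         ("Single",)])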
At the cost of the more refined expression of an attack grammar, such as that found
in Duchene et al., this technique aims to collect various permutations of lexical
symbols which represent "known bad" injection attempts (i.e., the semantic structure
of examples of SQL injections) and place them in their positional context [27].
Within the fuzzing framework, chromosomes are modeled as productions for a given
fingerprint, each producing a group of one, two, or three sequential tokens. This
design decision ensures that the Genetic Algorithm searches the input space with
genotypic components (tuples of SQL tokens derived from positive examples) that are
well enough preserved for focused searching. The heart of this research involves
exploring whether a precise attack grammar is required, or whether it is sufficient to
encode shallow productions of grammars, grouped by a common lexical structure, and
allow the algorithm to recombine productions of different fingerprints to produce new
exploit strings. These new exploit strings will sometimes not have a representative
fingerprint, and at other times will conform to ones available in libinjection. The
results demonstrate that there is value in this approach, especially since it is
completely automated. In theory, provided a lexer for a given target language is
available, along with a method for codifying similar examples into fingerprints, it is
possible to use this framework for any type of fuzz testing campaign. Future research
will explore using this framework to find Cross-Site Scripting (XSS) vulnerabilities
and memory corruption vulnerabilities in local binaries.
4.1.3 Fitness Evaluation
For this technique, the fitness function is a combination of three characteristics of a
given candidate solution. First, if a given chromosome successfully achieves a SQL
injection, it is heavily promoted within the population. Second, and relatedly, the
fitness scores of chromosomes which result in an invalid SQL statement are suppressed.
This raises the question of the proof of concept's tenability against real-world systems;
while most web fuzzing campaigns are purely black box, it is not unreasonable to
analyze input forms and determine the type of SQL statement executed. Furthermore,
the current implementation scores this condition very weakly, to the point where it
could be removed without lasting effect. Finally, the chromosome in question is scored
based on how well it conforms to the attack grammars built from the positive
examples. The token groups of the chromosome are compared against the terminals of
each grammar at the corresponding positional indices. Instead of only denoting
whether a chromosome is accepted or rejected by a grammar, if a given chromosome's
token groups match a successive sequence of a grammar's token groups, the fitness
score is compounded exponentially. In short, a chromosome that matches a contiguous
grouping of tokens fits a high portion of a grammar's potential terminals and is
exponentially promoted within the population (i.e., the chromosome does a good job
of approximately representing the attack language). This idea can be summarized by
the following formula:
Σ_{i=0}^{n-1} Σ_{j=0}^{m-1} (x_j == G_{i,j}) · k²    (4.4)
where n represents the total number of fingerprints, m represents the number of
positional token groups in the symbol representation of the chromosome, and k
represents the number of sequential matches found. k is reset to 0 in the event that a
mismatch is found. This formula, compounded with the other two metrics, comprises
the fitness calculation for a given chromosome.
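A direct sketch of Equation 4.4, again with illustrative data structures: each grammar is represented as a list, indexed by position, of the token groups it can produce there, and k tracks the length of the current run of consecutive matches.

def grammar_conformance_score(chromosome_groups, grammars):
    # chromosome_groups: the candidate solution's n-tuple token groups, in order
    # grammars: for each fingerprint, the token groups allowed at each positional index
    score = 0
    for grammar in grammars:                           # sum over fingerprints (i)
        k = 0
        for j, group in enumerate(chromosome_groups):  # sum over positional groups (j)
            if j < len(grammar) and group in grammar[j]:
                k += 1                                 # extend the run of sequential matches
                score += k ** 2                        # compound the reward
            else:
                k = 0                                  # reset on mismatch
    return score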
EDUCATION

Wake Forest University, May 2016
Master of Science in Computer Science
Overall GPA: 3.166

Wake Forest University, May 2013
Bachelor of Arts in English & Computer Science
Overall GPA: 3.352

Technical Coursework: Network and Computer Security, Internet Protocols, Algorithms, Artificial Intelligence, Operating Systems, Linux Administration, Discrete Mathematics, Calculus, Linear Algebra
EXPERIENCE
Wake Forest University, September 2013 - May 2016
Research and Teaching Assistant, Winston Salem, NC
· Supported research which implemented a "Moving Target" security configuration system for network hosts
· Conducted thesis research that explored the application of machine learning techniques, language theory, and evolutionary algorithms to optimize SQL-injection and XSS auditing approaches
· Organized and taught undergraduate lectures on introductory topics related to operating systems and computer security, involving attacker life-cycle, security vulnerability auditing, and secure software practices

Pacific Northwest National Laboratory, June 2014 - September 2014
Masters Intern, Richland, WA
· Developed auto-refresh functionality for a network traffic visualization application written in Java
· Provided operational assistance for a company-wide Capture the Flag competition, which involved instructing new participants on attack classes, exploitation techniques, and general secure development practices
· Developed Capture the Flag challenges, including a firewall rules testing application which utilized the Flask microframework, Scapy packet manipulation software, nginx reverse-proxy and gunicorn HTTP server

B/E Aerospace, June 2013 - August 2013
Operations Security Intern, Winston Salem, NC
· Supported company-wide security operations and incident response handling
· Developed automated tools and workflow procedures that increased the efficiency of incident management and mitigation

Cisco Systems, Inc., June 2012 - August 2012
Software Engineering Intern: R&D, Knoxville, TN
· Developed an analytic web application using Ruby on Rails framework
· Learned and developed software security analysis skills through independent study and participation in CTF challenges within penetration testing environments
· Studied and analyzed secure software development practices and related vulnerability classes/attack vectors
TECHNICAL SKILLS
Computer Languages and Technologies: Python, Ruby, Java, C/C++, R