
When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries

Aylin Caliskan-Islam, Princeton University, [email protected]
Fabian Yamaguchi, Technische Universität Braunschweig, [email protected]
Edwin Dauber, Drexel University, [email protected]
Konrad Rieck, Technische Universität Braunschweig, [email protected]
Richard Harang, US Army Research Laboratory, [email protected]
Rachel Greenstadt, Drexel University, [email protected]
Arvind Narayanan, Princeton University, [email protected]

ABSTRACT

The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. While recent work found that source code can be attributed to authors with high accuracy, attribution of executable binaries appears to be much more difficult. Many distinguishing features present in source code, e.g. variable names, are removed in the compilation process, and compiler optimization may alter the structure of a program, further obscuring features that are known to be useful in determining authorship. We examine programmer de-anonymization from the standpoint of machine learning, using a novel set of features that include ones obtained by decompiling the executable binary to source code. We adapt a powerful set of techniques from the domain of source code authorship attribution along with stylistic representations embedded in assembly, resulting in successful de-anonymization of a large set of programmers.

We evaluate our approach on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 100 and 83% with 600 candidate programmers. We present, for the first time, an executable binary authorship attribution approach that is robust to basic obfuscations, a range of compiler optimization settings, and binaries that have been stripped of their symbol tables. We perform programmer de-anonymization using both obfuscated binaries and real-world code found "in the wild" in single-author GitHub repositories and the recently leaked Nulled.IO hacker forum. We show that programmers who would like to remain anonymous need to take extreme countermeasures to protect their privacy.

ACM ISBN 978-1-4503-2138-9.

DOI: 10.1145/1235

1. INTRODUCTION

If we encounter an executable binary sample in the wild, what can we learn from it? In this work, we show that the programmer's stylistic fingerprint, or coding style, is preserved in the compilation process and can be extracted from the executable binary. This means that it may be possible to infer the programmer's identity if we have a set of known potential candidate programmers, along with executable binary samples (or source code) known to be authored by these candidates.

Programmer de-anonymization from executable binaries has implications for privacy and anonymity. Perhaps the creator of a censorship circumvention tool distributes it anonymously, fearing repression. Our work shows that such a programmer might be de-anonymized. Further, there are applications for software forensics, for example to help adjudicate cases of disputed authorship or copyright.

Furthermore, the White House Cyber R&D Plan states that "effective deterrence must raise the cost of malicious cyber activities, lower their gains, and convince adversaries that such activities can be attributed [38]." The DARPA Enhanced Attribution program calls for methods that can "consistently identify virtual personas and individual malicious cyber operators over time and across different endpoint devices and C2 infrastructures [24]." While the forensic applications are important, as attribution methods develop, they will threaten the anonymity of privacy-minded individuals at least as much as malicious actors.

We introduce the first part of our approach by significantly outperforming the previous attempt at de-anonymizing programmers by Rosenblum et al. [36]. We improve their accuracy of 51% in de-anonymizing 191 programmers to 92%, and we then scale the results to 83% accuracy on 600 programmers. First, whereas Rosenblum et al. extract structures such as control-flow graphs directly from the executable binaries, our work is the first to show that automated decompilation of executable binaries gives additional categories of useful features. Specifically, we generate abstract syntax trees of decompiled source code. Abstract syntax trees have been shown to greatly improve author attribution of source code [15]. We find that syntactical properties derived from these trees also improve the accuracy of executable binary attribution techniques.

Second, we demonstrate that using multiple tools for disassembly and decompilation in parallel increases the accuracy of de-anonymization by generating different representations of code that capture different aspects of the programmer's style. We present a machine learning framework based on entropy and correlation for dimensionality reduction, followed by random-forest classification, that allows us to effectively use these disparate types of features in conjunction without overfitting.

These innovations allow us to de-anonymize a large set of real-world programmers with high accuracy. We perform experiments with a controlled dataset collected from Google Code Jam (GCJ), allowing a direct comparison to previous work that used samples from GCJ. The results of these experiments are discussed in detail in Section 5. Specifically, we can distinguish between thirty times as many candidate programmers (600 vs. 20) with higher accuracy, while utilizing less training data and fewer stylistic features per programmer. The accuracy of our method degrades gracefully as the number of programmers increases, and we present experiments with as many as 600 programmers.

Third, we find that traditional binary obfuscation, enabling compiler optimizations, or stripping debugging symbols in executable binaries results in only a modest decrease in de-anonymization accuracy. These results, described in Section 6, are an important step toward establishing the practical significance of the method.

The fact that coding style survives compilation is unintuitive, and may leave the reader wanting a "sanity check" or an explanation for why this is possible. In Section 5.8, we present several experiments that help illuminate this mystery. First, we show that decompiled source code is not necessarily similar to the original source code in terms of the features that we use; rather, the feature vector obtained from disassembly and decompilation can be used to predict, using machine learning, the features in the original source code. Even if no individual feature is well preserved, there is enough information in the vector as a whole to enable this prediction. On average, the cosine similarity between the original feature vector and the reconstructed vector is over 80%. Further, we investigate factors that are correlated with coding style being well-preserved, and find that more skilled programmers are more fingerprintable. This suggests that programmers gradually acquire their own unique style as they gain experience.

All these experiments were carried out using the GCJ dataset; the availability of this dataset is a boon for research in this area since it allows us to develop and benchmark our results under controlled settings [36, 9]. Having done that, we present a case study with a real-world dataset collected from GitHub in Section 6.4. This data presents difficulties, particularly noise in ground truth because of library and code reuse. However, we show that we can handle a noisy dataset of 50 programmers found in the wild with 65% accuracy, and we further extend our method to tackle open world scenarios. We also present a case study using code found via the recently leaked Nulled.IO hacker forum. We were able to find four forum members who, in private messages, linked to executables they had authored (one of which had only one sample). Our approach correctly attributed the three individuals who had enough data to build a model, and correctly rejected the fourth sample as belonging to none of the previous three.

We emphasize that research challenges remain before programmer de-anonymization from executable binaries is fully ready for practical use. For example, programs may be authored by multiple programmers and may have gone through encryption. We have not performed experiments that model these scenarios; instead we focus on the privacy implications. Nonetheless, we present a robust and principled programmer de-anonymization method that raises immediate concerns for privacy and anonymity.

2. PROBLEM STATEMENT

In this work, we consider an analyst interested in determining the author of an executable binary purely based on its style. Moreover, we assume that the analyst only has access to executable binary samples, each assigned to one of a set of candidate programmers.

Depending on the context, the analyst's goal might be defensive or offensive in nature. For example, the analyst might be trying to identify a misbehaving employee who violates the non-compete clause of his company by launching an application related to his work.

By contrast, the analyst might belong to a surveillance agency in an oppressive regime that tries to unmask anonymous programmers. The regime might have made it unlawful for its citizens to use certain types of programs, such as censorship-circumvention tools, and might want to punish the programmers of any such tools. If executable binary stylometry is possible, it means that compiling code is not a means of anonymization. Because of its potential dual use, executable binary stylometry is of interest to both security and privacy researchers.

In either (defensive or offensive) case, the analyst (or adversary) will seek to obtain labeled executable binary samples from each of these programmers who may have potentially authored the anonymous executable binary. The analyst proceeds by converting each labeled sample into a numerical feature vector, and subsequently deriving a classifier from these vectors using machine learning techniques. This classifier can then be used to attribute the anonymous executable binary to the most likely programmer.

Since we assume that a set of candidate programmers is known, we treat our main problem as a closed world, supervised machine learning task. It is a multi-class machine learning problem in which the classifier calculates the most likely author for the anonymous executable binary sample among multiple authors. We briefly present initial experiments on an open-world scenario in Section 6.5.

Additional Assumptions. For our experiments, we assume that we know the compiler used for a given program binary. Previous work has shown that with only 20 executable binary samples per compiler as training data, it is possible to use a linear Conditional Random Field (CRF) to determine the compiler used with an average accuracy of 93% [37, 26]. Other work has shown that, using pattern matching, library functions can be identified with precision and recall between 0.98 and 1.00 based on each of three criteria: compiler version, library version, and Linux distribution [22].

In addition to knowing the compiler, we assume to know the optimization level used for compilation of the binary. Past work has shown that toolchain provenance, including compiler family, version, optimization, and source language, can be identified with a linear CRF with 99% accuracy for source language, compiler family, and optimization level, and with 92% accuracy for compiler version [35]. Based on this success, we make the assumption that these techniques will be used to identify the toolchain provenance of the executable binaries of interest and that our method will be trained using the same toolchain.

3. RELATED WORK

Any domain of creative expression allows authors or creators to develop a unique style, and we might expect that there are algorithmic techniques to identify authors based on their style. This class of techniques is called stylometry. Natural-language stylometry, in particular, is well over a century old [29]. Other domains such as source code and music also have stylistic features, especially grammar. Therefore stylometry is applicable to these domains as well, often using strikingly similar techniques [40, 10].

Linguistic stylometry. The state of the art in linguistic stylometry is dominated by machine-learning techniques [6, 30, 7]. Linguistic stylometry has been applied successfully to security and privacy problems; for example, Narayanan et al. used stylometry to identify anonymous bloggers in large datasets, exposing privacy issues [30]. On the other hand, stylometry has also been used for forensics in underground cyber forums. In these forums, the text consists of a mixture of languages and information about forum products, which makes it more challenging to identify personal writing style. Not only have forum users been de-anonymized, but their multiple identities across and within forums have also been linked through stylometric analysis [7].

Authors may deliberately try to obfuscate or anonymize their writing style [11, 6, 28]. Brennan et al. show how stylometric authorship attribution can be evaded with adversarial stylometry [11]. They present two approaches to adversarial stylometry: obfuscating one's own writing style and imitating someone else's. Afroz et al. identify the stylistic changes in a piece of writing that has been obfuscated, while McDonald et al. present a method to make writing style modification recommendations to anonymize an undisputed document [6, 28].

Source code stylometry. Several authors have applied similar techniques to identify programmers based on source code [15, 32, 14]. Source code authorship attribution has applications in software forensics and plagiarism detection¹.

The features used for machine learning in source code authorship attribution range from simple byte-level [19] and word-level n-grams [12, 13] to more evolved structural features obtained from abstract syntax trees [15, 32]. In particular, Burrows et al. present an approach based on n-grams that reaches an accuracy of 77% in differentiating 10 different programmers [13].

Similarly, Kothari et al. combine n-grams with lexical markers such as line length to build programmer profiles that allow them to identify 12 authors with an accuracy of 76% [25]. Lange et al. further show that metrics based on layout and lexical features along with a genetic algorithm reach an accuracy of 75% in de-anonymizing 20 authors [27]. Finally, Caliskan-Islam et al. incorporate abstract syntax tree based structural features to represent programmers' coding style [15]. They reach 94% accuracy in identifying 1,600 programmers of the GCJ dataset.

¹Note that popular plagiarism-detection tools such as Moss are not based on stylometry; rather, they detect code that may have been copied, possibly with modifications. This is an orthogonal problem [8].

Executable binary stylometry. In contrast, identifying programmers from compiled code is considerably more difficult and has received little attention to date. Code compilation results in a loss of information and obstructs stylistic features. We are aware of only two prior works, both of which perform their evaluation and experiments on controlled corpora that are not noisy, such as the GCJ dataset and student homework assignments [36, 9]. Our work significantly outperforms previous work; in addition, we investigate noisy real-world datasets and the effects of optimizations and obfuscations.

Rosenblum et al. present two main machine learning tasks based on programmer de-anonymization. One is based on supervised classification with a support vector machine to identify the authors of compiled code [17]. The second machine learning approach they use is based on clustering to group together programs written by the same programmers. They incorporate a distance based similarity metric to differentiate between features related to programmer style to increase clustering accuracy. They use the Paradyn project's ParseAPI for parsing executable binaries to get the instruction sequences and control flow graphs, whereas we use four different resources to parse executable binaries to generate a richer representation. Their dataset consists of submissions from GCJ and homework assignments with skeleton code.

4. APPROACH

Our ultimate goal is to automatically recognize programmers of compiled code. We approach this problem using supervised machine learning; that is, we generate a classifier from training data of sample executable binaries with known authors. The advantage of such learning-based methods over techniques based on manually specified rules is that the approach is easily retargetable to any set of programmers for which sample executable binaries exist. A drawback is that the method is inoperable if samples are not available or are too short to represent authorial style.

Data representation is critical to the success of machine learning. Accordingly, we design a feature set for executable binary authorship attribution with the goal of faithfully representing properties of executable binaries relevant for programmer style. We obtain this feature set by augmenting lower-level features extractable from disassemblers with additional string and symbol information, and, most importantly, incorporating higher-level syntactical features obtained from decompilers.

In summary, such an approach results in a method consisting of the following four steps (see Figure 1).

• Disassembly. We begin by disassembling the program to obtain features based on machine code instructions, referenced strings, symbol information, and control flow graphs (Section 4.1).

• Decompilation. We proceed to translate the program into C-like pseudo code via decompilation. By subsequently passing the code to a fuzzy parser for C, we thus obtain abstract syntax trees from which syntactical features and n-grams can be extracted (Section 4.2).

Figure 1: Overview of our method. Instructions, symbols, and strings are extracted using disassemblers (1), syntactical and control-flow features are obtained from decompilers (2). Dimensionality reduction is performed to obtain representative features (3). Finally, a random forest classifier is trained to de-anonymize programmers (4).

• Dimensionality reduction. With features from disassemblers and decompilers at hand, we select those among them that are particularly useful for classification by employing a standard feature selection technique based on information gain and correlation-based feature selection (Section 4.3).

• Classification. Finally, a random-forest classifier is trained on the corresponding feature vectors to yield a program that can be used for automatic executable binary authorship attribution (Section 4.4).

In the following sections, we describe these steps in greater detail and provide background information on static code analysis and machine learning where necessary.

4.1 Feature extraction via disassembly

As a first step, we disassemble the executable binary to extract low-level features that have been shown to be suitable for authorship attribution in previous work. In particular, we follow the example set by Rosenblum et al. and extract raw instruction traces from the executable binary [36]. In addition, disassemblers commonly make symbol information available, as well as strings referenced in the code, both of which greatly simplify manual reverse engineering. We augment the feature set accordingly. Finally, we can obtain control flow graphs of functions from disassemblers, providing features based on program basic blocks. The information necessary to construct our feature set is obtained from the following two disassemblers.

• The netwide disassembler. We begin by exploring whether simple instruction decoding alone can already provide useful features for de-anonymization. To this end, we process each executable binary using the netwide disassembler (ndisasm), a rudimentary disassembler that is capable of decoding instructions but is unaware of the executable's file format [39]. Due to this limitation, it resorts to simply decoding the executable binary from start to end, skipping bytes when invalid instructions are encountered. A problem with this approach is that no distinction is made between bytes that represent data versus bytes that represent code. Nonetheless, we explore this simplistic approach, as these inaccuracies may not degrade a classifier, given the statistical nature of machine learning.

• The radare2 disassembler. We proceed to apply radare2 [31], a state-of-the-art open-source disassembler based on the capstone disassembly framework [34]. In contrast to ndisasm, radare2 understands the executable binary format, allowing it to process relocation and symbol information in particular. This allows us to extract symbols from the dynamic (.dynsym) as well as the static symbol table (.symtab) where present, and any strings referenced in the code. Our approach thus gains knowledge over functions of dynamic libraries used in the code. Finally, radare2 attempts to identify functions in code and generates corresponding control flow graphs.

First, we normalize assembly instructions by converting hexadecimal numbers to a placeholder, namely number, to follow Occam's razor and avoid overfitting. Then, the information provided by the two disassemblers is combined to obtain our disassembly feature set as follows: we tokenize the instruction traces of both disassemblers and extract token unigrams, bigrams, and trigrams within a single line of assembly, as well as 6-grams, which span two consecutive lines of assembly. In addition, we extract single basic blocks of radare2's control flow graphs, as well as pairs of basic blocks connected by control flow.
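To make the construction concrete, the following is a minimal sketch of the normalization and n-gram extraction described above. It assumes instruction traces are available as plain-text lines of assembly; the function names and exact tokenization are illustrative, not the implementation used in the paper.

```python
import re
from collections import Counter

HEX_RE = re.compile(r"0x[0-9a-fA-F]+|\b\d+\b")

def normalize(line):
    """Replace hex/decimal literals with the placeholder 'number' and tokenize."""
    return [t for t in re.split(r"[\s,]+", HEX_RE.sub("number", line).strip()) if t]

def ngram_features(asm_lines):
    """Token uni/bi/tri-grams within a line, plus 6-grams spanning two lines."""
    feats = Counter()
    lines = [normalize(l) for l in asm_lines if l.strip()]
    for toks in lines:
        for n in (1, 2, 3):
            feats.update(" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))
    for a, b in zip(lines, lines[1:]):
        joined = a + b  # two consecutive lines of assembly
        feats.update(" ".join(joined[i:i + 6]) for i in range(len(joined) - 5))
    return feats

print(ngram_features(["test edi, edi", "mov eax, 0x0c"]))
```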

4.2 Feature extraction via decompilation

Decompilers are the second source of information that we consider for feature extraction in this work. In contrast to disassemblers, decompilers do not only uncover the program's machine code instructions, but additionally reconstruct higher-level constructs in an attempt to translate an executable binary into equivalent source code. In particular, decompilers can reconstruct control structures such as different types of loops and branching constructs. We make use of these syntactical features of code as they have been shown to be valuable in the context of source code authorship attribution [15]. For decompilation, we employ the Hex-Rays decompiler [1].

Hex-Rays is a commercial state-of-the-art decompiler. It converts executable programs into human-readable C-like pseudo code to be read by human analysts. It is noteworthy that this code is typically significantly longer than the original source code. For example, decompiling an executable binary generated from 70 lines of source code with Hex-Rays produces on average 900 lines of decompiled code. We extract two types of features from this pseudo code: lexical features and syntactical features.

Figure 2: Feature extraction via decompilation and fuzzy parsing: C-like pseudo code produced by Hex-Rays is transformed into an abstract syntax tree and control-flow graph to obtain syntactic and control-flow features.

Lexical features are simply the word unigrams, which capture the integer types used in a program, names of library functions, and names of internal functions when symbol information is available. Syntactical features are obtained by passing the C-like pseudo code to joern, a fuzzy parser for C that is capable of producing fuzzy abstract syntax trees (ASTs) from Hex-Rays pseudo code output [42]. We derive syntactic features from the abstract syntax tree, which represents the grammatical structure of the program. Such features (illustrated in Figure 2) are AST node unigrams, labeled AST edges, AST node term frequency inverse document frequency (TF-IDF), and AST node average depth. Previous work on source code authorship attribution [15, 41] shows that these features are highly effective in representing programming style.
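As a toy illustration of these syntactic features, the sketch below counts AST node unigrams, parent-child node pairs (a stand-in for labeled AST edges), and average node depth on a hand-built tree. The tuple representation of the AST and the exact feature encodings are assumptions made for illustration; in the paper, these trees come from joern's output.

```python
from collections import Counter

# A toy AST in (label, [children]) form.
ast = ("func", [
    ("decl", [("int", [])]),
    ("if", [("pred", [("<", [])]),
            ("stmt", [("=", [])])]),
])

def ast_features(node, depth=0, feats=None, depths=None):
    """Collect node unigrams, parent->child pairs, and node depths."""
    if feats is None:
        feats, depths = Counter(), []
    label, children = node
    feats["uni:" + label] += 1
    depths.append(depth)
    for child in children:
        feats["edge:" + label + "->" + child[0]] += 1
        ast_features(child, depth + 1, feats, depths)
    return feats, depths

feats, depths = ast_features(ast)
print(feats, sum(depths) / len(depths))  # unigram/edge counts, average depth
```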

4.3 Dimensionality reduction

Feature extraction produces a large number of features, resulting in sparse feature vectors with thousands of elements. However, not all features are equally informative about a programmer's style. This makes it desirable to perform feature selection to obtain a compact representation of the data, reducing both the computational burden during classification and the chance of overfitting. Moreover, sparse vectors may result in a large number of zero-valued attributes being selected during the random forest's random subsampling of attributes to select a best split. Reducing the dimensionality of the feature set is thus important for avoiding overfitting; one example of overfitting would be a rare assembly instruction uniquely identifying an author. For these reasons, we use an information gain criterion followed by correlation-based feature selection to identify the most informative attributes that represent each author as a class. This reduces vector size and sparsity while increasing accuracy and model training speed. For example, we get 750,000 features from the 900 executable binary samples of 100 programmers. If we use all of these features in classification, the resulting de-anonymization accuracy is slightly above 30%, because the random forest may be randomly selecting features with values of zero in the sparse feature vectors. Once the information gain criterion is applied, we are left with fewer than 2,000 features, and the correct classification accuracy for 100 programmers increases from 30% to 90%. Then, we identify locally predictive features that are highly correlated with classes and have low intercorrelation. After this second dimensionality reduction step, we are left with 53 predictive features and no sparsity remains in the feature vectors. Extracting 53 features, or training a machine learning model where each instance has 53 attributes, is computationally efficient. Based on this compact representation of instances, the correct classification accuracy for 100 programmers reaches 96%.

We applied the first dimensionality reduction step using WEKA's information gain attribute selection criterion [20], which evaluates the difference between the entropy of the distribution of classes and the Shannon entropy of the conditional distribution of classes given a particular feature [33]. Information gain can be thought of as measuring the amount of information that observing the value of an attribute gives about the class label associated with the example. We retained only those features that individually had non-zero information gain.

The second dimensionality reduction step was based on correlation-based feature selection, which generates a feature-class and feature-feature correlation matrix. The selection method then evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them [21]. Feature selection is performed iteratively with greedy hill-climbing and backtracking, adding attributes that have the highest correlation with the class to the list of selected features.
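For reference, Hall's correlation-based feature selection (the method cited as [21]) scores a candidate subset S of k features with the merit heuristic

```latex
\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}},
```

where r̄_cf is the mean feature-class correlation and r̄_ff the mean feature-feature intercorrelation; subsets whose features predict the class well but are mutually redundant score poorly.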

4.4 Classification

We use random forests as our classifier. Random forests are ensemble learners built from collections of decision trees, where each tree is trained on a subsample of the data obtained by random sampling with replacement. Random forests are by nature multi-class classifiers that avoid overfitting.

During classification, each test example is classified by each of the trained decision trees by following the binary decisions made at each node until a leaf is reached, and the results are aggregated. The most populous class is selected as the output of the forest for simple classification, or classifications can be ranked according to the number of trees that 'voted' for the label in question when performing relaxed attribution for top-n classification.

We employed random forests with 500 trees, which empirically provided the best tradeoff between accuracy and processing time. Examination of numerous out-of-bag error values across multiple fits suggested that (log M) + 1 random features (where M denotes the total number of features) at each split of the decision trees was in fact optimal in all of the experiments listed in Section 5, and was used throughout. Node splits were selected based on the information gain criterion, and all trees were grown to the largest extent possible, without pruning.
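The following sketch shows this classifier configuration using scikit-learn rather than the WEKA toolkit the experiments were run with; X and y are placeholders, and the base-2 logarithm in max_features is an assumption about the (log M) + 1 rule.

```python
import math
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((900, 53))            # placeholder: 900 samples x 53 features
y = np.repeat(np.arange(100), 9)     # placeholder: 100 authors, 9 samples each

M = X.shape[1]
clf = RandomForestClassifier(
    n_estimators=500,                    # 500 trees, as in the paper
    max_features=int(math.log2(M)) + 1,  # (log M) + 1 features per split
    criterion="entropy",                 # information-gain-based splits
    random_state=0,
)
clf.fit(X, y)
print(clf.predict(X[:1]))                # most-voted author label
```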

The data was analyzed via k-fold cross-validation, where the data was split into training and test sets stratified by author (ensuring that the number of code samples per author in the training and test sets was identical across authors). The parameter k varies by dataset and equals the number of instances present from each author. The cross-validation procedure was repeated 10 times, each with a different random seed, and average results across all iterations are reported, ensuring that results are not biased by improbably easy or difficult to classify subsets.

We report our classification results in terms of the kappa statistic, which is roughly equivalent to accuracy but subtracts the probability of correct classification by random chance from the final accuracy. As programmer de-anonymization is a multi-class classification problem, an evaluation based on accuracy, or the true positive rate, represents the correct classification rate in the most meaningful way.
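Concretely, Cohen's kappa (the standard definition, not restated in the text) is

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
```

where p_o is the observed accuracy and p_e is the accuracy expected by chance; κ = 1 indicates perfect agreement and κ = 0 indicates chance-level performance.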

5. GOOGLE CODE JAM EXPERIMENTS

In this section, we go over the details of the various experiments we performed to address the research question formulated in Section 2.

5.1 Dataset

We evaluate our executable binary authorship attribution method on a dataset based on the annual programming competition GCJ [5]. Thousands of programmers take part each year, including professionals, students, and hobbyists from all over the world. The contestants implement solutions to the same tasks in a limited amount of time, in a programming language of their choice. Accordingly, all of the correct solutions have the same algorithmic functionality.

There are two main reasons for choosing GCJ competition solutions as an evaluation corpus. First, it enables us to directly compare our results to previous work on executable binary authorship attribution, as both [9] and [36] evaluate their approaches on data from GCJ. Second, we eliminate the potential confounding effect of identifying programming task rather than programmer: since all contestants implement the same tasks, a classifier cannot succeed merely by identifying functionality properties instead of stylistic properties. GCJ is a clean, low-noise dataset known with certainty to be single-authored. GCJ solutions do not have significant dependencies outside of the standard library and contain few or no third-party libraries.

We focus our analysis on compiled C++ code, the most popular programming language used in the competition. We collect the solutions from the years 2008 to 2014 along with author names and problem identifiers.

5.2 Code Compilation

To create our experimental datasets, we first compiled the source code with the GNU Compiler Collection's gcc or g++, without any optimization, to Executable and Linkable Format (ELF) 32-bit Intel 80386 Unix binaries.

Next, to measure the effect of different compilation options, such as compiler optimization flags, we additionally compiled the source code with level-1, level-2, and level-3 optimizations, namely the -O1, -O2, and -O3 flags. The compiler attempts to improve performance and/or code size when these flags are turned on, at the expense of increased compilation time and more difficult program debugging.

5.3 53 features represent programmer style.

We are interested in identifying features that represent coding style preserved in executable binaries. With the current approach, we extract 750,000 representations of code properties of 100 authors, but only a subset of these representations are the result of individual programming style. We are able to capture the features that represent each author's programming style preserved in executable binaries by applying the information gain criterion to these 750,000 features. After applying information gain to effectively represent coding style, we reduce the feature set to approximately 1,600 features from all feature types. Furthermore, correlation-based feature selection during cross-validation eliminates features that have low class correlation and high intercorrelation, preserving 53 of the most distinguishing features. Considering that we reach such high accuracies in de-anonymizing 100 programmers with 900 executable binary samples (discussed below), these features provide a strong representation of style that survives compilation. The compact set of identifying stylistic features contains features of all types, namely disassembly, CFG, and syntactical decompiled code properties. To examine the potential for overfitting, we consider the ability of this feature set to generalize to a different set of programmers (see Section 5.5), and show that it does so, further supporting our belief that these features effectively capture coding style.

5.4 We can de-anonymize programmers from their executable binaries.

This is the main experiment demonstrating that de-anonymizing programmers from their executable binaries is possible. After preprocessing the dataset to generate the executable binaries without optimization, we further process the executable binaries to obtain the disassembly, control flow graphs, and decompiled source code. We then extract all the possible features detailed in Section 4. We take a set of 100 programmers who all have 9 executable binary samples. With 9-fold cross-validation, the random forest classifier correctly classifies 900 test instances with 95% accuracy, which is significantly higher than the accuracies reached in previous work.

There is an emphasis on the number of folds used in these experiments because each fold corresponds to the implementation of the same algorithmic function by all the programmers in the GCJ dataset (e.g., all samples in fold 1 may be attempts by the various authors to solve a list sorting problem). Since we know that each fold corresponds to the same Code Jam problem, by using stratified cross-validation without randomization and preserving order, we ensure that all training and test samples contain the same algorithmic functions implemented by all of the programmers. The classifier uses the excluded fold in the testing phase, which contains executable binary samples generated from an algorithmic function that was not previously observed in the training set for that classifier. Consequently, the only distinction between the test instances is the coding style of the programmer, without the potentially confounding effect of identifying an algorithmic function.
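This fold construction can be approximated with scikit-learn's GroupKFold, treating each GCJ problem as a group so that no problem appears in both training and test data. This is a sketch with placeholder data, not the exact order-preserving procedure used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.random((900, 53))                # placeholder feature matrix
y = np.tile(np.arange(100), 9)           # author labels: 100 authors
problems = np.repeat(np.arange(9), 100)  # one group id per GCJ problem

scores = cross_val_score(
    RandomForestClassifier(n_estimators=500, random_state=0),
    X, y,
    cv=GroupKFold(n_splits=9),           # each fold holds out one problem
    groups=problems,
)
print(scores.mean())
```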


5.5 The feature set selected via dimensionality reduction works and is validated across different sets of programmers.

In our earlier experiments, we trained the classifier on the same set of executable binaries that we used during feature selection. The high number of starting features from which we select our final feature set via dimensionality reduction does raise the potential concern of overfitting. To examine this, we applied this final feature set to a different set of programmers and executable binaries. If we are able to reach accuracies similar to what we obtained earlier, we can conclude that the selected features generalize to other programmers and problems, and therefore are not overfitted to the 100 programmers they were generated from. This would also suggest that the final set of features in general captures programmer style.

Recall that analyzing 900 executable binary samples of the 100 programmers resulted in about 750,000 features, and after dimensionality reduction, we are left with 53 important features. We picked a different (non-overlapping) set of 100 programmers and performed another de-anonymization experiment in which the feature selection step was omitted, using instead the information gain and correlation based features obtained from the original experiment. This resulted in very similar accuracies: we de-anonymized programmers in the validation set with 96% accuracy by using features selected via the main development set, compared to the 95% de-anonymization accuracy we achieve on the programmers of the main development set. The ability of the final reduced set of 53 features to generalize beyond the dataset which guided their selection strongly supports the assertion that these features, obtained from the main set of 100 programmers, are not overfitted; they actually represent coding style in executable binaries and can be used across different datasets.

5.6 Large Scale De-anonymization: We can de-anonymize 600 programmers from their executable binaries.

We would like to see how well our method scales up to 600 users. An analyst with a large set of labeled samples might be interested in performing large scale de-anonymization. For this experiment, we use 600 contestants from GCJ, each with 9 files. We extract only the reduced set of features from the 600 users, which decreases the amount of time required for feature extraction. At the same time, this experiment shows how effectively overall programming style is represented after dimensionality reduction. The results of large scale programmer de-anonymization, shown in Figure 3, demonstrate that our method scales to larger datasets with the reduced set of features, with a surprisingly small drop in accuracy.

5.7 We advance the state of executable binary authorship attribution.

Rosenblum et al. presented the largest scale evaluation of executable binary authorship attribution on 191 programmers, each with at least 8 training samples [36]. We compare our results with Rosenblum et al.'s in Table 1 to show how we advance the state of the art both in accuracy and on larger datasets. Rosenblum et al. use 1,900 coding style features to represent coding style whereas we use 53 features, which might suggest that our features are more powerful in representing coding style that is preserved in executable binaries.

Figure 3: Large Scale Programmer De-anonymization. Correct classification accuracy versus number of authors: 99% (20), 96% (100), 92% (200), 89% (300), 85% (400), 83% (500), and 83% (600).

On the other hand, we use fewer training samples than Rosenblum et al., which makes our experiments more challenging from a machine learning standpoint. Our accuracy in authorship attribution is significantly higher than Rosenblum et al.'s, even when we use an SVM as our classifier, showing that our different approach is more powerful and robust for de-anonymizing programmers. Rosenblum et al. suggest a linear SVM is the appropriate classifier for de-anonymizing programmers, but we show that our different set of techniques and choice of random forests leads to superior and larger scale de-anonymization.

Related Work     Number of Programmers   Number of Training Samples   Accuracy   Classifier
Rosenblum [36]   20                      8-16                         77%        SVM
This work        20                      8                            90%        SVM
This work        20                      8                            99%        RF
Rosenblum [36]   100                     8-16                         61%        SVM
This work        100                     8                            84%        SVM
This work        100                     8                            96%        RF
Rosenblum [36]   191                     8-16                         51%        SVM
This work        191                     8                            81%        SVM
This work        191                     8                            92%        RF
This work        600                     8                            71%        SVM
This work        600                     8                            83%        RF

Table 1: Comparison to Previous Results

5.8 Programmer style is preserved in executable binaries.

We show throughout the results that it is possible to de-anonymize programmers from their executable binaries with high accuracy. To quantify how stylistic features are preserved in executable binaries, we calculated the correlation of stylistic source code features and decompiled code features. We used the stylistic source code features from previous work on de-anonymizing programmers from their source code [15]. We took the 150 most important coding style features, which consist of AST node average depth, AST node TF-IDF, and the frequencies of AST nodes, AST node bigrams, word unigrams, and C++ keywords. For each executable binary sample, we have the corresponding source code sample. We extract 150 information gain features from the original source code, and we extract decompiled source code features from the decompiled executable binaries.

Figure 4: Feature Transformations. Each data point on the x-axis is a different executable binary sample. Each y-axis value is the cosine similarity between the feature vector extracted from the original source code and the feature vector that tries to predict the original features. The average value of these 900 cosine similarity measurements is 0.81, suggesting that decompiled code preserves transformed forms of the original source code features well enough to reconstruct the original source code features.

For each executable binary instance, we set one corresponding information gain feature as the class to predict, and then we calculate the correlation between the decompiled executable binary features and the class value. A random forest classifier with 500 trees predicts the class value of each instance, and then Pearson's correlation coefficient is calculated between the predicted and original values. The correlation has a mean of 0.32 and ranges from -0.12 to 0.69 for the most important 150 features.

To see how well we can reconstruct the original source code features from decompiled executable binary features, we reconstructed the 900 instances with the 150 features that represent the highest information gain features. We calculated the cosine similarity between the original 900 instances and the reconstructed instances after normalizing the features to unit distance. The cosine similarity for these instances is shown in Figure 4, where a cosine similarity of 1 means the two feature vectors are identical. The high values (average of 0.81) in cosine similarity suggest that the reconstructed features are similar to the original features. When we calculate the cosine similarity between the feature vectors of the original source code and the corresponding decompiled code's feature vectors (without predictions), the average cosine similarity is 0.35. In summary, reconstructed features are much more similar to the original code than the raw features extracted from decompiled code. This result suggests that decompiled code preserves transformed forms of the original code features well enough to reconstruct the original source code features.
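A minimal sketch of the similarity computation, assuming the original and reconstructed feature matrices are available as NumPy arrays; the data below is synthetic placeholder data, not the paper's.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
original = rng.random((900, 150))         # original source-code features
reconstructed = original + rng.normal(0.0, 0.3, original.shape)  # stand-in

sims = [cosine(o, r) for o, r in zip(original, reconstructed)]
print(float(np.mean(sims)))  # the paper reports an average of 0.81 on its data
```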

6. REAL-WORLD SCENARIOS

6.1 Programmers of optimized executable binaries can be de-anonymized.

In Section 5, we evaluated our approach on a controlled and clean real-world dataset and showed how we advance over previous methods, all of which were evaluated on clean datasets such as GCJ or homework assignments. In this section, we investigate a more complicated dataset that has been optimized during compilation, where the executable binary samples have been normalized further by the compiler.

Compiling with optimization tries to minimize or maximize some attributes of an executable program. The goal of optimization is to minimize execution time or the amount of memory a program occupies. The compiler applies optimizing transformations, which are algorithms that take a program and transform it into a semantically equivalent program that uses fewer resources.

GCC has predefined optimization levels that turn on sets of optimization flags. Compilation with optimization level 1 tries to reduce code size and execution time; it takes more time and much more memory for large functions than compilation without optimizations. Compilation with optimization level 2 uses all level-1 optimization flags and more, performing all optimizations that do not involve a space-speed tradeoff; it increases compilation time and the performance of the generated code compared to level 1. Level-3 optimization optimizes yet more aggressively than both level 1 and level 2.

So far, we have shown that programming style features survive compilation without any optimizations. As compilation with optimizations transforms code further, we investigate how much programming style is preserved in executable binaries that have gone through compilation with optimization. Our results, summarized in Table 2, show that programming style is preserved to a great extent even in the most aggressive level-3 optimization. This shows that programmers of optimized executable binaries can be de-anonymized and that optimization is not a highly effective code anonymization method.

Number of Programmers   Number of Training Samples   Compiler Optimization Level   Accuracy
100                     8                            None                          96%
100                     8                            1                             93%
100                     8                            2                             89%
100                     8                            3                             89%

Table 2: Programmer De-anonymization with Compiler Optimization

6.2 Removing symbol information does not anonymize executable binaries.

To investigate the relevance of symbol information for classification accuracy, we repeat our experiments with the 100 authors presented in the previous section on fully stripped executable binaries, that is, executable binaries from which symbol information has been removed completely. We obtain these executable binaries by running the standard utility GNU strip on each executable binary sample prior to analysis. Upon removal of symbol information, without any optimizations, we notice a decrease in classification accuracy of 24%, showing that stripping symbol information from executable binaries is not effective enough to anonymize an executable binary sample.

6.3 We can de-anonymize programmers from obfuscated binaries.


We are furthermore interested in finding out whether our method is capable of dealing with simple binary obfuscation techniques as implemented by tools such as Obfuscator-LLVM [23]. These obfuscators substitute instructions with other semantically equivalent instructions, introduce bogus control flow, and can even completely flatten control flow graphs.

For this experiment, we consider a set of 100 programmers from the GCJ dataset, who all have 9 executable binary samples. This is the same dataset as considered in our main experiment (see Section 5.4); however, we now apply all three obfuscation techniques implemented by Obfuscator-LLVM to the samples prior to learning and classification.

We proceed to train a classifier on obfuscated samples. This approach is feasible in practice, as an analyst who has only non-obfuscated samples available can easily obfuscate them to obtain the necessary obfuscated samples for classifier training. Using the same features as in Section 5.4, we obtain an accuracy of 88% in correctly classifying authors.

6.4 De-anonymization in the Wild

To better assess the applicability of our programmer de-anonymization approach in the wild, we extend our experiments to code collected from real open-source programs as opposed to solutions for programming competitions. To this end, we automatically collected source files from the popular open-source collaboration platform GitHub [4]. Starting from a seed set of popular repositories, we traversed the platform to obtain C/C++ repositories that meet the following criteria: only one author has committed to the repository; the repository is popular, as indicated by the presence of at least 5 stars, a measure of popularity for repositories on GitHub; it is sufficiently large, containing a total of at least 200 lines; and the repository is not a fork of another repository, nor is it named 'linux', 'kernel', 'osx', 'gcc', 'llvm', or 'next', as such repositories are typically copies of the so-named projects.

We cloned 439 repositories from 161 authors meeting these criteria and collected only C/C++ files for which the main author had contributed at least 5 commits and whose commit messages do not contain the word 'signed-off', a message that typically indicates that the code was written by another person. An author and her files are included in the dataset only if she has written at least 10 different files. In the final step, we manually verified ground truth on authorship for the selected files to make sure that they do not show any clear signs of code reuse from other projects. The resulting dataset had 2 to 344 files and 2 to 8 repositories per author, with a total of 3,438 files.

We subsequently compile the collected projects to obtain object files for each of the selected source files. We perform our experiment on object files as opposed to entire binaries, since the object files are the binary representations of the source files that clearly belong to the specified authors.

For different reasons, compiling code may not be possible for a project, e.g., the code may not be in a compilable state, it may not be compilable for our target platform (32-bit Intel, Linux), or the necessary files to set up a working build environment can no longer be obtained. Despite these difficulties, we are able to generate 1,075 object files from 90 different authors, where the number of object files per author ranges from 2 to 24, with most authors having at least 9 samples. We used 50 of these authors that have 6 to 15 files to perform a machine learning experiment with more balanced class sizes.

From this GitHub dataset, we extract the information gain features that were selected on the GCJ data. The corresponding classifier reaches an accuracy of 65% in correctly identifying the authors of executable binary samples.
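A hedged sketch of this selection-plus-classification step: our information gain criterion is approximated here with scikit-learn's mutual_info_classif, and the feature matrix X and author labels y are assumed to come from the feature extraction pipeline.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score

def select_and_classify(X: np.ndarray, y: np.ndarray, k: int = 50) -> float:
    """Reduce to the k most informative features, then cross-validate."""
    X_reduced = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    # cv=9 matches the nine GCJ samples per author; adjust for the
    # more variable GitHub class sizes.
    return cross_val_score(clf, X_reduced, y, cv=9).mean()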

Being able to de-anonymize programmers in the wild by using a small number of features obtained from our clean development dataset is a promising step towards attacking more challenging real-world de-anonymization problems.

6.5 Have I seen this programmer before?

While attempting to de-anonymize programmers in real-world settings, we cannot be certain that we have formerly encountered code samples from the programmers in the test set. As a mechanism to check whether an anonymous test file belongs to one of the candidate programmers in the training set, we extend our method to an open world setting by incorporating classification confidence thresholds. In random forests, the class probability or classification confidence P(Bi) that executable binary B is of class i is calculated by taking the percentage of trees in the random forest that voted for class i during classification.
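This confidence is what scikit-learn exposes as predict_proba for random forests; a minimal sketch, assuming a classifier clf fitted as in the earlier sketch and a test feature matrix X_test:

def classification_confidence(clf, X_test):
    """Return the predicted author and confidence P(Bi) per test sample."""
    # predict_proba averages the votes of the individual trees,
    # matching the definition of classification confidence above.
    probs = clf.predict_proba(X_test)   # shape: (n_samples, n_authors)
    return clf.classes_[probs.argmax(axis=1)], probs.max(axis=1)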

We performed 900 classifications in a 100-class problem to determine the confidence threshold based on the training data. The accuracy was 95%. There were 40 misclassifications, with an average classification confidence of 0.49. We then took another set of 100 programmers with 900 samples. We classified these 900 samples with the closed world classifier that was trained in the first step on samples from a disjoint set of programmers. All of the 900 samples are attributed to a programmer in the closed world classifier with a mean classification confidence of 0.40. We can pick a verification threshold and reject all classifications with confidence below the selected threshold. Accordingly, all the rejected open world samples and misclassifications become true negatives, and the rejected correct classifications end up as false negatives. Open world samples and misclassifications above the threshold are false positives, and the correct classifications above the threshold are true positives. Based on this, we generate an accuracy, precision, and recall graph with varying confidence threshold values in Figure 5. This figure shows that the optimal rejection threshold to guarantee 90% accuracy on 1,800 samples and 100 classes is at a confidence of around 0.72. Other confidence thresholds can be picked based on precision and recall trade-offs. These results are encouraging for extending our programmer de-anonymization method to open world settings where an analyst deals with many uncertainties under varying fault tolerance levels.
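The threshold analysis itself reduces to a simple sweep. The following sketch encodes the true/false positive and negative definitions given above, assuming arrays of closed world confidences, a boolean mask of correct classifications, and confidences for the disjoint open world set.

import numpy as np

def sweep_thresholds(conf_closed, correct, conf_open, thresholds):
    """Yield (threshold, accuracy, precision, recall) per the definitions
    above: rejected open world samples and misclassifications are true
    negatives; rejected correct classifications are false negatives."""
    for t in thresholds:
        accept = conf_closed >= t
        tp = np.sum(accept & correct)
        fp = np.sum(accept & ~correct) + np.sum(conf_open >= t)
        fn = np.sum(~accept & correct)
        tn = np.sum(~accept & ~correct) + np.sum(conf_open < t)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        yield t, accuracy, precision, recall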

6.6 Case Study: Nulled.IO Hacker Forum

On May 6, 2016, the well-known 'hacker' forum Nulled.IO was compromised and its forum dump was leaked along with the private messages of its 585,897 members. The members of these forums share, sell, and buy stolen credentials and cracking software. A high number of the forum members are active developers who write their own code and sell it, or share some of their code for free in public GitHub repositories along with tutorials on how to use it. The private messages of the sellers in the forum include links to their products and even to screenshots of how the products work, for buyers. We were able to find declared authorship along with active links to members' software on sharing sites such as FileDropper2 and MediaFire3 in the private messages.

2 www.filedropper.com: 'Simplest File Hosting Website..'
3 www.mediafire.com: 'All your media, anywhere you go'

[Figure 5: Confidence Thresholds for Verification. Accuracy, precision, and recall are plotted as the classification confidence threshold varies from 0 to 1.]

For our case study, we created a dataset from four forum members with a total of thirteen Windows executables. One of the members had only one sample, which we used to test the open world setting described in Section 6.5. A challenge encountered in this case study is that the binary programs obtained from Nulled.IO do not contain native code, but bytecode for the Microsoft Common Language Infrastructure (CLI). Therefore, we cannot immediately analyze them using our existing toolchain. We address this problem by first translating bytecode into corresponding native code using the Microsoft Native Image Generator (ngen.exe), and subsequently forcing the decompiler to treat the generated output files as regular native code binaries. On the other hand, radare2 is not able to disassemble such output or the original executables. Consequently, we had access to only a subset of the information gain feature set obtained from GCJ. We extracted a total of 605 features, consisting of decompiled source code features and ndisasm disassembly features. Nevertheless, we are able to de-anonymize these programmers with 100% accuracy, while the one sample from the open world class is classified in all cases with the lowest classification confidence (around 0.4), which is below the verification threshold and is recognized by the classifier as a sample that does not belong to the rest of the programmers.

A larger de-anonymization attack can be carried out by collecting code from GitHub users with relevant repositories and identifying all the available executables mentioned in the public portions of hacker forums. GitHub source code can be compiled with the necessary parameters and used with the approach described in Section 6.4. Incorporating verification thresholds from Section 6.5 can help handle programmers with only one sample. Consequently, a large portion of such members can be linked, reduced to a cluster, or directly de-anonymized.

The countermeasure against real-world programmer de-anonymization attacks requires a combination of various precautions. Developers should not have any public repositories. A set of programs should not be released by the same online identity. Programmers should try to have a different coding style in each piece of software they write and also try to code in different programming languages. Software should utilize different optimizations and obfuscations to avoid deterministic patterns. A programmer who accomplishes randomness across all potential identifying factors would be very difficult to de-anonymize. Nevertheless, even the most privacy-savvy developer might be willing to contribute to open source software or build a reputation for her identity based on her set of products, which would be a challenge for maintaining anonymity.

7. DISCUSSION

We consider two datasets: the GCJ dataset and a dataset based on GitHub repositories. Using the GitHub dataset, we show that we can perform programmer de-anonymization with executable binary authorship attribution in the wild. We de-anonymize GitHub programmers by using stylistic features obtained from the GCJ dataset. Using the same small set of features, we perform a case study on the leaked hacker forum Nulled.IO and de-anonymize four of its members. The successful de-anonymization of programmers from different sources supports the supposition that, in addition to its other useful properties for scientific analysis of attribution tasks, the GCJ dataset is a valid and useful proxy for real-world authorship attribution tasks.

The advantage of using the GCJ dataset is that we can perform the experiments in a controlled environment where the most distinguishing difference between programmers' solutions is their programming style. Every contestant implements the same functionality, in a limited amount of time, while the problems become more difficult at each round. This provides the opportunity to control the difficulty level of the samples and the skill set of the programmers in the dataset. In source code authorship attribution, programmers who can implement more sophisticated functionality have a more distinct programming style [15]. We observe the same pattern in executable binary samples and gain some software engineering insights by analyzing stylistic properties of executable binaries. In contrast to GCJ, GitHub and Nulled.IO offer noisy samples. However, our results show that we can de-anonymize programmers with high accuracy as long as enough training data is available.

Previous work shows that coding style is quite prevalent in source code. We were surprised to find that it is also preserved to a great degree in compiled source code. We can de-anonymize programmers from compiled source code with great accuracy, and furthermore, we can de-anonymize programmers from source code compiled with optimization or after obfuscation. In our experiments, we see that even though basic obfuscation, optimization, or stripping of symbols transforms executable binaries more than plain compilation does, stylistic features are still preserved to a large degree. Such methods are not sufficient on their own to protect programmers from de-anonymization attacks.

In scenarios where authorship attribution is challenging, an analyst or adversary could apply relaxed attribution to find a suspect set of n authors, instead of a direct top-1 classification. In top-10 attribution, the chance of having the original author within the returned set of 10 authors approaches 100%. Once the suspect set size is reduced from hundreds to 10, the analyst or adversary could adhere to content-based dynamic approaches and reverse engineering to identify the author of the executable binary sample.
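Relaxed top-n attribution is a small extension of the classifier output; a sketch, reusing the fitted random forest from the earlier sketches:

import numpy as np

def top_n_authors(clf, sample, n=10):
    """Rank candidate authors by class probability for one sample."""
    probs = clf.predict_proba(sample.reshape(1, -1))[0]
    ranked = np.argsort(probs)[::-1][:n]  # indices of the n likeliest classes
    return [(clf.classes_[i], float(probs[i])) for i in ranked]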

Even though executable binaries look cryptic and difficult to analyze, we can still extract many useful features from them. We extract features from disassembly, control flow graphs, and also decompiled code to identify features relevant only to programming style. After dimensionality reduction, we see that each of the feature spaces provides programmer style information. The initial development feature set contains a total of 750,000 features for 900 executable binary samples of 100 authors. Approximately 50 features suffice to capture enough key information about coding style to enable robust authorship attribution. We see that the reduced set of features remains valid in different datasets with different programmers, including optimized or obfuscated samples. Also, the reduced feature set is helpful in scaling up the programmer de-anonymization approach. While we can identify 100 programmers with 96% accuracy, we can de-anonymize 600 programmers with 83% accuracy using the same reduced set of features. 83% is a very high number for such a challenging de-anonymization task, where the random chance of correctly identifying an author is 0.17%.

8. LIMITATIONS

Our experiments suggest that our method is able to assist in de-anonymizing a much larger set of programmers with significantly higher accuracy than state-of-the-art approaches. However, there are also assumptions that underlie the validity of our experiments, as well as inherent limitations of our method, which we discuss in the following paragraphs.

First, we assume that our ground truth is correct, but in reality programs in GCJ or on GitHub might be written by programmers other than the stated programmer, or by multiple programmers. Such a ground truth problem would cause the classifier to train on noisy models, which would lead to lower de-anonymization accuracy and a noisy representation of programming style.

Second, many source code samples from GCJ contestants cannot be compiled. Consequently, we perform evaluation only on the subset of samples which can be compiled. This has two effects: first, we are performing attribution with fewer executable binary samples than the number of available source code samples. This is a limitation for our experiments, but it is not a limitation for an attacker who first gets access to the executable binary instead of the source code. If the attacker gets access to the source code instead, she could perform regular source code authorship attribution. Second, we must assume that whether or not a code sample can be compiled does not correlate with the ease of attribution for that sample.

Third, we mainly focus on C/C++ code compiled (except the Nulled.IO samples) using the GNU compiler gcc in this work, and assume that the executable binary format is the Executable and Linking Format. This is important to note, as dynamic symbols are typically present in ELF binary files even after stripping of symbols, which may ease the attribution task relative to other executable binary formats that may not contain this information. We defer an in-depth investigation of the impact that other compilers, languages, and binary formats might have on the attribution task to future work.

Finally, while we show that our method is capable of dealing with simple binary obfuscation techniques, we do not consider executable binaries that are heavily obfuscated to hinder reverse engineering. While simple systems, such as packers [2] or encryption stubs that merely restore the original executable binary into memory during execution, may be analyzed by simply recovering the unpacked or decrypted executable binary from memory, more complex approaches are becoming increasingly commonplace. A wide range of anti-forensic techniques exist [18], including methods that are designed specifically to prevent easy access to the original bytecode in memory via such techniques as modifying the process environment block or triggering decryption on the fly via guard pages. Other techniques such as virtualization [3] transform the original bytecode into emulated bytecode running on virtual machines, making decompilation both labor-intensive and error-prone. Finally, the use of specialized compilers that lack decompilers and produce nonstandard machine code (see [16] for an extreme but illustrative example) may likewise hinder our approach, particularly if the compiler is not available and cannot be fingerprinted. We leave the examination of these techniques, both with respect to their impact on authorship attribution and to possible mitigations, to future work.

9. CONCLUSION

De-anonymizing programmers has direct implications for privacy and anonymity. The ability to attribute authorship to anonymous executable binary samples has applications in software forensics, and is an immediate concern for programmers who would like to remain anonymous. We show that coding style is preserved in compilation, contrary to the belief that compilation wipes away stylistic properties. We de-anonymize 100 programmers from their executable binary samples with 96% accuracy, and 600 programmers with 83% accuracy. Moreover, we show that we can de-anonymize GitHub developers or hacker forum members with high accuracy. Our work, while significantly improving on the limited prior approaches to programmer de-anonymization, presents new methods to de-anonymize programmers in the wild from challenging real-world samples.

We discover a small set of features that effectively represent coding style in executable binaries. We obtain this precise representation of coding style via two different disassemblers, control flow graphs, and a decompiler. With this comprehensive representation, we are able to re-identify GitHub authors from their executable binary samples in the wild, where we reach an accuracy of 65% for 50 programmers, even though these samples are noisy and products of collaborative efforts.

Programmer style is embedded in executable binaries to a surprising degree, even when the binaries are obfuscated, generated with aggressive compiler optimizations, or stripped of symbols. Compilation, binary obfuscation, optimization, and stripping of symbols reduce the accuracy of stylistic analysis but are not effective in anonymizing coding style.

In future work, we plan to investigate snippet- and function-level stylistic information to de-anonymize multiple authors of collaboratively generated binaries. We also defer the analysis of highly sophisticated compilation and obfuscation methods to future work. Nevertheless, we show that stylistic information is prevalent in real-world settings, and accordingly developers cannot assume anonymity unless they take extreme precautions as a countermeasure. Examples of possible countermeasures include a combination of randomized coding style, the use of different programming languages, and the employment of a nondeterministic set of obfuscation methods. Since incorporating different programming languages or obfuscation methods is not always practical, especially for open source software contributors, our future work will focus on completely stripping stylistic information from binaries to render them anonymous.


Acknowledgment

This material is based on work supported by the ARO (U.S. Army Research Office) Grant W911NF-15-2-0055 and an AWS in Education Research Grant award. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein. This material is also based on work supported by the ARO (U.S. Army Research Office) Grant W911NF-14-1-0444, the DFG (German Research Foundation) under the project DEVIL (RI 2469/1-1), and an AWS in Education Research Grant award. This research was supported in part by the Center for Information Technology Policy at Princeton University.

10. REFERENCES

[1] Hex-Rays decompiler, November 2015.

[2] UPX: the ultimate packer for executables. upx.sourceforge.net, November 2015.

[3] Oreans Technology: Code Virtualizer, November 2015.

[4] The GitHub repository hosting service. http://www.github.com, visited November 2015.

[5] Google Code Jam programming competition. code.google.com/codejam, visited November 2015.

[6] S. Afroz, M. Brennan, and R. Greenstadt. Detecting hoaxes, frauds, and deception in writing style online. In Proc. of IEEE Symposium on Security and Privacy. IEEE, 2012.

[7] S. Afroz, A. Caliskan-Islam, A. Stolerman, R. Greenstadt, and D. McCoy. Doppelganger finder: Taking stylometry to the underground. In Proc. of IEEE Symposium on Security and Privacy, 2014.

[8] A. Aiken et al. Moss: A system for detecting software plagiarism. University of California-Berkeley. See www.cs.berkeley.edu/aiken/moss.html, 9, 2005.

[9] S. Alrabaee, N. Saleem, S. Preda, L. Wang, and M. Debbabi. OBA2: An onion approach to binary code authorship attribution. Digital Investigation, 11, 2014.

[10] E. Backer and P. van Kranenburg. On musical stylometry—a pattern recognition approach. Pattern Recognition Letters, 26(3):299–309, 2005.

[11] M. Brennan, S. Afroz, and R. Greenstadt. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM TISSEC, 15(3), 2012.

[12] S. Burrows and S. M. Tahaghoghi. Source code authorship attribution using n-grams. In Proc. of the Australasian Document Computing Symposium, 2007.

[13] S. Burrows, A. L. Uitdenbogerd, and A. Turpin. Application of information retrieval techniques for source code authorship attribution. In Database Systems for Advanced Applications, 2009.

[14] S. Burrows, A. L. Uitdenbogerd, and A. Turpin. Comparing techniques for authorship attribution of source code. Software: Practice and Experience, 44(1):1–32, 2014.

[15] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt. De-anonymizing programmers via code stylometry. In Proc. of the USENIX Security Symposium, 2015.

[16] C. Domas. M/o/vfuscator, November 2015.

[17] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (JMLR), 9, 2008.

[18] P. Ferrie. Anti-unpacker tricks - part one. Virus Bulletin, 4, 2008.

[19] G. Frantzeskou, E. Stamatatos, S. Gritzalis, and S. Katsikas. Effective identification of source code authors using byte-level information. In Proc. of the International Conference on Software Engineering. ACM, 2006.

[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11, 2009.

[21] M. A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.

[22] E. R. Jacobson, N. Rosenblum, and B. P. Miller. Labeling library functions in stripped binaries. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools, pages 1–8. ACM, 2011.

[23] P. Junod, J. Rinaldini, J. Wehrli, and J. Michielin. Obfuscator-LLVM – software protection for the masses. In Proc. of the IEEE/ACM 1st International Workshop on Software Protection, SPRO'15, 2015.

[24] A. Keromytis. Enhanced attribution. DARPA-BAA-16-34, 2016.

[25] J. Kothari, M. Shevertalov, E. Stehle, and S. Mancoridis. A probabilistic approach to source code authorship identification. In Information Technology, ITNG'07. IEEE, 2007.

[26] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the International Conference on Machine Learning (ICML), 2001.

[27] R. C. Lange and S. Mancoridis. Using code metric histograms and genetic algorithms to perform author identification for software forensics. In Proceedings of the Annual Conference on Genetic and Evolutionary Computation. ACM, 2007.

[28] A. W. McDonald, S. Afroz, A. Caliskan, A. Stolerman, and R. Greenstadt. Use fewer instances of the letter "i": Toward writing style anonymization. In Privacy Enhancing Technologies, pages 299–318. Springer Berlin Heidelberg, 2012.

[29] T. C. Mendenhall. The characteristic curves of composition. Science, pages 237–249, 1887.

[30] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song. On the feasibility of internet-scale author identification. In Proc. of IEEE Symposium on Security and Privacy, 2012.

[31] pancake. Radare. radare.org, visited October 2015.

[32] B. N. Pellin. Using classification techniques to determine source code authorship.

[33] J. Quinlan. Induction of decision trees. Machine Learning, 1(1), 1986.

[34] N. A. Quynh. Capstone. capstone-engine.org, visited October 2015.

[35] N. Rosenblum, B. P. Miller, and X. Zhu. Recovering the toolchain provenance of binary code. In Proc. of the International Symposium on Software Testing and Analysis. ACM, 2011.

[36] N. Rosenblum, X. Zhu, and B. Miller. Who wrote this code? Identifying the authors of program binaries. Computer Security – ESORICS, 2011.

[37] N. E. Rosenblum, B. P. Miller, and X. Zhu. Extracting compiler provenance from program binaries. In Proceedings of the 9th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering. ACM, 2010.

[38] National Science and Technology Council. Federal cybersecurity research and development strategic plan. whitehouse.gov/files/documents, 2016.

[39] S. Tatham and J. Hall. The Netwide Disassembler: NDISASM. http://www.nasm.us/doc/nasmdoca.html, visited October 2015.

[40] P. van Kranenburg. Composer attribution by quantifying compositional strategies. In ISMIR, 2006.

[41] W. Wisse and C. Veenman. Scripting DNA: Identifying the JavaScript programmer. Digital Investigation, 2015.

[42] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck. Modeling and discovering vulnerabilities with code property graphs. In Proc. of IEEE Symposium on Security and Privacy, 2014.