
DSpot: Test Amplification for Automatic Assessment of Computational Diversity

Benoit Baudry∘, Simon Allier∘, Marcelino Rodriguez-Cancio‡∘ and Martin Monperrus†∘

∘ Inria, France; ‡ University of Rennes 1, France; † University of Lille, France

Contact: benoit.baudry@inria.fr

ABSTRACT

Context: Computational diversity, i.e., the presence of a set of programs that all perform compatible services but that exhibit behavioral differences under certain conditions, is essential for fault tolerance and security.

Objective: We aim at proposing an approach for automatically assessing the presence of computational diversity. In this work, computationally diverse variants are defined as (i) sharing the same API, (ii) behaving the same according to an input-output based specification (a test suite) and (iii) exhibiting observable differences when they run outside the specified input space.

Method: Our technique relies on test amplification. We propose source code transformations on test cases to explore the input domain and systematically sense the observation domain. We quantify computational diversity as the dissimilarity between observations on inputs that are outside the specified domain.

Results: We run our experiments on 472 variants of 7 classes from large, open-source and thoroughly tested Java projects. Our test amplification multiplies by ten the number of input points in the test suite and is effective at detecting software diversity.

Conclusion: The key insights of this study are: the systematic exploration of the observable output space of a class provides new insights about its degree of encapsulation; the behavioral diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.).

KEYWORDS: software diversity, software testing, test amplification, dynamic analysis

1 INTRODUCTION

Computational diversity, i.e., the presence of a set of programs that all perform compatible services but that exhibit behavioral differences under certain conditions, is essential for fault tolerance and security [1, 5, 17, 6]. Consequently, it is of utmost importance to have systematic and efficient procedures to determine if a set of programs are computationally diverse.

Many works have tried to tackle this challenge, using input generation [12], static analysis [13], or evolutionary testing [25] and [3] (concurrent with this work). Yet, having a reliable detection of computational diversity for large object-oriented programs is still a challenging endeavor.

In this paper, we propose an approach called DSpot¹ for assessing the presence of computational diversity, i.e., to determine if a set of program variants exhibit different behaviors under certain conditions. DSpot takes as input a test suite and a set of n program variants. The n variants have the same application programming interface (API) and they all pass the same test suite (i.e., they comply with the same executable specification). DSpot consists of two steps: (i) automatically transforming the test suite, and (ii) running this larger test suite, which we call the "amplified test suite", on all variants to reveal visible differences in the computation.

¹DSpot stands for "diversity spotter".

The first step of DSpot is an original technique of test amplification [23, 26, 20, 25]. Our key insight is to combine the automatic exploration of the input domain with the systematic sensing of the observation domain. The former is obtained by transforming the input values and method calls of the original tests. The latter is the result of the analysis and transformation of the original assertions of the test suite, in order to observe the program state from as many observation points visible from the public API as possible. The second step of DSpot runs the amplified test suite on each variant. The observation points introduced during amplification generate new traces of the program state. If there exists a difference between the traces of a pair of variants, we say that these variants are computationally diverse. In other words, two variants are considered diverse if there exists at least one input outside the specified domain that triggers different behaviors on the variants, which can be observed through the public API.

To evaluate the ability of DSpot to observe computational diversity, we consider 7 open-source software applications. For these applications, we create 472 program variants in total, and we manually check that they are computationally diverse; they form our ground truth. We then run DSpot for each program variant. Our experiments show that DSpot detects 100% of the 472 computationally diverse program variants. In the literature, the technique that is the most similar to test amplification is the one by Yoo and Harman [25], called "test data regeneration" (TDR for short); we use it as baseline. We show that test suites amplified with DSpot detect twice as many computationally diverse programs as TDR. In particular, we show that the new test input transformations that we propose bring a real added value with respect to TDR to spot behavioral differences.

To sum up, our contributions are:

• an original set of test case transformations for the automatic amplification of an object-oriented test suite;
• a validation of the ability of amplified test suites to spot computational diversity in 472 variants of 7 open-source, large-scale programs;
• a comparative evaluation against the closest related work [25];
• original insights about the natural diversity of computation, due to randomness and the variety of runtime environments;
• a publicly available implementation² and benchmark³.


[Figure 1: A High-level View of Software Diversity. The figure distinguishes static code diversity (different source code, e.g., variable names), which covers source diversity and binary diversity (from different compilers and compilation options), from computational diversity (different executions), of which NVP-diversity (different outputs at the module interface) is one kind.]

public int subtract1(int a, int b) {
  return a - b;
}

public int subtract2(int a, int b) throws OverFlowException {
  BigInteger bigA = BigInteger.valueOf(a);
  BigInteger bigB = BigInteger.valueOf(b);
  BigInteger result = bigA.subtract(bigB);
  if (result.lowerThan(Integer.MIN_VALUE)) {
    throw new DoNotFitIn32BitException();
  }
  // the API requires a 32-bit integer value
  return result.intValue();
}

Listing 1: Two subtraction functions. They are NVP-diverse: there exist some inputs for which the outputs are different.


The paper is organized as follows: Section 2 expands on the background and motivation for this work; Section 3 describes the core technical contribution of the paper, the automatic amplification of test suites; Section 4 presents our empirical findings about the amplification of 7 real-world test suites and the assessment of diversity among 472 program variants.

2 BACKGROUND

In this paper we are interested in computational diversity. Computational diversity is one kind of software diversity. Figure 1 presents a high-level view of software diversity. Software diversity can be observed statically, either on source or binary code. Computational diversity is the one that happens at runtime. The computational diversity we target in this paper is NVP-diversity, which relates to N-version programming. It can be loosely defined as computational diversity that is visible at the module interface: different outputs for the same input.

2.1 N-version Programming

In the Encyclopedia of Software Engineering, N-version programming is defined as "a software structuring technique designed to permit software to be fault-tolerant, i.e., able to operate and provide correct outputs despite the presence of faults" [14].

²http://diversify-project.github.io/test-suite-amplification.html
³http://diversify-project.eu/data

[Figure 2: Original and amplified points on the input and observation spaces of P. The execution state of program P maps points of the input space (stimuli) to points of the observation space.]

In N-version systems, N variants of the same module, written by different teams, are executed in parallel. Faults are defined as an output of one or more variants that differs from the majority's output. Let us consider a simple example with 2 programs p1 and p2: if one observes a difference in the output for an input x, i.e., p1(x) ≠ p2(x), then a fault is detected.

Let us consider the example of Listing 1. It shows two implementations of subtraction which have been developed by two different teams, a typical N-version setup. subtract1 simply uses the subtraction operator. subtract2 is more complex: it leverages BigInteger objects to handle potential overflows.

The specification given to the two teams states that the expected input domain is [−2^16, 2^16] × [−2^16, 2^16]. To that extent, both implementations are correct and equivalent. These two implementations are run in parallel in production, using an N-version architecture.

If a production input is outside the specified input domain, e.g., subtract1(2^32+1, 2), the behavior of both implementations is different and the overflow fault is detected.
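To make this comparison concrete, the following is a minimal Java sketch of our own (the NVersionRunner class and its faultDetected method are hypothetical, they are not part of any N-version framework nor of DSpot): it runs a set of variants on the same input and reports a fault as soon as one output differs from the reference output.

import java.util.List;
import java.util.function.IntBinaryOperator;

// Minimal sketch of the output comparison behind N-version fault detection.
class NVersionRunner {
    // Runs all variants on the same input and reports whether they disagree.
    static boolean faultDetected(List<IntBinaryOperator> variants, int a, int b) {
        Integer reference = null;
        for (IntBinaryOperator variant : variants) {
            int output = variant.applyAsInt(a, b);
            if (reference == null) {
                reference = output;          // the first variant sets the reference output
            } else if (output != reference) {
                return true;                 // behavioral difference: a fault is detected
            }
        }
        return false;
    }
}

In an actual N-version architecture the variants run in parallel and a majority vote is taken; the sketch only illustrates the pairwise check p1(x) ≠ p2(x) described above.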

2.2 NVP-Diversity

In this paper, we use the term NVP-diversity to refer to the concept of computational diversity in N-version programming.

Definition: Two programs are NVP-diverse if and only if there exists at least one input for which the output is different.

Note that, according to this definition, if two programs are equivalent on all inputs, they are not NVP-diverse.

In this work we consider programs written in mainstream object-oriented programming languages (our prototype handles Java software). In OO programs, there is no such thing as "input" and "output". This requires us to slightly modify our definition of NVP-diversity.

Following [10], we replace "input" by "stimulus" and "output" by "observation". A stimulus is a sequence of method calls, and their parameters, on an object under test. An observation is a sequence of calls to specific methods that query the state of an object (typically getter methods). The input space I of a class P is the set of all possible stimuli for P. The observation space O is the set of all sets of observations.

Now we can clearly define NVP-diversity for OO programs.

Definition: Two classes are NVP-diverse if and only if there exist two respective instances that produce different observations for the same stimulus.
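As a small illustration of these definitions (our own example, not taken from the paper's benchmark), the stimulus below is a sequence of calls on a standard Java list, and the observations are calls to state-querying methods after the stimulus:

import java.util.ArrayList;
import java.util.List;

class StimulusObservationExample {
    public static void main(String[] args) {
        // Stimulus: a sequence of method calls, with their parameters, on the object under test.
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("b");
        list.remove(0);

        // Observations: calls to methods that query the state reached after the stimulus.
        System.out.println(list.size());     // 1
        System.out.println(list.isEmpty());  // false
        System.out.println(list.get(0));     // b
    }
}

Two list implementations that produce different values for at least one of these observations, under the same stimulus, would be NVP-diverse.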

2.3 Graphical Explanation

The notion of NVP-diversity is directly related to the activity of software testing, as illustrated in Figure 2. The first part of a test case, incl. creation of objects and method calls, constitutes the stimulus, i.e., a point in the program's input space (black diamonds in the figure). An oracle, in the form of an assertion, invokes one method and compares the result to an expected value; this constitutes an observation point on the program state that has been reached when running the program with a specific stimulus (the observation points of a test suite are black circles in the right-hand side of the figure). To this extent, we say that a test suite specifies a set of relations between points in the input and observation spaces.

2.4 Unspecified Input Space

In N-version programming, by definition, the differences that are observed at runtime happen for unspecified inputs, which we call the unspecified domain for short. In this paper, we consider that the points that are not exercised by a test suite form the unspecified domain. They are the orange diamonds in the left-hand side of the figure.

3 OUR APPROACH TO DETECT COMPUTATIONAL DIVERSITY

We present DSpot, our approach to detect computational diversity. This approach is based on test suite amplification, through automated transformations of test case code.

3.1 Overview

The global flow of DSpot is illustrated in Figure 3.

Input: DSpot takes as inputs a set of program variants P1, ..., Pn, which all pass the same test suite TS. Conceptually, Px can be written in any programming language. There is no assumption on the correctness or complexity of Px; the only requirement is that they are all specified by the same test suite. In this paper we consider unit tests; however, the approach can be straightforwardly extended to other kinds of tests, such as integration tests.

Output: The output of DSpot is an answer to the question: are P1, ..., Pn NVP-diverse?

Process: First, DSpot amplifies the test suite to explore the unspecified input and observation spaces (as defined in Section 2). As illustrated in Figure 2, amplification generates new inputs and observations in the neighbourhood of the original points (new points are orange diamonds and green circles). The Cartesian product of the amplified set of inputs and the complete set of observable points forms the amplified test suite ATS.

Also, Figure 3 shows the step "observation point selection"; this step removes the naturally random observations. Indeed, as discussed in more detail further in the paper, some observation points produce diverse outputs between different runs of the same test case on the same program. This natural randomness comes from randomness in the computation and from specificities of the execution environment (addresses, file system, etc.).

Once DSpot has generated an amplified test suite, it runs it on a pair of program variants to compare their visible behavior, as captured by the observation points. If some points reveal different values on each variant, the variants are considered as computationally diverse.

3.2 Test Suite Transformations

[Figure 3: An overview of DSpot, a decision procedure for automatically assessing the presence of NVP-diversity. The inputs are a test suite, which the program variants P1 and P2 both satisfy. DSpot performs input space exploration and observation point synthesis to build the amplified test suite, selects the observation points, and runs a computational diversity check whose output is yes or no.]

Our approach for amplifying test suites systematically explores the neighbourhood of the input and observation points of the original test suite. In this section, we discuss the different transformations we perform for test suite amplification; Algorithm 1 summarizes the procedure.

Data: TS, an initial test suite
Result: TS', an amplified version of TS
1  TStmp ← ∅
2  foreach test ∈ TS do
3    foreach statement ∈ test do
4      test' ← clone(test)
5      TStmp ← remove(statement, test')
6      test'' ← clone(test)
7      TStmp ← duplicate(statement, test'')
8    end
9    foreach literalValue ∈ test do
10     TStmp ← transform(literalValue, test)
11   end
12 end
13 TS' ← TStmp ∪ TS
   foreach test ∈ TS' do
14   removeAssertions(test)
15 end
16 foreach test ∈ TS' do
17   addObservationPoints(test)
18 end
19 foreach test ∈ TS' do
20   filterObservationPoints(test)
21 end

Algorithm 1: Amplification of test cases

3.2.1 Exploring the Input Space

Literals and statement manipulation. The first step of amplification consists in transforming all test cases in the test suite with the following test case transformations. Those transformations operate on literals and statements.

Transforming literals: given a test case tc, we run the following transformations for every literal value. A String value is transformed in three ways: remove a random character, add a random character, and replace a random character by another one. A numerical value i is transformed in four ways: i + 1, i − 1, i × 2, i ÷ 2. A boolean value is replaced by the opposite value. These transformations are performed at line 10 of Algorithm 1.

Transforming statements: given a test case tc, for every statement s in tc, we generate two test cases: one test case in which we remove s and another one in which we duplicate s. These transformations are performed in the loop at lines 3-8 of Algorithm 1.

Given the transformations described above, the transformation process has the following characteristics: (i) each time we transform a value in the original test suite, we generate a new test case (i.e., we do not "stack" the transformations on a single test case); (ii) the amplification process is exhaustive: given s the number of String values, n the number of numerical values, b the number of booleans, and st the number of statements in an original test suite TS, DSpot produces an amplified test suite ATS of size |ATS| = s × 3 + n × 4 + b + st × 2.
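The following sketch illustrates these literal neighborhoods on plain Java values. It is our own illustration of the transformation rules only: DSpot performs the corresponding rewrites on the source code of the test cases (with Spoon), not on runtime values, and the class and method names below are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustration of the literal neighborhoods used for input space exploration.
class LiteralNeighborhood {
    static final Random RANDOM = new Random();

    // A numerical value i is transformed in four ways: i+1, i-1, i*2, i/2.
    static List<Integer> neighbors(int i) {
        return List.of(i + 1, i - 1, i * 2, i / 2);
    }

    // A String value is transformed in three ways: remove, add and replace a random character.
    // Assumes a non-empty String for simplicity.
    static List<String> neighbors(String s) {
        int pos = RANDOM.nextInt(s.length());
        char c = (char) ('a' + RANDOM.nextInt(26));
        List<String> result = new ArrayList<>();
        result.add(new StringBuilder(s).deleteCharAt(pos).toString());
        result.add(new StringBuilder(s).insert(pos, c).toString());
        result.add(new StringBuilder(s).replace(pos, pos + 1, String.valueOf(c)).toString());
        return result;
    }

    // A boolean value is replaced by the opposite value.
    static List<Boolean> neighbors(boolean b) {
        return List.of(!b);
    }
}

Each transformed literal yields one new test case, which is how the |ATS| = s × 3 + n × 4 + b + st × 2 count above is obtained.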

These transformations, especially the ones on statements, can produce test cases that cannot be executed (e.g., removing a call to add before a remove on a list). In our experiments, this accounted for approximately 10% of the amplified test cases.

Assertion removal. The second step of amplification consists of removing all assertions from the test cases (line 14 of Algorithm 1). The rationale is that the original assertions are there to verify correctness, which is not the goal of the generated test cases: their goal is to assess computational differences. Indeed, assertions that were specified for a test case ts in the original test suite are most probably meaningless for a test case that is a variant of ts. When removing assertions, we are careful to keep method calls that are passed as a parameter of an assert method. We analyze the code of the whole test suite to find all assertions, using the following heuristic: an assertion is a call to a method whose name contains either assert or fail and which is provided by the JUnit framework. If one parameter of the assertion is a method call, we extract it; then we remove the assertion. In the final amplified test suite, we also keep the original test cases, with their assertions removed.
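The sketch below renders this heuristic on a simplified model of a test statement (a call whose arguments are either literals or nested calls). The Call and AssertionRemover names are ours, for illustration only; DSpot implements the heuristic on the Spoon AST of the test code.

import java.util.List;

// Simplified model of a test statement: a method call whose arguments are either
// literals (String, numbers, ...) or nested method calls.
record Call(String methodName, List<Object> arguments) {

    // Heuristic: an assertion is a call to a method whose name contains "assert" or "fail".
    boolean isAssertion() {
        String lower = methodName.toLowerCase();
        return lower.contains("assert") || lower.contains("fail");
    }
}

class AssertionRemover {
    // Removes an assertion but extracts the method calls passed as its parameters,
    // so that they stay in the test and can later become observation points.
    static List<Call> process(Call statement) {
        if (!statement.isAssertion()) {
            return List.of(statement);              // not an assertion: keep the statement as is
        }
        return statement.arguments().stream()
                .filter(arg -> arg instanceof Call) // keep only the method-call parameters
                .map(arg -> (Call) arg)
                .toList();
    }
}

For example, processing assertFalse("message", getMap().containsKey(key)) drops the assertion and keeps the containsKey call, matching the transformation visible in Listing 2 below.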

Listing 2 illustrates the generation of two new test cases. The first test method, testEntrySetRemoveChangesMap (1), is the original one, slightly simplified for the sake of presentation. The second one, testEntrySetRemoveChangesMap_Add (2), duplicates the statement entrySet.remove and does not contain the assertion anymore. The third test method, testEntrySetRemoveChangesMap_DataMutator (3), replaces the numerical value 0 by 1.

public void testEntrySetRemove() {                           // 1
  for (int i = 0; i < sampleKeys.length; i++) {
    entrySet.remove(new DefaultMapEntry<K, V>(
        sampleKeys[i], sampleValues[i]));
    assertFalse(
        "Entry should have been removed from the underlying map.",
        getMap().containsKey(sampleKeys[i]));
  } // end for
}

public void testEntrySetRemove_Add() {                       // 2
  // call duplication
  entrySet.remove(new DefaultMapEntry<K, V>(
      sampleKeys[i], sampleValues[i]));
  entrySet.remove(new DefaultMapEntry<K, V>(
      sampleKeys[i], sampleValues[i]));
  getMap().containsKey(sampleKeys[i]);
}

public void testEntrySetRemove_Data() {                      // 3
  // integer increment: int i = 0 -> int i = 1
  for (int i = 1; i < (sampleKeys.length); i++) {
    entrySet.remove(new DefaultMapEntry<K, V>(
        sampleKeys[i], sampleValues[i]));
    getMap().containsKey(sampleKeys[i]);
  } // end for
}

Listing 2: A test case, testEntrySetRemoveChangesMap (1), that is amplified twice (2 and 3).

3.2.2 Adding Observation Points

Our goal is to observe different observable behaviors between a program and variants of this program. Consequently, we need observation points on the program state. We do this by enhancing all the test cases in ATS with observation points (line 17 of Algorithm 1). These points are responsible for collecting pieces of information about the program state during or after the execution of the test case. In this context, an observation point is a call to a public method whose result is logged in an execution trace.

For each object o in the original test case (o can be part of an assertion or a local variable of the test case), we do the following, as illustrated by the sketch after this list:

• we look for all getter methods in the class of o (i.e., methods whose name starts with get, that take no parameter and whose return type is not void, as well as methods whose name starts with is and return a boolean value) and call each of them; we also collect the values of all public fields;

• if the toString method is redefined for the class of o, we call it (we ignore the hashcode that can be returned by toString);

• if the original assertion included a method call on o, we include this method call as an observation point.
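The following reflection-based sketch, with names of our own, shows how such observation points can be collected for an object o. It is only an illustration of the rules above: DSpot generates the corresponding calls directly in the source code of the amplified test cases instead of using runtime reflection.

import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of observation-point collection on an object o.
class ObservationCollector {
    static Map<String, Object> observe(Object o) throws Exception {
        Map<String, Object> observations = new LinkedHashMap<>();
        Class<?> type = o.getClass();

        // Getters: "get*" methods with no parameter and a non-void return type,
        // and "is*" methods that return a boolean.
        for (Method m : type.getMethods()) {
            boolean getter = m.getName().startsWith("get") && m.getReturnType() != void.class;
            boolean isser = m.getName().startsWith("is")
                    && (m.getReturnType() == boolean.class || m.getReturnType() == Boolean.class);
            if (m.getParameterCount() == 0 && (getter || isser)) {
                observations.put(m.getName(), m.invoke(o));
            }
        }

        // Values of all public fields.
        for (Field f : type.getFields()) {
            observations.put(f.getName(), f.get(o));
        }

        // toString(), only if it is redefined (i.e., not inherited from Object).
        if (type.getMethod("toString").getDeclaringClass() != Object.class) {
            observations.put("toString", o.toString());
        }
        return observations;
    }
}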

Filtering observation points. This introspective process provides a large number of observation points. Yet, we have noted in our pilot experiments that some of the values that we monitor change from one execution to another. For instance, the identifier of the current thread changes between two executions: in Java, Thread.currentThread().getId() is an observation point that always needs to be discarded.

If we kept those naturally varying observation points, DSpot would say that two variants are different while the observed difference would be due to randomness. This would produce spurious results that are irrelevant for computational diversity assessment. Consequently, we discard certain observation points as follows. We instrument the amplified tests ATS with all observation points. Then we run ATS 30 times on Px and repeat these 30 runs on three different machines. All observation points for which at least one value varies between at least two runs are filtered out (line 20 of Algorithm 1).
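A minimal sketch of this selection step, under a simplified trace representation of our own (each run is a map from an observation-point identifier to the serialized value it produced):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Keeps only the observation points whose value is identical in every run of the
// amplified test suite on the same program; the others are naturally random and discarded.
class ObservationPointFilter {
    static Set<String> stablePoints(List<Map<String, String>> repeatedRuns) {
        Map<String, Set<String>> valuesPerPoint = new HashMap<>();
        for (Map<String, String> run : repeatedRuns) {
            for (Map.Entry<String, String> observation : run.entrySet()) {
                valuesPerPoint
                        .computeIfAbsent(observation.getKey(), k -> new HashSet<>())
                        .add(observation.getValue());
            }
        }
        Set<String> stable = new HashSet<>();
        for (Map.Entry<String, Set<String>> entry : valuesPerPoint.entrySet()) {
            if (entry.getValue().size() == 1) {  // same value observed in every run
                stable.add(entry.getKey());
            }
        }
        return stable;
    }
}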

To sum up, DSpot produces an amplified test suite ATS that contains more test cases than the original one, and in which we have injected observation points in all test cases.

Table 1: Descriptive Statistics about our Dataset

Project              Purpose                                  Class           LOC   #tests  coverage  #variants
commons-codec        Data encoding                            Base64          255   72      98%       12
commons-collections  Collection library                       TreeBidiMap     1202  111     92%       133
commons-io           Input/output helpers                     FileUtils       1195  221     82%       44
commons-lang         General purpose helpers (e.g., String)   StringUtils     2247  233     99%       22
guava                Collection library                       HashBiMap       525   35      91%       3
gson                 JSON library                             Gson            554   684     89%       145
JGit                 Java implementation of Git               CommitCommand   433   138     81%       113

3.3 Detecting and Measuring the Visible Computational Diversity

The final step of DSpot runs the amplified test suite on pairs of program variants. Given P1 and P2, the number of observation points that have different values on each variant accounts for the visible computational diversity. When we compare a set of variants, we use the mean number of differences over all pairs of variants.
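Under the same simplified trace representation as above (one map of observation-point values per variant, restricted to the stable points), this comparison boils down to the following sketch of ours:

import java.util.List;
import java.util.Map;
import java.util.Objects;

// Visible computational diversity: number of observation points with different values on two
// variants, averaged over all pairs of variants when more than two variants are compared.
class DiversityMeasure {
    // Assumes both traces contain the same (already filtered) observation points.
    static long divergences(Map<String, String> traceA, Map<String, String> traceB) {
        return traceA.keySet().stream()
                .filter(point -> !Objects.equals(traceA.get(point), traceB.get(point)))
                .count();
    }

    static double meanDivergences(List<Map<String, String>> traces) {
        long total = 0;
        int pairs = 0;
        for (int i = 0; i < traces.size(); i++) {
            for (int j = i + 1; j < traces.size(); j++) {
                total += divergences(traces.get(i), traces.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0 : (double) total / pairs;
    }
}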

3.4 Implementation

Our prototype implementation amplifies Java source code.⁴ The test suites are expected to be written with the JUnit testing framework, which is the most widely used testing framework for Java. The prototype uses Spoon [18] to manipulate the source code in order to create the amplified test cases. DSpot is able to amplify a test suite within minutes.

The main challenges for the implementation of DSpot were as follows: handling the many different situations that occur in real-world, large test suites (use of different versions of JUnit, modularization of the code of the test suite itself, implementation of new types of assertions, etc.); handling large traces for the comparison of computations (as we will see in the next section, we collect hundreds of thousands of observations on each variant); spotting the natural randomness in test case execution to prevent false positives in the assessment of computational diversity.

4 EVALUATION

To evaluate whether DSpot is capable of detecting computational diversity, we set up a novel empirical protocol and apply it to 7 large-scale Java programs. Our guiding research question is: is DSpot capable of identifying realistic, large-scale programs that are computationally diverse?

4.1 Protocol

First, we take large open-source Java programs that are equipped with good test suites. Second, we forge variants of those programs using a technique from our previous work [2]. We call the variants sosie programs.⁵

Definition 1 (Sosie, noun): Given a program P, a test suite TS for P and a program transformation T, a variant P' = T(P) is a sosie of P if the two following conditions hold: 1) there is at least one test case in TS that executes the part of P that is modified by T; 2) all test cases in TS pass on P'.

⁴The prototype is available at http://diversify-project.github.io/test-suite-amplification.html
⁵The word "sosie" is a French word that literally means "lookalike".

Given an initial program, we synthesize sosies with source code transformations that are based on the modification of the abstract syntax tree (AST). As in previous work [16, 22], we consider three families of transformation that manipulate statement nodes of the AST: 1) remove a node in the AST (Delete); 2) add a node just after another one (Add); 3) replace a node by another one, e.g., a statement node is replaced by another statement (Replace). For "Add" and "Replace", the transplantation point refers to where a statement is inserted, the transplant statement refers to the statement that is copied and inserted, and both transplantation and transplant points are in the same AST (we do not synthesize new code, nor take code from other programs). We consider transplant statements that manipulate variables of the same type as the transplantation point, and we bind the names of variables in the transplant to names that are in the namespace of the transplantation point. We call these transformations Steroid transformations; more details are available in our previous work [2].
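As a hypothetical before/after illustration of an "Add" transformation (the method and the transplanted statement below are invented for the sake of the example, they are not taken from the benchmark):

class SteroidAddExample {
    // Before: the transplantation point is the statement inside the loop.
    static int totalLengthOriginal(java.util.List<String> words) {
        int total = 0;
        for (String w : words) {
            total += w.length();                  // transplantation point
        }
        return total;
    }

    // After "Add": a transplant statement that also manipulates an int is inserted right after
    // the transplantation point, with its variable names bound to names in scope (total, w).
    // Here the added statement does not change the result, so the variant still passes the
    // original test suite, i.e., it is a sosie.
    static int totalLengthSosie(java.util.List<String> words) {
        int total = 0;
        for (String w : words) {
            total += w.length();
            total = Math.max(total, w.length());  // transplanted statement, no visible effect
        }
        return total;
    }
}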

Once we have generated sosie programs, we manually select a set of sosies that indeed expose some computational diversity. Third, we amplify the original test suites using our approach, and also using a baseline technique by Yoo and Harman [25], presented in Section 4.3. Finally, we run both amplified test suites and measure the proportion of variants (sosies) that are detected as computationally different. We also collect additional metrics to further qualify the effectiveness of DSpot.

4.2 Dataset

We build a dataset of subject programs for performing our experiments. The inclusion criteria are the following: 1) the subject program must be real-world software; 2) the subject program must be written in Java; 3) the subject program's test suite must use the JUnit testing framework; 4) the subject program must have a good test suite (a statement coverage higher than 80%).

This results in Apache Commons Codec, Apache Commons Collections, Apache Commons IO, Apache Commons Lang, Google Guava, Google Gson and JGit. The dominance of Apache projects is due to the fact that they are among the very rare organizations with a very strong development discipline.

In addition, we aim at running the whole experiment in less than one day (24 hours). Consequently, we take a single class for each of those projects, as well as all the test cases that exercise it at least once.

Table 1 provides the descriptive statistics of our dataset. It gives the subject program identifier, its purpose, the class we consider, the class' number of lines of code (LOC), the number of tests that execute at least one method of the class under consideration, the statement coverage, and the total number of program variants we consider (excluding the original program). We see that this benchmark covers different domains, such as data encoding and collections, and is only composed of well-tested classes. In total, there are between 3 and 145 computationally diverse variants of each program to be detected. This variation comes from the relative difficulty of manually forging computationally diverse variants, depending on the project.

4.3 Baseline

In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach. Their technique is designed for augmenting input space coverage, but can be directly applied to detecting computational diversity. Their algorithm, called test data regeneration (TDR for short), is based on four transformations on numerical values in test cases, data shifting (λx.x + 1 and λx.x − 1) and data scaling (multiply or divide the value by 2), and on a hill-climbing algorithm based on the number of fitness function evaluations. They consider that a test case calls a single function, their implementation deals only with numerical functions, and they consider the numerical output of that function as the only observation point. In our experiment, we reimplemented the transformations on numerical values, since the tool used by Yoo is not available. We removed the hill-climbing part since it is not relevant in our case. Analytically, the key differences between DSpot and TDR are: TDR stacks multiple transformations together; DSpot has more, new transformation operators on test cases; DSpot considers a richer observation space, based on arbitrary data types and sequences of method calls.

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then, we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Now we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR. The one with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or from the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper in the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary, even on the same program. This phenomenon is due to the natural randomness of computation. We count the number of discarded observation points to answer this question quantitatively, and we discuss one case study.

RQ3b: what is the richness of computational diversity? Finally, we want to understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained by applying DSpot to our dataset.

4.5.1 Number of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go by pairs: one provides the data for a subject program, and the following one provides the same data gathered on the test suite amplified by DSpot. The columns are organized in two groups: the first group gives a static view on the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular. Some test cases are used multiple times because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests.

Let us first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive. We create between 4x and 18x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10,597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (#TC exec) and column 5 the number of assertions executed, or the number of observation points executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of executed tests is impacted by amplification.

Table 2: The performance of DSpot on amplifying 7 Java test suites
(Static: #TC, #assert or obs. Dynamic: #TC exec, #assert or obs exec, #disc. obs, branch cov, path cov.)

Project            #TC          #assert/obs    #TC exec  #assert/obs exec  #disc.obs  branch cov  path cov
codec              72           509            72        3528              -          124         1245
codec-DSpot        672 (x9)     10597 (x20)    672       16920             12         126         12461
collections        111          433            768       7035              -          223         376
collections-DSpot  1291 (x12)   14772 (x34)    9202      973096            0          224         465
io                 221          1330           262       1346              -          366         246
io-DSpot           2518 (x11)   20408 (x15)    2661      209911            54313      373         287
lang               233          2206           233       2266              -          1014        797
lang-DSpot         988 (x4)     12854 (x6)     12854     57856             18         1015        901
guava              35           84             14110     20190             -          60          77
guava-DSpot        625 (x18)    6834 (x81)     624656    9464              0          60          77
gson               684          1125           671       1127              -          106         84
gson-DSpot         4992 (x7)    26869 (x24)    4772      167150            144        108         137
JGit               138          176            138       185               -          75          1284
JGit-DSpot         2152 (x16)   90828 (x516)   2089      92856             13377      75          1735

Table 3: The effectiveness of computational diversity detection

                     #variants detected  #variants detected  input space  observation space  mean # of
                     by DSpot            by TDR              effect       effect             divergences
commons-codec        12/12               10/12               12/12        10/12              219
commons-collections  133/133             133/133             133/133      133/133            52079
commons-io           44/44               18/44               42/44        18/44              4055
commons-lang         22/22               0/22                10/22        0/22               229
guava                3/3                 0/3                 0/3          3/3                2
gson                 145/145             0/145               134/145      0/145              8015
jgit                 113/113             0/113               113/113      0/113              15654

For instance, for commons-collections, there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other, generic test methods. Consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. The last two columns of Table 2 provide deeper insights about the synthesized test cases. Column 7 gives the branch coverage of the original test suites and of the amplified ones (lines with -DSpot identifiers). While the original test suites have a very high branch coverage rate, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suite is also revealed in the last column of the table (path coverage): it provides the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for Gson, the total number of different paths covered in the methods under test increases from 84 to 137. This means that, while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec⁶: the original test suite covers 780 different paths in this method, while the amplified test suite covers 11,356 different paths. This phenomenon is due to the complex control flow of the method and to the fact that its behavior directly depends on the value of an array of bytes that takes many new values in the amplified test suite.

⁶Line 331 in the Base64 class: https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

The amplification process is massive and produces rich new input points: the number of declared and executed test cases and the diversity of executions from test cases increase.

4.5.2 Number of Generated Observation Points

Now we focus on the observation points. Column 3 of Table 2 gives the number of assertions in the original test suites. This corresponds to the number of locations where the tester specifies expected values about the state of the program execution. For the amplified test suites (lines with -DSpot identifiers), the same column gives the number of observation points. We do not call them assertions since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants, in order to assess computational diversity.

As we can see, we observe the program state at many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API of the program under consideration, hence they reveal visible and exploitable computational diversity. However, this number also encompasses the observation points of the newly generated test cases.

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space. It is possible to observe the program state from many other observation points.

4.5.3 Effectiveness

We want to assess whether our method is effective for identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in the descriptive Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore more in depth whether computational diversity is revealed by new input points or by new observation points, or both; we will come back to them later.

As we can see, DSpot is capable of detecting all computationally diverse variants of our benchmark. The baseline technique TDR performs worse on all subject programs but one: it detects only a fraction of the variants (e.g., 10/12 for commons-codec), or even none at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, incl. constructor calls, method invocations and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The result of this experiment is given in the "input space effect" and "observation space effect" columns of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation method used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the result of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space and 10/12 are detected by exploring the observation space. This means that 10 of them are detected by either exploration. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing columns "input space effect" and "observation space effect", one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) it is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.

The systematic exploration of the input and the observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.). This is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems: if the method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second only refers to variations in the runtime environment.

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the assertions in JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, this produces false positives for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.

Our experimental protocol enables us to quantify the number of discarded observation points. The 6th column of Table 2 gives this number.

void testCanonicalEmptyCollectionExists() {
  if (((supportsEmptyCollections()) && (isTestSerialization())) && (!(skipSerializedCanonicalTests()))) {
    Object object = makeObject();
    if (object instanceof Serializable) {
      String name = getCanonicalEmptyCollectionName(object);
      File f = new java.io.File(name);
      // observation on f
      Logger.logAssertArgument(f.getCanonicalPath());
      Logger.logAssertArgument(f.getAbsolutePath());
    }
  }
}

Listing 3: An amplified test case with observation points that naturally vary, hence are discarded by DSpot.

For instance, for commons-codec, DSpot detects 12 observation points that naturally vary. This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54,313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories, which are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. In those programs, DSpot does not detect a single point that naturally varies by running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow one to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good sign: it shows that the encapsulation is strong; beyond providing an intuitive API and a protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to variables that naturally vary, there is a risk that, when used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand more in depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of FileOutputStream.

// original program
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
  OutputStream out = null;
  out = openOutputStream(file, append);
  IOUtils.write(data, out, encoding);
  out.close();
}

// variant
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
  OutputStream out = null;
  out = new FileOutputStream(file, append);
  IOUtils.write(data, out, encoding);
  out.close();
}

Listing 4: Two variants of writeStringToFile in commons-io.

void testCopyDirectoryPreserveDates() {
  try {
    File sourceFile = new File(sourceDirectory, "hello*txt");
    FileUtils.writeStringToFile(sourceFile, "HELLO WORLD", "UTF8");
  } catch (Exception e) {
    DSpot.observe(e.getMessage());
  }
}

Listing 5: Amplified test case that reveals the computational diversity between the variants of Listing 4.

These two variants behave differently outside the specified domain: in case writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: the "." in the file name is replaced by a star "*", which makes the file name an invalid one. Running this test on the variant results in a FileNotFoundException.

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON document with a slightly different format, and none of these formatting decisions are part of the specified domain (and actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among the variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on the instance of StringWriter that is modified by toJson().

// Original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  ...
  writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  ...
  writer.setIndent("  ");
}

Listing 6: Two variants of toJson in GSON.

public void testWriteMixedStreamed_remove534() throws IOException {
  ...
  gson.toJson(RED_MIATA, Car.class, jsonWriter);
  jsonWriter.endArray();
  Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
  Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6.

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value stored in the encodedInt3 variable (the original String has one additional character, which is removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points; the single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmers: case 1 is possible, it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot is able to detect the variants. We discussed how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrated three categories of code variations that maintain the expected functionality, as specified in the test suite, but still induce diversity: different checks on inputs, different formatting, different handling of special cases.

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available.⁷

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

⁷http://diversify-project.github.io/test-suite-amplification.html

// Original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1: // 6 bits - ignore entirely
      // not currently tested, perhaps it is impossible
      break;
    ...

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1:
    ...

Listing 8: Two variants of decode in commons-codec.

@Test
void testCodeInteger3_literalMutation222() {
  String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
      + "rY+1LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg==";
  Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between the variants of Listing 8.

This is true for all self-made evaluations. This threat to the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than those of our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5 RELATED WORK

The work presented here is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences of up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals, as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of that work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resources that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process; this hybrid approach is more effective than each technique separately, but at increased cost. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6 CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations, for respectively exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Behind the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites, and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). In our view, the assertions of flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness, and enter again into the good old and reassuring world of determinism.

7 ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8 REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability and Assurance: From Needs to Solutions (CSDA '98), pages 171–. IEEE Computer Society, Washington, DC, USA, 1998.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. BinHunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE '03), pages 60–71. IEEE Computer Society, Washington, DC, USA, 2003.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.


[Figure 1: A high-level view of software diversity. Software diversity is observed either statically on the code (source diversity, binary diversity from different compilers and compilation options) or at runtime (computational diversity, including NVP-diversity: different outputs at the module interface, different executions).]

public int subtract1(int a, int b) {
  return a - b;
}

public int subtract2(int a, int b) throws OverFlowException {
  BigInteger bigA = BigInteger.valueOf(a);
  BigInteger bigB = BigInteger.valueOf(b);
  BigInteger result = bigA.subtract(bigB);
  if (result.compareTo(BigInteger.valueOf(Integer.MIN_VALUE)) < 0) {
    throw new DoNotFitIn32BitException();
  }
  // the API requires a 32-bit integer value
  return result.intValue();
}

Listing 1: Two subtraction functions. They are NVP-diverse: there exist some inputs for which the outputs are different.

• … large scale programs
• a comparative evaluation against the closest related work [25]
• original insights about the natural diversity of computation, due to randomness and to the variety of runtime environments
• a publicly available implementation² and benchmark³

The paper is organized as follows: Section 2 expands on the background and motivations for this work; Section 3 describes the core technical contribution of the paper, the automatic amplification of test suites; Section 4 presents our empirical findings about the amplification of 7 real-world test suites and the assessment of diversity among 472 program variants.

2 BACKGROUND

In this paper, we are interested in computational diversity. Computational diversity is one kind of software diversity; Figure 1 presents a high-level view of software diversity. Software diversity can be observed statically, either on source or binary code. Computational diversity is the one that happens at runtime. The computational diversity we target in this paper is NVP-diversity, which relates to N-version programming. It can be loosely defined as computational diversity that is visible at the module interface: different outputs for the same input.

2.1 N-version programming

In the Encyclopedia of Software Engineering, N-version programming is defined as "a software structuring technique designed to permit software to be fault-tolerant, i.e., able to operate and provide correct outputs despite the presence of faults" [14]. In N-version systems, N variants of the same module, written by different teams, are executed in parallel. The faults are defined as an output of one or more variants that differs from the majority's output. Let us consider a simple example with two programs p1 and p2: if one observes a difference in the output for an input x (p1(x) ≠ p2(x)), then a fault is detected.

² http://diversify-project.github.io/test-suite-amplification.html
³ http://diversify-project.eu/data

[Figure 2: Original and amplified points on the input and observation spaces of P.]

Let us consider the example of Listing 1. It shows two implementations of subtraction which have been developed by two different teams, a typical N-version setup. subtract1 simply uses the subtraction operator; subtract2 is more complex: it leverages BigInteger objects to handle potential overflows.

The specification given to the two teams states that the expected input domain is [−2^16, 2^16] × [−2^16, 2^16]. To that extent, both implementations are correct and equivalent. These two implementations are run in parallel in production, using an N-version architecture.

If a production input is outside the specified input domain, e.g., subtract1(2^32 + 1, 2), the behavior of both implementations is different and the overflow fault is detected.

2.2 NVP-Diversity

In this paper, we use the term NVP-diversity to refer to the concept of computational diversity in N-version programming.

Definition: Two programs are NVP-diverse if and only if there exists at least one input for which the output is different.

Note that, according to this definition, if two programs are equivalent on all inputs, they are not NVP-diverse.

In this work, we consider programs written in mainstream object-oriented programming languages (our prototype handles Java software). In OO programs, there is no such thing as "inputs" and "outputs". This requires us to slightly adapt our definition of NVP-diversity.

Following [10], we replace "input" by "stimuli" and "output" by "observation". A stimuli is a sequence of method calls, with their parameters, on an object under test. An observation is a sequence of calls to specific methods that query the state of an object (typically getter methods). The input space I of a class P is the set of all possible stimuli for P. The observation space O is the set of all sets of observations.

Now we can clearly define NVP-diversity for OO programs. Definition: Two classes are NVP-diverse if and only if there exist two respective instances that produce different observations for the same stimuli.
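To make this vocabulary concrete, here is a minimal, hypothetical illustration (the class and helper names are ours, not part of DSpot): the same stimuli is applied to two list implementations and an observation is collected through their public API; the two objects would be NVP-diverse if at least one stimuli led to different observations.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class StimuliObservationExample {

  // A stimuli: a sequence of method calls, with their parameters, on the object under test.
  static void stimuli(List<String> objectUnderTest) {
    objectUnderTest.add("a");
    objectUnderTest.add("b");
    objectUnderTest.remove("a");
  }

  // An observation: a sequence of calls to methods that query the state of the object.
  static List<Object> observe(List<String> objectUnderTest) {
    return Arrays.asList(objectUnderTest.size(), objectUnderTest.isEmpty(), objectUnderTest.toString());
  }

  public static void main(String[] args) {
    List<String> variant1 = new ArrayList<>();
    List<String> variant2 = new LinkedList<>();
    stimuli(variant1);
    stimuli(variant2);
    // Same stimuli, compared observations: a difference would indicate NVP-diversity.
    System.out.println(observe(variant1).equals(observe(variant2)) ? "same observations" : "NVP-diverse");
  }
}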

2.3 Graphical Explanation

The notion of NVP-diversity is directly related to the activity of software testing, as illustrated in Figure 2. The first part of a test case (incl. creation of objects and method calls) constitutes the stimuli, i.e., a point in the program's input space (black diamonds in the figure). An oracle, in the form of an assertion, invokes one method and compares the result to an expected value; this constitutes an observation point on the program state that has been reached when running the program with a specific stimuli (the observation points of a test suite are the black circles in the right-hand side of the figure). To this extent, we say that a test suite specifies a set of relations between points in the input and observation spaces.

2.4 Unspecified Input Space

In N-version programming, by definition, the differences that are observed at runtime happen for unspecified inputs, which we call the unspecified domain for short. In this paper, we consider that the points that are not exercised by a test suite form the unspecified domain. They are the orange diamonds in the left-hand side of the figure.

3 OUR APPROACH TO DETECT COMPUTATIONAL DIVERSITY

We present DSpot, our approach to detect computational diversity. This approach is based on test suite amplification, through automated transformations of test case code.

3.1 Overview

The global flow of DSpot is illustrated in Figure 3.

Input: DSpot takes as inputs a set of program variants P1, ..., Pn, which all pass the same test suite TS. Conceptually, Px can be written in any programming language. There is no assumption on the correctness or complexity of Px; the only requirement is that they are all specified by the same test suite. In this paper we consider unit tests; however, the approach can be straightforwardly extended to other kinds of tests, such as integration tests.

Output: The output of DSpot is an answer to the question: are P1, ..., Pn NVP-diverse?

Process: First, DSpot amplifies the test suite to explore the unspecified input and observation spaces (as defined in Section 2). As illustrated in Figure 2, amplification generates new inputs and observations in the neighbourhood of the original points (new points are orange diamonds and green circles). The cartesian product of the amplified set of inputs and the complete set of observable points forms the amplified test suite ATS.

Figure 3 also shows the step "observation point selection": this step removes the naturally random observations. Indeed, as discussed in more detail further in the paper, some observation points produce diverse outputs between different runs of the same test case on the same program. This natural randomness comes from randomness in the computation and from specificities of the execution environment (addresses, file system, etc.).

Once DSpot has generated an amplified test suite, it runs it on a pair of program variants to compare their visible behavior, as captured by the observation points. If some points reveal different values on each variant, the variants are considered as computationally diverse.

3.2 Test Suite Transformations

[Figure 3: An overview of DSpot, a decision procedure for automatically assessing the presence of NVP-diversity. The input test suite goes through input space exploration, observation point synthesis and observation point selection to produce the amplified test suite, which is run on the program variants P1 and P2 for the computational diversity check (output: yes/no).]

Our approach for amplifying test suites systematically explores the neighbourhood of the input and observation points of the original test suite. In this section, we discuss the different transformations we perform for test suite amplification; Algorithm 1 summarizes the procedure.

Data: TS, an initial test suite
Result: TS′, an amplified version of TS
1  TStmp ← ∅
2  foreach test ∈ TS do
3    foreach statement ∈ test do
4      test′ ← clone(test)
5      TStmp ← remove(statement, test′)
6      test″ ← clone(test)
7      TStmp ← duplicate(statement, test″)
8    end
9    foreach literalValue ∈ test do
10     TStmp ← transform(literalValue, test)
11   end
12 end
13 TS′ ← TStmp ∪ TS
   foreach test ∈ TS′ do
14   removeAssertions(test)
15 end
16 foreach test ∈ TS′ do
17   addObservationPoints(test)
18 end
19 foreach test ∈ TS′ do
20   filterObservationPoints(test)
21 end

Algorithm 1: Amplification of test cases

3.2.1 Exploring the Input Space

Literals and statement manipulation. The first step of amplification consists in transforming all test cases in the test suite with the following test case transformations. These transformations operate on literals and statements.

Transforming literals: given a test case tc, we run the following transformations for every literal value. A String value is transformed in three ways: remove a random character, add a random character, and replace a random character by another one. A numerical value i is transformed in four ways: i + 1, i − 1, i × 2, i ÷ 2. A boolean value is replaced by the opposite value. These transformations are performed at line 10 of Algorithm 1.

Transforming statements: given a test case tc, for every statement s in tc we generate two test cases: one test case in which we remove s, and another one in which we duplicate s. These transformations are performed at lines 3–8 of Algorithm 1.

Given the transformations described above, the transformation process has the following characteristics: (i) each time we transform a variable in the original test suite, we generate a new test case (i.e., we do not 'stack' the transformations on a single test case); (ii) the amplification process is exhaustive: given s the number of String values, n the number of numerical values, b the number of booleans, and st the number of statements in an original test suite TS, DSpot produces an amplified test suite ATS of size |ATS| = s × 3 + n × 4 + b + st × 2.
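As a hypothetical worked example of this count (the numbers are illustrative and not taken from our dataset): an original test suite with s = 2 String literals, n = 3 numerical literals, b = 1 boolean literal and st = 10 statements yields |ATS| = 2 × 3 + 3 × 4 + 1 + 10 × 2 = 39 generated test cases, in addition to the original ones.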

These transformations, especially the ones on statements, can produce test cases that cannot be executed (e.g., removing a call to add before a remove on a list). In our experiments, this accounted for approximately 10% of the amplified test cases.
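The following self-contained sketch illustrates the value-level operators described above; it only shows how the new literal values are computed (the actual implementation rewrites the literals inside cloned test methods through Spoon, which is not shown here).

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LiteralAmplifier {

  private static final Random RANDOM = new Random();

  // A numerical value i is transformed in four ways: i + 1, i - 1, i * 2, i / 2.
  static List<Integer> amplify(int i) {
    return List.of(i + 1, i - 1, i * 2, i / 2);
  }

  // A boolean value is replaced by its opposite.
  static List<Boolean> amplify(boolean b) {
    return List.of(!b);
  }

  // A String value is transformed in three ways: remove a random character,
  // add a random character, replace a random character by another one.
  static List<String> amplify(String s) {
    List<String> variants = new ArrayList<>();
    char randomChar = (char) ('a' + RANDOM.nextInt(26));
    int pos = s.isEmpty() ? 0 : RANDOM.nextInt(s.length());
    if (!s.isEmpty()) {
      variants.add(new StringBuilder(s).deleteCharAt(pos).toString());                                 // remove
      variants.add(new StringBuilder(s).replace(pos, pos + 1, String.valueOf(randomChar)).toString()); // replace
    }
    variants.add(new StringBuilder(s).insert(pos, randomChar).toString());                             // add
    return variants;
  }

  public static void main(String[] args) {
    System.out.println(amplify(10));          // [11, 9, 20, 5]
    System.out.println(amplify("hello.txt")); // three mutated strings
  }
}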

Assertion removal. The second step of amplification consists of removing all assertions from the test cases (line 14 of Algorithm 1). The rationale is that the original assertions are there to verify correctness, which is not the goal of the generated test cases: their goal is to assess computational differences. Indeed, the assertions that were specified for a test case ts in the original test suite are most probably meaningless for a test case that is a variant of ts. When removing assertions, we are careful to keep method calls that are passed as parameters of an assert method. We analyze the code of the whole test suite to find all assertions, using the following heuristic: an assertion is a call to a method whose name contains either assert or fail and which is provided by the JUnit framework. If one parameter of the assertion is a method call, we extract it; then we remove the assertion. In the final amplified test suite, we keep the original test cases, but also remove their assertions.
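A minimal sketch of this heuristic, working on a single statement given as a string; the real implementation works on the AST of the test method (through Spoon), so this string-based version is only an illustration of the rule "drop the assertion, keep its method-call arguments".

import java.util.ArrayList;
import java.util.List;

public class AssertionStripper {

  // Heuristic: an assertion is a call to a method whose name contains "assert" or "fail".
  static boolean isAssertion(String statement) {
    int paren = statement.indexOf('(');
    if (paren < 0) {
      return false;
    }
    String callee = statement.substring(0, paren).toLowerCase();
    return callee.contains("assert") || callee.contains("fail");
  }

  // Drops the assertion; keeps each argument that looks like a method call as a plain statement.
  static List<String> stripAssertion(String statement) {
    if (!isAssertion(statement)) {
      return List.of(statement);
    }
    String args = statement.substring(statement.indexOf('(') + 1, statement.lastIndexOf(')'));
    List<String> kept = new ArrayList<>();
    for (String arg : splitTopLevel(args)) {
      if (arg.contains("(")) {
        kept.add(arg.trim() + ";");
      }
    }
    return kept;
  }

  // Splits the argument list on commas that are not nested inside parentheses.
  static List<String> splitTopLevel(String args) {
    List<String> parts = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i < args.length(); i++) {
      char c = args.charAt(i);
      if (c == '(') depth++;
      else if (c == ')') depth--;
      else if (c == ',' && depth == 0) {
        parts.add(args.substring(start, i));
        start = i + 1;
      }
    }
    parts.add(args.substring(start));
    return parts;
  }

  public static void main(String[] args) {
    // keeps "getMap().containsKey(key);" and drops the assertion itself
    System.out.println(stripAssertion("assertFalse(\"removed\", getMap().containsKey(key))"));
  }
}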

Listing 2 illustrates the generation of two new test cases. The first test method, testEntrySetRemoveChangesMap(), is the original one, slightly simplified for the sake of presentation. The second one, testEntrySetRemoveChangesMap_Add, duplicates the statement entrySet.remove and does not contain the assertion anymore. The third test method, testEntrySetRemoveChangesMap_DataMutator, replaces the numerical value 0 by 1.

public void testEntrySetRemove() { // (1)
  for (int i = 0; i < sampleKeys.length; i++) {
    entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
    assertFalse("Entry should have been removed from the underlying map",
        getMap().containsKey(sampleKeys[i]));
  } // end for
}

public void testEntrySetRemove_Add() { // (2)
  for (int i = 0; i < sampleKeys.length; i++) {
    // call duplication
    entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
    entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
    getMap().containsKey(sampleKeys[i]);
  } // end for
}

public void testEntrySetRemove_Data() { // (3)
  // integer increment: int i = 0 -> int i = 1
  for (int i = 1; i < (sampleKeys.length); i++) {
    entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
    getMap().containsKey(sampleKeys[i]);
  } // end for
}

Listing 2: A test case, testEntrySetRemoveChangesMap (1), that is amplified twice (2 and 3).

3.2.2 Adding Observation Points

Our goal is to observe different observable behaviors between a program and variants of this program. Consequently, we need observation points on the program state. We do this by enhancing all the test cases in ATS with observation points (line 17 of Algorithm 1). These points are responsible for collecting pieces of information about the program state during or after the execution of the test case. In this context, an observation point is a call to a public method whose result is logged in an execution trace.

For each object o in the original test case (o can be part of an assertion or a local variable of the test case), we do the following (a sketch of this collection step follows the list below):

• we look for all getter methods in the class of o (i.e., methods whose name starts with get, that take no parameter and whose return type is not void, and methods whose name starts with is and return a boolean value) and call each of them; we also collect the values of all public fields;

• if the toString method is redefined for the class of o, we call it (we ignore the hashcode that can be returned by toString);

• if the original assertion included a method call on o, we include this method call as an observation point.
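The following reflection-based sketch gathers the values of these observation points for one object; it is only a stand-in for what DSpot actually does (the generated tests call the getters directly, as in Listing 3, rather than using reflection).

import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class ObservationCollector {

  // One value per observation point: getters, is-methods, public fields,
  // and toString() when it is redefined by the class of o.
  static Map<String, Object> observe(Object o) {
    Map<String, Object> trace = new LinkedHashMap<>();
    for (Method m : o.getClass().getMethods()) {
      boolean getter = m.getName().startsWith("get") && !m.getName().equals("getClass")
          && m.getParameterCount() == 0 && m.getReturnType() != void.class;
      boolean isMethod = m.getName().startsWith("is") && m.getParameterCount() == 0
          && (m.getReturnType() == boolean.class || m.getReturnType() == Boolean.class);
      if (getter || isMethod) {
        try {
          trace.put(m.getName(), m.invoke(o));
        } catch (Exception e) {
          trace.put(m.getName(), "<" + e.getClass().getSimpleName() + ">");
        }
      }
    }
    for (Field f : o.getClass().getFields()) { // public fields
      try {
        trace.put(f.getName(), f.get(o));
      } catch (IllegalAccessException ignored) {
        // inaccessible field: skip
      }
    }
    try {
      if (o.getClass().getMethod("toString").getDeclaringClass() != Object.class) {
        trace.put("toString", o.toString());
      }
    } catch (NoSuchMethodException ignored) {
      // toString is always declared somewhere in the hierarchy
    }
    return trace;
  }
}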

Filtering observation points. This introspective process provides a large number of observation points. Yet, we have noted in our pilot experiments that some of the values that we monitor change from one execution to another. For instance, the identifier of the current thread changes between two executions: in Java, Thread.currentThread().getId() is an observation point that always needs to be discarded.

If we kept those naturally varying observation points, DSpot would say that two variants are different while the observed difference would only be due to randomness. This would be a spurious result, irrelevant for computational diversity assessment. Consequently, we discard certain observation points as follows (line 20 of Algorithm 1). We instrument the amplified tests ATS with all observation points. Then we run ATS 30 times on Px, and repeat these 30 runs on three different machines. All observation points for which at least one value varies between at least two runs are filtered out.
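A sketch of this filtering step, assuming each run of the amplified test suite produces a trace represented as a map from observation-point identifiers to observed values (as the collector sketched above would); any point whose value is not identical across all runs is discarded.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class RandomnessFilter {

  // Keeps only the observation points whose value is stable across all runs
  // (the runs may come from repeated executions and from different machines).
  static Set<String> stablePoints(List<Map<String, Object>> runs) {
    Map<String, Object> reference = runs.get(0);
    Set<String> stable = new HashSet<>(reference.keySet());
    for (Map<String, Object> run : runs) {
      stable.removeIf(point ->
          !run.containsKey(point) || !Objects.equals(run.get(point), reference.get(point)));
    }
    return stable;
  }
}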

To sum up, DSpot produces an amplified test suite ATS that contains more test cases than the original one, and in which we have injected observation points in all test cases.

Table 1: Descriptive Statistics about our Dataset

Project             | Purpose                               | Class         | LOC  | #tests | coverage | #variants
commons-codec       | Data encoding                         | Base64        | 255  | 72     | 98%      | 12
commons-collections | Collection library                    | TreeBidiMap   | 1202 | 111    | 92%      | 133
commons-io          | Input/output helpers                  | FileUtils     | 1195 | 221    | 82%      | 44
commons-lang        | General purpose helpers (e.g. String) | StringUtils   | 2247 | 233    | 99%      | 22
guava               | Collection library                    | HashBiMap     | 525  | 35     | 91%      | 3
gson                | Json library                          | Gson          | 554  | 684    | 89%      | 145
JGit                | Java implementation of Git            | CommitCommand | 433  | 138    | 81%      | 113

3.3 Detecting and Measuring the Visible Computational Diversity

The final step of DSpot runs the amplified test suite on pairs of program variants. Given P1 and P2, the number of observation points that have different values on each variant accounts for the visible computational diversity. When we compare a set of variants, we use the mean number of differences over all pairs of variants.
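Under the same trace representation as above, the comparison step can be sketched as follows: the divergence between two variants is the number of stable observation points whose values differ, and a strictly positive divergence flags the pair as computationally diverse.

import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class DiversityMeasure {

  // Number of observation points on which the two variants visibly differ.
  static long divergences(Map<String, Object> traceOfP1, Map<String, Object> traceOfP2,
                          Set<String> stablePoints) {
    return stablePoints.stream()
        .filter(point -> !Objects.equals(traceOfP1.get(point), traceOfP2.get(point)))
        .count();
  }

  // Two variants are computationally diverse if divergences(...) > 0; when comparing a set
  // of variants, the mean of this count over all pairs is reported.
}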

3.4 Implementation

Our prototype implementation amplifies Java source code⁴. The test suites are expected to be written using the JUnit testing framework, which is the #1 testing framework for Java. The prototype uses Spoon [18] to manipulate the source code in order to create the amplified test cases. DSpot is able to amplify a test suite within minutes.

The main challenges for the implementation of DSpot were as follows: handling the many different situations that occur in real-world, large test suites (different versions of JUnit, modularized test suite code, new types of assertions, etc.); handling large traces for the comparison of computations (as we will see in the next section, we collect hundreds of thousands of observations on each variant); and spotting the natural randomness in test case execution to prevent false positives in the assessment of computational diversity.

4 EVALUATION

To evaluate whether DSpot is capable of detecting computational diversity, we set up a novel empirical protocol and apply it to 7 large-scale Java programs. Our guiding research question is: Is DSpot capable of identifying realistic, large-scale programs that are computationally diverse?

4.1 Protocol

First, we take large open-source Java programs that are equipped with good test suites. Second, we forge variants of those programs using a technique from our previous work [2]. We call these variants sosie programs⁵.

Definition 1 (Sosie). Given a program P, a test suite TS for P and a program transformation T, a variant P′ = T(P) is a sosie of P if the two following conditions hold: 1) there is at least one test case in TS that executes the part of P that is modified by T; 2) all test cases in TS pass on P′.

⁴ The prototype is available at http://diversify-project.github.io/test-suite-amplification.html
⁵ The word sosie is a French word that literally means "lookalike".

Given an initial program, we synthesize sosies with source code transformations that are based on the modification of the abstract syntax tree (AST). As in previous work [16, 22], we consider three families of transformation that manipulate statement nodes of the AST: 1) remove a node in the AST (Delete); 2) add a node just after another one (Add); 3) replace a node by another one, e.g., a statement node is replaced by another statement (Replace). For "Add" and "Replace", the transplantation point refers to where a statement is inserted, and the transplant statement refers to the statement that is copied and inserted; both transplantation and transplant points are in the same AST (we do not synthesize new code, nor take code from other programs). We consider transplant statements that manipulate variables of the same type as the transplantation point, and we bind the names of variables in the transplant to names that are in the namespace of the transplantation point. We call these transformations Steroid transformations; more details are available in our previous work [2].
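A highly simplified sketch of the three transformation families, modeling a method body as a flat list of statement strings instead of a real AST; the actual implementation works on Spoon ASTs and additionally checks type compatibility and rebinds variable names, which this sketch omits.

import java.util.ArrayList;
import java.util.List;

public class SteroidTransformations {

  // Delete: remove the statement at the transplantation point.
  static List<String> delete(List<String> body, int transplantationPoint) {
    List<String> variant = new ArrayList<>(body);
    variant.remove(transplantationPoint);
    return variant;
  }

  // Add: insert the transplant statement just after the transplantation point.
  static List<String> add(List<String> body, int transplantationPoint, String transplant) {
    List<String> variant = new ArrayList<>(body);
    variant.add(transplantationPoint + 1, transplant);
    return variant;
  }

  // Replace: substitute the statement at the transplantation point with the transplant.
  static List<String> replace(List<String> body, int transplantationPoint, String transplant) {
    List<String> variant = new ArrayList<>(body);
    variant.set(transplantationPoint, transplant);
    return variant;
  }
}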

Once we have generated sosie programs, we manually select a set of sosies that indeed expose some computational diversity. Third, we amplify the original test suites using our approach, and also using a baseline technique by Yoo and Harman [25], presented in Section 4.3. Finally, we run both amplified test suites and measure the proportion of variants (sosies) that are detected as computationally different. We also collect additional metrics to further qualify the effectiveness of DSpot.

4.2 Dataset

We build a dataset of subject programs for performing our experiments. The inclusion criteria are the following: 1) the subject program must be real-world software; 2) the subject program must be written in Java; 3) the subject program's test suite must use the JUnit testing framework; 4) the subject program must have a good test suite (a statement coverage higher than 80%).

This results in Apache Commons Math, Apache Commons Lang, Apache Commons Collections, Apache Commons Codec, Google GSON, and Guava. The dominance of Apache projects is due to the fact that they are among the very rare organizations with a very strong development discipline.

In addition, we aim at running the whole experiment in less than one day (24 hours). Consequently, we take a single class for each of those projects, as well as all the test cases that exercise it at least once.

Table 1 provides the descriptive statistics of our dataset. It gives the subject program identifier, its purpose, the class we consider, the class' number of lines of code (LOC), the number of tests that execute at least once one method of the class under consideration, the statement coverage, and the total number of program variants we consider (excluding the original program). We see that this benchmark covers different domains, such as data encoding and collections, and is only composed of well-tested classes. In total, there are between 12 and 145 computationally diverse variants of each program to be detected. This variation comes from the relative difficulty of manually forging computationally diverse variants, depending on the project.

4.3 Baseline

In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach. Their technique is designed for augmenting input space coverage, but can be directly applied to detecting computational diversity. Their algorithm, called test data regeneration (TDR for short), is based on four transformations on numerical values in test cases, namely data shifting (λx.x + 1 and λx.x − 1) and data scaling (multiply or divide the value by 2), and on a hill-climbing algorithm based on the number of fitness function evaluations. They consider that a test case calls a single function, their implementation deals only with numerical functions, and they consider the numerical output of that function as the only observation point. In our experiment, we reimplemented the transformations on numerical values, since the tool used by Yoo is not available. We removed the hill-climbing part since it is not relevant in our case. Analytically, the key differences between DSpot and TDR are: TDR stacks multiple transformations together; DSpot has more transformation operators on test cases; DSpot considers a richer observation space, based on arbitrary data types and sequences of method calls.

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then, we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Now we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR; the one with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper in the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary, even on the same program. This phenomenon is due to the natural randomness of computation. We count the number of discarded observation points to answer this question quantitatively, and we discuss one case study.

RQ3b: what is the richness of computational diversity? Now we want to really understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained by applying DSpot on our dataset.

4.5.1 # of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go by pair: one line that provides data for a subject program, and the following one that provides the same data gathered with the test suite amplified by DSpot. Columns 2 to 5 are organized in two groups: the first group gives a static view on the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular. Some test cases are used multiple times, because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests.

Let's first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive: we create between 4x and 12x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (TC exec) and column 5 the number of assertions executed, or the number of observation points executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of executed tests is impacted by amplification.

Table 2: The performance of DSpot on amplifying 7 Java test suites
(columns 2–3: static view; columns 4–8: dynamic view)

Project           | #TC        | #assert/obs  | #TC exec | #assert/obs exec | #disc. obs | branch cov | path cov
codec             | 72         | 509          | 72       | 3528             | -          | 124        | 1245
codec-DSpot       | 672 (×9)   | 10597 (×20)  | 672      | 16920            | 12         | 126        | 12461
collections       | 111        | 433          | 768      | 7035             | -          | 223        | 376
collections-DSpot | 1291 (×12) | 14772 (×34)  | 9202     | 973096           | 0          | 224        | 465
io                | 221        | 1330         | 262      | 1346             | -          | 366        | 246
io-DSpot          | 2518 (×11) | 20408 (×15)  | 2661     | 209911           | 54313      | 373        | 287
lang              | 233        | 2206         | 233      | 2266             | -          | 1014       | 797
lang-DSpot        | 988 (×4)   | 12854 (×6)   | 12854    | 57856            | 18         | 1015       | 901
guava             | 35         | 84           | 14110    | 20190            | -          | 60         | 77
guava-DSpot       | 625 (×18)  | 6834 (×81)   | 624656   | 9464             | 0          | 60         | 77
gson              | 684        | 1125         | 671      | 1127             | -          | 106        | 84
gson-DSpot        | 4992 (×7)  | 26869 (×24)  | 4772     | 167150           | 144        | 108        | 137
JGit              | 138        | 176          | 138      | 185              | -          | 75         | 1284
JGit-DSpot        | 2152 (×16) | 90828 (×516) | 2089     | 92856            | 13377      | 75         | 1735

Table 3: The effectiveness of computational diversity detection

Project             | variants detected by DSpot | variants detected by TDR | input space effect | observation space effect | mean of divergences
commons-codec       | 12/12                      | 10/12                    | 12/12              | 10/12                    | 219
commons-collections | 133/133                    | 133/133                  | 133/133            | 133/133                  | 52079
commons-io          | 44/44                      | 18/44                    | 42/44              | 18/44                    | 4055
commons-lang        | 22/22                      | 0/22                     | 10/22              | 0/22                     | 229
guava               | 3/3                        | 0/3                      | 0/3                | 3/3                      | 2
gson                | 145/145                    | 0/145                    | 134/145            | 0/145                    | 8015
jgit                | 113/113                    | 0/113                    | 113/113            | 0/113                    | 15654

For instance, for commons-collections, there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other generic test methods; consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. The last two columns of Table 2 (branch cov and path cov) provide deeper insights about the synthesized test cases. The branch coverage column gives the branch coverage of the original test suites and of the amplified ones (lines with -DSpot identifiers). While the original test suites already have a very high branch coverage rate, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suite is also revealed in the last column of the table (path coverage): it provides the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for Guava, the total number of different paths covered in the methods under test increases from 84 to 137. This means that, while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec⁶: the original test suite covers 780 different paths in this method, while the amplified test suite covers 11356 different paths. This phenomenon is due to the complex control flow of the method, and to the fact that its behavior directly depends on the value of an array of bytes that takes many new values in the amplified test suite.

The amplification process is massive and produces rich new input points: the number of declared and executed test cases, and the diversity of executions triggered by test cases, increase.

4.5.2 # of Generated Observation Points

Now we focus on the observation points. The fourth column of Table 2 gives the number of assertions in the original test suites; this corresponds to the number of locations where the tester specifies expected values about the state of the program execution. The fifth column gives the number of observation points in the amplified test suites. We do not call them assertions, since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants in order to assess computational diversity.

As we can see, we observe the program state on many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API of the program under consideration, and hence allow us to reveal visible and exploitable computational diversity. However, this number also encompasses the observation points of the newly generated test cases.

⁶ Line 331 in the Base64 class: https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space: it is possible to observe the program state from many other observation points.

4.5.3 Effectiveness

We want to assess whether our method is effective for identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in the descriptive Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore more in depth whether the computational diversity is revealed by new input points or new observation points, or both; we come back to them later.

As we can see, DSpot is capable of detecting all computationally diverse variants of our benchmark. On the contrary, the baseline technique TDR is always worse: either it detects only a fraction of them (e.g., 10/12 for commons-codec), or none at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, incl. constructor calls, method invocations, and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The result of this experiment is given in the "input space effect" and "observation space effect" columns of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation methods used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the result of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space and 10/12 are detected by exploring the observation space; this means that 10 of them are detected by either one exploration or the other. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing the "input space effect" and "observation space effect" columns, one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) it is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.

The systematic exploration of the input and observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.); this is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems: if the method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second one only reflects variations in the runtime environment.

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the assertions, in JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, it produces a false positive for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.

Our experimental protocol enables us to quantify the number of discarded observation points; the 6th column of Table 2 gives this number.

void testCanonicalEmptyCollectionExists() throws Exception {
  if ((supportsEmptyCollections()) && (isTestSerialization()) && (!skipSerializedCanonicalTests())) {
    Object object = makeObject();
    if (object instanceof Serializable) {
      String name = getCanonicalEmptyCollectionName(object);
      File f = new java.io.File(name);
      // observation on f
      Logger.logAssertArgument(f.getCanonicalPath());
      Logger.logAssertArgument(f.getAbsolutePath());
    }
  }
}

Listing 3: An amplified test case with observation points that naturally vary, and hence are discarded by DSpot.

For instance, for commons-codec, DSpot detects 12 observation points that naturally vary. This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories, which are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. In those programs, DSpot does not detect a single point that naturally varies by running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow one to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good sign: it shows that the encapsulation is strong; beyond providing an intuitive API and a protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to variables that naturally vary, there is a risk that, when used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand more in depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of FileOutputStream.

// original program
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
  OutputStream out = null;
  out = openOutputStream(file, append);
  IOUtils.write(data, out, encoding);
  out.close();
}

// variant
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
  OutputStream out = null;
  out = new FileOutputStream(file, append);
  IOUtils.write(data, out, encoding);
  out.close();
}

Listing 4: Two variants of writeStringToFile in commons-io.

void testCopyDirectoryPreserveDates() {
  try {
    File sourceFile = new File(sourceDirectory, "hello*txt");
    FileUtils.writeStringToFile(sourceFile, "HELLO WORLD", "UTF8");
  } catch (Exception e) {
    DSpot.observe(e.getMessage());
  }
}

Listing 5: Amplified test case that reveals the computational diversity between the variants of Listing 4.

These two variants behave differently outside the specified domain: if writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: a character of the file name is replaced by a star ("*"), which makes the file name invalid. Running this test on the variant results in a FileNotFoundException.

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON document with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on the instance of StringWriter that is modified by toJson().

// original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  // ...
  writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  // ...
  writer.setIndent("  ");
}

Listing 6: Two variants of toJson in GSON.

public void testWriteMixedStreamed_remove534() throws IOException {
  // ...
  gson.toJson(RED_MIATA, Car.class, jsonWriter);
  jsonWriter.endArray();
  Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
  Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6.

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break; an original comment by the programmers indicates that it is probably impossible to reach. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value in the encodedInt3 variable (the original String has an additional character, removed here by the "remove character" transformation). The amplification on the observation points adds multiple observation points; the single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmers: case 1 is possible, it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discussed how augmented test cases reveal this diversity (both with amplified inputs and observation points), and we illustrated three categories of code variations that maintain the expected functionality, as specified in the test suite, but still induce diversity (different checks on inputs, different formatting, different handling of special cases).

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available⁷.

Second, we have forged the computationally diverse program variants ourselves. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

⁷ http://diversify-project.github.io/test-suite-amplification.html

// original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
  // ...
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1: // 6 bits - ignore entirely
      // not currently tested, perhaps it is impossible
      break;
    // ...
  }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
  // ...
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1:
    // ...
  }
}

Listing 8: Two variants of decode in commons-codec.

@Test
void testCodeInteger3_literalMutation222() {
    String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
        + "rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==";
    Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between variants of Listing 8

variants. This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5 RELATED WORK
The work presented is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by Evosuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity, and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals, as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.
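The difference in transformation scope can be pictured with a short sketch. The helper class below is purely illustrative (the names LiteralNeighbours, numericNeighbours, etc. are ours and are not taken from either tool); it assumes the arithmetic perturbations on numerical literals (plus or minus one, multiply or divide by two) used by both approaches, and the boolean and String mutations that only DSpot applies.

    import java.util.Arrays;
    import java.util.List;

    class LiteralNeighbours {
        // Numeric neighbourhood explored by both approaches: shift by one, scale by two.
        static List<Integer> numericNeighbours(int i) {
            return Arrays.asList(i + 1, i - 1, i * 2, i / 2);
        }

        // DSpot additionally negates boolean literals...
        static boolean booleanNeighbour(boolean b) {
            return !b;
        }

        // ...and perturbs String literals, here by dropping the first character.
        static String stringNeighbour(String s) {
            return s.isEmpty() ? s : s.substring(1);
        }
    }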

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of this work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6 CONCLUSION
In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations for, respectively, exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Behind the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites, and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness, and enter again into the good old and reassuring world of determinism.

7 ACKNOWLEDGEMENTS
This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8 REFERENCES
[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability, and Assurance: From Needs to Solutions, CSDA '98, pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: Massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. BinHunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Trans. on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.

• 1 Introduction
• 2 Background
  • 2.1 N-version programming
  • 2.2 NVP-Diversity
  • 2.3 Graphical Explanation
  • 2.4 Unspecified Input Space
• 3 Our Approach to Detect Computational Diversity
  • 3.1 Overview
  • 3.2 Test Suite Transformations
    • 3.2.1 Exploring the Input Space
    • 3.2.2 Adding Observation Points
  • 3.3 Detecting and Measuring the Visible Computational Diversity
  • 3.4 Implementation
• 4 Evaluation
  • 4.1 Protocol
  • 4.2 Dataset
  • 4.3 Baseline
  • 4.4 Research Questions
  • 4.5 Empirical Results
    • 4.5.1 # of Generated Test Cases
    • 4.5.2 # of Generated Observation Points
    • 4.5.3 Effectiveness
    • 4.5.4 Natural Randomness of Computation
    • 4.5.5 Nature of Computational Diversity
  • 4.6 Threats to Validity
• 5 Related work
• 6 Conclusion
• 7 Acknowledgements
• 8 References


      Test suite amplification In the area of test suite am-plification the work by Yoo and Harman [25] is the mostclosely related to our approach and we used as the baselinefor computational diversity assessment They amplify testsuites only with transformations on integer values while wealso transform boolean and String literals as well as state-ments test cases Yoo and Harman also have two additionalparameters for test case transformation the interaction levelthat determines the number of simultaneous transformationon the same test case and the search radius that boundstheir search process when trying to improve the effectivenessof augmented test suites Their original intent is to increasethe input space coverage to improve test effectiveness Theydo not handle the oracle problem in that work

      Xie [23] augments test suites for Java program with newtest cases that are automatically generated and he automat-ically generates assertions for these new test cases whichcan check for regression errors Harder et al [9] proposeto retrieve operational abstractions ie invariant propertiesthat hold for a set of test cases These abstractions are thenused to compute operational differences which detects di-versity among a set of test cases (and not among a set ofimplementations as in our case) While the authors mentionthat operational differencing can be used to augment a testsuite the generation of new test cases is out of this workrsquosscope Zhang and Elbaum [26] focus on test cases that verifyerror handling code Instead of directly amplifying the testcases as we propose they transform the program under testthey instrument the target program by mocking the exter-

      nal resource that can throw exceptions which allow them toamplify the space of exceptional behaviors exposed to thetest cases Pezze et al [20] use the information providedin unit test cases about object creation and initializationto build composite test cases that focus on interactions be-tween classes Their main result is that the new test casesfind faults that could not be revealed by the unit test casesthat provided the basic material for the synthesis of compos-ite test cases Xu et al [24] refer toldquotest suite augmentationrdquoas the following process in case a program P evolves into Prsquoidentify the parts of Prsquo that need new test cases and gener-ate these tests They combine concolic and search-based testgeneration to automate this process This hybrid approachis more effective than each technique separately but with in-creased costs Dallmeier et al [4] automatically amplify testsuites by adding and removing method calls in JUnit testcases Their objective is to produce test cases that cover awider set of execution states than the original test suite inorder to improve the quality of models reverse engineeredfrom the code

      6 CONCLUSIONIn this paper we have presented DSpot a novel technique

      for detecting one kind of computational diversity between apair of programs This technique is based on test suite am-plification the automatic transformation of the original testsuite DSpot uses two kinds of transformations for respec-tively exploring new points in the programrsquos input space andexploring new observation points on the execution state af-ter execution with the given input points

      Our evaluation on large open-source projects shows thattest suites amplified by DSpot are capable of assessing com-putational diversity and that our amplification strategy isbetter than the closest related work a technique called TDRby Yoo and Harman [25] We have also presented a deepqualitative analysis of our empirical findings Behind theperformance of DSpot our results shed an original light onthe specified and unspecified parts of real-world test suitesand the natural randomness of computation

      This opens avenues for future work There is a relationbetween the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail) To usethe assertions of the flaky tests are at the border of the nat-ural undeterministic parts of the execution sometimes theyhit it sometimes they donrsquot With such a view we imag-ine an approach that characterizes this limit and proposesan automatic refactoring of the flaky tests so that they getfarther from the limit of the natural randomness and enteragain into the good old and reassuring world of determin-ism

7 ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8 REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability and Assurance: From Needs to Solutions, CSDA '98, pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. Binhunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.

• 1 Introduction
• 2 Background
  • 2.1 N-version programming
  • 2.2 NVP-Diversity
  • 2.3 Graphical Explanation
  • 2.4 Unspecified Input Space
• 3 Our Approach to Detect Computational Diversity
  • 3.1 Overview
  • 3.2 Test Suite Transformations
    • 3.2.1 Exploring the Input Space
    • 3.2.2 Adding Observation Points
  • 3.3 Detecting and Measuring the Visible Computational Diversity
  • 3.4 Implementation
• 4 Evaluation
  • 4.1 Protocol
  • 4.2 Dataset
  • 4.3 Baseline
  • 4.4 Research Questions
  • 4.5 Empirical Results
    • 4.5.1 # of Generated Test Cases
    • 4.5.2 # of Generated Observation Points
    • 4.5.3 Effectiveness
    • 4.5.4 Natural Randomness of Computation
    • 4.5.5 Nature of Computational Diversity
  • 4.6 Threats to Validity
• 5 Related work
• 6 Conclusion
• 7 Acknowledgements
• 8 References

other one; a numerical value i is transformed in four ways: i + 1, i - 1, i × 2, i ÷ 2; a boolean value is replaced by the opposite value. These transformations are performed at line 10 of Algorithm 1.

Transforming statements: given a test case tc, for every statement s in tc we generate two test cases: one test case in which we remove s, and another one in which we duplicate s. These transformations are performed at line 2 of Algorithm 1.

Given the transformations described above, the transformation process has the following characteristics: (i) each time we transform a variable in the original test suite, we generate a new test case (i.e., we do not "stack" the transformations on a single test case); (ii) the amplification process is exhaustive: given s the number of String values, n the number of numerical values, b the number of booleans and st the number of statements in an original test suite TS, DSpot produces an amplified test suite ATS of size |ATS| = s × 3 + n × 4 + b + st × 2.
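As an illustration (with hypothetical counts, not taken from our dataset), an original test suite with s = 2 String literals, n = 5 numerical literals, b = 1 boolean literal and st = 20 statements yields |ATS| = 2 × 3 + 5 × 4 + 1 + 20 × 2 = 67 amplified test cases.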

These transformations, especially the one on statements, can produce test cases that cannot be executed (e.g., removing a call to add before a remove on a list). In our experiments, this accounted for approximately 10% of the amplified test cases.

Assertion removal. The second step of amplification consists of removing all assertions from the test cases (line 14 of Algorithm 1). The rationale is that the original assertions are there to verify correctness, which is not the goal of the generated test cases: their goal is to assess computational differences. Indeed, assertions that were specified for a test case ts in the original test suite are most probably meaningless for a test case that is a variant of ts. When removing assertions, we are careful to keep method calls that are passed as a parameter of an assert method. We analyze the code of the whole test suite to find all assertions using the following heuristic: an assertion is a call to a method whose name contains either assert or fail and which is provided by the JUnit framework. If one parameter of the assertion is a method call, we extract it, then we remove the assertion. In the final amplified test suite, we keep the original test case, but also remove its assertions.
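To make the heuristic concrete, here is a minimal before/after sketch (the test method and the map under test are hypothetical and not taken from our dataset; standard JUnit and java.util imports are assumed):

    // Before amplification: a JUnit assertion whose argument is a method call.
    public void testContainsKey() {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("k1", 1);
        assertTrue(map.containsKey("k1"));
    }

    // After assertion removal: the method call passed to the assertion is kept,
    // the assertion itself is dropped; the call later becomes an observation point.
    public void testContainsKey_ampl() {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("k1", 1);
        map.containsKey("k1");
    }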

Listing 2 illustrates the generation of two new test cases. The first test method, testEntrySetRemoveChangesMap() (1), is the original one, slightly simplified for the sake of presentation. The second one, testEntrySetRemoveChangesMap_Add (2), duplicates the statement entrySet.remove and does not contain the assertion anymore. The third test method, testEntrySetRemoveChangesMap_DataMutator (3), replaces the numerical value 0 by 1.

    public void testEntrySetRemove() {                               // 1
        for (int i = 0; i < sampleKeys.length; i++) {
            entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
            assertFalse("Entry should have been removed from the underlying map",
                    getMap().containsKey(sampleKeys[i]));
        } // end for
    }

    public void testEntrySetRemove_Add() {                           // 2
        for (int i = 0; i < sampleKeys.length; i++) {
            // call duplication
            entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
            entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
            getMap().containsKey(sampleKeys[i]);
        } // end for
    }

    public void testEntrySetRemove_Data() {                          // 3
        // integer increment: int i = 0 -> int i = 1
        for (int i = 1; i < (sampleKeys.length); i++) {
            entrySet.remove(new DefaultMapEntry<K, V>(sampleKeys[i], sampleValues[i]));
            getMap().containsKey(sampleKeys[i]);
        } // end for
    }

Listing 2: A test case testEntrySetRemoveChangesMap (1) that is amplified twice (2 and 3)

3.2.2 Adding Observation Points

Our goal is to observe different observable behaviors between a program and variants of this program. Consequently, we need observation points on the program state. We do this by enhancing all the test cases in ATS with observation points (line 17 of Algorithm 1). These points are responsible for collecting pieces of information about the program state during or after the execution of the test case. In this context, an observation point is a call to a public method whose result is logged in an execution trace.

For each object o in the original test case (o can be part of an assertion or a local variable of the test case), we do the following (a sketch of the resulting observation code is shown after this list):

• we look for all getter methods in the class of o (i.e., methods whose name starts with get, that take no parameter and whose return type is not void, as well as methods whose name starts with is and return a boolean value) and call each of them; we also collect the values of all public fields;

• if the toString method is redefined for the class of o, we call it (we ignore the hashcode that can be returned by toString);

• if the original assertion included a method call on o, we include this method call as an observation point.
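As a sketch, here is roughly what the injected observation code looks like for an object involved in a test (the object f is hypothetical; the Logger.logAssertArgument helper is the one used in the listings of this paper, the selection of methods is a simplified illustration rather than DSpot's actual output):

    File f = new File("report.txt");            // an object used by the original test
    // zero-argument "get*" methods whose return type is not void
    Logger.logAssertArgument(f.getName());
    Logger.logAssertArgument(f.getPath());
    // zero-argument "is*" methods returning a boolean
    Logger.logAssertArgument(f.isAbsolute());
    // toString() is observed because File redefines it
    Logger.logAssertArgument(f.toString());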

Filtering observation points. This introspective process provides a large number of observation points. Yet, we have noted in our pilot experiments that some of the values that we monitor change from one execution to another. For instance, the identifier of the current thread changes between two executions: in Java, Thread.currentThread().getId() is an observation point that always needs to be discarded.

If we kept those naturally varying observation points, DSpot would say that two variants are different while the observed difference would be due to randomness. This would produce spurious results that are irrelevant for computational diversity assessment. Consequently, we discard certain observation points as follows. We instrument the amplified tests ATS with all observation points. Then we run ATS 30 times on Px and repeat these 30 runs on three different machines. All observation points for which at least one value varies between at least two runs are filtered out (line 20 of Algorithm 1).
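The filtering step can be sketched as follows (a simplified stand-alone version; representing each run as a map from observation-point identifiers to logged values is an assumption of this sketch, not DSpot's actual data structure):

    import java.util.*;

    public final class NaturalRandomnessFilter {
        // Keep only the observation points whose logged value is identical in every run.
        public static Set<String> stablePoints(List<Map<String, String>> runs) {
            Map<String, Set<String>> valuesPerPoint = new HashMap<String, Set<String>>();
            for (Map<String, String> run : runs) {
                for (Map.Entry<String, String> e : run.entrySet()) {
                    valuesPerPoint.computeIfAbsent(e.getKey(), k -> new HashSet<String>()).add(e.getValue());
                }
            }
            Set<String> stable = new HashSet<String>();
            for (Map.Entry<String, Set<String>> e : valuesPerPoint.entrySet()) {
                if (e.getValue().size() == 1) {
                    stable.add(e.getKey());
                }
            }
            return stable;
        }
    }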

To sum up, DSpot produces an amplified test suite ATS that contains more test cases than the original one, and in which we have injected observation points in all test cases.

Table 1: Descriptive statistics about our dataset

Project              Purpose                                 Class           LOC    # tests   Coverage (%)   # variants
commons-codec        Data encoding                           Base64           255        72             98           12
commons-collections  Collection library                      TreeBidiMap     1202       111             92          133
commons-io           Input/output helpers                    FileUtils       1195       221             82           44
commons-lang         General purpose helpers (e.g. String)   StringUtils     2247       233             99           22
guava                Collection library                      HashBiMap        525        35             91            3
gson                 Json library                            Gson             554       684             89          145
JGit                 Java implementation of GIT              CommitCommand    433       138             81          113

3.3 Detecting and Measuring the Visible Computational Diversity

The final step of DSpot runs the amplified test suite on pairs of program variants. Given P1 and P2, the number of observation points which have a different value on each variant accounts for visible computational diversity. When we compare a set of variants, we use the mean number of differences over each pair of variants.
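In code, comparing two variants P1 and P2 boils down to counting the observation points on which their traces disagree (a sketch under the same trace representation as the filtering sketch above, which is our assumption):

    import java.util.*;

    public final class DiversityMeasure {
        // Number of observation points that have a different value on the two variants.
        public static long divergence(Map<String, String> traceP1, Map<String, String> traceP2) {
            long count = 0;
            for (Map.Entry<String, String> e : traceP1.entrySet()) {
                String other = traceP2.get(e.getKey());
                if (other != null && !other.equals(e.getValue())) {
                    count++;
                }
            }
            return count;
        }
    }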

3.4 Implementation

Our prototype implementation amplifies Java source code.4 The test suites are expected to be written using the JUnit testing framework, which is the #1 testing framework for Java. It uses Spoon [18] to manipulate the source code in order to create the amplified test cases. DSpot is able to amplify a test suite within minutes.

The main challenges for the implementation of DSpot were as follows: handle the many different situations that occur in real-world, large test suites (use of different versions of JUnit, modularization of the code of the test suite itself, implementation of new types of assertions, etc.); handle large traces for the comparison of computations (as we will see in the next section, we collect hundreds of thousands of observations on each variant); spot the natural randomness in test case execution to prevent false positives in the assessment of computational diversity.

4 EVALUATION

To evaluate whether DSpot is capable of detecting computational diversity, we set up a novel empirical protocol and apply it on 7 large-scale Java programs. Our guiding research question is: Is DSpot capable of identifying realistic, large-scale programs that are computationally diverse?

4.1 Protocol

First, we take large open-source Java programs that are equipped with good test suites. Second, we forge variants of those programs using a technique from our previous work [2]. We call the variants sosie programs.5

Definition 1 (Sosie, noun). Given a program P, a test suite TS for P and a program transformation T, a variant P′ = T(P) is a sosie of P if the two following conditions hold: 1) there is at least one test case in TS that executes the part of P that is modified by T; 2) all test cases in TS pass on P′.

4 The prototype is available here: http://diversify-project.github.io/test-suite-amplification.html
5 The word sosie is a French word that literally means "lookalike".

Given an initial program, we synthesize sosies with source code transformations that are based on the modification of the abstract syntax tree (AST). As in previous work [16, 22], we consider three families of transformation that manipulate statement nodes of the AST: 1) remove a node in the AST (Delete); 2) add a node just after another one (Add); 3) replace a node by another one, e.g., a statement node is replaced by another statement (Replace). For "Add" and "Replace", the transplantation point refers to where a statement is inserted, the transplant statement refers to the statement that is copied and inserted, and both transplantation and transplant points are in the same AST (we do not synthesize new code nor take code from other programs). We consider transplant statements that manipulate variables of the same type as the transplantation point, and we bind the names of variables in the transplant to names that are in the namespace of the transplantation point. We call these transformations Steroid transformations; more details are available in our previous work [2].
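As an illustration of an "Add" transformation (on a hypothetical method of ours, not taken from the dataset), the transplant statement manipulates only variables that are visible at the transplantation point; whether the resulting variant is a sosie then depends on the test suite:

    // original method
    void store(Map<String, Integer> cache, String key, int value) {
        cache.put(key, value);          // transplantation point
    }

    // variant produced by an "Add" transformation: the transplant statement
    // "cache.put(key, value);" is copied and inserted right after the
    // transplantation point; here the duplication is idempotent, so a test
    // suite that only checks the cache content still passes
    void store(Map<String, Integer> cache, String key, int value) {
        cache.put(key, value);
        cache.put(key, value);
    }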

Once we have generated sosie programs, we manually select a set of sosies that indeed expose some computational diversity. Third, we amplify the original test suites using our approach, and also using a baseline technique by Yoo and Harman [25], presented in Section 4.3. Finally, we run both amplified test suites and measure the proportion of variants (sosies) that are detected as computationally different. We also collect additional metrics to further qualify the effectiveness of DSpot.

4.2 Dataset

We build a dataset of subject programs for performing our experiments. The inclusion criteria are the following: 1) the subject program must be real-world software; 2) the subject program must be written in Java; 3) the subject program's test suite must use the JUnit testing framework; 4) the subject program must have a good test suite (a statement coverage higher than 80%).

This results in Apache Commons Math, Apache Commons Lang, Apache Commons Collections, Apache Commons Codec, Google GSON and Guava. The dominance of Apache projects is due to the fact that they are among the very rare organizations with a very strong development discipline.

In addition, we aim at running the whole experiment in less than one day (24 hours). Consequently, we take a single class for each of those projects, as well as all the test cases that exercise it at least once.

Table 1 provides the descriptive statistics of our dataset. It gives the subject program identifier, its purpose, the class we consider, the class' number of lines of code (LOC), the number of tests that execute at least one method of the class under consideration, the statement coverage, and the total number of program variants we consider (excluding the original program). We see that this benchmark covers different domains, such as data encoding and collections, and is only composed of well-tested classes. In total, there are between 12 and 145 computationally diverse variants of each program to be detected. This variation comes from the relative difficulty of manually forging computationally diverse variants, depending on the project.

4.3 Baseline

In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach. Their technique is designed for augmenting input space coverage, but can be directly applied to detecting computational diversity. Their algorithm, called test data regeneration (TDR for short), is based on four transformations on numerical values in test cases, data shifting (λx.x + 1 and λx.x - 1) and data scaling (multiply or divide the value by 2), and a hill-climbing algorithm based on the number of fitness function evaluations. They consider that a test case calls a single function, their implementation deals only with numerical functions, and they consider the numerical output of that function as the only observation point. In our experiment, we reimplemented the transformations on numerical values, since the tool used by Yoo is not available. We removed the hill-climbing part, since it is not relevant in our case. Analytically, the key differences between DSpot and TDR are: TDR stacks multiple transformations together; DSpot has more new transformation operators on test cases; DSpot considers a richer observation space based on arbitrary data types and sequences of method calls.
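For reference, the four numerical transformations of TDR that we reimplemented can be sketched as follows (the class and method names are ours, not TDR's):

    import java.util.Arrays;
    import java.util.List;

    public final class TdrTransformations {
        // Data shifting (i + 1, i - 1) and data scaling (i * 2, i / 2) of a numerical literal.
        public static List<Integer> neighbors(int i) {
            return Arrays.asList(i + 1, i - 1, i * 2, i / 2);
        }
    }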

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then, we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Now we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR. The one with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper into the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary, even on the same program. This phenomenon is due to the natural randomness of computation. We count the number of discarded observation points to answer this question quantitatively, and we also discuss one case study.

RQ3b: what is the richness of computational diversity? Now we want to really understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained by applying DSpot on our dataset.

4.5.1 # of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go by pair: one line provides data for one subject program, and the following one provides the same data gathered with the test suite amplified by DSpot. Columns 2 to 5 are organized in two groups: the first group gives a static view on the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular. Some test cases are used multiple times because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests.

Let's first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while Column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive. We create between 4x and 18x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (TC exec.) and Column 5 the number of assertions or observation points executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of executed test cases is impacted by amplification.

Table 2: The performance of DSpot on amplifying 7 Java test suites

                          Static                          Dynamic
                     # TC          # assert. or obs.   # TC exec.   # assert. or obs. exec.   # disc. obs.   branch cov.   path cov.
codec                  72              509                   72          3528                       —            124          1245
codec-DSpot           672 (×9)      10597 (×20)             672         16920                      12            126         12461
collections           111              433                  768          7035                       —            223           376
collections-DSpot    1291 (×12)     14772 (×34)            9202        973096                       0            224           465
io                    221             1330                  262          1346                       —            366           246
io-DSpot             2518 (×11)     20408 (×15)            2661        209911                   54313            373           287
lang                  233             2206                  233          2266                       —           1014           797
lang-DSpot            988 (×4)      12854 (×6)            12854         57856                      18           1015           901
guava                  35               84                14110         20190                       —             60            77
guava-DSpot           625 (×18)      6834 (×81)          624656          9464                       0             60            77
gson                  684             1125                  671          1127                       —            106            84
gson-DSpot           4992 (×7)      26869 (×24)            4772        167150                     144            108           137
JGit                  138              176                  138           185                       —             75          1284
JGit-DSpot           2152 (×16)     90828 (×516)           2089         92856                   13377             75          1735

Table 3: The effectiveness of computational diversity detection

                     # variants detected   # variants detected   input space   observation space   mean # of
                     by DSpot              by TDR                effect        effect              divergences
commons-codec        12/12                 10/12                 12/12         10/12                       219
commons-collections  133/133               133/133               133/133       133/133                   52079
commons-io           44/44                 18/44                 42/44         18/44                      4055
commons-lang         22/22                 0/22                  10/22         0/22                        229
guava                3/3                   0/3                   0/3           3/3                           2
gson                 145/145               0/145                 134/145       0/145                      8015
jgit                 113/113               0/113                 113/113       0/113                      15654

For instance, for commons-collections, there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other generic test methods. Consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. The last columns of Table 2 provide deeper insights about the synthesized test cases. The branch coverage column gives the branch coverage of the original test suites and of the amplified ones (lines with the -DSpot suffix). While original test suites have a very high branch coverage rate, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io/FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava/HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suite is also revealed in the last column of the table (path coverage): it provides the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for Guava, the total number of different paths covered in the methods under test increases from 84 to 137. This means that while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec6: the original test suite covers 780 different paths in this method, while the amplified test suite covers 11356 different paths. This phenomenon is due to the complex control flow of the method, and to the fact that its behavior directly depends on the value of an array of bytes, which takes many new values in the amplified test suite.

The amplification process is massive and produces rich new input points: the number of declared and executed test cases, and the diversity of executions from test cases, increase.

4.5.2 # of Generated Observation Points

Now we focus on the observation points. The fourth column of Table 2 gives the number of assertions in the original test suites. This corresponds to the number of locations where the tester specifies expected values about the state of the program execution. The fifth column gives the number of observation points in the amplified test suites. We do not call them assertions, since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants in order to assess computational diversity.

As we can see, we observe the program state on many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API of the program under consideration, hence they allow to reveal visible and exploitable computational diversity. However, this number also encompasses the observation points of the newly generated test cases.

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space. It is possible to observe the program state from many other observation points.

6 Line 331 in the Base64 class: https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

4.5.3 Effectiveness

We want to assess whether our method is effective for identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore more in depth whether computational diversity is revealed by new input points, by new observation points, or by both; we will come back to them later.

As we can see, DSpot is capable of detecting all the computationally diverse variants of our benchmark. On the contrary, the baseline technique TDR is always worse: either it detects only a fraction of them (e.g., 10/12 for commons-codec), or even none at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, incl. constructor calls, method invocations, and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The result of this experiment is given in the last two columns of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation method used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the result of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space, and 10/12 are detected by exploring the observation space. This means that 10 of them are detected either by one exploration or the other. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing columns "input space effect" and "observation space effect", one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) it is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.

The systematic exploration of the input and the observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.). This is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems: if the method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second one only refers to variations in the runtime environment.
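The following stand-alone snippet illustrates both kinds of naturally varying observations (the printed values are examples only; they change from run to run or from machine to machine):

    import java.io.File;

    public class NaturalRandomnessExample {
        public static void main(String[] args) {
            // varies between two runs on the same machine
            System.out.println(Thread.currentThread().getId());
            System.out.println(new Object().hashCode());
            // varies between execution environments (path conventions, working directory)
            System.out.println(new File("foo.txt").getAbsolutePath());
        }
    }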

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the assertions, in JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, this produces false positives for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.

Our experimental protocol enables us to quantify the number of discarded observation points. The 6th column of Table 2 gives this number.

    void testCanonicalEmptyCollectionExists() {
        if (((supportsEmptyCollections()) && (isTestSerialization())) && ((skipSerializedCanonicalTests()))) {
            Object object = makeObject();
            if (object instanceof Serializable) {
                String name = getCanonicalEmptyCollectionName(object);
                File f = new java.io.File(name);
                // observation on f
                Logger.logAssertArgument(f.getCanonicalPath());
                Logger.logAssertArgument(f.getAbsolutePath());
            }
        }
    }

Listing 3: An amplified test case with observation points that naturally vary, hence are discarded by DSpot

For instance, for commons-codec, DSpot detects 12 observation points that naturally vary. This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories, which are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. In those programs, DSpot does not detect a single point that naturally varies when running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good property, as it shows that the encapsulation is strong: more than providing an intuitive API, more than providing a protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to variables that naturally vary, there is a risk that, when they are used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand more in depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of FileOutputStream.

    // original program
    void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
        OutputStream out = null;
        out = openOutputStream(file, append);
        IOUtils.write(data, out, encoding);
        out.close();
    }

    // variant
    void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
        OutputStream out = null;
        out = new FileOutputStream(file, append);
        IOUtils.write(data, out, encoding);
        out.close();
    }

Listing 4: Two variants of writeStringToFile in commons-io

    void testCopyDirectoryPreserveDates() {
        try {
            File sourceFile = new File(sourceDirectory, "hello*txt");
            FileUtils.writeStringToFile(sourceFile, "HELLO WORLD", "UTF8");
        } catch (Exception e) {
            DSpot.observe(e.getMessage());
        }
    }

Listing 5: Amplified test case that reveals computational diversity between variants of Listing 4

These two variants behave differently outside the specified domain: in case writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: one character of the file name is changed into a star ("*"), which makes the file name invalid. Running this test on the variant results in a FileNotFoundException.

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON output with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on the instances of StringWriter that are modified by toJson().

    // Original program
    void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
        // ...
        writer.setSerializeNulls(oldSerializeNulls);
    }

    // variant
    void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
        // ...
        writer.setIndent(" ");
    }

Listing 6: Two variants of toJson in GSON

    public void testWriteMixedStreamed_remove534() throws IOException {
        // ...
        gson.toJson(RED_MIATA, Car.class, jsonWriter);
        jsonWriter.endArray();
        Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
        Logger.logAssertArgument(stringWriter.toString());
    }

Listing 7: Amplified test detecting black-box diversity among variants of Listing 6

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value of the encodedInt3 variable (the original String has one additional character, removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points. The single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmer: case 1 is possible, it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discuss how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrate three categories of code variations that maintain the expected functionality as specified in the test suite, but still induce diversity (different checks on inputs, different formatting, different handling of special cases).

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available.7

Second, we have forged the computationally diverse program variants ourselves. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

7 http://diversify-project.github.io/test-suite-amplification.html

    // Original program
    void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
        // ...
        switch (context.modulus) {
            case 0: // impossible, as excluded above
            case 1: // 6 bits - ignore entirely
                // not currently tested, perhaps it is impossible
                break;
            // ...
        }
    }

    // variant
    void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
        // ...
        switch (context.modulus) {
            case 0: // impossible, as excluded above
            case 1:
            // ...
        }
    }

Listing 8: Two variants of decode in commons-codec

    @Test
    void testCodeInteger3_literalMutation222() {
        String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
                + "rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==";
        Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
    }

Listing 9: Amplified test case that reveals the computational diversity between variants of Listing 8

This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than those of our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5 RELATED WORK

The work presented here is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with a similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

        Test suite amplification In the area of test suite am-plification the work by Yoo and Harman [25] is the mostclosely related to our approach and we used as the baselinefor computational diversity assessment They amplify testsuites only with transformations on integer values while wealso transform boolean and String literals as well as state-ments test cases Yoo and Harman also have two additionalparameters for test case transformation the interaction levelthat determines the number of simultaneous transformationon the same test case and the search radius that boundstheir search process when trying to improve the effectivenessof augmented test suites Their original intent is to increasethe input space coverage to improve test effectiveness Theydo not handle the oracle problem in that work

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of that work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6 CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations for, respectively, exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Behind the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.

7 ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8 REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability, and Assurance: From Needs to Solutions (CSDA '98), pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. Binhunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE '03), pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Trans. on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.

• 1 Introduction
• 2 Background
  • 2.1 N-version programming
  • 2.2 NVP-Diversity
  • 2.3 Graphical Explanation
  • 2.4 Unspecified Input Space
• 3 Our Approach to Detect Computational Diversity
  • 3.1 Overview
  • 3.2 Test Suite Transformations
    • 3.2.1 Exploring the Input Space
    • 3.2.2 Adding Observation Points
  • 3.3 Detecting and Measuring the Visible Computational Diversity
  • 3.4 Implementation
• 4 Evaluation
  • 4.1 Protocol
  • 4.2 Dataset
  • 4.3 Baseline
  • 4.4 Research Questions
  • 4.5 Empirical Results
    • 4.5.1 # of Generated Test Cases
    • 4.5.2 # of Generated Observation Points
    • 4.5.3 Effectiveness
    • 4.5.4 Natural Randomness of Computation
    • 4.5.5 Nature of Computational Diversity
  • 4.6 Threats to Validity
• 5 Related work
• 6 Conclusion
• 7 Acknowledgements
• 8 References

Table 1: Descriptive Statistics about our Dataset

Project               Purpose                                  Class           LOC    # tests   coverage   # variants
commons-codec         Data encoding                            Base64          255    72        98%        12
commons-collections   Collection library                       TreeBidiMap     1202   111       92%        133
commons-io            Input/output helpers                     FileUtils       1195   221       82%        44
commons-lang          General purpose helpers (e.g., String)   StringUtils     2247   233       99%        22
guava                 Collection library                       HashBiMap       525    35        91%        3
gson                  Json library                             Gson            554    684       89%        145
JGit                  Java implementation of GIT               CommitCommand   433    138       81%        113

3.3 Detecting and Measuring the Visible Computational Diversity

The final step of DSpot runs the amplified test suite on pairs of program variants. Given P1 and P2, the number of observation points which have different values on each variant accounts for visible computational diversity. When we compare a set of variants, we use the mean number of differences over each pair of variants.
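To make this comparison step concrete, the following is a minimal sketch of how divergences between two variants could be counted, assuming each run of the amplified test suite produces a trace that maps an observation-point identifier to the logged value. The class and method names (DivergenceCounter, countDivergences) are illustrative only, not part of DSpot.

import java.util.Map;

public class DivergenceCounter {

    // traceP1 and traceP2 map an observation-point id to the value logged on each variant.
    public static int countDivergences(Map<String, String> traceP1, Map<String, String> traceP2) {
        int divergences = 0;
        for (Map.Entry<String, String> entry : traceP1.entrySet()) {
            String other = traceP2.get(entry.getKey());
            // a point counts as a divergence when both variants logged it and the values differ
            if (other != null && !other.equals(entry.getValue())) {
                divergences++;
            }
        }
        return divergences;
    }

    // Two variants are considered computationally diverse if at least one observation point differs.
    public static boolean isComputationallyDiverse(Map<String, String> traceP1, Map<String, String> traceP2) {
        return countDivergences(traceP1, traceP2) > 0;
    }
}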

3.4 Implementation

Our prototype implementation amplifies Java source code (see footnote 4). The test suites are expected to be written using the JUnit testing framework, which is the #1 testing framework for Java. It uses Spoon [18] to manipulate the source code in order to create the amplified test cases. DSpot is able to amplify a test suite within minutes.
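As an illustration of how Spoon can be used for this kind of transformation, here is a minimal, hypothetical sketch that clones every JUnit test method and shifts its integer literals by one to create a new input point. It is not DSpot's actual code; the input path "src/test/java" and the "test" name prefix are assumptions of the sketch.

import java.util.ArrayList;

import spoon.Launcher;
import spoon.reflect.code.CtLiteral;
import spoon.reflect.declaration.CtClass;
import spoon.reflect.declaration.CtMethod;
import spoon.reflect.visitor.filter.TypeFilter;

public class LiteralAmplifier {
    public static void main(String[] args) {
        Launcher launcher = new Launcher();
        launcher.addInputResource("src/test/java"); // assumed location of the JUnit test sources
        launcher.buildModel();

        for (CtClass<?> testClass : launcher.getModel().getElements(new TypeFilter<>(CtClass.class))) {
            for (CtMethod<?> test : new ArrayList<>(testClass.getMethods())) {
                if (!test.getSimpleName().startsWith("test")) {
                    continue; // keep only methods that look like test cases in this sketch
                }
                CtMethod<?> amplified = test.clone();
                amplified.setSimpleName(test.getSimpleName() + "_literalMutation");
                // shift every integer literal by one, yielding a new point in the input space
                for (CtLiteral<?> literal : amplified.getElements(new TypeFilter<>(CtLiteral.class))) {
                    if (literal.getValue() instanceof Integer) {
                        ((CtLiteral<Integer>) literal).setValue((Integer) literal.getValue() + 1);
                    }
                }
                testClass.addMethod(amplified);
            }
        }
        // launcher.prettyprint() would write the transformed sources back to disk
    }
}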

The main challenges for the implementation of DSpot were as follows: handle the many different situations that occur in real-world, large test suites (use of different versions of JUnit, modularization of the code of the test suite itself, implementation of new types of assertions, etc.); handle large traces for the comparison of computation (as we will see in the next section, we collect hundreds of thousands of observations on each variant); spot the natural randomness in test case execution to prevent false positives in the assessment of computational diversity.

4 EVALUATION

To evaluate whether DSpot is capable of detecting computational diversity, we set up a novel empirical protocol and apply it on 7 large-scale Java programs. Our guiding research question is: Is DSpot capable of identifying realistic, large-scale programs that are computationally diverse?

4.1 Protocol

First, we take large open-source Java programs that are equipped with good test suites. Second, we forge variants of those programs using a technique from our previous work [2]. We call the variants sosie programs (see footnote 5).

Definition 1 (Sosie, noun). Given a program P, a test suite TS for P, and a program transformation T, a variant P' = T(P) is a sosie of P if the two following conditions hold: 1) there is at least one test case in TS that executes the part of P that is modified by T; 2) all test cases in TS pass on P'.
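The two conditions of Definition 1 can be sketched as follows; TestCase and Variant are hypothetical helper interfaces introduced only for illustration and are not part of our tooling.

import java.util.Set;

public class SosieChecker {

    // Hypothetical abstractions, for illustration only.
    interface Variant { }

    interface TestCase {
        boolean executes(String modifiedStatementId); // does this test cover the transformed code?
        boolean passesOn(Variant variant);            // is this test green on the variant?
    }

    static boolean isSosie(Set<TestCase> testSuite, Variant variant, String modifiedStatementId) {
        // condition 1: at least one test case executes the part of P modified by T
        boolean modifiedCodeIsCovered = testSuite.stream().anyMatch(t -> t.executes(modifiedStatementId));
        // condition 2: all test cases of TS pass on P'
        boolean allTestsPass = testSuite.stream().allMatch(t -> t.passesOn(variant));
        return modifiedCodeIsCovered && allTestsPass;
    }
}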

4 The prototype is available here: http://diversify-project.github.io/test-suite-amplification.html
5 The word sosie is a French word that literally means "look-alike".

Given an initial program, we synthesize sosies with source code transformations that are based on the modification of the abstract syntax tree (AST). As in previous work [16, 22], we consider three families of transformation that manipulate statement nodes of the AST: 1) remove a node in the AST (Delete); 2) add a node just after another one (Add); 3) replace a node by another one, e.g., a statement node is replaced by another statement (Replace). For "Add" and "Replace", the transplantation point refers to where a statement is inserted, the transplant statement refers to the statement that is copied and inserted, and both transplantation and transplant points are in the same AST (we do not synthesize new code nor take code from other programs). We consider transplant statements that manipulate variables of the same type as the transplantation point, and we bind the names of variables in the transplant to names that are in the namespace of the transplantation point. We call these transformations Steroid transformations; more details are available in our previous work [2].
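The following sketch illustrates the three transformation families on Spoon statement nodes; the type-compatibility checks and variable renaming performed by the actual Steroid transformations are deliberately omitted here.

import spoon.reflect.code.CtStatement;

public class SteroidTransformations {

    // Delete: remove the statement at the transplantation point from the AST.
    static void delete(CtStatement transplantationPoint) {
        transplantationPoint.delete();
    }

    // Add: insert a copy of the transplant statement just after the transplantation point.
    static void add(CtStatement transplantationPoint, CtStatement transplant) {
        transplantationPoint.insertAfter(transplant.clone());
    }

    // Replace: substitute the transplantation point with a copy of the transplant statement.
    static void replace(CtStatement transplantationPoint, CtStatement transplant) {
        transplantationPoint.replace(transplant.clone());
    }
}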

Once we have generated sosie programs, we manually select a set of sosies that indeed expose some computational diversity. Third, we amplify the original test suites using our approach, and also using a baseline technique by Yoo and Harman [25] presented in Section 4.3. Finally, we run both amplified test suites and measure the proportion of variants (sosies) that are detected as computationally different. We also collect additional metrics to further qualify the effectiveness of DSpot.

4.2 Dataset

We build a dataset of subject programs for performing our experiments. The inclusion criteria are the following: 1) the subject program must be real-world software; 2) the subject program must be written in Java; 3) the subject program's test suite must use the JUnit testing framework; 4) the subject program must have a good test suite (a statement coverage higher than 80%).

This results in Apache Commons Math, Apache Commons Lang, Apache Commons Collections, Apache Commons Codec, and Google GSON and Guava. The dominance of Apache projects is due to the fact that they are among the very rare organizations with a very strong development discipline.

In addition, we aim at running the whole experiment in less than one day (24 hours). Consequently, we take a single class for each of those projects, as well as all the test cases that exercise it at least once.

Table 1 provides the descriptive statistics of our dataset. It gives the subject program identifier, its purpose, the class we consider, the class' number of lines of code (LOC), the number of tests that execute at least one method of the class under consideration, the statement coverage, and the total number of program variants we consider (excluding the original program). We see that this benchmark covers different domains, such as data encoding and collections, and is only composed of well-tested classes. In total, there are between 12 and 145 computationally diverse variants of each program to be detected. This variation comes from the relative difficulty of manually forging computationally diverse variants, depending on the project.

4.3 Baseline

In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach. Their technique is designed for augmenting input space coverage, but can be directly applied to detecting computational diversity. Their algorithm, called test data regeneration (TDR for short), is based on four transformations on numerical values in test cases, namely data shifting (λx.x+1 and λx.x−1) and data scaling (multiply or divide the value by 2), and on a hill-climbing algorithm based on the number of fitness function evaluations. They consider that a test case calls a single function, their implementation deals only with numerical functions, and they consider the numerical output of that function as the only observation point. In our experiment, we reimplemented the transformations on numerical values, since the tool used by Yoo is not available. We removed the hill-climbing part since it is not relevant in our case. Analytically, the key differences between DSpot and TDR are: TDR stacks multiple transformations together; DSpot has more new transformation operators on test cases; DSpot considers a richer observation space, based on arbitrary data types and sequences of method calls.
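For reference, the four TDR transformations on numerical values can be sketched as follows; the hill-climbing search of the original technique is left out, mirroring the simplification described above, and the class and method names are illustrative.

import java.util.Arrays;
import java.util.List;
import java.util.function.IntUnaryOperator;

public class TdrOperators {

    // data shifting (x+1, x-1) and data scaling (x*2, x/2)
    static final List<IntUnaryOperator> OPERATORS = Arrays.asList(
            x -> x + 1,
            x -> x - 1,
            x -> x * 2,
            x -> x / 2);

    // Apply every operator to one integer literal taken from a test case,
    // producing four neighboring input values.
    static int[] regenerate(int literal) {
        return OPERATORS.stream().mapToInt(op -> op.applyAsInt(literal)).toArray();
    }
}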

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then, we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Now we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR. The one with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper in the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary, even on the same program. This phenomenon is due to the natural randomness of computation. To answer this question quantitatively, we count the number of discarded observation points; to answer it qualitatively, we discuss one case study.

RQ3b: what is the richness of computational diversity? Now we want to really understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained when applying DSpot on our dataset.

4.5.1 # of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go by pairs: one provides data for one subject program, and the following one provides the same data gathered with the test suite amplified by DSpot. Columns 2 to 5 are organized in two groups: the first group gives a static view on the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular. Some test cases are used multiple times because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests.

Let's first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive. We create between 4x and 12x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (# TC exec) and column 5 the number of assertions executed or, for the amplified suites, the number of observation points executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of test cases actually executed is impacted by amplification. For instance, for commons-collections,

Table 2: The performance of DSpot on amplifying 7 Java test suites
(the first two data columns give the static view; the remaining columns give the dynamic view)

                    # TC          # assert/obs    # TC exec   # assert/obs exec   # disc obs   branch cov   path cov
codec               72            509             72          3528                -            124          1245
codec-DSpot         672 (×9)      10597 (×20)     672         16920               12           126          12461
collections         111           433             768         7035                -            223          376
collections-DSpot   1291 (×12)    14772 (×34)     9202        973096              0            224          465
io                  221           1330            262         1346                -            366          246
io-DSpot            2518 (×11)    20408 (×15)     2661        209911              54313        373          287
lang                233           2206            233         2266                -            1014         797
lang-DSpot          988 (×4)      12854 (×6)      12854       57856               18           1015         901
guava               35            84              14110       20190               -            60           77
guava-DSpot         625 (×18)     6834 (×81)      624656      9464                0            60           77
gson                684           1125            671         1127                -            106          84
gson-DSpot          4992 (×7)     26869 (×24)     4772        167150              144          108          137
JGit                138           176             138         185                 -            75           1284
JGit-DSpot          2152 (×16)    90828 (×516)    2089        92856               13377        75           1735

Table 3: The effectiveness of computational diversity detection

                      # variants detected   # variants detected   input space   observation    mean # of
                      by DSpot              by TDR                effect        space effect   divergences
commons-codec         12/12                 10/12                 12/12         10/12          219
commons-collections   133/133               133/133               133/133       133/133        52079
commons-io            44/44                 18/44                 42/44         18/44          4055
commons-lang          22/22                 0/22                  10/22         0/22           229
guava                 3/3                   0/3                   0/3           3/3            2
gson                  145/145               0/145                 134/145       0/145          8015
jgit                  113/113               0/113                 113/113       0/113          15654

there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other generic test methods. Consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. The last two columns of Table 2 (branch and path coverage) provide deeper insights about the synthesized test cases. The branch coverage column gives the number of branches covered by the original test suites and the amplified ones (lines with -DSpot identifiers). While original test suites have a very high branch coverage rate, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suite is also revealed in the last column of the table (path coverage): it provides the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for GSON, the total number of different paths covered in the methods under test increases from 84 to 137. This means that, while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec (see footnote 6): the original test suite covers 780 different paths in this method, while the amplified test suite covers 11356 different paths. This phenomenon is due to the complex control flow of the method and to the fact that its behavior directly depends on the value of an array of bytes that takes many new values in the amplified test suite.

The amplification process is massive and produces rich new input points: the number of declared and executed test cases, and the diversity of executions from test cases, increase.

4.5.2 # of Generated Observation Points

Now we focus on the observation points. For each subject program, Table 2 gives the number of assertions in the original test suite (column 3). This corresponds to the number of locations where the tester specifies expected values about the state of the program execution. The same column, for the amplified rows, gives the number of observation points in the amplified test suite. We do not call them assertions, since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants in order to assess the computational diversity.
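To illustrate what an observation point is, the sketch below derives observations from the public API of an object involved in a test by calling its parameterless getters and toString(). This reflective loop is an approximation for illustration only; DSpot generates the corresponding calls statically in the amplified test code, recording each value with a logging call similar to the Logger.logAssertArgument(...) calls visible in the listings below.

import java.lang.reflect.Method;

public class Observations {

    public static void observe(Object objectUnderObservation) {
        for (Method method : objectUnderObservation.getClass().getMethods()) {
            boolean isObservation = method.getParameterCount() == 0
                    && (method.getName().startsWith("get")
                        || method.getName().startsWith("is")
                        || method.getName().equals("toString"));
            if (!isObservation) {
                continue;
            }
            try {
                // each returned value becomes one observation point of the trace
                System.out.println(method.getName() + " -> " + method.invoke(objectUnderObservation));
            } catch (ReflectiveOperationException e) {
                // this facet cannot be observed (e.g., the getter throws): skip it
            }
        }
    }
}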

As we can see, we observe the program state on many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API of the program under consideration, and hence reveal visible and exploitable computational diversity. However, this number also encompasses the observation points of the newly generated test cases.

6 Line 331 in the Base64 class: https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space. It is possible to observe the program state from many other observation points.

4.5.3 Effectiveness

We want to assess whether our method is effective for identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in the descriptive Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore more in depth whether computational diversity is revealed by new input points, new observation points, or both; we will come back to them later.

As we can see, DSpot is capable of detecting all computationally diverse variants of our benchmark. On the contrary, the baseline technique TDR is always worse: either it detects only a fraction of them (e.g., 10/12 for commons-codec) or even none at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, including constructor calls, method invocations, and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The result of this experiment is given in the "input space effect" and "observation space effect" columns of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation method used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the result of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space, and 10/12 are detected by exploring the observation space. This means that 10 of them are detected by either one exploration or the other. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing columns "input space effect" and "observation space effect", one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) this is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.

The systematic exploration of the input and the observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.). This is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems. If method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second only refers to variations in the runtime environment.

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the assertions in JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, this would produce a false positive for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.
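A minimal sketch of this discarding phase is given below, under the assumption that each run of the amplified test suite yields a map from observation-point identifier to logged value; only the points that are stable across several runs of the same program are kept for the later comparison between variants. The class name and method signature are illustrative, not DSpot's actual API.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NaturalRandomnessFilter {

    // runsOnSameProgram: one map per run of the amplified test suite on the original program,
    // from observation-point id to the logged value.
    public static Set<String> stablePoints(List<Map<String, String>> runsOnSameProgram) {
        Map<String, String> reference = runsOnSameProgram.get(0);
        Set<String> stable = new HashSet<>(reference.keySet());
        for (Map<String, String> run : runsOnSameProgram.subList(1, runsOnSameProgram.size())) {
            // discard any point that is missing or whose value changed in this run
            stable.removeIf(point -> !run.containsKey(point) || !run.get(point).equals(reference.get(point)));
        }
        return stable; // only these points are used when comparing the original program with its variants
    }
}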

Our experimental protocol enables us to quantify the number of discarded observation points. The 6th column of Table 2 gives this number.

void testCanonicalEmptyCollectionExists() {
    if (((supportsEmptyCollections()) && (isTestSerialization())) && (!(skipSerializedCanonicalTests()))) {
        Object object = makeObject();
        if (object instanceof Serializable) {
            String name = getCanonicalEmptyCollectionName(object);
            File f = new java.io.File(name);
            // observation on f
            Logger.logAssertArgument(f.getCanonicalPath());
            Logger.logAssertArgument(f.getAbsolutePath());
        }
    }
}

Listing 3: An amplified test case with observation points that naturally vary, hence are discarded by DSpot

For instance, for commons-codec, DSpot detects 12 observation points that naturally vary. This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories that are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. In those programs, DSpot does not detect a single point that naturally varies when running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow one to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good sign, as it shows that the encapsulation is strong: more than providing an intuitive API, and more than providing a protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString()). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to variables that naturally vary, there is a risk that, when used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand more in depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of FileOutputStream.

// original program
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = openOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

// variant
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = new FileOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

Listing 4: Two variants of writeStringToFile in commons-io

void testCopyDirectoryPreserveDates() {
    try {
        File sourceFile = new File(sourceDirectory, "hello*txt"); // amplified String: the "." is replaced by "*"
        FileUtils.writeStringToFile(sourceFile, "HELLO WORLD", "UTF8");
    } catch (Exception e) {
        DSpot.observe(e.getMessage());
    }
}

Listing 5: Amplified test case that reveals computational diversity between the variants of Listing 4

These two variants behave differently outside the specified domain: in case writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: a "." is changed into a star "*", which makes the file name invalid. Running this test on the variant results in a FileNotFoundException.

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates JSON with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on instances of StringWriter that are modified by toJson().

// original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    ...
    writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    ...
    writer.setIndent("  ");
}

Listing 6: Two variants of toJson in GSON

public void testWriteMixedStreamed_remove534() throws IOException {
    ...
    gson.toJson(RED_MIATA, Car.class, jsonWriter);
    jsonWriter.endArray();
    Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
    Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break; an original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value of the encodedInt3 variable (the original String has one additional character, removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points; the single observation point shown in the listing is the one that detects computational diversity. It calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmer: case 1 is possible, it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discussed how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrated three categories of code variations that maintain the expected functionality, as specified in the test suite, but still induce diversity: different checks on inputs, different formatting, and different handling of special cases.

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have a high confidence in our results. Our implementation is publicly available (see footnote 7).

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

7 http://diversify-project.github.io/test-suite-amplification.html

// original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1: // 6 bits - ignore entirely
            // not currently tested; perhaps it is impossible
            break;
        ...
    }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1:
        ...
    }
}

Listing 8: Two variants of decode in commons-codec

@Test
void testCodeInteger3_literalMutation222() {
    String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
        + "rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==";
    Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between the variants of Listing 8

This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively, an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.


            the total number of program variants we consider (excludingthe original program) We see that this benchmark coversdifferent domains such as data encoding and collectionsand is only composed of well-tested classes In total thereare between 12 and 145 computationally diverse variants ofeach program to be detected This variation comes fromthe relative difficulty of manually forging computationallydiverse variants depending on the project

4.3 Baseline

In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach. Their technique is designed for augmenting input space coverage, but can be directly applied to detecting computational diversity. Their algorithm, called test data regeneration (TDR for short), is based on four transformations on numerical values in test cases, data shifting (λx.x+1 and λx.x−1) and data scaling (multiply or divide the value by 2), and on a hill-climbing algorithm based on the number of fitness function evaluations. They consider that a test case calls a single function, their implementation deals only with numerical functions, and they consider the numerical output of that function as the only observation point. In our experiment, we reimplemented the transformations on numerical values, since the tool used by Yoo is not available. We removed the hill-climbing part since it is not relevant in our case. Analytically, the key differences between DSpot and TDR are: TDR stacks multiple transformations together; DSpot has more new transformation operators on test cases; DSpot considers a richer observation space based on arbitrary data types and sequences of method calls.
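To make the baseline concrete, here is a minimal sketch of the four TDR-style numerical transformations as we reimplemented them; the class and method names, and the application of one operator to all literals of a test case at once, are our own simplifications, not Yoo and Harman's tool.

import java.util.Arrays;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Sketch of the four TDR data transformations used as baseline: data shifting
// (x+1, x-1) and data scaling (x*2, x/2) applied to the integer literals of a test case.
public class TdrNumericTransformations {

    static final List<IntUnaryOperator> OPERATORS = List.of(
            x -> x + 1,  // data shifting: lambda x. x + 1
            x -> x - 1,  // data shifting: lambda x. x - 1
            x -> x * 2,  // data scaling: multiply by 2
            x -> x / 2   // data scaling: divide by 2
    );

    // Given the integer literals of a test case, produce one mutated tuple per operator;
    // each tuple corresponds to one regenerated test case.
    static int[][] regenerate(int[] literals) {
        int[][] variants = new int[OPERATORS.size()][];
        for (int i = 0; i < OPERATORS.size(); i++) {
            variants[i] = Arrays.stream(literals).map(OPERATORS.get(i)).toArray();
        }
        return variants;
    }

    public static void main(String[] args) {
        // a test calling f(10, 3) yields regenerated calls f(11, 4), f(9, 2), f(20, 6), f(5, 1)
        for (int[] v : regenerate(new int[]{10, 3})) {
            System.out.println(Arrays.toString(v));
        }
    }
}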

4.4 Research Questions

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases? We want to know whether our transformation operators on test cases enable us to create many different new test cases, i.e., new points in the input space. Since DSpot systematically explores all neighbors according to the transformation operators, we measure the number of generated test cases to answer this basic research question.

RQ1b: what is the number of additional observation points? In addition to creating new input points, DSpot creates new observation points. We want to know the order of magnitude of the number of those new observation points. To have a clear explanation, we start by performing only observation point amplification (without input point amplification) and count the total number of observations. We compare this number with the initial number of assertions, which exactly corresponds to the original observation points.

Then we evaluate the ability of the amplified test suite to assess computational diversity.

RQ2a: does DSpot identify more computationally diverse programs than TDR? Now we want to compare our technique with the related work. We count the number of variants that are identified as computationally different using DSpot and TDR. The technique with the highest value is better.

RQ2b: does the efficiency of DSpot come from the new inputs or the new observations? DSpot stacks two techniques: the amplification of the input space and the amplification of the observation space. To study their impact in isolation, we count the number of computationally diverse program variants that are detected by the original input points equipped with new observation points, and by the amplified set of input points with the original observations.

The last research questions dig deeper into the analysis of amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness in computation? Recall that DSpot removes some observation points that naturally vary even on the same program. This phenomenon is due to the natural randomness of computation. To answer this question quantitatively, we count the number of discarded observation points; to answer it qualitatively, we discuss one case study.

RQ3b: what is the richness of computational diversity? Now we want to really understand the reasons behind the computational diversity we observe. We take a random sample of three pairs of computationally diverse program variants and analyze them. We discuss our findings.

4.5 Empirical Results

We now discuss the empirical results obtained by applying DSpot on our dataset.

4.5.1 # of Generated Test Cases

Table 2 presents the key statistics of the amplification process. The lines of this table go by pairs: one provides data for one subject program, and the following one provides the same data gathered with the test suite amplified by DSpot. Columns 2 to 5 are organized in two groups: the first group gives a static view on the test suites (e.g., how many test methods are declared); the second group draws a dynamic picture of the test suites under study (e.g., how many assertions are executed).

Indeed, in real large-scale programs, test cases are modular. Some test cases are used multiple times because they are called by other test cases. For instance, a test case that specifies a contract on a collection is called when testing all implementations of collections (ArrayList, LinkedList, etc.). We call them generic tests.

Let's first concentrate on the static values. Column 2 gives the number of test cases in the original and amplified test suites, while column 3 gives the number of assertions in the original test suites and the number of observations in the amplified ones.

One can see that our amplification process is massive. We create between 4x and 12x more test cases than the original test suites. For instance, the test suite considered for commons-codec contains 72 test cases; DSpot produces an amplified test suite that contains 672 test methods, 9x more than the original test suite. The original test suite observes the state of the program with 509 assertions, while DSpot employs 10597 observation points to detect computational differences.

Let us now consider the dynamic part of the table. Column 4 gives the number of tests executed (TC exec.) and column 5 the number of assertions or observation points executed. Column 6 gives the number of observation points discarded because of natural variations (discussed in more detail in Section 4.5.4). As we can see, the number of generated tests (ATC exec.) is impacted by amplification. For instance, for commons-collections,

Table 2: The performance of DSpot on amplifying 7 Java test suites

                    |           Static           |                          Dynamic
                    | TC          | assert./obs. | TC exec. | assert./obs. exec. | disc. obs. | branch cov. | path cov.
codec               | 72          | 509          | 72       | 3528               | -          | 124         | 1245
codec-DSpot         | 672 (×9)    | 10597 (×20)  | 672      | 16920              | 12         | 126         | 12461
collections         | 111         | 433          | 768      | 7035               | -          | 223         | 376
collections-DSpot   | 1291 (×12)  | 14772 (×34)  | 9202     | 973096             | 0          | 224         | 465
io                  | 221         | 1330         | 262      | 1346               | -          | 366         | 246
io-DSpot            | 2518 (×11)  | 20408 (×15)  | 2661     | 209911             | 54313      | 373         | 287
lang                | 233         | 2206         | 233      | 2266               | -          | 1014        | 797
lang-DSpot          | 988 (×4)    | 12854 (×6)   | 12854    | 57856              | 18         | 1015        | 901
guava               | 35          | 84           | 14110    | 20190              | -          | 60          | 77
guava-DSpot         | 625 (×18)   | 6834 (×81)   | 624656   | 9464               | 0          | 60          | 77
gson                | 684         | 1125         | 671      | 1127               | -          | 106         | 84
gson-DSpot          | 4992 (×7)   | 26869 (×24)  | 4772     | 167150             | 144        | 108         | 137
JGit                | 138         | 176          | 138      | 185                | -          | 75          | 1284
JGit-DSpot          | 2152 (×16)  | 90828 (×516) | 2089     | 92856              | 13377      | 75          | 1735

Table 3: The effectiveness of computational diversity detection

                    | variants detected by DSpot | variants detected by TDR | input space effect | observation space effect | mean # of divergences
commons-codec       | 12/12                      | 10/12                    | 12/12              | 10/12                    | 219
commons-collections | 133/133                    | 133/133                  | 133/133            | 133/133                  | 52079
commons-io          | 44/44                      | 18/44                    | 42/44              | 18/44                    | 4055
commons-lang        | 22/22                      | 0/22                     | 10/22              | 0/22                     | 229
guava               | 3/3                        | 0/3                      | 0/3                | 3/3                      | 2
gson                | 145/145                    | 0/145                    | 134/145            | 0/145                    | 8015
jgit                | 113/113                    | 0/113                    | 113/113            | 0/113                    | 15654

there are 1291 tests in the amplified test suite, but altogether 9202 test cases are executed. The reason is that we synthesize new test cases that use other generic test methods. Consequently, this increases the number of executed generic test methods, which is included in our count.

Our test case transformations yield a rich exploration of the input space. Columns 7 and 8 of Table 2 provide deeper insights about the synthesized test cases. Column 7 gives the branch coverage of the original test suites and the amplified ones (lines with -DSpot identifiers). While original test suites have a very high branch coverage rate, DSpot is still able to generate new tests that cover a few previously uncovered branches. For instance, the amplified test suite for commons-io FileUtils reaches 7 branches that were not executed by the original test suite. Meanwhile, the original test suite for guava HashBiMap already covers 90% of the branches, and DSpot did not generate test cases that cover new branches.

The richness of the amplified test suite is also revealed in the last column of the table (path coverage): it provides the cumulative number of different paths executed by the test suite in all methods under test. The amplified test suites cover many more paths than the original ones, which means that they trigger a much wider set of executions of the class under test than the original test suites. For instance, for Guava, the total number of different paths covered in the methods under test increases from 84 to 137. This means that, while the amplified test suite does not cover many new branches, it executes the parts that were already covered in many novel ways, increasing the diversity of executions that are tested. There is one extreme case in the encode method of commons-codec (see footnote 6): the original test suite covers 780 different paths in this method, while the amplified test suite covers 11356 different paths. This phenomenon is due to the complex control flow of the method and to the fact that its behavior directly depends on the value of an array of bytes that takes many new values in the amplified test suite.

The amplification process is massive and produces rich new input points: the number of declared and executed test cases and the diversity of executions from test cases increase.
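To make the path coverage metric used above concrete, the following sketch counts distinct execution paths of a method as distinct sequences of branch outcomes; this is our own illustration, not DSpot's actual instrumentation.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: the "path coverage" of a method is approximated here as the number of
// distinct sequences of branch outcomes observed across all invocations.
public class PathCounter {

    private final List<Boolean> currentPath = new ArrayList<>();
    private final Set<List<Boolean>> distinctPaths = new HashSet<>();

    // to be called by an instrumented method at every branch decision
    boolean branch(boolean outcome) {
        currentPath.add(outcome);
        return outcome;
    }

    // to be called when the instrumented method returns
    void methodExit() {
        distinctPaths.add(new ArrayList<>(currentPath));
        currentPath.clear();
    }

    int distinctPathCount() {
        return distinctPaths.size();
    }

    // example of an instrumented method with a single branch
    int abs(int x) {
        int result = branch(x < 0) ? -x : x;
        methodExit();
        return result;
    }

    public static void main(String[] args) {
        PathCounter counter = new PathCounter();
        counter.abs(3);
        counter.abs(-5);
        counter.abs(7);
        System.out.println(counter.distinctPathCount()); // prints 2: paths [false] and [true]
    }
}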

4.5.2 # of Generated Observation Points

Now we focus on the observation points. The fourth column of Table 2 gives the number of assertions in the original test suites. This corresponds to the number of locations where the tester specifies expected values about the state of the program execution. The fifth column gives the number of observation points in the amplified test suite. We do not call them assertions, since they do not contain an expected value, i.e., there is no oracle. Recall that we use those observation points to compare the behavior of two program variants in order to assess the computational diversity.

As we can see, we observe the program state on many more observation points than the original assertions. As discussed in Section 2.2, those observation points use the API

6: line 331 in the Base64 class, https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

of the program under consideration, hence they allow us to reveal visible and exploitable computational diversity. However, this number also encompasses the observation points of the newly generated test cases.

If we look at the dynamic perspective (second part of Table 2), one observes the same phenomenon as for test cases and assertions: there are many more points actually observed during test execution than statically declared ones. The reasons are identical: many observation points are in generic test methods that are executed several times, or are within loops in test code.

These results validate our initial intuition that a test suite only covers a small portion of the observation space. It is possible to observe the program state from many other observation points.

4.5.3 Effectiveness

We want to assess whether our method is effective for identifying computationally diverse program variants. As ground truth, we have the forged variants for which we know that they are NVP-diverse (see Section 4.1); their numbers are given in the descriptive Table 1. The benchmark is publicly available at http://diversify-project.eu/data.

We run DSpot and TDR to see whether those two techniques are able to detect the computationally diverse programs. Table 3 gives the results of this evaluation. The first column contains the name of the subject program. The second column gives the number of variants detected by DSpot. The third column gives the number of variants detected by TDR. The last three columns explore more in depth whether computational diversity is revealed by new input points, by new observation points, or both; we will come back to them later.

As we can see, DSpot is capable of detecting all computationally diverse variants of our benchmark. On the contrary, the baseline technique TDR is always worse: either it detects only a fraction of them (e.g., 10/12 for commons-codec) or it detects none at all. The reason is that TDR, as originally proposed by Yoo and Harman, focuses on simple programs with shallow input spaces (one single method with integer arguments). On the contrary, DSpot is designed to handle rich input spaces, including constructor calls, method invocations, and strings. This has a direct impact on the effectiveness of detecting computational diversity in program variants.

Our technique is based on two insights: the amplification of the input space and the amplification of the observation space. We now want to understand the impact of each of them. To do so, we disable one or the other kind of amplification and measure the number of detected variants. The result of this experiment is given in columns "input space effect" and "observation space effect" of Table 3. Column "input space effect" gives the number of variants that are detected only by the exploration of the input space (i.e., by observing the program state only with the observation method used in the original assertions). Column "observation space effect" gives the number of variants that are detected only by the exploration of the observation space (i.e., by observing the result of method calls on the objects involved in the test). For instance, for commons-codec, all variants (12/12) are detected by exploring the input space and 10/12 are detected by exploring the observation space. This means that 10 of them are detected either by one exploration or the other. On the contrary, for guava, only the exploration of the observation space enables DSpot to detect the three computationally diverse variants of our benchmark.

By comparing columns "input space effect" and "observation space effect", one sees that our two explorations are not mutually exclusive and are complementary. Some variants are detected by both kinds of exploration (as in the case of commons-codec). For some subjects, only the exploration of the input space is effective (e.g., commons-lang), while for others (guava) it is the opposite. Globally, the exploration of the input space is more efficient: most variants are detected this way.

Let us now consider the last column of Table 3. It gives the mean number of observation points for which we observe a difference between the original program and the variant to be detected. For instance, among the 12 variants for commons-codec, there are on average 219 observation points for which there is a difference. Those numbers are high, showing that the observation points are not independent: many of the methods we call to observe the program state inspect a different facet of the same state. For instance, in a list, the methods isEmpty() and size() are semantically correlated.

The systematic exploration of the input and the observation spaces is effective at detecting behavioral diversity between program variants.

4.5.4 Natural Randomness of Computation

When experimenting with DSpot on real programs, we noticed that some observation points naturally vary, even when running the same test case several times on the same program. For instance, a hashcode that takes into account a random salt can be different between two runs of the same test case. We call this effect the "natural randomness" of test case execution.

We distinguish two kinds of natural variations in the execution of test suites. First, some observation points vary over time when the test case is executed several times in the same environment (same machine, OS, etc.). This is the case for the hashcode example. Second, some observation points vary depending on the execution environment. For instance, if one adds an observation point on a file name, the path name convention is different on Unix and Windows systems: if the method getAbsolutePath is an observation point, it may return /tmp/foo.txt on Unix and C:\tmp\foo.txt on Windows. While the first example is pure randomness, the second only refers to variations in the runtime environment.

Interestingly, this natural randomness is not problematic in the case of the original test suites, because it remains below the level of observation of the oracles (the test suite assertions in JUnit test suites). However, in our case, if one keeps an observation point that is impacted by some natural randomness, this would produce false positives for computational diversity detection. Hence, as explained in Section 3, one phase of DSpot consists in detecting the natural randomness first and discarding the impacted observation points.
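A minimal sketch of such a filter is shown below: the amplified test suite is run several times on the same program, and every observation point whose value is not stable across runs is discarded. The trace representation (a map from observation point to observed value) and the method names are our own assumptions, not DSpot's actual API.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Sketch: keep only the observation points whose observed value is identical
// across repeated runs of the amplified test suite on the same program.
public class NaturalRandomnessFilter {

    // one trace = observation point identifier -> observed value (serialized as a String)
    static Set<String> stablePoints(List<Map<String, String>> tracesOfSameProgram) {
        Map<String, String> reference = tracesOfSameProgram.get(0);
        Set<String> stable = new HashSet<>(reference.keySet());
        for (Map<String, String> trace : tracesOfSameProgram) {
            stable.removeIf(point -> !Objects.equals(reference.get(point), trace.get(point)));
        }
        return stable; // only these points are used when comparing program variants
    }

    public static void main(String[] args) {
        // hypothetical traces of two runs on the same program and machine
        Map<String, String> run1 = Map.of("f.getName()", "foo.txt", "o.hashCode()", "1746372");
        Map<String, String> run2 = Map.of("f.getName()", "foo.txt", "o.hashCode()", "9982611");
        System.out.println(stablePoints(List.of(run1, run2))); // prints [f.getName()]
    }
}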

Our experimental protocol enables us to quantify the number of discarded observation points. The 6th column of Table 2 gives this number.

void testCanonicalEmptyCollectionExists() throws Exception {
    if ((supportsEmptyCollections() && isTestSerialization())
            && !skipSerializedCanonicalTests()) {
        Object object = makeObject();
        if (object instanceof Serializable) {
            String name = getCanonicalEmptyCollectionName(object);
            File f = new java.io.File(name);
            // observation on f
            Logger.logAssertArgument(f.getCanonicalPath());
            Logger.logAssertArgument(f.getAbsolutePath());
        }
    }
}

Listing 3: An amplified test case with observation points that naturally vary, hence are discarded by DSpot

For instance, for commons-codec, DSpot detects 12 observation points that naturally vary. This column shows two interesting facts. First, there is a large variation in the number of discarded observation points: it goes up to 54313 for commons-io. This case, together with JGit (the last line), is due to the heavy dependency of the library on the underlying file system (commons-io is about I/O, hence file system, operations; JGit is about manipulating Git versioning repositories, which are also stored on the local file system).

Second, there are two subject programs (commons-collections and guava) for which we discard no points at all. In those programs, DSpot does not detect a single point that naturally varies when running the test suite 100 times on three different operating systems. The reason is that the API of those subject programs does not allow one to inspect the internals of the program state up to the naturally varying parts (e.g., the memory addresses). We consider this a good property, as it shows that the encapsulation is strong: beyond providing an intuitive API and a protection against future changes, it also completely encapsulates the natural randomness of the computation.

Let us now consider a case study. Listing 3 shows an example of an amplified test with observation points for Apache Commons Collections. There are 12 observation methods that can be called on the object f, an instance of File (11 getter methods and toString). The listing shows two getter methods that return different values from one run to another (there are 5 getter methods with that kind of behavior for a File object). We ignore these observation points when comparing the original program with the variants.
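As an illustration of what calling "observation methods" on an object like f means, the sketch below enumerates the public, parameter-less getters and toString() of an object through reflection and logs their return values. This is our own simplification, not DSpot's implementation (whose selection, for File, amounts to the 11 getters and toString mentioned above).

import java.lang.reflect.Method;

// Sketch: collect observations on an object by invoking its public, parameter-less
// getters and toString(), as done on the File instance f of Listing 3.
public class ObservationCollector {

    static void observe(String label, Object target) {
        for (Method m : target.getClass().getMethods()) {
            boolean isObservation = m.getParameterCount() == 0
                    && m.getReturnType() != void.class
                    && !m.getName().equals("getClass")
                    && (m.getName().startsWith("get")
                        || m.getName().startsWith("is")
                        || m.getName().equals("toString"));
            if (!isObservation) {
                continue;
            }
            try {
                System.out.println(label + "." + m.getName() + "() = " + m.invoke(target));
            } catch (Exception e) {
                // observations that throw are simply skipped in this sketch
            }
        }
    }

    public static void main(String[] args) {
        observe("f", new java.io.File("/tmp/foo.txt"));
    }
}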

The systematic exploration of the observable output space provides new insights about the degree of encapsulation of a class. When a class gives public access to variables that naturally vary, there is a risk that, when used in oracles, they result in flaky test cases.

4.5.5 Nature of Computational Diversity

Now we want to understand more in depth the nature of the NVP-diversity we are observing. Let us discuss three case studies.

Listing 4 shows two variants of the writeStringToFile() method of Apache Commons IO. The original program calls openOutputStream, which checks different things about the file name, while the variant directly calls the constructor of

// original program
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = openOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

// variant
void writeStringToFile(File file, String data, Charset encoding, boolean append) throws IOException {
    OutputStream out = null;
    out = new FileOutputStream(file, append);
    IOUtils.write(data, out, encoding);
    out.close();
}

Listing 4: Two variants of writeStringToFile in commons-io

void testCopyDirectoryPreserveDates() {
    try {
        File sourceFile = new File(sourceDirectory, "hello*txt");
        FileUtils.writeStringToFile(sourceFile, "HELLOWORLD", "UTF8");
    } catch (Exception e) {
        DSpot.observe(e.getMessage());
    }
}

Listing 5: Amplified test case that reveals computational diversity between the variants of Listing 4

FileOutputStream. These two variants behave differently outside the specified domain: in case writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: a character of the file name is changed into a star ("*"), which makes the file name an invalid one. Running this test on the variant results in a FileNotFoundException.
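A hedged sketch of this kind of String-literal transformation is given below (character replacement and character removal); the replacement pool and the random selection are our own assumptions about one possible implementation, not DSpot's actual operators.

import java.util.Random;

// Sketch of two String-literal transformations used during input amplification:
// replace one character of a literal, or remove one character of a literal.
public class StringLiteralTransformations {

    private static final Random RANDOM = new Random();
    private static final String REPLACEMENT_POOL = "*/.:abc123"; // assumed pool of characters

    // precondition (assumed): literal is non-empty
    static String replaceOneChar(String literal) {
        int i = RANDOM.nextInt(literal.length());
        char replacement = REPLACEMENT_POOL.charAt(RANDOM.nextInt(REPLACEMENT_POOL.length()));
        return literal.substring(0, i) + replacement + literal.substring(i + 1);
    }

    static String removeOneChar(String literal) {
        int i = RANDOM.nextInt(literal.length());
        return literal.substring(0, i) + literal.substring(i + 1);
    }

    public static void main(String[] args) {
        // e.g. "hello.txt" may become "hello*txt" (replacement) or "hellotxt" (removal)
        System.out.println(replaceOneChar("hello.txt"));
        System.out.println(removeOneChar("hello.txt"));
    }
}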

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on instances of StringWriter, which are modified by toJson().

// Original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ... (rest of the method is identical in both variants)
    writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    // ... (rest of the method is identical in both variants)
    writer.setIndent("  ");
}

Listing 6: Two variants of toJson in GSON

public void testWriteMixedStreamed_remove534() throws IOException {
    gson.toJson(RED_MIATA, Car.class, jsonWriter);
    jsonWriter.endArray();
    Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
    Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value in the encodedInt3 variable (the original String has an additional character that is removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points; the single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmer: case 1 is possible, and it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discuss how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrate three categories of code variations that maintain the expected functionality as specified in the test suite but still induce diversity (different checks on inputs, different formatting, different handling of special cases).

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available (see footnote 7).

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those

7: http://diversify-project.github.io/test-suite-amplification.html

// Original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    // ...
    switch (context.modulus) {
        case 0 : // impossible, as excluded above
        case 1 : // 6 bits - ignore entirely
                 // not currently tested; perhaps it is impossible?
            break;
        // ...
    }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    // ...
    switch (context.modulus) {
        case 0 : // impossible, as excluded above
        case 1 : // (no break here: execution falls through to the next case)
        // ...
    }
}

Listing 8: Two variants of decode in commons-codec

@Test
void testCodeInteger3_literalMutation222() {
    String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
        + "rY+1LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg==";
    Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between the variants of Listing 8

variants. This is true for all self-made evaluations. This threat to the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5 RELATED WORK

The work presented is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity, and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of that work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6 CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations, for respectively exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Beyond the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.

7 ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No 600654 DIVERSIFY project.

8 REFERENCES

[1] A. Avizienis. The n-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491-1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149-159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85-96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability, and Assurance: From Needs to Solutions, CSDA '98, pages 171-, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7-16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147-156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. Binhunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238-255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60-71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294-305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81-92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30-37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Trans. on Software Engineering, 38(1):54-72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121-131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226-237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11-20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1-32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380-403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150-159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171-201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595-605. IEEE Press, 2012.

            • 1 Introduction
            • 2 Background
              • 21 N-version programming
              • 22 NVP-Diversity
              • 23 Graphical Explanation
              • 24 Unspecified Input Space
                • 3 Our Approach to Detect Computational Diversity
                  • 31 Overview
                  • 32 Test Suite Transformations
                    • 321 Exploring the Input Space
                    • 322 Adding Observation Points
                      • 33 Detecting and Measuring the Visible Computational Diversity
                      • 34 Implementation
                        • 4 Evaluation
                          • 41 Protocol
                          • 42 Dataset
                          • 43 Baseline
                          • 44 Research Questions
                          • 45 Empirical Results
                            • 451 of Generated Test Cases
                            • 452 of Generated Observation Points
                            • 453 Effectiveness
                            • 454 Natural Randomness of Computation
                            • 455 Nature of Computational Diversity
                              • 46 Threats to Validity
                                • 5 Related work
                                • 6 Conclusion
                                • 7 Acknowledgements
                                • 8 References

              Table 2 The performance of DSpot on amplifying 7 Java test suitesStatic Dynamic

              TC assert orobs

              TC exec assert orobs exec

              disc obs branchcov

              pathcov

              codec 72 509 72 3528 124 1245codec-DSpot 672 (times9) 10597 (times20) 672 16920 12 126 12461collections 111 433 768 7035 223 376collections-DSpot 1291 (times12) 14772 (times34) 9202 973096 0 224 465io 221 1330 262 1346 366 246io-DSpot 2518 (times11) 20408 (times15) 2661 209911 54313 373 287lang 233 2206 233 2266 1014 797lang-DSpot 988 (times4) 12854 (times6) 12854 57856 18 1015 901guava 35 84 14110 20190 60 77guava-DSpot 625 (times18) 6834 (times81) 624656 9464 0 60 77gson 684 1125 671 1127 106 84gson-DSpot 4992 (times7) 26869 (times24) 4772 167150 144 108 137JGit 138 176 138 185 75 1284JGit-DSpot 2152 (times16) 90828 (times516) 2089 92856 13377 75 1735

              Table 3 The effectiveness of computational diversity detectionvariants de-tected by DSpot

              variants de-tected by TDR

              input space effect observation spaceeffect

              mean of diver-gences

              commons-codec 1212 1012 1212 1012 219commons-collections 133133 133133 133133 133133 52079commons-io 4444 1844 4244 1844 4055commons-lang 2222 022 1022 022 229guava 33 03 03 33 2gson 145145 0145 134145 0145 8015jgit 113113 0113 113113 0113 15654

              there are 1291 tests in the amplified test suite but alto-gether 9202 test cases are executed The reason is that wesynthesize new test cases that use other generic test meth-ods Consequently this increases the number of executedgeneric test methods which is included in our count

              Our test case transformations yield a rich exploration ofthe input space Columns 7 to 11 of Table 2 provide deeperinsigths about the synthesized test cases Colum 7 gives thebranch coverage of the original test suites and the amplifiedones (lines with -DSPOT identifiers) While original testsuites have a very high branch coverage rate yet DSpot isstill able to generate new teststhat cover a few previouslyuncovered branches For instance the amplified test suitefor commons-ioFileUtils reaches 7 branches that were notexecuted by the original test suite Meanwhile the originaltest suite for guavaHashBiMap already covers 90 of thebranches and DSpot did not generate test cases that covernew branches

              The richness of the amplified test suite is also revealed inthe last column of the table (path coverage) it provides thecumulative number of different paths executed by the testsuite in all methods under test The amplified test suitescover much more paths than the original ones which meansthat they trigger a much wider set of executions of the classunder test than the original test suites For instance forGuava the total number of different paths covered in themethods under test increases from 84 to 137 This meansthat while the amplified test suite does not cover many newbranches it executes the parts that were already coveredin many novel ways increasing the diversity of executionsthat are tested There is one extreme case in the encode

              method of commons-codec6 the original test suite covers780 different paths in this method while the amplified testsuite covers 11356 different paths This phenomenon is dueto the complex control flow of the method and to the factthat its behavior directly depends on the value of an arrayof bytes that takes many new values in the amplified testsuite

              The amplification process is massive and producesrich new input points the number of declared and ex-ecuted test cases and the diversity of executions fromtest cases increase

              452 of Generated Observation PointsNow we focus on the observation points The fourth col-

              umn of Table 2 gives the number of assertions in original testsuite This corresponds to the number of locations wherethe tester specifies expected values about the state of theprogram execution The fifth column gives the number ofobservation points in the amplified test suite We do not callthem assertions since they do not contain an expected valueie there is no oracle Recall that we use those observationpoints to compare the behavior of two program variants inorder to assess the computational diversity

              As we can see we observe the program state on manymore observation points than the original assertions As dis-cussed in Section 22 those observations points use the API

              6line 331 in the Base64 class httpsgithubcomapachecommons-codecblobca8968be63712c1dcce006a6d6ee9ddcef0e0a51srcmainjavaorgapachecommonscodecbinaryBase64java

              of the program under consideration hence allow to revealvisible and exploitable computational diversity Howeverthis number also encompasses the observation points on thenew generated test cases

              If we look at the dynamic perspective (second part of Ta-ble 2) one observes the same phenomenon as for test casesand assertions there are many more points actually ob-served during test execution than statically declared onesThe reasons are identical many observations points are ingeneric test methods that are executed several times or arewithin loops in test code

              These results validate our initial intuition that a testsuite only covers a small portion of the observationspace It is possible to observe the program state frommany other observation points

              453 EffectivenessWe want to assess whether our method is effective for iden-

              tifying computationally diverse program variants As goldentruth we have the forged variants for which we know thatthey are NVP-diverse (see Section 41) their numbers aregiven in the descriptive Table 1 The benchmark is publiclyavailable at httpdiversify-projecteudata

              We run DSpot and TDR to see whether those two tech-niques are able to detect the computationally diverse pro-grams Table 3 gives the results of this evaluation The firstcolumn contains the name of the subject program The sec-ond column gives the number of variants detected by DSpotThe third column gives the number of variants detected byTDR The last three columns explore more in depth whethercomputational diversity is reveales by new input points ornew observation points or both we will come back to themlater

              As we can see DSpot is capable of detecting all computa-tionally diverse variants of our benchmark On the contrarythe baseline technique TDR is always worse Either it de-tects only a fraction of them (eg 1012 for commonscodec)or even not at all The reason is that TDR as originally pro-posed by Yoo and Harman focuses on simple programs withshallow input spaces (one single method with integer argu-ments) On the contrary DSpot is designed to handle richinput spaces incl constructor calls method invocationsand strings This has a direct impact on the effectiveness ofdetecting computational diversity in program variants

              Our technique is based on two insights the amplificationof the input space and the amplification of the observationspace We now want to understand the impact of each ofthem To do so we disable one or the other kind of ampli-fication and measure the number of detected variants Theresult of this experiment is given in the last two columns ofTable 3 Column ldquoinput space effectrdquo gives the number ofvariants that are detected only by the exploration of the in-put space (ie by observing the program state only with theobservation method used in the original assertions) Columnldquoobservation space effectrdquo gives the number of variants thatare detected only by the exploration of the observation space(ie by observing the result of method calls on the objectsinvolved in the test) For instance for commons-codec allvariants (1212) are detected by exploring the input spaceand 1012 are detected by exploring the observation spaceThis means that 10 of them are detected are detected either

              by one exploration or the other one On the contrary forguava only the exploration of the observation space enablesDSpot to detect the three computationally diverse variantsof our benchmark

              By comparing columns ldquoinput space effectrdquo and ldquoobserva-tion space effectrdquo one sees that our two explorations are notmutually exclusive and are complementary Some variantsare detected by both kinds of exploration (as in the case ofcommons-codec) For some subjects only the explorationof the input space is effective (eg commons-lang) whilefor others (guava) this is the opposite Globally the explo-ration of the input space is more efficient most variants aredetected this way

              Let us now consider the last column of Table 3 It givesthe mean number of observation points for which we observea difference between the original program and the variantto be detected For instance among the 12 variants forcommonscodec there is on average 219 observation pointsfor which there is a difference Those numbers are highshowing that the observation points are not independentMany of the methods we call to observe the program stateinspect a different facet of the same state For instance ina list the methods isEmpty() and size are semanticallycorrelated

              The systematic exploration of the input and the ob-servation spaces is effective at detecting behavioral di-versity between program variants

              454 Natural Randomness of ComputationWhen experimenting with DSpot on real programs we

              noticed that some observation points naturally vary evenwhen running the same test case several times on the sameprogram For instance a hashcode that takes into accounta random salt can be different between two runs of the sametest case We call this effect the ldquonatural randomnessrdquo oftest case execution

              We distinguish two kinds of natural variations in the ex-ecution of test suites First some observation points varyover time when the test case is executed several times on thesame environment (same machine OS etc) This is the casefor the hashcode example Second some observation pointsvary depending on the execution environment For instanceif one adds an observation point on a file name the pathname convention is different on Unix and Windows systemsIf method getAbsolutePath is an observation point it mayreturn tmpfootxt on Unix and Ctmpfootxt onWindows While this first example is pure randomness thesecond only refers to variations in the runtime environment

              Interestingly this natural randomness is not problematicin the case of the original test suites because it remainsbelow the level of observation of the oracles (the test suiteassertions in JUnit test suites) However in our case if onekeeps an observation point that is impacted by some naturalrandomness this would produce a false positive for com-putational diversity detection Hence as explained in Sec-tion 3 one phase of DSpot consists in detecting the naturalrandomness first and discarding the impacting observationpoints

              Our experimental protocol enables us to quantify the num-ber of discarded observation points The 6th column ofTable 2 gives this number For instance for commons-

              1 void testCanonicalEmptyCollectionExists () if ((( supportsEmptyCollections ()) ampamp (

              isTestSerialization ())) ampamp ((skipSerializedCanonicalTests ())))

              3 Object object = makeObject ()if (object instanceof Serializable)

              5 String name = getCanonicalEmptyCollectionName(object)

              File f = new javaioFile(name)7 observation on f

              LoggerlogAssertArgument(fgetCanonicalPath ())9 LoggerlogAssertArgument(fgetAbsolutePath ())

              11

              Listing 3 An amplified test case with observation pointsthat naturally vary hence are discarded by DSpot

              codec DSpot detects 12 observation points that naturallyvary This column shows two interesting facts First thereis a large variation in the number of discarded observationpoints it goes up to 54313 for commons-io This case to-gether with JGIT (the last line) is due to the heavy depen-dency of the library on the underlying file system (commons-io is about IO ndash hence file systems ndashoperations JGIT isabout manipulating GIT versioning repositories that are alsostored on the local file system)

              Second there are two subject programs (commons-collectionsand guava) for which we discard no points at all In thoseprograms DSpot does not detect a single point that nat-urally varies by running 100 times the test suite on threedifferent operating systems The reasons is that the API ofthose subject programs does not allow to inspect the inter-nals of the program state up to the naturally varying parts(eg the memory addresses) We consider this good as thisit shows that the encapsulation is good more than providingan intuitive API more than providing a protection againstfuture changes it also completely encapsulates the naturalrandomness of the computation

              Let us now consider a case study Listing 3 shows anexample of an amplified test with observation points forApache Commons Collection There are 12 observation meth-ods that can be called on the object f instance of File (11getter methods and toString) The figure shows two gettermethods that return different values from one run to another(there are 5 getter methods with that kind of behavior fora File object) We ignore these observation points whencomparing the original program with the variants

              The systematic exploration of the observable outputspace provides new insights about the degree of encap-sulation of a class When a class gives public access tovariables that naturally vary there is a risk that whenused in oracles they result in flaky test cases

              455 Nature of Computational DiversityNow we want to understand more in depth the nature of

              the NVP-diversity we are observing Let us discuss threecase studies

              Listing 4 shows two variants of the writeStringToFile()

              method of Apache Commons IO The original program callsopenOutputStream which checks different things about thefile name while the variant directly calls the constructor of

              original program2 void writeStringToFile(File file String data

              Charset encoding boolean append) throwsIOException

              OutputStream out = null4 out = openOutputStream(file append)

              IOUtilswrite(data out encoding)6 outclose()

              8 variantvoid writeStringToFile(File file String data

              Charset encoding boolean append) throwsIOException

              10 OutputStream out = nullout = new FileOutputStream(file append)

              12 IOUtilswrite(data out encoding)outclose()

              Listing 4 Two variants of writeStringToFile incommonsio

              1 void testCopyDirectoryPreserveDates () try

              3 File sourceFile = new File(sourceDirectory hellotxt)

              FileUtilswriteStringToFile(sourceFile HELLOWORLD UTF8)

              5 catch (Exception e) DSpotobserve(egetMessage ())

              7

              Listing 5 Amplified test case that reveals computationaldiversity between variants of listing 4

These two variants behave differently outside the specified domain: if writeStringToFile() is called with an invalid file name, the original program handles it, while the variant throws a FileNotFoundException. Our test transformation operator on String values produces such a file name, as shown in the test case of Listing 5: a "." is changed into a star "*", which makes the file name invalid. Running this test on the variant results in a FileNotFoundException.
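To make the transformation concrete, the sketch below shows two illustrative String-literal operators, replacing one character and removing one character; the class and method names are hypothetical and are not DSpot's actual operator implementation.

import java.util.Random;

// Illustrative String-literal transformations used during input amplification:
// replace one character of the literal, or remove one character.
class StringLiteralAmplifier {
    private static final Random RANDOM = new Random();

    static String replaceRandomChar(String literal, char replacement) {
        if (literal.isEmpty()) {
            return literal;
        }
        int i = RANDOM.nextInt(literal.length());
        return literal.substring(0, i) + replacement + literal.substring(i + 1);
    }

    static String removeRandomChar(String literal) {
        if (literal.isEmpty()) {
            return literal;
        }
        int i = RANDOM.nextInt(literal.length());
        return literal.substring(0, i) + literal.substring(i + 1);
    }
}

For instance, replaceRandomChar("hello.txt", '*') may produce "hello*txt", the kind of invalid file name that exposes the difference between the variants of Listing 4; the "remove character" transformation used in Listing 9 below corresponds to removeRandomChar.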

Let us now consider Listing 6, which shows two variants of the toJson() method from the Google Gson library. The last statement of the original method is replaced by another one: instead of setting the serialization format of the writer, it sets the indent format. Each variant creates a JSON document with a slightly different format, and none of these formatting decisions are part of the specified domain (actually, specifying the exact formatting of the JSON String could be considered as over-specification). The diversity among variants is detected by the test case displayed in Listing 7, which adds an observation point (a call to toString()) on instances of StringWriter, which are modified by toJson().

// Original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    ...
    writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
    ...
    writer.setIndent(" ");
}

Listing 6: Two variants of toJson in GSON.

public void testWriteMixedStreamed_remove534() throws IOException {
    gson.toJson(RED_MIATA, Car.class, jsonWriter);
    jsonWriter.endArray();
    Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
    Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity among the variants of Listing 6.
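To clarify how these observation points feed the detection step, here is a minimal sketch of what an observation logger in the spirit of the Logger.logAssertArgument calls above could look like; DSpot's actual implementation may differ.

import java.util.*;

// Minimal sketch of an observation logger: each call appends a printable
// snapshot of the observed value to the trace of the current amplified test.
class Logger {
    private static final List<String> trace = new ArrayList<>();

    static void logAssertArgument(Object observed) {
        trace.add(String.valueOf(observed));
    }

    static List<String> getTrace() {
        return Collections.unmodifiableList(trace);
    }

    static void reset() {
        trace.clear();
    }
}

With such traces, the call to stringWriter.toString() above is what exposes the formatting difference between the two toJson() variants: the original program and the variant produce different recorded values for that observation point.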

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value in the encodedInt3 variable (the original String has an additional character, removed by the "remove character" transformation). The amplification on the observation points adds multiple observation points. The single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmers: case 1 is possible, it can be triggered from the API.

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discuss how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrate three categories of code variations that maintain the expected functionality as specified in the test suite but still induce diversity (different checks on inputs, different formatting, different handling of special cases).

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have a high confidence in our results. Our implementation is publicly available.7

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants.

7 http://diversify-project.github.io/test-suite-amplification.html

// Original program
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    ...
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1: // 6 bits - ignore entirely
            // not currently tested; perhaps it is impossible
            break;
        ...
    }
}

// variant
void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
    ...
    switch (context.modulus) {
        case 0: // impossible, as excluded above
        case 1:
            ...
    }
}

Listing 8: Two variants of decode in commons-codec.

@Test
void testCodeInteger3_literalMutation222() {
    String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
            + "rY+1LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg==";
    Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the computational diversity between the variants of Listing 8.

This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5 RELATED WORK

The work presented is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by Evosuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that for sequences up to 40 tokens there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals as well as the statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detect diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of this work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resources that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6 CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations for respectively exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Behind the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally nondeterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests, so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.

7 ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8 REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability and Assurance: From Needs to Solutions, CSDA '98, pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. Binhunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Trans. on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.



                codec DSpot detects 12 observation points that naturallyvary This column shows two interesting facts First thereis a large variation in the number of discarded observationpoints it goes up to 54313 for commons-io This case to-gether with JGIT (the last line) is due to the heavy depen-dency of the library on the underlying file system (commons-io is about IO ndash hence file systems ndashoperations JGIT isabout manipulating GIT versioning repositories that are alsostored on the local file system)

                Second there are two subject programs (commons-collectionsand guava) for which we discard no points at all In thoseprograms DSpot does not detect a single point that nat-urally varies by running 100 times the test suite on threedifferent operating systems The reasons is that the API ofthose subject programs does not allow to inspect the inter-nals of the program state up to the naturally varying parts(eg the memory addresses) We consider this good as thisit shows that the encapsulation is good more than providingan intuitive API more than providing a protection againstfuture changes it also completely encapsulates the naturalrandomness of the computation

                Let us now consider a case study Listing 3 shows anexample of an amplified test with observation points forApache Commons Collection There are 12 observation meth-ods that can be called on the object f instance of File (11getter methods and toString) The figure shows two gettermethods that return different values from one run to another(there are 5 getter methods with that kind of behavior fora File object) We ignore these observation points whencomparing the original program with the variants

                The systematic exploration of the observable outputspace provides new insights about the degree of encap-sulation of a class When a class gives public access tovariables that naturally vary there is a risk that whenused in oracles they result in flaky test cases

                455 Nature of Computational DiversityNow we want to understand more in depth the nature of

                the NVP-diversity we are observing Let us discuss threecase studies

                Listing 4 shows two variants of the writeStringToFile()

                method of Apache Commons IO The original program callsopenOutputStream which checks different things about thefile name while the variant directly calls the constructor of

                original program2 void writeStringToFile(File file String data

                Charset encoding boolean append) throwsIOException

                OutputStream out = null4 out = openOutputStream(file append)

                IOUtilswrite(data out encoding)6 outclose()

                8 variantvoid writeStringToFile(File file String data

                Charset encoding boolean append) throwsIOException

                10 OutputStream out = nullout = new FileOutputStream(file append)

                12 IOUtilswrite(data out encoding)outclose()

                Listing 4 Two variants of writeStringToFile incommonsio

                1 void testCopyDirectoryPreserveDates () try

                3 File sourceFile = new File(sourceDirectory hellotxt)

                FileUtilswriteStringToFile(sourceFile HELLOWORLD UTF8)

                5 catch (Exception e) DSpotobserve(egetMessage ())

                7

                Listing 5 Amplified test case that reveals computationaldiversity between variants of listing 4

                FileOutputStream These two variants behave differentlyoutside the specified domain in case writeStringToFile()

                is called with an invalid file name the original program han-dles it while the variant throws a FileNotFoundExceptionOur test transformation operator on String values producessuch a file name as shown in the test case of listing 5 aldquordquo is changed into a star ldquordquo This made the file name aninvalid one Running this test on the variant results in aFileNotFoundException

                Let us now consider listing 6 which shows two variantsof the toJson() method from the Google Gson library Thelast statement of the original method is replaced by anotherone instead of setting the serialization format of the writer

                it set the indent format Each variant creates a JSon withslightly different formats and none of these formatting deci-sions are part of the specified domain (and actually specify-ing the exact formatting of the JSon String could be consid-ered as over-specification) The diversity among variants isdetected by the test cases displayed in figure 7 which addsan observation point (a call to toString()) on instances ofStringWriter which are modified by toJson()

                Original program2 void toJson(Object src Type typeOfSrc JsonWriter

                writer)writersetSerializeNulls(oldSerializeNulls)

                4 variantvoid toJson(Object src Type typeOfSrc JsonWriter

                writer)6 writersetIndent( )

                Listing 6 Two variants of toJson in GSON

                1 public void testWriteMixedStreamed_remove534 ()throws IOException

                3 gsontoJson(RED_MIATA Carclass jsonWriter)

                jsonWriterendArray ()5 LoggerlogAssertArgument(comgooglegson

                MixedStreamTestCARS_JSON)LoggerlogAssertArgument(stringWritertoString ())

                7

                Listing 7 Amplified test detecting black-box diversityamong variants of listing 6

                The next case study is in listing 8 two variants of themethod decode() in the Base64 class of the Apache Com-mons Codec library The original program has a switch-

                case statement in which case 1 execute a break An originalcomment by the programmers indicates that it is probablyimpossible The test case in listing 9 amplifies one of theoriginal test case with a mutation on the String value in theencodedInt3 variable (the original String has an additionallsquorsquo character removed by the ldquoremove characterrdquo transfor-mation) The amplification on the observation points addsmultiple observations points The single observation pointshown in the listing is the one that detects computationaldiversity it calls the static decodeInteger() method whichreturns 1 on the original program and 0 on the variant Inaddition to validating our approach this example anecdo-tally answers the question of the programmer case 1 is pos-sible it can be triggered from the API

                These three case examples are meant to give the readera better idea of how DSpot was able to detect the variantsWe discuss how augmented test cases reveal this diversity(both with amplified inputs and observation points) Weillustrate three categories of code variations that maintainthe expected functionality as specified in the test suite butstill induce diversity (different checks on input different for-matting different handling of special cases)

                The diversity that we observe originates from areasof the code that are characterized by their flexibility(caching checking formatting etc) These areas arevery close to the concept of forgiving region proposedby Martin Rinard [21]

                46 Threats to ValidityDSpot is able to effectively detect NVP-diversity using

                test suite amplification Our experimental results are sub-ject to the following threats

                First this experiment is highly computational a bug inour evaluation code may invalidate our findings Howeversince we have manually checked a sample of cases (the casestudies of Section 454 and Section 455) we have a highconfidence in our results Our implementation is publiclyavailable 7

                Second we have forged the computationally diverse pro-gram variants Eventually as shown on Table 3 our tech-nique DSpot is able to detect them all The reason is thatwe had a bias towards our technique when forging those

                7httpdiversify-projectgithubiotest-suite-amplificationhtml

                Original program2 void decode(final byte[] in int inPos final int

                inAvail final Context context) switch (contextmodulus)

                4 case 0 impossible as excluded abovecase 1 6 bits - ignore entirely

                6 not currently tested perhaps it isimpossiblebreak

                8

                10 variantvoid decode(final byte[] in int inPos final int

                inAvail final Context context) 12 switch (contextmodulus)

                case 0 impossible as excluded above14 case 1

                Listing 8 Two variants of decode in commonscodec

                1 Testvoid testCodeInteger3_literalMutation222 ()

                3 String encodedInt3 =FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2

                5 + rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==LoggerlogAssertArgument(Base64decodeInteger(

                encodedInt3getBytes(CharsetsUTF_8)))7

                Listing 9 Amplified test case that reveals thecomputational diversity between variants of listing 8

                variants This is true for all self-made evaluations Thisthreat on the results of the comparative evaluation againstTDR is mitigated by the analytical comparison of the twoapproaches Both the input space and the output space ofTDR (respectively an integer tuple and a returned value) aresimpler and less powerful than our amplification technique

                Third our experiments consider one programming lan-guage (Java) and 7 different application domains To furtherassess the external validity of our results new experimentsare required on different technologies and more applicationdomains

                5 RELATED WORKThe work presented is related to two main areas the iden-

                tification of similarities or diversity in source code and theautomatic augmentation of test suites

                Computational diversity The recent work by Carzanigaet al [3] has a similar intent as ours automatically identify-ing dissimilarities in the execution of code fragments that arefunctionally similar They use random test cases generatedby Evosuite to get execution traces and log the internals ofthe execution (executed code and the readwrite operationson data) The main difference with our work is that theyassess computational diversity and with random testing in-stead of test amplification

                Koopman and DeVale [15] aim at quantifying the diver-sity among a set of implementations of the POSIX operatingsystem with respect to their responses to exceptional con-ditions Diversity quantification in this context is used todetect which versions of POSIX provide the most differentfailure profiles and should thus be assembled to ensure faulttolerance Their approach relies on Ballista to generate mil-lions of input data and the outputs are analyzed to quantifythe difference This is an example of diversity assessment

                with intensive fuzz testing and observation points on crash-ing states

                Many other works look for semantic equivalence or diver-sity through static or dynamic analysis Gabel and Su [7] in-vestigate the level of granularity at which diversity emergesin source code Their main finding is that for sequencesup to 40 tokens there is a lot of redundancy Beyond this(of course fuzzy) threshold the diversity and uniquenessof source code appears Higo and Kusumoto [11] investi-gate the interplay between structural similarity vocabularysimilarity and method name similarity to assess functionalsimilarity between methods in Java programs They showthat many contextual factors influence the ability of thesesimilarity measures to spot functional similarity (eg thenumber of methods that share the same name or the factthat two methods with similar structure are in the sameclass or not) Jiang and Su [12] extract code fragments ofa given length and randomly generate input data for thesesnippets Then they identify the snippets that produce thesame output values (which are considered functionally equiv-alent wrt the set of random test inputs) They show thatthis method identifies redundancies that static clone detec-tion does not find Kawaguchi and colleagues [13] focus onthe introduction of changes that break the interface behav-ior They also use a notion of partial equivalence whereldquotwoversions of a program need only be semantically equivalentunder a subset of all inputsrdquo Gao and colleagues [8] pro-pose a graph-based analysis to identify semantic differencesin binary code This work is based on the extraction of callgraphs and control flow graphs of both variants and on com-parisons between these graphs in order to spot the semanticvariations Person and colleagues [19] developed differentialsymbolic execution which can be used to detect and char-acterize behavioral differences between program versions

                Test suite amplification In the area of test suite am-plification the work by Yoo and Harman [25] is the mostclosely related to our approach and we used as the baselinefor computational diversity assessment They amplify testsuites only with transformations on integer values while wealso transform boolean and String literals as well as state-ments test cases Yoo and Harman also have two additionalparameters for test case transformation the interaction levelthat determines the number of simultaneous transformationon the same test case and the search radius that boundstheir search process when trying to improve the effectivenessof augmented test suites Their original intent is to increasethe input space coverage to improve test effectiveness Theydo not handle the oracle problem in that work

                Xie [23] augments test suites for Java program with newtest cases that are automatically generated and he automat-ically generates assertions for these new test cases whichcan check for regression errors Harder et al [9] proposeto retrieve operational abstractions ie invariant propertiesthat hold for a set of test cases These abstractions are thenused to compute operational differences which detects di-versity among a set of test cases (and not among a set ofimplementations as in our case) While the authors mentionthat operational differencing can be used to augment a testsuite the generation of new test cases is out of this workrsquosscope Zhang and Elbaum [26] focus on test cases that verifyerror handling code Instead of directly amplifying the testcases as we propose they transform the program under testthey instrument the target program by mocking the exter-

                nal resource that can throw exceptions which allow them toamplify the space of exceptional behaviors exposed to thetest cases Pezze et al [20] use the information providedin unit test cases about object creation and initializationto build composite test cases that focus on interactions be-tween classes Their main result is that the new test casesfind faults that could not be revealed by the unit test casesthat provided the basic material for the synthesis of compos-ite test cases Xu et al [24] refer toldquotest suite augmentationrdquoas the following process in case a program P evolves into Prsquoidentify the parts of Prsquo that need new test cases and gener-ate these tests They combine concolic and search-based testgeneration to automate this process This hybrid approachis more effective than each technique separately but with in-creased costs Dallmeier et al [4] automatically amplify testsuites by adding and removing method calls in JUnit testcases Their objective is to produce test cases that cover awider set of execution states than the original test suite inorder to improve the quality of models reverse engineeredfrom the code

                6 CONCLUSIONIn this paper we have presented DSpot a novel technique

                for detecting one kind of computational diversity between apair of programs This technique is based on test suite am-plification the automatic transformation of the original testsuite DSpot uses two kinds of transformations for respec-tively exploring new points in the programrsquos input space andexploring new observation points on the execution state af-ter execution with the given input points

                Our evaluation on large open-source projects shows thattest suites amplified by DSpot are capable of assessing com-putational diversity and that our amplification strategy isbetter than the closest related work a technique called TDRby Yoo and Harman [25] We have also presented a deepqualitative analysis of our empirical findings Behind theperformance of DSpot our results shed an original light onthe specified and unspecified parts of real-world test suitesand the natural randomness of computation

                This opens avenues for future work There is a relationbetween the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail) To usethe assertions of the flaky tests are at the border of the nat-ural undeterministic parts of the execution sometimes theyhit it sometimes they donrsquot With such a view we imag-ine an approach that characterizes this limit and proposesan automatic refactoring of the flaky tests so that they getfarther from the limit of the natural randomness and enteragain into the good old and reassuring world of determin-ism

                7 ACKNOWLEDGEMENTSThis work is partially supported by the EU FP7-ICT-

                2011-9 No 600654 DIVERSIFY project

                8 REFERENCES[1] A Avizienis The n-version approach to fault-tolerant

                software IEEE Transactions on Software Engineering(12)1491ndash1501 1985

                [2] B Baudry S Allier and M Monperrus Tailoredsource code transformations to synthesizecomputationally diverse program variants In Proc of

                Int Symp on Software Testing and Analysis (ISSTA)pages 149ndash159 2014

                [3] A Carzaniga A Mattavelli and M Pezze Measuringsoftware redundancy In Proc of Int Conf onSoftware Engineering (ICSE) 2015

                [4] V Dallmeier N Knopp C Mallon S Hack andA Zeller Generating test cases for specificationmining In Proceedings of the 19th internationalsymposium on Software testing and analysis pages85ndash96 ACM 2010

                [5] Y Deswarte K Kanoun and J-C Laprie Diversityagainst accidental and deliberate faults In Proceedingsof the Conference on Computer SecurityDependability and Assurance From Needs toSolutions CSDA rsquo98 pages 171ndash Washington DCUSA 1998 IEEE Computer Society

                [6] M Franz E unibus pluram massive-scale softwarediversity as a defense mechanism In Proc of theworkshop on New security paradigms pages 7ndash16ACM 2010

                [7] M Gabel and Z Su A study of the uniqueness ofsource code In Proc of the Int Symp on Foundationsof Software Engineering (FSE) pages 147ndash156 ACM2010

                [8] D Gao M K Reiter and D Song BinhuntAutomatically finding semantic differences in binaryprograms In Information and CommunicationsSecurity pages 238ndash255 Springer 2008

                [9] M Harder J Mellen and M D Ernst Improvingtest suites via operational abstraction In Proc of theInt Conf on Software Engineering (ICSE) ICSE rsquo03pages 60ndash71 Washington DC USA 2003 IEEEComputer Society

                [10] M Harman P McMinn M Shahbaz and S Yoo Acomprehensive survey of trends in oracles for softwaretesting Technical Report CS-13-01 2013

                [11] Y Higo and S Kusumoto How should we measurefunctional sameness from program source code anexploratory study on java methods In Proc of theInt Symp on Foundations of Software Engineering(FSE) pages 294ndash305 ACM 2014

                [12] L Jiang and Z Su Automatic mining of functionallyequivalent code fragments via random testing In Procof Int Symp on Software Testing and Analysis(ISSTA) pages 81ndash92 ACM 2009

                [13] M Kawaguchi S K Lahiri and H RebeloConditional equivalence Technical ReportMSR-TR-2010-119 2010

                [14] J C Knight N-version programming Encyclopedia of

                Software Engineering 1990

                [15] P Koopman and J DeVale Comparing the robustnessof posix operating systems In Proc Of Int Symp onFault-Tolerant Computing pages 30ndash37 IEEE 1999

                [16] C Le Goues T Nguyen S Forrest and W WeimerGenprog A generic method for automatic softwarerepair IEEE Tran on Software Engineering38(1)54ndash72 2012

                [17] A J OrsquoDonnell and H Sethu On achieving softwarediversity for improved network security usingdistributed coloring algorithms In Proceedings of the11th ACM Conference on Computer andCommunications Security pages 121ndash131 ACM 2004

                [18] R Pawlak M Monperrus N Petitprez C Nogueraand L Seinturier Spoon v2 Large scale source codeanalysis and transformation for java Technical Reporthal-01078532 INRIA 2006

                [19] S Person M B Dwyer S Elbaum and C SPasareanu Differential symbolic execution In Proc ofthe Int Symp on Foundations of softwareengineering pages 226ndash237 ACM 2008

                [20] M Pezze K Rubinov and J Wuttke Generatingeffective integration test cases from unit ones In Procof Int Conf on Software Testing Verification andValidation (ICST) pages 11ndash20 IEEE 2013

                [21] M C Rinard Obtaining and reasoning about goodenough software In Design Automation Conference(DAC)

                [22] E Schulte Z P Fry E Fast W Weimer andS Forrest Software mutational robustness GeneticProgramming and Evolvable Machines pages 1ndash322013

                [23] T Xie Augmenting automatically generated unit-testsuites with regression oracle checking In Proc ofEuro Conf on Object-Oriented Programming(ECOOP) pages 380ndash403 Springer 2006

                [24] Z Xu Y Kim M Kim and G Rothermel A hybriddirected test suite augmentation technique In Proc ofInt Symp on Software Reliability Engineering(ISSRE) pages 150ndash159 IEEE 2011

                [25] S Yoo and M Harman Test data regenerationgenerating new test data from existing test dataSoftware Testing Verification and Reliability22(3)171ndash201 2012

                [26] P Zhang and S Elbaum Amplifying tests to validateexception handling code In Proc of Int Conf onSoftware Engineering (ICSE) pages 595ndash605 IEEEPress 2012

                • 1 Introduction
                • 2 Background
                  • 21 N-version programming
                  • 22 NVP-Diversity
                  • 23 Graphical Explanation
                  • 24 Unspecified Input Space
                    • 3 Our Approach to Detect Computational Diversity
                      • 31 Overview
                      • 32 Test Suite Transformations
                        • 321 Exploring the Input Space
                        • 322 Adding Observation Points
                          • 33 Detecting and Measuring the Visible Computational Diversity
                          • 34 Implementation
                            • 4 Evaluation
                              • 41 Protocol
                              • 42 Dataset
                              • 43 Baseline
                              • 44 Research Questions
                              • 45 Empirical Results
                                • 451 of Generated Test Cases
                                • 452 of Generated Observation Points
                                • 453 Effectiveness
                                • 454 Natural Randomness of Computation
                                • 455 Nature of Computational Diversity
                                  • 46 Threats to Validity
                                    • 5 Related work
                                    • 6 Conclusion
                                    • 7 Acknowledgements
                                    • 8 References

                  1 void testCanonicalEmptyCollectionExists () if ((( supportsEmptyCollections ()) ampamp (

                  isTestSerialization ())) ampamp ((skipSerializedCanonicalTests ())))

                  3 Object object = makeObject ()if (object instanceof Serializable)

                  5 String name = getCanonicalEmptyCollectionName(object)

                  File f = new javaioFile(name)7 observation on f

                  LoggerlogAssertArgument(fgetCanonicalPath ())9 LoggerlogAssertArgument(fgetAbsolutePath ())

                  11

                  Listing 3 An amplified test case with observation pointsthat naturally vary hence are discarded by DSpot

                  codec DSpot detects 12 observation points that naturallyvary This column shows two interesting facts First thereis a large variation in the number of discarded observationpoints it goes up to 54313 for commons-io This case to-gether with JGIT (the last line) is due to the heavy depen-dency of the library on the underlying file system (commons-io is about IO ndash hence file systems ndashoperations JGIT isabout manipulating GIT versioning repositories that are alsostored on the local file system)

                  Second there are two subject programs (commons-collectionsand guava) for which we discard no points at all In thoseprograms DSpot does not detect a single point that nat-urally varies by running 100 times the test suite on threedifferent operating systems The reasons is that the API ofthose subject programs does not allow to inspect the inter-nals of the program state up to the naturally varying parts(eg the memory addresses) We consider this good as thisit shows that the encapsulation is good more than providingan intuitive API more than providing a protection againstfuture changes it also completely encapsulates the naturalrandomness of the computation

                  Let us now consider a case study Listing 3 shows anexample of an amplified test with observation points forApache Commons Collection There are 12 observation meth-ods that can be called on the object f instance of File (11getter methods and toString) The figure shows two gettermethods that return different values from one run to another(there are 5 getter methods with that kind of behavior fora File object) We ignore these observation points whencomparing the original program with the variants

                  The systematic exploration of the observable outputspace provides new insights about the degree of encap-sulation of a class When a class gives public access tovariables that naturally vary there is a risk that whenused in oracles they result in flaky test cases

                  455 Nature of Computational DiversityNow we want to understand more in depth the nature of

                  the NVP-diversity we are observing Let us discuss threecase studies

                  Listing 4 shows two variants of the writeStringToFile()

                  method of Apache Commons IO The original program callsopenOutputStream which checks different things about thefile name while the variant directly calls the constructor of

                  original program2 void writeStringToFile(File file String data

                  Charset encoding boolean append) throwsIOException

                  OutputStream out = null4 out = openOutputStream(file append)

                  IOUtilswrite(data out encoding)6 outclose()

                  8 variantvoid writeStringToFile(File file String data

                  Charset encoding boolean append) throwsIOException

                  10 OutputStream out = nullout = new FileOutputStream(file append)

                  12 IOUtilswrite(data out encoding)outclose()

                  Listing 4 Two variants of writeStringToFile incommonsio

                  1 void testCopyDirectoryPreserveDates () try

                  3 File sourceFile = new File(sourceDirectory hellotxt)

                  FileUtilswriteStringToFile(sourceFile HELLOWORLD UTF8)

                  5 catch (Exception e) DSpotobserve(egetMessage ())

                  7

                  Listing 5 Amplified test case that reveals computationaldiversity between variants of listing 4

                  FileOutputStream These two variants behave differentlyoutside the specified domain in case writeStringToFile()

                  is called with an invalid file name the original program han-dles it while the variant throws a FileNotFoundExceptionOur test transformation operator on String values producessuch a file name as shown in the test case of listing 5 aldquordquo is changed into a star ldquordquo This made the file name aninvalid one Running this test on the variant results in aFileNotFoundException

                  Let us now consider listing 6 which shows two variantsof the toJson() method from the Google Gson library Thelast statement of the original method is replaced by anotherone instead of setting the serialization format of the writer

                  it set the indent format Each variant creates a JSon withslightly different formats and none of these formatting deci-sions are part of the specified domain (and actually specify-ing the exact formatting of the JSon String could be consid-ered as over-specification) The diversity among variants isdetected by the test cases displayed in figure 7 which addsan observation point (a call to toString()) on instances ofStringWriter which are modified by toJson()

    // original program
    void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
        // ...
        writer.setSerializeNulls(oldSerializeNulls);
    }

    // variant
    void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
        // ...
        writer.setIndent(" ");
    }

Listing 6: Two variants of toJson in GSON

    public void testWriteMixedStreamed_remove534() throws IOException {
        // ...
        gson.toJson(RED_MIATA, Car.class, jsonWriter);
        jsonWriter.endArray();
        Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
        Logger.logAssertArgument(stringWriter.toString());
    }

Listing 7: Amplified test detecting black-box diversity among variants of Listing 6
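For readers unfamiliar with the Gson API, the small standalone program below (written only for illustration, independent of the test in Listing 7) shows how an indentation setting on JsonWriter changes the serialized String; this is exactly the kind of difference that an observation point on StringWriter.toString() captures.

    import com.google.gson.Gson;
    import com.google.gson.stream.JsonWriter;
    import java.io.StringWriter;
    import java.util.Arrays;
    import java.util.List;

    // Illustration: the same data serialized with and without an indentation
    // setting yields different Strings, although both are valid JSON.
    public class IndentObservationDemo {
        public static void main(String[] args) {
            List<String> cars = Arrays.asList("miata", "civic");

            StringWriter plain = new StringWriter();
            new Gson().toJson(cars, List.class, new JsonWriter(plain));

            StringWriter indented = new StringWriter();
            JsonWriter writer = new JsonWriter(indented);
            writer.setIndent(" ");
            new Gson().toJson(cars, List.class, writer);

            // Compact form vs pretty-printed, multi-line form:
            // the two observed values differ.
            System.out.println(plain.toString().equals(indented.toString())); // false
        }
    }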

The next case study is in Listing 8: two variants of the method decode() in the Base64 class of the Apache Commons Codec library. The original program has a switch-case statement in which case 1 executes a break. An original comment by the programmers indicates that this case is probably impossible. The test case in Listing 9 amplifies one of the original test cases with a mutation on the String value in the encodedInt3 variable (the original String has an additional character, removed by the "remove character" transformation). The amplification of the observation points adds multiple observation points. The single observation point shown in the listing is the one that detects computational diversity: it calls the static decodeInteger() method, which returns 1 on the original program and 0 on the variant. In addition to validating our approach, this example anecdotally answers the question of the programmer: case 1 is possible, it can be triggered from the API.
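The comparison that flags such a difference can be pictured with the following minimal sketch, assuming each run yields a map from observation point identifiers to observed values. This representation and the class name are our own assumptions, not DSpot's actual code.

    import java.util.Map;
    import java.util.Objects;

    // Sketch: two programs are reported as computationally diverse if at least
    // one observation point shared by both traces yields different values.
    public class DivergenceChecker {

        public static boolean computationallyDiverse(Map<String, Object> originalTrace,
                                                     Map<String, Object> variantTrace) {
            for (Map.Entry<String, Object> entry : originalTrace.entrySet()) {
                // A point observed on both programs with different values
                // (e.g. decodeInteger returning 1 vs 0) reveals visible diversity.
                if (variantTrace.containsKey(entry.getKey())
                        && !Objects.equals(entry.getValue(), variantTrace.get(entry.getKey()))) {
                    return true;
                }
            }
            return false;
        }
    }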

These three case examples are meant to give the reader a better idea of how DSpot was able to detect the variants. We discuss how augmented test cases reveal this diversity (both with amplified inputs and observation points). We illustrate three categories of code variations that maintain the expected functionality as specified in the test suite, but still induce diversity: different checks on inputs, different formatting, and different handling of special cases.

The diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.). These areas are very close to the concept of forgiving region proposed by Martin Rinard [21].

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using test suite amplification. Our experimental results are subject to the following threats.

First, this experiment is highly computational: a bug in our evaluation code may invalidate our findings. However, since we have manually checked a sample of cases (the case studies of Section 4.5.4 and Section 4.5.5), we have high confidence in our results. Our implementation is publicly available.7

Second, we have forged the computationally diverse program variants. Eventually, as shown in Table 3, our technique DSpot is able to detect them all. The reason is that we had a bias towards our technique when forging those variants. This is true for all self-made evaluations. This threat on the results of the comparative evaluation against TDR is mitigated by the analytical comparison of the two approaches: both the input space and the output space of TDR (respectively an integer tuple and a returned value) are simpler and less powerful than our amplification technique.

7 http://diversify-project.github.io/test-suite-amplification.html

    // original program
    void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
        // ...
        switch (context.modulus) {
            case 0: // impossible, as excluded above
            case 1: // 6 bits - ignore entirely
                // not currently tested; perhaps it is impossible
                break;
            // ...
        }
    }

    // variant
    void decode(final byte[] in, int inPos, final int inAvail, final Context context) {
        // ...
        switch (context.modulus) {
            case 0: // impossible, as excluded above
            case 1:
            // ...
        }
    }

Listing 8: Two variants of decode in commons-codec

    @Test
    void testCodeInteger3_literalMutation222() {
        String encodedInt3 = "FKIhdgaG5LGKiEtF1vHy4f3y700zaD6QwDS3IrNVGzNp2"
                + "rY+1 LFWTK6D44AyiC1n8uWz1itkYMZF0aKDK0Yjg ==";
        Logger.logAssertArgument(Base64.decodeInteger(encodedInt3.getBytes(Charsets.UTF_8)));
    }

Listing 9: Amplified test case that reveals the computational diversity between variants of Listing 8

Third, our experiments consider one programming language (Java) and 7 different application domains. To further assess the external validity of our results, new experiments are required on different technologies and more application domains.

5. RELATED WORK

The work presented is related to two main areas: the identification of similarities or diversity in source code, and the automatic augmentation of test suites.

Computational diversity. The recent work by Carzaniga et al. [3] has a similar intent as ours: automatically identifying dissimilarities in the execution of code fragments that are functionally similar. They use random test cases generated by EvoSuite to get execution traces and log the internals of the execution (executed code and the read/write operations on data). The main difference with our work is that they assess computational diversity with random testing instead of test amplification.

Koopman and DeVale [15] aim at quantifying the diversity among a set of implementations of the POSIX operating system with respect to their responses to exceptional conditions. Diversity quantification in this context is used to detect which versions of POSIX provide the most different failure profiles and should thus be assembled to ensure fault tolerance. Their approach relies on Ballista to generate millions of input data, and the outputs are analyzed to quantify the difference. This is an example of diversity assessment with intensive fuzz testing and observation points on crashing states.

Many other works look for semantic equivalence or diversity through static or dynamic analysis. Gabel and Su [7] investigate the level of granularity at which diversity emerges in source code. Their main finding is that, for sequences up to 40 tokens, there is a lot of redundancy; beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not). Jiang and Su [12] extract code fragments of a given length and randomly generate input data for these snippets. Then, they identify the snippets that produce the same output values (which are considered functionally equivalent w.r.t. the set of random test inputs). They show that this method identifies redundancies that static clone detection does not find. Kawaguchi and colleagues [13] focus on the introduction of changes that break the interface behavior. They also use a notion of partial equivalence, where "two versions of a program need only be semantically equivalent under a subset of all inputs". Gao and colleagues [8] propose a graph-based analysis to identify semantic differences in binary code. This work is based on the extraction of call graphs and control flow graphs of both variants, and on comparisons between these graphs in order to spot the semantic variations. Person and colleagues [19] developed differential symbolic execution, which can be used to detect and characterize behavioral differences between program versions.

Test suite amplification. In the area of test suite amplification, the work by Yoo and Harman [25] is the most closely related to our approach, and we used it as the baseline for computational diversity assessment. They amplify test suites only with transformations on integer values, while we also transform boolean and String literals, as well as statements of test cases. Yoo and Harman also have two additional parameters for test case transformation: the interaction level, which determines the number of simultaneous transformations on the same test case, and the search radius, which bounds their search process when trying to improve the effectiveness of augmented test suites. Their original intent is to increase the input space coverage to improve test effectiveness. They do not handle the oracle problem in that work.

Xie [23] augments test suites for Java programs with new test cases that are automatically generated, and he automatically generates assertions for these new test cases, which can check for regression errors. Harder et al. [9] propose to retrieve operational abstractions, i.e., invariant properties that hold for a set of test cases. These abstractions are then used to compute operational differences, which detects diversity among a set of test cases (and not among a set of implementations, as in our case). While the authors mention that operational differencing can be used to augment a test suite, the generation of new test cases is out of this work's scope. Zhang and Elbaum [26] focus on test cases that verify error handling code. Instead of directly amplifying the test cases as we propose, they transform the program under test: they instrument the target program by mocking the external resource that can throw exceptions, which allows them to amplify the space of exceptional behaviors exposed to the test cases. Pezze et al. [20] use the information provided in unit test cases about object creation and initialization to build composite test cases that focus on interactions between classes. Their main result is that the new test cases find faults that could not be revealed by the unit test cases that provided the basic material for the synthesis of composite test cases. Xu et al. [24] refer to "test suite augmentation" as the following process: in case a program P evolves into P', identify the parts of P' that need new test cases and generate these tests. They combine concolic and search-based test generation to automate this process. This hybrid approach is more effective than each technique separately, but with increased costs. Dallmeier et al. [4] automatically amplify test suites by adding and removing method calls in JUnit test cases. Their objective is to produce test cases that cover a wider set of execution states than the original test suite, in order to improve the quality of models reverse engineered from the code.

6. CONCLUSION

In this paper, we have presented DSpot, a novel technique for detecting one kind of computational diversity between a pair of programs. This technique is based on test suite amplification: the automatic transformation of the original test suite. DSpot uses two kinds of transformations, for respectively exploring new points in the program's input space and exploring new observation points on the execution state after execution with the given input points.

Our evaluation on large open-source projects shows that test suites amplified by DSpot are capable of assessing computational diversity, and that our amplification strategy is better than the closest related work, a technique called TDR by Yoo and Harman [25]. We have also presented a deep qualitative analysis of our empirical findings. Beyond the performance of DSpot, our results shed an original light on the specified and unspecified parts of real-world test suites, and on the natural randomness of computation.

This opens avenues for future work. There is a relation between the natural randomness of computation and the so-called flaky tests (those tests that occasionally fail). To us, the assertions of the flaky tests are at the border of the naturally non-deterministic parts of the execution: sometimes they hit it, sometimes they don't. With such a view, we imagine an approach that characterizes this limit and proposes an automatic refactoring of the flaky tests so that they get farther from the limit of the natural randomness and enter again into the good old and reassuring world of determinism.

7. ACKNOWLEDGEMENTS

This work is partially supported by the EU FP7-ICT-2011-9 No. 600654 DIVERSIFY project.

8. REFERENCES

[1] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, (12):1491–1501, 1985.
[2] B. Baudry, S. Allier, and M. Monperrus. Tailored source code transformations to synthesize computationally diverse program variants. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 149–159, 2014.
[3] A. Carzaniga, A. Mattavelli, and M. Pezze. Measuring software redundancy. In Proc. of Int. Conf. on Software Engineering (ICSE), 2015.
[4] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller. Generating test cases for specification mining. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 85–96. ACM, 2010.
[5] Y. Deswarte, K. Kanoun, and J.-C. Laprie. Diversity against accidental and deliberate faults. In Proceedings of the Conference on Computer Security, Dependability and Assurance: From Needs to Solutions, CSDA '98, pages 171–, Washington, DC, USA, 1998. IEEE Computer Society.
[6] M. Franz. E unibus pluram: massive-scale software diversity as a defense mechanism. In Proc. of the Workshop on New Security Paradigms, pages 7–16. ACM, 2010.
[7] M. Gabel and Z. Su. A study of the uniqueness of source code. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 147–156. ACM, 2010.
[8] D. Gao, M. K. Reiter, and D. Song. Binhunt: Automatically finding semantic differences in binary programs. In Information and Communications Security, pages 238–255. Springer, 2008.
[9] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. of the Int. Conf. on Software Engineering (ICSE), ICSE '03, pages 60–71, Washington, DC, USA, 2003. IEEE Computer Society.
[10] M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. A comprehensive survey of trends in oracles for software testing. Technical Report CS-13-01, 2013.
[11] Y. Higo and S. Kusumoto. How should we measure functional sameness from program source code? An exploratory study on Java methods. In Proc. of the Int. Symp. on Foundations of Software Engineering (FSE), pages 294–305. ACM, 2014.
[12] L. Jiang and Z. Su. Automatic mining of functionally equivalent code fragments via random testing. In Proc. of Int. Symp. on Software Testing and Analysis (ISSTA), pages 81–92. ACM, 2009.
[13] M. Kawaguchi, S. K. Lahiri, and H. Rebelo. Conditional equivalence. Technical Report MSR-TR-2010-119, 2010.
[14] J. C. Knight. N-version programming. Encyclopedia of Software Engineering, 1990.
[15] P. Koopman and J. DeVale. Comparing the robustness of POSIX operating systems. In Proc. of Int. Symp. on Fault-Tolerant Computing, pages 30–37. IEEE, 1999.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2012.
[17] A. J. O'Donnell and H. Sethu. On achieving software diversity for improved network security using distributed coloring algorithms. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 121–131. ACM, 2004.
[18] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon v2: Large scale source code analysis and transformation for Java. Technical Report hal-01078532, INRIA, 2006.
[19] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. Differential symbolic execution. In Proc. of the Int. Symp. on Foundations of Software Engineering, pages 226–237. ACM, 2008.
[20] M. Pezze, K. Rubinov, and J. Wuttke. Generating effective integration test cases from unit ones. In Proc. of Int. Conf. on Software Testing, Verification and Validation (ICST), pages 11–20. IEEE, 2013.
[21] M. C. Rinard. Obtaining and reasoning about good enough software. In Design Automation Conference (DAC).
[22] E. Schulte, Z. P. Fry, E. Fast, W. Weimer, and S. Forrest. Software mutational robustness. Genetic Programming and Evolvable Machines, pages 1–32, 2013.
[23] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. of Euro. Conf. on Object-Oriented Programming (ECOOP), pages 380–403. Springer, 2006.
[24] Z. Xu, Y. Kim, M. Kim, and G. Rothermel. A hybrid directed test suite augmentation technique. In Proc. of Int. Symp. on Software Reliability Engineering (ISSRE), pages 150–159. IEEE, 2011.
[25] S. Yoo and M. Harman. Test data regeneration: generating new test data from existing test data. Software Testing, Verification and Reliability, 22(3):171–201, 2012.
[26] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. In Proc. of Int. Conf. on Software Engineering (ICSE), pages 595–605. IEEE Press, 2012.
