
Assessing Test Suite Effectiveness Using Static Metrics

Paco van Beckhoven 1,2, Ana Oprescu 1, and Magiel Bruntink 2

1 University of Amsterdam, 2 Software Improvement Group

Abstract

With the increasing amount of automated tests, we need ways to measure test effectiveness. The state-of-the-art technique for assessing test effectiveness, mutation testing, is too slow and cumbersome to be used in large-scale evolution studies or in code audits by external companies. In this paper we investigated two alternatives, namely code coverage and assertion count. We discovered that code coverage outperforms assertion count by showing a relation with test suite effectiveness for all analysed projects, whereas assertion count displays such a relation in only one of the analysed projects. Further analysing the relationship between assertion count, coverage and test effectiveness would allow us to circumvent some of the problems of mutation testing.

1 Introduction

Software testing is an important part of the software engineering process. It is widely used in industry for quality assurance, as tests can tackle software bugs early in the development process and also serve regression purposes [20]. Part of the software testing process is covered by developers writing automated tests such as unit tests. This process is supported by testing frameworks such as JUnit [19]. Monitoring the quality of the test code has been shown to provide valuable insight when maintaining high quality assurance standards [18]. Previous research shows that as the size of production code grows, the size of test code grows along with it [43]. Quality control on test suites is therefore important as the maintenance

Copyright © by the paper's authors. Copying permitted for private and academic purposes.

Proceedings of the Seminar Series on Advanced Techniques and Tools for Software Evolution SATToSE 2017 (sattose.org). 07-09 June 2017, Madrid, Spain.

on tests can be difficult and generate risks if done incorrectly [22]. Typically, such risks are related to the growing size and complexity, which consequently lead to incomprehensible tests. An important risk is the occurrence of test bugs, i.e., tests that fail although the program is correct (false positive) or, even worse, tests that do not fail when the program is not working as desired (false negative). Especially the latter is a problem when breaking changes are not detected by the test suite. This issue can be addressed by measuring the fault-detecting capability of a test suite, i.e., test suite effectiveness. Test suite effectiveness is measured by the number of faulty versions of a System Under Test (SUT) that are detected by a test suite. However, as real faults are unknown in advance, mutation testing is applied as a proxy measurement. It has been shown that mutant detection correlates with real fault detection [26].

Mutation testing tools generate faulty versions of the program and then run the tests to determine if the fault was detected. These faults, called mutants, are created by so-called mutators which mutate specific statements in the source code. Each mutant represents a very small change, to prevent changing the overall functionality of the program. Some examples of mutators are: replacing operands or operators in an expression, removing statements or changing the returned values. A mutant is killed if it is detected by the test suite, either because the program fails to execute (due to exceptions) or because the results are not as expected. If a large set of mutants survives, it might be an indication that the test quality is insufficient, as programming errors may remain undetected.
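To make this concrete, the following is a small illustrative sketch of such a mutant (our own example, not the output of any particular tool): a mutator replaces a relational operator in a conditional, and a test exercising the original behaviour kills the mutant.

```java
public class MutantExample {
    // Original production method.
    static boolean canWithdraw(int balance, int amount) {
        return balance >= amount;
    }

    // Mutant produced by an operator-replacing mutator: >= becomes <.
    // A test asserting canWithdraw(10, 5) == true kills this mutant,
    // because the mutated version returns false for the same input.
    static boolean canWithdrawMutant(int balance, int amount) {
        return balance < amount;
    }
}
```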

1.1 Problem statement

Mutation analysis is used to measure the test suite effectiveness of a project [26]. However, mutation testing techniques have several drawbacks, such as limited availability across programming languages and being resource-expensive [46, 25]. Furthermore, it often requires compilation of source code and it requires running tests which often depend


on other systems that might not be available, rendering it impractical for external analysis. External analysis is often applied in industry by companies such as the Software Improvement Group (SIG) to advise companies on the quality of their software. All these issues are compounded when performing software evolution analysis on large-scale legacy or open source projects. Therefore our research goal has both industry and research relevance.

1.2 Research questions and method

To tackle these issues, our goal is to understand to what extent metrics obtained through static source code analysis relate to test suite effectiveness as measured with mutation testing.

Preliminary research [40] on static test metrics highlighted two promising candidates: assertion count and static coverage. We structure our analysis on the following research questions:

RQ 1 To what extent is assertion count a good predictor for test suite effectiveness?

RQ 2 To what extent is static coverage a good predictor for test suite effectiveness?

We select our test suite effectiveness metric and mutation tool based on state-of-the-art literature. Next, we study existing test quality models to inspect which static metrics can be related to test suite effectiveness. Based on these results we implement a set of metrics using only static analysis.

To answer the research questions, we implement a simple tool that reads a project's source files and calculates the metric scores using static analysis.

Finally, we evaluate the individual metrics' suitability as indicators for effectiveness by performing a case study using our tool on three projects: Checkstyle, JFreeChart and JodaTime. The projects were selected from related research, based on the size and structure of their respective test suites. We focus on Java projects as Java is one of the most popular programming languages [15] and forms the subject of many recent research papers surrounding test effectiveness. We rely on JUnit [7] as the unit testing framework. JUnit is the most used unit testing framework for Java [44].

1.3 Contributions

In an effort to tackle the drawbacks of using mutation testing to measure test suite effectiveness, our research makes the following contributions: 1. An in-depth analysis of the relation between test effectiveness, assertion count and coverage as measured using static metrics for three large real-world projects. 2. A set of scenarios which influence the results of the static metrics and their sources of imprecision. 3. A tool to measure static coverage and assertion count using only static analysis.

Outline. Section 2 revisits background concepts. Section 3 introduces the design of the static metrics that will be investigated, together with an effectiveness metric and a mutation tool. Section 4 describes the empirical method of our research. Results are shown in Section 5 and discussed in Section 6. Section 7 summarises related work and Section 8 presents the conclusion and future work.

2 Background

First, we introduce some basic terminology. Next, we describe a test quality model used as input for the design of our static metrics. We briefly introduce mutation testing and compare mutation tools. Finally, we summarize test effectiveness measures and describe mutation analysis.

2.1 Terminology

We define several terms used in this paper:

Test (case/method) An individual JUnit test.

Test suite A set of tests.

Test suite size The number of tests in a test suite.

Master test suite All tests of a given project.

Dynamic metrics Metrics that can only be measured by, e.g., running a test suite. When we state that something is measured dynamically, we refer to dynamic metrics.

Static metrics Metrics measured by analysing the source code of a project. When we state that something is measured statically, we refer to static metrics.

2.2 Measuring test code quality

Athanasiou et al. introduced a Test Quality Model (TQM) based on metrics obtained through static analysis of production and test code [18]. This TQM consists of the following static metrics:

Code coverage is the percentage of code tested, implemented via static call graph analysis [16].

Assertion-McCabe ratio indicates tested decision points in the code; computed as the total number of assertion statements in the test code divided by the McCabe cyclomatic complexity score [33] of the production code.

Assertion Density indicates the ability to detect defects; computed as the number of assertions divided by the Lines Of Test Code (TLOC). (Both assertion-based ratios are written out as formulas after this list.)

Directness indicates the ability to detect the location of a defect's cause when a test fails. Similar to code coverage, except that only methods directly called from a test are counted.

Maintainability based on an existing maintainability model [21], adapted for test suites. The model consists of the following metrics for test code: Duplication, Unit Size, Unit Complexity and Unit Dependency.
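For reference, the two assertion-based ratios can be written as follows (our notation, derived directly from the definitions above):

```latex
\[
\text{Assertion-McCabe ratio} = \frac{\#\,\text{assertions in test code}}{\text{McCabe complexity of production code}},
\qquad
\text{Assertion density} = \frac{\#\,\text{assertions in test code}}{\text{TLOC}}
\]
```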


2.3 Mutation testing

Test effectiveness is measured by the number of mutants that were killed by a test suite. Recent research introduced a variety of effectiveness measures and mutants. We describe different types of mutants, mutation tools, types of effectiveness measures, and work on mutation analysis.

2.3.1 Mutant types

Not all mutants are equally easy to detect. Easy or weak mutants are killed by many tests and thus often easy to detect. Hard-to-kill mutants can only be killed by very specific tests and often subsume other mutants. Below is an overview of the different types of mutants in the literature:

Mutant represents a small change to the program, i.e., a modified version of the SUT.

Equivalent mutants do not change the outcome of a program, i.e., they cannot be detected. Consider a loop that breaks if i == 10, where i increments by 1. A mutant changing the condition to i >= 10 remains undetected, as the loop still breaks when i becomes 10 (a code sketch of this example follows the list).

Subsuming mutants are sole contributors to the effectiveness scores [36]. If mutants are subsumed, they are often killed "collaterally" together with the subsuming mutant. Killing these collateral mutants does not lead to more effective tests, but they influence the test effectiveness score calculation.
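The equivalent-mutant example above, written out as a minimal Java sketch for illustration (our own code, not taken from any analysed project):

```java
public class EquivalentMutantExample {
    // Original loop: exits when i reaches 10.
    static int original() {
        int i = 0;
        while (true) {
            if (i == 10) {
                break;
            }
            i++;
        }
        return i;
    }

    // Mutant: the condition i == 10 is changed to i >= 10. Because i only
    // increments by 1 starting from 0, the loop still exits at exactly i == 10,
    // so no test can observe a difference: the mutant is equivalent.
    static int mutant() {
        int i = 0;
        while (true) {
            if (i >= 10) {
                break;
            }
            i++;
        }
        return i;
    }
}
```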

2.3.2 Comparison of mutation tools

Three criteria were used to compare mutation tools for Java: 1. Effectiveness of the mutation adequate test suite of each tool. A mutation adequate test suite kills all the mutants generated by a mutation tool. Each test of this test suite contributes to the effectiveness score, i.e., if one test is removed, less than a 100% effectiveness score is achieved. A cross-testing technique is applied to evaluate the effectiveness of each tool's mutation adequate test suite: the adequate test suite of each tool is run on the set of mutants generated by the other tools. If the mutation adequate test suite for tool A would detect all the mutants of tool B, but the suite of tool B would not detect all the mutants of tool A, then tool A would subsume tool B. 2. The tool's application cost in terms of the number of test cases that need to be generated and the number of equivalent mutants that would have to be inspected. 3. The execution time of each tool.

Kintis et al. analysed and compared the effectiveness of PIT, muJava and Major [27]. Each tool was evaluated using the cross-testing technique on twelve methods of six Java projects. They found that the mutation adequate test suite of muJava was the most effective, followed by Major and PIT. The ordering in terms of application cost was different: PIT required the least test cases and generated the smallest set of equivalent mutants.

Marki and Lindstrom performed similar research on the same mutation tools [32]. They used three small Java programs popular in the literature. They found that none of the mutation tools subsumed each other. muJava generated the strongest mutants, followed by Major and PIT; however, muJava generated significantly more equivalent mutants and was slower than Major and PIT.

Laurent et al. introduced PIT+, an improved version of PIT with an extended set of mutators [31]. They combined the test suites generated by Kintis et al. [27] into a mutation adequate test suite that would detect the combined set of mutants generated by PIT, muJava and Major. A mutation adequate test suite was also generated for PIT+. The set of mutants generated by PIT+ was equally strong as the combined set of mutants.

2.3.3 Effectiveness measures

We found three types of effectiveness measures, summarised as formulas after the list:

Normal effectiveness calculated as the number of killed mutants divided by the total number of non-equivalent mutants.

Normalised effectiveness calculated as the number of killed mutants divided by the number of covered mutants, i.e., mutants located in code executed by the test suite. Intuitively, test suites killing more mutants while covering less code are more thorough than test suites killing the same number of mutants in a larger piece of source code [24].

Subsuming effectiveness is the percentage of killed subsuming mutants. Intuitively, strong mutants, i.e., subsuming mutants, are not equally distributed [36], which could lead to skewed effectiveness results.
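In our notation (K = killed mutants, N = non-equivalent mutants, C = covered non-equivalent mutants, S = subsuming mutants), the three measures read:

```latex
\[
E_{\text{normal}} = \frac{|K|}{|N|}, \qquad
E_{\text{normalised}} = \frac{|K|}{|C|}, \qquad
E_{\text{subsuming}} = \frac{|K \cap S|}{|S|}
\]
```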

2.3.4 Mutation analysis

In this section, we describe research conducted on mutation analysis that underpins our approach.

Mutants and real faults. Just et al. investigated whether generated faults are a correct representation of real faults [26]. Statistically significant evidence shows that mutant detection correlates with real fault detection. They could relate 73% of the real faults to common mutators. Of the remaining 27%, 10% can be detected by enhancing the set of commonly used mutators. They used Major for generating mutations. Equivalent mutants were ignored as mutation scores were only compared for subsets of a project's test suite.

Code coverage and effectiveness. Inozemtseva and Holmes analysed the correlation between


code coverage and test suite effectiveness [24] across twelve studies. They found three main shortcomings: 1. Studies did not control for the suite size. As code coverage relates to the test suite size (more coverage is achieved by adding more tests), it remains unclear whether the correlation with effectiveness was due to the size or the coverage of the test suite. 2. Small or synthetic programs limit generalisation to industry. 3. Comparing only test suites that fully satisfy a certain coverage criterion. They argue that these results can be generalised to more realistic test suites. Eight studies showed a correlation between some coverage type and effectiveness independently of size; the strength varied, in some studies appearing only for high coverage.

They also conducted an experiment on five large open source Java projects. All mutants undetected by the master test suite were marked equivalent. To control for size, fixed-size test suites are generated by randomly selecting tests from the master test suite. Coverage was measured using CodeCover [3] on statement, decision and modified condition levels. Effectiveness was measured using normal and normalised effectiveness. They found a low to moderate correlation between coverage and normal effectiveness when controlling for size. The coverage type had little impact on the correlation strength and only a weak correlation was found for normalised effectiveness.

Assertions and effectiveness. Zhang and Mesbah studied the relationship between assertions and test suite effectiveness [45]. Their experiment used five large open source Java projects, similarly to Inozemtseva and Holmes [24]. They found a strong correlation between assertion count and test effectiveness, even when test suite size was controlled for. They also found that some assertion types are more effective than others, e.g., boolean and object assertions are more effective than string and numeric assertions.

3 Metrics and mutants

Our goal is to investigate to what extent static analysis based metrics are related to test suite effectiveness. First, we need to select a set of static metrics. Secondly, we need a tool to measure these metrics. Thirdly, we need a way to measure test effectiveness.

3.1 Metric selection

We choose two static analysis-based metrics that could predict test suite effectiveness. We analyse the state-of-the-art TQM by Athanasiou et al. [18] because it is already based on static source code analysis. Furthermore, the TQM was developed in collaboration with SIG, the host company of this thesis, which means that knowledge of the model

is directly available. This TQM consists of the following static metrics: Code Coverage, Assertion-McCabe ratio, Assertion Density, Directness and Test Code Maintainability (see also Section 2.2).

Test code maintainability relates to code readability and understandability, indicating how easily we can make changes. We drop maintainability as a candidate metric as we consider it the least related to the completeness or effectiveness of tests.

The model also contains two assertion-based and two coverage-based metrics. Based on preliminary results, we found that the number of assertions had a stronger correlation with test effectiveness than the two assertion-based TQM metrics for all analysed projects. Similarly, static code coverage performed better than directness in the correlation test with test effectiveness. To get a more qualitative analysis, we focus on one assertion-based metric and one coverage-based metric, respectively assertion count and static coverage.

Furthermore, coverage was shown to be related to test effectiveness [24, 35]. Others found a relation between assertions and fault density [28] and between assertions and test suite effectiveness [45].

3.2 Tool implementation

In this section, we explain the foundation of the tool and the details of the implemented metrics.

3.2.1 Tool architecture

Figure 1 presents the analysis steps. The rectangles are artefacts that form the in- and output for the two processing stages.

The first processing step is performed by the Software Analysis Toolkit (SAT) [29], which constructs a call graph using only static source code analysis. Our analysis tool uses the call graph to measure both assertion count and static method coverage.

The SAT analyses source code and computes several metrics, e.g., Lines of Code (LOC), McCabe complexity [33] and code duplication, which are stored in a source graph. This graph contains information on the structure of the project, such as which packages contain which classes, which classes contain which methods, and the call relations between these methods. Each node is annotated with information such as lines of code. This graph is designed such that it can be used for many programming languages. By implementing our metrics on top of the SAT, we can do measurements for different programming languages.

3.2.2 Code coverage

Alves and Visser designed an algorithm for measuring method coverage using static source code analysis [16]. The algorithm takes as input a call graph obtained by static source code analysis.


Figure 1: Analysis steps to statically measure coverage and assertion count.

The calls from test to production code are counted by slicing the source graph and counting the methods. This includes indirect calls, e.g., from one production method to another. Additionally, the constructor of each called method's class is included. They found a strong correlation between static and dynamic coverage (the mean difference between static and dynamic coverage was 9%). We use this algorithm with the call graph generated by the SAT to calculate the static method coverage.

However, the static coverage algorithm has four sources of imprecision [16]. The first is conditional logic, e.g., a switch statement that for each case invokes a different method. The second is dynamic dispatch (virtual calls), e.g., a parent class with two subclasses both overriding a method that is called on the parent. Third, library/framework calls, e.g., java.util.List.contains() invokes the .equals() method of each object in the list. The source code of third-party libraries is not included in the analysis, making it impossible to trace which methods are called from the framework. And fourth, the use of Java reflection, a technique to invoke methods dynamically at runtime without knowledge of these methods or classes at compile time.

For the first two sources of imprecision, an optimistic approach is chosen, i.e., all possible paths are considered covered. Consequently, the coverage is overestimated. Invocations by the latter two sources of imprecision remain undetected, leading to underestimating the coverage.
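The following sketch (our own illustration, not code from the SAT) shows two of these imprecision sources in plain Java: dynamic dispatch, which the optimistic approach overestimates, and reflection, which purely static analysis misses entirely.

```java
import java.lang.reflect.Method;
import java.util.List;

public class ImprecisionExamples {

    interface Shape { double area(); }
    static class Circle implements Shape { public double area() { return Math.PI; } }
    static class Square implements Shape { public double area() { return 1.0; } }

    // Dynamic dispatch: statically, both Circle.area() and Square.area() are
    // possible targets of s.area(), so an optimistic call graph marks both as
    // covered even if only one is executed at run time (coverage overestimated).
    static double total(List<Shape> shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    // Reflection: the invoked method is only known at run time, so a purely
    // static call graph misses this edge entirely (coverage underestimated).
    static Object callByName(Object target, String name) throws Exception {
        Method m = target.getClass().getMethod(name);
        return m.invoke(target);
    }
}
```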

3.2.3 Assertions

We measure the number of assertions using the same call graph as the static method coverage algorithm. For each test, we follow the call graph through the test code to include all direct and indirect assertion calls. Indirect calls are important because test classes often contain some utility method for asserting the correctness of an object. Additionally, we take into account the number of times a method is invoked to approximate the number of executed assertions. Only assertions that are part of JUnit are counted.
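A minimal sketch of this idea, assuming a simple adjacency-list call graph keyed by method identifiers (the representation below is ours, not the SAT's, and for brevity it expands each helper method only once instead of weighting by invocation count as described above):

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StaticAssertionCounter {

    /** Call edges: caller method id -> callee method ids, repeated per call site. */
    private final Map<String, List<String>> calls;
    /** Ids of JUnit assertion methods, e.g. "org.junit.Assert.assertEquals". */
    private final Set<String> assertionMethods;

    public StaticAssertionCounter(Map<String, List<String>> calls, Set<String> assertionMethods) {
        this.calls = calls;
        this.assertionMethods = assertionMethods;
    }

    /** Counts direct and indirect assertion calls reachable from one test method. */
    public int countAssertions(String testMethod) {
        int count = 0;
        Set<String> visited = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(testMethod);
        while (!work.isEmpty()) {
            String caller = work.pop();
            if (!visited.add(caller)) {
                continue; // each method is expanded once, avoiding call-graph cycles
            }
            for (String callee : calls.getOrDefault(caller, Collections.emptyList())) {
                if (assertionMethods.contains(callee)) {
                    count++; // assertion invoked directly or via a test utility method
                } else {
                    work.push(callee); // follow the call graph through the test code
                }
            }
        }
        return count;
    }
}
```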

Identifying tests. By counting assertions based on the number of invocations from tests, we should also be able to identify these tests statically. We use the SAT to identify all invocations to assertion methods and then slice the call graph backwards, following all call and virtual call edges. All nodes within scope that have no parameters and have no incoming edges are marked as tests.

Assertion content types. Zhang and Mesbah found that the effectiveness of assertions differs significantly depending on the type of objects they assert [45]. Four assertion content types were classified: numeric, string, object and boolean. They found that object and boolean assertions are more effective than string and numeric assertions. The type of objects in an assertion can give insights into the strength of the assertion. We will include the distribution of these content types in the analysis.

We use the SAT to analyse the type of objects in an assertion. The SAT is unable to detect the type of an operator expression used inside a method invocation, e.g., assertTrue(a >= b);, resulting in unknown assertion content types. Also, fail statements are put in a separate category as these are a special type of assertion without any content type.

3.3 Mutation analysis

In this section we discuss our choice for the mutation tool and test effectiveness measure.

3.3.1 Mutation tool

We presented four candidate mutation tools for our experiment in Section 2.3.2: Major, muJava, PIT and PIT+. MuJava has not been updated in the last two years and does not support JUnit 4 or Java versions above 1.6 [9]. Conforming to these requirements would decrease the set of projects we could use in our experiment, as both JUnit 4 and Java 1.7 have been around for quite some time. Major does support JUnit 4 and has recently been updated [8]. However, it only works in Unix environments [32]. PIT targets industry [27], is open source and actively developed [12]. Furthermore, it supports a wide range of build tooling and is significantly faster than the other tools. PIT+ is based on a two-year-old branched version of PIT and was only recently made available [10]. The documentation is very sparse and the source code is missing. However, PIT+ generates a stronger set of mutants than the other three tools, whereas PIT generates the weakest set of mutants.

Based on these observations we decided that PIT+ would be the best choice for measuring test effectiveness. Unfortunately, PIT+ was not available at the start of our research. We first did the analysis based on PIT and then later switched to PIT+. Because we first used PIT, we selected projects that used Maven as a build tool. PIT+ is based on an old version, 1.1.5, not yet supporting Maven. To enable using the features of PIT's new version we merged the mutators provided by PIT+ into the regular version of PIT [11].


3.3.2 Dealing with equivalent mutants

Equivalent mutants are mutants that do not change the outcome of the program. Manually removing equivalent mutants is time-consuming and generally undecidable [35]. A commonplace solution is to mark all the mutants that are not killed by the project's test suite as equivalent. The resulting non-equivalent mutants are always detected by at least one test. The disadvantage of this approach is that many mutants might be falsely marked as equivalent. The number of false positives depends, for example, on the coverage of the tests: if the mutated code is not covered by any of the tests, it will never be detected and consequently be marked as equivalent. Another cause of false positives could be the lack of assertions in tests, i.e., not checking the correctness of the program's result. The percentage of equivalent mutants expresses to some extent the test effectiveness of the project's test suite.

With this approach, the complete test suite of each project will always kill all the remaining non-equivalent mutants. As the number of non-equivalent mutants heavily relies on the quality of a project's test suite, we cannot use these effectiveness scores to compare between different projects. To compensate for that, we will compare sub test suites within the same project.

3.3.3 Test effectiveness measure

Next, we evaluate both normalised and subsuming effectiveness in the subsections below and describe our choice of an effectiveness measure.

Normalised effectiveness. Normalised effectiveness is calculated by dividing the number of killed mutants by the number of non-equivalent mutants that are present in the code executed by the test.

Consider the following example in which there are two tests T1 and T2 for method M1. Suppose M1 is only covered by T1 and T2. In total, there are five mutants Mu1..5 generated for M1. T1 detects Mu1 and T2 detects Mu2. As T1 and T2 are the only tests to cover M1, the mutants Mu3..5 remain undetected and are marked as equivalent. Both tests only cover M1 and detect 1 of the two non-equivalent mutants, resulting in a normal effectiveness score of 0.5. A test suite consisting of only the above tests would detect all mutants in the covered code, resulting in a normalised effectiveness score of 1.
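In the notation of Section 2.3.3, the scores in this example work out as follows (killed over non-equivalent mutants for normal effectiveness, killed over covered non-equivalent mutants for normalised effectiveness):

```latex
\[
\text{normal}(\{T_1\}) = \frac{|\{Mu_1\}|}{|\{Mu_1, Mu_2\}|} = 0.5,
\qquad
\text{normalised}(\{T_1, T_2\}) = \frac{|\{Mu_1, Mu_2\}|}{|\{Mu_1, Mu_2\}|} = 1
\]
```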

We notice that the normalised effectiveness score heavily relies on how mutants are marked as equivalent. Suppose the mutants marked as equivalent were valid mutants but the tests failed to detect them (false positives), e.g., due to missing assertions. In this scenario, the (normalised) effectiveness score suggests that a bad test suite is actually very effective.

Projects that have ineffective tests will only detect a small portion of the mutants. As a result, a large percentage will be marked as equivalent. This increases the chance of false positives, which decreases the reliability of the normalised effectiveness score.

Consider a project of which only a portion of the code base is thoroughly tested. There is a high probability that the equivalent mutants are not equally distributed among the code base. Code covered by poor tests is more likely to contain false positives than thoroughly tested code. The poor tests scramble the results, e.g., a test with no assertions can be incorrectly marked as very effective.

Normalised effectiveness is intended to compare the thoroughness of two test suites, i.e., to penalise test suites that cover lots of code but only a small number of mutants. We believe that it is less suitable as a replacement for normal effectiveness.

We consider normal effectiveness scores more reliable when studying the relation with our metrics. Normal effectiveness is positively influenced by the breadth of a test and penalises small test suites, as a score of 1.0 can only be achieved if all mutants are found. However, this is less of a problem when comparing test suites of equal sizes.

Subsuming effectiveness. Current algorithms for identifying subsuming mutants are influenced by the overlap between tests. Suppose there are five mutants, Mu1..5, for method M1. There are five tests, T1..5, that kill Mu1..4 and one test, T6, that kills all five mutants.

Ammann et al. defined subsuming mutants as follows: "one mutant subsumes a second mutant if every test that kills the first mutant is guaranteed also to kill the second" [17]. According to this definition, Mu5 subsumes Mu1..4 because the set of tests that kill Mu5 is a subset of the tests that kill Mu1..4: {T6} ⊂ {T1..6}. The tests T1..5 will have a subsuming effectiveness score of 0.

Our goal is to identify properties of test suites that determine their effectiveness. If we measured subsuming effectiveness, T1..5 would be significantly less effective. This would suggest that the assertion count or coverage of these tests did not contribute to the effectiveness, even though they still detected 80% of all mutants.

Another weakness of this approach is that it is vulnerable to changes in the test set. If we remove T6, the mutants previously marked as "subsumed" become subsuming because Mu5 is no longer detected. Consequently, T1..5 now detect all the subsuming mutants. In this scenario, we decreased the quality of the master test suite by removing a single test, which leads to a significant increase in the subsuming effectiveness score of the tests T1..5. This can lead to strange results over time, as the addition of tests can lead to drops in the effectiveness of others.


Choice of effectiveness measure. Normalised effectiveness loses precision when large numbers of mutants are incorrectly marked as equivalent. Furthermore, normalised effectiveness is intended as a measurement of the thoroughness of a test suite, which is different from our definition of effectiveness. Subsuming effectiveness scores change when tests are added or removed, which makes the measure very sensitive to change. Furthermore, subsuming effectiveness penalises tests that do not kill a subsuming mutant.

We choose to apply normal effectiveness as this measure is more reliable. It also allows for comparing with similar research on effectiveness and assertions/coverage [24, 45]. We refer to test suite effectiveness also as normal effectiveness.

4 Are static metrics related to test suite effectiveness?

Mutation tooling is resource expensive and requires running the test suites, i.e., dynamic analysis. To address these problems, we investigate to what extent static metrics are related to test suite effectiveness. In this section, we describe how we will measure whether static metrics are a good predictor for test suite effectiveness.

4.1 Measuring the relationship between static metrics and test effectiveness

We consider two static metrics, assertion count and static method coverage, as candidates for predicting test suite effectiveness.

4.1.1 Assertion count

We hypothesise that assertion count is related to test effectiveness. Therefore, we first measure assertion count by following the call graph from all tests. As our context is static source code analysis, we should be able to identify the tests statically. Thus, we next compare the following approaches:

Static approach we use static call graph slicing (Section 3.2.3) to identify all tests of a project and measure the total assertion count for the identified tests.

Semi-dynamic approach we use Java reflection (Section 4.3) to identify all the tests and measure the total assertion count for these tests.

Finally, we inspect the type of the asserted object as input for the analysis of the relationship between assertion count and test effectiveness.

4.1.2 Static method coverage

We hypothesise that static method coverage is related to test effectiveness. To test this hypothesis, we measure the static method coverage using static call graph slicing.

We include dynamic method coverage as input for our analysis to: a) inspect the accuracy of the static method coverage algorithm and b) verify whether a correlation between method coverage and test suite effectiveness exists.

4.2 Case study setup

We study our selected projects using an experiment design based on work by Inozemtseva and Holmes [24]. They surveyed similar studies on the relation between test effectiveness and coverage and found that most studies implemented the following procedure: 1. Create faulty versions of one or more programs. 2. Create or generate many test suites. 3. Measure the metric scores of each suite. 4. Determine the effectiveness of each suite. We describe our approach for each step in the following subsections.

4.2.1 Generating faults

We employ mutation testing as a technique for generating faulty versions, mutants, of the different projects that will be analysed. We employ PIT as the mutation tool. Mutants are generated using the default set of mutators.¹ All mutants that are not detected by the master test suite are removed.

4.2.2 Project selection

We have chosen three projects for our analysis based on the following requirements: the projects had to have in the order of hundreds of thousands of LOC and thousands of tests.

Based on these criteria we selected a set of projects: Checkstyle [1], JFreeChart [5] and JodaTime [6]. Table 1 shows properties of the projects. Java LOC and TLOC are generated using David A. Wheeler's SLOCCount [14].

Checkstyle is a static analysis tool that checks if Java code and Javadoc comply with some coding rules, implemented in checker classes. Java and Javadoc grammars are used to generate Abstract Syntax Trees (ASTs). The checker classes visit the AST, generating messages if violations occur. The core logic is in the com.puppycrawl.tools.checkstyle.checks

package, representing 71% of the project's size. Checkstyle is the only project that used continuous integration and quality reports on GitHub to enforce quality, e.g., the build triggered by a commit would break if coverage or effectiveness dropped below a certain threshold. We decided to use the build tooling's class exclusion filters to get more representative results. These quality measures are needed as several developers have contributed to the project. The project currently has five active team members [2].

¹ http://pitest.org/quickstart/mutators/


JFreeChart is a chart library for Java. The project is split into two parts: the logic used for data and data processing, and the code focussed on the construction and drawing of plots. Most notable are the classes for the different plots in the org.jfree.chart.plot package, which contains 20% of the production code. JFreeChart is built and maintained by one developer [5].

JodaTime is a very popular date and time library. It provides functionality for calculations with dates and times in terms of periods, durations or intervals, while supporting many different date formats, calendar systems and time zones. The structure of the project is relatively flat, with only five different packages that are all at the root level. Most of the logic is related to either formatting dates or date calculations. Around 25% of the code is related to date formatting and parsing. JodaTime was created by two developers, only one of whom maintains the project [6].

4.2.3 Composing test suites

It has been shown that test suite size influences the relation with test effectiveness [35]. When a test is added to a test suite, it can never decrease the effectiveness, assertion count or coverage. Therefore, we will only compare test suites of equal sizes, similar to previous work [24, 45, 35].

We compose test suites of relative sizes, i.e., test suites that contain a certain percentage of all tests in the master test suite. For each size, we generate 1000 test suites. We selected the following range of relative suite sizes: 1%, 4%, 9%, 16%, 25%, 36%, 49%, 64% and 81%. Larger test suites were not included because the differences between the generated test suites would become too small. Additionally, we found that this sequence had the least overlap in effectiveness scores for the different suite sizes while still including a wide spread of the test effectiveness across different test suites.

Our approach differs from existing research [24] in which suites of sizes 3, 10, 30, 100, 300, 1000 and 3000 were used. A disadvantage of this approach is that the number of test suites for JodaTime is larger than for the others because JodaTime is the only project that has more than 3000 tests. Another disadvantage is that a test suite with 300 tests might be 50% of the master test suite for one project and only 10% of another project's test suite. Additionally, most composed test suites in this approach represent only a small portion of the master test suite. With our approach, we can more precisely study the behaviour of the metrics as the suites grow in size. Furthermore, we found that test suites with 16% of all tests already dynamically covered 50% to 70% of the methods covered by the master test suite.
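A sketch of how such suites could be composed (the class and method names are ours; the actual evaluation tool is described in Section 4.3): for each relative size, 1000 suites are drawn by sampling tests without replacement from the master test suite.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SuiteSampler {

    private static final double[] RELATIVE_SIZES =
            {0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.49, 0.64, 0.81};
    private static final int SUITES_PER_SIZE = 1000;

    /** Returns, for every relative size, 1000 randomly composed sub test suites. */
    public static List<List<String>> sample(List<String> masterSuite, long seed) {
        Random random = new Random(seed);
        List<List<String>> suites = new ArrayList<>();
        for (double size : RELATIVE_SIZES) {
            int n = (int) Math.round(size * masterSuite.size());
            for (int i = 0; i < SUITES_PER_SIZE; i++) {
                // Shuffle a copy of the master suite and take the first n tests,
                // i.e., sample n tests without replacement.
                List<String> shuffled = new ArrayList<>(masterSuite);
                Collections.shuffle(shuffled, random);
                suites.add(new ArrayList<>(shuffled.subList(0, n)));
            }
        }
        return suites;
    }
}
```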

4.2.4 Measuring metric scores and effectiveness

For each test suite, we measure the effectiveness, assertion count and static method coverage. The dynamic equivalents of both coverage metrics are included to evaluate their comparison. We obtain the dynamic coverage metrics using JaCoCo [4].

4.2.5 Statistical analysis

To determine how we will calculate the correlation with effectiveness, we analyse related work on the relation between test effectiveness and assertion count [45] and coverage [24]. Both works have similar experiment set-ups in which they generated sub test suites of fixed sizes and calculated metric and effectiveness scores for these suites. Furthermore, both studies used a parametric and a nonparametric correlation test, respectively Pearson and Kendall. We will also consider the Spearman rank correlation test, another nonparametric test, as it is commonly used in the literature. A parametric test assumes the underlying data to be normally distributed whereas nonparametric tests do not.

The Pearson correlation coefficient is based on the covariance of two variables, i.e., the metric and effectiveness scores, divided by the product of their standard deviations. Assumptions for Pearson include the absence of outliers, the normality of variables and linearity. Kendall's Tau rank correlation coefficient is a rank-based test used to measure the extent to which the rankings of two variables are similar. Spearman is a rank-based version of the Pearson correlation test, commonly used as its computation is more lightweight than Kendall's. However, our data set leads to similar computation times for Spearman and Kendall.

We discard Pearson because we cannot make assumptions on our data distribution. Moreover, Kendall "is a better estimate of the corresponding population parameter and its standard error is known" [23]. As the advantages of Spearman over Kendall do not apply in our case and Kendall has advantages over Spearman, we choose Kendall's Tau rank correlation test. The correlation coefficient is calculated with R's "Kendall" package [13]. We use the Guilford scale (Table 2) for verbal descriptions of the correlation strength [35].
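For reference, the basic (tau-a) form of Kendall's coefficient over the n (metric score, effectiveness score) pairs of a group of test suites is the difference between the number of concordant pairs n_c and discordant pairs n_d, normalised by the total number of pairs; the R package computes the tie-corrected variant (tau-b):

```latex
\[
\tau = \frac{n_c - n_d}{\binom{n}{2}}
\]
```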

4.3 Evaluation tool

We compose 1000 test suites of nine different sizes for each project. Running PIT+ on the master test suite took from 0.5 to 2 hours depending on the project. As we have to calculate the effectiveness of 27,000 test suites, this approach would take too much time. Our solution is to measure the test effectiveness of each test only once.


Table 1: Characteristics of the selected projects. Total Java LOC is the sum of the production LOC and TLOC.

Property Checkstyle JFreeChart JodaTime

Total Java LOC 73,244 134,982 84,035

Production LOC 32,041 95,107 28,724

TLOC 41,203 39,875 55,311

Number of tests 1,875 2,138 4,197

Method Coverage 98% 62% 90%

Date cloned from GitHub 4/30/17 4/25/17 3/23/17

Citations in literature [43, 39] [45, 24, 31, 26, 16] [24, 31, 26, 39]

Number of generated mutants 95,185 310,735 100,893

Number of killed mutants 80,380 80,505 69,615

Number of equivalent mutants 14,805 230,230 31,278

Equivalent mutants (%) 15.6% 74.1% 31.0%

Table 2: Guilford scale for the verbal description of correlation coefficients.

Correlation coefficient below 0.4 0.4 to 0.7 0.7 to 0.9 above 0.9

Verbal description low moderate high very high

Figure 2: Overview of the experiment set-up to obtain the relevant metrics for each test.

We then combine the results for different sets of tests to simulate test suites. To get the scores for a test suite with n tests, we combine the coverage results, assertion counts and killed mutants of its tests. Similarly, we calculate the static metrics and dynamic coverage only once for each test.

Detecting individual tests. We use a reflection library to detect both JUnit 3 and 4 tests for each project according to the following definitions:

JUnit 3 All methods in non-abstract subclasses of JUnit's TestCase class. Each method should have a name starting with "test", be public, return void and have no parameters.

JUnit 4 All public methods annotated with JUnit's @Test annotation.
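A sketch of these definitions using plain java.lang.reflect (the evaluation tool uses a reflection library; discovering the classes on the class path is assumed to happen elsewhere):

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

public class TestDetector {

    /** JUnit 3: public void no-arg methods named test* in non-abstract TestCase subclasses. */
    public static List<Method> junit3Tests(Class<?> clazz) {
        List<Method> tests = new ArrayList<>();
        boolean isTestCase = junit.framework.TestCase.class.isAssignableFrom(clazz)
                && !Modifier.isAbstract(clazz.getModifiers());
        if (!isTestCase) {
            return tests;
        }
        for (Method m : clazz.getMethods()) { // getMethods() only returns public methods
            if (m.getName().startsWith("test")
                    && m.getParameterCount() == 0
                    && m.getReturnType() == void.class) {
                tests.add(m);
            }
        }
        return tests;
    }

    /** JUnit 4: public methods annotated with @Test. */
    public static List<Method> junit4Tests(Class<?> clazz) {
        List<Method> tests = new ArrayList<>();
        for (Method m : clazz.getMethods()) {
            if (m.isAnnotationPresent(org.junit.Test.class)) {
                tests.add(m);
            }
        }
        return tests;
    }
}
```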

We verified the number of detected tests with the number of executed tests reported by each project's build tool.

We also need to include the set-up and tear-down logic of each test. We use JUnit's test runner API to execute individual tests. This API ensures execution of the corresponding set-up and tear-down logic. This extra test logic should also be included in the static coverage metric to get similar results. With JUnit 3 the extra logic is defined by overriding TestCase.setUp() or TestCase.tearDown(). JUnit 4 uses the @Before or @After annotations. However, the SAT does not provide information on the used annotations. A common practice is to still name these methods setUp or tearDown. We therefore include methods that are named setUp or tearDown and are located in the same class as the tests in the coverage results.

Aggregating metrics. To aggregate effectiveness, we need to know which mutants are detected by each test, as the sets of detected mutants could overlap. However, PIT does not provide a list of killed mutants. We solved this issue by creating a custom reporter using PIT's plug-in system to export the list of killed mutants.

The coverage of two tests can also overlap. Thus, we need information on the methods covered by each test. JaCoCo exports this information in a jacoco.exec report file, a binary file containing all the information required for aggregation. We aggregate these files via JaCoCo's API. For the static coverage metric, we export the list of covered methods in our analysis tool.

The assertion count of a test suite is simply calculated as the sum of each test's assertion count.
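A sketch of this aggregation (the types and names are ours; the per-test data would come from the exported PIT, JaCoCo and static-analysis results): killed mutants and covered methods are combined as set unions, assertion counts as a plain sum.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SuiteAggregator {

    /** Assertion count of a suite = sum of the assertion counts of its tests. */
    public static int suiteAssertionCount(List<String> suite, Map<String, Integer> assertionsPerTest) {
        return suite.stream().mapToInt(t -> assertionsPerTest.getOrDefault(t, 0)).sum();
    }

    /** Normal effectiveness = |union of killed mutants| / total non-equivalent mutants. */
    public static double suiteEffectiveness(List<String> suite,
                                            Map<String, Set<String>> killedPerTest,
                                            int totalNonEquivalentMutants) {
        Set<String> killed = new HashSet<>();
        for (String test : suite) {
            killed.addAll(killedPerTest.getOrDefault(test, Collections.emptySet()));
        }
        return (double) killed.size() / totalNonEquivalentMutants;
    }

    /** Method coverage = |union of covered methods| / total number of production methods. */
    public static double suiteCoverage(List<String> suite,
                                       Map<String, Set<String>> coveredPerTest,
                                       int totalMethods) {
        Set<String> covered = new HashSet<>();
        for (String test : suite) {
            covered.addAll(coveredPerTest.getOrDefault(test, Collections.emptySet()));
        }
        return (double) covered.size() / totalMethods;
    }
}
```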

Figure 2 provides an overview of the tools involved and the data they generate. The evaluation tool's input is raw test data and the sizes


of the test suites to create. We then compose test suites by randomly selecting a given number of tests from the master test suite. The output of the analysis tool is a data set containing the scores on the dynamic and static metrics for each test suite.

5 Results

We first present the results of our analysis of the assertion count metric, followed by the results of our analysis of code coverage.

Table 3 provides an overview of the assertion count, static and dynamic method coverage, and the percentage of mutants that were marked as equivalent for the master test suite of each project.

5.1 Assertion count

Figure 3 shows the distribution of the number of assertions for each test of each project.

We notice some tests with exceptionally high assertion counts. We manually checked these tests and found that the assertion count was correct for the outliers. We briefly explain a few outliers:

TestLocalDateTime_Properties.testPropertyRoundHour (140 asserts) checks the correctness of rounding 20 times, with for each check 7 assertions on year, month, week, etc.

TestPeriodFormat.test_wordBased_pl_regEx (140 asserts) calls and asserts the results of the Polish regex parser 140 times.

TestGJChronology.testDurationFields (57 asserts) tests for each duration field whether the field names are correct and if some flags are set correctly.

CategoryPlotTest.testEquals (114 asserts) incrementally tests all variations of the equals method of a plot object. The other tests with more than 37 assertions are similar tests for the equals methods of other types of plots.

Figure 4 shows the relation between the assertion count and normal effectiveness. Each dot represents a generated test suite; the colour of the dot represents the size of the suite relative to the total number of tests. The normal effectiveness, i.e., the percentage of mutants killed by a given test suite, is shown on the y-axis. The normalised assertion count is shown on the x-axis. We normalised the assertion count as the percentage of the total number of assertions for a given project. For example, as Checkstyle has 3,819 assertions (see Table 3), a test suite with 100 assertions would have a normalised assertion count of 100/3819 × 100 ≈ 2.6%.

We observe that test suites of the same relative size are clustered. For each group of test suites, we calculated the Kendall correlation coefficient between normal effectiveness and assertion count. These coefficients for each set of test suites of a given project and relative size are shown in Table 4. We highlight statistically significant correlations that have a p-value < 0.005 with two asterisks (**), and results with a p-value < 0.01 with a single asterisk (*).

We observe a statistically significant, low to moderate correlation for nearly all groups of test suites for JFreeChart. For JodaTime and Checkstyle, we notice significant but weaker correlations: 0.08-0.2 compared to JFreeChart's 0.14-0.4.

Table 5 shows the results of the two test identification approaches for the assertion count metric (see Section 4.1.1). False positives are methods that were incorrectly marked as tests. False negatives are tests that were not detected.

Figure 5 shows the distribution of asserted object types. Assertions for which we could not detect the content type are categorised as unknown.

5.2 Code coverage

Figure 6 shows the relation between static method coverage and normal effectiveness. A dot represents a test suite and its colour, the relative test suite size. Table 6 shows the Kendall correlation coefficients between static coverage and normal effectiveness for each set of test suites. We highlight statistically significant correlations that have a p-value < 0.005 with two asterisks (**), and results with a p-value < 0.01 with a single asterisk (*).

5.2.1 Static vs. dynamic method coverage

To evaluate the quality of the static method coverage algorithm, we compare static coverage with its dynamic counterpart for each suite (Figure 7). A dot represents a test suite; colours represent the size of a suite relative to the total number of tests. The black diagonal line illustrates the ideal line: all test suites below this line overestimate the coverage and all the test suites above underestimate the coverage. Table 7 shows the Kendall correlations between static and dynamic method coverage for the different projects and suite sizes. Each correlation coefficient maps to a set of test suites of the corresponding suite size and project. Coefficients with one asterisk (*) have a p-value < 0.01 and coefficients with two asterisks (**) have a p-value < 0.005. We observe a statistically significant, low to moderate correlation for all sets of test suites for JFreeChart and JodaTime.

5.2.2 Dynamic coverage and test suite effectiveness

Figure 8 shows the relation between dynamic method coverage and normal effectiveness. Each dot represents a test suite; its colour represents the size of that suite relative to the total number of tests.


Table 3: Results for the master test suite of each project.

Project Assertions Static coverage Dynamic coverage Equivalent mutants

Checkstyle 3,819 85% 98% 15.6%

JFreeChart 9,030 60% 62% 74.1%

JodaTime 23,830 85% 90% 31.0%


Figure 3: Distribution of the assertion count among individual tests per project.

Figure 4: Relation between assertion count and test suite effectiveness.

Table 4: Kendall correlations between assertion count and test suite effectiveness.

Project Relative test suite size

1% 4% 9% 16% 25% 36% 49% 64% 81%

Checkstyle -0.04 0.08** 0.13** 0.18** 0.20** 0.16** 0.16** 0.12** 0.10**

JFreeChart 0.03 0.14** 0.23** 0.32** 0.34** 0.35** 0.39** 0.40** 0.36**

JodaTime 0.05 0.11** 0.13** 0.13** 0.07** 0.09** 0.07** 0.10** 0.06*

Table 5: Comparison of different approaches to identify tests for the assertion count metric.

Project | Number of tests (semi-dynamic) | Assertion count (semi-dynamic) | Number of tests (static, diff) | Assertion count (static, diff) | False positives | False negatives

CheckStyle | 1,875 | 3,819 | 1,821 (-54) | 3,826 (+0.18%) | 5 | 59

JFreeChart | 2,138 | 9,030 | 2,172 (+34) | 9,224 (+2.15%) | 39 | 7

JodaTime | 4,197 | 23,830 | 4,180 (-17) | 23,943 (+0.47%) | 15 | 32

Figure 5: The distribution of assertion content types for the analysed projects. [Stacked bars per project (Checkstyle, JFreeChart, JodaTime); horizontal axis: percentage of total assertion count, 0–100%; assertion content types: fail, boolean, string, numeric, object, unknown.]


Figure 6: Relation between static coverage and test suite effectiveness.

Table 6: Kendall correlations between static method coverage and test suite effectiveness.

Project      Relative test suite size
             1%      4%      9%      16%     25%     36%     49%     64%     81%
Checkstyle   -0.05   -0.01   -0.02   -0.02   0.00    -0.04   -0.01   0.00    0.01
JFreeChart   0.49**  0.28**  0.23**  0.26**  0.27**  0.28**  0.31**  0.31**  0.26**
JodaTime     0.13**  0.28**  0.32**  0.28**  0.24**  0.25**  0.23**  0.20**  0.21**

Figure 7: Relation between static and dynamic method coverage. Static coverage of test suites below the black line is overestimated, above is underestimated.

Table 7: Kendall correlation between static and dynamic method coverage.

Project      Relative test suite size
             1%      4%      9%      16%     25%     36%     49%     64%     81%
Checkstyle   -0.03   -0.01   0.01    -0.02   0.00    0.00    0.05    0.10**  0.15**
JFreeChart   0.67**  0.33**  0.28**  0.31**  0.33**  0.35**  0.43**  0.45**  0.44**
JodaTime     0.35**  0.44**  0.48**  0.47**  0.51**  0.51**  0.52**  0.54**  0.59**

Table 8 shows the Kendall correlations between dynamic method coverage and normal effectiveness for the different groups of test suites for each project. Similarly to the other tables, two asterisks indicate that the correlation is statistically significant with a p-value < 0.005.

6 Discussion

We structure our discussion as follows: first, for each metric, we compare the results across all projects, perform an in-depth analysis on some of the projects and then answer the corresponding research question. Next, we describe the practicality of this research and the threats to validity.

6.1 Assertions and test suite effectiveness

We observe that test suites of the same relative size form groups in the plots in Figure 4, i.e., the assertion count and effectiveness score of same-size test suites are relatively close to each other.

For JFreeChart, groups of test suites with a relative size >= 9% exhibit a diagonal shape. This shape is ideal as it suggests that test suites with more assertions are more effective. These groups also show the strongest correlation between assertion count and effectiveness (Table 4).


Figure 8: Relation between dynamic method coverage and test suite effectiveness.

Table 8: Kendall correlation between dynamic method coverage and test suite effectiveness.

Project      Relative test suite size
             1%      4%      9%      16%     25%     36%     49%     64%     81%
Checkstyle   0.67**  0.71**  0.68**  0.59**  0.45**  0.36**  0.33**  0.31**  0.36**
JFreeChart   0.65**  0.59**  0.52**  0.48**  0.44**  0.47**  0.47**  0.49**  0.45**
JodaTime     0.48**  0.49**  0.53**  0.51**  0.48**  0.52**  0.48**  0.47**  0.44**

We notice that the normalised assertion count of a test suite is close to the relative suite size, e.g., suites with a relative size of 81% have a normalised assertion count between 77% and 85%. The difference between the relative suite size and the normalised assertion count is directly related to the variety in assertion count per test. More variety means that a test suite could exist with only below-average assertion counts, resulting in a normalised assertion count below 80%.

We analyse each project to find to what extent assertion count could predict test effectiveness.

6.1.1 Checkstyle

We notice a very low, statistically significant correlation between assertion count and test suite effectiveness for most of Checkstyle's test suite groups.

Most of Checkstyle's tests target the different checks in Checkstyle. Out of the 1875 tests, 1503 (80%) belong to a class that extends the BaseCheckTestSupport class. The BaseCheckTestSupport class contains a set of utility methods for creating a checker, executing the checker and verifying the messages generated by the checker. We notice a large variety in test suite effectiveness among the tests that extend this class. Similarly, we expect the same variety in assertion counts. However, the assertion count is the same for at least 75% of these tests.

We found that 1156 of these tests (62% of the master test suite) use the BaseCheckTestSupport.verify method for asserting the checker's results. The verify method iterates over the expected violation messages, which are passed as a parameter. This iteration hides the actual number of executed assertions. Consequently, we detect only two assertions for tests which might execute many assertions at runtime. In addition to the verify method, we found 60 tests that directly applied assertions inside for loops.
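The pattern is easy to reproduce. The following minimal sketch (a hypothetical test and helper, not Checkstyle's actual code) shows how assertion statements inside a loop stand for many runtime assertions:

import static org.junit.Assert.assertEquals;
import java.util.List;
import org.junit.Test;

public class LoopedAssertionExample {
    @Test
    public void verifiesAllMessages() {
        List<String> expected = List.of("msg1", "msg2", "msg3");
        List<String> actual = runCheckerAndCollectMessages();   // hypothetical helper
        assertEquals(expected.size(), actual.size());
        for (int i = 0; i < expected.size(); i++) {
            // One assertion in the source; expected.size() assertions at runtime.
            assertEquals(expected.get(i), actual.get(i));
        }
    }

    private List<String> runCheckerAndCollectMessages() {
        return List.of("msg1", "msg2", "msg3");
    }
}

Statically we would count two assertEquals calls here, while four assertions are executed at runtime; with longer expected-message lists the gap grows accordingly.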

Finding 1: Assertions within an iteration block skew the estimated assertion count. These iterations are a source of imprecision because the actual number of assertions could be much higher than the assertion count we measured.

Another consequence of the high usage of verify is that these 1156 tests all have the same assertion count. Figure 3 shows similar results for the distribution of assertions for Checkstyle's tests.

The effectiveness scores for these 1156 tests range from 0% to 11% (the highest effectiveness score of an individual test). This range shows that the group of tests with two assertions includes both the most and the least effective tests. There are approximately 1200 tests for which we detect exactly two assertions. As this concerns 64% of all tests, we state there is too little variety in the assertion count to make predictions on the effectiveness.

Finding 2: 64% of Checkstyle's tests have identical assertion counts. Variety in the assertion count is needed to distinguish between the effectiveness of different tests.

6.1.2 JFreeChart

JFreeChart is the only project exhibiting a low to moderate correlation for most groups of test suites.


We found many strong assertions in JFreeChart's tests. By strong, we mean that two large objects, e.g., plots, are compared in an assertion. This assertion uses the object's equals implementation. In this equals method, around 50 lines long, many fields of the plot, such as Paint or RectangleInsets, are compared, in turn relying on their respective equals implementations. We also notice that most outliers for JFreeChart in Figure 3 are tests for the equals methods, which suggests that the equals methods contain much logic.
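The contrast can be made concrete with a hedged sketch; the Plot class below is a stand-in with only two fields, whereas JFreeChart's real plot classes compare far more state in their equals implementations:

import static org.junit.Assert.assertEquals;
import java.util.Objects;
import org.junit.Test;

public class AssertionStrengthExample {
    static class Plot {
        int seriesCount = 3;
        String backgroundPaint = "white";

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Plot)) return false;
            Plot other = (Plot) o;
            // In JFreeChart the equivalent method spans roughly 50 lines and
            // compares paints, insets, axes, renderers, and more.
            return seriesCount == other.seriesCount
                    && Objects.equals(backgroundPaint, other.backgroundPaint);
        }

        @Override
        public int hashCode() {
            return Objects.hash(seriesCount, backgroundPaint);
        }
    }

    @Test
    public void weakVersusStrong() {
        Plot expected = new Plot();
        Plot actual = new Plot();
        assertEquals(3, actual.seriesCount);  // weak: checks a single property
        assertEquals(expected, actual);       // strong: one assertion, many properties via equals()
    }
}

Both statements count as one assertion for our metric, yet the second one constrains far more behaviour.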

Finding 3: Not all assertions are equally strong. Some only cover a single property, e.g., a string or a number, whereas others compare two objects, potentially covering many properties. For JFreeChart, we notice a large number of assertions that compare plot objects with many properties.

Next, we searched for the combination of loops and assertions that could skew the results, and found no such occurrences in the tests.

6.1.3 JodaTime

The correlations between assertion count and test suite effectiveness for JodaTime are similar to those of Checkstyle, and much lower than those of JFreeChart. We further analyse JodaTime to find a possible explanation for the weak correlation.

Assertions in for loops. We searched for test utility methods similar to the verify method of Checkstyle, i.e., a method that has assertions inside an iteration and is used by several tests. We observe that the four most effective tests, shown in Table 9, all call testForwardTransitions and/or testReverseTransitions, both of which are utility methods of the TestBuilder class. The rank columns contain the rank relative to the other tests to provide some context on how they compare. Ranks are calculated based on the descending order of effectiveness or assertion count. If multiple tests have the same score, we show the average rank. Note that the utility methods are different from the tests in the top 4 that share the same name. The top 4 tests are the only tests calling these utility methods. Both methods iterate over a two-dimensional array containing a set of approximately 110 date time transitions. For each transition, 4 to 7 assertions are executed, resulting in more than 440 executed assertions.

Additionally, we found 22 tests that combined iterations and assertions. Out of these 22 tests, at least 12 contained fixed-length iterations, e.g., for(int i = 0; i < 10; i++), that could be evaluated using other forms of static analysis.

In total, we found only 26 tests of the master test suite (0.6%) that were directly affected by assertions in for loops. Thus, for JodaTime, assertions in for loops do not explain the weak correlation between assertion count and effectiveness.

Assertion strength. JodaTime has significantly more assertions than JFreeChart and Checkstyle. We observe many assertions on numeric values, as one might expect from a library that is mostly about calculations on dates and times. For example, we noticed many utility methods that checked the properties of Date, DateTime or Duration objects. Each of these utility methods asserts the number of years, months, weeks, days, hours, etc. This large number of numeric assertions corresponds with the observation that 47% of the assertions are on numeric types (Figure 5).
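A sketch of the kind of utility method described above (the method name assertPeriod and the grouping are illustrative and not taken from JodaTime's test code, while Period and its getters are Joda-Time API):

import static org.junit.Assert.assertEquals;
import org.joda.time.Period;

final class PeriodAssertions {
    // One logical check on a date/time result expands into eight numeric assertions.
    static void assertPeriod(Period p, int years, int months, int weeks, int days,
                             int hours, int minutes, int seconds, int millis) {
        assertEquals(years, p.getYears());
        assertEquals(months, p.getMonths());
        assertEquals(weeks, p.getWeeks());
        assertEquals(days, p.getDays());
        assertEquals(hours, p.getHours());
        assertEquals(minutes, p.getMinutes());
        assertEquals(seconds, p.getSeconds());
        assertEquals(millis, p.getMillis());
    }
}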

However, the above is not always the case. For example, we found many tests, related to parsing dates or times from a string or tests for formatters, that only had one or two assertions while still being in the top half of most effective tests.

We distinguish between two types of tests: a) tests related to the arithmetic aspect with many assertions and b) tests related to formatting with only a few assertions. We find that assertion count does not work well as a predictor for test suite effectiveness since the assertion count of a test does not directly relate to how effective the test is.

Finding 4: Almost half of JodaTime's assertions are on numeric types. These assertions often occur in groups of 3 or more to assert a single result. However, a large number of effective tests only contains a small number of mostly non-numeric assertions. This mix leads to poor predictions.

6.1.4 Test identification

We measure the assertion count by following the static call graph for each test. As our context is static source code analysis, we also need to be able to identify the individual tests in the test code. We compare our static approach with a semi-static approach that uses Java reflection to identify tests.
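For illustration, a minimal sketch of how such a reflection-based (semi-static) identification could look; the class ReflectiveTestFinder is ours, not part of the SAT, and it assumes the compiled test classes are on the classpath:

import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;

final class ReflectiveTestFinder {
    // Returns the fully qualified names of all methods carrying JUnit 4's @Test annotation.
    static List<String> findTests(Class<?> testClass) {
        List<String> tests = new ArrayList<>();
        for (Method m : testClass.getMethods()) {
            if (m.isAnnotationPresent(Test.class)) {
                tests.add(testClass.getName() + "." + m.getName());
            }
        }
        return tests;
    }
}

The purely static approach instead recognises tests from the source code alone, so it cannot rely on class loading but also does not require the project to compile.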

Table 5 shows that the assertion count obtained with the static approach is closer to the dynamic approach than the assertion count obtained through the semi-static approach.

For all projects the assertion count of the static approach is higher. If the static algorithm does not identify tests, there are no call edges between the tests and the assertions. The absence of edges implies that these tests either have no assertions or an edge in the call graph was missing. These tests do not contribute to the assertion count.


Table 9: JodaTime's four most effective tests.

Test                                   Normal effectiveness       Assertions
                                       Score     Rank             Score   Rank
TestCompiler.testCompile()             17.23%    1                13      361.5
TestBuilder.testSerialization()        14.61%    2                13      361.5
TestBuilder.testForwardTransitions()   12.94%    3                7       1,063.5
TestBuilder.testReverseTransitions()   12.93%    4                4       1,773.0

We notice that the methods that were incorrectly marked as tests, false positives, are methods used for debugging purposes or methods that were missing the @Test annotation. The latter is most noticeable for JFreeChart: we identified 39 tests that were missing the @Test annotation. Of these 39 tests, 38 executed correctly when the @Test annotation was added. According to the repository's owner, these tests are valid tests (https://github.com/jfree/jfreechart/issues/57).

Based on the results of these three projects, we also show that the use of call graph slicing gives accurate results on a project level.

6.1.5 Assertion count as a predictor for test effectiveness

We found that the correlation for Checkstyle and JodaTime is weaker than for JFreeChart. Our analysis indicates that the correlation for Checkstyle is less strong because of a combination of assertions in for loops (Finding 1) and the assertion distribution (Finding 2). However, this does not explain the weak correlation for JodaTime. As shown in Figure 3, JodaTime has a much larger spread in the assertion count of each test. Furthermore, we observe that the assertion-iteration combination does not have a significant impact on the relationship with test suite effectiveness compared to Checkstyle. We notice a set of strong assertions for JFreeChart (Finding 3), whereas JodaTime has mostly weak assertions (Finding 4).

RQ 1: To what extent is assertion count a good predictor for test suite effectiveness?

Assertion count has potential as a predictor for test suite effectiveness because assertions are directly related to the detection of mutants. However, more work on assertions is needed as the correlation with test suite effectiveness is often weak or statistically insignificant.

For all three projects (Table 3), we observe different assertion counts. Checkstyle and JodaTime are of similar size and quality, but Checkstyle only has 16% of the assertions JodaTime has. JFreeChart has more assertions than Checkstyle, but the production code base that should be tested is also three times bigger. A test quality model that includes the assertion count should incorporate information about the strength of the assertions, either by incorporating assertion content types, assertion coverage [45] or the size of the asserted object. Furthermore, such a model should also include information about the size of a project.

If assertion count were used, we should measure the presence of its sources of imprecision to judge its reliability. This measurement should also include the intensity of the usage of erroneous methods. For example, we found hundreds of methods and tests with assertions in for loops. However, only a few methods that were used often had a significant impact on the results.

6.2 Coverage and effectiveness

We observe a diagonal-like shape for most groups of same-size test suites in Figure 6. This shape is ideal as it suggests that within this group, test suites with more static coverage are more effective. These groups also show the strongest correlation between static coverage and test suite effectiveness, as shown in Table 6.

Furthermore, we notice a difference in the spread of the static coverage on the horizontal axis. For example, coverage for Checkstyle's test suites can be split into three groups: around 30%, 70% and 80% coverage. JFreeChart shows a relatively large spread of coverage for smaller test suites, ranging between 18% and 45% coverage, but the coverage converges as test suites grow in size. JodaTime is the only project for which there is no split in the coverage scores of same-size test suites. We consider these differences in the spread of coverage a consequence of the quality of the static coverage algorithm. These differences are further explored in Section 6.2.1. We perform an in-depth analysis on Checkstyle in Section 6.2.2 because it is the only project which does not exhibit either a statistically significant correlation between static coverage and test effectiveness, or one between static coverage and dynamic method coverage.

6.2.1 Static vs. dynamic method coverage

When comparing dynamic and static coverage in Figure 7, we notice that the degree of over- or underestimation of the coverage depends on the project and test suite size. Smaller test suites tend to overestimate, whereas larger test suites underestimate. We observe that the quality of the static coverage for the Checkstyle project is significantly different compared to the other projects. Checkstyle is discussed in Section 6.2.2.

Overestimating coverage. The static coverage for the smaller test suites is significantly higher than the real coverage, as measured with dynamic analysis. Suppose a method M1 has a switch statement that, based on its input, calls one of the following methods: M2, M3, M4. There are three tests, T1, T2, T3, that each call M1 with one of the three options for the switch statement in M1 as a parameter. Additionally, there is a test suite TS1 that consists of T1, T2 and T3. Each test covers M1 and one of M2, M3, M4; all tests combined in TS1 cover all four methods. The static coverage algorithm does not evaluate the switch statement and detects for each test that four methods are covered. This shows that static coverage is not very accurate for individual tests. However, the static coverage for TS1 matches the dynamic coverage. This example illustrates why the loss in accuracy, caused by overestimating the coverage, decreases as test suites grow in size. The paths detected by the static and dynamic method coverage will eventually overlap once a test suite is created that contains all tests for a given function. The amount of overestimated coverage depends on how well the tests cover the different code paths.
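A hedged sketch of this example, with M1–M4 and T1–T3 written out as plain Java (placeholder code, not taken from the analysed projects):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class OverestimationExample {
    // M1: delegates to exactly one of M2, M3, M4 depending on its input.
    static String m1(int option) {
        switch (option) {
            case 1:  return m2();
            case 2:  return m3();
            default: return m4();
        }
    }
    static String m2() { return "a"; }
    static String m3() { return "b"; }
    static String m4() { return "c"; }

    // Each test dynamically covers M1 plus one of M2..M4. A static call graph that
    // does not evaluate the switch attributes all four methods to every test, so
    // per-test coverage is overestimated; the suite T1+T2+T3 matches either way.
    @Test public void t1() { assertEquals("a", m1(1)); }
    @Test public void t2() { assertEquals("b", m1(2)); }
    @Test public void t3() { assertEquals("c", m1(3)); }
}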

Finding 5: The degree of overestimation by the static method coverage algorithm depends on the real coverage and the amount of conditional logic and inheritance in the function under test.

Underestimating coverage. We observe that for larger test suites the coverage is often underestimated, see Figure 7. Similarly, the underestimation is also visible in the difference between static and dynamic method coverage of the different master test suites, as shown in the project results overview in Table 3.

A method that is called through reflection or by an external library is not detected by the static coverage algorithm. Smaller test suites do not suffer from this issue as the number of overestimated methods is often significantly larger than the number of underestimated methods.

We observe different tipping points between overestimating and underestimating for JFreeChart and JodaTime. For JFreeChart the tipping point is visible for test suites with a relative size of 81%, whereas JodaTime reaches the tipping point at a relative size of 25%. We assume this is caused by the relatively low "real" coverage of JFreeChart. We notice that many of JFreeChart's methods that were overestimated by the static coverage algorithm are not covered.

We illustrate the overlap between over- and underestimation with a small synthetic example. Given a project with 100 methods and a test suite T, we divide these methods into three groups: 1. Group A, with 60 methods that are all covered by T, as measured with dynamic coverage; 2. Group B, with 20 methods that are only called through the Java Reflection API, all covered by T similar to Group A; 3. Group C, with 20 methods that are not covered by T. The dynamic coverage for T consists of the 80 methods in groups A and B. The static method coverage for T also consists of 80 methods. However, the coverage for Group C is overestimated as these methods are not covered, and the coverage for Group B is underestimated as these methods are not detected by the static coverage algorithm.

JFreeChart has a relatively low coverage score compared to the other projects. It is likely that the parts of the code that are deemed covered by static and dynamic coverage will not overlap. However, it should be noted that low coverage does not imply that more methods are overestimated. When parts of the code base are completely uncovered, the static method coverage might also not detect any calls to those parts of the code base.

Finding 6: The degree of underestimation by the static coverage algorithm partially depends on the number of overestimated methods, as this will compensate for the underestimated methods, and on the number of methods that were called by reflection or external libraries.

Correlation between dynamic and static method coverage. Table 7 shows, for JFreeChart and JodaTime, statistically significant correlations that increase from a low correlation for smaller suites to a moderate correlation for larger suites. One exception is the correlation for JFreeChart's test suites with 1% relative size. We could not find an explanation for this exception.

We expected that the tipping point between static and dynamic coverage would also be visible in the correlation table. However, this is not the case. Our rank correlation test checks whether two variables follow the same ordering, i.e., if one variable increases, the other also increases. Underestimating the coverage does not influence the correlation when the degree of underestimation is similar for all test suites. As test suites grow in size, they become more similar in terms of included tests. Consequently, the chance of a test suite forming an outlier decreases as the size increases.

Finding 7: As test suites grow, the correlation between static and dynamic method coverage increases from low to moderate.


6.2.2 Checkstyle

Figures 6 and 7 show that the static coverage results for Checkstyle's test suites are significantly different from JFreeChart and JodaTime. For Checkstyle, all groups of test suites with a relative size of 49% and lower are split into three subgroups that have around 30%, 70% and 80% coverage. In the following subsections, we analyse the quality of the static coverage for Checkstyle and the predictability of test suite effectiveness.

Quality of static coverage algorithm. To analyse the static coverage algorithm for Checkstyle we compare the static coverage with the dynamic coverage for individual tests (Figure 9a), and inspect the distribution of the static coverage among the different tests (Figure 9b).

We regard the different groupings of test suites in the static coverage spread as a consequence of the few tests with high static method coverage.

Checker tests. Figure 9b shows 1104 tests scoring 30% to 32.5% coverage. Furthermore, dynamic coverage only varied between 31.3% and 31.6%, and nearly all tests are located in the com.puppycrawl.tools.checkstyle.checks package. We call these tests checker tests, as they are all focussed on the checks. A small experiment where we combined the coverage of all 1104 tests resulted in 31.8% coverage, indicating that all these checker tests almost completely overlap.

Listing 1 shows the structure typical for checker tests: the logic is mostly located in utility methods. Once the configuration for the checker is created, verify is called with the files that will be checked and the expected messages of the checker.

@Test
public void testCorrect() throws Exception {
    final DefaultConfiguration checkConfig =
        createCheckConfig(AnnotationLocationCheck.class);
    final String[] expected = CommonUtils.EMPTY_STRING_ARRAY;
    verify(checkConfig,
           getPath("InputCorrectAnnotationLocation.java"),
           expected);
}

Listing 1: Test in AnnotationLocationCheckTest

Finding 8: Most of Checkstyle's tests are focussed on the checker logic. Although these tests vary in effectiveness, they cover an almost identical set of methods as measured with the static coverage algorithm.

Coverage subgroups and outliers. We notice three vertical groups for Checkstyle in Figure 7 starting around 31%, 71% and 78% static coverage and then slowly curving to the right. These groupings are a result of how test suites are composed and the coverage of the included tests.

The coverage of the individual tests is shown in Figure 9a. We notice a few outliers at 48%, 58%, 74% and 75% coverage. We construct test suites by randomly selecting tests. A test suite's coverage is never lower than the highest coverage among its individual tests. For example, every time a test with 74% coverage is included, the test suite's coverage will jump to at least that percentage. As test suites grow in size, the chances of including a positive outlier increase. We notice that the outliers do not exactly match the coverage of the vertical groups. The second vertical group for Checkstyle in Figure 7 starts around 71% coverage. We found that if the test with 47.5% coverage, AbstractCheckTest.testVisitToken, is combined with a 30% coverage test (any of the checker tests), it results in 71% coverage. This shows that only 6.5% coverage is overlapping between both tests. We observe that all test suites in the vertical group at 71% include at least one checker test and AbstractCheckTest.testVisitToken and that they do not include any of the other outliers with more than 58%. The right-most vertical group starts at 79% coverage. This coverage is achieved by combining any of the tests with more than 50% coverage with a single checker test.

The groupings in Checkstyle's coverage scores are a consequence of the few coverage outliers. We show that these outliers can have a significant impact on a project's coverage score. Without these few outliers, the static coverage for Checkstyle's master test suite would only be 50%.

Test suites with low coverage. Figure 9b shows that more than half of the tests have at least 30% coverage. Similarly, Figure 7 shows that all test suites cover at least 31% of the methods. However, there are 763 tests with less than 30% coverage, and no test suites with less than 30% coverage. We explain this using probability theory. The smallest test suite for Checkstyle has a relative size of 1%, which is 19 tests. The chance of only including tests with less than 31% coverage is 763/1875 × (763−1)/(1875−1) × ... × (763−18)/(1875−18) ≈ 3 × 10^-8. These chances are negligible, even without considering that a combination of the selected tests might still lead to a coverage above 31%.
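As a check on this arithmetic, a small sketch that evaluates the product above (1,875 tests, 763 of them below 31% coverage, suites of 19 tests):

public class LowCoverageSuiteProbability {
    public static void main(String[] args) {
        int totalTests = 1875;       // tests in Checkstyle's master test suite
        int lowCoverageTests = 763;  // tests with less than 31% static coverage
        int suiteSize = 19;          // smallest sampled suite (1% relative size)

        double probability = 1.0;
        for (int i = 0; i < suiteSize; i++) {
            // Drawing without replacement: each pick must come from the low-coverage tests.
            probability *= (double) (lowCoverageTests - i) / (totalTests - i);
        }
        System.out.printf("P(all %d tests below 31%% coverage) = %.1e%n",
                suiteSize, probability);
    }
}

Running this yields a probability on the order of 10^-8, matching the estimate above.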

Missing coverage. We found that AbstractCheckTest.testVisitToken scores 47.5% static method coverage, although it only tests the AbstractCheck.visitToken method. Therefore any test calling the visitToken method will have at least 47.5% static method coverage.

160 classes extend AbstractCheck, of which 123 override the visitToken method. The static method coverage algorithm includes 123 virtual calls when AbstractCheck.visitToken is called.

Figure 9: Static method coverage scores for individual tests of Checkstyle. (a) Static and dynamic method coverage of individual tests; static coverage of tests below the black line is overestimated, above is underestimated. (b) Distribution of the tests over the different levels of static method coverage.

The coverage of all visitToken overrides combined is 47.5%. Note that the static coverage algorithm also considers constructor calls and static blocks as covered when a method of a class is invoked. We found that only 6.5% of the total method coverage overlaps with testVisitToken.

This small overlap between both tests suggests that visitToken is not called by any of the checker tests. However, we found that the verify method indirectly calls visitToken. The call process(File, FileText) is not matched with AbstractFileSetCheck.process(File, List). The parameter of type FileText extends AbstractList, which is part of the java.util package. During the construction of the static call graph, it was not detected that AbstractList is an implementation of the List interface because only Checkstyle's source code was inspected. If these calls were detected, the coverage of all checker tests would increase to 71%, filling the gap between the two right-most vertical groups in the plots for Checkstyle in both Figures 6 and 7.
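A hedged sketch of this failure mode; the class bodies below are stand-ins rather than Checkstyle's real implementations, but they show why the call cannot be matched without type information from java.util:

import java.io.File;
import java.util.AbstractList;
import java.util.List;

// Stand-in for Checkstyle's FileText: it is a List<String> only via the JDK hierarchy.
class FileText extends AbstractList<String> {
    @Override public String get(int index) { return ""; }
    @Override public int size() { return 0; }
}

abstract class AbstractFileSetCheck {
    abstract void process(File file, List<String> lines);
}

class Checker {
    void processFile(File file, FileText text, AbstractFileSetCheck check) {
        // Matching this call to process(File, List) requires the AbstractList -> List
        // relation from java.util; a call graph built from project sources alone misses it.
        check.process(file, text);
    }
}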

Finding 9: Our static coverage algorithm fails to detect a set of calls in the tests for the substantial group of checker tests due to shortcomings in the static call graph. If these calls were correctly detected, the static coverage for test suites of the same size would be grouped more closely, possibly resulting in a more significant correlation.

High reflection usage. Checkstyle applies a visitor pattern on an AST for the different code checks. The AbstractCheck class forms the basis of this visitor and is extended by 160 checker classes. These classes contain the core functionality of Checkstyle and consist of 2090 methods (63% of all methods), according to the SAT. Running our static coverage algorithm on the master test suite missed calls to 328 methods. Of these methods, 248 (7.5% of all methods) are setter methods. Further inspection showed that checkers are configured using reflection, based on a configuration file with properties that match the setters of the checkers. This large group of methods missed by the static coverage algorithm partially explains the difference between static and dynamic method coverage of Checkstyle's master test suite.
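A minimal sketch of the reflective configuration pattern we describe; the class, property name and String-typed setter are illustrative, not Checkstyle's actual configuration code:

import java.lang.reflect.Method;

final class CheckConfigurator {
    // Turns a property name from the XML configuration into a setter call at runtime,
    // so the static call graph contains no edge from the tests to the setter method.
    static void configure(Object check, String property, String value) throws Exception {
        String setterName = "set"
                + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        Method setter = check.getClass().getMethod(setterName, String.class);
        setter.invoke(check, value);
    }
}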

Finding 10: The large gap between static and dynamic method coverage for Checkstyle is caused by a significant amount of setter methods for the checker classes that are called through reflection.

Relation with effectiveness. Checkstyle is the only project for which there is no statistically significant correlation between static method coverage and test suite effectiveness.

We notice a large distance, regarding invocations in the call hierarchy, between most checkers and their tests. There are 9 invocations between visitToken and the much-used verify method.

In addition to the actual checker logic, a lot of infrastructure is included in each test, for example, instantiating the checkers and their properties based on a reflection framework, parsing the files and creating an AST, traversing the AST, and collecting and converting all messages of the checkers.

These characteristics seem to match those of integration tests. Zaidman et al. studied the evolution of the Checkstyle project and arrived at similar findings: "Moreover, there is a thin line between unit tests and integration tests. The Checkstyle developers see their tests more as I/O integration tests, yet associate individual test cases with a single production class by name" [43].

Directness. We implemented the directness measure to inspect whether it would reflect the presence of mostly integration-like tests. The directness is based on the percentage of methods that are directly called from a test. The master test suites of Checkstyle, JFreeChart and JodaTime cover respectively 30%, 26% and 61% of all methods directly. As Checkstyle's static coverage is significantly higher than that of JFreeChart, we observe that, relative to its overall coverage, Checkstyle covers the smallest portion of methods directly from tests. Given that unit tests should be focused on small functional units, we expected a relatively high directness measure for the test suites.

Finding 11: Many of Checkstyle's tests are integration-like tests that have a large distance between the test and the logic under test. Consequently, only a small portion of the code is covered directly.

To make matters worse, the integration-like tests were mixed with actual unit tests. We argue that integration tests have different test properties compared to unit tests: they often cover more code and have fewer assertions, but the assertions have a higher impact, e.g., comparing all the reported messages. These differences can lead to a skew in the effectiveness results.

6.2.3 Dynamic method coverage and effectiveness

We observe in Figure 8 that, within groups of test suites of the same size, test suites with more dynamic coverage are also more effective. Similarly, we observe a moderate correlation between dynamic method coverage and normal effectiveness for all three projects in Table 8.

When comparing test suite effectiveness with static method coverage, we observe a low to moderate correlation for JFreeChart and JodaTime when accounting for size in Table 6, but no statistically significant correlation for Checkstyle. Similarly, only the Checkstyle project does not show a statistically significant correlation between static and dynamic method coverage, as shown in Table 7. We believe this is a consequence of the integration-like test characteristics of the Checkstyle project. Due to the large distance between tests and code and the abstractions used in between, the static coverage is not very accurate.

The moderate correlation between dynamic method coverage and effectiveness suggests there is a relation between method coverage and normal effectiveness. However, the static method coverage does not show a statistically significant correlation with normal effectiveness for Checkstyle. We state that our static method coverage metric is not accurate enough for the Checkstyle project.

6.2.4 Method coverage as a predictor for test suite effectiveness

We found a statistically significant, low correlation between test suite effectiveness and static method coverage for JFreeChart and JodaTime. We evaluated the static coverage algorithm and found that smaller test suites typically overestimate the coverage (Finding 5), whereas for larger test suites the coverage is often underestimated (Finding 6). The tipping point depends on the real coverage of the project. We also found that static coverage correlates better with dynamic coverage as test suites increase in size (Finding 7).

An exception to these observations is Checkstyle, the only project without a statistically significant correlation between static method coverage and both test suite effectiveness and dynamic method coverage. Most of Checkstyle's tests have nearly identical coverage results (Finding 8), albeit the effectiveness varies. The SAT could calculate static code coverage; however, it is less suitable for more complex projects. The large distance between tests and tested functionality (Finding 11) in the Checkstyle project in terms of call hierarchy led to skewed results as some of the most used calls were not resolved (Finding 9). This can be partially mitigated by improving the call resolving.

We consider the inaccurate results of the static coverage algorithm a consequence of the quality of the call graph and the frequent use of Java reflection (Finding 10). Furthermore, the unit tests for Checkstyle show similarities with integration tests.

RQ 2: To what extent is static coverage a good predictor for test suite effectiveness?

First, we found a moderate to high correlation between dynamic method coverage and effectiveness for all analysed projects, which suggests that method coverage is a suitable indicator. The projects that showed a statistically significant correlation between static and dynamic method coverage also showed a significant correlation between static method coverage and test suite effectiveness. Although the correlation between test suite effectiveness and static coverage was not statistically significant for Checkstyle, the coverage score on project level provided a relatively good indication of the project's real coverage. Based on these observations we consider coverage suitable as a predictor for test effectiveness.

6.3 Practicality

A test quality model based on the current state of the metrics would not be sufficiently accurate.

Although there is evidence of a correlation between assertion count and effectiveness, the assertion count of each project's master test suite did not map to the relative effectiveness of each project. Each of the analysed projects had on average a different number of assertions per test. Further improvements to the assertion count metric, e.g., including the strength of the assertions, are needed to get more usable results.

The static method coverage could be used to evaluate effectiveness to a certain extent. We found a low to moderate correlation for two of the projects between effectiveness and static method coverage. Furthermore, we found a similar correlation between static and dynamic method coverage. The quality of the static call graph should be improved to better estimate the real coverage.

We did not investigate the quality of these metrics for other programming languages. However, the SAT supports call graph analysis and identifying assertions for a large range of programming languages, facilitating future experiments.

We encountered scenarios for which the static metrics gave imprecise results. If these sources of imprecision were translated to metrics, they could indicate the quality of the static metrics. An indication of low quality could suggest that more manual inspection is needed.

6.4 Internal threats to validity

Static call graph. We use the static call graph constructed by the SAT for both metrics. We found several occurrences where the SAT did not correctly resolve the call graph. We fixed some of the issues encountered during our analysis. However, as we did not manually analyse all the calls, this remains a threat to validity.

Equivalent mutants. We treated all mutants that were not detected by the master test suite as equivalent mutants, an approach often used in the literature [35, 24, 45]. There is a high probability that this resulted in overestimating the number of equivalent mutants, especially for JFreeChart where a large part of the code is simply not tested. In principle, this is not a problem as we only compare the effectiveness of sub test suites. However, our statement on the ordering of the master test suites' effectiveness is vulnerable to this threat as we did not manually inspect each mutant for equivalence.

Accuracy of analysis. We manually inspected large parts of the Java code of each project. Most of the inspections were done by a single person with four years of experience in Java. Also, we did not inspect all the tests. Most tests were selected on a statistics-driven basis, i.e., we looked at tests that showed high effectiveness but low coverage, or tests with a large difference between static and dynamic coverage. To mitigate this, we also verified randomly selected tests. However, the chance of missing relevant sources of imprecision remains a threat to validity.

6.5 External threats to validity

We study three open source Java projects. Our results are not generalisable to projects using other programming languages. Also, we only included assertions provided by JUnit. Although JUnit is the most popular testing library for Java, there are other testing libraries that possibly use different assertions [44]. We also ignored mocking libraries in our analysis. Mocking libraries provide a form of assertions based on the behaviour of units under test. These assertions are ignored by our analysis, albeit they can lead to an increase in effectiveness.

6.6 Reliability

Tengeri et al. compared different instrumentation techniques and found that JaCoCo produces inaccurate results, especially when mapped back to source code [39]. The main problem was that JaCoCo did not include coverage between two different sub-modules in a Maven project. For example, a call from sub-module A to sub-module B is not registered by JaCoCo because JaCoCo only analyses coverage on a module level. As the projects analysed in this paper do not contain sub-modules, this JaCoCo issue is not applicable to our work.

7 Related work

We group related work as follows: test quality models, standalone test metrics, code coverage and effectiveness, and assertions and effectiveness.

7.1 Test quality models

We compare the TQM [18] we used, as described in Section 2.2, with two other test quality models. We first describe the other models, followed by a motivation for the choice of a model.

STREW. Nagappan introduced the Software Testing and Reliability Early Warning (STREW) metric suite to provide "an estimate of post-release field quality early in software development phases [34]." The STREW metric suite consists of nine static source and test code metrics. The metric suite is divided into three categories: Test quantification, Complexity and OO-metrics, and Size adjustment. The test quantification metrics are the following: 1. the number of assertions per line of production code; 2. the number of tests per line of production code; 3. the number of assertions per test; 4. the ratio between lines of test code and production code, divided by the ratio of test and production classes.

TAIME. Tengeri et al. introduced a systematic approach for test suite assessment with a focus on code coverage [38]. Their approach, Test Suite Assessment and Improvement Method (TAIME), is intended to find improvement points and guide the improvement process. In this iterative process, first, both the test code and production code are split into functional groups and paired together. The second step is to determine the granularity of the measures, starting with coarse metrics on procedure level and repeating on statement level in later iterations. Based on these functional groups they define the following set of metrics:

Code coverage calculated on both procedure and statement level.

Partition metric "The Partition Metric (PART) characterizes how well a set of test cases can differentiate between the program elements based on their coverage information [38]".

Tests per Program how many tests have been created on average for a functional group.

Specialisation how many tests for a functional group are in the corresponding test group.

Uniqueness what portion of covered functionality is covered only by a particular test group.

STREW, TAIME and TQM are models for assessing aspects of test quality. STREW and TQM are both based on static source code analysis. However, STREW lacks coverage related metrics compared to TQM. TAIME is different from the other two models as it does not depend on a specific programming language or xUnit framework. Furthermore, TAIME is more an approach than a simple metric model. It is an iterative process that requires user input to identify functional groups. The required user input makes it less suitable for automated analysis or large-scale studies.

7.2 Standalone test metrics

Van den Bekerom investigated the relation between test smells and test bugs [41]. He built a tool using the SAT to detect a set of test smells: Eager Test, Lazy Test, Assertion Roulette, Sensitive Equality and Conditional Test Logic. He showed that classes affected by test bugs score higher on the presence of test smells. Additionally, he predicted classes that have test bugs based on the eager test smell with a precision of 7%, which was better than random. However, the recall was very low, which led to the conclusion that it is not yet usable to predict test bugs with smells.

Ramler et al. implemented 42 new rules for the static analysis tool PMD to evaluate JUnit code [37]. They defined four key problem areas that should be analysed: usage of the xUnit test framework, implementation of the unit test, maintainability of the test suite and testability of the SUT. The rules were applied to the JFreeChart project and resulted in 982 violations, of which one-third was deemed to be some symptom of problems in the underlying code.

7.3 Code coverage and effectiveness

Namin et al. studied how coverage and size independently influence effectiveness [35]. Their experiment used seven Siemens suite programs which varied between 137 and 513 LOC and had between 1000 and 5000 test cases. Four types of code coverage were measured: block, decision, C-Use and P-Use. The size was defined by the number of tests and effectiveness was measured using mutation testing. Test suites of fixed sizes and different coverage levels were randomly generated to measure the correlation between coverage and effectiveness. They showed that both coverage and size independently influence test suite effectiveness.

Another study on the relation between test effectiveness and code coverage was performed by Inozemtseva and Holmes [24]. They conducted an experiment on a set of five large open source Java projects and accounted for the size of the different test suites. Additionally, they introduced a novel effectiveness metric, normalized effectiveness. They found moderate correlations between coverage and effectiveness when size was accounted for. However, the correlation was low for normalized effectiveness.

The main difference with our work is that we used static source code analysis to calculate method coverage. Our experiment set-up is similar to that of Inozemtseva and Holmes, except that we chose a different set of data points which we showed to be more representative.

7.4 Assertions and effectiveness

Kudrjavets et al. investigated the relation between assertions and fault density [28]. They measured the assertion density, i.e., the number of assertions per thousand lines of code, for two components of Microsoft Visual Studio written in C and C++. Additionally, real faults were taken from an internal bug database and converted to fault density. Their results showed a negative relation between assertion density and fault density, i.e., code with a higher assertion density had a lower fault density. Instead of assertion density we focussed on the assertion count of Java projects and used artificial faults, i.e., mutants.

Zhang and Mesbah [45] investigated the relationship between assertions and test suite effectiveness. They found that, even when test suite size was controlled for, there was a strong correlation between assertion count and test effectiveness. Our results overlap with their work as we both found a correlation between assertion count and effectiveness for the JFreeChart project. However, we showed that this correlation is not always present, as both Checkstyle and JodaTime showed different results.


8 Conclusion

We analysed the relation between test suite effectiveness and two metrics, assertion count and static method coverage, for three large Java projects: Checkstyle, JFreeChart and JodaTime. Both metrics were measured using static source code analysis. We found a low correlation between test suite effectiveness and static method coverage for JFreeChart and JodaTime and a low to moderate correlation with assertion count for JFreeChart. We found that the strength of the correlation depends on the characteristics of the project. The absence of a correlation does not imply that the metrics are not useful for a TQM.

Our current implementation of the assertion count metric only shows promising results when predicting test suite effectiveness for JFreeChart. We found that simply counting the assertions for each project gives results that do not align with the relative effectiveness of the projects. The project with the most effective master test suite had a significantly lower assertion count than the other projects. Even for sub test suites of most projects, the assertion count did not correlate with test effectiveness. Incorporating the strength of an assertion could lead to better predictions.

Static method coverage is a good candidate for predicting test suite effectiveness. We found a statistically significant, low correlation between static method coverage and test suite effectiveness for most analysed projects. Furthermore, the coverage algorithm is consistent in its predictions on a project level, i.e., the ordering of the projects based on the coverage matched the relative ranking in terms of test effectiveness.

8.1 Future work

Static coverage. Landman et al. investigated the challenges for static analysis of Java reflection [30]. They identified that it is at least possible to identify and measure hard-to-resolve reflection usage. Measuring reflection usage could give an indication of the degree of underestimated coverage. Similarly, we would like to investigate whether we can give an indication of the degree of overestimation of the project.

Assertion count. We would like to investigate further whether we can measure the strength of an assertion. Zhang and Mesbah included assertion coverage and measured the effectiveness of different assertion types [45]. We would like to incorporate this knowledge into the assertion count. This could result in a more comparable assertion count on project level.

Van Deursen et al. described a set of test smells, including the eager test, a test that verifies too much functionality of the tested function [42].

We found a large number of tests in the JodaTime project that called the function under test several times. For example, JodaTime's test_wordBased_pl_regEx test checks 140 times whether periods are formatted correctly in Polish. These eager tests should be split into separate cases that test the specific scenarios.

8.2 Acknowledgements

We would like to thank Prof. Serge Demeyer for his elaborate and insightful feedback on our paper.

References

[1] Checkstyle. https://github.com/checkstyle/checkstyle. Accessed: 2017-07-15.

[2] Checkstyle team. http://checkstyle.sourceforge.net/team-list.html. Accessed: 2017-11-19.

[3] Code cover. http://codecover.org/. Accessed: 2017-07-15.

[4] JaCoCo. http://www.jacoco.org/. Accessed: 2017-07-15.

[5] JFreeChart. https://github.com/jfree/jfreechart. Accessed: 2017-07-15.

[6] JodaTime. https://github.com/jodaorg/joda-time. Accessed: 2017-07-15.

[7] JUnit. http://junit.org/. Accessed: 2017-07-15.

[8] MAJOR mutation tool. http://mutation-testing.org/. Accessed: 2017-07-15.

[9] muJava mutation tool. https://cs.gmu.edu/~offutt/mujava/. Accessed: 2017-07-15.

[10] PIT+. https://github.com/LaurentTho3/ExtendedPitest. Accessed: 2017-07-15.

[11] PIT fork. https://github.com/pacbeckh/pitest. Accessed: 2017-07-15.

[12] PIT mutation tool. http://pitest.org/. Accessed: 2017-07-15.

[13] R's Kendall package. https://cran.r-project.org/web/packages/Kendall/Kendall.pdf. Accessed: 2017-07-15.

[14] SLOCCount. https://www.dwheeler.com/sloccount/. Accessed: 2017-07-15.

[15] TIOBE-Index. https://www.tiobe.com/tiobe-index/. Accessed: 2017-07-15.

[16] Tiago L. Alves and Joost Visser. Static estimation of test coverage. In Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2009, Edmonton, Alberta, Canada, September 20-21, 2009, pages 55–64, 2009.

[17] Paul Ammann, Marcio Eduardo Delamaro, and Jeff Offutt. Establishing theoretical minimal sets of mutants. In Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014, March 31 - April 4, 2014, Cleveland, Ohio, USA, pages 21–30, 2014.


[18] Dimitrios Athanasiou, Ariadi Nugroho, Joost Visser, and Andy Zaidman. Test code quality and its relation to issue handling performance. IEEE Trans. Software Eng., 40(11):1100–1125, 2014.

[19] Kent Beck and Erich Gamma. Test infected: Programmers love writing tests. Java Report, 3(7):37–50, 1998.

[20] Antonia Bertolino. Software testing research: Achievements, challenges, dreams. In International Conference on Software Engineering, ISCE 2007, Workshop on the Future of Software Engineering, FOSE 2007, May 23-25, 2007, Minneapolis, MN, USA, pages 85–103, 2007.

[21] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A practical model for measuring maintainability. In Quality of Information and Communications Technology, 6th International Conference on the Quality of Information and Communications Technology, QUATIC 2007, Lisbon, Portugal, September 12-14, 2007, Proceedings, pages 30–39, 2007.

[22] Ferenc Horvath, Bela Vancsics, Laszlo Vidacs, Arpad Beszedes, David Tengeri, Tamas Gergely, and Tibor Gyimothy. Test suite evaluation using code coverage based metrics. In Proceedings of the 14th Symposium on Programming Languages and Software Tools (SPLST'15), Tampere, Finland, October 9-10, 2015, pages 46–60, 2015.

[23] David C. Howell. Statistical methods for psychology. Cengage Learning, 2012.

[24] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In 36th International Conference on Software Engineering, ICSE '14, Hyderabad, India, May 31 - June 07, 2014, pages 435–445, 2014.

[25] Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Trans. Software Eng., 37(5):649–678, 2011.

[26] Rene Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16-22, 2014, pages 654–665, 2014.

[27] Marinos Kintis, Mike Papadakis, Andreas Papadopoulos, Evangelos Valvis, and Nicos Malevris. Analysing and comparing the effectiveness of mutation testing tools: A manual study. In 16th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2016, Raleigh, NC, USA, October 2-3, 2016, pages 147–156, 2016.

[28] Gunnar Kudrjavets, Nachiappan Nagappan, and Thomas Ball. Assessing the relationship between software assertions and faults: An empirical investigation. In 17th International Symposium on Software Reliability Engineering (ISSRE 2006), 7-10 November 2006, Raleigh, North Carolina, USA, pages 204–212, 2006.

[29] Tobias Kuipers and Joost Visser. A tool-based methodology for software portfolio monitoring. In Software Audit and Metrics, Proceedings of the 1st International Workshop on Software Audit and Metrics, SAM 2004, in conjunction with ICEIS 2004, Porto, Portugal, April 2004, pages 118–128, 2004.

[30] Davy Landman, Alexander Serebrenik, and Jurgen J. Vinju. Challenges for static analysis of Java reflection: literature review and empirical study. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, pages 507–518, 2017.

[31] Thomas Laurent, Mike Papadakis, Marinos Kintis, Christopher Henard, Yves Le Traon, and Anthony Ventresque. Assessing and improving the mutation testing practice of PIT. In 2017 IEEE International Conference on Software Testing, Verification and Validation, ICST 2017, Tokyo, Japan, March 13-17, 2017, pages 430–435, 2017.

[32] Andras Marki and Birgitta Lindstrom. Mutation tools for Java. In Proceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Morocco, April 3-7, 2017, pages 1364–1415, 2017.

[33] Thomas J. McCabe. A complexity measure. IEEE Trans. Software Eng., 2(4):308–320, 1976.

[34] Nachiappan Nagappan. A Software Testing and Reliability Early Warning (Strew) Metric Suite. PhD thesis, North Carolina State University, 2005.

[35] Akbar Siami Namin and James H. Andrews. The influence of size and coverage on test suite effectiveness. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ISSTA 2009, Chicago, IL, USA, July 19-23, 2009, pages 57–68, 2009.

[36] Mike Papadakis, Christopher Henard, Mark Harman, Yue Jia, and Yves Le Traon. Threats to the validity of mutation-based test assessment. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrucken, Germany, July 18-20, 2016, pages 354–365, 2016.

[37] Rudolf Ramler, Michael Moser, and Josef Pichler. Automated static analysis of unit test code. In First International Workshop on Validating Software Tests, VST@SANER 2016, Osaka, Japan, March 15, 2016, pages 25–28, 2016.

[38] David Tengeri, Arpad Beszedes, Tamas Gergely, Laszlo Vidacs, David Havas, and Tibor Gyimothy. Beyond code coverage - an approach for test suite assessment and improvement. In Eighth IEEE International Conference on Software Testing, Verification and Validation, ICST 2015 Workshops, Graz, Austria, April 13-17, 2015, pages 1–7, 2015.

[39] David Tengeri, Ferenc Horvath, Arpad Beszedes, Tamas Gergely, and Tibor Gyimothy. Negative effects of bytecode instrumentation on Java source code coverage. In IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume 1, pages 225–235, 2016.

[40] Paco van Beckhoven. Assessing test suite effectiveness using static analysis. Master's thesis, University of Amsterdam, 2017.

[41] Kevin van den Bekerom. Detecting test bugs using static analysis tools. Master's thesis, University of Amsterdam, 2016.

[42] Arie van Deursen, Leon Moonen, Alex van den Bergh, and Gerard Kok. Refactoring test code. In Proceedings of the 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2001), pages 92–95, 2001.

[43] Andy Zaidman, Bart Van Rompaey, Serge Demeyer, and Arie van Deursen. Mining software repositories to study co-evolution of production & test code. In First International Conference on Software Testing, Verification, and Validation, ICST 2008, Lillehammer, Norway, April 9-11, 2008, pages 220–229, 2008.

[44] Ahmed Zerouali and Tom Mens. Analyzing the evolution of testing library usage in open source Java projects. In IEEE 24th International Conference on Software Analysis, Evolution and Reengineering, SANER 2017, Klagenfurt, Austria, February 20-24, 2017, pages 417–421, 2017.

[45] Yucheng Zhang and Ali Mesbah. Assertions are strongly correlated with test suite effectiveness. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015, pages 214–224, 2015.

[46] Hong Zhu, Patrick A. V. Hall, and John H. R. May. Software unit test coverage and adequacy. ACM Comput. Surv., 29(4):366–427, 1997.
