Software Unit Test Coverage and Adequacy

HONG ZHU

Nanjing University

PATRICK A. V. HALL AND JOHN H. R. MAY

The Open University, Milton Keynes, UK

Objective measurement of test quality is one of the key issues in software testing. It has been a major research focus for the last two decades. Many test criteria have been proposed and studied for this purpose. Various kinds of rationales have been presented in support of one criterion or another. We survey the research work in this area. The notion of adequacy criteria is examined together with its role in software dynamic testing. A review of criteria classification is followed by a summary of the methods for comparison and assessment of criteria.

Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging

General Terms: Measurement, Performance, Reliability, Verification

Additional Key Words and Phrases: Comparing testing effectiveness, fault detection, software unit test, test adequacy criteria, test coverage, testing methods

1. INTRODUCTION

In 1972, Dijkstra claimed that “program testing can be used to show the presence of bugs, but never their absence” to persuade us that a testing approach is not acceptable [Dijkstra 1972]. However, the last two decades have seen rapid growth of research in software testing as well as intensive practice and experiments. It has been developed into a validation and verification technique indispensable to the software engineering discipline. Then, where are we today? What can we claim about software testing?

In the mid-’70s, in an examination of the capability of testing for demonstrating the absence of errors in a program, Goodenough and Gerhart [1975, 1977] made an early breakthrough in research on software testing by pointing out that the central question of software testing is “what is a test criterion?”, that is, the criterion that defines what constitutes an adequate test. Since then, test criteria have been a major research focus. A great number of such criteria have been proposed and investigated. Considerable research effort has attempted to provide support for the use of one criterion or another. How should we understand these different criteria? What are the future directions for the subject?

Authors’ addresses: H. Zhu, Institute of Computer Software, Nanjing University, Nanjing, 210093, P.R. of China; email: [email protected]; P.A.V. Hall and J.H.R. May, Department of Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 1997 ACM 0360-0300/97/1200–0366 $03.50

In contrast to the constant attention given to test adequacy criteria by academics, the software industry has been slow to accept test adequacy measurement. Few software development standards require or even recommend the use of test adequacy criteria [Wichmann 1993; Wichmann and Cox 1992]. Are test adequacy criteria worth the cost for practical use?

Addressing these questions, we survey research on software test criteria in the past two decades and attempt to put it into a uniform framework.

1.1 The Notion of Test Adequacy

Let us start with some examples. Here we seek to illustrate the basic notions underlying adequacy criteria. Precise definitions will be given later.

—Statement coverage. In software testing practice, testers are often required to generate test cases to execute every statement in the program at least once. A test case is an input on which the program under test is executed during testing. A test set is a set of test cases for testing a program. The requirement of executing all the statements in the program under test is an adequacy criterion. A test set that satisfies this requirement is considered to be adequate according to the statement coverage criterion. Sometimes the percentage of executed statements is calculated to indicate how adequately the testing has been performed. The percentage of the statements exercised by testing is a measurement of the adequacy.

—Branch coverage. Similarly, the branch coverage criterion requires that all control transfers in the program under test are exercised during testing. The percentage of the control transfers executed during testing is a measurement of test adequacy.

—Path coverage. The path coverage criterion requires that all the execution paths from the program’s entry to its exit are executed during testing.

—Mutation adequacy. Software testing is often aimed at detecting faults in software. A way to measure how well this objective has been achieved is to plant some artificial faults into the program and check if they are detected by the test. A program with a planted fault is called a mutant of the original program. If a mutant and the original program produce different outputs on at least one test case, the fault is detected. In this case, we say that the mutant is dead or killed by the test set. Otherwise, the mutant is still alive. The percentage of dead mutants compared to the mutants that are not equivalent to the original program is an adequacy measurement, called the mutation score or mutation adequacy [Budd et al. 1978; DeMillo et al. 1978; Hamlet 1977] (see the sketch following this list).
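
The mutation score can be computed mechanically once the mutants are available. The following Python sketch is ours, not the paper's; the function name, the modeling of programs and mutants as callables, and the explicitly supplied set of known-equivalent mutants are illustrative assumptions.

    def mutation_score(original, mutants, test_set, equivalent_indices=frozenset()):
        # Fraction of non-equivalent mutants killed by the test set. Programs and
        # mutants are modeled as callables from a test input to an output; the
        # indices of mutants known to be equivalent must be supplied, since
        # equivalence is undecidable in general.
        candidates = [m for i, m in enumerate(mutants) if i not in equivalent_indices]
        if not candidates:
            return 1.0
        killed = sum(1 for m in candidates
                     if any(m(t) != original(t) for t in test_set))
        return killed / len(candidates)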

From Goodenough and Gerhart’s [1975, 1977] point of view, a software test adequacy criterion is a predicate that defines “what properties of a program must be exercised to constitute a ‘thorough’ test, i.e., one whose successful execution implies no errors in a tested program.” To guarantee the correctness of adequately tested programs, they proposed reliability and validity requirements of test criteria. Reliability requires that a test criterion always produce consistent test results; that is, if the program is tested successfully on one test set that satisfies the criterion, then the program is also tested successfully on all test sets that satisfy the criterion. Validity requires that the test always produce a meaningful result; that is, for every error in a program, there exists a test set that satisfies the criterion and is capable of revealing the error. But it was soon recognized that there is no computable criterion that satisfies the two requirements, and hence they are not practically applicable [Howden 1976]. Moreover, these two requirements are not independent, since a criterion is either reliable or valid for any given software [Weyuker and Ostrand 1980]. Since then, the focus of research seems to have shifted from seeking theoretically ideal criteria to the search for practically applicable approximations.

Currently, the software testing literature contains two different, but closely related, notions associated with the term test data adequacy criteria. First, an adequacy criterion is considered to be a stopping rule that determines whether sufficient testing has been done so that it can be stopped. For instance, when using the statement coverage criterion, we can stop testing if all the statements of the program have been executed. Generally speaking, since software testing involves the program under test, the set of test cases, and the specification of the software, an adequacy criterion can be formalized as a function C that takes a program p, a specification s, and a test set t and gives a truth value true or false. Formally, let P be a set of programs, S be a set of specifications, D be the set of inputs of the programs in P, and T be the class of test sets, that is, T = 2^D, where 2^X denotes the set of subsets of X.

Definition 1.1 (Test Data Adequacy Criteria as Stopping Rules). A test data adequacy criterion C is a function C: P × S × T → {true, false}. C(p, s, t) = true means that t is adequate for testing program p against specification s according to the criterion C; otherwise t is inadequate.

Second, test data adequacy criteria provide measurements of test quality when a degree of adequacy is associated with each test set, so that it is not simply classified as good or bad. In practice, the percentage of code coverage is often used as an adequacy measurement. Thus, an adequacy criterion C can be formally defined to be a function C from a program p, a specification s, and a test set t to a real number r = C(p, s, t), the degree of adequacy [Zhu and Hall 1992]. Formally:

Definition 1.2 (Test Data Adequacy Criteria as Measurements). A test data adequacy criterion is a function C: P × S × T → [0, 1]. C(p, s, t) = r means that the adequacy of testing the program p by the test set t with respect to the specification s is of degree r according to the criterion C. The greater the real number r, the more adequate the testing.

These two notions of test data adequacy criteria are closely related to one another. A stopping rule is a special case of measurement on the continuum, since the actual range of measurement results is the set {0, 1}, where 0 means false and 1 means true. On the other hand, given an adequacy measurement M and a degree r of adequacy, one can always construct a stopping rule M_r such that a test set is adequate if and only if the adequacy degree is greater than or equal to r; that is, M_r(p, s, t) = true ⇔ M(p, s, t) ≥ r. Since a stopping rule asserts a test set to be either adequate or inadequate, it is also called a predicate rule in the literature.
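
As an illustration of this correspondence, the construction of the stopping rule M_r from a measurement M takes only a few lines. The sketch below is ours, not the paper's; the names Measurement and stopping_rule are illustrative.

    from typing import Callable

    # An adequacy measurement maps (program, specification, test set) to a
    # degree of adequacy in [0, 1], as in Definition 1.2.
    Measurement = Callable[[object, object, list], float]

    def stopping_rule(M: Measurement, r: float):
        # M_r(p, s, t) = true if and only if M(p, s, t) >= r (Definition 1.1 form).
        def M_r(p, s, t):
            return M(p, s, t) >= r
        return M_r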

An adequacy criterion is an essential part of any testing method. It plays two fundamental roles. First, an adequacy criterion specifies a particular software testing requirement, and hence determines test cases to satisfy the requirement. It can be defined in one of the following forms.

(1) It can be an explicit specification for test case selection, such as a set of guidelines for the selection of test cases. Following such rules one can produce a set of test cases, although there may be some form of random selection. Such a rule is usually referred to as a test case selection criterion. Using a test case selection criterion, a testing method may be defined constructively in the form of an algorithm which generates a test set from the software under test and its specification. This test set is then considered adequate. It should be noticed that for a given test case selection criterion, there may exist a number of test case generation algorithms. Such an algorithm may also involve random sampling among many adequate test sets.


(2) It can also be in the form of specifying how to decide whether a given test set is adequate or specifying how to measure the adequacy of a test set. A rule that determines whether a test set is adequate (or more generally, how adequate) is usually referred to as a test data adequacy criterion.

However, the fundamental concept underlying both test case selection criteria and test data adequacy criteria is the same, that is, the notion of test adequacy. In many cases they can be easily transformed from one form to another. Mathematically speaking, test case selection criteria are generators, that is, functions that produce a class of test sets from the program under test and the specification (see Definition 1.3). Any test set in this class is adequate, so that we can use any of them equally.¹ Test data adequacy criteria are acceptors, that is, functions from the program under test, the specification of the software, and the test set to a truth value, as defined in Definition 1.1. Generators and acceptors are mathematically equivalent in the sense of one-one correspondence. Hence, we use “test adequacy criteria” to denote both of them.

¹ Test data selection criteria as generators should not be confused with test case generation software tools, which may only generate one test set.

Definition 1.3 (Test Data Adequacy Criteria as Generators [Budd and Angluin 1982]). A test data adequacy criterion C is a function C: P × S → 2^T. A test set t ∈ C(p, s) means that t satisfies C with respect to p and s, and it is said that t is adequate for (p, s) according to C.
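
The one-one correspondence between acceptors and generators can be made concrete as follows. This is our own sketch; it assumes, purely to keep the second direction executable, that the candidate test sets can be enumerated.

    def acceptor_from_generator(G):
        # Definition 1.3 -> Definition 1.1: t is adequate if and only if the
        # generator can produce it for (p, s).
        return lambda p, s, t: t in G(p, s)

    def generator_from_acceptor(C, candidate_test_sets):
        # Definition 1.1 -> Definition 1.3: collect every candidate test set the
        # acceptor judges adequate (the enumeration is an assumption of the sketch).
        return lambda p, s: [t for t in candidate_test_sets if C(p, s, t)]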

The second role that an adequacy criterion plays is to determine the observations that should be made during the testing process. For example, statement coverage requires that the tester, or the testing system, observe whether each statement is executed during the process of software testing. If path coverage is used, then the observation of whether statements have been executed is insufficient; execution paths should be observed and recorded. However, if mutation score is used, it is unnecessary to observe whether a statement is executed during testing. Instead, the output of the original program and the output of the mutants need to be recorded and compared.

Although, given an adequacy criterion, different methods could be developed to generate test sets automatically or to select test cases systematically and efficiently, the main features of a testing method are largely determined by the adequacy criterion. For example, as we show later, the adequacy criterion is related to fault-detecting ability, the dependability of the program that passes a successful test, and the number of test cases required. Unfortunately, the exact relationship between a particular adequacy criterion and the correctness or reliability of the software that passes the test remains unclear.

Due to the central role that adequacy criteria play in software testing, software testing methods are often compared in terms of the underlying adequacy criteria. Therefore, subsequently, we use the name of an adequacy criterion as a synonym of the corresponding testing method when there is no possibility of confusion.

1.2 The Uses of Test Adequacy Criteria

An important issue in the management of software testing is to “ensure that before any testing the objectives of that testing are known and agreed and that the objectives are set in terms that can be measured.” Such objectives “should be quantified, reasonable and achievable” [Ould and Unwin 1986]. Almost all test adequacy criteria proposed in the literature explicitly specify particular requirements on software testing. They are objective rules applicable by project managers for this purpose.

For example, branch coverage is a test requirement that all branches of the program should be exercised. The objective of testing is to satisfy this requirement. The degree to which this objective is achieved can be measured quantitatively by the percentage of branches exercised. The mutation adequacy criterion specifies the testing requirement that a test set should be able to rule out a particular set of software faults, that is, those represented by mutants. Mutation score is another kind of quantitative measurement of test quality.

Test data adequacy criteria are also very helpful tools for software testers. There are two levels of software testing processes. At the lower level, testing is a process where a program is tested by feeding more and more test cases to it. Here, a test adequacy criterion can be used as a stopping rule to decide when this process can stop. Once the measurement of test adequacy indicates that the test objectives have been achieved, no further test cases are needed. Otherwise, when the measurement of test adequacy shows that a test has not achieved the objectives, more tests must be made. In this case, the adequacy criterion also provides a guideline for the selection of the additional test cases. In this way, adequacy criteria help testers to manage the software testing process so that software quality is ensured by performing sufficient tests. At the same time, the cost of testing is controlled by avoiding redundant and unnecessary tests. This role of adequacy criteria has been considered by some computer scientists [Weyuker 1986] to be one of the most important.
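
The lower-level process described above can be pictured as a simple loop that keeps selecting test cases until the adequacy criterion, used as a stopping rule, is satisfied. The sketch below is ours; next_case stands for any criterion-guided selection strategy, and the budget parameter is an assumed safeguard, since an unsatisfiable criterion would otherwise never let the loop stop.

    def test_until_adequate(p, s, adequate, next_case, budget=1000):
        # Keep adding test cases until the stopping rule holds or the budget runs out.
        t = []
        while not adequate(p, s, t) and len(t) < budget:
            t.append(next_case(p, s, t))  # selection guided by the criterion (assumed)
        return t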

At a higher level, the testing procedure can be considered as repeated cycles of testing, debugging, modifying program code, and then testing again. Ideally, this process should stop only when the software has met the reliability requirements. Although test data adequacy criteria do not play the role of stopping rules at this level, they make an important contribution to the assessment of software dependability. Generally speaking, there are two basic aspects of software dependability assessment. One is the dependability estimation itself, such as a reliability figure. The other is the confidence in estimation, such as the confidence or the accuracy of the reliability estimate. The role of test adequacy here is a contributory factor in building confidence in the integrity estimate. Recent research has shown some positive results with respect to this role [Tsoukalas 1993].

Although it is common in current software testing practice that the test processes at both the higher and lower levels stop when money or time runs out, there is a tendency towards the use of systematic testing methods with the application of test adequacy criteria.

1.3 Categories of Test Data Adequacy Criteria

There are various ways to classify adequacy criteria. One of the most common is by the source of information used to specify testing requirements and in the measurement of test adequacy. Hence, an adequacy criterion can be:

—specification-based, which specifies the required testing in terms of identified features of the specification or the requirements of the software, so that a test set is adequate if all the identified features have been fully exercised. In the software testing literature it is fairly common that no distinction is made between specification and requirements. This tradition is followed in this article also;

—program-based, which specifies testing requirements in terms of the program under test and decides if a test set is adequate according to whether the program has been thoroughly exercised.

It should not be forgotten that for both specification-based and program-based testing, the correctness of program outputs must be checked against the specification or the requirements. However, in both cases, the measurement of test adequacy does not depend on the results of this checking. Also, the definition of specification-based criteria given previously does not presume the existence of a formal specification.

It has been widely acknowledged that software testing should use information from both specification and program. Combining these two approaches, we have:

—combined specification- and program-based criteria, which use the ideas of both program-based and specification-based criteria.

There are also test adequacy criteria that specify testing requirements without employing any internal information from the specification or the program. For example, test adequacy can be measured according to the prospective usage of the software by considering whether the test cases cover the data that are most likely to be frequently used as input in the operation of the software. Although few criteria are explicitly proposed in such a way, selecting test cases according to the usage of the software is the idea underlying random testing, or statistical testing. In random testing, test cases are sampled at random according to a probability distribution over the input space. Such a distribution can be the one representing the operation of the software, in which case the random testing is called representative. It can also be any probability distribution, such as a uniform distribution, in which case the random testing is called nonrepresentative. Generally speaking, if a criterion employs only the “interface” information—the type and valid range for the software input—it can be called an interface-based criterion:

—interface-based criteria, which specify testing requirements only in terms of the type and range of software input, without reference to any internal features of the specification or the program.

In the software testing literature, people often talk about white-box testing and black-box testing. Black-box testing treats the program under test as a “black box.” No knowledge about the implementation is assumed. In white-box testing, the tester has access to the details of the program under test and performs the testing according to such details. Therefore, specification-based criteria and interface-based criteria belong to black-box testing. Program-based criteria and combined specification- and program-based criteria belong to white-box testing.

Another classification of test adequacy criteria is by the underlying testing approach. There are three basic approaches to software testing:

(1) structural testing: specifies testing requirements in terms of the coverage of a particular set of elements in the structure of the program or the specification;

(2) fault-based testing: focuses on detecting faults (i.e., defects) in the software. An adequacy criterion of this approach is some measurement of the fault-detecting ability of test sets;²

(3) error-based testing: requires test cases to check the program on certain error-prone points according to our knowledge about how programs typically depart from their specifications.

The source of information used in the adequacy measurement and the underlying approach to testing can be considered as two dimensions of the space of software test adequacy criteria. A software test adequacy criterion can be classified by these two aspects. The review of adequacy criteria is organized according to the structure of this space.

² We use the word fault to denote defects in software and the word error to denote defects in the outputs produced by a program. An execution that produces an error is called a failure.


1.4 Organization of the Article

The remainder of the article consists of two main parts. The first part surveys various types of test data adequacy criteria proposed in the literature. It includes three sections devoted to structural testing, fault-based testing, and error-based testing. Each section consists of several subsections covering the principles of the testing method and their application to program-based and specification-based test criteria. The second part is devoted to the rationale presented in the literature in support of the various criteria. It has two sections. Section 5 discusses the methods of comparing adequacy criteria and surveys the research results in the literature. Section 6 discusses the axiomatic study and assessment of adequacy criteria. Finally, Section 7 concludes the paper.

2. STRUCTURAL TESTING

This section is devoted to adequacy criteria for structural testing. It consists of two subsections, one for program-based criteria and the other for specification-based criteria.

2.1 Program-Based Structural Testing

There are two main groups of program-based structural test adequacy criteria: control-flow criteria and data-flow criteria. These two types of adequacy criteria are combined and extended to give dependence coverage criteria. Most adequacy criteria of these two groups are based on the flow-graph model of program structure. However, a few control-flow criteria define test requirements in terms of program text rather than using an abstract model of software structure.

2.1.1 Control Flow Adequacy Criteria. Before we formally define various control-flow-based adequacy criteria, we first give an introduction to the flow graph model of program structure.

A. The flow graph model of program structure. The control flow graph stems from compiler work and has long been used as a model of program structure. It is widely used in static analysis of software [Fenton et al. 1985; Kosaraju 1974; McCabe 1976; Paige 1975]. It has also been used to define and study program-based structural test adequacy criteria [White 1981]. In this section we give a brief introduction to the flow-graph model of program structure. Although we use graph-theory terminology in the following discussion, readers are required to have only a preliminary knowledge of graph theory. To help understand the terminology and to avoid confusion, a glossary is provided in the Appendix.

A flow graph is a directed graph that consists of a set N of nodes and a set E ⊆ N × N of directed edges between nodes. Each node represents a linear sequence of computations. Each edge, representing a transfer of control, is an ordered pair ⟨n1, n2⟩ of nodes, and is associated with a predicate that represents the condition of control transfer from node n1 to node n2. In a flow graph, there is a begin node and an end node where the computation starts and finishes, respectively. The begin node has no inward edges and the end node has no outward edges. Every node in a flow graph must be on a path from the begin node to the end node. Figure 1 is an example of a flow graph.

Example 2.1 The following program computes the greatest common divisor of two natural numbers by Euclid’s algorithm. Figure 1 is the corresponding flow graph.

begin
  input (x, y);
  while (x > 0 and y > 0) do
    if (x > y)
      then x := x − y
      else y := y − x
    endif
  endwhile;
  output (x + y);
end

It should be noted that in the literature there are a number of conventions of flow-graph models with subtle differences, such as whether a node is allowed to be associated with an empty sequence of statements, the number of outward edges allowed for a node, the number of end nodes allowed in a flow graph, and the like. Although most adequacy criteria can be defined independently of such conventions, using different ones may result in different measures of test adequacy. Moreover, testing tools may be sensitive to such conventions. In this article no restrictions on the conventions are made.

For programs written in a procedural programming language, flow-graph models can be generated automatically. Figure 2 gives the correspondences between some structured statements and their flow-graph structures. Using these rules, a flow graph, shown in Figure 3, can be derived from the program given in Example 2.1. Generally, to construct a flow graph for a given program, the program code is decomposed into a set of disjoint blocks of linear sequences of statements. A block has the property that whenever the first statement of the block is executed, the other statements are executed in the given order. Furthermore, the first statement of the block is the only statement that may be executed directly after the execution of a statement in another block. Each block corresponds to a node in the flow graph. A control transfer from one block to another is represented by a directed edge between the nodes, with the condition of the control transfer associated with it.
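
The flow-graph model just described can be represented directly as a small data structure. The Python sketch below is ours and makes some simplifying assumptions: nodes are named by strings, each node stores its block of statements as text, and each edge stores the predicate guarding the control transfer. The node and edge collections it exposes are reused by the coverage checks sketched later in this section.

    from dataclasses import dataclass, field

    @dataclass
    class FlowGraph:
        # nodes maps a name to the linear sequence of statements in the block;
        # edges maps an ordered pair (n1, n2) to the predicate of the transfer.
        nodes: dict = field(default_factory=dict)
        edges: dict = field(default_factory=dict)
        begin: str = "begin"
        end: str = "end"

        def add_node(self, name, statements=""):
            self.nodes[name] = statements

        def add_edge(self, n1, n2, predicate="true"):
            self.edges[(n1, n2)] = predicate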

B. Control-flow adequacy criteria. Now, given a flow-graph model of a program and a set of test cases, how do we measure the adequacy of testing for the program on the test set? First of all, recall that the execution of the program on an input datum is modeled as a traversal of the flow graph. Every execution corresponds to a path in the flow graph from the begin node to the end node. Such a path is called a complete computation path, or simply a computation path or an execution path in the software testing literature.

A very basic requirement of adequate testing is that all the statements in the program are covered by test executions. This is usually called statement coverage [Hetzel 1984]. But full statement coverage cannot always be achieved because of the possible existence of infeasible statements, that is, dead code. Whether a piece of code is dead code is undecidable [Weyuker 1979a; Weyuker 1979b; White 1981]. Because statements correspond to nodes in flow-graph models, this criterion can be defined in terms of flow graphs, as follows.

Figure 1. Flow graph for program in Example 2.1.

Definition 2.1 (Statement Coverage Criterion). A set P of execution paths satisfies the statement coverage criterion if and only if for all nodes n in the flow graph, there is at least one path p in P such that node n is on the path p.

Notice that statement coverage is so weak that even some control transfers may be missed from an adequate test. Hence, we have a slightly stronger requirement of adequate testing, called branch coverage [Hetzel 1984], that all control transfers must be checked. Since control transfers correspond to edges in flow graphs, the branch coverage criterion can be defined as the coverage of all edges in the flow graph.

Definition 2.2 (Branch Coverage Criterion). A set P of execution paths satisfies the branch coverage criterion if and only if for all edges e in the flow graph, there is at least one path p in P such that p contains the edge e.
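
Definitions 2.1 and 2.2 translate directly into executable checks over a set of executed paths, each path represented as a sequence of node names (for instance, taken from the FlowGraph sketch above). The Python sketch and its function names are ours; the ratio variant mirrors the informal percentage measurements mentioned in Section 1.1.

    def statement_coverage(paths, nodes):
        # Definition 2.1: every node lies on at least one executed path.
        covered = {n for p in paths for n in p}
        return set(nodes) <= covered

    def branch_coverage(paths, edges):
        # Definition 2.2: every edge is contained in at least one executed path.
        covered = {(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)}
        return set(edges) <= covered

    def coverage_ratio(covered, required):
        # Degree of adequacy as the fraction of required elements exercised.
        required = set(required)
        return len(required & set(covered)) / len(required) if required else 1.0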

Branch coverage is stronger than statement coverage because if all edges in a flow graph are covered, all nodes are necessarily covered. Therefore, a test set that satisfies the branch coverage criterion must also satisfy statement coverage. Such a relationship between adequacy criteria is called the subsumes relation. It is of interest in the comparison of software test adequacy criteria (see details in Section 5.1.3).

Figure 2. Example flow graphs for structured statements.

Figure 3. Flow graph for Example 2.1.

However, even if all branches are exercised, this does not mean that all combinations of control transfers are checked. The requirement of checking all combinations of branches is usually called path coverage or path testing, which can be defined as follows.

Definition 2.3 (Path Coverage Criterion). A set P of execution paths satisfies the path coverage criterion if and only if P contains all execution paths from the begin node to the end node in the flow graph.

Although the path coverage criterion still cannot guarantee the correctness of a tested program, it is too strong to be practically useful for most programs, because there can be an infinite number of different paths in a program with loops. In such a case, an infinite set of test data must be executed for adequate testing. This means that the testing cannot finish in a finite period of time. But, in practice, software testing must be fulfilled within a limited fixed period of time. Therefore, a test set must be finite. The requirement that an adequacy criterion can always be satisfied by a finite test set is called finite applicability [Zhu and Hall 1993] (see Section 6).

The statement coverage criterion and branch coverage criterion are not finitely applicable either, because they require testing to cover infeasible elements. For instance, statement coverage requires that all the statements in a program are executed. However, a program may have infeasible statements, that is, dead code, so that no input data can cause their execution. Therefore, in such cases, there is no adequate test set that can satisfy statement coverage. Similarly, branch coverage is not finitely applicable because a program may contain infeasible branches. However, for statement coverage and also branch coverage, we can define a finitely applicable version of the criterion by requiring testing only to cover the feasible elements. Most program-based adequacy criteria in the literature are not finitely applicable, but finitely applicable versions can often be obtained by redefinition in this way. Subsequently, such a version is called the feasible version of the adequacy criterion. It should be noted, first, that although we can often obtain finite applicability by using the feasible version, this may cause the undecidability problem; that is, we may not be able to decide whether a test set satisfies a given adequacy criterion. For example, whether a statement in a program is feasible is undecidable [Weyuker 1979a; Weyuker 1979b; White 1991]. Therefore, when a test set does not cover all the statements in a program, we may not be able to decide whether a statement not covered by the test data is dead code. Hence, we may not be able to decide if the test set satisfies the feasible version of statement coverage. Second, for some adequacy criteria, such as path coverage, we cannot obtain finite applicability by such a redefinition.

Recall that the rationale for path coverage is that there is no path that does not need to be checked by testing, while finite applicability forces us to select a finite subset of paths. Thus, research into flow-graph-based adequacy criteria has focused on the selection of the most important subsets of paths. Probably the most straightforward solution to the conflict is to select paths that contain no redundant information. Hence, two notions from graph theory can be used. First, a path that has no repeated occurrence of any edge is called a simple path in graph theory. Second, a path that has no repeated occurrences of any node is called an elementary path. Thus, it is possible to define simple path coverage and elementary path coverage criteria, which require that adequate test sets should cover all simple paths and elementary paths, respectively.

These two criteria are typical ones that select finite subsets of paths by specifying restrictions on the complexity of the individual paths. Another example of this type is the length-n path coverage criterion, which requires coverage of all subpaths of length less than or equal to n [Gourlay 1983]. A more complicated example of the type is Paige’s level-i path coverage criterion [Paige 1978; Paige 1975]. Informally, the criterion starts with testing all elementary paths from the begin node to the end node. Then, if there is an elementary subpath or cycle that has not been exercised, the subpath is required to be checked at the next level. This process is repeated until all nodes and edges are covered by testing. Obviously, a test set that satisfies the level-i path coverage criterion must also satisfy the elementary path coverage criterion, because elementary paths are level-0 paths.

A set of control-flow adequacy criteria that are concerned with testing loops is the loop count criteria, which date back to the mid-1970s [Bently et al. 1993]. For any given natural number K, the loop count-K criterion requires that every loop in the program under test should be executed zero times, once, twice, and so on, up to K times [Howden 1975]. Another control-flow criterion concerned with testing loops is the cycle combination criterion, which requires that an adequate test set should cover all execution paths that do not contain a cycle more than once.

An alternative approach to defining control-flow adequacy criteria is to specify restrictions on the redundancy among the paths. McCabe’s cyclomatic measurement is such an example [McCabe 1976; McCabe 1983; McCabe and Schulmeyer 1985]. It is based on the theorem of graph theory that for any flow graph there is a set of execution paths such that every execution path can be expressed as a linear combination of them. A set of paths is independent if none of them is a linear combination of the others. According to McCabe, a path should be tested if it is independent of the paths that have been tested. On the other hand, if a path is a linear combination of tested paths, it can be considered redundant. According to graph theory, the maximal size of a set of independent paths is unique for any given graph, is called the cyclomatic number, and can be calculated by the following formula:

v(G) = e − n + p,

where v(G) denotes the cyclomatic number of the graph G, n is the number of vertices in G, e is the number of edges, and p is the number of strongly connected components.³ The adequacy criterion is then defined as follows.

³ A graph is strongly connected if, for any two nodes a and b, there exists a path from a to b and a path from b to a. Strongly connected components are maximal strongly connected subgraphs.

Definition 2.4 (Cyclomatic-Number Criterion). A set P of execution paths satisfies the cyclomatic number criterion if and only if P contains at least one set of v independent paths, where v = e − n + p is the cyclomatic number of the flow graph.
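
The cyclomatic number can be computed from the node and edge collections alone. The sketch below is ours, not McCabe's algorithm for generating independent paths: it counts strongly connected components with Kosaraju's two-pass depth-first search and then applies the formula v(G) = e − n + p given above; the input conventions (collections of node names and of directed edges) are an assumption.

    def cyclomatic_number(nodes, edges):
        # v(G) = e - n + p, where p is the number of strongly connected components.
        succ = {n: [] for n in nodes}
        pred = {n: [] for n in nodes}
        for a, b in edges:
            succ[a].append(b)
            pred[b].append(a)

        seen, order = set(), []

        def dfs(start, adjacency, postorder):
            # Iterative depth-first search; records nodes in postorder if asked.
            stack = [(start, iter(adjacency[start]))]
            seen.add(start)
            while stack:
                node, successors = stack[-1]
                for nxt in successors:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append((nxt, iter(adjacency[nxt])))
                        break
                else:
                    stack.pop()
                    if postorder is not None:
                        postorder.append(node)

        for n in nodes:                      # first pass: finishing order
            if n not in seen:
                dfs(n, succ, order)

        seen, p = set(), 0
        for n in reversed(order):            # second pass on the reversed graph
            if n not in seen:
                dfs(n, pred, None)
                p += 1
        return len(edges) - len(nodes) + p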

McCabe also gave an algorithm to generate a set of independent paths from any given flow graph [McCabe 1976]. Paige [1978] has shown that the level-i path coverage criterion subsumes McCabe’s cyclomatic number criterion.

The preceding control-flow test adequacy criteria are all defined in terms of flow-graph models of program structure, except the loop count criteria, which are defined in terms of program text. A number of other test adequacy criteria are also based on the text of the program. One of the most popular criteria in commercial software testing practice is the so-called multiple condition coverage discussed in Myers’ [1979] classic book. The criterion focuses on the conditions of control transfers, such as the condition in an IF-statement or a WHILE-LOOP statement. A test set is said to satisfy the decision coverage criterion if for every condition there is at least one test case such that the condition has value true when evaluated, and there is also at least one test case such that the condition has value false. In a high-level programming language, a condition can be a Boolean expression consisting of several atomic predicates combined by logic connectives like and, or, and not. A test set satisfies the condition coverage criterion if for every atomic predicate there is at least one test case such that the predicate has value true when evaluated, and there is also at least one test case such that the predicate has value false. Although the result of an evaluation of a Boolean expression can only be one of two possibilities, true or false, the result may be due to different combinations of the truth values of the atomic predicates. The multiple condition coverage criterion requires that a test set should cover all possible combinations of the truth values of atomic predicates in every condition. This criterion is sometimes also called extended branch coverage in the literature. Formally:

Definition 2.5 (Multiple Condition Coverage). A test set T is said to be adequate according to the multiple-condition-coverage criterion if, for every condition C, which consists of atomic predicates (p1, p2, . . . , pn), and all the possible combinations (b1, b2, . . . , bn) of their truth values, there is at least one test case in T such that the value of pi equals bi, i = 1, 2, . . . , n.
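
Checking this criterion for one condition amounts to comparing the observed combinations of atomic-predicate values against all 2^n possibilities. The sketch below is ours; truth_values stands for an assumed instrumentation hook that reports, for a test case, the tuple of values taken by the atomic predicates p1, . . . , pn when the condition is evaluated.

    from itertools import product

    def multiple_condition_coverage(test_set, truth_values):
        # Definition 2.5 for a single condition: every combination of truth values
        # of its atomic predicates is produced by at least one test case.
        observed = {tuple(truth_values(t)) for t in test_set}
        if not observed:
            return False
        n = len(next(iter(observed)))
        return observed >= set(product((True, False), repeat=n))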

Woodward et al. [1980] proposed and studied a hierarchy of program-text-based test data adequacy criteria based on a class of program units called linear code sequence and jump (LCSAJ). These criteria are usually referred to as test effectiveness metrics in the literature. An LCSAJ consists of a body of code through which the flow of control may proceed sequentially and which is terminated by a jump in the control flow. The hierarchy TERi, i = 1, 2, . . . , of criteria starts with statement coverage as the lowest level, followed by branch coverage as the next lowest level. They are denoted by TER1 and TER2, respectively, where TER represents test effectiveness ratio. The coverage of LCSAJs is the third level, which is defined as TER3. The hierarchy is then extended to the coverage of program paths containing a number of LCSAJs.

Definition 2.6 (TER3: LCSAJ Coverage)

TER3 = (number of LCSAJs exercised at least once) / (total number of LCSAJs).

Generally speaking, an advantage of text-based adequacy criteria is that test adequacy can be easily calculated from the part of the program text executed during testing. However, their definitions may be sensitive to language details. For programs written in a structured programming language, the application of TERn for n greater than or equal to 3 requires analysis and reformatting of the program structure. In such cases, the connection between program text and its test adequacy becomes less straightforward. In fact, it is observed in software testing practice that a small modification to the program may result in a considerably different set of linear code sequences and jumps.

It should be noted that none of the adequacy criteria discussed in this section is finitely applicable, due to the possible existence of infeasible elements in a program, such as infeasible statements, infeasible branches, infeasible combinations of conditions, and the like. These criteria, except path coverage, can be redefined to obtain finite applicability by requiring only the coverage of feasible elements.

2.1.2 Data-Flow-Based Test Data Adequacy Criteria. In the previous section, we have seen how control-flow information in the program under test is used to specify testing requirements. In this section, data-flow information is taken into account in the definition of testing requirements. We first introduce the way that data-flow information is added to the flow-graph models of program structures. Then, three basic groups of data-flow adequacy criteria are reviewed. Finally, their limitations and extensions are discussed.

A. Data-flow information in the flow graph. Data-flow analysis of test adequacy is concerned with the coverage of flow-graph paths that are significant for the data flow in the program. Therefore, data-flow information is introduced into the flow-graph models of program structures.

Data-flow testing methods are based on the investigation of the ways in which values are associated with variables and how these associations can affect the execution of the program. This analysis focuses on the occurrences of variables within the program. Each variable occurrence is classified as either a definition occurrence or a use occurrence. A definition occurrence of a variable is where a value is bound to the variable. A use occurrence of a variable is where the value of the variable is referred to. Each use occurrence is further classified as being a computational use or a predicate use. If the value of a variable is used to decide whether a predicate is true for selecting execution paths, the occurrence is called a predicate use. Otherwise, it is used to compute a value for defining other variables or as an output value. It is then called a computational use. For example, the assignment statement “y := x1 + x2” contains computational uses of x1 and x2 and a definition of y. The statement “if x1 < x2 then goto L endif” contains predicate uses of x1 and x2.

Since we are interested in tracing the flow of data between nodes, any definition that is used only within the node in which the definition occurs is of little importance. Therefore a distinction is made between local computational uses and global computational uses. A global computational use of a variable x is where no definition of x precedes the computational use within the node in which it occurs. That is, the value must have been bound to x in some node other than the one in which it is being used. Otherwise it is a local computational use.

Data-flow test adequacy analysis is concerned with subpaths from definitions to nodes where those definitions are used. A definition-clear path with respect to a variable x is a path such that for all nodes in the path there is no definition occurrence of the variable x. A definition occurrence of a variable x at a node u reaches a computational use occurrence of the variable at node v if and only if there is a path p from u to v such that p = (u, w1, w2, . . . , wn, v), (w1, w2, . . . , wn) is definition-clear with respect to x, and the occurrence of x at v is a global computational use. We say that the definition of x at u reaches the computational occurrence of x at v through the path p. Similarly, if there is a path p = (u, w1, w2, . . . , wn, v) from u to v, (w1, w2, . . . , wn) is definition-clear with respect to x, and there is a predicate occurrence of x associated with the edge from wn to v, we say that u reaches the predicate use of x on the edge (wn, v) through the path p. If a path in one of the preceding definitions is feasible, that is, there is at least one input datum that can actually cause the execution of the path, we say that the definition feasibly reaches the use of the definition.
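
The reaching relations just defined can be phrased as small predicates over a path, here represented as a sequence of node names. The sketch is ours; defs and c_uses are assumed maps from a node to the set of variables it defines and the set of variables it globally c-uses, respectively.

    def definition_clear(path, defs, x):
        # A (sub)path is definition-clear with respect to x if no node on it
        # contains a definition occurrence of x.
        return all(x not in defs[n] for n in path)

    def def_reaches_use(path, defs, c_uses, x):
        # The definition of x at path[0] reaches the global computational use of x
        # at path[-1] through this path if the interior nodes are definition-clear
        # with respect to x and the last node contains a global c-use of x.
        return (len(path) >= 2
                and x in defs[path[0]]
                and definition_clear(path[1:-1], defs, x)
                and x in c_uses[path[-1]])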

Three groups of data-flow adequacy criteria have been proposed in the literature, and are discussed in the following.

B. Simple definition-use association coverage—the Rapps-Weyuker-Frankl family. Rapps and Weyuker [1985] proposed a family of testing adequacy criteria based on data-flow information. Their criteria are concerned mainly with the simplest type of data-flow paths that start with a definition of a variable and end with a use of the same variable. Frankl and Weyuker [1988] later reexamined the data-flow adequacy criteria and found that the original definitions of the criteria did not satisfy the applicability condition. They redefined the criteria to be applicable. The following definitions come from the modified definitions.

The all-definitions criterion requires that an adequate test set should cover all definition occurrences in the sense that, for each definition occurrence, the testing paths should cover a path through which the definition reaches a use of the definition.

Definition 2.7 (All Definitions Criterion). A set P of execution paths satisfies the all-definitions criterion if and only if for all definition occurrences of a variable x such that there is a use of x which is feasibly reachable from the definition, there is at least one path p in P such that p includes a subpath through which the definition of x reaches some use occurrence of x.

Since one definition occurrence of a variable may reach more than one use occurrence, the all-uses criterion requires that all of the uses should be exercised by testing. Obviously, this requirement is stronger than the all-definitions criterion.

Definition 2.8 (All Uses Criterion). A set P of execution paths satisfies the all-uses criterion if and only if for all definition occurrences of a variable x and all use occurrences of x that the definition feasibly reaches, there is at least one path p in P such that p includes a subpath through which that definition reaches the use.
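
The all-uses criterion can be checked over a set of executed paths once the def-use pairs of interest are known. The sketch below is ours; du_pairs is an assumed list of triples (definition node, use node, variable), restricted to uses that are feasibly reachable from the definition, as the definition requires.

    def covers_du_pair(path, defs, d_node, u_node, x):
        # True if path contains a subpath from d_node to u_node through which the
        # definition of x at d_node reaches the use at u_node: the interior of the
        # subpath must be definition-clear with respect to x.
        for i, n in enumerate(path):
            if n != d_node:
                continue
            for j in range(i + 1, len(path)):
                if path[j] == u_node and all(x not in defs[w] for w in path[i + 1:j]):
                    return True
                if x in defs[path[j]]:   # x is redefined before reaching the use
                    break
        return False

    def all_uses(paths, du_pairs, defs):
        # Definition 2.8: every def-use pair is covered by at least one executed path.
        return all(any(covers_du_pair(p, defs, d, u, x) for p in paths)
                   for (d, u, x) in du_pairs)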

The all-uses criterion was also proposed by Herman [1976], who called it the reach-coverage criterion. As discussed at the beginning of the section, use occurrences are classified into computational use occurrences and predicate use occurrences. Hence, emphasis can be put either on computational uses or on predicate uses. Rapps and Weyuker [1985] identified four adequacy criteria of different strengths and emphasis. The all-c-uses/some-p-uses criterion requires that all of the computational uses are exercised, but it also requires that at least one predicate use should be exercised when there is no computational use of the variable. In contrast, the all-p-uses/some-c-uses criterion puts emphasis on predicate uses by requiring that test sets should exercise all predicate uses and exercise at least one computational use when there is no predicate use. Two even weaker criteria were also defined. The all-predicate-uses criterion completely ignores the computational uses and requires that only predicate uses need to be tested. The all-computation-uses criterion only requires that computational uses should be tested and ignores the predicate uses.

Notice that, given a definition occurrence of a variable x and a use of the variable x that is reachable from that definition, there may exist many paths through which the definition reaches the use. A weakness of the preceding criteria is that they require only one of such paths to be exercised by testing. However, the applicability problem arises if all such paths are to be exercised, because there may exist an infinite number of such paths in a flow graph. For example, consider the flow graph in Figure 1: the definition of y at node a1 reaches the use of y at node a3 through all the paths of the form

(a1, a2) ∧ (a2, a2)^n ∧ (a2, a3), n ≥ 1,

where ∧ is the concatenation of paths and p^n is the concatenation of p with itself n times, defined inductively by p^1 = p and p^k = p ∧ p^(k−1) for all k > 1. To obtain finite applicability, Frankl and Weyuker [1988] and Clarke et al. [1989] restricted the paths to be cycle-free or such that only the end node of the path is the same as the start node.

Definition 2.9 (All Definition-Use-Paths Criterion; abbr. All DU-Paths Criterion). A set P of execution paths satisfies the all-du-paths criterion if and only if for all definitions of a variable x and all paths q through which that definition reaches a use of x, there is at least one path p in P such that q is a subpath of p, and q is cycle-free or contains only simple cycles.

However, even with this restriction, it is still not applicable, since such a path may be infeasible.

C. Interactions between variables—the Ntafos required k-tuples criteria. Ntafos [1984] also used data-flow information to analyze test data adequacy. He studied how the values of different variables interact, and defined a family of adequacy criteria called required k-tuples, where k > 1 is a natural number. These criteria require that a path set cover the chains of alternating definitions and uses, called definition-reference interactions (abbr. k–dr interactions) in Ntafos’ terminology. Each definition in a k–dr interaction reaches the next use in the chain, which occurs at the same node as the next definition in the chain. Formally:

Definition 2.10 (k–dr interaction). For k > 1, a k–dr interaction is a sequence K = [d1(x1), u1(x1), d2(x2), u2(x2), . . . , dk(xk), uk(xk)] where

(i) di(xi), 1 ≤ i < k, is a definition occurrence of the variable xi;

(ii) ui(xi), 1 ≤ i < k, is a use occurrence of the variable xi;

(iii) the use ui(xi) and the definition di+1(xi+1) are associated with the same node ni+1;

(iv) for all i, 1 ≤ i < k, the ith definition di(xi) reaches the ith use ui(xi).

Note that the variables x1, x2, . . . , xk and the nodes n1, n2, . . . , nk need not be distinct. This definition comes from Ntafos’ [1988] later work. It is different from the original definition, where the nodes are required to be distinct [Ntafos 1984]. The same modification was also made by Clarke et al. [1989] in their formal analysis of data-flow adequacy criteria.

An interaction path for a k–dr interaction is a path p = (n1) * p1 * (n2) * . . . * (nk−1) * pk−1 * (nk) such that for all i = 1, 2, . . . , k − 1, di(xi) reaches ui(xi) through pi. The required k-tuples criterion then requires that all k–dr interactions are tested.

Definition 2.11 (Required k-Tuples Criteria). A set P of execution paths satisfies the required k-tuples criterion, k > 1, if and only if for all j–dr interactions L, 1 < j ≤ k, there is at least one path p in P such that p includes a subpath which is an interaction path for L.

Example 2.2 Consider the flow graph in Figure 2. The following are 3–dr interaction paths:

(a1, a3, a2, a4) for the 3–dr interaction [d1(x), u1(x), d2(y), u2(y), d3(x), u3(x)]; and

(a1, a2, a3, a4) for the 3–dr interaction [d1(y), u1(y), d2(x), u2(x), d3(y), u3(y)].

D. Combinations of definitions—the Laski-Korel criteria. Laski and Korel [1983] defined and studied another kind of testing path selection criteria based on data-flow analysis. They observed that a given node may contain uses of several different variables, where each use may be reached by several definitions occurring at different nodes. Such definitions constitute the context of the computation at the node. Therefore, they are concerned with exploring such contexts for each node by selecting paths along which the various combinations of definitions reach the node.

Definition 2.12 (Ordered Context). Let n be a node in the flow graph. Suppose that there are uses of the variables x1, x2, . . . , xm at the node n.⁴ Let [n1, n2, . . . , nm] be a sequence of nodes such that for all i = 1, 2, . . . , m, there is a definition of xi at node ni, and the definition of xi reaches the node n with respect to xi. A path p = p1 * (n1) * p2 * (n2) * . . . * pm * (nm) * pm+1 * (n) is called an ordered context path for the node n with respect to the sequence [n1, n2, . . . , nm] if and only if for all i = 2, 3, . . . , m, the subpath pi * (ni) * pi+1 * . . . * pm+1 is definition-clear with respect to xi−1. In this case, we say that the sequence [n1, n2, . . . , nm] of nodes is an ordered context for n.

⁴ This assumption comes from Clarke et al. [1989]. The original definition given by Laski and Korel [1983] defines a context to be formed from all variables having a definition that reaches the node.

Example 2.2. Consider the flow graph in Figure 1. There are uses of the two variables x and y at node a4. The node sequences [a1, a2], [a1, a3], [a2, a3], and [a3, a2] are ordered contexts for node a4. The paths (a1, a2, a4), (a1, a3, a4), (a2, a3, a4), and (a3, a2, a4) are the ordered context paths for them, respectively.
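As an illustration (ours, with assumed def sets loosely modeled on the example; it reuses the is_def_clear helper from the sketch above), the check of Definition 2.12 can be phrased directly over a path:

def is_ordered_context_path(path, n, ordered_ctx, vars_, defs):
    """ordered_ctx = [n1, ..., nm] and vars_ = [x1, ..., xm]: node n_i holds the
    definition of x_i meant to reach n.  The path must pass n1, ..., nm, n in that
    order and, from each n_{i-1} onwards, stay definition-clear for x_{i-1}."""
    positions, start = [], 0
    for node in list(ordered_ctx) + [n]:
        try:
            start = path.index(node, start)
        except ValueError:
            return False
        positions.append(start)
        start += 1
    end = positions[-1]                        # position of n on the path
    return all(is_def_clear(path, positions[i - 1], end, vars_[i - 1], defs)
               for i in range(1, len(ordered_ctx)))

# Ordered context [a1, a2] for node a4 (a1 supplies y, a2 supplies x):
defs = {"a1": {"x", "y"}, "a2": {"x"}, "a3": {"y"}, "a4": set()}
print(is_ordered_context_path(["a1", "a2", "a4"], "a4",
                              ["a1", "a2"], ["y", "x"], defs))   # True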

The ordered-context coverage criterion requires that an adequate test set should cover all ordered contexts for every node.

Definition 2.13 (Ordered-Context Coverage Criterion). A set P of execution paths satisfies the ordered-context coverage criterion if and only if for all nodes n and all ordered contexts c for n, there is at least one path p in P such that p contains a subpath which is an ordered context path for n with respect to c.

Given a node n, let {x1, x2, . . . , xm} be a nonempty subset of the variables used at the node n, and let the nodes ni, i = 1, 2, . . . , m, have definition occurrences of the variables xi that reach the node n. If there is a permutation s of these nodes which is an ordered context for n, then we say that the set {n1, n2, . . . , nm} is a context for n, and an ordered context path for n with respect to s is also called a definition context path for n with respect to the context {n1, n2, . . . , nm}. Ignoring the ordering between the nodes, a slightly weaker criterion, called the context-coverage criterion, requires that all contexts for all nodes are covered.

Definition 2.14 (Context-Coverage Criterion). A set P of execution paths satisfies the context coverage criterion if and only if for all nodes n and for all contexts for n, there is at least one path p in P such that p contains a subpath which is a definition context path for n with respect to the context.

E. Data-flow testing for structured data and dynamic data. The data-flow testing methods discussed so far have a number of limitations. First, they make no distinction between atomic data such as integers and structured or aggregate data such as arrays and records. Modifications and references to an element of a structured datum are regarded as modifications and references to the whole datum. It was argued that treating structured data, such as arrays, as aggregate values may lead to two types of mistakes [Hamlet et al. 1993]. A commission mistake may happen when a definition-use path is identified but is not present for any array element. An omission mistake may happen when a path is missed because of a false intermediate assignment. Such mistakes occur frequently even in small programs [Hamlet et al. 1993]. Treating elements of structured data as independent data can correct the mistakes. Such an extension seems to add no complexity when the references to the elements of structured data are static, such as the fields of records. However, treating arrays element-by-element may introduce a potential infinity of definition-use paths to be tested. Moreover, theoretically speaking, whether two references to array elements are references to the same element is undecidable. Hamlet et al. [1993] proposed a partial solution to this problem by using symbolic execution and a symbolic equation solver to determine whether two occurrences of array elements can be occurrences of the same element.

The second limitation of the data-flow testing methods discussed is that dynamic data were not taken into account. One of the difficulties in the data-flow analysis of dynamic data, such as data referred to by pointers, is that a pointer variable may actually refer to a number of storage locations. On the other hand, a storage location may have a number of references to it; that is, aliases exist. Therefore, for a given variable V, a node contains a definite definition of the variable if a new value is definitely bound to the variable at the node; it has a possible definition at a node n if it is possible that a new value is bound to it at the node. Similarly, a path may be definitely definition-clear or possibly definition-clear with respect to a variable. Ostrand and Weyuker [1991] extended the definition-use association relation on the occurrences of variables to a hierarchy of relations. A definition-use association is strong if there is a definite definition of a variable and a definite use of the variable and every definition-clear path from the definition to the use is definitely definition-clear with respect to the variable. The association is firm if both the definition and the use are definite and there is at least one path from the definition to the use that is definitely definition-clear. The association is weak if both the definition and the use are definite, but there is no path from the definition to the use which is definitely definition-clear. An association is very weak if the definition or the use or both are possible instead of definite.

F. Interprocedural data-flow testing. The data-flow testing methods discussed so far have also been restricted to testing the data dependence existing within a program unit, such as a procedure. As current trends in programming encourage a high degree of modularity, the number of procedure calls and returns executed in a module continues to grow. This mandates the efficient testing of the interaction between procedures. The basic idea of interprocedural data-flow testing is to test the data dependence across procedure interfaces. Harrold and Soffa [1990; 1991] identified two types of interprocedural data dependences in a program: direct data dependence and indirect data dependence. A direct data dependence is a definition-use association whose definition occurs in procedure P and use occurs in a directly called procedure Q of P. Such a dependence exists when (1) a definition of an actual parameter in one procedure reaches a use of the corresponding formal parameter at a call site (i.e., a procedure call); (2) a definition of a formal parameter in a called procedure reaches a use of the corresponding actual parameter at a return site (i.e., a procedure return); or (3) a definition of a global variable reaches a call or return site. An indirect data dependence is a definition-use association whose definition occurs in procedure P and use occurs in an indirectly called procedure Q of P. Conditions for indirect data dependence are similar to those for direct data dependence, except that multiple levels of procedure calls and returns are considered. Indirect data dependence can be determined by considering the possible uses of definitions along the calling sequences. When a formal parameter is passed as an actual parameter at a call site, an indirect data dependence may exist. Given this data dependence information, the data-flow test adequacy criteria can be easily extended for interprocedural data-flow testing. Harrold and Soffa [1990] proposed an algorithm for computing the interprocedural data dependences and developed a tool to support interprocedural data-flow testing.
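As a small illustration (ours; the procedures are hypothetical, written here in Python), the fragment below contains a direct data dependence of type (1), since the definition of the actual parameter a in P reaches the use of the formal parameter x in Q, and one of type (3), since the definition of the global g in Q reaches its use after the return site in P:

g = 0                      # global variable

def Q(x):
    global g
    g = x + 1              # definition of the global g inside the callee
    return x * 2           # use of the formal parameter x: together with the
                           # definition of a below, a type (1) direct dependence

def P():
    a = 41                 # definition of the actual parameter a
    r = Q(a)               # call site: a's value flows into the formal x
    return r + g           # use of g after the return: a type (3) dependence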

Based on Harrold and Soffa's work, Ural and Yang [1988; 1993] extended the flow-graph model for accurate representation of interprocedural data-flow information. Pande et al. [1991] proposed a polynomial-time algorithm for determining interprocedural definition-use associations, including dynamic data of single-level pointers, for C programs.

2.1.3 Dependence Coverage Criterion—an Extension and Combination of Data-Flow and Control-Flow Testing. An extension of data-flow testing methods was made by Podgurski and Clarke [1989; 1990] by generalizing control and data dependence. Informally, a statement s is semantically dependent on a statement s′ if the function computed by s′ affects the execution behavior of s. Podgurski and Clarke then proposed a necessary condition of semantic dependence called weak syntactic dependence as a generalization of data dependence. There is a weak syntactic dependence between two statements if there is a chain of data flow and a weak control dependence between the statements, where a statement u is weakly control-dependent on statement v if v has successors v′ and v″ such that if the branch from v to v′ is executed then u is necessarily executed within a fixed number of steps, whereas if the branch from v to v″ is taken then u can be bypassed or its execution can be delayed indefinitely. Podgurski and Clarke also defined the notion of strong syntactic dependence: there is a strong syntactic dependence between two statements if there is a chain of data flow and a strong control dependence between the statements. Roughly speaking, a statement u is strongly control-dependent on statement v if v has two successors v′ and v″ such that execution through the branch from v to v′ may result in the execution of u, but u may be bypassed when the branch from v to v″ is taken. Podgurski and Clarke proved that strong syntactic dependence is not a necessary condition of semantic dependence.

When the definition-use association relation is replaced with various dependence relations, various dependence-coverage criteria can be obtained as extensions to the data-flow test adequacy criteria. Such criteria make more use of semantic information contained in the program under test. Furthermore, these dependence relations can be efficiently calculated.

2.2 Specification-Based Structural Testing

There are two main roles a specification can play in software testing [Richardson et al. 1992]. The first is to provide the necessary information to check whether the output of the program is correct [Podgurski and Clarke 1989; 1990]. Checking the correctness of program outputs is known as the oracle problem. The second is to provide information to select test cases and to measure test adequacy. As the purpose of this article is to study test adequacy criteria, we focus on the second use of specifications.

Like programs, a specification has two facets, syntactic structure and semantics. Both of them can be used to select test cases and to measure test adequacy. This section is concerned with the syntactic structure of a specification.

A specification specifies the properties that the software must satisfy. Given a particular instance of the software's input and its corresponding output, to check whether the instance of the software behavior satisfies these properties we must evaluate the specification by substituting the instance of input and output into the input and output variables in the specification, respectively. Although this evaluation process may take various forms, depending on the type of the specification, the basic idea behind the approach is to consider a particular set of elements or components in the specification and to calculate the proportion of such elements or components involved in the evaluation.

There are two major approaches to formal software functional specifications: model-based specifications, and property-oriented specifications such as axiomatic or algebraic specifications. The following discussion is based on these types of specifications.


2.2.1 Coverage of Model-Based Formal Functional Specifications. When a specification is model-based, such as those written in Z and VDM, it has two parts. The first describes the state space of the software, and the second part specifies the required operations on the space. The state space of the software system can be defined as a set of typed variables with a predicate to describe the invariant property of the state space. The operations are functions mapping from input data and the state before the operation to the output data and the state after the operation. Such operations can be specified by a set of predicates that give the precondition, that is, the condition on the input data and the state before the operation, and postconditions that specify the relationship between the input data, output data, and the states before and after the operation.

The evaluation of model-based formal functional specifications is fairly similar to the evaluation of a Boolean expression in an imperative programming language. When input and output variables in the expression are replaced with an instance of input data and program outputs, each atomic predicate must be either true or false. If the result of the evaluation of the whole specification is true, then the correctness of the software on that input is confirmed. Otherwise, a program error is found. However, the same truth value of a specification on two instances of input/output may be due to different combinations of the truth values of the atomic predicates. Therefore it is natural to require that an adequate test cover a certain subset of feasible combinations of the predicates. Here a feasible combination means that the combination can be satisfied; that is, there is an assignment of values to the input and output variables such that the atomic predicates take their corresponding values in the predicate combination. In the case where the specification contains nondeterminism, the program may be less nondeterministic than the specification. That is, some of the choices of output allowed by the specification may not be implemented by the program. This may not be considered a program error, but it may result in infeasible combinations.
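A minimal sketch of this idea (ours; the atomic predicates, the operation, and the test data are all assumed for illustration) records which truth-value combinations of the atomic predicates a test set has exercised:

from itertools import product

# Assumed atomic predicates of a post-condition "p1 or p2" over (input, output):
atomic = {
    "p1": lambda inp, out: out >= inp,
    "p2": lambda inp, out: out == 0,
}

def combination(inp, out):
    """Truth-value vector of the atomic predicates for one observed pair."""
    return tuple(atomic[name](inp, out) for name in sorted(atomic))

observed = [(1, 2), (3, 0), (-1, -5)]            # (input, output) pairs from tests
covered = {combination(i, o) for i, o in observed}
all_combinations = set(product([False, True], repeat=len(atomic)))

# In general each combination's feasibility has to be argued separately; here we
# simply report coverage against all 2^n combinations.
print(f"covered {len(covered)} of {len(all_combinations)} predicate combinations")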

A feasible combination of the atomic predicates in the preconditions is a description of the conditions that test cases should satisfy. It specifies a subdomain of the input space. It can be expressed in the same specification language. Such specifications of testing requirements are called test templates. Stocks and Carrington [1993] suggested the use of the formal functional specification language Z to express test templates, because the schema structure of Z and its schema calculus can provide support for the derivation and refinement of test templates according to formal specifications and heuristic testing rules. Methods have also been proposed to derive such test templates from model-based specification languages. Amla and Ammann [1992] described a technique to extract information from formal specifications in Z and to derive test templates written in Z for partition testing. The key step in their method is to identify the categories of the test data for each parameter and environment variable of a functional unit under test. These categories classify the input domain of one parameter or one environment variable according to the major characteristics of the input. According to Amla and Ammann, there are typically two distinct sources of categories in Z specifications: (a) characteristics enumerated in the preconditions and (b) characteristics of a parameter or environment variable by itself. For parameters, these characteristics are based on their type. For environment variables, these characteristics may also be based on the invariant for the state components. Each category is then further divided into a set of choices. A choice is a subset of data that can be assigned to the parameter or the environment variable. Each category can be broken into at least two choices: one for the valid inputs and the other for the invalid inputs. Finer partitions of valid inputs are derived according to the syntactic structure of the precondition predicate, the parameters, or the invariant predicate of the environment variables. For example, the predicate "A ∨ B" is partitioned into three choices: (a) "¬(A ∨ B)" for the set of data which are invalid inputs; (b) "A" for the subset of valid inputs which satisfy condition A; and (c) "B" for the subset of valid inputs which satisfy condition B.

Based on Amla and Ammann's [1992] work, Ammann and Offutt [1994] recently considered how to test a functional unit effectively and efficiently by selecting test cases to cover various subsets of the combinations of the categories and choices when the functional unit has more than one parameter and environment variable. They proposed three coverage criteria. The all-combinations criterion requires that software is tested on all combinations of choices; that is, for each combination of the choices of the parameters and the environment variables, there is at least one test datum in the combination. Let x1, x2, . . . , xn be the parameters and environment variables of the functional unit under test. Suppose that the choices for xi are Ai,1, Ai,2, . . . , Ai,ki, ki > 0, i = 1, 2, . . . , n. Let

C = {A1,u1 × A2,u2 × . . . × An,un | 1 ≤ ui ≤ ki and 1 ≤ i ≤ n}.

C is then the set of all combinations of choices. The all-combinations criterion can be formally defined as follows.

Definition 2.15 (All-Combinations Criterion). A set of test data T satisfies the all-combinations criterion if for all c ∈ C, there exists at least one t ∈ T such that t ∈ c.

This criterion was considered to be inefficient, and the each-choice-used criterion was considered ineffective [Ammann and Offutt 1994]. The each-choice-used criterion requires that each choice is used in the testing; that is, for each choice of each parameter or environment variable, there is at least one test datum that belongs to the choice. Formally:

Definition 2.16 (Each-Choice-Used Criterion). A set of test data T satisfies the each-choice-used criterion if the subset E = {e | e ∈ C and ∃t ∈ T. (t ∈ e)} satisfies the condition:

∀i. (1 ≤ i ≤ n ⇒ Ei = {Ai,1, Ai,2, . . . , Ai,ki}),

where

Ei = {e | ∃X1, . . . , Xi−1, Xi+1, . . . , Xn. (X1 × . . . × Xi−1 × e × Xi+1 × . . . × Xn ∈ E)}.

Ammann and Offutt suggested the use of the base-choice-coverage criterion and described a technique to derive test templates that satisfy the criterion. The base-choice-coverage criterion is based on the notion of base choice, which is a combination of the choices of parameters and environment variables that represents the normal operation of the functional unit under test. Therefore, test cases of the base choice are useful to evaluate the function's behavior in normal operation mode. To satisfy the base-choice-coverage criterion, software needs to be tested on the subset of combinations of choices such that for each choice in a category, the choice is combined with the base choices for all other categories. Assume that A1,1 × A2,1 × . . . × An,1 is the base choice. The base-choice-coverage criterion can be formally defined as follows.

Definition 2.17 (Base-Choice-Coverage Criterion). A set of test data T satisfies the base-choice-coverage criterion if the subset E = {e | e ∈ C ∧ ∃t ∈ T. (t ∈ e)} satisfies the following condition:

E ⊇ B1 ∪ B2 ∪ . . . ∪ Bn,

where

Bi = {A1,1 × . . . × Ai−1,1 × Ai,j × Ai+1,1 × . . . × An,1 | j = 1, 2, . . . , ki}.
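The following sketch (ours; the categories, choices, and test set are invented) spells out the three requirement sets for a unit with two categories, taking each category's first choice as its base choice:

from itertools import product

choices = {
    "x1": ["A11", "A12", "A13"],
    "x2": ["A21", "A22"],
}
order = list(choices)

# Definition 2.15: every combination of choices must receive a test case.
all_combinations = set(product(*(choices[v] for v in order)))

# Definition 2.17: vary one category at a time around the base choice.
base = tuple(choices[v][0] for v in order)
base_choice_requirements = {
    base[:i] + (c,) + base[i + 1:]
    for i, v in enumerate(order) for c in choices[v]
}

# `executed` holds the combination exercised by each test case actually run.
executed = {("A11", "A21"), ("A12", "A21"), ("A13", "A21"), ("A11", "A22")}

# Definition 2.16 only asks that every individual choice occur in some combination.
each_choice_used = all(any(c in comb for comb in executed)
                       for v in order for c in choices[v])

print(base_choice_requirements <= executed)   # True: base-choice coverage met
print(all_combinations <= executed)           # False: 3 x 2 = 6 combinations needed
print(each_choice_used)                       # True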

There are a number of works on specification-based testing that focus on derivation of test cases from specifications, including Denney's [1991] work on test-case generation from Prolog-based specifications and many others [Hayes 1986; Kemmerer 1985; McMullin and Gannon 1983; Wild et al. 1992].

Model-based formal specification can also be in an executable form, such as a finite state machine or a state chart. Aspects of such models can be represented in the form of a directed graph. Therefore, the program-based adequacy criteria based on the flow-graph model can be adapted for specification-based testing [Fujiwara et al. 1991; Hall and Hierons 1991; Ural and Yang 1988; 1993].

2.2.2 Coverage of Algebraic Formal Functional Specifications. Property-oriented formal functional specifications specify software functions by a set of properties that the software should possess. In particular, an algebraic specification consists of a set of equations that the operations of the software must satisfy. Therefore checking if a program satisfies the specification means checking whether all of the equations are satisfied by the program.

An equation in an algebraic specification consists of two terms as the two sides of the equation. A term is constructed from three types of symbols: variables representing arbitrary values of a given data type, constants representing a given data value in a data type, and operators representing data constructors and operations on data types.

Each term has two interpretations in the context of testing. First, a term represents a sequence of calls to the operations that implement the operators specified in the specification. When the variables in the term are replaced with constants, such a sequence of calls to the operations represents a test execution of the program, where the test case consists of the constants substituted for the variables. Second, a term also represents a value, that is, the result of the sequence of operations. Therefore, checking an equation means executing the operation sequences for the two terms on the two sides of the equation and then comparing the results. If the results are the same or equivalent, the program is considered to be correct on this test case; otherwise the implementation has errors. This interpretation allows the use of algebraic specifications as test oracles.

Since variables in a term can be replaced by any value of the data type, there is a great deal of freedom to choose input data for any given sequence of operations. For algebraic specifications, values are represented by ground terms, that is, terms without variables. Gaudel [Bouge et al. 1986; Bernot et al. 1991] and her colleagues suggested that the selection of test cases should be based on partitioning the set of ground terms according to their complexity, so that the regularity and uniformity hypotheses on the subsets in the partition can be assumed. The complexity of a test case is then the depth of nesting of the operators in the ground term. Therefore, roughly speaking, the selection of test cases should first consider constants specified by the specification, then all the values generated by one application of operations on constants, then values generated by two applications on constants, and so on, until the test set covers data of a certain degree of complexity.
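As an illustration (ours; the stack signature, the axiom, and the name impl are assumptions, and the item domain is fixed to the single constant 0 for brevity), ground terms can be enumerated layer by layer up to a complexity bound, and an equation of the specification can be used as an oracle:

def ground_terms(max_depth):
    """Stack-valued ground terms over {empty, push} with nesting <= max_depth."""
    layer, terms = [("empty",)], [("empty",)]
    for _ in range(max_depth):
        layer = [("push", t, 0) for t in layer]    # one more operator application
        terms += layer
    return terms

def evaluate(term, impl):
    """Run the call sequence a ground term denotes against an implementation."""
    if term[0] == "empty":
        return impl.empty()
    _, sub, item = term
    return impl.push(evaluate(sub, impl), item)

def check_pop_push_axiom(impl, k):
    """Oracle use of the axiom pop(push(s, x)) = s on all terms s of complexity <= k."""
    return all(impl.pop(impl.push(evaluate(s, impl), 0)) == evaluate(s, impl)
               for s in ground_terms(k))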

The following hypothesis, called the regularity hypothesis [Bouge et al. 1986; Bernot et al. 1991], formally states the gap between software correctness and adequate testing by the preceding approach.

Regularity Hypothesis

∀x (complexity(x) ≤ k ⇒ t(x) = t′(x)) ⇒ ∀x (t(x) = t′(x))    (2.1)


Informally, the regularity hypothesis assumes that for some complexity degree k, if a program satisfies an equation t(x) = t′(x) on all data of complexity not higher than k, then the program satisfies the equation on all data. This hypothesis captures the intuition of inductive reasoning in software testing. But it can neither be proved formally (at least in its most general form) nor validated empirically. Moreover, there is no way to determine the complexity k such that only the test cases of complexity less than k need to be tested.

2.3 Summary of Structure Coverage Criteria

In summary, when programs are modeled as directed graphs, paths in the flow graph should be exercised by testing. However, only a finite subset of the paths can be checked during testing. The problem is therefore to choose which paths should be exercised.

Control-flow test data adequacy criteria answer this question by specifying restrictions on the complexity of the paths or by specifying restrictions on the redundancy among paths. Data-flow adequacy criteria use data-flow information in the program and select paths that are significant with respect to such information. Data-flow and control-flow adequacy criteria can be extended to dependence coverage criteria, which make more use of semantic information contained in the program under test. However, none of these criteria use information about software requirements or functional specifications.

Specification-based structure coverage criteria specify testing requirements and measure test adequacy according to the extent to which the test data cover the required functions specified in formal specifications. These criteria focus on the specification and ignore the program that implements the specification.

As discussed in Section 1, software testing should employ information contained in the program as well as information in the specification. A simple way to combine program-based structural testing with specification-based structure coverage criteria is to measure test adequacy with criteria from both approaches.

3. FAULT-BASED ADEQUACY CRITERIA

Fault-based adequacy criteria measure the quality of a test set according to its effectiveness or ability to detect faults.

3.1 Error Seeding

Error seeding is a technique originally proposed to estimate the number of faults that remain in software. By this method, artificial faults are introduced into the program under test in some suitable random fashion unknown to the tester. It is assumed that these artificial faults are representative of the inherent faults in the program in terms of difficulty of detection. Then, the software is tested, and the inherent and artificial faults discovered are counted separately. Let r be the ratio of the number of artificial faults found to the total number of artificial faults. Then the number of inherent faults in the program is statistically predicted with maximum likelihood to be f/r, where f is the number of inherent faults found by testing.

This method can also be used to measure the quality of software testing. The ratio r of the number of artificial faults found to the total number of artificial faults can be considered as a measure of the test adequacy. If only a small proportion of artificial faults are found during testing, the test quality must be poor. In this sense, error seeding can show weakness in the testing.
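The arithmetic is straightforward; a tiny sketch (ours, with made-up counts) of both uses of the ratio:

def error_seeding_estimates(seeded_total, seeded_found, inherent_found):
    """Adequacy ratio r and the maximum-likelihood estimate f / r of inherent faults."""
    r = seeded_found / seeded_total            # proportion of seeded faults detected
    return r, inherent_found / r

# E.g. 20 of 25 seeded faults and 6 inherent faults were found during testing:
r, estimated = error_seeding_estimates(25, 20, 6)
print(r, estimated)    # 0.8 and 7.5: roughly one or two inherent faults may remain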

An advantage of the method is that it is not restricted to measuring test quality for dynamic testing. It is applicable to any testing method that aims at finding errors or faults in the software.

However, the accuracy of the measure depends on how the faults are introduced. Usually, artificial faults are planted manually, but it has proved difficult to implement error seeding in practice. It is not easy to introduce artificial faults that are equivalent to inherent faults in difficulty of detection. Generally, artificial errors are much easier to find than inherent errors. In an attempt to overcome this problem, mutation testing introduces faults into a program more systematically.

3.2 Program Mutation Testing

3.2.1 Principles of Mutation Adequacy Analysis. Mutation analysis is proposed as a procedure for evaluating the degree to which a program is tested, that is, to measure test case adequacy [DeMillo et al. 1978; Hamlet 1977]. Briefly, the method is as follows. We have a program p and a test set t that has been generated in some fashion. The first step in mutation analysis is the construction of a collection of alternative programs that differ from the original program in some fashion. These alternatives are called mutants of the original program, a name borrowed from biology. Each mutant is then executed on each member of the test set t, stopping either when an element of t is found on which p and the mutant program produce different responses, or when t is exhausted.

In the former case we say that the mutant has died, since it is of no further value, whereas in the latter case we say the mutant lives. These live mutants provide valuable information. A mutant may remain alive for one of the following reasons.

(1) The test data are inadequate.

If a large proportion of mutants live, then it is clear that on the basis of these test data alone we have no more reason to believe that p is correct than to believe that any of the live mutants are correct. In this sense, mutation analysis can clearly reveal a weakness in test data by demonstrating specific programs that are not ruled out by the test data presented. For example, the test data may not exercise the portion of the program that was mutated.

(2) The mutant is equivalent to the original program.

The mutant and the original program always produce the same output, hence no test data can distinguish between the two. Normally, only a small percentage of mutants are equivalent to the original program.

Definition 3.1 (Mutation Adequacy Score). The mutation adequacy of a set of test data is measured by an adequacy score computed according to the following equation:

Adequacy Score S = D / (M − E),

where D is the number of dead mutants, M is the total number of mutants, and E is the number of equivalent mutants.

Notice that the general problem of deciding whether a mutant is equivalent to the original program is undecidable.
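The score computation itself is simple; the following sketch (ours; the programs are toy lambdas and the equivalence set is supplied by hand, since equivalence cannot be decided automatically) runs each mutant against the test set:

def mutation_score(program, mutants, tests, equivalent):
    """S = D / (M - E) as in Definition 3.1; `equivalent` is the set of mutants
    the tester has judged equivalent to the original program."""
    dead = sum(1 for m in mutants
               if any(m(t) != program(t) for t in tests))   # dies on first difference
    return dead / (len(mutants) - len(equivalent))

original = lambda x: x * 2
mutants = [lambda x: x + 2, lambda x: x * 3, lambda x: abs(x) * 2]

print(mutation_score(original, mutants, tests=[2], equivalent=set()))      # 0.33...
print(mutation_score(original, mutants, tests=[2, -1], equivalent=set()))  # 1.0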

3.2.2 Theoretical Foundations of Mutation Adequacy. Mutation analysis is based on two basic assumptions—the competent programmer hypothesis and the coupling effect hypothesis. These two assumptions are based on observations made in software development practice, and, if valid, enable the practical application of mutation testing to real software [DeMillo et al. 1988]. However, they are very strong assumptions whose validity is not self-evident.

A. Competent programmer assumption. The competent programmer hypothesis assumes that the program to be tested has been written by competent programmers. That is, "they create programs that are close to being correct" [DeMillo et al. 1988]. A consequence drawn from the assumption by DeMillo et al. [1988] is that "if we are right in our perception of programs as being close to correct, then these errors should be detectable as small deviations from the intended program." In other words, the mutants to be considered in mutation analysis are those within a small deviation from the original program. In practice, such mutants are obtained by systematically and mechanically applying a set of transformations, called mutation operators, to the program under test. These mutation operators would ideally model programming errors made by programmers. In practice, this may be only partly true. We return to mutation operators in Section 3.2.3.

B. Coupling effect assumption. The second assumption in mutation analysis is the coupling effect hypothesis, which assumes that simple and complex errors are coupled, and hence test data that cause simple nonequivalent mutants to die will usually also cause complex mutants to die.

Trying to validate the coupling effect assumption, Offutt [1989; 1992] did an empirical study with the Mothra mutation-testing tool. He demonstrated that a test set developed to kill first-order mutants (i.e., mutants) can be very successful at killing second-order mutants (i.e., mutants of mutants). However, the relationship between second-order mutants and complex faults remains unclear.

In the search for the foundation of mutation testing, theories have been developed for fault-based testing in general and mutation testing in particular. Program correctness is usually taken to mean that the program computes the intended function. But in mutation analysis there is another notion of correctness, namely, that a certain class of faults has been ruled out. This is called local correctness here to distinguish it from the usual notion of correctness.

According to Budd and Angluin [1982], local correctness of a program p can be defined relative to a neighborhood F of p, which is a set of programs containing the program p itself. Precisely speaking, F is a mapping from a program p to a set of programs F(p). We say that p is locally correct with respect to F if, for all programs q in F(p), either q is equivalent to p or q fails on at least one test point in the input space. In other words, in the neighborhood of a program only the program and its equivalents are possibly correct, because all others are incorrect. With the competent programmer hypothesis, local correctness implies correctness if the neighborhood is always large enough to cover at least one correct program.

Definition 3.2 (Neighborhood Adequacy Criterion). A test set t is F-adequate if for all programs q ≠ p in F, there exists x in t such that p(x) ≠ q(x).

Budd and Angluin [1982] studied the computability of the generation and recognition of adequate test sets with respect to the two notions of correctness, but left the neighborhood construction problem open.

The neighborhood construction problem was investigated by Davis and Weyuker [1988; 1983]. They developed a theory of metric space on programs written in a language defined by a set of production rules. A transition sequence from program p to program q was defined to be a sequence of programs p1, p2, . . . , pn such that p1 = p and pn = q, and for each i = 1, 2, . . . , n − 1, either pi+1 is obtained from pi (a forward step) or pi can be obtained from pi+1 (a backward step), using one of the productions of the grammar in addition to a set of short-cut rules to catch the intuition that certain kinds of programs are conceptually close. The length of a transition sequence is defined to be the maximum of the number of forward steps and the number of backward steps in the transition sequence. The distance ρ(p, q) between programs p and q is then defined as the smallest length of a transition sequence from p to q. This distance function on the program space can be proved to satisfy the axioms of a metric space; that is, for all programs p, q, r,


(1) ρ(p, q) ≥ 0;
(2) ρ(p, q) = 0 if and only if p = q;
(3) ρ(p, q) = ρ(q, p);
(4) ρ(p, r) ≤ ρ(p, q) + ρ(q, r).

Then, the neighborhood Fd(p) of program p within a distance d is the set of programs within the distance d according to ρ; formally, Fd(p) = {q | ρ(p, q) ≤ d}. Davis and Weyuker introduced the notion of critical points for a program with respect to a neighborhood. Intuitively, a critical point for a program is a test case that is the unique input to kill some mutant in the neighborhood set. Formally:

Definition 3.3 (Critical Points). An input c is an F-critical point for p with respect to F if there exists a program q ∈ F such that for all x ≠ c, p(x) = q(x), but p(c) ≠ q(c).

There is a nice relationship between critical points [Davis and Weyuker 1988] and adequate test sets. First, F-critical points must be members of any F-adequate test set. Second, when the distance d increases, the neighborhood set Fd(p) as well as the set of critical points increases in the sense of set inclusion. That is, if c is Fd(p)-critical, then for all ε ≥ d, c is also Fε(p)-critical. Finally, by studying minimally adequate test sets, Davis and Weyuker [1988] obtained a lower bound on the number of non-critical points that must be present in an adequate test set.

3.2.3 Mutation Transformations. Now let us consider the problem of how to generate mutants from a given program. A practical method is the use of mutation operators. Generally speaking, a mutation operator is a syntactic transformation that produces a mutant when applied to the program under test. It applies to a certain type of syntactic structure in the program and replaces it with another. In current mutation testing tools such as Mothra [King and Offutt 1991], the mutation operators are designed on the basis of many years' study of programmer errors. Different levels of mutation analysis can be done by applying certain types of mutation operators to the corresponding syntactic structures in the program. Table I, from Budd [1981], briefly describes the levels of mutation analysis and the corresponding mutation operators. This framework has been used by almost all of the mutation testing tools.
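As an illustration of what a single mutation operator does (ours; Mothra itself targets Fortran, whereas this sketch mutates Python source with the standard ast module, whose ast.unparse is available from Python 3.9), a relational-operator replacement operator can be written as:

import ast

REL_OPS = (ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE)

def relational_mutants(source):
    """Mutants of `source` obtained by replacing one relational operator at a
    time with every alternative relational operator."""
    tree = ast.parse(source)
    mutants = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Compare):
            continue
        for i, op in enumerate(node.ops):
            for alt in REL_OPS:
                if isinstance(op, alt):
                    continue                  # skip the original operator
                node.ops[i] = alt()           # apply the mutation operator ...
                mutants.append(ast.unparse(tree))
                node.ops[i] = op              # ... and undo it for the next mutant
    return mutants

for m in relational_mutants("def is_max(a, b):\n    return a >= b"):
    print(m)                                  # five mutants: ==, !=, <, <=, >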

Table I. Levels of Analysis in Mutation Testing

a. An erroneous expression may coincidentally compute correct output on a particular input. For example, assume that a reference to a variable is mistakenly replaced by a reference to another variable, but in a particular case these two variables have the same value. Then the expression computes the correct output on that particular input. If such an input is used as a test case, it is unable to detect the fault. Coincidental correctness analysis attempts to analyse whether a test set suffers from such a weakness.

3.2.4 The Pros and Cons. Using a mutation-based testing system, a tester supplies the program to be tested and chooses the levels of mutation analysis to be performed. Then the system generates the set of mutants. The tester also supplies a test set. The testing system executes the original program and each mutant on each test case and compares the outputs produced by the original program and its mutants. If the output of a mutant differs from the output of the original program, the mutant is marked dead. Once execution is complete, the tester examines any mutants that are still alive. The tester can then declare a mutant to be equivalent to the original program, or can supply additional test data in an effort to kill the mutant.

Reports on experiments with mutation-based testing have claimed that the method is powerful and has a number of attractive features [DeMillo and Mathur 1990; DeMillo et al. 1988].

(1) Mutation analysis allows a great degree of automation. Mutants are generated by applying mutation operators and are then compiled and executed. The outputs of the mutants and the original programs are compared and then a mutation adequacy score is calculated. All these can be supported by mutation-testing software such as that of Mothra [DeMillo et al. 1988; King and Offutt 1991].

(2) Mutation-based testing systems provide an interactive test environment that allows the tester to locate and remove errors. When a program under test fails on a test case and a mutant does not, the tester should find it easy to locate and remove the error by considering the mutation operator applied and the location where the mutation operator is applied.

(3) Mutation analysis includes many other testing methods as special cases. For example, statement coverage and branch coverage are special cases of statement analysis in mutation testing. This can be achieved by replacing a statement or a branch by the special statement TRAP, which causes the abortion of the execution.

The drawback of mutation testing is the large computational resource (both time and space) required to test large-scale software. It was estimated that the number of mutants for an n-line program is on the order of n² [Howden 1982]. Recently, experimental data confirmed this estimate; see Section 5.3 [Offutt et al. 1993]. A major expense in mutation testing is perhaps the substantial human cost of examining large numbers of mutants for possible equivalence, which cannot be determined effectively. An average of 8.8% equivalent mutants has been observed in experiments [Offutt et al. 1993].

3.3 Variants of Program Mutation Testing

Using the same idea of mutation analysis, Howden [1982] proposed a testing method to improve efficiency. His testing method is called weak mutation testing because it is weaker than the original mutation testing. The original method was referred to as strong mutation testing. The fundamental concept of weak mutation testing is to identify classes of program components and errors that can occur in the components. Mutation transformations are then applied to the components to generate mutants of the components. For each component and each mutant of the component, weak mutation testing requires that a test datum be constructed so that it causes the execution of the component and that the original component and the mutant compute different values. The major advantage of weak mutation testing is efficiency. Although there is the same number of mutants in weak mutation testing as in strong mutation testing, it is not necessary to carry out a separate program execution for each mutant. Although a test set that is adequate for weak mutation testing may not be adequate for strong mutation testing, experiments with weak mutation testing such as Offutt and Lee's [1991] and Marick's [1991] suggest that it can still be an effective testing method.
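A toy contrast between the two notions (ours; the component, its mutant, and the limit value are invented): weak mutation only compares the values computed by the component and its mutant at the point of execution, whereas strong mutation would further require the difference to propagate to the program's output.

component = lambda x, limit: x >= limit        # original component: a predicate
mutant    = lambda x, limit: x >  limit        # its mutant: >= replaced by >

def weakly_kills(x, limit=10):
    """The test value kills the mutant in the weak sense if the component and
    the mutant compute different values on an execution reaching the component."""
    return component(x, limit) != mutant(x, limit)

print(weakly_kills(10))   # True: x == limit separates >= from >
print(weakly_kills(7))    # False: both predicates evaluate to False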

From a more general view of mutation-testing principles, Woodward and Halewood [1988] proposed firm mutation testing. They regarded mutation testing as the process of making small changes in a program and comparing the outcomes of the original and changed versions. They identified a set of parameters of such changes and comparisons. The basic idea of firm mutation testing is to make such parameters explicit so that they can be altered by testers. Strong and weak mutation testing are two extremes of the firm-mutation testing method. The advantage of firm-mutation testing is that it is less expensive than strong-mutation testing but stronger than weak-mutation testing. In addition, firm-mutation testing provides a mechanism for the control of fault introduction and test result comparison. The major disadvantage of firm-mutation testing is that there is no obvious systematic basis on which to select the parameters and the area of program code.

Returning to frameworks for strong mutation testing, ordered mutation testing was proposed to improve efficiency without sacrificing effectiveness. The idea is to construct an order ≤ between mutants such that mutant b is stronger than a (written a ≤ b) if for any test case t, t kills b implies that t necessarily kills a. Therefore, mutant b should be executed on test cases before the execution of a, and a is executed only when the test data fail to kill b. A similar ordering can also be defined on test data. Given a mutant q of program p, q should be executed on test data t before the execution on test data s if t is more likely to kill q than s is. Ordering on mutation operators was proposed to achieve the ordering on mutants in the early '80s by Woodward et al. [1980] and Riddell et al. [1982], and reappeared recently in a note by Duncan and Robson [1990]. A mutation operation φ is said to be stronger than φ′ if for all programs p, the application of φ to p always gives a mutant that is stronger than the mutants obtained by the application of φ′ to p at the same location. Taking the relational operator = as an example, mutants can be generated by replacing "=" with "≠", "≤", "≥", "<", and ">", respectively. Intuitively, replacing "=" with "≠" should be stronger than replacing it with "<", because if the test data are not good enough to distinguish "=" from "≠", they would not be adequate for other relational operators. Based on such arguments, a partial ordering on relational operators was defined. However, Woodward [1991] proved that operator ordering is not the right approach to achieving mutant ordering. The situation turns out to be quite complicated when a mutation operator is applied to a loop body, and a counterexample was given. Experiments are needed to see how effective ordered mutation testing can be and to assess the extent of cost saving.

It was observed that some mutation operators generate a large number of mutants that are bound to be killed if a test set kills other mutants. Therefore, Mathur [1991] suggested applying the mutation testing method without applying the two mutation operators that produce the most mutants. Mathur's method was originally called constrained mutation testing. Offutt et al. [1993] extended the method to omit the first N most prevalent mutation operators and called it N-selective mutation testing. They did an experiment on selective testing with the Mothra mutation testing environment. They tested 10 small programs using an automatic test-case generator, Godzilla [DeMillo and Offutt 1991; 1993]. Their experimentation showed that with a full selective mutation score, average nonselective mutation scores of 99.99, 99.84, and 99.71% were achieved by 2-selective, 4-selective, and 6-selective mutation testing, respectively. Meanwhile, the average savings⁵ for these selective mutations were 23.98, 41.36, and 60.56%, respectively [Offutt et al. 1993]. Mathur theorized that selective mutation testing has complexity linear in program size measured as the number of variable references, and yet retains much of the effectiveness of mutation testing. However, experimental data show that the number of mutants is still quadratic, although the savings are substantial [Offutt et al. 1993].

⁵ The savings are measured as the reduction of the number of mutants.

3.4 Perturbation Testing

While mutation testing systematically plants faults in programs by applications of syntactic transformations, Zeil's perturbation testing analyses test effectiveness by considering faults in an "error" space. It is concerned with faults in arithmetic expressions within program statements [Zeil 1983]. It was proposed to test vector-bounded programs, which have the following properties: (1) vector-bounded programs have a fixed number of variables on a continuous input domain; (2) in the flow-graph model, each node n in the flow graph is associated with a function Cn that transforms the environment v into a new environment v′. Here, an environment is a vector (x1, x2, . . . , xn, y1, y2, . . . , ym) which consists of the current values of the input variables xi and the values of the intermediate and output variables yi. The functions associated with nodes are assumed to be in a class C which is closed under function composition. (3) Each node may have at most two outward arcs. In such cases the node n is also associated with a predicate Tn, which is applied to the new environment obtained by applying Cn and compared with zero to determine the next node for execution. It is assumed that the predicates are simple, not combined with and, or, or other logical operators. (4) The predicates are assumed to be in a class T which is a vector space over the real numbers R of dimension k and is closed under composition with C. Examples of vector spaces of functions include the set of linear functions, the set of polynomial functions of degree k, and the set of multinomials of degree k.

There are some major advantages of vector boundedness. First, any finite-dimensioned vector space can be described by a finite set of characteristic vectors such that any member of the vector space is a linear combination of the characteristic vectors. Second, vector spaces are closed under the operations of addition and scalar multiplication. Suppose that some correct function C has been replaced by an erroneous form C′. Then the expression C − C′ is the effect of the fault on the transformation function. It is called the error function of C′ or the perturbation to C. Since the vector space contains C and C′, the error function must also be in the vector space. This enables us to study faults in a program as functional differences between the correct function and the incorrect function rather than simply as syntactical differences, as in mutation testing. Thus it builds a bridge between fault-based testing and error-based testing, which is discussed in Section 4.

When a particular error function space E is identified, a neighborhood of a given function f can be expressed as the set of functions of the form f + fe, where fe ∈ E. Test adequacy can then be defined in terms of a test set's ability to detect error functions in a particular error function space. Assuming that there are no missing-path errors in the program, Zeil considered the error functions in the function C and the predicate T associated with a node. The following gives more details about perturbation of these two types of functions.

3.4.1 Perturbation to the Predicate. Let n be a node in a flow graph and let Cn and Tn be the computation function and predicate function associated with node n. Let A be a path from the begin node to node n and CA be the function computed through the path. Let T′ be the correct predicate that should be used in place of Tn. Suppose that the predicate is tested by an execution through the path A. Recall that a predicate is a real-valued function, and its result is compared with zero to determine the direction of control transfer. If in a test execution the error function e = T′ − Tn is evaluated to zero, it does not affect the direction of control transfer. Therefore the error function e = T′ − Tn cannot be detected by such an execution if and only if there exists a positive scalar a such that

Tn ∘ CA(v0) = a·T′ ∘ CA(v0),

for all initial environments v0 which cause execution through path A, where ∘ is functional composition. The subset of T that consists of the functions satisfying the preceding equation is called the blindness space for the path A, denoted by BLIND(A). Zeil identified three types of blindness and provided a constructive method to obtain BLIND(A) for any path A. Assignment blindness consists of functions in the following set:

null(CA) = {e | e ∈ T ∧ ∀v. e ∘ CA(v) = 0}.

These functions evaluate to zero when the expressions computed in CA are substituted for the program variables. After the assignment statement "X := f(v)", for example, the expression "X − f(v)" can be added into the set of undetectable predicates. Equality blindness consists of equality restrictions on the path domain. To be an initial environment that causes execution of a path A, an environment must satisfy some restrictions. A restriction can be an equality, such as x = 2. If an input restriction r(v0) = 0 is imposed, then the predicate r(v) = 0 is an undetectable error function. The final component of undetectable predicates is the predicate Tn itself. Because Tn(v) compared with zero and aTn(v) compared with zero are identical for all positive real numbers a, self-blindness consists of all predicates of the form aTn(v). These three types of undetectable predicates can be combined to form more complicated undetectable error functions.
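For the linear case this has a convenient matrix reading. The sketch below (ours; it assumes the linear setting with numpy, that each path A is summarized by the matrix of its linear transformation CA, and it covers assignment blindness only, ignoring equality and self-blindness) computes the linear perturbations that remain undetectable after a set of paths has been tested:

import numpy as np

def assignment_blindness(M, tol=1e-10):
    """Coefficient vectors a of the linear error functions e(v) = a . v with
    e(CA(v)) = a . (M v) = 0 for every v, i.e. the left null space of M."""
    _, s, vt = np.linalg.svd(M.T)
    rank = int((s > tol).sum())
    return vt[rank:]                           # each row spans a blind direction

def undetected_dimension(path_matrices, n_vars):
    """Dimension of the linear perturbations blind to every tested path: the
    intersection of the null spaces of the transposed path matrices."""
    stacked = np.vstack([M.T for M in path_matrices])
    return n_vars - np.linalg.matrix_rank(stacked)

# Two hypothetical paths over two variables:
M1 = np.array([[1.0, 0.0], [0.0, 0.0]])       # path 1 zeroes the second variable
M2 = np.array([[1.0, 0.0], [0.0, 1.0]])       # path 2 preserves both variables
print(undetected_dimension([M1], 2))           # 1: perturbations in v2 stay blind
print(undetected_dimension([M1, M2], 2))       # 0: no linear perturbation survives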

Having given a characterization theorem for the set BLIND(A) and a constructive method to calculate the set, Zeil defined a test-path selection criterion for predicate perturbations.

Definition 3.4 (Adequacy of Detecting Predicate Perturbation). A set P of paths all ending at some predicate T is perturbation-test-adequate for predicate T if

∩p∈P BLIND(p) = ∅.

Zeil’s criterion was originally stated inthe form of a rejection rule: if a programhas been reliably tested on a set P ofexecution paths that all end at somepredicate T, then an additional path palso ending at T need not be tested ifand only if

ùx[P

BLIND~ x! # BLIND~ p!.

Zeil also gave the following theorem about perturbation testing for predicate errors.

THEOREM 3.1 (Minimal Adequate Test Set for Predicate Perturbation Testing). A minimal set of subpaths adequate for testing a given predicate in a vector-bounded program contains at most k subpaths, where k is the dimension of T.

3.4.2 Perturbation to Computations. The perturbation function of the computation associated with node n can also be expressed as e = C′ − Cn, where C′ is the unknown correct computation. However, a fault in a computation function may cause two types of errors: domain errors or computation errors. A computation error can be revealed if there is a path A from the begin node to the node n and a path B from node n to a node that contains an output statement M, such that for some initial environment v0 that causes the execution of paths A and B,

M ∘ CB ∘ Cn ∘ CA(v0) ≠ M ∘ CB ∘ C′ ∘ CA(v0).


A sufficient condition for this inequality is that M ∘ CB ∘ e ∘ CA(v0) ≠ 0. The set of functions that satisfy the equation M ∘ CB ∘ e ∘ CA(v0) = 0 for all v0 that execute A and B is then the blindness space of the test execution through the path A followed by B.

A domain error due to an erroneous computation function Cn can be revealed by a path A from the begin node to node n and a path B from node n to a node m with predicate Tm, if for some initial environment v0 that causes the execution of paths A and B to node m,

Tm ∘ CB ∘ Cn ∘ CA(v0) ≠ a·Tm ∘ CB ∘ C′ ∘ CA(v0).

In other words, the erroneous computation may cause a domain error if the fault affects the evaluation of a predicate. The blindness space of a test path for detecting such perturbations of a computation function can be expressed by the equation Tm ∘ CB ∘ e ∘ CA(v0) = 0, where e = Cn − C′.

A solution to the two preceding equations was obtained for linear-dominated programs, which are vector-bounded programs with the additional restrictions that C is the set of linear transformations and T is the set of linear functions [Zeil 1983]. The solution essentially consists of two parts: assignment blindness and equality blindness, as in predicate perturbation, and blindness due to the computation CB masking out differences between Cn and C′. An adequacy criterion similar to the predicate perturbation adequacy was also defined.

The principle of perturbation testing can be applied without the assumptions about the computation function space C and the predicate function space T, by selecting an appropriate error function space E. Zeil [1983] gave a characterization of the blindness space for vector error spaces without the condition of linearity. The principle of perturbation testing can also be applied to individual test cases [Zeil 1984]. The notion of error space has been applied to improve domain-testing methods as well [Afifi et al. 1992; Zeil et al. 1992]; these are discussed in Section 4.

Although both mutation testing and perturbation testing are aimed at detecting faults in software, there are a number of differences between the two methods. First, mutation testing mainly deals with faults at the syntactic level, whereas perturbation testing focuses on functional differences. Second, mutation testing systematically generates mutants according to a given list of mutation operators, while perturbation testing perturbs a program by considering perturbation functions drawn from a structured error space, such as a vector space. Third, in mutation testing, a mutant is killed if on a test case it produces an output different from the output of the original program; hence the fault is detected. For perturbation testing, however, the blindness space is calculated according to the test cases, and a fault is undetected if it remains in the blindness space. Finally, although mutation testing is more generally applicable in the sense that there are no particular restrictions on the software under test, perturbation testing guarantees that combinations of faults can be detected if all simple faults are detected, provided that the error space is a vector space.

3.5 The RELAY Model

As discussed under perturbation testing, it is possible for a fault not to cause an error in the output even if the statement containing the fault is executed. With the restriction to linear-dominated programs, Zeil provided conditions under which a perturbation (i.e., a fault) can be exposed.

The problem of whether a fault can be detected was addressed by Morrel [1990], where a model of error origination and transfer was proposed. An error is originated (called "created" by Morrel) when an incorrect state is introduced at some fault location, and it is transferred (called "propagated" by Morrel) if it persists to the output.

The RELAY model proposed by Richardson and Thompson [1988; 1993] was built upon Morrel's theory, but with more detailed analysis of how an error is transferred and of the conditions of such transfers. The errors considered within the RELAY model are those caused by particular faults in a module. A potential fault is a discrepancy between a node n in the module under test and the corresponding node n* in the hypothetically correct module. This potential fault results in a potential error if the expression containing the fault is executed and evaluated to a value different from that of the corresponding hypothetically correct expression. Given a potential fault, a potential error originates at the smallest subexpression of the node containing the fault that evaluates incorrectly. The potential error transfers to a superexpression if the superexpression evaluates incorrectly. Such error transfers are called computational transfers. To reveal an output error, execution of a potential fault must cause an error that transfers from node to node until an incorrect output results, where an error in the function computed by a node is called a context error. If a potential error is reflected in the value of some variable that is referenced at another node, the error transfer is called a data-flow transfer. This process of error transfer is illustrated in Figure 4.

Therefore the conditions under which a fault is detected are:

(1) origination of a potential error in the smallest subexpression containing the fault;

(2) computational transfer of the potential error through each operator in the node, thereby revealing a context error;

(3) data-flow transfer of that context error to another node on the path that references the incorrect context;

(4) cycle through (2) and (3) until a potential error is output.

If there is no input datum for which a potential error originates and all these transfers occur, then the potential fault is not a fault. This view of error detection has an analogy in a relay race, hence the name of the model. Based on this view, the RELAY model develops revealing conditions that are necessary and sufficient to guarantee error detection. Test data are then selected to satisfy revealing conditions. When these conditions are instantiated for a particular type of fault, they provide a criterion by which test data can be selected for a program so as to guarantee the detection of an error caused by any fault of that type.
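To make the origination and transfer conditions concrete, the following sketch (ours, not from Richardson and Thompson) contrasts a hypothetically correct node with a faulty one. The potential error originates in a subexpression, but it only transfers through the enclosing operator, and hence can reach the output, for some inputs.

    # Hypothetical correct node n* and node n containing a potential fault.
    def correct_node(a, b, c):
        return (a + b) * c          # hypothetically correct expression

    def faulty_node(a, b, c):
        return (a - b) * c          # potential fault in the subexpression a - b

    def relay_analysis(a, b, c):
        originated = (a - b) != (a + b)                              # subexpression evaluates incorrectly
        transferred = faulty_node(a, b, c) != correct_node(a, b, c)  # error survives the '*' operator
        return originated, transferred

    print(relay_analysis(1, 2, 3))   # (True, True): the error originates and transfers
    print(relay_analysis(1, 2, 0))   # (True, False): originates but is masked by c = 0

A test case such as (1, 2, 0) executes the fault without revealing it, which is exactly the situation that the revealing conditions are designed to rule out.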

Figure 4. RELAY model of fault detection.

3.6 Specification-Mutation Testing

Specification-fault-based testing attempts to detect faults in the implementation that are derived from misinterpreting the specification or the faults in the specification. Specification-fault-based testing involves planting faults into the specification. The program that implements the original specification is then executed and the results of the execution are checked against the original specification and those with planted faults. The adequacy of the test is determined according to whether all the planted faults are detected by checking.

Gopal and Budd [1983] extended program-mutation testing to specification-mutation testing for specifications written in predicate calculus. They identified a set of mutation operations that are applied to a specification in the form of pre/postconditions to generate mutants of the specification. Then the program under test is executed on test cases to obtain the output of the program. The input/output pair is used to evaluate the mutants of the specification. If a mutant is falsified by the test cases in the evaluation, we say that it is killed by the test cases; otherwise it remains alive.
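As an illustration only (the postcondition, its mutants, and the helper names below are ours, not Gopal and Budd's operators), the following sketch evaluates specification mutants on the input/output pairs produced by the program under test:

    # A postcondition and two mutants, expressed as predicates over (input, output);
    # a mutant is killed when some observed pair falsifies it.
    postcondition = lambda x, out: out == abs(x)
    mutants = {
        "neg":   lambda x, out: out == -abs(x),   # mutated postcondition
        "ident": lambda x, out: out == x,         # another mutated postcondition
    }

    def program_under_test(x):      # the implementation being tested
        return abs(x)

    def killed_mutants(test_cases):
        pairs = [(x, program_under_test(x)) for x in test_cases]
        return {name for name, m in mutants.items()
                if any(not m(x, out) for x, out in pairs)}

    print(killed_mutants([2]))      # {'neg'} : x = 2 does not distinguish 'ident'
    print(killed_mutants([2, -3]))  # {'neg', 'ident'} : both mutants killed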

Gopal and Budd [1983] noticed that some alterations to specifications were not useful as mutation operators. For example, replacing the various clauses in the specification by the truth values "true" or "false" tends to generate mutants that are trivial to kill and appear to be of little practical significance.

In an investigation of mutation operators for algebraic specifications, Woodward [1993] defined his set of mutation operators based on his analysis of errors in the specifications made by students. He considered algebraic specifications as term-rewriting systems. The original specification and the mutants of the specification are compiled into executable code. When the executions of the original specification and a mutant on a given test case generate two different outputs, the mutant is regarded as dead. Otherwise, it is alive. In this way the test adequacy is measured without executing the program.

3.7 Summary of Fault-Based Adequacy Criteria

Fault-based adequacy criteria focus on the faults that could possibly be contained in the software. The adequacy of a test set is measured according to its ability to detect such faults.

Error seeding is based on the assumption that the artificial faults planted in the program are as difficult to detect as the natural errors. This assumption has proved not true in general.

Mutation analysis systematically and automatically generates a large number of mutants. A mutant represents a possible fault. It can be detected by the test cases if the mutant produces an output different from the output of the original program. The most important issue in mutation-adequacy analysis is the design of mutation operators. The method is based on the competent-programmer and coupling-effect hypotheses. The principle of mutation-adequacy analysis can be extended to specification-based adequacy analysis. Given a set of test cases and the program under test, mutation adequacy can be calculated automatically except for the detection of equivalent mutants.
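A minimal sketch of how mutation adequacy can be computed once the mutants are available; the program, mutants, and test inputs below are invented for illustration, and equivalent mutants are assumed to have been identified by hand:

    def original(x):
        return x * 2

    mutants = [lambda x: x + 2, lambda x: x * 3, lambda x: 2 * x]   # last one is equivalent
    equivalent = {2}        # indices of mutants judged equivalent to the original

    def mutation_score(tests):
        killed = {i for i, m in enumerate(mutants)
                  if any(m(t) != original(t) for t in tests)}
        return len(killed) / (len(mutants) - len(equivalent))

    print(mutation_score([0]))      # 0.5 : x = 0 kills only the "+ 2" mutant
    print(mutation_score([0, 3]))   # 1.0 : both nonequivalent mutants are killed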

However, measuring the adequacy of software testing by mutation analysis is expensive. It may require large computation resources to store and execute a large number of mutants. It also requires huge human resources to determine whether live mutants are equivalent to the original program. Reduction of the testing expense to a practically acceptable level has been an active research topic. Variants such as weak mutation testing, firm mutation testing, and ordered mutation testing have been proposed. Another approach not addressed in this article is the execution of mutants on parallel computers [Choi et al. 1989; Krauser et al. 1991]. This requires the availability of massively parallel computers and very high portability of the software so that it can be executed on the target machine as well as the testing machine. In summary, although progress has been made to reduce the expense of adequacy measurement by mutation analysis, there remain open problems.

Perturbation testing is concerned with the possible functional differences between the program under test and the hypothetically correct program. The adequacy of a test set is decided by its ability to limit the error space defined in terms of a set of functions.

A test case may not reveal a fault. The RELAY model analyzed the conditions under which a test case reveals a fault. Therefore, given a fault, test cases can be selected to satisfy the conditions, and hence guarantee the detection of the fault. This model can also be used as a basis of test adequacy criteria.

4. ERROR-BASED ADEQUACY CRITERIA AND DOMAIN ANALYSIS

Error-based testing methods require test cases to check programs on certain error-prone points [Foster 1980; Myers 1979]. The basic idea of domain analysis and domain testing is to partition the input-output behavior space into subdomains so that the behavior of the software on each subdomain is equivalent, in the sense that if the software behaves correctly for one test case within a subdomain, then we can reasonably assume that the behavior of the software is correct for all data within that subdomain. We may wish, however, to take more than one test case within each subdomain in order to increase our confidence in the conformance of the implementation upon this subdomain. Given a partition into subdomains, the question is how many test cases should be used for each subdomain and where in the subdomain these test cases should be chosen. The answers to these questions are based on assumptions about the whereabouts of errors. However, theoretical studies have proved that testing on certain sets of error-prone points can detect certain sets of faults in the program [Afifi et al. 1992; Clarke et al. 1989; Richardson and Clarke 1985; Zeil 1983; 1984; 1989a; 1989b; 1992].

Before considering these problems, let us first look at how to partition the behavior space.

4.1 Specification-Based Input Space Partitioning

The software input space can be partitioned either according to the program or the specification. When partitioning the input space according to the specification, we consider a subset of data as a subdomain if the specification requires the same function on the data.

For example, consider the following specification of a module DISCOUNT INVOICE, from Hall and Hierons [1991].

Example 4.1 (Informal Specification of DISCOUNT INVOICE Module). A company produces two goods, X and Y, with prices £5 for each X purchased and £10 for each Y purchased. An order consists of a request for a certain number of Xs and a certain number of Ys. The cost of the purchase is the sum of the costs of the individual demands for the two products, discounted as explained in the following. If the total is greater than £200, a discount of 5% is given; if the total is greater than £1,000, a discount of 20% is given; also, the company wishes to encourage sales of X and give a further discount of 10% if more than thirty Xs are ordered. Noninteger final costs are rounded down to give an integer value.

When the input x and y to the module DISCOUNT INVOICE has the property that x ≤ 30 and 5x + 10y ≤ 200, the output should be 5x + 10y. That is, for all the data in the subset {(x, y) | x ≤ 30, 5x + 10y ≤ 200}, the required computation is the same function sum = 5x + 10y. Therefore the subset should be one subdomain. A careful analysis of the required computation will give the following partition of the input space into six subdomains, as shown in Figure 5.

It seems that there is no general mechanically applicable method to derive partitions from specifications, even if there is a formal functional specification. However, systematic approaches to analyzing formal functional specifications and deriving partitions have been proposed by a number of researchers, such as Stocks and Carrington [1993]. In particular, for formal specifications in certain normal forms, the derivation of partitions is possible. Hierons [1992] developed a set of transformation rules to transform formal functional specifications written in pre/postconditions into the following normal form.

P1(x_1, x_2, ..., x_n) ∧ Q1(x_1, x_2, ..., x_n, y_1, y_2, ..., y_m)
∨ P2(x_1, x_2, ..., x_n) ∧ Q2(x_1, x_2, ..., x_n, y_1, y_2, ..., y_m)
∨ ···
∨ PK(x_1, x_2, ..., x_n) ∧ QK(x_1, x_2, ..., x_n, y_1, y_2, ..., y_m),

where Pi(x_1, x_2, ..., x_n), i = 1, 2, ..., K, are preconditions that give the condition on the valid input data and the state before the operation. Qi(x_1, x_2, ..., x_n, y_1, y_2, ..., y_m), i = 1, 2, ..., K, are postconditions that specify the relationship between the input data, output data, and the states before and after the operation. The variables x_i are input variables and y_i are output variables.

The input data that satisfy a precondition predicate Pi should constitute a subdomain. The corresponding computation on the subdomain must satisfy the postcondition Qi. The precondition predicate is called the subdomain's domain condition. When the domain condition can be written in the form of the conjunction of atomic predicates of inequality, such as

expl(x_1, x_2, ..., x_n) ≤ expr(x_1, x_2, ..., x_n)      (4.1)

expl(x_1, x_2, ..., x_n) < expr(x_1, x_2, ..., x_n),      (4.2)

then the equation

expl(x_1, x_2, ..., x_n) = expr(x_1, x_2, ..., x_n)      (4.3)

defines a border of the subdomain.

Example 4.2 (Formal Specification of the DISCOUNT INVOICE Module). For the DISCOUNT INVOICE module, a formal specification can be written as follows.

(x ≥ 0 ∧ y ≥ 0) ⇒ (sum = 5x + 10y)
(sum ≤ 200) ⇒ (discount1 = 100)
(sum > 200) ∧ (sum ≤ 1000) ⇒ (discount1 = 95)
(sum > 1000) ⇒ (discount1 = 80)
(x ≤ 30) ⇒ (discount2 = 100)
(x > 30) ⇒ (discount2 = 90)
(total = round(sum · discount1 · discount2 / 10000)),

where round is the function that rounds down a real number to an integer value; x is the number of product X that a customer ordered; y is the number of product Y that the customer ordered; sum is the cost of the order before discount; discount1 and discount2 are the percentages of the first and second type of discount, respectively; and total is the grand total cost of the purchase.

Figure 5. Partition of input space of DISCOUNT INVOICE module.

When this specification is transformed into the normal form, we have a clause with the following predicate as precondition,

(x ≥ 0) & (y ≥ 0) & (sum ≤ 200)
& not(sum > 200 & sum ≤ 1000)
& not(sum > 1000) & (x ≤ 30)
& not(x > 30)

and the following predicate as the corresponding postcondition,

(discount1 = 100) & (discount2 = 100)
& (total = round(sum · discount1 · discount2 / 10000)).

The precondition defines the region A in the partition shown in Figure 5. It is equivalent to

(x ≥ 0) & (y ≥ 0) & (sum ≤ 200) & (x ≤ 30).      (4.4)

Since sum = 5*x + 10*y, the borders of the region are the lines defined by the following four equations:

x = 0,  y = 0,  x = 30,  5*x + 10*y = 200.

They are the x axis, the y axis, and the borders a and g in Figure 5, respectively.

Notice that the DISCOUNT INVOICE module has two input variables. Hence the border equations should be understood as lines in a two-dimensional plane as in Figure 5. Generally speaking, the number of input variables is the dimension of the input space. A border of a subdomain in a K-dimensional space is a surface of K − 1 dimensions.

4.2 Program-Based Input-Space Partitioning

The software input space can also be partitioned according to the program. In this case, two input data belong to the same subdomain if they cause the same "computation" of the program. Usually, the same execution path of a program is considered as the same computation. Therefore, the subdomains correspond to the paths in the program. When the program contains loops, a particular subset of the paths is selected according to some criterion, such as those discussed in Section 2.1.

Example 4.3. Consider the following program, which implements the DISCOUNT INVOICE.

Program DISCOUNT_INVOICE (x, y: Int)
    Var discount1, discount2: Int;
    input (x, y);
    if x ≤ 30
        then discount2 := 100
        else discount2 := 90
    endif;
    sum := 5*x + 10*y;
    if sum ≤ 200
        then discount1 := 100
        elseif sum ≤ 1000
            then discount1 := 95
            else discount1 := 80
    endif;
    output (round(sum*discount1*discount2/10000))
end

There are six paths in the program; see Figure 6 for its flow graph. For each path there is a condition on the input data such that an input causes the execution of the path if and only if the condition is satisfied. Such a condition is called the path condition, and can be derived mechanically by, say, symbolic execution [Girgis 1992; Howden 1977; 1978; Liu et al. 1989; Young and Taylor 1988]. Therefore, path conditions characterize the subdomains (i.e., they are domain conditions). Table II gives the path conditions of the paths.
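The body of Table II is not reproduced in this transcript. Derived directly from the two nested conditionals of the program above (the labels P1–P6 are ours, not the original table's), the six path conditions can be written down and used to classify inputs, as in the following sketch:

    # The six path conditions of DISCOUNT_INVOICE; exactly one holds for any input (x, y).
    def path_conditions():
        s = lambda x, y: 5 * x + 10 * y
        return {
            "P1": lambda x, y: x <= 30 and s(x, y) <= 200,
            "P2": lambda x, y: x <= 30 and 200 < s(x, y) <= 1000,
            "P3": lambda x, y: x <= 30 and s(x, y) > 1000,
            "P4": lambda x, y: x > 30 and s(x, y) <= 200,
            "P5": lambda x, y: x > 30 and 200 < s(x, y) <= 1000,
            "P6": lambda x, y: x > 30 and s(x, y) > 1000,
        }

    def subdomain_of(x, y):
        return [name for name, cond in path_conditions().items() if cond(x, y)]

    print(subdomain_of(10, 5))    # ['P1']
    print(subdomain_of(40, 100))  # ['P6']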

Notice that, for the DISCOUNT INVOICE example, partitioning according to program paths is different from the partitioning according to specification. The borders of the x axis and the y axis are missing in the partitioning according to the program.

Now let us return to the questions about where and how many test cases should be selected for each subdomain. If only one test case is required and if we do not care about the position of the test case in the subdomain, then the test adequacy is equivalent to path coverage, provided that the input space is partitioned according to program paths. But error-based testing requires test cases selected not only within the subdomains, but also on the boundaries, at vertices, and just off the vertices or the boundaries in each of the adjacent subdomains, because these places are traditionally thought to be error-prone. A test case in the subdomain is called an on test point; a test case that lies outside the subdomain is called an off test point.

This answer stems from the classification of program errors into two main types: domain error and computation error. Domain errors are those due to the incorrectness in a program's selection of boundaries for a subdomain. Computation errors are those due to the incorrectness of the implementation of the computation on a given subdomain. This classification results in two types of domain testing. The first aims at the correctness of the boundaries for each subdomain. The second is concerned with the computation on each subdomain. The following give two sets of criteria for these two types of testing.

Figure 6. Flow graph of DISCOUNT INVOICE program.

4.3 Boundary Analysis

White and Cohen [1980] proposed a test method called the N × 1 domain-testing strategy that requires N test cases to be selected on the borders in an N-dimensional space and one test case just off the border. This can be defined as the following adequacy criterion.

Definition 4.1 (N × 1 Domain Adequacy). Let {D1, D2, ..., Dn} be the set of subdomains of software S that has N input variables. A set T of test cases is said to be N × 1 domain-test adequate if, for each subdomain Di, i = 1, 2, ..., n, and each border B of Di, there are at least N test cases on the border B and at least one test case which is just off the border B. If the border is in the domain Di, the test case off the border should be an off test point; otherwise, the test case off the border should be an on test point.

An adequacy criterion stricter than N × 1 adequacy is the N × N criterion, which requires N test cases off the border rather than only one test case off the border. Moreover, these N test cases are required to be linearly independent [Clarke et al. 1982].

Definition 4.2 (N × N Domain Adequacy). Let {D1, D2, ..., Dn} be the set of subdomains of software S that has N input variables. A set T of test cases is said to be N × N domain-test adequate if, for each subdomain Di, i = 1, 2, ..., n, and each border B of Di, there are at least N test cases on the border B and at least N linearly independent test cases just off the border B. If the border B is in the domain Di, the N test cases off the border should be off test points; otherwise they should be on test points.
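A minimal sketch, under the assumption of a single linear border a·x = b in an N-dimensional numerical input space, of how the N on points and one "just off" point of the N × 1 strategy could be generated (the function name and the tolerance eps are ours):

    import numpy as np

    def n_by_1_points(a, b, eps=1e-3):
        a = np.asarray(a, dtype=float)
        n = a.size
        x0 = b * a / a.dot(a)                    # a particular point on the hyperplane a.x = b
        _, _, vt = np.linalg.svd(a.reshape(1, n))
        directions = vt[1:]                      # n - 1 independent directions within the border
        on_points = [x0] + [x0 + d for d in directions]   # n points on the border
        off_point = x0 + eps * a / np.linalg.norm(a)      # a point just off the border
        return on_points, off_point

    # Border 5x + 10y = 200 of subdomain A in the DISCOUNT INVOICE example:
    on, off = n_by_1_points([5.0, 10.0], 200.0)
    print(on, off)

For the border 5x + 10y = 200, this yields two on points on the line and one point slightly outside it; the N × N strategy would instead generate N linearly independent off points.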

The focal point of boundary analysis is to test whether the borders of a subdomain are correct. The N × 1 domain-adequacy criterion aims to detect whether there is an error of parallel shift of a linear border, whereas N × N domain adequacy is able to detect parallel shift and rotation of linear borders. This is why the criteria select the specific positions for the on and off test cases. Considering that the vertices of a subdomain are the points at the intersection of several borders, Clarke et al. [1982] suggested the use of vertices as test cases to improve the efficiency of boundary analysis. The criterion requires that the vertices are chosen as test cases and that for each vertex a point close to the vertex is also chosen as a test case.

Definition 4.3 (V × V Domain Adequacy). Let {D1, D2, ..., Dn} be the set of subdomains of software S. A set T of test cases is said to be V × V domain-test adequate if, for each subdomain Di, i = 1, 2, ..., n, T contains the vertices of Di and for each vertex v of Di there is a test case just off the vertex v. If a vertex v of Di is in the subdomain Di, then the test case just off v should be an off test point; otherwise it should be an on point.

The preceding criteria are effective for detection of errors in linear domains, where a domain is a linear domain if its borders are linear functions. For nonlinear domains, Afifi et al. [1992] proved that the following criteria were effective for detecting linear errors.

Definition 4.4 (N + 2 Domain Adequacy). Let {D1, D2, ..., Dn} be the set of subdomains of software S that has N input variables. A set T of test cases is said to be N + 2 domain-test adequate if, for each subdomain Di, i = 1, 2, ..., n, and each border B of Di, there are at least N + 2 test cases x_1, x_2, ..., x_{N+2} in T such that

—each set of N + 1 test cases is in general position, where a set {x_i} containing N + 1 vectors is in general position if the N vectors x_i − x_1, i = 2, 3, ..., N + 1, are linearly independent (see the sketch following this definition);

—there is at least one on test point and one off test point;

—for each pair of test cases, if the two points are of the same type (in the sense that both are on test points or both are off test points), they should lie on the opposite sides of the hyperplane formed by the other N test points; otherwise they should lie on the same side of the hyperplane formed by the other N test points.
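A small sketch (ours) of the general-position check used in the first condition of the definition, for points given as coordinate lists in N-dimensional space:

    import numpy as np

    def in_general_position(points):
        # points: N+1 points in N-dimensional space, one per row
        pts = np.asarray(points, dtype=float)
        diffs = pts[1:] - pts[0]                 # the N difference vectors x_i - x_1
        return np.linalg.matrix_rank(diffs) == pts.shape[1]

    print(in_general_position([[0, 0], [1, 0], [0, 1]]))   # True
    print(in_general_position([[0, 0], [1, 1], [2, 2]]))   # False (collinear)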

By applying Zeil’s [1984; 1983] workon error spaces (see also the discussionof perturbation testing in Section 3.4),Afifi et al. [1992] proved that any testset satisfying the N 1 2 domain-ade-quacy criterion can detect all linear er-rors of domain borders. They also pro-vided a method of test-case selection tosatisfy the criterion consisting of thefollowing four steps.

(1) Choose N + 1 on test points in general position; the selection of these points should attempt to spread them as far from one another as possible and put them on or very close to the border.

(2) Determine the open convex region of these points.

(3) If this region contains off points, then select one.

(4) If this region has no off points, then change each on point to be an off point by a slight perturbation; now there are N + 1 off points and there is near certainty of finding an on point in the new open convex region.

4.4 Functional Analysis

Although boundary analysis focuses on border location errors, functional analysis emphasizes the correctness of computation on each subdomain. To see how functional analysis works, let us take the DISCOUNT INVOICE module as an example again.

Example 4.4. Consider region A in Figure 5. The function to be calculated on this subdomain specified by the specification is

total = 5*x + 10*y.      (4.5)

This can be written as f(x, y) = 5x + 10y, a linear function of two variables. Mathematically speaking, two points are sufficient to determine a linear function, but one point is not. Therefore at least two test cases must be chosen in the subdomain. If the program also computes a linear function on the subdomain and produces correct outputs on the two test cases, then we can say that the program computes the correct function on the subdomain.

Analyzing the argument given in the preceding example, we can see that functional analysis is based on the following assumptions. First, it assumes that for each subdomain there is a set of functions Ff associated with the domain. In the preceding example, Ff consists of linear functions of two variables. Second, the specified computation of the software on this domain is considered as a function f*, which is an element of Ff. Third, it is assumed that the function f computed by the software on the subdomain is also an element of Ff. Finally, there is a method for selecting a finite number of test cases for f* such that if f and f* agree on the test cases, then they are equivalent.

In Section 3.4, we saw the use of an error space in defining the neighborhood set Ff and the use of perturbation testing in detecting computation errors [Zeil 1983; 1984].

In Howden’s [1978; 1987] algebraictesting, the set of functions associatedwith the subdomain is taken as thepolynomials of degree less than or equalto k, where k is chosen as the degree ofthe required function, if it is a polyno-mial. The following mathematical theo-rem, then, provides a guideline for theselection of test cases in each subdo-main [Howden, 1987].

THEOREM 4.1. Suppose that F contains all multinomials in n variables x_1, x_2, ..., x_n of degree less than or equal to k, and f, f* ∈ F. Let f(x_1, x_2, ..., x_n) = Σ_{i=1}^{(k+1)^n} a_i t_i(x_1, x_2, ..., x_n), where t_i(x_1, x_2, ..., x_n) = x_1^{i_1} x_2^{i_2} ··· x_n^{i_n}, 0 ≤ i_1, i_2, ..., i_n ≤ k. Then f and f* are identical if they agree on any set of m = (k+1)^n values {⟨c_{i,1}, c_{i,2}, ..., c_{i,n}⟩ | i = 1, 2, ..., m} such that the matrix M = [b_{ij}] is nonsingular, where b_{ij} = t_i(c_{j,1}, c_{j,2}, ..., c_{j,n}).

Definition 4.5 (Functional Adequacy) [Howden 1978; 1987]. Let {D1, D2, ..., Dn} be the set of subdomains of software S. Suppose that the required function on subdomain Di is a multinomial in m variables of degree ki, i = 1, 2, ..., n. A set T of test cases is said to be functional-test adequate if, for each subdomain Di, i = 1, 2, ..., n, T contains at least (ki + 1)^m test cases c_j = ⟨c_{j,1}, c_{j,2}, ..., c_{j,m}⟩ in the subdomain Di such that the matrix T = [b_{ij}] is nonsingular, where b_{ij} = t_i(c_{j,1}, c_{j,2}, ..., c_{j,m}), and t_i is the same as in the preceding theorem.

4.5 Summary of Domain-Analysis and Error-Based Test Adequacy Criteria

The basic idea behind domain analysis is the classification of program errors into two types: computation errors and domain errors. A computation error is reflected by an incorrect function computed by the program. Such an error may be caused, for example, by the execution of an inappropriate assignment statement that affects the function computed within a path in the program. A domain error may occur, for instance, when a branch predicate is expressed incorrectly or an assignment statement that affects a branch predicate is wrong, thus affecting the conditions under which the path is selected. A boundary-analysis adequacy criterion focuses on the correctness of the boundaries, which are sensitive to domain errors. A functional-analysis criterion focuses on the correctness of the computation, which is sensitive to computation errors. They should be used in a complementary fashion.

It is widely recognized that software testing should take both specification and program into account. A way to combine program-based and specification-based domain-testing techniques is first to partition the input space using the two methods separately and then refine the partition by intersection of the subdomains [Gourlay 1983; Weyuker and Ostrand 1980; Richardson and Clarke 1985]. Finally, for each subdomain in the refined partition, the required function and the computed function are checked to see if they belong to the same set of functions, say, polynomials of degree K, and the test cases are selected according to the set of functions to which they belong.

A limitation of domain-analysis techniques is that they are too complicated to be applicable to software that has a complex input space. For example, process-control software may have sequences of interactions between the software and the environment system. It may be difficult to partition the input space into subdomains.

Another shortcoming of boundary-analysis techniques is that they were proposed for numerical input spaces such that the notions of the closeness of two points in the V × V adequacy, and "just off a border" in the N × N and N × 1 adequacy, can be formally defined. However, it is not so simple to extend these notions to nonnumerical software such as compilers.

5. COMPARISON OF TEST DATA ADEQUACY CRITERIA

Comparison of testing methods has always been desirable. It is notoriously difficult because testing methods are defined using different models of software and based on different theoretical foundations. The results are often controversial.

In comparing testing adequacy criteria, it must be made very clear in what sense one criterion is better than another. There are three main types of such measures in the software testing literature: fault-detecting ability, software reliability, and test cost. This section reviews the methods and the main results of the comparisons.

5.1 Fault-Detecting Ability

Fault-detecting ability is one of the most direct measures of the effectiveness of test adequacy criteria [Basili and Selby 1987; Duran and Ntafos 1984; Frankl and Weiss 1993; Frankl and Weyuker 1988; Ntafos 1984; Weyuker and Jeng 1991; Woodward et al. 1980]. The methods to compare test adequacy criteria according to this measure can be classified into three types: statistical experiment, simulation, and formal analysis. The following summarizes research in these three approaches.

5.1.1 Statistical Experiments. The basic form of statistical experiments with test adequacy criteria is as follows.

Let C1, C2, ..., Cn be the test adequacy criteria under comparison. The experiment starts with the selection of a set of sample programs, say, P1, P2, ..., Pm. Each program has a collection of faults that are known due to previous experience with the software, or planted artificially, say, by applying mutation operations. For each program Pi, i = 1, 2, ..., m, and adequacy criterion Cj, j = 1, 2, ..., n, k test sets T^j_{i1}, T^j_{i2}, ..., T^j_{ik} are generated in some fashion so that T^j_{iu} is adequate to test program Pi according to the criterion Cj. The proportion r^j_{iu} of faults detected by the test set T^j_{iu} over the known faults in the program Pi is calculated for every i = 1, 2, ..., m, j = 1, 2, ..., n, and u = 1, 2, ..., k. Statistical inferences are then made based on the data r^j_{iu}, i = 1, ..., m, j = 1, ..., n, u = 1, 2, ..., k.

For example, Ntafos [1984] compared branch coverage, random testing, and required-pair coverage. He used 14 small programs. Test cases for each program were selected from a large set of random test cases and modified as needed to satisfy each of the three testing strategies. The percentages of mutants killed by the test sets were considered the fault-detection abilities.

Hamlet [1989] pointed out two potential invalidating factors in this method. They are:

(1) A particular collection of programs must be used—it may be too small or too peculiar for the results to be trusted.

(2) Particular test data must be created for each method—the data may have good or bad properties not related to the testing method.

Hamlet [1989] was also critical: "It is not unfair to say that a typical testing experiment uses a small set of toy programs, with uncontrolled human generation of the test data. That is, neither (1) nor (2) is addressed."

In addition to these two problems, there is, in fact, another potential invalidating factor in the method. That is:

(3) A particular collection of known faults in each program must be used—it may not be representative of the faults in software. The faults could be too easy or too difficult to find, or there could be a different distribution of the types of faults. One particular test method or adequacy criterion may have a better ability to detect one type of fault than other types.

To avoid the effects of the potential invalidating factors related to the test data, Basili and Selby [1987] used fractional factorial design methodology of statistical experiments and repeated experimentation. They compared two dynamic testing methods (functional testing and statement coverage) and a static testing method (code review) in three iterative phases involving 74 subjects (i.e., testers) of various backgrounds. However, they only used four sample programs. The faults contained in the sample programs include those made during the actual development of the program as well as artificial faults. Not only the fault-detection ability, but also the fault-detection cost with respect to various classes of faults were compared. Since the test sets for dynamic testing are generated manually and the static testing is performed manually, they also made a careful comparison of human factors in testing. Their main results were:

—Human factors. With professional programmers, code reading detected more software faults and yielded a higher fault-detection rate than functional or structural testing did. Although functional testing detected more faults than structural testing did, functional testing and structural testing did not differ in fault-detection rate. In contrast, with advanced students, the three techniques were not different in fault-detection rate.

—Software type. The number of faults observed, fault-detection rate, and total effort in software testing all clearly depended on the type of software under test.

—Fault type. Different testing techniques have different fault-detection abilities for different types of faults. Code reading detected more interface faults than the other methods did, and functional testing detected more control faults than the other methods did.

Their experiment results indicated the complexity of the task of comparing software testing methods.

Recently, Frankl and Weiss [1993] used another approach to addressing the potential invalidating factor associated with the test data. They compared branch adequacy and all-uses data-flow adequacy criteria using nine small programs of different subjects. Instead of using one adequate test set for each criterion, they generated a large number of adequate test sets for each criterion and calculated the proportion of the test sets that detect errors as an estimate of the probability of detecting errors. Their main results were:

—for five of the nine subjects, the all-uses criterion was more effective than branch coverage at 99% confidence, where the effectiveness of an adequacy criterion is the probability that an adequate test set exposes an error when the test set is selected randomly according to a given distribution on the adequate test sets of the criterion.

—in four of the nine subjects, all-uses adequate test sets were more effective than branch-coverage adequate sets of similar size.

—for the all-uses criterion, the probability of detecting errors clearly depends on the extent of the adequacy only when the adequacy is very close to 100% (i.e., higher than 97%). In four of the subjects, such a dependence existed for the branch-coverage criterion over the whole adequacy range.

The advantage of the statistical experiment approach is that it is not limited to comparing adequacy criteria based on a common model. However, given the potential invalidating factors, it is doubtful that it is capable of producing trustworthy results on pairs of adequacy criteria with subtle differences.

5.1.2 Simulation. Both simulation and formal analysis are based on a certain simplified model of software testing. The most famous research that uses simulation for comparing software test adequacy criteria are Duran and Ntafos' [1984] comparison of partition testing with random testing and Hamlet and Taylor's [1990] repetition of the comparison.

Duran and Ntafos modeled partition testing methods by a finite number of disjoint subsets of software input spaces. Suppose that the input space D of a given software system is partitioned into k disjoint subsets D1, D2, ..., Dk and an input chosen at random has probability p_i of lying in D_i. Let θ_i be the failure rate for D_i; then

θ = Σ_{i=1}^{k} p_i θ_i      (5.1)

is the probability that the program will fail to execute correctly on an input chosen at random in accordance with the input distribution. Duran and Ntafos investigated the probability that a test set will reveal one or more errors and the expected number of errors that a set of test cases will discover.⁶ For random testing, the probability P_r of finding at least one error in n tests is 1 − (1 − θ)^n. For a partition testing method in which n_i test cases are chosen at random from subset D_i, the probability P_p of finding at least one error is given by

P_p = 1 − Π_{i=1}^{k} (1 − θ_i)^{n_i}.      (5.2)

With respect to the same partition,

P_r = 1 − (1 − θ)^n = 1 − (1 − Σ_{i=1}^{k} p_i θ_i)^n,   where n = Σ_{i=1}^{k} n_i.      (5.3)

The expected number E_p of errors discovered by partition testing is given by

E_p(k) = Σ_{i=1}^{k} θ_i      (5.4)

if one random test case is chosen from each D_i, where θ_i is the failure rate for D_i.

If in total n random test cases are used in random testing, some set of actual values n̄ = {n_1, n_2, ..., n_k} will occur, where n_i is the number of test cases that fall in D_i. If n̄ were known, then the expected number of errors found would be

E(n̄, k, n) = Σ_{i=1}^{k} (1 − (1 − θ_i)^{n_i}).      (5.5)

⁶ Note that an error means that the program's output on a specific input does not meet the specification of the software.


Since n̄ is not known, the expected number E_r of errors found by n random tests is

E_r(k, n) = Σ_{n̄} E(n̄, k, n) · n! · (Π_{i=1}^{k} p_i^{n_i} / Π_{i=1}^{k} n_i!) = k − Σ_{i=1}^{k} (1 − p_i θ_i)^n.      (5.6)

They conducted simulations of various program conditions in terms of θ_i distributions and k, and calculated and compared the values of P_p, P_r, E_p, and E_r. Two types of θ_i distributions were investigated. The first type, called hypothetical distributions, considered the situation where a test case chosen from a subset would have a high probability of finding an error affecting that subset. Therefore the θ_i were chosen from a distribution such that 2 percent of the time θ_i ≥ 0.98 and 98 percent of the time θ_i < 0.049. The second type of θ_i distribution was uniform distributions; that is, the θ_i were allowed to vary uniformly from 0 to a value θ_max that varies from 0.01 to 1.
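A small simulation sketch in the spirit of these experiments; the subdomain probabilities and failure rates below are invented for illustration, not Duran and Ntafos' actual parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    k = 50
    p = np.full(k, 1.0 / k)                      # equally likely subdomains
    theta = rng.uniform(0.0, 0.05, size=k)       # failure rate of each subdomain
    theta[0] = 0.98                              # one highly failure-prone subdomain (~2%)

    theta_bar = np.sum(p * theta)                # overall failure rate, Equation (5.1)
    n = k                                        # compare against one random test per subdomain
    P_r = 1 - (1 - theta_bar) ** n               # random testing, Equation (5.3)
    P_p = 1 - np.prod(1 - theta)                 # partition testing with n_i = 1, Equation (5.2)
    E_p = np.sum(theta)                          # expected errors, partition testing, Equation (5.4)
    E_r = k - np.sum((1 - p * theta) ** n)       # expected errors, random testing, Equation (5.6)
    print(P_r, P_p, E_p, E_r)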

The main result of their simulation was that when the fault-detecting ability of 100 simulated random test cases was compared to that of 50 simulated partition test cases, random testing was superior. This was considered as evidence to support one of the main conclusions of the paper, that random testing would be more cost-effective than partition testing, because performing 100 random tests was considered less expensive than 50 partition tests.

Considering Duran and Ntafos' result counterintuitive, Hamlet and Taylor [1990] did more extensive simulation and arrived at more precise statements about the relationship between partition probabilities, failure rates, and effectiveness. But their results corroborated Duran and Ntafos' results.

In contrast to statistical experiment, simulation can be performed in an ideal testing scenario to avoid some of the complicated human factors. However, because simulation is based on a certain simplified model of software testing, the realism of the simulation result could be questionable. For example, in Duran and Ntafos' experiment, the choice of the particular hypothetical distribution (i.e., for 2 percent of the time θ_i ≥ 0.98 and for 98 percent of the time θ_i < 0.049) seems rather arbitrary.

5.1.3 Formal Analysis of the Relationships among Adequacy Criteria. One of the basic approaches to comparing adequacy criteria is to define some relation among adequacy criteria and to prove that the relation holds or does not hold for various pairs of criteria. The majority of such comparisons in the literature use the subsume ordering, which is defined as follows.

Definition 5.1 (Subsume Relation among Adequacy Criteria). Let C1 and C2 be two software test data adequacy criteria. C1 is said to subsume C2 if, for all programs p under test, all specifications s, and all test sets t, t is adequate according to C1 for testing p with respect to s implies that t is adequate according to C2 for testing p with respect to s.

Rapps and Weyuker [1985] studied the subsume relation among their original version of data-flow adequacy criteria, which were not finitely applicable. Frankl and Weyuker [1988] later used this relation to compare their revised feasible version of data-flow criteria. They found that feasibility affects the subsume relation among adequacy criteria. In Figure 7, the subsume relation that holds for the infeasible version of criteria but does not hold for the feasible version of the criteria is denoted by an arrow marked with the symbol "*". Ntafos [1988] also used this relation to compare all the data-flow criteria and several other structural coverage criteria. Among many other works on comparing testing methods by the subsume relation are those by Weiser et al. [1985] and Clarke et al. [1989]. Since the adequacy criteria based on flow analysis have a common basis, most of them can be compared; they fall into a fairly simple hierarchy, as shown in Figure 7. A relatively complete picture of the relations among test adequacy criteria can be built. However, many methods are incomparable; that is, one does not subsume the other.

The subsume relation is actually a comparison of adequacy criteria according to the severity of the testing methods. The subsume relation evaluates adequacy criteria in their own terms, without regard for what is really of interest. The relation expresses nothing about the ability to expose faults or to assure software quality. Frankl and Weyuker [1993a] proved that the fact that C1 subsumes C2 does not always guarantee that C1 is better at detecting faults.

Frankl and Weyuker investigated whether a relation on adequacy criteria can guarantee better fault-detecting ability for subdomain testing methods. The testing methods they considered are those by which the input space is divided into a multiset of subspaces so that in each subspace at least one test case is required. Therefore a subdomain test adequacy criterion is equivalent to a function that takes a program p and a specification s and gives a multiset⁷ {D1, D2, ..., Dk} of subdomains. They defined a number of partial orderings on test adequacy criteria and studied whether these relations are related to the ability of fault detection. The following are the relations they defined, which can be regarded as extensions of the subsume relation.

⁷ In a multiset an element may appear more than once.

Figure 7. Subsume relation among adequacy criteria.

Definition 5.2 (Relations among Test Adequacy Criteria). Let C1 and C2 be two subdomain test adequacy criteria and π_C1(p, s) and π_C2(p, s) be the multisets of the subsets of the input space for a program p and a specification s according to C1 and C2, respectively.

—C1 narrows C2 for (p, s) if for every D ∈ π_C2(p, s), there is a subdomain D′ ∈ π_C1(p, s) such that D′ ⊆ D.

—C1 covers C2 for (p, s) if for every D ∈ π_C2(p, s), there is a nonempty collection of subdomains {D1, D2, ..., Dn} belonging to π_C1(p, s) such that D1 ∪ D2 ∪ ... ∪ Dn = D (see the sketch following this list).

—C1 partitions C2 for (p, s) if for every D ∈ π_C2(p, s), there is a nonempty collection of pairwise disjoint subdomains {D1, D2, ..., Dn} belonging to π_C1(p, s) such that D1 ∪ D2 ∪ ... ∪ Dn = D.

—Let π_C1(p, s) = {D1, D2, ..., Dm} and π_C2(p, s) = {E1, E2, ..., En}. C1 properly covers C2 for (p, s) if there is a multiset M = {D_{1,1}, D_{1,2}, ..., D_{1,k_1}, ..., D_{n,1}, ..., D_{n,k_n}} ⊆ π_C1(p, s) such that E_i = D_{i,1} ∪ D_{i,2} ∪ ... ∪ D_{i,k_i} for all i = 1, 2, ..., n.

—Let π_C1(p, s) = {D1, D2, ..., Dm} and π_C2(p, s) = {E1, E2, ..., En}. C1 properly partitions C2 for (p, s) if there is a multiset M = {D_{1,1}, ..., D_{1,k_1}, ..., D_{n,1}, ..., D_{n,k_n}} ⊆ π_C1(p, s) such that E_i = D_{i,1} ∪ D_{i,2} ∪ ... ∪ D_{i,k_i} for all i = 1, 2, ..., n, and for each i = 1, 2, ..., n, the collection {D_{i,1}, ..., D_{i,k_i}} is pairwise disjoint.
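Over a finite input space, the covers relation can be checked directly. The following sketch (our own formulation, with subdomains modeled as frozensets of inputs) tests whether every subdomain of C2 is the union of a nonempty collection of C1's subdomains:

    def covers(subdomains_c1, subdomains_c2):
        for D in subdomains_c2:
            parts = [E for E in subdomains_c1 if E <= D]   # C1 subdomains contained in D
            if not parts or set().union(*parts) != D:
                return False
        return True

    C1 = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5})]
    C2 = [frozenset({1, 2, 3}), frozenset({4, 5})]
    print(covers(C1, C2))   # True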

For each relation R previously defined, Frankl and Weyuker defined a stronger relation, called universal R, which requires R to hold for all specifications s and all programs p. These relations have the relationships shown in Figure 8.

Figure 8. Relationships among partial orderings on adequacy criteria.

To study fault-detecting ability, Frankl and Weyuker considered two idealized strategies for selecting the test cases. The first requires the tester to independently select test cases at random from the whole input space according to a uniform distribution until the adequacy criterion is satisfied. The second strategy assumes that the input space has first been divided into subdomains and then requires independent random selection of a predetermined number n of test cases from each subdomain. They used three different measures, M1, M2, and M3, in the following definition of fault-detecting ability. In a later paper, Frankl and Weyuker [1993b] also studied the expected number of errors detected by a test method. This measure is E(C, p, s) in the following definition. Notice that these measures are the same as those used by Duran and Ntafos [1984] except that θ_i is replaced by m_i/d_i; see also Section 5.1.2.

Definition 5.3 (Measures of Fault-Detection Ability). Let π_C(p, s) = {D1, D2, ..., Dk} be the multiset of the subdomains of the input space of (p, s) according to the criterion C, d_i = |D_i|, and m_i be the number of fault-causing inputs in D_i, i = 1, ..., k. Define:

M1(C, p, s) = max_{1≤i≤k} (m_i / d_i)      (5.7)

M2(C, p, s) = 1 − Π_{i=1}^{k} (1 − m_i / d_i)      (5.8)

M3(C, p, s, n) = 1 − Π_{i=1}^{k} (1 − m_i / d_i)^n,      (5.9)

where n ≥ 1 is the number of test cases in each subdomain.

E(C, p, s) = Σ_{i=1}^{k} m_i / d_i.      (5.10)

According to Frankl and Weyuker, the first measure M1(C, p, s) gives a crude lower bound on the probability that an adequate test set selected using either of the two strategies will expose at least one fault. The second measure M2(C, p, s) gives the exact probability that a test set chosen by using the first sampling strategy will expose at least one fault. M3(C, p, s, n) is a generalization of M2 obtained by taking the number n of test cases selected in each subdomain into account. It should be noted that M2 and M3 are special cases of Equation (5.2) for P_p in Duran and Ntafos' work with n_i = 1 and n_i = n, respectively. It is obvious that M2 and M3 are accurate only for the first sampling strategy because for the second strategy it is unreasonable to assume that each subdomain has exactly n random test cases.
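The measures are straightforward to compute once the subdomains and their fault-causing inputs are known; the numbers below are invented purely for illustration:

    from math import prod

    d = [100, 50, 200]        # subdomain sizes d_i
    m = [2, 0, 10]            # fault-causing inputs m_i per subdomain
    n = 3                     # test cases per subdomain, used by M3

    rates = [mi / di for mi, di in zip(m, d)]
    M1 = max(rates)                                  # Equation (5.7)
    M2 = 1 - prod(1 - r for r in rates)              # Equation (5.8)
    M3 = 1 - prod((1 - r) ** n for r in rates)       # Equation (5.9)
    E  = sum(rates)                                  # Equation (5.10)
    print(M1, M2, M3, E)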

Frankl and Weyuker [1993a; 1993b] proved that different relations on the adequacy criteria do relate to the fault-detecting ability. Table III summarizes their results.

Table III. Relationships among Fault-Detecting Ability and Orderings on Adequacy Criteria (k_i, i = 1, 2, is the number of subdomains in π_Ci(p, s))

Given the fact that the universal narrows relation is equivalent to the subsume relation for subdomain testing methods, Frankl and Weyuker concluded that "C1 subsumes C2" does not guarantee a better fault-detecting ability for C1. However, recently Zhu [1996a] proved that in certain testing scenarios the subsume relation can imply better fault-detecting ability. He identified two software testing scenarios. In the first scenario, a software tester is asked to satisfy a particular adequacy criterion and generates test cases specifically to meet the criterion. This testing scenario is called the prior testing scenario. Frankl and Weyuker's second sampling strategy belongs to this scenario. In the second scenario, the tester generates test cases without any knowledge of the test adequacy criterion. The adequacy criterion is only used as a stopping rule so that the tester stops generating test cases only if the criterion is satisfied. This scenario is called the posterior testing scenario. Frankl and Weyuker's first sampling strategy belongs to the posterior scenario. It was proved that in the posterior testing scenario, the subsume relation does imply better fault-detecting ability in all the probability measures for error detection and for the expected number of errors. In the posterior testing scenario, the subsume relation also implies more test cases [Zhu 1996a].

Frankl and Weyuker [1993b] also investigated how existing adequacy criteria fit into the partial orderings. They studied the properly cover relation on a subset of data-flow adequacy criteria, a limited mutation adequacy (which is actually branch coverage), and condition coverage. The results are shown in Figure 9.

5.2 Software Reliability

The reliability of a software system that has passed an adequate test is a direct measure of the effectiveness of test adequacy criteria. However, comparing test adequacy criteria with respect to this measure is difficult. There is little work of this type in the literature. A recent breakthrough is the work of Tsoukalas et al. [1993] on estimation of software reliability from random testing and partition testing.

Motivated to explain the observations made in Duran and Ntafos' [1984] experiment on random testing and partition testing, they extended the Thayer-Lipow-Nelson reliability model [Thayer et al. 1978] to take into account the cost of errors, then compared random testing with partition testing by looking at the upper confidence bounds when estimating the cost-weighted failure rate.

Figure 9. Universally properly cover relation among adequacy criteria.

Tsoukalas et al. [1993] considered the situation where the input space D is partitioned into k pairwise disjoint subsets so that D = D1 ∪ D2 ∪ ... ∪ Dk. On each subdomain D_i, there is a cost penalty c_i that would be incurred by the program's failure to execute properly on an input from D_i, and a probability p_i that a randomly selected input will belong to D_i. They defined the cost-weighted failure rate for the program as a whole as

C = Σ_{i=1}^{k} c_i p_i θ_i.      (5.11)

Usually, the cost-weighted failure rate is unknown in advance, but can be estimated by

C ≈ Σ_{i=1}^{k} c_i p_i (f_i / n_i),      (5.12)

where f_i is the number of failures observed over D_i and n_i is the number of random test cases within the subdomain D_i. Given the total number n of test cases, Equation (5.13) gives the maximum likelihood estimate of the cost-weighted failure rate C.

C ≈ Σ_{i=1}^{k} c_i f_i / n      (5.13)

But Tsoukalas et al. [1993] pointed out that the real issue is how confident one can be in the estimate. To this end they sought an upper confidence bound on C. Therefore, considering finding the upper bound as a linear programming problem that maximizes Σ_{i=1}^{k} c_i p_i θ_i subject to certain conditions, they obtained expressions for the upper bounds in the two cases of random testing and partition testing and calculated the upper bounds for various special cases.

The conclusion was that confidence is more difficult to achieve for random testing than for partition testing. This agrees with the intuition that it should be easier to have a certain level of confidence for the more systematic of the two testing methods. That is, it is easier to put one's faith in partition testing. But they also confirmed the result of the empirical studies of the effectiveness of random testing by Duran and Ntafos [1984] and Hamlet and Taylor [1990]. They showed that in some cases random testing can perform much better than partition testing, especially when only one test case is selected in each subdomain.

This work has a number of practical implications, especially for safety-critical software where there is a significant penalty for failures in some subdomains. For example, it was proved that when no failures are detected, the optimum way to distribute the test cases over the subdomains in partition testing is to select n_i proportional to c_i p_i, 0 ≤ i ≤ k, rather than select one test case in each subdomain.
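A sketch of the allocation rule just mentioned: test cases are distributed over the subdomains in proportion to c_i p_i rather than uniformly. The costs and probabilities below are invented for illustration.

    c = [10.0, 1.0, 5.0]      # cost penalty c_i of a failure in each subdomain
    p = [0.2, 0.5, 0.3]       # probability p_i that a random input falls in D_i
    n = 100                   # total number of test cases available

    weights = [ci * pi for ci, pi in zip(c, p)]
    total = sum(weights)
    allocation = [round(n * w / total) for w in weights]
    print(allocation)         # more tests go where failures would be most costly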

5.3 Test Cost

As testing is an expensive software development activity, the cost of testing to achieve a certain adequacy according to a given criterion is also of importance. Comparison of testing costs involves many factors. One of the simplified measures of test cost is the size of an adequate test set. Weyuker's [1988c; 1993] work on the complexity of data-flow adequacy criteria belongs to this category.

In 1988, Weyuker [1988c] reported an experiment with the complexity of data-flow testing. Weyuker used the suite of programs in Software Tools in Pascal by Kernighan and Plauger [1981] to establish empirical complexity estimates for Rapps and Weyuker's family of data-flow adequacy criteria. The suite consists of over 100 subroutines. Test cases were selected by testers who were instructed to select "natural" and "atomic" test cases using the selection strategy of their choice, and were not encouraged to select test cases explicitly to satisfy the selected data-flow criterion. Then seven statistical estimates were computed for each criterion from the data (d_i, t_i), where d_i denoted the number of decision statements in program i and t_i denoted the number of test cases used to satisfy the given criterion for program i. Among the seven measures, the most interesting ones are the least-squares line t = ad + b (where t is the number of test cases sufficient to satisfy the given criteria for the subject program and d is the number of decision statements in that program), and the weighted average of the ratios of the number of decision statements in a subject program to the number of test cases sufficient to satisfy the selected criterion. Weyuker observed that (1) the coefficients of d in the least-squares lines were less than 1 for all the cases and (2) the coefficients were very close to the average number of test cases required for each decision statement. Based on these observations, Weyuker concluded that the experiment results reinforced our intuition that the relationship between the required number of test cases and the program size was linear. A result similar to Weyuker's is also observed for the all du-paths criterion by Bieman and Schultz [1992].
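The least-squares fit itself is routine; the following sketch (with invented data points, not Weyuker's measurements) shows the computation of the line t = ad + b:

    import numpy as np

    d = np.array([3, 5, 8, 12, 20])      # decision statements per program
    t = np.array([4, 5, 7, 10, 15])      # test cases needed to satisfy the criterion
    a, b = np.polyfit(d, t, 1)           # slope and intercept of t = a*d + b
    print(a, b)                          # a < 1 would match Weyuker's observation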

The programs in the Kernighan and Plauger suite used in Weyuker's experiment were essentially the same type, well designed and modularized, and hence relatively small in size. Addressing the problem of whether substantially larger, unstructured and modularized programs would require larger amounts of test data relative to their size, Weyuker [1993] conducted another experiment with data-flow test adequacy criteria. She used a subset of ACM algorithms in Collected Algorithms from ACM (Vol. 1, ACM Press, New York, 1980) as sample programs that contained known faults and five or more decision statements. In this study, the same information was collected and the same values were calculated. The same conclusion was obtained. However, in this study, Weyuker observed a large proportion of infeasible paths in data-flow testing. The average rates of infeasible paths for the all du-paths criterion were 49 and 56% for the two groups of programs, respectively. Weyuker pointed out that the large number of infeasible paths implies that assessing the cost of data-flow testing only in terms of the number of test cases needed to satisfy a criterion might yield an optimistic picture of the real effort needed to accomplish the testing.

Offutt et al. [1993] studied the cost of mutation testing with a sample set of 10 programs. They used simple and multiple linear regression models to establish a relationship among the number of generated mutants and the number of lines of code, the number of variables, the number of variable references, and the number of branches. The linear regression models provide a powerful vehicle for finding functional relationships among random variables. The coefficient of determination provides a summary statistic that measures how well the regression equation fits the data, and hence is used to decide whether a relationship between some data exists. Offutt et al. used a statistical package to calculate the coefficient of determination of the following formulas.

Y_mutant = b_0 + b_1 X_line + b_2 X_line^2
Y_mutant = b_0 + b_1 X_var + b_2 X_var^2
Y_mutant = b_0 + b_1 X_var + b_2 X_varref + b_3 X_var X_varref,

where X_line is the number of lines in the program, X_var is the number of variables in the program, X_varref is the number of variable references in the program, and Y_mutant is the number of mutants. They found that for single units, the coefficients of determination of the formulas were 0.96, 0.96, and 0.95, respectively. For programs of multiple units, they established the following formula with a coefficient of determination of 0.91.

Y_mutant = b_0 + b_1 X_var + b_2 X_varref + b_3 X_unit + b_4 X_var X_varref,

where $X_{unit}$ is the number of units in the program. Therefore their conclusion was that the number of mutants is quadratic. As mentioned in Section 3, Offutt et al. also studied the cost of selective mutation testing and showed that even for 6-selective mutation testing, the number of mutants is still quadratic.
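As an illustration of the kind of fit involved (not a reproduction of Offutt et al.'s analysis), the sketch below fits the quadratic model in $X_{line}$ by ordinary least squares and computes the coefficient of determination; the data points are invented.

```python
# Illustration only: fit Y_mutant = b0 + b1*X_line + b2*X_line^2 to
# hypothetical data and compute the coefficient of determination R^2.
import numpy as np

x_line = np.array([20, 45, 60, 110, 150, 240], dtype=float)        # lines of code
y_mutant = np.array([180, 700, 1200, 3800, 6900, 17000], dtype=float)  # mutants

# Design matrix [1, x, x^2] and ordinary least-squares fit.
X = np.column_stack([np.ones_like(x_line), x_line, x_line ** 2])
coef, *_ = np.linalg.lstsq(X, y_mutant, rcond=None)

y_hat = X @ coef
ss_res = np.sum((y_mutant - y_hat) ** 2)
ss_tot = np.sum((y_mutant - y_mutant.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"b0, b1, b2 = {coef}, R^2 = {r_squared:.3f}")
```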

5.4 Summary of the Comparison of Adequacy Criteria

Comparison of test data adequacy criteria can be made with respect to various measures, such as fault detection, confidence in the reliability of the tested software, and the cost of software testing. Various abstract relations among the criteria, such as the subsume relation, have also been proposed, and the relationships among adequacy criteria have been investigated. The relationship between such relations and fault-detecting ability has also been studied. Recent research results have shown that partition testing more easily achieves high confidence in software reliability than random testing, though random testing may be more cost-effective.

6. AXIOMATIC ASSESSMENT OF ADEQUACY CRITERIA

In Section 5, we have seen two kinds of rationale presented to support the use of one criterion or another, yet there is no clear consensus. The first kind of rationale uses statistical or empirical data to compare testing effectiveness, whereas the second type analytically compares test data adequacy criteria based on certain mathematical models of software testing.

In this section we present a third type of rationale, the axiomatic study of the properties of adequacy criteria. The basic idea is to seek the most fundamental properties of software test adequacy and then check whether the properties are satisfied by each particular criterion. Although the idea of using abstract properties as requirements of test adequacy criteria can be found in the literature, such as Baker et al. [1986], Weyuker [1986, 1988a] is perhaps the first who explicitly and clearly employed axiom systems. Work following this direction includes refinement and improvement of the axioms [Parrish and Zweben 1991; 1993; Zhu and Hall 1993; Zhu et al. 1993; Zhu 1995a] as well as analysis and criticism [Hamlet 1989; Zweben and Gourlay 1989].

There are four clear-cut roles that axiomatization can play in software testing research, as already seen in physics and mathematics. First, it makes explicit the exact details of a concept or an argument that, before the axiomatization, was either incomplete or unclear. Second, axiomatization helps to isolate, abstract, and study more fully a class of mathematical structures that have recurred in many important contexts, often with quite different surface forms, so that a particular form can draw on results discovered elsewhere. Third, it can provide the scientist with a compact way to conduct a systematic exploration of the implications of the postulates. Finally, the fourth role is the study of what is needed to axiomatize a given empirical phenomenon and what can be axiomatized in a particular form. For instance, a number of computer scientists have questioned whether formal properties can be defined to distinguish one type of adequacy criteria from another when criteria are expressed in a particular form such as predicates [Hamlet 1989; Zweben and Gourlay 1989].

Therefore we believe that axiomatization will improve our understanding of software testing and clarify our notion of test adequacy. For example, as pointed out by Parrish and Zweben [1993], the investigation of the applicability of test data adequacy criteria is an excellent example of the role that axiomatization has played. Applicability was proposed as an axiom by Weyuker in 1986, requiring a criterion to be applicable to any program in the sense of the existence of test sets that satisfy the criterion. In the assessment of test data adequacy criteria against this axiom, the data-flow adequacy criteria originally proposed by Rapps and Weyuker in 1985 were found not to be applicable. This led to the redefinition of the criteria and a reexamination of the power of the criteria [Frankl and Weyuker 1988]. Weyuker's applicability requirements were defined on the assumption that the program has a finite representable input data space; therefore any adequate test set is finite. When this assumption is removed and adequacy criteria are considered as measurements, Zhu and Hall [1993] used the term finite applicability to denote the requirement that for any software and any adequacy degree r less than 1 there is a finite test set whose adequacy is greater than r. They also found that finite applicability can be derived from other, more fundamental properties of test data adequacy criteria.
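Stated formally (our paraphrase of the requirement, not Zhu and Hall's wording, using the adequacy-measurement notation $C_p^s(t)$ that appears in Section 6.3 below), finite applicability reads:

$\forall p,\ \forall s,\ \forall r < 1:\ \exists \text{ a finite test set } t \text{ such that } C_p^s(t) > r.$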

6.1 Weyuker’s Axiomatization of Program-Based Adequacy Criteria

Weyuker [1986] proposed an informal axiom system of test adequacy criteria. Her original purpose for the system was to present a set of properties of ideal program-based test adequacy criteria and to use these properties to assess existing criteria. Regarding test adequacy criteria as stopping rules, Weyuker's axiom system consists of eleven axioms, which were later examined, formalized, and revised by Parrish and Zweben [1991; 1993].

The most fundamental properties of adequacy criteria proposed by Weyuker were those concerning applicability. She distinguished the following three applicability properties of test adequacy criteria.

AXIOM A1 (Applicability). For every program, there exists an adequate test set.

Assuming the finiteness of representable points in the input data space, Weyuker rephrased the axiom into the following equivalent form.

AXIOM A1. For every program, there exists a finite adequate test set.

Weyuker then analyzed exhaustive testing and pointed out that, although exhaustive testing is adequate, an adequacy criterion should not always ask for exhaustive testing. Hence she defined the following nonexhaustive applicability.

AXIOM A2 (Nonexhaustive Applicability). There is a program p and a test set t such that p is adequately tested by t and t is not an exhaustive test set.

Notice that by exhaustive testing Weyuker meant the test set of all representable points of the specification domain. The property of central importance in Weyuker's system is monotonicity.

AXIOM A3 (Monotonicity). If t is adequate for p and t ⊆ t′, then t′ is adequate for p.

AXIOM A4 (Inadequate Empty Set). The empty set is not adequate for any program.
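As a concrete and deliberately tiny illustration of what these axioms require of a criterion, the following sketch (not from Weyuker's article) models a hypothetical program as a map from inputs to the statements they execute, takes "adequate" to mean "every statement is covered," and checks A1 through A4 by brute force.

```python
# Toy illustration: a hypothetical program is modeled as a map from each
# input to the set of statements it executes, and a test set is taken to
# be adequate iff it covers every statement.
from itertools import combinations

executes = {0: {"s1", "s2"}, 1: {"s1", "s3"}, 2: {"s1", "s2", "s3"}}
domain = set(executes)
all_statements = set().union(*executes.values())

def adequate(test_set):
    covered = set().union(*(executes[x] for x in test_set), set())
    return covered >= all_statements

subsets = [set(c) for r in range(len(domain) + 1)
           for c in combinations(domain, r)]

# A1: some finite adequate test set exists (here, the whole finite domain).
assert adequate(domain)
# A2: non-exhaustive applicability -- {2} is adequate but not exhaustive.
assert adequate({2}) and {2} != domain
# A3: monotonicity -- every superset of an adequate test set is adequate.
for t1 in subsets:
    for t2 in subsets:
        if t1 <= t2 and adequate(t1):
            assert adequate(t2)
# A4: the empty test set is not adequate.
assert not adequate(set())
```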

Weyuker then studied the relationships between test adequacy and program syntactic structure and semantics. However, her axioms were rather negative. They stated that neither semantic closeness nor syntactic structure closeness is sufficient to ensure that two programs require the same test data. Moreover, adequately testing a program does not imply that the components of the program are adequately tested, nor does adequate testing of components imply adequate testing of the program.

Weyuker assessed some test adequacy criteria against the axioms. The main discovery of the assessment is that the mutation adequacy criterion defined in its original form does not satisfy the monotonicity and applicability axioms, because it requires a correctness condition. The correctness condition requires that the program under test produce correct output on a test set if the test set is to be considered adequate. For all the adequacy criteria we have discussed, it is possible for a program to fail on a test case after producing correct outputs on an adequate test set. Hence the correctness condition causes a conflict with monotonicity. The correctness condition does not play any fundamental role in mutation adequacy, so Weyuker suggested removal of this condition from the definition.8

Aware of the insufficiency of the axiom system, Weyuker [1988a] later proposed three additional axioms. The renaming property requires that a test set be adequate for a program p if and only if it is adequate for a program obtained by systematically renaming the variables in p. The complexity property requires that for every natural number n, there is a program p such that p is adequately tested by a size-n test set but not by any size n − 1 test set. The statement coverage property requires that all adequate test sets cover all feasible statements of the program under test. These axioms were intended to rule out some "unsuitable notions of adequacy."
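For instance, the complexity property can be written out as follows (our paraphrase, with |t| denoting the size of test set t):

$\forall n\ \exists p:\ \bigl(\exists t:\ |t| = n \text{ and } t \text{ is adequate for } p\bigr)\ \text{and}\ \bigl(\forall t':\ |t'| = n - 1 \Rightarrow t' \text{ is not adequate for } p\bigr).$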

Given the fact that Weyuker's axioms are unable to distinguish most adequacy criteria, Hamlet [1989] and Zweben and Gourlay [1989] questioned whether the axiomatic approach could provide a useful comparison of adequacy criteria. Recently, some of Weyuker's axioms were combined with Baker et al.'s [1986] properties for assessing control-flow adequacy criteria [Zhu 1995a]. Control-flow adequacy criteria with subtle differences were distinguished. The assessment suggested that the cycle combination criterion was the most favorable.

6.2 Parrish and Zweben's Formalization and Analysis of Weyuker's Axioms

Parrish and Zweben [1991; 1993] formalized Weyuker's axioms, but put them in a more general framework that involved specifications as well. They defined the notions of program-independent criteria and specification-independent criteria. A program-independent criterion measures software test adequacy independently of the program, whereas a specification-independent criterion measures software test adequacy independently of the specification. Parrish and Zweben explored the interdependence relations between the notions and the axioms. Their work illustrated the complex levels of analysis that an axiomatic theory of software testing can make possible. Formalization is not merely a mathematical exercise, because the basic notions of software testing have to be expressed precisely and clearly so that they can be critically analyzed. For example, informally, a program-limited criterion requires only test data selected in the domain of the program, and a specification-limited criterion requires only test data selected in the domain of the specification. Parrish and Zweben formally defined the notion to be that if a test case falls outside the domain, the test set is considered inadequate. This definition is counterintuitive and conflicts with the monotonicity axiom. A more logical definition is that test cases outside the domain are not taken into account in determining test adequacy [Zhu and Hall 1993]. Parrish and Zweben [1993] also formalized the notion of the correctness condition such that a test set is inadequate if it contains a test case on which the program is incorrect. They tried to put it into the axiom system by modifying the monotonicity axiom so that it applies only to test sets on which the software is correct. However, Weyuker [1986] argued that the correctness condition plays no role in determining test adequacy; it should not be an axiom.

6.3 Zhu and Hall’s Measurement Theory

Based on the mathematical theory of measurement, Zhu and Hall [1993; Zhu et al. 1993] proposed a set of axioms for the measurement of software test adequacy. The mathematical theory of measurement is the study of the logical and philosophical concepts underlying measurement as it applies in all sciences. It studies the conceptual structures of measurement systems and their properties related to the validation of the use of measurements [Roberts 1979; Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990]. Therefore the measurement of software test adequacy should conform to the theory. In fact, recent years have seen rapid growth of rigorous applications of measurement theory to software metrics [Fenton 1991].

(Footnote 8: The definition of mutation adequacy used in this article does not use the correctness condition.)

According to the theory, measurement is the assignment of numbers to properties of objects or events in the real world by means of an objective empirical operation. The modern form of measurement theory is representational; that is, numbers assigned to objects/events must represent the perceived relation between the properties of those objects/events. Readers are referred to Roberts' [1979] book for various applications of the theory and Krantz et al.'s trilogy [1971; Suppes et al. 1989; Luce et al. 1990] for a systematic treatment of the subject. Usually, such a theory comprises three main sections: the description of an empirical relational system, a representation theorem, and a uniqueness condition.

An empirical relational system Q = (Q, R) consists of a set Q of manifestations of the property or attribute and a family of relations R = {R1, R2, . . . , Rn} on Q. The family R of relations provides the basic properties of the manifestations with respect to the property to be measured. They are usually expressed as axioms derived from empirical knowledge of the real world. For software testing, the manifestations are the software tests P × S × T, which consist of a set of programs P, a set of specifications S, and a set of test sets T. Zhu et al. [1994] defined a relation ≤ on the space of software testing such that t1 ≤ t2 means that test t2 is at least as adequate as t1. It was argued that the relation has the properties of reflexivity, transitivity, comparability, boundedness, equal inadequacy for empty tests, and equal adequacy for exhaustive tests. Therefore this relation is a total ordering with maximal and minimal elements.

To assign numbers to the objects in the empirical system, a numerical system N = (N, G) must be defined so that the measurement mapping is a homomorphism from the empirical system to the numerical system. It has been proved that for any numerical system $(N, \le_N)$ and measurement g of test adequacy on N, there is a measurement m on the unit interval of real numbers $([0,1], \le)$ such that $g = t \circ m$, where t is an isomorphism between $(N, \le_N)$ and $([0,1], \le)$ [Zhu and Hall 1993; Zhu et al. 1994; Zhu 1995b]. Therefore properties of test adequacy measurements can be obtained by studying adequacy measurements on the real unit interval.

Zhu and Hall's axiom system for adequacy measurements on the real unit interval consists of the following axioms.

AXIOM B1 (Inadequacy of Empty Test Set). For all programs p and specifications s, the adequacy of the empty test set is 0.

AXIOM B2 (Adequacy of Exhaustive Testing). For all programs p and specifications s, the adequacy of the exhaustive test set D is 1.

AXIOM B3 (Monotonicity). For all programs p and specifications s, if test set t1 is a subset of test set t2, then the adequacy of t1 is less than or equal to the adequacy of t2.

AXIOM B4 (Convergence). Let t1, t2, . . . , tn, . . . ∈ T be test sets such that t1 ⊆ t2 ⊆ . . . ⊆ tn ⊆ . . . . Then, for all programs p and specifications s,

$\lim_{n \to \infty} C_p^s(t_n) = C_p^s\left(\bigcup_{n=1}^{\infty} t_n\right),$

where $C_p^s(t)$ is the adequacy of test set t for testing program p with respect to specification s.

AXIOM B5 (Law of Diminishing Returns). The more a program has been tested, the less a given test set can further contribute to the test adequacy. Formally, for all programs p, specifications s, and test sets t,

$\forall c, d \in T.\ \left(c \subseteq d \Rightarrow C_p^s(t \mid c) \ge C_p^s(t \mid d)\right),$

where $C_p^s(t \mid c) = C_p^s(t \cup c) - C_p^s(c)$.

This axiom system has been proved to be consistent [Zhu and Hall 1993; Zhu et al. 1994]. From these axioms, a number of properties of test data adequacy criteria can be proved. For example, if a test data adequacy criterion satisfies Axiom B2 and Axiom B4, then it is finitely applicable, which means that for any real number r less than 1, there is a finite test set whose adequacy is greater than or equal to r. Therefore an adequacy criterion can fail to be finitely applicable for one of two reasons: not satisfying the adequacy of exhaustive testing (Axiom B2), for example by requiring coverage of infeasible elements, or not satisfying the convergence property (Axiom B4), as with the path-coverage criterion. Another example of a derivable property is subadditivity, which can be derived from the law of diminishing returns. Here an adequacy measurement is subadditive if the adequacy of the union of two test sets t1 and t2 is less than or equal to the sum of the adequacy of t1 and the adequacy of t2.
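As a small sanity check of how a familiar criterion fits these axioms, the sketch below treats the statement-coverage ratio as an adequacy measurement on a hypothetical three-input program and verifies B1, B2, B3, and subadditivity by enumeration. It is an illustration under these assumptions, not part of Zhu and Hall's development.

```python
# Illustration only: the statement-coverage ratio as an adequacy measurement
# C(t) in [0, 1] for a hypothetical program, checked against axioms B1-B3
# and the derived subadditivity property by enumerating all test sets.
from itertools import combinations

executes = {0: {"s1"}, 1: {"s1", "s2"}, 2: {"s3"}}
domain = set(executes)
all_stmts = set().union(*executes.values())

def C(test_set):
    covered = set().union(*(executes[x] for x in test_set), set())
    return len(covered) / len(all_stmts)

subsets = [set(c) for r in range(len(domain) + 1)
           for c in combinations(domain, r)]

assert C(set()) == 0.0      # B1: the empty test set has adequacy 0
assert C(domain) == 1.0     # B2: the exhaustive test set has adequacy 1
for t1 in subsets:
    for t2 in subsets:
        if t1 <= t2:
            assert C(t1) <= C(t2)                        # B3: monotonicity
        assert C(t1 | t2) <= C(t1) + C(t2) + 1e-12       # subadditivity
```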

The measurement theory of software test adequacy was concerned not only with the properties derivable from the axioms, but also with measurement-theoretical properties of the axiom systems. Zhu et al. [1994] proved a uniqueness theorem that characterizes the admissible transformations between any two adequacy measurements that satisfy the axioms. However, the irregularity of adequacy measurements was also formally proved; that is, not all adequacy measurements can be transformed from one to another. In fact, few existing adequacy criteria are convertible from one to another. According to measurement theory, these two theorems lay the foundation for the investigation of the meaningfulness of statements and statistical operations that involve test adequacy measurements [Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990].

The irregularity theorem confirms our intuition that the criteria are actually different approximations to the ideal measure of test adequacy. Therefore it is necessary to find the "errors" for each criterion. In an attempt to do so, Zhu et al. [1994] also investigated the properties that distinguish different classes of test adequacy criteria.

6.4 Semantics of Software Test Adequacy

The axioms of software test adequacy criteria characterize the notion of software test adequacy by a set of properties of adequacy. These axioms do not directly answer what test adequacy means. The model theory of mathematical logic [Chang and Keisler 1973] provides a way of assigning meanings to formal axioms. Recently, model theory was applied to assign meanings to the axioms of test adequacy criteria [Zhu 1996b, 1995b]. Software testing was interpreted as inductive inference, such that a finite number of observations made on the behavior of the software under test are used to inductively derive general properties of the software. In particular, Gold's [1967] inductive inference model of identification in the limit was used to interpret Weyuker's axioms, and a relationship between adequate testing and software correctness was obtained in the model [Zhu 1996b]. Valiant's [1984] PAC inductive inference model was used to interpret Zhu and Hall's axioms of adequacy measurements, and relationships between software reliability and test adequacy were obtained [Zhu 1995b].

7. CONCLUSION

Since Goodenough and Gerhart [1975, 1977] pointed out that test criteria are a central problem of software testing, test criteria have been a major focus of research on software testing. A large number of test data adequacy criteria have been proposed, and various rationales have been presented in support of one criterion or another. In this article, the various types of software test adequacy criteria proposed in the literature are reviewed. The research on comparison and evaluation of criteria is also surveyed. The notion of test adequacy is examined, and its roles in software testing practice and theoretical research are discussed. Although whether an adequacy criterion captures the true notion of test adequacy is still a matter of controversy, it appears that software test adequacy criteria as a whole are clearly linked to the fault-detecting ability of software testing as well as to the dependability of the software that has passed an adequate test. It was predicted that with the advent of a quality assurance framework based upon ISO 9001, which calls for specific mechanisms for defect removal, the acceptance of more formal measurement of the testing process can now be anticipated [Wichmann 1993]. There is a tendency towards systematic approaches to software testing through the use of test adequacy criteria.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their valuable comments.

APPENDIX

Glossary of Graph-Theory-Related Terminology

Complete computation path: A complete computation path is a path that starts with the begin node and ends at the end node of the flow graph. It is also called a computation path or execution path.

Concatenation of paths: If p = (n1, n2, . . . , ns) and q = (ns+1, . . . , nt) are two paths, and r = (n1, n2, . . . , ns, ns+1, . . . , nt) is also a path, we say that r is the concatenation of p to q and write r = p ∧ q.

Cycle: A path p = (n1, n2, . . . , nt, n1) is called a cycle.

Cycle-free path: A path is said to be cycle-free if it does not contain cycles as subpaths.

Directed graph, node, and edge: A directed graph consists of a set N of nodes and a set E ⊆ N × N of directed edges between nodes.

Elementary cycle and simple cycle: If the nodes n1, n2, . . . , nt in the cycle (n1, n2, . . . , nt, n1) are all different, then the cycle is called an elementary cycle. If the edges in the cycle are all different, then it is called a simple cycle.

Elementary path and simple path: A path is called an elementary path if the nodes in the path are all different. It is called a simple path if the edges in the path are all different.

Empty path: A path is called an empty path if its length is 1. In this case, the path contains no edge but only a node, and hence is written as (n1).

Feasible and infeasible paths: Not all complete computation paths necessarily correspond to an execution of the program. A feasible path is a complete computation path such that there exist input data that can cause the execution of the path. Otherwise, it is an infeasible path.

Flow graph: A flow graph is a directed graph that satisfies the following conditions: (1) it has a unique begin node, which has no inward edge; (2) it has a unique end node, which has no outward edge; and (3) every node in a flow graph must be on a path from the begin node to the end node.

Flow graph model of program structure: The nodes in a flow graph represent linear sequences of computations. The edges in the flow graph represent control transfers. Each edge is associated with a predicate that represents the condition of control transfer from the computation on the start node of the edge to that on the end node of the edge. The begin node and the end node in the flow graph are where the computation starts and finishes.

Inward edges and outward edges of a node: The edge ⟨n1, n2⟩ is an inward edge of the node n2. It is an outward edge of the node n1.

Length of a path: The length of a path (n1, n2, . . . , nt) is t.

Path: A path p in a graph is a sequence (n1, n2, . . . , nt) of nodes such that ⟨ni, ni+1⟩ is an edge in the graph, for all i = 1, 2, . . . , t − 1, t > 0.

Start node and end node of an edge: The node n1 is the start node of the edge ⟨n1, n2⟩ and n2 is the end node of the edge.

Start node and end node of a path: The start node of a path (n1, n2, . . . , nt) is n1, and nt is the end node of the path.

Strongly connected graph: A graph is strongly connected if for any two nodes a and b, there exists a path from a to b and a path from b to a.

Subpath: A subsequence nu, nu+1, . . . , nv of the path p = (n1, n2, . . . , nt) is called a subpath of p, where 1 ≤ u ≤ v ≤ t.

REFERENCES

ADRION, W. R., BRANSTAD, M. A., AND CHERNIAVSKY, J. C. 1982. Validation, verification, and testing of computer software. Comput. Surv. 14, 2 (June), 159–192.

AFIFI, F. H., WHITE, L. J., AND ZEIL, S. J. 1992.Testing for linear errors in nonlinear com-puter programs. In Proceedings of the 14thIEEE International Conference on SoftwareEngineering (May), 81–91.

AMLA, N. AND AMMANN, P. 1992. Using Z speci-fications in category partition testing. In Pro-ceedings of the Seventh Annual Conference onComputer Assurance (June), IEEE, 3–10.

AMMANN, P. AND OFFUTT, J. 1994. Using formalmethods to derive test frames in category-partition testing. In Proceedings of the NinthAnnual Conference on Computer Assurance(Gaithersburg, MD, June), IEEE, 69–79.

BACHE, R. AND MULLERBURG, M. 1990. Meas-ures of testability as a basis for quality assur-ance. Softw. Eng. J. (March), 86–92.

BAKER, A. L., HOWATT, J. W., AND BIEMAN, J. M. 1986. Criteria for finite sets of paths that characterize control flow. In Proceedings of the 19th Annual Hawaii International Conference on System Sciences, 158–163.

BASILI, V. R. AND RAMSEY, J. 1984. Structuralcoverage of functional testing. Tech. Rep. TR-1442, Department of Computer Science, Uni-versity of Maryland at College Park, Sept.

BASILI, V. R. AND SELBY, R. W. 1987. Com-paring the effectiveness of software testing.IEEE Trans. Softw. Eng. SE-13, 12 (Dec.),1278–1296.

BAZZICHI, F. AND SPADAFORA, I. 1982. An auto-matic generator for compiler testing. IEEETrans. Softw. Eng. SE-8, 4 (July), 343–353.

BEIZER, B. 1983. Software Testing Techniques.Van Nostrand Reinhold, New York.

BEIZER, B. 1984. Software System Testing andQuality Assurance. Van Nostrand Reinhold,New York.

BENGTSON, N. M. 1987. Measuring errors in op-erational analysis assumptions, IEEE Trans.Softw. Eng. SE-13, 7 (July), 767–776.

BENTLY, W. G. AND MILLER, E. F. 1993. CT cov-erage—initial results. Softw. Quality J. 2, 1,29–47.

BERNOT, G., GAUDEL, M. C., AND MARRE, B. 1991.Software testing based on formal specifica-tions: A theory and a tool. Softw. Eng. J.(Nov.), 387–405.

BIEMAN, J. M. AND SCHULTZ, J. L. 1992. An empirical evaluation (and specification) of the all du-paths testing criterion. Softw. Eng. J. (Jan.), 43–51.

BIRD, D. L. AND MUNOZ, C. U. 1983. Automaticgeneration of random self-checking test cases.IBM Syst. J. 22, 3.

BOUGE, L., CHOQUET, N., FRIBOURG, L., AND GAU-DEL, M.-C. 1986. Test set generation fromalgebraic specifications using logic program-ming. J. Syst. Softw. 6, 343–360.

BUDD, T. A. 1981. Mutation analysis: Ideas, ex-amples, problems and prospects. In ComputerProgram Testing, Chandrasekaran and Radic-chi, Eds., North Holland, 129–148.

BUDD, T. A. AND ANGLUIN, D. 1982. Two notionsof correctness and their relation to testing.Acta Inf. 18, 31–45.

BUDD, T. A., LIPTON, R. J., SAYWARD, F. G., AND DEMILLO, R. A. 1978. The design of a prototype mutation system for program testing. In Proceedings of National Computer Conference, 623–627.

CARVER, R. AND KUO-CHUNG, T. 1991. Replayand testing for concurrent programs. IEEESoftw. (March), 66–74.

CHAAR, J. K., HALLIDAY, M. J., BHANDARI, I. S., AND CHILLAREGE, R. 1993. In-process evaluation for software inspection and test. IEEE Trans. Softw. Eng. 19, 11, 1055–1070.

CHANDRASEKARAN, B. AND RADICCHI, S. (EDS.)1981. Computer Program Testing, North-Holland.

CHANG, C. C. AND KEISLER, H. J. 1973. Model Theory. North-Holland, Amsterdam.

CHANG, Y.-F. AND AOYAMA, M. 1991. Testing thelimits of test technology. IEEE Softw.(March), 9–11.

CHERNIAVSKY, J. C. AND SMITH, C. H. 1987. Arecursion theoretic approach to program test-ing. IEEE Trans. Softw. Eng. SE-13, 7 (July),777–784.

CHERNIAVSKY, J. C. AND SMITH, C. H. 1991. OnWeyuker’s axioms for software complexitymeasures. IEEE Trans. Softw. Eng. SE-17, 6(June), 636–638.

CHOI, B., MATHUR, A., AND PATTISON, B. 1989.PMothra: Scheduling mutants for executionon a hypercube. In Proceedings of SIGSOFTSymposium on Software Testing, Analysis andVerification 3 (Dec.) 58–65.

CHUSHO, T. 1987. Test data selection and qual-ity estimation based on the concept of essen-tial branches for path testing. IEEE Trans.Softw. Eng. SE-13, 5 (May), 509–517.

CLARKE, L. A., HASSELL, J., AND RICHARDSON, D. J.1982. A close look at domain testing. IEEETrans. Softw. Eng. SE-8, 4 (July), 380–390.

CLARKE, L. A., PODGURSKI, A., RICHARDSON, D. J.,AND ZEIL, S. J. 1989. A formal evaluation ofdata flow path selection criteria. IEEE Trans.Softw. Eng. 15, 11 (Nov.), 1318–1332.

CURRIT, P. A., DYER, M., AND MILLS, H. D. 1986.Certifying the reliability of software. IEEETrans. Softw. Eng. SE-6, 1 (Jan.) 2–13.

DAVIS, M. AND WEYUKER, E. 1988. Metric space-based test-data adequacy criteria. Comput. J.13, 1 (Feb.), 17–24.

DEMILLO, R. A. AND MATHUR, A. P. 1990. Onthe use of software artifacts to evaluate theeffectiveness of mutation analysis for detect-ing errors in production software. In Proceed-ings of 13th Minnowbrook Workshop on Soft-ware Engineering (July 24–27, BlueMountain Lake, NY), 75–77.

DEMILLO, R. A. AND OFFUTT, A. J. 1991. Con-straint-based automatic test data generation.IEEE Trans. Softw. Eng. 17, 9 (Sept.), 900–910.

DEMILLO, R. A. AND OFFUTT, A. J. 1993. Experimental results from an automatic test casegenerator. ACM Trans. Softw. Eng. Methodol.2, 2 (April), 109–127.

DEMILLO, R. A., GUINDI, D. S., MCCRACKEN, W. M., OFFUTT, A. J., AND KING, K. N. 1988. An extended overview of the Mothra software testing environment. In Proceedings of SIGSOFT Symposium on Software Testing, Analysis and Verification 2 (July), 142–151.

DEMILLO, R. A., LIPTON, R. J., AND SAYWARD,F. G. 1978. Hints on test data selection:Help for the practising programmer. Com-puter 11, (April), 34–41.

DEMILLO, R. A., MCCRACKEN, W. M., MATIN, R. J.,AND PASSUFIUME, J. F. 1987. Software Test-ing and Evaluation, Benjamin-Cummings,Redwood City, CA.

DENNEY, R. 1991. Test-case generation fromProlog-based specifications. IEEE Softw.(March), 49–57.

DIJKSTRA, E. W. 1972. Notes on structured pro-gramming. In Structured Programming, byO.-J. Dahl, E. W. Dijkstra, and C. A. R.Hoare, Academic Press.

DOWNS, T. 1985. An approach to the modellingof software testing with some applications.IEEE Trans. Softw. Eng. SE-11, 4 (April),375–386.

DOWNS, T. 1986. Extensions to an approach tothe modelling of software testing with someperformance comparisons. IEEE Trans.Softw. Eng. SE-12, 9 (Sept.), 979–987.

DOWNS, T. AND GARRONE, P. 1991. Some newmodels of software testing with performancecomparisons. IEEE Trans. Rel. 40, 3 (Aug.),322–328.

DUNCAN, I. M. M. AND ROBSON, D. J. 1990. Or-dered mutation testing. ACM SIGSOFTSoftw. Eng. Notes 15, 2 (April), 29–30.

DURAN, J. W. AND NTAFOS, S. 1984. An evalua-tion of random testing. IEEE Trans. Softw.Eng. SE-10, 4 (July), 438–444.

FENTON, N. 1992. When a software measure isnot a measure. Softw. Eng. J. (Sept.), 357–362.

FENTON, N. E. 1991. Software Metrics: A Rigorous Approach. Chapman & Hall, London.

FENTON, N. E., WHITTY, R. W., AND KAPOSI,A. A. 1985. A generalised mathematicaltheory of structured programming. Theor.Comput. Sci. 36, 145–171.

FORMAN, I. R. 1984. An algebra for data flowanomaly detection. In Proceedings of the Sev-enth International Conference on Software En-gineering (Orlando, FL), 250–256.

FOSTER, K. A. 1980. Error sensitive test caseanalysis (ESTCA). IEEE Trans. Softw. Eng.SE-6, 3 (May), 258–264.

FRANKL, P. G. AND WEISS, S. N. 1993. An exper-imental comparison of the effectiveness ofbranch testing and data flow testing. IEEETrans. Softw. Eng. 19, 8 (Aug.), 774–787.

FRANKL, P. G. AND WEYUKER, E. J. 1988. An applicable family of data flow testing criteria. IEEE Trans. Softw. Eng. SE-14, 10 (Oct.), 1483–1498.

FRANKL, P. G. AND WEYUKER, E. J. 1993a. A formal analysis of the fault-detecting ability of testing methods. IEEE Trans. Softw. Eng. 19, 3 (March), 202–213.

FRANKL, P. G. AND WEYUKER, E. J. 1993b. Prov-able improvements on branch testing. IEEETrans. Softw. Eng. 19, 10, 962–975.

FREEDMAN, R. S. 1991. Testability of softwarecomponents. IEEE Trans. Softw. Eng. SE-17,6 (June), 553–564.

FRITZSON, P., GYIMOTHY, T., KAMKAR, M., AND SHAHMEHRI, N. 1991. Generalized algorithmic debugging and testing. In Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation (Toronto, June 26–28).

FUJIWARA, S., V. BOCHMANN, G., KHENDEK, F.,AMALOU, M., AND GHEDAMSI, A. 1991. Testselection based on finite state models. IEEETrans. Softw. Eng. SE-17, 6 (June), 591–603.

GAUDEL, M.-C. AND MARRE, B. 1988. Algebraicspecifications and software testing: Theoryand application. In Rapport LRI 407.

GELPERIN, D. AND HETZEL, B. 1988. The growthof software testing. Commun. ACM 31, 6(June), 687–695.

GIRGIS, M. R. 1992. An experimental evalua-tion of a symbolic execution system. Softw.Eng. J. (July), 285–290.

GOLD, E. M. 1967. Language identification in the limit. Inf. Cont. 10, 447–474.

GOODENOUGH, J. B. AND GERHART, S. L. 1975. Toward a theory of test data selection. IEEE Trans. Softw. Eng. SE-3 (June).

GOODENOUGH, J. B. AND GERHART, S. L. 1977. Toward a theory of testing: Data selection criteria. In Current Trends in Programming Methodology, Vol. 2, R. T. Yeh, Ed., Prentice-Hall, Englewood Cliffs, NJ, 44–79.

GOPAL, A. AND BUDD, T. 1983. Program testingby specification mutation. Tech. Rep. TR 83-17, University of Arizona, Nov.

GOURLAY, J. 1983. A mathematical frameworkfor the investigation of testing. IEEE Trans.Softw. Eng. SE-9, 6 (Nov.), 686–709.

HALL, P. A. V. 1991. Relationship betweenspecifications and testing. Inf. Softw. Technol.33, 1 (Jan./Feb.), 47–52.

HALL, P. A. V. AND HIERONS, R. 1991. Formalmethods and testing. Tech. Rep. 91/16, Dept.of Computing, The Open University.

HAMLET, D. AND TAYLOR, R. 1990. Partitiontesting does not inspire confidence. IEEETrans. Softw. Eng. 16 (Dec.), 206–215.

HAMLET, D., GIFFORD, B., AND NIKOLIK, B. 1993.Exploring dataflow testing of arrays. In Pro-ceedings of 15th ICSE (May), 118–129.

HAMLET, R. 1989. Theoretical comparison of testing methods. In Proceedings of SIGSOFT Symposium on Software Testing, Analysis, and Verification 3 (Dec.), 28–37.

HAMLET, R. G. 1977. Testing programs with theaid of a compiler. IEEE Trans. Softw. Eng. 3,4 (July), 279–290.

HARROLD, M. J., MCGREGOR, J. D., AND FITZPATRICK, K. J. 1992. Incremental testing of object-oriented class structures. In Proceedings of 14th ICSE (May), 68–80.

HARROLD, M. J. AND SOFFA, M. L. 1990. Inter-procedural data flow testing. In Proceedingsof SIGSOFT Symposium on Software Testing,Analysis, and Verification 3 (Dec.), 158–167.

HARROLD, M. J. AND SOFFA, M. L. 1991. Select-ing and using data for integration testing.IEEE Softw. (March), 58–65.

HARTWICK, D. 1977. Test planning. In Proceed-ings of National Computer Conference, 285–294.

HAYES, I. J. 1986. Specification directed mod-ule testing. IEEE Trans. Softw. Eng. SE-12, 1(Jan.), 124–133.

HENNELL, M. A., HEDLEY, D., AND RIDDELL, I. J.1984. Assessing a class of software tools. InProceedings of the Seventh ICSE, 266–277.

HERMAN, P. 1976. A data flow analysis ap-proach to program testing. Aust. Comput. J. 8,3 (Nov.), 92–96.

HETZEL, W. 1984. The Complete Guide to Soft-ware Testing, Collins.

HIERONS, R. 1992. Software testing from formalspecification. Ph.D. Thesis, Brunel Univer-sity, UK.

HOFFMAN, D. M. AND STROOPER, P. 1991. Auto-mated module testing in Prolog. IEEE Trans.Softw. Eng. 17, 9 (Sept.), 934–943.

HORGAN, J. R. AND LONDON, S. 1991. Data flowcoverage and the C language. In Proceedingsof TAV4 (Oct.), 87–97.

HORGAN, J. R. AND MATHUR, A. P. 1992. Assess-ing testing tools in research and education.IEEE Softw. (May), 61–69.

HOWDEN, W. E. 1975. Methodology for the gen-eration of program test data. IEEE Trans.Comput. 24, 5 (May), 554–560.

HOWDEN, W. E. 1976. Reliability of the pathanalysis testing strategy. IEEE Trans. Softw.Eng. SE-2, (Sept.), 208–215.

HOWDEN, W. E. 1977. Symbolic testing and theDISSECT symbolic evaluation system. IEEETrans. Softw. Eng. SE-3 (July), 266–278.

HOWDEN, W. E. 1978a. Algebraic program test-ing. ACTA Inf. 10, 53–66.

HOWDEN, W. E. 1978b. Theoretical and empiri-cal studies of program testing. IEEE Trans.Softw. Eng. SE-4, 4 (July), 293–298.

HOWDEN, W. E. 1978c. An evaluation of the ef-fectiveness of symbolic testing. Softw. Pract.Exper. 8, 381–397.

HOWDEN, W. E. 1980a. Functional programtesting. IEEE Trans. Softw. Eng. SE-6, 2(March), 162–169.

HOWDEN, W. E. 1980b. Functional testing anddesign abstractions. J. Syst. Softw. 1, 307–313.

HOWDEN, W. E. 1981. Completeness criteria fortesting elementary program functions. In Pro-ceedings of Fifth International Conference onSoftware Engineering (March), 235–243.

HOWDEN, W. E. 1982a. Validation of scientificprograms. Comput. Surv. 14, 2 (June), 193–227.

HOWDEN, W. E. 1982b. Weak mutation testingand completeness of test sets. IEEE Trans.Softw. Eng. SE-8, 4 (July), 371–379.

HOWDEN, W. E. 1985. The theory and practice offunctional testing. IEEE Softw. (Sept.), 6–17.

HOWDEN, W. E. 1986. A functional approach toprogram testing and analysis. IEEE Trans.Softw. Eng. SE-12, 10 (Oct.), 997–1005.

HOWDEN, W. E. 1987. Functional program test-ing and analysis. McGraw-Hill, New York.

HUTCHINS, M., FOSTER, H., GORADIA, T., AND OSTRAND, T. 1994. Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. In Proceedings of 16th IEEE International Conference on Software Engineering (May).

INCE, D. C. 1987. The automatic generation oftest data. Comput. J. 30, 1, 63–69.

INCE, D. C. 1991. Software testing. In SoftwareEngineer’s Reference Book, J. A. McDermid,Ed., Butterworth-Heinemann (Chapter 19).

KARASIK, M. S. 1985. Environmental testingtechniques for software certification. IEEETrans. Softw. Eng. SE-11, 9 (Sept.), 934–938.

KEMMERER, R. A. 1985. Testing formal specifi-cations to detect design errors. IEEE Trans.Softw. Eng. SE-11, 1 (Jan.), 32–43.

KERNIGHAN, B. W. AND PLAUGER, P. J. 1981. Software Tools in Pascal. Addison-Wesley, Reading, MA.

KING, K. N. AND OFFUTT, A. J. 1991. A FOR-TRAN language system for mutation-basedsoftware testing. Softw. Pract. Exper. 21, 7(July), 685–718.

KOREL, B., WEDDE, H., AND FERGUSON, R. 1992.Dynamic method of test data generation fordistributed software. Inf. Softw. Tech. 34, 8(Aug.), 523–532.

KOSARAJU, S. 1974. Analysis of structured pro-grams. J. Comput. Syst. Sci. 9, 232–255.

KRANTZ, D. H., LUCE, R. D., SUPPES, P., AND TVERSKY, A. 1971. Foundations of Measurement, Vol. 1: Additive and Polynomial Representations. Academic Press, New York.

KRAUSER, E. W., MATHUR, A. P., AND REGO, V. J.1991. High performance software testing onSIMD machines. IEEE Trans. Softw. Eng.SE-17, 5 (May), 403–423.

LASKI, J. 1989. Testing in the program develop-ment cycle. Softw. Eng. J. (March), 95–106.

LASKI, J. AND KOREL, B. 1983. A data flow ori-

ented program testing strategy. IEEE Trans.Softw. Eng. SE-9, (May), 33–43.

LASKI, J., SZERMER, W., AND LUCZYCKI, P.1993. Dynamic mutation testing in inte-grated regression analysis. In Proceedings of15th International Conference on SoftwareEngineering (May), 108–117.

LAUTERBACH, L. AND RANDALL, W. 1989. Ex-perimental evaluation of six test techniques.In Proceedings of COMPASS 89 (Washington,DC, June), 36–41.

LEVENDEL, Y. 1991. Improving quality with amanufacturing process. IEEE Softw. (March),13–25.

LINDQUIST, T. E. AND JENKINS, J. R. 1987. Testcase generation with IOGEN. In Proceedingsof the 20th Annual Hawaii International Con-ference on System Sciences, 478–487.

LITTLEWOOD, B. AND STRIGINI, L. 1993. Valida-tion of ultra-high dependability for software-based systems. C ACM 36, 11 (Nov.), 69–80.

LIU, L.-L. AND ROBSON, D. J. 1989. Symbolicevaluation in software testing, the final re-port. Computer Science Tech. Rep. 10/89,School of Engineering and Applied Science,University of Durham, June.

LUCE, R. D., KRANTZ, D. H., SUPPES, P., AND TVERSKY, A. 1990. Foundations of Measurement, Vol. 3: Representation, Axiomatization, and Invariance. Academic Press, San Diego.

MALAIYA, Y. K., VON MAYRHAUSER, A., AND SRIMANI, P. K. 1993. An examination of fault exposure ratio. IEEE Trans. Softw. Eng. 19, 11, 1087–1094.

MARICK, B. 1991. The weak mutation hypothe-sis. In Proceedings of SIGSOFT Symposiumon Software Testing, Analysis, and Verifica-tion 4 (Oct.), 190–199.

MATHUR, A. P. 1991. Performance, effective-ness, and reliability issues in software test-ing. In Proceedings of the 15th Annual Inter-national Computer Software and ApplicationsConference (Tokyo, Sept.), 604–605.

MARSHALL, A. C. 1991. A Conceptual model ofsoftware testing. J. Softw. Test. Ver. Rel. 1, 3(Dec.), 5–16.

MCCABE, T. J. 1976. A complexity measure.IEEE Trans. Softw. Eng. SE-2, 4, 308–320.

MCCABE, T. J. (ED.) 1983. Structured Testing.IEEE Computer Society Press, Los Alamitos,CA.

MCCABE, T. J. AND SCHULMEYER, G. G. 1985.System testing aided by structured analysis:A practical experience. IEEE Trans. Softw.Eng. SE-11, 9 (Sept.), 917–921.

MCMULLIN, P. R. AND GANNON, J. D. 1983. Com-bining testing with specifications: A casestudy. IEEE Trans. Softw. Eng. SE-9, 3(May), 328–334.

MEEK, B. AND SIU, K. K. 1988. The effectiveness of error seeding. Alvey Project SE/064: Quality evaluation of programming language processors, Report No. 2, Computing Centre, King's College London, Oct.

MILLER, E. AND HOWDEN, W. E. 1981. Tutorial:Software Testing and Validation Techniques,(2nd ed.). IEEE Computer Society Press, LosAlamitos, CA.

MILLER, K. W., MORELL, L. J., NOONAN, R. E., PARK, S. K., NICOL, D. M., MURRILL, B. W., AND VOAS, J. M. 1992. Estimating the probability of failure when testing reveals no failures. IEEE Trans. Softw. Eng. 18, 1 (Jan.), 33–43.

MORELL, L. J. 1990. A theory of fault-basedtesting. IEEE Trans. Softw. Eng. 16, 8 (Aug.),844–857.

MYERS, G. J. 1977. An extension to the cyclo-matic measure of program complexity. SIG-PLAN No. 12, 10, 61–64.

MYERS, G. J. 1979. The Art of Software Testing.John Wiley and Sons, New York.

MYERS, J. P., JR. 1992. The complexity of soft-ware testing. Softw. Eng. J. (Jan.), 13–24.

NTAFOS, S. C. 1984. An evaluation of requiredelement testing strategies. In Proceedings ofthe Seventh International Conference on Soft-ware Engineering, 250–256.

NTAFOS, S. C. 1984. On required element test-ing. IEEE Trans. Softw. Eng. SE-10, 6 (Nov.),795–803.

NTAFOS, S. C. 1988. A comparison of somestructural testing strategies. IEEE Trans.Softw. Eng. SE-14 (June), 868–874.

OFFUTT, A. J. 1989. The coupling effect: Fact orfiction. In Proceedings of SIGSOFT Sympo-sium on Software Testing, Analysis, and Veri-fication 3 (Dec. 13–15), 131–140.

OFFUTT, A. J. 1992. Investigations of the soft-ware testing coupling effect. ACM Trans.Softw. Eng. Methodol. 1, 1 (Jan.), 5–20.

OFFUTT, A. J. AND LEE, S. D. 1991. How strongis weak mutation? In Proceedings of SIG-SOFT Symposium on Software Testing, Anal-ysis, and Verification 4 (Oct.), 200–213.

OFFUTT, A. J., ROTHERMEL, G., AND ZAPF, C. 1993. An experimental evaluation of selective mutation. In Proceedings of 15th ICSE (May), 100–107.

OSTERWEIL, L. AND CLARKE, L. A. 1992. A pro-posed testing and analysis research initiative.IEEE Softw. (Sept.), 89–96.

OSTRAND, T. J. AND BALCER, M. J. 1988. Thecategory-partition method for specifying andgenerating functional tests. Commun. ACM31, 6 (June), 676–686.

OSTRAND, T. J. AND WEYUKER, E. J. 1991. Data-flow-based test adequacy analysis for lan-guages with pointers. In Proceedings of SIG-SOFT Symposium on Software Testing, Anal-ysis, and Verification 4, (Oct.), 74–86.

OULD, M. A. AND UNWIN, C., EDS. 1986. Testingin Software Development. Cambridge Univer-sity Press, New York.

PAIGE, M. R. 1975. Program graphs, an alge-bra, and their implication for programming.IEEE Trans. Softw. Eng. SE-1, 3, (Sept.),286–291.

PAIGE, M. R. 1978. An analytical approach tosoftware testing. In Proceedings COMP-SAC’78, 527–532.

PANDI, H. D., RYDER, B. G., AND LANDI, W.1991. Interprocedural Def-Use associationsin C programs. In Proceedings of SIGSOFTSymposium on Software Testing, Analysis,and Verification 4, (Oct.), 139–153.

PARRISH, A. AND ZWEBEN, S. H. 1991. Analysis and refinement of software test data adequacy properties. IEEE Trans. Softw. Eng. SE-17, 6 (June), 565–581.

PARRISH, A. S. AND ZWEBEN, S. H. 1993. Clarifying some fundamental concepts in software testing. IEEE Trans. Softw. Eng. 19, 7 (July), 742–746.

PETSCHENIK, N. H. 1985. Practical priorities insystem testing. IEEE Softw. (Sept.), 18–23.

PIWOWARSKI, P., OHBA, M., AND CARUSO, J. 1993.Coverage measurement experience duringfunction testing. In Proceedings of the 15thICSE (May), 287–301.

PODGURSKI, A. AND CLARKE, L. 1989. The impli-cations of program dependences for softwaretesting, debugging and maintenance. In Pro-ceedings of SIGSOFT Symposium on SoftwareTesting, Analysis, and Verification 3, (Dec.),168–178.

PODGURSKI, A. AND CLARKE, L. A. 1990. A for-mal model of program dependences and itsimplications for software testing, debuggingand maintenance. IEEE Trans. Softw. Eng.16, 9 (Sept.), 965–979.

PRATHER, R. E. AND MYERS, J. P. 1987. The pathprefix software testing strategy. IEEE Trans.Softw. Eng. SE-13, 7 (July).

PROGRAM ANALYSIS LTD., UK. 1992. Testbedtechnical description. May.

RAPPS, S. AND WEYUKER, E. J. 1985. Selecting software test data using data flow information. IEEE Trans. Softw. Eng. SE-11, 4 (April), 367–375.

RICHARDSON, D. J. AND CLARKE, L. A. 1985.Partition analysis: A method combining test-ing and verification. IEEE Trans. Softw. Eng.SE-11, 12 (Dec.), 1477–1490.

RICHARDSON, D. J., AHA, S. L., AND O’MALLEY, T. O.1992. Specification-based test oracles for re-active systems. In Proceedings of 14th Inter-national Conference on Software Engineering(May), 105–118.

RICHARDSON, D. J. AND THOMPSON, M. C. 1988. The RELAY model of error detection and its application. In Proceedings of SIGSOFT Symposium on Software Testing, Analysis, and Verification 2 (July).

RICHARDSON, D. J. AND THOMPSON, M. C. 1993.An analysis of test data selection criteria us-ing the relay model of fault detection. IEEETrans. Softw. Eng. 19, 6, 533–553.

RIDDELL, I. J., HENNELL, M. A., WOODWARD, M. R.,AND HEDLEY, D. 1982. Practical aspects ofprogram mutation. Tech. Rep., Dept. of Compu-tational Science, University of Liverpool, UK.

ROBERTS, F. S. 1979. Measurement Theory, Encyclopedia of Mathematics and Its Applications, Vol. 7. Addison-Wesley, Reading, MA.

ROE, R. P. AND ROWLAND, J. H. 1987. Some the-ory concerning certification of mathematicalsubroutines by black box testing. IEEE Trans.Softw. Eng. SE-13, 6 (June), 677–682.

ROUSSOPOULOS, N. AND YEH, R. T. 1985. SEES:A software testing environment support system.IEEE Trans. Softw. Eng. SE-11, 4, (April), 355–366.

RUDNER, B. 1977. Seeding/tagging estimationof software errors: Models and estimates.Rome Air Development Centre, Rome, NY,RADC-TR-77-15, also AD-A036 655.

SARIKAYA, B., BOCHMANN, G. V., AND CERNY,E. 1987. A test design methodology for pro-tocol testing. IEEE Trans. Softw. Eng. SE-13,5 (May), 518–531.

SHERER, S. A. 1991. A cost-effective approach totesting. IEEE Softw. (March), 34–40.

SOFTWARE RESEARCH. 1992. Software Test-Works—Software Testers Workbench System.Software Research, Inc.

SOLHEIM, J. A. AND ROWLAND, J. H. 1993. Anempirical-study of testing and integrationstrategies using artificial software systems.IEEE Trans. Softw. Eng. 19, 10, 941–949.

STOCKS, P. A. AND CARRINGTON, D. A. 1993. Testtemplates: A specification-based testingframework. In Proceedings of 15th Interna-tional Conference on Software Engineering(May), 405–414.

SU, J. AND RITTER, P. R. 1991. Experience intesting the Motif interface. IEEE Softw.(March), 26–33.

SUPPES, P., KRANTZ, D. H., LUCE, R. D., AND TVERSKY, A. 1989. Foundations of Measurement, Vol. 2: Geometrical, Threshold, and Probabilistic Representations. Academic Press, San Diego.

TAI, K.-C. 1993. Predicate-based test genera-tion for computer programs. In Proceedings of15th International Conference on SoftwareEngineering (May), 267–276.

TAKAHASHI, M. AND KAMAYACHI, Y. 1985. Anempirical study of a model for program errorprediction. IEEE, 330–336.

THAYER, R., LIPOW, M., AND NELSON, E. 1978.Software Reliability. North-Holland.

TSAI, W. T., VOLOVIK, D., AND KEEFE, T. F.1990. Automated test case generation forprograms specified by relational algebra que-ries. IEEE Trans. Softw. Eng. 16, 3 (March),316–324.

TSOUKALAS, M. Z., DURAN, J. W., AND NTAFOS,S. C. 1993. On some reliability estimationproblems in random and partition testing.IEEE Trans. Softw. Eng. 19, 7 (July), 687–697.

URAL, H. AND YANG, B. 1988. A structural testselection criterion. Inf. Process. Lett. 28, 3(July), 157–163.

URAL, H. AND YANG, B. 1993. Modeling softwarefor accurate data flow representation. In Pro-ceedings of 15th International Conference onSoftware Engineering (May), 277–286.

VALIANT, L. G. 1984. A theory of the learnable. Commun. ACM 27, 11, 1134–1142.

VOAS, J., MORRELL, L., AND MILLER, K. 1991.Predicting where faults can hide from testing.IEEE Softw. (March), 41–48.

WEISER, M. D., GANNON, J. D., AND MCMULLIN,P. R. 1985. Comparison of structural testcoverage metrics. IEEE Softw. (March), 80–85.

WEISS, S. N. AND WEYUKER, E. J. 1988. An ex-tended domain-based model of software reli-ability. IEEE Trans. Softw. Eng. SE-14, 10(Oct.), 1512–1524.

WEYUKER, E. J. 1979a. The applicability of program schema results to programs. Int. J. Comput. Inf. Sci. 8, 5, 387–403.

WEYUKER, E. J. 1979b. Translatability and decidability questions for restricted classes of program schema. SIAM J. Comput. 8, 5, 587–598.

WEYUKER, E. J. 1982. On testing non-testable programs. Comput. J. 25, 4, 465–470.

WEYUKER, E. J. 1983. Assessing test data adequacy through program inference. ACM Trans. Program. Lang. Syst. 5, 4 (Oct.), 641–655.

WEYUKER, E. J. 1986. Axiomatizing software test data adequacy. IEEE Trans. Softw. Eng. SE-12, 12 (Dec.), 1128–1138.

WEYUKER, E. J. 1988a. The evaluation of program-based software test data adequacy criteria. Commun. ACM 31, 6 (June), 668–675.

WEYUKER, E. J. 1988b. Evaluating software complexity measures. IEEE Trans. Softw. Eng. SE-14, 9 (Sept.), 1357–1365.

WEYUKER, E. J. 1988c. An empirical study of the complexity of data flow testing. In Proceedings of SIGSOFT Symposium on Software Testing, Analysis, and Verification 2 (July), 188–195.

WEYUKER, E. J. 1993. More experience with data-flow testing. IEEE Trans. Softw. Eng. 19, 9, 912–919.

WEYUKER, E. J. AND DAVIS, M. 1983. A formal notion of program-based test data adequacy. Inf. Cont. 56, 52–71.

WEYUKER, E. J. AND JENG, B. 1991. Analyzingpartition testing strategies. IEEE Trans.Softw. Eng. 17, 7 (July), 703–711.

WEYUKER, E. J. AND OSTRAND, T. J. 1980. Theo-ries of program testing and the application ofrevealing sub-domains. IEEE Trans. Softw.Eng. SE-6, 3 (May), 236–246.

WHITE, L. J. 1981. Basic mathematical defini-tions and results in testing. In Computer Pro-gram Testing, B. Chandrasekaran and S.Radicchi, Eds., North-Holland, 13–24.

WHITE, L. J. AND COHEN, E. I. 1980. A domainstrategy for computer program testing. IEEETrans. Softw. Eng. SE-6, 3 (May), 247–257.

WHITE, L. J. AND WISZNIEWSKI, B. 1991. Pathtesting of computer programs with loops us-ing a tool for simple loop patterns. Softw.Pract. Exper. 21, 10 (Oct.).

WICHMANN, B. A. 1993. Why are there no measurement standards for software testing? Comput. Stand. Interfaces 15, 4, 361–364.

WICHMANN, B. A. AND COX, M. G. 1992. Prob-lems and strategies for software componenttesting standards. J. Softw. Test. Ver. Rel. 2,167–185.

WILD, C., ZEIL, S., CHEN, J., AND FENG, G. 1992.Employing accumulated knowledge to refinetest cases. J. Softw. Test. Ver. Rel. 2, 2 (July),53–68.

WISZNIEWSKI, B. W. 1985. Can domain testingovercome loop analysis? IEEE, 304–309.

WOODWARD, M. R. 1991. Concerning orderedmutation testing of relational operators. J.Softw. Test. Ver. Rel. 1, 3 (Dec.), 35–40.

WOODWARD, M. R. 1993. Errors in algebraicspecifications and an experimental mutationtesting tool. Softw. Eng. J. (July), 211–224.

WOODWARD, M. R. AND HALEWOOD, K. 1988.From weak to strong—dead or alive? An anal-ysis of some mutation testing issues. In Pro-ceedings of Second Workshop on SoftwareTesting, Verification and Analysis (July) 152–158.

WOODWARD, M. R., HEDLEY, D., AND HENNEL,M. A. 1980. Experience with path analysisand testing of programs. IEEE Trans. Softw.Eng. SE-6, 5 (May), 278–286.

WOODWARD, M. R., HENNEL, M. A., AND HEDLEY,D. 1980. A limited mutation approach toprogram testing. Tech. Rep. Dept. of Compu-tational Science, University of Liverpool.

YOUNG, M. AND TAYLOR, R. N. 1988. Combining static concurrency analysis with symbolic execution. IEEE Trans. Softw. Eng. SE-14, 10 (Oct.), 1499–1511.

ZEIL, S. J. 1983. Testing for perturbations ofprogram statements. IEEE Trans. Softw. Eng.SE-9, 3, (May), 335–346.

ZEIL, S. J. 1984. Perturbation testing for com-putation errors. In Proceedings of SeventhInternational Conference on Software Engi-neering (Orlando, FL), 257–265.

ZEIL, S. J. 1989. Perturbation techniques fordetecting domain errors. IEEE Trans. Softw.Eng. 15, 6 (June), 737–746.

ZEIL, S. J., AFIFI, F. H., AND WHITE, L. J.1992. Detection of linear errors via domaintesting. ACM Trans. Softw. Eng. Methodol. 1,4, (Oct.), 422–451.

ZHU, H. 1995a. Axiomatic assessment of control flow based software test adequacy criteria. Softw. Eng. J. (Sept.), 194–204.

ZHU, H. 1995b. An induction theory of software testing. Sci. China 38 (Supp.) (Sept.), 58–72.

ZHU, H. 1996a. A formal analysis of the subsume relation between software test adequacy criteria. IEEE Trans. Softw. Eng. 22, 4 (April), 248–255.

ZHU, H. 1996b. A formal interpretation of software testing as inductive inference. J. Softw. Test. Ver. Rel. 6 (July), 3–31.

ZHU, H. AND HALL, P. A. V. 1992a. Test data adequacy with respect to specifications and related properties. Tech. Rep. 92/06, Department of Computing, The Open University, UK, Jan.

ZHU, H. AND HALL, P. A. V. 1992b. Testability of programs: Properties of programs related to test data adequacy criteria. Tech. Rep. 92/05, Department of Computing, The Open University, UK, Jan.

ZHU, H. AND HALL, P. A. V. 1993. Test data adequacy measurements. Softw. Eng. J. 8, 1 (Jan.), 21–30.

ZHU, H., HALL, P. A. V., AND MAY, J. 1992. Inductive inference and software testing. J. Softw. Test. Ver. Rel. 2, 2 (July), 69–82.

ZHU, H., HALL, P. A. V., AND MAY, J. 1994. Understanding software test adequacy: An axiomatic and measurement approach. In Mathematics of Dependable Systems, Proceedings of IMA First Conference (Sept., London), Oxford University Press, Oxford.

ZWEBEN, S. H. AND GOURLAY, J. S. 1989. On the adequacy of Weyuker's test data adequacy axioms. IEEE Trans. Softw. Eng. SE-15, 4 (April), 496–501.

Received November 1994; revised March 1996; accepted October 1996
