
An Automated Model-Based Debugging Approach

Cemal Yilmaz and Clay Williams
IBM T. J. Watson Research Center
Hawthorne, NY 10532
{cyilmaz, clayw}@us.ibm.com

ABSTRACT

Program debugging is a difficult and time-consuming task. Our ultimate goal in this work is to help developers reduce the space of potential root causes for failures, which can, in turn, improve the turnaround time for bug fixes. We propose a novel and very different approach. Rather than focusing on how a program behaves by analyzing its source code and/or execution traces, we concentrate on how it should behave with respect to a given behavioral model. We identify and verify slices of the behavioral model that, if implemented incorrectly in the program, can potentially lead to failures. Not only do we identify functional differences between the program and its model, but we also provide a ranked list of diagnoses which might explain (or be associated with) these differences. Our experiments suggest that the proposed approach can be quite effective in reducing the search space for potential root causes for failures.

Categories and Subject Descriptors

D.2.5 [Testing and Debugging]: Debugging aids, Diagnostics

General Terms

Reliability, Experimentation, Algorithms

Keywords

Model-based problem determination, Fault localization, Automated debugging

1. INTRODUCTION

Program debugging is the process of identifying and fixing bugs. Identifying the root causes is the hardest, and thus the most expensive, part of debugging [13, 8]. Developers often take a slice of the statements involved in a failure, hypothesize a set of potential causes in an ad hoc manner, and iteratively verify and refine their hypotheses until the root causes are located. This process can be quite tedious and time-consuming.

Many approaches have been reported in the literature to facilitate fault localization. They all share the same ultimate goal, namely to narrow down the potential root causes for developers, but differ in how they achieve it. These approaches often leverage static and dynamic program analysis to detect anomalies or dependencies in the code [8, 5, 4], with one notable exception, namely Delta Debugging [13]. Delta Debugging is different in the sense that it is empirical. The fault localization information provided by these approaches often takes the form of slices of program states that may lead to failures [13], slices of automatically identified likely program invariants that are violated [8], or slices of the code that look suspicious [4]. Although these approaches can be quite effective, they suffer from two major limitations:

Inability to deal with conceptual errors. Current approaches mainly target coding errors. They may not track down missing or misinterpreted program requirements. Note that we here define a failure as the inability of a system or component to perform its required function. For example, consider the functional requirements of the deposit function for an automated teller machine (ATM). In its simplest form it can be expressed as balance = balance + amt, where balance is the balance of an account and amt is the amount to be deposited. Now assume that the implementation fails to update the balance or fails to commit the updated balance to a database. Tools that rely on static and dynamic analysis may not be able to localize this error, since, in general, what is not in the code/execution cannot be analyzed. Empirical tools may not localize it either; they often require at least one passing and one failing run in order to perform their functions, and in this case there may not be a passing run.

Dependence on source code or binaries. In one form or another, current approaches rely on access to source code or binaries. However, this is not always possible for programs composed of remote third-party components such as Web Services. As more and more systems are built from commercial COTS components and service-oriented architectures (SOA) gain momentum, the importance of being able to debug systems composed of many black boxes is magnified.

To overcome these shortcomings, we propose a novel and very different approach. Rather than focusing on how a program behaves by analyzing its source code and/or execution traces, we concentrate on how it should behave with respect to its behavioral model. As with current automated debugging approaches, our ultimate goal is to help developers reduce the space of potential root causes for a given failure.


[Figure 1 (diagram): an EFSM with three control states (uninitialized, activated, and authenticated) and the following transitions:
T1: activate(amt), amt > MAXBAL / err(INV_PARAM)
T2: activate(amt), amt <= MAXBAL / ack, bal = amt
T3: authenticate(pin), pin != PIN && tries < 2 / err(INV_PIN), tries++
T4: authenticate(pin), tries > 2 || (pin != PIN && tries = 2) / err(PURSE_LOCKED)
T5: authenticate(pin), pin = PIN && tries <= 2 / ack, tries = 0
T6: deposit(amt), bal + amt > MAXBAL / err(INV_PARAM)
T7: withdraw(amt), bal - amt < 0 / err(INV_PARAM)
T8: deposit(amt), bal + amt <= MAXBAL / ack(OK), bal += amt
T9: withdraw(amt), bal - amt >= 0 / ack(OK), bal -= amt]

Figure 1: EFSM model of an electronic purse application written for the Java Card platform.

The way we do this, however, is by identifying and verifying subsets of the behavioral model that, if implemented incorrectly in the program, can potentially lead to the failure. Not only do we identify functional differences between the program and its model, but we also provide a ranked list of diagnoses which might explain (or be associated with) these differences. We call this approach automated model-based debugging, or MBD for short.

MBD is a purely empirical and black-box technique. It takes as input a program, its behavioral model expressed as an extended finite state machine (EFSM), and a failing input sequence. In a sense, MBD simulates the role of a human debugger. We hypothesize what might have gone wrong with the program and led to the failure, and then verify and score our hypotheses according to how well they demonstrate the actual erroneous program behavior. Our hypotheses are constructed by mutating the behavioral model of the program. Each mutant represents a faulty behavior that the program may erroneously demonstrate. The intention behind these mutations is to mimic developers' typical errors, such as off-by-one errors. The verification of a hypothesis is performed by extracting special-purpose test cases, called confirming sequences [10], from the behavioral model and executing them on the program. Our experiments suggest that MBD can effectively reduce the space of potential root causes for failures, which can, in turn, improve the turnaround time for fixes.

2. BACKGROUND

This section provides background information on EFSM models and confirming sequences.

2.1 EFSM Models

The EFSM is a Mealy machine (a finite state machine (FSM)) extended with parameters, predicates, and operations. There are three types of parameters in the EFSM, namely: input, output, and context parameters. Unlike the Mealy machine, the input/output signals and the states of the EFSM are parameterized by input, output, and context parameters, respectively. Each transition of the EFSM is associated with a predicate and a set of operations, both defined over the input and context parameters.

Figure 1 illustrates an example EFSM model for an electronic purse application written for the Java Card platform. The basic inputs that this application can receive are: activate(amt), authenticate(pin), deposit(amt), and withdraw(amt). A purse first needs to be activated with an initial balance. Once activated, each deposit and withdraw operation must be authenticated with a card-specific pin number (represented by the constant PIN in the model). The authentication should be done in at most three attempts, otherwise the purse is locked.

Figure 1 models this application with three control states, four input signals (activate, deposit, withdraw, and authenticate), two input parameters (amt and pin), and two output signals (ack and err). This model also has two context parameters: bal, storing the current balance of the purse, and tries, storing the current number of incorrect authentication attempts. Each transition is denoted by a notation of the form s −(i, p/o, f)→ s′, where s and s′ are the starting and ending states, i and o are the parameterized input and output signals, p is the predicate, and f is the context update function of the transition. To simplify the notation, we drop update functions from transitions that have no updates.
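For example, transition T5 of Figure 1 reads in this notation as

activated −(authenticate(pin), pin = PIN && tries <= 2 / ack, tries = 0)→ authenticated

That is, when the machine is in state activated and receives authenticate(pin) with a matching PIN and at most two previous failed attempts, it emits ack, resets tries to zero, and moves to state authenticated.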

A state and a valuation of the context parameters constitute a so-called configuration of the EFSM. The EFSM usually starts with a designated initial configuration. For example, our example EFSM starts with the initial configuration [state = uninitialized, bal = 0, tries = 0]. A configuration reflects the history of input signals and the updates on context parameters from system start to the present moment.

The EFSM operates as follows. The machine receives a parameterized input signal and identifies the transitions whose predicates are satisfied for the current configuration. Among these transitions, a single transition is fired. During the execution of the chosen transition, the machine 1) generates an output signal along with the output parameters, 2) updates the context variables according to the update function of the transition, and 3) moves from the starting to the ending state of the transition.
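To make these semantics concrete, the following minimal Python sketch simulates one EFSM step. The representation (Transition records with guard, output, and update callables) and the value MAXBAL = 10 are our own illustration, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List

Config = Dict[str, int]  # valuation of the context parameters, e.g. {"bal": 0, "tries": 0}

@dataclass
class Transition:
    src: str                              # starting control state
    dst: str                              # ending control state
    signal: str                           # input signal name, e.g. "activate"
    guard: Callable[[Config, int], bool]  # predicate over context and input parameter
    output: Callable[[Config, int], str]  # parameterized output signal
    update: Callable[[Config, int], None] = lambda cfg, arg: None  # context update function

@dataclass
class EFSM:
    transitions: List[Transition]
    state: str
    context: Config

    def step(self, signal: str, arg: int) -> str:
        """Fire one transition enabled for the given parameterized input signal."""
        for t in self.transitions:
            if t.src == self.state and t.signal == signal and t.guard(self.context, arg):
                out = t.output(self.context, arg)  # 1) generate the output signal
                t.update(self.context, arg)        # 2) update the context parameters
                self.state = t.dst                 # 3) move to the ending state
                return out
        return "err(INV_CMD)"                      # input not defined on the current state

# Transition T2 of Figure 1: uninitialized -(activate(amt), amt <= MAXBAL / ack, bal = amt)-> activated
MAXBAL = 10  # illustrative value; the paper does not fix MAXBAL
t2 = Transition("uninitialized", "activated", "activate",
                guard=lambda cfg, amt: amt <= MAXBAL,
                output=lambda cfg, amt: "ack",
                update=lambda cfg, amt: cfg.update(bal=amt))
purse = EFSM([t2], state="uninitialized", context={"bal": 0, "tries": 0})
print(purse.step("activate", 5))  # -> ack; configuration becomes [activated, bal=5, tries=0]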

2.2 Confirming Sequences

Model-based testing (MBT) is one of the fields that has been extensively leveraging finite state models. In MBT, test cases are automatically derived from a given model and are in the form of an input and output sequence. The program is fed with the input sequence and its output is compared to the expected output.

Although matching program and model outputs increases our confidence in the correctness of the program, it is barely adequate. For example, consider the FSM model (M) given in Figure 2 and a program (P) attempting to implement this model. P is a black box; its inputs and outputs can be observed, but no other information regarding its condition is known. Now, assume that P incorrectly implements the transition A −a/x→ B as A −a/x→ C. A legitimate test case (derived from M) to test the original transition is composed of the single input a, assuming that the machine is already in A. Although P gives the expected output when fed with a, the input a leaves P in a wrong state, which can manifest itself later as a failure. Therefore, there is a need to verify the state of P after executing a. The question is: can we verify the current state of the black box P, if all we know about P is its blueprint (i.e., its model)?

An input/output sequence known as a confirming sequence is a solution to this problem [7]. A confirming sequence is a test case that, if passed, increases our confidence in the correctness of the state reached by the program after executing an input.


[Figure 2 (diagram): an FSM with three states, A, B, and C, and three transitions labeled a/x, a/x, and a/y.]

Figure 2: An example FSM model.

Given a model and a state to be verified, a confirming sequence is extracted directly from the model in a way that distinguishes the state from all the other states in the model. Going back to our simple example, we know that, after consuming a, P should be in state B. A confirming sequence that could be computed for state B is (a, a)/(x, y). Note that this sequence separates B from A and C. That is, B is the only state in M that would generate the output (x, y) given the input (a, a). In our simple example, feeding P with this confirming sequence would reveal that P is not in the expected state, since the input (a, a) generates the output (y, x).
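Operationally, checking a confirming sequence against a black-box program amounts to feeding it the input sequence and comparing the observed outputs with the expected ones. A minimal sketch, assuming only that the program exposes a step(input) -> output interface (our own illustration):

from typing import Callable, Sequence

def passes_confirming_sequence(step: Callable[[str], str],
                               inputs: Sequence[str],
                               expected: Sequence[str]) -> bool:
    """Return True iff the black box produces the expected output for every input."""
    return all(step(i) == o for i, o in zip(inputs, expected))

# For Figure 2: a program that is (incorrectly) in state C answers (y, x) to (a, a),
# so the confirming sequence (a, a)/(x, y) computed for state B fails.
outputs_from_c = iter(["y", "x"])
print(passes_confirming_sequence(lambda i: next(outputs_from_c), ["a", "a"], ["x", "y"]))  # False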

There are well-established algorithmic approaches to compute confirming sequences for FSM models [7]. Extracting confirming sequences from EFSM models is a more challenging task, though. This is due to the fact that the states of EFSM models are parameterized with context parameters. Because context parameters take their values from an infinite domain, extracting a confirming sequence that separates a given configuration (i.e., a state and a valuation of the context parameters) from all other configurations in the model is often computationally impractical because of the combinatorial complexity involved.

One way to ease this complexity is to verify a configuration against a carefully chosen list of suspicious configurations, rather than against all other configurations. Although the choice of suspicious configurations may affect the quality of the confirming sequence, experiments suggest that this approach can be quite effective in practice [10].

In MBD, we leverage confirming sequences to validate the expected behavior of the program against a set of error hypotheses and, in the case of a failure, to validate and score our hypotheses according to how well they demonstrate the actual erroneous program behavior.

3. THE MODEL-BASED DEBUGGING APPROACH

MBD takes as input a program, its behavioral model expressed as an EFSM, and a failing input sequence. The output is a ranked list of diagnoses, each given as a small subset of the model, which might explain (or be associated with) what the program implemented incorrectly, leading to the failure.

MBD is a completely black-box technique. All we need is a way to map the model inputs/outputs to the actual program inputs/outputs. We make no assumptions about how the program actually implements the model. For example, model states and transitions can be abstract and implemented either implicitly or explicitly in the program. However, we do rely on the competent specifier hypothesis, which states that the specifier of a program's behavior is likely to construct EFSM models close to the correct behavior.

A cornerstone of our approach is a stepwise validation of the program against its behavioral model. Stepwise validation allows us to detect faults as close to the time they occur as possible, rather than when they manifest themselves.

Algorithm 1 MBD(Model M, Program P, Input I)
 1: D ← {}
 2: for all i where i ∈ I do
 3:   T ← M(i)
 4:   P(i)
 5:   F ← mutate(M, T)
 6:   CS ← ComputeConfirmingSequences(M, F)
 7:   if ∀cs · cs ∈ CS ∧ P′(cs.in) = cs.out then
 8:     continue
 9:   end if
10:   for all M′ where M′ ∈ F do
11:     T′ ← M′(i)
12:     F′ ← mutate(M′, T′)
13:     CS′ ← ComputeConfirmingSequences(M′, F′)
14:     M′.score ← percentageOf({cs′ | cs′ ∈ CS′ ∧ P′(cs′.in) = cs′.out})
15:     D ← D ∪ {M′}
16:   end for
17: end for
18: return the ranked list of M′'s where M′ ∈ D

Algorithm 1 depicts the high-level MBD algorithm, where P, M, and I are the program, its EFSM model, and the failing input sequence, respectively. D is the set of diagnoses to be reported; M(i) returns the output of the model as well as the parts (T) of the model exercised by the input i; mutate(.) mutates the parts (T) of the model passed as the first argument; ComputeConfirmingSequences(.) computes pairwise confirming sequences between the model passed as the first argument and the set of models passed as the second argument; and cs.in and cs.out represent the input and output of the confirming sequence cs. Furthermore, P(.) and P′(.) both return the output of the program for the input passed as the argument, with the difference that the latter is assumed to restore the program state to what it was before executing the input, whereas the former leaves the program state as it is. One way to obtain P′ is to keep a history of the inputs executed so far and then restart P with these inputs.
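For illustration only, the Python sketch below mirrors Algorithm 1. The helper callables (mutate, confirming_sequences, run_restored) and the step interfaces are hypothetical stand-ins for the components described above and are passed in as parameters; they are not an API defined in the paper. The attributes cs.inp and cs.out correspond to cs.in and cs.out in Algorithm 1 (renamed because in is a Python keyword).

def mbd(model, program, failing_input, mutate, confirming_sequences, run_restored):
    """Sketch of Algorithm 1: return error hypotheses ranked by score."""
    diagnoses = []
    for i in failing_input:
        _, exercised = model.step(i)                 # M(i): model output and exercised parts T
        program.step(i)                              # P(i): advance the black-box program
        hypotheses = mutate(model, exercised)        # F: error hypotheses for this step
        cs_set = confirming_sequences(model, hypotheses)
        # If every pairwise confirming sequence passes, the program appears to be in the
        # expected configuration; continue with the next input (lines 7-8).
        if all(run_restored(program, cs.inp) == cs.out for cs in cs_set):
            continue
        # Otherwise, validate and score each error hypothesis against its own mutants (lines 10-16).
        for hyp in hypotheses:
            _, exercised_h = hyp.step(i)
            cs_h = confirming_sequences(hyp, mutate(hyp, exercised_h))
            passed = sum(run_restored(program, cs.inp) == cs.out for cs in cs_h)
            hyp.score = 100.0 * passed / len(cs_h) if cs_h else 0.0
            diagnoses.append(hyp)
    return sorted(diagnoses, key=lambda h: h.score, reverse=True)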

We execute the failing input sequence stepwise, in parallel, on both the program and the model (lines 2-4 in Algorithm 1). After each input, we validate the resulting state of the program. This step is performed regardless of the program output; having the right output from the program for a given input is not always a necessary and sufficient condition to conclude that the program reached the expected state after executing the input (Section 2.2). Since the program is a black box to the MBD approach, the validations are performed by generating and executing special-purpose test cases called confirming sequences (lines 5-9). Confirming sequences help us compare the internal execution states of the program with those of the model by means of testing.

As we have already discussed in Section 2, one way to validate the program's expected resulting configuration is to extract a confirming sequence from the behavioral model that separates the expected configuration from all other possible configurations in the model and execute it on the program. However, this approach is often computationally infeasible in practice because of the combinatorial complexity involved [10]. Separating the expected configuration from a set of suspicious configurations wouldn't help either; although such sequences are practical and can be effective in detecting the presence of faults, they cannot locate or diagnose them.

We, instead, anticipate what could go wrong with the program in implementing the parts of the model that are exercised by the current input step. This is done by mutating the respective parts of the behavioral model (line 5). The intent behind these mutations is to mimic programmers' typical mistakes, such as miscoded conditions, missing updates, and misinterpreted specifications.

Each mutant represents an error hypothesis. We then validate the expected configuration of the program against these error hypotheses by computing pairwise confirming sequences. That is, we compute a test case for each pair of expected behavior and error hypothesis that allows us to distinguish the expected behavior from the error hypothesis by investigating the input/output behavior of the program (line 6).

Consider our electronic purse model given in Figure 1 as an example. Given an implementation of this model, we may want to validate that the implementation starts with the expected initial configuration [state = uninitialized, bal = 0, tries = 0] rather than a faulty configuration [state = uninitialized, bal = 0, tries = 1]. This faulty configuration is a good candidate to validate the expected configuration against, since it reflects a typical off-by-one error in the initialization of a variable. One pairwise confirming sequence that separates the expected configuration from the faulty one is (activate(5), authenticate(2), authenticate(2), authenticate(2))/(ack, err(INV_PIN), err(INV_PIN), err(PURSE_LOCKED)), with the assumption that the model constant PIN is not 2.

The first thing to note about this confirming sequence is that what it really validates is whether or not there is an off-by-one error in the initialization of tries (however it is implemented in the program), rather than whether the program actually initializes it to 0 or 1. Moreover, between the expected initial configuration and the faulty one, the program, when fed with the input sequence of the confirming sequence, would generate the expected output sequence if and only if it started with the expected initial configuration with respect to the initialization of tries.

Given a confirming input/output sequence which separates the behavior to be validated from a mutant, if the actual program output, when fed with the input sequence, matches the expected output, then our confidence in the validity of the behavior under investigation increases. Otherwise (i.e., if the outputs don't match), it is a good indication that the program doesn't demonstrate the expected behavior.

In MBD, if all the pairwise confirming sequences computed to distinguish the expected configuration from its mutants pass, our confidence in the correctness of the resulting program configuration increases and we continue with the next input (lines 7-8). Otherwise, if at least one confirming sequence fails, it signals the presence of errors.

In the presence of a failure, we ask the question: which of the anticipated faulty behaviors (i.e., the error hypotheses computed) best demonstrates the erroneous program behavior? The way we answer this question is identical to the way we validate the expected behavior, only this time each error hypothesis, in turn, becomes the expected behavior of the faulty program (lines 10-16). As was the case with the validation of the expected behavior, anticipated faulty behaviors are validated against their mutants. The rationale behind this approach is that if the behavior to be validated demonstrates the actual erroneous program behavior, then all confirming sequences computed to distinguish it from its mutants should pass.

One difference between validating the expected behavior and validating the error hypotheses is that we use a relative scoring approach to assign belief to alternative error hypotheses. The score of an error hypothesis is the percentage of the pairwise confirming sequences (computed to distinguish it from its mutants) that support the hypothesis (line 14), reflecting how well the error hypothesis demonstrates the actual behavior of the faulty program. The error hypotheses are sorted in descending order by their scores and communicated to the end user in the form of a diagnosis report (line 18).
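Spelled out in the notation of Algorithm 1, the score assigned on line 14 to an error hypothesis M′ with pairwise confirming sequences CS′ is

score(M′) = 100 · |{cs′ ∈ CS′ : P′(cs′.in) = cs′.out}| / |CS′|

i.e., the fraction of confirming sequences whose expected outputs the program actually reproduces, expressed as a percentage.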

Using pairwise confirming sequences allows us to employ such a scoring scheme. Instead of computing a single confirming sequence, we compute pairwise confirming sequences between a behavior (either expected or faulty) and its mutants. A single confirming sequence to validate an error hypothesis would imply that we expect the faulty program to behave exactly the way we anticipate it could behave erroneously; otherwise the confirming sequence would fail. As we will see in Section 4.2, pairwise confirming sequences, on the other hand, help us diagnose faults which are not directly anticipated by the mutations.

3.1 Constructing Error Hypotheses

We hypothesize a set of faulty behaviors that the program may demonstrate by mutating its model. In this section, without losing the generality of the approach, we define a set of simple mutation operators.

The choice of our mutation operators is based on one major factor: a desirable mutation operator should be coarse-grained enough to detect as many faults as possible, yet fine-grained enough to diagnose them. We designed our mutation operators by giving equal importance to these competing factors.

The mutation operators defined in this work operate locally on the parts of the model which were exercised by the current input. The result of applying a mutation operator to a model is a replica of the original model with the corresponding mutations. The intent behind these mutations is to mimic programmers' typical mistakes.

In the rest of the discussion, let M be the model to be mutated, i be the last input consumed by M, C be the resulting configuration of the machine after executing i, and T: s −(i, p/o, f)→ s′ be the transition taken on i. We designed the following operators:

MIS – Modifying initial configurations. MIS modifies the initial configuration of the machine by 1) changing the initial state to every other state in M and 2) introducing an error term into the initialization of each context parameter, one at a time. For example, an initialization of the form bal = 0 would be mutated into bal = 0 + err, where err ranges over a small interval of positive and negative numbers. MIS is designed to validate that the program under study starts with the expected initial configuration.

MDT – Deleting transitions. MDT deletes the transition T from the model.

MTS – Modifying tail states. MTS changes the tail state s′ of T to every other state in M, one at a time.

MDU – Deleting updates. MDU modifies the update function f of T by deleting the update operations on the context parameters, one at a time.

MMU – Modifying updates. MMU modifies the update function f of T by introducing error terms into each update operation, one at a time. The way we introduce error terms is explained above for the MIS operator.


Score   Diagnosis

100.00  Update on tries appears to be incorrectly implemented! (Off-by-one error)
        Current Input (I): authenticate(pin=PIN); Input History (H): activate(amt=5)
        Model transition taken on I (T): activated −(authenticate(pin), pin = PIN && tries ≤ 2 / ack, tries = 0)→ authenticated
        T appears to be implemented as T′: activated −(authenticate(pin), pin = PIN && tries ≤ 2 / ack, tries = 0 + 1)→ authenticated

100.00  tries appears to be corrupted! (Off-by-one error)
        Current Input (I): authenticate(pin=PIN); Input History (H): activate(amt=5)
        Model configuration after I (C): [state=authenticated, bal=5, tries=0]
        Program appears to be in configuration: [state=authenticated, bal=5, tries=0+1]

 84.62  tries appears to be corrupted! (Off-by-two error)
        Current Input (I): authenticate(pin=PIN); Input History (H): activate(amt=5)
        Model configuration after I (C): [state=authenticated, bal=5, tries=0]
        Program appears to be in configuration: [state=authenticated, bal=5, tries=0+2]

...     ...

Table 1: A sample output.

MAU – Introducing updates. MAU introduces additional updates, one at a time, for the context parameters which are not originally updated by the function f of T.

MMC – Modifying context parameters. MMC modifies the context parameters in C, one at a time, by introducing error terms. The difference between MMC and MAU is that MMC targets nonsystematic faults by mutating only the current context, whereas MAU targets systematic faults by mutating the underlying machine.

For a given transition (T) and a configuration (C), each mutation operator defined above may produce zero, one, or more mutated models. Later, in Section 4.4, as a cost-cutting measure, we will add a parameter called distance to the MIS and MTS operators, which constrains the portion of the model that is visible to these operators. The next section briefly describes the way we compute pairwise confirming sequences once we have the mutated models.
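As an illustration of how such operators can be realized, the sketch below generates MMU-style mutants by adding small error terms to one context-parameter update at a time. The representation of updates (a dict from parameter name to a function of the configuration and the input parameter) is our own simplification, not the paper's implementation.

from typing import Callable, Dict, List

Config = Dict[str, int]
Update = Callable[[Config, int], int]  # computes the new value of one context parameter

def mmu_mutants(updates: Dict[str, Update],
                error_terms=(-2, -1, 1, 2)) -> List[Dict[str, Update]]:
    """MMU: introduce an error term into each update operation, one at a time."""
    mutants = []
    for param, original in updates.items():
        for err in error_terms:
            mutant = dict(updates)  # copy; only one entry is replaced
            # Bind the loop variables so every mutant keeps its own error term.
            mutant[param] = (lambda f, e: lambda cfg, arg: f(cfg, arg) + e)(original, err)
            mutants.append(mutant)
    return mutants

# Example: the update of transition T5 (tries = 0). The off-by-one MMU mutant below
# turns it into tries = 0 + 1, which is exactly the top diagnosis shown in Table 1.
t5_updates = {"tries": lambda cfg, pin: 0}
off_by_one = mmu_mutants(t5_updates, error_terms=(1,))[0]
print(off_by_one["tries"]({"bal": 5, "tries": 2}, 1))  # -> 1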

3.2 Computing Confirming Sequences

Petrenko et al. introduced an approach to compute confirming sequences for EFSM models in [10], and we adopt that approach in this work. The details of the approach are beyond the scope of this paper. In a nutshell, given two EFSM machines along with their current configurations, Petrenko et al. cast the problem of computing a confirming sequence that separates the first machine from the second into a reachability problem in the "distinguishing EFSM machine" obtained by cross-producting the given machines in a certain manner.

Once we compute the distinguishing EFSM machine, we use a model checker [1] to solve the reachability problem. The negation of the reachability property is expressed as a branching-time logic formula which should hold globally across all paths, so that the counterexample returned by the model checker (if any) becomes our confirming sequence.
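For instance, if a proposition distinguished marks the configurations of the distinguishing EFSM in which the two machines have produced different outputs, the property handed to the model checker could take a CTL form such as

AG ¬distinguished

so that any counterexample is a path reaching a distinguishing configuration, i.e., an input/output sequence that separates the two machines. This formula is our illustration; the paper does not give the exact query it uses.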

3.3 MBD in Action

This section illustrates the MBD approach over an implementation of the electronic purse model given in Figure 1. To facilitate the study, we manually introduced an error into the actual implementation of the authenticate function. When the valid pin is entered, instead of setting the number of failed authentication attempts (i.e., parameter tries) to zero, the implementation now erroneously sets it to one; a typical off-by-one error that requires certain combinations of inputs to detect. The introduced error corresponds to the implementation of transition T5.

We fed the MBD tool with the EFSM model, the faulty implementation, and the input sequence (activate(amt = 5), authenticate(pin = 1)), with the assumption that the PIN constant in the model is 1. This input sequence is interesting in the sense that the error we introduced is not observable externally through the input sequence provided; the faulty program returns the expected output of (ack, ack). Since we validate the resulting state of the program regardless of the program output, the error is, in fact, not required to be observable through the provided input sequence.

The MBD tool started by validating whether the program starts with the expected initial configuration [state = uninitialized, bal = 0, tries = 0]. The MIS mutation operator provided 6 mutants for this purpose. Six pairwise confirming sequences were automatically extracted, each of which distinguishes the expected initial configuration from a mutant. The confirming sequences were executed on the program, and it turned out that the program passed all of them, suggesting that the program starts with the expected initial configuration.

The tool then executed the input (activate(amt = 5)) on the program. To validate that the program is now in the configuration [state = activated, bal = 5, tries = 0], 13 mutants were created. All of the corresponding confirming sequences passed.

The input authenticate(pin = 1) was executed next. The model transition taken on this input was T5, which moved the machine to the configuration [state = authenticated, bal = 5, tries = 0]. Several of the 13 confirming sequences computed to validate the current configuration of the program failed, suggesting that the program is not in the expected state after executing the input.

One of the failing confirming sequences, for example, was (withdraw(amt = 0), authenticate(pin = 3), authenticate(pin = 3), authenticate(pin = 1), deposit(amt = 2), authenticate(pin = 3)) / (ack, err(INV_PIN), err(INV_PIN), ack, ack, err(INV_PIN)). It was computed to validate the original EFSM with the configuration [state = authenticated, bal = 5, tries = 0] against a mutant obtained by the MDU operator. The mutant simulated a missing update on tries by mutating transition T5 to activated −(authenticate(pin), pin = PIN && tries ≤ 2 / ack)→ authenticated while keeping the same configuration as the original machine. There are several things to note here. First, although the confirming sequence failed, it was not decisive about whether the mutant demonstrates the faulty program behavior (in this case, for example, it did not). Second, the confirming sequence given above is a minimal sequence; no shorter sequence performing the same task exists. Last, the main purpose of the withdraw and deposit operations in the sequence is to move back and forth between the activated and authenticated states; the amounts passed as arguments are irrelevant as long as they are valid. The program output was (ack, err(INV_PIN), err(PURSE_LOCKED), err(PURSE_LOCKED), err(INV_CMD), err(PURSE_LOCKED)). Note that the purse model outputs err(INV_CMD) for input signals which are not defined on a state.

The MBD tool then automatically validated and scored each mutant as explained in Section 3. Table 1 shows the top three diagnoses emitted by the tool, which were anticipated by the mutation operators MMU, MMC, and MMC, respectively. To further facilitate the human debugging process, each diagnosis provides detailed information. For example, the first diagnosis reads: after executing the inputs in H, the program should exercise the transition T on the current input I; however, the program appears to implement T as T′, i.e., with an off-by-one error in updating tries.

The first diagnosis not only pinpoints the exact location in the model that the program failed to implement correctly, but also explains exactly how the program erroneously implemented it.

The second diagnosis is implied by the error made in the implementation; having an off-by-one error in updating tries implies that tries will be corrupted in the resulting configuration. The only difference between the first and second diagnoses is that the latter is obtained by mutating the expected configuration of the model without touching the underlying machine, whereas the former is computed by mutating the underlying machine. Both received a perfect score.

The third diagnosis, although it localizes the error to the exact location in the model, fails to explain it accurately, which is reflected in its lower score. For example, one of the confirming sequences that did not support this diagnosis was (withdraw(iAmt = 0), authenticate(iPin = 3)) / (ack, PURSE_LOCKED). This sequence was extracted to distinguish the faulty (now expected) model configuration [state = authenticated, bal = 5, tries = 2] from the expected (now faulty) configuration [state = authenticated, bal = 5, tries = 0]. The program returned (ack, INV_PIN).

To sum up, MBD detected an error that was, in fact, externally unobservable through the provided input sequence, and then precisely diagnosed its root cause.

4. EXPERIMENTS

We have implemented the MBD approach as a fully functional research prototype tool and conducted a set of feasibility studies to evaluate the approach.

We used the reference implementation of the electronic purse application (≈1500 LOC) that comes with the Java Card Development Kit v2.2.2 as our subject application.

Figure 1 depicts the behavioral model of the application used in our studies. Note that although this model seems to have a total of three states, in reality these three states only constitute the so-called control states of the model and do not, by themselves, reflect the actual size of the model. When the data states (the valuations of context variables such as bal and tries) are combined with these control states, the actual state size of the model can be better understood. For example, merging the data and control states of this model, using even a small interval for each context variable, would convert this EFSM model with 3 states, 9 transitions, and 2 context variables into a semantically equivalent EFSM model with 56 states, 353 transitions, and no context variables (see Study 3 in Section 4.3).

We opted to run a set of controlled experiments to evaluate the MBD approach. This allowed us to better assess the benefits and shortcomings of the approach by analyzing the results in depth. Consequently, in our studies, we seeded faults into the application and provided failing input sequences. We then fed these artifacts to the MBD tool and manually categorized the resulting diagnoses with respect to their quality in locating the actual root causes.

[Figure 3 (plot): Study 1 results. The plot shows the scores (0-100, vertical axis) of the top five diagnoses for each of the 20 seeded faults (fault index, horizontal axis), with each diagnosis classified as excellent, very good, good (implied), or poor.]

Figure 3: Study 1 results.

4.1 Study 1: Anticipated Errors

Our goal in this study is to evaluate the MBD approach on faults which are directly anticipated by the mutation operators at its disposal. This is a sanity check of the approach. Any "bad" results obtained here would indicate that MBD may not work well in general.

Study Execution. To facilitate the study, we applied the mutation operators given in Section 3.1 to the original program and obtained 20 mutated programs. Each mutant was obtained by a single application of a single mutation operator. Each mutation operator was utilized multiple times. We then manually created a failing input sequence for each mutant, which exercised the error in the mutant. We finally fed the model, the mutated programs, and the failing input sequences to the MBD tool, one at a time.

Evaluation. Figure 3 depicts the scores of the top five diagnoses obtained for each mutated program. The vertical axis denotes the score and the horizontal axis denotes the mutated program index. For example, the first tick on the horizontal axis, which is 1, represents the mutated program indexed as 1, and the data points right above it depict the top five diagnoses obtained.

We manually categorized these diagnoses into four classes based on the information they provided towards locating the root causes: +, □, ◦, and ×. The first class (+) represents diagnoses which not only precisely locate the faults (e.g., the update tries = 0 in transition T5 is implemented incorrectly), but also precisely describe the actual error (e.g., off-by-one error: tries = 1). The second class (□) depicts diagnoses that are not as precise as the first class, but are extremely helpful in locating root causes. The third class (◦) is for diagnoses describing a potential cause which is implied by the actual error made in the program. For example, having the balance incorrectly updated in the current input step would imply that the balance will be corrupted in the next step. The fourth class (×) represents diagnoses that appear not to be related to the errors.

As Figure 3 indicates, MBD precisely located the root causes of the failures. In 85% of the cases, the actual root causes (i.e., +'s) were reported as the top-ranked result and, in the remaining 15% (mutants 7, 9, and 10), they were ranked second. Note that we sometimes have multiple diagnoses of an error classified as + in the figure, such as the case for mutant 12. This happens when the error is exercised multiple times by the input sequence provided.

[Figure 4 (plot): Study 2 results. The plot shows the scores (0-100, vertical axis) of the top five diagnoses for each of the 20 faults (fault index, horizontal axis), with each diagnosis classified as very good, good (implied), or poor.]

Figure 4: Study 2 results.

Even for mutants 7, 9, and 10, where the perfect diagnoses were ranked second, the top-ranked diagnoses reported were also helpful in pinpointing the root causes. For example, mutant 7 was created by resetting (instead of setting) a flag in the program which indicates whether the purse is activated or not. In effect, the purse is not activated when it should be. This error corresponds to the implementation of transition T2 in Figure 1. The top diagnosis suggested that T2 is not implemented at all in the program. Although this is not completely true (the mutant updated the balance correctly but failed to advance the state), the diagnosis was extremely helpful, since it implied that the purse was not activated.

Lessons Learned. This study suggests that MBD can precisely identify the root causes of failures that are directly anticipated by the mutation operators. Our analysis further revealed that the cases where the actual root causes received less than a perfect score (i.e., 100) were not due to pairwise confirming sequences that failed to support the diagnoses, but due to the fact that we were not able to compute any pairwise confirming sequence for particular scenarios. One reason was that our mutation operators sometimes created indistinguishable or identical mutants. In the rest of the studies, we prevented the generation of indistinguishable mutants, where we could, by running simple checks during the application of the mutation operators.

4.2 Study 2: Unanticipated Errors

In this study we evaluate the MBD approach on errors that are not directly anticipated by our mutation operators. This is important since, in practice, it is infeasible to anticipate all possible failure scenarios.

Study Execution. We adopted the same 20 program mutations created in our first study. For each mutated program, however, we first removed the mutation operator which was used to create the mutant from our database of mutation operators and then ran the MBD tool.

Evaluation. Figure 4 depicts the results we obtained. As was the case with Study 1, we manually classified each diagnosis. In the first study, our success criterion was the rank of the diagnoses classified as +. Since the exact mutation operators that created the mutants are now out of consideration, by definition, none of the diagnoses can be classified as + in this study. Instead, we are here mainly interested in the rank of □'s, i.e., diagnoses which may not be exact but are extremely helpful in locating root causes.

As the figure indicates, in 70% of the cases, at least one diagnosis classified as □ was reported in the top two ranks. All such diagnoses precisely located the incorrectly implemented transitions or states and pointed to the buggy updates, predicates, and/or parameters. In the remaining 30% of the cases, a majority of the top diagnoses reported were also helpful. We now investigate these cases one by one, as space allows.

Mutant 8 fails to update the balance of the purse during activation (transition T2). The top diagnosis perfectly diagnosed the point where the fault manifested itself rather than where it occurred. The failing input sequence under investigation activated the purse with an amount, deposited money several times, and then withdrew the entire balance at once. The top diagnosis states that the program appears to be in state authenticated instead of state activated after the withdraw operation (the model takes T9, whereas the program takes T7).

This diagnosis, although several steps away from the root cause, provides clues regarding the cause: it locates the operation where the fault first manifested itself, i.e., the withdraw operation, and then identifies the symptom, i.e., the program should be in state activated but appears to be in state authenticated after executing the withdraw operation. By providing information about the transition that should have been taken by the program, i.e., transition T9, this diagnosis further narrows down the potential search space for the root cause. For example, given T9, potential scenarios for failing to advance the state could be: 1) T9 is not implemented at all, 2) its precondition is implemented incorrectly, or 3) the precondition doesn't hold when it should. In this particular mutant, the last scenario was the actual reason, which suggested that the amount being withdrawn was greater than the current balance.

The top five diagnoses reported for mutants 5 and 17 are similar and very interesting. For example, mutant 5 erroneously starts the purse as activated instead of as uninitialized. All of the top five diagnoses pointed to transition T2, which was taken on the first input (i.e., activate(.)). Each update operation in T2 was reported as either missing or incorrectly updated. This indicates that the input didn't get executed at all, which was exactly the case with the mutant. Since the purse appeared to be already activated, an error code was returned instead of executing the activate operation. The top five diagnoses for mutants 5 and 17 collectively pointed us towards the root causes.

Mutants 1 and 13 are particularly interesting in the sense that we were not able to provide any diagnosis for them. This was due to the fact that these mutants failed each and every confirming sequence computed. Some scenarios were, indeed, expected to fail all confirming sequences that could be computed; others, however, could have passed, but failed because of the particular choice of the confirming sequence used.

Lessons Learned. This study suggests that MBD can narrow down the potential root causes for failures which are not directly anticipated by the mutation operators. In our experiments, we have observed three different ways in which this can happen.

[Figure 5 (plot): Study 3 results. The plot shows the scores (0-100, vertical axis) of the top five diagnoses for each of the 20 faults (fault index, horizontal axis), with each diagnosis classified as excellent, very good, good (implied), or poor.]

Figure 5: Study 3 results.

As was the case with the diagnoses classified as □ in Figure 4, mutants that closely resembled the actual faults in the program often got higher scores and were able to point to the buggy updates, predicates, and/or parameters. For example, in our studies, we have seen several cases where an anticipated off-by-two error was able to locate an unanticipated off-by-one error.

As was the case with mutant 7, we were also able to narrow down the causes when the faulty program state induced by unanticipated faults led to a situation which was anticipated by one of our mutation operators. For example, a missing update in mutant 7 later caused the program to erroneously change its state, which was then anticipated by our MTS operator in the absence of the MDU operator. In such cases, although we diagnose the situation where the fault manifests itself rather than where it occurs, our experiments suggest that the diagnoses provided by the MBD approach often provide accurate clues regarding the causes.

As was the case with mutants 5 and 7, sometimes a set of diagnoses was collectively helpful in diagnosing unanticipated errors. These diagnoses might, however, be more difficult to interpret compared to the first two types of diagnoses described above.

Another lesson learned is that confirming sequences, even if the hypotheses they are trying to validate are correct, can potentially fail because of errors in the rest of the program. This issue is partially compensated for in MBD by our pairwise validation approach. Instead of computing one confirming sequence, we compute pairwise sequences. By doing so, we take a better sample of the program behavior, since each sequence may potentially exercise a different path in the program.

4.3 Study 3: Sensitivity Analysis

There may be different ways to encode the same behavior using an EFSM. For example, there is no reason why a developer might not choose to "unfold" the tries variable into the states of the EFSM, which would lead to a series of activated states. This study addresses the sensitivity of the MBD approach to structural variations in behavioral models which are semantically equivalent.

Study Execution. We took an extreme approach to better evaluate MBD. We automatically unfolded both context variables (i.e., bal in the range 0–11 and tries in the range 0–3) into the states of our original behavioral model given in Figure 1. This process transformed our original EFSM with 3 states, 9 transitions, and 2 context variables into an EFSM with 56 states, 353 transitions, and no context variables. For example, our new EFSM now starts in a state labeled uninitialized_bal:0_tries:0 and, given the input activate(5), would move to state activated_bal:5_tries:0. The decoding of the state labels is straightforward. The ranges for the context variables were chosen so that we could use the same mutants and failing input sequences studied in Studies 1 and 2, which allowed us to reliably compare the study results.

The structure of the newly transformed model made four of our fault models obsolete, which left us with only the MIS, MTS, and MDT operators. We used all three of these operators in this study. Furthermore, in order to keep the number of tests to be performed for localizing faults under control (see Section 4.4 for more details), we set the distance parameter of the MIS and MTS operators to 1, as opposed to -1 as in Studies 1 and 2. That is, we mutated a given state with its immediate neighbors, one at a time, instead of with each and every state in the entire model. Another reason for employing such a cost-cutting measure was to provide a basis for the performance and scalability discussions given in Section 4.4.

Evaluation. As depicted in Figure 5, we obtained results similar to those of Study 1, with a couple of exceptions that deserve some explanation.

First of all, to demonstrate the differences in the diagnoses obtained between this study and Study 1, we take mutant 1 as an example. For mutant 1, the top-ranked diagnoses are classified as excellent in both studies. Mutant 1 forgets to update the balance in the deposit function. The failing input sequence for this mutant was (activate(5), authenticate(1), deposit(2)). The top diagnosis emitted in Study 1 states that bal = bal + amt in transition T8 is missing in the implementation, which is detected by the MDU mutation operator, whereas the top diagnosis obtained in this study states that, after executing the deposit function, the implementation appears to be in state activated_bal:5_tries:0 when it should be in activated_bal:7_tries:0, which is detected by the MTS operator. Both diagnoses perfectly diagnose the bug in the code. One difference, though, is that the latter diagnosis may be harder to interpret than the former, since the missing-update information now needs to be inferred from the difference between two state labels.

For mutant 6, we have no diagnosis classified as excellent. It turned out that the main reason behind this was not the structural changes in the model but rather the cost-cutting decision we made in this study, i.e., only the distance-1 neighbors of the initial state were visible to the MIS operator. Mutant 6 starts in state authenticated instead of uninitialized. Since these states are distance-4 neighbors in the new model (distance-2 neighbors in the original model), we couldn't localize the bug at the point of validating the starting configuration of the program. Therefore, we couldn't classify any diagnosis as +. However, the diagnoses obtained were very helpful in locating the bug.

The in-depth analysis of mutant 17 led to an interesting observation. In Studies 1 and 2, the presence of context variables in the original model enabled us to generate and test some interesting hypotheses that we were not able to generate/test on the new model with no context variables. This difference may in turn lead to poor diagnoses, or no diagnoses at all, which was the case with mutant 17. For example, one type of hypothesis that we were not able to generate/validate on the new model was testing whether the program starts in the state uninitialized with the context variables (i.e., bal and tries) set to a non-zero value. This is because such a state was not present in the new model and we didn't have a mutation operator that can generate such new states. In general, it is difficult (if not impossible) to come up with mutation operators that can generate new states, since the transitions to or from such newly created states may not be determined in a meaningful way. However, context variables give us the ability to do the trick by mutating the model configuration.

Model           Depth  Distance  Tests    Dist. EFSM   Dist. EFSM                   Conf. Seq.
                                          Comput. (s)  Complexity                   Comput. (s)
Study 1 model    -1       -1     121.04      0.06      12 states, 93 trans.            0.70
Study 2 model    -1       -1     115.77      0.06      12 states, 93 trans.            0.84
Study 3 model    -1        1      72.47      6.14      3192 states, 24222 trans.      83.41
Study 3 model     2        1      34.55      1.63      609 states, 6532.32 trans.      5.12
Study 3 model     3        1      54.56      4.09      1624 states, 15628.27 trans.   22.46
Study 3 model     2       -1     411.28      1.88      624 states, 6708.24 trans.      5.98

Table 2: Performance measures.

Lessons Learned. In the majority of the cases, we found that the structural changes in the original model didn't affect the quality of the diagnoses in terms of their effectiveness in locating bugs. The same bugs were detected and localized using different mutation operators across structurally different models. One observation, though, is that the ease of interpreting the diagnoses may depend on the structure of the models, since the amount of information that needs to be inferred from the diagnoses may vary across structurally different models.

Another observation is that the number and the type of hypotheses that can be generated and validated may also be affected by the structure of the models. In general, context variables give us flexibility in generating a wide range of hypotheses by changing the model configuration in any way desired, which can be difficult to manage in the absence of context variables.

4.4 Study 4: Performance Analysis

There are two major factors that can affect the performance and scalability of the MBD approach: 1) the number of tests that should be executed to localize bugs and 2) the time it takes to extract these test cases from the model. This section studies these factors in depth and proposes two types of control knobs, in the form of additional parameters to the MBD algorithm and to the mutation operators, to improve performance and scalability.

All the performance improvement suggestions presented here are based on the assumption that bugs can be detected and localized using only the local information available with respect to the location of the bugs. That is, one can leverage only a portion of the behavioral model for localizing faults at each step, as opposed to taking the entire model into consideration. Note that for subtle bugs this assumption may not always hold in practice. We have seen examples of such cases in Study 3.

To exploit the bug locality assumption, we use a concept called a depth-k model with respect to a given model state. For a given state, its depth-k model is the portion of the original model that consists of only the states, and the transitions among them, that can be reached from, or that can reach, the given state in at most k transitions.
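To make the depth-k construction concrete, the following is a minimal sketch under simplifying assumptions: the model is flattened into a set of states and (source, input, target) transition triples, and the neighborhood is computed with two breadth-first searches (forward and backward) from the given state. The function name and representation are illustrative; they are not the EFSM encoding used in our tool.

from collections import deque

def depth_k_model(states, transitions, center, k):
    # Return the states (and the transitions among them) that can be
    # reached from `center`, or that can reach `center`, in at most k
    # transitions.  `transitions` is a list of (src, input, dst) triples.
    succ, pred = {}, {}
    for (src, _inp, dst) in transitions:
        succ.setdefault(src, set()).add(dst)
        pred.setdefault(dst, set()).add(src)

    def bfs(adj):
        seen, frontier = {center}, deque([(center, 0)])
        while frontier:
            s, d = frontier.popleft()
            if d == k:
                continue
            for t in adj.get(s, ()):
                if t not in seen:
                    seen.add(t)
                    frontier.append((t, d + 1))
        return seen

    kept = (bfs(succ) | bfs(pred)) & set(states)
    kept_transitions = [t for t in transitions if t[0] in kept and t[2] in kept]
    return kept, kept_transitions

The depth parameter of the MBD algorithm and the distance parameter of the mutation operators, both introduced below, restrict their work to such depth-k models; a parameter value of -1 means no restriction, i.e., the entire model is used.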

Factor 1: Number of tests needed. The complexity of the algorithm given in Algorithm 1, in terms of the number of test cases needed per input symbol, is O(n^2), where n is the number of hypotheses to be tested. The number of hypotheses generated per input symbol is directly proportional to the number and characteristics of the mutation operators present in the fault models. For some mutation operators, this number is proportional to the size of the model (e.g., the number of context variables and/or states). MIS and MTS are examples of such operators, where the number of mutants to be created depends on the number of states in the behavioral model. In such cases, to improve scalability by exploiting the bug locality assumption, we introduced an additional parameter, called distance, to such operators. The distance parameter ensures that only the states in the depth-distance model are available to MIS and MTS. The default value of this parameter is -1, indicating that the entire model is visible to these operators. Note that these parameters only reduce the number of error hypotheses, thus the number of test cases, to be generated; they don't change the size of the model that the model checker has to deal with to compute the test cases.
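As an illustration of how the distance knob cuts down the hypothesis count, the sketch below restricts an MTS-style operator (which hypothesizes a wrong target state for a transition) to a supplied set of visible states, e.g., the state set of a depth-distance model computed as in the previous sketch. The function and its arguments are hypothetical simplifications, not our tool's actual interface.

def mts_mutants(transition, model_states, visible_states=None):
    # Hypothesize wrong target states for `transition`, a (src, input, dst)
    # triple.  With visible_states=None (distance = -1), every model state
    # other than the expected target is a candidate; otherwise only the
    # states of the depth-distance model are considered.
    src, inp, dst = transition
    candidates = model_states if visible_states is None else visible_states
    return [(src, inp, wrong) for wrong in candidates if wrong != dst]

Because the number of test cases needed per input symbol grows as O(n^2) in the number n of hypotheses, shrinking the candidate set from all model states to a depth-distance neighborhood compounds into the reductions visible in the Tests column of Table 2.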

Factor 2: Time to extract test cases. The time to extract a test case (i.e., a confirming sequence) is directly proportional to the number of states and transitions present in the behavioral models. We introduced a parameter called depth to the MBD algorithm that allows us to look for confirming sequences in depth-k models, instead of in the entire model.

We conducted several experiments to evaluate the effectiveness of these performance-improving knobs on the performance of the MBD approach. Table 2 presents the results of these experiments. The columns of this table give, respectively, the model used, the value of the depth parameter of the MBD algorithm, the value of the distance parameter of the MIS and MTS operators, the average number of tests generated/executed per input symbol, the average time to compute a distinguishing EFSM using the algorithm in [10], the complexity of the resulting distinguishing EFSM, and the average time to compute a confirming sequence using the model checker. The time measures are given in seconds and were obtained on a dual-processor Xeon machine with 1GB of memory. Since the actual number of test cases to be generated is proportional to the number of input symbols in the failing input sequence provided, in an attempt to normalize the measures we report the average number of tests per input symbol.

The first thing to note in this table is that the performance knobs introduced in this section greatly improved the performance and scalability of MBD. Furthermore, we observed that, for the experiments that exploited the bug locality assumption in one form or another (i.e., all the experiments starting from the third row of the table), the quality of the diagnoses obtained was similar to that of Study 3. This is an encouraging result in terms of the applicability of the bug locality assumption. We are, however, fully aware that more and larger experiments are needed to establish the generality of this assumption.

5. RELATED WORK

We group the related work on fault localization as follows:

Model-based reasoning. Perhaps the closest works to ours are the ones in this category, since they leverage a model for reasoning about failures. Reiter [11] and de Kleer et al. [3] provide a model-based reasoning (MBR) framework for diagnosing faults. The MBR approach can be summarized as a two-step process: 1) identify possible model-artifact differences and 2) propose evidence-gathering measurement points to refine the set of possible model-artifact differences until they accurately reflect the actual differences. This framework often leverages data-dependency models.

MBR has been extensively employed in debugging hardware systems. However, its application to software debugging has been limited [9]. The reason behind this is two-fold: 1) MBR requires a precise mapping of models to the artifacts being modeled and 2) MBR assumes that components of the artifacts have no side effects. Our approach is completely black-box, doesn't require a one-to-one mapping of models to programs, takes the side effects of components into account, and supports behavioral abstraction.

Static analysis-based reasoning. Engler et al. [4] introduced a static analysis technique to automatically check manually written system rules by using simple system-specific compiler extensions, and showed that simple rule checkers can be effective in finding real bugs. MBD doesn't require access to source code and can localize conceptual errors.

Dynamic analysis-based reasoning. Reps et al. were the first to state that manifestations of failures are highly correlated with differences in program spectra collected from passing and failing runs [12]. Jones et al. [6] and Dallmeier et al. [2] showed that detecting suspicious statement coverage and suspicious call sequences, respectively, can help developers localize faults.

Hangal et al. introduced an online invariant detector and checker system for Java programs [5]. Liblit et al. explored the possibility of locating faults by gathering a little information from every run of fielded program instances and finding violations of automatically identified likely program invariants [8]. MBD doesn't require the instrumentation of binaries; all it needs is a way to communicate with the program via its input/output channels.

Empirical reasoning. Zeller et al. introduced the delta debugging algorithm to isolate failure-inducing cause-effect chains from programs via testing [13]. One way MBD differs from delta debugging is that MBD always generates valid test cases (either positive or negative), whereas the intermediate tests conducted by the delta debugging approach are often meaningless to QA teams.

6. CONCLUDING REMARKS

To the best of our knowledge, MBD is the first approach that can localize conceptual errors (e.g., errors stemming from nonconformance to specifications) as well as coding errors. Furthermore, it is a completely black-box approach to debugging; no access to source code or binaries is needed.

To evaluate the proposed approach, we conducted a set of controlled experiments on an electronic purse application written for the Java Card environment. All empirical studies suffer from threats to their internal and external validity. For this work, we were primarily concerned with threats to external validity since they limit our ability to generalize our results to industrial practice. One potential threat is that several steps in our approach require human decision making and input, e.g., developers must provide failing input sequences and must also map problem diagnoses back to source code. Another possible threat concerns the representativeness of the subject model, the program, and the errors seeded. We need to apply our approach to larger, more realistic models and software systems in future work to understand how well it scales.

Despite these limitations, we believe that the results of our studies support our basic hypotheses: 1) mutating expected behavioral models can mimic programmers' typical errors, 2) leveraging confirming sequences to validate expected behaviors against a set of hypothesized faulty behaviors obtained by mutations can detect faults, 3) validating anticipated faulty behaviors against their own mutants can locate potential causes, and 4) scoring the validated faulty behaviors can accurately prioritize the causes.

We believe that this line of research is novel and interesting, but much work remains to be done. How well would it scale? Can the diagnoses be automatically mapped back to the code (if available)? How can the debuggability of behavioral models be improved? Can other types of behavioral models be leveraged in addition to EFSMs? All these questions present opportunities for future research.

7. REFERENCES

[1] A. Cimatti, E. Clarke, E. Giunchiglia, et al. NuSMV 2: An opensource tool for symbolic model checking. In CAV '02.
[2] V. Dallmeier, C. Lindig, and A. Zeller. Lightweight defect localization for Java. In ECOOP '05, pages 528–550, 2005.
[3] J. de Kleer and B. C. Williams. Diagnosing multiple faults. Artif. Intell., 32(1):97–130, 1987.
[4] D. R. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In OSDI '00, pages 1–16, 2000.
[5] S. Hangal and M. S. Lam. Tracking down software bugs using automatic anomaly detection. In ICSE '02, pages 291–301, New York, NY, USA, 2002. ACM Press.
[6] J. A. Jones, M. J. Harrold, and J. Stasko. Visualization of test information to assist fault localization. In ICSE '02, pages 467–477, New York, NY, USA, 2002. ACM Press.
[7] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, 1978.
[8] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In PLDI '03, pages 141–154, New York, NY, USA, 2003. ACM Press.
[9] C. Mateis, M. Stumptner, and F. Wotawa. Debugging of Java programs using a model-based approach. In DX '99 Workshop, 1999.
[10] A. Petrenko, S. Boroday, and R. Groz. Confirming configurations in EFSM testing. IEEE TSE, 30(1):29–42, 2004.
[11] R. Reiter. A theory of diagnosis from first principles. Artif. Intell., 32(1):57–95, 1987.
[12] T. Reps, T. Ball, M. Das, and J. Larus. The use of program profiling for software maintenance with applications to the year 2000 problem. SIGSOFT Softw. Eng. Notes, 22(6):432–449, 1997.
[13] A. Zeller. Isolating cause-effect chains from computer programs. In SIGSOFT '02/FSE-10, pages 1–10, New York, NY, USA, 2002. ACM Press.
