A software fault localization technique based on program mutations

A Software Fault Localization Technique Based on Program Mutations

Tao HeCoauthor with Xinming Wang, Xiaocong Zhou, Wenjun Li, Zhenyu Zhang, S.C. Cheung

[email protected] Engineering Laboratory

Department of Computer Science, Sun Yat-Sen University

The 6nd Seminar of SELABNovember 2012

Sun Yat-Sen University, Guangzhou, China

1/23

mailto:[email protected]

Outline

Background and Motivation Our Approach – Muffler Empirical Evaluation Conclusion

2/23

Background and Motivation

3/23

Background

Coverage-Based Fault Localization (CBFL) Input

Coverage Testing results (passed or failed)

Output A ranking list of statements

Ranking functions Most CBFL techniques are similar with each other

except that different ranking functions are used to compute suspiciousness.

4/23

Motivation

One fundamental assumption [YPW08] of CBFL The observed behaviors from passed runs can precisely

represent the correct behaviors of this program; and the observed behaviors from failed runs can represent the

infamous behaviors. Therefore, the different observed behaviors of program

entities between passed runs and failed runs will indicate the fault’s location.

But this does not always hold.

5/23

[YPW08] C. Yilmaz, A. Paradkar, and C. Williams. Time will tell: fault localization using time spectra. In Proceedings of the 30th international conference on Software engineering (ICSE '08). ACM, New York, NY, USA, 81-90. 2008.

Motivation Coincidental Correctness (CC)

“No failure is detected, even though a fault has been executed.” [RT93]

i.e., the passed runs may cover the fault.

Weaken the first part of CBFL’s assumption: The observed behaviors from passed runs can precisely represent

the correct behaviors of this program; More, CC occurs frequently in practice.[MAE+09]

6/23

[RT93] D.J. Richardson and M.C. Thompson, An analysis of test data selection criteria using the RELAY model of fault detection, Software Engineering, IEEE Transactions on, vol. 19, (no. 6), pp. 533-553, 1993.[MAE+09] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi, An empirical study of the factors that reduce the effectiveness of coverage-based fault localization, in Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2009), pp. 1-5, 2009.

Our goal is to address the CC issue via mutation analysis

Our Approach – Muffler

7/23

Why does our approach work?- Key hypothesis

Mutating the faulty statement tends to maintain the results of passed test cases.

By contrast, mutating a correct statement tends to change the results of passed test cases (from passed to failed).

8/23

Why does our approach work?- Three comprehensive scenarios (1/3)

9/23

F M

M: Mutant pointF: Fault point

Program

Test cases

Test results

Passed

Failed

3 test results change from passed to failed

- If we mutate an M in different basic blocks with F


10/23

F

M


Program

Test cases

Test results

Passed

Failed


- If we mutate an M in different basic blocks with F


11/23

F+M


Program

Test cases

Test results

Passed

Failed

0 test result changes from passed to failed

- If we mutate F


12/23

F


Program

Test cases

Test results

Passed

FailedM

Data Flow

Control Flow


- If we mutate an M in the same basic block with FDue to different data flow to affect output


13/23

F


Program

Test cases

Test results

Passed

Failed

Data Flow

Control Flow

0 test result change from passed to failed

+M

- If we mutate F


14/23

F+MM: Mutant pointF: Fault point

Program

Test cases

Test results

Passed

Failed

0 test result changes from passed to failed

Weak ability to generate an infectious state or to propagate the infectious state to output

- When CC occurs frequently

- If we mutate FDue to weak ability to affect output

Our Approach – Muffler

15/23

Naish, the best existing ranking function[LRR11]

Mutation impact, the average amount of testing results change from passed to failed

Muffler, a combination of Naish and mutation impact

[LRR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACM Transaction on Software Engineering Methodology, 20(3):11, 2011.

Empirical Evaluation

16/23


Program suite Number of versions

Lines of Executable

Code

Number of test cases LOC

tcas 41 63-67 1608 133-137

tot_info 23 122-123 1052 272-273

schedule 9 149-152 2650 290-294

schedule2 10 127-129 2710 261-263

print_tokens 7 189-190 4130 341-343

print_tokens2 10 199-200 4115 350-355

replace 32 240-245 5542 508-515

space 38 3633-3647 13585 5882-590417/23

Evaluation metrics

Subject programs


18/23

Percentage of code examined

Per

cent

age

of fa

ult l

ocat

ed

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

55%

60%

65%

70%

75%

80%

85%

90%

95%

100%

Techiniques

MufflerNaishOchiaiTarantulaWong3

Figure: Overall effectiveness comparison.


19/23

% of code examined

Tarantula Ochiai χDebug Naish Muffler

1% 14 18 19 21 35

5% 38 48 56 58 74

10% 54 63 68 68 85

15% 57 65 80 80 94

20% 60 67 84 84 99

30% 79 88 91 92 110

40% 92 98 98 99 117

50% 98 99 101 102 121

60% 99 103 105 106 123

70% 101 107 117 119 123

80% 114 122 122 123 123

90% 123 123 122 123 123

100% 123 123 123 123 123

Table: Number of faults located at different level of code examination effort using Naish and Muffler.

When 1% of the statements have been examined, Naish can reach the fault in 17.07% of faulty versions. At the same time, Muffler can reach the fault in 28.46% of faulty versions.


20/23

Tarantula Ochiai χDebug Naish Muffler

Min 0.00 0.00 0.00 0.00 0.00

Max 87.89 84.25 93.85 78.46 55.38

Median 20.33 9.52 7.69 7.32 3.25

Mean 27.68 23.62 20.04 19.34 9.62

Stdev 28.29 26.36 24.61 23.86 13.22

Table: Statistics of code examination effort.

Among these five techniques, Muffler always scores the best in the rows that correspond to the minimum, median, and mean code examination effort. In addition, Muffler gets much lower standard deviation, which means that their performances vary less widely than others, and are shown to be more stable in terms of effectiveness. Results also show that Muffler reduces the average code examination effort from Naish by 50.26% (=100%-(9.62%/19.34%).

Conclusion and future work

We propose Muffler, a technique using mutation to help locate program faults.

On 123 faulty versions of seven programs, we conduct a comparison of effectiveness and efficiency with Naish technique. Results show that Muffler reduces the average code examination effort on each faulty version by 50.26%.

For future work, we plan to generalize our approach to locate faults in multi-fault programs.

21/23

Q & A

22/23

Thank you!Contact me via [email protected]

23/23

mailto:[email protected]

# Background

Mutation analysis, first proposed by Hamlet [Ham77] and Demilo et al. [DLS78] , is a fault-based testing technique used to measure the effectiveness of a test suite.

In mutation analysis, one introduces syntactic code changes, one at a time, into a program to generate various faulty programs (called mutants).

A mutation operator is a change-seeding rule to generate a mutant from the original program.

24/23

[Ham77] R.G. Hamlet, Testing Programs with the Aid of a Compiler, Software Engineering, IEEE Transactions on, vol. SE-3, (no. 4), pp. 279- 290, 1977.[DLS78] R.A. DeMillo, R.J. Lipton and F.G. Sayward, Hints on Test Data Selection: Help for the Practicing Programmer, Computer, vol. 11, (no. 4), pp. 34-41, 1978.

# Ranking functions

25/23

Table: Ranking faunctions

Tarantula [JHS02], Ochiai [AZV07], χDebug [WQZ+07], and Naish [NLR11]

[JHS02] J.A. Jones, M. J. Harrold, and J. Stasko. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering (ICSE '02), pp. 467-477, 2002.[AZV07] R. Abreu, P. Zoeteweij and A.J.C. Van Gemund, On the accuracy of spectrum-based fault localization, in Proc. Proceedings - Testing: Academic and Industrial Conference Practice and Research Techniques, TAIC PART-Mutation 2007, pp. 89-98, 2007. [WQZ+07] W.E. Wong, Yu Qi, Lei Zhao, and Kai-Yuan Cai. Effective Fault Localization using Code Coverage. In Proceedings of the 31st Annual International Computer Software and Applications Conference (COMPSAC '07), Vol. 1, pp. 449-456, 2007.[NLR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACM Transaction on Software Engineering Methodology, 20(3):11, 2011.

# Our Approach – Muffler

26/23

Test Suite

Faulty Program

Ranking List of all statements

Instrument program&

Execute against test suite

Select statements to mutate

Mutate selected statements

Run mutants against test suite

Calculate suspiciousness&

Sort statements

Coverage & Testing Results

Candidate Statements

Mutants

Changes of testing results

Legend

Process

Input

Output

1.

3.

4.

5.

2.

Figure: Dataflow diagram of Muffler.

# Our Approach – Muffler

𝑆𝑢𝑠𝑝𝑀𝑢𝑓𝑓𝑙𝑒𝑟 (𝑆 𝑖)=𝐹𝑎𝑖𝑙𝑒𝑑(𝑃 ,𝑆𝑖)×(𝑇𝑜𝑡𝑎𝑙𝑃𝑎𝑠𝑠𝑒𝑑(𝑃 )+1)– 𝐼𝑚𝑝𝑎𝑐𝑡 (𝑆 𝑖)27/23

Primary Key(imprecise when multiple faults occurs)

Secondary Key(invalid when coincidental correctness%is high)

Additional Key(inclined to handlecoincidental correctness)

# An Example

28/23

Part I

Statement

S1 if (block_queue){

S2 count = block_queue->mem_count + 1; /* fault: insert ‘+1’ */

S3 n = (int) (count*ratio); /* fault: missing ‘+1’ */

S4 proc = find_nth(block_queue, n);

S5 if (proc) {

S6 block_queue = del_ele(block_queue, proc);

S7 prio = proc->priority;

S8 prio_queue[prio] = append_ele(prio_queue[prio], proc);}}

Part II

Tarantula Ochiai χDebug Naish

susp* r** susp r susp r susp r

0.58 8 0.32 8 205.41 8 510812 8

0.64 7 0.36 7 205.83 7 511228 7

0.64 7 0.36 7 205.83 7 511228 7

0.64 7 0.36 7 205.83 7 511228 7

0.64 7 0.36 7 205.83 7 511228 7

0.64 3 0.37 3 205.85 3 511252 3

0.64 3 0.37 3 205.85 3 511252 3

0.64 3 0.37 3 205.85 3 511252 3

Code examination effort to locate S2 and S3:

TotalPassed TotalFailed

2440 210

Passed(s) Failed(s)

1798 210

1382 210

1382 210

1382 210

1382 210

1358 210

1358 210

1358 210

88% 88% 88%

Figure: Faulty version v2 of program “schedule”.

# An Example

29/23

Part III

Mutated statement for each mutant Changep→f

M1 if (!block_queue ) { 1644

M2 count = block_queue->mem_count != 1; 249

M3 n = (int) (count <= ratio) ; 249

M4 proc = find_nth(block_queue , ratio); 1088

M5 if (!proc) { 1136

M6 block_queue = del_ele(block_queue , proc-1); 1123

M7 prio /= proc->priority; 1358

M8 prio_queue[prio] = append_ele(prio_queue[__MININT__] , proc); }} 598

Changep→f Changep→f Changep→f Changep→f

1798 1101 1101 1644

1097 1097 249 1382

1116 1101 494 1101

638 1136 744 1382

1358 1101 1382 1101

349 1358 814 1358

1358 1101 1101 1358

598 1138 1358 1101

Part IV Muffler

Impact susp r

1457.6 509354.4 8

814.8 510413.2 2

812.2 510415.8 2

997.6 510230.4 5

1215.6 510012.4 6

1000.4 510251.6 4

1255.2 509996.8 7

958.6 510293.4 3

Code examination effort to locate S2 and S3: 25%


# An Example

30/23

Part I

Statement

S1 if (block_queue){

S2 count = block_queue->mem_count + 1; /* fault: insert ‘+1’ */

S3 n = (int) (count*ratio); /* fault: missing ‘+1’ */

S4 proc = find_nth(block_queue, n);

S5 if (proc) {

S6 block_queue = del_ele(block_queue, proc);

S7 prio = proc->priority;

S8 prio_queue[prio] = append_ele(prio_queue[prio], proc);}}

Part II

Tarantula Ochiai χDebug Naish

susp* r** susp r susp r susp r

0.58 8 0.32 8 205.41 8 510812 8

0.64 7 0.36 7 205.83 7 511228 7

0.64 7 0.36 7 205.83 7 511228 7

0.64 7 0.36 7 205.83 7 511228 7

0.64 7 0.36 7 205.83 7 511228 7

0.64 3 0.37 3 205.85 3 511252 3

0.64 3 0.37 3 205.85 3 511252 3

0.64 3 0.37 3 205.85 3 511252 3

Code examination effort to locate S2 and S3:

TotalPassed TotalFailed

2440 210

Passed(s) Failed(s)

1798 210

1382 210

1382 210

1382 210

1382 210

1358 210

1358 210

1358 210

88% 88% 88%


# An Example

31/23

Part III

Mutated statement for each mutant Changep→f

M1 if (!block_queue ) { 1644

M2 count = block_queue->mem_count != 1; 249

M3 n = (int) (count <= ratio) ; 249

M4 proc = find_nth(block_queue , ratio); 1088

M5 if (!proc) { 1136

M6 block_queue = del_ele(block_queue , proc-1); 1123

M7 prio /= proc->priority; 1358

M8 prio_queue[prio] = append_ele(prio_queue[__MININT__] , proc); }} 598

Changep→f Changep→f Changep→f Changep→f

1798 1101 1101 1644

1097 1097 249 1382

1116 1101 494 1101

638 1136 744 1382

1358 1101 1382 1101

349 1358 814 1358

1358 1101 1101 1358

598 1138 1358 1101

Part IV Muffler

Impact susp r

1457.6 509354.4 8

814.8 510413.2 2

812.2 510415.8 2

997.6 510230.4 5

1215.6 510012.4 6

1000.4 510251.6 4

1255.2 509996.8 7

958.6 510293.4 3

Code examination effort to locate S2 and S3: 25%


# Empirical Evaluation

32/23

Versus

TanrantulaVersus Ochiai

Versus χDebug

Versus Naish

More effective 102 96 93 89

Same effectiveness 19 23 23 25

Less effective 2 4 7 9

Table: Pair-wise comparison betweenMuffler and existing techniques.

Muffler is more effective (examining more statements before encountering the faulty statement) than Naish for 89 out of 123 faulty versions; is as effective (examining the same number of statements before encountering the faulty statement) as Naish for 25 out of 123 faulty versions; and is less effective (examining less statements before encountering the faulty statement) than Naish for only 9 out of 123 faulty versions.


33/23

Faulty versions

CC%

Code examination effort

Naish Muffler

v5 1% 0% 0%

v9 7% 1% 0%

v17 31% 12% 7%

v28 49% 11% 5%

v29 99% 25% 9%

Experience on real faults

Table: Results with real faults in space

Five faulty versions are chosen to represent low, medium, and the high occurrence of coincidental correctness. In this table, the column “CC%” presents the percentage of coincidentally passed test cases out of all passed test cases. The columns under the head “Code examination effort” present the percentage of code to be examined before the fault is encountered.


34/23

Efficiency analysis

Table: Time spent by each technique on subject programs.

We have shown experimentally that, by taking advantages from both coverage and mutation impact, Muffler outperforms Naish regardless the occurrence of coincidental correctness. Unfortunately, our approaches, Muffler need to execute piles of mutants to compute mutation impact. The execution of mutants against the test suite may increase the time cost of fault localization. The time mainly contains the cost of instrumentation, execution, and coverage

collection. From this table, we observe that Muffler takes approximately 62.59 times of average time cost to the Naish technique.

Program suite CBFL (seconds) Muffler (seconds)tcas 18.00 868.68

tot_info 11.92 573.12schedule 34.02 2703.01

schedule2 27.76 1773.14print_tokens 59.11 2530.17

print_tokens2 62.07 5062.87replace 69.13 4139.19Average 40.29 2521.46


35/23

Efficiency analysis

Table: Information about mutants generated.

This Table illustrates the detailed data about the number of mutated/total executable statements, the number of mutants generated, and the time cost of running each mutant. For example, of the program tcas, there are, on average, 40.15 statements that are mutated by Muffler; and 65.10 executable statements in total; 199.90 mutants are generated and it takes 4.26 seconds to run each of them, on average. Notice that there is no need to collect coverage from the mutants’ executions, and it takes about 1/4 time to run a mutant without instrumentation and coverage collection.

Programsuite

Mutatedstatements

Totalstatements Mutants Time per mutant

(seconds)tcas 40.15 65.10 199.90 4.26

tot_info 39.57 122.96 191.87 2.92 schedule 80.60 150.20 351.60 7.59

schedule2 75.33 127.56 327.78 5.32 print_tokens 67.43 189.86 260.29 9.49

print_tokens2 86.67 199.44 398.67 12.54 replace 71.14 242.86 305.93 13.30 Average 56.52 142.79 256.90 7.92

How about the coincidental correctness issue?

36/23

Empirical Evaluation- The impact of coincidental correctness

37/23

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%0 %

10 %

20 %

30 %

40 %

50 %

60 %

70 %

80 %

90 %

100 %

Percentage of coincidental correctness (|Tcc|/|Tp|)Pe

rcen

tage

of c

ode

exam

ined

(Muffl

er)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%0 %

10 %

20 %

30 %

40 %

50 %

60 %

70 %

80 %

90 %

100 %

Percentage of coincidental correctness (|Tcc|/|Tp|)

Perc

enta

ge o

f cod

e ex

amin

ed (N

aish

)

Each point in Figure 5 represents a faulty version; the horizontal axis presents the faulty version’s percentage of coincidental correctness (CC%) that occurs in passed test cases, and the vertical axis presents the faulty version’s code examination effort to find the fault. The polynomial fitting curve (second order) represents the points’ tendency.

Figure 5: Correlation between effectiveness and coincidental correctness.

Does this work in real programs?

38/23

Why does our approach work? - A feasibility study

39/23

0

200

400

600

800

1000

tcas v7

0

200

400

600

800

tot_info v17

0

500

1000

1500

2000

schedule v4

0

500

1000

1500

2000

2500

schedule2 v1

0

1000

2000

3000

4000

print_tokens v7

0

1000

2000

3000

4000

print_tokens2 v3

0

1000

2000

3000

4000

replace v24

0

50

100

150

space v20

The vertical axis denotes the number of testing results changes (from ‘passed’ to ‘failed’), and horizontal width denotes the probability density at corresponding amount of testing results changes.

Figure: Distribution of statements’ result changesand faulty statement’s testing result changes.

Why does our approach work? - A feasibility study

40/23

The vertical axis denotes the number of testing results changes (from ‘passed’ to ‘failed’), and horizontal width denotes the probability density at corresponding amount of testing results changes.

Figure: Distribution of statements’ result changesand faulty statement’s testing result changes.

0

200

400

600

800

1000

tcas v7

0

200

400

600

800

tot_info v7

0

500

1000

1500

2000

2500

schedule v2

0

500

1000

1500

2000

2500

schedule2 v1

0

1000

2000

3000

4000

print_tokens v2

0

1000

2000

3000

4000

print_tokens2 v3

0

1000

2000

3000

4000

replace v15

0

50

100

150

space v8

0

200

400

600

800

1000

tcas v12

0

200

400

600

800

tot_info v8

0

500

1000

1500

2000

2500

schedule v3

0

500

1000

1500

2000

2500

schedule2 v4

0

1000

2000

3000

4000

print_tokens v3

0

1000

2000

3000

4000

print_tokens2 v6

0

1000

2000

3000

4000

replace v17

0

50

100

150

space v11

0

200

400

600

800

1000

tcas v17

0

200

400

600

800

tot_info v17

0

500

1000

1500

2000

2500

schedule v4

0

500

1000

1500

2000

2500

schedule2 v6

0

1000

2000

3000

4000

print_tokens v7

0

1000

2000

3000

4000

print_tokens2 v9

0

1000

2000

3000

4000

replace v24

0

50

100

150

space v20

Why does our approach work? - Another feasibility study (When CC%≥95%)

41/23

0 20 40 60 800

5

10

15

20

25

Percentage of code examined

Fre

qu

ency

of

fau

lty

vers

ion

s

When CC% is greater or equal than 95%, code examination effort reduction of result changes is 65.66% (=100%-16.33%/47.55%).

Only 6 faulty versions need to examine less than 20% of statements for Naish, while 22 versions by using result changes

∎ Result changes (avg. 16.33%)

∎ Naish (avg. 47.55%)

Figure: Frequency distribution of effectiveness when CC%≥ 95%.

Experience on real faults

42/23

Table 8: Results with real faults in space

Faulty versions

CC%Lines of code examined

Naish Muffler

v5 0.90% 2 1v20 1.97% 15 5v21 1.97% 15 6v10 2.74% 47 18v11 6.29% 37 14V6 6.92% 40 7v9 19.05% 7 1

v17 30.92% 427 244v28 48.57% 268 170v29 99.32% 797 331