Adequate and Precise Evaluation of Predictive Models in Software Engineering Studies

Jun 09, 2015

promise07

Yan Ma and Bojan Cukic
Transcript
Page 1: Adequate and Precise Evaluation of Predictive Models in Software Engineering Studies

Yan Ma, Bojan Cukic

Lane Department of Computer Science and Electrical Engineering

West Virginia University

May 2007

Adequate Evaluation of Quality Models in Software Engineering Studies

CITeR, The Center for Identification Technology Research, www.citer.wvu.edu, an NSF I/UCR Center advancing integrative research

Page 2:

Evaluating Defect Models

• Hundreds of research papers.
  – Most offer very little one can generalize and reapply.
  – The initial hurdle was the lack of data, but not any longer:
    • open source repositories, NASA MDP, PROMISE datasets.
• How to evaluate defect prediction models?

Page 3:

Software Defect Data: Class Imbalance

• A few modules are fault-prone.
  – A problem for supervised learning algorithms, which typically try to maximize overall accuracy.
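To make the imbalance problem concrete, here is a minimal sketch; the counts are invented, though the 7% fault rate mirrors MDP PC1 (see Page 10). An accuracy-maximizing learner can score well while detecting nothing:

```python
# With heavy class imbalance, a learner that never flags a fault still
# scores high accuracy. Counts below are invented for illustration.
n_modules = 1000
n_faulty = 70                        # ~7% fault-prone, as in MDP PC1
n_fault_free = n_modules - n_faulty

# A degenerate "classifier" that labels every module defect-free:
accuracy = n_fault_free / n_modules  # 0.93, yet it detects zero faults
print(f"accuracy={accuracy:.2f}, faults detected=0")
```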

Page 4:

Software Defect Data: Correlation

MDP-PC1: Pearson correlation coefficients

        LOC    TOpnd  V      B      LCC    N
LOC     1.000  0.908  0.937  0.931  0.545  0.924
TOpnd   0.908  1.000  0.976  0.971  0.464  0.996
V       0.937  0.976  1.000  0.995  0.468  0.987
B       0.931  0.971  0.995  1.000  0.468  0.982
LCC     0.545  0.464  0.468  0.468  1.000  0.473
N       0.924  0.996  0.987  0.982  0.473  1.000

Page 5:

Software Defect Data: Correlation (2)

MDP-KC2: Pearson correlation coefficients

        LOC    UOp    V      IV.G   TOp    LOB
LOC     1.000  0.632  0.986  0.968  0.991  0.909
UOp     0.632  1.000  0.536  0.577  0.615  0.636
V       0.986  0.536  1.000  0.970  0.990  0.887
IV.G    0.968  0.577  0.970  1.000  0.972  0.836
TOp     0.991  0.615  0.990  0.972  1.000  0.912
LOB     0.909  0.636  0.887  0.836  0.912  1.000

Page 6:

Software Defect Data: Correlation (3)

• Five “most informative attributes”

Page 7:

Software Defect Data: Module Size

• Defect-free modules are smaller.
• In MDP, modules are very small.

The 90th percentile of LOC for the collection of defect modules and defect-free modules:

              KC1  KC2  PC1  JM1  CM1
Defect-free   42   55   47   72   55
Defect        99   167  114  165  131

Page 8:

Software Defect Data: Close Neighbors

• The “nearest neighbor” of most defective modules is a fault-free module.
  – Measured by Euclidean distance between module metrics.

Percentage of defect modules whose nearest neighbors fall in the majority (fault-free) class:

Project   Nearest neighbor in majority class   2 of 3 nearest neighbors in majority class
KC1       66.26%                               73.62%
KC2       58.33%                               58.33%
JM1       67.90%                               75.46%
PC1       75.32%                               85.71%
CM1       73.47%                               97.96%
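Percentages like these can be computed directly from the metric vectors. A self-contained sketch, assuming Euclidean distance over raw metrics as on the slide; the toy modules below are invented, not MDP data:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two module metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fraction_with_majority_nn(features, labels):
    """Share of defective modules (label 1) whose nearest neighbor is fault-free (label 0)."""
    defective = [i for i, y in enumerate(labels) if y == 1]
    hits = 0
    for i in defective:
        nn = min((j for j in range(len(features)) if j != i),
                 key=lambda j: euclidean(features[i], features[j]))
        hits += labels[nn] == 0
    return hits / len(defective)

# Toy data: both defective modules sit closest to fault-free neighbors.
features = [(10, 50), (12, 55), (11, 52), (90, 400), (13, 54)]
labels   = [0, 0, 1, 1, 0]
print(fraction_with_majority_nn(features, labels))  # -> 1.0
```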

Page 9:

Implications on Evaluation

• Many machine learning algorithms ineffective.
  – But one would never know by reading the literature.
• Experimental results rarely reported adequately.

Page 10:

Classification Success Measures

• Probability of detection (PD): correctly classifying faulty modules (also called sensitivity).
• Specificity: correctly classified fault-free modules.
• False alarm rate (PF): proportion of misclassified fault-free modules.
  – PF = 1 − Specificity
• Precision index: proportion of faulty modules among those predicted as faulty.

Random forests on PC1 (only 7% of modules faulty)
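All four indices can be read off a classifier's confusion matrix. A small sketch with invented counts for an imbalanced project:

```python
# tp = faulty modules flagged faulty, fn = faulty modules missed,
# fp = fault-free modules flagged faulty, tn = fault-free modules passed.
def indices(tp, fn, fp, tn):
    pd = tp / (tp + fn)            # probability of detection (sensitivity)
    specificity = tn / (tn + fp)   # correctly classified fault-free modules
    pf = 1 - specificity           # false alarm rate
    precision = tp / (tp + fp)     # faulty among those predicted faulty
    return pd, specificity, pf, precision

# Hypothetical counts (invented): 70 faulty and 930 fault-free modules.
pd, spec, pf, prec = indices(tp=30, fn=40, fp=50, tn=880)
print(f"PD={pd:.3f} Specificity={spec:.3f} PF={pf:.3f} Precision={prec:.3f}")
```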

Page 11:

Success Measures (2)

• Accuracy, PD, specificity, and the precision index each tell a one-sided story.
• Combined indices capture the measures of interest:
  – G-mean1 = √(PD × Precision)
  – G-mean2 = √(PD × Specificity), the geometric mean of the two accuracies.
• Higher precision leads to “cheaper” V&V.
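The two geometric-mean indices are straightforward to compute; the PD, Precision, and Specificity values below are invented for illustration:

```python
import math

def g_mean1(pd, precision):
    """Geometric mean of detection rate and precision."""
    return math.sqrt(pd * precision)

def g_mean2(pd, specificity):
    """Geometric mean of the two class-wise accuracies."""
    return math.sqrt(pd * specificity)

# Invented operating point: PD=0.4, Precision=0.6, Specificity=0.9.
print(round(g_mean1(0.4, 0.6), 3), round(g_mean2(0.4, 0.9), 3))
```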

Page 12:

Success Measures (3)

• The F-measure, like G-mean1, combines PD and Precision.
  – More flexibility: the weight β should reflect the project’s “cost vs. risk aversion”.
  – Choosing β may be difficult.

F-measure = ((β² + 1) × Precision × PD) / (β² × Precision + PD)
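A direct transcription of the weighted F-measure; β = 1 recovers the usual F1, while β = 2 weights detection (PD) more heavily. The PD and Precision inputs below are invented:

```python
def f_measure(pd, precision, beta=1.0):
    """Weighted harmonic combination of PD and Precision."""
    return ((beta ** 2 + 1) * precision * pd) / (beta ** 2 * precision + pd)

# Invented values: PD=0.4, Precision=0.6.
print(round(f_measure(0.4, 0.6, beta=1), 3))  # balances PD and Precision
print(round(f_measure(0.4, 0.6, beta=2), 3))  # favors PD
```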

Page 13:

Comparing Models

PC1: random forests at different voting cutoffs

Index               Figure 1 (a)  Figure 1 (b)
G-mean1             0.399         0.426
G-mean2             0.519         0.783
F-measure (β = 1)   0.372         0.368
F-measure (β = 2)   0.305         0.527

• G-mean2 and F-measure (β = 2) reflect the difference between the two models.
• Interpretation still in the domain of human understanding.

Page 14:

Comparing Performance: MDP-PC1

Index               Naïve Bayes  Logistic  IB1    J48    Bagging
PD                  0.299        0.065     0.442  0.234  0.169
1 - PF              0.936        0.988     0.954  0.985  0.995
Precision           0.259        0.289     0.415  0.540  0.732
Overall Accuracy    0.892        0.924     0.918  0.933  0.938
G-mean1             0.278        0.137     0.428  0.356  0.352
G-mean2             0.529        0.253     0.649  0.480  0.410
F-measure (β = 1)   0.278        0.106     0.428  0.327  0.275
F-measure (β = 2)   0.290        0.077     0.436  0.264  0.200
ED (θ = 0.5)        0.498        0.661     0.396  0.542  0.588

Page 15:

Comparing Performance: KC-2

Index               Naïve Bayes  Logistic  IB1    J48    Bagging
PD                  0.398        0.389     0.509  0.546  0.472
1 - PF              0.950        0.932     0.858  0.896  0.931
Precision           0.674        0.599     0.483  0.578  0.639
Overall Accuracy    0.836        0.820     0.786  0.824  0.836
G-mean1             0.518        0.483     0.496  0.562  0.549
G-mean2             0.615        0.602     0.661  0.700  0.663
F-measure (β = 1)   0.501        0.472     0.496  0.562  0.543
F-measure (β = 2)   0.434        0.418     0.504  0.552  0.498
ED (θ = 0.5)        0.427        0.435     0.361  0.329  0.377
ED (θ = 0.67)       0.492        0.500     0.409  0.375  0.433

Page 16:

Visual Tools: Margin Plots

Page 17:

Visual Tools: ROC

• Classifiers that allow multiple operating points are more flexible.
• Area Under the Curve (AUC), when two curves are available.
• Distance from the ideal performance:

ED = √(θ × (1 − PD)² + (1 − θ) × PF²)

The factor θ depends on the misclassification cost.
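A quick sketch of ED, the distance from the ideal ROC point (PD = 1, PF = 0). Plugging in the Naïve Bayes row from the PC1 table (PD = 0.299, PF = 1 − 0.936 = 0.064) reproduces the tabulated value:

```python
import math

def ed(pd, pf, theta=0.5):
    """Distance from ideal ROC performance; theta weights missed
    defects against false alarms (misclassification cost)."""
    return math.sqrt(theta * (1 - pd) ** 2 + (1 - theta) * pf ** 2)

# Naive Bayes on PC1: PD = 0.299, PF = 0.064 (from the Page 14 table).
print(round(ed(0.299, 0.064, theta=0.5), 3))  # -> 0.498
```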

Page 18:

Summary

• Emerging data collection must be accompanied by mature evaluation.

• Research reports must reveal all aspects of achieved predictive performance fairly.

• The goodness of a model depends on the cost of misclassification.

Page 19:

Current work

• Statistical significance calculation between ROC curves.
• ROC vs. PR (Precision-Recall) curves, cost curves.