Adequate and Precise Evaluation of Predictive Models in Software Engineering Studies

Jun 09, 2015

promise07

Yan Ma and Bojan Cukic
Transcript
Page 1: Adequate and Precise Evaluation of Predictive Models in Software Engineering Studies

Yan Ma, Bojan Cukic

Lane Department of Computer Science and Electrical Engineering

West Virginia University

May 2007

Adequate Evaluation of Quality Models in Software Engineering Studies

CITeR, The Center for Identification Technology Research, www.citer.wvu.edu, an NSF I/UCR Center advancing integrative research

Page 2:

Evaluating Defect Models

• Hundreds of research papers.
  – Most offer very little one can generalize and reapply.
  – The initial hurdle was the lack of data, but not any longer:
    • open source repositories, NASA MDP, PROMISE datasets.
• How to evaluate defect prediction models?

Page 3:

Software Defect Data: Class Imbalance

• A few modules are fault-prone.
  – A problem for supervised learning algorithms, which typically try to maximize overall accuracy.
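To make the imbalance problem concrete, here is a minimal sketch; the counts are invented, though the 7% fault rate mirrors MDP PC1 (see Page 10). An accuracy-maximizing learner can score well while detecting nothing:

```python
# With heavy class imbalance, a learner that never flags a fault still
# scores high accuracy. Counts below are invented for illustration.
n_modules = 1000
n_faulty = 70                        # ~7% fault-prone, as in MDP PC1
n_fault_free = n_modules - n_faulty

# A degenerate "classifier" that labels every module defect-free:
accuracy = n_fault_free / n_modules  # 0.93, yet it detects zero faults
print(f"accuracy={accuracy:.2f}, faults detected=0")
```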

Page 4:

Software Defect Data: Correlation

MDP-PC1: Pearson correlation coefficients

        LOC    TOpnd  V      B      LCC    N
LOC     1.000  0.908  0.937  0.931  0.545  0.924
TOpnd   0.908  1.000  0.976  0.971  0.464  0.996
V       0.937  0.976  1.000  0.995  0.468  0.987
B       0.931  0.971  0.995  1.000  0.468  0.982
LCC     0.545  0.464  0.468  0.468  1.000  0.473
N       0.924  0.996  0.987  0.982  0.473  1.000

Page 5:

Software Defect Data: Correlation (2)

MDP-KC2: Pearson correlation coefficients

        LOC    UOp    V      IV.G   TOp    LOB
LOC     1.000  0.632  0.986  0.968  0.991  0.909
UOp     0.632  1.000  0.536  0.577  0.615  0.636
V       0.986  0.536  1.000  0.970  0.990  0.887
IV.G    0.968  0.577  0.970  1.000  0.972  0.836
TOp     0.991  0.615  0.990  0.972  1.000  0.912
LOB     0.909  0.636  0.887  0.836  0.912  1.000

Page 6:

Software Defect Data: Correlation (3)

• Five “most informative attributes”

Page 7:

Software Defect Data: Module Size

• Defect-free modules are smaller.
• In MDP, modules are very small.

The 90th percentile of LOC for the collection of defect modules and defect-free modules:

              KC1  KC2  PC1  JM1  CM1
Defect-free   42   55   47   72   55
Defect        99   167  114  165  131

Page 8:

Software Defect Data: Close Neighbors

• The “nearest neighbor” of most defective modules is a fault-free module.
  – Measured by Euclidean distance between module metrics.

Percentage of defect modules whose nearest neighbors fall in the majority (fault-free) class:

Project   Nearest neighbor in majority class   2 of 3 nearest neighbors in majority class
KC1       66.26%                               73.62%
KC2       58.33%                               58.33%
JM1       67.90%                               75.46%
PC1       75.32%                               85.71%
CM1       73.47%                               97.96%
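Percentages like these can be computed directly from the metric vectors. A self-contained sketch, assuming Euclidean distance over raw metrics as on the slide; the toy modules below are invented, not MDP data:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two module metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fraction_with_majority_nn(features, labels):
    """Share of defective modules (label 1) whose nearest neighbor is fault-free (label 0)."""
    defective = [i for i, y in enumerate(labels) if y == 1]
    hits = 0
    for i in defective:
        nn = min((j for j in range(len(features)) if j != i),
                 key=lambda j: euclidean(features[i], features[j]))
        hits += labels[nn] == 0
    return hits / len(defective)

# Toy data: both defective modules sit closest to fault-free neighbors.
features = [(10, 50), (12, 55), (11, 52), (90, 400), (13, 54)]
labels   = [0, 0, 1, 1, 0]
print(fraction_with_majority_nn(features, labels))  # -> 1.0
```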

Page 9:

Implications on Evaluation

• Many machine learning algorithms ineffective.
  – But one would never know by reading the literature.
• Experimental results rarely reported adequately.

Page 10:

Classification Success Measures

• Probability of detection (PD): correctly classifying faulty modules (also called sensitivity).
• Specificity: correctly classified fault-free modules.
• False alarm rate (PF): proportion of misclassified fault-free modules.
  – PF = 1 − Specificity
• Precision index: proportion of faulty modules among those predicted as faulty.

Random forests on PC1 (only 7% of modules faulty)
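All four indices can be read off a classifier's confusion matrix. A small sketch with invented counts for an imbalanced project:

```python
# tp = faulty modules flagged faulty, fn = faulty modules missed,
# fp = fault-free modules flagged faulty, tn = fault-free modules passed.
def indices(tp, fn, fp, tn):
    pd = tp / (tp + fn)            # probability of detection (sensitivity)
    specificity = tn / (tn + fp)   # correctly classified fault-free modules
    pf = 1 - specificity           # false alarm rate
    precision = tp / (tp + fp)     # faulty among those predicted faulty
    return pd, specificity, pf, precision

# Hypothetical counts (invented): 70 faulty and 930 fault-free modules.
pd, spec, pf, prec = indices(tp=30, fn=40, fp=50, tn=880)
print(f"PD={pd:.3f} Specificity={spec:.3f} PF={pf:.3f} Precision={prec:.3f}")
```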

Page 11:

Success Measures (2)

• Accuracy, PD, specificity, and the precision index each tell a one-sided story.
• Combined indices capture the measures of interest:
  – G-mean1 = √(PD × Precision)
  – G-mean2 = √(PD × Specificity), the geometric mean of the two accuracies.
• Higher precision leads to “cheaper” V&V.
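The two geometric-mean indices are straightforward to compute; the PD, Precision, and Specificity values below are invented for illustration:

```python
import math

def g_mean1(pd, precision):
    """Geometric mean of detection rate and precision."""
    return math.sqrt(pd * precision)

def g_mean2(pd, specificity):
    """Geometric mean of the two class-wise accuracies."""
    return math.sqrt(pd * specificity)

# Invented operating point: PD=0.4, Precision=0.6, Specificity=0.9.
print(round(g_mean1(0.4, 0.6), 3), round(g_mean2(0.4, 0.9), 3))
```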

Page 12:

Success Measures (3)

• The F-measure, like G-mean1, combines PD and Precision.
  – More flexibility: the weight β should reflect the project’s “cost vs. risk aversion”.
  – Choosing β may be difficult.

F-measure = ((β² + 1) × Precision × PD) / (β² × Precision + PD)
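A direct transcription of the weighted F-measure; β = 1 recovers the usual F1, while β = 2 weights detection (PD) more heavily. The PD and Precision inputs below are invented:

```python
def f_measure(pd, precision, beta=1.0):
    """Weighted harmonic combination of PD and Precision."""
    return ((beta ** 2 + 1) * precision * pd) / (beta ** 2 * precision + pd)

# Invented values: PD=0.4, Precision=0.6.
print(round(f_measure(0.4, 0.6, beta=1), 3))  # balances PD and Precision
print(round(f_measure(0.4, 0.6, beta=2), 3))  # favors PD
```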

Page 13:

Comparing Models

PC1: random forests at different voting cutoffs

Index               Figure 1 (a)  Figure 1 (b)
G-mean1             0.399         0.426
G-mean2             0.519         0.783
F-measure (β = 1)   0.372         0.368
F-measure (β = 2)   0.305         0.527

• G-mean2 and F-measure (β = 2) reflect the difference between the two models.
• Interpretation still in the domain of human understanding.

Page 14:

Comparing Performance: MDP-PC1

Index               Naïve Bayes  Logistic  IB1    J48    Bagging
PD                  0.299        0.065     0.442  0.234  0.169
1 - PF              0.936        0.988     0.954  0.985  0.995
Precision           0.259        0.289     0.415  0.540  0.732
Overall Accuracy    0.892        0.924     0.918  0.933  0.938
G-mean1             0.278        0.137     0.428  0.356  0.352
G-mean2             0.529        0.253     0.649  0.480  0.410
F-measure (β = 1)   0.278        0.106     0.428  0.327  0.275
F-measure (β = 2)   0.290        0.077     0.436  0.264  0.200
ED (θ = 0.5)        0.498        0.661     0.396  0.542  0.588

Page 15:

Comparing Performance: KC-2

Index               Naïve Bayes  Logistic  IB1    J48    Bagging
PD                  0.398        0.389     0.509  0.546  0.472
1 - PF              0.950        0.932     0.858  0.896  0.931
Precision           0.674        0.599     0.483  0.578  0.639
Overall Accuracy    0.836        0.820     0.786  0.824  0.836
G-mean1             0.518        0.483     0.496  0.562  0.549
G-mean2             0.615        0.602     0.661  0.700  0.663
F-measure (β = 1)   0.501        0.472     0.496  0.562  0.543
F-measure (β = 2)   0.434        0.418     0.504  0.552  0.498
ED (θ = 0.5)        0.427        0.435     0.361  0.329  0.377
ED (θ = 0.67)       0.492        0.500     0.409  0.375  0.433

Page 16:

Visual Tools: Margin Plots

Page 17:

Visual Tools: ROC

• Classifiers that allow multiple operating points are more flexible.
• Area Under the Curve (AUC), when two curves are available.
• Distance from the ideal performance:

ED = √(θ × (1 − PD)² + (1 − θ) × PF²)

The factor θ depends on the misclassification cost.
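A quick sketch of ED, the distance from the ideal ROC point (PD = 1, PF = 0). Plugging in the Naïve Bayes row from the PC1 table (PD = 0.299, PF = 1 − 0.936 = 0.064) reproduces the tabulated value:

```python
import math

def ed(pd, pf, theta=0.5):
    """Distance from ideal ROC performance; theta weights missed
    defects against false alarms (misclassification cost)."""
    return math.sqrt(theta * (1 - pd) ** 2 + (1 - theta) * pf ** 2)

# Naive Bayes on PC1: PD = 0.299, PF = 0.064 (from the Page 14 table).
print(round(ed(0.299, 0.064, theta=0.5), 3))  # -> 0.498
```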

Page 18:

Summary

• Emerging data collection must be accompanied by mature evaluation.

• Research reports must reveal all aspects of achieved predictive performance fairly.

• The goodness of a model depends on the cost of misclassification.

Page 19:

Current work

• Statistical significance calculation between ROC curves.
• ROC vs. PR (Precision-Recall) curves, cost curves.