Nearest Neighbor Sampling for Better Defect Prediction
Gary D. Boetticher
Department of Software Engineering
University of Houston - Clear Lake
Houston, Texas, USA
Jan 18, 2016
The Problem: Why is there not more ML in Software Engineering?
Human-based: 62 to 86% [Jørgensen 2004]
Algorithmic / Machine Learning: 7 to 16%
Key Idea
More ML in SE through a better-defined experimental process.
Agenda
A better defined process for better predicting (quality)
Experiments: Nearest Neighbor Sampling on PROMISE defect data sets
Extending the approach
Discussion
Conclusions
A Better Defined Process
Emphasis of ML approaches: emphasis on measuring success
– PRED(X)
– Accuracy
– MARE
Prediction success depends upon the relationship between training and test data.
PROMISE Defect Data (from NASA)

Project  Code  Description
CM1      C     NASA spacecraft instrument
KC1      C++   Storage management for receiving/processing ground data
KC2      C++   Science data processing. No software overlap with KC1.
JM1      C     Real-time predictive ground system
PC1      C     Flight software for earth orbiting satellite
21 Inputs
– Size (SLOC, Comments)
– Complexity (McCabe Cyclomatic Complexity)
– Vocabulary (Halstead Operators, Operands)
1 Output: Number of Defects
Data Preprocessing

Project  Original Size  Size w/ No Bad, No Dups  0 Defects  1+ Defects  % Defects
CM1      498            441                      393        48          10.9%
JM1      10,885         8911                     6904       2007        22.5%
KC1      2109           1211                     896        315         26.0%
KC2      522            374                      269        105         28.1%
PC1      1109           953                      883        70          7.3%
Reduced to 2 classes
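The cleaning step above (drop bad records and duplicate vectors, then collapse defect counts into two classes, 0 vs. 1+) can be sketched as follows; the record layout and the field handling are illustrative assumptions, not the paper's actual code:

```python
def preprocess(records):
    """records: list of (metrics_tuple, defect_count) pairs.

    Drops 'bad' records (missing or negative defect counts) and
    duplicate metric vectors, then reduces the output to 2 classes:
    label 1 for 1+ defects, label 0 otherwise. Illustrative sketch.
    """
    seen = set()
    cleaned = []
    for metrics, defects in records:
        if defects is None or defects < 0:   # bad record
            continue
        if metrics in seen:                  # duplicate vector
            continue
        seen.add(metrics)
        label = 1 if defects >= 1 else 0     # reduce to 2 classes
        cleaned.append((metrics, label))
    return cleaned

sample = [((10, 2, 1.5), 0), ((10, 2, 1.5), 3),    # second is a duplicate
          ((25, 7, 4.0), 2), ((5, 1, 0.5), None)]  # last is a bad record
print(preprocess(sample))  # [((10, 2, 1.5), 0), ((25, 7, 4.0), 1)]
```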
Experiment 1
JM1: 6904 vectors with 0 defects, 2007 with 1+ defects (22% defective)
Training: 40% of original data
Test sets: Nice Test and Nasty Test
Experiment 1 Continued
Training set, plus a Nice Test set and a Nasty Test set, each drawn from the remaining vectors in the data set
J48 and Naïve Bayes classifiers from WEKA
200 trials (100 Nice test data + 100 Nasty test data)
– CM1
– JM1
– KC1
– KC2
– PC1
Experiment 1 Continued
20 Nice trials + 20 Nasty trials per data set
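Experiment 1's layout (40% of the data for training, the remainder split into a "nice" and a "nasty" test set) might be set up as below. Treating the nice test set as the remaining 0-defect vectors and the nasty set as the remaining 1+-defect vectors is my reading of the slide diagram, and the WEKA classifiers themselves are not reproduced here:

```python
import random

def make_trial(data, train_frac=0.4, seed=0):
    """Split (vector, label) pairs into train / nice test / nasty test.

    Assumption: the 'nice' test set holds the remaining 0-defect
    vectors and the 'nasty' test set the remaining 1+-defect vectors.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, rest = shuffled[:cut], shuffled[cut:]
    nice = [v for v in rest if v[1] == 0]    # non-defective remainder
    nasty = [v for v in rest if v[1] == 1]   # defective remainder
    return train, nice, nasty

# 6 non-defective + 4 defective toy vectors
data = [((i, i), 0) for i in range(6)] + [((i + 10, i), 1) for i in range(4)]
train, nice, nasty = make_trial(data)
print(len(train), len(nice) + len(nasty))  # 4 training vectors, 6 test vectors
```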
Results: Accuracy

                 Nice Test Set         Nasty Test Set
                 J48     Naïve Bayes   J48     Naïve Bayes
CM1              97.4%   88.3%         6.2%    37.4%
JM1              94.6%   94.8%         16.3%   17.7%
KC1              90.9%   87.5%         22.8%   30.9%
KC2              88.3%   94.1%         42.3%   36.0%
PC1              97.8%   91.9%         19.8%   35.8%
Overall Average  94.4%   93.6%         18.7%   21.2%
Results: Average Confusion Matrix

Average Nice Results (rows = actual class: 1+ Defects, 0 Defects):
J48:   2    3        Naïve Bayes:   3    2
       58   1021                    68   1011

Average Nasty Results:
J48:   50   249      Naïve Bayes:   60   241
       12   7                       3    5

Note the distribution: 0 Defects vs. 1+ Defects
Experiment 2: 60% Train, KNN=3

                                               Accuracy
Neighbor Description  # of TRUEs  # of FALSEs  J48   Naïve Bayes
PPP                   None        None         NA    NA
PPN                   0           354          88    90
PNP                   0           5            40    20
NPP                   None        None         NA    NA
PNN                   3           0            100   0
NPN                   13          0            31    100
NNP                   110         0            25    28
NNN                   None        None         NA    NA
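One plausible reading of the P/N codes above is the class agreement of a test vector's three nearest training neighbors, nearest first; the following sketch computes such a pattern under that assumption (Euclidean distance, toy data):

```python
import math

def neighbor_pattern(test_vec, test_label, train, k=3):
    """Return e.g. 'PPN': for each of the k nearest training vectors
    (nearest first), 'P' if it shares the test vector's class, 'N' if
    not. The P/N reading of the slide's codes is an assumption."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], test_vec))[:k]
    return "".join("P" if label == test_label else "N"
                   for _, label in nearest)

# Two defective neighbors close by, one non-defective far away
train = [((0, 0), 1), ((1, 0), 1), ((5, 5), 0)]
print(neighbor_pattern((0, 1), 1, train))  # PPN
```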
Assessing Experiment Difficulty
Exp_Difficulty = 1 - Matches / Total_Test_Instances
Match = a test vector’s nearest neighbor in the training set belongs to the same class.
Experimental Difficulty = 1: Hard experiment
Experimental Difficulty = 0: Easy experiment
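The formula Exp_Difficulty = 1 - Matches / Total_Test_Instances reduces to a 1-nearest-neighbor lookup per test vector; a minimal sketch, assuming Euclidean distance over the metric vectors:

```python
import math

def experimental_difficulty(train, test):
    """Exp_Difficulty = 1 - Matches / Total_Test_Instances, where a
    match means the test vector's nearest training neighbor has the
    same class. Euclidean distance is an assumption."""
    matches = 0
    for vec, label in test:
        nn = min(train, key=lambda t: math.dist(t[0], vec))
        matches += (nn[1] == label)
    return 1 - matches / len(test)

train = [((0, 0), 0), ((10, 10), 1)]
test = [((1, 1), 0), ((9, 9), 0)]  # second vector sits nearest the wrong class
print(experimental_difficulty(train, test))  # 0.5
```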
Assessing Overall Data Difficulty
Overall Data Difficulty = 1 - Matches / Total_Data_Instances
Match = a data vector’s nearest neighbor among the other vectors in the data set belongs to the same class.
Overall Data Difficulty = 1: Difficult data
Overall Data Difficulty = 0: Easy data
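Overall Data Difficulty is the same computation applied leave-one-out across the whole data set; a sketch under the same Euclidean-distance assumption:

```python
import math

def overall_data_difficulty(data):
    """1 - Matches / Total_Data_Instances, where a match means a
    vector's nearest neighbor among the *other* vectors shares its
    class (leave-one-out over the whole data set)."""
    matches = 0
    for i, (vec, label) in enumerate(data):
        others = [d for j, d in enumerate(data) if j != i]
        nn = min(others, key=lambda t: math.dist(t[0], vec))
        matches += (nn[1] == label)
    return 1 - matches / len(data)

# Two well-separated classes: every nearest neighbor agrees -> easy data
data = [((0, 0), 0), ((0, 1), 0), ((9, 9), 1), ((9, 8), 1)]
print(overall_data_difficulty(data))  # 0.0
```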
Discussion: Anticipated Benefits
Method for characterizing the difficulty of an experiment
More realistic models
Easy to implement
Can be integrated into N-way cross-validation
Can apply to various types of SE data sets:
– Defect Prediction
– Effort Estimation
Can be extended beyond SE to other domains
Discussion: Potential Problems
More work needs to be done
Agreement on how to measure Experimental Difficulty
Extra overhead
Implicitly or explicitly data-starved domains
How to get more ML in SE?
Conclusions
Assess experiments/data for their difficulty
Benefits:
More credibility to the modeling process
More reliable predictors
More realistic models
Acknowledgements
Thanks to the reviewers for their comments!

References
1) M. Jørgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.