Nearest Neighbor Sampling for Better Defect Prediction
Gary D. Boetticher
Department of Software Engineering
University of Houston - Clear Lake
Houston, Texas, USA
Jan 18, 2016
The Problem: Why is there not more ML in Software Engineering?
Human-based: 62 to 86% [Jørgensen 2004]
Algorithmic / Machine Learning: 7 to 16%
Key Idea
More ML in SE through a better-defined experimental process.
Agenda
A better defined process for better predicting (quality)
Experiments: Nearest Neighbor Sampling on PROMISE defect data sets
Extending the approach
Discussion
Conclusions
A Better Defined Process
Emphasis of ML approaches: emphasis on measuring success
– PRED(X)
– Accuracy
– MARE
Prediction success depends upon the relationship between training and test data.
PROMISE Defect Data (from NASA)

Project  Code  Description
CM1      C     NASA spacecraft instrument
KC1      C++   Storage management for receiving/processing ground data
KC2      C++   Science data processing. No software overlap with KC1.
JM1      C     Real-time predictive ground system
PC1      C     Flight software for earth orbiting satellite
21 Inputs
– Size (SLOC, Comments)
– Complexity (McCabe Cyclomatic Complexity)
– Vocabulary (Halstead Operators, Operands)
1 Output: Number of Defects
Data Preprocessing

Project  Original Size  Size w/ No Bad, No Dups  0 Defects  1+ Defects  % Defects
CM1      498            441                      393        48          10.9%
JM1      10,885         8911                     6904       2007        22.5%
KC1      2109           1211                     896        315         26.0%
KC2      522            374                      269        105         28.1%
PC1      1109           953                      883        70          7.3%
Reduced to 2 classes
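The cleaning step above (drop bad records and duplicate vectors, then collapse defect counts into two classes, 0 vs. 1+) can be sketched as follows; the record layout and the field handling are illustrative assumptions, not the paper's actual code:

```python
def preprocess(records):
    """records: list of (metrics_tuple, defect_count) pairs.

    Drops 'bad' records (missing or negative defect counts) and
    duplicate metric vectors, then reduces the output to 2 classes:
    label 1 for 1+ defects, label 0 otherwise. Illustrative sketch.
    """
    seen = set()
    cleaned = []
    for metrics, defects in records:
        if defects is None or defects < 0:   # bad record
            continue
        if metrics in seen:                  # duplicate vector
            continue
        seen.add(metrics)
        label = 1 if defects >= 1 else 0     # reduce to 2 classes
        cleaned.append((metrics, label))
    return cleaned

sample = [((10, 2, 1.5), 0), ((10, 2, 1.5), 3),    # second is a duplicate
          ((25, 7, 4.0), 2), ((5, 1, 0.5), None)]  # last is a bad record
print(preprocess(sample))  # [((10, 2, 1.5), 0), ((25, 7, 4.0), 1)]
```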
Experiment 1
JM1: 6904 vectors with 0 defects, 2007 with 1+ defects (22% defective)
Training: 40% of original data
Test sets: Nice Test and Nasty Test
Experiment 1 Continued
Training set, plus a Nice Test set and a Nasty Test set, each drawn from the remaining vectors in the data set
J48 and Naïve Bayes classifiers from WEKA
200 trials (100 Nice test data + 100 Nasty test data)
– CM1
– JM1
– KC1
– KC2
– PC1
Experiment 1 Continued
20 Nice trials + 20 Nasty trials per data set
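Experiment 1's layout (40% of the data for training, the remainder split into a "nice" and a "nasty" test set) might be set up as below. Treating the nice test set as the remaining 0-defect vectors and the nasty set as the remaining 1+-defect vectors is my reading of the slide diagram, and the WEKA classifiers themselves are not reproduced here:

```python
import random

def make_trial(data, train_frac=0.4, seed=0):
    """Split (vector, label) pairs into train / nice test / nasty test.

    Assumption: the 'nice' test set holds the remaining 0-defect
    vectors and the 'nasty' test set the remaining 1+-defect vectors.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, rest = shuffled[:cut], shuffled[cut:]
    nice = [v for v in rest if v[1] == 0]    # non-defective remainder
    nasty = [v for v in rest if v[1] == 1]   # defective remainder
    return train, nice, nasty

# 6 non-defective + 4 defective toy vectors
data = [((i, i), 0) for i in range(6)] + [((i + 10, i), 1) for i in range(4)]
train, nice, nasty = make_trial(data)
print(len(train), len(nice) + len(nasty))  # 4 training vectors, 6 test vectors
```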
Results: Accuracy

                 Nice Test Set         Nasty Test Set
                 J48     Naïve Bayes   J48     Naïve Bayes
CM1              97.4%   88.3%         6.2%    37.4%
JM1              94.6%   94.8%         16.3%   17.7%
KC1              90.9%   87.5%         22.8%   30.9%
KC2              88.3%   94.1%         42.3%   36.0%
PC1              97.8%   91.9%         19.8%   35.8%
Overall Average  94.4%   93.6%         18.7%   21.2%
Results: Average Confusion Matrix

Average Nice Results (rows = actual class: 1+ Defects, 0 Defects):
J48:   2    3        Naïve Bayes:   3    2
       58   1021                    68   1011

Average Nasty Results:
J48:   50   249      Naïve Bayes:   60   241
       12   7                       3    5

Note the distribution: 0 Defects vs. 1+ Defects
Experiment 2: 60% Train, KNN=3

                                               Accuracy
Neighbor Description  # of TRUEs  # of FALSEs  J48   Naïve Bayes
PPP                   None        None         NA    NA
PPN                   0           354          88    90
PNP                   0           5            40    20
NPP                   None        None         NA    NA
PNN                   3           0            100   0
NPN                   13          0            31    100
NNP                   110         0            25    28
NNN                   None        None         NA    NA
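One plausible reading of the P/N codes above is the class agreement of a test vector's three nearest training neighbors, nearest first; the following sketch computes such a pattern under that assumption (Euclidean distance, toy data):

```python
import math

def neighbor_pattern(test_vec, test_label, train, k=3):
    """Return e.g. 'PPN': for each of the k nearest training vectors
    (nearest first), 'P' if it shares the test vector's class, 'N' if
    not. The P/N reading of the slide's codes is an assumption."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], test_vec))[:k]
    return "".join("P" if label == test_label else "N"
                   for _, label in nearest)

# Two defective neighbors close by, one non-defective far away
train = [((0, 0), 1), ((1, 0), 1), ((5, 5), 0)]
print(neighbor_pattern((0, 1), 1, train))  # PPN
```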
Assessing Experiment Difficulty
Exp_Difficulty = 1 - Matches / Total_Test_Instances
Match = a test vector’s nearest neighbor in the training set belongs to the same class.
Experimental Difficulty = 1: Hard experiment
Experimental Difficulty = 0: Easy experiment
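The formula Exp_Difficulty = 1 - Matches / Total_Test_Instances reduces to a 1-nearest-neighbor lookup per test vector; a minimal sketch, assuming Euclidean distance over the metric vectors:

```python
import math

def experimental_difficulty(train, test):
    """Exp_Difficulty = 1 - Matches / Total_Test_Instances, where a
    match means the test vector's nearest training neighbor has the
    same class. Euclidean distance is an assumption."""
    matches = 0
    for vec, label in test:
        nn = min(train, key=lambda t: math.dist(t[0], vec))
        matches += (nn[1] == label)
    return 1 - matches / len(test)

train = [((0, 0), 0), ((10, 10), 1)]
test = [((1, 1), 0), ((9, 9), 0)]  # second vector sits nearest the wrong class
print(experimental_difficulty(train, test))  # 0.5
```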
Assessing Overall Data Difficulty
Overall Data Difficulty = 1 - Matches / Total_Data_Instances
Match = a data vector’s nearest neighbor among the other vectors in the data set belongs to the same class.
Overall Data Difficulty = 1: Difficult data
Overall Data Difficulty = 0: Easy data
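Overall Data Difficulty is the same computation applied leave-one-out across the whole data set; a sketch under the same Euclidean-distance assumption:

```python
import math

def overall_data_difficulty(data):
    """1 - Matches / Total_Data_Instances, where a match means a
    vector's nearest neighbor among the *other* vectors shares its
    class (leave-one-out over the whole data set)."""
    matches = 0
    for i, (vec, label) in enumerate(data):
        others = [d for j, d in enumerate(data) if j != i]
        nn = min(others, key=lambda t: math.dist(t[0], vec))
        matches += (nn[1] == label)
    return 1 - matches / len(data)

# Two well-separated classes: every nearest neighbor agrees -> easy data
data = [((0, 0), 0), ((0, 1), 0), ((9, 9), 1), ((9, 8), 1)]
print(overall_data_difficulty(data))  # 0.0
```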
Discussion: Anticipated Benefits
Method for characterizing the difficulty of an experiment
More realistic models
Easy to implement
Can be integrated into N-way cross-validation
Can apply to various types of SE data sets:
– Defect Prediction
– Effort Estimation
Can be extended beyond SE to other domains
Discussion: Potential Problems
More work needs to be done
Agreement on how to measure Experimental Difficulty
Extra overhead
Implicitly or explicitly data-starved domains
How to get more ML in SE?
Conclusions
Assess experiments/data for their difficulty
Benefits:
More credibility to the modeling process
More reliable predictors
More realistic models
Acknowledgements
Thanks to the reviewers for their comments!

References
1) M. Jørgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.