Predicting Defects Using Change Genealogies
Kim Herzig*, Sascha Just†, Andreas Rau†, Andreas Zeller†
* Microsoft Research, UK † Saarland University, Germany
Prediction Models
• Goal: determine the likelihood of bugs in code entities
Quality assurance limited by time and money.
Can be helpful for project outsiders.
• Trained on “ground truth”
Known instances and their properties.
Idea: learn from the past to predict the future.
• Predicting / estimating defect likelihood of new, unknown code entities
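The train-then-predict idea above can be sketched in a few lines. This is a minimal illustration, not the learner used in the study: it swaps in a plain k-nearest-neighbours vote, and the two features (lines of code, number of past changes) and all data points are hypothetical.

```python
import math

# Hypothetical "ground truth": (lines_of_code, past_changes) -> was the entity defective?
TRAINING = [
    ((120.0, 2.0), False),
    ((150.0, 3.0), False),
    ((200.0, 4.0), False),
    ((900.0, 25.0), True),
    ((1100.0, 30.0), True),
    ((1000.0, 18.0), True),
]


def defect_likelihood(features, training=TRAINING, k=3):
    """Estimate defect likelihood as the share of defective entities
    among the k known entities closest to `features`."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], features))
    nearest = by_distance[:k]
    return sum(1 for _, defective in nearest if defective) / k


# New, unknown code entities
print(defect_likelihood((950.0, 20.0)))  # 1.0
print(defect_likelihood((130.0, 2.0)))   # 0.0
```

A real setup would normalise the features and use one of several machine learners; the sketch only shows how known instances drive the estimate for unknown ones.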
Fine-Tuning Prediction Models
Prediction Target
Machine Learner
Training Methods
Metrics (independent variables)
(Social) Network Metrics
Some participants more active and central than others.
Are these participants also more crucial?
Assumption: “Central binaries tend to be defect-prone”.
Code Network Metrics
Code entities communicate with each other.
Use call graph network to compute network metrics.
[2008] Zimmermann and Nagappan: “Predicting Defects using Network Analysis on Dependency Graphs”
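To make "network metrics on a call graph" concrete, here is a small sketch that computes in-degree and out-degree per code entity. The call graph and the entity names are hypothetical, and degree is only one simple example of a network metric; the slide does not list which metrics the cited study computes.

```python
from collections import defaultdict

# Hypothetical call graph: caller -> list of callees
CALLS = {
    "Parser.parse": ["Lexer.next", "Ast.build"],
    "Ast.build": ["Ast.node"],
    "Cli.main": ["Parser.parse", "Ast.build"],
}


def degree_metrics(calls):
    """Return {entity: {"in": callers, "out": callees}} for every node."""
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    nodes = set(calls)
    for caller, callees in calls.items():
        for callee in set(callees):  # count each distinct edge once
            nodes.add(callee)
            out_degree[caller] += 1
            in_degree[callee] += 1
    return {n: {"in": in_degree[n], "out": out_degree[n]} for n in sorted(nodes)}


for entity, metrics in degree_metrics(CALLS).items():
    print(entity, metrics)
```

In this toy graph `Ast.build` has the most callers, which is the kind of "central" entity the assumption above refers to.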
Call graphs do not change significantly over time!
Assumption: “Code being crucially changed tends to be defect-prone”.
Change Network Metrics
Code changes depend on each other.
Central code changes tend to be crucial.
Idea: Use dependencies between code changes
Change Genealogies
Change Genealogies (in a nutshell)
[2013] Kim Herzig: “Mining and Untangling Change Genealogies” (PhD thesis)
Directed graph structure
Method level dependencies
Multi-dimensional (space & time)
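The directed, method-level graph structure can be illustrated with a simplified sketch. The dependency rule used here (a change that adds a call to a method depends on the most recent earlier change that defined that method) is an assumption made for illustration; the thesis defines the actual rules. The `Change` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    change_id: str
    time: int    # position in commit history (hypothetical field)
    kind: str    # "add_definition" or "add_call" (simplified)
    method: str  # fully qualified method name


def build_genealogy(changes):
    """Return {change_id: set of parent change_ids it depends on}."""
    latest_definition = {}  # method -> id of the latest change defining it
    parents = {}
    for change in sorted(changes, key=lambda c: c.time):
        parents[change.change_id] = set()
        if change.kind == "add_call" and change.method in latest_definition:
            # Edge points from the dependent change back to its parent
            parents[change.change_id].add(latest_definition[change.method])
        elif change.kind == "add_definition":
            latest_definition[change.method] = change.change_id
    return parents


CHANGES = [
    Change("c1", 1, "add_definition", "Foo.bar"),
    Change("c2", 2, "add_call", "Foo.bar"),
    Change("c3", 3, "add_definition", "Foo.bar"),
    Change("c4", 4, "add_call", "Foo.bar"),
]
print(build_genealogy(CHANGES))
```

Because edges link changes made at different times to different methods, the graph spans both dimensions named above: space (which code) and time (when it changed).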
Change Genealogy Metrics
EGO network metrics
Measure the immediate impact of changes on other changes.
GLOBAL network metrics
Express the long-term impact of changes on other changes.
Considering the type of the change
Adding method definition, modifying method call
Considering parent age
How old are the parent changes a change depends on?
Change genealogy metrics must be aggregated to source file level.
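The metric families above, and their aggregation to file level, can be sketched as follows. Everything concrete here is an assumption for illustration: the genealogy, commit days, and file mapping are made up; "immediate impact" is approximated by the number of direct dependents, "long-term impact" by the number of transitive dependents; and `max` is used as the aggregation function, which the slides do not specify.

```python
from statistics import mean

# Hypothetical genealogy: change -> parent changes it depends on
PARENTS = {"c1": set(), "c2": {"c1"}, "c3": {"c1"}, "c4": {"c2", "c3"}}
COMMIT_DAY = {"c1": 0, "c2": 10, "c3": 40, "c4": 100}
FILE_OF = {"c1": "A.java", "c2": "A.java", "c3": "B.java", "c4": "B.java"}


def children_of(parents):
    """Invert the parent relation: change -> changes that depend on it."""
    children = {c: set() for c in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    return children


def descendants(change, children):
    """All changes that depend on `change`, directly or transitively."""
    seen, stack = set(), [change]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def change_metrics(parents, commit_day):
    children = children_of(parents)
    metrics = {}
    for change, ps in parents.items():
        ages = [commit_day[change] - commit_day[p] for p in ps]
        metrics[change] = {
            "ego_dependents": len(children[change]),                  # EGO
            "global_dependents": len(descendants(change, children)),  # GLOBAL
            "avg_parent_age": mean(ages) if ages else 0,              # parent age
        }
    return metrics


def aggregate_by_file(metrics, file_of):
    """Collapse per-change metrics to one row per source file (max of each)."""
    grouped = {}
    for change, values in metrics.items():
        grouped.setdefault(file_of[change], []).append(values)
    return {
        path: {name: max(row[name] for row in rows) for name in rows[0]}
        for path, rows in grouped.items()
    }


print(aggregate_by_file(change_metrics(PARENTS, COMMIT_DAY), FILE_OF))
```

The aggregated rows are what a file-level prediction model would consume as independent variables.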
Comparing change genealogies against:
Code complexity models (e.g. McCabe)
Code dependency models (Zimmermann & Nagappan)
Combined network models (Change genealogy & code dependency network metrics)
Experimental Setup
Study subjects
Multiple machine learners
Prediction Precision
Code complexity metrics
Code dependency network metrics (Zimmermann & Nagappan)
Change genealogy metrics
Combined code dependency network metrics and change genealogy metrics (NM & CGM)
Confirmed: Network metrics outperform complexity metrics.
Change genealogy models report fewer false positives (higher precision).
Change genealogy models report slightly more false negatives (lower recall).
Combining network metrics: good recall but worse precision.
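The link between false positives and precision, and between false negatives and recall, can be checked with a short worked example. The labels below are hypothetical and are not the study's results; they only show how a cautious model trades recall for precision and an eager model does the opposite.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans,
    where True means "defective"."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)      # true positives
    fp = sum(1 for p, a in pairs if p and not a)  # false positives
    fn = sum(1 for p, a in pairs if not p and a)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth for eight files
ACTUAL = [True, True, True, True, False, False, False, False]
# A cautious model: few false positives, more false negatives
CAUTIOUS = [True, True, False, False, False, False, False, False]
# An eager model: no false negatives, more false positives
EAGER = [True, True, True, True, True, True, False, False]

print(precision_recall(CAUTIOUS, ACTUAL))  # (1.0, 0.5)
print(precision_recall(EAGER, ACTUAL))
```

Precision is TP / (TP + FP) and recall is TP / (TP + FN), so fewer false positives raise precision while more false negatives lower recall.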
Influential Metrics
Network efficiency among the top 10 most influential metrics.
Relationship between changes and type of dependency are the top 2 metrics (for all projects).
The higher the number of old parents, the higher the probability of adding bugs.
Code entities combining multiple older functionalities more defect prone.
Summary
Adapting social network metrics to change dependency graphs.
Comparing prediction models.
Change genealogies are well suited for defect prediction (better precision, close recall).