Studying the Fix-Time for Bugs in Large Open Source Projects
Lionel Marks, Ying Zou, Ahmed E. Hassan, Thanh Nguyen
Queen's University, Kingston, Ontario, Canada
Wednesday, 21 September, 11
Jul 05, 2015
If life were like that, we wouldn't need software prediction.
2
Reality
Many simple feature requests or defect reports do NOT get fixed for years.
3
Feature request / Defect report filed → Triage → Implementation plan / cause of defect determined → Implement → Verify → Close

Work item fix-time: When will it be fixed? Which one should we fix this iteration?
Can we predict the work item fix-time?
4
Three dimensions of properties feed the predictor:

Location properties (7): Product / Version / Component; Number of completed WI*; Average fix time*
Reporter properties (4): Industry / local / public; Popularity*; Number of past requests*; Average fix time*
Work item properties (12): Severity / Priority; Number of interested parties*; Morning / Day / night; Description length*; Code attachment

→ Predictor
5
When will it be fixed?
Location properties (7) + Reporter properties (4) + Work item properties (12) → Predictor → Short / Normal / Long
6
Which one should we fix this iteration?
Location properties (7) + Reporter properties (4) + Work item properties (12) → Predictor → Next minor revision / Next major revision / Next version
7
Case study

Project | Number of work items | < 3 months | < 1 year | < 3 years
Mozilla | 85,616 | 46% | 27% | 27%
Eclipse | 63,402 | 76% | 18% | 6%
8
Random Forest
We use Random Forest because:
• Decision-tree-based models are explainable compared to SVMs or neural networks.
• Random Forest outperforms C4.5 because it is more resistant to data with highly correlated attributes.
• It is easy to analyze the sensitivity of each property.
9
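A minimal sketch of the random-forest idea the slide relies on: train many weak trees, each on a bootstrap sample and a randomly chosen attribute, and predict by majority vote. Everything here is invented for illustration (the toy data, the one-feature "stump" trees, and the class labels); feature 1 is deliberately a coarsened, highly correlated copy of feature 0, echoing the correlated-attributes point above.

```python
import random

random.seed(0)

# Invented toy data: each row is ([attribute values], fix-time class).
# Feature 1 is a coarsened copy of feature 0, so the two are highly correlated.
DATA = [([i % 5, (i % 5) // 2], "Short" if i % 5 < 2 else "Long")
        for i in range(60)]

def train_stump(rows, feature):
    """One-feature 'tree': remember the majority label for each feature value."""
    buckets = {}
    for x, y in rows:
        buckets.setdefault(x[feature], []).append(y)
    return {value: max(set(labels), key=labels.count)
            for value, labels in buckets.items()}

def train_forest(rows, n_trees=25):
    """Each tree sees a bootstrap sample and a randomly chosen feature."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(rows) for _ in rows]
        feature = random.randrange(len(rows[0][0]))
        forest.append((feature, train_stump(sample, feature)))
    return forest

def predict(forest, x):
    """Majority vote across all trees in the forest."""
    votes = [stump.get(x[f], "Short") for f, stump in forest]
    return max(set(votes), key=votes.count)

forest = train_forest(DATA)
accuracy = sum(predict(forest, x) == y for x, y in DATA) / len(DATA)
print(f"training accuracy: {accuracy:.2f}")
```

A real study would replace the stumps with full decision trees over the 23 work-item properties, but the bootstrap-plus-vote mechanics are the same.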
10

Data | J48 | Random Forest | M5P
Linux* | 11.09 (3.09) | 18.42 (2.53) | 16.22 (2.19)
Apache* | 38.64 (1.35) | 34.45 (1.68) | 25.39 (1.64)
Jazz | 17.75 (2.76) | 18.43 (2.74) | 11.67 (2.27)

*Akinori Ihara (Kinki University, Japan) and Yasutaka Kamei (Kyushu University, Japan)
Goals of our case study
• G1: What is the accuracy of the fix-time prediction model?
• G2: Which properties are the most important predictors of fix-time?
• G3: How applicable are the models in practice?
11
G1: Accuracy of the model
We build 10 random forests for each project:
• Each random forest randomly uses 2/3 of the data for training.
• We evaluate the prediction on the remaining 1/3 of the data.
12
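The evaluation protocol above can be sketched as follows. The data and the placeholder majority-class model are invented stand-ins; a real run would plug the random forest in where `majority_classifier` sits.

```python
import random

random.seed(1)

# Hypothetical stand-in data: (feature, fix-time class) pairs.
data = [((i * 3) % 7, ["Short", "Normal", "Long"][i % 3]) for i in range(90)]

def majority_classifier(train_rows):
    """Baseline model: always predict the most common class in the training set."""
    labels = [y for _, y in train_rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

rates = []
for _ in range(10):                       # ten independent models, as on the slide
    rows = data[:]
    random.shuffle(rows)
    cut = (2 * len(rows)) // 3            # 2/3 of the data for training ...
    train, test = rows[:cut], rows[cut:]  # ... the remaining 1/3 for evaluation
    model = majority_classifier(train)
    errors = sum(model(x) != y for x, y in test)
    rates.append(errors / len(test))

print(f"mean misclassification over 10 runs: {sum(rates) / len(rates):.2f}")
```

Repeating the random split ten times gives a spread of misclassification rates rather than a single, possibly lucky, number.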
Accuracy of the model: overall misclassification (%)

Dimension | Mozilla | Eclipse
Reporter | 51.9 | 35.8
Location | 39.6 | 34.7
Description | 50.8 | 43.0
All | 36.2 | 32.8

G1: We can correctly classify the fix-time for work items in Eclipse and Mozilla ∼65% of the time, twice as good as random.
13
G2: Model sensitivity - importance of each property
We use a technique called the permutation accuracy importance measure, as follows:
• For each property, we randomly permute its values and rerun the classification.
• We assign an importance score (1 to 10) depending on the change in the classification result.
• We sum each property's score across all ten forests.
14
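The permutation measure described above can be sketched in a few lines. The toy lookup classifier and the data are invented (feature 0 drives the label, feature 1 is noise), so only the shuffle-and-rescore mechanics mirror the slide.

```python
import random

random.seed(2)

# Synthetic rows: feature 0 determines the label, feature 1 is random noise.
rows = [([i % 4, random.randrange(4)], "fast" if i % 4 == 0 else "slow")
        for i in range(80)]

def lookup_classifier(train_rows):
    """Toy classifier: majority label for each full feature tuple."""
    table = {}
    for x, y in train_rows:
        table.setdefault(tuple(x), []).append(y)
    return {k: max(set(v), key=v.count) for k, v in table.items()}

def accuracy(model, test_rows):
    return sum(model.get(tuple(x), "slow") == y for x, y in test_rows) / len(test_rows)

model = lookup_classifier(rows)
baseline = accuracy(model, rows)

importance = {}
for feature in range(2):
    # Permute one feature column while leaving the labels fixed, then re-score.
    shuffled = [x[feature] for x, _ in rows]
    random.shuffle(shuffled)
    permuted = [(x[:feature] + [v] + x[feature + 1:], y)
                for (x, y), v in zip(rows, shuffled)]
    importance[feature] = baseline - accuracy(model, permuted)

print(importance)
```

Permuting the informative feature should cost the classifier far more accuracy than permuting the noise feature; that accuracy drop is the raw material for the 1-to-10 scores summed across the ten forests.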
15
Top two attributes per dimension:

Dimension | Mozilla | Eclipse
Location | Product, Component | Project fix-time, Product opened work items
Reporter | Fix-time, Requests | Fix-time, Overall popularity
Description | Year, Has target milestone | Year, Severity
All | Year, Product | Severity, Number of CCed

G2: The time of bug filing and its location are the most important properties in the Mozilla project. In the Eclipse project, bug severity is the most important property.
Why is time the most important factor in Mozilla?

Project | Duplicate | Invalid | Moved | Won't fix | Works for me | Fixed | Total
Mozilla | 99,414 (37%) | 29,856 (11%) | 103 (0%) | 9,512 (4%) | 46,659 (17%) | 85,616 (32%) | 271,160 (100%)
Eclipse | 19,060 (18%) | 7,958 (7%) | 0 (0%) | 7,141 (7%) | 10,013 (9%) | 63,402 (59%) | 107,574 (100%)
Table 5: Statistics for the Resolution Type of Bugs for the Mozilla and Eclipse Projects

Figure 1 (a) Mozilla, (b) Eclipse: The distribution of fix-time for bugs over the lifetime of the Mozilla and Eclipse projects. The x-axis shows the specific project years, while the y-axis shows the percentage of bugs fixed within a specific class.

… Mozilla project. The observation about severity confirms the finding reported by Panjer using basic decision trees [17]. Our analysis shows that the year of the bug has a higher impact on the fix-time for a bug than the severity of the bug. We also note that our findings do not match the finding by Hooimeijer and Weimer [11] that the severity of bugs in Mozilla has an important effect on the time needed to fix a bug. In contrast, our results show that the reporting time (e.g., Year and Week of Year) has a more important influence.

We believe that this is because Hooimeijer and Weimer measure the fix-time of a bug from the time it is entered into Bugzilla until it is fixed. In contrast, we only measure the time from the assignment of a developer (i.e., after triage) until the bug is fixed. Therefore, it might be the case that bugs with high severity are triaged faster, but then their fix-time is independent of their severity level. This could be due to the fact that severity levels do not represent the true severity of a bug.

# | Mozilla attribute | % | Eclipse attribute | %
1 | Year | 100 | Year | 99
2 | Has target milestone | 82 | Severity | 87
3 | Number of CCed | 80 | Number of CCed | 83
4 | Week of Year | 77 | Has target milestone | 69
Table 9: Top Attributes for the Bug Description Dimension

Experiment #4: All Attributes
In our last experiment, we combine all the attributes across the three dimensions. This resulted in 50 metrics that were studied using our sensitivity analysis technique. Table 10 shows the results of our analysis. The table shows that the severity of the bug and the number of CCed project personnel are the most important attributes for the Eclipse project, while on the other hand the reporting time (year and week) and the location (product and component) have the largest influence for the Mozilla project.

# | Mozilla attribute | % | Eclipse attribute | %
1 | Year | 97 | Severity | 87
2 | Product | 92 | Number of CCed | 73
3 | Week | 71 | Product fix-time | 61
4 | Component | 63 | |
Table 10: Top Attributes for the Models with All Attributes

Discussion
Table 11 shows the performance of the three dimensions and all attributes for the Mozilla and Eclipse projects. The table shows that the dimensions perform differently for both projects. For the Mozilla project, the best performing dimensions are Location, Description, then Reporter. For the Eclipse project, the best performing dimensions are Location, Reporter, then Description. The performance differences are statistically significant. The results suggest that the bug descriptions for the Eclipse project should be improved to help project managers in project planning and resource allocation decisions.

A random guessing approach would result in an overall misclassification rate of 60%, as we have 3 classes. Our random forest classifier shows considerable improvement over random guessing. Moreover, if we did not discretize the numerical attributes, we might be able to achieve a lower misclassification rate. However, the produced model would be much harder to comprehend and use in practice for project planning, as we are searching for simple and basic rules of thumb that practitioners could use.

We verified the significance of our result differences by comparing the performance of the ten forests generated for each di-
16
G3: How applicable are the models in practice?
For a prediction model to be applicable in practice, it should:
• Use only properties that are available at prediction time
• Be stable
18
Feature request / Defect report filed → Triage → Work item fix-time
Number of CCed? Severity change! Assigned to someone else
Accuracy of the predictor models using only available properties
20

Data | Size | Misclassification rate
Eclipse* | 86,490 | 0.51
Linux* | 2,024 | 0.55
Jazz | 16,672 | 0.57
Apache* | 1,466 | 0.37

*Akinori Ihara (Kinki University, Japan) and Yasutaka Kamei (Kyushu University, Japan)
Stability of the model
• Training-size stability: as more data is added to the training set, the accuracy should improve.
• Time stability: the accuracy should be stable over time.
21
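Training-size stability can be checked with a simple learning-curve loop: retrain on growing fractions of the data and score on a fixed test set. The data and the per-value majority model below are invented stand-ins for the real forest; only the loop structure mirrors the stability check.

```python
import random

random.seed(3)

def noisy_label(x):
    """Invented labeling rule with ~10% label noise."""
    base = "Long" if x >= 6 else "Short"
    if random.random() < 0.1:
        return "Short" if base == "Long" else "Long"
    return base

data = [(i % 10, noisy_label(i % 10)) for i in range(300)]
random.shuffle(data)
train_pool, test = data[:200], data[200:]

def majority_by_value(rows):
    """Majority class per feature value -- a stand-in for the real model."""
    table = {}
    for x, y in rows:
        table.setdefault(x, []).append(y)
    return {k: max(set(v), key=v.count) for k, v in table.items()}

accuracies = []
for tenths in range(1, 10):                         # 10%, 20%, ..., 90% of the pool
    train = train_pool[: len(train_pool) * tenths // 10]
    model = majority_by_value(train)
    acc = sum(model.get(x, "Short") == y for x, y in test) / len(test)
    accuracies.append(round(acc, 2))

print(accuracies)
```

A stable model's curve should improve (or at least not degrade) as the training fraction grows; time stability is the same loop with chronological rather than random slices.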
Apache - Training size stability
22
[Chart: precision (y-axis, 0.45-0.70) vs. tenths of the training data (x-axis, 1-9)]
Apache - Time stability
23
[Chart: precision (y-axis, 0.34-0.42) vs. tenths of the training data (x-axis, 1-8)]

G3: Fix-time prediction models may work in practice on projects such as Apache. The Apache prediction model has both data stability and time stability.
24