Studying the Fix-Time for Bugs in Large Open Source Projects
Lionel Marks, Ying Zou, Ahmed E. Hassan, Thanh Nguyen
Queen's University, Kingston, Ontario, Canada
Wednesday, 21 September, 11
Jul 05, 2015
If life were like that, we wouldn't need software prediction.
2
Reality
Many simple feature requests or defect reports do NOT get fixed for years.
3
Feature request / Defect report filed → Triage → Implementation plan / cause of defect determined → Implement → Verify → Close

Work item fix-time: When will it be fixed? Which one should we fix this iteration?
Can we predict the work item fix-time?
4
Three dimensions of properties feed the predictor:

Location properties (7): Product / Version / Component; Number of completed WI*; Average fix time*
Reporter properties (4): Industry / local / public; Popularity*; Number of past requests*; Average fix time*
Work item properties (12): Severity / Priority; Number of interested parties*; Morning / Day / night; Description length*; Code attachment

→ Predictor
5
When will it be fixed?
Location properties (7) + Reporter properties (4) + Work item properties (12) → Predictor → Short / Normal / Long
6
Which one should we fix this iteration?
Location properties (7) + Reporter properties (4) + Work item properties (12) → Predictor → Next minor revision / Next major revision / Next version
7
Case study

Project | Number of work items | < 3 months | < 1 year | < 3 years
Mozilla | 85,616 | 46% | 27% | 27%
Eclipse | 63,402 | 76% | 18% | 6%
8
Random Forest
We use Random Forest because:
• Decision-tree-based models are explainable compared to SVMs or neural networks.
• Random Forest outperforms C4.5 because it is more resistant to data with highly correlated attributes.
• It is easy to analyze the sensitivity of each property.
9
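A minimal sketch of the random-forest idea the slide relies on: train many weak trees, each on a bootstrap sample and a randomly chosen attribute, and predict by majority vote. Everything here is invented for illustration (the toy data, the one-feature "stump" trees, and the class labels); feature 1 is deliberately a coarsened, highly correlated copy of feature 0, echoing the correlated-attributes point above.

```python
import random

random.seed(0)

# Invented toy data: each row is ([attribute values], fix-time class).
# Feature 1 is a coarsened copy of feature 0, so the two are highly correlated.
DATA = [([i % 5, (i % 5) // 2], "Short" if i % 5 < 2 else "Long")
        for i in range(60)]

def train_stump(rows, feature):
    """One-feature 'tree': remember the majority label for each feature value."""
    buckets = {}
    for x, y in rows:
        buckets.setdefault(x[feature], []).append(y)
    return {value: max(set(labels), key=labels.count)
            for value, labels in buckets.items()}

def train_forest(rows, n_trees=25):
    """Each tree sees a bootstrap sample and a randomly chosen feature."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(rows) for _ in rows]
        feature = random.randrange(len(rows[0][0]))
        forest.append((feature, train_stump(sample, feature)))
    return forest

def predict(forest, x):
    """Majority vote across all trees in the forest."""
    votes = [stump.get(x[f], "Short") for f, stump in forest]
    return max(set(votes), key=votes.count)

forest = train_forest(DATA)
accuracy = sum(predict(forest, x) == y for x, y in DATA) / len(DATA)
print(f"training accuracy: {accuracy:.2f}")
```

A real study would replace the stumps with full decision trees over the 23 work-item properties, but the bootstrap-plus-vote mechanics are the same.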
10

Data | J48 | Random Forest | M5P
Linux* | 11.09 (3.09) | 18.42 (2.53) | 16.22 (2.19)
Apache* | 38.64 (1.35) | 34.45 (1.68) | 25.39 (1.64)
Jazz | 17.75 (2.76) | 18.43 (2.74) | 11.67 (2.27)

*Akinori Ihara (Kinki University, Japan) and Yasutaka Kamei (Kyushu University, Japan)
Goals of our case study
• G1: What is the accuracy of the fix-time prediction model?
• G2: Which properties are the most important predictors of fix-time?
• G3: How applicable are the models in practice?
11
G1: Accuracy of the model
We build 10 random forests for each project:
• Each random forest randomly uses 2/3 of the data for training.
• We evaluate the prediction on the remaining 1/3 of the data.
12
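The evaluation protocol above can be sketched as follows. The data and the placeholder majority-class model are invented stand-ins; a real run would plug the random forest in where `majority_classifier` sits.

```python
import random

random.seed(1)

# Hypothetical stand-in data: (feature, fix-time class) pairs.
data = [((i * 3) % 7, ["Short", "Normal", "Long"][i % 3]) for i in range(90)]

def majority_classifier(train_rows):
    """Baseline model: always predict the most common class in the training set."""
    labels = [y for _, y in train_rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

rates = []
for _ in range(10):                       # ten independent models, as on the slide
    rows = data[:]
    random.shuffle(rows)
    cut = (2 * len(rows)) // 3            # 2/3 of the data for training ...
    train, test = rows[:cut], rows[cut:]  # ... the remaining 1/3 for evaluation
    model = majority_classifier(train)
    errors = sum(model(x) != y for x, y in test)
    rates.append(errors / len(test))

print(f"mean misclassification over 10 runs: {sum(rates) / len(rates):.2f}")
```

Repeating the random split ten times gives a spread of misclassification rates rather than a single, possibly lucky, number.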
Accuracy of the model: overall misclassification (%)

Dimension | Mozilla | Eclipse
Reporter | 51.9 | 35.8
Location | 39.6 | 34.7
Description | 50.8 | 43.0
All | 36.2 | 32.8

G1: We can correctly classify the fix-time for work items in Eclipse and Mozilla ∼65% of the time, twice as good as random.
13
G2: Model sensitivity - importance of each property
We use a technique called the permutation accuracy importance measure, as follows:
• For each property, we randomly permute its values and rerun the classification.
• We assign an importance score (1 to 10) depending on the change in the classification result.
• We sum each property's score across all ten forests.
14
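The permutation measure described above can be sketched in a few lines. The toy lookup classifier and the data are invented (feature 0 drives the label, feature 1 is noise), so only the shuffle-and-rescore mechanics mirror the slide.

```python
import random

random.seed(2)

# Synthetic rows: feature 0 determines the label, feature 1 is random noise.
rows = [([i % 4, random.randrange(4)], "fast" if i % 4 == 0 else "slow")
        for i in range(80)]

def lookup_classifier(train_rows):
    """Toy classifier: majority label for each full feature tuple."""
    table = {}
    for x, y in train_rows:
        table.setdefault(tuple(x), []).append(y)
    return {k: max(set(v), key=v.count) for k, v in table.items()}

def accuracy(model, test_rows):
    return sum(model.get(tuple(x), "slow") == y for x, y in test_rows) / len(test_rows)

model = lookup_classifier(rows)
baseline = accuracy(model, rows)

importance = {}
for feature in range(2):
    # Permute one feature column while leaving the labels fixed, then re-score.
    shuffled = [x[feature] for x, _ in rows]
    random.shuffle(shuffled)
    permuted = [(x[:feature] + [v] + x[feature + 1:], y)
                for (x, y), v in zip(rows, shuffled)]
    importance[feature] = baseline - accuracy(model, permuted)

print(importance)
```

Permuting the informative feature should cost the classifier far more accuracy than permuting the noise feature; that accuracy drop is the raw material for the 1-to-10 scores summed across the ten forests.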
15
Top two attributes per dimension:

Dimension | Mozilla | Eclipse
Location | Product, Component | Project fix-time, Product opened work items
Reporter | Fix-time, Requests | Fix-time, Overall popularity
Description | Year, Has target milestone | Year, Severity
All | Year, Product | Severity, Number of CCed

G2: The time of bug filing and its location are the most important properties in the Mozilla project. In the Eclipse project, bug severity is the most important property.
Why is time the most important factor in Mozilla?

Project | Duplicate | Invalid | Moved | Won't fix | Works for me | Fixed | Total
Mozilla | 99,414 (37%) | 29,856 (11%) | 103 (0%) | 9,512 (4%) | 46,659 (17%) | 85,616 (32%) | 271,160 (100%)
Eclipse | 19,060 (18%) | 7,958 (7%) | 0 (0%) | 7,141 (7%) | 10,013 (9%) | 63,402 (59%) | 107,574 (100%)
Table 5: Statistics for the Resolution Type of Bugs for the Mozilla and Eclipse Projects

Figure 1 (a) Mozilla, (b) Eclipse: The distribution of fix-time for bugs over the lifetime of the Mozilla and Eclipse projects. The x-axis shows the specific project years, while the y-axis shows the percentage of bugs fixed within a specific class.

… Mozilla project. The observation about severity confirms the finding reported by Panjer using basic decision trees [17]. Our analysis shows that the year of the bug has a higher impact on the fix-time for a bug than the severity of the bug. We also note that our findings do not match the finding by Hooimeijer and Weimer [11] that the severity of bugs in Mozilla has an important effect on the time needed to fix a bug. In contrast, our results show that the reporting time (e.g., Year and Week of Year) has a more important influence.

We believe that this is because Hooimeijer and Weimer measure the fix-time of a bug from the time it is entered into Bugzilla until it is fixed. In contrast, we only measure the time from the assignment of a developer (i.e., after triage) until the bug is fixed. Therefore, it might be the case that bugs with high severity are triaged faster, but then their fix-time is independent of their severity level. This could be due to the fact that severity levels do not represent the true severity of a bug.

# | Mozilla attribute | % | Eclipse attribute | %
1 | Year | 100 | Year | 99
2 | Has target milestone | 82 | Severity | 87
3 | Number of CCed | 80 | Number of CCed | 83
4 | Week of Year | 77 | Has target milestone | 69
Table 9: Top Attributes for the Bug Description Dimension

Experiment #4: All Attributes
In our last experiment, we combine all the attributes across the three dimensions. This resulted in 50 metrics that were studied using our sensitivity analysis technique. Table 10 shows the results of our analysis. The table shows that the severity of the bug and the number of CCed project personnel are the most important attributes for the Eclipse project, while on the other hand the reporting time (year and week) and the location (product and component) have the largest influence for the Mozilla project.

# | Mozilla attribute | % | Eclipse attribute | %
1 | Year | 97 | Severity | 87
2 | Product | 92 | Number of CCed | 73
3 | Week | 71 | Product fix-time | 61
4 | Component | 63 | |
Table 10: Top Attributes for the Models with All Attributes

Discussion
Table 11 shows the performance of the three dimensions and all attributes for the Mozilla and Eclipse projects. The table shows that the dimensions perform differently for both projects. For the Mozilla project, the best performing dimensions are Location, Description, then Reporter. For the Eclipse project, the best performing dimensions are Location, Reporter, then Description. The performance differences are statistically significant. The results suggest that the bug descriptions for the Eclipse project should be improved to help project managers in project planning and resource allocation decisions.

A random guessing approach would result in an overall misclassification rate of 60%, as we have 3 classes. Our random forest classifier shows considerable improvement over random guessing. Moreover, if we did not discretize the numerical attributes, we might be able to achieve a lower misclassification rate. However, the produced model would be much harder to comprehend and use in practice for project planning, as we are searching for simple and basic rules of thumb that practitioners could use.

We verified the significance of our result differences by comparing the performance of the ten forests generated for each di-
16
G3: How applicable are the models in practice?
For a prediction model to be applicable in practice, it should:
• Use only properties that are available at prediction time
• Be stable
18
Feature request / Defect report filed → Triage → Work item fix-time
Number of CCed? Severity change! Assigned to someone else
Accuracy of the predictor models using only available properties
20

Data | Size | Misclassification rate
Eclipse* | 86,490 | 0.51
Linux* | 2,024 | 0.55
Jazz | 16,672 | 0.57
Apache* | 1,466 | 0.37

*Akinori Ihara (Kinki University, Japan) and Yasutaka Kamei (Kyushu University, Japan)
Stability of the model
• Training-size stability: as more data is added to the training set, the accuracy should improve.
• Time stability: the accuracy should be stable over time.
21
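Training-size stability can be checked with a simple learning-curve loop: retrain on growing fractions of the data and score on a fixed test set. The data and the per-value majority model below are invented stand-ins for the real forest; only the loop structure mirrors the stability check.

```python
import random

random.seed(3)

def noisy_label(x):
    """Invented labeling rule with ~10% label noise."""
    base = "Long" if x >= 6 else "Short"
    if random.random() < 0.1:
        return "Short" if base == "Long" else "Long"
    return base

data = [(i % 10, noisy_label(i % 10)) for i in range(300)]
random.shuffle(data)
train_pool, test = data[:200], data[200:]

def majority_by_value(rows):
    """Majority class per feature value -- a stand-in for the real model."""
    table = {}
    for x, y in rows:
        table.setdefault(x, []).append(y)
    return {k: max(set(v), key=v.count) for k, v in table.items()}

accuracies = []
for tenths in range(1, 10):                         # 10%, 20%, ..., 90% of the pool
    train = train_pool[: len(train_pool) * tenths // 10]
    model = majority_by_value(train)
    acc = sum(model.get(x, "Short") == y for x, y in test) / len(test)
    accuracies.append(round(acc, 2))

print(accuracies)
```

A stable model's curve should improve (or at least not degrade) as the training fraction grows; time stability is the same loop with chronological rather than random slices.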
Apache - Training size stability
22
[Chart: precision (y-axis, 0.45-0.70) vs. tenths of the training data (x-axis, 1-9)]
Apache - Time stability
23
[Chart: precision (y-axis, 0.34-0.42) vs. tenths of the training data (x-axis, 1-8)]

G3: Fix-time prediction models may work in practice on projects such as Apache. The Apache prediction model has both data stability and time stability.
24