Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects
Erika Camargo and Ochimizu Koichiro, Japan Institute of Science and Technology
ESEM 2009

Dec 18, 2015

Page 1

Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Erika Camargo and Ochimizu Koichiro

Japan Institute of Science and Technology

ESEM 2009

Page 2

Contents

1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work

Page 3

Abstract

Challenge: to make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects.

First attempt at a solution: simple log data transformations.

[Figure: P(y=1), the probability that a class is fault-prone, plotted against X, a design-complexity metric]
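
The curve on this slide can be sketched as follows, assuming the standard univariate logistic form P(y=1) = 1 / (1 + e^-(b0 + b1*X)). The function name and the coefficient values b0 and b1 are hypothetical placeholders chosen for illustration; they are not values reported in the study.

```python
import math


def fault_probability(x, b0, b1):
    """Univariate logistic model: P(y=1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical coefficients, NOT taken from the slides.
B0, B1 = -3.0, 0.25

for metric_value in (0, 12, 40):
    p = fault_probability(metric_value, B0, B1)
    print(f"X = {metric_value:>2}  ->  P(fault-prone) = {p:.3f}")
```

With a positive b1, the predicted probability of a class being fault-prone rises as the metric value grows, which is the shape the slide's plot describes.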

Page 4

Background

• Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models
• Among these metrics are the Chidamber & Kemerer (CK) metrics
– The 80th and 20th percentiles of their distributions can be used to determine high and low values
– Their thresholds cannot be determined before use and should be derived and used locally
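
A minimal sketch of deriving such local thresholds, assuming the 20th and 80th percentiles are computed per project with Python's `statistics.quantiles`. The helper names and the metric values are invented for illustration; the slides do not say how the percentiles were computed.

```python
from statistics import quantiles


def local_thresholds(values):
    """Return (low, high): the 20th and 80th percentiles of one project's metric values."""
    # n=5 yields four cut points: the 20th, 40th, 60th and 80th percentiles.
    cuts = quantiles(values, n=5, method="inclusive")
    return cuts[0], cuts[-1]


def label(value, low, high):
    """Classify a metric value against the project's own thresholds."""
    if value <= low:
        return "low"
    if value >= high:
        return "high"
    return "medium"


# Hypothetical per-class metric values (e.g. WMC) for a single project.
wmc = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
low, high = local_thresholds(wmc)
print(low, high, [label(v, low, high) for v in wmc])
```

Because the cut points come from the project's own distribution, the same raw value can be "high" in one project and "medium" in another, which is why the thresholds have to be derived locally.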

Page 5

Problem Analysis

Can an LR model built with this kind of metric work efficiently with different software projects?

[Figure: P(y=1) plotted against X = Number of Methods for a small-size and a large-size software project, with the labels LEAST FAULTY and MOST FAULTY]

Page 6

Case Study

1. Data analysis of 7 different projects and application of simple log data transformations.
2. Construction of 3 univariate LR models using a large open source project (1st release of the Mylyn system, with 638 Java classes).
– Independent variables: CK-CBO, CK-RFC, CK-WMC (one per model)
– Dependent variable: defects (from Bugzilla & CVS)
3. Testing of these models with 2 other smaller projects (with 11 and 13 Java classes).
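
Steps 2 and 3 can be sketched as below: fit one univariate LR model on a training project, then apply it to classes from another project. The slides do not name the fitting tool, so plain gradient ascent on the log-likelihood is used here as a stand-in. All data values, function names and hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_univariate_lr(xs, ys, learning_rate=0.5, iterations=20000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iterations):
        # Residuals y - p are the per-sample gradient of the log-likelihood w.r.t. the linear term.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Hypothetical training data: one metric value per class, 1 = fault-prone, 0 = not.
xs = [0.0, 0.3, 0.5, 0.7, 0.8, 1.0, 1.1, 1.3, 1.5, 1.8]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
b0, b1 = fit_univariate_lr(xs, ys)

# Hypothetical classes from a different, smaller project.
for x in (0.2, 0.9, 1.6):
    p = sigmoid(b0 + b1 * x)
    print(f"x = {x:.1f}  P(fault-prone) = {p:.2f}  predicted = {int(p >= 0.5)}")
```

The 0.5 cut-off used to turn probabilities into predictions is also an assumption; the slides do not state the classification threshold.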

Page 7

Challenge

… produce biased regression estimates and reduce the predictive power of regression models

BNS: Banking system (2006) *
CRS: Cruise control system (2005) *
ECS: E-commerce system (2006) *
ELCS: Elevator control system (2003) *
FACS: Factory automation system (2005) *
GMF: Graphical Modeling Framework **
MYL: Mylyn system **

(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
(**) Eclipse project

Page 8

The RFC data of BNS is more spread out than the data of MYL.

Page 9

The RFC data of BNS is more spread out than the data of MYL.
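
One common way to compare spread and count outliers across projects is the interquartile range (IQR) with the 1.5 × IQR box-plot rule. The slides do not state which measure was used, so this is an assumption, and the metric values below are invented.

```python
from statistics import quantiles


def iqr_summary(values):
    """Return (IQR, outliers) using the 1.5 * IQR box-plot rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return iqr, outliers


# Hypothetical RFC values for two projects: one tightly grouped, one widely spread.
narrow_project = [1, 2, 3, 4, 5, 6, 7, 8, 9]
wide_project = [2, 10, 18, 26, 34, 42, 50, 58, 200]

for name, data in (("narrow", narrow_project), ("wide", wide_project)):
    iqr, outliers = iqr_summary(data)
    print(f"{name}: IQR = {iqr}, outliers = {outliers}")
```

A model fitted on one of these distributions sees very different metric ranges when applied to the other, which is the difficulty the slides point at.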

Page 10

Case Study

Solution: simple data transformation using “Log10”

Example:

• The number of outliers is lower
• The data spread is more uniform

LCBO = Log10(CBO + 1)
LTCBO = Log10(CBO + 1) + dm

where dm is the difference between the CBO medians of the Mylyn system and of the system whose data is being transformed.
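
The two transformations can be sketched as below. Two points are assumptions rather than facts from the slides: the slide's "medias" is read as medians (means would work the same way), and dm is computed on the log scale as median(Mylyn) minus median(target), so that the shifted data shares Mylyn's median. The function names and CBO values are hypothetical.

```python
import math
from statistics import median


def log_transform(values):
    """LCBO = log10(CBO + 1); the +1 keeps classes with a metric value of 0 defined."""
    return [math.log10(v + 1) for v in values]


def shifted_log_transform(values, reference_values):
    """LTCBO = log10(CBO + 1) + dm.

    ASSUMPTION: dm = median(log-transformed reference) - median(log-transformed values).
    """
    logged = log_transform(values)
    dm = median(log_transform(reference_values)) - median(logged)
    return [v + dm for v in logged]


# Hypothetical CBO values: the reference (model-building) project and another project.
reference_cbo = [0, 9, 99]
other_cbo = [0, 99, 999]

shifted = shifted_log_transform(other_cbo, reference_cbo)
print(log_transform(reference_cbo))
print(shifted)
```

After the shift, the transformed project is centred on the same median as the reference project, so a model fitted on the reference data sees inputs in a comparable range.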

Page 11

Results

Effects of the log data transformations:
• Elimination of a great number of outliers
• Overall goodness of fit of the 3 models is better
• Discrimination (Most Faulty/Least Faulty)
– All models discriminate well between the Most Faulty and Least Faulty classes of the Mylyn system
– What about using different projects?

Page 12

Results

BANKING SYSTEM

Group              Model  Correct classification  Correct classification   Effect
                          (raw data)              (log-transformed data)
MF (6 classes)     CBO    2                       5                        ↑
MF (6 classes)     RFC    5                       5                        =
MF (6 classes)     WMC    6                       6                        =
LF (5 classes)     CBO    5                       5                        =
LF (5 classes)     RFC    3                       3                        =
LF (5 classes)     WMC    4                       4                        =
BOTH (11 classes)  CBO    7                       10                       ↑
BOTH (11 classes)  RFC    8                       8                        =
BOTH (11 classes)  WMC    10                      10                       =

MF: Most Faulty
LF: Least Faulty

Page 13

Results

E-COMMERCE SYSTEM

Group              Model  Correct classification  Correct classification   Effect
                          (raw data)              (log-transformed data)
MF (9 classes)     CBO    3                       7                        ↑
MF (9 classes)     RFC    9                       8                        ↓
MF (9 classes)     WMC    7                       6                        ↓
LF (4 classes)     CBO    4                       4                        =
LF (4 classes)     RFC    0                       3                        ↑
LF (4 classes)     WMC    0                       4                        ↑
BOTH (13 classes)  CBO    7                       11                       ↑
BOTH (13 classes)  RFC    9                       11                       ↑
BOTH (13 classes)  WMC    7                       10                       ↑

MF: Most Faulty
LF: Least Faulty
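
The overall (BOTH) rows of the two result tables can be turned into classification rates with a few lines of arithmetic. The counts below are copied from the Banking system and E-commerce system tables; the dictionary layout and function name are only illustrative.

```python
# (raw, log-transformed) correct classifications out of "total" classes,
# copied from the BOTH rows of the two result tables.
RESULTS = {
    "Banking system": {"total": 11, "CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)},
    "E-commerce system": {"total": 13, "CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)},
}


def accuracy_table(results):
    """Return rows of (project, model, raw accuracy, log-transformed accuracy)."""
    rows = []
    for project, data in results.items():
        total = data["total"]
        for model in ("CBO", "RFC", "WMC"):
            raw, logged = data[model]
            rows.append((project, model, raw / total, logged / total))
    return rows


for project, model, raw_acc, log_acc in accuracy_table(RESULTS):
    print(f"{project:<18} {model}  raw {raw_acc:6.1%}  log {log_acc:6.1%}")
```

Across all six project/model pairs, the log-transformed rate is never lower than the raw rate, although the per-group rows show that RFC and WMC each lose one Most Faulty class on the E-commerce system.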

Page 14

Conclusions and Future Work

• CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects
• Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used in the construction of the model
• Future work: further data exploration and study of data transformations

Page 15

Thank you!

Questions, comments …

contact: [email protected]
