SOFTWARE FAULT PREDICTION USING QUAD TREE-BASED FUZZY C-MEANS CLUSTERING ALGORITHM By, SHANMUGAPRIYA.K II-M.E[CSE] Guided by, Mr. S. Nandagopal, M.E.,

SOFTWARE FAULT PREDICTION USING QUAD TREE-BASED FUZZY C-MEANS

CLUSTERING ALGORITHM

By,

SHANMUGAPRIYA.K

II-M.E[CSE]

Guided by,

Mr. S. Nandagopal, M.E., (Ph.D).,

Associate Professor / CSE,

Nandha College of Technology,

Erode- 52.

TABLE OF CONTENTS

• ABSTRACT

• INTRODUCTION

• LITERATURE SURVEY

• EXISTING SYSTEM

• PROPOSED SYSTEM

• SYSTEM REQUIREMENTS

• SYSTEM IMPLEMENTATION

• REFERENCES

SYNOPSIS

The Quad Tree-based Fuzzy C-Means algorithm has been applied for predicting faults in program modules. Quad Trees are used to find the initial cluster centers that are input to the Fuzzy C-Means algorithm. An input threshold parameter governs the number of initial cluster centers; by varying it, the user can generate the desired number of initial cluster centers. Clustering gain has been used to determine the quality of the clusters when evaluating the Quad Tree-based initialization. The overall error rates of this prediction approach are compared with those of other existing algorithms such as Quad Tree-based K-Means, K-Means and Global K-Means, and the approach is found to give the best accuracy.

1. INTRODUCTION

1.1. Overview of the project:

• Data mining is the process of extracting relevant data from large databases.

• Real-time databases are large and complex and contain different types of attributes.

• Clustering algorithms are used to find faults in software modules.

• An analysis of static metrics and faults in C software suggests that multiple-variable models are necessary to find metrics that are important in addition to program size.

• A software fault is a defect that causes software failure in an executable product.

• A software bug is an error, flaw, mistake, failure, or fault in a computer program that prevents it from behaving as intended.

• Most bugs arise from mistakes and errors made by people in either a program's source code or its design, and a few are caused by compilers producing incorrect code.

• Prediction of fault-prone modules in the software development process mostly uses the metric-based approach.

1.2. Software Fault Prediction process

• The fault prediction process shows the modified software with unknown fault data.

• Faults can be predicted based on a training data set.

• The new version of the software can then be generated.

1.3. Cluster Analysis

• Clustering is the assignment of a set of observations into subsets so that observations in the same cluster are similar in some sense.

• Clustering can be used to predict the impact of faults in object-oriented software systems.

• K-Means clustering is a widely used technique for finding faults.

• K-Means is sensitive to noise, so it is applied together with a Quad Tree-based method.

• Fuzzy C-Means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters.

• FCM assigns every data point a membership grade for each cluster.

• By iteratively updating the cluster centers and the membership grades for each data point, FCM moves the cluster centers to the right location within a data set.
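The FCM update loop described above can be sketched in a few lines of Python. This is only an illustration of the standard textbook update rules, not the project's Java implementation; the function name `fcm`, the fuzzifier `m = 2`, the stopping tolerance and the sample points are all assumptions made for the example.

```python
from math import dist

def fcm(points, centres, m=2.0, iterations=50, eps=1e-9):
    """Plain Fuzzy C-Means sketch: returns (centres, membership grades)."""
    centres = [tuple(c) for c in centres]
    u = []
    for _ in range(iterations):
        # Membership grade of every point in every cluster (each row sums to 1).
        u = []
        for p in points:
            d = [max(dist(p, c), eps) for c in centres]  # eps avoids division by zero
            u.append([
                1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0)) for k in range(len(d)))
                for i in range(len(d))
            ])
        # Move each centre to the membership-weighted mean of the data.
        new_centres = []
        for i in range(len(centres)):
            w = [row[i] ** m for row in u]
            total = sum(w)
            new_centres.append(tuple(
                sum(wj * p[a] for wj, p in zip(w, points)) / total
                for a in range(len(points[0]))
            ))
        shift = max(dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < 1e-6:  # assumed convergence tolerance
            break
    return centres, u

# Hypothetical two-group data set and rough initial centres.
sample = [(0, 0), (0.5, 0.2), (0.2, 0.6), (0.4, 0.4),
          (10, 10), (9.5, 9.8), (9.8, 9.4), (9.6, 9.6)]
final_centres, grades = fcm(sample, [(1, 1), (8, 8)])
```

Each row of `grades` holds one point's membership grades across all clusters, which is what distinguishes FCM from the hard assignment made by K-Means.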

1.4. Quad Tree-based Algorithm

• The Quad Tree-based FCM algorithm (QDC) has been applied for predicting faults in program modules.

• Quad Trees are applied for finding the initial cluster centers for the Fuzzy C-Means algorithm.

• By varying the value of the threshold parameter, a user can generate a desired number of cluster centers to be used as input to the Fuzzy C-Means algorithm.

• The overall error rates of this prediction approach are compared with those of other existing algorithms and are found to be better in most cases.

• Clustering gain values for the best cluster by FCM and by the Quad Tree-based algorithm are very close, which indicates the effectiveness of the algorithm.

1.4. Quad Tree-based Algorithm continued…

• The Quad Tree-based method provides the initialization for Fuzzy C-Means. The algorithm divides the initial data space into buckets and continues subdividing until all buckets are either black or white leaf buckets.

• After the first division into four buckets, three buckets are gray and one is white.

• The gray buckets are further subdivided, while the white one is left as it is. At this stage, one of the sub buckets is labelled as a black leaf bucket.

[Figures: Quad Tree implementation for 4 quadrants; Quad Tree implementation for 8 quadrants]

1.5. Metric Thresholds used

Metric Name : Description
MODULE : Unique numeric identifier of the product.
LOC_BLANK : Number of blank lines in a module.
BRANCH_COUNT : Branch count metrics.
CALL_PAIRS : Number of calls to other functions in a module.
LOC_CODE_AND_COMMENT : Number of lines which contain both code and comment in a module.
LOC_COMMENTS : Number of lines of comments in a module.
CONDITION_COUNT : Number of conditionals in a given module.
CYCLOMATIC_COMPLEXITY : Cyclomatic complexity of a module.
CYCLOMATIC_DENSITY : Ratio of the module's cyclomatic complexity to its length in NCSLOC.
DECISION_COUNT : Number of decision points in a given module.
DECISION_DENSITY : Calculated as: Cond / Decision.
DESIGN_COMPLEXITY : Design complexity of a module.
DESIGN_DENSITY : Calculated as: iv(G) / v(G).
EDGE_COUNT : Number of edges found in a given module.
ERROR_COUNT : Number of defects associated with a module.
ERROR_DENSITY : Number of defects per 1000 lines of code for a module.
ESSENTIAL_COMPLEXITY : Essential complexity of a module.
ESSENTIAL_DENSITY : Calculated as: (ev(G) - 1) / (v(G) - 1).
LOC_EXECUTABLE : Number of lines of executable code for a module.

1.5. Metric Thresholds continued…

Metric Name : Description
PARAMETER_COUNT : Number of parameters to a given module.
GLOBAL_DATA_COMPLEXITY : Quantifies the cyclomatic complexity of a module's structure as it relates to global/parameter data.
GLOBAL_DATA_DENSITY : Calculated as: gdv(G) / v(G).
HALSTEAD_CONTENT : Halstead length content of a module.
HALSTEAD_DIFFICULTY : Halstead difficulty metric of a module.
HALSTEAD_EFFORT : Halstead effort metric of a module.
HALSTEAD_ERROR_EST : Halstead error estimate metric.
HALSTEAD_LENGTH : Halstead length metric of a module.
HALSTEAD_LEVEL : Halstead level metric of a module.
HALSTEAD_PROG_TIME : Halstead programming time metric of a module.
HALSTEAD_VOLUME : Halstead volume metric of a module.
MULTIPLE_CONDITION_COUNT : Number of multiple conditions that exist within a module.
NODE_COUNT : Number of nodes found in a given module.
NUM_OPERANDS : Number of operands contained in a module.
NUM_OPERATORS : Number of operators contained in a module.
NUM_UNIQUE_OPERANDS : Number of unique operands contained in a module.
NUM_UNIQUE_OPERATORS : Number of unique operators contained in a module.
NUMBER_OF_LINES : Number of lines in a module.
PATHOLOGICAL_COMPLEXITY : A measure of the degree to which a module contains extremely unstructured constructs.
PERCENT_COMMENTS : Percentage of the code that is comments.
LOC_TOTAL : Total number of lines for a given module.
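Several of the metrics listed above are ratios of other metrics. The following Python sketch shows how they could be derived from the base counts, using the formulas given in the descriptions. The helper names, the choice of LOC_TOTAL as the line count for ERROR_DENSITY, and returning 0.0 when a denominator is zero are assumptions made for illustration; the sample values are hypothetical.

```python
def safe_div(numerator, denominator):
    """Return the ratio, or 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0

def derived_metrics(m):
    """Compute the ratio metrics of one module from its base metric values."""
    v = m["CYCLOMATIC_COMPLEXITY"]  # v(G)
    return {
        # Cond / Decision
        "DECISION_DENSITY": safe_div(m["CONDITION_COUNT"], m["DECISION_COUNT"]),
        # iv(G) / v(G)
        "DESIGN_DENSITY": safe_div(m["DESIGN_COMPLEXITY"], v),
        # (ev(G) - 1) / (v(G) - 1)
        "ESSENTIAL_DENSITY": safe_div(m["ESSENTIAL_COMPLEXITY"] - 1, v - 1),
        # gdv(G) / v(G)
        "GLOBAL_DATA_DENSITY": safe_div(m["GLOBAL_DATA_COMPLEXITY"], v),
        # defects per 1000 lines; LOC_TOTAL is an assumed choice of line count
        "ERROR_DENSITY": safe_div(1000 * m["ERROR_COUNT"], m["LOC_TOTAL"]),
    }

# Hypothetical module record.
module = {
    "CONDITION_COUNT": 8, "DECISION_COUNT": 4, "CYCLOMATIC_COMPLEXITY": 6,
    "DESIGN_COMPLEXITY": 3, "ESSENTIAL_COMPLEXITY": 3,
    "GLOBAL_DATA_COMPLEXITY": 3, "ERROR_COUNT": 2, "LOC_TOTAL": 500,
}
ratios = derived_metrics(module)
```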

2. LITERATURE SURVEY

2.1. Expert-based approach for software fault prediction

• The authors applied K-Means and Neural-Gas techniques on different real data sets.

• In their experience, the Neural-Gas-based prediction approach performed slightly worse than the K-Means clustering-based approach in terms of the overall error rate on large data sets.

• This approach helped the expert make better estimations compared with the predictions made by an unsupervised learning algorithm.

2.2. Clustering and Metrics Threshold Based Software Fault Prediction of Unlabeled Program Modules

• This work demonstrates the effectiveness of metrics thresholds, including a standalone application based on metrics thresholds.

• The unsupervised learning approach to fault prediction in software modules is assessed by its

– False negative rates (FNR)

– False positive rates (FPR)

2.3. Software Fault Prediction in Object Oriented Software Systems

• An advantage of object-oriented metrics is that they can serve as early predictors of classes that contain faults or that are costly to maintain.

• A fault prediction model is designed to separate the faulty classes in the field of software testing.

• An approach for predicting run-time errors in Java is introduced.

3. EXISTING SYSTEM

The existing systems are:

• Expert-based approach for software fault prediction

• Software Quality Classification Modeling Using the SPRINT Decision Tree Algorithm

• Clustering and Metrics Threshold Based Software Fault Prediction of Unlabeled Program

Modules

• Extending K-Means with Efficient Estimation of the Number of Clusters

• A Genetic Algorithm Using Hyper-Quadtrees for Low-Dimensional K-Means Clustering

• Software Fault Prediction in Object Oriented Software Systems

• Quad Tree-based software fault prediction

• Genetic Programming Model for Software Quality Classification

• Quality Prediction of Function Based Software Using Decision Tree Approach

3.1. Drawbacks of Existing System

• The expert-based approach for software fault prediction depends on the availability and capability of the expert.

• In K-Means clustering, the user has to specify the number of clusters, which is very difficult to identify in most cases.

• It requires selection of suitable initial cluster centers, which is again subject to error. Since the structure of the clusters depends on the initial cluster centers, this may result in inefficient clustering.

• The K-Means algorithm is very sensitive to noise.

• The false negative rates (FNR) for the clustering-based approach are lower than those for the metrics-based approach.

• The false positive rates (FPR) are better for the metrics-based approach.

4. PROPOSED SYSTEM

• The Quad Tree-based Fuzzy C-Means algorithm (QDC) has been applied for predicting faults in program modules.

• Quad Trees are applied for finding the initial cluster centers for the FCM algorithm.

• A desired number of cluster centers is generated to be used as input to the Fuzzy C-Means algorithm.

• Clustering gain values for the best cluster by FCM and by the Quad Tree-based algorithm are very close, which indicates the effectiveness of the algorithm.

• The performance of QDC is compared with other initialization techniques such as the Global K-Means algorithm, K-Means and Quad Tree-based K-Means.

4.1. Advantages

• QDC returns the number of cluster centers for analysis.

• It provides better performance than other prediction techniques.

• It supports labeling of the modules.

• Metric thresholds are readily available.

• The number of iterations of the K-Means algorithm is lower in the case of QDC.

5. SYSTEM REQUIREMENTS

HARDWARE REQUIREMENTS

• Processor : Pentium IV

• Speed : Above 500 MHz

• RAM capacity : 2 GB

• Hard disk drive : 80 GB

• Monitor : 17” Samsung

SOFTWARE REQUIREMENTS

Operating System : Windows XP and above

Front end used : Java

Back End : SQL Server

6. SYSTEM IMPLEMENTATION

6.1. Modules

The proposed system has three modules:

• Calculation of Input Parameters

– Usage of JM1 Data Sets

– Calculation of Metric Thresholds

– Evaluation of Fault-prone Parameters

• Applying Fuzzy C-Means Algorithm Using Quad-Tree

• Performance Comparison

6.2. Module Description

1. Calculation of Input Parameters

1.1. Usage of JM1 Data Sets

• A data set is a collection of data in tabular form.

• A data set has several characteristics which define its structure and properties.

• Four real data sets are used to test the algorithm.

– Those data sets are: AR3, AR4, AR5 and the Iris data set.

• The software measurements and fault data are collected at the program function, subroutine, or

method levels.

1.2. Calculation of Metric Thresholds

Three methods are described for determining acceptable metrics thresholds:

• Experience and Hints from literature

• Tuning machine

• Analysis of multiple versions

Branch Count metric : Branch Count
Line Count metrics : Total Lines of Code, Executable LOC, Comments LOC, Blank LOC, Code And Comments LOC
Halstead metrics : Total Operators, Total Operands, Unique Operators, Unique Operands
Cyclomatic Complexity : The number of linearly independent paths through the flow graph.
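Cyclomatic complexity can be computed from the flow graph with McCabe's formula v(G) = E - N + 2P, where E is the number of edges, N the number of nodes and P the number of connected components. The sketch below assumes a single connected flow graph (P = 1) given as a list of edges; the function name and the example graphs are hypothetical.

```python
def cyclomatic_complexity(edge_list):
    """v(G) = E - N + 2 for one connected control-flow graph (assumed P = 1)."""
    nodes = {node for edge in edge_list for node in edge}
    return len(edge_list) - len(nodes) + 2

# Flow graph of a single if/else: one decision, two paths.
if_else = [("cond", "then"), ("cond", "else"), ("then", "exit"), ("else", "exit")]

# Flow graph of a single while loop.
while_loop = [("entry", "cond"), ("cond", "body"), ("body", "cond"), ("cond", "exit")]

v_if_else = cyclomatic_complexity(if_else)
v_loop = cyclomatic_complexity(while_loop)
```

The edge and node counts used here correspond to the EDGE_COUNT and NODE_COUNT metrics of a module.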

1.3. Evaluation of Fault-prone Parameters

• A confusion matrix is formed from

– Actual Labels

– Predicted Labels

• The actual labels of the data items are placed along the rows.

• The predicted labels are placed along the columns.

• If a non-faulty module (actual label: False) is predicted as non-faulty (predicted label: False), the result is a True Negative.

• If it is predicted as faulty (predicted label: True), the result is a False Positive.

• The false negative rate is the percentage of faulty modules labeled as not fault-prone.

• The false positive rate is the percentage of non-faulty modules labeled as fault-prone.
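The rates defined above follow directly from the four cells of the confusion matrix. A minimal Python sketch is given here; the function name is hypothetical, labels are assumed to be booleans (True = faulty), and the overall error rate is assumed to be the share of misclassified modules, (FP + FN) / total.

```python
def error_rates(actual, predicted):
    """Return FPR, FNR and overall error (all in percent) from boolean labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # faulty, predicted faulty
        elif a and not p:
            fn += 1      # faulty, predicted non-faulty
        elif not a and p:
            fp += 1      # non-faulty, predicted faulty
        else:
            tn += 1      # non-faulty, predicted non-faulty
    total = tp + fp + tn + fn
    return {
        "FPR": 100.0 * fp / (fp + tn) if (fp + tn) else 0.0,
        "FNR": 100.0 * fn / (fn + tp) if (fn + tp) else 0.0,
        "ERROR": 100.0 * (fp + fn) / total if total else 0.0,
    }

# Hypothetical labels for five modules.
actual_labels = [True, True, False, False, False]
predicted_labels = [True, False, False, True, False]
rates = error_rates(actual_labels, predicted_labels)
```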

2. Applying Fuzzy C-Means Algorithm Using Quad-Tree

Notations and parameters used in the initialization algorithm are as follows:

• MIN: user-defined threshold for the minimum number of data points in a sub bucket.

• MAX: user-defined threshold for the maximum number of data points in a sub bucket.

• δ: user-specified distance for finding nearest neighbours.

• White leaf bucket: a sub bucket having less than MIN percent of the data points of the parent bucket.

• Black leaf bucket: a sub bucket having more than MAX percent of the data points of the parent bucket.

• Gray bucket: a sub bucket which is neither white nor black.

• Rk: neighbourhood set of the centre ck of a black leaf bucket.

• C: set of cluster centers used for initializing the Fuzzy C-Means algorithm.

ALGORITHM:

Step 1: Divide the initial data space into buckets and continue until all buckets are either black or white leaf buckets.

Step 2: Subdivide the gray buckets further and leave the white buckets as they are. A sub bucket that exceeds the MAX threshold is labeled as a black leaf bucket.

Step 3: Initialize the set of cluster centers C to the null set.

Step 4: For the centre ck of each black leaf bucket, find its δ-nearest neighbours and include them in the set Rk.

Step 5: Calculate the mean of each group. The means are used as the initial cluster centers for the Fuzzy C-Means algorithm.
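The steps above can be sketched in Python for two-dimensional data. This is a simplified illustration under several assumptions, not the project's implementation: data points are 2-D tuples, the centre ck of a black leaf bucket is taken as the mean of its points, a depth cap (`max_depth`) is added as a safety guard, and all function names are hypothetical.

```python
from math import dist

def mean(points):
    """Component-wise mean of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def split(points, box):
    """Divide a bounding box into four quadrants and distribute the points."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quads = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
    buckets = [[] for _ in quads]
    for p in points:
        buckets[(1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)].append(p)
    return list(zip(quads, buckets))

def black_leaf_centres(points, box, min_pct, max_pct, depth=0, max_depth=8):
    """Steps 1-2: subdivide until every bucket is a white or black leaf."""
    centres = []
    for quad, pts in split(points, box):
        share = 100.0 * len(pts) / len(points)  # percent of the parent's points
        if not pts or share < min_pct:
            continue                             # white leaf bucket: ignored
        if share > max_pct or depth >= max_depth:
            centres.append(mean(pts))            # black leaf bucket (or depth cap)
        else:                                    # gray bucket: subdivide further
            centres.extend(black_leaf_centres(pts, quad, min_pct, max_pct,
                                              depth + 1, max_depth))
    return centres

def initial_centres(points, min_pct, max_pct, delta):
    """Steps 3-5: group black-leaf centres within delta and average each group."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    remaining = black_leaf_centres(points, (min(xs), min(ys), max(xs), max(ys)),
                                   min_pct, max_pct)
    c_set = []                                   # Step 3: C starts empty
    while remaining:
        ck = remaining[0]
        rk = [c for c in remaining if dist(c, ck) <= delta]  # Step 4 (includes ck)
        remaining = [c for c in remaining if dist(c, ck) > delta]
        c_set.append(mean(rk))                   # Step 5: mean of the group
    return c_set

# Hypothetical data: two well-separated groups of modules.
data = [(0, 0), (0.5, 0.2), (0.2, 0.6), (0.4, 0.4),
        (10, 10), (9.5, 9.8), (9.8, 9.4), (9.6, 9.6)]
seeds = initial_centres(data, min_pct=10, max_pct=40, delta=2)
```

The resulting `seeds` list would then be passed to the clustering algorithm as its initial cluster centers. A larger δ merges more black-leaf centres into one group and therefore yields fewer initial centers.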

3. Performance Comparison

• Performance can be compared with various clustering algorithms based on gain value.

• Parameters for evaluation are

– Number Of Iterations (NOI)

– Sum of Squares Error (SSE)

– Gain

– Percent Error.

• The simplified formula for calculation of gain is as follows:

$$\text{Gain} = \sum_{k=1}^{K} (v_k - 1)\, \lVert z_0 - z_0^{k} \rVert^{2}$$
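The gain formula can be evaluated as in the short Python sketch below. The symbols are not defined on the slide, so the sketch assumes the usual reading of clustering gain: K is the number of clusters, v_k the number of data points in cluster k, z_0 the centroid of the whole data set and z_0^k the centroid of cluster k. The function names and sample clusters are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[a] for p in points) / n for a in range(len(points[0])))

def clustering_gain(clusters):
    """Sum over clusters of (v_k - 1) * ||z_0 - z_0^k||^2."""
    all_points = [p for cluster in clusters for p in cluster]
    z0 = centroid(all_points)                    # global centroid (assumed meaning of z_0)
    gain = 0.0
    for cluster in clusters:
        if not cluster:
            continue                             # empty clusters contribute nothing
        zk = centroid(cluster)                   # centroid of cluster k
        gain += (len(cluster) - 1) * sum((a - b) ** 2 for a, b in zip(z0, zk))
    return gain

# Hypothetical clustering of four points into two clusters.
example_gain = clustering_gain([[(0, 0), (2, 0)], [(10, 0), (12, 0)]])
```

In this example the global centroid is (6, 0) and the cluster centroids are (1, 0) and (11, 0), so each cluster contributes (2 - 1) × 25.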

REFERENCES

1. Bataineh K.M., Naji M., Saqer M., "A Comparison Study between Various Fuzzy Clustering Algorithms", Jordan Journal of Mechanical and Industrial Engineering, Vol. 5, No. 4, pp. 335-343, Aug. 2011.

2. Catal C., Sevim U., and Diri B., "Software Fault Prediction of Unlabeled Program Modules", Proceedings of the World Congress on Engineering 2009, Vol. I.

3. Deepak Gupta, Vinay Kumar Goel, Harish Mittal, "Software Quality Analysis of Unlabeled Program Modules with Fuzzy C-means Clustering Technique", IJMRS's International Journal of Engineering Sciences, Vol. 01, Issue 02, June 2012.

4. Han J. and Kamber M., Data Mining: Concepts and Techniques, second ed., pp. 401-404, Morgan Kaufmann Publishers, 2007.

5. Yue Jiang, Bojan Cukic, Yan Ma, "Techniques for evaluating fault prediction models", Empirical Software Engineering (2008) 13:561-595.

THANK YOU
