Estimation of Defects Based on
Defect Decay Model: ED3M
Abstract:
An accurate prediction of the number of defects in a software product during
system testing contributes not only to the management of the system testing process but
also to the estimation of the product’s required maintenance. Here, a new approach,
called Estimation of Defects based on Defect Decay Model (ED3M), is presented that
computes an estimate of the number of defects in an ongoing testing process. ED3M is based on
estimation theory. Unlike many existing approaches, the technique presented here does
not depend on historical data from previous projects or any assumptions about the
requirements and/or testers’ productivity. It is a completely automated approach that
relies only on the data collected during an ongoing testing process. This is a key
advantage of the ED3M approach as it makes it widely applicable in different testing
environments. Here, the ED3M approach has been evaluated using five data sets from
large industrial projects and two data sets from the literature. In addition, a performance
analysis has been conducted using simulated data sets to explore its behavior using
different models for the input data. The results are very promising; they indicate that the
ED3M approach provides accurate estimates, with convergence as fast as or faster than
well-known alternative techniques, while only using defect data as the
input.
EXISTING SYSTEM:
Several researchers have investigated the behavior of defect density based on
module size. One group of researchers has found that larger modules have lower defect
density. Two of the reasons provided for their findings are the smaller number of links
between modules and that larger modules are developed with more care. The second
group has suggested that there is an optimal module size for which the defect density is
minimal. Their results have shown that defect density depicts a U-shaped behavior
against module size. Still others have reported that smaller modules enjoy lower defect
density, exploiting the famous divide-and-conquer rule. Another line of studies has been
based on the use of design metrics to predict fault-prone modules. Briand et al. have
studied the degree of accuracy of capture-recapture models,
proposed by biologists, to predict the number of remaining defects during inspection
using actual inspection data. They have also studied the impact of the number of
inspectors and the total number of defects on the accuracy of the estimators based on
relevant capture-recapture models. Ostrand and Bell have developed a model to
predict which files will contain the most faults in the next release based on the structure
of each file, as well as fault and modification history from the previous release.
PROPOSED SYSTEM:
Many researchers have addressed this important problem with varying end
goals and have proposed estimation techniques to compute the total number of defects. A
group of researchers focuses on finding error-prone modules based on the size of the
module. Briand et al. predict the number of remaining defects during inspection using
actual inspection data, whereas Ostrand et al. predict which files will contain the most
faults in the next release. Zhang and Mockus use data collected from previous projects to
estimate the number of defects in a new project. However, these data sets are not always
available or, even if they are, may lead to inaccurate estimates. For example, Zhang and
Mockus use a naïve method based only on the size of the product to select similar
projects while ignoring many other critical factors such as project type, complexity, etc.
Another alternative that appears to produce very accurate estimates is based on the use of
Bayesian Belief Networks (BBNs). However, these techniques require the use of
additional information, such as expert knowledge and empirical data, that are not
necessarily collected by most software development companies. Software reliability
growth models (SRGMs) are also used to estimate the total number of defects to measure
software reliability. Although they can be used to indicate the status of the testing
process, some have slow convergence while others have limited application as they may
require more input data or initial values that are selected by experts.
Hardware and Software Requirements:
SOFTWARE REQUIREMENTS
VS .NET 2010, C#
Windows 7
HARDWARE REQUIREMENTS
Hard disk : 80 GB
RAM : 1 GB
Processor : Pentium Dual Core or above
Monitor : 17" color monitor
Scope of the project :
The goal of the project is to estimate the number of defects in a software product. The availability of this estimate allows a test manager to improve his planning, monitoring, and controlling activities, providing a more efficient testing process. Estimators can achieve high accuracy as more and more data become available and the process nears completion.
Introduction :
Software metrics are crucial for characterizing the development status of a software product. Well-defined metrics can help to address many issues, such as cost, resource planning (people, equipment such as testbeds, etc.), and product release schedules. Metrics have been proposed for many phases of the software development lifecycle, including requirements, design, and testing. In this paper, the focus is on characterizing the status of the software testing effort using a single key metric: the estimated number of defects in a software product. The availability of this estimate allows a test manager to improve his planning, monitoring, and controlling activities; this provides a more efficient testing process. Also, since, in many companies, system
testing is one of the last phases (if not the last), the time to release can be better assessed; the estimated remaining defects can be used to predict the required level of customer support. Ideally, a defect estimation technique has several important characteristics. First, the technique should be accurate as decisions based on inaccurate estimates can be time consuming and costly to correct. However, most estimators can achieve high accuracy as more and more data becomes available and the process nears completion.
By that time, the estimates are of little, if any, use. Therefore, a second important characteristic is that accurate estimates need to be available as early as possible during the system testing phase. The faster the estimate converges to the actual value (i.e., the lower its latency), the more valuable the result is to a test manager. Third, the technique should be generally applicable in different software testing processes and on different kinds of software products. The inputs to the process should be commonly available and should not require extensive expertise in an underlying formalism. In this case, the same technique can be widely reused, both within and among software development companies, reducing training costs, the need for additional tool support, etc. Many researchers have addressed this important problem with varying end goals and have proposed estimation techniques to compute the total number of defects. A group of researchers focuses on finding error-prone modules based on the size of the module. Briand et al. predict the number of remaining defects during inspection using actual inspection data, whereas Ostrand et al. predict which files will contain the most faults in the next release. Zhang and Mockus use data collected from previous projects to estimate the number of defects in a new project. However, these data sets are not always available or, even if they are, may lead to inaccurate estimates. For example, Zhang and Mockus use a naïve method based only on the size of the product to select similar projects while ignoring many other critical factors such as project type, complexity, etc. Another alternative that appears to produce very accurate estimates is based on the use of Bayesian Belief Networks (BBNs).
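The latency characteristic above can be made concrete: an estimator is considered to have converged once its estimates stay within a given tolerance of the actual value. The following is a minimal illustrative sketch (in Python for brevity; the function name, tolerance, and sample history are assumptions for illustration, not taken from the ED3M paper):

```python
def convergence_point(estimates, actual, tolerance=0.1):
    """Return the first index from which every subsequent estimate stays
    within `tolerance` (as a fraction) of the actual value; None if the
    series never converges."""
    for i in range(len(estimates)):
        if all(abs(e - actual) <= tolerance * actual for e in estimates[i:]):
            return i
    return None

# A hypothetical estimate history converging to an actual count of 100 defects.
history = [40, 70, 85, 93, 97, 99, 100]
point = convergence_point(history, actual=100, tolerance=0.1)
```

The lower this index relative to the length of the testing process, the earlier a test manager can act on the estimate.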
Estimation of Defects based on Defect Decay Model (ED3M) is a novel approach proposed here which has been rigorously validated using case studies, simulated data sets, and data sets from the literature. Based on this validation work, the ED3M approach has been shown to produce accurate final estimates with a convergence rate that meets or improves upon closely related, well-known techniques. The only input is the defect data; the ED3M approach is fully automated.
Although the ED3M approach has yielded promising results, there are defect prediction issues that are not addressed by it. For example, system test managers would benefit from obtaining a prediction of the defects to be found in system testing well before the testing begins, ideally in the requirements or design phase. This could be used to improve the plan for developing the test cases. The ED3M approach, which requires test defect data as the input, cannot be used for this. Alternate approaches which rely on different input data (e.g., historical project data and expert knowledge) could be selected to accomplish this. However, in general, these data are not available at most companies.
A second issue is that test managers may prefer to obtain the predictions for the number of defects on a feature-by-feature basis, rather than for the whole system. Although the ED3M approach could be used for this, the number of sample points for each feature may be too small to allow for accurate predictions. As before, additional information could be used to achieve such estimations, but this is beyond the scope of this paper. Third, the performance of the ED3M approach is affected when the data diverge from the underlying assumption of an exponential decay behavior.
General Concepts
Software defect prediction is the process of estimating the number of defects that a
software product under development will contain after a certain time frame.
It helps in allocating resources to rectify the defects and also helps in software cost
estimation. Generally, a defect prediction model is based on some current data and some
learning data, so most models rely on how defects are identified and corrected in an
organization.
The ED3M model, on the other hand, is a defect prediction model that does not require any
historical data. The only inputs needed are the bug reports of the first few days and the
rate of rectifying the bugs.
Based on these inputs, it calculates the model parameters (lambda) and estimates the value
of h(n), the curve-fitting element over n, the number of days during which defects are
intended to be detected.
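The curve-fitting step can be illustrated with a small sketch. Assuming the cumulative defect count follows an exponential decay form, x(t) = N * (1 - exp(-lambda * t)), the total defect count N and the decay rate lambda can be recovered by least squares. The grid-search approach and all names below are illustrative choices (Python for brevity), not the actual ED3M estimator:

```python
import math

def fit_decay(cumulative, lambdas=None):
    """Fit x(t) = N * (1 - exp(-lam * t)) to a cumulative defect series by
    grid search over lam; for each candidate lam, the best N has a
    closed-form least-squares solution. Returns (N_estimate, lam_estimate)."""
    if lambdas is None:
        lambdas = [i / 1000 for i in range(1, 2001)]  # lam in 0.001 .. 2.0
    best = None
    for lam in lambdas:
        g = [1 - math.exp(-lam * t) for t in range(1, len(cumulative) + 1)]
        denom = sum(gi * gi for gi in g)
        if denom == 0:
            continue
        # Closed-form least-squares N for this candidate lam.
        n = sum(x * gi for x, gi in zip(cumulative, g)) / denom
        err = sum((x - n * gi) ** 2 for x, gi in zip(cumulative, g))
        if best is None or err < best[0]:
            best = (err, n, lam)
    return best[1], best[2]

# Synthetic series: 120 total defects with decay rate 0.3 (invented numbers).
data = [120 * (1 - math.exp(-0.3 * t)) for t in range(1, 15)]
n_hat, lam_hat = fit_decay(data)
```

With noise-free synthetic data the fit recovers the generating parameters; with real, noisy defect counts the estimate improves as more sample points arrive, mirroring the convergence behavior discussed earlier.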
The concept also takes a noise factor into consideration. Noise is nothing but
misleading defect information. For example, suppose that in your project Facebook Connect
and Facebook Live Stream are both not working and both are reported in the bug report,
but the cause is a single problem in the Facebook API hook. The second report would
therefore be considered noise data here.
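This duplicate-report notion of noise can be approximated with a simple textual-similarity check over the report descriptions, such as the bigram similarity used later in the implementation. The sketch below (Python for brevity) uses the Dice coefficient over character bigrams; the 0.6 threshold is an arbitrary illustrative choice:

```python
def bigrams(text):
    """Set of character bigrams of the lowercased text."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over character bigrams: 1.0 for identical bigram
    sets, 0.0 for no overlap."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def flag_noise(reports, threshold=0.6):
    """Return indices of reports that closely resemble an earlier report
    and are therefore treated as likely duplicates (noise)."""
    return [i for i in range(1, len(reports))
            if any(bigram_similarity(reports[i], reports[j]) >= threshold
                   for j in range(i))]
```

In the Facebook example above, the two near-identical reports score highly against each other, so the second one is flagged while an unrelated report is not.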
Module Diagram :
UML Diagram :
Use case diagram :
Project Defect History
Browse (To select the project)
Estimate the defects in a project
Error correction method
Report process
User
Login
Browse
Error estimation
Error correction
Report
Admin
Log of Defect Correction
Class Diagram :
Collaboration Diagram :
Sequence Diagram :
StateChart Diagram :
Activity Diagram :
Component Diagram :
Project Flow Diagram :
User
Browse the input folder
Apply Error Estimation techniques
Error Correction
Report for admin
System Architecture :
Literature review :
The traditional way of predicting software reliability has, since the 1970s, been the use of software reliability growth models. They were developed in a time when software was developed using a waterfall process model. This is in line with the fact that most software reliability growth models require a substantial amount of failure data to get any trustworthy estimate of the reliability. Software reliability growth models are normally described in the form of an equation with a number of parameters that need to be fitted to the failure data. A key problem is that the curve fitting often means that the parameters can only be estimated very late in testing and hence their industrial value for decision-making is limited. This is particularly the case when development is done, for example, using an incremental approach or other short-turnaround approaches. A sufficient amount of failure data is simply not available. The software reliability growth models were initially developed for a quite different situation than today. Thus, it is not a surprise that they are not really a fit for the challenges of today unless the problems can be circumvented. This paper addresses some of the possibilities for addressing the problems with software reliability growth models by looking at ways of estimating the parameters in software reliability growth models before entering integration or system testing.
Construction simulation tools typically provide results in the form of numerical or statistical data. However, they do not illustrate the modeled operations graphically in 3D. This poses significant difficulty in communicating the results of simulation models, especially to persons who are not trained in simulation but are domain experts. The resulting “Black-Box Effect” is a major impediment in verifying and validating simulation models. Decision makers often do not have the means, the training, and/or the time to verify and validate simulation models based solely on the numerical output of simulation models and are thus always skeptical about simulation analyses and have little confidence in their results. This lack of credibility is a major deterrent hindering the widespread use of simulation as an operations planning tool in construction. This paper illustrates the use of DES in the design of a complex dynamic earthwork operation whose control logic was verified and validated using 3D animation. The model was created using Stroboscope and animated using the Dynamic Construction Visualizer.
Over the years, many defect prediction studies have been conducted. The studies consider the problem using a variety of mathematical models (e.g., Bayesian Networks, probability distributions, reliability growth models, etc.) and characteristics of the project, such as module size, file structure, etc. A useful survey and critique of these techniques is available in the literature. Several researchers have investigated the behavior of defect density based on module size. One group of researchers has found that larger modules have lower defect density. Two of the reasons provided for their findings are the smaller number of links between modules and that larger modules are developed with more care. The second group has suggested that there is an optimal module size for which the defect density is minimal. Their results have shown that defect density depicts a U-shaped behavior against module size. Still others have reported that smaller modules enjoy lower defect density, exploiting the famous divide-and-conquer rule. Another line of studies has been based on the use of design metrics to predict fault-prone modules. Briand et al. have studied the degree of accuracy of capture-recapture models, proposed by biologists, to predict the number of remaining defects during inspection using actual inspection data. They have also studied the impact of the number of inspectors and the total number of defects on the accuracy of the estimators based on relevant capture-recapture models. Ostrand and Bell have developed a model to predict which files will contain the most faults in the next release based on the
structure of each file, as well as fault and modification history from the previous release. Their research [5] has shown that faults are distributed in files according to the famous Pareto Principle, i.e., 80 percent of the faults are found in 20 percent of the files. Zhang and Mockus assume that defects discovered and fixed during development are caused by implementing new features recorded as Modification Requests (MRs). Historical data from past projects are used to collect estimates for the defect rate per feature MR, the time to repair the defect in a feature, and the delay between a feature implementation and defect repair activities. The selection criteria for past similar projects are based only on the size of the project while disregarding many other critical characteristics. These estimates are used as input to a prediction model, based on the Poisson distribution, to predict the number of defect repair MRs. The technique that has been presented by Zhang and Mockus relies solely on historical data from past projects and does not consider the data from the current project. Fenton et al. have used BBNs to predict the number of defects in the software. The results shown are plausible; the authors also explain causes of the results from the model. However, accuracy has been achieved at the cost of requiring expert knowledge of the Project Manager and historical data (information besides defect data) from past projects. Currently, such information is not always collected in industry. Also, expert knowledge is highly subjective and can be biased. These factors may limit the application of such models to a few companies that can cope with these requirements. This has been a key motivating factor in developing the ED3M approach. The only information ED3M needs is the defect data from the ongoing testing process; this is collected by almost all companies. Gras et al. also advocate the use and effectiveness of BBNs for defect prediction.
However, they point out that the use of BBN is not always possible and an alternative method, Defect Profile Modeling (DPM), is proposed. Although DPM does not demand as much on calibration as BBN, it does rely on data from past projects, such as the defect
identifier, release sourced, phase sourced, release found, phase found, etc. Many reliability models have been used to predict the number of defects in a software product. The models have also been used to provide the status of the testing process based on the defect growth curve. For example, if the defect curve is growing exponentially, then more undiscovered defects are to follow and testing should continue. If the growth curve has reached saturation, then the decision regarding the fate of testing can be reviewed by managers and engineers.
Advantages :
1) Accurate estimates as early as possible during the system testing process.
2) Much more information to compute the estimates.
3) Estimation of defects in large modules.
4) Correction of the current estimate; the corrected value is the output.
To improve on the existing approaches, in particular with respect to the applicability, the following characteristics are needed in a technique.
First, it should use the defect count, an almost ubiquitous input, as the only data required to compute the estimates (historical data are not required). Most companies, if not all, developing software have a way to report defects which then can be easily counted. Second, the user should not be required to provide any initial values for internal parameters or expert knowledge; this results in a fully automated approach. Third, the technique should be flexible; it should be able to produce estimates based on defect data reported in execution time or calendar time.
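The third characteristic (accepting defect data reported in calendar time) amounts to turning raw report dates into a cumulative per-day series, filling in days with no reports. The following is an illustrative Python sketch (the function name and example dates are assumptions):

```python
from collections import Counter
from datetime import date, timedelta

def cumulative_counts(report_dates):
    """Turn raw defect report dates into a per-day cumulative series,
    filling in days with no reports, so the data can be fed to an
    estimator in calendar time."""
    counts = Counter(report_dates)
    first, last = min(counts), max(counts)
    series, total, day = [], 0, first
    while day <= last:
        total += counts.get(day, 0)   # days without reports add zero
        series.append((day, total))
        day += timedelta(days=1)
    return series
```

The same series, indexed by test-execution hours instead of dates, would serve for execution time; only the keys change.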
Numerous application areas, such as signal processing, defect estimation, and software reliability, need to extract information from observations that have been corrupted by noise. For example, in a software testing process, the observations are the number of defects detected; the noise may have been caused by the experience of the testers, size and complexity of the application, errors in collecting the data, etc. The information of interest to extract from the observations is the total number of defects in the software. A branch of statistics and signal processing, estimation theory, provides techniques to accomplish this.
(Figure: the observations x[0], x[1], ..., x[N-1] feed an estimator, which extracts the information of interest from the data set.)
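The estimation-theory idea can be shown in its simplest form: for observations x[n] = theta + w[n], where w[n] is i.i.d. Gaussian noise, the maximum-likelihood estimate of theta is the sample mean. A toy Python sketch (the defect total of 250 and the noise level are invented for illustration):

```python
import random

def ml_estimate_constant(observations):
    """For x[n] = theta + w[n] with i.i.d. Gaussian noise w[n], the
    maximum-likelihood estimate of theta is the sample mean."""
    return sum(observations) / len(observations)

# Toy example: the "information of interest" is a total of 250 defects,
# observed through noisy measurements.
random.seed(1)
obs = [250 + random.gauss(0, 5) for _ in range(200)]
est = ml_estimate_constant(obs)
```

ED3M applies the same principle to a time-varying signal (the decay curve) rather than a constant, but the noisy-observation setup is identical.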
Implementation
bmp = new Bitmap(pictureBox1.Width, pictureBox1.Height);
g = Graphics.FromImage(bmp);
listBox1.Items.Clear();

// Let the user pick the bug-report file; each line is "date<TAB>description".
OpenFileDialog ofd = new OpenFileDialog();
DialogResult dr = ofd.ShowDialog();
if (dr.Equals(DialogResult.OK))
{
    string fname = ofd.FileName;
    string[] Lines = File.ReadAllLines(fname);
    for (int i = 0; i < Lines.Length; i++)
    {
        listBox5.Items.Add(Lines[i]);
    }

    BugReport[] bug = new BugReport[Lines.Length - 1];
    int BugCount = bug.Length;
    // The first line is the header, so start reading from the next line.
    for (int i = 0; i < Lines.Length - 1; i++)
    {
        try
        {
            string s = Lines[i + 1]; // e.g., "2.1.2011<TAB>error!"
            string[] data = s.Split(new char[] { '\t' }); // Split takes a character array as input.
            string date = data[0]; // first part is the date,
            string name = data[1]; // second part is the description.
            bug[i] = new BugReport(name, date);
            BugCount = i + 1;
        }
        catch (Exception) { }
    }
    MessageBox.Show("Bug Count=" + BugCount.ToString());

    // W[i] is the noise component: the bigram similarity between consecutive
    // bug reports. Highly similar (correlated) reports are treated as noise.
    float[] W = new float[BugCount];
    for (int i = 1; i < BugCount; i++)
    {
        W[i] = NGram.GetBigramSimilarity(bug[i].Bug, bug[i - 1].Bug);
        listBox2.Items.Add(W[i]);
    }

    // Date-wise total bugs x[n]: a single date may contain more than one error.
    ArrayList totErrors = new ArrayList();
    int err = 1;
    for (int i = 1; i < BugCount; i++)
    {
        if (bug[i].date != bug[i - 1].date)
        {
            totErrors.Add(err);
            err = 1; // reset the counter for the new date
        }
        else
        {
            err++;
        }
    }
    totErrors.Add(err); // count for the last date

    int[] X = new int[totErrors.Count];
    for (int i = 0; i < X.Length; i++)
    {
        X[i] = (int)totErrors[i];
    }
    BugCount = X.Length; // update BugCount: now the number of distinct dates

    // Let theta be the date index (0, 1, 2, ... for the first date, second date, and so on).
    // P[x(n)|theta] is the likelihood of the observed count under Gaussian noise.
    double sigma = 0.5;
    double[] Px_Theta = new double[BugCount];
    for (int i = 0; i < BugCount; i++)
    {
        double p = 1 / Math.Sqrt(2 * Math.PI * sigma * sigma);
        double p1 = Math.Exp(-1 * (X[i] - i) * (X[i] - i) / (2 * sigma * sigma));
        Px_Theta[i] = p * p1; // Gaussian density: the normalizing factor multiplies the exponential
        listBox3.Items.Add(Px_Theta[i]);
    }

    // R(t) is the rate of error-correction velocity. From observation, assume
    // that once an error is tracked, it is solved in two days on average.
    double Rt = 2.0;
    // lambda1 = rate of error occurrence = total errors / distinct dates;
    // lambda2 is the scale of error occurrence.
    double lambda1 = (double)bug.Length / (double)BugCount;
    // Knowing R(init), Rt, and lambda1, Eq. (23) is solved for lambda2.
    double K = Rt / (double)X[0];
    double lambda2 = K * lambda1 / (1 - Math.Exp(-1 * lambda1));

    // Now calculate h(n), the curve-fitting element.
    double[] H = new double[BugCount];
    double[] LogCk = new double[BugCount];
    int x1 = pictureBox1.Width, x2 = 0, y1 = pictureBox1.Height, y2 = pictureBox1.Height;
    for (int i = 0; i < BugCount; i++)
    {
        double t1 = lambda2 / (lambda2 - lambda1);
        t1 = t1 * Math.Exp(-1 * lambda1 * i);