Statistical Process Control (SPC)
A Metrics-Based Point of View of Software Processes Achieving the CMMI Level Four

Reiner Dumke, Isabelle Côté, Olga Andruschak
Otto-von-Guericke-Universität Magdeburg, Institut für Verteilte Systeme
[email protected], http://ivs.cs.uni-magdeburg.de/sw-eng/agruppe/

Contents

1 The CMMI Approach
1.1 Basic Intentions of the CMMI
1.2 The CMMI Levels
1.3 The CMMI Metrication
2 Software Measurement Intentions
2.1 The CAME Measurement Framework
2.2 The CMMI Metrics Set by Kulpa and Johnson
2.3 The CMMI-Based Organization’s Measurement Repository
3 The Statistical Software Process (SPC)
3.1 Foundations of the SPC
3.2 Empirical Strategies
3.3 Testing Methods
3.4 Methods of Data Analysis
4 SPC and CMMI
4.1 Basics of Quantified Process Management
4.2 Controlling the Process Improvement
5 References

Abstract

The following preprint presents a new form of integration of the idea of statistically based analysis of the software process (SPC) into the assessment and improvement activities of the Capability Maturity Model Integration initiative. Covering the basic statistical methods and the foundations of software experiments, we describe a structured approach to the metrication of the different stages of the CMMI approach. Further, this preprint shows appropriate methods of statistical analysis for improving the software process areas and activities towards a quantitatively managed process level, based on the metrics set defined by Kulpa and Johnson.

1 The CMMI Approach

1.1 Basic Intentions of the CMMI

CMMI stands for Capability Maturity Model Integration and is an initiative for changing the general intention from an assessment view based on the “classical” CMM or ISO 9000 to an improvement view integrating the System Engineering CMM (SE-CMM), the Software Acquisition Capability Maturity Model (SA-CMM), the Integrated Product Development CMM (IPD-CMM), the System Engineering Capability Assessment Model (SECAM), the Systems Engineering Capability Model (SECM), and basic ideas of the new versions of ISO 9001 and ISO 15504. The following semantic network shows some classical approaches in the software process evaluation without any comments [Ferguson 1998].

Figure 1: Dependencies of software process evaluation methods and standards (relating the CMMI to its predecessors and neighbours such as SW-CMM, People CMM, SA-CMM, SE-CMM, SSE-CMM, SECAM, SECM, IPD-CMM, FAA-iCMM, ISO 9000, ISO 15504 (SPICE), ISO/IEC 12207, Trillium, TickIT, and various military and IEEE standards)

The CMMI is structured in the five maturity levels, the considered process areas, the specific goals (SG) and generic goals (GG), the common features, and the specific practices (SP) and generic practices (GP). The process areas are defined as follows [Kulpa 2003]:

“The Process Area is a group of practices or activities performed collectively to achieve a specific objective.”

Such objectives could be requirements management at level 2, requirements development at maturity level 3, or quantitative project management at level 4. The difference between the “specific” and the “generic” goals, practices, or process areas lies in the special aspects or areas that are considered, in contrast to the general IT- or company-wide analysis or improvement. There are four common features:

• The commitment to perform (CO)
• The ability to perform (AB)
• The directing implementation (DI)
• The verifying implementation (VE).

The CO is shown through senior management commitment, the AB is shown through training personnel, the DI is demonstrated by managing configurations, and the VE is demonstrated via objectively evaluating adherence and by reviewing status with higher-level management.

2

Page 3: Statistical Process Control for Level 4

The following Figure 2 shows the general relationships between the different components of the CMMI approach.

Figure 2: The CMMI model components (process areas with their specific goals and practices, generic goals and practices, and the capability levels)

The CMMI gives us some guidance as to what is a required component, what is an expected component, and what is simply informative.

1.2 The CMMI Levels

There are six capability levels (but five maturity levels), designated by the numbers 0 through 5 [SEI 2002], including the following process areas:

0. Incomplete: -
1. Performed: best practices;
2. Managed: requirements management, project planning, project monitoring and control, supplier agreement management, measurement and analysis, process and product quality assurance, configuration management;
3. Defined: requirements development, technical solution, product integration, verification, validation, organizational process focus, organizational process definition, organizational training, integrated project management, risk management, integrated teaming, integrated supplier management, decision analysis and resolution, organizational environment for integration;
4. Quantitatively Managed: organizational process performance, quantitative project management;
5. Optimizing: organizational innovation and deployment, causal analysis and resolution.

Kulpa and Johnson consider the following specific goals and practices for achieving the different maturity levels with respect to quantification [Kulpa 2003]:

Level 2: Measurement and Analysis:
The purpose of Measurement and Analysis is to develop and sustain a measurement capability that is used to support management information needs.
Specific Practices by Specific Goal:

SG1 Align Measurement and Analysis Activities: Measurement objectives and activities are aligned with identified information needs and objectives.

SP1.1 Establish Measurement Objectives: Establish and maintain measurement objectives that are derived from identified information needs and objectives.

SP1.2 Specify Measures: Specify measures to address the measurement objectives.


SP1.3 Specify Data Collection and Storage Procedures: Specify how measurement data will be obtained and stored.

SP1.4 Specify Analysis Procedures: Specify how measurement data will be analyzed and reported.
SG2 Provide Measurement Results: Measurement results that address identified information needs and objectives are provided.
SP2.1 Collect Measurement Data: Obtain specified measurement data.
SP2.2 Analyze Measurement Data: Analyze and interpret measurement data.
SP2.3 Store Data and Results: Manage and store measurement data, measurement specifications, and analysis results.
SP2.4 Communicate Results: Report results of measurement and analysis activities to all relevant stakeholders.

Level 2: Process and Product Quality Assurance:
Specific Practices by Specific Goal:
SG1 Objectively Evaluate Processes and Work Products: Adherence of the performed process and associated work products and services to applicable process descriptions, standards, and procedures is objectively evaluated.
SP1.1 Objectively Evaluate Processes: Objectively evaluate the designated performed processes against the applicable process descriptions, standards, and procedures.
SP1.2 Objectively Evaluate Work Products and Services: Objectively evaluate the designated work products and services against the applicable process descriptions, standards, and procedures.
SG2 Provide Objective Insight: Noncompliance issues are objectively tracked and communicated, and resolution is ensured.
SP2.1 Communicate and Ensure Resolution of Noncompliance Issues: Communicate quality issues and ensure resolution of noncompliance issues with the staff and managers.
SP2.2 Establish Records: Establish and maintain records of the quality assurance activities.

Level 3: Verification:
The purpose of Verification is to ensure that selected work products meet their specified requirements.
Specific Practices by Specific Goal:
SG1 Prepare for Verification: Preparation for verification is conducted.
SP1.1 Select Work Products for Verification: Select the work products to be verified and the verification methods that will be used for each.
SP1.2 Establish the Verification Environment: Establish and maintain the environment needed to support verification.
SP1.3 Establish Verification Procedures and Criteria: Establish and maintain verification procedures and criteria for the selected work products.
SG2 Perform Peer Reviews: Peer reviews are performed on selected work products.
SP2.1 Prepare for Peer Reviews: Prepare for peer reviews of selected work products.
SP2.2 Conduct Peer Reviews: Conduct peer reviews on selected work products and identify issues resulting from the peer review.
SP2.3 Analyze Peer Review Data: Analyze data about preparation, conduct, and results of the peer reviews.
SG3 Verify Selected Work Products: Selected work products are verified against their specified requirements.
SP3.1 Perform Verification: Perform verification on the selected work products.
SP3.2 Analyze Verification Results and Identify Corrective Action: Analyze the results of all verification activities and identify corrective action.


Level 3: Validation:

The purpose of Validation is to demonstrate that a product or product component fulfills its intended use when placed in its intended environment.
Specific Practices by Specific Goal:
SG1 Prepare for Validation: Preparation for validation is conducted.
SP1.1 Select Products for Validation: Select products and product components to be validated and the validation methods that will be used for each.
SP1.2 Establish the Validation Environment: Establish and maintain the environment needed to support validation.
SP1.3 Establish Validation Procedures and Criteria: Establish and maintain procedures and criteria for validation.
SG2 Validate Product or Product Components: The product or product components are validated to ensure that they are suitable for use in their intended operating environment.
SP2.1 Perform Validation: Perform validation on the selected products and product components.
SP2.2 Analyze Validation Results: Analyze the results of the validation activities and identify issues.

Level 3: Decision Analysis and Resolution:
The purpose of Decision Analysis and Resolution is to analyze possible decisions using a formal evaluation process that evaluates identified alternatives against established criteria.
Specific Practices by Specific Goal:
SG1 Evaluate Alternatives: Decisions are based on an evaluation of alternatives using established criteria.
SP1.1 Establish Guidelines for Decision Analysis: Establish and maintain guidelines to determine which issues are subject to a formal evaluation process.
SP1.2 Establish Evaluation Criteria: Establish and maintain the criteria for evaluating alternatives, and the relative ranking of these criteria.
SP1.3 Identify Alternative Solutions: Identify alternative solutions to address issues.
SP1.4 Select Evaluation Methods: Select the evaluation methods.
SP1.5 Evaluate Alternatives: Evaluate alternative solutions using the established criteria and methods.
SP1.6 Select Solutions: Select solutions from the alternatives based on the evaluation criteria.

Level 4: Quantitative Project Management:
The purpose of the Quantitative Project Management process area is to quantitatively manage the project’s defined process to achieve the project’s established quality and process-performance objectives.
Specific Practices by Specific Goal:

SG1 Quantitatively Manage the Project: The project is quantitatively managed using quality and process- performance objectives.

SP1.1 Establish the Project’s Objectives: Establish and maintain the project’s quality and process- performance objectives.

SP1.2 Compose the Defined Process: Select the subprocesses that compose the project’s defined process, based on historical stability and capability data.

SP1.3 Select the Subprocesses that Will Be Statistically Managed: Select the subprocesses of the project’s defined process that will be statistically managed.

SP1.4 Manage Project Performance: Monitor the project to determine whether the project’s objectives for quality and process performance will be satisfied, and identify corrective action as appropriate.

SG2 Statistically Manage Subprocess Performance: The performance of selected subprocesses within the project’s defined process is statistically managed.
SP2.1 Select Measures and Analytic Techniques: Select the measures and analytic techniques to be used in statistically managing the selected subprocesses.
SP2.2 Apply Statistical Methods to Understand Variation: Establish and maintain an understanding of the variation of the selected subprocesses using the selected measures and analytic techniques.
SP2.3 Monitor Performance of the Selected Subprocesses: Monitor the performance of the selected subprocesses to determine their capability to satisfy their quality and process-performance objectives, and identify corrective action as necessary.
SP2.4 Record Statistical Management Data: Record statistical and quality management data in the organization’s measurement repository.
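To illustrate SP2.1 and SP2.2, the following sketch (not part of the CMMI model itself) computes the natural process bounds of a subprocess measure with an individuals/moving-range (XmR) control chart; the defect-density values are invented, and 2.66 and 3.268 are the usual XmR chart constants.

```python
# Minimal XmR (individuals / moving range) control chart sketch.
# The subprocess measure and its values are hypothetical illustration data.

defect_density = [0.8, 1.1, 0.9, 1.4, 1.0, 0.7, 1.2, 0.9, 1.3, 2.6, 1.0, 0.8]

mean_x = sum(defect_density) / len(defect_density)

# Moving ranges between consecutive observations
moving_ranges = [abs(b - a) for a, b in zip(defect_density, defect_density[1:])]
mean_mr = sum(moving_ranges) / len(moving_ranges)

# Natural process bounds (control limits) with the usual XmR constants
ucl_x = mean_x + 2.66 * mean_mr
lcl_x = max(0.0, mean_x - 2.66 * mean_mr)
ucl_mr = 3.268 * mean_mr

print(f"centre line: {mean_x:.2f}, natural bounds: [{lcl_x:.2f}, {ucl_x:.2f}]")

# Points outside the natural bounds signal special causes of variation (SP2.3)
special_causes = [(i, x) for i, x in enumerate(defect_density)
                  if x > ucl_x or x < lcl_x]
print("observations suggesting special causes:", special_causes)
```

Such natural bounds would typically be recalculated whenever the subprocess changes, as listed among the examples in Section 1.3.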

Level 5: Causal Analysis and Resolution:
The purpose of Causal Analysis and Resolution is to identify causes of defects and other problems and take action to prevent them from occurring in the future.
Specific Practices by Specific Goal:
SG1 Determine Causes of Defects: Root causes of defects and other problems are systematically determined.
SP1.1 Select Defect Data for Analysis: Select the defects and other problems for analysis.
SP1.2 Analyze Causes: Perform causal analysis of selected defects and other problems and propose actions to address them.
SG2 Address Causes of Defects: Root causes of defects and other problems are systematically addressed to prevent their future occurrence.
SP2.1 Implement the Action Proposals: Implement the selected action proposals that were developed in causal analysis.
SP2.2 Evaluate the Effect of Changes: Evaluate the effect of changes on process performance.
SP2.3 Record Data: Record causal analysis and resolution data for use across the project and organization.

Addressing the basics of project management, the CMMI considers the following components for the management of IT processes [SEI 2002]:

Figure 3: The CMMI project management process areas (QPM, IPM for IPPD, RSKM, and ISM, and their data flows to the basic project management, process management, and engineering and support process areas)


Here QPM stands for Quantitative Project Management, IPM for Integrated Project Management, IPPD for Integrated Product and Process Development, RSKM for Risk Management, and ISM for Integrated Supplier Management.

1.3 The CMMI Metrication

In order to manage the software process quantitatively, the CMMI defines a set of metrics examples. Some of these appropriate software measurement intentions are [SEI 2002]:

Examples of quality and process performance attributes for which needs and priorities might be identified include the following:

o Functionality
o Reliability
o Maintainability
o Usability
o Duration
o Predictability
o Timeliness
o Accuracy

Examples of quality attributes for which objectives might be written include the following:

o Mean time between failures
o Critical resource utilization
o Number and severity of defects in the released product
o Number and severity of customer complaints concerning the provided service

Examples of process performance attributes for which objectives might be written include the following:

o Percentage of defects removed by product verification activities (perhaps by type of verification, such as peer reviews and testing)

o Defect escape rates
o Number and density of defects (by severity) found during the first year following product delivery (or start of service)
o Cycle time
o Percentage of rework time
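As a small illustration of the first two attributes above, the following sketch derives the percentage of defects removed by verification and the defect escape rate from purely hypothetical defect counts:

```python
# Hypothetical defect counts used only to illustrate the two attributes above.
defects_found_by_verification = {"peer review": 42, "testing": 31}
defects_escaped_to_field = 9  # found in the first year after delivery

removed = sum(defects_found_by_verification.values())
total_defects = removed + defects_escaped_to_field

removal_percentage = 100.0 * removed / total_defects
escape_rate = 100.0 * defects_escaped_to_field / total_defects

print(f"defects removed by verification: {removal_percentage:.1f} %")
print(f"defect escape rate: {escape_rate:.1f} %")
```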

Examples of sources for objectives include the following:

o Requirements
o Organization's quality and process-performance objectives
o Customer's quality and process-performance objectives
o Business objectives
o Discussions with customers and potential customers
o Market surveys

Examples of sources for criteria used in selecting subprocesses include the following:

o Customer requirements related to quality and process performance
o Quality and process-performance objectives established by the customer
o Quality and process-performance objectives established by the organization
o Organization’s performance baselines and models
o Stable performance of the subprocess on other projects
o Laws and regulations

Examples of product and process attributes include the following:

o Defect density
o Cycle time
o Test coverage

Example sources of the risks include the following:

o Inadequate stability and capability data in the organization’s measurement repository
o Subprocesses having inadequate performance or capability
o Suppliers not achieving their quality and process-performance objectives
o Lack of visibility into supplier capability
o Inaccuracies in the organization’s process performance models for predicting future performance
o Deficiencies in predicted process performance (estimated progress)
o Other identified risks associated with identified deficiencies

Examples of actions that can be taken to address deficiencies in achieving the project’s objectives include the following:

o Changing quality or process-performance objectives so that they are within the expected range of the project’s defined process
o Improving the implementation of the project’s defined process so as to reduce its normal variability (reducing variability may bring the project’s performance within the objectives without having to move the mean)
o Adopting new subprocesses and technologies that have the potential for satisfying the objectives and managing the associated risks
o Identifying the risk and risk mitigation strategies for the deficiencies
o Terminating the project

Examples of subprocess measures include the following:

o Requirements volatility
o Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and schedule)
o Coverage and efficiency of peer reviews
o Test coverage and efficiency
o Effectiveness of training (e.g., percent of planned training completed and test scores)
o Reliability
o Percentage of the total defects inserted or found in the different phases of the project life cycle
o Percentage of the total effort expended in the different phases of the project life cycle

Sources of anomalous patterns of variation may include the following:

o Lack of process compliance
o Undistinguished influences of multiple underlying subprocesses on the data
o Ordering or timing of activities within the subprocess
o Uncontrolled inputs to the subprocess
o Environmental changes during subprocess execution
o Schedule pressure
o Inappropriate sampling or grouping of data

Examples of criteria for determining whether data are comparable include the following:

o Product lines
o Application domain
o Work product and task attributes (e.g., size of product)
o Size of project

Examples of where the natural bounds are calculated include the following:

o Control charts
o Confidence intervals (for parameters of distributions)
o Prediction intervals (for future outcomes)
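A minimal sketch of the two interval-based options above, assuming approximately normally distributed data and using only the Python standard library; the cycle-time sample is invented and 1.96 approximates the two-sided 95 % normal quantile:

```python
import statistics

# Hypothetical cycle times (days) of a subprocess, assumed roughly normal.
cycle_times = [12.0, 14.5, 13.2, 11.8, 15.1, 12.9, 13.7, 14.0, 12.4, 13.5]

n = len(cycle_times)
mean = statistics.mean(cycle_times)
s = statistics.stdev(cycle_times)
z = 1.96  # two-sided 95 % quantile of the standard normal distribution

# Confidence interval for the process mean (a parameter of the distribution)
ci = (mean - z * s / n ** 0.5, mean + z * s / n ** 0.5)

# Prediction interval for a single future observation
pi = (mean - z * s * (1 + 1 / n) ** 0.5, mean + z * s * (1 + 1 / n) ** 0.5)

print(f"95 % confidence interval for the mean: {ci[0]:.1f} .. {ci[1]:.1f} days")
print(f"95 % prediction interval for the next value: {pi[0]:.1f} .. {pi[1]:.1f} days")
```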

Examples of techniques for analyzing the reasons for special causes of variation include the following:

o Cause-and-effect (fishbone) diagrams
o Designed experiments
o Control charts (applied to subprocess inputs or to lower level subprocesses)
o Subgrouping (analyzing the same data segregated into smaller groups based on an understanding of how the subprocess was implemented facilitates isolation of special causes)

Examples of when the natural bounds may need to be recalculated include the following:

o There are incremental improvements to the subprocess
o New tools are deployed for the subprocess
o A new subprocess is deployed
o The collected measures suggest that the subprocess mean has permanently shifted or the subprocess variation has permanently changed

Examples of actions that can be taken when a selected subprocess’ performance does not satisfy its objectives include the following:

o Changing quality and process-performance objectives so that they are within the subprocess’ process capability
o Improving the implementation of the existing subprocess so as to reduce its normal variability (reducing variability may bring the natural bounds within the objectives without having to move the mean)
o Adopting new process elements and subprocesses and technologies that have the potential for satisfying the objectives and managing the associated risks
o Identifying risks and risk mitigation strategies for each subprocess’ process capability deficiency

Examples of other resources provided include the following tools:

o System dynamics models
o Automated test-coverage analyzers
o Statistical process and quality control packages
o Statistical analysis packages

Examples of training topics include the following:

o Process modelling and analysis
o Process measurement data selection, definition, and collection

Examples of work products placed under configuration management include the following:

o Subprocesses to be included in the project’s defined process
o Operational definitions of the measures, their collection points in the subprocesses, and how the integrity of the measures will be determined
o Collected measures

Examples of activities for stakeholder involvement include the following:

o Establishing project objectives
o Resolving issues among the project’s quality and process-performance objectives
o Appraising performance of the selected subprocesses
o Identifying and managing the risks in achieving the project’s quality and process-performance objectives
o Identifying what corrective action should be taken

Examples of measures used in monitoring and controlling include the following:

o Profile of subprocesses under statistical management (e.g., number planned to be under statistical management, number currently being statistically managed, and number that are statistically stable)

o Number of special causes of variation identified

Examples of activities reviewed include the following:

o Quantitatively managing the project using quality and process-performance objectives
o Statistically managing selected subprocesses within the project’s defined process

Examples of work products reviewed include the following:

o Subprocesses to be included in the project’s defined process
o Operational definitions of the measures
o Collected measures

Based on these quantifications, the CMMI defines: “A ‘managed process’ is a performed process that is planned and executed in accordance with policy; employs skilled people having adequate resources to produce controlled outputs; involves relevant stakeholders; is monitored, controlled, and reviewed; and is evaluated for adherence to its process description.“


2 Software Measurement Intentions

2.1 The CAME Measurement Framework

The following measurement and evaluation framework addressing software products, processes, and resources was developed at the University of Magdeburg [Dumke 1999]. The measurement framework is embedded in some aspects of strategy in the IT area of organizations and societies, as shown in the following Figure 4.

Figure 4: Main areas relating to the software measurement and evaluation framework (society, organization, IT area, CAME strategy, CAME framework, and CAME tools)

We will briefly describe some essential aspects of this framework and the characteristics of the framework environments. The CAME strategy is related to the experience with measurement frameworks or metrics programs embedded in the enterprise area ([Dumke 2002], [Eickelmann 2000], [Fehrling 2003], [Kitchenham 1997], [Munson 2003]) and stands for

• Community: the necessity of a group or a team that is motivated and has the knowledge of software measurement to install software metrics. In general, the members of these groups are organised in metrics communities such as our German Interest Group on Software Metrics.

• Acceptance: the agreement of the (top) management to install a metrics program in the (IT) business area. This aspect is strongly connected with the knowledge about required budgets and personnel resources.

• Motivation: the production of measurement and evaluation results in a first metrics application which demonstrates the convincing benefits of the metrics application. This very important aspect can be achieved by the application of essential results from (world-wide) practice which are easy to understand and should motivate the management. One of the problems of this aspect is the fact that the management wants to obtain one single (quality) number as a summary of all measured characteristics.

• Engagement: the acceptance of spending effort to implement the software measurement as a permanent metrics system (with continued measurement, different statistical analyses, metrics set updates, etc.). This aspect also includes the requirement to dedicate personnel resources such as measurement teams.

The CAME framework consists of the following four phases which are defined to install a metrics program in the IT area and which can be used to evaluate the measurement level of this metrics program itself (see also [Dumke 2001], [Fenton 1997], [Kitchenham 1995], [Putnam 2003], [Zuse 1998]):

• Choice: the selection of metrics based on a special or general measurement view on the kind of measurement and the related measurement goals,


• Adjustment: the investigation and definition of the measurement characteristics of the metrics for the specific application field,

• Migration: the installation of a high metrication coverage based on semantic relations between the metrics along the whole life cycle and along the system architecture,

• Efficiency: the automation level of the construction of a tool-based measurement for the used metrics.

The phases of this framework will be explained in the following sections, including the detailed aspects of software measurement evaluation and the role of the CAME tools.

The Measurement Choice involves the following two essential questions:

“What is possible to measure?” vs. “What is necessary to measure?”

Obviously, we only want to measure what is necessary. But in most software engineering areas, this aspect is unknown (especially for modern software development paradigms or methodologies such as software agents and multi-agent systems). The first framework step includes the choice of the software metrics and measures. Therefore, we must define the set of software metrics explicitly [Dumke 2003]. The structure of this set of metrics is based on the following classification principles:

software product measurement and evaluation is based on the three components: model, implementation and documentation (see Figure 5),

Figure 5: Simplified visualisation of the product metrication (software architecture, software operation, and software documentation aspects)

Note that the metrication process depends on the kind of development method, the application area of the software system, the implementation paradigm, etc.


software process measurement and evaluation is based on the process aspects: controlling, phases/steps and methodologies (see Figure 6),

Figure 6: Simplified visualisation of the process metrication (software life cycle, software management, and software methodology aspects)

software resources measurement and evaluation is based on the three resource parts: personnel, software and hardware (see Figure 7).

Figure 7: Simplified visualisation of the resources metrication (personnel, software resources, and hardware resources)

Our framework starts with the investigation of the chosen metrics and assumes an underlying choice method such as

• the general measurement goal planning by [Basili 1986] (see also [Wohlin 2000]), which considers different measurement goals such as understanding of systems, assessment, proof of hypothesis, understanding of metrics, etc.,
• the Goal Question Metric (GQM) paradigm [Solingen 1999], which is directed at the improvement of a special aspect or component of the software system related to a special goal (a small sketch of a GQM plan is given below).

The measurement choice step defines the static characteristics of the software measurement process [Feiler 1993]. Note that the choice of software metrics or software measures decides about the areas under control and the areas outside of control in the IT department.
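A minimal sketch of how such a GQM-style choice could be recorded; the goal, questions, and metric names are invented for illustration and are not prescribed by [Solingen 1999]:

```python
# Invented GQM example: goal, questions, and metrics are illustrative only.
gqm_plan = {
    "goal": "Improve the reliability of the delivered product",
    "questions": {
        "How many defects escape to the customer?": [
            "defect escape rate",
            "number and severity of customer-reported defects",
        ],
        "How effective are our peer reviews?": [
            "defects found per review hour",
            "peer review coverage",
        ],
    },
}

for question, metrics in gqm_plan["questions"].items():
    print(question)
    for metric in metrics:
        print("  ->", metric)
```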


The Measurement Adjustment is related to the experience (expressed in values) of the measured attributes for the evaluation. The adjustment includes the metrics validation ([Card 2000], [Kitchenham 1995], [Zelkowitz 1997]) and the determination of the metrics algorithm based on measurement theory ([Henderson 1996], [Zuse 2003]). The steps in the measurement adjustment are (a small threshold sketch follows the list):

• the determination of the scale type and (if possible) the unit,

• the determination of the favourable values (as thresholds) for the evaluation of the measurement component, e. g. by

o discussion or brainstorming in the development or quality team,

o analysing and using the examples in the literature,

o using the thresholds of the metrics tools,

o taking the results of appropriate case studies and experimentation,

• the tuning of the thresholds as

o approximation during the software development from other project components,

o application of a metrics tool for a chosen software product that was classified as a ‘good qualitative’ example,

• the calibration of the scale (as transformation of the numerical scale part to the empirical) depends on the improvement of the knowledge in the problem domain.

In the adjustment step, we mainly consider the metrics characteristics addressed to the qualitative evaluation (nominal and ordinal scale types) or to the quantitative evaluation (interval or ratio scale types).

The Measurement Migration step is aimed at the dynamic aspects of the measurement framework or metrics program. This means that we must install a metrics-based network over the software product, process, and resources components as an Internal Measurement Process (IMP). We “migrate” the idea of metrication to all components of software development and maintenance. Note that most existing software measurement approaches or frameworks do not consider this step explicitly. First intentions of this idea are described as complexity traces in [Ebert 1993], as measurement through the life cycle in [Cool 1993], and as granularity of object-oriented systems in [Abreu 1995]. Some examples of these kinds of migration for software products are [Dumke 1999] (the first of which is sketched in code after the list):

• metrics tracing along the software life cycle, e. g. #notions (problem definition) → #classes (specification) → #new-defined-classes (design) → #implemented-classes (implementation),

• metrics refinement along the software life cycle, e. g. informal description of a specified service (text metrics) → PDL description of a service (design metrics) → Java form of a service (code metrics),

• metrics granulation related to the architecture, e. g. in an object-oriented development as the system, the component, the class/object and the method.
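A minimal sketch of the first kind of migration, metrics tracing along the life cycle; the phase and metric names follow the example above, while the counts and the consistency check are invented for illustration:

```python
# Metrics tracing along the life cycle, following the example above:
# #notions -> #classes -> #new-defined-classes -> #implemented-classes.
trace = [
    ("problem definition", "#notions", 45),
    ("specification", "#classes", 38),
    ("design", "#new-defined-classes", 41),
    ("implementation", "#implemented-classes", 41),
]

# A simple consistency check over the semantic relation between the phases:
# large jumps between consecutive counts may indicate lost or invented concepts.
for (phase_a, name_a, count_a), (phase_b, name_b, count_b) in zip(trace, trace[1:]):
    ratio = count_b / count_a
    print(f"{name_a} ({phase_a}) -> {name_b} ({phase_b}): ratio {ratio:.2f}")
```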

In the process and resources area, the semantic characteristics such as process phases and resources versions are also considered. Observing the software metrics as a class hierarchy, we can understand the measurement migration as the definition and design of the metrics behaviour. On the other hand, the migration step includes the definition and installation of the External Measurement Process (EMP) as software measurement integration. This means that we must consider the final goals of software measurement in the IT area. Hence, we need all of the process steps such as measurement, evaluation, exploitation and application (assessment, decision support, improvement) in a persistent manner ([Eickelmann 2000], [Jacquet 1997], [Wohlin 2000]).


The Measurement Efficiency step includes the instrumentation or the automation of the measurement process by tools. It requires analysing the algorithmic character of the software metrics and the possibility of integrating tool-based ‘control cycles’ into the software development or maintenance process. We call these metrics tools CAME (Computer Assisted software Measurement and Evaluation) tools [Dumke 1996]. In most cases, it is necessary to combine different metrics tools and techniques related to the measurement phases. Finally, we can describe the software measurement intentions as follows:

⇒ We don’t have any general system of measures in software engineering like in physics. Hence, in software development we must also consider rules of thumb, statements of trends, conclusions by analogy, expertise, estimations and predictions ([Dumke 2003], [Endres 2003]).

⇒ We also don’t have any standardised measurement system which implements such a system of measures. Therefore, we must use the general techniques of assessment (continuous, periodic or certified), general evaluation, experience and experimentation. Sometimes, the experimentation is not immediately used for decision support, improvement or controlling. We also use experimentation for the understanding of new paradigms or the cognition of new kinds of problems ([Basili 1986], [Wohlin 2000]).

⇒ Software measurement instruments are mostly not based on a physical analogy such as the column of mercury used to measure temperature. In most cases, software measurement is counting [Kitchenham 1995].

⇒ Software measurement has a context and is not finished with measurement values or thresholds. Software measurement can be a generic measurement and analysis process ([Card 2000], [Jacquet 1997]).

⇒ Empirical techniques are divided into informal observation, formal experiments, industrial case studies and benchmarking exercises or surveys ([Juristo 2003], [Kitchenham 1997]).

⇒ “In software engineering metrics area, should place more emphasis on the validity of the mathematical (and statistical) tools which have been (and are currently being) used in their development and use. Areas which give cause for concern in the past include the use of dimensionally incorrect equations, incorrect plotting of equations and consequent incorrect inferences, the sloppy use of mathematical notation and of calculated values and the lack of underpinning mathematical models.” [Henderson 1996]

Hence, the software metrics application based on different methodologies or frameworks requires statistical methods ([Juristo 2003], [Munson 2003], [Pandian 2003], [Sigpurwalla 1999], [Wohlin 2000], [Zuse 1998]).


2.2 The CMMI Metrics Set by Kulpa and Johnson

The following set of metrics is defined by Kulpa and Johnson in order to cover the quantification requirements of the different CMMI levels [Kulpa 2003].

CMMI Level 2: Requirements Management

1. Requirements volatility (percentage of requirements changes)
2. Number of requirements by type or status (defined, reviewed, approved, and implemented)
3. Cumulative number of changes to the allocated requirements, including total number of changes proposed, open, approved, and incorporated into the system baseline
4. Number of change requests per month, compared to the original number of requirements for the project
5. Amount of time spent, effort spent, cost of implementing change requests
6. Number and size of change requests after the Requirements phase is completed
7. Cost of implementing a change request
8. Number of change requests versus the total number of change requests during the life of the project
9. Number of change requests accepted but not implemented
10. Number of requirements (changes and additions to the baseline)
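As an illustration of metrics 1, 3, and 4, the following sketch computes requirements volatility and related ratios from invented project counts:

```python
# Invented counts used only to illustrate metrics 1, 3, and 4 above.
original_requirements = 120
requirement_changes = {"proposed": 25, "open": 4, "approved": 18, "incorporated": 15}
change_requests_per_month = [3, 5, 2, 7, 4, 4]

# Metric 1: requirements volatility as percentage of requirements changes
volatility = 100.0 * requirement_changes["approved"] / original_requirements

# Metric 3: cumulative number of changes to the allocated requirements
cumulative_changes = sum(requirement_changes.values())

# Metric 4: change requests per month compared to the original number of requirements
monthly_ratio = [100.0 * n / original_requirements for n in change_requests_per_month]

print(f"requirements volatility: {volatility:.1f} %")
print(f"cumulative changes (all states): {cumulative_changes}")
print("monthly change requests as % of baseline:",
      [f"{r:.1f}" for r in monthly_ratio])
```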

Project Planning

11. Completion of milestones for the project planning activities compared to the plan (estimates versus actuals)

12. Work completed, effort and funds expended in the project planning activities compared to the plan

13. Number of revisions to the project plan
14. Cost, schedule, and effort variance per plan revision
15. Replanning effort due to change requests
16. Effort expended over time to manage the project compared to the plan
17. Frequency, causes, and magnitude of the replanning effort

Project Monitoring and Control

18. Effort and other resources expended in performing monitoring and oversight activities
19. Change activity for the project plan, which includes changes to size estimates of the work products, cost/resource estimates, and schedule
20. Number of open and closed corrective actions or action items
21. Project milestone dates (planned versus actual)
22. Number of project milestone dates made on time
23. Number and types of reviews performed
24. Schedule, budget, and size variance between planned and actual reviews
25. Comparison of actuals versus estimates for all planning and tracking items

Measurement and Analysis

26. Number of projects using progress and performance measures
27. Number of measurement objectives addressed

Supplier Agreement Management

28. Cost of the COTS (commercial off-the-shelf) products
29. Cost and effort to incorporate the COTS products into the project
30. Number of changes made to the supplier requirements
31. Cost and schedule variance per supplier agreement
32. Costs of the activities for managing the contract compared to the plan
33. Actual delivery dates for contracted products compared to the plan
34. Actual dates of prime contractor deliveries to the subcontractor compared to the plan
35. Number of on-time deliveries from the vendor, compared with the contract
36. Number and severity of errors found after delivery
37. Number of exceptions to the contract to ensure schedule adherence
38. Number of quality audits compared to the plan


39. Number of Senior Management reviews to ensure adherence to budget and schedule versus the plan

40. Number of contract violations by supplier or vendor

Process and Product Quality Assurance (QA)

41. Completions of milestones for the QA activities compared to the plan
42. Work completed, effort expended in the QA activities compared to the plan
43. Number of product audits and activity reviews compared to the plan
44. Number of process audits and activities versus those planned
45. Number of defects per release and/or build
46. Amount of time/effort spent in rework
47. Amount of QA time/effort spent in each phase of the life cycle
48. Number of reviews and audits versus number of defects found
49. Total number of defects found in internal reviews and testing versus those found by the customer or end user after delivery
50. Number of defects found in each phase of the life cycle
51. Number of defects injected during each phase of the life cycle
52. Number of noncompliances written versus the number resolved
53. Number of noncompliances elevated to senior management
54. Complexity of module or component (McCabe, McClure, and Halstead metrics)

Configuration Management (CM)

55. Number of change requests or change board requests processed per unit of time
56. Completions of milestones for the CM activities compared to the plan
57. Work completed, effort expended, and funds expended in the CM activities
58. Number of changes to configuration items
59. Number of configuration audits conducted
60. Number of fixes returned as "Not Yet Fixed"
61. Number of fixes returned as "Could Not Reproduce Error"
62. Number of violations of CM procedures (noncompliance found in audits)
63. Number of outstanding problem reports versus rate of repair
64. Number of times changes are overwritten by someone else (or number of times people have the wrong initial version or baseline)
65. Number of engineering change proposals proposed, approved, rejected, implemented
66. Number of changes by category to source code, and to supporting documentation
67. Number of changes by category, type, and severity
68. Source lines of code stored in libraries placed under configuration control

CMMI Level 3: Requirements Development

69. Cost, schedule, and effort expended for rework
70. Defect density of requirements specifications
71. Number of requirements approved for build (versus the total number of requirements)
72. Actual number of requirements documented (versus the total number of estimated requirements)
73. Staff hours (total and by Requirements Development activity)
74. Requirements status (percentage of defined specifications out of the total approved and proposed; number of requirements defined)
75. Estimates of total requirements, total requirements definition effort, requirements analysis effort, and schedule
76. Number and type of requirements changes

Technical Solution

77. Cost, schedule, and effort expended for rework
78. Number of requirements addressed in the product or product-component design
79. Size and complexity of the product, product components, interfaces, and documentation
80. Defect density of technical solutions work products (number of defects per page)
81. Number of requirements by status or type throughout the life of the project (for example, number defined, approved, documented, implemented, tested, and signed-off by phase)
82. Problem reports by severity and length of time they are open


83. Number of requirements changed during implementation and test
84. Effort to analyze proposed changes for each proposed change and cumulative totals
85. Number of changes incorporated into the baseline by category (e.g., interface, security, system configuration, performance, and usability)
86. Size and cost to implement and test incorporated changes, including initial estimate and actual size and cost
87. Estimates and actuals of system size, reuse, effort, and schedule
88. The total estimated and actual staff hours needed to develop the system by job category and activity
89. Estimated dates and actuals for the start and end of each phase of the life cycle
90. Number of diagrams completed versus the estimated total diagrams
91. Number of design modules/units proposed
92. Number of design modules/units delivered
93. Estimates and actuals of total lines of code - new, modified, and reused
94. Estimates and actuals of total design and code modules and units
95. Estimates and actuals for total CPU hours used to date
96. The number of units coded and tested versus the number planned
97. Errors by category, phase discovered, phase injected, type, and severity
98. Estimates of total units, total effort, and schedule
99. System tests planned, executed, passed, or failed
100. Test discrepancies reported, resolved, or not resolved
101. Source code growth by percentage of planned versus actual

Product Integration

102. Product-component integration profile (i.e., product-component assemblies planned and performed, and number of exceptions found)

103. Integration evaluation problem report trends (e.g., number written and number closed)
104. Integration evaluation problem report aging (i.e., how long each problem report has been open)

Verification

105. Verification profile (e.g., the number of verifications planned and performed, and the defects found; perhaps categorized by verification method or type)

106. Number of defects detected by defect category
107. Verification problem report trends (e.g., number written and number closed)
108. Verification problem report status (i.e., how long each problem report has been open)
109. Number of peer reviews performed compared to the plan
110. Overall effort expended on peer reviews compared to the plan
111. Number of work products reviewed compared to the plan

Validation

112. Number of validation activities completed (planned versus actual)
113. Validation problem report trends (e.g., number written and number closed)
114. Validation problem report aging (i.e., how long each problem report has been open)

Organizational Process Focus

115. Number of process improvement proposals submitted, accepted, or implemented
116. CMMI maturity or capability level
117. Work completed, effort and funds expended in the organization's activities for process assessment, development, and improvement compared to the plans for these activities
118. Results of each process assessment, compared to the results and recommendations of previous assessments

Organizational Process Definition

119. Percentage of projects using the process architectures and process elements of the organization's set of standard processes

120. Defect density of each process element of the organization's set of standard processes
121. Number of on-schedule milestones for process development and maintenance
122. Costs for the process definition activities

Organizational Training

123. Number of training courses delivered (e.g., planned versus actual)
124. Post-training evaluation ratings
125. Training program quality surveys


126. Actual attendance at each training course compared to the projected attendance
127. Progress in improving training courses compared to the organization's and projects' training plans
128. Number of training waivers approved over time

Integrated Project Management for IPPD

129. Number of changes to the project's defined process
130. Effort to tailor the organization's set of standard processes
131. Interface coordination issue trends (e.g., number identified and closed)

Risk Management

132. Number of risks identified, managed, tracked, and controlled
133. Risk exposure and changes to the risk exposure for each assessed risk, and as a summary percentage of management reserve
134. Change activity for the risk mitigation plans (e.g., processes, schedules, funding)
135. Number of occurrences of unanticipated risks
136. Risk categorization volatility
137. Estimated versus actual risk mitigation effort
138. Estimated versus actual risk impact
139. The amount of effort and time spent on risk management activities versus the number of actual risks
140. The cost of risk management versus the cost of actual risks
141. For each identified risk, the realized adverse impact compared to the estimated impact

Integrated Teaming

142. Performance according to plans, commitments, and procedures for the integrated team, and deviations from expectations

143. Number of times team objectives were not achieved
144. Actual effort and other resources expended by one group to support another group or groups, and vice versa
145. Actual completion of specific tasks and milestones by one group to support the activities of other groups, and vice versa

Integrated Supplier Management

146. Effort expended to manage the evaluation of sources and selection of suppliers
147. Number of changes to the requirements in the supplier agreement
148. Number of documented commitments between the project and the supplier
149. Interface coordination issue trends (e.g., number identified and number closed)
150. Number of defects detected in supplied products (during integration and after delivery)

Decision Analysis and Resolution

151. Cost-to-benefit ratio of using formal evaluation processes

Organizational Environment for Integration

152. Parameters for key operating characteristics of the work environment

CMMI Level 4: Organizational Process Performance

153. Trends in the organization's process performance with respect to changes in work products and task attributes (e.g., size growth, effort, schedule, and quality)

Quantitative Project Management

154. Time between failures
155. Critical resource utilization
156. Number and severity of defects in the released product
157. Number and severity of customer complaints concerning the provided service
158. Number of defects removed by product verification activities (perhaps by type of verification, such as peer reviews and testing)
159. Defect escape rates
160. Number and density of defects by severity found during the first year following product delivery or start of service


161. Cycle time
162. Amount of rework time
163. Requirements volatility (i.e., number of requirements changes per phase)
164. Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and schedule)
165. Coverage and efficiency of peer reviews (i.e., number/amount of products reviewed compared to total number, and number of defects found per hour)
166. Test coverage and efficiency (i.e., number/amount of products tested compared to total number, and number of defects found per hour)
167. Effectiveness of training (i.e., percent of planned training completed and test scores)
168. Reliability (i.e., mean time-to-failure usually measured during integration and systems test)
169. Percentage of the total defects inserted or found in the different phases of the project life cycle
170. Percentage of the total effort expended in the different phases of the project life cycle
171. Profile of subprocesses under statistical management (i.e., number planned to be under statistical management, number currently being statistically managed, and number that are statistically stable)
172. Number of special causes of variation identified
173. The cost over time for the quantitative process management activities compared to the plan
174. The accomplishment of schedule milestones for quantitative process management activities compared to the approved plan (i.e., establishing the process measurements to be used on the project, determining how the process data will be collected, and collecting the process data)
175. The cost of poor quality (e.g., amount of rework, re-reviews and re-testing)
176. The costs for achieving quality goals (e.g., amount of initial reviews, audits, and testing)

CMMI Level 5: Organizational Innovation and Deployment

177. Change in quality after improvements (e.g., number of reduced defects)
178. Change in process performance after improvements (e.g., change in baselines)
179. The overall technology change activity, including number, type, and size of changes
180. The effect of implementing the technology change compared to the goals (e.g., actual cost saving compared to projected)
181. The number of process improvement proposals submitted and implemented for each process area
182. The number of process improvement proposals submitted by each project, group, and department
183. The number and types of awards and recognitions received by each of the projects, groups, and departments
184. The response time for handling process improvement proposals
185. Number of process improvement proposals accepted per reporting period
186. The overall change activity including number, type, and size of changes
187. The effect of implementing each process improvement compared to its defined goals
188. Overall performance of the organization's and projects' processes, including effectiveness, quality, and productivity compared to their defined goals
189. Overall productivity and quality trends for each project
190. Process measurements that relate to the indicators of the customers' satisfaction (e.g., survey results, number of customer complaints, and number of customer compliments)

Causal Analysis and Resolution

191. Defect data (problem reports, defects reported by the customer, defects reported by the user, defects found in peer reviews, defects found in testing, process capability problems, time and cost for identifying the defect and fixing it, estimated cost of not fixing the problem)
192. Number of root causes removed
193. Change in quality or process performance per instance of the causal analysis and resolution process (e.g., number of defects and changes in baseline)
194. The costs of defect prevention activities (e.g., holding causal analysis meetings and implementing action items), cumulatively
195. The time and cost for identifying the defects and correcting them compared to the estimated cost of not correcting the defects
196. Profiles measuring the number of action items proposed, open, and completed
197. The number of defects injected in each stage, cumulatively, and over releases of similar products
198. The number of defects


2.3 The CMMI-Based Organization’s Measurement Repository

The following section describes the main activities for defining and implementing measurement repositories used in an organizational context. The repository contains both product and process measures that are related to the organization's set of standard processes ([SEI 2002]). It also contains or refers to the information needed to understand and interpret the measures and assess them for reasonableness and applicability. For example, the definitions of the measures are used to compare similar measures from different processes. Typical Work Products:

1. Definition of the common set of product and process measures for the organization's set of standard processes

2. Design of the organization’s measurement repository

3. Organization's measurement repository (i.e., the repository structure and support environment)

4. Organization’s measurement data

Subpractices:

1. Determine the organization's needs for storing, retrieving, and analyzing measurements.

2. Define a common set of process and product measures for the organization's set of standard processes. The measures in the common set are selected based on the organization's set of standard processes. The common set of measures may vary for different standard processes. Operational definitions for the measures specify the procedures for collecting valid data and the point in the process where the data will be collected. Examples of classes of commonly used measures include the following:
   Estimates of work product size (e.g., pages)
   Estimates of effort and cost (e.g., person hours)
   Actual measures of size, effort, and cost
   Quality measures (e.g., number of defects found, severity of defects)
   Peer review coverage
   Test coverage
   Reliability measures (e.g., mean time to failure)
   Refer to the Measurement and Analysis process area for more information about defining measures.

3. Design and implement the measurement repository.

4. Specify the procedures for storing, updating, and retrieving measures.

5. Conduct peer reviews on the definitions of the common set of measures and the procedures for storing and retrieving measures. Refer to the Verification process area for more information about conducting peer reviews.

6. Enter the specified measures into the repository. Refer to the Measurement and Analysis process area for more information about collecting and analyzing data.

7. Make the contents of the measurement repository available for use by the organization and projects as appropriate.

8. Revise the measurement repository, common set of measures, and procedures as the organization’s needs change. Examples of when the common set of measures may need to be revised include the following:
   New processes are added
   Processes are revised and new product or process measures are needed
   Finer granularity of data is required
   Greater visibility into the process is required
   Measures are retired
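To make the repository structure described in these subpractices more concrete, the following is a minimal sketch of how such an organization-wide measurement repository could be implemented with Python and SQLite. The table and column names are illustrative assumptions only, not prescribed by the CMMI material.

```python
import sqlite3

# Illustrative schema: one table holds the operational definitions of the
# common set of measures, the other stores the collected measurement data.
schema = """
CREATE TABLE IF NOT EXISTS measure_definition (
    measure_id       INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,   -- e.g. 'effort', 'size', 'defects_found'
    unit             TEXT NOT NULL,   -- e.g. 'person hours', 'pages'
    scale            TEXT NOT NULL,   -- nominal / ordinal / interval / ratio
    collection_point TEXT             -- where in the standard process data is collected
);
CREATE TABLE IF NOT EXISTS measurement (
    id           INTEGER PRIMARY KEY,
    measure_id   INTEGER REFERENCES measure_definition(measure_id),
    project      TEXT NOT NULL,
    process      TEXT NOT NULL,       -- element of the set of standard processes
    value        REAL NOT NULL,
    collected_on TEXT NOT NULL        -- ISO date
);
"""

conn = sqlite3.connect("measurement_repository.db")
conn.executescript(schema)

# Store one definition and one data point (cf. subpractices 2 and 6) ...
conn.execute("INSERT INTO measure_definition (name, unit, scale, collection_point) "
             "VALUES (?, ?, ?, ?)", ("effort", "person hours", "ratio", "end of phase"))
conn.execute("INSERT INTO measurement (measure_id, project, process, value, collected_on) "
             "VALUES (1, 'Project A', 'design', 320.0, '2004-06-30')")
conn.commit()

# ... and retrieve them for use by the organization and projects (cf. subpractice 7).
for row in conn.execute("SELECT m.project, d.name, m.value, d.unit "
                        "FROM measurement m JOIN measure_definition d USING (measure_id)"):
    print(row)
```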


3 The Statistical Process Control (SPC)

3.1 Foundations of the Statistical Process Control

This section gives a short description of Statistical Process Control (SPC) based on [Kulpa 2003]. SPC is often the most dreaded of all subjects when discussing process improvement, because it involves numbers, and then scrutinizing those numbers to determine whether they are correctly collected, reported, and used throughout the organization. Many organizations collect metrics to summarize the best practices found in other organizations. So we will describe the different types of charts and discuss reasons for using the charts and reasons for collecting data.

SPC consists of a set of techniques used to help individuals understand, analyze, and interpret numerical information. SPC is used to identify and track variation in processes. All processes will have some natural variation. Due to the normal variation in any process, the numbers (for example, the number of cars waiting at a stoplight, or the number of accidents that occur) can change when the process really has not. So, we need to understand both the numbers relating to our processes and the changes that occur in our processes so that we may respond appropriately.

Other terms that you may see are common causes of variation and special causes of variation, as well as common cause systems and special cause systems. Common causes of variation result from such things as system design decisions and the use of one development tool over another. This variation will occur predictably across the entire process associated with it and is considered normal variation. Special causes of variation are those that arise from such things as inconsistent process execution and lack of resources. This variation is exceptional variation and is also known as assignable causes of variation. We will use both terms. Other terms you will hear are "in control" for predictable processes or steady-state, and "out of control" for unpredictable processes that are "outside the natural limits." When a process is predictable, it exhibits routine variation as a result of common causes. When a process is unpredictable, it exhibits exceptional variation as a result of assignable causes. It is our job to be able to tell the difference and to find the assignable cause.

When a process is predictable, it is performing as consistently as it can (either for better or for worse). It will not be performing perfectly; there will always be some normal, routine variation. Looking for assignable causes for processes that are running predictably is a waste of time because you will not find any. Work instead on improving the process itself. When a process is unpredictable, that means it is not operating consistently. In that case it is a waste of time to try to improve the process itself; you must instead find out why it is not operating predictably and detail the "whys" as specifically as possible. To do that, you must find and fix the assignable cause(s); that is, the activity that is causing the process to behave erratically.

In contrast to the predictability of a process, we may want to consider whether a process is capable of delivering what is needed by the customer. Capable processes perform within the specification limits set by the customer. So, a process may be predictable, but not capable.

Usually, there are seven commonly recognized tools or diagrams for statistical process control:

1. Check sheet
2. Run chart
3. Histogram
4. Pareto chart
5. Scatter diagram/chart
6. Cause-and-effect or fishbone diagram
7. Control chart

Some basic examples, cited from [Kulpa 2003], are shown in the following only to illustrate the general characteristics.

Check Sheet: The check sheet (see Table 1) is used for counting and accumulating data in a general or special context.


Table 1: Check sheet used for counting and accumulating data

Run Chart: The run chart (see Figure 8) tracks trends over a period of time. Points are plotted in the order in which they occur. Each point represents an observation. You can often see interesting trends in the data by simply plotting data on a run chart. A danger in using run charts is that you might overreact to normal variations, but it is often useful to put your data on a run chart to get a feel for process behaviour.

Figure 8: Example of a run chart

Histogram: The histogram (see Figure 9) is a bar chart that presents data that have been collected over a period of time, and graphically presents these data by frequency. Each bar represents the number of observations that fit within the indicated range. Histograms are useful because they can be used to see the amount of variation in a process. The data in this histogram are the same data as in the run chart in Figure 8. Using the histogram, you get a different perspective on the data. You see how often similar values occur and get a quick idea of how the data are distributed.

Figure 9: A simple example of a histogram
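The following small sketch illustrates the point made above: the same observations plotted as a run chart (in time order) and as a histogram (by frequency) give two different perspectives on the data. The weekly defect counts are hypothetical placeholder data, not values from the text.

```python
import matplotlib.pyplot as plt

# Hypothetical observations (e.g. defects found per week), in time order.
data = [12, 15, 11, 14, 18, 13, 12, 16, 20, 14, 13, 15]

fig, (run_ax, hist_ax) = plt.subplots(1, 2, figsize=(8, 3))

# Run chart: the values in the order in which they occurred.
run_ax.plot(range(1, len(data) + 1), data, marker="o")
run_ax.set_title("Run chart")
run_ax.set_xlabel("Observation")
run_ax.set_ylabel("Value")

# Histogram: the same values grouped by frequency; time order is lost.
hist_ax.hist(data, bins=5)
hist_ax.set_title("Histogram")
hist_ax.set_xlabel("Value")
hist_ax.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```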


Pareto Chart: The Pareto chart (see Figure 10) is a bar chart that presents data prioritized in some fashion, usually either by descending or ascending order of importance. Pareto diagrams are used to show attribute data. Attributes are qualitative data that can be counted for recording and analysis; for example, counting the number of each type of defect. Pareto charts are often used to analyze the most often occurring type of something.

Figure 10: An example of a Pareto chart

Scatter Diagram/Chart: The scatter diagram (see Figure 11) is a diagram that plots data points, allowing trends to be observed between one variable and another. The scatter diagram is used to test for possible cause-and-effect relationships. A danger is that a scatter diagram does not prove the cause-and-effect relationship and can be misused. A common error in statistical analysis is seeing a relationship and concluding cause-and-effect without additional analysis.

Figure 11: An example of a scatter diagram/chart


Cause-and-Effect/Fishbone Diagram: The cause-and-effect/fishbone diagram (see Figure 12) is a graphical display of problems and causes. This is a good way to capture team input from a brainstorming meeting, from a set of defect data, or from a check sheet.

Figure 12: A cause-and-effect/fishbone diagram example

Control Chart: The control chart (see Figure 13) is basically a run chart with upper and lower limits that allows an organization to track process performance variation. Control charts are also called process behavior charts.

Figure 13: Example of a control chart

These seven graphical displays can be used together or separately to help gather data, accumulate data, and present the data for different functions associated with SPC.

The following seven questions are a start for reviewing the data for your charts [Kulpa 2003]:

1. Who collected these data? (Hopefully the same people who are trained in proper data collection techniques.)
2. How were the data collected? (Hopefully by automated means and at the same part of the process.)
3. When were the data collected? (Hopefully all at the same time on the same day or at the same time in the process - very important for accounting data dealing with month-end or year-end closings.)
4. What do the values presented mean? (Have you changed the process recently? Do these values really tell me what I want or need to know?)
5. How were these values computed from raw inputs? (Have you computed the data to arrive at the results you want, or to accurately depict the true voice of the process?)


6. What formulas were used? (Are they measuring what we need to measure? Are they working? Are they still relevant?)

... and the most important question of all:

7. Are we collecting the right data, and are we collecting the data right? (The data collected should be consistent, and the way data are collected should also be consistent. Do the data contain the correct information for analysis? In our peer review example, this information would be size, complexity, and programming language.)

Control charts are used to identify process variation over time. All processes vary. The degree of variance, and the causes of the variance, can be determined using control charting techniques. While there are many types of control charts, the ones we have seen the most often are the following [Kulpa 2003]:

c-chart: This chart uses a constant sample size of attribute data, where the average sample size is greater than five. It is used to chart the number of defects (such as “12” or “15” defects per thousand lines of code). c stands for the number of nonconformities within a constant sample size.

u-chart: This chart uses a variable sample size of attribute data. This chart is used to chart the number of defects in a sample or set of samples (such as "20 out of 50" design flaws were a result of requirements errors). u stands for the number of nonconformities with varying sample sizes.

np-chart: This chart uses a constant sample size of attribute data, usually greater than or equal to 50. This chart is used to chart the number defective in a group. For example, a hardware component might be considered defective, regardless of the total number of defects in it. np stands for the number defective.

p-chart: This chart uses a variable sample size of attribute data, usually greater than or equal to 50.

This chart is used to chart the fraction defective found in a group. p stands for the proportion defective.

X and mR charts: These charts use variable data where the sample size is one.

X-bar and R charts: These charts use variable data where the sample size is small. They can also be based on a large sample size greater than or equal to ten. X-bar stands for the average of the data collected. R stands for the range (distribution) of the data collected.

X-bar and s charts: These charts use variable data where the sample size is large, usually greater than or equal to ten.

So, as you can see, you can sometimes use several of the charts, based on the type of data and on the size of the sample - and the size of the sample may change. Control charts help detect and differentiate between noise (normal variation of the process) and signals (exceptional variation that warrants further investigation). Although others may disagree, we recommend that you use the Average Moving Range (XmR) chart for most situations. There are automated tools that can support building and displaying these charts. The task we need to undertake is to figure out how to tell the difference between noise and signals. Properly generated control charts, specifically the XmR chart, can help us in this task. Risk data (historical data) are critical for generating accurate control charts and for correct SPC analyses. Table 2 shows the count for each month of the year 2002 and the mR values (moving range).

Table 2: Example of moving range for the calendar year 2002


We can then average the moving ranges in the following statistical manner (see Figure 14), where Cen stands for the centerline, UCL for the upper control limit, and LCL for the lower control limit.

Figure 14: The 2002 moving R chart

We know that the values for the centerlines for each chart were computed by simply taking the average of the values displayed (i.e., by adding up the values for each month and then dividing by the number of months/values to compute the average). How were the upper and lower limits calculated for the charts shown above? We can calculate the limits for both the X (Individual Values) chart and the Average Moving Range (mR) chart as follows:

For the mR (moving range) chart: the upper range (or upper control limit, or upper natural limit) is computed by multiplying the average moving range (the centerline of the mR chart) by a constant scaling factor.

For the X chart (individual values chart): the upper range for the X chart is computed by multiplying the average moving range of the associated chart by a constant scaling factor and then adding the value for the centerline of the X chart. The lower range for the X chart is computed by multiplying the average moving range by the same factor and then subtracting this product from the value for the centerline of the X chart.

Notice that values for both representations (individual values and average moving range values) must be gathered and computed. The upper and lower limits for the individual values chart (X chart) depend on the average variations calculated for the centerline of the average moving range chart. Therefore, these charts are interdependent and can be used to show relationships between the two types of charts and the two types of data.

We have also seen the limits for the XmR charts calculated using median ranges instead of average ranges. The median moving range is often more sensitive to assignable causes when the values used contain some very high range values that inflate the average. Remember that the median range is the range of numbers that hover around the middle of a list sequenced in ascending or descending order; thus, the median range chart will automatically "throw out" the very high- or low-end values. Use of the median moving range approach is valid; however, the formulas (constants) change.

The most obvious interpretation is when one or more data points fall outside your control limits (either upper or lower). Those values should be investigated for assignable causes, and the assignable causes should be fixed. If your control chart shows three out of four consecutive points hovering closer to the limits than to the centerline, this pattern may signal a shift or trend, and should be investigated (because predictable processes generally show 85 to 90 percent of the data closer to the centerline than to the limits). Remember: useful limits can be constructed with as few as five or six consecutive values. However, the more data used to compute the limits, the greater the certainty of the results.

Another way to spot trends is to look at the data points along the centerline. If eight or more consecutive data points are clustered on the same side of the centerline, a shift in the original baseline or performance of the process has probably occurred, even without a data point falling outside the limits. This is a signal to be investigated.

c-chart appropriateness: While XmR charts are the most often applied in organizations, and are the most appropriate charts to use most often, they are not infallible. Sometimes, an event will occur that "skews the norm"; that is, a rare event way outside of the average has occurred. When this happens, a c-chart is better used.
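Before turning to the c-chart and u-chart in detail, the XmR limit calculation described above can be summarized in a short sketch. The monthly counts are hypothetical, and the scaling factors 2.66 and 3.268 are the conventional XmR constants, stated here as an assumption because the text describes the multiplication without giving the factors.

```python
# Sketch of an XmR (individuals and moving range) limit computation.
counts = [20, 23, 19, 25, 22, 30, 21, 24, 26, 22, 23, 28]   # hypothetical monthly counts

# Moving ranges: absolute differences of successive values (as in Table 2).
moving_ranges = [abs(b - a) for a, b in zip(counts, counts[1:])]

x_bar = sum(counts) / len(counts)                  # centerline of the X chart
mr_bar = sum(moving_ranges) / len(moving_ranges)   # centerline of the mR chart

unpl = x_bar + 2.66 * mr_bar   # upper natural process limit (X chart)
lnpl = x_bar - 2.66 * mr_bar   # lower natural process limit (X chart)
url = 3.268 * mr_bar           # upper range limit (mR chart); the lower limit is 0

print(f"X chart:  CL={x_bar:.2f}  UNPL={unpl:.2f}  LNPL={lnpl:.2f}")
print(f"mR chart: CL={mr_bar:.2f}  URL={url:.2f}")

# Simple signal check mentioned above: points outside the natural limits.
signals = [i + 1 for i, x in enumerate(counts) if x > unpl or x < lnpl]
print("Months with points outside the limits:", signals or "none")
```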


A c-chart is used for rare events that are independent of each other. The formulas for c-charts are different from those for XmR charts. First, calculate the average count of the rare occurrence over the total time period in which the occurrences happened. That number becomes the centerline. The upper limit is calculated by adding three times the square root of the average count to the average count. The lower limit is calculated by subtracting three times the square root of the average count from the average count. Charting the number of times a rare event occurs is pretty useless. However, charting the time periods between recurring rare events can be used to help predict when another rare event will occur. To do this, count the number of times the rare event occurs (usually per day per year) and determine the intervals between the rare events. Convert these numbers into the average moving ranges and, voilà, you can build an XmR chart.

u-chart appropriateness: The u-chart is based on the assumption that your data are a count of discrete events occurring within well-defined, finite regions/areas, and that these events are independent. The u-chart assumes a Poisson process. You may want to consider a u-chart when dealing with defects (counts) within a group of pages (region/area); for example, the number of errors per page or the number of defects per 1000 lines of code. The u-chart differs from the XmR chart in that the upper and lower control limits of the u-chart change over time. The ū in the u-chart is the weighted average of the counts (ū = Σ count_j / Σ size_j). The upper control limit is calculated by adding three times the square root of ū divided by the last size (size_j) to ū. The lower control limit is calculated by subtracting three times the square root of ū divided by the last size (size_j) from ū.

3.2 Empirical Strategies

There are three different types of strategies: survey, case study, and experiment ([Juristo 2003], [Kitchenham 1997]). These three strategies will be looked at in more detail in the following.

The survey is applied to subjects already in use (tools, etc.). The usual way to gather information is the use of questionnaires or interviews. These are applied to a representative sample group and the outcomes are then analysed. The aim is to derive conclusions that are descriptive, exploratory, or explanatory. With the use of generalization the result from the sample is mapped to the whole group. It is, however, not possible to manipulate or control the samples. Nevertheless it is practicable to compare the result with similar outcomes of other surveys. Both qualitative and quantitative data can be derived from this strategy. Which one it is depends on the data that is being collected through the questionnaires or interviews and whether statistical analysis methods are applicable or not. A popular field for this kind of investigation is well known to most people: social studies. An example would be public opinion polls before elections take place. The surveys there try to show how the people will vote on the actual day of election.

Another helpful kind of survey method is the application of experience, such as rules of thumb. Examples of these rules of thumb are described in the following as laws and conjectures cited from [Endres 2004].

Process-related experiences:

Fagan’s law: “Inspections significantly increase productivity, quality, and project stability”. There are three kinds of inspection: design, code, and test inspection. They are applicable in the development of all information- or knowledge-intensive products. This form of inspection is widespread throughout the industry today. Inspection also has a key role in the Capability Maturity Model (CMM). The benefit of inspections can be summarized as follows: they “create awareness for quality that is not achievable by any other method”.

Porter-Votta law: “Effectiveness of inspections is fairly independent of its organizational form”. A. Porter and L. Votta investigated the inspection process introduced by Fagan and came up with the following results: physical meetings are overestimated. They can be helpful when introducing the inspection process to new people; when education and experience are present, they are not that important anymore. Another point revealed was that it is not true that adding more persons to the inspection team increases the detection rate.

Hetzel-Myers law: “A combination of different Verification and Validation methods outperforms any

single method alone”. W. Hetzel and G. Myers claim that it is better to use all three methods in combination to gain better results at the end. This is due to the fact that design, code and test inspection are not competitors.


Mills-Jones hypothesis: “Quality entails productivity”. It is also known as “the optimist’s law” and can be seen as a variation of P. Crosby’s proverb “quality is free”. It is a very intuitive hypothesis: on the one hand, when the quality is high, less rework has to be done, which results in better productivity. On the other hand, when quality is poor, more rework has to be considered; therefore the productivity rate drops as well.

Mays’ hypothesis: “Error prevention is better than error removal”. No matter when an error is detected a

certain amount of rework has to be done (this amount increases the later it is detected). Therefore it is better to prevent errors. To be able to do so, the circumstances of errors have to be investigated, identified and then removed. It is still a hypothesis because it is extremely difficult to prove.

Structured conclusions:

Basili-Rombach hypothesis: “Measurements require both goals and models”. Metrics and measurement

need goals and questions otherwise they do not have a meaning. It is also preferable to use a top-down approach when specifying the parameters. This leads to the Goal-Question-Metric (GQM) paradigm.

Conjecture a: “Human-based methods can only be studied empirically”. The human-based methods

involve (human) judgement and depend on experience and motivation. This is why the results also depend on these different factors. To be able to understand and control those factors empirical studies are needed.

Conjecture b: “Learning is best accelerated by a combination of controlled experiments and case

studies”. Observing software development helps the developers to learn. The case studies supply the project characteristics, (realistic) complexity, project pressure etc. The lack of cause and effect insights can be provided through controlled experiments.

Conjecture c: “Empirical results are transferable only if abstracted and packaged with context”. The

information that has been gained needs to be transformed into knowledge with the context borne in mind. This can be achieved with the help of abstraction. It offers the opportunity to reuse the results. When the results are abstracted and packaged only two questions remain to be answered: “Do the results apply to this environment?” and “What are the risks of reusing these results?”

Another form of experience survey is the delivery of models, such as the models for measuring software reliability based on the failure rates and probabilistic characteristics of software systems [Singpurwalla 1999]:

• Jelinski-Moranda model: Jelinski and Moranda assume that the software contains an unknown number, say N, of bugs and that each time the software fails, a bug is detected and corrected; the failure rate during the ith interval T_i is proportional to N - i + 1, the number of bugs remaining in the code.

• Bayesian reliability growth model: This model reflects the consideration that the relationship between the number of bugs and the frequency of failure is tenuous.

• Musa-Okumoto models: These models are based on postulating a relationship between the intensity function and the mean value function of a Poisson process; they have gained popularity with users.

• General order statistics models: This kind of model is based on statistical order functions. The motivation for ordering comes from many applications like hydrology, strength of materials, and reliability.

• Concatenated failure rate model: These models introduce infinite memory for storing the failure rates, where the notion of infinite memory is akin to the notion of invertibility in time series analysis.

A case study is used to monitor the project. Throughout the study data is collected. This data is then investigated with statistical methods. The aim is to track variables or to establish relationships between different variables that have a leading role or effect on the outcome of the study. With the help of this kind of strategy it is possible to build a prediction model. The statistical analysis methods used for this kind of study consist of linear regression and principal component analysis. A disadvantage of this study is the generalisation. Depending on


the kind of result it can be very difficult to find a corresponding generalisation. This also influences the interpretation and thus makes it more difficult. Like the survey the case study can provide data for both qualitative and quantitative research. Experiments are usually performed in an environment resembling a laboratory to ensure a high amount of control while carrying out the experiment. The assignments of the different factors for the experiment are allotted totally at random. More about this random assignment can be found in the following sections. The main task of an experiment is to manipulate variables and to measure the effects they cause. This measurement data is the basis for the statistical analysis that is performed afterwards. In the case that it is not possible to assign the factors through random assignment, so-called quasi-experiments can be used instead of the experiments described above. Experiments are used for instance to confirm existing theories, to validate measures or to evaluate the accuracy of models [Wohlin 2000]. Other than surveys and case studies the experiments only provide data for a quantitative study. The difference between case studies and experiments is that case studies have a more observational character. They track specific attributes or establish relationships between attributes but do not manipulate them. In other words they observe the on-going project. The characteristic of an experiment in this case is that control is the main aspect and that the essential factors are not only identified but also manipulated. It is also possible to see a difference between case studies and surveys. A case study is performed during the execution of a project. The survey looks at the project in retrospect. Although it is possible to perform a survey before starting a project as a kind of prediction of the outcome, the experience used to do this is based on former knowledge and hence based on those experiences gained in the past. Carrying out experiments in the field of Software Engineering is different from other fields of application [Juristo 2003]. In software engineering several aspects are rather difficult to establish. These are:

• Find variable definitions that are accepted by everyone • Prove that the measures are nominal or ordinal scale • Validation of indirect measures: models and direct measures have to be validated

To be able to carry out an experiment several steps have to be performed [Basili 1986]:

1. The definition of the experiment
2. The planning
3. Carrying out the experiment
4. Analysis and interpretation of the outcomes
5. Presentation of the results

Now a more detailed look at the different steps mentioned above. The experiment definition is the basis for the whole experiment. It is crucial that this definition is performed with some caution. When the definition is not well founded and interpreted, the whole effort spent could have been in vain and, worse, the result of the experiment may not display what was intended. The definition sets up the objective of the experiment. This can be done by following a framework; the GQM templates could supply such a framework, for example [Solingen 1999]. After finishing the definition, the planning step has to be performed. While the previous step answers the question why the experiment is performed, this step answers the question how the experiment will be carried out. Six different stages are needed to complete the planning phase [Wohlin 2000].

Context selection: The environment in which the experiment will be carried out is selected.

Hypothesis formulation and variable selection: Hypothesis testing is the main aspect of statistical analysis when carrying out experiments. The goal is to reject the hypothesis with the help of the collected data gained through the experiment. In the case that the hypothesis is rejected, it is possible to draw conclusions from it. More details about hypothesis testing can be read in the following sections. The selection of variables is a difficult task. Two kinds of variables have to be identified: dependent and independent ones. This also includes the choice of scale type and range of the different variables. The section above also contains more information about dependent and independent variables.


Subject selection: It is performed through sampling methods. Different kinds of sampling can be found at the end of this chapter. This step is the foundation for the later generalisation; therefore the selection chosen here has to be representative for the whole population. The act of sampling the population can be performed in two ways, either probabilistic or non-probabilistic. The difference between those two methods is that in the latter the probability of choosing a sample of the selection is not known. Simple random sampling and systematic sampling, just to name two, are probability-sampling techniques. Those and other methods can be found at the end of this chapter. The size of the sample also has influence on the generalisation. A rule of thumb is that the larger the sample is, the lower the error in generalising the results will be. There are some general principles described in [Juristo 2003]:
   If there is large variability in the population, a large sample size is needed.
   The analysis of the data may influence the choice of the sample size. It is therefore necessary to consider already at the design stage of the experiment how the data shall be analysed.

Experiment design: The design tells how the tests are being organized and performed. An experiment is, so to speak, a series of tests. A close relationship between the design and the statistical analysis exists, and they have an effect on each other. The choices taken before (measurement scale, etc.) and a closer look at the null hypothesis help to find the appropriate statistical method to be able to reject the hypothesis. The following sections provide a deeper view into the subjects described shortly above.

Instrumentation: In this step the instruments needed for the experiment are being developed. Therefore

three different aspects have to be addressed: experiment objects (i.e. specification and code documents), guidelines (i.e. process description and checklists) and measurement. Using instrumentation does not affect the outcome of the experiment. It is only used to provide means for performing and to monitor experiments [Wohlin 2000].

Validity evaluation: After the experiments are carried out the question arises how valid the results are.

Therefore it is necessary to think of possibilities to check the validity.

The following components are an important vocabulary needed for the software engineering experimentation process:

Dependent & independent variables: Variables that are being manipulated or controlled are called independent variables. When variables are used to study the effects of the manipulation etc., they are called dependent variables.

Factors: independent variables that are used to study the effect when manipulating them. All the other independent variables remain unchanged.

Treatment: a specific value of a factor is called a treatment.

Object & subject: an example for an object is a review of a document. A subject is the person carrying out the review. Both can be independent variables.

Test (sometimes referred to as trial): an experiment is built up using several tests. Each single test is structured in treatment, objects and subjects. However, these tests should not be mixed up with statistical tests.

Experimental error: gives an indication of how much confidence can be put in the experiment. It is affected by how many tests have been carried out.

Validity: there are four kinds of validity: internal validity (validity within the environment and reliability of the results), external validity (how general the findings are), construct validity (how the treatment reflects the cause construct) and conclusion validity (relationship between treatment and outcome).

Randomisation: the analysis of the data has to be done from independent random variables. It can also be used to select subjects out of the population and to average out effects.

Blocking: is used to eliminate effects that are not desired.

Balancing: when each treatment has the same number of subjects it is called balanced.

Software engineering experimentation could be supported by the following sampling methods [Wohlin 2000]:

Simple random sampling: the subjects that are selected are randomly chosen out of a list of the population.


Systematic sampling: only the first subject is selected randomly out of the list of the population. After that, every n-th subject is chosen.

Stratified random sampling: first the population is divided into different strata, also referred to as groups,

with a known distribution between the different strata. Second the random sampling is applied to every stratum.

Convenience sampling: the nearest and most convenient subjects are selected.

Quota sampling: various elements of the population are desired; therefore convenience sampling is applied to get every single subject.

Controlled Experiments: The advantage of this approach is that it promotes comparison and statistical analysis. Controlled here means that the experiment follows the steps mentioned above ([Basili 1986], [Zelkowitz 1997]):

1. Experiment definition: it should provide answers to the following questions: “what is studied?” (object of study), ”what is the intention?” (purpose), “which effect is studied?” (quality focus), “whose view is represented?” (perspective) and “where is the study conducted?” (context).

2. Experiment planning: null hypothesis and alternative hypothesis is formulated. The details (personnel,

environment, measuring scale, etc.) are determined and the dependent and independent variables are chosen. First thoughts about the validity of the results.

3. Experiment realization: the experiment is carried out according to the baselines established in the

design and planning step. The data is collected and validated.

4. Experiment analysis: the data collection gathered during the realization is the basis for this step. First descriptive statistics are applied to gain an understanding of the submitted data. The data is informally interpreted. Now the decision has to be made how the data can be reduced. After the reduction the hypothesis test is performed. More about hypothesis testing can be found in the following sections.

5. Portrayal of the results and conclusion about the hypothesis: the analysis provides the information that is needed to decide whether the hypothesis was rejected or accepted. These conclusions are collected and documented, together with the lessons learned.

Experimental design types: The quality of the design decides whether the study is a success or a failure. So it is very important to meticulously design the experiment [Juristo 2003]. Several principles of how to design an experiment are known. Those are randomisation, blocking and balancing. In general a combination of the three methods is applied. The experimental design can be divided into several standard design types. The difference between them is that they have distinct factors and treatment. The first group relies on one factor, the second on two and the third group on more than two factors. The following paragraphs will show some detail about the different design types.

• One-factor design with two treatments:
  Field of use: comparison.
  Example: comparing two different analysis techniques using several projects.
  Assignment: techniques are assigned totally at random; the same objects are used for both treatments.
  Analysis methods: t-test, Mann-Whitney.
  Benefit: simple experiment.

Project Technique 1 Technique 2 1 ☺ 2 ☺ 3 ☺ 4 ☺ 5 ☺ 6 ☺

Table 3: Assigning analysis techniques to projects


• Paired comparison design (extends the design mentioned above)

Field of use: comparison of two different analysis techniques (two treatments).
Example: comparing two different analysis techniques.
Assignment: the subjects are applied to both treatments on the same object; the assignment is performed randomly.
Analysis methods: Paired t-test, Sign test, Wilcoxon.
Benefit: improves the precision of the experiment.

Subject   Treatment 1   Treatment 2
1         2             1
2         1             2
3         2             1
4         2             1
5         1             2
6         1             2

Table 4: Assigning the treatments to pairs

• One-factor design with more than two treatments:
  Field of use: comparison of all treatments.
  Example: comparing different programming languages regarding their quality while using them.
  Assignment: subjects are randomly assigned; one object to all treatments.
  Analysis methods: analysis of variance (ANOVA), Kruskal-Wallis.

Subject Treatment 1 Treatment 2 … Treatment n 1 ☺ 2 ☺ 3 ☺ 4 ☺

Table 5: Assigning the n-treatments to the subjects

• Randomised complete block design:
  Field of use: comparison of all treatments with high variability among the subjects; more than two treatments.
  Example: same as in the design mentioned above.
  Assignment: each subject uses all treatments; the order is assigned randomly; restriction of randomisation because of the blocks.
  Analysis methods: ANOVA, Kruskal-Wallis.
  Benefit: minimizing the effect of variability. One of the most used designs in experimentation. Subjects form a more homogenous unit.

Subject   Treatment 1   Treatment 2   Treatment 3
1         3             2             1
2         1             2             3
3         1             3             2
4         2             1             3

Table 6: Assigning the subjects to the different treatments using randomized complete block design

• Two factor design: This design is used when more complex experimentation arrangements are needed. There are now three hypotheses: one for the effect of the first factor, one for the second factor and one for the interaction between the two factors. The following paragraphs will depict different two factor designs.


• 2*2 factorial design:
  Example: investigating the understandability of design documents using two different designs, i.e. structured versus object-oriented design; two treatments per factor.
  Assignment: randomly assign subjects to combinations of the two treatments.
  Analysis methods: ANOVA.

                           Factor 2, Treatment 2_1   Factor 2, Treatment 2_2
Factor 1, Treatment 1_1    C, F                      B, E
Factor 1, Treatment 1_2    A, H                      D, G

Table 7: Possible portrayal of a 2*2 factorial design (Available Subjects: A, B, C, D, E, F, G, H)

• Two-stage nested design:
  Field of use: one factor is similar to another factor for different treatments (two or more treatments).
  Example: efficiency of unit testing using two different designs, i.e. functional programming versus object-oriented programming.
  Assignment: one of the two factors is nested within the other; the subjects are randomly assigned.
  Analysis method: ANOVA.

Factor and treatment combination                   Subjects
Factor 1 Treatment 1_1, Factor 2 Treatment 2_1     A, H
Factor 1 Treatment 1_1, Factor 2 Treatment 2_2     C, F
Factor 1 Treatment 1_2, Factor 2 Treatment 2_2_1   B, E
Factor 1 Treatment 1_2, Factor 2 Treatment 2_2_2   D, G

Table 8: Two-stage nested design with two treatments per factor (Available Subjects: A, B, C, D, E, F, G, H)

• More than two factors designs:

Some experimentation arrangements depend on more than two factors. These kinds of designs are also called factorial designs because the dependent variables also depend on interaction between the n factors. Known factorial designs with two treatments are: the 2^k factorial design, the 2^k fractional factorial design, the one-half fractional factorial design of the 2^k factorial design, and the one-quarter fractional factorial design of the 2^k factorial design.

3.3 Testing Methods

A listing of the statistical testing methods needed for the different design types, in alphabetical order, is given in the following. More details about them can be found in [Juristo 2003]:

• ANOVA: This test is an ANalysis Of VAriance between groups of artefacts.

• Binomial test: This test analyses the differences between dichotomous variables.

• Chi2: This type of test is used when frequencies are involved. This means that the data has the form of frequencies.


• F-test: The F-test compares the variance of two (independent) samples

• Kruskal-Wallis: In this case one-way analysis of variance by ranks is performed.

• Mann-Whitney: When the assumptions made in the t-test are uncertain, it is possible to use the Mann-Whitney test instead. Similar to the Wilcoxon test, this method is based on ranks.

• Paired t-test: This method compares two samples, gained through repeated measures.

• Sign test: It depends on the sign of the difference of the values of the examined pairs.

• t-test: This test compares two (independent) samples.

• Wilcoxon: For this method it is important that it is possible to determine the greater value of the examined pair and that the difference can be ranked because the ranks are the basis of the Wilcoxon test.

Parametric and non-parametric testing: We will start with the parametric tests. Their main characteristic is that the analysed models are assumed to follow a specific distribution; usually the assumption is made that some parameters are normally distributed. The parameters must be measurable at least on an interval scale; the test for normality can be done with the Chi2 test. The main characteristic of the non-parametric tests is that only very general assumptions are made, more general than for parametric tests. Where applicable, non-parametric tests can be used instead of parametric tests, but not vice versa. The decision which of the two approaches is best suited can be based on two factors: applicability (what are the assumptions made? The assumptions must be realistic!) and power (parametric tests have, in general, higher power than non-parametric tests). The relation between experimental design types, test methods, and parametric/non-parametric tests is shown in the following Table 9 [Juristo 2003].

Design type                                           Parametric       Non-parametric
One factor, one treatment                             -                Binomial test, Chi2
One factor, two treatments, completely randomised     t-test, F-test   Mann-Whitney, Chi2
One factor, two treatments, paired comparison         Paired t-test    Wilcoxon, Sign test
One factor, more than two treatments                  ANOVA            Kruskal-Wallis, Chi2
More than two factors                                 ANOVA            -

Table 9: Overview of parametric and non-parametric test methods

Hypothesis testing: One way to evaluate whether a presumption we have is correct is to use hypothesis testing as the evaluation source. The result, when everything has been carried out correctly, will help us to draw conclusions on whether the presumption that was used to formulate the tested hypothesis established some cause-and-effect relationship. Hypothesis testing takes place in several steps that are applied repeatedly if needed. The first phase, induction, is used to formulate the first hypothesis, also called the null hypothesis, as well as an alternative hypothesis in case of rejection of the null hypothesis. It is possible that the test rejects a true hypothesis or vice versa. Should such behaviour occur, it is referred to as a risk. Two different kinds of risk can be identified: a Type-I error (the hypothesis is true but rejected) and a Type-II error (the hypothesis is false but accepted). When talking about the risks it is also necessary to talk about the power of a statistical test. The power indicates the probability that the statistical test will reveal a true pattern if the null hypothesis is false. It is therefore desirable to choose a test with high power over one with lower power.
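As an illustration of how the choices in Table 9 play out in practice, the sketch below applies both a parametric and a non-parametric test to the same two samples of a one-factor, two-treatment, completely randomised design using scipy. The measurement values are hypothetical.

```python
from scipy import stats

# Hypothetical measurements for two treatments (e.g. two analysis techniques),
# completely randomised design.
treatment_1 = [12.1, 14.3, 11.8, 13.5, 12.9, 15.0, 13.2]
treatment_2 = [10.2, 11.9, 10.8, 12.3, 11.1, 10.5, 12.0]

# Parametric choice: assumes roughly normal, interval/ratio-scaled data.
t_stat, t_p = stats.ttest_ind(treatment_1, treatment_2)

# Non-parametric alternative: rank based, weaker assumptions, lower power.
u_stat, u_p = stats.mannwhitneyu(treatment_1, treatment_2, alternative="two-sided")

alpha = 0.05
print(f"t-test:       t={t_stat:.3f}, p={t_p:.4f}, reject H0: {t_p < alpha}")
print(f"Mann-Whitney: U={u_stat:.1f}, p={u_p:.4f}, reject H0: {u_p < alpha}")
```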


We have described the kinds of visualisation used for SPC above. Now we briefly give some further characteristics. A graphical visualisation provides an illustrative way of presenting information about different aspects. In the following passages several visualisation methods are described.

Scatter Plot:
  Input: Paired samples (xi, yi)
  Portrayal: Two-dimensional grid
  Used for: Assessing dependencies between variables; tendency of linear relation; identification of outliers; observation of correlation

Box Plot:
  Input: Percentiles
  Portrayal: Box plot constructed by different percentiles
  Used for: Visualisation of dispersion and skewness

Histogram:
  Input: Frequency (or relative frequency) of a value or interval of values
  Portrayal: Bars with different heights
  Used for: Overview of distribution density; indicator for normal distribution

Cumulative Histogram:
  Input: Variables with corresponding samples
  Portrayal: Bars containing the cumulative sum of frequencies up to the current class of values
  Used for: Probability distribution function of the samples from one variable

Pie Chart:
  Input: Data values, divided into a specific number of distinct classes
  Portrayal: Segments in a circle; angles proportional to the relative frequency
  Used for: Relative frequency of the data values

In the following we will describe an example of a controlled experiment investigating the performance of participants using the Personal Software Process (PSP) [Wohlin 2000].

First step: Definition

• Object of study: participants in the PSP course, their ability considering performance with respect to background and experience;

• Purpose: evaluate the individual performance with respect to the individual background;

• Perspective: point of view of researchers and teachers; They would like to know if there are differences

between the participants in the course having different backgrounds;

• Quality focus: Productivity in terms of KLOC (thousands of lines of code) per development time, and defect density in terms of faults per KLOC;


• Context: experiment is run within the PSP;

• Summary (of Definition): Analyse the outcome of the PSP for the purpose of evaluation with respect to

the background of the individuals from the point of view of the researchers and teachers in the context of the PSP course.

Second step: Planning

• Context selection: PSP course at university; It addresses a real problem and is performed off-line because it is not used for industrial software development. The programming language is C.

• Hypothesis selection:
  Null hypothesis 1, H0: No difference in productivity between students from the Computer Science and Engineering program (CSE) and the Electrical Engineering program (EE): Product(CSE) = Product(EE)
  Alternative hypothesis 1, H1: Product(CSE) ≠ Product(EE)
  Null hypothesis 2, H0: No difference between the students considering the faults/KLOC (based on prior knowledge of C), i.e., the number of faults is independent of C experience
  Alternative hypothesis 2, H1: the number of faults/KLOC changes with C experience

• Measures: C experience, faults/KLOC

• Data to be collected: student program (nominal scale), program size in lines of code (ratio scale), development time in minutes (ratio scale), productivity (ratio scale), experience in C (ordinal scale; a classification into four groups was used here), and faults/KLOC

• Variables selection:

o Independent variables: program and experience in C. o Dependent variables: productivity and faults / KLOC

• Selection of subjects: chosen based on convenience; They are samples from the two programs and not

chosen by a random sample.

• Experiment design:
  o Randomisation: subjects are not assigned at random. They all use the PSP and take part in all of the assignments.
  o Blocking: not applied
  o Balancing: not applicable

• Standard design types:

o first design: one factor (program), two treatments (CSE, EE). A parametric test is chosen, in this case the t-test because the dependent variables are ratio scaled.

o Second design: one factor (experience in C), more than two treatments. Here four treatments can be identified (4 different groups). The dependent variable is also measured in a ratio scale so that parametric testing can be applied. In this case the ANOVA test.

• Instrumentation: A survey carried out at the beginning of the course provides the needed data about

experience and background.

• Validity evaluation:
  o Internal validity: provided through the number of tests within the course.
  o External validity: highly probable that similar results are obtained when the course is run in a similar way. It is rather difficult to generalize the results to students not taking the course. However, it might be possible to generalise the outcome to other PSP courses, comparing the background for example.
  o Conclusion validity: not considered to be critical due to the fact that the faked or incorrect data is independent from the background.
  o Construct validity: two major threats can be identified. Are the measures appropriate? Example: Is LOC / development time a good measure for productivity? And because it was a


graded course, the students might bias their data. At the beginning of the course it was stated that the grade did not solely depend on the actual data but rather on timely and proper delivery and on the reports handed in.

Third step: Operation

• Preparation: The students primarily took a course; they were not aware of exactly what was being investigated.

• Execution:

Execution time: 14 weeks
  Number of assignments: 10
  Number of participants: 65
  At the end of the course, interviews were performed to evaluate the course and the PSP.

• Data validation: Of the 65 students, six were removed because their results were rather questionable or invalid. This was done based on the researchers' and teachers' personal impression of whether the data submitted for the given assignments were representative or not. The remaining 59 students (32 CSE, 27 EE) were used for the statistical analysis and interpretation.

Fourth step: Analysis and interpretation

Descriptive statistics: In Figure 15 the productivity of the two study programs is shown. It gives a hint that the productivity of the EE students is not as high as the productivity of the CSE students.

Figure 15: Frequency distribution for the productivity (in classes)

As a second method, box plots are made (Figure 16). There it is visible that the EE group has an outlier, which stays in the data and is considered an extreme value.

Figure 16: Box plot of productivity

The two figures already indicate that the productivity of the EE students is lower than that of the CSE students. The hypothesis testing might reveal a difference between the two study programs. Let us move on to the faults/KLOC. The table below shows the different parameters of the faults/KLOC. It can be seen that the distribution is skewed towards the first group (little or no experience). That is why a box plot for this group is made (see Figure 17).


Class   Number of students   Median faults/KLOC   Mean faults/KLOC   Standard deviation of faults/KLOC
1       32                   66.8                 82.9               64.2
2       19                   69.7                 68.0               22.9
3       6                    63.6                 67.6               20.6
4       2                    63                   63.0               17.3

Table 10: Faults/KLOC for the different experience groups

Figure 17: Box plot for faults/ KLOC for the first group

The descriptive statistics tell what can be expected from the hypothesis testing and where problems due to outliers might appear.

Data reduction: It was decided to remove the outliers, which changed the mean values and standard deviation as can be seen in Table 11.

Class   Number of students   Median faults/KLOC   Mean faults/KLOC   Standard deviation of faults/KLOC
1       31                   66                   72.7               29.0

Table 11: Faults/KLOC for group 1

Hypothesis testing: For the first null hypothesis the t-test was applied. The result can be seen in Table 12. The conclusion is that the hypothesis H0 is rejected. The difference between the students from the two programs is significant. The actual reasons for this have to be further evaluated.

Factor       Mean diff.   Degrees of freedom (DF)   t-value   p-value
CSE vs. EE   6.1617       57                        3.283     0.0018

Table 12: t-test result

For the second null hypothesis the ANOVA test was chosen. The result can be seen in Table 13.

Factor: C experience vs. faults/KLOC   Degrees of freedom (DF)   Sum of squares   Mean square   F-value   p-value
Between treatments                     3                         3483             1160.9        0.442     0.7236
Errors                                 55                        144304           2623.7        -         -

Table 13: ANOVA test results


The outcome was that there is no significant relationship between the different experience groups and the faults/KLOC. Groups 2, 3 and 4 were then merged to investigate the difference between the newly formed group and group 1. A t-test was applied to look for differences between those two groups. No significant results were obtained.
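To show how the tests of this analysis step could be reproduced, the sketch below applies a t-test to two productivity samples, a one-way ANOVA to four experience groups, and then merges groups 2-4 and re-tests against group 1, mirroring the procedure described above. All numbers are made-up placeholder data, not the original study measurements.

```python
from scipy import stats

# Made-up placeholder data (NOT the original study data).
productivity_cse = [25.0, 31.2, 28.4, 30.1, 27.5, 33.0, 29.8]   # LOC per development time
productivity_ee  = [21.3, 24.8, 22.1, 25.5, 23.0, 20.9, 24.2]

faults_group1 = [66.8, 82.9, 70.1, 55.4, 90.2]    # little or no C experience
faults_group2 = [69.7, 68.0, 60.3, 75.1]
faults_group3 = [63.6, 67.6, 70.2]
faults_group4 = [63.0, 58.5]

# Hypothesis 1: difference in productivity between the two study programs (t-test).
t_stat, t_p = stats.ttest_ind(productivity_cse, productivity_ee)
print(f"t-test CSE vs. EE: t={t_stat:.3f}, p={t_p:.4f}")

# Hypothesis 2: faults/KLOC vs. C experience (one-way ANOVA over four groups).
f_stat, f_p = stats.f_oneway(faults_group1, faults_group2, faults_group3, faults_group4)
print(f"ANOVA over experience groups: F={f_stat:.3f}, p={f_p:.4f}")

# Follow-up step from the study: merge groups 2-4 and compare against group 1.
merged = faults_group2 + faults_group3 + faults_group4
t2_stat, t2_p = stats.ttest_ind(faults_group1, merged)
print(f"t-test group 1 vs. groups 2-4: t={t2_stat:.3f}, p={t2_p:.4f}")
```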

Fifth step: Summary and Conclusion

Two hypotheses were investigated: study program versus productivity, and experience in C versus faults per KLOC. The first hypothesis test showed that the CSE students were more productive than the EE students. The second test showed that there is no significant influence of the experience in C on the number of faults. Hence,

When following the PSP it is better to use a well-known language so that the focus can solely be on the PSP.

It is also reasonable to claim that students with a computer science background have a higher productivity than students with a background in other disciplines. Further studies are still necessary.

3.4 Methods of Data Analysis

In the following we will give some examples of statistical analysis in three kinds of domains [Pandian 2003]:

• Metrics data analysis in frequency domain,
• Metrics data analysis in time domain,
• Metrics data analysis in the relationship domain.

The following methods and examples are cited from [Pandian 2004] in order to achieve a consistent form of statistical descriptions (see also [Juristo 2003] and [Wohlin 2000]).

Metrics data analysis in the frequency domain: All processes show variations that become evident if a frequency distribution is drawn on the process metric. Understanding process variation, Deming observes, will lead to profound knowledge of the process. A frequency distribution also contains an indication about the probability of occurrence of events. Analysis of metrics data in the frequency domain results in empirical distribution curves. The shape and structure of these distribution curves represent a process signature. Analyses of distributions are usually based on several well-known probability distributions. We have selected two distribution types that find practical use in software projects: the normal distribution and the Rayleigh distribution. All empirical distributions are referred to one of these two for interpretation.

Normal Distribution: The normal distribution is considered nature's template, the most common pattern of process variation. A large number of project outcomes can be directly fitted to the ideal normal curve. For example, effort variance in a family of software projects has been analyzed to find that it has a mean value of 10 percent and a standard deviation of 2 percent. The normal distribution is given by the probability density function

f(x) = (1 / (σ·√(2π))) · exp( −(x − µ)² / (2σ²) )

where µ is the mean and σ is the standard deviation.



The process variation illustrated here makes us view software projects from a statistical standpoint.

Bias - A Process Reality: Real-life process behaviour may exhibit a bias. Such distributions lack symmetry and are skewed to one side. They also have a characteristic "tail", representing occurrences that have transgressed or strayed into unusual regions. The bias is characteristic of human systems that use intention or will to choose among several tactical opportunities. The long tail, such as in the Rayleigh distribution, bears evidence to a fundamental but small propensity of nature to defy human design. This tail could be a symbol of machine failure in mechanical processes or estimation failure in project management. The tail of the schedule variance distribution presented in Figure 18 shows how "best-made estimates" have failed.

Figure 18: Schedule variance bias

As a structure, the skewed Rayleigh distribution has been put to great use in software estimation by Putnam. Software reliability models use this structure to represent defect leakage into the field in the continuum of time. The Rayleigh curve can be expressed as

m(t) = 2Kat · exp(−at²)

where m(t) is the manpower, K the total effort, a a constant (shape parameter), and t the time.

Central Tendency of Processes: Central tendency in a skewed distribution, a more authentic representation of real-life processes, is difficult to establish. Nevertheless, it is conventional to refer to three measures of central tendency:

1. Mean
2. Median
3. Mode

The mean is the arithmetic average of all the observations. The median is the value that divides a series of data, arranged in the order of magnitude of their values, so that an equal number of values lies on either side of the center; the median divides the distribution curve into two equal areas. The mode denotes the value that has the highest frequency of occurrence in the dataset. If the distribution of the data is normal and not skewed, then the mode, median, and mean are equal. It is customary to take the mean value to indicate the central value of a metric. It is convenient to think so, and many business models run on this simple assumption. But when the metrics dataset contains outliers and extreme values, the median could be a better choice because it presents a balanced picture. The mode is considered for setting process goals.



Process Spread: Process results wander away from the mean value. The degree of wandering, or spread, is denoted by the standard deviation, sigma (σ), of process output values. Frequency distributions are the most natural tools to study and analyze process spread. In Figure 19, three models for effort variance are plotted, all with different standard deviations but a common central value of 10 percent. Process variations such as these indicate trouble. The larger the variation, the larger is the uncertainty. It may be noticed that as the spread increases, the number of “results on target” decreases. When the process deviations get closer to process boundaries or tolerance limits, the process tends to become unreliable.

Bin | Sigma 2 | Sigma 4 | Sigma 7
1   | 0.00    | 0.02    | 0.07
2   | 0.00    | 0.04    | 0.09
3   | 0.00    | 0.06    | 0.10
4   | 0.01    | 0.10    | 0.12
5   | 0.03    | 0.14    | 0.13
6   | 0.08    | 0.18    | 0.15
7   | 0.19    | 0.23    | 0.16
8   | 0.36    | 0.26    | 0.16
9   | 0.53    | 0.29    | 0.17
10  | 0.60    | 0.30    | 0.17
11  | 0.53    | 0.29    | 0.17
12  | 0.36    | 0.26    | 0.16
13  | 0.19    | 0.23    | 0.16
14  | 0.08    | 0.18    | 0.15
15  | 0.03    | 0.14    | 0.13
16  | 0.01    | 0.10    | 0.12
17  | 0.00    | 0.06    | 0.10
18  | 0.00    | 0.04    | 0.09
19  | 0.00    | 0.02    | 0.07
20  | 0.00    | 0.01    | 0.06

Figure 19: Dispersion of effort variance: three models

Another example of process dispersion can be seen in how bug-fixing time (TTR, time to repair, in days) falls into three service levels, corresponding to simple, medium, and complex types of bugs. Fixing each type of bug is a process of its own, characterized by central tendencies and standard deviations. As illustrated in Figure 20, the distinction between these processes blurs in some areas, and the maintenance project manager needs to use this information while setting goals and limits for delivery schedules.

Figure 20: Three service models for bug fixing



Measures of Dispersion: Measures of dispersion describe how the observations in the dataset are spread out. Important measures of dispersion are

• Range
• Variance
• Standard deviation

Range is the difference between the highest and lowest values in a dataset. Variance measures the fluctuation of the observations around the mean. The larger the value of the variance, the greater the fluctuation. The standard deviation, like the variance, also measures the variability of the observations around the mean. Standard deviation is equal to the positive square root of variance. A standard deviation has the same units as the observations, and thus is easier to interpret.

Descriptive Statistics: Before we draw any inferences from data (using inferential statistics), we need to do a descriptive statistical study. Hence, metric data can first be studied for its descriptive statistics, which includes estimation of the following parameters:

• Mean
• Standard error (of the mean)
• Median
• Mode
• Standard deviation
• Variance
• Kurtosis
• Skewness
• Range
• Minimum
• Maximum
• Sum
• Count
• Largest (#)
• Smallest (#)
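Most of these parameters can be computed directly with a statistics library, analogous to the Excel Analysis ToolPak output mentioned below. A minimal sketch with NumPy/SciPy on an invented effort-variance sample:

```python
# A minimal sketch, assuming a small hypothetical sample of effort-variance values (%).
import numpy as np
from scipy import stats

data = np.array([8.2, 9.5, 10.1, 10.4, 9.8, 12.3, 7.9, 10.0, 11.2, 14.8])

n = data.size
std = data.std(ddof=1)                            # sample standard deviation
values, counts = np.unique(data, return_counts=True)

print("mean           :", round(data.mean(), 3))
print("standard error :", round(std / np.sqrt(n), 3))
print("median         :", round(float(np.median(data)), 3))
print("mode           :", values[counts.argmax()])   # most frequent value
print("std deviation  :", round(std, 3))
print("variance       :", round(data.var(ddof=1), 3))
print("kurtosis       :", round(stats.kurtosis(data), 3))
print("skewness       :", round(stats.skew(data), 3))
print("range          :", round(float(np.ptp(data)), 3))
print("min / max / sum:", data.min(), data.max(), round(data.sum(), 2))
print("count          :", n)
```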

Note: Skew means lack of symmetry. The skew can be positive (a long tail to the right) or negative (a long tail to the left). For a positively skewed distribution, the mean is greater than the median because a few values are large compared to the others. If a distribution is negatively skewed, the mean is less than the median. Kurtosis is a measure of the peakedness of the dataset. It is also viewed as a measure of the "heaviness" of the tails of a distribution. A tool for calculating descriptive statistics is available in Excel as a macro in the Analysis ToolPak.

Deriving Frequency Distribution from Data: There are three ways of visualizing a frequency distribution, ranging from mathematical to empirical. Each can be applied to a practical situation; each has its advantages.

Probability Density Function Curve: The first is to work from the mean and sigma to construct an ideal normal distribution curve, applying the equation of the probability density function. One can use the spreadsheet function NORMDIST and generate the graph by constructing an x,y table (and plotting an x,y chart) in accordance with the normal probability density function given earlier.

This bell-shaped curve is a classical way of getting a feel for the process. Next we can draw a histogram and study its shape. The bin intervals (or class intervals) are marked on the x-axis and the frequency on the y-axis.



One can use a "tally" system to count the number of data points falling into each bin, or use the histogram macro on the spreadsheet and get the tally as well as the chart. Histogram will present details that had been ironed out in the normal curve. Empirical Distribution Curve. Finally, we can transform the histogram into a "curve" by constructing a smooth line that passes through the tops of the histogram bars. Constructing such a curve, sometimes called the fre-quency polynomial, is not an attempt to find a mathematical expression for an empirical reality; it is an attempt to create a graphical pattern, as a model and a continuous representation process behavior. Frequency Scan: While arriving at empirical distribution curves, we stand to gain by doing alternative analysis by varying the bin sizes. One such analysis is “scanning”, where we deliberately run a histogram on a large number of bins, although the number of data points may not warrant a large number of bins. An example of schedule variance analysis with 32 bins is depicted in Figure 21. Figure 21: Frequency analysis with modified bins The frequency diagram scans the entire process range, like a spectral scanner, and finds occurrences in the right location in the metrics scale. Such an analysis highlights “bursts” of events, which stand far away in the frequency domain from the primary process modes. In the background, the best-fit normal curve built from the process mean and average is presented. It may be noted that the normal curve is very broad and shallow, indicating a widely varying process. The standard deviation is about 2.5 times larger than the mean, with the obvious consequences on the curve. A frequency scan could make several discoveries in process behaviour, including the following:

• Extreme deviations
• Process outliers
• Natural clusters
• Secondary modes
• Primary modes
• Zoom view of the significant modes
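To make the scanning idea concrete, the following minimal sketch bins the same invented schedule-variance sample once with a large number of bins (the scan) and once with only a few bins (the smoothed, low-pass view discussed next):

```python
# A minimal sketch of the frequency scan, assuming invented schedule-variance data (%).
import numpy as np

rng = np.random.default_rng(1)
sv = np.concatenate([rng.normal(5, 4, 120),    # primary process mode
                     rng.normal(35, 3, 12),    # a secondary cluster ("burst")
                     [80.0, -40.0]])           # extreme values / outliers

for bins in (32, 7):                           # fine-grained scan vs. smoothed view
    counts, edges = np.histogram(sv, bins=bins)
    print(f"--- {bins} bins ---")
    for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
        print(f"{lo:7.1f} .. {hi:6.1f} | " + "#" * int(c))
```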

The Filter Effect - Getting a Smooth Overall Picture: We can obtain a smoother function, with the details ironed out, to show a broad picture of schedule variance, as shown in Figure 22. The desire here is not to prescribe discrimination rules or locate troublesome groups, but to get a sense of variation.

Figure 22: Frequency diagram designed to give the overall picture



This choice is deliberately made because of the shift in decision-making approach from class discrimination to variation control. The same process data, which was scanned in the previous figure, is now processed with fewer bins, just 7 instead of the original 32. The result is a smoothed curve, which has muffled the fast variations, like a low-pass filter, and indicates an overall picture. One can vary the "filter characteristics" of a histogram to see different views of variation, and develop an insight from these many perspectives. It is like tuning in to different wavelengths, looking for signals.

Looking at Histograms: The histogram is known as the "voice of the process". On a chosen metric, histogram analysis can reveal process behaviour such as stability and bias. The first-cut analysis is to look at the shape of the histogram and see the "process signature". Standard types of histograms have been identified by Feigenbaum for manufacturing processes. The shapes and types could reveal the nature of the process from which the data points have been gathered. For example, a histogram truncated on both sides represents product behaviour after the "out-of-tolerance components" have been removed. A histogram with the central portion missing can be traced to a population where the best components have been selected and removed, perhaps marked as a higher-grade delivery. In software, too, we can identify histograms with telltale signatures. Three of these signatures are presented in Figure 25, along with their special meanings:

1. Comb structure
2. Right-biased structure
3. Left-biased structure

Many of the other figures furnished in this chapter contain real-life process signatures. Notable among them are the following:

• Bimodal distribution with equal peaks
• Bimodal distribution with a single dominant peak
• Multiple clusters
• Rayleigh type distribution with long "tail"
• Plateau structure (flat distribution)
• Spurs (in spectral scanning)

Projects can maintain histogram libraries and map them to the contributing process scenarios. This way, every organization can invent its own histogram types, as shown in Figure 23.

Figure 23: Defect histograms for three processes



Process Capability from Frequency Distribution: A process that is under statistical control is said to be capable if it is able to satisfy the customer specifications or, in the event customer specifications are not available, the goals of the process. Process capability refers to the inherent ability of a process to repeat results for a sustained period of time under a given set of conditions. The frequency signature of a capable process has a few notable characteristics: a single mode, little variation, and a process peak that tends to be close to the target. In the classical model of process capability computation, a normal distribution is assumed, and numerical indices are calculated to quantify process capability.

Process Capability Index Cp: This index indicates the performance of the process by relating the natural process spread to the specification (tolerance) spread:

Cp = (USL − LSL) / (6σ)

where USL and LSL are the upper and lower specification limits and σ is the process standard deviation. Modifications of this basic definition are in use to account for special situations such as a single specification limit and process drift. Such indices and their variants were originally designed for mechanical processes, based on well-established statistical models for process variation, defect occurrence, inspection, and sampling. For software projects, can we apply Cp? There are several constraints. The beginning of the problem lies in the very nature of the processes called project management and software engineering, each having process signatures different from those of mechanical processes. Next in line are the difficulties of prescribing control limits and specification limits, which cannot be calculated based on old assumptions but require a deep understanding of the statistical distributions of process parameters and defects.

Probability: The area under the probability density function represents the probability of occurrence. In Figure 24, the shaded area represents the probability that the upper specification limit of schedule variance may be transgressed.

Figure 24: Probability calculation

The exact value of this probability, P(SV > USL), is obtained by dividing the shaded area by the total area under the curve. The probability that the schedule target will be met corresponds to the unshaded area. The shaded area, lying outside the limit, constitutes what we can term "process defects". The white area is the acceptable region. The areas are actually integral values of the probability density function (pdf) within the specified limits, and can be calculated as

P(SV > USL) = ∫ f(x) dx over the interval from USL to ∞.
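Assuming a normal model of schedule variance, this tail area can be computed directly with a statistics library; the mean, sigma, and USL used below are invented for illustration:

```python
# A minimal sketch assuming a normal model; the mean, sigma, and USL are invented.
from scipy.stats import norm

mean_sv = 10.0   # mean schedule variance (%)
sigma_sv = 4.0   # standard deviation (%)
usl = 20.0       # upper specification limit (%)

# Tail area beyond the USL = probability of transgression ("process defects").
p_exceed = norm.sf(usl, loc=mean_sv, scale=sigma_sv)   # survival function = 1 - cdf
print(f"P(SV > USL)                    = {p_exceed:.4f}")
print(f"P(meeting the schedule target) = {1.0 - p_exceed:.4f}")
```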


Probabilistic Expressions of Capability and Risk: Probabilistic models can be used to determine process capability and risk. Capability is defined as the probability of meeting the target, and risk is the probability of missing the target. Capability and risk are like two sides of a coin: if a process is not "filled" with capability, the vacuum will be encroached by risk. A similar analysis can be done on almost all metrics, although core metrics such as schedule, productivity, and defects are the preferred choices.

Analyzing Process Maturity: Process maturity can be analyzed using frequency distributions. Mature processes show slim frequency diagrams, with sharp peaks - the fat and the process wanderings having been eliminated. Mature processes show, decisively, a central value. The danger of secondary process intervention would have been eliminated to secure stability. The voice of the process will stand clear above noise from spurious performances, outliers, and strange isolated events. Mature process peaks tend to drift toward customer satisfaction, resource conservation, and better performances. A productivity distribution, as the project matures in capability, tends to move toward higher values. The defect distribution peak, in a similar environment, will move to lower values. A process behaviour model is seldom static. It is highly dynamic, constantly shifting its location and changing its shape. The process boundaries keep pace and the process remains in a constant state of metamorphosis. The road to process maturity can be tracked using frequency diagram models of the process, and by arranging a process maturity storyboard or chronicler, which has now become an industry standard for visualizing "continuous process improvement". Figure 25 presents a process maturity storyboard of an organization that is moving up the maturity grid as time passes. Approximately, the signatures correspond with capability maturity model (CMM) levels. The metric - the chosen indicator - is effort variance. If the organization's goals can be marked on these frames, one can easily perceive and quantitatively estimate resource management capability as well as effort escalation risk, and relate the findings to the climbing maturity level. Apart from using process signatures to narrate a story in time, we can use them to compare business units within an organization or benchmark teams within a business unit. We could also create a signature board to cover all primary metrics to see if there is balance in capability or how uncertainty and risk propagate into the deeper recesses of processes.

Figure 25: Process maturity storyboard



Process Diagnosis: Process baselines based on mean and sigma sometimes hide real problems, as in the case study described here. The effort variance in this instance shows a bimodal distribution, each mode on either side of zero. The arithmetic mean is almost zero; going by the mean, one may think that the process is on target. Far from it: the process is severely unstable and toggles between two meta-stable states, as revealed in the frequency analysis. The project team recognized the problem, the first step in diagnosis, did a causal analysis, and spotted trouble in the estimation process, which was in its juvenile stage. Either effort was overestimated or it was underestimated. Where they had provided contingency cushions, it turned out that the expected risks did not attack. Where they had been optimistic, risks had surfaced eventually. More than estimation, the problem was in risk forecasting and linking it with estimation. The team was trying to grapple with the problem, and the struggle resulted in the twin modes.

Search for Natural Process Boundary: Higher-level metrics, such as effort variance, denote complex processes because they tend to capture the net result of several sub-processes. Calculating process control limits in such cases is a tricky job. The exact distribution type of each sub-process may not be known, much less the way the sub-processes combine. Traditional control limits use mean and sigma-based concoctions. But we know the fallacy of blindly choosing the mean as a representative figure. The questions emerge: What is the true process limit? What is going to be the decision threshold? Which is an outlier and which is the core? What control limits do we use in our control charts? We are looking for a natural process boundary that we can trust and use in decision making. The answer lies in a frequency distribution study of the metric. Typically, as illustrated in Figure 26, such an analysis would manifest a dominant mode, denoting a primary process, and a subdued mode, denoting a secondary process. The valley point is taken as the natural process boundary, which can be used as the upper control limit.

Figure 26: Natural process boundary

Class Recognition - Productivity: Productivity in software development is a very complex area. Analysis of productivity using frequency distributions could give tangible benefits. Apart from the baseline normal curve, the empirical distribution derived with the right choice of bin intervals could reveal "productivity clusters", as illustrated in the following case study. In Figure 27, four modes have emerged during an organization-wide analysis of productivity data. These modes point to the existence of four distinct classes of projects; the discriminating factors could be complexity of the job and skill grades of the staff. There could also be interplay between other productivity drivers and barriers. This diagnosis establishes four productivity levels and facilitates developing management strategies. It also provides a fair basis for performance measurement and comparison. The mistake of having and quoting one productivity figure for the entire organization can now be avoided. The gaps in productivity levels provide a framework for improvement of performance levels, tools utilization, and better and more objective human resource management.



Figure 27: Software productivity classes

Benchmarking: A benchmark study using frequency distributions, in addition to the conventional comparison charts, can bring out more valuable information. Sometimes it is just a comparison of signatures between successful projects and not-so-successful projects. Sometimes it can be a comparison of motivation level and commitment. During a benchmarking study using frequency distributions, one can compare the following features:

• Process central tendency (dominant peak)
• Number of modes
• Natural process boundary
• Process capability (percent)
• Risk (percent)
• Outliers (percent)
• Extreme values (percent)
• Mean (overall)
• Sigma (overall)

Measuring the True Value: Software measurements can have ambiguities as large as 50 percent. The measuring process, such as review or testing, has its own sources of uncertainty, noise, and variation. The measuring tool and the measured process both vary simultaneously, making software measurements even more difficult. In the presence of this ambiguity, histograms help in getting at the true value: the central tendency or the dominant mode. The histogram successfully points out the true value, even while presenting the details of variations. All modern measuring techniques and instruments use histogram analysis to detect the true value. A case in point is defect measurement, fraught with uncertainties of high proportions.

Measuring Defects without Ambiguity: When it comes to defects, the measured value depends on the product of two factors, as

Measured Value = Actual Defects * Detection Effectiveness.

Detection effectiveness values could vary from 40 to 80 percent, depending on the review methodology used and the review capability of the reviewers. Thus an uncertainty is associated with the review process. Measurement capability is inversely proportional to measurement uncertainty. The rule book of measurement says that the measuring instrument should have less uncertainty than the process variation the instrument is trying to measure. We have to measure defect variations of the order of 10 percent with measuring instruments such as review, with an inherent variation of up to 70 percent. The ambiguity in defect measurements can be overcome by using a simple signal-processing technique: the defect histogram.

Comparison when Distinctions Blur: We go to statistics when we cannot make a judgment without its help. An example is the case study where statistics was called upon to compare two review methods. The first (DD) is a one-person method; the other is a group method (PI/DC). Defect detection probabilities looked very similar in both cases, and the raw data was confusing. Once the frequency distributions of the findings were plotted (the bottom curves in Figure 28), the whole picture could be understood.

Figure 28: Review performance comparison

Six Sigma Model: Six Sigma concepts originally began with a process behaviour model in the frequency domain. The graphs shown in Figure 29 give a Six Sigma representation of process capability. Capability is measured by the gap - the safety distance measured in terms of sigma - between the process tendency and the performance limit. Graph A has a safety distance, or gap, of 3σ, and hence the process has 3σ capability. Graph B has a process peak that is 6σ away from the specification limit, and hence has 6σ capability. Defects in a Six Sigma process - those transgressions across the specification limits - account for a mere 3.4 parts per million (ppm) of the total events (even after allowing for some wandering of the process peak from the mean).

Figure 29: Six Sigma process model



Metrics data analysis in the time domain: Viewing in Time: Metrics data, organized in the time domain in a framework, present a window into the real world. Our purpose here is to see what the present holds out in the context of the past. We also wish to connect events, like a thread connects beads, and see meaningful patterns from which a future can be forecast. We will also see how control charts can be devised to provide support in decision making. Because software projects run a predetermined path known as the life cycle, with a finite start and a finite end, time domain analysis proves to be only natural. Time domain analysis enables project teams to become sensitive to reality, responsive to situations, and self-organizing through continuous learning.

Temporal Patterns in Metrics: Plotting data in chronological order brings out the hidden temporal patterns. A causal factor for attrition, the motivational level of employees, is measured here as a commitment index and gathered every quarter. We recognize first the simple linear trend, and later more intricate nonlinear trends. While the linear trend captures a broad, long-term behavioural pattern, the local characteristics are captured in increasing levels of detail by power, polynomial, and moving average trends. All of them are effective in suppressing noise, but forecasting scope and efficiency vary. Each analysis offers an adaptive perception, different from the rest. The overall problem, of course, is a steady decline in commitment, but the pattern of decline, the seasonality, and similarity with known trends provide knowledge.

Time Series Forecasting: Using time series analysis, events can be predicted based on historical trends. The bug arrival pattern shown here is an important input for maintenance projects to decide the following:

• Work scheduling
• Human resource balancing
• Strategies for service quality assurance

Forecasting requires that we identify structures in the data which might repeat. Software failure intensity data can be plotted and the trend can be used to predict failure, as indicated in Figure 30. In fixed assets and facilities management, asset downtime data can be plotted in time sequence, and the trend may be derived and used to forecast spare-parts requirements as well as the manpower and tools required to fix failure events. With the information made available by forecasting, one stands to plan better and even avoid those marginal losses that are bound to be incurred without the benefit of advance information.

Figure 30: Bug arrival trend
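As a hedged illustration of such trend-based forecasting, the following sketch fits a linear trend to invented weekly bug-arrival counts and extrapolates a few weeks ahead:

```python
# A minimal sketch, assuming invented weekly bug-arrival counts.
import numpy as np

weeks = np.arange(1, 13)
bugs = np.array([42, 40, 37, 38, 33, 31, 30, 27, 26, 24, 23, 20])

slope, intercept = np.polyfit(weeks, bugs, deg=1)   # least-squares linear trend
for w in range(13, 17):                             # forecast weeks 13-16
    print(f"week {w}: expected bug arrivals ~ {slope * w + intercept:.1f}")
```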



Signature Prediction: Beyond the bug arrival statistics, signatures of the bug population are captured periodically, as illustrated in Figure 31, and used in prediction. The signatures become yet another dimension in forecasting. Here signature refers to a bar graph showing the distribution of bugs among the known categories as percentages. The distribution pattern keeps changing. Risk tracking, risk exposure magnitude, and risk distribution may be carried out in a similar fashion. Defect magnitude and defect signature are known to have been tracked in a similar way by IBM in their ODC framework of defect management.

Figure 31: Signature profiles of bug population

Prediction Windows: Prediction may be done by seeing patterns across projects or locally within a project. For instance, a customer satisfaction index may be tracked in an organization, as shown in Figure 32, project after project, and the trend may be used in decision making. The prediction window here is quite large and may run into years. Each project runs within a time window inside which predictions are made. Time to complete a project and cost at completion are both predicted from the earned value graph (EVG), which cumulatively tracks value and cost as a time series.

Figure 32: Prediction windows

Within a project, there could be smaller process windows where very short time series curves operate. The reliability growth curve (RGC) tracks defects within the inspection window of the project. The failure intensity curve, being a reliability model, operates in a window that begins with in-process inspection but goes beyond delivery and penetrates into deeper time zones of alpha, beta, and acceptance tests and application runs. Every metric operates in a time window, which also becomes the prediction window. The window patterns are eventually called models.

Process Characterization - Process Central Tendency Chart: A process behaviour is characterized, in simple terms, by the mean value and the standard deviation. The first refers to the location of the process and the second represents the variation of the process. The weekly average (X-bar value) of time to repair (TTR) bugs in a maintenance project is itself a good indicator of the process. Such a plot is called the X-bar chart, shown in Figure 33(a). When the process variations are quite large, central tendency is more meaningful with median values. Therefore, monitoring of process median charts is recommended in these conditions. Figure 33(b) shows the plot of median values for the same set of data.



Figure 33: X-bar chart on TTR

Process Variation Charts: Process variation is represented by the standard deviation. Figure 34(a) illustrates the weekly values for standard deviation, in the form of an S chart. There are occasions when the process range is used as a measure of variation in place of the standard deviation, which is represented in Figure 34(b).

Figure 34: Range-standard deviation chart

Plotting Central Value and Variation Together: When accompanied by another chart showing how the range (maximum/minimum) varies every week, the pair is called the X-bar-R chart, which has been very popular on the work floor. A simpler way is to plot the mean, minimum, and maximum values in the same graph and construct the MMM chart. The weekly data set is known as a sub-group (the sub-groups could stand for a group of projects, a group of components, etc.). In our example, the MMM chart is plotted for sub-groups, each corresponding to one week. The chart could be modified to consider (µ + σ) and (µ − σ) instead of the maximum and minimum values to express variations. The MMM format allows forecasting and pattern recognition.

Control Charts: Park et al., Fenton and Pfleeger, Adrian Burr and Mal Owen, and Thomas Thelin are among the earliest to have applied the traditional forms of control charts to software engineering processes. Many software development houses have adapted control charts in one form or another. An established tool in manufacturing, the control chart is an emergent technology in software development. In a control chart, process results are plotted in time and compared with an expected value. Examples for the expected values are

• Control limits set from experience
• Control limits calculated from data
• Specification limits drawn from process requirements
• Process goals set by benchmarking
• Improvement goals
• Estimated value
• Planned value

In Figure 35, the estimated value of cumulative lines of code is plotted against month, and the actually delivered lines of code are compared with the estimated. The perceived gap between the estimated and actual makes the process owner see the problem and do something to bring the process result back to the estimated value. Control here means adhering to a budget or a plan. The essential control chart is a decision support tool, an early warning radar that alerts the user.



Figure 35: Tracking growth against point estimate


Range in Expected Values: The estimated value, instead of being a point, could have a range, taking a clue from real-life process variations. Hence, there exists an upper limit and a lower limit for the estimated value, for a given confidence level. If σ represents the standard deviation and if the limits are estimated at 3σ, for instance, the associated confidence level is 99.7 percent.

Figure 36: Tracking growth against interval estimate

As shown in Figure 36, the actual values are plotted against the background of the estimated mean value and the limits. Now one sees a problem only if the actual values cross the limits, because we have already given a tolerance band to deviations from the expected mean value. Those data points which lie outside the tolerance band are known as outliers. The first improvement one can think of is to prevent outliers, the next improvement being reduction of the allowed variation band.
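A minimal sketch of this outlier check is shown below; it assumes a fixed interval estimate of mean ± 3σ and invented monthly actuals, whereas the real chart in Figure 36 tracks a growing estimate:

```python
# A minimal sketch, assuming a fixed interval estimate (mean ± 3σ) and invented actuals.
import numpy as np

estimate = 100.0                 # estimated monthly value (hypothetical units)
sigma = 5.0
ucl, lcl = estimate + 3 * sigma, estimate - 3 * sigma

actual = np.array([98.0, 103.5, 96.0, 118.2, 101.1, 83.0])
for month, value in enumerate(actual, start=1):
    status = "within band" if lcl <= value <= ucl else "outlier"
    print(f"month {month}: actual = {value:6.1f} -> {status}")
```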



Life Cycle Phase Control Charts: The acceptable limits (point estimates) on defect levels are marked in the life cycle phase control chart. The actual data is superimposed on the expectation levels. Perhaps this type of control chart is most natural for life cycle projects. One can plot the following metrics values in this control chart format:

• Effort
• Schedule
• Rework
• Defect found
• Defect leaked
• Review effort

These life cycle phase control charts provide an opportunity to disseminate process goals and deploy them phase-wise. One can define ranges around each estimate to be more realistic about goal setting. The expected values and process goals change with time and improve when the organization makes progress in its processes. There is perhaps no expected value that can be stationary and permanent.

When Limits Blur: We must recall that uncertainties are associated with each measured value. Each data point is not a deterministic entity, but probabilistic in nature. If we plot the probability densities of measured values, as in Figure 37, each data point is not a single point but a distribution. Let us try to answer the following questions. Have distributions A, B, C, D, and E crossed the limits? Should we read red alert or early warning? The answer: these are blurred crossings, not abrupt jumps. Statistically, they represent process diffusion. We may relate control limits to the assumed confidence levels of judgment and appreciate the tentative nature of limits. We can move the control limits up or down and opt for yet another reference point as the UCL. We can fix the UCL and LCL at chosen points on the process distribution curve and accept the corresponding confidence level for decision making. Crossing the limit is a question of degree, which depends on assumptions and perceptions and not so much on the seemingly rigorous mathematical expressions that are used to compute the limits.

Figure 37: Blurred crossings

Selecting Control Limits for Unknown Distributions: When the type of distribution is not known we can apply Chebyshev's theorem, according to which, for any population or sample, at least (1 − 1/k²) of the observations in the dataset fall within k standard deviations of the mean, where k ≥ 1. This is illustrated in Figure 38 as a relationship between the standard deviation and the corresponding confidence level.

Figure 38: Selecting confidence limits for control chart



Chebyshev's theorem provides a lower bound on the proportion of measurements that fall within a certain number of standard deviations from the mean. This lower bound estimate can be very helpful when the distribution of a particular population is unknown or mathematically intractable. Because the software development process is totally a human process, one cannot expect a standard distribution pattern. Therefore, we should adopt an estimation method which does not depend on the data distribution pattern and at the same time reasonably represents the actual situation. Depending on the confidence level required, one could set the process capability baseline limits at 1.5σ, 2σ, or 3σ for 56, 75, and 89 percent confidence levels, respectively.

Control Limits for the X m R Chart: When sample data points are not available it is frequently impossible to construct an X-bar-R chart. In this case the only alternative available is to construct an X moving range (X m R) chart. Here successive data points are grouped to form a sub-group, and the control limits are derived from the average moving range using control chart constants. Let us consider an application of the X m R chart for the effort variance process. Because this data is less frequently available, at project closure we can characterize this process and arrive at its baseline value through the application of the X m R chart.

Process Capability Baseline Charts: Figure 39 shows the process capability baselines with popular control limits. If tighter control on a metric such as effort variance percent is wanted, one could choose 1.5σ limits; on the contrary, if the project manager does not want too many causal analyses to be made, or if the process is in its inception stage, one could choose 3σ control limits, wherein nearly 89 out of 100 times the process value will be within the 3σ control limit.

Figure 39: Control chart with confidence limits

Process Capability Baselines from Empirical Distribution: The process history, if available, can be used to set control limits as demonstrated in Figure 40, where the frequency distribution of historical data reveals the existence of natural process limits, the valley points dropping off the principal peak. UNPL refers to the upper natural process limit and LNPL refers to the lower natural process limit. This approach allows us to use empirical frequency distributions, which are perhaps more relevant and accurate than the elegant assumptions made in the traditional computations of limits.

Figure 40: History-based limits
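A minimal sketch of such an individuals/moving-range computation is shown below; it assumes the conventional control chart constant 2.66 and invented effort-variance values:

```python
# A minimal sketch of X-moving-range limits, assuming the conventional constant 2.66
# and invented effort-variance data (%).
import numpy as np

effort_variance = np.array([4.0, 7.5, 2.0, 9.0, 5.5, 12.0, 6.0, 3.5, 8.0, 5.0])

x_bar = effort_variance.mean()
moving_range = np.abs(np.diff(effort_variance))   # |x_i - x_(i-1)|
mr_bar = moving_range.mean()

ucl = x_bar + 2.66 * mr_bar
lcl = x_bar - 2.66 * mr_bar
print(f"centre line = {x_bar:.2f}, UCL = {ucl:.2f}, LCL = {lcl:.2f}")
```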

Let us consider an application of X m R chart for effort variance process. Because this data is less frequently available, at the project closure we can characterize this process and arrive at its baseline value through the appli-cation of X m R chart. Process Capability Baseline Charts: Figure 39 shows the process capability baselines with popular control limits. If tighter control on a metric such as effort variance percent is wanted, one could choose 1.5σ limits; on the contrary, if the project manager does not want too many causal analyses to be made or if the process is in the inception stage, one could choose 3σ control limits, wherein nearly 89 out of 100 times the process value will be within the 3σ control limit. Figure 39: Control chart with confidence limits Process Capability Baselines from Empirical Distribution: The process history, if available, can be used to set control limits such as demonstrated in Figure 40, where frequency distribution of historical data reveals the existence of natural process limits, the valley points dropping off the principal peak. UNPL refers to upper natural process limit and LNPL refers to lower natural process limit. This approach allows us to use empirical frequency distributions, which are perhaps more relevant and accurate than the elegant assumptions made in the traditional computations of limits. Figure 40: History-based limits



Metrics data analysis in the relationship domain: A Fertile Domain: Processes are interdependent, forming a network. The interplay between process parameters has been the subject of several studies in software engineering, leading to an understanding of the hidden process dynamics. The interactions that exist in the process network can be symbolically represented as a map of relationships between metrics. The symbolic world of relationships between metrics is a new domain, which mirrors the real world of processes and the influences they exert on one another. The analysis of an individual metric in the frequency and time domains enhances the indicative abilities of the metric and allows us to see patterns. In the new domain, we expand our view angle, look at the neighborhood around each metric, spot more metrics (which seem to be connected), and focus on capturing the interrelationships. The relationship domain brings in a pragmatic perspective. In the real world, processes do not work in isolation and, as a consequence, complete truth cannot be represented by isolated metrics. Analysis in the relationship domain complements analysis in the other domains.

When processes work as interconnected systems, the interrelationships may follow an order or rule. This may be just a local discipline governing a narrow range of process events. Or it may be a global order, with universal influence. The order may change from time to time when processes shift from one phase state to another. When we analyze metrics data in the relationship domain, we use metrics "snapshots" of the process to try to arrive at formulas that depict the order, rule, or discipline by which the process runs. The formulas could be local or global, following the characteristic of the process order. Some are ephemeral while others are everlasting. Some are reversible, some are irreversible. Some are reproducible while others are not. We search for all. The relationship domain is a fertile hunting ground.

Studying relationships among metrics with existing data is one approach. Making special observations under controlled conditions or conducting experiments is another. The choice between routine observation and experiment is decided by the proposed degree of rigor in the intended analysis and by cost. We proceed with the first choice, studying naturally available data without incurring the expenditure of experiments. We believe that in a project environment there is a lot to learn from available data, and a lot of improvement can be made from the study results of such data before the need arises to commission experiments. The relationship between metrics, and its expression as a formula or equation, can be presented graphically. In fact, we begin with graphical analysis and then arrive at empirical formulas.

Search for Relationships: A relationship between metrics is a mirror of the interplay between processes. Now we wish to analyze metrics in search of relationships. In principle we can suppose a relationship between any two metrics. For example, let us look at the relationships between six core metrics selected from a project:

1. Skill level
2. Productivity
3. Review effectiveness
4. Defect density
5. Effort variance
6. Size

A relationship map of these six core metrics is displayed in Figure 41. The connecting lines denote possible relationships. Any pair of metrics provides an opportunity to conceive a relationship. There are 15 pairs of metrics, and to match there are 15 relationship lines in the map. Not all the supposed relationships are meaningful. Some are merely mechanical constructs, just unreal mathematical possibilities. In others, we do have expectations to uncover relationships of practical significance.

Figure 41: Relationship map



Pairing metrics is a limited, simple step, useful within its limits. We can see a more complex set of relationships if we connect one "driven" metric to five "driver" metrics. This way we are applying a cause-and-effect relationship or predictor-response model. We take defect density as the effect and can imagine that it is driven by the remaining five metrics, establishing a one-to-five multivariate mapping. Considering the simultaneous influence of five predictor metrics on one response metric is a more complete and more rigorous approach.

Perceiving Relationships: Let us consider metrics in ordered pairs - two at a time - and take a look at the possible types of relationships that can exist between them. Relationships may be perceived by plotting scatter diagrams. One of the two chosen metrics will be treated as the dependent variable (y-axis), the other as the independent variable (x-axis). The scatter diagram may reveal relationships, which can be among the five types mentioned in Table 14.

Type 1 | Strong positive
Type 2 | Strong negative
Type 3 | Weak positive
Type 4 | Weak negative
Type 5 | No relationship

Table 14: Relationships revealed in a scatter diagram

Perceiving the type of influence between metrics allows us to see the interplay between process elements. In Figure 42 the five types of influences, or relationships, are illustrated.

Figure 42: Scatter plots of relationships



Strength of Relationship - Correlation Coefficient: We may begin the relationship study between two variables by estimating the correlation coefficient (r), which is a statistical measure of the degree of linear relationship between the two variables. It lies between +1 and -1, depending on whether the relationship is positive or negative. The strength of the relationship is expressed by the absolute value of the correlation coefficient.

Let us consider the metrics Skill Level and Productivity as x and y variables for a correlation study. Metrics data obtained from a project is given in Table 15. The correlation coefficient r is defined as

r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² )

where x̄ and ȳ are the sample means.

Table 15: Productivity skill level data

Computation of r using the equation above yields a value of 0.993 for the correlation coefficient. The computation is shown in Table 16. The correlation analysis shows that there is a good correlation between productivity and skill level. We need not go through all these time-consuming steps to do a correlation study; Excel and similar spreadsheets lend support with built-in statistical functions.
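The same computation can also be scripted. Below is a minimal sketch with NumPy/SciPy on an invented skill-level/productivity sample, not the Table 15 data:

```python
# A minimal sketch on invented skill-level/productivity data (not the Table 15 values).
import numpy as np
from scipy import stats

skill = np.array([1, 2, 2, 3, 3, 4, 4, 5, 5, 6], dtype=float)
productivity = np.array([9.0, 11.5, 12.0, 14.2, 15.1, 17.8, 18.3, 20.9, 21.5, 24.0])

# Direct formula: r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
dx = skill - skill.mean()
dy = productivity - productivity.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value via a built-in routine (analogous to a spreadsheet CORREL function).
r_lib, p_value = stats.pearsonr(skill, productivity)
print(f"r (formula) = {r_manual:.3f}, r (pearsonr) = {r_lib:.3f}, p = {p_value:.4f}")
```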



Table 16: Calculation of correlation coefficient

The calculation is based on the concrete equations applied to the considered productivity data shown in the table above.

Causal Relationship and Statistical Correlation: There is a difference between correlation and causal relationship. Correlation between metrics suggests that they are associated; a change in one follows approximate changes in the other. However, mere association does not assure causal relationship. Correlation could be superficial. The variables keep pace perhaps by coincidence. In a feeding experiment with pigeons, food was dropped in a random manner. However, some pigeons happened to see food drop when they raised their heads. A coincidence, indeed. These pigeons moved their heads up when they needed food and expected food to drop from the feeder. Other pigeons thought sideways movement caused food drop. The pigeons soon settled in a self-devised superstition on the basis of apparent correlation. Expectation (or estimation) based on the strength of mere correlation might be misleading. Likewise, if the linear correlation coefficient is zero, we cannot come to the conclusion that there is no relationship at all. Other forms of relations might still exist, invisible because they are "buried" in the data. Sometimes, linear correlation studies may not be able to grasp highly nonlinear or cyclic patterns. One should be careful while making correlation studies; correlation can degenerate into scientific superstition if invalidated. Relationship, on the other hand, goes beyond statistical correlation and coincidence. Usually a relationship is conceived before data analysis, based on some fundamental assumptions or well-known, time-proven concepts. Sometimes a new relationship is proposed based on theoretical reasoning, which awaits validation.


Linear Regression: We will now move from correlation coefficient, which measures the strength of relationships between two variables, to regression analysis, which determines the mathematical expression of the relationship. In the simplest form of regression, the dataset is fitted to the equation y = a + bx, where y is the dependent variable and x is the independent variable. The values of x are assumed to cause or determine the values of y. y = a + bx is known as the regression line to which the data points regress. This is also taken as a regression model, which estimates y from x.

Error Sum of Squares: The difference between the estimated value and the true value is called the error of estimation, or residual, in regression. For a proposed regression model, the error sum of squares is

SSE = Σ (y − ŷ)²

where y denotes the observed values and ŷ the values estimated by the regression model.

The Principle of Least Squares: The best-fit regression model, built according to the principle of least squares, is the regression line that achieves a minimum value for the error sum of squares. This is done through a process of iteration, where the error sum of squares converges to its lowest value.

Standard Error of Estimate: The standard error of estimate measures the variability or scatter of the observed values around the regression line. It is also a measure of the reliability of the regression line as an estimation equation. It is calculated as

SE = √( SSE / (n − 2) )

where n is the number of observations.

Total Sum of Squares (TSS): This is the total of the squared deviations between each sample observation and the sample mean:

TSS = Σ (y − ȳ)²

Coefficient of Determination R²: The coefficient of determination is defined as the measure of the proportion of variation in y that is accounted for by the regression on x:

R² = 1 − SSE / TSS

Linear Regression Example: We present an example of regression analysis on the relationship between Review Effectiveness (RE) and Defect Density (DD). The independent variable is Review Effectiveness and the dependent variable is Defect Density. We expect a relationship between DD and RE. We believe that an increase in RE will make DD come down. However, we do not know whether the relationship will be nonlinear, weak, or strong; we wish to find out from the regression analysis. A typical regression analysis using the Excel tool yields outputs that include the following results:

• Regression line
• Regression table
• Residual plot
• Regression statistics
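As an illustration of these outputs, the following sketch runs a simple linear regression of DD on RE with SciPy on invented data points (not the dataset behind Table 17):

```python
# A minimal sketch of a DD-on-RE regression with SciPy, on invented data points.
import numpy as np
from scipy import stats

re = np.array([20, 30, 40, 50, 60, 70, 80, 90], dtype=float)      # Review Effectiveness (%)
dd = np.array([28.0, 25.5, 22.9, 21.6, 19.8, 17.5, 16.1, 13.9])   # Defect Density (defects/KLOC)

fit = stats.linregress(re, dd)
dd_hat = fit.intercept + fit.slope * re     # predicted values (the regression line)
residuals = dd - dd_hat                     # residual column of the regression table

print(f"regression line: DD = {fit.slope:.4f} * RE + {fit.intercept:.4f}")
print(f"R^2 = {fit.rvalue ** 2:.4f}, standard error of the slope = {fit.stderr:.4f}")
print("residuals:", np.round(residuals, 2))
```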


The first output, the regression line, is shown in Figure 43. The equation of the regression line and the coefficient of determination are also printed in a textbox next to the regression line.

Figure 43: Regression line plot

The regression results are presented by the tool in tabular form as shown in Table 17. This table presents the predicted values (y estimated) and the observed values (y true). The difference between them is presented as residuals. The residuals provide important information for judging the adequacy of the regression analysis. One way they can be used is in a plot of the residuals versus the independent variable. If the residuals do not appear to be randomly scattered about the horizontal line, it may indicate a problem with the regression analysis. Perhaps a straight-line relationship is not appropriate, or the assumptions of normality or constant variance are not reasonable. A plot of the residuals is shown in Figure 44.

Table 17: Regression analysis results

Figure 44: Residual plot



Regression statistics include the estimation of the coefficient of determination (R²) and the standard error, as in Table 18.

Table 18: Regression statistics

Outliers in Relationship: A special graph showing the sloping lines (1 SE and 2 SE) that run parallel to the best-fit line, indicating outliers, is given in Figure 45. Those data points that lie beyond a threshold of 1 SE slopes are considered as results of process violations, and marked for study and examination. The graph in Figure 47 is known sometimes as a sloping control chart. Here the control chart raises a trigger when a process changes its inner dynamics. This trigger is regarded as more proactive than the conventional control charts.

Figure 45: Reliability of regression line

Departure from the expected relation is the decision criterion, and not the magnitude of defect density. For example, in Figure 45, the outlier has the least defect density, and for all practical reasons it represents a good job done by the developers. However, we wish to question why the relationship with review effectiveness has changed. This unexpected change in relationship could mean that:

• A new complexity has arrived in the development process.
• Factors other than Review Effectiveness have contributed to defect reduction.
• The intended relationship (DD = -0.1927 RE + 31.199) has failed to govern this outlier for reasons not known to us.

Nonlinear Regression Models: In nonlinear regression the dataset is fitted to nonlinear curves, again using the principle of least squares. Where linear relationships are absent, there could be nonlinear relationships that we must verify. Nonlinear regression analysis is an iterative approach. We try different modelling equations; if one equation does not describe the data, then we try a different equation. The dataset must be carefully examined before the iteration begins. If the data is not enough in "critical ranges", it is safer to wait until more data is collected in the region. If the data is too scattered, nonlinear fittings could give unstable results. If possible, collect more data to make sure that the wide scatter (suggesting a weak relationship) is not a mistake but a reality we have to deal with. Simple data transformations or normalization may be tried to see if the data scatter can be narrowed.

Nonlinear Regression Analysis of Productivity: Software development productivity in the simplest definition is size/effort. Productivity is a heavily loaded metric, and is very complex in the sense that many factors determine its value. Productivity tends to be fundamentally nonlinear in nature. Studies have been made in mapping productivity drivers to productivity estimates.

We will pick size from the potential drivers and study its relationship with productivity. Metrics data has been collected for size in function points (FP) and effort in person months (PM). Size is the predictor variable or independent variable x. Productivity itself is the "response variable" or dependent variable y. The data is presented in Table 19.

Table 19: Data used for nonlinear regression

Nonlinear Regression Analysis: We will use the following nonlinear equations for regression analysis of a typical productivity dataset given in Table 19. Excel has been used to generate the regression curves that correspond with the following six nonlinear equations:

1. Nonlinear regression: logarithmic equation
2. Nonlinear regression: polynomial, degree 2
3. Nonlinear regression: polynomial, degree 3
4. Nonlinear regression: polynomial, degree 4
5. Nonlinear regression: power equation
6. Nonlinear regression: exponential equation

Figure 46: Nonlinear regression



Goodness of Fit: The regression curves are shown in Figure 46. It may be seen that the coefficient of determination, R², which represents the quality of fit, is different for different regression equations. The lowest value is 0.3034 for the logarithmic curve and the best value is 0.5621 for the fourth-degree polynomial curve. R² gives an indication of the closeness of the data points to the regression equation in a statistical sense. This helps in making a first-order judgment on regression.

Monotonicity: However, choosing the regression curve must consider the other requirements of curve fitting. The regression curves must be monotonic and stable. A look at the six models in Figure 46 shows that one model - the fourth-order polynomial - shows a curve which reverses its trend in a few places. Physically, trend reversal means that a larger program costs less in those regions of reversal - an absurdity.

Stability of Nonlinear Regression Curves - A Comparison: The forecasting ability of nonlinear curves has to be assessed while choosing regression models. Let us formulate a forecasting problem and examine how the six nonlinear regression models fare. The forecasting problem we have taken is to predict the productivity value (y) for a given size of 15000 FP (x) (see Table 20). It may be noted that the current data range is 0 to 11000 FP. This means that the regression curve has to be extrapolated by 4000 FP to reach an estimate.

Table 20: Results of forecasting

The results of forecasting are illustrated in Figure 47. The fourth-order polynomial predicts a deeply negative value, while all other models predict productivity in the range between 23 and 43 FP/PM. Negative productivity is a physically meaningless number, and the magnitude of the negative value indicates a complete failure in forecasting. The forecasting performance of the fourth-order polynomial is shown in Figure 47, along with the power curve. It is seen from these results that the polynomial curve has collapsed to negative values of productivity. Hence, it is a poor and unreliable estimate. The power curve, however, behaves better and predicts a value that is realistic.

Figure 47: Forecasting nonlinear regression model
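As a hedged sketch of this comparison (not the Table 19/20 data), the snippet below fits a fourth-degree polynomial and a power curve to an invented size/productivity sample and extrapolates both to 15,000 FP:

```python
# A minimal sketch of the extrapolation comparison, on an invented size/productivity sample.
import numpy as np

size_kfp = np.array([0.5, 1.0, 2.0, 3.0, 4.5, 6.0, 8.0, 9.5, 11.0])              # size in kFP
productivity = np.array([12.0, 15.0, 19.0, 22.0, 26.0, 29.0, 33.0, 35.0, 38.0])  # FP/PM

# Fourth-degree polynomial fit, then extrapolation to 15,000 FP (= 15 kFP).
poly4 = np.polyfit(size_kfp, productivity, deg=4)
pred_poly4 = np.polyval(poly4, 15.0)

# Power-law fit y = a * x^b via a linear fit in log-log space.
b, log_a = np.polyfit(np.log(size_kfp), np.log(productivity), deg=1)
pred_power = np.exp(log_a) * 15.0 ** b

print(f"4th-degree polynomial at 15,000 FP: {pred_poly4:.1f} FP/PM")
print(f"power curve at 15,000 FP:           {pred_power:.1f} FP/PM")
```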

Figure 47: Forecasting nonlinear regression model

Multiple Linear Regression: So far we have been looking at relationships between one dependent variable (y) and one independent variable (x). But in many studies we need to consider the influence of several independent variables. In multiple linear regression, the mean of the dependent variable is a linear combination of the independent variables, as shown in the following equation:

y = b0 + b1*x1 + b2*x2 + ... + bk*xk


Linearity: If the linearity assumption is not met, sometimes we can transform one or more of the x variables, for example by taking the square root, and obtain a linear dependence.

Interaction: If interactions between the independent variables are to be included in the model, then additional cross products, xi*xj, have to be included in the model.

Surface Plot: We will consider a case study for multiple linear regression with two independent variables. The dependent variable is Defect Density (y), measured in Defects/KLOC. The independent variables are Skill Level (x1) and Review Effectiveness (x2). A surface plot of the linear model is shown in Figure 48. The planar Defect Density surface indicates how the quality of the software work product is influenced by the two variables. The surface gently slopes towards the high performance point indicated in the plot.

Figure 48: Surface plot

This surface, being a plane, does not offer optimum points but only indicates the general direction of process improvement.
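A minimal sketch of such a two-variable linear model is given below; the skill, review effectiveness, and defect density values, as well as the prediction point, are invented for illustration and are not the case study data.

import numpy as np

skill  = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])   # x1: skill level (assumed scale)
review = np.array([30., 35., 45., 50., 60., 70., 80., 85.])   # x2: review effectiveness in %
defect = np.array([9.0, 8.2, 7.1, 6.5, 5.2, 4.4, 3.1, 2.5])   # y : defects/KLOC

# Design matrix [1, x1, x2]; least-squares fit of y = b0 + b1*x1 + b2*x2
X = np.column_stack([np.ones_like(skill), skill, review])
coef, *_ = np.linalg.lstsq(X, defect, rcond=None)
b0, b1, b2 = coef
print(f"Defect density ~ {b0:.2f} + {b1:.3f}*skill + {b2:.3f}*review_effectiveness")

# The fitted plane plays the role of the surface in Figure 48: negative b1 and b2
# would indicate that higher skill and better reviews reduce defect density.
print("predicted defects/KLOC at skill=8, review=75%:", round(b0 + b1*8 + b2*75, 2))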


4 SPC and CMMI

4.1 Basics of Quantified Process Management

In general we can establish the following four categories of processes in software development ([Kulpa 2003], [SEI 2002]): the project management processes, the process management processes, the engineering processes, and the support processes. Based on process models like the CMMI we can evaluate the main activities shown in Figure 49.

Figure 49: Activities supported by process models

According to the GQM paradigm and the principles of the CAME framework for successful measurement application, we can formulate the basic CMMI intentions considering the SPC approach as follows (see Figure 50).

Figure 50: CMMI approach including the SPC


The actual goals are implied in achieving the different levels of the CMMI maturity evaluation. The appropriate questions for process maturity can be identified by considering the CMMI key processes. In the following we give the essential questions to satisfy these key processes, cited from [Singpurwalla 1999].

Maturity Level 2

Key Process Area 1 (K21)-Requirements Management

1. For each project involving software development, is there a designated software manager?
2. Does the project software manager report directly to the project (or project development) manager?
3. Does the Software Quality Assurance (SQA) function have a management reporting channel separate from the software development project management?
4. Is there a designated individual or team responsible for the control of software interfaces?
5. Is there a software configuration control function for each project that involves software development?

Key Process Area 2 (K22)-Software Quality Assurance

6. Does senior management have a mechanism for the regular review of the status of software development projects?

7. Is a mechanism used for regular technical interchanges with the customer?
8. Do software development first-line managers sign off on their schedules and cost estimates?
9. Is a mechanism used for controlling changes to the software requirements?
10. Is a mechanism used for controlling changes to the code? (Who can make changes and under what circumstances?)

Key Process Area 3 (K23)-Software Project Planning

11. Is there a required training program for all newly appointed development managers designed to familiarize them with software project management?
12. Is a formal procedure used to make estimates of software size?
13. Is a formal procedure used to produce software development schedules?
14. Are formal procedures applied to estimating software development cost?
15. Is a formal procedure used in the management review of each software development prior to making contractual commitments?

Maturity Level 3

Key Process Area 1 (K31)-Integrated Software Management

16. Is a mechanism used for identifying and resolving system engineering issues that affect software?
17. Is a mechanism used for independently calling integration and test issues to the attention of the project manager?
18. Are the action items resulting from testing tracked to closure?
19. Is a mechanism used for ensuring compliance with the software engineering standards?
20. Is a mechanism used for ensuring traceability between the software requirements and top-level design?

Key Process Area 2 (K32)-Organization Process Definition

21. Are statistics on software design errors gathered?
22. Are the action items resulting from design reviews tracked to closure?
23. Is a mechanism used for ensuring traceability between the software top-level and detailed designs?
24. Is a mechanism used for verifying that the samples examined by Software Quality Assurance are representative of the work performed?
25. Is there a mechanism for ensuring the adequacy of regression testing?


Key Process Area 3 (K33)-Peer Review

26. Are internal software design reviews conducted?
27. Is a mechanism used for controlling changes to the software design?
28. Is a mechanism used for ensuring traceability between software detailed design and the code?
29. Are software code reviews conducted?
30. Is a mechanism used for configuration management of the software tools used in the development process?

Maturity Level 4

Key Process Area 1 (K41)-Quantitative Process Management

31. Is a mechanism used for periodically assessing the Software engineering process and implementing indicated improvements?

32. Is there a formal management process for determining if the prototyping of Software functions is an appropriate part of the design process?

33. Are design and code review coverage measured and recorded?
34. Is test coverage measured and recorded for each phase of functional testing?
35. Are internal design review standards applied?

Key Process Area 2 (K42)-Software Quality Management

36. Has a managed and controlled process database been established for process metrics data across all projects?
37. Are the review data gathered during design reviews analyzed?
38. Are the error data from code reviews and tests analyzed to determine the likely distribution and characteristics of the errors remaining in the product?
39. Are analyses of errors conducted to determine their process-related causes?
40. Is review efficiency analyzed for each project?

Maturity Level 5

Key Process Area 1 (K51)-Defect Prevention

41. Is software system engineering represented on the system design team?
42. Is a formal procedure used to ensure periodic management review of the status of each software development project?
43. Is a mechanism used for initiating error prevention actions?
44. Is a mechanism used for identifying and replacing obsolete technologies?
45. Is software productivity analyzed for major process steps?

As appropriate metrics for answering the questions above, we give again the CMMI metrics defined by Kulpa and Johnson (only for the CMMI Level Four) [Kulpa 2003]:

Organizational Process Performance

QM01: Trends in the organization's process performance with respect to changes in work products and task attributes (e.g., size growth, effort, schedule, and quality)

Quantitative Project Management

QM02: Time between failures
QM03: Critical resource utilization
QM04: Number and severity of defects in the released product
QM05: Number and severity of customer complaints concerning the provided service
QM06: Number of defects removed by product verification activities (perhaps by type of verification, such as peer reviews and testing)
QM07: Defect escape rates
QM08: Number and density of defects by severity found during the first year following product delivery or start of service
QM09: Cycle time


QM10: Amount of rework time
QM11: Requirements volatility (i.e., number of requirements changes per phase)
QM12: Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and schedule)
QM13: Coverage and efficiency of peer reviews (i.e., number/amount of products reviewed compared to total number, and number of defects found per hour)
QM14: Test coverage and efficiency (i.e., number/amount of products tested compared to total number, and number of defects found per hour)
QM15: Effectiveness of training (i.e., percent of planned training completed and test scores)
QM16: Reliability (i.e., mean time-to-failure usually measured during integration and systems test)
QM17: Percentage of the total defects inserted or found in the different phases of the project life cycle
QM18: Percentage of the total effort expended in the different phases of the project life cycle
QM19: Profile of subprocesses under statistical management (i.e., number planned to be under statistical management, number currently being statistically managed, and number that are statistically stable)
QM20: Number of special causes of variation identified
QM21: The cost over time for the quantitative process management activities compared to the plan
QM22: The accomplishment of schedule milestones for quantitative process management activities compared to the approved plan (i.e., establishing the process measurements to be used on the project, determining how the process data will be collected, and collecting the process data)
QM23: The cost of poor quality (e.g., amount of rework, re-reviews, and re-testing)
QM24: The costs for achieving quality goals (e.g., amount of initial reviews, audits, and testing)

SPC depends on historical data. It also depends on accurate, consistent process data. If you are just beginning the process improvement journey, do not jump into SPC; you (your data) are not yet ready for it. That is why the CMMI waits until Maturity Level 4 in the staged representation to suggest the application of SPC techniques. At Level 2, processes are still evolving. At Level 3, they are more consistent. Level 4 takes process information from Level 3, and analyzes and structures both the data and their collection. Level 5 takes predictable and unpredictable processes, and improves them.

4.2 Controlling the Process Improvement

Finally, we will describe some statistical methods especially supporting the Statistical Process Control (see [Pandian 2004], [Putnam 2003], [Zelkowitz 1997] and [Zuse 2003]).

The Shewhart control chart, introduced in the 1920s, decomposes process variation into two components: random variation (within predictable bounds) and systematic variation (anomalies). Random variations, when the cause system is constant, approach some distribution function, and hence remain predictable or statistically stable. Systematic variations are due to assignable causes such as unusual events, freak incidents, process drifts, and environmental threats. Shewhart demonstrated how control charts can be used to identify and distinguish the two types of process variation, to achieve process efficiency and the ensuing economic benefits. Figure 51 shows how a training manager uses the Shewhart control chart to identify (and later solve) two problems: an extraordinary cost for Training ID 7 and an average cost (µ) greater than the budget. The approach of Armand V. Feigenbaum allows specifying control limits pragmatically, from past experience and informed guesswork.

Figure 51: Controlling the cost of training
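A hedged sketch of the kind of computation behind such a chart is given below. The cost figures and the budget are invented; sigma is estimated from the average moving range divided by the standard individuals-chart constant d2 = 1.128, and the two checks mirror the two problems named above (a point beyond the limits and a mean above budget).

import numpy as np

# Invented cost-per-training values and budget (not the Figure 51 data)
cost = np.array([410, 395, 430, 405, 420, 415, 980, 400, 425, 410], dtype=float)
budget = 400.0

centre = cost.mean()
mr_bar = np.mean(np.abs(np.diff(cost)))   # average moving range of consecutive points
sigma = mr_bar / 1.128                    # standard SPC estimate for individuals data
ucl, lcl = centre + 3 * sigma, centre - 3 * sigma

print(f"centre = {centre:.1f}, UCL = {ucl:.1f}, LCL = {lcl:.1f}")
for training_id, c in enumerate(cost, start=1):
    if c > ucl or c < lcl:
        print(f"Training ID {training_id}: cost {c:.0f} beyond the limits -> special cause")
if centre > budget:
    print(f"average cost {centre:.1f} exceeds the budget {budget:.1f} -> systemic problem")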


Tests for Control Charts: Tests for statistical control have been in use for a long time. The classical tests or decision rules to be applied while reading the control charts are presented in the following list, along with an illustration in Figure 52.

Test #1: Any point outside one of the control limits is an indication of a special cause and needs to be investigated.

Test #2: A run of seven points in succession, either all above the central line, all below the central line, all increasing, or all decreasing, is an indication of a special cause and needs to be investigated.

Test #3: Any unusual pattern or trend involving cyclic or drift behaviour of the data is an indication of a special cause and needs to be investigated.

Test #4: The proportion of points in the middle-third zone of the distance between the control limits should be about two thirds of all the points under observation.
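Two of these tests are easy to automate. The sketch below (invented data and precomputed limits) checks Test #1 and the run-of-seven part of Test #2.

import numpy as np

values = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 10.3, 10.45,
                   10.6, 10.7, 10.8, 10.9, 11.0, 13.9])
centre, ucl, lcl = 10.4, 12.0, 8.8   # assumed limits for illustration

# Test #1: any point outside the control limits
outside = np.where((values > ucl) | (values < lcl))[0]
print("Test #1 violations at indices:", outside.tolist())

# Test #2 (run rule): seven points in succession all above or all below the centre line
for start in range(len(values) - 6):
    window = values[start:start + 7]
    if np.all(window > centre) or np.all(window < centre):
        print(f"Test #2 violation: run of seven starting at index {start}")
        break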

Figure 52: Tests for Statistical Process Control charts

Control Chart in the Presence of Trend: If the metric shows a trend, such as delivered defect density (DDD) in Figure 53, the control charts may be partitioned to make a clearer presentation of the problem. The trend line helps in forecasting and risk estimation. The baseline helps in process analysis, estimation, and setting process guidelines.

Dual Process Control Charts: Sometimes the metric is a product of two major components, each showing its own independent characteristics. Defects found by design review, for instance, are a product of defects injected and review effectiveness, shown in the following equation.

Defects Found = Defects Injected * Review Effectiveness


Figure 53: Trend and baseline

The UCL in the control chart of defects/KLOC, as shown in Figure 55, is more relevant to the designers, who have to keep the defect level below the UCL. The LCL, on the other hand, appeals to the reviewers, who have to find defects, more than the UCL does. In the defect control chart in Figure 55, the relevant reference limits are marked for proper interpretation.

From Dual Limits to Single Limits: The control chart in Figure 54 is cluttered, and one has to strain to read, analyze, and interpret the chart. When the chart is used to give process feedback, some process owners may mix signals: one may demand a minimum production of defects, another may demand just the opposite.


Figure 54: In-process defect control chart

This problem may be solved, and an effective presentation may be made to the process owner, if only we could construct two separate control charts, each delivered to the process owner with the appropriate control limits, as indicated in Figure 55. After the split, the new control charts look simple and clear, with just one decision rule marked. The process owner, the designer, or the reviewer, gets a clear signal.

Figure 55: Splitting a double-sided limit into two single-sided limits

The process defects are marked as circles in both cases. With defects clearly marked and the goal (specification limit) clearly specified, each process owner can go into causal analysis of process violations and initiate corrective measures. The purpose of this control chart is to provide effective feedback and facilitate corrective action.

Control Chart Types: There are several control chart forms in use, including the ones we have used so far. Below is a brief list for quick reference; the exact formulas for the computations may be found elsewhere. When we have a large number of data points that can be organized as sub-groups according to some real-life order, and when the sub-group sizes are used in determining the control limits, the following charts may be useful:

• X-bar chart with UCL and LCL
• X-bar - R chart with UCL and LCL
• X-bar - S chart with UCL and LCL
• p chart (percentage defectives) with UCL and LCL
• u chart (defects per unit size) with UCL and LCL
• c chart (defect counts per module) with UCL and LCL
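As a minimal illustration of how such limits are computed, the following sketch derives c-chart limits (centre line c-bar, limits c-bar plus or minus 3*sqrt(c-bar), with the LCL floored at zero) for invented defect counts per module.

import numpy as np

defects_per_module = np.array([4, 7, 3, 5, 6, 14, 4, 5, 3, 6], dtype=float)  # invented counts

c_bar = defects_per_module.mean()
ucl = c_bar + 3 * np.sqrt(c_bar)
lcl = max(0.0, c_bar - 3 * np.sqrt(c_bar))

print(f"c-bar = {c_bar:.2f}, UCL = {ucl:.2f}, LCL = {lcl:.2f}")
for module, c in enumerate(defects_per_module, start=1):
    if c > ucl or c < lcl:
        print(f"module {module}: {c:.0f} defects -> special cause, investigate")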


If instead of sub-groups we have just an individual data point for every process delivery, we can artificially create a sub-group by selecting data points from a moving average window, and plot a graph with control limits calculated in the traditional way.

• Individuals chart (XmR) with UCL and LCL

When all we desire is to characterize the process and generate some performance baseline on a chosen metric, the following forms may be used. These forms can be used across life cycle phases or across sub-groups.
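Returning to the moving-window idea above, the sketch below (invented observations) forms overlapping sub-groups of three consecutive points and computes X-bar chart limits with the standard constant A2 = 1.023 for sub-groups of size three.

import numpy as np

x = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.6, 12.2, 11.7, 12.3, 12.5])  # invented data
n = 3
subgroups = np.array([x[i:i + n] for i in range(len(x) - n + 1)])   # moving window

xbar = subgroups.mean(axis=1)                                        # sub-group means
rbar = np.mean(subgroups.max(axis=1) - subgroups.min(axis=1))        # average range
centre = xbar.mean()
ucl = centre + 1.023 * rbar    # A2 = 1.023 for sub-groups of size 3
lcl = centre - 1.023 * rbar

print(f"centre = {centre:.2f}, UCL = {ucl:.2f}, LCL = {lcl:.2f}")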

If we wish to compare actual values with estimates, then the following may be used:

• Cumulative graphs with point estimates
• Cumulative graphs with interval estimates
• Run charts with estimates shown as USL, LSL
• Life cycle profiles with USL and LSL
• Run charts with baseline values (history) marked

Special Forms

Most performance models are constructed this way. A few of them are illustrated in this section.

Multi-Process Tracking Model: A simple way to take a holistic and balanced view of processes is to track all related process metrics on a radar chart, marking the target values and the achieved values. Cost drivers, performance drivers, and defect drivers in software development can be plotted on the radar chart for effective process control. Tracking of multiple goals, all competing for resources, is presented in the radar chart format in Figure 56. The following is a list of metrics used to represent and measure goals:

• Customer satisfaction index (CUST SAT)
• Productivity index (PROD)
• Employee satisfaction index (EMP SAT)
• Right first time index (RFT)
• Defect removal effectiveness (DRE)
• Training need fulfilment index (TNF)

All these are measured quantitatively on a 0 to 10 scale (ratio scale). Targets and achievement in each direction are plotted. This is a control chart because it compares reality with expectation and allows one to see deviations. It gives deeper meaning and allows one to visualize a balanced picture or model on goal achievement.

Figure 56: Goal control radar


Dynamic Model - Automated Control Charts: Control charts in modern times have taken a totally new form. They are embedded in metric databases and analysis modules, which perform dynamic functions. A defect-tracking tool uses a defect database as the platform and tracks bug closure. If the time taken exceeds a preset limit, the software generates a message to the tester. If the bug lives long after the message, the software escalates the issue and the message is now flashed to the project manager. The tester or the manager does not see a physical control chart but gets the results. The limit setting can be a choice of the manager, where his experience and judgment prevail, or the limit setting can be done by the software logic, which will use an appropriate decision rule and raise an alarm. The decision-making algorithm can be simple algebra or a sophisticated knowledge engine that learns and works with intelligence. The graph is printed, on demand, as a report from the tool along with other statistics. In a similar way, metrics data analysis tools can generate dynamic control charts on all metrics. These charts can be published in the monthly process capability baseline reports.

Control Chart for Effective Application: There are many forms of control charts, but they all must be structured well for effective application. Here are some suggestions. On any metric we can plot a control chart; choose the metric that communicates better. For instance, a training manager can choose the cost of absenteeism instead of the number of people who are absent, because the former makes senior management look at the control chart seriously. The data should be in chronological order. Most software development processes follow the learning curve, both first order and second order; before process stability is achieved, the learning curve is encountered. Chronological order gives control charts their vital meaning and power. A decision rule must be provided to enable problem recognition. The rule could be expressed in the following ways:

• Control limits
• Specification limits
• Baseline references
• Estimated values
• Process goals
• Process constraints
• Benchmark values
• Expected trend
• Zones
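Returning to the automated, dynamic control chart described above, the following is a purely hypothetical sketch of such escalation logic; the thresholds, bug records, and notification texts are invented and do not describe any real defect-tracking tool.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Bug:
    bug_id: str
    opened: datetime
    closed: Optional[datetime] = None

WARN_AFTER = timedelta(days=5)        # hypothetical preset limit: remind the tester
ESCALATE_AFTER = timedelta(days=10)   # hypothetical longer limit: flash to the project manager

def check_bug(bug: Bug, now: datetime) -> Optional[str]:
    """Return a notification text when an open bug's age exceeds a threshold."""
    if bug.closed is not None:
        return None
    age = now - bug.opened
    if age > ESCALATE_AFTER:
        return f"ESCALATE to project manager: {bug.bug_id} open for {age.days} days"
    if age > WARN_AFTER:
        return f"Remind tester: {bug.bug_id} open for {age.days} days"
    return None

now = datetime(2004, 6, 1)
for bug in (Bug("B-101", datetime(2004, 5, 29)), Bug("B-087", datetime(2004, 5, 18))):
    message = check_bug(bug, now)
    if message:
        print(message)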

The reader must be made familiar with the rules for interpretation. The chart must be designed with the most likely readers in mind, and every effort must be made to make the chart provide effective communication to a human system (biofeedback). Provide support data as annotations for significant data points. For example, a defect distribution pie chart can be provided as a companion to a defect control chart. Annotate identified hot spots or trends with causal analysis findings; we learn from such annotations. Wherever possible, suggested corrective action may be indicated.

Modernism in Process Control - Decision Support Charts: Metrics data, when presented in time series, offers a new form that helps to understand the process. A well-structured time series chart could emerge into a model once it captures a pattern that can be applied as a historic lesson. The time series analysis for trend or process control is also a time series model of the process, inasmuch as it can increase one's understanding of the process behaviour and support forecasting.

What-If Analysis: But the outstanding issue in software projects is whether a process goes according to a plan or estimate. The need for statistically derived, self-organizing goals, should it arise, is only secondary. The term control chart may then be replaced with the term decision support chart, and the concept of a control limit will be substituted with the concept of decision thresholds. What-if analysis can be done on a control chart by shifting the limits and seeing each time how many events are picked up and earmarked for investigation. The problem set will shift according to the location of the threshold line.

Clues, Not Convincing Proof: There are reasons why metrics control charts end up issuing suggestive clues but not convincing proof about process problems:

• Data errors
• Ambiguity in measurement scale
• Process having non-normal distributions
• Non-availability of defect propagation models


But all a project manager is looking for is a set of clues, not final proof. A decision support chart can coexist with ambiguity but the classical control chart cannot.

If It Is Written on the Wall, Do Not Draw Control Charts: If known problems are not solved, nobody wants to use a control chart to detect new problems. If trouble can be spotted without having to use a control chart, avoid control charts. Going one step further, if without the aid of control limits we can spot outliers using the naked eye, let us not draw control limits. The connection of control charts with action is now legendary. The best control chart is the one on which somebody acts.

Regression models have huge application potential in software engineering and management. They support the creation of a wide variety of knowledge products, from simple visual displays of relationships to estimation equations. They can reflect real situations in different degrees of detail, ranging from simple two-variable models to complex multiple-variable models. They can capture process nonlinearity and allow us to exploit this knowledge, either in optimization or in risk avoidance.

Regression Model Application - Causal Analysis: Regression models are naturally poised for causal analysis application. The x-y relationship is a cause-effect relationship (in the predictor-predicted sense). The regression analysis discussed here makes use of productivity data; requirement effort % has been chosen as the independent variable. The data and the nonlinear regression line fitted to the dataset are shown in Figure 57. The association rule for causal analysis demands a good R2, and we get a value of 64.34 percent. The extraneous data and outliers can be put aside and we can focus on the regression line to do causal analysis. Logic tells us that software productivity should improve with better requirement capturing (and a direction for causal analysis is set this way). The regression model (nonlinear, logarithmic) shows an asymptotic rise in productivity, and we can see a shoulder on the curve after which it becomes flat. Requirement effort affects productivity up to a point; then either other factors take over or further investment in requirements does not yield return.

Figure 57: Influence of requirement analysis effort on productivity

Regression Model Application - Optimum Team Size: That there exists an optimum team size has been much discussed and widely quoted. But what are the facts? A regression model of team size on productivity reveals the real picture. Team size productivity data is shown in Figure 58, and the graph shows the nonlinear regression curve, a power equation, which fits to an R2 of 42.28 percent. According to the regression model, when the team grows away from the organic small size, its productivity decreases exponentially. The nonlinear model does not permit optimization of team size; it imposes a constraint equation on software projects. Choice is made not based on the intrinsic demonstration of the best among the lot prediction but based on other factors. For example, a strategic limit on minimum productivity would dictate the team size limit. In those cases where a larger team size is chosen based on other considerations, from the model we know what the corresponding loss in productivity would be, and can take appropriate countermeasures. This model would also help in breaking work packages into smaller units and operating the project with the proverbial small teams.

Figure 58: Team size constraints on productivity

Regression Model Application - Building an Effort Estimation Model: Predicting effort from size has been a favourite game for several researchers. They go by the name of cost models and estimation models.

Figure 59: Refined regression model (after removal of outliers)

Our objective here is to apply regression modelling to design an effort estimation model from data commonly available in projects, namely, effort and size. Some practical data is provided in Table 21.

Table 21: Effort data


Expectation: The metrics used here are effort in hours and size in function points. Size is taken as the independent variable. The expected relationship, based on several experiences, is a power equation of the form

Effort = a * (Size)^b

We also expect complications in regression model building. Size measurements can have errors, which will interfere with regression.

Analysis: Regression analysis of the dataset is shown in Figure 60. A linear regression line appears with a goodness of fit of 39.75 percent, a poor value for an estimation model. There is a large scatter of data. The model requires improvement.

Figure 60: Effort Estimation from size: the first regression

Table 22: Clustered data


Presentation of such scatter plots sometimes invites criticism. Lack of clear trend makes people give up and lose interest in analysis. They conclude that "if you have enough data you can prove any theory." The problem is quite basic. The step that had been missed in data collection is "categorization," a discipline lower in the rank of measurement scales but which could bring in clarity.

Clustering: By examining the scatter plot in Figure 60 we may notice that there is a possibility for clustering: regrouping data according to some logical rule and trying separate regressions for each cluster. The exploratory data analysis indicates a natural divide in the data, worth finding. Now we know that there must be a logic for regrouping which is based on some physical reasoning, such as type of project, nature of technology, or even year of completion. Histograms can be used to test for the existence of strong clusters. The data was grouped into two clusters. The regrouped data is shown in Table 22.

New Regression Models: The new regression lines, obtained after clustering, are shown in Figure 61. The goodness-of-fit figures are 83.44 percent for one and 67.63 percent for the other. Regression quality is far better than what we had in the first run. This is an example that emphasizes the need for iterative runs in model building. We can continue the iteration with further clustering, transformation, partitioning, or other means of model refinement. We can also search for better equations. Of course, we can go to multiple linear regressions and achieve better and better models. It is a process by itself. The quest is brought to an end when we have a reasonable model which has a reasonable confidence level and which agrees with common sense.
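The clustering-plus-regression step can be sketched as follows (invented effort and size values and an assumed cluster label, for example project type); each cluster gets its own power-law effort model fitted by log-log regression, and the per-cluster R2 values are reported.

import numpy as np

# Invented data, not the preprint's Table 21/22 values
size    = np.array([120, 300, 550, 800, 1100, 150, 400, 700, 950, 1300], dtype=float)   # FP
effort  = np.array([900, 1900, 3200, 4400, 5800, 2100, 5200, 8600, 11300, 15200], dtype=float)  # hours
cluster = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # assumed grouping, e.g. two project types

for label in np.unique(cluster):
    x, y = size[cluster == label], effort[cluster == label]
    b, ln_a = np.polyfit(np.log(x), np.log(y), 1)     # ln(effort) = ln(a) + b*ln(size)
    y_hat = np.exp(ln_a) * x ** b
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print(f"cluster {label}: effort = {np.exp(ln_a):.2f} * size^{b:.2f}, "
          f"R^2 = {1 - ss_res / ss_tot:.3f}")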

Figure 61: Regression after clustering

Important Lesson: This application proves one principle: estimation models predict better within their own families. Each estimation model represents a narrow world, inside which it operates best. There is no universal estimation model. Hence, even if we have just a few data points, it is better to build our own estimation model, one for each family.

Statistical Process Control provides a way of handling the increasing complexity of software engineering. In this preprint the statistical basics were introduced and an example was provided to show how this approach is practically applied. To be able to use it in a profitable way it is necessary to gain experience with this approach. With the necessary experience it is a very powerful tool for controlling the software processes being developed at the moment but also for the planning of future projects. This means that the overall effort decreases while the quality increases.


5 References

[Abreu 1995] Abreu, F. B.; Gonlao, M.; Esteves, R.: Towards the Design Quality Evaluation in Object-Oriented Software Systems. Proc. of the 5ICSQ, October 24-26, Austin, Texas, 1995, pp. 44-57
[Basili 1986] Basili, V. R.; Selby, R. W.; Hutchens, D. H.: Experimentation in Software Engineering. IEEE Transactions on Software Engineering, 12(1986)7, pp. 733-743
[Card 2000] Card, D. N.: Making Measurement Understandable. IEEE Software, January/February 2000, pp. 95-96
[Cole 1993] Cole, R. J.; Woods, D.: Measurement Through the Software Lifecycle: A Comparative Case Study. Proc. of the 10th Annual Conference on Application of Software Metrics and Quality Assurance in Industry, Amsterdam, Netherlands, September 1993, Section 19
[Dumke 2001] Dumke, R.; Abran, A.: Current Trends in Software Measurement. Proc. of the IWSM2001, Montreal, August 2001, Shaker Publ., 2001
[Dumke 2002] Dumke, R.; Abran, A.; Bundschuh, M.; Symons, C.: Software Measurement and Estimation. Proc. of the IWSM2002, Magdeburg, October 2002, Shaker Publ., 2002
[Dumke 1999] Dumke, R.; Foltin, E.: An Object-Oriented Software Measurement and Evaluation Framework. Proc. of the FESMA, October 4-8, 1999, Amsterdam, pp. 59-68
[Dumke 1996] Dumke, R.; Foltin, E.; Koeppe, R.; Winkler, A.: Softwarequalität durch Meßtools – Assessment, Messung und instrumentierte ISO 9000. Vieweg Publ., Braunschweig, Germany, 1996
[Dumke 2003] Dumke, R.; Lother, M.: Softwarequalitätsmanagement (SQM). Vorlesungsskript, Otto-von-Guericke-Universität Magdeburg, http://ivs.cs.uni-magdeburg.de/sw-eng/agruppe/lehre/swt2.shtml
[Ebert 1993] Ebert, C.: Complexity Traces – an Instrument for Software Project Management. Proc. of the 10th Annual Conference on Application of Software Metrics and Quality Assurance in Industry, Amsterdam, Netherlands, September 1993, Section 17
[Eickelmann 2000] Eickelmann, N.: Integrating the Balanced Scorecard and Software Measurement Frameworks. Proc. of the IRMA 2000, Anchorage, Alaska, May 2000, pp. 980-983
[Endres 2003] Endres, A.; Rombach, D.: A Handbook of Software and System Engineering. Pearson Education Limited, 2003
[Fehrling 2003] Fehrling, N.: Softwaremetriken im Umfeld der Automobilindustrie. In: Büren et al.: Software-Messung in der Praxis. Tagungsband der MetriKon 2003, November 2003, Ulm, Shaker-Verlag, 2003, pp. 163-164
[Feiler 1993] Feiler, P. H.; Humphrey, W. S.: Software Process Development and Enactment: Concepts and Definitions. Proc. of the 2nd Int. Conference on Software Process, Los Alamitos, 1993, pp. 28-40
[Fenton 1997] Fenton, N. E.; Pfleeger, S. L.: Software Metrics – A Rigorous and Practical Approach. Thomson Publ., 1997
[Ferguson 1998] Ferguson, J.; Sheard, S.: Leveraging Your CMM Efforts for IEEE/EIA 12207. IEEE Software, September/October 1998, pp. 23-28
[Henderson 1996] Henderson-Sellers, B.: The Mathematical Validity of Software Metrics. Software Engineering Notes, 21(1996)5, pp. 89-94
[Jacquet 1997] Jacquet, J.; Abran, A.: From Software Metrics to Software Measurement Methods: A Process Model. Proc. of the ISESS, 1997
[Juristo 2003] Juristo, N.; Moreno, A. M.: Basics of Software Engineering Experimentation. Kluwer Academic Publishers, Boston, 2003
[Kitchenham 1995] Kitchenham, B.; Pfleeger, S. L.; Fenton, N.: Towards a Framework for Software Measurement Validation. IEEE Transactions on Software Engineering, 21(1995)12, pp. 929-944
[Kitchenham 1997] Kitchenham, B. et al.: Evaluation and assessment in software engineering. Information and Software Technology, 39(1997), pp. 731-734
[Kulpa 2003] Kulpa, M. K.; Johnson, K. A.: Interpreting the CMMI – A Process Improvement Approach. CRC Press Company, 2003
[Munson 2003] Munson, J. C.: Software Engineering Measurement. CRC Press Company, Boca Raton, London, New York, 2003
[Pandian 2004] Pandian, C. R.: Software Metrics – A Guide to Planning, Analysis, and Application. CRC Press Company, 2004
[Putnam 2003] Putnam, L. H.; Myers, W.: Five Core Metrics – The Intelligence Behind Successful Software Management. Dorset House Publishing, New York, 2003
[SEI 2002] SEI: Capability Maturity Model Integration (CMMI), Version 1.1. Software Engineering Institute, Pittsburgh, March 2002, CMMI-SE/SW/IPPD/SS, V1.1
[Singpurwalla 1999] Singpurwalla, N. D.; Wilson, S. P.: Statistical Methods in Software Engineering. Springer Publ., 1999
[Solingen 1999] Solingen, R. v.; Berghout, E.: The Goal/Question/Metric Method. McGraw Hill Publ., 1999
[Wohlin 2000] Wohlin, C.; Runeson, P.; Höst, M.; Ohlsson, M.; Regnell, B.; Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Boston, 2000
[Zelkowitz 1997] Zelkowitz, M. V.; Wallace, D. R.: Experimental Models for Validating Technology. IEEE Computer, May 1998, pp. 23-31
[Zuse 1998] Zuse, H.: A Framework of Software Measurement. De Gruyter Publ., Berlin, New York, 1998
[Zuse 2003] Zuse, H.: What can Practitioners learn from Measurement Theory. In: Dumke et al.: Investigations in Software Measurement, Proc. of the IWSM 2003, Montreal, September 2003, pp. 175-176
