Data for Statistics - 1 Version 2.1 © 2010 Kasse Initiatives, LLC DATA
Jan 21, 2015
2. Data
Data Input for Analysis and Interpretation
Data are generally collected as a basis for action
You must always use some method of analysis to extract and
interpret the information that lies in the data
The type of data that has been collected will determine the type of
statistics or analysis that can be performed
Making sense of the data is a process in itself
Always provide a context for data
Data have no meaning apart from their context
Data should always be presented in such a way that preserves the
evidence in the data for all the predictions that might be made
from these data
3. Data - 2
Data should be completely and fully described
Who collected the data?
How were the data collected?
When were the data collected?
Where were the data collected?
What do these values represent?
If the data are computed values, how were the values computed from
the raw inputs?
4. Data - 3
Variation exists in all data and consists of both noise (random or
common cause variation) and signal (nonrandom or special cause
variation)
Without formal and standardized approaches for analyzing data, you
may have difficulty interpreting and using your measurement
results
When you interpret and act on measurement results, you are
presuming that the measurements represent reality
5. Data - 4
To use data safely, you must have simple and effective methods not
only for detecting signals that are surrounded by noise,
but also for recognizing and dealing with normal process variations
when there are no signals present
Drawing conclusions and predictions from data depends not only on
using appropriate analytical methods and tools,
but also on understanding the underlying nature of the data and the
appropriateness of assumptions about the conditions and
environments in which the data were obtained
6. Data Definitions
Categorical vs. Quantitative Variables - Variables can be
classified as categorical (aka, qualitative) or quantitative (aka,
numerical)
Categorical - Categorical variables take on values that are names
or labels. The color of a ball (e.g., red, green, blue) or the
breed of a dog (e.g., collie, shepherd, terrier) would be examples
of categorical variables.
Quantitative - Quantitative variables are numerical. They represent
a measurable quantity.
For example, when we speak of the population of a city, we are
talking about the number of people in the city - a measurable
attribute of the city. Therefore, population would be a
quantitative variable
7. Data Definitions - 2
Discrete vs. Continuous Variables - Quantitative variables can be
further classified as discrete or continuous.
If a variable can take on any value between two specified values,
it is called a continuous variable; otherwise, it is called a
discrete variable.
Two examples clarify the difference between discrete and continuous
variables.
Suppose the fire department mandates that all fire fighters must
weigh between 150 and 250 pounds. The weight of a fire fighter
would be an example of a continuous variable, since a fire
fighter's weight could take on any value between 150 and 250
pounds.
Suppose we flip a coin and count the number of heads. The number of
heads could be any integer value between 0 and plus infinity.
However, it could not be any number between 0 and plus infinity. We
could not, for example, get 2.5 heads. Therefore, the number of
heads must be a discrete variable.
8. Attributes Data vs. Variables Data
9. Variables Data
Variables data is measured and plotted on a continuous scale
With variables data, an actual numeric estimate is derived for one
or more characteristics of the population being sampled such
as:
Time
Temperature
Length
Weight
Height
Volume
Voltage
Horsepower
Torque
Speed
Cost
10. Variables Data - 2
In software, examples of variables data include:
Effort expended - (Number of hours, days, weeks, years, etc., that
have been expended by a workforce member on an identified
topic)
Years of experience - (Total number of years of experience per
category)
Memory utilization - (% of total memory available)
CPU utilization - (% of CPU used at any given moment in time)
Cost of rework - (Dollars and cents calculation of the rework based
on the effort put forth by anyone involved in the finding and
fixing of reported problems)
11. Counts Could Be Treated as Variables Data
There are many situations where counts get used as measures of
size:
Total number of requirements
Total lines of code
Total bubbles in a data-flow diagram
Customer sites
Change requests received
Total people assigned to a project
When we count these things, we are counting all the entities in a
population, not just the occurrence of entities with specific
attributes
These should always be treated as variables data even though they
are instances of discrete counts
12. Attributes Data
When working with attributes data, the focus is on learning about
one or more specific non-numerical characteristics of the
population being sampled
When attributes data are used for direct comparisons, they must be
based on consistent areas of opportunity if the comparisons are to
be meaningful
If the number of defects that are likely to be observed depends on
the size (lines of code) of a module or component, all sizes must be
nearly equal
If the probabilities associated with defect discovery depend on the
time spent on inspecting or testing, the elapsed time spent must be
nearly equal
13. Attributes Data - 2
In general, when the areas of opportunity for observing a specific
event are not equal or nearly so, the chances of observing the
event will differ across the observations
Then we must normalize (convert to rates) by dividing each count by
its area of opportunity before valid comparisons are made
Conditions that make us willing to assume constant areas of
opportunity seem to be less common in software environments
Normalization is almost always needed for software!
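The normalization step described above can be sketched in Python. This is a minimal illustration, assuming the area of opportunity is module size in lines of code; the function name `defects_per_kloc` and the sample numbers are hypothetical and do not come from the slides.

```python
def defects_per_kloc(defect_counts, sizes_loc):
    """Convert raw defect counts to rates per thousand lines of code."""
    if len(defect_counts) != len(sizes_loc):
        raise ValueError("counts and sizes must have the same length")
    rates = []
    for count, size in zip(defect_counts, sizes_loc):
        if size <= 0:
            raise ValueError("area of opportunity must be positive")
        # Divide each count by its own area of opportunity (in KLOC).
        rates.append(count / (size / 1000.0))
    return rates


# Hypothetical modules: two have the same raw count but very different sizes.
counts = [12, 12, 3]
sizes = [4000, 1000, 500]
print(defects_per_kloc(counts, sizes))
```

The first two modules have identical raw counts, yet their rates differ by a factor of four, which is why raw counts alone would mislead a comparison.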
14. Attributes Data - 3
Example:
If defects are being counted and the size of an item inspected
influences the number of defects found, some measure of item size
will also be needed to convert defect counts to relative rates that
can be compared in meaningful ways (such as defects per thousand
lines of code)
If the variations in the amount of time spent inspecting or testing
can influence the number of defects found, these times should be
clearly defined and measured as well
15. Attributes Data - 4
One of the keys to making effective use of attributes data lies in
preserving the ordering of each count in space and time
Sequence information (the order in time or space in which the data
is collected) is almost always needed to correctly interpret counts
of attributes
Make the counts specific: make sure there is an operational
definition (a clear set of rules and procedures) for recognizing an
attribute or entity, so that what gets counted is what the user of
the data expects it to be
16. Attributes Data - 5
Attributes data is counted and plotted as discrete events:
Shipping errors
Percentage waste
Number of defects found
Number of defective items
Number of source statements of a given type
Number of lines of comments in a module of n lines
Number of people with certain skills on a project
Percentage of projects using formal inspections
Team size
Elapsed time between milestones
Staff hours logged per task
Backlog
Number of priority-one customer complaints
Percentage of non-conforming products in the output of an activity
or a process
17. The Key to Classifying Data
Whether data are classified as attributes data or variables data
depends not so much on whether the data are discrete or continuous,
but on how they are collected and used
The total number of defects found is often used as a measure of the
amount of rework or retesting to be performed
In that role it is viewed as a measure of size and treated as
variables data
Otherwise it is normally used as a count based on attributes
The method of analysis you choose for any data will depend
on:
The questions you are asking
The data distribution model you have in mind
The assumptions you are willing to make with respect to the nature
of the data (Page 79)
18. Data Type Classifications
Discrete
Continuous
19. Distributional Models: Relationship to Chart Types
Each type of chart is related to a set of assumptions (a
distributional model) that must hold for that type of chart to be
valid.
There are six types of charts for attributes data
NP
P
C
U
XmR for counts
XmR for rates
20. Distributional Models: Relationship to Chart Types - 2
XmR charts have an advantage over np, p, c, and u charts in
that they require fewer and less stringent assumptions
They are easier to plot and use
They have wide applicability
They are recommended by many quality-control professionals
When assumptions of the distributional model are met, however, the
more specialized np, p, c, and u charts can give better bounds for
control limits and can offer advantages
21. Distributional Models: Relationship to Chart Types - 3
NP Chart: An np chart is used when the count data are binomially
distributed and all samples have equal areas of opportunity
These conditions occur in manufacturing settings when there is 100%
inspection of lots of size n (n is constant) and the number of
defective units in each lot is recorded
P Chart: A p chart is used when the data are binomially distributed
but the areas of opportunity vary from sample to sample
A p chart could be appropriate if the lot size n were to change
from lot to lot
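The slides do not spell out how the limits are computed. As a sketch, assuming the conventional three-sigma binomial limits (center ± 3 standard deviations, clipped to the feasible range), the np and p chart calculations might look like this in Python. The function names and sample data are hypothetical.

```python
import math


def np_chart_limits(defectives, n):
    """Return (center, lcl, ucl) for an np chart with constant lot size n."""
    p_bar = sum(defectives) / (n * len(defectives))
    center = n * p_bar
    sigma = math.sqrt(n * p_bar * (1.0 - p_bar))
    lcl = max(0.0, center - 3.0 * sigma)
    ucl = min(float(n), center + 3.0 * sigma)
    return center, lcl, ucl


def p_chart_limits(defectives, lot_sizes):
    """Return (p_bar, [(lcl, ucl), ...]) for a p chart with varying lot sizes."""
    p_bar = sum(defectives) / sum(lot_sizes)
    limits = []
    for n in lot_sizes:
        # Each lot gets its own limits because its area of opportunity differs.
        sigma = math.sqrt(p_bar * (1.0 - p_bar) / n)
        limits.append((max(0.0, p_bar - 3.0 * sigma),
                       min(1.0, p_bar + 3.0 * sigma)))
    return p_bar, limits


# Hypothetical data: defective units found in four lots.
print(np_chart_limits([5, 5, 5, 5], 100))
print(p_chart_limits([5, 10, 5, 20], [100, 200, 100, 400]))
```

On the p chart the limits narrow as the lot size grows, which is the practical difference from the np chart's single pair of limits.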
22. Distributional Models: Relationship to Chart Types - 4
C Chart: A c chart is used when the count data are samples from a
Poisson distribution and the samples all have equal-sized areas of
opportunity
U Chart: A u chart is used in place of a c chart when the count data
are samples from a Poisson distribution and the areas of
opportunity are not constant
Defects per thousand lines of code is an example for software
NP, P, C and U charts are the traditional control charts used with
attributes data
XmR Chart: Useful when little is known about the underlying
distribution or when the justification for assuming a binomial or
Poisson process is questionable
Almost always a reasonable choice
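As an illustration of why XmR charts need so few assumptions, the limits can be computed from the data alone. The sketch below assumes the conventional scaling constants for moving ranges of two consecutive points (2.660 for the individuals chart, 3.268 for the moving range chart); the function name `xmr_limits` and the sample values are hypothetical.

```python
def xmr_limits(values):
    """Compute center lines and limits for an XmR (individuals and moving range) chart."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    # Moving ranges are absolute differences between consecutive values,
    # so the time order of the data must be preserved.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "x_center": x_bar,
        "x_lcl": x_bar - 2.660 * mr_bar,
        "x_ucl": x_bar + 2.660 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.268 * mr_bar,
    }


# Hypothetical weekly counts, in time order.
print(xmr_limits([10, 12, 11, 13, 12]))
```

The lower limit is not clipped here; for counts or rates that cannot be negative, a negative lower limit would simply mean there is no lower limit in practice.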
23. Distributional Models: Relationship to Chart Types - 5
More About U Charts
U charts seem to have the greatest prospects
for use in software settings
U charts require normalization (conversion to rates) when the areas
of opportunity are not constant
Poisson might be appropriate when counting the number of defects in
modules during inspection or testing
Defects per thousand lines of source code is an example of
attributes data that is a candidate for u charts
Although u charts may be appropriate for studying software defect
densities in an operational environment, we are not aware of any
empirical studies that have generally validated the use of Poisson
models for nonoperational environments such as inspections
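A u chart calculation can be sketched as follows, assuming the conventional Poisson-based limits of u-bar ± 3·sqrt(u-bar / a_i), where a_i is the area of opportunity of observation i. The slides do not give this formula, and the function name and sample data are hypothetical.

```python
import math


def u_chart_limits(defect_counts, areas):
    """Return (u_bar, rows); each row is (rate, lcl, ucl, is_signal)."""
    if len(defect_counts) != len(areas):
        raise ValueError("counts and areas must have the same length")
    if any(area <= 0 for area in areas):
        raise ValueError("areas of opportunity must be positive")
    u_bar = sum(defect_counts) / sum(areas)
    rows = []
    for count, area in zip(defect_counts, areas):
        rate = count / area  # normalization: count divided by its area
        sigma = math.sqrt(u_bar / area)
        lcl = max(0.0, u_bar - 3.0 * sigma)
        ucl = u_bar + 3.0 * sigma
        rows.append((rate, lcl, ucl, rate < lcl or rate > ucl))
    return u_bar, rows


# Hypothetical inspections: defects found and module size in KLOC.
u_bar, rows = u_chart_limits([4, 6, 10, 30], [1.0, 2.0, 2.0, 1.0])
for rate, lcl, ucl, is_signal in rows:
    print(f"rate={rate:.2f} lcl={lcl:.2f} ucl={ucl:.2f} signal={is_signal}")
```

Because each observation has its own area of opportunity, each gets its own pair of limits, and a point is flagged only against the limits for its own size.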
24. Distributional Models: Relationship to Chart Types - 6
Defects per module or defects per test are unlikely candidates for
u charts, c charts, or any other charts for that matter
The ratios are not based on equal areas of opportunity and can't be
normalized
There is no reason to expect them to be constant across all modules
or tests when the process is in statistical control
25. Distributional Models: Relationship to Chart Types - 7
If you are uncertain as to the model that applies, it can make
sense to use more than one set of charts
If you think you may have a Poisson situation but are not sure that
all conditions for a Poisson process are present, then plotting
both a u chart and the corresponding XmR charts should bracket the
situation
If both charts point to the same conclusions, you are unlikely to
be led astray
If the conclusions differ, then you should investigate your
assumptions or the events
26. Presenting Data
While it is simple and easy to compare one number with another,
such comparisons are limited and weak
Limited because of the small amount of data used
Weak because both of the numbers are subject to variation
This makes it difficult to determine just how much of the
difference between the values is due to variation in the numbers and
how much is due to real changes in the process
27. Presenting Data - 2
Graphs: two basic graphs are the most helpful in
providing the context for interpreting the current value
Time series graph (Run Chart)
Have months or years marked off on the horizontal axis and possible
values marked off on the vertical axis
As you move from left to right, there is a passage of time
By visually comparing the current value with the plotted values for
the preceding months you can quickly see if the current value is
unusual or not
Histogram (Tally Plot)
An accumulation of the different values as they occur without
trying to display the time order sequence
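Both graphs can be approximated in plain text to show what each one preserves. The sketch below is only illustrative: the function names and data are hypothetical, the values are assumed to be small non-negative integers, and the run chart is rotated so that time runs down the page instead of left to right.

```python
from collections import Counter


def run_chart(labels, values):
    """Text run chart: one row per time period, kept in time order."""
    return "\n".join(
        f"{label:<6} | {'#' * value} {value}"
        for label, value in zip(labels, values)
    )


def tally_plot(values):
    """Text histogram: accumulates values and ignores their time order."""
    counts = Counter(values)
    return "\n".join(
        f"{value:>4} | {'x' * counts[value]}"
        for value in range(min(counts), max(counts) + 1)
    )


# Hypothetical monthly values.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
observed = [3, 4, 3, 5, 9]
print(run_chart(months, observed))
print()
print(tally_plot(observed))
```

The run chart makes the jump in the last period visible in sequence, while the tally plot shows how the same values are distributed once time order is set aside.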
28. Run Charts
[Figure: run chart of the Number of Required Changes to a Module as the Project Approaches Systems Test. Stages on the horizontal axis: Syntax Check, Desk Check, Code Review, Unit Test, Integration and Test, Systems Test]
29. Histograms
[Figure: histogram of Product Service Staff Hours (horizontal axis, 32 to 56) against Number of Days (vertical axis, 0 to 20)]
30. Process Control Chart
[Figure: annotated process control chart with header fields for Chart Type and Metric. Source: Adrian Burr and Mal Owen, Statistical Methods for Software Quality, 1996]
Data relating to the process: numerical data taken in time sequence
Plotted points are either individual measurements or the means of
small groups of measurements
Center Line (CL): mean of the data used to set up the chart
Upper Control Limit (UCL) and Lower Control Limit (LCL): represent
the natural variation in the process
A point above or below the control lines suggests that the
measurement has a special, preventable or removable cause
The chart is analyzed using standard rules to define the control
status of the process
The chart is used for continuous and time control of the process
and prevention of causes
31. Impacts of Poor Data Quality
Inability to conduct hypothesis testing and predictive modeling
Inability to manage the quality and performance of software or
application development
Ineffective process change instead of process improvement
Ineffective and inefficient testing causing issues with time to
market, field quality, and development costs
Products that are costly to use within real-life usage
profiles
32. References
Brassard, Michael & Ritter, Diane, The Memory Jogger II: A
Pocket Guide of Tools for Continuous Improvement & Effective
Planning, GOAL/QPC, Salem, New Hampshire, 1994
Florac, W.A. & Carleton, A.D., Measuring the Software Process,
Addison-Wesley, 1999
Six Sigma Academy, The Black Belt Memory Jogger: A Pocket Guide for
Six Sigma Success, GOAL/QPC, Salem, New Hampshire, 2002
Wheeler, Donald J., Understanding Variation: The Key to Managing
Chaos, SPC Press, Knoxville, Tennessee, 2000