SAA 2023 COMPUTATIONAL TECHNIQUE FOR BIOSTATISTICS: Introduction & Descriptive Statistics
Dec 26, 2015
Introduction
Statistics - technology used to describe and measure aspects of nature from samples
Statistics lets us quantify the uncertainty of these measures
Statistics is also about good scientific practice
The history of statistics has its roots in biology
Sir Francis Galton
Pioneer of fingerprint classification; studied the heredity of quantitative traits
Regression & correlation
Also: efficacy of prayer, attractiveness as function of distance from London
Karl Pearson
Polymath
Studied genetics
Correlation coefficient, χ² test, standard deviation
Sir Ronald Fisher
The Genetical Theory of Natural Selection
Founder of population genetics
Analysis of variance, likelihood, P-value, randomized experiments, multiple regression, etc.
Statistical quotations
There are three kinds of lies: lies, damn lies, and statistics. Benjamin Disraeli / Mark Twain
It is easy to lie with statistics, but easier to lie without them. Frederick Mosteller
Goals of statistics
Estimation: infer an unknown quantity of a population using sample data
Hypothesis testing: differences among groups; relationships among variables
Introduction
Introduction to the basic concepts of statistics as applied to problems in biological science.
Goals of the course:
- Understand statistical concepts (population, sample, slope, significance, etc.);
- Identify appropriate methods for your data (e.g., one-sample or two-sample, paired or independent t-test, one-way or two-way ANOVA);
- Select correct MINITAB procedures to analyze data;
- Scientific reading and interpretation.
Biostatistics: Why study Biostatistics?
- Statistical methods are widely used in the biological field;
- Examples are from the biological field, practical and useful;
- Focus on application instead of mathematical derivation;
- Helps to evaluate a paper in an intelligent manner.
Statistics - the science and art of obtaining reliable results and conclusions from data that is subject to variation.
Biostatistics (Biometry)- the application of statistics to the biological sciences.
Biostatistics: Why Computer Applications?
- Statistical methods are often difficult and complicated (ANOVA, regression, etc.);
- Advances in computer technology and statistical software development make the application of statistical methods much easier today than before;
- Software such as MINITAB takes time to learn.
Is Biostatistics hard to study?
Factors that make it hard for some students to learn statistics:
- The terminology is deceptive. To understand statistics, you have to understand that the statistical meanings of terms such as significant, error, and hypothesis are distinct from the ordinary uses of these words.
- Statistics requires mastering abstract concepts. It is not easy to think about theoretical concepts such as populations, probability distributions, and null hypotheses.
- Statistics is at the interface of mathematics and science. To really grasp the concepts of statistics, you need to be able to think about it from both angles.
- The derivation of many statistical tests involves difficult math. However, you can learn to use statistical tests and interpret the results even if you do not fully understand how they work. You only need to know enough about how the tools work to avoid using them in inappropriate situations.
Basically, you can calculate statistical tests and interpret results even if you don't understand how the equations were derived, as long as you know enough to use the statistical tests appropriately.
Questions about this course
Is this course going to be hard? No. The concepts are easy and the procedures are clear.
Why do we spend time on theoretical material? It helps in understanding the applications.
Do we need to know all of it? You may not need all of it, but be prepared.
Role of statistics in Biological Science
Science
1. Idea or question
2. Collect data / make observations
3. Describe data / observations
4. Assess the strength of evidence for / against the hypothesis
Statistics
1. Mathematical model / hypothesis
2. Study design
3. Descriptive statistics
4. Inferential statistics
Contents of the course
Descriptive statistics: graph, table, mean and standard deviation
Inferential statistics: probability and distribution, hypothesis test, analysis of variance, correlation and regression analysis, other special topics
Basic Concept: Data
Numerical facts, measurements, or observations obtained from an investigation or experiment aimed at answering a question
Statistical analyses deal with numbers
Basic Concept: Quantitative
Usual type of measurement, such as height or weight - measurements of quantitative variables carry information about 'amount' - can calculate means, etc., and can use in calculations
Basic Concept: Qualitative
Carry information about category or classification, such as medical diagnosis, ethnic group, gender - cannot calculate means as such, but can tabulate counts or frequencies and analyze frequencies
Basic Concept: Variable
A characteristic that can take on different values for different persons, places or things
Statistical analyses need variability; otherwise there is nothing to study
Examples: concentration of a substance, pH values obtained from atmospheric precipitation, birth weight of babies whose mothers are smokers, etc.
A variable is a characteristic measured on individuals drawn from a population under study.
Data are measurements of one or more variables made on a collection of individuals.
Basic Concept: Type of Variable
Continuous variable: between any two values of the variable, there is another possible value. Examples: height, weight, concentration.
Discrete variable: values can only be integers. Examples: number of people, number of plants, etc.
Continuous variables: can take any value to any degree of precision in a certain range - height, weight, temperature (?)
Discrete variables: can take only certain values or can only be measured to a certain degree of accuracy - e.g., # of children that a woman has delivered, # of teeth with fillings, blood pressure (?) - may be handled differently in analysis
Basic Concept: Independent Variable and Dependent Variable
We try to predict or explain a response (dependent) variable from an explanatory (independent) variable.
Basic Concept: Populations and samples
Populations <-> Parameters; Samples <-> Estimates
Nomenclature
                      Population (parameters)    Sample (statistics)
Mean                  μ                          x̄
Variance              σ²                         s²
Standard deviation    σ                          s
Basic Concept: Population
Population parameters are constants whereas estimates are random variables, changing from one random sample to the next from the same population.
Basic Concept: Population and Sample
[Diagram: population (parameter) and sample (statistic). From the population we predict properties of a sample; from the sample statistic we generalize to the population.]
Population: a set or collection of objects we are interested in (finite or infinite).
Parameter: a descriptive measure associated with a variable of an entire population, usually unknown because the whole population cannot be enumerated.
For example: plant height under warming conditions; graduates of USIM; smokers in the world.
Basic Concept: Population and Sample
- Population - largest collection of values of a random variable for which we have an interest at a particular time - e.g., school children in Negeri Sembilan.
- Sample - selected part of a population - e.g., Form Three girls, Form Five boys, etc.
A sample of convenience is a collection of individuals that happen to be available at the time.
Basic Concept: Sampling
Essence of statistical inference - why?
Why sample? We cannot afford the time or money to record measurements on the entire population, and new members of the population may be entering all of the time. We use statistical analysis of a sample to answer questions about a population - cancer patients, teen-age boys, women after child birth, etc.
[Figure: sampling outcomes arranged as a 2 x 2 grid: precise vs. imprecise, biased vs. unbiased]
Bias is a systematic discrepancy between estimates and the true population characteristic.
Sampling error - the difference, arising by chance, between an estimate and the population parameter being estimated.
Larger samples on average will have smaller sampling error.
Basic Concept: Properties of a good sample
- Independent selection of individuals
- Random selection of individuals
- Sufficiently large
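The claim that larger samples have smaller sampling error on average can be checked with a small simulation. The sketch below is only an illustration in Python (not MINITAB): the population of plant heights is made up, and the function name `mean_abs_sampling_error` is hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 plant heights (cm); the values are made up.
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # population parameter (a constant)


def mean_abs_sampling_error(n, repeats=500):
    """Average |sample mean - population mean| over many random samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, n)  # random sample without replacement
        errors.append(abs(statistics.mean(sample) - mu))
    return statistics.mean(errors)


error_small = mean_abs_sampling_error(10)
error_large = mean_abs_sampling_error(100)
print(f"n = 10:  mean absolute sampling error = {error_small:.2f} cm")
print(f"n = 100: mean absolute sampling error = {error_large:.2f} cm")
```

Each sample mean is an estimate that changes from sample to sample, while `mu` stays fixed; the average gap shrinks as n grows.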
Basic Concept: Sampling
So how do 'intervention studies' fit into this? Studies select a sample of the population (e.g., cancer patients) to study the effects of a new therapy and then make inferences about how the rest of the cancer patient population would react to the new therapy.
Basic Concept: Sample
Sample: a small number of subjects from a population used to make inference about the population.
Random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected.
Statistic: a descriptive measure associated with a random variable of a sample.
Basic Concept: Random
Variables whose values arise by chance factors which cannot be predicted in advance, such as height or weight
Race or age are 'fixed' variables, i.e., not random
In a random sample, each member of a population has an equal and independent chance of being selected.
Descriptive Statistics
Graphical Summaries: frequency distribution, histogram, stem and leaf plot, boxplot
Numerical Summaries:
- Location: mean, median, mode
- Spread: range, variance, standard deviation
- Shape: skewness, kurtosis
Frequency Distribution - Discrete variables
Example: Number of grass plants, Mytilus edulis, found in 800 sample quadrats (1 m²) in an ecological study of grasses:
1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2, 0, 1, 2,
………………………………………………………
1, 2, 3, 2, 1, 1, 0, 5, 0, 0, 1, 0, 1, 0, 2, 4, 7, 2, 1,0
How is the plant number in a quadrat distributed?
Frequency Distribution - Discrete variables
Table 1. Frequency, relative frequency, and cumulative relative frequency of sedge plants per quadrat.
Plants/quadrat (Xi)   Frequency (fi)   Relative frequency (fi/n*100)   Cumulative relative frequency (%)
0                     268              33.500                          33.500
1                     316              39.500                          73.000
2                     135              16.875                          89.875
3                     61               7.625                           97.500
4                     15               1.875                           99.375
5                     3                0.375                           99.750
6                     1                0.125                           99.875
7                     1                0.125                           100.000
Total                 800              100.000
• frequency - number of times a value occurs in the data (corresponds to probability for a population).
• relative frequency - the % of the time that the value occurs (frequency/n × 100).
• cumulative relative frequency - the % of the sample that is equal to or smaller than the value (cumulative frequency/n × 100).
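The three columns of Table 1 can be rebuilt from the frequencies alone. The following is a minimal Python sketch (the course itself uses MINITAB); the variable names are chosen here for illustration only.

```python
# Frequencies from Table 1: plants per quadrat -> number of quadrats
frequency = {0: 268, 1: 316, 2: 135, 3: 61, 4: 15, 5: 3, 6: 1, 7: 1}
n = sum(frequency.values())  # total number of quadrats

rows = []
cumulative = 0
for value in sorted(frequency):
    f = frequency[value]
    cumulative += f
    relative = 100 * f / n               # relative frequency (%)
    cum_relative = 100 * cumulative / n  # cumulative relative frequency (%)
    rows.append((value, f, relative, cum_relative))

print("Xi   fi   rel.%    cum.%")
for value, f, relative, cum_relative in rows:
    print(f"{value:<4} {f:<4} {relative:<8.3f} {cum_relative:.3f}")
```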
Histogram (Bar graph) and polygon
Histogram - graph of frequencies
- Can be used to visually compare frequencies
- Easier to assess the magnitude of differences than trying to judge numbers
Frequency polygon - similar to histogram
Fig. 1. Frequency distribution of plants in a quadrat.
Grouping of continuous outcome
Examples: weight, height. Grouping gives a better understanding of what the data show than individual values.
Example: Fiber length of a cotton (n=106)
Data:
27.5,28.6,29.4,30.5,31.4,29.8,27.6,28.7,27.6…………
31.8,32.0,27.8
Frequency Distribution - Continuous variables
Table 2. Frequency and relative frequency distribution of fiber length (mm) of a cotton variety (n=106)
Length (Xi, mm)   Frequency (fi)   Relative frequency (%)   Cumulative relative frequency (%)
27.0~27.5         1                0.94                     0.94
27.5~28.0         3                2.83                     3.77
28.0~28.5         6                5.66                     9.43
28.5~29.0         13               12.26                    21.70
29.0~29.5         18               16.98                    38.68
29.5~30.0         19               17.92                    56.60
30.0~30.5         17               16.04                    72.64
30.5~31.0         16               15.09                    87.74
31.0~31.5         6                5.66                     93.40
31.5~32.0         5                4.72                     98.11
32.0~32.5         2                1.89                     100.00
Total             106              100.00
Frequency Distribution - Continuous variables
1. Calculate the range: R = max(X) - min(X) = 5.13
2. Set the number of intervals g and the interval width i. Some "rules" exist, but generally create 8-15 equal-sized intervals: g = 11, i = R/(g-1) ≈ 0.5
3. Set the intervals: L1 = min(X) - i/2 = 27.0, L2 = L1 + i = 27.5, …
4. Count the number of observations in each interval
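The grouping steps (range, interval width i = R/(g-1), first lower limit L1 = min(X) - i/2, then counting) can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `grouped_frequencies` is hypothetical, only the twelve fiber lengths visible in the data excerpt are used (the full 106 values are not reproduced), and g = 5 is chosen just to suit such a small list.

```python
def grouped_frequencies(values, g):
    """Group continuous values into g equal-width intervals.

    Assumes g >= 2 and that the values are not all identical.
    """
    low, high = min(values), max(values)
    data_range = high - low          # R = max(X) - min(X)
    width = data_range / (g - 1)     # i = R / (g - 1)
    first_lower = low - width / 2    # L1 = min(X) - i/2
    counts = [0] * g
    for v in values:
        index = int((v - first_lower) / width)
        index = min(index, g - 1)    # guard against rounding at the top edge
        counts[index] += 1
    limits = [(first_lower + k * width, first_lower + (k + 1) * width)
              for k in range(g)]
    return limits, counts


# The twelve fiber lengths (mm) shown in the data excerpt; not the full data set.
fiber_lengths = [27.5, 28.6, 29.4, 30.5, 31.4, 29.8,
                 27.6, 28.7, 27.6, 31.8, 32.0, 27.8]

limits, counts = grouped_frequencies(fiber_lengths, g=5)
for (lower, upper), count in zip(limits, counts):
    print(f"{lower:.2f}~{upper:.2f}  {count}")
```

Because the first interval starts half a width below the minimum, the g intervals always cover both the smallest and the largest observation.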
Fig. 2. Frequency distribution in fiber length of a cotton.
[Figure: histogram of fiber length; x-axis: Length (mm), classes 27.0~27.5 to 32.0~32.5; y-axis: Frequency, 0 to 20]
[Figure: frequency polygon of fiber length; x-axis: Length (mm), 27 to 33; y-axis: Frequency, 0 to 20]
[Figure: cumulative relative frequency of fiber length; x-axis: Length (mm), 27 to 33; y-axis: Accumulated relative frequency (%), 0 to 100]
Histogram
A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height.
The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (>100 observations), when stem and leaf plots become tedious to construct. A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.
Stem and Leaf Displays
Another way to assess frequencies
- Does preserve individual measurement information, so not useful for large data sets
- Stem is the first digit(s) of the measurements, leaves are the last digit of the measurements
- Most useful for two-digit numbers, more cumbersome for three or more digits
Counts per class:
20: X
30: XXX
40: XXXX
50: XX
60: X
Stem | Leaf
2* | 1
3* | 244
4* | 2468
5* | 26
6* | 4
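A stem and leaf display for two-digit numbers can be produced with a few lines of Python. This is a minimal sketch; the function name `stem_and_leaf` is hypothetical, and the data are the eleven values read back from the display above (stems 2 to 6).

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: leaves} for non-negative integers (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}


data = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem}* | {leaves}")
```

Sorting first means the leaves on each stem appear in increasing order, so the original values can be read back in order of magnitude.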
Stem and Leaf Plot
A stem and leaf plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient and easily drawn form.
A stem and leaf plot is similar to a histogram but is usually a more informative display for relatively small data sets (<100 data points). It provides a table as well as a picture of the data and from it we can readily write down the data in order of magnitude, which is useful for many statistical procedures.
We can compare more than one data set by the use of multiple stem and leaf plots. By using a back-to-back stem and leaf plot, we are able to compare the same characteristic in two different groups, for example, pulse rate after exercise of smokers and non-smokers.
Summary
In practice, descriptive statistics play a major role
- Always the first 1-2 tables/figures in a paper
- The statistician needs to know about each variable before deciding how to analyze it to answer research questions
In any analysis, 90% of the effort goes into setting up the data
- Descriptive statistics are part of that 90%
Descriptive measure computed from population data - parameter
Descriptive measure computed from sample data - statistic
Descriptive Statistics - Measures of Location
Most common measures of location: mean, median, mode, geometric mean, harmonic mean
Arithmetic mean (population)
Suppose we have N measurements of a particular variable in a population. We denote these N measurements as:
X1, X2, X3,…,XN
where X1 is the first measurement, X2 is the second, etc.
Definition
More accurately called the arithmetic mean, it is defined as the sum of the measures observed divided by the number of observations.
\[ \mu = \frac{1}{N} X_1 + \frac{1}{N} X_2 + \dots + \frac{1}{N} X_N = \frac{1}{N} \sum_{i=1}^{N} X_i \]
Arithmetic mean (sample)
Sample: Suppose we have a sample of n measurements of a particular variable, drawn from a population with N measurements. The n measurements are:
X1, X2, X3,…,Xn
where X1 is the first measurement, X2 is the second, etc.
Definition
\[ \bar{x} = \frac{1}{n} X_1 + \frac{1}{n} X_2 + \dots + \frac{1}{n} X_n = \frac{1}{n} \sum_{i=1}^{n} X_i \]
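Some Properties of the Arithmetic Mean
1. The deviations from the mean sum to zero: \( \sum (X_i - \bar{x}) = 0 \)
2. The sum of squared deviations is smallest when taken about the mean: \( \sum (X_i - \bar{x})^2 = \min \)
Proof:
1. \[ \sum (X_i - \bar{x}) = \sum X_i - n\bar{x} = 0 \]
2. Let \( x' = \bar{x} + e \). Then
\[ \sum (X_i - x')^2 = \sum [(X_i - \bar{x}) - e]^2 = \sum (X_i - \bar{x})^2 - 2e \sum (X_i - \bar{x}) + n e^2 = \sum (X_i - \bar{x})^2 + n e^2 \ge \sum (X_i - \bar{x})^2 \]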
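Both properties of the arithmetic mean can be checked numerically. The sketch below uses the ten tree ages from the central-location example in these notes; the shift e = 0.5 is an arbitrary value chosen for illustration.

```python
ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)
n = len(ages)
x_bar = sum(ages) / n


def sum_sq_dev(center):
    """Sum of squared deviations of the ages about a chosen center."""
    return sum((x - center) ** 2 for x in ages)


# Property 1: deviations from the mean sum to zero (up to rounding).
sum_dev = sum(x - x_bar for x in ages)

# Property 2: moving the center away from the mean by e adds n * e**2.
e = 0.5
ss_at_mean = sum_sq_dev(x_bar)
ss_shifted = sum_sq_dev(x_bar + e)

print(f"sum of deviations       = {sum_dev:.10f}")
print(f"SS about the mean       = {ss_at_mean:.2f}")
print(f"SS about mean + {e}     = {ss_shifted:.2f}")
print(f"difference vs. n * e^2  = {ss_shifted - ss_at_mean:.2f} vs. {n * e**2:.2f}")
```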
Median
Frequently used if there are extreme values in a distribution or if the distribution is non-normal
Definition: the value that divides the 'ordered array' into two equal parts
- If there is an odd number of observations, the median Md is the (n+1)/2-th observation; e.g., the median of 11 observations is the 6th observation
- If there is an even number of observations, the median Md is the midpoint between the middle two observations; e.g., the median of 12 observations is the midpoint between the 6th and 7th
Mode
Definition: the value that occurs most frequently in the data set
Example: 2 3 4 5 3 4 5 6 7 5 3 2 5, mode Mo=5
- If all values are different, there is no mode
- There may be more than one mode (bimodal or multimodal)
- Not used very frequently in practice
Example: Central Location
Suppose the ages of the 10 trees you are studying are: 34,24,56,52,21,44,64,34,42,46
Then the mean age of this group is:
\[ \bar{x} = \frac{1}{n} \sum X_i = (34+24+56+52+21+44+64+34+42+46)/10 = 417/10 = 41.7 \text{ years} \]
To find the median, first order the data:
21,24,34,34,42,44,46,52,56,64
\[ \text{Median} = \frac{1}{2}\left( X_{\frac{10}{2}} + X_{\frac{10}{2}+1} \right) = \frac{1}{2}(42 + 44) = 43 \text{ years} \]
The mode is 34 years, Mo=34 (occurred twice).
The mean is the most commonly used of these measures.
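The same three measures can be obtained with Python's standard `statistics` module, shown here as a quick cross-check of the hand calculation (the course itself uses MINITAB).

```python
import statistics

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)

mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # midpoint of the 5th and 6th ordered values
mode_age = statistics.mode(ages)      # most frequent value

print(f"mean   = {mean_age} years")
print(f"median = {median_age} years")
print(f"mode   = {mode_age} years")
```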
Geometric mean
Used to calculate mean growth rate
Definition: antilog of the mean of the log Xi
\[ G = (X_1 X_2 \cdots X_n)^{1/n} \]
\[ \log G = \frac{\log X_1 + \log X_2 + \dots + \log X_n}{n} \]
Geometric mean
Example: Root growth at 25 °C, calculate mean growth rate (mm/d).
Day     Root length (mm)   Growth rate (Xi, mm/d)   log(Xi)
0       17
1       23                 1.352941176              0.131279
2       30                 1.304347826              0.115393
3       38                 1.266666667              0.102662
4       51                 1.342105263              0.127787
5       72                 1.411764706              0.149762
6       86                 1.194444444              0.077166
Total                      7.872270083              0.70405
\[ \log G = \frac{0.7040}{6} = 0.1173, \quad G = \log^{-1} 0.1173 = 1.31 \text{ (mm/d)} \]
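The root growth calculation can be reproduced step by step. In this Python sketch each growth rate is computed, as in the table, as the root length on one day divided by the length on the previous day; base-10 logarithms are assumed because they match the tabulated log(Xi) values.

```python
import math

root_length = [17, 23, 30, 38, 51, 72, 86]  # mm, days 0 to 6

# Growth rate Xi: length on day t divided by length on day t-1
rates = [later / earlier for earlier, later in zip(root_length, root_length[1:])]

# log G = mean of log10(Xi); G = antilog of that mean
log_mean = sum(math.log10(r) for r in rates) / len(rates)
G = 10 ** log_mean

print(f"log G = {log_mean:.4f}")
print(f"G     = {G:.2f}")
```

Multiplying the starting length by G six times recovers the final length, which is why the geometric mean is the natural average for growth rates.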
Descriptive Statistics - Measures of Dispersion
Look at these two data sets:
Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10
If we calculate the mean:
Set 1: n = 7, x̄ = 1
Set 2: n = 7, x̄ = 1
How to measure dispersion (spread, variability)?
Common measures:
- Range
- Variance and standard deviation
- Coefficient of variation
Many distributions are well described by a measure of location and a measure of dispersion
Range
Range is the difference between the largest and smallest values in the data set
R=Max (Xi) - Min (Xi)
Heavily influenced by the two most extreme values and ignores the rest of the distribution
Set 1: 100, 30, 20, 7, -20, -30, -100; R1=200
Set 2: 10, 3, 2, 7, -2, -3, -10; R2=20
Variance and Standard Deviation - Population
Suppose we have N measurements of a particular variable in a population: X1, X2, X3,…,XN.
The mean is \( \mu \). Because \( \sum (X_i - \mu) = 0 \), we define:
\[ \sigma^2 = \frac{1}{N}(X_1 - \mu)^2 + \frac{1}{N}(X_2 - \mu)^2 + \dots + \frac{1}{N}(X_N - \mu)^2 = \frac{\sum (X_i - \mu)^2}{N} \]
as variance, whose unit is (X unit)², and
\[ \sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} \]
as standard deviation.
Variance and Standard Deviation - Sample
Suppose we have n measurements of a particular variable in a sample: X1, X2, X3,…,Xn.
The mean is \( \bar{x} \). We define:
\[ s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1} \]
as mean squares, or sample variance, and
\[ s = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}} \]
as standard deviation.
Variance and Standard Deviation
Corrected Sum of Squares (CSS):
\[ SS = \sum (X_i - \bar{x})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n} \]
Degrees of freedom: \( df = n - 1 \)
- n-1 is used because if we know n-1 deviations, the nth deviation is known
- Deviations have to sum to zero
\[ s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1} \]
Example
Suppose the ages of the 10 trees you are studying are: 34,24,56,52,21,44,64,34,42,46. We calculated x̄ = 41.7.
Calculate the range, variance, standard deviation and CV.
No.     Xi     x̄      Xi - x̄    (Xi - x̄)²   Xi²
1       34     41.7    -7.7      59.29        1156
2       24     41.7    -17.7     313.29       576
3       56     41.7    14.3      204.49       3136
4       52     41.7    10.3      106.09       2704
5       21     41.7    -20.7     428.49       441
6       44     41.7    2.3       5.29         1936
7       64     41.7    22.3      497.29       4096
8       34     41.7    -7.7      59.29        1156
9       42     41.7    0.3       0.09         1764
10      46     41.7    4.3       18.49        2116
Total   417            0         1692.1       19081
R = 64 - 21 = 43 y, s² = 1692.1/9 = 188.01 y², s = 13.71 y, CV = 13.71/41.7 × 100 = 32.9%.
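The tree-age example can be recomputed in Python to confirm the table totals, using both the definition of the corrected sum of squares and the shortcut formula SS = ΣXi² - (ΣXi)²/n. This is a sketch for checking the arithmetic, not a MINITAB procedure.

```python
import math
import statistics

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)
n = len(ages)
x_bar = statistics.mean(ages)

data_range = max(ages) - min(ages)

# Corrected sum of squares, two equivalent ways
ss_definition = sum((x - x_bar) ** 2 for x in ages)
ss_shortcut = sum(x * x for x in ages) - sum(ages) ** 2 / n

s2 = ss_definition / (n - 1)  # sample variance, df = n - 1
s = math.sqrt(s2)             # sample standard deviation
cv = 100 * s / x_bar          # coefficient of variation (%)

print(f"R  = {data_range} y")
print(f"SS = {ss_definition:.1f} (definition), {ss_shortcut:.1f} (shortcut)")
print(f"s2 = {s2:.2f} y^2, s = {s:.2f} y, CV = {cv:.1f}%")
```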
Coefficient of Variation
Relative variation rather than absolute variation such as standard deviation
Definition of C.V.:
\[ CV = \frac{s}{\bar{x}} \times 100 \]
Useful in comparing variation between two distributions
- Used particularly in comparing laboratory measures to identify those determinations with more variation
Example
Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10
Calculate x̄, s², s and CV.
Set    x̄    s²        s       CV (%)
1      1     3773.7    61.4    6143
2      1     44.7      6.7     668
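The two sets share the same mean but differ greatly in spread, which a short Python check makes concrete. The helper name `describe` is hypothetical, and the CV is expressed as a percentage, following the definition CV = s/x̄ × 100.

```python
import statistics


def describe(values):
    """Return (mean, sample variance, sample standard deviation, CV in %)."""
    mean = statistics.mean(values)
    s2 = statistics.variance(values)  # divides by n - 1
    s = statistics.stdev(values)
    cv = 100 * s / mean
    return mean, s2, s, cv


set1 = [100, 30, 20, 7, -20, -30, -100]
set2 = [10, 3, 2, 7, -2, -3, -10]

for name, values in (("Set 1", set1), ("Set 2", set2)):
    mean, s2, s, cv = describe(values)
    print(f"{name}: mean = {mean}, s2 = {s2:.1f}, s = {s:.1f}, CV = {cv:.0f}%")
```

Because the mean is close to zero relative to the spread, the CV becomes very large here; the CV is most informative for positive measurements whose mean is well away from zero.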
Box Plots
Descriptive method to convey information about measures of location and dispersion (box-and-whisker plots)
Construction of a boxplot:
- Box is the IQR
- Line at the median
- Whiskers at the smallest and largest observations
- Other conventions can be used, especially to represent extreme values
[Figure: box plots of increment in systolic B.P. (y-axis, -20 to 40) for drugs 1 to 4 (x-axis)]
Box and Whisker Plot (or Boxplot)
A box and whisker plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.
A box plot (as it is often called) is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set. Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.
5-Number Summary
A 5-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set. It consists of 5 values: the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median. A 5-number summary can be represented in a diagram known as a box and whisker plot. In cases where we have more than one data set to analyse, a 5-number summary is constructed for each, with corresponding multiple box and whisker plots.
Outlier
An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others. An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.
If an outlier is a genuine result, it is important because it might indicate an extreme of behaviour of the process under study. For this reason, all outliers must be examined carefully before embarking on any formal analysis. Outliers should not routinely be removed without further justification.
Interpreting a Boxplot
The boxplot is interpreted as follows:
- The box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The range of the middle two quartiles is known as the inter-quartile range.
- The line in the box indicates the median value of the data.
- If the median line within the box is not equidistant from the hinges, then the data are skewed.
- The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case the whiskers extend to a maximum of 1.5 times the inter-quartile range beyond the hinges.
- The points outside the ends of the whiskers are outliers or suspected outliers.
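The quantities a boxplot displays (quartiles, inter-quartile range, whisker ends, and points beyond 1.5 times the IQR) can be computed directly. The sketch below is an illustration in Python: the function name `boxplot_summary` is hypothetical, the data are the ten tree ages used earlier plus one made-up extreme value (120), and the quartiles use the `inclusive` method of `statistics.quantiles`. Statistical packages such as MINITAB may use a different quartile convention and so give slightly different hinges.

```python
import statistics


def boxplot_summary(values):
    """Five-number summary plus whisker ends and suspected outliers (1.5 x IQR rule)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "iqr": iqr,
        "whisker_low": min(inside),    # most extreme values still inside the fences
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Ten tree ages from the earlier example plus one made-up extreme value (120).
ages_with_extreme = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46, 120]

summary = boxplot_summary(ages_with_extreme)
for key, value in summary.items():
    print(f"{key:13s} {value}")
```

The upper whisker stops at the largest observation inside the fence, and the made-up extreme value is reported separately as a suspected outlier.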
Boxplot Enhancements
Beyond the basic information, boxplots sometimes are enhanced to convey additional information:
- The mean and its confidence interval can be shown using a diamond shape in the box.
- The expected range of the median can be shown using notches in the box.
- The width of the box can be varied in proportion to the log of the sample size.
Advantages of Boxplots
Boxplots have the following strengths:
- Graphically display a variable's location and spread at a glance.
- Provide some indication of the data's symmetry and skewness.
- Unlike many other methods of data display, boxplots show outliers.
- By using a boxplot for each categorical variable side-by-side on the same graph, one quickly can compare data sets.
Disadvantage of Boxplots
One drawback of boxplots is that they tend to emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution. Displaying a histogram in conjunction with the boxplot helps in this regard, and both are important tools for exploratory data analysis.
Boxplot Example 1
Check location and variation shifts. Box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data.
Sample plot: this box plot reveals that machine has a significant effect on energy with respect to location and possibly variation.
This box plot, comparing four machines for energy output, shows that machine has a significant effect on energy with respect to both location and variation. Machine 3 has the highest energy response (about 72.5); machine 4 has the least variable energy response with about 50% of its readings being within 1 energy unit.
These MINITAB boxplots represent lottery payoffs for winning numbers for three time periods (May 1975-March 1976, November 1976-September 1977, and December 1980-September 1981).
The median for each dataset is indicated by the black center line, and the first and third quartiles are the edges of the red area, which is known as the inter-quartile range (IQR).
The extreme values (within 1.5 times the inter-quartile range from the upper or lower quartile) are the ends of the lines extending from the IQR. Points at a greater distance from the median than 1.5 times the IQR are plotted individually as asterisks. These points represent potential outliers.
In this example, the three boxplots have nearly identical median values. The IQR is decreasing from one time period to the next, indicating reduced variability of payoffs in the second and third periods. In addition, the extreme values are closer to the median in the later time periods.
Boxplot Example 2
As shown in the figure, a line is drawn from the upper hinge to the upper adjacent value and from the lower hinge to the lower adjacent value. Every score between the inner and outer fences is indicated by an "o" whereas a score beyond the outer fences is indicated by a "*".
It is often useful to compare data from two or more groups by viewing box plots from the groups side by side. The data from 2b are higher, more spread out, and have a positive skew. That the skew is positive can be determined by the fact that the mean is higher than the median and the upper whisker is longer than the lower whisker.
Boxplot Example 3
Although the medians are all roughly the same, you can see at a glance that the spread of each data set is different. The boxplot on the left shows data that appear to be distributed evenly. The median is in the middle of the rectangle, and the whiskers are about the same length. In addition, the plot contains no outside values. The median of the second plot from the left appears to be slightly off-center. The number of extreme values is a point of concern because it suggests that the data vary widely.
The third boxplot shows data that have less variation and spread than the other plots. The fourth boxplot shows data that are significantly upwardly skewed. The median of this plot is closer to the top of the rectangle than to the bottom, and the upper whisker is longer than the bottom one. All the boxplots have approximately the same median, and the two boxplots on the left have approximately the same variation in the data.
Descriptive Statistics (Summary)
Graphical Summaries: frequency distribution, histogram, stem and leaf plot, boxplot
Numerical Summaries:
- Location: mean, median, mode
- Dispersion: range, variance, standard deviation
- Shape
Software
Statistical software: SAS, SPSS, Stata, BMDP, MINITAB
Graphical software: Sigmaplot, Harvard Graphics, PowerPoint, Excel