SAA 2023 COMPUTATIONAL TECHNIQUE FOR BIOSTATISTICS - Introduction & Descriptive Statistics

Post on 26-Dec-2015



Introduction

Statistics - a technology used to describe and measure aspects of nature from samples

Statistics lets us quantify the uncertainty of these measures

Statistics is also about good scientific practice

The history of statistics has its roots in biology

Sir Francis Galton

- Pioneer of fingerprint identification; studied the heredity of quantitative traits
- Regression & correlation
- Also: efficacy of prayer, attractiveness as a function of distance from London

Karl Pearson

- Polymath
- Studied genetics
- Correlation coefficient, chi-square (χ²) test, standard deviation

Sir Ronald Fisher

- The Genetical Theory of Natural Selection
- Founder of population genetics
- Analysis of variance, likelihood, P-value, randomized experiments, multiple regression, etc.

Statistical quotations

"There are three kinds of lies: lies, damned lies, and statistics." - Benjamin Disraeli / Mark Twain

"It is easy to lie with statistics, but easier to lie without them." - Frederick Mosteller

Goals of statistics

- Estimation: infer an unknown quantity of a population using sample data
- Hypothesis testing: differences among groups; relationships among variables

Introduction to the basic concepts of statistics as applied to problems in biological science.

Goal of the course

- Understand statistical concepts (population, sample, slope, significant, etc.);
- Identify appropriate methods for your data (e.g., one-sample, two-sample, paired t-test or independent t-test, one-way or two-way ANOVA);
- Select the correct MINITAB procedures to analyze data;
- Read and interpret scientific results.

Biostatistics - Why study Biostatistics?

- Statistical methods are widely used in the biological field;
- Examples are from the biological field, practical and useful;
- The focus is on application instead of mathematical derivation;
- It helps you evaluate a paper in an intelligent manner.

Statistics - the science and art of obtaining reliable results and conclusions from data that are subject to variation.

Biostatistics (Biometry) - the application of statistics to the biological sciences.

Biostatistics - Why Computer Applications?

- Statistical methods are mostly difficult and complicated (ANOVA, regression, etc.);
- Advances in computer technology and statistical software development make the application of statistical methods much easier today than before;
- Software such as MINITAB takes time to learn.

Is Biostatistics hard to study?

Factors that make it hard for some students to learn statistics:

- The terminology is deceptive. To understand statistics, you have to understand that the statistical meanings of terms such as significant, error and hypothesis are distinct from the ordinary uses of these words.
- Statistics requires mastering abstract concepts. It is not easy to think about theoretical concepts such as populations, probability distributions, and null hypotheses.
- Statistics is at the interface of mathematics and science. To really grasp the concepts of statistics, you need to be able to think about it from both angles.
- The derivation of many statistical tests involves difficult math. However, you can learn to use statistical tests and interpret the results even if you do not fully understand how they work. You only need to know enough about how the tools work to avoid using them in inappropriate situations.

Basically, you can calculate statistical tests and interpret results even if you don't understand how the equations were derived, as long as you know enough to use the statistical tests appropriately.

Questions about this course

- Is this course going to be hard? No. The concepts are easy and the procedures are clear.
- Why do we spend time on theoretical material? It is helpful for understanding the applications.
- Do we need to know all of it? You may not need all of it, but be prepared.

Role of statistics in Biological Science

Science

1. Idea or question
2. Collect data / make observations
3. Describe data / observations
4. Assess the strength of evidence for / against the hypothesis

Statistics

1. Mathematical model / hypothesis
2. Study design
3. Descriptive statistics
4. Inferential statistics

Contents of the course

Descriptive statistics
- Graph, table, mean and standard deviation

Inferential statistics
- Probability and distribution
- Hypothesis test
- Analysis of variance
- Correlation and regression analysis
- Other special topics

Basic Concept - Data

Numerical facts, measurements, or observations obtained from an investigation or experiment aimed at answering a question

Statistical analyses deal with numbers

Basic Concept - Quantitative

The usual type of measurement, such as height or weight. Measurements of quantitative variables carry information about 'amount'; we can calculate means, etc., and use the values in calculations.

Basic Concept - Qualitative

Carry information about category or classification, such as medical diagnosis, ethnic group, gender. We cannot calculate means as such, but we can tabulate counts or frequencies and analyze the frequencies.

Basic Concept - Variable

A characteristic that can take on different values for different persons, places or things

Statistical analyses need variability; otherwise there is nothing to study

Examples: concentration of a substance, pH values obtained from atmospheric precipitation, birth weight of babies whose mothers are smokers, etc.

A variable is a characteristic measured on individuals drawn from a population under study.

Data are measurements of one or more variables made on a collection of individuals.

Basic Concept - Type of Variable

Continuous variable
- Between any two values of the variable, there is another possible value
- Can take any value to any degree of precision in a certain range
- Examples: height, weight, concentration, temperature (?)

Discrete variable
- Can take only certain values (often only integers), or can only be measured to a certain degree of accuracy
- Examples: number of people or plants, number of children that a woman has delivered, number of teeth with fillings, blood pressure (?)
- May be handled differently in analysis

Basic Concept - Independent Variable and Dependent Variable

We try to predict or explain a response (dependent) variable from an explanatory (independent) variable.

Basic Concept - Populations and samples

Populations <-> Parameters; Samples <-> Estimates

Nomenclature

                      Population (parameters)   Sample (statistics)
Mean                  $\mu$                     $\bar{x}$
Variance              $\sigma^2$                $s^2$
Standard deviation    $\sigma$                  $s$

Basic Concept - Population

Population parameters are constants, whereas estimates are random variables that change from one random sample to the next from the same population.

Basic Concept - Population and Sample

Sample -> Population, Statistic -> Parameter

[Diagram: a population is described by a parameter, which predicts properties of the sample; a sample is described by a statistic, which is generalized to the population.]

Basic Concept - Population

Population: a set or collection of objects we are interested in (finite or infinite).

Parameter: a descriptive measure associated with a variable of an entire population, usually unknown because the whole population cannot be enumerated.

For example: plant height under warming conditions; graduates of USIM; smokers in the world.

Basic Concept - Population and Sample

- Population - the largest collection of values of a random variable in which we have an interest at a particular time, e.g. school children in Negeri Sembilan.

- Sample - a selected part of a population, e.g. Form Three girls, Form Five boys, etc.

A sample of convenience is a collection of individuals that happen to be available at the time.

Basic Concept - Sampling

Sampling is the essence of statistical inference. Why?

Why sample? We cannot afford the time or money to record measurements on an entire population, and new members of the population may be entering all the time. We use statistical analysis of a sample to answer questions about a population: cancer patients, teenage boys, women after childbirth, etc.

[Figure: sampling outcomes classified as precise or imprecise and as biased or unbiased.]

Basic Concept - Bias and sampling error

Bias is a systematic discrepancy between estimates and the true population characteristic.

Sampling error is the difference between an estimate and the average value of the estimate over repeated samples. It arises by chance, whereas bias is systematic.

Larger samples on average will have smaller sampling error.
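The effect of sample size on sampling error can be illustrated with a small simulation. This is only a sketch: the population below is simulated (normal, mean 50, SD 10), and the sample sizes, number of repetitions and the name `mean_abs_error` are assumptions chosen for illustration, not values from the course.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 measurements (not from the slides).
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)

def mean_abs_error(n, reps=200):
    """Average absolute difference between sample means and the population mean."""
    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)  # simple random sample without replacement
        errors.append(abs(statistics.mean(sample) - mu))
    return statistics.mean(errors)

small_error = mean_abs_error(5)
large_error = mean_abs_error(500)
print(f"n=5:   {small_error:.2f}")
print(f"n=500: {large_error:.2f}")
```

With these settings the average error for n = 500 should come out well below that for n = 5, in line with the statement above.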

Basic Concept - Properties of a good sample

- Independent selection of individuals
- Random selection of individuals
- Sufficiently large

Basic Concept - Sampling

So how do 'intervention studies' fit into this? Studies select a sample of the population (e.g., cancer patients) to study the effects of a new therapy and then make inferences about how the rest of the cancer patient population would react to the new therapy.

Basic Concept - Sample

Sample: a small number of subjects from a population, used to make inferences about the population.

Random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected.

Statistic: a descriptive measure associated with a random variable of a sample.

Basic Concept - Random

Random variables are variables whose values arise from chance factors that cannot be predicted in advance, such as height or weight.

Race or age are 'fixed' variables, i.e. not random.

In a random sample, each member of a population has an equal and independent chance of being selected.
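As a minimal sketch of drawing a simple random sample, Python's standard `random` module can select n individuals so that every subset of size n is equally likely. The population of 100 numbered individuals and the sample size of 10 are hypothetical.

```python
import random

# Hypothetical population: individuals identified by the numbers 1..100.
population = list(range(1, 101))

rng = random.Random(42)              # fixed seed only to make the sketch repeatable
sample = rng.sample(population, 10)  # 10 distinct individuals, chosen without replacement

print(sorted(sample))
```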

Descriptive Statistics

Graphical Summaries
- Frequency distribution
- Histogram
- Stem and Leaf plot
- Boxplot

Numerical Summaries
- Location - mean, median, mode
- Spread - range, variance, standard deviation
- Shape - skewness, kurtosis

Frequency Distribution - Discrete variables

Example: Number of grass plants, Mytilus edulis, found in 800 sample quadrats (1 m²) in an ecological study of grasses:

1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2, 0, 1, 2,

…

1, 2, 3, 2, 1, 1, 0, 5, 0, 0, 1, 0, 1, 0, 2, 4, 7, 2, 1, 0

How is the plant number in a quadrat distributed?

Frequency Distribution - Discrete variables

Table 1. Frequency, relative frequency and cumulative relative frequency of sedge plants in a quadrat.

Plants/quadrat (Xi)   Frequency (fi)   Relative frequency (fi/n*100)   Cumulative relative frequency
0                     268              33.500                          33.500
1                     316              39.500                          73.000
2                     135              16.875                          89.875
3                     61               7.625                           97.500
4                     15               1.875                           99.375
5                     3                0.375                           99.750
6                     1                0.125                           99.875
7                     1                0.125                           100.000
Total                 800              100.000

• frequency - the number of times a value occurs in the data (it corresponds to probability in the population).

• relative frequency - the % of the time that the value occurs (frequency/n × 100).

• cumulative relative frequency - the % of the sample that is equal to or smaller than the value (cumulative frequency/n × 100).
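The three quantities defined above can be computed in a few lines. The sketch below uses only the first ten values of the data listing (the full 800 counts are not reproduced in the slides), and the function name `frequency_table` is a hypothetical helper.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative %, cumulative relative %)."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq
        rows.append((value, freq, 100 * freq / n, 100 * cumulative / n))
    return rows

# First ten plant counts from the listing, used only for illustration.
sample = [1, 4, 1, 0, 0, 1, 0, 0, 2, 3]
table = frequency_table(sample)
for value, freq, rel, cum in table:
    print(f"{value:>3} {freq:>4} {rel:8.3f} {cum:8.3f}")
```

Applied to all 800 quadrats, the same function would reproduce the columns of Table 1.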

Histogram (bar graph) and polygon

Histogram - a graph of frequencies
- Can be used to compare frequencies visually
- It is easier to assess the magnitude of differences from a graph than by trying to judge numbers

Frequency polygon - similar to a histogram

Fig. 1. Frequency distribution of plants in a quadrat.

Grouping of continuous outcomes

Examples: weight, height. Grouping gives a better understanding of what the data show than the individual values do.

Example: Fiber length of a cotton (n=106)

Data:

27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6, …, 31.8, 32.0, 27.8

Frequency Distribution - Continuous variables

Table 2. Frequency and relative frequency distribution of fiber length (mm) of a cotton variety (n=106)

Length (Xi, mm)   Frequency (fi)   Relative frequency (%)   Cumulative relative frequency (%)
27.0~27.5         1                0.94                     0.94
27.5~28.0         3                2.83                     3.77
28.0~28.5         6                5.66                     9.43
28.5~29.0         13               12.26                    21.70
29.0~29.5         18               16.98                    38.68
29.5~30.0         19               17.92                    56.60
30.0~30.5         17               16.04                    72.64
30.5~31.0         16               15.09                    87.74
31.0~31.5         6                5.66                     93.40
31.5~32.0         5                4.72                     98.11
32.0~32.5         2                1.89                     100.00
Total             106              100.00

Frequency Distribution - Continuous variables

1. Calculate the range: R = max(X) - min(X) = 5.13
2. Set the number of intervals g and the interval width i. Some "rules" exist, but generally create 8-15 equal-sized intervals. Here g = 11 and i = R/(g-1) ≈ 0.5.
3. Set the intervals: L1 = min(X) - i/2 = 27.0, L2 = L1 + i = 27.5, …
4. Count the number of observations in each interval.
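The counting step can be sketched as follows, using the interval width i = 0.5 and first lower limit L1 = 27.0 given above. Only the twelve fiber lengths visible in the data listing are used, so the counts do not match Table 2. The helper name `class_intervals` and the convention that each interval includes its lower limit but not its upper limit are assumptions; the slides do not say which end is inclusive.

```python
import math

def class_intervals(values, width, start):
    """Count values in consecutive intervals [lower, lower + width)."""
    n_classes = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)  # which interval v falls into
        counts[index] += 1
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in range(n_classes)]

# Only the values shown in the listing (12 of the 106 measurements).
lengths = [27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6, 31.8, 32.0, 27.8]
intervals = class_intervals(lengths, width=0.5, start=27.0)
for lower, upper, count in intervals:
    print(f"{lower:.1f}~{upper:.1f}  {count}")
```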

Histogram (bar graph) and polygon

Fig. 2. Frequency distribution in fiber length of a cotton.

[Figures: histogram and frequency polygon of Frequency against Length (mm); cumulative polygon of Accumulated relative frequency (%) against Length (mm).]

Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height.

The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (>100 observations), when stem and leaf plots become tedious to construct. A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.

Stem and Leaf Displays

Another way to assess frequencies
- Preserves individual measurement information, so it is not useful for large data sets
- The stem is the first digit(s) of a measurement; the leaves are the last digits
- Most useful for two-digit numbers, more cumbersome for three or more digits

20: X
30: XXX
40: XXXX
50: XX
60: X

Stem | Leaf
2* | 1
3* | 244
4* | 2468
5* | 26
6* | 4
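A stem and leaf display for two-digit numbers can be built by splitting each value into its tens digit (stem) and units digit (leaf). The sketch below uses the eleven values that can be read off the display above; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each tens digit (stem) to the sorted list of units digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Values read off the display: 2*|1, 3*|244, 4*|2468, 5*|26, 6*|4.
data = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
plot = stem_and_leaf(data)
for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem}* | {leaves}")
```

Because the leaves keep the last digit of every observation, the original data can be recovered in order of magnitude from the display.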

Stem and Leaf Plot

A stem and leaf plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient and easily drawn form.

A stem and leaf plot is similar to a histogram but is usually a more informative display for relatively small data sets (<100 data points). It provides a table as well as a picture of the data and from it we can readily write down the data in order of magnitude, which is useful for many statistical procedures, e.g. in the skinfold thickness example below:

We can compare more than one data set by the use of multiple stem and leaf plots. By using a back-to-back stem and leaf plot, we are able to compare the same characteristic in two different groups, for example, pulse rate after exercise of smokers and non-smokers.

Summary

In practice, descriptive statistics play a major role
- Always the first 1-2 tables/figures in a paper
- The statistician needs to know about each variable before deciding how to analyze the data to answer research questions

In any analysis, 90% of the effort goes into setting up the data
- Descriptive statistics are part of that 90%

Descriptive Statistics - Measures of Location

Descriptive measure computed from population data - parameter

Descriptive measure computed from sample data - statistic

Most common measures of location
- Mean
- Median
- Mode
- Geometric mean, harmonic mean

Suppose we have N measurements of a particular variable in a population. We denote these N measurements as:

X1, X2, X3, …, XN

where X1 is the first measurement, X2 is the second, etc.

Arithmetic mean (population)

Definition

More accurately called the arithmetic mean, it is defined as the sum of the measures observed divided by the number of observations.

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{1}{N}X_1 + \frac{1}{N}X_2 + \dots + \frac{1}{N}X_N$$

Arithmetic mean (sample)

Sample: Suppose we have n measurements of a particular variable in a sample drawn from a population with N measurements. The n measurements are:

X1, X2, X3, …, Xn

where X1 is the first measurement, X2 is the second, etc.

Definition

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \dots + \frac{1}{n}X_n$$

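Some Properties of the Arithmetic Mean

1. $\sum (X_i - \bar{x}) = 0$
2. $\sum (X_i - \bar{x})^2 = \min$

Proof:

1. $\sum (X_i - \bar{x}) = \sum X_i - n\bar{x} = 0$
2. Let $x' = \bar{x} + e$. Then

$$\sum (X_i - x')^2 = \sum [(X_i - \bar{x}) - e]^2 = \sum [(X_i - \bar{x})^2 - 2e(X_i - \bar{x}) + e^2]$$

$$= \sum (X_i - \bar{x})^2 - 2e\sum (X_i - \bar{x}) + ne^2 = \sum (X_i - \bar{x})^2 + ne^2 \ge \sum (X_i - \bar{x})^2$$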
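Both properties can be checked numerically. The five values below are hypothetical and chosen so that the arithmetic is exact; the helper names are illustrative only.

```python
def sum_of_deviations(values, center):
    """Sum of (x - center) over all values."""
    return sum(x - center for x in values)

def sum_of_squares(values, center):
    """Sum of squared deviations (x - center)^2 over all values."""
    return sum((x - center) ** 2 for x in values)

data = [2, 4, 4, 5, 10]        # hypothetical measurements
mean = sum(data) / len(data)   # 5.0

# Property 1: deviations from the mean sum to zero.
print(sum_of_deviations(data, mean))

# Property 2: moving the center away from the mean by e adds n * e**2.
e = 0.5
print(sum_of_squares(data, mean), sum_of_squares(data, mean + e))
```

Here the sum of squares about the mean is 36, and about mean + 0.5 it is 36 + 5 × 0.25 = 37.25, so any center other than the mean gives a larger sum of squares.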

Median

Frequently used if there are extreme values in a distribution or if the distribution is non-normal

Definition: the value that divides the 'ordered array' into two equal parts
- If there is an odd number of observations, the median Md is the ((n+1)/2)th observation. Ex.: the median of 11 observations is the 6th observation.
- If there is an even number of observations, the median Md is the midpoint between the middle two observations. Ex.: the median of 12 observations is the midpoint between the 6th and 7th.

Mode

Definition: the value that occurs most frequently in the data set

Example: 2 3 4 5 3 4 5 6 7 5 3 2 5, mode Mo = 5
- If all values are different, there is no mode
- There may be more than one mode (bimodal or multimodal)
- Not used very frequently in practice

Example: Central Location

Suppose the ages of the 10 trees you are studying are: 34, 24, 56, 52, 21, 44, 64, 34, 42, 46

Then the mean age of this group is:

$$\bar{x} = \frac{1}{n}\sum X_i = (34+24+56+52+21+44+64+34+42+46)/10 = 417/10 = 41.7 \text{ years}$$

To find the median, first order the data:

21, 24, 34, 34, 42, 44, 46, 52, 56, 64

$$\text{Median} = \frac{1}{2}\left(X_{(10/2)} + X_{(10/2+1)}\right) = \frac{1}{2}(42 + 44) = 43 \text{ years}$$

The mode is 34 years, Mo = 34 (occurred twice).

The mean is commonly used.
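The same three measures can be obtained with Python's standard `statistics` module. This is a sketch using the tree ages from the example, not the MINITAB procedure used in the course.

```python
import statistics

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages from the example (years)

mean_age = statistics.mean(ages)      # sum of ages divided by n
median_age = statistics.median(ages)  # midpoint of the 5th and 6th ordered values
mode_age = statistics.mode(ages)      # most frequent value

print(mean_age, median_age, mode_age)
```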

Geometric mean

Used to calculate mean growth rate

Definition: the antilog of the mean of the log $X_i$

$$G = (X_1 X_2 \cdots X_n)^{1/n}$$

$$\log G = \frac{\log X_1 + \log X_2 + \dots + \log X_n}{n}$$

Geometric mean

Example: Root growth at 25 °C, calculate mean growth rate (mm/d).

Day     Root length (mm)   Growth rate (Xi, mm/d)   log(Xi)
0       17
1       23                 1.352941176              0.131279
2       30                 1.304347826              0.115393
3       38                 1.266666667              0.102662
4       51                 1.342105263              0.127787
5       72                 1.411764706              0.149762
6       86                 1.194444444              0.077166
Total                      7.872270083              0.70405

$$\log G = \frac{0.7040}{6} = 0.1173, \quad G = \log^{-1} 0.1173 = 1.31 \text{ (mm/d)}$$
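The calculation in the table can be reproduced directly. In this sketch the growth rates are computed as the ratio of each day's root length to the previous day's, which is how the values in the table's growth-rate column arise; the variable names are illustrative.

```python
import math

lengths = [17, 23, 30, 38, 51, 72, 86]  # root length (mm) on days 0..6, from the table

# Growth rate for each day: length divided by the previous day's length.
rates = [later / earlier for earlier, later in zip(lengths, lengths[1:])]

# Geometric mean: antilog of the mean of the base-10 logs.
log_mean = sum(math.log10(r) for r in rates) / len(rates)
geometric_mean = 10 ** log_mean

print(round(log_mean, 4), round(geometric_mean, 2))
```

Because the rates are successive ratios, the geometric mean also equals (86/17)^(1/6), the constant daily factor that would take the root from 17 mm to 86 mm in six days.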

Look at these two data sets:

Set 1: 100, 30, 20, 7, -20, -30, -100

Set 2: 10, 3, 2, 7, -2, -3, -10

If we calculate the mean:

Set 1: $n = 7, \bar{x} = 1$

Set 2: $n = 7, \bar{x} = 1$

How to measure dispersion (spread, variability)?

Descriptive Statistics - Measures of Dispersion

Common measures
- Range
- Variance and standard deviation
- Coefficient of variation

Many distributions are well described by a measure of location and a measure of dispersion

Range

Range is the difference between the largest and smallest values in the data set

R = Max(Xi) - Min(Xi)

Heavily influenced by the two most extreme values and ignores the rest of the distribution

Set 1: 100, 30, 20, 7, -20, -30, -100, R1 = 200

Set 2: 10, 3, 2, 7, -2, -3, -10, R2 = 20

Variance and Standard Deviation - Population

Suppose we have N measurements of a particular variable in a population: X1, X2, X3, …, XN.

The mean is $\mu$. As $\sum (X_i - \mu) = 0$, we define:

$$\sigma^2 = \frac{1}{N}(X_1 - \mu)^2 + \frac{1}{N}(X_2 - \mu)^2 + \dots + \frac{1}{N}(X_N - \mu)^2 = \frac{\sum (X_i - \mu)^2}{N}$$

as the variance (its unit is the unit of X squared), and

$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$

as the standard deviation.

Variance and Standard Deviation - Sample

Suppose we have n measurements of a particular variable in a sample: X1, X2, X3, …, Xn.

The mean is $\bar{x}$. We define:

$$s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1}$$

as the mean square, or sample variance, and

$$s = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}}$$

as the standard deviation.

Variance and Standard Deviation

Corrected Sum of Squares (CSS)

$$SS = \sum (X_i - \bar{x})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$$

$$s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1}$$

Degrees of freedom: $df = n - 1$
- n-1 is used because if we know n-1 deviations, the nth deviation is known
- Deviations have to sum to zero

Example

Suppose the ages of the 10 trees you are studying are: 34, 24, 56, 52, 21, 44, 64, 34, 42, 46. We calculated $\bar{x} = 41.7$.

Calculate the range, variance, standard deviation and CV.

No.     Xi    x_bar   Xi-x_bar   (Xi-x_bar)^2   Xi^2
1       34    41.7    -7.7       59.29          1156
2       24    41.7    -17.7      313.29         576
3       56    41.7    14.3       204.49         3136
4       52    41.7    10.3       106.09         2704
5       21    41.7    -20.7      428.49         441
6       44    41.7    2.3        5.29           1936
7       64    41.7    22.3       497.29         4096
8       34    41.7    -7.7       59.29          1156
9       42    41.7    0.3        0.09           1764
10      46    41.7    4.3        18.49          2116
Total   417           0          1692.1         19081

R = 64 - 21 = 43 y, s² = 1692.1/9 = 188.01 y², s = 13.71 y, CV = 13.71/41.7 × 100 = 32.9%.
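The worked example can be checked with a short script. This sketch uses the computational form of the corrected sum of squares, SS = ΣXi² - (ΣXi)²/n, with n - 1 degrees of freedom; the variable names are illustrative.

```python
import math

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages from the example (years)
n = len(ages)
mean = sum(ages) / n

value_range = max(ages) - min(ages)

# Corrected sum of squares: sum of Xi^2 minus (sum of Xi)^2 / n.
ss = sum(x * x for x in ages) - sum(ages) ** 2 / n

variance = ss / (n - 1)   # sample variance, n - 1 degrees of freedom
sd = math.sqrt(variance)  # sample standard deviation
cv = 100 * sd / mean      # coefficient of variation, in percent

print(value_range, round(ss, 1), round(variance, 2), round(sd, 2), round(cv, 1))
```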

Coefficient of Variation

Relative variation rather than absolute variation such as the standard deviation

Definition of C.V.

$$CV = \frac{s}{\bar{x}} \times 100$$

Useful in comparing variation between two distributions
- Used particularly in comparing laboratory measures to identify those determinations with more variation

Example

Set 1: 100, 30, 20, 7, -20, -30, -100

Set 2: 10, 3, 2, 7, -2, -3, -10

Calculate $\bar{x}$, $s^2$, $s$ and CV.

Set   x_bar   s^2      s      CV (%)
1     1       3773.7   61.4   6143
2     1       44.7     6.7    668
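As a sketch of how the two sets compare, the standard `statistics` module gives the sample variance and standard deviation directly, and the CV follows from the definition CV = s / x̄ × 100. The function name `describe` is hypothetical.

```python
import statistics

def describe(values):
    """Return (mean, sample variance, sample SD, CV in percent)."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # divides by n - 1
    sd = statistics.stdev(values)
    return mean, variance, sd, 100 * sd / mean

set1 = [100, 30, 20, 7, -20, -30, -100]
set2 = [10, 3, 2, 7, -2, -3, -10]

for name, values in (("Set 1", set1), ("Set 2", set2)):
    mean, variance, sd, cv = describe(values)
    print(f"{name}: mean={mean}, s2={variance:.1f}, s={sd:.1f}, CV={cv:.0f}%")
```

Both sets have the same mean, so the tenfold difference in standard deviation carries over unchanged to the CV.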

Box Plots

A descriptive method to convey information about measures of location and dispersion (box-and-whisker plots)

Construction of a boxplot
- The box is the IQR
- A line is drawn at the median
- Whiskers extend to the smallest and largest observations
- Other conventions can be used, especially to represent extreme values

[Figure: boxplots of Increment in Systolic B.P. for Drugs 1-4.]

Box and Whisker Plot (or Boxplot)

A box and whisker plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A box plot (as it is often called) is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set. Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

5-Number Summary

A 5-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set. It consists of 5 values: the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median. A 5-number summary can be represented in a diagram known as a box and whisker plot. In cases where we have more than one data set to analyse, a 5-number summary is constructed for each, with corresponding multiple box and whisker plots.
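A 5-number summary can be sketched with the standard `statistics` module (Python 3.8 or later). Several conventions exist for computing quartiles, and software packages differ; the `method="inclusive"` option used here is one such convention and is an assumption, not necessarily the one MINITAB applies. The data are the tree ages from the earlier example.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)
summary = five_number_summary(ages)
print(summary)
```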

Outlier

An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others. An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.

If an outlier is a genuine result, it is important because it might indicate an extreme of behaviour of the process under study. For this reason, all outliers must be examined carefully before embarking on any formal analysis. Outliers should not routinely be removed without further justification.

Interpreting a Boxplot

The boxplot is interpreted as follows:
- The box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The range of the middle two quartiles is known as the inter-quartile range.
- The line in the box indicates the median value of the data.
- If the median line within the box is not equidistant from the hinges, then the data are skewed.
- The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case the whiskers extend to at most 1.5 times the inter-quartile range beyond the hinges.
- The points outside the ends of the whiskers are outliers or suspected outliers.
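The whisker rule can be sketched in code: compute the quartiles, place fences 1.5 × IQR beyond them, and flag anything outside. The data are hypothetical (the tree ages from the earlier example plus one invented extreme value of 120), and the quartile convention (`method="inclusive"`) is an assumption; plotting software may compute hinges slightly differently.

```python
import statistics

def boxplot_limits(values):
    """Return (lower fence, upper fence, whisker ends, suspected outliers)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    # Whiskers end at the most extreme observations still inside the fences.
    return lower_fence, upper_fence, (min(inside), max(inside)), outliers

# Hypothetical data: the ten tree ages plus one invented extreme value.
data = [21, 24, 34, 34, 42, 44, 46, 52, 56, 64, 120]
lower_fence, upper_fence, whiskers, outliers = boxplot_limits(data)
print(lower_fence, upper_fence, whiskers, outliers)
```

Here only the value 120 lies beyond the upper fence, so it would be plotted individually while the upper whisker stops at 64.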

Boxplot Enhancements

Beyond the basic information, boxplots sometimes are enhanced to convey additional information:
- The mean and its confidence interval can be shown using a diamond shape in the box.
- The expected range of the median can be shown using notches in the box.
- The width of the box can be varied in proportion to the log of the sample size.

Advantages of Boxplots

Boxplots have the following strengths:
- They graphically display a variable's location and spread at a glance.
- They provide some indication of the data's symmetry and skewness.
- Unlike many other methods of data display, boxplots show outliers.
- By using a boxplot for each categorical variable side-by-side on the same graph, one can quickly compare data sets.

Disadvantage of Boxplots

One drawback of boxplots is that they tend to emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution. Displaying a histogram in conjunction with the boxplot helps in this regard, and both are important tools for exploratory data analysis.

Boxplot Example 1

Check location and variation shifts. Box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data.

Sample plot: This box plot, comparing four machines for energy output, shows that machine has a significant effect on energy with respect to both location and variation. Machine 3 has the highest energy response (about 72.5); machine 4 has the least variable energy response, with about 50% of its readings being within 1 energy unit.

These MINITAB boxplots represent lottery payoffs for winning numbers for three time periods (May 1975-March 1976, November 1976-September 1977, and December 1980-September 1981).

The median for each dataset is indicated by the black center line, and the first and third quartiles are the edges of the red area, which is known as the inter-quartile range (IQR).

The extreme values (within 1.5 times the inter-quartile range from the upper or lower quartile) are the ends of the lines extending from the IQR. Points farther than 1.5 times the IQR from the nearer quartile are plotted individually as asterisks. These points represent potential outliers.

In this example, the three boxplots have nearly identical median values. The IQR is decreasing from one time period to the next, indicating reduced variability of payoffs in the second and third periods. In addition, the extreme values are closer to the median in the later time periods.

Boxplot Example 2

As shown in the figure, a line is drawn from the upper hinge to the upper adjacent value and from the lower hinge to the lower adjacent value. Every score between the inner and outer fences is indicated by an "o" whereas a score beyond the outer fences is indicated by a "*".

It is often useful to compare data from two or more groups by viewing box plots from the groups side by side. The data from 2b are higher, more spread out, and have a positive skew. That the skew is positive can be determined by the fact that the mean is higher than the median and the upper whisker is longer than the lower whisker.

Boxplot Example 3

Although the medians are all roughly the same, you can see at a glance that the spread of each data set is different. The boxplot on the left shows data that appear to be distributed evenly. The median is in the middle of the rectangle, and the whiskers are about the same length. In addition, the plot contains no outside values. The median of the second plot from the left appears to be slightly off-center. The number of extreme values is a point of concern because it suggests that the data vary widely.

The third boxplot shows data that have less variation and spread than the other plots. The fourth boxplot shows data that are markedly skewed upward. The median of this plot is closer to the top of the rectangle than to the bottom, and the upper whisker is longer than the bottom one. All the boxplots have approximately the same median, and the two boxplots on the left have approximately the same variation in the data.

Descriptive Statistics (Summary)

Graphical Summaries
- Frequency distribution
- Histogram
- Stem and Leaf plot
- Boxplot

Numerical Summaries
- Location - mean, median, mode
- Dispersion - range, variance, standard deviation
- Shape

Software

Statistical software: SAS, SPSS, Stata, BMDP, MINITAB

Graphical software: Sigmaplot, Harvard Graphics, PowerPoint, Excel
