Lecture 1. Brief history, basic concepts and descriptive statistics Biostatistics
Xinhai Li
How to learn statistics in this class
• No preview needed before the class
• Focus on listening and thinking (3 hours / week) at class
– Don’t take notes (wasting your time)
• Intensive review (1-2 hours / week) after the class
• Do the homework (1 hour / week) after the class
Textbooks
Sokal, R. R. and F. J. Rohlf. 1995. Biometry: the principles and practice of statistics in biological research. Third Edition. W. H. Freeman and Co.: New York. 887 pp.
Zar, J. H. 1999. Biostatistical Analysis. Fourth Edition. Prentice Hall: New Jersey, 663 pp.
From 1976 (the earliest year indexed) to mid 1997 (the date the search was performed) the following counts were obtained: Darwin (all publications, e.g. The origin of the species) = 7,111. Sokal and Rohlf Biometry = 31,757.
Overview
Biostatistics or biometry
• "Biostatistics" and "biometry" are sometimes used interchangeably.
• "Biometry" is more often used of biological or agricultural applications.
• "Biostatistics" is more often used of medical applications.
What is statistics?
• Statistics is the science of collection, analysis, interpretation, and presentation of data.
• Descriptive statistics are numerical estimates that organize, sum up or present the data.
• Inferential statistics is the process of inferring from a sample to the population.
Image source: http://teeky.org/search-engine-optimization/determine-success-via-website-statistics/
Statistical errors in publications
Underwood (1981) found statistical errors in 78% of the papers he surveyed in marine ecology. Hurlbert (1984) reported that in two separate surveys 26% and 48% of the ecological papers surveyed showed the statistical error of pseudoreplication (Krebs 1999).
Charles J. Krebs. 1999. Ecological Methodology, 2nd ed. Addison-Wesley Educational Publishers, Inc.
“50% of medical literature have statistical flaws (Altman et al. 1991). Serious statistical errors were found in 40% of 164 articles published in a psychiatry journal (McGuigan 1995)” (Ercan et al. 2007).
Ilker Ercan, Berna Yazıcı, Yaning Yang, Guven Özkaya, Sengul Cangur, Bulent Ediz, Ismet Kan. Misusage Of Statistics In Medical Research. Eur J Gen Med 2007; 4(3):128-134.
Contents
• Brief history, basic concepts, and descriptive statistics
• Probability distribution
• Hypothesis testing
• Analysis of variance (ANOVA)
• Simple linear regression and correlation
• Analysis of covariance (ANCOVA)
• Nonparametric statistics
• Multivariate analysis
• Generalized linear model
• Common mistakes
Statistical software R http://cran.r-project.org
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
In 1995, R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand.
Since mid-1997 there has been a core group (the “R Core Team”) who can modify the R source code archive.
It is free software distributed under a GNU-style copyleft, and an official part of the GNU project (“GNU S”).
It had over 2100 packages in 2010.
Citation
R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. ISBN: 3-900051-07-0. http://www.R-project.org.
Today’s contents
Introduction to biological statistics
• History
• Data in biology
• Descriptive statistics
History
• John Graunt (1620-1674, British) and William Petty (1623-1687, British): developed early human statistical and census methods that later provided a framework for modern demography based on life table, mean value, census, longevity, and mortality.
• Blaise Pascal (1623-1662, French), Pierre de Fermat (1601-1665, French) and Jacques Bernoulli (1654-1705, Swiss): probability theory (binomial coefficients)
• Abraham de Moivre (1667-1754, French): combined statistics with probability theory; approximated the normal distribution through the expansion of the binomial distribution
• Carl Friedrich Gauss (1777-1855, German): least squares, normal distribution
• Adolphe Quetelet (1796-1874, Belgian): significance of constancy of large numbers (rate of criminal events)
• Florence Nightingale (1820-1910, British): graphic presentation of statistics
Emergence of statistics in 1800’s
• Laplace wrote a book describing how to compute the future positions of planets and comets on the basis of a few observations from earth.
• Napoleon: "I find no mention of God in your treatise, Mr. Laplace."
• Laplace replied: "I had no need for that hypothesis."
• The observations of planets and comets from this earthly platform did not fit the predicted positions exactly. Laplace and his fellow scientists attributed this to errors in the observations, sometimes due to perturbations in the earth's atmosphere, other times due to human error.
• By the end of the nineteenth century, the errors had mounted instead of diminishing. As measurements became more and more precise, more and more error cropped up.
Gaps between Darwinism and genetics in early 1900’s
Core Evolution Concepts
Population: Organisms that share a common gene pool (Species = actually or potentially interbreeding organisms)
Variation: Modifications of forms are produced by chance via mutations, genetic coding errors of individual organisms
Natural Selection: Reproduction & survival of organisms whose heritable traits are better suited to existing environmental conditions
Retention: Persistence within a population of the selected variation(s) over successive generations
Mendel’s law of segregation
By carrying out the monohybrid crosses, Mendel determined that the 2 alleles for each character segregate during gamete production.
Neo-Darwinian Modern evolutionary synthesis in 1930’s
• Sir Ronald A. Fisher (1890-1962, British) developed several basic statistical methods in support of his work The Genetical Theory of Natural Selection
• Sewall G. Wright (1889-1988, American) used statistics in the development of modern population genetics
• John B. S. Haldane (1892-1964, British) reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics in his book The Causes of Evolution.
Francis Galton
• Francis Galton (1822-1911, British), father of biometry and eugenics: regression, correlation
– African explorer and elected Fellow in the Royal Geographic Society
– Creator of the first weather maps and establisher of the meteorological theory of anticyclones
– Coined term "eugenics" and phrase "nature versus nurture"
– Developed statistical concepts of correlation and regression to the mean
– Discovered that fingerprints were an index of personal identity and persuaded Scotland Yard to adopt a fingerprinting system
– First to utilize the survey as a method for data collection
– Produced over 340 papers and books throughout his lifetime
– Knighted in 1909
Galton, F. (1869/1892/1962). Hereditary Genius: An Inquiry into its Laws and Consequences. Macmillan/Fontana, London.
Galton, F. (1883/1907/1973). Inquiries into Human Faculty and its Development. AMS Press, New York.
Karl Pearson
• Karl Pearson (1857-1936, British): continued in the tradition of Galton and laid the foundation for much of descriptive statistics.
– In 1884, Pearson became Professor of Applied Mathematics and Mechanics at University College London.
– In 1901, Pearson, Weldon and Galton founded Biometrika, a “Journal for the Statistical Study of Biological Problems”.
– In 1907, Pearson took over a research unit founded by Galton and reconstituted it as the Francis Galton Laboratory of National Eugenics.
– In 1911, Pearson founded the world's first university statistics department at University College London.
Contributions: method of moments, chi-square, correlation
Source: http://www.economics.soton.ac.uk/staff/aldrich/New%20Folder/kpreader1.htm
Ronald A. Fisher
• Sir Ronald Aylmer Fisher (1890–1962), an English statistician, evolutionary biologist, and geneticist.
• He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science" [1] and Richard Dawkins described him as "the greatest biologist since Darwin" [2] (from Wikipedia).
– In 1933 he became a Professor of Eugenics at University College London
– In 1943 he was offered the Balfour Chair of Genetics at Cambridge University
Contributions: analysis of variance, maximum likelihood, Fisher information
Fisher, R.A. 1925. Statistical Methods for Research Workers
Fisher, R.A. 1935. The design of experiments
[1] Hald, Anders (1998). A History of Mathematical Statistics. New York: Wiley.
[2] Dawkins, Richard (1995). River out of Eden.
Image source: http://en.wikipedia.org/wiki/Image:RonaldFisher.jpg
Society and publications in early years
• In 1901, Pearson, Weldon and Galton founded Biometrika, a “Journal for the Statistical Study of Biological Problems”.
• By the 1940s, the application of statistics to biological questions had begun to have a profound impact on the scientific community.
• The biometrics section of the American Statistical Association began to publish the Biometrics Bulletin in 1945.
• In 1947, the International Biometric Society (IBS) was established. Shortly thereafter, the IBS began publishing Biometrics.
• In 1980, the NBC television network aired a documentary entitled "If Japan Can, Why Can't We?"
– The documentary was really a description of the influence one man had on Japanese industry, W. Edwards Deming.
• Deming's major point about quality control is that the output of a production line is variable, because that is the nature of all human activity. What the customer wants is not a perfect product but a reliable product.
A story of statistics and industry
Deming’s quality control
• Deming proposed that the production line be seen as a stream of activities that start with raw material and end with finished product.
• Each activity can be measured, so each activity has its own variability due to environmental causes.
• Instead of waiting for the final product to exceed arbitrary limits of variability, the managers should be looking at the variability of each of these activities.
• The most variable of the activities is the one that should be addressed. Once that variability is reduced, there will be another activity that is "most variable," and it should then be addressed.
• Thus, quality control becomes a continuous process, where the most variable aspect of the production line is constantly being worked on.
Data
• Datum is one observation about the variable being measured.
• Data are a collection of observations.
• A population consists of all subjects about whom the study is being conducted.
• A sample is a sub-group of the population being examined.
Parameters vs. Statistics
• A parameter is a numerical quantity measuring some aspect of a population of scores.
– For example, the mean is a measure of central tendency
– Usually use Greek letters
• A statistic computed in samples is used to estimate parameters

Quantity             Parameter   Statistic
Mean                 μ           M
Standard deviation   σ           s
Proportion           π           p
Correlation          ρ           r
– ordered, constant scale, but no natural zero
– differences make sense, but ratios do not (e.g., 30º-20º = 20º-10º, but 20º is not twice as hot as 10º)
– e.g., temperature (C, F), dates
• Ratio scale variable
– ordered, constant scale, natural zero
Derived variables
• Ratio: sex ratio
• Index: S&P 500 index (stock market)
• Rate: growth rate
Accuracy and precision of data
[Figure: panels labelled Accuracy, Precision, Inaccuracy]
Accuracy of data
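Mean square error
for estimating population mean (μ) using sample mean (M)
$$\mathrm{Bias} = E(M) - \mu$$
$$\mathrm{MSE}(M) = E[(M - \mu)^2] = \mathrm{Var}(M) + [E(M) - \mu]^2$$
Accuracy is measured by MSE(M); the Var(M) term reflects precision and the $[E(M) - \mu]^2$ term reflects bias.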
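The decomposition of mean square error into a precision (variance) term and a squared bias term can be checked numerically. The sketch below assumes a hypothetical true mean `mu` and a handful of made-up sample means `M`; none of these values come from the lecture.

```r
# Hypothetical true population mean and a set of sample means (illustrative only)
mu <- 10
M  <- c(9.2, 10.4, 11.1, 10.8, 9.9, 10.6)

mse      <- mean((M - mu)^2)        # accuracy: average squared distance from mu
variance <- mean((M - mean(M))^2)   # precision: spread of M around its own average
bias     <- mean(M) - mu            # systematic offset of M from mu

print(c(mse = mse, variance = variance, bias_squared = bias^2))

# MSE = variance + bias^2 holds exactly for these empirical averages
stopifnot(isTRUE(all.equal(mse, variance + bias^2)))
```

Note that the variance here divides by the number of values, not by one less, which is what makes the identity exact.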
Summarizing Data
• Frequency Distribution
• Cumulative Distributions
• Relative Frequency Distribution
• Percent Frequency Distribution
• Bar Graph
• Histogram
• Pie Chart
• Dot Plot
Frequency Distribution for Qualitative data
A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes.
The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.
Frequency Distribution
Guests staying at Holiday Inn were asked to rate the quality of their accommodations.

Rating          Frequency
Poor            2
Below Average   3
Average         5
Above Average   9
Excellent       1
Total           20
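A frequency table like this one can be produced in R with `table()`. This is a minimal sketch: the individual guest responses are not given on the slide, so the vector `ratings` is rebuilt from the printed counts purely for illustration.

```r
rating_levels <- c("Poor", "Below Average", "Average", "Above Average", "Excellent")

# Responses reconstructed from the slide's counts (the raw answers are not given)
ratings <- factor(rep(rating_levels, times = c(2, 3, 5, 9, 1)),
                  levels = rating_levels)

freq <- table(ratings)   # count of responses in each class
print(freq)
print(sum(freq))         # total number of guests
```

Declaring the ratings as a factor with explicit levels keeps the classes in their natural order instead of alphabetical order.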
An example for quantitative data: Hudson Auto Repair
Sample of Parts Cost for 50 Tune-ups
Frequency Distribution
• Guidelines for selecting number of classes
– Use between 5 and 20 classes
– Data sets with a larger number of elements usually require a larger number of classes
– Smaller data sets usually require fewer classes
• Use classes of equal width
$$\text{Approximate class width} = \frac{\text{Largest Data Value} - \text{Smallest Data Value}}{\text{Number of Classes}}$$
Frequency Distribution
For Hudson Auto Repair, if we choose six classes:
Approximate Class Width = (109 - 52)/6 = 9.5 ≈ 10

Parts Cost ($)   Frequency
50-59            2
60-69            13
70-79            16
80-89            7
90-99            7
100-109          5
Total            50
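The class-width arithmetic and the binning step can be sketched in R as follows. The smallest and largest values (52 and 109) and the choice of six classes come from the slide; the starting point of 50 and the short `costs` vector are assumptions made only so the example runs, since the 50 raw costs are not reproduced here.

```r
smallest  <- 52
largest   <- 109
n_classes <- 6

width_raw <- (largest - smallest) / n_classes   # 9.5
width     <- ceiling(width_raw)                 # rounded up to 10

# Class boundaries 50, 60, ..., 110 (starting at 50 is an assumption)
breaks <- seq(50, by = width, length.out = n_classes + 1)

# Hypothetical parts costs, for illustration only
costs   <- c(52, 61, 68, 75, 79, 82, 97, 104, 109)
classes <- cut(costs, breaks = breaks, right = FALSE)   # intervals [50,60), [60,70), ...
print(table(classes))
```

Using `right = FALSE` makes each class include its lower limit, which matches classes written as 50-59, 60-69, and so on.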
Relative Frequency Distribution
The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.
A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.
Percent Frequency Distribution
The percent frequency of a class is the relative frequency multiplied by 100.
A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.
Relative Frequency and Percent Frequency Distributions
Holiday Inn Quality Ratings

Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100

For example, Excellent: 1/20 = .05
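Relative and percent frequencies follow directly from the counts. A minimal R sketch using the Holiday Inn frequencies from the earlier table (the variable names are illustrative):

```r
freq <- c("Poor" = 2, "Below Average" = 3, "Average" = 5,
          "Above Average" = 9, "Excellent" = 1)

rel <- freq / sum(freq)   # relative frequency: share of the total
pct <- 100 * rel          # percent frequency

print(data.frame(Frequency = freq, Relative = rel, Percent = pct))
```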
Relative Frequency and Percent Frequency Distributions
Hudson Auto Repair

Parts Cost ($)   Relative Frequency   Percent Frequency
50-59            .04                  4
60-69            .26                  26
70-79            .32                  32
80-89            .14                  14
90-99            .14                  14
100-109          .10                  10
Total            1.00                 100

For example, 50-59: 2/50 = .04
Relative Frequency and Percent Frequency Distributions
Insights gained from the percent frequency distribution
• Only 4% of the parts costs are in the $50-59 class.
• 30% of the parts costs are under $70.
• The greatest percentage (32% or almost one-third) of the parts costs are in the $70-79 class.
• 10% of the parts costs are $100 or more.
Bar Graph
R function: barplot()
[Figure: bar graph; vertical axis "Number" (0 to 12); horizontal axis lists the surnames 陈 杜 李 刘 马 宋 王 魏 吴 徐 杨 于 袁 张 赵 郑 朱]
• A bar graph is a graphical device for depicting qualitative data.
• Specify the labels that are used for each of the classes on one axis (usually the horizontal axis).
• A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).
• Use a bar of fixed width drawn above each class label.
• The bars are separated to emphasize the fact that each class is a separate category.
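The points above can be tried out with R's `barplot()`. The sketch below reuses the Holiday Inn rating counts from earlier in the lecture; the axis labels and title are illustrative choices, not taken from the slide.

```r
freq <- c("Poor" = 2, "Below Average" = 3, "Average" = 5,
          "Above Average" = 9, "Excellent" = 1)

# Class labels go on the horizontal axis, frequencies on the vertical axis;
# barplot() returns the horizontal midpoints of the bars
mids <- barplot(freq,
                xlab = "Rating", ylab = "Frequency",
                main = "Holiday Inn Quality Ratings",
                cex.names = 0.8)   # shrink labels so long names fit
```

Passing `freq / sum(freq)` instead of `freq` would give the same picture on a relative frequency scale.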
Histogram
• Another common graphical presentation of quantitative data is a histogram.
• The variable of interest is placed on the horizontal axis.
• A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.
• Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.
R code
hist(rnorm(100),nclass=6)
Pie Chart
[Figure: pie chart of Holiday Inn Quality Ratings]
R code
x=sample(1:100,6,replace=TRUE)
names(x)=c('A','B','C','D','E','F')
pie(x)
• The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.
• First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
• Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.
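The sector-angle rule is simple arithmetic: each class takes its relative frequency times 360 degrees. A short R sketch, using the Holiday Inn relative frequencies from the earlier table as input (the variable names are illustrative):

```r
rel <- c("Poor" = 0.10, "Below Average" = 0.15, "Average" = 0.25,
         "Above Average" = 0.45, "Excellent" = 0.05)

angles <- 360 * rel   # degrees of the circle taken by each class
print(angles)

pie(rel, main = "Holiday Inn Quality Ratings")   # pie() does the same subdivision
```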
Dot Plot
• One of the simplest graphical summaries of data is a dot plot.
• A horizontal axis shows the range of data values.
• Then each data value is represented by a dot placed above the axis.
Cumulative Distributions
Cumulative frequency distribution: shows the number of items with values less than or equal to the upper limit of each class.
Cumulative relative / percent frequency distribution: shows the proportion / percentage of items with values less than or equal to the upper limit of each class.
R code
x=seq(-5,5,by=0.1)
plot(x,pnorm(x,mean=0,sd=1),type='l')
Cumulative Distributions
• Hudson Auto Repair

Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59       2                      .04                             4
≤ 69       15                     .30                             30
≤ 79       31                     .62                             62
≤ 89       38                     .76                             76
≤ 99       45                     .90                             90
≤ 109      50                     1.00                            100

For example, ≤ 69: 2 + 13 = 15 and 15/50 = .30
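The cumulative columns are running totals of the class frequencies. A minimal R sketch with the Hudson Auto Repair class counts from the frequency table (the variable names are illustrative):

```r
freq <- c("50-59" = 2, "60-69" = 13, "70-79" = 16,
          "80-89" = 7, "90-99" = 7, "100-109" = 5)

cum_freq <- cumsum(freq)              # running total of counts
cum_rel  <- cum_freq / sum(freq)      # cumulative relative frequency
cum_pct  <- 100 * cum_rel             # cumulative percent frequency

print(data.frame(CumFreq = cum_freq, CumRel = cum_rel, CumPct = cum_pct))
```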
Stem-and-Leaf Display
Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
• A stem-and-leaf display shows both the rank order and shape of the distribution of the data.
• It is similar to a histogram on its side, but it has the advantage of showing the actual data values.
• The first digits of each data item are arranged to the left of a vertical line.
• To the right of the vertical line we record the last digit for each item in rank order.
• Each line in the display is referred to as a stem.
• Each digit on a stem is a leaf.
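R can draw a stem-and-leaf display with `stem()`. The values below are an assumption: they are what the small display on this slide appears to encode with a leaf unit of 0.1, and the layout R chooses may differ from the slide's.

```r
# Values as read off the slide's display (assumed, leaf unit 0.1)
x <- c(8.6, 8.8, 9.1, 9.4, 10.2, 11.0, 11.7)

stem(x)   # prints stems to the left of the bar and leaves to the right

# Keep the printed lines if the display is needed as text
display <- capture.output(stem(x))
```

The `scale` argument of `stem()` can be used to lengthen or shorten the display when the default grouping of stems is not the one wanted.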
a stem-and-leaf display of these data will be
Leaf Unit = 10
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7
The 82 in 1682 is rounded down to 80 and is represented as an 8.
Probability density function (PDF)
A probability density function (pdf) is a function that represents a probability distribution in terms of integrals.
Formally, a probability distribution has density f(x), such that the probability of the interval [a, b] is given by
$$\int_a^b f(x)\,dx$$
Intuitively, if a probability distribution has density f(x), then the infinitesimal interval [x, x + dx] has probability f(x) dx.
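As an illustration of the integral, the probability of an interval can be obtained by numerically integrating a density. The sketch assumes a standard normal density and the interval [-1.96, 1.96], both chosen here only as an example.

```r
# Density of the standard normal distribution (chosen for illustration)
f <- function(x) dnorm(x, mean = 0, sd = 1)

a <- -1.96
b <-  1.96

p_integral <- integrate(f, lower = a, upper = b)$value   # integral of f from a to b
p_cdf      <- pnorm(b) - pnorm(a)                        # same probability via the CDF

print(c(integral = p_integral, cdf = p_cdf))   # both are close to 0.95
```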
Descriptive statistics
• Are the scores generally high or generally low?
• Where the center of the distribution tends to be located
• Three measures of central tendency
– Mode
– Median
– Mean
Mode
• The most frequently occurring score
• Report mode when using nominal scale, the most frequently occurring category
• Based on the simple frequency of each score
• If you have a rectangular distribution, do not report the mode
• Unimodal, bimodal, multimodal, antimode
Example of Mode
Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
• In this case the data have two modes: 5 and 7
• Both values occur twice
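Base R has no built-in function for the statistical mode, so a common approach is to tabulate the values and pick the most frequent ones. A minimal sketch, assuming the ten measurements of this example (they also appear in the median example later in the lecture):

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

counts <- table(x)                                   # frequency of each distinct score
modes  <- as.numeric(names(counts)[counts == max(counts)])

print(counts)
print(modes)   # every value that attains the highest frequency
```

Returning all values with the maximum count handles bimodal and multimodal data without extra code.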
Example of Mode
Measurements x: 3, 5, 1, 1, 4, 7, 3, 8, 3
• Mode: 3
• Notice that it is possible for a data set not to have any mode.
Median
• Score at the 50th percentile
• For a normal distribution the median is the same as the mode
• Arrange scores from lowest to highest; if there is an odd number of scores the median is the one in the middle, if there is an even number of scores then average the two scores in the middle
• Used when have ordinal scale and when the distribution is skewed
Example of Median

Measurements x   Ranked x
3                0
5                1
5                2
1                3
7                4
2                5
6                5
7                6
0                7
4                7
Sum: 40          Sum: 40

• Median: (4+5)/2 = 4.5
• Notice that only the two central values are used in the computation.
• The median is not sensitive to extreme values
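The median computation can be reproduced step by step in R and compared with the built-in `median()`. The sketch also replaces the largest score with a hypothetical extreme value (700, an arbitrary choice) to show that the median stays put while the mean does not.

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

ranked <- sort(x)
n <- length(ranked)

# Middle score for odd n, average of the two middle scores for even n
med_manual <- if (n %% 2 == 1) {
  ranked[(n + 1) / 2]
} else {
  mean(ranked[c(n / 2, n / 2 + 1)])
}
print(c(manual = med_manual, builtin = median(x)))

# Replace one of the largest scores with a hypothetical extreme value
x_out <- replace(x, which.max(x), 700)
print(c(median = median(x_out), mean = mean(x_out)))
```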
Mean
• Score at the exact mathematical center of distribution (average)
• Used with interval and ratio scales, and when have a symmetrical and unimodal distribution
• Not accurate when distribution is skewed because it is pulled towards the tail
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Uses of the Mean
• Describes scores
• Deviation from the mean gives us the error of our estimate of the score, with total error equal to zero
• Predict scores
• Describe a score's location
• Describe the population mean (μ), which is a parameter
Deviations around the Mean
• The score minus the mean
• Include plus or minus sign
• Sum of deviations from the mean always equals zero: $\sum (X - M) = 0$
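The zero-sum property of deviations is easy to confirm numerically. A short R sketch using the ten measurements from the mode and median examples (any data set would behave the same way):

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

M   <- mean(x)    # sum of scores divided by the number of scores
dev <- x - M      # signed deviations: negative below the mean, positive above

print(M)
print(dev)
print(sum(dev))   # deviations cancel out
```

With non-integer data the sum may come out as a tiny number such as 1e-16 instead of exactly 0, which is floating-point rounding and not a failure of the property.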
Range
• Report the maximum difference between the lowest and highest
• Semi-interquartile range used with the median: one half the distance between the scores at the 25th and 75th percentile
Measures of Variability
• Extent to which the scores differ from each other or how spread out the scores are
• Tells us how accurately the measure of central tendency describes the distribution
• Shape of the distribution
Why do we care about variability?• Where would you rather vacation LA Bungalows• Where would you rather vacation, LA Bungalows,
where the mean temperature is 24 degrees, or Sahara Condos where the mean temperature isSahara Condos where the mean temperature is also 24 degrees?
◊ LA temperature range: day = 26y
night = 22
◊ S h t t◊ Sahara temperature range:day = 40
61night = 8
Variance
• Uses the deviation from the mean
• Remember, the sum of the deviations always equals zero, so you have to square each of the deviations
• $S_X^2$ = sum of squared deviations divided by the number of scores
• Provides information about the relative variability
Some Limits
• It isn’t the average deviation
• Interpretation doesn’t make sense because:
– Number is too large
– And it is a squared value
The standard deviation (SD)
• Take the square root of the variance
• Denoted $S_X$
• Uses the same units of measurement as the raw scores
• How much scores deviate below and above the mean
The standard deviation (SD)
• Standard deviation ~ the mean of deviations from the mean (sort of)
σ (lowercase sigma) is the population standard deviation.
S is the sample standard deviation.
ŝ (s-hat) is the sample estimate of σ.
The deviation (definitional) formula for the population standard deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
• The larger the standard deviation the more variability there is in the scores
• The standard deviation is somewhat less sensitive to extreme outliers than the range (as N increases)
The deviation (definitional) formula for the sample standard deviation
$$S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$
• What’s the difference between this formula and the population standard deviation?
• In the first case, all the Xs represent the entire population. In the second case, the Xs represent a sample.
Standard Deviation: Example
         X      X - X̄     (X - X̄)²
        21      -5.8      33.64
        25      -1.8       3.24
        24      -2.8       7.84
        30       3.2      10.24
        34       7.2      51.84
Mean    26.8     0        21.36
$$S_X = \sqrt{\frac{106.8}{5}} = \sqrt{21.36} = 4.62$$
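The worked example can be checked with a few lines of Python. This is only a sketch that follows the definitional formula with N in the denominator; the variable names are chosen here for illustration.

```python
import math

scores = [21, 25, 24, 30, 34]  # the five scores from the example

n = len(scores)
mean = sum(scores) / n                       # 26.8
deviations = [x - mean for x in scores]      # X minus the mean
squared = [d ** 2 for d in deviations]       # squared deviations
ss = sum(squared)                            # sum of squares, about 106.8
variance = ss / n                            # about 21.36
s = math.sqrt(variance)                      # rounds to 4.62

for x, d, d2 in zip(scores, deviations, squared):
    print(x, round(d, 1), round(d2, 2))
print("mean =", mean, "SS =", round(ss, 2), "S =", round(s, 2))
```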
Calculating S using the raw-score formula
$$S_X = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$$
To calculate ΣX² you square all the scores first and then sum them
To calculate (ΣX)² you sum all the scores first and then square them
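A short Python sketch comparing the raw-score formula with the definitional formula on the example scores 21, 25, 24, 30, 34. The function names are hypothetical; both use N in the denominator, as on the slides.

```python
import math


def sd_definitional(xs):
    """S from deviations: sqrt(sum((X - mean)^2) / N)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / n)


def sd_raw_score(xs):
    """S from raw scores: sqrt((sum(X^2) - (sum(X))^2 / N) / N)."""
    n = len(xs)
    sum_x = sum(xs)                      # sum first ...
    sum_x_sq = sum(x ** 2 for x in xs)   # ... versus square first, then sum
    return math.sqrt((sum_x_sq - sum_x ** 2 / n) / n)


scores = [21, 25, 24, 30, 34]
print("sum of X^2   =", sum(x ** 2 for x in scores))  # 3698
print("(sum of X)^2 =", sum(scores) ** 2)             # 17956
print(sd_definitional(scores), sd_raw_score(scores))
```

The two formulas are algebraically the same. The raw-score version is convenient by hand, but on a computer it can lose precision when the scores are very large relative to their spread.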
Population and sample variance and standard deviation
• When we have data from the entire population we use $\mu$ (not $\bar{X}$) to compute $\sigma_X$ using the same formula
• Variance and standard deviations of the sample are biased estimates of the population values
Estimating the population standard deviation from a sample
• S, the sample standard deviation, is usually a little smaller than the population standard deviation. Why?
• The sample mean minimizes the sum of squared deviations (SS). Therefore, if the sample mean differs at all from the population mean, then the SS from the sample will be an underestimate of the SS from the population
• Therefore, statisticians alter the formula of the sample standard deviation by subtracting 1 from N
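The bias can be seen exactly, without random numbers, by listing every possible sample. The sketch below is an illustration under stated assumptions: it treats the five scores 21, 25, 24, 30, 34 as an entire population and draws all ordered samples of size 2 with replacement.

```python
from itertools import product

population = [21, 25, 24, 30, 34]   # assumed here to be the whole population
mu = sum(population) / len(population)
pop_var = sum((x - mu) ** 2 for x in population) / len(population)


def sum_of_squares(sample):
    """SS around the sample's own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))   # all 25 ordered samples

# Average the two competing variance estimates over every sample.
mean_var_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
mean_var_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print("population variance:     ", pop_var)
print("average of SS / N:       ", mean_var_n)
print("average of SS / (N - 1): ", mean_var_n_minus_1)
```

Dividing by N gives estimates that are too small on average, while dividing by N - 1 recovers the population variance on average. The correction is exact for the variance; the square root of the corrected variance is still only approximately unbiased for the standard deviation.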
Formulas for s-hat (estimated)
Definitional formula:
$$\hat{s} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$
Raw-score formula:
$$\hat{s} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$$
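A Python sketch of both s-hat formulas on the example scores 21, 25, 24, 30, 34, compared with the standard library. In Python's `statistics` module, `stdev` divides by N - 1 and `pstdev` divides by N; the other variable names are illustrative.

```python
import math
import statistics

scores = [21, 25, 24, 30, 34]
n = len(scores)
mean = sum(scores) / n

# Definitional formula with N - 1 in the denominator.
s_hat_def = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

# Raw-score formula with N - 1 in the denominator.
s_hat_raw = math.sqrt(
    (sum(x ** 2 for x in scores) - sum(scores) ** 2 / n) / (n - 1)
)

print("s-hat (definitional):", round(s_hat_def, 3))
print("s-hat (raw score):   ", round(s_hat_raw, 3))
print("statistics.stdev:    ", round(statistics.stdev(scores), 3))
print("statistics.pstdev:   ", round(statistics.pstdev(scores), 3))
```

For these five scores s-hat is about 5.17, a little larger than S = 4.62 computed with N.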
The estimated variance
The standard deviation squared
$$\hat{s}^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$$
The variance is not a very useful descriptive statistic, but it is a very important value that you will use in other techniques (e.g., the analysis of variance)
For a standard normal distribution
• Sample mean is a good estimate of population mean
• The estimate of the population variance and standard deviation tells us how spread out the scores are
• About 68% of the scores are within +1 and –1 $S_X$ of the mean
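The 68% figure can be checked from the normal distribution itself. A minimal sketch using the error function from Python's `math` module; the helper name `normal_prob_within` is made up for illustration.

```python
import math


def normal_prob_within(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print("within", k, "SD:", round(normal_prob_within(k), 4))
```

Within one standard deviation the probability is about 0.6827, which is the 68% quoted on the slide.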
Standard error
The standard error of a sample of sample size n is the sample's standard deviation divided by $\sqrt{n}$. It therefore estimates the standard deviation of the sample mean based on the population mean.
$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$
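A brief Python sketch of the standard error for the example scores 21, 25, 24, 30, 34. It assumes s is the N - 1 estimate, which is what `statistics.stdev` returns; the function name `standard_error` is illustrative.

```python
import math
import statistics


def standard_error(xs):
    """Standard error of the mean: s / sqrt(n), with s based on N - 1."""
    return statistics.stdev(xs) / math.sqrt(len(xs))


scores = [21, 25, 24, 30, 34]
se = standard_error(scores)
print("mean =", statistics.mean(scores), "SE =", round(se, 2))
```

The standard error shrinks as n grows, so the mean of a large sample is a more precise estimate than the mean of a small one.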
Coefficient of variation
In probability theory and statistics, the coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution. It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$, here expressed as a percentage:
$$CV = \frac{\sigma}{\mu} \times 100$$
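A small Python sketch of the CV. The measurements are hypothetical: the same three lengths recorded in centimetres and in millimetres, to show what "normalized" means in practice.

```python
import math
import statistics


def coefficient_of_variation(xs):
    """CV in percent: population SD divided by the mean, times 100."""
    return statistics.pstdev(xs) / statistics.mean(xs) * 100


lengths_cm = [10, 12, 14]                    # hypothetical lengths in cm
lengths_mm = [x * 10 for x in lengths_cm]    # the same lengths in mm

print("SD (cm):", round(statistics.pstdev(lengths_cm), 3))
print("SD (mm):", round(statistics.pstdev(lengths_mm), 3))
print("CV (cm):", round(coefficient_of_variation(lengths_cm), 2))
print("CV (mm):", round(coefficient_of_variation(lengths_mm), 2))
```

The standard deviation changes with the unit of measurement, but the CV does not. The CV is only meaningful for measurements with a true zero and a mean well away from zero.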
Skewness: Symmetrical distribution
• Symmetric
– Left tail is the mirror image of the right tail
– Examples: heights and weights of people
[Figure: relative frequency histogram of a symmetric distribution; y-axis: Relative Frequency, 0 to .35]
Skewness: Asymmetrical distribution
• Moderately Skewed Left
– A longer tail to the left
– Example: exam scores
[Figure: relative frequency histogram of a moderately left-skewed distribution; y-axis: Relative Frequency, 0 to .35]
Skewness: Asymmetrical distribution
• Income
• Populations of countries
[Figure: frequency distribution; x-axis: Value, y-axis: Frequency]
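Skewness
A measure of skewness based on the 3rd moment about the mean:
$$\text{skewness} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{(N - 1)\, s^3}$$
Written out in full, with $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$:
$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1) \left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \right]^{3/2}}$$
Other measures of skewness:
$$\frac{\text{mean} - \text{mode}}{s} \qquad \frac{3\,(\text{mean} - \text{median})}{s}$$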
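A Python sketch of the third-moment measure and the mean-minus-median measure. Two things are assumptions: s is computed with N - 1 in the denominator, and the three small data sets are hypothetical, chosen so the sign of the skewness is easy to see.

```python
import math
import statistics


def skewness(xs):
    """Third-moment skewness: sum((x - mean)^3) / ((N - 1) * s^3)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return sum((x - mean) ** 3 for x in xs) / ((n - 1) * s ** 3)


def median_skewness(xs):
    """Median-based measure: 3 * (mean - median) / s."""
    return 3 * (statistics.mean(xs) - statistics.median(xs)) / statistics.stdev(xs)


symmetric = [1, 2, 3, 4, 5]          # hypothetical, mirror-image tails
right_skewed = [1, 2, 3, 4, 20]      # hypothetical, long right tail
left_skewed = [-20, -4, -3, -2, -1]  # hypothetical, long left tail

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(name, round(skewness(data), 3), round(median_skewness(data), 3))
```

A symmetric distribution gives a value of zero, a long right tail gives a positive value, and a long left tail gives a negative value.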
Skewness
[Figure: frequency distribution; x-axis: Value, y-axis: Frequency]
Skewed Right - Positive Skewness
[Figure: histogram "Number of Music CDs of Spring 1998 Stat 250 Students"; x-axis: Number of Music CDs (0 to 400), y-axis: Frequency (0 to 20)]
Kurtosis
• Measures of Kurtosis
– Kurtosis is a measure of the flatness or peakedness of a Distribution
SAS Example
/****************************************************************/
/*          SAS SAMPLE LIBRARY                                  */
/*                                                              */
/*    NAME: UNIVAR                                              */
/*   TITLE: Simple Descriptive Statistics using PROC UNIVARIATE */
/* PRODUCT: SAS                                                 */
/*  SYSTEM: ALL                                                 */
/*    KEYS: DESCRIPTIVE STATISTICS,                             */
/*   PROCS: UNIVARIATE                                          */
/*    DATA:                                                     */
/*                                                              */
/*     REF:                                                     */
/*    MISC:                                                     */
/*    DESC: INPUT A SMALL DATA SET USING THE CARDS STATEMENT.   */
/*          RUN UNIVARIATE USING THE FREQ, PLOT AND NORMAL      */
/*          PROC OPTIONS. ANALYZE THE VARIABLE POP AND          */
/*          RETAIN THE VARIABLE STATE USING THE ID STATEMENT.   */
/*          NO OTHER OPTIONS ARE USED.                          */
/*                                                              */
/****************************************************************/
OPTIONS LS=75 NODATE;
DATA STATEPOP;
INPUT STATE $ POP @@;
LABEL POP ='1970 CENSUS POPULATION IN MILLIONS';
CARDS;
ALA 3.44 ALASKA 0.30 ARIZ 1.77 ARK 1.92 CALIF 19.95
COLO 2.21 CONN 3.03 DEL 0.55 FLA 6.79 GA 4.59
HAW 0.77 IDAHO 0.71 ILL 11.01 IND 5.19 IOWA 2.83
KAN 2.25 KY 3.22 LA 3.64 ME 0.99 MD 3.92
MASS 5.69 MICH 8.88 MINN 3.81 MISS 2.22 MO 4.68
MONT 0.69 NEB 1.48 NEV 0.49 NH 0.74 NJ 7.17
NM 1.02 NY 18.24 NC 5.08 ND 0.62 OHIO 10.65
OKLA 2.56 ORE 2.09 PA 11.79 RI 0.95 SC 2.59
SD 0.67 TENN 3.92 TEXAS 11.2 UTAH 1.06 VT 0.44
Mean      4.047200     Std Deviation           4.32932
Median    2.710000     Variance               18.74300
Mode      3.920000     Range                  19.65000
                       Interquartile Range     3.69000
Quantiles
Quantile        Estimate
100% Max        19.950
99%             19.950
95%             11.790
90%             10.830
75% Q3           4.680
50% Median       2.710
25% Q1           0.990
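The summary numbers in the SAS output are linked by simple arithmetic, which makes a useful sanity check. The sketch below only re-uses values printed in the output; the variable names are chosen here for illustration.

```python
# Values copied from the SAS output.
q1 = 0.990
q3 = 4.680
maximum = 19.950
std_deviation = 4.32932
reported_variance = 18.74300
reported_range = 19.65
reported_iqr = 3.69

iqr = q3 - q1                               # interquartile range = Q3 - Q1
implied_minimum = maximum - reported_range  # range = max - min
variance_from_sd = std_deviation ** 2       # variance = SD squared

print("IQR:", round(iqr, 2))
print("implied minimum:", round(implied_minimum, 2))
print("SD squared:", round(variance_from_sd, 3))
```

The implied minimum of 0.30 matches the smallest value in the data listing (ALASKA 0.30).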
Assignment
Be familiar with the following terms:
• Probability density function (PDF)
• Deviation
• Variance
• Standard deviation
• Standard error
• Range
• Mode
• Quantile
• Coefficient of variation
Download and install R on your laptop
Plot histograms using