Introduction to biostatistics: part 1

Lecture 1. Brief history, basic concepts and descriptive statistics Biostatistics

Xinhai Li

Biological statistics

Li, Xinhai

Phone: 64807898
Email: [email protected]
http://people.gucas.ac.cn/~LiXinhai
Blog: http://blog.sciencenet.cn/u/lixinhai
Miniblog: http://weibo.com/lixinhaiblog


How to learn statistics in this class

• No preview needed before the class
• Focus on listening and thinking (3 hours / week) in class
– Don’t take notes (it wastes your time)
• Intensive review (1-2 hours / week) after the class
• Do the homework (1 hour / week) after the class


Textbooks

Sokal, R. R. and F. J. Rohlf. 1995. Biometry: the principles and practice of statistics in biological research. Third Edition. W. H. Freeman and Co.: New York. 887 pp.

Zar, J. H. 1999. Biostatistical Analysis. Fourth Edition. Prentice Hall: New Jersey, 663 pp.

From 1976 (the earliest year indexed) to mid-1997 (the date the search was performed) the following counts were obtained: Darwin (all publications, e.g. On the Origin of Species) = 7,111. Sokal and Rohlf, Biometry = 31,757.


Overview

Biostatistics or biometry

• "Biostatistics" and "biometry" are sometimes used interchangeably.
• "Biometry" is more often used for biological or agricultural applications.
• "Biostatistics" is more often used for medical applications.


What is statistics?

• Statistics is the science of collection, analysis, interpretation, and presentation of data.
• Descriptive statistics are numerical estimates that organize, sum up or present the data.
• Inferential statistics is the process of inferring from a sample to the population.

http://teeky.org/search-engine-optimization/determine-success-via-website-statistics/


Statistical errors in publications

Underwood (1981) found statistical errors in 78% of the papers he surveyed in marine ecology. Hurlbert (1984) reported that in two separate surveys 26% and 48% of the ecological papers surveyed showed the statistical error of pseudoreplication (Krebs 1999).

Charles J. Krebs. 1999. Ecological Methodology, 2nd ed. Addison-Wesley Educational Publishers, Inc.

“50% of medical literature have statistical flaws (Altman et al. 1991). Serious statistical errors were found in 40% of 164 articles published in a psychiatry journal (McGuigan 1995)” (Ercan et al. 2007).

Ilker Ercan, Berna Yazıcı, Yaning Yang, Guven Özkaya, Sengul Cangur, Bulent Ediz, Ismet Kan. Misusage Of Statistics In Medical Research. Eur J Gen Med 2007; 4(3):128-134.


Contents

• Brief history, basic concepts, and descriptive statistics
• Probability distribution
• Hypothesis testing
• Analysis of variance (ANOVA)
• Simple linear regression and correlation
• Analysis of covariance (ANCOVA)
• Nonparametric statistics
• Multivariate analysis
• Generalized linear model
• Common mistakes


Statistical software R http://cran.r-project.org

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

In 1995, R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand.

Since mid-1997 there has been a core group (the “R Core Team”) who can modify the R source code archive.

It is free software distributed under a GNU-style copyleft, and an official part of the GNU project (“GNU S”).

It had over 2,100 packages in 2010.

Citation

R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. ISBN: 3-900051-07-0. http://www.R-project.org.


Today’s contents

Introduction to biological statistics

• History
• Data in biology
• Descriptive statistics


History

• John Graunt (1620-1674, British) and William Petty (1623-1687, British): developed early human statistical and census methods that later provided a framework for modern demography based on life tables, mean values, censuses, longevity, and mortality.

• Blaise Pascal (1623-1662, French) and Pierre de Fermat (1601-1665, French), Jacques Bernoulli (1654-1705, Swiss): probability theory (binomial coefficients)

• Abraham de Moivre (1667-1754, French): combined statistics with probability theory; approximated the normal distribution through the expansion of the binomial distribution

• Carl Friedrich Gauss (1777-1855, German): least squares, normal distribution

• Adolphe Quetelet (1796-1874, Belgian): significance of constancy of large numbers (rate of criminal events)

• Florence Nightingale (1820-1910, British): graphic presentation of statistics


Emergence of statistics in the 1800s

• Laplace wrote a book describing how to compute the future positions of planets and comets on the basis of a few observations from earth.

• Napoleon: "I find no mention of God in your treatise, Mr. Laplace."

• Laplace replied: "I had no need for that hypothesis."

• The observations of planets and comets from this earthly platform did not fit the predicted positions exactly. Laplace and his fellow scientists attributed this to errors in the observations, sometimes due to perturbations in the earth's atmosphere, other times due to human error.

• By the end of the nineteenth century, the errors had mounted instead of diminishing. As measurements became more and more precise, more and more error cropped up.


Gaps between Darwinism and genetics in the early 1900s

Core Evolution Concepts

Population: Organisms that share a common gene pool (Species = actually or potentially interbreeding organisms)

Variation: Modifications of forms are produced by chance via mutations, genetic coding errors of individual organisms

Natural Selection: Reproduction & survival of organisms whose heritable traits are better suited to existing environmental conditions

Retention: Persistence within a population of the selected variation(s) over successive generations

Mendel’s law of segregation

By carrying out the monohybrid crosses, Mendel determined that the 2 alleles for each character segregate during gamete production.


Neo-Darwinian modern evolutionary synthesis in the 1930s

• Sir Ronald A. Fisher (1890-1962, British) developed several basic statistical methods in support of his work The Genetical Theory of Natural Selection.

• Sewall G. Wright (1889-1988, American) used statistics in the development of modern population genetics.

• John B. S. Haldane (1892-1964, British) reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics in his book The Causes of Evolution.


Francis Galton

• Francis Galton (1822-1911, British), father of biometry and eugenics: regression, correlation
– African explorer and elected Fellow of the Royal Geographical Society
– Creator of the first weather maps and establisher of the meteorological theory of anticyclones
– Coined the term "eugenics" and the phrase "nature versus nurture"
– Developed the statistical concepts of correlation and regression to the mean
– Discovered that fingerprints were an index of personal identity and persuaded Scotland Yard to adopt a fingerprinting system
– First to utilize the survey as a method for data collection
– Produced over 340 papers and books throughout his lifetime
– Knighted in 1909

http://www.sil.si.edu/digitalcollections/hst/scientific-identity/fullsize/SIL14-G001-05a.jpg

Galton, F. (1869/1892/1962). Hereditary Genius: An Inquiry into its Laws and Consequences. Macmillan/Fontana, London.
Galton, F. (1883/1907/1973). Inquiries into Human Faculty and its Development. AMS Press, New York.


Karl Pearson

• Karl Pearson (1857-1936, British): continued in the tradition of Galton and laid the foundation for much of descriptive statistics.
– In 1884, Pearson became Professor of Applied Mathematics and Mechanics at University College London.
– In 1901, Pearson, Weldon and Galton founded Biometrika, a “Journal for the Statistical Study of Biological Problems”.
– In 1907, Pearson took over a research unit founded by Galton and reconstituted it as the Francis Galton Laboratory of National Eugenics.
– In 1911, Pearson founded the world's first university statistics department at University College London.

Method of moments, chi-square, correlation

http://www.economics.soton.ac.uk/staff/aldrich/New%20Folder/kpreader1.htm


Ronald A. Fisher

• Sir Ronald Aylmer Fisher (1890-1962), an English statistician, evolutionary biologist, and geneticist.

• He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science" [1], and Richard Dawkins described him as "the greatest biologist since Darwin" [2] (from Wikipedia).
– In 1933 he became Professor of Eugenics at University College London
– In 1943 he was offered the Balfour Chair of Genetics at Cambridge University

Analysis of variance, maximum likelihood, Fisher information

Fisher, R.A. 1925. Statistical Methods for Research Workers
Fisher, R.A. 1935. The design of experiments

[1] Hald, Anders (1998). A History of Mathematical Statistics. New York: Wiley.
[2] Dawkins, Richard (1995). River out of Eden.

http://en.wikipedia.org/wiki/Image:RonaldFisher.jpg


Society and publications in early years

• In 1901, Pearson, Weldon and Galton founded Biometrika, a “Journal for the Statistical Study of Biological Problems”.

• By the 1940s, the application of statistics to biological questions had begun to have a profound impact on the scientific community.

• The biometrics section of the American Statistical Association began publishing the Biometrics Bulletin in 1945.

• In 1947, the International Biometric Society (IBS) was established. Shortly thereafter, the IBS began publishing Biometrics.


A story of statistics in industry

http://www.census.gov/history/www/census_then_now/notable_alumni/w_edwards_deming.html

• In 1980, the NBC television network aired a documentary entitled "If Japan Can, Why Can't We?"

– The documentary was really a description of the influence one man had on Japanese industry, W. Edwards Deming.

• Deming's major point about quality control is that the output of a production line is variable, because that is the nature of all human activity. What the customer wants is not a perfect product but a reliable product.


A story of statistics and industry: Deming’s quality control

• Deming proposed that the production line be seen as a stream of activities that start with raw material and end with finished product.

• Each activity can be measured, so each activity has its own variability due to environmental causes.

• Instead of waiting for the final product to exceed arbitrary limits of variability, the managers should be looking at the variability of each of these activities.

• The most variable of the activities is the one that should be addressed. Once that variability is reduced, there will be another activity that is "most variable," and it should then be addressed.

• Thus, quality control becomes a continuous process, where the most variable aspect of the production line is constantly being worked on.


Data

• A datum is one observation about the variable being measured.

• Data are a collection of observations.

• A population consists of all subjects about whom the study is being conducted.

• A sample is a subgroup of the population being examined.


Parameters vs. Statistics

• A parameter is a numerical quantity measuring some aspect of a population of scores.
– For example, the mean is a measure of central tendency
– Parameters are usually denoted by Greek letters

• A statistic computed from a sample is used to estimate a parameter

Quantity             Parameter   Statistic
Mean                 μ           M
Standard deviation   σ           s
Proportion           π           p
Correlation          ρ           r
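
The distinction between parameters and statistics can be sketched in R. This is only an illustration: the population below (the integers 1 to 1000) and the sample size of 50 are made-up assumptions, not data from the lecture.

```r
# Hypothetical population: the integers 1..1000 (illustrative only)
population <- 1:1000

# Parameters describe the whole population
mu    <- mean(population)                  # population mean
sigma <- sqrt(mean((population - mu)^2))   # population SD (divide by N)

# Statistics are computed from a sample and estimate the parameters
set.seed(1)                                # make the sample reproducible
smp <- sample(population, size = 50)
M <- mean(smp)                             # sample mean, estimates mu
s <- sd(smp)                               # sample SD, estimates sigma

cat("mu    =", mu, "   M =", round(M, 2), "\n")
cat("sigma =", round(sigma, 2), "   s =", round(s, 2), "\n")
```

Drawing a different sample changes M and s, while mu and sigma stay fixed.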


Variables

[Diagram: qualitative and quantitative variables; ordinal, interval or ratio]

• Nominal variable
– classification data, e.g., male/female, 0/1, etc.
• Ordinal variable
– ordered, but differences between values are not important
– e.g., Likert scales, rank on a scale of 1..5 (degree of satisfaction); restaurant ratings
• Interval scale variable
– ordered, constant scale, but no natural zero
– differences make sense, but ratios do not (e.g., 30º-20º = 20º-10º, but 20º is not twice as hot as 10º)
– e.g., temperature (C, F), dates
• Ratio scale variable
– ordered, constant scale, natural zero
– e.g., height, weight, age, length
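
One way these four scales might be represented in R is sketched below. The variable names and values are invented for illustration.

```r
# Nominal: unordered categories
sex <- factor(c("male", "female", "female"))

# Ordinal: ordered categories; comparisons are meaningful, differences are not
satisfaction <- factor(c("low", "high", "medium"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)
print(satisfaction[1] < satisfaction[2])   # is "low" below "high"?

# Interval: differences are meaningful, ratios are not (no natural zero)
temp_c <- c(10, 20, 30)
print(temp_c[3] - temp_c[2] == temp_c[2] - temp_c[1])

# Ratio: natural zero, so ratios are meaningful
height_cm <- c(150, 180)
print(height_cm[2] / height_cm[1])
```

Comparing levels of an unordered factor such as sex with < is not meaningful in R; it returns NA with a warning.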


Derived variables

• Ratio: sex ratio
• Index: S&P 500 index (stock market)
• Rate: growth rate


Accuracy and precision of data

[Figure: accuracy, precision, and inaccuracy]


Accuracy of data

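Mean square error for estimating the population mean (μ) using the sample mean (M)

\[ \mathrm{Bias}(M) = E(M) - \mu \]

\[ \mathrm{MSE}(M) = E[(M - \mu)^2] = \mathrm{Var}(M) + [E(M) - \mu]^2 \]

Accuracy (MSE) = precision (variance) + bias (squared)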
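
The decomposition can be checked numerically. The sketch below is a simulation under assumed values (true mean 10, standard deviation 2, samples of size 20), none of which come from the lecture.

```r
set.seed(42)
mu    <- 10      # assumed true population mean
n_rep <- 1000    # number of simulated samples

# Sample means of n_rep samples of size 20 drawn from N(mu, sd = 2)
M <- replicate(n_rep, mean(rnorm(20, mean = mu, sd = 2)))

mse      <- mean((M - mu)^2)         # accuracy
variance <- mean((M - mean(M))^2)    # precision (divide by n_rep, not n_rep - 1)
bias     <- mean(M) - mu             # bias

print(c(MSE = mse, Var_plus_Bias2 = variance + bias^2))
```

The two printed numbers agree because the identity holds for any set of estimates when the variance divides by the number of replicates.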


Summarizing Data

• Frequency Distribution
• Cumulative Distributions
• Relative Frequency Distribution
• Percent Frequency Distribution
• Bar Graph
• Histogram
• Pie Chart
• Dot Plot


Frequency Distribution for Qualitative Data

A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes.

The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.


Frequency Distribution

Guests staying at Holiday Inn were asked to rate the quality of their accommodations as being poor, below average, average, above average, or excellent.

Rating          Frequency
Poor            2
Below Average   3
Average         5
Above Average   9
Excellent       1
Total           20


An example for quantitative data: Hudson Auto Repair

Sample of Parts Cost for 50 Tune-ups

 91  78  93  57  75  52  99  80  97  62
 71  69  72  89  66  75  79  75  72  76
104  74  62  68  97 105  77  65  80 109
 85  97  88  68  83  68  71  69  67  74
 62  82  98 101  79 105  79  69  62  73


Frequency Distribution

• Guidelines for selecting number of classes
– Use between 5 and 20 classes
– Data sets with a larger number of elements usually require a larger number of classes
– Smaller data sets usually require fewer classes
– Use classes of equal width
– Approximate class width = (Largest Data Value − Smallest Data Value) / Number of Classes


Frequency Distribution

For Hudson Auto Repair, if we choose six classes:

Approximate Class Width = (109 - 52)/6 = 9.5 ≈ 10

Parts Cost ($)   Frequency
50-59            2
60-69            13
70-79            16
80-89            7
90-99            7
100-109          5
Total            50
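
The table above can be reproduced in R. This is a minimal sketch using the 50 parts costs from the Hudson Auto Repair example; the choice of cut() with left-closed intervals is one reasonable way to build the classes, not necessarily how the slide was produced.

```r
# Parts costs ($) for the 50 tune-ups
costs <- c(91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
           71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
           104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
           85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
           62, 82, 98, 101, 79, 105, 79, 69, 62, 73)

# Approximate class width for six classes (rounded up to 10 on the slide)
width <- (max(costs) - min(costs)) / 6

# Classes [50,60), [60,70), ..., [100,110)
classes <- cut(costs, breaks = seq(50, 110, by = 10), right = FALSE,
               labels = c("50-59", "60-69", "70-79",
                          "80-89", "90-99", "100-109"))
freq <- table(classes)
print(width)
print(freq)
```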


Relative Frequency Distribution

The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.

A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.


Percent Frequency Distribution

The percent frequency of a class is the relative frequency multiplied by 100.

A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.


Relative Frequency and Percent Frequency Distributions

Holiday Inn Quality Ratings

Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100

For example, the relative frequency of Excellent is 1/20 = .05.
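
As a small sketch, the relative and percent frequencies can be computed in R from the rating frequencies given earlier (2, 3, 5, 9, 1).

```r
# Frequencies of the Holiday Inn quality ratings
ratings <- c("Poor" = 2, "Below Average" = 3, "Average" = 5,
             "Above Average" = 9, "Excellent" = 1)

rel <- ratings / sum(ratings)   # relative frequency
pct <- 100 * rel                # percent frequency

print(data.frame(Frequency = ratings, Relative = rel, Percent = pct))
```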


Relative Frequency and Percent Frequency Distributions

Hudson Auto Repair

Parts Cost ($)   Relative Frequency   Percent Frequency
50-59            .04                  4
60-69            .26                  26
70-79            .32                  32
80-89            .14                  14
90-99            .14                  14
100-109          .10                  10
Total            1.00                 100

For example, the relative frequency of the $50-59 class is 2/50 = .04.

Relative Frequency and Percent Frequency Distributions

Insights gained from the percent frequency distribution

• Only 4% of the parts costs are in the $50-59 class.

• 30% of the parts costs are under $70.

• The greatest percentage (32% or almost one-third) of the parts costs are in the $70-79 class.

• 10% of the parts costs are $100 or more.


Our class

students <- read.csv('D:/ioz/statistics/2012/students.csv', header=T)  # load the class roster
students$ID <- as.character(students$ID)  # keep the long IDs as text
head(students)

order ID              name   visits email
33    201028007610020 吴国菊 2093   163.com
3     201028016215017 袁金蕊 222    163.com
111   201028016215018 张苗苗 99     163.com
4     201028016215019 赵次娴 130    163.com
25    201128000206033 钞婷   130    163.com
56    201128000206061 雷金龙 282    mails.gucas.ac.cn

nrow(students) #115
family.name <- substr(students$name,1,1)  # first character is the family name
length(unique(family.name)) #62
f.name <- table(family.name)[table(family.name)>1]  # family names shared by more than one student
f.name <- as.table(f.name)
barplot(f.name, ylab='Number')

[Bar plot: Number of students by family name (陈 杜 李 刘 马 宋 王 魏 吴 徐 杨 于 袁 张 赵 郑 朱)]


Our class

email <- table(students$email)[table(students$email) > 2]
class(email) # array
email <- as.table(email)
barplot(email, ylab='Number')

[Bar plot: Number of students by email domain (126.com, 163.com, gmail.com, mails.gucas.ac.cn, qq.com, sina.com)]


Our class

hist(students$visits, freq=T, nclass=15, xlab='Times')

[Histogram of students$visits: x-axis Times (0 to 2000), y-axis Frequency]


Bar Graph

barplot()

[Bar graph: Number of students by family name]

• A bar graph is a graphical device for depicting qualitative data.

• Specify the labels that are used for each of the classes on one axis (usually the horizontal axis).

• A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).

• Use a bar of fixed width drawn above each class label.

• The bars are separated to emphasize the fact that each class is a separate category.


Histogram

• Another common graphical presentation of quantitative data is a histogram.

• The variable of interest is placed on the horizontal axis.

• A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.

• Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

R code
hist(rnorm(100),nclass=6)


Pie Chart

[Pie chart: Holiday Inn Quality Ratings]

R code
x=sample(1:100,6,replace=TRUE)
names(x)=c('A','B','C','D','E','F')
pie(x)

• The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.

• First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.

• Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.


Dot Plot

• One of the simplest graphical summaries of data is a dot plot.

• A horizontal axis shows the range of data values.

• Then each data value is represented by a dot placed above the axis.

[Dot plot: Tune-up Parts Cost; x-axis Cost ($), 50 to 110]


Cumulative Distributions

• Cumulative frequency distribution: shows the number of items with values less than or equal to the upper limit of each class.
• Cumulative relative / percent frequency distribution: shows the proportion / percentage of items with values less than or equal to the upper limit of each class.

R code
x=seq(-5,5,by=0.1)
plot(x,pnorm(x,mean=0,sd=1),type='l')


Cumulative Distributions

• Hudson Auto Repair

Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59       2                      .04                             4
≤ 69       15                     .30                             30
≤ 79       31                     .62                             62
≤ 89       38                     .76                             76
≤ 99       45                     .90                             90
≤ 109      50                     1.00                            100

For example, for the ≤ 69 row the cumulative frequency is 2 + 13 = 15 and the cumulative relative frequency is 15/50 = .30.
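
The cumulative columns follow directly from the class frequencies. A minimal R sketch, using the Hudson Auto Repair frequencies from the earlier table:

```r
# Class frequencies for the Hudson Auto Repair parts costs
freq <- c("50-59" = 2, "60-69" = 13, "70-79" = 16,
          "80-89" = 7, "90-99" = 7, "100-109" = 5)

cum_freq <- cumsum(freq)              # cumulative frequency
cum_rel  <- cum_freq / sum(freq)      # cumulative relative frequency
cum_pct  <- 100 * cum_rel             # cumulative percent frequency

print(data.frame(CumFreq = cum_freq, CumRel = cum_rel, CumPct = cum_pct))
```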


Stem-and-Leaf Display

• A stem-and-leaf display shows both the rank order and shape of the distribution of the data.

• It is similar to a histogram on its side, but it has the advantage of showing the actual data values.

• The first digits of each data item are arranged to the left of a vertical line.

• To the right of the vertical line we record the last digit for each item in rank order.

• Each line in the display is referred to as a stem.
• Each digit on a stem is a leaf.


Example: Leaf Unit = 0.1

If we have data with values such as

8.6  11.7  9.4  9.1  10.2  11.0  8.8

a stem-and-leaf display of these data will be

Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
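
R has a built-in stem() function, although it may pick its own stem scale. The sketch below also builds the display by hand for the seven values above, so that the stems and leaves match the slide; the manual construction is an illustration, not part of the lecture.

```r
x <- c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8)

# Built-in display (R chooses the scale itself)
stem(x)

# Manual display with leaf unit 0.1
stems   <- floor(x)                    # digits left of the vertical line
leaves  <- round((x - stems) * 10)     # last digit of each value
display <- tapply(leaves, stems,
                  function(l) paste(sort(l), collapse = " "))

for (s in names(display)) {
  cat(sprintf("%2s | %s\n", s, display[[s]]))
}
```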


Example: Leaf Unit = 10

If we have data with values such as

1806  1717  1974  1791  1682  1910  1838

a stem-and-leaf display of these data will be

Leaf Unit = 10
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7

The 82 in 1682 is rounded down to 80 and is represented as an 8.


Probability density function (PDF)

A probability density function (pdf) is a function that represents a probability distribution in terms of integrals.

Formally, a probability distribution has density f(x) if the probability of the interval [a, b] is given by

\[ \int_a^b f(x)\,dx \]

Intuitively, if a probability distribution has density f(x), then the infinitesimal interval [x, x + dx] has probability f(x) dx.

x=seq(-5,5,by=0.1)
plot(x,dnorm(x,mean=0,sd=1),type='l')

The total area under the graph is 1:

\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \]
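
Both integrals can be checked numerically in R for the standard normal density. This is a sketch using integrate() and pnorm(); the interval [-1, 1] is an arbitrary choice for illustration.

```r
# Total area under the standard normal density
area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Probability of the interval [a, b] = [-1, 1], computed two ways
p_int <- integrate(dnorm, lower = -1, upper = 1)$value   # integral of f(x)
p_cdf <- pnorm(1) - pnorm(-1)                            # difference of the CDF

print(c(area = area, p_int = p_int, p_cdf = p_cdf))
```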


Descriptive statistics

• Are the scores generally high or generally low?

• Where does the center of the distribution tend to be located?

• Three measures of central tendency
– Mode
– Median
– Mean




Mode

• The most frequently occurring score

• Report mode when using nominal scale, the most frequently occurring category

• Based on the simple frequency of each score

• If you have a rectangular distribution, do not report the mode

• Unimodal, bimodal, multimodal, antimode

Example of Mode

Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4

• In this case the data have two modes: 5 and 7

• Both measurements are repeated twice
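
Base R has no function that returns the statistical mode, so one possible approach is to tabulate the values and keep those with the highest count. A minimal sketch for the ten measurements above:

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

counts <- table(x)                                        # frequency of each score
modes  <- as.numeric(names(counts)[counts == max(counts)])  # most frequent score(s)

print(counts)
print(modes)
```

If every value occurs equally often (a rectangular distribution), this returns all values, which is a signal that no mode should be reported.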


Example of Mode

[Table: measurements x]

• Mode: 3

• Notice that it is possible for a data set not to have any mode.




Median

• Score at the 50th percentile

• For a normal distribution the median is the same as the mode

• Arrange scores from lowest to highest; if there is an odd number of scores, the median is the one in the middle; if there is an even number of scores, average the two scores in the middle

• Used with ordinal scales and when the distribution is skewed




Example of Median

Measurements x:          3  5  5  1  7  2  6  7  0  4   (sum 40)
Measurements ranked x:   0  1  2  3  4  5  5  6  7  7   (sum 40)

• Median: (4+5)/2 = 4.5

• Notice that only the two central values are used in the computation.

• The median is not sensitive to extreme values
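
The same computation in R, as a short sketch. The outlier value of 700 is invented here to illustrate how an extreme score affects the mean but not the median.

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

ranked <- sort(x)
med_manual <- mean(ranked[c(5, 6)])   # average of the two central values
med <- median(x)

# Replace one of the largest scores with an extreme value (hypothetical)
x_out <- x
x_out[which.max(x_out)] <- 700

print(c(median = med, median_with_outlier = median(x_out)))
print(c(mean = mean(x), mean_with_outlier = mean(x_out)))
```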

Mean

• Score at the exact mathematical center of the distribution (average)

• Used with interval and ratio scales, and when the distribution is symmetrical and unimodal

• Not accurate when the distribution is skewed, because it is pulled towards the tail

\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \]



Uses of the Mean

• Describes scores

• Deviation from the mean gives us the error of our estimate of the score, with total error equal to zero

• Predict scores

• Describe a score's location

• Describe the population mean (μ), which is a parameter




Deviations around the Mean

• The score minus the mean

• Include plus or minus sign

• Sum of deviations around the mean always equals zero: Σ(X − M) = 0

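
A quick check in R, reusing the ten measurements from the median example (mean 4):

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

M   <- mean(x)    # sample mean
dev <- x - M      # signed deviations around the mean

print(dev)
print(sum(dev))   # sum of deviations
```

With non-integer data the sum may differ from zero by a tiny floating-point rounding error.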



Range

• Report the maximum difference between the lowest and highest scores

• Semi-interquartile range used with the median: one half the distance between the scores at the 25th and 75th percentile




Measures of Variability

• Extent to which the scores differ from each other or how spread out the scores are

• Tells us how accurately the measure of central tendency describes the distribution

• Shape of the distribution


Why do we care about variability?

• Where would you rather vacation: LA Bungalows, where the mean temperature is 24 degrees, or Sahara Condos, where the mean temperature is also 24 degrees?

◊ LA temperature range: day = 26, night = 22

◊ Sahara temperature range: day = 40, night = 8



Variance

• Uses the deviation from the mean

• Remember, the sum of the deviations always equals zero, so you have to square each of the deviations

• S²_X = sum of squared deviations divided by the number of scores

• Provides information about the relative variability




Some Limits

• It isn’t the average deviation

• Interpretation doesn’t make sense because:

– Number is too large

– And it is a squared value




The standard deviation (SD)

• Take the square root of the variance

• S_X

• Uses the same units of measurement as the raw scores

• How much scores deviate below and above the mean




The standard deviation (SD)

• Standard deviation ~ the mean of deviations from the mean (sort of)

σ (lowercase sigma) is the population standard deviation.

S is the sample standard deviation.

ŝ (s-hat) is the sample estimate of σ.



The deviation (definitional) formula for the population standard deviation

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \]

• The larger the standard deviation, the more variability there is in the scores

• The standard deviation is somewhat less sensitive to extreme outliers than the range (as N increases)



The deviation (definitional) formula for the sample standard deviation

\[ S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}} \]

• What’s the difference between this formula and the population standard deviation?

• In the first case, all the Xs represent the entire population. In the second case, the Xs represent a sample.


Standard Deviation: Example

X       X − X̄    (X − X̄)²
21      -5.8     33.64
25      -1.8     3.24
24      -2.8     7.84
30      3.2      10.24
34      7.2      51.84
Mean    26.8     0        21.36

\[ S = \sqrt{\frac{106.8}{5}} = \sqrt{21.36} = 4.62 \]
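
The worked example can be reproduced in R. Note that R's built-in sd() divides by N − 1, so this sketch computes S by hand with N in the denominator, following the definitional formula.

```r
X <- c(21, 25, 24, 30, 34)
N <- length(X)

dev <- X - mean(X)        # deviations from the mean (26.8)
SS  <- sum(dev^2)         # sum of squared deviations
S   <- sqrt(SS / N)       # sample standard deviation, dividing by N

print(data.frame(X = X, deviation = dev, squared = dev^2))
print(c(mean = mean(X), SS = SS, S = round(S, 2)))
```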



Calculating S using the raw-score formula

\[ S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}} \]

To calculate ΣX² you square all the scores first and then sum them

To calculate (ΣX)² you sum all the scores first and then square them
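
As a sketch, the raw-score and definitional formulas can be compared in R on the five scores from the standard deviation example (21, 25, 24, 30, 34).

```r
X <- c(21, 25, 24, 30, 34)
N <- length(X)

sum_sq <- sum(X^2)     # square first, then sum
sq_sum <- sum(X)^2     # sum first, then square

S_raw <- sqrt((sum_sq - sq_sum / N) / N)      # raw-score formula
S_def <- sqrt(sum((X - mean(X))^2) / N)       # definitional formula

print(c(sum_sq = sum_sq, sq_sum = sq_sum, S_raw = S_raw, S_def = S_def))
```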



Population and sample variance and standard deviation

• When we have data from the entire population we use μ (not X̄) to compute σ_X using the same formula

• Variance and standard deviations of the sample are biased estimates of the population values




Estimating the population standard deviation from a sample

• S, the sample standard deviation, is usually a little smaller than the population standard deviation. Why?

• The sample mean minimizes the sum of squared deviations (SS). Therefore, if the sample mean differs at all from the population mean, then the SS from the sample will be an underestimate of the SS from the population

• Therefore, statisticians alter the formula of the sample standard deviation by subtracting 1 from N
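The claim that the sample mean minimizes SS can be illustrated numerically. In this sketch both the sample and the "population mean" of 28 are hypothetical values chosen for illustration:

```r
X <- c(21, 25, 24, 30, 34)      # illustrative sample (assumed)
N <- length(X)

# Sum of squared deviations about an arbitrary centre
ss_about <- function(center) sum((X - center)^2)

ss_sample_mean <- ss_about(mean(X))  # SS about the sample mean
mu_assumed     <- 28                 # hypothetical population mean
ss_mu          <- ss_about(mu_assumed)

# SS about any other value is larger by exactly N * (mean - value)^2
excess <- N * (mean(X) - mu_assumed)^2

cat("SS about sample mean:", ss_sample_mean, "\n")
cat("SS about assumed mu: ", ss_mu, "\n")
cat("difference:", ss_mu - ss_sample_mean, " N*(mean - mu)^2:", excess, "\n")
```

Because SS about the sample mean is never larger than SS about the true mean, dividing it by N tends to underestimate the population variance, which is the motivation for dividing by N − 1.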



Formulas for s-hat (estimated)

Definitional formula:

$$\hat{s} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

Raw-score formula:

$$\hat{s} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$$
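As a quick check, both s-hat formulas can be compared with R's built-in sd(), which also divides by N − 1. The scores below are illustrative values, not data from the slide:

```r
X <- c(21, 25, 24, 30, 34)   # illustrative scores (assumed)
N <- length(X)

# Definitional formula: squared deviations over N - 1
s_hat_def <- sqrt(sum((X - mean(X))^2) / (N - 1))

# Raw-score formula: uses only sum(X^2) and sum(X)
s_hat_raw <- sqrt((sum(X^2) - sum(X)^2 / N) / (N - 1))

print(c(definitional = s_hat_def, raw_score = s_hat_raw, builtin = sd(X)))
```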



The estimated variance

The square of the standard deviation

$$\hat{s}^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$$

The variance is not a very useful descriptive statistic, but it is a very important value that you will use in other techniques (e.g., the analysis of variance)



For a normal distribution

• Sample mean is a good estimate of population mean

• The estimate of the population variance and standard deviation tells us how spread out the scores are

• About 68% of the scores are within +1 and –1 SX of the mean
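The 68% figure can be verified from the normal cumulative distribution function and, approximately, by simulation. A minimal sketch; the seed and sample size are arbitrary choices:

```r
# Exact probability of falling within 1 SD of the mean for a normal distribution
p_within_1sd <- pnorm(1) - pnorm(-1)

# Approximate check by simulation
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(10000)
prop_sim <- mean(abs(x - mean(x)) <= sd(x))

cat("theoretical:", round(p_within_1sd, 4), "\n")
cat("simulated:  ", prop_sim, "\n")
```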



Standard error

The standard error of the mean for a sample of size n is the sample's standard deviation divided by √n. It therefore estimates the standard deviation of the sample mean, that is, how far sample means typically fall from the population mean.

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$
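In R the standard error of the mean takes one line. A sketch with illustrative scores (assumed values); sd() already uses the N − 1 denominator:

```r
X <- c(21, 25, 24, 30, 34)   # illustrative scores (assumed)
n <- length(X)

se <- sd(X) / sqrt(n)        # standard error of the mean

cat("s =", round(sd(X), 4), " n =", n, " SE =", round(se, 4), "\n")
```

With the same standard deviation, quadrupling the sample size halves the standard error.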



Coefficient of variation

In probability theory and statistics, the coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution. It is defined as the ratio of the standard deviation σ to the mean μ, here expressed as a percentage:

$$CV = \frac{\sigma}{\mu} \times 100$$
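A short sketch of the CV in R, using the sample standard deviation and sample mean as estimates of σ and μ. The data are illustrative, and the rescaling step is only there to show that the CV does not depend on the unit of measurement:

```r
X <- c(21, 25, 24, 30, 34)          # illustrative scores (assumed)

cv <- function(x) 100 * sd(x) / mean(x)   # CV as a percentage

cv_original <- cv(X)
cv_rescaled <- cv(10 * X)           # e.g. the same data in different units

cat("CV =", round(cv_original, 2), "%\n")
cat("CV after rescaling =", round(cv_rescaled, 2), "%\n")
```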


Skewness: Symmetrical distribution

• Symmetric
– Left tail is the mirror image of the right tail
– Examples: heights and weights of people

[Figure: histogram of a symmetric distribution; y-axis: relative frequency]


Skewness: Asymmetrical distribution

• Moderately Skewed Left
– A longer tail to the left
– Example: exam scores

[Figure: histogram of a left-skewed distribution; y-axis: relative frequency]


Skewness: Asymmetrical distribution

• Income
• Populations of countries

[Figure: frequency versus value]



Skewness

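A measure of skewness based on the 3rd moment about the mean:

$$\text{skewness} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^3}{(N-1)\,s^3}$$

$$\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$

Simpler measures: (mean − mode) / s and 3 (mean − median) / s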
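The moment-based measure and the median-based measure can be written as small R functions. This is a sketch only: the function names and the two test vectors are made up for illustration, and the first function follows the (N − 1)s³ form.

```r
# Moment-based skewness: sum of cubed deviations over (N - 1) * s^3
skew_moment <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / ((n - 1) * sd(x)^3)
}

# Median-based measure: 3 * (mean - median) / s
skew_median <- function(x) 3 * (mean(x) - median(x)) / sd(x)

sym   <- c(1, 2, 3, 4, 5)       # symmetric: skewness should be 0
right <- c(1, 1, 2, 2, 3, 10)   # long right tail: skewness should be positive

cat("symmetric:   ", skew_moment(sym), skew_median(sym), "\n")
cat("right-skewed:", skew_moment(right), skew_median(right), "\n")
```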


Skewness

[Figure: frequency versus value]



Skewed Right - Positive Skewness

Number of Music CDs of Spring 1998 Stat 250 Students

[Figure: histogram; x-axis: Number of Music CDs (0 to 400); y-axis: Frequency]



Kurtosis

• Measures of Kurtosis

– Kurtosis is a measure of the flatness or peakedness of a distribution

• Normal Kurtosis - Mesokurtic

• Flat Kurtosis - Platykurtic

• Peaked Kurtosis - Leptokurtic

– A measure of kurtosis based on the 4th moment about the mean



Kurtosis

$$\text{kurtosis} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^4}{(N-1)\,s^4}$$

After subtracting 3 (the kurtosis of a normal distribution):
• Less than 0 = Platykurtic
• More than 0 = Leptokurtic
• Equal to 0 = Mesokurtic
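A sketch of the 4th-moment measure in R, with 3 subtracted so that a normal distribution scores about 0. The function name and both data vectors are invented for illustration: one is deliberately flat, the other has most values at the centre plus two extreme values.

```r
# Excess kurtosis: sum of 4th-power deviations over (N - 1) * s^4, minus 3
excess_kurtosis <- function(x) {
  n <- length(x)
  sum((x - mean(x))^4) / ((n - 1) * sd(x)^4) - 3
}

flat   <- 1:100                    # evenly spread values: platykurtic
peaked <- c(rep(0, 98), -10, 10)   # sharp peak with heavy tails: leptokurtic

cat("flat:  ", excess_kurtosis(flat), "\n")
cat("peaked:", excess_kurtosis(peaked), "\n")
```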



Kurtosis

[Figure: frequency versus value for three curves with kurtosis k > 3 (leptokurtic), k = 3 (mesokurtic) and k < 3 (platykurtic)]



Describing data

          Statistic (mean based)                          Statistic (non-mean based)
Center    Mean                                            Mode, median
Spread    Variance, SD (standard deviation), SE, CV       Range, interquartile range
Skew      Skewness                                        --
Peaked    Kurtosis                                        --



R code

x = rnorm(100)    # 100 random draws from a standard normal distribution
mean(x)
sd(x)
var(x)
min(x)
max(x)
median(x)
range(x)
quantile(x)
summary(x)
# moment-based skewness and excess kurtosis (kurtosis minus 3)
skewness = sum((x-mean(x))^3/sqrt(var(x))^3)/length(x); skewness
kurtosis = sum((x-mean(x))^4/var(x)^2)/length(x) - 3; kurtosis


SAS Example

/****************************************************************/
/*          SAS SAMPLE LIBRARY                                  */
/*                                                              */
/*    NAME: UNIVAR                                              */
/*   TITLE: Simple Descriptive Statistics using PROC UNIVARIATE */
/* PRODUCT: SAS                                                 */
/*  SYSTEM: ALL                                                 */
/*    KEYS: DESCRIPTIVE STATISTICS,                             */
/*   PROCS: UNIVARIATE                                          */
/*    DATA:                                                     */
/*                                                              */
/*     REF:                                                     */
/*    MISC:                                                     */
/*    DESC: INPUT A SMALL DATA SET USING THE CARDS STATEMENT.   */
/*          RUN UNIVARIATE USING THE FREQ, PLOT AND NORMAL      */
/*          PROC OPTIONS. ANALYZE THE VARIABLE POP AND          */
/*          RETAIN THE VARIABLE STATE USING THE ID STATEMENT.   */
/*          NO OTHER OPTIONS ARE USED.                          */
/*                                                              */
/****************************************************************/
OPTIONS LS=75 NODATE;
DATA STATEPOP;
INPUT STATE $ POP @@;
LABEL POP ='1970 CENSUS POPULATION IN MILLIONS';
CARDS;
ALA 3.44 ALASKA 0.30 ARIZ 1.77 ARK 1.92 CALIF 19.95
COLO 2.21 CONN 3.03 DEL 0.55 FLA 6.79 GA 4.59
HAW 0.77 IDAHO 0.71 ILL 11.01 IND 5.19 IOWA 2.83
KAN 2.25 KY 3.22 LA 3.64 ME 0.99 MD 3.92
MASS 5.69 MICH 8.88 MINN 3.81 MISS 2.22 MO 4.68
MONT 0.69 NEB 1.48 NEV 0.49 NH 0.74 NJ 7.17
NM 1.02 NY 18.24 NC 5.08 ND 0.62 OHIO 10.65
OKLA 2.56 ORE 2.09 PA 11.79 RI 0.95 SC 2.59
SD 0.67 TENN 3.92 TEXAS 11.2 UTAH 1.06 VT 0.44
VA 4.65 WASH 3.41 W.VA 1.74 WIS 4.42 WYO 0.33
PROC UNIVARIATE FREQ PLOT NORMAL;
VAR POP; ID STATE;
run;


SAS results

The SAS System
The UNIVARIATE Procedure

Variable: POP (1970 CENSUS POPULATION IN MILLIONS)

Moments

N                         50    Sum Weights               50
Mean                  4.0472    Sum Observations      202.36
Std Deviation     4.32931867    Variance          18.7430002
Skewness          2.05521839    Kurtosis          4.54561679
Uncorrected SS     1737.3984    Corrected SS      918.407008
Coeff Variation   106.970712    Std Error Mean    0.61225812
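Most of the Moments table can be reproduced in R from the same 50 state populations. This is a sketch, not part of the SAS sample: the vector name pop and the labels are illustrative, and the values are copied from the CARDS data in the SAS example. Skewness and kurtosis are left out because SAS may compute them with a different formula from the one in the R code earlier in this lecture.

```r
# 1970 census populations in millions, in the order of the SAS CARDS data
pop <- c(3.44, 0.30, 1.77, 1.92, 19.95,
         2.21, 3.03, 0.55, 6.79, 4.59,
         0.77, 0.71, 11.01, 5.19, 2.83,
         2.25, 3.22, 3.64, 0.99, 3.92,
         5.69, 8.88, 3.81, 2.22, 4.68,
         0.69, 1.48, 0.49, 0.74, 7.17,
         1.02, 18.24, 5.08, 0.62, 10.65,
         2.56, 2.09, 11.79, 0.95, 2.59,
         0.67, 3.92, 11.20, 1.06, 0.44,
         4.65, 3.41, 1.74, 4.42, 0.33)
n <- length(pop)

moments <- c(N              = n,
             Sum            = sum(pop),
             Mean           = mean(pop),
             StdDeviation   = sd(pop),                     # N - 1 denominator
             Variance       = var(pop),
             UncorrectedSS  = sum(pop^2),
             CorrectedSS    = sum((pop - mean(pop))^2),
             CoeffVariation = 100 * sd(pop) / mean(pop),
             StdErrorMean   = sd(pop) / sqrt(n))
print(round(moments, 4))

# Location and spread from the Basic Statistical Measures table
c(median = median(pop), range = diff(range(pop)))
```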


Basic statistics

Basic Statistical Measures

    Location                     Variability
Mean      4.047200     Std Deviation             4.32932
Median    2.710000     Variance                 18.74300
Mode      3.920000     Range                    19.65000
                       Interquartile Range       3.69000


Quantiles

Quantile        Estimate
100% Max          19.950
99%               19.950
95%               11.790
90%               10.830
75% Q3             4.680
50% Median         2.710
25% Q1             0.990


Assignment

Be familiar with the following terms:
• Probability density function (PDF)
• Deviation
• Variance
• Standard deviation
• Standard error
• Range
• Mode
• Quantile
• Coefficient of variation

Download and install R on your laptop

Plot histograms using hist(rnorm(100), nclass=6)