Data Analysis Using R: 2. Descriptive Statistics Tuan V. Nguyen Garvan Institute of Medical Research, Sydney, Australia

Dec 14, 2015


Data Analysis Using R: 2. Descriptive Statistics

Tuan V. Nguyen

Garvan Institute of Medical Research,

Sydney, Australia


Overview

• Measurements
• Population vs sample
• Summary of data: mean, variance, standard deviation, standard error
• Graphical analyses
• Transformation


Scales of Measurement

• In general, most observable behaviors can be measured on a ratio-scale

• In general, many unobservable psychological qualities (e.g., extraversion), are measured on interval scales

• We will mostly concern ourselves with the simple categorical (nominal) versus continuous distinction (ordinal, interval, ratio)

[Diagram: variables are either categorical or continuous; the continuous branch covers ordinal, interval and ratio scales]


Ordinal Measurement

• Ordinal: designates an ordering; quasi-ranking
– Does not assume that the intervals between numbers are equal
– Example: finishing place in a race (first place, second place)

[Figure: a time axis from 1 to 8 hours with the 1st, 2nd, 3rd and 4th finishing places marked along it]


Interval and Ratio Measurement

• Interval: designates an equal-interval ordering
– The distance between, for example, a 1 and a 2 is the same as the distance between a 4 and a 5
– Example: common IQ tests are assumed to use an interval metric

• Ratio: designates an equal-interval ordering with a true zero point (i.e., the zero implies an absence of the thing being measured)
– Example: number of intimate relationships a person has had
   • 0 quite literally means none
   • A person who has had 4 relationships has had twice as many as someone who has had 2


Statistics: Enquiry into the unknown

[Diagram: Population (parameter) and Sample (estimate)]


Estimate the population mean

Population height mean = 160 cm

Standard deviation = 5.0 cm

ht <- rnorm(10, mean=160, sd=5)
mean(ht)

ht <- rnorm(10, mean=160, sd=5)
mean(ht)

ht <- rnorm(100, mean=160, sd=5)
mean(ht)

ht <- rnorm(1000, mean=160, sd=5)
mean(ht)

ht <- rnorm(10000, mean=160, sd=5)
mean(ht)
hist(ht)

The larger the sample, the closer the estimate tends to be to the population mean!
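One way to put numbers on this claim is to repeat the sampling many times at each sample size and see how much the sample means vary. The sketch below is an illustration only: the seed, the 500 replicates and the names sizes, se_sim and se_theory are assumptions, not part of the original example.

```r
set.seed(1)                          # assumed seed, for reproducibility only
sizes <- c(10, 100, 1000, 10000)     # the sample sizes used above

# For each sample size, draw 500 samples and take the SD of their means
se_sim <- sapply(sizes, function(n) {
  sd(replicate(500, mean(rnorm(n, mean = 160, sd = 5))))
})

# Theoretical standard error of the mean: sd / sqrt(n)
se_theory <- 5 / sqrt(sizes)

data.frame(n = sizes,
           simulated = round(se_sim, 3),
           theoretical = round(se_theory, 3))
```

The spread of the sample means should shrink roughly in proportion to 1/sqrt(n), so a 100-fold larger sample gives an estimate that is about 10 times less variable.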


Estimate the population proportion

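Population proportion of males = 0.50. Take n samples of k individuals each and record the number of males in each sample.

rbinom(n, k, prob)

(n = number of samples, k = individuals per sample, prob = population proportion; each value returned is a count of males, so mean(males)/k is the estimated proportion)

males <- rbinom(10, 10, 0.5)
males
mean(males)

males <- rbinom(20, 100, 0.5)
males
mean(males)

males <- rbinom(1000, 100, 0.5)
males
mean(males)

The larger the sample, the closer the estimate tends to be to the population proportion!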
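Because rbinom() returns counts, turning them into proportions takes one extra division. A minimal sketch, assuming a seed and the names k and p_hat purely for illustration:

```r
set.seed(2)                      # assumed seed, for reproducibility only
k <- 100                         # individuals per sample
males <- rbinom(1000, k, 0.5)    # number of males in each of 1000 samples

p_hat <- males / k               # proportion of males in each sample

mean(p_hat)                      # should be close to 0.5
sd(p_hat)                        # should be close to sqrt(0.5 * 0.5 / k) = 0.05
```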


Summary of Continuous Data

• Measures of central tendency:
– Mean, median, mode

• Measures of dispersion or variability:
– Variance, standard deviation, standard error
– Interquartile range

R commands: length(x), mean(x), median(x), var(x), sd(x)

summary(x)
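The commands listed do not cover the standard error or the interquartile range directly. Base R has IQR() and quantile(), and the standard error of the mean can be computed from sd() and length(). The sketch below uses simulated data; the seed and the variable names are assumptions for illustration.

```r
set.seed(3)                        # assumed seed, for reproducibility only
x <- rnorm(1000, mean = 55, sd = 8.2)

n   <- length(x)
se  <- sd(x) / sqrt(n)             # standard error of the mean
iqr <- IQR(x)                      # interquartile range
q   <- quantile(x, c(0.25, 0.75))  # first and third quartiles

c(n = n, mean = mean(x), sd = sd(x), se = se, iqr = iqr)
```

Note that mode(x) in R reports the storage mode of an object, not the most frequent value.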


R example

height <- rnorm(1000, mean=55, sd=8.2)
mean(height)
[1] 55.30948

median(height)
[1] 55.018

var(height)
[1] 68.02786

sd(height)
[1] 8.2479

summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  28.34   49.97   55.02   55.31   60.78   85.05


Graphical Summary: Box plot

boxplot(height)

[Figure: box plot of height, y-axis from 30 to 80. Annotations: 95% percentile, 75% percentile, median (50% percentile), 25% percentile, 5% percentile]
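The annotations place the whisker ends at the 5th and 95th percentiles. With its default settings, R's boxplot() instead draws each whisker to the most extreme data point that lies within 1.5 times the box length from the box, so the two need not coincide. A short check on simulated data, with the seed and the names bp and pct as assumptions for illustration:

```r
set.seed(4)                        # assumed seed, for reproducibility only
height <- rnorm(1000, mean = 55, sd = 8.2)

# Values boxplot() draws: lower whisker, lower hinge, median,
# upper hinge, upper whisker
bp <- boxplot.stats(height)$stats

# Percentiles named in the annotations, for comparison
pct <- quantile(height, c(0.05, 0.25, 0.50, 0.75, 0.95))

round(rbind(boxplot = bp, percentile = unname(pct)), 2)
```

For roughly normal data the default whiskers usually reach further out than the 5th and 95th percentiles, while the box edges and the median agree closely.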


Strip chart

[Figure: strip chart of height; x-axis from 30 to 80]


Histogram

[Figure: histogram of height; x-axis: height, 30 to 90; y-axis: frequency, 0 to 250]


Implications of the mean and SD

• “In the Vietnamese population aged 30+ years, the average weight was 55.0 kg, with an SD of 8.2 kg.”

• What does this mean?

• If weight is approximately normally distributed, about 68% of individuals will have a weight between 55 ± 8.2 × 1 = 46.8 to 63.2 kg

• About 95% of individuals will have a weight between 55 ± 8.2 × 1.96 = 38.9 to 71.1 kg
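The 68% and 95% figures follow from the normal distribution and can be checked with pnorm(). This is a small sketch under the assumption that weight is normally distributed; the names m, s, p68, p95 and range95 are chosen here for illustration.

```r
m <- 55.0   # mean weight (kg)
s <- 8.2    # standard deviation (kg)

# Probability of falling within 1 SD and within 1.96 SD of the mean
p68 <- pnorm(m + s, mean = m, sd = s) - pnorm(m - s, mean = m, sd = s)
p95 <- pnorm(m + 1.96 * s, mean = m, sd = s) -
       pnorm(m - 1.96 * s, mean = m, sd = s)

# Range expected to contain about 95% of individuals
range95 <- m + c(-1, 1) * 1.96 * s

round(c(p68 = p68, p95 = p95), 3)
round(range95, 1)
```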


Implications of the mean and SD

• The distribution of weight of the entire population can be shown to be:

[Figure: distribution of weight in the population; x-axis: weight (kg), 22 to 92; y-axis: percent (%), 0 to 6; the ranges within 1 SD and 1.96 SD of the mean are marked]


Summary of Categorical Data

• Categorical data:
– Gender: male, female
– Race: Asian, Caucasian, African

• Semi-quantitative data:
– Severity of disease: mild, moderate, severe
– Stages of cancer: I, II, III, IV
– Preference: dislike very much, dislike, equivocal, like, like very much


Mean and variance of a proportion

• For an individual consumer i, the probability that he/she prefers A is p_i. Assuming that all consumers are independent and share the same probability, p_i = p.

• The variance of p_i is var(p_i) = p(1 − p)

• For a sample of n consumers, the estimated probability of preference for A is:

$$\bar{p} = \frac{p_1 + p_2 + p_3 + \dots + p_n}{n}$$

and the variance of $\bar{p}$ is:

$$\mathrm{var}(\bar{p}) = \frac{p(1 - p)}{n}$$


Normal approximation of a binomial distribution

• For an individual consumer i, the probability that he/she prefers A is p_i. Assuming that all consumers are independent and share the same probability, p_i = p.

• The variance of p_i is var(p_i) = p(1 − p)

• For a sample of n consumers, the estimated probability of preference for A is:

$$\bar{p} = \frac{p_1 + p_2 + p_3 + \dots + p_n}{n}$$

and the variance of $\bar{p}$ is:

$$\mathrm{var}(\bar{p}) = \frac{p(1 - p)}{n}$$

and standard deviation:

$$s = \sqrt{\frac{p(1 - p)}{n}}$$


Normal approximation of a binomial distribution - example

• 10 consumers, 8 preferred product A.

• Proportion of preference for A: p = 8/10 = 0.8

• Variance: var(p) = 0.8(0.2)/10 = 0.016

• Standard deviation of p: s = sqrt(0.016) = 0.126

• 95% CI of p: 0.8 ± 1.96(0.126) = 0.55 to 1.05, truncated to 0.55 to 1.00 because a proportion cannot exceed 1
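The same arithmetic can be reproduced in R. With only 10 consumers the normal approximation is rough, so the sketch below also computes the exact (Clopper-Pearson) interval from binom.test() for comparison. The variable names are assumptions made for illustration.

```r
n <- 10                           # consumers
k <- 8                            # consumers preferring product A

p <- k / n                        # estimated proportion
s <- sqrt(p * (1 - p) / n)        # standard deviation of the estimate

# Normal-approximation 95% CI, then truncated to the valid range [0, 1]
ci       <- p + c(-1, 1) * 1.96 * s
ci_trunc <- pmin(pmax(ci, 0), 1)

# Exact binomial 95% CI for comparison
exact <- as.numeric(binom.test(k, n)$conf.int)

round(rbind(normal = ci_trunc, exact = exact), 2)
```

The exact interval is not symmetric around 0.8 and stays below 1, which is one reason to prefer it for small samples.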


Descriptive Analyses: Continuous data


Paired t-test

• Continuous data
• Normally distributed
• Two samples are NOT independent


Paired t-test – an example

• The problem: Viewing certain meats under red light might enhance judges' preferences for meat. Twelve judges were asked to score the redness of meat under red light and under white light.

Results:

Judge Red White

1 20 22

2 18 19

3 19 17

4 22 18

5 17 21

6 20 23

7 19 19

8 16 20

9 21 22

10 17 20

11 23 27

12 18 24


Paired t-test – analysis

Judge Red light White light Difference

1 20 22 2

2 18 19 1

3 19 17 -2

4 22 18 -4

5 17 21 4

6 20 23 3

7 19 19 0

8 16 20 4

9 21 22 1

10 17 20 3

11 23 27 4

12 18 24 6

Mean 19.2 21.0 1.83

SD 2.1 2.8 2.82

Mean difference (white − red): 1.83, SD: 2.82

Standard error (SE):

SD/sqrt(n) = 2.82/sqrt(12) = 0.81

T-test = (1.83 – 0)/0.81 = 2.25

P-value = 0.0459

Conclusion: there was a significant effect of light colour (P < 0.05).
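The hand calculation can be followed step by step in R. The sketch below takes the difference as white minus red, as in the table; the names d, se, t_stat and p_value are assumptions for illustration.

```r
red   <- c(20, 18, 19, 22, 17, 20, 19, 16, 21, 17, 23, 18)
white <- c(22, 19, 17, 18, 21, 23, 19, 20, 22, 20, 27, 24)

d <- white - red                  # per-judge difference
n <- length(d)                    # number of judges

se      <- sd(d) / sqrt(n)        # standard error of the mean difference
t_stat  <- mean(d) / se           # t statistic against a true difference of 0
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided P-value

round(c(mean = mean(d), sd = sd(d), se = se, t = t_stat, p = p_value), 4)
```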


Paired t-test – R analysis

red <- c(20,18,19,22,17,20,19,16,21,17,23,18)

white <- c(22,19,17,18,21,23,19,20,22,20,27,24)

t.test(red, white, paired=TRUE)

data: red and white
t = -2.2496, df = 11, p-value = 0.04592
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.6270234 -0.0396433
sample estimates:
mean of the differences
 -1.833333


Two-sample t-test

Sample        Group 1   Group 2
1             x1        y1
2             x2        y2
3             x3        y3
4             x4        y4
5             x5        y5
…             …         …
n             xn        yn

Sample size   n1        n2
Mean          x̄         ȳ
SD            sx        sy

Mean difference:

D = x̄ – ȳ

Variance of D:

T-statistic:

95% Confidence interval:
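The three quantities named above have standard expressions. Assuming the unequal-variance (Welch) form, which is what R's t.test() uses by default, they can be written as follows. The pooled-variance form is the usual alternative, and the symbols $\nu$ (degrees of freedom) and $t_{0.975,\nu}$ (the 97.5th percentile of the t distribution) are introduced here for illustration.

```latex
\[
\mathrm{var}(D) = \frac{s_x^2}{n_1} + \frac{s_y^2}{n_2},
\qquad
t = \frac{D}{\sqrt{\mathrm{var}(D)}},
\qquad
95\%\ \mathrm{CI}:\; D \pm t_{0.975,\nu}\,\sqrt{\mathrm{var}(D)}
\]
\[
\nu = \frac{\left( \dfrac{s_x^2}{n_1} + \dfrac{s_y^2}{n_2} \right)^{2}}
           {\dfrac{(s_x^2/n_1)^2}{n_1 - 1} + \dfrac{(s_y^2/n_2)^2}{n_2 - 1}}
\]
```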


Two-group comparison: an example

ID A B

1 3 3

2 7 1

3 1 2

4 9 4

5 3 5

6 4 2

7 1 2

8 2 5

9 6 3

10 7 2

ID A B

11 5 3

12 8 4

13 5 2

14 9 3

15 4 5

16 6 4

17 4 3

18 3 1

19 9 3

20 5 2

20 consumers rated their preference for two rice desserts (A and B)


Unpaired t-test using R

a <- c(3,7,1,9,3,4,1,2,6,7,5,8,5,9,4,6,4,3,9,5)
b <- c(3,1,2,4,5,2,2,5,3,2,3,4,2,3,5,4,3,1,3,2)
t.test(a, b)

Welch Two Sample t-test

data: a and b

t = 3.3215, df = 27.478, p-value = 0.002539

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

0.8037895 3.3962105

sample estimates:

mean of x mean of y

5.05 2.95
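To see where the numbers in this output come from, the Welch statistic, its degrees of freedom and the confidence interval can be computed by hand and compared with t.test(). This is a sketch; the names D, var_D, t_stat, df and ci are assumptions chosen for illustration.

```r
a <- c(3, 7, 1, 9, 3, 4, 1, 2, 6, 7, 5, 8, 5, 9, 4, 6, 4, 3, 9, 5)
b <- c(3, 1, 2, 4, 5, 2, 2, 5, 3, 2, 3, 4, 2, 3, 5, 4, 3, 1, 3, 2)

va <- var(a) / length(a)          # variance of the mean of a
vb <- var(b) / length(b)          # variance of the mean of b

D      <- mean(a) - mean(b)       # mean difference
var_D  <- va + vb                 # variance of D (unequal-variance form)
t_stat <- D / sqrt(var_D)         # t statistic

# Welch-Satterthwaite degrees of freedom
df <- var_D^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

# 95% confidence interval for the difference
ci <- D + c(-1, 1) * qt(0.975, df) * sqrt(var_D)

round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 4)
```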


Transformation of data: multiplicative effects

• The following data are lysozyme levels in the gastric juice of 29 patients with peptic ulcer and of 30 normal controls. The question is whether lysozyme levels differ between the two groups.

Group 1:

0.2 0.3 0.4 1.1 2.0 2.1 3.3 3.8 4.5 4.8 4.9 5.0 5.3 7.5 9.8 10.4 10.9 11.3 12.4 16.2 17.6 18.9 20.7 24.0 25.4 40.0 42.2 50.0 60.0

Group 2:

0.2 0.3 0.4 0.7 1.2 1.5 1.5 1.9 2.0 2.4 2.5 2.8 3.6 4.8 4.8 5.4 5.7 5.8 7.5 8.7 8.8 9.1 10.3 15.6 16.1 16.5 16.7 20.0 20.7 33.0


Unpaired t-test by R

g1 <- c( 0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8,

4.5, 4.8, 4.9, 5.0, 5.3, 7.5, 9.8, 10.4,

10.9, 11.3, 12.4, 16.2, 17.6, 18.9, 20.7,

24.0, 25.4, 40.0, 42.2, 50.0, 60)

g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9, 2.0,

2.4, 2.5, 2.8, 3.6, 4.8, 4.8, 5.4, 5.7, 5.8,

7.5, 8.7, 8.8, 9.1, 10.3, 15.6, 16.1, 16.5,

16.7, 20.0, 20.7, 33.0)

t.test(g1, g2)

data: g1 and g2
t = 2.0357, df = 40.804, p-value = 0.04831
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.05163216 13.20239083
sample estimates:
mean of x mean of y
14.310345  7.683333


Exploration of data

par(mfrow=c(1,2))

hist(g1)

hist(g2)

[Figure: histograms of g1 and g2, side by side; x-axes: 0 to 60 for g1 and 0 to 30 for g2; y-axes: frequency, 0 to 15]

Group 1:

mean(g1) = 14.3

sd(g1) = 15.7

Group 2:

mean(g2) = 7.7

sd(g2) = 7.8


Re-analysis of lysozyme data

log.g1 <- log(g1)

log.g2 <- log(g2)

t.test(log.g1, log.g2)

data: log.g1 and log.g2
t = 1.406, df = 55.714, p-value = 0.1653
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2182472 1.2453165
sample estimates:
mean of x mean of y
 1.921094  1.407559

exp(1.921-1.407) = 1.67

Group 1’s geometric mean is 67% higher than group 2’s, but the difference is not statistically significant (P = 0.17)
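Back-transforming the confidence interval as well as the point estimate expresses the whole result as a ratio of geometric means. A minimal sketch, where fit, ratio and ratio_ci are names assumed for illustration:

```r
g1 <- c(0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8, 4.5, 4.8, 4.9, 5.0,
        5.3, 7.5, 9.8, 10.4, 10.9, 11.3, 12.4, 16.2, 17.6, 18.9, 20.7,
        24.0, 25.4, 40.0, 42.2, 50.0, 60)
g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9, 2.0, 2.4, 2.5, 2.8,
        3.6, 4.8, 4.8, 5.4, 5.7, 5.8, 7.5, 8.7, 8.8, 9.1, 10.3, 15.6,
        16.1, 16.5, 16.7, 20.0, 20.7, 33.0)

# Welch t-test on the log scale
fit <- t.test(log(g1), log(g2))

# Difference of log means back-transformed: ratio of geometric means
ratio    <- exp(unname(fit$estimate[1] - fit$estimate[2]))
ratio_ci <- exp(as.numeric(fit$conf.int))   # 95% CI for that ratio

round(c(ratio = ratio, lower = ratio_ci[1], upper = ratio_ci[2]), 2)
```

An interval for the ratio that includes 1 corresponds to a log-scale interval that includes 0, in line with the non-significant P-value.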


Descriptive analysis: Categorical data


Comparison of two proportions - theory

                      Group 1   Group 2
____________________________________________
Sample size           n1        n2
Number of events      e1        e2
Proportion of events  p1        p2

Difference: D = p1 – p2
SE of difference: SE = sqrt[p1(1–p1)/n1 + p2(1–p2)/n2]

Z = D / SE
95% CI: D ± 1.96(SE)

With (n1 + n2) > 20, if |Z| > 2 the null hypothesis of no difference can be rejected at about the 5% level.


Comparison of two proportions - example

                    Heroin   Cocaine
__________________________________________
Sample size         100      100
Number of deaths    90       36
Mortality rate      0.90     0.36

Thirty-day mortality among rats exposed to heroin or cocaine (100 rats per group).

Analysis

Difference: D = 0.90 – 0.36 = 0.54
SE(D) = sqrt[0.9(0.1)/100 + 0.36(0.64)/100] = 0.0566
Z = 0.54 / 0.0566 = 9.54

95% CI: 0.54 ± 1.96(0.0566) = 0.43 to 0.65

Conclusion: reject the null hypothesis.
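The analysis above translates directly into a few lines of R. This sketch uses the unpooled standard error from the theory slide, without a continuity correction; the variable names are assumptions for illustration.

```r
deaths <- c(90, 36)               # heroin, cocaine
n      <- c(100, 100)             # rats per group

p  <- deaths / n                  # mortality proportions
D  <- p[1] - p[2]                 # difference in proportions
se <- sqrt(sum(p * (1 - p) / n))  # SE of the difference (unpooled)
z  <- D / se                      # Z statistic

ci <- D + c(-1, 1) * 1.96 * se    # 95% confidence interval

round(c(D = D, se = se, z = z, lower = ci[1], upper = ci[2]), 3)
```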


Comparison of two proportions - R

events <- c(90, 36)

total <- c(100, 100)

prop.test(events, total)

2-sample test for equality of proportions with continuity correction

data: events out of total
X-squared = 60.2531, df = 1, p-value = 8.341e-15
alternative hypothesis: two.sided
95 percent confidence interval:
 0.4190584 0.6609416
sample estimates:
prop 1 prop 2
  0.90   0.36


Comparison of >2 proportions – Chi square analysis

table(sex, ethnicity)

ethnicity

sex African Asian Caucasian Others

Female 4 43 22 0

Male 4 17 8 2

females <- c(4, 43, 22, 0)

total <- c(8, 60, 30, 2)

prop.test(females, total)


Comparison of >2 proportions – Chi square analysis

4-sample test for equality of proportions without continuity correction

data: females out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.5000000 0.7166667 0.7333333 0.0000000

Warning message:
Chi-squared approximation may be incorrect in: prop.test(females, total)
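The warning appears when some expected cell counts are small, which is the case here for the "Others" column. One common response, offered here as a suggestion and not as part of the original analysis, is to inspect the expected counts and use Fisher's exact test instead. The names tab, chi and exact are assumptions for illustration.

```r
# Sex by ethnicity table from the previous slide
tab <- matrix(c(4, 43, 22, 0,
                4, 17,  8, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("Female", "Male"),
                              ethnicity = c("African", "Asian",
                                            "Caucasian", "Others")))

# Chi-square test on the table; the same small-count warning is suppressed
chi <- suppressWarnings(chisq.test(tab))
round(chi$expected, 2)            # expected counts under independence

# Fisher's exact test does not rely on the chi-square approximation
exact <- fisher.test(tab)
exact$p.value
```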


Summary

• Examine the distribution of data
– Mean and variance: systematic difference?
– Normally distributed?

• Transformation?

• Present confidence intervals (and p-values)