Transcript
Page 1: Introduction to Statistics

1

Introduction to Statistics

Bob Conatser, Irvine 210

Research Associate, Biomedical Sciences

[email protected]

Page 2: Introduction to Statistics

2

Statistics - Definition

The scientific study of numerical data based on variation in nature. (Sokal and Rohlf)

A set of procedures and rules for reducing large masses of data into manageable proportions allowing us to draw conclusions from those data. (McCarthy)

Page 3: Introduction to Statistics

3

Basic Terms

Measurement – assignment of a number to something

Data – collection of measurements

Sample – collected data

Population – all possible data

Variable – a property with respect to which data from a sample differ in some measurable way.

Page 4: Introduction to Statistics

4

Types of Measurements

Ordinal – rank order (1st, 2nd, 3rd, etc.)

Nominal – categorized or labeled data (red, green, blue; male, female)

Ratio (Interval) – indicates order as well as magnitude. An interval scale does not include a true zero; a ratio scale does.

Page 5: Introduction to Statistics

5

Types of Variables

Independent Variable – controlled or manipulated by the researcher; causes a change in the dependent variable. (x-axis)

Dependent Variable – the variable being measured (y-axis)

Discrete Variable – can take only certain fixed values

Continuous Variable - can assume any value

Page 6: Introduction to Statistics

6

Types of Statistics

Descriptive – used to organize and describe a sample

Inferential – used to extrapolate from a sample to a larger population

Page 7: Introduction to Statistics

7

Descriptive Statistics

Measures of Central Tendency
- Mean (average)
- Median (middle)
- Mode (most frequent)

Measures of Dispersion
- variance
- standard deviation
- standard error

Measures of Association
- correlation

Page 8: Introduction to Statistics

8

Descriptive Stats Central Tendency

Page 9: Introduction to Statistics

9

Descriptive Stats Central Tendency

Page 10: Introduction to Statistics

10

Descriptive Stats Central Tendency

[Chart: Age (20–30) plotted against Rank (1–13), marking the Mean, Median, and Mode]
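The three measures of central tendency can be computed with Python's standard statistics module. The ages below are illustrative, not the slide's exact data:

```python
import statistics

# Illustrative age sample for ranks 1-13 (assumed data, not the slide's)
ages = [20, 21, 22, 23, 24, 25, 25, 25, 26, 27, 28, 29, 30]

mean = statistics.mean(ages)      # average
median = statistics.median(ages)  # middle value
mode = statistics.mode(ages)      # most frequent value
```

For this symmetric sample all three measures coincide at 25.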

Page 11: Introduction to Statistics

11

Descriptive Stats Dispersion

Page 12: Introduction to Statistics

12

Descriptive Stats Dispersion

[Bar charts: Age data with error bars comparing Standard Deviation vs. Standard Error; one chart with y-axis 20–30, one with y-axis 0–35]
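The dispersion measures listed earlier (variance, standard deviation, standard error) can be computed directly; the age sample below is illustrative:

```python
import math
import statistics

# Illustrative age sample (assumed data, not the slide's)
ages = list(range(20, 31))  # 20, 21, ..., 30

variance = statistics.variance(ages)      # sample variance (n - 1 denominator)
std_dev = statistics.stdev(ages)          # standard deviation = sqrt(variance)
std_err = std_dev / math.sqrt(len(ages))  # standard error of the mean
```

Note the standard error shrinks as the sample grows, which is why it is always smaller than the standard deviation in the charts above.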

Page 13: Introduction to Statistics

13

Descriptive Stats Association

Highest Mastery Level Visit 1 vs Visit 6

[Scatter plot: Visit 6 (0–0.3) vs. Visit 1 (0–0.5)]

Pearson Correlation = 0.354

Correlations (SPSS output):

                                Visit_1   Visit_6
Visit_1   Pearson Correlation   1         .354**
          Sig. (2-tailed)                 .001
          N                     83        83
Visit_6   Pearson Correlation   .354**    1
          Sig. (2-tailed)       .001
          N                     83        83

**. Correlation is significant at the 0.01 level (2-tailed).
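The Pearson correlation coefficient reported by SPSS is simple to compute by hand. A minimal sketch (tested on made-up data, since the Visit_1/Visit_6 values are not in the transcript):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation: covariance scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

r ranges from -1 (perfect negative association) through 0 (none) to +1 (perfect positive association); the 0.354 above is a modest positive association.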

Page 14: Introduction to Statistics

14

Descriptive Statistics

Data can usually be characterized by a normal distribution.

Central tendency is represented by the peak of the distribution.

Dispersion is represented by the width of the distribution.

Page 15: Introduction to Statistics

15

Descriptive Statistics

Normal distribution: f(x) = (1 / (σ√(2π))) · exp[-(1/2)·((x - μ)/σ)²]

[Chart: Frequency vs. Measurement for a normal distribution; the Mean, Median, and Mode coincide at the peak, and the Standard Deviation sets the width]
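The density formula above translates directly into code; evaluating it at the mean gives the peak of the curve:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-0.5 * ((x - mu) / sigma)**2)"""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

peak = normal_pdf(0)  # height at the mean of a standard normal, ~0.3989
```

The curve is symmetric about μ, so f(μ + d) = f(μ - d) for any offset d.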

Page 16: Introduction to Statistics

16

Descriptive Statistics

Skew Distribution

[Chart: Frequency vs. Measurement for a skewed distribution; the Mode sits at the peak, with the Median and Mean pulled toward the long tail; the Standard Deviation sets the width]

Page 17: Introduction to Statistics

17

Descriptive Statistics

Standard Deviations

[Chart: Frequency vs. Measurement for normal curves with σ = 0.5, σ = 1.0, and σ = 1.5]

Page 18: Introduction to Statistics

18

Inferential Statistics

Can your experiment make a statement about the general population?

Two types:

1. Parametric
- Interval or ratio measurements
- Continuous variables
- Usually assumes that data are normally distributed

2. Non-Parametric
- Ordinal or nominal measurements
- Discrete variables
- Makes no assumption about how data are distributed

Page 19: Introduction to Statistics

19

Inferential Statistics

Null Hypothesis

Statistical hypotheses usually assume no relationship between variables.

Example: there is no association between eye color and eyesight.

If the result of your statistical test is significant, then the null hypothesis is rejected and you can say that the variables in your experiment are somehow related.

Page 20: Introduction to Statistics

20

Inferential Statistics - Error

Type I – false positive, α
Type II – false negative, β

Unfortunately, α and β cannot both have very small values. As one decreases, the other increases.

                           Statistical Result for Null Hypothesis
                           Accepted          Rejected
Actual Null    TRUE        Correct           Type I Error
Hypothesis     FALSE       Type II Error     Correct

Page 21: Introduction to Statistics

21

Inferential Statistics - Error

Type I Error

[Chart: Frequency vs. Measurement for the null distribution, with a rejection region of area α/2 in each tail]

Page 22: Introduction to Statistics

22

Inferential Statistics - Error

Type II Error

[Chart: null and alternative distributions with the α/2 rejection regions; β is the area under the alternative distribution falling in the acceptance region, and 1 - β is the power]

Page 23: Introduction to Statistics

23

Statistical Test Decision Tree

Page 24: Introduction to Statistics

24

Inferential Statistics - Power

The ability to detect a difference between two different hypotheses.

Complement of the Type II error rate (1 - β).

Fix α (= 0.05) and then try to minimize β (maximize 1 - β).

Page 25: Introduction to Statistics

25

Inferential Statistics - Power

Power depends on:

- sample size
- standard deviation
- size of the difference you want to detect

The sample size is usually adjusted so that power equals 0.8.
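As an illustration of fixing α and solving for the sample size that reaches power 0.8, here is a normal-approximation sketch for a two-sided, two-sample t-test. It is an approximation: the exact t-based calculation gives a slightly larger n (about 26 per group for a large effect):

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Normal approximation to the power of a two-sided two-sample t-test."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return NormalDist().cdf(effect_size * (n_per_group / 2) ** 0.5 - z_crit)

# Smallest n per group with power >= 0.8 for a large effect (d = 0.8)
n = 2
while approx_power(0.8, n) < 0.8:
    n += 1
```

Re-running with d = 0.2 shows why small effects need far larger samples.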

Page 26: Introduction to Statistics

26

Inferential Statistics

Effect Size

- Detectable difference in means / standard deviation
- Dimensionless
- ~0.2 – small (low power)
- ~0.5 – medium
- ~0.8 – large (powerful test)
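The effect size defined above (often called Cohen's d) is easy to compute from group summaries. Using the means and standard deviations of the treadmill example that appears later in these slides:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Effect size: difference in means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Treadmill example: healthy (n=8) vs. diseased (n=10)
d = cohens_d(928.5, 138.121, 8, 764.6, 213.75, 10)  # ~0.89, a large effect
```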

Page 27: Introduction to Statistics

27

Inferential Statistics – T-Test

Are the means of two groups different? Groups are assumed to be normally distributed and of similar size.

t(α,ν) = (Y1 - Y2) / √[(σ1² + σ2²) / n]   (equal sample sizes)

Y1 and Y2 are the means of each group; σ1 and σ2 are the standard deviations; n is the number of data points in each group; α is the significance level (usually 0.05); ν is the degrees of freedom, 2(n - 1). (Sokal & Rohlf)

Page 28: Introduction to Statistics

28

Inferential Statistics – T-Test

Compare the calculated t(α,ν) value with the value from the table. If the calculated value is larger, the null hypothesis is rejected. (Lentner, C., 1982, Geigy Scientific Tables vol. 2, CIBA-Geigy Limited, Basle, Switzerland)

Page 29: Introduction to Statistics

29

T-Test Example

Example data from SPSS.

Null Hypothesis – There is no difference between healthy people and people with coronary artery disease in time spent on a treadmill.

Group (1 – healthy, 2 – diseased); treadmill time in seconds:

Group 1: 1014, 684, 810, 990, 840, 978, 1002, 1110
Group 2: 864, 636, 638, 708, 786, 600, 1320, 750, 594, 750

Group 1: mean 928.5, StDev 138.121
Group 2: mean 764.6, StDev 213.75

[Bar chart: Coronary Artery Disease – Time on Treadmill (sec), healthy vs. diseased; y-axis 0–1200]
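As a sketch, scipy's independent-samples t-test (whose default pooled-variance form corresponds to the "equal variances assumed" row of the SPSS output) reproduces the analysis of the treadmill data:

```python
from scipy import stats

healthy = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

# Pooled-variance (equal variances assumed) independent-samples t-test
t, p = stats.ttest_ind(healthy, diseased)
# t ~ 1.87, p ~ 0.08: not significant at the 0.05 level
```

Passing `equal_var=False` instead gives the Welch version, matching the "equal variances not assumed" row.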

Page 30: Introduction to Statistics

30

T-Test Decision Tree

Page 31: Introduction to Statistics

31

T-Test Example (cont.)

Independent Samples Test (Treadmill time in seconds):

                              Levene's Test         t-test for Equality of Means
                              F      Sig.    t      df      Sig. (2-tailed)  Mean Diff.  Std. Error Diff.  95% CI Lower  95% CI Upper
Equal variances assumed       .137   .716    1.873  16      .080             163.900     87.524            -21.642       349.442
Equal variances not assumed                  1.966  15.439  .068             163.900     83.388            -13.398       341.198

The null hypothesis is accepted because the results are not significant at the 0.05 level.

Page 32: Introduction to Statistics

32

Non-Parametric Decision Tree

Page 33: Introduction to Statistics

33

Non-Parametric Statistics

Makes no assumptions about the population from which the samples are selected.

Used for the analysis of discrete data sets.

Also used when data does not meet the assumptions for a parametric analysis ("small" data sets).

Page 34: Introduction to Statistics

34

Non-Parametric Example I: Mann-Whitney

Most commonly used as an alternative to the independent samples T-Test.

Test Statistics (Treadmill time in seconds; grouping variable: group):

Mann-Whitney U                    15.000
Wilcoxon W                        70.000
Z                                 -2.222
Asymp. Sig. (2-tailed)            .026
Exact Sig. [2*(1-tailed Sig.)]    .027 (not corrected for ties)

Note the difference in results between this test and the T-Test.
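The same comparison can be run with scipy. Note that scipy reports U for the first sample, while SPSS reports the smaller of the two U statistics:

```python
from scipy import stats

healthy = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

res = stats.mannwhitneyu(healthy, diseased)
# Convert to the smaller U that SPSS reports: U_small = min(U, n1*n2 - U)
u_small = min(res.statistic, len(healthy) * len(diseased) - res.statistic)
# u_small = 15; p < 0.05, unlike the t-test on the same data
```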

Page 35: Introduction to Statistics

35

Inferential Statistics - ANOVA

ANOVA – Analysis of Variance

Compares the means of 3 or more groups.

Assumptions:
- Groups relatively equal.
- Standard deviations similar (homogeneity of variance).
- Data normally distributed.
- Sampling should be randomized.
- Independence of errors.

Post-Hoc test

Page 36: Introduction to Statistics

36

ANOVA - Example

Group 1: 12, 13, 15, 15, 17, 19, 20, 20
Group 2: 15, 15, 16, 17, 17, 21, 21, 23
Group 3: 23, 23, 25, 26, 27, 30, 30, 31

         Group 1   Group 2   Group 3
Mean     16.38     18.13     26.88
StDev    3.114     3.091     3.182
StErr    0.794     0.747     0.626

[Bar chart of the three group series; y-axis 0–35]
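A one-way ANOVA on these three groups can be run with scipy's f_oneway, which reproduces the F statistic shown in the SPSS results:

```python
from scipy import stats

g1 = [12, 13, 15, 15, 17, 19, 20, 20]
g2 = [15, 15, 16, 17, 17, 21, 21, 23]
g3 = [23, 23, 25, 26, 27, 30, 30, 31]

f, p = stats.f_oneway(g1, g2, g3)
# F ~ 25.86, p << 0.001: at least one group mean differs
```

A significant F only says the means are not all equal; a post-hoc test (such as Tukey HSD, as in the SPSS output) is still needed to find which pairs differ.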

Page 37: Introduction to Statistics

37

Anova Decision Tree

Page 38: Introduction to Statistics

38

ANOVA - Results

Test of Homogeneity of Variances (VAR00001):

Levene Statistic   df1   df2   Sig.
.001               2     21    .999

ANOVA (VAR00001):

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   506.333          2    253.167       25.855   .000
Within Groups    205.625          21   9.792
Total            711.958          23

Multiple Comparisons (Dependent Variable: VAR00001; Tukey HSD):

(I) Group   (J) Group   Mean Difference (I-J)   Std. Error   Sig.
1.00        2.00        -1.75000                1.56458      .514
1.00        3.00        -10.50000*              1.56458      .000
2.00        1.00        1.75000                 1.56458      .514
2.00        3.00        -8.75000*               1.56458      .000
3.00        1.00        10.50000*               1.56458      .000
3.00        2.00        8.75000*                1.56458      .000

*. The mean difference is significant at the .05 level.

Page 39: Introduction to Statistics

39

ANOVA – Example 2

SPSS example: Machine type vs. brake diameter (4-group ANOVA).

Null Hypothesis – There is no difference among machines in brake diameter.

Machine 1: 322.000, 322.005, 322.022, 321.991, 322.011, 321.995, 322.006, 321.976, 321.998, 321.996, 321.984, 321.984, 322.004, 322.000, 322.003, 322.002
Machine 2: 322.007, 322.031, 322.011, 322.029, 322.009, 322.026, 322.018, 322.007, 322.018, 321.986, 322.018, 322.018, 322.020, 322.012, 322.014, 322.005
Machine 3: 321.986, 321.990, 322.002, 321.984, 322.017, 321.983, 322.002, 322.001, 322.004, 322.016, 322.002, 322.009, 321.990, 321.993, 321.991, 322.002
Machine 4: 321.994, 322.003, 322.006, 322.003, 321.986, 322.002, 321.998, 321.991, 321.996, 321.999, 321.990, 322.002, 321.986, 321.991, 321.983, 321.998

          Machine 1   Machine 2   Machine 3   Machine 4
Mean      321.999     322.014     321.998     321.995
StDev     0.011       0.011       0.010       0.007

Page 40: Introduction to Statistics

40

ANOVA – Example 2

Page 41: Introduction to Statistics

41

ANOVA – Example 2 - Results

Test of Homogeneity of Variances (Disc Brake Diameter (mm)):

Levene Statistic   df1   df2   Sig.
.697               3     60    .557

ANOVA (Disc Brake Diameter (mm)):

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   .004             3    .001          11.748   .000
Within Groups    .006             60   .000
Total            .009             63

Multiple Comparisons (Dependent Variable: Disc Brake Diameter (mm); Tukey HSD), significant pairwise mean differences (I-J):

Machine 1 vs Machine 2:   -.0157487*   Sig. .000
Machine 2 vs Machine 1:    .0157487*   Sig. .000
Machine 2 vs Machine 3:    .0159803*   Sig. .000
Machine 2 vs Machine 4:    .0188277*   Sig. .000
Machine 3 vs Machine 2:   -.0159803*   Sig. .000
Machine 4 vs Machine 2:   -.0188277*   Sig. .000

*. The mean difference is significant at the .05 level.

The null hypothesis is rejected because the result is highly significant.

Page 42: Introduction to Statistics

42

Non-Parametric ANOVA Decision Tree

Page 43: Introduction to Statistics

43

Non-Parametric ANOVA Example II

Kruskal-Wallis

The Kruskal-Wallis test is a non-parametric alternative to one-way analysis of variance.

Test Statistics (Brake_Dia; Kruskal-Wallis Test; grouping variable: Machine):

Chi-Square    23.563
df            3
Asymp. Sig.   .000

The test result is highly significant. A post hoc test (multiple Mann-Whitney tests) would be done to determine which groups were different.

Page 44: Introduction to Statistics

44

Chi-Square Test

used with categorical data

two variables and two groups on both variables

results indicate whether the variables are related

Figure from Geigy Scientific Tables vol. 2

Page 45: Introduction to Statistics

45

Chi-Square Test

Assumptions:
- observations are independent
- categories do not overlap
- most expected counts > 5 and none < 1

Sensitive to the number of observations

Spurious significant results can occur for large n.

Page 46: Introduction to Statistics

46

Chi Squared Decision Tree

Page 47: Introduction to Statistics

47

Chi-Square Example

A 1991 U.S. general survey of 225 people asked whether they thought their most important problem in the last 12 months was health or finances.

Null hypothesis – Males and females will respond the same to the survey.

Group: 1 – males, 2 – females. Problem: 1 – health, 2 – finances.

[Data: 225 paired (group, problem) responses]

Page 48: Introduction to Statistics

48

Chi-Square Example

problem * group Crosstabulation (Count):

           Males   Females   Total
Health     35      57        92
Finance    56      77        133
Total      91      134       225

The cross-tabulation table shows how many people are in each category.

Chi-Square Tests:

                               Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             .372(b)   1    .542
Continuity Correction(a)       .223      1    .637
Likelihood Ratio               .373      1    .541
Fisher's Exact Test                                                   .582                   .319
Linear-by-Linear Association   .371      1    .543
N of Valid Cases               225

a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 37.21.

The non-significant result signifies that the null hypothesis is accepted.
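The 2x2 table above can be checked with scipy's chi2_contingency; with `correction=False` it reproduces the uncorrected Pearson Chi-Square row:

```python
from scipy.stats import chi2_contingency

# problem (rows: health, finance) x group (columns: males, females)
observed = [[35, 57],
            [56, 77]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
# chi2 ~ 0.37, p ~ 0.54: not significant, so the null hypothesis is retained
```

The default `correction=True` applies the Yates continuity correction instead, matching the "Continuity Correction" row.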

Page 49: Introduction to Statistics

49

Chi-Square Example II

The chi-square test can be extended to multiple responses for two groups.

problem * group Crosstabulation (Count):

                Males   Females   Total
Health          35      57        92
Finance         56      77        133
Family          15      33        48
Personal        9       10        19
Miscellaneous   15      25        40
Total           130     202       332

Chi-Square Tests:

                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             2.377(a)   4    .667
Likelihood Ratio               2.400      4    .663
Linear-by-Linear Association   .021       1    .885
N of Valid Cases               332

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 7.44.

Page 50: Introduction to Statistics

50

Page 51: Introduction to Statistics

51

Multiple Regression

Null Hypothesis – GPA at the end of Freshman year cannot be predicted by performance on college entrance exams.

GPA = α * (ACT score) + β * (Read. Comp. score)
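A two-predictor model like this (plus an intercept, which the fitted SPSS model includes) can be sketched as an ordinary least-squares fit. The numbers below are hypothetical, exactly-linear data chosen so the fit recovers known coefficients; they are not the study's data:

```python
import numpy as np

# Hypothetical data generated from gpa = 1.17 + 0.018*act + 0.042*read
act = np.array([10, 12, 14, 16, 18], dtype=float)
read = np.array([8, 9, 11, 10, 14], dtype=float)
gpa = 1.17 + 0.018 * act + 0.042 * read

# Design matrix with an intercept column; least squares recovers the coefficients
X = np.column_stack([np.ones_like(act), act, read])
coef, *_ = np.linalg.lstsq(X, gpa, rcond=None)
# coef ~ [1.17, 0.018, 0.042] = intercept, ACT slope, reading slope
```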

Page 52: Introduction to Statistics

52

Multiple Regression

[Scatter plots: GPA vs. ACT Score with fit y = 0.0382x + 1.56, R² = 0.0981; GPA vs. Reading Comprehension Score with fit y = 0.0491x + 1.2803, R² = 0.2019; plus a 3D plot of GPA against ACT Score and Read. Comp. Score]

Page 53: Introduction to Statistics

53

Multiple Regression

ANOVA(b):

             Sum of Squares   df   Mean Square   F       Sig.
Regression   .373             2    .186          1.699   .224(a)
Residual     1.317            12   .110
Total        1.689            14

a. Predictors: (Constant), Reading Comprehension score, ACT score
b. Dependent Variable: Grade Point Average in first year of college

Coefficients(a):

                              Unstandardized Coefficients   Standardized Coefficients
                              B       Std. Error            Beta                        t       Sig.
(Constant)                    1.172   .460                                              2.551   .025
ACT score                     .018    .034                  .151                        .537    .601
Reading Comprehension score   .042    .031                  .386                        1.374   .195

a. Dependent Variable: Grade Point Average in first year of college

The analysis shows no significant relationship between college entrance tests and GPA. α = .018, β = .042 (neither coefficient is significant).

Page 54: Introduction to Statistics

54

MANOVA

Multivariate ANalysis of VAriance (MANOVA)

MANOVA allows you to look at differences between variables as well as group differences.

- assumptions are the same as ANOVA
- additional condition of multivariate normality
- also assumes equal covariance matrices (standard deviations between variables should be similar)

Page 55: Introduction to Statistics

55

MANOVA Example

Subset of plantar fasciitis dataset.

Null Hypothesis – There is no difference in soleus EMG activity, peak torque, or time to peak torque for quick stretch measurements in people with plantar fasciitis who receive counterstrain treatment compared with the same group of people receiving a placebo treatment.

Treatment Group (1 – Treated, 2 – Control); values are (Peak to Peak Soleus EMG Response, millivolt²; Peak Quick Stretch Torque, newton-meter; Time to Peak Torque, milliseconds):

Group 1: (0.0706, 0.883, 322.56), (0.0189, 0.347, 329.28), (0.0062, 0.388, 319.2), (0.0396, 1.104, 325.92), (0.0668, 3.167, 315.84), (0.2524, 2.628, 248.64), (0.0183, 0.346, 336), (0.0393, 1.535, 332.64), (0.1319, 3.282, 292.32), (0.3781, 3.622, 278.88)
Group 2: (0.039, 0.557, 299.04), (0.074, 0.525, 372.96), (0.0396, 1.400, 362.88), (0.0143, 0.183, 295.68), (0.076, 3.074, 322.56), (0.2213, 3.073, 258.72), (0.0196, 0.271, 346.08), (0.0498, 2.278, 302.4), (0.155, 3.556, 309.12), (0.2887, 3.106, 292.32)

Treated:   Mean 0.10221, 1.73017, 310.128;   StDev 0.12151083, 1.31802532, 28.15859087
Control:   Mean 0.09773, 1.80212, 316.176;   StDev 0.09324085, 1.35575411, 35.22039409

Page 56: Introduction to Statistics

56

MANOVA Example

[Bar charts: Plantar Study Soleus Peak to Peak EMG (mv²), Peak Torque (nm), and Time to Peak Torque (ms), each comparing Treated vs. Control]

Page 57: Introduction to Statistics

57

MANOVA Results

Box's Test of Equality of Covariance Matrices (Design: Intercept+group):

Box's M   5.165
F         .703
df1       6
df2       2347.472
Sig.      .647

Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.

Levene's Test of Equality of Error Variances (Design: Intercept+group):

                                      F      df1   df2   Sig.
Soleus peak to peak emg (mv*2)        .349   1     18    .562
Peak quick stretch torque (nm)        .078   1     18    .783
Time to peak torque (milliseconds)    .550   1     18    .468

Tests the null hypothesis that the error variance of each dependent variable is equal across groups.

Box's test checks for equal covariance matrices. A non-significant result means the assumption holds.

Levene's test checks each dependent variable for equality of error variances across groups (univariate homogeneity of variance). A non-significant result means the assumption holds.

Page 58: Introduction to Statistics

58

MANOVA Results

Multivariate Tests (Design: Intercept+group):

Effect      Test                 Value     F            Hypothesis df   Error df   Sig.
Intercept   Pillai's Trace       .996      1207.992(a)  3.000           16.000     .000
            Wilks' Lambda        .004      1207.992(a)  3.000           16.000     .000
            Hotelling's Trace    226.499   1207.992(a)  3.000           16.000     .000
            Roy's Largest Root   226.499   1207.992(a)  3.000           16.000     .000
group       Pillai's Trace       .020      .109(a)      3.000           16.000     .954
            Wilks' Lambda        .980      .109(a)      3.000           16.000     .954
            Hotelling's Trace    .020      .109(a)      3.000           16.000     .954
            Roy's Largest Root   .020      .109(a)      3.000           16.000     .954

a. Exact statistic

The non-significant group result indicates that the null hypothesis is accepted. If the result had been significant, you would need to do post hoc tests to find out which variables were significant.

Page 59: Introduction to Statistics

59

References

Bruning, James L. and Kintz, B.L. Computational Handbook of Statistics (4th Edition). Massachusetts: Addison Wesley Longman, Inc., 1997.

Research Resource Guide, 3rd Edition, 2004-2005, http://ohiocoreonline.org, go to Research -> Research Education.

Norusis, Marija J. SPSS 9.0 Guide to Data Analysis. New Jersey: Prentice-Hall, Inc., 1999.

Sokal, Robert R. and Rohlf, James F. Biometry (2nd Edition). New York: W.H. Freeman, 1981.

Stevens, James. Applied Multivariate Statistics for the Social Sciences. New Jersey: Lawrence Erlbaum Associates, Inc., 1986.

Lentner, Cornelius. Geigy Scientific Tables vol. 2. Basle, Switzerland: Ciba-Geigy Limited, 1982.