EARTH SC \ ENVIR SC \ GEOG 3MB3 STATISTICAL ANALYSIS SECTION 4 INFERENTIAL STATISTICS (cont’d)
Page 1: Geog 3mb3 Section 4

EARTH SC \ ENVIR SC \ GEOG 3MB3 STATISTICAL ANALYSIS

SECTION 4 INFERENTIAL STATISTICS (cont’d)

Page 2: Geog 3mb3 Section 4

Two-Sample Difference of Means Tests

We may want to form hypotheses comparing two populations: does a significant difference exist between them?

Examples: Two similar cars are introduced at the same time with the same price. After 5 years, have the two cars’ values depreciated by the same amount?

In China, do we find a significant difference between the number of children born to women in coastal regions (heavy policing of one-child policy) and inland regions (weak policing of one-child policy) ?

Slight alterations of the one-sample difference of means Z and t tests allow us to compare two populations

Page 3: Geog 3mb3 Section 4

Two-Sample Difference of Means Z Test

Where:
• E(X1) is the mean of sample 1
• E(X2) is the mean of sample 2
• σ1² is the variance of population 1
• σ2² is the variance of population 2
• n1 is the size of sample 1
• n2 is the size of sample 2

Like the one-sample Z test, we use the two-sample difference of means Z test when both n1 and n2 are ≥ 30
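The test statistic itself is not reproduced in this transcript. As a sketch, the standard form implied by the definitions above is shown below, assuming a null hypothesis of equal population means and treating σ1² and σ2² as known variances. The symbol for the standard error of the difference is an assumed notation.

```latex
Z = \frac{E(X_1) - E(X_2)}{\sigma_{X_1 - X_2}},
\qquad
\sigma_{X_1 - X_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}
```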

Page 4: Geog 3mb3 Section 4

Two-Sample Difference of Means t Test

In most cases we do not know the variance of the populations, so we estimate it from sample variances (s²) using the two-sample difference of means t test:

Though the formulas for the Z and t tests look the same, the denominator of the t test is derived using the sample variances. There are two ways to do this:

1. Assume population variances are equal (σ1² = σ2²), and calculate a weighted average of the two sample variances, called a pooled variance estimate (PVE)

2. Assume population variances are unequal (σ1² ≠ σ2²), and substitute the sample variances directly for the population variances, called a separate variance estimate (SVE)

Page 5: Geog 3mb3 Section 4

Two-Sample Difference of Means t Test (cont’d)

Pooled Variance Estimate ( σ1² = σ2² )

Separate Variance Estimate ( σ1² ≠ σ2² )
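The two standard-error formulas from this slide did not survive in the transcript. The block below is a sketch of the commonly used textbook forms, assuming s1² and s2² are sample variances computed with an n − 1 divisor. Texts that define the sample variance with an n divisor write the denominators differently (for example n1 − 1 and n2 − 1 in the SVE), so the slide's exact notation may differ. The symbol s_p² for the pooled variance is an assumed name.

```latex
% Pooled variance estimate (assumes \sigma_1^2 = \sigma_2^2)
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
\sigma_{X_1 - X_2} = \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}

% Separate variance estimate (assumes \sigma_1^2 \neq \sigma_2^2)
\sigma_{X_1 - X_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

% In both cases
t = \frac{E(X_1) - E(X_2)}{\sigma_{X_1 - X_2}}
```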

Page 6: Geog 3mb3 Section 4

Two-Sample Difference of Means t Test Example

A researcher found the mean house price in Dundas and Ancaster from a record of housing sales from 2012

Ancaster µA : $462,579   n = 23   s = 30,000   s² = 900,000,000

Dundas µD : $455,891   n = 17   s = 15,000   s² = 225,000,000

Is there a significant difference between mean house price in Dundas and Ancaster?

H0 : µA = µD

HA : µA > µD

Page 7: Geog 3mb3 Section 4

Two-Sample Difference of Means t Test Example (cont’d)

We cannot use the Z test because we do not know the variances of the populations the two samples were taken from. We were only given the sample variances. We assume the population variances to be unequal, so we use the separate variance estimate.
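As a rough check on the Ancaster and Dundas example, the sketch below computes a separate-variance t statistic from the sample means, variances, and sizes given on the slide. The function name `sve_t` and the `minus_one` switch are illustrative assumptions. The switch exists because texts differ on whether each sample variance is divided by n or by n − 1, and the transcript does not show which form the slide used, so the output may not match the reported t value exactly.

```python
import math

def sve_t(mean1, mean2, var1, var2, n1, n2, minus_one=False):
    """Two-sample t statistic using a separate variance estimate (SVE).

    minus_one=False divides each sample variance by n.
    minus_one=True divides by n - 1, the form some texts use when the
    sample variance itself is defined with an n divisor.
    """
    d1 = n1 - 1 if minus_one else n1
    d2 = n2 - 1 if minus_one else n2
    std_error = math.sqrt(var1 / d1 + var2 / d2)
    return (mean1 - mean2) / std_error

# Sample values from the Ancaster (A) and Dundas (D) slide
t_n = sve_t(462_579, 455_891, 900_000_000, 225_000_000, 23, 17)
t_n_minus_1 = sve_t(462_579, 455_891, 900_000_000, 225_000_000, 23, 17,
                    minus_one=True)
print(round(t_n, 2), round(t_n_minus_1, 2))
```

Both variants give a t statistic a little below 1, so the choice of divisor does not change the decision reached in this example.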

Page 8: Geog 3mb3 Section 4

Two-Sample Difference of Means t Test Example (cont’d)

The t-value of 0.94, according to the t table, corresponds to A = 0.3159

Therefore, the p-value is:

p-value = 0.5000 – 0.3159 = 0.1841

This is a relatively high p-value. We cannot reject H0 at either α = 0.10 or α = 0.05

H0 : µA = µD

HA : µA > µD

Page 9: Geog 3mb3 Section 4

Two-Sample Difference of Proportions Test

Used to compare two sample proportions for a difference. Assumption: the variable being considered is binary (i.e. only 2 types of observation: yes-no, male-female)

Where:
• p1 = proportion of sample 1 in the category of focus
• p2 = proportion of sample 2 in the category of focus
• p̂ = pooled estimate of the focus category for the population

We define the focus category as one of the two possible responses.
Ex. Proportions in Sample 1: Yes – 0.86 ; No – 0.14
If we choose “Yes” as the focus category, we use 0.86 for calculations
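The formulas for this test are missing from the transcript. A sketch of the usual form is given below, with the pooled estimate written as p̂ and the standard error of the difference written as σ with a subscript (both symbols are assumptions). Applied to the urban and rural example that follows, this form gives |Zp| ≈ 0.437, in line with the value reported there.

```latex
\hat{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2},
\qquad
\sigma_{p_1 - p_2} = \sqrt{\hat{p}\,(1 - \hat{p}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)},
\qquad
Z_p = \frac{p_1 - p_2}{\sigma_{p_1 - p_2}}
```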

Page 10: Geog 3mb3 Section 4

Two-Sample Difference of Proportions Example

A sample was taken from a county regarding a proposed piece of legislation. Participants were divided into two categories: rural and urban. We want to know if there is a significant difference of opinion between rural and urban citizens on the legislation.

Category   Sample Size (n)   Proportion in Favour   Proportion Against
Urban      79                0.63                   0.37
Rural      44                0.59                   0.41

H0 : purban = prural

HA : purban ≠ prural

Page 11: Geog 3mb3 Section 4

Two-Sample Difference of Proportions Example (cont’d)

Substitute the pooled estimate into the equation for the standard error of the difference

Put that expression into the Zp equation
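The two substitution steps can be sketched in a few lines, using the sample sizes and proportions in favour from the urban and rural table. The function names are hypothetical, and taking rural minus urban is an assumption made only so that the sign matches the negative Zp reported on the next slide. The p-value here comes from the normal distribution directly rather than from a printed table, so it can differ from a table lookup in the third decimal place.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Return (pooled estimate, standard error, Z) for two sample proportions."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, std_error, (p1 - p2) / std_error

def two_tailed_p(z):
    """Two-tailed p-value from the standard normal distribution."""
    upper_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return 2 * upper_tail

# Rural minus urban, proportions in favour (order is an assumption)
pooled, std_error, z = two_proportion_z(0.59, 44, 0.63, 79)
p_value = two_tailed_p(z)
print(round(pooled, 4), round(std_error, 4), round(z, 4), round(p_value, 2))
```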

Page 12: Geog 3mb3 Section 4

Two-Sample Difference of Proportions Example (cont’d)

Zp = −0.43714, which corresponds to A = 0.1700

p-value = [ 0.5000 – 0.1700 ] x 2 = 0.3300 x 2 = 0.6600

We multiply by 2 because we have a non-directional HA

This p-value of 0.6600 is very large. We cannot reject the null hypothesis that there is no difference between urban and rural opinions on the new legislation.

H0 : purban = prural

HA : purban ≠ prural

Page 13: Geog 3mb3 Section 4

Matched-Pairs Tests

Matched-pairs tests are used to analyze dependent samples

Dependent Samples : Samples that are related; results of one sample give information about other samples

Examples:

1. Two measurements of the same participant’s non-commute driving distances before and after an oil crisis. Did driving distances decrease?

2. A random sample of men and women from the same Mexican villages to determine the average male and female life expectancy for these villages. Do male and female life expectancies differ between villages?

Page 14: Geog 3mb3 Section 4

Matched-Pairs Tests (cont’d)

Each sample observation has two values, which are known as a matched-pair:

In the first example, matched-pairs would be formed from each participant’s before and after distances.

In the second, the life expectancies of men and women from the same village are dependent and constitute a matched-pair. This is because people in the same village are affected by the same social, economic, and environmental factors.

Page 15: Geog 3mb3 Section 4

Matched-Pairs t Test

A parametric test which compares the mean differences of matched-pairs

Where:
• di = difference of matched-pair i
• E(d) = mean of matched-pair differences
• σd = standard error of matched-pair differences
• sd = standard deviation of matched-pair differences
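The formula itself is missing from the transcript. A sketch of the usual form, built from the symbols defined above, is shown below, where n (an added symbol) is the number of matched-pairs and the null hypothesis is a mean difference of zero. This version assumes sd uses an n − 1 divisor; defining sd with an n divisor and dividing by the square root of n − 1 gives the same t.

```latex
E(d) = \frac{1}{n} \sum_{i=1}^{n} d_i,
\qquad
s_d = \sqrt{\frac{\sum_{i=1}^{n} \left[ d_i - E(d) \right]^2}{n - 1}},
\qquad
\sigma_d = \frac{s_d}{\sqrt{n}},
\qquad
t = \frac{E(d)}{\sigma_d}
```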

Page 16: Geog 3mb3 Section 4

Matched-Pairs t Test Example

Let’s perform a matched-pairs t test on the crop yield data

The average difference, E(d), was found to be 1.5333

The term Σ[di − E(d)]² was found to be 177.73422*

*[calculations are very space-consuming]
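The space-consuming arithmetic can be sketched in code. The block below takes the 15 pairs from the crop yield table in the Wilcoxon matched-pairs example and computes the mean difference, the sum of squared deviations, and the t statistic. The variable names are illustrative, and the standard error is assumed to be the standard deviation (n − 1 divisor) over the square root of n.

```python
import math

# Crop yields for the 15 matched pairs, copied from the crop yield table
yield_2011 = [92, 91, 84, 86, 87, 90, 92, 91, 97, 102, 107, 102, 89, 90, 92]
yield_2012 = [87, 90, 78, 86, 89, 87, 93, 94, 92, 95, 101, 101, 93, 91, 92]

# Difference for each matched pair (2011 minus 2012)
differences = [x - y for x, y in zip(yield_2011, yield_2012)]
n = len(differences)

mean_diff = sum(differences) / n
sum_sq_dev = sum((d - mean_diff) ** 2 for d in differences)
std_dev = math.sqrt(sum_sq_dev / (n - 1))   # standard deviation of differences
std_error = std_dev / math.sqrt(n)          # standard error of the mean difference
t_stat = mean_diff / std_error
print(round(mean_diff, 4), round(sum_sq_dev, 4), round(t_stat, 2))
```

The sum of squared deviations comes out near 177.73, close to the figure quoted on the slide.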

Page 17: Geog 3mb3 Section 4

Matched-Pairs t Test Example (cont’d)

The t statistic calculated was 1.67. This corresponds to A = 0.4444

p = 0.5000 – 0.4444
p = 0.0556

This is a relatively low p-value. We reject the null hypothesis at α = 0.10, but cannot reject it at α = 0.05 or α = 0.01

Page 18: Geog 3mb3 Section 4

Parametric and Nonparametric Tests

Up to this point, we have made assumptions about the populations we have tested for differences in means and proportions:

• Population parameters (μ, ρ, and σ)
• Populations are normally distributed with mean μ and standard deviation σ

Parametric Tests

Tests that require knowledge of population parameters and make certain assumptions about the population’s distribution. Can only be used with interval/ratio scale data

Page 19: Geog 3mb3 Section 4

Parametric and Nonparametric Tests

Nonparametric Tests

Tests that require no knowledge of population parameters and make few assumptions about the population’s distribution. Can only be used for data in ordinal form

Data may only be available in ordinal form. Sometimes, we choose to downgrade interval/ratio data to ordinal data to use nonparametric tests.

We use nonparametric tests for non-normally distributed data

Example on next slide

Page 20: Geog 3mb3 Section 4

Non-normally Distributed Data

Turbidity, the measure of haziness or cloudiness of a fluid caused by suspended solids, is a key test of water quality. Turbidity values are generally very high upstream, and drop off downstream.

[Figure 1: Water Quality Along a Stream. Turbidity plotted against distance downstream from an arbitrarily chosen starting point (km), with a fitted non-normal distribution and a fitted normal distribution.]

As you can see, a non-normal distribution in green fits the data better than the normal distribution in red. We should use a nonparametric test to analyze these data

Page 21: Geog 3mb3 Section 4

Wilcoxon Rank Sum W Test

A nonparametric test of the difference between two samples, which only works for ordinal data. It assumes that the two population distributions have the same shape

Procedure:
1. Combine the results of the two samples and rank them (starting by assigning the lowest value the rank of 1)
2. If there is a tie, assign the average rank to the tied values (e.g. if ranks 7 through 11 all equal 34.3, assign a rank of 9 to all of those values)
3. Put the ranked values back into their original samples

Where:
• Wi = sum of ranks for the smaller sample
• E(Wi) = expected (mean) rank sum of the smaller sample
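The test statistic is not reproduced in the transcript. A sketch of the usual normal-approximation form is given below, where n_i is the size of the smaller sample and s_W is the standard deviation of the rank sum (both symbols are assumptions, and no correction for ties is included). With the export-rank example later in this section (Wi = 104, sample sizes 9 and 11) it gives Z ≈ 0.722, matching the value reported there.

```latex
Z_W = \frac{W_i - E(W_i)}{s_W},
\qquad
E(W_i) = \frac{n_i\,(n_1 + n_2 + 1)}{2},
\qquad
s_W = \sqrt{\frac{n_1\,n_2\,(n_1 + n_2 + 1)}{12}}
```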

Page 22: Geog 3mb3 Section 4

Wilcoxon Matched-Pairs Signed-Ranks Test

A nonparametric test comparing matched-pair differences using their absolute differences

Steps:

1. For all pairs, determine the sign of the differences and the absolute value of the differences

2. Exclude all pairs with an absolute difference of 0

3. Order remaining pairs from smallest absolute difference to largest absolute difference

4. Rank the pairs, starting with the smallest as 1. Ties receive a rank equal to the average of the ranks they span
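The four steps above can be sketched as a short function. The name `signed_rank_sums` and the sample differences are hypothetical and are used for illustration only; they are not data from the slides.

```python
def signed_rank_sums(differences):
    """Return (Tp, Tn): rank sums for positive and negative differences."""
    # Steps 1-2: keep each signed difference, dropping differences of 0
    nonzero = [d for d in differences if d != 0]
    # Step 3: order from smallest to largest absolute difference
    ordered = sorted(nonzero, key=abs)
    # Step 4: rank from 1; tied values get the average of the ranks they span
    ranks = [0.0] * len(ordered)
    start = 0
    while start < len(ordered):
        end = start
        while end + 1 < len(ordered) and abs(ordered[end + 1]) == abs(ordered[start]):
            end += 1
        average_rank = ((start + 1) + (end + 1)) / 2
        for k in range(start, end + 1):
            ranks[k] = average_rank
        start = end + 1
    t_pos = sum(r for d, r in zip(ordered, ranks) if d > 0)
    t_neg = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return t_pos, t_neg

# Hypothetical differences, for illustration only
t_pos, t_neg = signed_rank_sums([3, -1, 0, 1, -2])
print(t_pos, t_neg)
```

In this small example the two differences with absolute value 1 span ranks 1 and 2, so each receives 1.5.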

Page 23: Geog 3mb3 Section 4

Wilcoxon Matched-Pairs Signed-Ranks Test (cont’d)

Where: • n = number of matched-pairs; must be > 10

• T = rank sum

There are two possible values for T: Tp (rank sum for positive differences) and Tn (rank sum for negative differences). Which to use depends on how HA is stated.

If non-directional, we choose the smaller of Tp and Tn

If directional, we choose the value of T according to the smaller number of hypothesized differences (i.e. if more differences are expected to be positive, we choose Tn).
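The test statistic is not reproduced in the transcript. A sketch of the usual normal-approximation form is shown below, where μ_T and σ_T are assumed names for the mean and standard deviation of the rank sum under the null hypothesis. With T = 23.5 and n = 13, as in the crop yield example later in this section, it gives Z ≈ −1.54, matching the value reported there.

```latex
Z = \frac{T - \mu_T}{\sigma_T},
\qquad
\mu_T = \frac{n\,(n + 1)}{4},
\qquad
\sigma_T = \sqrt{\frac{n\,(n + 1)\,(2n + 1)}{24}}
```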

Page 24: Geog 3mb3 Section 4

Wilcoxon Rank Sum W Test Example

The Canadian government decides to present Canada’s gross exports for 2013 by dividing the country into 20 geographic regions. Instead of providing exact dollar values for each region, they rank them. Researchers are interested in determining whether there is an appreciable difference in gross exports between Eastern and Western Canada.

Groups A - K are classified as “Eastern”, and L-T as “Western”

H0 : ∑RE = ∑RW

HA : ∑RE ≠ ∑RW

Page 25: Geog 3mb3 Section 4

Wilcoxon Rank Sum W Test Example (cont’d)

Regions in alphabetical order:

Region   Location   Rank
A        East       11
B        East       12
C        East       3
D        East       16
E        East       6
F        East       14
G        East       2
H        East       5
I        East       13
J        East       20
K        East       4
L        West       10
M        West       9
N        West       17
O        West       1
P        West       18
Q        West       7
R        West       19
S        West       8
T        West       15

Regions sorted by rank:

Region   Location   Rank
O        West       1
G        East       2
C        East       3
K        East       4
H        East       5
E        East       6
Q        West       7
S        West       8
M        West       9
L        West       10
A        East       11
B        East       12
I        East       13
F        East       14
T        West       15
D        East       16
N        West       17
P        West       18
R        West       19
J        East       20

Page 26: Geog 3mb3 Section 4

Wilcoxon Rank Sum W Test Example (cont’d)

We first calculate the sum of the Western and Eastern ranks respectively

Sum of Western Ranks
∑ RW = 10 + 9 + 17 + 1 + 18 + 7 + 19 + 8 + 15 = 104

Sum of Eastern Ranks
∑ RE = 11 + 12 + 16 + 3 + 6 + 14 + 2 + 5 + 20 + 13 + 4 = 106

** Calculate W using values from the sample with the smaller sample size
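The calculation of W and its Z-score can be sketched as follows, using the ranks from the region table. The expected rank sum and standard deviation use the common normal-approximation formulas, which are an assumption here because the slide's own formula is not reproduced in the transcript. Variable names are illustrative.

```python
import math

east_ranks = [11, 12, 3, 16, 6, 14, 2, 5, 13, 20, 4]   # regions A-K
west_ranks = [10, 9, 17, 1, 18, 7, 19, 8, 15]          # regions L-T

# Use the smaller sample, as the slide instructs
smaller, larger = sorted([west_ranks, east_ranks], key=len)
n_small, n_large = len(smaller), len(larger)

w = sum(smaller)                                        # rank sum W
expected_w = n_small * (n_small + n_large + 1) / 2      # expected rank sum
std_w = math.sqrt(n_small * n_large * (n_small + n_large + 1) / 12)
z = (w - expected_w) / std_w
print(w, expected_w, round(std_w, 3), round(z, 3))
```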

Page 27: Geog 3mb3 Section 4

Wilcoxon Rank Sum W Test Example (cont’d)

The Z-score calculated using the Wilcoxon Rank Sum Test was 0.722

According to the normal table, that Z-score corresponds to A = 0.2642

p-value = 2(0.5000 – 0.2642) = 0.4716

This is a very high p-value, so we cannot reject the H0 that there is no appreciable difference in gross exports between the Eastern and Western regions

H0 : WE = WW

HA : WE ≠ WW

Page 28: Geog 3mb3 Section 4

Wilcoxon Matched-Pairs Signed-Ranks Test Example

A farmer recorded his crop yields for two consecutive years. The average rainfall in the growing season of Year 1 was much greater than in Year 2. Is there a significant difference in crop yields between the two years?

Crop yields and differences for all pairs:

i    2011 (X)   2012 (Y)   Sign of X-Y   |X-Y|
1    92         87         pos           5
2    91         90         pos           1
3    84         78         pos           6
4    86         86         N/A           0
5    87         89         neg           2
6    90         87         pos           3
7    92         93         neg           1
8    91         94         neg           3
9    97         92         pos           5
10   102        95         pos           7
11   107        101        pos           6
12   102        101        pos           1
13   89         93         neg           4
14   90         91         neg           1
15   92         92         N/A           0

Pairs ordered by absolute difference and ranked:

i    2011 (X)   2012 (Y)   Sign of X-Y   |X-Y|   Rank
2    91         90         pos           1       2
7    92         93         neg           1       2
14   90         91         neg           1       2
12   102        101        pos           1       2
5    87         89         neg           2       5
8    91         94         neg           3       6.5
6    90         87         pos           3       6.5
13   89         93         neg           4       8
9    97         92         pos           5       9.5
1    92         87         pos           5       9.5
3    84         78         pos           6       11.5
11   107        101        pos           6       11.5
10   102        95         pos           7       13

** we exclude pairs where X-Y = 0

H0: Yield2011 = Yield2012

HA: Yield2011 > Yield 2012

Page 29: Geog 3mb3 Section 4

Wilcoxon Matched-Pairs Signed-Ranks Test Example (cont’d)

We now find the sum of ranks for pairs with positive and negative differences

Sum of Ranks for Positive Differences (Tp)
ΣTp = 2 + 2 + 6.5 + 9.5 + 9.5 + 11.5 + 11.5 + 13 = 65.5

Sum of Ranks for Negative Differences (Tn)
ΣTn = 2 + 2 + 5 + 6.5 + 8 = 23.5

Since our HA is directional, that the yield of 2011 will be greater than 2012, we must determine whether there are more positive or negative differences in our data. We use the rank sum of the group with the smaller number of differences.

Thus, we use rank sum for negative differences = 23.5
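The conversion of this rank sum to a Z-score can be sketched as below. It takes T = 23.5 as reported above and n = 13, the number of pairs left after the two zero differences are excluded. The function name `signed_rank_z` is hypothetical, and the mean and standard deviation are the common normal-approximation formulas, assumed here because the slide's formula is not reproduced in the transcript.

```python
import math

def signed_rank_z(t, n):
    """Normal-approximation Z for a signed-ranks sum t over n nonzero pairs."""
    mean_t = n * (n + 1) / 4
    std_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - mean_t) / std_t

# T = 23.5 as reported above; 13 of the 15 pairs have a nonzero difference
z = signed_rank_z(23.5, 13)
print(round(z, 2))
```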

Page 30: Geog 3mb3 Section 4

Wilcoxon Matched-Pairs Signed-Ranks Test Example (cont’d)

A Z-score of −1.54 corresponds to A = 0.4382

p = 0.5000 – 0.4382
p = 0.0618

With this p-value, we reject the null hypothesis at α = 0.10, but fail to reject it at α = 0.05 and α = 0.01