
“closed-sum” effect. correlation and percentages association between variables can be explored using counts –are high counts of bone needles associated.

Jan 02, 2016


correlation and percentages

• association between variables can be explored using counts
  – are high counts of bone needles associated with high counts of end scrapers?

• similar questions can be asked using percent-standardized data
  – are high proportions of decorated pottery associated with high proportions of copper bells?

but…

• these are different questions with different implications for formal regression

• percents will show some correlation even if the underlying counts do not…
  – ‘spurious’ correlation (negative)
  – “closed-sum” effect

case  C_v1  C_v2  C_v3  C_v4  C_v5  C_v6  C_v7  C_v8  C_v9  C_v10
   1    15    14    94    59    76    13     8    97    10     95
   2    35     1    89    95    23    77    14     9    27     43
   3    20    96    73    31    90    65    74    60    85     27
   4    23    59     7    52    33    83    71    35    57     90
   5    36    90    86    15    97    54    52    41    34      3
   6    79     2    26     5    11    68    74    44    13     87
   7    40    99    28    66    77    23    69    22    63     36
   8    95    36    22    75    21    48    95    58    74     68
   9    27     0    58    99    32    30     5     5   100     75
  10    67    93    98    61    62    94     3    16    43     48

Variable sets used for the percentages: 10 vars., 5 vars., 3 vars., 2 vars.

matrix(round(rnorm(100, 50, 15)), nrow=10)
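The closed-sum effect can be sketched with simulated data. This is a minimal illustration, not the code behind the slides: the seed, the 1000 cases (more than the 10 in the table, so the averages are stable) and the helper name `mean_r` are assumptions made here.

```r
# Sketch of the closed-sum effect with simulated, independent counts.
set.seed(1)                      # arbitrary seed, for repeatability only

# Mean off-diagonal correlation after percentaging the first k columns.
# (mean_r is a helper defined for this sketch.)
mean_r <- function(counts, k) {
  sub <- counts[, 1:k, drop = FALSE]
  pct <- 100 * sub / rowSums(sub)      # each row now sums to 100
  r <- cor(pct)
  mean(r[upper.tri(r)])
}

# 1000 cases of 10 independent variables.
counts <- matrix(round(rnorm(10000, 50, 15)), ncol = 10)
counts[counts < 1] <- 1          # guard against zero or negative "counts"

r_counts <- cor(counts)
mean_counts <- mean(r_counts[upper.tri(r_counts)])

mean_pct <- sapply(c(10, 5, 3, 2), function(k) mean_r(counts, k))
names(mean_pct) <- c("10 vars.", "5 vars.", "3 vars.", "2 vars.")

print(round(mean_counts, 2))     # independent counts: near zero
print(round(mean_pct, 2))        # percentages: increasingly negative
```

The fewer variables share the 100%, the stronger the induced negative correlation; with only two variables the percentages are perfectly negatively correlated, because one is always 100 minus the other.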

[Figure: five panels of r, each on an axis from -1.0 to 1.0: original counts; %s (10 vars.); %s (5 vars.); %s (3 vars.); %s (2 vars.)]

[Figure: five scatterplots of V2 against V1: original counts (C_V1, C_V2); %s 10 vars. (P10_V1, P10_V2); %s 5 vars. (T5_V1, T5_V2); %s 3 vars. (T3_V1, T3_V2); %s 2 vars. (T2_V1, T2_V2)]


outliers


• including outliers in regression analyses is usually a bad idea…

• Tukey-line / least squares discrepancies are good red-flag signals
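One way to see the red-flag signal is to fit both lines to the same data and compare slopes. The sketch below uses simulated data with one planted outlier (an assumption for illustration); `line()` is the resistant-line function in R's stats package.

```r
# Sketch: compare a least-squares fit with Tukey's resistant line.
set.seed(2)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1)   # simulated linear data, true slope 2
y[20] <- y[20] + 60                  # one gross outlier (illustrative)

ls_fit    <- lm(y ~ x)               # least squares
tukey_fit <- line(x, y)              # Tukey's resistant line (stats::line)

slopes <- c(least.squares = coef(ls_fit)[["x"]],
            tukey.line    = coef(tukey_fit)[[2]])
print(round(slopes, 2))
```

When the two slopes disagree noticeably, as they should here, a few cases are probably dominating the least-squares fit and are worth inspecting before any regression is reported.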

[Figure: three scatterplots of y2 against x2]

[Figure: scatterplot of soMort$mortal against soMort$SO2]

“convex hull trimming”

[Figure: two scatterplots of soMort$mortal against log(soMort$SO2)]

“convex hull trimming”

> hull1 <- chull(x, y)

> plot(x, y)

> polygon(x[hull1], y[hull1])

> abline(lm(y[-hull1] ~ x[-hull1]))

[Figure: scatterplot of soMort$mortal against log(soMort$SO2)]

transformation

• at least two major motivations in regression analysis:
  – create/improve a linear relationship
  – correct skewed distribution(s)

• ex: density of obsidian vs. distance from the quarry:

[Figure: scatterplot of DENSITY against DIST]

[Figure: scatterplot of DENSITY against DIST, with a “Plot of Residuals against Predicted Values” (RESIDUAL against ESTIMATE)]

[Figure: DENSITY against DIST with a logarithmic y-axis, and LG_DENS against DIST]

LG_DENS <- log(DENSITY)

old.par <- par(no.readonly = TRUE)

plot(DIST, DENSITY, log="y")
par(old.par)

[Figure: two scatterplots of VAR2 against VAR1]

[Figure: scatterplot of VAR2 against VAR1T]

> VAR1T <- sqrt(VAR1)
> plot(VAR1T, VAR2)


transformation summary

• correcting left skew:
  x⁴      stronger
  x³      strong
  x²      mild

• correcting right skew:
  √x      weak
  log(x)  mild
  -1/x    strong
  -1/x²   stronger
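The right-skew half of this ladder can be tried on simulated data. The sketch below is an illustration only: the lognormal sample and the moment-based `skew` helper are assumptions made here, not part of the slides.

```r
# Sketch: stepping down the ladder on right-skewed data.
set.seed(4)
x <- rlnorm(500, meanlog = 3, sdlog = 0.8)   # simulated right-skewed data

# Simple moment-based skewness (a helper defined here, not a base R function).
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

ladder <- c(raw       = skew(x),
            sqrt      = skew(sqrt(x)),
            log       = skew(log(x)),
            neg.recip = skew(-1 / x))
print(round(ladder, 2))
```

Each step down the ladder pulls the skewness lower. For this sample the log is about right, and -1/x overcorrects into left skew, which is why the choice of transformation should match the severity of the skew.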


“coefficient of determination”

• regression/correlation
  – the strength of a relationship can be assessed by seeing how knowledge of one variable improves the ability to predict the other

• if you ignore x, the best predictor of y will be the mean of all y values (y-bar)
  – if the y measurements are widely scattered, prediction errors will be greater than if they are close together

• we can assess the dispersion of y values around their mean by:

$$\sum_i (y_i - \bar{y})^2$$

[Figure: regression diagram marking $\bar{y}$, $y_i$, $\sum_i (y_i - \bar{y})^2$ and $\sum_i (y_i - \hat{y}_i)^2$]

$$r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

• “coefficient of determination” (r²)

• describes the proportion of variation that is “explained” or accounted for by the regression line…

• r² = .5: half of the variation is explained by the regression…

half of the variation in y is explained by variation in x…
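The two sums of squares can be computed directly and compared with what `lm()` reports. The data below are simulated for illustration, and the variable names are choices made for this sketch.

```r
# Sketch: r-squared from the two sums of squares.
set.seed(5)
x <- runif(40, 0, 10)
y <- 2 + 1.5 * x + rnorm(40, sd = 3)     # simulated data

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

ss_total <- sum((y - mean(y))^2)         # dispersion around y-bar
ss_resid <- sum((y - y_hat)^2)           # dispersion around the regression line

r2 <- 1 - ss_resid / ss_total
print(c(by.hand     = r2,
        from.lm     = summary(fit)$r.squared,
        cor.squared = cor(x, y)^2))
```

All three numbers agree: for a single predictor, the proportion of variation accounted for by the line is the square of the correlation coefficient.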

[Figures: “explaining variance” vs. range of x]

multiple regression

residuals

• vertical deviations of points around the regression
  – for case i, residual = y_i − ŷ_i = y_i − (a + bx_i)

• residuals in y should not show patterned variation either with x or y-hat

• should be normally distributed around the regression line

• residual error should not be autocorrelated (errors/residuals in y are independent…)
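These checks can be sketched in a few lines. The data are simulated and well behaved by construction; the Shapiro-Wilk test and the lag-1 correlation are one possible choice of diagnostics, not necessarily the ones used for the slides.

```r
# Sketch: basic residual checks on a simulated regression.
set.seed(7)
x <- seq(1, 50)
y <- 10 + 0.8 * x + rnorm(50, sd = 4)    # simulated, well-behaved data

fit <- lm(y ~ x)
res <- resid(fit)

# Patterning with x or y-hat: the linear correlations are zero by
# construction for least squares, so a plot is needed to spot curvature
# or funnel shapes.
if (interactive()) plot(fitted(fit), res)

# Normality of the residuals (Shapiro-Wilk test).
print(shapiro.test(res)$p.value)

# Autocorrelation: lag-1 correlation of residuals taken in case order.
lag1 <- cor(res[-1], res[-length(res)])
print(round(lag1, 2))
```

A small p-value or a lag-1 correlation far from zero would suggest that the assumptions listed above are not being met.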

• residuals may show patterning with respect to other variables…

• explore this with a residual scatterplot
  – residuals (y − ŷ) vs. other variables…

• are there suggestions of linear or other kinds of relationships?

• if r² < 1, some of the remaining variation may be explainable with reference to other variables

• paying close attention to outliers in a residual plot may lead to important insights

• e.g.: outlying residuals from quantities of exotic flint ~ distance from quarries
  – sites with special access through transport routes, political alliances…

• residuals from regressions are often the main payoff

Middle Formative, Basin of Mexico

Formative Basin of Mexico

• settlement survey

• 3 variables recorded from sites:
  – site size (proxy for population)
  – amount of arable land in standard “catchment”
  – productivity index for soils

How are these variables related?

Do any make sense as dependent or independent variables?

1. SIZE (ha)

2. AGLAND (km²)

3. PROD (index)

[Figure: scatterplot of SIZE against AGLAND]

SIZE ~ AGLAND

r² = .75

y = 35.4 + .66x

SIZE (ha) = 35.38 + .66*AGLAND (km²)

[Figure: scatterplot of SIZE against AGLAND]

residuals??

> resSize <- frmdat$size - (35.4 + .66 * frmdat$agland)

residual SIZE = SIZE – SIZE-hat

[Figure: scatterplot of resSize against frmdat$prod]
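The same residuals can be taken from a fitted model instead of typing the coefficients by hand. In the sketch below `demo` is a hypothetical stand-in for `frmdat`: its values are simulated here and are not the survey data.

```r
# Sketch: residuals from lm(), then checked against a third variable.
set.seed(6)
# Hypothetical stand-in for frmdat; simulated values, not the survey data.
demo <- data.frame(agland = runif(30, 0, 100),
                   prod   = runif(30, 0.7, 1.3))
demo$size <- 35 + 0.66 * demo$agland + 40 * (demo$prod - 1) + rnorm(30, sd = 5)

fit <- lm(size ~ agland, data = demo)
demo$resSize <- resid(fit)               # size minus fitted size

# The same thing computed by hand from the fitted coefficients.
by_hand <- demo$size - (coef(fit)[[1]] + coef(fit)[[2]] * demo$agland)

print(all.equal(unname(demo$resSize), by_hand))
print(cor(demo$resSize, demo$prod))      # leftover association with prod

if (interactive()) plot(demo$prod, demo$resSize)
```

Using `resid()` avoids the small rounding error that comes from retyping coefficients such as 35.4 and .66.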

[Figure: scatterplot of SIZE against PROD]

PROD & SIZE

r² = .69

SIZE = -29 + 98 * PROD

[Figure: SIZE against AGLAND (r² = .75) and SIZE against PROD (r² = .69)]

What have we “explained” about site size??

[Figure: scatterplot matrix of size, agland and prod]

[Figure: scatterplot of PROD against AGLAND]

r² = .55

multiple regression…

[Figure: diagram of X0, X1 and X2]

1 = total variance observed in the dependent variable (x0)

[Figure: X0, with total variance 1]

[Figure: diagram of X0 and X1]

$r_{01}^2$: variance in x0 explained by x1, by itself…

$1 - r_{01}^2$: variance in x0 unexplained by x1…

[Figure: diagram of X0 and X2]

$r_{02}^2$: variance in x0 explained by x2, by itself…

$1 - r_{02}^2$: variance in x0 unexplained by x2…

$$r_{01.2} = \frac{r_{01} - r_{02}\,r_{12}}{\sqrt{1 - r_{02}^2}\,\sqrt{1 - r_{12}^2}}$$

partial correlation coefficient: proportion of variance in x0 explained by x1, that is not explained by x2…

$$r_{01.2}^2\,(1 - r_{02}^2)$$

(total variance in x0 explained by x1, that is not explained by x2…)

[Figure: diagram of X0 and X1]

$$R_{0.12}^2 = r_{02}^2 + r_{01.2}^2\,(1 - r_{02}^2)$$

multiple coefficient of determination: variance in x0 explained by x1 and x2, both separately, and together…
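As a worked check, the two formulas can be applied to the r² values reported for the Basin of Mexico example (x0 = SIZE, x1 = AGLAND, x2 = PROD). The sketch assumes all three simple correlations are positive, since the slides give only their squares; the variable names are choices made here.

```r
# Values reported on the slides (squared simple correlations).
r2_size_agland <- 0.75    # SIZE ~ AGLAND
r2_size_prod   <- 0.69    # SIZE ~ PROD
r2_agland_prod <- 0.55    # PROD ~ AGLAND

# All three relationships are treated as positive (an assumption; the
# slides give only the squared values).
r01 <- sqrt(r2_size_agland)
r02 <- sqrt(r2_size_prod)
r12 <- sqrt(r2_agland_prod)

# Partial correlation of x0 with x1, controlling for x2.
r01.2 <- (r01 - r02 * r12) / (sqrt(1 - r02^2) * sqrt(1 - r12^2))

# Multiple coefficient of determination.
R2 <- r02^2 + r01.2^2 * (1 - r02^2)

print(round(c(r01.2 = r01.2, R2 = R2), 2))
```

Under that assumption the partial correlation comes to about .67 and the multiple coefficient of determination to about .83, which is close to the r² = .83 shown on the final slide.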

[Figure: plots of SIZE, AGLAND and PROD, and a diagram of SITE-SIZE, productivity and agricultural land]

SIZE = -1.8 + .42*AGLAND + 50*PROD

y = -1.8 + .42x₁ + 50x₂

size = -1.8 + .42*agland + 50*prod

• various scales are involved:
  size    hectares
  agland  km²
  prod    productivity index

• increasing available agricultural land by 1 km² increases site-size by about .4 hectares

• a 1-unit increase of soil productivity increases site-size by about 50 hectares

• which of these two factors is more important??

• calculate “beta” coefficients to eliminate the effect of differing scales…

• convert the variables to Z-scores
  – mean of 0
  – standard deviation of 1

• repeat multiple correlation analysis…

# Z-scores: subtract the mean, divide by the standard deviation
frmdat <- within(frmdat, {
  Bsize   <- (size - mean(size)) / sd(size)
  Bagland <- (agland - mean(agland)) / sd(agland)
  Bprod   <- (prod - mean(prod)) / sd(prod)
})

# regression on the standardized variables gives the "beta" coefficients
lmBeta <- lm(Bsize ~ Bagland + Bprod, data = frmdat)
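The same standardization can be done with `scale()`. The sketch below runs on a hypothetical stand-in for `frmdat` (simulated values, not the survey data) and also checks the shortcut beta = b × sd(x) / sd(y).

```r
# Sketch: beta coefficients via scale() and via the shortcut formula.
set.seed(8)
# Hypothetical stand-in for frmdat (simulated values, not the survey data).
demo <- data.frame(agland = runif(30, 0, 100),
                   prod   = runif(30, 0.7, 1.3))
demo$size <- -2 + 0.4 * demo$agland + 50 * demo$prod + rnorm(30, sd = 5)

raw_fit <- lm(size ~ agland + prod, data = demo)

# scale() centres each column on 0 and rescales it to sd 1 (Z-scores).
z <- as.data.frame(scale(demo))
beta_fit <- lm(size ~ agland + prod, data = z)

# Equivalent shortcut: beta = b * sd(x) / sd(y)
b <- coef(raw_fit)[c("agland", "prod")]
beta_by_hand <- b * c(sd(demo$agland), sd(demo$prod)) / sd(demo$size)

print(round(coef(beta_fit), 3))          # intercept is zero up to rounding
print(round(beta_by_hand, 3))
print(c(raw          = summary(raw_fit)$r.squared,
        standardized = summary(beta_fit)$r.squared))
```

Standardizing changes the coefficients but not the fit: r² is the same for both models, and the intercept of the standardized model is zero.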


size = .55*agland + .43*prod

doesn’t change…should be zero…

[Figure: diagram linking agricultural land and productivity to site size, labeled β = .55, β = .45, r² = .83 and r² = .55]