Page 1
correlation and percentages
• association between variables can be explored using counts
  – are high counts of bone needles associated with high counts of end scrapers?
• similar questions can be asked using percent-standardized data
  – are high proportions of decorated pottery associated with high proportions of copper bells?
Page 2
but…
• these are different questions with different implications for formal regression
• percents will show some correlation even if underlying counts do not…
  – ‘spurious’ correlation (negative)
  – “closed-sum” effect
Page 3
case C_v1 C_v2 C_v3 C_v4 C_v5 C_v6 C_v7 C_v8 C_v9 C_v10
1 15 14 94 59 76 13 8 97 10 95
2 35 1 89 95 23 77 14 9 27 43
3 20 96 73 31 90 65 74 60 85 27
4 23 59 7 52 33 83 71 35 57 90
5 36 90 86 15 97 54 52 41 34 3
6 79 2 26 5 11 68 74 44 13 87
7 40 99 28 66 77 23 69 22 63 36
8 95 36 22 75 21 48 95 58 74 68
9 27 0 58 99 32 30 5 5 100 75
10 67 93 98 61 62 94 3 16 43 48
%s computed across: 10 vars., 5 vars., 3 vars., 2 vars.
matrix(round(rnorm(100, 50, 15)), nrow=10)
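The closed-sum effect can be checked by simulation. What follows is a minimal R sketch, not the code behind the slides: it assumes the percentages are taken across the first k columns of the count matrix, and the names `counts`, `pct_first_k`, and `r_by_k` are illustrative only.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

# 10 cases x 10 variables of independent random "counts"
counts <- matrix(round(rnorm(100, 50, 15)), nrow = 10)

# Row percentages across the first k variables (assumed subset)
pct_first_k <- function(m, k) {
  sub <- m[, 1:k, drop = FALSE]
  100 * sub / rowSums(sub)   # each row divided by its own total
}

# Correlation between V1 and V2 in the raw counts ...
r_counts <- cor(counts[, 1], counts[, 2])

# ... and after percent-standardizing across 10, 5, 3, and 2 variables
ks <- c(10, 5, 3, 2)
r_by_k <- sapply(ks, function(k) {
  p <- pct_first_k(counts, k)
  cor(p[, 1], p[, 2])
})
names(r_by_k) <- paste0(ks, "_vars")

print(r_counts)
print(round(r_by_k, 2))
```

With only two variables the percentages must sum to 100, so the correlation between them is forced to -1 whatever the counts were.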
Page 4
[Figure: distributions of r (axis from -1.0 to 1.0) in five panels: original counts; %s (10 vars.); %s (5 vars.); %s (3 vars.); %s (2 vars.)]
Page 5
[Figure: scatterplots of V2 against V1 in five panels: original counts (C_V1, C_V2); %s 10 vars. (P10_V1, P10_V2); %s 5 vars. (T5_V1, T5_V2); %s 3 vars. (T3_V1, T3_V2); %s 2 vars. (T2_V1, T2_V2)]
Page 7
• including outliers in regression analyses is usually a bad idea…
• Tukey-line / least squares discrepancies are good red-flag signals
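One way to see the red-flag signal is to fit both lines to the same data and compare slopes. The sketch below uses made-up data (an exact line with a single outlier) and base R's `line()`, which fits Tukey's resistant line; the variable names are illustrative.

```r
# Made-up data: an exact line y = 2x, with one outlier at the last point
x <- 1:20
y <- 2 * x
y[20] <- 100

ls_fit    <- lm(y ~ x)    # least squares, pulled toward the outlier
tukey_fit <- line(x, y)   # Tukey's resistant line (stats::line)

ls_slope    <- unname(coef(ls_fit)[2])
tukey_slope <- unname(coef(tukey_fit)[2])

# A large gap between the two slopes suggests influential points
print(c(least_squares = ls_slope, tukey = tukey_slope))
```

Here the resistant slope should stay close to 2 while the least-squares slope is dragged upward, which is the kind of discrepancy worth investigating before trusting a regression.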
Page 8
[Figure: three scatterplot panels of y2 against x2]
Page 9
[Figure: scatterplot of soMort$mortal against soMort$SO2]
“convex hull trimming”
Page 10
[Figure: scatterplot of soMort$mortal against log(soMort$SO2)]
Page 11
[Figure: scatterplot of soMort$mortal against log(soMort$SO2)]
Page 12
“convex hull trimming”
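> hull1 <- chull(x, y)                 # indices of the points on the convex hull
> plot(x, y)
> polygon(x[hull1], y[hull1])          # outline the hull
> abline(lm(y[-hull1] ~ x[-hull1]))    # regression without the hull points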
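For a self-contained run, the same steps can be tried on simulated data. This is a sketch under stated assumptions: `x` and `y` are random stand-ins, not the soMort data, and `fit_all` / `fit_trimmed` are illustrative names.

```r
set.seed(1)  # arbitrary seed
x <- runif(50, 0, 10)
y <- 3 + 0.5 * x + rnorm(50)

hull <- chull(x, y)                     # indices of points on the convex hull

fit_all     <- lm(y ~ x)                # all points
fit_trimmed <- lm(y[-hull] ~ x[-hull])  # hull points removed
n_trimmed   <- length(y[-hull])

plot(x, y)
polygon(x[hull], y[hull])               # outline the hull
abline(fit_all, lty = 2)                # dashed: all points
abline(fit_trimmed)                     # solid: trimmed

print(rbind(all = coef(fit_all), trimmed = coef(fit_trimmed)))
```

Note that trimming removes every point on the hull, whether or not it is a genuine outlier, so comparing the two fits is more informative than looking at the trimmed fit alone.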
Page 13
[Figure: scatterplot of soMort$mortal against log(soMort$SO2)]
Page 15
transformation
• at least two major motivations in regression analysis:
  – create/improve a linear relationship
  – correct skewed distribution(s)
Page 16
• ex: density of obsidian vs. distance from the quarry:
[Figure: scatterplot of DENSITY against DIST]
Page 17
[Figure: scatterplot of DENSITY against DIST; plot of residuals against predicted values (RESIDUAL against ESTIMATE)]
Page 18
[Figure: DENSITY (log-scaled axis) against DIST; LG_DENS against DIST]
LG_DENS <- log(DENSITY)
old.par <- par(no.readonly = TRUE)
plot(DIST, DENSITY, log="y")
par(old.par)
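The effect of logging the dependent variable can be illustrated with simulated data. The sketch below does not use the obsidian data from the slides: `DIST` and `DENSITY` are generated from an assumed exponential fall-off, and the decay rate and noise level are arbitrary choices.

```r
set.seed(7)  # arbitrary seed
DIST    <- seq(5, 80, by = 5)   # made-up distances
DENSITY <- 6 * exp(-0.05 * DIST) * exp(rnorm(length(DIST), 0, 0.1))

fit_raw <- lm(DENSITY ~ DIST)        # straight line through a curved pattern
fit_log <- lm(log(DENSITY) ~ DIST)   # linear after the log transformation

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(raw = r2_raw, logged = r2_log))

# Residuals against fitted values: curvature shows up in the raw fit
old.par <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "raw")
plot(fitted(fit_log), resid(fit_log), main = "log(DENSITY)")
par(old.par)
```

In a run like this the residuals of the raw fit form an arc, while those of the logged fit scatter without pattern, which is the visual cue that the transformation has improved linearity.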
Page 19
[Figure: two scatterplots of VAR2 against VAR1]
Page 20
[Figure: scatterplot of VAR2 against VAR1T]
> VAR1T <- sqrt(VAR1)
> plot(VAR1T, VAR2)
Page 21
transformation summary
• correcting left skew:
  x^4      stronger
  x^3      strong
  x^2      mild
• correcting right skew:
  √x       weak
  log(x)   mild
  -1/x     strong
  -1/x^2   stronger
Page 22
“coefficient of determination”
Page 23
• regression/correlation
  – the strength of a relationship can be assessed by seeing how knowledge of one variable improves the ability to predict the other
Page 24
• if you ignore x, the best predictor of y will be the mean of all y values (y-bar)
  – if the y measurements are widely scattered, prediction errors will be greater than if they are close together
• we can assess the dispersion of y values around their mean by:
$\sum (y_i - \bar{y})^2$
Page 25
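[Figure: scatterplot marking $\bar{y}$ and $y_i$, with deviations from the mean, $(y_i - \bar{y})^2$, and from the regression line, $(y_i - \hat{y}_i)^2$]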
Page 26
$r^2 = 1 - \dfrac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
• “coefficient of determination” (r2)
• describes the proportion of variation that is “explained” or accounted for by the regression line…
• r2 = .5: half of the variation is explained by the regression…
  half of the variation in y is explained by variation in x…
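The definition can be verified numerically: one minus the ratio of the residual sum of squares to the total sum of squares around the mean matches the r-squared that `lm()` reports, and also the squared correlation. This is a sketch on simulated data; all names are illustrative.

```r
set.seed(3)  # arbitrary seed
x <- 1:30
y <- 10 + 2 * x + rnorm(30, 0, 8)

fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)       # dispersion of y around y-bar
ss_resid <- sum((y - fitted(fit))^2)   # dispersion of y around the line

r2_by_hand <- 1 - ss_resid / ss_total  # proportion of variation "explained"
r2_lm      <- summary(fit)$r.squared   # value reported by lm()
r2_cor     <- cor(x, y)^2              # squared correlation coefficient

print(c(by_hand = r2_by_hand, lm = r2_lm, cor_squared = r2_cor))
```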
Page 27
[Figure: “explaining variance” vs. the range of x]
Page 29
multiple regression
Page 30
residuals
• vertical deviations of points around the regression
  – for case i, residual $= y_i - \hat{y}_i = y_i - (a + b x_i)$
• residuals in y should not show patterned variation either with x or y-hat
• should be normally distributed around the regression line
• residual error should not be autocorrelated (errors/residuals in y are independent…)
Page 31
• residuals may show patterning with respect to other variables…
• explore this with a residual scatterplot
  – ŷ vs. other variables…
• are there suggestions of linear or other kinds of relationships?
• if r2 < 1, some of the remaining variation may be explainable with reference to other variables
Page 32
• paying close attention to outliers in a residual plot may lead to important insights
• e.g.: outlying residuals from quantities of exotic flint ~ distance from quarries
  – sites with special access through transport routes, political alliances…
• residuals from regressions are often the main payoff
Page 33
Middle Formative,
Basin of Mexico
Page 34
Formative Basin of Mexico
• settlement survey
• 3 variables recorded from sites:
  – site size (proxy for population)
  – amount of arable land in standard “catchment”
  – productivity index for soils
Page 35
How are these variables related?
Do any make sense as dependent or independent variables?
1. SIZE (ha)
2. AGLAND (km2)
3. PROD (index)
Page 36
[Figure: scatterplot of SIZE against AGLAND (SIZE ~ AGLAND)]
Page 37
r2 = .75
y = 35.4 + .66x
SIZE (ha) = 35.38 + .66*AGLAND (km2)
Page 38
[Figure: scatterplot of SIZE against AGLAND]
residuals??
Page 39
> resSize <- frmdat$size - (35.4 + .66 * frmdat$agland)
residual SIZE = SIZE – SIZE-hat
[Figure: scatterplot of resSize against frmdat$prod]
Page 40
[Figure: scatterplot of SIZE against PROD]
PROD & SIZE
r2 = .69
SIZE = -29 + 98 * PROD
Page 41
[Figure: scatterplots of SIZE against AGLAND (r2 = .75) and SIZE against PROD (r2 = .69)]
What have we “explained” about site size??
Page 42
[Figure: scatterplot matrix of size, agland, and prod]
Page 43
[Figure: scatterplot of PROD against AGLAND]
r2 = .55
Page 44
multiple regression…
[Figure: diagram relating X0, X1, and X2]
Page 45
[Figure: X0, labelled 1]
1 = total variance observed in the dependent variable (x0)
Page 46
[Figure: X0 and X1]
$r_{01}^2$: variance in x0 explained by x1, by itself…
$1 - r_{01}^2$: variance in x0 unexplained by x1…
Page 47
[Figure: X0 and X2]
$r_{02}^2$: variance in x0 explained by x2, by itself…
$1 - r_{02}^2$: variance in x0 unexplained by x2…
Page 48
$r_{01.2}^2 = \dfrac{(r_{01} - r_{02}\,r_{12})^2}{(1 - r_{02}^2)(1 - r_{12}^2)}$
partial correlation coefficient (squared): proportion of the variance in x0 left unexplained by x2 that is explained by x1…
$r_{01.2}^2\,(1 - r_{02}^2)$
(total variance in x0 explained by x1, that is not explained by x2…)
[Figure: X0 and X1]
Page 49
$R_{0.12}^2 = r_{02}^2 + r_{01.2}^2\,(1 - r_{02}^2)$
multiple coefficient of determination: variance in x0 explained by x1 and x2, both separately, and together…
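As a worked check, the squared partial correlation and the multiple coefficient of determination can be computed from the three pairwise r2 values quoted for the Basin of Mexico example (SIZE ~ AGLAND .75, SIZE ~ PROD .69, AGLAND ~ PROD .55). This sketch assumes all three correlations are positive, which the slides do not state explicitly; the function and variable names are illustrative.

```r
# Pairwise correlations (0 = SIZE, 1 = AGLAND, 2 = PROD),
# taken as positive square roots of the quoted r2 values (assumption)
r01 <- sqrt(.75)
r02 <- sqrt(.69)
r12 <- sqrt(.55)

# Squared partial correlation of x0 and x1, controlling for x2
partial_r2 <- function(r01, r02, r12) {
  (r01 - r02 * r12)^2 / ((1 - r02^2) * (1 - r12^2))
}

r2_partial <- partial_r2(r01, r02, r12)

# Multiple R2: what x2 explains alone, plus what x1 adds beyond x2
R2_multiple <- r02^2 + r2_partial * (1 - r02^2)

print(round(c(partial = r2_partial, multiple = R2_multiple), 3))
```

Under these assumptions the multiple R2 comes out close to the r2 = .83 shown on the final slide.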
Page 50
[Figure: pairwise plots of SIZE, AGLAND, and PROD; diagram of site size, productivity, and agricultural land]
Page 51
SIZE = -1.8 + .42*AGLAND + 50*PROD
y = -1.8 + .42x1 + 50x2
Page 52
size = -1.8 + .42*agland + 50*prod
• various scales are involved:
  size     hectares
  agland   km2
  prod     productivity index
• increasing available agricultural land by 1 km2 increases site-size by about .4 hectares
• a 1-unit increase of soil productivity increases site-size by about 50 hectares
• which of these two factors is more important??
Page 53
• calculate “beta” coefficients to eliminate the effect of differing scales…
• convert the variables to Z-scores
  – mean of 0
  – standard deviation of 1
• repeat multiple correlation analysis…
Page 54
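# within() keeps the new columns in frmdat; assignments made inside with() are discarded
frmdat <- within(frmdat, {
  Bsize   <- (size - mean(size)) / sd(size)
  Bagland <- (agland - mean(agland)) / sd(agland)
  Bprod   <- (prod - mean(prod)) / sd(prod)
})
lmBeta <- lm(Bsize ~ Bagland + Bprod, data = frmdat)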
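The same standardization can be done with `scale()`, and the result checked against the unstandardized fit. The data below are simulated stand-ins for `frmdat` (the real survey data are not reproduced here), with coefficients chosen arbitrarily to resemble the example.

```r
set.seed(11)  # arbitrary seed
n <- 40
agland <- runif(n, 10, 90)
prod   <- runif(n, 0.7, 1.3)
size   <- -2 + 0.4 * agland + 50 * prod + rnorm(n, 0, 5)
sim    <- data.frame(size, agland, prod)   # simulated stand-in for frmdat

fit_raw <- lm(size ~ agland + prod, data = sim)

z        <- as.data.frame(scale(sim))      # z-scores: mean 0, sd 1
fit_beta <- lm(size ~ agland + prod, data = z)

# Beta coefficients read directly from the standardized fit ...
beta_direct <- coef(fit_beta)[c("agland", "prod")]

# ... equal the raw slopes rescaled by sd(x) / sd(y)
beta_rescaled <- coef(fit_raw)[c("agland", "prod")] *
  c(sd(sim$agland), sd(sim$prod)) / sd(sim$size)

print(round(rbind(beta_direct, beta_rescaled), 3))
print(c(intercept = unname(coef(fit_beta)[1]),
        r2_raw    = summary(fit_raw)$r.squared,
        r2_beta   = summary(fit_beta)$r.squared))
```

Standardizing leaves r2 unchanged and drives the intercept to zero (up to rounding error), so only the slopes change, and they become directly comparable.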
Page 55
size = .55*agland + .43*prod
doesn’t change…should be zero…
Page 56
[Figure: diagram linking agricultural land and productivity to site size; labels: β = .55, β = .45, r2 = .83, r2 = .55]