Post on 14-Dec-2015
Stat 112: Notes 1
• Main topics of course:
– Simple Regression
– Multiple Regression
– Analysis of Variance
– Chapters 3-9 of textbook
• Readings for Notes 1: Chapter 3.1-3.2. Also, Chapter 2 contains review of material from Stat 111.
Monitoring Tiger Prey Abundance
• The Siberian (Amur) tiger is a subspecies of tiger found in the Russian Far East.
• Tigers in general are in trouble. At the beginning of the 20th century, there were around 100,000 tigers. Today, there are fewer than 6,000 tigers in the world, and only about 400 of them are Siberian tigers.
• The Sika deer is a staple of the Siberian tiger diet. It is also hunted by the local people.
• To balance the needs of the local people while ensuring that there is adequate prey for tigers, local government managers need accurate estimates of the number of Sika deer in an area.
Estimating Deer Abundance
• Two Methods:
– Counting Method: The number of deer in a plot can be determined accurately but with considerable time and work. It requires 3-5 expert field workers to monitor the plot and to classify whether deer tracks are moving into or out of the plot.
– Total Tracks Counted: Count the total number of deer tracks along transects (fixed paths that the observer walks along). Total tracks counted requires much less work.
• Can total tracks counted be used to accurately predict the density (per km^2) of deer obtained from the counting method?
Deer Density vs. Tracks Counted
• A study was done in which density was determined by expert field workers over a range of plots.
• How would we estimate the deer density if we counted 1 track per square kilometer?
[Figure: scatterplot of Density (per km^2) versus Tracks Counted (per km^2)]
Simple Regression Model
• How would we estimate the deer density if we counted 1 track per square kilometer?
• Idea: Estimate the mean deer density when we count 1 track per square kilometer.
• Simple Regression Setup:
– Y = outcome (density per km^2)
– X = explanatory variable (tracks counted per km^2)
– Note: the outcome is sometimes called the dependent variable, and the explanatory variable is sometimes called the independent or predictor variable.
• Simple Regression Model: Model for the mean (expected value) of Y given X, denoted $\mu_{Y \mid X}$ or $E(Y \mid X)$.
Simple Linear Regression Model
Simple Linear Regression Model:
$E(Y \mid X) = \beta_0 + \beta_1 X$
$\beta_1$ = Slope = Change in Mean of Y for each one unit change in X = Change in Mean Density for each one unit increase in tracks counted
$\beta_0$ = Intercept = Mean of Y for X=0 = Mean Density for zero tracks counted (X=0)
[Figure: scatterplot of Density (per km^2) versus Tracks Counted (per km^2) with the linear fit line]
Linear Fit Density (per km^2) = -0.053472 + 1.9085392*Tracks Counted (per km^2)
Using the Simple Linear Regression Model for Estimating Deer Density
Mean Density when Tracks = 0.5: E(Y|X=0.5) = -0.053 + 1.909 * 0.5 = 0.90
Mean Density when Tracks = 1: E(Y|X=1) = -0.053 + 1.909 * 1 = 1.86
Mean Density when Tracks = 1.5: E(Y|X=1.5) = -0.053 + 1.909 * 1.5 = 2.81
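The calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name `estimated_mean_density` is made up for illustration, and the coefficients are the fitted values reported in these notes, rounded to three decimals.

```python
# Fitted coefficients as reported in the notes, rounded to three decimals.
B0 = -0.053  # intercept: estimated mean density at zero tracks
B1 = 1.909   # slope: change in estimated mean density per one-unit increase in tracks


def estimated_mean_density(tracks: float) -> float:
    """Estimated mean deer density (per km^2) for a given tracks count (per km^2)."""
    return B0 + B1 * tracks


for tracks in (0.5, 1.0, 1.5):
    density = estimated_mean_density(tracks)
    print(f"tracks = {tracks}: estimated mean density = {density:.2f}")
```

The same function gives an estimate for any tracks count, although estimates are only trustworthy within the range of tracks observed in the study (roughly 0 to 3 per km^2).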
[Figure: scatterplot of Density (per km^2) versus Tracks Counted (per km^2) with the linear fit line]
Estimating the Slope and Intercept
[Figure: scatterplot of Density (per km^2) versus Tracks Counted (per km^2)]
For the population, the line that best predicts Y based on X (in terms of minimizing squared prediction error) is the line
$E(Y \mid X) = \beta_0 + \beta_1 X$
The Least Squares Principle: We want to choose the line for the sample data $(X_1, Y_1), \ldots, (X_n, Y_n)$ that minimizes the sum of squared prediction errors in the sample. The least squares line is the line $\hat{E}(Y \mid X) = b_0 + b_1 X$ that minimizes the sum of squared prediction errors in the data,
$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$.
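As a rough illustration of the least squares principle, the sketch below computes $b_0$ and $b_1$ with the standard closed-form solution, $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1 \bar{X}$. These formulas are not stated in the notes, and the data are hypothetical values invented for the example, not the deer data. The function names are also made up for illustration.

```python
import numpy as np

# Hypothetical (X, Y) pairs invented for illustration; NOT the deer data.
x = np.array([0.2, 0.5, 0.9, 1.3, 1.8, 2.4])
y = np.array([0.4, 0.8, 1.9, 2.3, 3.6, 4.4])


def least_squares_line(x, y):
    """Return (b0, b1) minimizing the sum of squared prediction errors."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(b0, b1, x, y):
    """Least squares criterion: sum over i of (Y_i - b0 - b1 * X_i)^2."""
    return np.sum((y - b0 - b1 * x) ** 2)


b0, b1 = least_squares_line(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"criterion at least squares line: {sum_squared_errors(b0, b1, x, y):.4f}")
# Moving either coefficient away from the least squares values raises the criterion.
print(f"criterion with b0 shifted by 0.1: {sum_squared_errors(b0 + 0.1, b1, x, y):.4f}")
```

Software such as JMP performs this minimization automatically; the sketch only makes the criterion concrete.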
Simple Linear Regression Using JMP
• Use Analyze, Fit Y by X. Put response variable in Y and explanatory variable in X (make sure X is continuous by clicking on the X column, clicking Cols and Column Info and checking that the Modeling Type is Continuous).
• Click Fit Line under the red triangle next to Bivariate Fit of Y by X.
Residuals
When we use the mean density $E(Y \mid X)$ (or estimated mean density $\hat{E}(Y \mid X) = b_0 + b_1 X$) to predict the true density in a plot based on the tracks X, we will typically make some error. The error for a given observation $(X_i, Y_i)$ is
$e_i = Y_i - \hat{E}(Y \mid X_i) = Y_i - (b_0 + b_1 X_i)$
$e_i$ = Residual for observation i
For observation 85, $X_{85} = 2.391$, $Y_{85} = 5.571$
$\hat{E}(Y \mid X = 2.391) = -0.053 + 1.909 \times 2.391 = 4.511$
$e_{85} = 5.571 - 4.511 = 1.060$
[Figure: scatterplot of Density (per km^2) versus Tracks Counted (per km^2), with observations 26 and 85 labeled]
For observation 26, $X_{26} = 2.174$, $Y_{26} = 2.612$
$\hat{E}(Y \mid X = 2.174) = -0.053 + 1.909 \times 2.174 = 4.097$
$e_{26} = 2.612 - 4.097 = -1.485$
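The two residual calculations for observations 85 and 26 can be checked with a short Python sketch. It uses the rounded coefficients and the (X, Y) values quoted in these notes; the function name `residual` is an illustrative choice.

```python
# Rounded fitted coefficients from the notes.
B0, B1 = -0.053, 1.909


def residual(x: float, y: float) -> float:
    """Residual e = Y - (b0 + b1 * X): observed density minus estimated mean density."""
    return y - (B0 + B1 * x)


# Observation number -> (tracks counted, observed density), as given in the notes.
observations = {85: (2.391, 5.571), 26: (2.174, 2.612)}

for obs_id, (x, y) in observations.items():
    print(f"observation {obs_id}: residual = {residual(x, y):+.3f}")
```

A positive residual (observation 85) means the plot had more deer than the line predicts for its tracks count; a negative residual (observation 26) means it had fewer.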
Root Mean Square Error
The root mean square error (RMSE) is the “average” absolute error (absolute residual). The book calls the RMSE the standard error of regression. The RMSE is obtained by the formula
$\text{RMSE} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{E}(Y \mid X_i))^2} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - (b_0 + b_1 X_i))^2}$
The root mean square error is found in the Summary of Fit in the JMP output for a simple linear regression analysis.
Bivariate Fit of Density (per km^2) By Tracks Counted (per km^2)
Linear Fit
Density (per km^2) = -0.053472 + 1.9085392*Tracks Counted (per km^2)
Summary of Fit
RSquare 0.901652
RSquare Adj 0.900649
Root Mean Square Error 0.409303
Mean of Response 1.821665
Observations (or Sum Wgts) 100
RMSE = 0.41. On average, we will make an absolute error of about 0.41 when we predict the density of deer based on the tracks.
Technical Note: RMSE^2 is the average squared residual (with the sum of squared residuals divided by n-2 rather than n). RMSE is close to, but not exactly, the average absolute residual.
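To see the technical note in action, the sketch below computes the RMSE with the n-2 divisor and compares it with the plain average absolute residual. The data are hypothetical values invented for illustration, not the deer data, and `np.polyfit` is used here simply as a convenient way to get the least squares line.

```python
import numpy as np

# Hypothetical (X, Y) pairs invented for illustration; NOT the deer data.
x = np.array([0.2, 0.5, 0.9, 1.3, 1.8, 2.4])
y = np.array([0.4, 0.8, 1.9, 2.3, 3.6, 4.4])

# np.polyfit with degree 1 returns (slope, intercept) of the least squares line.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
n = len(x)

# RMSE: square root of the sum of squared residuals divided by n - 2.
rmse = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Plain average of the absolute residuals, for comparison.
mean_abs_residual = np.mean(np.abs(residuals))

print(f"RMSE = {rmse:.3f}")
print(f"average absolute residual = {mean_abs_residual:.3f}")
```

The RMSE is never smaller than the average absolute residual, so reading it as a typical prediction error is slightly conservative.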
Application of Deer Density Regression for Tiger Conservation
• Government managers do not need to know the density of deer in an area exactly; they only need a reasonable estimate.
• An average absolute error of 0.41 in the estimate of density based on tracks counted is tolerable.
• Because the Siberian tiger’s habitat is so vast, it would be enormously costly to have expert field workers count the deer in each area of interest.
• Counting tracks and using regression to estimate the deer density based on the tracks counted provides a basis for estimating deer density across the different areas of the tiger habitat. This enables government managers to balance the needs of the local people and tigers.
Poverty and MDs
• Do states with more poverty tend to have fewer doctors? Which states have an unusually high number of doctors given their poverty rate, or an unusually low number of doctors given their poverty rate?
[Figure: Bivariate Fit of MDs per 100,000 By Poverty Percent; scatterplot with California, New York, and Pennsylvania labeled]
Simple Linear Regression Model
Y = M.D.’s per 100,000 residents
X = Poverty percent
$E(Y \mid X) = \beta_0 + \beta_1 X$
$\beta_1$ = Slope = Change in Mean of Y for each one unit change in X = Change in mean M.D.’s per 100,000 residents for 1% increase in poverty
$\beta_0$ = Intercept = Mean of Y for X=0 = Mean M.D.’s per 100,000 residents for 0% poverty.
[Figure: scatterplot of MDs per 100,000 versus Poverty Percent with the linear fit line; California, New York, and Pennsylvania labeled]
Linear Fit
MDs per 100,000 = 286.84208 - 4.3292991*Poverty Percent
The mean MDs per 100,000 residents is estimated to decrease by about 4 for an increase of 1 in the poverty percentage.
Residuals in JMP
• Saving the residuals in JMP:
– To save the residuals, after fitting the line using Fit Y by X, click the red triangle next to linear fit and click save residuals. A column with the residuals is created on the data spreadsheet.
• Sorting the residuals:
– Click the table menu, then click sort, click the name of the column with the residuals, click by and then click sort.
• Labeling observations:
– To label an observation in the graph, click the row with the observation and then click the rows menu and label. By default, JMP will use the observation number to label the observation. To make JMP use state to label the observation, click the state column, click the Cols menu and click label.
Residuals for Poverty-MD Data
Five Largest Positive Residuals
1. Massachusetts +172.35
2. New York +168.13
3. Maryland +120.06
4. Connecticut +103.52
5. Rhode Island +100.51
These states all have more doctors per resident than their poverty rate would predict.
Five Most Negative Residuals
1. Alaska -82.61
2. Iowa -76.18
3. Idaho -72.66
4. Nevada -66.22
5. Wyoming -64.32
These states all have fewer doctors per resident than their poverty rate would predict.
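Outside JMP, the same ranking can be done by sorting a table of residuals. The Python sketch below uses only the ten residuals listed in these notes (the full data set covers more states, whose residuals are not reproduced here), and the variable names are illustrative.

```python
# Residuals (MDs per 100,000) for the ten states listed in the notes.
residuals = {
    "Massachusetts": 172.35, "New York": 168.13, "Maryland": 120.06,
    "Connecticut": 103.52, "Rhode Island": 100.51,
    "Alaska": -82.61, "Iowa": -76.18, "Idaho": -72.66,
    "Nevada": -66.22, "Wyoming": -64.32,
}

# Sort from the largest positive residual down to the most negative one.
ranked = sorted(residuals.items(), key=lambda item: item[1], reverse=True)

most_positive = ranked[:3]        # more doctors than the poverty rate predicts
most_negative = ranked[-3:][::-1]  # fewer doctors, most negative first

print("Largest positive residuals:", most_positive)
print("Most negative residuals:", most_negative)
```

Sorting by the residual rather than by the raw number of doctors is what identifies states that are unusual given their poverty rate.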
Summary for Notes 1
• Regression Model: Model for the mean of an outcome Y given a value of the explanatory variable X, E(Y|X).
• Simple Linear Regression Model: $E(Y \mid X) = \beta_0 + \beta_1 X$
• Regression Models are useful for:
– Predicting Y from X
– Understanding the association between Y and X.
– Identifying observations that are unusual in their relationship between Y and X (large magnitude of residuals).