
Stat 112: Notes 1

• Main topics of course:

– Simple Regression

– Multiple Regression

– Analysis of Variance

– Chapters 3-9 of textbook

• Readings for Notes 1: Chapter 3.1-3.2. Also, Chapter 2 contains review of material from Stat 111.

Monitoring Tiger Prey Abundance

• The Siberian (Amur) tiger is a subspecies of tiger found in the Russian Far East.

• Tigers in general are in trouble. At the beginning of the 20th century, there were around 100,000 tigers. Today, there are fewer than 6,000 tigers in the world, and only about 400 of them are Siberian tigers.

• The Sika deer is a staple of the Siberian tiger diet. It is also hunted by the local people.

• To balance the needs of the local people while ensuring that there is adequate prey for tigers, local government managers need accurate estimates of the number of Sika deer in an area.

Estimating Deer Abundance

• Two Methods:

– Counting Method: The number of deer in a plot can be determined accurately but with considerable time and work. It requires 3-5 expert field workers to monitor the plot and to classify whether deer tracks are moving into or out of the plot.

– Total Tracks Counted: Count the total number of deer tracks along transects (fixed paths that the observer walks along). Total tracks counted requires much less work.

• Can total tracks counted be used to accurately predict the density (per km^2) of deer obtained from the counting method?

Deer Density vs. Tracks Counted

• A study was done in which density was determined by expert field workers over a range of plots.

• How would we estimate the deer density if we counted 1 track per square kilometer?

[Figure: scatterplot of Density (per km^2) against Tracks Counted (per km^2)]

Simple Regression Model

• How would we estimate the deer density if we counted 1 track per square kilometer?

• Idea: Estimate the mean deer density when we count 1 track per square kilometer.

• Simple Regression Setup:

– Y = outcome (density per km^2)

– X = explanatory variable (tracks counted per km^2)

– Note: the outcome is sometimes called the dependent variable, and the explanatory variable is sometimes called the independent or predictor variable.

• Simple Regression Model: Model for the mean (expected value) of Y given X, denoted $\mu_{Y|X}$ or $E(Y|X)$.

Simple Linear Regression Model

$$E(Y|X) = \beta_0 + \beta_1 X$$

$\beta_1$ = Slope = Change in Mean of Y for each one unit change in X = Change in Mean Density for each one unit increase in tracks counted

$\beta_0$ = Intercept = Mean of Y for X=0 = Mean Density for zero tracks counted (X=0)

[Figure: scatterplot of Density (per km^2) against Tracks Counted (per km^2), with the fitted line]

Linear Fit: Density (per km^2) = -0.053472 + 1.9085392*Tracks Counted (per km^2)

Using the Simple Linear Regression Model for Estimating Deer Density

Mean Density when Tracks = 0.5: E(Y|X=0.5) = -0.053 + 1.909 * 0.5 = 0.90

Mean Density when Tracks = 1: E(Y|X=1) = -0.053 + 1.909 * 1 = 1.86

Mean Density when Tracks = 1.5: E(Y|X=1.5) = -0.053 + 1.909 * 1.5 = 2.81
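The three calculations above can be reproduced with a short Python sketch. The function name `predict_density` is illustrative and not part of the course materials; the coefficients are the rounded values from the fitted line.

```python
# Rounded coefficients taken from the fitted line reported in the notes.
B0 = -0.053  # intercept
B1 = 1.909   # slope

def predict_density(tracks_per_km2):
    """Estimated mean deer density (per km^2) for a given track count (per km^2).

    Illustrative helper; the name is an assumption, not from the notes.
    """
    return B0 + B1 * tracks_per_km2

for x in (0.5, 1.0, 1.5):
    print(f"Tracks = {x}: estimated mean density = {predict_density(x):.2f}")
```

The study's plots appear to cover track counts from about 0 to 3 per km^2, so estimates for track counts far outside that range would be extrapolation.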


Estimating the Slope and Intercept


For the population, the line that best predicts Y based on X (in terms of minimizing squared prediction error) is the line

$$E(Y|X) = \beta_0 + \beta_1 X$$

The Least Squares Principle: We want to choose the line for the sample data $(X_1, Y_1), \ldots, (X_n, Y_n)$ that minimizes the sum of squared prediction errors in the sample. The least squares line is the line $\hat{E}(Y|X) = b_0 + b_1 X$ that minimizes the sum of squared prediction errors in the data,

$$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$
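As a minimal sketch of the least squares principle, the Python below computes the slope and intercept using the standard closed-form solution, b1 = sum((X_i - mean X)(Y_i - mean Y)) / sum((X_i - mean X)^2) and b0 = mean Y - b1 * mean X. The notes do not state this formula (JMP does the computation), and the function names and the four data points are made up for illustration; they are not the deer study data.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) minimizing sum((y - b0 - b1*x)^2) over the sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared prediction errors for the line b0 + b1*x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data (NOT the deer study data).
tracks = [0.0, 1.0, 2.0, 3.0]
density = [0.1, 1.9, 4.1, 5.9]

b0, b1 = least_squares_line(tracks, density)
print("intercept b0:", b0)
print("slope b1:", b1)
print("SSE at least squares line:", sum_squared_errors(tracks, density, b0, b1))
print("SSE at a nearby line:", sum_squared_errors(tracks, density, b0 + 0.1, b1))
```

Any other choice of intercept and slope, such as the nearby line in the last print statement, gives a larger sum of squared errors on the same data.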

Simple Linear Regression Using JMP

• Use Analyze, Fit Y by X. Put response variable in Y and explanatory variable in X (make sure X is continuous by clicking on the X column, clicking Cols and Column Info and checking that the Modeling Type is Continuous).

• Click Fit Line under the red triangle next to Bivariate Fit of Y by X.

Residuals

When we use the mean density $E(Y|X)$ (or estimated mean density $\hat{E}(Y|X) = b_0 + b_1 X$) to predict the true density in a plot based on the tracks X, we will typically make some error. The error for a given observation $(X_i, Y_i)$ is

$$e_i = Y_i - \hat{E}(Y_i|X_i) = Y_i - (b_0 + b_1 X_i)$$

$e_i$ = Residual for observation i

For observation 85, $X_{85} = 2.391$, $Y_{85} = 5.571$:

$$\hat{E}(Y|X = 2.391) = -0.053 + 1.909 \times 2.391 = 4.511$$

$$e_{85} = 5.571 - 4.511 = 1.060$$

For observation 26, $X_{26} = 2.174$, $Y_{26} = 2.612$:

$$\hat{E}(Y|X = 2.174) = -0.053 + 1.909 \times 2.174 = 4.097$$

$$e_{26} = 2.612 - 4.097 = -1.485$$
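The two residual calculations can be checked with a short Python sketch. The helper name `residual` is illustrative; the coefficients are the rounded fitted values, and the (X, Y) pairs are the two observations quoted in the notes.

```python
# Rounded coefficients from the fitted line in the notes.
B0, B1 = -0.053, 1.909

def residual(x, y):
    """Observed Y minus the estimated mean of Y at X (illustrative helper)."""
    return y - (B0 + B1 * x)

# Observation number -> (tracks counted, density), as quoted in the notes.
observations = {85: (2.391, 5.571), 26: (2.174, 2.612)}

for i, (x, y) in observations.items():
    print(f"Observation {i}: residual = {residual(x, y):.3f}")
```

A positive residual (observation 85) means the plot had more deer than the line predicts from its tracks; a negative residual (observation 26) means it had fewer.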

Root Mean Square Error

The root mean square error (RMSE) is the “average” absolute error (absolute residual). The book calls the RMSE the standard error of regression. The RMSE is obtained by the formula

$$\text{RMSE} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\left(Y_i - \hat{E}(Y_i|X_i)\right)^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\left(Y_i - (b_0 + b_1 X_i)\right)^2}$$

The root mean square error is found in the Summary of Fit in the JMP output for a simple linear regression analysis.

Bivariate Fit of Density (per km^2) By Tracks Counted (per km^2)

Linear Fit: Density (per km^2) = -0.053472 + 1.9085392*Tracks Counted (per km^2)

Summary of Fit
RSquare                      0.901652
RSquare Adj                  0.900649
Root Mean Square Error       0.409303
Mean of Response             1.821665
Observations (or Sum Wgts)   100

RMSE = 0.41. On average, we will make an absolute error of about 0.41 when we predict the density of deer based on the tracks.

Technical Note: RMSE^2 is the average squared residual (with n-2 rather than n in the denominator). RMSE is close to, but not exactly, the average absolute residual.
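To make the technical note concrete, here is a small Python sketch that computes the RMSE (dividing by n-2) and the average absolute residual for the same set of residuals. The six residual values are made up for illustration and are not from the deer data; the function names are likewise assumptions.

```python
import math

def rmse(residuals):
    """Root mean square error with n - 2 in the denominator."""
    n = len(residuals)
    return math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

def mean_absolute_residual(residuals):
    """Plain average of the absolute residuals."""
    return sum(abs(e) for e in residuals) / len(residuals)

# Hypothetical residuals (NOT from the deer study), chosen to sum to zero.
residuals = [0.3, -0.5, 0.1, -0.2, 0.4, -0.1]

print("RMSE:", round(rmse(residuals), 4))
print("Average absolute residual:", round(mean_absolute_residual(residuals), 4))
```

The two numbers measure the typical size of an error on the same scale as Y, but they are not equal, which is why the RMSE is only roughly an "average" absolute error.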

Application of Deer Density Regression for Tiger Conservation

• Government managers do not need to know the density of deer in an area exactly; they only need a reasonable estimate.

• An average absolute error of 0.41 in the estimate of density based on tracks counted is tolerable.

• Because the Siberian tiger’s habitat is so vast, it would be enormously costly to have expert field workers count the deer in each area of interest.

• Counting tracks and using regression to estimate the deer density based on the tracks counted provides a basis for estimating deer density across the different areas of the tiger habitat. This enables government managers to balance the needs of the local people and tigers.

Poverty and MDs

• Do states with more poverty tend to have fewer doctors? Which states have an unusually high or an unusually low number of doctors given their poverty rate?

[Figure: Bivariate Fit of MDs per 100,000 By Poverty Percent; scatterplot with California, New York, and Pennsylvania labeled]

Simple Linear Regression Model

Y = M.D.’s per 100,000 residents

X = Poverty percent

$$E(Y|X) = \beta_0 + \beta_1 X$$

$\beta_1$ = Slope = Change in Mean of Y for each one unit change in X = Change in mean M.D.’s per 100,000 residents for a one-point increase in the poverty percent

$\beta_0$ = Intercept = Mean of Y for X=0 = Mean M.D.’s per 100,000 residents for 0% poverty

[Figure: scatterplot of MDs per 100,000 against Poverty Percent with the fitted line; California, New York, and Pennsylvania labeled]

Linear Fit: MDs per 100,000 = 286.84208 - 4.3292991*Poverty Percent

The mean number of MDs per 100,000 residents is estimated to decrease by about 4.3 for each one-point increase in the poverty percentage.
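The slope interpretation can be illustrated with a brief Python sketch that evaluates the fitted line at two poverty percentages one point apart. The function name `predict_mds` and the choice of 10% and 11% are illustrative assumptions; the coefficients are those reported in the Linear Fit.

```python
# Coefficients from the Linear Fit reported in the notes.
B0 = 286.84208    # intercept: estimated mean MDs per 100,000 at 0% poverty
B1 = -4.3292991   # slope: change in estimated mean per one-point increase

def predict_mds(poverty_percent):
    """Estimated mean MDs per 100,000 residents (illustrative helper)."""
    return B0 + B1 * poverty_percent

at_10 = predict_mds(10)
at_11 = predict_mds(11)
print("Estimated mean at 10% poverty:", round(at_10, 2))
print("Estimated mean at 11% poverty:", round(at_11, 2))
print("Difference:", round(at_11 - at_10, 2))
```

Because the model is linear, the difference between any two poverty percentages one point apart equals the slope.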

Residuals in JMP

• Saving the residuals in JMP:

– To save the residuals, after fitting the line using Fit Y by X, click the red triangle next to Linear Fit and click Save Residuals. A column with the residuals is created in the data table.

• Sorting the residuals:

– Click the Tables menu, then click Sort, click the name of the column with the residuals, click By, and then click Sort.

• Labeling observations:

– To label an observation in the graph, click the row with the observation and then click the Rows menu and Label. By default, JMP will use the observation number to label the observation. To make JMP use the state name instead, click the state column, click the Cols menu, and click Label.

Residuals for Poverty-MD Data

Five Largest Positive Residuals

1. Massachusetts +172.35
2. New York +168.13
3. Maryland +120.06
4. Connecticut +103.52
5. Rhode Island +100.51

These states all have more doctors per resident than their poverty rate would predict.

Five Most Negative Residuals

1. Alaska -82.61
2. Iowa -76.18
3. Idaho -72.66
4. Nevada -66.22
5. Wyoming -64.32

These states all have fewer doctors per resident than their poverty rate would predict.
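Outside JMP, the same sort-and-rank step could be sketched in Python. The dictionary below holds only the ten residuals listed above, not the full set of states, and the variable names are illustrative.

```python
# Residuals for the ten states listed in the notes (MDs per 100,000).
residuals = {
    "Massachusetts": 172.35, "New York": 168.13, "Maryland": 120.06,
    "Connecticut": 103.52, "Rhode Island": 100.51,
    "Alaska": -82.61, "Iowa": -76.18, "Idaho": -72.66,
    "Nevada": -66.22, "Wyoming": -64.32,
}

# Sort from largest to smallest residual.
ranked = sorted(residuals.items(), key=lambda item: item[1], reverse=True)

largest_positive = ranked[:5]       # most doctors relative to prediction
most_negative = ranked[-5:][::-1]   # fewest doctors relative to prediction

print("Largest positive residuals:")
for state, e in largest_positive:
    print(f"  {state}: {e:+.2f}")

print("Most negative residuals:")
for state, e in most_negative:
    print(f"  {state}: {e:+.2f}")
```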

Summary for Notes 1

• Regression Model: Model for the mean of an outcome Y given a value of the explanatory variable X, E(Y|X).

• Simple Linear Regression Model: $E(Y|X) = \beta_0 + \beta_1 X$

• Regression Models are useful for:

– Predicting Y from X

– Understanding the association between Y and X

– Identifying observations that are unusual in their relationship between Y and X (large magnitude of residuals)
