22s:152 Applied Linear Regression
Chapter 2: Regression Analysis
--------------------
Regression analysis
• a class of statistical methods for
– studying relationships between variables that can be measured
  e.g. predicting blood pressure from age
– using known values of certain variables to predict the values of other variables for the same subjects
  e.g. given a person’s age, cholesterol, and weight, predict blood pressure
Well-known Example: Space Shuttle Challenger
On January 27, 1986, the night before a planned launch, a 3-hour discussion took place.
The discussion was about the forecasted low temperature of 31°F for the next day, and the effect of low temperature on O-ring performance. (O-rings seal joints.)
In their discussion they utilized the following plot, showing the relationship between the number of O-rings having some thermal distress and the temperature, to decide whether the shuttle should take off as planned.
[Figure: number of incidents versus temperature (50 to 85), showing only the flights that had at least one incident.]
The final decision was to launch the shuttle as planned.
- 7 astronauts were killed
- combustion gas leak through an O-ring was the cause of the accident
Post-tragedy, a commission noted that a mistake in the analysis of the data was that the flights with zero incidents were left off, because it was felt that these flights did not contribute any information about the temperature effect.
[Figure: number of incidents versus temperature (50 to 85), with the zero-incident flights included.]
What may have helped in the decision-making process?
- use of all the data (rather than using data conditional on the occurrence of an incident)
- quantification of the relationship between temperature and O-ring failure (perhaps as a conditional probability)
- prediction of the probability of O-ring failure at 31°F (logistic regression; Dalal et al. used this approach in their 1989 article)
Dalal, S.R., Fowlkes, E.B. and Hoadley, B. (1989). Risk analysis of the Space Shuttle: Pre-Challenger Prediction of Failure. Journal of the American Statistical Association, v.84, 945-957.
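The prediction idea in the last bullet can be sketched as follows. This is only an illustration, not the analysis in Dalal et al.: the temperature and distress values are made up for demonstration (they are not the shuttle flight data), and `fit_logistic` is a hypothetical helper that fits a one-predictor logistic regression by Newton-Raphson on centered temperatures.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def predict(b0, b1, x):
    """Estimated probability of an incident at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# HYPOTHETICAL data: temperature (F) and whether any O-ring distress occurred.
temps = [52, 55, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82]
distress = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

center = 67  # centering the predictor keeps the Newton steps well behaved
b0, b1 = fit_logistic([t - center for t in temps], distress)

p_31 = predict(b0, b1, 31 - center)
p_80 = predict(b0, b1, 80 - center)
print(f"slope = {b1:.3f}, P(incident | 31F) = {p_31:.3f}, P(incident | 80F) = {p_80:.3f}")
```

Note that 31°F lies far below every temperature in the fitted data, so any such prediction is an extrapolation and carries much more uncertainty than the point estimate suggests.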
‘Investing it: duffers need not apply’
New York Times, May 31, 1998
An example of inappropriate removal of outliers
- An investment compensation expert carried out a study purporting to show that the major companies whose CEOs had low golf scores had high-performing stocks.
- The expert obtained data for golf scores from the journal Golf Digest and used his own data on the stock market performance of the companies of 51 chief executives.
- He created a Stock Rating, which rated each company on how the investors who held its stock did, with 100 being highest and 0 lowest.
[Figure: three scatterplots of stock rating (0 to 100) versus handicap (5 to 35).
 Panel 1, "All data points": corr = −0.04.
 Panel 2, "Points considered outliers": 'Outliers' marked with X.
 Panel 3, "Data in final analysis": 'Outliers' removed, corr = −0.41.]
King, B. (1998). Critique of ‘Investing it: duffers need not apply.’ Chance News 7.06.
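The effect of selectively dropping points can be reproduced on a toy example. The sketch below uses a made-up 3 x 3 grid of (handicap, stock rating) pairs, not the data from the golf study, and a hypothetical `pearson` helper. The grid is built to have exactly zero correlation, and the two points removed are chosen because they contradict a negative relationship.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL data: every combination of three handicaps and three ratings.
points = [(h, s) for h in (10, 20, 30) for s in (20, 50, 80)]

# Drop the two points that least support "low handicap goes with high rating".
dropped = {(10, 20), (30, 80)}
kept = [p for p in points if p not in dropped]

r_all = pearson(*zip(*points))
r_kept = pearson(*zip(*kept))
print(f"all points: r = {r_all:.2f}; after removal: r = {r_kept:.2f}")
```

With all nine points the correlation is 0; after removing the two inconvenient points it is −0.5, even though nothing about the underlying relationship has changed.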
Ch.2 Regression analysis... (as stated in book p. 16)
examines the relationship between a quantitative dependent variable $Y$ and one or more quantitative independent variables, $X_1, \ldots, X_k$. (He reserves the term regression for quantitative variables.)
Regression analysis traces the conditional distribution of $Y$, or some aspect of the distribution such as its mean, as a function of the $X$'s.
Examples:
- General relationship between $X$ and $Y$ (where $\varepsilon$ represents a random error):
$Y = f(X) + \varepsilon$
where $f$ may be a linear or non-linear relationship.
Linear Models (linear in the parameters)
- Simple linear relationship:
Model the conditional mean response of a continuous variable using a linear relationship to a single continuous variable, assuming normal errors.
$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$
Given $X$, $Y$ has a normal distribution with a mean (center) of $\beta_0 + \beta_1 X$ and a variance of $\sigma^2$.
Also written as: $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$
[Sketch: plot showing normal conditional distributions.]
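To make the simple linear model concrete, the sketch below simulates data from it and recovers the parameters by ordinary least squares. The parameter values (intercept 3, slope 1.5, error standard deviation 2), the sample size, and the helper name `ols` are all assumptions chosen for illustration.

```python
import random

def ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

random.seed(0)
beta0, beta1, sigma = 3.0, 1.5, 2.0   # assumed "true" values for the simulation

# Given X, draw Y from N(beta0 + beta1 * X, sigma^2).
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

b0, b1 = ols(xs, ys)

# Estimate sigma^2 from the residuals, using n - 2 degrees of freedom.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s2 = sum(r * r for r in residuals) / (len(xs) - 2)

print(f"intercept = {b0:.2f}, slope = {b1:.2f}, estimated variance = {s2:.2f}")
```

The estimates should land near the assumed values, with the scatter around the fitted line reflecting the common variance at every value of X.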
- Quadratic relationship:
Model the conditional mean response of a continuous variable as a quadratic relationship to a single continuous variable (this is still a linear model, as it’s linear in the parameters).
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$
- Multiple linear relationships:
Model the conditional mean response of a continuous variable as a linear relationship with each of two continuous variables (no interaction).
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$
The mean response surface is shown below.
Mean response surface (errors not shown):
[Figure: three-dimensional plot of the mean response surface.]
This surface is a plane in space.
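Because the quadratic and two-predictor models are both linear in the parameters, the same least-squares machinery fits either one; only the columns of the design matrix change. The sketch below shows this with hypothetical helpers `solve` and `least_squares` that solve the normal equations. The coefficients are assumed values, and the responses are generated without error terms (as in the mean response surface), so the fit recovers them up to rounding.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares(design, y):
    """Coefficients minimizing the residual sum of squares, via (X'X) beta = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Quadratic model: design columns 1, X, X^2 (assumed coefficients 1, 2, -0.5).
xs = [0, 1, 2, 3, 4, 5, 6]
y_quad = [1.0 + 2.0 * x - 0.5 * x ** 2 for x in xs]
beta_quad = least_squares([[1.0, x, x ** 2] for x in xs], y_quad)

# Two-predictor model: design columns 1, X1, X2 (assumed coefficients 4, 1, 3).
pairs = [(x1, x2) for x1 in range(4) for x2 in range(3)]
y_plane = [4.0 + 1.0 * x1 + 3.0 * x2 for x1, x2 in pairs]
beta_plane = least_squares([[1.0, x1, x2] for x1, x2 in pairs], y_plane)

print("quadratic:", [round(b, 6) for b in beta_quad])
print("plane:    ", [round(b, 6) for b in beta_plane])
```

Solving the normal equations directly is fine for a small illustration; statistical software typically uses a more numerically stable decomposition.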
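Non-Linear Models (not linear in the parameters)
- Specific relationship:
$Y = \beta_0 + \beta_1 X_1^{\beta_2} + \beta_3 X_2^{\beta_4} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$
- Specific relationship:
$Y = f(X_1, X_2) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$
Mean response surface (errors not shown):
[Figure: mean response surface for a non-linear model.]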
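One informal way to see the distinction between "linear in the parameters" and "not linear in the parameters" is to differentiate the mean function with respect to a parameter: in a linear model the result involves only the predictors, while in a non-linear model it still involves parameters. The comparison below is a sketch for the quadratic and power models on these slides, assuming $X_1 > 0$ so that the logarithm is defined.

```latex
\begin{align*}
\text{Quadratic model:}\quad
  & \frac{\partial}{\partial \beta_2}
    \left( \beta_0 + \beta_1 X + \beta_2 X^2 \right) = X^2
  && \text{(free of the } \beta\text{'s)} \\[4pt]
\text{Power model:}\quad
  & \frac{\partial}{\partial \beta_2}
    \left( \beta_0 + \beta_1 X_1^{\beta_2} + \beta_3 X_2^{\beta_4} \right)
    = \beta_1 X_1^{\beta_2} \ln X_1
  && \text{(depends on } \beta_1 \text{ and } \beta_2\text{)}
\end{align*}
```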
Non-normality
The conditional distribution of $Y$ given $X$ does not have to be normal. BUT the validity of many of our common hypothesis tests depends on normality.
$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim$ a right-skewed distribution
[Sketch]
- Might attain normality of errors through transformations ⇒ if so, the common statistical tests are valid
- Could use the original skewed data and maximum likelihood methods for estimation (with a specified non-normal distribution)
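The transformation idea in the first bullet can be illustrated with simulated data. The sketch below assumes one particular situation in which a log transformation works, namely that log(Y) is linear in X with normal errors; other skewed error distributions may call for a different transformation or for the maximum likelihood route. All parameter values and the helper names `ols`, `skewness`, and `residual_skewness` are hypothetical.

```python
import math
import random

def ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def residual_skewness(xs, ys):
    """Skewness of the residuals from a straight-line fit of ys on xs."""
    b0, b1 = ols(xs, ys)
    return skewness([y - (b0 + b1 * x) for x, y in zip(xs, ys)])

random.seed(1)
# ASSUMED data-generating process: log(Y) = 1 + 0.3 X + N(0, 0.6^2) error.
xs = [random.uniform(0, 5) for _ in range(500)]
ys = [math.exp(1.0 + 0.3 * x + random.gauss(0, 0.6)) for x in xs]

skew_raw = residual_skewness(xs, ys)                         # fit on the original scale
skew_log = residual_skewness(xs, [math.log(y) for y in ys])  # fit after log transform

print(f"residual skewness: raw = {skew_raw:.2f}, log scale = {skew_log:.2f}")
```

On the original scale the residuals are clearly right-skewed, while on the log scale their skewness is close to zero, which is the situation in which the usual normal-theory tests are appropriate.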
Nonparametric Regression
LOWESS (locally weighted scatterplot smoother)
[Figure: scatterplot of Prestige (20 to 80) versus Average Income, USD (0 to 25000).]
- The lowess smoother estimates the function $f$ in $Y_i = f(x_i) + \varepsilon_i$
- The predicted $Y_i$ for a given $x_i$ is determined by considering only ‘local’ points in a ‘window’ around $x_i$
- Often a simple linear regression is fit to the local points, and the prediction falls on this line
- The researcher chooses the width of the window
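The local-fitting idea in these bullets can be sketched in a few lines. This is a simplified illustration rather than a full lowess implementation: it fits a weighted straight line in a nearest-neighbour window with tricube weights and omits the robustness iterations that lowess normally adds. The function name `local_linear_smooth`, the `span` parameter, and the example data are assumptions.

```python
import math

def local_linear_smooth(xs, ys, span=0.5):
    """Smooth ys against xs with a locally weighted straight-line fit.

    span is the fraction of the points that defines the window around each x.
    """
    n = len(xs)
    k = max(3, int(round(span * n)))      # number of neighbours in each window
    fitted = []
    for x0 in xs:
        # Window half-width: distance from x0 to its k-th nearest neighbour.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        # Tricube weights: near 1 close to x0, falling to 0 at the window edge.
        ws = [(1 - (abs(x - x0) / h) ** 3) ** 3 if abs(x - x0) < h else 0.0
              for x in xs]
        # Weighted least-squares line through the local points.
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # The prediction at x0 falls on the local line.
        fitted.append(my + slope * (x0 - mx))
    return fitted

# HYPOTHETICAL data: a curved trend with an alternating wiggle added.
xs = [float(i) for i in range(20)]
ys = [math.sqrt(x) + 0.3 * (-1) ** i for i, x in enumerate(xs)]
smooth = local_linear_smooth(xs, ys, span=0.5)

for x, y, s in list(zip(xs, ys, smooth))[:5]:
    print(f"x = {x:4.1f}  y = {y:6.3f}  smoothed = {s:6.3f}")
```

A wider `span` gives a smoother but less flexible curve, while a narrower one follows the data more closely; this is the window-width choice left to the researcher.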
Other analyses
• The type of data will affect how the data is modeled and the choice of analysis
– Binary response (0/1) with covariate predictors:
  Logistic regression
– Relationship between categorical/ordinal variables:
  Contingency tables, chi-squared test (we won’t cover this in this class)
– Relationship between a quantitative dependent variable (Y) and a qualitative predictor:
  t-test or ANOVA
– Predicting a continuous response from both quantitative and qualitative variables:
  Dummy-variable regression or ANCOVA
– Response is a count (Poisson distribution) and the Poisson distribution mean is dependent on the covariates: