Applied Statistics - MIT
Post on 19-Jan-2016
Dr. Elizabeth Newton
Slides prepared by Elizabeth Newton (MIT) with some slides by Roy Welsch (MIT) and Gordon Kaufman (MIT).
15.075, Applied Statistics
Lecture: M,W 10-11:30
Recitation: R 4-5
Text: Statistics and Data Analysis by Tamhane and Dunlop
Computing: S-Plus
Exams: Mid-term (in class) and Final during exam week
Prerequisites: Calculus, Probability, Linear Algebra
15.075, Applied Statistics, Course Outline
• Collecting Data
• Summarizing and Exploring Data
• Review of Probability
• Sampling Distributions of Statistics
• Inference
– Point and CI Estimation, Hypothesis Testing
• Linear Regression
• Analysis of Variance
• Nonparametric Methods
• Special Topics (Data Mining?)
Statistics
“The science of collecting and analyzing data for the purpose of drawing conclusions and making decisions.” from Tamhane, Ajit C., and Dorothy D. Dunlop. Statistics and Data Analysis from Elementary to Intermediate. Prentice Hall, 2000, p. 1.
“Statistics are no substitute for judgment.” Henry Clay
How is the meter defined?
One ten-millionth of a quarter meridian (distance from pole to equator).
BUT – it isn’t exactly.
Why?
The Measure of All Things, by Ken Alder, describes the attempt of 2 French astronomers, Delambre and Mechain, to determine the circumference of the earth during the time of the French Revolution.
Determined the distance between Barcelona and Dunkirk by triangulation.
Needed to know latitude at each end (by measuring heights of stars).
Seven months stretched to seven years.
Mechain obtained conflicting information and suppressed some of his data.
Page 214 (Measure of All Things):
“What counts as an error? Who is to say when you have made a mistake? How close is close enough? Neither Mechain nor his colleagues could have answered these questions with any degree of confidence. They were completely innocent of statistical method.”
- Quote from Alder, Ken. The Measure of All Things: The Seven-Year Odyssey and Hidden Error that Transformed the World. Free Press, 2003.
Data: A Set of measurements
Character
– Nominal, e.g. color: red, green, blue
– Binary, e.g. (M,F), (H,T), (0,1)
– Ordinal, e.g. attitude to war: agree, neutral, disagree
Numeric
– Discrete, e.g. number of children
– Continuous, e.g. distance, time, temperature
also:
– Interval, e.g. Fahrenheit temperature
– Ratio (real zero), e.g. distance, number of children
S-Plus Data Set: cu.summary
Concepts
Population: The set of all units of interest (finite or infinite). E.g. all students at MIT
Sample: A subset of the population actually observed. E.g. students in this room.
Variable: A property or attribute of each unit, e.g. age, height
Observation: Values of all variables for an individual unit
A dataset is often organized as a matrix with rows corresponding to observations and columns to variables.
Concepts (continued)
Parameter: Numerical characteristic of the population, defined for each variable, e.g. proportion opposed to war
Statistic: Numerical function of the sample used to estimate a population parameter.
Precision: Spread of the estimator of a parameter
Accuracy: How close the estimator is to the true value (the opposite of bias)
Bias: Systematic deviation of the estimate from the true value
Accuracy and Precision
accurate and precise
accurate, not precise
precise, not accurate
not accurate, not precise
Diagram courtesy of MIT OpenCourseWare
Steps in Study Design and Implementation
1. Background research and literature review.
2. Define the goals and specific hypotheses of the study.
3. Determine what variables should be measured and how.
4. Develop a plan to collect the data
– Sampling design
– Sample size
– Inclusions and exclusions
5. Train Personnel
6. Gather Data
7. Analyze Data
8. Report Results
Ethical Issues
For human subjects:
For animal subjects:
(See Hulley & Cummings, Designing Clinical Research.)
Statistical Studies
Descriptive: One group, e.g. survey, poll
Comparative: 2 or more groups, e.g. compare effectiveness of different teaching methods.
Experimental: Investigator actively intervenes to control study conditions
– Look at relationship between predictor (explanatory) and response (outcome) variables
– Establish causation, e.g. drug trial
Observational: Investigator records data without intervening
– Difficult to distinguish effects of predictors and confounding variables (lurking variables)
– Establish association, e.g. Framingham Heart Study
Observational Studies:
Cross-sectional
– Look at sample at a single point in time
– E.g. Census, Sample survey
Prospective (expensive!)
– Follow sample (cohort) forward in time.
– E.g. Framingham Heart Study, Nurses’ Health Study
Retrospective (case-control)
– Look back in time
Sources of Error in Observational Studies
Sampling Error – sample differs from population
Measurement Bias – poorly worded questions
Self-Selection Bias – refusal to participate
Response Bias – incorrect or untruthful responses
Types of Samples
Probability Sample (every element in population has known non-zero probability of inclusion)
• Simple Random Sample (SRS)
• Stratified Random Sample
• Multi-Stage Cluster Sample
• Systematic Sample
Non-Probability Sample (estimates may be biased, but frequently used as only feasible method)
• Convenience Sample, e.g. supermarket survey
• Judgment Sample – chosen by investigator
Simple Random Sample (SRS)
Requires a Sampling Frame, a list of all the units in a finite population
Sample of size n is drawn without replacement from population of size N, such that each sample (there are $\binom{N}{n}$ of them) has same chance of being chosen.
Each unit in population has same chance of being chosen: n/N (the sampling fraction).
Generate random numbers to select from sampling frame.
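A minimal Python sketch of drawing a simple random sample, assuming a small hypothetical sampling frame of unit IDs 1 to 20. The function name `simple_random_sample` and the frame are illustrative only; `math.comb` needs Python 3.8 or later.

```python
import math
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: unit IDs 1..20, so N = 20.
frame = list(range(1, 21))
n = 5

num_possible_samples = math.comb(len(frame), n)  # N choose n
sampling_fraction = n / len(frame)               # n / N
sample = simple_random_sample(frame, n, seed=1)

print(num_possible_samples, sampling_fraction, sorted(sample))
```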
Stratified Random Sample
Divide a diverse population into homogeneous subpopulations (strata).
Draw simple random sample from each one.
Advantages:
Separate estimates for strata obtained in addition to overall estimates.
Precision of estimates higher than for simple random sample
Disadvantage: Requires sampling frame
Multistage Cluster Sampling
Used to survey large populations when sampling frame not available, e.g. USA
For instance, in an educational survey, draw a sample of states, then towns within states, then schools within towns.
Prepare a sampling frame of students from selected schools and use SRS.
Systematic Sampling
Useful when list of units exists or when units arrive sequentially (cars through a toll booth).
Select first unit at random, then every kth unit.
In finite population, each unit has same probability of selection (n/N) (however not all samples are equally likely).
Must avoid choosing k to coincide with regular cyclic variations in the data
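A short Python sketch of systematic sampling under the assumption that N is a multiple of k, so every start gives a sample of the same size. The list `units` and the name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(units, k, seed=None):
    """Pick a random start among the first k units, then take every kth unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based position of the first selected unit
    return units[start::k]

# Hypothetical list of N = 100 units, sampled with k = 10 (so n = 10).
units = list(range(1, 101))
k = 10
sample = systematic_sample(units, k, seed=0)
print(sample)
```

Only k distinct samples can ever be drawn this way, which is why each unit has selection probability n/N while most subsets of size n have probability zero.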
Questionnaire Design
Structured questions: responses should be mutually exclusive and collectively exhaustive.
E.g. How many glasses of water do you drink per day?
___ 0 to 2
___ 3 to 5
___ 6 or more
Non-structured:
E.g. How many glasses of water do you drink per day?
Allow more individualized response, but more prone to data entry errors.
Attitude questions
1. The homework load in this course is reasonable.
Strongly Disagree | Disagree | Neither Agree nor Disagree | Agree | Strongly Agree
Usually 5 to 9 categories.
(Should we assign numbers to these categories?)
(High to low or low to high?)
Problems with Question Wording
Double-barreled question
Leading question
One-sided question
Ambiguous question
Pretest! Pretest! Pretest!
(For more information, see Johnson & Wichern, Business Statistics)
Sensitive Questions
E.g. Have you ever used heroin?
Randomized Response may elicit more accurate responses.
Interviewer does not know what question respondent is answering.
E.g. Roll a die. If less than 3 then say whether statement 1 is true or false. Otherwise say whether statement 2 is true or false.
Statement 1: I have used heroin.
Statement 2: I have not used heroin.
Let p = proportion of people who have used heroin
q = proportion of people answering question 1 (can’t be 0.5).
P(True) = P(True|1)P(1) + P(True|2)P(2) = p q + (1-p)(1-q)
Solve for p.
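Expanding the right-hand side gives P(True) = (1 - q) + p(2q - 1), so p = (P(True) - (1 - q)) / (2q - 1), which is undefined when q = 0.5. A minimal Python sketch of that inversion follows; the function name `estimate_p` and the 60% survey result are assumptions for illustration, not data from the course.

```python
def estimate_p(prop_true, q):
    """Invert P(True) = p*q + (1 - p)*(1 - q) to recover p."""
    if abs(2 * q - 1) < 1e-12:
        raise ValueError("q = 0.5 makes p unidentifiable")
    return (prop_true - (1 - q)) / (2 * q - 1)

# The die shows less than 3 (a 1 or a 2) with probability 2/6.
q = 2 / 6

# Hypothetical survey result: 60% of respondents answered "True".
p_hat = estimate_p(0.60, q)
print(round(p_hat, 3))
```

With real survey data the observed proportion is itself an estimate, so the resulting value can fall outside [0, 1] by chance.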
Question Sequencing
1. Demographics at end
2. Sensitive questions nearer to end
3. Same topic questions appear together
4. Go from general to specific
5. Avoid skipping around.
Experimental Studies
Purpose: Evaluate how a set of predictor variables (factors) affect a response variable.
Treatment Factors are of primary interest. Values (Levels) are controlled.
Nuisance Factors also affect response.
Treatment: particular combination of levels of treatment factors.
Experimental units (EU’s): subjects to which treatments applied.
Treatment group: all EU’s receiving same treatment
Run: observation on an EU under particular treatment condition.
Replicate: another independent run.
Sources of Error in Experimental Studies
Systematic Error: differences among EU’s caused by Confounding Factors
Random Error: inherent variability in responses of EU’s.
Measurement Error: due to imprecision of measuring instruments.
Strategies to Control Error in Experimental Studies
Blocking: Divide sample into groups of similar EU’s (same value for nuisance factors).
E.g. In agricultural trials effect of nutrient and moisture gradients can be controlled for by blocking on agricultural plots
Matching: EU’s can be matched on nuisance factors, then each member of match can be randomly assigned to different treatment (each match is a block).
Regression Analysis: If value of nuisance factor is known can include as covariate in final model.
Randomization: Randomly assign EU’s to treatments.
Basic Idea: Block over those nuisance factors that can be easily controlled and randomize over the rest
Basic Experimental Designs
Completely Randomized Design (CRD)
– EU’s assigned at random to treatments
Randomized Block Design (RBD)
– EU’s divided into homogeneous blocks
– Treatments assigned randomly within blocks.
Randomized Complete Block Design (RCBD): Blocks contain all treatments.
Randomized Incomplete Block Design (RIBD): Blocks do not contain all treatments.
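A sketch of a randomized complete block assignment in Python, assuming each block holds exactly one experimental unit per treatment. The block names, unit labels and treatments are hypothetical.

```python
import random

def randomized_complete_block(blocks, treatments, seed=None):
    """Within each block, randomly assign every treatment to exactly one unit."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs one unit per treatment")
        shuffled = list(treatments)
        rng.shuffle(shuffled)  # randomization happens separately inside each block
        for unit, treatment in zip(units, shuffled):
            assignment[unit] = (block, treatment)
    return assignment

# Hypothetical agricultural plots (blocks), each with three experimental units.
blocks = {"plot1": ["u1", "u2", "u3"], "plot2": ["u4", "u5", "u6"]}
treatments = ["A", "B", "C"]
design = randomized_complete_block(blocks, treatments, seed=42)
print(design)
```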
Chapter 4: Summarizing & Exploring Data (Descriptive Statistics)
Graphics! Graphics! Graphics! (and some numbers)
Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).
Graphical Excellence
“Complex ideas communicated with clarity, precision, and efficiency”
• Shows the data
• Makes you think about substance rather than method, graphic design, or something else
• Many numbers in a small space
• Makes large data sets coherent
• Encourages the eye to compare different pieces of the data
Charles Joseph Minard
Graphic Depicting Exports of Wine from France (1864)
Available at http://www.math.yorku.ca/SCS/Gallery/
Source: Minard, C. J. Carte figurative et approximative des quantités de vin français exportés par mer en 1864. 1865. ENPC (École Nationale des Ponts et Chaussées), 1865.
Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.
Summarizing Categorical Data
A frequency table shows the number of occurrences of each category.
Relative frequency is the proportion of the total in each category.
Bar charts and Pie Charts are used to graph categorical data. A Pareto chart is a bar chart with categories arranged from the highest to lowest (QC: “vital few from the trivial many”).
Attraction         Frequency   Relative Frequency (%)
Vertical Drop      101         15.1
Roller Coaster A   54          8.1
Roller Coaster B   77          11.5
Water Park         155         23.1
Spinners           35          5.2
Tea Cups           81          12.1
Haunted House      79          11.8
Log Drop           88          13.1
Total              670         100.0
Popularity of attractions at an amusement park
[Figure: bar chart of Relative Frequency (%) (y-axis, 0.0 to 25.0) by attraction: Vertical Drop, Roller Coaster A, Roller Coaster B, Water Park, Spinners, Tea Cups, Haunted House, Log Drop]
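The relative frequencies and the Pareto ordering can be reproduced from the counts in the amusement park table. A minimal Python sketch (the variable names are illustrative):

```python
# Frequencies from the amusement park table.
counts = {
    "Vertical Drop": 101,
    "Roller Coaster A": 54,
    "Roller Coaster B": 77,
    "Water Park": 155,
    "Spinners": 35,
    "Tea Cups": 81,
    "Haunted House": 79,
    "Log Drop": 88,
}

total = sum(counts.values())

# Relative frequency (%) of each category.
relative = {name: 100 * c / total for name, c in counts.items()}

# Pareto chart order: categories sorted from highest to lowest frequency.
pareto_order = sorted(counts, key=counts.get, reverse=True)

for name in pareto_order:
    print(f"{name:18s}{counts[name]:5d}{relative[name]:7.1f}")
```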
Pie Chart and Bar Chart of Attraction Popularity at an Amusement Park
[Figure: pie chart and bar chart of Relative Frequency (%) by attraction: Vertical Drop, Roller Coaster A, Roller Coaster B, Water Park, Spinners, Tea Cups, Haunted House, Log Drop]
Charles Joseph Minard
Graph showing quantities of meat sent from various regions of France to Paris using pie charts overlaid on a map of France (1864)
Available at http://www.math.yorku.ca/SCS/Gallery/
Source: Minard, C. J. Carte figurative et approximative des quantités de viande de boucherie envoyées sur pied par les départements et consommées à Paris. ENPC (École Nationale des Ponts et Chaussées), 1858, p. 44.
Plots for Numerical Univariate Data
Scatter plot (vs. observation number)
Histogram
Stem and Leaf
Box Plot (Box and Whiskers)
QQ Plot (Normal probability plot)
Scatter Plot of Iris Data
[Figure: scatter plot of iris[, "Sepal W", "Setosa"] (y-axis, 2.5 to 4.0) against observation number (x-axis, 0 to 50)]
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Scatter Plot of Iris Data with Observation Number Indicated
[Figure: scatter plot of iris21 (y-axis, 2.5 to 4.0) against observation number (x-axis, 0 to 50), with each point labeled by its observation number, 1 to 50]
plot(iris21)
text(iris21)
Plot of data using jitter function in S-Plus
[Figure: two panels, x and jitter(x) (y-axes, 0.0 to 3.0), each plotted against observation number (x-axis, 0 to 500)]
Run Chart
For time series data, it is often useful to plot the data in time sequence. A run chart graphs the data against time.
[Figure: run chart; x-axis: Production Order (0 to 30); y-axis: Compression Frequency; series: Compression]
Always Plot Your Data Appropriately - Try Several Ways!
Histogram
Data: n=24 Gas Mileage {31,13,20,21,24,25,25,27,28,40,29,30,31,23,31,32,35,28,36,37,38,40,50,17}
Gives a picture of the distribution of data.
• Area under the histogram represents sample proportion.
• Use approx. sqrt(n) “bins”: if too many, too jagged; if too few, too smooth (no detail)
• Shows if the distribution is:
– Symmetric or skewed
– Unimodal or bimodal
• Gaps in the data may indicate a problem with the measurement process.
• Many quality control applications
– Are there two processes?
– Detection of rework or cheating
– Tells if process meets the specifications
[Figure: histogram titled “Distributions”; x-axis: Miles per gallon (10 to 55); y-axis: Count (ticks at 2.5, 5.0, 7.5)]
Note: Bars touch for continuous data, but do NOT touch for discrete data.
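A small Python sketch of the binning behind a histogram of the gas mileage data. The bin layout (width 5, starting at 10) is an assumption read off the axis ticks of the slide's figure, and `bin_counts` is a hypothetical helper.

```python
import math

mileage = [31, 13, 20, 21, 24, 25, 25, 27, 28, 40, 29, 30,
           31, 23, 31, 32, 35, 28, 36, 37, 38, 40, 50, 17]

# Rule of thumb from the slide: roughly sqrt(n) bins.
suggested_bins = round(math.sqrt(len(mileage)))

def bin_counts(data, start, width, num_bins):
    """Count observations in bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * num_bins
    for x in data:
        idx = int((x - start) // width)
        if 0 <= idx < num_bins:
            counts[idx] += 1
    return counts

# Assumed layout: nine bins of width 5 covering 10 to 55.
counts = bin_counts(mileage, start=10, width=5, num_bins=9)
print(suggested_bins, counts)
```

The empty 45 to 50 bin is the kind of gap the slide suggests checking against the measurement process.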
Histogram of Iris Data
[Figure: histogram of iris21 (x-axis, 2.5 to 4.0; y-axis, counts 0 to 10)]
Histogram of Iris Data with Density Curve
[Figure: histogram of siris21 with density curve (x-axis, 2.0 to 4.5; y-axis, 0.0 to 1.0)]
Stem and Leaf Diagram
Data: Gas Mileage
Stem   Leaf     Count
5      0        1
4
4      00       2
3      5678     4
3      01112    5
2      557889   6
2      0134     4
1      7        1
1      3        1
Shows distribution of data similar to a histogram but preserves the actual data.
Can see numerical patterns in the data (like 40’s and 50).
Cum. Dist. Function
[Figure: CDF plot; x-axis: Miles per gallon (10 to 55); y-axis: Cum Prob (0.1 to 1.0)]
Step occurs at each data value (higher for more values at the same data point).
Stem and Leaf Diagram for Iris Data
Decimal point is 1 place to the left of the colon
23 : 0
24 :
25 :
26 :
27 :
28 :
29 : 0
30 : 000000
31 : 0000
32 : 00000
33 : 00
34 : 000000000
35 : 000000
36 : 000
37 : 000
38 : 0000
39 : 00
40 : 0
41 : 0
42 : 0
43 :
44 : 0
Summary Statistics for Numerical Data
Measures of Location:
Mean (“average”): $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: middle of the ordered sample (like $\theta_{0.5}$ for distribution)
$x_{\min} = x_{(1)} \le x_{(2)} \le \dots \le x_{(n)} = x_{\max}$
$\text{median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right] & \text{if } n \text{ is even} \end{cases}$
Median of {0,1,2} is 1: n=3 so n+1=4 & (n+1)/2=2 (2nd value)
Median of {0,1,2,3} is 1.5 (assumes data is continuous): n=4
Mode: The most common value
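The odd and even cases of the median can be written out directly. This Python sketch uses a hypothetical function `sample_median` and checks it against the standard library's `statistics.median`.

```python
import statistics

def sample_median(sample):
    """Median of the ordered sample, following the odd/even definition."""
    x = sorted(sample)
    n = len(x)
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th ordered value (1-based)
        return x[(n + 1) // 2 - 1]
    # n even: average of the (n / 2)-th and (n / 2 + 1)-th ordered values
    return (x[n // 2 - 1] + x[n // 2]) / 2

print(sample_median([0, 1, 2]))     # the 2nd ordered value
print(sample_median([0, 1, 2, 3]))  # average of the 2nd and 3rd ordered values
print(statistics.mean([0, 1, 2, 3]), statistics.mode([1, 2, 2, 3]))
```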
Mean or Median?
Appropriate summary of the center of the data?
– Mean if the data has a symmetric distribution with light tails (i.e. a relatively small proportion of the observations lie away from the center of the data).
– Median if the distribution has heavy tails or is asymmetric.
Extreme values that are far removed from the main body of the data are called outliers.
– Large influence on the mean but not on the median.
Right and left skewness (asymmetry)
[Figure: two skewed density curves with mode (high point), median and mean marked. RIGHT skewed: mode, median, mean from left to right (reverse alphabetic). LEFT skewed: mean, median, mode from left to right (alphabetic).]
Quantiles, Fractiles, Percentiles
For a theoretical distribution:
The pth quantile is the value of a random variable X, xp, such that P(X ≤ xp) = p.
For the normal dist’n:
In S-Plus: qnorm(p), 0<p<1, gives the quantile.
In S-Plus: pnorm(q) gives the probability.
For a sample:
The order statistics are the sample values in ascending order, denoted X(1), …, X(n).
The pth quantile is the data value in the sorted sample such that a fraction p of the data is less than or equal to that value.
Normal CDF
[Figure: standard normal CDF, pnorm(x), for x from -3 to 3 (y-axis, 0.0 to 1.0)]
qnorm(0.8)=0.8416212
pnorm(0.8416212)=0.8
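For readers without S-Plus, the same quantile and probability lookups can be sketched in Python with `statistics.NormalDist` (Python 3.8 or later). Here `inv_cdf` plays the role of qnorm and `cdf` the role of pnorm; this is an analogue, not the course's own code.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

x_p = std_normal.inv_cdf(0.8)  # quantile for p = 0.8, like qnorm(0.8)
p = std_normal.cdf(x_p)        # back to the probability, like pnorm(x_p)

print(round(x_p, 7), round(p, 7))
```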
An algorithm for finding sample quantiles:
1) Arrange observations from smallest to largest.
2) For a given proportion p, compute the sample size × p = np.
3) If np is NOT an integer, round up to the next integer (ceiling(np)) and set the corresponding observation = xp.
4) If np IS an integer k, average the kth and (k + 1)st ordered values. This average is then xp.
– Text has a different algorithm
Quantiles, continued (pth quantile is 100pth percentile)
Example:
Data: {0, 1, 2, 3, 4, 5, 6} = {x(1), x(2), x(3), x(4), x(5), x(6), x(7)}
n=7
ceiling(0.25*7) = 2 ⇒ Q1 = x(2) = 1 = 25th percentile
ceiling(0.50*7) = 4 ⇒ Q2 = x(4) = 3 = median (50th percentile)
ceiling(0.75*7) = 6 ⇒ Q3 = x(6) = 5 = 75th percentile
S-Plus gives different answers! Different methods for calculating quantiles.
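The four-step algorithm above can be sketched in Python as follows. The function name `sample_quantile` and the floating-point tolerance used to decide whether np is an integer are choices made for this illustration.

```python
import math

def sample_quantile(data, p):
    """Sample quantile by the slide algorithm: ceiling(np), or an average when np is an integer."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    x = sorted(data)                 # step 1: order the observations
    n = len(x)
    np_ = n * p                      # step 2: compute np
    k = round(np_)
    if k >= 1 and math.isclose(np_, k, abs_tol=1e-9):
        # step 4: np is an integer k, so average the kth and (k+1)st values (1-based)
        return (x[k - 1] + x[k]) / 2
    # step 3: np is not an integer, so take the ceiling(np)-th value (1-based)
    return x[math.ceil(np_) - 1]

data = [0, 1, 2, 3, 4, 5, 6]
q1, q2, q3 = (sample_quantile(data, p) for p in (0.25, 0.50, 0.75))
print(q1, q2, q3)
```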
Measures of Dispersion (Spread, Variability):
Two data sets may have the same center but quite different dispersions around it.
Two ways to summarize variability:
1. Give the values that divide the data into equal parts.
– Median is the 50th percentile
– The 25th, 50th, and 75th percentiles are called quartiles (Q1, Q2, Q3) and divide the data into four equal parts.
– The minimum, maximum, and three quartiles are called the “five number summary” of the data.
2. Compute a single number, e.g., range, interquartile range, variance, and standard deviation.
Measures of Dispersion, continued
Range = maximum - minimum
Interquartile range (IQR) = Q3 – Q1
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$
Sample standard deviation: $s = \sqrt{s^2}$
Sample mean, variance, and standard deviation are sample analogs of the population mean, variance, and standard deviation ($\mu$, $\sigma^2$, $\sigma$)
Other Measures of Dispersion
Sample Average of Absolute Deviations from the Mean:
$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
Sample Median of Absolute Deviations from the Median:
Median of $\{|x_i - x_{0.5}|,\ i = 1, \dots, n\}$
Computations for Measures of Dispersion
Example:
Data: {0, 1, 2, 3, 4, 5, 6} = {x(1), x(2), x(3), x(4), x(5), x(6), x(7)}
mean = (0+1+2+3+4+5+6)/7 = 21/7 = 3
min = 0, max = 6
Q1 = x(2) = 1 = 25th percentile
Q2 = x(4) = 3 = median (50th percentile)
Q3 = x(6) = 5 = 75th percentile
Range = max - min = 6 - 0 = 6
IQR = Q3 - Q1 = 5 - 1 = 4
s² = [(0²+1²+2²+3²+4²+5²+6²) - 7(3²)]/(7-1) = [91-63]/6 = 4.67
s = sqrt(4.67) = 2.16
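The same computations can be checked in Python. The sketch below evaluates both forms of the sample variance, plus the two absolute-deviation measures from the previous slide; the variable names are illustrative.

```python
import math
import statistics

data = [0, 1, 2, 3, 4, 5, 6]
n = len(data)
mean = sum(data) / n

# Sample variance from the definition and from the shortcut formula.
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)
s2_shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
s = math.sqrt(s2_definition)

data_range = max(data) - min(data)

# Average absolute deviation from the mean.
mean_abs_dev = sum(abs(x - mean) for x in data) / n

# Median absolute deviation from the median.
med = statistics.median(data)
median_abs_dev = statistics.median([abs(x - med) for x in data])

print(round(s2_definition, 2), round(s, 2), data_range, round(mean_abs_dev, 3), median_abs_dev)
```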
Sample Variance and Standard Deviation
s² and s should only be used to summarize dispersion with symmetric distributions.
For asymmetric distribution, a more detailed breakup of the dispersion must be given in terms of quartiles.
For normal data and large samples (approximately):
– 50% of the data values fall between mean ± 0.67s
– 68% of the data values fall between mean ± 1s
– 95% of the data values fall between mean ± 2s
– 99.7% of the data values fall between mean ± 3s
For normally distributed data:
IQR = (mean + 0.67s) - (mean - 0.67s) = 1.34s
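These coverage figures are rounded values for the normal distribution. A quick Python check with `statistics.NormalDist` (Python 3.8 or later) shows how close the rounding is; the helper `coverage` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (0.67, 1, 2, 3):
    print(k, round(coverage(k), 4))

# IQR of a normal distribution, in units of the standard deviation.
iqr_in_sd = z.inv_cdf(0.75) - z.inv_cdf(0.25)
print(round(iqr_in_sd, 3))
```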
Standard Normal Density
[Figure: standard normal density, dnorm(x), for x from -4 to 4 (y-axis, 0.0 to 0.4), with the central 68% marked]
Box (and Whiskers) Plots
Visual display of summary of data (more than five numbers)
Data: Gas Mileage
[Figure: Outlier Box Plot and Quantile Box Plot of the gas mileage data. Outlier box plot labels: median, Q1, Q3. Quantile box plot labels: 10th percentile, 90th percentile, Rectangle.]
IQR = Q3 - Q1
Upper Fence = Q3 + 1.5 x IQR
Lower Fence = Q1 – 1.5 x IQR
Two lines are called whiskers and extend to the most extreme data values that are still inside the fences.
Observations outside the fences are regarded as possible outliers and are denoted by dots and circles or asterisks.
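A Python sketch of the fence and whisker computation for the gas mileage data. The quartiles passed in (Q1 = 24.5, Q3 = 35.5) come from applying the sample quantile algorithm given earlier in these slides; other quantile methods give slightly different values, and `box_plot_summary` is a hypothetical helper.

```python
def box_plot_summary(data, q1, q3):
    """Fences, whisker ends and possible outliers for an outlier box plot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # whiskers reach the most extreme values still inside the fences
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

mileage = [31, 13, 20, 21, 24, 25, 25, 27, 28, 40, 29, 30,
           31, 23, 31, 32, 35, 28, 36, 37, 38, 40, 50, 17]

summary = box_plot_summary(mileage, q1=24.5, q3=35.5)
print(summary)
```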
Box Plot for Iris Data
[Figure: box plot of iris21 (axis, 2.5 to 4.0)]
QQ Plots
Compare Sample to Theoretical Distribution
Order the data. The ith ordered data value is the pth quantile, where p = (i - 0.5)/n, 0<p<1.
Text uses i/(n+1). (Why can’t we just say i/n?)
Obtain quantiles from theoretical distribution corresponding to the values for p. E.g. qnorm(p), in S-Plus for normal distribution.
Plot theoretical quantiles vs. empirical quantiles (sorted data). S-Plus: n <- length(y); plot(qnorm(((1:n)-0.5)/n), sort(y))
Fit line through first and third quartiles of each distribution.
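The coordinates of a normal QQ plot can be computed without any plotting library. This Python sketch uses p = (i - 0.5)/n as on the slide; the function name `normal_qq_points` and the five-value sample are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(y):
    """Pair standard normal quantiles at p = (i - 0.5)/n with the sorted data."""
    n = len(y)
    std_normal = NormalDist()
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [std_normal.inv_cdf(p) for p in probs]
    return list(zip(theoretical, sorted(y)))

# Hypothetical sample of five measurements.
y = [2.9, 3.4, 3.0, 3.6, 3.2]
points = normal_qq_points(y)
theoretical = [t for t, _ in points]

for t, value in points:
    print(round(t, 3), value)
```

With p = i/n the largest observation would get p = 1, whose normal quantile is infinite, which is one answer to the question on the slide.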
QQ (Normal) Plot for Iris Data
[Figure: normal QQ plot of iris21 (y-axis, 2.5 to 4.0) against quantiles of the standard normal (x-axis, -2 to 2)]
Normalizing Transformations
Data can be non-normal in a number of ways, e.g., the distribution may not be bell shaped or may be heavier tailed than the normal distribution or may not be symmetric.
Only the departure from symmetry can be easily corrected by transforming the data.
If the distribution is positively skewed, then the right tail needs to be shrunk inward. The most common transformation used for this purpose is the log transformation: x → log x (e.g., decibels, Richter, and Beaufort (?) scales); see Figure 4.11.
The square-root ($\sqrt{x}$) transformation provides a weaker shrinking effect; it is frequently used for (Poisson) count data.
For negatively skewed data, use the exponential ($e^x$) or squared ($x^2$) transformations.
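A small Python illustration of how the log and square-root transformations shrink a long right tail. The data set (powers of 2) is hypothetical and chosen so that its logarithms are exactly symmetric; the moment-based `skewness` helper is one common definition, used here only as a rough indicator.

```python
import math

def skewness(data):
    """Moment-based sample skewness: third central moment over variance^(3/2)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed data: each value doubles the previous one.
raw = [1, 2, 4, 8, 16, 32, 64]
rooted = [math.sqrt(x) for x in raw]
logged = [math.log(x) for x in raw]

print(round(skewness(raw), 3), round(skewness(rooted), 3), round(skewness(logged), 3))
```

On this example the square root reduces the skewness and the log removes it, matching the statement that the square root has the weaker shrinking effect.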
Normal Probability Plot of data generated from a certain distribution
[Figure: normal probability plot of x (y-axis, 0 to 10) against quantiles of the standard normal (x-axis, -2 to 2)]
Normal probability plot of log of same data
[Figure: normal probability plot of log(x) (y-axis, -1 to 2) against quantiles of the standard normal (x-axis, -2 to 2)]
Histogram of the same data
[Figure: histogram of x (x-axis, 0 to 10; y-axis, counts 0 to 40)]
Summarizing Multivariate Data
When two or more variables are measured on each sampling unit, the result is multivariate data.
If only two variables are measured the result is bivariate data. One variable may be called the x variable and the other the y variable.
We can analyze the x and y variable separately with the methods we have learned so far, but these methods would NOT answer questions about the relationship between x and y.
– What is the nature of the relationship between x and y (if any)?
– How strong is the relationship?
– How well can one variable be predicted from the other?
Summarizing Bivariate Categorical Data
Two-way Table: Annual Salary (rows) by Overall Job Satisfaction (columns)
Annual Salary        Very Dissatisfied   Slightly Dissatisfied   Slightly Satisfied   Very Satisfied   Row Sum
Less than $10,000    81                  64                      29                   10               184
$10,000-25,000       73                  79                      35                   24               211
$25,000-50,000       47                  59                      75                   58               239
More than $50,000    14                  23                      84                   69               190
Column Sum           215                 225                     223                  161              824
The numbers in the cells are the frequencies of each possible combination of categories.
Cell, row and column percentages can be computed to assess distribution.
Column Percentages for Income and Job Satisfaction Table
Annual Salary (rows) by Overall Job Satisfaction (columns)
Annual Salary        Very Dissatisfied   Slightly Dissatisfied   Slightly Satisfied   Very Satisfied
Less than $10,000    37.7                28.4                    13.0                 6.2
$10,000-25,000       34.0                35.1                    15.7                 14.9
$25,000-50,000       21.9                26.2                    33.6                 36.0
More than $50,000    6.5                 10.2                    37.7                 42.9
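The column percentages follow from dividing each cell by its column sum. A minimal Python sketch using the counts from the two-way table (the variable names are illustrative):

```python
# Cell counts from the two-way table; columns run from Very Dissatisfied to Very Satisfied.
table = {
    "Less than $10,000": [81, 64, 29, 10],
    "$10,000-25,000": [73, 79, 35, 24],
    "$25,000-50,000": [47, 59, 75, 58],
    "More than $50,000": [14, 23, 84, 69],
}

# Column sums: total respondents at each satisfaction level.
col_sums = [sum(col) for col in zip(*table.values())]

# Column percentages: each cell as a percentage of its column sum.
col_pct = {
    row: [round(100 * cell / total, 1) for cell, total in zip(cells, col_sums)]
    for row, cells in table.items()
}

for row, pcts in col_pct.items():
    print(f"{row:20s}", pcts)
```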
Simpson’s Paradox
“Lurking variables [excluded from consideration] can change or reverse a relation between two categorical variables!”
Doctors’ Salaries
• The interpreter of a survey of doctors’ salaries in 1990 and again in 2000 concluded that their average income actually declined from $97,000 in 1990 to $91,000 in 2000.
• Income is measured here in nominal (not adjusted for inflation) dollars.
What about the “Rest of the Story”?
• What deductive piece of logic might clarify the real meaning of this particular pair of statistics?
• Look more deeply: Is there a piece missing?
• Here is a very simple breakdown of “the numbers” that may help.
Doctors’ Salaries by Age
        1980                       1990
Age     fraction, f1   Income      fraction, f2   Income
<=45    0.5            $60,000     0.7            $70,000
>45     0.5            $120,000    0.3            $130,000
Mean                   $90,000                    $88,000
Conclusion
• If MD salaries are broken into two categories by age:
– Doctors younger than 45 constituted 50% of the MD population in 1980 and 70% in 1990
– Younger doctors tend to earn less than older, more experienced doctors
– Parsed by age, MD salaries increased in both age categories!
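The overall mean in each year is a weighted average of the age-group incomes, with the age fractions as weights. This Python sketch reproduces the table's means and the reversal; the names `groups_1980`, `groups_1990` and `overall_mean` are illustrative.

```python
# (fraction of doctors, mean income) for ages <=45 and >45, taken from the table.
groups_1980 = [(0.5, 60_000), (0.5, 120_000)]
groups_1990 = [(0.7, 70_000), (0.3, 130_000)]

def overall_mean(groups):
    """Weighted mean income: sum of fraction * income over the age groups."""
    return sum(fraction * income for fraction, income in groups)

mean_1980 = overall_mean(groups_1980)
mean_1990 = overall_mean(groups_1990)

# Income rose within every age group...
rose_within_groups = all(
    later > earlier
    for (_, earlier), (_, later) in zip(groups_1980, groups_1990)
)
# ...yet the overall mean fell, because the age mix shifted toward younger doctors.
overall_fell = mean_1990 < mean_1980

print(round(mean_1980), round(mean_1990), rose_within_groups, overall_fell)
```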
Gender Bias in Graduate Admissions
For this example, see Johnson and Wichern, Business Statistics: Decision Making with Data. Wiley, First Edition, 1997.
Statistical Ideal
Randomized study
Gender should be randomly assigned to applicants!
This would automatically balance out the departmental factor which is not controlled for in the original plaintiff (observational) study.
Practical reality
Gender cannot be assigned randomly.
Control for department factor by comparing admission within department, i.e. controlling for the confounding factor after completion of the study.
“There are lies, damn lies and then there are statistics!”
Benjamin Disraeli
Summarizing Bivariate Numerical Data
No.   Method 1 (xi)   Method 2 (yi)
1     88              86
2     78              81
3     90              87
4     91              90
5     89              89
6     79              80
7     76              74
8     80              78
9     78              76
10    90              86
[Figure: scatter plot of Method 2 (y-axis, 0 to 100) against Method 1 (x-axis, 75 to 95)]
Is it easier to grasp the relationship in the data between Method 1 and Method 2 from the Table or from the Figure (scatter plot)?
Labeled Scatter Plot
Year   Country A   Country B   Country C   Country D
1965   64.7        64.8        61.1        86.2
1970   65.0        65.2        61.2        86.5
1975   66.8        66.3        63.0        87.4
1980   66.9        67.4        62.8        87.0
1985   67.9        68.5        63.1        89.2
1990   68.3        69.1        63.5        89.4
1995   70.8        69.4        64.3        90.1
2000   71.7        70.0        65.1        90.5
Can you see the improvements in the literacy rates for these four countries more easily in the Table or in the Figure?
[Figure: labeled scatter plot of Literacy Rate (y-axis, 0.0 to 100.0) against Year (x-axis, 1960 to 2005) for Country A, Country B, Country C and Country D]
Sample Correlation Coefficient
A single numerical summary statistic which measures the strength of a linear relationship between x and y.
r = covar(x,y)/(stddev(x)*stddev(y))
$r = \frac{s_{xy}}{s_x s_y}$ where $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
Properties similar to the population correlation coefficient ρ
– Unitless quantity
– Takes values between –1 and 1
– The extreme values are attained if and only if the points (xi, yi) fall exactly on a straight line (r = -1 for a line with negative slope and r = +1 for a line with positive slope.)
– Takes values close to zero if there is no linear relationship between x and y.
• See Figures 4.15, 4.16, 4.17 (a) and (b)
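A direct Python translation of the formula for r, applied to the Method 1 and Method 2 scores from the bivariate data table. The function name `sample_correlation` is illustrative.

```python
import math

def sample_correlation(x, y):
    """r = s_xy / (s_x * s_y), using n - 1 in every denominator."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Scores from the Method 1 / Method 2 table.
method1 = [88, 78, 90, 91, 89, 79, 76, 80, 78, 90]
method2 = [86, 81, 87, 90, 89, 80, 74, 78, 76, 86]

r = sample_correlation(method1, method2)
print(round(r, 3))

# Points exactly on a line give the extreme values.
print(sample_correlation([1, 2, 3], [2, 4, 6]), sample_correlation([1, 2, 3], [3, 2, 1]))
```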
What is the correlation?
[Figure: scatter plot of y (0 to 120) against x (0 to 100)]
What is the correlation?
[Figure: scatter plot of y (0 to 15) against x (-4 to 4)]
What is the correlation?
[Figure: scatter plot of y (-1.0 to 1.0) against x (-4 to 4)]
Correlation and Causation
High correlation is frequently mistaken for a cause and effect relationship. Such a conclusion may not be valid in observational studies, where the variables are not controlled.
– A lurking variable may be affecting both variables.
– One can only claim association, not causation.
Countries with high fat diets tend to have higher incidences of cancer. Can we conclude causation?
A common lurking variable in many studies is time order.
– Wealth and health problems go up with age. Does wealth cause health problems?
Sometimes correlations can be found without any plausible explanation, e.g., sun spots and economic cycles.
Plots for Multivariate Data
• Side by Side Box Plots
• Scatter plot matrix
• Three dimensional plots
• Brush and Spin plots – add motion
• Maps for spatial data
Box Plots of Auto Data (widths indicate number of each type)
[Figure: side-by-side box plots of fuel.frame[, "Mileage"] (y-axis, 20 to 35) by fuel.frame[, "Type"]: Compact, Large, Medium, Small, Sporty, Van]
Scatter plot matrix, Iris (Versicolor)
[Figure: scatter plot matrix of Sepal.L. (5.0 to 7.0), Sepal.W. (2.0 to 3.2), Petal.L. (3.0 to 5.0) and Petal.W. (1.0 to 1.8)]
Galaxy (S-PLUS Language Reference)
Radial Velocity of Galaxy NGC7531
SUMMARY:
The galaxy data frame records the radial velocity of a spiral galaxy measured at 323 points in the area of sky which it covers. All the measurements lie within seven slots crossing at the origin. The positions of the measurements are given by four variables (columns).
ARGUMENTS:
• east.west
– the east-west coordinate. The origin, (0,0), is near the center of the galaxy, east is negative, west is positive.
• north.south
– the north-south coordinate. The origin, (0,0), is near the center of the galaxy, south is negative, north is positive.
• angle
– degrees of counter-clockwise rotation from the horizontal of the slot within which the observation lies.
• radial.position
– signed distance from origin; negative if east-west coordinate is negative.
• velocity
– radial velocity measured in km/sec.
This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Galaxy Data
[Figure: scatter plot matrix of east.west, north.south, radial.position and velocity (velocity axis, about 1400 to 1700)]
Galaxy 3D
Earthquake Data
[Figure: scatter plot matrix of longitude (-123 to -120), latitude (36.0 to 38.5) and magnitude (3 to 5)]
Earthquake 3D
Narrative Graphics of Space and Time
• Adding spatial dimensions to a graph, so that the data are moving over space and time, can enhance the explanatory power of time series displays.
• The classic graphic of Charles Joseph Minard (1781-1870) shows the terrible fate of Napoleon’s army during his Russian campaign of 1812. A copy of the map is available at http://www.math.yorku.ca/SCS/Gallery/
Map Source: Minard, C. J. Carte figurative des pertes successives en hommes de l'armée qu'Annibal conduisit d'Espagne en Italie en traversant les Gaules (selon Polybe). Carte figurative des pertes successives en hommes de l'armée française dans la campagne de Russie, 1812-1813. École Nationale des Ponts et Chaussées (ENPC), 1869. Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.
Beginning at the left on the Polish-Russian border near the Niemen River, the thick band shows the size of the army (422,000) as it invaded Russia in June 1812.
– The width of the band indicates the size of the army…
– The army reached a sacked and deserted Moscow with 100,000 men.
– Napoleon’s retreat path from Moscow is depicted by a dark, lower band, linked to a temperature scale and dates at the bottom.
– The men struggled into Poland with only 10,000 troops remaining.
• Minard’s graphic tells a rich, coherent story with its multivariate data, far more enlightening than just a single number.
• SIX variables are plotted:
– The army’s location on a two-dimensional surface
– Direction of the army’s movement
– Temperature as a function of time during the retreat
– The size of the army
• “It may well be the best statistical graphic ever drawn.” Edward Tufte (The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001, p. 40)
Scatter plot matrix of air data set in S-Plus
[Figure: scatter plot matrix of ozone, radiation, temperature, and wind.]
plot(temperature,ozone)
[Figure: scatter plot of ozone against temperature.]
Fitting Lines
We often try to fit a straight line to bivariate data as a way to summarize the data:
y = data = fit + residual
fit = a + bx
The parameters (coefficients) a and b can be found in many ways. Least-squares is commonly used:
$\min_{a,b} \sum_{i=1}^{n} (y_i - a - b x_i)^2$
$b = S_{xy} / S_x^2$, $a = \bar{y} - b\bar{x}$
where $S_{xy}$ is the sample covariance of x and y and $S_x^2$ is the sample variance of x.
The fit is often denoted by $\hat{y}_i = a + b x_i$. The residuals are $y_i - \hat{y}_i$.
What about curvature and outliers?
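The least-squares slope and intercept can be sketched in a few lines. This is an illustration in Python rather than the S-PLUS used in the course; the function name `least_squares_line` and the toy data are assumptions, not from the slides. The slope is computed as a ratio of sums about the means, which equals the sample covariance divided by the sample variance of x.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (a, b) minimizing sum((y_i - a - b*x_i)**2)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sums of cross-products and of squares about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical toy data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
a, b = least_squares_line(x, y)
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

Because the toy data lie on a line, every residual is zero; with real data the residuals carry the information about curvature and outliers.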
Resistant Line
Divide x data into thirds. Find median of x in each third, and median of the y’s that correspond to the x’s in each third. Call these three pairs (xa, ya), (xb, yb), (xc, yc). Fit a least-squares line to these three points.
Or consider other metrics:
$\min_{a,b} \sum_{i=1}^{n} |y_i - a - b x_i|$
$\min_{a,b} \operatorname{median}_i |y_i - a - b x_i|$
These are alternatives to least-squares.
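The three-group recipe for the resistant line can be sketched as follows. This is a Python illustration under stated assumptions: the name `resistant_line`, the rule that leftover points go to the middle third, and the toy data are all hypothetical.

```python
from statistics import median

def resistant_line(x, y):
    """Fit a line through the median points of the lower, middle and upper thirds."""
    pairs = sorted(zip(x, y))          # order the observations by x
    n = len(pairs)
    k = n // 3                         # assumption: extras go to the middle third
    thirds = [pairs[:k], pairs[k:n - k], pairs[n - k:]]
    pts = [(median([p[0] for p in g]), median([p[1] for p in g])) for g in thirds]
    # Least-squares line through the three summary points.
    mx = sum(px for px, _ in pts) / 3
    my = sum(py for _, py in pts) / 3
    b = sum((px - mx) * (py - my) for px, py in pts) / sum((px - mx) ** 2 for px, _ in pts)
    a = my - b * mx
    return a, b

# Hypothetical data on y = 1 + 2x with one wild value at x = 6.
x = list(range(1, 10))
y = [3, 5, 7, 9, 11, 100, 15, 17, 19]
a, b = resistant_line(x, y)
```

In this toy case the wild value does not move the group medians, so the fitted line is unaffected by it, which is the sense in which the line is resistant.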
abline(lm(ozone~temperature))
[Figure: scatter plot of ozone against temperature with the fitted least-squares line.]
Prediction and Residuals
Fitted lines can be used to predict. If we go too far beyond the range of the x-data, we can expect poor results. Consider problems of interpolation and extrapolation.
Examination of residuals helps tell us how well our model (a line) fits the data. We also compute
$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
and call s the standard deviation of the residuals. Note use of n − 2 because two degrees of freedom are used to find a and b.
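A minimal sketch of the residual standard deviation, in Python. The helper name `residual_sd` and the small data set are assumptions for illustration; the fitted values shown are the least-squares fit 0.5 + 0.8x for x = 1, 2, 3, 4.

```python
from math import sqrt

def residual_sd(y, fitted):
    """Standard deviation of residuals with n - 2 in the denominator."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
    return sqrt(rss / (n - 2))

# Hypothetical responses and their least-squares fitted values.
y = [1.0, 2.0, 4.0, 3.0]
fitted = [1.3, 2.1, 2.9, 3.7]
s = residual_sd(y, fitted)
```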
Residual Plots
Plot the residuals:
1. against fitted values ($\hat{y}_i$)
2. against explanatory variable
3. against other possible explanatory variables
4. against time, if applicable.
We want these pictures to look random, with no pattern.
Outliers and Influence
Values of x far away from the bulk of the x-data have a lot of leverage on the line. Values of y with large residuals at high leverage points will usually be quite influential on the fitted line.
We can check by setting influential points aside and comparing fits and residuals.
Plot of residuals vs. observation number for ozone data
[Figure: resid(lmfit) against observation number (0 to 100).]
Residuals vs. Fitted Values for ozone data
[Figure: resid(lmfit) against fitted(lmfit).]
Smoothing
• Fitting curves to data
• Separate signal from noise
• Fitted values, $\hat{y}$, are a weighted average of the response y.
• Weights are a function of predictor x.
• Degrees of freedom indicate roughness
• Simple linear regression, df=2
plot(temperature,ozone)
lines(smooth.spline(temperature,ozone,df=16.5))
[Figure: ozone against temperature with a smoothing spline, df=16.5.]
plot(temperature,ozone)
lines(smooth.spline(temperature,ozone,df=6))
[Figure: ozone against temperature with a smoothing spline, df=6.]
Time-Series/Runs Chart
Plot of Compression vs. Time (Order of Production)
[Figure: runs chart of Compression (0.0 to 100.0) against Production Order (0 to 30).]
This is an example of a process not in “statistical control”, as seen from the downward drift.
The usual statistical procedures (such as means, standard deviations, confidence intervals, hypothesis tests) should NOT be applied until the process has been stabilized.
Time-Series Data
Data obtained at successive time points for the same sampling unit(s).
A time series typically consists of the following components.
1. Stable component
2. Trend component
3. Seasonal component
4. Random component
5. Cyclic (long term) component
Univariate time series { xt, t = 1, 2, …, T }
Time-series plot: Xt vs. Time
Data Smoothing and Forecasting
Two types of averages for time-series data:
1. Moving averages
2. Exponentially weighted averages
These should be used only if mean is constant (process is in “statistical control” or is stationary) or mean varies slowly.
Regression techniques can be used to model trends.
More advanced methods are needed to model seasonality and dependence between successive observations (autocorrelation).
80
(Arithmetic) Moving Averages (MA)
The average of a set of w successive data values (called a window); the oldest data value is successively dropped off.
$MA_t = \frac{x_{t-w+1} + \cdots + x_t}{w}$, for t = w, w+1, …, T
The bigger the window (w), the more the smoothing.
MA forecast: $\hat{x}_t = MA_{t-1}$
Forecast error: $e_t = x_t - \hat{x}_t = x_t - MA_{t-1}$, t = 2, …, T
Mean Absolute Percent Error: $MAPE = \left( \frac{1}{T-1} \sum_{t=2}^{T} \left| \frac{e_t}{x_t} \right| \right) \times 100\%$
(error in eqn 4.12 in textbook: x, not y, in the denominator)
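The moving-average forecast and its percent error can be sketched as below, in Python. The function names and the toy series are assumptions. One further assumption is flagged in the code: forecasts are only produced once a full window of w values exists, so the MAPE here averages over those time points rather than over t = 2, …, T.

```python
def moving_average(x, w):
    """MA_t for t = w, ..., T, keyed by the 1-based time index t."""
    T = len(x)
    return {t: sum(x[t - w:t]) / w for t in range(w, T + 1)}

def ma_forecast_mape(x, w):
    """Forecast errors e_t = x_t - MA_{t-1} and the mean absolute percent error."""
    ma = moving_average(x, w)
    T = len(x)
    # Assumption: forecasts start at t = w + 1, the first t with MA_{t-1} defined.
    errors = {t: x[t - 1] - ma[t - 1] for t in range(w + 1, T + 1)}
    mape = 100.0 * sum(abs(e / x[t - 1]) for t, e in errors.items()) / len(errors)
    return errors, mape

# Hypothetical series with a steady upward trend.
x = [10.0, 12.0, 14.0, 16.0, 18.0]
ma = moving_average(x, 2)
errors, mape = ma_forecast_mape(x, 2)
```

Every forecast error in this toy series is positive, which illustrates why moving averages lag behind a trending mean.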
Exponentially Weighted Moving Averages
Uses all data, but the most recent data is weighted the heaviest.
$EWMA_t = w x_t + (1 - w) EWMA_{t-1}$
where 0 < w < 1 is the smoothing constant (usually 0.2 to 0.3).
EWMA forecast: $\hat{x}_t = EWMA_{t-1}$
Forecast error: $e_t = x_t - \hat{x}_t = x_t - EWMA_{t-1}$
Alternative formula: $EWMA_t = w e_t + EWMA_{t-1}$
Interpretation: If the forecast error is positive (forecast underestimated the actual value), the next period’s forecast is adjusted upward by a fraction of the forecast error.
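The two EWMA formulas can be checked against each other with a short Python sketch. The function names are hypothetical, and starting the recursion at the first observation is an assumption; the slides do not say how the recursion is initialized.

```python
def ewma(x, w):
    """EWMA_t = w*x_t + (1 - w)*EWMA_{t-1}, started at x_1 (an assumption)."""
    level = x[0]
    out = [level]
    for xt in x[1:]:
        level = w * xt + (1 - w) * level
        out.append(level)
    return out

def ewma_error_form(x, w):
    """Same recursion written as EWMA_t = EWMA_{t-1} + w*e_t."""
    level = x[0]
    out = [level]
    for xt in x[1:]:
        e = xt - level           # forecast error e_t = x_t - EWMA_{t-1}
        level = level + w * e    # adjust by a fraction of the error
        out.append(level)
    return out

x = [10.0, 20.0, 30.0]           # hypothetical series
direct = ewma(x, 0.2)
via_errors = ewma_error_form(x, 0.2)
```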
Autocorrelation Coefficient
For time-series data, observations separated by a specified time period (called a lag) are said to be lagged.
First-order autocorrelation or the serial correlation coefficient between observations with lag = 1:
$r_1 = \frac{\sum_{t=2}^{T} (x_t - \bar{x})(x_{t-1} - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$
The k-th order autocorrelation coefficient:
$r_k = \frac{\sum_{t=k+1}^{T} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$
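A direct Python transcription of the k-th order autocorrelation coefficient, as a sketch; the function name `autocorr` and the toy series are assumptions.

```python
def autocorr(x, k=1):
    """k-th order autocorrelation coefficient r_k of the series x."""
    T = len(x)
    x_bar = sum(x) / T
    # Numerator runs over t = k+1, ..., T (0-based indices k, ..., T-1).
    num = sum((x[t] - x_bar) * (x[t - k] - x_bar) for t in range(k, T))
    den = sum((xt - x_bar) ** 2 for xt in x)
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical trending series
r1 = autocorr(x, 1)
r0 = autocorr(x, 0)
```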
Lag Plots in S-Plus
lag.plot(x) or plot(x[1:(n-i)],x[(i+1):n])
[Figure: six lagged scatterplots of Series 1 against lagged 1 through lagged 6.]
Housing starts 1966:1974, lagged scatterplots
John W. Tukey (1915 - 2000)
Statistician at Princeton Univ. and Bell Labs
Co-developer of Fast Fourier Transform
Coined terms “bit” (binary digit) and “software”
“An approximate answer to the right problem is worth a great deal more than a precise answer to the wrong problem.”
Developed new graphical displays (stem-and-leaf and box plots) to examine the data, as a reaction to the “mathematization of statistics.”
85
Review of Probability
Corresponds to Chapter 2 of Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford (Johns Hopkins University)
Concepts (Review)
A population is a collection of all units of interest.
A sample is a subset of a population that is actually observed.
A measurable property or attribute associated with each unit of a population is called a variable.
A parameter is a numerical characteristic of a population.
A statistic is a numerical characteristic of a sample.
Statistics are used to infer the values of parameters.
A random sample gives a non-zero chance to every unit of the population to enter the sample.
In probability, we assume that the population and its parameters are known and compute the probability of drawing a particular sample.
In statistics, we assume that the population and its parameters are unknown and the sample is used to infer the values of the parameters.
Different samples give different estimates of population parameters (called sampling variability).
Sampling variability leads to “sampling error”.
Probability is deductive (general -> particular)
Statistics is inductive (particular -> general)
Difference between Statistics and Probability
Statistics: Given the information in your hand, what is in the box?
Probability: Given the information in the box, what is in your hand?
Based on: Statistics, Norma Gilbert, W.B. Saunders Co., 1976.
Probability Concepts
Random experiment – procedure whose outcome cannot be predicted in advance. E.g. toss a coin twice.
Sample Space (S) – the finest grain, mutually exclusive, collectively exhaustive listing of all possible outcomes (Drake, Fundamentals of Applied Probability Theory). S = {(H,H), (H,T), (T,H), (T,T)}
Event (A) – a set of outcomes (subset of S). E.g. no heads: A = {(T,T)}
Union (or): E.g. A = heads on first, B = heads on second. A U B = {(H,T), (H,H), (T,H)}
Intersection (and): E.g. A = heads on first, B = heads on second. A ∩ B = {(H,H)}
Complement of event A – set of all outcomes not in A. E.g. A = {(T,T)}, Ac = {(H,H), (H,T), (T,H)}
Venn Diagram
[Figure: Venn diagram of events A and B.]
Axioms of Probability
Associated with each event A in S is the probability of A, P(A).
Axioms:
1. P(A) ≥ 0
2. P(S) = 1 where S is the sample space
3. P(A U B) = P(A) + P(B) if A and B are mutually exclusive
E.g. P(ace or king) = P(ace) + P(king) = 1/13 + 1/13 = 2/13.
Theorems about probability can be proved using these axioms, and these theorems can be used in probability calculations.
P(A) = 1 - P(Ac) (see “birthday problem” on p. 13)
P(A U B) = P(A) + P(B) – P(A∩B)
E.g. P(ace or black) = P(ace) + P(black) – P(ace and black) = 4/52 + 26/52 – 2/52 = 28/52 = 7/13
Conditional Probability: P(A|B) = P(A∩B)/P(B)
P(A∩B) = P(A|B)P(B)
E.g. Drawing a card from a deck of 52 cards, P(Heart)=1/4.
However, if it is known that the card is red, P(Heart | Red) = ½.
Sample space has been reduced to the 26 red cards.
(See page 16)
Independence: P(A|B) = P(A)
There are situations in which knowing that event B occurred gives no information about event A. E.g. knowing that a card is black gives no information about whether it is an ace: P(ace | black) = 2/26 = 4/52 = P(ace).
If two events are independent then P(A∩B) = P(A)P(B):
P(A∩B) = P(A|B)P(B) = P(A)P(B)
E.g. P(ace of hearts) = P(ace) * P(hearts) = 4/52 * 13/52 = 1/52
Independent events are not the same as disjoint events. There is strong dependence between disjoint events. E.g. if a card is red, it can’t be black: P(A|B) = 0.
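The card examples for unions, conditional probability and independence can be verified by brute-force enumeration of a 52-card deck. The Python sketch below uses exact fractions; the helper `prob` and the event names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 equally likely outcomes

def prob(event, given=lambda card: True):
    """P(event | given), computed by counting over the reduced sample space."""
    space = [card for card in deck if given(card)]
    return Fraction(sum(1 for card in space if event(card)), len(space))

is_ace = lambda card: card[0] == "A"
is_black = lambda card: card[1] in ("clubs", "spades")
is_red = lambda card: not is_black(card)
is_heart = lambda card: card[1] == "hearts"

p_ace_or_black = prob(lambda c: is_ace(c) or is_black(c))
p_heart_given_red = prob(is_heart, given=is_red)
p_ace_given_black = prob(is_ace, given=is_black)
p_ace_of_hearts = prob(lambda c: is_ace(c) and is_heart(c))
```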
Summary
If A and B are disjoint: P(A U B) = P(A) + P(B) P (A ∩B) =0
If A and B are independent: P(A ∩ B) = P(A) * P(B) P(A U B) = P(A) + P(B) – P(A ∩B)
9
Bayes Theorem
• P(A∩B) = P(A|B) P(B) = P(B|A) P(A)
• P(B|A) = P(A|B) P(B) / P(A)
• P(B) = prior probability
• P(B|A) = posterior probability
• E.g. P(heart | red) = P(red | heart) * P(heart) / P(red) = 1 * 0.25 / 0.5 = 0.5
• Monty Hall problem (page 20)
Sensor Problem
Assume that there are two chemical hazard sensors: A and B.
Let P(A falsely detecting a hazardous chemical)=0.05 and the same for B.
What is the probability of both sensors falsely detecting a hazardous chemical?
P (A ∩ B) = P(A|B)×P(B) = P(A) × P(B) = 0.05 × 0.05 = 0.0025
– only if A and B are independent (use different detection methods).
If A and B are both “fooled” by the same chemical substance, then P (A ∩ B) = P(A | B) × P(B) = 1 × 0.05 = 0.05 – which is 20 times the rate of false alarms (same type of sensor)
DON’T assume independence without good reason!
HIV Testing Example (made-up data)

                     HIV +    HIV -    Total
Test positive (+)       95      495      590
Test negative (-)        5     9405     9410
Total                  100     9900    10000

P(HIV +) = 100/10000 = .01 (prevalence)
P(Test + | HIV +) = 95/100 = 0.95 (sensitivity; want this to be high)
P(Test - | HIV -) = 9405/9900 = .95 (specificity; want this to be high)
P(Test - | HIV +) = 5/100 = .05 (false negatives; want this to be low)
P(Test + | HIV -) = 495/9900 = .05 (false positives; want this to be low)
P(HIV + | Test +) = 95/590 = 0.16
This is one reason why we don’t have mass HIV screening.
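The 0.16 figure follows from Bayes' theorem applied to the prevalence, sensitivity and specificity of the made-up table. A minimal Python sketch; the variable names are assumptions.

```python
prevalence = 0.01     # P(HIV +)
sensitivity = 0.95    # P(Test + | HIV +)
specificity = 0.95    # P(Test - | HIV -)

# Total probability of a positive test: true positives plus false positives.
p_test_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(HIV + | Test +).
p_hiv_given_pos = sensitivity * prevalence / p_test_pos
```

Even with a test that is right 95% of the time in both groups, most positives are false because the condition is rare.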
Suggestions for Solving Probability Problems
Draw a picture
– Venn diagram
– Tree or event diagram (Probabilistic Risk Assessment)
– Sketch
Write out all possible combinations if feasible
Do a smaller scale problem first
– Figure out the algorithm for the solution
– Increment the size of the problem by one and check algorithm for correctness
– Generalize algorithm (mathematical induction)
Counting rules
Number of Possible Arrangements of Size r from n Objects:

              Without Replacement     With Replacement
Ordered:      $\frac{n!}{(n-r)!}$     $n^r$
Unordered:    $\binom{n}{r}$          $\binom{n+r-1}{r}$

Source: Casella, George, and Roger L. Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990, page 16.
Counting rules (from Casella & Berger)
For these examples, see pages 15-16 of: Casella, George, and Roger L. Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990.
15
Birthday Problem
At a gathering of s randomly chosen students, what is the probability that at least 2 will have the same birthday?
P(at least 2 have same birthday) = 1 - P(all s students have different birthdays).
Assume 365 days in a year. Think of students’ birthdays as a sample of these 365 days.
The total number of possible outcomes is: N = 365^s (ordered, with replacement)
The number of ways that s students can have different birthdays is M = 365!/(365-s)! (ordered, without replacement)
P(all s students have different birthdays) is M / N.
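The ratio M / N can be computed without large factorials by multiplying s fractions. A Python sketch under the same assumption of 365 equally likely days; the function name `p_all_different` is hypothetical.

```python
def p_all_different(s, days=365):
    """P(all s birthdays differ) = [365 * 364 * ... * (365 - s + 1)] / 365**s."""
    p = 1.0
    for i in range(s):
        p *= (days - i) / days   # the (i+1)-th student avoids i taken days
    return p

# Probability that at least two of 23 students share a birthday.
p_shared_23 = 1 - p_all_different(23)
```

With 23 students the probability of a shared birthday already exceeds one half.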
Probability that all students have different birthdays
[Figure: probability that all students have different birthdays (0.0 to 1.0) against number of students (0 to 80).]
See “Harry Potter and the Sorcerer’s Stone” by J.K. Rowling.
Another Counting Rule
The number of ways of classifying n items into k groups with ri in group i, r1+r2+…+rk=n, is:
n! / (r1! r2! r3!...rk!)
For example: How many ways are there to assign 100 incoming students to the 4 houses at Hogwarts?
(1.6 * 10^57)
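The quoted count can be reproduced with exact integer arithmetic in Python. The slide does not state the group sizes; the sketch assumes 25 students per house, which yields a value of about 1.6 * 10^57. The function name `multinomial` is also an assumption.

```python
from math import factorial

def multinomial(n, group_sizes):
    """Number of ways to classify n items into groups of the given sizes."""
    if sum(group_sizes) != n:
        raise ValueError("group sizes must add up to n")
    ways = factorial(n)
    for r in group_sizes:
        ways //= factorial(r)    # exact: each partial quotient is an integer
    return ways

# Assumption: the 100 students are split evenly, 25 per house.
ways = multinomial(100, [25, 25, 25, 25])
```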
Random Variables
A random variable (r.v.) associates a unique numerical value with each outcome in the sample space.
Example: X = 1 if coin toss results in a head, X = 0 if coin toss results in a tail.
Discrete random variables: number of possible values is finite or countably infinite: x1, x2, x3, x4, x5, x6, …
Probability mass function (p.m.f.): f(x) = P(X = x) (sum over all possible values = 1 always)
Cumulative distribution function (c.d.f.): $F(x) = P(X \le x) = \sum_{k \le x} f(k)$
• See Table 2.1 on p. 21 (p.m.f. and c.d.f. for sum of two dice)
• See Figure 2.5 on p. 22 (p.m.f. and c.d.f. graphs for two dice)
Continuous Random Variables
An r.v. is continuous if it can assume any value from one or more intervals of real numbers.
Probability density function (p.d.f.) f(x):
$f(x) \ge 0$
$\int_{-\infty}^{\infty} f(x)\,dx = 1$ (area under the curve = 1 always)
$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$ for any $a \le b$
P(0<X<1) for standard normal = area under curve between 0 and 1
[Figure: standard normal density dnorm(x) for x from -4 to 4.]
Cumulative Distribution Function
The cumulative distribution function (c.d.f.), denoted F(x), for a continuous random variable is given by:
$F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy$
$f(x) = \frac{dF(x)}{dx}$
P(0<Z<1) for standard normal = F(1) - F(0) = 0.8413 - 0.5 = 0.3413 (table page 674)
[Figure: standard normal c.d.f. pnorm(z) for z from -4 to 4.]
Expected Value
The expected value or mean of a discrete r.v. X, denoted by E(X), $\mu_X$, or simply µ, is defined as:
$E(X) = \mu = \sum_{x} x f(x) = x_1 f(x_1) + x_2 f(x_2) + \cdots$
This is essentially a weighted average of the possible values the r.v. can assume, weights = f(x).
The expected value of a continuous r.v. X is defined as:
$E(X) = \mu = \int x f(x)\,dx$
Variance and Standard Deviation
The variance of an r.v. X, denoted by Var(X), $\sigma_X^2$, or simply $\sigma^2$, is defined as:
$Var(X) = \sigma^2 = E[(X - \mu)^2]$
$Var(X) = E[(X - \mu)^2] = E(X^2 - 2\mu X + \mu^2)$
$= E(X^2) - 2\mu E(X) + E(\mu^2)$
$= E(X^2) - 2\mu\mu + \mu^2$
$= E(X^2) - \mu^2 = E(X^2) - [E(X)]^2$
The standard deviation (SD) is the square root of the variance. Note that the variance is in the square of the original units, while the SD is in the original units.
• See Example 2.17 on p. 26 (mean and variance of two dice)
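For the sum of two fair dice, the mean and both forms of the variance can be computed exactly from the p.m.f. A Python sketch using fractions; the variable names are assumptions.

```python
from fractions import Fraction
from itertools import product

# p.m.f. of the sum of two fair dice, built by enumerating all 36 outcomes.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, 0) + Fraction(1, 36)

mu = sum(x * p for x, p in pmf.items())                        # E(X)
var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]
var_shortcut = sum(x * x * p for x, p in pmf.items()) - mu ** 2  # E(X^2) - [E(X)]^2
```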
Quantiles and Percentiles
For 0 ≤ p ≤ 1 the pth quantile (or the 100pth percentile), denoted by $\theta_p$, of a continuous r.v. X is defined by the following equation:
$P(X \le \theta_p) = F(\theta_p) = p$
$\theta_{0.5}$ is called the median
• See Example 2.20 on p. 30 (exponential distribution)
Jointly distributed random variables and independent random variables
See pp. 30-33
27
Joint Distributions
For a discrete distribution:
f(x,y) = P(X=x,Y=y)
f(x,y) ≥ 0 for all x and y ∑x ∑y f(x,y)=1
28
Marginal Distributions
• g(x) = P(X=x) = ∑y f(x,y)
• h(y) = P(Y=y) = ∑x f(x,y)
• Independent if joint distribution factors into product of marginal distributions
• f(x,y) = g(x) h(y)
29
Conditional Distributions
f(y|x) = f(x,y) / g(x)
If X and Y are independent:
f(y|x) = g(x) h(y) / g(x) = h(y)
Conditional distribution is just a probability distribution defined on a reduced sample space. For every x, ∑y f(y|x) = 1
Covariance and Correlation
$Cov(X,Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y) = E(XY) - \mu_X \mu_Y$
If X and Y are independent, then E(XY) = E(X)E(Y) so the covariance is zero. The converse is not true.
Note that: $E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y f(x,y)\,dx\,dy$
$\rho_{XY} = corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$
• See Examples 2.26 and 2.27 on pp. 37-38 (prob vs. stat grades)
Example 2.25 in text
y = x with probability 0.5 and y = -x with probability 0.5
y is not independent of x, yet covariance is zero
[Figure: scatter plot of y (-40 to 40) against x (0 to 50).]
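The zero-covariance claim can be checked exactly for a small discrete version of this example. In the Python sketch below, taking X uniform on the integers 1 to 50 is an assumption chosen to resemble the plotted range; the helper `expect` is hypothetical.

```python
from fractions import Fraction

# Joint p.m.f.: X uniform on 1..50 (assumption), then Y = X or Y = -X, each w.p. 1/2.
joint = {}
for x in range(1, 51):
    joint[(x, x)] = Fraction(1, 100)
    joint[(x, -x)] = Fraction(1, 100)

def expect(g):
    """E[g(X, Y)] under the joint p.m.f."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Dependence: knowing X = 1 changes the probability that Y = 1.
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)
p_y1_given_x1 = joint[(1, 1)] / p_x1
```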
Two Famous Theorems
Chebyshev’s Inequality: Let c > 0 be a constant. Then, irrespective of the distribution of X,
$P(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}$
• See Example 2.29 on p. 41 (exact vs. Cheb. for two dice)
Weak Law of Large Numbers: Let $\bar{X}$ be the sample mean of n i.i.d. observations from a population with finite mean µ and variance $\sigma^2$. Then, for any fixed c > 0,
$P(|\bar{X} - \mu| \ge c) \to 0$ as $n \to \infty$
Selected Discrete Distributions
Bernoulli trials: (single coin flip)
$f(x) = P(X = x) = \begin{cases} p & \text{if } x = 1 \text{ (success)} \\ 1 - p & \text{if } x = 0 \text{ (failure)} \end{cases}$
E(X) = p and Var(X) = p(1-p)
Binomial distribution: (multiple coin flips)
X successes out of n trials
$f(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$ for x = 0, 1, …, n
E(X) = np and Var(X) = np(1-p)
• See Example 2.30 on p. 43 (teeth)
Selected Discrete Distributions (cont)
Hypergeometric: drawing balls from the box without replacing the balls (as in the hand with the question mark)
Poisson: number of occurrences of a rare event
Geometric: number of failures before the first success
Multinomial: more than two outcomes
Negative Binomial: number of trials to get r successes
Uniform: N equally likely events 1 2 3 … N
• See Table 2.5, p. 59 for properties of these distributions
35
Selected Continuous Distributions
Uniform: equally likely over an interval
Exponential: lifetimes of devices with no wear-out (“memoryless”), interarrival times when the arrivals are at random
Gamma: used to model lifetimes, related to many other distributions
Lognormal: lifetimes (similar shape to Gamma but with longer tail)
Beta: not equally likely over an interval
• See Table 2.5, p. 59 for properties of these distributions
Normal Distribution
First discovered by de Moivre (1667-1754) in 1733.
Rediscovered by Laplace (1749-1827) and also by Gauss (1777-1855) in their studies of errors in astronomical measurements.
Often referred to as the Gaussian distribution.
Carl Friedrich Gauss (1777 - 1855)
Photograph courtesy of John L. Telford, John Telford Photography. Used with permission. Currency from 1991.
38
Karl Pearson (1857 - 1936)
“Many years ago I called the Laplace-Gauss curve the NORMAL curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ABNORMAL. That belief is, of course, not justifiable.”
Karl Pearson, 1920
39
Normal Distribution (“Bell-curve”, Gaussian)
A continuous r.v. X has a normal distribution with parameters µ and $\sigma^2$ if its probability density function is given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]$ for $-\infty < x < \infty$
E(X) = µ and Var(X) = $\sigma^2$ (see Figure 2.12, p. 53)
Standard normal distribution: $Z = \frac{X - \mu}{\sigma} \sim N(0,1)$
• See Table A.3 on p. 673: Φ(z) = P(Z ≤ z)
$P(X \le x) = P\left( Z = \frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma} = z \right) = \Phi\left( \frac{x - \mu}{\sigma} \right)$
• See Examples 2.37 and 2.38 on pp. 54-55 (computations)
Percentiles of the Normal Distribution
Suppose that the scores on a standardized test are normally distributed with mean 500 and standard deviation of 100. What is the 75th percentile score of this test?
$P(X \le x) = P\left( \frac{X - 500}{100} \le \frac{x - 500}{100} \right) = \Phi\left( \frac{x - 500}{100} \right) = 0.75$
From Table A.3, Φ(0.675) = 0.75
$\frac{x - 500}{100} = 0.675 \Rightarrow x = 500 + (0.675)(100) = 567.5$
Useful Information about the Normal Distribution:
~68% of a normal population is within ± 1σ of µ
~95% of a normal population is within ± 2σ of µ
~99.7% of a normal population is within ± 3σ of µ
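The table lookup can be cross-checked in Python with `statistics.NormalDist` from the standard library (Python 3.8 or later), standing in here for the S-PLUS `qnorm` and `pnorm` functions. The variable names are assumptions.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

# 75th percentile: the x with P(X <= x) = 0.75.
q75 = scores.inv_cdf(0.75)

# Fraction of the population within k standard deviations of the mean.
within = {k: scores.cdf(500 + 100 * k) - scores.cdf(500 - 100 * k) for k in (1, 2, 3)}
```

The computed percentile agrees with the table-based 567.5 to about one decimal place; the small difference comes from rounding z to 0.675.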
75th percentile for a test with scores which are normally distributed, mean=500, standard deviation=100
qnorm(0.75, 500, 100)=567.5
pnorm(567.5, 500, 100)=0.75
[Figure: pnorm(x, 500, 100) for x from 200 to 800.]
Linear Combinations of r.v.s
$X_i \sim N(\mu_i, \sigma_i^2)$ for i = 1, …, n and $Cov(X_i, X_j) = \sigma_{ij}$ for i ≠ j.
Let $X = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$ where the $a_i$ are constants.
Then X has a normal distribution with mean and variance:
$E(X) = E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1\mu_1 + a_2\mu_2 + \cdots + a_n\mu_n = \sum_{i=1}^{n} a_i \mu_i$
$Var(X) = Var(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \sum_{i<j} a_i a_j \sigma_{ij}$
$\bar{X} = (X_1 + X_2 + \cdots + X_n) / n$, so $a_i = 1/n$.
Therefore, $\bar{X}$ from n i.i.d. N(µ, σ2) observations ~ N(µ, σ2/n), since the covariances (σij) are zero (by independence).
Sampling Distributions of Statistics
Corresponds to Chapter 5 of Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford (Johns Hopkins University)
Sampling Distributions
2
Definitions and Key Concepts
• A sample statistic used to estimate an unknown population parameter is called an estimate.
• The discrepancy between the estimate and the true parameter value is known as sampling error.
• A statistic is a random variable with a probability distribution, called the sampling distribution, which is generated by repeated sampling.
• We use the sampling distribution of a statistic to assess the sampling error in an estimate.
Random Sample
• Definition 5.11, page 201, Casella and Berger.
• How is this different from a simple random sample?
• For mutual independence, population must be very large or must sample with replacement.
3
Sample Mean and Variance
Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Sample Variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
How do the sample mean and variance vary in repeated samples of size n drawn from the population?
In general, it is difficult to find the exact sampling distribution. However, see the example of deriving the distribution when all possible samples can be enumerated (rolling 2 dice) in sections 5.1 and 5.2. Note errors on page 168.
Properties of a sample mean and variance
See Theorem 5.2.2, page 268, Casella & Berger.
5
Distribution of Sample Means
• If the i.i.d. r.v.’s are
– Bernoulli
– Normal
– Exponential
the distributions of the sample means can be derived.
Sum of n i.i.d. Bernoulli(p) r.v.’s is Binomial(n,p)
Sum of n i.i.d. Normal(µ,σ2) r.v.’s is Normal(nµ,nσ2)
Sum of n i.i.d. Exponential(λ) r.v.’s is Gamma(λ,n)
6
Distribution of Sample Means
• Generally, the exact distribution is difficult to calculate.
• What can be said about the distribution of the sample mean when the sample is drawn from an arbitrary population?
• In many cases we can approximate the distribution of the sample mean when n is large by a normal distribution.
• The famous Central Limit Theorem
7
Central Limit Theorem
Let X1, X2, … , Xn be a random sample drawn from an arbitrary distribution with a finite mean µ and variance σ2.
As n goes to infinity, the sampling distribution of
$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
converges to the N(0,1) distribution.
Sometimes this theorem is given in terms of the sums:
$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}} \approx N(0,1)$
Central Limit Theorem
Let X1… Xn be a random sample from an arbitrary distribution with finite mean µ and variance σ2. As n increases
$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \approx N(0,1)$
$\Rightarrow \bar{X} \approx N(\mu, \sigma^2 / n)$ ?
$\Rightarrow \sum_{i=1}^{n} X_i \approx N(n\mu, n\sigma^2)$ ?
What happens as n goes to infinity?
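A small simulation gives a feel for the standardization in the theorem. This Python sketch draws 500 samples of size 100 from a Uniform(0, 10) distribution; the seed, the function name `standardized_mean`, and the choice of distribution and sizes are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance

random.seed(0)  # arbitrary seed so the run is repeatable

def standardized_mean(n, a=0.0, b=10.0):
    """Draw n Uniform(a, b) values and return (xbar - mu) / (sigma / sqrt(n))."""
    mu = (a + b) / 2
    sigma = (b - a) / sqrt(12)
    xbar = mean([random.uniform(a, b) for _ in range(n)])
    return (xbar - mu) / (sigma / sqrt(n))

z = [standardized_mean(100) for _ in range(500)]
z_mean, z_var = mean(z), variance(z)   # should be near 0 and 1 respectively
```

A histogram or normal probability plot of `z` is the natural next step for judging how close the standardized means are to N(0,1).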
Variance of means from uniform distribution
sample size=10 to 10^6, number of samples=100
[Figure: log10(variance) against log10(sample.size).]
Example: Uniform Distribution
• f(x | a, b) = 1 / (b-a), a≤x≤b
• E X = (b+a)/2
• Var X = (b-a)^2/12
[Figure: histogram of runif(500, min = 0, max = 10).]
Standardized Means, Uniform Distribution
500 samples, n=1
[Figure: histogram of standardized means; number of samples=500, n=1.]
Standardized Means, Uniform Distribution
500 samples, n=2
[Figure: histogram of standardized means; number of samples=500, n=2.]
Standardized Means, Uniform Distribution
500 samples, n=100
[Figure: histogram of standardized means; number of samples=500, n=100.]
QQ (Normal) plot of means of 500 samples of size 100 from uniform distribution
[Figure: sample means against Quantiles of Standard Normal.]
Bootstrap – sampling from the sample
• Previous slides have shown results for means of 500 samples (of size 100) from uniform distribution.
• Bootstrap takes just one sample of size 100 and then takes 500 samples (of size 100) with replacement from the sample.
x <- runif(100)
y <- mean(sample(x, 100, replace=T))
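The bootstrap idea can be sketched in Python as follows: one sample of size 100 is drawn, then 500 resamples of size 100 are taken from it with replacement and the mean of each is recorded. The seed and variable names are assumptions; `random.choices` is used for sampling with replacement.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed so the run is repeatable

# The one observed sample of size 100 from a uniform distribution.
x = [random.random() for _ in range(100)]

# 500 bootstrap resamples, each of size 100 drawn with replacement from x.
boot_means = [mean(random.choices(x, k=len(x))) for _ in range(500)]
```

The spread of `boot_means` approximates the sampling variability of the mean using only the single observed sample.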
Normal probability plot of sample of size 100 from exponential distribution
[Figure: x against Quantiles of Standard Normal.]
Normal probability plot of means of 500 bootstrap samples from sample of size 100 from exponential distribution
[Figure: y against Quantiles of Standard Normal.]
Law of Large Numbers and Central Limit Theorem
Both are asymptotic results about the sample mean:
• Law of Large Numbers (LLN) says that as n → ∞, the sample mean converges to the population mean, i.e.,
$\bar{X} - \mu \to 0$ as $n \to \infty$
• Central Limit Theorem (CLT) says that as n → ∞, the distribution of the standardized sample mean also converges to Normal, i.e.,
$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ converges to N(0,1) as $n \to \infty$
Normal Approximation to the Binomial
A binomial r.v. is the sum of i.i.d. Bernoulli r.v.’s so the CLT can be used to approximate its distribution.
Suppose that X is B(n, p). Then the mean of X is np and the variance of X is np(1 - p) .
By the CLT, we have:
$\frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$
General formula: $\frac{\text{r.v.} - E(\text{r.v.})}{SD(\text{r.v.})}$
How large a sample, n, do we need for the approximation to be good?
Rule of Thumb: np ≥ 10 and n(1-p) ≥ 10
For p=0.5, np = n(1-p) = n (0.5) = 10 ⇒ n should be 20. (symmetrical)
For p=0.1 or 0.9, np or n(1-p) = n (0.1) = 10 ⇒ n should be 100. (skewed)
• See Figures 5.2 and 5.3 and Example 5.3, pp.172-174
20
Continuity Correction
See Figure 5.4 for motivation.
$P(X \le x) \cong \Phi\left( \frac{x + 0.5 - np}{\sqrt{np(1-p)}} \right)$
$P(X \ge x) \cong 1 - \Phi\left( \frac{x - 0.5 - np}{\sqrt{np(1-p)}} \right)$
Exact Binomial Probability:
P(X ≤ 8) = 0.2517
Normal approximation without Continuity Correction:
P(X ≤ 8) = 0.1867
Normal approximation with Continuity Correction:
P(X ≤ 8.5) = 0.2514 (much better agreement with exact calculation)
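The three probabilities can be recomputed in Python. The slide does not state n and p; the sketch assumes n = 20 and p = 0.5, which gives an exact value that rounds to the quoted 0.2517. The normal approximations may differ from the quoted values in the third decimal place because table lookups round z.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, x = 20, 0.5, 8             # assumption: n and p are not given on the slide

# Exact binomial probability P(X <= 8).
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

mu, sd = n * p, sqrt(n * p * (1 - p))
Phi = NormalDist().cdf           # standard normal c.d.f.

no_correction = Phi((x - mu) / sd)            # normal approximation, no correction
with_correction = Phi((x + 0.5 - mu) / sd)    # with continuity correction
```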
Sampling Distribution of the Sample Variance
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \sim\ ?$
There is no analog to the CLT for $S^2$ that gives an approximation for large samples for an arbitrary distribution.
The exact distribution for S2 can be derived for X ~ i.i.d. Normal.
Chi-square distribution: For ν ≥ 1, let Z1, Z2, …, Zν be i.i.d. N(0,1) and let $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$.
The p.d.f. of Y can be shown to be
$f(y) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, y^{\nu/2 - 1} e^{-y/2}$
This is known as the χ2 distribution with ν degrees of freedom (d.f.), or $Y \sim \chi^2_\nu$.
• See Figures 5.5 and 5.6, pp. 176-177 and Table A.5, p. 676
Distribution of the Sample Variance in the Normal Case
If Z ~ N(0,1), then $Z^2 \sim \chi^2_1$.
It can be shown that
$\frac{(n-1) S^2}{\sigma^2} = \frac{S^2}{\sigma^2 / (n-1)} \sim \chi^2_{n-1}$
or equivalently $S^2 \sim \frac{\sigma^2 \chi^2_{n-1}}{n-1}$, a scaled χ2.
E(S2) = σ2 (is an unbiased estimator)
$Var(S^2) = \frac{2\sigma^4}{n-1}$
See Result 2 (p.179)
Chi-square distribution
[Figure: chi-square density for df=5, 10, 20, 30, x from 0 to 50.]
Chi-Square Distribution
Interesting Facts
• E X = ν (degrees of freedom)
• Var X = 2ν
• Special case of the gamma distribution with scale parameter=2, shape parameter=ν/2.
• Chi-square variate with ν d.f. is equal to the sum of the squares of ν independent unit normal variates.
Student’s t-Distribution
Consider a random sample X1, X2, ..., Xn drawn from N(µ,σ2).
It is known that $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ is exactly distributed as N(0,1).
$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$ is NOT distributed as N(0,1).
A different distribution for each ν = n-1 degrees of freedom (d.f.).
T is the ratio of a N(0,1) r.v. and sq.rt.(independent χ2 divided by its d.f.) - for derivation, see eqn 5.13, p.180, and its messy p.d.f., eqn 5.14
• See Figure 5.7, Student’s t p.d.f.’s for ν = 2, 10, and ∞, p.180
• See Table A.4, t-distribution table, p. 675
• See Example 5.6, milk cartons, p. 181
Student’s t densities for df=1,100
[Figure: Student's t pdf for df=1 and df=100, x from -4 to 4.]
Student’s t Distribution: Interesting Facts
• E X = 0, for ν > 1
• Var X = ν/(ν-2), for ν > 2
• Related to F distribution ($F_{1,\nu} = t_\nu^2$)
• As ν tends to infinity, t variate tends to unit normal
• If ν = 1 then t variate is standard Cauchy
Cauchy Distribution for center=0, scale=1 and center=1, scale=2
[Figure: Cauchy densities for (center=0, scale=1) and (center=1, scale=2), plotted for x from -4 to 4]
Cauchy Distribution: Interesting Facts
$f(x \mid a, b) = \left\{ \pi b \left[ 1 + \left( \frac{x-a}{b} \right)^2 \right] \right\}^{-1}$
• Parameters: a = center, b = scale
• Mean and variance do not exist (how could this be?)
• a = median
• Quartiles = a ± b
• Special case of Student’s t with 1 d.f.
• Ratio of 2 independent unit normal variates is a standard Cauchy variate
• Should not be thought of as “only a pathological case” (Casella & Berger), as we frequently (when?) calculate ratios of random variables.
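Several of these facts can be seen numerically. The sketch below (the seed and sample size are arbitrary assumptions) forms ratios of two independent unit normal variates, checks that the sample quartiles sit near a ± b = ±1 with median near 0, and prints running means, which keep jumping around because the mean does not exist.

```python
import random
import statistics

rng = random.Random(1)
# Ratio of two independent unit normal variates (standard Cauchy, per the slide).
ratios = [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(50000)]

q1, med, q3 = statistics.quantiles(ratios, n=4)  # sample quartiles
print(round(q1, 2), round(med, 2), round(q3, 2))

# Running means for growing prefixes; unlike quartiles, these need not settle.
for size in (100, 1000, 10000, 50000):
    print(size, round(statistics.fmean(ratios[:size]), 3))
```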
Snedecor-Fisher’s F-Distribution
Consider two independent random samples:
$X_1, X_2, \dots, X_{n_1}$ from $N(\mu_1, \sigma_1^2)$, and $Y_1, Y_2, \dots, Y_{n_2}$ from $N(\mu_2, \sigma_2^2)$.
Then
$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\frac{(n_1-1)S_1^2}{\sigma_1^2} \big/ (n_1-1)}{\frac{(n_2-1)S_2^2}{\sigma_2^2} \big/ (n_2-1)}$
has an F-distribution with n1-1 d.f. in the numerator and n2-1 d.f. in the denominator.
• F is the ratio of two independent χ2’s divided by their respective d.f.’s
• Used to compare sample variances.
• See Table A.6, F-distribution, pp. 677-679
Snedecor’s F Distribution
[Figure: F p.d.f. for df2 = 40 with df1 = 4, 10, 40, plotted for x from 0 to 3]
Snedecor’s F Distribution: Interesting Facts
• Parameters v, w, referred to as degrees of freedom (df).
• Mean = w/(w-2), for w > 2
• Variance = $\frac{2w^2(v+w-2)}{v(w-2)^2(w-4)}$, for w > 4
• As the d.f. v and w increase, F variate tends to normal
• Related also to Chi-square, Student’s t, Beta and Binomial
• Reference for distributions: Statistical Distributions, 3rd ed., by Evans, Hastings and Peacock, Wiley, 2000
Sampling Distributions - Summary
• For random sample from any distribution, standardized sample mean converges to N(0,1) as n increases (CLT).
• In normal case, standardized sample mean with S instead of sigma in the denominator ~ Student’s t(n-1).
• Sum of n squared unit normal variates ~ Chi-square (n)
• In the normal case, sample variance has scaled Chi-square distribution.
• In the normal case, the ratio of the sample variances from two independent samples, each divided by its population variance, has an F distribution.
Sir Ronald A. Fisher (1890-1962)
Wrote the first books on statistical methods (1926 & 1936):
“A student should not be made to read Fisher’s books unless he has read them before.”
George W. Snedecor (1882-1974)
Taught at Iowa State Univ., where he wrote a college textbook (1937):
“Thank God for Snedecor; now we can understand Fisher.”
(named the distribution for Fisher)
Sampling Distributions for Order Statistics
Most sampling distribution results (except for the CLT) apply to samples from normal populations.
If the data do not come from a normal (or at least approximately normal) population, then statistical methods called “distribution-free” or “non-parametric” methods can be used (Chapter 14).
Non-parametric methods are often based on ordered data (called order statistics: X(1), X(2), …, X(n)) or just their ranks.
If X1..Xn are from a continuous population with cdf F(x) and pdf f(x), then the pdf of X(j) is:
$f_{(j)}(x) = \frac{n!}{(j-1)!\,(n-j)!}\, [F(x)]^{j-1}\, [1 - F(x)]^{n-j}\, f(x)$
The confidence intervals for percentiles can be derived using the order statistics and the binomial distribution.
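As an illustration of the last point, here is a minimal sketch for the median (the 50th percentile). It relies on the fact that, for a continuous population, the number of observations below the median is Binomial(n, 1/2); the function name and the choice n = 10, j = 2, k = 9 are hypothetical.

```python
from math import comb

def median_interval_coverage(n, j, k):
    """P(X_(j) <= median <= X_(k)) for a continuous population.

    The count of observations below the median is Binomial(n, 1/2);
    the interval covers the median when that count lies in j, ..., k - 1.
    """
    return sum(comb(n, i) for i in range(j, k)) / 2**n

coverage = median_interval_coverage(n=10, j=2, k=9)
print(coverage)
```

With these inputs, [X(2), X(9)] is a distribution-free interval for the median whose confidence level is the printed coverage.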
Basic Concepts of Inference
Corresponds to Chapter 6 of Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” H. G. Wells
Statistical Inference
Deals with methods for making statements about a population based on a sample drawn from the population
Point Estimation: Estimate an unknown population parameter
Confidence Interval Estimation: Find an interval that contains the parameter with preassigned probability.
Hypothesis testing: Testing a hypothesis about an unknown population parameter
Examples
Point Estimation: estimate the mean package weight of a cereal box filled during a production shift
Confidence Interval Estimation: Find an interval [L,U] based on the data that includes the mean weight of the cereal box with a specified probability
Hypothesis testing: Do the cereal boxes meet the minimum mean weight specification of 16 oz?
Two Levels of Statistical Inference
• Informal, using summary statistics (may only be descriptive statistics)
• Formal, which uses methods of probability and sampling distributions to develop measures of statistical accuracy
Estimation Problems
• Point estimation: estimation of an unknown population parameter by a single statistic calculated from the sample data.
• Confidence interval estimation: calculation of an interval from sample data that includes the unknown population parameter with a pre-assigned probability.
Point Estimation Terminology
Estimator = the random variable (r.v.) $\hat{\theta}$, a function of the Xi’s (the general formula of the rule to be computed from the data)
Estimate = the numerical value of $\hat{\theta}$ calculated from the observed sample data X1 = x1, ..., Xn = xn (the specific value calculated from the data)
Example: Xi ~ N(µ,σ²)
Estimator = $\hat{\mu} = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ is an estimator of µ
Estimate = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (= 10.2) is an estimate of µ
Other estimators of µ?
Methods of Evaluating Estimators: Bias and Variance
$\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$
- The bias measures the accuracy of an estimator.
- An estimator whose bias is zero is called unbiased.
- An unbiased estimator may, nevertheless, fluctuate greatly from sample to sample.
$\mathrm{Var}(\hat{\theta}) = E\{[\hat{\theta} - E(\hat{\theta})]^2\}$
- The lower the variance, the more precise the estimator.
- A low-variance estimator may be biased.
- Among unbiased estimators, the one with the lowest variance should be chosen. “Best” = minimum variance.
Accuracy and Precision
accurate and precise
accurate, not precise
precise, not accurate
not accurate, not precise
Diagram courtesy of MIT OpenCourseWare
Mean Squared Error
- To choose among all estimators (biased and unbiased), minimize a measure that combines both bias and variance.
- A “good” estimator should have low bias (accurate) AND low variance (precise).
$\mathrm{MSE}(\hat{\theta}) = E\{[\hat{\theta} - \theta]^2\} = \mathrm{Var}(\hat{\theta}) + [\mathrm{Bias}(\hat{\theta})]^2$ (eqn 6.2)
MSE = expected squared error loss function
where $\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$ and $\mathrm{Var}(\hat{\theta}) = E\{[\hat{\theta} - E(\hat{\theta})]^2\}$
Example: estimators of variance
Two estimators of variance:
$S_1^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$ is unbiased (Example 6.3)
$S_2^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$ is biased but has smaller MSE (Example 6.4)
In spite of its larger MSE, we almost always use $S_1^2$.
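The comparison can be made concrete for normal data. The sketch below is an assumption-laden illustration, not the textbook's derivation: it takes Var(S1²) = 2σ⁴/(n-1) from the sampling-distribution result for the normal case, uses S2² = ((n-1)/n) S1², and applies MSE = Var + Bias². The function name and n = 10 are hypothetical.

```python
from fractions import Fraction

def mse_variance_estimators(n, sigma2=1):
    """Exact bias, variance and MSE of S1^2 (divisor n-1) and S2^2 (divisor n)
    for a normal sample, using Var(S1^2) = 2*sigma^4/(n-1)."""
    sigma4 = Fraction(sigma2) ** 2
    var1 = 2 * sigma4 / (n - 1)
    bias1 = Fraction(0)
    # S2^2 = ((n-1)/n) * S1^2, so its mean and variance are rescaled.
    shrink = Fraction(n - 1, n)
    var2 = shrink**2 * var1
    bias2 = shrink * sigma2 - sigma2
    return {
        "S1^2": (bias1, var1, var1 + bias1**2),
        "S2^2": (bias2, var2, var2 + bias2**2),
    }

result = mse_variance_estimators(n=10)
for name, (bias, var, mse) in result.items():
    print(name, float(bias), float(var), float(mse))
```

For n = 10 and σ² = 1 this gives MSE 2/9 for S1² and 19/100 for S2², consistent with the slide's statement that the biased estimator has the smaller MSE.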
Example - Poisson
(See example in Casella & Berger, page 308)
Standard Error (SE)
- The standard deviation of an estimator is called the standard error of the estimator (SE).
- The estimated standard error is also called standard error (se).
- The precision of an estimator is measured by the SE.
Examples for the normal and binomial distributions:
1. $\bar{X}$ is an unbiased estimator of µ. $SE(\bar{X}) = \sigma/\sqrt{n}$ and $se(\bar{X}) = s/\sqrt{n}$ are called the standard error of the mean.
2. $\hat{p}$ is an unbiased estimator of p. $se(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$
Precision and Standard Error
• A precise estimate has a small standard error, but exactly how are the precision and standard error related?
• If the sampling distribution of an estimator is normal with mean equal to the true parameter value (i.e., unbiased), then we know that about 95% of the time the estimator will be within two SE’s of the true parameter value.
Methods of Point Estimation
•Method of Moments (Chapter 6)
•Maximum Likelihood Estimation (Chapter 15)
•Least Squares (Chapter 10 and 11)
Method of Moments
• Equate sample moments to population moments (as we did with Poisson).
• Example: for the continuous uniform distribution, f(x|a,b) = 1/(b-a), a ≤ x ≤ b
• E(X) = (b+a)/2, Var(X) = (b-a)²/12
• Set $\bar{X}$ = (b+a)/2
• Set S² = (b-a)²/12
• Solve for a and b (can be a bit messy).
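Solving the two moment equations gives b - a = √12 · S, so a = X̄ - √3·S and b = X̄ + √3·S. A minimal sketch of that solution follows; the function name and the data values are made up for illustration.

```python
import math
import statistics

def uniform_method_of_moments(data):
    """Solve xbar = (a+b)/2 and s^2 = (b-a)^2/12 for a and b."""
    xbar = statistics.fmean(data)
    s = statistics.stdev(data)          # square root of S^2 (divisor n-1)
    half_width = math.sqrt(3) * s       # (b - a)/2 = sqrt(12 * s^2)/2
    return xbar - half_width, xbar + half_width

a_hat, b_hat = uniform_method_of_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(a_hat, b_hat)
```

Note that moment estimates are not guaranteed to bracket every observation, which is one reason other estimation methods are also considered.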
Maximum Likelihood Parameter Estimation
• By far the most popular estimation method! (Casella & Berger).
• MLE is the parameter point for which observed data is most likely under the assumed probability model.
• Likelihood function: L(θ |x) = f(x| θ), where x is the vector of sample values, θ also a vector possibly.
• When we consider f(x| θ), we consider θ as fixed and x as the variable.
• When we consider L(θ |x), we are considering x to be the fixed observed sample point and θ to be varying over all possible parameter values.
MLE (continued)
• If X1, …, Xn are iid then
L(θ|x) = f(x1, …, xn | θ) = ∏ f(xi | θ)
• The MLE of θ is the value which maximizes the likelihood function (assuming it has a global maximum).
• Found by differentiating when possible.
• Usually work with log of likelihood function (∏→∑).
• Equations obtained by setting partial derivatives of ln L(θ) = 0 are called the likelihood equations.
• See text page 616 for example – normal distribution.
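For the normal distribution the likelihood equations have the well-known closed-form solution µ̂ = x̄ and σ̂² = Σ(xi - x̄)²/n. The sketch below, with made-up data and hypothetical function names, computes that solution and checks numerically that nearby parameter values do not give a larger log-likelihood.

```python
import math

def normal_log_likelihood(data, mu, sigma2):
    """ln L(mu, sigma2 | data) for i.i.d. N(mu, sigma2) observations."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def normal_mle(data):
    """Closed-form solution of the likelihood equations."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n  # divisor n, not n-1
    return mu_hat, sigma2_hat

data = [9.8, 10.4, 10.1, 9.6, 10.6]   # made-up illustrative values
mu_hat, sigma2_hat = normal_mle(data)
best = normal_log_likelihood(data, mu_hat, sigma2_hat)
# Log-likelihood on a small grid around the MLE (the grid includes the MLE itself).
nearby = [normal_log_likelihood(data, mu_hat + dm, sigma2_hat * ds)
          for dm in (-0.1, 0.0, 0.1) for ds in (0.8, 1.0, 1.25)]
print(mu_hat, sigma2_hat, best >= max(nearby))
```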
Confidence Interval Estimation
We want an interval [ L, U ] where L and U are two statistics calculated from X1, X2, …, Xn such that
P[ L ≤ θ ≤ U] = 1 - α regardless of the true value of θ.
Note: L and U are random and θ is fixed but unknown.
• [ L, U ] is called a 100(1-α)% confidence interval (CI).
• 1-α is called the confidence level of the interval.
• After the data are observed, X1 = x1, ..., Xn = xn, the confidence limits L = l and U = u can be calculated.
95% Confidence Interval: Normal, σ² known
Consider a random sample X1, X2, …, Xn ~ N(µ,σ²) where σ² is assumed to be known and µ is an unknown parameter to be estimated. Then
$P\left[ -1.96 \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le 1.96 \right] = 0.95$
By the CLT, even if the sample is not normal, this result is approximately correct.
$\Rightarrow P\left[ L = \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le U = \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right] = 0.95$
$\Rightarrow l = \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} = u$ is a 95% CI for µ (two-sided)
• See Example 6.7, Airline Revenues, p. 204
Normal Distribution, 95% of area under curve is between -1.96 and 1.96
[Figure: standard normal density dnorm(x) for x from -3 to 3, with the central 95% of the area marked]
Frequentist Interpretation of CI’s
In an infinitely long series of trials in which repeated samples of size n are drawn from the same population and 95% CI’s for µ are calculated using the same method, the proportion of intervals that actually include µ will be 95% (coverage probability).
However, for any particular CI, it is not known whether or not the CI includes µ, but the probability that it includes µ is either 0 or 1, that is, either it does or it doesn’t.
It is incorrect to say that the probability is 0.95 that the true µ is in a particular CI.
• See Figure 6.2, p. 205
95% CI, 50 samples from unit normal distribution
[Figure: 95% confidence intervals for each of 50 samples; horizontal axis: sample number, 0 to 50; vertical axis: 95% confidence interval, -1.0 to 1.0]
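The coverage interpretation can be sketched by simulation. The example below assumes unit normal data, a known σ = 1, and arbitrary choices of n = 20 and 4000 repetitions; it counts how often the interval x̄ ± 1.96σ/√n contains the true mean.

```python
import random
from math import sqrt

def coverage_rate(n, reps, mu=0.0, sigma=1.0, z=1.96, seed=0):
    """Fraction of z-intervals xbar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage_rate(n=20, reps=4000)
print(rate)
```

The printed proportion should be close to 0.95, while any single interval either contains µ or does not.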
Arbitrary Confidence Level for CI: σ² known
100(1-α)% two-sided CI for µ based on the observed sample mean:
$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
For 99% confidence, $z_{\alpha/2}$ = 2.576.
The price paid for a higher confidence level is a wider interval.
For large samples, these CIs can be used for data from any distribution, since by the CLT $\bar{X} \approx N(\mu, \sigma^2/n)$.
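A short sketch of the two-sided z-interval at an arbitrary confidence level follows. The function name and the input values (x̄ = 10, σ = 2, n = 25) are hypothetical; `NormalDist().inv_cdf` from the Python standard library supplies the critical value.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf=0.95):
    """Two-sided CI for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for conf = 0.95
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

lo95, hi95 = z_interval(xbar=10.0, sigma=2.0, n=25)             # hypothetical inputs
lo99, hi99 = z_interval(xbar=10.0, sigma=2.0, n=25, conf=0.99)
print(round(lo95, 3), round(hi95, 3))
print(round(lo99, 3), round(hi99, 3))
```

The 99% interval comes out wider than the 95% interval, which is the price of higher confidence noted above.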
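One-sided Confidence Intervals
$\mu \ge \bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Lower one-sided CI)
$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Upper one-sided CI)
For 95% confidence, $z_{\alpha}$ = 1.645 vs. $z_{\alpha/2}$ = 1.96.
One-sided CIs are tighter for the same confidence level.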
Hypothesis Testing
The objective of hypothesis testing is to assess the validity of a claim against a counterclaim using sample data.
• The claim to be “proved” is the alternative hypothesis (H1).
• The competing claim is called the null hypothesis (H0).
• One begins by assuming that H0 is true. If the data fails to contradict H0 beyond a reasonable doubt, then H0 is not rejected. However, failing to reject H0 does not mean that we accept it as true. It simply means that H0 cannot be ruled out as a possible explanation for the observed data. A proof by insufficient data is not a proof at all.
Testing Hypotheses
“The process by which we use data to answer questions about parameters is very similar to how juries evaluate evidence about a defendant.” – from Geoffrey Vining, Statistical Methods for Engineers, Duxbury, 1st edition, 1998. For more information, see that textbook.
Hypothesis Tests
• A hypothesis test is a data-based rule to decide between H0 and H1.
• A test statistic calculated from the data is used to make this decision.
• The values of the test statistics for which the test rejects H0 comprise the rejection region of the test.
• The complement of the rejection region is called the acceptance region.
• The boundaries of the rejection region are defined by one or more critical constants (critical values).
• See Examples 6.13(acc. sampling) and 6.14(SAT coaching), pp. 210-211.
Hypothesis Testing as a Two-Decision Problem
Framework developed by Neyman and Pearson in 1933.
When a hypothesis test is viewed as a decision procedure, two types of errors are possible:
Reality | Decision: Do not reject H0 | Decision: Reject H0 | Total
H0 True | Correct Decision, “Confidence”, 1 - α | Type I Error, “Significance Level”, α | = 1
H0 False | Type II Error, “Failure to Detect”, β | Correct Decision, “Prob. of Detection”, 1 - β | = 1
Column Total | ≠ 1 | ≠ 1 |
Probabilities of Type I and II Errors
α = P{Type I error} = P{Reject H0 when H0 is true} = P{Reject H0|H0}
also called α-risk or producer’s risk or false alarm rate
β = P{Type II error} = P{Fail to reject H0 when H1 is true} = P{Fail to reject H0|H1}
also called β-risk or consumer’s risk or prob. of not detecting
π = 1 - β = P{Reject H0|H1} is prob. of detection or power of the test
We would like to have low α and low β (or equivalently, high power).
α and 1-β are directly related, can increase power by increasing α.
These probabilities are calculated using the sampling distributions from either the null hypothesis (for α) or alternative hypothesis (for β).
Example 6.17 (SAT Coaching)
See Example 6.17, “SAT Coaching,” in the course textbook.
Power Function and OC Curve
The operating characteristic function of a test is the probability that the test fails to reject H0 as a function of θ, where θ is the test parameter.
OC(θ) = P{test fails to reject H0 | θ}
For θ values included in H1 the OC function is the β –risk.
The power function is:
π(θ) = P{Test rejects H0 | θ} = 1 – OC(θ)
Example: In SAT coaching, for the test that rejects the null hypothesis when mean change is 25 or greater, the power = 1-pnorm(25,mean=0:50,sd=40/sqrt(20))
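For readers without S-Plus, the power expression in the SAT coaching example can be mirrored in Python. This is a sketch that assumes, as in the S-Plus call, a standard deviation of 40, n = 20, and rejection when the sample mean change is 25 or more; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_reject_at_25(mu, sd=40.0, n=20, cutoff=25.0):
    """P(sample mean >= cutoff) when the true mean change is mu.

    Mirrors the S-Plus expression 1 - pnorm(25, mean=mu, sd=40/sqrt(20)).
    """
    return 1.0 - NormalDist(mu, sd / sqrt(n)).cdf(cutoff)

powers = [power_reject_at_25(mu) for mu in range(0, 51)]   # mu = 0, 1, ..., 50
print(round(powers[0], 4), round(powers[25], 4), round(powers[50], 4))
```

The power is exactly 0.5 when the true mean equals the cutoff and increases with µ, which is the shape of the power function; 1 minus these values gives the OC curve.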
Level of Significance
The practice in hypothesis testing is to put an upper bound on P(Type I error) and, subject to that constraint, find a test with the lowest possible P(Type II error).
The upper bound on P(Type I error) is called the level of significance of the test and is denoted by α (usually some small number such as 0.01, 0.05, or 0.10).
The test is required to satisfy:
P{ Type I error } = P{ Test Rejects H0 | H0 } ≤ α
Note that α is now used to denote an upper bound on P(Type I error).
Motivated by the fact that the Type I error is usually the more serious.
A hypothesis test with a significance level α is called an α-level test.
Choice of Significance Level
What α level should one use?
Recall that as P(Type I error) decreases P(Type II error) increases.
A proper choice of α should take into account the relative costs of Type I and Type II errors. (These costs may be difficult to determine in practice, but must be considered!)
Fisher said: α =0.05
Today α = 0.10, 0.05, 0.01 depending on how much proof against the null hypothesis we want to have before rejecting it.
P-values have become popular with the advent of computer programs.
Observed Level of Significance or P-value
Simply rejecting or not rejecting H0 at a specified α level does not fully convey the information in the data.
Example: H0 : µ = 15 vs H1 : µ > 15 is rejected at the α = 0.05 level when
$\bar{x} > 15 + 1.645 \times \frac{40}{\sqrt{20}} = 29.71$
Is a sample with a mean of 30 equivalent to a sample with a mean of 50? (Note that both lead to rejection at the α-level of 0.05.)
More useful to report the smallest α-level for which the data would reject (this is called the observed level of significance or P-value).
Reject H0 if P-value < α
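The example can be worked through numerically. The sketch below assumes the same setup (µ0 = 15, σ = 40, n = 20, upper one-sided z-test) and compares the P-values for sample means of 30 and 50; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def upper_p_value(xbar, mu0=15.0, sigma=40.0, n=20):
    """P-value for H0: mu = mu0 vs H1: mu > mu0 with sigma known."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1.0 - NormalDist().cdf(z)

# Critical value for the 0.05-level test, as on the slide.
critical = 15.0 + NormalDist().inv_cdf(0.95) * 40.0 / sqrt(20)
p30, p50 = upper_p_value(30.0), upper_p_value(50.0)
print(round(critical, 2), p30, p50)
```

Both P-values fall below 0.05, but the one for a mean of 50 is far smaller, which is the extra information a bare reject/do-not-reject decision hides.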
Example 6.23 (SAT Coaching: P-Value)
See Example 6.23, “SAT Coaching,” on page 220 of the course textbook.
One-sided and Two-sided Tests
H0 : µ = 15 can have three possible alternative hypotheses:
H1 : µ > 15 (upper one-sided), H1 : µ < 15 (lower one-sided), or H1 : µ ≠ 15 (two-sided)
Example 6.27 (SAT Coaching: Two-sided testing)
See Example 6.27 in the course textbook.
Example 6.27 continued
See Example 6.27, “SAT Coaching,” on page 223 of the course textbook.
Relationship Between Confidence Intervals and Hypothesis Tests
An α-level two-sided test rejects a hypothesis H0 : µ = µ0 if and only if the (1- α)100% confidence interval does not contain µ0.
Example 6.7 (Airline Revenues)
See Example 6.7, “Airline Revenues,” on page 207 of the course textbook.
Use/Misuse of Hypothesis Tests in Practice
• Difficulties of Interpreting Tests on Non-random samples and observational data
• Statistical significance versus Practical significance– Statistical significance is a function of sample size
• Perils of searching for significance
• Ignoring lack of significance
•Confusing confidence (1 - α) with probability of detecting a difference (1 - β)
Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980)
Carried on a decades-long feud with Fisher over the foundations of statistics (hypothesis testing and confidence limits). Fisher never recognized Type II error and developed fiducial limits.
Inference for Single Samples
Corresponds to Chapter 7 of
Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León (University of Tennessee)
Inference About the Mean and Variance of a Normal Population
Applications:
• Monitor the mean of a manufacturing process to determine if the process is under control
• Evaluate the precision of a laboratory instrument measured by the variance of its readings
• Prediction intervals and tolerance intervals, which are methods for estimating future observations from a population.
By using the central limit theorem (CLT), inference procedures for the mean of a normal population can be extended to the mean of a non-normal population when a large sample is available.
Inferences on Mean (Large Samples)
• Inferences on µ will be based on the sample mean $\bar{X}$, which is an unbiased estimator of µ with variance $\sigma^2/n$.
• For large sample size n, the CLT tells us that $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$ distributed, even if the population is not normal.
• Also for large n, the sample variance $s^2$ may be taken as an accurate estimator of $\sigma^2$ with negligible sampling error. If n ≥ 30, we may assume that σ ≈ s in the formulas.
Pivots
• Definition: Casella & Berger, p. 413
• E.g. $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$
• Allow us to construct confidence intervals on parameters.
Confidence Intervals on the Mean: Large Samples
$P\left[ -z_{\alpha/2} \le Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2} \right] = 1 - \alpha$
Note: zα/2 = -qnorm(α/2)
(See Figure 2.15 on page 56 of the course textbook.)
Confidence Intervals on the Mean
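$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ (Two-Sided CI)
$\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}} \le \mu$ (Lower One-Sided CI)
$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Upper One-Sided CI)
$\frac{\sigma}{\sqrt{n}}$ is the standard error of the mean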
Confidence Intervals in S-Plus
t.test(lottery.payoff)
One-sample t-Test
data: lottery.payoff
t = 35.9035, df = 253, p-value = 0
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 274.4315 306.2850
sample estimates:
 mean of x
  290.3583
This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Sample Size Determination for a z-interval
• Suppose that we require a (1-α)-level two-sided CI for µ of the form $[\bar{x} - E, \bar{x} + E]$, with a margin of error E.
• Set $E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ and solve for n, obtaining $n = \left[ \frac{z_{\alpha/2}\,\sigma}{E} \right]^2$
• Calculation is done at the design stage, so a sample estimate of σ is not available.
• An estimate for σ can be obtained by anticipating the range of the observations and dividing by 4. This is based on assuming normality, since then 95% of the observations are expected to fall in [µ - 2σ, µ + 2σ].
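The design-stage calculation above can be sketched as follows. The anticipated range of 40 and the margin of error of 2 are hypothetical planning values, and the function name is an assumption for illustration; n is rounded up to the next integer.

```python
from math import ceil
from statistics import NormalDist

def n_for_margin(sigma, margin, conf=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z * sigma / margin) ** 2)

anticipated_range = 40.0              # hypothetical planning value
sigma_guess = anticipated_range / 4   # range / 4 rule from the slide
n_needed = n_for_margin(sigma_guess, margin=2.0)
print(n_needed)
```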
Example 7.1 (Airline Revenue)
See Example 7.1, “Airline Revenue,” on page 239 of the course textbook.
Example 7.2 – Strength of Steel Beams
See Example 7.2 on page 240 of the course textbook.
Power Calculation for One-sided Z-tests
$\pi(\mu) = P[\text{Test rejects } H_0 \mid \mu]$
Testing $H_0: \mu \le \mu_0$ vs. $H_1: \mu > \mu_0$
For the derivation of the power function of the α-level upper one-sided z-test, see Equation 7.7 in the course textbook.
Illustration of calculation on next page
[Figure: standard normal curve with -z and z marked, showing Φ(-z) = 1 - Φ(z)]
Power Calculation for One-sided Z-tests
p.d.f. curves of $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
(See Figure 7.1 on page 243 of the course textbook.)
Power Function Curves
See Figure 7.2 on page 243 of the course textbook.
Notice how it is easier to detect a big difference from µ0.
Example 7.3 (SAT Coaching: Power Calculation)
See Example 7.3 on page 244 of the course textbook.
$\pi(\mu) = \Phi\left[ -z_{\alpha} + \frac{(\mu - \mu_0)\sqrt{n}}{\sigma} \right]$
Power Calculation Two-Sided Test
(See Figure 7.3 on page 245 of the course textbook.)
Power Curve for Two-sided Test
It is easier to detect large differences from the null hypothesis.
(See Figure 7.4 on page 246 of the course textbook.)
Larger samples lead to more powerful tests.
Power as a function of µ and n, µ0 = 0, σ = 1
Uses function persp in S-Plus
[Figure: perspective plot of power (0 to 1) against n (20 to 100) and mu (-1 to 1)]
Sample Size Determination for a One-Sided z-Test
• Determine the sample size so that a study will have sufficient power to detect an effect of practically important magnitude
• If the goal of the study is to show that the mean response µ under a treatment is higher than the mean response µ0 without the treatment, then µ−µ0 is called the treatment effect
• Let δ > 0 denote a practically important treatment effect and let 1−β denote the minimum power required to detect it. The goal is to find the minimum sample size n which would guarantee that an α-level test of H0 has at least 1−β power to reject H0 when the treatment effect is at least δ.
Sample Size Determination for a One-sided Z-test
Because Power is an increasing function of µ−µ0, it is only necessary to find n that makes the power 1− β at µ = µ0+δ.
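$\pi(\mu_0 + \delta) = \Phi\left( -z_{\alpha} + \frac{\delta\sqrt{n}}{\sigma} \right) = 1 - \beta$ [See Equation (7.7) and the slide "Power Calculation for One-sided Z-tests"]
Since $\Phi(z_{\beta}) = 1 - \beta$, we have $-z_{\alpha} + \frac{\delta\sqrt{n}}{\sigma} = z_{\beta}$.
Solving for n, we obtain
$n = \left[ \frac{(z_{\alpha} + z_{\beta})\,\sigma}{\delta} \right]^2$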
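The derivation can be checked numerically. The sketch below implements the power function of Equation (7.7) and the resulting sample size formula; the inputs δ = 0.3, σ = 1, α = 0.05, β = 0.20 and the function names are hypothetical, and n is rounded up to an integer.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD = NormalDist()

def power_upper_z(mu, mu0, sigma, n, alpha=0.05):
    """Power of the alpha-level upper one-sided z-test at true mean mu."""
    z_alpha = STD.inv_cdf(1 - alpha)
    return STD.cdf(-z_alpha + (mu - mu0) * sqrt(n) / sigma)

def n_upper_z(delta, sigma, alpha=0.05, beta=0.20):
    """Smallest n giving power at least 1 - beta at mu = mu0 + delta."""
    z_alpha = STD.inv_cdf(1 - alpha)
    z_beta = STD.inv_cdf(1 - beta)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

n_req = n_upper_z(delta=0.3, sigma=1.0)            # hypothetical inputs
power_at_n = power_upper_z(0.3, 0.0, 1.0, n_req)
power_below = power_upper_z(0.3, 0.0, 1.0, n_req - 1)
print(n_req, power_at_n, power_below)
```

The power at the computed n is at least 1 - β, while one fewer observation falls just short, which is what "minimum sample size" means here.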
Example 7.5 (SAT Coaching: Sample Size Determination)
See Example 7.5 on page 248 of the course textbook.
Sample Size Determination for a Two-Sided z-Test
$n \approx \left[ \frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\delta} \right]^2$
Read on your own the derivation on pages 248-249.
See Example 7.6 on page 249 of the course textbook.
Read on your own Example 7.4 (page 246).
Power and Sample Size in S-Plus
> normal.sample.size(mean.alt = 0.3)
  mean.null sd1 mean.alt delta alpha power n1
          0   1      0.3   0.3  0.05   0.8 88
> normal.sample.size(mean.alt = 0.3, n1 = 100)
  mean.null sd1 mean.alt delta alpha  power  n1
          0   1      0.3   0.3  0.05 0.8508 100
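The S-Plus numbers above can be reproduced with the two-sided formula. This sketch assumes that the S-Plus call uses a two-sided test with α = 0.05 and power 0.8, which the printed columns suggest but the slide does not state; the Python function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD = NormalDist()

def n_two_sided_z(delta, sigma=1.0, alpha=0.05, power=0.80):
    """Sample size from n = [(z_{alpha/2} + z_beta) * sigma / delta]^2, rounded up."""
    z_half = STD.inv_cdf(1 - alpha / 2)
    z_beta = STD.inv_cdf(power)
    return ceil(((z_half + z_beta) * sigma / delta) ** 2)

def approx_power_two_sided(delta, n, sigma=1.0, alpha=0.05):
    """Power ignoring the (tiny) chance of rejecting in the wrong tail."""
    z_half = STD.inv_cdf(1 - alpha / 2)
    return STD.cdf(-z_half + delta * sqrt(n) / sigma)

n_two = n_two_sided_z(0.3)
power_100 = approx_power_two_sided(0.3, 100)
print(n_two, round(power_100, 4))
```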
Inference on Mean (Small Samples)
The sampling variability of s² may be sizable if the sample is small (less than 30). Inference methods must take this variability into account when σ² is unknown.
Assume that $X_1, \dots, X_n$ is a random sample from an $N(\mu, \sigma^2)$ distribution. Then $T = \frac{\bar{X}-\mu}{S/\sqrt{n}}$ has a t-distribution with n-1 degrees of freedom (d.f.).
(Note that T is a pivot)
Confidence Intervals on Mean
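$1 - \alpha = P\left[ -t_{n-1,\alpha/2} \le T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \le t_{n-1,\alpha/2} \right] = P\left[ \bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \right]$
$\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$ [Two-Sided 100(1-α)% CI]
$t_{n-1,\alpha/2} > z_{\alpha/2}$ ⇒ the t-interval is wider on the average than the z-interval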
Examples 7.7, 7.8, and 7.9
See Examples 7.7, 7.8, and 7.9 from the course textbook.
Inference on Variance
Assume that $X_1, \dots, X_n$ is a random sample from an $N(\mu, \sigma^2)$ distribution.
$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$ has a Chi-square distribution with n-1 d.f.
(See Figure 7.8 on page 255 of the course textbook)
$1 - \alpha = P\left[ \chi^2_{n-1,\,1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{n-1,\,\alpha/2} \right]$
CI for σ2 and σ
The 100(1-α)% two-sided CI for σ2 (Equation 7.17 in course textbook):
$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$
The 100(1-α)% two-sided CI for σ (Equation 7.18 in course textbook):
$s\sqrt{\frac{n-1}{\chi^2_{n-1,\,\alpha/2}}} \le \sigma \le s\sqrt{\frac{n-1}{\chi^2_{n-1,\,1-\alpha/2}}}$
Hypothesis Test on Variance
See Equation 7.21 on page 256 of the course textbook for an explanation of the chi-square statistic:
$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$
Prediction Intervals
• Many practical applications call for an interval estimate of an individual (future) observation sampled from a population, rather than of the mean of the population.
• An interval estimate for an individual observation is called a prediction interval
Prediction Interval Formula:
$\bar{x} - t_{n-1,\alpha/2}\, s\sqrt{1 + \frac{1}{n}} \le X \le \bar{x} + t_{n-1,\alpha/2}\, s\sqrt{1 + \frac{1}{n}}$
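Confidence vs. Prediction Interval
Prediction interval of a single future observation:
$\bar{x} - t_{n-1,\alpha/2}\, s\sqrt{1 + \frac{1}{n}} \le X \le \bar{x} + t_{n-1,\alpha/2}\, s\sqrt{1 + \frac{1}{n}}$
As n → ∞, the interval converges to $[\mu - z_{\alpha/2}\sigma,\ \mu + z_{\alpha/2}\sigma]$
Confidence interval for µ:
$\bar{x} - t_{n-1,\alpha/2}\, s\sqrt{\frac{1}{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\, s\sqrt{\frac{1}{n}}$
As n → ∞, the interval converges to the single point µ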
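The difference between the two intervals is only the factor under the square root. The sketch below computes both from summary statistics; the values x̄ = 33.712, s = 0.798, n = 14 are used purely as illustrative inputs, and t_crit = 2.160 is the tabled value of t with 13 d.f. at α/2 = 0.025, supplied by hand because the Python standard library has no t quantile function.

```python
from math import sqrt

def ci_and_pi(xbar, s, n, t_crit):
    """Return (confidence interval for mu, prediction interval for one new X)."""
    ci_half = t_crit * s * sqrt(1 / n)
    pi_half = t_crit * s * sqrt(1 + 1 / n)
    return (xbar - ci_half, xbar + ci_half), (xbar - pi_half, xbar + pi_half)

# t_crit = 2.160 is the tabled t_{13, 0.025}; summary values are illustrative.
ci, pi = ci_and_pi(xbar=33.712, s=0.798, n=14, t_crit=2.160)
print(ci, pi)
```

The prediction interval is always the wider of the two, since it must allow for the variability of a single new observation as well as the uncertainty in x̄.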
Example 7.12: Tear Strength of Rubber
See Example 7.12 on page 259 of the course textbook.
Run chart shows process is predictable.
Tolerance Intervals
Suppose we want an interval which will contain at least 0.90 = 1-γ of the strengths of the future batches (observations) with 95% = 1-α confidence.
Using Table A.12 in the course textbook:
1-α = 0.95
1-γ = 0.90
n = 14
So, the critical value we want is K = 2.529.
$[\bar{x} - Ks,\ \bar{x} + Ks] = 33.712 \pm 2.529 \times 0.798 = [31.694, 35.730]$
Note that this statistical interval is even wider than the prediction interval.
Inferences for Two Samples
Corresponds to Chapter 8 of Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León (University of Tennessee)
Introductory Remarks
• A majority of statistical studies, whether experimental or observational, are comparative
• Simplest type of comparative study compares two populations
• Two principal designs for comparative studies
– Using independent samples
– Using matched pairs
• Graphical methods for informal comparisons
• Formal comparisons of means and variances of normal populations
– Confidence intervals
– Hypothesis tests
Independent Samples Design
Example: Compare Control Group to Treatment Group
• See page 270 in course textbook.
Independent samples design:
Sample 1: $x_1, x_2, \dots, x_{n_1}$
Sample 2: $y_1, y_2, \dots, y_{n_2}$
(the sample sizes n1 and n2 may be different numbers)
• The two samples are independent
• Independent sample design relies on random assignment to make the two groups equal (on the average) on all attributes except for the treatment used (treatment factor).
Graphical Methods for Comparing Two Independent Samples
See Table 8.1 and Figure 8.1, which is a Q-Q plot. The plot suggests that treatment group costs are less than control group costs. But is it true?
Plot of the order statistics, the ordered pairs $(x_{(i)}, y_{(i)})$, which are the $\left(\frac{i}{n+1}\right)$ quantiles of the respective samples.
Book discusses how to prepare this graph when the two samples are not of the same size (interpolation).
Box plots of hospitalization cost data
[Figure: box plots of hcc and hct; vertical axis from 0 to 30000]
Box plots of logs of hospitalization cost data
[Figure: box plots of lhcc and lhct; vertical axis from 6 to 10]
Graphical Displays of Data from Matched Pairs
• Plot the pairs (xi, yi) in a scatter plot. Using the 45° line as a reference, one can judge whether the two sets of values are similar or whether one set tends to be larger than the other
• Plots of the differences or the ratios of the pairs may prove to be useful
• A Q-Q plot is meaningless for paired data because the same quantiles based on the ordered observations do not, in general, come from the same pair.
Comparing Means of Two Populations: Independent Samples Design
(Large Samples Case)
Suppose that the observations $x_1, x_2, \dots, x_{n_1}$ and $y_1, y_2, \dots, y_{n_2}$ are random samples from two populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. Both means and variances are assumed to be unknown.
The goal is to compare $\mu_1$ and $\mu_2$ in terms of their difference $\mu_1 - \mu_2$. We assume that $n_1$ and $n_2$ are large (say > 30).
Comparing Means of Two Populations: Independent Samples Design
$E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_1 - \mu_2$
$\mathrm{Var}(\bar{X} - \bar{Y}) = \mathrm{Var}(\bar{X}) + \mathrm{Var}(\bar{Y}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
Therefore the standardized r.v.
$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
has mean = 0 and variance = 1.
If $n_1$ and $n_2$ are large, then Z is approximately N(0,1) by the Central Limit Theorem, though we did not assume the samples came from normal populations. (We also use the fact that the difference of independent normal r.v.'s is also normal.)
Large Sample (Approximate) 100(1-α)% CI for µ1−µ2
$\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Note: $s_i^2$ has been substituted for $\sigma_i^2$ because the samples are large, i.e., bigger than 30.
Example 8.2: See Example 8.2 in course textbook.
Large Sample (Approximate) Test of Hypothesis
$H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$ (Typically $\delta_0 = 0$)
Test statistic: $z = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
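The large-sample interval and test statistic use the same standard error, so they can be computed together. The sketch below works from summary statistics; all input values and the function name are hypothetical and are not taken from the textbook examples.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(xbar, ybar, s1, s2, n1, n2, delta0=0.0, conf=0.95):
    """Large-sample CI for mu1 - mu2 and z statistic for H0: mu1 - mu2 = delta0."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    z_half = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    diff = xbar - ybar
    z = (diff - delta0) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return (diff - z_half * se, diff + z_half * se), z, p_two_sided

# Hypothetical summary statistics, not taken from the textbook examples.
ci, z, p = two_sample_z(xbar=52.0, ybar=50.0, s1=6.0, s2=8.0, n1=36, n2=64)
print(ci, z, p)
```

In this made-up case the 95% interval contains 0 and the two-sided P-value exceeds 0.05, illustrating the link between confidence intervals and two-sided tests.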
Inference for Small Samples
Case 1: Variances $\sigma_1^2$ and $\sigma_2^2$ assumed equal.
Assumption of normal populations is important since we cannot invoke the CLT.
Pooled estimate of the common variance:
$S^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1-1) + (n_2-1)} = \frac{\sum (X_i - \bar{X})^2 + \sum (Y_i - \bar{Y})^2}{n_1 + n_2 - 2}$
Note: $S^2 = (S_1^2 + S_2^2)/2$ if sample sizes are equal
$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ has a t-distribution with $n_1 + n_2 - 2$ d.f.
Inference for Small Samples: Confidence Intervals and Hypothesis Tests
Case 1: Variances $\sigma_1^2$ and $\sigma_2^2$ assumed equal.
Two-sided 100(1-α)% CI is given by:
$\bar{x} - \bar{y} - t_{n_1+n_2-2,\,\alpha/2}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{n_1+n_2-2,\,\alpha/2}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
Test of Hypothesis: $H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$
Test statistic: $t = \frac{\bar{x} - \bar{y} - \delta_0}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Reject $H_0$ if $|t| > t_{n_1+n_2-2,\,\alpha/2}$
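A minimal sketch of the pooled calculation follows, using made-up data and a hypothetical function name. It returns the test statistic, its degrees of freedom, and the pooled variance; the critical value would still come from a t table such as Table A.4.

```python
from math import sqrt
from statistics import fmean, variance

def pooled_t(x, y, delta0=0.0):
    """Pooled-variance t statistic, its degrees of freedom, and pooled S^2."""
    n1, n2 = len(x), len(y)
    s2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (fmean(x) - fmean(y) - delta0) / sqrt(s2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2, s2

# Made-up data for illustration only.
t_stat, df, s2 = pooled_t([5.1, 4.9, 5.6, 5.8, 5.1], [4.4, 4.9, 4.1, 4.6, 4.5])
print(t_stat, df, s2)
```

Because the two samples here have equal sizes, the pooled variance equals the simple average of the two sample variances, as the note on the previous slide states.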
Hospitalization Cost Example
• See Example 8.2 on page 276 of course textbook.
Contrast this conclusion with the apparent difference seen on the Q-Q plot in Figure 8.1
t.test in S-Plus to test difference in means of logs of hospitalization cost data
t.test(lhcc,lhct)
Standard Two-Sample t-Test
data: lhcc and lhct
t = 0.6181, df = 58, p-value = 0.5389
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3731277 0.7064981
sample estimates:
 mean of x mean of y
  8.250925   8.08424
Interpretation of Difference in Means on the Log Scale
Mean(log Cost) = Median(log Cost), because the distribution of log cost is symmetric
Median(log Cost) = log(Median Cost), because the log preserves ordering
$-0.373 \le \text{Mean}(\log \text{Cost}_C) - \text{Mean}(\log \text{Cost}_T) \le 0.707$
$-0.373 \le \log(\text{Median Cost}_C) - \log(\text{Median Cost}_T) \le 0.707$
$-0.373 \le \log\left( \frac{\text{Median Cost}_C}{\text{Median Cost}_T} \right) \le 0.707$
$0.689 = \exp(-0.373) \le \frac{\text{Median Cost}_C}{\text{Median Cost}_T} \le \exp(0.707) = 2.028$
This is a 95% confidence interval for the ratio of median costs.
This interpretation is not in your textbook.
Inference for Small Samples
Case 2: Variances $\sigma_1^2$ and $\sigma_2^2$ unequal.
$$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$$
does not have a Student t-distribution.
It can be shown that the distribution of T depends on the ratio of the unknown variances, hence T is not a pivotal quantity. However, when n1 and n2 are large, T has an approximate N(0,1) distribution.
17
Inference for Small Samples
Case 2: Variances $\sigma_1^2$ and $\sigma_2^2$ unequal.
For small samples
$$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$$
has an approximately t-distribution with $\nu$ degrees of freedom, where
$$\nu = \frac{(w_1 + w_2)^2}{w_1^2/(n_1 - 1) + w_2^2/(n_2 - 1)}$$
and $w_1 = \text{SEM}(\bar{x})^2 = s_1^2/n_1$, $w_2 = \text{SEM}(\bar{y})^2 = s_2^2/n_2$.
Note: d.f. are estimated from the data and are not a function of the sample sizes alone.
Note: ν is not usually an integer but is rounded down to the nearest integer.
18
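As a small sketch of the degrees-of-freedom formula above (the helper name `welch_df` and the inputs are made up for illustration), the following Python computes ν from the two sample variances and sample sizes.

```python
import math

def welch_df(s1_sq, n1, s2_sq, n2):
    """Estimated d.f. for the unequal-variance t statistic."""
    w1 = s1_sq / n1  # squared standard error of the first sample mean
    w2 = s2_sq / n2  # squared standard error of the second sample mean
    return (w1 + w2) ** 2 / (w1 ** 2 / (n1 - 1) + w2 ** 2 / (n2 - 1))

# Equal variances and equal sample sizes recover n1 + n2 - 2.
nu_equal = welch_df(4.0, 10, 4.0, 10)
# Very different variances pull the d.f. down toward n - 1.
nu_unequal = welch_df(1.0, 10, 16.0, 10)
nu_rounded = math.floor(nu_unequal)  # rounded down, as the slide suggests
```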
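Inference for Small Samples
Case 2: Variances $\sigma_1^2$ and $\sigma_2^2$ unequal.
Approximate 100(1-α)% two-sided CI for $\mu_1 - \mu_2$:
$$\bar{x} - \bar{y} - t_{\nu,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \;\le\; \mu_1 - \mu_2 \;\le\; \bar{x} - \bar{y} + t_{\nu,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
Test statistic for $H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \neq \delta_0$ is
$$t = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
Reject $H_0$ if $|t| > t_{\nu,\alpha/2}$.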
19
Hospitalization Costs: Inference Using Separate Variances
See Example 8.4 on page 280 of course textbook.
20
t.test in S-Plus to test differences in means of hospitalization data, unequal variances
t.test(lhcc,lhct,var.equal=F)
        Welch Modified Two-Sample t-Test

data:  lhcc and lhct
t = 0.6181, df = 54.61, p-value = 0.5391
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3738420  0.7072124
sample estimates:
 mean of x mean of y
  8.250925   8.08424
21
Testing for the Equality of Variances
Section 8.4 covers the classical F test for the equality of two variances and associated confidence intervals. However, this method is not robust against departures from normality. For example, p-values can be off by a factor of 10 if the distributions have shorter or longer tails than the normal.
A robust alternative is Levene’s Test. His test applies the two-sample t-test to the absolute value of the difference of each observation and the group mean:
$$|Y_{1i} - \bar{Y}_1|, \quad i = 1, 2, \ldots, n_1$$
$$|Y_{2i} - \bar{Y}_2|, \quad i = 1, 2, \ldots, n_2$$
This method works well even though these absolute deviations are not independent.
In the Brown-Forsythe test the response is the absolute value of the difference of each observation and the group median.
22
Independent Sample Design: Sample Size Determination Assuming Equal Variances
$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 \neq 0$
$$n_1 = n_2 = n = 2\left[\frac{\sigma (z_{\alpha/2} + z_\beta)}{\delta}\right]^2$$
Here δ is the smallest difference of practical importance that we want to detect.
Because we assume a known variance, this n is a slight underestimate of the sample size.
23
Using S-Plus to compute sample size
normal.sample.size(mean2=.693,power=0.9)
  mean1 sd1 mean2 sd2 delta alpha power n1 n2 prop.n2
      0   1 0.693   1 0.693  0.05   0.9 44 44       1
24
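The S-Plus result above can be cross-checked against the two-sided sample size formula. This Python sketch uses the standard library's `NormalDist`; the function name `two_sample_n` is an illustrative choice, and the inputs (δ = 0.693, σ = 1, α = 0.05, power 0.9) are the ones shown in the S-Plus call.

```python
import math
from statistics import NormalDist

def two_sample_n(delta, sigma=1.0, alpha=0.05, power=0.90):
    """Per-group n for a two-sided z-test with equal, known variances."""
    std_normal = NormalDist()
    z_alpha2 = std_normal.inv_cdf(1 - alpha / 2)  # upper alpha/2 point
    z_beta = std_normal.inv_cdf(power)            # upper beta point
    return math.ceil(2 * (sigma * (z_alpha2 + z_beta) / delta) ** 2)

n = two_sample_n(delta=0.693)
```

Rounding up gives 44 per group, in agreement with the `n1` and `n2` columns of the S-Plus output.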
Matched Pairs Design
Example: See Section 8.3.2, page 283 in course textbook.
25
Statistical Justification of Matched Pairs Design
See Section 8.3.2, page 283 in course textbook.
26
Sample Size Determination
$$n = \left[\frac{\sigma_D (z_\alpha + z_\beta)}{\delta}\right]^2 \quad \text{(One-Sided Test)}$$
$$n = \left[\frac{\sigma_D (z_{\alpha/2} + z_\beta)}{\delta}\right]^2 \quad \text{(Two-Sided Test)}$$
• One needs a planning value for $\sigma_D$
• These formulas come from the one-sample formulas applied to the differences
27
Comparing Variances of Two Populations
• Application arises when comparing instrument precision or uniformities of products.
• The methods discussed in the book are applicable only under the assumption of normality of the data. They are highly sensitive to even modest departures from normality.
• In case of nonnormal data there are nonparametric and other robust methods for comparing data dispersion.
28
Comparing Variances of Two Populations
Independent sample design:
Sample 1: $x_1, x_2, \ldots, x_{n_1}$ is a random sample from $N(\mu_1, \sigma_1^2)$
Sample 2: $y_1, y_2, \ldots, y_{n_2}$ is a random sample from $N(\mu_2, \sigma_2^2)$
$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$$
has an F distribution with $n_1 - 1$ and $n_2 - 1$ d.f., respectively.
$$P\left\{ f_{n_1-1,\,n_2-1,\,1-\alpha/2} \le \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \le f_{n_1-1,\,n_2-1,\,\alpha/2} \right\} = 1 - \alpha$$
$$P\left\{ \frac{1}{f_{n_1-1,\,n_2-1,\,\alpha/2}} \frac{S_1^2}{S_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{1}{f_{n_1-1,\,n_2-1,\,1-\alpha/2}} \frac{S_1^2}{S_2^2} \right\} = 1 - \alpha$$
(1-α)-level CI (two-sided):
$$\frac{1}{f_{n_1-1,\,n_2-1,\,\alpha/2}} \frac{S_1^2}{S_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{1}{f_{n_1-1,\,n_2-1,\,1-\alpha/2}} \frac{S_1^2}{S_2^2}$$
29
An Important Industrial Application: Example 8.8
(See Table 8.8 in course textbook.)
Do the two labs have equal measurement precision?
30
Inferences for Proportions and Count Data
Corresponds to Chapter 9 of
Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León
(University of Tennessee) 1
Inference for Proportions
• Data = {0,1,1,1,0,0,…,1,0}, Bernoulli(p)
• Goal – estimate p, probability of success (or proportion of population with a certain attribute)
• $\hat{p} = x/n$, where x = number of successes in n trials
• $\text{Var}(\hat{p}) = p(1-p)/n = pq/n$
• Variance depends on the mean.
2
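Large Sample Confidence Interval for Proportion
Recall that
$$\frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}} \approx N(0,1) \text{ if } n \text{ is large}$$
($\hat{q} = 1 - \hat{p}$, $n\hat{p} \ge 10$ and $n(1-\hat{p}) \ge 10$)
It follows that:
$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
Confidence interval for p:
$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$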
3
A Better Confidence Interval for Proportion
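Use this probability statement
$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{pq/n}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
Solve for p using quadratic equation
CI for p:
$$\frac{\hat{p} + \dfrac{z^2}{2n} - \sqrt{\dfrac{\hat{p}\hat{q}z^2}{n} + \dfrac{z^4}{4n^2}}}{1 + \dfrac{z^2}{n}} \le p \le \frac{\hat{p} + \dfrac{z^2}{2n} + \sqrt{\dfrac{\hat{p}\hat{q}z^2}{n} + \dfrac{z^4}{4n^2}}}{1 + \dfrac{z^2}{n}}$$
where $z = z_{\alpha/2}$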
4
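To compare the two intervals numerically, here is a minimal Python sketch. The function names are illustrative, and the inputs $\hat{p} = 0.45$, n = 800 are assumed values chosen to match the binomial example that follows; the second interval is the one usually called the score (Wilson) interval.

```python
import math
from statistics import NormalDist

def wald_ci(p_hat, n, alpha=0.05):
    """Simple large-sample interval: p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def score_ci(p_hat, n, alpha=0.05):
    """Interval obtained by solving the quadratic in p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p_hat + z * z / (2 * n)
    half = math.sqrt(p_hat * (1 - p_hat) * z * z / n + z ** 4 / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - half) / denom, (centre + half) / denom

wald = wald_ci(0.45, 800)
score = score_ci(0.45, 800)
```

With n this large the two intervals nearly coincide; the difference matters more for small n or for p near 0 or 1.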
Example
See Example 9.1 on page 301 of the course textbook.
5
Binomial CI
In S-Plus:
> qbinom(.975,800,0.45)
[1] 388
> qbinom(.025,800,0.45)
[1] 332
95% CI for proportion of gun owners is:
332/800 ≤ p ≤ 388/800
0.415 ≤ p ≤ 0.485
6
Sample Size Determination for a Confidence Interval for Proportion
Want (1-α)-level two-sided CI:
$\hat{p} \pm E$ where E is the margin of error. Then $E = z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$.
Solving for n gives
$$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \hat{p}\hat{q}$$
Largest value of pq = (1/2)(1/2) = 1/4, so conservative sample size is:
$$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \frac{1}{4} \quad \text{(Formula 9.5)}$$
7
Example 9.2: Presidential Poll
See Example 9.2 on page 302 of the course textbook.
Threefold increase in precision requires ninefold increase in sample size
8
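The ninefold statement follows directly from Formula 9.5, since n is proportional to $1/E^2$. A short Python sketch, with margins of error of 3% and 1% chosen purely for illustration (they are not taken from the textbook example):

```python
import math
from statistics import NormalDist

def conservative_n(margin, alpha=0.05):
    """Conservative sample size from Formula 9.5, using pq <= 1/4."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z / margin) ** 2 / 4)

n_3pct = conservative_n(0.03)  # margin of error E = 0.03
n_1pct = conservative_n(0.01)  # margin of error E = 0.01, three times tighter
ratio = n_1pct / n_3pct        # close to 9, up to rounding
```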
Large Sample Hypothesis Test on Proportion
$H_0: p = p_0$ vs. $H_1: p \neq p_0$
Best test statistic:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
Acceptance Region: $p_0 \pm cd$, where $c = z_{\alpha/2}$ and $d = (p_0 q_0 / n)^{0.5}$
9
Basketball Problem: z-test
See Example 9.3 on page 303 of the course textbook.
P-value
2.182
10
Exact Binomial Test in S-Plus
> 1-pbinom(299,400,.7)
[1] 0.01553209
[Figure: dbinom(x, 400, 0.7) plotted against x, for x from 240 to 320]
11
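The exact tail probability computed by `1-pbinom(299,400,.7)` can be reproduced by summing binomial probabilities directly. The following Python is a sketch of that sum; the helper name `binom_upper_tail` is hypothetical.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return math.fsum(
        math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1)
    )

# P(X >= 300) when n = 400 and p = 0.7, the quantity 1 - pbinom(299, 400, .7).
p_value = binom_upper_tail(300, 400, 0.7)
```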
Sample Size for Z-Test of Proportion
$H_0: p \le p_0$ vs. $H_1: p > p_0$
Suppose that the power for rejecting $H_0$ must be at least 1-β when the true proportion is $p = p_1 > p_0$.
Let $\delta = p_1 - p_0$. Then
$$n = \left[\frac{z_\alpha\sqrt{p_0 q_0} + z_\beta\sqrt{p_1 q_1}}{\delta}\right]^2$$
Test based on:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
Replace $z_\alpha$ by $z_{\alpha/2}$ for two-sided test sample size.
12
Example 9.4: Pizza Testing
See Example 9.4 on page 305 of the course textbook.
$$n = \left[\frac{z_{\alpha/2}\sqrt{p_0 q_0} + z_\beta\sqrt{p_1 q_1}}{\delta}\right]^2$$
13
Comparing Two Proportions: Independent Sample Design
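If $n_1\hat{p}_1,\ n_1\hat{q}_1,\ n_2\hat{p}_2,\ n_2\hat{q}_2 \ge 10$, then
$$Z = \frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}} \approx N(0,1)$$
Confidence Interval:
$$\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \le p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$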
14
Test for Equality of Proportions (Large n) Independent Sample Design – pooled estimate of p
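$H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$
Test statistic:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{x + y}{n_1 + n_2}$$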
15
Example 9.6 –Comparing Two Leukemia Therapies
See Example 9.6 on page 310 of the course textbook.
16
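A sketch of the pooled z statistic in Python, using the counts that appear in the S-Plus leukemia crosstabs output later in these slides (14 of 21 in row 1 and 38 of 42 in row 2 fall in column 1). Treating the rows as the two therapy groups is an assumption made here for illustration.

```python
import math

def pooled_z(x, n1, y, n2):
    """z statistic for H0: p1 = p2 using the pooled estimate of p."""
    p1_hat, p2_hat = x / n1, y / n2
    p_hat = (x + y) / (n1 + n2)  # pooled proportion under H0
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

z = pooled_z(14, 21, 38, 42)
z_squared = z * z
```

The square of this z equals the chi-square statistic (5.507) that the crosstabs output reports for the same 2x2 table, which is why the two tests agree for two proportions.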
Inference for Small Samples Fisher’s Exact Test
• Calculates the probability of obtaining observed 2x2 table or any more extreme with margins fixed.
• Uses hypergeometric distribution
$$P(X = x \mid N, M, K) = \frac{\dbinom{M}{x}\dbinom{N-M}{K-x}}{\dbinom{N}{K}}$$
17
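The hypergeometric probability is easy to evaluate with exact integer binomial coefficients. In this Python sketch the reading of the symbols is an assumption: N is the table total, M the first row total, K the first column total, and x the count in cell (1,1). The 2x2 table used is the leukemia table from the crosstabs output later in these slides.

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x | N, M, K) with all margins of the 2x2 table fixed."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

def lower_tail(x_obs, N, M, K):
    """Probability of the observed cell count or any smaller one."""
    lowest = max(0, K - (N - M))  # smallest count the margins allow
    return sum(hypergeom_pmf(x, N, M, K) for x in range(lowest, x_obs + 1))

# Tiny check: N = 4, M = 2, K = 2, x = 2 gives 1/6.
tiny = hypergeom_pmf(2, 4, 2, 2)

# Leukemia table: cell (1,1) = 14, row 1 total 21, column 1 total 52, N = 63.
support = range(max(0, 52 - 42), min(21, 52) + 1)
total = sum(hypergeom_pmf(x, 63, 21, 52) for x in support)
p_lower = lower_tail(14, 63, 21, 52)
```

`p_lower` is a one-sided p-value; which tail counts as "more extreme" depends on the alternative hypothesis.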
Inference for Count Data
Data = cell counts = number of observations in each of several (>2) categories, $n_i$, i = 1, …, c, $\sum n_i = n$
Joint distribution of corresponding r.v.’s is multinomial.
Goal – determine if the probabilities of belonging to each of the categories are equal to hypothesized values, $p_{i0}$.
Test statistic: $\chi^2 = \sum (\text{observed} - \text{expected})^2 / \text{expected}$, where observed = $n_i$, expected = $n p_{i0}$
$\chi^2$ has a chi-square distribution when the sample size is large
18
Multinomial Test of Proportions
See Example 9.10 on page 316 of the course textbook.
19
Inferences for Two-Way Count Data
y: Job Satisfaction (columns), x: Annual Salary (rows)

x: Annual Salary     Very Dissatisfied  Slightly Dissatisfied  Slightly Satisfied  Very Satisfied  Row Sum
Less than $10,000                   81                     64                  29              10      184
$10,000-25,000                      73                     79                  35              24      211
$25,000-50,000                      47                     59                  75              58      239
More than $50,000                   14                     23                  84              69      190
Column Sum                         215                    225                 223             161      824

Sampling Model 1: Multinomial Model (Total Sample Size Fixed)
Sample of 824 from a single population that is then cross-classified
The null hypothesis is that X and Y are independent:
$$H_0: p_{ij} = P(X = i, Y = j) = P(X = i)\,P(Y = j) = p_{i\cdot}\,p_{\cdot j} \text{ for all } i, j$$
20
Sampling Model 1 (Total Sample Size Fixed)
Based on Table 9.10 in the course textbook
y: Job Satisfaction (columns), x: Annual Salary (rows)

x: Annual Salary     Very Dissatisfied  Slightly Dissatisfied  Slightly Satisfied  Very Satisfied  Row Sum
Less than $10,000                   81                     64                  29              10      184
$10,000-25,000                      73                     79                  35              24      211
$25,000-50,000                      47                     59                  75              58      239
More than $50,000                   14                     23                  84              69      190
Column Sum                         215                    225                 223             161      824

Estimated Expected Frequency (Cell 1,1):
$$n\,\hat{p}_{1\cdot}\,\hat{p}_{\cdot 1} = 824\left(\frac{184}{824}\right)\left(\frac{215}{824}\right) = \frac{215 \times 184}{824} = 48.01$$
21
Chi-Square Statistics
See Example 9.13, page 324 for instructions on calculating the chi-square statistic.
$$\chi^2 = \sum_{i=1}^{c} \frac{(n_i - e_i)^2}{e_i}$$
22
Chi-Square Test Critical Value
Based on Table A.5, critical values $\chi^2_{\nu,\alpha}$ for the Chi-square Distribution, in the course textbook:
Row ν = 9, column α = .05: 16.919
The d.f. for this $\chi^2$ statistic is (4-1)(4-1) = 9. Since $\chi^2_{9,.05} = 16.919$, the calculated $\chi^2 = 11.989$ is not sufficiently large to reject the hypothesis of independence at the α = .05 level.
23
S-Plus – job satisfaction example
Call:
crosstabs(formula = c(jobsat) ~ c(row(jobsat)) + c(col(jobsat)))
901 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
c(row(jobsat))|c(col(jobsat))
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      | 20    | 24    | 80    | 82    |206    |
       |0.097  |0.12   |0.39   |0.4    |0.23   |
       |0.32   |0.22   |0.25   |0.2    |       |
       |0.022  |0.027  |0.089  |0.091  |       |
-------+-------+-------+-------+-------+-------+
2      | 22    | 38    |104    |125    |289    |
       |0.076  |0.13   |0.36   |0.43   |0.32   |
       |0.35   |0.35   |0.33   |0.3    |       |
       |0.024  |0.042  |0.12   |0.14   |       |
-------+-------+-------+-------+-------+-------+
3      | 13    | 28    | 81    |113    |235    |
       |0.055  |0.12   |0.34   |0.48   |0.26   |
       |0.21   |0.26   |0.25   |0.27   |       |
       |0.014  |0.031  |0.09   |0.13   |       |
-------+-------+-------+-------+-------+-------+
4      |  7    | 18    | 54    | 92    |171    |
       |0.041  |0.11   |0.32   |0.54   |0.19   |
       |0.11   |0.17   |0.17   |0.22   |       |
       |0.0078 |0.02   |0.06   |0.1    |       |
-------+-------+-------+-------+-------+-------+
ColTotl|62     |108    |319    |412    |901    |
       |0.069  |0.12   |0.35   |0.46   |       |
-------+-------+-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 11.98857 d.f.= 9 (p=0.2139542)
Yates' correction not used
>
24
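The chi-square statistic in this output can be recomputed from the cell counts alone. The Python below is a sketch (the function name is illustrative) that forms the expected counts as row total times column total over n and sums (observed - expected)²/expected over the 16 cells of the 901-case table.

```python
# Cell counts N from the crosstabs output (rows 1-4, columns 1-4).
table = [
    [20, 24, 80, 82],
    [22, 38, 104, 125],
    [13, 28, 81, 113],
    [7, 18, 54, 92],
]

def chi_square_independence(obs):
    """Return (chi-square statistic, d.f.) for a two-way table of counts."""
    row_totals = [sum(r) for r in obs]
    col_totals = [sum(c) for c in zip(*obs)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(obs):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

chi2, df = chi_square_independence(table)
```

The result is below the 5% critical value 16.919 quoted for 9 d.f., so independence is not rejected.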
Product Multinomial Model: Row Totals Fixed
(See Table 9.2 in the course textbook.)
Sampling Model 2: Product Multinomial
Total number of patients in each drug group is fixed.
• The null hypothesis is that the probability of column response (success or failure) is the same, regardless of the row population:
$$H_0: P(Y = j \mid X = i) = p_j$$
25
S-Plus – leukemia trial
Call:
crosstabs(formula = c(leuk) ~ c(row(leuk)) + c(col(leuk)))
63 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
c(row(leuk))|c(col(leuk))
       |1      |2      |RowTotl|
-------+-------+-------+-------+
1      |14     | 7     |21     |
       |0.67   |0.33   |0.33   |
       |0.27   |0.64   |       |
       |0.22   |0.11   |       |
-------+-------+-------+-------+
2      |38     | 4     |42     |
       |0.9    |0.095  |0.67   |
       |0.73   |0.36   |       |
       |0.6    |0.063  |       |
-------+-------+-------+-------+
ColTotl|52     |11     |63     |
       |0.83   |0.17   |       |
-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 5.506993 d.f.= 1 (p=0.01894058)
Yates' correction not used
Some expected values are less than 5, don't trust stated p-value
>
26
Remarks About Chi-Square Test
• The distribution of the chi-square statistic under the null hypothesis is approximately chi-square only when the sample sizes are large.
  – The rule of thumb is that all expected cell counts should be greater than 1, and
  – No more than 1/5th of the expected cell counts should be less than 5.
• Combine sparse cells (having small expected cell counts) with adjacent cells. Unfortunately, this has the drawback of losing some information.
• Never stop with the chi-square test. Look at cells with large values of (O-E), as in the job satisfaction example.
27
Odds Ratio as a Measure of Association for a 2x2 Table
Sampling Model I: Multinomial
$$\psi = \frac{p_{11}/p_{12}}{p_{21}/p_{22}}$$
The numerator is the odds of the column 1 outcome vs. the column 2 outcome for row 1, and the denominator is the same odds for row 2, hence the name “odds ratio”
28
Odds Ratio as a Measure of Association for a 2x2 Table
Sampling Model II: Product Multinomial
$$\psi = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}$$
If the two column outcomes are labeled as “success” and “failure,” then ψ is the odds of success for the row 1 population vs. the odds of success for the row 2 population
29
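As a numerical illustration, the sample odds ratio replaces each probability by its observed proportion, which reduces to the cross-product ratio of the four counts. The sketch below uses the leukemia 2x2 counts from the crosstabs output earlier in these slides; the function name is illustrative.

```python
def odds_ratio(table):
    """Sample odds ratio for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    # Odds of column 1 vs. column 2 in row 1, divided by the same odds in row 2.
    return (a / b) / (c / d)

leuk = [[14, 7], [38, 4]]
psi_hat = odds_ratio(leuk)
cross_product = (14 * 4) / (7 * 38)  # equivalent cross-product form ad/(bc)
```

A value of ψ equal to 1 corresponds to no association between rows and columns.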
Inference in a Nutshell
Slides prepared by Elizabeth Newton (MIT)
Corresponds to Chapters 6-9 of Tamhane and Dunlop
1
Outline
Chapter 6: Basic Concepts of Inference
  Mean Square Error
  Confidence Interval
  Hypothesis Test
Chapter 7: Inference for Single Samples
  Mean - Large Sample - z
  Mean - Small Sample – t
  Variance – Chi-square
  Prediction and Tolerance Intervals
2
Outline (continued)
Chapter 8 – Inference for Two Samples
  Comparing Means, Independent, Large Sample – z
  Comparing Means, Independent, Small Sample
    Variances equal – t
    Variances not equal – t with df from SEM
  Matched Pairs – test differences – t
  Comparing Variances – F
3
Outline (continued)
Chapter 9 - Inferences for Proportions and Count Data
  Proportion, Large sample – z
  Proportion, Small sample – binomial
  Comparing 2 Proportions, large – z or Chi-square
  Comparing 2 Proportions, small – Fisher’s Exact
  Matched Pairs – McNemar’s Test
  One way Count – Chi square
  Two-way Count – Chi square
  Goodness of Fit – Chi square
  Odds ratio - z
4
Confidence Interval on the Mean
û ± cd is a two-sided CI for mean u, where:
û = estimator of u = sample mean
d = standard deviation of û
c = critical constant, for instance $z_{\alpha/2}$ or $t_{n-1,\alpha/2}$
$z_{\alpha/2}$ is such that $P(Z > z_{\alpha/2}) = \alpha/2$
$z_{\alpha/2} = \Phi^{-1}(1-\alpha/2)$ = qnorm(1-α/2) = -qnorm(α/2)
If α = 0.05 then $z_{\alpha/2}$ = 1.96.
If we draw many samples and construct 95% CIs from them, about 95% would contain the true value of u.
5
Confidence Intervals
(See Figure 6.2 on page 205 of the course textbook.)
6
Hypothesis Tests
• H0: null hypothesis, no change, no effect, for instance u=u0
• H1: alternative hypothesis, u≠u0
• α = P(Type I error) = P(reject H0 | H0 true)
• β = P(Type II error) = P(accept H0 | H0 false)
• Power = function of u = P(reject H0 | u)
• A two-sided hypothesis test rejects H0 when |û-u0|/d > c ↔ |û-u0| > cd ↔ û<u0-cd or û>u0+cd
7
Level α Tests
(See Table 7.1 on page 240 of the course textbook.)
8
P-Values
• P-Value is the probability of obtaining the observed result or one more extreme
• Two-sided P-Value = P(|Z| > |û-u0|/d) = 2[1-Φ(|û-u0|/d)] = 2*(1-pnorm(abs(û-u0)/d)) in S-Plus
9
P-Values
(See Table 7.2 on page 241 of the course textbook.)
10
Power Function
Power is the probability of rejecting H0 for a given value of u.
π(u) = P(û<u0-cd | u) + P(û>u0+cd |u)
= Φ[-c+(u0-u)/d] + Φ[-c+(u-u0)/d]
11
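The power function above is straightforward to evaluate. This Python sketch assumes a two-sided z-test with c = $z_{\alpha/2}$; the function name `power` and the numbers u0 = 0, d = 1 are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(u, u0, d, alpha=0.05):
    """Probability of rejecting H0: u = u0 when the true mean is u."""
    c = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(-c + (u0 - u) / d) + STD_NORMAL.cdf(-c + (u - u0) / d)

at_null = power(0.0, 0.0, 1.0)      # equals alpha when u = u0
shifted_up = power(3.0, 0.0, 1.0)   # true mean three standard errors above u0
shifted_down = power(-3.0, 0.0, 1.0)
```

At u = u0 the power equals α, and it rises symmetrically toward 1 as u moves away from u0 in either direction.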
Power
(See Figure 7.3 on page 245 of the course textbook.)
12
Reject H0
(1) If u0 falls outside interval û ± cd.
(2) if û falls outside interval u0 ± cd.
(3) if p-value is small.
13
Simple Linear Regression and Correlation.
Corresponds to Chapter 10
Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT) with some slides by
Jacqueline Telford (Johns Hopkins University)
1
Simple linear regression analysis estimates the relationship between two variables.
One of the variables is regarded as a response or outcome variable (y).
The other variable is regarded as predictor or explanatory variable (x).
Sometimes it is not clear which of two variables should be the response (e.g. height and weight). In this case, correlation analysis may be used.
Simple linear regression estimates relationships of the form y = a + bx.
2
Scatter plot of ozone concentration by temperature
[Figure: scatter plot of air$ozone (vertical axis, 1 to 5) against air$temperature (horizontal axis, 60 to 90)]
3
A Probabilistic Model for Simple Linear Regression
Let x1, x2,..., xn be specific settings of the predictor variable.
Let y1, y2,..., yn be the corresponding values of the response variable.
Assume that yi is the observed value of a random variable (r.v.) Yi, which depends on xi according to the following model:
Yi = β0 + β1 xi + εi (i = 1, 2, …, n)
Here εi is the random error with E(εi)=0 and Var(εi)=σ2 .
Thus, E(Yi) = µi = β0 + β1 xi (true regression line).
The xi’s usually are assumed to be fixed (not random variables).
4
A Probabilistic Model for Simple Linear Regression
See Figure 10.1, p. 348 and also see page 348 for the four assumptions of a simple linear regression model.
5
Least Square Line Mathematics (invented by Gauss)
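Find the line, i.e., values of β0 and β1 that minimize the sum of the squared deviations:
$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$$
How?
Solve for values of β0 and β1 for which
$$\frac{\partial Q}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial Q}{\partial \beta_1} = 0$$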
6
Finding Regression Coefficients
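$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]$$
$$\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)]$$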
7
Normal Equations
$$\beta_0 n + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
8
Solution to Normal Equations
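$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Note that the least squares line goes through $(\bar{x}, \bar{y})$.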
9
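The closed-form solution translates directly into code. The Python below is a minimal sketch on a small made-up data set (not the ozone data); it also checks the two normal equations through the residuals.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (b0, b1) from the solution of the normal equations."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
b0, b1 = least_squares_line(x, y)

# The normal equations say the residuals sum to zero and are orthogonal to x.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sum_resid = sum(resid)
sum_x_resid = sum(xi * ei for xi, ei in zip(x, resid))
```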
Fitted regression line
[Figure: scatter plot of air$ozone against air$temperature with the fitted least squares line]
10
Fitted values of y: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad i = 1, 2, \ldots, n$
Residuals: $e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i), \quad i = 1, 2, \ldots, n$

temperature  ozone  fitted  resid
         67   3.45    2.49   0.96
         72   3.30    2.84   0.46
         74   2.29    2.98  -0.69
         62   2.62    2.14   0.48
         65   2.84    2.35   0.50
11
Matrix Approach to Simple Linear Regression (what your regression package is really doing)
The model: y=Xβ + ε
y is n by 1
X is n by 2
β is 2 by 1
ε is n by 1
12
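Y=Xβ + ε
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{bmatrix}$$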
13
Solution of linear equations
In linear algebra:
Find x which solves Ax=b.
In regression analysis:
Find β which solves Xβ=y
Why can’t we do this?
14
Least Squares
Q = (y-Xβ)’(y-Xβ)
  = y’y – β’X’y – y’Xβ + β’X’Xβ
  = y’y – 2β’X’y + β’X’Xβ
∂Q/∂β = -2X’y + 2X’Xβ
∂Q/∂β = 0 → X’y = X’Xb, where b = $\hat{\beta}$
15
Least Squares continued
For simple linear regression:
$$X'X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}, \qquad X'y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
16
Least Squares continued
X’Xb = X’y
$$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} b = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
The Normal Equations as before
17
Least Squares continued
X’Xb = X’y
b = (X’X)^{-1}X’y (if X has linearly independent columns)
Solution by QR decomposition:
X = QR, Q orthonormal, R upper triangular and invertible
b = (X’X)^{-1}X’y = (R’Q’QR)^{-1}R’Q’y = (R’R)^{-1}R’Q’y = R^{-1}Q’y
18
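To make the QR route concrete, the sketch below factors the n by 2 design matrix of a simple linear regression by classical Gram-Schmidt and then solves Rb = Q’y by back-substitution. Gram-Schmidt is used here only because it is easy to read; it is an assumption of this sketch, not a claim about what any particular regression package does. The data are the same made-up values as in the earlier least squares illustration.

```python
import math

def qr_fit(x, y):
    """Least squares (b0, b1) via X = QR for X = [1, x]."""
    n = len(x)
    # First column of X is all ones; normalise it to get q1.
    r11 = math.sqrt(n)
    q1 = [1 / r11] * n
    # Remove from the x column its projection on q1, then normalise to get q2.
    r12 = sum(q * xi for q, xi in zip(q1, x))
    u2 = [xi - r12 * q for q, xi in zip(q1, x)]
    r22 = math.sqrt(sum(u * u for u in u2))
    q2 = [u / r22 for u in u2]
    # Form Q'y, then back-substitute in the upper triangular system R b = Q'y.
    c1 = sum(q * yi for q, yi in zip(q1, y))
    c2 = sum(q * yi for q, yi in zip(q2, y))
    b1 = c2 / r22
    b0 = (c1 - r12 * b1) / r11
    return b0, b1

b0, b1 = qr_fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
```

The answer agrees with the closed form $S_{xy}/S_{xx}$ without ever forming or inverting X’X.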
The Hat Matrix
b = (X’X)^{-1}X’y
$\hat{y}$ = Xb = X(X’X)^{-1}X’y = Hy
H (n by n) is the Hat matrix
Takes y to $\hat{y}$
H is symmetric and idempotent: HH=H
Diagonal elements of the hat matrix are useful in detecting influential observations.
19
Expected value of b
E(b) = E[(X’X)^{-1}X’y]
     = E[(X’X)^{-1}X’(Xβ+ε)]
     = E[(X’X)^{-1}X’Xβ + (X’X)^{-1}X’ε]
     = β
Hence b is an unbiased estimator of β.
20
Covariance of b
The covariance matrix of y is σ²I
b = (X’X)^{-1}X’y = Ay (where A is k by n)
Cov(b) = A Var(y) A’ = A σ²I A’ = σ²AA’
       = σ² (X’X)^{-1}X’X(X’X)^{-1}
       = σ² (X’X)^{-1}
21
Covariance of b
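For simple linear regression,
$$\sigma^2 (X'X)^{-1} = \sigma^2 \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}^{-1} = \frac{\sigma^2}{n\sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}$$
$$SD(b_0) = \sigma\sqrt{\frac{\sum x_i^2}{n S_{xx}}}; \qquad SD(b_1) = \frac{\sigma}{\sqrt{S_{xx}}}$$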
22
Estimation of σ2
$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$$
Note: The denominator is n - 2 since two parameters are being estimated (β0 and β1).
E[S²] = σ² (See proof in Seber, Linear Regression Analysis)
23
Statistical Inference for βo and β1
$$SE(\hat{\beta}_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}} \quad \text{and} \quad SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}$$
For ozone example:
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -2.2260     0.4614 -4.8243   0.0000
temperature  0.0704     0.0059 11.9511   0.0000
24
Sums of Squares
Sum of Squares Total (SST): $\sum_{i=1}^{n} (y_i - \bar{y})^2$
Sum of Squares for Error (SSE): $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Sum of Squares for Regression (SSR): $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
25
Geometry of the Sums of Squares
$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
SST = SSR + SSE, see derivation on p. 354
26
J. Telford
Coefficient of Determination (R-squared)
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
= proportion of the variance in y that is accounted for by the regression on x
= square of correlation between y and $\hat{y}$
For ozone example:
Multiple R-Squared: 0.5672
27
Analysis of Variance (ANOVA)
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
$$F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} = t^2$$
For ozone example:
summary.aov(tmp)
             Df Sum of Sq  Mean Sq  F Value Pr(F)
temperature   1  49.46178 49.46178 142.8282     0
Residuals   109  37.74698  0.34630
28
Regression Diagnostics
Residual vs. observation number
[Figure: resid(ozone.lm) plotted against observation number (0 to 100)]
29
Regression Diagnostics
Residual vs. fitted value
[Figure: resid(ozone.lm) plotted against fitted(ozone.lm)]
30
Regression Diagnostics
Residual vs. x
[Figure: resid(ozone.lm) plotted against air$temperature]
31
Regression Diagnostics
QQ plot of residuals
[Figure: resid(ozone.lm) plotted against Quantiles of Standard Normal]
32
Hat Matrix Diagonals
[Figure: hat(model.matrix(ozone.lm)) plotted against observation number (0 to 100)]
33
Some useful S-Plus commands
my.lm <- lm(y~x, data=mydata, na.action=na.omit)
    includes intercept term by default
summary(my.lm)
    gives coefficients, correlation of coefficients, R-square, F-statistic, residual standard error
summary.aov(my.lm)
    gives ANOVA table
resid(my.lm)
    gives residuals
fitted(my.lm)
    gives fitted values
model.matrix(my.lm)
    gives model matrix
34
Multiple Linear Regression
Corresponds to Chapter 11 of Tamhane & Dunlop
Slides prepared by Elizabeth Newton (MIT) with some slides by Roy Welsch (MIT).
Linear Regression
Review:
Linear Model: y = Xβ + ε
y ~ N(Xβ, σ²I)
Least squares: $\hat{\beta}$ = (X’X)^{-1}X’y
$\hat{y}$ = fitted value of y = X$\hat{\beta}$ = X(X’X)^{-1}X’y = Hy
e = error = residuals = y - $\hat{y}$ = y - Hy = (I-H)y
2
Properties of the Hat matrix
• Symmetric: H’=H
• Idempotent: HH=H
• Trace(H) = sum(diag(H)) = k+1 = number of columns in the X matrix
• 1’H = vector of 1’s (hence y and $\hat{y}$ have same mean)
• 1’(I-H) = vector of 0’s (hence mean of residuals is 0).
• What is H when X is only a column of 1’s?
3
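These properties can be checked numerically. For simple linear regression (an intercept plus one predictor) the hat matrix has the closed form $h_{ij} = 1/n + (x_i - \bar{x})(x_j - \bar{x})/S_{xx}$, which the Python sketch below uses; the x values are made up, with one point deliberately far from the rest.

```python
def hat_matrix_simple(x):
    """Hat matrix for the model with an intercept and a single predictor x."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [[1 / n + (xi - x_bar) * (xj - x_bar) / s_xx for xj in x] for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical predictor values
H = hat_matrix_simple(x)
n = len(x)

trace = sum(H[i][i] for i in range(n))  # should equal 2, the columns of X
HH = [[sum(H[i][k] * H[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
col_sums = [sum(H[i][j] for i in range(n)) for j in range(n)]  # entries of 1'H
diag = [H[i][i] for i in range(n)]
```

The largest diagonal element belongs to the point at x = 10, which is the sense in which the diagonal flags potentially influential observations.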
Variance-Covariance Matrices
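$$\text{Cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1} \quad \text{(as we saw last time)}$$
$$\text{Cov}(\hat{y}) = \text{Cov}(Hy) = H\,\text{Cov}(y)\,H = H \sigma^2 I H = \sigma^2 H$$
$$\text{Cov}(e) = \text{Cov}((I-H)y) = (I-H)\,\text{Cov}(y)\,(I-H) = (I-H)\sigma^2 I (I-H) = \sigma^2 (I-H)$$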
4
Confidence and Prediction Intervals
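Variance of mean response at $x_0$:
$$\text{Var}(\hat{y}_0) = \text{Var}(x_0'\hat{\beta}) = \sigma^2 x_0'(X'X)^{-1}x_0 = \sigma^2 v_0$$
Variance of new observation at $x_0$, $\hat{y}_0 + \varepsilon_0$:
$$\text{Var}(\hat{y}_0 + \varepsilon_0) = \text{Var}(\hat{y}_0) + \text{Var}(\varepsilon_0) = \sigma^2 x_0'(X'X)^{-1}x_0 + \sigma^2 = \sigma^2 (x_0'(X'X)^{-1}x_0 + 1) = \sigma^2 (v_0 + 1)$$
An estimate of σ² is s² = MSE = y’(I-H)y /(n-k-1)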
5
Confidence and Prediction Intervals
(1-α) Confidence Interval on Mean Response at x0:
$\hat{y}_0 \pm cd$, where $c = t_{n-(k+1),\,\alpha/2}$ and $d = s\sqrt{v_0}$
(1-α) Prediction Interval on New Observation at x0:
$\hat{y}_0 \pm cd$, where $c = t_{n-(k+1),\,\alpha/2}$ and $d = s\sqrt{v_0 + 1}$
6
Sums of Squares
Sum of Squares Total (SST): $\sum_{i=1}^{n} (y_i - \bar{y})^2$
Sum of Squares for Error (SSE): $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Sum of Squares for Regression (SSR): $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
SSR = SST - SSE
7
Overall Significance Test
To see if there is any linear relationship we test:
H0: β1 = β2 = . . . = βk = 0
H1: βj ≠ 0 for some j.
Compute
$$SST = \sum (y_i - \bar{y})^2, \qquad SSE = \sum (y_i - \hat{y}_i)^2, \qquad SSR = SST - SSE$$
The F statistic is:
$$F = \frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE}$$
with F based on k and (n − k − 1) degrees of freedom.
Reject H0 when F exceeds $F_{k,\,n-k-1}(\alpha)$.
8
Sequential Sums of Squares
SSR(x1) = SST - SSE(x1)
SSR(x2|x1) = SSR(x1,x2) - SSR(x1) =SSE(x1) - SSE(x1,x2)
SSR(x3|x1 x2) = SSE(x1,x2) - SSE(x1,x2,x3)
9
ANOVA Table
Type 1 (sequential) sums of squares

Source of Variation   SS               df
Regression            SSR(x1,x2,x3)    3
  x1                  SSR(x1)          1
  x2|x1               SSR(x2|x1)       1
  x3|x2 x1            SSR(x3|x2,x1)    1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1
10
ANOVA Table
Type 3 (partial) sums of squares

Source of Variation   SS               df
Regression            SSR(x1,x2,x3)    3
  x1|x2,x3            SSR(x1|x2,x3)    1
  x2|x1,x3            SSR(x2|x1,x3)    1
  x3|x1,x2            SSR(x3|x1,x2)    1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1
11
12
Scatter plot Matrix of the Air Data Set in S-Plus
pairs(air)
[Figure: scatter plot matrix of ozone, radiation, temperature, and wind]
air.lm<-lm(y~x1+x2+x3)

> summary(air.lm)$coef
                   Value   Std. Error    t value      Pr(>|t|)
(Intercept) -0.297329634 0.5552138923 -0.5355227 5.933998e-001
         x1  0.002205541 0.0005584658  3.9492854 1.407070e-004
         x2  0.050044325 0.0061061612  8.1957098 5.848655e-013
         x3 -0.076021950 0.0157548357 -4.8253090 4.665124e-006

> summary.aov(air.lm)
           Df Sum of Sq  Mean Sq  F Value         Pr(F)
       x1   1  15.53144 15.53144  59.6761 6.000000e-012
       x2   1  37.76939 37.76939 145.1204 0.000000e+000
       x3   1   6.05985  6.05985  23.2836 4.665124e-006
Residuals 107  27.84808  0.26026

> summary.aov(air.lm,ssType=3)
Type III Sum of Squares
           Df Sum of Sq  Mean Sq  F Value        Pr(F)
       x1   1   4.05928  4.05928 15.59685 0.0001407070
       x2   1  17.48174 17.48174 67.16966 0.0000000000
       x3   1   6.05985  6.05985 23.28361 0.0000046651
Residuals 107  27.84808  0.26026
>
13
Polynomial Models
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$$
Problems:
  Powers of x tend to be large in magnitude
  Powers of x tend to be highly correlated
Solutions:
  Centering and scaling of x variables
  Orthogonal polynomials (poly(x,k) in S-Plus, see Seber for methods of generating)
14
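The correlation problem and the effect of centering can be seen on a toy example. In the Python sketch below the weights are an evenly spaced made-up grid (not the auto.stats data), and `corr` is a plain Pearson correlation written out for self-containment.

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)

wt = [2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0]  # hypothetical weights
raw = corr(wt, [w ** 2 for w in wt])  # x and x^2 are nearly collinear

wt_mean = sum(wt) / len(wt)
centred = [w - wt_mean for w in wt]
after_centring = corr(centred, [c ** 2 for c in centred])
```

The correlation drops to exactly zero here only because the grid is symmetric about its mean; with real data centering reduces the correlation between powers but does not generally remove it.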
15
Plot of mpg vs. weight for 74 autos (S-Plus dataset auto.stats)
[Figure: scatter plot of mpg (vertical axis, 15 to 40) against wt (horizontal axis, 2000 to 4500)]
summary(lm(mpg~wt+wt^2+wt^3))

Call: lm(formula = mpg ~ wt + wt^2 + wt^3)
Residuals:
    Min     1Q  Median    3Q   Max
 -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 68.1797    21.4515  3.1783   0.0022
         wt -0.0309     0.0214 -1.4430   0.1535
    I(wt^2)  0.0000     0.0000  0.9586   0.3410
    I(wt^3)  0.0000     0.0000 -0.7449   0.4588

Residual standard error: 3.209 on 70 degrees of freedom
Multiple R-Squared: 0.705
F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0

Correlation of Coefficients:
        (Intercept)      wt I(wt^2)
     wt     -0.9958
I(wt^2)      0.9841 -0.9961
I(wt^3)     -0.9659  0.9846 -0.9961
16
wts<-(wt-mean(wt))/sqrt(var(wt))

summary(lm(mpg~wts+wts^2+wts^3))

Call: lm(formula = mpg ~ wts + wts^2 + wts^3)
Residuals:
    Min     1Q  Median    3Q   Max
 -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 20.2331     0.5676 35.6470   0.0000
        wts -4.4466     0.7465 -5.9567   0.0000
   I(wts^2)  1.1241     0.4682  2.4007   0.0190
   I(wts^3) -0.2521     0.3385 -0.7449   0.4588

Residual standard error: 3.209 on 70 degrees of freedom
Multiple R-Squared: 0.705
F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0

Correlation of Coefficients:
         (Intercept)     wts I(wts^2)
     wts     -0.2800
I(wts^2)     -0.7490  0.4558
I(wts^3)      0.3925 -0.8596  -0.6123
17
Orthogonal Polynomials
Generation is similar to Gram-Schmidt orthogonalization (see Strang, Linear Algebra)
Resulting vectors are orthonormal: X’X = I
Hence (X’X)^{-1} = I and coefficients = (X’X)^{-1}X’y = X’y
Addition of higher degree term does not affect coefficients for lower degree terms
Correlation of coefficients = I
SE of coefficients = s = sqrt(MSE)
18
summary(lm(mpg~poly(wt,3)))

Call: lm(formula = mpg ~ poly(wt, 3))
Residuals:
    Min     1Q  Median    3Q   Max
 -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:
                Value Std. Error  t value Pr(>|t|)
 (Intercept)  21.2973     0.3730  57.0912   0.0000
poly(wt, 3)1 -40.6769     3.2090 -12.6758   0.0000
poly(wt, 3)2   7.8926     3.2090   2.4595   0.0164
poly(wt, 3)3  -2.3904     3.2090  -0.7449   0.4588

Residual standard error: 3.209 on 70 degrees of freedom
Multiple R-Squared: 0.705
F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0

Correlation of Coefficients:
             (Intercept) poly(wt, 3)1 poly(wt, 3)2
poly(wt, 3)1           0
poly(wt, 3)2           0            0
poly(wt, 3)3           0            0            0
19
20
Plot of mpg by weight with fitted regression line
[Figure: scatter plot of mpg (vertical axis, 15 to 40) against wt (horizontal axis, 2000 to 4500) with the fitted curve]
Indicator Variables
• Sometimes we might want to fit a model with a categorical variable as a predictor. For instance, automobile price as a function of where the car is made (Germany, Japan, USA).
• If there are c categories, we need c-1 indicator (0,1) variables as predictors. For instance j=1 if car is made in Japan, 0 otherwise, u=1 if car is made in USA, 0 otherwise.
• If there are just 2 categories and no other predictors, we could just do a t-test for difference in means.
21
22
Boxplots of price by country for S-Plus dataset cu.summary
[Figure: boxplots of price (vertical axis, 10000 to 40000) for cntry = Germany, Japan, USA]
23
Histogram of automobile prices for S-Plus dataset cu.summary
[Figure: histogram of price (horizontal axis, 10000 to 40000; counts from 0 to 40)]
24
Histogram of log of automobile prices for S-Plus dataset cu.summary
9.0 9.5 10.0 10.5
05
1015
20
log(price)
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
summary(lm(price~u+j))

Call: lm(formula = price ~ u + j)
Residuals:
    Min    1Q Median   3Q   Max
 -15746 -4586  -2071 2374 22495

Coefficients:
                  Value Std. Error t value Pr(>|t|)
(Intercept)  25741.3636  2282.2729 11.2788   0.0000
          u -10520.5473  2525.4871 -4.1657   0.0001
          j -10236.0088  2656.5095 -3.8532   0.0002

Residual standard error: 7569 on 88 degrees of freedom
Multiple R-Squared: 0.1723
F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is 0.0002435

Correlation of Coefficients:
  (Intercept)      u
u     -0.9037
j     -0.8591 0.7764
summary(lm(price~u+g))

Call: lm(formula = price ~ u + g)
Residuals:
    Min    1Q Median   3Q   Max
 -15746 -4586  -2071 2374 22495

Coefficients:
                 Value Std. Error t value Pr(>|t|)
(Intercept) 15505.3548  1359.5121 11.4051   0.0000
          u  -284.5385  1737.1208 -0.1638   0.8703
          g 10236.0088  2656.5095  3.8532   0.0002

Residual standard error: 7569 on 88 degrees of freedom
Multiple R-Squared: 0.1723
F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is 0.0002435

Correlation of Coefficients:
  (Intercept)      u
u     -0.7826
g     -0.5118 0.4005
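The two fits of price on country indicators differ only in which country is the baseline (Germany when u and j are used, Japan when u and g are used). A small arithmetic check, written in Python rather than S-PLUS and using only the coefficients printed in the two summaries, suggests that both parameterizations imply the same three group means. The dictionary and function names are illustrative.

```python
# Coefficients copied from the two S-PLUS summaries.
# Fit 1: price ~ u + j  (baseline category: Germany)
fit_uj = {"intercept": 25741.3636, "u": -10520.5473, "j": -10236.0088}
# Fit 2: price ~ u + g  (baseline category: Japan)
fit_ug = {"intercept": 15505.3548, "u": -284.5385, "g": 10236.0088}


def group_means(intercept, effects):
    """Each non-baseline mean is the intercept plus that group's coefficient."""
    return {name: intercept + coef for name, coef in effects.items()}


means_1 = {"Germany": fit_uj["intercept"]}
means_1.update(group_means(fit_uj["intercept"],
                           {"USA": fit_uj["u"], "Japan": fit_uj["j"]}))

means_2 = {"Japan": fit_ug["intercept"]}
means_2.update(group_means(fit_ug["intercept"],
                           {"USA": fit_ug["u"], "Germany": fit_ug["g"]}))

for country in ("Germany", "Japan", "USA"):
    print(country, round(means_1[country], 2), round(means_2[country], 2))
```

The residuals, residual standard error, R-squared and F-statistic are identical in the two summaries for the same reason: the fitted values do not depend on the choice of baseline.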
Regression Diagnostics
Goal: identify remarkable observations and unremarkable predictors.

Problems with observations:
• Outliers
• Influential observations

Problems with predictors:
• A predictor may not add much to model.
• A predictor may be too similar to another predictor (collinearity).
• Predictors may have been left out.
[Figure: Plot of standardized residuals vs. fitted values for air dataset, points labeled by observation number (1 to 111). x-axis: fitted value; y-axis: standardized residual.]
[Figure: Plot of residual vs. fit for air data set with all interaction terms. x-axis: fitted(tmp); y-axis: resid(tmp).]
[Figure: Plot of residual vs. fit for air model with x3*x4 interaction. x-axis: fitted(tmp); y-axis: resid(tmp).]
Call: lm(formula = air[, 1] ~ air[, 2] + air[, 3] + air[, 4] + air[, 3] * air[, 4])
Residuals:
    Min      1Q   Median     3Q  Max
 -1.088 -0.3542 -0.07242 0.3436 1.47

Coefficients:
                     Value Std. Error t value Pr(>|t|)
      (Intercept)  -3.6465     1.1684 -3.1209   0.0023
         air[, 2]   0.0023     0.0005  4.3223   0.0000
         air[, 3]   0.0920     0.0143  6.4435   0.0000
         air[, 4]   0.2523     0.1031  2.4478   0.0160
air[, 3]:air[, 4]  -0.0042     0.0013 -3.2201   0.0017

Residual standard error: 0.4892 on 106 degrees of freedom
Multiple R-Squared: 0.7091
F-statistic: 64.61 on 4 and 106 degrees of freedom, the p-value is 0

Correlation of Coefficients:
                  (Intercept) air[, 2] air[, 3] air[, 4]
         air[, 2]     -0.0361
         air[, 3]     -0.9880  -0.0495
         air[, 4]     -0.9268   0.0620   0.9313
air[, 3]:air[, 4]      0.8902  -0.0661  -0.9119  -0.9892
Remarkable Observations

Residuals are the key

Standardized residuals:
$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1-h_{ii}}}$
Outlier if $|e_i^*| > 2$

Hat matrix diagonals, $h_{ii}$:
Influential if $h_{ii} > 2(k+1)/n$

Cook’s Distance:
$d_i = \left(\frac{e_i^{*2}}{k+1}\right)\left(\frac{h_{ii}}{1-h_{ii}}\right)$
Influential if $d_i > 1$
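The three diagnostics can be sketched for the simplest case, a straight-line fit with k = 1 predictor, where the hat diagonal has the closed form h_ii = 1/n + (x_i - x̄)²/S_xx. This is a Python illustration rather than S-PLUS; the data and the function name are made up for the example.

```python
import math


def simple_regression_diagnostics(x, y):
    """Return (standardized residuals, hat diagonals, Cook's distances)
    for the simple linear regression y = b0 + b1*x (k = 1 predictor)."""
    n, k = len(x), 1
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - k - 1))  # s = sqrt(MSE)
    hat = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]    # h_ii
    std_resid = [e / (s * math.sqrt(1.0 - h)) for e, h in zip(resid, hat)]
    cook = [(r * r / (k + 1)) * h / (1.0 - h) for r, h in zip(std_resid, hat)]
    return std_resid, hat, cook


# Hypothetical data: the last point is far from the others in x and off the line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 9.0]
std_resid, hat, cook = simple_regression_diagnostics(x, y)

n, k = len(x), 1
for i, (r, h, d) in enumerate(zip(std_resid, hat, cook), start=1):
    flags = []
    if abs(r) > 2:
        flags.append("outlier")
    if h > 2 * (k + 1) / n:
        flags.append("high leverage")
    if d > 1:
        flags.append("influential (Cook)")
    print(i, round(r, 2), round(h, 3), round(d, 3), ", ".join(flags))
```

The hat diagonals always sum to k + 1, which is why 2(k+1)/n (twice the average leverage) is used as the cutoff.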
[Figure: Plot of standardized residual vs. observation number for air dataset, points labeled by observation number. x-axis: observation number; y-axis: standardized residual.]
[Figure: Hat matrix diagonals, points labeled by observation number. x-axis: observation number; y-axis: hat matrix diagonals.]
[Figure: Plot of wind vs. ozone, points labeled by observation number. x-axis: wind; y-axis: ozone.]
[Figure: Cook’s Distance vs. observation number; labeled points: 17, 30, 77. y-axis: Cook's Distance.]
[Figure: Plot of ozone vs. wind including fitted regression lines with and without observation 30 (simple linear regression). x-axis: wind; y-axis: ozone.]
Remedies for Outliers

• Nothing?
• Data Transformation?
• Remove outliers?
• Robust Regression (weighted least squares): b = (X’WX)^{-1}X’Wy
• Minimize median absolute deviation
Collinearity

High correlation among the predictors can cause problems with least squares estimates (wrong signs, low t-values, unexpected results).

If predictors are centered and scaled to unit length, then X’X is the correlation matrix.

Diagonal elements of inverse of correlation matrix are called VIF’s (variance inflation factors).

$\mathrm{VIF}_j = \frac{1}{1-R_j^2}$, where $R_j^2$ is the coefficient of determination for the regression of the jth predictor on the remaining predictors
When $R_j^2 = .90$, VIF is 10 and caution is advised. (Some authors say VIF = 5.) A large VIF indicates there is redundant information in the explanatory variables.

Why is this called the variance inflation factor? We can show that

$\mathrm{Var}(\hat\beta_j) = \left(\frac{1}{1-R_j^2}\right)\frac{\sigma^2}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2} = \mathrm{VIF}_j\,\left[\mathrm{Var}(\hat\beta_j)\ \text{in simple regression}\right]$

Thus VIF_j represents the variance inflation caused by adding all the variables other than x_j to the model.

R Welsch
Remedies for collinearity

1. Identify and eliminate redundant variables (large literature on this).

2. Modified regression techniques

a. ridge regression, b = (X’X+cI)^{-1}X’y

3. Regress on orthogonal linear combinations of the explanatory variables

a. principal components regression

4. Careful variable selection

R Welsch
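The ridge estimate b = (X’X+cI)^{-1}X’y can be sketched for two predictors, where the 2 x 2 system is solved in closed form. This Python illustration uses made-up, nearly collinear data; the predictors are centered but, for brevity, not scaled to unit length, and the function names are assumptions of the example.

```python
def ridge_two_predictors(x1, x2, y, c):
    """Solve (X'X + cI) b = X'y for two centered predictors (no intercept)."""
    a11 = sum(u * u for u in x1) + c
    a22 = sum(v * v for v in x2) + c
    a12 = sum(u * v for u, v in zip(x1, x2))
    g1 = sum(u * t for u, t in zip(x1, y))
    g2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det)


def center(values):
    """Subtract the mean so that no intercept column is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]


# Hypothetical, nearly collinear predictors (x2 is x1 plus a small wiggle).
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = center([1.1, 1.9, 3.2, 3.9, 5.1, 5.9])
y = center([2.0, 4.1, 6.2, 7.9, 10.1, 12.0])

for c in (0.0, 0.1, 1.0, 10.0):
    b1, b2 = ridge_two_predictors(x1, x2, y, c)
    print(c, round(b1, 3), round(b2, 3))
```

With c = 0 this is ordinary least squares; as c grows the coefficient vector shrinks toward zero, trading some bias for a smaller variance.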
Correlation and inverse of correlation matrix for air data set.

r<-cor(model.matrix(air.lm)[,-1])
> r
           x1         x2         x3
x1  1.0000000  0.2940876 -0.1273656
x2  0.2940876  1.0000000 -0.4971459
x3 -0.1273656 -0.4971459  1.0000000

> solve(r)
            x1         x2          x3
x1  1.09524102 -0.3357220 -0.02740677
x2 -0.33572201  1.4312012  0.66875638
x3 -0.02740677  0.6687564  1.32897882
Correlation and inverse of correlation matrix for mpg data set

r<-cor(model.matrix(auto1.lm)[,-1])
> r
               wt   I(wt^2)   I(wt^3)
     wt 1.0000000 0.9917756 0.9677228
I(wt^2) 0.9917756 1.0000000 0.9918939
I(wt^3) 0.9677228 0.9918939 1.0000000

solve(r)
               wt   I(wt^2)   I(wt^3)
     wt  2000.377 -3951.728  1983.884
I(wt^2) -3951.728  7868.535 -3980.575
I(wt^3)  1983.884 -3980.575  2029.459
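The VIFs are the diagonal of the inverse correlation matrix, so they can be checked directly from the printed output. The Python sketch below (not S-PLUS) inverts the air correlation matrix with a small Gauss-Jordan routine written for this example, then converts VIFs to the implied R_j² = 1 - 1/VIF_j. For the mpg polynomial model only the printed diagonal of solve(r) is used, because that matrix is too ill-conditioned to re-invert reliably from rounded correlations.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Correlation matrix of the air predictors, copied from the output above.
r_air = [[1.0000000, 0.2940876, -0.1273656],
         [0.2940876, 1.0000000, -0.4971459],
         [-0.1273656, -0.4971459, 1.0000000]]
r_air_inv = invert(r_air)
vif_air = [r_air_inv[j][j] for j in range(3)]
r2_air = [1.0 - 1.0 / v for v in vif_air]      # implied R_j^2

# Diagonal of solve(r) for the cubic mpg model, copied from the output above.
vif_mpg = [2000.377, 7868.535, 2029.459]
r2_mpg = [1.0 - 1.0 / v for v in vif_mpg]

print([round(v, 4) for v in vif_air], [round(v, 4) for v in r2_air])
print([round(v, 5) for v in r2_mpg])
```

The air predictors have VIFs well below the cautionary values of 5 or 10, while wt, wt^2 and wt^3 are almost perfectly predictable from one another, which is the motivation for the orthogonal polynomials fitted with poly(wt,3).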
Variable Selection

• We want a parsimonious model: as few variables as possible while still providing reasonable accuracy in predicting y.

• Some variables may not contribute much to the model.

• SSE never increases when more variables are added to the model; however, MSE = SSE/(n-k-1) may.

• Minimum MSE is one possible optimality criterion. However, we must fit all possible subsets (2^k of them) and find the one with minimum MSE.
Backward Elimination

1. Fit the full model (with all candidate predictors).

2. If P-values for all coefficients < α then stop.

3. Delete predictor with highest P-value.

4. Refit the model.

5. Go to Step 2.
Logistic Regression
References: Applied Linear Statistical Models, Neter et al.
Categorical Data Analysis, Agresti
Slides prepared by Elizabeth Newton (MIT)
Logistic Regression

• Nonlinear regression model when response variable is qualitative.
• 2 possible outcomes, success or failure, diseased or not diseased, present or absent
• Examples: CAD (y/n) as a function of age, weight, gender, smoking history, blood pressure
• Smoker or non-smoker as a function of family history, peer group behavior, income, age
• Purchase an auto this year as a function of income, age of current car, age
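Response Function for Binary Outcome

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

$E\{Y_i\} = \beta_0 + \beta_1 X_i$

$P(Y_i = 1) = \pi_i, \quad P(Y_i = 0) = 1 - \pi_i$

$E\{Y_i\} = 1(\pi_i) + 0(1 - \pi_i) = \pi_i$

$E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i$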
Special Problems when Response is Binary

Constraints on Response Function
0 ≤ E{Y} = π ≤ 1

Non-normal Error Terms
When Yi=1: εi = 1-β0-β1Xi
When Yi=0: εi = -β0-β1Xi

Non-constant error variance
Var{Yi} = Var{εi} = πi(1-πi)
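Logistic Response Function

$E\{Y\} = \pi = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$

$\pi\,(1 + \exp(\beta_0 + \beta_1 X)) = \exp(\beta_0 + \beta_1 X)$

$\pi + \pi\exp(\beta_0 + \beta_1 X) = \exp(\beta_0 + \beta_1 X)$

$\pi = \exp(\beta_0 + \beta_1 X) - \pi\exp(\beta_0 + \beta_1 X)$

$\pi = (1 - \pi)\exp(\beta_0 + \beta_1 X)$

$\frac{\pi}{1-\pi} = \exp(\beta_0 + \beta_1 X)$

$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$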
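The derivation shows that the logistic response function and the logit are inverses of each other. A minimal Python sketch (not S-PLUS) makes this concrete; the coefficients b0 and b1 are hypothetical values chosen only for illustration.

```python
import math


def logistic(eta):
    """pi = exp(eta) / (1 + exp(eta)), written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, for illustration only.
b0, b1 = -5.0, 0.1
for x in (0, 25, 50, 75, 100):
    p = logistic(b0 + b1 * x)
    # logit(p) recovers the linear predictor b0 + b1*x
    print(x, round(p, 4), round(logit(p), 4))
```

Whatever the value of the linear predictor, the response stays strictly between 0 and 1, which addresses the constraint 0 ≤ E{Y} ≤ 1 that a straight-line response function violates.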
[Figure: Example of Logistic Response Function. x-axis: Age (0 to 100); y-axis: Probability of CAD (0 to 1).]
Properties of Logistic Response Function
log(π/(1-π))=logit transformation, log odds
π/(1-π) = odds
Logit ranges from -∞ to ∞ as x varies from -∞ to ∞
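Likelihood Function

$P(Y_i = 1) = \pi_i, \quad P(Y_i = 0) = 1 - \pi_i$

pdf: $f_i(Y_i) = \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}, \quad Y_i = 0, 1; \; i = 1, 2, \ldots, n$

Since the $Y_i$ are independent, the joint pdf is:

$g(Y_1, \ldots, Y_n) = \prod_{i=1}^{n} f_i(Y_i) = \prod_{i=1}^{n} \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}$

$\log g(Y_1, \ldots, Y_n) = \sum_{i=1}^{n} Y_i \log\left(\frac{\pi_i}{1-\pi_i}\right) + \sum_{i=1}^{n} \log(1-\pi_i)$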
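Likelihood Function (continued)

$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_i$

$1 - \pi_i = \frac{1}{1 + \exp(\beta_0 + \beta_1 X_i)}$

$\log L(\beta_0, \beta_1) = \sum_{i=1}^{n} Y_i(\beta_0 + \beta_1 X_i) - \sum_{i=1}^{n} \log[1 + \exp(\beta_0 + \beta_1 X_i)]$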
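Likelihood for Multiple Logistic Regression

$\log L(\beta) = \sum_i y_i \left(\sum_j \beta_j x_{ij}\right) - \sum_i \log\left[1 + \exp\left(\sum_j \beta_j x_{ij}\right)\right]$

$\frac{\partial \log L}{\partial \beta_j} = \sum_i y_i x_{ij} - \sum_i x_{ij}\left[\frac{\exp(\sum_j \beta_j x_{ij})}{1 + \exp(\sum_j \beta_j x_{ij})}\right]$

Likelihood Equations:

$\sum_i y_i x_{ij} = \sum_i x_{ij}\left[\frac{\exp(\sum_j \hat\beta_j x_{ij})}{1 + \exp(\sum_j \hat\beta_j x_{ij})}\right] = \sum_i x_{ij}\hat\pi_i$

$X'y = X'\hat\pi$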
Solution of Likelihood Equations

No closed form solution
Use Newton-Raphson algorithm
Iteratively reweighted least squares (IRLS)
Start with OLS solution for β at iteration t=0, $\beta^0$

$\pi_i^t = 1/(1 + \exp(-X_i'\beta^t))$

$\beta^{(t+1)} = \beta^t + (X'VX)^{-1} X'(y - \pi^t)$

where $V = \mathrm{diag}(\pi_i^t(1 - \pi_i^t))$

Usually only takes a few iterations
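The update β^(t+1) = β^t + (X'VX)^{-1} X'(y - π^t) can be sketched for a model with an intercept and one predictor, where X'VX is a 2 x 2 matrix that can be inverted by hand. This is a Python illustration on made-up data, not the S-PLUS glm implementation; the function name and stopping rule are assumptions of the example.

```python
import math


def fit_logistic_newton(x, y, iterations=25, tol=1e-10):
    """Newton-Raphson / IRLS for logit(pi_i) = b0 + b1*x_i."""
    n = len(x)
    # Starting values: ordinary least squares of y on x.
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    for _ in range(iterations):
        pi = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        v = [p * (1.0 - p) for p in pi]          # V = diag(pi_i (1 - pi_i))
        # Score vector X'(y - pi)
        g0 = sum(yi - p for yi, p in zip(y, pi))
        g1 = sum(xi * (yi - p) for xi, yi, p in zip(x, y, pi))
        # Information matrix X'VX (2 x 2)
        a00 = sum(v)
        a01 = sum(vi * xi for vi, xi in zip(v, x))
        a11 = sum(vi * xi * xi for vi, xi in zip(v, x))
        det = a00 * a11 - a01 * a01
        # Newton step (X'VX)^{-1} X'(y - pi)
        d0 = (a11 * g0 - a01 * g1) / det
        d1 = (a00 * g1 - a01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data with overlapping outcomes (no perfect separation).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic_newton(x, y)
print(round(b0, 4), round(b1, 4))
```

At convergence the score X'(y - π) is zero, which is exactly the likelihood equation X'y = X'π. If the two outcome groups were perfectly separated in x, the estimates would diverge and the iteration would not settle.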
Interpretation of logistic regression coefficients

• Log(π/(1-π)) = Xβ
• So each βj is the effect of a unit increase in Xj on the log odds of success, with values of other variables held constant
• Odds Ratio = exp(βj)
Example: Spinal Disease in Children Data

SUMMARY: The kyphosis data frame has 81 rows representing data on 81 children who have had corrective spinal surgery. The outcome Kyphosis is a binary variable, the other three variables (columns) are numeric.

ARGUMENTS:
Kyphosis
    a factor telling whether a postoperative deformity (kyphosis) is "present" or "absent".
Age
    the age of the child in months.
Number
    the number of vertebrae involved in the operation.
Start
    the beginning of the range of vertebrae involved in the operation.

SOURCE: John M. Chambers and Trevor J. Hastie, Statistical Models in S, Wadsworth and Brooks, Pacific Grove, CA 1992, pg. 200.
Observations 1:16 of kyphosis data set

kyphosis[1:16,]
   Kyphosis Age Number Start
 1   absent  71      3     5
 2   absent 158      3    14
 3  present 128      4     5
 4   absent   2      5     1
 5   absent   1      4    15
 6   absent   1      2    16
 7   absent  61      2    17
 8   absent  37      3    16
 9   absent 113      2    16
10  present  59      6    12
11  present  82      5    14
12   absent 148      3    16
13   absent  18      5     2
14   absent   1      4    12
16   absent 168      3    18
Variables in kyphosis

summary(kyphosis)
   Kyphosis        Age             Number            Start
  absent:64   Min.:  1.00     Min.: 2.000     Min.: 1.00
 present:17   1st Qu.: 26.00  1st Qu.: 3.000  1st Qu.: 9.00
              Median: 87.00   Median: 4.000   Median:13.00
              Mean: 83.65     Mean: 4.049     Mean:11.49
              3rd Qu.:130.00  3rd Qu.: 5.000  3rd Qu.:16.00
              Max.:206.00     Max.:10.000     Max.:18.00
[Figure: Scatter plot matrix, kyphosis data set. Panels: Kyphosis (absent, present), Age, Number, Start.]
[Figure: Boxplots of predictors vs. kyphosis. Three panels: Age, Number and Start, each by Kyphosis (absent, present).]
[Figure: Smoothing spline fits, df=3. Three panels: kyp vs. jitter(age), kyp vs. jitter(num), kyp vs. jitter(sta).]
Summary of glm fit

Call: glm(formula = Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)

Deviance Residuals:
       Min         1Q     Median         3Q     Max
 -2.312363 -0.5484308 -0.3631876 -0.1658653 2.16133

Coefficients:
                  Value Std. Error   t value
(Intercept) -2.03693225 1.44918287 -1.405573
        Age  0.01093048 0.00644419  1.696175
     Number  0.41060098 0.22478659  1.826626
      Start -0.20651000 0.06768504 -3.051043

Null Deviance: 83.23447 on 80 degrees of freedom
Residual Deviance: 61.37993 on 77 degrees of freedom
Number of Fisher Scoring Iterations: 5

Correlation of Coefficients:
       (Intercept)        Age    Number
   Age  -0.4633715
Number  -0.8480574  0.2321004
 Start  -0.3784028 -0.2849547 0.1107516
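Applying the odds-ratio interpretation exp(βj) to this fit is simple arithmetic on the printed coefficients. The Python sketch below (not S-PLUS) also forms approximate 95% Wald intervals, exp(b ± 1.96·SE); the normal approximation behind those intervals is an assumption of the example and is not shown on the slides.

```python
import math

# Estimates and standard errors copied from the glm summary above.
coef = {"Age": (0.01093048, 0.00644419),
        "Number": (0.41060098, 0.22478659),
        "Start": (-0.20651000, 0.06768504)}

odds_ratio = {}
for name, (b, se) in coef.items():
    odds_ratio[name] = math.exp(b)
    # Approximate 95% Wald interval on the odds-ratio scale.
    lower = math.exp(b - 1.96 * se)
    upper = math.exp(b + 1.96 * se)
    print(name, round(odds_ratio[name], 3),
          round(lower, 3), round(upper, 3), round(b / se, 3))
```

An odds ratio below 1 for Start means that, holding Age and Number fixed, each one-vertebra increase in the starting position multiplies the odds of kyphosis by a factor less than one; the ratio b/SE reproduces the "t value" column.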
Residuals

• Response Residuals: yi-πi

• Pearson Residuals: (yi-πi)/sqrt(πi(1-πi))

• Deviance Residuals: sign(yi-πi)*sqrt(-2log(|1-yi-πi|))
Model Deviance

• Deviance of fitted model compares log-likelihood of fitted model to that of saturated model.

• Log likelihood of saturated model=0

$DEV = -2\sum_{i=1}^{n}\left[Y_i\log(\hat\pi_i) + (1-Y_i)\log(1-\hat\pi_i)\right]$

$d_i = \mathrm{sign}(Y_i - \hat\pi_i)\left\{-2\left[Y_i\log(\hat\pi_i) + (1-Y_i)\log(1-\hat\pi_i)\right]\right\}^{1/2}$

$DEV = \sum_{i=1}^{n} d_i^2$
Covariance Matrix

> x<-model.matrix(kyph.glm)
> xvx<-t(x)%*%diag(fi*(1-fi))%*%x
> xvx
            (Intercept)         Age     Number      Start
(Intercept)    9.620342    907.8887   43.67401   86.49845
        Age  907.888726 114049.8308 3904.31350 9013.14464
     Number   43.674014   3904.3135  219.95353  378.82849
      Start   86.498450   9013.1446  378.82849 1024.07328

> xvxi<-solve(xvx)
> xvxi
             (Intercept)            Age        Number         Start
(Intercept)  2.101402986 -0.00433216784 -0.2764670205 -0.0370950612
        Age -0.004332168  0.00004155736  0.0003368969 -0.0001244665
     Number -0.276467020  0.00033689690  0.0505664221  0.0016809996
      Start -0.037095061 -0.00012446655  0.0016809996  0.0045833534

> sqrt(diag(xvxi))
[1] 1.44962167 0.00644650 0.22486979 0.06770047
Change in Deviance resulting from adding terms to model

> anova(kyph.glm)
Analysis of Deviance Table

Binomial model

Response: Kyphosis

Terms added sequentially (first to last)
       Df Deviance Resid. Df Resid. Dev
  NULL                    80   83.23447
   Age  1  1.30198        79   81.93249
Number  1 10.30593        78   71.62656
 Start  1 10.24663        77   61.37993
Summary for kyphosis model with age^2 added

Call: glm(formula = Kyphosis ~ poly(Age, 2) + Number + Start, family = binomial, data = kyphosis)

Deviance Residuals:
       Min         1Q    Median          3Q      Max
 -2.235654 -0.5124374 -0.245114 -0.06111367 2.354818

Coefficients:
                    Value Std. Error   t value
  (Intercept)  -1.6502939 1.40171048 -1.177343
poly(Age, 2)1   7.3182325 4.66933068  1.567298
poly(Age, 2)2 -10.6509151 5.05858692 -2.105512
       Number   0.4268172 0.23531689  1.813798
        Start  -0.2038329 0.07047967 -2.892080

Null Deviance: 83.23447 on 80 degrees of freedom
Residual Deviance: 54.42776 on 76 degrees of freedom
Number of Fisher Scoring Iterations: 5

Correlation of Coefficients:
              (Intercept) poly(Age, 2)1 poly(Age, 2)2    Number
poly(Age, 2)1  -0.2107783
poly(Age, 2)2   0.2497127    -0.0924834
       Number  -0.8403856     0.3070957    -0.0988896
        Start  -0.4918747    -0.2208804     0.0911896 0.0721616
Analysis of Deviance

> anova(kyph.glm2)
Analysis of Deviance Table

Binomial model

Response: Kyphosis

Terms added sequentially (first to last)
             Df Deviance Resid. Df Resid. Dev
        NULL                    80   83.23447
poly(Age, 2)  2 10.49589        78   72.73858
      Number  1  8.87597        77   63.86261
       Start  1  9.43485        76   54.42776
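A change in deviance between nested models is commonly compared with a chi-square distribution whose degrees of freedom equal the number of added parameters. The slides show only the deviance tables, so the test below is an assumed, standard large-sample approximation rather than something taken from the output. The Python sketch uses closed forms for the chi-square upper tail with 1 and 2 degrees of freedom, so no statistics library is needed.

```python
import math


def chisq_upper_tail_1df(x):
    """P(chi-square with 1 df > x); for 1 df this equals erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


def chisq_upper_tail_2df(x):
    """P(chi-square with 2 df > x) = exp(-x / 2)."""
    return math.exp(-x / 2.0)


# Residual deviances copied from the two analysis of deviance tables.
dev_linear_age = 61.37993      # Age + Number + Start, 77 df
dev_quadratic_age = 54.42776   # poly(Age, 2) + Number + Start, 76 df

drop = dev_linear_age - dev_quadratic_age   # change in deviance, 1 df
p_quadratic = chisq_upper_tail_1df(drop)
print(round(drop, 5), round(p_quadratic, 4))

# Sequential term from the second table: poly(Age, 2) added to the null model, 2 df.
p_poly_age = chisq_upper_tail_2df(10.49589)
print(round(p_poly_age, 4))
```

Under this approximation the quadratic age term gives a noticeably smaller residual deviance than the linear term alone, consistent with the t value of about -2.1 for poly(Age, 2)2.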
Kyphosis data, 16 obs, with fit and residuals

cbind(kyphosis,round(p,3),round(rr,3),round(rp,3),round(rd,3))[1:16,]
   Kyphosis Age Number Start   fit     rr     rp     rd
 1   absent  71      3     5 0.257 -0.257 -0.588 -0.771
 2   absent 158      3    14 0.122 -0.122 -0.374 -0.511
 3  present 128      4     5 0.493  0.507  1.014  1.189
 4   absent   2      5     1 0.458 -0.458 -0.919 -1.107
 5   absent   1      4    15 0.030 -0.030 -0.175 -0.246
 6   absent   1      2    16 0.011 -0.011 -0.105 -0.148
 7   absent  61      2    17 0.017 -0.017 -0.131 -0.185
 8   absent  37      3    16 0.024 -0.024 -0.157 -0.220
 9   absent 113      2    16 0.036 -0.036 -0.193 -0.271
10  present  59      6    12 0.197  0.803  2.020  1.803
11  present  82      5    14 0.121  0.879  2.689  2.053
12   absent 148      3    16 0.076 -0.076 -0.288 -0.399
13   absent  18      5     2 0.450 -0.450 -0.905 -1.094
14   absent   1      4    12 0.054 -0.054 -0.239 -0.333
16   absent 168      3    18 0.064 -0.064 -0.261 -0.363
17   absent   1      3    16 0.016 -0.016 -0.129 -0.181
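The three residual columns (rr, rp, rd) can be recomputed from the fit column. The Python sketch below (not S-PLUS) codes "present" as 1 and "absent" as 0 and uses the rounded fitted values from the table, so the results agree with the printed residuals only to about two decimal places. The deviance residual carries the sign of y - fit, as the negative rd values for "absent" rows show.

```python
import math


def residuals(y, fit):
    """Response, Pearson, and deviance residuals for one binary observation."""
    response = y - fit
    pearson = response / math.sqrt(fit * (1.0 - fit))
    # |1 - y - fit| equals 1 - fit when y = 0 and fit when y = 1.
    magnitude = math.sqrt(-2.0 * math.log(abs(1.0 - y - fit)))
    deviance = math.copysign(magnitude, response)
    return response, pearson, deviance


# (row, y, fit) copied from the table above; y = 1 for "present", 0 for "absent".
rows = [(1, 0, 0.257), (3, 1, 0.493), (10, 1, 0.197), (11, 1, 0.121)]
for row, y, fit in rows:
    rr, rp, rd = residuals(y, fit)
    print(row, round(rr, 3), round(rp, 3), round(rd, 3))
```

Rows 10 and 11 are children with kyphosis present but low fitted probabilities, which is why their Pearson and deviance residuals are the largest in the table.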
[Figure: Plot of response residual vs. fit. x-axis: fi; y-axis: y - fi.]
[Figure: Plot of deviance residual vs. index. y-axis: deviance residuals of kyph.glm.]
[Figure: Plot of deviance residuals vs. fitted value. x-axis: fitted(kyph.glm2); y-axis: deviance residuals of kyph.glm2.]
Summary of bootstrap for kyphosis model

Call:
bootstrap(data = kyphosis, statistic = coef(glm(Kyphosis ~ poly(Age, 2) + Number + Start, family = binomial, data = kyphosis)), trace = F)

Number of Replications: 1000

Summary Statistics:
              Observed     Bias     Mean      SE
  (Intercept)  -1.6503 -0.85600  -2.5063  5.1675
poly(Age, 2)1   7.3182  4.33814  11.6564 22.0166
poly(Age, 2)2 -10.6509 -7.48557 -18.1365 37.6780
       Number   0.4268  0.17785   0.6047  0.6823
        Start  -0.2038 -0.07825  -0.2821  0.4593

Empirical Percentiles:
                   2.5%         5%     95%    97.5%
  (Intercept)  -8.52922  -7.247145  1.1760  2.27636
poly(Age, 2)1  -6.13910  -1.352143 27.1515 34.64701
poly(Age, 2)2 -48.86864 -38.993192 -4.9585 -4.13232
       Number  -0.07539  -0.003433  1.4756  1.82754
        Start  -0.58795  -0.470139 -0.1159 -0.08919

Summary of bootstrap (continued)

BCa Confidence Limits:
                  2.5%       5%      95%    97.5%
  (Intercept)  -6.4394  -5.3043  2.39707  3.56856
poly(Age, 2)1 -18.2205 -10.1003 18.34192 21.56654
poly(Age, 2)2 -24.2382 -20.3911 -1.75701 -0.19269
       Number  -0.7653  -0.1694  1.14036  1.27858
        Start  -0.3521  -0.3167 -0.03478  0.01461

Correlation of Replicates:
              (Intercept) poly(Age, 2)1 poly(Age, 2)2  Number   Start
  (Intercept)      1.0000       -0.4204        0.5082 -0.5676 -0.1839
poly(Age, 2)1     -0.4204        1.0000       -0.8475  0.4368 -0.6478
poly(Age, 2)2      0.5082       -0.8475        1.0000 -0.3739  0.5983
       Number     -0.5676        0.4368       -0.3739  1.0000 -0.4174
        Start     -0.1839       -0.6478        0.5983 -0.4174  1.0000
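The quantities in the bootstrap summary (observed value, bias, mean, SE, empirical percentiles) can be illustrated with a much simpler statistic than the logistic coefficients. The Python sketch below resamples rows with replacement and recomputes an ordinary least squares slope; the data, the function names and the seed are made up, and the BCa limits reported by S-PLUS are not implemented.

```python
import random
import statistics


def slope(pairs):
    """OLS slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    return sum((x - xbar) * (y - ybar) for x, y in pairs) / sxx


def bootstrap(pairs, statistic, replications=1000, seed=0):
    """Resample rows with replacement and recompute the statistic each time."""
    rng = random.Random(seed)
    n = len(pairs)
    reps = []
    while len(reps) < replications:
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        if len({x for x, _ in sample}) > 1:   # skip degenerate resamples
            reps.append(statistic(sample))
    return reps


def percentile(sorted_values, p):
    """Simple empirical percentile (nearest order statistic)."""
    idx = min(len(sorted_values) - 1, max(0, round(p * (len(sorted_values) - 1))))
    return sorted_values[idx]


# Hypothetical data.
data = [(1, 1.2), (2, 1.9), (3, 3.4), (4, 3.8), (5, 5.3),
        (6, 5.9), (7, 7.4), (8, 7.7), (9, 9.1), (10, 10.4)]
observed = slope(data)
reps = sorted(bootstrap(data, slope))
mean_rep = statistics.mean(reps)
print("observed", round(observed, 4))
print("bias", round(mean_rep - observed, 4), "SE", round(statistics.stdev(reps), 4))
print("2.5%, 97.5%:", round(percentile(reps, 0.025), 4), round(percentile(reps, 0.975), 4))
```

Bias is the mean of the replicates minus the observed value, and SE is the standard deviation of the replicates, matching the column definitions in the summary above.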
[Figure: Histograms of coefficient estimates. Five panels: (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number, Start. x-axis: Value; y-axis: Density.]
[Figure: QQ Plots of coefficient estimates. Five panels: (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number, Start. x-axis: Quantiles of Standard Normal; y-axis: Quantiles of Replicates.]
Regression Review and Robust Regression

Slides prepared by Elizabeth Newton (MIT)

S-Plus Oil City Data Frame
Monthly Excess Returns of Oil City Petroleum, Inc. Stocks and the Market

SUMMARY: The oilcity data frame has 129 rows and 2 columns. The sample runs from April 1979 to December 1989. This data frame contains the following columns:

VALUE:
Oil
    monthly excess returns of Oil City Petroleum, Inc. stocks.
Market
    monthly excess returns of the market.
Oil City Data (continued)

• Returns = relative change in the stock price over a one month interval
• Excess returns are computed relative to the monthly return of a 90-day US Treasury bill at the risk-free rate
• Financial economists use least squares to fit a straight line predicting a particular stock return from the market return.
• Beta = estimated coefficient of the market return. Measures the riskiness of the stock in terms of standard deviation and expected returns.
• Large beta -> stock is risky compared to market, but also expected returns from the stock are large.
[Figure: Plot of Market returns vs. month. x-axis: Month; y-axis: oilcity$Market.]
[Figure: Plot of Oil City Petroleum return vs. month. x-axis: month; y-axis: Oil.]
[Figure: Histogram of Market Returns. x-axis: Market.]
[Figure: Histogram of Oil City Returns. x-axis: Oil.]
[Figure: Plot of Oil City vs. Market Returns, points labeled by observation number. x-axis: Market; y-axis: Oil City.]
[Figure: Plot of Oil City vs. Market Returns without observation 94, points labeled by observation number. x-axis: Market; y-axis: Oil City.]
> summary(oilcity)
         Oil                   Market
 Min.:-0.55667260      Min.:-0.27857020
 1st Qu.:-0.23968330   1st Qu.:-0.10557534
 Median:-0.10049000    Median:-0.07277544
 Mean:-0.07221215      Mean:-0.07689209
 3rd Qu.:-0.05821000   3rd Qu.:-0.03973828
 Max.: 5.19292000      Max.: 0.07131940

Summary oil.lm

Call: lm(formula = Oil ~ Market, data = oilcity)
Residuals:
     Min      1Q   Median      3Q   Max
 -0.6952 -0.1732 -0.05444 0.08407 4.842

Coefficients:
             Value Std. Error t value Pr(>|t|)
(Intercept) 0.1474     0.0707  2.0849   0.0391
     Market 2.8567     0.7318  3.9040   0.0002

Residual standard error: 0.4867 on 127 degrees of freedom
Multiple R-Squared: 0.1071
F-statistic: 15.24 on 1 and 127 degrees of freedom, the p-value is 0.0001528

Correlation of Coefficients:
       (Intercept)
Market      0.7956
[Figure: Plot of residual vs. fit for oil.lm; labeled points: 65, 79, 94. x-axis: Fitted : Market; y-axis: Residuals.]
[Figure: Plot of Cook's Distance vs. Index; the labeled points include observation 94. y-axis: Cook's Distance.]
[Figure: Plot of hat matrix diagonals for oil.lm, points labeled by observation number. x-axis: month; y-axis: hat(model.matrix(oil.lm)).]
Summary of model without observation 94

Call: lm(formula = Oil ~ Market, data = oilcity94)

Residuals:
     Min      1Q   Median      3Q   Max
 -0.5169 -0.1174 -0.01959 0.06864 0.859

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.0247     0.0304 -0.8139   0.4173
     Market  1.1355     0.3137  3.6202   0.0004

Residual standard error: 0.2033 on 126 degrees of freedom
Multiple R-Squared: 0.09422
F-statistic: 13.11 on 1 and 126 degrees of freedom, the p-value is 0.0004249

Correlation of Coefficients:
       (Intercept)
Market      0.8061
[Figure: Plot of residual vs fit for model without observation 94. x-axis: Fitted : Market; y-axis: Residuals.]
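Weighted Least Squares

Used when observations, $y_i$, have unequal variances

$y = X\beta + \varepsilon$

$E(\varepsilon) = 0, \quad \mathrm{Var}(\varepsilon) = \sigma^2 V$

V is non-singular positive definite
V is diagonal if errors are uncorrelated
V is always symmetric

$\exists$ an n x n non-singular symmetric matrix R such that $R'R = RR = V$
R is sometimes called the square root of V

Weighted least squares (continued)

Define new variables:

$y^* = R^{-1}y, \quad X^* = R^{-1}X, \quad \varepsilon^* = R^{-1}\varepsilon$

$y = X\beta + \varepsilon$ becomes $R^{-1}y = R^{-1}X\beta + R^{-1}\varepsilon$, or $y^* = X^*\beta + \varepsilon^*$

$E(\varepsilon^*) = R^{-1}E(\varepsilon) = 0$

Weighted least squares (continued)

$\mathrm{Var}(\varepsilon^*) = E\{[\varepsilon^* - E(\varepsilon^*)][\varepsilon^* - E(\varepsilon^*)]'\}$
$= E(\varepsilon^*\varepsilon^{*\prime})$
$= E(R^{-1}\varepsilon\varepsilon'R^{-1})$
$= R^{-1}E(\varepsilon\varepsilon')R^{-1}$
$= \sigma^2 R^{-1}VR^{-1}$
$= \sigma^2 R^{-1}RRR^{-1}$
$= \sigma^2 I$

Weighted Least Squares (continued)

$Q(\beta) = \varepsilon^{*\prime}\varepsilon^* = \varepsilon'R^{-1}R^{-1}\varepsilon = \varepsilon'V^{-1}\varepsilon = (y - X\beta)'W(y - X\beta), \quad W = V^{-1}$ (weights)

Least squares normal equations are $(X'WX)\hat\beta = X'Wy$

The solution is: $\hat\beta = (X'WX)^{-1}X'Wy$

$\mathrm{Var}(\hat\beta) = (X'WX)^{-1}X'W\,\mathrm{var}(y)\,WX(X'WX)^{-1}$
$= \sigma^2 (X'WX)^{-1}X'WW^{-1}WX(X'WX)^{-1}$
$= \sigma^2 (X'WX)^{-1}$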
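For a straight-line model with uncorrelated errors, W is diagonal and the normal equations (X'WX)b = X'Wy reduce to a 2 x 2 system. The Python sketch below solves that system directly; the data and the variance pattern (error standard deviation proportional to x) are hypothetical, and the function name is an assumption of the example.

```python
def weighted_least_squares(x, y, w):
    """Solve (X'WX) b = X'Wy for the line y = b0 + b1*x with W = diag(w)."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    # (X'WX)^{-1}; multiply by an estimate of sigma^2 to get Var(b).
    xtwx_inv = [[swxx / det, -swx / det], [-swx / det, sw / det]]
    return (b0, b1), xtwx_inv


# Hypothetical data in which the error variance grows with x,
# so V = diag(x_i^2) and the weights are w_i = 1 / x_i^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.3, 7.6, 10.9, 11.2]
w = [1.0 / xi ** 2 for xi in x]

(b0_w, b1_w), cov_unscaled = weighted_least_squares(x, y, w)
(b0_o, b1_o), _ = weighted_least_squares(x, y, [1.0] * len(x))   # equal weights = OLS
print(round(b0_w, 4), round(b1_w, 4), round(b0_o, 4), round(b1_o, 4))
```

Observations with small variance receive large weights and pull the fitted line toward themselves; with all weights equal the same function returns the ordinary least squares fit.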
Robust Regression

Used to reduce influence of outliers

LAR Regression:
minimize $L_1 = \sum_{i=1}^{n} |y_i - x_i'\beta| = \sum_{i=1}^{n} |e_i|$

LMS Regression:
minimize $\mathrm{median}\{[y_i - x_i'\beta]^2\} = \mathrm{median}\{e_i^2\}$

M estimators:
minimize $\sum_{i=1}^{n} g(y_i - x_i'\beta) = \sum_{i=1}^{n} g(e_i)$, g a function of residuals
Robust Regression (continued)

IRLS, iteratively reweighted least squares
Minimize e’We
W is a diagonal matrix of weights, inversely proportional to magnitude of scaled residuals, ui
ui = ei/s, s = MAD = median{|ei-median(ei)|}

Procedure:
1. Obtain initial coefficient estimates from OLS
2. Obtain weights from scaled residuals
3. Obtain coefficient estimates from WLS
4. Return to 2.
Convergence usually rapid.

(See Figure 10.4, and Equations 10.44 and 10.45 in Neter et al. Applied Linear Statistical Models.)
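The four-step procedure can be sketched for a straight-line fit. The slides defer the weight function to Neter et al., so the Huber weights used here (w = 1 for |u| ≤ c, c/|u| otherwise, with c = 1.345) are an assumption of this Python example, as are the made-up data. The scale s is the MAD multiplied by 1.4826, the factor that appears in the S-PLUS code for the oil.rreg coefficient table.

```python
import statistics


def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1*x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    return ybar - b1 * xbar, b1


def huber_irls(x, y, c=1.345, iterations=50, tol=1e-8):
    """IRLS with Huber weights: w = 1 if |u| <= c, else c/|u|, with u = e/s."""
    w = [1.0] * len(x)
    b0, b1 = wls_line(x, y, w)                       # step 1: OLS start
    for _ in range(iterations):
        e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = statistics.median(e)
        s = 1.4826 * statistics.median([abs(ei - med) for ei in e])   # scaled MAD
        if s == 0:
            break
        # step 2: weights from scaled residuals
        w = [1.0 if abs(ei / s) <= c else c / abs(ei / s) for ei in e]
        new_b0, new_b1 = wls_line(x, y, w)            # step 3: WLS
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:                                 # step 4: otherwise repeat
            break
    return (b0, b1), w


# Hypothetical data near the line y = 1 + 2x with one gross outlier at the end.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 40.0]
ols = wls_line(x, y, [1.0] * len(x))
(rb0, rb1), weights = huber_irls(x, y)
print("OLS   ", round(ols[0], 3), round(ols[1], 3))
print("robust", round(rb0, 3), round(rb1, 3),
      "weight of last point", round(weights[-1], 3))
```

The outlier receives a weight below 1, so the robust slope stays closer to the slope of the bulk of the data than the OLS slope does.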
[Figure: Plot of residuals in oil.rreg. y-axis: oil.rreg$resid.]
[Figure: Plot of weights in robust regression for oil city data set, points labeled by observation number. x-axis: Month; y-axis: Weights (0 to 1).]
[Figure: Plot of sqrt(weights)*resid/s in oil.rreg.]
Coefficient table for oil.rreg
> x<-cbind(1,Market)
> beta<-solve(t(x)%*%diag(w)%*%x)%*%t(x)%*%diag(w)%*%Oil
> r<-Oil-x%*%beta
> s<- median(abs(r-median(r)))*1.4826
> covm<-solve(t(x)%*%diag(w)%*%x)*s^2
> se<-sqrt(diag(covm))
> tvalue=beta/se
> prob<-2*(1-pt(abs(tvalue),127))
> cbind(beta,se,tvalue,prob)
                   beta         se    tvalue         prob
(Intercept) -0.06779903 0.02451469 -2.765649 0.0065285939
          x  0.89895511 0.24902845  3.609849 0.0004394276
Covariance matrix is approximate.
Plots of fitted regression lines for oil city data
[Figure: Oil (vertical axis, 0 to 5) vs. Market (horizontal axis, -0.2 to 0.0), points labeled by observation number 1 to 129, with fitted lines from oil.lm, oil.lm94, and oil.rreg.]
Least Trimmed Squares Regression
Minimizes $\sum_{i=1}^{q} e_{(i)}^2$, the sum of the q smallest squared residuals, where q is chosen to be between n/2 and n.
Based on a genetic algorithm for finding a subset of data with minimum SSE.
High breakdown point: fits the bulk of the data well, even if bulk is only a little more than half the data.
Resulting weights are 1 or 0
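A small Python sketch of the LTS criterion for a single predictor follows. It does not use the genetic algorithm mentioned above; as a stand-in it simply tries every line through two data points, which only approximates the LTS minimizer and is practical only for small data sets. All function names are hypothetical.

```python
from itertools import combinations

def lts_objective(b0, b1, x, y, q):
    """Sum of the q smallest squared residuals for the line y = b0 + b1*x."""
    squared = sorted((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return sum(squared[:q])

def lts_line_brute_force(x, y, q):
    """Search candidate lines through pairs of points; returns (b0, b1, objective)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                      # vertical line, skip
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        obj = lts_objective(b0, b1, x, y, q)
        if best is None or obj < best[2]:
            best = (b0, b1, obj)
    return best

def lts_weights(b0, b1, x, y, q):
    """Weight 1 for the q observations with smallest squared residual, else 0."""
    order = sorted(range(len(x)), key=lambda k: (y[k] - b0 - b1 * x[k]) ** 2)
    keep = set(order[:q])
    return [1 if k in keep else 0 for k in range(len(x))]
```

With eight points on a line and two gross outliers, `lts_line_brute_force(x, y, q=8)` recovers the line through the eight good points and `lts_weights` marks the two outliers with weight 0.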
> summary(oil.lts)
Method:
[1] "Least Trimmed Squares Robust Regression."
Call:
ltsreg(formula = Oil ~ Market)
Coefficients:
Intercept  Market
  -0.0864  0.7907
Scale estimate of residuals: 0.1468
Robust Multiple R-Squared: 0.09863
Total number of observations: 129
Number of observations that determine the LTS estimate: 116
Residuals:
  Min. 1st Qu. Median 3rd Qu.  Max.
-0.454  -0.088  0.032   0.097 5.223
Weights:
 0   1
10 119
Single Factor ANOVA Models
Corresponds to Chapter 12 of Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University).
Chapter 8: How to compare two treatments
Chapter 12: How to compare more than two treatments (or just two).
Example: yields of several varieties of barley.
Variety is the treatment factor (predictor).
Yield is the response.
Experimental Designs
S-Plus barley data set (observations 13:30)
> barley.small
      yield  variety year            site
13 35.13333 Svansota 1931 University Farm
14 47.33333 Svansota 1931          Waseca
15 25.76667 Svansota 1931          Morris
16 40.46667 Svansota 1931       Crookston
17 29.66667 Svansota 1931    Grand Rapids
18 25.70000 Svansota 1931          Duluth
19 39.90000   Velvet 1931 University Farm
20 50.23333   Velvet 1931          Waseca
21 26.13333   Velvet 1931          Morris
22 41.33333   Velvet 1931       Crookston
23 23.03333   Velvet 1931    Grand Rapids
24 26.30000   Velvet 1931          Duluth
25 36.56666    Trebi 1931 University Farm
26 63.83330    Trebi 1931          Waseca
27 43.76667    Trebi 1931          Morris
28 46.93333    Trebi 1931       Crookston
29 29.76667    Trebi 1931    Grand Rapids
30 33.93333    Trebi 1931          Duluth
Completely Randomized Design Notation
If the sample sizes are equal the design is balanced; otherwise the design is unbalanced
See Table 12.1, page 458 in the course textbook.
$N = \sum_{i=1}^{a} n_i$
S-Plus barley dataset (observations 13:30)
Variety       Svansota   Velvet    Trebi
              35.13333 39.90000 36.56666
              47.33333 50.23333 63.83330
              25.76667 26.13333 43.76667
              40.46667 41.33333 46.93333
              29.66667 23.03333 29.76667
              25.70000 26.30000 33.93333
Variety Mean  34.01111 34.48889 42.46666
Plot of yield by variety for S-Plus barley data set
[Figure: barley.small$yield (vertical axis, 30 to 60) by barley.small$variety (Svansota, Velvet, Trebi).]
S-plus plot.design function
[Figure: plot.design output in two panels, mean of yield and median of yield, each by level of the factor variety (Svansota, Velvet, Trebi).]
CRD: Model and Estimation (cell means model)
See Section 12.1.1 and Figure 12.2 on page 460 of the course textbook.
CRD: Treatment Effects Model
Alternative Formulation of the Model:
Formula from 12.1.1, page 460 in the course textbook.
$Y_{ij} = \mu + \tau_i + \varepsilon_{ij} \quad (i = 1, 2, \ldots, a;\ j = 1, 2, \ldots, n_i)$
CRD parameter estimates
$\mu$, grand mean, estimated by $\bar{y} = (\mathbf{1}'y)/N$
$\mu_i$, mean of $i$th treatment, estimated by $\bar{y}_i = (\mathbf{1}'y_i)/n_i$
$\hat{y}$ = vector of fitted values = treatment means
$e$ = error = $y - \hat{y}$
$\sigma^2$ estimated by $s^2 = e'e/(N-a)$
Fitted values and residuals for barley example
> cbind(barley.small[,1:2],fitted(tmp),resid(tmp))
      yield  variety   fitted      resid
13 35.13333 Svansota 34.01111   1.122218
14 47.33333 Svansota 34.01111  13.322218
15 25.76667 Svansota 34.01111  -8.244442
16 40.46667 Svansota 34.01111   6.455558
17 29.66667 Svansota 34.01111  -4.344442
18 25.70000 Svansota 34.01111  -8.311112
19 39.90000   Velvet 34.48889   5.411113
20 50.23333   Velvet 34.48889  15.744443
21 26.13333   Velvet 34.48889  -8.355557
22 41.33333   Velvet 34.48889   6.844443
23 23.03333   Velvet 34.48889 -11.455557
24 26.30000   Velvet 34.48889  -8.188887
25 36.56666    Trebi 42.46666  -5.900000
26 63.83330    Trebi 42.46666  21.366640
27 43.76667    Trebi 42.46666   1.300010
28 46.93333    Trebi 42.46666   4.466670
29 29.76667    Trebi 42.46666 -12.699990
30 33.93333    Trebi 42.46666  -8.533330
X matrix?
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
Model.matrix in S-Plus
> round(model.matrix(barley.small.aov),3)
   (Intercept) variety.L variety.Q
13           1    -0.707     0.408
14           1    -0.707     0.408
15           1    -0.707     0.408
16           1    -0.707     0.408
17           1    -0.707     0.408
18           1    -0.707     0.408
19           1     0.000    -0.816
20           1     0.000    -0.816
21           1     0.000    -0.816
22           1     0.000    -0.816
23           1     0.000    -0.816
24           1     0.000    -0.816
25           1     0.707     0.408
26           1     0.707     0.408
27           1     0.707     0.408
28           1     0.707     0.408
29           1     0.707     0.408
30           1     0.707     0.408
Model Coefficients
> summary.lm(barley.small.aov)
Call: aov(formula = yield ~ variety, data = barley.small)
Residuals:
   Min     1Q Median    3Q   Max
 -12.7 -8.294 -1.611 6.194 21.37
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 36.9889     2.5207 14.6741   0.0000
  variety.L  5.9790     4.3660  1.3695   0.1910
  variety.Q  3.0619     4.3660  0.7013   0.4939
Residual standard error: 10.69 on 15 degrees of freedom
Multiple R-Squared: 0.1363
F-statistic: 1.184 on 2 and 15 degrees of freedom, the p-value is 0.3332
Correlation of Coefficients:
          (Intercept) variety.L
variety.L           0
variety.Q           0         0
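The columns variety.L and variety.Q in the model matrix look like orthonormal linear and quadratic polynomial contrasts, with entries -1/sqrt(2), 0, 1/sqrt(2) and 1/sqrt(6), -2/sqrt(6), 1/sqrt(6). Under that reading (an interpretation of the printed output, not something the slides state), and because the design is balanced, each coefficient is just the contrast applied to the three variety means. A short Python check, with illustrative variable names:

```python
from math import sqrt

# Variety means from the barley.small slide, in level order.
means = [34.01111, 34.48889, 42.46666]   # Svansota, Velvet, Trebi

# Orthonormal polynomial contrasts for three levels (assumed coding).
linear = [-1 / sqrt(2), 0.0, 1 / sqrt(2)]
quadratic = [1 / sqrt(6), -2 / sqrt(6), 1 / sqrt(6)]

# Balanced design with orthonormal columns: coefficient = contrast . means.
intercept = sum(means) / len(means)
coef_linear = sum(c * m for c, m in zip(linear, means))
coef_quadratic = sum(c * m for c, m in zip(quadratic, means))

print(round(intercept, 4), round(coef_linear, 4), round(coef_quadratic, 4))
```

The three numbers agree with the Value column of the coefficient table (36.9889, 5.9790, 3.0619) to the printed precision.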
S-plus model.tables command gives treatment means or effects
> model.tables(barley.small.aov,type="mean")
Warning messages:
Model was refit to allow projection in: model.tables(tmp, type = "mean")
Tables of means
Grand mean
36.989
variety
Svansota Velvet  Trebi
  34.011 34.489 42.467
S-plus model.tables command gives treatment means or effects
> model.tables(barley.small.aov)
Warning messages:
Model was refit to allow projection in: model.tables(barley.small.aov)
Tables of effects
variety
Svansota  Velvet  Trebi
 -2.9778 -2.5000 5.4778
Analysis of Variance (ANOVA)
Homogeneity Hypothesis:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_a$ vs. $H_1$: Not all the $\mu_i$ are equal.
$H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$ vs. $H_1$: At least some $\tau_i \neq 0$.
Note SSR=SSA=Treatment sums of squares
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F
Treatments (A) | $SSA = \sum n_i (\bar{y}_i - \bar{y})^2$ | $a-1$ | $MSA = SSA/(a-1)$ | $F = MSA/MSE$
Error (E) | $SSE = \sum\sum (y_{ij} - \bar{y}_i)^2$ | $N-a$ | $MSE = SSE/(N-a)$ |
Total (T) | $SST = \sum\sum (y_{ij} - \bar{y})^2$ | $N-1$ | |
ANOVA table for model with 3 varieties of barley, year 1
> summary(aov(yield~variety,barley.small))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
variety    2   270.739 135.3694 1.183614 0.3332005
Residuals 15  1715.544 114.3696
ANOVA table for model with all 10 varieties of barley, year 1
> summary(aov(yield~variety,barley1))
          Df Sum of Sq  Mean Sq   F Value    Pr(F)
variety    9   646.262  71.8069 0.5963671 0.793823
Residuals 50  6020.357 120.4071
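The sums of squares in the three-variety table can be reproduced by hand from the barley.small yields. The Python sketch below follows the ANOVA table formulas (SSA, SSE, mean squares, F) and also returns the treatment effects; the function name and return structure are illustrative only, and the p-value is left out because it needs the F distribution.

```python
barley_small = {
    "Svansota": [35.13333, 47.33333, 25.76667, 40.46667, 29.66667, 25.70000],
    "Velvet":   [39.90000, 50.23333, 26.13333, 41.33333, 23.03333, 26.30000],
    "Trebi":    [36.56666, 63.83330, 43.76667, 46.93333, 29.76667, 33.93333],
}

def one_way_anova(groups):
    """One-way ANOVA sums of squares, mean squares, F, and treatment effects."""
    a = len(groups)
    n_total = sum(len(v) for v in groups.values())
    grand = sum(sum(v) for v in groups.values()) / n_total
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    ssa = sum(len(v) * (means[g] - grand) ** 2 for g, v in groups.items())
    sse = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
    msa = ssa / (a - 1)
    mse = sse / (n_total - a)
    effects = {g: means[g] - grand for g in groups}
    return {"SSA": ssa, "SSE": sse, "MSA": msa, "MSE": mse,
            "F": msa / mse, "effects": effects}

result = one_way_anova(barley_small)
print(round(result["SSA"], 3), round(result["SSE"], 3), round(result["F"], 4))
```

The effects returned here are the treatment means minus the grand mean, which is what the "Tables of effects" output reports.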
F-statistic for One-way ANOVA
$F = \frac{MSA}{MSE} \sim F_{a-1,\,N-a}$ under $H_0$
$E(MSE) = \sigma^2$
$E(MSA) = \sigma^2 + \frac{\sum_{i=1}^{a} n_i \tau_i^2}{a-1}$
Fitting model with continuous vs. character predictor
> summary(aov(barley.small$yield~varnum))
          Df Sum of Sq  Mean Sq F Value     Pr(F)
varnum     1   214.489 214.4889 1.93692 0.1830502
Residuals 16  1771.794 110.7371
> summary(aov(barley.small$yield~as.factor(varnum)))
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
as.factor(varnum)  2   270.739 135.3694 1.183614 0.3332005
Residuals         15  1715.544 114.3696
Equivalence of T test and ANOVA for model with single factor with 2 levels
> t.test(y[1:6],y[7:12])
Standard Two-Sample t-Test
data: y[1:6] and y[7:12]
t = -1.194, df = 10, p-value = 0.26
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-22.864726 6.909179
sample estimates:
mean of x mean of y
 34.48889  42.46666
> summary(aov(yield~variety,barley.vsmall))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
variety    1   190.935 190.9346 1.425727 0.2600178
Residuals 10  1339.209 133.9209
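The equivalence comes from the identity t² = F when the factor has two levels and the t-test pools the variances. A quick Python check on the Velvet and Trebi yields (the two groups whose means appear in the t-test output); variable names are illustrative:

```python
from math import sqrt

velvet = [39.90000, 50.23333, 26.13333, 41.33333, 23.03333, 26.30000]
trebi = [36.56666, 63.83330, 43.76667, 46.93333, 29.76667, 33.93333]

def mean(values):
    return sum(values) / len(values)

n1, n2 = len(velvet), len(trebi)
m1, m2 = mean(velvet), mean(trebi)

# Pooled variance: this is also the MSE of the one-way ANOVA with two groups.
sse = sum((y - m1) ** 2 for y in velvet) + sum((y - m2) ** 2 for y in trebi)
pooled_var = sse / (n1 + n2 - 2)

# Standard (equal-variance) two-sample t statistic.
t_stat = (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA F statistic with a = 2 groups (1 numerator degree of freedom).
grand = mean(velvet + trebi)
msa = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
f_stat = msa / pooled_var

print(round(t_stat, 3), round(f_stat, 4), round(t_stat ** 2, 4))
```

The squared t statistic and the F statistic agree, which is why both outputs report the same p-value of about 0.26.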
Model Diagnostics, residual vs. fitted value (all 10 varieties, year 1)
[Figure: resid(barley1.aov) vs. fitted(barley1.aov).]
Model Diagnostics, residual vs. observation number (all 10 varieties, year 1)
[Figure: resid(barley1.aov) vs. observation number (0 to 60).]
Model Diagnostics, normal plot of residuals (all 10 varieties, year 1)
[Figure: resid(barley1.aov) vs. Quantiles of Standard Normal.]
Model Diagnostics, histogram of residuals (all 10 varieties, year 1)
[Figure: histogram of resid(barley1.aov).]
Random Effects Model for a One-way Layout
When the treatment levels are determined by the experimenter (or those are the only levels of interest), the design is a fixed effects model.
• Goal is to measure the treatment effects or means (“pick the winner”).
When the treatment levels are a random sample from a population of possible treatment levels (e.g. workers in a factory) and the particular levels used in the experiment are not of any interest, the design is a random effects model.
• Goal is to measure the treatment variability (estimate the expected variability among workers).
Random Effects Model for a One-way Layout
Model: $Y_{ij} = \mu_i + \varepsilon_{ij} = \mu + \tau_i + \varepsilon_{ij}$ (looks similar to the fixed effects model), where
$\varepsilon_{ij} \sim N(0, \sigma^2)$
$\mu_i \sim N(\mu, \sigma_A^2)$ or $\tau_i \sim N(0, \sigma_A^2)$ (constants in fixed effects model)
$\mathrm{Var}(Y_{ij}) = \mathrm{Var}(\mu_i) + \mathrm{Var}(\varepsilon_{ij}) = \sigma_A^2 + \sigma^2$
$\sigma_A^2$ = variance among, $\sigma^2$ = variance within
With balanced one-way layout, n observations per treatment:
$E(MSE) = \sigma^2$
$E(MSA) = \sigma^2 + n\sigma_A^2$
Can estimate $\sigma_A^2$ as (MSA-MSE)/n (if you are lucky!)
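The "if you are lucky" caveat refers to the fact that (MSA-MSE)/n can come out negative when MSA < MSE. The Python sketch below applies the estimator to the mean squares from the two barley ANOVA tables shown earlier, purely as arithmetic illustration: those tables came from fixed-effects fits, and treating variety as random here is an assumption made only for the example. Truncating a negative estimate at zero is one common convention, also an assumption.

```python
def variance_component(msa, mse, n):
    """Method-of-moments estimate of sigma_A^2 from a balanced one-way layout.

    Returns (raw, truncated): the raw estimate (MSA - MSE)/n, and the same
    value truncated at zero (an assumed convention for negative estimates).
    """
    raw = (msa - mse) / n
    return raw, max(raw, 0.0)

# Three varieties, n = 6 sites each: MSA > MSE, so the estimate is positive.
raw3, est3 = variance_component(135.3694, 114.3696, 6)

# All ten varieties, n = 6 sites each: MSA < MSE, so the raw estimate is negative.
raw10, est10 = variance_component(71.8069, 120.4071, 6)

print(round(raw3, 4), round(raw10, 4), est10)
```

In the ten-variety case the raw estimate is negative, which is the "unlucky" outcome: the data give no evidence of variability among varieties beyond the within-variety noise.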
Randomized Block Design
See Figure 3.2 on page 99 of the course textbook.
Barley Example
10 varieties, 6 sites
> ym
                 University Farm   Waseca   Morris Crookston Grand Rapids   Duluth Variety Mean
Manchuria               27.00000 48.86667 27.43334  39.93333     32.96667 28.96667     34.19445
Glabron                 43.06666 55.20000 28.76667  38.13333     29.13333 29.66667     37.32778
Svansota                35.13333 47.33333 25.76667  40.46667     29.66667 25.70000     34.01111
Velvet                  39.90000 50.23333 26.13333  41.33333     23.03333 26.30000     34.48889
Trebi                   36.56666 63.83330 43.76667  46.93333     29.76667 33.93333     42.46666
No. 457                 43.26667 58.10000 28.70000  45.66667     32.16667 33.60000     40.25000
No. 462                 36.60000 65.76670 30.36667  48.56666     24.93334 28.10000     39.05556
Peatland                32.76667 48.56666 29.86667  41.60000     34.70000 32.00000     36.58333
No. 475                 24.66667 46.76667 22.60000  44.10000     19.70000 33.06666     31.81667
Wisconsin No. 38        39.30000 58.80000 29.46667  49.86667     34.46667 31.60000     40.58333
Site Mean               35.82667 54.34667 29.28667  43.66000     29.05334 30.29333     37.07778
Randomized Block Design (RBD) Method
$Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij} \quad (i = 1, \ldots, a;\ j = 1, \ldots, b)$
$\sum_{i=1}^{a} \tau_i = 0$: a-1 independent treatment effects
$\sum_{j=1}^{b} \beta_j = 0$: b-1 independent block effects
For more information, see 12.4, page 482 in course textbook.
No Interactions Between Treatments and Blocks
$\mu_{ij} - \mu_{i'j} = (\mu + \tau_i + \beta_j) - (\mu + \tau_{i'} + \beta_j) = \tau_i - \tau_{i'}$
Formula from page 483 in the course textbook.
RBD: Sums of Squares
See formulas 12.17, 12.18, and 12.19 on pages 484-5 in the course textbook.
ANOVA tables for models for barley data set
> summary(aov(yield~variety,barley1))
          Df Sum of Sq  Mean Sq   F Value    Pr(F)
variety    9   646.262  71.8069 0.5963671 0.793823
Residuals 50  6020.357 120.4071
> summary(aov(yield~variety+site,barley1))
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
variety    9   646.262   71.807  3.67995 0.001612103
site       5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513
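Adding site as a block moves most of the one-way residual sum of squares into the site term, which is why the variety F value rises from about 0.60 to 3.68. The decomposition can be reproduced from the 10 x 6 yield table with a short Python sketch of the randomized block sums of squares (treatments = varieties, blocks = sites); the variable names are illustrative.

```python
# Rows: varieties; columns: University Farm, Waseca, Morris, Crookston,
# Grand Rapids, Duluth (values copied from the barley example table).
yields = {
    "Manchuria":        [27.00000, 48.86667, 27.43334, 39.93333, 32.96667, 28.96667],
    "Glabron":          [43.06666, 55.20000, 28.76667, 38.13333, 29.13333, 29.66667],
    "Svansota":         [35.13333, 47.33333, 25.76667, 40.46667, 29.66667, 25.70000],
    "Velvet":           [39.90000, 50.23333, 26.13333, 41.33333, 23.03333, 26.30000],
    "Trebi":            [36.56666, 63.83330, 43.76667, 46.93333, 29.76667, 33.93333],
    "No. 457":          [43.26667, 58.10000, 28.70000, 45.66667, 32.16667, 33.60000],
    "No. 462":          [36.60000, 65.76670, 30.36667, 48.56666, 24.93334, 28.10000],
    "Peatland":         [32.76667, 48.56666, 29.86667, 41.60000, 34.70000, 32.00000],
    "No. 475":          [24.66667, 46.76667, 22.60000, 44.10000, 19.70000, 33.06666],
    "Wisconsin No. 38": [39.30000, 58.80000, 29.46667, 49.86667, 34.46667, 31.60000],
}

rows = list(yields.values())
a, b = len(rows), len(rows[0])                      # a treatments, b blocks
grand = sum(sum(r) for r in rows) / (a * b)
row_means = [sum(r) / b for r in rows]              # variety means
col_means = [sum(r[j] for r in rows) / a for j in range(b)]   # site means

ss_variety = b * sum((m - grand) ** 2 for m in row_means)
ss_site = a * sum((m - grand) ** 2 for m in col_means)
ss_error = sum((rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
               for i in range(a) for j in range(b))

ms_error = ss_error / ((a - 1) * (b - 1))
f_variety = (ss_variety / (a - 1)) / ms_error
f_site = (ss_site / (b - 1)) / ms_error
print(round(ss_variety, 3), round(ss_site, 3), round(ss_error, 3))
```

The variety sum of squares is the same in both tables; only the error term it is compared against changes.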
Type 1 and Type 3 Sums of Squares for barley example (balanced design)
> summary(barley12.aov)
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
variety    9   646.262   71.807  3.67995 0.001612103
site       5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513
> summary(barley12.aov,ssType=3)
Type III Sum of Squares
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
variety    9   646.262   71.807  3.67995 0.001612103
site       5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513
Degrees of Freedom
Effects in barley model
> model.tables(barley12.aov,type="effects")
Warning messages:
Model was refit to allow projection in: model.tables(barley12.aov, type = "effects")
Tables of effects
variety
 Svanso No. 462   Manch No. 475  Velvet  Peatla Glabron No. 457 Wisc No. 38  Trebi
-3.0667  1.9778 -2.8833 -5.2611 -2.5889 -0.4944  0.2500  3.1722      3.5056 5.3889
site
Grand Rapids Duluth University Farm Morris Crookston Waseca
      -8.024 -6.784          -1.251 -7.791     6.582 17.269
Analysis of Multifactor Experiments
Corresponds to Chapter 13 of Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford (Johns Hopkins University)
Analysis of Multifactor Experiments
(See Table 13.1 on page 505 of the course textbook.)
Model and estimates
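$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}$
$\hat{\mu} = \bar{y}_{...}$
$\hat{\tau}_i = \bar{y}_{i..} - \bar{y}_{...}$
$\hat{\beta}_j = \bar{y}_{.j.} - \bar{y}_{...}$
$\widehat{(\tau\beta)}_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}$
$\hat{y}_{ijk} = \bar{y}_{ij.}$
$e_{ijk} = y_{ijk} - \hat{y}_{ijk} = y_{ijk} - \bar{y}_{ij.}$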
For any model
$y$ = vector of observed response values
$\hat{y}$ = vector of fitted values
$\bar{y}$ = vector of grand mean
$SST = SSTotal = (y - \bar{y})'(y - \bar{y})$
$SSM = SSModel = (\hat{y} - \bar{y})'(\hat{y} - \bar{y})$
$SSE = SSError = (y - \hat{y})'(y - \hat{y})$
• Biochemical Reactions of Cells Treated with Puromycin
• SUMMARY: The "Balanced" Puromycin data frame has 24 rows representing the measurement of initial velocity of a biochemical reaction for 6 different concentrations of substrate and two different cell treatments. This data frame contains the following variables (columns):
• ARGUMENTS:
• conc: the concentration of the substrate.
• vel: the initial velocity of the reaction.
• state: a factor telling whether the cells involved were treated or untreated.
Scatterplot matrix for puromycin data set
[Figure: scatterplot matrix of conc, state (untr, trtd), and vel.]
plot.factor(conc,vel)
[Figure: vel (50 to 200) by level of f(conc): 0.02, 0.06, 0.11, 0.22, 0.56, 1.1.]
plot.factor(state,vel)
[Figure: vel (50 to 200) by state (untreated, treated).]
Velocity in "Balanced" puromycin data set
conc   treated    untreated
0.02    76   47     67   51
0.06    97  107     84   86
0.11   123  139     98  115
0.22   159  152    131  124
0.56   191  201    144  158
1.10   207  200    160  162
Histogram of velocity
[Figure: histogram of vel.]
interaction.plot(pyb$state,pyb$conc,pyb$vel)
[Figure: mean of pyb$vel (60 to 200) vs. pyb$state (untreated, treated), one line per level of pyb$conc (1.1, 0.56, 0.22, 0.11, 0.06, 0.02).]
interaction.plot(pyb$conc,pyb$state,pyb$vel)
[Figure: mean of pyb$vel (60 to 200) vs. pyb$conc (0.02 to 1.1), one line per level of pyb$state (treated, untreated).]
Summaries of puromycin model
Residuals:
   Min 1Q      Median 3Q  Max
 -14.5 -5 -4.441e-016  5 14.5
Residual standard error: 9.559 on 12 degrees of freedom
Multiple R-Squared: 0.9784
F-statistic: 49.5 on 11 and 12 degrees of freedom, the p-value is 2.919e-008
           Df Sum of Sq  Mean Sq  F Value      Pr(F)
state       1   4240.04 4240.042 46.40264 0.00001871
conc        5  44243.71 8848.742 96.83985 0.00000000
state:conc  5   1270.71  254.142  2.78130 0.06803651
Residuals  12   1096.50   91.375
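The sums of squares in this table follow directly from the estimates on the "Model and estimates" slide: main effects from marginal means, interaction from cell means minus the additive fit, and error from deviations within each cell. A Python sketch using the velocity table (two replicates per cell); the variable names are illustrative.

```python
# vel[state][conc] = the two replicate velocities in that cell.
vel = {
    "treated":   {0.02: [76, 47], 0.06: [97, 107], 0.11: [123, 139],
                  0.22: [159, 152], 0.56: [191, 201], 1.10: [207, 200]},
    "untreated": {0.02: [67, 51], 0.06: [84, 86], 0.11: [98, 115],
                  0.22: [131, 124], 0.56: [144, 158], 1.10: [160, 162]},
}
states = list(vel)
concs = list(vel["treated"])
n = 2                                               # replicates per cell

all_y = [y for s in states for c in concs for y in vel[s][c]]
grand = sum(all_y) / len(all_y)
cell = {(s, c): sum(vel[s][c]) / n for s in states for c in concs}
state_mean = {s: sum(cell[s, c] for c in concs) / len(concs) for s in states}
conc_mean = {c: sum(cell[s, c] for s in states) / len(states) for c in concs}

ss_state = n * len(concs) * sum((state_mean[s] - grand) ** 2 for s in states)
ss_conc = n * len(states) * sum((conc_mean[c] - grand) ** 2 for c in concs)
ss_inter = n * sum((cell[s, c] - state_mean[s] - conc_mean[c] + grand) ** 2
                   for s in states for c in concs)
ss_error = sum((y - cell[s, c]) ** 2
               for s in states for c in concs for y in vel[s][c])

ms_error = ss_error / (len(all_y) - len(states) * len(concs))
f_inter = (ss_inter / ((len(states) - 1) * (len(concs) - 1))) / ms_error
print(round(ss_state, 2), round(ss_conc, 2), round(ss_inter, 2), ss_error)
```

Dropping the interaction term pools its sum of squares with the error, so a no-interaction fit has residual sum of squares `ss_inter + ss_error` on 17 degrees of freedom.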
Observed velocity and fitted values for puromycin model with interaction
       Observed                  Fitted Values
conc   treated    untreated     treated        untreated
0.02    76   47     67   51     61.5   61.5     59.0   59.0
0.06    97  107     84   86    102.0  102.0     85.0   85.0
0.11   123  139     98  115    131.0  131.0    106.5  106.5
0.22   159  152    131  124    155.5  155.5    127.5  127.5
0.56   191  201    144  158    196.0  196.0    151.0  151.0
1.10   207  200    160  162    203.5  203.5    161.0  161.0
model.tables
Tables of means
Grand mean
128.29
state
untreated treated
   115.00  141.58
conc
 0.02  0.06   0.11   0.22   0.56    1.1
60.25 93.50 118.75 141.50 173.50 182.25
state:conc
Dim 1 : state
Dim 2 : conc
           0.02  0.06  0.11  0.22  0.56   1.1
untreated  59.0  85.0 106.5 127.5 151.0 161.0
treated    61.5 102.0 131.0 155.5 196.0 203.5
multicomp(pyb.aov,focus="concf")
95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method
critical point: 3.3595
response variable: vel
intervals excluding 0 are flagged by '****'
          Estimate Std.Error Lower Bound Upper Bound
0.02-0.06   -33.20      6.76       -56.0    -10.5000 ****
0.02-0.11   -58.50      6.76       -81.2    -35.8000 ****
0.02-0.22   -81.20      6.76      -104.0    -58.5000 ****
0.02-0.56  -113.00      6.76      -136.0    -90.5000 ****
0.02-1.1   -122.00      6.76      -145.0    -99.3000 ****
0.06-0.11   -25.30      6.76       -48.0     -2.5400 ****
0.06-0.22   -48.00      6.76       -70.7    -25.3000 ****
0.06-0.56   -80.00      6.76      -103.0    -57.3000 ****
0.06-1.1    -88.70      6.76      -111.0    -66.0000 ****
0.11-0.22   -22.70      6.76       -45.5     -0.0425 ****
0.11-0.56   -54.70      6.76       -77.5    -32.0000 ****
0.11-1.1    -63.50      6.76       -86.2    -40.8000 ****
0.22-0.56   -32.00      6.76       -54.7     -9.2900 ****
0.22-1.1    -40.70      6.76       -63.5    -18.0000 ****
0.56-1.1     -8.75      6.76       -31.5     14.0000
Residual vs. fit for puromycin model
[Figure: resid(pyb.aov) (-15 to 15) vs. fitted(pyb.aov).]
qqplot of residuals for puromycin model
[Figure: resid(pyb.aov) (-15 to 15) vs. Quantiles of Standard Normal.]
Summaries of puromycin model without interaction
Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04
Residual standard error: 11.8 on 17 degrees of freedom
Multiple R-Squared: 0.9534
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
conc       5  44243.71 8848.742 63.54684 0.00000000021
state      1   4240.04 4240.042 30.44967 0.00003762498
Residuals 17   2367.21  139.248
Observed velocity and fitted values for puromycin model without interaction
       Observed                  Fitted
conc   treated    untreated     treated            untreated
0.02    76   47     67   51     73.542  73.542     46.958  46.958
0.06    97  107     84   86    106.792 106.792     80.208  80.208
0.11   123  139     98  115    132.042 132.042    105.458 105.458
0.22   159  152    131  124    154.792 154.792    128.208 128.208
0.56   191  201    144  158    186.792 186.792    160.208 160.208
1.10   207  200    160  162    195.542 195.542    168.958 168.958
Plot of residual vs. fit for puromycin model without interaction
[Figure: Residuals (-20 to 20) vs. Fitted : conc + state; observations 21, 13, and 2 are labeled.]
Plot of velocity vs. concentration
[Figure: vel (50 to 200) vs. conc.]
Call: aov(formula = vel ~ conc + conc^2 + state)
Residuals:
   Min    1Q Median    3Q   Max
 -45.4 -6.93  4.227 7.902 23.94
Coefficients:
                Value Std. Error t value Pr(>|t|)
(Intercept)   73.0885     6.0136 12.1539   0.0000
conc         304.9581    37.3027  8.1752   0.0000
I(conc^2)   -188.9327    32.5953 -5.7963   0.0000
state         13.2917     3.4172  3.8897   0.0009
Residual standard error: 16.74 on 20 degrees of freedom
Multiple R-Squared: 0.8898
F-statistic: 53.82 on 3 and 20 degrees of freedom, the p-value is 9.291e-010
> summary(pyb2.aov)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
conc       1  31590.27 31590.27 112.7215 0.0000000011
I(conc^2)  1   9415.64  9415.64  33.5972 0.0000113551
state      1   4240.04  4240.04  15.1295 0.0009104989
Residuals 20   5605.01   280.25
Plot of residual vs. fit for pyb2.aov
[Figure: Residuals (-40 to 20) vs. Fitted : conc + conc^2 + state; observations 18, 21, and 2 are labeled.]
qqplot of residuals for pyb2.aov
[Figure: Residuals (-40 to 20) vs. Quantiles of Standard Normal; observations 18, 21, and 2 are labeled.]
Call: aov(formula = vel ~ conc + conc^2 + conc^3 + conc^4 + conc^5 + state)
Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04
Coefficients:
Residual standard error: 11.8 on 17 degrees of freedom
Multiple R-Squared: 0.9534
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010
> summary(pyb5.aov)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
conc       1  31590.27 31590.27 226.8641 0.0000000
I(conc^2)  1   9415.64  9415.64  67.6180 0.0000003
I(conc^3)  1   2603.71  2603.71  18.6984 0.0004604
I(conc^4)  1    631.13   631.13   4.5324 0.0481759
I(conc^5)  1      2.96     2.96   0.0213 0.8857934
state      1   4240.04  4240.04  30.4497 0.0000376
Residuals 17   2367.21   139.25
Plot of residual vs. fit for pyb5.aov
[Figure: Residuals (-20 to 20) vs. Fitted : conc + conc^2 + conc^3 + conc^4 + conc^5 + state; observations 21, 13, and 2 are labeled.]
Guayule data set
• Rate of Germination of Treated Guayule Seeds
• SUMMARY: The guayule data frame, a design object, has 96 rows and 5 columns. The guayule is a Mexican plant from which rubber is manufactured. Batches of 100 seeds of eight varieties (variety) of guayule were given one of four treatments (treatment), and planted; the number of plants that came up in each batch (plants) was recorded.
• ARGUMENTS:
• variety: factor with levels V1 through V8 labeling the variety of guayule.
• treatment: factor with levels T1 through T4 labeling the treatment given to the seeds.
• plants: numeric vector giving the number of seeds out of a batch of 100 that germinated.
pairs(gy)
[Figure: scatterplot matrix of variety, treatment, and plants.]
plot.factor(gy$variety,gy$plants)
[Figure: gy$plants (20 to 80) by gy$variety (V1 to V8).]
plot.factor(gy$treatment,gy$plants)
[Figure: gy$plants (20 to 80) by gy$treatment (T1 to T4).]
interaction.plot(gy$variety,gy$treatment,gy$plants)
[Figure: mean of gy$plants (10 to 60) vs. gy$variety (V1 to V8), one line per level of gy$treatment (T1, T3, T2, T4).]
interaction.plot(gy$treatment,gy$variety,gy$plants)
[Figure: mean of gy$plants (10 to 60) vs. gy$treatment (T1 to T4), one line per level of gy$variety (V6, V8, V5, V3, V2, V7, V4, V1).]
hist(gy$plants)
[Figure: histogram of gy$plants.]
Summaries of gy.aov
Call: aov(formula = plants ~ variety * treatment, data = gy)
Residuals:
    Min     1Q     Median   3Q Max
 -16.33 -2.667 1.494e-015 2.75  16
Residual standard error: 6.348 on 64 degrees of freedom
Multiple R-Squared: 0.9298
F-statistic: 27.35 on 31 and 64 degrees of freedom, the p-value is 0
> summary(gy.aov)
                  Df Sum of Sq  Mean Sq  F Value      Pr(F)
variety            7    763.16   109.02   2.7058 0.01604076
treatment          3  30774.28 10258.09 254.5959 0.00000000
variety:treatment 21   2620.14   124.77   3.0966 0.00026666
Residuals         64   2578.67    40.29
Plot of residual vs. fit for gy data set
[Figure: Residuals (-15 to 15) vs. Fitted : variety * treatment; observations 3, 35, and 34 are labeled.]
model.tables(gy.aov,type="mean")
Tables of means
Grand mean
25.302
variety
    V1     V2     V3     V4     V5     V6     V7     V8
24.667 26.833 28.833 21.000 21.917 28.167 23.250 27.750
treatment
    T1     T2     T3     T4
55.833 13.917 20.042 11.417
variety:treatment
Dim 1 : variety
Dim 2 : treatment
       T1     T2     T3     T4
V1 66.333 11.667 12.333  8.333
V2 63.333 18.333 14.333 11.333
V3 65.000 12.667 26.333 11.333
V4 50.333 10.000 14.000  9.667
V5 49.333 16.333 10.333 11.667
V6 58.000  8.000 29.667 17.000
V7 46.333 14.667 22.000 10.000
V8 48.000 19.667 31.333 12.000
multicomp(gy.aov,focus="treatment")
95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method
critical point: 2.6378
response variable: plants
intervals excluding 0 are flagged by '****'
      Estimate Std.Error Lower Bound Upper Bound
T1-T2    41.90      1.83       37.10       46.80 ****
T1-T3    35.80      1.83       31.00       40.60 ****
T1-T4    44.40      1.83       39.60       49.30 ****
T2-T3    -6.12      1.83      -11.00       -1.29 ****
T2-T4     2.50      1.83       -2.33        7.33
T3-T4     8.62      1.83        3.79       13.50 ****
Guayule ANOVA with variety random
> gyr.tab
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
treatment          3  30774.28 10258.09 82.21711 0.0000000
variety            7    763.16   109.02  0.87380 0.5428964
treatment:variety 21   2620.14   124.77  3.09663 0.0002667
Residuals         64   2578.67    40.29
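Compared with the fixed-effects table for gy.aov, the sums of squares and mean squares are unchanged; only the F values for the main effects differ. The printed F values are consistent with dividing each main-effect mean square by the treatment:variety mean square instead of the residual mean square, which the following Python arithmetic illustrates (this is a reading of the printed numbers, not a statement of how gyr.tab was produced):

```python
# Mean squares copied from the gyr.tab output (rounded as printed).
ms = {
    "treatment": 10258.09,
    "variety": 109.02,
    "treatment:variety": 124.77,
    "Residuals": 40.29,
}

# Main effects compared against the interaction mean square.
f_treatment = ms["treatment"] / ms["treatment:variety"]
f_variety = ms["variety"] / ms["treatment:variety"]

# Interaction still compared against the residual mean square.
f_interaction = ms["treatment:variety"] / ms["Residuals"]

print(round(f_treatment, 2), round(f_variety, 3), round(f_interaction, 3))
```

Small differences from the printed F values in the last digits come from using the rounded mean squares.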
Random if:
• Not interested in those particular factor levels (e.g. batches)
• Levels of factor are randomly chosen from a larger population of factor levels (e.g. 10 universities selected from all universities in country).
• Want to generalize to a larger population of factor levels.
EMS for 2-factor models
(See Table 24.5 on page 981 of Neter et al. Applied Linear Statistical Models.)
Nested vs. Crossed Design
(See Figure 28.1 in Neter et al. Applied Linear Statistical Models.)
Nested Fixed Factors
(See Table 28.3 on page 1129 of Neter et al. Applied Linear Statistical Models.)
Nested Mixed Factors
(See Table 28.5 on page 1133 of Neter et al. Applied Linear Statistical Models.)
Cross-Nested Models
(See Table 28.11 on page 1151 of Neter et al. Applied Linear Statistical Models.)
Images of book covers:
Patrick O’Brian, The Commodore.
Patrick O’Brian, The Fortune of War.
Nested Factors
• Speed of Firing Naval Guns
• SUMMARY: The gun data frame, a design object, has 36 rows representing runs of a team of 3 men loading and firing naval guns attempting to get off as many rounds per minute as possible. The three predictor variables (columns) specify the team and the physique of the men on it and the loading method used; the outcome variable is the rounds fired per minute.
• ARGUMENTS:
• Method: factor giving one of two methods for loading rounds into Naval guns. Levels are M1 and M2.
• Physique: an ordered factor giving the physique of the men: S for slight, A for average, and H for heavy.
• Team: factor with levels T1, T2 or T3. In fact there are nine teams, three of each physique, i.e. a slight T1, an average T1, and a heavy T1, etc.
• Rounds: numeric vector giving the number of rounds per minute fired by a team.
gun
   Method Physique Team Rounds
 1     M1        S   T1   20.2
 2     M2        S   T1   14.2
 3     M1        A   T1   22.0
 4     M2        A   T1   14.1
 5     M1        H   T1   23.1
 6     M2        H   T1   14.1
 7     M1        S   T2   26.2
 8     M2        S   T2   18.0
 9     M1        A   T2   22.6
10     M2        A   T2   14.0
11     M1        H   T2   22.9
12     M2        H   T2   12.2
13     M1        S   T3   23.8
14     M2        S   T3   12.5
15     M1        A   T3   22.9
16     M2        A   T3   13.7
17     M1        H   T3   21.8
18     M2        H   T3   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2
gun
   Method Physique Team Rounds
 1     M1        S   T1   20.2
 2     M2        S   T1   14.2
 3     M1        A   T2   22.0
 4     M2        A   T2   14.1
 5     M1        H   T3   23.1
 6     M2        H   T3   14.1
 7     M1        S   T4   26.2
 8     M2        S   T4   18.0
 9     M1        A   T5   22.6
10     M2        A   T5   14.0
11     M1        H   T6   22.9
12     M2        H   T6   12.2
13     M1        S   T7   23.8
14     M2        S   T7   12.5
15     M1        A   T8   22.9
16     M2        A   T8   13.7
17     M1        H   T9   21.8
18     M2        H   T9   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2
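The second listing differs from the first only in the Team column: because teams are nested within physique, the three within-physique labels T1 to T3 are replaced by nine distinct labels T1 to T9. The mapping visible in the two listings can be written as a small Python helper (the function and constant names are hypothetical):

```python
# Physique levels in the order used by the relabeling seen in the listings.
PHYSIQUE_ORDER = ["S", "A", "H"]

def unique_team(physique, team):
    """Map a (physique, within-physique team) pair to a distinct label T1..T9.

    Example from the listings: slight T1 -> T1, average T1 -> T2,
    heavy T1 -> T3, slight T2 -> T4, and so on.
    """
    within = int(team[1:])                          # "T2" -> 2
    index = len(PHYSIQUE_ORDER) * (within - 1) + PHYSIQUE_ORDER.index(physique) + 1
    return "T%d" % index

# The nine distinct teams, in the order they first appear in the data.
labels = [unique_team(p, t) for t in ("T1", "T2", "T3") for p in PHYSIQUE_ORDER]
print(labels)
```

With distinct labels, a model can treat team as its own factor without confusing, say, the slight T1 with the heavy T1.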
Speed of firing of naval guns
          Slight          Average         Heavy
Method 1  T1: 20.2, 24.1  T2: 22.0, 23.5  T3: 23.1, 22.9
          T4: 26.2, 26.9  T5: 22.6, 24.6  T6: 22.9, 23.7
          T7: 23.8, 24.9  T8: 22.9, 25.0  T9: 21.8, 23.5
Method 2  T1: 14.2, 16.2  T2: 14.1, 16.1  T3: 14.1, 16.1
          T4: 18.0, 19.1  T5: 14.0, 18.1  T6: 12.2, 13.8
          T7: 12.5, 15.4  T8: 13.7, 16.0  T9: 12.7, 15.1
pairs(gun2)
49
[Figure: scatterplot matrix of method, physique, team, and rounds]
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
50
Method Effect
[Figure: mean of rounds by method (M1, M2)]
51
Physique Effect
[Figure: mean of rounds by physique (S, A, H)]
52
Team Effect
[Figure: mean of rounds by team (1 to 9)]
53
Method-Physique Interaction
[Figure: interaction plot of mean of rounds against method (M1, M2), one line per physique (S, A, H)]
ANOVA tables for firing of naval guns example (with teams numbered 1-9)
54
> summary(aov(rounds~phys*meth*team))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
     phys  2   16.0517   8.0258   3.4736 0.0529995
     meth  1  651.9511 651.9511 282.1621 0.0000000
     team  6   39.2583   6.5431   2.8318 0.0403140
phys:meth  2    1.1872   0.5936   0.2569 0.7762240
meth:team  6   10.7217   1.7869   0.7734 0.6009376
Residuals 18   41.5900   2.3106

> summary(aov(rounds~phys*meth*team%in%phys))
                      Df Sum of Sq  Mean Sq  F Value     Pr(F)
                 phys  2   16.0517   8.0258   3.4736 0.0529995
                 meth  1  651.9511 651.9511 282.1621 0.0000000
            phys:meth  2    1.1872   0.5936   0.2569 0.7762240
       team %in% phys  6   39.2583   6.5431   2.8318 0.0403140
meth:(team %in% phys)  6   10.7217   1.7869   0.7734 0.6009376
            Residuals 18   41.5900   2.3106
> model.tables(gunaov,type="mean")
Tables of means
Grand mean
 19.333

Method
     M1     M2
 23.589 15.078

Physique
      S      A      H
 20.125 19.383 18.492

Team %in% Physique
Dim 1 : Physique
Dim 2 : Team
      T1     T2     T3
S 18.675 22.550 19.150
A 18.925 19.825 19.400
H 19.050 18.150 18.275
55
56
Tables of means
Grand mean
 19.333

method
        M1     M2
    23.589 15.078
rep 18.000 18.000

physique
         S      A      H
    20.125 19.383 18.492
rep 12.000 12.000 12.000

team %in% physique
Dim 1 : physique
Dim 2 : team
         1      2      3      4      5      6      7      8      9
S   18.675                22.550                19.150
rep  4.000  0.000  0.000  4.000  0.000  0.000  4.000  0.000  0.000
A          18.925                19.825                19.400
rep  0.000  4.000  0.000  0.000  4.000  0.000  0.000  4.000  0.000
H                 19.050                18.150                18.275
rep  0.000  0.000  4.000  0.000  0.000  4.000  0.000  0.000  4.000
Summaries of firing of naval guns example (without interaction)
57
Call: aov(formula = Rounds ~ Method + Physique/Team, data = gun)

Residuals:
    Min      1Q     Median     3Q   Max
 -2.731 -0.7368 2.498e-016 0.9972 2.531

Residual standard error: 1.434 on 26 degrees of freedom
Multiple R-Squared: 0.9297
F-statistic: 38.19 on 9 and 26 degrees of freedom, the p-value is 9.602e-013

> summary(gunaov)
                   Df Sum of Sq  Mean Sq  F Value      Pr(F)
            Method  1  651.9511 651.9511 316.8426 0.00000000
          Physique  2   16.0517   8.0258   3.9005 0.03300457
Team %in% Physique  6   39.2583   6.5431   3.1799 0.01782181
         Residuals 26   53.4989   2.0576
Plot of residual vs fit for gun.aov
58
[Figure: residuals vs. fitted values (Fitted : Method + Physique/Team)]
2^k Factorial Designs
• Exploratory experimental studies.
• Multifactor experiment in which each factor is studied at two levels.
• Used to screen a large number of factors to identify the most important.
• Sometimes 2 levels occur naturally, e.g. present or absent, smoker or non-smoker.
• k factors => 2^k treatment combinations
59
2^k Factorial Design Example
Example: 13.19, page 553 of the course textbook.
60
61
pairs(nw.df)
[Figure: scatterplot matrix of y, a, b, and c]
62
hist(y)
[Figure: histogram of y]
Effect of a
63
[Figure: mean of y at a = -1 and a = 1]
64
Effect of b
[Figure: mean of y at b = -1 and b = 1]
65
Effect of c
[Figure: mean of y at c = -1 and c = 1]
66
interaction.plot(a,b,y)
[Figure: mean of y against a (-1, 1), one line per level of b (-1, 1)]
67
interaction.plot(a,c,y)
[Figure: mean of y against a (-1, 1), one line per level of c (-1, 1)]
68
interaction.plot(b,c,y)
[Figure: mean of y against b (-1, 1), one line per level of c (-1, 1)]
summary.lm(nw.aov)
Call: aov(formula = y ~ a * b * c, data = nw.df)
Residuals:
    Min     1Q Median    3Q   Max
 -37.67 -6.861  2.388 12.67 28.67

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) 171.1942     4.6675  36.6780   0.0000
          a -17.6942     4.6675  -3.7909   0.0016
          b -76.5833     4.6675 -16.4078   0.0000
          c  13.3333     4.6675   2.8566   0.0114
        a:b -14.8050     4.6675  -3.1719   0.0059
        a:c  16.6667     4.6675   3.5708   0.0026
        b:c   4.9442     4.6675   1.0593   0.3052
      a:b:c -25.0558     4.6675  -5.3682   0.0001

Residual standard error: 22.87 on 16 degrees of freedom
Multiple R-Squared: 0.9556
F-statistic: 49.21 on 7 and 16 degrees of freedom, the p-value is 1.209e-009

Effect (of going from low to high level) is 2*regression coefficient
69
model.matrix(nw.aov)
   (Intercept)  a  b  c a:b a:c b:c a:b:c
 1           1 -1 -1 -1   1   1   1    -1
 2           1  1 -1 -1  -1  -1   1     1
 3           1 -1  1 -1  -1   1  -1     1
 4           1 -1 -1  1   1  -1  -1     1
 5           1  1  1 -1   1  -1  -1    -1
 6           1  1 -1  1  -1   1  -1    -1
 7           1 -1  1  1  -1  -1   1    -1
 8           1  1  1  1   1   1   1     1
 9           1 -1 -1 -1   1   1   1    -1
10           1  1 -1 -1  -1  -1   1     1
11           1 -1  1 -1  -1   1  -1     1
12           1 -1 -1  1   1  -1  -1     1
13           1  1  1 -1   1  -1  -1    -1
14           1  1 -1  1  -1   1  -1    -1
15           1 -1  1  1  -1  -1   1    -1
16           1  1  1  1   1   1   1     1
17           1 -1 -1 -1   1   1   1    -1
18           1  1 -1 -1  -1  -1   1     1
19           1 -1  1 -1  -1   1  -1     1
20           1 -1 -1  1   1  -1  -1     1
21           1  1  1 -1   1  -1  -1    -1
22           1  1 -1  1  -1   1  -1    -1
23           1 -1  1  1  -1  -1   1    -1
24           1  1  1  1   1   1   1     1
70
X’X Matrix
t(X)%*%X
            (Intercept)  a  b  c a:b a:c b:c a:b:c
(Intercept)          24  0  0  0   0   0   0     0
          a           0 24  0  0   0   0   0     0
          b           0  0 24  0   0   0   0     0
          c           0  0  0 24   0   0   0     0
        a:b           0  0  0  0  24   0   0     0
        a:c           0  0  0  0   0  24   0     0
        b:c           0  0  0  0   0   0  24     0
      a:b:c           0  0  0  0   0   0   0    24
71
n*(X’X)^-1 X’
> solve(t(X)%*%X)%*%t(X)*24
             1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
(Intercept)  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
          a -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1
          b -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1
          c -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1
        a:b  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1
        a:c  1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1
        b:c  1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1
      a:b:c -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1
72
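The orthogonality shown in the X’X matrix can be checked by hand. The following is a minimal Python sketch (not the S-PLUS code used in the course) that builds a replicated 2^3 design in coded units and confirms that X’X = 24*I, so each least-squares coefficient is simply X’y/n. The names `runs`, `model_row`, and `coefficients` are illustrative, and the run order need not match the slide's row order.

```python
from itertools import product

# Full 2^3 factorial in coded units (-1 = low, +1 = high), replicated 3 times.
# Run order is illustrative only; it need not match the order on the slide.
runs = [(a, b, c) for a, b, c in product((-1, 1), repeat=3)] * 3


def model_row(a, b, c):
    """One row of X: intercept, main effects, two-way and three-way products."""
    return [1, a, b, c, a * b, a * c, b * c, a * b * c]


X = [model_row(a, b, c) for a, b, c in runs]
n, p = len(X), len(X[0])

# X'X: entry (i, j) is the dot product of columns i and j of X.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]


def coefficients(y):
    """Least-squares coefficients; valid here because X'X = n * I, so b = X'y / n."""
    return [sum(row[j] * yi for row, yi in zip(X, y)) / n for j in range(p)]


for line in XtX:
    print(line)
```

Because every column of X is orthogonal to every other, dropping a term from the model leaves the remaining coefficient estimates unchanged.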
summary(nw.aov)
> summary(nw.aov)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
        a  1    7514.0   7514.0  14.3712 0.0016031
        b  1  140760.2 140760.2 269.2166 0.0000000
        c  1    4266.7   4266.7   8.1604 0.0114229
      a:b  1    5260.5   5260.5  10.0612 0.0059164
      a:c  1    6666.7   6666.7  12.7506 0.0025519
      b:c  1     586.7    586.7   1.1221 0.3052037
    a:b:c  1   15067.1  15067.1  28.8171 0.0000628
Residuals 16    8365.6    522.9
73
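Because the coded columns are orthogonal with entries of ±1, each effect is twice its regression coefficient, and each one-degree-of-freedom sum of squares equals n times the squared coefficient. The sketch below, in Python rather than S-PLUS, reproduces the sums of squares from the printed coefficients; the dictionary names are illustrative, and small discrepancies come from the coefficients being rounded to four decimals.

```python
# Coefficients copied from the summary.lm(nw.aov) output (rounded to 4 decimals).
coef = {
    "a": -17.6942, "b": -76.5833, "c": 13.3333,
    "a:b": -14.8050, "a:c": 16.6667, "b:c": 4.9442, "a:b:c": -25.0558,
}
n = 24  # total number of runs (2^3 treatment combinations x 3 replicates)

# Effect of moving a factor from its low (-1) to its high (+1) level.
effects = {term: 2 * value for term, value in coef.items()}

# With orthogonal +/-1 columns, SS(term) = n * coefficient^2.
sums_of_squares = {term: n * value ** 2 for term, value in coef.items()}

for term in coef:
    print(f"{term:6s} effect = {effects[term]:9.4f}   SS = {sums_of_squares[term]:10.1f}")
```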
Plot of residual vs. fit for nw.aov
74
[Figure: residuals vs. fitted values (Fitted : a * b * c)]
Nonparametric Statistical Methods
Corresponds to Chapter 14 of Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT)
1
Nonparametric Methods
• Most NP methods are based on ranks instead of original data
• Reference: Hollander & Wolfe, Nonparametric Statistical Methods
E Newton 2
E Newton 3
Histogram of 100 gamma(1,1) r.v.’s
[Figure: histogram of g]
Histogram of ranks of 100 r.v.’s
[Figure: histogram of rank(g)]
E Newton 4
Parametric and Nonparametric Tests
E Newton 5
Type of test                 Parametric     Nonparametric
Single sample                z and t tests  Sign test
                                            Wilcoxon signed rank test
Two independent samples      z and t tests  Wilcoxon rank sum test
                                            Mann-Whitney U test
E Newton 6
Type of test                 Parametric     Nonparametric
Several independent samples  ANOVA CRD      Kruskal-Wallis test
Several matched samples      ANOVA RBD      Friedman test
Correlation                  Pearson        Spearman rank correlation
                                            Kendall’s rank correlation
Sign Test
• Inference on the median (u) for a single sample of size n
• H0: u=u0 vs. H1: u≠u0
• Count the number of xi’s that are greater than u0 and denote this s+
• The number of xi’s less than u0 is s- = n - s+ (assuming no xi equals u0)
• Reject H0 if either s+ or s- is large (equivalently, if the other count is small).
• Under H0, s+ (and s-) has a binomial(n,1/2) distribution
• Large sample z test
E Newton 7
Histogram of thermostat data
[Figure: histogram of x (thermostat readings, roughly 198 to 208)]
E Newton 8
Sign Test in S-Plus
> thermostat
 [1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0
> thermostat<200
 [1] F F F F F T F F F T
> sum(thermostat<200)
[1] 2
> 2*pbinom(sum(thermostat<200),10,0.5)
[1] 0.109375
E Newton 9
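The same calculation can be written out without `pbinom`. This is a minimal Python sketch of the two-sided exact sign test, assuming that observations equal to u0 are dropped; the function name `sign_test` is illustrative.

```python
from math import comb

thermostat = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]


def sign_test(x, u0):
    """Two-sided exact sign test of H0: median = u0.

    Returns (s_plus, s_minus, p_value). Values equal to u0 are dropped.
    """
    s_plus = sum(xi > u0 for xi in x)
    s_minus = sum(xi < u0 for xi in x)
    n = s_plus + s_minus
    k = min(s_plus, s_minus)
    # P(S <= k) for S ~ binomial(n, 1/2), doubled for the two-sided test.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return s_plus, s_minus, min(1.0, 2 * lower_tail)


print(sign_test(thermostat, 200))  # (8, 2, 0.109375)
```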
Wilcoxon Signed Rank Test
• Inference on median (u), single sample, size n
• Assumes population distribution is symmetric
• H0: u=u0 vs. H1: u≠u0
• di = xi - u0
• Rank order |di|
• W+ = sum of ranks of positive differences
• W- = sum of ranks of negative differences
• Wmax = maximum (W+, W-)
• Reject H0 if Wmax is large.
• Null Distribution – see text
• Large sample z test
E Newton 10
S-Plus wilcox.test for thermostat data
E Newton 11
> thermostat
 [1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0
> sum(rank(abs(thermostat-200))[-c(6,10)])
[1] 47
> wilcox.test(thermostat,mu=200)

        Exact Wilcoxon signed-rank test

data:  thermostat
signed-rank statistic V = 47, n = 10, p-value = 0.0488
alternative hypothesis: true mu is not equal to 200
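The statistic and its exact null distribution can be reproduced directly. The Python sketch below assumes no ties among the |di| and no di equal to zero (true for the thermostat data); under H0 each rank is attached to a positive difference with probability 1/2, so the null distribution is found by counting subsets of {1,…,n} by rank sum. The function names are illustrative.

```python
thermostat = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]


def signed_rank(x, u0):
    """Return (W+, W-, n) for H0: median = u0, assuming no tied |d_i|."""
    d = [xi - u0 for xi in x if xi != u0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    n = len(d)
    return w_plus, n * (n + 1) // 2 - w_plus, n


def exact_two_sided_p(w_max, n):
    """2 * P(W+ >= w_max) under H0, by counting subsets of {1..n} by rank sum."""
    total = n * (n + 1) // 2
    counts = [1] + [0] * total          # counts[s] = number of subsets with sum s
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return min(1.0, 2 * sum(counts[w_max:]) / 2 ** n)


w_plus, w_minus, n = signed_rank(thermostat, 200)
print(w_plus, w_minus, exact_two_sided_p(max(w_plus, w_minus), n))
```

For these data W+ = 47 and the p-value is 50/1024, which rounds to the 0.0488 reported by `wilcox.test`.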
S-Plus parametric t-test for thermostat data
> t.test(thermostat, mu=200)

        One-sample t-Test

data:  thermostat
t = 2.3223, df = 9, p-value = 0.0453
alternative hypothesis: true mean is not equal to 200
95 percent confidence interval:
 200.0459 203.4941
sample estimates:
 mean of x
    201.77
E Newton 12
Location-Scale Families
• See course textbook, page 575.
E Newton 13
2 normal pdf’s with location parameters = -1 and 1, scale parameter = 1
[Figure: the two normal densities plotted against x, for x from -4 to 4]
E Newton 14
Wilcoxon Rank Sum Test
• Inference on location of the distributions of 2 independent random samples X and Y (e.g. from control and treatment populations).
• Assume X~Y+∆
• H0: ∆=0 vs. H1: ∆≠0
• Rank all N = n1 + n2 observations
• W = sum of ranks assigned to the Y’s (or X’s, whichever has smaller sample size)
• Reject H0 if W is extreme
E Newton 15
Mann-Whitney U test
• Equivalent to Wilcoxon rank sum test
• Compare each xi with each yj.
• There are nx*ny such comparisons
• U = number of pairs in which xi<yj.
• It can be shown that W = U + (ny*(ny+1))/2 when there are no ties, where W is the sum of the ranks of the y’s.
• Reject H0 if U is extreme.
E Newton 16
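The identity linking W and U is easy to verify numerically. The Python sketch below uses small made-up samples (the values of `x` and `y` are hypothetical, not the capacitor data) and assumes no ties.

```python
# Hypothetical samples with no tied values (not the capacitor data).
x = [1.1, 3.4, 5.0]
y = [2.2, 4.1, 6.3, 7.7]


def rank_sum_of_y(x, y):
    """W: sum of the ranks of the y's in the pooled sample (no ties assumed)."""
    pooled = sorted(x + y)
    return sum(pooled.index(v) + 1 for v in y)


def mann_whitney_u(x, y):
    """U: number of (x_i, y_j) pairs with x_i < y_j."""
    return sum(xi < yj for xi in x for yj in y)


W = rank_sum_of_y(x, y)
U = mann_whitney_u(x, y)
ny = len(y)
print(W, U, U + ny * (ny + 1) // 2)  # W equals U + ny(ny+1)/2
```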
Boxplots of times to failure for control and stressed capacitors
[Figure: boxplots of time to failure for cg and sg]
E Newton 17
S-Plus wilcox.test
> wilcox.test(cg, sg)

        Exact Wilcoxon rank-sum test

data:  cg and sg
rank-sum statistic W = 95, n = 8, m = 10, p-value = 0.1011
alternative hypothesis: true mu is not equal to 0
E Newton 18
S-Plus parametric t-test
> t.test(cg,sg)

        Standard Two-Sample t-Test

data:  cg and sg
t = 1.8105, df = 16, p-value = 0.089
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.103506 14.018506
sample estimates:
 mean of x mean of y
   15.5375      9.08
E Newton 19
Kolmogorov-Smirnov Tests
The Kolmogorov-Smirnov test detects any difference between two distributions (location, scale, skewness, or anything else). It compares two empirical cumulative distribution functions (step functions).
[Figure: Two-sample Test. Cumulative frequency (0.00 to 1.00) for Distribution 1 and Distribution 2, with the maximum gap between them marked]
There is also a one-sample version for testing the distance between some observed data and a specified (ideal) distribution.
[Figure: One-sample Test. Cumulative frequency (0.00 to 1.00) for the observed distribution and the ideal distribution, with the maximum gap marked]
Tests the maximum gap between the observed distribution and the hypothesized distribution as a function of sample size (tables or p-values).
J Telford 20
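The "maximum gap" is straightforward to compute. Below is a minimal Python sketch of the two-sample statistic only (no p-value); since both empirical CDFs are step functions, the largest gap must occur at one of the observed values. The function names and the sample values are illustrative.

```python
def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at t: fraction of values <= t."""
    return sum(v <= t for v in sample) / len(sample)


def ks_two_sample(x, y):
    """Largest vertical gap between the two empirical CDFs.

    Both ECDFs only jump at observed values, so it suffices to check those points.
    """
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in list(x) + list(y))


# Hypothetical samples: y is x shifted right by 1.5.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.5, 3.5, 4.5, 5.5]
print(ks_two_sample(x, y))  # 0.5
```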
E Newton 21
Histograms of 100 random normal (2,1) deviates and 100 random gamma(4,2) deviates
[Figure: histogram of x (normal deviates) and histogram of y (gamma deviates)]
Kolmogorov-Smirnov Tests
> ks.gof(x,y)

        Two-Sample Kolmogorov-Smirnov Test

data:  x and y
ks = 0.15, p-value = 0.2112
alternative hypothesis: cdf of x does not equal the cdf of y for at least one sample point.

> ks.gof(y)

        One sample Kolmogorov-Smirnov Test of Composite Normality

data:  y
ks = 0.0969, p-value = 0.0216
alternative hypothesis: True cdf is not the normal distn. with estimated parameters
sample estimates:
 mean of x standard deviation of x
  1.865857               0.9421928
E Newton 22
Kruskal-Wallis Test
• Inference for several independent samples
• Assume distributions of each of the samples differ only possibly in location.
• Xij = θ + τj + eij.
• H0: τ1=τ2=..=τa vs. H1: τi≠τj for some i ≠ j
• Rank all N=n1+n2+..+na observations.
• Calculate rank sums and averages in each group
• Calculate KW test statistic = kw (see text)
• Reject H0 for large values of kw
• For large ni’s, the null dist’n of kw is approximately χ2 with a-1 degrees of freedom
E Newton 23
Test scores for four different teaching methods (page 582)
> scm<-matrix(score,7,4)
> scm
      [,1]  [,2]  [,3]  [,4]
[1,] 14.06 14.71 23.32 26.93
[2,] 14.26 19.49 23.42 29.76
[3,] 14.59 20.20 24.92 30.43
[4,] 18.15 20.27 27.82 33.16
[5,] 20.82 22.34 28.68 33.88
[6,] 23.44 24.92 32.85 36.43
[7,] 25.43 26.84 33.90 37.04
E Newton 24
Plot.factor(f(grp),score)
[Figure: test scores for each teaching method, plotted against f(grp) = 1, 2, 3, 4]
E Newton 25
Ranks of Test Scores
E Newton 26
> scmr<-matrix(rank(score),7,4)
> scmr
     [,1] [,2] [,3] [,4]
[1,]    1  4.0 11.0   18
[2,]    2  6.0 12.0   21
[3,]    3  7.0 14.5   22
[4,]    5  8.0 19.0   24
[5,]    9 10.0 20.0   25
[6,]   13 14.5 23.0   27
[7,]   16 17.0 26.0   28
> tmp<-apply(scmr,2,sum)
> tmp
[1]  49.0  66.5 125.5 165.0
> (12/(28*29))*sum((tmp^2)/7)-3*29
[1] 18.13406
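The hand calculation above can be mirrored in a few lines. This Python sketch computes mid-ranks for the pooled scores and the uncorrected statistic 12/(N(N+1)) * Σ Rj²/nj - 3(N+1); the function names are illustrative. It does not apply a tie correction, so with the single tied pair (24.92 appears twice) it gives 18.134, while dividing by the usual tie-correction factor 1 - 6/(28³ - 28) gives approximately 18.139.

```python
# Test scores, one list per teaching method (the columns of scm).
scores = [
    [14.06, 14.26, 14.59, 18.15, 20.82, 23.44, 25.43],
    [14.71, 19.49, 20.20, 20.27, 22.34, 24.92, 26.84],
    [23.32, 23.42, 24.92, 27.82, 28.68, 32.85, 33.90],
    [26.93, 29.76, 30.43, 33.16, 33.88, 36.43, 37.04],
]


def midranks(values):
    """Ranks 1..N; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1   # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_wallis(groups):
    """Return (rank sums per group, kw statistic without tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    N = len(pooled)
    rank_sums, pos = [], 0
    for g in groups:
        rank_sums.append(sum(ranks[pos:pos + len(g)]))
        pos += len(g)
    total = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return rank_sums, 12 / (N * (N + 1)) * total - 3 * (N + 1)


rank_sums, kw = kruskal_wallis(scores)
print(rank_sums, round(kw, 5))
```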
Kruskal-Wallis test in S-Plus
> kruskal.test(scm, col(scm))

        Kruskal-Wallis rank sum test

data:  scm and col(scm)
Kruskal-Wallis chi-square = 18.139, df = 3, p-value = 0.0004
alternative hypothesis: two.sided
E Newton 27
ANOVA for test scores
> summary(aov(score~f(grp)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
   f(grp)  3  830.1914 276.7305 15.93607 6.509182e-006
Residuals 24  416.7609  17.3650
E Newton 28
Friedman Test
• Inference for several matched samples
• a treatments, b blocks
• H0: τ1=τ2=..=τa vs. H1: τi≠τj for some i ≠ j
• Rank observations separately within each block
• Calculate rank sums
• Calculate the Friedman statistic, fr (see text)
• Reject H0 for large values of fr
• For b large, fr is approximately χ2 with a-1 degrees of freedom
E Newton 29
Ranks within Blocks (rows)
> scmrb<-t(apply(scm,1,rank))
> scmrb
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    1    2    3    4
[3,]    1    2    3    4
[4,]    1    2    3    4
[5,]    1    2    3    4
[6,]    1    2    3    4
[7,]    1    2    3    4
> tmp<-apply(scmrb,2,sum)
> tmp
[1]  7 14 21 28
> (12/(4*7*5))*sum(tmp^2)-3*7*5
[1] 21
E Newton 30
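The same steps in Python, as a sketch that assumes no ties within a block: rank within each row, sum the ranks by treatment, and apply fr = 12/(a*b*(a+1)) * Σ Rj² - 3*b*(a+1). The name `friedman` is illustrative; the data are the rows of scm.

```python
# Rows of scm: each row is a block, each column a teaching method.
blocks = [
    [14.06, 14.71, 23.32, 26.93],
    [14.26, 19.49, 23.42, 29.76],
    [14.59, 20.20, 24.92, 30.43],
    [18.15, 20.27, 27.82, 33.16],
    [20.82, 22.34, 28.68, 33.88],
    [23.44, 24.92, 32.85, 36.43],
    [25.43, 26.84, 33.90, 37.04],
]


def friedman(blocks):
    """Return (rank sums per treatment, fr). Assumes no ties within a block."""
    b, a = len(blocks), len(blocks[0])
    rank_sums = [0] * a
    for row in blocks:
        order = sorted(range(a), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    fr = 12 / (a * b * (a + 1)) * sum(r * r for r in rank_sums) - 3 * b * (a + 1)
    return rank_sums, fr


rank_sums_fr, fr = friedman(blocks)
print(rank_sums_fr, fr)
```

Every block orders the four methods identically, so the rank sums take their most extreme possible values and fr reaches its maximum, b*(a-1) = 21.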
Friedman test in S-Plus
> friedman.test(scm, col(scm), row(scm))

        Friedman rank sum test

data:  scm and col(scm) and row(scm)
Friedman chi-square = 21, df = 3, p-value = 0.0001
alternative hypothesis: two.sided
E Newton 31
ANOVA test score data with blocks
> summary(aov(score~f(grp)+f(blk)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
   f(grp)  3  830.1914 276.7305 260.4768 5.220000e-015
   f(blk)  6  397.6377  66.2729  62.3804 4.558276e-011
Residuals 18   19.1232   1.0624
E Newton 32
Correlation Methods
• Pearson Correlation: measures only linear association.
• Spearman Correlation: correlation of the ranks
• Kendall’s Tau: based on number of concordant and discordant pairs.
E Newton 33
Kendall’s Tau
• Assume: the n bivariate observations (X1,Y1),…,(Xn,Yn) are a random sample from a continuous bivariate population.
• H0: Xi, Yi are independent, i.e. F(x,y) = FX(x)FY(y)
• Measure dependence by finding the number of concordant and discordant pairs.
• Population correlation coefficient: τ = 2*P{(X2-X1)(Y2-Y1)>0} - 1
E Newton 34
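Kendall’s Tau
For $1 \le i < j \le n$:
$$Q((X_i,Y_i),(X_j,Y_j)) = \begin{cases} 1, & \text{if } (X_j-X_i)(Y_j-Y_i) > 0 \\ 0, & \text{if } (X_j-X_i)(Y_j-Y_i) = 0 \\ -1, & \text{if } (X_j-X_i)(Y_j-Y_i) < 0 \end{cases}$$
$$K = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Q((X_i,Y_i),(X_j,Y_j))$$
$$\hat{\tau} = \frac{2K}{n(n-1)}$$
E Newton 35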
Kendall’s Tau example
E Newton 36
> m
   1  3  2  4
1 NA  1  1  1
2 NA NA -1  1
3 NA NA NA  1
4 NA NA NA NA
> 2*sum(m,na.rm=T)/12
[1] 0.6666667
> cor.test(c(1,2,3,4),c(1,3,2,4),method="k")

        Kendall's rank correlation tau

data:  c(1, 2, 3, 4) and c(1, 3, 2, 4)
normal-z = 1.3587, p-value = 0.1742
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau
 0.6666667
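The matrix m above holds the pairwise scores Q for each pair i < j. A minimal Python sketch of the same computation follows; the function name `kendall_tau` is illustrative, and it returns the estimate 2K/(n(n-1)) without any test.

```python
def kendall_tau(x, y):
    """Sample Kendall's tau: 2K / (n(n-1)), where K sums +1 for each
    concordant pair, -1 for each discordant pair, and 0 for ties."""
    n = len(x)
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[j] - x[i]) * (y[j] - y[i])
            k += (prod > 0) - (prod < 0)
    return 2 * k / (n * (n - 1))


# Five concordant pairs and one discordant pair give K = 4, tau = 8/12.
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```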
x=1:10
y=exp(x)
[Figure: plot of y against x]
E Newton 37
Pearson Correlation
> cor.test(x,y,method="p")

        Pearson's product-moment correlation

data:  x and y
t = 2.9082, df = 8, p-value = 0.0196
alternative hypothesis: true coef is not equal to 0
sample estimates:
       cor
 0.7168704
E Newton 38
Spearman Correlation
> cor.test(x,y,method="s")

        Spearman's rank correlation

data:  x and y
normal-z = 2.9818, p-value = 0.0029
alternative hypothesis: true rho is not equal to 0
sample estimates:
 rho
   1
E Newton 39
Kendall Correlation
> cor.test(x,y,method="k")

        Kendall's rank correlation tau

data:  x and y
normal-z = 4.0249, p-value = 0.0001
alternative hypothesis: true tau is not equal to 0
sample estimates:
 tau
   1
E Newton 40
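The contrast between the three outputs comes from y = exp(x) being a monotone but strongly nonlinear function of x: the rank-based measures equal 1 while Pearson's r is well below 1. A short Python sketch of the Pearson and Spearman estimates (no tests, no ties assumed; function names are illustrative):

```python
from math import exp, sqrt


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n, assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = list(range(1, 11))
y = [exp(v) for v in x]
print(round(pearson(x, y), 7), spearman(x, y))
```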
E Newton 41
Example - Environmental Data –Censored below LOD
[Figure: histograms of g and h]
Resampling Methods
• Parametric methods – Inference based on assumed population distribution
• Resampling methods – No assumption about functional form of population distribution.
• Permutation Tests – 2 sample problem
• Jackknife – Delete one observation at a time
• Bootstrap – resample with replacement
E Newton 42
Permutation Tests
• Goal: estimate difference in means (2 sample problem)
• (x1, x2,…, xn1) and (y1, y2,…, yn2) are independent samples drawn from F1 and F2.
• H0: F1=F2 => all assignments of labels x and y equally likely.
• Choose SRS of size n1 from n1+n2 observations and label as x, label rest as y.
• Calculate value of test statistic (e.g. difference in means) for each assignment -> permutation distribution.
• There are (n1+n2) choose (n1) possible distinct assignments (capacitor data set Ex14.7, n1=8, n2=10, number of assignments=43,758)
E Newton 43
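The procedure in the bullets can be carried out by full enumeration when the number of assignments is manageable. This Python sketch uses the difference in means as the statistic and a two-sided p-value (the proportion of assignments at least as extreme as the observed one); the function name and the tiny example data are hypothetical.

```python
from itertools import combinations
from math import comb


def permutation_test(x, y):
    """Exact two-sample permutation test on the difference in means.

    Returns (observed difference, two-sided p-value) by enumerating every
    way of labelling len(x) of the pooled observations as 'x'.
    """
    pooled = list(x) + list(y)
    n1, n2, total = len(x), len(y), sum(pooled)
    observed = sum(x) / n1 - sum(y) / n2
    count = extreme = 0
    for idx in combinations(range(len(pooled)), n1):
        sx = sum(pooled[i] for i in idx)
        diff = sx / n1 - (total - sx) / n2
        count += 1
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / count


print(comb(18, 8))                                       # 43758 assignments for n1=8, n2=10
print(permutation_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # (-3.0, 0.1)
```

With larger samples, full enumeration is usually replaced by a random sample of assignments.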
Jackknife
• Goal: estimate distribution and standard error of a statistic (e.g. median or mean)
• Draw n samples of size n-1 from the original sample, by deleting one observation at a time.
• Calculate mj* = mean (median) from each sample
$$JSE(m) = \sqrt{\frac{n-1}{n} \sum_{j=1}^{n} (m_j^* - \bar{m}^*)^2}$$
where $\bar{m}^*$ is the average of the $m_j^*$.
• JSE is exact for the mean (it equals s/√n), not necessarily very good for the median
E Newton 44
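A short Python sketch of the jackknife standard error follows, written with the square-root form of the formula; the data values are made up for illustration. For the mean, the result agrees with the usual s/√n.

```python
from math import sqrt
from statistics import mean, median, stdev


def jackknife_se(sample, statistic):
    """Jackknife SE: sqrt((n-1)/n * sum((m_j - mbar)^2)) over leave-one-out replicates."""
    n = len(sample)
    reps = [statistic(sample[:j] + sample[j + 1:]) for j in range(n)]
    centre = mean(reps)
    return sqrt((n - 1) / n * sum((m - centre) ** 2 for m in reps))


# Hypothetical data, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7]

print(jackknife_se(data, mean), stdev(data) / sqrt(len(data)))  # these two agree
print(jackknife_se(data, median))                               # can be a poor estimate
```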
Bootstrap
• Goal: estimate distribution, standard error, confidence interval of a statistic (e.g. mean, median, correlation)
• Draw B samples of size n, with replacement, from the original sample
• Calculate the statistic mj* from each sample
$$BSE(m) = \sqrt{\frac{\sum_{j=1}^{B} (m_j^* - \bar{m}^*)^2}{B-1}}$$
where $\bar{m}^*$ is the average of the $m_j^*$.
E Newton 45
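The bootstrap steps translate directly into code. This is a minimal Python sketch, not the S-PLUS `bootstrap` function: it draws B resamples with replacement using a seeded generator, then returns the standard error and a simple percentile interval. The function name, its parameters, and the data are hypothetical.

```python
import random
from statistics import mean, median


def bootstrap(sample, statistic, B=1000, seed=0):
    """Return (standard error, (2.5%, 97.5%) percentile interval, replicates)."""
    rng = random.Random(seed)          # seeded so the run is reproducible
    n = len(sample)
    reps = [statistic([sample[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    centre = mean(reps)
    se = (sum((m - centre) ** 2 for m in reps) / (B - 1)) ** 0.5
    ordered = sorted(reps)
    lower = ordered[int(0.025 * (B - 1))]
    upper = ordered[int(0.975 * (B - 1))]
    return se, (lower, upper), reps


# Hypothetical data, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7]
se, interval, reps = bootstrap(data, median, B=1000, seed=1)
print(se, interval)
```

The percentile interval here is the simplest choice; the BCa limits reported by S-PLUS adjust these percentiles for bias and skewness.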
Swiss Data Set in S-Plus
Fertility Data for Switzerland in 1888
SUMMARY: The swiss.fertility and swiss.x data sets contain fertility data for Switzerland in 1888.
ARGUMENTS:
swiss.fertility
    standardized fertility measure I[g] for each of 47 French-speaking provinces of Switzerland in approximately 1888.
swiss.x
    matrix with 5 columns that contain socioeconomic indicators for the provinces:
    1) percent of population involved in agriculture as an occupation;
    2) percent of "draftees" receiving highest mark on army examination;
    3) percent of population whose education is beyond primary school;
    4) percent of population who are Catholic; and,
    5) percent of live births who live less than 1 year (infant mortality).
SOURCE: Mosteller and Tukey (1977). Data Analysis and Regression. Addison-Wesley. Unpublished data used by permission of Francine van de Walle. Population Study Center, University of Pennsylvania, Philadelphia, PA.
E Newton 46
This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Bootstrap estimates and CI for variance of education
> educ<-swiss.x[,3]
> var(educ)
[1] 92.45606
> educ.boot<-bootstrap(educ,var,trace=F)
> summary(educ.boot)
Call:
bootstrap(data = educ, statistic = var, trace = F)

Number of Replications: 1000

Summary Statistics:
    Observed    Bias  Mean    SE
var    92.46 -0.5972 91.86 39.14

Empirical Percentiles:
     2.5%    5%   95% 97.5%
var 29.98 36.26 165.3   175
E Newton 47
Histogram of variance estimates obtained from 1000 bootstrap samples
[Figure: histogram (density vs. value) of the bootstrap replicates of var]
E Newton 48
QQ plot of variance estimates
[Figure: quantiles of replicates of var vs. quantiles of standard normal]
E Newton 49
Plot of LSAT scores by GPA for a sample of 15 schools
[Figure: scatterplot of lsat against gpa]
E Newton 50
Bootstrap estimates and CI for correlation between LSAT and GPA
> law.boot<-bootstrap(law.data, cor(lsat,gpa), trace=F)
> summary(law.boot)
Call:
bootstrap(data = law.data, statistic = cor(lsat, gpa), trace = F)

Number of Replications: 1000

Summary Statistics:
      Observed     Bias   Mean     SE
Param   0.7764 -0.00506 0.7713 0.1368

Empirical Percentiles:
       2.5%     5%   95%  97.5%
Param 0.449 0.5133 0.947 0.9623

BCa Confidence Limits:
        2.5%     5%    95%  97.5%
Param 0.2623 0.4138 0.9232 0.9413
E Newton 51
Histogram of correlation estimates obtained from 1000 bootstrap samples
[Figure: histogram (density vs. value) of the bootstrap replicates of Param]
E Newton 52
QQ Plot of correlation estimates
[Figure: quantiles of replicates of Param vs. quantiles of standard normal]
E Newton 53
S-Plus Stack-loss data set
• Stack-loss Data
• SUMMARY: The stack.loss and stack.x data sets are from the operation of a plant for the oxidation of ammonia to nitric acid, measured on 21 consecutive days.
• ARGUMENTS:
• stack.loss – percent of ammonia lost (times 10).
• stack.x – matrix with 21 rows and 3 columns representing air flow to the plant, cooling water inlet temperature, and acid concentration as a percentage (coded by subtracting 50 and then multiplying by 10).
• SOURCE:
• Brownlee, K.A. (1965). Statistical Theory and Methodology in Science and Engineering. New York: John Wiley & Sons, Inc.
• Draper and Smith (1966). Applied Regression Analysis. New York: John Wiley & Sons, Inc.
• Daniel and Wood (1971). Fitting Equations to Data. New York: John Wiley & Sons, Inc.
E Newton 54
S-Plus stack loss data set
[Figure: scatterplot matrix of stack.loss, Air.Flow, Water.Temp, and Acid.Conc.]
E Newton 55
Summary of stack loss regression
> summary(tmp)

Call: lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stack)

Residuals:
    Min     1Q  Median    3Q   Max
 -7.238 -1.712 -0.4551 2.361 5.698

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960 -3.3557   0.0038
   Air.Flow   0.7156     0.1349  5.3066   0.0001
 Water.Temp   1.2953     0.3680  3.5196   0.0026
 Acid.Conc.  -0.1521     0.1563 -0.9733   0.3440

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-Squared: 0.9136
F-statistic: 59.9 on 3 and 17 degrees of freedom, the p-value is 3.016e-009

Correlation of Coefficients:
           (Intercept) Air.Flow Water.Temp
  Air.Flow      0.1793
Water.Temp     -0.1489  -0.7356
Acid.Conc.     -0.9016  -0.3389     0.0002
E Newton 56
Summary of stack loss bootstrap output
summary(stack.boot)
Call:
bootstrap(data = stack, statistic = coef(lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., stack)), trace = F)

Number of Replications: 1000

Summary Statistics:
            Observed       Bias     Mean     SE
(Intercept) -39.9197  0.5691396 -39.3505 9.3731
   Air.Flow   0.7156  0.0016734   0.7173 0.1777
 Water.Temp   1.2953 -0.0264873   1.2688 0.4798
 Acid.Conc.  -0.1521 -0.0006978  -0.1528 0.1261

Empirical Percentiles:
                2.5%       5%       95%     97.5%
(Intercept) -56.0109 -53.4216 -21.92994 -18.75262
   Air.Flow   0.3903   0.4366   1.00261   1.04605
 Water.Temp   0.4004   0.5131   2.07381   2.23633
 Acid.Conc.  -0.4285  -0.3740   0.03282   0.05912
E Newton 57
Summary of stack loss bootstrap output
summary(stack.boot)

BCa Confidence Limits:
                2.5%       5%        95%     97.5%
(Intercept) -55.6465 -52.6606 -21.451125 -18.55810
   Air.Flow   0.3266   0.4120   0.992007   1.01855
 Water.Temp   0.5244   0.6193   2.264165   2.40956
 Acid.Conc.  -0.4629  -0.4101  -0.007724   0.04459

Correlation of Replicates:
            (Intercept) Air.Flow Water.Temp Acid.Conc.
(Intercept)     1.00000 -0.17636    0.09902   -0.80236
   Air.Flow    -0.17636  1.00000   -0.78822   -0.07635
 Water.Temp     0.09902 -0.78822    1.00000   -0.24463
 Acid.Conc.    -0.80236 -0.07635   -0.24463    1.00000
E Newton 58
Histograms of regression coefficients
E Newton 59
[Figure: histograms (density vs. value) of the bootstrap replicates for (Intercept), Air.Flow, Water.Temp, and Acid.Conc.]
QQ Plots of regression coefficients
E Newton 60
[Figure: quantiles of replicates vs. quantiles of standard normal for (Intercept), Air.Flow, Water.Temp, and Acid.Conc.]