Applied Statistics - MIT

Dr. Elizabeth Newton

Slides prepared by Elizabeth Newton (MIT) with some slides by Roy Welsch (MIT) and Gordon Kaufman (MIT).

15.075, Applied Statistics

Lecture: M,W 10-11:30

Recitation: R 4-5

Text: Statistics and Data Analysis by Tamhane and Dunlop

Computing: S-Plus

Exams: Mid-term (in class) and Final during exam week

Prerequisites: Calculus, Probability, Linear Algebra

15.075, Applied Statistics, Course Outline

• Collecting Data
• Summarizing and Exploring Data
• Review of Probability
• Sampling Distributions of Statistics
• Inference: Point and CI Estimation, Hypothesis Testing
• Linear Regression
• Analysis of Variance
• Nonparametric Methods
• Special Topics (Data Mining?)

Statistics

“The science of collecting and analyzing data for the purpose of drawing conclusions and making decisions.” from Tamhane, Ajit C., and Dorothy D. Dunlop. Statistics and Data Analysis from Elementary to Intermediate. Prentice Hall, 2000, p. 1.

“Statistics are no substitute for judgment.” Henry Clay

How is the meter defined?

One ten-millionth of a quarter meridian (distance from pole to equator).

BUT – it isn’t exactly.

The Measure of All Things, by Ken Alder, describes the attempt of 2 French astronomers, Delambre and Mechain, to determine the circumference of the earth during the time of the French Revolution.

Determined the distance between Barcelona and Dunkirk by triangulation.

Needed to know latitude at each end (by measuring heights of stars).

Seven months stretched to seven years.

Mechain obtained conflicting information and suppressed some of his data.

(Measure of All Things):

“What counts as an error? Who is to say when you have made a mistake? How close is close enough? Neither Mechain nor his colleagues could have answered these questions with any degree of confidence. They were completely innocent of statistical method.”

- Quote from Alder, Ken. The Measure of All Things: The Seven-Year Odyssey and Hidden Error that Transformed the World. Free Press, 2003.

Data: A set of measurements

Character

Nominal, e.g. color: red, green, blue

Binary, e.g. (M,F), (H,T), (0,1)

Ordinal, e.g. attitude to war: agree, neutral, disagree

Numeric

Discrete, e.g. number of children

Continuous, e.g. distance, time, temperature

Interval, e.g. Fahrenheit temperature

Ratio (real zero), e.g. distance, number of children

S-Plus Data Set: cu.summary

Concepts

Population: The set of all units of interest (finite or infinite). E.g. all students at MIT

Sample: A subset of the population actually observed. E.g. students in this room.

Variable: A property or attribute of each unit, e.g. age, height

Observation: Values of all variables for an individual unit

A dataset is often organized as a matrix with rows corresponding to observations and columns to variables.

Concepts (continued)

Parameter: Numerical characteristic of population, defined for each variable, e.g. proportion opposed to war

Statistic: Numerical function of sample used to estimate population parameter.

Precision: Spread of estimator of a parameter

Accuracy: How close estimator is to true value (the opposite of bias)

Bias: Systematic deviation of estimate from true value

Accuracy and Precision

accurate and precise

accurate, not precise

precise, not accurate

not accurate, not precise

Diagram courtesy of MIT OpenCourseWare

Steps in Study Design and Implementation

1. Background research and literature review.

2. Define the goals and specific hypotheses of the study.

3. Determine what variables should be measured and how.

4. Develop a plan to collect the data
– Sampling design
– Sample size
– Inclusions and exclusions

5. Train Personnel

6. Gather Data

7. Analyze Data

8. Report Results

Ethical Issues

For human subjects:

For animal subjects:

(See Hulley & Cummings, Designing Clinical Research.)

Statistical Studies

Descriptive: One group, e.g. survey, poll

Comparative: 2 or more groups, e.g. compare effectiveness of different teaching methods.

Experimental: Investigator actively intervenes to control study conditions
– Look at relationship between predictor (explanatory) and response (outcome) variables
– Establish causation, e.g. drug trial

Observational: Investigator records data without intervening
– Difficult to distinguish effects of predictors and confounding variables (lurking variables)
– Establish association, e.g. Framingham Heart Study

Observational Studies:

Cross-sectional
– Look at sample at a single point in time
– E.g. Census, Sample survey

Prospective (expensive!)
– Follow sample (cohort) forward in time.
– E.g. Framingham Heart Study, Nurses’ Health Study

Retrospective (case-control)
– Look back in time

Sources of Error in Observational Studies

Sampling Error – sample differs from population

Measurement Bias – poorly worded questions

Self-Selection Bias – refusal to participate

Response Bias – incorrect or untruthful responses

Types of Samples

Probability Sample (every element in population has known non-zero probability of inclusion)

• Simple Random Sample (SRS)
• Stratified Random Sample
• Multi-Stage Cluster Sample
• Systematic Sample

Non-Probability Sample (estimates may be biased, but frequently used as only feasible method)

• Convenience Sample, e.g. supermarket survey
• Judgment Sample – chosen by investigator

Simple Random Sample (SRS)

Requires a Sampling Frame, a list of all the units in a finite population

Sample of size n is drawn without replacement from population of size N, such that each sample (there are $\binom{N}{n}$ of them) has same chance of being chosen.

Each unit in population has same chance of being chosen: n/N (the sampling fraction).

Generate random numbers to select from sampling frame.
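The SRS recipe above can be sketched in a few lines. This is a minimal illustration in Python rather than S-Plus, assuming a hypothetical sampling frame of N = 10 unit labels; the frame, the seed, and the variable names are illustrative and not from the course materials.

```python
import math
import random

# Hypothetical sampling frame: labels for N = 10 units (illustrative only).
frame = [f"unit{i}" for i in range(1, 11)]
N, n = len(frame), 3

# Number of distinct samples of size n: "N choose n".
num_samples = math.comb(N, n)

# random.sample draws without replacement, so every size-n subset
# of the frame is equally likely to be the sample.
rng = random.Random(15075)  # fixed seed so the draw is reproducible
srs = rng.sample(frame, n)

sampling_fraction = n / N
print(num_samples, srs, sampling_fraction)
```

Drawing without replacement is what makes each unit's inclusion probability equal to n/N.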

Stratified Random Sample

Divide a diverse population into homogeneous subpopulations (strata).

Draw simple random sample from each one.

Advantages:

Separate estimates for strata obtained in addition to overall estimates.

Precision of estimates higher than for simple random sample

Disadvantage: Requires sampling frame

Multistage Cluster Sampling

Used to survey large populations when sampling frame not available, e.g. USA

For instance, in an educational survey, draw a sample of states, then towns within states, then schools within towns.

Prepare a sampling frame of students from selected schools and use SRS.

Systematic Sampling

Useful when list of units exists or when units arrive sequentially (cars through a toll booth).

Select first unit at random, then every kth unit.

In finite population, each unit has same probability of selection (n/N) (however, not all samples are equally likely).

Must avoid choosing k to coincide with regular cyclic variations in the data
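A rough Python sketch of systematic sampling follows, assuming a hypothetical list of N = 20 units and k = 5; the function name and data are made up for illustration.

```python
import random

def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then take every kth unit."""
    start = rng.randrange(k)  # random start in 0..k-1
    return units[start::k]

# Hypothetical list of N = 20 units; k = 5 gives n = N/k = 4 units per sample.
units = list(range(1, 21))
sample = systematic_sample(units, 5, random.Random(1))
print(sample)
```

Only k distinct samples are possible here (one per starting position), which is why every unit has the same selection probability while most subsets of size n can never be drawn.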

Questionnaire Design

Structured questions: responses should be mutually exclusive and collectively exhaustive.

E.g. How many glasses of water do you drink per day?
-------------- 0 to 2
-------------- 3 to 5
-------------- 6 or more

Non-structured:
E.g. How many glasses of water do you drink per day?
Allow more individualized response, but more prone to data entry errors.

Attitude questions

1. The homework load in this course is reasonable.

Strongly Disagree | Disagree | Neither Agree nor Disagree | Agree | Strongly Agree

Usually 5 to 9 categories.
(Should we assign numbers to these categories?)
(High to low or low to high?)

Problems with Question Wording

Double-barreled question

Leading question

One-sided question

Ambiguous question

Pretest! Pretest! Pretest!

(For more information, see Johnson & Wichern, Business Statistics)

Sensitive Questions

E.g. Have you ever used heroin?

Randomized Response may elicit more accurate responses.
Interviewer does not know what question respondent is answering.

E.g. Roll a die. If less than 3 then say whether statement 1 is true or false. Otherwise say whether statement 2 is true or false.

Statement 1: I have used heroin.
Statement 2: I have not used heroin.

Let p = proportion of people who have used heroin
q = proportion of people answering statement 1 (can’t be 0.5).

P(True)=P(True|1)P(1) + P(True|2)P(2) = p q + (1-p) (1-q)

Solve for p.
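Rearranging P(True) = pq + (1-p)(1-q) gives p = (P(True) - (1-q)) / (2q - 1), which is undefined when q = 0.5. The Python sketch below applies that formula; the 60% "True" rate is a hypothetical survey result, and the function name is illustrative.

```python
def estimate_p(prop_true, q):
    """Solve P(True) = p*q + (1 - p)*(1 - q) for p.

    prop_true: observed proportion of "True" answers
    q: probability a respondent answers statement 1 (must not be 0.5)
    """
    if abs(q - 0.5) < 1e-12:
        raise ValueError("q = 0.5 makes p unidentifiable")
    return (prop_true - (1 - q)) / (2 * q - 1)

# A die roll "less than 3" means faces 1 or 2, so q = 2/6.
q = 2 / 6
# Hypothetical survey result: 60% of respondents said "True".
p_hat = estimate_p(0.60, q)
print(round(p_hat, 3))
```

In practice the observed proportion is itself an estimate, so p_hat inherits its sampling variability, inflated by the factor 1/(2q - 1).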

Question Sequencing

1. Demographics at end

2. Sensitive questions nearer to end

3. Same topic questions appear together

4. Go from general to specific

5. Avoid skipping around.

Experimental Studies

Purpose: Evaluate how a set of predictor variables (factors) affect a response variable.

Treatment Factors are of primary interest. Values (Levels) are controlled.

Nuisance Factors also affect response.

Treatment: particular combination of levels of treatment factors.

Experimental units (EU’s): subjects to which treatments applied.

Treatment group: all EU’s receiving same treatment

Run: observation on an EU under particular treatment condition.

Replicate: another independent run.

Sources of Error in Experimental Studies

Systematic Error: differences among EU’s caused by Confounding Factors

Random Error: inherent variability in responses of EU’s.

Measurement Error: due to imprecision of measuring instruments.

Strategies to Control Error in Experimental Studies

Blocking: Divide sample into groups of similar EU’s (same value for nuisance factors).
E.g. In agricultural trials effect of nutrient and moisture gradients can be controlled for by blocking on agricultural plots

Matching: EU’s can be matched on nuisance factors, then each member of match can be randomly assigned to different treatment (each match is a block).

Regression Analysis: If value of nuisance factor is known can include as covariate in final model.

Randomization: Randomly assign EU’s to treatments.

Basic Idea: Block over those nuisance factors that can be easily controlled and randomize over the rest

Basic Experimental Designs

Completely Randomized Design (CRD): EU’s assigned at random to treatments

Randomized Block Design (RBD): EU’s divided into homogeneous blocks. Treatments assigned randomly within blocks.

Randomized Complete Block Design (RCBD): Blocks contain all treatments.

Randomized Incomplete Block Design (RIBD): Blocks do not contain all treatments.
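As a rough illustration of the difference between a CRD and an RCBD, the Python sketch below randomizes treatments both ways. The unit labels, block names, and helper functions are hypothetical; this shows one simple way to carry out the assignment, not a prescribed procedure.

```python
import random

def completely_randomized(units, treatments, rng):
    """CRD: shuffle all EUs, then deal them out to treatments in turn."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {u: treatments[i % len(treatments)] for i, u in enumerate(shuffled)}

def randomized_complete_block(blocks, treatments, rng):
    """RCBD: within each block, assign every treatment once in random order."""
    plan = {}
    for block, units in blocks.items():
        order = treatments[:]
        rng.shuffle(order)
        for unit, trt in zip(units, order):
            plan[unit] = (block, trt)
    return plan

rng = random.Random(0)
treatments = ["A", "B", "C"]
crd = completely_randomized([f"eu{i}" for i in range(6)], treatments, rng)
# Hypothetical blocks (e.g. plots sharing a moisture level), 3 EUs each.
blocks = {"block1": ["p1", "p2", "p3"], "block2": ["p4", "p5", "p6"]}
rcbd = randomized_complete_block(blocks, treatments, rng)
print(crd)
print(rcbd)
```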

Chapter 4: Summarizing & Exploring Data (Descriptive Statistics)

Graphics! Graphics! Graphics! (and some numbers)

Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).

Graphical Excellence

“Complex ideas communicated with clarity, precision, and efficiency”
• Shows the data
• Makes you think about substance rather than method, graphic design, or something else
• Many numbers in a small space
• Makes large data sets coherent
• Encourages the eye to compare different pieces of the data

Charles Joseph Minard

Graphic Depicting Exports of Wine from France (1864)

Available at http://www.math.yorku.ca/SCS/Gallery/

Source: Minard, C. J. Carte figurative et approximative des quantités de vin français exportés par mer en 1864. 1865. ENPC (École Nationale des Ponts et Chaussées), 1865.

Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.

Summarizing Categorical Data

A frequency table shows the number of occurrences of each category.
Relative frequency is the proportion of the total in each category.

Bar charts and Pie Charts are used to graph categorical data. A Pareto chart is a bar chart with categories arranged from the highest to lowest (QC: “vital few from the trivial many”).

Attraction          Frequency   Relative Frequency (%)
Vertical Drop          101            15.1
Roller Coaster A        54             8.1
Roller Coaster B        77            11.5
Water Park             155            23.1
Spinners                35             5.2
Tea Cups                81            12.1
Haunted House           79            11.8
Log Drop                88            13.1
Total                  670           100.0

Popularity of attractions at an amusement park
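The relative frequencies and the Pareto ordering for the amusement park table can be reproduced with a short Python sketch. The counts are taken from the table; the variable names are illustrative.

```python
# Counts from the amusement park table above.
counts = {
    "Vertical Drop": 101, "Roller Coaster A": 54, "Roller Coaster B": 77,
    "Water Park": 155, "Spinners": 35, "Tea Cups": 81,
    "Haunted House": 79, "Log Drop": 88,
}
total = sum(counts.values())

# Relative frequency (%) = 100 * count / total, rounded to one decimal.
rel_freq = {name: round(100 * c / total, 1) for name, c in counts.items()}

# Pareto ordering: categories sorted from highest to lowest frequency.
pareto = sorted(counts, key=counts.get, reverse=True)

for name in pareto:
    print(f"{name:18s}{counts[name]:5d}{rel_freq[name]:7.1f}")
```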

(Figure: bar chart of Relative Frequency (%) by attraction.)

Pie Chart and Bar Chart of Attraction Popularity at an Amusement Park

(Legend: Vertical Drop, Roller Coaster A, Roller Coaster B, Water Park, Spinners, Tea Cups, Haunted House, Log Drop.)

Charles Joseph Minard

Graph showing quantities of meat sent from various regions of France to Paris using pie charts overlaid on a map of France (1864)

Available at http://www.math.yorku.ca/SCS/Gallery/

Source: Minard, C. J. Carte figurative et approximative des quantités de viande de boucherie envoyées sur pied par les départements et consommées à Paris. ENPC (École Nationale des Ponts et Chaussées), 1858, pp. 44.

Plots for Numerical Univariate Data

Scatter plot (vs. observation number)

Histogram

Stem and Leaf

Box Plot (Box and Whiskers)

QQ Plot (Normal probability plot)

Scatter Plot of Iris Data

(Horizontal axis: observation number.)

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Scatter Plot of Iris Data with Observation Number Indicated

(Horizontal axis: observation number.)

plot(iris21)
text(iris21)

Plot of data using jitter function in S-Plus

(Horizontal axis: observation number.)

Run Chart

For time series data, it is often useful to plot the data in time sequence. A run chart graphs the data against time.

(Figure: Compression vs. Production Order.)

Always Plot Your Data Appropriately - Try Several Ways!

Histogram

Data: n=24 Gas Mileage {31,13,20,21,24,25,25,27,28,40,29,30,31,23,31,32,35,28,36,37,38,40,50,17}

Gives a picture of the distribution of data.

• Area under the histogram represents sample proportion.

• Use approx. sqrt(n) “bins”: if too many, too jagged; if too few, too smooth (no detail)

• Shows if the distribution is:
– Symmetric or skewed
– Unimodal or bimodal

• Gaps in the data may indicate a problem with the measurement process.

• Many quality control applications
– Are there two processes?
– Detection of rework or cheating
– Tells if process meets the specifications

(Figure: histogram of gas mileage; horizontal axis: Miles per gallon, 10 to 55.)

Note: Bars touch for continuous data, but do NOT touch for discrete data.

Histogram of Iris Data

(Horizontal axis: iris21.)

Histogram of Iris Data with Density Curve

Stem and Leaf Diagram and Cumulative Distribution Function

Data: Gas Mileage

Stem  Leaf     Count
5     0        1
4
4     00       2
3     5678     4
3     01112    5
2     557889   6
2     0134     4
1     7        1
1     3        1

Shows distribution of data similar to a histogram but preserves the actual data.
Can see numerical patterns in the data (like 40’s and 50).

CDF Plot

(Figure: cumulative proportion, 0.1 to 1.0, vs. Miles per gallon, 10 to 55.)

Step occurs at each data value (higher for more values at the same data point).

Stem and Leaf Diagram for Iris Data

Decimal point is 1 place to the left of the colon

23 : 0
24 :
25 :
26 :
27 :
28 :
29 : 0
30 : 000000
31 : 0000
32 : 00000
33 : 00
34 : 000000000
35 : 000000
36 : 000
37 : 000
38 : 0000
39 : 00
40 : 0
41 : 0
42 : 0
43 :
44 : 0

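Summary Statistics for Numerical Data

Measures of Location:

Mean (“average”): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Median: middle of the ordered sample (like $\theta_{.5}$ for distribution)

$x_{\min} = x_{(1)} \le x_{(2)} \le \dots \le x_{(n)} = x_{\max}$

$\text{median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right] & \text{if } n \text{ is even} \end{cases}$

Median of {0,1,2} is 1: n=3 so n+1=4 & (n+1)/2=2 (2nd value)

Median of {0,1,2,3} is 1.5 (assumes data is continuous): n=4, so average the 2nd and 3rd values

Mode: The most common value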
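The three measures of location can be checked on the gas mileage data with Python's standard `statistics` module. This is a small sketch only; the course itself uses S-Plus.

```python
import statistics

# Gas mileage data (n = 24) from the histogram slide.
mpg = [31, 13, 20, 21, 24, 25, 25, 27, 28, 40, 29, 30,
       31, 23, 31, 32, 35, 28, 36, 37, 38, 40, 50, 17]

print(statistics.mean(mpg))    # arithmetic average
print(statistics.median(mpg))  # n is even: average of the two middle values
print(statistics.mode(mpg))    # most common value

# The two small examples from the slide.
print(statistics.median([0, 1, 2]))     # odd n: the middle value
print(statistics.median([0, 1, 2, 3]))  # even n: average of 1 and 2
```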

Mean or Median?

Appropriate summary of the center of the data?
– Mean if the data has a symmetric distribution with light tails (i.e. a relatively small proportion of the observations lie away from the center of the data).
– Median if the distribution has heavy tails or is asymmetric.

Extreme values that are far removed from the main body of the data are called outliers.
– Large influence on the mean but not on the median.

Right and left skewness (asymmetry)
– RIGHT skewed: mode (high point) < median < mean (reverse alphabetic order)
– LEFT skewed: mean < median < mode (alphabetic order)

Quantiles, Fractiles, Percentiles

For a theoretical distribution:
The pth quantile is the value of a random variable X, xp, such that P(X<xp)=p.
For the normal dist’n:
In S-Plus: qnorm(p), 0<p<1, gives the quantile.
In S-Plus: pnorm(q) gives the probability.

For a sample:
The order statistics are the sample values in ascending order, denoted X(1), …, X(n).
The pth quantile is the data value in the sorted sample, such that a fraction p of the data is less than or equal to that value.

Normal CDF

qnorm(0.8)=0.8416212

pnorm(0.8416212)=0.8
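For readers without S-Plus, the same pair of calculations can be sketched with Python's standard library, where `NormalDist.inv_cdf` plays the role of qnorm and `NormalDist.cdf` the role of pnorm.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# inv_cdf plays the role of qnorm: probability -> quantile.
x = std_normal.inv_cdf(0.8)
# cdf plays the role of pnorm: quantile -> probability.
p = std_normal.cdf(x)

print(round(x, 7), round(p, 7))
```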

An algorithm for finding sample quantiles:

1) Arrange observations from smallest to largest.
2) For a given proportion p, compute the sample size × p = np.
3) If np is NOT an integer, round up to the next integer (ceiling(np)) and set the corresponding observation = xp.
4) If np IS an integer k, average the kth and (k + 1)st ordered values. This average is then xp.

– Text has a different algorithm

Quantiles, continued (pth quantile is 100pth percentile)

Example:
Data: {0, 1, 2, 3, 4, 5, 6} = {x(1),x(2),x(3),x(4),x(5),x(6),x(7)}

n=7
Q1: ceiling(0.25*7) = 2 ⇒ Q1 = x(2) = 1 = 25th percentile
Q2: ceiling(0.50*7) = 4 ⇒ Q2 = x(4) = 3 = median (50th percentile)
Q3: ceiling(0.75*7) = 6 ⇒ Q3 = x(6) = 5 = 75th percentile

S-Plus gives different answers! Different methods for calculating quantiles.
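The four-step quantile rule above can be written out as a short Python function. This is a sketch of that particular rule only; as the slide notes, the text and S-Plus use other definitions. The function name and the tolerance used to decide whether np is an integer are assumptions made here.

```python
import math

def sample_quantile(data, p):
    """Sample quantile by the four-step rule on the slide (0 < p < 1)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    x = sorted(data)                  # step 1: order the observations
    n = len(x)
    np_ = n * p                       # step 2: compute np
    k = round(np_)
    if abs(np_ - k) < 1e-9:           # step 4: np is an integer k
        return (x[k - 1] + x[k]) / 2  # average kth and (k+1)st (1-based)
    return x[math.ceil(np_) - 1]      # step 3: round up, take that observation

data = [0, 1, 2, 3, 4, 5, 6]
print([sample_quantile(data, p) for p in (0.25, 0.50, 0.75)])
```

With n = 8 and p = 0.25, np = 2 is an integer, so the rule averages the 2nd and 3rd ordered values instead of picking one observation.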

Measures of Dispersion (Spread, Variability):

Two data sets may have the same center but quite different dispersions around it.

Two ways to summarize variability:

1. Give the values that divide the data into equal parts.
– Median is the 50th percentile
– The 25th, 50th, and 75th percentiles are called quartiles (Q1,Q2,Q3) and divide the data into four equal parts.
– The minimum, maximum, and three quartiles are called the “five number summary” of the data.

2. Compute a single number, e.g., range, interquartile range, variance, and standard deviation.

Measures of Dispersion, continued

Range = maximum - minimum

Interquartile range (IQR) = Q3 – Q1

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$

Sample standard deviation: $s = \sqrt{s^2}$

Sample mean, variance, and standard deviation are sample analogs of the population mean, variance, and standard deviation ($\mu$, $\sigma^2$, $\sigma$)

Other Measures of Dispersion

Sample Average of Absolute Deviations from the Mean:

$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$

Sample Median of Absolute Deviations from the Median:

Median of $\{|x_i - x_{.5}|,\ i = 1, \dots, n\}$

Computations for Measures of Dispersion

Example:

Data: {0, 1, 2, 3, 4, 5, 6} = {x(1),x(2),x(3),x(4),x(5),x(6),x(7)}

mean = (0+1+2+3+4+5+6)/7 = 21/7 = 3
min = 0, max = 6
Q1 = x(2) = 1 = 25th percentile
Q2 = x(4) = 3 = median (50th percentile)
Q3 = x(6) = 5 = 75th percentile
Range = max - min = 6 - 0 = 6
IQR = Q3 - Q1 = 5 - 1 = 4
s^2 = [(0^2+1^2+2^2+3^2+4^2+5^2+6^2) - 7(3^2)]/(7-1) = [91-63]/6 = 4.67
s = sqrt(4.67) = 2.16
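The worked numbers above can be verified with a few lines of Python. The sketch also evaluates the two absolute-deviation measures from the previous slide on the same data; the variable names are illustrative.

```python
import math
import statistics

data = [0, 1, 2, 3, 4, 5, 6]
n = len(data)
mean = sum(data) / n

# Definitional form and the computational shortcut give the same s^2.
s2_def = sum((x - mean) ** 2 for x in data) / (n - 1)
s2_short = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
s = math.sqrt(s2_def)

# Average absolute deviation from the mean, and median absolute
# deviation from the median.
med = statistics.median(data)
avg_abs_dev = sum(abs(x - mean) for x in data) / n
med_abs_dev = statistics.median(abs(x - med) for x in data)

print(round(s2_def, 2), round(s, 2), round(avg_abs_dev, 2), med_abs_dev)
```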

Sample Variance and Standard Deviation

$s^2$ and s should only be used to summarize dispersion with symmetric distributions.

For asymmetric distribution, a more detailed breakup of the dispersion must be given in terms of quartiles.

For normal data and large samples:
– 50% of the data values fall between mean ± 0.67s
– 68% of the data values fall between mean ± 1s
– 95% of the data values fall between mean ± 2s
– 99.7% of the data values fall between mean ± 3s

For normally distributed data:
IQR = (mean + 0.67s) - (mean - 0.67s) = 1.34s

Standard Normal Density


Box (and Whiskers) Plots

Visual display of summary of data (more than five numbers)

Outlier Box Plot and Quantile Box Plot

Data: Gas Mileage

median

IQR = Q3 - Q1

Upper Fence = Q3 + 1.5 x IQR

Lower Fence = Q1 – 1.5 x IQR

Two lines are called whiskers and extend to the most extreme data values that are still inside the fences.

Observations outside the fences are regarded as possible outliers and are denoted by dots and circles or asterisks.
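A small Python sketch of the fence calculation for the gas mileage data follows. The quartile rule used here (average the kth and (k+1)st ordered values when np is an integer) is one convention among several and is an assumption; other software may report slightly different quartiles and fences.

```python
# Gas mileage data (n = 24) used for the box plot.
mpg = sorted([31, 13, 20, 21, 24, 25, 25, 27, 28, 40, 29, 30,
              31, 23, 31, 32, 35, 28, 36, 37, 38, 40, 50, 17])

# Quartiles by one common rule (assumption): here n*p is an integer k for
# p = 0.25 and 0.75, so average the kth and (k+1)st ordered values.
n = len(mpg)
k1, k3 = n // 4, 3 * n // 4
q1 = (mpg[k1 - 1] + mpg[k1]) / 2
q3 = (mpg[k3 - 1] + mpg[k3]) / 2
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme values still inside the fences;
# anything outside is flagged as a possible outlier.
inside = [x for x in mpg if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))
outliers = [x for x in mpg if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, whiskers, outliers)
```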

90th percentile

10th percentile

Rectangle:

Box Plot for Iris Data

(Figure: box plot of iris21.)

QQ Plots: Compare Sample to Theoretical Distribution

Order the data. The ith ordered data value is the pth quantile, where p = (i - 0.5)/n, 0<p<1.
Text uses i/(n+1). (Why can’t we just say i/n?)

Obtain quantiles from theoretical distribution corresponding to the values for p. E.g. qnorm(p), in S-Plus for normal distribution.

Plot theoretical quantiles vs. empirical quantiles (sorted data). S-Plus: n <- length(y); plot(qnorm((1:n - 0.5)/n), sort(y))

Fit line through first and third quartiles of each distribution.
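The points of a normal QQ plot can be computed without any plotting library. The Python sketch below pairs each sorted value with the standard normal quantile at p = (i - 0.5)/n; the five data values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(y):
    """Pair each sorted value with the standard normal quantile at p = (i - 0.5)/n."""
    n = len(y)
    std_normal = NormalDist()
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [std_normal.inv_cdf(p) for p in probs]
    return list(zip(theoretical, sorted(y)))

# Hypothetical small sample, for illustration only.
points = normal_qq_points([3.1, 2.9, 3.4, 3.0, 3.6])
for q, value in points:
    print(round(q, 3), value)
```

Using p = i/n instead would assign p = 1 to the largest observation, and the normal quantile at p = 1 is infinite, so that point could not be plotted.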

QQ (Normal) Plot for Iris Data

(Horizontal axis: Quantiles of Standard Normal.)

Normalizing Transformations

Data can be non-normal in a number of ways, e.g., the distribution may not be bell shaped or may be heavier tailed than the normal distribution or may not be symmetric.

Only the departure from symmetry can be easily corrected by transforming the data.

If the distribution is positively skewed, then the right tail needs to be shrunk inward. The most common transformation used for this purpose is the log transformation: x → log x (e.g., decibels, Richter, and Beaufort (?) scales); see Figure 4.11.

The square-root ($\sqrt{x}$) transformation provides a weaker shrinking effect; it is frequently used for (Poisson) count data.

For negatively skewed data, use the exponential ($e^x$) or squared ($x^2$) transformations.
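The shrinking effect of the log and square-root transformations can be seen numerically. The Python sketch below uses a moment-based skewness coefficient (one of several conventions, chosen here as an assumption) on a small hypothetical data set whose values double at each step.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (one of several conventions)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed data (each value doubles the previous one).
x = [1, 2, 4, 8, 16, 32, 64]
skew_raw = skewness(x)                          # positive: long right tail
skew_log = skewness([math.log(v) for v in x])   # near zero after log
skew_sqrt = skewness([math.sqrt(v) for v in x]) # in between: weaker shrinking
print(skew_raw, skew_sqrt, skew_log)
```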

Normal Probability Plot of data generated from a certain distribution

Normal probability plot of log of same data

Histogram of the same data

Summarizing Multivariate Data

When two or more variables are measured on each sampling unit, the result is multivariate data.

If only two variables are measured the result is bivariate data. One variable may be called the x variable and the other the y variable.

We can analyze the x and y variable separately with the methods we have learned so far, but these methods would NOT answer questions about the relationship between x and y.

– What is the nature of the relationship between x and y (if any)?

– How strong is the relationship?

– How well can one variable be predicted from the other?

Summarizing Bivariate Categorical Data

Two-way Table: Overall Job Satisfaction by Annual Salary

Annual Salary        Very Dissatisfied   Slightly Dissatisfied   Slightly Satisfied   Very Satisfied   Row Sum
Less than $10,000           81                   64                     29                 10            184
$10,000-25,000              73                   79                     35                 24            211
$25,000-50,000              47                   59                     75                 58            239
More than $50,000           14                   23                     84                 69            190
Column Sum                 215                  225                    223                161            824

The numbers in the cells are the frequencies of each possible combination of categories.

Cell, row and column percentages can be computed to assess distribution.

Column Percentages for Income and Job Satisfaction Table

Annual Salary        Very Dissatisfied   Slightly Dissatisfied   Slightly Satisfied   Very Satisfied
Less than $10,000          37.7                 28.4                   13.0                 6.2
$10,000-25,000             34.0                 35.1                   15.7                14.9
$25,000-50,000             21.9                 26.2                   33.6                36.0
More than $50,000           6.5                 10.2                   37.7                42.9
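The column percentages can be recomputed from the two-way table of counts with a short Python sketch. The counts are taken from the table; the variable names are illustrative.

```python
# Rows: salary bands; columns: Very Dissatisfied, Slightly Dissatisfied,
# Slightly Satisfied, Very Satisfied (counts from the two-way table).
table = [
    [81, 64, 29, 10],   # Less than $10,000
    [73, 79, 35, 24],   # $10,000-25,000
    [47, 59, 75, 58],   # $25,000-50,000
    [14, 23, 84, 69],   # More than $50,000
]

row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]

# Column percentage = 100 * cell count / column total.
col_pct = [[round(100 * cell / col_sums[j], 1) for j, cell in enumerate(row)]
           for row in table]

for row in col_pct:
    print(row)
```

Row percentages follow the same pattern with `row_sums` as the divisor.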

Simpson’s Paradox

“Lurking variables [excluded from consideration] can change or reverse a relation between two categorical variables!”

Doctors’ Salaries

• The interpreter of a survey of doctors’ salaries in 1980 and again in 1990 concluded that their average income actually declined from $90,000 in 1980 to $88,000 in 1990.

• Income is measured here in nominal (not adjusted for inflation) dollars.

What about the “Rest of the Story”?

• What deductive piece of logic might clarify the real meaning of this particular pair of statistics?

• Look more deeply: Is there a piece missing?

• Here is a very simple breakdown of “the numbers” that may help.

Doctors’ Salaries by Age

                1980                       1990
Age     fraction, f1   Income      fraction, f2   Income
<=45        0.5        $60,000         0.7        $70,000
>45         0.5        $120,000        0.3        $130,000

Mean                   $90,000                    $88,000

Conclusion

• If MD salaries are broken into two categories by age:
– Doctors aged 45 or younger constituted 50% of the MD population in 1980 and 70% in 1990
– Younger doctors tend to earn less than older, more experienced doctors
– Parsed by age, MD salaries increased in both age categories!
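The arithmetic behind this reversal is a weighted mean. The Python sketch below recomputes the overall means from the age-group fractions and incomes in the table; the dictionary layout and names are illustrative.

```python
# (fraction of doctors, mean income) per age group, from the table.
y1980 = {"<=45": (0.5, 60_000), ">45": (0.5, 120_000)}
y1990 = {"<=45": (0.7, 70_000), ">45": (0.3, 130_000)}

def overall_mean(groups):
    """Weighted mean: sum over groups of fraction * group mean."""
    return sum(frac * income for frac, income in groups.values())

mean_1980 = overall_mean(y1980)
mean_1990 = overall_mean(y1990)

# Every age group earns more in 1990, yet the overall mean is lower,
# because the mix shifted toward the lower-paid younger group.
each_group_up = all(y1990[g][1] > y1980[g][1] for g in y1980)
print(round(mean_1980), round(mean_1990), each_group_up)
```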

Gender Bias in Graduate Admissions

For this example, see Johnson and Wichern, Business Statistics: Decision Making with Data. Wiley, First Edition, 1997.

Statistical Ideal

Randomized study

Gender should be randomly assigned to applicants!

This would automatically balance out the departmental factor which is not controlled for in the original plaintiff (observational) study.

Practical reality

Gender cannot be assigned randomly.

Control for department factor by comparing admission within department, i.e. controlling for the confounding factor after completion of the study.

“There are lies, damn lies and then there are statistics!”

Benjamin Disraeli

Summarizing Bivariate Numerical Data

No.   Method 1 (xi)   Method 2 (yi)
1          88              86
2          78              81
3          90              87
4          91              90
5          89              89
6          79              80
7          76              74
8          80              78
9          78              76
10         90              86

(Figure: scatter plot of the data; horizontal axis: Method 1.)

Is it easier to grasp the relationship in the data between Method 1 and Method 2 from the Table or from the Figure (scatter plot)?

Labeled Scatter Plot

Year   Country A   Country B   Country C   Country D
1965     64.7        64.8        61.1        86.2
1970     65.0        65.2        61.2        86.5
1975     66.8        66.3        63.0        87.4
1980     66.9        67.4        62.8        87.0
1985     67.9        68.5        63.1        89.2
1990     68.3        69.1        63.5        89.4
1995     70.8        69.4        64.3        90.1
2000     71.7        70.0        65.1        90.5

Can you see the improvements in the literacy rates for these four countries more easily in the Table or in the Figure?

(Figure: labeled scatter plot by year, 1960 to 2005, for Country A, Country B, Country C, Country D.)

Sample Correlation Coefficient

A single numerical summary statistic which measures the strength of a linear relationship between x and y.

r = covar(x,y)/(stddev(x)*stddev(y))

$r = \frac{s_{xy}}{s_x s_y}$ where $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

Properties similar to the population correlation coefficient ρ
– Unitless quantity
– Takes values between –1 and 1
– The extreme values are attained if and only if the points (xi, yi) fall exactly on a straight line (r = -1 for a line with negative slope and r = +1 for a line with positive slope).
– Takes values close to zero if there is no linear relationship between x and y.

• See Figures 4.15, 4.16, 4.17 (a) and (b)
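The definition r = covar(x,y)/(stddev(x)*stddev(y)) can be applied directly to the Method 1 and Method 2 measurements. The Python sketch below uses n - 1 divisors for the covariance and both standard deviations; the function name is illustrative.

```python
import math

def sample_correlation(x, y):
    """r = s_xy / (s_x * s_y), with (n - 1) divisors throughout."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Method 1 / Method 2 measurements from the bivariate table.
method1 = [88, 78, 90, 91, 89, 79, 76, 80, 78, 90]
method2 = [86, 81, 87, 90, 89, 80, 74, 78, 76, 86]
r = sample_correlation(method1, method2)
print(round(r, 3))

# Points exactly on a line with negative slope give r = -1 (up to rounding).
r_line = sample_correlation([1, 2, 3], [6, 4, 2])
print(r_line)
```

Because the same divisor appears in the numerator and denominator, using n instead of n - 1 throughout would give the same r.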

What is the correlation?


Correlation and Causation

High correlation is frequently mistaken for a cause and effect relationship. Such a conclusion may not be valid in observational studies, where the variables are not controlled.
– A lurking variable may be affecting both variables.
– One can only claim association, not causation.

Countries with high fat diets tend to have higher incidences of cancer. Can we conclude causation?

A common lurking variable in many studies is time order.
– Wealth and health problems go up with age. Does wealth cause health problems?

Sometimes correlations can be found without any plausible explanation, e.g., sun spots and economic cycles.

Plots for Multivariate Data

• Side by Side Box Plots
• Scatter plot matrix
• Three dimensional plots
• Brush and Spin plots – add motion
• Maps for spatial data

Box Plots of Auto Data (widths indicate number of each type)

(Figure: side-by-side box plots by fuel.frame[, "Type"]: Compact, Large, Medium, Small, Sporty, Van.)

Scatter plot matrix Iris –(Versicolor)

(Panels: Sepal.L., Sepal.W., Petal.L., Petal.W.)

Galaxy (S-PLUS Language Reference)

Radial Velocity of Galaxy NGC7531

SUMMARY:
The galaxy data frame records the radial velocity of a spiral galaxy measured at 323 points in the area of sky which it covers. All the measurements lie within seven slots crossing at the origin. The positions of the measurements are given by four variables (columns).

ARGUMENTS:
• east.west – the east-west coordinate. The origin, (0,0), is near the center of the galaxy, east is negative, west is positive.
• north.south – the north-south coordinate. The origin, (0,0), is near the center of the galaxy, south is negative, north is positive.
• angle – degrees of counter-clockwise rotation from the horizontal of the slot within which the observation lies.
• radial.position – signed distance from origin; negative if east-west coordinate is negative.
• velocity – radial velocity measured in km/sec.

This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Galaxy Data

(Panels: east.west, north.south, radial.position, velocity.)

Galaxy 3D

Earthquake Data

(Panels: longitude, latitude, magnitude.)

Earthquake 3D

Narrative Graphics of Space and Time

• Adding spatial dimensions to a graph so that the data are moving over space and time can enhance the explanatory power of time series displays

• The Classic of Charles Joseph Minard (1781-1870) shows the terrible fate of Napoleon’s army during his Russian campaign of 1812. A copy of the map is available at http://www.math.yorku.ca/SCS/Gallery/

Map Source: Minard, C. J. Carte figurative des pertes successives en hommes de l'armée qu'Annibal conduisit d'Espagne en Italie en traversant les Gaules (selon Polybe). Carte figurative des pertes successives en hommes de l'armée française dans la campagne de Russie, 1812-1813. École Nationale des Ponts et Chaussées (ENPC), 1869. Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.

Beginning at the left on the Polish-Russian border near the Niemen River the thick band shows the size of the army (422,000) as it invaded Russia in June 1812.
– The width of the band indicates the size of the army…
– The army reached a sacked and deserted Moscow with 100,000 men
– Napoleon’s retreat path from Moscow is depicted by a dark, lower band, linked to a temperature scale and dates at the bottom.
– The men struggled into Poland with only 10,000 troops remaining.

• Minard’s graphic tells a rich, coherent story with its multivariate data, far more enlightening than just a single number

• SIX variables are plotted:
– Its location on a two-dimensional surface
– Direction of army’s movement
– Temperature as a function of time during the retreat
– The size of the army

• “It may well be the best statistical graphic ever drawn.” Edward Tufte (The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001, p. 40)

Scatter plot matrix of air data set in S-Plus

(Figure: scatter plot matrix; panels include radiation and temperature.)

plot(temperature,ozone)

(Figure: ozone vs. temperature.)

We often try to fit a straight line to bivariate data as a way to summarize bivariate data:

y = data = fit + residual

fit = a + bx

The parameters (coefficients) a and b can be found in many ways. Least-squares is commonly used:

$\min_{a,b} \sum_{i} (y_i - a - b x_i)^2$, which gives $a = \bar{y} - b\bar{x}$.

The fit is often denoted by $\hat{y}_i = a + b x_i$. The residuals are $y_i - \hat{y}_i$.

What about curvature and outliers?
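A minimal Python sketch of the least-squares line follows, using the standard closed form b = Sxy/Sxx and a = ybar - b*xbar. The five data points are hypothetical, chosen to lie roughly on y = 1 + 2x.

```python
def least_squares_line(x, y):
    """Return (a, b) minimizing sum((y_i - a - b*x_i)**2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Hypothetical data, roughly y = 1 + 2x with small deviations.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = least_squares_line(x, y)
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print(round(a, 3), round(b, 3))
```

For a least-squares fit with an intercept, the residuals sum to zero and are uncorrelated with x, which gives a quick sanity check on any implementation.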

Fitting Lines

Resistant Line

Divide x data into thirds. Find median of x in each third, and median of the y’s that correspond to the x’s in each third. Call these three pairs (xa, ya), (xb, yb), (xc, yc). Fit a least-squares line to these three points.

Or consider other metrics, e.g. $\min_{a,b} \operatorname{median}_i |y_i - a - b x_i|$

These are alternatives to least-squares.
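The three-group recipe described above can be sketched in Python as follows. How to split the data when n is not divisible by 3 is not specified in the slides, so the rule used here (extra points go to the middle third) is an assumption, as are the function name and the example data.

```python
import statistics

def resistant_line(x, y):
    """Three-group line: medians of each third, then least squares on 3 points."""
    pairs = sorted(zip(x, y))          # order the points by x
    n = len(pairs)
    cut1, cut2 = n // 3, n - n // 3    # assumption: extra points go to the middle third
    groups = [pairs[:cut1], pairs[cut1:cut2], pairs[cut2:]]
    mx = [statistics.median(p[0] for p in g) for g in groups]
    my = [statistics.median(p[1] for p in g) for g in groups]
    # Least-squares line through the three summary points.
    xbar, ybar = sum(mx) / 3, sum(my) / 3
    b = (sum((u - xbar) * (v - ybar) for u, v in zip(mx, my))
         / sum((u - xbar) ** 2 for u in mx))
    return ybar - b * xbar, b

# Hypothetical data on the line y = 1 + 2x, with one gross outlier at x = 9.
x = list(range(1, 10))
y = [1 + 2 * xi for xi in x]
y[-1] = 100
a, b = resistant_line(x, y)
print(a, b)
```

Because each third is summarized by medians, the single outlier does not move the summary points, and the underlying line is recovered; an ordinary least-squares fit to all nine points would be pulled toward the outlier.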

abline(lm(ozone~temperature))

(Figure: ozone vs. temperature with the fitted least-squares line.)

Prediction and Residuals

Fitted lines can be used to predict. If we go too far beyond the range of the x-data, we can expect poor results. Consider problems of interpolation and extrapolation.

Examination of residuals helps tell us how well our model (a line) fits the data. We also compute

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$

and call s the standard deviation of the residuals. Note the use of n − 2 because two degrees of freedom are used to find a and b.
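As a rough illustration of the least-squares fit and the residual standard deviation s (with n − 2 in the denominator), here is a minimal Python sketch. The function name `least_squares_line` and the data are hypothetical; this is not the ozone data from the slides.

```python
import math

def least_squares_line(x, y):
    """Return intercept a, slope b, and residual SD s for the fit y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx                      # least-squares slope
    a = y_bar - b * x_bar              # least-squares intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # n - 2 because two degrees of freedom are used to find a and b
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return a, b, s

# Hypothetical data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, s = least_squares_line(x, y)
```

Plotting the residuals against the fitted values is the natural next step for checking the fit.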

Residual Plots

Plot the residuals:
1. against fitted values ($\hat{y}_i$)
2. against explanatory variable
3. against other possible explanatory variables
4. against time, if applicable.

We want these pictures to look random, with no pattern.

Outliers and Influence

Values of x far away from the bulk of the x data have a lot of leverage on the line. Values of y with large residuals at high leverage points will usually be quite influential on the fitted line.

We can check by setting influential points aside and comparing fits and residuals.

[Figure: plot of residuals vs. observation number for ozone data]

[Figure: residuals vs. fitted values, fitted(lmfit), for ozone data]

Smoothing

• Fitting curves to data
• Separate signal from noise
• Fitted values, $\hat{y}$, are a weighted average of the response y.
• Weights are a function of predictor x.
• Degrees of freedom indicate roughness
• Simple linear regression, df=2

plot(temperature,ozone)
lines(smooth.spline(temperature,ozone,df=16.5))

[Figure: ozone vs. temperature with smoothing spline, df=16.5]

plot(temperature,ozone)
lines(smooth.spline(temperature,ozone,df=6))

[Figure: ozone vs. temperature with smoothing spline, df=6]

Time-Series/Runs Chart

[Figure: plot of compression vs. time (order of production); x-axis: Production Order]

This is an example of a process not in “statistical control”, as seen from the downward drift.

The usual statistical procedures (such as means, standard deviations, confidence intervals, hypothesis tests) should NOT be applied until the process has been stabilized.

Time-Series Data

Data obtained at successive time points for the same sampling unit(s).

A time series typically consists of the following components.
1. Stable component
2. Trend component
3. Seasonal component
4. Random component
5. Cyclic (long term) component

Univariate time series { xt, t = 1, 2, …, T }

Time-series plot: Xt vs. Time

Data Smoothing and Forecasting

Two types of averages for time-series data:

1. Moving averages

2. Exponentially weighted averages

These should be used only if mean is constant (process is in “statistical control” or is stationary) or mean varies slowly.

Regression techniques can be used to model trends.

More advanced methods are needed to model seasonality and dependence between successive observations (autocorrelation).

(Arithmetic) Moving Averages (MA)

The average of a set of w successive data values (called a window); the oldest data is successively dropped off.

$MA_t = \frac{x_t + x_{t-1} + \cdots + x_{t-w+1}}{w}$, for $t = w, w+1, \ldots, T$

The bigger the window (w), the more the smoothing.

MA forecast: $\hat{x}_t = MA_{t-1}$

Forecast error: $e_t = x_t - \hat{x}_t = x_t - MA_{t-1}$, $t = 2, \ldots, T$

Mean Absolute Percent Error: $MAPE = \left( \frac{1}{T-1} \sum_{t=2}^{T} \left| \frac{e_t}{x_t} \right| \right) \times 100\%$

(Note the error in eqn 4.12 in the textbook: x, not y, belongs in the denominator.)
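The moving-average forecast and its MAPE can be sketched in a few lines of Python. This sketch assumes forecasts are scored only for t = w+1, …, T, the periods where a full-window MA_{t−1} exists; the function names and data are hypothetical.

```python
def moving_averages(x, w):
    """MA_t for t = w, ..., T (1-based t), returned as a dict keyed by t."""
    return {t: sum(x[t - w:t]) / w for t in range(w, len(x) + 1)}

def ma_forecast_errors(x, w):
    """e_t = x_t - MA_{t-1} for every t that has a full-window forecast."""
    ma = moving_averages(x, w)
    return {t: x[t - 1] - ma[t - 1] for t in range(w + 1, len(x) + 1)}

def mape(x, errors):
    """Mean absolute percent error, with x_t (not the forecast) in the denominator."""
    return 100.0 * sum(abs(e / x[t - 1]) for t, e in errors.items()) / len(errors)

# Hypothetical series for illustration only
x = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
errors = ma_forecast_errors(x, w=3)
mape_value = mape(x, errors)
```

A larger w gives smoother averages but makes the forecast react more slowly to a change in level.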

Exponentially Weighted Moving Averages

Uses all data, but the most recent data is weighted the heaviest.

$EWMA_t = w x_t + (1 - w) EWMA_{t-1}$

where 0 < w < 1 is the smoothing constant (usually 0.2 to 0.3).

EWMA forecast: $\hat{x}_t = EWMA_{t-1}$

Forecast error: $e_t = x_t - \hat{x}_t = x_t - EWMA_{t-1}$

Alternative formula: $EWMA_t = w e_t + EWMA_{t-1}$

Interpretation: If the forecast error is positive (forecast underestimated the actual value), the next period’s forecast is adjusted upward by a fraction of the forecast error.
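The two forms of the EWMA update give the same numbers, which the following Python sketch checks. The starting value EWMA_0 is not specified in the slides; taking it equal to the first observation is an assumption made here, and the data are hypothetical.

```python
def ewma(x, w, start=None):
    """EWMA_t = w*x_t + (1 - w)*EWMA_{t-1}; start is EWMA_0 (assumed: first observation)."""
    level = x[0] if start is None else start
    out = []
    for xt in x:
        level = w * xt + (1 - w) * level
        out.append(level)
    return out

def ewma_error_form(x, w, start=None):
    """Same recursion written as EWMA_t = EWMA_{t-1} + w*e_t."""
    level = x[0] if start is None else start
    out = []
    for xt in x:
        error = xt - level            # e_t = x_t - EWMA_{t-1}
        level = level + w * error     # move the forecast by a fraction w of the error
        out.append(level)
    return out

# Hypothetical series for illustration only
x = [10.0, 12.0, 11.0, 13.0]
smoothed = ewma(x, w=0.2)
smoothed_alt = ewma_error_form(x, w=0.2)
```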

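Autocorrelation Coefficient

For time-series data, observations separated by a specified time period (called a lag) are said to be lagged.

First-order autocorrelation or the serial correlation coefficient between observations with lag = 1:

$r_1 = \frac{\sum_{t=2}^{T} (x_{t-1} - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$

The k-th order autocorrelation coefficient:

$r_k = \frac{\sum_{t=k+1}^{T} (x_{t-k} - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$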
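A minimal Python sketch of the lag-k autocorrelation, assuming the common definition in which the lagged cross-products of deviations from the overall mean are divided by the total sum of squared deviations. The function name is hypothetical.

```python
def autocorrelation(x, k=1):
    """Lag-k autocorrelation: lagged cross-products over total sum of squares."""
    T = len(x)
    x_bar = sum(x) / T
    num = sum((x[t - k] - x_bar) * (x[t] - x_bar) for t in range(k, T))
    den = sum((xt - x_bar) ** 2 for xt in x)
    return num / den

# Hypothetical series with a steady upward trend
series = [1.0, 2.0, 3.0, 4.0, 5.0]
r1 = autocorrelation(series, k=1)
r2 = autocorrelation(series, k=2)
```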

Lag Plots in S-Plus

lag.plot(x) or plot(x[1:(n-i)],x[(i+1):n])

[Figure: Housing starts 1966:1974, lagged scatterplots; panels: lagged 1 through lagged 6]

These graphs were created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

John W. Tukey (1915 - 2000)

Statistician at Princeton Univ. and Bell Labs

Co-developer of Fast Fourier Transform

Coined terms “bit” (binary digit) and “software”

“An approximate answer to the right problem is worth a great deal more than a precise answer to the wrong problem.”

Developed new graphical displays (stem-and-leaf and box plots) to examine the data, as a reaction to the “mathematization of statistics.”

Review of Probability

Corresponds to Chapter 2 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford (Johns Hopkins University)

Concepts (Review)

• A population is a collection of all units of interest.
• A sample is a subset of a population that is actually observed.
• A measurable property or attribute associated with each unit of a population is called a variable.
• A parameter is a numerical characteristic of a population.
• A statistic is a numerical characteristic of a sample.
• Statistics are used to infer the values of parameters.
• A random sample gives a non-zero chance to every unit of the population to enter the sample.
• In probability, we assume that the population and its parameters are known and compute the probability of drawing a particular sample.
• In statistics, we assume that the population and its parameters are unknown and the sample is used to infer the values of the parameters.
• Different samples give different estimates of population parameters (called sampling variability).
• Sampling variability leads to “sampling error”.
• Probability is deductive (general -> particular).
• Statistics is inductive (particular -> general).

Difference between Statistics and Probability

Statistics: Given the information in your hand, what is in the box?

Probability: Given the information in the box, what is in your hand?

Based on: Statistics, Norma Gilbert, W.B. Saunders Co., 1976.

Probability Concepts

Random experiment – a procedure whose outcome cannot be predicted in advance. E.g. toss a coin twice.

Sample Space (S) – the finest grain, mutually exclusive, collectively exhaustive listing of all possible outcomes (Drake, Fundamentals of Applied Probability Theory). S = {(H,H), (H,T), (T,H), (T,T)}

Event (A) – a set of outcomes (subset of S). E.g. no heads: A = {(T,T)}

Union (or): E.g. A = heads on first, B = heads on second, A U B = {(H,T), (H,H), (T,H)}

Intersection (and): E.g. A = heads on first, B = heads on second, A ∩ B = {(H,H)}

Complement of event A – the set of all outcomes not in A. E.g. A = {(T,T)}, Ac = {(H,H), (H,T), (T,H)}

Venn Diagram

Axioms of Probability

Associated with each event A in S is the probability of A, P(A).

Axioms:
1. P(A) ≥ 0
2. P(S) = 1, where S is the sample space
3. P(A U B) = P(A) + P(B) if A and B are mutually exclusive

E.g. P(ace or king) = P(ace) + P(king) = 1/13 + 1/13 = 2/13.

Theorems about probability can be proved using these axioms, and these theorems can be used in probability calculations.

P(A) = 1 - P(Ac) (see “birthday problem” on p. 13)

P(A U B) = P(A) + P(B) – P(A∩B)

E.g. P(ace or black) = P(ace) + P(black) – P(ace and black) = 4/52 + 26/52 – 2/52 = 28/52 = 7/13

Conditional Probability: P(A|B) = P(A∩B)/P(B), so P(A∩B) = P(A|B)P(B)

E.g. Drawing a card from a deck of 52 cards, P(Heart)=1/4.

However, if it is known that the card is red, P(Heart | Red) = ½.

Sample space has been reduced to the 26 red cards.

(See page 16)

Independence: P(A|B) = P(A)

There are situations in which knowing that event B occurred gives no information about event A. E.g. knowing that a card is black gives no information about whether it is an ace: P(ace | black) = 2/26 = 4/52 = P(ace).

If two events are independent, then P(A∩B) = P(A)P(B), because P(A∩B) = P(A|B)P(B) = P(A)P(B).

E.g. P(ace of hearts) = P(ace) * P(hearts) = 4/52 * 13/52 = 1/52

Independent events are not the same as disjoint events. Disjoint events are strongly dependent: e.g. if a card is red, it can’t be black, so P(A|B) = 0.

Summary

If A and B are disjoint: P(A U B) = P(A) + P(B) P (A ∩B) =0

If A and B are independent: P(A ∩ B) = P(A) * P(B) P(A U B) = P(A) + P(B) – P(A ∩B)

Bayes Theorem

P(A|B) = P(B|A) P(A) / P(B)

• E.g. P(heart | red) = P(red | heart) * P(heart) / P(red) = 1 * 0.25 / 0.5 = 0.5

• Monty Hall problem (page 20)

Sensor Problem

Assume that there are two chemical hazard sensors: A and B.

Let P(A falsely detecting a hazardous chemical) = 0.05 and the same for B.

What is the probability of both sensors falsely detecting a hazardous chemical?

P(A ∩ B) = P(A|B) × P(B) = P(A) × P(B) = 0.05 × 0.05 = 0.0025

– only if A and B are independent (use different detection methods).

If A and B are both “fooled” by the same chemical substance (same type of sensor), then P(A ∩ B) = P(A | B) × P(B) = 1 × 0.05 = 0.05, which is 20 times the rate of false alarms.

DON’T assume independence without good reason!

HIV Testing Example (made-up data)

                     HIV +    HIV -    Total
Test positive (+)       95      495      590
Test negative (-)        5     9405     9410
Total                  100     9900    10000

P(HIV +) = 100/10000 = 0.01 (prevalence)

Want these to be high:
P(Test + | HIV +) = 95/100 = 0.95 (sensitivity)
P(Test - | HIV -) = 9405/9900 = 0.95 (specificity)

Want these to be low:
P(Test - | HIV +) = 5/100 = 0.05 (false negatives)
P(Test + | HIV -) = 495/9900 = 0.05 (false positives)

P(HIV + | Test +) = 95/590 = 0.16

This is one reason why we don’t have mass HIV screening.
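The 0.16 figure can also be obtained from Bayes' theorem using only the prevalence, sensitivity, and specificity. The following Python sketch uses exact fractions and the made-up counts from the table; the variable names are illustrative.

```python
from fractions import Fraction

# Made-up values from the HIV testing table
prevalence = Fraction(100, 10000)     # P(HIV +)
sensitivity = Fraction(95, 100)       # P(Test + | HIV +)
specificity = Fraction(9405, 9900)    # P(Test - | HIV -)

# Law of total probability: P(Test +)
p_test_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(HIV + | Test +)
p_hiv_given_pos = sensitivity * prevalence / p_test_pos
```

With a low prevalence, most positive tests come from the large HIV − group, even though the test itself is 95% accurate in both directions.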

Suggestions for Solving Probability Problems

Draw a picture
– Venn diagram
– Tree or event diagram (Probabilistic Risk Assessment)
– Sketch

Write out all possible combinations if feasible

Do a smaller scale problem first
– Figure out the algorithm for the solution
– Increment the size of the problem by one and check algorithm for correctness
– Generalize algorithm (mathematical induction)

Counting rules

Number of Possible Arrangements of Size r from n Objects:

              Without Replacement      With Replacement
Ordered       $\frac{n!}{(n-r)!}$      $n^r$
Unordered     $\binom{n}{r}$           $\binom{n+r-1}{r}$

Source: Casella, George, and Roger L. Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990, page 16.

Counting rules (from Casella & Berger)

For these examples, see pages 15-16 of: Casella, George, and Roger L. Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990.

Birthday Problem

At a gathering of s randomly chosen students, what is the probability that at least 2 will have the same birthday?

P(at least 2 have same birthday) = 1 - P(all s students have different birthdays).

Assume 365 days in a year. Think of students’ birthdays as a sample of these 365 days.

The total number of possible outcomes is: N = 365^s (ordered, with replacement)

The number of ways that s students can have different birthdays is M = 365!/(365-s)! (ordered, without replacement)

P(all s students have different birthdays) is M / N.
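A short Python sketch of this calculation, assuming 365 equally likely birthdays. It multiplies the ratios (365 − i)/365 one at a time, which equals M / N without forming the huge factorials; the function names are hypothetical.

```python
def p_all_different(s, days=365):
    """P(all s students have different birthdays) = [days!/(days-s)!] / days**s."""
    p = 1.0
    for i in range(s):
        p *= (days - i) / days   # the (i+1)-th student avoids the i birthdays already taken
    return p

def p_shared(s, days=365):
    """P(at least 2 of s students share a birthday)."""
    return 1.0 - p_all_different(s, days)

p23 = p_shared(23)
```

The probability of a shared birthday passes one half at s = 23.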

[Figure: probability that all students have different birthdays vs. number of students]

See “Harry Potter and the Sorcerer’s Stone” by J.K. Rowling.

Another Counting Rule

The number of ways of classifying n items into k groups with ri in group i, r1+r2+…+rk=n, is:

n! / (r1! r2! r3!...rk!)

For example: How many ways are there to assign 100 incoming students to the 4 houses at Hogwarts?

(1.6 * 10^57, taking 25 students per house: 100! / (25!)^4)
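The count can be checked with exact integer arithmetic. This Python sketch assumes the 100 students are split evenly, 25 per house, which is the split that reproduces the 1.6 * 10^57 figure; the function name is hypothetical.

```python
from math import factorial

def multinomial_count(group_sizes):
    """Number of ways to classify n = sum(group_sizes) items into groups of the given sizes."""
    count = factorial(sum(group_sizes))
    for r in group_sizes:
        count //= factorial(r)   # each intermediate quotient is an exact integer
    return count

small = multinomial_count([2, 1, 1])       # 4! / (2! 1! 1!)
hogwarts = multinomial_count([25] * 4)     # 100! / (25!)^4
```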

Random Variables

A random variable (r.v.) associates a unique numerical value with each outcome in the sample space.

Example:
X = 1 if coin toss results in a head
X = 0 if coin toss results in a tail

Discrete random variables: number of possible values is finite or countably infinite: x1, x2, x3, x4, x5, x6, …

Probability mass function (p.m.f.): f(x) = P(X = x) (sum over all possible values = 1 always)

Cumulative distribution function (c.d.f.): $F(x) = P(X \le x) = \sum_{k \le x} f(k)$

• See Table 2.1 on p. 21 (p.m.f. and c.d.f. for sum of two dice)
• See Figure 2.5 on p. 22 (p.m.f. and c.d.f. graphs for two dice)

Continuous Random Variables

An r.v. is continuous if it can assume any value from one or more intervals of real numbers.

Probability density function (p.d.f.) f(x):

f(x) ≥ 0

$\int_{-\infty}^{\infty} f(x)\,dx = 1$ (area under the curve = 1 always)

$P(a \le X \le b) = \int_a^b f(x)\,dx$ for any a ≤ b

[Figure: P(0<X<1) for standard normal = area under curve between 0 and 1]

Cumulative Distribution Function

The cumulative distribution function (c.d.f.), denoted F(x), for a continuous random variable is given by:

$F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy$

$f(x) = \frac{dF(x)}{dx}$

P(0<Z<1) for standard normal = F(1) - F(0) = 0.8413 - 0.5 = 0.3413 (table page 674)

[Figure: standard normal plot for P(0<Z<1)]
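The table lookup can be reproduced numerically. This Python sketch writes the standard normal c.d.f. in terms of the error function from the standard library; the helper name `std_normal_cdf` is hypothetical.

```python
import math

def std_normal_cdf(z):
    """Phi(z) = P(Z <= z) for a standard normal Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(0 < Z < 1) = F(1) - F(0)
p = std_normal_cdf(1.0) - std_normal_cdf(0.0)
```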

Expected Value

The expected value or mean of a discrete r.v. X, denoted by E(X), $\mu_X$, or simply µ, is defined as:

$E(X) = \mu = \sum_x x f(x) = x_1 f(x_1) + x_2 f(x_2) + \cdots$

This is essentially a weighted average of the possible values the r.v. can assume, with weights f(x).

The expected value of a continuous r.v. X is defined as:

$E(X) = \mu = \int x f(x)\,dx$

Variance and Standard Deviation

The variance of an r.v. X, denoted by Var(X), $\sigma_X^2$, or simply σ², is defined as:

Var(X) = σ² = E[(X - µ)²]

Var(X) = E[(X - µ)²] = E(X² - 2µX + µ²)

= E(X²) - 2µE(X) + E(µ²)

= E(X²) - 2µ·µ + µ²

= E(X²) - µ² = E(X²) - [E(X)]²

The standard deviation (SD) is the square root of the variance. Note that the variance is in the square of the original units, while the SD is in the original units.

• See Example 2.17 on p. 26 (mean and variance of two dice)
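As a worked instance of these definitions, the following Python sketch computes the mean and variance of the sum of two fair dice exactly, using E(X) = Σ x f(x) and Var(X) = E(X²) − [E(X)]². Taking the example to be the sum of two fair dice is an assumption based on the table of the p.m.f. mentioned earlier.

```python
from fractions import Fraction
from itertools import product

# p.m.f. of the sum of two fair dice: each of the 36 outcomes has probability 1/36
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, 0) + Fraction(1, 36)

mean = sum(x * p for x, p in pmf.items())            # E(X)
second_moment = sum(x ** 2 * p for x, p in pmf.items())   # E(X^2)
variance = second_moment - mean ** 2                 # E(X^2) - [E(X)]^2
```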

Quantiles and Percentiles

For 0 ≤ p ≤ 1 the pth quantile (or the 100pth percentile), denoted by $\theta_p$, of a continuous r.v. X is defined by the following equation:

$P(X \le \theta_p) = F(\theta_p) = p$

$\theta_{0.5}$ is called the median

• See Example 2.20 on p. 30 (exponential distribution)

Jointly distributed random variables and independent random variables

See pp. 30-33

Joint Distributions

For a discrete distribution:

f(x,y) = P(X=x,Y=y)

f(x,y) ≥ 0 for all x and y ∑x ∑y f(x,y)=1

Marginal Distributions

• g(x) = P(X=x) = ∑y f(x,y) • h(y) = P(Y=y) = ∑x f(x,y)

• Independent if joint distribution factors into product of marginal distributions

• f(x,y) = g(x) h(y)

Conditional Distributions

f(y|x) = f(x,y) / g(x)

If X and Y are independent:

f(y|x) = g(x) h(y) / g(x) = h(y)

A conditional distribution is just a probability distribution defined on a reduced sample space. For every x, ∑y f(y|x) = 1

Covariance and Correlation

$Cov(X,Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y) = E(XY) - \mu_X \mu_Y$

If X and Y are independent, then E(XY) = E(X)E(Y), so the covariance is zero. The other direction is not true.

Note that: $E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y f(x,y)\,dx\,dy$

$\rho_{XY} = corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$

• See Examples 2.26 and 2.27 on pp. 37-38 (prob vs. stat grades)

Example 2.25 in text

y = x with probability 0.5 and y = -x with probability 0.5

y is not independent of x, yet covariance is zero

[Figure: scatter plot of y vs. x for Example 2.25]
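The claim that zero covariance does not imply independence can be checked exactly. This Python sketch assumes X is equally likely to be any integer from 1 to 50 (the distribution of X is not given in the slides) and sets Y = X or Y = −X with probability 0.5 each.

```python
from fractions import Fraction

# Joint p.m.f.: X uniform on 1..50 (an assumption), Y = X or -X with probability 1/2 each
joint = {}
for x in range(1, 51):
    for y in (x, -x):
        joint[(x, y)] = Fraction(1, 50) * Fraction(1, 2)

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y                     # Cov(X,Y) = E(XY) - E(X)E(Y)

# Dependence check: knowing X = 1 changes the distribution of Y
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)
p_y1_given_x1 = joint[(1, 1)] / p_x1
```

Here the covariance is exactly zero, while P(Y = 1 | X = 1) differs from P(Y = 1), so X and Y are dependent.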

Two Famous Theorems

Chebyshev’s Inequality: Let c > 0 be a constant. Then, irrespective of the distribution of X,

$P(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}$

• See Example 2.29 on p. 41 (exact vs. Cheb. for two dice)

Weak Law of Large Numbers: Let $\bar{X}$ be the sample mean of n i.i.d. observations from a population with finite mean µ and variance σ². Then, for any fixed c > 0,

$P(|\bar{X} - \mu| \ge c) \to 0$ as $n \to \infty$

Selected Discrete Distributions

Bernoulli trials: (single coin flip)

$f(x) = P(X = x) = \begin{cases} p & \text{if } x = 1 \text{ (success)} \\ 1 - p & \text{if } x = 0 \text{ (failure)} \end{cases}$

E(X) = p and Var(X) = p(1-p)

Binomial distribution: (multiple coin flips)

X successes out of n trials

$f(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$ for x = 0, 1, …, n

E(X) = np and Var(X) = np(1-p)

• See Example 2.30 on p. 43 (teeth)

Selected Discrete Distributions (cont)

Hypergeometric: drawing balls from the box without replacing the balls (as in the hand with the question mark)

Poisson: number of occurrences of a rare event

Geometric: number of failures before the first success

Multinomial: more than two outcomes

Negative Binomial: number of trials to get r successes

Uniform: N equally likely events (1, 2, 3, …, N)

• See Table 2.5, p. 59 for properties of these distributions

Selected Continuous Distributions

Uniform: equally likely over an interval

Exponential: lifetimes of devices with no wear-out (“memoryless”), interarrival times when the arrivals are at random

Gamma: used to model lifetimes, related to many other distributions

Lognormal: lifetimes (similar shape to Gamma but with longer tail)

Beta: not equally likely over an interval

• See Table 2.5, p. 59 for properties of these distributions

Normal Distribution

First discovered by de Moivre (1667-1754) in 1733.

Rediscovered by Laplace (1749-1827) and also by Gauss (1777-1855) in their studies of errors in astronomical measurements.

Often referred to as the Gaussian distribution.

Carl Friedrich Gauss (1777 - 1855)

Photograph courtesy of John L. Telford, John Telford Photography. Used with permission. Currency from 1991.

Karl Pearson (1857 - 1936)

“Many years ago I called the Laplace-Gauss curve the NORMAL curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ABNORMAL. That belief is, of course, not justifiable.”

Karl Pearson, 1920

Normal Distribution (“Bell-curve”, Gaussian)

A continuous r.v. X has a normal distribution with parameters µ and σ² if its probability density function is given by:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]$ for $-\infty < x < \infty$

E(X) = µ and Var(X) = σ² (see Figure 2.12, p. 53)

Standard normal distribution: $Z = \frac{X - \mu}{\sigma} \sim N(0,1)$

• See Table A.3 on p. 673: Φ(z) = P(Z ≤ z)

$P(X \le x) = P\left( Z = \frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma} = z \right) = \Phi\left( \frac{x - \mu}{\sigma} \right)$

• See Examples 2.37 and 2.38 on pp. 54-55 (computations)

Percentiles of the Normal Distribution

Suppose that the scores on a standardized test are normally distributed with mean 500 and standard deviation of 100. What is the 75th percentile score of this test?

$P(X \le x) = P\left( \frac{X - 500}{100} \le \frac{x - 500}{100} \right) = \Phi\left( \frac{x - 500}{100} \right) = 0.75$

From Table A.3, Φ(0.675) = 0.75

$\frac{x - 500}{100} = 0.675 \Rightarrow x = 500 + (0.675)(100) = 567.5$

Useful Information about the Normal Distribution:

~68% of a normal population is within ± 1σ of µ
~95% of a normal population is within ± 2σ of µ
~99.7% of a normal population is within ± 3σ of µ

75th percentile for a test with scores which are normally distributed, mean=500, standard deviation=100:

qnorm(0.75, 500, 100)=567.5

pnorm(567.5, 500, 100)=0.75

[Figure: normal density for mean=500, standard deviation=100, with the 75th percentile marked]
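For readers without S-Plus, the same percentile calculation can be sketched with `statistics.NormalDist` from the Python standard library, where `inv_cdf` plays the role of qnorm and `cdf` the role of pnorm. The variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

p75 = scores.inv_cdf(0.75)            # counterpart of qnorm(0.75, 500, 100)
check = scores.cdf(567.5)             # counterpart of pnorm(567.5, 500, 100)

# Share of the population within 1, 2, and 3 standard deviations of the mean
within = [scores.cdf(500 + k * 100) - scores.cdf(500 - k * 100) for k in (1, 2, 3)]
```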

Linear Combinations of r.v.s

$X_i \sim N(\mu_i, \sigma_i^2)$ for i = 1, …, n and $Cov(X_i, X_j) = \sigma_{ij}$ for i ≠ j

Let X = a1X1 + a2X2 + … + anXn where the $a_i$ are constants.

Then X has a normal distribution with mean and variance:

$E(X) = E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1\mu_1 + a_2\mu_2 + \cdots + a_n\mu_n = \sum_{i=1}^{n} a_i \mu_i$

$Var(X) = Var(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \sum_{i<j} a_i a_j \sigma_{ij}$

Sample mean: $\bar{X} = (X_1 + X_2 + \cdots + X_n)/n$, so $a_i = 1/n$

Therefore, $\bar{X}$ from n i.i.d. N(µ, σ²) observations ~ N(µ, σ²/n), since the covariances ($\sigma_{ij}$) are zero (by independence).

Sampling Distributions of Statistics

Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford (Johns Hopkins University)

Sampling Distributions

Definitions and Key Concepts

• A sample statistic used to estimate an unknown population parameter is called an estimate.

• The discrepancy between the estimate and the true parameter value is known as sampling error.

• A statistic is a random variable with a probability distribution, called the sampling distribution, which is generated by repeated sampling.

• We use the sampling distribution of a statistic to assess the sampling error in an estimate.

Random Sample

• Definition 5.11, page 201, Casella and Berger.

• How is this different from a simple random sample?

• For mutual independence, population must be very large or must sample with replacement.

Sample Mean and Variance

Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

Sample Variance: $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

How do the sample mean and variance vary in repeated samples of size n drawn from the population?

In general, it is difficult to find the exact sampling distribution. However, see the example of deriving the distribution when all possible samples can be enumerated (rolling 2 dice) in sections 5.1 and 5.2. Note errors on page 168.

Properties of a sample mean and variance

See Theorem 5.2.2, page 268, Casella & Berger.

Distribution of Sample Means

• If the i.i.d. r.v.’s are
– Bernoulli
– Normal
– Exponential

the distributions of the sample means can be derived:

Sum of n i.i.d. Bernoulli(p) r.v.’s is Binomial(n,p)

Sum of n i.i.d. Normal(µ,σ2) r.v.’s is Normal(nµ,nσ2)

Sum of n i.i.d. Exponential(λ) r.v.’s is Gamma(λ,n)

Distribution of Sample Means

• Generally, the exact distribution is difficult to calculate.

• What can be said about the distribution of the sample mean when the sample is drawn from an arbitrary population?

• In many cases we can approximate the distribution of the sample mean when n is large by a normal distribution.

• The famous Central Limit Theorem

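Central Limit Theorem

Let X1, X2, … , Xn be a random sample drawn from an arbitrary distribution with a finite mean µ and variance σ².

As n goes to infinity, the sampling distribution of

$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

converges to the N(0,1) distribution.

Sometimes this theorem is given in terms of the sums:

$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \approx N(0,1)$

Central Limit Theorem

Let X1… Xn be a random sample from an arbitrary distribution with finite mean µ and variance σ². As n increases

$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1) \Rightarrow \bar{X} \approx N(\mu, \sigma^2/n) \Rightarrow \sum_{i} X_i \approx N(n\mu, n\sigma^2)$

What happens as n goes to infinity?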

[Figure: variance of means from uniform distribution vs. log10(sample.size); sample size = 10 to 10^6, number of samples = 100]

Example: Uniform Distribution
• f(x | a, b) = 1 / (b-a), a≤x≤b
• E X = (b+a)/2
• Var X = (b-a)²/12

[Figure: plot of runif(500, min = 0, max = 10)]

[Figure: standardized means, uniform distribution, number of samples=500, n=1]

[Figure: QQ (Normal) plot of means of 500 samples of size 100 from uniform distribution]

Bootstrap – sampling from the sample

• Previous slides have shown results for means of 500 samples (of size 100) from uniform distribution.

• Bootstrap takes just one sample of size 100 and then takes 500 samples (of size 100) with replacement from the sample.

• x<-runif(100)
• y<- mean(sample(x,100,replace=T))
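The S-Plus lines above compute one bootstrap mean. A fuller sketch in Python, assuming 500 bootstrap resamples as described in the bullets and using a fixed seed only for reproducibility, might look like this (names are illustrative):

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# One observed sample of size 100 from Uniform(0, 1), like x <- runif(100)
sample = [random.random() for _ in range(100)]

# 500 bootstrap samples of size 100, drawn with replacement from the sample itself
boot_means = [
    statistics.fmean(random.choices(sample, k=len(sample)))
    for _ in range(500)
]

# The spread of the bootstrap means estimates the standard error of the sample mean
boot_se = statistics.stdev(boot_means)
```

The bootstrap means centre on the mean of the observed sample rather than on the population mean, since the observed sample stands in for the population.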

[Figure: normal probability plot of sample of size 100 from exponential distribution]

[Figure: normal probability plot of means of 500 bootstrap samples from sample of size 100 from exponential distribution]

Law of Large Numbers and Central Limit Theorem

Both are asymptotic results about the sample mean:

• Law of Large Numbers (LLN) says that as n → ∞, the sample mean converges to the population mean, i.e.,

$\bar{X} - \mu \to 0$ as $n \to \infty$

• Central Limit Theorem (CLT) says that as n → ∞, also the distribution converges to Normal, i.e.,

$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ converges to N(0,1) as $n \to \infty$

Normal Approximation to the Binomial

A binomial r.v. is the sum of i.i.d. Bernoulli r.v.’s so the CLT can be used to approximate its distribution.

Suppose that X is B(n, p). Then the mean of X is np and the variance of X is np(1 - p) .

By the CLT, we have:

$\frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$

General formula: $\frac{\text{r.v.} - E(\text{r.v.})}{SD(\text{r.v.})}$

How large a sample, n, do we need for the approximation to be good?

Rule of Thumb: np ≥ 10 and n(1-p) ≥ 10

For p=0.5, np = n(1-p) = n (0.5) = 10 ⇒ n should be 20. (symmetrical)

For p=0.1 or 0.9, np or n(1-p) = n (0.1) = 10 ⇒ n should be 100. (skewed)

• See Figures 5.2 and 5.3 and Example 5.3, pp.172-174

Continuity Correction

See Figure 5.4 for motivation.

$P(X \le x) \cong \Phi\left( \frac{x + 0.5 - np}{\sqrt{np(1-p)}} \right)$

$P(X \ge x) \cong 1 - \Phi\left( \frac{x - 0.5 - np}{\sqrt{np(1-p)}} \right)$

Exact Binomial Probability:

P(X ≤ 8) = 0.2517

Normal approximation without Continuity Correction:

P(X ≤ 8) = 0.1867

Normal approximation with Continuity Correction:

P(X ≤ 8.5) = 0.2514 (much better agreement with exact calculation)
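The slide does not state n and p for this comparison. The Python sketch below assumes n = 20 and p = 0.5, which reproduces the exact value 0.2517; the approximate values may differ from the slide in the last digits if the slide used a normal table with z rounded to two decimals.

```python
import math

def binom_cdf(x, n, p):
    """Exact P(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

def phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p, x = 20, 0.5, 8                 # assumed values, not stated on the slide
mu = n * p
sd = math.sqrt(n * p * (1 - p))

exact = binom_cdf(x, n, p)
approx_plain = phi((x - mu) / sd)          # without continuity correction
approx_cc = phi((x + 0.5 - mu) / sd)       # with continuity correction
```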

Sampling Distribution of the Sample Variance

$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

There is no analog to the CLT for S² that gives an approximation for large samples for an arbitrary distribution.

The exact distribution for S² can be derived for X ~ i.i.d. Normal.

Chi-square distribution: For ν ≥ 1, let Z1, Z2, …, Zν be i.i.d. N(0,1) and let Y = Z1² + Z2² + … + Zν².

The p.d.f. of Y can be shown to be

$f(y) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, y^{\nu/2 - 1} e^{-y/2}$, for $y > 0$

This is known as the χ² distribution with ν degrees of freedom (d.f.), or Y ~ $\chi^2_\nu$.

• See Figures 5.5 and 5.6, pp. 176-177 and Table A.5, p. 676

Distribution of the Sample Variance in the Normal Case

If Z ~ N(0,1), then Z² ~ $\chi^2_1$

It can be shown that

$\frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}$

or equivalently $S^2 \sim \frac{\sigma^2 \chi^2_{n-1}}{n-1}$, a scaled χ²

E(S²) = σ² (S² is an unbiased estimator)

$Var(S^2) = \frac{2\sigma^4}{n-1}$

See Result 2 (p.179)

[Figure: chi-square distribution]

Chi-Square Distribution: Interesting Facts

• EX = ν (degrees of freedom)
• Var X = 2ν
• Special case of the gamma distribution with scale parameter=2, shape parameter=ν/2.
• Chi-square variate with ν d.f. is equal to the sum of the squares of ν independent unit normal variates.

Student’s t-Distribution

Consider a random sample X1, X2, ..., Xn drawn from N(µ,σ²).

It is known that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is exactly distributed as N(0,1).

$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ is NOT distributed as N(0,1).

A different distribution for each ν = n-1 degrees of freedom (d.f.).

T is the ratio of a N(0,1) r.v. and sq.rt.(independent χ² divided by its d.f.). For the derivation, see eqn 5.13, p.180, and its messy p.d.f., eqn 5.14.

• See Figure 5.7, Student’s t p.d.f.’s for ν = 2, 10, and ∞, p.180
• See Table A.4, t-distribution table, p. 675
• See Example 5.6, milk cartons, p. 181

[Figure: Student’s t densities for df=1,100]

Student’s t Distribution: Interesting Facts

• E X = 0, for v>1
• Var X = v/(v-2) for v>2
• Related to F distribution ($F_{1,v} = t_v^2$)
• As v tends to infinity, t variate tends to unit normal
• If v=1 then t variate is standard Cauchy

[Figure: Cauchy distribution for center=0, scale=1 and center=1, scale=2]

Cauchy Distribution: Interesting Facts

$f(x \mid a, b) = \left\{ \pi b \left[ 1 + \left( \frac{x - a}{b} \right)^2 \right] \right\}^{-1}$

• Parameters: a=center, b=scale
• Mean and variance do not exist (how could this be?)
• a=median
• Quartiles=a +/- b
• Special case of Student’s t with 1 d.f.
• Ratio of 2 independent unit normal variates is standard Cauchy variate
• Should not be thought of as “only a pathological case” (Casella & Berger), as we frequently (when?) calculate ratios of random variables.

Snedecor-Fisher’s F-Distribution

Consider two independent random samples:

X1, X2, ..., Xn1 from N(µ1, σ1²), Y1, Y2, ..., Yn2 from N(µ2, σ2²).

Then $F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2}$ has an F-distribution with n1-1 d.f. in the numerator and n2-1 d.f. in the denominator.

• F is the ratio of two independent χ²’s divided by their respective d.f.’s

• Used to compare sample variances.

• See Table A.6, F-distribution, pp. 677-679

[Figure: Snedecor’s F distribution; curves labelled df1=40 and df1=10]

Snedecor’s F Distribution: Interesting Facts

• Parameters, v, w, referred to as degrees of freedom (df).
• Mean = w/(w-2), for w>2
• Variance = 2w²(v+w-2)/(v(w-2)²(w-4)), for w>4
• As d.f., v and w increase, F variate tends to normal
• Related also to Chi-square, Student’s t, Beta and Binomial
• Reference for distributions: Statistical Distributions 3rd ed. by Evans, Hastings and Peacock, Wiley, 2000

Sampling Distributions - Summary

• For random sample from any distribution, standardized sample mean converges to N(0,1) as n increases (CLT).

• In normal case, standardized sample mean with S instead of sigma in the denominator ~ Student’s t(n-1).

• Sum of n squared unit normal variates ~ Chi-square (n)

• In the normal case, sample variance has scaled Chi-square distribution.

• In the normal case, ratio of sample variances from two different samples divided by their respective d.f. has F distribution.

Sir Ronald A. Fisher (1890-1962)

Wrote the first books on statistical methods (1926 & 1936):

“A student should not be made to read Fisher’s books unless he has read them before.”

George W. Snedecor (1882-1974)

Taught at Iowa State Univ. where he wrote a college textbook (1937):

“Thank God for Snedecor; now we can understand Fisher.”

(named the distribution for Fisher)

Sampling Distributions for Order Statistics

Most sampling distribution results (except for the CLT) apply to samples from normal populations.

If data do not come from a normal (or at least approximately normal) population, then statistical methods called “distribution-free” or “non-parametric” methods can be used (Chapter 14).

Non-parametric methods are often based on ordered data (called order statistics: X(1), X(2), …, X(n)) or just their ranks.

If X1..Xn are from a continuous population with cdf F(x) and pdf f(x), then the pdf of X(j) is:

$f_{(j)}(x) = \frac{n!}{(j-1)!\,(n-j)!}\, f(x)\, [F(x)]^{j-1}\, [1 - F(x)]^{n-j}$

The confidence intervals for percentiles can be derived using the order statistics and the binomial distribution.

Basic Concepts of Inference

Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” H. G. Wells

Statistical Inference

Deals with methods for making statements about a population based on a sample drawn from the population

Point Estimation: Estimate an unknown population parameter

Confidence Interval Estimation: Find an interval that contains the parameter with preassigned probability.

Hypothesis testing: Testing hypothesis about an unknown population parameter

Examples

Point Estimation: estimate the mean package weight of a cereal box filled during a production shift

Confidence Interval Estimation: Find an interval [L,U] based on the data that includes the mean weight of the cereal box with a specified probability

Hypothesis testing: Do the cereal boxes meet the minimum mean weight specification of 16 oz?

Two Levels of Statistical Inference

• Informal, using summary statistics (may only be descriptive statistics)

• Formal, which uses methods of probability and sampling distributions to develop measures of statistical accuracy

Estimation Problems

• Point estimation: estimation of an unknown population parameter by a single statistic calculated from the sample data.

• Confidence interval estimation: calculation of an interval from sample data that includes the unknown population parameter with a pre-assigned probability.

Point Estimation Terminology

Estimator = the random variable (r.v.) $\hat{\theta}$, a function of the Xi’s (the general formula of the rule to be computed from the data)

Estimate = the numerical value of $\hat{\theta}$ calculated from the observed sample data X1 = x1, ..., Xn = xn (the specific value calculated from the data)

Example: Xi ~ N(µ,σ²)

Estimator = $\hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is an estimator of µ

Estimate = $\bar{x}$ (= 10.2) is an estimate of µ

Other estimators of µ?

Methods of Evaluating Estimators

Bias and Variance

$Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$

- The bias measures the accuracy of an estimator.
- An estimator whose bias is zero is called unbiased.
- An unbiased estimator may, nevertheless, fluctuate greatly from sample to sample.

$Var(\hat{\theta}) = E\{ [\hat{\theta} - E(\hat{\theta})]^2 \}$

- The lower the variance, the more precise the estimator.
- A low-variance estimator may be biased.
- Among unbiased estimators, the one with the lowest variance should be chosen. “Best” = minimum variance.

Accuracy and Precision

accurate and precise

accurate, not precise

precise, not accurate

not accurate, not precise

Diagram courtesy of MIT OpenCourseWare

Mean Squared Error

- To choose among all estimators (biased and unbiased), minimize a measure that combines both bias and variance.
- A “good” estimator should have low bias (accurate) AND low variance (precise).

$MSE(\hat{\theta}) = E\{ [\hat{\theta} - \theta]^2 \} = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2$ (eqn 6.2)

MSE = expected squared error loss function

where $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$ and $Var(\hat{\theta}) = E\{ [\hat{\theta} - E(\hat{\theta})]^2 \}$

Example: estimators of variance

Two estimators of variance:

$S_1^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$ is unbiased (Example 6.3)

$S_2^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / n$ is biased but has smaller MSE (Example 6.4)

In spite of larger MSE, we almost always use $S_1^2$
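The bias/MSE trade-off between the two variance estimators can be checked by simulation. The sketch below is only an illustration under assumed settings (normal data, n = 10, σ = 1, a fixed seed); the function names are hypothetical and not from the textbook.

```python
import random

def variance_estimators(sample):
    """Return (S1^2, S2^2): squared deviations divided by n-1 and by n."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / (n - 1), ss / n

def simulate_bias_mse(n=10, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo estimates of bias and MSE of both estimators of sigma^2."""
    rng = random.Random(seed)
    true_var = sigma ** 2
    totals = {"S1": [0.0, 0.0], "S2": [0.0, 0.0]}  # [sum of errors, sum of squared errors]
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s1, s2 = variance_estimators(sample)
        for name, est in (("S1", s1), ("S2", s2)):
            err = est - true_var
            totals[name][0] += err
            totals[name][1] += err * err
    return {k: {"bias": e / reps, "mse": e2 / reps} for k, (e, e2) in totals.items()}

result = simulate_bias_mse()
print(result)
```

With these settings the simulated bias of S1² should be near zero, while S2² should show a negative bias but the smaller MSE.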

Example - Poisson

(See example in Casella & Berger, page 308)

Standard Error (SE)

- The standard deviation of an estimator is called the standard error of the estimator (SE).
- The estimated standard error is also called standard error (se).
- The precision of an estimator is measured by the SE.

Examples for the normal and binomial distributions:

1. $\bar{X}$ is an unbiased estimator of µ: $SE(\bar{X}) = \sigma/\sqrt{n}$ and $se(\bar{X}) = s/\sqrt{n}$ are called the standard error of the mean

2. $\hat{p}$ is an unbiased estimator of p: $se(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$

Precision and Standard Error

• A precise estimate has a small standard error, but exactly how are the precision and standard error related?

• If the sampling distribution of an estimator is normal with mean equal to the true parameter value (i.e., unbiased), then we know that about 95% of the time the estimator will be within two SE’s of the true parameter value.

Methods of Point Estimation

•Method of Moments (Chapter 6)

•Maximum Likelihood Estimation (Chapter 15)

•Least Squares (Chapter 10 and 11)

Method of Moments

• Equate sample moments to population moments (as we did with Poisson).

• Example: for the continuous uniform distribution, f(x|a,b) = 1/(b−a), a ≤ x ≤ b

• E(X) = (b+a)/2, Var(X) = (b−a)²/12

• Set $\bar{X}$ = (b+a)/2

• S² = (b−a)²/12

• Solve for a and b (can be a bit messy).
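Solving the two moment equations gives b − a = √12·S, so a = X̄ − √3·S and b = X̄ + √3·S. A minimal sketch, assuming S is the usual sample standard deviation (n − 1 denominator); the data values are made up for illustration:

```python
import math
import statistics

def uniform_method_of_moments(data):
    """Method-of-moments estimates (a_hat, b_hat) for Uniform(a, b)."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)        # sample standard deviation, n-1 denominator
    half_width = math.sqrt(3.0) * s   # (b - a)/2 = sqrt(3) * S
    return xbar - half_width, xbar + half_width

data = [2.1, 3.7, 4.4, 5.0, 6.3, 7.9]   # hypothetical sample
a_hat, b_hat = uniform_method_of_moments(data)
print(a_hat, b_hat)
```

Note that these estimates match the first two moments only; nothing forces every observation to fall inside [a_hat, b_hat].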

Maximum Likelihood Parameter Estimation

• By far the most popular estimation method! (Casella & Berger).

• MLE is the parameter point for which observed data is most likely under the assumed probability model.

• Likelihood function: L(θ |x) = f(x| θ), where x is the vector of sample values, θ also a vector possibly.

• When we consider f(x| θ), we consider θ as fixed and x as the variable.

• When we consider L(θ |x), we are considering x to be the fixed observed sample point and θ to be varying over all possible parameter values.

MLE (continued)

•If X1, …, Xn are iid then

L(θ|x) = f(x1, …, xn|θ) = ∏ f(xi|θ)

•The MLE of θ is the value which maximizes the likelihood function (assuming it has a global maximum).

•Found by differentiating when possible.

•Usually work with log of likelihood function (∏→∑).

•Equations obtained by setting partial derivatives of ln L(θ) = 0 are called the likelihood equations.

•See text page 616 for example – normal distribution.
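For the normal model the likelihood equations have the well-known closed-form solution µ̂ = x̄ and σ̂² = Σ(xi − x̄)²/n. The sketch below (hypothetical data and function names) evaluates the log-likelihood and checks that nearby parameter values do no better:

```python
import math

def normal_log_likelihood(data, mu, sigma2):
    """ln L(mu, sigma^2 | x) for an iid N(mu, sigma^2) sample."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def normal_mle(data):
    """Closed-form solution of the likelihood equations."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n   # divisor n, not n-1
    return mu_hat, sigma2_hat

data = [9.8, 10.4, 10.1, 10.9, 9.6, 10.4]   # hypothetical sample
mu_hat, sigma2_hat = normal_mle(data)
best = normal_log_likelihood(data, mu_hat, sigma2_hat)
# Log-likelihood at a few perturbed parameter values, for comparison
nearby = [normal_log_likelihood(data, mu_hat + d, sigma2_hat * f)
          for d in (-0.2, 0.0, 0.2) for f in (0.8, 1.0, 1.25)]
print(mu_hat, sigma2_hat, best)
```

The MLE of σ² uses divisor n, so it coincides with the biased estimator S2² rather than S1².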

Confidence Interval Estimation

We want an interval [ L, U ] where L and U are two statistics calculated from X1, X2, …, Xn such that

P[ L ≤ θ ≤ U ] = 1 - α regardless of the true value of θ.

Note: L and U are random and θ is fixed but unknown.

• [ L, U ] is called a 100(1-α)% confidence interval (CI).

• 1-α is called the confidence level of the interval.

• After the data is observed X1 = x1, ..., Xn = xn, the confidence limits L = l and U = u can be calculated.

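95% Confidence Interval: Normal, σ² known

Consider a random sample X1, X2, …, Xn ~ N(µ,σ²) where σ² is assumed to be known and µ is an unknown parameter to be estimated. Then

$P\left[-1.96 \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le 1.96\right] = 0.95$

By the CLT even if the sample is not normal, this result is approximately correct.

$\Rightarrow P\left[L = \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le U = \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right] = 0.95$

$\Rightarrow l = \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le u = \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ is a 95% CI for µ (two-sided)

• See Example 6.7, Airline Revenues, p. 204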
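A minimal sketch of the known-σ interval, using Python's standard-library normal distribution. The numbers (x̄ = 10.2, σ = 2, n = 25) are made up for illustration and are not from the textbook example.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided CI for mu when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lo95, hi95 = z_interval(xbar=10.2, sigma=2.0, n=25)
lo99, hi99 = z_interval(xbar=10.2, sigma=2.0, n=25, confidence=0.99)
print((lo95, hi95), (lo99, hi99))
```

The 99% interval comes out wider than the 95% interval for the same data.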

[Figure: standard normal density; 95% of the area under the curve is between -1.96 and 1.96.]

Frequentist Interpretation of CI’s

In an infinitely long series of trials in which repeated samples of size n are drawn from the same population and 95% CI’s for µ are calculated using the same method, the proportion of intervals that actually include µ will be 95% (coverage probability).

However, for any particular CI, it is not known whether or not the CI includes µ, but the probability that it includes µ is either 0 or 1, that is, either it does or it doesn’t.

It is incorrect to say that the probability is 0.95 that the true µ is in a particular CI.

• See Figure 6.2, p. 205

[Figure: 95% CIs computed from 50 samples from the unit normal distribution.]
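The coverage interpretation can be illustrated by simulation: draw many samples from N(0,1), build the known-σ 95% interval each time, and count how often it contains µ = 0. This is a sketch with assumed settings (n = 20, 5000 repetitions, fixed seed), not the code behind the figure.

```python
import math
import random
from statistics import NormalDist

def simulated_coverage(n=20, reps=5000, confidence=0.95, seed=1):
    """Fraction of known-sigma z-intervals that contain the true mean 0."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * 1.0 / math.sqrt(n)          # sigma = 1 is known
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if xbar - half_width <= 0.0 <= xbar + half_width:
            hits += 1
    return hits / reps

coverage = simulated_coverage()
print(coverage)
```

The result should be close to 0.95, while each individual interval either contains 0 or does not.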

Arbitrary Confidence Level for CI: σ² known

100(1-α)% two-sided CI for µ based on the observed sample mean:

$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$

For 99% confidence, $z_{\alpha/2}$ = 2.576

The price paid for higher confidence level is a wider interval.

For large samples, these CIs can be used for data from any distribution, since by the CLT $\bar{x} \approx N(\mu, \sigma^2/n)$.

One-sided Confidence Intervals

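$\mu \ge \bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Lower one-sided CI)

$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Upper one-sided CI)

For 95% confidence, $z_{\alpha}$ = 1.645 vs. $z_{\alpha/2}$ = 1.96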

One-sided CIs are tighter for the same confidence level.

Hypothesis Testing

The objective of hypothesis testing is to assess the validity of a claim against a counterclaim using sample data.

• The claim to be “proved” is the alternative hypothesis (H1).

• The competing claim is called the null hypothesis (H0).

• One begins by assuming that H0 is true. If the data fails to contradict H0 beyond a reasonable doubt, then H0 is not rejected. However, failing to reject H0 does not mean that we accept it as true. It simply means that H0 cannot be ruled out as a possible explanation for the observed data. A proof by insufficient data is not a proof at all.

Testing Hypotheses

“The process by which we use data to answer questions about parameters is very similar to how juries evaluate evidence about a defendant.” – from Geoffrey Vining, Statistical Methods for Engineers, Duxbury, 1st edition, 1998. For more information, see that textbook.

Hypothesis Tests

• A hypothesis test is a data-based rule to decide between H0 and H1.

• A test statistic calculated from the data is used to make this decision.

• The values of the test statistic for which the test rejects H0 comprise the rejection region of the test.

• The complement of the rejection region is called the acceptance region.

• The boundaries of the rejection region are defined by one or more critical constants (critical values).

• See Examples 6.13(acc. sampling) and 6.14(SAT coaching), pp. 210-211.

Hypothesis Testing as a Two-Decision Problem

Framework developed by Neyman and Pearson in 1933.

When a hypothesis test is viewed as a decision procedure, two types of errors are possible:

Decision     Do not reject H0                              Reject H0

H0 True      Correct Decision, “Confidence”, 1 - α         Type I Error, “Significance Level”, α

H0 False     Type II Error, “Failure to Detect”, β         Correct Decision, “Prob. of Detection”, 1 - β

Probabilities of Type I and II Errors

α = P{Type I error} = P{Reject H0 when H0 is true} = P{Reject H0|H0}

also called α-risk or producer’s risk or false alarm rate

β = P{Type II error} = P{Fail to reject H0 when H1 is true} = P{Fail to reject H0|H1}

also called β-risk or consumer’s risk or prob. of not detecting

π = 1 - β = P{Reject H0|H1} is prob. of detection or power of the test

We would like to have low α and low β (or equivalently, high power).

α and 1-β are directly related: for a given test and sample size, power can be increased by increasing α.

These probabilities are calculated using the sampling distributions from either the null hypothesis (for α) or alternative hypothesis (for β).

Example 6.17 (SAT Coaching)

See Example 6.17, “SAT Coaching,” in the course textbook.

Power Function and OC Curve

The operating characteristic function of a test is the probability that the test fails to reject H0 as a function of θ, where θ is the test parameter.

OC(θ) = P{test fails to reject H0 | θ}

For θ values included in H1 the OC function is the β –risk.

The power function is:

π(θ) = P{Test rejects H0 | θ} = 1 – OC(θ)

Example: In SAT coaching, for the test that rejects the null hypothesis when mean change is 25 or greater, the power = 1-pnorm(25,mean=0:50,sd=40/sqrt(20))

Level of Significance

The practice in hypothesis testing is to put an upper bound on P(Type I error) and, subject to that constraint, find a test with the lowest possible P(Type II error).

The upper bound on P(Type I error) is called the level of significance of the test and is denoted by α (usually some small number such as 0.01, 0.05, or 0.10).

The test is required to satisfy:

P{ Type I error } = P{ Test Rejects H0 | H0 } ≤ α

Note that α is now used to denote an upper bound on P(Type I error).

Motivated by the fact that the Type I error is usually the more serious.

A hypothesis test with a significance level α is called an α-level test.

Choice of Significance Level

What α level should one use?

Recall that as P(Type I error) decreases P(Type II error) increases.

A proper choice of α should take into account the relative costs of Type I and Type II errors. (These costs may be difficult to determine in practice, but must be considered!)

Fisher said: α =0.05

Today α = 0.10, 0.05, 0.01 depending on how much proof against the null hypothesis we want to have before rejecting it.

P-values have become popular with the advent of computer programs.

Observed Level of Significance or P-value

Simply rejecting or not rejecting H0 at a specified α level does not fully convey the information in the data.

Example: H0 : µ = 15 vs H1 : µ > 15 is rejected at the α = 0.05 level when

$\bar{x} > 15 + 1.645 \times \frac{40}{\sqrt{20}} = 29.71$

Is a sample with a mean of 30 equivalent to a sample with a mean of 50? (Note that both lead to rejection at the α-level of 0.05.)

More useful to report the smallest α-level for which the data would reject (this is called the observed level of significance or P-value).

Reject H0 if P-value < α
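To see how much more a P-value conveys, the sketch below computes the critical mean and the upper one-sided P-values for sample means of 30 and 50, using the example's values µ0 = 15, σ = 40, n = 20. The function name is hypothetical.

```python
import math
from statistics import NormalDist

MU0, SIGMA, N, ALPHA = 15.0, 40.0, 20, 0.05
se = SIGMA / math.sqrt(N)                                   # standard error of the mean
critical_mean = MU0 + NormalDist().inv_cdf(1 - ALPHA) * se  # reject H0 if xbar exceeds this

def upper_p_value(xbar):
    """P-value for H1: mu > MU0, i.e. P(Z >= z_observed) under H0."""
    return 1.0 - NormalDist().cdf((xbar - MU0) / se)

p30 = upper_p_value(30.0)
p50 = upper_p_value(50.0)
print(critical_mean, p30, p50)
```

Both means lead to rejection at α = 0.05, but a mean of 30 is only barely significant while a mean of 50 gives a far smaller P-value.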

Example 6.23 (SAT Coaching: P-Value)

See Example 6.23, “SAT Coaching,” on page 220 of the course textbook.

One-sided and Two-sided Tests

H0 : µ = 15 can have three possible alternative hypotheses:

H1 : µ > 15 (upper one-sided), H1 : µ < 15 (lower one-sided), or H1 : µ ≠ 15 (two-sided)

Example 6.27 (SAT Coaching: Two-sided testing)

See Example 6.27 in the course textbook.

Example 6.27 continued

See Example 6.27, “SAT Coaching,” on page 223 of the course textbook.

Relationship Between Confidence Intervals and Hypothesis Tests

An α-level two-sided test rejects a hypothesis H0 : µ = µ0 if and only if the (1- α)100% confidence interval does not contain µ0.

Example 6.7 (Airline Revenues)

See Example 6.7, “Airline Revenues,” on page 207 of the course textbook.

Use/Misuse of Hypothesis Tests in Practice

• Difficulties of Interpreting Tests on Non-random samples and observational data

• Statistical significance versus Practical significance
  – Statistical significance is a function of sample size

• Perils of searching for significance

• Ignoring lack of significance

•Confusing confidence (1 - α) with probability of detecting a difference (1 - β)

Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980)

Carried on a decades-long feud with Fisher over the foundations of statistics (hypothesis testing and confidence limits) - Fisher never recognized Type II error & developed fiducial limits

Inference for Single Samples

Corresponds to Chapter 7 of

Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León (University of Tennessee)

Inference About the Mean and Variance of a Normal Population

Applications:

• Monitor the mean of a manufacturing process to determine if the process is under control

• Evaluate the precision of a laboratory instrument measured by the variance of its readings

• Prediction intervals and tolerance intervals, which are methods for estimating future observations from a population.

By using the central limit theorem (CLT), inference procedures for the mean of a normal population can be extended to the mean of a non-normal population when a large sample is available

Inferences on Mean (Large Samples)

Inferences on µ will be based on the sample mean $\bar{X}$, which is an unbiased estimator of µ with variance σ²/n.

For large sample size n, the CLT tells us that $\bar{X}$ is approximately N(µ, σ²/n) distributed, even if the population is not normal.

Also for large n, the sample variance s² may be taken as an accurate estimator of σ² with negligible sampling error. If n ≥ 30, we may assume that σ ≈ s in the formulas.

Pivots

• Definition: Casella & Berger, p. 413

• E.g. $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$

• Allow us to construct confidence intervals on parameters.

Confidence Intervals on the Mean: Large Samples

$P\left[-z_{\alpha/2} \le Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right] = 1-\alpha$

Note: zα/2 = -qnorm(α/2)

(See Figure 2.15 on page 56 of the course textbook.)

Confidence Intervals on the Mean

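$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ (Two-Sided CI)

$\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}} \le \mu$ (Lower One-Sided CI)

$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Upper One-Sided CI)

$\sigma/\sqrt{n}$ is the standard error of the mean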

Confidence Intervals in S-Plus

t.test(lottery.payoff)

One-sample t-Test

data: lottery.payoff
t = 35.9035, df = 253, p-value = 0
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 274.4315 306.2850
sample estimates:
 mean of x
  290.3583

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Sample Size Determination for a z-interval

Suppose that we require a (1-α)-level two-sided CI for µ of the form $[\bar{x} - E, \bar{x} + E]$, with a margin of error E.

Set $E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ and solve for n, obtaining $n = \left[\frac{z_{\alpha/2}\,\sigma}{E}\right]^2$

• Calculation is done at the design stage so a sample estimate of σ is not available.

• An estimate for σ can be obtained by anticipating the range of the observations and dividing by 4. This is based on assuming normality, since then 95% of the observations are expected to fall in [µ − 2σ, µ + 2σ].
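The sample-size formula and the range/4 planning rule can be sketched as follows. The function names and the planning numbers (anticipated range 0 to 40, margin of error 2) are assumptions for illustration; n is rounded up to the next integer.

```python
import math
from statistics import NormalDist

def sigma_from_range(low, high):
    """Planning value for sigma: anticipated range divided by 4."""
    return (high - low) / 4.0

def ci_sample_size(sigma, margin, confidence=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

sigma_plan = sigma_from_range(0.0, 40.0)      # 10.0
n_needed = ci_sample_size(sigma_plan, margin=2.0)
print(sigma_plan, n_needed)
```

Halving the margin of error roughly quadruples the required n, since n is proportional to 1/E².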

Example 7.1 (Airline Revenue)

See Example 7.1, “Airline Revenue,” on page 239 of the course textbook.

Example 7.2 – Strength of Steel Beams

See Example 7.2 on page 240 of the course textbook.

Power Calculation for One-sided Z-tests

$\pi(\mu) = P[\text{Test rejects } H_0 \mid \mu]$

Testing $H_0: \mu \le \mu_0$ vs. $H_1: \mu > \mu_0$

For the power function of the α-level upper one sided z-test derivation, see Equation 7.7 in the course textbook.

Illustration of calculation on next page

Φ(−z) = 1−Φ(z)

Power Calculation for One-sided Z-tests

[Figure: p.d.f. curves of $\bar{X} \sim N(\mu, \sigma^2/n)$]

Power Functions Curves

See Figure 7.2 on page 243 of the course textbook.

Notice how it is easier to detect a big difference from µ0.

Example 7.3 (SAT Coaching: Power Calculation)

$\pi(\mu) = \Phi\left[-z_{\alpha} + \frac{(\mu-\mu_0)\sqrt{n}}{\sigma}\right]$
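A minimal sketch of the power function π(µ) = Φ[−zα + (µ − µ0)√n/σ] for the upper one-sided z-test, evaluated with the SAT coaching values used earlier in these slides (µ0 = 15, σ = 40, n = 20, α = 0.05). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def power_upper_z(mu, mu0, sigma, n, alpha=0.05):
    """Power of the alpha-level upper one-sided z-test at true mean mu."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(-z_alpha + (mu - mu0) * math.sqrt(n) / sigma)

# Power at several true mean changes
powers = {mu: power_upper_z(mu, mu0=15.0, sigma=40.0, n=20) for mu in (15, 20, 30, 40, 50)}
for mu, pw in powers.items():
    print(mu, round(pw, 3))
```

At µ = µ0 the power equals α, and it increases toward 1 as µ moves away from µ0.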

Power Calculation Two-Sided Test

(See Figure 7.3 on page 245 of the course textbook.)

Power Curve for Two-sided Test

It is easier to detect large differences from the null hypothesis. (See Figure 7.4 in the course textbook.)

Larger samples lead to more powerful tests

Power as a function of µ and n, µ0 = 0, σ = 1

Uses function persp in S-Plus

Sample Size Determination for a One-Sided z-Test

• Determine the sample size so that a study will have sufficient power to detect an effect of practically important magnitude

• If the goal of the study is to show that the mean response µ under a treatment is higher than the mean response µ0 without the treatment, then µ−µ0 is called the treatment effect

• Let δ > 0 denote a practically important treatment effect and let 1−β denote the minimum power required to detect it. The goal is to find the minimum sample size n which would guarantee that an α-level test of H0 has at least 1−β power to reject H0 when the treatment effect is at least δ.

Sample Size Determination for a One-sided Z-test

Because Power is an increasing function of µ−µ0, it is only necessary to find n that makes the power 1− β at µ = µ0+δ.

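$\pi(\mu_0+\delta) = \Phi\left(-z_{\alpha} + \frac{\delta\sqrt{n}}{\sigma}\right) = 1-\beta$ [See Equation (7.7), Slide 11]

Since $\Phi(z_{\beta}) = 1-\beta$ we have $-z_{\alpha} + \frac{\delta\sqrt{n}}{\sigma} = z_{\beta}$.

Solving for n, we obtain

$n = \left[\frac{(z_{\alpha}+z_{\beta})\,\sigma}{\delta}\right]^2$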

Example 7.5 (SAT Coaching: Sample Size Determination)

Sample Size Determination for a Two-Sided z-Test

$n \approx \left[\frac{(z_{\alpha/2}+z_{\beta})\,\sigma}{\delta}\right]^2$

Read on your own the derivation on pages 248-249

Read on your own Example 7.4 (page 246)
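The one-sided and two-sided sample-size formulas differ only in the critical value (zα versus zα/2). A sketch with a hypothetical helper, using δ = 0.3, σ = 1, α = 0.05 and power 0.80 as illustrative inputs:

```python
import math
from statistics import NormalDist

def z_test_sample_size(delta, sigma, alpha=0.05, power=0.80, two_sided=False):
    """n = [(z_alpha + z_beta) * sigma / delta]^2, rounded up (z_{alpha/2} if two-sided)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)                      # Phi(z_beta) = 1 - beta
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

n_one_sided = z_test_sample_size(0.3, 1.0)
n_two_sided = z_test_sample_size(0.3, 1.0, two_sided=True)
print(n_one_sided, n_two_sided)
```

With these inputs the two-sided value agrees with the n1 = 88 reported by normal.sample.size(mean.alt = 0.3) in the S-Plus slide that follows, which suggests that call uses a two-sided alternative by default (an inference from the numbers, not something the slide states).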

Power and Sample Size in S-Plus

> normal.sample.size(mean.alt = 0.3)
  mean.null sd1 mean.alt delta alpha power n1
          0   1      0.3   0.3  0.05   0.8 88

> normal.sample.size(mean.alt = 0.3, n1 = 100)
  mean.null sd1 mean.alt delta alpha  power  n1
          0   1      0.3   0.3  0.05 0.8508 100

Inference on Mean (Small Samples)

The sampling variability of s² may be sizable if the sample is small (less than 30). Inference methods must take this variability into account when σ² is unknown.

Assume that X1, ..., Xn is a random sample from an N(µ, σ²) distribution. Then

$T = \frac{\bar{X}-\mu}{S/\sqrt{n}}$

has a t-distribution with n−1 degrees of freedom (d.f.)

(Note that T is a pivot)

Confidence Intervals on Mean

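$1-\alpha = P\left[-t_{n-1,\alpha/2} \le T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \le t_{n-1,\alpha/2}\right] = P\left[\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}\right]$

$\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$ [Two-Sided 100(1-α)% CI]

$t_{n-1,\alpha/2} > z_{\alpha/2}$ ⇒ t-interval is wider on the average than z-interval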

Example 7.7, 7.8, and 7.9

See Examples 7.7, 7.8, and 7.9 from the course textbook.

Inference on Variance

Assume that X1, ..., Xn is a random sample from an N(µ, σ²) distribution.

$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$ has a chi-square distribution with n−1 d.f.

(See Figure 7.8 on page 255 of the course textbook)

$1-\alpha = P\left[\chi^2_{n-1,1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{n-1,\alpha/2}\right]$

CI for σ2 and σ

The 100(1-α)% two-sided CI for σ2 (Equation 7.17 in course textbook):

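$\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}$

The 100(1-α)% two-sided CI for σ (Equation 7.18 in course textbook):

$s\sqrt{\frac{n-1}{\chi^2_{n-1,\alpha/2}}} \le \sigma \le s\sqrt{\frac{n-1}{\chi^2_{n-1,1-\alpha/2}}}$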

Hypothesis Test on Variance

See Equation 7.21 on page 256 of the course textbook for an explanation of the chi-square statistic:

$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$

Prediction Intervals

• Many practical applications call for an interval estimate of an individual (future) observation sampled from a population, rather than of the mean of the population.

• An interval estimate for an individual observation is called a prediction interval

Prediction Interval Formula:

$\bar{x} - t_{n-1,\alpha/2}\, s\sqrt{1+\frac{1}{n}} \le X \le \bar{x} + t_{n-1,\alpha/2}\, s\sqrt{1+\frac{1}{n}}$

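Confidence vs. Prediction Interval

Prediction interval of a single future observation:

$\bar{x} - t_{n-1,\alpha/2}\, s\sqrt{1+\frac{1}{n}} \le X \le \bar{x} + t_{n-1,\alpha/2}\, s\sqrt{1+\frac{1}{n}}$

As n → ∞ the interval converges to $[\mu - z_{\alpha/2}\sigma,\ \mu + z_{\alpha/2}\sigma]$

Confidence interval for µ:

$\bar{x} - t_{n-1,\alpha/2}\, s\sqrt{\frac{1}{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\, s\sqrt{\frac{1}{n}}$

As n → ∞ the interval converges to the single point µ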

Example 7.12: Tear Strength of Rubber

Run chart shows process is predictable.

Tolerance Intervals

Suppose we want an interval which will contain at least 0.90 = 1-γ of the strengths of the future batches (observations) with 95% = 1-α confidence.

Using Table A.12 in the course textbook: 1-α = 0.95, 1-γ = 0.90, n = 14. So, the critical value we want is K = 2.529.

$[\bar{x} - Ks,\ \bar{x} + Ks] = 33.712 \pm 2.529 \times 0.798 = [31.694, 35.730]$

Note that this statistical interval is even wider than the prediction interval.
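The three intervals can be put side by side with the summary values from the tolerance-interval slide (x̄ = 33.712, s = 0.798, n = 14, K = 2.529). In this sketch the t critical value t13,0.025 = 2.160 is a table lookup supplied by hand, and applying the same x̄, s and n to the confidence and prediction intervals is an assumption made for illustration.

```python
import math

XBAR, S, N = 33.712, 0.798, 14   # summary values from the tolerance-interval slide
T_CRIT = 2.160                   # assumed table value t_{13, 0.025}
K = 2.529                        # tolerance factor quoted from Table A.12

def interval(center, half_width):
    """Return (lower, upper) for center +/- half_width."""
    return center - half_width, center + half_width

conf_int = interval(XBAR, T_CRIT * S * math.sqrt(1 / N))       # CI for mu
pred_int = interval(XBAR, T_CRIT * S * math.sqrt(1 + 1 / N))   # one future observation
tol_int = interval(XBAR, K * S)                                # 90% of population, 95% confidence

for name, (lo, hi) in (("confidence", conf_int), ("prediction", pred_int), ("tolerance", tol_int)):
    print(f"{name:10s} [{lo:.3f}, {hi:.3f}]")
```

The widths increase in the order confidence, prediction, tolerance, consistent with the note on the slide.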

Inferences for Two Samples

Corresponds to Chapter 8 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León (University of Tennessee)

Introductory Remarks

• A majority of statistical studies, whether experimental or observational, are comparative

• Simplest type of comparative study compares two populations

• Two principal designs for comparative studies
  – Using independent samples
  – Using matched pairs

• Graphical methods for informal comparisons

• Formal comparisons of means and variances of normal populations
  – Confidence intervals
  – Hypothesis tests

Independent Samples Design

Example: Compare Control Group to Treatment Group

•See page 270 in course textbook.

Independent samples design:

Sample 1: $x_1, x_2, \dots, x_{n_1}$

Sample 2: $y_1, y_2, \dots, y_{n_2}$

(The sample sizes n1 and n2 may be different numbers.)

•The two samples are independent

•Independent sample design relies on random assignment to make the two groups equal (on the average) on all attributes except for the treatment used (treatment factor).

Graphical Methods for Comparing Two Independent Samples

See Table 8.1 and Figure 8.1, which is a Q-Q Plot: a plot of the order statistics, i.e., the ordered pairs $(x_{(i)}, y_{(i)})$, which are the $\left(\frac{i}{n+1}\right)$ quantiles of the respective samples.

Plot suggests that treatment group costs are less than control group costs. But is it true?

Book discusses how to prepare this graph when the two samples are not of the same size (interpolation).

[Figure: box plots of hospitalization cost data (hcc, hct)]

[Figure: box plots of logs of hospitalization cost data (lhcc, lhct)]

Graphical Displays of Data from Matched Pairs

• Plot the pairs (xi, yi) in a scatter plot. Using the 45° line as a reference, one can judge whether the two sets of values are similar or whether one set tends to be larger than the other

• Plots of the differences or the ratios of the pairs may prove to be useful

• A Q-Q plot is meaningless for paired data because the same quantiles based on the ordered observations do not, in general, come from the same pair.

Comparing Means of Two Populations: Independent Samples Design (Large Samples Case)

Suppose that the observations $x_1, x_2, \dots, x_{n_1}$ and $y_1, y_2, \dots, y_{n_2}$ are random samples from two populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. Both means and variances are assumed to be unknown.

The goal is to compare $\mu_1$ and $\mu_2$ in terms of their difference $\mu_1 - \mu_2$. We assume that $n_1$ and $n_2$ are large (say > 30).

Comparing Means of Two Populations: Independent Samples Design

$E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_1 - \mu_2$

$Var(\bar{X} - \bar{Y}) = Var(\bar{X}) + Var(\bar{Y}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$

Therefore the standardized r.v.

$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$

has mean = 0 and variance = 1.

If $n_1$ and $n_2$ are large, then Z is approximately N(0,1) by the Central Limit Theorem though we did not assume the samples came from normal populations. (We also use the fact that the difference of independent normal r.v.'s is also normal.)

Large Sample (Approximate) 100(1-α)% CI for µ1−µ2

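$\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Note $s_i^2$ has been substituted for $\sigma_i^2$ because samples are large, i.e., bigger than 30.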

Example 8.2: See Example 8.2 in course textbook.

Large Sample (Approximate) Test of Hypothesis

$H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$ (Typically $\delta_0 = 0$)

Test statistic: $z = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
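The large-sample interval and test for µ1 − µ2 share the same standard error, so they can be computed together. A sketch with made-up summary statistics (not the hospitalization data); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def two_sample_z(xbar, s1, n1, ybar, s2, n2, delta0=0.0, confidence=0.95):
    """Large-sample z statistic, two-sided P-value, and CI for mu1 - mu2."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = xbar - ybar
    z = (diff - delta0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z, p_value, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical summary statistics for two large samples
z, p_value, ci = two_sample_z(xbar=52.0, s1=10.0, n1=50, ybar=48.0, s2=12.0, n2=60)
print(z, p_value, ci)
```

Here the 95% interval contains 0 and the two-sided P-value exceeds 0.05, so both lead to the same conclusion about H0: µ1 − µ2 = 0.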

Inference for Small Samples

Case 1: Variances $\sigma_1^2$ and $\sigma_2^2$ assumed equal.

Assumption of normal populations is important since we cannot invoke the CLT

Pooled estimate of the common variance:

$S^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1-1) + (n_2-1)} = \frac{\sum (X_i - \bar{X})^2 + \sum (Y_i - \bar{Y})^2}{n_1 + n_2 - 2}$

Note: $S^2 = (S_1^2 + S_2^2)/2$ if sample sizes are equal

$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ has a t-distribution with $n_1 + n_2 - 2$ d.f.

Inference for Small Sample: Confidence Intervals and Hypothesis Tests

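Case 1: Variances $\sigma_1^2$ and $\sigma_2^2$ assumed equal.

Two-sided 100(1-α)% CI is given by:

$\bar{x} - \bar{y} - t_{n_1+n_2-2,\alpha/2}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{n_1+n_2-2,\alpha/2}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

Test of Hypothesis: $H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$

Test statistic: $t = \frac{\bar{x} - \bar{y} - \delta_0}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

Reject $H_0$ if $|t| > t_{n_1+n_2-2,\alpha/2}$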

Hospitalization Cost Example

•See Example 8.2 on page 276 of course textbook.

Contrast this conclusion with apparent difference seen on the Q-Q plot in Figure 8.1

t.test in S-Plus to test difference in means of logs of hospitalization cost data

t.test(lhcc,lhct)

Standard Two-Sample t-Test

data: lhcc and lhct
t = 0.6181, df = 58, p-value = 0.5389
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3731277 0.7064981
sample estimates:
 mean of x mean of y
  8.250925   8.08424

Interpretation of Difference in Means on the Log Scale

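Mean(log Cost) = Median(log Cost), because the distribution of log cost is symmetric

Median(log Cost) = log(Median Cost), because the log preserves ordering

With subscripts c and t denoting the lhcc and lhct samples:

$-0.373 \le \mathrm{Mean}(\log \mathrm{Cost}_c) - \mathrm{Mean}(\log \mathrm{Cost}_t) \le 0.707$

$-0.373 \le \log(\mathrm{Median\ Cost}_c) - \log(\mathrm{Median\ Cost}_t) \le 0.707$

$-0.373 \le \log\left(\frac{\mathrm{Median\ Cost}_c}{\mathrm{Median\ Cost}_t}\right) \le 0.707$

$0.689 = \exp(-0.373) \le \frac{\mathrm{Median\ Cost}_c}{\mathrm{Median\ Cost}_t} \le \exp(0.707) = 2.028$

This is a 95% confidence interval for the ratio of median costs.

This interpretation is not in your textbook.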

Inference for Small Samples

Case 2: Variances $\sigma_1^2$ and $\sigma_2^2$ unequal.

$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$ does not have a Student t-distribution

It can be shown that the distribution of T depends on the ratio of unknown variances, hence T is not a pivotal quantity. However, when n1 and n2 are large T has an approximate N(0,1) distribution.

For small samples, T has an approximately t-distribution with

$\nu = \frac{(w_1 + w_2)^2}{\frac{w_1^2}{n_1-1} + \frac{w_2^2}{n_2-1}}$

degrees of freedom, where $w_1 = \mathrm{SEM}(\bar{x})^2 = \frac{s_1^2}{n_1}$ and $w_2 = \mathrm{SEM}(\bar{y})^2 = \frac{s_2^2}{n_2}$

Note: d.f. are estimated from the data and are not a function of the sample sizes alone

Note: ν is not usually an integer but is rounded down to the nearest integer

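Approximate 100(1-α)% two-sided CI for $\mu_1 - \mu_2$:

$\bar{x} - \bar{y} - t_{\nu,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{\nu,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Test statistic for $H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$:

$t = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$

Reject $H_0$ if $|t| > t_{\nu,\alpha/2}$.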
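The approximate degrees of freedom ν = (w1 + w2)² / [w1²/(n1−1) + w2²/(n2−1)], with wi = si²/ni, depend on the data as well as the sample sizes. A small sketch with hypothetical inputs:

```python
import math

def welch_df(s1, n1, s2, n2):
    """Approximate d.f. for the unequal-variance two-sample t statistic."""
    w1 = s1 ** 2 / n1          # squared standard error of the first sample mean
    w2 = s2 ** 2 / n2          # squared standard error of the second sample mean
    return (w1 + w2) ** 2 / (w1 ** 2 / (n1 - 1) + w2 ** 2 / (n2 - 1))

nu_equal = welch_df(s1=2.0, n1=10, s2=2.0, n2=10)     # equal variances, equal n
nu_unequal = welch_df(s1=1.0, n1=10, s2=3.0, n2=10)   # very different variances
print(nu_equal, nu_unequal, math.floor(nu_unequal))
```

With equal variances and equal sample sizes ν equals n1 + n2 − 2; as the variances diverge, ν drops toward the d.f. of the noisier sample.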

Hospitalization Costs: Inference Using Separate Variances

See Example 8.4 on page 280 of course textbook.

t.test in S-Plus to test differences in means of hospitalization data, unequal variances

t.test(lhcc,lhct,var.equal=F)

Welch Modified Two-Sample t-Test

data: lhcc and lhct
t = 0.6181, df = 54.61, p-value = 0.5391
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3738420 0.7072124
sample estimates:
 mean of x mean of y
  8.250925   8.08424

Testing for the Equality of Variances

Section 8.4 covers the classical F test for the equality of two variances and associated confidence intervals. However, this method is not robust against departures from normality. For example, p-values can be off by a factor of 10 if the distributions have shorter or longer tails than the normal.

A robust alternative is Levene’s Test. His test applies the two-sample t-test to the absolute value of the difference of each observation and the group mean:

$|Y_{1i} - \bar{Y}_1|,\ i = 1, 2, \dots, n_1$

$|Y_{2i} - \bar{Y}_2|,\ i = 1, 2, \dots, n_2$

This method works well even though these absolute deviations are not independent.

In the Brown-Forsythe test the response is the absolute value of the difference of each observation and the group median.

Independent Sample Design: Sample Size Determination Assuming Equal Variances

$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 \ne 0$

$n = n_1 = n_2 = 2\left[\frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\delta}\right]^2$

δ = smallest difference of practical importance that we want to detect

Because we assume a known variance this n is a slight underestimate of sample size

Using S-Plus to compute sample size

> normal.sample.size(mean2=.693,power=0.9)
  mean1 sd1 mean2 sd2 delta alpha power n1 n2 prop.n2
      0   1 0.693   1 0.693  0.05   0.9 44 44       1

Matched Pairs Design

Example:See Section 8.3.2, page 283 in course textbook.

Statistical Justification of Matched Pairs Design

See Section 8.3.2, page 283 in course textbook.

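Sample Size Determination

$n = \left[\frac{(z_{\alpha} + z_{\beta})\,\sigma_D}{\delta}\right]^2$ (One-Sided Test)

$n = \left[\frac{(z_{\alpha/2} + z_{\beta})\,\sigma_D}{\delta}\right]^2$ (Two-Sided Test)

•One needs a planning value for σD

•These formulas come from the one-sample formulas applied to the differences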

Comparing Variances of Two Populations

•Application arises when comparing instrument precision or uniformities of products.

•The methods discussed in the book are applicable only under the assumption of normality of the data. They are highly sensitive to even modest departures from normality

• In case of nonnormal data there are nonparametric and other robust methods for comparing data dispersion.

Comparing Variances of Two Populations

Independent sample design:

Sample 1: $x_1, x_2, \dots, x_{n_1}$ is a random sample from $N(\mu_1, \sigma_1^2)$

Sample 2: $y_1, y_2, \dots, y_{n_2}$ is a random sample from $N(\mu_2, \sigma_2^2)$

$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$ has an F distribution with $n_1 - 1$ and $n_2 - 1$ d.f. respectively

$P\left\{f_{n_1-1,n_2-1,1-\alpha/2} \le \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \le f_{n_1-1,n_2-1,\alpha/2}\right\} = 1-\alpha$

$P\left\{\frac{1}{f_{n_1-1,n_2-1,\alpha/2}}\frac{S_1^2}{S_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{1}{f_{n_1-1,n_2-1,1-\alpha/2}}\frac{S_1^2}{S_2^2}\right\} = 1-\alpha$

(1-α)-level CI (two-sided):

$\frac{1}{f_{n_1-1,n_2-1,\alpha/2}}\frac{s_1^2}{s_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{1}{f_{n_1-1,n_2-1,1-\alpha/2}}\frac{s_1^2}{s_2^2}$

An Important Industrial Application: Example 8.8

(See Table 8.8 in course textbook.)

Do the two labs have equal measurement precision?

Inferences for Proportions and Count Data

Corresponds to Chapter 9 of

Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León (University of Tennessee)

Inference for Proportions

• Data = {0,1,1,1,0,0,…,1,0}, Bernoulli(p)

• Goal – estimate p, probability of success (or proportion of population with a certain attribute)

• $\hat{p} = x/n$, where x = number of successes in n trials

• $Var(\hat{p}) = p(1-p)/n = pq/n$

• Variance depends on the mean.

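Large Sample Confidence Interval for Proportion

Recall that $\frac{\hat{p} - p}{\sqrt{pq/n}} \approx N(0,1)$ if n is large (q = 1 − p, $n\hat{p} \ge 10$ and $n(1-\hat{p}) \ge 10$)

It follows that:

$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}} \le z_{\alpha/2}\right) \approx 1-\alpha$

Confidence interval for p:

$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$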

A Better Confidence Interval for Proportion

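Use this probability statement

$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{pq/n}} \le z_{\alpha/2}\right) \approx 1-\alpha$

Solve for p using quadratic equation

CI for p:

$\frac{\hat{p} + \frac{z^2}{2n} - \sqrt{\frac{\hat{p}\hat{q}z^2}{n} + \frac{z^4}{4n^2}}}{1 + \frac{z^2}{n}} \le p \le \frac{\hat{p} + \frac{z^2}{2n} + \sqrt{\frac{\hat{p}\hat{q}z^2}{n} + \frac{z^4}{4n^2}}}{1 + \frac{z^2}{n}}$

where $z = z_{\alpha/2}$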
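The simple interval and the quadratic-equation ("score") interval can be compared numerically. The sketch below assumes x = 360 successes in n = 800 trials, chosen to match the p̂ = 0.45 and n = 800 that appear in the S-Plus example on the next slide; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def simple_interval(x, n, confidence=0.95):
    """p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def score_interval(x, n, confidence=0.95):
    """Interval obtained by solving the quadratic in p (pq/n in the denominator)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = x / n
    center = p_hat + z * z / (2 * n)
    half = math.sqrt(p_hat * (1 - p_hat) * z * z / n + z ** 4 / (4 * n * n))
    denom = 1 + z * z / n
    return (center - half) / denom, (center + half) / denom

simple = simple_interval(360, 800)
score = score_interval(360, 800)
print(simple, score)
```

For a sample this large the two intervals nearly coincide; the difference matters more for small n or p̂ near 0 or 1.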

Example

Binomial CI

In S-Plus:

> qbinom(.975,800,0.45)
[1] 388
> qbinom(.025,800,0.45)
[1] 332

95% CI for proportion of gun owners is: 332/800 ≤ p ≤ 388/800, i.e., 0.415 ≤ p ≤ 0.485

Sample Size Determination for a Confidence Interval for Proportion

Want (1-α)-level two-sided CI: $\hat{p} \pm E$, where E is the margin of error. Then $E = z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$.

Solving for n gives $n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \hat{p}\hat{q}$

Largest value of pq = (1/2)(1/2) = 1/4, so conservative sample size is:

$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \frac{1}{4}$ (Formula 9.5)

Example 9.2: Presidential Poll

Threefold increase in precision requires ninefold increase in sample size
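The threefold/ninefold relationship follows from n being proportional to 1/E². A sketch of the conservative formula (p̂q̂ replaced by 1/4), with margins of error of 3 and 1 percentage points chosen for illustration:

```python
import math
from statistics import NormalDist

def conservative_sample_size(margin, confidence=0.95):
    """n = (z_{alpha/2} / E)^2 * 1/4, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z / margin) ** 2 / 4)

n_3pt = conservative_sample_size(0.03)   # margin of error +/- 3 percentage points
n_1pt = conservative_sample_size(0.01)   # three times more precise
print(n_3pt, n_1pt, n_1pt / n_3pt)
```

The ratio of the two sample sizes is approximately 9.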

Large Sample Hypothesis Test on Proportion

$H_0: p = p_0$ vs. $H_1: p \ne p_0$

Best test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$

Acceptance Region: $p_0 \pm cd$, where $c = z_{\alpha/2}$ and $d = (p_0 q_0/n)^{0.5}$

Basketball Problem: z-test

P-value

Exact Binomial Test in S-Plus

> 1-pbinom(299,400,.7)
[1] 0.01553209
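The exact binomial tail probability can be compared with the large-sample z-test P-value. This sketch assumes the setting implied by the S-Plus call (n = 400 trials, p0 = 0.7, at least 300 successes observed, upper one-sided alternative); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

n, x, p0 = 400, 300, 0.7
p_exact = binom_upper_tail(x, n, p0)               # same quantity as 1 - pbinom(299, 400, .7)
z = (x / n - p0) / math.sqrt(p0 * (1 - p0) / n)    # large-sample test statistic
p_normal = 1 - NormalDist().cdf(z)                 # upper one-sided P-value
print(p_exact, z, p_normal)
```

The two P-values are close but not identical, since the z-test relies on the normal approximation.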

Sample Size for Z-Test of Proportion

$H_0: p \le p_0$ vs. $H_1: p > p_0$

Suppose that the power for rejecting $H_0$ must be at least 1 − β when the true proportion is $p = p_1 > p_0$. Let $\delta = p_1 - p_0$. Then

$n = \left[\frac{z_{\alpha}\sqrt{p_0 q_0} + z_{\beta}\sqrt{p_1 q_1}}{\delta}\right]^2$

Test based on: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$

Replace $z_{\alpha}$ by $z_{\alpha/2}$ for two-sided test sample size.

Example 9.4: Pizza Testing

$n = \left[\frac{z_{\alpha/2}\sqrt{p_0 q_0} + z_{\beta}\sqrt{p_1 q_1}}{\delta}\right]^2$
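A sketch of the sample-size formula for the test on a proportion, covering both the one-sided and two-sided versions. The inputs (p0 = 0.5, p1 = 0.6, α = 0.05, power 0.80) are hypothetical and are not the numbers from Example 9.4.

```python
import math
from statistics import NormalDist

def proportion_test_sample_size(p0, p1, alpha=0.05, power=0.80, two_sided=False):
    """n = [(z_a*sqrt(p0*q0) + z_b*sqrt(p1*q1)) / delta]^2, rounded up."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    numerator = z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(p1 * (1 - p1))
    return math.ceil((numerator / abs(p1 - p0)) ** 2)

n_one = proportion_test_sample_size(0.5, 0.6)
n_two = proportion_test_sample_size(0.5, 0.6, two_sided=True)
print(n_one, n_two)
```

Unlike the z-test for a mean, the variance changes between H0 and H1, which is why p0q0 and p1q1 both appear.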

Comparing Two Proportions: Independent Sample Design

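If $n_1\hat{p}_1,\ n_1\hat{q}_1,\ n_2\hat{p}_2,\ n_2\hat{q}_2 \ge 10$, then

$Z = \frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}} \approx N(0,1)$

Confidence Interval:

$\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \le p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$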

Test for Equality of Proportions (Large n)

Independent Sample Design – pooled estimate of p

$H_0: p_1 = p_2$ vs. $H_1: p_1 \ne p_2$

Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

where $\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{x + y}{n_1 + n_2}$
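A sketch of the pooled z-test for equality of two proportions. The counts (30 of 100 versus 45 of 100) are made up for illustration and are not the leukemia data of Example 9.6.

```python
import math
from statistics import NormalDist

def pooled_two_proportion_z(x, n1, y, n2):
    """Pooled z statistic and two-sided P-value for H0: p1 = p2."""
    p1_hat, p2_hat = x / n1, y / n2
    p_pool = (x + y) / (n1 + n2)                       # pooled estimate of p under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = pooled_two_proportion_z(x=30, n1=100, y=45, n2=100)
print(z, p_value)
```

The pooled estimate is used in the standard error because p1 = p2 under H0; the confidence interval above uses the unpooled form instead.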

Example 9.6 –Comparing Two Leukemia Therapies

Inference for Small Samples: Fisher’s Exact Test

• Calculates the probability of obtaining the observed 2x2 table or any more extreme table with margins fixed.

• Uses hypergeometric distribution

$P(X = x \mid N, M, K) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}$
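The hypergeometric probability and a one-sided Fisher P-value can be computed directly with binomial coefficients. In this sketch N is the table total, M the first row total, K the first column total, and x the count in cell (1,1); the 2x2 table [[3, 1], [1, 3]] is hypothetical.

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x | N, M, K) = C(M, x) * C(N - M, K - x) / C(N, K)."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

def fisher_upper_p(x, N, M, K):
    """One-sided P-value: probability of cell (1,1) being x or larger, margins fixed."""
    return sum(hypergeom_pmf(j, N, M, K) for j in range(x, min(M, K) + 1))

# Hypothetical table [[3, 1], [1, 3]]: N = 8, row-1 total M = 4, column-1 total K = 4
p_one_sided = fisher_upper_p(x=3, N=8, M=4, K=4)
print(p_one_sided)   # equals 17/70
```

With the margins fixed, a single cell determines the whole table, so "more extreme" tables are indexed by larger values of x.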

Inference for Count Data

Data = cell counts = number of observations in each of several (>2) categories, ni, i=1..c, Σni=n

Joint distribution of corresponding r.v.’s is multinomial.

Goal – determine if the probabilities of belonging to each of the categories are equal to hypothesized values, pi0.

Test statistic, χ² = Σ(observed − expected)²/expected, where observed = ni, expected = n·pi0

χ² has a chi-square distribution when sample size is large

Multinomial Test of Proportions

Inferences for Two-Way Count Data

                                  y: Job Satisfaction
x: Annual Salary     Very          Slightly      Slightly   Very       Row Sum
                     Dissatisfied  Dissatisfied  Satisfied  Satisfied
Less than $10,000     81            64            29         10         184
$10,000-25,000        73            79            35         24         211
$25,000-50,000        47            59            75         58         239
More than $50,000     14            23            84         69         190
Column Sum           215           225           223        161         824

Sampling Model 1: Multinomial Model (Total Sample Size Fixed)

Sample of 824 from a single population that is then cross-classified.

The null hypothesis is that X and Y are independent:

$H_0: p_{ij} = P(X = i, Y = j) = P(X = i)\,P(Y = j) = p_{i\cdot}\, p_{\cdot j}$ for all $i, j$

Sampling Model 1 (Total Sample Size Fixed)

Based on Table 9.10 in the course textbook.

Estimated Expected Frequency (Cell 1,1) $= n\,\hat{p}_{1\cdot}\,\hat{p}_{\cdot 1} = 824\left(\dfrac{215}{824}\right)\left(\dfrac{184}{824}\right) = \dfrac{215 \times 184}{824} = 48.01$

Chi-Square Statistics

See Example 9.13, page 324 for instructions on calculating the chi-square statistic.

$\chi^2 = \sum_{i} \dfrac{(n_i - e_i)^2}{e_i}$, summed over all cells

Chi-Square Test Critical Value

Based on Table A.5, critical values $\chi^2_{\nu,\alpha}$ for the chi-square distribution, in the course textbook:

 ν    .995   .99   .975   .95   .90   .10   .05
 9                                          16.919

The d.f. for this $\chi^2$ statistic is (4-1)(4-1) = 9. Since $\chi^2_{9,.05} = 16.919$, the calculated $\chi^2 = 11.989$ is not sufficiently large to reject the hypothesis of independence at the α = .05 level.

S-Plus – job satisfaction example

Call:
crosstabs(formula = c(jobsat) ~ c(row(jobsat)) + c(col(jobsat)))
901 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
c(row(jobsat))|c(col(jobsat))
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      | 20    | 24    | 80    | 82    |206    |
       |0.097  |0.12   |0.39   |0.4    |0.23   |
       |0.32   |0.22   |0.25   |0.2    |       |
       |0.022  |0.027  |0.089  |0.091  |       |
-------+-------+-------+-------+-------+-------+
2      | 22    | 38    |104    |125    |289    |
       |0.076  |0.13   |0.36   |0.43   |0.32   |
       |0.35   |0.35   |0.33   |0.3    |       |
       |0.024  |0.042  |0.12   |0.14   |       |
-------+-------+-------+-------+-------+-------+
3      | 13    | 28    | 81    |113    |235    |
       |0.055  |0.12   |0.34   |0.48   |0.26   |
       |0.21   |0.26   |0.25   |0.27   |       |
       |0.014  |0.031  |0.09   |0.13   |       |
-------+-------+-------+-------+-------+-------+
4      |  7    | 18    | 54    | 92    |171    |
       |0.041  |0.11   |0.32   |0.54   |0.19   |
       |0.11   |0.17   |0.17   |0.22   |       |
       |0.0078 |0.02   |0.06   |0.1    |       |
-------+-------+-------+-------+-------+-------+
ColTotl|62     |108    |319    |412    |901    |
       |0.069  |0.12   |0.35   |0.46   |       |
-------+-------+-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 11.98857 d.f.= 9 (p=0.2139542)
Yates' correction not used
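The chi-square statistic reported by crosstabs can be reproduced by hand. The Python sketch below uses the 901-case counts from the S-Plus output above; the function name is an illustrative assumption, and the P-value is not computed because that would need a chi-square distribution function outside the standard library.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and d.f. for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # estimated expected count
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Counts from the crosstabs output (901 cases)
jobsat = [
    [20, 24, 80, 82],
    [22, 38, 104, 125],
    [13, 28, 81, 113],
    [7, 18, 54, 92],
]
chi2, df = chi_square_independence(jobsat)
print(round(chi2, 5), df, chi2 > 16.919)  # 16.919 is the tabled 5% critical value for 9 d.f.
```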

Product Multinomial Model: Row Totals Fixed

(See Table 9.2 in the course textbook.)

Sampling Model 2: Product Multinomial

Total number of patients in each drug group is fixed.

• The null hypothesis is that the probability of column response (success or failure) is the same, regardless of the row population:

$H_0: P(Y = j \mid X = i) = p_j$ for all $i, j$

S-Plus – leukemia trial

Call:
crosstabs(formula = c(leuk) ~ c(row(leuk)) + c(col(leuk)))
63 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
c(row(leuk))|c(col(leuk))
       |1      |2      |RowTotl|
-------+-------+-------+-------+
1      |14     | 7     |21     |
       |0.67   |0.33   |0.33   |
       |0.27   |0.64   |       |
       |0.22   |0.11   |       |
-------+-------+-------+-------+
2      |38     | 4     |42     |
       |0.9    |0.095  |0.67   |
       |0.73   |0.36   |       |
       |0.6    |0.063  |       |
-------+-------+-------+-------+
ColTotl|52     |11     |63     |
       |0.83   |0.17   |       |
-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 5.506993 d.f.= 1 (p=0.01894058)
Yates' correction not used
Some expected values are less than 5, don't trust stated p-value

Remarks About Chi-Square Test

• The distribution of the chi-square statistic under the null hypothesis is approximately chi-square only when the sample sizes are large.
  – The rule of thumb is that all expected cell counts should be greater than 1, and
  – no more than 1/5th of the expected cell counts should be less than 5.

• Combine sparse cells (those having small expected cell counts) with adjacent cells. Unfortunately, this has the drawback of losing some information.

• Never stop with the chi-square test. Look at cells with large values of (O-E), as in job satisfaction example.

Odds Ratio as a Measure of Association for a 2x2 Table

Sampling Model I: Multinomial

$\psi = \dfrac{p_{11}/p_{12}}{p_{21}/p_{22}}$

The numerator is the odds of the column 1 outcome vs. the column 2 outcome for row 1, and the denominator is the same odds for row 2, hence the name “odds ratio”

Odds Ratio as a Measure of Association for a 2x2 Table

Sampling Model II: Product Multinomial

$\psi = \dfrac{p_1/(1-p_1)}{p_2/(1-p_2)}$

If the two column outcomes are labeled “success” and “failure,” then ψ is the odds of success for the row 1 population vs. the odds of success for the row 2 population.
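As a small illustration, the sample odds ratio can be computed from the leukemia trial counts shown in the crosstabs output (14, 7 in row 1 and 38, 4 in row 2). Plugging observed counts in for the probabilities is an assumption of this sketch; the slides define ψ in terms of the population probabilities.

```python
def odds_ratio(table):
    """Sample odds ratio for a 2x2 table [[a, b], [c, d]]: (a/b) / (c/d)."""
    (a, b), (c, d) = table
    return (a / b) / (c / d)

leuk = [[14, 7], [38, 4]]
psi_hat = odds_ratio(leuk)
print(round(psi_hat, 4))
```

A value below 1 means the odds of the column 1 outcome are lower in row 1 than in row 2.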

Inference in a Nutshell

Slides prepared by Elizabeth Newton (MIT)

Corresponds to Chapters 6-9 of Tamhane and Dunlop

Outline

Chapter 6: Basic Concepts of Inference
  Mean Square Error
  Confidence Interval
  Hypothesis Test

Chapter 7: Inference for Single Samples
  Mean - Large Sample - z
  Mean - Small Sample – t
  Variance – Chi-square
  Prediction and Tolerance Intervals

Outline (continued)

Chapter 8: Inference for Two Samples
  Comparing Means, Independent, Large Sample – z
  Comparing Means, Independent, Small Sample
    Variances equal – t
    Variances not equal – t with df from SEM
  Matched Pairs – test differences – t
  Comparing Variances – F

Outline (continued)

Chapter 9: Inferences for Proportions and Count Data
  Proportion, Large sample – z
  Proportion, Small sample – binomial
  Comparing 2 Proportions, large – z or Chi-square
  Comparing 2 Proportions, small – Fisher’s Exact
  Matched Pairs – McNemar’s Test
  One-way Count – Chi-square
  Two-way Count – Chi-square
  Goodness of Fit – Chi-square
  Odds ratio - z

Confidence Interval on the Mean

û ± cd is a two-sided CI for mean u, where:
  û = estimator of u = sample mean
  d = standard deviation of û
  c = critical constant, for instance, zα/2 or tn-1,α/2
  zα/2 is such that P(Z > zα/2) = α/2
  zα/2 = Φ-1(1-α/2) = qnorm(1-α/2) = -qnorm(α/2)

If α = 0.05 then zα/2 = 1.96.

If we draw many samples and construct 95% CIs from them, about 95% of those intervals would contain the true value of u.
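The critical constant and the interval û ± cd can be sketched in Python, with `NormalDist().inv_cdf` playing the role of qnorm. The estimate 50.0 and standard deviation 2.0 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

def z_confidence_interval(estimate, d, alpha=0.05):
    """Two-sided (1 - alpha) interval: estimate +/- c*d with c = z_{alpha/2}."""
    c = NormalDist().inv_cdf(1 - alpha / 2)   # same as qnorm(1 - alpha/2)
    return estimate - c * d, estimate + c * d

z_crit = NormalDist().inv_cdf(0.975)
lower, upper = z_confidence_interval(estimate=50.0, d=2.0)  # hypothetical inputs
print(round(z_crit, 2), round(lower, 3), round(upper, 3))
```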

Confidence Intervals

Hypothesis Tests

• H0: null hypothesis, no change, no effect, for instance u=u0
• H1: alternative hypothesis, u≠u0
• α = P(Type I error) = P(reject H0 | H0 true)
• β = P(Type II error) = P(accept H0 | H0 false)
• Power = function of u = P(reject H0 | u)
• A two-sided hypothesis test rejects H0 when |û-u0|/d > c ↔ |û-u0| > cd ↔ û<u0-cd or û>u0+cd

Level α Tests

(See Table 7.1 on page 240 of the course textbook.)

P-Values

• P-Value is the probability of obtaining the observed result or one more extreme

• Two-sided P-Value = P(|Z| > |û-u0|/d) = 2[1-Φ(|û-u0|/d)] = 2*(1-pnorm(abs(û-u0)/d)) in S-Plus

P-Values

Power Function

Power is the probability of rejecting H0 for a given value of u.

π(u) = P(û<u0-cd | u) + P(û>u0+cd |u)

= Φ[-c+(u0-u)/d] + Φ[-c+(u-u0)/d]
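The two-sided P-value and the power function translate directly into code. This is a minimal Python sketch in which `NormalDist().cdf` stands in for Φ (pnorm); the values u0 = 10 and d = 2 are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.

def two_sided_p_value(u_hat, u0, d):
    """2 * [1 - PHI(|u_hat - u0| / d)]."""
    return 2 * (1 - PHI(abs(u_hat - u0) / d))

def power(u, u0, d, alpha=0.05):
    """pi(u) = PHI[-c + (u0 - u)/d] + PHI[-c + (u - u0)/d], with c = z_{alpha/2}."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return PHI(-c + (u0 - u) / d) + PHI(-c + (u - u0) / d)

for u in (10, 12, 14, 16):
    print(u, round(power(u, u0=10, d=2), 3))
```

At u = u0 the power equals α, and it rises toward 1 as u moves away from u0 in either direction.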

Reject H0

(1) If u0 falls outside interval û ± cd.

(2) if û falls outside interval u0 ± cd.

(3) if p-value is small.

Simple Linear Regression and Correlation.

Corresponds to Chapter 10

Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT) with some slides by

Jacqueline Telford (Johns Hopkins University)

Simple linear regression analysis estimates the relationship between two variables.

One of the variables is regarded as a response or outcome variable (y).

The other variable is regarded as predictor or explanatory variable (x).

Sometimes it is not clear which of two variables should be the response (e.g. height and weight). In this case, correlation analysis may be used.

Simple linear regression estimates relationships of the form y = a + bx.

[Figure: scatter plot of ozone concentration by temperature; x-axis: air$temperature]

A Probabilistic Model for Simple Linear Regression

Let x1, x2,..., xn be specific settings of the predictor variable.

Let y1, y2,..., yn be the corresponding values of the response variable.

Assume that yi is the observed value of a random variable (r.v.) Yi, which depends on xi according to the following model:

Yi = β0 + β1 xi + εi (i = 1, 2, …, n)

Here εi is the random error with E(εi)=0 and Var(εi)=σ2 .

Thus, E(Yi) = µi = β0 + β1 xi (true regression line).

The xi’s usually are assumed to be fixed (not random variables).

A Probabilistic Model for Simple Linear Regression

See Figure 10.1, p. 348 and also see page 348 for the four assumptions of a simple linear regression model.

Least Square Line Mathematics (invented by Gauss)

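Find the line, i.e., the values of β0 and β1, that minimizes the sum of the squared deviations:

$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$

Solve for values of β0 and β1 for which

$\dfrac{\partial Q}{\partial \beta_0} = 0$ and $\dfrac{\partial Q}{\partial \beta_1} = 0$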

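Finding Regression Coefficients

$\dfrac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)] = 0$

$\dfrac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)] = 0$

Normal Equations

$\beta_0 n + \beta_1 \sum x_i = \sum y_i$

$\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i$

Solution to Normal Equations

$\hat\beta_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$

Note that the least squares line goes through $(\bar{x}, \bar{y})$.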

Fitted regression line

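[Figure: fitted regression line on the scatter plot of ozone vs. air$temperature]

Fitted values of y: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i, \quad i = 1, 2, \ldots, n$

Residuals: $e_i = y_i - \hat{y}_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i), \quad i = 1, 2, \ldots, n$

temperature  ozone  fitted  resid
         67   3.45    2.49   0.96
         72   3.30    2.84   0.46
         74   2.29    2.98  -0.69
         62   2.62    2.14   0.48
         65   2.84    2.35   0.50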
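The closed-form least squares solution can be sketched in Python. As an illustration only, the code fits a line to the five (temperature, ozone) pairs listed above; because the slide's fitted values come from the full air data set, the coefficients from this five-point fit will differ from those in the table.

```python
def least_squares_line(x, y):
    """Return (b0, b1) minimizing the sum of squared deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar      # forces the line through (x_bar, y_bar)
    return b0, b1

temperature = [67, 72, 74, 62, 65]
ozone = [3.45, 3.30, 2.29, 2.62, 2.84]
b0, b1 = least_squares_line(temperature, ozone)
fitted = [b0 + b1 * xi for xi in temperature]
resid = [yi - fi for yi, fi in zip(ozone, fitted)]
print(b0, b1, sum(resid))
```

The residuals sum to zero (up to rounding), which is the first normal equation restated.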

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.

Matrix Approach to Simple Linear Regression (what your regression package is really doing)

The model: y=Xβ + ε

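y is n by 1
X is n by 2
β is 2 by 1
ε is n by 1

Written out, y = Xβ + ε is:

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$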

Solution of linear equations

In linear algebra: Find x which solves Ax=b.

In regression analysis: Find β which solves Xβ=y. Why can’t we do this?

Least Squares

Q = (y-Xβ)’(y-Xβ) = y’y – β’X’y – y’Xβ + β’X’Xβ = y’y – 2β’X’y + β’X’Xβ

∂Q/∂β = -2X’y + 2X’Xβ

∂Q/∂β = 0 → X’y = X’Xb, where b = $\hat\beta$

Least Squares continued

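For simple linear regression:

$X'X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}, \qquad X'y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$

X’Xb = X’y:

$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$

The Normal Equations as before.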

X’Xb = X’y

b = (X’X)^{-1}X’y (if X has linearly independent columns)

Solution by QR decomposition

X = QR, Q orthonormal, R upper triangular and invertible

b = (X’X)^{-1}X’y = (R’Q’QR)^{-1}R’Q’y = (R’R)^{-1}R’Q’y = R^{-1}Q’y

The Hat Matrix

b = (X’X)^{-1}X’y

$\hat{y}$ = Xb = X(X’X)^{-1}X’y = Hy

H (n by n) is the Hat matrix.
It takes y to $\hat{y}$.
H is symmetric and idempotent: HH = H.
Diagonal elements of the hat matrix are useful in detecting influential observations.
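For simple linear regression the hat matrix diagonals have the well-known closed form h_ii = 1/n + (x_i − x̄)²/Σ(x_j − x̄)², so they can be computed without forming H. The Python sketch below uses hypothetical temperatures, with one value deliberately far from the rest, and applies the cutoff 2(k+1)/n that the slides give later for influential observations.

```python
def hat_diagonals(x):
    """Diagonal of H = X(X'X)^{-1}X' for a model with intercept and one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

x = [62, 65, 67, 72, 74, 95]     # hypothetical predictor values; 95 is an extreme point
h = hat_diagonals(x)
cutoff = 2 * (1 + 1) / len(x)    # 2(k+1)/n with k = 1 predictor
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]
print([round(hi, 3) for hi in h], flagged)
```

The diagonals sum to 2, the number of columns of X, which matches the trace property of H.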

Expected value of b

E(b) = E[(X’X)^{-1}X’y]
= E[(X’X)^{-1}X’(Xβ+ε)]
= E[(X’X)^{-1}X’Xβ + (X’X)^{-1}X’ε]
= β

Hence b is an unbiased estimator of β.

Covariance of b

The covariance matrix of y is σ^2 I.

b = (X’X)^{-1}X’y = Ay (where A = (X’X)^{-1}X’ is k by n)

Cov(b) = A Var(y) A’ = A σ^2 I A’ = σ^2 AA’

= σ^2 (X’X)^{-1}X’X(X’X)^{-1}

= σ^2 (X’X)^{-1}

Covariance of b

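For simple linear regression,

$\sigma^2 (X'X)^{-1} = \dfrac{\sigma^2}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix} = \dfrac{\sigma^2}{n \sum (x_i - \bar{x})^2} \begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}$

$SD(b_0) = \sigma \sqrt{\dfrac{\sum x_i^2}{n \sum (x_i - \bar{x})^2}}; \qquad SD(b_1) = \dfrac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}}$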

Estimation of σ2

$s^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n-2} = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$

Note: The denominator is n - 2 since two parameters are being estimated (β0 and β1).

E[S2]=σ2 (See proof in Seber, Linear Regression Analysis)

Statistical Inference for βo and β1

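$SE(\hat\beta_0) = s \sqrt{\dfrac{\sum x_i^2}{n \sum (x_i - \bar{x})^2}}$ and $SE(\hat\beta_1) = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$

For ozone example:

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept)  -2.2260      0.4614  -4.8243    0.0000
temperature   0.0704      0.0059  11.9511    0.0000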

Sums of Squares

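Sum of Squares Total (SST): $\sum_{i=1}^{n} (y_i - \bar{y})^2$

Sum of Squares for Error (SSE): $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Sum of Squares for Regression (SSR): $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

Geometry of the Sums of Squares

$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$

SST = SSR + SSE, see derivation on p. 354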

Coefficient of Determination (R-squared)

$r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$

= proportion of the variance in y that is accounted for by the regression on x

= square of correlation between y and $\hat{y}$

For ozone example: Multiple R-Squared: 0.5672

Analysis of Variance (ANOVA)

$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$

$F = \dfrac{MSR}{MSE} = \dfrac{SSR/1}{SSE/(n-2)} = t^2$

For ozone example:

summary.aov(tmp)
             Df  Sum of Sq   Mean Sq   F Value  Pr(F)
temperature   1   49.46178  49.46178  142.8282      0
Residuals   109   37.74698   0.34630
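The ANOVA quantities follow from the two sums of squares alone. The Python sketch below recomputes R-squared, MSE, and F from the numbers printed in the ozone ANOVA table; n = 111 is inferred from the 109 residual degrees of freedom, and the function name is an illustrative assumption.

```python
def regression_anova(ssr, sse, n, k=1):
    """R-squared, MSE and F statistic from regression and error sums of squares."""
    sst = ssr + sse
    msr = ssr / k
    mse = sse / (n - k - 1)
    return {"r_squared": ssr / sst, "mse": mse, "f": msr / mse}

# Sums of squares from the summary.aov output for the ozone example
result = regression_anova(ssr=49.46178, sse=37.74698, n=111)
print(result)
```

The F value agrees with the square of the t value for temperature (11.9511) reported in the coefficient table, as F = t² requires for simple linear regression.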

Regression Diagnostics

[Figure: residual vs. observation number]

[Figure: residual vs. fitted value; x-axis: fitted(ozone.lm)]

[Figure: residual vs. x; x-axis: air$temperature]

[Figure: qq plot of residuals]

[Figure: hat matrix diagonals]

Some useful S-Plus commands

my.lm <- lm(y~x, data=mydata, na.action=na.omit)
    includes intercept term by default

summary(my.lm)
    gives coefficients, correlation of coefficients, R-square, F-statistic, residual standard error

summary.aov(my.lm)
    gives ANOVA table

resid(my.lm)
    gives residuals

fitted(my.lm)
    gives fitted values

model.matrix(my.lm)
    gives model matrix

Multiple Linear Regression

Corresponds to Chapter 11 of Tamhane & Dunlop

Slides prepared by Elizabeth Newton (MIT) with some slides by Roy Welsch (MIT).

Linear Regression

Review:

Linear Model: y = Xβ + ε, y ~ N(Xβ, σ^2 I)

Least squares: $\hat\beta$ = (X’X)^{-1}X’y

$\hat{y}$ = fitted value of y = X$\hat\beta$ = X(X’X)^{-1}X’y = Hy

e = error = residuals = y − $\hat{y}$ = y − Hy = (I−H)y

Properties of the Hat matrix

• Symmetric: H’=H
• Idempotent: HH=H
• Trace(H) = sum(diag(H)) = k+1 = number of columns in the X matrix
• 1’H = vector of 1’s (hence y and $\hat{y}$ have the same mean)
• 1’(I-H) = vector of 0’s (hence mean of residuals is 0)
• What is H when X is only a column of 1’s?

Variance-Covariance Matrices

Cov($\hat\beta$) = σ^2 (X’X)^{-1} (as we saw last time)

Cov($\hat{y}$) = Cov(Hy) = H Cov(y) H’ = H σ^2 I H = σ^2 H

Cov(e) = Cov((I−H)y) = (I−H) Cov(y) (I−H)’ = (I−H) σ^2 I (I−H) = σ^2 (I−H)

Confidence and Prediction Intervals

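Variance of mean response at x0:

Var($\hat{y}_0$) = Var(x0’$\hat\beta$) = x0’ Cov($\hat\beta$) x0 = σ^2 x0’(X’X)^{-1}x0 = σ^2 v0

Variance of new observation at x0, y0 = $\hat{y}_0$ + ε:

Var($\hat{y}_0$) + Var(ε) = σ^2 x0’(X’X)^{-1}x0 + σ^2 = σ^2 (x0’(X’X)^{-1}x0 + 1) = σ^2 (v0 + 1)

An estimate of σ^2 is s^2 = MSE = y’(I-H)y /(n-k-1)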

Confidence and Prediction Intervals

(1-α) Confidence Interval on Mean Response at x0:

$\hat{y}_0 \pm cd$, where $c = t_{n-(k+1),\,\alpha/2}$ and $d = s\sqrt{v_0}$

(1-α) Prediction Interval on New Observation at x0:

$\hat{y}_0 \pm cd$, where $c = t_{n-(k+1),\,\alpha/2}$ and $d = s\sqrt{v_0 + 1}$

Sums of Squares

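Sum of Squares Total (SST): $\sum_{i=1}^{n} (y_i - \bar{y})^2$

Sum of Squares for Error (SSE): $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Sum of Squares for Regression (SSR): $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

SSR = SST - SSE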

Overall Significance Test

To see if there is any linear relationship we test:

H0: β1 = β2 = . . . = βk = 0
H1: βj ≠ 0 for some j.

Compute

$SST = \sum (y_i - \bar{y})^2, \quad SSE = \sum (y_i - \hat{y}_i)^2, \quad SSR = SST - SSE$

The F statistic is:

$F = \dfrac{SSR/k}{SSE/(n-k-1)} = \dfrac{MSR}{MSE}$

with F based on k and (n − k − 1) degrees of freedom.

Reject H0 when F exceeds $F_{k,\,n-k-1}(\alpha)$.

Sequential Sums of Squares

SSR(x1) = SST - SSE(x1)

SSR(x2|x1) = SSR(x1,x2) - SSR(x1) =SSE(x1) - SSE(x1,x2)

SSR(x3|x1 x2) = SSE(x1,x2) - SSE(x1,x2,x3)

ANOVA Table
Type 1 (sequential) sums of squares

Source of Variation   SS               df
Regression            SSR(x1,x2,x3)    3
  x1                  SSR(x1)          1
  x2|x1               SSR(x2|x1)       1
  x3|x2,x1            SSR(x3|x2,x1)    1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1

ANOVA Table
Type 3 (partial) sums of squares

Source of Variation   SS               df
Regression            SSR(x1,x2,x3)    3
  x1|x2,x3            SSR(x1|x2,x3)    1
  x2|x1,x3            SSR(x2|x1,x3)    1
  x3|x1,x2            SSR(x3|x1,x2)    1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1

Scatter plot Matrix of the Air Data Set in S-Plus: pairs(air)

[Figure: scatter plot matrix of the air data set; panels include radiation and temperature]

> air.lm<-lm(y~x1+x2+x3)

> summary(air.lm)$coef
                   Value   Std. Error    t value      Pr(>|t|)
(Intercept) -0.297329634 0.5552138923 -0.5355227 5.933998e-001
x1           0.002205541 0.0005584658  3.9492854 1.407070e-004
x2           0.050044325 0.0061061612  8.1957098 5.848655e-013
x3          -0.076021950 0.0157548357 -4.8253090 4.665124e-006

> summary.aov(air.lm)
           Df Sum of Sq  Mean Sq  F Value         Pr(F)
x1          1  15.53144 15.53144  59.6761 6.000000e-012
x2          1  37.76939 37.76939 145.1204 0.000000e+000
x3          1   6.05985  6.05985  23.2836 4.665124e-006
Residuals 107  27.84808  0.26026

> summary.aov(air.lm,ssType=3)
Type III Sum of Squares
           Df Sum of Sq  Mean Sq  F Value        Pr(F)
x1          1   4.05928  4.05928 15.59685 0.0001407070
x2          1  17.48174 17.48174 67.16966 0.0000000000
x3          1   6.05985  6.05985 23.28361 0.0000046651
Residuals 107  27.84808  0.26026

Polynomial Models

y = β0 + β1x + β2x^2 + … + βkx^k

Problems:
  Powers of x tend to be large in magnitude
  Powers of x tend to be highly correlated

Solutions:
  Centering and scaling of x variables
  Orthogonal polynomials (poly(x,k) in S-Plus; see Seber for methods of generating)

[Figure: plot of mpg vs. weight for 74 autos (S-Plus dataset auto.stats)]

summary(lm(mpg~wt+wt^2+wt^3))

Call: lm(formula = mpg ~ wt + wt^2 + wt^3)
Residuals:
    Min     1Q  Median    3Q   Max
 -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept)  68.1797    21.4515  3.1783   0.0022
wt           -0.0309     0.0214 -1.4430   0.1535
I(wt^2)       0.0000     0.0000  0.9586   0.3410
I(wt^3)       0.0000     0.0000 -0.7449   0.4588

Residual standard error: 3.209 on 70 degrees of freedom
Multiple R-Squared: 0.705
F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0

Correlation of Coefficients:
        (Intercept)      wt I(wt^2)
wt          -0.9958
I(wt^2)      0.9841 -0.9961
I(wt^3)     -0.9659  0.9846 -0.9961

wts<-(wt-mean(wt))/sqrt(var(wt))

summary(lm(mpg~wts+wts^2+wts^3))

Call: lm(formula = mpg ~ wts + wts^2 + wts^3)
Residuals:
    Min     1Q  Median    3Q   Max
 -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept)  20.2331     0.5676 35.6470   0.0000
wts          -4.4466     0.7465 -5.9567   0.0000
I(wts^2)      1.1241     0.4682  2.4007   0.0190
I(wts^3)     -0.2521     0.3385 -0.7449   0.4588

Correlation of Coefficients:
         (Intercept)     wts I(wts^2)
wts          -0.2800
I(wts^2)     -0.7490  0.4558
I(wts^3)      0.3925 -0.8596  -0.6123

Orthogonal Polynomials

Generation is similar to Gram-Schmidt orthogonalization (see Strang, Linear Algebra)

Resulting vectors are orthonormal: X’X = I

Hence (X’X)^{-1} = I and coefficients = (X’X)^{-1}X’y = X’y

Addition of higher degree term does not affect coefficients for lower degree terms

Correlation of coefficients = I

SE of coefficients = s = sqrt(MSE)

summary(lm(mpg~poly(wt,3)))

Call: lm(formula = mpg ~ poly(wt, 3))
Residuals:
    Min     1Q  Median    3Q   Max
 -6.415 -1.556 -0.2815 1.265 13.06

Coefficients:
                 Value Std. Error  t value Pr(>|t|)
(Intercept)    21.2973     0.3730  57.0912   0.0000
poly(wt, 3)1  -40.6769     3.2090 -12.6758   0.0000
poly(wt, 3)2    7.8926     3.2090   2.4595   0.0164
poly(wt, 3)3   -2.3904     3.2090  -0.7449   0.4588

Correlation of Coefficients:
             (Intercept) poly(wt, 3)1 poly(wt, 3)2
poly(wt, 3)1           0
poly(wt, 3)2           0            0
poly(wt, 3)3           0            0            0

[Figure: plot of mpg by weight with fitted regression line]

Indicator Variables

• Sometimes we might want to fit a model with a categorical variable as a predictor. For instance, automobile price as a function of where the car is made (Germany, Japan, USA).

• If there are c categories, we need c-1 indicator (0,1) variables as predictors. For instance j=1 if car is made in Japan, 0 otherwise, u=1 if car is made in USA, 0 otherwise.

• If there are just 2 categories and no other predictors, we could just do a t-test for difference in means.

[Figure: boxplots of price by country (Germany, Japan, USA) for S-Plus dataset cu.summary]

[Figure: histogram of automobile prices for S-Plus dataset cu.summary]

[Figure: histogram of log of automobile prices for S-Plus dataset cu.summary; x-axis: log(price)]

summary(lm(price~u+j))

Call: lm(formula = price ~ u + j)
Residuals:
    Min    1Q Median   3Q   Max
 -15746 -4586  -2071 2374 22495

Coefficients:
                   Value Std. Error t value Pr(>|t|)
(Intercept)   25741.3636  2282.2729 11.2788   0.0000
u            -10520.5473  2525.4871 -4.1657   0.0001
j            -10236.0088  2656.5095 -3.8532   0.0002

Residual standard error: 7569 on 88 degrees of freedom
Multiple R-Squared: 0.1723
F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is 0.0002435

Correlation of Coefficients:
  (Intercept)      u
u     -0.9037
j     -0.8591 0.7764

summary(lm(price~u+g))

Call: lm(formula = price ~ u + g)
Residuals:
    Min    1Q Median   3Q   Max
 -15746 -4586  -2071 2374 22495

Coefficients:
                  Value Std. Error t value Pr(>|t|)
(Intercept)  15505.3548  1359.5121 11.4051   0.0000
u             -284.5385  1737.1208 -0.1638   0.8703
g            10236.0088  2656.5095  3.8532   0.0002

Residual standard error: 7569 on 88 degrees of freedom
Multiple R-Squared: 0.1723
F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is 0.0002435

Correlation of Coefficients:
  (Intercept)      u
u     -0.7826
g     -0.5118 0.4005

Regression Diagnostics

Goal: identify remarkable observations and unremarkable predictors.

Problems with observations:
  Outliers
  Influential observations

Problems with predictors:
  A predictor may not add much to model.
  A predictor may be too similar to another predictor (collinearity).
  Predictors may have been left out.

[Figure: plot of standardized residuals vs. fitted values for air dataset]

[Figure: plot of residual vs. fit for air data set with all interaction terms; x-axis: fitted(tmp)]

[Figure: plot of residual vs. fit for air model with x3*x4 interaction; x-axis: fitted(tmp)]

Call: lm(formula = air[, 1] ~ air[, 2] + air[, 3] + air[, 4] + air[, 3] * air[, 4])
Residuals:
    Min      1Q   Median     3Q  Max
 -1.088 -0.3542 -0.07242 0.3436 1.47

Coefficients:
                     Value Std. Error t value Pr(>|t|)
(Intercept)        -3.6465     1.1684 -3.1209   0.0023
air[, 2]            0.0023     0.0005  4.3223   0.0000
air[, 3]            0.0920     0.0143  6.4435   0.0000
air[, 4]            0.2523     0.1031  2.4478   0.0160
air[, 3]:air[, 4]  -0.0042     0.0013 -3.2201   0.0017

Correlation of Coefficients:
                  (Intercept) air[, 2] air[, 3] air[, 4]
air[, 2]              -0.0361
air[, 3]              -0.9880  -0.0495
air[, 4]              -0.9268   0.0620   0.9313
air[, 3]:air[, 4]      0.8902  -0.0661  -0.9119  -0.9892

Remarkable Observations

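Residuals are the key

Standardized residuals: $e_i^* = \dfrac{e_i}{s\sqrt{1 - h_{ii}}}$
  Outlier if $|e_i^*| > 2$

Hat matrix diagonals, $h_{ii}$
  Influential if $h_{ii} > 2(k+1)/n$

Cook’s Distance: $d_i = \dfrac{(e_i^*)^2}{k+1} \cdot \dfrac{h_{ii}}{1 - h_{ii}}$
  Influential if $d_i > 1$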

[Figure: plot of standardized residual vs. observation number for air dataset]

[Figure: hat matrix diagonals vs. observation number]

[Figure: plot of wind vs. ozone]

[Figure: Cook’s distance]

[Figure: plot of ozone vs. wind including fitted regression lines with and without observation 30 (simple linear regression)]

Remedies for Outliers

• Nothing?
• Data Transformation?
• Remove outliers?
• Robust Regression – weighted least squares: b = (X’WX)^{-1}X’Wy
• Minimize median absolute deviation

Collinearity

High correlation among the predictors can cause problems with least squares estimates (wrong signs, low t-values, unexpected results).

If predictors are centered and scaled to unit length, then X’X is the correlation matrix.

Diagonal elements of inverse of correlation matrix are called VIF’s (variance inflation factors).

$VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination for the regression of the jth predictor on the remaining predictors.

When $R_j^2 = .90$, VIF is about 10 and caution is advised. (Some authors say VIF = 5.) A large VIF indicates there is redundant information in the explanatory variables.

Why is this called the variance inflation factor? We can show that

$\mathrm{Var}(\hat\beta_j) = \dfrac{1}{1 - R_j^2} \cdot \dfrac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} = VIF_j \left[ \mathrm{Var}(\hat\beta_j) \text{ in simple regression} \right]$

Thus $VIF_j$ represents the variance inflation caused by adding all the variables other than $x_j$ to the model.

Remedies for collinearity

1. Identify and eliminate redundant variables (large literature on this).

2. Modified regression techniques

a. ridge regression, b = (X’X+cI)^{-1}X’y

3. Regress on orthogonal linear combinations of the explanatory variables

a. principal components regression

4. Careful variable selection

Correlation and inverse of correlation matrix for air data set.

r<-cor(model.matrix(air.lm)[,-1])

> r
           x1         x2         x3
x1  1.0000000  0.2940876 -0.1273656
x2  0.2940876  1.0000000 -0.4971459
x3 -0.1273656 -0.4971459  1.0000000

> solve(r)
            x1         x2          x3
x1  1.09524102 -0.3357220 -0.02740677
x2 -0.33572201  1.4312012  0.66875638
x3 -0.02740677  0.6687564  1.32897882
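The link between the correlation matrix and the VIFs can be checked numerically. The Python sketch below inverts the 3x3 air data correlation matrix printed above with the adjugate formula and reads the VIFs off the diagonal; the helper names are illustrative assumptions, and a general-purpose routine such as solve would be used in practice.

```python
def inverse_3x3(m):
    """Inverse of a 3x3 matrix via the adjugate (transposed cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj]

def vif_from_r2(r2):
    """VIF_j = 1 / (1 - R_j^2)."""
    return 1.0 / (1.0 - r2)

# Correlation matrix of x1, x2, x3 from the S-Plus output for the air data
r = [
    [1.0000000, 0.2940876, -0.1273656],
    [0.2940876, 1.0000000, -0.4971459],
    [-0.1273656, -0.4971459, 1.0000000],
]
r_inv = inverse_3x3(r)
vifs = [r_inv[j][j] for j in range(3)]
print([round(v, 4) for v in vifs], vif_from_r2(0.90))
```

All three VIFs are close to 1, so collinearity is not a concern for the air data, in contrast to the polynomial terms in the mpg example.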

Correlation and inverse of correlation matrix for mpg data set

r<-cor(model.matrix(auto1.lm)[,-1])

> r
               wt   I(wt^2)   I(wt^3)
wt      1.0000000 0.9917756 0.9677228
I(wt^2) 0.9917756 1.0000000 0.9918939
I(wt^3) 0.9677228 0.9918939 1.0000000

> solve(r)
               wt   I(wt^2)   I(wt^3)
wt       2000.377 -3951.728  1983.884
I(wt^2) -3951.728  7868.535 -3980.575
I(wt^3)  1983.884 -3980.575  2029.459

Variable Selection

• We want a parsimonious model – as few variables as possible to still provide reasonable accuracy in predicting y.

• Some variables may not contribute much to the model.

• SSE will never increase if we add more variables to the model; however, MSE=SSE/(n-k-1) may.

• Minimum MSE is one possible optimality criterion. However, we must fit all possible subsets (2^k of them) and find the one with minimum MSE.

Backward Elimination

1. Fit the full model (with all candidate predictors).

2. If P-values for all coefficients < α then stop.

3. Delete predictor with highest P-value.

4. Refit the model.

5. Go to Step 2.

Logistic Regression

References: Applied Linear Statistical Models, Neter et al.

Categorical Data Analysis, Agresti

Logistic Regression

• Nonlinear regression model when response variable is qualitative.
• 2 possible outcomes, success or failure, diseased or not diseased, present or absent
• Examples: CAD (y/n) as a function of age, weight, gender, smoking history, blood pressure
• Smoker or non-smoker as a function of family history, peer group behavior, income, age
• Purchase an auto this year as a function of income, age of current car, age

E Newton 2

Response Function for Binary Outcome

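$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad Y_i = 0, 1$

$P(Y_i = 1) = \pi_i, \qquad P(Y_i = 0) = 1 - \pi_i$

$E\{Y_i\} = 1(\pi_i) + 0(1 - \pi_i) = \pi_i$

$E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i$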

Special Problems when Response is Binary

Constraints on Response Function
  0 ≤ E{Y} = π ≤ 1

Non-normal Error Terms
  When Yi=1: εi = 1-β0-β1Xi
  When Yi=0: εi = -β0-β1Xi

Non-constant error variance
  Var{Yi} = Var{εi} = πi(1-πi)

Logistic Response Function

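$E\{Y\} = \pi = \dfrac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$

$\pi\,(1 + \exp(\beta_0 + \beta_1 X)) = \exp(\beta_0 + \beta_1 X)$

$\pi + \pi \exp(\beta_0 + \beta_1 X) = \exp(\beta_0 + \beta_1 X)$

$\pi = \exp(\beta_0 + \beta_1 X)(1 - \pi)$

$\dfrac{\pi}{1 - \pi} = \exp(\beta_0 + \beta_1 X)$

$\log\left(\dfrac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X$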

[Figure: example of logistic response function]

Properties of Logistic Response Function

log(π/(1-π))=logit transformation, log odds

π/(1-π) = odds

Logit ranges from -∞ to ∞ as x varies from -∞ to ∞
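The logistic response function and the logit transformation are inverses of each other, which a short Python sketch can confirm. The function names and the linear predictor values used below are illustrative assumptions.

```python
from math import exp, log

def logistic(eta):
    """pi = exp(eta) / (1 + exp(eta)), written in the equivalent form 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + exp(-eta))

def logit(pi):
    """Log odds: log(pi / (1 - pi))."""
    return log(pi / (1.0 - pi))

for eta in (-5.0, 0.0, 0.7, 5.0):   # hypothetical values of beta0 + beta1*x
    pi = logistic(eta)
    print(eta, round(pi, 4), round(pi / (1 - pi), 4), round(logit(pi), 4))
```

The printed probabilities stay strictly between 0 and 1 while the logit recovers the original linear predictor.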


Likelihood Function

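$P(Y_i = 1) = \pi_i, \qquad P(Y_i = 0) = 1 - \pi_i$

pdf: $f_i(Y_i) = \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}, \quad Y_i = 0, 1; \; i = 1, 2, \ldots, n$

Since the $Y_i$ are independent, the joint pdf is:

$g(Y_1, \ldots, Y_n) = \prod_{i=1}^{n} f_i(Y_i) = \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}$

$\log g(Y_1, \ldots, Y_n) = \sum_{i=1}^{n} \left[ Y_i \log\left(\dfrac{\pi_i}{1 - \pi_i}\right) \right] + \sum_{i=1}^{n} \log(1 - \pi_i)$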

Likelihood Function (continued)

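$1 - \pi_i = \dfrac{1}{1 + \exp(\beta_0 + \beta_1 X_i)}, \qquad \log\left(\dfrac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i$

$\log L(\beta_0, \beta_1) = \sum_{i=1}^{n} Y_i (\beta_0 + \beta_1 X_i) - \sum_{i=1}^{n} \log[1 + \exp(\beta_0 + \beta_1 X_i)]$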

Likelihood for Multiple Logistic Regression

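$\log L(\beta) = \sum_{i} Y_i (X_i'\beta) - \sum_{i} \log[1 + \exp(X_i'\beta)]$

$\dfrac{\partial \log L}{\partial \beta_j} = \sum_{i} Y_i X_{ij} - \sum_{i} \left[ \dfrac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)} \right] X_{ij}$

Likelihood Equations: $\sum_{i} Y_i X_{ij} - \sum_{i} \left[ \dfrac{\exp(X_i'\hat\beta)}{1 + \exp(X_i'\hat\beta)} \right] X_{ij} = 0$ for each j, i.e. $\sum_{i} (Y_i - \hat\pi_i) X_{ij} = 0$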

Solution of Likelihood Equations

No closed form solution

Use Newton-Raphson algorithm

Iteratively reweighted least squares (IRLS)

Start with OLS solution for β at iteration t=0, β^0

π_i^t = 1/(1+exp(-X_i’β^t))

β^(t+1) = β^t + (X’VX)^{-1} X’(y-π^t), where V = diag(π_i^t(1-π_i^t))

Usually only takes a few iterations
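The Newton-Raphson update can be sketched for the one-predictor case, where X’VX is a 2x2 matrix that can be inverted by hand. This is an illustrative Python sketch, not the S-Plus glm implementation: it starts from β = 0 rather than from the OLS solution, and the small data set is hypothetical.

```python
from math import exp

def fit_logistic_newton(x, y, max_iter=25, tol=1e-10):
    """Newton-Raphson for logit(pi_i) = b0 + b1*x_i; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0   # simplification: start at zero instead of the OLS solution
    for _ in range(max_iter):
        pi = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector X'(y - pi)
        g0 = sum(yi - p for yi, p in zip(y, pi))
        g1 = sum(xi * (yi - p) for xi, yi, p in zip(x, y, pi))
        # X'VX with V = diag(pi_i * (1 - pi_i)), stored as [[a, b], [b, c]]
        w = [p * (1.0 - p) for p in pi]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        # Step = (X'VX)^{-1} X'(y - pi), using the explicit 2x2 inverse
        step0 = (c * g0 - b * g1) / det
        step1 = (a * g1 - b * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

x_demo = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical predictor values
y_demo = [0, 0, 1, 0, 1, 0, 1, 1]   # hypothetical binary responses (not separable)
b0, b1 = fit_logistic_newton(x_demo, y_demo)
print(round(b0, 4), round(b1, 4))
```

At convergence the likelihood equations hold, so the residuals y − π are orthogonal to every column of X.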

Interpretation of logistic regression coefficients

• Log(π/(1-π)) = Xβ
• So each βj is the effect of a unit increase in Xj on the log odds of success, with values of other variables held constant
• Odds Ratio = exp(βj)

Example: Spinal Disease in Children Data

SUMMARY: The kyphosis data frame has 81 rows representing data on 81 children who have had corrective spinal surgery. The outcome Kyphosis is a binary variable, the other three variables (columns) are numeric.

ARGUMENTS:

Kyphosis: a factor telling whether a postoperative deformity (kyphosis) is "present" or "absent".

Age: the age of the child in months.

Number: the number of vertebrae involved in the operation.

Start: the beginning of the range of vertebrae involved in the operation.

SOURCE: John M. Chambers and Trevor J. Hastie, Statistical Models in S, Wadsworth and Brooks, Pacific Grove, CA 1992, pg. 200.

Observations 1:16 of kyphosis data set

kyphosis[1:16,]
   Kyphosis Age Number Start
1    absent  71      3     5
2    absent 158      3    14
3   present 128      4     5
4    absent   2      5     1
5    absent   1      4    15
6    absent   1      2    16
7    absent  61      2    17
8    absent  37      3    16
9    absent 113      2    16
10  present  59      6    12
11  present  82      5    14
12   absent 148      3    16
13   absent  18      5     2
14   absent   1      4    12
16   absent 168      3    18

Variables in kyphosis

summary(kyphosis)
    Kyphosis             Age           Number           Start
  absent:64     Min.:  1.00     Min.: 2.000     Min.: 1.00
 present:17  1st Qu.: 26.00  1st Qu.: 3.000  1st Qu.: 9.00
              Median: 87.00   Median: 4.000   Median:13.00
                Mean: 83.65     Mean: 4.049     Mean:11.49
             3rd Qu.:130.00  3rd Qu.: 5.000  3rd Qu.:16.00
                Max.:206.00     Max.:10.000     Max.:18.00

[Figure: scatter plot matrix of kyphosis data set; panels include Kyphosis and Number]

[Figure: boxplots of predictors vs. Kyphosis (absent, present)]

[Figure: smoothing spline fits, df=3; x-axes: jitter(age), jitter(num), jitter(sta)]

Summary of glm fit

Call: glm(formula = Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)

Deviance Residuals:
       Min         1Q     Median         3Q     Max
 -2.312363 -0.5484308 -0.3631876 -0.1658653 2.16133

Coefficients:
                  Value Std. Error   t value
(Intercept) -2.03693225 1.44918287 -1.405573
Age          0.01093048 0.00644419  1.696175
Number       0.41060098 0.22478659  1.826626
Start       -0.20651000 0.06768504 -3.051043

Summary of glm fit

Null Deviance: 83.23447 on 80 degrees of freedom

Residual Deviance: 61.37993 on 77 degrees of freedom

Number of Fisher Scoring Iterations: 5

Correlation of Coefficients:
       (Intercept)        Age    Number
Age     -0.4633715
Number  -0.8480574  0.2321004
Start   -0.3784028 -0.2849547 0.1107516

Residuals

• Response Residuals: yi-πi

• Pearson Residuals: (yi-πi)/sqrt(πi(1-πi))

• Deviance Residuals: sign(yi-πi) sqrt(-2log(|1-yi-πi|))
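The three residual types can be sketched in Python and compared with the fit, rr, rp, and rd columns in the kyphosis table shown later in these slides. The inputs below are the rounded fitted values for observations 1 (absent, fit 0.257) and 3 (present, fit 0.493), so agreement is only to about three decimals; the function names are illustrative assumptions.

```python
from math import copysign, log, sqrt

def response_residual(y, pi):
    return y - pi

def pearson_residual(y, pi):
    return (y - pi) / sqrt(pi * (1 - pi))

def deviance_residual(y, pi):
    """sign(y - pi) * sqrt(-2 log|1 - y - pi|) for a binary response y in {0, 1}."""
    magnitude = sqrt(-2 * log(abs(1 - y - pi)))
    return copysign(magnitude, y - pi)

for y, pi in [(0, 0.257), (1, 0.493)]:
    print(y, pi,
          round(response_residual(y, pi), 3),
          round(pearson_residual(y, pi), 3),
          round(deviance_residual(y, pi), 3))
```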

Model Deviance

• Deviance of fitted model compares log-likelihood of fitted model to that of saturated model.

• Log likelihood of saturated model=0

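$DEV = -2 \sum_{i=1}^{n} \left[ Y_i \log(\hat\pi_i) + (1 - Y_i) \log(1 - \hat\pi_i) \right] = \sum_{i=1}^{n} d_i^2$

$d_i = \mathrm{sign}(Y_i - \hat\pi_i) \left\{ -2 \left[ Y_i \log(\hat\pi_i) + (1 - Y_i) \log(1 - \hat\pi_i) \right] \right\}^{1/2}$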

Covariance Matrix

> x<-model.matrix(kyph.glm)

> xvx<-t(x)%*%diag(fi*(1-fi))%*%x

> xvx
            (Intercept)         Age     Number      Start
(Intercept)    9.620342    907.8887   43.67401   86.49845
Age          907.888726 114049.8308 3904.31350 9013.14464
Number        43.674014   3904.3135  219.95353  378.82849
Start         86.498450   9013.1446  378.82849 1024.07328

> xvxi<-solve(xvx)
> xvxi
             (Intercept)            Age        Number         Start
(Intercept)  2.101402986 -0.00433216784 -0.2764670205 -0.0370950612
Age         -0.004332168  0.00004155736  0.0003368969 -0.0001244665
Number      -0.276467020  0.00033689690  0.0505664221  0.0016809996
Start       -0.037095061 -0.00012446655  0.0016809996  0.0045833534

> sqrt(diag(xvxi))
[1] 1.44962167 0.00644650 0.22486979 0.06770047

Change in Deviance resulting from adding terms to model

> anova(kyph.glm)
Analysis of Deviance Table

Binomial model

Response: Kyphosis

Terms added sequentially (first to last)
       Df Deviance Resid. Df Resid. Dev
NULL                      80   83.23447
Age     1  1.30198        79   81.93249
Number  1 10.30593        78   71.62656
Start   1 10.24663        77   61.37993

Summary for kyphosis model with age^2 added

Call: glm(formula = Kyphosis ~ poly(Age, 2) + Number + Start, family = binomial, data = kyphosis)

Deviance Residuals:
       Min         1Q    Median          3Q      Max
 -2.235654 -0.5124374 -0.245114 -0.06111367 2.354818

Coefficients:
                    Value Std. Error   t value
(Intercept)    -1.6502939 1.40171048 -1.177343
poly(Age, 2)1   7.3182325 4.66933068  1.567298
poly(Age, 2)2 -10.6509151 5.05858692 -2.105512
Number          0.4268172 0.23531689  1.813798
Start          -0.2038329 0.07047967 -2.892080

Summary of fit with age^2 added

Null Deviance: 83.23447 on 80 degrees of freedom

Residual Deviance: 54.42776 on 76 degrees of freedom

Number of Fisher Scoring Iterations: 5

Correlation of Coefficients:
              (Intercept) poly(Age, 2)1 poly(Age, 2)2    Number
poly(Age, 2)1  -0.2107783
poly(Age, 2)2   0.2497127    -0.0924834
Number         -0.8403856     0.3070957    -0.0988896
Start          -0.4918747    -0.2208804     0.0911896 0.0721616

Analysis of Deviance

> anova(kyph.glm2)
Analysis of Deviance Table

Binomial model

Response: Kyphosis

Terms added sequentially (first to last)
             Df Deviance Resid. Df Resid. Dev
NULL                            80   83.23447
poly(Age, 2)  2 10.49589        78   72.73858
Number        1  8.87597        77   63.86261
Start         1  9.43485        76   54.42776

Kyphosis data, 16 obs, with fit and residuals

> cbind(kyphosis,round(p,3),round(rr,3),round(rp,3),round(rd,3))[1:16,]
   Kyphosis Age Number Start   fit     rr     rp     rd
 1   absent  71      3     5 0.257 -0.257 -0.588 -0.771
 2   absent 158      3    14 0.122 -0.122 -0.374 -0.511
 3  present 128      4     5 0.493  0.507  1.014  1.189
 4   absent   2      5     1 0.458 -0.458 -0.919 -1.107
 5   absent   1      4    15 0.030 -0.030 -0.175 -0.246
 6   absent   1      2    16 0.011 -0.011 -0.105 -0.148
 7   absent  61      2    17 0.017 -0.017 -0.131 -0.185
 8   absent  37      3    16 0.024 -0.024 -0.157 -0.220
 9   absent 113      2    16 0.036 -0.036 -0.193 -0.271
10  present  59      6    12 0.197  0.803  2.020  1.803
11  present  82      5    14 0.121  0.879  2.689  2.053
12   absent 148      3    16 0.076 -0.076 -0.288 -0.399
13   absent  18      5     2 0.450 -0.450 -0.905 -1.094
14   absent   1      4    12 0.054 -0.054 -0.239 -0.333
16   absent 168      3    18 0.064 -0.064 -0.261 -0.363
17   absent   1      3    16 0.016 -0.016 -0.129 -0.181
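The three residual columns (rr, rp, rd) can be recomputed from the fitted probability alone. The Python sketch below assumes the response is coded 1 for "present" and 0 for "absent", and that rr, rp and rd stand for response, Pearson and deviance residuals; the function name is illustrative. Because it starts from the rounded fit column, results agree with the table only to about two or three decimals.

```python
import math

def glm_residuals(y, p):
    """Return (response, Pearson, deviance) residuals for a 0/1 response y
    with fitted probability p, 0 < p < 1."""
    response = y - p
    pearson = response / math.sqrt(p * (1.0 - p))
    # Log-likelihood contribution of a single Bernoulli observation.
    loglik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    # Deviance residual carries the sign of the response residual.
    deviance = math.copysign(math.sqrt(-2.0 * loglik), response)
    return response, pearson, deviance

# (y, fit) for rows 1, 3 and 4 of the table (absent = 0, present = 1).
rows = {1: (0, 0.257), 3: (1, 0.493), 4: (0, 0.458)}
results = {k: glm_residuals(y, p) for k, (y, p) in rows.items()}

for k, (rr, rp, rd) in results.items():
    print(f"row {k}: rr = {rr:.3f}, rp = {rp:.3f}, rd = {rd:.3f}")
```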

Plot of response residual vs. fit (figure)

Plot of deviance residual vs. index (figure)

Plot of deviance residuals vs. fitted value (figure; x-axis: fitted(kyph.glm2))

Summary of bootstrap for kyphosis model

Call:
bootstrap(data = kyphosis, statistic = coef(glm(Kyphosis ~ poly(Age, 2) + Number + Start, family = binomial, data = kyphosis)), trace = F)

Number of Replications: 1000

Summary Statistics:
              Observed     Bias     Mean      SE
(Intercept)    -1.6503 -0.85600  -2.5063  5.1675
poly(Age, 2)1   7.3182  4.33814  11.6564 22.0166
poly(Age, 2)2 -10.6509 -7.48557 -18.1365 37.6780
Number          0.4268  0.17785   0.6047  0.6823
Start          -0.2038 -0.07825  -0.2821  0.4593

Empirical Percentiles:
                   2.5%         5%     95%    97.5%
(Intercept)    -8.52922  -7.247145  1.1760  2.27636
poly(Age, 2)1  -6.13910  -1.352143 27.1515 34.64701
poly(Age, 2)2 -48.86864 -38.993192 -4.9585 -4.13232
Number         -0.07539  -0.003433  1.4756  1.82754
Start          -0.58795  -0.470139 -0.1159 -0.08919

Summary of bootstrap (continued)

BCa Confidence Limits:
                  2.5%       5%      95%    97.5%
(Intercept)    -6.4394  -5.3043  2.39707  3.56856
poly(Age, 2)1 -18.2205 -10.1003 18.34192 21.56654
poly(Age, 2)2 -24.2382 -20.3911 -1.75701 -0.19269
Number         -0.7653  -0.1694  1.14036  1.27858
Start          -0.3521  -0.3167 -0.03478  0.01461

Correlation of Replicates:
              (Intercept) poly(Age, 2)1 poly(Age, 2)2  Number   Start
(Intercept)        1.0000       -0.4204        0.5082 -0.5676 -0.1839
poly(Age, 2)1     -0.4204        1.0000       -0.8475  0.4368 -0.6478
poly(Age, 2)2      0.5082       -0.8475        1.0000 -0.3739  0.5983
Number            -0.5676        0.4368       -0.3739  1.0000 -0.4174
Start             -0.1839       -0.6478        0.5983 -0.4174  1.0000
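In the summary statistics above, Bias is Mean minus Observed (for the intercept, -2.5063 - (-1.6503) = -0.8560). The following Python sketch shows how such a summary could be produced for a simple statistic. It is not the S-Plus bootstrap function: the data are made up, the percentile rule is a crude empirical one, and BCa limits are not attempted.

```python
import random
import statistics

def bootstrap(data, statistic, n_rep=1000, seed=0):
    """Nonparametric bootstrap: resample the data with replacement and
    summarize the replicates of the statistic."""
    rng = random.Random(seed)
    n = len(data)
    observed = statistic(data)
    reps = []
    for _ in range(n_rep):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(statistic(sample))
    reps.sort()
    mean_rep = statistics.mean(reps)
    return {
        "observed": observed,
        "mean": mean_rep,
        "bias": mean_rep - observed,          # Bias = Mean - Observed
        "se": statistics.stdev(reps),         # SD of the replicates
        "pct_2.5": reps[int(0.025 * n_rep)],  # simple empirical percentiles
        "pct_97.5": reps[int(0.975 * n_rep) - 1],
    }

# Hypothetical data, for illustration only.
toy = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
result = bootstrap(toy, statistics.mean)
for key, value in result.items():
    print(f"{key:9s} {value: .4f}")
```

For the kyphosis model the statistic would be the vector of fitted GLM coefficients, refit on each resample, which is why the replicates can be so variable.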

Histograms of coefficient estimates (figure; panels include (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number)

QQ Plots of coefficient estimates (figure; panels include (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number)

Regression Review and Robust Regression

S-Plus Oil City Data Frame

Monthly Excess Returns of Oil City Petroleum, Inc. Stocks and the Market

SUMMARY: The oilcity data frame has 129 rows and 2 columns. The sample runs from April 1979 to December 1989. This data frame contains the following columns:

VALUE:
Oil
    monthly excess returns of Oil City Petroleum, Inc. stocks.
Market
    monthly excess returns of the market.

Oil City Data (continued)

• Returns = relative change in the stock price over a one-month interval.
• Excess returns are computed relative to the monthly return of a 90-day US Treasury bill at the risk-free rate.
• Financial economists use least squares to fit a straight line predicting a particular stock return from the market return.
• Beta = estimated coefficient of the market return. Measures the riskiness of the stock in terms of standard deviation and expected returns.
• Large beta -> stock is risky compared to market, but also expected returns from the stock are large.
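As a small illustration of beta as a least squares slope, the Python sketch below fits a straight line by the usual closed-form formulas. The return series are hypothetical and constructed to lie exactly on a line with slope 1.5; they are not the oilcity data.

```python
def ols_line(x, y):
    """Least squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta = sxy / sxx             # slope: coefficient of the market return
    alpha = ybar - beta * xbar   # intercept
    return alpha, beta

# Hypothetical excess returns: the stock lies exactly on 0.01 + 1.5 * market.
market = [-0.10, -0.05, 0.00, 0.05, 0.10]
stock = [0.01 + 1.5 * m for m in market]

alpha, beta = ols_line(market, stock)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```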

Plot of Market returns vs. month (figure)

Plot of Oil City Petroleum return vs. month (figure)

Histogram of Market Returns (figure)

Histogram of Oil City Returns (figure)

Plot of Oil City vs. Market Returns (figure; x-axis: Market; points labeled by observation number)

Plot of Oil City vs. Market Returns without observation 94 (figure; x-axis: Market; points labeled by observation number)

> summary(oilcity)
         Oil                   Market
    Min.:-0.55667260      Min.:-0.27857020
 1st Qu.:-0.23968330   1st Qu.:-0.10557534
  Median:-0.10049000    Median:-0.07277544
    Mean:-0.07221215      Mean:-0.07689209
 3rd Qu.:-0.05821000   3rd Qu.:-0.03973828
    Max.: 5.19292000      Max.: 0.07131940

Summary oil.lm

Call: lm(formula = Oil ~ Market, data = oilcity)
Residuals:
     Min      1Q   Median      3Q   Max
 -0.6952 -0.1732 -0.05444 0.08407 4.842

Coefficients:
             Value Std. Error t value Pr(>|t|)
(Intercept) 0.1474     0.0707  2.0849   0.0391
Market      2.8567     0.7318  3.9040   0.0002

Residual standard error: 0.4867 on 127 degrees of freedom
Multiple R-Squared: 0.1071
F-statistic: 15.24 on 1 and 127 degrees of freedom, the p-value is 0.0001528

Correlation of Coefficients:
       (Intercept)
Market 0.7956

Plot of residual vs. fit for oil.lm (figure; x-axis: Fitted : Market)

Plot of Cooks Distance vs. Index (figure)

Plot of hat matrix diagonals for oil.lm (figure; points labeled by observation number)

Summary of model without observation 94

Call: lm(formula = Oil ~ Market, data = oilcity94)

Residuals:
     Min      1Q   Median      3Q   Max
 -0.5169 -0.1174 -0.01959 0.06864 0.859

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.0247     0.0304 -0.8139   0.4173
Market       1.1355     0.3137  3.6202   0.0004

... the p-value is 0.0004249

Correlation of Coefficients:
       (Intercept)
Market 0.8061

Plot of residual vs fit for model without observation 94 (figure; x-axis: Fitted : Market)

Weighted Least Squares

Used when observations, y, have unequal variances.

$$y = X\beta + \varepsilon, \quad E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma^2 V$$

V is non-singular positive definite.
V is diagonal if errors are uncorrelated.
V is always symmetric.

There is an n x n non-singular symmetric matrix, R, such that $R'R = RR = V$. R is sometimes called the square root of V.

Weighted least squares (continued)

Define new variables: $R^{-1}y$, $R^{-1}X$, $R^{-1}\varepsilon$.

$$y = X\beta + \varepsilon \quad \text{becomes} \quad R^{-1}y = R^{-1}X\beta + R^{-1}\varepsilon$$

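Weighted least squares (continued)

$$Var(R^{-1}\varepsilon) = E\{[R^{-1}\varepsilon - E(R^{-1}\varepsilon)][R^{-1}\varepsilon - E(R^{-1}\varepsilon)]'\} = E(R^{-1}\varepsilon\varepsilon'R^{-1})$$

$$= R^{-1}E(\varepsilon\varepsilon')R^{-1} = \sigma^2 R^{-1}VR^{-1} = \sigma^2 R^{-1}RRR^{-1} = \sigma^2 I$$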

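Weighted Least Squares (continued)

$$Q(\beta) = \varepsilon'W\varepsilon = (y - X\beta)'W(y - X\beta), \quad W = V^{-1} \text{ (weights)}$$

Least squares normal equations are

$$(X'WX)\hat{\beta} = X'Wy$$

The solution is:

$$\hat{\beta} = (X'WX)^{-1}X'Wy$$

$$Var(\hat{\beta}) = (X'WX)^{-1}X'W\,Var(y)\,WX(X'WX)^{-1} = \sigma^2 (X'WX)^{-1}X'WW^{-1}WX(X'WX)^{-1} = \sigma^2 (X'WX)^{-1}$$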
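For a straight-line model with a diagonal weight matrix, the normal equations (X'WX)β̂ = X'Wy reduce to a 2 x 2 system that can be solved directly. The Python sketch below does this with weighted sums; the function name and the toy data are illustrative. Giving the last point weight 0 removes it from the fit, so the remaining three points (which lie exactly on y = 1 + 2x) are reproduced.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1*x with diagonal weights w.

    Returns ((b0, b1), inv) where inv is (X'WX)^{-1}; multiplying inv by
    sigma^2 gives Var(beta-hat).
    """
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx                # determinant of X'WX
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    inv = [[swxx / det, -swx / det], [-swx / det, sw / det]]
    return (b0, b1), inv

# Toy data: the first three points lie on y = 1 + 2x, the fourth does not.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 20.0]
w = [1.0, 1.0, 1.0, 0.0]    # zero weight drops the fourth point

(b0, b1), inv = wls_line(x, y, w)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```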

Robust Regression

Used to reduce influence of outliers

LAR Regression: minimize $L_1 = \sum |y_i - \hat{y}_i| = \sum |e_i|$

LMS Regression: minimize $\text{median}\{[y_i - \hat{y}_i]^2\} = \text{median}\{e_i^2\}$

M estimators: minimize $\sum g(y_i - \hat{y}_i) = \sum g(e_i)$, g a function of residuals

Robust Regression (continued)

IRLS, iteratively reweighted least squares
Minimize e'We
W is a diagonal matrix of weights, inversely proportional to magnitude of scaled residuals, ui
ui = ei/s, s = MAD = median{|ei - median(ei)|}

Procedure:
1. Obtain initial coefficient estimates from OLS
2. Obtain weights from scaled residuals
3. Obtain coefficient estimates from WLS
4. Return to 2.
Convergence usually rapid.
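The four-step procedure can be sketched in a few lines of Python for a straight-line model. Several choices here are assumptions rather than something the slide specifies: the Huber weight function with tuning constant 1.345, the factor 1.4826 applied to the MAD (as in the coefficient-table code for oil.rreg), a fixed number of iterations in place of a convergence test, and the made-up data with one gross outlier.

```python
import statistics

def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    return (swxx * swy - swx * swxy) / det, (sw * swxy - swx * swy) / det

def huber_weight(u, c=1.345):
    """Huber weight: 1 for small scaled residuals, c/|u| for large ones."""
    return 1.0 if abs(u) <= c else c / abs(u)

def irls_line(x, y, n_iter=20):
    w = [1.0] * len(x)
    b0, b1 = wls_line(x, y, w)                   # step 1: OLS start
    for _ in range(n_iter):
        e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = statistics.median(e)
        s = 1.4826 * statistics.median([abs(ei - med) for ei in e])
        if s == 0.0:
            break
        w = [huber_weight(ei / s) for ei in e]   # step 2: weights
        b0, b1 = wls_line(x, y, w)               # step 3: WLS; step 4: repeat
    return b0, b1, w

# Hypothetical data: y = 2 + 0.5x plus small noise, last point is an outlier.
x = list(range(10))
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, 10.0]
y = [2.0 + 0.5 * xi + ni for xi, ni in zip(x, noise)]

ols_b0, ols_b1 = wls_line(x, y, [1.0] * len(x))
rob_b0, rob_b1, weights = irls_line(x, y)
print(f"OLS slope = {ols_b1:.3f}, IRLS slope = {rob_b1:.3f}")
print(f"weight of the outlier = {weights[-1]:.3f}")
```

The outlier pulls the OLS slope well above 0.5, while the reweighted fit gives it a small weight and stays close to the bulk of the data.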

(See Figure 10.4, and Equations 10.44 and 10.45 in Neter et al. Applied Linear Statistical Models.)

Plot of residuals in oil.rreg (figure)

Plot of weights in robust regression for oil city data set (figure; points labeled by observation number)

Plot of sqrt(weights)*resid/s in oil.rreg (figure)

Coefficient table for oil.rreg

> x<-cbind(1,Market)
> beta<-solve(t(x)%*%diag(w)%*%x)%*%t(x)%*%diag(w)%*%Oil
> r<-Oil-x%*%beta
> s<- median(abs(r-median(r)))*1.4826
> covm<-solve(t(x)%*%diag(w)%*%x)*s^2
> se<-sqrt(diag(covm))
> tvalue=beta/se
> prob<-2*(1-pt(abs(tvalue),127))
> cbind(beta,se,tvalue,prob)
                   beta         se    tvalue         prob
(Intercept) -0.06779903 0.02451469 -2.765649 0.0065285939
x            0.89895511 0.24902845  3.609849 0.0004394276

Covariance matrix is approximate.

Plots of fitted regression lines for oil city data (figure; x-axis: Market; points labeled by observation number; lines: oil.lm, oil.lm94, oil.rreg)

Least Trimmed Squares Regression

Minimizes the sum of the q smallest squared residuals, $\sum_{i=1}^{q} e_{(i)}^2$, where q is chosen to be between n/2 and n.

Based on a genetic algorithm for finding a subset of data with minimum SSE.

High breakdown point: fits the bulk of the data well, even if bulk is only a little more than half the data.

Resulting weights are 1 or 0
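The trimmed criterion is easy to write down even though minimizing it is hard. The Python sketch below evaluates the objective for a given line and, as a stand-in for the genetic algorithm mentioned above, searches only over lines through pairs of data points. That brute-force search is a simplification for illustration, not the method used by S-Plus, and the data are made up.

```python
from itertools import combinations

def lts_objective(b0, b1, x, y, q):
    """Sum of the q smallest squared residuals for the line b0 + b1*x."""
    sq = sorted((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sum(sq[:q])

def lts_line_bruteforce(x, y, q):
    """Crude search: try every line through two data points and keep the
    one with the smallest trimmed sum of squares."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        obj = lts_objective(b0, b1, x, y, q)
        if best is None or obj < best[0]:
            best = (obj, b0, b1)
    return best

# Five points on y = 1 + 2x and one gross outlier; q between n/2 and n.
x = [0, 1, 2, 3, 4, 5]
y = [1, 3, 5, 7, 9, 30]
obj, b0, b1 = lts_line_bruteforce(x, y, q=4)
print(f"objective = {obj}, b0 = {b0}, b1 = {b1}")
```

Because the objective ignores the largest residuals, the outlier has no effect on the selected line here.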


> summary(oil.lts)
Method:
[1] "Least Trimmed Squares Robust Regression."

Call:
ltsreg(formula = Oil ~ Market)

Coefficients:
 Intercept  Market
   -0.0864  0.7907

Scale estimate of residuals: 0.1468

Robust Multiple R-Squared: 0.09863

Total number of observations: 129

Number of observations that determine the LTS estimate: 116

Residuals:
   Min. 1st Qu. Median 3rd Qu.  Max.
 -0.454  -0.088  0.032   0.097 5.223

Weights:
  0   1
 10 119

Single Factor ANOVA Models

Corresponds to Chapter 12 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University).

Chapter 8: How to compare two treatments

Chapter 12: How to compare more than two treatments (or just two).

Example: yields of several varieties of barley. Variety is the treatment factor (predictor). Yield is the response.

Experimental Designs

S-Plus barley data set (observations 13:30)

> barley.small
      yield  variety year            site
13 35.13333 Svansota 1931 University Farm
14 47.33333 Svansota 1931          Waseca
15 25.76667 Svansota 1931          Morris
16 40.46667 Svansota 1931       Crookston
17 29.66667 Svansota 1931    Grand Rapids
18 25.70000 Svansota 1931          Duluth
19 39.90000   Velvet 1931 University Farm
20 50.23333   Velvet 1931          Waseca
21 26.13333   Velvet 1931          Morris
22 41.33333   Velvet 1931       Crookston
23 23.03333   Velvet 1931    Grand Rapids
24 26.30000   Velvet 1931          Duluth
25 36.56666    Trebi 1931 University Farm
26 63.83330    Trebi 1931          Waseca
27 43.76667    Trebi 1931          Morris
28 46.93333    Trebi 1931       Crookston
29 29.76667    Trebi 1931    Grand Rapids
30 33.93333    Trebi 1931          Duluth

Completely Randomized Design Notation

If the sample sizes are equal the design is balanced; otherwise the design is unbalanced

See Table 12.1, page 458 in the course textbook.

S-Plus barley dataset (observations 13:30)

Variety        Svansota   Velvet    Trebi
               35.13333 39.90000 36.56666
               47.33333 50.23333 63.83330
               25.76667 26.13333 43.76667
               40.46667 41.33333 46.93333
               29.66667 23.03333 29.76667
               25.70000 26.30000 33.93333

Variety Mean   34.01111 34.48889 42.46666

Plot of yield by variety for S-Plus barley data set (figure; x-axis: barley.small$variety with levels Svansota, Velvet, Trebi)

S-plus plot.design function (figure; factor: variety)

CRD: Model and Estimation (cell means model)

See Section 12.1.1 and Figure 12.2 on page 460 of the course textbook.

CRD: Treatment Effects Model

Alternative Formulation of the Model:

Formula from 12.1.1, page 460 in the course textbook.

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij} \quad (i = 1, 2, \ldots, a;\ j = 1, 2, \ldots, n_i)$$

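CRD parameter estimates

$\mu$, grand mean, estimated by $\bar{y} = (1'y)/n$
$\mu_i$, mean of i-th treatment, estimated by $\bar{y}_i = (1'y_i)/n_i$
$\hat{y}$ = vector of fitted values (treatment means)
$e$ = error = $y - \hat{y}$
$\sigma^2$ estimated by $s^2 = e'e/(n - a)$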

Fitted values and residuals for barley example

> cbind(barley.small[,1:2],fitted(tmp),resid(tmp))
      yield  variety   fitted      resid
13 35.13333 Svansota 34.01111   1.122218
14 47.33333 Svansota 34.01111  13.322218
15 25.76667 Svansota 34.01111  -8.244442
16 40.46667 Svansota 34.01111   6.455558
17 29.66667 Svansota 34.01111  -4.344442
18 25.70000 Svansota 34.01111  -8.311112
19 39.90000   Velvet 34.48889   5.411113
20 50.23333   Velvet 34.48889  15.744443
21 26.13333   Velvet 34.48889  -8.355557
22 41.33333   Velvet 34.48889   6.844443
23 23.03333   Velvet 34.48889 -11.455557
24 26.30000   Velvet 34.48889  -8.188887
25 36.56666    Trebi 42.46666  -5.900000
26 63.83330    Trebi 42.46666  21.366640
27 43.76667    Trebi 42.46666   1.300010
28 46.93333    Trebi 42.46666   4.466670
29 29.76667    Trebi 42.46666 -12.699990
30 33.93333    Trebi 42.46666  -8.533330

X matrix?

1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1

Model.matrix in S-Plus

> round(model.matrix(barley.small.aov),3)
   (Intercept) variety.L variety.Q
13           1    -0.707     0.408
14           1    -0.707     0.408
15           1    -0.707     0.408
16           1    -0.707     0.408
17           1    -0.707     0.408
18           1    -0.707     0.408
19           1     0.000    -0.816
20           1     0.000    -0.816
21           1     0.000    -0.816
22           1     0.000    -0.816
23           1     0.000    -0.816
24           1     0.000    -0.816
25           1     0.707     0.408
26           1     0.707     0.408
27           1     0.707     0.408
28           1     0.707     0.408
29           1     0.707     0.408
30           1     0.707     0.408

Model Coefficients

> summary.lm(barley.small.aov)

Call: aov(formula = yield ~ variety, data = barley.small)
Residuals:
   Min     1Q Median    3Q   Max
 -12.7 -8.294 -1.611 6.194 21.37

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 36.9889     2.5207 14.6741   0.0000
variety.L    5.9790     4.3660  1.3695   0.1910
variety.Q    3.0619     4.3660  0.7013   0.4939

Residual standard error: 10.69 on 15 degrees of freedom
Multiple R-Squared: 0.1363
F-statistic: 1.184 on 2 and 15 degrees of freedom, the p-value is 0.3332

Correlation of Coefficients:
          (Intercept) variety.L
variety.L 0
variety.Q 0           0

S-plus model.tables command gives treatment means or effects

> model.tables(barley.small.aov,type="mean")
Warning messages:
Model was refit to allow projection in: model.tables(tmp, type = "mean")

Tables of means
Grand mean
 36.989

variety
 Svansota Velvet  Trebi
   34.011 34.489 42.467

> model.tables(barley.small.aov)
Warning messages:
Model was refit to allow projection in: model.tables(barley.small.aov)

Tables of effects

variety
 Svansota  Velvet  Trebi
  -2.9778 -2.5000 5.4778

Analysis of Variance (ANOVA)

Homogeneity Hypothesis:

$H_0: \mu_1 = \mu_2 = \cdots = \mu_a$ vs. $H_1$: Not all the $\mu_i$ are equal.
$H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$ vs. $H_1$: At least some $\tau_i \neq 0$.

Note SSR=SSA=Treatment sums of squares

Variation Source   Sum of Squares                                      Degrees of Freedom   Mean Square           F
Treatments (A)     $SSA = \sum n_i (\bar{y}_i - \bar{y})^2$             $a - 1$              $MSA = SSA/(a - 1)$   $MSA/MSE$
Error (E)          $SSE = \sum\sum (y_{ij} - \bar{y}_i)^2$              $N - a$              $MSE = SSE/(N - a)$
Total (T)          $SST = \sum\sum (y_{ij} - \bar{y})^2$                $N - 1$

ANOVA table for model with 3 varieties of barley, year 1

> summary(aov(yield~variety,barley.small))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
variety    2   270.739 135.3694 1.183614 0.3332005
Residuals 15  1715.544 114.3696
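The entries of this table follow directly from the sums of squares defined in the ANOVA layout. The Python sketch below recomputes them from the 18 yields listed for Svansota, Velvet and Trebi; the function name is illustrative, and the p-value column is not computed.

```python
# Yields for observations 13:30 of the barley data, copied from the slides.
yields = {
    "Svansota": [35.13333, 47.33333, 25.76667, 40.46667, 29.66667, 25.70000],
    "Velvet":   [39.90000, 50.23333, 26.13333, 41.33333, 23.03333, 26.30000],
    "Trebi":    [36.56666, 63.83330, 43.76667, 46.93333, 29.76667, 33.93333],
}

def one_way_anova(groups):
    """Return (SSA, SSE, MSA, MSE, F) for a completely randomized design."""
    all_y = [v for g in groups.values() for v in g]
    N, a = len(all_y), len(groups)
    grand = sum(all_y) / N
    means = {k: sum(g) / len(g) for k, g in groups.items()}
    # Treatment sum of squares: n_i * (treatment mean - grand mean)^2
    ssa = sum(len(g) * (means[k] - grand) ** 2 for k, g in groups.items())
    # Error sum of squares: deviations from the treatment means
    sse = sum((v - means[k]) ** 2 for k, g in groups.items() for v in g)
    msa = ssa / (a - 1)
    mse = sse / (N - a)
    return ssa, sse, msa, mse, msa / mse

ssa, sse, msa, mse, f_value = one_way_anova(yields)
print(f"SSA = {ssa:.3f}, SSE = {sse:.3f}")
print(f"MSA = {msa:.4f}, MSE = {mse:.4f}, F = {f_value:.6f}")
```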

ANOVA table for model with all 10 varieties of barley, year 1

> summary(aov(yield~variety,barley1))
          Df Sum of Sq  Mean Sq   F Value    Pr(F)
variety    9   646.262  71.8069 0.5963671 0.793823
Residuals 50  6020.357 120.4071

F-statistic for One-way ANOVA

$$F = \frac{MSA}{MSE} \sim F_{a-1,\,n-a}$$

Fitting model with continuous vs. character predictor

> summary(aov(barley.small$yield~varnum))
          Df Sum of Sq  Mean Sq F Value     Pr(F)
varnum     1   214.489 214.4889 1.93692 0.1830502
Residuals 16  1771.794 110.7371

> summary(aov(barley.small$yield~as.factor(varnum)))
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
as.factor(varnum)  2   270.739 135.3694 1.183614 0.3332005
Residuals         15  1715.544 114.3696

Equivalence of T test and ANOVA for model with single factor with 2 levels

> t.test(y[1:6],y[7:12])

data: y[1:6] and y[7:12]
t = -1.194, df = 10, p-value = 0.26
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -22.864726 6.909179
sample estimates:
 mean of x mean of y
  34.48889  42.46666

> summary(aov(yield~variety,barley.vsmall))
Df Sum of Sq Mean Sq F Value Pr(F)
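The equivalence can be checked numerically: with two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic. The Python sketch below assumes the two samples in the t.test call are the Velvet and Trebi yields (their means match the sample estimates shown) and that the test pools the variances (consistent with df = 10).

```python
import math

velvet = [39.90000, 50.23333, 26.13333, 41.33333, 23.03333, 26.30000]
trebi = [36.56666, 63.83330, 43.76667, 46.93333, 29.76667, 33.93333]

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1.0 / nx + 1.0 / ny))

def anova_f(x, y):
    """One-way ANOVA F statistic for a factor with two levels."""
    n = len(x) + len(y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    grand = (sum(x) + sum(y)) / n
    ssa = len(x) * (mx - grand) ** 2 + len(y) * (my - grand) ** 2
    sse = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    return (ssa / 1.0) / (sse / (n - 2))

t_stat = pooled_t(velvet, trebi)
f_stat = anova_f(velvet, trebi)
print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```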

Model Diagnostics, residual vs. fitted value (all 10 varieties, year 1) (figure; x-axis: fitted(barley1.aov))

Model Diagnostics, residual vs. observation number (all 10 varieties, year 1) (figure)

Model Diagnostics, normal plot of residuals (all 10 varieties, year 1) (figure; x-axis: Quantiles of Standard Normal)

Model Diagnostics, histogram of residuals (all 10 varieties, year 1) (figure; x-axis: resid(barley1.aov))

Random Effects Model for a One-way Layout

When the treatment levels are determined by the experimenter (or those are the only levels of interest), the design is a fixed effects model.

• Goal is to measure the treatment effects or means (“pick the winner”).

When the treatment levels are a random sample from a population of possible treatment levels (e.g. workers in a factory) and the particular levels used in the experiment are not of any interest, the design is a random effects model.

• Goal is to measure the treatment variability (estimate the expected variability among workers).

Random Effects Model for a One-way Layout

Model: Yij = µi + εij = µ + τi + εij (looks similar to the fixed effects model), where

εij ~ N(0, σ²)
µi ~ N(µ, σA²) or τi ~ N(0, σA²) (constants in fixed effects model)

Var(Yij) = Var(µi) + Var(εij) = σA² + σ²

σA² = variance among, σ² = variance within

With balanced one-way layout, n observations per treatment:

$$E(MSE) = \sigma^2, \quad E(MSA) = \sigma^2 + n\sigma_A^2$$

Can estimate σA² as (MSA-MSE)/n (if you are lucky!)
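The "if you are lucky" remark refers to the fact that (MSA-MSE)/n can come out negative. A short Python sketch, using the mean squares from the one-way ANOVA of all 10 barley varieties (n = 6 sites per variety) and treating variety as random purely for illustration:

```python
def variance_components(msa, mse, n):
    """Method-of-moments estimates for a balanced one-way random effects
    model: sigma^2 (within) and sigma_A^2 (among)."""
    sigma2_hat = mse
    sigma2_a_hat = (msa - mse) / n
    return sigma2_hat, sigma2_a_hat

# Mean squares from the 10-variety barley ANOVA table; n = 6 per variety.
sigma2_hat, sigma2_a_hat = variance_components(msa=71.8069, mse=120.4071, n=6)
print(f"estimate of sigma^2   = {sigma2_hat:.4f}")
print(f"estimate of sigma_A^2 = {sigma2_a_hat:.4f}")
```

Here MSA is smaller than MSE, so the moment estimate of the among-variety variance is negative, which is not a valid value for a variance.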

Randomized Block Design

See Figure 3.2 on page 99 of the course textbook.

Barley Example

10 varieties, 6 sites

> ym
                  University Farm   Waseca   Morris Crookston Grand Rapids   Duluth Variety Mean
Manchuria                27.00000 48.86667 27.43334  39.93333     32.96667 28.96667     34.19445
Glabron                  43.06666 55.20000 28.76667  38.13333     29.13333 29.66667     37.32778
Svansota                 35.13333 47.33333 25.76667  40.46667     29.66667 25.70000     34.01111
Velvet                   39.90000 50.23333 26.13333  41.33333     23.03333 26.30000     34.48889
Trebi                    36.56666 63.83330 43.76667  46.93333     29.76667 33.93333     42.46666
No. 457                  43.26667 58.10000 28.70000  45.66667     32.16667 33.60000     40.25000
No. 462                  36.60000 65.76670 30.36667  48.56666     24.93334 28.10000     39.05556
Peatland                 32.76667 48.56666 29.86667  41.60000     34.70000 32.00000     36.58333
No. 475                  24.66667 46.76667 22.60000  44.10000     19.70000 33.06666     31.81667
Wisconsin No. 38         39.30000 58.80000 29.46667  49.86667     34.46667 31.60000     40.58333
Site Mean                35.82667 54.34667 29.28667  43.66000     29.05334 30.29333     37.07778

Randomized Block Design (RBD) Method

$$Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij} \quad (i = 1, \ldots, a;\ j = 1, \ldots, b)$$

a-1 independent treatment effects

b-1 independent block effects

For more information, see 12.4, page 482 in course textbook.

No Interactions Between Treatments and Blocks

$$\mu_{ij} - \mu_{i'j} = (\mu + \tau_i + \beta_j) - (\mu + \tau_{i'} + \beta_j) = \tau_i - \tau_{i'}$$

Formula from page 483 in the course textbook.

RBD: Sums of Squares

See formulas 12.17, 12.18, and 12.19 on pages 484-5 in the course textbook.

ANOVA tables for models for barley data set

> summary(aov(yield~variety,barley1))
Df Sum of Sq Mean Sq F Value Pr(F)

> summary(aov(yield~variety+site,barley1))
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
variety    9   646.262   71.807  3.67995 0.001612103
site       5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513
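The RBD sums of squares can be reproduced from the 10 x 6 table of yields (varieties as treatments, sites as blocks). The Python sketch below copies the yields from the table of varieties by sites and computes the treatment, block and error sums of squares; variable names are illustrative, and p-values are not computed.

```python
# Rows: varieties; columns: University Farm, Waseca, Morris, Crookston,
# Grand Rapids, Duluth (year 1 yields copied from the slides).
barley = {
    "Manchuria":        [27.00000, 48.86667, 27.43334, 39.93333, 32.96667, 28.96667],
    "Glabron":          [43.06666, 55.20000, 28.76667, 38.13333, 29.13333, 29.66667],
    "Svansota":         [35.13333, 47.33333, 25.76667, 40.46667, 29.66667, 25.70000],
    "Velvet":           [39.90000, 50.23333, 26.13333, 41.33333, 23.03333, 26.30000],
    "Trebi":            [36.56666, 63.83330, 43.76667, 46.93333, 29.76667, 33.93333],
    "No. 457":          [43.26667, 58.10000, 28.70000, 45.66667, 32.16667, 33.60000],
    "No. 462":          [36.60000, 65.76670, 30.36667, 48.56666, 24.93334, 28.10000],
    "Peatland":         [32.76667, 48.56666, 29.86667, 41.60000, 34.70000, 32.00000],
    "No. 475":          [24.66667, 46.76667, 22.60000, 44.10000, 19.70000, 33.06666],
    "Wisconsin No. 38": [39.30000, 58.80000, 29.46667, 49.86667, 34.46667, 31.60000],
}

rows = list(barley.values())
a, b = len(rows), len(rows[0])             # a treatments, b blocks
grand = sum(sum(r) for r in rows) / (a * b)
trt_means = [sum(r) / b for r in rows]
blk_means = [sum(r[j] for r in rows) / a for j in range(b)]

ss_trt = b * sum((m - grand) ** 2 for m in trt_means)
ss_blk = a * sum((m - grand) ** 2 for m in blk_means)
ss_tot = sum((y - grand) ** 2 for r in rows for y in r)
ss_err = ss_tot - ss_trt - ss_blk          # no-interaction (additive) model

ms_err = ss_err / ((a - 1) * (b - 1))
f_trt = (ss_trt / (a - 1)) / ms_err
f_blk = (ss_blk / (b - 1)) / ms_err
print(f"variety: SS = {ss_trt:.3f}, F = {f_trt:.4f}")
print(f"site:    SS = {ss_blk:.3f}, F = {f_blk:.4f}")
print(f"error:   SS = {ss_err:.3f}, MS = {ms_err:.3f}")
```

Removing the site (block) variation from the error term is what turns the variety F statistic from about 0.60 in the one-way analysis into about 3.68 here.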

Type 1 and Type 3 Sums of Squares for barley example (balanced design)

> summary(barley12.aov)
          Df Sum of Sq  Mean Sq  F Value       Pr(F)
variety    9   646.262   71.807  3.67995 0.001612103
site       5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513

> summary(barley12.aov,ssType=3)
Type III Sum of Squares
site       5  5142.272 1028.454 52.70610 0.000000000
Residuals 45   878.085   19.513

Degrees of Freedom

Effects in barley model

> model.tables(barley12.aov,type="effects")
Warning messages:
Model was refit to allow projection in: model.tables(barley12.aov, type = "effects")

Tables of effects

variety
  Svanso No. 462   Manch No. 475  Velvet  Peatla Glabron No. 457 Wisc No. 38  Trebi
 -3.0667  1.9778 -2.8833 -5.2611 -2.5889 -0.4944  0.2500  3.1722      3.5056 5.3889

site
 Grand Rapids Duluth University Farm Morris Crookston Waseca
       -8.024 -6.784          -1.251 -7.791     6.582 17.269

Analysis of Multifactor Experiments

Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford (Johns Hopkins University)

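Model and estimates

$$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}$$

$$\hat{\mu} = \bar{y}_{...}$$

$$\hat{\tau}_i = \bar{y}_{i..} - \bar{y}_{...}, \quad \hat{\beta}_j = \bar{y}_{.j.} - \bar{y}_{...}$$

$$\widehat{(\tau\beta)}_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}$$

$$\hat{y}_{ijk} = \bar{y}_{ij.}$$

$$e_{ijk} = y_{ijk} - \hat{y}_{ijk} = y_{ijk} - \bar{y}_{ij.}$$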

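For any model

$y$ = vector of observed response values
$\hat{y}$ = vector of fitted values
$\bar{y}$ = vector of grand mean

SST = SSTotal = $(y - \bar{y})'(y - \bar{y})$
SSM = SSModel = $(\hat{y} - \bar{y})'(\hat{y} - \bar{y})$
SSE = SSError = $(y - \hat{y})'(y - \hat{y})$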

• Biochemical Reactions of Cells Treated with Puromycin
• SUMMARY:
• The "Balanced" Puromycin data frame has 24 rows representing the measurement of initial velocity of a biochemical reaction for 6 different concentrations of substrate and two different cell treatments. This data frame contains the following variables (columns):
• ARGUMENTS:
• conc
  – the concentration of the substrate.
• vel
  – the initial velocity of the reaction.
• state
  – a factor telling whether the cells involved were treated or untreated.

Scatterplot matrix for puromycin data set (figure)

plot.factor(conc,vel) (figure; x-axis: f(conc) with levels 0.02, 0.06, 0.11, 0.22, 0.56, 1.1)

plot.factor(state,vel) (figure; levels: untreated, treated)

Velocity in "Balanced" puromycin data set

conc   treated    untreated
0.02    76   47    67   51
0.06    97  107    84   86
0.11   123  139    98  115
0.22   159  152   131  124
0.56   191  201   144  158
1.10   207  200   160  162

Histogram of velocity (figure)

interaction.plot(pyb$state,pyb$conc,pyb$vel) (figure; x-axis: pyb$state; one line per level of pyb$conc)

interaction.plot(pyb$conc,pyb$state,pyb$vel) (figure; x-axis: pyb$conc; one line per level of pyb$state)

Summaries of puromycin model

Residuals:
   Min 1Q      Median 3Q  Max
 -14.5 -5 -4.441e-016  5 14.5

Residual standard error: 9.559 on 12 degrees of freedom
Multiple R-Squared: 0.9784
F-statistic: 49.5 on 11 and 12 degrees of freedom, the p-value is 2.919e-008

           Df Sum of Sq  Mean Sq  F Value      Pr(F)
state       1   4240.04 4240.042 46.40264 0.00001871
conc        5  44243.71 8848.742 96.83985 0.00000000
state:conc  5   1270.71  254.142  2.78130 0.06803651
Residuals  12   1096.50   91.375
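For a balanced two-factor layout with replication, the sums of squares in this table follow from the marginal and cell means. The Python sketch below recomputes them from the 24 velocities in the "Balanced" puromycin table; variable names are illustrative, and p-values are not computed.

```python
conc = [0.02, 0.06, 0.11, 0.22, 0.56, 1.10]
# Two replicate velocities per (state, conc) cell, copied from the slides.
vel = {
    "treated":   [(76, 47), (97, 107), (123, 139), (159, 152), (191, 201), (207, 200)],
    "untreated": [(67, 51), (84, 86), (98, 115), (131, 124), (144, 158), (160, 162)],
}

def mean(values):
    return sum(values) / len(values)

n_rep = 2
all_obs = [y for cells in vel.values() for cell in cells for y in cell]
grand = mean(all_obs)
state_means = {s: mean([y for cell in cells for y in cell]) for s, cells in vel.items()}
conc_means = [mean([y for s in vel for y in vel[s][k]]) for k in range(len(conc))]
cell_means = {s: [mean(cell) for cell in cells] for s, cells in vel.items()}

ss_state = len(conc) * n_rep * sum((m - grand) ** 2 for m in state_means.values())
ss_conc = len(vel) * n_rep * sum((m - grand) ** 2 for m in conc_means)
ss_cells = n_rep * sum((m - grand) ** 2 for ms in cell_means.values() for m in ms)
ss_int = ss_cells - ss_state - ss_conc      # interaction sum of squares
ss_err = sum((y - cell_means[s][k]) ** 2
             for s in vel for k in range(len(conc)) for y in vel[s][k])

df_int = (len(vel) - 1) * (len(conc) - 1)
df_err = len(all_obs) - len(vel) * len(conc)
ms_err = ss_err / df_err
f_int = (ss_int / df_int) / ms_err
print(f"state: {ss_state:.2f}  conc: {ss_conc:.2f}  "
      f"state:conc: {ss_int:.2f}  residual: {ss_err:.2f}")
print(f"F for interaction = {f_int:.4f}")
```

With the interaction in the model, the fitted values are simply the cell means, which is why each pair of replicates shares one fitted value.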

Observed velocity and fitted values for puromycin model with interaction

       Observed                Fitted Values
conc   treated    untreated    treated         untreated
0.02    76   47    67   51      61.5   61.5     59.0   59.0
0.06    97  107    84   86     102.0  102.0     85.0   85.0
0.11   123  139    98  115     131.0  131.0    106.5  106.5
0.22   159  152   131  124     155.5  155.5    127.5  127.5
0.56   191  201   144  158     196.0  196.0    151.0  151.0
1.10   207  200   160  162     203.5  203.5    161.0  161.0

model.tables

Tables of means
Grand mean
 128.29

state
 untreated treated
    115.00  141.58

conc
  0.02  0.06   0.11   0.22   0.56    1.1
 60.25 93.50 118.75 141.50 173.50 182.25

state:conc
Dim 1 : state
Dim 2 : conc
          0.02  0.06  0.11  0.22  0.56   1.1
untreated 59.0  85.0 106.5 127.5 151.0 161.0
treated   61.5 102.0 131.0 155.5 196.0 203.5

multicomp(pyb.aov,focus="concf")

95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method

critical point: 3.3595
response variable: vel

intervals excluding 0 are flagged by '****'

          Estimate Std.Error Lower Bound Upper Bound
0.02-0.06   -33.20      6.76       -56.0    -10.5000 ****
0.02-0.11   -58.50      6.76       -81.2    -35.8000 ****
0.02-0.22   -81.20      6.76      -104.0    -58.5000 ****
0.02-0.56  -113.00      6.76      -136.0    -90.5000 ****
0.02-1.1   -122.00      6.76      -145.0    -99.3000 ****
0.06-0.11   -25.30      6.76       -48.0     -2.5400 ****
0.06-0.22   -48.00      6.76       -70.7    -25.3000 ****
0.06-0.56   -80.00      6.76      -103.0    -57.3000 ****
0.06-1.1    -88.70      6.76      -111.0    -66.0000 ****
0.11-0.22   -22.70      6.76       -45.5     -0.0425 ****
0.11-0.56   -54.70      6.76       -77.5    -32.0000 ****
0.11-1.1    -63.50      6.76       -86.2    -40.8000 ****
0.22-0.56   -32.00      6.76       -54.7     -9.2900 ****
0.22-1.1    -40.70      6.76       -63.5    -18.0000 ****
0.56-1.1     -8.75      6.76       -31.5     14.0000
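Each interval in this output has the form estimate ± (critical point) x (standard error). The Python sketch below rebuilds two of the intervals from quantities shown elsewhere in the slides: the concentration means from model.tables, the residual mean square 91.375 from the model with interaction, and 4 observations per concentration level. The critical point 3.3595 is taken from the output as given rather than computed.

```python
import math

mse = 91.375          # residual mean square from the model with interaction
n_per_level = 4       # observations at each concentration
crit = 3.3595         # critical point reported by multicomp

# Concentration means from the model.tables output.
conc_means = {"0.02": 60.25, "0.06": 93.50, "0.56": 173.50, "1.1": 182.25}

# Standard error of a difference between two level means.
se = math.sqrt(mse * 2.0 / n_per_level)

def tukey_interval(level_a, level_b):
    est = conc_means[level_a] - conc_means[level_b]
    return est, est - crit * se, est + crit * se

est1, lower1, upper1 = tukey_interval("0.02", "0.06")
est2, lower2, upper2 = tukey_interval("0.56", "1.1")
print(f"Std.Error = {se:.2f}")
print(f"0.02-0.06: {est1:.2f} ({lower1:.1f}, {upper1:.1f})")
print(f"0.56-1.1:  {est2:.2f} ({lower2:.1f}, {upper2:.1f})")
```

Only the last comparison (0.56 vs. 1.1) gives an interval that includes 0, matching the single unflagged row.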

Residual vs. fit for puromycin model

fitted(pyb.aov)

qqplot of residuals for puromycin model

Summaries of puromycin model without interaction

Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04

Residual standard error: 11.8 on 17 degrees of freedom
Multiple R-Squared: 0.9534
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010

          Df Sum of Sq  Mean Sq  F Value         Pr(F)
conc       5  44243.71 8848.742 63.54684 0.00000000021
state      1   4240.04 4240.042 30.44967 0.00003762498
Residuals 17   2367.21  139.248

Observed velocity and fitted values for puromycin model without interaction

       Observed                Fitted
conc   treated    untreated    treated             untreated
0.02    76   47    67   51      73.542   73.542     46.958   46.958
0.06    97  107    84   86     106.792  106.792     80.208   80.208
0.11   123  139    98  115     132.042  132.042    105.458  105.458
0.22   159  152   131  124     154.792  154.792    128.208  128.208
0.56   191  201   144  158     186.792  186.792    160.208  160.208
1.10   207  200   160  162     195.542  195.542    168.958  168.958

Plot of residual vs. fit for puromycin model without interaction

Fitted : conc + state

Plot of velocity vs. concentration

Call: aov(formula = vel ~ conc + conc^2 + state)
Residuals:
   Min    1Q Median    3Q   Max
 -45.4 -6.93  4.227 7.902 23.94

Coefficients:
                Value Std. Error t value Pr(>|t|)
(Intercept)   73.0885     6.0136 12.1539   0.0000
conc         304.9581    37.3027  8.1752   0.0000
I(conc^2)   -188.9327    32.5953 -5.7963   0.0000
state         13.2917     3.4172  3.8897   0.0009

Residual standard error: 16.74 on 20 degrees of freedom
Multiple R-Squared: 0.8898
F-statistic: 53.82 on 3 and 20 degrees of freedom, the p-value is 9.291e-010

> summary(pyb2.aov)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
conc       1  31590.27 31590.27 112.7215 0.0000000011
I(conc^2)  1   9415.64  9415.64  33.5972 0.0000113551
state      1   4240.04  4240.04  15.1295 0.0009104989
Residuals 20   5605.01   280.25

Plot of residual vs. fit for pyb2.aov

Fitted : conc + conc^2 + state

qqplot of residuals for pyb2.aov

Call: aov(formula = vel ~ conc + conc^2 + conc^3 + conc^4 + conc^5 + state)

Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04

Residual standard error: 11.8 on 17 degrees of freedom
Multiple R-Squared: 0.9534
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010

> summary(pyb5.aov)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
conc       1  31590.27 31590.27 226.8641 0.0000000
I(conc^2)  1   9415.64  9415.64  67.6180 0.0000003
I(conc^3)  1   2603.71  2603.71  18.6984 0.0004604
I(conc^4)  1    631.13   631.13   4.5324 0.0481759
I(conc^5)  1      2.96     2.96   0.0213 0.8857934
state      1   4240.04  4240.04  30.4497 0.0000376
Residuals 17   2367.21   139.25

Plot of residual vs. fit for pyb5.aov

Fitted : conc + conc^2 + conc^3 + conc^4 + conc^5 + state

Guayule data set

• Rate of Germination of Treated Guayule Seeds
• SUMMARY:
• The guayule data frame, a design object, has 96 rows and 5 columns. The guayule is a Mexican plant from which rubber is manufactured. Batches of 100 seeds of eight varieties (variety) of guayule were given one of four treatments (treatment), and planted; the number of plants that came up in each batch (plants) was recorded.
• ARGUMENTS:
• variety
  – factor with levels V1 through V8 labeling the variety of guayule.
• treatment
  – factor with levels T1 through T4 labeling the treatment given to the seeds.
• plants
  – numeric vector giving the number of seeds out of a batch of 100 that germinated.

pairs(gy) (figure; panels: variety, treatment, plants)

plot.factor(gy$variety,gy$plants) (figure; x-axis: gy$variety, levels V1 to V8)

plot.factor(gy$treatment,gy$plants) (figure; x-axis: gy$treatment, levels T1 to T4)

interaction.plot(gy$variety,gy$treatment,gy$plants) (figure; x-axis: gy$variety; one line per level of gy$treatment)

interaction.plot(gy$treatment,gy$variety,gy$plants) (figure; x-axis: gy$treatment; one line per level of gy$variety)

hist(gy$plants) (figure; x-axis: gy$plants)

Summaries of gy.aov

Call: aov(formula = plants ~ variety * treatment, data = gy)
Residuals:
    Min     1Q     Median   3Q Max
 -16.33 -2.667 1.494e-015 2.75  16

... is 0

> summary(gy.aov)
                  Df Sum of Sq  Mean Sq  F Value      Pr(F)
treatment          3  30774.28 10258.09 254.5959 0.00000000
variety:treatment 21   2620.14   124.77   3.0966 0.00026666
Residuals         64   2578.67    40.29

Plot of residual vs. fit for gy data set (figure; x-axis: Fitted : variety * treatment)

model.tables(gy.aov,type="mean")

Grand mean
 25.302

variety
     V1     V2     V3     V4     V5     V6     V7     V8
 24.667 26.833 28.833 21.000 21.917 28.167 23.250 27.750

treatment
     T1     T2     T3     T4
 55.833 13.917 20.042 11.417

variety:treatment
Dim 1 : variety
Dim 2 : treatment
       T1     T2     T3     T4
V1 66.333 11.667 12.333  8.333
V2 63.333 18.333 14.333 11.333
V3 65.000 12.667 26.333 11.333
V4 50.333 10.000 14.000  9.667
V5 49.333 16.333 10.333 11.667
V6 58.000  8.000 29.667 17.000
V7 46.333 14.667 22.000 10.000
V8 48.000 19.667 31.333 12.000

multicomp(gy.aov,focus="treatment")

95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method

critical point: 2.6378
response variable: plants

intervals excluding 0 are flagged by '****'

      Estimate Std.Error Lower Bound Upper Bound
T1-T2    41.90      1.83       37.10       46.80 ****
T1-T3    35.80      1.83       31.00       40.60 ****
T1-T4    44.40      1.83       39.60       49.30 ****
T2-T3    -6.12      1.83      -11.00       -1.29 ****
T2-T4     2.50      1.83       -2.33        7.33
T3-T4     8.62      1.83        3.79       13.50 ****

Guayule ANOVA with variety random

> gyr.tab
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
treatment          3  30774.28 10258.09 82.21711 0.0000000
variety            7    763.16   109.02  0.87380 0.5428964
treatment:variety 21   2620.14   124.77  3.09663 0.0002667
Residuals         64   2578.67    40.29
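The F values in this table differ from the fixed-effects analysis only in their denominators. The Python sketch below reproduces them from the rounded mean squares: in the table shown, treatment and variety are each divided by the treatment:variety mean square, while the interaction is divided by the residual mean square. Which denominator is appropriate for variety depends on the mixed-model convention used; the sketch simply mirrors the table.

```python
# Mean squares copied from the gyr.tab output (rounded as printed).
ms = {
    "treatment": 10258.09,
    "variety": 109.02,
    "treatment:variety": 124.77,
    "Residuals": 40.29,
}

# F ratios as they appear in the table with variety random.
f_random = {
    "treatment": ms["treatment"] / ms["treatment:variety"],
    "variety": ms["variety"] / ms["treatment:variety"],
    "treatment:variety": ms["treatment:variety"] / ms["Residuals"],
}

# For comparison: the fixed-effects F for treatment uses the residual MS.
f_treatment_fixed = ms["treatment"] / ms["Residuals"]

for term, f in f_random.items():
    print(f"{term:18s} F = {f:.4f}")
print(f"treatment (fixed effects analysis) F = {f_treatment_fixed:.4f}")
```

Small differences from the printed F values come from using mean squares rounded to two decimals.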

Random if:

• Not interested in those particular factor levels (e.g. batches)

• Levels of factor are randomly chosen from a larger population of factor levels (e.g. 10 universities selected from all universities in country).

• Want to generalize to a larger population of factor levels.

EMS for 2-factor models
(See Table 24.5 on page 981 of Neter et al. Applied Linear Statistical Models.)

Nested vs. Crossed Design
(See Figure 28.1 in Neter et al. Applied Linear Statistical Models.)

Nested Fixed Factors
(See Table 28.3 on page 1129 of Neter et al. Applied Linear Statistical Models.)

Nested Mixed Factors
(See Table 28.5 on page 1133 of Neter et al. Applied Linear Statistical Models.)

Cross-Nested Models
(See Table 28.11 on page 1151 of Neter et al. Applied Linear Statistical Models.)

Images of book covers:

Patrick O’Brian, The Commodore.

Patrick O’Brian, The Fortune of War.

Nested Factors

• Speed of Firing Naval Guns
• SUMMARY:
• The gun data frame, a design object, has 36 rows representing runs of a team of 3 men loading and firing naval guns attempting to get off as many rounds per minute as possible. The three predictor variables (columns) specify the team and the physique of the men on it and the loading method used; the outcome variable is the rounds fired per minute.
• ARGUMENTS:
• Method
  – factor giving one of two methods for loading rounds into Naval guns. Levels are M1 and M2.
• Physique
  – an ordered factor giving the physique of the men: S for slight, A for average, and H for heavy.
• Team
  – factor with levels T1, T2 or T3. In fact there are nine teams, three of each physique, i.e. a slight T1, an average T1, and a heavy T1, etc.
• Rounds
  – numeric vector giving the number of rounds per minute fired by a team.

gun
   Method Physique Team Rounds
 1     M1        S   T1   20.2
 2     M2        S   T1   14.2
 3     M1        A   T1   22.0
 4     M2        A   T1   14.1
 5     M1        H   T1   23.1
 6     M2        H   T1   14.1
 7     M1        S   T2   26.2
 8     M2        S   T2   18.0
 9     M1        A   T2   22.6
10     M2        A   T2   14.0
11     M1        H   T2   22.9
12     M2        H   T2   12.2
13     M1        S   T3   23.8
14     M2        S   T3   12.5
15     M1        A   T3   22.9
16     M2        A   T3   13.7
17     M1        H   T3   21.8
18     M2        H   T3   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2

gun
   Method Physique Team Rounds
 1     M1        S   T1   20.2
 2     M2        S   T1   14.2
 3     M1        A   T2   22.0
 4     M2        A   T2   14.1
 5     M1        H   T3   23.1
 6     M2        H   T3   14.1
 7     M1        S   T4   26.2
 8     M2        S   T4   18.0
 9     M1        A   T5   22.6
10     M2        A   T5   14.0
11     M1        H   T6   22.9
12     M2        H   T6   12.2
13     M1        S   T7   23.8
14     M2        S   T7   12.5
15     M1        A   T8   22.9
16     M2        A   T8   13.7
17     M1        H   T9   21.8
18     M2        H   T9   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2

Speed of firing of naval guns

           Slight            Average           Heavy
Method 1   T1: 20.2, 24.1    T2: 22.0, 23.5    T3: 23.1, 22.9
           T4: 26.2, 26.9    T5: 22.6, 24.6    T6: 22.9, 23.7
           T7: 23.8, 24.9    T8: 22.9, 25.0    T9: 21.8, 23.5
Method 2   T1: 14.2, 16.2    T2: 14.1, 16.1    T3: 14.1, 16.1
           T4: 18.0, 19.1    T5: 14.0, 18.1    T6: 12.2, 13.8
           T7: 12.5, 15.4    T8: 13.7, 16.0    T9: 12.7, 15.1

[Figure: pairs(gun2), scatterplot matrix with panels including method, physique and rounds]

[Figure: Method Effect, rounds by method]

[Figure: Physique Effect, rounds by physique]

[Figure: Team Effect, rounds by team (teams 1-9)]

[Figure: Method-Physique Interaction]

ANOVA tables for firing of naval guns example (with teams numbered 1-9)

> summary(aov(rounds~phys*meth*team))
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
     phys  2   16.0517   8.0258   3.4736 0.0529995
     meth  1  651.9511 651.9511 282.1621 0.0000000
     team  6   39.2583   6.5431   2.8318 0.0403140
phys:meth  2    1.1872   0.5936   0.2569 0.7762240
meth:team  6   10.7217   1.7869   0.7734 0.6009376
Residuals 18   41.5900   2.3106

> summary(aov(rounds~phys*meth*team%in%phys))
                      Df Sum of Sq  Mean Sq  F Value     Pr(F)
                 phys  2   16.0517   8.0258   3.4736 0.0529995
                 meth  1  651.9511 651.9511 282.1621 0.0000000
            phys:meth  2    1.1872   0.5936   0.2569 0.7762240
       team %in% phys  6   39.2583   6.5431   2.8318 0.0403140
meth:(team %in% phys)  6   10.7217   1.7869   0.7734 0.6009376
            Residuals 18   41.5900   2.3106

> model.tables(gunaov,type="mean")
Tables of means
Grand mean
 19.333

Method
     M1     M2
 23.589 15.078

Physique
      S      A      H
 20.125 19.383 18.492

Team %in% Physique
Dim 1 : Physique
Dim 2 : Team
      T1     T2     T3
S 18.675 22.550 19.150
A 18.925 19.825 19.400
H 19.050 18.150 18.275

The same tables with teams numbered 1-9 (rep = number of observations behind each mean):

Grand mean
 19.333

method
        M1     M2
    23.589 15.078
rep 18.000 18.000

physique
         S      A      H
    20.125 19.383 18.492
rep 12.000 12.000 12.000

team %in% physique
Dim 1 : physique
Dim 2 : team
         1      2      3      4      5      6      7      8      9
S   18.675                22.550                19.150
rep  4.000  0.000  0.000  4.000  0.000  0.000  4.000  0.000  0.000
A          18.925                19.825                19.400
rep  0.000  4.000  0.000  0.000  4.000  0.000  0.000  4.000  0.000
H                 19.050                18.150                18.275
rep  0.000  0.000  4.000  0.000  0.000  4.000  0.000  0.000  4.000

Summaries of firing of naval guns example (without interaction)

Call: aov(formula = Rounds ~ Method + Physique/Team, data = gun)

Residuals:
    Min      1Q     Median     3Q   Max
 -2.731 -0.7368 2.498e-016 0.9972 2.531

Residual standard error: 1.434 on 26 degrees of freedom
Multiple R-Squared: 0.9297
F-statistic: 38.19 on 9 and 26 degrees of freedom, the p-value is 9.602e-013

> summary(gunaov)
                   Df Sum of Sq  Mean Sq  F Value      Pr(F)
            Method  1  651.9511 651.9511 316.8426 0.00000000
          Physique  2   16.0517   8.0258   3.9005 0.03300457
Team %in% Physique  6   39.2583   6.5431   3.1799 0.01782181
         Residuals 26   53.4989   2.0576
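The sums of squares in this table can be cross-checked by hand. The following is a minimal sketch in Python (not S-Plus), assuming the 36 observations are exactly those in the "Speed of firing of naval guns" table with teams numbered T1-T9; the variable and function names are illustrative only. Because the design is balanced, each sum of squares is a weighted sum of squared deviations of group means, and the nested team term compares each team mean with the mean of its own physique group rather than with the grand mean.

```python
from collections import defaultdict

# Rounds per minute keyed by (method, physique, team); each cell holds the
# two replicate runs. Values are copied from the slide's data table.
gun = {
    ("M1", "S", "T1"): (20.2, 24.1), ("M1", "S", "T4"): (26.2, 26.9),
    ("M1", "S", "T7"): (23.8, 24.9), ("M1", "A", "T2"): (22.0, 23.5),
    ("M1", "A", "T5"): (22.6, 24.6), ("M1", "A", "T8"): (22.9, 25.0),
    ("M1", "H", "T3"): (23.1, 22.9), ("M1", "H", "T6"): (22.9, 23.7),
    ("M1", "H", "T9"): (21.8, 23.5), ("M2", "S", "T1"): (14.2, 16.2),
    ("M2", "S", "T4"): (18.0, 19.1), ("M2", "S", "T7"): (12.5, 15.4),
    ("M2", "A", "T2"): (14.1, 16.1), ("M2", "A", "T5"): (14.0, 18.1),
    ("M2", "A", "T8"): (13.7, 16.0), ("M2", "H", "T3"): (14.1, 16.1),
    ("M2", "H", "T6"): (12.2, 13.8), ("M2", "H", "T9"): (12.7, 15.1),
}


def mean(values):
    return sum(values) / len(values)


def group_stats(key):
    """Group the 36 runs by key(method, physique, team); return {group: (mean, n)}."""
    groups = defaultdict(list)
    for (m, p, t), runs in gun.items():
        groups[key(m, p, t)].extend(runs)
    return {g: (mean(v), len(v)) for g, v in groups.items()}


all_runs = [y for runs in gun.values() for y in runs]
grand = mean(all_runs)
by_method = group_stats(lambda m, p, t: m)
by_phys = group_stats(lambda m, p, t: p)
by_team = group_stats(lambda m, p, t: (p, t))

ss_method = sum(n * (mu - grand) ** 2 for mu, n in by_method.values())
ss_phys = sum(n * (mu - grand) ** 2 for mu, n in by_phys.values())
# Nested term: team means are measured from their own physique mean.
ss_team_in_phys = sum(
    n * (mu - by_phys[p][0]) ** 2 for (p, t), (mu, n) in by_team.items()
)
ss_total = sum((y - grand) ** 2 for y in all_runs)
ss_resid = ss_total - ss_method - ss_phys - ss_team_in_phys

print(round(ss_method, 4), round(ss_phys, 4),
      round(ss_team_in_phys, 4), round(ss_resid, 4))
```

The first three printed values should agree with the Method, Physique and Team %in% Physique rows above; the last is the residual sum of squares for the model without interaction.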

[Figure: Plot of residual vs. fit for gun.aov. Horizontal axis: Fitted : Method + Physique/Team]

2^k Factorial Designs

• Exploratory experimental studies.
• Multifactor experiment in which each factor is studied at two levels.
• Used to screen a large number of factors to identify the most important.
• Sometimes 2 levels occur naturally, e.g. present or absent, smoker or non-smoker.
• k factors => 2^k treatment combinations

2^k Factorial Design Example

Example: 13.19, page 553 of the course textbook.

[Figure: pairs(nw.df), scatterplot matrix of the coded factors (-1 to 1) and the response]

[Figure: hist(y)]

[Figure: Effect of a]

[Figure: Effect of b]

[Figure: Effect of c]

[Figure: interaction.plot(a,b,y)]

[Figure: interaction.plot(a,c,y)]

[Figure: interaction.plot(b,c,y)]

> summary.lm(nw.aov)
Call: aov(formula = y ~ a * b * c, data = nw.df)
Residuals:
    Min     1Q Median    3Q   Max
 -37.67 -6.861  2.388 12.67 28.67

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) 171.1942     4.6675  36.6780   0.0000
          a -17.6942     4.6675  -3.7909   0.0016
          b -76.5833     4.6675 -16.4078   0.0000
          c  13.3333     4.6675   2.8566   0.0114
        a:b -14.8050     4.6675  -3.1719   0.0059
        a:c  16.6667     4.6675   3.5708   0.0026
        b:c   4.9442     4.6675   1.0593   0.3052
      a:b:c -25.0558     4.6675  -5.3682   0.0001

Residual standard error: 22.87 on 16 degrees of freedom
Multiple R-Squared: 0.9556
F-statistic: 49.21 on 7 and 16 degrees of freedom, the p-value is 1.209e-009

Effect (of going from low to high level) is 2*regression coefficient

> model.matrix(nw.aov)
   (Intercept)  a  b  c a:b a:c b:c a:b:c
 1           1 -1 -1 -1   1   1   1    -1
 2           1  1 -1 -1  -1  -1   1     1
 3           1 -1  1 -1  -1   1  -1     1
 4           1 -1 -1  1   1  -1  -1     1
 5           1  1  1 -1   1  -1  -1    -1
 6           1  1 -1  1  -1   1  -1    -1
 7           1 -1  1  1  -1  -1   1    -1
 8           1  1  1  1   1   1   1     1
 9           1 -1 -1 -1   1   1   1    -1
10           1  1 -1 -1  -1  -1   1     1
11           1 -1  1 -1  -1   1  -1     1
12           1 -1 -1  1   1  -1  -1     1
13           1  1  1 -1   1  -1  -1    -1
14           1  1 -1  1  -1   1  -1    -1
15           1 -1  1  1  -1  -1   1    -1
16           1  1  1  1   1   1   1     1
17           1 -1 -1 -1   1   1   1    -1
18           1  1 -1 -1  -1  -1   1     1
19           1 -1  1 -1  -1   1  -1     1
20           1 -1 -1  1   1  -1  -1     1
21           1  1  1 -1   1  -1  -1    -1
22           1  1 -1  1  -1   1  -1    -1
23           1 -1  1  1  -1  -1   1    -1
24           1  1  1  1   1   1   1     1

X’X Matrix

> t(X)%*%X
            (Intercept)  a  b  c a:b a:c b:c a:b:c
(Intercept)          24  0  0  0   0   0   0     0
          a           0 24  0  0   0   0   0     0
          b           0  0 24  0   0   0   0     0
          c           0  0  0 24   0   0   0     0
        a:b           0  0  0  0  24   0   0     0
        a:c           0  0  0  0   0  24   0     0
        b:c           0  0  0  0   0   0  24     0
      a:b:c           0  0  0  0   0   0   0    24

n*(X’X)^-1 X’

> solve(t(X)%*%X)%*%t(X)*24
             1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
(Intercept)  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
          a -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1
          b -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1
          c -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1
        a:b  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1
        a:c  1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1
        b:c  1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1
      a:b:c -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1
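The two matrices above explain why the effect of a factor is twice its regression coefficient: the coded columns are orthogonal, so least squares reduces to averaging. The following is a minimal sketch in Python (not S-Plus) that rebuilds the 24-row model matrix and checks this. The response y is hypothetical, built from known coefficients for illustration; it is not the data from Example 13.19.

```python
# One replicate of the 2^3 design in coded units (-1 = low, +1 = high),
# in the same run order as the model matrix shown on the slide.
runs = [(-1, -1, -1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1),
        (1, 1, -1), (1, -1, 1), (-1, 1, 1), (1, 1, 1)]
names = ["(Intercept)", "a", "b", "c", "a:b", "a:c", "b:c", "a:b:c"]


def model_row(a, b, c):
    """Intercept, main effects, and all interaction columns for one run."""
    return [1, a, b, c, a * b, a * c, b * c, a * b * c]


X = [model_row(*run) for run in runs * 3]  # three replicates, 24 rows
n = len(X)

# X'X: the columns are mutually orthogonal, so this is n times the identity.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(8)] for i in range(8)]

# Hypothetical response with known coefficients (NOT the textbook data).
y = [10 + 3 * a - 2 * b * c for a, b, c in runs * 3]

# Because X'X = n*I, the least squares estimate is simply X'y / n.
coef = {name: sum(row[k] * yi for row, yi in zip(X, y)) / n
        for k, name in enumerate(names)}

# Moving a factor from -1 to +1 changes its column by 2 units.
effects = {name: 2 * value for name, value in coef.items()
           if name != "(Intercept)"}

print(coef)
print(effects)
```

Each coefficient is a signed average of the responses, which is also why all eight standard errors in the summary table are equal.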

> summary(nw.aov)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
        a  1    7514.0   7514.0  14.3712 0.0016031
        b  1  140760.2 140760.2 269.2166 0.0000000
        c  1    4266.7   4266.7   8.1604 0.0114229
      a:b  1    5260.5   5260.5  10.0612 0.0059164
      a:c  1    6666.7   6666.7  12.7506 0.0025519
      b:c  1     586.7    586.7   1.1221 0.3052037
    a:b:c  1   15067.1  15067.1  28.8171 0.0000628
Residuals 16    8365.6    522.9

[Figure: Plot of residual vs. fit for nw.aov. Horizontal axis: Fitted : a * b * c]

Nonparametric Statistical Methods

Nonparametric Methods

• Most NP methods are based on ranks instead of original data

• Reference: Hollander & Wolfe, Nonparametric Statistical Methods

E Newton 2

[Figure: Histogram of 100 gamma(1,1) r.v.’s]

[Figure: Histogram of ranks of 100 r.v.’s, rank(g)]

Parametric and Nonparametric Tests

Type of test                  Parametric      Nonparametric
Single sample                 z and t tests   Sign test
                                              Wilcoxon Signed Rank Test
Two independent samples       z and t tests   Wilcoxon Rank Sum Test
                                              Mann-Whitney U Test
Several independent samples   ANOVA CRD       Kruskal-Wallis Test
Several matched samples       ANOVA RBD       Friedman Test
Correlation                   Pearson         Spearman Rank Correlation
                                              Kendall’s Rank Correlation

Sign Test

• Inference on median (u) for a single sample, size n
• H0: u=u0 vs. H1: u≠u0
• Count the number of xi’s that are greater than u0 and denote this s+
• The number of xi’s less than u0 is s- = n - s+
• Reject H0 if s+ is very large or very small, i.e. if max(s+, s-) is large.
• Under H0, s+ (and s-) has a binomial(n,1/2) distribution
• Large sample z test

[Figure: Histogram of thermostat data]

Sign Test in S-Plus

> thermostat
[1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0
> thermostat<200
[1] F F F F F T F F F T
> sum(thermostat<200)
[1] 2
> 2*pbinom(sum(thermostat<200),10,0.5)
[1] 0.109375
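The same two-sided sign test can be reproduced without S-Plus. This is a minimal Python sketch, assuming no observation equals u0 exactly (true for the thermostat data); binom_cdf is an illustrative helper name, not a library function.

```python
from math import comb

thermostat = [202.2, 203.4, 200.5, 202.5, 206.3,
              198.0, 203.7, 200.8, 201.3, 199.0]
u0 = 200
n = len(thermostat)

s_plus = sum(x > u0 for x in thermostat)   # observations above u0
s_minus = sum(x < u0 for x in thermostat)  # observations below u0


def binom_cdf(k, size):
    """P(S <= k) for S ~ binomial(size, 1/2)."""
    return sum(comb(size, i) for i in range(k + 1)) / 2 ** size


# Two-sided p-value: double the lower tail of the smaller count.
p_value = min(1.0, 2 * binom_cdf(min(s_plus, s_minus), n))
print(s_plus, s_minus, p_value)
```

With s- = 2 the lower tail is (1 + 10 + 45)/1024, so the doubled p-value matches the pbinom result above.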

Wilcoxon Signed Rank Test
• Inference on median (u), single sample, size n
• Assumes population distribution is symmetric
• H0: u=u0 vs. H1: u≠u0
• di = xi - u0
• Rank order |di|
• W+ = sum of ranks of positive differences
• W- = sum of ranks of negative differences
• Wmax = maximum (W+, W-)
• Reject H0 if Wmax is large.
• Null Distribution: see text
• Large sample z test

S-Plus wilcox.test for thermostat data

> thermostat
[1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0
> sum(rank(abs(thermostat-200))[-c(6,10)])
[1] 47
> wilcox.test(thermostat,mu=200)

Exact Wilcoxon signed-rank test

data: thermostat
signed-rank statistic V = 47, n = 10, p-value = 0.0488
alternative hypothesis: true mu is not equal to 200
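The exact null distribution used here comes from enumeration: under H0 each rank is equally likely to carry a plus or a minus sign. This is a minimal Python sketch of that calculation (not the S-Plus implementation), assuming no ties among the |di| and no zero differences, which holds for the thermostat data.

```python
from itertools import product

thermostat = [202.2, 203.4, 200.5, 202.5, 206.3,
              198.0, 203.7, 200.8, 201.3, 199.0]
u0 = 200
d = [x - u0 for x in thermostat]
n = len(d)

# Rank the absolute differences (1 = smallest); valid only without ties.
order = sorted(range(n), key=lambda i: abs(d[i]))
ranks = [0] * n
for r, i in enumerate(order, start=1):
    ranks[i] = r

w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
w_minus = sum(r for r, di in zip(ranks, d) if di < 0)
w_max = max(w_plus, w_minus)

# Enumerate all 2^n sign assignments and count those with a positive
# rank sum at least as large as w_max.
count = sum(
    1
    for signs in product((0, 1), repeat=n)
    if sum(r for r, s in zip(range(1, n + 1), signs) if s) >= w_max
)
p_value = min(1.0, 2 * count / 2 ** n)  # two-sided, by symmetry
print(w_plus, w_minus, p_value)
```

Only the two negative differences (observations 6 and 10) contribute to W-, which is why the S-Plus one-liner above drops those two ranks to obtain W+.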

S-Plus parametric t-test for thermostat data

> t.test(thermostat, mu=200)

One-sample t-Test

data: thermostat
t = 2.3223, df = 9, p-value = 0.0453
alternative hypothesis: true mean is not equal to 200
95 percent confidence interval:
 200.0459 203.4941
sample estimates:
 mean of x
    201.77

Location-Scale Families

• See course textbook, page 575.

[Figure: 2 normal pdf’s with location parameters = -1 and 1, scale parameter = 1]

Wilcoxon Rank Sum Test

• Inference on location of distribution of 2 independent random samples X and Y (e.g. from control and treatment population).
• Assume X~Y+∆
• H0: ∆=0 vs. H1: ∆≠0
• Rank all N = n1 + n2 observations
• W = sum of ranks assigned to the Y’s (or X’s, whichever has smaller sample size)
• Reject H0 if W is extreme

Mann-Whitney U test

• Equivalent to Wilcoxon rank sum test
• Compare each xi with each yj.
• There are nx*ny such comparisons
• U = number of pairs in which xi<yj.
• It can be shown that W = U + ny*(ny+1)/2 (when there are no ties), where W is the sum of the ranks of the y’s
• Reject H0 if U is extreme.
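The link between the rank sum W and the pair count U is easy to check numerically. The sketch below is in Python and uses two small hypothetical samples (not the capacitor data), with W taken as the sum of the pooled ranks of the y's and no ties.

```python
x = [1.1, 2.3, 4.0]        # hypothetical sample 1
y = [0.5, 3.1, 5.2, 6.0]   # hypothetical sample 2

# Rank the pooled observations (1 = smallest); valid only without ties.
pooled = sorted(x + y)
rank = {value: i + 1 for i, value in enumerate(pooled)}

w = sum(rank[value] for value in y)                # rank sum of the y's
u = sum(1 for xi in x for yj in y if xi < yj)      # pairs with xi < yj
ny = len(y)

print(w, u, u + ny * (ny + 1) // 2)
```

The term ny*(ny+1)/2 is the smallest possible rank sum for the y's, reached when every y lies below every x; each pair with xi < yj raises one y rank by one.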

[Figure: Boxplots of times to failure for control and stressed capacitors]

S-Plus wilcox.test

> wilcox.test(cg, sg)

Exact Wilcoxon rank-sum test

data: cg and sg
rank-sum statistic W = 95, n = 8, m = 10, p-value = 0.1011
alternative hypothesis: true mu is not equal to 0

S-Plus parametric t-test

> t.test(cg,sg)

data: cg and sg
t = 1.8105, df = 16, p-value = 0.089
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.103506 14.018506
sample estimates:
 mean of x mean of y
   15.5375      9.08

Kolmogorov-Smirnov Tests

The Kolmogorov-Smirnov test detects any difference between two distributions (location, scale, skewness, or anything else). It compares two empirical cumulative distribution functions (step functions).

[Figure: Two-sample Test. Empirical cdfs of Distribution 1 and Distribution 2, with the maximum gap between them marked.]

There is also a one-sample version for testing the distance between some observed data and a specified (ideal) distribution.

[Figure: One-sample Test. Observed distribution and ideal distribution, with the maximum gap between them marked.]

The test statistic is the maximum gap between the observed distribution and the hypothesized distribution; its significance depends on the sample size (tables or p-values).
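The "maximum gap" can be computed directly from the step functions. This is a minimal Python sketch of the two statistics only (no p-values), using small hypothetical samples; the function names are illustrative and are not the S-Plus ks.gof interface.

```python
def ecdf(sample, t):
    """Empirical cdf of sample evaluated at t."""
    return sum(v <= t for v in sample) / len(sample)


def ks_two_sample(x, y):
    """Largest vertical gap between two empirical cdfs.

    Both are right-continuous step functions that jump only at data
    points, so it is enough to check the gap at every observation.
    """
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in x + y)


def ks_one_sample(x, cdf):
    """Largest gap between the empirical cdf of x and a hypothesized cdf.

    The hypothesized cdf is continuous, so the gap is checked just
    before and just after each jump of the empirical cdf.
    """
    xs = sorted(x)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(v), cdf(v) - i / n) for i, v in enumerate(xs)
    )


# Hypothetical data for illustration.
d_two = ks_two_sample([1.0, 2.0, 3.0, 4.0], [2.5, 3.5, 4.5, 5.5])
d_one = ks_one_sample([0.1, 0.4, 0.7], lambda t: min(max(t, 0.0), 1.0))
print(d_two, d_one)
```

In the second call the hypothesized distribution is Uniform(0,1), chosen only because its cdf is simple to write down.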

J Telford 20

[Figure: Histograms of 100 random normal (2,1) deviates and 100 random gamma(4,2) deviates]

Kolmogorov-Smirnov Tests

> ks.gof(x,y)

Two-Sample Kolmogorov-Smirnov Test

data: x and y
ks = 0.15, p-value = 0.2112
alternative hypothesis: cdf of x does not equal the cdf of y for at least one sample point.

> ks.gof(y)

One sample Kolmogorov-Smirnov Test of Composite Normality

data: y
ks = 0.0969, p-value = 0.0216
alternative hypothesis: True cdf is not the normal distn. with estimated parameters
sample estimates:
 mean of x standard deviation of x
  1.865857               0.9421928

Kruskal-Wallis Test
• Inference for several independent samples
• Assume distributions of each of the samples differ only possibly in location.
• Xij = θ + τj + eij.
• H0: τ1=τ2=..=τa vs. H1: τi≠τj for some i ≠ j
• Rank all N=n1+n2+..+na observations.
• Calculate rank sums and averages in each group
• Calculate KW test statistic = kw (see text)
• Reject H0 for large values of kw
• For large ni’s, the null distribution of kw is approximately χ2 with a-1 degrees of freedom

Test scores for four different teaching methods (page 582)

> scm<-matrix(score,7,4)
> scm
      [,1]  [,2]  [,3]  [,4]
[1,] 14.06 14.71 23.32 26.93
[2,] 14.26 19.49 23.42 29.76
[3,] 14.59 20.20 24.92 30.43
[4,] 18.15 20.27 27.82 33.16
[5,] 20.82 22.34 28.68 33.88
[6,] 23.44 24.92 32.85 36.43
[7,] 25.43 26.84 33.90 37.04

[Figure: plot.factor(f(grp),score), test scores for groups 1-4]

Ranks of Test Scores

> scmr<-matrix(rank(score),7,4)
> scmr
     [,1] [,2] [,3] [,4]
[1,]    1  4.0 11.0   18
[2,]    2  6.0 12.0   21
[3,]    3  7.0 14.5   22
[4,]    5  8.0 19.0   24
[5,]    9 10.0 20.0   25
[6,]   13 14.5 23.0   27
[7,]   16 17.0 26.0   28
> tmp<-apply(scmr,2,sum)
> tmp
[1] 49.0 66.5 125.5 165.0
> (12/(28*29))*sum((tmp^2)/7)-3*29
[1] 18.13406

Kruskal-Wallis test in S-Plus

> kruskal.test(scm, col(scm))

Kruskal-Wallis rank sum test

data: scm and col(scm)
Kruskal-Wallis chi-square = 18.139, df = 3, p-value = 0.0004
alternative hypothesis: two.sided
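The hand calculation (18.13406) and the kruskal.test value (18.139) differ slightly because one score, 24.92, occurs twice. The sketch below, in Python rather than S-Plus, recomputes the statistic from the test-score matrix with midranks and then applies the usual tie correction, dividing by 1 - Σ(t^3 - t)/(N^3 - N) over tied groups of size t. That kruskal.test applies exactly this correction is an assumption here; it is consistent with the printed value.

```python
from collections import Counter

# Columns of scm: one list of seven scores per teaching method.
groups = [
    [14.06, 14.26, 14.59, 18.15, 20.82, 23.44, 25.43],
    [14.71, 19.49, 20.20, 20.27, 22.34, 24.92, 26.84],
    [23.32, 23.42, 24.92, 27.82, 28.68, 32.85, 33.90],
    [26.93, 29.76, 30.43, 33.16, 33.88, 36.43, 37.04],
]


def midranks(values):
    """Ranks 1..N, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


pooled = [v for g in groups for v in g]
N = len(pooled)
ranks = midranks(pooled)

# Rank sum for each group, slicing the pooled ranks group by group.
rank_sums, start = [], 0
for g in groups:
    rank_sums.append(sum(ranks[start:start + len(g)]))
    start += len(g)

kw = 12 / (N * (N + 1)) * sum(
    r ** 2 / len(g) for r, g in zip(rank_sums, groups)
) - 3 * (N + 1)

# Tie correction over groups of tied values.
ties = sum(t ** 3 - t for t in Counter(pooled).values() if t > 1)
kw_corrected = kw / (1 - ties / (N ** 3 - N))
print(rank_sums, kw, kw_corrected)
```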

ANOVA for test scores

> summary(aov(score~f(grp)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
   f(grp)  3  830.1914 276.7305 15.93607 6.509182e-006
Residuals 24  416.7609  17.3650

Friedman Test

• Inference for several matched samples
• a treatments, b blocks
• H0: τ1=τ2=..=τa vs. H1: τi≠τj for some i ≠ j
• Rank observations separately within each block
• Calculate rank sums
• Calculate the Friedman statistic, fr (see text)
• Reject H0 for large values of fr
• For b large, fr is approximately χ2 with a-1 degrees of freedom

Ranks within Blocks (rows)

> scmrb<-t(apply(scm,1,rank))
> scmrb
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    1    2    3    4
[3,]    1    2    3    4
[4,]    1    2    3    4
[5,]    1    2    3    4
[6,]    1    2    3    4
[7,]    1    2    3    4
> tmp<-apply(scmrb,2,sum)
> tmp
[1] 7 14 21 28
> (12/(4*7*5))*sum(tmp^2)-3*7*5
[1] 21

Friedman test in S-Plus

> friedman.test(scm, col(scm), row(scm))

Friedman rank sum test

data: scm and col(scm) and row(scm)
Friedman chi-square = 21, df = 3, p-value = 0.0001
alternative hypothesis: two.sided
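The Friedman statistic of 21 can be reproduced from the score matrix by ranking within each row. This is a minimal Python sketch (not S-Plus), treating the rows of scm as the b = 7 blocks and the columns as the a = 4 treatments, and assuming no ties within a block, which holds for these data.

```python
# Rows of scm: one block per row, one treatment per column.
scm = [
    [14.06, 14.71, 23.32, 26.93],
    [14.26, 19.49, 23.42, 29.76],
    [14.59, 20.20, 24.92, 30.43],
    [18.15, 20.27, 27.82, 33.16],
    [20.82, 22.34, 28.68, 33.88],
    [23.44, 24.92, 32.85, 36.43],
    [25.43, 26.84, 33.90, 37.04],
]
b = len(scm)      # blocks
a = len(scm[0])   # treatments


def rank_within(row):
    """Ranks 1..a of the values in one block; valid only without ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0] * len(row)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


ranked = [rank_within(row) for row in scm]
rank_sums = [sum(row[j] for row in ranked) for j in range(a)]

fr = 12 / (b * a * (a + 1)) * sum(r ** 2 for r in rank_sums) - 3 * b * (a + 1)
print(rank_sums, fr)
```

Every block orders the four methods identically, so the rank sums take their most extreme possible values and fr reaches its maximum, b*(a-1) = 21.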

ANOVA test score data with blocks

> summary(aov(score~f(grp)+f(blk)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
   f(grp)  3  830.1914 276.7305 260.4768 5.220000e-015
   f(blk)  6  397.6377  66.2729  62.3804 4.558276e-011
Residuals 18   19.1232   1.0624

Correlation Methods

• Pearson Correlation: measures only linear association.

• Spearman Correlation: correlation of the ranks

• Kendall’s Tau: based on number of concordant and discordant pairs.


Kendall’s Tau

• Assume: the n bivariate observations (X1,Y1),…,(Xn,Yn) are a random sample from a continuous bivariate population.
• H0: Xi, Yi are independent
• H0: F(x,y) = F(x)F(y)
• Measure dependence by finding the number of concordant and discordant pairs.
• Population correlation coefficient: τ = 2*P{(X2-X1)(Y2-Y1)>0}-1

Kendall’s Tau

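\[
Q\big((X_i,Y_i),(X_j,Y_j)\big) =
\begin{cases}
 1, & \text{if } (X_j-X_i)(Y_j-Y_i) > 0 \\
 0, & \text{if } (X_j-X_i)(Y_j-Y_i) = 0 \\
-1, & \text{if } (X_j-X_i)(Y_j-Y_i) < 0
\end{cases}
\]

\[
K = \sum_{1 \le i < j \le n} Q\big((X_i,Y_i),(X_j,Y_j)\big),
\qquad
\hat{\tau} = \frac{2K}{n(n-1)}
\]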

Kendall’s Tau example

> m
   1  3  2  4
1 NA  1  1  1
2 NA NA -1  1
3 NA NA NA  1
4 NA NA NA NA
> 2*sum(m,na.rm=T)/12
[1] 0.6666667
> cor.test(c(1,2,3,4),c(1,3,2,4),method="k")

Kendall's rank correlation tau

data: c(1, 2, 3, 4) and c(1, 3, 2, 4)
normal-z = 1.3587, p-value = 0.1742
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau
 0.6666667
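The matrix m above holds the pairwise scores Q for each pair i < j. A minimal Python sketch of the same calculation, using the four observations from the example; kendall_tau is an illustrative name, not a library function.

```python
def kendall_tau(x, y):
    """Sample Kendall's tau: 2K / (n(n-1)), with K the sum of pair scores."""
    n = len(x)
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[j] - x[i]) * (y[j] - y[i])
            if product > 0:
                k += 1      # concordant pair
            elif product < 0:
                k -= 1      # discordant pair
    return k, 2 * k / (n * (n - 1))


k, tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(k, tau)
```

Of the six pairs, five are concordant and one (the second and third observations) is discordant, so K = 4 and the estimate is 8/12.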

x=1:10
y=exp(x)

[Figure: plot of y against x]

Pearson Correlation

> cor.test(x,y,method="p")

Pearson's product-moment correlation

data: x and y
t = 2.9082, df = 8, p-value = 0.0196
alternative hypothesis: true coef is not equal to 0
sample estimates:
       cor
 0.7168704

Spearman Correlation

> cor.test(x,y,method="s")

Spearman's rank correlation

data: x and y
normal-z = 2.9818, p-value = 0.0029
alternative hypothesis: true rho is not equal to 0
sample estimates:
 rho
   1

Kendall Correlation

> cor.test(x,y,method="k")

Kendall's rank correlation tau

data: x and y
normal-z = 4.0249, p-value = 0.0001
alternative hypothesis: true tau is not equal to 0
sample estimates:
 tau
   1
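The contrast between the three outputs (Pearson about 0.72, Spearman and Kendall exactly 1) follows from y = exp(x) being monotone but far from linear. A minimal Python sketch that recomputes the Pearson and Spearman coefficients for the same x and y; the helper names are illustrative and the ranking assumes no ties.

```python
from math import exp, sqrt

x = list(range(1, 11))
y = [exp(v) for v in x]


def pearson(u, v):
    """Product-moment correlation of two equal-length sequences."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    s_uv = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    s_uu = sum((a - mu_u) ** 2 for a in u)
    s_vv = sum((b - mu_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


def ranks(values):
    """Ranks 1..n; valid only when there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for r, i in enumerate(order, start=1):
        out[i] = r
    return out


r_pearson = pearson(x, y)
r_spearman = pearson(ranks(x), ranks(y))  # Pearson correlation of the ranks
print(r_pearson, r_spearman)
```

Any strictly increasing transformation of y leaves the ranks, and therefore the Spearman and Kendall coefficients, unchanged.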

[Figure: Example - Environmental Data - Censored below LOD]

Resampling Methods

• Parametric methods – Inference based on assumed population distribution
• Resampling methods – No assumption about functional form of population distribution.
• Permutation Tests – 2 sample problem
• Jackknife – Delete one observation at a time
• Bootstrap – resample with replacement

Permutation Tests
• Goal: test for a difference in means (2 sample problem)
• (x1, x2… xn1) and (y1, y2.. yn2) are independent samples drawn from F1 and F2.
• H0: F1=F2 => all assignments of labels x and y equally likely.
• Choose SRS of size n1 from n1+n2 observations and label as x, label rest as y.
• Calculate value of test statistic (e.g. difference in means) for each assignment -> permutation distribution.
• There are (n1+n2) choose (n1) possible distinct assignments (capacitor data set Ex14.7, n1=8, n2=10, number of assignments=43,758)
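The procedure above can be carried out by complete enumeration when the samples are small. This is a minimal Python sketch using two hypothetical samples (the capacitor data are not reproduced in these slides); the two-sided p-value is the fraction of label assignments whose absolute difference in means is at least as large as the observed one.

```python
from itertools import combinations
from math import comb

x = [12.1, 15.3, 9.8]        # hypothetical sample 1
y = [7.2, 8.9, 6.5, 10.4]    # hypothetical sample 2
pooled = x + y
n1, n = len(x), len(pooled)


def mean_diff(x_idx):
    """Difference in means when the observations at x_idx are labelled x."""
    chosen = set(x_idx)
    xs = [pooled[i] for i in range(n) if i in chosen]
    ys = [pooled[i] for i in range(n) if i not in chosen]
    return sum(xs) / len(xs) - sum(ys) / len(ys)


observed = mean_diff(range(n1))

# Permutation distribution: one value per distinct assignment of labels.
perm_dist = [mean_diff(idx) for idx in combinations(range(n), n1)]

# The small tolerance guards against floating-point noise in the comparison.
p_value = sum(abs(d) >= abs(observed) - 1e-12 for d in perm_dist) / len(perm_dist)

print(len(perm_dist), p_value)
print(comb(18, 8))  # number of assignments for n1 = 8, n2 = 10
```

For larger samples the full enumeration becomes expensive, and a random subset of assignments is commonly used instead.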

Jackknife
• Goal: estimate distribution and standard error of statistic (e.g. median or mean)
• Draw n samples of size n-1 from original sample, by deleting one observation at a time.
• Calculate mj* = mean (median) from each sample

\[
\mathrm{JSE}(m) = \sqrt{\frac{n-1}{n}\sum_{j=1}^{n}\left(m_j^{*}-\bar{m}^{*}\right)^{2}},
\qquad
\bar{m}^{*} = \frac{1}{n}\sum_{j=1}^{n} m_j^{*}
\]

• JSE is exact for mean, not necessarily very good for median
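The claim that the jackknife standard error is exact for the mean can be checked numerically: it should equal s/sqrt(n). A minimal Python sketch with a small hypothetical sample; jackknife_se is an illustrative name.

```python
import statistics
from math import sqrt

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3]  # hypothetical sample
n = len(data)


def jackknife_se(sample, statistic):
    """Jackknife standard error of statistic, deleting one point at a time."""
    size = len(sample)
    replicates = [statistic(sample[:j] + sample[j + 1:]) for j in range(size)]
    centre = sum(replicates) / size
    return sqrt((size - 1) / size * sum((m - centre) ** 2 for m in replicates))


jse_mean = jackknife_se(data, statistics.mean)
jse_median = jackknife_se(data, statistics.median)
classical = statistics.stdev(data) / sqrt(n)  # usual standard error of the mean

print(jse_mean, classical, jse_median)
```

The leave-one-out medians take only a few distinct values, which is one way to see why the jackknife can behave poorly for the median.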

Bootstrap
• Goal: estimate distribution, standard error, confidence interval of statistic (e.g. mean, median, correlation)
• Draw B samples of size n, with replacement, from original sample
• Calculate the statistic mb* from each sample

\[
\mathrm{BSE}(m) = \sqrt{\frac{\sum_{b=1}^{B}\left(m_b^{*}-\bar{m}^{*}\right)^{2}}{B-1}},
\qquad
\bar{m}^{*} = \frac{1}{B}\sum_{b=1}^{B} m_b^{*}
\]
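A bootstrap standard error and a simple percentile interval can be sketched in a few lines. This Python example uses a hypothetical sample, B = 2000 replicates and a fixed seed so that runs are repeatable; it is not the S-Plus bootstrap function, and it computes plain percentile limits rather than BCa limits.

```python
import random
import statistics
from math import sqrt

data = [4.1, 5.6, 3.8, 7.2, 6.0, 5.1, 4.7, 6.6, 5.9, 4.4]  # hypothetical sample
n = len(data)
B = 2000
rng = random.Random(0)  # fixed seed, chosen arbitrarily

# Draw B resamples of size n with replacement and record the statistic.
boot_means = []
for _ in range(B):
    resample = [rng.choice(data) for _ in range(n)]
    boot_means.append(statistics.mean(resample))

# Standard deviation of the replicates (divisor B - 1).
bse = statistics.stdev(boot_means)

# Percentile interval: the 2.5% and 97.5% points of the sorted replicates.
ordered = sorted(boot_means)
lower = ordered[int(0.025 * B)]
upper = ordered[int(0.975 * B) - 1]

print(bse, statistics.stdev(data) / sqrt(n), (lower, upper))
```

For the mean, the bootstrap standard error should land close to the classical s/sqrt(n); the value of the method is that the same loop works for statistics such as the variance or a correlation, where no simple formula is available.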

Swiss Data Set in S-Plus

Fertility Data for Switzerland in 1888

SUMMARY: The swiss.fertility and swiss.x data sets contain fertility data for Switzerland in 1888.

ARGUMENTS:

swiss.fertility
standardized fertility measure I[g] for each of 47 French-speaking provinces of Switzerland in approximately 1888.

swiss.x
matrix with 5 columns that contain socioeconomic indicators for the provinces:
1) percent of population involved in agriculture as an occupation;
2) percent of "draftees" receiving highest mark on army examination;
3) percent of population whose education is beyond primary school;
4) percent of population who are Catholic; and,
5) percent of live births who live less than 1 year (infant mortality).

SOURCE: Mosteller and Tukey (1977). Data Analysis and Regression. Addison-Wesley. Unpublished data used by permission of Francine van de Walle, Population Study Center, University of Pennsylvania, Philadelphia, PA.

Bootstrap estimates and CI for variance of education

> educ<-swiss.x[,3]
> var(educ)
[1] 92.45606
> educ.boot<-bootstrap(educ,var,trace=F)
> summary(educ.boot)
Call: bootstrap(data = educ, statistic = var, trace = F)

Summary Statistics:
    Observed    Bias  Mean    SE
var    92.46 -0.5972 91.86 39.14

Empirical Percentiles:
     2.5%    5%   95% 97.5%
var 29.98 36.26 165.3   175

[Figure: Histogram of variance estimates obtained from 1000 bootstrap samples]

[Figure: QQ plot of variance estimates]

[Figure: Plot of LSAT scores by GPA for a sample of 15 schools]

Bootstrap estimates and CI for correlation between LSAT and GPA

> law.boot<-bootstrap(law.data, cor(lsat,gpa), trace=F)
> summary(law.boot)
Call: bootstrap(data = law.data, statistic = cor(lsat, gpa), trace = F)

Summary Statistics:
      Observed     Bias   Mean     SE
Param   0.7764 -0.00506 0.7713 0.1368

Empirical Percentiles:
       2.5%     5%   95%  97.5%
Param 0.449 0.5133 0.947 0.9623

BCa Confidence Limits:
        2.5%     5%    95%  97.5%
Param 0.2623 0.4138 0.9232 0.9413

[Figure: Histogram of correlation estimates obtained from 1000 bootstrap samples]

[Figure: QQ plot of correlation estimates]

S-Plus Stack-loss data set

• Stack-loss Data
• SUMMARY:
• The stack.loss and stack.x data sets are from the operation of a plant for the oxidation of ammonia to nitric acid, measured on 21 consecutive days.
• ARGUMENTS:
• stack.loss
– percent of ammonia lost (times 10).
• stack.x
– matrix with 21 rows and 3 columns representing air flow to the plant, cooling water inlet temperature, and acid concentration as a percentage (coded by subtracting 50 and then multiplying by 10).
• SOURCE:
• Brownlee, K.A. (1965). Statistical Theory and Methodology in Science and Engineering. New York: John Wiley & Sons, Inc.
• Draper and Smith (1966). Applied Regression Analysis. New York: John Wiley & Sons, Inc.
• Daniel and Wood (1971). Fitting Equations to Data. New York: John Wiley & Sons, Inc.

[Figure: S-Plus stack loss data set, scatterplot matrix of stack.loss, Air.Flow, Water.Temp and Acid.Conc.]

Summary of stack loss regression

> summary(tmp)

Call: lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stack)

Residuals:
    Min     1Q  Median    3Q   Max
 -7.238 -1.712 -0.4551 2.361 5.698

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960 -3.3557   0.0038
   Air.Flow   0.7156     0.1349  5.3066   0.0001
 Water.Temp   1.2953     0.3680  3.5196   0.0026
 Acid.Conc.  -0.1521     0.1563 -0.9733   0.3440

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-Squared: 0.9136
F-statistic: 59.9 on 3 and 17 degrees of freedom, the p-value is 3.016e-009

Correlation of Coefficients:
           (Intercept) Air.Flow Water.Temp
  Air.Flow      0.1793
Water.Temp     -0.1489  -0.7356
Acid.Conc.     -0.9016  -0.3389     0.0002

Summary of stack loss bootstrap output

> summary(stack.boot)
Call: bootstrap(data = stack, statistic = coef(lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., stack)), trace = F)

Summary Statistics:
            Observed       Bias     Mean     SE
(Intercept) -39.9197  0.5691396 -39.3505 9.3731
   Air.Flow   0.7156  0.0016734   0.7173 0.1777
 Water.Temp   1.2953 -0.0264873   1.2688 0.4798
 Acid.Conc.  -0.1521 -0.0006978  -0.1528 0.1261

Empirical Percentiles:
                2.5%       5%       95%     97.5%
(Intercept) -56.0109 -53.4216 -21.92994 -18.75262
   Air.Flow   0.3903   0.4366   1.00261   1.04605
 Water.Temp   0.4004   0.5131   2.07381   2.23633
 Acid.Conc.  -0.4285  -0.3740   0.03282   0.05912

Summary of stack loss bootstrap output

> summary(stack.boot)

BCa Confidence Limits:
                2.5%       5%        95%     97.5%
(Intercept) -55.6465 -52.6606 -21.451125 -18.55810
   Air.Flow   0.3266   0.4120   0.992007   1.01855
 Water.Temp   0.5244   0.6193   2.264165   2.40956
 Acid.Conc.  -0.4629  -0.4101  -0.007724   0.04459

Correlation of Replicates:
            (Intercept) Air.Flow Water.Temp Acid.Conc.
(Intercept)     1.00000 -0.17636    0.09902   -0.80236
   Air.Flow    -0.17636  1.00000   -0.78822   -0.07635
 Water.Temp     0.09902 -0.78822    1.00000   -0.24463
 Acid.Conc.    -0.80236 -0.07635   -0.24463    1.00000

[Figure: Histograms of regression coefficients, with panels for (Intercept), Air.Flow, Water.Temp and Acid.Conc.]

[Figure: QQ plots of regression coefficients, with panels for (Intercept), Air.Flow, Water.Temp and Acid.Conc.]
