Stat 231 Final Slides

5/12/2018 Stat 231 Final Slides - slidepdf.com

http://slidepdf.com/reader/full/stat-231-final-slides 1/100

STAT 231

Final



Outline

• Chapter 1

– Data types (discrete, continuous, categorical)

– Problem (3 different aspects)

– Populations (target, study, sample)

– Representations of data

• Graphical: histograms, CDFs, box plots

• Numerical: mean, standard deviation, IQR

– Bivariate Data

• Relative risk

• Correlation coefficient



Outline

• Chapter 2

– Review of probability distributions

– Random PPDAC examples…



Outline

• Chapter 3

– Binomial Model

– Response Model

– Regression Model

– Maximum Likelihood Estimation



Outline

• Chapter 4

– Sampling distributions for estimators

– Introduction to new distributions
• Gaussian
• Chi‐squared
• t
– Confidence Interval
– Hypothesis Testing
– Confidence Intervals and Hypothesis Testing with the likelihood function



Outline

• Chapter 5

– Testing for independence with categorical variates

– Model checking and assessment for assumptions



Outline

• Chapter 6
– Comparison
• 2 sample t‐tests
• Paired t‐test
– Causality
• Testing for association
• Blocking
• Randomization and repetition
• Matching
– Prediction

• Prediction intervals for response

• Prediction intervals for regression



Confidence Intervals using the

Relative Likelihood Function

Define the likelihood function:

$L(\pi) = \prod_{i=1}^{n} f(x_i; \pi)$

Define the relative likelihood function as:

$\frac{L(\pi)}{L(\hat{\pi})}$



Confidence Intervals using the

Relative Likelihood Function

Graph the relative likelihood function:

Draw a horizontal line at 0.1; the x‐coordinates of its two intersections with the curve form an approximate 95% confidence interval
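As a rough illustration of reading an interval off the relative likelihood curve, the sketch below evaluates the relative likelihood on a grid and keeps the parameter values where it is at least 0.1. The binomial model and the counts (40 successes in 100 trials) are assumptions made only for this example; they do not come from the slides. Under the chi‐squared approximation for the likelihood ratio statistic, a cutoff of 0.1 corresponds to −2 ln(0.1) ≈ 4.61, which is coverage of roughly 97% on one degree of freedom, while a cutoff near 0.147 gives 95%, so the 0.1 line is best read as a slightly conservative approximation.

```python
import math


def relative_likelihood(pi, y, n):
    """Binomial relative likelihood L(pi) / L(pi_hat), with pi_hat = y / n.

    Assumes 0 < y < n so that both logarithms are defined.
    """
    pi_hat = y / n
    log_r = (y * math.log(pi / pi_hat)
             + (n - y) * math.log((1 - pi) / (1 - pi_hat)))
    return math.exp(log_r)


def likelihood_interval(y, n, cutoff=0.1, steps=10_000):
    """Smallest and largest grid values of pi with relative likelihood >= cutoff."""
    grid = [k / steps for k in range(1, steps)]
    inside = [p for p in grid if relative_likelihood(p, y, n) >= cutoff]
    return min(inside), max(inside)


# Hypothetical data: 40 successes in 100 trials.
low, high = likelihood_interval(y=40, n=100)
print(f"10% likelihood interval: ({low:.3f}, {high:.3f})")
```

The grid search stands in for the graphical step of locating where the horizontal line crosses the curve; a finer grid or a root finder would sharpen the endpoints.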



Hypothesis Testing using the

Likelihood Function

1) Define the null hypothesis, define the alternate

hypothesis

2) Define the test statistic, identify the distribution, calculate the observed value

3) Calculate the p‐value

The test statistic: $D = 2[l(\tilde{\theta}) - l(\theta_0)]$

Distribution of D: $D \sim \chi^2_{n-p}$



Hypothesis Testing using the

Likelihood Function

Observed value of D: $d = 2[l(\hat{\theta}) - l(\theta_0)]$

P‐value: $P(D \ge d)$, where $D \sim \chi^2_{n-p}$



Example



Example

The observed value of the test statistic: $d = 2[l(\hat{\theta}) - l(\theta_0)]$



Example

$l(\theta) = n \ln(\theta + 1) + \theta \sum_{i=1}^{n} \ln x_i$



Example



Example

$d = 2[l(\hat{\theta}) - l(\theta_0)]$

$l(\theta) = n \ln(\theta + 1) + \theta \sum_{i=1}^{n} \ln x_i$
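The calculation in this example can be sketched numerically. The code below assumes the log‐likelihood $l(\theta) = n \ln(\theta + 1) + \theta \sum \ln x_i$, which is maximized at $\hat{\theta} = -n / \sum \ln x_i - 1$ (set the derivative to zero). The sample values, the null value $\theta_0 = 1$, and the use of one degree of freedom for the chi‐squared reference distribution are all assumptions for illustration; the slide's own data are not reproduced here.

```python
import math


def log_likelihood(theta, xs):
    """l(theta) = n ln(theta + 1) + theta * sum(ln x_i), for theta > -1."""
    return len(xs) * math.log(theta + 1) + theta * sum(math.log(x) for x in xs)


def mle(xs):
    """Root of dl/dtheta = n / (theta + 1) + sum(ln x_i) = 0."""
    return -len(xs) / sum(math.log(x) for x in xs) - 1


def likelihood_ratio_test(xs, theta0):
    """Return (theta_hat, d, p_value) with d = 2[l(theta_hat) - l(theta0)].

    The p-value uses a chi-squared distribution with 1 degree of freedom
    (an assumption here): P(chi2_1 >= d) = erfc(sqrt(d / 2)).
    """
    theta_hat = mle(xs)
    d = 2 * (log_likelihood(theta_hat, xs) - log_likelihood(theta0, xs))
    d = max(d, 0.0)  # guard against tiny negative values from rounding
    return theta_hat, d, math.erfc(math.sqrt(d / 2))


# Hypothetical observations in (0, 1).
sample = [0.92, 0.79, 0.90, 0.65, 0.86, 0.47, 0.73, 0.97, 0.94, 0.77]
theta_hat, d, p_value = likelihood_ratio_test(sample, theta0=1.0)
print(f"theta_hat = {theta_hat:.3f}, d = {d:.3f}, p-value = {p_value:.3f}")
```

The identity used for the p‐value only holds for one degree of freedom; for other degrees of freedom a chi‐squared table or a statistics library would be needed.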



Model Assessment

• We’ve been assuming that the data we collect fit a specific model (Binomial, Response, etc.)
• With these models come many assumptions, including independence
• In this chapter, we analyze our data to check whether we are actually able to use these models to fit our data



Independence with

Binary Variates

• We want to see if we can assume two binary variates (represented by 2 random variables X and Y) are independent
• This is essentially another type of hypothesis testing
• Since a binary variate is just a categorical variate with 2 categories, this test can be extended to two categorical variates



Independence with

Binary Variates

Define:

Let X represent the binary variate gender (Male = 0, Female = 1)

Let Y represent the binary variate smoker (Non‐Smoker = 0,

Smoker = 1)

Let n be the sample size

Let us collect our observed data and present it in the following frequency table:

                     Male (X=0)    Female (X=1)    Total
Non-Smoker (Y=0)     a             b               a + b
Smoker (Y=1)         c             d               c + d
Total                a + c         b + d           n = a + b + c + d



Independence with

Binary Variates

If X and Y are independent then:

Expected frequency of male smokers is $n \cdot P(X = 0) \cdot P(Y = 1)$

Expected frequency of male non‐smokers is $n \cdot P(X = 0) \cdot P(Y = 0)$

Expected frequency of female smokers is $n \cdot P(X = 1) \cdot P(Y = 1)$

Expected frequency of female non‐smokers is $n \cdot P(X = 1) \cdot P(Y = 0)$



Independence with

Binary Variates

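Using the observed frequency table

                     Male (X=0)    Female (X=1)    Total
Non-Smoker (Y=0)     a             b               a + b
Smoker (Y=1)         c             d               c + d
Total                a + c         b + d           n = a + b + c + d

we estimate each probability by the matching marginal proportion:

$P(X = 0) = (a + c)/n$

$P(X = 1) = (b + d)/n$

$P(Y = 0) = (a + b)/n$

$P(Y = 1) = (c + d)/n$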



Independence with

Binary Variates

Creating our expected frequency table

                     Male (X=0)                                Female (X=1)                              Total
Non-Smoker (Y=0)     $e_1 = n \cdot P(X=0) \cdot P(Y=0)$       $e_2 = n \cdot P(X=1) \cdot P(Y=0)$       a + b
Smoker (Y=1)         $e_3 = n \cdot P(X=0) \cdot P(Y=1)$       $e_4 = n \cdot P(X=1) \cdot P(Y=1)$       c + d



Independence with

Binary Variates

As with any other hypothesis testing question, we need to define the test statistic.

Test Statistic: $S = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$

Distribution of the test statistic: $S \sim \chi^2_{(r-1)(c-1)}$

Observed value: $s = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$



Independence with

Binary Variates

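p‐value $= P(S \ge s)$

Make your conclusion:

Small p‐value (reject the null hypothesis): there is evidence that X and Y are not independent

Large p‐value (do not reject): the data are consistent with X and Y being independent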
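The steps of this test can be sketched for a 2 × 2 table as follows. The counts are hypothetical and chosen only to make the arithmetic easy to follow. With two rows and two columns the reference distribution has (2 − 1)(2 − 1) = 1 degree of freedom, which is what allows the p‐value to be written with the complementary error function; that shortcut does not extend to larger tables.

```python
import math


def independence_test(a, b, c, d):
    """Chi-squared test of independence for a 2x2 frequency table.

    Layout follows the slides: rows are Y (non-smoker, smoker),
    columns are X (male, female), so the table is [[a, b], [c, d]].
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # e = n * P(X = x) * P(Y = y), probabilities estimated from the margins.
    expected = [row * col / n for row in row_totals for col in col_totals]
    s = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # 1 degree of freedom: P(chi2_1 >= s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(s / 2))
    return expected, s, p_value


# Hypothetical counts, not taken from the slides.
expected, s, p_value = independence_test(a=30, b=40, c=20, d=10)
print(f"expected = {expected}, s = {s:.3f}, p-value = {p_value:.3f}")
```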

Example



Example



Example

Observed value: $s = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$



Example

P‐value:



Model Assessment

For the regression model, we have the following assumptions when fitting our data

1) The expectation of Y is a linear function of the explanatory variate

2) The model used is Gaussian

3) Yi’s are independent

4) The model has a constant variance



Model Assessment

The expectation of Y is a linear function of the explanatory variate

• The model assumes that E[Yi] is a linear function of xi
• If we plot yi vs. xi we should see a linear relationship



Model Assessment

The model used is Gaussian

• In the model, we assume $R \sim G(0, \sigma)$ and thus $Y \sim G(\alpha + \beta x, \sigma)$
• How do we check if this assumption is reasonable? Residuals
• Rearranging the model, $R = Y - (\alpha + \beta x)$
• A realization of R becomes $r_i = y_i - (\alpha + \beta x_i)$
• An estimated residual is $\hat{r}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i) = y_i - \hat{y}_i$
• Graphically, $\hat{r}_i$ is the distance from the line of best fit to our observed response variate



Model Assessment

• We can check the Gaussian assumption by plotting a QQ plot
• Plot the sample quantiles of the estimated residuals against the theoretical Gaussian quantiles; if the points fall on a relatively straight line, then the Gaussian assumption is reasonable
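A minimal sketch of how the points of such a QQ plot could be computed is given below. It fits the line by least squares, sorts the estimated residuals, and pairs them with standard Gaussian quantiles. The plotting positions (i − 0.5)/n are one common convention and are an assumption here, as are the function names; the height and weight values are borrowed from the regression example later in these slides.

```python
from statistics import NormalDist, mean


def fit_line(xs, ys):
    """Least-squares estimates for y = alpha + beta * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


def qq_points(xs, ys):
    """Pairs (theoretical quantile, ordered estimated residual) for a QQ plot."""
    alpha, beta = fit_line(xs, ys)
    residuals = sorted(y - (alpha + beta * x) for x, y in zip(xs, ys))
    n = len(residuals)
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, residuals))


heights = [172, 162, 180, 170, 174]  # cm
weights = [60, 54, 72, 65, 64]       # kg
points = qq_points(heights, weights)
for q, r in points:
    print(f"{q:6.2f}  {r:6.2f}")
```

Plotting the second column against the first gives the QQ plot; with only five points the picture is too coarse to judge the assumption, so this is purely a demonstration of the mechanics.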



Model Assessment

Yi’s are independent

• We will check these assumptions by plotting the fitted response, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$, against the estimated residuals, $\hat{r}_i$
• If our assumptions are true, we should see a random pattern centered around 0



Model Assessment




Model Assessment

Yi’s have Constant Variance

• If Yi’s have constant variance, we should see residuals evenly distributed around zero

Non‐constant variance: funnel shaped



Comparison

Recall in Chapter 1 we learned there were three different aspects (types of problem)

• Descriptive
• Causative
• Predictive

Chapter 6 looks at techniques for solving each of the 3 problems



Comparison

• The descriptive aspect of the problem could involve comparing two different populations
• In this section, we will learn how to conduct hypothesis tests that allow us to conclude whether there is a difference between 2 populations
– The question asked is ‘is there a difference between the mean values of the 2 populations?’
• Essentially, the hypothesis tested is whether the parameter for each population is equal: $H_0: \mu_1 = \mu_2$



Comparison

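2 sample t‐tests (Response Model)

• Two populations: $Y_{1j} = \mu_1 + R_{1j}$ and $Y_{2j} = \mu_2 + R_{2j}$
• The estimator for each population is $\tilde{\mu}_1 = \frac{1}{n_1} \sum_{j=1}^{n_1} Y_{1j}$ and $\tilde{\mu}_2 = \frac{1}{n_2} \sum_{j=1}^{n_2} Y_{2j}$
• The sampling distribution for each estimator is $\tilde{\mu}_1 \sim G(\mu_1, \frac{\sigma}{\sqrt{n_1}})$ and $\tilde{\mu}_2 \sim G(\mu_2, \frac{\sigma}{\sqrt{n_2}})$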



Comparison

• In the hypothesis tests, we want to see if the two parameters $\mu_1$ and $\mu_2$ are equal, so let’s look at the r.v. $\tilde{\mu}_1 - \tilde{\mu}_2$
• What is the sampling distribution of $\tilde{\mu}_1 - \tilde{\mu}_2$ under the assumption $\mu_1 = \mu_2$?

$\tilde{\mu}_1 \sim G(\mu_1, \frac{\sigma}{\sqrt{n_1}})$, $\tilde{\mu}_2 \sim G(\mu_2, \frac{\sigma}{\sqrt{n_2}})$



Comparison

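$\tilde{\mu}_1 - \tilde{\mu}_2 \sim G(0, \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}})$

Standardize: $\frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim G(0, 1)$

Replace $\sigma$ with estimate: $\frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$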



Comparison

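$\hat{\sigma} = \sqrt{\frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2}}$

$T = \frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$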



Example




Example

$\hat{\mu}_1 = 71.3$

$\hat{\mu}_2 = 68.7$

$\hat{\sigma}_1 = 10.2$

$\hat{\sigma}_2 = 11.3$

$n_1 = 47$

$n_2 = 36$



Example

$\hat{\sigma} = \sqrt{\frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(47 - 1)10.2^2 + (36 - 1)11.3^2}{47 + 36 - 2}} = 10.6892$

$t = \frac{71.3 - 68.7}{10.6892 \sqrt{\frac{1}{47} + \frac{1}{36}}} = 1.097$
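The arithmetic of this example can be checked with a short script. The sketch below takes the summary statistics from the example (means 71.3 and 68.7, standard deviations 10.2 and 11.3, sample sizes 47 and 36) and returns the pooled standard deviation, the observed test statistic, and the degrees of freedom; the function name is a label chosen for illustration.

```python
import math


def pooled_two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled standard deviation and two-sample t statistic for H0: mu1 = mu2."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return pooled_sd, t, n1 + n2 - 2


# Summary statistics from the example.
pooled_sd, t, df = pooled_two_sample_t(71.3, 10.2, 47, 68.7, 11.3, 36)
print(f"pooled sd = {pooled_sd:.4f}, t = {t:.2f}, df = {df}")
```

The pooled estimate comes out at about 10.689 and the test statistic close to 1.1 on 81 degrees of freedom; the p‐value would then be read from the t distribution with that many degrees of freedom, for example from tables.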



Paired T Tests

• In the prior pages, we looked at two sample t‐tests
• A stronger test is called the paired t‐test
• This test only works if the two samples we collect are actually data for the same group of n units, but at different times
• The paired t‐test involves simplifying the two data sets into one by finding the difference of each pair of data, and working with this single dataset
• Then we conduct a usual t‐test/hypothesis test on this single dataset of differences

Causation



• The causative aspect of a problem looks at the relationship between the explanatory and response variates
• Recall in chapter 1 we looked at 2 types of concepts that look at the relationship between X and Y
– Relative Risk
– Association
• Association involves calculating the correlation coefficient

$\hat{\rho} = r = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
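The correlation coefficient can be computed directly from the three sums of squares and cross products. The sketch below does so for the height and weight values that appear in the regression example later in these slides; the function name is chosen for illustration.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r = S_XY / sqrt(S_XX * S_YY)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


heights = [172, 162, 180, 170, 174]  # cm
weights = [60, 54, 72, 65, 64]       # kg
r = correlation(heights, weights)
print(f"r = {r:.3f}")
```

A value near 1 indicates a strong positive linear association in this small sample; it says nothing by itself about whether height causes weight.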

Causation



• In this course, we only have the skills to test for association
• This involves testing the hypothesis $H_0: \beta = 0$ in the regression model
• If $H_0: \beta = 0$ is true, then we can say there is no association between X and Y

Example



Example

$t = \frac{\hat{\beta} - \beta_0}{SE(\tilde{\beta})}$
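A sketch of how this test statistic could be evaluated is shown below. It assumes the standard error of the slope is estimated by $\hat{\sigma} / \sqrt{S_{XX}}$, in line with the sampling distribution of the slope estimator used elsewhere in these slides, and it reuses the height and weight values from the regression example later in the deck because the data for this example are not reproduced in the text.

```python
import math


def slope_t_statistic(xs, ys, beta0=0.0):
    """Return (beta_hat, se, t) for testing H0: beta = beta0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    # Residual sum of squares with n - 2 degrees of freedom.
    sse = sum((y - (alpha_hat + beta_hat * x)) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(sse / (n - 2))
    se = sigma_hat / math.sqrt(s_xx)
    return beta_hat, se, (beta_hat - beta0) / se


heights = [172, 162, 180, 170, 174]  # cm
weights = [60, 54, 72, 65, 64]       # kg
beta_hat, se, t = slope_t_statistic(heights, weights)
print(f"beta_hat = {beta_hat:.4f}, SE = {se:.4f}, t = {t:.2f}")
```

The observed value would be compared with a t distribution on n − 2 degrees of freedom to obtain the p‐value.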

Causation



• Association does NOT imply causation

• The course notes talk about why this is the case and how we can avoid making the wrong assumption using three techniques

– Blocking

– Repetition and Randomization

– Matching

Causation



Confounding

• Association does not imply causation

• There could be a third hidden variate that is related to both the explanatory and response variates and produces the apparent relationship between them: this is called confounding
• The difficulty with confounding variates is identifying them in the first place; if we miss them, we will make a wrong conclusion about the relationship between the explanatory and response variates
• If we can identify the confounding variates, then there are tools we can use when designing experimental plans to account for these variates

Causation



Blocking

• If we’ve identified the confounding variate, we neutralize its effect by collecting samples where the units have the same value for the confounding variate
• The Chicken Example:
– Response variate: growth rate of chickens
– Explanatory variate: protein in diet
– Confounding variate: gender of the chickens
– Blocking: look at samples of only male chickens and samples of only female chickens
– This eliminates the gender effect and the experimenter is able to look at the effects of protein in diet on the growth rate of chickens

Causation



Replication and Randomization

• If we cannot identify or control the confounding variate, we can also try to neutralize its effects by randomly allocating our controlled variate in the experimental plan
• The Medicine Example:
– Response variate: survival rate
– Explanatory variate: type of treatment
– Confounding variates: medical history/health of the patient
– Using randomization and replication to assign the treatment type to each unit will result in two very balanced groups in terms of their health/medical history
– This will eliminate the confounding variates as much as possible
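Random allocation of the kind described in the medicine example can be sketched in a few lines. The unit labels, the two treatment names, and the seed are hypothetical; the point is only that every unit is assigned by chance, so health and medical history cannot systematically favour one group.

```python
import random


def randomize(units, treatments=("treatment", "control"), seed=None):
    """Randomly allocate units to treatments in (near-)equal group sizes."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {name: [] for name in treatments}
    # Deal the shuffled units out in turn, like dealing cards.
    for position, unit in enumerate(shuffled):
        groups[treatments[position % len(treatments)]].append(unit)
    return groups


patients = [f"patient_{k}" for k in range(1, 21)]  # hypothetical unit labels
groups = randomize(patients, seed=231)
for name, members in groups.items():
    print(name, len(members))
```

Fixing the seed makes the allocation reproducible for record keeping; in practice the seed itself should be chosen without reference to the units.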

Causation



Matching and Observational Plans

• In observational plans, the experimenter cannot control the variates
• The method of matching is used where each unit being observed is compared with a control unit that has very similar characteristics to the unit in the plan (this is similar to blocking)
• Thus if there is a difference in the value observed between the sampled unit and the control unit, the difference is more plausibly due to the explanatory variate than to the matched characteristics

Prediction



• The predictive aspect of a problem involves using our collected data to estimate a value for a unit to be randomly selected from the population
• We will look at prediction intervals for
– Response
– Regression

Prediction



The Model: $Y = \mu + R$, so $Y \sim G(\mu, \sigma)$

The predicted unit: $Y_0$

Since $Y_0$ follows the response model then $Y_0 \sim G(\mu, \sigma)$

Prediction



What would be a logical choice to use as our predicted value?

• The average

We need the estimator for the mean parameter:

From MLE: $\tilde{\mu} = \frac{1}{n} \sum_{i=1}^{n} Y_i$

Sampling Distribution: $\tilde{\mu} \sim G(\mu, \frac{\sigma}{\sqrt{n}})$

Prediction



If we look at the difference between the value to be predicted and the estimator of the population average, then we have the random variable $Y_0 - \tilde{\mu}$

$\tilde{\mu} \sim G(\mu, \frac{\sigma}{\sqrt{n}})$, $Y_0 \sim G(\mu, \sigma)$

Prediction



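$Y_0 - \tilde{\mu} \sim G(0, \sigma \sqrt{1 + \frac{1}{n}})$

Standardizing gives $\frac{Y_0 - \tilde{\mu}}{\sigma \sqrt{1 + \frac{1}{n}}} \sim G(0, 1)$

Replacing $\sigma$ with an estimator gives $\frac{Y_0 - \tilde{\mu}}{\tilde{\sigma} \sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$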

Prediction



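Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)

Our ultimate goal: $a \le Y_0 \le b$

Since $\frac{Y_0 - \tilde{\mu}}{\tilde{\sigma} \sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$ we can make the probability statement:

$P\left( \left| \frac{Y_0 - \tilde{\mu}}{\tilde{\sigma} \sqrt{1 + \frac{1}{n}}} \right| \le c \right) = 0.95$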

Prediction



$P\left( -c \le \frac{Y_0 - \tilde{\mu}}{\tilde{\sigma} \sqrt{1 + \frac{1}{n}}} \le c \right) = 0.95$

Example



Let Y be the response variate representing body weight (kg). The following sample is collected:

60 54 72 65 64

Construct a 95% prediction interval for the body weight of someone we randomly select from the population.

$\hat{\mu} \pm c \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n}}$

Example



$\hat{\mu} \pm c \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n}}$
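The interval for this example can be worked out with the sketch below. It uses the five body weights from the example and takes c = 2.776, the 0.975 quantile of the t distribution with n − 1 = 4 degrees of freedom as read from a t table; hard‐coding that constant is a simplification chosen to keep the example within the standard library.

```python
import math
from statistics import mean, stdev


def prediction_interval(sample, c):
    """mu_hat +/- c * sigma_hat * sqrt(1 + 1/n) for a new unit Y_0."""
    n = len(sample)
    mu_hat = mean(sample)
    sigma_hat = stdev(sample)  # sample standard deviation, divisor n - 1
    half_width = c * sigma_hat * math.sqrt(1 + 1 / n)
    return mu_hat - half_width, mu_hat + half_width


weights = [60, 54, 72, 65, 64]  # kg, from the example
C_T4 = 2.776  # 0.975 quantile of t with 4 degrees of freedom (t table)
lower, upper = prediction_interval(weights, C_T4)
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

The interval is wide because it has to cover the variability of a single new unit as well as the uncertainty in the estimated mean, and with five observations both are large.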

Prediction



The Model: $Y = \alpha + \beta x_i + R$

But for our purposes, we will use a shifted version of the model: $Y = \alpha + \beta (x_i - \bar{x}) + R$

Prediction



The Model: $Y = \alpha + \beta (x_i - \bar{x}) + R$

The predicted unit: $Y_0$

We want to predict $Y_0$ given the subgroup $x_i = x_0$

Since $Y_0$ follows the regression model then $Y_0 \sim G(\alpha + \beta (x_0 - \bar{x}), \sigma)$

Prediction



What would be a logical choice to use as our predicted value?

• The average given the subgroup $x_i = x_0$, which we will denote $\tilde{\mu}(x_0)$

Regression Model: $Y = \alpha + \beta (x_i - \bar{x}) + R$

Average of the subgroup $x_i = x_0$: $\tilde{\mu}(x_0) = E[Y | x_0] = \tilde{\alpha} + \tilde{\beta} (x_0 - \bar{x})$

Prediction



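Using Maximum Likelihood Estimation we obtain the estimators

$\tilde{\alpha} = \frac{1}{n} \sum_{i=1}^{n} Y_i$

$\tilde{\beta} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{XY}}{S_{XX}}$

The sampling distributions of these two estimators are

$\tilde{\alpha} \sim G(\alpha, \frac{\sigma}{\sqrt{n}})$, $\tilde{\beta} \sim G(\beta, \frac{\sigma}{\sqrt{S_{XX}}})$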

Prediction



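What is the sampling distribution of $\tilde{\mu}(x_0) = \tilde{\alpha} + \tilde{\beta} (x_0 - \bar{x})$?

$\tilde{\alpha} \sim G(\alpha, \frac{\sigma}{\sqrt{n}})$, $\tilde{\beta} \sim G(\beta, \frac{\sigma}{\sqrt{S_{XX}}})$

$\tilde{\mu}(x_0) \sim G\left( \alpha + \beta (x_0 - \bar{x}), \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right)$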

Prediction



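If we look at the difference between the value to be predicted and the estimator of the subgroup average, then we have the random variable $Y_0 - \tilde{\mu}(x_0)$

The obvious next step would be to determine the sampling distribution of $Y_0 - \tilde{\mu}(x_0)$

$Y_0 \sim G(\alpha + \beta (x_0 - \bar{x}), \sigma)$

$\tilde{\mu}(x_0) \sim G\left( \alpha + \beta (x_0 - \bar{x}), \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right)$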

Prediction



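$Y_0 \sim G(\alpha + \beta (x_0 - \bar{x}), \sigma)$

$\tilde{\mu}(x_0) \sim G\left( \alpha + \beta (x_0 - \bar{x}), \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right)$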

Prediction



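$Y_0 - \tilde{\mu}(x_0) \sim G\left( 0, \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right)$

Standardizing gives $\frac{Y_0 - \tilde{\mu}(x_0)}{\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim G(0, 1)$

Estimating sigma gives $\frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t_{n-2}$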

Prediction



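Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)

Our ultimate goal: $a \le Y_0 \le b$

Since $\frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t_{n-2}$ we can make the probability statement:

$P\left( \left| \frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \right| \le c \right) = 0.95$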

Prediction



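$P\left( \left| \frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \right| \le c \right) = 0.95$

$P\left( -c \le \frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \le c \right) = 0.95$

$P\left( -c \cdot \tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \le Y_0 - \tilde{\mu}(x_0) \le c \cdot \tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right) = 0.95$

$P\left( \tilde{\mu}(x_0) - c \cdot \tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \le Y_0 \le \tilde{\mu}(x_0) + c \cdot \tilde{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right) = 0.95$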

Prediction



$\hat{\alpha} + \hat{\beta} (x_0 - \bar{x}) \pm c \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$

Upper and Lower bounds of a regression prediction interval

Example



Let Y be the response variate representing body weight (kg) and X be the explanatory variate representing body height (cm). The following sample is collected:

i     1     2     3     4     5
xi    172   162   180   170   174
yi    60    54    72    65    64

Construct a 95% prediction interval for the body weight of someone we randomly select from the population whose height is 175cm. Use $\hat{\sigma} = 2.97$

Example



$\hat{\alpha} + \hat{\beta} (x_0 - \bar{x}) \pm c \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$

i     1     2     3     4     5
xi    172   162   180   170   174
yi    60    54    72    65    64

Example



$\hat{\alpha} + \hat{\beta} (x_0 - \bar{x}) \pm c \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$
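The regression prediction interval for this example can be evaluated with the sketch below. It uses the five (height, weight) pairs from the example, the shifted model so that the intercept estimate is the mean response, and c = 3.182, the 0.975 quantile of the t distribution with n − 2 = 3 degrees of freedom as read from a t table. The function name and the hard‐coded quantile are choices made for illustration.

```python
import math


def regression_prediction_interval(xs, ys, x0, c):
    """Prediction interval for a new response at x0 under the shifted model
    Y = alpha + beta * (x - x_bar) + R."""
    n = len(xs)
    x_bar = sum(xs) / n
    alpha_hat = sum(ys) / n  # intercept estimate in the shifted model
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - alpha_hat) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    sse = sum((y - (alpha_hat + beta_hat * (x - x_bar))) ** 2
              for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(sse / (n - 2))
    centre = alpha_hat + beta_hat * (x0 - x_bar)
    half_width = c * sigma_hat * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return sigma_hat, centre, (centre - half_width, centre + half_width)


heights = [172, 162, 180, 170, 174]  # cm
weights = [60, 54, 72, 65, 64]       # kg
C_T3 = 3.182  # 0.975 quantile of t with 3 degrees of freedom (t table)
sigma_hat, centre, (lower, upper) = regression_prediction_interval(
    heights, weights, x0=175, c=C_T3)
print(f"sigma_hat = {sigma_hat:.2f}, fitted value = {centre:.2f}")
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

The computed estimate of sigma agrees with the value 2.97 supplied in the example, which is a useful check that the data were entered correctly.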

Outline



• Chapter 1

– Data types (discrete, continuous, categorical)

– Problem (3 different aspects)

– Populations (target, study, sample)

– Representations of data

• Graphical: histograms, CDFs, box plots

• Numerical: mean, standard deviation, IQR
– Bivariate Data
• Relative risk
• Correlation coefficient
• Chapter 2
– Review of probability distributions
– Random PPDAC examples…

PPDAC



PPDAC



Draw a frequency histogram of the Flash data, with bins given by the intervals (45 – 49.9), (50 – 54.9), etc.

First make a frequency table with the bin widths

Interval       Frequency
(45 – 49.9)    1
(50 – 54.9)    1
(55 – 59.9)    2
(60 – 64.9)    5
(65 – 69.9)    5
(70 – 74.9)    1
(75 – 79.9)    1
(80 – 84.9)    1
(85 – 89.9)    2
(90 – 94.9)    1
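Since the raw Flash data are not reproduced in these slides, the sketch below starts from the frequency table itself and prints a simple text histogram with one bar per bin. The layout of the output is an illustrative choice; any plotting tool would do the same job from the same table.

```python
# Frequency table copied from the slide: (lower bound, upper bound, frequency).
FREQUENCY_TABLE = [
    (45, 49.9, 1), (50, 54.9, 1), (55, 59.9, 2), (60, 64.9, 5), (65, 69.9, 5),
    (70, 74.9, 1), (75, 79.9, 1), (80, 84.9, 1), (85, 89.9, 2), (90, 94.9, 1),
]


def text_histogram(table):
    """One row per bin; the bar length equals the bin frequency."""
    rows = []
    for low, high, count in table:
        label = f"({low} - {high})"
        rows.append(f"{label:<14}{'#' * count}")
    return "\n".join(rows)


print(text_histogram(FREQUENCY_TABLE))
print("total observations:", sum(count for _, _, count in FREQUENCY_TABLE))
```

Because every bin has the same width, bar length proportional to frequency gives the same shape as a frequency histogram drawn by hand.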

PPDAC



Concept Review



• From the previous example:
– Target population, study population, sample, unit
– Response vs. explanatory variates

– Aspects

• Descriptive

• Causative

• Predictive

– Histograms

• Bin Width

• Frequency histogram

Outline



• Chapter 3

– Binomial Model

– Response Model

– Regression Model

– Maximum Likelihood Estimation

MLE



$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$

MLE



$l(\theta) = n \ln\theta + (\theta - 1) \sum_{i=1}^{n} \ln(x_i)$

Concept Review



• From the previous example:
– Maximum Likelihood Estimation Method
• Define likelihood function
• Define log likelihood function
• Differentiate with respect to the parameter
• Set to zero
• Solve for the parameter
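The steps listed above can be traced in code for one concrete model. The sketch assumes a density of the form f(x; θ) = θ x^(θ − 1) on (0, 1), for which the log‐likelihood is n ln θ + (θ − 1) Σ ln x_i; the model and the sample values are assumptions made for illustration. Differentiating gives n/θ + Σ ln x_i, and setting this to zero gives the closed form θ̂ = −n / Σ ln x_i.

```python
import math


def log_likelihood(theta, xs):
    """l(theta) = n ln(theta) + (theta - 1) * sum(ln x_i), for theta > 0."""
    return len(xs) * math.log(theta) + (theta - 1) * sum(math.log(x) for x in xs)


def score(theta, xs):
    """Derivative of the log-likelihood: n / theta + sum(ln x_i)."""
    return len(xs) / theta + sum(math.log(x) for x in xs)


def mle(xs):
    """Solution of score(theta) = 0."""
    return -len(xs) / sum(math.log(x) for x in xs)


# Hypothetical observations in (0, 1).
sample = [0.92, 0.79, 0.90, 0.65, 0.86, 0.47, 0.73, 0.97, 0.94, 0.77]
theta_hat = mle(sample)
print(f"theta_hat = {theta_hat:.3f}, score at theta_hat = {score(theta_hat, sample):.2e}")
```

Checking that the score is numerically zero at the estimate, and that the log‐likelihood is lower on either side, is a cheap way to catch algebra slips in the differentiation step.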

Outline



• Chapter 4

– Sampling distributions for estimators

– Introduction to new distributions

• Gaussian

• Chi‐squared

• t

– Confidence Interval
– Hypothesis Testing
– Confidence Intervals and Hypothesis Testing with the likelihood function

Confidence Intervals



Confidence Interval



Concepts Review



• From the previous example:
– Confidence Intervals for the response model, sigma unknown
– Structure of a symmetric confidence interval

Hypothesis Testing



Hypothesis Testing

For a paired t‐test, we create a new set of data

i       1      2      3      4      5      6      7      8
Diff    0.48   0.53   0.52   0.21   -0.05  0.44   0.41   0.68

i       9      10     11     12     13     14     15     16
Diff    0.46   0.76   3.09   0.26   0.34   0.32   -0.07  0.33

Hypothesis Testing

Test statistic: $T = \frac{\tilde{\mu}_D - \mu_0}{\tilde{\sigma}_D / \sqrt{n}} \sim t_{n-1}$
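The observed value of this test statistic can be computed from the sixteen differences listed for the paired example. The sketch below assumes the null value is 0 (no mean difference), which the slide text does not state explicitly, and uses the sample standard deviation of the differences.

```python
import math
from statistics import mean, stdev

# Differences from the paired example.
differences = [0.48, 0.53, 0.52, 0.21, -0.05, 0.44, 0.41, 0.68,
               0.46, 0.76, 3.09, 0.26, 0.34, 0.32, -0.07, 0.33]


def paired_t_statistic(diffs, mu0=0.0):
    """Observed t = (mean - mu0) / (sd / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(diffs)
    t = (mean(diffs) - mu0) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# mu0 = 0 is an assumed null value.
t, df = paired_t_statistic(differences)
print(f"mean difference = {mean(differences):.4f}, t = {t:.2f}, df = {df}")
```

The eleventh difference (3.09) is far from the others and has a large influence on both the mean and the standard deviation, so it would be worth checking before relying on the result.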

Hypothesis Testing




P‐value

Hypothesis Testing



Hypothesis Testing

For a 2 sample t‐test, we have two populations, with 2 sets of data

Hypothesis Testing

Test statistic: $T = \frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$

Hypothesis Testing

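$\hat{\sigma} = \sqrt{\frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(16 - 1)2.48^2 + (16 - 1)2.91^2}{16 + 16 - 2}} = 2.704$

Observed value of the test statistic: $t = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$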

Hypothesis Testing




P‐value

Concepts Review

• From the previous example:


• Define the null hypothesis

• Define the test statistic, identify the distribution, calculate

the observed value of the test statistic

• Calculate the p‐value

– 2 sample t test

– Paired t test
