Additional Slides on Bayesian Statistics for STA 101 Prof. Jerry Reiter Fall 2008
Transcript
Page 1: Additional Slides on Bayesian Statistics for STA 101 Prof. Jerry Reiter Fall 2008.

Additional Slides on Bayesian Statistics for STA 101

Prof. Jerry Reiter

Fall 2008

Page 2

Can we use this method to learn about means and percentages?

• To learn about population averages and percentages, we’ve used data (like the DNA test results), but not prior information (like the list of suspects).

• We show how to combine data and prior information in class.

Page 3

Combining the prior beliefs and the data using Bayes Rule

• In the Bayes rule problem from before break, we combined the prior beliefs and the data using Bayes rule.

• Pr(p | X = 1) represents our posterior beliefs about p.

Pr(p | X = 1) = Pr(X = 1 | p) Pr(p) / Pr(X = 1)

Page 4

Estimation of unknown parameters in statistical models (Bayesian and non-Bayesian)

• Suppose we posit a probability distribution to model data. How do we estimate its unknown parameters?

• Example: assume data follow regression model. Where do the estimates of the regression coefficients come from?

• Classical statistics: maximum likelihood estimation.

• Bayesian statistics: Bayes rule.

Page 5

Estimating percentage of Dukies who plan to get advanced degree

• Suppose we want to estimate the percentage of Duke students who plan to get an advanced degree (MBA, JD, MD, PhD, etc.). Call this percentage p.

• We sample 20 people at random, and 8 of them say they plan to get an advanced degree.

• What should be our estimate of p?

Page 6

Estimating the average IQ of Duke professors

• Let µ be the population average IQ of Duke profs.

• Suppose we randomly sample 25 Duke profs and record their IQs.

• What should be our estimate of µ?

[Figure: histogram and normal quantile plot of the 25 sampled IQs, roughly 100 to 170]

Prof IQs (hypothetical data) — Moments

Mean            132.16
Std Dev          11.710679
Std Err Mean      2.3421358
upper 95% Mean  136.99393
lower 95% Mean  127.32607
N                25

Page 7

Maximum likelihood estimation: A principled approach to estimation

• Usually we can use subject-matter knowledge to specify a distribution for the data. But, we don’t know the parameters of that distribution.

1) Number out of 20 who want advanced degree: binomial distribution.

2) Profs’ IQs: normal distribution.

Page 8

Maximum likelihood estimation

• We need to estimate the parameters of the distribution.

Why do we care?

A) So we can make probability statements about future events.

B) The parameters themselves may be important.

Page 9

Maximum likelihood estimation

• The maximum likelihood estimate of the unknown parameter is the value for which the data were most likely to have occurred.

• Let’s see how this works in the examples.

Page 10

Advanced degree example

• Let Y be the random variable for the number of people out of 20 that plan to get an advanced degree.

• Y has a binomial distribution with n = 20, and unknown probability p.

• In the data, Y = 8. If we knew p, the value of the probability distribution function at Y = 8 would be:

Pr(Y = 8) = [20! / (8! 12!)] p^8 (1 − p)^12
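As a quick numeric check (a sketch, not part of the slides), the pmf above can be evaluated in Python; `math.comb` supplies the 20!/(8! 12!) coefficient:

```python
from math import comb

def pr_y_equals_8(p):
    """Pr(Y = 8) for Y ~ Binomial(n = 20, p): (20! / (8! 12!)) p^8 (1-p)^12."""
    return comb(20, 8) * p**8 * (1 - p)**12

# Evaluate at a few candidate values of p
for p in (0.2, 0.4, 0.6):
    print(f"p = {p}: Pr(Y = 8) = {pr_y_equals_8(p):.4f}")
```

Among these candidates, p = 0.4 makes the observed Y = 8 most probable, which foreshadows the maximum likelihood idea on the next pages.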

Page 11

MLE for degree example

• Let’s graph Pr(Y = 8) as a function of the unknown p.

• Label the function L(p). L(p) is called the likelihood function for p.

Page 12

Maximum likelihood

• The maximum likelihood estimate of p is the value of p that maximizes L(p).

• This is a reasonable estimate because it is the value of p for which the observed data (Y = 8) had the greatest chance of occurring.

Page 13

Finding the MLE for degree example

• To maximize the likelihood function, we take the derivative of L(p) with respect to p, set it equal to zero, and solve for p.

• You get the sample percentage: p̂ = 8/20 = 0.40.

L(p) = [20! / (8! 12!)] p^8 (1 − p)^12
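The calculus result can be checked numerically with a simple grid search over L(p) (a sketch, assuming only the L(p) formula above):

```python
from math import comb

def L(p):
    """Likelihood of p given the observed y = 8 out of n = 20."""
    return comb(20, 8) * p**8 * (1 - p)**12

# Grid search over p in (0, 1); the calculus derivation gives the same answer exactly
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=L)
print(p_hat)  # 0.4, the sample percentage 8/20
```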

Page 14

Estimating the average IQ of Duke professors

• Let µ be the population average IQ of Duke profs.

• Suppose we randomly sample 25 Duke profs and record their IQs.

• What should be our estimate of µ?

[Figure and Moments table repeated from Page 6: Prof IQs (hypothetical data); Mean 132.16, Std Dev 11.71, Std Err Mean 2.34, N 25]

Page 15

Model for Professors’ IQs

• The mathematical function for a normal curve for any prof’s IQ, which we label Y, is:

• All normal curves have this form, with different means and SDs. Here, we’ll assume σ = 15. We don’t know µ, which is what we’re after.

f(y) = [1 / (σ √(2π))] e^( −(y − µ)² / (2σ²) )

Page 16

Model for all 25 IQs

• We need the function for all 25 IQs.

• Assuming each prof’s IQ is independent of other profs’ IQs, we have

f(y1, y2, ..., y25) = f(y1) f(y2) ... f(y25) = [1 / (15 √(2π))]^25 e^( −Σ_{i=1}^{25} (y_i − µ)² / (2 · 15²) )

Page 17

Model for all 25 IQs

• With some algebra and simplifications, the likelihood function is:

L(µ) = [1 / (15 √(2π))]^25 e^( −Σ_{i=1}^{25} (y_i − µ)² / (2 · 15²) )

Page 18

Likelihood function and maximum likelihood estimates

• A graph of the likelihood function looks something like this: [Figure: likelihood curve L(µ), peaked at the sample average]

• The function is maximized when µ is the sample average. So, we use 132.16 as our estimate of the average Duke prof’s IQ.

• This sample average is the MLE for µ in any normal curve.
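A sketch verifying this claim numerically. The 25 individual IQs are not given in the slides, so the sample below is hypothetical; the point is only that the grid maximizer of the log-likelihood equals the sample average:

```python
from math import log, pi

# Hypothetical stand-in for the sampled IQs (the slides only report summaries)
iqs = [120, 135, 128, 145, 132, 140, 125]
sigma = 15.0  # SD assumed known, as in the slides

def log_likelihood(mu):
    # log of prod_i [1/(sigma sqrt(2 pi))] exp(-(y_i - mu)^2 / (2 sigma^2))
    n = len(iqs)
    return -(n / 2) * log(2 * pi * sigma**2) - sum((y - mu)**2 for y in iqs) / (2 * sigma**2)

# Maximize over a fine grid of candidate mu values from 100 to 170
grid = [100 + i / 100 for i in range(7001)]
mu_hat = max(grid, key=log_likelihood)
print(mu_hat, sum(iqs) / len(iqs))  # maximizer matches the sample average (to grid precision)
```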

Page 19

The Bayesian approach to estimation of means

• Let’s show how to combine data and prior information to address the following motivating question:

What is a likely range for the average IQ of Duke professors?

Page 20

Combining the prior beliefs and the data using Bayes Rule

• We combine our prior beliefs and the data using Bayes rule.

• f(µ|data) represents our posterior beliefs about µ.

f(µ | data) = f(data | µ) f(µ) / f(data)

Page 21

Formalizing a model for prior information

• Let’s assign a distribution for µ that reflects our a priori beliefs about its likely range. Label this f(µ).

• Using the data you supplied in class, the curve describing our beliefs about µ is the normal curve with

mean = 128, SD = 15

Page 22

Mathematical equation for normal curve

• We can write down the equation for this normal curve.

f(µ) = [1 / (15 √(2π))] e^( −(µ − 128)² / (2 · 15²) )

Page 23

Model for the data (25 IQs)

• If we knew µ, the model for the data (the professors’ IQs) is

f(y1, y2, ..., y25 | µ) = [1 / (15 √(2π))]^25 e^( −Σ_{i=1}^{25} (y_i − µ)² / (2 · 15²) )

Page 24

Estimating the average IQ of Duke professors

• Let µ be the population average IQ of Duke profs.

• Suppose we randomly sample 25 Duke profs and record their IQs.

• What should be our estimate of µ?

[Figure and Moments table repeated from Page 6: Prof IQs (hypothetical data); Mean 132.16, Std Dev 11.71, Std Err Mean 2.34, N 25]

Page 25

Combining the prior beliefs and the data using Bayes Rule

• We combine the model for the prior beliefs and the model for the data using Bayes rule.

• f(µ|data) represents our posterior beliefs about µ.

f(µ | data) = f(data | µ) f(µ) / f(data)

Page 26

Posterior distribution

• Using calculus, one can show that f(µ|data) is a normal curve with

mean = [ (1/SE²) ȳ + (1/Prior SD²)(Prior mean) ] / [ (1/SE²) + (1/Prior SD²) ]

SD = √( 1 / ( (1/SE²) + (1/Prior SD²) ) )

Page 27

Posterior distribution

• For our data and prior beliefs, the posterior distribution f(µ|data) is a normal curve with

mean = [ (1/2.34²)(132.16) + (1/15²)(128) ] / [ (1/2.34²) + (1/15²) ] = 132.06

SD = √( 1 / ( (1/2.34²) + (1/15²) ) ) = 2.313
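Plugging the slide's numbers into the formulas from Page 26 (a sketch; SE is taken as 2.3421358 from the Moments table on Page 6):

```python
from math import sqrt

# Values from the slides: sample mean, its standard error (11.710679 / sqrt(25)),
# and the prior mean and prior SD elicited in class
ybar, se = 132.16, 2.3421358
prior_mean, prior_sd = 128.0, 15.0

# Precision-weighted combination from the posterior formulas
w_data, w_prior = 1 / se**2, 1 / prior_sd**2
post_mean = (w_data * ybar + w_prior * prior_mean) / (w_data + w_prior)
post_sd = sqrt(1 / (w_data + w_prior))

print(round(post_mean, 2), round(post_sd, 3))
```

The SD comes out about 2.314 here; the slides show 2.313 on this page and 2.314 on the next, a small difference that comes from how much the SE is rounded.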

Page 28

Using the posterior distribution to summarize beliefs about µ

• Because f(µ|data) describes beliefs about µ, we can make probability statements about µ.

• For example, using a normal curve with mean equal to 132.06 and SD equal to 2.314,

Pr(µ > 130 | data) = .813

• A 95% posterior interval for µ stretches from

127.52 to 136.59.
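These probability statements can be reproduced with Python's `statistics.NormalDist` (a sketch using the slide's posterior mean and SD; the 95% interval uses the usual mean ± 1.96 SD rule):

```python
from statistics import NormalDist

# Posterior for mu from the slides: normal with mean 132.06 and SD 2.314
posterior = NormalDist(mu=132.06, sigma=2.314)

# Pr(mu > 130 | data)
pr_above_130 = 1 - posterior.cdf(130)
print(round(pr_above_130, 3))  # 0.813, matching the slide

# 95% posterior interval: mean +/- 1.96 * SD
lo = posterior.mean - 1.96 * posterior.stdev
hi = posterior.mean + 1.96 * posterior.stdev
print(round(lo, 2), round(hi, 2))  # close to the slide's 127.52 to 136.59
```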

Page 29

Bayesian statistics in general

• Bayesian methods exist for any population parameter, including percentiles, maxima and minima, ratios, etc.

• The method is general:

1) specify a mathematical curve that reflects prior beliefs about the population parameter.

2) specify a mathematical curve that describes the distribution of the data, given a value of the population parameter.

3) combine the curves from 1 and 2 mathematically to get posterior beliefs for the parameter, updated for the data.
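The three steps above can be sketched on a grid, reusing the IQ example. This grid approximation is my own illustration, not the slides' calculus; the prior, the data summary (sample mean 132.16 with SE 2.34), and the posterior mean it recovers are all from the slides:

```python
from statistics import NormalDist

# Step 1: curve reflecting prior beliefs about the parameter mu: N(128, 15)
prior = NormalDist(128, 15)

# Step 2: curve for the data given mu; the data are summarized by the
# sample mean 132.16, whose sampling distribution has SD equal to the SE 2.34
def f_data_given(mu):
    return NormalDist(mu, 2.34).pdf(132.16)

# Step 3: combine the two curves on a grid of candidate mu values
grid = [100 + i * 0.01 for i in range(7001)]          # mu from 100 to 170
unnorm = [prior.pdf(mu) * f_data_given(mu) for mu in grid]
total = sum(unnorm) * 0.01                            # approximates f(data)
posterior = [u / total for u in unnorm]               # posterior density on the grid

mu_mode = grid[unnorm.index(max(unnorm))]
print(round(mu_mode, 2))  # near the posterior mean of 132.06 from Page 27
```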

Page 30

Differences between frequentist and Bayesian

FREQUENTIST

• Parameters are not random.

• Confidence intervals.

BAYESIAN

• Parameters are random.

• Posterior distributions.