Hadley Wickham Stat405 Bootstrapping Tuesday, 21 September 2010
Page 1: 09 bootstrapping

Hadley Wickham

Stat405 Bootstrapping

Tuesday, 21 September 2010

Page 2: 09 bootstrapping

1. The project

2. Hypothesis testing revision

3. Bootstrapping


Page 3: 09 bootstrapping

The project


Page 4: 09 bootstrapping

Project

About the movies data (sorry for confusion)

Like a homework, but bigger, and you will work on it as a group. 4-5 main questions.

There will be a single grade for the project, but individual grades will be adjusted based on effort (peer rating form).


Page 5: 09 bootstrapping

Team work

Working in teams is tough, but it is a vital skill to gain.

Take 10 minutes to discuss expectations with your team. Make sure to sign the sheet, and one team member should take responsibility for copying and distributing it.

Make sure you talk about how to get in touch.

Handouts: team policies, team expectations.


Page 6: 09 bootstrapping

Firing & Quitting

You may fire a non-participating team member, but you need to meet with me and issue a written warning.

If you feel that you are doing all the work in your team, you may quit. You’ll also need to meet with me and give a written warning to the rest of your team.


Page 7: 09 bootstrapping

Project meetings

I think I have scheduled all teams. If your team doesn’t have a time yet, please get in touch ASAP.

Bring your main questions, as well as any initial plots you have made to answer them.

The more of a start you have made, the more I’ll be able to help you.


Page 8: 09 bootstrapping

Hypothesis testing


Page 9: 09 bootstrapping

Goal

The casino claims that its slot machines have a prize payout of 92%, but the payoff for the 345 pulls we observed is 67%. Is the casino lying?

(House advantage of 8% vs. 33%)

(Big caveat: we’re using a prize calculation function we know to be incorrect)


Page 10: 09 bootstrapping

http://www.flickr.com/photos/joegratz/117048243

Hypothesis testing

The statistical justice system


Page 11: 09 bootstrapping

A suspect is accused of a crime. The suspect is declared guilty or innocent based on a trial. Each trial has a defence and a prosecution. On the basis of how evidence compares to a standard, the judge declares them guilty or not guilty.


Page 12: 09 bootstrapping

A dataset is accused of having a particular parameter value. The data is declared guilty or innocent based on the results of a statistical test. Each test has a null hypothesis and an alternative hypothesis. On the basis of how a test statistic compares to a standard distribution, we make the decision to reject the null or fail to reject the null hypothesis.


Page 13: 09 bootstrapping

Your turn

Write down the null and alternative hypotheses for this case. How could we generate the distribution of payoffs under the null hypothesis?


Page 14: 09 bootstrapping

Hypothesis testing

Null hypothesis: Mean payout is 92%

Alternative hypothesis: Mean payout is less than 92%

How could we generate samples from the null distribution?


Page 15: 09 bootstrapping

# Could assume that the payoffs come from a normal
# distribution (sample mean 0.67, standard deviation 2.55).
# This gives a t-test:

> t.test(slots$prize, mu = 0.92, alternative = "less")

        One Sample t-test

data:  slots$prize
t = -1.8026, df = 344, p-value = 0.03616
alternative hypothesis: true mean is less than 0.92
95 percent confidence interval:
      -Inf 0.8989483
sample estimates:
mean of x
0.6724638
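As a rough check on the output above, the t statistic can be recomputed from the summary numbers alone. This sketch assumes the rounded values quoted on the slide (mean 0.6724638, standard deviation 2.55, n = 345), so it only approximately reproduces the t.test() result. The variable names are illustrative and not part of the course code.

```r
# Summary numbers quoted on the slide (sd is rounded, so results are approximate)
x_bar <- 0.6724638   # sample mean payout
s     <- 2.55        # sample standard deviation (rounded)
n     <- 345         # number of pulls
mu0   <- 0.92        # payout claimed by the casino

# One-sample t statistic and one-sided p-value (alternative: mean < mu0)
se      <- s / sqrt(n)
t_stat  <- (x_bar - mu0) / se
p_value <- pt(t_stat, df = n - 1)

round(c(t = t_stat, p = p_value), 3)
```

The normality assumption behind this calculation is what the histogram on the next slide calls into question.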


Page 16: 09 bootstrapping

[Figure: histogram of slots$prize. x-axis: slots$prize (0 to 20); y-axis: count (0 to 300).]


Page 17: 09 bootstrapping

Alternative approach

Assume that the distribution under the null hypothesis is similar to the empirical (data) distribution, but with a different mean.

This is called bootstrapping, and is a very widely used statistical technique.
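One common way to act on this idea is to shift the observed data so that its mean matches the null value, then resample from the shifted data. The sketch below is only an illustration under stated assumptions: the prize vector is simulated (the real slots data is not reproduced here), and this is not necessarily the approach taken on the later slides, which resample the individual windows instead.

```r
set.seed(405)

# Hypothetical stand-in for slots$prize: mostly zeros with a few larger wins
prize <- sample(c(0, 2, 5, 10, 40), size = 345, replace = TRUE,
                prob = c(0.85, 0.08, 0.04, 0.02, 0.01))

# Shift the data so its mean equals the value claimed under the null (0.92)
null_prize <- prize - mean(prize) + 0.92

# Resample with replacement and record the mean of each bootstrap sample
boot_means <- replicate(2000, mean(sample(null_prize, replace = TRUE)))

# One-sided p-value: how often is a null mean at least as low as the observed one?
p_value <- mean(boot_means <= mean(prize))
p_value
```

Note that shifting can produce negative "prizes", which is one reason to treat this as a sketch of the idea rather than a realistic model of the machine.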


Page 18: 09 bootstrapping

Bootstrapping


Page 19: 09 bootstrapping

Bootstrapping

So to answer the question we’re interested in, we need to get a better grasp on the distribution of prizes.

One set of 345 pulls is too few; we want to simulate many. We’ll start by simulating a single pull and then work our way up.


Page 20: 09 bootstrapping

Your turn

Now we want to generate a new draw from the same distribution (a bootstrap sample).

Write a function that returns the prize from a randomly generated new draw. Call the function random_prize.

Hint: sample(slots$w1, 1) will draw a single sample randomly from the original data. The next slide has the code you need to get started.


Page 21: 09 bootstrapping

slots <- read.csv("slots.csv", stringsAsFactors = FALSE)

calculate_prize <- function(windows) {
  payoffs <- c("DD" = 800, "7" = 80, "BBB" = 40,
    "BB" = 25, "B" = 10, "C" = 10, "0" = 0)

  same <- length(unique(windows)) == 1
  allbars <- all(windows %in% c("B", "BB", "BBB"))

  if (same) {
    prize <- payoffs[windows[1]]
  } else if (allbars) {
    prize <- 5
  } else {
    cherries <- sum(windows == "C")
    diamonds <- sum(windows == "DD")

    prize <- c(0, 2, 5)[cherries + 1] *
      c(1, 2, 4)[diamonds + 1]
  }
  unname(prize)
}


Page 22: 09 bootstrapping

w1 <- sample(slots$w1, 1)
w2 <- sample(slots$w2, 1)
w3 <- sample(slots$w3, 1)

calculate_prize(c(w1, w2, w3))


Page 23: 09 bootstrapping

random_prize <- function() {
  w1 <- sample(slots$w1, 1)
  w2 <- sample(slots$w2, 1)
  w3 <- sample(slots$w3, 1)

  calculate_prize(c(w1, w2, w3))
}

# What is the implicit assumption here?
# How could we test that assumption?
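One reading of the question above is that sampling each window separately treats the three windows as independent. A rough way to probe that is a chi-squared test on the cross-tabulation of two windows. The data below are simulated stand-ins, independent by construction, so the result says nothing about the actual machines; with the real data you would tabulate slots$w1 against slots$w2 (and the other pairs) instead.

```r
set.seed(405)

# Hypothetical window symbols, drawn independently by construction
symbols <- c("0", "B", "BB", "BBB", "C", "7", "DD")
toy <- data.frame(
  w1 = sample(symbols, 345, replace = TRUE),
  w2 = sample(symbols, 345, replace = TRUE),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two windows and test for association.
# simulate.p.value avoids relying on the chi-squared approximation
# when some cells have small expected counts.
tab <- table(toy$w1, toy$w2)
test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
test$p.value
```

A small p-value on the real data would suggest the windows are not independent, in which case resampling whole rows might be a safer choice than resampling each window on its own.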


Page 24: 09 bootstrapping

Your turn

Write a function to do this n times

Use a for loop: create an empty vector, and then fill with values

Draw a histogram of the results


Page 25: 09 bootstrapping

n <- 100
prizes <- rep(NA, n)

for(i in seq_along(prizes)) {
  prizes[i] <- random_prize()
}
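For comparison, the same preallocate-and-fill pattern can be written with replicate(), which handles the bookkeeping itself. The sketch below uses toy_prize(), a hypothetical stand-in for random_prize(), so that it runs without the slots data; resetting the seed before each version shows that both make the same draws.

```r
# Hypothetical stand-in for random_prize(), so the sketch runs on its own
toy_prize <- function() sample(c(0, 2, 5), 1)

n <- 100

# For loop: preallocate, then fill
set.seed(1)
loop_prizes <- rep(NA_real_, n)
for (i in seq_along(loop_prizes)) {
  loop_prizes[i] <- toy_prize()
}

# replicate(): same draws, no explicit bookkeeping
set.seed(1)
rep_prizes <- replicate(n, toy_prize())

identical(loop_prizes, rep_prizes)
```

The for loop is still worth writing out once, since it makes the "create an empty vector, then fill it" idea explicit.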


Page 26: 09 bootstrapping

random_prizes <- function(n = 345) {
  prizes <- rep(NA, n)

  for(i in seq_along(prizes)) {
    prizes[i] <- random_prize()
  }
  prizes
}

library(ggplot2)
qplot(random_prizes())

payout <- function() mean(random_prizes(345))
payout()


Page 27: 09 bootstrapping

# Windows should be a matrix with a column for each window
calculate_prize <- function(windows) {
  payoffs <- c("DD" = 800, "7" = 80, "BBB" = 40,
    "BB" = 25, "B" = 10, "C" = 10, "0" = 0)

  prize <- rep(NA, nrow(windows))

  # Three identical symbols: look the prize up by name
  same <- windows[, 1] == windows[, 2] & windows[, 2] == windows[, 3]
  prize[same] <- payoffs[windows[same, 1]]

  # Mixed bars only: three identical bars were already paid above,
  # which matches the if/else order of the one-pull version
  bars <- windows == "B" | windows == "BB" | windows == "BBB"
  all_bars <- bars[, 1] & bars[, 2] & bars[, 3] & !same
  prize[all_bars] <- 5

  # Everything else: count cherries and diamonds in each row
  other <- !same & !all_bars
  other_windows <- windows[other, , drop = FALSE]  # stay a matrix even for one row
  cherries <- rowSums(other_windows == "C")
  diamonds <- rowSums(other_windows == "DD")
  prize[other] <- c(0, 2, 5)[cherries + 1] *
    c(1, 2, 4)[diamonds + 1]

  unname(prize)
}

random_prizes <- function(n) {
  w1 <- sample(slots$w1, n, replace = TRUE)
  w2 <- sample(slots$w2, n, replace = TRUE)
  w3 <- sample(slots$w3, n, replace = TRUE)
  calculate_prize(cbind(w1, w2, w3))
}

This is a much faster version. You should be able to figure out how it works.
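The least obvious step in the vectorised version is probably the lookup by position: the cherry and diamond counts, plus one, are used as indices into small vectors of prizes and multipliers. The sketch below works through that step on a hand-made matrix of four pulls; the matrix and the names base and multiplier are illustrative, not part of the course code.

```r
# Small hand-made example: one row per pull, one column per window
windows <- rbind(
  c("C", "0",  "0"),
  c("C", "C",  "DD"),
  c("0", "DD", "DD"),
  c("0", "0",  "B")
)

# Count cherries and diamonds in every row at once
cherries <- rowSums(windows == "C")
diamonds <- rowSums(windows == "DD")

# Use the counts (plus one) as positions in small lookup vectors
base       <- c(0, 2, 5)[cherries + 1]   # 0, 1 or 2 cherries
multiplier <- c(1, 2, 4)[diamonds + 1]   # 0, 1 or 2 diamonds
prize <- base * multiplier
prize
```

Because every step operates on whole columns, there is no loop over pulls, which is where the speed-up comes from.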


Page 28: 09 bootstrapping

Your turn

Now we want to do this repeatedly to learn about the distribution of the payout. Write a function that generates the payout from m sets of 345 pulls.

Then think about how this helps answer our question.


Page 29: 09 bootstrapping

payouts <- function(m) {
  payouts <- rep(NA, m)

  for(i in seq_along(payouts)) {
    payouts[i] <- payout()
  }
  payouts
}

payouts(10)


Page 30: 09 bootstrapping

quantile(payouts(10000), c(.05, 0.5, .95))

# However, this is too low because our prize
# function is incorrect - once you've done the
# homework try it out with your function.
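One way to read the quantiles is to ask whether the claimed payout of 0.92 falls inside the central 90% of simulated payouts. The sketch below shows the mechanics with toy_payout(), a hypothetical stand-in for payout() that draws prizes from made-up probabilities, so its numbers are not the course results.

```r
set.seed(405)

# Hypothetical stand-in for payout(): mean prize over n simulated pulls
toy_payout <- function(n = 345) {
  mean(sample(c(0, 2, 5, 10, 40), n, replace = TRUE,
              prob = c(0.85, 0.08, 0.04, 0.02, 0.01)))
}

# Simulate many payouts and take the 5% and 95% quantiles
sims <- replicate(5000, toy_payout())
interval <- quantile(sims, c(0.05, 0.95))
interval

# Is the claimed payout inside the central 90% of simulated payouts?
claimed <- 0.92
claimed >= interval[[1]] && claimed <= interval[[2]]
```

With your corrected prize function, the same check can be run on payouts(10000) in place of sims.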
