Bayesian Models in R
Vivian Zhang | SupStat Inc.
Copyright SupStat Inc., All rights reserved
http://docs.supstat.com/BayesianModelEN/#1 (10/3/14, 13:37)
Outline
1. Introduction to Bayes and Bayes' Theorem
2. Distribution estimation
3. Conditional probability
4. Bayesian models
Introduction to Bayes and Bayes' Theorem
The Story Behind the Bayesian Model
Thomas Bayes
Source: http://www.bioquest.org/products/auth_images/422_bayes.gif
· 18th-century English statistician
· Best known for Bayes' theorem
· Essential contributor to the early development of probability theory
The Model
1. Models using Bayes' theorem (based on conditional probability)
   · Naive Bayes, association rules
2. Bayes decision theory
   · The classical Bayesian model for decision theory
3. Models implementing Bayesian thinking
   · Treat all parameters as random variables, especially in hierarchical models
Distribution Estimation
Distribution Estimation: Probability Density Function
· In statistics, the probability density function (PDF) of a continuous random variable describes how probable it is for the variable to fall near a given point.
· Example: plot of the PDF of the normal distribution
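A normal-density plot like the one on this slide can be reproduced in base R with dnorm (a minimal sketch; the standard-normal parameters are illustrative):

```r
# Evaluate and plot the PDF of the standard normal distribution
x = seq(-4, 4, by = 0.01)
dens = dnorm(x, mean = 0, sd = 1)
plot(x, dens, type = "l",
     main = "PDF of the Normal distribution", ylab = "density")
# The density peaks at the mean, where it equals 1/sqrt(2*pi)
dnorm(0)
```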
Distribution Estimation: Probability Density Function
· The PDF has an important place in statistics:
  - It contains all the information about the random variable.
· Knowing the PDF, we can calculate the:
  - Mean
  - Variance
  - Median
  - etc.
Distribution Estimation: Probability Density Function
Once you obtain the PDF, you can get everything from a random variable. This allows you to perform:
· Bayesian hypothesis tests
· Bayesian interval estimation
· Bayesian regression models
· Bayesian logistic models
· etc.
Distribution Estimation: Probability Density Function
Example: Bayesian regression
· Estimation methods for the regression model
    Y = Xβ + ε,  ε ∼ N(0, σ²)
· OLS (ordinary least squares)
  - β̂ = (X′X)⁻¹X′Y is the estimator of β
  - β ∼ N((X′X)⁻¹X′Y, σ²(X′X)⁻¹)
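The OLS formula above can be checked directly against lm(); a short sketch on simulated data (the design matrix and coefficients below are made up for illustration):

```r
# beta_hat = (X'X)^(-1) X'Y, computed from the normal equations
set.seed(1)
n = 200
X = cbind(1, rnorm(n))                 # intercept plus one predictor
Y = X %*% c(2, 0.5) + rnorm(n)         # Y = X beta + eps, eps ~ N(0, 1)
beta_hat = solve(t(X) %*% X) %*% t(X) %*% Y
# lm() produces the same coefficients
cbind(normal_eq = beta_hat, lm = coef(lm(Y ~ X[, 2])))
```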
Distribution Estimation: The Bayesian Model
· Before obtaining data, one has beliefs about the value of the proportion and models those beliefs in terms of a prior distribution.
· After data have been observed, one updates one's beliefs about the proportion by computing the posterior distribution.
Distribution Estimation: The Bayesian Model
· Building a Bayesian model begins with Bayesian thinking (every value has its own distribution).
· Steps to build a Bayesian model:
  - Make inferences about the prior distribution
  - Calculate the parameters of the posterior distribution
  - Finish the statistical task (interval estimation, statistical decision, etc.)
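The three steps above can be sketched with the simplest conjugate case, a Beta prior on a binomial proportion (the prior parameters and data below are invented for illustration):

```r
# Step 1: prior beliefs about a proportion, theta ~ Beta(a, b)
a = 2; b = 2
# Step 2: observe s successes in n trials; conjugacy gives
#         the posterior Beta(a + s, b + n - s) in closed form
s = 7; n = 10
a_post = a + s
b_post = b + n - s
# Step 3: finish the statistical task, e.g. a point estimate
#         and a 95% credible interval
post_mean = a_post / (a_post + b_post)
ci = qbeta(c(0.025, 0.975), a_post, b_post)
c(mean = post_mean, lower = ci[1], upper = ci[2])
```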
Inferring from the Posterior Distribution
Essentials:
· Posterior inference is the core of Bayes' theorem: we do not actually know the population distribution that generated our data, so we use the conditional distribution to address this gap indirectly. A certain degree of mathematical sophistication is required here, without which we cannot easily implement the model computationally.
· Bayes' theorem
· Conditional distribution
  - For example: in regression, ε is from a normal distribution
· Certain prior distribution
  - No information given
Calculating the Posterior Distribution
The most difficult part is calculating the posterior distribution, which requires integration.
· Markov chain Monte Carlo (MCMC)
  - Gibbs sampling
  - Metropolis-Hastings (MH) method
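A minimal random-walk Metropolis-Hastings sketch (the N(3, 1) target is invented for illustration; in practice one would use a package such as rjags or rstan):

```r
# Random-walk Metropolis-Hastings sampling from a N(3, 1) target
set.seed(42)
log_target = function(theta) dnorm(theta, mean = 3, sd = 1, log = TRUE)
n_iter = 10000
draws = numeric(n_iter)
theta = 0                               # arbitrary starting value
for (i in 1:n_iter) {
  proposal = theta + rnorm(1)           # symmetric proposal
  # accept with probability min(1, target(proposal) / target(theta))
  if (log(runif(1)) < log_target(proposal) - log_target(theta)) {
    theta = proposal
  }
  draws[i] = theta
}
mean(draws[-(1:1000)])                  # close to the target mean of 3
```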
Conditional probability
Conditional Probability: What is conditional probability?
· The probability that event A will occur given that event B has occurred. This probability is written as P(A|B).
    P(A|B) = P(AB) / P(B)
· A and B are two events
· P(AB) is the probability that both A and B occur
· P(B) is the probability that B occurs
Conditional Probability: Why conditional probability? An example
· Suppose:
  - A: the event of getting a cold
  - B: the event of a rainy day (P(B) = 0.2)
  - AB: the event that it rains and you get a cold (P(AB) = 0.1)
    P(A|B) = P(AB)/P(B) = 0.1/0.2 = 0.5
· Interpretation:
  - When it rains, the probability of getting a cold is 50%
Conditional Probability: Exercise
· There are two kids in a family.
  - If one of the kids is a boy, the probability that the other one is also a boy is 1/3.
  - If the first one is a boy, the probability that the other one is a boy is 1/2.
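Both answers can be verified by simulation (a quick sketch; 1 encodes a boy, 0 a girl):

```r
# Simulate 100,000 two-child families; each child is a boy with prob 1/2
set.seed(1)
kids = matrix(rbinom(200000, 1, 0.5), ncol = 2)
both_boys = rowSums(kids) == 2
at_least_one_boy = rowSums(kids) >= 1
# P(both boys | at least one boy): close to 1/3
mean(both_boys[at_least_one_boy])
# P(second is a boy | first is a boy): close to 1/2
mean(kids[kids[, 1] == 1, 2])
```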
Conditional Probability: Models related to conditional probability
· Apriori
  - Mining association rules
  - The confidence of the rule from A to B is defined as:
      A => B : P(AB)/P(A) = P(B|A)
· In R, use the arules package
Conditional Probability: Apriori
· Goal: find the items with strong relationships
· First, load the data:
library(arules)
data = read.csv("data/BASKETS1n")
names(data)
[1] "cardid" "value" "pmethod" "sex" "homeown" "income"
[7] "age" "fruitveg" "freshmeat" "dairy" "cannedveg" "cannedmeat"
[13] "frozenmeal" "beer" "wine" "softdrink" "fish" "confectionery"
Conditional Probability: Apriori
basket = data[, 8:18]
names(basket)[which(basket[1, ] == T)]
[1] "freshmeat" "dairy" "confectionery"
tbs2 = apply(basket, 1, function(x) names(basket)[which(x==T)])
len = sapply(tbs2, length)
require(arules)
trans.code = rep(1:1000, len)
trans.items = unname(unlist(tbs2))
trans.code.ind = match(trans.code, unique(trans.code))
trans.items.ind = match(trans.items, unique(trans.items))
Conditional Probability: Apriori
mat = sparseMatrix(i = trans.items.ind,
j = trans.code.ind,
x = 1,
dims = c(length(unique(trans.items)),
length(unique(trans.code))))
mat = as(mat, 'ngCMatrix')
# after setting the arguments we get the model:
trans.res = apriori(mat,parameter = list(confidence=0.05,
support=0.05,
minlen=2,maxlen=3))
Conditional Probability: Apriori
parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen target ext
0.05 0.1 1 none FALSE TRUE 0.05 2 3 rules FALSE
algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[11 item(s), 940 transaction(s)] done [0.00s].
sorting and recoding items ... [11 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [108 rule(s)] done [0.00s].
Conditional Probability: Apriori
· At last, we have the items with the strongest relationships in one basket
#let's see these rules:
lhs.generic = unique(trans.items)[trans.res@lhs@data@i+1]
rhs.generic = unique(trans.items)[trans.res@rhs@data@i+1]
cbind(lhs.generic, rhs.generic)[1:10, ]
lhs.generic rhs.generic
[1,] "dairy" "confectionery"
[2,] "confectionery" "dairy"
[3,] "dairy" "fish"
[4,] "fish" "dairy"
[5,] "dairy" "fruitveg"
[6,] "fruitveg" "dairy"
[7,] "dairy" "frozenmeal"
[8,] "frozenmeal" "dairy"
[9,] "freshmeat" "confectionery"
[10,] "confectionery" "freshmeat"
Conditional Probability: Models related to conditional probability
· Naive Bayes
  - Used in recommendation systems and classification problems
  - Compute the posterior probability P(C|A1, A2, ..., An) for all values of C using Bayes' theorem:
      P(C|A1 A2 ... An) = P(A1 A2 ... An|C) × P(C) / P(A1 A2 ... An)
  - Choose the value of C that maximizes P(C|A1, A2, ..., An)
  - Equivalent to choosing the value of C that maximizes P(A1, A2, ..., An|C)P(C)
Naive Bayes

library(e1071)   # provides naiveBayes()
data(iris)
m = naiveBayes(Species ~ ., data=iris)
## alternatively:
m = naiveBayes(iris[, -5], iris[, 5])
Naive Bayes
Model:
m
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = iris[, -5], y = iris[, 5])
A-priori probabilities:
iris[, 5]
setosa versicolor virginica
0.33333 0.33333 0.33333
Conditional probabilities:
Sepal.Length
iris[, 5] [,1] [,2]
setosa 5.006 0.35249
Naive Bayes
Predict:
table(predict(m, iris), iris[,5])
setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
From Conditional Probability to Bayes' Theorem
· We have:
    P(B|A) = P(AB)/P(A)
· So:
    P(AB) = P(B|A)P(A)
· Substituting into the conditional probability:
    P(A|B) = P(AB)/P(B) = P(B|A)P(A)/P(B)
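The inversion can be checked with the rain/cold numbers from the earlier example; the overall cold probability P(A) = 0.25 is an invented value, added only so both directions can be computed:

```r
p_B  = 0.2    # P(B): rainy day (from the earlier example)
p_AB = 0.1    # P(AB): rain and a cold (from the earlier example)
p_A  = 0.25   # P(A): assumed overall cold probability (illustrative)
p_B_given_A = p_AB / p_A                 # conditional probability
p_A_given_B = p_B_given_A * p_A / p_B    # Bayes' theorem
p_A_given_B                              # equals P(AB)/P(B) = 0.5
```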
Bayes' Theorem
    P(A|B) = P(B|A)P(A)/P(B)
· Bayes' theorem relates a conditional probability to the marginal distribution of a random variable. It tells us how to update our thinking after obtaining new data.
· Harold Jeffreys claimed that Bayes' theorem is to statistics as the Pythagorean theorem is to geometry.
Bayes' Theorem: Continuous situation
· The Bayes' theorem above is stated in discrete form
· In the real world we are often analyzing continuous random variables
· Bayes' theorem can be written in continuous form as:
    π(θ|x) = f(x|θ)π(θ) / m(x)
Bayes' Theorem: Continuous form
    π(θ|x) = f(x|θ)π(θ) / m(x)
· Here:
  - θ is an unknown parameter
  - X is the data observed
  - Processing goes from π(θ) to π(θ|x)
  - Our original knowledge of θ is updated after we observe X
Bayes' Theorem: Continuous form
· Based on the properties of continuous random variables, it can be written as:
    π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ
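The continuous form can be evaluated numerically with integrate(); the sketch below uses an invented example, a N(0, 1) prior and one observation x = 2 from N(θ, 1), where the conjugate posterior is known to be N(1, 1/2):

```r
# Posterior density from Bayes' theorem, with m(x) computed numerically
x_obs = 2
prior      = function(theta) dnorm(theta, 0, 1)        # pi(theta)
likelihood = function(theta) dnorm(x_obs, theta, 1)    # f(x|theta)
m_x = integrate(function(t) likelihood(t) * prior(t), -Inf, Inf)$value
posterior = function(theta) likelihood(theta) * prior(theta) / m_x
# Posterior mean by numerical integration; the conjugate answer is 1
integrate(function(t) t * posterior(t), -Inf, Inf)$value
```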
Bayes' Theorem: Continuous form
Important distributions:
    π(θ|x) = f(x|θ)π(θ)/m(x) = f(x|θ)π(θ)/∫ f(x|θ)π(θ) dθ
· π(θ): the prior distribution
· π(θ|x): the posterior distribution
Bayes' Theorem: Continuous form
Other distributions:
    π(θ|x) = f(x|θ)π(θ)/m(x) = f(x|θ)π(θ)/∫ f(x|θ)π(θ) dθ
· m(x) = ∫ f(x|θ)π(θ) dθ: the marginal distribution
· f(x|θ)π(θ) = f(x, θ): the joint distribution
Bayesian Models
Bayesian Models: Bayesian thinking
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
· Data are random variables with a mean of μ
Bayesian Models: Bayesian thinking
· The frequentist perspective: the mean μ is a constant
colMeans(iris[, 1:3])
Sepal.Length Sepal.Width Petal.Length
5.8433 3.0573 3.7580
Bayesian Models: Bayesian thinking
· The Bayesian perspective: the mean μ is a random variable

PROB  SEPAL LENGTH  SEPAL WIDTH  PETAL LENGTH
90%   5.843333      3.057333     3.758000
10%   Others        Others       Others
Bayesian Models
· In fact, nearly all modern Bayesian modeling uses Bayesian thinking
· Nearly all statistical models can be implemented as Bayesian-form models
· Even some non-parametric models can be transformed into Bayesian versions
· Bayes clustering
· Bayes regression
  - Logit, Probit, Tobit, Quantile, LASSO...
· Bayes neural networks
· Non-parametric Bayes
· Hierarchical models
· etc.
Bayesian Modeling Example: Question
· For a sample X1, X2, ..., Xn ∼ N(θ, σ²), we want to know the mean of this sample.
· Frequentists think θ̂ = mean(x)
· Bayesians think θ is a random variable with a distribution
· Suppose that θ ∼ N(μ, τ²)
  - Infer the posterior distribution
  - Calculate the posterior distribution
  - Estimate the mean of the sample
Bayesian Modeling Example: Inference
Inferring the posterior distribution using Bayes' theorem in continuous form:
    π(θ|x) = f(x|θ)π(θ)/m(x) = f(x|θ)π(θ)/∫ f(x|θ)π(θ) dθ
· Put the distributions into the theorem to calculate the posterior distribution
  - Prior distribution: θ ∼ N(μ, τ²)
  - Conditional distribution: x|θ ∼ N(θ, σ²)
Bayesian Modeling Example: Inference
· Combining the N(μ, τ²) prior with the N(θ, σ²) likelihood for a single observation x gives a normal posterior:
    θ|x ∼ N((σ²μ + τ²x)/(σ² + τ²), σ²τ²/(σ² + τ²))
Bayesian Modeling Example: Calculating the posterior distribution
According to the theorem, we know the mean and the variance of θ for a normal distribution.
postDis = function(miu=2, tau=4, n=100) {
  x = rnorm(n, 3, 5)                                        # simulated data
  a = list(0)
  a[[1]] = (var(x)*miu + tau^2*mean(x)) / (var(x) + tau^2)  # posterior mean
  a[[2]] = var(x)*tau^2 / (var(x) + tau^2)                  # posterior variance
  a
}
postDis(3, 5, 1000)
[[1]]
[1] 2.9284
[[2]]
[1] 12.254
Bayesian Modeling Example: Estimating the mean
· In ordinary statistics, the MLE and the moment estimator of μ in a normal distribution are both the sample mean.
· For the Bayes posterior distribution:
  - MLE → posterior maximum likelihood estimator
  - It can be considered the MLE of the posterior distribution
  - The posterior distribution is normal, too, so its mean parameter is:
      (σ²μ + τ²x)/(σ² + τ²)
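This posterior mean is a precision-weighted average of the prior mean μ and the observation x; a quick check with invented numbers shows that a vaguer prior (larger τ²) pulls the estimate toward x:

```r
# Posterior mean (sigma^2 * mu + tau^2 * x) / (sigma^2 + tau^2)
post_mean = function(mu, tau2, x, sigma2) {
  (sigma2 * mu + tau2 * x) / (sigma2 + tau2)
}
mu = 2; x = 3.5; sigma2 = 25
post_mean(mu, tau2 = 4,   x, sigma2)   # pulled toward the prior mean mu
post_mean(mu, tau2 = 1e6, x, sigma2)   # nearly equal to the observation x
```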
Bayesian Modeling Example: Estimating the mean
· x ∼ N(3, 5²), i.e. sd = 5
  - The mean is 3
· Use different prior distributions
· Observe the error in each situation
Bayesian Modeling Example
· Prior distribution: N(3, 1)
library(ggplot2)
plot_dif = function(miu=3, tau=1) {
i = seq(100, 10000, by=10)
set.seed(123)
meanCompare = function(n=100, miu=3, tau=1) {
x = rnorm(n, 3, 5)
(var(x)*miu+tau^2*mean(x))/(var(x)+tau^2)-3
}
aa = sapply(i, meanCompare, miu=miu, tau=tau)
bb = sapply(i,function(i) mean(rnorm(i,3,5))-3)
g = ggplot(data.frame(i=i, a=aa, b=bb)) +
  geom_line(aes(x=i, y=b), col="blue") +
  geom_line(aes(x=i, y=a), col="red")
print(g)
}
Bayesian Modeling Example
· Prior distribution: N(3, 1) (Bayes estimator in red, MLE in blue)
plot_dif(3, 1)
Bayesian Modeling Example
· Prior distribution: N(2, 1) (Bayes estimator in red, MLE in blue)
plot_dif(2,1)
Bayesian Modeling Example
· Prior distribution: N(2, 4) (Bayes estimator in red, MLE in blue)
plot_dif(2,4)
Bayesian Modeling Example
· Prior distribution: N(2, 100) (Bayes estimator in red, MLE in blue)
plot_dif(2,100)
Bayesian Modeling Example
1. As we can see, if the prior distribution is very accurate, the Bayes estimator is better than the ordinary estimator.
2. If the prior distribution is not accurate enough:
   · a larger prior variance is better
   · for a suitable variance, more data is better
Bayesian Modeling Example: Choosing the prior distribution
· Choosing a prior distribution...
  - If you are sure about the model, it can improve the accuracy of the estimator
  - If you are not sure, select a larger prior variance to improve the estimator