1 Motivation.
• Bayesian discrete choice models
• Bayesian approach offers extremely powerful methods for numerical integration
• These methods facilitate the study of latent variable models
• Discrete choice is a case in point
• Some particularly rich models can only be studied using Bayes
2 Review of Bayesian Statistics
• Let p(D|θ) denote the probability of the data D given parameter θ
• In most applications, p(D|θ) corresponds to a likelihood l(θ)
• Unlike many classical estimators, the Bayesian approach typically requires specification of a parametric likelihood
• Many researchers view this as a drawback
• However, in discrete choice, parametric models are almost always used
• p(θ) is prior distribution of parameters
• The choice of a prior can be controversial in applied work
• In some cases, we might put reasonable restrictions on parameters
• Some researchers conduct sensitivity analysis with respect to priors
• Bayes theorem:

p(θ|D) = p(D, θ) / p(D) = p(D|θ) p(θ) / p(D) ∝ p(θ) l(θ)
• p(θ|D) is the posterior distribution of the parameters given the data
• In the Bayesian approach, the researcher updates his/her beliefs about the parameters using the laws of conditional probability
• The posterior is proportional to the prior times the likelihood
• Note that as the sample becomes large, the likelihood will become the dominant term
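A minimal numerical illustration of the prior-times-likelihood update on a grid; the Bernoulli data and flat prior are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def grid_posterior(data, theta_grid, prior):
    """Posterior on a grid: p(theta|D) is proportional to p(theta) * l(theta)."""
    k, n = data.sum(), data.size
    # Bernoulli log-likelihood, computed in logs for numerical stability
    log_lik = k * np.log(theta_grid) + (n - k) * np.log(1 - theta_grid)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()                 # guard against underflow
    post = np.exp(log_post)
    dtheta = theta_grid[1] - theta_grid[0]
    return post / (post.sum() * dtheta)        # normalizing plays the role of p(D)

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.001, 0.999, 999)
flat_prior = np.ones_like(theta_grid)          # uniform prior on (0, 1)

data = rng.random(200) < 0.3                   # 200 Bernoulli(0.3) draws
post = grid_posterior(data, theta_grid, flat_prior)
dtheta = theta_grid[1] - theta_grid[0]
post_mean = (theta_grid * post).sum() * dtheta # should sit near 0.3
```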
• In classical statistics, the parameter is a fixed quantity; in the Bayesian approach, it is treated as a random variable
• Confidence intervals etc. are obtained using an asymptotic approximation
• The choice between Bayes/Classical approaches is a matter of dispute in statistics/econometrics
• Under mild regularity conditions, as the sample size n becomes large, the posterior becomes:

p(θ|D) ≈ N(θ̂MLE, [−H|θ=θ̂MLE]^(−1))

where H is the Hessian of the log-likelihood.
• Intuitively, this occurs because the posterior closely resembles the likelihood function as the sample size becomes large.
• p(θ) stays fixed with n
• l(θ) increases with n
• The likelihood swamps the prior in large samples
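To see the large-sample normal approximation at work, one can compare the exact flat-prior posterior for a Bernoulli parameter with N(θ̂MLE, [−H]^(−1)); the Bernoulli setup is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
data = rng.random(n) < 0.3        # Bernoulli(0.3) sample, for illustration
k = int(data.sum())

theta_hat = k / n                 # MLE of the Bernoulli parameter
# Hessian of the Bernoulli log-likelihood, evaluated at the MLE
H = -k / theta_hat**2 - (n - k) / (1 - theta_hat)**2
approx_sd = np.sqrt(-1.0 / H)     # sd implied by N(theta_hat, [-H]^(-1))

# under a flat prior the exact posterior is Beta(k+1, n-k+1); compare moments
a, b = k + 1, n - k + 1
exact_mean = a / (a + b)
exact_sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
```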
2.1 Predictive distributions
• Df: data that we have not observed
• D: observed data
p(Df|D) = ∫ p(Df|θ) p(θ|D) dθ

• p(Df|θ): probability of Df given θ
• Integrate out parameter uncertainty using the posterior, p(θ|D)
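A Monte Carlo sketch of this integral: draw θ(s) from the posterior, then draw Df from p(Df|θ(s)). A conjugate Beta-Bernoulli example is used so the answer can be checked; the data counts are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up data: k successes in n Bernoulli trials; with a flat Beta(1,1)
# prior, the posterior is Beta(k+1, n-k+1)
k, n = 30, 100
theta_draws = rng.beta(k + 1, n - k + 1, size=50_000)   # theta^(s) ~ p(theta|D)

# draw Df^(s) ~ p(Df | theta^(s)); averaging integrates out parameter uncertainty
df_draws = rng.random(theta_draws.size) < theta_draws
pred_prob = df_draws.mean()     # estimates P(Df = 1 | D) = (k+1)/(n+2)
```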
2.2 Decision Theory
• L(a, θ): loss function associated with an action a when the parameter is θ
• A Bayesian should choose the action that minimizes expected loss given parameter uncertainty:

min_a ∫ L(a, θ) p(θ|D) dθ
• Examples: estimation, where a corresponds to the choice of a parameter
• Non-nested testing, where a corresponds to the choice of a model
• Profit maximization: target marketing
• Utility Maximization
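The expected-loss minimization can be sketched by simulation; under squared loss the optimal action should land on the posterior mean. The posterior draws below are stand-ins, not output from any model in these notes:

```python
import numpy as np

rng = np.random.default_rng(2)
# stand-in posterior draws theta^(s); in practice these come from MCMC
theta_draws = rng.beta(31, 71, size=20_000)

def expected_loss(a, draws, loss):
    """Monte Carlo estimate of the integral of L(a, theta) p(theta|D) dtheta."""
    return loss(a, draws).mean()

squared = lambda a, t: (a - t) ** 2
actions = np.linspace(0.0, 1.0, 201)            # candidate actions a
losses = [expected_loss(a, theta_draws, squared) for a in actions]
best = actions[int(np.argmin(losses))]
# under squared loss the optimal action is the posterior mean
```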
3 Markov Chains.
• Two common ways to conduct MCMC are Gibbs sampling and Metropolis.
• A normal random walk Metropolis works as follows.
• First, the econometrician comes up with a rough guess θ0 at the MLE.
• Second, come up with a rough guess I0 at the information matrix using the Hessian at the MLE.
• A sequence of pseudorandom values θ(1), ..., θ(S) is then generated.
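The steps above can be sketched on a toy target, the posterior of a normal mean under a flat prior; the target, the 2.4/√I0 proposal scaling, and the burn-in length are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# toy target: posterior of a normal mean theta with known unit variance and
# a flat prior, so the log posterior equals the log likelihood up to a constant
y = rng.normal(2.0, 1.0, size=200)

def log_post(theta):
    return -0.5 * np.sum((y - theta) ** 2)

theta0 = y.mean()                # step 1: rough guess at the MLE
I0 = float(y.size)               # step 2: information matrix for a N(theta, 1) mean
step = 2.4 / np.sqrt(I0)         # proposal scale built from the rough I0

S = 10_000
chain = np.empty(S)
theta, lp = theta0, log_post(theta0)
for s in range(S):
    prop = theta + step * rng.normal()   # normal random-walk proposal
    lp_prop = log_post(prop)
    # accept with probability min(1, posterior ratio), computed in logs
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    chain[s] = theta
kept = chain[1000:]                      # discard burn-in
```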
• In the above, yij is the utility of person i for
alternative j
• εij is the stochastic preference shock
• xij are covariates that enter into i’s utility
• cij = 1 if i chooses j
• If the yij were known, then we could use the Gibbs sampler above to estimate β and h
• However, the yij are latent variables, and therefore we use data augmentation.
• The idea behind data augmentation is simple: we integrate out the distribution of the variables that we do not see.
• Following the notation in Cameron and Trivedi, let f(θ|y, y∗) denote the posterior conditional on the observed variables y and the latent variables y∗.
• Let f(y∗|y, θ) denote the distribution of the latent variable conditional on y and the parameters.
• Then the posterior can be written as:
p(θ|y) = ∫ f(θ|y, y∗) f(y∗|y, θ) dy∗
• Taking account of the latent variable simply involves an additional Gibbs step.
• The distribution of the latent utility yij is a truncated normal distribution.
• If cij = 1, yij is a truncated normal with mean parameter xijβ, precision h, and lower truncation point max{yij′, j′ ≠ j}.
• If cij = 0, yij is a truncated normal with mean parameter xijβ, precision h, and upper truncation point max{yij′, j′ ≠ j}.
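The truncated-normal draws just described can be sketched with scipy's truncnorm; the helper name and all of the numerical values (mean, precision, truncation point) are made up for illustration:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)

def draw_latent_utility(mean, h, lower=None, upper=None):
    """One data-augmentation draw of y_ij from N(mean, 1/h) truncated
    below at `lower` or above at `upper` (illustrative helper)."""
    scale = 1.0 / np.sqrt(h)
    # scipy's truncnorm takes bounds standardized by loc and scale
    a = -np.inf if lower is None else (lower - mean) / scale
    b = np.inf if upper is None else (upper - mean) / scale
    return float(truncnorm.rvs(a, b, loc=mean, scale=scale, random_state=rng))

# chosen alternative (c_ij = 1): its utility must exceed the best rival utility
y_win = draw_latent_utility(mean=0.5, h=1.0, lower=1.2)
# unchosen alternative (c_ij = 0): its utility must fall below that point
y_lose = draw_latent_utility(mean=0.5, h=1.0, upper=1.2)
```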
• The Gibbs sampler for the multinomial probit simply adds the data augmentation step above:
• A Gibbs sampler generates a pseudo-random sequence (h(s), β(s), {y(s)ij}i∈I,j∈J), s = 1, ..., S, using the following Markov chain
1. Given (h(s), β(s)), draw β(s+1) ∼ p(β|h(s), X, y, C)
2. Given β(s+1), draw h(s+1) ∼ p(h|β(s+1), X, y, C)
3. For each i ∈ I, draw y(s+1)i1 ∼ p(yi1|h(s+1), β(s+1), X, y(s)i2, ..., y(s)iJ, C)
4. Draw y(s+1)i2 ∼ p(yi2|h(s+1), β(s+1), X, y(s+1)i1, y(s)i3, ..., y(s)iJ, C)
5. ...
6. Draw y(s+1)iJ ∼ p(yiJ|h(s+1), β(s+1), X, y(s+1)i1, ..., y(s+1)iJ−1, C)
7. Return to 1
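A runnable sketch of this alternation for the binary probit, a deliberate simplification of the multinomial sampler above (two alternatives, precision h fixed at 1, following the Albert-Chib construction; the data and prior are simulated stand-ins):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# simulated binary probit data (a simplification of the multinomial case)
n, beta_true = 300, np.array([0.5, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
c = (X @ beta_true + rng.normal(size=n) > 0).astype(int)   # observed choices

# Gibbs sampler: alternate beta | y* and the data augmentation step y* | beta, c
B0_inv = 0.01 * np.eye(2)              # vague N(0, 100 I) prior on beta
S, burn = 1500, 500
beta = np.zeros(2)
draws = np.empty((S, 2))
V = np.linalg.inv(B0_inv + X.T @ X)    # posterior covariance of beta (h = 1)
for s in range(S):
    # data augmentation: y*_i | beta, c_i is truncated normal around X beta
    mu = X @ beta
    lo = np.where(c == 1, 0.0, -np.inf)    # c=1 means y* > 0
    hi = np.where(c == 1, np.inf, 0.0)     # c=0 means y* < 0
    # inverse-CDF draw from N(mu, 1) truncated to (lo, hi)
    p_lo, p_hi = norm.cdf(lo - mu), norm.cdf(hi - mu)
    y_lat = mu + norm.ppf(p_lo + rng.random(n) * (p_hi - p_lo))
    # regression step: beta | y* is multivariate normal
    m = V @ (X.T @ y_lat)
    beta = rng.multivariate_normal(m, V)
    draws[s] = beta
post_mean = draws[burn:].mean(axis=0)      # should sit near beta_true
```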
6 Target Marketing
• In "The Value of Purchase History Data in Target Marketing", Rossi et al. attempt to estimate household-level preference parameters.
• This is of interest as a marketing problem.
• The CMI checkout coupon uses purchase information to customize coupons to a particular household.
• In principle, the entire purchase history (from consumer loyalty cards) could be used to customize coupons (and hence prices)
• If a household-level preference parameter can be forecasted with high precision, this is essentially first-degree price discrimination!
• Even with short purchase histories, they find that profits are increased 2.5-fold through the use of purchase data compared to blanket couponing strategies.
• Even one observation can boost profits from couponing by 50%.
• This application is of interest to economists as well.
• The methods in this paper allow us to account for consumer heterogeneity in a very rich manner.
• This might be useful to examine the distribution of welfare consequences of a policy intervention (e.g. a merger or market regulation).
• Beyond that, these methods demonstrate the power of Bayesian methods in latent variable problems.
7 Random Coefficients Model
• Multinomial probit with panel data on household-level choices:

yh,t = Xh,t βh + εh,t,   εh,t ∼ N(0, Λ)
βh = ∆zh + vh,   vh ∼ N(0, Vβ)
• Households h = 1, ...,H and time t = 1, ..., T
• Xh,t covariates and zh demographics
• Note that the household-specific random coefficients βh remain fixed over time
• Ih,t observed choice
• The posterior distributions are derived in Appendix A.
• Formally, the derivations are very close to our multinomial probit model above.
• Gibbs sampling is used to simulate the posterior distribution of Λ, ∆, Vβ
8 Predictive Distributions
• The authors wish to give different coupons to different households.
• A rational (Bayesian) decision maker would form her beliefs about household h's preference parameters given her posterior about the model parameters.
• This will involve, as we show below, forming a predictive distribution for βh given the econometrician's information set.
• As a first case, suppose that the econometrician only knew zh, the demographics of household h
• From our model, p(βh|zh,∆, Vβ) is N(∆zh, Vβ)
• Given the posterior p(∆, Vβ|Data), the econometrician's predictive distribution for βh is:

p(βh|zh, Data) = ∫ p(βh|zh, ∆, Vβ) p(∆, Vβ|Data) d∆ dVβ
• We can simulate p(βh|zh, Data) using Gibbs sampling given our posterior simulations ∆(s), V(s)β, s = 1, ..., S:

(1/S) Σs p(βh|zh, ∆(s), V(s)β)
• We could draw random βh from p(βh|zh,Data).
• For each ∆(s), V(s)β, draw β(s)h from p(βh|zh, ∆(s), V(s)β)
• Given β(s)h, s = 1, ..., S, we could then simulate purchase probabilities.
• Draw ε(s)ht from εh,t ∼ N(0, Λ(s))
• The posterior purchase probability for j, given Xht and zh, is:

(1/S) Σs 1{Xjht β(s)h + ε(s)jht > Xj′ht β(s)h + ε(s)j′ht for all j′ ≠ j}
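Putting the last few bullets together, a sketch of the purchase-probability simulation with fabricated stand-ins for the posterior draws β(s)h and the ε(s)ht shocks (none of these numbers come from the paper, and Λ is set to the identity for simplicity):

```python
import numpy as np

rng = np.random.default_rng(6)

S, J, K = 5000, 3, 2              # posterior draws, alternatives, covariates
X_ht = rng.normal(size=(J, K))    # hypothetical covariates for household h at t

# stand-ins for the Gibbs output: beta_h^(s) draws and the preference shocks
beta_draws = rng.normal(loc=[0.5, -0.3], scale=0.1, size=(S, K))
eps = rng.normal(size=(S, J))     # eps^(s)_ht ~ N(0, Lambda^(s)) with Lambda = I

util = beta_draws @ X_ht.T + eps              # S x J simulated utilities
choice = util.argmax(axis=1)                  # j with the highest utility wins
probs = np.bincount(choice, minlength=J) / S  # posterior purchase probabilities
```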
• This would allow us to simulate the purchase response to different couponing strategies for a specific household h.
• The paper runs through different couponing strategies given different information sets (e.g. full or choice-only information sets).
• The key ideas are similar: form a predictive distribution for h's preferences and simulate purchase behavior in an analogous fashion.
• In the case of a full purchase information history, we could use the raw Gibbs output, since the Markov chain will simulate β(s)h, s = 1, ..., S.
• This could then be used to simulate choice behavior as in the example above (given draws of ε(s)ht)
9 Data
• AC Nielsen scanner panel data for tuna in Springfield, Missouri.
• 400 households, 1.5 years, 1-61 purchases.
• Brands and covariates in Table 2.
• Demographics Table 3.
• Table 4, delta coefficients.
• Poorer people prefer private label.
• Goodness of fit is moderate for the demographic coefficients
• Figures 1 and 2: household-level coefficient estimates with different information sets
• Table 5, return to different marketing strategies.
• Bottom line: you gain 0.5 to 1.0 cents per customer through better estimates.
• With a lot of customers, this could be quite profitable.