Bayesian Inference
Ekaterina Lomakina
TNU seminar: Bayesian inference
1 March 2013
Dec 29, 2015
Outline
• Probability distributions
• Maximum likelihood estimation
• Maximum a posteriori estimation
• Conjugate priors
• Conceptualizing models as collection of priors
• Noninformative priors
• Empirical Bayes
Probability distribution
• Density estimation: to model the distribution p(x) of a random variable x given a finite set of observations x1, …, xN.
Nonparametric approach
• Histogram
• Kernel density estimation
• Nearest neighbor approach
Parametric approach
• Gaussian distribution
• Beta distribution
• …
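The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the bin range and the synthetic data are hypothetical and not taken from the slides.

```python
import math
import random

def histogram_density(data, num_bins, lo, hi):
    """Nonparametric estimate: piecewise-constant density over equal-width bins."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        if lo <= x < hi:
            # Clamp the index to guard against floating-point edge cases near hi.
            counts[min(int((x - lo) / width), num_bins - 1)] += 1
    # Dividing by N * width makes the bars integrate to the fraction of data in [lo, hi).
    return [c / (len(data) * width) for c in counts]

def gaussian_density(data):
    """Parametric estimate: fit a mean and a variance, return the fitted pdf."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)

    def pdf(x):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    return pdf

# Synthetic observations (assumed for illustration): mean 1, standard deviation 2.
rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(5000)]
bars = histogram_density(data, num_bins=40, lo=-9.0, hi=11.0)
pdf = gaussian_density(data)
```

The histogram keeps one number per bin and makes no assumption about shape, while the Gaussian fit summarizes the same data with just two parameters.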
The Exponential Family
• Gaussian distribution
• Binomial distribution
• Beta distribution
• etc…
Gaussian distribution
• The central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well-defined mean and well-defined variance, will be approximately normally distributed.
Bean machine by Sir Francis Galton
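The bean machine can be imitated with a small simulation. The sketch below assumes an idealized board in which every ball moves one step left or right with equal probability at each row of pegs; the function name and the board size are hypothetical.

```python
import random
import statistics

def bean_machine(num_rows, num_balls, rng):
    """Each ball bounces left (-1) or right (+1) at every row; return the final positions."""
    return [sum(rng.choice((-1, 1)) for _ in range(num_rows)) for _ in range(num_balls)]

rng = random.Random(0)
positions = bean_machine(num_rows=100, num_balls=5000, rng=rng)

mean = statistics.fmean(positions)
std = statistics.pstdev(positions)
# For a normal distribution roughly two thirds of the mass lies within one
# standard deviation of the mean; the discrete positions shift this slightly.
within_one_std = sum(abs(p - mean) <= std for p in positions) / len(positions)
```

Each final position is a sum of many independent steps, so the pile of balls takes on the familiar bell shape, with a standard deviation close to the square root of the number of rows.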
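Maximum likelihood estimation
• The frequentist approach to estimating the parameters of a distribution from a set of observations is to maximize the likelihood.
– The data are assumed to be i.i.d., so the likelihood factorizes over the observations.
– The logarithm is a monotonic transformation, so maximizing the log-likelihood gives the same estimate.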
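The two annotations correspond to the following steps. The symbols θ for the parameters and X = {x1, …, xN} for the data are notation assumed here for illustration.

```latex
\begin{align}
p(X \mid \theta) &= \prod_{n=1}^{N} p(x_n \mid \theta)
  && \text{(data are i.i.d.)} \\
\theta_{\mathrm{ML}} &= \arg\max_{\theta}\, p(X \mid \theta)
  = \arg\max_{\theta} \sum_{n=1}^{N} \ln p(x_n \mid \theta)
  && \text{(the logarithm is monotonic)}
\end{align}
```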
MLE for Gaussian distribution
– The maximum likelihood estimate of the mean is the simple average of the observations.
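A minimal sketch of the Gaussian maximum likelihood estimates, assuming a one-dimensional Gaussian; the function name and the synthetic data are hypothetical.

```python
import random

def gaussian_mle(data):
    """Return the ML estimates (mean, variance) of a 1-D Gaussian."""
    n = len(data)
    mu_ml = sum(data) / n  # simple average of the observations
    # The ML variance divides by N (not N - 1), so it is slightly biased.
    var_ml = sum((x - mu_ml) ** 2 for x in data) / n
    return mu_ml, var_ml

rng = random.Random(1)
data = [rng.gauss(3.0, 0.5) for _ in range(2000)]
mu_ml, var_ml = gaussian_mle(data)
```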
Maximum a posteriori estimation
• The Bayesian approach to estimating the parameters of a distribution from a set of observations is to maximize the posterior distribution.
• It allows us to account for prior information.
MAP for Gaussian distribution
The posterior distribution over the mean is again a Gaussian.
– Its mean is a weighted average of the prior mean and the maximum likelihood estimate.
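The weighted average can be made concrete with a short sketch. It assumes the standard setting of a Gaussian likelihood with known variance sigma2 and a Gaussian prior N(mu0, sigma0_2) on the mean; these names are hypothetical, since the slide's own formula is not available.

```python
def gaussian_mean_posterior(data, sigma2, mu0, sigma0_2):
    """Posterior over the mean of a Gaussian with known variance sigma2.

    Returns (posterior mean, posterior variance). Because the posterior is
    Gaussian, its mean is also the MAP estimate.
    """
    n = len(data)
    mu_ml = sum(data) / n
    # Weight on the prior mean shrinks as the number of observations grows.
    w_prior = sigma2 / (n * sigma0_2 + sigma2)
    mu_n = w_prior * mu0 + (1.0 - w_prior) * mu_ml
    # Precisions (inverse variances) of prior and data add up.
    sigma_n2 = 1.0 / (1.0 / sigma0_2 + n / sigma2)
    return mu_n, sigma_n2

# Two observations with ML mean 3.0, a prior centred at 0.0.
mu_n, sigma_n2 = gaussian_mean_posterior([2.0, 4.0], sigma2=1.0, mu0=0.0, sigma0_2=1.0)
```

With two observations and equal prior and noise variances, the prior mean receives weight 1/3 and the data average weight 2/3, so the estimate is pulled from 3.0 toward the prior.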
Conjugate prior
• In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.
• For any member of the exponential family, there exists a conjugate prior that can be written in the form
• Important conjugate pairs include:
  Binomial – Beta
  Multinomial – Dirichlet
  Gaussian – Gaussian (for mean)
  Gaussian – Gamma (for precision)
  Exponential – Gamma
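For reference, one common way of writing the exponential family and its conjugate prior is shown below. The notation (h, g, u, χ, ν, f) follows Bishop's Pattern Recognition and Machine Learning and is an assumption here, not something recovered from the slide.

```latex
\begin{align}
p(x \mid \eta) &= h(x)\, g(\eta)\, \exp\{\eta^{\mathsf T} u(x)\}
  && \text{(exponential family)} \\
p(\eta \mid \chi, \nu) &= f(\chi, \nu)\, g(\eta)^{\nu} \exp\{\nu\, \eta^{\mathsf T} \chi\}
  && \text{(conjugate prior)} \\
p(\eta \mid X, \chi, \nu) &\propto g(\eta)^{\nu + N}
  \exp\Big\{\eta^{\mathsf T}\Big(\sum_{n=1}^{N} u(x_n) + \nu \chi\Big)\Big\}
  && \text{(posterior, same form as the prior)}
\end{align}
```

In this form ν can be read as an effective number of prior pseudo-observations, each contributing the value χ to the sufficient statistic.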
MLE for Binomial distribution
• The binomial distribution models the probability of m “heads” out of N tosses.
• Its only parameter, μ, encodes the probability of a single event (“head”).
• The maximum likelihood estimate is μ_ML = m / N.
MAP for Binomial distribution
• The conjugate prior for this distribution is the Beta distribution, Beta(μ | a, b).
• The posterior is then again a Beta distribution, p(μ | m, l, a, b) ∝ μ^(m+a-1) (1-μ)^(l+b-1),
where l = N – m, simply the number of “tails”.
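The difference between the two estimates shows up most clearly on a small sample. The sketch below assumes a Beta(a, b) prior with hyperparameters named a and b (hypothetical names) and uses the mode of the Beta posterior as the MAP estimate.

```python
def binomial_mle(m, n):
    """ML estimate of the head probability: fraction of heads."""
    return m / n

def beta_posterior(m, n, a, b):
    """Update a Beta(a, b) prior with m heads and l = n - m tails."""
    l = n - m
    return a + m, b + l

def beta_mode(a, b):
    """Mode of Beta(a, b); this formula is only valid for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

# Three tosses, all heads, with a Beta(2, 2) prior.
m, n = 3, 3
a_post, b_post = beta_posterior(m, n, a=2, b=2)
mu_ml = binomial_mle(m, n)          # 3 / 3 = 1.0
mu_map = beta_mode(a_post, b_post)  # (5 - 1) / (5 + 2 - 2) = 0.8
```

After three heads in a row the ML estimate claims that tails can never occur, while the MAP estimate stays away from that extreme because the prior acts like a few previously observed tosses.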
Models as collection of priors - 1
• Take a simple regression model
• Add a prior on weights
• And get Bayesian linear regression!
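A minimal sketch of this construction for a straight-line model. It assumes a zero-mean isotropic Gaussian prior on the two weights (intercept and slope) and Gaussian observation noise; the parameter names prior_precision and noise_precision are hypothetical and not taken from the slide.

```python
def blr_posterior(xs, ys, prior_precision, noise_precision):
    """Posterior mean and covariance of (intercept, slope) under a Gaussian prior."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = prior_precision * I + noise_precision * Phi^T Phi,
    # where each row of the design matrix Phi is (1, x_n).
    a00 = prior_precision + noise_precision * n
    a01 = noise_precision * sx
    a11 = prior_precision + noise_precision * sxx
    det = a00 * a11 - a01 * a01
    # Posterior covariance is the inverse of the 2x2 matrix A.
    s00, s01, s11 = a11 / det, -a01 / det, a00 / det
    # Posterior mean = noise_precision * A^{-1} Phi^T y.
    b0, b1 = noise_precision * sy, noise_precision * sxy
    mean = (s00 * b0 + s01 * b1, s01 * b0 + s11 * b1)
    return mean, ((s00, s01), (s01, s11))

# Noise-free data from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
weak_prior_mean, _ = blr_posterior(xs, ys, prior_precision=1e-9, noise_precision=1.0)
strong_prior_mean, _ = blr_posterior(xs, ys, prior_precision=100.0, noise_precision=1.0)
```

With a very weak prior the posterior mean essentially reproduces the least-squares fit, while a strong prior shrinks both weights toward zero.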
Models as collection of priors - 2
• Take again a simple regression model
• Add a prior on the function
• And get Gaussian processes!
[Graphical models: nodes yn and β (Bayesian linear regression); nodes yn, β and K (Gaussian process)]
Where yn is some function of xn
Models as collection of priors - 3
• Take a model where xn is discrete and unknown
• Add a prior on the states (xn), assuming they are temporally smooth
• And get Hidden Markov Model!
[Graphical model: parameters θ; chain of states x1, x2, xn-1, xn, xn+1; observations t1, t2, tn-1, tn, tn+1]
Noninformative priors
• Sometimes we have no strong prior belief but still want to apply Bayesian inference. Then we need noninformative priors.
• If our parameter λ is a discrete variable with K states, then we can simply set each prior probability to 1/K.
• For continuous variables, however, it is not so clear.
• One example is a noninformative prior over μ for the Gaussian distribution: p(μ) = N(μ | μ0, σ0²) with σ0² → ∞.
• We can see that the effect of the prior on the posterior over μ vanishes in this case.
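The vanishing influence of the prior can be checked numerically. The sketch assumes the prior in question is a Gaussian N(mu0, sigma0_2) over the mean of a Gaussian with known variance, and that "noninformative" means letting sigma0_2 grow very large; the names and the data values are hypothetical.

```python
def posterior_mean(data, sigma2, mu0, sigma0_2):
    """Posterior mean of a Gaussian mean with known variance and a Gaussian prior."""
    n = len(data)
    mu_ml = sum(data) / n
    w_prior = sigma2 / (n * sigma0_2 + sigma2)  # weight given to the prior mean
    return w_prior * mu0 + (1.0 - w_prior) * mu_ml

data = [4.8, 5.1, 5.3, 4.9]
mu_ml = sum(data) / len(data)

# Distance between the posterior mean and the ML estimate for increasingly
# broad priors centred at 0.0: the gap shrinks toward zero.
gaps = [
    abs(posterior_mean(data, sigma2=1.0, mu0=0.0, sigma0_2=s0) - mu_ml)
    for s0 in (0.1, 10.0, 1e6)
]
```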
Empirical Bayes
• But what if we still want to assume some prior information, but want to learn it from the data instead of fixing it in advance?
• Imagine the following model
• We cannot use full Bayesian inference, but we can approximate it by finding the best λ* that maximizes p(X|λ).
[Graphical model with nodes λ, θs and xn, and plates of size S and N]
Empirical Bayes
• We can estimate the result by the following iterative procedure (EM algorithm):
• Initialize λ*
• E-step: compute p(θ|X, λ*) given the fixed λ*
• M-step: update λ* to maximize the expected complete-data log-likelihood under p(θ|X, λ*)
• This illustrates the other name for Empirical Bayes: maximum marginal likelihood.
• This is not a fully Bayesian treatment, but it offers a useful compromise between the Bayesian and frequentist approaches.
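The iterative procedure can be sketched on a toy hierarchy. Everything about the model here is an assumption made for illustration: S groups of observations, group means θs drawn from N(λ, tau2), observations drawn from N(θs, sigma2), with tau2 and sigma2 known and only the prior mean λ learned from the data.

```python
def posterior_means(groups, lam, sigma2, tau2):
    """E-step: posterior mean of each theta_s given the current lambda."""
    means = []
    for xs in groups:
        precision = 1.0 / tau2 + len(xs) / sigma2
        means.append((lam / tau2 + sum(xs) / sigma2) / precision)
    return means

def em_empirical_bayes(groups, sigma2, tau2, lam=0.0, num_iters=50):
    """Alternate E- and M-steps to find the lambda maximizing p(X | lambda)."""
    for _ in range(num_iters):
        means = posterior_means(groups, lam, sigma2, tau2)  # E-step
        # M-step: for this Gaussian prior, the expected complete-data
        # log-likelihood is maximized by the average posterior mean.
        lam = sum(means) / len(means)
    return lam

# Three groups with sample means 2, 5 and 8 (toy data).
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
lam_star = em_empirical_bayes(groups, sigma2=1.0, tau2=1.0)
shrunk = posterior_means(groups, lam_star, sigma2=1.0, tau2=1.0)
```

With equally sized groups the learned λ* settles at the average of the group means, and the posterior means of the individual θs are pulled from their sample means toward it.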
Thank you for your attention!