Exponential Family & Generalized Linear Models (GLIMs)
Probabilistic Graphical Models
Sharif University of Technology
Spring 2017
Soleymani
Outline

• Exponential family
  • Many standard distributions are in this family
  • Learning algorithms for different models in this family share similarities:
    • ML estimation has a simple form for exponential families: moment matching of sufficient statistics
    • Bayesian learning is simplest for exponential families
  • They have a maximum entropy interpretation
• GLIMs: a way to parameterize conditional distributions that place an exponential family distribution on a variable for each value of its parents
Exponential family: canonical parameterization

$p(x \mid \eta) = \frac{1}{Z(\eta)}\, h(x)\, \exp\!\left(\eta^T T(x)\right)$

$Z(\eta) = \int h(x)\, \exp\!\left(\eta^T T(x)\right) dx$

$p(x \mid \eta) = h(x)\, \exp\!\left(\eta^T T(x) - \ln Z(\eta)\right)$

• $T: \mathcal{X} \to \mathbb{R}^K$: sufficient statistics function
• $\eta$: natural or canonical parameters
• $h: \mathcal{X} \to \mathbb{R}^+$: reference measure, independent of the parameters
• $Z$: normalization factor or partition function ($0 < Z(\eta) < \infty$)
• $A(\eta) = \ln Z(\eta)$: log partition function
Example: Bernoulli

$p(x \mid \mu) = \mu^x (1-\mu)^{1-x} = \exp\!\left(x \ln\frac{\mu}{1-\mu} + \ln(1-\mu)\right)$

• $\eta = \ln\frac{\mu}{1-\mu} \;\Rightarrow\; \mu = \frac{e^\eta}{e^\eta + 1} = \frac{1}{1+e^{-\eta}}$
• $T(x) = x$
• $A(\eta) = -\ln(1-\mu) = \ln(1+e^\eta)$
• $h(x) = 1$
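As a quick numerical sanity check (an illustrative sketch, not part of the slides), the Bernoulli pmf can be evaluated both in its standard form and in the exponential-family form above, and the two agree:

```python
import numpy as np

def bernoulli_pmf(x, mu):
    """Standard Bernoulli pmf."""
    return mu**x * (1 - mu)**(1 - x)

def bernoulli_expfam(x, mu):
    """Same pmf written as h(x) * exp(eta*T(x) - A(eta))."""
    eta = np.log(mu / (1 - mu))   # natural parameter
    A = np.log(1 + np.exp(eta))   # log partition function
    h = 1.0                       # reference measure
    T = x                         # sufficient statistic
    return h * np.exp(eta * T - A)
```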
Example: Gaussian

$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

• $\eta = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \mu/\sigma^2 \\ -\frac{1}{2\sigma^2} \end{bmatrix}$
• $\Rightarrow \mu = -\frac{\eta_1}{2\eta_2}, \quad \sigma^2 = -\frac{1}{2\eta_2}$
• $T(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$
• $A(\eta) = \ln\!\left(\sqrt{2\pi}\,\sigma\right) + \frac{\mu^2}{2\sigma^2} = \frac{1}{2}\ln 2\pi - \frac{1}{2}\ln(-2\eta_2) - \frac{\eta_1^2}{4\eta_2}$
• $h(x) = 1$
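The same check works for the Gaussian (again an illustrative sketch, not from the slides): evaluating the density via the natural parameters $(\eta_1, \eta_2)$ and the log partition function reproduces the standard pdf.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Standard univariate Gaussian pdf."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gaussian_expfam(x, mu, sigma2):
    """Same pdf via natural parameters: exp(eta1*x + eta2*x^2 - A(eta)), h(x) = 1."""
    eta1 = mu / sigma2
    eta2 = -1.0 / (2 * sigma2)
    # A(eta) = (1/2) ln 2*pi - (1/2) ln(-2*eta2) - eta1^2 / (4*eta2)
    A = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * eta2) - eta1**2 / (4 * eta2)
    return np.exp(eta1 * x + eta2 * x**2 - A)
```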
Example: Multinomial

$p(x \mid \pi) = \prod_{k=1}^{K} \pi_k^{x_k}$

$p(x \mid \pi) = \exp\!\left(\sum_{k=1}^{K} x_k \ln \pi_k\right) = \exp\!\left(\sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \ln\!\left(1 - \sum_{k=1}^{K-1} \pi_k\right)\right)$

• $\eta = [\eta_1, \ldots, \eta_{K-1}]^T = \left[\ln\frac{\pi_1}{1-\sum_{k=1}^{K-1}\pi_k}, \ldots, \ln\frac{\pi_{K-1}}{1-\sum_{k=1}^{K-1}\pi_k}\right]^T$
• i.e., $\eta = \left[\ln\frac{\pi_1}{\pi_K}, \ldots, \ln\frac{\pi_{K-1}}{\pi_K}\right]^T \;\Rightarrow\; \pi_k = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}}$ (with $\eta_K = 0$)
• $T(x) = [x_1, \ldots, x_{K-1}]^T$
• $A(\eta) = -\ln \pi_K = -\ln\!\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln \sum_{k=1}^{K} e^{\eta_k}$

Note the constraint $\sum_{k=1}^{K} \pi_k = 1$.
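The map between $\pi$ and $\eta$ is the familiar softmax with the last category as reference; a minimal sketch (my own helper name) verifying the round trip:

```python
import numpy as np

def pi_from_eta(eta):
    """Recover mean parameters pi from natural parameters eta_k = ln(pi_k / pi_K),
    appending eta_K = 0 for the reference category (a softmax)."""
    eta_full = np.append(eta, 0.0)
    e = np.exp(eta_full - eta_full.max())  # shift for numerical stability
    return e / e.sum()

pi = np.array([0.2, 0.3, 0.5])
eta = np.log(pi[:-1] / pi[-1])             # K-1 natural parameters
pi_rec = pi_from_eta(eta)                  # should recover pi exactly
```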
Well-behaved parameter space

Multiple exponential families may encode the same set of distributions. We want the parameter space $\{\eta \mid 0 < Z(\eta) < \infty\}$ to be:
• a convex set
• non-redundant: $\eta \neq \eta' \Rightarrow p(x \mid \eta) \neq p(x \mid \eta')$
  • i.e., the function from $\theta$ to $\eta$ is invertible
  • Example: the invertible map from $\eta$ to $\mu$ in the Bernoulli example, $\mu = \frac{1}{1+e^{-\eta}}$
Examples of non-exponential-family distributions

• Uniform
• Laplace
• Student's t-distribution
Moments

$A(\eta) = \ln Z(\eta), \qquad Z(\eta) = \int h(x)\, \exp\!\left(\eta^T T(x)\right) dx$

$\nabla_\eta A(\eta) = \frac{\nabla_\eta Z(\eta)}{Z(\eta)} = \frac{\int T(x)\, h(x)\, \exp\!\left(\eta^T T(x)\right) dx}{Z(\eta)} = \int T(x)\, \frac{h(x)\, \exp\!\left(\eta^T T(x)\right)}{Z(\eta)}\, dx = E_{p(x \mid \eta)}[T(x)]$

$\Rightarrow \nabla_\eta A(\eta) = E_\eta[T(x)]$

$\nabla^2_\eta A(\eta) = E_\eta\!\left[T(x)\, T(x)^T\right] - E_\eta[T(x)]\, E_\eta[T(x)]^T = \mathrm{Cov}_\eta[T(x)]$

The first derivative of $A(\eta)$ is the mean of the sufficient statistics, and the second derivative is their covariance. More generally, the $i$-th derivative gives the $i$-th cumulant of the sufficient statistics (which coincides with the $i$-th centered moment for low orders).
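These identities are easy to verify numerically. The sketch below (an illustration, not from the slides) differentiates the Bernoulli log partition function $A(\eta) = \ln(1 + e^\eta)$ by finite differences and compares against the mean $\mu$ and variance $\mu(1-\mu)$ of $T(x) = x$:

```python
import numpy as np

def A(eta):
    """Bernoulli log partition function."""
    return np.log1p(np.exp(eta))

eta, eps = 0.7, 1e-5
mu = 1.0 / (1.0 + np.exp(-eta))  # E[T(x)] = E[x]

# central finite differences for the first and second derivatives of A
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)               # ~ E[x]
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2    # ~ Var[x]
```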
Properties

The moment parameters $\mu$ can be derived as a function of the natural (canonical) parameters:

$\nabla_\eta A(\eta) = E_\eta[T(x)]$

$\mu \equiv E_\eta[T(x)] \;\Rightarrow\; \nabla_\eta A(\eta) = \mu$

$A(\eta)$ is convex, since $\nabla^2_\eta A(\eta) = \mathrm{Cov}_\eta[T(x)] \succeq 0$:
• a covariance matrix is always positive semi-definite, so the Hessian $\nabla^2_\eta A(\eta)$ is positive semi-definite, and hence $A(\eta) = \ln Z(\eta)$ is a convex function of $\eta$.

For many distributions, $\mu \equiv E_\eta[T(x)]$ coincides with the usual mean parameter.
Exponential family: moment parameterization

A distribution in the exponential family can also be parameterized by the moment parameterization:

$p(x \mid \mu) = \frac{1}{Z(\mu)}\, h(x)\, \exp\!\left(\psi(\mu)^T T(x)\right)$

$Z(\mu) = \int h(x)\, \exp\!\left(\psi(\mu)^T T(x)\right) dx$

If $\nabla^2_\eta A(\eta) \succ 0$, then $\nabla_\eta A(\eta)$ is strictly increasing, so $\psi^{-1}(\eta) = \mu = \nabla_\eta A(\eta)$ is strictly increasing and hence one-to-one.
• The mapping from the moments to the canonical parameters is invertible (a one-to-one relationship): $\eta = \psi(\mu)$, $\mu = \psi^{-1}(\eta)$
• $\psi$ maps the moment parameters $\mu$ to the canonical parameters, where $\mu \equiv E_\eta[T(x)] = \nabla_\eta A(\eta)$
Sufficiency

• A statistic is a function of a random variable.
• Suppose that the distribution of $X$ depends on a parameter $\theta$.
• "$T(X)$ is a sufficient statistic for $\theta$ if there is no information in $X$ regarding $\theta$ beyond that in $T(X)$."
• Sufficiency in both the frequentist and Bayesian frameworks implies a factorization of $p(x \mid \theta)$ (the Neyman factorization theorem):

$p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))$

$p(x, \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x)) \;\Rightarrow\; p(x \mid \theta) = \psi_1'(T(x), \theta)\, \psi_2(x, T(x))$
Sufficient statistic

Sufficient statistics and the exponential family:

$p(x \mid \eta) = h(x)\, \exp\!\left(\eta^T T(x) - A(\eta)\right)$

In the case of i.i.d. sampling, the sufficient statistic for a set of $N$ observations is obtained easily:

$p(\mathcal{D} \mid \eta) = \prod_{n=1}^{N} h(x^{(n)})\, \exp\!\left(\eta^T T(x^{(n)}) - A(\eta)\right) = \left(\prod_{n=1}^{N} h(x^{(n)})\right) \exp\!\left\{\eta^T \sum_{n=1}^{N} T(x^{(n)}) - N A(\eta)\right\}$

$\mathcal{D}$ itself has an exponential family distribution with sufficient statistic $\sum_{n=1}^{N} T(x^{(n)})$.
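One consequence worth seeing concretely: two different samples with the same value of the sufficient statistic yield the same likelihood for every parameter value. A sketch for the Gaussian, where $T$-totals are $\sum_n x^{(n)}$ and $\sum_n (x^{(n)})^2$ (example data constructed by hand, not from the slides):

```python
import numpy as np

def gaussian_loglik(data, mu, sigma2):
    """i.i.d. Gaussian log likelihood."""
    return np.sum(-(data - mu)**2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2))

d1 = np.array([0.0, 1.0, 2.0])                      # sum = 3, sum of squares = 5
r = np.sqrt(3.25)
d2 = np.array([0.5, (2.5 + r) / 2, (2.5 - r) / 2])  # different sample, same sufficient statistics
```

For any $(\mu, \sigma^2)$, `gaussian_loglik(d1, ...)` and `gaussian_loglik(d2, ...)` agree, even though the samples differ.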
MLE for exponential family

$\ell(\eta; \mathcal{D}) = \ln p(\mathcal{D} \mid \eta) = \ln \prod_{n=1}^{N} h(x^{(n)})\, \exp\!\left(\eta^T T(x^{(n)}) - A(\eta)\right)$
$= \sum_{n=1}^{N} \ln h(x^{(n)}) + \eta^T \sum_{n=1}^{N} T(x^{(n)}) - N A(\eta)$

This is a concave function of $\eta$ (since $A(\eta)$ is convex).

$\nabla_\eta \ell(\eta; \mathcal{D}) = 0 \;\Rightarrow\; \sum_{n=1}^{N} T(x^{(n)}) - N \nabla_\eta A(\eta) = 0$

$\Rightarrow \nabla_\eta A(\hat\eta) = \frac{\sum_{n=1}^{N} T(x^{(n)})}{N}$

$\Rightarrow E_{\hat\eta}[T(x)] = \frac{\sum_{n=1}^{N} T(x^{(n)})}{N} \qquad$ (moment matching)
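For the Bernoulli case, moment matching says $\mathrm{sigmoid}(\hat\eta)$ should equal the sample mean; the sketch below (toy data of my choosing) confirms that this choice zeroes the log-likelihood gradient:

```python
import numpy as np

data = np.array([1., 0., 1., 1., 0., 1.])
xbar = data.mean()                        # empirical mean of T(x) = x

# moment matching: pick eta so that E_eta[T(x)] = sigmoid(eta) = xbar
eta_hat = np.log(xbar / (1.0 - xbar))

def grad_loglik(eta, data):
    """Gradient of sum_n [eta * x_n - A(eta)] with A(eta) = log(1 + e^eta)."""
    return data.sum() - len(data) / (1.0 + np.exp(-eta))
```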
Maximum entropy models

Among all distributions with certain moments of interest, the exponential family is the most random (it makes the fewest assumptions and imposes the least structure):
• Out of all distributions that reproduce the observed sufficient statistics, the exponential family distribution (roughly) makes the fewest additional assumptions.
• The unique distribution maximizing the entropy, subject to the constraint that these moments are exactly matched, is an exponential family distribution.
Maximum entropy

Constraints:

$E[f_i] = \sum_x f_i(x)\, p(x) = F_i$

• $f_i(x)$: an arbitrary function
• $F_i$: a constant

Maximum entropy (maxent): pick the distribution with maximum entropy subject to the constraints.

$L(p, \lambda) = -\sum_x p(x) \log p(x) + \lambda_0 \left(1 - \sum_x p(x)\right) + \sum_i \lambda_i \left(F_i - \sum_x f_i(x)\, p(x)\right)$

$\nabla L = 0 \;\Rightarrow\; p(x) = \frac{1}{Z} \exp\!\left(-\sum_i \lambda_i f_i(x)\right), \qquad Z = \sum_x \exp\!\left(-\sum_i \lambda_i f_i(x)\right)$
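A tiny numerical illustration of this result (domain, feature, target moment, and step size are my choices, not from the slides): on a finite domain with one feature $f(x) = x$, adjusting $\lambda$ until the model moment matches $F$ yields the maxent distribution $p(x) \propto \exp(-\lambda x)$.

```python
import numpy as np

# finite domain {0, 1, 2, 3}; one feature f(x) = x with target moment F = 1.2
xs = np.arange(4, dtype=float)
F = 1.2

# simple fixed-point iteration on lambda: move it until E_p[f] = F
lam = 0.0
for _ in range(5000):
    p = np.exp(-lam * xs)
    p /= p.sum()                        # p(x) = exp(-lam * f(x)) / Z
    lam += 0.1 * ((p * xs).sum() - F)   # raise lam if the moment is too large

moment = (p * xs).sum()
```

Since the target mean 1.2 is below the uniform mean 1.5, the converged $\lambda$ is positive, tilting mass toward small $x$.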
Maximum entropy: constraints

Constants in the constraints:
• $F_i$ measures the empirical counts on the training data:

$F_i = \frac{\sum_{n=1}^{N} f_i(x^{(n)})}{N}$

• These constraints also ensure consistency automatically.
Exponential family: summary

• Many famous distributions are in the exponential family.
• Important properties for learning with exponential families:
  • The gradient of the log partition function gives the expected sufficient statistics, or moments.
  • Moments of any distribution in the exponential family can be computed easily by taking derivatives of the log normalizer.
  • The Hessian of the log partition function is positive semi-definite, so the log partition function is convex.
  • Among all distributions with certain moments of interest, the exponential family has the highest entropy.
• Exponential families are important for modeling the distributions of Markov networks.
Generalized linear models (GLIMs)

Conditional relationship between $y$ and $\boldsymbol{x}$. Examples:
• Linear regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}, \sigma^2) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
• Discriminative linear classifiers (two-class):
  • Logistic regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \sigma(\boldsymbol{w}^T\boldsymbol{x}))$
  • Probit regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \Phi(\boldsymbol{w}^T\boldsymbol{x}))$, where $\Phi$ is the CDF of $\mathcal{N}(0,1)$
Generalized linear models (GLIMs)

$p(y \mid \boldsymbol{x})$ is a generalized linear model if:
• $\boldsymbol{x}$ enters the model via a linear combination $\xi = \boldsymbol{w}^T\boldsymbol{x}$
• The conditional mean of $p(y \mid \boldsymbol{x})$ is expressed as $f(\boldsymbol{w}^T\boldsymbol{x})$:
  • $f$ is called the response function
  • $\mu = E[y \mid \boldsymbol{x}] = f(\boldsymbol{w}^T\boldsymbol{x})$
• The distribution of $y$ is characterized by an exponential family distribution (with conditional mean $f(\boldsymbol{w}^T\boldsymbol{x})$)

We have two choices in the specification of a GLIM:
• The choice of the exponential family distribution
  • usually constrained by the nature of $y$
• The choice of the response function $f$, the principal degree of freedom in the specification of a GLIM
  • However, we need to impose constraints on this function (e.g., $\mu$ must be in $[0,1]$ for a Bernoulli distribution on $y$)
The relation between variables in a GLIM

(figure omitted: $\boldsymbol{x} \to \xi = \boldsymbol{w}^T\boldsymbol{x} \to \mu = f(\xi) \to \eta = \psi(\mu) \to y$)
Canonical response function

Canonical response function: $f(\cdot) = \psi^{-1}(\cdot)$, i.e., $\eta = \xi$.
• In this case, the choice of the exponential family density completely determines the GLIM.
• The constraints on the range of $\mu$ are automatically satisfied: $\mu = f(\xi)$ is guaranteed to be a possible value of the conditional expectation, since $f(\xi) = \psi^{-1}(\eta) = \frac{dA(\eta)}{d\eta} = E[y \mid \boldsymbol{x}]$.
Log likelihood for GLIMs

$\ell(\eta; \mathcal{D}) = \ln p(\mathcal{D} \mid \eta) = \ln \prod_{n=1}^{N} h(y^{(n)})\, \exp\!\left(\eta^{(n)} y^{(n)} - A(\eta^{(n)})\right)$
$= \sum_{n=1}^{N} \ln h(y^{(n)}) + \sum_{n=1}^{N} \left(\eta^{(n)} y^{(n)} - A(\eta^{(n)})\right)$

where $\eta^{(n)} = \psi(\mu^{(n)})$ and $\mu^{(n)} = f(\boldsymbol{w}^T\boldsymbol{x}^{(n)})$.

In the case of the canonical response function, $\eta^{(n)} = \boldsymbol{w}^T\boldsymbol{x}^{(n)}$:

$\ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \ln h(y^{(n)}) + \boldsymbol{w}^T \sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)} - \sum_{n=1}^{N} A(\boldsymbol{w}^T\boldsymbol{x}^{(n)})$

$\sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)}$: sufficient statistics for $\boldsymbol{w}$
Gradient of log likelihood

$\nabla_{\boldsymbol{w}} \ell(\eta; \mathcal{D}) = \sum_{n=1}^{N} \frac{d\ell^{(n)}}{d\eta^{(n)}}\, \nabla_{\boldsymbol{w}} \eta^{(n)} = \sum_{n=1}^{N} \left(y^{(n)} - \frac{dA(\eta^{(n)})}{d\eta^{(n)}}\right) \nabla_{\boldsymbol{w}} \eta^{(n)}$
$= \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \frac{d\eta^{(n)}}{d\mu^{(n)}} \frac{d\mu^{(n)}}{d\xi^{(n)}}\, \boldsymbol{x}^{(n)}$

In the case of the canonical response function, $\eta^{(n)} = \xi^{(n)}$:

$\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \boldsymbol{x}^{(n)}, \qquad \mu^{(n)} = f(\boldsymbol{w}^T\boldsymbol{x}^{(n)})$
Online learning for GLIMs

An LMS-like algorithm as a generic stochastic gradient ascent on the log likelihood for GLIMs:

$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left(y^{(n)} - \mu^{(n)}_t\right) \boldsymbol{x}^{(n)}, \qquad \mu^{(n)}_t = f\!\left((\boldsymbol{w}^t)^T \boldsymbol{x}^{(n)}\right)$

• This is similar to the Least Mean Squares (LMS) algorithm.
• If we do not use the canonical response function, scaling coefficients due to the derivatives of $\psi(\cdot)$ and $f(\cdot)$ will also be incorporated into the step size.
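The update rule can be exercised on synthetic data. The sketch below (data, seed, and step size are my own choices, not from the slides) runs the stochastic update with the canonical sigmoid response, i.e., online logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic logistic data; the constant first feature plays the role of a bias
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
w_true = np.array([-0.5, 2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

# generic GLIM stochastic update with canonical response f = sigmoid
w = np.zeros(2)
rho = 0.05
for epoch in range(30):
    for n in range(N):
        mu_n = 1.0 / (1.0 + np.exp(-w @ X[n]))
        w = w + rho * (y[n] - mu_n) * X[n]
```

With a constant step size the iterates hover around the MLE rather than converging exactly; a decaying $\rho$ would give convergence.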
Batch learning for GLIMs: Newton-Raphson

For the canonical response function:

$\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \boldsymbol{x}^{(n)} = X^T(\boldsymbol{y} - \boldsymbol{\mu})$

$H = \frac{d^2\ell}{d\boldsymbol{w}\, d\boldsymbol{w}^T} = \frac{d}{d\boldsymbol{w}^T} \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \boldsymbol{x}^{(n)} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\boldsymbol{w}^T} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\eta^{(n)}} \frac{d\eta^{(n)}}{d\boldsymbol{w}^T}$

Since $\eta^{(n)} = \boldsymbol{w}^T\boldsymbol{x}^{(n)}$:

$H = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\eta^{(n)}} (\boldsymbol{x}^{(n)})^T = -X^T W X$

where

$W = \mathrm{diag}\!\left(\frac{d\mu^{(1)}}{d\eta^{(1)}}, \ldots, \frac{d\mu^{(N)}}{d\eta^{(N)}}\right), \qquad \frac{d\mu^{(n)}}{d\eta^{(n)}} = \frac{d^2 A}{d(\eta^{(n)})^2}$

$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}, \qquad \boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$
Batch learning for GLIMs: Newton-Raphson (cont.)

$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \left(X^T W^t X\right)^{-1} X^T (\boldsymbol{y} - \boldsymbol{\mu}^t)$
$= \left(X^T W^t X\right)^{-1} \left[X^T W^t X \boldsymbol{w}^t + X^T (\boldsymbol{y} - \boldsymbol{\mu}^t)\right]$

$\Rightarrow \boldsymbol{w}^{t+1} = \left(X^T W^t X\right)^{-1} X^T W^t \boldsymbol{z}^t, \qquad \boldsymbol{z}^t = X\boldsymbol{w}^t + (W^t)^{-1}(\boldsymbol{y} - \boldsymbol{\mu}^t)$

This is Iteratively Reweighted Least Squares (IRLS): each step solves a weighted least-squares problem with working response $\boldsymbol{z}^t$.
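The IRLS update can be written generically for any canonical-response GLIM; the sketch below (function and argument names are my own) takes the response function and its derivative as inputs:

```python
import numpy as np

def irls_step(w, X, y, f, df_deta):
    """One IRLS update w <- (X^T W X)^{-1} X^T W z for a canonical-response GLIM."""
    eta = X @ w
    mu = f(eta)
    Wd = df_deta(eta)            # diagonal of W = d mu / d eta
    z = eta + (y - mu) / Wd      # working response z = X w + W^{-1} (y - mu)
    XtW = X.T * Wd               # X^T W without forming the diagonal matrix
    return np.linalg.solve(XtW @ X, XtW @ z)
```

For example, with `f` the sigmoid and `df_deta(eta) = f(eta) * (1 - f(eta))`, iterating this step performs Newton-Raphson for logistic regression.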
Linear regression

Cost function (from MLE, where $p(y \mid \boldsymbol{x}) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$):

$J(\boldsymbol{w}) = \frac{1}{2} \sum_{n=1}^{N} \left(\boldsymbol{w}^T\boldsymbol{x}^{(n)} - y^{(n)}\right)^2$

$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = 0 \;\Rightarrow\; \boldsymbol{w} = \left(X^T X\right)^{-1} X^T \boldsymbol{y}$

Online learning (LMS):

$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left(y^{(n)} - (\boldsymbol{w}^t)^T\boldsymbol{x}^{(n)}\right) \boldsymbol{x}^{(n)}$

Canonical response function: $\mu = \eta = \boldsymbol{w}^T\boldsymbol{x} \;\Rightarrow\; \frac{d\mu}{d\eta} = 1 \;\Rightarrow\; W = I$

IRLS:

$\boldsymbol{w}^{t+1} = \left(X^T W^t X\right)^{-1} X^T W^t \boldsymbol{z}^t = \left(X^T X\right)^{-1} X^T \left(X\boldsymbol{w}^t + \boldsymbol{y} - \boldsymbol{\mu}^t\right) = \left(X^T X\right)^{-1} X^T \boldsymbol{y}$

so IRLS recovers the closed-form solution in a single step.
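Because $W = I$ here, one IRLS step from any starting point lands exactly on the normal-equations solution. A small sketch on made-up data (not from the slides):

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([0.1, 0.9, 2.1, 2.9])

w_ols = np.linalg.solve(X.T @ X, X.T @ y)    # closed form (X^T X)^{-1} X^T y

w0 = np.array([5.0, -3.0])                   # arbitrary starting point
mu = X @ w0
z = X @ w0 + (y - mu)                        # working response with W = I; simplifies to y
w_irls = np.linalg.solve(X.T @ X, X.T @ z)   # one IRLS step
```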
Logistic regression

$\mu(\boldsymbol{x}) = \frac{1}{1 + e^{-\eta(\boldsymbol{x})}}$

Canonical response function: $\eta = \xi = \boldsymbol{w}^T\boldsymbol{x}$

IRLS: $\frac{d\mu}{d\eta} = \mu(1-\mu)$

$W = \begin{bmatrix} \mu^{(1)}(1-\mu^{(1)}) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \mu^{(N)}(1-\mu^{(N)}) \end{bmatrix}$

$\boldsymbol{w}^{t+1} = \left(X^T W^t X\right)^{-1} X^T W^t \boldsymbol{z}^t, \qquad \boldsymbol{z}^t = X\boldsymbol{w}^t + (W^t)^{-1}(\boldsymbol{y} - \boldsymbol{\mu}^t)$
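Putting the pieces together, a compact logistic-regression fit via IRLS (a sketch on made-up, non-separable toy data so the MLE is finite; function and variable names are my own):

```python
import numpy as np

def fit_logistic_irls(X, y, iters=25):
    """Logistic regression by IRLS: W = diag(mu_n (1 - mu_n)),
    z = X w + W^{-1} (y - mu), then w <- (X^T W X)^{-1} X^T W z."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ w))
        Wd = mu * (1.0 - mu)              # diagonal of W
        z = X @ w + (y - mu) / Wd         # working response
        XtW = X.T * Wd
        w = np.linalg.solve(XtW @ X, XtW @ z)
    return w

# non-separable toy data (perfectly separable data would drive ||w|| to infinity)
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
y = np.array([0., 1., 0., 1., 1.])
w_hat = fit_logistic_irls(X, y)
```

At the fitted $\hat{\boldsymbol{w}}$ the gradient $X^T(\boldsymbol{y} - \boldsymbol{\mu})$ vanishes, matching the stationarity condition derived earlier.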
References

• Jordan, M. I., An Introduction to Probabilistic Graphical Models, Chapter 8.
• Koller, D. & Friedman, N., Probabilistic Graphical Models, Sections 8.1-8.3.