Exponential Family & Generalized Linear Models (GLIMs)
Probabilistic Graphical Models
Sharif University of Technology
Spring 2017
Soleymani
Outline

• Exponential family
  • Many standard distributions are in this family
  • Learning algorithms for different models in this family share similarities:
    • ML estimation has a simple form for exponential families: moment matching of sufficient statistics
    • Bayesian learning is simplest for exponential families
  • They have a maximum entropy interpretation
• GLIMs: a way to parameterize conditional distributions that place an exponential family distribution on a variable for each value of its parents
Exponential family: canonical parameterization

$p(x \mid \eta) = \frac{1}{Z(\eta)}\, h(x)\, \exp\!\left(\eta^T T(x)\right)$

$Z(\eta) = \int h(x)\, \exp\!\left(\eta^T T(x)\right) dx$

$p(x \mid \eta) = h(x)\, \exp\!\left(\eta^T T(x) - \ln Z(\eta)\right)$

• $T: \mathcal{X} \to \mathbb{R}^K$: sufficient statistics function
• $\eta$: natural or canonical parameters
• $h: \mathcal{X} \to \mathbb{R}^+$: reference measure, independent of the parameters
• $Z$: normalization factor or partition function ($0 < Z(\eta) < \infty$)
• $A(\eta) = \ln Z(\eta)$: log partition function
Example: Bernoulli

$p(x \mid \mu) = \mu^x (1-\mu)^{1-x} = \exp\!\left(x \ln\frac{\mu}{1-\mu} + \ln(1-\mu)\right)$

• $\eta = \ln\frac{\mu}{1-\mu} \;\Rightarrow\; \mu = \frac{e^\eta}{e^\eta + 1} = \frac{1}{1+e^{-\eta}}$
• $T(x) = x$
• $A(\eta) = -\ln(1-\mu) = \ln(1+e^\eta)$
• $h(x) = 1$
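As a quick numerical sanity check (an illustrative sketch, not part of the slides), the Bernoulli pmf can be evaluated both in its standard form and in the exponential-family form above, and the two agree:

```python
import numpy as np

def bernoulli_pmf(x, mu):
    """Standard Bernoulli pmf."""
    return mu**x * (1 - mu)**(1 - x)

def bernoulli_expfam(x, mu):
    """Same pmf written as h(x) * exp(eta*T(x) - A(eta))."""
    eta = np.log(mu / (1 - mu))   # natural parameter
    A = np.log(1 + np.exp(eta))   # log partition function
    h = 1.0                       # reference measure
    T = x                         # sufficient statistic
    return h * np.exp(eta * T - A)
```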
Example: Gaussian

$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

• $\eta = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \mu/\sigma^2 \\ -\frac{1}{2\sigma^2} \end{bmatrix}$
• $\Rightarrow \mu = -\frac{\eta_1}{2\eta_2}, \quad \sigma^2 = -\frac{1}{2\eta_2}$
• $T(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$
• $A(\eta) = \ln\!\left(\sqrt{2\pi}\,\sigma\right) + \frac{\mu^2}{2\sigma^2} = \frac{1}{2}\ln 2\pi - \frac{1}{2}\ln(-2\eta_2) - \frac{\eta_1^2}{4\eta_2}$
• $h(x) = 1$
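The same check works for the Gaussian (again an illustrative sketch, not from the slides): evaluating the density via the natural parameters $(\eta_1, \eta_2)$ and the log partition function reproduces the standard pdf.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Standard univariate Gaussian pdf."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gaussian_expfam(x, mu, sigma2):
    """Same pdf via natural parameters: exp(eta1*x + eta2*x^2 - A(eta)), h(x) = 1."""
    eta1 = mu / sigma2
    eta2 = -1.0 / (2 * sigma2)
    # A(eta) = (1/2) ln 2*pi - (1/2) ln(-2*eta2) - eta1^2 / (4*eta2)
    A = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * eta2) - eta1**2 / (4 * eta2)
    return np.exp(eta1 * x + eta2 * x**2 - A)
```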
Example: Multinomial

$p(x \mid \pi) = \prod_{k=1}^{K} \pi_k^{x_k}$

$p(x \mid \pi) = \exp\!\left(\sum_{k=1}^{K} x_k \ln \pi_k\right) = \exp\!\left(\sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \ln\!\left(1 - \sum_{k=1}^{K-1} \pi_k\right)\right)$

• $\eta = [\eta_1, \ldots, \eta_{K-1}]^T = \left[\ln\frac{\pi_1}{1-\sum_{k=1}^{K-1}\pi_k}, \ldots, \ln\frac{\pi_{K-1}}{1-\sum_{k=1}^{K-1}\pi_k}\right]^T$
• i.e., $\eta = \left[\ln\frac{\pi_1}{\pi_K}, \ldots, \ln\frac{\pi_{K-1}}{\pi_K}\right]^T \;\Rightarrow\; \pi_k = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}}$ (with $\eta_K = 0$)
• $T(x) = [x_1, \ldots, x_{K-1}]^T$
• $A(\eta) = -\ln \pi_K = -\ln\!\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln \sum_{k=1}^{K} e^{\eta_k}$

Note the constraint $\sum_{k=1}^{K} \pi_k = 1$.
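The map between $\pi$ and $\eta$ is the familiar softmax with the last category as reference; a minimal sketch (my own helper name) verifying the round trip:

```python
import numpy as np

def pi_from_eta(eta):
    """Recover mean parameters pi from natural parameters eta_k = ln(pi_k / pi_K),
    appending eta_K = 0 for the reference category (a softmax)."""
    eta_full = np.append(eta, 0.0)
    e = np.exp(eta_full - eta_full.max())  # shift for numerical stability
    return e / e.sum()

pi = np.array([0.2, 0.3, 0.5])
eta = np.log(pi[:-1] / pi[-1])             # K-1 natural parameters
pi_rec = pi_from_eta(eta)                  # should recover pi exactly
```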
Well-behaved parameter space

Multiple exponential families may encode the same set of distributions. We want the parameter space $\{\eta \mid 0 < Z(\eta) < \infty\}$ to be:
• a convex set
• non-redundant: $\eta \neq \eta' \Rightarrow p(x \mid \eta) \neq p(x \mid \eta')$
  • i.e., the function from $\theta$ to $\eta$ is invertible
  • Example: the invertible map from $\eta$ to $\mu$ in the Bernoulli example, $\mu = \frac{1}{1+e^{-\eta}}$
Examples of non-exponential-family distributions

• Uniform
• Laplace
• Student's t-distribution
Moments

$A(\eta) = \ln Z(\eta), \qquad Z(\eta) = \int h(x)\, \exp\!\left(\eta^T T(x)\right) dx$

$\nabla_\eta A(\eta) = \frac{\nabla_\eta Z(\eta)}{Z(\eta)} = \frac{\int T(x)\, h(x)\, \exp\!\left(\eta^T T(x)\right) dx}{Z(\eta)} = \int T(x)\, \frac{h(x)\, \exp\!\left(\eta^T T(x)\right)}{Z(\eta)}\, dx = E_{p(x \mid \eta)}[T(x)]$

$\Rightarrow \nabla_\eta A(\eta) = E_\eta[T(x)]$

$\nabla^2_\eta A(\eta) = E_\eta\!\left[T(x)\, T(x)^T\right] - E_\eta[T(x)]\, E_\eta[T(x)]^T = \mathrm{Cov}_\eta[T(x)]$

The first derivative of $A(\eta)$ is the mean of the sufficient statistics, and the second derivative is their covariance. More generally, the $i$-th derivative gives the $i$-th cumulant of the sufficient statistics (which coincides with the $i$-th centered moment for low orders).
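These identities are easy to verify numerically. The sketch below (an illustration, not from the slides) differentiates the Bernoulli log partition function $A(\eta) = \ln(1 + e^\eta)$ by finite differences and compares against the mean $\mu$ and variance $\mu(1-\mu)$ of $T(x) = x$:

```python
import numpy as np

def A(eta):
    """Bernoulli log partition function."""
    return np.log1p(np.exp(eta))

eta, eps = 0.7, 1e-5
mu = 1.0 / (1.0 + np.exp(-eta))  # E[T(x)] = E[x]

# central finite differences for the first and second derivatives of A
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)               # ~ E[x]
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2    # ~ Var[x]
```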
Properties

The moment parameters $\mu$ can be derived as a function of the natural (canonical) parameters:

$\nabla_\eta A(\eta) = E_\eta[T(x)]$

$\mu \equiv E_\eta[T(x)] \;\Rightarrow\; \nabla_\eta A(\eta) = \mu$

$A(\eta)$ is convex, since $\nabla^2_\eta A(\eta) = \mathrm{Cov}_\eta[T(x)] \succeq 0$:
• a covariance matrix is always positive semi-definite, so the Hessian $\nabla^2_\eta A(\eta)$ is positive semi-definite, and hence $A(\eta) = \ln Z(\eta)$ is a convex function of $\eta$.

For many distributions, $\mu \equiv E_\eta[T(x)]$ coincides with the usual mean parameter.
Exponential family: moment parameterization

A distribution in the exponential family can also be parameterized by the moment parameterization:

$p(x \mid \mu) = \frac{1}{Z(\mu)}\, h(x)\, \exp\!\left(\psi(\mu)^T T(x)\right)$

$Z(\mu) = \int h(x)\, \exp\!\left(\psi(\mu)^T T(x)\right) dx$

If $\nabla^2_\eta A(\eta) \succ 0$, then $\nabla_\eta A(\eta)$ is strictly increasing, so $\psi^{-1}(\eta) = \mu = \nabla_\eta A(\eta)$ is strictly increasing and hence one-to-one.
• The mapping from the moments to the canonical parameters is invertible (a one-to-one relationship): $\eta = \psi(\mu)$, $\mu = \psi^{-1}(\eta)$
• $\psi$ maps the moment parameters $\mu$ to the canonical parameters, where $\mu \equiv E_\eta[T(x)] = \nabla_\eta A(\eta)$
Sufficiency

• A statistic is a function of a random variable.
• Suppose that the distribution of $X$ depends on a parameter $\theta$.
• "$T(X)$ is a sufficient statistic for $\theta$ if there is no information in $X$ regarding $\theta$ beyond that in $T(X)$."
• Sufficiency in both the frequentist and Bayesian frameworks implies a factorization of $p(x \mid \theta)$ (the Neyman factorization theorem):

$p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))$

$p(x, \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x)) \;\Rightarrow\; p(x \mid \theta) = \psi_1'(T(x), \theta)\, \psi_2(x, T(x))$
Sufficient statistic

Sufficient statistics and the exponential family:

$p(x \mid \eta) = h(x)\, \exp\!\left(\eta^T T(x) - A(\eta)\right)$

In the case of i.i.d. sampling, the sufficient statistic for a set of $N$ observations is obtained easily:

$p(\mathcal{D} \mid \eta) = \prod_{n=1}^{N} h(x^{(n)})\, \exp\!\left(\eta^T T(x^{(n)}) - A(\eta)\right) = \left(\prod_{n=1}^{N} h(x^{(n)})\right) \exp\!\left\{\eta^T \sum_{n=1}^{N} T(x^{(n)}) - N A(\eta)\right\}$

$\mathcal{D}$ itself has an exponential family distribution with sufficient statistic $\sum_{n=1}^{N} T(x^{(n)})$.
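One consequence worth seeing concretely: two different samples with the same value of the sufficient statistic yield the same likelihood for every parameter value. A sketch for the Gaussian, where $T$-totals are $\sum_n x^{(n)}$ and $\sum_n (x^{(n)})^2$ (example data constructed by hand, not from the slides):

```python
import numpy as np

def gaussian_loglik(data, mu, sigma2):
    """i.i.d. Gaussian log likelihood."""
    return np.sum(-(data - mu)**2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2))

d1 = np.array([0.0, 1.0, 2.0])                      # sum = 3, sum of squares = 5
r = np.sqrt(3.25)
d2 = np.array([0.5, (2.5 + r) / 2, (2.5 - r) / 2])  # different sample, same sufficient statistics
```

For any $(\mu, \sigma^2)$, `gaussian_loglik(d1, ...)` and `gaussian_loglik(d2, ...)` agree, even though the samples differ.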
MLE for exponential family

$\ell(\eta; \mathcal{D}) = \ln p(\mathcal{D} \mid \eta) = \ln \prod_{n=1}^{N} h(x^{(n)})\, \exp\!\left(\eta^T T(x^{(n)}) - A(\eta)\right)$
$= \sum_{n=1}^{N} \ln h(x^{(n)}) + \eta^T \sum_{n=1}^{N} T(x^{(n)}) - N A(\eta)$

This is a concave function of $\eta$ (since $A(\eta)$ is convex).

$\nabla_\eta \ell(\eta; \mathcal{D}) = 0 \;\Rightarrow\; \sum_{n=1}^{N} T(x^{(n)}) - N \nabla_\eta A(\eta) = 0$

$\Rightarrow \nabla_\eta A(\hat\eta) = \frac{\sum_{n=1}^{N} T(x^{(n)})}{N}$

$\Rightarrow E_{\hat\eta}[T(x)] = \frac{\sum_{n=1}^{N} T(x^{(n)})}{N} \qquad$ (moment matching)
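For the Bernoulli case, moment matching says $\mathrm{sigmoid}(\hat\eta)$ should equal the sample mean; the sketch below (toy data of my choosing) confirms that this choice zeroes the log-likelihood gradient:

```python
import numpy as np

data = np.array([1., 0., 1., 1., 0., 1.])
xbar = data.mean()                        # empirical mean of T(x) = x

# moment matching: pick eta so that E_eta[T(x)] = sigmoid(eta) = xbar
eta_hat = np.log(xbar / (1.0 - xbar))

def grad_loglik(eta, data):
    """Gradient of sum_n [eta * x_n - A(eta)] with A(eta) = log(1 + e^eta)."""
    return data.sum() - len(data) / (1.0 + np.exp(-eta))
```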
Maximum entropy models

Among all distributions with certain moments of interest, the exponential family is the most random (it makes the fewest assumptions and imposes the least structure):
• Out of all distributions that reproduce the observed sufficient statistics, the exponential family distribution (roughly) makes the fewest additional assumptions.
• The unique distribution maximizing the entropy, subject to the constraint that these moments are exactly matched, is an exponential family distribution.
Maximum entropy

Constraints:

$E[f_i] = \sum_x f_i(x)\, p(x) = F_i$

• $f_i(x)$: an arbitrary function
• $F_i$: a constant

Maximum entropy (maxent): pick the distribution with maximum entropy subject to the constraints.

$L(p, \lambda) = -\sum_x p(x) \log p(x) + \lambda_0 \left(1 - \sum_x p(x)\right) + \sum_i \lambda_i \left(F_i - \sum_x f_i(x)\, p(x)\right)$

$\nabla L = 0 \;\Rightarrow\; p(x) = \frac{1}{Z} \exp\!\left(-\sum_i \lambda_i f_i(x)\right), \qquad Z = \sum_x \exp\!\left(-\sum_i \lambda_i f_i(x)\right)$
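A tiny numerical illustration of this result (domain, feature, target moment, and step size are my choices, not from the slides): on a finite domain with one feature $f(x) = x$, adjusting $\lambda$ until the model moment matches $F$ yields the maxent distribution $p(x) \propto \exp(-\lambda x)$.

```python
import numpy as np

# finite domain {0, 1, 2, 3}; one feature f(x) = x with target moment F = 1.2
xs = np.arange(4, dtype=float)
F = 1.2

# simple fixed-point iteration on lambda: move it until E_p[f] = F
lam = 0.0
for _ in range(5000):
    p = np.exp(-lam * xs)
    p /= p.sum()                        # p(x) = exp(-lam * f(x)) / Z
    lam += 0.1 * ((p * xs).sum() - F)   # raise lam if the moment is too large

moment = (p * xs).sum()
```

Since the target mean 1.2 is below the uniform mean 1.5, the converged $\lambda$ is positive, tilting mass toward small $x$.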
Maximum entropy: constraints

Constants in the constraints:
• $F_i$ measures the empirical counts on the training data:

$F_i = \frac{\sum_{n=1}^{N} f_i(x^{(n)})}{N}$

• These constraints also ensure consistency automatically.
Exponential family: summary

• Many famous distributions are in the exponential family.
• Important properties for learning with exponential families:
  • The gradient of the log partition function gives the expected sufficient statistics, or moments.
  • Moments of any distribution in the exponential family can be computed easily by taking derivatives of the log normalizer.
  • The Hessian of the log partition function is positive semi-definite, so the log partition function is convex.
  • Among all distributions with certain moments of interest, the exponential family has the highest entropy.
• Exponential families are important for modeling the distributions of Markov networks.
Generalized linear models (GLIMs)

Conditional relationship between $y$ and $\boldsymbol{x}$. Examples:
• Linear regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}, \sigma^2) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
• Discriminative linear classifiers (two-class):
  • Logistic regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \sigma(\boldsymbol{w}^T\boldsymbol{x}))$
  • Probit regression: $p(y \mid \boldsymbol{x}, \boldsymbol{w}) = \mathrm{Ber}(y \mid \Phi(\boldsymbol{w}^T\boldsymbol{x}))$, where $\Phi$ is the CDF of $\mathcal{N}(0,1)$
Generalized linear models (GLIMs)

$p(y \mid \boldsymbol{x})$ is a generalized linear model if:
• $\boldsymbol{x}$ enters the model via a linear combination $\xi = \boldsymbol{w}^T\boldsymbol{x}$
• The conditional mean of $p(y \mid \boldsymbol{x})$ is expressed as $f(\boldsymbol{w}^T\boldsymbol{x})$:
  • $f$ is called the response function
  • $\mu = E[y \mid \boldsymbol{x}] = f(\boldsymbol{w}^T\boldsymbol{x})$
• The distribution of $y$ is characterized by an exponential family distribution (with conditional mean $f(\boldsymbol{w}^T\boldsymbol{x})$)

We have two choices in the specification of a GLIM:
• The choice of the exponential family distribution
  • usually constrained by the nature of $y$
• The choice of the response function $f$, the principal degree of freedom in the specification of a GLIM
  • However, we need to impose constraints on this function (e.g., $\mu$ must be in $[0,1]$ for a Bernoulli distribution on $y$)
The relation between variables in a GLIM

(figure omitted: $\boldsymbol{x} \to \xi = \boldsymbol{w}^T\boldsymbol{x} \to \mu = f(\xi) \to \eta = \psi(\mu) \to y$)
Canonical response function

Canonical response function: $f(\cdot) = \psi^{-1}(\cdot)$, i.e., $\eta = \xi$.
• In this case, the choice of the exponential family density completely determines the GLIM.
• The constraints on the range of $\mu$ are automatically satisfied: $\mu = f(\xi)$ is guaranteed to be a possible value of the conditional expectation, since $f(\xi) = \psi^{-1}(\eta) = \frac{dA(\eta)}{d\eta} = E[y \mid \boldsymbol{x}]$.
Log likelihood for GLIMs

$\ell(\eta; \mathcal{D}) = \ln p(\mathcal{D} \mid \eta) = \ln \prod_{n=1}^{N} h(y^{(n)})\, \exp\!\left(\eta^{(n)} y^{(n)} - A(\eta^{(n)})\right)$
$= \sum_{n=1}^{N} \ln h(y^{(n)}) + \sum_{n=1}^{N} \left(\eta^{(n)} y^{(n)} - A(\eta^{(n)})\right)$

where $\eta^{(n)} = \psi(\mu^{(n)})$ and $\mu^{(n)} = f(\boldsymbol{w}^T\boldsymbol{x}^{(n)})$.

In the case of the canonical response function, $\eta^{(n)} = \boldsymbol{w}^T\boldsymbol{x}^{(n)}$:

$\ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \ln h(y^{(n)}) + \boldsymbol{w}^T \sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)} - \sum_{n=1}^{N} A(\boldsymbol{w}^T\boldsymbol{x}^{(n)})$

$\sum_{n=1}^{N} \boldsymbol{x}^{(n)} y^{(n)}$: sufficient statistics for $\boldsymbol{w}$
Gradient of log likelihood

$\nabla_{\boldsymbol{w}} \ell(\eta; \mathcal{D}) = \sum_{n=1}^{N} \frac{d\ell^{(n)}}{d\eta^{(n)}}\, \nabla_{\boldsymbol{w}} \eta^{(n)} = \sum_{n=1}^{N} \left(y^{(n)} - \frac{dA(\eta^{(n)})}{d\eta^{(n)}}\right) \nabla_{\boldsymbol{w}} \eta^{(n)}$
$= \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \frac{d\eta^{(n)}}{d\mu^{(n)}} \frac{d\mu^{(n)}}{d\xi^{(n)}}\, \boldsymbol{x}^{(n)}$

In the case of the canonical response function, $\eta^{(n)} = \xi^{(n)}$:

$\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \boldsymbol{x}^{(n)}, \qquad \mu^{(n)} = f(\boldsymbol{w}^T\boldsymbol{x}^{(n)})$
Online learning for GLIMs

An LMS-like algorithm as a generic stochastic gradient ascent on the log likelihood for GLIMs:

$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left(y^{(n)} - \mu^{(n)}_t\right) \boldsymbol{x}^{(n)}, \qquad \mu^{(n)}_t = f\!\left((\boldsymbol{w}^t)^T \boldsymbol{x}^{(n)}\right)$

• This is similar to the Least Mean Squares (LMS) algorithm.
• If we do not use the canonical response function, scaling coefficients due to the derivatives of $\psi(\cdot)$ and $f(\cdot)$ will also be incorporated into the step size.
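The update rule can be exercised on synthetic data. The sketch below (data, seed, and step size are my own choices, not from the slides) runs the stochastic update with the canonical sigmoid response, i.e., online logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic logistic data; the constant first feature plays the role of a bias
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
w_true = np.array([-0.5, 2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

# generic GLIM stochastic update with canonical response f = sigmoid
w = np.zeros(2)
rho = 0.05
for epoch in range(30):
    for n in range(N):
        mu_n = 1.0 / (1.0 + np.exp(-w @ X[n]))
        w = w + rho * (y[n] - mu_n) * X[n]
```

With a constant step size the iterates hover around the MLE rather than converging exactly; a decaying $\rho$ would give convergence.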
Batch learning for GLIMs: Newton-Raphson

For the canonical response function:

$\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}; \mathcal{D}) = \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \boldsymbol{x}^{(n)} = X^T(\boldsymbol{y} - \boldsymbol{\mu})$

$H = \frac{d^2\ell}{d\boldsymbol{w}\, d\boldsymbol{w}^T} = \frac{d}{d\boldsymbol{w}^T} \sum_{n=1}^{N} \left(y^{(n)} - \mu^{(n)}\right) \boldsymbol{x}^{(n)} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\boldsymbol{w}^T} = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\eta^{(n)}} \frac{d\eta^{(n)}}{d\boldsymbol{w}^T}$

Since $\eta^{(n)} = \boldsymbol{w}^T\boldsymbol{x}^{(n)}$:

$H = -\sum_{n=1}^{N} \boldsymbol{x}^{(n)} \frac{d\mu^{(n)}}{d\eta^{(n)}} (\boldsymbol{x}^{(n)})^T = -X^T W X$

where

$W = \mathrm{diag}\!\left(\frac{d\mu^{(1)}}{d\eta^{(1)}}, \ldots, \frac{d\mu^{(N)}}{d\eta^{(N)}}\right), \qquad \frac{d\mu^{(n)}}{d\eta^{(n)}} = \frac{d^2 A}{d(\eta^{(n)})^2}$

$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}, \qquad \boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$
Batch learning for GLIMs: Newton-Raphson (cont.)

$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \left(X^T W^t X\right)^{-1} X^T (\boldsymbol{y} - \boldsymbol{\mu}^t)$
$= \left(X^T W^t X\right)^{-1} \left[X^T W^t X \boldsymbol{w}^t + X^T (\boldsymbol{y} - \boldsymbol{\mu}^t)\right]$

$\Rightarrow \boldsymbol{w}^{t+1} = \left(X^T W^t X\right)^{-1} X^T W^t \boldsymbol{z}^t, \qquad \boldsymbol{z}^t = X\boldsymbol{w}^t + (W^t)^{-1}(\boldsymbol{y} - \boldsymbol{\mu}^t)$

This is Iteratively Reweighted Least Squares (IRLS): each step solves a weighted least-squares problem with working response $\boldsymbol{z}^t$.
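The IRLS update can be written generically for any canonical-response GLIM; the sketch below (function and argument names are my own) takes the response function and its derivative as inputs:

```python
import numpy as np

def irls_step(w, X, y, f, df_deta):
    """One IRLS update w <- (X^T W X)^{-1} X^T W z for a canonical-response GLIM."""
    eta = X @ w
    mu = f(eta)
    Wd = df_deta(eta)            # diagonal of W = d mu / d eta
    z = eta + (y - mu) / Wd      # working response z = X w + W^{-1} (y - mu)
    XtW = X.T * Wd               # X^T W without forming the diagonal matrix
    return np.linalg.solve(XtW @ X, XtW @ z)
```

For example, with `f` the sigmoid and `df_deta(eta) = f(eta) * (1 - f(eta))`, iterating this step performs Newton-Raphson for logistic regression.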
Linear regression

Cost function (from MLE, where $p(y \mid \boldsymbol{x}) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$):

$J(\boldsymbol{w}) = \frac{1}{2} \sum_{n=1}^{N} \left(\boldsymbol{w}^T\boldsymbol{x}^{(n)} - y^{(n)}\right)^2$

$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = 0 \;\Rightarrow\; \boldsymbol{w} = \left(X^T X\right)^{-1} X^T \boldsymbol{y}$

Online learning (LMS):

$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \rho \left(y^{(n)} - (\boldsymbol{w}^t)^T\boldsymbol{x}^{(n)}\right) \boldsymbol{x}^{(n)}$

Canonical response function: $\mu = \eta = \boldsymbol{w}^T\boldsymbol{x} \;\Rightarrow\; \frac{d\mu}{d\eta} = 1 \;\Rightarrow\; W = I$

IRLS:

$\boldsymbol{w}^{t+1} = \left(X^T W^t X\right)^{-1} X^T W^t \boldsymbol{z}^t = \left(X^T X\right)^{-1} X^T \left(X\boldsymbol{w}^t + \boldsymbol{y} - \boldsymbol{\mu}^t\right) = \left(X^T X\right)^{-1} X^T \boldsymbol{y}$

so IRLS recovers the closed-form solution in a single step.
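Because $W = I$ here, one IRLS step from any starting point lands exactly on the normal-equations solution. A small sketch on made-up data (not from the slides):

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([0.1, 0.9, 2.1, 2.9])

w_ols = np.linalg.solve(X.T @ X, X.T @ y)    # closed form (X^T X)^{-1} X^T y

w0 = np.array([5.0, -3.0])                   # arbitrary starting point
mu = X @ w0
z = X @ w0 + (y - mu)                        # working response with W = I; simplifies to y
w_irls = np.linalg.solve(X.T @ X, X.T @ z)   # one IRLS step
```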
Logistic regression

$\mu(\boldsymbol{x}) = \frac{1}{1 + e^{-\eta(\boldsymbol{x})}}$

Canonical response function: $\eta = \xi = \boldsymbol{w}^T\boldsymbol{x}$

IRLS: $\frac{d\mu}{d\eta} = \mu(1-\mu)$

$W = \begin{bmatrix} \mu^{(1)}(1-\mu^{(1)}) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \mu^{(N)}(1-\mu^{(N)}) \end{bmatrix}$

$\boldsymbol{w}^{t+1} = \left(X^T W^t X\right)^{-1} X^T W^t \boldsymbol{z}^t, \qquad \boldsymbol{z}^t = X\boldsymbol{w}^t + (W^t)^{-1}(\boldsymbol{y} - \boldsymbol{\mu}^t)$
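Putting the pieces together, a compact logistic-regression fit via IRLS (a sketch on made-up, non-separable toy data so the MLE is finite; function and variable names are my own):

```python
import numpy as np

def fit_logistic_irls(X, y, iters=25):
    """Logistic regression by IRLS: W = diag(mu_n (1 - mu_n)),
    z = X w + W^{-1} (y - mu), then w <- (X^T W X)^{-1} X^T W z."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ w))
        Wd = mu * (1.0 - mu)              # diagonal of W
        z = X @ w + (y - mu) / Wd         # working response
        XtW = X.T * Wd
        w = np.linalg.solve(XtW @ X, XtW @ z)
    return w

# non-separable toy data (perfectly separable data would drive ||w|| to infinity)
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
y = np.array([0., 1., 0., 1., 1.])
w_hat = fit_logistic_irls(X, y)
```

At the fitted $\hat{\boldsymbol{w}}$ the gradient $X^T(\boldsymbol{y} - \boldsymbol{\mu})$ vanishes, matching the stationarity condition derived earlier.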
References

• Jordan, M. I., An Introduction to Probabilistic Graphical Models, Chapter 8.
• Koller, D. & Friedman, N., Probabilistic Graphical Models, Sections 8.1-8.3.