Chang Liu (刘畅), Microsoft Research Asia (微软亚洲研究院) · Advanced Machine Learning (高等机器学习)
Transcript
Page 1

生成模型 (Generative Models)

Chang Liu (刘畅), Microsoft Research Asia (微软亚洲研究院)

Advanced Machine Learning (高等机器学习)

Page 2

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 3

Generative Model: Overview
• Generative Models: models that describe the generating process of all observations.
  • Technically, they specify p(x) (unsupervised) or p(x, y) (supervised) in principle, either explicitly or implicitly.

{x^(n)} ∼ p(x)  [figure: sampled images]

Page 4

Generative Model: Overview
• Generative Models: models that describe the generating process of all observations.
  • Technically, they specify p(x) (unsupervised) or p(x, y) (supervised) in principle, either explicitly or implicitly.

{(x^(n), y^(n))} ∼ p(x, y)  [figure: digit images x^(n) with labels y^(n) = "0", "1", …, "9"]

Page 5

Generative Model: Overview
• Non-Generative Models:
Discriminative models (e.g., feedforward neural networks): only p(y|x) is available.

[Figure: x → f → p(y|x) over classes "0"–"9"]

Recurrent neural networks: only a conditional p(·|·) is available.

Page 6

Generative Model: Overview
• Non-Generative Models:

Autoencoders: p(x) is unavailable. [Figure: x → f → z → g → x; img: [DFD+18]]

Page 7

Generative Model: Overview
• What can generative models do:

1. Generate new data.

[Figures: Generation p(x) [KW14]; Conditional Generation p(x|y) [LWZZ18]; Missing-Value Imputation (Completion) p(x_hidden | x_observed) [OKK16]]

Page 8

Generative Model: Overview
• What can generative models do:

1. Generate new data.

"the cat sat on the mat" ∼ p(x): a Language Model.
[Figure: RNN unrolled over the sentence; hidden states h_1 … h_7 emit p(x_1 = the), p(cat|x_1), p(sat|x_{1…2}), p(on|x_{1…3}), p(the|x_{1…4}), p(mat|x_{1…5}), p(</s>|x_{1…6})]

Page 9

Generative Model: Overview
• What can generative models do:

2. Density estimation p(x).

Anomaly Detection [figure: Ritchie Ng]

Page 10

Generative Model: Overview
• What can generative models do:

3. Draw a semantic or concise representation of data x (via a latent variable z).

x (documents) → z (topics) [PT13]

x (image) → z (semantic regions) [DFD+18]

Page 11

Generative Model: Overview
• What can generative models do:

3. Draw a semantic or concise representation of data x (via a latent variable z).

x (image) → z (semantic regions) [KD18]

Page 12

Generative Model: Overview
• What can generative models do:

3. Draw a semantic or concise representation of data x (via a latent variable z).

Dimensionality Reduction:
• x ∈ ℝ^(28×28) → z ∈ ℝ^20 [DFD+18]
• x ∈ ℝ^#vocabulary → z ∈ ℝ^#topic (topic proportion) [PT13]

Page 13

Generative Model: Overview
• What can generative models do:

4. Supervised Learning: argmax_{y*} p(y* | x*, {x^(n), y^(n)}).

[Naive Bayes]

[Figure: x_1 (doc 1) → z (topics) → y_1 (science & tech); x_2 (doc 2) → y_2 (politics); …]

Supervised LDA [MB08]

Page 14

Generative Model: Overview
• What can generative models do:

4. Supervised Learning: argmax_{y*} p(y* | x*, {x^(n), y^(n)}, {x^(m)}).
Semi-Supervised Learning: unlabeled data {x^(m)} can be utilized to learn a better p(x, y).

Page 15

Generative Model: Benefits
"What I cannot create, I do not understand." —Richard Feynman

• Natural for generation.

• For representation learning: responsible and faithful knowledge of the data.

• For supervised learning: can leverage unlabeled data.

• For supervised learning: more data-efficient. Compare logistic regression (discriminative) and naive Bayes (generative) [NJ01] (d: data dimension; N: data size).

Page 16

Generative Model: Taxonomy
• Plain Generative Models: directly model p(x); no latent variable.

• Latent Variable Models:

[Figure: plain GMs model p_θ(x) of x (or (x, y)) directly; latent variable models draw z ∼ p(z) and produce x either deterministically, x = f_θ(z), or probabilistically, x ∼ p_θ(x|z), inducing p_θ(x)]

• Deterministic Generative Models: the dependency between x and z is deterministic: x = f_θ(z).
• Bayesian Generative Models: the dependency between x and z is probabilistic: (x, z) ∼ p_θ(x, z).

Page 17

Generative Model: Taxonomy
• Latent Variable Models
• Bayesian Generative Models
[Figure: z ∼ p(z) → x ∼ p_θ(x|z), inducing p_θ(x)]

• Bayesian Network (BayesNet): p(x, z) specified by p(z) and p(x|z).
  • Synonyms: Causal Networks, Directed Graphical Models.
• Markov Random Field (MRF): p(x, z) specified by an energy function E_θ(x, z): p_θ(x, z) ∝ exp(−E_θ(x, z)).

• Synonyms: Energy-Based Model, Undirected Graphical Model

[Figure: z — x undirected, p_θ(x, z) ∝ exp(−E_θ(x, z))]

Page 18

Generative Model: Taxonomy
• Summary

Generative Models (GMs)
├─ Plain GMs: Autoregressive Models                    (no latent variable; model p_θ(x) directly)
└─ Latent Variable Models                              (whether a latent variable z is used)
   ├─ Deterministic GMs: Generative Adversarial Nets,  (deterministic z-x dependency: x = f_θ(z))
   │  Flow-Based Models
   └─ Bayesian GMs                                     (probabilistic z-x dependency: x ∼ p_θ(x|z))
      ├─ BayesNets: Topic Models,                      (directed)
      │  Variational Auto-Encoders
      └─ MRFs: Boltzmann Machines,                     (undirected: p_θ(x, z) ∝ exp(−E_θ(x, z)))
         Deep Energy-Based Models

Page 19

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 20

Plain Generative Models
• Directly model p(x); no latent variable involved.

• Easy to learn (no normalization constant issue) and use (generation).

• Learning: Maximum Likelihood Estimation (MLE).
  θ* = argmax_θ E_{p̂(x)}[log p_θ(x)] = argmin_θ KL(p̂, p_θ) ≈ argmax_θ (1/N) Σ_{n=1}^N log p_θ(x^(n)).

• First example: Gaussian Mixture Model
  p_θ(x) = Σ_{k=1}^K α_k N(x | μ_k, Σ_k),  θ = {α, μ, Σ}. (A sketch follows below.)


Kullback-Leibler divergence: KL(p̂, p_θ) := E_{p̂(x)}[log(p̂/p_θ)].
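A minimal numpy sketch of the MLE objective for the Gaussian mixture above, assuming 1-D data and scalar variances (a log-sum-exp trick keeps log p_θ(x) numerically stable):

```python
import numpy as np

def gmm_log_likelihood(x, alpha, mu, sigma):
    """Average log-likelihood (1/N) sum_n log p_theta(x_n) for a 1-D GMM."""
    # log N(x | mu_k, sigma_k^2), shape (N, K)
    log_comp = (-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :]) ** 2
                - np.log(sigma[None, :]) - 0.5 * np.log(2 * np.pi))
    # log sum_k alpha_k N(x | mu_k, sigma_k^2), via log-sum-exp
    log_joint = np.log(alpha[None, :]) + log_comp
    m = log_joint.max(axis=1, keepdims=True)
    log_px = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
    return log_px.mean()

x = np.concatenate([np.random.randn(500) - 2, np.random.randn(500) + 2])
print(gmm_log_likelihood(x, np.array([0.5, 0.5]), np.array([-2., 2.]), np.array([1., 1.])))
```

The MLE θ* would be found by ascending this objective (e.g., by EM or gradient methods).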

Page 21

Plain Generative Models
• Autoregressive Model: model p(x) by each conditional p(x_i | x_<i) (i indexes components).
  • Full dependency can be restored.
  • Conditionals are easier to model.
  • Easy learning (MLE).
  • Easy generation:
    x ∼ p(x) ⟺ x_1 ∼ p(x_1), x_2 ∼ p(x_2|x_1), …, x_d ∼ p(x_d | x_1, …, x_{d−1}).
    But non-parallelizable; see the sketch after the figure below.

[Figure: x = (x_1, x_2, …, x_d); p(x_1, x_2, …, x_d) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) ⋯ p(x_d|x_<d)]

Page 22

Autoregressive Models
• Fully Visible Sigmoid Belief Network [Fre98]:
  p(x_i | x_<i) = Bern(x_i | σ(Σ_{j<i} W_ij x_j)).
• Neural Autoregressive Distribution Estimator [LM11]:
  p(x_i | x_<i) = Bern(x_i | σ(V_{i,:} σ(W_{:,<i} x_<i + a) + b_i)).
  (Sigmoid function: σ(r) = 1/(1 + e^{−r}).)
• A typical language model:

[Figure: RNN language model; hidden states h_1 … h_7 emit p(x_1 = the), p(cat|x_1), p(sat|x_{1…2}), p(on|x_{1…3}), p(the|x_{1…4}), p(mat|x_{1…5}), p(</s>|x_{1…6}), so p("the cat sat on the mat") = p(x)]

Page 23

Autoregressive Models
• WaveNet [ODZ+16]
• Construct p(x_i | x_<i) via causal convolution; a sketch follows the figure below.

[Figure: stack of causal convolutions over x_1, …, x_{i−1}; each NN outputs p(x_1), p(x_2|x_1), …, p(x_{i−1}|x_{<(i−1)}), p(x_i|x_<i)]
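A minimal numpy sketch of a (dilated) causal 1-D convolution, the building block named above: the output at position i depends only on x at positions ≤ i (shifted by one step when predicting the next sample). The filter values are placeholders:

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """y[i] = sum_j w[j] * x[i - j*dilation], with left zero-padding."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[pad + i - j * dilation] for j in range(k))
                     for i in range(len(x))])

x = np.arange(8, dtype=float)
print(causal_conv1d(x, np.array([0.5, 0.5]), dilation=2))  # y[i] uses x[i] and x[i-2]
```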

Page 24

Autoregressive Models
• PixelCNN & PixelRNN [OKK16]
• Autoregressive structure of an image: [figure]
• PixelCNN: model the conditional distributions via (masked) convolution:
  h_i = K ∗ x_<i,  p(x_i | x_<i) = NN(h_i). (A sketch follows the figure below.)
  • Bounded receptive field.
  • Likelihood evaluation: parallel.

[Figure: masked convolution produces h_i, from which NN outputs p(x_i|x_<i)]
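A minimal numpy sketch of the masked convolution: the kernel is zeroed at and after the centre pixel (mask type "A"), so h_i only sees pixels before x_i in raster order:

```python
import numpy as np

def mask_A(ksize):
    """1 at kernel positions strictly before the centre in raster order."""
    m = np.zeros((ksize, ksize))
    c = ksize // 2
    m[:c, :] = 1          # rows above the centre
    m[c, :c] = 1          # same row, columns left of the centre
    return m

def masked_conv(img, kernel):
    k = kernel.shape[0]; c = k // 2
    padded = np.pad(img, c)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i+k, j:j+k] * kernel).sum()
    return out

kernel = np.ones((3, 3)) * mask_A(3)   # masked 3x3 kernel
img = np.arange(16.).reshape(4, 4)
print(masked_conv(img, kernel))        # each h_ij ignores x_ij and all later pixels
```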

Page 25

Autoregressive Models
• PixelCNN & PixelRNN [OKK16]
• PixelRNN: model the conditional distributions via a recurrent connection:
  (h_i, c_i) = LSTM(K ∗ h_(previous row) [a 1-D convolution], c_{i−1}, x_{i−1}),
  p(x_i | x_<i) = NN(h_i).
  • Unbounded receptive field.
  • Likelihood evaluation: parallel in-row, sequential inter-row.

[Figure: row LSTM producing (h_i, c_i) from x_{i−1}, c_{i−1}; NN outputs p(x_i|x_<i)]

Page 26

Autoregressive Models
• PixelCNN & PixelRNN [OKK16]
[Figures: image generation; image completion]

Page 27

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 28

Latent Variable Models
• Latent Variable: abstract knowledge of data; enables various tasks.

[Figures: knowledge discovery; manipulated generation; dimensionality reduction]

Page 29

Latent Variable Models
• Latent Variable: compact representation of dependency.

De Finetti's Theorem (1955): if x_1, x_2, … are infinitely exchangeable, then there exist a r.v. z and p(·|z) s.t. for all n,
  p(x_1, …, x_n) = ∫ Π_{i=1}^n p(x_i|z) p(z) dz.

Infinite exchangeability: for all n and permutations σ, p(x_1, …, x_n) = p(x_{σ(1)}, …, x_{σ(n)}).

Page 30

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 31

Generative Adversarial Nets
• Deterministic f_θ: z ↦ x, modeled by a neural network.

+ Flexible modeling ability.

+ Good generation performance.

- Hard to infer 𝑧 of a data point 𝑥.

- Unavailable density function p_θ(x).

- Mode-collapse.

• Learning: min_θ discr(p̂(x), p_θ(x)).
  • discr = KL(p̂, p_θ) ⟹ MLE: max_θ E_p̂[log p_θ], but p_θ(x) is unavailable!
  • discr = Jensen-Shannon divergence [GPM+14].
  • discr = Wasserstein distance [ACB17].

[Figure: z ∼ p(z) → x = f_θ(z) (neural nets) → p_θ(x)]

Page 32

* Generative Adversarial Nets
• Learning: min_θ discr(p̂(x), p_θ(x)).
• GAN [GPM+14]: discr = Jensen-Shannon divergence.
  JS(p̂, p_θ) := (1/2) [ KL(p̂, (p_θ + p̂)/2) + KL(p_θ, (p_θ + p̂)/2) ]
  = (1/2) max_{T(·)} { E_{p̂(x)}[log σ(T(x))] + E_{p_θ(x)}[log(1 − σ(T(x)))] } + log 2,
  where E_{p_θ(x)}[log(1 − σ(T(x)))] = E_{p(z)}[log(1 − σ(T(f_θ(z))))].

• σ(T(x)) is the discriminator; T is implemented as a neural network.
• Expectations can be estimated by samples; see the sketch after the figure below.

[Figure: z ∼ p(z) → x = f_θ(z) (neural nets) → p_θ(x)]
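A minimal PyTorch sketch of the alternating objective above on toy 1-D data; the network sizes, learning rates, and data distribution are illustrative assumptions:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # f_theta: z -> x
T = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator logit T(x)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_t = torch.optim.Adam(T.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()  # bce(T, 1) = -log sigma(T); bce(T, 0) = -log(1 - sigma(T))

for step in range(1000):
    x_real = torch.randn(64, 1) * 0.5 + 2.0     # samples from \hat p(x) (toy data)
    x_fake = G(torch.randn(64, 2))              # x = f_theta(z), z ~ p(z)
    # Discriminator step: max over T of the JS lower bound (expectations by samples).
    loss_t = bce(T(x_real), torch.ones(64, 1)) + bce(T(x_fake.detach()), torch.zeros(64, 1))
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
    # Generator step: min_theta E_{p(z)} log(1 - sigma(T(f_theta(z)))).
    loss_g = -bce(T(x_fake), torch.zeros(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

(In practice the non-saturating generator loss, maximizing log σ(T(f_θ(z))), is often preferred for gradient signal.)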

Page 33

* Generative Adversarial Nets
• Learning: min_θ discr(p̂(x), p_θ(x)).
• WGAN [ACB17]: discr = Wasserstein distance:
  d_W(p̂, p_θ) = inf_{γ ∈ Γ(p̂, p_θ)} E_{γ(x,y)}[c(x, y)] = sup_{φ ∈ Lip_1} E_p̂[φ] − E_{p_θ}[φ]  (if c is a distance in a Polish space).
• Choose φ as a neural network with parameter clipping.
• Benefit: d_W reacts more mildly to distribution differences than JS.
[Figure: densities p_0 and p_h offset by h; d_W(p_0, p_h) varies smoothly with h whereas JS(p_0, p_h) does not]

[Figure: z ∼ p(z) → x = f_θ(z) (neural nets) → p_θ(x)]

Page 34

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 35

Flow-Based Generative Models
• Deterministic and invertible f_θ: z ↦ x.
+ Available density function!
  p_θ(x) = p(z = f_θ^{-1}(x)) |det ∂f_θ^{-1}/∂x|  (rule of change of variables; the determinant is the Jacobian determinant).
+ Easy inference: z = f_θ^{-1}(x).
− Redundant representation: dim z = dim x.
− Restricted f_θ: deliberate design; either f_θ or f_θ^{-1} is costly to compute.
• Learning: min_θ KL(p̂(x), p_θ(x)) ⟹ MLE: max_θ E_{p̂(x)}[log p_θ(x)].
• NICE [DKB15], RealNVP [DSB17], MAF [PPM17], GLOW [KD18].
(A change-of-variables sketch follows the figure below.)

[Figure: z ∼ p(z) → x = f_θ(z) (invertible) → p_θ(x)]
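A minimal numpy sketch of the change-of-variables rule for the simplest invertible f_θ, an elementwise affine map with assumed parameters s and t:

```python
import numpy as np

s, t = 0.7, -1.0                 # assumed parameters of the affine flow x = exp(s) * z + t

def f(z):     return np.exp(s) * z + t
def f_inv(x): return (x - t) * np.exp(-s)

def log_p_theta(x):
    """log p_theta(x) = log p(z = f^{-1}(x)) + log|det df^{-1}/dx| (per dimension)."""
    z = f_inv(x)
    log_pz = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)   # standard normal prior p(z)
    log_det = -s * np.ones_like(x)                   # df^{-1}/dx = exp(-s)
    return log_pz + log_det

x = f(np.random.randn(5))        # samples from p_theta by pushing z ~ p(z) through f
print(log_p_theta(x))
```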

Page 36

* Flow-Based Generative Models
• RealNVP [DSB17]
• Building block: the coupling layer y = g(x),
  y_{1:d} = x_{1:d},  y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}),
  where s, t: ℝ^d → ℝ^{D−d} are general functions for scale and translation.
• Jacobian determinant: |∂g/∂x| = exp(Σ_{j=1}^{D−d} s_j(x_{1:d})).
• Partitioning x using a binary mask b: y = b ⊙ x + (1−b) ⊙ (x ⊙ exp(s(b ⊙ x)) + t(b ⊙ x)).
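A minimal numpy sketch of one affine coupling layer, with toy linear maps standing in for the s and t networks; the Jacobian is triangular, so the log-determinant is just Σ_j s_j, and inversion is exact and cheap:

```python
import numpy as np

d, D = 2, 4
Ws = np.random.randn(D - d, d) * 0.1   # toy linear "networks" for s and t
Wt = np.random.randn(D - d, d) * 0.1

def coupling_forward(x):
    x1, x2 = x[:d], x[d:]
    s, t = Ws @ x1, Wt @ x1
    y = np.concatenate([x1, x2 * np.exp(s) + t])
    return y, s.sum()                  # (y, log|det dy/dx|)

def coupling_inverse(y):
    y1, y2 = y[:d], y[d:]
    s, t = Ws @ y1, Wt @ y1            # s, t depend only on the untouched half
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

x = np.random.randn(D)
y, logdet = coupling_forward(x)
print(np.allclose(coupling_inverse(y), x), logdet)   # True; cheap in both directions
```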

Page 37

* Flow-Based Generative Models
• RealNVP [DSB17]
• Building block: squeezing, a reshape from s × s × c to (s/2) × (s/2) × 4c.
• Combine with a multi-scale architecture, where each f follows a "coupling-squeezing-coupling" architecture.

Page 38

* Flow-Based Generative Models
• RealNVP [DSB17] [figure: generated samples]

Page 39

Flow-Based Generative Models
• GLOW [KD18]
[Figures: one step of f_θ; component details; combination of the steps to form f_θ]

Page 40

Flow-Based Generative Models
• GLOW [KD18]
[Figures: generation results, interpolation; generation results, manipulation, with each semantic direction = z̄_pos − z̄_neg]

Page 41

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 42

Bayesian Generative Models: Overview
Bayesian Networks
• Model structure (Bayesian Modeling):
  • Prior p(z): initial belief about z.
  • Likelihood p(x|z): dependence of x on z.
• Learning (Model Selection): MLE.
  θ* = argmax_θ E_{p̂(x)}[log p_θ(x)],  evidence p(x) = ∫ p(z, x) dz.
• Feature/representation learning (Bayesian Inference):
  Posterior p(z|x) = p(z, x)/p(x) = p(z) p(x|z) / ∫ p(z, x) dz  (Bayes' rule)
  represents the updated information that the observation x conveys about z.
• Generation/prediction: z_new ∼ p(z|x), x_new ∼ p(x|z_new).

[Diagram: Bayesian Modeling links prior p(z) and likelihood p(x|z) between the latent variable z and the data variable x; evidence p(x) (Model Selection); posterior p(z|x) (Bayesian Inference)]

Page 43

Bayesian Generative Models: Overview
• The dependency between x and z is probabilistic: (x, z) ∼ p_θ(x, z).
[Figure: z ∼ p(z) → x ∼ p_θ(x|z), inducing p_θ(x)]

• Bayesian Network (BayesNet): p(x, z) specified by p(z) and p(x|z).
  • Synonyms: Causal Networks, Directed Graphical Models.
  • Directional/causal belief encoded: x is generated/caused by z, not the other way around.
• Markov Random Field (MRF): p(x, z) specified by an energy function E_θ(x, z): p_θ(x, z) ∝ exp(−E_θ(x, z)).
  • Synonyms: Energy-Based Models, Undirected Graphical Models.
  • Models the symmetric correlation.
  • Harder learning and generation.
[Figure: z — x undirected, p_θ(x, z) ∝ exp(−E_θ(x, z))]

Page 44

* Bayesian Generative Models: Overview
Not all Bayesian models are generative:
• Generative Bayesian models: prior p(z), likelihood p(x, y|z); the latent variable z generates the data & response (x, y); evidence p(x, y); posterior p(z|x, y).
• Non-generative Bayesian models: prior p(z), likelihood p(y|z, x); z and the data x generate only the response y; evidence p(y|x); posterior p(z|x, y).

               | Generative                                | Non-generative
  Supervised   | Naive Bayes, supervised LDA               | Bayesian logistic regression, Bayesian neural networks
  Unsupervised | BayesNets (LDA, VAE), MRFs (BM, RBM, DBM) | (invalid task)

Page 45

Bayesian Generative Models: Benefits
• Robust to small data and adversarial attacks.
• Stable training process.
• Principled and natural inference of p(z|x) via Bayes' rule.
[Figure: one-shot generation [LST15]]

Page 46

* Bayesian Generative Models: Benefits
• Robust to small data and adversarial attacks.
[Figures: meta-learning [KYD+18]; one-shot generation [LST15]]

Page 47

* Bayesian Generative Models: Benefits
• Robust to small data and adversarial attacks.
[Figure: adversarial robustness [LG17] (non-generative case)]

Page 48

* Bayesian Generative Models: Benefits
• Stable training process.
• Principled and natural inference of p(z|x) via Bayes' rule.
[Diagram: prior p(z) and likelihood p(x|z) (Bayesian Modeling); evidence p(x) (Model Selection); posterior p(z|x) (Bayesian Inference)]

Page 49

Bayesian Generative Models: Benefits
• Natural to incorporate prior knowledge. [KSDV18]
[Figure: attribute manipulation, e.g., Bald, Mustache]

Page 50

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 51

Bayesian Inference
Estimate the posterior p(z|x).
[Diagram: Bayesian modeling specifies p(z) and p(x|z), hence p(x) and p(x, z), and generates x ∼ p(x|z*); Bayesian inference inverts this to p(z|x*) for an observation x*]

Page 52

Bayesian Inference
Estimate the posterior p(z|x).
• Extract knowledge/representation from data.
[Diagram: Bayesian modeling (prior p(z) over topics, likelihood p(x|z) over documents); Bayesian inference yields the posterior p(z|x)]

Page 53

* Bayesian Inference
Estimate the posterior p(z|x).
• Extract knowledge/representation from data.
Naive Bayes: z = y.
  p(y=0|x) = p(x|y=0) p(y=0) / [ p(x|y=0) p(y=0) + p(x|y=1) p(y=1) ].
  f(x) = argmax_y p(y|x) achieves the lowest error ∫ p(y = 1 − f(x) | x) p(x) dx.

Page 54

* Bayesian Inference
Estimate the posterior p(z|x).
• Facilitate model learning: max_θ (1/N) Σ_{n=1}^N log p_θ(x^(n)).
  • p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz is hard to evaluate:
    • Closed-form integration is generally unavailable.
    • Numerical integration: curse of dimensionality.
  • Hard to optimize:
    • log p_θ(x) = log E_{p(z)}[p_θ(x|z)] ≈ log (1/N) Σ_n p_θ(x|z^(n)), z^(n) ∼ p(z).
    • Hard for p(z) to cover the regions where p_θ(x|z) is large.
    • log (1/N) Σ_n p_θ(x|z^(n)) is biased:
      E[log (1/N) Σ_n p_θ(x|z^(n))] ≤ log E[(1/N) Σ_n p_θ(x|z^(n))] = log p_θ(x).

Page 55

Bayesian Inference
Estimate the posterior p(z|x).
• Facilitate model learning: max_θ (1/N) Σ_{n=1}^N log p_θ(x^(n)).
  (p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz is hard to evaluate!)
An effective and practical learning approach:
• Introduce a variational distribution q(z):
  log p_θ(x) = L_θ(q(z)) + KL(q(z), p_θ(z|x)),
  L_θ(q(z)) := E_{q(z)}[log p_θ(z, x)] − E_{q(z)}[log q(z)].
• L_θ(q(z)) ≤ log p_θ(x) ⟹ the Evidence Lower BOund (ELBO)!
• L_θ(q(z)) is easier to estimate; see the sketch below.
• (Variational) Expectation-Maximization Algorithm:
  (a) E-step: make L_θ(q(z)) ≈ log p_θ(x) at the current θ ⟺ min_{q∈Q} KL(q(z), p_θ(z|x))  (Bayesian inference);
  (b) M-step: max_θ L_θ(q(z)).
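A minimal numpy sketch of estimating the ELBO by samples from q, on a toy conjugate Gaussian model where everything can be checked in closed form (the model, q, and all parameters are assumptions for illustration):

```python
import numpy as np

x = 1.5                                      # one observation; model: z ~ N(0,1), x|z ~ N(z,1)
def log_joint(z, x):                         # log p(z) + log p(x|z)
    return -0.5 * z**2 - 0.5 * (x - z)**2 - np.log(2 * np.pi)

mu_q, sig_q = x / 2, np.sqrt(0.5)            # q(z) = N(mu_q, sig_q^2); the exact posterior here
z = np.random.randn(100_000) * sig_q + mu_q  # z ~ q
log_q = -0.5 * ((z - mu_q) / sig_q)**2 - np.log(sig_q) - 0.5 * np.log(2 * np.pi)
elbo = np.mean(log_joint(z, x) - log_q)      # E_q[log p(z,x)] - E_q[log q(z)]

# When q equals the true posterior, KL(q, p(z|x)) = 0, so ELBO = log p_theta(x):
log_px = -0.25 * x**2 - 0.5 * np.log(2 * np.pi * 2)   # x ~ N(0, 2) marginally
print(elbo, log_px)                          # nearly equal
```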

Page 56

* Bayesian Inference
Estimate the posterior p(z|x).
• For prediction:
  p(y*|x*, x, y) = ∫ p(y*|z, x*) p(z|x*, x, y) dz  (generative),
  p(y*|x*, x, y) = ∫ p(y*|z, x*) p(z|x, y) dz  (non-generative).
[Diagrams: generative, Z → (X, Y); non-generative, (X, Z) → Y]

Page 57

* Bayesian Inference
Estimate the posterior p(z|x).
• It is a hard problem:
  • The closed form of p(z|x) ∝ p(z) p(x|z) is generally intractable.
  • We care about expectations w.r.t. p(z|x) (prediction, computing the ELBO),
    so even if we knew the closed form (e.g., by numerical integration), downstream tasks would still be hard;
    and the Maximum a Posteriori (MAP) estimate
    argmax_z log p(z | {x^(n)}_{n=1}^N) = argmax_z [ log p(z) + Σ_{n=1}^N log p(x^(n)|z) ]
    does not help much for Bayesian tasks.

  Modeling Method   | Mathematical Problem
  Parametric Method | Optimization
  Bayesian Method   | Bayesian Inference

Page 58

Bayesian Inference
• Variational inference (VI): use a tractable variational distribution q(z) to approximate p(z|x):
  min_{q∈Q} KL(q(z), p(z|x)).
  (Tractability: known density function, or samples are easy to draw.)
  • Parametric VI: use a parameter φ to represent q_φ(z).
  • Particle-based VI: use a set of particles {z^(i)}_{i=1}^N to represent q(z).
• Monte Carlo (MC): draw samples from p(z|x).
  • Typically by simulating a Markov chain (i.e., MCMC) to relax the requirements on p(z|x).

Page 59

Bayesian Inference: Variational Inference
min_{q∈Q} KL(q(z), p(z|x)).
• Parametric variational inference: use a parameter φ to represent q_φ(z).
  But KL(q_φ(z), p_θ(z|x)) is hard to compute…
  Recall log p_θ(x) = L_θ(q(z)) + KL(q(z), p_θ(z|x)),
  so min_φ KL(q_φ(z), p(z|x)) ⟺ max_φ L_θ(q_φ(z)).
  The ELBO L_θ(q_φ(z)) = E_{q_φ(z)}[log p_θ(z, x)] − E_{q_φ(z)}[log q_φ(z)] is easier to compute.
• For model-specifically designed q_φ(z), ELBO(θ, φ) has a closed form (e.g., [SJJ96] for SBN, [BNJ03] for LDA).

Page 60

* Bayesian Inference: Variational Inference
min_{q∈Q} KL(q(z), p(z|x)).
• Parametric variational inference: use a parameter φ to represent q_φ(z).
• Information-theory perspective of the ELBO: Bits-Back Coding [HV93].
  • Average coding length for communicating x after communicating its code z: E_{q(z|x)}[−log p(x|z)].
  • Average coding length for communicating z under bits-back coding: E_{q(z|x)}[−log p(z)] − E_{q(z|x)}[−log q(z|x)].
    (The second term: the receiver knows the encoder q(z|x) that the sender uses.)
  • Average coding length for communicating x with the help of z: E_{q(z|x)}[−log p(x|z) − log p(z) + log q(z|x)].
    This coincides with the negative ELBO!
  Maximizing the ELBO = minimizing the average coding length under the bits-back scheme.

Page 61

Bayesian Inference: Variational Inference
min_{q∈Q} KL(q(z), p(z|x)).
• Parametric variational inference: use a parameter φ to represent q_φ(z).
Main challenge:
• Q should be as large/general/flexible as possible,
• while enabling practical optimization of the ELBO.
[Figure: the family Q and the projection q* = argmin_{q∈Q} KL(q(z), p(z|x)) of the true posterior p(z|x)]

Page 62

Bayesian Inference: Variational Inference
• Parametric variational inference: use a parameter φ to represent q_φ(z).
  max_φ L_θ(q_φ(z)) = E_{q_φ(z)}[log p_θ(z, x)] − E_{q_φ(z)}[log q_φ(z)].
• Explicit variational inference: specify the form of the density function q_φ(z).
  • Model-specific q_φ(z): [SJJ96] for SBN, [BNJ03] for LDA.
  • [GHB12, HBWP13, RGB14]: model-agnostic q_φ(z) (e.g., a mixture of Gaussians).
  • [RM15, KSJ+16]: define q_φ(z) by a flow-based generative model.
• Implicit variational inference: define q_φ(z) by a GAN-like generative model.
  • More flexible but more difficult to optimize:
    L_θ(q_φ(z)) = E_{q_φ(z)}[log p_θ(x|z)] − E_{q_φ(z)}[log(q_φ(z)/p(z))].
  • Density ratio estimation: [MNG17, SSZ18a].
  • Gradient estimation of ∇log q_φ(z): [VLBM08, LT18, SSZ18b].

Page 63

* Bayesian Inference: Variational Inference
• Parametric variational inference: use a parameter φ to represent q_φ(z).
  max_φ L_θ(q_φ(z)) = E_{q_φ(z)}[log p_θ(z, x)] − E_{q_φ(z)}[log q_φ(z)].
• Explicit variational inference: specify the form of the density function q_φ(z).
To be applicable to any model (model-agnostic q_φ(z)):
• [GHB12]: mixture of Gaussians q_φ(z) = (1/N) Σ_{n=1}^N N(z | μ_n, σ_n² I).
  The expected-log-joint term, with f(z) := log p_θ(z, x):
    (1/N) Σ_{n=1}^N E_{N(μ_n, σ_n² I)}[f(z)] ≈ (1/N) Σ_{n=1}^N E_{N(μ_n, σ_n² I)}[Taylor₂(f, μ_n)] = (1/N) Σ_{n=1}^N [ f(μ_n) + (σ_n²/2) tr ∇²f(μ_n) ];
  the entropy term:
    ≥ −(1/N) Σ_{n=1}^N log Σ_{j=1}^N N(μ_n | μ_j, (σ_n² + σ_j²) I) + log N.
• [RGB14]: mean-field q_φ(z) = Π_{d=1}^D q_{φ_d}(z_d).
  • ∇_θ L_θ(q_φ) = E_{q_φ(z)}[∇_θ log p_θ(z, x)].
  • ∇_φ L_θ(q_φ) = E_{q_φ(z)}[∇_φ log q_φ(z) (log p_θ(z, x) − log q_φ(z))]
    (similar to REINFORCE [Wil92]; used with variance reduction). See the sketch below.
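A minimal numpy sketch of the score-function (REINFORCE-like) gradient estimator of ∇_φ L, on the same toy Gaussian model used earlier; q_φ(z) = N(φ, 1) and all parameters are assumptions:

```python
import numpy as np

x, phi = 1.5, 0.0
def log_joint(z): return -0.5 * z**2 - 0.5 * (x - z)**2 - np.log(2 * np.pi)
def log_q(z):     return -0.5 * (z - phi)**2 - 0.5 * np.log(2 * np.pi)

z = np.random.randn(200_000) + phi    # z ~ q_phi
score = z - phi                       # grad_phi log N(z | phi, 1)
grad_est = np.mean(score * (log_joint(z) - log_q(z)))
print(grad_est)                       # exact gradient for this model is x - 2*phi = 1.5
```

Without a baseline or other variance reduction, this estimator needs many samples, which is why [RGB14] pairs it with variance-reduction techniques.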

Page 64

* Bayesian Inference: Variational Inference
• Parametric variational inference: use a parameter φ to represent q_φ(z).
  max_φ L_θ(q_φ(z)) = E_{q_φ(z)}[log p_θ(z, x)] − E_{q_φ(z)}[log q_φ(z)].
• Explicit variational inference: specify the form of the density function q_φ(z).
To be more flexible and model-agnostic:
• [RM15, KSJ+16]: define q_φ(z) by a generative model:
  z ∼ q_φ(z) ⟺ z = g_φ(ε), ε ∼ q(ε), where g_φ is invertible (a flow model).
  The density function q_φ(z) is known!
  q_φ(z) = q(ε = g_φ^{-1}(z)) |det ∂g_φ^{-1}/∂z|  (rule of change of variables).
  L_θ(q_φ) = E_{q(ε)}[ log p_θ(z, x)|_{z=g_φ(ε)} − log q_φ(z)|_{z=g_φ(ε)} ]. See the sketch below.

Page 65

* Bayesian Inference: Variational Inference
• Parametric variational inference: use a parameter φ to represent q_φ(z).
  max_φ L_θ(q_φ(z)) = E_{q_φ(z)}[log p_θ(z, x)] − E_{q_φ(z)}[log q_φ(z)].
• Implicit variational inference: define q_φ(z) by a generative model:
  z ∼ q_φ(z) ⟺ z = g_φ(ε), ε ∼ q(ε), where g_φ is a general function.
  • More flexible than explicit VIs.
  • Samples are easy to draw, but the density function q_φ(z) is unavailable.
  • L_θ(q_φ(z)) = E_{q(ε)}[log p_θ(x|z)|_{z=g_φ(ε)}] − E_{q(ε)}[log(q_φ(z)/p(z))|_{z=g_φ(ε)}].
Key problems:
• Density ratio estimation r(z) := q_φ(z)/p(z).
• Gradient estimation of ∇log q(z).

Page 66

* Bayesian Inference: Variational Inference
• Parametric variational inference: use a parameter φ to represent q_φ(z).
  max_φ L_θ(q_φ(z)) = E_{q_φ(z)}[log p_θ(z, x)] − E_{q_φ(z)}[log q_φ(z)].
• Implicit variational inference; density ratio estimation:
  • [MNG17]: log r = argmax_T { E_{q_φ(Z)}[log σ(T(Z))] + E_{p(Z)}[log(1 − σ(T(Z)))] }.
    Also used in [MSJ+15, Hus17, TRB17].
  • [SSZ18a]:
    r ≈ argmin_{r̂∈H} (1/2) E_p[(r̂ − r)²] + (λ/2)‖r̂‖²_H
      ≈ (1/(λ N_q)) 1ᵀ K_q − (1/(λ N_p N_q)) 1ᵀ K_qp ((1/N_p) K_pp + λI)^{-1} K_p,
    where (K_p(z))_j = K(z_j^p, z), (K_qp)_{ij} = K(z_i^q, z_j^p), {z_i^q}_{i=1}^{N_q} ∼ q_φ(z), {z_j^p}_{j=1}^{N_p} ∼ p(z).
• Gradient estimation: [VLBM08, LT18, SSZ18b].

Page 67

Bayesian Inference: Variational Inference
min_{q∈Q} KL(q(z), p(z|x)).
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
To minimize KL(q(z), p(z|x)), simulate its gradient flow on the Wasserstein space.
• Wasserstein space P₂(Z): an abstract space of distributions.
• A Wasserstein tangent vector ⟺ a vector field:
[Figure: moving each particle z_t^(i) to z_t^(i) + ε V_t(z_t^(i)) carries q_t to q_{t+ε} + o(ε)]

Page 68

Bayesian Inference: Variational Inference
min_{q∈Q} KL(q(z), p(z|x)).
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
  V := −grad_q KL(q, p) = −∇log(q/p);  update z^(i) ← z^(i) + ε V(z^(i)).
Approximations of V(z^(i)) (see the sketch after this list):
• SVGD [LW16]: Σ_j K_ij ∇_{z^(j)} log p(z^(j)|x) + Σ_j ∇_{z^(j)} K_ij.
  (For a Gaussian kernel, the second term = Σ_j (z^(i) − z^(j)) K_ij: a repulsive force!)
• Blob [CZW+18]: ∇_{z^(i)} log p(z^(i)|x) − Σ_j ∇_{z^(i)} K_ij / Σ_k K_ik − Σ_j ( ∇_{z^(i)} K_ij / Σ_k K_jk ).
• GFSD [LZC+19]: ∇_{z^(i)} log p(z^(i)|x) − Σ_j ∇_{z^(i)} K_ij / Σ_k K_ik.
• GFSF [LZC+19]: ∇_{z^(i)} log p(z^(i)|x) + Σ_{j,k} (K^{-1})_{ik} ∇_{z^(j)} K_kj.
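A minimal numpy sketch of the SVGD update with an RBF kernel and the median-heuristic bandwidth of [LW16]; the target here is a standard 2-D Gaussian, and the step size and iteration count are assumptions:

```python
import numpy as np

def grad_log_p(z):                    # score of log N(0, I); swap in any posterior score
    return -z

def svgd_step(Z, eps=0.2):
    diff = Z[:, None, :] - Z[None, :, :]            # diff[i, j] = z_i - z_j
    sq = (diff ** 2).sum(-1)
    h = np.median(sq) / np.log(len(Z) + 1)          # median-heuristic bandwidth
    K = np.exp(-sq / h)                             # K_ij = exp(-||z_i - z_j||^2 / h)
    # V(z_i) = (1/N) [ sum_j K_ij grad log p(z_j) + sum_j grad_{z_j} K_ij ]
    repulsion = (2 / h) * (diff * K[:, :, None]).sum(axis=1)
    V = (K @ grad_log_p(Z) + repulsion) / len(Z)
    return Z + eps * V

Z = np.random.default_rng(0).normal(size=(100, 2)) + 3.0   # particles start off-target
for _ in range(1500):
    Z = svgd_step(Z)
print(Z.mean(0), Z.std(0))            # approach the N(0, I) statistics
```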

Page 69

* Bayesian Inference: Variational Inference
min_{q∈Q} KL(q(z), p(z|x)).
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
  Non-parametric q: more particles, more flexibility.
• Stein Variational Gradient Descent (SVGD) [LW16]:
  update the particles by a dynamics dz_t/dt = V_t(z_t) so that the KL decreases.
• Distribution evolution, a consequence of the dynamics:
  ∂_t q_t = −div(q_t V_t) = −∇·(q_t V_t)  (continuity equation).
[Figure: particles z_t^(i) moved by ε V_t(z_t^(i)) carry q_t to q_{t+ε}]

Page 70

* Bayesian Inference: Variational Inference
min_{q∈Q} KL(q(z), p(z|x)).
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
• Stein Variational Gradient Descent (SVGD) [LW16]:
  update the particles by a dynamics dz_t/dt = V_t(z_t) so that the KL decreases.
• Decrease the KL:
  V_t* := argmax_{V_t} −(d/dt) KL(q_t, p) = E_{q_t}[V_t · ∇log p + ∇·V_t]  (the Stein operator A_p[V_t]).
  For tractability,
  V_t^SVGD := argmax_{V_t ∈ H^D, ‖V_t‖=1} E_{q_t}[V_t · ∇log p + ∇·V_t]
            = E_{q(z')}[K(z', ·) ∇_{z'} log p(z') + ∇_{z'} K(z', ·)].
  Update rule: z^(i) += ε ( Σ_j K_ij ∇_{z^(j)} log p(z^(j)) + Σ_j ∇_{z^(j)} K_ij ).
  (For a Gaussian kernel, the second term = Σ_j (z^(i) − z^(j)) K_ij: a repulsive force!)

Page 71

* Bayesian Inference: Variational Inference
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
• Unified view as Wasserstein gradient flow (WGF) [LZC+19]: particle-based VIs approximate the WGF with a compulsory smoothing assumption, in either of two equivalent forms: smoothing the density (Blob, GFSD) or smoothing functions (SVGD, GFSF).

Page 72

* Bayesian Inference: Variational Inference
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
• Acceleration on the Wasserstein space [LZC+19]: apply Riemannian Nesterov's methods to P₂(Z).
[Figure: inference for Bayesian logistic regression]

Page 73

* Bayesian Inference: Variational Inference
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
• Kernel bandwidth selection:
  • Median [LW16]: the median of pairwise distances of the particles.
  • HE [LZC+19]: the two approximations to q_{t+ε}(z), namely q(z; {z^(j)}_j) + ε Δ_z q(z; {z^(j)}_j) (heat equation) and q(z; {z^(i) − ε ∇_{z^(i)} log q(z^(i); {z^(j)}_j)}_i) (particle evolution), should match.

Page 74

Bayesian Inference: Variational Inference
• Particle-based variational inference: use particles {z^(i)}_{i=1}^N to represent q(z).
• Unified view as Wasserstein gradient flow: [LZC+19].
• Asymptotic analysis: SVGD [Liu17] (N → ∞, ε → 0).
• Non-asymptotic analysis:
  • w.r.t. ε: e.g., [RT96] (as WGF).
  • w.r.t. N: [CMG+18, FCSS18, ZZC18].
• Accelerating ParVIs: [LZC+19, LZZ19].
• Adding particles dynamically: [CMG+18, FCSS18].
• Solving the Wasserstein gradient by optimal transport: [CZ17, CZW+18].
• Manifold support space: [LZ18].

Page 75

Bayesian Inference: MCMC
• Monte Carlo: directly draw (i.i.d.) samples from p(z|x).
  • Almost always impossible to do directly.
• Markov Chain Monte Carlo (MCMC): simulate a Markov chain whose stationary distribution is p(z|x).
  • Easier to implement: only requires the unnormalized p(z|x) (e.g., p(z, x)).
  • Asymptotically accurate.
  • Drawback/challenge: sample auto-correlation, making the samples less effective than i.i.d. ones. [GC11]

Page 76

Bayesian Inference: MCMC
A fantastic MCMC animation site: https://chi-feng.github.io/mcmc-demo/

Page 77

Bayesian Inference: MCMC
Classical MCMC
• Metropolis-Hastings framework [MRR+53, Has70]:
  Draw z* ∼ q(z*|z^(k)) and take z^(k+1) = z* with probability
  min{ 1, [q(z^(k)|z*) p(z*|x)] / [q(z*|z^(k)) p(z^(k)|x)] },
  otherwise take z^(k+1) = z^(k).
  Proposal distribution q(z*|z): e.g., N(z*|z, σ²). A sketch follows.
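A minimal numpy sketch of Metropolis-Hastings with a symmetric Gaussian proposal (so the q-ratio cancels), targeting an assumed toy unnormalized log-density; note only the unnormalized density is needed:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_unnorm(z):                  # stand-in for log p(z, x); target is N(3, 1)
    return -0.5 * (z - 3.0)**2

def mh(n_samples, sigma=1.0):
    z, out = 0.0, []
    for _ in range(n_samples):
        z_star = z + sigma * rng.standard_normal()            # z* ~ q(z*|z) = N(z, sigma^2)
        log_accept = log_p_unnorm(z_star) - log_p_unnorm(z)   # symmetric q: ratio cancels
        if np.log(rng.random()) < log_accept:
            z = z_star                                        # accept; else keep z^(k)
        out.append(z)
    return np.array(out)

samples = mh(20_000)[5_000:]          # discard burn-in
print(samples.mean(), samples.std())  # approx. 3 and 1
```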

Page 78

Bayesian Inference: MCMC
Classical MCMC
• Gibbs sampling [GG87]: iteratively sample from the conditional distributions, which are easier to draw:
  z_1^(1) ∼ p(z_1 | z_2^(0), z_3^(0), …, z_d^(0), x),
  z_2^(1) ∼ p(z_2 | z_1^(1), z_3^(0), …, z_d^(0), x),
  z_3^(1) ∼ p(z_3 | z_1^(1), z_2^(1), …, z_d^(0), x),
  …,
  z_i^(k+1) ∼ p(z_i | z_1^(k+1), …, z_{i−1}^(k+1), z_{i+1}^(k), …, z_d^(k), x).

Page 79

Bayesian Inference: MCMC
Dynamics-based MCMC
• Simulate a jump-free continuous-time Markov process (dynamics):
  dz = b(z) dt [drift] + √(2D(z)) dB_t(z) [diffusion],
  Δz = b(z) ε + N(0, 2D(z) ε) + o(ε),
  with appropriate b(z) and D(z) so that p(z|x) is kept stationary/invariant.
  (B_t: Brownian motion; D(z): positive semi-definite matrix.)
• Informative transitions using the gradient ∇_z log p(z|x).
• Some are compatible with stochastic gradients (SG): more efficient.
  ∇_z log p(z|x) = ∇_z log p(z) + Σ_{n∈D} ∇_z log p(x^(n)|z),
  ∇̃_z log p(z|x) = ∇_z log p(z) + (|D|/|S|) Σ_{n∈S} ∇_z log p(x^(n)|z),  S ⊂ D.

Page 80

Bayesian Inference: MCMC
Dynamics-based MCMC
• Langevin Dynamics [RS02] (compatible with SG [WT11, CDC15, TTV16]):
  z^(k+1) = z^(k) + ε ∇log p(z^(k)|x) + N(0, 2ε).
• Hamiltonian Monte Carlo [DKPR87, Nea11, Bet17]
  (incompatible with SG [CFG14, Bet15]; leap-frog integrator [CDC15]):
  r^(0) ∼ N(0, Σ),
  r^(k+1/2) = r^(k) + (ε/2) ∇log p(z^(k)|x),
  z^(k+1) = z^(k) + ε Σ^{-1} r^(k+1/2),
  r^(k+1) = r^(k+1/2) + (ε/2) ∇log p(z^(k+1)|x).
• Stochastic Gradient Hamiltonian Monte Carlo [CFG14] (compatible with SG):
  z^(k+1) = z^(k) + ε Σ^{-1} r^(k),
  r^(k+1) = r^(k) + ε ∇log p(z^(k)|x) − ε C Σ^{-1} r^(k) + N(0, 2Cε).
• … A stochastic-gradient Langevin sketch follows.
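A minimal numpy sketch of stochastic-gradient Langevin dynamics [WT11]: the Langevin update above with the minibatch gradient estimate from the previous slide. The toy model (z is the unknown mean of Gaussian data) and all hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)     # x^(n) | z ~ N(z, 1); prior z ~ N(0, 10^2)

def sg_grad_log_post(z, batch):
    grad_prior = -z / 10.0**2
    # (|D|/|S|) * sum_{n in S} grad_z log p(x^(n)|z)
    grad_lik = (len(data) / len(batch)) * np.sum(batch - z)
    return grad_prior + grad_lik

z, eps, samples = 0.0, 1e-4, []
for k in range(5000):
    batch = rng.choice(data, size=100, replace=False)
    z = z + eps * sg_grad_log_post(z, batch) + rng.normal(0.0, np.sqrt(2 * eps))
    samples.append(z)
print(np.mean(samples[1000:]))             # concentrates near the posterior mean, approx. 2
```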

Page 81

* Bayesian Inference: MCMC
Dynamics-based MCMC
• Langevin dynamics [Lan08]: dz = ∇log p dt + √2 dB_t(z).
  Algorithm (also called the Metropolis-Adjusted Langevin Algorithm) [RS02]:
  z^(k+1) = z^(k) + ε ∇log p(z^(k)|x) + N(0, 2ε),
  followed by an MH step.

Page 82

* Bayesian Inference: MCMC
Dynamics-based MCMC
• Hamiltonian dynamics: dz = Σ^{-1} r dt, dr = ∇log p dt.
• Algorithm: Hamiltonian Monte Carlo [DKPR87, Nea11, Bet17].
  Draw r^(0) ∼ N(0, Σ) and simulate K steps:
  r^(k+1/2) = r^(k) + (ε/2) ∇_z log p(z^(k)|x),
  z^(k+1) = z^(k) + ε Σ^{-1} r^(k+1/2),
  r^(k+1) = r^(k+1/2) + (ε/2) ∇_z log p(z^(k+1)|x),
  then do an MH step, for one sample of z. A sketch follows.
• Störmer-Verlet (leap-frog) integrator:
  • Makes the MH ratio close to 1.
  • Higher-order simulation error [CDC15].
• More distant exploration than LD (less auto-correlation).
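A minimal numpy sketch of HMC with the leap-frog integrator and the MH correction (Σ = I); the toy target and step parameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
log_p  = lambda z: -0.5 * (z - 3.0)**2      # unnormalized log-density; target N(3, 1)
grad_p = lambda z: -(z - 3.0)

def hmc_sample(z, eps=0.1, K=20):
    r = rng.standard_normal()                # r^(0) ~ N(0, 1)
    z_new, r_new = z, r
    for _ in range(K):                       # leap-frog: half / full / half steps
        r_new = r_new + 0.5 * eps * grad_p(z_new)
        z_new = z_new + eps * r_new
        r_new = r_new + 0.5 * eps * grad_p(z_new)
    # MH step on the Hamiltonian H(z, r) = -log p(z) + r^2/2 (acceptance close to 1)
    log_accept = (log_p(z_new) - 0.5 * r_new**2) - (log_p(z) - 0.5 * r**2)
    return z_new if np.log(rng.random()) < log_accept else z

z, samples = 0.0, []
for _ in range(5000):
    z = hmc_sample(z)
    samples.append(z)
print(np.mean(samples[500:]), np.std(samples[500:]))   # approx. 3 and 1
```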

Page 83

* Bayesian Inference: MCMC
Dynamics-based MCMC: using stochastic gradients (SG).
• Langevin dynamics is compatible with SG [WT11, CDC15, TTV16].
• Hamiltonian Monte Carlo is incompatible with SG [CFG14, Bet15]: the stationary distribution is changed.
• Stochastic Gradient Hamiltonian Monte Carlo [CFG14]:
  dz = Σ^{-1} r dt,
  dr = ∇log p dt − C Σ^{-1} r dt + √(2C) dB_t(r).
  • Asymptotically, the stationary distribution is p.
  • Non-asymptotically (with an Euler integrator), the gradient noise is of higher order than the Brownian-motion noise [CDC15].

Page 84

* Bayesian Inference: MCMC
Dynamics-based MCMC: using stochastic gradients.
• Stochastic Gradient Nosé-Hoover Thermostats [DFB+14] (scalar C > 0):
  dz = Σ^{-1} r dt,
  dr = ∇log p dt − ξ r dt + √(2CΣ) dB_t(r),
  dξ = ( (1/D) rᵀ Σ^{-1} r − 1 ) dt.
• Thermostat ξ ∈ ℝ: adaptively balances the gradient noise and the Brownian-motion noise.

Page 85

* Bayesian Inference: MCMC
Dynamics-based MCMC
• Complete recipe for the dynamics [MCF15]: for any skew-symmetric matrix Q and positive semi-definite matrix D, the dynamics
  dz = b(z) dt + √(2D(z)) dB_t(z),  b_i = (1/p) Σ_j ∂_j [ p (D_ij + Q_ij) ],
  keeps p stationary/invariant.
• The converse also holds: any dynamics that keeps p stationary can be cast into this form.
• If D is positive definite, then p is the unique stationary distribution.
• Integrators and their non-asymptotic analysis (with SG): [CDC15].
• MCMC dynamics as flows on the Wasserstein space: [LZZ19].

Page 86

Bayesian Inference: MCMC
Dynamics-based MCMC
• Complete framework for MCMC dynamics: [MCF15].
• Interpretation on the Wasserstein space: [JKO98, LZZ19].
• Integrators and their non-asymptotic analysis (with SG): [CDC15].
• For manifold support spaces:
  • LD: [GC11]; HMC: [GC11, BSU12, BG13, LSSG15]; SGLD: [PT13]; SGHMC: [MCF15, LZS16]; SGNHT: [LZS16].
• Different kinetic energies (other than Gaussian): monomial Gamma [ZWC+16, ZCG+17].
• Fancy dynamics: relativistic [LPH+16]; magnetic [TRGT17].

Page 87

Bayesian Inference: Comparison

                              | Parametric VI | Particle-Based VI | MCMC
  Asymptotic Accuracy         | No            | Yes               | Yes
  Approximation Flexibility   | Limited       | Unlimited         | Unlimited
  Empirical Convergence Speed | High          | High              | Low
  Particle Efficiency         | (n/a)         | High              | Low
  High-Dimensional Efficiency | High          | Low               | High

Page 88

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Page 89

Topic Models
Separate global (dataset abstraction) and local (datum representation) latent variables. [SG07]

Page 90

Latent Dirichlet Allocation
Model structure [BNJ03]:
• Data variable: words/documents w = {w_dn}_{n=1:N_d, d=1:D}, w_dn ∈ {1…W}.
• Latent variables:
  • Global: topics β = {β_k}_{k=1:K}, β_k ∈ Δ^W.
  • Local: topic proportions θ = {θ_d}, θ_d ∈ Δ^K; topic assignments z = {z_dn}, z_dn ∈ {1…K}.
• Prior: p(β_k|b) = Dir(b), p(θ_d|a) = Dir(a), p(z_dn|θ_d) = Mult(θ_d).
• Likelihood: p(w_dn|z_dn, β) = Mult(β_{z_dn}).
[Plate diagram: b → β_k (plate over K); a → θ_d → z_dn → w_dn ← β_k (plates over D and N_d)]

Page 91

Latent Dirichlet Allocation
Variational inference [BNJ03]:
• Take a variational distribution (mean-field approximation):
  q_{λ,γ,φ}(β, θ, z) := Π_{k=1}^K Dir(β_k|λ_k) Π_{d=1}^D Dir(θ_d|γ_d) Π_{n=1}^{N_d} Mult(z_dn|φ_dn).
• ELBO(λ, γ, φ; a, b) is available in closed form.
• E-step: update λ, γ, φ by maximizing the ELBO;
• M-step: update a, b by maximizing the ELBO.

Page 92

Latent Dirichlet Allocation
MCMC: Gibbs sampling [GS04]
Model structure ⟹ p(β, θ, z, w) = AB Π_{k,w} β_kw^{N_kw + b_w − 1} Π_{d,k} θ_dk^{N_kd + a_k − 1}
⟹ p(z, w) = AB Π_k [ Π_w Γ(N_kw + b_w) / Γ(N_k + W b̄) ] Π_d [ Π_k Γ(N_kd + a_k) / Γ(N_d + K ā) ].
(N_kw: #times word w is assigned to topic k; N_kd: #times topic k appears in document d.)
• Unacceptable cost to directly compute p(z|w) = p(z, w)/p(w).
• Use Gibbs sampling to draw from p(z|w)! See the sketch below.
  p(z_dn = k | z_−dn, w) ∝ (N_kw^−dn + b_w) / (N_k^−dn + W b̄) · (N_kd^−dn + a_k).
• For β and θ, use MAP estimates (with the counts estimated from samples of z):
  β̂_kw := argmax_β log p(β|w) ≈ (N_kw + b_w) / (N_k + W b̄),
  θ̂_dk := argmax_θ log p(θ|w) ≈ (N_kd + a_k) / (N_d + K ā).
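A minimal numpy sketch of collapsed Gibbs sampling for LDA with the conditional above; the toy corpus, symmetric hyperparameters, and sweep count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 4, 5]]        # toy corpus, W = 6 word types
K, W, a, b = 2, 6, 0.1, 0.01

z = [[rng.integers(K) for _ in doc] for doc in docs]      # random initial assignments
Nkw = np.zeros((K, W)); Nkd = np.zeros((K, len(docs))); Nk = np.zeros(K)
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]; Nkw[k, w] += 1; Nkd[k, d] += 1; Nk[k] += 1

for sweep in range(200):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]                                   # remove the current assignment
            Nkw[k, w] -= 1; Nkd[k, d] -= 1; Nk[k] -= 1
            # Gibbs conditional p(z_dn = k | z_-dn, w), up to normalization
            p = (Nkd[:, d] + a) * (Nkw[:, w] + b) / (Nk + W * b)
            k = rng.choice(K, p=p / p.sum())
            z[d][n] = k; Nkw[k, w] += 1; Nkd[k, d] += 1; Nk[k] += 1

beta_hat = (Nkw + b) / (Nk[:, None] + W * b)              # MAP-style topic estimates
print(beta_hat.round(2))
```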

Page 93

* Latent Dirichlet Allocation
MCMC: Gibbs sampling [GS04]
p(β, θ, z, w)
= Π_{k=1}^K Dir(β_k|b) Π_{d=1}^D Dir(θ_d|a) Π_{n=1}^{N_d} Mult(z_dn|θ_d) Mult(w_dn|β_{z_dn})
= AB Π_{k,w} β_kw^{b_w−1} Π_{d,k} θ_dk^{a_k−1} Π_{d,n} θ_{d z_dn} β_{z_dn w_dn}
= AB Π_{k,w} β_kw^{N_kw + b_w − 1} Π_{d,k} θ_dk^{N_kd + a_k − 1},
• A = [Γ(Σ_k a_k) / Π_k Γ(a_k)]^D, B = [Γ(Σ_w b_w) / Π_w Γ(b_w)]^K, where Γ(·) is the Gamma function.
• N_kw = Σ_{d=1}^D Σ_{n=1}^{N_d} 𝕀(w_dn = w, z_dn = k): the number of times that word w is assigned to topic k.
• N_kd = Σ_{n=1}^{N_d} 𝕀(z_dn = k): the number of times that topic k appears in document d.

Page 94

* Latent Dirichlet Allocation
MCMC: Gibbs sampling [GS04]
p(β, θ, z, w) = AB Π_{k,w} β_kw^{N_kw + b_w − 1} Π_{d,k} θ_dk^{N_kd + a_k − 1},
N_kw = Σ_{d=1}^D Σ_{n=1}^{N_d} 𝕀(w_dn = w, z_dn = k),  N_kd = Σ_{n=1}^{N_d} 𝕀(z_dn = k).
• β and θ can be collapsed (integrated out):
  p(z, w) = ∬ p(β, θ, z, w) dβ dθ = AB Π_k [ Π_w Γ(N_kw + b_w) / Γ(N_k + W b̄) ] Π_d [ Π_k Γ(N_kd + a_k) / Γ(N_d + K ā) ].
• Unacceptable cost to directly compute p(z|w) = p(z, w)/p(w)!

Page 95

* Latent Dirichlet Allocation
MCMC: Gibbs sampling [GS04]
p(z, w) = AB Π_k [ Π_w Γ(N_kw + b_w) / Γ(N_k + W b̄) ] Π_d [ Π_k Γ(N_kd + a_k) / Γ(N_d + K ā) ].
• Use Gibbs sampling: iteratively sample from
  z_11^(1) ∼ p(z_11 | z_12^(0), z_13^(0), z_14^(0), …, z_{DN_D}^(0), w),
  z_12^(1) ∼ p(z_12 | z_11^(1), z_13^(0), …, z_{DN_D}^(0), w),
  z_13^(1) ∼ p(z_13 | z_11^(1), z_12^(1), z_14^(0), …, z_{DN_D}^(0), w),
  …,
  z_dn^(l+1) ∼ p(z_dn | z_11^(l+1), …, z_{d(n−1)}^(l+1), z_{d(n+1)}^(l), …, w) =: p(z_dn | z_−dn, w),
  where p(z_dn = k | z_−dn, w) ∝ (N_kw^−dn + b_w) / (N_k^−dn + W b̄) · (N_kd^−dn + a_k).

Page 96

* Latent Dirichlet Allocation
Derivation of the Gibbs conditional. Start from
p(z, w) = AB Π_{k'} [ Π_{w'} Γ(N_{k'w'} + b_{w'}) / Γ(N_{k'} + W b̄) ] Π_{d'} [ Π_{k'} Γ(N_{k'd'} + a_{k'}) / Γ(N_{d'} + K ā) ].
Denote w_dn as w. Split off the factors involving z_dn via N_{k'w} = N_{k'w}^−dn + 𝕀(z_dn = k') (similarly for N_{k'} and N_{k'd}) and Γ(N + 1) = N Γ(N):
p(z, w) = AB Π_{k'} [ Π_{w'≠w} Γ(N_{k'w'} + b_{w'}) · Γ(N_{k'w}^−dn + b_w) / Γ(N_{k'}^−dn + W b̄) ]
  · Π_{d'≠d} [ Π_{k'} Γ(N_{k'd'} + a_{k'}) / Γ(N_{d'} + K ā) ] · [ Π_{k'} Γ(N_{k'd}^−dn + a_{k'}) / Γ(N_d + K ā) ]
  · Π_{k'} [ (N_{k'w}^−dn + b_w) / (N_{k'}^−dn + W b̄) · (N_{k'd}^−dn + a_{k'}) ]^{𝕀(z_dn = k')}.
Only the last product depends on z_dn, hence
p(z_dn = k | z_−dn, w) ∝ (N_kw^−dn + b_w) / (N_k^−dn + W b̄) · (N_kd^−dn + a_k).

Page 97

* Latent Dirichlet Allocation
MCMC: Gibbs sampling [GS04]
• For β, use the MAP estimate: β̂ = argmax_β log p(β|w).
  Estimate p(β|w) = E_{p(z|w)}[p(β|z, w)] with one sample of z from p(z|w):
  ⟹ β̂_kw = (N_kw + b_w − 1) / (N_k + W b̄ − W) ≈ (N_kw + b_w) / (N_k + W b̄).
• For θ, use the MAP estimate:
  θ̂_dk = (N_kd + a_k − 1) / (N_d + K ā − K) ≈ (N_kd + a_k) / (N_d + K ā).

Page 98

Latent Dirichlet Allocation
MCMC: LightLDA [YGH+15]
p(z_dn = k | z_−dn, w) ∝ (N_kd^−dn + a_k) (N_kw^−dn + b_w) / (N_k^−dn + W b̄).
• Direct implementation: O(K) time per draw.
• Amortized O(1) multinomial sampling: the alias table.
  Example: probabilities (3/8, 1/16, 1/8, 7/16) ⟹ alias table {(h_i, v_i)} = {(4, 3/16), (1, 1/16), (4, 1/8), (4, 1/4)}.
• O(1) sampling: i ∼ Unif{1, …, K}, v ∼ Unif(0, 1/K); take z = i if v < v_i, else z = h_i.
• O(K) time to build the alias table ⟹ amortized O(1) time over K samples. See the sketch below.
• What if the target changes (slightly)? Use Metropolis-Hastings (MH) to correct.

2019/10/10 清华大学-MSRA 《高等机器学习》 98
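A sketch of the construction and sampling (Vose's variant of the alias method; the table it builds for the example above may differ from the one shown, since alias tables are not unique):

```python
import numpy as np

def build_alias_table(p):
    """O(K) construction: each bucket i keeps mass v_i of item i, alias h_i fills the rest."""
    K = len(p)
    v = np.asarray(p, dtype=float) / np.sum(p)        # absolute masses, sum to 1
    h = np.arange(K)                                  # default alias: the bucket itself
    small = [i for i in range(K) if v[i] < 1.0 / K]
    large = [i for i in range(K) if v[i] >= 1.0 / K]
    while small and large:
        s, l = small.pop(), large.pop()
        h[s] = l                                      # bucket s overflows into item l
        v[l] -= 1.0 / K - v[s]                        # remove the donated mass from l
        (small if v[l] < 1.0 / K else large).append(l)
    return h, v                                       # leftover buckets keep themselves

def alias_sample(h, v, rng):
    """Amortized O(1) draw: z = i if v' < v_i else h_i."""
    K = len(h)
    i = rng.integers(K)
    return i if rng.uniform(0, 1.0 / K) < v[i] else h[i]

rng = np.random.default_rng(0)
h, v = build_alias_table([3/8, 1/16, 1/8, 7/16])
counts = np.bincount([alias_sample(h, v, rng) for _ in range(10000)], minlength=4)
# counts / 10000 should be close to (0.375, 0.0625, 0.125, 0.4375)
```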

Latent Dirichlet Allocation - MCMC: LightLDA [YGH+15]

$$p(z_{dn} = k | z_{-dn}, w) \propto \big(N_{kd}^{-dn} + a_k\big) \frac{N_{kw}^{-dn} + b_w}{N_k^{-dn} + W\bar{b}},$$

• Proposal in MH:
$$q(z_{dn} = k) \propto \underbrace{\big(M_{kd} + a_k\big)}_{\text{doc-proposal}} \cdot \underbrace{\frac{M_{kw} + b_w}{M_k + W\bar{b}}}_{\text{word-proposal}}.$$
Update the cached counts $M_{kd} = N_{kd}$, $M_{kw} = N_{kw}$, $M_k = N_k$ every $K$ draws.

• Doc-proposal (for a move $k \to k'$):
  • MH ratio $= \dfrac{\big(N_{k'd}^{-dn} + a_{k'}\big)\big(N_{k'w}^{-dn} + b_w\big)\big(N_k^{-dn} + W\bar{b}\big)\big(M_{kd} + a_k\big)}{\big(N_{kd}^{-dn} + a_k\big)\big(N_{kw}^{-dn} + b_w\big)\big(N_{k'}^{-dn} + W\bar{b}\big)\big(M_{k'd} + a_{k'}\big)}$: computable in $O(1)$.
  • Sampling from $\propto M_{kd}$: take $z_{dn'}$ where $n' \sim \mathrm{Unif}\{1, \ldots, N_d\}$. Directly $O(1)$.
  • Sampling from $\propto a_k$ (dense): use an alias table. Amortized $O(1)$.

Latent Dirichlet Allocation - MCMC: LightLDA [YGH+15]

$$p(z_{dn} = k | z_{-dn}, w) \propto \big(N_{kd}^{-dn} + a_k\big) \frac{N_{kw}^{-dn} + b_w}{N_k^{-dn} + W\bar{b}},$$

• Proposal in MH:
$$q(z_{dn} = k) \propto \underbrace{\big(M_{kd} + a_k\big)}_{\text{doc-proposal}} \cdot \underbrace{\frac{M_{kw} + b_w}{M_k + W\bar{b}}}_{\text{word-proposal}}.$$
Update the cached counts $M_{kd} = N_{kd}$, $M_{kw} = N_{kw}$, $M_k = N_k$ every $K$ draws.

• Word-proposal (for a move $k \to k'$):
  • MH ratio $= \dfrac{\big(N_{k'd}^{-dn} + a_{k'}\big)\big(N_{k'w}^{-dn} + b_w\big)\big(N_k^{-dn} + W\bar{b}\big)\big(M_{kw} + b_w\big)\big(M_{k'} + W\bar{b}\big)}{\big(N_{kd}^{-dn} + a_k\big)\big(N_{kw}^{-dn} + b_w\big)\big(N_{k'}^{-dn} + W\bar{b}\big)\big(M_{k'w} + b_w\big)\big(M_k + W\bar{b}\big)}$: computable in $O(1)$.
  • $\dfrac{M_{kw} + b_w}{M_k + W\bar{b}} = \dfrac{M_{kw}}{M_k + W\bar{b}} + \dfrac{b_w}{M_k + W\bar{b}}$: sample from either term using an alias table. Amortized $O(1)$.

Latent Dirichlet Allocation - MCMC: LightLDA [YGH+15]

• Overall procedure for Gibbs sampling (cycle proposal):
  • Alternate between the word-proposal and the doc-proposal: better coverage of the modes (see the sketch below).
  • For each $z_{dn}$, run the MH chain $L \leq K$ steps and take the last sample.
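A simplified sketch of one doc-proposal MH step (a sketch assuming symmetric priors and the count arrays from the Gibbs sketch above; `doc_topics` is a hypothetical stale snapshot of document $d$'s topic assignments, standing in for $M_{kd}$, and the word-proposal step would be analogous with alias tables over $M_{kw}$):

```python
import numpy as np

def mh_step_doc_proposal(k, d, w, Nkd, Nkw, Nk, doc_topics, a, b, W, rng):
    """One MH step with the doc-proposal q(k') ∝ M_k'd + a (symmetric prior a).
    Nkd, Nkw, Nk: current counts with the token being resampled excluded."""
    K = len(Nk)
    Nd = len(doc_topics)
    # Draw a candidate: with prob N_d/(N_d + aK) sample ∝ M_kd, else ∝ a (uniform).
    if rng.uniform() < Nd / (Nd + a * K):
        k_new = doc_topics[rng.integers(Nd)]   # topic of a random token: ∝ M_kd, O(1)
    else:
        k_new = rng.integers(K)                # ∝ a_k, uniform for a symmetric prior
    # Acceptance ratio p(k') q(k) / (p(k) q(k')): all O(1) lookups.
    Mkd = np.bincount(doc_topics, minlength=K) # for clarity; cache this in practice
    p_ratio = ((Nkd[k_new, d] + a) * (Nkw[k_new, w] + b) * (Nk[k] + W * b)) / \
              ((Nkd[k, d] + a) * (Nkw[k, w] + b) * (Nk[k_new] + W * b))
    q_ratio = (Mkd[k] + a) / (Mkd[k_new] + a)
    return k_new if rng.uniform() < min(1.0, p_ratio * q_ratio) else k
```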

Latent Dirichlet Allocation - MCMC: LightLDA [YGH+15]

• System implementation:
  • Send the model to the data.
  • Hybrid data structure for the count tables.

(Figures illustrating the system design in [YGH+15] are not reproduced here.)

Latent Dirichlet Allocation - Dynamics-Based MCMC and Particle-Based VI: target $p(\beta | w)$.

$$\nabla_\beta \log p(\beta | w) = \mathbb{E}_{p(z | \beta, w)}\big[\nabla_\beta \log p(\beta, z, w)\big].$$

• Stochastic Gradient Riemannian Langevin Dynamics [PT13], Stochastic Gradient Nose-Hoover Thermostats [DFB+14], Stochastic Gradient Riemannian Hamiltonian Monte Carlo [MCF15].
• Accelerated particle-based VI [LZC+19, LZZ19].
• The estimator is tractable: the inner expectation can be approximated by Gibbs sampling of $z$, and $\nabla_\beta \log p(\beta, z, w)$ is known in closed form.

Latent Dirichlet Allocation - MCMC: Stochastic Gradient Riemannian Langevin Dynamics [PT13]

$$\mathrm{d}x = G^{-1} \nabla \log p \,\mathrm{d}t + \nabla \cdot G^{-1} \,\mathrm{d}t + \mathcal{N}\big(0, 2 G^{-1} \mathrm{d}t\big).$$

• To draw from $p(\beta | w)$:
$$\nabla_\beta \log p(\beta | w) = \frac{1}{p(\beta | w)} \nabla_\beta \int p(\beta, z | w) \,\mathrm{d}z = \frac{1}{p(\beta | w)} \int \nabla_\beta p(\beta, z | w) \,\mathrm{d}z$$
$$= \int \frac{p(\beta, z | w)}{p(\beta | w)} \cdot \frac{\nabla_\beta p(\beta, z | w)}{p(\beta, z | w)} \,\mathrm{d}z = \mathbb{E}_{p(z | \beta, w)}\big[\nabla_\beta \log p(\beta, z, w)\big].$$

• $p(\beta, z, w)$ is available in closed form.
• $p(z | \beta, w)$ can be drawn from using Gibbs sampling.
• Each $\beta_k$ lives on a simplex: use a reparameterization to convert to Euclidean space (that is where $G$ comes from), e.g., $\beta_{kw} = \pi_{kw} / \sum_{w'} \pi_{kw'}$ (a simplified code sketch follows).
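For intuition only, a stripped-down sketch of one such update: plain Langevin on the unconstrained parameters $\pi$ with a reflection to stay positive, not the full Riemannian scheme of [PT13]; `Nkw` is assumed to come from one Gibbs draw of $z \sim p(z | \beta, w)$ and a symmetric prior $b$ is assumed.

```python
import numpy as np

def langevin_step_lda_beta(pi, Nkw, b, eps, rng):
    """One Langevin step on unconstrained pi, with beta_kw = pi_kw / sum_w' pi_kw'."""
    S = pi.sum(axis=1, keepdims=True)
    beta = pi / S
    g_beta = (Nkw + b - 1.0) / beta                  # d log p(beta, z, w) / d beta_kw
    # chain rule through the normalization beta = pi / S
    g_pi = (g_beta - (g_beta * beta).sum(axis=1, keepdims=True)) / S
    pi = pi + eps * g_pi + np.sqrt(2 * eps) * rng.standard_normal(pi.shape)
    return np.abs(pi)                                # reflection keeps pi positive
```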

Latent Dirichlet Allocation - MCMC: Stochastic Gradient Riemannian Langevin Dynamics [PT13]

$$\mathrm{d}x = G^{-1} \nabla \log p \,\mathrm{d}t + \nabla \cdot G^{-1} \,\mathrm{d}t + \mathcal{N}\big(0, 2 G^{-1} \mathrm{d}t\big).$$

• Various parameterizations of the simplex are possible, each inducing a different metric $G$; the comparison from [PT13] is not reproduced here.

Supervised Latent Dirichlet Allocation - model structure [MB08]

(Plate diagram: hyperparameters $a, b$; topics $\beta_k$, $k = 1, \ldots, K$; per-document proportions $\theta_d$ and responses $y_d$ with parameters $\eta, \sigma$; per-word assignments $z_{dn}$ and words $w_{dn}$, $n = 1, \ldots, N_d$, $d = 1, \ldots, D$. Illustration: documents paired with labels such as "science & tech" or "politics", linked through their topics.)

• Variational inference: similar to LDA.
• Prediction: for a test document $w_d$,
$$\hat{y}_d := \mathbb{E}_{p(y_d | w_d)}[y_d] = \eta^\top \mathbb{E}_{p(z_d | w_d)}[\bar{z}_d] \approx \eta^\top \mathbb{E}_{q(z_d | w_d)}[\bar{z}_d].$$
First do inference (find $q(z_d | w_d)$), then estimate $\hat{y}_d$.

Supervised Latent Dirichlet Allocation - model structure [MB08]

• Generating process (translated to code below):
  • Draw topics: $\beta_k \sim \mathrm{Dir}(b)$, $k = 1, \ldots, K$;
  • For each document $d$,
    • Draw the topic proportion $\theta_d \sim \mathrm{Dir}(a)$;
    • For each word $n$ in document $d$,
      • Draw the topic assignment $z_{dn} \sim \mathrm{Mult}(\theta_d)$;
      • Draw the word $w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})$;
    • Draw the response $y_d \sim \mathcal{N}(\eta^\top \bar{z}_d, \sigma^2)$, where $\bar{z}_d := \frac{1}{N_d} \sum_{n=1}^{N_d} z_{dn}$ ($z_{dn}$ one-hot).

$$p(\beta, \theta, z, w, y) = \prod_{k=1}^{K} \mathrm{Dir}(\beta_k | b) \prod_{d=1}^{D} \Big[\mathrm{Dir}(\theta_d | a) \prod_{n=1}^{N_d} \mathrm{Mult}(z_{dn} | \theta_d)\, \mathrm{Mult}(w_{dn} | \beta_{z_{dn}}) \cdot \mathcal{N}(y_d | \eta^\top \bar{z}_d, \sigma^2)\Big].$$
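The generating process maps directly onto code; a minimal sketch (symmetric priors and equal document lengths assumed for brevity; all names are illustrative):

```python
import numpy as np

def generate_slda_corpus(D, K, W, Nd, a, b, eta, sigma, rng):
    """Sample a corpus from the sLDA generating process."""
    beta = rng.dirichlet(np.full(W, b), size=K)          # topics beta_k ~ Dir(b)
    docs, responses = [], []
    for d in range(D):
        theta = rng.dirichlet(np.full(K, a))             # theta_d ~ Dir(a)
        z = rng.choice(K, size=Nd, p=theta)              # z_dn ~ Mult(theta_d)
        words = np.array([rng.choice(W, p=beta[k]) for k in z])  # w_dn ~ Mult(beta_z)
        z_bar = np.bincount(z, minlength=K) / Nd         # empirical topic frequencies
        y = rng.normal(eta @ z_bar, sigma)               # y_d ~ N(eta^T z_bar, sigma^2)
        docs.append(words); responses.append(y)
    return docs, np.array(responses), beta
```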

Supervised Latent Dirichlet Allocation - variational inference [MB08]: similar to LDA.

• Same variational distribution as in LDA:
$$q_{\lambda, \gamma, \phi}(\beta, \theta, z) := \prod_{k=1}^{K} \mathrm{Dir}(\beta_k | \lambda_k) \prod_{d=1}^{D} \mathrm{Dir}(\theta_d | \gamma_d) \prod_{n=1}^{N_d} \mathrm{Mult}(z_{dn} | \phi_{dn}).$$
$\mathrm{ELBO}(\lambda, \gamma, \phi; a, b, \eta, \sigma^2)$ is available in closed form.

• E-step: update $\lambda, \gamma, \phi$ by maximizing the ELBO.
• M-step: update $a, b, \eta, \sigma^2$ by maximizing the ELBO.
• Prediction: given a new document $w_d$,
$$\hat{y}_d := \mathbb{E}_{p(y_d | w_d)}[y_d] = \eta^\top \mathbb{E}_{p(z_d | w_d)}[\bar{z}_d] \approx \eta^\top \mathbb{E}_{q(z_d | w_d)}[\bar{z}_d].$$
First do inference: find $q(z_d | w_d)$, i.e., $\phi_d$; then estimate $\hat{y}_d$.

Supervised Latent Dirichlet Allocation - variational inference with posterior regularization [ZAX12]

• Regularized Bayes (RegBayes) [ZCX14]:
  • Recall:
$$p\big(z \,\big|\, \{x^{(n)}, y^{(n)}\}_n\big) = \operatorname{argmin}_{q(z)} \Big\{ -\mathcal{L}(q) = \mathrm{KL}\big(q(z) \,\|\, p(z)\big) - \sum_n \mathbb{E}_q \log p\big(x^{(n)}, y^{(n)} | z\big) \Big\}.$$
  • Regularize the posterior towards better prediction:
$$\min_{q(z)} \mathrm{KL}\big(q(z) \,\|\, p(z)\big) - \sum_n \mathbb{E}_q \log p\big(x^{(n)}, y^{(n)} | z\big) + \lambda\, \ell\big(q(z); \{x^{(n)}, y^{(n)}\}_n\big).$$

• Maximum entropy discrimination LDA (MedLDA) [ZAX12]:
  • $\ell\big(q; \{w^{(n)}, y^{(n)}\}_n\big) = \sum_n \ell_\varepsilon\big(y^{(n)} - \hat{y}^{(n)}(q, w^{(n)})\big) = \sum_n \ell_\varepsilon\big(y^{(n)} - \eta^\top \mathbb{E}_{q(z^{(n)} | w^{(n)})}[\bar{z}^{(n)}]\big)$, where $\ell_\varepsilon(r) = \max(0, |r| - \varepsilon)$ is the hinge (max-margin) loss.
  • Facilitates both prediction and topic representation.

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Variational Auto-Encoder: a more flexible Bayesian model using deep learning tools.

• Model structure (decoder) [KW14]:
$$z_d \sim p(z_d) = \mathcal{N}(z_d | 0, I), \qquad x_d \sim p_\theta(x_d | z_d) = \mathcal{N}\big(x_d \,\big|\, \mu_\theta(z_d), \Sigma_\theta(z_d)\big),$$
where $\mu_\theta(z_d)$ and $\Sigma_\theta(z_d)$ are modeled by neural networks.

(Plate diagram: latent $z_d$ generates observation $x_d$ through $p_\theta(x_d | z_d)$, $d = 1, \ldots, D$.)

Variational Auto-Encoder

• Variational inference (encoder) [KW14]:
$$q_\phi(z | x) := \prod_{d=1}^{D} q_\phi(z_d | x_d) = \prod_{d=1}^{D} \mathcal{N}\big(z_d \,\big|\, \nu_\phi(x_d), \Gamma_\phi(x_d)\big),$$
where $\nu_\phi(x_d)$ and $\Gamma_\phi(x_d)$ are also neural networks.

• Amortized inference: approximate the local posteriors $\{p(z_d | x_d)\}_{d=1}^{D}$ globally by a single $\phi$.
• Objective (a code sketch follows below):
$$\mathbb{E}_{\hat{p}(x)}[\mathrm{ELBO}(x)] \approx \frac{1}{D} \sum_{d=1}^{D} \mathbb{E}_{q_\phi(z_d | x_d)}\big[\log p(z_d)\, p_\theta(x_d | z_d) - \log q_\phi(z_d | x_d)\big].$$
• Gradient estimation with the reparameterization trick:
$$z_d \sim q_\phi(z_d | x_d) \iff z_d = g_\phi(x_d, \epsilon) := \nu_\phi(x_d) + \Gamma_\phi(x_d)^{1/2} \epsilon, \qquad \epsilon \sim q(\epsilon) = \mathcal{N}(\epsilon | 0, I).$$
$$\nabla_{\phi, \theta}\, \mathbb{E}_{\hat{p}(x)}[\mathrm{ELBO}(x)] \approx \frac{1}{D} \sum_{d=1}^{D} \mathbb{E}_{q(\epsilon)}\Big[\nabla_{\phi, \theta} \big(\log p(z_d)\, p_\theta(x_d | z_d) - \log q_\phi(z_d | x_d)\big)\Big|_{z_d = g_\phi(x_d, \epsilon)}\Big].$$
(Smaller variance than the REINFORCE-like estimator [Wil92]: $\nabla_\theta \mathbb{E}_{q_\theta}[f] = \mathbb{E}_{q_\theta}[f \nabla_\theta \log q_\theta]$.)
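A minimal PyTorch-style sketch of this training objective (a sketch under assumed interfaces: hypothetical `encoder`/`decoder` modules returning means and log-variances; the analytic Gaussian KL stands in for the $\log p(z_d) - \log q_\phi(z_d | x_d)$ terms):

```python
import math
import torch

def vae_loss(x, encoder, decoder):
    """Negative ELBO for one batch, with the reparameterization trick."""
    mu_z, logvar_z = encoder(x)
    eps = torch.randn_like(mu_z)                     # eps ~ N(0, I)
    z = mu_z + eps * torch.exp(0.5 * logvar_z)       # z = g_phi(x, eps)
    mu_x, logvar_x = decoder(z)
    # log p_theta(x | z) for a diagonal-Gaussian likelihood
    log_px = -0.5 * (logvar_x + (x - mu_x) ** 2 / logvar_x.exp()
                     + math.log(2 * math.pi)).sum(dim=1)
    # KL(q_phi(z|x) || N(0, I)) in closed form
    kl = 0.5 * (mu_z ** 2 + logvar_z.exp() - 1 - logvar_z).sum(dim=1)
    return (kl - log_px).mean()                      # minimize the negative ELBO
```

Because $z$ is a differentiable function of $\phi$ and the noise $\epsilon$, ordinary backpropagation through this loss yields the reparameterized gradient.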

Variational Auto-Encoder

• Generation results [KW14]. (Generated-sample figures are not reproduced here.)

Variational Auto-Encoder

• With a spatial attention structure [GDG+15]. (Generated-sample figures from DRAW are not reproduced here.)

Variational Auto-Encoder

• Inference with the importance-weighted ELBO [BGS15]:
  • ELBO: $\mathcal{L}_\theta(q_\phi(z)) = \mathbb{E}_{q_\phi(z)}[\log p_\theta(z, x)] - \mathbb{E}_{q_\phi(z)}[\log q_\phi(z)]$.
  • A tighter lower bound (a code sketch follows below):
$$\mathcal{L}_\theta^k(q_\phi) := \mathbb{E}_{z^{(1)}, \ldots, z^{(k)} \overset{\text{i.i.d.}}{\sim} q_\phi}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(z^{(i)}, x)}{q_\phi(z^{(i)})}\right].$$
  • Ordering relation:
$$\mathcal{L}_\theta(q_\phi) = \mathcal{L}_\theta^1(q_\phi) \leq \mathcal{L}_\theta^2(q_\phi) \leq \cdots \leq \mathcal{L}_\theta^\infty(q_\phi) = \log p_\theta(x),$$
where the last equality holds if $p(z, x) / q(z | x)$ is bounded.
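A sketch of a Monte Carlo estimate of $\mathcal{L}_\theta^k$ (same hypothetical encoder/decoder interface as the VAE sketch above; the log-sum-exp keeps the average of importance weights numerically stable):

```python
import math
import torch

def iwae_bound(x, encoder, decoder, k):
    """Monte Carlo estimate of the importance-weighted bound L^k for one batch."""
    mu_z, logvar_z = encoder(x)
    std = torch.exp(0.5 * logvar_z)
    log_w = []
    for _ in range(k):
        z = mu_z + std * torch.randn_like(std)       # z^(i) ~ q_phi(z | x), i.i.d.
        mu_x, logvar_x = decoder(z)
        log_pxz = -0.5 * (logvar_x + (x - mu_x) ** 2 / logvar_x.exp()
                          + math.log(2 * math.pi)).sum(dim=1)
        log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
        log_qz = -0.5 * (logvar_z + (z - mu_z) ** 2 / logvar_z.exp()
                         + math.log(2 * math.pi)).sum(dim=1)
        log_w.append(log_pxz + log_pz - log_qz)      # log w_i = log[p(z, x) / q(z)]
    log_w = torch.stack(log_w, dim=0)                # shape (k, batch)
    # log (1/k) sum_i w_i, computed stably with log-sum-exp
    return (torch.logsumexp(log_w, dim=0) - math.log(k)).mean()
```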

Variational Auto-Encoder

• Parametric variational inference: towards more flexible approximations.
  • Explicit VI: normalizing flows [RM15, KSJ+16]; using a tighter ELBO [BGS15].
  • Implicit VI: Adversarial Auto-Encoder [MSJ+15], Adversarial Variational Bayes [MNG17], Wasserstein Auto-Encoder [TBGS17], [SSZ18a], [LT18], [SSZ18b].
• MCMC [LTL17] and particle-based VI [FWL17, PGH+17]:
  • Train the encoder as a sample generator.
  • Amortize the update on samples into $\phi$.

Outline
• Generative Models: Overview
• Plain Generative Models
  • Autoregressive Models
• Latent Variable Models
  • Deterministic Generative Models
    • Generative Adversarial Nets
    • Flow-Based Generative Models
  • Bayesian Generative Models
    • Bayesian Inference (variational inference, MCMC)
    • Bayesian Networks
      • Topic Models (LDA, LightLDA, sLDA)
      • Deep Bayesian Models (VAE)
    • Markov Random Fields (Boltzmann machines, deep energy-based models)

Markov Random Fields

Specify $p_\theta(x, z)$ by an energy function $E_\theta(x, z)$:
$$p_\theta(x, z) = \frac{1}{Z_\theta} \exp\big(-E_\theta(x, z)\big), \qquad Z_\theta = \int \exp\big(-E_\theta(x', z')\big) \,\mathrm{d}x' \,\mathrm{d}z'.$$

(Diagram: an undirected graph between $z$ and $x$, with $p_\theta(x, z) \propto \exp(-E_\theta(x, z))$.)

• Only correlation and no causality: the joint can be read either as $p(z)p(x|z)$ or as $p(x)p(z|x)$; no generating direction is singled out.
  + Flexible and simple in modeling dependency.
  - Harder to learn and to generate from than BayesNets.
• Learning: even $p_\theta(x, z)$ is unavailable, since $Z_\theta$ is intractable. Still,
$$\nabla_\theta \mathbb{E}_{\hat{p}(x)}[\log p_\theta(x)] = -\mathbb{E}_{\hat{p}(x) p_\theta(z|x)}[\nabla_\theta E_\theta(x, z)] + \mathbb{E}_{p_\theta(x, z)}[\nabla_\theta E_\theta(x, z)],$$
where the first expectation is over the (augmented) data distribution (requires Bayesian inference) and the second over the model distribution (requires generation); the second term $= 0$ when $E_\theta = -\log p_\theta$, since $\mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta] = 0$.
• Bayesian inference: generally the same as for BayesNets.
• Generation: rely on MCMC, or train a generator.

Markov Random Fields

• Learning: $\nabla_\theta \mathbb{E}_{\hat{p}(x)}[\log p_\theta(x)] = -\mathbb{E}_{\hat{p}(x) p_\theta(z|x)}[\nabla_\theta E_\theta(x, z)] + \mathbb{E}_{p_\theta(x, z)}[\nabla_\theta E_\theta(x, z)]$.
• Boltzmann Machine [HS83]: Gibbs sampling for both Bayesian inference and generation, with
$$E_\theta(x, z) = -x^\top W z - \tfrac{1}{2} x^\top L x - \tfrac{1}{2} z^\top J z,$$
$$p_\theta(z_j | x, z_{-j}) = \mathrm{Bern}\Big(\sigma\Big(\textstyle\sum_{i=1}^{D} W_{ij} x_i + \sum_{m \neq j}^{P} J_{jm} z_m\Big)\Big),$$
$$p_\theta(x_i | z, x_{-i}) = \mathrm{Bern}\Big(\sigma\Big(\textstyle\sum_{j=1}^{P} W_{ij} z_j + \sum_{k \neq i}^{D} L_{ik} x_k\Big)\Big).$$

Markov Random Fields

• Restricted Boltzmann Machine [Smo86] (a gradient-estimate sketch follows below):
$$E_\theta(x, z) = -x^\top W z + b^{(x)\top} x + b^{(z)\top} z.$$
  • Bayesian inference is exact (the hidden units are conditionally independent):
$$p_\theta(z_k | x) = \mathrm{Bern}\big(\sigma(x^\top W_{:k} + b_k^{(z)})\big).$$
  • Generation: Gibbs sampling. Iterate:
$$p_\theta(z_k | x) = \mathrm{Bern}\big(\sigma(x^\top W_{:k} + b_k^{(z)})\big), \qquad p_\theta(x_k | z) = \mathrm{Bern}\big(\sigma(W_{k:} z + b_k^{(x)})\big).$$
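A sketch of one gradient estimate for an RBM, written in the common sign convention $E_\theta(x, z) = -x^\top W z - b_x^\top x - b_z^\top z$ (the slide's biases enter with the opposite sign), and using the contrastive-divergence shortcut of [Hin02]: a single block-Gibbs step approximates the model expectation instead of a long chain.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def rbm_cd1_grads(x, W, bx, bz, rng):
    """CD-1 estimate of grad log p(x) for an RBM; x is a batch of binary rows."""
    pz_data = sigmoid(x @ W + bz)                            # p(z_k = 1 | x): exact
    z = (rng.uniform(size=pz_data.shape) < pz_data).astype(float)
    px_model = sigmoid(z @ W.T + bx)                         # p(x_k = 1 | z)
    x_neg = (rng.uniform(size=px_model.shape) < px_model).astype(float)
    pz_model = sigmoid(x_neg @ W + bz)
    # grad ≈ E_data[x z^T] - E_model[x z^T], and likewise for the biases
    dW = (x.T @ pz_data - x_neg.T @ pz_model) / x.shape[0]
    dbx = (x - x_neg).mean(axis=0)
    dbz = (pz_data - pz_model).mean(axis=0)
    return dW, dbx, dbz
```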

Markov Random Fields - Deep Energy-Based Models

No latent variable; $E_\theta(x)$ is modeled by a neural network.
$$\nabla_\theta \mathbb{E}_{\hat{p}(x)}[\log p_\theta(x)] = -\mathbb{E}_{\hat{p}(x)}[\nabla_\theta E_\theta(x)] + \mathbb{E}_{p_\theta(x')}[\nabla_\theta E_\theta(x')].$$

• [KB16]: learn a generator
$$x \sim q_\phi(x) \iff z \sim q(z),\ x = g_\phi(z)$$
to mimic generation from $p_\theta(x)$:
$$\operatorname{argmin}_\phi \mathrm{KL}(q_\phi \| p_\theta) = \operatorname{argmin}_\phi\ \mathbb{E}_{q(z)}\big[E_\theta(g_\phi(z))\big] - \mathbb{H}[q_\phi],$$
where the entropy $\mathbb{H}[q_\phi]$ is approximated via batch normalization with a Gaussian assumption.


Markov Random Fields - Deep Energy-Based Models

No latent variable; $E_\theta(x)$ is modeled by a neural network.
$$\nabla_\theta \mathbb{E}_{\hat{p}(x)}[\log p_\theta(x)] = -\mathbb{E}_{\hat{p}(x)}[\nabla_\theta E_\theta(x)] + \mathbb{E}_{p_\theta(x')}[\nabla_\theta E_\theta(x')].$$

• [DM19]: estimate $\mathbb{E}_{p_\theta(x')}[\cdot]$ by samples drawn with Langevin dynamics (a sampler sketch follows below):
$$x^{(k+1)} = x^{(k)} - \varepsilon \nabla_x E_\theta\big(x^{(k)}\big) + \mathcal{N}(0, 2\varepsilon).$$
• A replay buffer initializes the LD chains.
• $L_2$ regularization on the energy function.
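A sketch of the negative-phase sampler: short-run Langevin dynamics initialized from a replay buffer. Step count, step size, and restart probability are illustrative placeholders, not the settings of [DM19]; `energy` is the network $E_\theta$ and `buffer` a tensor of past samples.

```python
import torch

def langevin_negatives(energy, buffer, batch=64, n_steps=60, eps=1e-2, reinit_p=0.05):
    """Draw negative samples for the EBM gradient via short-run Langevin dynamics."""
    idx = torch.randint(len(buffer), (batch,))
    x = buffer[idx].clone()
    fresh = torch.rand(batch) < reinit_p
    x[fresh] = torch.rand_like(x[fresh])                     # occasional fresh restarts
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]    # grad_x E_theta(x)
        x = (x - eps * grad
             + (2 * eps) ** 0.5 * torch.randn_like(x)).detach()
    buffer[idx] = x                                          # persist chains in the buffer
    return x
```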

Markov Random Fields - Deep Energy-Based Models

• Results of [DM19]. (ImageNet 32x32 generation samples are not reproduced here.)

Generative Model: Summary

• Plain Generative Models: Autoregressive Models (model $p_\theta(x)$ directly).
  + Easy learning; + easy generation; - no abstract representation; - slow generation.
• Latent Variable Models: + abstract representation and manipulated generation; - harder learning.
  • Deterministic Generative Models:
    • GANs ($z \sim p(z)$, $x = f_\theta(z)$ with a neural network; $p_\theta(x)$ implicit): + flexible modeling; + easy and good generation.
    • Flow-Based ($z \sim p(z)$, $x = f_\theta(z)$ invertible; $p_\theta(x)$ by change of variables): + easy inference; + stable learning; - hard model design.
  • Bayesian Generative Models: + robust to small data and adversarial attacks; + principled inference; + incorporate prior knowledge; - hard inference; - hard learning.
    • BayesNets ($z \sim p(z)$, $x \sim p_\theta(x|z)$): + causal information; + easier learning; + easy generation.
    • MRFs ($p_\theta(x, z) \propto \exp(-E_\theta(x, z))$): + simple dependency modeling; - harder learning; - hard generation.


Questions?


References


References - Plain Generative Models: Autoregressive Models
• [Fre98] Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press.
• [LM11] Larochelle, H., & Murray, I. (2011). The neural autoregressive distribution estimator. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
• [UML14] Uria, B., Murray, I., & Larochelle, H. (2014). A deep and tractable density estimator. In International Conference on Machine Learning (pp. 467-475).
• [GGML15] Germain, M., Gregor, K., Murray, I., & Larochelle, H. (2015). MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning (pp. 881-889).
• [OKK16] Oord, A. v. d., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.
• [ODZ+16] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References - Deterministic Generative Models
• Generative Adversarial Networks:
• [GPM+14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672-2680).
• [ACB17] Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning (pp. 214-223).
• Flow-Based Generative Models:
• [DKB15] Dinh, L., Krueger, D., & Bengio, Y. (2015). NICE: Non-linear independent components estimation. ICLR Workshop.
• [DSB17] Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using real NVP. In Proceedings of the International Conference on Learning Representations.
• [PPM17] Papamakarios, G., Pavlakou, T., & Murray, I. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems (pp. 2338-2347).
• [KD18] Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems (pp. 10215-10224).

References - Bayesian Inference: Variational Inference
• Explicit parametric VI:
• [SJJ96] Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76.
• [BNJ03] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
• [GHB12] Gershman, S., Hoffman, M., & Blei, D. (2012). Nonparametric variational inference. arXiv preprint arXiv:1206.4665.
• [HBWP13] Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1), 1303-1347.
• [RGB14] Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics (pp. 814-822).
• [RM15] Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning (pp. 1530-1538).
• [KSJ+16] Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (pp. 4743-4751).

References - Bayesian Inference: Variational Inference
• Implicit parametric VI (density ratio estimation):
• [MSJ+15] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2016). Adversarial autoencoders. In Proceedings of the International Conference on Learning Representations.
• [MNG17] Mescheder, L., Nowozin, S., & Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the International Conference on Machine Learning (pp. 2391-2400).
• [Hus17] Huszár, F. (2017). Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235.
• [TRB17] Tran, D., Ranganath, R., & Blei, D. (2017). Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems (pp. 5523-5533).
• [SSZ18a] Shi, J., Sun, S., & Zhu, J. (2018). Kernel implicit variational inference. In Proceedings of the International Conference on Learning Representations.
• Implicit parametric VI (gradient estimation):
• [VLBM08] Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (pp. 1096-1103). ACM.
• [LT18] Li, Y., & Turner, R. E. (2018). Gradient estimators for implicit models. In Proceedings of the International Conference on Learning Representations.
• [SSZ18b] Shi, J., Sun, S., & Zhu, J. (2018). A spectral approach to gradient estimation for implicit distributions. In Proceedings of the 35th International Conference on Machine Learning (pp. 4651-4660).

References - Bayesian Inference: Variational Inference
• Particle-based VI:
• [LW16] Liu, Q., & Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (pp. 2378-2386).
• [Liu17] Liu, Q. (2017). Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems (pp. 3115-3123).
• [CZ17] Chen, C., & Zhang, R. (2017). Particle optimization in stochastic gradient MCMC. arXiv preprint arXiv:1711.10927.
• [FWL17] Feng, Y., Wang, D., & Liu, Q. (2017). Learning to draw samples with amortized Stein variational gradient descent. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
• [PGH+17] Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., & Carin, L. (2017). VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems (pp. 4236-4245).
• [LZ18] Liu, C., & Zhu, J. (2018). Riemannian Stein variational gradient descent for Bayesian inference. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (pp. 3627-3634).

References - Bayesian Inference: Variational Inference
• Particle-based VI (continued):
• [CMG+18] Chen, W. Y., Mackey, L., Gorham, J., Briol, F. X., & Oates, C. J. (2018). Stein points. arXiv preprint arXiv:1803.10161.
• [FCSS18] Futami, F., Cui, Z., Sato, I., & Sugiyama, M. (2018). Frank-Wolfe Stein sampling. arXiv preprint arXiv:1805.07912.
• [CZW+18] Chen, C., Zhang, R., Wang, W., Li, B., & Chen, L. (2018). A unified particle-optimization framework for scalable Bayesian sampling. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
• [ZZC18] Zhang, J., Zhang, R., & Chen, C. (2018). Stochastic particle-optimization sampling and the non-asymptotic convergence theory. arXiv preprint arXiv:1809.01293.
• [LZC+19] Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J., & Carin, L. (2019). Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning (pp. 4082-4092).

References - Bayesian Inference: MCMC
• Classical MCMC:
• [MRR+53] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087-1092.
• [Has70] Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97-109.
• [GG87] Geman, S., & Geman, D. (1987). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision (pp. 564-584).
• [ADDJ03] Andrieu, C., De Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50(1-2), 5-43.

References - Bayesian Inference: MCMC
• Dynamics-based MCMC (full-batch):
• [Lan08] Langevin, P. (1908). Sur la théorie du mouvement Brownien. Compt. Rendus, 146, 530-533.
• [DKPR87] Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216-222.
• [RT96] Roberts, G. O., & Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4), 341-363.
• [RS02] Roberts, G. O., & Stramer, O. (2002). Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4), 337-357.
• [Nea11] Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2.
• [ZWC+16] Zhang, Y., Wang, X., Chen, C., Henao, R., Fan, K., & Carin, L. (2016). Towards unifying Hamiltonian Monte Carlo and slice sampling. In Advances in Neural Information Processing Systems (pp. 1741-1749).
• [TRGT17] Tripuraneni, N., Rowland, M., Ghahramani, Z., & Turner, R. (2017). Magnetic Hamiltonian Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning (pp. 3453-3461).
• [Bet17] Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.

References - Bayesian Inference: MCMC
• Dynamics-based MCMC (full-batch, manifold support):
• [GC11] Girolami, M., & Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2), 123-214.
• [BSU12] Brubaker, M., Salzmann, M., & Urtasun, R. (2012). A family of MCMC methods on implicitly defined manifolds. In Artificial Intelligence and Statistics (pp. 161-172).
• [BG13] Byrne, S., & Girolami, M. (2013). Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4), 825-845.
• [LSSG15] Lan, S., Stathopoulos, V., Shahbaba, B., & Girolami, M. (2015). Markov chain Monte Carlo from Lagrangian dynamics. Journal of Computational and Graphical Statistics, 24(2), 357-378.

References - Bayesian Inference: MCMC
• Dynamics-based MCMC (stochastic gradient):
• [WT11] Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning (pp. 681-688).
• [CFG14] Chen, T., Fox, E., & Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the International Conference on Machine Learning (pp. 1683-1691).
• [DFB+14] Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems (pp. 3203-3211).
• [Bet15] Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In International Conference on Machine Learning (pp. 533-540).
• [TTV16] Teh, Y. W., Thiery, A. H., & Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1), 193-225.
• [LPH+16] Lu, X., Perrone, V., Hasenclever, L., Teh, Y. W., & Vollmer, S. J. (2016). Relativistic Monte Carlo. arXiv preprint arXiv:1609.04388.
• [ZCG+17] Zhang, Y., Chen, C., Gan, Z., Henao, R., & Carin, L. (2017). Stochastic gradient monomial Gamma sampler. In Proceedings of the 34th International Conference on Machine Learning (pp. 3996-4005).
• [LTL17] Li, Y., Turner, R. E., & Liu, Q. (2017). Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343.

References - Bayesian Inference: MCMC
• Dynamics-based MCMC (stochastic gradient, manifold support):
• [PT13] Patterson, S., & Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems (pp. 3102-3110).
• [MCF15] Ma, Y. A., Chen, T., & Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems (pp. 2917-2925).
• [LZS16] Liu, C., Zhu, J., & Song, Y. (2016). Stochastic gradient geodesic MCMC methods. In Advances in Neural Information Processing Systems (pp. 3009-3017).
• Dynamics-based MCMC (general theory):
• [JKO98] Jordan, R., Kinderlehrer, D., & Otto, F. (1998). The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1), 1-17.
• [CDC15] Chen, C., Ding, N., & Carin, L. (2015). On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems (pp. 2278-2286).
• [LZZ19] Liu, C., Zhuo, J., & Zhu, J. (2019). Understanding MCMC dynamics as flows on the Wasserstein space. In Proceedings of the 36th International Conference on Machine Learning (pp. 4093-4103).

References - Bayesian Models: Bayesian Networks (Topic Models)
• [BNJ03] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
• [GS04] Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.
• [SG07] Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7), 424-440.
• [MB08] Mcauliffe, J. D., & Blei, D. M. (2008). Supervised topic models. In Advances in Neural Information Processing Systems (pp. 121-128).
• [ZAX12] Zhu, J., Ahmed, A., & Xing, E. P. (2012). MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research, 13(Aug), 2237-2278.
• [PT13] Patterson, S., & Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems (pp. 3102-3110).
• [ZCX14] Zhu, J., Chen, N., & Xing, E. P. (2014). Bayesian inference with posterior regularization and applications to infinite latent SVMs. The Journal of Machine Learning Research, 15(1), 1799-1847.

References - Bayesian Models: Bayesian Networks (Topic Models, continued)
• [LARS14] Li, A. Q., Ahmed, A., Ravi, S., & Smola, A. J. (2014). Reducing the sampling complexity of topic models. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 891-900).
• [YGH+15] Yuan, J., Gao, F., Ho, Q., Dai, W., Wei, J., Zheng, X., Xing, E. P., Liu, T. Y., & Ma, W. Y. (2015). LightLDA: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web (pp. 1351-1361).
• [CLZC16] Chen, J., Li, K., Zhu, J., & Chen, W. (2016). WarpLDA: A cache efficient O(1) algorithm for latent Dirichlet allocation. Proceedings of the VLDB Endowment, 9(10), 744-755.

References - Bayesian Models: Bayesian Networks (Variational Auto-Encoders)
• [KW14] Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations.
• [GDG+15] Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning.
• [BGS15] Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
• [DFD+18] Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., & Tomczak, J. M. (2018). Hyperspherical variational auto-encoders. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
• [MSJ+15] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2016). Adversarial autoencoders. In Proceedings of the International Conference on Learning Representations.
• [CDH+16] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2172-2180).
• [MNG17] Mescheder, L., Nowozin, S., & Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the International Conference on Machine Learning (pp. 2391-2400).

References - Bayesian Models: Bayesian Networks (Variational Auto-Encoders, continued)
• [TBGS17] Tolstikhin, I., Bousquet, O., Gelly, S., & Schoelkopf, B. (2017). Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558.
• [FWL17] Feng, Y., Wang, D., & Liu, Q. (2017). Learning to draw samples with amortized Stein variational gradient descent. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
• [PGH+17] Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., & Carin, L. (2017). VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems (pp. 4236-4245).
• [KSDV18] Kocaoglu, M., Snyder, C., Dimakis, A. G., & Vishwanath, S. (2018). CausalGAN: Learning causal implicit generative models with adversarial training. In Proceedings of the International Conference on Learning Representations.
• [LWZZ18] Li, C., Welling, M., Zhu, J., & Zhang, B. (2018). Graphical generative adversarial networks. In Advances in Neural Information Processing Systems (pp. 6069-6080).

References - Bayesian Models: Markov Random Fields
• [HS83] Hinton, G., & Sejnowski, T. (1983). Optimal perceptual inference. In IEEE Conference on Computer Vision and Pattern Recognition.
• [Smo86] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing, volume 1, chapter 6 (pp. 194-281). MIT Press.
• [Hin02] Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771-1800.
• [LCH+06] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data, 1(0).
• [HOT06] Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
• [SH09] Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In AISTATS (pp. 448-455).
• [Sal15] Salakhutdinov, R. (2015). Learning deep generative models. Annual Review of Statistics and Its Application, 2, 361-385.
• [KB16] Kim, T., & Bengio, Y. (2016). Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.
• [DM19] Du, Y., & Mordatch, I. (2019). Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689.

References - Others
• Bayesian models:
• [KYD+18] Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., & Ahn, S. (2018). Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems (pp. 7332-7342).
• [LST15] Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338.
• Bayesian neural networks:
• [LG17] Li, Y., & Gal, Y. (2017). Dropout inference in Bayesian neural networks with alpha-divergences. In Proceedings of the International Conference on Machine Learning (pp. 2052-2061).
• Related references:
• [Wil92] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.
• [HV93] Hinton, G., & Van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory.
• [NJ01] Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (pp. 841-848).

The End