Page 1: Software toolkits for machine learning and graphical models

Graphical model software for machine learning

Kevin Murphy

University of British Columbia

December, 2005

Page 2: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 3: Software toolkits for machine learning and graphical models

Supervised learning as Bayesian inference

[Figure: plate-style diagram with training pairs (X1, Y1), …, (XN, YN) — also drawn as a plate over (Xn, Yn) — and a test pair (X*, Y*); panels labelled “Training” and “Testing”.]

Page 4: Software toolkits for machine learning and graphical models

Supervised learning as optimization

[Figure: the same training/testing diagram, shown again for the point-estimate (optimization) view.]
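A compact way to state the contrast between these two slides (standard equations, not reproduced from the original slides): the Bayesian view integrates over the unknown parameters, while the optimization view plugs in a single point estimate.

```latex
% Bayesian view: average predictions over the parameter posterior
p(y_* \mid x_*, D) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid D)\, d\theta

% Optimization view: fit a point estimate (e.g. penalized maximum likelihood / MAP)
\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) + \log p(\theta),
\qquad p(y_* \mid x_*, D) \approx p(y_* \mid x_*, \hat{\theta})
```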

Page 5: Software toolkits for machine learning and graphical models

Example: logistic regression

• Let yn ∈ {1,…,C} be given by a softmax

• Maximize conditional log likelihood

• “Max margin” solution
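The softmax model and objective referred to above, filled in with standard definitions because the slide's equations are not in the transcript:

```latex
p(y_n = c \mid x_n, W) = \frac{\exp(w_c^{\top} x_n)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} x_n)},
\qquad
\ell(W) = \sum_{n=1}^{N} \log p(y_n \mid x_n, W)
```

The “max margin” alternative replaces this log-loss with a hinge-type loss, as in multiclass SVMs.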

Page 6: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 7: Software toolkits for machine learning and graphical models

1D chain CRFs for sequence labeling

[Figure: chain-structured CRF with label nodes Yn1, Yn2, …, Ynm and observed input Xn.]

A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, yn ∈ {1,…,C}^m

p(yn | xn) ∝ ∏i φi(yn,i, xn) · ∏i ψi,i+1(yn,i, yn,i+1)    (φi: local evidence; ψij: edge potential)

Page 8: Software toolkits for machine learning and graphical models

2D Lattice CRFs for pixel labeling

A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψij are image dependent.

Page 9: Software toolkits for machine learning and graphical models

2D Lattice MRFs for pixel labeling

A Markov Random Field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using ψij(yi, yj). We also have a per-pixel generative model of observations P(xi|yi).

P(y, x) = (1/Z) ∏ij ψij(yi, yj) ∏i P(xi|yi)    (P(xi|yi): local evidence; ψij: potential function; Z: partition function)

Page 10: Software toolkits for machine learning and graphical models

Tree-structured CRFs

• Used in parts-based object detection

• Yi is location of part i in image

[Figure: tree-structured model over part locations: eyeL, eyeR, nose, mouth.]

Fischler & Elschlager, “The representation and matching of pictorial structures”, PAMI’73
Felzenszwalb & Huttenlocher, “Pictorial Structures for Object Recognition”, IJCV’05

Page 11: Software toolkits for machine learning and graphical models

General CRFs

• In general, the graph may have arbitrary structure

• eg for collective web page classification, nodes = urls, edges = hyperlinks

• The potentials are in general defined on cliques, not just edges

Page 12: Software toolkits for machine learning and graphical models

Factor graphs

• Square nodes = factors (potentials)
• Round nodes = random variables
• Graph structure = bipartite

Page 13: Software toolkits for machine learning and graphical models

Potential functions

• For the local evidence, we can use a discriminative classifier (trained iid)

• For the edge compatibilities, we can use a maxent/ loglinear form, using pre-defined features
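A sketch of the maxent / log-linear edge potential described in the second bullet, with generic pre-defined features f_k (illustrative notation, not taken from the slide):

```latex
\psi_{ij}(y_i, y_j \mid x) = \exp\Big( \sum_{k} \lambda_k\, f_k(y_i, y_j, x) \Big)
```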

Page 14: Software toolkits for machine learning and graphical models

Restricted potential functions

• For some applications (esp in vision), we often use a Potts model of the form

• We can generalize this for ordered labels (eg discretization of continuous states)
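The Potts form itself is not legible in the transcript; a standard version (an assumption, not a verbatim reconstruction), together with a common generalization to ordered labels via a truncated linear penalty, is:

```latex
% Potts model: potential depends only on whether the labels agree
\psi_{ij}(y_i, y_j) = \exp\big(\lambda\, \delta(y_i, y_j)\big)

% Ordered labels: penalty grows with |y_i - y_j|, truncated at \tau
\psi_{ij}(y_i, y_j) = \exp\big(-\lambda\, \min(|y_i - y_j|,\, \tau)\big)
```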

Page 15: Software toolkits for machine learning and graphical models

Page 16: Software toolkits for machine learning and graphical models

Learning CRFs

• If the log likelihood is the standard log-linear objective (a sum over cliques, with tied parameters), then the gradient takes the form:

Gradient = features – expected features
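The slide's equations are missing from the transcript; for a log-linear CRF with parameters tied across cliques, the standard objective and gradient (matching “gradient = features – expected features”) are:

```latex
\ell(\theta) = \sum_{n} \Big[ \sum_{c \in \mathcal{C}} \theta^{\top} f_c(y_{n,c}, x_n) \;-\; \log Z(x_n; \theta) \Big]

\nabla \ell(\theta) = \sum_{n} \sum_{c \in \mathcal{C}}
\Big[ f_c(y_{n,c}, x_n) \;-\; \mathbb{E}_{p(y_c \mid x_n; \theta)}\, f_c(y_c, x_n) \Big]
```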

Page 17: Software toolkits for machine learning and graphical models

Learning CRFs

• Given the gradient, one can find the global optimum using first or second order optimization methods, such as
– Conjugate gradient
– Limited memory BFGS
– Stochastic meta descent (SMD)?

• The bottleneck is computing the expected features needed for the gradient
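A minimal, runnable sketch of this recipe for the iid special case (the softmax regression of page 5), using SciPy's limited-memory BFGS; for a real CRF the only change is that the expected features in the gradient come from running (approximate) inference. The helper name and toy data are illustrative, not from any of the packages mentioned later.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik_and_grad(w_flat, X, y, C):
    """Negative conditional log-likelihood and gradient for softmax
    (multinomial logistic) regression -- the iid special case of a CRF."""
    N, D = X.shape
    W = w_flat.reshape(C, D)
    scores = X @ W.T                                   # N x C
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(N), y]).sum()
    # gradient of the *negative* log-likelihood:
    # expected features minus observed features
    delta = probs.copy()
    delta[np.arange(N), y] -= 1.0
    grad = delta.T @ X                                 # C x D
    return nll, grad.ravel()

# toy problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
res = minimize(neg_log_lik_and_grad, np.zeros(2 * 5), args=(X, y, 2),
               jac=True, method="L-BFGS-B")
W_hat = res.x.reshape(2, 5)
```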

Page 18: Software toolkits for machine learning and graphical models

Exact inference

• For 1D chains, one can compute P(yi, yi+1 | x) exactly in O(N K^2) time using belief propagation (BP = forwards-backwards algorithm)

• For restricted potentials (eg ψij(yi, yj) = ψ(yi − yj)), one can do this in O(N K) time using FFT-like tricks

• This can be generalized to trees.
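A minimal numpy sketch of sum-product (forwards-backwards) on a chain, computing the pairwise marginals P(yi, yi+1 | x) in O(N K^2) time; the local-evidence and edge-potential arrays are illustrative inputs, not the API of any toolbox mentioned here.

```python
import numpy as np

def chain_pairwise_marginals(phi, psi):
    """phi: (N, K) local evidence phi_i(y_i); psi: (K, K) shared edge potential.
    Returns an (N-1, K, K) array of pairwise marginals P(y_i, y_{i+1} | x)."""
    N, K = phi.shape
    alpha = np.zeros((N, K))                  # rescaled forward messages
    beta = np.ones((N, K))                    # rescaled backward messages
    alpha[0] = phi[0] / phi[0].sum()
    for i in range(1, N):
        a = phi[i] * (alpha[i - 1] @ psi)     # sum over y_{i-1}
        alpha[i] = a / a.sum()                # rescale to avoid underflow
    for i in range(N - 2, -1, -1):
        b = psi @ (phi[i + 1] * beta[i + 1])  # sum over y_{i+1}
        beta[i] = b / b.sum()
    pair = np.empty((N - 1, K, K))
    for i in range(N - 1):
        m = alpha[i][:, None] * psi * (phi[i + 1] * beta[i + 1])[None, :]
        pair[i] = m / m.sum()                 # normalize each pairwise table
    return pair

# usage: 5-node chain, 3 states, Potts-style edge potential
phi = np.random.rand(5, 3) + 0.1
psi = np.exp(2.0 * np.eye(3))
marginals = chain_pairwise_marginals(phi, psi)
```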

Page 19: Software toolkits for machine learning and graphical models

Sum-product vs max-product

• We use sum-product to compute marginal probabilities needed for learning

• We use max-product to find the most probable assignment (Viterbi decoding)

• We can also compute max-marginals

Page 20: Software toolkits for machine learning and graphical models

Complexity of exact inference

In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering).
For chains and trees, w = 2.
For n × n lattices, w = O(n).

Page 21: Software toolkits for machine learning and graphical models

Approximate sum-product    (N = num nodes, K = num states, I = num iterations)

Algorithm                  Potential (pairwise)   Time
BP (exact iff tree)        General                O(N K^2 I)
BP+FFT (exact iff tree)    Restricted             O(N K I)
Generalized BP             General                O(N K^(2c) I), c = cluster size
Gibbs                      General                O(N K I)
Swendsen-Wang              General                O(N K I)
Mean field                 General                O(N K I)

Page 22: Software toolkits for machine learning and graphical models

Approximate max-product    (N = num nodes, K = num states, I = num iterations)

Algorithm                          Potential (pairwise)   Time
BP (exact iff tree)                General                O(N K^2 I)
BP+DT (exact iff tree)             Restricted             O(N K I)
Generalized BP                     General                O(N K^(2c) I), c = cluster size
Graph-cuts (exact iff K=2)         Restricted             O(N^2 K I) [?]
ICM (iterated conditional modes)   General                O(N K I)
SLS (stochastic local search)      General                O(N K I)

Page 23: Software toolkits for machine learning and graphical models

Learning intractable CRFs

• We can use approximate inference and hope the gradient is “good enough”.
– If we use max-product, we are doing “Viterbi training” (cf perceptron rule)

• Or we can use other techniques, such as pseudo-likelihood, which do not need inference.

Page 24: Software toolkits for machine learning and graphical models

Pseudo-likelihood
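The equations for this slide are not in the transcript; pseudo-likelihood is standardly defined as the product, over nodes, of each label's conditional given its neighbours, which avoids computing the partition function:

```latex
\mathrm{PL}(\theta) = \prod_{n} \prod_{i}
p\big(y_{n,i} \mid y_{n,\mathcal{N}(i)}, x_n; \theta\big)
```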

Page 25: Software toolkits for machine learning and graphical models

Software for inference and learning in 1D CRFs

• Various packages
– Mallet (McCallum et al) – Java
– Crf.sourceforge.net (Sarawagi, Cohen) – Java
– My code – matlab (just a toy, not integrated with BNT)
– Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning).

• Nothing standard, emphasis on NLP apps

Page 26: Software toolkits for machine learning and graphical models

Software for inference in general CRFs/ MRFs

• Max-product: C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al
– “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother

• Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference)

• Sum-product: various other ad hoc pieces
– My matlab BP code (MRF2)
– Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs)
– Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)

Page 27: Software toolkits for machine learning and graphical models

Software for learning general MRFs/CRFs

• Hardly any!
– Parise’s matlab code (approx gradient, pseudo-likelihood, CD, etc)
– My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)

Page 28: Software toolkits for machine learning and graphical models

Structure of ideal toolbox

[Block diagram of the ideal toolbox: a generator/GUI/file produces trainData and testData; a learnEngine (train) uses an infEngine to fit a model; the model plus an infEngine answers queries (infer), producing a probDist or N-best list, which feeds visualize/summarize, performance evaluation, and a decisionEngine (decide) that combines it with utilities to output a decision.]

Page 29: Software toolkits for machine learning and graphical models

Structure of BNT

[The same block diagram annotated with BNT's implementations; annotations include: BP / Jtree / VarElim / MCMC (inference engines), EM / Structural EM (learning), graphs + CPDs (models), cell arrays (data), node ids (queries), arrays / Gaussians / samples with N=1 (MAP) (probDist), LIMID with Jtree / VarElim and a policy output (decision making), and the contributor names LeRay and Shan.]

Page 30: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 31: Software toolkits for machine learning and graphical models

Unsupervised learning: why?

• Labeling data is time-consuming.

• Often not clear what label to use.

• Complex objects often not describable with a single discrete label.

• Humans learn without labels.

• Want to discover novel patterns/ structure.

Page 32: Software toolkits for machine learning and graphical models

Unsupervised learning: what?

• Clusters (eg GMM)

• Low dim manifolds (eg PCA)

• Graph structure (eg biology, social networks)

• “Features” (eg maxent models of language and texture)

• “Objects” (eg sprite models in vision)

Page 33: Software toolkits for machine learning and graphical models

Unsupervised learning of objects from video

Frey and Jojic; Williams and Titsias; et al.

Page 34: Software toolkits for machine learning and graphical models

Unsupervised learning: issues

• Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression).

• Local minima (non convex objective).

• Uses inference as subroutine (can be slow – no worse than discriminative learning)
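For concreteness (a standard formulation, not from the slide): with latent variables z_n, the maximum-likelihood objective is

```latex
\ell(\theta) = \sum_{n} \log \sum_{z_n} p(x_n, z_n \mid \theta)
```

The sum inside the log is what makes the objective non-convex, and evaluating it (or its gradient, via expectations over z_n) is where inference enters as a subroutine.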

Page 35: Software toolkits for machine learning and graphical models

Unsupervised learning: how?

• Construct a generative model (eg a Bayes net).

• Perform inference.

• May have to use approximations such as maximum likelihood and BP.

• Cannot use max likelihood for model selection…

Page 36: Software toolkits for machine learning and graphical models

A comparison of BN software

www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html

Page 37: Software toolkits for machine learning and graphical models

Popular BN software

• BNT (matlab)

• Intel’s PNL (C++)

• Hugin (commercial)

• Netica (commercial)

• GMTk (free .exe from Jeff Bilmes)

Page 38: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 39: Software toolkits for machine learning and graphical models

Bayesian inference: why?

• It is optimal.

• It can easily incorporate prior knowledge (esp. useful for small n, large p problems).

• It properly reports confidence in output (useful for combining estimates, and for risk-averse applications).

• It separates models from algorithms.

Page 40: Software toolkits for machine learning and graphical models

Bayesian inference: how?

• Since we want to integrate, we cannot use max-product.

• Since the unknown parameters are continuous, we cannot use sum-product.

• But we can use EP (expectation propagation), which is similar to BP.

• We can also use variational inference.

• Or MCMC (eg Gibbs sampling).

Page 41: Software toolkits for machine learning and graphical models

General purpose Bayesian software

• BUGS (Gibbs sampling)

• VIBES (variational message passing)

• Minka and Winn’s toolbox (infer.net)

Page 42: Software toolkits for machine learning and graphical models

Structure of ideal Bayesian toolbox

[Block diagram of the ideal Bayesian toolbox — the same structure as the ideal toolbox on page 28 (generator/GUI/file, trainData/testData, learnEngine, infEngine, model, queries, probDist, visualize/summarize, performance, decisionEngine, utilities, decision).]