Page 1: Software toolkits for machine learning and graphical models

Graphical model software for machine learning

Kevin Murphy

University of British Columbia

December, 2005

Page 2: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 3: Software toolkits for machine learning and graphical models

Supervised learning as Bayesian inference

[Figure: plate-style diagram with training pairs (X1, Y1), …, (XN, YN) — also drawn as a plate over (Xn, Yn) — and a test pair (X*, Y*); panels labelled “Training” and “Testing”.]

Page 4: Software toolkits for machine learning and graphical models

Supervised learning as optimization

[Figure: the same training/testing diagram, shown again for the point-estimate (optimization) view.]
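A compact way to state the contrast between these two slides (standard equations, not reproduced from the original slides): the Bayesian view integrates over the unknown parameters, while the optimization view plugs in a single point estimate.

```latex
% Bayesian view: average predictions over the parameter posterior
p(y_* \mid x_*, D) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid D)\, d\theta

% Optimization view: fit a point estimate (e.g. penalized maximum likelihood / MAP)
\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) + \log p(\theta),
\qquad p(y_* \mid x_*, D) \approx p(y_* \mid x_*, \hat{\theta})
```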

Page 5: Software toolkits for machine learning and graphical models

Example: logistic regression

• Let yn ∈ {1,…,C} be given by a softmax

• Maximize conditional log likelihood

• “Max margin” solution
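The softmax model and objective referred to above, filled in with standard definitions because the slide's equations are not in the transcript:

```latex
p(y_n = c \mid x_n, W) = \frac{\exp(w_c^{\top} x_n)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} x_n)},
\qquad
\ell(W) = \sum_{n=1}^{N} \log p(y_n \mid x_n, W)
```

The “max margin” alternative replaces this log-loss with a hinge-type loss, as in multiclass SVMs.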

Page 6: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 7: Software toolkits for machine learning and graphical models

1D chain CRFs for sequence labeling

[Figure: chain-structured CRF with label nodes Yn1, Yn2, …, Ynm and observed input Xn.]

A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, yn ∈ {1,…,C}^m

p(yn | xn) ∝ ∏i φi(yn,i, xn) · ∏i ψi,i+1(yn,i, yn,i+1)    (φi: local evidence; ψij: edge potential)

Page 8: Software toolkits for machine learning and graphical models

2D Lattice CRFs for pixel labeling

A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψij are image dependent.

Page 9: Software toolkits for machine learning and graphical models

2D Lattice MRFs for pixel labeling

A Markov Random Field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using ψij(yi, yj). We also have a per-pixel generative model of observations P(xi|yi).

P(y, x) = (1/Z) ∏ij ψij(yi, yj) ∏i P(xi|yi)    (P(xi|yi): local evidence; ψij: potential function; Z: partition function)

Page 10: Software toolkits for machine learning and graphical models

Tree-structured CRFs

• Used in parts-based object detection

• Yi is location of part i in image

[Figure: tree-structured model over part locations: eyeL, eyeR, nose, mouth.]

Fischler & Elschlager, “The representation and matching of pictorial structures”, PAMI’73
Felzenszwalb & Huttenlocher, “Pictorial Structures for Object Recognition”, IJCV’05

Page 11: Software toolkits for machine learning and graphical models

General CRFs

• In general, the graph may have arbitrary structure

• eg for collective web page classification, nodes = urls, edges = hyperlinks

• The potentials are in general defined on cliques, not just edges

Page 12: Software toolkits for machine learning and graphical models

Factor graphs

• Square nodes = factors (potentials)
• Round nodes = random variables
• Graph structure = bipartite

Page 13: Software toolkits for machine learning and graphical models

Potential functions

• For the local evidence, we can use a discriminative classifier (trained iid)

• For the edge compatibilities, we can use a maxent/ loglinear form, using pre-defined features
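A sketch of the maxent / log-linear edge potential described in the second bullet, with generic pre-defined features f_k (illustrative notation, not taken from the slide):

```latex
\psi_{ij}(y_i, y_j \mid x) = \exp\Big( \sum_{k} \lambda_k\, f_k(y_i, y_j, x) \Big)
```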

Page 14: Software toolkits for machine learning and graphical models

Restricted potential functions

• For some applications (esp in vision), we often use a Potts model of the form

• We can generalize this for ordered labels (eg discretization of continuous states)
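The Potts form itself is not legible in the transcript; a standard version (an assumption, not a verbatim reconstruction), together with a common generalization to ordered labels via a truncated linear penalty, is:

```latex
% Potts model: potential depends only on whether the labels agree
\psi_{ij}(y_i, y_j) = \exp\big(\lambda\, \delta(y_i, y_j)\big)

% Ordered labels: penalty grows with |y_i - y_j|, truncated at \tau
\psi_{ij}(y_i, y_j) = \exp\big(-\lambda\, \min(|y_i - y_j|,\, \tau)\big)
```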

Page 15: Software toolkits for machine learning and graphical models

Page 16: Software toolkits for machine learning and graphical models

Learning CRFs

• If the log likelihood is the standard log-linear objective (a sum over cliques, with tied parameters), then the gradient takes the form:

Gradient = features – expected features
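The slide's equations are missing from the transcript; for a log-linear CRF with parameters tied across cliques, the standard objective and gradient (matching “gradient = features – expected features”) are:

```latex
\ell(\theta) = \sum_{n} \Big[ \sum_{c \in \mathcal{C}} \theta^{\top} f_c(y_{n,c}, x_n) \;-\; \log Z(x_n; \theta) \Big]

\nabla \ell(\theta) = \sum_{n} \sum_{c \in \mathcal{C}}
\Big[ f_c(y_{n,c}, x_n) \;-\; \mathbb{E}_{p(y_c \mid x_n; \theta)}\, f_c(y_c, x_n) \Big]
```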

Page 17: Software toolkits for machine learning and graphical models

Learning CRFs

• Given the gradient, one can find the global optimum using first or second order optimization methods, such as
– Conjugate gradient
– Limited memory BFGS
– Stochastic meta descent (SMD)?

• The bottleneck is computing the expected features needed for the gradient
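A minimal, runnable sketch of this recipe for the iid special case (the softmax regression of page 5), using SciPy's limited-memory BFGS; for a real CRF the only change is that the expected features in the gradient come from running (approximate) inference. The helper name and toy data are illustrative, not from any of the packages mentioned later.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik_and_grad(w_flat, X, y, C):
    """Negative conditional log-likelihood and gradient for softmax
    (multinomial logistic) regression -- the iid special case of a CRF."""
    N, D = X.shape
    W = w_flat.reshape(C, D)
    scores = X @ W.T                                   # N x C
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(N), y]).sum()
    # gradient of the *negative* log-likelihood:
    # expected features minus observed features
    delta = probs.copy()
    delta[np.arange(N), y] -= 1.0
    grad = delta.T @ X                                 # C x D
    return nll, grad.ravel()

# toy problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
res = minimize(neg_log_lik_and_grad, np.zeros(2 * 5), args=(X, y, 2),
               jac=True, method="L-BFGS-B")
W_hat = res.x.reshape(2, 5)
```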

Page 18: Software toolkits for machine learning and graphical models

Exact inference

• For 1D chains, one can compute P(yi, yi+1 | x) exactly in O(N K^2) time using belief propagation (BP = forwards-backwards algorithm)

• For restricted potentials (eg ψij(yi, yj) = ψ(yi − yj)), one can do this in O(N K) time using FFT-like tricks

• This can be generalized to trees.
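A minimal numpy sketch of sum-product (forwards-backwards) on a chain, computing the pairwise marginals P(yi, yi+1 | x) in O(N K^2) time; the local-evidence and edge-potential arrays are illustrative inputs, not the API of any toolbox mentioned here.

```python
import numpy as np

def chain_pairwise_marginals(phi, psi):
    """phi: (N, K) local evidence phi_i(y_i); psi: (K, K) shared edge potential.
    Returns an (N-1, K, K) array of pairwise marginals P(y_i, y_{i+1} | x)."""
    N, K = phi.shape
    alpha = np.zeros((N, K))                  # rescaled forward messages
    beta = np.ones((N, K))                    # rescaled backward messages
    alpha[0] = phi[0] / phi[0].sum()
    for i in range(1, N):
        a = phi[i] * (alpha[i - 1] @ psi)     # sum over y_{i-1}
        alpha[i] = a / a.sum()                # rescale to avoid underflow
    for i in range(N - 2, -1, -1):
        b = psi @ (phi[i + 1] * beta[i + 1])  # sum over y_{i+1}
        beta[i] = b / b.sum()
    pair = np.empty((N - 1, K, K))
    for i in range(N - 1):
        m = alpha[i][:, None] * psi * (phi[i + 1] * beta[i + 1])[None, :]
        pair[i] = m / m.sum()                 # normalize each pairwise table
    return pair

# usage: 5-node chain, 3 states, Potts-style edge potential
phi = np.random.rand(5, 3) + 0.1
psi = np.exp(2.0 * np.eye(3))
marginals = chain_pairwise_marginals(phi, psi)
```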

Page 19: Software toolkits for machine learning and graphical models

Sum-product vs max-product

• We use sum-product to compute marginal probabilities needed for learning

• We use max-product to find the most probable assignment (Viterbi decoding)

• We can also compute max-marginals

Page 20: Software toolkits for machine learning and graphical models

Complexity of exact inference

In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering).
For chains and trees, w = 2.
For n × n lattices, w = O(n).

Page 21: Software toolkits for machine learning and graphical models

Approximate sum-product    (N = num nodes, K = num states, I = num iterations)

Algorithm                  Potential (pairwise)   Time
BP (exact iff tree)        General                O(N K^2 I)
BP+FFT (exact iff tree)    Restricted             O(N K I)
Generalized BP             General                O(N K^(2c) I), c = cluster size
Gibbs                      General                O(N K I)
Swendsen-Wang              General                O(N K I)
Mean field                 General                O(N K I)

Page 22: Software toolkits for machine learning and graphical models

Approximate max-product    (N = num nodes, K = num states, I = num iterations)

Algorithm                          Potential (pairwise)   Time
BP (exact iff tree)                General                O(N K^2 I)
BP+DT (exact iff tree)             Restricted             O(N K I)
Generalized BP                     General                O(N K^(2c) I), c = cluster size
Graph-cuts (exact iff K=2)         Restricted             O(N^2 K I) [?]
ICM (iterated conditional modes)   General                O(N K I)
SLS (stochastic local search)      General                O(N K I)

Page 23: Software toolkits for machine learning and graphical models

Learning intractable CRFs

• We can use approximate inference and hope the gradient is “good enough”.
– If we use max-product, we are doing “Viterbi training” (cf perceptron rule)

• Or we can use other techniques, such as pseudo-likelihood, which do not need inference.

Page 24: Software toolkits for machine learning and graphical models

Pseudo-likelihood
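The equations for this slide are not in the transcript; pseudo-likelihood is standardly defined as the product, over nodes, of each label's conditional given its neighbours, which avoids computing the partition function:

```latex
\mathrm{PL}(\theta) = \prod_{n} \prod_{i}
p\big(y_{n,i} \mid y_{n,\mathcal{N}(i)}, x_n; \theta\big)
```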

Page 25: Software toolkits for machine learning and graphical models

Software for inference and learning in 1D CRFs

• Various packages
– Mallet (McCallum et al) – Java
– Crf.sourceforge.net (Sarawagi, Cohen) – Java
– My code – matlab (just a toy, not integrated with BNT)
– Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning).

• Nothing standard, emphasis on NLP apps

Page 26: Software toolkits for machine learning and graphical models

Software for inference in general CRFs/ MRFs

• Max-product: C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al
– “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother

• Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference)

• Sum-product: various other ad hoc pieces
– My matlab BP code (MRF2)
– Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs)
– Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)

Page 27: Software toolkits for machine learning and graphical models

Software for learning general MRFs/CRFs

• Hardly any!
– Parise’s matlab code (approx gradient, pseudo-likelihood, CD, etc)
– My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)

Page 28: Software toolkits for machine learning and graphical models

Structure of ideal toolbox

[Block diagram of the ideal toolbox: a generator/GUI/file produces trainData and testData; a learnEngine (train) uses an infEngine to fit a model; the model plus an infEngine answers queries (infer), producing a probDist or N-best list, which feeds visualize/summarize, performance evaluation, and a decisionEngine (decide) that combines it with utilities to output a decision.]

Page 29: Software toolkits for machine learning and graphical models

Structure of BNT

[The same block diagram annotated with BNT's implementations; annotations include: BP / Jtree / VarElim / MCMC (inference engines), EM / Structural EM (learning), graphs + CPDs (models), cell arrays (data), node ids (queries), arrays / Gaussians / samples with N=1 (MAP) (probDist), LIMID with Jtree / VarElim and a policy output (decision making), and the contributor names LeRay and Shan.]

Page 30: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 31: Software toolkits for machine learning and graphical models

Unsupervised learning: why?

• Labeling data is time-consuming.

• Often not clear what label to use.

• Complex objects often not describable with a single discrete label.

• Humans learn without labels.

• Want to discover novel patterns/ structure.

Page 32: Software toolkits for machine learning and graphical models

Unsupervised learning: what?

• Clusters (eg GMM)

• Low dim manifolds (eg PCA)

• Graph structure (eg biology, social networks)

• “Features” (eg maxent models of language and texture)

• “Objects” (eg sprite models in vision)

Page 33: Software toolkits for machine learning and graphical models

Unsupervised learning of objects from video

Frey and Jojic; Williams and Titsias; et al.

Page 34: Software toolkits for machine learning and graphical models

Unsupervised learning: issues

• Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression).

• Local minima (non convex objective).

• Uses inference as subroutine (can be slow – no worse than discriminative learning)
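For concreteness (a standard formulation, not from the slide): with latent variables z_n, the maximum-likelihood objective is

```latex
\ell(\theta) = \sum_{n} \log \sum_{z_n} p(x_n, z_n \mid \theta)
```

The sum inside the log is what makes the objective non-convex, and evaluating it (or its gradient, via expectations over z_n) is where inference enters as a subroutine.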

Page 35: Software toolkits for machine learning and graphical models

Unsupervised learning: how?

• Construct a generative model (eg a Bayes net).

• Perform inference.

• May have to use approximations such as maximum likelihood and BP.

• Cannot use max likelihood for model selection…

Page 36: Software toolkits for machine learning and graphical models

A comparison of BN software

www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html

Page 37: Software toolkits for machine learning and graphical models

Popular BN software

• BNT (matlab)

• Intel’s PNL (C++)

• Hugin (commercial)

• Netica (commercial)

• GMTk (free .exe from Jeff Bilmes)

Page 38: Software toolkits for machine learning and graphical models

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 39: Software toolkits for machine learning and graphical models

Bayesian inference: why?

• It is optimal.

• It can easily incorporate prior knowledge (esp. useful for small n, large p problems).

• It properly reports confidence in output (useful for combining estimates, and for risk-averse applications).

• It separates models from algorithms.

Page 40: Software toolkits for machine learning and graphical models

Bayesian inference: how?

• Since we want to integrate, we cannot use max-product.

• Since the unknown parameters are continuous, we cannot use sum-product.

• But we can use EP (expectation propagation), which is similar to BP.

• We can also use variational inference.

• Or MCMC (eg Gibbs sampling).

Page 41: Software toolkits for machine learning and graphical models

General purpose Bayesian software

• BUGS (Gibbs sampling)

• VIBES (variational message passing)

• Minka and Winn’s toolbox (infer.net)

Page 42: Software toolkits for machine learning and graphical models

Structure of ideal Bayesian toolbox

[Block diagram of the ideal Bayesian toolbox — the same structure as the ideal toolbox on page 28 (generator/GUI/file, trainData/testData, learnEngine, infEngine, model, queries, probDist, visualize/summarize, performance, decisionEngine, utilities, decision).]