Diffusion trees as priors
David A. Knowles, University of Cambridge
April 20, 2012
Motivation
- True hierarchies
- Parameter tying
- Visualisation and interpretability
[Figure: an example learnt hierarchy over 33 animal species (rhino, iguana, alligator, ..., camel).]
Setup: Pitman-Yor diffusion tree
- Generalisation of the Dirichlet Diffusion Tree (Neal, 2001)
- A top-down generative model for trees over $N$ data points $x_1, x_2, \dots, x_N \in \mathbb{R}^D$
- Points start at "time" $t = 0$ and follow Brownian diffusion in a $D$-dimensional Euclidean space until $t = 1$, where they are observed
- The model-based approach allows uncertainty over trees to be quantified, and integration into larger models
[Figure: sequential construction of a sample. Each new point (1, 2, 3, 4) diffuses from $t = 0$ along existing paths, possibly diverging at or between existing branch points ($a$, $b$) before being observed at $t = 1$.]
Branching probability
At a branch point,

$$P(\text{following branch } k) = \frac{n_k - \alpha}{m + \theta}, \qquad P(\text{diverging}) = \frac{\theta + \alpha K}{m + \theta},$$

where

- $n_k$: number of samples which previously took branch $k$
- $K$: current number of branches from this branch point
- $m = \sum_{k=1}^{K} n_k$: number of samples which previously took the current path
- $\theta, \alpha$ are hyperparameters
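As a concrete illustration, here is a minimal sketch of sampling the decision a new point makes at a branch point; this is not code from the paper, and the helper name and representation are assumptions:

```python
import numpy as np

def sample_branch(counts, theta, alpha, rng=np.random.default_rng()):
    """Sample the choice a new point makes at a PYDT branch point.

    counts : n_k, the number of previous samples down each existing branch.
    Returns the index of the branch followed, or -1 to diverge (new branch).
    """
    n = np.asarray(counts, dtype=float)
    K, m = len(n), n.sum()
    p = np.append((n - alpha) / (m + theta),          # follow branch k
                  (theta + alpha * K) / (m + theta))  # create a new branch
    choice = rng.choice(K + 1, p=p)
    return -1 if choice == K else choice
```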
Probability of diverging

To maintain exchangeability, the probability of diverging becomes

$$P(\text{diverging in } [t, t + dt]) = \frac{a(t)\,\Gamma(m - \alpha)\,dt}{\Gamma(m + 1 + \theta)},$$

where we use $a(t) = c/(1 - t)$. Note that $\int_0^1 a(t)\,dt = \infty$ gives divergence before $t = 1$ almost surely, and therefore a continuous distribution on $x$.
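Because $A(t) = \int_0^t a(u)\,du = -c \log(1 - t)$ is available in closed form, the divergence time along an edge can be sampled by inverting the survival function. A minimal sketch under the $a(t) = c/(1 - t)$ assumption above (the function name is hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def sample_divergence_time(t0, m, c, theta, alpha, rng=np.random.default_rng()):
    """Sample when a new point diverges from an edge entered at time t0,
    given that m previous points traversed the edge and a(t) = c / (1 - t).

    Inverts P(no divergence by t) = exp(-(A(t) - A(t0)) * h), where
    h = Gamma(m - alpha) / Gamma(m + 1 + theta) and A(t) = -c log(1 - t).
    """
    h = np.exp(gammaln(m - alpha) - gammaln(m + 1 + theta))
    u = rng.uniform()
    return 1.0 - (1.0 - t0) * u ** (1.0 / (c * h))
```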
[Plot: the rate multiplier $\Gamma(m)/\Gamma(m + 1 + \theta)$ against $\theta$, for $m = 1, 2, 3, 4$.]
Example draws in $\mathbb{R}^2$

[Figure: example draws. (a) $c = 1, \theta = 0, \alpha = 0$ (DDT); (b) $c = 1, \theta = 0.5, \alpha = 0$; (c) $c = 1, \theta = 1, \alpha = 0$; (d) $c = 3, \theta = 1.5, \alpha = 0$.]
Lemma
The probability of generating a specific tree structure, divergence times, divergence locations and corresponding data set is invariant to the ordering of data points.

[Figure: two orderings of the same data yield the same probability.]

Proof.
Probability of the tree structure:

$$\prod_{[ab] \in \text{internal edges}} \frac{\prod_{k=3}^{K_b} [\theta + (k - 1)\alpha] \; \prod_{l=1}^{K_b} \Gamma(n_l^b - \alpha)}{\Gamma(m(b) + \theta)\,\Gamma(1 - \alpha)^{K_b - 1}} \qquad (1)$$

Probability of the divergence times:

$$\prod_{[ab] \in \text{internal edges}} a(t_b) \exp\left[ (A(t_a) - A(t_b))\, H^{\theta,\alpha}_{m(b)-1} \right]$$

where we define $H^{\theta,\alpha}_n = \sum_{i=1}^{n} \frac{\Gamma(i - \alpha)}{\Gamma(i + 1 + \theta)}$.

Probability of the node locations:

$$\prod_{[ab] \in \text{edges}} N(x_b; x_a, \sigma^2 (t_b - t_a) I)$$

None of these depend on the order of the data points!
Proposition
The Pitman-Yor Diffusion Tree defines an infinitely exchangeable distribution over data points.

Proof.
Summing over all possible tree structures, and integrating over all branch point times and locations, by Lemma 1 we have infinite exchangeability.

Corollary
There exists a prior $\nu$ on probability measures on $\mathbb{R}^D$ such that the samples $x_1, x_2, \dots$ generated by a PYDT are conditionally independent and identically distributed (iid) according to $F \sim \nu$; that is, we can represent the PYDT as

$$P_{\text{PYDT}}(x_1, x_2, \dots) = \int \left( \prod_i F(x_i) \right) d\nu(F).$$

Proof.
Since the PYDT defines an infinitely exchangeable process on data points, the result follows directly from de Finetti's theorem.
Comparing to the DPM
It is difficult for the DPM to model fine structure: it has to choosebetween using many small clusters whose parameters will bedifficult to fit, or large clusters that would oversmooth the data.
[Figure: data modelled by (e) the PYDT and (f) the DPM.]
Parameter ranges
There are several valid ranges of the parameters $(\theta, \alpha)$:

- $0 \le \alpha < 1$ and $\theta > -2\alpha$: the general multifurcating case with arbitrary branching degree.
- $\alpha < 0$ and $\theta = -\kappa\alpha$, where $\kappa \in \{3, 4, \dots\}$ is the maximum outdegree of a node.
- $\alpha < 1$ and $\theta = -2\alpha$: binary branching, and specifically the DDT for $\alpha = \theta = 0$. A parameterised family of priors proposed by MacKay and Broderick (2007).
- $\alpha = 1$ gives instantaneous divergence, so the data points are independent.
Effect of varying θ
Fix $\alpha = 0$. Large $\theta$ gives flat clusterings; small $\theta$ gives hierarchical clusterings.

[Figure: trees sampled as $\theta$ increases, moving from hierarchical to flat structure.]
Tree balance

Binary branching parameter range: $\alpha < 1$ and $\theta = -2\alpha$. The probability of going left is

$$\frac{n_l - \alpha}{n_l + n_r - 2\alpha} \qquad (2)$$

This reinforcement is equivalent to hypothesising a per-node "probability of going left" with prior

$$p \sim \text{Beta}(-\alpha, -\alpha) \qquad (3)$$

Conditioning on the previous data points,

$$p \mid n_l, n_r \sim \text{Beta}(n_l - \alpha, n_r - \alpha) \qquad (4)$$

Thus marginalising out $p$ gives (2). For $\alpha$ close to 1, $p$ will be close to 0 or 1, so the tree will be very unbalanced. For $\alpha \to -\infty$, $p$ will be close to $\frac{1}{2}$, giving balanced trees.
Tree balance

A measure of tree imbalance is Colless's $I$ (Colless, 1982):

$$I = \frac{2}{(n - 1)(n - 2)} \sum_{a \in T} |l(a) - r(a)| \qquad (5)$$

Another is the normalised number of unbalanced nodes in a tree, $J$ (Rogers, 1996), i.e.

$$J = \frac{1}{n - 2} \sum_{a \in T} (1 - \mathbb{I}[l(a) = r(a)]) \qquad (6)$$

[Plots: Colless's index of imbalance and the proportion of unbalanced nodes, both against $\alpha$.]
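For reference, a minimal sketch of computing Colless's $I$ (Eqn 5) for a binary tree; representing trees as nested 2-tuples is an assumption made purely for illustration:

```python
def _leaves_and_imbalance(tree):
    """Return (number of leaves, sum over internal nodes a of |l(a) - r(a)|)."""
    if not isinstance(tree, tuple):
        return 1, 0  # a leaf
    nl, sl = _leaves_and_imbalance(tree[0])
    nr, sr = _leaves_and_imbalance(tree[1])
    return nl + nr, sl + sr + abs(nl - nr)

def colless_index(tree):
    """Colless's normalised imbalance I for a binary tree with n >= 3 leaves."""
    n, s = _leaves_and_imbalance(tree)
    return 2.0 * s / ((n - 1) * (n - 2))

# A maximally unbalanced "caterpillar" tree on 4 leaves has I = 1:
print(colless_index(((("a", "b"), "c"), "d")))  # 1.0
```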
Generalises the Dirichlet diffusion tree

$\theta = \alpha = 0$ recovers the DDT of Neal (2001). The probability of diverging off a branch is

$$\frac{a(t)\,\Gamma(m - 0)\,dt}{\Gamma(m + 1 + 0)} = \frac{a(t)\,(m - 1)!\,dt}{m!} = \frac{a(t)\,dt}{m}, \qquad (7)$$

The probability of following a branch at an existing branch point is proportional to the number of previous data points having followed that branch:

$$\frac{\prod_{l=1}^{K_b = 2} \Gamma(n_l^b - 0)}{\Gamma(m(b) + 0)} = \frac{(n_1^b - 1)!\,(n_2^b - 1)!}{(m(b) - 1)!}, \qquad (8)$$
Nested CRP
- Distribution over hierarchical partitions
- Denote the $K$ blocks in the first level as $\{B^1_k : k = 1, \dots, K\}$
- Partition these blocks with independent CRPs
- Denote the partitioning of $B^1_k$ as $\{B^2_{kl} : l = 1, \dots, K_k\}$
- Recurse for $S$ iterations, forming an $S$-deep hierarchy

[Figure: a nested partition of $\{1, \dots, 7\}$ into first-level blocks $B^1_1, B^1_2$ and second-level blocks $B^2_{11}, B^2_{12}, B^2_{21}$.]
Nested CRP
[Figure: a draw from an $S = 10$-level nested Chinese restaurant process with 15 leaves.]
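A minimal sketch of drawing such a hierarchy, assuming a constant concentration parameter at every level (all names here are illustrative, not from the paper):

```python
import numpy as np

def crp_partition(items, conc, rng):
    """Partition `items` with a Chinese restaurant process of concentration `conc`."""
    blocks = []
    for item in items:
        weights = np.array([len(b) for b in blocks] + [conc], dtype=float)
        k = rng.choice(len(blocks) + 1, p=weights / weights.sum())
        if k == len(blocks):
            blocks.append([])  # the item starts a new block
        blocks[k].append(item)
    return blocks

def nested_crp(items, conc, depth, rng=np.random.default_rng()):
    """Recursively partition `items` into a `depth`-level hierarchy of blocks."""
    if depth == 0 or len(items) <= 1:
        return items
    return [nested_crp(block, conc, depth - 1, rng)
            for block in crp_partition(items, conc, rng)]

tree = nested_crp(list(range(15)), conc=1.0, depth=10)
```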
Continuum limit of a nested CRP
Associate each level $s$ in an $S$-level nCRP with "time" $t_s = \frac{s - 1}{S} \in [0, 1)$, and let the concentration parameter at level $s$ be $a(t_s)/S$, where $a : [0, 1] \to \mathbb{R}^+$. Taking the limit $S \to \infty$ recovers the Dirichlet Diffusion Tree with divergence function $a(t)$.

[Figure: draws with $S = 5$, $S = 10$ and $S = 15$.]
Other properties of the PYDT
- Generalisation of a DP mixture of Gaussians (with a specific variance structure)
- The prior over tree structures is a multifurcating Gibbs fragmentation tree (McCullagh et al., 2008), the most general Gibbs-type, Markovian, exchangeable, consistent distribution over trees
Inference: MCMC
- Not straightforward to extend Neal's slice sampling moves because of atoms in the prior at existing branches
- Proposing new subtree locations from the prior is slow!
- Working on a Gibbs sampling algorithm using uniformisation ideas from Rao and Teh (2011) (with Vinayak Rao)
Inference: EM with greedy search
- Power EP or EM to calculate the marginal likelihood for a tree
- Use this to drive sequential tree building and search over tree structures
- Warm-start inference
- Use local evidence contributions to propose $L = 3$ "good" re-attachment locations

A straightforward extension of the algorithm we presented at ICML 2011 for the DDT (Knowles et al., 2011).
[Figure: the message passing factor graph, with $\mathbb{I}[0 < x < 1]$ constraint factors, prior factors on divergence times (Eqn 1) and Normal factors on locations (Eqn 2).]
Results: toy data
Figure: Optimal trees learnt by the greedy EM algorithm for the DDT and PYDT on a synthetic dataset with $D = 2$, $N = 100$.
Results: macaque skull measurements

$N_{\text{train}} = 200$, $N_{\text{test}} = 28$, $D = 10$ (Adams et al., 2008)
Figure: Density modelling of the $D = 10$, $N = 200$ macaque skull measurement dataset of Adams et al. (2008). Top: improvement in test predictive likelihood compared to a kernel density estimate. Bottom: marginal likelihood of the current tree. The shared x-axis is computation time in seconds.
Results: Animal species
- 33 animal species from Kemp and Tenenbaum (2008)
- 102-dimensional binary feature vectors relating to attributes (e.g. being warm-blooded, having two legs)
- Probit regression
Results: Animal species
Figure: Tree structure learnt for the animals dataset of Kemp and Tenenbaum (2008).
Other priors over tree structures used in ML
- Kingman's coalescent (KC) (Kingman, 1982; Teh et al., 2008). Points coalesce together rather than fragmenting as in the DDT/PYDT. KC is in a sense the dual process to the DDT, a fact used in Teh et al. (2011).
- A fixed number of generations and individuals per generation, where each child chooses its parent (Williams, 2000): a discretisation of KC.
- The nested CRP itself (Blei et al., 2010; Steinhardt and Ghahramani, 2012). How to choose when to stop?
- Tree-structured stick breaking (Adams et al., 2010). Extends the stick-breaking construction of the CRP to the nested CRP, and adds a per-node stopping probability.
Infinite Latent Attributes model for network data (with Konstantina Palla)
- Existing network models explain a "flat" clustering structure
- ILA has features that are partitioned into disjoint groups (subclusters)
- Generalises the IRM (Kemp and Tenenbaum, 2006), LFRM (Miller et al., 2009), and MAG (Kim and Leskovec, 2011)
- Excellent empirical performance in link prediction
Generative model:

$$Z \mid \alpha \sim \text{IBP}(\alpha)$$
$$c^{(m)} \mid \gamma \sim \text{CRP}(\gamma)$$
$$w^{(m)}_{kk'} \mid \sigma_w \sim N(0, \sigma_w^2)$$
$$\Pr(r_{ij} = 1 \mid Z, C, W) = \sigma\left( \sum_m z_{im} z_{jm} w^{(m)}_{c_{mi} c_{mj}} + s \right)$$
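A minimal sketch of the link probability above, assuming `Z` is a binary feature matrix, `c[m]` gives each node's subcluster within feature $m$, and `W[m]` is that feature's subcluster weight matrix ($\sigma$ is the logistic function; all names are illustrative):

```python
import numpy as np

def link_prob(i, j, Z, c, W, s=0.0):
    """P(r_ij = 1 | Z, c, W): sum the weights w^(m)[c_mi, c_mj] over features m
    active for both nodes i and j, then pass through the logistic function."""
    active = np.flatnonzero(Z[i] * Z[j])
    logit = s + sum(W[m][c[m][i], c[m][j]] for m in active)
    return 1.0 / (1.0 + np.exp(-logit))
```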
Gaussian Process Regression Networks (with Andrew Wilson)

- Multivariate heteroskedastic regression with covariate-dependent signal and noise correlations
- Tractability of Gaussian processes and multitask advantages of neural networks

$$W_{ij}(x) \sim \mathcal{GP}(0, k_w)$$
$$f_i(x) \sim \mathcal{GP}(0, k_f + \sigma_f^2 \delta)$$
$$y(x) \sim N(W(x) f(x), \sigma_y^2 I)$$

[Figure: network diagram with latent functions $f_1(x), f_2(x)$, weight functions $W_{11}(x), \dots, W_{32}(x)$ and outputs $y_1(x), y_2(x), y_3(x)$.]
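A minimal sketch of drawing from this prior at one-dimensional inputs, assuming squared-exponential kernels for both $k_w$ and $k_f$ (all names and settings are illustrative):

```python
import numpy as np

def rbf(x, ell=0.5, jitter=1e-8):
    """Squared-exponential kernel matrix on 1-D inputs x (jitter for stability)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2) + jitter * np.eye(len(x))

def sample_gprn(x, p=3, q=2, sigma_f=0.1, sigma_y=0.05,
                rng=np.random.default_rng(0)):
    """Draw y(x) = W(x) f(x) + noise from the GPRN prior at inputs x."""
    n = len(x)
    Kw = rbf(x)
    Kf = rbf(x) + sigma_f ** 2 * np.eye(n)
    W = rng.multivariate_normal(np.zeros(n), Kw, size=(p, q))  # shape (p, q, n)
    f = rng.multivariate_normal(np.zeros(n), Kf, size=q)       # shape (q, n)
    return np.einsum('pqn,qn->pn', W, f) + sigma_y * rng.standard_normal((p, n))

y = sample_gprn(np.linspace(0, 1, 50))  # shape (3, 50)
```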
Future/ongoing work
- Improved MCMC: uniformisation, slice sampling subtree locations
- Hierarchically structured states in an infinite HMM (e.g. for unsupervised part-of-speech tagging, modelling genetic variation)
- Topic modelling: a hierarchy over topic-specific distributions over words
- How to summarise posterior samples?
- Time-varying tree structures?
Bibliography

Adams, R., Murray, I., and MacKay, D. (2008). The Gaussian process density sampler. In Advances in Neural Information Processing Systems, volume 21. MIT Press.

Adams, R. P., Ghahramani, Z., and Jordan, M. I. (2010). Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems (NIPS) 23.

Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57.

Colless, D. (1982). Phylogenetics: the theory and practice of phylogenetic systematics. Systematic Zoology, 31(1):100–104.

Kemp, C. and Tenenbaum, J. B. (2006). Learning systems of concepts with an infinite relational model. In 21st National Conference on Artificial Intelligence.

Kemp, C. and Tenenbaum, J. B. (2008). The discovery of structural form. In Proceedings of the National Academy of Sciences, volume 105(31), pages 10687–10692.

Kim, M. and Leskovec, J. (2011). Modeling social networks with node attributes using the multiplicative attribute graph model. In UAI.

Kingman, J. (1982). The coalescent. Stochastic Processes and their Applications, 13(3):235–248.

Knowles, D. A., Van Gael, J., and Ghahramani, Z. (2011). Message passing algorithms for Dirichlet diffusion trees. In Proceedings of the 28th Annual International Conference on Machine Learning.

Knowles, D. A. and Ghahramani, Z. (2011). Pitman-Yor diffusion trees. In The 28th Conference on Uncertainty in Artificial Intelligence (to appear).

MacKay, D. and Broderick, T. (2007). Probabilities over trees: generalizations of the Dirichlet diffusion tree and Kingman's coalescent. Website.

McCullagh, P., Pitman, J., and Winkel, M. (2008). Gibbs fragmentation trees. Bernoulli, 14(4):988–1002.

Miller, K., Griffiths, T., and Jordan, M. (2009). Nonparametric latent feature models for link prediction. In NIPS.

Neal, R. M. (2001). Defining priors for distributions using Dirichlet diffusion trees. Technical Report 0104, Dept. of Statistics, University of Toronto.

Rao, V. and Teh, Y. W. (2011). Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence.

Rogers, J. S. (1996). Central moments and probability distributions of three measures of phylogenetic tree imbalance. Systematic Biology, 45(1):99–110.

Steinhardt, J. and Ghahramani, Z. (2012). Flexible martingale priors for deep hierarchies. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Teh, Y. W., Blundell, C., and Elliott, L. T. (2011). Modelling genetic variations with fragmentation-coagulation processes. In Advances in Neural Information Processing Systems (NIPS).

Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, 20.

Williams, C. (2000). A MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems, 13.