Diffusion trees as priors
David A. Knowles, University of Cambridge
April 20, 2012
Motivation
- True hierarchies
- Parameter tying
- Visualisation and interpretability
[Figure: an example learnt hierarchy over 33 animal species (rhino, iguana, alligator, ..., camel).]
Setup: Pitman-Yor diffusion tree
- Generalisation of the Dirichlet Diffusion Tree (Neal, 2001)
- A top-down generative model for trees over $N$ data points $x_1, x_2, \dots, x_N \in \mathbb{R}^D$
- Points start at "time" $t = 0$ and follow Brownian diffusion in a $D$-dimensional Euclidean space until $t = 1$, where they are observed
- The model-based approach allows uncertainty over trees to be quantified, and integration into larger models
[Figure: sequential construction of a sample. Each new point (1, 2, 3, 4) diffuses from $t = 0$ along existing paths, possibly diverging at or between existing branch points ($a$, $b$) before being observed at $t = 1$.]
Branching probability
At a branch point,

$$P(\text{following branch } k) = \frac{n_k - \alpha}{m + \theta}, \qquad P(\text{diverging}) = \frac{\theta + \alpha K}{m + \theta},$$

where

- $n_k$: number of samples which previously took branch $k$
- $K$: current number of branches from this branch point
- $m = \sum_{k=1}^{K} n_k$: number of samples which previously took the current path
- $\theta, \alpha$ are hyperparameters
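As a concrete illustration, here is a minimal sketch of sampling the decision a new point makes at a branch point; this is not code from the paper, and the helper name and representation are assumptions:

```python
import numpy as np

def sample_branch(counts, theta, alpha, rng=np.random.default_rng()):
    """Sample the choice a new point makes at a PYDT branch point.

    counts : n_k, the number of previous samples down each existing branch.
    Returns the index of the branch followed, or -1 to diverge (new branch).
    """
    n = np.asarray(counts, dtype=float)
    K, m = len(n), n.sum()
    p = np.append((n - alpha) / (m + theta),          # follow branch k
                  (theta + alpha * K) / (m + theta))  # create a new branch
    choice = rng.choice(K + 1, p=p)
    return -1 if choice == K else choice
```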
Probability of diverging

To maintain exchangeability, the probability of diverging becomes

$$P(\text{diverging in } [t, t + dt]) = \frac{a(t)\,\Gamma(m - \alpha)\,dt}{\Gamma(m + 1 + \theta)},$$

where we use $a(t) = c/(1 - t)$. Note that $\int_0^1 a(t)\,dt = \infty$ gives divergence before $t = 1$ almost surely, and therefore a continuous distribution on $x$.
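Because $A(t) = \int_0^t a(u)\,du = -c \log(1 - t)$ is available in closed form, the divergence time along an edge can be sampled by inverting the survival function. A minimal sketch under the $a(t) = c/(1 - t)$ assumption above (the function name is hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def sample_divergence_time(t0, m, c, theta, alpha, rng=np.random.default_rng()):
    """Sample when a new point diverges from an edge entered at time t0,
    given that m previous points traversed the edge and a(t) = c / (1 - t).

    Inverts P(no divergence by t) = exp(-(A(t) - A(t0)) * h), where
    h = Gamma(m - alpha) / Gamma(m + 1 + theta) and A(t) = -c log(1 - t).
    """
    h = np.exp(gammaln(m - alpha) - gammaln(m + 1 + theta))
    u = rng.uniform()
    return 1.0 - (1.0 - t0) * u ** (1.0 / (c * h))
```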
[Plot: the rate multiplier $\Gamma(m)/\Gamma(m + 1 + \theta)$ against $\theta$, for $m = 1, 2, 3, 4$.]
Example draws in $\mathbb{R}^2$

[Figure: example draws. (a) $c = 1, \theta = 0, \alpha = 0$ (DDT); (b) $c = 1, \theta = 0.5, \alpha = 0$; (c) $c = 1, \theta = 1, \alpha = 0$; (d) $c = 3, \theta = 1.5, \alpha = 0$.]
Lemma
The probability of generating a specific tree structure, divergence times, divergence locations and corresponding data set is invariant to the ordering of data points.

[Figure: two orderings of the same data yield the same probability.]

Proof.
Probability of the tree structure:

$$\prod_{[ab] \in \text{internal edges}} \frac{\prod_{k=3}^{K_b} [\theta + (k - 1)\alpha] \; \prod_{l=1}^{K_b} \Gamma(n_l^b - \alpha)}{\Gamma(m(b) + \theta)\,\Gamma(1 - \alpha)^{K_b - 1}} \qquad (1)$$

Probability of the divergence times:

$$\prod_{[ab] \in \text{internal edges}} a(t_b) \exp\left[ (A(t_a) - A(t_b))\, H^{\theta,\alpha}_{m(b)-1} \right]$$

where we define $H^{\theta,\alpha}_n = \sum_{i=1}^{n} \frac{\Gamma(i - \alpha)}{\Gamma(i + 1 + \theta)}$.

Probability of the node locations:

$$\prod_{[ab] \in \text{edges}} N(x_b; x_a, \sigma^2 (t_b - t_a) I)$$

None of these depend on the order of the data points!
Proposition
The Pitman-Yor Diffusion Tree defines an infinitely exchangeable distribution over data points.

Proof.
Summing over all possible tree structures, and integrating over all branch point times and locations, by Lemma 1 we have infinite exchangeability.

Corollary
There exists a prior $\nu$ on probability measures on $\mathbb{R}^D$ such that the samples $x_1, x_2, \dots$ generated by a PYDT are conditionally independent and identically distributed (iid) according to $F \sim \nu$; that is, we can represent the PYDT as

$$P_{\text{PYDT}}(x_1, x_2, \dots) = \int \left( \prod_i F(x_i) \right) d\nu(F).$$

Proof.
Since the PYDT defines an infinitely exchangeable process on data points, the result follows directly from de Finetti's theorem.
Comparing to the DPM
It is difficult for the DPM to model fine structure: it has to choosebetween using many small clusters whose parameters will bedifficult to fit, or large clusters that would oversmooth the data.
[Figure: data modelled by (e) the PYDT and (f) the DPM.]
Parameter ranges
There are several valid ranges of the parameters $(\theta, \alpha)$:

- $0 \le \alpha < 1$ and $\theta > -2\alpha$: the general multifurcating case with arbitrary branching degree.
- $\alpha < 0$ and $\theta = -\kappa\alpha$, where $\kappa \in \{3, 4, \dots\}$ is the maximum outdegree of a node.
- $\alpha < 1$ and $\theta = -2\alpha$: binary branching, and specifically the DDT for $\alpha = \theta = 0$. A parameterised family of priors proposed by MacKay and Broderick (2007).
- $\alpha = 1$ gives instantaneous divergence, so the data points are independent.
Effect of varying θ
Fix $\alpha = 0$. Large $\theta$ gives flat clusterings; small $\theta$ gives hierarchical clusterings.

[Figure: trees sampled as $\theta$ increases, moving from hierarchical to flat structure.]
Tree balance

Binary branching parameter range: $\alpha < 1$ and $\theta = -2\alpha$. The probability of going left is

$$\frac{n_l - \alpha}{n_l + n_r - 2\alpha} \qquad (2)$$

This reinforcement is equivalent to hypothesising a per-node "probability of going left" with prior

$$p \sim \text{Beta}(-\alpha, -\alpha) \qquad (3)$$

Conditioning on the previous data points,

$$p \mid n_l, n_r \sim \text{Beta}(n_l - \alpha, n_r - \alpha) \qquad (4)$$

Thus marginalising out $p$ gives (2). For $\alpha$ close to 1, $p$ will be close to 0 or 1, so the tree will be very unbalanced. For $\alpha \to -\infty$, $p$ will be close to $\frac{1}{2}$, giving balanced trees.
Tree balance

A measure of tree imbalance is Colless's $I$ (Colless, 1982):

$$I = \frac{2}{(n - 1)(n - 2)} \sum_{a \in T} |l(a) - r(a)| \qquad (5)$$

Another is the normalised number of unbalanced nodes in a tree, $J$ (Rogers, 1996), i.e.

$$J = \frac{1}{n - 2} \sum_{a \in T} (1 - \mathbb{I}[l(a) = r(a)]) \qquad (6)$$

[Plots: Colless's index of imbalance and the proportion of unbalanced nodes, both against $\alpha$.]
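For reference, a minimal sketch of computing Colless's $I$ (Eqn 5) for a binary tree; representing trees as nested 2-tuples is an assumption made purely for illustration:

```python
def _leaves_and_imbalance(tree):
    """Return (number of leaves, sum over internal nodes a of |l(a) - r(a)|)."""
    if not isinstance(tree, tuple):
        return 1, 0  # a leaf
    nl, sl = _leaves_and_imbalance(tree[0])
    nr, sr = _leaves_and_imbalance(tree[1])
    return nl + nr, sl + sr + abs(nl - nr)

def colless_index(tree):
    """Colless's normalised imbalance I for a binary tree with n >= 3 leaves."""
    n, s = _leaves_and_imbalance(tree)
    return 2.0 * s / ((n - 1) * (n - 2))

# A maximally unbalanced "caterpillar" tree on 4 leaves has I = 1:
print(colless_index(((("a", "b"), "c"), "d")))  # 1.0
```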
Generalises the Dirichlet diffusion tree

$\theta = \alpha = 0$ recovers the DDT of Neal (2001). The probability of diverging off a branch is

$$\frac{a(t)\,\Gamma(m - 0)\,dt}{\Gamma(m + 1 + 0)} = \frac{a(t)\,(m - 1)!\,dt}{m!} = \frac{a(t)\,dt}{m}, \qquad (7)$$

The probability of following a branch at an existing branch point is proportional to the number of previous data points having followed that branch:

$$\frac{\prod_{l=1}^{K_b = 2} \Gamma(n_l^b - 0)}{\Gamma(m(b) + 0)} = \frac{(n_1^b - 1)!\,(n_2^b - 1)!}{(m(b) - 1)!}, \qquad (8)$$
Nested CRP
- Distribution over hierarchical partitions
- Denote the $K$ blocks in the first level as $\{B^1_k : k = 1, \dots, K\}$
- Partition these blocks with independent CRPs
- Denote the partitioning of $B^1_k$ as $\{B^2_{kl} : l = 1, \dots, K_k\}$
- Recurse for $S$ iterations, forming an $S$-deep hierarchy

[Figure: a nested partition of $\{1, \dots, 7\}$ into first-level blocks $B^1_1, B^1_2$ and second-level blocks $B^2_{11}, B^2_{12}, B^2_{21}$.]
Nested CRP
[Figure: a draw from an $S = 10$-level nested Chinese restaurant process with 15 leaves.]
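A minimal sketch of drawing such a hierarchy, assuming a constant concentration parameter at every level (all names here are illustrative, not from the paper):

```python
import numpy as np

def crp_partition(items, conc, rng):
    """Partition `items` with a Chinese restaurant process of concentration `conc`."""
    blocks = []
    for item in items:
        weights = np.array([len(b) for b in blocks] + [conc], dtype=float)
        k = rng.choice(len(blocks) + 1, p=weights / weights.sum())
        if k == len(blocks):
            blocks.append([])  # the item starts a new block
        blocks[k].append(item)
    return blocks

def nested_crp(items, conc, depth, rng=np.random.default_rng()):
    """Recursively partition `items` into a `depth`-level hierarchy of blocks."""
    if depth == 0 or len(items) <= 1:
        return items
    return [nested_crp(block, conc, depth - 1, rng)
            for block in crp_partition(items, conc, rng)]

tree = nested_crp(list(range(15)), conc=1.0, depth=10)
```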
Continuum limit of a nested CRP
Associate each level $s$ in an $S$-level nCRP with "time" $t_s = \frac{s - 1}{S} \in [0, 1)$, and let the concentration parameter at level $s$ be $a(t_s)/S$, where $a : [0, 1] \to \mathbb{R}^+$. Taking the limit $S \to \infty$ recovers the Dirichlet Diffusion Tree with divergence function $a(t)$.

[Figure: draws with $S = 5$, $S = 10$ and $S = 15$.]
Other properties of the PYDT
- Generalisation of a DP mixture of Gaussians (with a specific variance structure)
- The prior over tree structures is a multifurcating Gibbs fragmentation tree (McCullagh et al., 2008), the most general Gibbs-type, Markovian, exchangeable, consistent distribution over trees
Inference: MCMC
- Not straightforward to extend Neal's slice sampling moves because of atoms in the prior at existing branches
- Proposing new subtree locations from the prior is slow!
- Working on a Gibbs sampling algorithm using uniformisation ideas from Rao and Teh (2011) (with Vinayak Rao)
Inference: EM with greedy search
- Power EP or EM to calculate the marginal likelihood for a tree
- Use this to drive sequential tree building and search over tree structures
- Warm-start inference
- Use local evidence contributions to propose $L = 3$ "good" re-attachment locations

A straightforward extension of the algorithm we presented at ICML 2011 for the DDT (Knowles et al., 2011).
[Figure: the message passing factor graph, with $\mathbb{I}[0 < x < 1]$ constraint factors, prior factors on divergence times (Eqn 1) and Normal factors on locations (Eqn 2).]
Results: toy data
Figure: Optimal trees learnt by the greedy EM algorithm for the DDT and PYDT on a synthetic dataset with $D = 2$, $N = 100$.
Results: macaque skull measurements

$N_{\text{train}} = 200$, $N_{\text{test}} = 28$, $D = 10$ (Adams et al., 2008)
Figure: Density modelling of the $D = 10$, $N = 200$ macaque skull measurement dataset of Adams et al. (2008). Top: improvement in test predictive likelihood compared to a kernel density estimate. Bottom: marginal likelihood of the current tree. The shared x-axis is computation time in seconds.
Results: Animal species
- 33 animal species from Kemp and Tenenbaum (2008)
- 102-dimensional binary feature vectors relating to attributes (e.g. being warm-blooded, having two legs)
- Probit regression
Results: Animal species
Figure: Tree structure learnt for the animals dataset of Kemp and Tenenbaum (2008).
Other priors over tree structures used in ML
- Kingman's coalescent (KC) (Kingman, 1982; Teh et al., 2008). Points coalesce together rather than fragmenting as in the DDT/PYDT. KC is in a sense the dual process to the DDT, a fact used in Teh et al. (2011).
- A fixed number of generations and individuals per generation, where each child chooses its parent (Williams, 2000): a discretisation of KC.
- The nested CRP itself (Blei et al., 2010; Steinhardt and Ghahramani, 2012). How to choose when to stop?
- Tree-structured stick breaking (Adams et al., 2010). Extends the stick-breaking construction of the CRP to the nested CRP, and adds a per-node stopping probability.
Infinite Latent Attributes model for network data (with Konstantina Palla)
- Existing network models explain a "flat" clustering structure
- ILA has features that are partitioned into disjoint groups (subclusters)
- Generalises the IRM (Kemp and Tenenbaum, 2006), LFRM (Miller et al., 2009), and MAG (Kim and Leskovec, 2011)
- Excellent empirical performance in link prediction
Generative model:

$$Z \mid \alpha \sim \text{IBP}(\alpha)$$
$$c^{(m)} \mid \gamma \sim \text{CRP}(\gamma)$$
$$w^{(m)}_{kk'} \mid \sigma_w \sim N(0, \sigma_w^2)$$
$$\Pr(r_{ij} = 1 \mid Z, C, W) = \sigma\left( \sum_m z_{im} z_{jm} w^{(m)}_{c_{mi} c_{mj}} + s \right)$$
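A minimal sketch of the link probability above, assuming `Z` is a binary feature matrix, `c[m]` gives each node's subcluster within feature $m$, and `W[m]` is that feature's subcluster weight matrix ($\sigma$ is the logistic function; all names are illustrative):

```python
import numpy as np

def link_prob(i, j, Z, c, W, s=0.0):
    """P(r_ij = 1 | Z, c, W): sum the weights w^(m)[c_mi, c_mj] over features m
    active for both nodes i and j, then pass through the logistic function."""
    active = np.flatnonzero(Z[i] * Z[j])
    logit = s + sum(W[m][c[m][i], c[m][j]] for m in active)
    return 1.0 / (1.0 + np.exp(-logit))
```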
Gaussian Process Regression Networks (with Andrew Wilson)

- Multivariate heteroskedastic regression with covariate-dependent signal and noise correlations
- Tractability of Gaussian processes and multitask advantages of neural networks

$$W_{ij}(x) \sim \mathcal{GP}(0, k_w)$$
$$f_i(x) \sim \mathcal{GP}(0, k_f + \sigma_f^2 \delta)$$
$$y(x) \sim N(W(x) f(x), \sigma_y^2 I)$$

[Figure: network diagram with latent functions $f_1(x), f_2(x)$, weight functions $W_{11}(x), \dots, W_{32}(x)$ and outputs $y_1(x), y_2(x), y_3(x)$.]
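A minimal sketch of drawing from this prior at one-dimensional inputs, assuming squared-exponential kernels for both $k_w$ and $k_f$ (all names and settings are illustrative):

```python
import numpy as np

def rbf(x, ell=0.5, jitter=1e-8):
    """Squared-exponential kernel matrix on 1-D inputs x (jitter for stability)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2) + jitter * np.eye(len(x))

def sample_gprn(x, p=3, q=2, sigma_f=0.1, sigma_y=0.05,
                rng=np.random.default_rng(0)):
    """Draw y(x) = W(x) f(x) + noise from the GPRN prior at inputs x."""
    n = len(x)
    Kw = rbf(x)
    Kf = rbf(x) + sigma_f ** 2 * np.eye(n)
    W = rng.multivariate_normal(np.zeros(n), Kw, size=(p, q))  # shape (p, q, n)
    f = rng.multivariate_normal(np.zeros(n), Kf, size=q)       # shape (q, n)
    return np.einsum('pqn,qn->pn', W, f) + sigma_y * rng.standard_normal((p, n))

y = sample_gprn(np.linspace(0, 1, 50))  # shape (3, 50)
```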
Future/ongoing work
- Improved MCMC: uniformisation, slice sampling subtree locations
- Hierarchically structured states in an infinite HMM (e.g. for unsupervised part-of-speech tagging, modelling genetic variation)
- Topic modelling: a hierarchy over topic-specific distributions over words
- How to summarise posterior samples?
- Time-varying tree structures?
Bibliography

Adams, R., Murray, I., and MacKay, D. (2008). The Gaussian process density sampler. In Advances in Neural Information Processing Systems, volume 21. MIT Press.

Adams, R. P., Ghahramani, Z., and Jordan, M. I. (2010). Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems (NIPS) 23.

Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57.

Colless, D. (1982). Phylogenetics: the theory and practice of phylogenetic systematics. Systematic Zoology, 31(1):100–104.

Kemp, C. and Tenenbaum, J. B. (2006). Learning systems of concepts with an infinite relational model. In 21st National Conference on Artificial Intelligence.

Kemp, C. and Tenenbaum, J. B. (2008). The discovery of structural form. In Proceedings of the National Academy of Sciences, volume 105(31), pages 10687–10692.

Kim, M. and Leskovec, J. (2011). Modeling social networks with node attributes using the multiplicative attribute graph model. In UAI.

Kingman, J. (1982). The coalescent. Stochastic Processes and their Applications, 13(3):235–248.

Knowles, D. A., Van Gael, J., and Ghahramani, Z. (2011). Message passing algorithms for Dirichlet diffusion trees. In Proceedings of the 28th Annual International Conference on Machine Learning.

Knowles, D. A. and Ghahramani, Z. (2011). Pitman-Yor diffusion trees. In The 28th Conference on Uncertainty in Artificial Intelligence (to appear).

MacKay, D. and Broderick, T. (2007). Probabilities over trees: generalizations of the Dirichlet diffusion tree and Kingman's coalescent. Website.

McCullagh, P., Pitman, J., and Winkel, M. (2008). Gibbs fragmentation trees. Bernoulli, 14(4):988–1002.

Miller, K., Griffiths, T., and Jordan, M. (2009). Nonparametric latent feature models for link prediction. In NIPS.

Neal, R. M. (2001). Defining priors for distributions using Dirichlet diffusion trees. Technical Report 0104, Dept. of Statistics, University of Toronto.

Rao, V. and Teh, Y. W. (2011). Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence.

Rogers, J. S. (1996). Central moments and probability distributions of three measures of phylogenetic tree imbalance. Systematic Biology, 45(1):99–110.

Steinhardt, J. and Ghahramani, Z. (2012). Flexible martingale priors for deep hierarchies. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Teh, Y. W., Blundell, C., and Elliott, L. T. (2011). Modelling genetic variations with fragmentation-coagulation processes. In Advances in Neural Information Processing Systems (NIPS).

Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, 20.

Williams, C. (2000). A MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems, 13.