School of Computer Science
Stochastic Block Models of Mixed Membership
Edo Airoldi 1,2, Dave Blei 2, Steve Fienberg 1, Eric Xing 1
1 Carnegie-Mellon University & 2 Princeton University
SAMSI, High Dimensional Inference and Random Matrices, September 17th, 2006
2
The Scientific Problem
• Protein-protein interactions in Yeast
• Different studies test protein interactions with different technologies, and hence with different precision
[Figure panels: Expression graphs; Interaction graphs]
3
The Data: Interaction Graphs
• M proteins in a graph (nodes)
• M² observations on pairs of proteins
– Edges are random quantities, Y[n,m]
• Interactions are not independent
– Interacting proteins form a protein complex
• T graphs on the same set of proteins
• Partial annotations for each protein, X[n]
Example: M = 871 nodes, M² ≈ 750K entries
4
The Scientific Problems
• What are stable protein complexes?
– They perform many cellular processes
– A protein may be a member of several complexes
• How many are there?
• How do stable protein complexes interact?
– Test hypotheses (inform new analyses)
– Learn complex-to-complex interaction patterns
5
More Network Data
[Figure panels: Disease Spread; Social Network; Food Web; Electronic Circuit; Internet]
6
An Abstraction of the Data
• A collection of unipartite graphs: $G_{1:T} = (Y_{1:T}, N)$
• Node-specific (multivariate) attributes: $X_{1:T} = \{ X_t[n] : n \in N \}$
• Partially observable $Y_{1:T}$ and $X_{1:T}$
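The abstraction above can be held in a small container. This is a minimal sketch, not the authors' data format: the class name `GraphCollection`, its methods, and the use of `None` for unobserved entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GraphCollection:
    """T unipartite graphs on a shared node set N, plus node attributes.

    Y[t][n][m] is 1 or 0 for an observed edge or non-edge and None when the
    pair was not observed; X[t] maps a node index to its attribute vector and
    may omit nodes (partial annotations). All names are hypothetical.
    """
    num_nodes: int
    Y: List[List[List[Optional[int]]]] = field(default_factory=list)
    X: List[Dict[int, List[float]]] = field(default_factory=list)

    def add_graph(self, edges, non_edges=(), attributes=None):
        """Append one graph; pairs that are not listed stay unobserved."""
        y = [[None] * self.num_nodes for _ in range(self.num_nodes)]
        for n, m in edges:
            y[n][m] = 1
        for n, m in non_edges:
            y[n][m] = 0
        self.Y.append(y)
        self.X.append(dict(attributes or {}))

    def observed_fraction(self, t):
        """Share of the M^2 entries of graph t that were observed."""
        seen = sum(v is not None for row in self.Y[t] for v in row)
        return seen / self.num_nodes ** 2


data = GraphCollection(num_nodes=4)
data.add_graph(edges=[(0, 1), (1, 0)], non_edges=[(2, 3)],
               attributes={0: [1.0, 0.0]})
print(data.observed_fraction(0))  # 3 of the 16 entries are observed
```

Keeping "unobserved" distinct from "observed zero" is the point of the sketch: the two carry different information when Y is only partially observable.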
7
The Challenge
• Given the data abstraction and the goals of the analysis
• Can we posit a rich class of models that is instrumental for thinking about the scientific problems we face? Amenable to theoretical analyses?
8
Modeling Ideas
• Hierarchical Bayes
– Latent variables encode semantic elements
– Assume structure on observable-latent elements
• Combination of two classes of models
1. Models of mixed membership
2. Network models (block models)
= Stochastic block models of mixed membership
9
Graphical Model Representation
[Figure: graphical model with two parts labeled Mixed Membership and Stochastic Blocks]
10
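A Hierarchical Likelihood
• Interactions (observed*): $y_{ij} = 1$ when node i interacts with node j
• Mixed membership vectors (latent*): $\pi_i$, $\pi_j$ over groups 1, 2, 3
• Group-to-group patterns (latent*): $B_{gh}$ for groups g, h in 1, 2, 3; e.g., $B_{23} = 0.9$
$\Pr(y_{ij} = 1 \mid \pi_i, \pi_j, B) = \pi_i^\top B\, \pi_j$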
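The likelihood on this slide is a bilinear form: the membership vector of node i, the group-to-group matrix, and the membership vector of node j. A minimal sketch follows, writing `pi_i`, `pi_j` for the membership vectors and `B` for the group-to-group matrix (names assumed here). Only the entry 0.9 for groups 2 and 3 comes from the slide; every other number is made up for illustration.

```python
def edge_probability(pi_i, pi_j, B):
    """Return pi_i^T B pi_j, the probability that y_ij = 1."""
    return sum(pi_i[g] * B[g][h] * pi_j[h]
               for g in range(len(pi_i))
               for h in range(len(pi_j)))


# Illustrative 3-group matrix; B[1][2] is the slide's B_23 = 0.9 (0-indexed here).
B = [[0.8, 0.1, 0.0],
     [0.1, 0.7, 0.9],
     [0.0, 0.9, 0.6]]

pi_i = [0.0, 1.0, 0.0]   # node i belongs fully to group 2
pi_j = [0.0, 0.0, 1.0]   # node j belongs fully to group 3
p_pure = edge_probability(pi_i, pi_j, B)    # picks out B_23

pi_k = [0.5, 0.5, 0.0]   # node k is split between groups 1 and 2
p_mixed = edge_probability(pi_k, pi_j, B)   # 0.5 * 0.0 + 0.5 * 0.9

print(p_pure, p_mixed)
```

With pure memberships the formula reduces to a single entry of the matrix, which is the classical block model; mixed memberships average over entries.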
11
More Modeling Issues
• Technical :: Sparsity
– Introduce a parameter that modulates the relative importance of ones and zeros (binary edges) in the cost function that drives the clustering
• Biological :: Ribosomes & Distress
– Some protein complexes act like hubs because they are involved, e.g., in protein production or cell recovery (Y2H technology is invasive)
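The slide does not say how the sparsity parameter enters the cost function. One formulation found in the mixed membership block model literature, used here purely as an assumption, scales the edge probability p down to (1 - rho) * p, so that part of the mass of zeros is attributed to sparsity rather than to the group structure. The names `rho` and `bernoulli_log_lik` are hypothetical.

```python
import math


def bernoulli_log_lik(y, p, rho=0.0):
    """Log-likelihood of a binary edge y when the edge probability p is
    down-weighted to (1 - rho) * p; rho = 0 recovers the plain Bernoulli.

    Assumes 0 < p < 1 and 0 <= rho < 1 so that both logarithms are finite.
    """
    q = (1.0 - rho) * p
    return math.log(q) if y == 1 else math.log(1.0 - q)


p = 0.9
for rho in (0.0, 0.5):
    # Larger rho makes an observed zero cheaper and an observed one costlier.
    print(rho, bernoulli_log_lik(1, p, rho), bernoulli_log_lik(0, p, rho))
```

Under this assumed form, a missing edge between two groups that usually interact is penalized less as rho grows, which is one way to keep the many zeros of a sparse graph from dominating the clustering.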
12
Large Scale Computation
• Masses of data
– 750K observations in a small problem (M = 871)
– 2.5M observations with M = 1578
– 3M expressions for 6K genes/proteins in Yeast
• Variational inference [Jordan et al., 2001]
– Naïve implementation does not work
– We develop a novel “nested” variational algorithm
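The observation counts follow from squaring the number of nodes, and the same arithmetic suggests why a naïve implementation is costly. The sketch below assumes that a naïve algorithm keeps two K-dimensional indicator vectors per ordered pair; that storage layout and the value K = 15 are assumptions for illustration, not figures from the talk.

```python
def num_pair_observations(num_nodes):
    """Number of entries in an M x M interaction matrix."""
    return num_nodes ** 2


def naive_indicator_floats(num_nodes, num_groups):
    """Floats needed if two K-vectors are stored for every pair (assumed layout)."""
    return 2 * num_groups * num_nodes ** 2


for m in (871, 1578):
    print(m, num_pair_observations(m))   # 758641 and 2490084

# Hypothetical K = 15 groups on the M = 871 problem.
print(naive_indicator_floats(871, 15))   # 22759230 floats
```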
13
Example: A Scientific Question
• Do PPI contain information about functions?
[Figure labels: Raw data; Model; Approximate posterior on membership vectors; "?"; Functional annotations; example protein YLD014W]
14
Interactions in Yeast (MIPS)
• Do PPI contain information about functions?
[Figure: bar plot for protein YLD014W; horizontal axis 1, 2, 3, ..., 15; vertical axis from 0 to 1]
15
Results: Identifiability
• In this example we map latent groups to known functional categories
[Figure panels: Known Annotations; Unknown Annotations]
16
Results: Functional Annotations
17
Results: Mixed Membership
• The estimated membership vectors support the mixed membership assumption
18
Results: Stochastic Block Model
19
General Bayesian Formulation
• Assumptions for unipartite graphs
– Population: existence of K sub-populations
– Latent variable: mixed memb. vectors $\pi[n] \sim D$
– Subject: exchangeable edges given blocks $B$ & memb., $Y[n,m] \sim f(\,\cdot \mid \pi[n]^\top B\, \pi[m])$
– Sampling scheme: the graphs are IID
• Additional data, e.g., attributes, annotations
– Integrated model formulation (descriptive/predictive)
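The assumptions above describe a generative process, which can be simulated directly. This is a sketch under stated assumptions: the slide leaves the membership distribution D and the edge distribution f unspecified, so a Dirichlet and a Bernoulli are swapped in here, and the per-pair group draws are one way to realize an edge probability equal to the bilinear form of the two membership vectors and the block matrix. All function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a membership vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index k with probability probs[k]."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1  # guard against rounding in the running sum


def sample_graph(num_nodes, alpha, B, seed=0):
    """Sample membership vectors and one directed binary graph."""
    rng = random.Random(seed)
    pi = [sample_dirichlet(alpha, rng) for _ in range(num_nodes)]
    Y = [[0] * num_nodes for _ in range(num_nodes)]
    for n in range(num_nodes):
        for m in range(num_nodes):
            if n == m:
                continue  # no self-interactions in this sketch
            g = sample_categorical(pi[n], rng)  # group n acts in toward m
            h = sample_categorical(pi[m], rng)  # group m acts in toward n
            Y[n][m] = 1 if rng.random() < B[g][h] else 0
    return pi, Y


B = [[0.8, 0.1, 0.0],
     [0.1, 0.7, 0.9],
     [0.0, 0.9, 0.6]]
pi, Y = sample_graph(num_nodes=5, alpha=[0.5, 0.5, 0.5], B=B, seed=1)
```

Averaging over the two group draws gives Pr(Y[n][m] = 1) equal to the sum over g, h of pi[n][g] * B[g][h] * pi[m][h], so the edges are conditionally independent given the memberships and blocks.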
20
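Variational Algorithms
• Naïve algorithm:
– init ($\gamma_i\ \forall i$, $\phi_{ij}\ \forall ij$)
– while (≈ log-lik not converged)
    update ($\phi_{ij}\ \forall ij$)
    update ($\gamma_i\ \forall i$)
• Nested algorithm:
– init ($\gamma_i\ \forall i$)
– while (≈ log-lik not converged)
    loop $\forall ij$
    • init $\phi_{ij}$
    • while (≈ log-lik not converged)
        update $\phi_{ij}$
    • partially update ($\gamma_i$, $\gamma_j$)
We trade space for time but …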
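The nested schedule can be sketched for a small graph. This is an illustration under several assumptions, not the authors' implementation: memberships have a Dirichlet prior `alpha`, edges are Bernoulli, the block matrix `B` is held fixed with entries strictly between 0 and 1, and fixed iteration counts stand in for the convergence checks on the approximate log-likelihood. The names `gamma` (per-node variational parameters) and `send`/`recv` (per-pair indicator parameters) are assumed. For simplicity the sketch stores every pair's indicators so that a partial update can remove the old contribution, which means it does not show any memory saving.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_log_pi(gamma):
    """E[log pi_k] under a Dirichlet(gamma) variational distribution."""
    total = digamma(sum(gamma))
    return [digamma(g) - total for g in gamma]


def normalize_exp(log_w):
    """Exponentiate and normalize log-weights in a numerically stable way."""
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    s = sum(w)
    return [v / s for v in w]


def nested_variational(Y, B, alpha, outer_iters=20, inner_iters=10, seed=0):
    """Return per-node parameters gamma after nested mean-field sweeps."""
    M, K = len(Y), len(alpha)
    rng = random.Random(seed)
    pairs = [(p, q) for p in range(M) for q in range(M) if p != q]
    # init gamma from random soft assignments (breaks symmetry between groups)
    gamma = [list(alpha) for _ in range(M)]
    phi = {}
    for p, q in pairs:
        send = normalize_exp([rng.random() for _ in range(K)])
        recv = normalize_exp([rng.random() for _ in range(K)])
        phi[(p, q)] = (send, recv)
        for k in range(K):
            gamma[p][k] += send[k]
            gamma[q][k] += recv[k]
    for _ in range(outer_iters):
        for p, q in pairs:
            old_send, old_recv = phi[(p, q)]
            e_p, e_q = expected_log_pi(gamma[p]), expected_log_pi(gamma[q])
            # log of B^y (1 - B)^(1 - y) for every pair of groups
            logp = [[math.log(B[g][h]) if Y[p][q] == 1 else math.log(1.0 - B[g][h])
                     for h in range(K)] for g in range(K)]
            send, recv = [1.0 / K] * K, [1.0 / K] * K   # init phi for this pair
            for _ in range(inner_iters):                # inner loop on one pair
                send = normalize_exp([e_p[g] + sum(recv[h] * logp[g][h] for h in range(K))
                                      for g in range(K)])
                recv = normalize_exp([e_q[h] + sum(send[g] * logp[g][h] for g in range(K))
                                      for h in range(K)])
            for k in range(K):                          # partial update of gamma_p, gamma_q
                gamma[p][k] += send[k] - old_send[k]
                gamma[q][k] += recv[k] - old_recv[k]
            phi[(p, q)] = (send, recv)
    return gamma


# Toy graph: nodes 0-2 interact with each other, as do nodes 3-5.
M = 6
Y = [[1 if n != m and (n < 3) == (m < 3) else 0 for m in range(M)] for n in range(M)]
B = [[0.9, 0.05], [0.05, 0.9]]
alpha = [0.5, 0.5]
gamma = nested_variational(Y, B, alpha)
membership = [[g / sum(row) for g in row] for row in gamma]  # posterior mean memberships
```

The difference from a naïve schedule is visible in the loop structure: each pair's indicators are iterated on their own, and the two affected node parameters are refreshed immediately instead of after a full pass over all pairs.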
21
Variational Algorithms for MMSB
On a single machine* we empirically observed: faster convergence (offsets extra computation), and more stable paths to convergence.
[Figure panels: Naïve; Nested]
22
Take Home Points
• Bayesian formulation is integral to the biology
• A novel class of models that combines MM for soft-clustering & network models for dependent data
• Latent aspects capture patterns that correlate with, and help predict, functional processes in the cell
• Current implementation allows for fast inference on large matrices through variational approximation; there is considerable opportunity to improve both the computation and the efficiency of the approximation
23
• Data & Problems: Gavin et al. (2002) Nature; Ho et al. (2002) Nature; Mewes et al. (2004) Nucleic Acids Research; Krogan et al. (2006) Nature.
• Mixed Membership Models
– Pritchard et al. (2000); Erosheva (2002); Rosenberg et al. (2002); Blei et al. (2003); Xing et al. (2003ab); Erosheva et al. (2004); Airoldi et al. (2005); Blei & Lafferty (2006); Xing et al. (2006)
• Stochastic network models
– Wasserman et al. (1980, 1994, 1996); Fienberg et al. (1985); Frank & Strauss (1986); Nowicki & Snijders (2001); Hoff et al. (2002); Airoldi et al. (2006)
• More material on the Web at: http://www.cs.cmu.edu/~eairoldi/
• ICML Workshop on “Statistical Network Analysis: Models, Issues and New Directions” on June 29 at Carnegie Mellon, Pittsburgh PA: http://nlg.cs.cmu.edu/