ALGORITHMIC & ANALYTICAL METHODS FOR FUNCTIONAL CHARACTERIZATION OF MOLECULAR INTERACTION NETWORKS Mehmet Koyutürk PURDUE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE Joint work with Jayesh Pandey, Wojciech Szpankowski, and Ananth Grama
Jan 02, 2016
OUTLINE
Biological motivation
    Gene regulation, molecular annotation, pathway annotation
Formal framework
    Functional attribute networks: Multigraph model
Algorithmic challenges
    Statistical interpretability, non-monotonicity
Statistical model
    Conditioning on building blocks to emphasize modularity
Resulting tool
    NARADA, algorithms, implementation, results
GENE REGULATION
Gene expression is the process of synthesizing a functional protein coded by the corresponding gene
Genes (& their products) regulate (promote / suppress) the extent of each other’s expression
Any step of gene expression can be modulated
    Transcription, translation, post-transcriptional modification, RNA transport, mRNA degradation…
[Figure: Negative ligand-independent transcriptional regulation at the chromatin level]
GENE REGULATORY NETWORKS
Abstraction: organization of regulatory interactions in the cell
    Genes are nodes, regulatory interactions are directed edges
    Boolean network model: Edges are signed, indicating up-regulation (promotion) and down-regulation (suppression)
[Figure: Flowering time in Arabidopsis; legend: gene, up-regulation, down-regulation]
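A minimal sketch of the signed, directed abstraction described on this slide. The gene names and edges are hypothetical toy data, and treating the net effect of an indirect path as the product of its edge signs is an assumption made for illustration, not something the slides state.

```python
from collections import defaultdict

# Hypothetical toy network (not taken from the talk).
# Sign +1 = up-regulation (promotion), -1 = down-regulation (suppression).
EDGES = [
    ("g1", "g2", +1),
    ("g2", "g3", -1),
    ("g1", "g3", +1),
]

def build_network(edges):
    """Return an adjacency map: regulator -> {target: sign}."""
    network = defaultdict(dict)
    for regulator, target, sign in edges:
        if sign not in (+1, -1):
            raise ValueError("edge sign must be +1 or -1")
        network[regulator][target] = sign
    return network

def path_sign(network, path):
    """Net sign along a directed path (assumption: product of edge signs)."""
    sign = 1
    for u, v in zip(path, path[1:]):
        sign *= network[u][v]  # raises KeyError if the edge is absent
    return sign

network = build_network(EDGES)
indirect = path_sign(network, ["g1", "g2", "g3"])  # promotion, then suppression
direct = path_sign(network, ["g1", "g3"])
```

In this toy network g1 affects g3 both directly and through g2, with opposite net signs under the stated assumption.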
MOLECULAR ANNOTATION
Similar systems that involve different molecules (genes, proteins) in different species
Functional annotation of genes provides a unified understanding of the underlying principles
Gene Ontology: A library of molecular annotation
    Molecular function: What is the role of a gene?
    Biological process: In which processes is a gene involved?
    Cellular component: Where is a gene’s product localized?
We refer to each annotation class as a functional attribute
FROM MOLECULES TO SYSTEMS
Networks are species-specific
Annotation is at the molecular level
Map networks from gene space to function space
    Can generate a library of annotated (sub-)networks
[Figure: Network of Gene Ontology terms based on significance of pairwise interactions in the S. cerevisiae Synthetic Gene Array (SGA) network (Tong et al., Science, 2004)]
INDIRECT REGULATION
[Figure: two views of a gene network over genes g1 to g6]
Assessment of pairwise interactions is simple, but not adequate
FUNCTIONAL ATTRIBUTE NETWORK
Multigraph model
    A gene is associated with multiple functional attributes
    A functional attribute is associated with multiple genes
    Functional attributes are represented by nodes
    Genes are represented by ports, reflecting context
[Figure: a gene network over genes g1 to g6 and the corresponding functional attribute network]
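The mapping from gene space to functional attribute space can be sketched as follows. This is an illustrative construction under stated assumptions, not the NARADA implementation: the toy edges, the annotation table, and the function name `build_attribute_network` are all hypothetical. Each gene-level edge is projected onto every pair of attributes of its endpoints, and the gene pair is kept as the "port" that gives the attribute-level edge its context.

```python
from collections import defaultdict

# Hypothetical toy data (not from the talk).
GENE_EDGES = [("g1", "g2"), ("g3", "g2"), ("g2", "g4")]
ANNOTATION = {
    "g1": {"A"},
    "g2": {"B"},
    "g3": {"A"},
    "g4": {"C", "D"},  # a gene may carry several functional attributes
}

def build_attribute_network(gene_edges, annotation):
    """Project gene edges onto attribute pairs.

    Returns {(Ti, Tj): set of (gene_u, gene_v)}. Parallel edges between
    two attributes are distinguished by the gene pair that realizes them.
    """
    multigraph = defaultdict(set)
    for u, v in gene_edges:
        for tu in annotation.get(u, ()):
            for tv in annotation.get(v, ()):
                multigraph[(tu, tv)].add((u, v))
    return multigraph

attr_net = build_attribute_network(GENE_EDGES, ANNOTATION)
```

Here the attribute edge A to B is realized twice (through g1 and through g3), which is why a multigraph rather than a simple graph is needed.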
FREQUENCY OF A MULTIPATH
A pathway of functional attributes occurs in various contexts in the gene network
    Multipath in the functional attribute network
Frequency of a multipath?
[Figure: example multipaths and their frequencies; the values 4 and 0 survive from the figure]
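One way to make the frequency of a multipath concrete is to count the gene-level paths whose genes carry the given attributes in order. The sketch below assumes that definition (directed walks, so a gene may in principle repeat); the toy data and the name `multipath_frequency` are hypothetical and are not taken from the figure.

```python
from collections import defaultdict

# Hypothetical toy data (not from the talk).
GENE_EDGES = [("g1", "g2"), ("g3", "g2"), ("g2", "g4"), ("g2", "g5")]
ANNOTATION = {"g1": {"A"}, "g3": {"A"}, "g2": {"B"}, "g4": {"C"}, "g5": {"C"}}

def multipath_frequency(gene_edges, annotation, attributes):
    """Count gene walks x1 -> x2 -> ... -> xk where xi carries attributes[i]."""
    successors = defaultdict(set)
    for u, v in gene_edges:
        successors[u].add(v)
    # counts[g] = number of matching walks for the prefix that end at gene g
    counts = {g: 1 for g, terms in annotation.items() if attributes[0] in terms}
    for term in attributes[1:]:
        extended = defaultdict(int)
        for gene, n_walks in counts.items():
            for nxt in successors.get(gene, ()):
                if term in annotation.get(nxt, ()):
                    extended[nxt] += n_walks
        counts = extended
    return sum(counts.values())

forward = multipath_frequency(GENE_EDGES, ANNOTATION, ["A", "B", "C"])
backward = multipath_frequency(GENE_EDGES, ANNOTATION, ["C", "B", "A"])
```

The dynamic-programming form avoids enumerating every walk explicitly; only the number of matching walks ending at each gene is carried forward.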
SIGNIFICANCE OF A PATHWAY
We want to identify multipaths with unusual frequency
    These might correspond to modular pathways
Frequency alone is not a good measure of statistical significance
    The distribution of functional attributes among genes is not uniform
    The degree distribution in the gene network is highly skewed
    Pathways that contain common functional attributes have high frequency, but they are not necessarily interesting
STATISTICAL INTERPRETABILITY
We are interested in identifying statistically over-represented patterns
    Null hypothesis: the pattern is sparse
    Additional positive observation => more significance
    Additional negative observation => less significance
[Figure: patterns A, B, and B’, with P(B) < P(A) and P(B’) > P(A)]
MONOTONICITY
Frequency is a monotonic measure
    If a pathway is frequent, then all of its sub-paths are frequent
    Algorithmic advantage: enumerate all frequent patterns in a bottom-up fashion
    Commonly exploited in traditional data mining applications
Statistically interpretable measures are not monotonic!
    Statistical significance fluctuates in the search space
    Existing data mining algorithms do not apply
    Significance of pathways is non-monotonic in two dimensions: GO hierarchy & path space
GO HIERARCHY
Functional attributes are organized in a hierarchical manner
    “regulation of steroid biosynthetic process” is a “regulation of steroid metabolic process” and is part of “steroid biosynthetic process”
Statistically interpretable measures are not monotonic with respect to the GO hierarchy
    A pattern corresponding to a child may be more significant or less significant than that corresponding to its parent
    Common example: Identification of significantly enriched GO terms in a set of genes (Ontologizer, VAMPIRE)
MONOTONICITY W.R.T. GO
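[Figure: a gene network over genes g1 to g5 and a GO DAG whose terms are annotated with gene sets including {g1, g2, g3}, {g1, g2, g4}, and {g1, g2}; the significance of three patterns is ordered as P( ) < P( ) < P( ), with the pattern drawings lost in extraction]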
PATHWAY LENGTH
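[Figure: two comparisons of paths of different length, one with P( ) > P( ) and one with P( ) < P( ); the path drawings are lost in extraction]
Open problems
    How can we effectively search in the pathway space, where significance fluctuates?
    How can we find optimal resolution in functional attribute space?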
STATISTICAL MODEL: INSIGHT
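Emphasize modularity of pathways
    Condition on frequency of building blocks
    Evaluate the significance of the coupling of building blocks
[Figure: a gene network over genes g1 to g7 with the frequencies φ( ) of a pathway and its building blocks (the values 2, 5, and 4 survive) and the resulting comparison of P( ) values]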
STATISTICAL MODEL: FORMULATION
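We denote each frequency random variable by Φ and its realization by φ
Significance of pathway π123 (p123):
    p123 = P(Φ123 ≥ φ123 | Φ12 = φ12, Φ23 = φ23, Φ1 = φ1, Φ2 = φ2, Φ3 = φ3)
[Figure: pathway π123 with attribute frequencies Φ1, Φ2, Φ3, interaction frequencies Φ12, Φ23, and pathway frequency Φ123]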
SIGNIFICANCE OF A PATHWAY
Assume that regulatory interactions are independent
    There are φ12 φ23 possible pairs of π12 and π23 edges
    The probability that a pair of π12 and π23 edges goes through the same gene (and so corresponds to an occurrence of π123) is 1/φ2
    The probability that at least φ123 of these pairs go through the same gene can be bounded by
        p123 ≤ exp(φ12 φ23 Hq(t)), where q = 1/φ2 and t = φ123 / (φ12 φ23)
    Hq(t) = t log(q/t) + (1-t) log((1-q)/(1-t)) is the negative of the Kullback-Leibler divergence between Bernoulli(t) and Bernoulli(q), so the bound is informative when t > q
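The bound on p123 can be evaluated directly from the four frequencies. The sketch below follows the formula on this slide; the function names are hypothetical, and returning 1.0 when the observed rate t does not exceed q is a choice made here, since the exponential bound says nothing useful in that case.

```python
import math

def neg_divergence(q, t):
    """H_q(t) = t*log(q/t) + (1-t)*log((1-q)/(1-t)), with 0*log(0) taken as 0."""
    total = 0.0
    if t > 0:
        total += t * math.log(q / t)
    if t < 1:
        total += (1 - t) * math.log((1 - q) / (1 - t))
    return total

def pathway_pvalue_bound(phi12, phi23, phi2, phi123):
    """Upper bound exp(phi12*phi23*H_q(t)) on the p-value of pathway pi_123."""
    n_pairs = phi12 * phi23          # possible pairs of pi_12 and pi_23 edges
    if n_pairs == 0 or phi2 == 0:
        return 1.0                   # no pairs to couple: nothing to test
    q = 1.0 / phi2                   # chance that a pair shares its middle gene
    t = phi123 / n_pairs             # observed fraction of coupled pairs
    if t <= q:
        return 1.0                   # not over-represented (choice made here)
    return math.exp(n_pairs * neg_divergence(q, t))

# Hypothetical frequencies, for illustration only.
bound = pathway_pvalue_bound(phi12=4, phi23=5, phi2=4, phi123=10)
```

With these illustrative numbers, q = 0.25 and t = 0.5, so half of the 20 possible pairs are coupled where a quarter would be expected.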
BASELINE MODEL
A single regulatory interaction is the shortest pathway
Arbitrary degree distribution: The number of edges leaving and entering each functional attribute is specified
Edges are assumed to be independent
The frequency of a regulatory interaction is a hypergeometric random variable
    Can derive a similar bound for the p-value of a single regulatory interaction
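The slide does not spell out the parameters of the hypergeometric distribution, so the sketch below assumes one plausible reading: out of m edges in total, in_j enter attribute Tj, and the out_i edges leaving attribute Ti are treated as a random sample of the m edges. The names `hypergeom_tail`, `interaction_pvalue`, `m`, `out_i`, and `in_j` are hypothetical, and the exact tail is computed here rather than the bound mentioned on the slide.

```python
from math import comb

def hypergeom_tail(population, successes, draws, k):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(max(k, 0), upper + 1)
    )
    return tail / total

def interaction_pvalue(m, out_i, in_j, phi_ij):
    """p-value of observing at least phi_ij edges from Ti to Tj.

    Assumed parameterization (not given in the slides): m edges in total,
    in_j of them enter Tj, and the out_i edges leaving Ti are the sample.
    """
    return hypergeom_tail(m, in_j, out_i, phi_ij)

# Hypothetical counts, for illustration only.
p_edge = interaction_pvalue(m=10, out_i=4, in_j=3, phi_ij=1)
```

`math.comb` requires Python 3.8 or later and returns 0 when the second argument exceeds the first, which keeps the sum valid near the edges of the support.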
ALGORITHMIC ISSUES
Significance is not monotonic
    Need to enumerate all pathways?
Strongly significant pathways
    A pathway is strongly significant if all of its building blocks and their coupling are significant (defined recursively)
    Allows pruning out the search space effectively
Shortcutting common functional attributes
    Transcription factors, DNA binding genes, etc. are responsible for mediating regulation
    Shortcut these terms, consider regulatory effect of different processes on each other directly
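The pruning idea behind strongly significant pathways can be sketched as a bottom-up search. This is an illustration under assumptions, not the NARADA algorithm: the building blocks of a path of k attributes are taken to be its two sub-paths of k-1 attributes, `pvalue` is a hypothetical caller-supplied function that scores a single interaction or the coupling of a longer path, and the p-values in the toy table are made up.

```python
from collections import defaultdict

def strongly_significant_paths(attr_edges, pvalue, alpha, max_attrs):
    """Enumerate paths whose building blocks and couplings all pass alpha."""
    level = {edge for edge in attr_edges if pvalue(edge) <= alpha}
    successors = defaultdict(set)
    for a, b in level:
        successors[a].add(b)
    found = set(level)
    for _ in range(3, max_attrs + 1):
        extended = set()
        for path in level:                      # prefix is strongly significant
            for nxt in successors.get(path[-1], ()):
                candidate = path + (nxt,)
                # suffix must also be strongly significant, then test coupling
                if candidate[1:] in level and pvalue(candidate) <= alpha:
                    extended.add(candidate)
        found |= extended
        level = extended
    return found

# Hypothetical p-values, for illustration only.
PVALUES = {
    ("A", "B"): 0.01,
    ("B", "C"): 0.02,
    ("C", "D"): 0.50,
    ("A", "B", "C"): 0.03,
    ("B", "C", "D"): 0.001,
}
result = strongly_significant_paths(
    [("A", "B"), ("B", "C"), ("C", "D")],
    lambda path: PVALUES.get(path, 1.0),
    alpha=0.05,
    max_attrs=4,
)
```

In the toy table the coupling B, C, D has a small p-value, but it is never evaluated because its building block C to D is not significant; this is the pruning that makes the search tractable.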
NARADA
http://www.cs.purdue.edu/homes/jpandey/narada/
A software tool for the identification of significant pathways
Queries
    Given functional attribute T, find all significant pathways that originate at T
    Given functional attribute T, find all significant pathways that terminate at T
    Given a sequence of functional attributes T1, T2, …, Tk, find all occurrences of the corresponding pathway
Identified pathways are displayed as a tree
    User can explore back and forth between the gene network and the functional attribute network
RESULTS
E. coli transcription network obtained from RegulonDB
    3159 regulatory interactions between 1364 genes
    Using Gene Ontology, 881 of these genes are mapped to 318 processes

Pathway length            2     3     4     5
All                     427   580  1401   942
Strongly significant    427   208   183   142
Common terms shortcut   184   119     3     1
MOLYBDATE ION TRANSPORT
[Figure: Significant regulatory pathways that originate at molybdate ion transport, and their occurrences in the gene network]
WHAT IS SIGNIFICANT?
Molybdate ion transport regulates various processes directly
    Mo-molybdopterin cofactor biosynthesis, oligopeptide transport, cytochrome complex assembly
It regulates various other processes indirectly
    Through DNA-dependent regulation of transcription, two-component signal transduction system, nitrate assimilation
    Regulation of these mediator processes is not significant on its own!
    NARADA captures modularity of indirect regulation!
CONCLUSION
Mapping gene regulatory networks to functional attribute space demonstrates great potential
    Abstract, unified understanding of regulatory systems
Algorithmically, a wide range of new challenges
    Bounding interpretable statistical measures
    Handling resolution in functional attribute space
    Generalizing the definition of a pathway
Discovering new information
    Projecting identified “canonical” patterns on other networks to discover new regulatory relationships
ACKNOWLEDGMENTS
Ananth Grama
Wojciech Szpankowski
Shankar Subramaniam
Yohan Kim
Jayesh Pandey