
ALGORITHMIC & ANALYTICAL METHODS FOR FUNCTIONAL CHARACTERIZATION OF MOLECULAR INTERACTION NETWORKS

Mehmet Koyutürk

PURDUE UNIVERSITY, DEPARTMENT OF COMPUTER SCIENCE

Joint work with Jayesh Pandey, Wojciech Szpankowski, and Ananth Grama

OUTLINE

Biological motivation
    Gene regulation, molecular annotation, pathway annotation

Formal framework
    Functional attribute networks: Multigraph model

Algorithmic challenges
    Statistical interpretability, non-monotonicity

Statistical model
    Conditioning on building blocks to emphasize modularity

Resulting tool
    NARADA, algorithms, implementation, results


GENE REGULATION

Gene expression is the process of synthesizing a functional protein coded by the corresponding gene

Genes (& their products) regulate (promote / suppress) the extent of each other’s expression

Any step of gene expression can be modulated
    Transcription, translation, post-transcriptional modification, RNA transport, mRNA degradation…

[Figure: Negative ligand-independent transcriptional regulation at the chromatin level]

GENE REGULATORY NETWORKS

Abstraction: organization of regulatory interactions in the cell
    Genes are nodes, regulatory interactions are directed edges
    Boolean network model: edges are signed, indicating up-regulation (promotion) and down-regulation (suppression)

[Figure: Flowering time in Arabidopsis; legend: gene, up-regulation, down-regulation]
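The signed-network abstraction on this slide can be sketched as a small data structure. This is only an illustration: the gene names and edges are invented, and `build_signed_network` and `targets` are hypothetical helper names, not part of any tool mentioned in the talk.

```python
from collections import defaultdict

# Hypothetical toy network (gene names invented for illustration).
# Each edge is (regulator, target, sign): +1 = up-regulation, -1 = down-regulation.
EDGES = [("gA", "gB", +1), ("gB", "gC", -1), ("gA", "gC", +1)]


def build_signed_network(edges):
    """Return {regulator: {target: sign}} for a list of signed directed edges."""
    network = defaultdict(dict)
    for regulator, target, sign in edges:
        if sign not in (+1, -1):
            raise ValueError(f"sign must be +1 or -1, got {sign!r}")
        network[regulator][target] = sign
    return dict(network)


def targets(network, regulator, sign):
    """Sorted list of genes that `regulator` regulates with the given sign."""
    return sorted(t for t, s in network.get(regulator, {}).items() if s == sign)


network = build_signed_network(EDGES)
print(targets(network, "gA", +1))  # genes promoted by gA
print(targets(network, "gB", -1))  # genes suppressed by gB
```

A nested dictionary keeps the edge direction and the sign together, which is all the Boolean network model on this slide requires.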

MOLECULAR ANNOTATION

Similar systems involve different molecules (genes, proteins) in different species

Functional annotation of genes provides a unified understanding of the underlying principles

Gene Ontology: A library of molecular annotation
    Molecular function: What is the role of a gene?
    Biological process: In which processes is a gene involved?
    Cellular component: Where is a gene’s product localized?

We refer to each annotation class as a functional attribute

FROM MOLECULES TO SYSTEMS

Networks are species-specific

Annotation is at the molecular level

Map networks from gene space to function space
    Can generate a library of annotated (sub-)networks

[Figure: Network of Gene Ontology terms based on significance of pairwise interactions in the S. cerevisiae Synthetic Gene Array (SGA) network (Tong et al., Science, 2004)]

INDIRECT REGULATION

[Figure: example gene networks over genes g1 to g6]

Assessment of pairwise interactions is simple, but not adequate


FUNCTIONAL ATTRIBUTE NETWORK

Multigraph model
    A gene is associated with multiple functional attributes
    A functional attribute is associated with multiple genes
    Functional attributes are represented by nodes
    Genes are represented by ports, reflecting context

[Figure: a gene network over genes g1 to g6 and the corresponding functional attribute network]
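A minimal sketch of the mapping from a gene network to a functional attribute multigraph might look as follows. The toy edges and annotations are invented, and the representation (a dictionary from attribute pairs to the list of gene pairs acting as ports) is an assumption made for illustration, not the data structure used by the authors.

```python
from collections import defaultdict

# Hypothetical toy data: directed gene edges and gene -> functional attributes.
GENE_EDGES = [("g1", "g2"), ("g3", "g2"), ("g2", "g4")]
ANNOTATION = {"g1": {"A"}, "g2": {"B"}, "g3": {"A", "C"}, "g4": {"C"}}


def functional_attribute_network(gene_edges, annotation):
    """Map every gene edge (u, v) to one multi-edge per attribute pair (a, b),
    where a annotates u and b annotates v. The gene pair is kept as the
    'ports' of the multi-edge so the gene-level context is not lost."""
    multigraph = defaultdict(list)
    for u, v in gene_edges:
        for a in sorted(annotation.get(u, ())):
            for b in sorted(annotation.get(v, ())):
                multigraph[(a, b)].append((u, v))
    return dict(multigraph)


fan = functional_attribute_network(GENE_EDGES, ANNOTATION)
for (a, b), ports in sorted(fan.items()):
    print(f"{a} -> {b}: {ports}")
```

Because g3 carries two attributes, the single gene edge g3 -> g2 contributes to two different attribute pairs, which is why the attribute-level structure is a multigraph rather than a simple graph.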

FREQUENCY OF A MULTIPATH

A pathway of functional attributes occurs in various contexts in the gene network
    Multipath in the functional attribute network

[Figure: example multipaths and their frequencies; the values shown are 4 and 0]
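Counting the frequency of a multipath can be sketched as below. The sketch assumes that an occurrence of the attribute sequence T1, T2, …, Tk is a directed walk g1 -> g2 -> … -> gk in the gene network in which gene gi is annotated with Ti; whether repeated genes should be excluded is not stated on the slide, so that choice is an assumption. The toy data and the function name `multipath_frequency` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: directed gene edges and gene -> functional attributes.
GENE_EDGES = [("g1", "g2"), ("g3", "g2"), ("g2", "g4"), ("g2", "g5")]
ANNOTATION = {"g1": {"A"}, "g3": {"A"}, "g2": {"B"}, "g4": {"C"}, "g5": {"C"}}


def multipath_frequency(gene_edges, annotation, attributes):
    """Count gene walks whose i-th gene is annotated with attributes[i]."""
    successors = defaultdict(set)
    for u, v in gene_edges:
        successors[u].add(v)
    # counts[g] = number of occurrences of the current prefix that end at gene g
    counts = {g: 1 for g, attrs in annotation.items() if attributes[0] in attrs}
    for attribute in attributes[1:]:
        extended = defaultdict(int)
        for gene, count in counts.items():
            for nxt in successors[gene]:
                if attribute in annotation.get(nxt, ()):
                    extended[nxt] += count
        counts = extended
    return sum(counts.values())


print(multipath_frequency(GENE_EDGES, ANNOTATION, ["A", "B", "C"]))
print(multipath_frequency(GENE_EDGES, ANNOTATION, ["C", "A"]))
```

The dynamic program extends prefixes one attribute at a time, so the cost grows with the number of gene edges per step rather than with the number of occurrences.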

SIGNIFICANCE OF A PATHWAY

We want to identify multipaths with unusual frequency
    These might correspond to modular pathways

Frequency alone is not a good measure of statistical significance
    The distribution of functional attributes among genes is not uniform
    The degree distribution in the gene network is highly skewed
    Pathways that contain common functional attributes have high frequency, but they are not necessarily interesting

STATISTICAL INTERPRETABILITY

We are interested in identifying statistically over-represented patterns
    Null hypothesis: the pattern is sparse
    Additional positive observation => more significance
    Additional negative observation => less significance

[Figure: patterns A, B, and B’, with P(B) < P(A) and P(B’) > P(A)]

MONOTONICITY

Frequency is a monotonic measure
    If a pathway is frequent, then all of its sub-paths are frequent
    Algorithmic advantage: enumerate all frequent patterns in a bottom-up fashion
    Commonly exploited in traditional data mining applications

Statistically interpretable measures are not monotonic!
    Statistical significance fluctuates in the search space
    Existing data mining algorithms do not apply
    Significance of pathways is non-monotonic in two dimensions: GO hierarchy & path space

GO HIERARCHY

Functional attributes are organized in a hierarchical manner
    “regulation of steroid biosynthetic process” is a “regulation of steroid metabolic process” and is part of “steroid biosynthetic process”

Statistically interpretable measures are not monotonic with respect to the GO hierarchy
    A pattern corresponding to a child may be more significant or less significant than that corresponding to its parent

Common example: Identification of significantly enriched GO terms in a set of genes (Ontologizer, VAMPIRE)

MONOTONICITY W.R.T. GO

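[Figure: a gene network over genes g1 to g5 and a GO DAG whose terms annotate the gene sets {g1, g2, g3}, {g1, g2, g4}, {g1, g2}, {g3}, and {g4}, comparing the p-values of patterns at different levels of the DAG]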

PATHWAY LENGTH

[Figure: p-values of pathways of different lengths; significance may increase or decrease as a pathway is extended]

Open problems
    How can we effectively search in the pathway space, where significance fluctuates?
    How can we find optimal resolution in functional attribute space?



STATISTICAL MODEL: INSIGHT

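Emphasize modularity of pathways
    Condition on frequency of building blocks
    Evaluate the significance of the coupling of building blocks

[Figure: example gene network over genes g1 to g7 with the frequencies φ of a pathway and its building blocks (values shown: 2, 5, and 4) and the resulting comparison of p-values]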

STATISTICAL MODEL: FORMULATION

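We denote each frequency random variable by Φ and its realization by φ

Significance of pathway π123 (p123):

$$p_{123} = P(\Phi_{123} \geq \varphi_{123} \mid \Phi_{12} = \varphi_{12}, \Phi_{23} = \varphi_{23}, \Phi_1 = \varphi_1, \Phi_2 = \varphi_2, \Phi_3 = \varphi_3)$$

[Figure: pathway π123 and its frequency variables Φ1, Φ2, Φ3, Φ12, Φ23, Φ123]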

SIGNIFICANCE OF A PATHWAY

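Assume that regulatory interactions are independent
    There are φ12 φ23 possible pairs of π12 and π23 edges
    The probability that a pair of π12 and π23 edges goes through the same gene (corresponding to an occurrence of π123) is 1/φ2
    The probability that at least φ123 of these pairs go through the same gene can be bounded by

$$p_{123} \leq \exp\left(\varphi_{12}\varphi_{23} H_q(t)\right), \quad q = 1/\varphi_2, \quad t = \varphi_{123} / (\varphi_{12}\varphi_{23})$$

    where

$$H_q(t) = t \log\frac{q}{t} + (1-t) \log\frac{1-q}{1-t}$$

    is the negated Kullback-Leibler divergence between Bernoulli distributions with parameters t and q (the bound is informative when t > q)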
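The bound on this slide is easy to evaluate numerically. The sketch below follows the formulas as stated (q = 1/φ2, t = φ123 / (φ12 φ23)); the function names are hypothetical, and returning 1.0 when t ≤ q is a choice made here because the bound carries no information in that regime.

```python
import math


def h_q(q, t):
    """H_q(t) = t log(q/t) + (1-t) log((1-q)/(1-t)), for 0 < q < 1 and 0 < t <= 1.

    This is the negated KL divergence, so the value is never positive."""
    first = t * math.log(q / t)
    # Convention 0 * log(0) = 0 when t == 1.
    second = 0.0 if t == 1.0 else (1.0 - t) * math.log((1.0 - q) / (1.0 - t))
    return first + second


def pathway_pvalue_bound(phi12, phi23, phi2, phi123):
    """Upper bound exp(phi12 * phi23 * H_q(t)) on the p-value p123."""
    pairs = phi12 * phi23
    if pairs == 0 or phi2 == 0:
        return 1.0
    q = 1.0 / phi2
    t = phi123 / pairs
    if t <= q:
        # Observed frequency is not above expectation: use the trivial bound.
        return 1.0
    return math.exp(pairs * h_q(q, t))


# Hypothetical counts: 10 edges of each kind, 20 genes with the middle attribute.
bound = pathway_pvalue_bound(phi12=10, phi23=10, phi2=20, phi123=15)
print(f"p123 <= {bound:.3e}")
```

Working with the exponent directly avoids summing binomial tail terms, which matters when φ12 φ23 is large.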

BASELINE MODEL

A single regulatory interaction is the shortest pathway
    Arbitrary degree distribution: the number of edges leaving and entering each functional attribute is specified
    Edges are assumed to be independent

The frequency of a regulatory interaction is a hypergeometric random variable
    Can derive a similar bound for the p-value of a single regulatory interaction
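For small networks the hypergeometric tail can be computed exactly instead of bounded. The slide does not give the parameters of the hypergeometric variable, so the parameterization in `interaction_pvalue` (population = total number of edges, successes = edges entering the target attribute, draws = edges leaving the source attribute) is purely an assumption for illustration; only the tail computation itself is standard.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """Exact P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(max(observed, 0), upper + 1)
    )
    return numerator / comb(population, draws)


def interaction_pvalue(total_edges, out_degree_i, in_degree_j, phi_ij):
    """p-value of observing at least phi_ij edges from attribute i to attribute j.

    ASSUMPTION (not from the slides): the out_degree_i edges leaving attribute i
    are treated as draws from total_edges edge slots, of which in_degree_j
    enter attribute j."""
    return hypergeom_tail(total_edges, in_degree_j, out_degree_i, phi_ij)


# Hypothetical degrees for two functional attributes.
print(interaction_pvalue(total_edges=100, out_degree_i=10, in_degree_j=10, phi_ij=5))
```

Integer arithmetic with `math.comb` keeps the numerator exact, so the only rounding happens in the final division.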

ALGORITHMIC ISSUES

Significance is not monotonic
    Need to enumerate all pathways?

Strongly significant pathways
    A pathway is strongly significant if all of its building blocks and their coupling are significant (defined recursively)
    Allows pruning the search space effectively

Shortcutting common functional attributes
    Transcription factors, DNA binding genes, etc. are responsible for mediating regulation
    Shortcut these terms, consider regulatory effect of different processes on each other directly
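The recursive definition of strongly significant pathways given above can be sketched as follows. Several things here are assumptions: the building blocks of a pathway T1, …, Tk are taken to be its prefix T1, …, Tk-1 and its suffix T2, …, Tk (generalizing the π12 and π23 blocks of π123), `p_value` is a hypothetical callable returning the p-value of the coupling for a given pathway, and the threshold `alpha` is arbitrary.

```python
from functools import lru_cache


def make_strongly_significant(p_value, alpha=0.01):
    """Build a memoized predicate over pathways given as tuples of attributes."""

    @lru_cache(maxsize=None)
    def strongly_significant(path):
        if len(path) < 2:
            raise ValueError("a pathway needs at least one regulatory interaction")
        if p_value(path) > alpha:
            return False  # the coupling itself is not significant
        if len(path) == 2:
            return True  # base case: a single significant interaction
        # Recursive case: both building blocks must be strongly significant.
        return strongly_significant(path[:-1]) and strongly_significant(path[1:])

    return strongly_significant


# Hypothetical p-values for a handful of pathways.
P_VALUES = {
    ("A", "B"): 0.001,
    ("B", "C"): 0.5,
    ("B", "D"): 0.002,
    ("A", "B", "C"): 0.0001,
    ("A", "B", "D"): 0.003,
}
is_strong = make_strongly_significant(lambda path: P_VALUES.get(path, 1.0))
print(is_strong(("A", "B", "C")))  # rejected: building block B -> C is not significant
print(is_strong(("A", "B", "D")))
```

Under this definition a search only needs to extend pathways that are already strongly significant, which is what makes bottom-up pruning possible even though significance itself is not monotonic.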

NARADA

http://www.cs.purdue.edu/homes/jpandey/narada/

A software tool for identifying significant pathways

Queries
    Given functional attribute T, find all significant pathways that originate at T
    Given functional attribute T, find all significant pathways that terminate at T
    Given a sequence of functional attributes T1, T2, …, Tk, find all occurrences of the corresponding pathway

Identified pathways are displayed as a tree
    User can explore back and forth between the gene network and the functional attribute network

RESULTS

E. coli transcription network obtained from RegulonDB
    3159 regulatory interactions between 1364 genes
    Using Gene Ontology, 881 of these genes are mapped to 318 processes

Pathway length             2     3     4     5
All                      427   580  1401   942
Strongly significant     427   208   183   142
Common terms shortcut    184   119     3     1

MOLYBDATE ION TRANSPORT

[Figure: Significant regulatory pathways that originate at molybdate ion transport, and their occurrences in the gene network]

WHAT IS SIGNIFICANT?

Molybdate ion transport regulates various processes directly
    Mo-molybdopterin cofactor biosynthesis, oligopeptide transport, cytochrome complex assembly

It regulates various other processes indirectly
    Through DNA-dependent regulation of transcription, two-component signal transduction system, nitrate assimilation
    Regulation of these mediator processes is not significant by itself!
    NARADA captures modularity of indirect regulation!

CONCLUSION

Mapping gene regulatory networks to functional attribute space demonstrates great potential
    Abstract, unified understanding of regulatory systems

Algorithmically, a wide range of new challenges
    Bounding interpretable statistical measures
    Handling resolution in functional attribute space
    Generalizing the definition of a pathway

Discovering new information
    Projecting identified “canonical” patterns on other networks to discover new regulatory relationships

ACKNOWLEDGMENTS

Ananth Grama

Wojciech Szpankowski

Shankar Subramaniam

Yohan Kim

Jayesh Pandey
