Seminar in Bioinformatics: An efficient algorithm for detecting frequent subgraphs in biological networks. Paper by: M. Koyuturk, A. Grama and W. Szpankowski. Appeared in: Bioinformatics, Vol. 20, Sup. 1, 2004, pages i200-i207. Presented by: Royi Ronen

Dec 20, 2015


Seminar in Bioinformatics

An efficient algorithm for detecting frequent subgraphs in biological networks

Paper by: M. Koyuturk, A. Grama and W. Szpankowski

Appeared in: Bioinformatics, Vol. 20, Sup. 1, 2004, pages i200-i207.

Presented by: Royi Ronen


Abstract

• Motivation

– Network interaction data is abundant

– Analyzing this data is important

– The problems are close to the subgraph isomorphism problem, which is hard

• Results

– An efficient algorithm for detecting frequently occurring patterns in biological networks

– The algorithm simplifies the subgraph isomorphism problem to a different, tractable problem with biological applications

– Mining the KEGG database yields positive empiric results


Outline

• Introduction

• Model

• Approach: Graph Mining

– Related Work

– Formalism for metabolic pathways

– The Algorithm

• Discussion and Empiric Results

• Conclusion

• Future Work


Introduction

• Experimental data on biological sequences, which are widely available and accessible, play an important role in tasks such as discovering common sequences and motifs

• Biomolecular interaction data are abstracted as graphs

– Example: a hypergraph can represent a metabolic pathway, where nodes represent compounds

– It can be reduced to a directed graph where nodes are enzymes and edges relate them


Introduction

• Key problems in this context:

– Aligning multiple graphs

– Finding frequently occurring subgraphs in a collection of graphs

• A solution can lead to the understanding of

– Motifs of cellular interactions

– Evolutionary relationships

– Differences between networks in different organisms

– Patterns of gene regulation


Introduction

• In the paper

– Finding frequently occurring subgraphs in a collection of graphs, each representing a metabolic pathway

– Close to the NP-hard subgraph isomorphism problem

– End of story?

• No!

– The problem can be simplified and made tractable while still capturing the biological information

– Nodes will be “uniquely labeled” according to the enzyme they represent

– Experimental results: discovering “interesting” patterns from KEGG takes seconds


Outline

• Introduction ☺

• Model

• Approach: Graph Mining

– Related Work

– Formalism for metabolic pathways

– The Algorithm

• Discussion and Empiric Results

• Conclusion

• Future Work


Metabolic Pathways

• Oldest kind of biological network

• Group the reactions that belong to a process

• Publicly available (e.g., KEGG)

• Chemical compounds are linked to each other by a product-substrate relationship

• In a hypergraph

– Nodes are compounds

– A hyperedge is a reaction (or an enzyme)

– Hyperedge direction is important to distinguish between substrates and products

[Figure: example hypergraph with nodes a, b and c]


Metabolic Pathways

• Simplification:

– Ordinary directed graph: nodes represent enzymes, and an edge connects enzyme a to enzyme b iff a’s product is b’s substrate (more accurately, iff at least one such relation exists)

– Edges may be labeled by the compound that relates a to b

– A specific enzyme may appear more than once in the same pathway; such occurrences are merged into one node, at the price of losing temporal information

• Various problems related to understanding molecular interactions in the cell can be solved using graph-based frameworks, mostly by providing a means to investigate units with well-defined functionality

• Paper focus: mining pathways for frequent connected subgraphs, which is important because functional modules are expected to repeat across several pathways, organisms, or both

[Figure: edge from enzyme a to enzyme b, labeled with the relating compound (com.)]


Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining

– Related Work

– Formalism for metabolic pathways

– The Algorithm

• Discussion and Empiric Results

• Conclusion

• Future Work


Related Work

• Subgraph isomorphism

– Unlabeled version: hardness is usually “tackled” by ordering nodes and edges for efficient processing

– Labeled version: easier, suitable for biological networks

• Frequent itemset mining

– Multiple sets of items (transactions) from a domain D are given

– Itemset X implies itemset Y with confidence c if c% of the sets containing X also contain Y

– X→Y has support s if s% of the sets contain both X and Y
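
The support and confidence definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy transactions are assumptions, not taken from the paper, and the values are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    with_x = [t for t in transactions if x <= t]
    return sum(y <= t for t in with_x) / len(with_x) if with_x else 0.0


# Hypothetical toy transactions, not data from the paper.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
rule_support = support(transactions, {"a", "b"})          # X and Y together
rule_confidence = confidence(transactions, {"a"}, {"b"})  # X -> Y
```

Here the rule {a}→{b} is supported by two of the four transactions, and two of the three transactions containing a also contain b.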


Graph Formalism for Metabolic Pathways

• A Metabolic Pathway is a triplet, P(M,Z,R)

– M, a set of metabolites

– Z, a set of enzymes

– R, a set of reactions, where each reaction r is associated with

• A set of enzymes Z(r) from Z

• A set of substrates S(r) from M

• A set of products T(r) from M

[Figure: pathway diagram with legend: metabolite, enzyme]


Graph Formalism for Metabolic Pathways

• A graph G(V,E) for P(M,Z,R) is defined as follows

– For every enzyme zi in Z, a node vi exists in V

– (vi,vj) is in E iff zj consumes a product of zi

• Example: [Figure with legend: enzyme, metabolite]
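
The construction of G(V,E) from P(M,Z,R) can be sketched as follows. This is a minimal illustration under stated assumptions: each reaction is given as a triple (Z(r), S(r), T(r)), and the function name, enzyme names and metabolite names are all hypothetical.

```python
from itertools import product


def enzyme_graph(reactions):
    """Directed edges (zi, zj) such that zj consumes a product of zi.

    `reactions` is a list of (enzymes, substrates, products) triples,
    mirroring Z(r), S(r) and T(r) from the formalism.
    """
    edges = set()
    for (enz_i, _, prod_i), (enz_j, subs_j, _) in product(reactions, repeat=2):
        if set(prod_i) & set(subs_j):  # a product of r_i is a substrate of r_j
            edges.update((zi, zj) for zi in enz_i for zj in enz_j)
    return edges


# Hypothetical pathway: z1 turns m1 into m2; z2 and z3 both consume m2.
reactions = [
    ({"z1"}, {"m1"}, {"m2"}),
    ({"z2"}, {"m2"}, {"m3"}),
    ({"z3"}, {"m2"}, {"m4"}),
]
edges = enzyme_graph(reactions)
```

In this toy pathway the resulting graph has the two edges (z1, z2) and (z1, z3). Because every ordered pair of reactions is checked, the sketch is quadratic in the number of reactions, which is fine for illustration.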


Mining Metabolic Pathways

• The problem: given a collection of n graphs and a support threshold ε, find all maximal connected subgraphs that are contained in at least εn of the graphs

• The support of a subgraph that appears in n’ graphs is n’/n

• A frequent subgraph is maximal if it is not contained in another frequent subgraph


Subgraph Isomorphism Simplified

• Nodes are labeled by enzyme identifiers

• Edges alone suffice to define a graph, since each edge’s label identifies its two end nodes

• Edges are items, uniquely specified by labels that refer to enzymes

• The problem can therefore be reduced to frequent itemset mining

• In the example, the graph G1 is {ab,ac,de}

• Connectivity still has to be considered
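
With unique labels, a graph is just a set of edge labels, so "subgraph of" becomes "subset of". The sketch below shows support and maximality in that representation. G1 is the edge set given on the slide; G2, G3 and G4 and the function names are hypothetical, chosen only for illustration.

```python
def subgraph_support(graphs, edge_set):
    """Fraction of graphs containing every edge of `edge_set`."""
    edge_set = frozenset(edge_set)
    return sum(edge_set <= g for g in graphs) / len(graphs)


def maximal_only(frequent_sets):
    """Drop every edge set strictly contained in another one."""
    sets = [frozenset(s) for s in frequent_sets]
    return [s for s in sets if not any(s < other for other in sets)]


# G1 is taken from the slide; G2, G3 and G4 are hypothetical.
graphs = [
    frozenset({"ab", "ac", "de"}),
    frozenset({"ab", "ac", "bd", "de"}),
    frozenset({"ab", "ac", "ce"}),
    frozenset({"bd", "de", "ce"}),
]
support_ab_ac = subgraph_support(graphs, {"ab", "ac"})
maximal = maximal_only([{"ab"}, {"ac"}, {"ab", "ac"}, {"de"}])
```

The subset test replaces an isomorphism test, which is what makes the simplified problem tractable. Connectivity is not checked here.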


Subgraph Homeomorphism Simplified

• A connected edgeset corresponds to a connected subgraph

– A unique edge is a set of two node labels

– A set of unique edges ES={e1, e2, …, ek} is called connected iff every nonempty proper subset ES’ of ES shares at least one node with the remaining edges ES\ES’

• Connection to frequent itemset mining

– Input graphs correspond to transactions

– Connected edgesets correspond to itemsets

– Approach: build frequent sets bottom-up (small to large)

– Each added edge is a neighbor of the current edgeset, so edge addition preserves connectivity
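
The connectivity condition on an edgeset is equivalent to the edges forming a single connected component, which can be checked by growing a set of reached nodes. The sketch below assumes each edge is written as a two-character label such as "ab"; the function name is hypothetical, and treating the empty edgeset as not connected is a convention chosen here.

```python
def is_connected(edge_set):
    """True if the edges form one connected component (direction ignored)."""
    edges = [frozenset(e) for e in edge_set]  # "ab" -> {"a", "b"}
    if not edges:
        return False  # convention chosen for this sketch
    reached = set(edges[0])  # nodes reachable so far
    pending = edges[1:]
    grew = True
    while pending and grew:
        touching = [e for e in pending if e & reached]
        grew = bool(touching)
        for e in touching:
            reached |= e
        pending = [e for e in pending if e not in touching]
    return not pending


connected_example = is_connected({"ab", "ac"})     # share node a
disconnected_example = is_connected({"ab", "de"})  # no shared node
```

A bottom-up miner never needs this full check if it only ever adds edges that touch the current edgeset, which is the point of the last bullet above.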


Subgraph Homeomorphism Simplified

• Throughout the search, only connected edgesets are considered

– This captures the connected nature of pathways

• It is important to avoid the redundancy that comes from considering the same sets in different orders


The Algorithm


The Algorithm

• The procedure is invoked for each frequent edge ei

– Mine({}, {ei}, N(ei), {e1,e2,…,ek})

• The support threshold is enforced by the “if frequent” statement

• Example: consider 5 enzymes, a, b, c, d and e, each of which may or may not take part in each of 4 pathways G1, G2, G3, G4

• We mine with support threshold ¾


Example

ab, ac and de are the only frequent edges

Mine({}, {ab}, N(ab), {ab,ac,bd,de,ce})

Mine({}, {ac}, N(ac), {ab,ac,bd,de,ce})

Mine({}, {de}, N(de), {ab,ac,bd,de,ce})

{ab,ac} and {de} are the maximal frequent subgraphs
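
A bottom-up search of this kind can be sketched in Python as below. This is not the paper's Mine procedure: it is a simplified stand-in that grows connected edgesets one neighboring edge at a time and uses a `seen` set to skip edgesets reached in a different order. G1 is taken from the earlier slide; G2, G3 and G4 are hypothetical graphs chosen so that the frequent edges match the example, and all names are assumptions.

```python
def mine_frequent_subgraphs(graphs, min_support):
    """Return the maximal frequent connected edge sets."""
    n = len(graphs)

    def is_frequent(edge_set):
        return sum(edge_set <= g for g in graphs) >= min_support * n

    all_edges = set().union(*graphs)
    frequent_edges = {e for e in all_edges if is_frequent(frozenset([e]))}

    seen = set()   # edge sets already expanded, to avoid redundant work
    frequent = []  # every frequent connected edge set found

    def grow(current):
        if current in seen:
            return
        seen.add(current)
        frequent.append(current)
        nodes = set().union(*map(set, current))
        for e in frequent_edges - current:
            # Only edges touching the current nodes are tried,
            # so every extension stays connected.
            if set(e) & nodes and is_frequent(current | {e}):
                grow(current | {e})

    for e in frequent_edges:
        grow(frozenset([e]))
    return [s for s in frequent if not any(s < t for t in frequent)]


# G1 is from the slides; G2, G3 and G4 are hypothetical.
graphs = [
    frozenset({"ab", "ac", "de"}),
    frozenset({"ab", "ac", "bd", "de"}),
    frozenset({"ab", "ac", "ce"}),
    frozenset({"bd", "de", "ce"}),
]
result = mine_frequent_subgraphs(graphs, 0.75)
```

On these graphs the search finds {ab,ac} and {de}, in agreement with the example. The `seen` set trades memory for simplicity; it is one way to avoid revisiting the same edgeset, not necessarily the one used in the paper.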


Example

{ab,ac} and {de} are the maximal frequent subgraphs

[Figure: mining development]


Polynomial Bound

• The paper does not prove a complexity bound; it only justifies “efficiency” empirically

• We show a polynomial bound on the time complexity

– Determining which edges are frequent can be done by sorting

– Determining the neighbors of an edge is linear (requires one pass)

– At every level of the recursion, the algorithm extends a frequent subgraph with a new frequent edge; this amounts to a linear number of extension steps

– Each such step can be done in time polynomial in n, where n is the number of edges in the input
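
The first two steps of this argument can be sketched as follows. Note the swap: the slide suggests sorting to find frequent edges, while this sketch counts with a hash-based `Counter`, which serves the same purpose. The function names are hypothetical, and G2, G3 and G4 are hypothetical graphs (only G1 appears on the slides).

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Edges that occur in at least min_support * n of the n graphs."""
    # Each graph is a set, so an edge is counted at most once per graph.
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c >= min_support * len(graphs)}


def neighbours(edge, edges):
    """N(edge): edges sharing an end node with `edge`, found in one pass."""
    return {e for e in edges if e != edge and set(e) & set(edge)}


# G1 is from the slides; G2, G3 and G4 are hypothetical.
graphs = [
    frozenset({"ab", "ac", "de"}),
    frozenset({"ab", "ac", "bd", "de"}),
    frozenset({"ab", "ac", "ce"}),
    frozenset({"bd", "de", "ce"}),
]
freq = frequent_edges(graphs, 0.75)
n_ab = neighbours("ab", {"ab", "ac", "bd", "de", "ce"})
```

Counting takes one pass over all edge occurrences, and each neighbor lookup takes one pass over the edge list.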


Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining ☺

– Related Work ☺

– Formalism for metabolic pathways ☺

– The Algorithm ☺

• Discussion and Empiric Results

• Conclusion

• Future Work


Empiric Results

• The bold subgraph was mined and appears in 29% of the organisms in KEGG

• The solid subgraph appears in 19.3%

• The entire graph appears in 14.2%

[Figure: Glutamate]


Empiric Results

[Figures, left to right: 32.1%, 19.2%, 11.5% | 25.6%, 21.8%, 15.4%]

[Figure labels, left to right: Alanine-aspartate | Pyrimidine]


Empiric Results

• Run time results on a Pentium 4, 2 GHz, 0.5 GB of RAM

• A sub-pathway of 16 edges was discovered in 3 sec.

• The entire graph appears in 14.2%


Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining ☺

– Related Work ☺

– Formalism for metabolic pathways ☺

– The Algorithm ☺

• Discussion and Empiric Results ☺

• Conclusion

• Future Work


Conclusion

• Framework for mining biological networks

• Graph simplification without losing biological meaning

• Efficient graph mining

• Good response times


Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining ☺

– Related Work ☺

– Formalism for metabolic pathways ☺

– The Algorithm ☺

• Discussion and Empiric Results ☺

• Conclusion ☺

• Future Work


Future Work

• Adding flexibility for capturing biologically meaningful information and concepts, for example through probabilistic methods

• Probabilistic models for assessing the significance of discovered patterns (unlike the previous case, here probability does not model the biology)

• Approximate rather than exact matching

– What is an approximation in this case? A suitable definition is needed


NEXT PAPER (IN BRIEF)…


Seminar in Bioinformatics

Pairwise Local Alignment of Protein Interaction Networks Guided by Models of Evolution

Paper by: M. Koyuturk, A. Grama and W. Szpankowski

Appeared in: Journal of Comp. Biology, 13(2), 182-199, 2006.

Presented by: Royi Ronen


The Problem

• Protein-protein interaction (PPI) networks are modeled as graphs

• A PPI network is an undirected graph (V,E)

– Elements of V represent proteins

– Elements of E represent pairs of proteins that interact

• The paper solves the problem of aligning two graphs (rather than many)


Homology Function S(•,•)

• Consider two graphs: G(U,E), H(V,F)

• For each pair of nodes from the union of U and V, S assigns a score:

– If the two proteins belong to the same species, the confidence that they are paralogous; if they belong to different species, the confidence that they are orthologous. 0 is the lowest value

– Values of S are determined by an algorithm outside the scope of the paper (INPARANOID)

• Some definitions:

– Match: a conserved interaction between orthologous pairs

– Mismatch: a lack of interaction between a pair whose orthologs interact

– Duplication: paralogous proteins (which tend to diverge in the long run)


Proposed Solution

• Every pair of node subsets (one from each graph) induces an alignment {M,N,D}, which is associated with a score

• M: pairs of edges whose end nodes have positive S values and which exist in both graphs. Each is associated with a positive score

• N: pairs of edges whose end nodes have positive S values and which exist in one graph but not in the other. Each is associated with a negative score

• D: pairs of nodes from the same graph with positive S. Each is associated with a negative score

• The total score is the sum of all these scores, and we wish to find alignments with locally maximal scores
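
The structure of this score can be sketched as below. The sketch is an assumption-laden simplification: it uses fixed hypothetical weights (`match`, `mismatch`, `dup`) instead of whatever weighting the paper derives from S, it stores S as a dictionary keyed by unordered node pairs, and it assumes node names are distinct across the two graphs. All names and values are made up for illustration.

```python
from itertools import combinations


def alignment_score(E, F, S, P, Q, match=1.0, mismatch=-0.5, dup=-0.25):
    """Score the alignment induced by node subsets P (of G) and Q (of H).

    E and F are sets of frozenset node pairs; S maps frozenset pairs to scores.
    """
    def s(x, y):
        return S.get(frozenset((x, y)), 0.0)

    score = 0.0
    for u, u2 in combinations(sorted(P), 2):
        for v, v2 in combinations(sorted(Q), 2):
            for a, b in ((v, v2), (v2, v)):  # both ways of pairing end nodes
                if s(u, a) > 0 and s(u2, b) > 0:
                    in_g = frozenset((u, u2)) in E
                    in_h = frozenset((a, b)) in F
                    if in_g and in_h:
                        score += match      # M: conserved interaction
                    elif in_g or in_h:
                        score += mismatch   # N: present in one graph only
    for nodes in (P, Q):
        for x, y in combinations(sorted(nodes), 2):
            if s(x, y) > 0:
                score += dup                # D: similar nodes, same graph
    return score


# Hypothetical toy networks and similarity values.
E = {frozenset(("u1", "u2"))}
F = {frozenset(("v1", "v2"))}
S = {
    frozenset(("u1", "v1")): 1.0,
    frozenset(("u2", "v2")): 1.0,
    frozenset(("u2", "v3")): 0.8,
    frozenset(("v2", "v3")): 0.5,
}
total = alignment_score(E, F, S, {"u1", "u2"}, {"v1", "v2", "v3"})
```

In this toy case there is one match (u1-u2 with v1-v2), one mismatch (u1-u2 with the missing v1-v3) and one duplication (v2, v3), so the total is 1.0 - 0.5 - 0.25.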


Proposed Solution

• An algorithm is proposed in order to avoid considering all possible subsets

• The heuristic tries to expand a set so that its score increases

• Rings a bell?
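
The idea of expanding a set while its score improves can be sketched generically. This is a plain greedy local search, not the paper's heuristic; `greedy_expand`, the candidate names and the toy score are all hypothetical.

```python
def greedy_expand(seed, candidates, score):
    """Add one candidate at a time while the score strictly improves."""
    current = frozenset(seed)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for c in sorted(set(candidates) - current):
            trial = score(current | {c})
            if trial > best:
                current, best, improved = current | {c}, trial, True
                break  # restart the scan from the enlarged set
    return current, best


# Hypothetical score: each element contributes a fixed gain or penalty.
weights = {"a": 2, "b": -1, "c": 1}
found, value = greedy_expand({"a"}, weights, lambda s: sum(weights[x] for x in s))
```

The search stops at a set that no single addition improves, that is, a local maximum. The resemblance to the bottom-up growth of frequent edgesets in the first paper is the point of the question above.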


Experimental Results

• Using this alignment method, with INPARANOID as the scoring algorithm for S(•,•), the PPI networks of human and mouse were aligned

• Data taken from the DIP database

• Details:

– Homo sapiens: 1369 interactions between 1065 proteins

– Mus musculus: 286 interactions between 329 proteins


Experimental Results

• INPARANOID discovered 237 ortholog clusters

• 305 matched interactions were discovered, along with 205 mismatches and 536 duplications in human, and 149 mismatches and 384 duplications in mouse

• Examples:

– Conserved subnet with one-way mismatches

– Conserved subnet with two-way mismatches

– Duplications


Example 1

• Graphs aligned

• Biological meaning

– Similarity and differences between the species

– Insight on evolutionary events


Example 2

• Another graph alignment result with a locally maximal score


Example 3

• Instance of duplication between mouse and human

• The regulator regulates homologs