Lecture 4 Protein Function prediction using network concepts

1. Protein function prediction using network concepts

2. Application of network concepts in DNA sequencing

The topology of a protein-protein interaction (PPI) network is informative, but further analysis can reveal additional information.

A popular assumption, which holds in many cases, is that proteins of similar function interact with each other.

Based on this assumption, we have developed methods to predict protein functions and protein complexes from PPI networks, mainly based on cluster analysis.

Cluster Analysis

Cluster analysis, also called data segmentation, means grouping or segmenting a collection of objects into subsets or "clusters", such that the objects within each cluster are more closely related to one another than to objects assigned to different clusters.

In the context of a graph, densely connected groups of nodes are considered clusters.

Visually we can detect two clusters in this graph

K-cores of Protein-Protein Interaction Networks

Definition

Let a graph $G=(V, E)$ consist of a finite set of nodes $V$ and a finite set of edges $E$.

A subgraph $S=(V', E')$, where $V' \subseteq V$ and $E' \subseteq E$, is a k-core or a core of order k of G if and only if $\forall v \in V'$: $\deg(v) \geq k$ within S and S is the maximal subgraph with this property.

Graph G

1-core graph: the degree of every node is one or more

2-core graph: the degree of every node is two or more

3-core graph: the degree of every node is three or more

The 3-core is the highest k-core subgraph of the graph G
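The k-core definition can be illustrated with a minimal sketch that repeatedly deletes nodes of degree below k. The toy edge list `EDGES` and the function names `k_core` and `highest_core` are hypothetical and are not taken from the cited papers; the toy graph is not the graph G of the slides.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the node set of the k-core of an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # self-loops are ignored in this sketch
            adj[u].add(v)
            adj[v].add(u)
    removed = set()
    # Nodes whose degree is already below k must be deleted first.
    queue = [n for n in adj if len(adj[n]) < k]
    while queue:
        n = queue.pop()
        if n in removed:
            continue
        removed.add(n)
        for m in adj[n]:
            adj[m].discard(n)
            # Deleting n may push a neighbour below degree k.
            if m not in removed and len(adj[m]) < k:
                queue.append(m)
        adj[n].clear()
    return set(adj) - removed

def highest_core(edges):
    """Return (k, nodes) for the largest k with a non-empty k-core."""
    k, core = 0, k_core(edges, 0)
    while True:
        nxt = k_core(edges, k + 1)
        if not nxt:
            return k, core
        k, core = k + 1, nxt

# Hypothetical toy graph: a 4-clique a-b-c-d, node e attached to a and b,
# and node f attached only to e.
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("e", "a"), ("e", "b"), ("f", "e")]

print(k_core(EDGES, 2))
print(highest_core(EDGES))
```

In this toy graph, f drops out of the 2-core, e drops out of the 3-core, and no 4-core exists, so the 3-core (the 4-clique) is the highest k-core.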


Analyzing protein-protein interaction data obtained from different sources, G. D. Bader and C.W.V. Hogue, Nature biotechnology, Vol 20, 2002

Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks

“Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks and Amino Acid Sequences”, Md. Altaf-Ul-Amin, Kensaku Nishikata, Toshihiro Koma, Teppei Miyasato, Yoko Shinbo, Md. Arifuzzaman, Chieko Wada, Maki Maeda, Taku Oshima, Hirotada Mori, Shigehiko Kanaya The 14th International Conference on Genome Informatics December 14-17, 2003, Yokohama Japan.

The network contains a total of 3007 proteins and 11531 interactions.

Around 2000 of the proteins are function-unknown proteins.

The highest k-core of this whole graph is not very helpful.

10-core graph

We separate 1072 interactions (out of 11531) involving protein synthesis and function unknown proteins.

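Figure legend: interactions between protein synthesis (P. S.) and unknown function (U. F.) proteins; interactions between two protein synthesis proteins; unknown proteins.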

Function-unknown proteins of this 6-core graph are likely to be involved in protein synthesis.

We also separate 193 interactions (out of 11531) involving electron transport and function-unknown proteins.

The highest k-core of the graph formed by these interactions is the 2-core subgraph.

Function-unknown proteins of this 2-core graph are likely to be involved in electron transport.

Further sub-classification may be possible by combining other information with the k-core subgraph.

“Prediction of Protein Functions Based on Protein-Protein Interaction Networks: A Min-Cut Approach”, Md. Altaf-Ul-Amin, Toshihiro Koma, Ken Kurokawa, Shigehiko Kanaya, Proceedings of the Workshop on Biomedical Data Engineering (BMDE), Tokyo, Japan, pp. 37-43, April 3-4, 2005.

Outline

•Introduction

•The concept of Min-Cut

•Problem Formulation

•A Heuristic Method

•Evaluation of the Proposed Method

•Conclusions


Introduction

After the complete sequencing of several genomes, the challenging problem now is to determine the functions of proteins

1) Determining protein functions experimentally

2) Using various computational methods

a) sequence

b) structure

c) gene neighborhood

d) gene fusions

e) cellular localization

f) protein-protein interactions

The present work predicts protein functions based on the protein-protein interaction network.

For the purpose of prediction, we consider the interactions of

•function-unknown proteins with function-known proteins and

•function-unknown proteins with function-unknown proteins

in the context of the whole network.

Schwikowski, B., Uetz, P. and Fields, S. A network of protein-protein interactions in yeast. Nature Biotech. 18, 1257-1261 (2000)

Deals with a network of 2039 proteins and 2709 interactions.

65% of interactions occurred between protein pairs with at least one common function.

Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., and Takagi, T. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531 (2001)

Reported similar results.

So, the majority of protein-protein interactions are between protein pairs of similar function.

Therefore, we assign function-unknown proteins to different functional groups in such a way that the number of inter-group interactions becomes the minimum.

Hence we call the proposed approach a Min-Cut approach.


The concept of Min-Cut

Figure: A typical small network of known proteins (K) and unknown proteins (U1 to U4); the known proteins belong to groups G1 and G2.

Figure: Unknown proteins assigned to known groups based on majority interactions. Number of CUT = 4

Figure: An alternative assignment of unknown proteins. Number of CUT = 2

For every assignment of unknown proteins, there is a value of CUT.

The Min-Cut approach looks for an assignment for which the value of CUT is minimum.
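As a minimal sketch of how CUT can be evaluated for one assignment, the function below counts the interactions whose two endpoints lie in different groups. It assumes that each protein belongs to exactly one group (multi-function proteins are not handled) and that interactions between two known proteins are skipped. The toy network and the names `count_cut`, `EDGES` and `KNOWN_GROUP` are hypothetical and do not reproduce the network in the figures.

```python
def count_cut(edges, group_of, known):
    """Count interactions that join proteins assigned to different groups.

    edges    : iterable of (protein, protein) pairs
    group_of : dict mapping every protein to its (assigned) group
    known    : set of function-known proteins
    """
    cut = 0
    for a, b in edges:
        if a in known and b in known:
            continue  # assumption: known-known interactions never count
        if group_of[a] != group_of[b]:
            cut += 1
    return cut

# Hypothetical toy network: K1, K2 in G1; K3, K4 in G2; U1, U2 unknown.
KNOWN_GROUP = {"K1": "G1", "K2": "G1", "K3": "G2", "K4": "G2"}
EDGES = [("K1", "K2"), ("K1", "K3"), ("U1", "K1"), ("U1", "K3"),
         ("U1", "K4"), ("U2", "K1"), ("U2", "K2"), ("U2", "U1")]

known = set(KNOWN_GROUP)
assignment_a = {**KNOWN_GROUP, "U1": "G2", "U2": "G1"}
assignment_b = {**KNOWN_GROUP, "U1": "G2", "U2": "G2"}

print(count_cut(EDGES, assignment_a, known))  # 2
print(count_cut(EDGES, assignment_b, known))  # 3
```

The two assignments differ only in the group of U2, yet they give different CUT values, which is what the search for a minimum exploits.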


Problem Formulation

Let $G_1, G_2, \ldots, G_n$ be n sets/groups of function-known proteins such that all proteins of a group are of similar function. Multiple-function proteins are members of more than one group. Therefore, the set of all function-known proteins is $G = \bigcup_{k=1}^{n} G_k$. The set of function-unknown proteins is denoted by $U$. $N = (V, E)$ is a graph/network where $v_i \in V$ is a node representing a protein and $e_{ij} = (v_i, v_j) \in E$ is an edge representing …….

Here we explain some points with a typical example.

Figure: An example network N of known proteins K1 to K10, which belong to groups G1, G2 and G3, and unknown proteins U1 to U8.

$N = (V, E)$

V = set of all nodes

E = set of all edges

G = {K1, K2, K3, K4, K5, K6, K7, K8, K9, K10}

U = {U1, U2, U3, U4, U5, U6, U7, U8}

We generate $U' \subseteq U$ such that each protein of U´ is connected in N with at least one protein of the set G by a path of length 1 or length 2.

U´ = {U1, U2, U3, U4, U5, U6, U7}
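The selection of U´ can be sketched as a two-step neighbourhood check: an unknown protein is kept if a known protein appears among its neighbours or among the neighbours of its neighbours. The toy chain below and the name `select_u_prime` are hypothetical, and the intermediate node of a length-2 path is allowed to be either known or unknown, which is an assumption.

```python
from collections import defaultdict

def select_u_prime(edges, known, unknown):
    """Return the unknown proteins within distance 2 of some known protein."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    u_prime = set()
    for u in unknown:
        one_hop = adj.get(u, set())      # endpoints of paths of length 1
        two_hop = set()
        for n in one_hop:                # endpoints of paths of length 2
            two_hop |= adj[n]
        if (one_hop | two_hop) & known:
            u_prime.add(u)
    return u_prime

# Hypothetical toy chain: K1 - U1 - U2 - U3
EDGES = [("K1", "U1"), ("U1", "U2"), ("U2", "U3")]
KNOWN = {"K1"}
UNKNOWN = {"U1", "U2", "U3"}

print(sorted(select_u_prime(EDGES, KNOWN, UNKNOWN)))  # ['U1', 'U2']
```

U3 is three steps away from the only known protein, so it is left out, in the same way that U8 is left out of U´ in the example.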

Figure: The example network with the proteins of U´ assigned to the groups G1, G2 and G3.

We can assign the proteins of U´ to different groups and calculate CUT. For this assignment of unknown proteins, CUT = 6.

Interactions between known protein pairs can never be part of CUT.

The problem we are trying to solve is to assign the proteins of set U´ to the known groups G1, G2, ……, Gn in such a way that CUT becomes the minimum.


A Heuristic Method

•The problem at hand is a variant of the network partitioning problem.

•It is known that network partitioning problems are NP-hard.

•Therefore, we resort to heuristics to find as good a solution as possible.

1) Set min_cut = |E| and iteration = 0.

2) Make a table for each protein of U´ containing at most 3 IDs of its priority groups.

3) Assign each protein of U´ to a randomly or intentionally chosen group from among its priority groups.

4) Calculate CUT.

5) If CUT < min_cut (YES), set min_cut = CUT and record the current assignment.

6) Set iteration = iteration + 1.

7) If iteration < max_value (YES), go back to step 3. Otherwise (NO), print min_cut and the corresponding assignment, and exit.
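The loop above can be sketched as a random-restart search. This is a minimal sketch under stated assumptions, not the authors' implementation: groups are always chosen at random (the "intentionally chosen" option is not modelled), each protein has exactly one group, and the toy network, the priority table and the names `count_cut` and `min_cut_search` are hypothetical.

```python
import random

def count_cut(edges, group_of, known):
    """Count interactions between different groups, skipping known-known pairs."""
    return sum(1 for a, b in edges
               if not (a in known and b in known)
               and group_of[a] != group_of[b])

def min_cut_search(edges, known_group, priority, max_value=200, seed=0):
    """Try max_value random assignments and keep the one with the lowest CUT."""
    rng = random.Random(seed)
    known = set(known_group)
    min_cut = len(edges)      # min_cut = |E|
    best = None
    for _ in range(max_value):
        # Assign every unknown protein to one of its priority groups.
        assignment = {u: rng.choice(groups) for u, groups in priority.items()}
        cut = count_cut(edges, {**known_group, **assignment}, known)
        if cut < min_cut:     # record an improved assignment
            min_cut, best = cut, assignment
    return min_cut, best

# Hypothetical toy network and priority table.
KNOWN_GROUP = {"K1": "G1", "K2": "G1", "K3": "G2", "K4": "G2", "K5": "G2"}
EDGES = [("K1", "K2"), ("K1", "K3"), ("U1", "K1"), ("U1", "K3"),
         ("U1", "K4"), ("U1", "K5"), ("U2", "K1"), ("U2", "K2"),
         ("U2", "U1")]
PRIORITY = {"U1": ["G2", "G1"], "U2": ["G1", "G2"]}

result = min_cut_search(EDGES, KNOWN_GROUP, PRIORITY)
print(result)
```

With only four possible assignments in this toy case, 200 iterations are far more than needed; on a real network the number of assignments grows exponentially with the size of U´, which is why the search is bounded by max_value.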


U1 has one path of length 1 with G2 and two paths of length 2 with G1.


U4 has two paths of length 1 with G1, one path of length one with G2 and one path of length two with G3.

Priority groups of the proteins of U´ (x marks an unused entry):

| Protein | Priority 1 | Priority 2 | Priority 3 |
|---|---|---|---|
| U1 | G2 | G1 | x |
| U2 | G2 | G1 | x |
| U3 | G2 | G1 | x |
| U4 | G1 | G2 | G3 |
| U5 | G1 | G2 | G3 |
| U6 | G1 | G3 | G2 |
| U7 | G3 | G2 | x |
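One way to build such a priority table is sketched below. The ranking rule is an assumption inferred from the two worked examples: groups are ordered first by the number of paths of length 1, then by the number of paths of length 2. The slides do not state the exact rule, and the toy network and the name `priority_groups` are hypothetical.

```python
from collections import Counter, defaultdict

def priority_groups(edges, known_group, unknown, max_groups=3):
    """Rank the known groups reachable from each unknown protein."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    table = {}
    for u in unknown:
        len1, len2 = Counter(), Counter()
        for n in adj.get(u, set()):
            if n in known_group:                  # path u - n, length 1
                len1[known_group[n]] += 1
            for m in adj[n]:
                if m != u and m in known_group:   # path u - n - m, length 2
                    len2[known_group[m]] += 1
        groups = set(len1) | set(len2)
        # Assumed rule: more length-1 paths first, then more length-2 paths.
        ranked = sorted(groups, key=lambda g: (-len1[g], -len2[g], g))
        table[u] = ranked[:max_groups]
    return table

# Hypothetical toy network built to match the two descriptions in the text:
# x has two length-1 paths to G1, one to G2, and one length-2 path to G3;
# y has one length-1 path to G2 and two length-2 paths to G1.
KNOWN_GROUP = {"a1": "G1", "a2": "G1", "a3": "G1", "a4": "G1",
               "b1": "G2", "b2": "G2", "c1": "G3"}
EDGES = [("x", "a1"), ("x", "a2"), ("x", "b1"), ("b1", "c1"),
         ("y", "b2"), ("b2", "a3"), ("b2", "a4")]

table = priority_groups(EDGES, KNOWN_GROUP, ["x", "y"])
print(table)
```

Under this assumed rule, x receives the priorities G1, G2, G3 and y receives G2, G1, the same pattern as the rows for U4 and U1 in the table.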

By assigning all the unknown proteins to their respective highest priority groups, CUT = 6.



Evaluation of the Proposed Approach

•The proposed method is a general one and can be applied to any organism and any type of functional classification.

•Here we apply it to the protein-protein interaction network of the yeast Saccharomyces cerevisiae.

•We obtain the protein-protein interaction data from ftp://ftpmips.gsf.de/yeast/PPI/, which contains 15613 genetic and physical interactions.

YAR019c YMR001c

YAR019c YNL098c

YAR019c YOR101w

YAR019c YPR111w

YAR027w YAR030c

YAR027w YBR135w

YAR031w YBR217w

...

Total 12487 pairs

We discard self-interactions and extract a set of 12487 unique binary interactions involving 4648 proteins.
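The filtering step can be sketched as follows. The sketch assumes a whitespace-separated two-column format like the sample rows above and treats A-B and B-A as the same interaction; the slides do not specify either point. The duplicate and self-interaction rows in `SAMPLE` are hypothetical additions for illustration, and `unique_interactions` is a hypothetical name.

```python
def unique_interactions(lines):
    """Drop self-interactions and duplicate pairs from two-column text rows."""
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue                 # skip blank or malformed rows
        a, b = fields
        if a == b:
            continue                 # discard self-interactions
        pairs.add(tuple(sorted((a, b))))  # unordered pair: A-B equals B-A
    proteins = {p for pair in pairs for p in pair}
    return pairs, proteins

SAMPLE = [
    "YAR019c YMR001c",
    "YMR001c YAR019c",   # hypothetical reversed duplicate
    "YAR019c YNL098c",
    "YAR027w YAR027w",   # hypothetical self-interaction
    "YAR027w YAR030c",
]

pairs, proteins = unique_interactions(SAMPLE)
print(len(pairs), len(proteins))  # 3 5
```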


A network of 12487 interactions and 4648 proteins is reasonably big


We collect the classification data from http://mips.gsf.de/genre/proj/yeast/index.jsp

Table 1

| Name of functional class | # of proteins |
|---|---|
| METABOLISM | 984 |
| ENERGY | 260 |
| CELL CYCLE AND DNA PROCESSING | 690 |
| TRANSCRIPTION | 842 |
| PROTEIN SYNTHESIS | 381 |
| PROTEIN FATE (folding, modification, destination) | 631 |
| PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) | 39 |
| PROTEIN ACTIVITY REGULATION | 27 |
| CELLULAR TRANSPORT, TRANSPORT FACILITATION AND TRANSPORT ROUTES | 719 |
| CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM | 94 |
| CELL RESCUE, DEFENSE AND VIRULENCE | 296 |
| INTERACTION WITH THE CELLULAR ENVIRONMENT | 336 |
| TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS | 118 |
| BIOGENESIS OF CELLULAR COMPONENTS | 451 |
| CELL TYPE DIFFERENTIATION | 339 |




•The proposed approach is intended to predict the functions of function-unknown proteins.

•However, when the functions of truly function-unknown proteins are predicted, it is not possible to determine the correctness of the predictions.

•Therefore, for evaluation, we treat around 10% randomly selected proteins of each group of Table 1 as function-unknown proteins.




•The union of the selected 10% of all groups consists of 604 proteins. This is the unknown group U.

•The union of the remaining 90% of each of the functional groups constitutes the set of known proteins G. There are a total of 3783 proteins in G.

•We generate $U' \subseteq U$ such that each protein of U´ is connected in N with at least one protein of the set G by a path of length 1 or length 2. There are 470 proteins in U´.

•We predict the functions of these 470 proteins using the proposed method.


We applied this algorithm with max_value = 50000 to predict the functions of the 470 proteins.


•We cannot guarantee that the minimum CUT corresponds to the maximum number of successful predictions.

•However, the trend of the results in the figure above shows that the lower the value of CUT, the greater the number of successful predictions is likely to be.


We then examine the relation between successful predictions and the degrees of the proteins in the network.

In the example network, the degree of U4 is 7 and the degree of U7 is 3.


| Degree | Number of proteins | Successful predictions | Percentage |
|---|---|---|---|
| 1 | 128 | 39 | 30.46 |
| 2 | 80 | 39 | 48.75 |
| 3 | 60 | 32 | 53.33 |
| 4 | 33 | 24 | 72.72 |
| 5 | 23 | 15 | 65.21 |
| 6 | 24 | 14 | 58.33 |
| 7 | 17 | 12 | 70.58 |
| >7 | 105 | 71 | 67.61 |
| Total | 470 | 246 | 52.34 |

Figure: Success percentage (0 to 100) plotted against degree.

•The success rate of prediction is as low as 30.46% for proteins of degree 1 in the interaction network.

•However, it is 67.61% for proteins of degree 8 or more.

•This implies that the reliability of the prediction can be improved by providing a reasonable amount of interaction information.
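The percentages in the table can be recomputed from the counts, as in the short sketch below. The counts are copied from the table; the helper name `truncated_percent` is hypothetical. The table values match truncation to two decimals rather than rounding (for example, 39/128 is 30.46875%).

```python
# (degree, number of proteins, successful predictions), copied from the table
ROWS = [("1", 128, 39), ("2", 80, 39), ("3", 60, 32), ("4", 33, 24),
        ("5", 23, 15), ("6", 24, 14), ("7", 17, 12), (">7", 105, 71)]

def truncated_percent(successes, total):
    """Success percentage truncated (not rounded) to two decimals, as text."""
    hundredths = successes * 10000 // total   # integer hundredths of a percent
    return f"{hundredths // 100}.{hundredths % 100:02d}"

total_proteins = sum(n for _, n, _ in ROWS)
total_success = sum(s for _, _, s in ROWS)

for degree, n, s in ROWS:
    print(degree, n, s, truncated_percent(s, n))
print("Total", total_proteins, total_success,
      truncated_percent(total_success, total_proteins))
```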

