Jan 20, 2016
BIOINFORMATIC ANALYSES OF PROTEIN-PROTEIN INTERACTION NETWORKS
UE Systems Biology and Complex Systems, Master Biosciences, Lyon, 05/20/2010
& Biology Summer School, Marseille, 09/01/2010
Christine Brun, Marseille
A protein never acts alone…
…but interacts with others to perform its function.
Molecular interactions :
- protein-DNA
- protein-RNA
- protein-protein
- protein-lipid
- protein-small molecule
MolecularMovies.org The Inner Life of the Cell
THE DIVERSITY OF P-P INTERACTIONS, from Nooren & Thornton (1/3)
1- Structural diversity
• The interaction occurs between identical molecules: homo-dimers, trimers, tetramers...
Ex: ferritin, 24 identical polypeptides
• The interaction occurs between different polypeptides: hetero-dimers, trimers, tetramers...
Ex: RNA polymerase, 12 different polypeptides
THE DIVERSITY OF P-P INTERACTIONS, from Nooren & Thornton (2/3)
2- Functional diversity
• Non-obligatory PPI:
Proteins are stable independently. Proteins are functional independently.
The interaction performs an action.
Ex: antigen-antibody reaction, enzymatic reaction, phosphorylation reaction (signaling…)
Illustration: interactions within the JAK-STAT signaling pathway in Drosophila
• Obligatory PPI:
Proteins are not stable independently. Proteins are not functional independently.
The interaction is necessary for stability and function.
Ex: protein complexes (DNA polymerase, RNA polymerase, ribosome…)
Illustration: interactions within a complex, the yeast proteasome
THE DIVERSITY OF P-P INTERACTIONS, from Nooren & Thornton (3/3)
3- Dynamic diversity
• Transient PPI:
Associate and dissociate in vivo.
Transient PPI may be non-obligatory.
Illustration: interactions within the JAK-STAT signaling pathway in Drosophila
• Permanent PPI:
Exist only in complexes.
Permanent PPI generally correspond to obligatory PPI.
Illustration: interactions within a complex, the yeast proteasome
Protein Function: a complex notion
Different integration levels:
- Molecule: molecular function
- Cell: cellular function
- Tissue, organ: physiology
- Organism: development, reproduction
- Population: ecological equilibrium
Molecular function of the proteins
Molecular activity, examples: kinase, ATPase, DNA binding...
Approaches: biochemical analyses; bioinformatic predictions (similarity search between sequences and structures)
But...
~30% of the genes/proteins of each newly sequenced organism do not show any similarity with any known gene.
Cellular function of the proteins
Biological process, examples: signaling, transcription, establishment of the epithelium
Approaches: genetic analyses; cellular biology
But...
Sharan et al., Mol Syst Biol, 2007
Protein function: functional annotations are also missing in E. coli
Bouveret & Brun, MethMolBio, 2010
How to study the protein cellular functions?
Cellular function = biological process
Interaction analysis allows investigating protein cellular functions.
- Interactions within a process: the JAK-STAT signaling pathway in Drosophila
- Interactions within a complex: the yeast proteasome
HOW TO IDENTIFY PROTEIN-PROTEIN INTERACTIONS AT THE WHOLE PROTEOME SCALE?
Two high-throughput methods:
- Two-hybrid screens
- Affinity Purification followed by Mass Spectrometry
THE MODULARITY OF THE TRANSCRIPTION FACTORS, as an elementary principle of the yeast two-hybrid
Transcription factor = DNA binding site (BS) + transcription activation domain (AD), able to activate the basal transcription machinery
[Figure: a transcription factor (BS + AD) bound to the transcription factor binding site upstream of a gene; messenger RNA is produced]
YEAST TWO-HYBRID: Principle of the test
• The bait protein X is fused to the BS of a transcription factor.
• The prey proteins Y (potential interactors) are fused to the activation domain (AD) of a transcription factor.
• The fusion proteins are expressed in a yeast strain containing a reporter gene under the control of BS.
• When the prey Y interacts with the bait X, the activation domain AD is brought close to the gene promoter and transcription can occur.
[Figure: BS-bait X and AD-prey Y fusions at the reporter gene; reporter RNA is produced when X and Y interact]
THE LARGE-SCALE TWO-HYBRID SCREENS
- S. cerevisiae: Uetz et al., 2000; Ito et al., 2001
- P. falciparum: Lacount et al., 2005
- C. elegans: Li et al., 2004
- D. melanogaster: Giot et al., 2003; Stanyon et al., 2004; Formstecher et al., 2005
- H. sapiens: Stelzl et al., 2005; Rual et al., 2005
- T. pallidum: Titz et al., 2008
- H. pylori: Rain et al., 2001
- C. jejuni: Parrish et al., 2007
- Synechocystis: Sato et al., 2007
- Mesorhizobium: Shimoda et al., 2008
+ viruses (bacteriophage T7, vaccinia, HCV, BPV, Herpes, EBV…)
+ host-virus (HCV-human, EBV-human)
OTHER LARGE SCALE METHODS BASED ON THE TWO-HYBRID PRINCIPLE (1/2)
Protein reconstitution
Bouveret & Brun, MethMolBio, 2010
OTHER LARGE SCALE METHODS BASED ON THE TWO-HYBRID PRINCIPLE (2/2)
MAPPIT, a functional complementation assay (cytokine receptor signaling pathway)
Bouveret & Brun, MethMolBio, 2010
HOW TO IDENTIFY PROTEIN-PROTEIN INTERACTIONS AT THE WHOLE PROTEOME SCALE?
Two high-throughput methods:
- Two-hybrid screens
- Affinity Purification followed by Mass Spectrometry
PRINCIPLE OF COMPLEX PURIFICATION
[Figure: a tagged bait Y captured by an anti-tag antibody]
PROTOCOL FOR COMPLEX PURIFICATION
Different types of tags, formed by 2 parts separated by a cleavage site, allow 2 steps of purification.
Bouveret & Brun, MethMolBio, 2010
AP/MS analyses
- S. cerevisiae: Gavin et al., 2002, 2006; Ho et al., 2002; Krogan et al., 2006
- E. coli: Butland et al., 2005; Arifuzzaman et al., 2006; Hu et al., 2009
- M. pneumoniae: Kühner et al., 2009
- D. melanogaster: Perrimon lab, 2009
(+ signaling pathways in Drosophila and human)
HOW TO IDENTIFY PROTEIN-PROTEIN INTERACTIONS AT THE WHOLE PROTEOME SCALE?
The two high-throughput methods do not detect the same interaction types:
- the yeast two-hybrid method detects interactions that are biophysically possible, transient or permanent.
- the tandem affinity purification method identifies permanent interactions in vivo.
FINDING INTERACTION DATA?
International Molecular Exchange Consortium
Specialized databases
Multi-organisms:
- DIP (dip.doe-mbi.ucla.edu)
- IntAct (www.ebi.ac.uk/intact)
- MINT (mint.bio.uniroma2.it/mint)
- BioGRID (www.thebiogrid.org)
- BIND (www.blueprint.org)
Yeast:
- MPact (mips.gsf.de/genre/proj/mpact)
Human:
- HPRD (www.hprd.org)
- Reactome (http://reactome.org/)
Meta-database:
- APID (bioinfow.dep.usal.es/apid/index.htm)
EXAMPLE 1: INTACT
EXAMPLE 2: APID METABASE
A standardized representation of interactions is proposed for databases. Authors are invited to submit their interactions to the databases in this format when they submit their publication, so that the knowledge is better recorded.
Interactions described in databases can be represented as networks
Networks = universal language to describe complex systems
Examples: disease spread [Krebs], social network, food web, neural network [Cajal], electronic circuit, Internet [Burch & Cheswick]
PROTEIN-PROTEIN INTERACTION NETWORK
Undirected graph
Node = protein
Edge = physical interaction
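The graph representation (node = protein, edge = physical interaction) can be sketched in a few lines of Python. This is only an illustration: the function name `build_network` and the protein names are hypothetical placeholders, and the input is assumed to be a plain list of binary interactions such as one exported from a database.

```python
from collections import defaultdict

def build_network(interactions):
    """Build an undirected graph as a dict: protein -> set of partner proteins."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)  # undirected: store the edge in both directions
    return dict(network)

# Hypothetical binary interactions (protein names are placeholders)
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
network = build_network(pairs)
```

Storing neighbours as sets makes duplicate database entries for the same pair harmless, which is convenient when several sources are merged.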
WHAT IS AN INTERACTOME?
The set of all possible protein-protein interactions between all the proteins of an organism.
Jeong et al., 2001; Li et al., 2004; Formstecher et al., 2005; Rual et al., 2005
FAQ about INTERACTOMES
• Are all detected interactions physiological? They are physically possible.
• What is the size of the interaction space? ~75 000 to 350 000 interactions in human.
BUT... Interactomes do not contain spatio-temporal information: they are 2D maps, projections, long-exposure photographs...
SHOULD WE TRUST LS-Y2H? (1/4)
Comparison of the results of the 3 Drosophila LS-Y2H screens (Giot et al., Science 2003; Stanyon et al., Genome Biol 2004; Formstecher et al., Genome Res 2005)
• Low overlap between experiments
• The size of the complete set of interactions being unknown: false positives or barely overlapping subsets?
SHOULD WE TRUST LS-Y2H? (2/4)
Comparison of the results of 2 Drosophila LS-Y2H screens (Giot et al., Science 2003; Formstecher et al., Genome Res 2005)…
…only 30 baits in common
[Figure: overlap diagrams; counts as extracted: 20416 203723 / 192 63823]
Yes, we should, since interactions detected in LS-Y2H are possible interactions.
SHOULD WE TRUST LS-Y2H? (3/4)
Different detection methods do not detect p-p interactions with the same efficiency.
Braun et al., Nature Meth 2009
Positive interactions (17-21%)
From Venkatesan et al., Nat. Meth. 2009
SHOULD WE TRUST LS-Y2H? (4/4)
Negative interactions (false positive rate 0.5-2%)
1- SENSITIVITY:
2- COVERAGE:
Low sensitivity rather than low specificity
AND TAP-TAG? (1/2)
Overlap of the interactions detected in the 3 TAP-TAG screens performed in E. coli.
Hu et al., Plos Biol 2009
AND TAP-TAG? (2/2)
Overlap between the interactions detected in the single TAP-TAG screen performed in M. pneumoniae and the interactions identified/inferred by other means.
Kühner et al., Science 2009
WHAT ABOUT LITERATURE?
Cusick et al., Nature Meth 2009
MOTIVATIONS OF INTERACTOMICS
Discovery science:
• Knowing all the parts
• ~30% of the genes/proteins of sequenced genomes are of unknown function
Aim: predict/discover protein function
Systems biology approach:
• Analyze the interactome properties
Aim: bring new insights into old and novel biological questions
POSSIBLE APPROACHES
1- Description of the network organisation (statistics, graph theory…)
- Protein degree
- Edge-betweenness
- K-core
- Diameter
- ...
PROTEIN DEGREE
• The protein degree k: number of neighbours
• If the network is directed: k_in and k_out
[Figure: an undirected node with k = 4; the same node in a directed network with k_in = 1 and k_out = 3]
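A minimal sketch of the degree computation, assuming the network is given as a list of pairs (the function names and node labels are illustrative, not taken from the slides). The toy example mirrors the figure values: k = 4 in the undirected case, k_in = 1 and k_out = 3 in the directed case.

```python
def degrees(edges):
    """Undirected degree k: number of distinct neighbours of each node."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {node: len(partners) for node, partners in neighbours.items()}

def in_out_degrees(arcs):
    """Directed case: k_in counts incoming arcs, k_out counts outgoing arcs."""
    k_in, k_out = {}, {}
    for source, target in arcs:
        k_out[source] = k_out.get(source, 0) + 1
        k_in[target] = k_in.get(target, 0) + 1
        # Make sure every node appears in both dictionaries
        k_in.setdefault(source, 0)
        k_out.setdefault(target, 0)
    return k_in, k_out

# Hypothetical node "p": three outgoing links and one incoming link
edges = [("p", "q"), ("p", "r"), ("p", "s"), ("t", "p")]
k = degrees(edges)
k_in, k_out = in_out_degrees(edges)
```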
PROTEIN DEGREE
• Protein degree distribution:
[Figure: degree distribution in the yeast S. cerevisiae; x-axis: connectivity k (1 to 50); y-axis: number of genes (logarithmic scale, 0.2 to 2000)]
A lot of proteins are poorly connected.
Some proteins are highly connected = « hubs ».
Power-law distribution
…what if the airline traffic network were organized like a protein-protein interaction network?
WHAT DOES IT MEAN BIOLOGICALLY?
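The degree distribution can be tabulated as below. This is a sketch on a hypothetical toy network (one hub plus a short chain); the names `degree_distribution`, `hub` and `p1` to `p5` are placeholders. On real interactome data, plotting these counts on logarithmic axes is the usual way to look for a power-law shape, which appears as a roughly straight line.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: number of proteins with degree k} for an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return Counter(len(partners) for partners in neighbours.values())

# Toy network: many poorly connected proteins, one highly connected "hub"
edges = [("hub", "p1"), ("hub", "p2"), ("hub", "p3"), ("hub", "p4"), ("p4", "p5")]
distribution = degree_distribution(edges)
for k in sorted(distribution):
    print(f"k = {k}: {distribution[k]} protein(s)")
```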
‘EDGE BETWEENNESS’
Number of shortest paths going through an edge (computed between all node pairs)
[Figure: example graph with nodes a to j; each edge is labelled with its betweenness value]
Bio: centrality
Processes are connected by interactions of high betweenness.
(Can also be used to disconnect the graph)
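The definition above can be sketched with a breadth-first search from every node followed by a back-propagation of path counts (the approach usually attributed to Brandes). This is an illustrative implementation, not the one used for the slides. It assumes an unweighted, undirected network stored as a symmetric dict of neighbour sets, and it splits the contribution of a node pair equally when several shortest paths exist, which would explain non-integer edge values. The toy graph (two triangles joined by one edge) is hypothetical.

```python
from collections import deque

def edge_betweenness(network):
    """Edge betweenness for an unweighted, undirected graph.

    network: dict mapping each node to the set of its neighbours.
    Returns {frozenset({u, v}): value}; each unordered node pair contributes
    a total of 1, split equally among its shortest paths.
    """
    score = {frozenset((u, v)): 0.0 for u in network for v in network[u]}
    for source in network:
        # Breadth-first search: distances and number of shortest paths (sigma)
        dist, sigma, preds = {source: 0}, {source: 1.0}, {source: []}
        order, queue = [], deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in network[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    sigma[nb] = 0.0
                    preds[nb] = []
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    sigma[nb] += sigma[node]
                    preds[nb].append(node)
        # Back-propagate path fractions from the farthest nodes to the source
        delta = {node: 0.0 for node in order}
        for node in reversed(order):
            for p in preds[node]:
                share = sigma[p] / sigma[node] * (1.0 + delta[node])
                score[frozenset((p, node))] += share
                delta[p] += share
    # Each unordered pair was counted from both of its ends
    return {edge: value / 2.0 for edge, value in score.items()}

# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d
network = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
eb = edge_betweenness(network)
bridge = max(eb, key=eb.get)  # edge whose removal disconnects the two triangles
```

In this toy graph the edge c-d carries all 9 shortest paths between the two triangles, so removing the edge of highest betweenness separates them.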
K-CORE NOTION or how to peel the interactome… (1/3)
* Recursively remove vertices/proteins according to their number of neighbours
[Figure: example network; removed proteins are crossed out]
a, b, c, d, e belong to core 1
K-CORE NOTION or how to peel the interactome… (2/3)
* Recursively remove vertices/proteins according to their number of neighbours
a, b, c, d, e belong to core 1; f, g, h, i, j, k, l belong to core 2…
K-CORE NOTION or how to peel the interactome… (3/3)
Bio: central proteins vs. peripheral proteins
Functional differences?
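The peeling procedure can be sketched as follows, assuming the network is a symmetric dict of neighbour sets. The function name `core_numbers` and the toy network are hypothetical. Each protein receives the index of the core at which it is removed: proteins peeled with at most 1 remaining neighbour belong to core 1, then those with at most 2 to core 2, and so on.

```python
def core_numbers(network):
    """Assign a core number to each node by recursive peeling."""
    # Work on a copy and ignore self-interactions
    remaining = {node: set(nbs) - {node} for node, nbs in network.items()}
    core = {}
    k = 0
    while remaining:
        # The current level is the smallest remaining degree (never decreasing)
        k = max(k, min(len(nbs) for nbs in remaining.values()))
        to_remove = [n for n, nbs in remaining.items() if len(nbs) <= k]
        while to_remove:
            node = to_remove.pop()
            if node not in remaining:
                continue  # already peeled
            core[node] = k
            for nb in remaining.pop(node):
                if nb in remaining:
                    remaining[nb].discard(node)
                    # Removing a node can push its neighbours below the threshold
                    if len(remaining[nb]) <= k:
                        to_remove.append(nb)
    return core

# Toy network: a triangle (a, b, c) with a two-protein tail (d, e) attached to a
network = {
    "a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"a", "e"}, "e": {"d"},
}
core = core_numbers(network)
```

Here the tail (d, e) is peeled first and lands in core 1, while the triangle survives until level 2, illustrating the peripheral versus central distinction.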
Potential difficulties:
It is often difficult to match a graph property/characteristic to a biological role/property…
…and even more so when the graph/interactome does not contain any spatio-temporal information
Towards data integration
• The p-p interaction network is a static view
• Not all interactions happen at the same time and in the same place!
• ‘Dynamic’ information: expression data from transcriptome experiments.
[Figure: gene expression over time (on, off, on)]
EXAMPLE OF DATA INTEGRATION (1/2)
• Different kinds of hubs:
1st possibility: simultaneous interactions, « party hubs » (intra-process role)
2nd possibility: successive interactions, « date hubs » (inter-process communication)
[Figure labels: M phase of the cell cycle; S phase of the cell cycle]
[Han et al., Nature 2004]
POSSIBLE APPROACHES
2- Functional module identification for function prediction and systems biology: classification and graph partitioning
FUNCTION PREDICTION: 2 types of methods (from Sharan et al., Mol Syst Biol, 2007)
- Direct method: function prediction
- Module detection: function prediction + systems biology
FUNCTION PREDICTION: direct method
Inference of the function of an uncharacterized protein by transfer of its neighbours’ functions.
- majority rule
- functional flow
- ...
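As an illustration of the direct method, here is a minimal sketch of a majority rule: the uncharacterized protein receives the function(s) most frequent among its annotated neighbours. All names (`majority_rule`, the proteins, the annotation terms) are hypothetical, and ties are simply all returned, which is one possible convention among others.

```python
from collections import Counter

def majority_rule(network, annotations, protein):
    """Return the most frequent function(s) among the annotated neighbours."""
    counts = Counter()
    for neighbour in network.get(protein, ()):
        # Unannotated neighbours contribute nothing
        counts.update(annotations.get(neighbour, ()))
    if not counts:
        return []  # no annotated neighbour: no prediction
    best = max(counts.values())
    return sorted(f for f, c in counts.items() if c == best)

# Hypothetical uncharacterized protein X with four interactors
network = {"X": {"P1", "P2", "P3", "P4"}}
annotations = {
    "P1": {"cell cycle"},
    "P2": {"cell cycle", "signaling"},
    "P3": {"transcription"},
    # P4 has no known function
}
prediction = majority_rule(network, annotations, "X")
```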
FUNCTION PREDICTION: module detection
Identification of groups of proteins. Inference of the group function.
- density
- distances
- edge-betweenness/betweenness cut
- optimisation of a criterion: modularity (more internal edges than in a random partition of the graph with the same class cardinalities)
- ...
EXAMPLE 1: IDENTIFICATION OF MODULES BASED ON EDGE DENSITY
• What is a dense zone?
[Figure: two groups of proteins, one "not dense...", the other "...rather dense!"]
• « Rigorous » definition:
The maximal number of connections between N proteins is ½N(N-1). Density is defined as:
$$d = \frac{\text{number of connections}}{\text{maximal number of connections}} = \frac{\text{number of connections}}{\tfrac{1}{2}N(N-1)}$$
Examples with N = 7 (maximal number of connections = 21): d = 6/21 = 0.29 and d = 14/21 = 0.67
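The density definition can be checked numerically with a short sketch. The function names are illustrative; `group_density` assumes the network is a symmetric dict of neighbour sets and counts only the connections internal to the group.

```python
def density(n_proteins, n_connections):
    """d = number of connections / maximal number of connections, i.e. N(N-1)/2."""
    if n_proteins < 2:
        raise ValueError("density needs at least two proteins")
    return n_connections / (n_proteins * (n_proteins - 1) / 2)

def group_density(network, group):
    """Density of the subgraph induced by a group of proteins."""
    group = set(group)
    # Each internal edge is seen from both of its endpoints, hence the division by 2
    internal = sum(len(network.get(p, set()) & (group - {p})) for p in group) // 2
    return density(len(group), internal)

# The two slide examples: 7 proteins with 6 and with 14 connections
sparse = density(7, 6)
dense = density(7, 14)
print(round(sparse, 2), round(dense, 2))
```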
FUNCTIONAL MODULES IDENTIFIED BASED ON EDGE DENSITY
Examples: cell cycle regulation; signaling pathway triggered by pheromones
Spirin & Mirny, PNAS 2003
EXAMPLE 2: THE PRODISTIN METHOD, A FUNCTIONAL CLASSIFICATION BASED ON INTERACTIONS
1- [Figure: interaction network]
2- The Czekanowski-Dice distance (Dice, 1945)
3- A classification tree
4- Annotated classification tree (+ GO annotations)
Brun et al., Genome Biology, 2003; Baudot et al., Bioinformatics, 2006
PRINCIPLE: …calculate a distance based on the number of interactors shared and not shared by a protein pair, as a reflection of their functional similarity
[Figure: example network with proteins A, B, C and D]
HYPOTHESIS: the more interactors two proteins share, the more likely they are to be functionally related
A POSSIBLE TRANSLATION OF THE BIOLOGICAL HYPOTHESIS: THE CZEKANOWSKI-DICE DISTANCE
With X and Y the sets of interactors of the two proteins:
$$D(X, Y) = \frac{|X \setminus (X \cap Y)| + |Y \setminus (X \cap Y)|}{|X \cup Y| + |X \cap Y|} = \frac{2 + 3}{8 + 3}$$
(Dice, Ecology, 1945)
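A minimal sketch of the distance, assuming X and Y are given as Python sets of interactor names. The function name and the interactor labels are hypothetical, and whether each protein is counted in its own interactor set is a convention not stated on the slide, so it is left to the caller. The example reproduces the slide's numbers: 3 shared interactors, 2 specific to X and 3 specific to Y, hence a union of 8.

```python
def czekanowski_dice(x, y):
    """Czekanowski-Dice distance between two sets of interactors.

    0 when the sets are identical, 1 when they share nothing.
    """
    x, y = set(x), set(y)
    shared = x & y
    union = x | y
    if not union:
        raise ValueError("both interactor sets are empty")
    specific = len(x - shared) + len(y - shared)
    return specific / (len(union) + len(shared))

# Hypothetical interactor sets: s1..s3 shared, x1..x2 and y1..y3 specific
X = {"s1", "s2", "s3", "x1", "x2"}
Y = {"s1", "s2", "s3", "y1", "y2", "y3"}
d = czekanowski_dice(X, Y)  # (2 + 3) / (8 + 3)
```

Counting the shared interactors twice in the denominator gives them more weight than the specific ones, so two proteins with many common partners end up close even if each has a few partners of its own.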
FUNCTIONAL MODULES IDENTIFIED BY THE COMPUTATION OF A DISTANCE
Functional classes = groups of proteins involved in the same pathway, the same protein complex or the same cellular process through interactions
[CLASS: 76]
CA sensory_organ_development, neuroblast_division, cytoskeleton_organization_and_biogenesis
P# 5
PN pros, mira, insc, numb, baz
[CLASS: 60]
CA myoblast_fusion
P# 7
PN sls, Mhc, Act88F, Actn, rols, mbc, Crk
PROTEIN FUNCTION PREDICTION: EXAMPLE OF THE 'DNA METABOLISM' AND 'CELL CYCLE' CLASS
Pre-Replication Complex
Protein kinase complex
Telomere replication
Targets PRC to origin
Cell cycle control
Chromatin structure
??
Telomere tethering to the nuclear periphery
POSSIBILITY OF FUNCTION INFERENCE
PREDICTION OF CELLULAR FUNCTION
93 uncharacterised proteins
42 belong to PRODISTIN classes: a cellular function is predicted
- 2 new predictions (5%)
- 40 predicted by other bioinformatic methods or recently characterised experimentally:
  - 27 in agreement (64%)
  - 13 different (31%)
Brun et al., Genome Biol, 2003
STATISTICAL EVALUATION OF CLASS QUALITY AND FUNCTIONAL PREDICTIONS (1/2)
- Is the protein clustering significant? Would it happen by chance?
  Test on random networks of the same topology (‘reshuffling’): 15 classes instead of 64. Significant.
- What is the functional prediction quality?
  Suppose that all members of a class perform the function assigned to the class; compare these predictions with known functions.
  Success rate = # correctly predicted functions / # predictions = 67% (vs 43% for the Majority Rule algorithm)
- What is the class quality?
  Class Robustness Index (CRI), based on tree topological criteria (bootstrap for a distance-based method): 0 < 0.96 < 1
Brun et al., Genome Biol, 2003
STATISTICAL EVALUATION OF CLASS QUALITY AND FUNCTIONAL PREDICTIONS (2/2)
- How does PRODISTIN deal with noise in interaction data? Is it robust to the presence of both spurious and missing interactions in the dataset?
  Test on networks of different topologies (‘rewiring’). What is the prediction success rate?
[Figure: number of proteins predicted (axis 0 to 600) and prediction rate (axis 0 to 1) as a function of % rewiring (0 to 50)]
Brun et al., Genome Biol, 2003
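One way to simulate noisy data for such a robustness test is sketched below. The exact rewiring procedure used in the study is not described on the slide, so this is an assumption: a given fraction of the edges is removed (mimicking missing interactions) and the same number of edges absent from the original network is added at random (mimicking spurious interactions). The function name `rewire` and the toy network are hypothetical.

```python
import random

def rewire(edges, fraction, seed=None):
    """Replace a fraction of the edges by random edges absent from the original."""
    rng = random.Random(seed)
    original = {frozenset(edge) for edge in edges}
    nodes = sorted({node for edge in original for node in edge})
    n_swap = round(fraction * len(original))
    # Guard against an endless search when the graph is almost complete
    possible = len(nodes) * (len(nodes) - 1) // 2
    absent = possible - sum(1 for edge in original if len(edge) == 2)
    if n_swap > absent:
        raise ValueError("not enough absent node pairs to rewire that many edges")
    removed = rng.sample(sorted(original, key=sorted), n_swap)
    added = set()
    while len(added) < n_swap:
        candidate = frozenset(rng.sample(nodes, 2))
        if candidate not in original:
            added.add(candidate)
    return (original - set(removed)) | added

# Toy chain of 6 proteins (5 edges), with 40% of the edges rewired
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
noisy = rewire(chain, 0.4, seed=1)
```

The classification and the predictions can then be recomputed on `noisy` for increasing fractions, and the success rate compared with the one obtained on the original network.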
INTERACTOMES AND THE EVOLUTION OF THE FUNCTION OF THE DUPLICATED GENES
WHAT CAN WE LEARN ABOUT THE EVOLUTION OF THE FUNCTION OF THE DUPLICATED GENES WHEN STUDYING THE INTERACTOME WITH THE CLASSIFICATION METHOD?
THE ANCESTRAL WHOLE GENOME DUPLICATION IN YEAST
UPDATE OF THE RESULTS: 2004, 2007
• 100-150 million years ago
• After the Kluyveromyces waltii and S. cerevisiae divergence (Kellis et al., 2004)
• Followed by massive deletion events
• 16% of the present genome is formed by WGD paralog pairs, remnants of this duplication event: 457-460 pairs (Wolfe & Shields, 1997; Seoighe & Wolfe, 1999; Kellis et al., 2004)
THE ANCESTRAL WHOLE GENOME DUPLICATION IN YEAST
Kellis et al., 2004
EVOLUTION OF THE NUMBER AND THE IDENTITY OF THE PARALOGUES’ INTERACTORS AFTER A GENOME DUPLICATION
[Figure: an ancestral protein A and its duplicates (A’, A’’, A’’’) along a timeline t0, t1, t2]
- t0 (duplication): 4 shared interactors
- t1: 2 shared interactors, 2 specific interactors
- t2: 2 shared interactors, 4 specific interactors
YEAST PPI NETWORK (2004 vs 2007)
- Interactions: 3991 in 2004 (core data: LS-Y2H + homemade literature curation); 17656 in 2007
- Proteins: 2644 in 2004 (~46% of ORFs); 4773 in 2007 (82.3% of ORFs)
- Mean degree: 3 in 2004; 7.4 in 2007
- Paralog pairs in the classification tree: 38 in 2004 (8% of paralogs); 172 in 2007 (37% of paralogs)
2004: 38 paralog pairs
How are they classified?
Functional classification tree for yeast proteins (2004)
- Both paralogs are in the same class
- Paralogs are classified in different classes devoted to the same cellular function
- Paralogs are classified in different classes devoted to different cellular functions
THREE CLASSIFICATION BEHAVIOURS (2004 / 2007)
- Same class, same function: 26 (68%) / 95 (55.3%)
- Different class, same function: 3 (8%) / 13 (7.6%)
- Different class, different function: 9 (24%) / 21 (12.2%)
- Fourth 2007 value, unlabelled in the extracted slide: 43 (25%)
• The majority of the WGD paralogs (68%) are in the same functional class, i.e. they share interactors.
• The majority of the WGD paralogs are involved in the same cellular function (76%).
EVOLUTION OF CELLULAR FUNCTION: A SCALE OF FUNCTIONAL DIVERGENCE FOR DUPLICATED GENES BASED ON INTERACTION ANALYSIS
Functional divergence, from the lowest (-) to the highest (+):
- Same class, same function: 68%
- Different class, same function: 8%
- Different class, different function: 24%
EVOLUTION BY NEO-FUNCTIONALIZATION (Ohno, 1970)
EVOLUTION BY SUB-FUNCTIONALIZATION (Force et al., 1999)
Baudot et al., 2004, Genome Biology, 5: R76
CLASSIFICATION BEHAVIOURS AND SEQUENCE IDENTITY
[Figure legend: Same class, same function; Different class, same function; Different class, different function; Not classified]
• Sequence conservation does not necessarily imply function conservation: the relationship between sequence identity and cellular function is complex.
• The functional classification shows paralog properties not detectable by sequence analysis alone.
INTERACTION NETWORK AND EVOLUTION OF THE FUNCTION OF THE DUPLICATED GENES: CONCLUSIONS
* A scale of functional divergence for the duplicated genes based on the interactions: distinct scenarios for the evolution of the cellular function of the duplicated genes.
* Differences between paralog pairs detectable neither by sequence analysis nor by functional annotation analysis: the analysis of the interactions provides a novel type of information.
(Baudot et al., Genome Biology, 2004)
Update 2004-2007:
* Stability of the results with a 4 times larger interactome:
- a high-quality interactome, even a small one, gives reliable biological information
- robustness of the network analysis method to false negatives/positives (Brun et al., Genome Biology, 2003)
- changes in functional annotations (‘knowledge’ effect) may change the biological interpretation of results.