14. Lecture WS 2003/04
Bioinformatics III 1
In silico studies to predict protein-protein contacts
Two approaches:
(1) on the macro level: map networks of protein interactions
(2) on the micro level: understand mechanisms of interaction
to predict interaction sites
The growth of genome data has stimulated a lot of research in area (1),
but few studies have addressed area (2).
Constructing detailed models of protein-protein interfaces is nevertheless important
for a comprehensive understanding of molecular processes, for drug design, and
for the prediction of quaternary structure
(arrangement into macromolecular complexes).
Also: understanding (2) should facilitate (1).
Therefore, this lecture focuses on area (2).
Overview
Statistical analysis of protein-protein interfaces in crystal structures of
protein-protein complexes: residues in interfaces have a significantly different
amino acid composition than the rest of the protein.
predict protein-protein interaction sites from local sequence information
Conservation at protein-protein interfaces: interface regions are more conserved
than other regions on the protein surface
identify conserved regions on protein surface e.g. from solvent accessibility
Interacting residues on two binding partners often show correlated mutations (among
different organisms) when they mutate
identify correlated mutations
Surface patterns of protein-protein interfaces: interface often formed by hydrophobic
patch surrounded by ring of polar or charged residues.
identify suitable patches on surface if 3D structure is known
1 Analysis of interfaces
The PDB contains 1812 non-redundant protein complexes
(less than 25% identity).
Results don't change significantly if NMR structures,
theoretical models, or structures at lower resolution
(altogether 50%) are excluded.
Most interesting are the results
for transiently formed
complexes.
Ofran, Rost, J. Mol. Biol. 325, 377 (2003)
1 Properties of interfaces
Amino acid composition of six interface types. The propensities of all residues
found in SWISS-PROT were used as background. If the frequency of an amino
acid is similar to its frequency in SWISS-PROT, the height of the bar is close to
zero. Over-representation results in a positive bar, and under-representation
results in a negative bar. Ofran, Rost, J. Mol. Biol. 325, 377 (2003)
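The slide does not give the formula behind the bars. One common way to get the described behaviour (zero for equal frequencies, positive for over-representation, negative for under-representation) is a log ratio of the interface frequency to the background frequency. The sketch below assumes that form; the function name `log_propensity` and the toy sequences are illustrative and not taken from Ofran and Rost.

```python
import math
from collections import Counter


def log_propensity(interface_seq, background_seq):
    """Return log2(f_interface / f_background) per residue type.

    Assumption: a log ratio is used as the propensity measure; the slide
    only describes the sign convention, not the exact formula.
    """
    iface = Counter(interface_seq)
    background = Counter(background_seq)
    n_iface = sum(iface.values())
    n_background = sum(background.values())
    scores = {}
    for aa, count in iface.items():
        if background[aa] == 0:
            continue  # no background estimate for this residue type
        f_iface = count / n_iface
        f_background = background[aa] / n_background
        scores[aa] = math.log2(f_iface / f_background)
    return scores


# Toy data: L is enriched at the interface, A is depleted.
scores = log_propensity("LLLA", "LLAA")
for aa in sorted(scores):
    print(aa, round(scores[aa], 3))
```

In a real analysis the background would be the residue frequencies of a whole sequence database such as SWISS-PROT, and the interface string would collect all interface residues of one interface type.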
1 Pairing frequencies at interfaces
Residue–residue preferences.
(A) Intra-domain: hydrophobic core is clear,
(B) domain–domain,
(C) obligatory homo-oligomers (homo-obligomers),
(D) transient homo-oligomers (homo-complexes),
(E) obligatory hetero-oligomers (hetero-obligomers), and
(F) transient hetero-oligomers (hetero-complexes).
A red square indicates that the interaction occurs more
frequently than expected; a blue square
indicates that it occurs less frequently than
expected. The amino acid residues are
ordered according to hydrophobicity, with
isoleucine as the most hydrophobic and
arginine as the least hydrophobic.
Ofran, Rost, J. Mol. Biol. 325, 377 (2003)
2 Exploit local sequence properties for predicting interfaces
Analyze local sequence information
Assume that, on the protein surface, interacting residues are clustered in
sequence segments of several contacting residues.
- focus on transient protein-protein complexes; in the PDB: 1134 chains in 333
complexes, ca. 60,000 contacting residues (a residue counts as contacting if any
of its atoms is within 6 Å of any atom of the other protein)
- prediction method: neural network with back-propagation and one hidden layer;
stretches of 9 residues × 21 possible states = 189 input nodes,
300 hidden units and two output units (interaction site or not)
- train on 2/3 of the data, predict 1/3 of the data
Ofran, Rost, FEBS Lett. 544, 236 (2003)
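The input layout described above (9 positions with 21 states each, giving 189 input nodes) can be sketched as a one-hot encoding of a sequence window. This is only an illustration: the slide does not say what the 21st state is, so the code assumes it marks padding beyond the chain ends or an unknown residue, and the names `encode_window`, `STATES` and the example sequence are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: the 21st state stands for "no residue" (window reaches past
# the chain end) or an unknown residue type.
STATES = AMINO_ACIDS + "-"
WINDOW = 9


def encode_window(sequence, center):
    """One-hot encode the 9-residue window centred on `center`.

    Returns a flat list of WINDOW * len(STATES) = 189 zeros and ones.
    """
    half = WINDOW // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        aa = sequence[pos] if 0 <= pos < len(sequence) else "-"
        if aa not in AMINO_ACIDS:
            aa = "-"
        one_hot = [0] * len(STATES)
        one_hot[STATES.index(aa)] = 1
        vector.extend(one_hot)
    return vector


features = encode_window("MKTAYIAKQR", center=4)
print(len(features))  # 9 positions x 21 states
```

Each residue of a chain yields one such vector, and the network output for that vector is read as the prediction for the central residue.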
Number of residues in interface in a stretch of 9
2 different distance thresholds to
consider a residue involved in
protein–protein interfaces were
used, namely when the closest atom
pair between two residues in
different proteins was closer than 4
(gray) or 6 (black) Å.
Although the distribution for the less
permissive 4 Å cut-off is moved
slightly to shorter segments, both
distributions clearly demonstrate
that most interface residues have
other contacting residues in their
sequence neighborhood.
Ofran, Rost, FEBS Lett. 544, 236 (2003)
Together with the observation that interacting
residues tend to have a unique composition,
this suggests that interaction sites are
detectable from sequence alone.
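The contact definition used here (closest atom pair between two residues of different proteins closer than 4 or 6 Å) can be written down in a few lines. The sketch below is a brute-force version over toy coordinates; the data layout (a dict from residue number to a list of atom coordinates) and the function names are assumptions made for the example.

```python
import math


def min_atom_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of one residue and any atom of another."""
    return min(math.dist(p, q) for p in atoms_a for q in atoms_b)


def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Residues of chain_a with at least one atom closer than `cutoff` (in Å)
    to any atom of chain_b.

    Each chain is a dict: residue number -> list of (x, y, z) atom coordinates.
    """
    return sorted(
        res_id
        for res_id, atoms in chain_a.items()
        if any(min_atom_distance(atoms, other) < cutoff for other in chain_b.values())
    )


# Toy coordinates, one atom per residue, all on the x-axis.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(2.0, 0.0, 0.0)], 3: [(20.0, 0.0, 0.0)]}
chain_b = {101: [(5.0, 0.0, 0.0)]}

print(interface_residues(chain_a, chain_b, cutoff=6.0))
print(interface_residues(chain_a, chain_b, cutoff=4.0))
```

The stricter 4 Å cut-off selects a subset of the residues found with 6 Å, which is why the gray distribution on the slide is shifted to shorter segments.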
Prediction of contacts: better than random?
Significant improvement over random was found.
The random results were obtained as follows. The predictions of the network were scrambled and assigned randomly to the residues in the test set. Then the filtering stage was applied to these 'predictions' to reveal any size effect that might result from the distributions of the contacts and the predictions.
Accuracy (y-axis) is the number of correctly predicted contacts divided by the number of predicted contacts, i.e. the fraction of positive predictions that are correct. Coverage (x-axis) is the number of correctly predicted contacts divided by the number of observed contacts, i.e. the fraction of all known interacting residues that were correctly predicted.
The random predictions never reached coverage levels above 2%, and their accuracy hovered around 0.4. The neural-network method had substantially better accuracy at any level of coverage. Note that the accuracy drops significantly if the system is forced to detect more than 0.5–1% of all the observed contacts. However, at a level at which at least one interaction site is detected in each protein, 70% of the predictions are correct.
Ofran, Rost, FEBS Lett. 544, 236 (2003)
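The two axes of this plot follow directly from the definitions given in the text: accuracy is correct predictions over all predictions, coverage is correct predictions over all observed contacts. A minimal sketch, with residue numbers made up for the example:

```python
def accuracy_and_coverage(predicted, observed):
    """Accuracy = correct / predicted, coverage = correct / observed.

    `predicted` and `observed` are collections of residue identifiers.
    """
    predicted, observed = set(predicted), set(observed)
    correct = len(predicted & observed)
    accuracy = correct / len(predicted) if predicted else 0.0
    coverage = correct / len(observed) if observed else 0.0
    return accuracy, coverage


# Hypothetical residue numbers: 3 predictions, 4 true interface residues.
acc, cov = accuracy_and_coverage(predicted=[3, 4, 10], observed=[3, 4, 5, 6])
print(f"accuracy={acc:.2f} coverage={cov:.2f}")
```

Raising the prediction threshold typically shrinks the predicted set, which trades coverage for accuracy; this is the trade-off the curve on the slide shows.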
Could simpler models work as well?
Single-residue frequencies contain only rather weak preferences for protein-protein
interactions:
a neural network trained on single residues does not markedly outperform the random
prediction.
Another simple method, which predicts all exposed hydrophobic residues as
interaction sites, also does not perform better than random.
Ofran, Rost, FEBS Lett. 544, 236 (2003)
Quality of strong predictions
When the 9-stretch network is calibrated to the point of its strongest predictions,
94% of the predicted protein-protein interaction sites are correct
(58 sites identified from 28 chains in complexes, all predictions are correct,
random model gives 0 correct predictions).
At 70% accuracy, 197 sites are identified (12 expected at random) from 95 chains in 66
complexes. In 81 of these chains, all predictions were correct.
Ofran, Rost, FEBS Lett. 544, 236 (2003)
Example of successful prediction
Example of a prediction mapped onto the
3D structure. When scaled for highest
accuracy (94%), the method correctly
identified some contacts in 28 chains;
one of these is presented here.
The method identified two residues
(green) in the ubiquitin ligase skp1–skp2 complex.
Both of the predictions are part of a
pocket that accommodates
Trp109 of the SKP-2 F-box protein. Note
that there were no wrong predictions
in this complex at the given threshold
for the prediction strength.
Ofran, Rost, FEBS Lett. 544, 236 (2003)
correlation information is sufficient for selecting the correct structural arrangement of
known heterodimers and protein domains because the correlated pairs between the
monomers tend to accumulate at the contact interface.
Use same idea to identify interacting protein pairs.
Correlated mutations at interface
Correlated mutations evaluate the similarity in variation patterns between positions in
a multiple sequence alignment.
Similarity of those variation patterns is thought to be related to compensatory
mutations.
Calculate for each pair of positions i and j in the sequence a rank correlation coefficient (r_ij):
\[ r_{ij} = \frac{\sum_{k,l} (S_{ikl} - \bar{S}_i)(S_{jkl} - \bar{S}_j)}{\sqrt{\sum_{k,l} (S_{ikl} - \bar{S}_i)^2}\,\sqrt{\sum_{k,l} (S_{jkl} - \bar{S}_j)^2}} \]
where the summations run over every possible pair of proteins k and l in the multiple
sequence alignment.
S_{ikl} is the ranked similarity between residue i in protein k and residue i in protein l.
S_{jkl} is the same for residue j.
\bar{S}_i and \bar{S}_j are the means of S_{ikl} and S_{jkl}.
Pazos, Valencia, Proteins 47, 219 (2002)
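To make the correlation coefficient concrete, the sketch below computes it for two columns of a small alignment. It is an illustration under a strong simplification: the similarity between two residues is taken as 1 for identical residues and 0 otherwise, whereas the method described here uses ranked similarities. The function names and the toy alignment are made up for the example.

```python
import math
from itertools import combinations


def similarity(a, b):
    # Assumption: identity similarity (1 or 0) stands in for the ranked
    # residue similarity used in the original method.
    return 1.0 if a == b else 0.0


def column_similarities(msa, i):
    """S_ikl for column i over all pairs of sequences k < l."""
    return [similarity(msa[k][i], msa[l][i]) for k, l in combinations(range(len(msa)), 2)]


def correlation(msa, i, j):
    """Correlation r_ij between the variation patterns of columns i and j."""
    s_i = column_similarities(msa, i)
    s_j = column_similarities(msa, j)
    if not s_i:
        return 0.0  # fewer than two sequences
    mean_i = sum(s_i) / len(s_i)
    mean_j = sum(s_j) / len(s_j)
    numerator = sum((a - mean_i) * (b - mean_j) for a, b in zip(s_i, s_j))
    denominator = math.sqrt(sum((a - mean_i) ** 2 for a in s_i)) * math.sqrt(
        sum((b - mean_j) ** 2 for b in s_j)
    )
    return numerator / denominator if denominator > 0 else 0.0


# Toy alignment: columns 0 and 1 change together, column 2 does not follow them.
msa = ["AKL", "AKV", "GEL", "GEI"]
print(round(correlation(msa, 0, 1), 3))
print(round(correlation(msa, 0, 2), 3))
```

Columns that always change together across the sequences give a coefficient close to 1, which is the signature interpreted as a possible compensatory mutation.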
Correlated mutations at interface
For each protein i, generate a multiple sequence alignment of homologous proteins (HSSP
database).
Compare the MSAs of two proteins and reduce them by keeping only sequences from
coincident species (delete the other rows).
Pazos, Valencia, Proteins 47, 219 (2002)
i2h method
Schematic representation of the i2h method.
A: Family alignments are collected for two different proteins, 1 and 2, including corresponding sequences from different species (a, b, c, ).
B: A virtual alignment is constructed, concatenating the sequences of the probable orthologous sequences of the two proteins. Correlated mutations are calculated.
C: The distributions of the correlation values are recorded. We used 10 correlation levels. The corresponding distributions are represented for the pairs of residues internal to the two proteins (P11 and P22) and for the pairs composed of one residue from each of the two proteins (P12).
Pazos, Valencia, Proteins 47, 219 (2002)
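Steps A and B can be sketched as follows: keep only the species present in both family alignments, concatenate the two rows of each species into one virtual sequence, and label every column pair as internal to protein 1 (P11), internal to protein 2 (P22), or spanning both (P12). The data layout (a dict from species to aligned sequence), the species labels and the function names are assumptions made for this example.

```python
def virtual_alignment(msa1, msa2):
    """Concatenate the rows of two alignments for the species they share.

    Each alignment is a dict: species -> aligned sequence.
    Returns (common species, concatenated rows, length of the first alignment).
    """
    common = sorted(set(msa1) & set(msa2))
    rows = [msa1[species] + msa2[species] for species in common]
    len1 = len(next(iter(msa1.values()))) if msa1 else 0
    return common, rows, len1


def pair_class(i, j, len1):
    """Classify a column pair of the virtual alignment as P11, P22 or P12."""
    i_in_first, j_in_first = i < len1, j < len1
    if i_in_first and j_in_first:
        return "P11"
    if not i_in_first and not j_in_first:
        return "P22"
    return "P12"


# Hypothetical species labels and toy aligned sequences.
msa1 = {"ecoli": "AKL", "bsub": "AKV", "hinf": "GEL"}
msa2 = {"ecoli": "MT", "hinf": "MS", "mtub": "LT"}

species, rows, len1 = virtual_alignment(msa1, msa2)
print(species, rows)
print(pair_class(0, 2, len1), pair_class(1, 4, len1), pair_class(3, 4, len1))
```

Correlation values would then be computed for every column pair of the virtual alignment and collected separately for the three classes, which gives the distributions compared in step C.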
Predictions from correlated mutations
Results obtained by i2h on a set of 14 two-domain proteins of known structure, i.e. proteins with two interacting domains. The 2 domains are treated as different proteins.
A: Interaction index for the 133 pairs with 11 or more sequences in common. The true positive hits are highlighted with filled squares.
B: Representation of i2h results, reminiscent of those obtained in the experimental yeast two-hybrid system. The diameter of the black circles is proportional to the interaction index; true pairs are highlighted with gray squares. Empty spaces correspond to those cases in which the i2h system could not be applied, because they contained <11 sequences from different species in common for the two domains.
In most cases, i2h scored the correct pair of protein domains above all other possible interactions.
Pazos, Valencia, Proteins 47, 219 (2002)
Second test set
The i2h method was applied to the set of bacterial interacting proteins analyzed by Dandekar et al., using MSAs compiled from 14 fully sequenced genomes. All proteins for which sequences are found in at least 11 genomes were selected.
A: The interaction index is represented for the 244 possible pairs. In this case, possible interactions are indicated with empty squares, including different ribosomal proteins and elongation factors.
B: Representation of i2h results reminiscent of the typical representation of yeast two-hybrid experimental data. In this case, a subset of the results of (A) is represented, corresponding to proteins that form part of protein pairs with experimentally verified interactions and protein families with enough alignments. The diameter of the black circles is proportional to the interaction index, positive cases are highlighted with dark gray squares, and plausible interactions with light gray squares. Empty spaces correspond to those cases with <11 sequences from different species in common.
Pazos, Valencia, Proteins 47, 219 (2002)
Analyze the influence of species distribution on
results: Can the presence or absence of sequences
of given species always be related with high scores?
Plot shows interaction indexes for the different
phylogenetic profiles in this data set. A phylogenetic
profile represents the pattern of presence
(1)/absence (0) of that species in the alignment of
common species for a pair of proteins.
The values of interaction indexes for all pairs of
proteins containing a given phylogenetic profile are
drawn.
Answer: No obvious relation between the species
distribution (phylogenetic profile) and the interaction
index.
Pazos, Valencia, Proteins 47, 219 (2002)
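The phylogenetic profile of a protein pair, as defined above, is just a presence/absence pattern over a fixed list of species. A minimal sketch, with hypothetical species labels:

```python
def phylogenetic_profile(all_species, common_species):
    """Return a string of 1/0 marking which species occur in the
    common-species alignment of a protein pair."""
    present = set(common_species)
    return "".join("1" if species in present else "0" for species in all_species)


# Hypothetical species labels; the order of `genomes` fixes the bit positions.
genomes = ["ecoli", "bsub", "hinf", "mtub"]
print(phylogenetic_profile(genomes, ["ecoli", "hinf"]))
```

Grouping the protein pairs by this string and plotting their interaction indexes per group reproduces the kind of comparison shown in the figure.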
Second test set
Abbreviations for species names
Predicted interactions for E. coli
Number of predicted interactions for E. coli.
The bars represent the number of
predicted interactions obtained from the
67,238 calculated pairs (having at least 11
homologous sequences of common
species for the two proteins in each pair),
depending on the interaction index cutoff
established as a limit to consider
interaction.
Pazos, Valencia, Proteins 47, 219 (2002)
Among the high scoring pairs are many cases of known interacting proteins.
Predicted interactions of hypothetical protein
Example of data analysis using the E. coli i2h database. Analysis of predicted interaction partners for the hypothetical protein YABK_ECOLI, one of the E. coli proteins included in the prototype database.
The interaction index distribution for the different possible pairs is compared in an interactive Web-based interface that facilitates inspection of their functions by following links to the information deposited in SWISS-PROT and other databases, localization in the E. coli genome, and the possible relationship to E. coli operons.
In this case, the different functions highlight the relationships of the hypothetical protein with iron and zinc transport mechanisms, as well as with other hypothetical proteins.
Pazos, Valencia, Proteins 47, 219 (2002)
4 Coevolutionary Analysis
Idea: if co-evolution is relevant, a ligand-receptor pair should occupy related
positions in phylogenetic trees.
Observe that for ligand-receptor pairs that are part of most large protein families,
the correlation between their phylogenetic distance matrices is significantly
greater than for uncorrelated protein families (Goh et al. 2000, Pazos, Valencia,
2001).
Finer analysis (Goh & Cohen, 2002) shows that within these correlated
phylogenetic trees, the protein pairs that bind have a higher correlation between
their phylogenetic distance matrices than other homologs drawn from the ligand
and receptor families that do not bind.
Goh, Cohen J Mol Biol 324, 177 (2002)
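The comparison of two phylogenetic distance matrices can be illustrated with a plain Pearson correlation over their upper triangles. This is a sketch of the general idea only, not the exact procedure of the cited papers; the matrices are toy values, and it is assumed that row and column n refer to the same species in both matrices.

```python
import math


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator if denominator > 0 else 0.0


def tree_similarity(dist1, dist2):
    """Correlation between two distance matrices with matching species order."""
    return pearson(upper_triangle(dist1), upper_triangle(dist2))


# Toy distance matrices for three species.
ligand = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
receptor = [[0, 2, 8], [2, 0, 6], [8, 6, 0]]   # same shape of tree, scaled
unrelated = [[0, 4, 1], [4, 0, 3], [1, 3, 0]]

print(round(tree_similarity(ligand, receptor), 3))
print(round(tree_similarity(ligand, unrelated), 3))
```

Because the correlation ignores overall scale, two families that evolved at different rates but with the same branching pattern still score highly.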
5 Multimeric threading: Fit pair A, B to complex database
Phase 1: single-chain threading.
Each sequence is independently threaded and assigned to a list of possible
candidate structures according to the Z-scores of the alignments.
The Z-score for the k-th structure having energy E_k is given by:
\[ Z_k = \frac{E_k - \langle E \rangle}{\sigma} \]
where \langle E \rangle and \sigma are the mean and standard deviation of the energy of the
probe in all templates of the structural database.
For the assignment of energies, statistical potentials of residue pairing frequencies
are used.
Library of 3405 protein folds where the pairwise sequence identity is < 35%.
Lu, ..., Skolnick, Genome Res 13, 1146 (2003)
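The Z-score of phase 1 is an ordinary standardisation of the threading energies of one probe sequence across all templates. The sketch below uses made-up energies and assumes the population standard deviation; whether the original work uses the population or the sample form is not stated on the slide.

```python
import statistics


def z_scores(energies):
    """Z_k = (E_k - mean) / sigma for the energies of one probe
    against all templates of the library."""
    mean = statistics.mean(energies)
    sigma = statistics.pstdev(energies)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in energies]
    return [(energy - mean) / sigma for energy in energies]


# Made-up threading energies of one probe against four templates.
energies = [-10.0, -2.0, -3.0, -1.0]
scores = z_scores(energies)
best = min(range(len(scores)), key=scores.__getitem__)
print([round(z, 2) for z in scores], "best template index:", best)
```

With this sign convention, the template with the lowest energy has the most negative Z-score, so candidate structures would be ranked from the most negative value upwards.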