Swiss Institute of Bioinformatics
Protein Structure Bioinformatics: Introduction
Secondary Structure & Protein Disorder Prediction
EMBnet course Lausanne, February 21, 2007
Lorenza Bordoli
Overview
Introduction
Secondary Structure Prediction
Solvent Accessibility Prediction
Disorder Prediction
Principles of protein structure
Primary Structure
Secondary Structure
Tertiary Structure (Fold)
Quaternary Structure
Principles of protein structure
Protein structures include:
Core region:
Secondary structure elements packed in close proximity
in a hydrophobic environment
Limited amino acid substitution
Outside the core:
loops and structural elements in contact with water,
membrane or other proteins
Amino acid substitution: not as restricted as in the core
Protein Structures:
Solvent Accessibility
• Buried
• Solvent exposed
Primary Structure
Secondary Structure
Tertiary Structure (Fold)
Quaternary Structure
Secondary Structures:
• α Helix
• β Sheet
Secondary structure assignment
DSSP
Dictionary of Protein Secondary Structure (Kabsch
& Sander, 1983)
Based on recognition of hydrogen-bonding patterns in
known structures
Automated assignment of secondary structure
Interprets backbone hydrogen bonds
Uses a Coulomb approximation for the hydrogen-bond energy
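As an illustration, here is a minimal Python sketch of the Kabsch & Sander electrostatic criterion. The partial charges (0.42 e and 0.20 e), the factor 332 and the −0.5 kcal/mol cutoff are the published DSSP constants; the distance arguments are assumed to be in Ångström:

```python
# Kabsch & Sander (1983) Coulomb approximation of the backbone
# N-H...O=C hydrogen-bond energy, as used by DSSP.
Q1, Q2 = 0.42, 0.20   # partial charges on C=O and N-H (units of e)
F = 332.0             # conversion factor, kcal*Angstrom/(mol*e^2)

def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Energy (kcal/mol) from the four inter-atomic distances (Angstrom)."""
    return Q1 * Q2 * F * (1.0/r_on + 1.0/r_ch - 1.0/r_oh - 1.0/r_cn)

def is_hbond(r_on, r_ch, r_oh, r_cn, cutoff=-0.5):
    """DSSP assigns a hydrogen bond when E < -0.5 kcal/mol."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < cutoff
```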
2. Identify regions where 4 of 6 contiguous residues have P(H) > 100: an “α-helix nucleus”

Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69  142  151  121  145   98   77   69   57
Extend α-helix nucleus
3. Extend the helix in both directions until a set of four contiguous residues has an average P(H) < 100.

Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69  142  151  121  145   98   77   69   57
Repeat steps 1 – 3 for entire peptide
4. Identify regions where 3 of 5 contiguous residues have P(E) > 100: a “β-sheet nucleus”
Extend the β-sheet until 4 contiguous residues have an average P(E) < 100
If the region's average P(E) > 105 and the average P(E) > the average P(H), then assign “β-sheet”
Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69  142  151  121  145   98   77   69   57
P(E)    147   75   55  147   83   37  130  105   93   75  147   75
Scan peptide for β-sheet regions
Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(α-helix) > 100. That region is declared an α-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(α-helix) < 100 is reached; that is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and its average P(α-helix) > P(β-sheet), the segment is assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify regions where 3 out of 5 residues have P(β-sheet) > 100. That region is declared a β-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(β-sheet) < 100 is reached; that is declared the end of the sheet. Any segment located by this procedure is assigned as a β-sheet if its average P(β-sheet) > 105 and its average P(β-sheet) > P(α-helix).
5. Any region with overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region, and a β-sheet if the average P(β-sheet) > P(α-helix).
6. To identify a bend at residue number j, calculate p(t) = f(j) · f(j+1) · f(j+2) · f(j+3), where f(j+1) is the bend frequency of the residue at position j+1, and likewise for f(j+2) and f(j+3). If (1) p(t) > 0.000075, (2) the average P(turn) > 1.00 in the tetrapeptide, and (3) the tetrapeptide averages obey P(α-helix) < P(turn) > P(β-sheet), then a β-turn is predicted at that location.
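The helix part of this procedure (steps 1-3) can be sketched in a few lines of Python. The propensity table below is truncated to the residues of the example sequence TSPTAELMRSTG used on the slides above, and the final segment-length and P(β-sheet) comparison checks are omitted for brevity:

```python
# Sketch of Chou-Fasman helix nucleation and extension (steps 1-3).
# P_HELIX covers only the residues of the example sequence; a real
# implementation would use the full 20-residue Chou-Fasman table.
P_HELIX = {'T': 69, 'S': 77, 'P': 57, 'A': 142, 'E': 151,
           'L': 121, 'M': 145, 'R': 98, 'G': 57}

def find_helices(seq):
    ph = [P_HELIX[aa] for aa in seq]
    helices = []
    i = 0
    while i <= len(seq) - 6:
        window = ph[i:i + 6]
        if sum(p > 100 for p in window) >= 4:      # helix nucleus found
            start, end = i, i + 6                  # end is exclusive
            # extend while a flanking 4-residue set still averages >= 100
            while start > 0 and sum(ph[start - 1:start + 3]) / 4 >= 100:
                start -= 1
            while end < len(seq) and sum(ph[end - 3:end + 1]) / 4 >= 100:
                end += 1
            helices.append((start, end))           # 0-based, end-exclusive
            i = end                                # resume scan after helix
        else:
            i += 1
    return helices

# Nucleus at PTAELM (4 of 6 residues with P(H) > 100), extended rightwards
print(find_helices("TSPTAELMRSTG"))                # [(2, 10)]
```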
Example CHOFAS output:
  CHOFAS predicts protein secondary structure
  version 2.0u61 September 1998
  Please cite: Chou and Fasman (1974) Biochem., 13:222-245
  Chou-Fasman plot of @, 12 aa; SEQ1 sequence
Artificial intelligence: computer programs are trained to recognize amino acid patterns located in known secondary structure elements and to distinguish them from patterns not located in these structures.
NNs can detect interactions between amino acids in a sequence window.
Artificial Neural Networks (ANN)
Excursion:
Introduction to Artificial
Neural Networks
Thanks to C. Pellegrini & P. Palagi (SIB) for slides about ANNs.
Inspiration - The brain
• Capable of remembering, recognizing patterns and associating. Main characteristics:
• massively parallel
• non-linear
• huge number of slow units, highly connected
• self-organizing and self-adapting
• Some statistics about the brain:
• 10¹¹ neurons
• 10¹⁵ connections
• and about neurons:
• 1 neuron is connected with 10³ to 10⁵ other neurons
• slow: 10⁻³ s (silicon logic gates: 10⁻⁹ s)
C. Pellegrini (SIB)
The nervous system
The brain continually receives information, perceives it and makes appropriate decisions.
The human nervous system is a three-stage system:
[Diagram: Stimulus → Receptors → Neural network (brain) → Effectors → Response, with forward information flow and feedback]
C. Pellegrini (SIB)
An artificial neural network
• An artificial neural network (ANN) is a “machine”:
• an assembly of artificial neurons
• created to model the way the brain executes tasks by simulating mathematically the neurons and their connections
• Requirements to achieve good performance:
• a huge number of neurons
• massive interconnection among them
C. Pellegrini (SIB)
Artificial neuron model
• Introduced by McCulloch & Pitts (1943):
v = Σᵢ wᵢ xᵢ
if v > θ then output = +1, else output = −1
[Diagram: inputs x1, x2 with weights w1, w2 and threshold θ feeding the output]
Quite simple: all signals can be 1 or −1. The neuron calculates a weighted sum of its inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to +1, otherwise −1.
P.Palagi (SIB)
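As a quick illustration, the whole model fits in a few lines of Python; the example inputs, weights and threshold below are arbitrary:

```python
# McCulloch & Pitts neuron: weighted sum of +/-1 signals vs. a threshold.
def mp_neuron(inputs, weights, theta):
    v = sum(w * x for w, x in zip(weights, inputs))   # weighted sum
    return 1 if v > theta else -1                     # fire or not

# Example: v = 0.5*1 + (-0.3)*(-1) = 0.8 > 0.1, so the neuron fires (+1)
print(mp_neuron([1, -1], [0.5, -0.3], 0.1))
```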
Artificial neuron model
• This simple neuron model consists of:
• A set of connections, called synapses, which make the link to other neurons to create a network. Each synapse has a synaptic weight which represents the strength of the connection.
• One unit which multiplies each incoming activity by the weight on the connection and adds together all these weighted inputs to get a total input.
• An activation function that transforms the total input into an outgoing activity (to constrain the input amplitude).
v = Σᵢ wᵢ xᵢ ; if v > θ then output = +1, else output = −1
P.Palagi (SIB)
Artificial neuron model
Modern McCulloch & Pitts neuron:
[Diagram: input signals x1 … xp with synaptic weights wk1 … wkp feed a summation unit Σ producing vk; an activation function φ(·) with threshold θk yields the output yk]
P.Palagi (SIB)
Artificial neuron model
The model can be described mathematically as:

vₖ = Σⱼ wₖⱼ xⱼ (sum over j = 1 … p)   and   yₖ = φ(vₖ − θₖ)

Where:
• x₁, x₂, …, xₚ are the inputs,
• wₖ₁, wₖ₂, …, wₖₚ are the synaptic weights of neuron k,
• vₖ is the linear combiner output,
• θₖ is the threshold,
• φ(·) is the activation function,
• yₖ is the output signal of the neuron.
P.Palagi (SIB)
Types of activation functions
The activation function defines the output of a neuron in terms of the activity level at its inputs. There are 3 basic types of activation functions:

• threshold function:
  φ(v) = 1 if v ≥ 0, 0 if v < 0

• piecewise-linear function:
  φ(v) = 1 if v ≥ β, α·v if 0 < v < β, 0 if v ≤ 0

• sigmoid function:
  φ(v) = 1 / (1 + e^(−a·v))
  or
  φ(v) = tanh(v/2) = (1 − e^(−v)) / (1 + e^(−v))
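The same three functions as plain Python; since the break points of the piecewise-linear form are garbled on the original slide, this sketch assumes a ramp of slope α between 0 and a saturation point β (the defaults α = β = 1 make it continuous):

```python
import math

def threshold(v):
    """Threshold (Heaviside) function."""
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v, alpha=1.0, beta=1.0):
    """Saturates at 1 above beta, is 0 at or below 0, linear in between."""
    if v >= beta:
        return 1.0
    if v <= 0:
        return 0.0
    return alpha * v

def sigmoid(v, a=1.0):
    """Logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * v))

def tanh_sigmoid(v):
    """tanh form: tanh(v/2) == (1 - exp(-v)) / (1 + exp(-v))."""
    return math.tanh(v / 2.0)
```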
Activation functions - interpretation
An activation function is a decision function:
• it defines a threshold below which the activation value will not fire any output,
• it allows selection, linear or not, among different activation values,
• the highest output value comes from the highest activation value, i.e. the greatest similarity between the input values and the synaptic weights.
Network architectures
The power of neural networks comes from their collective behavior in a network where all neurons are interconnected. The network starts evolving: neurons continuously evaluate their output by looking at their inputs, calculating the weighted sum and comparing it to a threshold to decide if they should fire. This is a highly complex parallel process whose features cannot be reduced to phenomena taking place within individual neurons.
Network architectures
Neural networks are formed by an assembly of many artificial neurons. An artificial neural network may be seen as a massively parallel distributed processor.
The behavior of a neural network is determined by learning. The memorized information is retained through the synaptic weights.
⇒ Knowledge is represented by the free parameters of the neural network, i.e. synaptic weights and thresholds.
Single-layer Feed-forward Network
[Diagram: input signals x1 … xm feeding a single layer of neurons (the single-layer perceptron)]
Learning methods
An artificial neural network learning method is a procedure which adjusts the neural network's free parameters, i.e. the synaptic weights and thresholds.
Supervised: we feed the neural network with k inputs (examples) and their corresponding desired outputs. The learning algorithm modifies (little by little) the synaptic weights to adapt the obtained output to the desired output. Only the synaptic weights which produce an error are modified.
Unsupervised: we feed the neural network with the inputs only. The neural network organizes itself in order to represent the input data.
Multilayer Feed-forward Network
[Diagram: input signals x1 … xm feeding hidden neurons and an output layer (the multilayer perceptron)]
Training a neural network
Supervised Learning
We feed the neural network with the inputs (examples) and the corresponding desired outputs.
The learning algorithm modifies (step by step) the synaptic weights to adapt the obtained output to the desired output. Only the synaptic weights which produce an error are modified.
The error back-propagation algorithm consists of two phases:
the forward phase, where the activations are propagated from the input to the output layer, and
the backward phase, where the error between the actual (observed) and the desired (target) value in the output layer is propagated backwards in order to modify the weights and bias values.
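A minimal numpy sketch of the two phases for a one-hidden-layer perceptron with sigmoid units; the layer sizes, training example and learning rate are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (4, 3)), np.zeros(4)   # input(3) -> hidden(4)
W2, b2 = rng.normal(0, 0.5, (2, 4)), np.zeros(2)   # hidden(4) -> output(2)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_step(x, t, lr=0.5):
    global W1, b1, W2, b2
    # Forward phase: propagate activations from input to output layer.
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # Backward phase: propagate the output error back through the net.
    delta2 = (y - t) * y * (1 - y)            # output-layer error term
    delta1 = (W2.T @ delta2) * h * (1 - h)    # hidden-layer error term
    W2 -= lr * np.outer(delta2, h); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x); b1 -= lr * delta1
    return float(((y - t) ** 2).sum())

x, t = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0])
for _ in range(100):
    err = train_step(x, t)
print(err)   # squared error shrinks toward 0
```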
Neural Networks for Secondary Structure Prediction
Artificial intelligence: computer programs are trained to recognize amino acid patterns located in known secondary structure elements and to distinguish them from patterns not located in these structures.
NNs can detect interactions between amino acids in a sequence window.
[Diagram: a sliding window of residues from a known structure (D R Q G F V P A A Y V K K, with observed states L E E E E E E H H H E E E) is encoded over the 21-letter alphabet ACDEFGHIKLMNPQRSTVWY. and fed to the input layer; the weights of a hidden layer and an output layer with one unit per state (H, E, L) are adjusted during training]
(B. Rost, Columbia, New York)
Neural Networks for Secondary Structure Prediction
[Diagram: the trained network applied to a new window; the output activations are H = 0.19, E = 0.61, L = 0.17, so the winner is E: the prediction for the central residue]
(B. Rost, Columbia, New York)
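Picking the winner is simply an argmax over the three output activations shown on the slide:

```python
# Winner-take-all over the three output units (values from the slide)
scores = {'H': 0.19, 'E': 0.61, 'L': 0.17}
prediction = max(scores, key=scores.get)
print(prediction)   # 'E'
```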
Neural Networks for Secondary Structure Prediction
Neural Networks
Benefits:
Generally applicable
Can capture higher-order correlations
Accepts inputs other than sequence information
Drawbacks:
Needs many data points (solved structures)
(Each group of input units corresponds to the 21×3-bit coding for the profile of one residue.)
(B.Rost, Columbia, NewYork)
3rd Generation secondary structure prediction
PHD method (Rost and Sander)
Combines neural networks with MAXHOM multiple sequence profiles
6-8 percentage point increase in prediction accuracy over standard neural networks
Uses a second-layer “structure-to-structure” network to filter predictions
Jury of predictors
3rd generation secondary structure prediction
PHD (Rost et al.) Q3 72-76%
[B. Rost (2001) J. Struct. Biol. 134, 204]
[Plot: Q3 as a function of the prediction reliability index (0 = weak, 9 = strong); accuracy rises through 59%, 65% and 72% as the reliability index increases]
3rd generation secondary structure prediction
PSIPRED (Jones, D.T.)
Uses alignments from iterative sequence searches (PSI-BLAST) as input to a neural network
Better predictions due to better sequence profiles
Available as a stand-alone program and via the web
How accurate are predictions today?
[Histogram: number of protein chains vs. per-residue accuracy Q3 (0-100%); ⟨Q3⟩ = 72.3%, σ = 10.5%; outlier chains such as 1spf, 1bct, 1stu, 3ifm and 1psm are marked]
(B. Rost, Columbia, New York)
How accurate are predictions today?
Q3 = 72-77% ± 11% (on average)
• i.e. about 30% of predicted assignments are wrong
• i.e. for 2/3 of all proteins, between 60% and 80% of residues are predicted correctly
• i.e. for your protein, accuracy can be lower than 60% or higher than 80%
Secondary Structure Prediction
META-PredictProtein Server
• http://www.predictprotein.org
• Simultaneous submission tool for several other servers, e.g. JPRED, PHD, PROF, PSIPRED, SAM-T99, APSSP2, SSpro
• Also includes motif searches, domain assignments, TM predictions, etc.
1D-Structure prediction
Secondary Structure Prediction
Solvent Accessibility Prediction
Identify exposed residues, e.g. for
mutation studies, epitopes, etc.
1D-Structure prediction
Projection onto strings of structural assignments
E.g. “solvent accessibility” (buried or exposed?)

A B C D E F G …
| | | | | | |
e e b b e e e …

Accuracy of two-state prediction: 75% ± 10%
PHDacc: solvent accessibility prediction
[http://www.predictprotein.org]
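As a sketch of such a two-state projection, per-residue relative accessibility can be thresholded into a buried/exposed string; the 16% threshold is a commonly used choice for two-state splits, not a value quoted on the slide:

```python
def two_state(rel_acc, threshold=0.16):
    """Map relative solvent accessibilities (0 = fully buried) to 'b'/'e'."""
    return ''.join('e' if a >= threshold else 'b' for a in rel_acc)

# Reproduces the 'e e b b e e e' pattern from the example above
print(two_state([0.45, 0.30, 0.05, 0.02, 0.61, 0.33, 0.50]))   # 'eebbeee'
```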
1D-Structure Prediction
Introduction
Secondary Structure Prediction
Solvent Accessibility Prediction
Disorder Prediction
Native Disorder in Proteins
Structural biology tenet: “the function of a protein is determined by its 3D structure”.
However, disordered proteins, or disordered regions of proteins, have no fixed secondary or tertiary structure under physiological conditions and/or in the absence of a binding partner/ligand:
An ensemble of structural states leading to dynamic flexibility
Non-globular structures that are extended in the solvent
(Examples: PDB entries 2hfv, 2hfq)
Experimental Detection of Disordered regions
A protein region is defined as disordered if it is devoid of stable secondary structure and samples a large number of conformations:
X-Ray crystallography: lack of electron density
NMR: dynamics of sizeable disordered regions
CD (Circular dichroism)
SAXS (Small-angle X-ray scattering)
Hydrodynamic measurements
Traditional biochemical studies: proteolytic susceptibility
…
DisProt: Database of protein disorder (www.disprot.org)
Role of Protein Disorder
Disordered regions participate in many biological processes:
Regulation of transcription and translation
Cellular signal transduction
Cell-cycle control
Regulation of the self-assembly of large multiprotein complexes (e.g. the bacterial flagellum and the ribosome)
Role?
Form larger contact areas with other proteins
Flexibility allows binding of multiple ligands
Proteins are easily regulated by post-translational modifications (PTMs)
The relative instability of the intrinsically disordered proteins involved in transcription and signaling provides a further level of control through proteolytic degradation: their concentration is easily regulated by protease digestion.
The continuum of protein structure
ACTR: interaction domain of activator (p160) for retinoid receptor
NCBD: nuclear-receptor co-activator domain of CBP
TFIIIA: 3 zinc fingers of transcription factor
eIF4E: translation-initiation factor
Thermodynamic consequences of coupled folding and binding
There is an entropic cost associated with the disorder-to-order transition that accompanies binding of an intrinsically unstructured protein to its target. The key thermodynamic driving force for the binding reaction is therefore a favorable enthalpic contribution: enthalpy-entropy compensation. Coupled folding and binding gives rise to a complex with high specificity and relatively low affinity, appropriate for signal-transduction proteins.
Characteristics of Disorder regions
Clear patterns characterize disordered regions (see the toy scan below):
Low sequence complexity (biased composition, overrepresentation of a few residues)
Amino acid compositional bias:
• Low content of bulky hydrophobic amino acids (Val, Leu, Ile, Met, Phe, Trp and Tyr)
• High proportion of polar and charged amino acids (Gln, Ser, Pro, Glu, Lys and sometimes Gly and Ala)
High sequence variability (high flexibility)
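As a toy illustration of this compositional bias (not one of the published disorder predictors), one can score the fraction of disorder-promoting residues in a sliding window; the residue sets follow the lists above:

```python
ORDER_PROMOTING = set("VLIMFWY")      # bulky hydrophobic / aromatic residues
DISORDER_PROMOTING = set("QSPEKGA")   # polar/charged plus Pro, Gly, Ala

def disorder_bias(seq, window=21):
    """Fraction of disorder-promoting residues in each full-length window."""
    return [sum(aa in DISORDER_PROMOTING for aa in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

# Windows rich in Q/S/P/E/K/G/A score near 1; hydrophobic cores score near 0
print(max(disorder_bias("MKQSPEEKSGQPSAEKSPGQAESPKSGVLIMFWYVLIMFWYVLIM")))
```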
Role of Prediction of Disordered regions
The prediction of disordered regions would provide:
A first step in the identification of functionally relevant disordered regions
• Design of laboratory experiments for the identification of binding sites within disordered regions [1]
Identification of regions that hinder successful crystallization of the protein: a bottleneck in structural proteomics (high-throughput structure determination pipelines) [2]

[1] Longhi S. et al. (2003), J. Biol. Chem., 278, 18638
[2] Linding R. et al. (2003), Structure, 11, 1453