Protein Structure Space
Patrice Koehl
Computer Science and Genome Center
http://www.cs.ucdavis.edu/~koehl/
From Sequence to Function
KKAVINGEQIRSISDLHQTLKKWELALPEYYGENLDALWDCLTGVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGCDITI
Sequence
Structure
Function
ligand
Protein Structure Space
1CTF (68 AA), 1TIM (247 AA), 1K3R (268 AA)
1A1O (384 AA), 1NIK (4504 AA), 1AON (8337 AA)
Outline
• Protein Shape Descriptors: Differential Geometry Tools
• Classifying Proteins: The Shapes of Protein Structures
• Protein Structure Space: Dimension?
• Complexity of Protein Structures: Are Proteins 3D, or 1D objects?
Classification of Protein Structure: CATH
C (Class): Alpha, Mixed Alpha/Beta, Beta
A (Architecture): Sandwich, Barrel, Super Roll
T (Topology): TIM Barrel, Other Barrel
Protein Structure Space
Test set: 2,930 proteins out of 23,000 proteins in PDB
No sequence similarity (Fasta E-value < e-4)
Reference structural similarity defined from CATH: 769 folds
104,000 pairs of similar structures out of 4,600,000 pairs
Performance measure: ROC curve (Receiver Operating Characteristic)
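Projecting Protein Structure Space
$$D = \begin{bmatrix} 0 & \cdots & d_{1N} \\ \vdots & 0 & \vdots \\ d_{N1} & \cdots & 0 \end{bmatrix} \;\rightarrow\; G = X X^{T} \;\rightarrow\; X$$
Distance Matrix → Metric Matrix → Points in Space
$$D = \begin{bmatrix} 0 & \cdots & 1 \\ \vdots & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \qquad D = \begin{bmatrix} 0 & \cdots & 0 \\ \vdots & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
[Figure panels: Class, Fold; axis label: k]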
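The step from a distance matrix D to a metric matrix G = X X^T and then to points X is the construction used in classical multidimensional scaling. The sketch below is one way to carry it out with NumPy, not necessarily the procedure behind the slides; the function name `points_from_distances` and the use of double centering to build G are assumptions made for illustration.

```python
import numpy as np

def points_from_distances(D, k):
    """Classical MDS sketch: k-dimensional coordinates X from a distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double centering turns squared distances into a metric (Gram) matrix G = X X^T.
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    # G is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    w, V = np.linalg.eigh(G)
    order = np.argsort(w)[::-1][:k]
    # Negative eigenvalues signal that D is not exactly Euclidean; clip them to zero.
    w_top = np.clip(w[order], 0.0, None)
    X = V[:, order] * np.sqrt(w_top)
    # Also return the full spectrum (descending) to judge how many dimensions matter.
    return X, w[::-1]
```

The decay of the returned eigenvalues gives a rough indication of how many coordinates are needed to represent the distances.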
Protein Structure Similarity
Root mean square distance (cRMS):
$$cRMS(A, B) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| a_i - R\, b_i - T \right\|^{2}}$$
N: number of equivalent atoms between A and B
R, T: rigid transformation that minimizes cRMS
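A minimal sketch of the cRMS computation, assuming the equivalent atoms are already paired row by row in two N x 3 arrays. The optimal R and T are obtained here with the SVD-based Kabsch construction, which is a standard choice and not necessarily the one used for the results in these slides; the function name `crms` is hypothetical.

```python
import numpy as np

def crms(A, B):
    """cRMS between paired N x 3 coordinate arrays A and B after optimal superposition."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # The optimal translation superimposes the two centroids.
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - ca, B - cb
    # Optimal rotation from the SVD of the 3 x 3 covariance matrix (Kabsch).
    U, S, Vt = np.linalg.svd(B0.T @ A0)
    # Force det(R) = +1 so that R is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = ca - R @ cb
    # Residuals a_i - R b_i - T, with coordinates stored as rows.
    diff = A - (B @ R.T + T)
    return np.sqrt((diff ** 2).sum() / len(A)), R, T
```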
Eigenvalues of the Metric Matrix:
Protein Structure Classes
Measure of Structure Similarity: cRMS after Optimal Superposition (Structal)
A Picture of the Protein Structure Space
α Proteins
β Proteins
α and β Proteins
Labeled structures: 1a81G2, 1sfcK0, 1repC2, 1bdo00, 2bi6H0
Protein Fold Space: ROC Analysis (Receiver Operating Characteristic)
[Plot: rate of true positives (%) versus rate of true negatives (%), both axes from 0 to 100]
Random measure: Area = 0.5
"Perfect" measure: Area = 1.0
True positives: pairs of proteins that belong to the same T class of CATH
True negatives: pairs of proteins that belong to the same C class, but not the same T class
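The area under a ROC curve can be computed without drawing the curve: it equals the fraction of (positive, negative) pairs that the measure ranks in the right order. The sketch below assumes a distance-like measure (smaller means more similar) and two hypothetical input lists, `pos_dist` for pairs in the same T class and `neg_dist` for pairs in the same C class but different T classes. It compares all pairs directly, so it is only practical for small samples.

```python
import numpy as np

def roc_area(pos_dist, neg_dist):
    """Area under the ROC curve for a distance measure (smaller = more similar)."""
    pos = np.asarray(pos_dist, dtype=float)[:, None]
    neg = np.asarray(neg_dist, dtype=float)[None, :]
    # Count (positive, negative) pairs ranked correctly; ties count as one half.
    wins = (pos < neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)
```

A measure that cannot separate the two groups gives an area of 0.5, and one that separates them completely gives 1.0, matching the "random" and "perfect" reference values.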
Protein Fold Space
[Plot: rate of true positives (%) versus rate of true negatives (%)]
Area under the ROC curve:
Fasta: 0.54
CATH Class: 0.51
CATH Fold20: 0.98
Fold20: first 20 coordinates derived from the CATH fold matrix
CATH class: first 3 coordinates derived from the CATH class matrix
Protein Fold Space
Fasta: 0.54
Structal: 0.88
Protein Structure Features
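$$R(x, y, z) = \frac{d(x, y)}{2 \sin(\hat{z})}$$
Global radius of curvature:
$$\rho(x) = \min_{y, z} \{ R(x, y, z) \}$$
Thickness:
$$\Delta = \min_{x} \{ \rho(x) \}$$
[Figure: three points x, y, z on the curve and the circle of radius R(x,y,z) through them]
(Gonzalez & Maddocks, PNAS, 1999, 96:4769)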
Thickness of a protein structure: Δ = 2.60 Å
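For a curve sampled at discrete points, the global radius of curvature and the thickness can be estimated by brute force over all triples of points. The sketch below assumes the structure is given as an N x 3 array of distinct points (for example one point per residue, which is an assumption and not stated on the slides) and uses the circumradius formula R = abc / (4 · area), which is equivalent to d(x,y) / (2 sin(ẑ)). The function names are hypothetical, and the cost grows as N³.

```python
import numpy as np
from itertools import combinations

def circumradius(x, y, z):
    """Radius of the circle through three points (infinite if they are collinear)."""
    a = np.linalg.norm(y - z)
    b = np.linalg.norm(x - z)
    c = np.linalg.norm(x - y)
    area = 0.5 * np.linalg.norm(np.cross(y - x, z - x))
    return np.inf if area == 0.0 else a * b * c / (4.0 * area)

def global_radius_and_thickness(P):
    """rho[i] = min over triples containing point i of R; thickness = min of rho."""
    P = np.asarray(P, dtype=float)
    rho = np.full(len(P), np.inf)
    for i, j, k in combinations(range(len(P)), 3):
        r = circumradius(P[i], P[j], P[k])
        # R is symmetric in its arguments, so one triple updates all three points.
        rho[[i, j, k]] = np.minimum(rho[[i, j, k]], r)
    return rho, rho.min()
```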
Curvature Feature Vector
$$U_p = \left( \iiint \frac{1}{R^{p}(x, y, z)} \, dC_x \, dC_y \, dC_z \right)^{1/p}$$
$$C_5 = [\, U_1 \;\; U_2 \;\; U_3 \;\; U_4 \;\; U_5 \,]$$
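A discrete analogue of the curvature features U_p can be sketched by replacing the triple integral with an average over all triples of sampled points. The averaging (rather than a weighted sum with arc-length elements) is an assumption made here to keep the example short, so the values are not guaranteed to match those behind the slides; the function name `curvature_features` is hypothetical and the points are assumed distinct.

```python
import numpy as np
from itertools import combinations

def curvature_features(P, p_max=5):
    """Discrete sketch of [U_1 ... U_pmax] from an N x 3 array of distinct points."""
    P = np.asarray(P, dtype=float)
    inv_R = []
    for i, j, k in combinations(range(len(P)), 3):
        x, y, z = P[i], P[j], P[k]
        a = np.linalg.norm(y - z)
        b = np.linalg.norm(x - z)
        c = np.linalg.norm(x - y)
        area = 0.5 * np.linalg.norm(np.cross(y - x, z - x))
        # 1/R for the circle through x, y, z; zero when the points are collinear.
        inv_R.append(4.0 * area / (a * b * c))
    inv_R = np.array(inv_R)
    # Power mean of order p of 1/R, standing in for the integral on the slide.
    return np.array([np.mean(inv_R ** p) ** (1.0 / p) for p in range(1, p_max + 1)])
```

With the default `p_max=5` the result has five components, mirroring C5.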
Performance of the Curvature Feature Vector
Curvature vector performs better than Fasta.
Needs more features to match Structal.
[Plot: rate of true positives (%) versus rate of true negatives (%)]
Fasta: 0.54
C5: 0.65
Structal: 0.88
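Protein Structure Features: Writhing
Writhing Number:
$$Wr = \frac{1}{4\pi} \iint \omega(t_1, t_2) \, dt_1 \, dt_2$$
$$\omega(t_1, t_2) = \frac{\det\left( \gamma'(t_1),\; \gamma(t_1) - \gamma(t_2),\; \gamma'(t_2) \right)}{\left| \gamma(t_1) - \gamma(t_2) \right|^{3}}$$
[Figure: sign of crossing (+ / -) between curve points γ(t1) and γ(t2)]
Fain and Røgen, PNAS, 100: 119 (2003)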
Writhe Feature Vector for Each Protein: W10 (built from Wr and |Wr| terms)
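The writhing number Wr = (1/4π) ∬ ω(t1, t2) dt1 dt2 can be approximated for a polygonal backbone by a midpoint rule: each segment contributes its vector as γ'(t) dt and its midpoint as γ(t). The sketch below is a plain numerical estimate under that assumption, summing over all ordered pairs of distinct segments. It is not the exact polygonal formula of the cited paper, and the function name `writhe_estimate` is hypothetical.

```python
import numpy as np

def writhe_estimate(P):
    """Midpoint-rule estimate of the writhe of an open polygonal curve P (N x 3)."""
    P = np.asarray(P, dtype=float)
    seg = P[1:] - P[:-1]           # gamma'(t) dt for each segment
    mid = 0.5 * (P[1:] + P[:-1])   # gamma(t) sampled at segment midpoints
    n = len(seg)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = mid[i] - mid[j]
            # det(gamma'(t1), gamma(t1) - gamma(t2), gamma'(t2)) / |gamma(t1) - gamma(t2)|^3
            triple = np.linalg.det(np.array([seg[i], r, seg[j]]))
            total += triple / np.linalg.norm(r) ** 3
    return total / (4.0 * np.pi)
```

Two properties are useful sanity checks: a planar curve gives zero, and mirroring the curve flips the sign of the result.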
Protein Structure Features: Writhing
W10 Writhe performs better than C5 Curvature
[Plot: rate of true positives (%) versus rate of true negatives (%)]
Fasta: 0.54
C5: 0.65
W10: 0.77
Structal: 0.88
Clustering Protein Fragments to Extract a Small Set of Representatives (a Library)
data → clustered data → library
(Simulated annealing K-means)
Generating an approximate structure
Fragment library
A B C D
Structural Sequence: AC
Fitting Protein Structures
100 fragments of length 5: 0.91 Å cRMS (better)
50 fragments of length 7: 2.78 Å cRMS
[Plot: average cRMS distance versus complexity (states/residue), one curve per fragment size: 7, 6, 5, and 4 residues]
Longer fragments give better fit at same complexity
Complexity: $$C = N^{1/(L-3)}$$
N: number of fragments
L: size of each fragment
(Kolodny, Koehl, Guibas, Levitt, J. Mol. Biol., 323, 297, 2002)
Choosing the “right” library
Library size N such that Complexity = 20:
Size L    N
7         160000
6         8000
5         400
4         20
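The library sizes in this table follow from the complexity formula C = N^(1/(L-3)), which inverts to N = C^(L-3). A short sketch that reproduces the table; the function names are hypothetical.

```python
def complexity(n_fragments, length):
    """Complexity in states per residue: C = N ** (1 / (L - 3))."""
    return n_fragments ** (1.0 / (length - 3))

def library_size(target_complexity, length):
    """Number of fragments N that gives the target complexity: N = C ** (L - 3)."""
    return round(target_complexity ** (length - 3))

# Library sizes for complexity 20 at each fragment length in the table.
for L in (7, 6, 5, 4):
    print(L, library_size(20, L))
```

The same formula gives a complexity of 10 for 100 fragments of length 5, and about 2.66 for 50 fragments of length 7.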
A Structural Alphabet for Protein Backbone
Fragment size: 4
Number of fragments: 20
[Plots: number of structures and protein size versus cRMS between model and experimental structure (axis from 0.2 to 1.0)]
Structural Alphabet: Application to Structure Comparison
cRMS = 1Å
Collaborators
• Michael Levitt
(Computational Biology)
Stanford University
• Marc Delarue
(Biophysics)
Institut Pasteur, Paris
• Rachel Kolodny
(Computer Science)
Columbia University
• Herbert Edelsbrunner
(Math/Computer Science)
Duke University
• Peter Roegen
(Math)
DTU, Denmark
• Joel Hass
(Math)
UC Davis
Thank You