TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition Principal Investigators: Thomas Ioerger (Dept. Computer Science) James Sacchettini (Dept. Biochem/Biophys) Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee, Lalji Kanbi, Reetal Pai & Jacob Smith Funding: National Institutes of Health Texas A&M University
TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition
Principal Investigators: Thomas Ioerger (Dept. Computer Science)
James Sacchettini (Dept. Biochem/Biophys)
Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee,
Lalji Kanbi, Reetal Pai & Jacob Smith
Funding: National Institutes of Health
Texas A&M University
X-ray crystallography
• Most widely used method for protein modeling
• Steps:
  – Grow crystal
  – Collect diffraction data
  – Generate electron density map (Fourier transform)
  – Interpret map, i.e. infer atomic coordinates
  – Refine structure
• Model-building
  – Currently: crystallographers
  – Challenges: noise, resolution
  – Goal: automation
• Automated model-building program
• Can we automate the kind of visual processing of patterns that crystallographers use?
  – Intelligent methods to interpret density, despite noise
  – Exploit knowledge about typical protein structure
• Focus on medium-resolution maps
  – optimized for 2.8 Å (actually, 2.6-3.2 Å is fine)
  – typical for MAD data (useful for high-throughput)
  – other programs exist for higher-res data (ARP/wARP)
Overview of TEXTAL
• Input: electron density map (or structure factors) → TEXTAL → Output: protein model (may need refinement)
• Data flow: Crystal → (collect data) → Diffraction data → Electron density map → Model of backbone → Model of backbone & side chains → Corrected & refined model
• CAPRA: models backbone
  – SCALE MAP
  – TRACE MAP
  – CALCULATE FEATURES
  – PREDICT Cα's
  – BUILD CHAINS
  – PATCH & STITCH CHAINS
  – REFINE CHAINS
• LOOKUP: models side chains
• POST-PROCESSING
  – SEQUENCE ALIGNMENT
  – REAL SPACE REFINEMENT
CAPRA: C-Alpha Pattern-Recognition Algorithm
• Steps: tracing, linking
• Neural network: estimates which pseudo-atoms are closest to true Cα's
• Best-first search with heuristic scoring function based on:
  – neural net scores
  – density
  – connectivity
  – secondary structure

Example of Cα-chains fit by CAPRA
• Rat α2 urinary protein (P. Adams)
• Data: 2.5 Å MR; map generated at 2.8 Å
• % built: 84%
• # chains: 2
• Lengths: 47, 88
• RMSD: 0.82 Å
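The chain-linking idea can be illustrated with a minimal best-first search sketch. This is not CAPRA's code: the single per-atom score stands in for the combination of neural net scores, density, connectivity and secondary structure named on the slide, and the names `scores`, `links` and `best_first_chain` are hypothetical.

```python
import heapq

def best_first_chain(scores, links, start, max_len=10):
    """Grow a chain from `start`, always expanding the best-scoring partial chain first.

    scores: dict mapping pseudo-atom id -> score (assumed: higher means more Ca-like)
    links:  dict mapping pseudo-atom id -> list of ids it may connect to
    """
    # heapq is a min-heap, so priorities are stored as negated chain scores.
    frontier = [(-scores[start], [start])]
    best_score, best_chain = scores[start], [start]
    while frontier:
        neg_score, chain = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_score, best_chain = -neg_score, chain
        if len(chain) >= max_len:
            continue
        for nxt in links.get(chain[-1], []):
            if nxt not in chain:  # a pseudo-atom is used at most once per chain
                heapq.heappush(frontier, (neg_score - scores[nxt], chain + [nxt]))
    return best_chain

# Toy network of four pseudo-atoms with made-up scores.
scores = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.7}
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
chain = best_first_chain(scores, links, "a")
```

On this toy network the search prefers the route through the high-scoring pseudo-atom `b` over the low-scoring `c`. A real scoring function would also penalize implausible geometry, which this sketch ignores.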
Stage 2: LOOKUP
• LOOKUP is based on pattern recognition
  – Given a local (5 Å spherical) region of density, have we seen a pattern like this before (in another map)?
  – If so, use similar atomic coordinates.
• Use a database of maps with known structures
  – 200 proteins from PDB-Select (non-redundant)
  – back-transformed (calculated) maps at 2.8 Å (no noise)
  – regions centered on 50,000 Cα's
• Use feature extraction to match regions efficiently
  – features (e.g. moments) represent local density patterns
  – features must be rotation-invariant (independent of 3D orientation)
  – use density correlation for more precise evaluation
• CAPRA, BUILD CHAINS: examines the network of Cα's and uses heuristic search to connect them to form backbone chains
• LOOKUP: uses case-based reasoning to find, for each Cα, the best-matching local region in a database
The LOOKUP Process
• Diagram labels: database of known maps; region in map to be interpreted; find optimal rotation
• "2-norm": weighted Euclidean distance metric for retrieving matches:

$$\mathrm{dist}(R_1, R_2) = \sum_i w_i \,\bigl(F_i(R_1) - F_i(R_2)\bigr)^2$$

• Two-step filter: 1) by features 2) by density correlation
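A minimal sketch of the two-step filter, assuming each region is stored as a feature vector plus a list of sampled density values. The record layout, the shortlist size `k` and the function names are illustrative assumptions, and the rotation search that precedes density correlation is omitted. The distance is computed as the weighted sum of squared feature differences; taking a square root would not change the ranking.

```python
import math

def weighted_distance(f1, f2, weights):
    """Weighted sum of squared feature differences between two regions."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2))

def correlation(d1, d2):
    """Pearson correlation between two equally sampled density vectors."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    v1 = sum((a - m1) ** 2 for a in d1)
    v2 = sum((b - m2) ** 2 for b in d2)
    if v1 == 0 or v2 == 0:
        return 0.0
    return cov / math.sqrt(v1 * v2)

def lookup(query, database, weights, k=2):
    """Step 1: shortlist the k nearest regions by features.
    Step 2: pick the shortlisted region with the highest density correlation."""
    shortlist = sorted(
        database,
        key=lambda r: weighted_distance(query["features"], r["features"], weights),
    )[:k]
    return max(shortlist, key=lambda r: correlation(query["density"], r["density"]))

# Toy database: B is nearest by features, but A matches the density better.
database = [
    {"name": "A", "features": [0.10, 0.90], "density": [1.0, 2.0, 3.0, 4.0]},
    {"name": "B", "features": [0.12, 0.88], "density": [4.0, 3.0, 2.0, 1.0]},
    {"name": "C", "features": [0.90, 0.10], "density": [1.0, 2.0, 3.0, 4.1]},
]
query = {"features": [0.12, 0.89], "density": [1.1, 2.0, 2.9, 4.2]}
best = lookup(query, database, weights=[1.0, 1.0], k=2)
```

The cheap feature comparison prunes the database (region C never reaches step 2), and the more expensive density comparison makes the final choice among the survivors.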
Examples of Numeric Density Features
• Distance from center-of-sphere to center-of-mass
• Moments of inertia: relative dispersion along orthogonal axes
• Geometric features like "spoke angles"
• Local variance and other statistics

Features are designed to be rotation-invariant, i.e. same values for a region in any orientation/frame-of-reference.

TEXTAL uses 19 distinct numeric features to represent the pattern of density in a region, each calculated over 4 different radii, for a total of 76 features.
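Rotation invariance can be checked numerically. The sketch below assumes a region is a list of `(x, y, z, density)` samples in a sphere centered at the origin; `radial_second_moment` is a simplified stand-in for the moment-of-inertia features, not TEXTAL's definition, and all names are hypothetical.

```python
import math

def center_of_mass_distance(points):
    """Distance from the sphere center (origin) to the density-weighted center of mass."""
    total = sum(rho for _, _, _, rho in points)
    cx = sum(x * rho for x, _, _, rho in points) / total
    cy = sum(y * rho for _, y, _, rho in points) / total
    cz = sum(z * rho for _, _, z, rho in points) / total
    return math.sqrt(cx * cx + cy * cy + cz * cz)

def radial_second_moment(points):
    """Density-weighted mean squared distance from the sphere center."""
    total = sum(rho for _, _, _, rho in points)
    return sum(rho * (x * x + y * y + z * z) for x, y, z, rho in points) / total

def rotate_z(points, angle):
    """Rotate every sample about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z, rho) for x, y, z, rho in points]

# Toy region of four density samples.
region = [
    (1.0, 0.0, 0.0, 2.0),
    (0.0, 1.0, 0.0, 1.0),
    (0.0, 0.0, 1.0, 1.0),
    (-1.0, 0.0, 0.0, 0.5),
]
rotated = rotate_z(region, math.radians(37.0))
before = (center_of_mass_distance(region), radial_second_moment(region))
after = (center_of_mass_distance(rotated), radial_second_moment(rotated))
```

Both feature values agree before and after the rotation up to floating-point error, which is what allows regions to be compared without first aligning them.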
SLIDER: Feature-weighting algorithm
• Euclidean distance metric used for retrieval:

$$\mathrm{dist}(R_1, R_2) = \sum_i w_i \,\bigl(F_i(R_1) - F_i(R_2)\bigr)^2$$

• Emphasize the importance of relevant features, avoid noisy features
• Goal: find the optimal weight vector w that generates the highest probability of hits (matches) in the top K candidates from the database
• Concept of Slider:
  – analyze distances between representative matches and mismatches
  – adjust features so that the most matches are ranked higher than mismatches

Slider Algorithm(w, F, {R_i}, matches, mismatches)
  choose feature f ∈ F at random
  for each <R_i, R_j, R_k>, R_j ∈ matches(R_i), R_k ∈ mismatches(R_i)
    compute cross-over point λ_i where: dist'(R_i, R_j) = dist'(R_i, R_k)
      dist'(X, Y) = λ (X_f − Y_f)² + (1 − λ) dist\f(X, Y)
  pick λ that is the best compromise among the λ_i: ranks most matches above mismatches
  update weight vector: w' ← update(w, f, λ), w_f' = λ
  repeat until convergence
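One Slider step for a single feature can be sketched as follows. This is an interpretation under stated assumptions rather than the authors' implementation: each triple is reduced to four numbers, the squared difference in the chosen feature and the distance over the remaining features, for the match and for the mismatch. The mixing weight in dist' is called `lam`, and "best compromise" is taken to mean the candidate value that ranks the most matches above their mismatches. All function names are hypothetical.

```python
def crossover(a_match, b_match, a_mis, b_mis):
    """Mixing weight at which match and mismatch are equally distant, or None.

    Solves lam*a_match + (1-lam)*b_match == lam*a_mis + (1-lam)*b_mis,
    where a_* is the squared difference in the chosen feature f and
    b_* is the distance computed over all other features.
    """
    denom = (a_match - a_mis) - (b_match - b_mis)
    if denom == 0:
        return None  # the two lines never cross
    return (b_mis - b_match) / denom

def count_correct(lam, triples):
    """Number of triples whose match is strictly closer than its mismatch."""
    return sum(
        lam * am + (1 - lam) * bm < lam * ak + (1 - lam) * bk
        for am, bm, ak, bk in triples
    )

def best_lambda(triples):
    """Pick the candidate weight that ranks the most matches above mismatches."""
    cuts = {0.0, 1.0}
    for t in triples:
        lam = crossover(*t)
        if lam is not None and 0.0 < lam < 1.0:
            cuts.add(lam)
    cuts = sorted(cuts)
    # Rankings only change at cross-over points, so test one value per interval.
    candidates = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])] + [0.0, 1.0]
    return max(candidates, key=lambda lam: count_correct(lam, triples))

# Two toy triples: feature f helps the first and hurts the second.
triples = [
    (0.1, 0.5, 0.9, 0.2),  # correct only when lam is large enough
    (0.4, 0.1, 0.2, 0.6),  # correct only when lam is small enough
]
lam = best_lambda(triples)
weights = {"f": 0.5}
weights["f"] = lam  # the update on the slide: w_f' is set to the chosen value
```

Here the two cross-over points bracket an interval in which both triples are ranked correctly, and the chosen value falls inside it. Repeating the step over randomly chosen features gives the iterative loop on the slide.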
SLIDER Results
[Figure: Convergence of feature selection/weighting algorithms. x-axis: Iterations (0-250); y-axis: Accuracy of ranking (60-100); series: SLIDER, SFS, SBS, DIET]
[Figure: Accuracy of case retrieval. y-axis: Number of matches retrieved (0-8); bars: SLIDER, SBS, DIET, SFS, Uniform weights]
[Figure: Speed of convergence. y-axis: Time (seconds), 0-2000; bars: SLIDER, SFS, SBS, DIET]
[Figure: Effectiveness of retrieval using Euclidean (tolerance = .02). x-axis: k (0-4000); y-axis: Average no. of matches caught in top k (0-7); series: Uniform-weighted, Slider-weighted]
Stage 3: Post-Processing
Quality of TEXTAL models
• Typically builds >80% of the protein atoms
• Accuracy of coordinates: ~1 Å error (RMSD)
  – Depends on resolution and quality of map
• Percent amino acid identity = 87.5% (mistakes: small frame-shifts around gaps in alignment); all-atom RMSD = 0.92 Å
• Closeup of β-strand (TEXTAL model in green)
• Closeup of another β-strand and turn
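The RMSD figures quoted above measure the root-mean-square distance between corresponding atoms in the built model and a reference structure. A minimal sketch, assuming both coordinate lists are already paired atom-by-atom and expressed in the same frame (no superposition step); the function name is hypothetical.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between paired (x, y, z) coordinates, in the input units."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model, reference)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model))

# Two atoms, each displaced by 1 unit along z, give an RMSD of 1.
value = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
```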
Implementation
• Project started in 1998
  – Collaboration between TAMU Computer Science & Biochemistry departments
• 100,000 lines of C/C++, Perl, Python code
• ~8 developers
• CVS for version management
• Platforms: Irix, Linux, OSX, Win32
• Speed: 1-3 hours for medium-sized proteins
Deployment
• September 2004: Linux and OSX distributions
  – Can be downloaded from http://textal.tamu.edu:12321
  – 40 trial licenses granted so far
• June 2002: WebTex (http://textal.tamu.edu:12321)
  – Until May 2005: TB Structural Genomics Consortium members only
  – Recently opened to the public
  – ~500 jobs successfully processed
  – 120 users from 70 institutions in 20 countries
• July 2003: Model building component of PHENIX
  – Python-based Hierarchical ENvironment for Integrated Xtallography
  – Consortium members:
    • Lawrence Berkeley National Lab
    • University of Cambridge
    • Los Alamos National Lab
    • Texas A&M University
  – April 2005: Alpha release, over 300 downloads so far