EBI is an Outstation of the European Molecular Biology Laboratory.
PDBe-fold (SSM)
A web-based service for protein structure comparison and structure searches
Gaurav Sahni, Ph.D.
Dec 29, 2015
Structure alignment
Structure alignment may be defined as the identification of residues occupying “equivalent” geometrical positions
Unlike in sequence alignment, residue type is neglected
Used for:
• measuring structural similarity
• protein classification and functional analysis
• database searches
Sequence and Structure Alignments
Sequence alignment
Based on residue identity, sometimes with a modified alphabet
    --AARNEDDDGKMPSTF-L
    E-AARNFG-DGK--STFIL
Algorithms: Dynamic programming + heuristics
Applications: BLAST, FASTA, FLASH and others
Used for:
• evolution studies
• protein function analysis
• guessing on structure similarity
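As a minimal sketch of the dynamic programming idea behind sequence alignment, the following implements a plain Needleman-Wunsch global alignment. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions; tools such as BLAST or FASTA add heuristics and substitution matrices on top of this basic scheme.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `needleman_wunsch("AARNEDDDGKMPSTFL", "EAARNFGDGKSTFIL")` aligns the two example sequences above after their gaps are stripped. Note that only residue identity enters the score, which is exactly what structure alignment disregards.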
Structure alignment
Based on geometrical equivalence of residue positions, residue type disregarded
Used for:
• protein function analysis
• some aspects of evolution studies
Algorithms: Dynamic programming, graph theory, Monte Carlo (MC), geometric hashing and others
Applications: DALI, VAST, CE, MASS, SSM and others
Methods
Many methods are known:
• Distance matrix alignment (DALI, Holm & Sander, EBI)
• Vector alignment (VAST, Bryant et al., NCBI)
• Depth-first recursive search on SSEs (DEJAVU, Madsen & Kleywegt, Uppsala)
• Combinatorial extension (CE, Shindyalov & Bourne, SDSC)
• Dynamic programming on Cα (Gerstein & Levitt)
• Dynamic programming on SSEs (SSA, Singh & Brutlag, Stanford University)
• many more …
SSM employs a 2-step procedure:
A. Initial structure alignment and superposition using SSE graph matching
B. Cα-alignment
Three-dimensional graph matching
• Protein secondary structure elements (SSEs) are natural and convenient objects for building three-dimensional graphs.
• Secondary structure provides most of the functionality and is conserved through evolution.
• Details of a protein fold are expressed in terms of two types of SSE: helices and strands.
Graph representation of SSEs
• SSE graphs are built from SSEs represented by vectors
• Each SSE is used as a graph vertex, labelled (Ti, ρi)
• Any 2 vertices are connected by an edge whose label describes the position and orientation of the connected SSEs
• Each edge is labelled with a property vector: the angles α1 and α2 between the edge and the two vertices, the torsion angle between the vertices, and the length of the edge L
[Figure: two vertices Vi and Vj connected by an edge; labels e and L]
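The edge label described above can be sketched in a few lines. This is an illustration under stated assumptions, not the SSM implementation: each SSE is taken as a vector from a start point to an end point, the edge is assumed to join the midpoints of the two SSE vectors, and (Ti, ρi) is read as SSE type and SSE length. The names `SSE`, `EdgeLabel`, `vertex_label` and `edge_label` are hypothetical.

```python
import math
from collections import namedtuple

# Hypothetical containers; the field names are assumptions for illustration.
SSE = namedtuple("SSE", "kind start end")  # kind: "H" (helix) or "S" (strand)
EdgeLabel = namedtuple("EdgeLabel", "length alpha1 alpha2 torsion")


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def _angle(u, v):
    c = _dot(u, v) / (_norm(u) * _norm(v))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error


def vertex_label(sse):
    """Vertex label read as (type, length) of the SSE vector."""
    return sse.kind, _norm(_sub(sse.end, sse.start))


def edge_label(s1, s2):
    """Edge label for two SSEs whose midpoints are distinct."""
    v1 = _sub(s1.end, s1.start)
    v2 = _sub(s2.end, s2.start)
    mid1 = tuple((a + b) / 2 for a, b in zip(s1.start, s1.end))
    mid2 = tuple((a + b) / 2 for a, b in zip(s2.start, s2.end))
    e = _sub(mid2, mid1)            # edge vector between the two vertices
    length = _norm(e)
    n = _cross(e, v2)
    # Signed dihedral angle of v1 and v2 about the edge; its sign flips
    # under mirror reflection of the whole structure.
    torsion = math.atan2(length * _dot(v1, n), _dot(_cross(v1, e), n))
    return EdgeLabel(length, _angle(v1, e), _angle(v2, e), torsion)
```

For example, `edge_label(SSE("H", (-1, 0, 0), (1, 0, 0)), SSE("S", (0, -1, 5), (0, 1, 5)))` describes a helix and a strand whose midpoints are 5 units apart and whose axes are perpendicular to the edge and to each other.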
• Sets of vertices, edges and their labels provide a full definition of the graph.
• A graph matching algorithm requires a set of rules for comparing individual vertices and edges; the tolerances are chosen empirically.
• Both relative and absolute differences in vertex and edge lengths are used for comparison, which allows larger absolute differences for longer vertices and edges.
• Torsion angles are compared in order to distinguish mirror-symmetric mates.
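One plausible way to express such comparison rules is sketched below. The function names and the tolerance values (`abs_tol`, `rel_tol`, `tol`) are assumptions chosen for illustration; the empirically tuned tolerances used by SSM are not given here.

```python
import math


def lengths_compatible(l1, l2, abs_tol=2.0, rel_tol=0.2):
    """Compare two vertex or edge lengths.

    The allowed difference is at least abs_tol and grows with the
    length itself, so longer elements may differ by more.
    """
    allowed = max(abs_tol, rel_tol * max(l1, l2))
    return abs(l1 - l2) <= allowed


def torsions_compatible(t1, t2, tol=0.5):
    """Compare two signed torsion angles (radians), wrapping at 2*pi.

    Mirror-symmetric arrangements have torsions of opposite sign, so
    they fail this test unless the angle is close to 0 or pi.
    """
    d = abs(t1 - t2) % (2 * math.pi)
    d = min(d, 2 * math.pi - d)
    return d <= tol
```

With these defaults, lengths 30 and 35 are accepted (the relative tolerance dominates) while 5 and 8 are rejected, and torsions of +π/2 and −π/2 are treated as a mismatch.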
SSE graph matching
[Figure: SSE graphs of two structures, A and B, with helices (H1 to H6) and strands (S1 to S7) as vertices]
Matching the SSE graphs yields a correspondence between secondary structure elements, that is, between groups of residues. This correspondence may be used as an initial guess for structure superposition and for the alignment of individual residues.
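To make the idea of a correspondence concrete, here is a brute-force backtracking sketch that finds the largest set of vertex pairs whose types, lengths and mutual edge lengths agree within a tolerance. It is a simplified stand-in, not the algorithm used by SSM: only edge lengths are compared (angles and torsions are omitted), a single absolute tolerance `tol` is assumed, and the names `distance_matrix` and `match_graphs` are hypothetical.

```python
import math


def distance_matrix(points):
    """Edge lengths between SSE midpoints (assumed edge definition)."""
    return [[math.dist(p, q) for q in points] for p in points]


def match_graphs(verts_a, edges_a, verts_b, edges_b, tol=1.5):
    """Return the largest list of (i, j) vertex pairs found.

    verts_*: list of (kind, length); edges_*: matrix of edge lengths.
    """
    best = []

    def compatible(i, j, mapping):
        kind_a, len_a = verts_a[i]
        kind_b, len_b = verts_b[j]
        if kind_a != kind_b or abs(len_a - len_b) > tol:
            return False
        # Every edge to an already matched vertex must agree as well.
        return all(abs(edges_a[i][k] - edges_b[j][l]) <= tol
                   for k, l in mapping)

    def extend(i, mapping, used_b):
        nonlocal best
        # Prune: even matching all remaining vertices cannot beat best.
        if len(mapping) + (len(verts_a) - i) <= len(best):
            return
        if i == len(verts_a):
            best = list(mapping)
            return
        for j in range(len(verts_b)):
            if j not in used_b and compatible(i, j, mapping):
                mapping.append((i, j))
                used_b.add(j)
                extend(i + 1, mapping, used_b)
                mapping.pop()
                used_b.remove(j)
        extend(i + 1, mapping, used_b)  # leave vertex i unmatched

    extend(0, [], set())
    return best


# Toy data: B contains the same three SSEs as A in a different chain
# order, plus one extra helix far away.
verts_a = [("H", 10.0), ("S", 5.0), ("S", 6.0)]
mid_a = [(0, 0, 0), (5, 0, 0), (5, 8, 0)]
verts_b = [("S", 6.0), ("H", 10.0), ("H", 12.0), ("S", 5.0)]
mid_b = [(5, 8, 0), (0, 0, 0), (20, 20, 20), (5, 0, 0)]
pairs = match_graphs(verts_a, distance_matrix(mid_a),
                     verts_b, distance_matrix(mid_b))
```

Because the search ignores the order of SSEs along the chain, the three shared elements are matched even though they appear in a different order in B. The cost of this exhaustive search grows quickly with graph size, so it is only suitable for small examples.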
What next?
• So far we have considered the three-dimensional arrangement of secondary structure elements (SSEs) regardless of their ordering in the protein chain.
• The connectivity of SSEs is significant (although it can be neglected when comparing mutated/engineered proteins).
• In previous methods connectivity was either preserved or neglected.
PDBefold (SSM) Approach – a more flexible way
• There are three options:
1) Connectivity of SSEs neglected
[Figure: structures with different SSE connectivity whose SSE graphs are geometrically identical]
2) Soft connectivity: the general order of SSEs along their protein chains is the same in both structures, BUT any number of missing/unmatched SSEs between matched ones is allowed
3) Strict connectivity: matched SSEs follow the same order along their protein chains and are separated only by an equal number of matched/unmatched SSEs in both structures
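The three options can be read as a filter on a list of matched SSE pairs. The sketch below is one interpretation of the rules as stated; the function name `connectivity_ok` and the mode strings are hypothetical, and each pair is assumed to hold the positions of the matched SSEs along chain A and chain B.

```python
def connectivity_ok(pairs, mode):
    """Check matched SSE pairs against a connectivity option.

    pairs: list of (index_in_A, index_in_B), where an index is the
           position of the SSE along its protein chain.
    mode:  "none", "soft" or "strict".
    """
    if mode == "none":
        return True  # geometry only, chain order is ignored

    ordered = sorted(pairs)
    b_order = [j for _, j in ordered]
    # Same general order: walking along chain A also walks along chain B.
    same_order = all(j1 < j2 for j1, j2 in zip(b_order, b_order[1:]))
    if mode == "soft":
        return same_order
    if mode == "strict":
        # Consecutive matched SSEs must be separated by the same number
        # of SSEs in both chains.
        equal_gaps = all(i2 - i1 == j2 - j1
                         for (i1, j1), (i2, j2) in zip(ordered, ordered[1:]))
        return same_order and equal_gaps
    raise ValueError("mode must be 'none', 'soft' or 'strict'")
```

For instance, `[(0, 0), (1, 2), (3, 3)]` passes the soft test but fails the strict one, because one SSE is skipped in chain B between the first two matches and none in chain A.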
• To obtain a 3D alignment of individual residues, represent them by their Cα atoms and use the results of graph matching as a starting point
SSE-alignment is used as an initial guess for Cα-alignment
Cα-alignment is an iterative procedure based on the expansion of shortest contacts at best superposition of structures
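A single step of such a procedure might look like the sketch below: with the two structures already superposed, the shortest Cα contacts are mapped first, one-to-one, up to a distance cutoff. This is a deliberately simplified illustration. It omits the re-superposition and iteration, does not enforce chain order, and the function name and the `cutoff` value are assumptions.

```python
import math


def map_shortest_contacts(ca_a, ca_b, cutoff=3.0):
    """Greedy one-to-one mapping of Calpha atoms, shortest contacts first.

    ca_a, ca_b: lists of (x, y, z) coordinates after superposition.
    Returns a list of (i, j, distance) sorted by residue index in A.
    """
    contacts = []
    for i, p in enumerate(ca_a):
        for j, q in enumerate(ca_b):
            d = math.dist(p, q)
            if d <= cutoff:
                contacts.append((d, i, j))
    contacts.sort()  # shortest contacts are considered first

    used_a, used_b, pairs = set(), set(), []
    for d, i, j in contacts:
        if i not in used_a and j not in used_b:
            pairs.append((i, j, d))
            used_a.add(i)
            used_b.add(j)
    return sorted(pairs)
```

In a full procedure the mapped pairs would be used to recompute the superposition, after which the mapping step is repeated until it no longer changes.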
[Figure: chain A and chain B, with matched helices and matched strands highlighted]
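Cα-alignment
Cα-alignment is a compromise between the alignment length Nalign and the r.m.s.d. Longest contacts are unmapped in order to maximise the Q-score:

$$Q = \frac{N_{\mathrm{align}}^{2}}{\left(1 + \left(\mathrm{r.m.s.d.}/R_{0}\right)^{2}\right) N_{A} N_{B}}$$

where $N_A$ and $N_B$ are the numbers of residues in structures A and B, and $R_0$ is an empirical parameter.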
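The trade-off can be checked numerically with a direct transcription of the Q-score. The value of R0 is not stated here, so the default of 3.0 Å in the sketch below is an assumption, as is the function name.

```python
def q_score(n_align, rmsd, n_a, n_b, r0=3.0):
    """Q = N_align^2 / ((1 + (rmsd / R0)^2) * N_A * N_B).

    n_align: number of aligned residue pairs
    rmsd:    r.m.s.d. of the aligned Calpha atoms (same units as r0)
    n_a/n_b: numbers of residues in the two structures
    r0:      empirical parameter (the default value is an assumption)
    """
    return n_align ** 2 / ((1 + (rmsd / r0) ** 2) * n_a * n_b)


# Two candidate alignments of two 100-residue chains:
shorter_tighter = q_score(80, 1.5, 100, 100)  # fewer pairs, low r.m.s.d.
longer_looser = q_score(90, 3.0, 100, 100)    # more pairs, high r.m.s.d.
```

Here the shorter alignment scores higher, which illustrates why unmapping the longest contacts can raise Q even though Nalign decreases. Q reaches 1 only for identical structures aligned over their full length with zero r.m.s.d.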
Multiple structure alignment
More than 2 structures are aligned simultaneously
Multiple alignment is not equal to the set of all-to-all pairwise alignments
Helps to identify common structure motifs for a whole family of structures
Macromolecular Structure Database, 31.10.07
If you have to ask….
• Are there any structures in the PDB that are similar to mine?
• What SCOP and/or CATH family could my structure belong to?
• Can I get some idea about the possible function of my protein from its structural similarity to others?
• Can I get a multiple alignment of many of my structures?
Use PDBefold.
Upload your own PDB file for analysis!
SSM output
• Table of matched Secondary Structure Elements
• Table of matched backbone Cα-atoms with distances between them at best structure superposition
• Rotation-translation matrix of best structure superposition
• Visualisation in Jmol and Rasmol
• r.m.s.d. of Cα-alignment
• Length of Cα-alignment Nalign
• Number of gaps in Cα-alignment
• Quality score Q
• Statistical significance scores P(S), Z
• Sequence identity
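A rotation-translation matrix of this kind can be applied to coordinates as sketched below, for example to superpose one structure onto another outside the web interface. The layout assumed here (three rows of four numbers, a rotation R followed by a translation t, applied as x' = R·x + t) is a common convention and not a statement about the exact format of the PDBefold output; the function name is hypothetical.

```python
def apply_transform(matrix, points):
    """Apply a 3x4 rotation-translation matrix to a list of points.

    matrix: three rows [r_k0, r_k1, r_k2, t_k] (assumed layout)
    points: iterable of (x, y, z) coordinates
    Returns the transformed points as a list of tuples.
    """
    moved = []
    for p in points:
        moved.append(tuple(
            row[0] * p[0] + row[1] * p[1] + row[2] * p[2] + row[3]
            for row in matrix
        ))
    return moved


# Example: rotate by 90 degrees about the z axis, then shift by 1 along x.
rot_z_shift_x = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
moved = apply_transform(rot_z_shift_x, [(1.0, 0.0, 0.0), (0.0, 2.0, 3.0)])
```

Applying the matrix to every atom of the moving structure reproduces the best superposition from which the reported r.m.s.d. and Cα distances are measured.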
The PDBefold Search Interface
The Results Page For Pairwise Alignment
Analyzing the result from a particular pairwise alignment
Residue by Residue Structural alignment result
Multiple 3D alignment using PDBefold
Results from multiple 3D alignment
Conclusion
• It is quite possible that residue identity plays a much less significant role in protein structure than is often believed.
• As a consequence, the role of residue identity in protein function may often be overestimated.
• Using sequence identity to assess structural or functional features may give more false negatives than expected.
• Physical-chemical properties of residues should be given preference over residue identity in structure and function analysis.
• Modern methods for structure alignment are efficient; there is little sense in using sequence alignment in structure-related studies.