Combining Sequence and Structure Information Topic 17
Page 1: Combining Sequence and Structure Information Topic 17.

Combining Sequence and Structure Information

Topic 17

Page 2: Combining Sequence and Structure Information Topic 17.

Problem: Identify the most important region(s)

What is a functional site? This is actually a very difficult question to answer robustly.

Of course, “catalytic residues” are functional sites.

Generally, we assume that other sites directly interacting with the substrate, or with other proteins in a complex, are also functional.

However, what about sites far removed from the “active site region”? If a mutation at one of these sites is deleterious, is that site functional?

Page 3: Combining Sequence and Structure Information Topic 17.

Problem: Identify the most important region(s)

Catalog of Important Sites (KC and Livesay)

Catalytic Sites: Sites that are identified as catalytic sites in the Catalytic Site Atlas (CSA).

Active Sites: Union of the CSA catalytic residues and all residues that contact them, as identified using HBPLUS.

Ligand-Binding Sites: Sites identified by characterizing all enzyme-ligand interactions using HBPLUS.

What about Allosteric Sites? Structural Sites? Etc?

Page 4: Combining Sequence and Structure Information Topic 17.

Note that the two things are not the same

Page 5: Combining Sequence and Structure Information Topic 17.

The devil is in the details

Page 6: Combining Sequence and Structure Information Topic 17.

Methods

Typical approach: Combine sequence and structural information

Alignment Content: Sequence conservation and phylogeny-based. Typically also use structural information.

Machine-Learning Methods: Computational “black boxes,” but give good results.

Structure Features: Graph-theoretic methods, protein surface shape, protein surface physicochemical properties, etc.

Triosephosphate isomerase color-coded by conservation

Page 7: Combining Sequence and Structure Information Topic 17.

Catalytic Site Atlas

Page 8: Combining Sequence and Structure Information Topic 17.

Catalytic Propensity

Page 9: Combining Sequence and Structure Information Topic 17.

Multiple Sequence Alignment

Page 10: Combining Sequence and Structure Information Topic 17.

Multiple Sequence Alignment

The Sum of Pairs (SP) score of column mi is the sum of s(mk,ml), the scoring-matrix substitution value for residues k and l of that column. The sum runs over all possible pairs within a single alignment column.
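As a rough illustration of the SP score, the sketch below sums a substitution score over every unordered pair of residues in one column. The `toy_score` function (+1 for identity, -1 for mismatch) is an assumption standing in for a real scoring matrix such as BLOSUM62, and gap handling is ignored.

```python
from itertools import combinations

def toy_score(a, b):
    # Assumed stand-in for a substitution matrix: +1 identity, -1 mismatch.
    return 1 if a == b else -1

def sum_of_pairs(column, score=toy_score):
    """Sum s(m_k, m_l) over all unordered residue pairs in one alignment column."""
    return sum(score(a, b) for a, b in combinations(column, 2))

sp_conserved = sum_of_pairs("CCCC")  # four sequences, all cysteine
sp_mixed = sum_of_pairs("CCSA")      # same depth, mostly mismatched pairs
```

A fully conserved column scores highest because every pair contributes the identity score; in practice the raw sum is often normalized by the number of pairs so that alignments of different depth can be compared.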

The Shannon entropy (S) score is calculated from pi, the probability of each residue i in that column. Very similarly, the Williamson Property Entropy (WPE) sums over groups of chemically similar residues (k = 9), where the probability within the logarithm is normalized by the average column probability.
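A minimal sketch of the Shannon entropy of one column is given below. The base-2 logarithm and the use of raw residue frequencies as the probabilities pi are assumptions; the slide does not specify either, and real implementations often add pseudocounts or sequence weighting.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Entropy (in bits) of the residue distribution in one alignment column."""
    counts = Counter(column)
    n = len(column)
    # p_i is estimated as the observed frequency of residue i in the column.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

s_conserved = shannon_entropy("CCCC")  # one residue type only
s_variable = shannon_entropy("ACDE")   # four equally frequent residue types
```

Low entropy indicates a conserved column, so the score runs in the opposite direction to the SP score. A property entropy such as WPE would apply the same sum after mapping residues to chemical groups.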

Rate4Site (R4S) constructs a mathematical description of the underlying phylogeny in order to improve determination of the rate of evolution at each site. The rate of evolution at each site is then estimated using the maximum likelihood principle, which considers both phylogenetic tree branch lengths and the stochastic nature of evolution.

And many others.

Page 11: Combining Sequence and Structure Information Topic 17.

Relative predictive power

Method     Catalytic   Active   Ligand-binding
R4S        0.83        0.75     0.74
SP-score   0.77        0.66     0.70
JSD        0.78        0.72     0.67
WPE        0.75        0.70     0.66

Page 12: Combining Sequence and Structure Information Topic 17.

ConSurf is a web-implementation of R4S

Page 13: Combining Sequence and Structure Information Topic 17.

Throw everything at it, including the kitchen sink…

Page 14: Combining Sequence and Structure Information Topic 17.

Gutteridge et al., JMB (2003) 330:719-734.

Relative importance of input variables

Page 15: Combining Sequence and Structure Information Topic 17.

Gutteridge et al., JMB (2003) 330:719-734.

Three different neural networks (NNs)

Using structural clustering to filter out false positives (FPs)

Unfortunately, the method tends to over-predict catalytic residues

Structural clustering improves results

Page 16: Combining Sequence and Structure Information Topic 17.

Going beyond conservation

[Figure: schematic multiple sequence alignment]

Of course, well conserved positions make very good functional site predictions. But what defines differences between sub-families in the overall phylogeny?

Page 17: Combining Sequence and Structure Information Topic 17.

Evolutionary trace (aka tree-determinant) residues

..A........B....C...Y..

..A........B....C...Y..

..Z..D.....E....C...S..

..Z..D.....E....C...S..

..Z..D.....E....C...S..

..H......G.E....C...S..

..H......G.E....C...S..

..H......G.E....C...S..

J.I......F.F....C...S..

J.I......F.F....C...S..

Analyze to detect those residues with a tendency to be conserved within a subfamily of proteins, but which differ between subfamilies (tree-dependent positions), and regard them as a result of the evolutionary scenario in which conservation and specificity are present in a delicate balance.
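The idea can be sketched on the schematic alignment above. The code below is a simplification, not the published evolutionary trace procedure: the subfamily labels are supplied by hand (an assumption, since a real trace derives them by partitioning the phylogenetic tree), and "." is treated as an unspecified residue, so any column containing it is skipped.

```python
def classify_columns(rows, labels, unknown="."):
    """Return (fully conserved columns, tree-determinant columns), 0-based."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)

    conserved, trace = [], []
    for col in range(len(rows[0])):
        # Set of residues seen at this column within each subfamily.
        per_group = [{row[col] for row in members} for members in groups.values()]
        if any(unknown in residues for residues in per_group):
            continue  # unspecified in the schematic; cannot classify
        if any(len(residues) > 1 for residues in per_group):
            continue  # varies inside a subfamily
        distinct = set().union(*per_group)
        (conserved if len(distinct) == 1 else trace).append(col)
    return conserved, trace

rows = (
    ["..A........B....C...Y.."] * 2
    + ["..Z..D.....E....C...S.."] * 3
    + ["..H......G.E....C...S.."] * 3
    + ["J.I......F.F....C...S.."] * 2
)
labels = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]  # hypothetical subfamily assignment
conserved, trace = classify_columns(rows, labels)
```

With these labels, the column holding C in every row is reported as fully conserved, while the columns holding A/Z/H/I, B/E/E/F, and Y/S/S/S are reported as conserved within each subfamily but different between subfamilies.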

Page 18: Combining Sequence and Structure Information Topic 17.

Evolutionary trace (aka tree-determinant) residues

• Identifying and understanding the role of the essential sites that determine the structure and proper functioning of the molecule.

• A thorough evaluation of the importance of all sequence sites involves extremely time-consuming and laborious biochemical experimental methods.

• All methods presented here rely on some sort of co-evolutionary theme. Or put otherwise, Nature has allowed some plasticity within (some) functional positions assuming the appropriate conditions are met elsewhere.

• Starting from the groundbreaking Lichtarge et al. paper in 1996, several approaches have been presented that use this intra-family co-evolution principle to predict functional sites. The methods, called evolutionary trace, tree-determinant residues, phylogenetic motifs, ConSurf, and strong motifs, are conceptually similar and provide somewhat consistent results.

Page 19: Combining Sequence and Structure Information Topic 17.

Evolutionary trace (aka tree-determinant) residues

The ET process

Page 20: Combining Sequence and Structure Information Topic 17.

Structural clusters of ET overlap ligand binding sites

Active site

Trace residues

97% of the time (37 of 38 examples), the largest cluster of trace residues contacts the ligand (Madabushi et al., Journal of Molecular Biology, 2002).

Page 21: Combining Sequence and Structure Information Topic 17.

Livesay et al. (2003). Biochemistry 42:3464-73.

What leads to conservation of CuZnSOD surface electrostatics?

Page 22: Combining Sequence and Structure Information Topic 17.

Livesay et al. (2003). Biochemistry 42:3464-73.

Structural and structure variability

Page 23: Combining Sequence and Structure Information Topic 17.

Stephen Jay Gould said, “The proof of evolution lies in those adaptations that arise from improbable foundations.”

Livesay et al. (2003). Biochemistry 42:3464-73.

An improbable result

Page 24: Combining Sequence and Structure Information Topic 17.

Triosephosphate isomerase

window width = 5

PSZ threshold = -1.5

TIM Prosite definition

La, Sutch, Livesay (2005). Proteins 58:309-320.

Phylogenetic motifs

Notice structural clustering despite little overall sequence proximity

Page 25: Combining Sequence and Structure Information Topic 17.

La, Sutch, Livesay (2005). Proteins 58:309-320.

Page 26: Combining Sequence and Structure Information Topic 17.

Copper, zinc-superoxide dismutase

TATA-box binding protein

Inorganic pyrophosphatase

Cytochrome P450

Myoglobin

Page 27: Combining Sequence and Structure Information Topic 17.

Glutamate dehydrogenase

Enolase

Alcohol dehydrogenase

Glyceraldehyde-3-phosphate dehydrogenase

Page 28: Combining Sequence and Structure Information Topic 17.

Trace residues that correspond to PMs are colored red.

Trace residues that do not correspond to PMs are colored blue.

PMs identify sequence clusters of ET residues

Page 29: Combining Sequence and Structure Information Topic 17.

La, Sutch, Livesay (2005). Proteins 58:309-320.

PMs also correspond to traditional motif definitions

Page 30: Combining Sequence and Structure Information Topic 17.

That is, PMs represent a subset of motif space

La, Sutch, Livesay (2005). Proteins 58:309-320.

Page 31: Combining Sequence and Structure Information Topic 17.

APSRKFFVGGNWKMNGRKQSLGELIGTLNAAKV

PADTEVVCAPPTAYIDFARQKLDPKIAVAAQNC

YKVTGAFTGEISPGMIKDCGATWVVLGHSERRH

VFGESDELIGQKVAHALAEGLGVIACIGEKLDE

REAGITEVFEQTKVIADNVKDWSKVVLAYEPVW

AIGTGKTATPQQAQEVHEKLRGWLKSNVSDAVA

QSTRIIYGGVTGATCKELASQPDVDGFLVGGAS

LKPEFVDIINAKQ

Page 32: Combining Sequence and Structure Information Topic 17.

Livesay, La (2005), Protein Science 14:1158-1170.

Page 33: Combining Sequence and Structure Information Topic 17.

Figure caption: Ligand-binding positions of tyrosine aminotransferase of Trypanosoma cruzi. One chain of the crystal structure of tyrosine aminotransferase from Trypanosoma cruzi (PDB code 1BW0). Results of a conservation-based measure (Williamson, in blue) are shown compared to the phylogeny-based SMERFS (in red). Positions predicted by both techniques are shown in green, the PLP cofactor in orange. Protein regions in stick representation and labelled are those important for cofactor binding, as described in the text. Manning et al. BMC Bioinformatics 2008 9:51.

The SMERFS algorithm is intermediate in philosophy to those of TreeDet [21] and MINER [18] and compares local to global similarity matrices over windows on an alignment.

The work presented here has shown that SMERFS produces sets of putative functional positions in multiple sequence alignments fundamentally different from those of conservation measures. For this reason, conservation measures and phylogeny-aware methods such as SMERFS should be considered complementary tools. The data suggest that if alignment positions involved in the core function of a protein, for example catalysis, are the target of a study, relatively simple conservation measures remain the most useful tool. If less critical positions, perhaps responsible for defining sequence subfamily specificity, are the target, then methods such as SMERFS may be of use. Finally, SMERFS has been shown to predict many more surface positions than conservation, reducing the possibility of confusing signals from positions of core structural rather than functional significance.

Page 34: Combining Sequence and Structure Information Topic 17.

Np: total number of vertices

Lij: shortest path between i and j

Vertex: Cα

Edge: present if the Cα-Cα distance is within 8.5 Å
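A residue contact graph of this kind can be sketched as follows. The function name `contact_graph`, the adjacency-set representation, and the toy coordinates (four Cα atoms spaced 3.8 Å apart on a line) are illustrative assumptions; real coordinates would be read from a PDB file.

```python
import math

def contact_graph(ca_coords, cutoff=8.5):
    """Undirected graph: one vertex per C-alpha, edge if distance <= cutoff (in Å)."""
    n = len(ca_coords)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency

# Toy chain (assumed data): consecutive residues 3.8 Å apart along the x axis.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
graph = contact_graph(toy_coords)
```

In the toy chain, residues up to two positions apart fall within the cutoff, while the two end residues (11.4 Å apart) do not share an edge. The all-pairs loop is quadratic in the number of residues, which is usually acceptable for single protein chains.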

Page 35: Combining Sequence and Structure Information Topic 17.

Degree (aka connectivity or valency): Simply the integer count of the number of edges a vertex shares.

Closeness: The closeness centrality, CC, for a vertex v is the reciprocal of the sum of geodesic distances to all other vertices in the graph.

Geodesic distance (aka shortest path): The number of edges in the shortest path connecting two vertices.

Note: this assumes constant edge weights
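The definitions above can be sketched with a breadth-first search, which gives geodesic distances when every edge has the same weight. The adjacency-set input format and the four-vertex path graph are assumptions made for illustration; the sketch also assumes a connected graph, since unreachable vertices are simply left out of the sum.

```python
from collections import deque

def geodesic_distances(adjacency, source):
    """Edge counts of shortest paths from source, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def degree(adjacency, v):
    """Number of edges the vertex shares."""
    return len(adjacency[v])

def closeness(adjacency, v):
    """Reciprocal of the summed geodesic distances from v to all other vertices."""
    total = sum(d for u, d in geodesic_distances(adjacency, v).items() if u != v)
    return 1.0 / total if total else 0.0

# Toy path graph 0 - 1 - 2 - 3 (assumed data).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
cc_end = closeness(path, 0)     # distances 1 + 2 + 3
cc_middle = closeness(path, 1)  # distances 1 + 1 + 2
```

The interior vertex has the higher closeness, matching the intuition that central residues are few steps away from everything else. Some formulations multiply by the number of other vertices to normalize across graphs of different size.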

Centrality metrics

Page 36: Combining Sequence and Structure Information Topic 17.

• The networks are usually highly clustered with few links connecting any two random vertices.

• A key feature of many complex systems (including protein networks) is robustness, meaning that the system can continue to function despite perturbations.

• On the other hand, robustness is coupled with fragility toward non-trivial rearrangements of the connections between the system’s key internal parts.

• Proteins are no exception: they have evolved toward a robust design. However, they are vulnerable to mutation at certain residues, meaning that some special importance could be placed on central residues.

• Recently, various centrality scores have been used to predict folding nuclei and catalytic sites.

Protein networks

Page 37: Combining Sequence and Structure Information Topic 17.

del Sol et al., Mol Sys Biol (2006).

Protein networks conserve “hubs”

Page 38: Combining Sequence and Structure Information Topic 17.

However, there is a clear distinction between buried noncatalytic and catalytic sites.

Catalytic residues

One third most buried residues

Middle third

One third least buried residues

Close residues are typically buried…

Chea and Livesay (2007), BMC Bioinformatics, 8:153.

Page 39: Combining Sequence and Structure Information Topic 17.

Closeness centrality...

Non-catalytic residues

Catalytic residues

ROC curve for CC predictions

Chea and Livesay (2007), BMC Bioinformatics, 8:153.

Catalytic site prediction power
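One common way to summarize a ROC curve is the area under it, which equals the probability that a randomly chosen positive (here, a catalytic residue) scores higher than a randomly chosen negative. The sketch below computes that area by direct pair counting; the scores and labels are made-up toy values, not data from the cited study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example (assumed data): closeness-like scores and catalytic labels.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of catalytic from non-catalytic residues.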

Page 40: Combining Sequence and Structure Information Topic 17.

Computed p-values on the null hypothesis that CC does not predict catalytic sites better than random.

Simple steps to improve accuracy

Raw predictions

Accessibility filter

Residue identity filter