Structural Phylogenomic Analysis
Estimate Tree of Life; plot key traits onto tree
VirB4 model
Predict active site & subfamily specificity positions
Anti-fungal defensin (Radish)
Scorpion toxin
Extend function prediction through inclusion of structure prediction and analysis
Drosomycin (Drosophila)
Based on 12% identity to TrwB structure
Annotation transfer by homology
• Status quo approach to protein function prediction
  – Given a gene (or protein) of unknown function:
    • Run BLAST to find homologs
    • Identify the top BLAST hit(s)
    • If the score is significant, transfer the annotation
  – If resources permit, predict domains using PFAM or CDD
• Problems:
  – Approach fails completely for ~30% of genes
  – Of those with annotations, only 3% have any supporting experimental evidence
    • 97% have had functions predicted by homology alone*
  – High error rate
* Based on analysis of >300K proteins in the UniProt database
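The status quo procedure above can be sketched in a few lines. This is only an illustration of the decision rule, not code from the talk: the `BlastHit` record, the `transfer_annotation` function, the hit names, and the 1e-5 E-value cutoff are all assumptions made for the example, and the hits are typed in by hand rather than parsed from real BLAST output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    """Hypothetical record for one BLAST hit (not a real BLAST API type)."""
    subject_id: str
    evalue: float
    annotation: Optional[str]  # None if the database entry has no annotation


def transfer_annotation(hits: List[BlastHit],
                        max_evalue: float = 1e-5) -> Optional[str]:
    """Return the annotation of the most significant hit, or None.

    The cutoff is an assumed value chosen for illustration only.
    """
    if not hits:
        return None  # no homologs found: the approach fails for this gene
    top = min(hits, key=lambda h: h.evalue)  # smallest E-value = top hit
    if top.evalue > max_evalue:
        return None  # top hit is not significant, so nothing is transferred
    return top.annotation


# Made-up hits for a query of unknown function.
hits = [
    BlastHit("hitA", 3e-40, "receptor-like kinase"),
    BlastHit("hitB", 2e-12, "LRR-containing protein"),
]
print(transfer_annotation(hits))  # receptor-like kinase
```

Note that the rule copies whatever the top hit says, so any error already present in the database annotation is propagated unchanged.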
Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate between orthologs and paralogs)
3. Existing database annotation errors
Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption.” In Silico Biol. 1998
Sub-functionalization
Neo-functionalization
Propagation of existing database annotation errors
Errors in gene structure
Berkeley Phylogenomics Group
Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function prediction by homology, particularly for very common domains (e.g., LRR regions).
Domain structure analysis (e.g., PFAM) is often critical.
BLAST against Arabidopsis
PFAM results
Panther
Tomato Cf-2 (GI:1587673)
Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)
Top BLAST hit in Arabidopsis is an RLK!
Plant and Animal Innate Immunity Mediated by Structurally Similar Receptor and Receptor-like molecules
Homology detection and alignment accuracy (and % superposable positions!) drop with evolutionary distance
Structure can provide clues, but not necessarily exact definition
Not all positions in a molecule are created equal
Light-blue positions are variable across subfamilies, but can be very conserved within subfamilies. These are the hallmarks of binding pockets determining substrate specificity.
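One simple way to look for such positions is to compare conservation within each subfamily against conservation over the pooled alignment. The sketch below is a minimal illustration of that idea, not the method used in the talk: the function names, the two thresholds, and the toy alignment are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def column_conservation(residues: List[str]) -> float:
    """Fraction of non-gap residues equal to the most common residue."""
    observed = [r for r in residues if r != "-"]
    if not observed:
        return 0.0
    return Counter(observed).most_common(1)[0][1] / len(observed)


def specificity_positions(subfamilies: Dict[str, List[str]],
                          within_min: float = 0.9,
                          across_max: float = 0.6) -> List[int]:
    """Columns conserved inside every subfamily but variable overall.

    Both thresholds are assumed values chosen for this toy example.
    """
    length = len(next(iter(subfamilies.values()))[0])
    positions = []
    for col in range(length):
        within = [column_conservation([s[col] for s in seqs])
                  for seqs in subfamilies.values()]
        pooled = column_conservation(
            [s[col] for seqs in subfamilies.values() for s in seqs])
        if min(within) >= within_min and pooled <= across_max:
            positions.append(col)
    return positions


# Made-up aligned sequences grouped into three hypothetical subfamilies.
toy = {
    "A": ["MKDLS", "MKDLT", "MKDLS"],
    "B": ["MKELS", "MKELA", "MKELS"],
    "C": ["MKQLS", "MKQLS", "MKQLG"],
}
print(specificity_positions(toy))  # [2]
```

Column 2 (D in A, E in B, Q in C) is the only one that is invariant within each subfamily and variable across them; the last column is variable even within subfamilies, so it is not reported.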
Major differences between trees are in the coarse branching order
[Figure: three alternative trees relating classes A, B and C, with leaf orders A B C, A C B and B C A]
When classes A, B and C each appear equally similar to one another, the coarse branching order can be difficult to determine. In this case, it is critical to weight the subfamily-defining residues more heavily when computing the distance between classes.
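A toy example makes the effect of such weighting concrete. The sketch below uses a weighted mismatch count between one representative sequence per class; the sequences, the choice of column 1 as the subfamily-defining position, and the weight of 3 are all invented for illustration and are not taken from the talk.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def weighted_distance(a: str, b: str, weights: Sequence[float]) -> float:
    """Sum of the column weights at positions where the sequences differ."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("sequences and weights must have the same length")
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


# Hypothetical representative sequence for each class.
consensus: Dict[str, str] = {"A": "DSAK", "B": "DSGR", "C": "DTGK"}


def all_pairs(weights: Sequence[float]) -> Dict[Tuple[str, str], float]:
    """Weighted distance for every pair of classes."""
    return {
        (p, q): weighted_distance(consensus[p], consensus[q], weights)
        for p, q in combinations(sorted(consensus), 2)
    }


uniform = all_pairs([1, 1, 1, 1])   # every column counts equally
weighted = all_pairs([1, 3, 1, 1])  # column 1 assumed subfamily-defining

print(uniform)
print(weighted)
```

With uniform weights every pair of classes is at distance 2, so no branching order is preferred. Once column 1 is up-weighted, A and B stay at distance 2 while C moves to distance 4 from both, which favours grouping A with B.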
HMM construction using an initial multiple sequence alignment
Seq1 M V V S - - P
Seq2 M V V S T G P
Seq3 M V V S S G P
Seq4 M V L S S P P
Seq5 M - L S G P P
Delete/skip
Insert
Match
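The mapping from alignment columns to match, insert and delete states can be sketched with a common heuristic: a column becomes a match state when its gap fraction is at or below a cutoff. The slide does not say which rule was used, so the cutoff and the function name `assign_states` are assumptions; only the five sequences come from the slide.

```python
from typing import Dict, List, Tuple


def assign_states(alignment: Dict[str, str],
                  max_gap_fraction: float = 0.5
                  ) -> Tuple[List[bool], Dict[str, str]]:
    """Label columns as match/insert and give each sequence's state path.

    Path letters: M = match, D = delete/skip, I = insert.
    The gap-fraction rule and its default cutoff are assumptions.
    """
    names = list(alignment)
    n_cols = len(alignment[names[0]])
    is_match = []
    for col in range(n_cols):
        gaps = sum(alignment[name][col] == "-" for name in names)
        is_match.append(gaps / len(names) <= max_gap_fraction)

    paths = {}
    for name in names:
        path = []
        for col, residue in enumerate(alignment[name]):
            if is_match[col]:
                # A gap in a match column means the match state is skipped.
                path.append("M" if residue != "-" else "D")
            elif residue != "-":
                path.append("I")
            # A gap in an insert column uses no state at all.
        paths[name] = "".join(path)
    return is_match, paths


msa = {
    "Seq1": "MVVS--P",
    "Seq2": "MVVSTGP",
    "Seq3": "MVVSSGP",
    "Seq4": "MVLSSPP",
    "Seq5": "M-LSGPP",
}
columns, paths = assign_states(msa)
print(paths["Seq1"])  # MMMMDDM
print(paths["Seq5"])  # MDMMMMM
```

With the default cutoff every column is a match column, so the gaps in Seq1 and Seq5 become delete states. A stricter cutoff such as `max_gap_fraction=0.1` turns the three gapped columns into insert columns instead.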
Profile or HMM parameter estimation using small training sets
D S I F M K
D S V F M K
D T I W M K
D T I W M K
D T V W M K
What other amino acids might be seen at this position among homologs? What are their probabilities?
The context is critical when estimating amino acid distributions
D S I F M K
D S V F M K
D T I W M K
D T I W L K
D T L W L R
This position may be critical for function or structure, and may not allow substitutions
Dirichlet Mixture Prior “Blocks9”
Parameters estimated using Expectation Maximization (EM) algorithm. Training data: 86,000 columns from BLOCKS alignment database.
Combining Prior Knowledge with Observations using Dirichlet Mixture Densities
$\hat{p}_i$ = the estimated probability of amino acid ‘i’
$\vec{n} = (n_1, \ldots, n_{20})$ = the count vector summarizing the observed amino acids at a position.
$\vec{\alpha}_j = (\alpha_{j,1}, \ldots, \alpha_{j,20})$ = the parameters of component j of the Dirichlet mixture.
Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Sjolander, Karplus, Brown, Hughey, Krogh, Mian and Haussler. CABIOS (1996)
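As described in the paper cited above, the estimate combines the observed counts with each mixture component and weights the components by their posterior probability given the counts: the estimate for amino acid i is the sum over components j of Prob(component j | n) · (n_i + α_{j,i}) / (|n| + |α_j|). The sketch below implements that posterior-mean estimate on a toy three-letter alphabet. The two-component mixture is made up for illustration and is not the Blocks9 prior; the function and variable names are likewise assumptions.

```python
import math
from typing import List, Sequence


def log_evidence(counts: Sequence[float], alpha: Sequence[float]) -> float:
    """log P(counts | alpha), omitting the multinomial coefficient.

    The omitted term is the same for every component, so it cancels
    when the component posteriors are normalised.
    """
    value = math.lgamma(sum(alpha)) - math.lgamma(sum(counts) + sum(alpha))
    for n_i, a_i in zip(counts, alpha):
        value += math.lgamma(n_i + a_i) - math.lgamma(a_i)
    return value


def posterior_mean(counts: Sequence[float],
                   mix_weights: Sequence[float],
                   alphas: Sequence[Sequence[float]]) -> List[float]:
    """Posterior-mean residue probabilities under a Dirichlet mixture."""
    log_post = [math.log(q) + log_evidence(counts, a)
                for q, a in zip(mix_weights, alphas)]
    shift = max(log_post)  # subtract the max for numerical stability
    unnorm = [math.exp(v - shift) for v in log_post]
    total = sum(unnorm)
    post = [u / total for u in unnorm]  # Prob(component j | counts)

    n_total = sum(counts)
    estimate = [0.0] * len(counts)
    for w, alpha in zip(post, alphas):
        a_total = sum(alpha)
        for i, (n_i, a_i) in enumerate(zip(counts, alpha)):
            estimate[i] += w * (n_i + a_i) / (n_total + a_total)
    return estimate


# Toy mixture (assumed values): one component strongly favouring the
# first letter, one roughly uniform component.
toy_weights = [0.5, 0.5]
toy_alphas = [[10.0, 0.1, 0.1], [1.0, 1.0, 1.0]]

few = posterior_mean([3, 0, 0], toy_weights, toy_alphas)
many = posterior_mean([300, 100, 100], toy_weights, toy_alphas)
print(few)
print(many)
```

With only three observations the prior contributes noticeably, and unobserved letters still receive non-zero probability. With hundreds of observations the estimate approaches the observed frequencies.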
SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
Xia Jiang
Nandini Krishnamurthy
Duncan Brown
Michael Tung
Jake Gunn-Glanville
Bob Edgar
SATCHMO motivation
• Structural divergence within a superfamily means that…
  – Multiple sequence alignment (MSA) is hard
  – The set of alignable positions varies according to the degree of divergence
• Current MSA methods are not designed to handle this variability
  • Assume globally alignable, all columns (e.g. ClustalW)…
    – Over-aligns, i.e. aligns regions that are not superposable
  • …or identify and align only highly conserved positions (profile HMMs)
    – Discards information important for subfamily specificity
• Reality
  – Different degrees of alignability in different sequence pairs, different