Multiple sequence alignment and their reliability
The Bioinformatics Unit, G.S. Wise Faculty of Life Science
Tel Aviv University, Israel
January 2013
By Haim Ashkenazy
http://guidance.tau.ac.il/workshop_2013/
TAU Bioinformatics Workshop
What are alignments good for?
• To compare sequences
  o Find homology
  o Similar sequence → similar function
• To learn about sequence evolution
  o Mismatch = point mutation
  o Gap = indel (insertion or deletion)
  o Reconstruct phylogenetic tree
  o Infer selection forces, e.g., detecting positive selection, co-evolving sites
• For structure prediction
  o Similar regions potentially have similar structure
Making an alignment (pairwise)

    ADLGAVFALCDRYFQ
    ||||     |||| |
    ADLGRTQN-CDRYYQ

• For 2 sequences – pairwise alignment
  o Local alignment – finds regions of high similarity in parts of the sequences
  o Global alignment – finds the best alignment across the entire two sequences
• Use exact solution
  o Needleman-Wunsch (for global) or Smith-Waterman (for local) - http://www.ebi.ac.uk/Tools/psa/
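The exact global method named on this slide can be sketched in a few lines. This is a minimal Needleman-Wunsch illustration, not the implementation behind the EBI tools: the scoring values (match +1, mismatch -1, gap -2 per position) are assumptions chosen for readability, whereas real protein alignment normally uses a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # (mis)match
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


# The two sequences from the slide, with the gap removed from the second one.
total, top, bottom = needleman_wunsch("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ")
print(total)
print(top)
print(bottom)
```

Several alignments can share the optimal score, so the position of the gap in the printed result depends on the tie-breaking order in the traceback and may differ from the slide.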
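Sequence evolution
[Figure: a tree in which the sequence ATGAAATAA gives rise to ATGTTTTAA and ATGCCCAAATAA, and then to the present-day sequences ATGTTTTCA, ATGTTTTAA and ATGCCCAAA. Time axis: 30 MYA, 5 MYA, Today. Species labels: Human, Chimp, Mouse.]
Two alignments of the present-day sequences are shown:

    A T G - - - T T T T A A
    A T G - - - T T T T C A
    A T G C C C A A A - - -

    A T G - - - T T T T A A
    A T G - - - T T T T C A
    A T G C C C - - - A A A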
Alignment and phylogeny are mutually dependent
[Figure: Unaligned sequences → Sequence alignment → MSA → Phylogeny reconstruction, annotated "Inaccurate tree building"; tree scale bar 0.4]
Alignment and phylogeny are both challenging
• ~25% of residues are wrongly aligned (based on BAliBASE: a large representative set of proteins)
• 5% of tree branches are wrong (based on simulations of 100 protein sequences)
Making an alignment (MSA)
• For more sequences - multiple sequence alignment (MSA)
  o Exact methods are not feasible (too slow)
  o We use heuristic methods
  o Several advanced MSA programs are available
Basically two recommended methods:
• MAFFT – fastest and one of the most accurate
• PRANK – distinct from all other MSA programs because of its correct treatment of insertions/deletions
Progressive alignment
First step: compute pairwise distances
Sequences: A, B, C, D, E
Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table:

         A    B    C    D    E
    A
    B    8
    C   15   17
    D   16   14   10
    E   32   31   31   32
Second step: build a guide tree
Cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
[Figure: guide tree with leaves labelled A, D, C, B, E, shown next to the distance table from the first step]
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
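To make the clustering step concrete, here is a small sketch that builds a tree from the distance table of the first step. The slides do not say which clustering method is used; UPGMA (average linkage) is assumed here purely for illustration, and the nested-tuple tree representation is a choice made for this example.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster with UPGMA (average linkage) and return a nested-tuple tree.

    distances maps an (x, y) pair of names to their distance.
    """
    size = {name: 1 for name in names}                 # cluster -> leaf count
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(size) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance from the new cluster to every other one is the
        # size-weighted average of the distances from its two parts.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Distances copied from the table on the slide.
table = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}
guide_tree = upgma("ABCDE", table)
print(guide_tree)
```

With these distances A and B are joined first (distance 8), then C and D (distance 10), then the two pairs, and E is attached last. A progressive aligner would align sequences and sub-alignments in that order.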
GUIDANCE: Guide-tree based alignment confidence scores
Comparing alignments
Common measures to quantify distance between two MSAs:
1. Column score (CS): Each column of the MSA that is identically aligned in the other MSA is given a score of 1; all other columns are given the score 0.
2. Sum-of-pairs score (SP): Each pair of residues in the MSA that is identically aligned in the other MSA is given a score of 1; all other residue pairs are given the score 0.
3. Sum-of-pairs column score (SPC): The score of each column is simply the average of the SPs over all pairs in it.
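The first two measures can be sketched as follows. This is an illustration under stated assumptions rather than the GUIDANCE code: both MSAs must list the same sequences in the same order, gaps are written as "-", and each measure is reported as the fraction of columns (or residue pairs) of the first MSA that scored 1.

```python
from itertools import combinations


def column_signatures(msa):
    """Describe each column by the residue index it holds in every row (None = gap)."""
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        signature = []
        for row, seq in enumerate(msa):
            if seq[col] == "-":
                signature.append(None)
            else:
                signature.append(counters[row])
                counters[row] += 1
        columns.append(tuple(signature))
    return columns


def aligned_pairs(msa):
    """Set of residue pairs (row1, index1, row2, index2) placed in the same column."""
    pairs = set()
    for signature in column_signatures(msa):
        for (r1, p1), (r2, p2) in combinations(enumerate(signature), 2):
            if p1 is not None and p2 is not None:
                pairs.add((r1, p1, r2, p2))
    return pairs


def column_score(base, other):
    """Fraction of columns of base that appear identically in other (CS)."""
    other_columns = set(column_signatures(other))
    base_columns = column_signatures(base)
    return sum(c in other_columns for c in base_columns) / len(base_columns)


def sum_of_pairs_score(base, other):
    """Fraction of aligned residue pairs of base that are also aligned in other (SP)."""
    base_pairs = aligned_pairs(base)
    return len(base_pairs & aligned_pairs(other)) / len(base_pairs)


# The two alignments of ATGTTTTAA, ATGTTTTCA and ATGCCCAAA from the evolution slide.
base = ["ATG---TTTTAA", "ATG---TTTTCA", "ATGCCCAAA---"]
other = ["ATG---TTTTAA", "ATG---TTTTCA", "ATGCCC---AAA"]
print(column_score(base, other))
print(sum_of_pairs_score(base, other))
```

Only the first six columns are shared by the two alignments, so half of the columns agree, while 15 of the 21 aligned residue pairs are preserved. SP is therefore the more forgiving of the two measures in this case.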
Accuracy of GUIDANCE scores
http://guidance.tau.ac.il
As a rule of thumb, use HoT (Heads-or-Tails) for fewer than 8 sequences.
Input to the web server:
• Unaligned sequences (FASTA format)
• Choose sequence type
• Choose alignment method
GUIDANCE results
[Figure: MSA colored by confidence score, on a scale from confident to uncertain, with a sequence score for each row and a column score for each column]
GUIDANCE outputs
• Download MSA for downstream analysis
• Text files with all scores
• Mask residues by score
• Remove unreliable sequences
Filtering sequences with low scores and re-aligning
• Remove unreliable sequences
• Re-align sequences after filtration
• Sequences left after filtration
But always remember not to remove too much data, and consider the biology…
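Sequence filtering can be pictured with a short sketch. Everything here is hypothetical: the function name, the toy sequences, the scores and the 0.6 cutoff are invented for illustration, and the sketch does not parse the actual score files produced by the server.

```python
def filter_sequences(msa, sequence_scores, cutoff):
    """Keep sequences scoring at least cutoff, with gaps stripped for re-alignment.

    msa and sequence_scores are dicts keyed by sequence name (hypothetical layout).
    """
    return {
        name: aligned.replace("-", "")
        for name, aligned in msa.items()
        if sequence_scores[name] >= cutoff
    }


# Toy data, not real GUIDANCE output.
toy_msa = {"seq1": "ATG---TTT", "seq2": "ATGCCCTTT", "seq3": "A--CC-T-T"}
toy_scores = {"seq1": 0.95, "seq2": 0.90, "seq3": 0.40}
kept = filter_sequences(toy_msa, toy_scores, cutoff=0.6)
print(kept)
```

The surviving sequences are returned unaligned because removing a sequence can change the best alignment of the rest, which is why the slide asks for re-alignment after filtration.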
Filtering columns with low scores
• Remove unreliable columns
• MSA after filtration
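Column filtering works on the alignment itself, so no re-alignment is needed. The sketch below is hypothetical in the same way as any toy example: the function name, the per-column scores and the cutoff are assumptions, not the server's file format or defaults.

```python
def filter_columns(msa, column_scores, cutoff):
    """Drop every alignment column whose score is below cutoff.

    msa maps names to aligned sequences; column_scores has one value per column.
    """
    keep = [i for i, score in enumerate(column_scores) if score >= cutoff]
    return {name: "".join(seq[i] for i in keep) for name, seq in msa.items()}


# Toy data, not real GUIDANCE output.
toy_alignment = {"seq1": "ATG-TT", "seq2": "ATGCTT", "seq3": "AT-CTA"}
toy_column_scores = [1.0, 1.0, 0.5, 0.3, 1.0, 0.9]
trimmed = filter_columns(toy_alignment, toy_column_scores, cutoff=0.8)
print(trimmed)
```

All rows lose the same columns, so the filtered sequences stay aligned to one another.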
Filtering residues with low scores
• Masking unreliably aligned residues
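Masking keeps every column and every sequence and hides only the individual residues that are poorly supported. In this hypothetical sketch the per-residue scores, the cutoff and the mask character "X" are assumptions made for illustration ("N" would be the usual choice for nucleotides); gap positions carry no score and are left untouched.

```python
def mask_residues(msa, residue_scores, cutoff, mask_char="X"):
    """Replace residues scoring below cutoff with mask_char; gaps stay as they are.

    residue_scores maps each name to one score per alignment position
    (None at gap positions).
    """
    masked = {}
    for name, seq in msa.items():
        masked[name] = "".join(
            ch if ch == "-" or score >= cutoff else mask_char
            for ch, score in zip(seq, residue_scores[name])
        )
    return masked


# Toy data, not real GUIDANCE output.
toy_proteins = {"seq1": "ADLG-C", "seq2": "ADLGRC"}
toy_residue_scores = {
    "seq1": [1.0, 0.9, 0.2, 0.8, None, 1.0],
    "seq2": [1.0, 0.9, 0.9, 0.8, 0.1, 1.0],
}
masked = mask_residues(toy_proteins, toy_residue_scores, cutoff=0.5)
print(masked)
```

Because the alignment keeps its original dimensions, masking discards less data than removing whole columns or sequences.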
Filtering unreliable regions can improve downstream analysis
(Mol Biol Evol 2012;29:1-5)
Acknowledgments
• Prof. Tal Pupko
• Dr. Eyal Privman
• Dr. Osnat Penn
• Pupko’s lab members
1. Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D. and Pupko, T. (2010). GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Research, 2010 Jul 1; 38 (Web Server issue): W23-W28; doi: 10.1093/nar/gkq443
2. Penn, O., Privman, E., Landan, G., Graur, D. and Pupko, T. (2010). An alignment confidence score capturing robustness to guide-tree uncertainty. Molecular Biology and Evolution, 2010 Aug; 27(8): 1759-67; doi: 10.1093/molbev/msq066
3. Landan, G. and Graur, D. (2008). Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput 13:15-24
Thanks for your attention!