Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: 123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP Position within the alignment = columns

Jan 01, 2016


Prosper Shelton
Transcript
Page 1

Sequence Alignment

Goal: line up two or more sequences

An alignment of two amino acid sequences:

      123456789….
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
      HKIYHLQSKVP   R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

Position within the alignment = columns

Page 2

Sequence Alignment

The alignment is a hypothesis:

• The positions with identical nt/AA were present in the common ancestor

• Differences represent the nt/AA that have diverged since the common ancestor

Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
      HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

Page 3

From extant sequences to evolution

Page 4

Constructing and evaluating alignments

• How to identify regions of sequence similarity between two sequences?

• How to evaluate the degree of similarity?

• What is the biological significance of the alignment?

Page 5

Dot Plots: visualization of an alignment

1) Take two English words: THISSEQUENCE and THATSEQUENCE

2) Place the two sequences on the vertical and horizontal axes of a graph

3) Put dots wherever there is a match

4) The diagonal line is the region of identity – local alignment
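The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea: the function names `dot_plot` and `main_diagonal_matches` are made up for this example, and a text grid stands in for the graph on the slide.

```python
def dot_plot(horizontal, vertical):
    """Return the dot plot as text rows: '*' marks a match, '.' a mismatch."""
    rows = ["  " + horizontal]  # header row: the sequence on the horizontal axis
    for v in vertical:
        cells = "".join("*" if v == h else "." for h in horizontal)
        rows.append(v + " " + cells)
    return rows


def main_diagonal_matches(a, b):
    """Count matches on the main diagonal (same position in both words)."""
    return sum(x == y for x, y in zip(a, b))


for row in dot_plot("THISSEQUENCE", "THATSEQUENCE"):
    print(row)
```

Runs of `*` along a diagonal correspond to the regions of identity; isolated `*` off the diagonal are chance matches of single letters.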

Page 6

Alignments reveal insertions and deletions

A gap in seq1 accounts for the insertion of ISA into seq2

seq1 THIS---SEQUENCE
seq2 THISISASEQUENCE

Page 7

Alignments reveal substitutions

THISSEQUENCE
THATSEQUENCE

– How many substitutions?

– Are all substitutions equal?

– If these were real AA sequences in two extant organisms, how can we determine whether they reflect evolutionary ancestry?

• Would two unrelated sequences share this level of identity?
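The first question, how many substitutions separate THISSEQUENCE and THATSEQUENCE, can be answered by comparing the two words position by position. The sketch below assumes equal-length sequences with no gaps; the function name `substitutions` is hypothetical.

```python
def substitutions(seq1, seq2):
    """List (position, residue in seq1, residue in seq2) for every mismatch.

    Positions are 1-based, like the column numbering used on the slides.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(seq1, seq2))
            if a != b]


diffs = substitutions("THISSEQUENCE", "THATSEQUENCE")
print(len(diffs), diffs)
```

Counting mismatches treats every substitution as equal, which is exactly the simplification the second question on the slide challenges.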

Page 8

Substitutions

Need a method for evaluating the likelihood of

- A to V (alanine to valine)

- R to F (arginine to phenylalanine)

Page 9

Scoring schemes to assess similarity

• Percent identity = percentage of aligned positions with identical amino acids

• Percent similarity (biochemical equivalence)

• Substitution matrices
– value assigned based on the probability of substitution
– score the alignment

      123456789….
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
      HKIYHLQSKVP   R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
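Percent identity is the simplest of these schemes to compute. The sketch below assumes the two rows are already aligned (same length, `-` for gaps) and divides by the full alignment length, gap columns included; other conventions for the denominator exist, so treat that choice as an assumption. The function name `percent_identity` is hypothetical.

```python
def percent_identity(row1, row2, gap="-"):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    identical = sum(a == b and a != gap for a, b in zip(row1, row2))
    return 100.0 * identical / len(row1)


# 12 identical columns out of 15 in the THIS---SEQUENCE example
print(percent_identity("THIS---SEQUENCE", "THISISASEQUENCE"))
```

Percent similarity would additionally count columns whose residues are biochemically equivalent, which requires a table of equivalent pairs or a substitution matrix.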

Page 10

Substitution matrices

How might one construct a scoring matrix?

– what types of sequence events should we consider?

• DNA level? Transition vs. transversion

• Amino acid level? Biochemical equivalence

• Using known proteins?

– comparing protein homologs

– post hoc determination of probabilities

– should we use the same substitution matrix for two very closely related proteins vs. proteins that diverged long ago?

• Probability of substitutions increases over time

• Probability that multiple substitutions occurred in a single position

Page 11

Substitution matrix

Each cell represents the likelihood of substitution of each possible pair of amino acids

T  H  I  S  S  E  Q  U  E  N  C  E
T  H  A  T  S  E  Q  U  E  N  C  E
5  8 -1  1  4  5  5  0  5  6  9  5

Sum up the score = 52
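Scoring an ungapped alignment with a substitution matrix is a lookup per column followed by a sum. The sketch below uses a small hand-written table, `PAIR_SCORES`, holding only the pairs needed for THISSEQUENCE vs THATSEQUENCE; the values are chosen to reproduce the slide's total of 52 and are believed to match BLOSUM62 for the standard residues, while U (not a standard amino acid) is scored 0. The table and function names are assumptions for illustration, not a full matrix.

```python
# Illustrative pair scores only: just the pairs this example needs.
PAIR_SCORES = {
    ("T", "T"): 5, ("H", "H"): 8, ("A", "I"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("U", "U"): 0,
    ("N", "N"): 6, ("C", "C"): 9,
}


def pair_score(a, b):
    """Look up a pair in either order, since the matrix is symmetric."""
    key = (a, b) if (a, b) in PAIR_SCORES else (b, a)
    return PAIR_SCORES[key]


def alignment_score(seq1, seq2):
    """Sum the per-column scores of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


columns = [pair_score(a, b) for a, b in zip("THISSEQUENCE", "THATSEQUENCE")]
print(columns, sum(columns))
```

Note that identical pairs do not all score the same (C/C scores higher than S/S), and a mismatch can score above zero (S/T) or below it (I/A).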

Page 12

PAM and BLOSUM matrices for AA sequences

Most protein alignment matrices are empirically derived:

• PAM Scoring Matrices
– compared full length of closely related proteins
– measured the frequency of all possible substitution pairs

• BLOSUM Scoring Matrices
– compared highly conserved regions of proteins
– blocks

Page 13

How to score gaps?

THISISASEQUENCE vs THATSEQUENCE?

More than one possible alignment:

THISISASEQUENCE
TH----ATSEQUENCE
THA---TSEQUENCE
TH---ATSEQUENCE
TH-A-T-SEQUENCE

Scoring the alignment must take into account
1) Substitutions
2) Gaps

Gap penalties:
1) start a new gap (-4)
2) extend an existing gap (-1)

Score all, choose highest score
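The two gap penalties can be applied to any gapped row by finding each run of `-` characters. The sketch below assumes that the first position of a run costs the opening penalty (-4) and each further position in the same run costs the extension penalty (-1); some tools charge both penalties for the first position, so this convention is an assumption. The function name `gap_penalty` is hypothetical.

```python
import re

GAP_OPEN = -4    # start a new gap (value from the slide)
GAP_EXTEND = -1  # extend an existing gap (value from the slide)


def gap_penalty(aligned_row, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Total penalty for all gap runs in one row of an alignment."""
    total = 0
    for run in re.findall(r"-+", aligned_row):
        # first position opens the gap, the remaining positions extend it
        total += gap_open + gap_extend * (len(run) - 1)
    return total


for row in ("THA---TSEQUENCE", "TH---ATSEQUENCE", "TH-A-T-SEQUENCE"):
    print(row, gap_penalty(row))
```

Under this scheme one gap of three positions costs -6, while three separate one-position gaps cost -12, so alignments that concentrate gaps in one place are favoured. The full alignment score is this penalty plus the substitution scores of the ungapped columns.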

Page 14

Alignments

• Finding regions of sequence identity or similarity

• Inserting gaps to reflect indels

• Scoring the possible alignments to find the optimal alignment by

Page 15

Common tools that produce alignments

• BLAST to identify similar sequences, given a query sequence

• ClustalW to align two or more sequences across their entire length

Page 16

The BLAST algorithm

• Uses PAM or BLOSUM matrix

• divides query sequence into short strings, called words

• searches through the database to find subject sequences that contain similar words

• When it finds similar words, it extends and scores the alignment

• Output consists of all subject sequences that align to the query at or above a threshold score

• If no words are similar, then no alignment
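The word-splitting and lookup steps can be sketched as follows. This is a simplification, not the BLAST implementation: it only finds identical words, whereas BLAST also accepts similar words as judged by the scoring matrix, and the word size of 3 is an assumed value. The function names `words` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def words(sequence, size=3):
    """All overlapping words of the given size, with their start positions."""
    return [(i, sequence[i:i + size])
            for i in range(len(sequence) - size + 1)]


def find_seeds(query, subject, size=3):
    """Pairs (query position, subject position) where the same word occurs."""
    index = defaultdict(list)  # word -> start positions in the subject
    for pos, word in words(subject, size):
        index[word].append(pos)
    return [(q_pos, s_pos)
            for q_pos, word in words(query, size)
            for s_pos in index.get(word, [])]


print(find_seeds("THISSEQUENCE", "THATSEQUENCE"))
```

An empty result corresponds to the last bullet: with no shared words there is nothing to extend, so no alignment is reported.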

Page 17

BLAST Algorithm

divide entire length into words (segments of X length)

Page 18

Extend hits one base at a time

S is the alignment score:

If it falls below a threshold, the extension process ends
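A minimal sketch of the extension step, in one direction only and without gaps. It assumes a simple +1 match / -1 mismatch scheme instead of a substitution matrix, and it stops as soon as the running score S falls below the threshold, as described on the slide; BLAST's actual stopping rule is based on how far the score drops from the best value seen. The function name `extend_right` and all parameter values are assumptions.

```python
def extend_right(query, subject, q_start, s_start, seed_len,
                 match=1, mismatch=-1, threshold=0):
    """Extend an exact seed to the right, one position at a time.

    Returns (best score, length of the aligned region at that score).
    """
    score = seed_len * match          # the seed is assumed to match exactly
    best_score, best_len = score, seed_len
    length = seed_len
    q, s = q_start + seed_len, s_start + seed_len
    while q < len(query) and s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score < threshold:         # S fell below the threshold: stop
            break
        if score > best_score:        # remember the best extension so far
            best_score, best_len = score, length
        q += 1
        s += 1
    return best_score, best_len


# Seed "SEQ" found at position 4 in both words
print(extend_right("THISSEQUENCE", "THATSEQUENCE", 4, 4, 3))
```

A complete version would also extend to the left of the seed and combine both directions into one aligned region.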

Page 19

HSPs are Aligned Regions

• High scoring segment pairs = the original word match plus the extension
– high scoring = score of the alignment above threshold
– segment = the region of the query sequence aligned to the subject
– pair = alignment between two sequences (query and subject)

• BLAST often produces several short HSPs rather than a single aligned region

Page 20

BLAST Results report local alignments

>gi|17556182|ref|NP_497582.1| Predicted CDS, phosphatidylinositol transfer protein [Caenorhabditis elegans]

Score = 283 bits (723), Expect = 8e-75
Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%)

Query: 48  KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101
           K+ RV+LP+SV+EYQVGQL+SVAEASK               P++     +G+  KGQYT
Sbjct: 70  KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129

Query: 102 HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160
           HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Sbjct: 130 HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189

Query: 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219
           D GT EN H L+ +     E V I+IA+  + L S D   +  P+KF+S KTGRGPL  N
Sbjct: 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249

Query: 220 WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279
           WK  +      P MCAYKLVTV FKW+G Q  VEN+ H Q  RLF+ FHR++FCW+DKW
Sbjct: 250 WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304

Query: 280 DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309
            LTM DIR +E + +++L+E R+   V+GM
Sbjct: 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334

Query was the entire protein sequence (position 1 to 749)

• Score, E-value, Identities, Positives, Gaps

Page 21

BLAST Statistics

• E-value is comparable to a P value; for small values the two are nearly equal

• smaller numbers are more significant
– 1e-4 = 1 x 10^-4 = 0.0001
– 1e-50 = 1 x 10^-50

• E-value is calculated from the alignment score (S)

• how many alignments of that score would likely occur by chance if you query a database of that size?
– if GenBank contains 10 million sequences, there is a good probability that the sequence “MAGAV” will occur multiple times in sequences that are NOT evolutionarily related

• The E-value is the number of alignments with at least this score expected by chance alone; the lower it is, the less likely the observed alignment is due to chance
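The link between score, database size, and E-value can be made concrete with the standard relations E = m · n · 2^(−S′), where S′ is the bit score and m and n are the query and database lengths, and P = 1 − e^(−E). The sketch below ignores the effective-length corrections that BLAST applies, so its numbers will not reproduce a real report exactly; the function names and the lengths in the usage example are assumptions.

```python
import math


def e_value_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with raw (not effective) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: the same 40-bit hit against two database sizes
for db_len in (10**6, 10**9):
    e = e_value_from_bit_score(40, 300, db_len)
    print(db_len, e, p_value_from_e_value(e))
```

The same score becomes less significant as the database grows, which is the point of the “MAGAV” example: a short match is expected to turn up by chance in a large enough collection of sequences.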

Page 22

Interpretation of output

• very low E-values (e-100) represent sequences that are very close to being identical

• moderate E-values are related genes (homologs)

• a long list of gradually declining E-values indicates a large gene family

• you must examine the results when the E-value is in the 10^-4 to 10^-5 range
– examine sequences

• a few AA matches in a long sequence?

• many AA matches in a very short sequence?

Page 23

Evaluating BLAST results

• Alignment:
– colored bar alignments (region of alignment, score along length)
– sequence alignments (region of alignment, AA information)

• Exploring potential function from significant BLAST hits
– use accession link to go to the record page for each hit

• published papers

• full sequence information

• annotation

• BLAST is linked to a protein domain tool