bio1 sequences handout - Göteborgs universitetbio.lundberg.gu.se/courses/bio1/seq.pdf · 2011. 8. 8. · Comparing non-identical sequences Protein sequence comparison - basic concepts


SEQUENCE ANALYSIS

Sequences - Problem of sequence alignment - interaction between molecular biology / computer science / statistics

* What are the biological problems?
* Algorithm (dynamic programming)
* 'Simple' implementation / source code / compiling
* Statistics and probability theory of alignments
* Common implementations in molecular biology software packages
* Results of biological significance

Sequence analysis - Where and why

Sequencing projects, assembly of sequence data
Identification of functional elements in sequences
Sequence comparison
Classification of proteins
Comparative genomics
RNA structure prediction
Protein structure prediction
Evolutionary history

Overview of methods

1. Comparison

a) Identity
Examples:
• finding restriction sites (GAATTC)
• pattern matches ((A,G)x4GK[S,T])
(SeqWeb package: FindPatterns)

b) Comparing non-identical sequences
Alignments
Pairwise comparison

global alignment (SW: Gap)
local alignment (Smith-Waterman) (SW: Bestfit)

Fasta (SW: Fasta, Tfasta)

Blast (SW: NetBlast)
blastn compares a DNA sequence to a DNA database
blastp compares a protein sequence to a protein database
tblastn compares a protein sequence to all possible translation products of a DNA database

Multiple sequence alignment
Clustalw (SW: Pileup)

Methods in phylogeny
• Distances (SW: Growtree)
• Parsimony: the tree that corresponds to the smallest number of changes is preferred
• Maximum likelihood

2. Analyzing for properties other than a simple linear sequence of letters
Examples:
• statistical composition of residues
• profile analysis, Position Specific Iterated BLAST (PSI-BLAST) (www.ncbi.nlm.nih.gov/BLAST)
• HMMs
• Prediction of higher order structure

1. Protein secondary structure (alpha, beta)
2. RNA (folding by base pairing within the molecule)

3. Simple transformation / extraction
• Translation, DNA -> protein (SW: Translate, Map)
• Reverse translation
• Splicing


Identity. Pattern matching

Pattern matching is used for finding short sequence patterns in a single sequence, in a group of sequences or in the databases.

Examples of patterns (regular expressions):

GAATTC Recognition site for the restriction enzyme EcoRI

GDSGGP Typical of serine proteases.

[AG]-x(4)-G-K-[ST] motif A of the ATP/GTP-binding site

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H zinc finger proteins

The program Findpatterns uses these types of patterns to search a set of sequences, such as those of a database. The program Motifs specifically searches a protein sequence or set of sequences for the motifs present in the PROSITE database.
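The patterns listed above map closely onto ordinary regular expressions. The following is a minimal sketch, not the Findpatterns or Motifs implementation; the function names and the test DNA string are made up for illustration, and only the pattern elements used in this handout (residues, [..] sets, x, and repeat counts) are handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a pattern such as [AG]-x(4)-G-K-[ST] into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # A lowercase x stands for "any residue"
        element = element.replace("x", ".")
        # Repeat counts (n) or (n,m) become regex quantifiers {n} or {n,m}
        element = element.replace("(", "{").replace(")", "}")
        parts.append(element)
    return "".join(parts)

def find_pattern(pattern: str, sequence: str) -> list:
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))
print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))
# Hypothetical DNA string containing two EcoRI sites
print(find_pattern("GAATTC", "TTGAATTCAAGAATTCGG"))
```

Real PROSITE patterns have further syntax (excluded residues, anchors to the sequence ends) that this sketch does not cover.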

Comparing non-identical sequences
Protein sequence comparison - basic concepts

Comparing two nearly identical protein sequences

GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK

When two protein sequences are being compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related. There are really only two kinds of biological relationships:

Orthologs Proteins that carry out the same function in different species
Paralogs Proteins that perform different but related functions within one organism

Proteins are homologous if they are related by divergence from a common ancestor.

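What are orthologs?

[Figure: an ancestral organism carries gene X. After speciation, organism A and organism B each carry a copy of X (X1 and X2). X1 and X2 are orthologs.]

What are paralogs?

[Figure: gene X undergoes gene duplication, giving the copies Xa and Xb. Xa and Xb are paralogs.]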

Mouse trypsin       -- orthologs --  Human trypsin
      |                                    |
   paralogs                             paralogs
      |                                    |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin

Comparing 2 sequences - Dotplot analysis

[Figure: dotplot of MAKLQGALGKRY against MAKIQGALAKRY, with * marking each pair of identical residues]

Sequence alignment

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
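A dotplot is simple to compute: put one sequence along the rows, the other along the columns, and mark every cell where the two residues are identical. The sketch below assumes exact identity as the only criterion (no window or threshold filtering); the function name is illustrative.

```python
def dotplot(seq_rows: str, seq_cols: str) -> list:
    """Return one string per row residue, with * where row and column residues match."""
    return ["".join("*" if a == b else " " for b in seq_cols) for a in seq_rows]

row_seq = "MAKIQGALAKRY"
col_seq = "MAKLQGALGKRY"

print("  " + " ".join(col_seq))
for residue, row in zip(row_seq, dotplot(row_seq, col_seq)):
    print(residue + " " + " ".join(row))
```

The unbroken run of stars along the main diagonal corresponds to the ungapped alignment; isolated stars off the diagonal are chance matches of single residues.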


Comparing 2 sequences - Gaps

[Figure: dotplot of MAKLQLGKRY against MAKLQGALGKRY; the diagonal of * is displaced where the gap occurs]

Sequence alignment

M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y

Gap: the two positions marked -


Comparing 2 sequences: What are really gaps?

Gaps are results of mutations (changes in DNA) that occur during evolution

For instance consider this deletion mutation:

Before the deletion:

DNA       AACTTGACG GACTGGGCGTAT CGGGCACCG
          TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein   N  L  T   D  W  A  Y   R  A  P

After the deletion:

DNA       AACTTGACG CGGGCACCG
          TTGAACTGC GCCCGTGGC
protein   N  L  T   R  A  P
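The effect of this deletion on the protein can be checked by translating the upper DNA strand before and after the 12-nucleotide segment is removed. This is a small sketch using the standard genetic code; the table layout (bases in the order T, C, A, G) and the function name are choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna: str) -> str:
    """Translate reading frame 1; trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

before = "AACTTGACG" + "GACTGGGCGTAT" + "CGGGCACCG"
after = "AACTTGACG" + "CGGGCACCG"   # the middle 12 nucleotides are deleted

print(translate(before))
print(translate(after))
```

Because the deleted segment is a multiple of three nucleotides long, the reading frame is preserved and the protein simply loses four residues, which appears as a gap when the two proteins are aligned.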


Pairwise comparison

In pairwise comparison gaps cannot be inserted in an unrestricted manner. For this reason a gap penalty is assigned to gaps. Two parameters are frequently used in sequence comparison (for example in the programs Gap, Bestfit and Fasta):

- Gap creation penalty
- Gap extension penalty

There are two parameters because it is more 'difficult' to create a gap than to extend an existing gap.
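The two-parameter scheme can be illustrated with a small calculation. The sketch below assumes the convention penalty = creation + extension × length; programs differ in exactly how they count the first gap position, and the default values used here are hypothetical.

```python
def gap_penalty(length: int, creation: int = 8, extension: int = 2) -> int:
    """Penalty for one gap of the given length (assumed convention, hypothetical defaults)."""
    if length <= 0:
        return 0
    return creation + extension * length

one_long_gap = gap_penalty(4)            # a single gap spanning 4 positions
four_short_gaps = 4 * gap_penalty(1)     # four separate gaps of 1 position each

print(one_long_gap, four_short_gaps)
```

With these values one gap of four positions costs 16, while four separate one-position gaps cost 40, so the alignment algorithm prefers a few longer gaps over many scattered ones.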

Substitution matrices

In the scoring of an alignment we do not only take into account whether amino acids are identical or not. To better evaluate the biological significance of an alignment we make use of the fact that not all amino acid substitutions occur with the same frequency.

Example: the substitution Asp (D) -> Glu (E) is more likely than Asp (D) -> Cys (C).

For each pair of amino acids one can estimate a probability for the pair to occur in a correct alignment of related protein sequences. This kind of data is used to produce a substitution matrix. The first substitution matrix to be used was PAM250. Now the matrix BLOSUM62 is most often used.
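To make the Asp/Glu/Cys example concrete, the sketch below looks up scores in a small excerpt of BLOSUM62 covering only the residues D, E and C. The dictionary layout and function name are illustrative; a real program would load the full 20 × 20 matrix.

```python
# Excerpt of BLOSUM62 restricted to D, E and C (the matrix is symmetric)
BLOSUM62_EXCERPT = {
    ("D", "D"): 6, ("E", "E"): 5, ("C", "C"): 9,
    ("D", "E"): 2, ("D", "C"): -3, ("E", "C"): -4,
}

def substitution_score(a: str, b: str) -> int:
    """Look up the score for aligning residue a with residue b."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]

print(substitution_score("D", "E"))   # conservative substitution, positive score
print(substitution_score("D", "C"))   # unlikely substitution, negative score
```

A positive score means the pair is seen more often in alignments of related proteins than expected by chance, and a negative score means it is seen less often.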


Neurospora_crassa GSVDGYAYTD ANKQKGITWD ENTLFEYLEN PKKYIPGTKM AFGGLKKDKD

Stellaria_longipes GSVEGFSYTD ANKAKGIEWN KDTLFEYLEN PKKYIPGTKM AFGGLKKDKD

Thermomyces_lanuginosus GSVEGYSYTD ANKQAGITWN EDTLFEYLEN PKKFIPGTKM AFGGLKKNKD

Arabidopsis_thaliana GSVAGYSYTD ANKQKGIEWK DDTLFEYLEN PKKYIPGTKM AFGGLKKPKD

Aspergillus_niger GQSEGYAYTD ANKQAGVTWD ENTLFSYLEN PKKFIPGTKM AFGGLKKGKE

Debaryomyces_occidentalis GQAAGYSYTD ANKKKGVEWT EQTMSDYLEN PKKYIPGTKM AFGGLKKPKD

Schizosaccharomyces_pombe GQAEGFSYTE ANRDKGITWD EETLFAYLEN PKKYIPGTKM AFAGFKKPAD

Fagopyrum_esculentum GTTAGYSYSA ANKNKAVTWG EDTLYEYLLN PKKYIPGTKM VFPGLKKPQE

Sesamum_indicum GTTPGYSYSA ANKNMAVIWG ENTLYDYLLN PKKYIPGTKM VFPGLKKPQE

Haematobia_irritans GQAAGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE

Lucilia_cuprina GQAPGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE

Ceratitis_capitata GQAAGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE

Sarcophaga_peregrina GQAPGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE

Manduca_sexta GQAPGFSYSD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM VFAGLKKANE

Samia_cynthia GQAPGFSYSN ANKAKGITWG DDTLFEYLEN PKKYIPGTKM VFAGLKKANE

Schistocerca_gregaria GQAPGFSYTD ANKSKGITWD ENTLFIYLEN PKKYIPGTKM VFAGLKKPEE

Apis_mellifera GQAPGYSYTD ANKGKGITWN KETLFEYLEN PKKYIPGTKM VFAGLKKPQE

Macaca_mulatta GQAPGYSYTA ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE

Pan_troglodytes GQAPGYSYTA ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE

Anas_platyrhynchos GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE

Aptenodytes_patagonicus GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE


Global alignment (Needleman-Wunsch algorithm): Considers similarity across the full extent of the sequences (Gap program in SeqWeb)

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| | ||||||| | |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Local alignment (Smith-Waterman algorithm): Considers regions of similarity in parts of the sequences only. (Bestfit program in SeqWeb)

xxxxxxx
|||||||   region of similarity
xxxxxxx

This means for instance that when doing a global alignment of sequences an alignment is produced even though there is no significant similarity between the two sequences. How can one in such global comparisons decide whether the similarity is significant? In the Gap program there is an option "Generate statistics from randomized alignment". When this option is selected the second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment.
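The shuffling procedure described above can be sketched as follows. This is not the Gap program: the scoring scheme (match +1, mismatch -1, linear gap penalty -1) is a deliberately simple assumption, and the function names, number of shuffles and random seed are illustrative.

```python
import random
import statistics

def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch score with a linear gap penalty, keeping one row in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffled_scores(a: str, b: str, n_shuffles: int = 100, seed: int = 0):
    """Shuffle b repeatedly (same length and composition) and realign it to a."""
    rng = random.Random(seed)
    residues = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(global_score(a, "".join(residues)))
    return statistics.mean(scores), statistics.stdev(scores)

real = global_score("MAKLQLGKRY", "MAKLQGALGKRY")
mean, sd = shuffled_scores("MAKLQLGKRY", "MAKLQGALGKRY")
print(f"actual score {real}, randomized {mean:.2f} +/- {sd:.2f}")
```

If the actual score lies many standard deviations above the randomized average, the similarity is unlikely to be explained by sequence length and composition alone.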

Database searches

Searching databases with FASTA / BLAST

Improvement of speed as compared to the local alignment algorithm:

Initial search is for short words.
Word hits are then extended in either direction.

Fasta and Blast are programs frequently used to search sequence databases for homology to a query sequence. Programs of this kind answer practical questions posed by molecular biologists, such as: Is my sequence similar to anything in the database? I seem to have identified a new protein; what is the relationship of this protein to proteins that have been described previously?

Blast and Fasta are local similarity search methods that concentrate on finding short identical matches, which may contribute to a total match.


FastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence. In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison. In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.

ktup or wordsize. Length of initial peptide match, default is 2, i.e. the program starts identifying a diagonal by extending a dipeptide match. ktup=1 is used for a more sensitive search.
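The first step (counting word matches per diagonal) can be sketched in a few lines. This is an illustration of the idea only, not the FASTA source code; the function names are made up, and the two short peptides are the ones used in the dotplot example earlier in this handout.

```python
from collections import Counter, defaultdict

def word_index(seq: str, ktup: int) -> dict:
    """Map every word of length ktup in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index

def diagonal_hits(query: str, subject: str, ktup: int = 2) -> Counter:
    """Count identical words on each diagonal (subject position minus query position)."""
    index = word_index(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits

hits = diagonal_hits("MAKLQGALGKRY", "MAKIQGALAKRY", ktup=2)
print(hits.most_common())
```

The diagonal with the most word hits is the candidate region that the later steps rescore with a substitution matrix. Lowering ktup to 1 finds more hits (higher sensitivity) at the cost of more work.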


Output from Fasta

Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library

173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                      opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as (  75)  483  130 1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end (  66)  426  116 3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4 (  65)  251   71 8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F (  77)  145   45 0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2 ( 136)  105   35 0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S ( 217)  105   35 0.33
gi|6319639|ref|NP_009721.1| involved in the secre (  65)   92   31 1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1 ( 136)   93   32 1.8

>>gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associa (75 aa)
 initn: 483 init1: 483 opt: 483 Z-score: 682.4 bits: 130.3 E(): 1.9e-30
Smith-Waterman score: 483; 100.000% identity in 75 aa overlap (1-75:1-75)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|458 MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
               10        20        30        40        50        60

               70
ramp4. GSAIFQIIQSIRMGM
       :::::::::::::::
gi|458 GSAIFQIIQSIRMGM
               70

>>gi|7657552|ref|NP_055260.1| stress-associated endoplas (66 aa)
 initn: 426 init1: 426 opt: 426 Z-score: 605.1 bits: 115.8 E(): 3.7e-26
Smith-Waterman score: 426; 100.000% identity in 66 aa overlap (10-75:1-66)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                :::::::::::::::::::::::::::::::::::::::::::::::::::
gi|765          MVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                        10        20        30        40        50

               70
ramp4. GSAIFQIIQSIRMGM
       :::::::::::::::
gi|765 GSAIFQIIQSIRMGM
              60

>>gi|7504801|pir||T23009 hypothetical protein F59F4.2 - (65 aa)
 initn: 227 init1: 143 opt: 251 Z-score: 365.3 bits: 71.4 E(): 8.5e-13
Smith-Waterman score: 251; 53.846% identity in 65 aa overlap (10-74:1-64)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                :. :::. .::.. :::...::::::. . : :.:  ..:::..::.::::
gi|750          MAPKQRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVC
                        10        20        30         40        50

               70
ramp4. GSAIFQIIQSIRMGM
       :::.:.::. ..::
gi|750 GSAVFEIIRYVKMGW
               60

>>gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F17L21 (77 aa)
 initn: 139 init1: 100 opt: 145 Z-score: 218.9 bits: 44.6 E(): 0.00012
Smith-Waterman score: 145; 44.262% identity in 61 aa overlap (17-74:15-74)

               10        20           30        40        50
ramp4. MVGAGGAAKMVAKQRIRMAN---EKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIF
                       :.:.   :: .::: .:: : .:. .  ..   ::: ::..:.:
gi|980   MVDLERTITNTTSKRLADRKIEKFDKNILKRGFVPETTTKKGKDYP-VGPILLGFFVF
                 10        20        30        40         50

        60        70
ramp4. VVCGSAIFQIIQSIRMGM
       :: ::..::::..   :
gi|980 VVIGSSLFQIIRTATSGGMA
        60        70

>>gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|212121 (136 aa)
 initn: 66 init1: 41 opt: 105 Z-score: 160.3 bits: 34.6 E(): 0.22
Smith-Waterman score: 105; 30.488% identity in 82 aa overlap (3-75:50-125)

                                           10        20        30
ramp4.                             MVGAGGAAKMVAKQRIRMANEKHSKNITQRGN
                                     :.::.:  : .: ::. :..:.. .  ::
gi|249 RLGLPAVRIPLNERERIQVDEPYILIVPSYGGGGTAGAVPRQVIRFLNDEHNRALL-RGV
      20        30        40        50        60        70

             40                 50        60        70
ramp4. VAKTSRNAPEEKASVG---------PWLLALFIFVVCGSAIFQIIQSIRMGM
       .:. .::  :  . .:         :::   . : . :.   . :...: :.
gi|249 IASGNRNFGEAYGRAGDVIARKCGVPWL---YRFELMGTQ--SDIENVRKGVTEFWQRQP
       80        90       100          110         120       130

gi|249 QNA

...

75 residues in 1 query sequences
173831120 residues in 553635 library sequences
 Scomplib [version 3.3t04 January 25, 2000]
 start: Thu Sep 14 12:30:17 2000 done: Thu Sep 14 12:32:19 2000
 Scan time: 102.010 Display time: 0.020


E-value
For a given score, the number of hits in a database search that we expect to see by chance with this score or better. The E-value takes into account the size of the database that was searched. The lower the E-value, the more significant the score is.

P-value
Like an E-value, but a P-value is the probability of a hit occurring by chance with this score or better, as opposed to the expected number of hits. A P-value has a maximum of 1.0, while an E-value has a maximum of the number of sequences in the database that was searched. For small (significant) P-values, P and E are approximately equal, so the choice of one or the other in a software package is arbitrary. NCBI BLAST 2.0, FASTA, and HMMER report E-values. WU-BLAST 2.0 reports P-values.
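The statement that P and E are approximately equal when small follows from the usual assumption that chance hits follow a Poisson distribution, which gives P = 1 - exp(-E). The sketch below applies that relation to three E-values taken from the Fasta and Blast outputs in this handout; the function name is illustrative.

```python
import math

def p_value_from_e_value(e_value: float) -> float:
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed chance hits."""
    # expm1 keeps precision for very small E, where 1 - exp(-E) would round to 0
    return -math.expm1(-e_value)

for e in (1.9e-30, 0.22, 3.6):
    print(f"E = {e:<8g} P = {p_value_from_e_value(e):.3g}")
```

For E = 1.9e-30 the two values are indistinguishable, while for E = 3.6 the P-value saturates near 1 even though the E-value keeps growing.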

Blast

blastp compares an amino acid query sequence against a protein sequence database

blastn compares a nucleotide query sequence against a nucleotide sequence database

blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

          Query     Database

blastp    Protein   Protein
blastn    DNA       DNA
tblastn   Protein   DNA
blastx    DNA       Protein
tblastx   DNA       DNA
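The six-frame conceptual translation used by blastx, tblastn and tblastx can be sketched as below: three reading frames on the given strand and three on its reverse complement. This is an illustration with the standard genetic code, not BLAST's own code; the function names, frame labels and the short test sequence are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna: str) -> str:
    """Translate from the first base; incomplete trailing codons are dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def six_frame_translation(dna: str) -> dict:
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

print(six_frame_translation("ATGGCC"))
```

Each of the six protein strings is then compared like an ordinary protein sequence, which is why the translated searches take longer than blastp or blastn.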

Databases used in BLAST at NCBI

Peptide Sequence Databases

nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF

swissprot the last major release of the SWISS-PROT protein sequence database (no updates)

pdb Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank

Nucleotide Sequence Databases

nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0,1 or 2 HTGS sequences)

dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions

dbsts Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions

htgs Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)

pdb Sequences derived from the 3-dimensional structure

gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.


Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr 457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                    Score  E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...    126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...    126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...     74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                 46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                 36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...     29  3.6

>gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membrane protein RAMP4 [Rattus norvegicus]
          Length = 75

 Score = 126 bits (313), Expect = 2e-29
 Identities = 62/62 (100%), Positives = 62/62 (100%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM
Sbjct: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73

Query: 74 GM 75
          GM
Sbjct: 74 GM 75

>gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rattus norvegicus]
 >gi|5326497|dbj|BAA81894.1| (AB018546) similar to rat RAMP4 and yeast YSY6 [Rattus norvegicus]
 >gi|5326499|dbj|BAA81895.1| (AB022427) SERP1 [Homo sapiens]
          Length = 66

 Score = 126 bits (313), Expect = 2e-29
 Identities = 62/62 (100%), Positives = 62/62 (100%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM
Sbjct: 5  QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 64

Query: 74 GM 75
          GM
Sbjct: 65 GM 66

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
          Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64

>gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]
          Length = 68

 Score = 46.1 bits (107), Expect = 3e-05
 Identities = 25/54 (46%), Positives = 35/54 (64%), Gaps = 1/54 (1%)

Query: 21 EKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRMG 74
          EK  KNI +RG V +T+    ++   VGP LL  F+FVV GS++FQII++   G
Sbjct: 13 EKFDKNILKRGFVPETTTKKGKDYP-VGPILLGFFVFVVIGSSLFQIIRTATSG 65

>gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]
          Length = 106

 Score = 35.6 bits (80), Expect = 0.048
 Identities = 20/42 (47%), Positives = 26/42 (61%), Gaps = 1/42 (2%)

Query: 21 EKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGS 62
          EK  KNI +RG V +T+    ++   VGP LL  F+FVV GS
Sbjct: 13 EKFDKNILKRGFVPETTTKKGKDYP-VGPILLGFFVFVVIGS 53

Filtering of low complexity regions

1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60

61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120

121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180

181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240

241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300

301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360

361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ
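The sequence above contains long runs dominated by Q and N, which tend to produce spurious database hits. One simple way to flag such regions is to slide a window along the sequence and mask windows with low Shannon entropy. The sketch below illustrates that idea only; it is not the filtering algorithm used by BLAST, and the window size, threshold and function names are hypothetical choices.

```python
import math
from collections import Counter

def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every window whose entropy is below the threshold with X."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)

# Residues 121-180 of the example sequence shown above
segment = "NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK"
print(segment)
print(mask_low_complexity(segment))
```

Masked residues (X) are ignored when seeding hits, so the remaining, more informative parts of the query drive the search.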


Multiple sequence alignment

The homology search programs Fasta and Blast both rely on a basic procedure to compare two sequences with each other. Multiple sequence alignment programs, on the other hand, allow you to align and directly compare more than two related sequences. This procedure is a very useful tool if you want to analyze a family of proteins and, for instance, to identify the structural elements that are characteristic of that family. Common programs for multiple sequence analysis are Clustalw and Pileup.

Clustalw and Pileup exploit the fact that similar sequences are likely to be evolutionarily related. Thus, the programs align sequences in pairs, following the branching order of a family tree. Similar sequences are aligned first, and more distantly related sequences are added later. Once pairwise alignment scores for each sequence relative to all others have been calculated, they are used to cluster the sequences into groups, which are then aligned against each other to generate the final multiple alignment.
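The "similar sequences first" idea can be illustrated by ranking all sequence pairs by similarity. The sketch below is a strong simplification and not the Clustalw or Pileup procedure: it uses the fraction of identical positions between already aligned fragments as the pairwise score (the real programs compute pairwise alignment scores and build a tree from them). The fragments are the first ten columns of four sequences from the multiple alignment shown earlier in this handout; the function names are illustrative.

```python
from itertools import combinations

def fraction_identical(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length aligned fragments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pairs_by_similarity(sequences: dict) -> list:
    """Return (score, name1, name2) for all pairs, most similar pair first."""
    scored = [
        (fraction_identical(sequences[p], sequences[q]), p, q)
        for p, q in combinations(sequences, 2)
    ]
    return sorted(scored, key=lambda item: -item[0])

fragments = {
    "Neurospora_crassa": "GSVDGYAYTD",
    "Apis_mellifera": "GQAPGYSYTD",
    "Macaca_mulatta": "GQAPGYSYTA",
    "Pan_troglodytes": "GQAPGYSYTA",
}

for score, first, second in pairs_by_similarity(fragments):
    print(f"{score:.1f} {first} {second}")
```

In this small example the two primate sequences would be joined first, the insect sequence added next, and the fungal sequence last, mirroring the branching order of the family tree.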
