Analysis of biological sequences. (Lesk chapter 4)

bio.lundberg.gu.se/courses/ht05/bio1/sequences_biol.pdf

Sequence alignment
• Sequence assembly
• Classification
• Prediction of function
• Comparative genomics
• Phylogeny / Evolutionary history

Pattern matching

Recognition of signals / statistical properties

/ character relationships

• Prediction of protein function
• Identification of transcription regulatory sites

• Gene prediction
• RNA and protein secondary structure prediction

Regulatory elements

[Figure: gene structure with the labels Promoter, Transcription start, Translation start, Exons, Introns, Translation stop, polyA signal, Transcription stop]

Expression from a eukaryotic gene

[Figure: DNA, RNA (primary transcript), RNA (spliced) and Protein, linked by the steps Transcription and Translation]

%G   11   74  100    0   29
%A   64    9    0    0   61
%U   13   12    0  100    7
%C   11    6    0    0    2

Exon Intron

Two-dimensional weight matrices are used in identification of splicing signals
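A weight matrix like the one above can be used to score candidate sites. The sketch below treats each column as base frequencies at one position and sums log-odds over a 5-base window. The uniform background (0.25 per base), the pseudo-frequency and all function names are assumptions made for illustration, not part of the slides.

```python
import math

# Percent frequencies per position, copied from the table on the slide.
FREQ = {
    "G": [11, 74, 100, 0, 29],
    "A": [64, 9, 0, 0, 61],
    "U": [13, 12, 0, 100, 7],
    "C": [11, 6, 0, 0, 2],
}
BACKGROUND = 0.25  # assumption: all four bases equally likely
PSEUDO = 0.01      # assumption: avoids log(0) for bases never observed

def log_odds(site):
    """Sum of log2(observed / background) over the five positions."""
    score = 0.0
    for position, base in enumerate(site):
        observed = FREQ[base][position] / 100.0 + PSEUDO
        score += math.log2(observed / BACKGROUND)
    return score

def best_site(rna, width=5):
    """Return (score, start) of the highest-scoring window in rna."""
    return max((log_odds(rna[i:i + width]), i)
               for i in range(len(rna) - width + 1))

score, start = best_site("CCAGGUAAGU")
print(start, round(score, 2))
```

A window that matches the most frequent base in every column scores highest; any window with a base that is never observed at a position is penalised heavily.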

Prediction of RNA secondary structure

5’ GCCUCUUGGC 3’

[Figure: the same sequence drawn folded back on itself, 5’ end next to 3’ end, as a stem-loop]
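For the example sequence, the base pairing can be checked with a few lines of code. This is a minimal sketch that only tests whether the two ends of a sequence pair up into a single stem; real secondary structure prediction uses dynamic programming over all possible pairings. The function names and the minimum loop size of 3 are assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def terminal_stem(rna, min_loop=3):
    """Number of consecutive base pairs formed between the 5' and 3' ends."""
    i, j = 0, len(rna) - 1
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        i += 1
        j -= 1
    return i

def dot_bracket(rna, min_loop=3):
    """Dot-bracket string for the single hairpin found by terminal_stem."""
    stem = terminal_stem(rna, min_loop)
    return "(" * stem + "." * (len(rna) - 2 * stem) + ")" * stem

print(dot_bracket("GCCUCUUGGC"))
```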

Problem of sequence alignment - interaction between molecular biology / computer science / statistics

* What biological problems are addressed?
* Algorithm (dynamic programming)
* ‘Simple’ implementation / source code / compiling
* Common implementations in molecular biology software packages
* Statistics and probability theory of alignments
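To make the dynamic programming item concrete, here is a minimal sketch of a local alignment in the style of Smith-Waterman. The scores (match +2, mismatch -1, gap -2) are assumptions chosen for readability; real protein alignments use a substitution matrix and usually separate gap-open and gap-extension penalties.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: this is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(local_align("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths.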

Biological aspects of sequence alignments

Why do we want to align 2 sequences?

As one example, consider this common application:

We have a ‘new’ sequence. Is it similar to a previously known sequence?

Alignment to all previously known sequences. (Many of these have annotation such as a description of function.)

[Diagram: the search gives either similarity or no similarity (?)]

• Prediction of function
• Phylogeny / evolutionary history

Basic concepts of protein sequence alignments

Proteins are homologous if they are related by divergence from a common ancestor.

Two kinds of homology:

Orthologs: proteins that carry out the same function in different species

Paralogs: proteins that perform different but related functions within one organism

[Figure: two tree diagrams. Speciation: gene X in an ancestral organism, with the copies in Organism A and Organism B (X, X1, X2) labelled as orthologs. Gene duplication: gene X duplicated into Xa and Xb, labelled as paralogs.]

Mouse trypsin       -- orthologs --  Human trypsin
      |                                   |
   paralogs                            paralogs
      |                                   |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin

Ortholog / paralog relationships may be identified using local alignment algorithms such as Smith-Waterman

But: databases are huge, current nucleotide database = 100 billion nucleotides

Comparing 2 sequences - Dotplot analysis

  M A K L Q G A L G K R Y
M *
A   *         *
K     *             *
I
Q         *
G           *     *
A   *         *
L       *       *
A   *         *
K     *             *
R                     *
Y                       *

Sequence alignment

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
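A dot plot is simple to compute: put one sequence along each axis and mark every cell where the two residues are identical. The sketch below uses the two example sequences from the slide; the function names are assumptions, and no window or threshold filtering is applied.

```python
def dot_matrix(horizontal, vertical):
    """matrix[i][j] is True when vertical[i] equals horizontal[j]."""
    return [[v == h for h in horizontal] for v in vertical]

def render(horizontal, vertical):
    """Plain-text dot plot, one row per residue of the vertical sequence."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dot_matrix(horizontal, vertical)):
        lines.append(v + " " + " ".join("*" if hit else " " for hit in row))
    return "\n".join(lines)

matrix = dot_matrix("MAKLQGALGKRY", "MAKIQGALAKRY")
diagonal_hits = sum(matrix[i][i] for i in range(len(matrix)))
print(render("MAKLQGALGKRY", "MAKIQGALAKRY"))
print("dots on the main diagonal:", diagonal_hits)
```

A run of dots along a diagonal corresponds to an ungapped stretch of the alignment; dots off the diagonal are chance matches or repeats.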

Searching databases with FASTA / BLAST

Improvement of speed as compared to local alignment algorithm:

Initial search is for short words. Word hits are then extended in either direction.

First step in BLAST - obtaining a list of words based on the query sequence

Query sequence: FSGTWAMA ....

Words derived from query sequence: FSG, SGT, GTW, TWA .... etc

GTW (6+5+11=22)
GSW (6+1+11=18)
GNW (6+0+11=17)
GAW (6+0+11=17)
ATW (0+5+11=16)
DTW (-1+5+11=15)
GTF (6+5+1=12)
---------------- threshold
GTM (6+5-1=10)
DAW (-1+0+11=10)
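The word list and the scoring of neighbouring words can be sketched as follows. The pair scores are the ones that appear in the sums on the slide; the threshold value of 11 is an assumption (the slide marks a threshold but gives no number), and the function names are made up for the example.

```python
# Pair scores (query residue, candidate residue) taken from the sums on the slide.
PAIR_SCORE = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("T", "N"): 0, ("T", "A"): 0,
    ("G", "A"): 0, ("G", "D"): -1,
    ("W", "F"): 1, ("W", "M"): -1,
}
THRESHOLD = 11  # assumption: the slide does not state the value

def query_words(query, size=3):
    """All overlapping words of the given size."""
    return [query[i:i + size] for i in range(len(query) - size + 1)]

def word_score(query_word, candidate):
    """Score a candidate word against the query word, position by position."""
    return sum(PAIR_SCORE[(q, c)] for q, c in zip(query_word, candidate))

candidates = ["GTW", "GSW", "GNW", "GAW", "ATW", "DTW", "GTF", "GTM", "DAW"]
kept = [w for w in candidates if word_score("GTW", w) >= THRESHOLD]
print(query_words("FSGTWAMA"))
print(kept)
```

Only the words that score at or above the threshold are looked up in the database, which is what keeps the initial search fast.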

Output from Fasta

Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library

173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                      opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as (  75)  483  130 1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end (  66)  426  116 3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4 (  65)  251   71 8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F (  77)  145   45 0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2 ( 136)  105   35 0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S ( 217)  105   35 0.33
gi|6319639|ref|NP_009721.1| involved in the secre (  65)   92   31 1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1 ( 136)   93   32 1.8

How do we know from an alignment if two sequences are evolutionarily related?

This seems convincing:

GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK

But what about this:

VAKTSRNAPEEKASVG
IASGNRNFGEAYGRAG ?

We need some input from statistics / probability theory

For instance, alignment methods like BLAST will ask: What is the probability that this match occurs by chance only?

The Expect value (E)

Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the more "significant" the match is.
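The dependence of E on database size can be illustrated with the standard relation between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total database length. This is a rough sketch only: it ignores the effective-length corrections that BLAST applies, so it will not reproduce reported E-values exactly. The function name is an assumption; the numbers come from the BLASTP run shown in these slides (75-letter query, 140,871,481 database letters, hit with 74.1 bits).

```python
def expect_from_bits(bit_score, query_len, db_len):
    """Approximate E-value: search space size times 2**(-bit score)."""
    return query_len * db_len * 2.0 ** (-bit_score)

e_value = expect_from_bits(74.1, 75, 140_871_481)
e_value_double_db = expect_from_bits(74.1, 75, 2 * 140_871_481)
print(f"{e_value:.1e}")
print(f"{e_value_double_db:.1e}")
```

Doubling the database doubles E for the same score, which is why an E-value is always tied to the database that was searched.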

>>gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associa (75 aa)
 initn: 483 init1: 483 opt: 483 Z-score: 682.4 bits: 130.3 E(): 1.9e-30
Smith-Waterman score: 483; 100.000% identity in 75 aa overlap (1-75:1-75)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|458 MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
               10        20        30        40        50        60

               70
ramp4. GSAIFQIIQSIRMGM
       :::::::::::::::
gi|458 GSAIFQIIQSIRMGM
               70

>>gi|7504801|pir||T23009 hypothetical protein F59F4.2 - (65 aa)
 initn: 227 init1: 143 opt: 251 Z-score: 365.3 bits: 71.4 E(): 8.5e-13
Smith-Waterman score: 251; 53.846% identity in 65 aa overlap (10-74:1-64)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                :. :::. .::.. :::...::::::. . : :.:  ..:::..::.::::
gi|750          MAPKQRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVC
                        10        20        30         40        50

               70
ramp4. GSAIFQIIQSIRMGM
       :::.:.::. ..::
gi|750 GSAVFEIIRYVKMGW
               60

>>gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|212121 (136 aa)
 initn: 66 init1: 41 opt: 105 Z-score: 160.3 bits: 34.6 E(): 0.22
Smith-Waterman score: 105; 30.488% identity in 82 aa overlap (3-75:50-125)

                                           10        20        30
ramp4.                             MVGAGGAAKMVAKQRIRMANEKHSKNITQRGN
                                     :.::.:  : .: ::. :..:.. .  ::
gi|249 RLGLPAVRIPLNERERIQVDEPYILIVPSYGGGGTAGAVPRQVIRFLNDEHNRALL-RGV
      20        30        40        50        60        70

             40                 50        60        70
ramp4. VAKTSRNAPEEKASVG---------PWLLALFIFVVCGSAIFQIIQSIRMGM
       .:. .::  :  . .:         :::   . : . :.   . :...: :.
gi|249 IASGNRNFGEAYGRAGDVIARKCGVPWL---YRFELMGTQ--SDIENVRKGVTEFWQRQP
       80        90       100          110         120       130

Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr 457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                    Score     E
Sequences producing significant alignments:                         (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...    126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...    126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...     74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                 46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                 36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...     29  3.6

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST
           EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes
           from this gene; cDNA EST EMBL:D74852 comes from this gene;
           cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
           Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64

The different variants of BLAST

          Query     Database
blastp    Protein   Protein
blastn    DNA       DNA
tblastn   Protein   DNA
blastx    DNA       Protein
tblastx   DNA       DNA

Basic BLAST command line

blastall -i input_sequence -d database -p blast_version

In a BLAST search low complexity regions in the query sequence are filtered out by default

Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits. In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.

Another reason why hits to low-complexity regions in proteins should be filtered out is that such regions often have a disordered 3D structure and are not associated with well-defined biological functions.
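The idea of "unusual composition" can be made measurable with Shannon entropy: a window dominated by one or two residues has low entropy. The sketch below is not the method used by the BLAST filters; it is a simplified stand-in, and the window size (12), the cutoff (2.2 bits, meant for protein sequences) and the function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Entropy in bits per residue of the composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def mask_low_complexity(seq, window=12, cutoff=2.2, mask_char="X"):
    """Replace every window whose entropy is below the cutoff by mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < cutoff:
            masked[start:start + window] = mask_char * window
    return "".join(masked)

print(round(shannon_entropy("PPCDPPPPPKDKKKKDDGPP"), 2))
print(round(shannon_entropy("AAATAAAAAAAATAAAAAAT"), 2))
print(mask_low_complexity("PPCDPPPPPKDKKKKDDGPP"))
```

For comparison, a protein segment that uses all 20 amino acids equally has an entropy of about 4.3 bits, and a nucleotide sequence cannot exceed 2 bits.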

BLAST and filtering of low-complexity sequence

Query:295 DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD
          DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD
Sbjct:87  DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD

Query:355 ASSYAGKTCTLRIKLAPDGMLLDIKPEGGDXXXXXXXXXXXXXXXXXXXXSQAVYEVFKN
          ASSYAGKTCTLRIKLAPDGMLLDIKPEGGD                    SQAVYEVFKN
Sbjct:147 ASSYAGKTCTLRIKLAPDGMLLDIKPEGGDPALCQAALAAAKLAKIPKPPSQAVYEVFKN

Query:415 APLDFKP 421
          APLDFKP
Sbjct:207 APLDFKP 213

Introduction to practicals - biological sequences

M A K R K L K K N L K T F V A F S A I T   F1
W Q R E S * K R T * K L L L H L V L L L   F2
G K E K V K K E L K N F C C I * C Y Y C   F3
  1 ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT 60
    ----:----|----:----|----:----|----:----|----:----|----:----|
  1 TACCGTTTCTCTTTCAATTTTTTCTTGAATTTTTGAAAACAACGTAAATCACGATAATGA 60
X A F L F N F F F K F V K T A N L A I V   F6
X P L S F T L F S S L F K Q Q M * H * *   F5
H C L S L * F L V * F S K N C K T S N S   F4

A L L L T N G I P I S A L T Q S S N T T   F1
L Y C * L M V F Q L V L * L S L P I Q L   F2
F I V N * W Y S N * C F N S V F Q Y N *   F3
 61 GCTTTATTGTTAACTAATGGTATTCCAATTAGTGCTTTAACTCAGTCTTCCAATACAACT 120
    ----:----|----:----|----:----|----:----|----:----|----:----|
 61 CGAAATAACAATTGATTACCATAAGGTTAATCACGAAATTGAGTCAGAAGGTTATGTTGA 120
A K N N V L P I G I L A K V * D E L V V   F6
Q K I T L * H Y E L * H K L E T K W Y L   F5
S * Q * S I T N W N T S * S L R G I C S   F4

E I T S Q A T T G L R N V M Y Y G D W S   F1
R L L H K L L Q G Y V M * C I M V T G L   F2
D Y F T S Y Y R V T * C N V L W * L V Y   F3
121 GAGATTACTTCACAAGCTACTACAGGGTTACGTAATGTAATGTATTATGGTGACTGGTCT 180
    ----:----|----:----|----:----|----:----|----:----|----:----|
121 CTCTAATGAAGTGTTCGATGATGTCCCAATGCATTACATTACATAATACCACTGACCAGA 180
S I V E C A V V P N R L T I Y * P S Q D   F6
Q S * K V L * * L T V Y H L T N H H S T   F5
L N S * L S S C P * T I Y H I I T V P R   F4

Translation of a nucleotide sequence using ‘sixpack’
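A six-frame translation like the one above can be reproduced in a few lines. This is a minimal sketch using the standard genetic code; it is not how sixpack is implemented, and the frame labels for the reverse strand (R1 to R3) are a naming assumption that does not follow sixpack's F4 to F6 numbering.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna, offset=0):
    """Translate complete codons starting at offset; unknown codons give X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))

def six_frames(dna):
    """Three forward frames (F1-F3) and three reverse-strand frames (R1-R3)."""
    rc = reverse_complement(dna)
    frames = {f"F{k + 1}": translate(dna, k) for k in range(3)}
    frames.update({f"R{k + 1}": translate(rc, k) for k in range(3)})
    return frames

dna = "ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT"
for name, protein in six_frames(dna).items():
    print(name, protein)
```

The reverse-strand frames are printed in their own 5' to 3' direction, so they read right to left relative to the display above.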

Plotorf to show open reading frames

Ribosomal protein S16 1771-2019

Deviations from the standard genetic code

# Yeast mitochondria

UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M

# Mammalian mitochondria

UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = * :*
AGG = * :*

# Drosophila mitochondria

UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S

# Mycoplasma

UGA = Trp

# Ciliate protozoa

UAA = Gln:Q
UAG = Gln:Q
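In code, a variant genetic code can be handled as a small set of overrides on top of the standard table. The sketch below uses the mammalian and yeast mitochondrial entries listed above that differ from the standard code; the dictionary and function names are assumptions.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Only the codons whose meaning differs from the standard code.
MAMMALIAN_MITO = {"UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"}
YEAST_MITO = {"UGA": "W", "CUU": "T", "CUC": "T", "CUA": "T", "CUG": "T", "AUA": "M"}

def translate(rna, overrides=None):
    """Translate complete codons of rna, applying optional code overrides."""
    table = dict(STANDARD)
    table.update(overrides or {})
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

print(translate("AUGUGAAUAAGA"))
print(translate("AUGUGAAUAAGA", MAMMALIAN_MITO))
```

The same mRNA gives a different protein depending on the table, so choosing the right genetic code matters when translating organelle sequences.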

EMBOSS

sixpack
plotorf

water - Smith-Waterman alignment
needle - Needleman-Wunsch alignment
dottup - dotplot analysis


Alignment of mRNA sequence to genomic DNA sequence with needle

effect of gap parameters
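Why gap parameters matter when an mRNA is aligned to genomic DNA: the intron has to be covered by one long gap, so the cost of extending a gap decides whether that alignment scores well. The sketch below computes only the score of a global alignment with separate gap-open and gap-extension penalties. It is an illustration, not the code of needle; the scoring values, the toy sequences and the function name are assumptions.

```python
def global_score(a, b, match=5, mismatch=-4, gap_open=10.0, gap_extend=0.5):
    """Best global alignment score with affine gap penalties (score only)."""
    neg = float("-inf")
    n, m = len(a), len(b)
    M = [[neg] * (m + 1) for _ in range(n + 1)]  # last column aligns two residues
    X = [[neg] * (m + 1) for _ in range(n + 1)]  # last column: residue of a vs gap
    Y = [[neg] * (m + 1) for _ in range(n + 1)]  # last column: gap vs residue of b
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

mrna = "ACGTACGT"
genomic = "ACGT" + "G" * 10 + "ACGT"  # two toy exons separated by a 10-base intron
print(global_score(mrna, genomic, gap_open=10.0, gap_extend=0.5))
print(global_score(mrna, genomic, gap_open=10.0, gap_extend=10.0))
```

With a cheap extension the intron-sized gap costs little and the alignment keeps a positive score; with an expensive extension the same alignment scores far below zero.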

Dot plot analysis (dottup) reveals repeats

hprt_mouse         1 mptrspsvvisddepgydldlfcipnhyaedlekvfiphglimdrterl
                     +++++++++++++++++++++++++++++++++++++++++++++++++
                     MPTRSPSVVISDDEPGYDLDLFCIPNHYAEDLEKVFIPHGLIMDRxERL
gi|26145909|dbj   36 acacacaggaagggcgtgcgtttacactgggtgagtaccgcaagaagac
                     tccggcgtttgaaacgaatattgtcaaacaataatttcagtttagNagt
                     ggcctcccgtcttaattcatgttattttcgtgaagttttagtgcgtaat

hprt_mouse        50 ardvmkemgghhivalcvlkggykffadlldyikalnrnsdrsipmtvd
                     ++++++++++++++ +++++++++++++ +++++++++++++++++++
                     ARDVMKEMGGHHIV!LCVLKGGYKFFAD!LDYIKALNRNSDRSIPMTVH
gi|26145909|dbj  183 gcggaagaggccag4ctgcaggtattgg4cgtaagcaaaagatacaagc
                     cgattaatggaatt tgttaggaattca taatactagagagctctcta
                     tatcggggactctg ctgcggctgcttc gtctaagtatttacttgtat

hprt_mouse        99 firlksycndqstgdikviggddlstltgk
                      +++++++++++++++++++++++++++++
                     SIRLKSYCNDQSTGDIKVIGGDDLSTLTGK
gi|26145909|dbj  332 taacaattagctaggaagaggggctataga
                     ctgtagagaaaccgatattggaatcctcga
                     tcaggcctttgaggcaatttattcatatag

Alignment of protein sequence to DNA sequence (genewise)


Substitution matrices

Each amino acid change has a characteristic probability

Aligning two sequences using the BLAST algorithm:

bl2seq -i sequence_1 -j sequence_2 -p blastn


BLAST and word size

blastall -i ........ -W7 (default is W11)

GTCAAGTGGCAACTCCGTCAG
********** **********
GTCAAGTGGCTACTCCGTCAG
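The example above can be checked directly: the two sequences share exact runs of at most 10 bases, so no exact word of length 11 exists and the pair gives no initial word hit at W11, while at W7 there are several. The sketch below only looks for shared exact words; it does not perform the extension step, and the function name is an assumption.

```python
def shared_words(seq1, seq2, size):
    """Set of exact words of the given size that occur in both sequences."""
    words1 = {seq1[i:i + size] for i in range(len(seq1) - size + 1)}
    words2 = {seq2[i:i + size] for i in range(len(seq2) - size + 1)}
    return words1 & words2

a = "GTCAAGTGGCAACTCCGTCAG"
b = "GTCAAGTGGCTACTCCGTCAG"
print(len(shared_words(a, b, 11)))
print(sorted(shared_words(a, b, 7)))
```

A smaller word size makes the search more sensitive but produces more initial hits, so it is slower.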


‘seg’ - NCBI utility to identify low-complexity regions

‘fastacmd’ retrieves sequences from BLAST-formatted databases:

fastacmd -s accession_number -d database


Query= gi|28872819|ref|NP_057849.4| Gag-Pol [Human immunodeficiency virus 1]
 (1435 letters)

Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples
 2,506,223 sequences; 849,940,114 total letters

Searching..................................................done


ref|NP_057849.4| Gag-Pol [Human immunodeficiency virus 1]            2849   0.0
gb|AAG28737.1| gag-pol fusion protein [synthetic construct]          2770   0.0
gb|AAD03191.1| gag-pol fusion polyprotein [Human immunodeficienc...  2768   0.0
dbj|BAB85751.1| Gag-pol fusion polyprotein [Human immunodeficien...  2759   0.0
gb|AAD03200.1| gag-pol fusion polyprotein [Human immunodeficienc...  2745   0.0
gb|AAG30116.1| gag-pol fusion polyprotein [Human immunodeficienc...  2741   0.0
gb|AAD03217.1| gag-pol fusion polyprotein [Human immunodeficienc...  2727   0.0
dbj|BAC77511.1| Gag-Pol fusion protein [Human immunodeficiency v...  2710   0.0
gb|AAD03326.1| gag-pol fusion polyprotein [Human immunodeficienc...  2704   0.0
dbj|BAC77477.1| Gag-Pol fusion polyprotein [Human immunodeficien...  2702   0.0
dbj|BAC77486.1| Gag-Pol fusion polyprotein [Human immunodeficien...  2693   0.0
gb|AAD03241.1| gag-pol fusion polyprotein [Human immunodeficienc...  2692   0.0
gb|AAD03225.1| gag-pol fusion polyprotein [Human immunodeficienc...  2684   0.0
gb|AAD03233.1| gag-pol fusion polyprotein [Human immunodeficienc...  2680   0.0
gb|AAD03209.1| gag-pol fusion polyprotein [Human immunodeficienc...  2679   0.0
gb|AAN73492.1| gag-pol fusion polyprotein [Human immunodeficienc...  2664   0.0
gb|AAN73736.1| gag-pol fusion polyprotein [Human immunodeficienc...  2657   0.0
emb| AD59561 1| gag-pol fusion protein [Human immunodeficiency v...  2653   0.0


FASTA:

      30        40        50        60        70        80
AF1862 GAUAGUCCAGGACUAUUGGAUUUAAUUCCAAAUGCUCCUGAGAGCUCCAUAGAGCGGAA-
                                     :::::::::::::::: : : : ::::::
AF1862 GUGCGUCUUUCGGGGCGCGCGGGGCGAAAGAAUGCUCCUGAGAGCUUCCU-GGGCGGAAA
      20        30        40        50        60         70

            90       100       110       120       130       140
AF1862 -----GCUCUGGACGAAGCCAUCAGAAAAAUCGCUUACUUGUGAAGUGAUGGGCCACUCU
             : : :: ::  :::::::::::
AF1862 UAUUUCCGCCGGGCGUCGCCAUCAGAAAUUCAGCAGGCUAUGCUUGCAUGGGAGGCGGCG
       80        90       100       110       120       130

BLAST:

Query: 60 aatgctcctgagagct 75
          ||||||||||||||||
Sbjct: 50 aatgctcctgagagct 65

 Score = 22.3 bits (11), Expect = 1.0
 Identities = 11/11 (100%)
 Strand = Plus / Plus

Query: 101 gccatcagaaa 111
           |||||||||||
Sbjct: 96  gccatcagaaa 106
