Source: bio.lundberg.gu.se/courses/ht05/bio1/sequences_biol.pdf
Analysis of biological sequences. (Lesk chapter 4)
Sequence alignment
• Sequence assembly
• Classification
• Prediction of function
• Comparative genomics
• Phylogeny / Evolutionary history
Pattern matching
Recognition of signals / statistical properties / character relationships
• Prediction of protein function
• Identification of transcription regulatory sites
• Gene prediction
• RNA and protein secondary structure prediction
Two-dimensional weight matrices are used in identification of splicing signals
Prediction of RNA secondary structure
5' GCCUCUUGGC 3'
[Figure: the same sequence drawn in its folded secondary structure, with the 5' and 3' ends labelled]
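As a rough illustration of what a secondary structure prediction computes, the sketch below counts the maximum number of nested Watson-Crick base pairs a sequence can form, using Nussinov-style dynamic programming. This is not the method of any particular program mentioned in the slides; the function name `max_base_pairs` and the minimum loop length of 3 are assumptions made for the example, and G-U wobble pairs are ignored.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested A-U / G-C pairs (Nussinov-style DP).

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed value; real folding programs use energy models instead).
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some base k, leaving a loop of >= min_loop
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GCCUCUUGGC"))
```

For GCCUCUUGGC the count is 3, which is consistent with a hairpin in which GCC at the 5' end pairs with GGC at the 3' end around a four-nucleotide loop.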
Problem of sequence alignment - interaction between molecular biology / computer science / statistics
* What biological problems are addressed?
* Algorithm (dynamic programming)
* ‘Simple’ implementation / source code / compiling
* Common implementations in molecular biology software packages
* Statistics and probability theory of alignments
Biological aspects of sequence alignments
Why do we want to align 2 sequences?
As one example, consider this common application:
We have a ‘new’ sequence. Is it similar to a previously known sequence?
Alignment to all previously known sequences. (Many of these have annotation such as a description of function.)
Outcome: similarity or no similarity. When similarity is found, it supports:
• Prediction of function
• Phylogeny / evolutionary history
Basic concepts of protein sequence alignments
Proteins are homologous if they are related by divergence from a common ancestor.
Two kinds of homology:
Orthologs: Proteins that carry out the same function in different species
Paralogs: Proteins that perform different but related functions within one organism
[Figure: Speciation. Gene X in the ancestral organism becomes X1 in organism A and X2 in organism B. X1 and X2 are orthologs.]
[Figure: Gene duplication. Gene X is duplicated into Xa and Xb. Xa and Xb are paralogs.]
Mouse trypsin      -- orthologs -- Human trypsin
      |                                  |
   paralogs                           paralogs
      |                                  |
Mouse chymotrypsin -- orthologs -- Human chymotrypsin
Ortholog / paralog relationships may be identified using local alignment algorithms such as Smith-Waterman
But: databases are huge, current nucleotide database = 100 billion nucleotides
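To make the local alignment idea concrete, here is a minimal sketch of the Smith-Waterman dynamic programming recurrence that returns only the best local alignment score. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the function name are assumptions chosen for illustration; real protein searches normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = h[i - 1][j] + gap      # gap in b
            left = h[i][j - 1] + gap    # gap in a
            h[i][j] = max(0, diag, up, left)  # 0 lets an alignment restart anywhere
            best = max(best, h[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

Every cell of the matrix is filled, so the work grows with the product of the two sequence lengths. That cost is the reason a database of the size quoted above calls for faster heuristic searches.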
Comparing 2 sequences - Dotplot analysis
[Figure: dotplot of MAKLQGALGKRY (horizontal) against MAKIQGALAKRY (vertical); an asterisk marks each position where the two residues are identical]
Sequence alignment
M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
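A dotplot of the two example sequences MAKLQGALGKRY and MAKIQGALAKRY can be produced with a few lines of code. The sketch below is a plain identity dotplot with no window or threshold filtering; the function names are made up for this example.

```python
def dotplot(seq_x, seq_y):
    """Return text rows: one row per residue of seq_y, '*' where residues match."""
    rows = []
    for res_y in seq_y:
        cells = ["*" if res_x == res_y else " " for res_x in seq_x]
        rows.append(res_y + " " + " ".join(cells))
    return rows


def diagonal_matches(seq_x, seq_y):
    """Count identical residues at the same position (the main diagonal)."""
    return sum(1 for a, b in zip(seq_x, seq_y) if a == b)


if __name__ == "__main__":
    top = "MAKLQGALGKRY"
    side = "MAKIQGALAKRY"
    print("  " + " ".join(top))
    for row in dotplot(top, side):
        print(row)
    print("matches on the main diagonal:", diagonal_matches(top, side))
```

The main diagonal carries 10 matches out of 12 positions, which corresponds to the ungapped alignment of the two sequences. Dots off the diagonal come from residues such as A, L, G and K that occur more than once.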
Searching databases with FASTA / BLAST
Improvement of speed as compared to local alignment algorithm:
Initial search is for short words. Word hits are then extended in either direction.
First step in BLAST - obtaining a list of words based on the query sequence
Query sequence: FSGTWAMA ....
Words derived from query sequence: FSG, SGT, GTW, TWA .... etc
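The word list is simply every overlapping substring of a fixed length. A minimal sketch, assuming a word size of 3 as in the example above (the function name `query_words` is hypothetical):

```python
def query_words(query, word_size=3):
    """Return all overlapping words of length word_size, in order of position."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


if __name__ == "__main__":
    print(query_words("FSGTWAMA"))
```

This covers only the first step shown on the slide. The later steps (looking up the words in the database and extending the hits) are not sketched here.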
Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library
173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46
FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                        opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as  (  75)  483  130 1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end  (  66)  426  116 3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4  (  65)  251   71 8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F  (  77)  145   45 0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2  ( 136)  105   35 0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S  ( 217)  105   35 0.33
gi|6319639|ref|NP_009721.1| involved in the secre  (  65)   92   31 1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1  ( 136)   93   32 1.8
How do we know from an alignment whether two sequences are evolutionarily related?
This seems convincing:
GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK
But what about this:
VAKTSRNAPEEKASVG
IASGNRNFGEAYGRAG ?
We need some input from statistics / probability theory
For instance, alignment methods like BLAST will ask: What is the probability that this match occurs by chance only?
The Expect value (E)
Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to "0", the more "significant" the match is.
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= ramp4.seq (75 letters)
Database: nr 457,798 sequences; 140,871,481 total letters
>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
 Length = 65
 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)
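The Expect value in a report like this one can be related approximately to the bit score through the Karlin-Altschul relation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below plugs in the raw lengths from the report (75 letters and 140,871,481 letters). That is an assumption made for simplicity: BLAST itself uses corrected "effective" lengths, so its reported value is expected to differ somewhat.

```python
def expect_from_bit_score(bit_score, query_len, db_len):
    """Approximate E value: E = m * n * 2**(-S'), using raw (uncorrected) lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    e_value = expect_from_bit_score(74.1, 75, 140_871_481)
    print(f"{e_value:.1e}")
```

With these inputs the estimate comes out below 1e-12, in the same range as the reported Expect = 1e-13. The formula also shows why the same score becomes less significant as the database grows: E scales linearly with n.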
In a BLAST search, low-complexity regions in the query sequence are filtered out by default
Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits. In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
Another reason why hits to low-complexity regions in proteins should be filtered out is that such regions often have a disordered 3D structure and are not associated with well-defined biological functions.
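One simple way to put a number on "low complexity" is the Shannon entropy of the residue composition: a sequence dominated by one or two letters has low entropy. The sketch below is only an illustration of that idea and is not the filter BLAST actually applies; the function name is hypothetical, and any cutoff for calling a region low-complexity would have to be chosen by the user.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Composition entropy in bits per residue (0 for a single repeated letter)."""
    total = len(seq)
    if total == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    for s in ("PPCDPPPPPKDKKKKDDGPP", "AAATAAAAAAAATAAAAAAT", "ACGT"):
        print(s, round(shannon_entropy(s), 2))
```

For comparison, the maximum possible value is 2 bits for DNA (four equally frequent bases) and about 4.32 bits for proteins (twenty equally frequent amino acids). Both example sequences from the text fall well below their respective maxima.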
    W Q R E S * K R T * K L L L H L V L L L   F2
    G K E K V K K E L K N F C C I * C Y Y C   F3
  1 ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT 60
    ----:----|----:----|----:----|----:----|----:----|----:----|
  1 TACCGTTTCTCTTTCAATTTTTTCTTGAATTTTTGAAAACAACGTAAATCACGATAATGA 60
    X A F L F N F F F K F V K T A N L A I V   F6
    X P L S F T L F S S L F K Q Q M * H * *   F5
    H C L S L * F L V * F S K N C K T S N S   F4

    A L L L T N G I P I S A L T Q S S N T T   F1
    L Y C * L M V F Q L V L * L S L P I Q L   F2
    F I V N * W Y S N * C F N S V F Q Y N *   F3
 61 GCTTTATTGTTAACTAATGGTATTCCAATTAGTGCTTTAACTCAGTCTTCCAATACAACT 120
    ----:----|----:----|----:----|----:----|----:----|----:----|
 61 CGAAATAACAATTGATTACCATAAGGTTAATCACGAAATTGAGTCAGAAGGTTATGTTGA 120
    A K N N V L P I G I L A K V * D E L V V   F6
    Q K I T L * H Y E L * H K L E T K W Y L   F5
    S * Q * S I T N W N T S * S L R G I C S   F4

    E I T S Q A T T G L R N V M Y Y G D W S   F1
    R L L H K L L Q G Y V M * C I M V T G L   F2
    D Y F T S Y Y R V T * C N V L W * L V Y   F3
121 GAGATTACTTCACAAGCTACTACAGGGTTACGTAATGTAATGTATTATGGTGACTGGTCT 180
    ----:----|----:----|----:----|----:----|----:----|----:----|
121 CTCTAATGAAGTGTTCGATGATGTCCCAATGCATTACATTACATAATACCACTGACCAGA 180
    S I V E C A V V P N R L T I Y * P S Q D   F6
    Q S * K V L * * L T V Y H L T N H H S T   F5
    L N S * L S S C P * T I Y H I I T V P R   F4
Translation of a nucleotide sequence using ‘sixpack’
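The idea behind a six-frame translation can be sketched in a few lines: translate the sequence from offsets 0, 1 and 2, then do the same for its reverse complement. The code below is an independent illustration using the standard genetic code, not sixpack's implementation. The frame keys 1..3 and -1..-3 are a labelling chosen for this example and need not correspond to sixpack's F4-F6 numbering.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at offset; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return {1, 2, 3: forward frames, -1, -2, -3: reverse-complement frames}."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for k in range(3):
        frames[k + 1] = translate(dna, k)
        frames[-(k + 1)] = translate(rev, k)
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frames("ATGGCAAAGAGAAAG").items()):
        print(name, protein)
```

Applied to the first 15 nucleotides of the sequence shown above, forward frames 2 and 3 begin WQRE and GKEK, matching the start of the F2 and F3 lines in the sixpack output.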