Analysis of biological sequences. (Lesk chapter 4)
Source: bio.lundberg.gu.se/courses/ht04/bio1/seqintro.pdf

Sequence alignment
• Sequence assembly
• Classification
• Prediction of function
• Comparative genomics
• Phylogeny / Evolutionary history

Pattern matching
Recognition of signals / statistical properties / character relationships
• Prediction of protein function
• Identification of transcription regulatory sites
• Gene prediction
• RNA and protein secondary structure prediction
Two-dimensional weight matrices are used in the identification of splicing signals
Prediction of RNA secondary structure
GCCUCUUGGC
[Figure: the sequence 5’ GCCUCUUGGC 3’ drawn as a linear strand and as a folded structure]
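One simple way to see why GCCUCUUGGC can fold is to count how many nested base pairs it can form. The sketch below uses Nussinov-style base-pair maximization, which is not named on the slide and is far simpler than the energy-based methods used in practice. The function name, the restriction to Watson-Crick pairs, and the minimum loop size of 3 are all assumptions made for illustration.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested Watson-Crick pairs (Nussinov-style sketch)."""
    pairs = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with base k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GCCUCUUGGC"))
```

For the slide's sequence the three pairs come from GCC at the 5’ end pairing with GGC at the 3’ end, leaving UCUU as the loop.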
Problem of sequence alignment - interaction between molecular biology / computer science / statistics
* What biological problems are addressed?
* Algorithm (dynamic programming)
* ‘Simple’ implementation / source code / compiling
* Common implementations in molecular biology software packages
* Statistics and probability theory of alignments
Why do we want to align 2 sequences?
As one example, consider this common application:
We have a ‘new’ sequence. Is it similar to a previously known sequence?
Align it to all previously known sequences. (Many of these have annotation, such as a description of function.)
[Diagram: the comparison gives either similarity or no similarity. Similarity allows:]
• Prediction of function
• Phylogeny / evolutionary history
Basic concepts of protein sequence alignments
Proteins are homologous if they are related by divergence from a common ancestor.
Two kinds of homology:
Orthologs: proteins that carry out the same function in different species
Paralogs: proteins that perform different but related functions within one organism
[Figure: Speciation. Gene X of an ancestral organism is inherited by Organism A (as X1) and Organism B (as X2); the two copies are orthologs.]
[Figure: Gene duplication. Gene X is duplicated into Xa and Xb; the two copies are paralogs.]
Mouse trypsin       -- orthologs --  Human trypsin
      |                                    |
   paralogs                             paralogs
      |                                    |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin
How do we know from an alignment if two sequences are evolutionarily related?
This seems convincing:
GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK
But what about this:
VAKTSRNAPEEKASVG
IASGNRNFGEAYGRAG ?
We need some input from statistics / probability theory
For instance, alignment methods like BLAST will ask: What is the probability that this match occurs by chance alone?
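The question "could this match occur by chance?" can be illustrated with a shuffling test on the two example pairs above. This is only a sketch: BLAST does not shuffle sequences but uses an analytical score distribution, and counting identities ignores conservative substitutions. The function names, the number of trials, and the fixed random seed are assumptions for illustration.

```python
import random


def identities(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def chance_probability(a, b, trials=1000, seed=0):
    """Estimate how often a shuffled copy of b matches a at least as well as b does."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = identities(a, b)
    letters = list(b)
    as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if identities(a, letters) >= observed:
            as_good += 1
    # add-one correction avoids reporting a probability of exactly zero
    return (as_good + 1) / (trials + 1)


strong = ("GWFTREKLREEDHIKK", "GWFTKEKIREEDHIKK")
weak = ("VAKTSRNAPEEKASVG", "IASGNRNFGEAYGRAG")
for a, b in (strong, weak):
    print(identities(a, b), "identities, estimated p =", chance_probability(a, b))
```

The first pair shares 14 of 16 positions and the second 5 of 16, so the estimate for the first pair can never be larger than for the second.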
Comparing 2 sequences - Dotplot analysis
[Figure: dot plot of M A K L Q G A L G K R Y against M A K I Q G A L A K R Y; each * marks a pair of identical residues, and the matches line up along the diagonal]

Sequence alignment
M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
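A dot plot like the one on this slide can be produced in a few lines. The sketch below is a minimal text version that marks every pair of identical residues with `*`; the function name and the layout (first sequence across the top, second down the side) are assumptions, and real dot plot programs usually compare short windows rather than single residues.

```python
def dot_plot(seq_x, seq_y):
    """Return text lines: seq_x across the top, seq_y down the side, * for identity."""
    lines = ["  " + " ".join(seq_x)]
    for residue_y in seq_y:
        row = ["*" if residue_x == residue_y else " " for residue_x in seq_x]
        lines.append(residue_y + " " + " ".join(row))
    return lines


for line in dot_plot("MAKLQGALGKRY", "MAKIQGALAKRY"):
    print(line)
```

Residues that occur more than once (A, K, L, G) also produce dots away from the main diagonal, which is why a single-residue plot looks noisy.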
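Comparing 2 sequences - Gaps
[Figure: dot plot of M A K L Q L G K R Y against M A K L Q G A L G K R Y; the diagonal of matches is displaced at the position of the gap]

Sequence alignment
M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y
(the two "-" positions form the gap)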
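The gapped alignment of MAKLQLGKRY and MAKLQGALGKRY can be found with dynamic programming. The following is a minimal sketch of global alignment in the style of Needleman and Wunsch. The scoring values (+1 match, -1 mismatch, -1 per gap position) and the function name are assumptions chosen for illustration, not the parameters of any particular package.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap cost; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = global_align("MAKLQLGKRY", "MAKLQGALGKRY")
print(total)
print(top)
print(bottom)
```

With these scores the best alignment has ten matches and one two-residue gap, matching the alignment shown on the slide.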
Gaps are results of mutations (changes in DNA) that occur during evolution
For instance consider this deletion mutation:
Before the deletion:
DNA      AACTTGACG GACTGGGCGTAT CGGGCACCG
         TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein  N  L  T   D  W  A  Y   R  A  P

After the deletion:
DNA      AACTTGACG CGGGCACCG
         TTGAACTGC GCCCGTGGC
protein  N  L  T   R  A  P
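The deletion example can be checked by translating the coding strand before and after removing the 12 nucleotides GACTGGGCGTAT. The sketch below includes only the ten codons that occur in this example, taken from the standard genetic code; the table name and function name are assumptions for illustration.

```python
# Only the codons needed for this example (standard genetic code).
CODONS = {
    "AAC": "N", "TTG": "L", "ACG": "T", "GAC": "D", "TGG": "W",
    "GCG": "A", "TAT": "Y", "CGG": "R", "GCA": "A", "CCG": "P",
}


def translate(dna):
    """Translate complete codons of a coding strand, reading from position 0."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


before = "AACTTGACG" + "GACTGGGCGTAT" + "CGGGCACCG"
after = before[:9] + before[21:]  # delete nucleotides 10-21 (four codons)

print(translate(before))
print(translate(after))
```

Because the deletion removes a whole number of codons, the reading frame is preserved and the protein simply loses the residues D W A Y, which appears as a gap when the two proteins are aligned.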
Comparing / aligning two sequences. Gaps
In a pairwise comparison, gaps cannot be inserted in an unrestricted manner: with enough gaps, almost any two sequences can be made to look similar. For this reason a gap penalty is assigned to gaps. Two parameters are frequently used in sequence comparison:
- Gap creation penalty
- Gap extension penalty
There are two parameters because it is more ‘difficult’ to create a gap than to extend an existing gap.
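The effect of having two gap parameters can be seen in a small calculation. The sketch below assumes the common convention that a gap of length k costs creation + k × extension; programs differ in whether the first gap position is also charged an extension, so treat the formula as an assumption. The default values 12 and 2 are only illustrative.

```python
def gap_penalty(length, creation=12, extension=2):
    """Cost of one gap of the given length under an assumed affine scheme."""
    if length <= 0:
        return 0
    return creation + extension * length


one_long_gap = gap_penalty(2)         # a single gap covering two positions
two_short_gaps = 2 * gap_penalty(1)   # two separate one-position gaps
print(one_long_gap, two_short_gaps)
```

One gap of two positions is cheaper than two separate gaps of one position each, so the alignment is steered toward a few longer gaps, consistent with a single insertion or deletion event.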
Substitution matrices
Each amino acid change has a characteristic probability
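A substitution matrix turns these probabilities into scores, so that a likely change such as L to I costs less than an unlikely one. The sketch below scores the ungapped alignment of MAKLQGALGKRY and MAKIQGALAKRY with a small subset of values quoted from the BLOSUM62 matrix. The choice of BLOSUM62 is an assumption for illustration, and the values should be checked against a published table before being relied on.

```python
# Subset of BLOSUM62 (assumed values; keys are alphabetically sorted residue pairs).
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("K", "K"): 5, ("L", "L"): 4,
    ("I", "I"): 4, ("Q", "Q"): 5, ("G", "G"): 6, ("R", "R"): 5,
    ("Y", "Y"): 7, ("I", "L"): 2, ("A", "G"): 0,
}


def pair_score(x, y):
    """Look up the score of a residue pair; the matrix is symmetric."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def ungapped_score(a, b):
    """Sum of pair scores over two equal-length, already aligned sequences."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(ungapped_score("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

In this subset the L/I mismatch still earns a positive score while G/A scores zero, which is the kind of distinction a simple identity count cannot make.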
Dot plot analysis reveals repeats
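A repeat shows up in a dot plot of a sequence against itself as a line of dots parallel to the main diagonal. The sketch below searches for such lines directly. The function name, the minimum run length of 3, and the test sequence MAKLQGMAKLQG (a made-up tandem repeat) are assumptions for illustration.

```python
def repeat_offsets(seq, min_run=3):
    """Offsets d > 0 where seq matches itself shifted by d for at least min_run residues."""
    found = []
    for d in range(1, len(seq)):
        run = best = 0
        for i in range(len(seq) - d):
            # extend the current run on diagonal d, or reset it on a mismatch
            run = run + 1 if seq[i] == seq[i + d] else 0
            best = max(best, run)
        if best >= min_run:
            found.append(d)
    return found


print(repeat_offsets("MAKLQGMAKLQG"))
```

Each reported offset is the distance between the two copies of a repeated segment, that is, how far the extra line lies from the main diagonal.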
Searching databases with FASTA / BLAST
Improvement of speed as compared to the local alignment algorithm:
Initial search is for short words.
Word hits are then extended in either direction.
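The word-then-extend idea can be sketched as follows. This is a strong simplification and not the actual FASTA or BLAST procedure: it uses exact word matches only and extends while residues are identical, whereas the real programs score extensions with a substitution matrix. The function names and the word length of 3 are assumptions.

```python
from collections import defaultdict


def seed_hits(query, subject, k=3):
    """All (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_hit(query, subject, i, j, k=3):
    """Extend a word hit in both directions while residues stay identical."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + k, j + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q  # (query start, subject start, length)


query, subject = "MAKLQGALGKRY", "MAKIQGALAKRY"
segments = {extend_hit(query, subject, i, j) for i, j in seed_hits(query, subject)}
print(sorted(segments))
```

Only positions that share a word are ever examined, which is where the speed gain over a full dynamic programming comparison comes from.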
Output from Fasta

Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library

173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                        opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as   (  75)  483  130 1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end   (  66)  426  116 3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4   (  65)  251   71 8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F   (  77)  145   45 0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2   ( 136)  105   35 0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S   ( 217)  105   35 0.33
gi|6319639|ref|NP_009721.1| involved in the secre   (  65)   92   31 1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1   ( 136)   93   32 1.8
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr
 457,798 sequences; 140,871,481 total letters

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
 Length = 65
 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)
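The Expect value in this output is tied to the bit score and to the size of the search. A standard approximation is E = m × n × 2^(-S'), where m is the query length, n the total database length, and S' the bit score. The sketch below applies it to the numbers reported above; the function name is an assumption, and because BLAST corrects the lengths before computing E, only rough agreement with the reported value should be expected.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate Expect value: E = m * n * 2**(-S') without length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Values taken from the BLAST output: 75-letter query, 140,871,481 database letters,
# and a hit scoring 74.1 bits.
estimate = expected_hits(74.1, 75, 140_871_481)
print(f"{estimate:.1e}")
```

The estimate falls between 1e-13 and 1e-12, close to the reported Expect of 1e-13. Each additional bit of score halves the number of hits expected by chance.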
The different variants of BLAST

Program   Query     Database
blastp    Protein   Protein
blastn    DNA       DNA
tblastn   Protein   DNA
blastx    DNA       Protein
tblastx   DNA       DNA
In a BLAST search, low-complexity regions in the query sequence are filtered out by default

Regions with low-complexity sequence have an unusual composition, and this can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity, and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits. In BLAST searches performed without a filter, certain hits will often be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.

Another reason why hits to low-complexity regions in proteins should be filtered out is that such regions often have a disordered 3D structure and are not associated with well-defined biological functions.
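Low complexity can be made measurable with the Shannon entropy of the residue composition: the fewer distinct residues dominate a stretch, the lower the entropy. The sketch below is a stand-in for illustration and not the filter BLAST itself applies. The function names, the window of 10 residues, and the threshold of 1.0 bit are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy in bits of the residue composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=10, threshold=1.0):
    """Replace every window whose entropy is below the threshold with X."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)


print(round(shannon_entropy("AAATAAAAAAAATAAAAAAT"), 2))
print(round(shannon_entropy("PPCDPPPPPKDKKKKDDGPP"), 2))
print(mask_low_complexity("AAATAAAAAAAATAAAAAAT"))
```

With these settings the nucleotide example from the text is masked completely, while a sequence with balanced composition such as ACGTACGTACGT is left untouched. A protein filter would need a higher threshold because the alphabet is larger.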
WP: The program Findpatterns searches one or more sequences for patterns. The program Motifs specifically searches a protein sequence or set of sequences for the motifs present in the PROSITE database.
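Pattern searching of this kind can be sketched by converting a PROSITE-style pattern into a regular expression. The code below is not how Findpatterns or Motifs are implemented; it is a minimal illustration that supports only the elements x, [..], {..} and repeat counts such as (2) or (2,4), and ignores the terminus anchors. The function names are assumptions. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # split an element such as "x(2,4)" into its core and optional repeat counts
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            piece = "."                       # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            piece = core                      # a single residue or a [..] set
        if low:
            piece += "{" + low + ("," + high if high else "") + "}"
        parts.append(piece)
    return "".join(parts)


def find_motif(pattern, protein):
    """Non-overlapping matches as (start position, matched text) pairs."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(protein)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "NLTDWAYRAP"))
```

Short patterns like this one match very often by chance, so a hit indicates only a possible site.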