Introduction to sequence analysis (Lesk chapter 4)
Problem of sequence alignment - interaction between molecular biology / computer science / statistics
* What are the biological problems?
* Algorithm (dynamic programming)
* ‘Simple’ implementation / source code / compiling
* Statistics and probability theory of alignments
* Common implementations in molecular biology software packages
* Results of biological significance
Source: Göteborgs … bio.lundberg.gu.se/courses/ht03/bio1/seq.pdf
Applications of sequence analysis
* Sequencing projects, assembly of sequence data
* Identification of functional elements in sequences, gene prediction
* Sequence comparison
* Classification of proteins
* Comparative genomics
* RNA structure prediction
* Protein structure prediction
* Evolutionary history
Expression from a eukaryotic gene
[Figure: structure of a eukaryotic gene, with labels for regulatory elements, promoter, transcription start, translation start, exons, introns, translation stop, polyA signal and transcription stop]
[Figure: DNA -> (transcription) -> RNA (primary transcript) -> RNA (spliced) -> (translation) -> Protein]
b) Comparing non-identical sequences
Alignments
Pairwise alignment
  global alignment (WP: Gap)
  local alignment (Smith-Waterman) (WP: Bestfit)
  Fasta (WP: Fasta, Tfasta)
  Blast (WP / NCBI)
    blastn compares a DNA sequence to a DNA database
    blastp compares a protein sequence to a protein database
    tblastn compares a protein sequence to all possible translation products of a DNA database
Multiple sequence alignment
  Clustalw (WP: Pileup)
2. Analyzing for properties other than a simple linear sequence of letters.
Examples:
• statistical composition of letters
• profile analysis, Position Specific Iterated BLAST (PSI-BLAST) (www.ncbi.nlm.nih.gov/BLAST)
• HMMs
• Prediction of higher order structure
  Protein secondary structure (alpha, beta)
  RNA (folding by base pairing within the molecule)
3. Simple transformation / extraction
• Translation, DNA -> protein (W2H: Translate, Map)
• Reverse translation
• Splicing
Identity. Pattern matching
Pattern matching is used for finding short sequence patterns in a single sequence, in a group of sequences or in the databases.
Examples of patterns (regular expressions):
GAATTC                Recognition site for the restriction enzyme EcoRI
GDSGGP                Typical of serine proteases
[AG]-x(4)-G-K-[ST]    Motif A of the ATP/GTP-binding site
WP: The program Findpatterns uses patterns to search one or more sequences. The program Motifs specifically searches a protein sequence or set of sequences for the motifs present in the PROSITE database.
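The PROSITE-style pattern above can be turned into an ordinary regular expression. The following is a minimal sketch, not the method used by Findpatterns or Motifs: the helper names `prosite_to_regex` and `find_pattern` and the test peptide are hypothetical, and only hyphen-separated elements (`x`, `x(n)`, `x(n,m)`, `[..]`, `{..}`, single letters) are handled. Plain strings such as GAATTC or GDSGGP need no conversion and can be searched directly.

```python
import re

# One PROSITE element: x, a residue, [allowed] or {excluded}, optionally
# followed by a repeat count such as (4) or (2,5).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a hyphen-separated PROSITE-style pattern to a Python regex."""
    parts = []
    for element in pattern.strip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {ED} = anything except E or D
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # (4) -> {4}, (2,5) -> {2,5}
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = prosite_to_regex(pattern)
    return [(m.start(), m.group()) for m in re.finditer(regex, sequence)]


# Hypothetical test peptide containing a motif A (P-loop) like segment.
print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))
print(find_pattern("[AG]-x(4)-G-K-[ST]", "MTEYKLVVVGAGGVGKSALT"))
```

Start positions are 0-based, as is usual in Python; sequence analysis packages generally report 1-based positions.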
When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related.
GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK
Two kinds of biological relationships:
Orthologs: proteins that carry out the same function in different species
Paralogs: proteins that perform different but related functions within one organism
Proteins are homologous if they are related by divergence from a common ancestor.
[Figure: speciation. Gene X in the ancestral organism diverges into X1 in organism A and X2 in organism B; X1 and X2 are orthologs.]
[Figure: gene duplication. Gene X is duplicated and the copies diverge into Xa and Xb; Xa and Xb are paralogs.]
Mouse trypsin      -- orthologs --  Human trypsin
      |                                   |
   paralogs                            paralogs
      |                                   |
Mouse chymotrypsin -- orthologs --  Human chymotrypsin
Comparing 2 sequences - Dotplot analysis
  M A K L Q G A L G K R Y
M *
A   *         *
K     *             *
I
Q         *
G           *     *
A   *         *
L       *       *
A   *         *
K     *             *
R                     *
Y                       *
Sequence alignment
M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
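A dotplot of this kind is simple to generate. The sketch below is an illustration only (the function name `dotplot` is hypothetical): it marks every cell where the residue of the vertical sequence is identical to the residue of the horizontal sequence, with no windowing or scoring matrix.

```python
def dotplot(horizontal, vertical, mark="*"):
    """Return a text dotplot: one row per residue of `vertical`,
    one column per residue of `horizontal`, `mark` where they are identical."""
    lines = ["  " + " ".join(horizontal)]          # header row
    for v in vertical:
        cells = [mark if v == h else " " for h in horizontal]
        lines.append((v + " " + " ".join(cells)).rstrip())
    return "\n".join(lines)


print(dotplot("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

Related sequences show up as runs along a diagonal; marks away from the diagonal come from residues that occur more than once.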
Comparing 2 sequences - Gaps
  M A K L Q L G K R Y
M *
A   *
K     *         *
L       *   *
Q         *
G             *
A   *
L       *   *
G             *
K     *         *
R                 *
Y                   *
Sequence alignment (with gap)
M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y
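The gapped alignment of these two sequences can be found by dynamic programming. The following is a minimal sketch of global (Needleman-Wunsch style) alignment with a single linear gap penalty; it is not the code of the WP program Gap, and the scores `match=1`, `mismatch=-1` and `gap=-2` are arbitrary values chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("MAKLQLGKRY", "MAKLQGALGKRY")
print(best)
print(top)
print(bottom)
```

With these scores the ten matches and two gap positions give 10 x 1 + 2 x (-2) = 6, and the gap falls after the Q, as in the alignment on the slide.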
Gaps are results of mutations (changes in DNA) that occur during evolution
For instance consider this deletion mutation:
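DNA before the deletion (both strands; the middle block is the segment that is deleted):
AACTTGACG GACTGGGCGTAT CGGGCACCG
TTGAACTGC CTGACCCGCATA GCCCGTGGC
Protein: N L T D W A Y R A P
DNA after the deletion:
AACTTGACG CGGGCACCG
TTGAACTGC GCCCGTGGC
Protein: N L T R A P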
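The effect of this deletion on the protein can be checked by translating both DNA sequences. The sketch below assumes the standard genetic code and reads the top strand in frame 1; the names `CODON_TABLE` and `translate` are illustrative and not taken from any package.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


before = "AACTTGACG" + "GACTGGGCGTAT" + "CGGGCACCG"
after = "AACTTGACG" + "CGGGCACCG"          # middle 12 bases deleted
print(translate(before))  # NLTDWAYRAP
print(translate(after))   # NLTRAP
```

Because the deleted segment is 12 bases long (a multiple of 3), the reading frame is preserved: four residues (D W A Y) are lost and the rest of the protein is unchanged, which is what appears as a gap when the two proteins are aligned.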
Comparing / aligning two sequences. Gaps
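In pairwise comparison, gaps cannot be inserted in an unrestricted manner; without a cost, enough gaps would make almost any two sequences appear to match. For this reason a gap penalty is assigned to gaps. Two parameters are frequently used in sequence comparison:
- Gap creation penalty
- Gap extension penalty
There are two parameters because it is more ‘difficult’ to create a gap than to extend an existing gap: a single insertion or deletion event can add or remove several residues at once.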
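The two penalties can be illustrated by scoring a given alignment. This is a sketch under stated assumptions: a gap of n positions is taken to cost creation + n x extension (conventions differ between programs), and the values `match=5`, `mismatch=-4`, `creation=8` and `extension=2` are hypothetical, not the defaults of any particular package.

```python
def gap_cost(length, creation=8, extension=2):
    """Cost of one gap of `length` positions (assumed convention)."""
    return creation + extension * length if length > 0 else 0


def score_alignment(row_a, row_b, match=5, mismatch=-4, creation=8, extension=2):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    open_gap = None                      # row in which a gap is currently open
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("column with a gap in both rows")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            if gap_row != open_gap:      # a new gap starts: pay creation once
                score -= creation
            score -= extension           # every gap position pays extension
            open_gap = gap_row
        else:
            score += match if a == b else mismatch
            open_gap = None
    return score


print(gap_cost(2), 2 * gap_cost(1))                       # one gap of 2 vs two gaps of 1
print(score_alignment("MAKLQ--LGKRY", "MAKLQGALGKRY"))    # single two-residue gap
print(score_alignment("MAKLQ-L-GKRY", "MAKLQGALGKRY"))    # two separate gaps
```

With these values one gap of two positions costs 12, while two separate one-position gaps cost 20, so the scoring favors alignments in which missing residues are grouped together.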
Program   Query              Database
blastp    Protein            Protein
blastn    DNA                DNA
tblastn   Protein            DNA (translated)
blastx    DNA (translated)   Protein
tblastx   DNA (translated)   DNA (translated)
Databases at NCBI
Protein sequence databases
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
swissprot the last major release of SWISS-PROT
DNA sequence Databases
nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
dbsts Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions
htgs Unfinished High Throughput Genomic Sequences
Output from Blast
BLASTP 2.0.11 [Jan-20-2000]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= ramp4.seq (75 letters)
Database: nr 457,798 sequences; 140,871,481 total letters
>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
Length = 65
Score = 74.1 bits (179), Expect = 1e-13
Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)
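The statistics reported for a hit can be pulled out of this text output with regular expressions. The sketch below is illustrative only: `parse_hit_stats` is a hypothetical helper written for the old plain-text format shown here, not part of BLAST or of any parsing library, and it does not handle E-values printed without a leading digit.

```python
import re

SCORE = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s*bits\s*\((?P<raw>\d+)\),\s*"
    r"Expect\s*=\s*(?P<expect>[\d.eE+-]+)"
)
FRACTION = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)")


def parse_hit_stats(text):
    """Extract score, E-value and identity/positive/gap counts for one hit."""
    m = SCORE.search(text)
    if m is None:
        raise ValueError("no 'Score = ... Expect = ...' line found")
    stats = {
        "bits": float(m["bits"]),        # normalized bit score
        "raw": int(m["raw"]),            # raw alignment score
        "expect": float(m["expect"]),    # E-value
    }
    for name, count, length in FRACTION.findall(text):
        stats[name.lower()] = (int(count), int(length))
    return stats


HIT_TEXT = """Length = 65
Score = 74.1 bits (179), Expect = 1e-13
Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)"""

stats = parse_hit_stats(HIT_TEXT)
print(stats)
count, length = stats["positives"]
print(100 * count // length)   # 78, as printed in the report
```

For this hit the printed percentages agree with integer division of the counts (48/61 is about 78.7% and is shown as 78%).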
In a BLAST search low complexity regions in the query sequence are filtered out by default.
Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits. In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
Another reason why hits to low-complexity regions in proteins should be filtered out is that such regions often have a disordered 3D structure and are not associated with well-defined biological functions.
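One simple way to quantify "unusual composition" is the Shannon entropy of the letter frequencies. The sketch below is a crude stand-in for illustration and is not the filter that BLAST applies; the function names, the window size and the threshold are all hypothetical choices.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy in bits per letter of the composition of `seq`."""
    total = len(seq)
    counts = Counter(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=10, threshold=1.0, mask_char="N"):
    """Replace every window whose entropy is below `threshold` with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


print(round(shannon_entropy("PPCDPPPPPKDKKKKDDGPP"), 2))   # well below log2(20), about 4.32
print(round(shannon_entropy("AAATAAAAAAAATAAAAAAT"), 2))   # well below log2(4) = 2
print(mask_low_complexity("AAATAAAAAAAATAAAAAAT"))
```

A sequence using all letters equally has the maximum entropy (2 bits for DNA); both example sequences from the text score far lower, and the nucleotide example is masked completely with these settings.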