bio1 sequences handout - Göteborgs universitet · bio.lundberg.gu.se/courses/bio1/seq.pdf · 2011. 8. 8.

SEQUENCE ANALYSIS

Sequences - Problem of sequence alignment - interaction between molecular biology / computer science / statistics

* What are the biological problems?
* Algorithm (dynamic programming)
* ‘Simple’ implementation / source code / compiling
* Statistics and probability theory of alignments
* Common implementations in molecular biology software packages
* Results of biological significance

Sequence analysis - Where and why

Sequencing projects, assembly of sequence data
Identification of functional elements in sequences
Sequence comparison
Classification of proteins
Comparative genomics
RNA structure prediction
Protein structure prediction
Evolutionary history

Overview of methods

1. Comparison

a) Identity
Examples:
• finding restriction sites (GAATTC)
• pattern matches ((A,G)x4GK[S,T]) (SeqWeb package: FindPatterns)

b) Comparing non-identical sequences
Alignments
Pairwise comparison
global alignment (SW: Gap)
local alignment (Smith-Waterman) (SW: Bestfit)
Fasta (SW: Fasta, Tfasta)
Blast (SW: NetBlast)
blastn compares a DNA sequence to a DNA database
blastp compares a protein sequence to a protein database
tblastn compares a protein sequence to all possible translation products of a DNA database
Multiple sequence alignment
Clustalw (SW: Pileup)

Methods in phylogeny
• Distances (SW: Growtree)
• Parsimony: the tree that corresponds to the smallest number of changes is preferred
• Maximum likelihood

2. Analyzing for properties other than a simple linear sequence of letters
Examples:
• statistical composition of residues
• profile analysis, Position Specific Iterated BLAST (PSI-BLAST) (www.ncbi.nlm.nih.gov/BLAST)
• HMMs
• Prediction of higher order structure
  1. Protein secondary structure (alpha, beta)
  2. RNA (folding by base pairing within the molecule)

3. Simple transformation / extraction
• Translation, DNA -> protein (SW: Translate, Map)
• Reverse translation
• Splicing
Identity. Pattern matching

Pattern matching is used for finding short sequence patterns in a single sequence, in a group of sequences or in the databases.

Examples of patterns (regular expressions):

GAATTC Recognition site for the restriction enzyme EcoRI

GDSGGP Typical of serine proteases.

[AG]-x(4)-G-K-[ST] Motif A of the ATP/GTP-binding site

The program Findpatterns uses these types of patterns to search a set of sequences, such as those of a database. The program Motifs specifically searches a protein sequence or set of sequences for the motifs present in the PROSITE database.
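The patterns above can be tried out with ordinary regular expressions. The following is a minimal sketch, not the method used by Findpatterns or Motifs: the function names `prosite_to_regex` and `find_pattern` are hypothetical, the converter handles only the pattern elements shown above (literal residues, `[..]` alternatives, `x`, and `(n)` repeats), and the test sequences are made up for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple hyphen-separated PROSITE-style pattern to a Python regex.

    Supports only literal residues, [..] alternatives, x (any residue)
    and (n) repeat counts, as in [AG]-x(4)-G-K-[ST].
    """
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count = match.groups()
        if residue == "x":
            residue = "."  # x means "any residue"
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return "".join(parts)


def find_pattern(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, matched text) for each non-overlapping hit."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]


if __name__ == "__main__":
    # A restriction site is a plain literal string, so no conversion is needed.
    dna = "TTGAATTCAAGGAATTCC"  # made-up sequence
    print([m.start() + 1 for m in re.finditer("GAATTC", dna)])

    peptide = "MLGPPGSGKSTLA"  # made-up sequence
    print(find_pattern("[AG]-x(4)-G-K-[ST]", peptide))
```

Note that `finditer` reports non-overlapping matches only; a program that must report overlapping sites would need a different scanning strategy.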
Protein sequence comparison - basic concepts

When two protein sequences are being compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related. There are really only two kinds of biological relationships:

Orthologs: Proteins that carry out the same function in different species
Paralogs: Proteins that perform different but related functions within one organism
Proteins are homologous if they are related by divergence from a common ancestor.
[Figure: What are orthologs? Speciation separates an ancestral organism carrying gene X into Organism A and Organism B; the resulting genes X1 and X2 are orthologs.]

[Figure: What are paralogs? Gene duplication of gene X gives two copies, Xa and Xb, which are paralogs.]
Mouse trypsin       -- orthologs --  Human trypsin
      |                                   |
   paralogs                            paralogs
      |                                   |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin
Comparing 2 sequences - Dotplot analysis

  M A K L Q G A L G K R Y
M *
A   *         *
K     *             *
I
Q         *
G           *     *
A   *         *
L       *       *
A   *         *
K     *             *
R                     *
Y                       *

Sequence alignment

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
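A dotplot of this kind is simple to compute: a mark is placed wherever the residue on the vertical axis is identical to the residue on the horizontal axis. The sketch below is only an illustration of that idea, using the two sequences compared above; the function name `dotplot` is hypothetical, and real dotplot programs usually compare windows of residues rather than single positions.

```python
def dotplot(horizontal: str, vertical: str) -> list[str]:
    """Return text rows with '*' wherever the two residues are identical."""
    rows = ["  " + " ".join(horizontal)]
    for v in vertical:
        cells = ["*" if v == h else " " for h in horizontal]
        # Strip trailing blanks so rows end at their last mark.
        rows.append((v + " " + " ".join(cells)).rstrip())
    return rows


if __name__ == "__main__":
    for row in dotplot("MAKLQGALGKRY", "MAKIQGALAKRY"):
        print(row)
```

Related sequences show up as a run of marks along a diagonal; marks off the diagonal come from residues that happen to recur elsewhere in the sequence.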
Comparing 2 sequences - Gaps

  M A K L Q L G K R Y
M *
A   *
K     *         *
L       *   *
Q         *
G             *
A   *
L       *   *
G             *
K     *         *
R                 *
Y                   *

Sequence alignment

M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y

The two positions marked "-" are the gap.
Comparing 2 sequences: What are really gaps?

Gaps are results of mutations (changes in DNA) that occur during evolution.

For instance, consider this deletion mutation:

DNA      AACTTGACG GACTGGGCGTAT CGGGCACCG
         TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein   N  L  T   D  W  A  Y   R  A  P

After deletion of the middle segment:

DNA      AACTTGACG CGGGCACCG
         TTGAACTGC GCCCGTGGC
protein   N  L  T   R  A  P
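The effect of the deletion on the protein can be checked by translating the DNA before and after the mutation. This is a small sketch assuming the coding strand reads AACTTGACG GACTGGGCGTAT CGGGCACCG and the standard genetic code; the names `CODON_TABLE` and `translate` are introduced here for illustration only.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA sequence in reading frame 1."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete final codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


before = "AACTTGACG" + "GACTGGGCGTAT" + "CGGGCACCG"
after = "AACTTGACG" + "CGGGCACCG"  # the middle 12 bases are deleted

if __name__ == "__main__":
    print(translate(before))
    print(translate(after))
```

Because the deleted segment is a multiple of three bases long, the reading frame is preserved and the two proteins differ only by the four missing residues, which is what appears as a gap in the alignment.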
Pairwise comparison

In pairwise comparison gaps cannot be inserted in an unrestricted manner: with enough gaps, almost any two sequences can be made to look similar. For this reason a penalty is assigned to gaps. Two parameters are frequently used in sequence comparison programs (such as Gap, Bestfit and Fasta):

- Gap creation penalty
- Gap extension penalty

There are two parameters because it is more ‘difficult’ to create a gap than to extend an existing gap.
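The two parameters can be combined into a single penalty for a gap of a given length. The sketch below assumes the convention penalty = creation + extension × length; programs differ in whether the first gapped position is also charged the extension penalty, so treat this as one possible convention. The default values 12 and 2 are taken from the "gap-pen: -12/ -2" line of the Fasta output in this handout, and the function name `gap_penalty` is hypothetical.

```python
def gap_penalty(length: int, creation: int = 12, extension: int = 2) -> int:
    """Penalty for one gap of `length` positions (assumed convention).

    The creation penalty is charged once per gap, the extension penalty
    once per gapped position.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return creation + extension * length


if __name__ == "__main__":
    # One gap of four positions is cheaper than two separate gaps of two.
    print(gap_penalty(4), 2 * gap_penalty(2))
```

With these values a single long gap costs less than several short gaps covering the same number of positions, which reflects the idea that one mutation event can insert or delete several residues at once.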
Substitution matrices
In the scoring of an alignment we do not only take into account whether amino acids are identical or not. To better evaluate the biological significance of an alignment we make use of the fact that not all amino acid substitutions occur with the same frequency.

Example: the substitution Asp (D) -> Glu (E) is more likely than Asp (D) -> Cys (C).

For each pair of amino acids one can estimate the probability that the pair occurs in a correct alignment of related protein sequences. This kind of data is used to produce a substitution matrix. The first substitution matrix to be used was PAM250. Now the matrix BLOSUM62 is most often used.
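Matrices of this kind are commonly built as log-odds scores: the observed frequency of a pair in trusted alignments is divided by the frequency expected if the two residues were paired by chance, and the logarithm of that ratio is scaled and rounded. The sketch below illustrates only the arithmetic. All frequencies in it are hypothetical numbers chosen for the example, not values from PAM250 or BLOSUM62, and the function name and the scale factor of 2 are assumptions.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical background frequencies of single residues (illustration only).
FREQ = {"D": 0.05, "E": 0.06, "C": 0.02}

# Hypothetical observed frequencies of aligned pairs (illustration only).
OBSERVED = {("D", "E"): 0.006, ("D", "C"): 0.00025}

if __name__ == "__main__":
    for (a, b), observed in OBSERVED.items():
        print(a, b, log_odds_score(observed, FREQ[a], FREQ[b]))
```

A pair seen more often than chance predicts gets a positive score, and a pair seen less often gets a negative score, which is how a conservative replacement such as D -> E ends up scoring higher than D -> C.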
Local alignment (Smith-Waterman algorithm): Considers regions of similarity in parts of the sequences only. (Bestfit program in SeqWeb)

xxxxxxx
|||||||   region of similarity
xxxxxxx
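The local alignment idea can be sketched with a small dynamic programming routine. This is a simplified illustration of the Smith-Waterman approach, not the Bestfit implementation: it assumes a plain match/mismatch score and a single per-position gap penalty instead of a substitution matrix with creation and extension penalties, and the parameter values are arbitrary.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple[int, str, str]:
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores are never allowed to drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # Made-up sequences sharing only the segment MAKLQG.
    print(smith_waterman("TTTTMAKLQGPPPP", "CCCCMAKLQGWWWW"))
```

Resetting negative scores to zero is what makes the alignment local: the dissimilar flanking regions are simply left out instead of being forced into the alignment.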
A global alignment program produces an alignment even when there is no significant similarity between the two sequences. How can one decide, in such global comparisons, whether the similarity is significant? In the Gap program there is an option "Generate statistics from randomized alignment". When this option is selected the second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment.
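The shuffling procedure described for the Gap program can be imitated in a few lines. The sketch below is an assumption-laden illustration, not Gap itself: it scores global alignments with a simple match/mismatch scheme and a linear gap penalty, and the function names, the number of shuffles and the scoring values are all hypothetical choices.

```python
import random
import statistics


def global_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score (score only, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffled_scores(a: str, b: str, n: int = 100, seed: int = 0) -> list[int]:
    """Score `a` against `n` shuffled copies of `b` (same length, same composition)."""
    rng = random.Random(seed)
    residues = list(b)
    scores = []
    for _ in range(n):
        rng.shuffle(residues)
        scores.append(global_score(a, "".join(residues)))
    return scores


def randomization_summary(a: str, b: str, n: int = 100,
                          seed: int = 0) -> tuple[int, float, float]:
    """Return (actual score, mean shuffled score, standard deviation)."""
    scores = shuffled_scores(a, b, n, seed)
    return global_score(a, b), statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    actual, mean, sd = randomization_summary("MAKLQGALGKRY", "MAKIQGALAKRY")
    print(f"actual {actual}, randomized {mean:.1f} +/- {sd:.1f}")
```

If the actual score lies many standard deviations above the mean of the shuffled scores, the similarity is unlikely to be an effect of composition alone.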
Database searches
Searching databases with FASTA / BLAST
Improvement of speed as compared to local alignment algorithm:
Initial search is for short words.
Word hits are then extended in either direction.

Fasta and Blast are programs frequently used to search sequence databases for homology to a query sequence. Programs of this kind answer practical questions posed by molecular biologists, such as: Is my sequence similar to anything in the database? I seem to have identified a new protein; what is the relationship of this protein to proteins that have been described previously?

Blast and Fasta are local similarity search methods that concentrate on finding short identical matches, which may contribute to a total match.
FastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence. In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison. In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.

ktup or wordsize. Length of initial peptide match, default is 2, i.e. the program starts identifying a diagonal by extending a dipeptide match. ktup=1 is used for a more sensitive search.
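The first step, counting short perfect matches per diagonal, can be sketched as follows. This illustrates only the word-lookup idea under simple assumptions and is not the FastA source code: the function names are hypothetical, and a diagonal is identified here by the difference between the position in the subject and the position in the query.

```python
from collections import Counter, defaultdict


def word_positions(seq: str, ktup: int) -> dict[str, list[int]]:
    """Index every word of length `ktup` in `seq` by its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hits(query: str, subject: str, ktup: int = 2) -> Counter:
    """Count shared words per diagonal (subject position minus query position)."""
    index = word_positions(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("MAKLQGALGKRY", "MAKIQGALAKRY", ktup=2)
    print(hits.most_common(3))
```

Because the query is indexed once and each subject word is looked up directly, no full dynamic programming matrix is needed at this stage, which is where the speed advantage over a complete local alignment comes from. Lowering ktup to 1 finds more hits per diagonal at the cost of more work.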
Output from Fasta

Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library

173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                        opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as   (  75)  483  130 1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end   (  66)  426  116 3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4   (  65)  251   71 8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F   (  77)  145   45 0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2   ( 136)  105   35 0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S   ( 217)  105   35 0.33
gi|6319639|ref|NP_009721.1| involved in the secre   (  65)   92   31 1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1   ( 136)   93   32 1.8
E-value
For a given score, the number of hits in a database search that we expect to see by chance with this score or better. The E-value takes into account the size of the database that was searched. The lower the E-value, the more significant the score is.

P-value
Like an E-value, but a P-value is the probability of a hit occurring by chance with this score or better, as opposed to the expected number of hits. A P-value has a maximum of 1.0, while an E-value has a maximum of the number of sequences in the database that was searched. For small (significant) P-values, P and E are approximately equal, so the choice of one or the other in a software package is arbitrary. NCBI BLAST 2.0, FASTA, and HMMER report E-values. WU-BLAST 2.0 reports P-values.
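The statement that P and E are approximately equal when small can be made concrete. If the number of chance hits is assumed to follow a Poisson distribution with mean E (the usual assumption in database search statistics, stated here as an assumption rather than taken from this handout), then the probability of at least one chance hit is P = 1 - exp(-E). The function name `p_from_e` is hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # E-values taken from the Fasta output listed in this handout.
    for e in (1.9e-30, 0.00012, 0.22, 1.8):
        print(f"E = {e:<8g} P = {p_from_e(e):.4g}")
```

For the top hits the two values are practically identical, while for the weak hits near the bottom of the list P stays below 1.0 even though E exceeds it.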
Blast
blastp compares an amino acid query sequence against a protein sequence database
blastn compares a nucleotide query sequence against a nucleotide sequence database
blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
Program   Query     Database

blastp    Protein   Protein
blastn    DNA       DNA
tblastn   Protein   DNA
blastx    DNA       Protein
tblastx   DNA       DNA
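The six-frame conceptual translation used by blastx, tblastn and tblastx can be sketched as below: three reading frames on the given strand and three on its reverse complement. This is an illustration using the standard genetic code, not BLAST code; the function names and the frame labels "+1" to "-3" are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate in frame 1, ignoring an incomplete final codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translations(dna: str) -> dict[str, str]:
    """Translate both strands in all three reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("AACTTGACGCGGGCACCG").items():
        print(name, protein)
```

Translating all six frames lets a protein query find a coding region in DNA without knowing in advance which strand or frame it lies in.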
Databases used in BLAST at NCBI
Peptide Sequence Databases
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
swissprot The last major release of the SWISS-PROT protein sequence database (no updates)

pdb Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank

Nucleotide Sequence Databases

nr All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)

dbest Non-redundant database of GenBank+EMBL+DDBJ EST divisions

dbsts Non-redundant database of GenBank+EMBL+DDBJ STS divisions

htgs Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)

pdb Sequences derived from the 3-dimensional structure

gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr
 457,798 sequences; 140,871,481 total letters

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
 Length = 65
The homology search programs Fasta and Blast both rely on a basic procedure to compare two sequences with each other. Multiple sequence alignment programs, on the other hand, allow you to align and directly compare more than two related sequences. This procedure is a very useful tool if you want to analyze a family of proteins and, for instance, identify the structural elements that are characteristic of that family. Common programs for multiple sequence analysis are Clustalw and Pileup.

Clustalw and Pileup exploit the fact that similar sequences are likely to be evolutionarily related. Thus, the programs align sequences in pairs, following the branching order of a family tree. Similar sequences are aligned first, and more distantly related sequences are added later. Once pairwise alignment scores for each sequence relative to all others have been calculated, they are used to cluster the sequences into groups, which are then aligned against each other to generate the final multiple alignment.
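The clustering step that decides the order of alignment can be sketched as follows. This is a simplified illustration and not the exact procedure of Clustalw or Pileup: it assumes a basic global alignment score as the similarity measure and a greedy average-linkage rule for joining groups, and it only reports the joining order without building the multiple alignment itself. The function names and the example sequences are hypothetical.

```python
from itertools import combinations


def global_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score (score only, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def joining_order(seqs: dict[str, str]) -> list[tuple[tuple, tuple]]:
    """Return the order in which groups of sequences would be aligned."""
    # Pairwise scores for each sequence relative to all others.
    score = {
        frozenset(pair): global_score(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def similarity(c1: tuple, c2: tuple) -> float:
        # Average score over all pairs drawn from the two groups.
        total = sum(score[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    order = []
    while len(clusters) > 1:
        # Join the two most similar groups first.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: similarity(*p))
        order.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return order


if __name__ == "__main__":
    example = {  # made-up sequences
        "s1": "MAKLQGALGKRY",
        "s2": "MAKIQGALAKRY",
        "s3": "MAKLQLGKRY",
        "s4": "WWPPCCHHFFDD",
    }
    for step in joining_order(example):
        print(step)
```

The two most similar sequences are joined first and the unrelated sequence is added last, mirroring the "similar first, distant later" order described above.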