General nucleic acid Sequence databases
• EMBL: (European Molecular Biology Laboratory) http://www.ebi.ac.uk/Information/
• GenBank: NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/
• DDBJ: DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
Entry name; accession number; version number
General protein Sequence databases
• SWISS-PROT
• PIR
• PRF/SEQDB
• PDB: the largest data bank of three-dimensional (3-D) biological macromolecular structure data.
Translated from coding sequences (CDS):
• TrEMBL
• GenPept
• SWISS-PROT is a highly curated database that contains excellent documentation. SWISS-PROT systematically merges variants and fragments into a single entry, but is greatly lagging behind the growth of the DNA data banks.
• PIR contains more sequences, including numerous “really sequenced” oligopeptides, but is not that tightly curated.
• The “automatic” data banks such as TrEMBL and GenPept are even larger, but contain little documentation and sometimes conceptual translations that are not actually found in nature.
BLAST: Basic Local Alignment Search Tool
• The BLAST algorithm breaks the query sequence into short fragments, or “words,” and looks for an identical or close match between those words and words from the database sequences. When such a match or “hit” is encountered, the hit is extended in both directions to generate a local alignment segment. The quality of each alignment is quantified in a score, and the high-scoring segment pairs (HSPs) are reported in a table.
• The BLAST programs are BLASTN, which compares a nucleotide query sequence with a nucleotide sequence database; BLASTP, which compares a protein query sequence with a protein sequence database; BLASTX, which compares a nucleotide query sequence translated in all six reading frames with a protein sequence database; TBLASTN, which compares a protein query sequence with a nucleotide sequence database dynamically translated in all six reading frames; and TBLASTX, which compares a six-frame translation of a nucleotide query sequence with the six-frame translations of a nucleotide sequence database.
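The word-and-extend idea described above can be sketched in a few lines. This is a toy illustration only: exact word matches, a simple match/mismatch score, and the `drop_off` stopping rule are assumptions made for the example, and the function name `find_hsps` is hypothetical. The real BLAST programs also use scoring matrices, close (not only identical) word matches, and statistical significance estimates.

```python
def find_hsps(query, subject, word_size=3, match=1, mismatch=-1, drop_off=2):
    """Toy BLAST-like search: seed with exact words, extend without gaps."""
    # Index every word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    def extend(i, j, step):
        # Walk in one direction, remembering the best cumulative score
        # and how many positions were needed to reach it.
        score = best = 0
        length = best_len = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            score += match if query[i] == subject[j] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop_off:
                break
            i += step
            j += step
        return best, best_len

    hsps = set()
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            right, r_len = extend(i + word_size, j + word_size, +1)
            left, l_len = extend(i - 1, j - 1, -1)
            score = word_size * match + left + right
            hsps.add((score, i - l_len, j - l_len, l_len + word_size + r_len))
    # Each entry is (score, query start, subject start, length), best first.
    return sorted(hsps, reverse=True)


for hsp in find_hsps("ACGTACGT", "TTACGTACGTTT")[:3]:
    print(hsp)
```

Seeds that lie on the same diagonal extend to the same segment, so collecting the results in a set reports each segment pair once.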
Sequence alignment
Chapter 5
Measuring Genetic Change
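D = s + wg
where s is the number of substitutions, g is the number of gaps, and w is the gap penalty.
Two alternative alignments: P1 (s = 0, g = 2) and P2 (s = 2, g = 1).
w = 1: P1 = 0 + 1×2 = 2; P2 = 2 + 1×1 = 3
w = 2: P1 = 0 + 2×2 = 4; P2 = 2 + 2×1 = 4
w = 3: P1 = 0 + 3×2 = 6; P2 = 2 + 3×1 = 5
Small w: gaps have little impact, so alignments contain more gaps.
Large w: gaps have a large impact, so alignments contain fewer gaps.
Many gaps or highly variable sequences: a small w can be chosen.
Few gaps or conserved sequences: a large w can be chosen.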
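The effect of the gap penalty on which alignment is preferred can be checked directly. The sketch below recomputes the worked values for the two alignments P1 and P2; the function names are illustrative and not from the slides.

```python
def alignment_cost(substitutions, gaps, gap_penalty):
    """Return D = s + w*g for one alignment."""
    return substitutions + gap_penalty * gaps


# The two candidate alignments of the worked example:
# P1 has 0 substitutions and 2 gaps, P2 has 2 substitutions and 1 gap.
candidates = {"P1": (0, 2), "P2": (2, 1)}


def preferred(gap_penalty):
    """Names of the alignment(s) with the lowest cost for this gap penalty."""
    costs = {name: alignment_cost(s, g, gap_penalty)
             for name, (s, g) in candidates.items()}
    best = min(costs.values())
    return sorted(name for name, cost in costs.items() if cost == best)


for w in (1, 2, 3):
    print(w, preferred(w))
```

With w = 1 the gap-rich alignment P1 wins, with w = 2 the two tie, and with w = 3 the alignment with fewer gaps, P2, wins.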
[Table: Blosum62mt substitution matrix; rows and columns A B C D E F G H I K L M N P Q R S T V W X Y Z]
[Table: PAM500 substitution matrix; rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z X *]
• The cost for every pair of possible amino acid replacements defines a cost matrix that can be used to score the alignment. Protein sequence alignment programmes typically use matrices derived from empirical comparisons of protein sequences.
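As a minimal sketch of how a cost matrix scores an alignment, the example below sums a per-column cost over two aligned protein sequences. The three-residue matrix `TOY_COST` and the gap cost of 4 are made-up values for illustration only; a real analysis would use an empirically derived matrix such as PAM or BLOSUM.

```python
# Hypothetical toy cost matrix for three residues (illustration only).
TOY_COST = {
    ("A", "A"): 0, ("A", "S"): 1, ("A", "W"): 3,
    ("S", "S"): 0, ("S", "W"): 3,
    ("W", "W"): 0,
}


def pair_cost(a, b, matrix=TOY_COST):
    """Look up the symmetric cost of replacing residue a with residue b."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def alignment_score(seq1, seq2, gap_cost=4, matrix=TOY_COST):
    """Sum the per-column costs of two aligned sequences ('-' marks a gap)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq1, seq2):
        total += gap_cost if "-" in (a, b) else pair_cost(a, b, matrix)
    return total


print(alignment_score("ASW", "SSW"))
print(alignment_score("AW-", "WWS"))
```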
If indels were weighted 4, transversions 2, and transitions 1, the morphological character data were weighted 4. Leading and trailing gaps were weighted one-half of internal gaps. These parameters, insertion:deletion cost (indel) and transversion:transition ratio (Tv:Ti), were varied. In all cases where morphological data were included, character transformations for morphology were weighted as equal to the indel cost.
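The weighting scheme can be illustrated with a short sketch that classifies each aligned nucleotide pair and adds up the weights 4, 2, and 1 quoted above. Charging every gapped column separately is an assumption of this example; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def change_type(a, b):
    """Classify one aligned nucleotide pair."""
    if a == b:
        return "match"
    if "-" in (a, b):
        return "indel"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def weighted_cost(seq1, seq2, indel=4, transversion=2, transition=1):
    """Cost of an alignment; leading/trailing gaps cost half an internal gap."""
    weights = {"match": 0, "indel": indel,
               "transversion": transversion, "transition": transition}
    total = 0.0
    for k, (a, b) in enumerate(zip(seq1, seq2)):
        kind = change_type(a, b)
        cost = weights[kind]
        if kind == "indel":
            gapped = seq1 if a == "-" else seq2
            # A gap is terminal when only gaps lie before it or after it.
            if not gapped[:k].strip("-") or not gapped[k + 1:].strip("-"):
                cost = indel / 2
        total += cost
    return total


print(weighted_cost("ACGT", "ATGA"))
print(weighted_cost("-ACGT", "TAC-T"))
```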
Small values of α result in an L-shaped distribution with extreme variation of rates; most sites are invariable but a few have very high rates of substitution.
Parameter α: the range of rate variation among sites
Rates are nearly equal among sites
Conserved region
Rates differ greatly among sites
Variable region
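To see how the shape parameter controls rate variation among sites, the sketch below draws relative site rates from a gamma distribution with mean 1 and reports how many sites are nearly invariable. The threshold of 0.1, the sample size, and the function names are arbitrary choices for this illustration.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # Shape alpha and scale 1/alpha give a mean of alpha * (1/alpha) = 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_nearly_invariable(rates, threshold=0.1):
    """Share of sites whose relative rate falls below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)


for alpha in (0.2, 1.0, 5.0):
    rates = sample_site_rates(alpha, 10_000)
    print(alpha, round(fraction_nearly_invariable(rates), 2), round(max(rates), 1))
```

A small shape value should put a large share of sites close to rate zero while a few sites receive very high rates; a large shape value keeps the rates close to the mean.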
• This is primarily because the majority of substitutions happen at the same sites; that is, the variable positions.
• Obviously the more distantly related the sequences, the more pronounced this phenomenon becomes.
• Jin and Nei (1990) followed a similar approach, but assumed that substitution rates were Γ-distributed. Using this evolutionary model, which involves a parameter α that describes the extent of the rate variation, they derived several equations to compute the evolutionary distance from the observed sequence dissimilarities.
• Relative nucleotide substitution rates in the SRC method are estimated by observing the frequencies with which sequence pairs differ at homologous positions.
• For an alignment of n sequences, TREECON computes n(n-1)/2 pairwise evolutionary distances d according to the Jukes and Cantor equation.
• When all pairwise distances have been computed, they are classified in several distance intervals (e.g., four).
• For each distance interval, the fraction of sequence pairs possessing a different nucleotide is plotted, and a curve is fitted through the points. The curve describes the probability p_i that an alignment position i contains a different nucleotide in two sequences, as a function of the evolutionary distance d separating these sequences.
• This is accomplished for all alignment positions.
• The slope of the curve through the origin yields the specific nucleotide substitution rate v_i for the position under consideration.
• After estimation of all v_i values, alignment positions are grouped into sets of similar variability and form a spectrum of relative nucleotide substitution rates.
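The first steps of the procedure above can be sketched as follows: compute all n(n-1)/2 Jukes and Cantor distances for an alignment, and evaluate a curve of differences against distance. The curve used here, p = 3/4 (1 - exp(-4/3 · v · d)), is an assumption chosen because it is consistent with the Jukes and Cantor model and has slope v at the origin; it is not taken from the slides. The toy alignment and function names are also made up.

```python
import math
from itertools import combinations


def p_distance(seq1, seq2):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def jukes_cantor(p):
    """Jukes and Cantor distance for an observed dissimilarity p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def pairwise_distances(alignment):
    """All n(n-1)/2 Jukes-Cantor distances, keyed by sequence index pair."""
    return {(i, j): jukes_cantor(p_distance(alignment[i], alignment[j]))
            for i, j in combinations(range(len(alignment)), 2)}


def expected_difference(d, v):
    """Assumed curve p = 3/4 * (1 - exp(-4/3 * v * d)); slope at d = 0 is v."""
    return 0.75 * (1.0 - math.exp(-4.0 * v * d / 3.0))


toy_alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGTTCGTAT", "ACGAACGTAC"]
for pair, d in pairwise_distances(toy_alignment).items():
    print(pair, round(d, 4))
```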
Inferring Molecular Phylogeny
Distance methods first convert aligned sequences into a pairwise distance matrix, then input that matrix into a tree-building method, whereas discrete methods consider each nucleotide site directly.
The parsimony tree gives us the additional information of which sites contribute to the length of each branch. Once we convert sequences into distances, we lose this information.
Clustering methods
• Tree-building methods in the second class use optimality criteria to choose among the set of all possible trees (Fig. 6.3). This criterion is used to assign to each tree a ‘score’ or rank, which is a function of the relationship between tree and data (examples include maximum parsimony and maximum likelihood).
• What is the value of the optimality criterion for that tree?
• Which tree requires the fewest evolutionary events?
• While for small numbers of sequences (e.g. no more than 20) it is often possible to find the optimal tree (or trees), in many cases this is not feasible, in which case we have to rely on heuristic methods.
• A typical heuristic strategy is to start with a tree and rearrange it, keeping any rearrangement that produces a better tree. Such algorithms are often called ‘hill-climbing’.
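The reason exhaustive search quickly becomes infeasible is the number of candidate trees. For n labelled tips the number of unrooted bifurcating trees is (2n-5)!!, the product of the odd numbers from 3 to 2n-5; the sketch below evaluates it. The function name is illustrative.

```python
def unrooted_tree_count(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled tips."""
    if n < 3:
        raise ValueError("need at least three sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, unrooted_tree_count(n))
```

Evaluating every tree is manageable for a handful of sequences, but the count grows so fast that heuristic rearrangement strategies are needed for larger data sets.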
Unweighted pair group method with arithmetic means (UPGMA)
• In an ultrametric tree all the tips are equidistant from the root of the tree, which is equivalent to assuming a molecular clock.
[Figure: UPGMA worked example; labels 0.1715/2, 0.2192/2, 0.2795/2]
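A compact sketch of UPGMA clustering is given below. At each step the two closest clusters are merged, the new node is placed at half their distance (so that all tips under it are equidistant from it), and distances to the new cluster are size-weighted averages. The input matrix in the usage lines is a made-up toy example, not the data behind the slide figure.

```python
def upgma(names, matrix):
    """UPGMA clustering; matrix[i][j] is the distance between names[i], names[j].

    Returns a list of (cluster_a, cluster_b, height) merges in the order made.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    d = {(i, j): matrix[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)          # closest pair of current clusters
        height = d[(a, b)] / 2            # node sits at half the distance
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Size-weighted average distance from the new cluster to each other one.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: value for pair, value in d.items()
             if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((label_a, label_b), size_a + size_b)
        merges.append((label_a, label_b, height))
        next_id += 1
    return merges


toy = [[0, 2, 4],
       [2, 0, 4],
       [4, 4, 0]]
for merge in upgma(["A", "B", "C"], toy):
    print(merge)
```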
• Distances are rarely exactly tree metrics, and hence one class of ‘goodness of fit’ methods seeks the metric tree that best accounts for the ‘observed’ distances.
• The goodness of fit F is computed from the observed distances d_ij and the tree distances p_ij over each pair of sequences i and j.
• In the example just given we were fitting an additive tree with (2n-3) branches to $\binom{n}{2} = n(n-1)/2$ pairwise distances.
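One way to make the comparison between observed and tree distances concrete is an unweighted sum of powered differences over all pairs, with power 2 giving a least-squares fit. This particular form is an assumption for illustration; other weightings and exponents are possible. The sketch also counts branches and pairwise distances for n sequences.

```python
def goodness_of_fit(observed, tree, power=2):
    """Sum over pairs of |d_ij - p_ij| ** power (unweighted; an assumed form)."""
    return sum(abs(observed[pair] - tree[pair]) ** power for pair in observed)


def n_pairwise(n):
    """Number of pairwise distances among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def n_branches(n):
    """Number of branches in an unrooted bifurcating tree of n sequences."""
    return 2 * n - 3


observed = {(0, 1): 0.3, (0, 2): 0.5, (1, 2): 0.6}
fitted = {(0, 1): 0.3, (0, 2): 0.4, (1, 2): 0.6}
print(goodness_of_fit(observed, fitted))
print(n_pairwise(4), n_branches(4))
```

For n greater than 3 there are more pairwise distances than branches, so in general the branch lengths cannot reproduce every observed distance exactly.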
Distance methods
Minimum evolution
• Given an unrooted metric tree for n sequences there are (2n-3) branches, each with length e_i. The sum of these branch lengths is the length L of the tree:
$L = \sum_{i=1}^{2n-3} e_i$
The minimum evolution tree (ME) is the tree which minimizes L.
• More commonly, the branch lengths of the minimum evolution tree are estimated using least-squares methods. The branch lengths are estimated in the same way as for goodness of fit measures; however, rather than comparing the fit to the observed distances, the least-squares branch lengths are added together to give the length of the tree.
Neighbour joining clustering
• Neighbour joining (NJ) is a widely used method for tree building which combines computational speed with uniqueness of result - most implementations give a single tree.
• One strategy for finding the ME tree is to first compute the NJ tree, then see if any local rearrangement of the NJ tree produces a shorter tree.
• Summarizing a set of sequences by a pairwise distance matrix loses information;
• Branch lengths estimated by some distance methods may not be evolutionarily interpretable.
Discrete methods operate directly on the sequences, rather than on pairwise distances.
• The two major discrete methods are maximum parsimony (MP) and maximum likelihood (ML).
• Maximum parsimony chooses the tree (or trees) that requires the fewest evolutionary changes.
• Maximum likelihood chooses the tree (or trees) that, of all trees, is the one most likely to have produced the observed data.
1 ATATT
2 ATCGT
3 GCAGT
4 GCCGT
The total number of evolutionary changes on a tree is simply the sum of the number of changes at each site.
Phylogenetically uninformative sites: sites that are invariant, or sites where only one sequence has a different nucleotide, are examples of such sites.
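For the four example sequences, the number of changes each tree requires can be counted site by site and summed. The sketch below uses Fitch's set-based counting on a rooted representation of each of the three possible trees (the position of the root does not affect the count) and also flags which sites are informative. Fitch's algorithm and the function names are choices made for this illustration.

```python
SEQS = {1: "ATATT", 2: "ATCGT", 3: "GCAGT", 4: "GCCGT"}


def fitch_site(tree, column):
    """Return (state set, change count) for one site on a rooted binary tree.

    tree is either a tip label or a pair (left, right) of subtrees.
    """
    if not isinstance(tree, tuple):
        return {column[tree]}, 0
    left_states, left_changes = fitch_site(tree[0], column)
    right_states, right_changes = fitch_site(tree[1], column)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: take the union and count one extra change.
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, seqs=SEQS):
    """Total changes: the sum of the per-site minimum change counts."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {k: s[i] for k, s in seqs.items()})[1]
               for i in range(n_sites))


def is_informative(column):
    """Informative if at least two states each occur in at least two sequences."""
    counts = {}
    for state in column:
        counts[state] = counts.get(state, 0) + 1
    return sum(c >= 2 for c in counts.values()) >= 2


for tree in (((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))):
    print(tree, tree_length(tree))
print([is_informative(col) for col in zip(*SEQS.values())])
```

The tree grouping sequences 1 with 2 and 3 with 4 needs the fewest changes, and only the first three sites distinguish between the trees.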
This is equivalent to saying that transversions are rarer than transitions, and therefore may be more reliable indicators of phylogeny.
Maximum likelihood requires three elements: a model of sequence evolution, a tree, and the observed data.
• For a given tree topology, what set of branch lengths makes the observed data most likely?
• Which tree, of all the possible trees, has the greatest likelihood?
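The log likelihood of obtaining the observed sequences is the sum of the log likelihoods of each individual site:
$\ln L = \sum_{i=1}^{N} \ln L_i$
where $L_i$ is the likelihood of site $i$ and $N$ is the number of sites.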
The 16 possible combinations of ancestral sites for a tree for four sequences.
• Obtaining the maximum likelihood estimate of branch lengths for a given tree is computationally time consuming, and in practice this has limited the application of the method to fairly small data sets.
• This model may include parameters for the transition/transversion ratio (TS/TV), base composition, and variation in rate among sites.
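The simplest case of finding the branch length that makes the data most likely is a "tree" of just two sequences. The sketch below assumes the Jukes and Cantor model (equal base frequencies, a single substitution rate) and uses a plain grid search; both are choices made for this illustration, and practical programs use richer models and numerical optimizers.

```python
import math


def jc_log_likelihood(d, n_same, n_diff):
    """Log likelihood of two aligned sequences separated by distance d (JC)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e      # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e      # probability of one specific different base
    # Each site contributes log(base frequency 1/4 * change probability);
    # the per-site log likelihoods are summed over all sites.
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_distance(n_same, n_diff, step=0.001, upper=3.0):
    """Grid search for the distance that maximizes the log likelihood."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_same, n_diff))


# 100 sites, 10 of which differ between the two sequences.
print(ml_distance(n_same=90, n_diff=10))
```

Under this model the maximum should fall at the Jukes and Cantor distance for the observed proportion of differing sites, which gives a quick check on the search.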
Objections to likelihood
• Which model to use, and what values of the parameters, such as transition/transversion ratio, should be employed.
• This is computationally time consuming, and more than one maximum likelihood value may exist for a given tree.
Putting confidence limits on phylogenies: bootstrap analysis
• Because we are sampling with replacement some sites may occur more than once in the pseudoreplicate, while others may not be represented at all.
• From this pseudoreplicate we would then build a tree using.
• We then repeat this two-step process a large number of times (anywhere from 100- to 1000-fold), resulting in a set of bootstrap trees.
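The resampling step can be sketched as follows. Columns of the alignment are drawn with replacement to form each pseudoreplicate, a tree is built from it, and the process is repeated. The tree-building step is left as a caller-supplied function, `build_tree`, which is hypothetical here; tallying whole trees rather than individual groupings is also a simplification made for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement to build a pseudoreplicate."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment], columns


def bootstrap_support(alignment, build_tree, n_replicates=100, seed=0):
    """Fraction of pseudoreplicates in which each distinct tree is recovered.

    build_tree is a caller-supplied (hypothetical) function that maps an
    alignment to a hashable description of the inferred tree.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate, _ = bootstrap_alignment(alignment, rng)
        tree = build_tree(replicate)
        counts[tree] = counts.get(tree, 0) + 1
    return {tree: count / n_replicates for tree, count in counts.items()}


alignment = ["ATATT", "ATCGT", "GCAGT", "GCCGT"]
replicate, columns = bootstrap_alignment(alignment, random.Random(1))
print(columns)
print(replicate)
```

Because columns are drawn with replacement, some column indices appear more than once in `columns` and others not at all, which is exactly the behaviour described for the pseudoreplicate.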