-
Syst. Biol. 50(3):351366, 2001
Empirical and Hierarchical Bayesian Estimation of Ancestral
States
JOHN P. HUELSENBECK AND JONATHAN P. BOLLBACKDepartment of
Biology, University of Rochester, Rochester, New York 14627,
USA;
E-mail: [email protected]
Abstract.Several methods have been proposed to infer the states
at the ancestral nodes on a phy-logeny. These methods assume a
specic tree and set of branch lengths when estimating the
ancestralcharacter state. Inferences of the ancestral states, then,
are conditioned on the tree and branch lengthsbeing true. We
develop a hierarchical Bayes method for inferring the ancestral
states on a tree. Themethod integrates over uncertainty in the
tree, branch lengths, and substitutionmodel parameters
byusingMarkov chainMonteCarlo.We compare thehierarchicalBayes
inferences of ancestral stateswithinferences of ancestral states
made under the assumption that a specic tree is correct. We nd that
themethods are correlated, but that accommodating uncertainty in
parameters of the phylogenetic modelcanmake inferences of ancestral
states evenmore uncertain than they would be in an empirical
Bayesanalysis. [Ancestral state reconstruction; Bayesian
estimation; empirical Bayes; hierarchical Bayes.]
The reconstruction of ancestral states on aphylogeny remains an
important endeavorin evolutionary biology. The comparativemethod,
for example, looks for evidence ofcorrelated change in two or more
characters;in the course of such analyses, the ances-tral states
are often reconstructed on a tree(Harvey and Pagel, 1991). Other
studies ex-amine the properties of ancient molecules(Malcolm et
al., 1990; Stackhouse et al., 1990;Adey et al., 1994; Jermann et
al., 1995). Theamino acid sequence of the ancient protein
isestimatedand then synthesized in the labora-tory. Theproperties
of the ancient protein canthen be measured in vivo or in vitro with
thegoal of demonstrating a change in the func-tion of the protein.
Similar studies have beenperformed for morphological or
behavioraltraits. For example, Ryan and Rand (1995) re-constructed
themating calls of the hypothet-ical ancestors of a group of frogs
and thenexamined the response of the extant femalesto the ancient
calls. The inferences made insuch studies depend on the reliability
of theancestral state reconstructions.Several methods have been
proposed to
reconstruct the states present in the hypo-thetical ancestors on
a phylogenetic tree. Theparsimony method, probably the most
fre-quentlyusedmethod to infer ancestral states,nds the combination
of ancestral states atan interior node of a phylogeny that
mini-mizes the number of changes over the wholetree. The result of
a parsimony analysis of an-cestral states is either to choose one
state asthe best reconstruction or, less frequently, topresent
multiple reconstructions when sev-eral reconstructionsgive the
sameparsimony
tree length (in which case the reconstructionis said to be
ambiguous).More recently, ancestral states have been
reconstructed forDNA,amino acid, andmor-phological (two-state)
data by the use ofstochastic models (Schluter, 1995; Yang et
al.,1995; Schluter et al., 1997; Pagel, 1999). Twodifferent methods
have been used to inferthe ancestral states by using stochastic
mod-els. The maximum likelihood method ndsthe character state at an
internal node onthe tree that maximizes the probability ofobserving
the data. For Bayesian inference,on the other hand, the goal is to
calculatewhat is called the posterior probability thatan ancestral
node on a tree has a particularstate, given the observations at the
tips of thetree. The probability that a character takes aparticular
state at some interior node on aphylogenetic tree depends on the
topologyof the phylogenetic tree, the lengths of thebranches on the
phylogeny, and the param-eters of the substitution model (such as
thetransition/transversion rate bias). The typ-ical approach is to
use the maximum like-lihood estimates of these parameters
whencalculating the posterior probability of a site.Such an
analysis is referred to as an empiricalBayes analysis. Schultz and
Churchill (1999)have reviewed Bayesian approaches to
re-constructing ancestral states for morpholog-ical characters.In
this study, we examine how sensitive
the empirical Bayesian estimates of ances-tral states are to
uncertainty in the tree,branch lengths, and substitution
parameters.We propose a method that integrates overuncertainty in
these parameters; such an
351
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
352 SYSTEMATIC BIOLOGY VOL. 50
analysis is referred to as an hierarchical Bayesanalysis. The
advantage of a hierarchicalBayes analysis is that inferences about
thestate of an ancestral character are not condi-tioned on any
single tree or set of parametervalues. We approximate the posterior
proba-bility of an ancestral state assignment usingMarkov
chainMonteCarlo (MCMC).Wendthat it is important to consider
uncertainty inmodel parameters and phylogeny when re-constructing
ancestral states.
METHODSWe develop a hierarchical Bayes estimate
of ancestral states on phylogenetic trees.In this section, we
review several differ-ent methods for estimating ancestral
statesand introduce the notation that will be usedthroughout the
paper. The general goal ofthis study is to approximate the
posteriorprobability of a nucleotide assignment toa specic internal
node of a phylogenetictreewhile accommodating uncertainty in
thetree, branch lengths, and the substitution pa-rameters. We do
this using MCMC and thencompare the hierarchical Bayes estimates
ofancestral states with the empirical Bayesestimates.
Data
We assume that aligned DNA sequencesare available. However, the
method devel-oped in this paper applies equally well tostochastic
models for the stems of rRNAgenes (Schoniger and von Haeseler,
1994),codon models (Goldman and Yang, 1994;Muse and Gaut, 1994),
models of aminoacid change (Adachi and Hasegawa, 1992),or simple
two-state models, such as theMarkovBernoulli process, that might
beapplied to morphological features (seeSchluter, 1995; Schluter et
al., 1997; Pagel,1999). Although the method developed herecan be
applied to different types of data, ourmethod differs from earlier
work, in thatSchluter (1995) and Pagel (1999) consideredcharacters
that would not offer informationon the tree and branch lengths.
However,the methods devoped here could usefullybe applied in the
same situations consideredby Schluter (1995) and Pagel (1999)
byaccommodating uncertainty in the trees. Thealigned DNA sequences
are contained in thematrix X D fxi j g where i D 0, 1, : : : , s
andj D 1, 2, : : : , c (s is the number of species
and c is the length of the sequences). The j thsite in the
sequence is contained in the vectorx j D fx1 j , x2 j , : : : , xs
j g. Each element in thematrix, xi j , can take one of four states
(A, C,G, or T). The observation that the i th speciesand j th site
is nucleotide A is denotedxi j D A.We examine ve aligned DNA
sequence
data sets. These data include (1) IRBPsequences from s D 13
mammals (van denBussche et al., 1998); (2) three tRNA(tRNAHI S,
tRNASER, tRNALEU), partial(30 region) NADH-dehydrogenase subunit4
(ND4), and partial (50 region) ND5 se-quences from s D 12 primates
(Hayasakaet al., 1988); (3) ATPase8 sequences from s D10
vertebrates (Cummings et al., 1995); (4) cy-tochrome oxidase I
(COI) sequences from s D10 vertebrates (Cummings et al., 1995);
and(5) ND3 sequences from s D 10 vertebrates(Cummings et al.,
1995).
Phylogenetic Trees
We assume that the s sampled speciesare related through a
phylogenetic tree, i ,where i D 1, : : : , B(s). The number of
possi-ble trees is B(s) D (2s5)!2s3(s3)! for unrooted treesand B(s)
D (2s3)!2s2(s2)! for rooted trees. Eachtree, i , denes a set of b
branches (b D 2s 3or b D 2s 2 for unrooted and rooted
trees,respectively). The lengths of the b brancheson the tree are
expressed in terms of ex-pected number of substitutions per site,
.The branch lengths for the ith tree are con-tained in the vector i
( i D f1, 2, : : : , bg).The tips of the tree are labeled n1, : : :
, ns ,and the internal nodes of the tree are la-beled nsC1, : : : ,
n2s2 for unrooted trees andnsC1, : : : , n2s1 for rooted trees. The
internalnodes are labeled consecutively according toa postorder
traversal of the tree (i.e., fromthe tips of the tree to the root).
The treesare rooted either along a branch of the tree(for rooted
trees) or at taxon ns (for unrootedtrees). The ancestor of node nk
is denoted (nk ). For unrooted trees, (n2s2) D ns . Thatis, the
ancestor of the last internal node onthe tree is the tip taxon ns
.We are interested in estimating the (unob-
served) nucleotide states at one or more ofthe internal nodes of
the tree. In particular,we are interested in the ancestral states
forbats (Tonatia silvicola, Tonatia bidens, andPteropus; van den
Bussche et al., 1998);apes (Homo sapiens, Pan, Gorilla, Pongo,
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
2001 HUELSENBECK AND BOLLBACKBAYESIAN ANCESTRAL STATE
RECONSTRUCTION 353
and Hylobates; Hayasaka et al., 1988); andAmniota (a clade
containing chicken, rat,mouse, human, bovine, seal, and
whale;Cummings et al., 1995).We assume that theseclades are
monophyletic. Figure 1 shows themaximum likelihood estimates of
phylogeny
FIGURE 1. Themaximum likelihood trees under the constraints of
monophyly of (a) bats, (b) apes, and (c, d, ande) amniotes. The
thickened branch indicates the constraint of monophyly. The
ancestral states are reconstructed forthe node with the dot.
under the constraint that bats, apes, or am-niotes are
monophyletic. The phylogenieswere estimated under the HKY85 model
ofDNA substitution (Hasegawa et al., 1984,1985), a model that
allows for differingbase frequencies and for a bias in the rate
of
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
354 SYSTEMATIC BIOLOGY VOL. 50
TABLE 1. The maximum likelihood estimates under the HKY85C0
model of DNA substitution for the ve datasets examined in this
study. Themaximum likelihood estimates were under the constraints
of monophyly, indicatedin Figure 1. The genes are (A) IRBP (van den
Bussche et al., 1998), (B) mtDNA (Hayasaka et al., 1988), (C)
ATPase8(Cummings et al., 1995), (D) COI (Cummings et al., 1995),
and (E) ND3 (Cummings et al., 1995). See text for moreinformation
regarding the parameters.
Gene max[loge f (X j , v, )] A C G TA 7569.55 5.22 0.45 0.22
0.30 0.30 0.18B 5711.94 12.76 0.36 0.36 0.32 0.08 0.23C 1590.49
5.73 0.61 0.40 0.31 0.07 0.22D 9489.63 15.62 0.16 0.35 0.30 0.10
0.26E 2665.87 7.33 0.30 0.34 0.33 0.08 0.25
transitions and transversions. Among-siterate variationwas
accommodated by assum-ing that the rate at a site is a random
vari-able drawn from a mean-one gamma distri-bution with shape
parameter ( > 0; Yang,1993, 1994). The constrained clade is
indi-catedby adot inFigure 1. Themaximum like-lihoodestimates of
theparameters are shownin Table 1. The maximum likelihood tree
forthe ND3 gene did not have amniotes mono-phyletic. The log
likelihood of the best treewas 2665:77, whereas the log likelihood
ofthe best tree under the constraint of mono-phyly was 2665:87.The
ancestral states are estimated for the
nodes indicated by the large dot in Figure 1.Weestimated the
ancestral states for only oneof the nodes on the tree. However, if
addi-tional constraints areplacedon the tree topol-ogy, the
ancestral states for other nodes canalso be estimated. The reason
we constrain aparticular node is because our method con-siders all
trees that are consistent with theconstraint; inferences of
ancestral charactersare aweightedaverageoverall possible trees.If
we did not maintain the constraint, thenode of interest would not
be present in alarge number of the possible trees. The effectof
constraining the tree is to reduce the num-ber of possible trees.
If for a tree of s speciesthere is a single constraint with s1
species onone side of the constraint and s2 species onthe other
side, the total number of possibleunrooted trees is:
Bc(s1, s2) D B(s1 C 1) B(s2 C 1)
Hence, a total of 103,378,275 unrootedtrees are consistent with
the bat constraint,1,091,475 trees are consistent with theape
constraint, and 31,185 trees are con-sistent with the amniote
constraint. In the
hierarchical Bayes analysis, inferences of an-cestral state
reconstructions will be a sumover all possible trees consistent
with theconstraint, weighted by the probability thatthe tree is
correct. Because the number of
FIGURE 2. Examples of parsimony reconstruction ontwo trees. The
nodes (internal and external) of the treeare labelled n1 , : : : ,
n8. The observations are the nu-cleotide states assigned to the
tips, (in this case, AACCCor ACACC for Tree a or Tree b,
respectively. The parsi-mony reconstruction of ancestral states is
indicated bythe character sets at the internal nodes. The
characterreconstruction is ambiguous for Tree b .
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
2001 HUELSENBECK AND BOLLBACKBAYESIAN ANCESTRAL STATE
RECONSTRUCTION 355
possible trees is so large, we will use a nu-merical technique
to approximate the sum.
Parsimony Reconstruction
The parsimony reconstruction of an an-cestral character state is
the nucleotide as-signed to an interior node of a tree
thatminimizes the number of changes. Take, forexample, the tree
shown in Figure 2a. Thetree is drawn as an unrooted tree of ve
se-quences, with one of the sequences drawn atthe root. The
observations for the j th site arex j D fA, A, C, C, Cg. What are
the states atthe interior nodes of the tree?For notation purposes,
we contain the
hidden (or unobserved) states in a matrixY D fyi j g, where i D
s C 1, s C 2, : : : , 2s 2 and j D 1, 2, : : : , c. The
reconstructionthat minimizes the number of changesfor the
observations at the jth site x j DfA, A, C, C, Cg is y j D fA, C,
Cg. That is,there is an A at node 6 and C at nodes 7and 8. This
reconstruction implies that therewas a single change along the
branch be-tween nodes 6 and 7.Swofford et al. (1996) and Maddison
and
Maddison (1992) describe algorithms for re-constructing
ancestral states in a parsimonyanalysis. For our purposes, we
simply notethe dependence of the reconstruction of thehidden states
(Y) on the topology of the tree( ) and on the states observed at
the tips ofthe tree (X). Also, it is possible for the par-simony
method to ambiguously reconstructthe ancestral states. For example,
Figure 2bshows a different set of observations at thetips of the
phylogenetic tree. For this tree, thereconstruction of states at
two of the internalnodes of the tree is ambiguous.
Maximum Likelihood
The other two methods for estimatingancestral statesthe methods
of maximumlikelihood and Bayesian inferenceassumethat the
characters evolve according to astochastic process. For example,
theMarkovBernoulli process is a simple two-statemodel of evolution
that has been appliedto morphological characters. The
MarkovBernoullimodel has twoparameters, the biasparameter p and a
rate parameter . In an in-stant of time, dt , the probability of a
changefrom state 0 to state 1 is (1 p)dt and theprobability of a
change from state 1 to state0 is pdt . In this study, we are
interested in
modeling the evolution of DNA sequencesand assume that the DNA
sequences evolveaccording to a time-homogeneous Poissonprocess with
four states. In particular, we as-sume that substitutions follow
the HKY85model ofDNAsubstitution (Hasegawa et al.,1984, 1985). The
instantaneous ratematrix,Q,for the HKY85 model is
Q D fqi j g
D C G TA G TA C TA C G
(1)
where is the transition/transversion ratebias, and D (A, C, G,
T) are the equi-librium base frequencies. When > 1, tran-sitions
occur more frequently than transver-sions. The rows of the
instantaneous ratematrix sum to 0. Moreover, the constraintthat qi
ii D 1 is added, ensuring that thebranch lengths of the
phylogenetic tree aregiven in terms of expected number of
substi-tutions per site, . The probability that nu-cleotide i
changes into j over a branch oflength is contained in the matrix P
D fpi j g.P can be obtained from the rate matrix Qthrough the
operation P D eQ : We accom-modate rate variation across sites by
assum-ing that the rate at a site is a random variabledrawn from
amean-one gamma distributionwith shape parameter (Yang, 1993,
1994).Substitution models that assume gamma-distributed rate
variation are denoted C0.We use a total of four rate categories
toapproximate the continuous gamma distri-bution. Parameters of the
model of DNAsubstitution are contained in the vectorD (, , ).The
probability ofobserving the data at the
tips of an unrooted tree for a particular site(xi ) and an
assignment of nucleotides to theinternal nodes of the tree (yi )
is
f (xi jyi , j , j , ) D xis pyi (2s2)xi s (2s2)
s1
kD1pyi (k )xi k (k )
2s3
kDsC1pyi (k)yi k (k )
(2)
The probability of observing the data at thetips is conditional
on the states assigned to
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
356 SYSTEMATIC BIOLOGY VOL. 50
the interior nodes of the tree (yi ), the topol-ogyof the tree (
j ), the lengthsof the brancheson the tree ( j ), and the
parameters of thesubstitutionmodel ( ). Maximum
likelihoodestimation of phylogeny is not typicallyconditioned on a
particular assignment ofnucleotides to the interior nodes of the
tree(Felsenstein, 1981). Instead, the probabilityof observing the
data at the tips of the treeis a sum over all 4s2 possible
assignmentsof nucleotides to the internal nodes of thetree.
f (xi j j , j , ) Dyi
f (xi jyi , j , j , ) (3)
Assuming independence of substitutions atdifferent sites, the
probability of observingthe aligned DNA sequence data is the
prod-uct of the c site probabilities
f (Xj j , j , ) Dc
iD1f (xi j j , j , ) (4)
This function is maximized to obtain maxi-mum likelihood
estimates of the parameters , , and .The ancestral states can be
estimated
by maximum likelihood (Schluter, 1995;Schluter et al., 1997;
Pagel, 1999). Severaldifferent maximum likelihood methods canbe
used to estimate ancestral states. Sup-pose one is interested in
the ancestral stateat only one of the internal nodes of thetree.
One method for estimating the ances-tral state at that node is to
nd the com-bination of states for all nodes on the treethat
maximizes the likelihood; not only isthe likelihood maximized with
respect to thenode of interest but also for other nodes onthe tree
that are not of direct interest. Onepotential problemwith
thismethod for infer-ring the ancestral condition at a node,
how-ever, is that the number of parameters be-ing estimated is
large (in fact, larger than thenumber of observations at a site if
branchlengths are estimated for the each site inde-pendently).
Another method for estimatingthe ancestral condition is to nd the
max-imum likelihood nucleotide assignment forthe node of interest
while summing overall possible nucleotide assignments to thenodes
that are not of direct interest. Thismethod is preferable because
it focuses the
power of the method only on the node ofinterest.
Empirical Bayes
Bayesian inference is based on the poste-rior probability of a
parameter. The posteriorprobability that the character for the jth
siteat the i th internal node of tree k takes stateyi j ,
conditional on the data at the tips, tree,branch lengths, and
substitution parame-ters, is
f (yi j jx j , k , k , )
D f (x j jyi j , k , k , )yi jyi j2fA,C,G,Tg f (x j jyi j , k ,
k , )yi j
(5)
The probabilities are a sum over all possi-ble assignments of
nucleotides that can beassigned to the nodes in the tree that
arenot of interest (i.e., all of the internal nodesexcept ni ).
Note that the probability of theancestral state at node ni is
conditioned onthe observed data at the tips of the tree,the
topology of the tree, the lengths of thebranches, and the values of
the parameters ofthe substitution process.What values shouldthese
unknown parameters take? An empir-ical Bayes analysis uses
estimates for theseparameters. Yang et al. (1995) substitutedthe
maximum likelihood estimates for theseunknown parameters. Hence,
the posteriorprobability for the ancestral state at node i is
f (yi j jx j , , , )
D f (x j jyi j , , , )yi j
yi j2fA,C,G,Tg f (x j jyi j , , , )yi j(6)
Hierarchical Bayes
One of the disadvantages of an empiricalBayes analysis is that
inferences are condi-tioned on assigning specic values to un-known
parameters (such as the maximumlikelihood estimates for the
parameters). Analternativemethod, called hierarchical
Bayesanalysis, species a prior probability distri-bution for
theunknownparameters.Thepos-terior probability is then integrated
over un-certainty in the parameters. The posteriorprobability that
the state for the j th site at
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
2001 HUELSENBECK AND BOLLBACKBAYESIAN ANCESTRAL STATE
RECONSTRUCTION 357
the i th internal node of the tree is yi j is
f (yi j jX) D f (Xjyi j )yi jyi j2fA,C,G,Tg f (Xjyi j )yi j
(7)
where
f (Xjyi j ) DBc (s)
kD1 kf (Xjyi j , k , k , )
f (k) f ( k ) f ( )d kd (8)
The prior probabilities for the parameters aref (k ), f ( k ),
and f ( ), and the summation isover all possible trees that are
consistentwiththe constraint. In this study, we assume thatall
trees are a priori equally probable, witha uniform(0, 10) prior for
branch lengths, auniform(0, 100) prior for , a uniform(0, 10)prior
for , and a Dirichlet distributed priorfor (Appendix).
Markov Chain Monte Carlo
The summations and integrations requiredin equation 8 are
impossible to perform ana-lytically for even small phylogenetic
prob-lems. We use MCMC to approximate theposterior probability of
nucleotide assign-ments to interior nodes on the tree.
Speci-cally,we use theMetropolisHastingsGreenalgorithm (Metropolis
et al., 1953; Hastings,1970;Green, 1995). TheMCMCmethod takesvalid,
albeit dependent, samples from theprobability distribution of
interest by con-structing a Markov chain that has as itsstate space
the parameter(s) of interest. Here,we are interested in integrating
over un-certainty in the phylogenetic tree, branchlengths, and
substitution parameters; for theproblem of estimating ancestral
states, then,the chain runs over topology ( ), branchlengths ( ),
the transition/transversion bias(), the gamma shape parameter for
among-site rate variation (), and base frequenc-ies ( ).The Markov
chain was constructed as fol-
lows: (1) The current state of the chain isdesignated U ( U D f
, , , , g); the cur-rent state of the chain consists of a
specictree with branch lengths and specic val-ues for the
substitution parameters. If this isthe rst generation of the chain,
then an ini-tial state for the chain is chosen. (2) A new
state for the chain is proposed and desig-nated U 0. The
probability of proposing thenew state given the old state is f ( U
0j U ). Theprobability of making the reverse move isf ( U jU 0).
Our proposal mechanism modiesonly one or a few of the elements in U
ata time. The specic proposal mechanismsand their acceptance
probabilities are dis-cussed in the Appendix. (3) The
probabilitythat the proposed state is accepted is calcu-lated. The
probability of accepting the newstate is
R D min 1, f (Xj U0) f ( U 0)= f (X)
f (Xj U ) f ( U )=f (X) f ( U j U 0)f ( U 0jU )
D min 1, f (Xj U0) f ( U 0)
f (Xj U ) f ( U ) f ( U j U 0)f ( U 0j U )
D min 1, f (Xj U0)
f (Xj U )Likelihood ratio
f ( U0)
f ( U )
Prior ratio
f ( U j U0)
f ( U 0j U )Proposal ratio
(9)
The probability of accepting the new(proposed) state is the
product of the like-lihood, prior, and proposal ratios. The
pro-posal ratio is often referred to as the Hast-ings ratio. (4) A
uniformly distributed(pseudo)random number on the interval[0, 1] is
generated. If this number is lessthan R, then the proposed state is
acceptedand U D U 0. Otherwise, the chain remainsat U .Steps 1 to 4
are repeated a large number of
times (in this study, 106 times). The sequenceof states visited
constitutes a Markov chain.In this study,we save the states of
theMarkovchain every 100 generations (taking a totalof 104
samples). These sampled points rep-resent valid draws from the
posterior prob-ability of interest. Although the 104 sampledstates
are not independent draws from theposterior distribution, the
Markov chain lawof large numbers guarantees that
posteriorprobabilities can be validly estimated fromlong-run
samples from the chain (Tierney,1994). For each sampled state of
the chain,we calculate the posterior probabilities of thenucleotide
state assignments at the con-strained node for all c sites.
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
358 SYSTEMATIC BIOLOGY VOL. 50
RESULTSInferences of the posterior probabilities of
the parameters should be based on samplesdrawn from the chain
when at stationar-ity. Figure 3 shows the log likelihood of
thecurrent state of the chain through time forthe MCMC analysis.
For each data set, thechain started at a low likelihood (a poor
com-bination of parameters) and quickly reacheda plateau. The rst
1,000 sampled points(or the rst 105 generations of the
chain)werediscarded as the burn in of the chain. Allposterior
probabilities in this paper are basedon the 9,000 points that were
sampled fromthe chain when at apparent stationarity. Theposterior
probability of a phylogeny is theproportion of the time that it was
sampled(out of 9,000 samples).The phylogenetic trees for the ve
data
sets were not known with certainty. Figure4 shows the 50%
majority rule consensustrees for the ve data sets. These trees
rep-resent the Bayesian estimates of phylogenyunder the HKY85C0
model of DNA sub-stitution (Li, 1996; Mau, 1996; Rannala andYang,
1996; Mau and Newton, 1997; Yangand Rannala, 1997; Larget and
Simon, 1999;Mau et al., 1999; Newton et al., 1999). Thenumbers at
the interior nodes of the treesrepresent the posterior probability
that theclade is correct; they do not represent non-parametric
bootstrap values. The posteriorprobability for the constrained
clade is notshown because it must have been presentin 100% of the
samples. Note that the chainconsidered many different trees. For
four ofthe data sets, no single tree could have beensafely treated
as known without error whenestimating the ancestral states at the
con-strained node. For the COI data set, on theother hand, the
posterior probabilities of allclades are high (>0.97); wemay
thus assumethe topology (Fig. 4, Tree d) is known with-out error,
even though the other parametersof the model are uncertain.Just as
the analysis considers uncertainty
in the topology of the tree relating thespecies, uncertainty in
the parameters of thesubstitution model is also accommodated.Figure
5 shows the posterior probabilitiesfor the transition/transversion
rate bias ()and the gamma shape parameter for among-site rate
variation (). The posterior proba-bility for both parameters is
distributed overa range of values, most of the weight being
FIGURE 3. The log likelihood through time for theve data sets
analyzed in this study. (a) van den Busscheet al. (1998); (b)
Hayasaka et al. (1988); (c) atpase8(Cummings et al., 1995); (d)COI
(Cummings et al., 1995);(e) ND3 (Cummings et al., 1995).
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
2001 HUELSENBECK AND BOLLBACKBAYESIAN ANCESTRAL STATE
RECONSTRUCTION 359
FIGURE 4. The 50% majority rule consensus trees of the trees
visited during the MCMC analysis. The numbersat the interior
branches represent the posterior probability that the clade is
correct. Data sets (a)(e) as in Figure 3.
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
360 SYSTEMATIC BIOLOGY VOL. 50
FIGURE 5. The posterior probability distributions forthe
transition/transversion rate () and for the gammashape parameter
(). (a)(e) refer to same data sets as inFigure 3.
placed near the maximum likelihood esti-mates. A 95% credible
interval for each pa-rameter can be constructed by taking the2.5%
and 97.5% quantiles of the distribution.Table 2 shows themean and
credible intervalfor the parameters of the substitutionmodel.We
calculated the empirical and hierar-
chical Bayes estimates of ancestral states forall site patterns
at the constrained nodes of
Figure 1. The empirical Bayes estimates usedthemaximum
likelihoodestimates for thepa-rameters , , and . In thehierarchical
Bayesanalysis, the uncertainty in these parame-ters was integrated
over by using MCMC.Figure 6 shows the relationship between
theempirical and hierarchical Bayes estimatesof the ancestral
states. The graphs show theposterior probabilities across all
sites, rankedfrom sites with lowest to greatest probabilityunder
the empirical Bayes approach. As ex-pected, the posterior
probabilities of ances-tral state assignments for the empirical
andhierarchical Bayes analyses show a close re-lationship. The
relationship between the hi-erarchical and empirical Bayes
estimates isclosest for probabilities near 0 or 1. Theseare site
patterns for which there is little un-certainty about the state at
the constrainednode; both methods place most of the prob-ability on
a single reconstruction. However,for some site patterns, the
nucleotide assign-ment at the ancestral node is less certain
andthere is more disagreement between the em-pirical and
hierarchical Bayes analyses. Forthese sites, the topology and
branch lengthsof the tree and the uncertainty in the sub-stitution
parameters make the ancestral con-dition at the constrained node
less certain.Importantly, a site can have either a loweror a higher
posterior probability under thehierarchical Bayes approach because
of theuncertainty in the trees and branch lengths.Figure 7 more
explicitly demonstrates the
uncertainty in the ancestral state assign-ments introduced by
the uncertainty in thephylogeny and substitutionmodel. The
errorbars represent the 95% credible region for theprobability of a
particular nucleotide assign-ment to the constrained node on the
tree. Forthe nuclear gene (IRBP), the uncertainty inthe ancestral
states is relatively small. How-ever, for the other genes, the
uncertainty inthe ancestral state assignment can be quitelarge. For
one of the sites in theATPase8gene,for example, the probability
that anAwas as-signed to the constrained node varied by asmuch as
0.83 (a credible interval from 0.09to 0.93). The nucleotide state
assignments forother sites for ATPase8 gene were nearly
asuncertain.
DISCUSSIONPhylogenetic uncertainty is usually ig-
nored when reconstructing ancestral states.
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
2001 HUELSENBECK AND BOLLBACKBAYESIAN ANCESTRAL STATE
RECONSTRUCTION 361
TABLE 2. TheBayesian estimates under theHKY85C0model of DNA
substitution for the ve data sets examinedin this study. The point
estimate is the mean of the posterior distribution, and the
interval represent the 95% credibleregion for the parameters. Genes
as in Table 1.
Gene A C G T
A 5.32 0.45 0.22 0.30 0.30 0.19(4.63, 6.08) (0.39, 0.52) (0.20,
0.23) (0.28, 0.32) (0.28, 0.32) (0.17, 0.20)
B 11.92 0.38 0.36 0.31 0.08 0.24(9.39, 15.22) (0.31, 0.45)
(0.34, 0.39) (0.29, 0.34) (0.07, 0.09) (0.22, 0.26)
C 6.35 0.59 0.40 0.31 0.07 0.22(4.16, 9.51) (0.41, 0.83) (0.36,
0.44) (0.28, 0.35) (0.05, 0.09) (0.19, 0.25)
D 12.43 0.17 0.34 0.30 0.11 0.25(9.82, 16.36) (0.15, 0.18)
(0.32, 0.36) (0.29, 0.32) (0.10, 0.12) (0.24, 0.27)
E 11.89 0.23 0.34 0.33 0.08 0.25(5.67, 18.57) (0.17, 0.35)
(0.31, 0.37) (0.30, 0.36) (0.07, 0.10) (0.23, 0.28)
The phylogeny of some groups is so wellestablished that it is
perhaps safe to treatthe phylogeny as known. However, evenfor cases
where the phylogeny is wellsupported, the uncertainty in other
pa-rameters of the phylogenetic model, suchas the branch lengths on
the tree and thesubstitution parameters, can be large. Un-certainty
in the phylogenetic model (tree,
FIGURE6. Relationshipbetween theposterior probabilities of
states A, C,G,orTcalculatedbyusing the empiricaland the
hierarchical Bayes methods. (a)(e) as in Figure 3.
branch lengths, and substitution model)can all contribute to
make ancestral statereconstruction ambiguous.Figures 8 and 9
demonstrate how uncer-
tainty in the phylogeny and the lengths ofthe branches on the
phylogeny can lead todifferent interpretations of the ancestral
stateat a node. The observations at the tips of thetrees are A, A,
C, C, and C for species 1, 2,
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
362 SYSTEMATIC BIOLOGY VOL. 50
FIGURE7. Relationshipbetween theposteriorprobabilities of
statesA,C,G, orTcalculatedbyusing the empiricaland the hierarchical
Bayes methods. The 95% credible regions for the nucleotide
probabilities are indicated by thevertical error bars. (a)(e) as in
Figures 36.
FIGURE 8. The posterior probabilities of A, C, G, or T at the
node indicated by the dot. The three trees representall of the
trees of ve species that contain the taxon bipartition fn1 , n2 ,
n3g, fn4, n5g. Posterior probabilities werecalculated under the
JukesCantor (1969) model of DNA substitution, assuming that all of
the branches were 0.1expected substitutions per site. Numbers in
parentheses below each tree indicate (from left to right) the
probabilityof having A, C, G, or T, respectively, at the
constrained node.
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
2001 HUELSENBECK AND BOLLBACKBAYESIAN ANCESTRAL STATE
RECONSTRUCTION 363
FIGURE 9. The posterior probabilities of A, C, G, or T at the
node indicated by the dot. Posterior probabilitieswere calculated
under the JukesCantor (1969) model of DNA substitution and assuming
that the lengths of all ofthe branches were 0.1 expected
substitutions per site, except the two short branches of Trees b
and c, which were0.01 expected substitutions per site. Numbers in
parentheses as in Figure 8.
3, 4, and 5, respectively. The lengths of thebranches for the
three trees of Figure 8 areall 0.1 expected substitutions per site.
Thethree trees shown in Figure 8 represent allof the trees that
contain the taxon bipartitionfn1, n2, n3g, fn4, n5g. What is the
probabilityof having an A, C, G, or T at the node indi-cated by the
dot? If the JukesCantor (1969)model of DNA substitution is assumed,
thenthe probability of having an A, C, G, or Tat the constrained
node is indicated by thenumbers in parentheses in Figure 8. For
Treea of Figure 8, the posterior probability of hav-ingaCat the
interior node isgreatest (0.9644),makingC themost probable state at
thenode.However, for Trees b and c of Figure 8, thereconstruction
of the ancestral state is moreambiguous; the probability of having
anA orC are about the same. These empirical Bayesreconstructions
make intuitive sense and arein accordwith the parsimony
reconstruction.The parsimonymethod reconstructs the stateat the
constrained node as C for Tree a ofFigure 8 and either A or C for
Trees b and c.If the tree is certain, then the reconstruction ofthe
ancestral state is not problematic. For ex-ample, if Tree a is
correct, then the best recon-struction has nucleotide C at the
constrainednode. Similarly, if Tree b unambiguouslyrepresents the
relationships of the vespecies, then the best reconstruction has
ei-ther an A or C at the constrained node (witha slight preference
for the reconstruction thathas A at the constrained node).
However,rarely is thephylogenyknownwith certainty.For example, what
if the posterior probabili-
ties of Trees a , b, and c were 0.5, 0.3, and 0.2,respectively?
If one were to simply assumethat the tree with the greatest
posterior prob-ability is correct (the MAP estimate of phy-logeny),
then the best reconstruction wouldhave C at the constrained node
(with proba-bility 0.96). Ideally, however, the uncertaintyin the
topology should be accommodated,as we do by using MCMC in this
study. Ifthe uncertainty of the trees is accounted for,the
probabilities are 0.2702, 0.7269, 0.0015,and 0.0015 for nucleotides
A, C, G, and T,respectively. This calculation assumes thatthe
lengths of all branches on the trees areequal (0.1 substitutions
per site). The hier-archical Bayes estimate differs
substantiallyfrom the empirical Bayes estimate (e.g.,
theprobability of C is 0.9644 and 0.7269 for theempirical and
hierarchical Bayes analyses,respectively).Uncertainty in the
lengths of the branches
on the phylogeny is another source of am-biguity in ancestral
state reconstructions.Figure 9 shows how posterior probabili-ties
of ancestral state reconstructions can beaffected by uncertainty in
branch lengths.Here, the topology of the phylogeny is thesame for
all three examples. The branchesfor Tree a are all equal in length
(0.1 ex-pected substitutions per site). The lengthsof the branches
for Trees b and c are also0.1 expected substitutions per site
exceptfor the branch forming the taxon bipartitionfn1, n2g, fn3,
n4, n5g on Tree b, which is 0.01expected substitutions per site
long, and thebranch leading to tip n3 on Tree c , which is
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
364 SYSTEMATIC BIOLOGY VOL. 50
also 0.01 expected substitutions per site long.The posterior
probabilities of having an A,C, G, or T are shown for each tree.
For alltrees, the best reconstruction has C at theconstrained node.
However, the poste-rior probability changes depending on thelengths
of the branches. For Tree b the proba-bility of having C is much
less than for Treesa and c . The empirical Bayes analysis picksone
set of branch lengths to use when recon-structing ancestral states,
whereas the hierar-chical Bayes analysis attempts to accommo-date
any uncertainty in the branch lengthswhen reconstructing ancestral
states.Although inferences for our hierarchical
Bayes method are not conditioned on anyparticular tree or set of
model parametersbeing correct, the ancestral state recon-structions
are conditioned on the model ofDNA substitution being correct and
the con-strained node being real. Therefore, usingas realistic a
model of DNA substitution aspossible is important when
reconstructingancestral states. For DNA sequences, modelsthat
accommodate more rate parametersor allow limited dependence among
sitesmight provide better estimates of ancestralstates. At least
one of the clades on the treemust be assumed to be known.
Becausemany of the possible phylogenetic trees willnot contain the
ancestral node of interest,we must assume that the constraint, at
least,is correct. For the problems considered inthis paper, the
constraint is not problematic;virtually every study has supported
themonophyly of bats (see van den Bussche etal., 1998, for a review
of the bat monophylydebate), apes, and amniotes.Many studies in
evolutionary biology
assume that the phylogeny and branchlengths of a species group
are known with-out error (e.g., Harvey and Pagel, 1991).However,
phylogenetic estimates are poten-tially subject to large errors. In
fact, methodsfor evaluating uncertainty in phylogenies,such as the
nonparametric bootstrap andBayesian inference, suggest that many
treeshave a large amount of uncertainty. Ideally,this uncertainty
should be accommodatedin evolutionary studies. MCMC has beenapplied
with success to accommodate un-certainty in trees. For example,
Kuhner et al.(1994) and Beerli and Felsenstein (1999) haveused MCMC
to estimate parameters of thecoalescence process while integrating
overuncertainty in the gene tree. The hierarchical
Bayesmethoddeveloped in this paper allowsestimation of ancestral
states, conditionedonly on the observations of the states at thetip
of the tree. The method might usefully beextended to other problems
in evolutionarybiology that depend on a phylogeny butfor which the
phylogeny is not of primeinterest.
ACKNOWLEDGMENTSThisworkwas supported byNSF grantsDEB-0075406
and MCB-0075404 awarded to J.P.H.
REFERENCESADACHI, J., AND M. HASEGAWA. 1992. Amino acid
sub-stitution of proteins coded for in mitochondrial DNAduring
mammalian evolution. Jpn. J. Genet. 67:187197.
ADEY, N. B., T. O. TOLLEFSBOL, A. B. SPARKS, M.H. EDGELL, AND C.
A. HUTCHISON. 1994. Molecu-lar resurrection of an extinct ancestral
promoter formouse L1. Proc. Nat. Acad. Sci. USA 91:15691573.
BEERLI, P., AND J. FELSENSTEIN. 1999. Maximum likeli-hood
estimation of migration rates and effective pop-ulation umbers in
two populations using a coalescentapproach. Genetics
152:763773.
CUMMINGS , M. P., S. P. OTTO, AND J. WAKELEY. 1995.Sampling
properties of DNA sequence data in phylo-genetic analyses. Mol.
Biol. Evol. 12:814822.
FELSENSTEIN, J. 1981. Evolutionary trees from DNAsequences: A
maximum likelihood approach. J. Mol.Evol. 17:368376.
GOLDMAN, N., AND Z. YANG. 1994. A codon-basedmodel of nucleotide
substitution for protein-codingDNA sequences. Mol. Biol. Evol.
11:725736.
GREEN, P. J. 1995. Reversible jump Markov chain MonteCarlo
computation and Bayesian model determina-tion. Biometrika
82:711732.
HARVEY, P. H.,ANDM.D. PAGEL. 1991. The comparativemethod in
evolutionary biology. Oxford Univ. Press,Oxford.
HASEGAWA, M., H. KISHINO , AND T. YANO. 1985. Dat-ing the
humanape split by a molecular clock of mi-tochondrial DNA. J. Mol.
Evol. 22:160174.
HASEGAWA, M., T. YANO, AND H. KISHINO . 1984. A newmolecular
clock of mitochondrial DNA and the evo-lution of Hominoids. Proc.
Jpn. Acad. Ser. B 60:9598.
HASTINGS , W. K. 1970. Monte Carlo sampling meth-ods using
Markov chains and their applications.Biometrika 57:97109.
HAYASAKA,K.,T.GOJOBORI,AND S.HORAI. 1988.Molec-ular phylogeny
and evolution of primate mitochon-drial DNA. Mol. Biol. Evol.
5:626644.
JERMANN, T. M., J. G. OPITZ, J. STACKHOUSE, ANDS. A. BENNER .
1995. Reconstructing the evolutionaryhistory of the artiodactyl
ribonuclease superfamily.Nature 374:5759.
JUKES, T., AND C. CANTOR. 1969. Evolution of pro-tein molecules.
Pages 21132 in Mammalian pro-tein metabolism (H. Munro, ed.).
Academic Press,New York.
KUHNER, M., J. YAMATO, AND J. FELSENSTEIN. 1994. Es-timating
effective population size and mutation ratefrom sequence data using
MetropolisHastings sam-pling. Genetics 149:429434.
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
2001 HUELSENBECK AND BOLLBACKBAYESIAN ANCESTRAL STATE
RECONSTRUCTION 365
LARGET, B., AND D. SIMON. 1999. Markov chain MonteCarlo
algorithms for the Bayesian analysis of phylo-genetic trees. Mol.
Biol. Evol. 16:750759.
LI, S. 1996. Phylogenetic tree construction
usingMarkovchainMonte carlo.
Ph.D.dissertation,OhioStateUniv.Columbus.
MADDISON, W. P., AND D. R. MADDISON. 1992.MacClade, version
3.00. Academic Press, New York.
MALCOLM, B. A., K. P. WILSON, B. W. MATTHEWS,J. F. KIRSCH, AND
A. C. WILSON. 1990. Ancestrallysozymes reconstructed, neutrality
tested, and ther-mostability linked to hydrocarbon packing.
Nature345:8689.
MAU, B. 1996. Bayesian phylogenetic inference viaMarkov chain
Monte carlo methods. Ph.D. disserta-tion, Univ. of Wisconsin,
Madison.
MAU, B., AND M.NEWTON. 1997. Phylogenetic inferencefor binary
data on dendrograms using Markov chainMonte Carlo. J. Comput.
Graph. Stat. 6:122131.
MAU, B., M. NEWTON, AND B. LARGET. 1999. Bayesianphylogenetic
inference via Markov chainMonte carlomethods. Biometrics
55:112.
METROPOLIS, N., A. W. ROSENBLUTH, M. N.ROSENBLUTH, A. H. TELLER
, AND E. TELLER. 1953.Equations of state calculations by fast
computingmachines. J. Chem. Phys. 21:10871091.
MUSE, S. V., AND B. S. GAUT. 1994. A likelihood ap-proach for
comparing synonymous and nonsynony-mous nucleotide substitution
rates with applicationto the chloroplast genome. Mol. Biol. Evol.
11:715724.
NEWTON, M., B. MAU, AND B. LARGET. 1999. Markovchain Monte Carlo
for the Bayesian analysis of evo-lutionary trees from aligned
molecular sequences. InStatistics in molecular biology (F.
Seillier-Moseiwitch,T. P. Speed, and M. Waterman, eds.). Monograph
Se-ries of the Institute of Mathematical Statistics.
PAGEL, M. 1999. The maximum likelihood approach toreconstructing
ancestral character states of discretecharacters on phylogenies.
Syst. Biol. 48:612622.
RANNALA, B., AND Z. YANG. 1996. Probability distribu-tion of
molecular evolutionary trees: A new methodof phylogenetic
inference. J. Mol. Evol. 43:304311.
RYAN, M. J., AND A. S. RAND. 1995. Female responsesto ancestral
advertisement calls in Tungara frogs.Science 269:390392.
SCHLUTER, D. 1995. Uncertainty in ancient phylogenies.Nature
377:108109.
SCHLUTER, D., T. PRICE, A. .MOOERS,AND D. LUDWIG.1997.
Likelihood of ancestor states in adaptive radia-tion. Evolution
51:16991711.
SCHONIGER ,M.,AND A. VONHAESELER. 1994.A stochas-tic model for
the evolution of autocorrelated DNA se-quences. Mol. Phylogenet.
Evol. 3:240247.
SCHULTZ, T. R., AND G. A. CHURCHILL. 1999. The roleof
subjectivity in reconstructing ancestral characterstates: A
Bayesian approach to unknown rates, states,and transformation
asymmetries. Syst. Biol. 48:651664.
STACKHOUSE, J., S. R. PRESNELL, G. M. MCGEEHAN,K. P. NAMBIAR,
AND S. A. BENNER. 1990. The ribonu-clease from an ancient bovid
ruminant. FEBS Lett.262:104106.
SWOFFORD, D., G. OLSEN, P. WADDELL, AND D. M.HILLIS . 1996.
Phylogenetic inference. Pages 407511in Molecular Systematics, 2nd
edition (D. Hillis, C.Moritz, and B. Mable, eds.). Sinauer,
Sunderland,Massachusetts.
TIERNEY, L. 1994. Markov chains for exploring
posteriordistributions (with discussion). Ann. Stat.
22:17011762.
VANDENBUSSCHE, R. A.,R. J. BAKER, J. P.HUELSENBECK,AND D. M.
HILLIS . 1998. Base compositional bias andphylogenetic analyses: A
test of the ying DNA hy-pothesis. Mol. Phylogenet. Evol.
13:408416.
YANG, Z. 1993. Maximum likelihood estimation of phy-logeny from
DNA sequences when substitution ratesdiffer over sites. Mol. Biol.
Evol. 10:13961401.
YANG, Z. 1994. Maximum likelihood phylogenetic esti-mation from
DNA sequences with variable rates oversites: Approximate methods.
J. Mol. Evol. 39:306314.
YANG, Z., S. KUMAR, AND M. NEI. 1995. A new methodof inference
of ancestral nucleotide and amino acidsequences. Genetics
141:16411650.
YANG, Z.,AND B.RANNALA. 1997.Bayesianphylogeneticinference using
DNA sequences: A Markov chainMonte Carlo method. Mol. Biol. Evol.
14:717724.
Received 8 March 2000; accepted 3 May 2000Associate Editor: R.
Olmstead
APPENDIXWe use the MetropolisHastingsGreen algorithm
(Metropolis et al., 1953; Hastings, 1970; Green, 1995)to
approximate the posterior probabilities of ancestralstates (and
other parameters). We changed one or a fewparameters of the model
at a time. The proposal mech-anisms are described here.
Changing and .We used the LOCAL algorithmofBAMBE to change the
topology and branch lengths ofthe tree simultaneously (Larget and
Simon, 1999). Oneof the s 3 branches of the tree was chosen at
random.The two nodes of this branch are labeled u and w. Thisbranch
partitions the tree, with two clades on one endof the branch and
two clades on the other. One of theclades on the left side of the
branch (from node u) is ran-domly labeled a , and the other is
randomly labeled b.Similarly, one of the clades to the right (from
node w) israndomly labeled c, the other randomly labeled d .
Thepath length from a to c is m. Path length m is modiedby
multiplying the branches on that path by a randomnumber to obtain
the new path length, m . Specically,m D m e(U1=2), where U is a
uniformly distributedrandom number on the interval (0, 1) and is a
tuningparameter (Larget and Simon, 1999). After the path be-tween a
and c is modied, one of the two branches, u bor w d , is detached
from the tree and then reattachedat random along the path between a
and c. Specically,its reattachment point is uniformly distributed
on theinterval m . The acceptance probability for this move(see
Larget and Simon, 1999:753) is
R D min 1, (likelihood ratio) (m=m)2
Changing and .The transition/transversion rateratio () and the
gamma shape parameter () were mod-ied by adding to the current
value a uniformly dis-tributed random number on the interval [, C].
Both and are restricted to positive values. When the pro-posed
state was negative, the excess was reected back.The acceptance
probability for these moves is
R D min [1, (likelihood ratio)]Changing .New values for the base
frequencies
were proposed from a Dirichlet distribution withexpected values
at the current values. The Dirichlet
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from
-
366 SYSTEMATIC BIOLOGY VOL. 50
distribution has probability density
f ( j ) D 0(0)0(A)0(C)0(G)0(T)
A1A C1C G1G T1T
where i is the Dirichlet parameter for the i thnucleotide, 0 D A
C C C G C T , and i is thefrequency of the ith nucleotide. New base
fre-quencies are drawn from the Dirichlet distributionwith
parameter i D i0. We set 0 D 100 in thisstudy.
at Mihai Em
inescu Central University Library of Iasi on M
arch 17, 2015http://sysbio.oxfordjournals.org/
Dow
nloaded from