Points of View. ISSN: 1063-5157 print / 1076-836X online. DOI: 10.1080/10635150490868877
Identifying and Removing Fast-Evolving Sites Using Compatibility Analysis: An Example from the Arthropoda
DAVIDE PISANI
Department of Zoology, The Natural History Museum, Cromwell Road, London SW7 5BD, United Kingdom; E-mail: [email protected]. uk
Felsenstein (1978) first recognized that long branch attraction (LBA) can seriously affect the accuracy of phylogenetic reconstruction. Although Felsenstein specifically addressed this problem in the cases of parsimony and clique analyses, it is now known that LBA can affect any tree reconstruction method, including maximum likelihood (ML) and Bayesian approaches. However, LBA is only a problem for distance, ML, and Bayesian methods when the assumed substitution model is underparameterized, i.e., when it is unrealistically simple (Swofford et al., 2001; Lemmon and Moriarty, 2004). LBA should therefore be avoidable by analyzing the data using ML, Bayesian, or distance methods under the best-fitting substitution model (providing this is a good approximation of the true substitution model). However, ML and (although to a much lesser extent) Bayesian analyses are time consuming, whereas many widely used implementations of distance methods (for example Kumar et al., 2001) do not allow the specification of complex substitution models. Accordingly, estimating the relationships of fast-evolving species still represents one of the most serious problems of molecular phylogenetics.
Strategies for dealing with LBA that do not necessarily rely on the use of probabilistic methods or complex evolutionary models have been suggested. These strategies can be of special utility when parsimony or distance methods are used. These include (1) increasing the taxon sampling (Hendy and Penny, 1989; Hillis, 1996; Rannala et al., 1998; Pollock et al., 2002; Poe, 2003); (2) optimal outgroup selection (Wheeler, 1990); and (3) sampling strategies specifically targeting slowly evolving species (e.g., Aguinaldo et al., 1997). Unfortunately, none of these strategies is universally applicable. For example, increasing the taxon sampling can (in some cases) exacerbate LBA (Kim, 1996; Poe and Swofford, 1999; Poe, 2003), phylogenetic uncertainty can prevent the selection of adequate outgroups, and for certain groups, it is possible that no slowly evolving species can be identified.
An alternative approach to countering LBA that does not necessarily rely on the use of complex substitution models is to identify and remove fast-evolving sites, which are expected to contribute substantially to it (e.g., Brinkmann and Philippe, 1999; Hirt et al., 1999). In particular, Brinkmann and Philippe (1999) proposed a simple parsimony-based method, christened slow-fast (SF), for identifying (and then removing) fast-evolving sites from an alignment. SF, as well as other methods that will not be considered in detail here (e.g., Hirt et al., 1999), can be especially useful when taxon sampling is limited, if close outgroups are unavailable, and in all cases when fast-evolving species are included in the investigation.
Here, the use of alternative, compatibility-based methods (see Felsenstein, 2003; Semple and Steel, 2003; Pisani, 2002; Wilkinson, 2001; Meacham and Estabrook, 1985, for an introduction) for identifying fast-evolving sites is proposed and illustrated using arthropod data in a taxonomic congruence (Miyamoto and Fitch, 1993) context. Unlike the parsimony-based SF, the methods proposed here are topology independent, allowing for their application in cases where SF cannot be applied (see below).
IDENTIFYING FAST-EVOLVING SITES USING PARSIMONY
Brinkmann and Philippe's (1999) SF method works as follows. The aligned sequences are partitioned into n nonoverlapping monophyletic groups based on a priori background knowledge. For each of the n groups, parsimony analysis is carried out and the length of each position determined. Finally, for each position in the alignment, the sum of its lengths on each of the n trees is taken as a proxy of its evolutionary rate. Characters most likely to be fast evolving will be highlighted by their higher parsimony scores.
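The scoring step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Brinkmann and Philippe's implementation: the within-group trees are assumed to be already available as rooted binary trees (nested tuples), character lengths are counted with Fitch's algorithm for unordered characters, and the function names and toy data are hypothetical.

```python
def fitch_length(tree, states):
    """Fitch parsimony length of one site on a rooted binary tree.

    tree   : nested 2-tuples with taxon names (str) at the leaves
    states : dict mapping taxon name -> character state at this site
    """
    steps = 0

    def postorder(node):
        nonlocal steps
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = postorder(node[0]), postorder(node[1])
        common = left & right
        if common:                         # children agree: no change needed
            return common
        steps += 1                         # children disagree: one step
        return left | right

    postorder(tree)
    return steps


def slow_fast_scores(alignment, group_trees):
    """Sum, for every site, its parsimony length over the n group trees."""
    n_sites = len(next(iter(alignment.values())))
    scores = []
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        scores.append(sum(fitch_length(t, column) for t in group_trees))
    return scores


# Toy data (hypothetical): two a priori groups, three sites.
alignment = {"a": "AAC", "b": "AAG", "c": "ATC", "d": "ATG",
             "e": "AAA", "f": "AAA", "g": "ACA"}
group_trees = [(("a", "b"), ("c", "d")), (("e", "f"), "g")]
print(slow_fast_scores(alignment, group_trees))
```

Sites with the largest summed lengths would be the candidates for removal; the result depends on the trees supplied, which is the topology dependence discussed in the text.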
SF can prove useful in many situations, and it has been used to address important phylogenetic problems (e.g., Philippe et al., 2000; Brochier and Philippe, 2002). However, it requires subsets of monophyletic taxa to be known and defined a priori. It is not difficult to envision cases where this knowledge is unavailable or where the monophyletic subsets are misspecified. The full implications of misspecification for SF need to be investigated, for example through simulations. Nevertheless, because the length of a character is a topology-dependent variable, misspecification would generally lead to character lengths being erroneously calculated and putatively fast-evolving sites misidentified. In any case, even if the sequences
2004 POINTS OF VIEW 979
were correctly partitioned, fast-evolving sites will be accurately highlighted only to the extent that the phylogenies within each monophyletic group are correctly resolved. Otherwise, character lengths may be erroneously calculated and a variable proportion of fast-evolving sites misidentified. In conclusion, the limitations of SF are a consequence of the method being topology dependent.
Character compatibility was first introduced by Le Quesne (1969). Briefly, two characters are compatible if they can be mapped on the same tree without homoplasy (see Felsenstein, 2003, Semple and Steel, 2003, for details), otherwise they are incompatible.
Given a set of aligned sequences, for each site in the alignment, it is possible to calculate its incompatibility score, that is, the number of positions with which the specified site is incompatible. Because fast-evolving sites have lost most (if not all) of their phylogenetic information, they are expected to show more incompatibilities than slowly evolving ones, and high incompatibility scores could therefore be used to highlight, and thus remove, potentially fast-evolving sites.
Incompatibility scores are calculated, for each character in a data set, without reference to any specific phylogenetic hypothesis. Therefore, compatibility-based methods of character selection are topology independent, and the potential problems of misspecification of monophyletic groups and phylogenetic inaccuracy within subsets do not pertain.
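As an illustration of how an incompatibility score might be computed without any tree, the sketch below treats every site as an unordered multistate character and uses the standard pairwise criterion that two such characters are compatible exactly when the graph linking their co-occurring states contains no cycle. Treating sites as unordered, ignoring gaps and ambiguity codes, and the function names are assumptions of this sketch; it is not the PICA implementation.

```python
from itertools import combinations


def compatible(col_a, col_b):
    """True if two unordered characters fit one tree without homoplasy.

    Builds the bipartite graph whose nodes are the states of each character
    and whose edges join states observed together in some taxon; the pair is
    compatible when that graph is acyclic (checked with union-find).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for state_a, state_b in set(zip(col_a, col_b)):
        root_a, root_b = find(("a", state_a)), find(("b", state_b))
        if root_a == root_b:                # edge would close a cycle
            return False
        parent[root_a] = root_b
    return True


def incompatibility_scores(sequences):
    """Number of other sites each site is incompatible with."""
    columns = list(zip(*sequences))         # one tuple of states per site
    scores = [0] * len(columns)
    for i, j in combinations(range(len(columns)), 2):
        if not compatible(columns[i], columns[j]):
            scores[i] += 1
            scores[j] += 1
    return scores


# Toy alignment (hypothetical): four taxa, three sites.
print(incompatibility_scores(["AAA", "ACA", "CAC", "CCC"]))
```

In the toy alignment the first and third sites support the same split and conflict only with the second site, so the second site receives the highest score.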
Wilkinson (1992) and Meacham (1994) independently introduced compatibility-based randomization tests analogous to the parsimony PTP test (Archie, 1989; Faith and Cranston, 1991), but used to evaluate the quality of individual characters rather than entire matrices. Their methods differ only in trivial details (Wilkinson, 2001) and only Wilkinson's (1992) method will be considered here. It provides a test of the null hypothesis that a character is no less incompatible with the other characters in the data than is a random, phylogenetically uninformative, character. Expectations under the null model are determined for a given character by repeatedly randomly permuting (shuffling) the assignment of its character states across the taxa and counting the number of other (unpermuted) characters that it is incompatible with. The test statistic used, named the Le Quesne Probability (LQP), is the probability of a random character having as low or lower incompatibility with the rest of the data than does the original character (Wilkinson, 2001; see also Wilkinson and Nussbaum, 1996). Characters with high LQP have notably high incompatibility so that LQP values can be used to highlight fast-evolving sites.
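The permutation procedure just described can be sketched as a simple Monte Carlo estimate. The code below is a hedged illustration rather than the DNALQP program: the pairwise compatibility test assumes unordered characters, the LQP is estimated as the plain fraction of permutations scoring as low as or lower than the observed character (the exact estimator used by DNALQP is not given in the text), and all names and defaults are hypothetical.

```python
import random


def compatible(col_a, col_b):
    """Pairwise compatibility of two unordered characters (acyclic state graph)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for state_a, state_b in set(zip(col_a, col_b)):
        root_a, root_b = find(("a", state_a)), find(("b", state_b))
        if root_a == root_b:
            return False
        parent[root_a] = root_b
    return True


def lqp(site_index, columns, n_perm=999, rng=None):
    """Estimate the Le Quesne probability of one site by permutation.

    columns : list of per-site state tuples (one state per taxon)
    Returns the fraction of shuffled versions of the site whose number of
    incompatibilities with the other, unpermuted sites is <= the observed one.
    """
    rng = rng or random.Random(0)
    others = [c for k, c in enumerate(columns) if k != site_index]

    def score(col):
        return sum(not compatible(col, other) for other in others)

    observed = score(columns[site_index])
    shuffled = list(columns[site_index])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)               # reassign states across taxa
        if score(shuffled) <= observed:
            hits += 1
    return hits / n_perm
```

A site that agrees with many other sites is rarely matched by its shuffled versions and receives a low value, whereas a site that conflicts with everything is matched by every shuffle and receives a value of 1.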
When removing characters from an alignment, there is an important caveat to be considered: fast-evolving sites could still convey some phylogenetic signal (sensu Pisani and Wilkinson, 2002). Accordingly, caution should be taken when removing sites (even if fast evolving), and character removal should be limited (hopefully) only to outliers. However, the threshold at which character deletion would not improve phylogenetic accuracy is data dependent and so can only be found experimentally, by the sequential deletion of sites of increasingly good quality (e.g., sites with decreasing LQP values, lower incompatibility scores, or shorter lengths). Although protocols to optimize site removal can be outlined, these protocols cannot avoid a certain amount of subjectivity. In any case, LQP values offer an advantage over incompatibility and parsimony scores because the LQP of a character reflects both its compatibility with the other characters in the data and its performance with reference to a fully noisy character. Accordingly, LQP values provide a simple way of monitoring the risk of deleting useful information from the data while trying to remove noise. For example, deletion of characters with LQP of 1 is likely to result in an improved signal to noise ratio for the whole data set, whereas deletion of characters of decreasing LQP (e.g., LQP = 0.4) will not necessarily improve the signal to noise ratio of the data set, and could result in the loss of potentially important phylogenetic information.
EXAMPLE: AN ELONGATION FACTOR 1α PHYLOGENY OF ARTHROPODA
Materials and Methods
DNA sequences of the Elongation Factor 1α (EF-1α) gene for a variety of arthropods were retrieved from GenBank (see Appendix for the accession numbers), and an alignment scoring 47 species and 866 nucleotide positions (gaps excluded) was created using Clustal X (Thompson et al., 1994) default options. The data set used in the analyses has been deposited (and is available for download) in TreeBase (www.treebase.org). Taxon sampling was designed to allow the testing of two phylogenetic hypotheses: (1) the monophyly of Hexapoda (contrast Nardi et al., 2003 and Delsuc et al., 2003) and (2) the phylogenetic relationships of the Myriapoda (compare Giribet et al., 2001, with Frederich and Tautz, 1995; Hwang et al., 2001; Cook et al., 2001; Nardi et al., 2003; Delsuc et al., 2003; Pisani et al., 2004; Mallatt et al., 2004; Negrisolo et al., 2004). Some vertebrates (human, salmon, and zebra fish) were selected as outgroups. Relatively distant outgroups were deliberately selected in an attempt to exacerbate LBA.
For the complete data set (all nucleotide positions), departure from homogeneity in base composition across taxa was tested using the χ² test (e.g., Negrisolo et al., 2004), and the best-fitting substitution model was selected using Modeltest (Posada and Crandall, 1998). Phylogenetic analyses were performed using Bayesian and ML methods under the best-fitting substitution model selected by Modeltest (GTR+Γ+I). Neighbor-joining (NJ) analyses were also performed. In order to allow LBA, the NJ analyses were carried out using an underparameterized substitution model, which was the gamma-corrected Kimura two-parameter (K2P) model, and transversions only were considered when calculating distances for the NJ analyses. All analyses were performed with gaps removed.
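For readers unfamiliar with the χ² test of compositional homogeneity, the statistic can be sketched as a contingency-table calculation over taxa (rows) and bases (columns). The snippet below is a minimal illustration and not the PAUP* routine used in the study; the function name is hypothetical, characters other than A, C, G, and T are simply ignored, and the P value (which could be obtained from a χ² distribution, for example with `scipy.stats.chi2.sf`) is deliberately left out.

```python
from collections import Counter


def base_composition_chi2(sequences, bases="ACGT"):
    """Chi-square statistic and degrees of freedom for a taxa-by-base table.

    sequences : dict mapping taxon name -> aligned sequence (str)
    Expected counts assume every taxon shares the pooled base frequencies.
    """
    counts = [Counter(b for b in seq if b in bases) for seq in sequences.values()]
    row_totals = [sum(c[b] for b in bases) for c in counts]
    col_totals = [sum(c[b] for c in counts) for b in bases]
    total = sum(row_totals)

    chi2 = 0.0
    for c, row_total in zip(counts, row_totals):
        for base, col_total in zip(bases, col_totals):
            expected = row_total * col_total / total
            if expected > 0:                # skip bases absent from the data
                chi2 += (c[base] - expected) ** 2 / expected

    used_bases = sum(1 for t in col_totals if t > 0)
    df = (len(counts) - 1) * (used_bases - 1)
    return chi2, df
```

For an alignment of 47 taxa in which all four bases occur, this gives (47 - 1) × (4 - 1) = 138 degrees of freedom.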
980 SYSTEMATIC BIOLOGY VOL. 53
The strict consensus (Sokal and Rohlf, 1981) of the Bayesian and ML trees was calculated and the NJ tree compared with it and with what is generally accepted about the phylogeny of Arthropoda (compare for example Giribet et al., 2001, with Hwang et al., 2001). This was done to evaluate to what extent the NJ tree differed from the Bayesian and likelihood trees, and to what extent the ML, Bayesian, and NJ trees matched what is generally accepted about the phylogeny of this group.
LQP values were then obtained for each position in the alignment, putatively fast-evolving sites (sites with high LQP) sequentially eliminated, and the retained data reanalyzed using Bayesian, ML, and NJ analyses. As above, NJ analyses were performed assuming an underparameterized (gamma-corrected K2P) model, whereas Bayesian and ML analyses were performed under the best-fitting substitution model, which was reestimated after removing the characters with high LQP.
To test the performance of the compatibility-based method, SF was also implemented and the results obtained using the two methods compared. To implement SF the sequences were partitioned a priori in three monophyletic groups (Pancrustacea, Chelicerata, and Myriapoda). Phylogenetic uncertainty and taxon-sampling limitations did not allow the definition of less inclusive monophyletic groups. Sites highlighted as fast evolving using SF were excluded and the remaining data reanalyzed using NJ, ML, and Bayesian analyses. As in the other cases, the NJ analyses were performed assuming a gamma-corrected K2P model, whereas the ML and Bayesian analyses were performed under the (reestimated) best-fitting substitution model.
To monitor the consequences associated with the deletion of putatively fast-evolving sites, characters were incrementally removed (in steps of 0.1) according to their LQP, and starting with those with the highest values, i.e., those with LQP included between 1 and 0.9. Phylogenetic analyses were carried out after each set of characters was removed and changes in the tree topology monitored. Removed characters were subjected to the PTP test to evaluate whether they conveyed clustering information, and phylogenetic analysis of the removed characters was performed to visualize their information content.
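The incremental removal scheme can be pictured as a sequence of cut-offs applied to the per-site LQP values. The helper below is a hypothetical sketch of that bookkeeping only (tree building and testing are not shown); it assumes that, at each step, every site with LQP strictly greater than the current cut-off is excluded.

```python
def removal_schedule(lqp_values, thresholds=(0.9, 0.8, 0.7, 0.6, 0.5)):
    """Yield (threshold, removed_sites, retained_sites) for each cut-off.

    lqp_values : list of LQP values, one per alignment position
    thresholds : decreasing cut-offs, here in steps of 0.1
    """
    for threshold in thresholds:
        removed = [i for i, v in enumerate(lqp_values) if v > threshold]
        retained = [i for i, v in enumerate(lqp_values) if v <= threshold]
        yield threshold, removed, retained


# Hypothetical LQP values for five sites.
for threshold, removed, retained in removal_schedule([1.0, 0.95, 0.65, 0.55, 0.1]):
    print(threshold, removed, retained)
```

Each retained set would then be reanalyzed, and each removed set examined separately (for instance with the PTP test), as described in the text.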
To evaluate whether the exclusion of putatively fast-evolving sites improved phylogenetic estimation, the likelihood of the NJ trees obtained after each set of characters was deleted and of the strict consensus of the Bayesian and ML trees was calculated. This was done (under the best-fitting substitution model) considering (1) all nucleotide positions and (2) each set of retained characters. To evaluate whether the observed topological changes were significant, the Shimodaira-Hasegawa (SH) test (see Felsenstein, 2003, for details) was used to compare the NJ trees with each other and with the strict consensus of the ML and Bayesian trees. The likelihood calculations for the SH test were always performed considering all nucleotide positions, under the best-fitting substitution model.
Support for the nodes in the NJ trees was evaluated using the interior branch length test (1000 replicates; see Nei and Kumar, 2000, for details). Deletion of fast-evolving sites could have a potentially deleterious impact on the branch lengths of the recovered tree. As pointed out by Kim (1996), LBA can be exacerbated when long terminal branches (i.e., fast-evolving species) follow short internal branches. Accordingly, it is possible to conjecture that the application of methods for the removal of fast-evolving sites, if unevenly affecting different parts of the tree, could potentially exacerbate LBA. The interior branch test allows monitoring the significance of the branch lengths for the recovered tree, thus making it possible to evaluate whether site removal could be exacerbating LBA. Support for the nodes in the strict consensus of the ML and Bayesian trees is expressed as their posterior probabilities.
NJ analyses and the interior branch test were implemented using MEGA 2.1 (Kumar et al., 2001). The program DNALQP (which is part of the software package PICA 4.0; Wilkinson, 2001) was used to calculate LQP values. PAUP* (Swofford, 1998) was used to implement the χ² test for homogeneity in base composition across taxa, the PTP test (1000 replicates with heuristic search and the multiple trees option turned off), to carry out the ML and parsimony analyses (100 replicates with heuristic search and random sequence addition), to infer the gamma parameter (α) values used in the NJ analyses, to calculate the likelihood of the inferred trees, and to implement the SH test (RELL option with 100,000 replicates). Bayesian analyses were performed using MrBayes 3.0 (Ronquist and Huelsenbeck, 2003). For each Bayesian analysis 2,000,000 generations were run, sampling every 1000 generations. The burn-in period of each analysis was estimated by plotting the likelihood of the sampled trees.
Results
Diagnosing LBA. The ML and Bayesian analyses of the data (all nucleotide positions) yielded trees differing only in trivial details, and their strict consensus is reported in Figure 1. This tree is generally consistent with the "known" phylogeny of Arthropoda, for example, supporting monophyletic Pancrustacea, Chelicerata, Myriapoda, Branchiopoda, and Malacostraca. However, it does not support a monophyletic Hexapoda (in accordance with Nardi et al., 2003), although, interestingly, the springtail Tomocerus sp. clusters with the "primitive" insects Metajapyx subterraneus and Ctenolepisma lineata (see Fig. 1). In accordance with Giribet et al. (2001), this tree supports a sister-group relation between Myriapoda and Pancrustacea (i.e., it supports the Mandibulata hypothesis).
The NJ analysis of the data (all nucleotide positions) allowed for a possible case of LBA to be highlighted in the form of an obviously misplaced taxon, the crustacean Speleonectes tulumensis (Remipedia), stemming at the base of the tree (Fig. 2). However, LBA is not the only possible explanation for the misplacement of S. tulumensis: other possible explanations are base composition bias and paralogy (see Gribaldo and Philippe, 2002, for a review).
FIGURE 1. Strict consensus of the maximum likelihood and Bayesian trees. Numbers above the nodes represent the posterior probabilities (all sites). Numbers below the nodes are the "revised" posterior probabilities calculated after excluding all the sites with LQP > 0.5 (172 sites excluded on a total of 866). In bold: posterior probabilities that increased after excluding the sites with high LQP. In italics: posterior probabilities that decreased after excluding the sites with high LQP. N.A.: not applicable. This node was not supported in the Bayesian analysis performed after excluding the sites with LQP > 0.5.
The χ² test for homogeneity in base composition could not highlight significant heterogeneity across taxa (P = 0.927). Evidence for the presence of multiple copies of the EF-1α gene within Arthropoda exists (Hovemann et al., 1988; Danforth and Ji, 1998; Hedin and Maddison, 2001). However, based on the paralog intron structure (Hedin and Maddison, 2001) and phylogenetic analyses (Danforth and Ji, 1998), these EF-1α paralogs appear to be the result of independent duplications in specific lineages rather than of ancient duplication events (Hedin and Maddison, 2001). Accordingly, neither base composition bias nor paralogy can explain the misplacement of S. tulumensis in Figure 2, leaving LBA as the most likely explanation. This is further confirmed by the ML and Bayesian analyses (see Fig. 1) where S. tulumensis nests within Pancrustacea as the sister group of the crustacean Hutchinsoniella macracantha. These results were to be expected if the misplacement of S. tulumensis in Figure 2 was due to LBA, which was avoided in the Bayesian and ML analyses implementing a better fitting substitution model.
The NJ tree in Figure 2 also shows several other obviously misplaced taxa. The "primitive" insects M. subterraneus and C. lineata are nested outside Pancrustacea, the centipede (Chilopoda) Cryptops hyalinus is outside Myriapoda, the crustacean Hutchinsoniella macracantha is within Insecta, and the chelicerate Chthonius tetrachelatus is within Pancrustacea. Other interesting aspects of this tree are (1) the basal position of the springtail Tomocerus sp. that is nested outside Hexapoda (in accordance with Nardi et al., 2003) and, most importantly, (2) the pairing of the Chelicerata and Myriapoda. The latter aspect of the tree in Figure 2 contradicts the results of the ML and Bayesian analyses (compare Figs. 1 and 2) and provides support for the Myriochelata (Pisani et al., 2004), or Paradoxopoda (Mallatt et al., 2004), hypothesis. However, the likelihood of the NJ tree in Figure 2 is much lower than that of the strict consensus of the ML and Bayesian trees (Table 1), and the SH test suggests this tree fits the data significantly worse than the tree in Figure 1 does (P = 0.0092). In conclusion, the NJ tree in Figure 2 is likely to be a poor reflection of the true phylogenetic relationships of these EF-1α sequences.
FIGURE 2. Neighbor-joining tree. Tree obtained from the analysis of all 866 sites (α = 0.24). Numbers at the nodes represent support as expressed using the interior branch test. Dashed lines represent relationships unsupported in the tree summarizing the results of the interior branch test. In bold: obviously misplaced taxa.
TABLE 1. Likelihood of the recovered trees. Likelihood values have all been calculated under the best-fitting substitution model (reestimated after each set of characters was excluded). Tree 1: Strict consensus of the Bayesian and ML trees (all nucleotide positions). Tree 2: NJ tree of all nucleotide positions. Tree 3: Suboptimal LQP tree (see text). Tree 4: The best LQP tree (see text). Tree 5: The best SF tree (see text). In bold: best NJ trees. An asterisk indicates the best trees overall.
Data set (rows; the likelihood values for Trees 1 to 5 are not recoverable from this copy):
All data
Characters with LQP > 0.6 excluded
Characters with LQP > 0.5 excluded
Characters with length ≥ 9 excluded
Coping with LBA using LQP. Deletion of putatively fast-evolving sites (sites with high LQP) resulted in substantial changes of the recovered NJ trees, but did not result in any substantial change of the topology of the recovered ML and Bayesian trees. However, the posterior probabilities of many of the nodes recovered by the ML and Bayesian analyses varied when sites with high LQP were removed (see Fig. 1).
A first important change of the recovered NJ tree topology was observed after all the characters with LQP > 0.6 (i.e., the first 131 worst performing characters) were excluded (Fig. 3). This tree will be hereafter referred to as "the suboptimal LQP tree" (see text below). Further site removal, down to all the sites with LQP > 0.5 (i.e., the first 172 worst performing characters), resulted in other notable changes in the recovered NJ tree (Fig. 4). The tree in Figure 4 will be hereafter referred to as "the best LQP tree." This is because its likelihood (calculated after the characters removed to infer it were reintroduced) is greater than that of any other NJ tree recovered (see also Table 1). Additional exclusion of increasingly better performing characters (with LQP < 0.5) led only to the deterioration of the results (i.e., appearance of nonsensical clades, large polytomies, and decreasing likelihood of the recovered trees), suggesting that important phylogenetic information was being removed and that character deletion should have stopped.
It is immediately evident that removal of characters with high LQP improved the accuracy of the recovered NJ trees. In fact, even in the suboptimal LQP tree the crustacean S. tulumensis nested within Pancrustacea as the sister group of the crustacean H. macracantha. This is consistent with the results of the Bayesian and ML analyses of the complete data set (compare Figs. 1 and 3), and with the results of Giribet et al. (2001). Furthermore, in the suboptimal LQP tree, the "primitive" insects M. subterraneus and C. lineata are also nested within Pancrustacea, whereas the centipede C. hyalinus is clustered at the base of Myriapoda. In this tree, the only taxon that is still grossly misplaced, therefore, is the chelicerate C. tetrachelatus. The likelihood of the suboptimal LQP tree is greater than that of the NJ tree obtained from the analysis of the complete data set (see also Table 1), and is exceeded only by that of the best LQP tree. Nonetheless, the SH test indicates the suboptimal LQP tree does not fit the data significantly better than the NJ tree obtained from the analysis of the complete data set (P = 0.2605).
Exclusion of other putatively fast-evolving sites (i.e., all the sites with LQP > 0.5; see above) eventually resulted in the recovery of the best LQP tree, in which there are no grossly misplaced taxa. The likelihood of this tree is greater than that of the suboptimal LQP tree (and of any other NJ-LQP tree), suggesting that the exclusion of extra sites with high LQP (down to all the sites with LQP > 0.5) further improved phylogenetic estimation. This is further confirmed by the SH test, suggesting that the best LQP tree fits the data significantly better than both the NJ tree obtained from the analysis of the complete data set (P = 0.0402), and the suboptimal LQP tree (P = 0.031). Most importantly, the same test cannot reject the null hypothesis that the best LQP tree and the strict consensus of the Bayesian and likelihood trees fit the data equally well (P = 0.2158).
The best LQP tree is topologically very dissimilar from the suboptimal LQP tree (compare Figs. 3 and 4). The suboptimal LQP tree suggests Myriapoda to be the sister group of Chelicerata (i.e., it supports Myriochelata), whereas the best LQP tree suggests Myriapoda to be the sister group of Pancrustacea (i.e., it supports Mandibulata).
The PTP test suggested clustering information was being removed together with the putatively fast-evolving sites (P < 0.001). However, the phylogenetic trees recovered from the analyses of the removed characters, except for suggesting some obvious groups (e.g., Xiphosura, that is, the horseshoe crabs), were mainly nonsensical. Therefore, it is possible to conclude that the signal loss associated with the removal of the putatively fast-evolving sites is an expression of the "residual" signal associated with the better supported groups, which is strong enough to be conserved even in the worst performing characters.
SF analysis. Implementing SF, the sites in the alignment were partitioned into groups having equal length, the worst performing ones having a length of 21 steps. Interestingly, the fast-evolving characters identified using SF are sometimes different (and generally differently ranked) when compared with those highlighted in the LQP analysis.
Characters identified as putatively fast evolving by SF were sequentially deleted down to the worst 175 (i.e., down to all the characters of length 10 or higher), that is, in a number comparable to that of the characters excluded to obtain the best LQP tree. NJ analysis of the remaining characters resulted in a tree still showing misplaced S. tulumensis and H. macracantha (not shown). The additional exclusion of all the characters of length 9 (for a total of 196), however, resulted in a tree topologically very similar to the best LQP tree (compare Figs. 4 and 5). The tree in Figure 5 is the SF tree of maximum likelihood (see Table 1) and will be referred to hereafter as the best SF tree. Additional character removal (down to all the characters of length 5 or more) did not lead to any improvement of the results, and led to a generalized decrease of the likelihood of the recovered trees.
FIGURE 3. Neighbor-joining tree. Tree obtained after excluding the sites with LQP > 0.6, 131 on a total of 866 (α = 0.18). This is the suboptimal LQP tree (see text). Numbers at the nodes represent support as expressed using the interior branch test. Dashed lines represent relationships unsupported in the tree summarizing the results of the interior branch test. In bold: obviously misplaced taxa. [Tree labels: Araneae, Ricinulei, Pseudoscorpiones, Acari, Opiliones, Xiphosura, Diplopoda, Chilopoda; clades Pancrustacea, Chelicerata, Myriapoda; scale bar: 0.01 substitutions/site.]
The likelihood of the best LQP tree and of the best SF tree are very similar (see Table 1) and the SH test cannot reject the null hypothesis that the two trees fit the data equally well (P = 0.7186). Besides, as in the case of the best LQP tree, the SH test cannot reject the null hypothesis that the best SF tree and the strict consensus of the Bayesian and ML trees of Figure 1 fit the data equally well (P = 0.1586).
FIGURE 4. Neighbor-joining tree. Tree obtained after excluding the sites with LQP > 0.5, 172 on a total of 866 (α = 0.19). This is the best LQP tree (see text). Numbers at the nodes represent support as expressed using the interior branch test. Dashed lines represent relationships unsupported in the tree summarizing the results of the interior branch test.
DISCUSSION
Phylogenetic Implications
Two competing hypotheses have been proposed for the relationships among Pancrustacea, Myriapoda, and Chelicerata. These are (1) the Mandibulata hypothesis (see for example Giribet et al., 2001) and (2) the Myriochelata (or Paradoxopoda) hypothesis. Recent molecular phylogenetic analyses seem to support Myriochelata (see Frederich and Tautz, 1995; Hwang et al., 2001; Cook et al., 2001; Kusche and Burmester, 2001; Nardi et al., 2003; Delsuc et al., 2003; Pisani et al., 2004; Mallatt et al., 2004; Negrisolo et al., 2004). However, morphological and combined morphological and molecular analyses (e.g., Giribet et al., 2001) support Mandibulata. Everything considered, the results presented here seem to favor Mandibulata. In fact, although the interior branch test indicates Mandibulata to be poorly supported in both the best LQP and SF trees, their congruence with the strict consensus of the ML and Bayesian trees (see Figs. 1, 4, and 5) suggests LBA is not being exacerbated by character deletion. Besides, the posterior probability of Mandibulata is relatively high and it increases when characters of high LQP are removed (Fig. 1).
FIGURE 5. Neighbor-joining tree. Tree obtained after removing the sites identified (using SF) as having length ≥ 9 steps, 196 on a total of 866 (α = 0.20). This is the best SF tree (see text). Numbers at the nodes represent support as expressed using the interior branch test. Dashed lines represent relationships unsupported in the tree summarizing the results of the interior branch test.
Monophyly of insects is not supported even in the best LQP tree (in accordance with Nardi et al., 2003). However, interestingly, a clade including the springtail Tomocerus sp. and the insects M. subterraneus and C. lineata is recovered in the suboptimal LQP tree, in both the best LQP and SF trees, and in the strict consensus of the Bayesian and ML trees (compare Figs. 1, 3, 4, and 5). Accordingly, this data set leaves the question of the monophyly/paraphyly of Hexapoda substantially unresolved.
Coping with LBA Using Compatibility-Based Methods
The implementation of both the LQP method suggested here and SF improved phylogenetic accuracy of the NJ analyses. Both methods yielded trees not significantly different from the ML and Bayesian ones, despite the NJ analysis employing an unrealistically simple model. Accordingly, this example shows topology-independent, compatibility-based methods can be used to improve the signal to noise ratio of a data set. Furthermore, this example shows the proposed compatibility-based method performed at least as well as SF. Additional investigations need to be performed (for example through simulations) to evaluate better the extent to which compatibility-based methods could improve phylogenetic estimation. However, this example illustrates by means of taxonomic congruence (Miyamoto and Fitch, 1995) that the suggested method can improve phylogenetic accuracy, at least when distance methods are implemented under unrealistically simple substitution models (as is often the case).
The utility of character removal methods in a ML or Bayesian context (where more complex substitution models are generally implemented) is less obvious. However, it should be pointed out that the available substitution models are all simplifications of the true, but unknown, substitution models (see for example Foster, 2004) and it can thus be conjectured that character removal methods could be useful also in a ML or Bayesian context. It should also be noted that, at least in this example, the exclusion of fast-evolving sites notably increased computational speed. The ML analysis of the complete data set (all positions) lasted more than 300 hours (on a cluster of 6 dual processor nodes), whereas the ML analysis performed after excluding all sites with LQP > 0.5 could be completed in 42 hours and 46 minutes.
In their simplest implementation (see above), compatibility-based methods for highlighting fast-evolving sites do not need the a priori definition of monophyletic groups of sequences. Furthermore, they do not need phylogenetic trees to be built in order to highlight fast-evolving sites. Therefore, they make minimal assumptions about the data (Wilkinson, 1998), and can be implemented when the prior information necessary to implement SF is lacking. However, if enough information is available, the procedure implemented here can be modified, groups of monophyletic sequences defined, and a strict equivalent of SF designed.
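The tree-free nature of this kind of scoring can be illustrated with a minimal sketch. It assumes binary characters and uses the four-state-combination condition for pairwise incompatibility; the function names and the permutation scheme are illustrative assumptions, not the PICA implementation used in this study, and real sequence data are multistate and would need a more general compatibility test.

```python
import random

def compatible(a, b):
    """Two binary characters are pairwise compatible unless all four
    state combinations (0,0), (0,1), (1,0), (1,1) occur across the taxa."""
    return len(set(zip(a, b))) < 4

def incompatibility_count(idx, chars):
    """Number of other characters with which character idx is incompatible."""
    return sum(not compatible(chars[idx], c)
               for j, c in enumerate(chars) if j != idx)

def lqp_estimate(idx, chars, n_perm=200, rng=None):
    """Permutation estimate (assumed scheme): the fraction of randomly
    shuffled versions of character idx that are incompatible with no more
    of the remaining characters than the original character is."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    observed = incompatibility_count(idx, chars)
    hits = 0
    for _ in range(n_perm):
        shuffled = list(chars[idx])
        rng.shuffle(shuffled)
        trial = chars[:idx] + [shuffled] + chars[idx + 1:]
        if incompatibility_count(idx, trial) <= observed:
            hits += 1
    return hits / n_perm
```

Under this sketch a value near 1 means the character conflicts with the rest of the matrix about as much as a random character would, which is the behaviour expected of a saturated, fast-evolving site. No tree and no predefined groups are needed at any step.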
All methods of character selection pose the problem of finding an optimal cut-off value under which characters should not be deleted. How to discriminate characters the deletion of which could improve phylogenetic accuracy, therefore, is key. Still, this is the most complex step of any character selection protocol. Because it is obvious that optimal cut-off values must be defined on a case-by-case basis, the threshold discriminating characters that should be deleted from those that should be retained can only be defined experimentally (i.e., by monitoring the effect associated with character deletion). Although this would always imply a certain amount of subjectivity, some guidelines to monitor the effects of character deletion can be suggested. (1) Groups of characters should be removed sequentially according to their LQP (starting with the characters with a LQP = 1 and going down toward lower LQP values). Changes in the tree topology should be examined after each group of characters is removed. (2) Tests to highlight the presence of clustering signal (e.g., the PTP test) should be carried out on the deleted characters. (3) Phylogenetic analysis of the deleted characters should be performed to visualize their information content. (4) The likelihood of each recovered tree should be recorded, and the values compared. (5) Site removal should be stopped if further character deletion results in a significant and systematic deterioration of the results (i.e., appearance of obviously nonsensical clusters and/or substantial loss of resolution, support, or a decrease in the likelihood of the recovered trees). If the PTP (or a similar) test suggests that there is no clustering information in the excluded characters (point 2 above), their elimination should be justifiable. However, deleted characters often convey significant clustering information (Pisani, unpublished results, and above). In such cases, the phylogenetic tree recovered from the analysis of the deleted characters (point 3 above) should be scrutinized in order to evaluate what signal they convey.
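The sequential removal described in guideline (1) can be sketched as a simple schedule of retained and deleted site indices. This is only an illustration: the function name, the default cut-off values, and the inclusive boundary (sites with LQP at or above the cut-off are deleted) are assumptions, and tree building, PTP testing, and likelihood scoring are left to external software.

```python
def removal_schedule(lqp_values, cutoffs=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Yield (cutoff, retained, deleted) from the highest cut-off downward.

    lqp_values: one LQP score per site (index = site position).
    cutoffs:    hypothetical thresholds; a site is deleted when its
                LQP is greater than or equal to the current cut-off.
    """
    for cutoff in sorted(cutoffs, reverse=True):
        deleted = [i for i, v in enumerate(lqp_values) if v >= cutoff]
        retained = [i for i, v in enumerate(lqp_values) if v < cutoff]
        yield cutoff, retained, deleted


# Each step provides two site sets: `retained` for the main analysis
# (guidelines 1, 4, 5) and `deleted` for the checks on the excluded
# characters (guidelines 2 and 3).
for cutoff, retained, deleted in removal_schedule([1.0, 0.2, 0.95, 0.6, 0.4]):
    print(cutoff, len(retained), len(deleted))
```

Keeping the deleted set alongside the retained set at every step is deliberate, since the guidelines require the excluded characters to be analysed in their own right before their removal is accepted.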
CONCLUSIONS
LBA is still one of the major problems of molecular phylogenetics and there is great need of methods that could help coping with it. Both analytical methods and methods focusing on the taxon-sampling process are promising and need to be better investigated. However, because in many cases taxon sampling is limited, analytical methods, for example the methods suggested here or SF, seem particularly important. Still, with few notable exceptions (e.g., Hirt et al., 1999; Brinkmann and Philippe, 1999; this study) they have been overlooked.
It is the hope of the author that this study will stimulate further developments of these methods, as well as new studies specifically investigating areas of applicability of the different approaches, and new, more sophisticated experimental protocols.
ACKNOWLEDGMENTS
The author would like to thank H. Philippe, F. Delsuc, I. Padovani, M. Wilkinson, F. Thomarat, D. Gower, S. Harris, and J. Cotton for their critical reading of this manuscript and for their helpful comments and suggestions.
REFERENCES
Aguinaldo, A. M., J. M. Turbeville, L. S. Linford, M. C. Rivera, J. R. Garey, R. A. Raff, and J. A. Lake. 1997. Evidence for a clade of nematodes, arthropods, and other moulting animals. Nature 387:489-493.
Archie, J. W. 1989. A randomization test for phylogenetic information in systematic data. Syst. Zool. 38:239-252.
Brinkmann, H., and H. Philippe. 1999. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol. Biol. Evol. 16:817-825.
Brochier, C., and H. Philippe. 2002. A non-hyperthermophilic ancestor for Bacteria. Nature 417:244.
Cook, C. E., M. L. Smith, M. J. Telford, A. Bastianello, and M. Akam. 2001. Hox genes and the phylogeny of arthropods. Curr. Biol. 11:759-763.
Danforth, B. N., and S. Ji. 1998. Elongation factor-1α occurs in two copies in bees: Implications for phylogenetic analysis of EF-1α sequences in insects. Mol. Biol. Evol. 15:225-235.
Delsuc, F., M. Phillips, and D. Penny. 2003. Comment on "Hexapod origins: Monophyletic or Paraphyletic?" Science 301:1482.
Faith, D. P., and P. S. Cranston. 1991. Could a cladogram this short have arisen by chance alone? On permutation tests for cladistic structure. Cladistics 7:1-28.
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Foster, P. G. 2004. Modeling compositional heterogeneity. Syst. Biol. 53:485-495.
Friedrich, M., and D. Tautz. 1995. Ribosomal DNA phylogeny of the major extant arthropod classes and the evolution of myriapods. Nature 376:165-167.
Giribet, G., G. D. Edgecombe, and W. C. Wheeler. 2001. Arthropod phylogeny based on eight molecular loci and morphology. Nature 413:157-161.
Gribaldo, S., and H. Philippe. 2002. Ancient phylogenetic relationships. Theor. Pop. Biol. 61:391-408.
Hedin, M. C., and W. P. Maddison. 2001. Phylogenetic utility and evidence for multiple copies of Elongation Factor-1α in the spider genus Habronattus (Araneae: Salticidae). Mol. Biol. Evol. 18:1512-1521.
Hendy, M. D., and D. Penny. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297-309.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.
Hirt, R., J. M. Logsdon, Jr., B. Healy, M. W. Dorey, W. F. Doolittle, and T. M. Embley. 1999. Microsporidia are related to fungi: Evidence from the largest subunit of RNA polymerase II and other proteins. Proc. Natl. Acad. Sci. USA 96:580-585.
Hovemann, B., S. Richter, U. Walldorf, and C. Cziepluch. 1988. Two genes encode related cytoplasmic elongation factors 1α (EF-1α) in Drosophila melanogaster with continuous and stage specific expression. Nucleic Acids Res. 16:3175-3194.
Hwang, U.-W., M. Friedrich, D. Tautz, C.-J. Park, and W. Kim. 2001. Mitochondrial protein phylogeny joins myriapods with chelicerates. Nature 413:154-157.
Kim, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing number of taxa. Syst. Biol. 45:363-374.
Kumar, S., K. Tamura, I. B. Jakobsen, and M. Nei. 2001. MEGA2: Molecular evolutionary genetics analysis software. Arizona State University, Tempe, Arizona.
Kusche, K., and T. Burmester. 2001. Diplopod hemocyanin sequence and the phylogenetic position of the Myriapoda. Mol. Biol. Evol. 18:1566-1573.
Le Quesne, W. J. 1969. A method of selection of characters in numerical taxonomy. Syst. Zool. 18:201-205.
Lemmon, A. R., and E. C. Moriarty. 2004. The importance of proper model assumption in Bayesian phylogenetics. Syst. Biol. 53:265-277.
Mallatt, J. M., J. R. Garey, and J. W. Shultz. 2004. Ecdysozoan phylogeny and Bayesian inference: First use of nearly complete 28S and 18S rRNA gene sequences to classify the arthropods and their kin. Mol. Phylogenet. Evol. 31:178-191.
Meacham, C. A. 1994. Phylogenetic relationships at the basal radiation of angiosperms: Further study by probability of character compatibility. Syst. Bot. 19:506-522.
Meacham, C. A., and G. E. Estabrook. 1985. Compatibility methods in systematics. Annu. Rev. Ecol. Syst. 16:431-466.
Miyamoto, M. M., and W. M. Fitch. 1995. Testing species phylogenies and phylogenetic methods with congruence. Syst. Biol. 44:64-76.
Nardi, F., G. Spinsanti, J. L. Boore, A. Carapelli, R. Dallai, and F. Frati. 2003. Hexapod origins: Monophyletic or paraphyletic? Science 299:1887-1889.
Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.
Negrisolo, E., A. Minelli, and G. Valle. 2004. The mitochondrial genome of the house centipede Scutigera and the monophyly versus paraphyly of Myriapods. Mol. Biol. Evol. 21:770-780.
Philippe, H., P. Lopez, H. Brinkmann, K. Budin, A. Germot, J. Laurent, D. Moreira, M. Müller, and H. Le Guyader. 2000. Early branching or fast evolving eukaryotes? An answer based on slowly evolving positions. Proc. R. Soc. Lond. B 267:1213-1221.
Pisani, D. 2002. Comparing and combining trees and data in phylogenetic analysis. Ph.D. Thesis, University of Bristol, UK.
Pisani, D., L. L. Poling, M. Lyons-Weiler, and S. B. Hedges. 2004. The colonization of land by animals: Molecular phylogeny and divergence times among arthropods. BMC Biology 2:1.
Pisani, D., and M. Wilkinson. 2002. Matrix representation with parsimony, taxonomic congruence and total evidence. Syst. Biol. 51:151-155.
Poe, S. 2003. Evaluation of the strategy of long-branch subdivision to improve the accuracy of phylogenetic methods. Syst. Biol. 52:423-428.
Poe, S., and D. L. Swofford. 1999. Taxon sampling revisited. Nature 398:299-300.
Pollock, D., D. J. Zwickl, J. A. McGuire, and D. M. Hillis. 2002. Increased taxon sampling is advantageous for phylogenetic inference. Syst. Biol. 51:664-671.
Posada, D., and K. A. Crandall. 1998. Modeltest: Testing the model of DNA substitution. Bioinformatics 14:817-818.
Rannala, B., J. P. Huelsenbeck, Z. Yang, and R. Nielsen. 1998. Taxon sampling and the accuracy of large phylogenies. Syst. Biol. 47:702-710.
Ronquist, F., and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574.
Semple, C., and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford, UK.
Sokal, R. R., and F. J. Rohlf. 1981. Taxonomic congruence in the Leptopodomorpha reexamined. Syst. Zool. 30:309-325.
Swofford, D. L. 1998. PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
Swofford, D. L., P. J. Waddell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, and J. S. Rogers. 2001. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. 50:525-539.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Wheeler, W. C. 1990. Nucleic acid sequence phylogeny and random outgroups. Cladistics 6:363-367.
Wilkinson, M. 1992. Consensus compatibility and missing data in phylogenetic inference. Ph.D. Thesis. University of Bristol, UK.
Wilkinson, M. 1998. Split support and split conflict randomisation tests in phylogenetic inference. Syst. Biol. 47:637-695.
Wilkinson, M. 2001. PICA 4.0: Software and documentation. Department of Zoology, The Natural History Museum, London.
Wilkinson, M., and R. A. Nussbaum. 1996. On the phylogenetic position of the Uraeotyphlidae (Amphibia: Gymnophiona). Copeia 3:550-562.
First submitted 2 December 2003; reviews returned 1[illegible] May 2004; final acceptance 29 August 2004
Any tree can be represented by a variety of matrices from which the tree can be reconstructed. Such matrix representations (encodings) of trees have been used in a number of contexts, including measuring tree similarity (Farris, 1973), cophylogeny (Brooks, 1981), and consensus (Nelson and Ladiges, 1994b; Phillips and Warnow, 1996), but they have become more prominent through the matrix representation with parsimony (MRP) approach to supertree construction (Baum, 1992; Ragan, 1992; Sanderson et al., 1998). Most published supertrees have utilized MRP (e.g., Jones et al., 2002; Kennedy and Page, 2002; Pisani et al., 2002; Purvis, 1995a), in which matrix representations of trees are combined into a composite matrix and analyzed with parsimony. In standard MRP, the elements of matrix representations are the relationships given by the full splits or bipartitions of leaves (terminal taxa) present in the trees. In the context of rooted trees, standard matrix representations have one binary (pseudo)character for each clade in which clade members are scored conventionally (and arbitrarily) as '1' and nonmembers and the root (MRP outgroup) as '0' (Table 1).
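The standard coding for a rooted tree can be sketched as follows. This is a minimal illustration, assuming trees are written as nested Python tuples of leaf names; the function names and the `MRP_outgroup` label are hypothetical, and the clade containing all leaves is omitted here as a simplifying choice because it cannot distinguish among the leaves.

```python
def leaves(tree):
    """Return the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clades(tree):
    """Leaf sets of every internal node other than the root, in preorder."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(frozenset(leaves(node)))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found

def matrix_representation(tree, outgroup="MRP_outgroup"):
    """One binary pseudocharacter per clade: members 1, nonmembers 0,
    plus an all-zero row for the root (MRP outgroup)."""
    columns = clades(tree)
    matrix = {t: [1 if t in c else 0 for c in columns]
              for t in sorted(leaves(tree))}
    matrix[outgroup] = [0] * len(columns)
    return matrix


tree = ((("A", "B"), "C"), "D")
for taxon, row in matrix_representation(tree).items():
    print(taxon, row)
```

For the example tree the two pseudocharacters correspond to the clades (A, B, C) and (A, B), so the original tree can be read back from the columns alone.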
Purvis (1995b) developed an alternative matrix representation (Purvis coding) intended to compensate for apparent overweighting of larger trees in standard MRP. His method, which is applicable only to rooted trees, attempted to remove redundant information in standard matrix representations, but was criticized by Ronquist (1996) who argued that the information removed was not redundant. Williams and Humphries (2003) advocated the use of matrix representations based on three-item statements or triplets (Nelson and Ladiges, 1992, 1994b; Nelson and Platnick, 1991; Wilkinson et al., 2001), which are also applicable only to rooted trees, and they used such matrices, and an associated differential weighting, as a yardstick to compare and contrast standard and Purvis matrix representations. Matrix representations can also be based on pairwise pathlength distances between leaves (Lapointe and Cucumel, 1997; Lapointe et al., 2003) and nonbinary discrete characters (Semple and Steel, 2002).
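A triplet-based representation of a rooted tree can be sketched in the same spirit. The code below assumes nested-tuple trees, writes a statement ab|c as `((a, b), c)`, and scores leaves outside a statement as `?`; the function names, the `root` row label, and the missing-data convention are illustrative assumptions, and no weighting of the statements is attempted.

```python
from itertools import combinations

def leaves(tree):
    """Return the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clades(tree):
    """Leaf sets of every internal node other than the root."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(frozenset(leaves(node)))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found

def triplets(tree):
    """All statements ab|c: a and b share a clade that excludes c."""
    everything = set(leaves(tree))
    result = set()
    for clade in clades(tree):
        for a, b in combinations(sorted(clade), 2):
            for c in everything - clade:
                result.add(((a, b), c))
    return result

def triplet_matrix(tree, root="root"):
    """One column per triplet: a and b scored 1, c scored 0,
    all other leaves '?', and the root row scored 0 throughout."""
    taxa = sorted(leaves(tree))
    matrix = {t: "" for t in taxa + [root]}
    for (a, b), c in sorted(triplets(tree)):
        for t in taxa:
            matrix[t] += "1" if t in (a, b) else "0" if t == c else "?"
        matrix[root] += "0"
    return matrix
```

For the tree `((("A", "B"), "C"), "D")` the sketch yields four statements, one for each possible set of three leaves, as expected for a fully resolved four-leaf tree. Because the statement AB|D follows from two different clades, this enumeration also makes visible the kind of overlap between pseudocharacters that the weighting schemes discussed here are meant to address.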
We believe the literature on the relative merits of alternative binary matrix representations of trees to be confused and confusing. Here we use a quantitative measure, cladistic information content (CIC; Thorley et al., 1998), to compare the information content of trees and their matrix representations. Specifically, we revisit Ronquist's (1996) critique of Purvis coding. We demonstrate that the information removed by this method is redundant in the sense that Purvis (1995b) intended, but that it does not remove all such redundant information as Purvis supposed. Building on Ronquist (1996), we discuss the distinction between the representation of a tree and the representation of the data from which a tree has been inferred and its importance in choosing a matrix representation. This leads us to consider logical relations and dependencies among pseudocharacters from the same tree, and highlight fundamental inference rules that form the basis of the logical calculus of cladistic relationships. We show that fractional weighting as proposed by Nelson and Ladiges (1992), and the form of this used by Williams and Humphries (2003), does not account for all logical dependencies between triplets. This leads us to propose an alternative weighting scheme. Finally, we review the results and present a critical discussion of Williams and Humphries' (2003) comparison of alternative methods. We use the terms triplet and three-item