
Molecular Phylogenetics and Evolution 63 (2012) 342–350


E value cutoff and eukaryotic genome content phylogenetics

Jeffrey A. Rosenfeld a, Rob DeSalle b,⇑
a IST/High Performance and Research Computing, University of Medicine and Dentistry of New Jersey, Newark, NJ 07103, United States
b Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, United States

Article info

Article history: Received 22 May 2011; Revised 2 January 2012; Accepted 3 January 2012; Available online 28 January 2012

Keywords: Genome content; Dollo parsimony; e Value; Phylogeny

1055-7903/$ - see front matter © 2012 Elsevier Inc. All rights reserved.
doi:10.1016/j.ympev.2012.01.003

⇑ Corresponding author.
E-mail addresses: [email protected] (J.A. Rosenfeld), [email protected] (R. DeSalle).

Abstract

Genome content analysis has been used as a source of phylogenetic information in large prokaryotic tree of life studies. Recently the sequencing of many eukaryotic genomes has allowed for the similar use of genome content analysis for these organisms too. In this communication we examine the utility of genome content analysis for recovering phylogenetic patterns in several eukaryotic groups. By constructing multiple matrices using different e value cutoffs we examine the dynamics of altering the e value cutoff on five eukaryotic genome data sets. Our analysis indicates that the e value cutoff that is used as a criterion in the construction of the genome content matrix is a critical factor in both the accuracy and information content of the analysis. Strikingly, genome content by itself is not a reliable or accurate source of characters for phylogenetic analysis of the taxa in the five data sets we analyzed. We discuss two problems, small genome attraction and genome duplications, as being involved in the rather poor performance of genome content data in recovering eukaryotic phylogeny.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

A potentially useful way to utilize whole genome information in phylogenomics is to use protein domain content (Yang et al., 2005; Fukami-Kobayashi et al., 2007), or gene or gene family content (Tekaia et al., 1999; Snel et al., 1999, 2005; Wolf et al., 2001a, 2001b, 2002; Gu and Zhang, 2004; Dutilh et al., 2007; Makarova et al., 2007) as phylogenetic characters. In this approach, known as gene content analysis, whole genomes of several taxa are scanned and the presence or absence of genes or gene families in these genomes is assessed (Wolf et al., 2001a, 2001b; Wu et al., 2006; Puigbo et al., 2009, 2010; Snipen and Ussery, 2010; Schliep et al., 2011). The approach used to generate the primary data matrix in most studies is a single linkage method that takes aligned pairs of genes identified by either BLAST or BLAT and groups them into clusters. Single linkage clustering is one of the simplest clustering approaches and relies on the law of syllogism (transitivity): if gene A matches gene B and gene B matches gene C, then genes A, B and C will be clustered together even though genes A and C might not have a significant match. Before the matches are clustered, they are filtered based upon the e-value score, which is a measure of the strength of the match between the sequences.

This approach produces a primary data matrix of zeros (indicating the lack of the gene family) and ones (indicating the presence of the gene family) for each taxon in the analysis. The matrix can either be analyzed using character based approaches (parsimony or likelihood; Snipen and Ussery, 2010; Gu and Zhang, 2004; Huson and Steel, 2004; Mirkin et al., 2003) or the character data can be converted into a table of similarity measures and used in distance analyses (minimum evolution or neighbor joining; Blair et al., 2005; Vishnoi et al., 2010). Most of these approaches have focused on phylogenetic studies in Bacteria (Wolf et al., 2001a, 2002; Dutilh et al., 2007; Medini et al., 2005) and Archaea (Makarova et al., 2007). In general, for these two domains of life, the results are in agreement with the known taxonomy of these prokaryotes. Several websites exist to implement the scoring of orthologous protein domains and gene families to assist in this phylogenetic approach (Tatusov et al., 2003; Jiang et al., 2008; Luo et al., 2007; Mirkin et al., 2003; Wilson et al., 2009; Novichkov et al., 2009; Jensen et al., 2008; Penel et al., 2009; Ranwez et al., 2007; Dubchak and Ryaboy, 2006; Marthey et al., 2008; Miller et al., 2007).

A problem with the genome content approach that we address in this communication arises in the construction of the primary gene content matrix. The problem relates directly to the fact that different e value cutoffs will give different matrices and the different matrices will give different phylogenetic hypotheses (Lienau et al., 2006, 2011). This problem is very similar to the impact of input parameters in multiple alignment on phylogenetics. Different input values of gap costs and transformation costs in the alignment procedure will oftentimes give very different alignments, and these different alignments will give different trees (Gatesy et al., 1994; Wheeler et al., 1995). While some studies have pointed to the problem of e value cutoff in Bacteria and Archaea (Wolf et al., 2001a; Abeln et al., 2007; Lienau et al., 2006), a thorough understanding of the impact of e value cutoff and gene family assignment in gene content studies is lacking. In addition, the genome content approach has not been used widely as a tool for diploid eukaryotic organisms (but see Blair et al., 2005; Ranwez et al., 2007; Wu et al., 2006). There are some major differences in the dynamics of eukaryotic genomes relative to bacterial genomes, the most obvious being whole genome duplications and chromosomal duplications. How these duplications affect gene content assessment is also an important aspect of genome content studies and can be examined using e value as a possible indicator of the impact of duplication.

In this communication we have generated genome content matrices for five eukaryotic systems (Plants, Animals, Mammals, Drosophila and Eukaryotes) for e value cutoffs ranging from e-5 to e-300. The mammal, plant and animal data sets overlap with the Eukaryotic data set. We test the impact of e value on the generation of character matrices for phylogenetic purposes in these five data sets. In addition, since the genome content data are amenable to analysis using weighting based on the direction of change (Dollo parsimony), we have examined the effect of this phylogenetic weighting system using e value as a comparative tool.

2. Methods

2.1. Rationale for exploring e value space

The calculation of a BLAST e-value depends upon both the length of the match between the two sequences and the total amount of sequence in the database (Altschul et al., 1990). The reason for this is that as the size of a database increases, the likelihood of finding a match of a certain length by chance increases. For example, in a small database with only several thousand total nucleotides, it would be very unlikely for a match of 50 contiguous bases to occur randomly between a query sequence and any sequence in the database. Thus, this match would have a very significant e-value (Altschul et al., 1990). But if the size of the database were increased to over 100 billion nucleotides (the current size of GenBank), then it would be much more likely for this match to occur and therefore the e-value is correspondingly less significant. Because of the exorbitant growth of GenBank, the same BLAST query against the whole database performed 10 years ago and performed today would have a different e-value. Therefore the use of a single strict e-value cutoff in studies is not necessarily optimal. In addition, the contention that a specific e-value can distinguish between paralogs and orthologs is not logical. Furthermore, the e-value threshold utilized needs to vary with each study if a different database is used, because the e-value calculation is highly dependent upon the size of the database that is being searched: the same match between two sequences would have highly divergent e-values depending upon the size of the database that is being queried.
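The dependence on database size can be made concrete with the Karlin-Altschul expectation that underlies BLAST statistics, E ≈ m · n · 2^(−S′), where m is the query length, n the total database length and S′ the bit score of the alignment. The following is a minimal sketch only: the bit score, query length and database sizes are illustrative assumptions rather than values from this study, and edge-effect corrections applied by real BLAST implementations are ignored.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Karlin-Altschul expectation E = m * n * 2**(-S'), without edge corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative (assumed) numbers: one fixed alignment, two database sizes.
bit_score = 100.0
query_len = 500
small_db = 5_000             # a few thousand residues
large_db = 100_000_000_000   # on the order of 100 billion residues

e_small = expected_hits(bit_score, query_len, small_db)
e_large = expected_hits(bit_score, query_len, large_db)

# The alignment itself is unchanged; only the search space differs.
print(f"small database: E = {e_small:.3e}")
print(f"large database: E = {e_large:.3e}")
print(f"shift in log10(E): {math.log10(e_large / e_small):.1f}")
```

Under this formula the e-value scales linearly with database size, so a fixed cutoff such as e-50 corresponds to a different underlying alignment score in each database.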

It is also extremely important to take into consideration the length of the gene family matches that are found using the BLAST algorithm. Finally, the level of phylogenetic analysis should also have an impact on the generation of genome content matrices. This impact is caused by the divergence of the primary sequence of genes. In a set of closely related species, one expects the matches to be more precise and hence e values will be impacted.

2.2. Genomes used and generating genome content matrices

All genomes were obtained from NCBI except for those designated otherwise in Table 1. The total numbers of aligned nucleotides for the different groups are: Mammals, 162,002,373; Plants, 186,681,789; Other Animals, 206,083,415; Combined, 554,767,577. The genome content matrices were constructed as in Lienau et al. (2006) and Rosenfeld et al. (2008). Briefly, all of the genomes in a data set were compared against each other using the BLAT program with the default parameters. Matches were filtered at different e-value thresholds and single linkage clustered. Each species was then queried to determine if it contained a gene in each of the clusters. If a species contained a gene in a cluster, it was marked as a 1 for that cluster; otherwise it was marked as a 0. Tabulating these data for all clusters resulted in a binary matrix with each row listing a species and each column listing a cluster. All five matrices are available as Supplemental Files 1–5. For convenience we have given these five data sets three letter abbreviations: Drosophila (dro), Mammal (mam), Plant (pla), animal level (ani) and Eukaryota (euk).
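The filtering, clustering and scoring steps described above can be sketched as follows. This is a hedged illustration, not the pipeline used in the study: the hit tuples, gene names and the convention that the species is the prefix of the gene name are all hypothetical, and the clustering is a plain union-find implementation of single linkage.

```python
from collections import defaultdict


def single_linkage_clusters(hits, cutoff):
    """Cluster genes joined by any pairwise hit with e-value <= cutoff (union-find)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for gene_a, gene_b, evalue in hits:
        if evalue <= cutoff:  # keep only matches at least as strong as the cutoff
            root_a, root_b = find(gene_a), find(gene_b)
            if root_a != root_b:
                parent[root_a] = root_b

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted(clusters.values(), key=lambda c: sorted(c))


def presence_matrix(clusters, species_of, species):
    """Rows are species, columns are clusters; 1 if the species has a gene in the cluster."""
    return {
        sp: [int(any(species_of[g] == sp for g in cluster)) for cluster in clusters]
        for sp in species
    }


# Hypothetical pairwise hits: (gene, gene, e-value).
hits = [
    ("hsa_g1", "mmu_g1", 1e-80),
    ("mmu_g1", "dme_g1", 1e-30),
    ("hsa_g2", "mmu_g2", 1e-12),
]
species_of = {gene: gene.split("_")[0] for hit in hits for gene in hit[:2]}

for cutoff in (1e-50, 1e-10):
    clusters = single_linkage_clusters(hits, cutoff)
    print(cutoff, presence_matrix(clusters, species_of, ["hsa", "mmu", "dme"]))
```

With the looser cutoff, hsa_g1 and dme_g1 fall into the same cluster through mmu_g1 even though they have no direct hit, and the matrix gains a column; changing the cutoff alone changes both the number of characters and their scoring.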

2.3. Tree construction and manipulation

All analyses discussed in this communication were accomplished using PAUP* (Swofford, 2001). The first step in our analyses was to obtain a phylogenetic hypothesis from the entire data set for each of the five matrices we generated. Because the genome content data generated trees that were incongruent with the current understanding of relationships for some of the taxa in our analyses, we also generated trees that represent the most reasonable hypothesis for those taxa (Supplemental File 6; Supplemental Fig. 3). We call these trees "constrained trees". Phylogenetic trees using the genome content data were generated using parsimony and exact searches where possible. For the matrices with large numbers of taxa, we used heuristic searches with Tree Bisection Reconnection (TBR) and 100 random taxon additions to generate a phylogenetic hypothesis. Bootstrap and jackknife values were generated in PAUP* with frequency cutoffs set as indicated in the text. We used the Rohlf index to measure agreement of the trees generated under different conditions with the constrained trees mentioned above and with the trees generated from the parsimony analysis of the genome content data. We also used the consistency index (ci) and the rescaled consistency index (rci) as measures of internal consistency of the trees we generated. For Dollo parsimony we examined "up" Dollo parsimony, which allows only one gain (0 → 1) of a specific character and as many loss (1 → 0) changes as needed to optimize the character on the tree.
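To illustrate how the Dollo criterion counts changes for a single presence/absence character, the sketch below places the one permitted gain at the most recent common ancestor of all taxa scored 1 and counts one loss for every maximal subtree below that node that contains only 0s. It is a simplified illustration, not the PAUP* implementation: it assumes a rooted tree written as nested tuples, an ancestral state of 0, and hypothetical taxon names.

```python
def has_present(node, states):
    """True if any leaf under this node is scored 1."""
    if isinstance(node, str):
        return states[node] == 1
    return any(has_present(child, states) for child in node)


def count_losses(node, states):
    """Count maximal subtrees containing only 0s; each needs one 1 -> 0 loss."""
    if not has_present(node, states):
        return 1
    if isinstance(node, str):
        return 0
    return sum(count_losses(child, states) for child in node)


def dollo_steps(tree, states):
    """Return (gains, losses) for one binary character on a rooted tree."""
    if not has_present(tree, states):
        return (0, 0)
    node = tree
    # Descend to the most recent common ancestor of all taxa scored 1.
    while not isinstance(node, str):
        carriers = [child for child in node if has_present(child, states)]
        if len(carriers) != 1:
            break
        node = carriers[0]
    return (1, count_losses(node, states))


# Hypothetical five-taxon tree and character.
tree = (("A", "B"), ("C", ("D", "E")))
states = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}
print(dollo_steps(tree, states))  # (1, 2): one gain at the root, losses in B and in (D, E)
```

Under equal weighting the same character could be explained by two independent gains (in A and in C); Dollo weighting instead forces a single origin and pays for it with losses.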

To obtain the number of "inflections" caused by changing e value we examined the data for Rohlf indices and determined which e values incurred greater than 25% (either up or down) change between adjacent e values. We also imposed the restriction that the changes had to remain for at least two e value intervals. We compiled these data for intervals of e-20 values.

3. Results and discussion

In all of the following sections we carried out three kinds of analyses. First, we used equal weighted parsimony. Second, we used the same matrices but generated phylogenetic hypotheses using Dollo parsimony. Third, we generated matrices using a size restriction on the BLAST alignments (keeping only alignments >250 residues), with subsequent analysis using equal weighted parsimony.

3.1. Impact of e value on character number (raw gene family number)

The single linkage cluster approach to genome content will group genes in the genomes of target organisms into gene families and determine if all organisms have at least one gene in their genome for each gene family cluster. For large e values, the stringency of the similarity is low and hence the number of hits using BLAST will be larger than for very small e values. The behavior of gene family number versus e value for the five phylogenetic matrices we generated for both equal weighted parsimony and Dollo parsimony is shown in Fig. 1A (left and middle panels), and it agrees with previously reported patterns for genome content studies using whole genomes in bacteria and flies (Lienau et al., 2006 and Rosenfeld et al., 2008, respectively).

Table 1
Sources for genomes.

Species                          Source
Amphimedon_queenslandica         JGI build 1.0
Arabidopsis_thaliana             http://www.ncbi.nlm.nih.gov/sites/genomes
Bos_taurus                       http://www.ncbi.nlm.nih.gov/sites/genomes
Brachypodium_distachyon          http://www.phytozome.net/brachy.php
Branchiostoma_floridae           NZ_ABEP00000000
Caenorhabditis_elegans           http://www.ncbi.nlm.nih.gov/sites/genomes
Canis_familiaris                 http://www.ncbi.nlm.nih.gov/sites/genomes
Carica_papaya                    ftp://asgpb.mhpcc.hawaii.edu/papaya/annotation/
Chlamydomonas_reinhardtii        http://www.phytozome.net/chlamy
Ciona_intestinalis               http://cucumber.genomics.org.cn
Cucumis_sativus                  http://www.ncbi.nlm.nih.gov/sites/genomes
Danio_rerio                      http://www.ncbi.nlm.nih.gov/sites/genomes
Dictyostelium_discoideum         http://dictybase.org
Drosophila_12_species            http://www.ncbi.nlm.nih.gov/sites/genomes
Equus_caballus                   http://www.ncbi.nlm.nih.gov/sites/genomes
Gallus_gallus                    http://www.ncbi.nlm.nih.gov/sites/genomes
Glycine_max                      http://www.phytozome.net/soybean
Homo_sapiens                     http://www.ncbi.nlm.nih.gov/sites/genomes
Hydra_magnipapillata             http://www.ncbi.nlm.nih.gov/sites/genomes
Macaca_mulatta                   http://www.ncbi.nlm.nih.gov/sites/genomes
Monodelphis_domestica            http://www.ncbi.nlm.nih.gov/sites/genomes
Monosiga_brevicollis             JGI build 1.0
Mus_musculus                     http://www.ncbi.nlm.nih.gov/sites/genomes
Nematostella_vectensis           JGI build 1.0
Neurospora_crassa                Broad Institute
Ornithorhynchus_anatinus         http://www.ncbi.nlm.nih.gov/sites/genomes
Oryctolagus_cuniculus            http://www.ncbi.nlm.nih.gov/sites/genomes
Oryza_sativa                     http://rice.plantbiology.msu.edu/index.shtml
Pan_troglodytes                  http://www.ncbi.nlm.nih.gov/sites/genomes
Paramecium_tetraurelia           http://paramecium.cgm.cnrs-gif.fr
Physcomitrella_patens            http://www.phytozome.net/physcomitrella
Populus_trichocarpa              http://www.phytozome.net/poplar
Rattus_norvegicus                http://www.ncbi.nlm.nih.gov/sites/genomes
Saccoglossus_kowalevskii         http://www.ncbi.nlm.nih.gov/sites/genomes
Selaginella_moellendorfii        http://www.phytozome.net/selaginella
Sorghum_bicolor                  http://www.phytozome.net/sorghum
Strongylocentrotus_purpuratus    http://www.ncbi.nlm.nih.gov/sites/genomes
Taeniopygia_guttata              http://www.ncbi.nlm.nih.gov/sites/genomes
Trichoplax_adhaerens             JGI build 1.0
Vitis_vinifera                   http://www.ncbi.nlm.nih.gov/sites/genomes
Xenopus_laevis                   http://ftp.maizesequence.org/
Zea_mays                         http://www.ncbi.nlm.nih.gov/sites/genomes

The optima in the curves in Fig. 1A (left and middle panels) are explained by noting that at very large e values, a large number of sequences that are found in only a single species will be obtained and these are eliminated a priori from the matrices. At the other end of the curve, as the e value decreases, the stringency of a match gets higher and higher and fewer gene families are identified. The figure demonstrates that, for each of the data sets, the highest number of gene families is recovered at an optimal e value of around e-50 to e-100. The number of phylogenetically informative gene families in the data sets is correlated with the number of gene families.

We next examined the relative number of phylogenetically informative (PI) gene families in the data sets. To do this we scaled the number of phylogenetically informative gene families by dividing it by the minimum number of steps in the parsimony tree for each e value. This scaling step allowed us to compare patterns across data sets (Baker et al., 1998). Fig. 1B (left and middle panels) shows the results of comparing the scaled PI for each e value for both equal weighted parsimony and Dollo parsimony. The figure demonstrates that the number of scaled PI characters begins to plateau for three of the data sets (pla, euk and dro) at an e value of about e-50. The remaining two data sets behave very differently from the first three. The mam pattern indicates that e value has no impact on scaled PI. On the other hand, the ani data set shows an almost linear increase in scaled PI with decreasing e value. It therefore appears that optimal e value cutoffs for scaled PI vary from data set to data set. The e value that gives the largest raw number of gene families for recently diverged species is about the same as the e values for scaled PI, at e-50 to e-100.
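The scaling can be sketched as below. A binary character is counted as parsimony-informative when each state occurs in at least two taxa; the tree length is supplied by the caller because in practice it would come from the parsimony search. The matrix and the tree length of 4 steps are hypothetical.

```python
def is_informative(column):
    """A binary character is parsimony-informative if each state occurs in >= 2 taxa."""
    ones = sum(column)
    zeros = len(column) - ones
    return ones >= 2 and zeros >= 2


def scaled_pi(matrix, tree_length):
    """Number of informative characters divided by the number of steps on the tree."""
    columns = list(zip(*matrix.values()))
    n_informative = sum(is_informative(column) for column in columns)
    return n_informative / tree_length


# Hypothetical presence/absence matrix: rows are taxa, columns are gene families.
matrix = {
    "A": [1, 1, 0, 1],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 0],
    "D": [1, 0, 1, 0],
    "E": [1, 0, 0, 0],
}

# Columns 2 and 3 are informative; column 1 is constant and column 4 is a singleton.
print(scaled_pi(matrix, tree_length=4))  # 0.5
```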

3.2. Impact of e value on internal character consistency

We measured the agreement of the phylogenetic hypotheses generated using the genome content matrices with standard consistency measures and the Rohlf index. While the standard consistency measure (consistency index [CI]) can measure internal consistency of a data matrix, it is dependent on the number of taxa in an analysis (Sanderson and Donoghue, 1992). The retention index (RI) is another measure of internal data set consistency and it is not dependent on taxon number. The Rohlf index allows for direct comparison of a tree with a reference tree, here either the parsimony tree from the analysis of the data set or a constrained tree based on accepted relationships. The Rohlf index measures the ratio of nodes in a query tree that also exist in a reference tree. These comparisons for the five data sets are given in Fig. 2A (left for Dollo parsimony and middle for equal weighted parsimony). For the rci, four of the curves show plateaus at e values between e-50 and e-100. The ani data set, however, shows a pattern of increasing consistency with decreasing e value.
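Following the description above (the ratio of nodes in a query tree that also exist in a reference tree), a node-sharing score can be sketched as the fraction of clades in the query tree that are also present in the reference tree. This is a simplification under stated assumptions: trees are rooted and written as nested tuples, both trees contain the same taxa, and the exact normalization used by PAUP* for the Rohlf index may differ. The two small trees are invented.

```python
def clades(tree):
    """Return the non-trivial clades (frozensets of taxa) of a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_taxa = leaves(tree)
    found.discard(all_taxa)  # the root clade is shared by every tree on these taxa
    return found


def shared_node_fraction(query, reference):
    """Fraction of clades in the query tree that also occur in the reference tree."""
    query_clades, reference_clades = clades(query), clades(reference)
    if not query_clades:
        return 1.0
    return len(query_clades & reference_clades) / len(query_clades)


# Invented example: the query tree misplaces Pan relative to the reference.
reference = ((("Homo", "Pan"), "Macaca"), ("Mus", "Rattus"))
query = ((("Homo", "Macaca"), "Pan"), ("Mus", "Rattus"))

print(round(shared_node_fraction(query, reference), 3))  # 0.667
```

A value of 1.0 means every resolved clade in the query is also in the reference, which is the situation reported for the dro data set.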

The curves for the Rohlf indices are much more complex and show extreme instability in agreement with the total evidence tree as e value varies. The dro data set is unique with respect to the rci and Rohlf indices because the parsimony tree for this data set and the accepted tree are the same. The other data sets give parsimony trees whose topologies are different from the accepted tree (see Table 2). Consequently, the dro data set is the most stable of the five data sets, as it rises to a maximum Rohlf index at about e-50 and plateaus there up to a value of e-240. The other data sets fluctuate between strong agreement (Rohlf values close to 1.0) and strong disagreement (lower Rohlf indices) over long ranges of e value when compared to the parsimony tree of each data set (Fig. 2B, left and middle panels).

We also compared the trees generated by the various e value cutoffs with constrained trees based on the accepted topologies (Supplemental Fig. 1) for the five data sets. These comparisons are shown in Fig. 2C and indicate that there is a low degree of agreement of the constrained topology (Supplemental Fig. 1) with the topologies obtained for the various e values. Since the dro data set tree and the dro constrained tree are the same, the pattern described above, where a maximum Rohlf index is obtained at about e-50 and remains maximized to about e-240, is the same for the constrained tree. The four remaining data sets have very poor node consistency with their constrained trees (Fig. 2C, middle panel). The ani data set behaves the worst, with the lowest Rohlf indices, and the pla data set the best, with the highest Rohlf indices, but overall the four remaining data sets perform very poorly when compared to the accepted topologies (Table 2).

Fig. 1. Graphs of e value (x axis) versus (A) gene families; (B) scaled phylogenetically informative sites (see text). The left graphs are for Dollo parsimony analysis. The middle graphs are for equal weighting. The right graphs are for size restricted e value matrices. Color code is shown in the upper right hand of the figures.

Fig. 2. Graphs of e value (x axis) versus (A) rci, (B) Rohlf index on tree from each data set, (C) Rohlf index on constrained tree (Supplemental Fig. 1). The left graphs are for Dollo parsimony analysis. The middle graphs are for equal weighting. The right graphs are for size restricted e value matrices. Color code is shown in the upper right hand of the figures.

Table 2
Rohlf indices for the analyses used in this paper. All indices are calculated relative to the accepted phylogenies for the five data sets (see Supplemental Fig. 1).

               dro    mam      pla      ani      euk
Equal wt.      1.0    0.250    0.306    0.088    0.233
Dollo          1.0    0.367    0.510    0.054    0.190
Size restr.    1.0    0.167    0.286    0.088    0.196

The curves in Fig. 2 give us the opportunity to examine whether specific e values have consistent effects on agreement with phylogenetic hypotheses. To do this, for each data set and each of the three kinds of analyses we counted the number of "inflections" in the Rohlf index curves in Fig. 2B and C for ranges of e values. We then examined whether particular ranges incurred more inflections (defined here as the change of a Rohlf index by 20% from one e value interval to the next). Fig. 4 shows the distribution of inflections as a function of e value and indicates that the e values at which a change in cutoff alters the consistency of a data set are non-randomly distributed. For instance, there are very few inflections for e values between e-5 and e-100 and no inflections for e values smaller than e-240. However, between e values of e-120 and e-220, a great deal of fluctuation occurs in the data sets. The e value with the greatest number of inflections is e-120, indicating that this e value may be a critical one for detecting orthology of gene families in these data sets.

3.3. Effect of length of match on e value

The length of a match has a direct impact on the estimation of the e value. Short matches of 100 residues or less will oftentimes give highly significant e values and result in the inclusion of a gene family based only on the conservation of a short motif in a protein. Consequently, we constrained the length of the match used in generating matrices to be greater than 250 residues. At this stringent threshold for matching, it is not mathematically possible for there to be an e value weaker (numerically larger) than e-60, because a match of 250 residues in length is a strong match. When this size restriction is imposed, the maximal gene family number starts at e-60, plateaus to e-150 and then drops off as e value is lowered (Fig. 1A, right panel). On the other hand, as e value is lowered, the scaled PI rises with slight leveling at e-150 for all data sets (Fig. 1B, right panel). As with the unrestricted size data sets using equal weighted parsimony (Fig. 2B and C, middle panels), the size restricted data sets show fluctuations in Rohlf indices as e value is lowered (Fig. 2B and C, right panels).

3.4. Phylogenetic hypotheses from eukaryotic genome content matrices

In general, the genome content information gives robust phylogenetic trees, but trees that are incongruent at several nodes with the accepted phylogenetic relationships of these organisms (Fig. 3; Supplemental Fig. 1). The phylogenetic hypotheses generated using genome content data are entirely congruent with only one well known organismal phylogeny (Drosophila 12 Genomes Consortium, 2007), that of the dro data set (Table 2). The pla data set has the next highest congruence with known relationships (Bowman et al., 2007; Angiosperm Phylogeny Group III, 2009; Chase and Reveal, 2009), where the major agreement is that the monocots and dicots are recovered and the relationships within monocots are congruent with known relationships within that group. Relationships within the dicots in the pla data set are not recovered, however. For the mam data set, three major clades of taxa in the data set (Primates [Homo, Pan, Macaca], Rodentia [Mus, Rattus] and Scrotifera [Canis, Bos, Equus]) are recovered using genome content data and are congruent with accepted mammalian relationships (Murphy et al., 2001, 2007; Asher and Helgen, 2010). However, the relationships of these groups to each other and, more distinctly, the placement of Oryctolagus in the mammal genome content tree are problematic.

For the ani and euk trees, some relationships are recovered that are accepted by phylogeneticists working at this level (Halanych, 2004; Philippe and Telford, 2006; Dunn et al., 2008; DeSalle and Schierwater, 2008). Two glaring exceptions are the placements of the two protostome model organisms, C. elegans and D. melanogaster. In addition, the Cnidaria (Nematostella and Hydra) are never found as monophyletic in the genome content trees. These two organisms are placed in very unconventional positions in both phylogenies. Specifically, Nematostella is placed within the Bilateria, and Hydra resides toward the base of the tree under all treatments of the data. It is notable that the ani data set has the lowest Rohlf index of all five of the data sets under all treatments (Table 2). Imposing Dollo parsimony (Supplemental Fig. 2) or restricting the orthology searches to longer stretches of similarity (Supplemental Fig. 3) does not improve the congruence of the resulting trees with the accepted phylogeny (Table 2; Supplemental Figs. 2 and 3).

3.5. Genome duplications and eukaryotic genome content phylogenies

These results indicate that the genome content data are by themselves poor characters for recovering accepted topologies for four out of five of the data sets. The one major difference between the dro data set and all of the others concerns duplication events. The taxa in the dro data set have undergone chromosomal rearrangement but no extreme genome or segmental duplications. Duplications do occur in the 12 taxa in the dro data set, but with an "excess of low-divergence duplicated genes in the terminal branches of the 12 species tree" (Osada and Innan, 2008). Hence, the low divergence of genes involved in duplications does not complicate the establishment of gene family characters in the dro matrix. The ani and euk data sets involve taxa that have experienced several well-known genome duplications (Meyer and Schartl, 1999; Wang and Gu, 2000; Escriva et al., 2002; Pebusque et al., 1998; Gibson and Spring, 2000). Such genome duplications obscure, or at best make difficult, the discovery of the orthology of gene families and result in phylogenetic signal incongruent with the accepted phylogenies. Likewise, the taxa in the pla data set are known to undergo polyploidization and segmental duplications (Vision et al., 2000; Blanc and Wolfe, 2004; Tang et al., 2008a, 2008b). The major genome duplications involved in mammalian evolution occurred prior to the divergence of the taxa in the mam data set. However, the taxa in the mam data set have undergone large amounts of chromosomal rearrangement (which might have resulted in segmental duplications, as with plants). Taken together, these results indicate that genome content phylogenetic matrices perform poorly, most likely as a result of the inability of e values to discern between orthology and paralogy of gene families.

Fig. 3. Phylogenetic trees generated using the genome content data generated for e values from e-5 to e-300. The matrices for each e value were elided together and the elided matrix was analyzed as in Section 2 to produce the trees in the figure. The red asterisks indicate nodes that are not supported by bootstrap values of 100%. (A) dro; (B) pla; (C) mam; (D) ani; (E) euk.

Fig. 4. Bar graph of Rohlf index "inflection" (see text) versus e value compiled over all data sets. Note the concentration of "inflection" between e-120 and e-200.

Fig. 5. Graph of genome size versus presence of gene family. Correlation coefficients are given next to the three clusters of organisms we used in this study. The plant correlation was the only one that was not significant. The correlation for all points in the graph is r = 0.3061 and is significant at p < 0.99.


3.6. Small genome attraction as an additional problem in major departures of phylogenetic incongruence

Highly divergent or highly reduced genomes will have a large number of gene families coded as missing or as zeros. Such coding will lead to what has been called small genome attraction. Small genome attraction is a phenomenon first mentioned by Rivera and Lake (2004) and Lake and Rivera (2004) in analysis of gene content and phylogeny of the three major domains of life. The problem arises when an organism's genome is so small that single linkage clustering will result in the character state of missing (0) being predominant for that taxon in the matrix. Species in the same matrix with a large number of missing gene families will "attract" each other in phylogenetic analysis. Another problem with small genomes is that they tend to attach randomly to branches in phylogenetic trees, whether those branches are small genome branches or not.

Fig. 6. Bar graph of gene family presence for all of the taxa in the euk data set. The arrows are discussed in the text.

To characterize the impact of "small genomes" we first characterized the number of gene families present in the e-100 euk data matrix by counting the number of 1's in each taxon. We then graphed these values relative to genome size. Fig. 5 shows these measures graphed versus genome size for three phylogenetic clusters of taxa: plants, mammals and eukaryotes minus mammals. The correlation of genome size with presence of gene families is significant for mammals and eukaryotes minus mammals and borderline significant for plants. Over all of the eukaryotes in the euk data set there is a significant correlation (r = 0.3061) that intercepts the X axis at about 4000 gene families.
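The counting and correlation steps can be sketched as follows. The taxa, matrix rows and genome sizes are hypothetical placeholders, not data from this study, and the correlation is a plain Pearson coefficient computed by hand.

```python
import math


def gene_family_counts(matrix):
    """Number of gene families scored present (1) for each taxon."""
    return {taxon: sum(row) for taxon, row in matrix.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical presence/absence rows and genome sizes (Mb).
matrix = {
    "taxon_a": [1, 1, 1, 1, 0, 0],
    "taxon_b": [1, 1, 0, 0, 0, 0],
    "taxon_c": [1, 1, 1, 1, 1, 1],
    "taxon_d": [1, 0, 0, 0, 0, 0],
}
genome_size_mb = {"taxon_a": 400, "taxon_b": 150, "taxon_c": 900, "taxon_d": 100}

counts = gene_family_counts(matrix)
taxa = sorted(matrix)
r = pearson_r([genome_size_mb[t] for t in taxa], [counts[t] for t in taxa])
print(counts)
print(f"r = {r:.3f}")
```

Taxa with few 1's in their row are the candidates for small genome attraction, whether or not their genomes are small in nucleotide terms.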

Fig. 6 shows bar graphs of gene family number for each taxon in the euk data set. The arrows in the figure point to critical taxa in the data sets we discuss in this paper. The red¹ arrows point to the two protostome species that are problematic in both the ani and euk data sets with respect to phylogenetic position (Drosophila and Caenorhabditis). Caenorhabditis has an extremely low number of present gene families despite its genome having nearly 20,000 genes and not being alarmingly small with respect to numbers of nucleotides. This small number of present gene families likely has an impact on its position in the various trees we generated for this study and can explain much of the incongruence for this species with respect to accepted phylogenies. In fact, Caenorhabditis is placed in all analyses with the small genome taxa Neurospora and Paramecium. The blue arrow singles out the bilaterian taxon Ciona intestinalis, which is also placed in unaccepted positions in the euk and ani trees. Ciona has the smallest number of gene families present of any of the deuterostome taxa in the euk and ani data sets, and its small genome factor may explain the incongruent position of this taxon in the genome content trees.

The green arrow points to Xenopus laevis, a problematic taxon in the ani and euk trees (Fig. 3), as it is inferred to have a more primitive position in the phylogeny than fish. Xenopus has the smallest number of gene families of all of the vertebrates in these data sets. The purple arrow points to Oryctolagus cuniculus, a mammalian taxon mentioned earlier that is considered misplaced in all of the mam data set trees (Fig. 3) because accepted phylogenies most commonly place it as sister to the Rodentia. This species has the smallest number of gene families present of all of the mammals in the data sets. Finally, the connected dotted black line indicates the two cnidarian taxa (Nematostella vectensis and Hydra magnipapillata) that are problematic in the euk and ani data sets, as they are never seen as sister taxa in a monophyletic grouping. Rather, Hydra appears to be “attracted” to join with another lower metazoan, Trichoplax, which also has one of the smaller genomes in the data set. It appears, then, that small genomes in the present data set attract each other, as in the cases of Ciona to Drosophila, Hydra to Trichoplax and Caenorhabditis to Neurospora, and also cause taxa to “float” in phylogenetic analysis, as in the case of Xenopus and Oryctolagus.

¹ For interpretation of color in Figs. 1–3, 5 and 6, the reader is referred to the web version of this article.

4. Conclusions

Clearly, the genome content data by themselves are not perfect predictors of phylogeny in most of the data sets we examined regardless of e value, character weighting or restriction on matrix construction on length of sequence match (Table 1). However, they are not completely incongruent and do contain some phylogenetic signal. In combination with sequence information, they may be useful in phylogenetic studies. This approach has been shown to be useful in Bacteria (Lienau et al., 2011).

Varying e value to construct matrices is an important consideration. Previous studies showing that varying alignment parameters to produce matrices can alter phylogenetic hypotheses (Gatesy et al., 1994; Wheeler et al., 1995) are similar to the results shown here for varying e value. It appears that the degree of evolutionary divergence has an impact on the optimal e value for obtaining the maximum number of gene families (Fig. 1). Specifically, the deeper phylogenetic matrices (euk – 1 billion years; pla – 400 million years; ani – 600 million years) have optimal e values at about e-50. The shallower phylogenetic matrices (mam – 100 million years; dros – 100 million years) have optimal e values at about e-150. These results reflect the higher similarity of gene sequences in the shallower phylogenetic matrices. When there is more similarity the matches are longer, and this pushes the e values toward smaller (more stringent) values. We examine the phenomenon of length and e value in more detail below. It also appears that e value cutoffs are somewhat unstable with respect to consistency of trees. In other words, changing e values slightly will in some cases cause somewhat drastic changes in consistency of the phylogenetic hypotheses generated (Fig. 4). There appear to be specific e values that are points of instability. For instance, e value cutoffs between e-120 and e-200 show a high degree of “inflection” with respect to the Rohlf index. Apparently, slightly changing the e value cutoff will produce somewhat different homology statements and result in relatively different matrices that yield inconsistent tree topologies. This result is probably caused by the cutoff values in this range being highly prone to making different orthology statements about gene family membership.
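How a change in cutoff alters the homology statements in a presence/absence matrix can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the matrix construction pipeline used in this study: the hit list, taxon names and family names are hypothetical, the function `presence_matrix` is introduced here for illustration, and a family is scored as present whenever a taxon has at least one hit at or below the cutoff.

```python
def presence_matrix(hits, taxa, families, cutoff):
    """Score a family as present (1) in a taxon if any hit has E value <= cutoff."""
    present = {(fam, tax) for fam, tax, evalue in hits if evalue <= cutoff}
    return {
        tax: [1 if (fam, tax) in present else 0 for fam in families]
        for tax in taxa
    }


# Hypothetical search hits: (gene family, taxon, best E value).
toy_hits = [
    ("fam1", "taxonA", 1e-180),
    ("fam1", "taxonB", 1e-160),
    ("fam1", "taxonC", 1e-60),
    ("fam2", "taxonA", 1e-120),
    ("fam2", "taxonB", 1e-40),
    ("fam3", "taxonC", 1e-155),
]
toy_taxa = ["taxonA", "taxonB", "taxonC"]
toy_families = ["fam1", "fam2", "fam3"]

# Cutoffs chosen to mirror the range discussed in the text.
cutoffs = {"e-50": 1e-50, "e-100": 1e-100, "e-150": 1e-150}

matrices = {
    label: presence_matrix(toy_hits, toy_taxa, toy_families, value)
    for label, value in cutoffs.items()
}

for label, matrix in matrices.items():
    print(label)
    for tax in toy_taxa:
        print("  ", tax, matrix[tax])
```

In this toy example, tightening the cutoff from e-50 to e-100 removes fam1 from taxonC, and tightening it further to e-150 removes fam2 from taxonA, so each cutoff produces a different character matrix from the same set of hits.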

Finally, “small genome” attraction can be very problematic for reconstruction of phylogenetic relationships using gene family presence/absence data. “Small genomes” can be produced by physically small genomes and by highly divergent genomes where the presence of genes is overlooked due to extreme divergence. We demonstrate here a clear correlation of “small genomes” with incongruent placement of taxa.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ympev.2012.01.003.

References

Abeln, S., Teubner, C., Deane, C.M., 2007. Using phylogeny to improve genome-wide distant homology recognition. PLoS Comput. Biol. 3 (1), e3.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

Angiosperm Phylogeny Group III, 2009. An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III. Bot. J. Linn. Soc. 161, 105–121.

Asher, R.J., Helgen, K.M., 2010. Nomenclature and placental mammal phylogeny. BMC Evol. Biol. 10, 102.

Baker, R.H., Yu, X., DeSalle, R., 1998. Assessing the relative contribution of molecular and morphological characters in simultaneous analysis trees. Mol. Phylogenet. Evol. 9, 427–436.

Blair, J.E., Shah, P., Hedges, S.B., 2005. Evolutionary sequence analysis of complete eukaryote genomes. BMC Bioinform. 6, 53.

Blanc, G., Wolfe, K.H., 2004. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16, 1667–1678.

Bowman, J.L., Floyd, S.K., Sakakibara, K., 2007. Green genes – comparative genomics of the green branch of life. Cell 129, 229–234.

Chase, M.W., Reveal, J.L., 2009. A phylogenetic classification of the land plants to accompany APG III. Bot. J. Linn. Soc. 161, 122–127.

DeSalle, R., Schierwater, B., 2008. An even “newer” animal phylogeny. Bioessays 30, 1043–1047.

Drosophila 12 Genomes Consortium, 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218.

Dubchak, I., Ryaboy, D.V., 2006. VISTA family of computational tools for comparative analysis of DNA sequences and whole genomes. Meth. Mol. Biol. 338, 69–89.

Dunn, C.W., Hejnol, A., Matus, D.Q., Pang, K., Brown, W.E., et al., 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452, 745–749.

Dutilh, B.E., van Noort, V., van der Heijden, R.T., Boekhout, T., Snel, B., Huynen, M.A., 2007. Assessment of phylogenomic and orthology approaches for phylogenetic inference. Bioinformatics 23, 815–824.

Escriva, H., Manzon, L., Youson, J., Laudet, V., 2002. Analysis of lamprey and hagfish genes reveals a complex history of gene duplications during early vertebrate evolution. Mol. Biol. Evol. 19, 1440–1450.

Fukami-Kobayashi, K., Minezaki, K., Tateno, Y., Nishikawa, K., 2007. A tree of life based on protein domain organizations. Mol. Biol. Evol. 24, 1181–1189.

Gatesy, J., DeSalle, R., Wheeler, W.C., 1994. Alignment-ambiguous nucleotide sites and the exclusion of data. Mol. Phylogenet. Evol. 2, 152–157.

Gibson, T.J., Spring, J., 2000. Evidence in favour of ancient octaploidy in the vertebrate genome. Biochem. Soc. Trans. 28, 259–264.

Gu, X., Zhang, H., 2004. Genome phylogenetic analysis based on extended gene contents. Mol. Biol. Evol. 21, 1401–1408.

Halanych, K.M., 2004. The new view of animal phylogeny. Annu. Rev. Ecol. Evol. Syst. 35, 229–256.

Huson, D.H., Steel, M., 2004. Phylogenetic trees based on gene content. Bioinformatics 20, 2044–2049.

Jensen, L.J., Julien, P., Kuhn, M., von Mering, C., Muller, J., Doerks, T., Bork, P., 2008. EggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 36, D250–D254.

Jiang, L.W., Lin, K.L., Lu, C.L., 2008. OGtree: a tool for creating genome trees of prokaryotes based on overlapping genes. Nucleic Acids Res. 36, W475–W480.

Lake, J.A., Rivera, M.C., 2004. Deriving the genomic tree of life in the presence of horizontal gene transfer: conditioned reconstruction. Mol. Biol. Evol. 21, 681–690.

Lienau, E.K., DeSalle, R., Rosenfeld, J.A., Planet, P.J., 2006. Reciprocal illumination in the gene content tree of life. Syst. Biol. 55, 441–453.

Lienau, K., DeSalle, R., Allard, M., Brown, E.W., Swofford, D., Rosenfeld, J.A., Sarkar, I.N., Planet, P.J., 2011. The mega-matrix tree of life: using genome-scale horizontal gene transfer and sequence evolution data as information about the vertical history of life. Cladistics 27, 1–12.

Luo, Y., Fu, C., Zhang, D.Y., Lin, K., 2007. BPhyOG: an interactive server for genome-wide inference of bacterial phylogenies based on overlapping genes. BMC Bioinform. 8, 266.

Makarova, K.S., Sorokin, A.V., Novichkov, P.S., Wolf, Y.I., Koonin, E.V., 2007. Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biol. Direct. 2, 33.

Marthey, S., Aguileta, G., Rodolphe, F., Gendrault, A., Giraud, T., Fournier, E., Lopez-Villavicencio, M., Gautier, A., Lebrun, M.H., Chiapello, H., 2008. FUNYBASE: a FUNgal phYlogenomic dataBASE. BMC Bioinform. 27, 456.

Medini, D., Donati, C., Tettelin, H., Masignani, V., Rappuoli, R., 2005. The microbial pan-genome. Curr. Opin. Genet. Dev. 15, 589–594.

Meyer, A., Schartl, M., 1999. Gene and genome duplications in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr. Opin. Cell Biol. 11, 699–704.

Miller, W., Rosenbloom, K., Hardison, R.C., Hou, M., Taylor, J., Raney, B., Burhans, R., King, D.C., Baertsch, R., Blankenberg, D., Kosakovsky Pond, S.L., Nekrutenko, A., Giardine, B., Harris, R.S., Tyekucheva, S., Diekhans, M., Pringle, T.H., Murphy, W.J., Lesk, A., Weinstock, G.M., Lindblad-Toh, K., Gibbs, R.A., Lander, E.S., Siepel, A., Haussler, D., Kent, W.J., 2007. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 17, 1797–1808.

Mirkin, B.G., Fenner, T.I., Galperin, M.Y., Koonin, E.V., 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 3, 2.

Murphy, W.J., Eizirik, E., Johnson, W.E., Zhang, Y.P., Ryder, O.A., O’Brien, S.J., 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409, 614–618.

Murphy, W.J., Pringle, T.H., Crider, T.A., Springer, M.S., Miller, W., 2007. Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res. 17, 413–421.

Novichkov, P.S., Ratnere, I., Wolf, Y.I., Koonin, E.V., Dubchak, I., 2009. ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res. 37, D448–D454.

Osada, N., Innan, H., 2008. Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet. 4, e1000305.

Pebusque, M.J., Coulier, F., Birnbaum, D., Pontarotti, P., 1998. Ancient large-scale genome duplications: phylogenetic and linkage analyses shed light on chordate genome evolution. Mol. Biol. Evol. 15, 1145–1159.

Penel, S., Arigon, A.M., Dufayard, J.F., Sertier, A.S., Daubin, V., Duret, L., Gouy, M., Perrière, G., 2009. Databases of homologous gene families for comparative genomics. BMC Bioinform. 16, S3.

Philippe, H., Telford, M.J., 2006. Large-scale sequencing and the new animal phylogeny. Trends Ecol. Evol. 21, 614–620.

Puigbo, P., Wolf, Y.I., Koonin, E.V., 2009. Search for a Tree of Life in the thicket of the phylogenetic forest. J. Biol. 8, 59.

Puigbo, P., Wolf, Y.I., Koonin, E.V., 2010. The tree and net components of prokaryote evolution. Genome Biol. Evol. 2010 (2), 745–756.

Ranwez, V., Delsuc, F., Ranwez, S., Belkhir, K., Tilak, M.K., Douzery, E.J., 2007. OrthoMaM: a database of orthologous genomic markers for placental mammal phylogenetics. BMC Evol. Biol. 30, 241.

Rivera, M.C., Lake, J.A., 2004. The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature 2004 (431), 152–155.

Rosenfeld, J.A., DeSalle, R., Lee, E.K., O’Grady, P., 2008. Using whole genome presence/absence data to untangle function in 12 Drosophila genomes. Fly 2, 291–299.

Sanderson, M., Donoghue, M., 1992. Patterns of variation in levels of homoplasy. Evolution 43, 1781–1795.

Schliep, K., Lopez, P., Lapointe, F.-J., Bapteste, E., 2011. Harvesting evolutionary signals in a forest of prokaryotic gene trees. Mol. Biol. Evol. 28, 1393–1405.

Snel, B., Bork, P., Huynen, M.A., 1999. Genome phylogeny based on gene content. Nat. Genet. 21, 108–110.

Snel, B., Huynen, M.A., Dutilh, B.E., 2005. Genome trees and the nature of genome evolution. Annu. Rev. Microbiol. 59, 191–209.

Snipen, L., Ussery, D.W., 2010. Standard operating procedure for computing pangenome trees. Stand. Genomic Sci. 2, 1.

Swofford, D., 2001. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods), 4.0b7 Beta Version. Sinauer Associates, Sunderland, MA.

Tang, H., Bowers, J.E., Wang, X., Ming, R., Alam, M., Paterson, A.H., 2008a. Synteny and collinearity in plant genomes. Science 320, 486–488.

Tang, H., Wang, X., Bowers, J.E., Ming, R., Alam, M., Paterson, A.H., 2008b. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 18, 1944–1954.

Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., et al., 2003. The COG database: an updated version includes eukaryotes. BMC Bioinform. 4, 41.

Tekaia, F., Lazcano, A., Dujon, B., 1999. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9, 550–557.

Vishnoi, A., Roy, R., Prasad, H.K., Bhattacharya, A., 2010. Anchor-based whole genome phylogeny (ABWGP): a tool for inferring evolutionary relationship among closely related microorganisms. PLoS One 5 (11), e14159.

Vision, T.J., Brown, D.G., Tanksley, S.D., 2000. The origins of genomic duplications in Arabidopsis. Science 290, 2114–2121.

Wang, Y., Gu, X., 2000. Evolutionary patterns of gene families generated in the early stage of vertebrates. J. Mol. Evol. 51, 88–96.

Wheeler, W.C., Gatesy, J., DeSalle, R., 1995. Elision: a method for accommodating multiple molecular sequence alignments with alignment-ambiguous sites. Mol. Phylogenet. Evol. 4, 1–9.

Wilson, D., Pethica, R., Zhou, Y., Talbot, C., Vogel, C., Madera, M., Chothia, C., Gough, J., 2009. SUPERFAMILY – sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386.



Wolf, Y.I., Rogozin, I.B., Kondrashov, A.S., Koonin, E.V., 2001a. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 11 (3), 356–372.

Wolf, Y.I., Rogozin, I.B., Grishin, N.V., Tatusov, R.L., Koonin, E.V., 2001b. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1 (8), 1792–1797.

Wolf, Y.I., Rogozin, I.B., Grishin, N.V., Koonin, E.V., 2002. Genome trees and the tree of life. Trends Genet. 18, 472–479.

Wu, F., Mueller, L.A., Crouzillat, D., Petiard, V., Tanksley, S.D., 2006. Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: a test case in the euasterid plant clade. Genetics 174, 1407–1420.

Yang, S., Doolittle, R.F., Bourne, P.E., 2005. Phylogeny determined by protein domain content. Proc. Natl. Acad. Sci. USA 102, 373–378.
