E value cutoff and eukaryotic genome content phylogenetics

Jeffrey A. Rosenfeld a, Rob DeSalle b,⇑

a IST/High Performance and Research Computing, University of Medicine and Dentistry of New Jersey, Newark, NJ 07103, United States
b Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, United States

Article history: Received 22 May 2011; Revised 2 January 2012; Accepted 3 January 2012; Available online 28 January 2012

Keywords: Genome content; Dollo parsimony; e Value; Phylogeny

Abstract

Genome content analysis has been used as a source of phylogenetic information in large prokaryotic tree of life studies. Recently, the sequencing of many eukaryotic genomes has allowed similar use of genome content analysis for these organisms. In this communication we examine the utility of genome content analysis for recovering phylogenetic patterns in several eukaryotic groups. By constructing multiple matrices using different e value cutoffs, we examine the effect of altering the e value cutoff on five eukaryotic genome data sets. Our analysis indicates that the e value cutoff used as a criterion in the construction of the genome content matrix is a critical factor in both the accuracy and the information content of the analysis. Strikingly, genome content by itself is not a reliable or accurate source of characters for phylogenetic analysis of the taxa in the five data sets we analyzed. We discuss two problems, small genome attraction and genome duplications, as being involved in the rather poor performance of genome content data in recovering eukaryotic phylogeny.

© 2012 Elsevier Inc. All rights reserved.

1.
Introduction

A potentially useful way to utilize whole genome information in phylogenomics is to use protein domain content (Yang et al., 2005; Fukami-Kobayashi et al., 2007) or gene or gene family content (Tekaia et al., 1999; Snel et al., 1999, 2005; Wolf et al., 2001a, 2001b, 2002; Gu and Zhang, 2004; Dutilh et al., 2007; Makarova et al., 2007) as phylogenetic characters. In this approach, known as gene content analysis, whole genomes of several taxa are scanned and the presence or absence of genes or gene families in these genomes is assessed (Wolf et al., 2001a, 2001b; Wu et al., 2006; Puigbo et al., 2009, 2010; Snipen and Ussery, 2010; Schliep et al., 2011).

The approach used to generate the primary data matrix in most studies is a single linkage method that takes pairs of genes aligned by either BLAST or BLAT and groups them into clusters. Single linkage clustering is one of the simplest types of clustering approaches and relies on the law of syllogism (transitivity): if gene A matches gene B and gene B matches gene C, genes A, B and C will be clustered together even though genes A and C might not have a significant match. Before the matches are clustered, they are filtered by e value, the number of matches of equal or better score expected by chance, so that a lower e value indicates a stronger match between the sequences. This approach produces a primary data matrix of zeros (indicating the lack of the gene family) and ones (indicating the presence of the gene family) for each taxon in the analysis. The matrix can either be analyzed using character based approaches (parsimony or likelihood; Snipen and Ussery, 2010; Gu and Zhang, 2004; Huson and Steel, 2004; Mirkin et al., 2003) or the character data can be converted into a table of similarity measures and used in distance analyses (minimum evolution or neighbor joining; Blair et al., 2005; Vishnoi et al., 2010).
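The matrix construction described above can be illustrated with a minimal sketch. It assumes that pairwise hits are already available as (query, subject, e value) triples and that each gene is labeled by a (taxon, gene) pair. The function names, the input format and the toy hits are hypothetical and chosen only for illustration; this is not the pipeline used in any of the cited studies.

```python
from collections import defaultdict


def single_linkage_clusters(hits, e_cutoff):
    """Cluster genes into families by single linkage.

    hits: iterable of (query, subject, e_value) triples, where each gene
    is a (taxon, gene_id) tuple (an assumed, hypothetical input format).
    """
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for query, subject, e_value in hits:
        root_q, root_s = find(query), find(subject)  # registers both genes
        if e_value <= e_cutoff and root_q != root_s:
            parent[root_q] = root_s  # merge the two clusters

    families = defaultdict(list)
    for gene in list(parent):
        families[find(gene)].append(gene)
    return sorted(sorted(members) for members in families.values())


def presence_absence(families, taxa):
    """Return one row of 0/1 characters per taxon, one column per family."""
    return {
        taxon: [int(any(gene[0] == taxon for gene in family)) for family in families]
        for taxon in taxa
    }


# Toy hits (hypothetical): a1 matches b1, b1 matches c1, a2 weakly matches c2.
hits = [
    (("A", "a1"), ("B", "b1"), 1e-50),
    (("B", "b1"), ("C", "c1"), 1e-8),
    (("A", "a2"), ("C", "c2"), 1e-3),
]

for cutoff in (1e-2, 1e-5, 1e-10):
    families = single_linkage_clusters(hits, cutoff)
    print(cutoff, presence_absence(families, ["A", "B", "C"]))
```

With these toy hits, a cutoff of 1e-5 places a1, b1 and c1 in one family even though a1 and c1 have no direct hit, whereas a cutoff of 1e-10 splits c1 into its own family and so changes the number of columns in the matrix.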
The focus of most of these approaches has been on phylogenetic studies of Bacteria (Wolf et al., 2001a, 2002; Dutilh et al., 2007; Medini et al., 2005) and Archaea (Makarova et al., 2007). In general, for these two domains of life the results agree with the known taxonomy of these prokaryotes. Several websites exist to implement the scoring of orthologous protein domains and gene families to assist in this phylogenetic approach (Tatusov et al., 2003; Jiang et al., 2008; Luo et al., 2007; Mirkin et al., 2003; Wilson et al., 2009; Novichkov et al., 2009; Jensen et al., 2008; Penel et al., 2009; Ranwez et al., 2007; Dubchak and Ryaboy, 2006; Marthey et al., 2008; Miller et al., 2007).

⇑ Corresponding author. E-mail addresses: [email protected] (J.A. Rosenfeld), [email protected] (R. DeSalle).
Molecular Phylogenetics and Evolution 63 (2012) 342–350. doi:10.1016/j.ympev.2012.01.003. Journal homepage: www.elsevier.com/locate/ympev

A problem with the genome content approach that we address in this communication arises in the construction of the primary gene content matrix. The problem relates directly to the fact that different e value cutoffs will give different matrices, and the different matrices will give different phylogenetic hypotheses (Lienau et al., 2006, 2011). This problem is very similar to the impact of input parameters in multiple alignment on phylogenetics. Different input values of gap costs and transformation costs in the alignment procedure will oftentimes give very different alignments, and these different alignments will give different trees (Gatesy et al., 1994; Wheeler et al., 1995). While some studies have pointed to the