
Avoiding potential biases in ses.PD estimations with the Picante software package

Rafael Molina-Venegas¹

1. Institute of Plant Sciences, University of Bern, Altenbergrain 21, Bern 3013, Switzerland.

Contact: [email protected]

Running title: avoiding biases in ses.PD estimations

bioRxiv preprint doi: https://doi.org/10.1101/579300; this version posted March 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Abstract

1. Faith’s phylogenetic diversity (PD) is one of the most widely used indices of phylogenetic structure in the eco-phylogenetic literature. The metric became notably popular with the publication of the function pd as part of the Picante R package, which is nowadays a reference software for phylogenetic analyses.

2. Because PD is not statistically independent of species richness, the routine procedure is to standardize the observed PD values for unequal richness across samples. The function ses.pd, which is also implemented in the Picante R package, is the reference function to conduct such standardization.

3. Unfortunately, I have detected an anomaly in the function that may result in biased estimations of standardized PD values, particularly in communities with low species richness (i.e. fewer than four species) and unbalanced phylogenies.

4. I conduct a simple simulation exercise to illustrate the issue and propose two alternative, easy-to-implement solutions to work around the problem.

Keywords: Phylogenetic diversity; Picante; ses.pd; standardization.





Introduction

Faith’s phylogenetic diversity (PD; Faith, 1992), defined as the sum of all branch lengths connecting taxa in a sample, is one of the most widely used indices of phylogenetic structure in the eco-phylogenetic literature. The metric became notably popular with the publication of the function pd as part of the Picante software (Kembel et al., 2010), which is nowadays a reference R package for phylogenetic analyses. The pd function takes three arguments: (i) “samp”, a community data matrix (samples in rows and species in columns); (ii) “tree”, a phylo tree object including all the species in the community data matrix; and (iii) “include.root”, a logical argument. If the latter is set to TRUE (the default), the PD of each sample in the community data matrix will include the distance between the most recent common ancestor (MRCA) of the species in the sample and the root of the supplied phylogeny (hereafter “MRCA – root” distance). Otherwise, the MRCA – root distance is excluded from the computations.
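To make the role of the MRCA – root distance concrete, the following is a minimal Python sketch of Faith’s PD on a toy four-tip tree. It is not Picante’s code: the tree, the parent-pointer representation and the names TREE, path_to_root and faith_pd are hypothetical, and returning 0 for an empty sample is a choice made only for this illustration.

```python
# Toy rooted, ultrametric tree (root-to-tip distance = 1).
# Each node maps to (parent, length of the branch above it); "root" has no entry.
TREE = {
    "n1": ("root", 0.4),
    "D": ("root", 1.0),
    "n2": ("n1", 0.3),
    "C": ("n1", 0.6),
    "A": ("n2", 0.3),
    "B": ("n2", 0.3),
}


def path_to_root(tip, tree=TREE):
    """Return the set of branches (named by their lower node) between a tip and the root."""
    branches = set()
    node = tip
    while node in tree:
        branches.add(node)
        node = tree[node][0]
    return branches


def faith_pd(sample, include_root=True, tree=TREE):
    """Sum the branch lengths linking the sampled tips.

    With include_root=True the path from the sample's MRCA to the root is counted;
    with include_root=False the branches shared by every tip (those above the MRCA)
    are dropped.
    """
    paths = [path_to_root(tip, tree) for tip in sample]
    if not paths:
        return 0.0  # illustrative convention for an empty sample
    used = set().union(*paths)
    if not include_root:
        used -= set.intersection(*paths)
    return sum(tree[node][1] for node in used)


print(faith_pd(["A", "B"], include_root=True))   # counts the 0.7 MRCA-root path
print(faith_pd(["A", "B"], include_root=False))  # only the two tip branches
print(faith_pd(["A", "D"], include_root=False))  # MRCA is the root: nothing to drop
```

For the sister species A and B the two settings differ by the 0.7 units that separate their MRCA from the root, whereas for A and D, whose MRCA is the root itself, both settings return the same value.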

Importantly, PD is not statistically independent of species richness, and the routine procedure is to standardize the observed PD values for unequal richness across samples (see documentation for the pd function in the Picante R package, Kembel et al., 2010). Typically, the observed PD is compared to a null distribution of PD values generated by shuffling species names across the phylogenetic tips a large number of times (e.g. 999 times). The function ses.pd, which is also implemented in the Picante software, is the reference function to conduct such standardization. In addition to the arguments of pd, ses.pd includes multiple null models that can be used to generate null distributions (the default model is “taxa.labels”, which shuffles taxa labels across the phylogenetic tips). Since ses.pd calls pd internally, the user can specify whether the MRCA – root distance should be included in the calculation of the observed PD and the corresponding null PD values. However, I have noted that ses.pd computes null PD values without including the MRCA – root distance regardless of the logical value that is specified in the include.root argument. That is, if include.root = TRUE (the default), the observed PD will include the MRCA – root distance, but the null PD values will not (see Supplementary Material). Unfortunately, this anomaly in the ses.pd function may result in biased estimations of standardized PD values (hereafter “ses.PD”), particularly in samples with low species richness. This is because the lower the species richness in the samples, the lower the probability that the phylogenetic branches connecting species in the null samples traverse the root node of the supplied phylogeny, and therefore the higher the impact of excluding the MRCA – root distance from the computations when it should be included (i.e. when include.root = TRUE). Here, I conducted a simple simulation exercise to illustrate this issue.
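The mismatch described above can be mimicked with a small, self-contained Python sketch. It is an illustration under stated assumptions, not the Picante implementation: the fully unbalanced ("caterpillar") tree, the helper names and the null_include_root switch are all hypothetical. The standardized value is computed as (observed PD − mean of null PD) / standard deviation of null PD, and shuffling taxa labels across tips is emulated by drawing a random set of tips of the same size as the sample.

```python
import random
import statistics


def caterpillar(n):
    """Build a fully unbalanced ultrametric tree with n tips and height 1.

    Each node maps to (parent, length of the branch above it); "i0" is the root.
    """
    step = 1.0 / (n - 1)
    tree = {}
    for k in range(n - 1):
        if k > 0:
            tree[f"i{k}"] = (f"i{k - 1}", step)
        tree[f"t{k}"] = (f"i{k}", 1.0 - k * step)
    tree[f"t{n - 1}"] = (f"i{n - 2}", step)
    return tree


def faith_pd(sample, tree, include_root=True):
    """Sum the branches linking the sample, with or without the MRCA-root path."""
    paths = []
    for tip in sample:
        branches, node = set(), tip
        while node in tree:
            branches.add(node)
            node = tree[node][0]
        paths.append(branches)
    used = set().union(*paths)
    if not include_root:
        used -= set.intersection(*paths)  # branches shared by all tips lie above the MRCA
    return sum(tree[node][1] for node in used)


def ses_pd(sample, tree, tips, null_include_root, runs=999, seed=1):
    """Standardized effect size of PD; the observed PD always includes the root path."""
    rng = random.Random(seed)
    observed = faith_pd(sample, tree, include_root=True)
    null = [
        faith_pd(rng.sample(tips, len(sample)), tree, include_root=null_include_root)
        for _ in range(runs)
    ]
    return (observed - statistics.mean(null)) / statistics.stdev(null)


tree = caterpillar(10)
tips = [f"t{k}" for k in range(10)]
sample = ["t8", "t9"]  # two closely related species far from the root

mismatched = ses_pd(sample, tree, tips, null_include_root=False)  # emulates the anomaly
consistent = ses_pd(sample, tree, tips, null_include_root=True)   # root handled consistently
print(mismatched, consistent)
```

Because both calls use the same seed, they draw the same null samples, so any difference between the two values comes from the root handling alone. In this toy case the mismatched version should return a less negative value than the consistent one.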

Materials and Methods

I simulated four different community data matrices with n = 50 samples (rows) and m = 25, 50, 100 and 200 species (columns), respectively. Then, I used the function pbtree implemented in the Phytools R package (Revell, 2012) to simulate 500 pure-birth phylogenies (root-to-tip distance scaled to one) of m = 25, 50, 100 and 200 tips, respectively (2000 phylogenies in total), representing the species in the community data matrices. Finally, I simulated four different community datasets per community data matrix (each community data matrix represents a different species pool) with fixed species richness within datasets (i.e. equal row sums). To do so, I assigned n = 2, 4, 8 and 16 species, respectively, to each of the samples in the community data matrices by randomly picking species from the corresponding pools. For each dataset (16 in total), I computed ses.PD values for the samples using the corresponding 500 simulated phylogenies and the ses.pd function as implemented in Picante (Kembel et al., 2010, hereafter “ses.pd-Picante”). The argument include.root was set to TRUE, and null distributions were generated using the taxa.labels model and 999 randomizations. Then, I reanalyzed the data using a corrected version of the ses.pd-Picante function that actually includes the MRCA – root distance in all the computations if the argument include.root is set to TRUE. Both functions are identical in all other respects (see Supplementary Material). I used the same seed (random number generator) to analyze the data with both functions. Finally, I compared the ses.PD values derived from each function using cross-validation R-squared ($R^2_{CV}$) (Molina-Venegas, Moreno-Saiz, Castro, Davies, Peres-Neto & Rodríguez, 2018). $R^2_{CV}$ = 1 indicates a perfect match between the ses.PD values obtained from both functions, and $R^2_{CV}$ < 1 indicates an imperfect match. $R^2_{CV}$ varies from 1 to minus infinity. Since I used the same seed to analyze the data with both functions, the randomization pattern was preserved, and therefore $R^2_{CV}$ will be equal to 1 if both functions yield identical results. The R code to reproduce all the analyses, along with the corrected version of the ses.pd-Picante function, is provided as Supplementary Material. All analyses were conducted using Picante version 1.7 (latest version) and R version 3.4.3 (R Core Team, 2017), yet results were the same regardless of the version of the package (the ses.pd function was first implemented in Picante version 0.7-2 and released on the CRAN repository in July 2009).
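The text does not spell out the formula for the cross-validation R-squared, so the Python sketch below assumes one common definition of a predictive R-squared: one minus the ratio of the squared differences between the two series to the total sum of squares of the reference series. Treating the corrected function’s values as the reference and the ses.pd-Picante values as the candidate is also an assumption, and the function name r2_cv is hypothetical. The definition does reproduce the two properties stated above: it equals 1 for identical series and has no lower bound.

```python
def r2_cv(reference, candidate):
    """Predictive R-squared: 1 - SS_residual / SS_total (assumed definition).

    reference: values taken as the benchmark (e.g. from the corrected function).
    candidate: values compared against the benchmark (e.g. from ses.pd-Picante).
    """
    mean_ref = sum(reference) / len(reference)
    ss_residual = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    ss_total = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_residual / ss_total


identical = r2_cv([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect match
reversed_order = r2_cv([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # strong mismatch
print(identical, reversed_order)
```

Unlike a squared correlation, this score penalizes any departure from the 1:1 line, including a constant shift between the two series, which is why it can become negative.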

Results and Discussion

I found substantial mismatch between the ses.PD values derived from the ses.pd-Picante function and its corrected version, particularly at low species richness and regardless of the size of the species pool (Figs. 1 and S1 in Appendix 1). More specifically, the ses.pd-Picante function yielded higher ses.PD values than expected (i.e. above the 1:1 line) on the negative side of the distribution (Figs. 2 and S2 in Appendix 1). However, results derived from both functions rapidly converged following an exponential trend as species richness increased (Figs. 1 and S1 in Appendix 1), suggesting that the ses.pd-Picante function will introduce biases only when species richness is very low (i.e. fewer than four species). Fortunately, such low richness levels are rare in natural communities, yet they are occasionally reported along with their ses.PD values (e.g. Mennes, Moerland, Rath, Smets & Merckx, 2015; Geedicke, Schultz, Rudolph & Oldeland, 2016; Nowakowski, Frishkoff, Thompson, Smith & Todd, 2018). In addition, some simulation analyses have reported ses.PD values for samples including only two species (e.g. Mazel et al., 2016), and diversity experiments often include plots with very few species (e.g. Symstad et al., 2003).

Figs. 3 and S3 show that mismatches were more likely to occur with highly unbalanced trees (i.e. those with internal nodes defining divergent lineages of unequal size). This is because the higher the imbalance of the phylogeny, the lower the probability that the phylogenetic branches connecting species in the null samples traverse the root node of the supplied phylogeny, and therefore the higher the impact of excluding the MRCA – root distance when it should be included. Given the unbalanced nature of most real phylogenies, I conclude that future studies can avoid potential biases in ses.PD estimations (particularly in communities with very low species richness) by either removing the MRCA – root distance from all the computations conducted by the ses.pd-Picante function (i.e. setting the include.root argument to FALSE) or using its corrected version if the MRCA – root distance is to be included (see Supplementary Material).
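Tree imbalance of the kind discussed above is commonly measured with the Colless index, the sum over internal nodes of the absolute difference in tip numbers between the two descendant lineages. The Python sketch below computes it for a binary tree written as nested tuples. The nested-tuple representation and the function names are hypothetical, and dividing by the maximum possible value, (n − 1)(n − 2)/2 for n tips, is an assumed way of scaling the index between 0 and 1; the text does not state which scaling was used.

```python
def colless(tree):
    """Return (number of tips, raw Colless sum) for a binary tree.

    A tip is a string; an internal node is a 2-tuple (left, right).
    """
    if isinstance(tree, str):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


def colless_scaled(tree):
    """Colless index divided by its maximum for the same tip number (needs >= 3 tips)."""
    n_tips, raw = colless(tree)
    return raw / ((n_tips - 1) * (n_tips - 2) / 2)


balanced = (("A", "B"), ("C", "D"))      # every split divides the tips evenly
caterpillar = ((("A", "B"), "C"), "D")   # every split isolates a single tip

print(colless_scaled(balanced))
print(colless_scaled(caterpillar))
```

In the fully unbalanced tree, a randomly drawn pair of tips spans the root only if it contains the single tip on the short side of the root split, which is the situation in which excluding the MRCA – root distance from the null values matters most.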

Acknowledgments

I thank the Scientific Computation Centre of Andalusia (CICA) for the computing services they provided.

Supporting Information

Appendix 1. R code used for the analyses.

REFERENCES

Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61, 1–10. doi:10.1016/0006-3207(92)91201-3

Geedicke, I., Schultz, M., Rudolph, B., & Oldeland, J. (2016). Phylogenetic clustering found in lichen but not in plant communities in European heathlands. Community Ecology, 17, 216–224. doi:10.1556/168.2016.17.2.10

Kembel, S. W., Cowan, P. D., Helmus, M. R., Cornwell, W. K., Morlon, H., Ackerly, D. D., … Webb, C. O. (2010). Picante: R tools for integrating phylogenies and ecology. Bioinformatics, 26, 1463–1464.

Mazel, F., Davies, T. J., Gallien, L., Renaud, J., Groussin, M., Münkemüller, T., & Thuiller, W. (2016). Influence of tree shape and evolutionary time-scale on phylogenetic diversity metrics. Ecography, 39, 913–920. doi:10.1111/ecog.01694

Mennes, C. B., Moerland, M. S., Rath, M., Smets, E. F., & Merckx, V. S. F. T. (2015). Evolution of mycoheterotrophy in Polygalaceae: The case of Epirixanthes. American Journal of Botany, 102, 598–608. doi:10.3732/ajb.1400549

Molina-Venegas, R., Moreno-Saiz, J. C., Parga, I. C., Davies, T. J., Peres-Neto, P. R., & Rodríguez, M. Á. (2018). Assessing among-lineage variability in phylogenetic imputation of functional trait datasets. Ecography, 41, 1740–1749. doi:10.1111/ecog.03480

Nowakowski, A. J., Frishkoff, L. O., Thompson, M. E., Smith, T. M., & Todd, B. D. (2018). Phylogenetic homogenization of amphibian assemblages in human-altered habitats across the globe. Proceedings of the National Academy of Sciences, 115, E3454–E3462. doi:10.1073/pnas.1714891115

R Core Team (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Revell, L. J. (2012). phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3, 217–223.

Symstad, A. J., Chapin, F. S., Wall, D. H., Gross, K. L., Huenneke, L. F., Mittelbach, G. G., … Tilman, D. (2003). Long-term and large-scale perspectives on the relationship between biodiversity and ecosystem functioning. BioScience, 53, 89–98. doi:10.1641/0006-3568(2003)053[0089:LTALSP]2.0.CO;2




Figure 1. Violin plot showing the cross-validation R-squared scores for the comparisons between ses.PD values derived from the function ses.pd-Picante (Kembel et al., 2010) and its corrected version. Analyses were conducted using datasets with species richness n = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool (community data matrix) of m = 25 species (see Fig. S1 in Appendix 1 for results derived from community data matrices with m = 50, 100 and 200 species).




Figure 2. Scatter plots showing the relationship between ses.PD values (25,000 per plot) derived from the function ses.pd-Picante (Kembel et al., 2010, y-axis) and its corrected version (x-axis). Analyses were conducted using datasets with species richness n = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool of 25 species (see Fig. S2 for results derived from species pools of 50, 100 and 200 species). The grey lines represent the expected 1:1 relationship.



Figure 3. Relationship between cross-validation R-squared scores (y-axis) and tree imbalance (i.e. Colless’ index, x-axis). Analyses were conducted using datasets with species richness n = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool of 25 species (see Fig. S3 for results derived from species pools of 50, 100 and 200 species). The higher the value of Colless’ index, the higher the imbalance of the phylogeny (values scaled between 0 and 1).


