Avoiding potential biases in ses.PD estimations with the Picante
software package
Rafael Molina-Venegas1
1. Institute of Plant Sciences, University of Bern,
Altenbergrain 21, Bern 3013, Switzerland.
Contact: [email protected]
Running title: avoiding biases in ses.PD estimations
bioRxiv preprint doi: https://doi.org/10.1101/579300; this version
posted March 15, 2019. The copyright holder for this preprint (which
was not certified by peer review) is the author/funder, who has
granted bioRxiv a license to display the preprint in perpetuity. It is
made available under a CC-BY 4.0 International license
(http://creativecommons.org/licenses/by/4.0/).
Abstract
1. Faith’s phylogenetic diversity (PD) is one of the most widely used
indices of phylogenetic structure in the eco-phylogenetic literature.
The metric became notably popular with the publication of the function
pd as part of the Picante R package, which is now a reference software
for phylogenetic analyses.
2. Because PD is not statistically independent of species richness,
the routine procedure is to standardize the observed PD values for
unequal richness across samples. The function ses.pd, also implemented
in the Picante R package, is the reference function for this
standardization.
3. Unfortunately, I have detected an anomaly in ses.pd that may result
in biased estimations of standardized PD values, particularly in
communities with low species richness (i.e. fewer than four species)
and unbalanced phylogenies.
4. I conduct a simple simulation exercise to illustrate the issue and
propose two alternative, easy-to-implement solutions to work around
the problem.

Keywords: Phylogenetic diversity; Picante; ses.pd; standardization.

Introduction
Faith’s phylogenetic diversity (PD; Faith, 1992), defined as the sum
of all branch lengths connecting taxa in a sample, is one of the most
widely used indices of phylogenetic structure in the eco-phylogenetic
literature. The metric became notably popular with the publication of
the function pd as part of the Picante software (Kembel et al., 2010),
which is now a reference R package for phylogenetic analyses. The pd
function takes three arguments: (i) “samp”, a community data matrix
(samples in rows and species in columns), (ii) “tree”, a phylo tree
object including all the species in the community data matrix, and
(iii) “include.root”, a logical argument. If the latter is set to TRUE
(the default), the PD of every sample in the community data matrix
includes the distance between the most recent common ancestor (MRCA)
of the species in the sample and the root of the supplied phylogeny
(hereafter “MRCA – root” distance). Otherwise, the MRCA – root
distance is excluded from the computations.
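To make the role of the MRCA – root distance concrete, the following minimal Python sketch computes PD with and without that distance on a hypothetical four-tip tree. It is an illustration only and not Picante code: the tree, its branch lengths (in arbitrary integer units) and the names TREE, path_to_root and faith_pd are assumptions introduced here.

```python
# Hypothetical rooted tree, stored as child -> (parent, branch length).
# Branch lengths are in arbitrary integer units; every tip is 10 units
# from the root. This toy tree is an assumption, not data from the study.
TREE = {
    "A": ("root", 10),
    "n1": ("root", 4),
    "B": ("n1", 6),
    "n2": ("n1", 3),
    "C": ("n2", 3),
    "D": ("n2", 3),
}


def path_to_root(tip, tree):
    """Return the branches (named by their child node) from a tip to the root."""
    branches = []
    node = tip
    while node in tree:  # the root has no parent entry
        branches.append(node)
        node = tree[node][0]
    return branches


def faith_pd(tips, tree, include_root=True):
    """Sum the branch lengths connecting `tips`, optionally up to the root."""
    if not tips:
        return 0
    paths = [set(path_to_root(tip, tree)) for tip in tips]
    used = set().union(*paths)
    if not include_root:
        # Branches shared by every tip lie between the MRCA and the root.
        used -= set.intersection(*paths)
    return sum(tree[branch][1] for branch in used)


# C and D share the branches n2 and n1 above their MRCA (7 units).
print(faith_pd(["C", "D"], TREE, include_root=True))   # 13
print(faith_pd(["C", "D"], TREE, include_root=False))  # 6
# The MRCA of A and C is the root itself, so both settings agree.
print(faith_pd(["A", "C"], TREE, include_root=True))   # 20
print(faith_pd(["A", "C"], TREE, include_root=False))  # 20
```

In this toy case the two settings differ only for samples whose MRCA lies below the root, which is the situation the rest of the paper is concerned with.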
Importantly, PD is not statistically independent of species richness,
and the routine procedure is to standardize the observed PD values for
unequal richness across samples (see the documentation of the pd
function in the Picante R package, Kembel et al., 2010). Typically,
the observed PD is compared to a null distribution of PD values
generated by shuffling species names across the phylogenetic tips a
large number of times (e.g. 999 times). The function ses.pd, also
implemented in the Picante software, is the reference function for
this standardization. In addition to the arguments of pd, ses.pd
offers multiple null models to generate the null distributions (the
default model is “taxa.labels”, which shuffles taxa labels across the
phylogenetic tips). Since ses.pd calls pd internally, the user can
specify whether the MRCA – root distance should be included in the
calculation of the observed PD and the
corresponding null PD values. However, I have noted that ses.pd
computes the null PD values without the MRCA – root distance
regardless of the logical value specified in the include.root
argument. That is, if include.root = TRUE (the default), the observed
PD will include the MRCA – root distance, but the null PD values will
not (see Supplementary Material). Unfortunately, this anomaly in the
ses.pd function may result in biased estimations of standardized PD
values (hereafter “ses.PD”), particularly in samples with low species
richness. This is because the lower the species richness of the
samples, the lower the probability that the phylogenetic branches
connecting the species in the null samples traverse the root node of
the supplied phylogeny, and therefore the higher the impact of
excluding the MRCA – root distance from the computations when it
should be included (i.e. when include.root = TRUE). Here, I conducted
a simple simulation exercise to illustrate this issue.
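The effect of this mismatch on the standardized value, (observed PD - mean of null PD) / SD of null PD, can be sketched in a few lines of Python. The example is a simplified stand-in for the behaviour described above, not the Picante source: the four-tip tree is hypothetical, the names faith_pd and ses_pd are introduced here, and the null distribution is obtained by enumerating every possible sample of the same richness (with the population SD) instead of drawing 999 random shuffles, so that the result is deterministic.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical tree (child -> (parent, branch length)), arbitrary units.
TREE = {
    "A": ("root", 10), "n1": ("root", 4), "B": ("n1", 6),
    "n2": ("n1", 3), "C": ("n2", 3), "D": ("n2", 3),
}
TIPS = ["A", "B", "C", "D"]


def faith_pd(tips, tree, include_root=True):
    """Sum the branch lengths connecting `tips`, optionally up to the root."""
    paths = []
    for tip in tips:
        node, path = tip, set()
        while node in tree:  # climb from the tip to the root
            path.add(node)
            node = tree[node][0]
        paths.append(path)
    used = set().union(*paths)
    if not include_root:
        # Drop the branches between the MRCA and the root.
        used -= set.intersection(*paths)
    return sum(tree[branch][1] for branch in used)


def ses_pd(sample, tree, tips, obs_root, null_root):
    """Standardized PD using an exhaustive null of equal-richness samples.

    obs_root and null_root control whether the MRCA - root distance is
    included in the observed value and in the null values, respectively.
    """
    obs = faith_pd(sample, tree, include_root=obs_root)
    null = [faith_pd(s, tree, include_root=null_root)
            for s in combinations(tips, len(sample))]
    return (obs - mean(null)) / pstdev(null)


# Root distance included everywhere (the intended behaviour).
consistent = ses_pd(["C", "D"], TREE, TIPS, obs_root=True, null_root=True)
# Root distance included in the observed PD but not in the null PD.
mismatched = ses_pd(["C", "D"], TREE, TIPS, obs_root=True, null_root=False)
print(round(consistent, 2), round(mismatched, 2))  # -1.67 -0.37
```

For this two-species sample the mismatched calculation returns a value much closer to zero than the consistent one, which is the direction of the bias reported in the Results.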

Materials and Methods
I simulated four different community data matrices with n = 50 samples
(rows) and m = 25, 50, 100 and 200 species (columns), respectively.
Then, I used the function pbtree implemented in the Phytools R package
(Revell, 2012) to simulate 500 pure-birth phylogenies (root-to-tip
distance scaled to one) of m = 25, 50, 100 and 200 tips, respectively
(2000 phylogenies in total), representing the species in the community
data matrices. Finally, I simulated four different community datasets
per community data matrix (each community data matrix represents a
different species pool) with fixed species richness within datasets
(i.e. equal row sums). To do so, I assigned 2, 4, 8 and 16 species,
respectively, to each of the samples in the community data matrices by
randomly picking species from the corresponding pools. For each
dataset (16 in total), I computed ses.PD values for the samples using
the 500 simulated phylogenies and the ses.pd
function as implemented in Picante (Kembel et al., 2010, hereafter
“ses.pd-Picante”). The argument include.root was set to TRUE, and null
distributions were generated using the taxa.labels model and 999
randomizations. Then, I reanalyzed the data using a corrected version
of the ses.pd-Picante function that actually includes the MRCA – root
distance in all the computations if the argument include.root is set
to TRUE. Both functions are identical in all other respects (see
Supplementary Material). I used the same seed (random number
generator) to analyze the data with both functions. Finally, I
compared the ses.PD values derived from each function using
cross-validation R-squared (R²cv) (Molina-Venegas, Moreno-Saiz,
Castro, Davies, Peres-Neto & Rodríguez, 2018). R²cv = 1 indicates a
perfect match between the ses.PD values obtained from both functions,
and R²cv < 1 indicates an imperfect match; R²cv ranges from 1 to minus
infinity. Since I used the same seed to analyze the data with both
functions, the randomization pattern was preserved, and therefore R²cv
will equal 1 if both functions yield identical results. The R code to
reproduce all the analyses, along with the corrected version of the
ses.pd-Picante function, is provided as Supplementary Material. All
analyses were conducted using Picante version 1.7 (the latest version)
and R version 3.4.3 (R Core Team, 2017), yet results were the same
regardless of the version of the package (the ses.pd function was
first implemented in Picante version 0.7-2 and released on the CRAN
repository in July 2009).
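Two steps of this design lend themselves to a short illustration: building a community data matrix with fixed richness, and scoring the agreement between two sets of ses.PD values. The Python sketch below is not the R code of the Supplementary Material. The function names are hypothetical, the corrected function is treated as the reference, and the score assumes the common form 1 - SS_res / SS_tot measured against the 1:1 line, which is consistent with the range of 1 to minus infinity stated above; the exact definition is given in Molina-Venegas et al. (2018).

```python
import random


def fixed_richness_matrix(n_samples, pool, richness, seed=1):
    """Presence/absence matrix (rows = samples) with equal row sums.

    Each sample receives `richness` species drawn at random, without
    replacement, from the species pool.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        present = set(rng.sample(pool, richness))
        rows.append([1 if species in present else 0 for species in pool])
    return rows


def cv_r2(reference, test):
    """Assumed form of the cross-validation R-squared: 1 - SS_res / SS_tot.

    SS_res is taken around the 1:1 line, so the score is 1 only when the
    two vectors are identical and has no lower bound.
    """
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - t) ** 2 for r, t in zip(reference, test))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1 - ss_res / ss_tot


# A pool of 25 hypothetical species and 50 samples of two species each.
pool = [f"sp{i}" for i in range(1, 26)]
matrix = fixed_richness_matrix(50, pool, richness=2)
print(all(sum(row) == 2 for row in matrix))  # True

print(cv_r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(cv_r2([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # -0.5
```

The last call shows why a constant upward shift is penalized even though the two vectors are perfectly correlated: the score measures departure from the 1:1 line and not the strength of the linear relationship.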

Results and Discussion
I found a substantial mismatch between the ses.PD values derived from
the ses.pd-Picante function and its corrected version, particularly at
low species richness and regardless of the size of the species pool
(Figs. 1 and S1 in Appendix 1). More specifically, the ses.pd-Picante
function yielded higher ses.PD values than expected (i.e.
above the 1:1 line) on the negative side of the distribution (Figs. 2
and S2 in Appendix 1). However, the results derived from both
functions rapidly converged following an exponential trend as species
richness increased (Figs. 1 and S1 in Appendix 1), suggesting that the
ses.pd-Picante function will introduce biases only when species
richness is very low (i.e. fewer than four species). Fortunately, such
low richness levels are rare in natural communities, yet they are
occasionally reported along with their ses.PD values (e.g. Mennes,
Moerland, Rath, Smets & Merckx, 2015; Geedicke, Schultz, Rudolph &
Oldeland, 2016; Nowakowski, Frishkoff, Thompson, Smith & Todd, 2018).
In addition, some simulation analyses have reported ses.PD values for
samples including only two species (e.g. Mazel et al., 2016), and
diversity experiments often include plots with very few species (e.g.
Symstad et al., 2003).
Figs. 3 and S3 show that mismatches were more likely to occur with
highly unbalanced trees (i.e. those with internal nodes defining
divergent lineages of unequal size). This is because the higher the
imbalance of the phylogeny, the lower the probability that the
phylogenetic branches connecting the species in the null samples
traverse the root node of the supplied phylogeny, and therefore the
higher the impact of excluding the MRCA – root distance when it should
be included. Given the unbalanced nature of most real phylogenies, I
conclude that future studies can avoid potential biases in ses.PD
estimations (particularly in communities with very low species
richness) by either removing the MRCA – root distance from all the
computations conducted by the ses.pd-Picante function (i.e. setting
the include.root argument to FALSE) or using its corrected version if
the MRCA – root distance is to be included (see Supplementary
Material).
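The argument about root traversal can be checked with a short calculation. Under the taxa.labels model a null sample is, in effect, a random draw of tips of the same size as the observed sample, and its MRCA is the root only if the draw contains tips from both sides of the root split. The Python sketch below computes that probability. The function name prob_spans_root and the two example splits of a 25-tip tree (12/13 and 1/24) are hypothetical choices for illustration, not trees used in the study.

```python
from math import comb


def prob_spans_root(left, right, richness):
    """Probability that a random draw of tips has the root as its MRCA.

    `left` and `right` are the numbers of tips on each side of the root
    split; `richness` tips are drawn at random without replacement.
    """
    total = comb(left + right, richness)
    # Draws that fall entirely on one side of the root do not reach it.
    same_side = comb(left, richness) + comb(right, richness)
    return 1 - same_side / total


for richness in (2, 4, 8, 16):
    balanced = prob_spans_root(12, 13, richness)    # nearly even root split
    unbalanced = prob_spans_root(1, 24, richness)   # one tip against 24
    print(richness, round(balanced, 3), round(unbalanced, 3))
```

With two species, the null sample reaches the root about half of the time under the even split (0.52) but only 8% of the time under the uneven one, so dropping the MRCA – root distance alters most null values. With 16 species the corresponding probabilities are 1 and 0.64, in line with the convergence at higher richness.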

Acknowledgments
I thank the Scientific Computation Centre of Andalusia (CICA) for the
computing services they provided.

Supporting Information
Appendix 1. R code used for the analyses.

REFERENCES
Faith, D. P. (1992). Conservation evaluation and phylogenetic
diversity. Biological Conservation, 61, 1–10.
doi:10.1016/0006-3207(92)91201-3
Geedicke, I., Schultz, M., Rudolph, B., & Oldeland, J. (2016).
Phylogenetic clustering found in lichen but not in plant communities
in European heathlands. Community Ecology, 17, 216–224.
doi:10.1556/168.2016.17.2.10
Kembel, S. W., Cowan, P. D., Helmus, M. R., Cornwell, W. K., Morlon,
H., Ackerly, D. D., … Webb, C. O. (2010). Picante: R tools for
integrating phylogenies and ecology. Bioinformatics, 26, 1463–1464.
Mazel, F., Davies, T. J., Gallien, L., Renaud, J., Groussin, M.,
Münkemüller, T., & Thuiller, W. (2016). Influence of tree shape and
evolutionary time-scale on phylogenetic diversity metrics. Ecography,
39, 913–920. doi:10.1111/ecog.01694
Mennes, C. B., Moerland, M. S., Rath, M., Smets, E. F., & Merckx, V.
S. F. T. (2015). Evolution of mycoheterotrophy in Polygalaceae: The
case of Epirixanthes. American Journal of Botany, 102, 598–608.
doi:10.3732/ajb.1400549
Molina-Venegas, R., Moreno-Saiz, J. C., Parga, I. C., Davies, T. J.,
Peres-Neto, P. R., & Rodríguez, M. Á. (2018). Assessing among-lineage
variability in phylogenetic imputation of functional trait datasets.
Ecography, 41, 1740–1749. doi:10.1111/ecog.03480
Nowakowski, A. J., Frishkoff, L. O., Thompson, M. E., Smith, T. M., &
Todd, B. D. (2018). Phylogenetic homogenization of amphibian
assemblages in human-altered habitats across the globe. Proceedings of
the National Academy of Sciences, 115, E3454–E3462.
doi:10.1073/pnas.1714891115
R Core Team (2017). R: a language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
Revell, L. J. (2012). phytools: an R package for phylogenetic
comparative biology (and other things). Methods in Ecology and
Evolution, 3, 217–223.
Symstad, A. J., Chapin, F. S., Wall, D. H., Gross, K. L., Huenneke, L.
F., Mittelbach, G. G., … Tilman, D. (2003). Long-term and large-scale
perspectives on the relationship between biodiversity and ecosystem
functioning. BioScience, 53, 89–98.
doi:10.1641/0006-3568(2003)053[0089:LTALSP]2.0.CO;2

Figure 1. Violin plot showing the cross-validation R-squared scores
for the comparisons between ses.PD values derived from the function
ses.pd-Picante (Kembel et al., 2010) and its corrected version.
Analyses were conducted using datasets with species richness = 2, 4, 8
and 16, respectively, 500 simulated phylogenies and a species pool
(community data matrix) of m = 25 species (see Fig. S1 in Appendix 1
for results derived from community data matrices with m = 50, 100 and
200 species).

Figure 2. Scatter plots showing the relationship between ses.PD values
(25,000 per plot) derived from the function ses.pd-Picante (Kembel et
al., 2010, y-axis) and its corrected version (x-axis). Analyses were
conducted using datasets with species richness = 2, 4, 8 and 16,
respectively, 500 simulated phylogenies and a species pool of 25
species (see Fig. S2 for results derived from species pools of 50, 100
and 200 species). The grey lines represent the expected 1:1
relationship.

Figure 3. Relationship between cross-validation R-squared scores
(y-axis) and tree imbalance (i.e. Colless’ index, x-axis). Analyses
were conducted using datasets with species richness = 2, 4, 8 and 16,
respectively, 500 simulated phylogenies and a species pool of 25
species (see Fig. S3 for results derived from species pools of 50, 100
and 200 species). The higher the value of Colless’ index, the higher
the imbalance of the phylogeny (values scaled between 0 and 1).