MACROECOLOGICAL METHODS
BHPMF – a hierarchical Bayesian approach to gap-filling and trait prediction for macroecology and functional biogeography
Franziska Schrodt1,2,3,*, Jens Kattge1,2, Hanhuai Shan4,5, Farideh Fazayeli4,
Julia Joswig1, Arindam Banerjee4, Markus Reichstein1, Gerhard Bönisch1,
Sandra Díaz6, John Dickie7, Andy Gillison8, Anuj Karpatne4, Sandra Lavorel9,
Paul Leadley10, Christian B. Wirth2,11, Ian J. Wright12, S. Joseph Wright13 and
Peter B. Reich3,14
1Max Planck Institute for Biogeochemistry,
Hans-Knöll-Strasse 10, 07745 Jena, Germany, 2German Centre for Integrative Biodiversity
UOS, AgroParisTech, 91405 Orsay, France, 11University of Leipzig, Leipzig, Germany, 12Department of Biological Sciences,
Macquarie University, NSW 2109, Australia, 13Smithsonian Tropical Research Institute,
Apartado 0843-03092, Balboa, Republic of
Panama, 14Hawkesbury Institute for the
Environment, University of Western Sydney,
Locked Bag 1797, Penrith, NSW 2751
Australia
ABSTRACT
Aim Functional traits of organisms are key to understanding and predicting biodiversity and ecological change, which motivates continuous collection of traits and their integration into global databases. Such trait matrices are inherently sparse, severely limiting their usefulness for further analyses. On the other hand, traits are characterized by the phylogenetic trait signal, trait–trait correlations and environmental constraints, all of which provide information that could be used to statistically fill gaps. We propose the application of probabilistic models which, for the first time, utilize all three characteristics to fill gaps in trait databases and predict trait values at larger spatial scales.
Innovation For this purpose we introduce BHPMF, a hierarchical Bayesian extension of probabilistic matrix factorization (PMF). PMF is a machine learning technique which exploits the correlation structure of sparse matrices to impute missing entries. BHPMF additionally utilizes the taxonomic hierarchy for trait prediction and provides uncertainty estimates for each imputation. In combination with multiple regression against environmental information, BHPMF allows for extrapolation from point measurements to larger spatial scales. We demonstrate the applicability of BHPMF in ecological contexts, using different plant functional trait datasets, also comparing results to taking the species mean and PMF.
Main conclusions Sensitivity analyses validate the robustness and accuracy of BHPMF: our method captures the correlation structure of the trait matrix as well as the phylogenetic trait signal – also for extremely sparse trait matrices – and provides a robust measure of confidence in prediction accuracy for each missing entry. The combination of BHPMF with environmental constraints provides a promising concept to extrapolate traits beyond sampled regions, accounting for intraspecific trait variability. We conclude that BHPMF and its derivatives have a high potential to support future trait-based research in macroecology and functional biogeography.
BHPMF exploits the taxonomic hierarchy of the plant kingdom
as a proxy for the phylogenetic trait signal, with the individual
plant being nested in species, species in genus, genus in family
and family in phylogenetic group (Fig. 2).
BHPMF sequentially performs PMF at the different hierarchical levels, using latent vectors of the neighbouring level (ℓ) as prior information at the current level. For example, trait data averaged at species level are used to optimize latent vectors at species level, which in turn act as priors for latent vectors at the individual level, which finally are optimized against the observed trait entries in the trait matrix (Fig. 2, equation 1). The sequential approach across the taxonomic hierarchy turned out to be most effective if applied iteratively top down and bottom up.
After transformation of traits to approximate normal distributions and z-score transformation, the cost function is developed as the sum of squared deviations of predictions versus observations for traits (m) of entities (n) (first summand in equation 1) and the sum of squared deviations of posterior and prior of the latent factors u and v (second and third summand of equation 1) across all hierarchical levels (L):
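$$
E = \sum_{\ell=1}^{L} \left\{ \sum_{nm} \delta_{nm}^{(\ell)} \left( x_{nm}^{(\ell)} - \langle u_n^{(\ell)}, v_m^{(\ell)} \rangle \right)^2 + \lambda_u \sum_{n} \left\| u_n^{(\ell)} - u_{p(n)}^{(\ell-1)} \right\|_2^2 + \lambda_v \sum_{m} \left\| v_m^{(\ell)} - v_m^{(\ell-1)} \right\|_2^2 \right\}, \tag{1}
$$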
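As a reading aid for equation 1, the contribution of a single hierarchical level can be written as a short function. This is a minimal sketch and not the authors' MATLAB implementation: the function name `level_cost`, the list-of-lists data layout and the use of `None` to mark a gap (δ = 0) are assumptions made for illustration.

```python
def level_cost(x, u, v, u_prior, v_prior, lam_u, lam_v):
    """Cost of one hierarchical level: squared misfit plus prior penalties.

    x        -- matrix rows; None marks a missing entry (delta = 0)
    u, v     -- latent vectors for rows (entities) and columns (traits)
    u_prior  -- latent vector of each row's parent at the neighbouring level
    v_prior  -- latent vector of each column at the neighbouring level
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # First summand: only observed entries contribute.
    misfit = sum(
        (x[n][m] - dot(u[n], v[m])) ** 2
        for n in range(len(x))
        for m in range(len(x[n]))
        if x[n][m] is not None
    )
    # Second and third summands: distance of latent vectors from their priors.
    pen_u = lam_u * sum(sq_dist(u[n], u_prior[n]) for n in range(len(u)))
    pen_v = lam_v * sum(sq_dist(v[m], v_prior[m]) for m in range(len(v)))
    return misfit + pen_u + pen_v
```

Summing `level_cost` over all levels would give the full cost E, with the prior at each level taken from the neighbouring level as described in the text.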
Figure 1 Schematic of the probabilistic matrix factorization (PMF) model. u denotes the latent vector on the individual plant side, v the latent vector on the functional trait side, both of which have a Gaussian normal distribution with a mean of 0 and a variance of σ². Each missing entry Xij can be approximated by the product of the transposed latent vector U and the latent vector V.
The parameters of our BHPMF model are optimized against the observations in the matrix using a Gibbs sampler (Fazayeli et al., 2014). The Gibbs sampler is a Markov chain Monte Carlo (MCMC) method, which samples the probability density distributions of model parameters (here the latent vectors) and model predictions (here entries in the plant × trait matrix) (see the grey inset in Fig. 2 and ‘Model evaluation’, as well as Gibbs sampler results in Appendix S10). The Gibbs sampler-inferred density distributions of trait values are then used to infer the most likely imputation value, as well as the associated uncertainty for each prediction.
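The step from sampled values to an imputation value and its uncertainty can be sketched as follows. This is an illustrative assumption about how a chain of sampled predictions for one matrix entry might be summarized (mean as best estimate, SD as uncertainty); the function name `summarize_chain` and its `burn_in` and `lag` arguments are hypothetical and not taken from the authors' code.

```python
from statistics import mean, stdev


def summarize_chain(samples, burn_in=0, lag=1):
    """Summarize sampled predictions for one entry of the trait matrix.

    samples -- values drawn for this entry, in sampling order
    burn_in -- number of initial samples to discard
    lag     -- keep every `lag`-th sample after burn-in (thinning)
    Returns (best estimate, uncertainty) as (mean, sample SD).
    """
    kept = samples[burn_in::lag]
    return mean(kept), stdev(kept)
```

For example, `summarize_chain(chain, burn_in=100, lag=2)` would discard the first 100 draws and keep every second draw thereafter.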
aHPMF – extrapolation from point measurements to regional scales
If BHPMF is stopped at the species level, i.e. without accounting for trait variability specific to individual plants, the residual error represents the intraspecific variability and modelling/measurement errors. aHPMF focuses on explaining this residual trait variability based on environmental variables, such as soil and climate characteristics of the growth environment, in order to enable out-of-sample prediction, i.e. trait predictions for individual plants where the only known factors are species identity and location but no traits have actually been measured on the given individual (Fig. 3).
To capture trait variability that can be attributed to environmental factors, we utilize a hierarchical regression framework, taking into account the taxonomic structure of plants to regularize the regression model. The regression framework takes as independent predictor variables the climatic and soil variables mentioned below at locations with georeferenced trait measurements. The residuals of BHPMF for the 13 plant traits of each observation are considered as the target dependent variables to be predicted. We treat each plant trait independently of every other while regressing them using climate and soil features.
In essence, combining BHPMF with least squares regression
over the residuals against environmental factors, we can model
the unknown value for species n and trait m in a probabilistic
model as
$$
k_{nm} = \alpha\, u_n^{T} v_m + \beta\, w^{T} x + e_{nm} \tag{2}
$$
where $(u_n, v_m)$ are the latent factors, with $u_n$ having a hierarchical prior from the taxonomy, $x$ is the environmental condition and $w$ the regression coefficient. $e_{nm}$ is zero-mean Gaussian noise. Note that α, β are scalar parameters: for BHPMF set (α = 1, β = 0), for aHPMF set (α = 1, β = 1). For details see ‘aHPMF’ in Appendix S1.
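The deterministic part of equation 2 can be illustrated with a few lines of code. This is a sketch only: the function name `predict_trait` and the plain-list representation of vectors are assumptions, and the noise term is omitted because only the expected value is computed.

```python
def predict_trait(u_n, v_m, w, x_env, alpha=1.0, beta=0.0):
    """Expected trait value following equation 2, without the noise term.

    u_n, v_m -- latent vectors of entity n and trait m
    w, x_env -- regression coefficients and environmental covariates
    alpha, beta -- scalar switches: (1, 0) for BHPMF, (1, 1) for aHPMF
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    return alpha * dot(u_n, v_m) + beta * dot(w, x_env)
```

With the default arguments the environmental term drops out, which corresponds to the BHPMF setting; passing `beta=1.0` adds the regression on environmental covariates as in aHPMF.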
Data: traits, climate and soil
We demonstrate the applicability of the methods introduced above on the example of a trait matrix derived from TRY. For details on data standardization see Kattge et al. (2011). The spatial distribution of measurement sites and detailed information on the original datasets are shown in Fig. S4.1 (Appendix S4) and Tables S3.1 & S3.2 in Appendix S3.
Figure 2 Schematic of the Bayesian hierarchical probabilistic matrix factorization (BHPMF) model. N denotes the entity (individual plant) side and U the corresponding matrix of latent vectors on the row side, M the trait side and V the corresponding matrix of latent vectors on the column side. x denotes an entry in the original plant × trait matrix S. The numbers in parentheses show the taxonomic level L. For example (4) is the species level whereas (2) is the family level. The grey inset provides a schema for the Gibbs sampler where p(n) is the parent node of n in the upper level and c(n) is the set of child nodes of n in the lower level.
Figure 3 Schematic of the advanced hierarchical probabilistic matrix factorization (aHPMF) model. w denotes the regression coefficient at different levels of the hierarchy and Q the corresponding matrix of latent vectors. The numbers in parentheses show the taxonomic level L.
We extracted a matrix of 13 georeferenced traits consisting of 204,404 trait measurements on 78,300 individuals, spanning 14,320 species, 3793 genera, 358 families and 6 phylogenetic groups. The sparsity ranged from 49.63% for leaf area to 92.33% for the leaf N to P ratio, with an average sparsity of 79.9% across the trait matrix (Table 1). All traits were log- and z-transformed to improve normality and equalize traits in the cost function during optimization.
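The preprocessing described here (log-transformation, z-transformation and the sparsity of a trait) can be sketched for a single trait column. The details are assumptions made for illustration: the natural logarithm, the population standard deviation, `None` as the marker for a missing value and the function names are not specified in the text.

```python
import math
from statistics import mean, pstdev


def log_z_transform(column):
    """Log-transform the observed values of one trait, then z-score them.

    Missing values (None) are passed through unchanged. Assumes strictly
    positive trait values and at least two distinct observations.
    """
    logged = [math.log(v) for v in column if v is not None]
    mu, sd = mean(logged), pstdev(logged)
    return [None if v is None else (math.log(v) - mu) / sd for v in column]


def sparsity(column):
    """Fraction of missing entries in one trait column."""
    return sum(v is None for v in column) / len(column)
```

After this transformation every trait has mean 0 and unit variance in the observed entries, so that no single trait dominates the cost function because of its measurement units.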
For out-of-sample predictions by aHPMF, climate data for mean annual precipitation, mean annual temperature, isothermality and precipitation seasonality were extracted from the WorldClim dataset (Hijmans et al., 2005) and soil texture (sand, silt, clay) and soil organic carbon content in the topsoil from the Harmonized World Soil Database v1.2 (FAO et al., 2012).
Model evaluation
We ran PMF, BHPMF and aHPMF on the test dataset extracted from TRY. Given the plant × trait matrix, we randomly selected 80% of entries for training (parameter setting), 10% for validation (parameter adjustment by optimizing performance) and 10% for testing (independent performance testing after parameter adjustment and learning). This cross-validation improves model fidelity by ensuring that none of the observations are known by the model when performing new predictions. Test entries without training data in the same row would have highly inflated variance. Such cases were prevented by adjusting the splitting accordingly (see ‘BHPMF’ in Appendix S1).
We evaluated the predicted trait values, using the root mean squared error (RMSE; see equation S13 in Appendix S1) and the coefficient of determination (R²) of z-transformed predicted versus observed traits as indicators of overall prediction accuracy. We compared the performance of PMF, BHPMF and aHPMF with a baseline of species mean trait values (MEAN), which uses the overall trait mean of all individual plants within a species for prediction. The effectiveness of capturing the phylogenetic trait signal was explored by performing BHPMF including increasingly detailed taxonomic information (Fig. 2).
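The two accuracy indicators can be computed as in the following sketch. RMSE follows the definition referenced in the text; R² is computed here as the squared Pearson correlation between observed and predicted values, which is an assumption about the exact definition used. Function names are illustrative.

```python
import math


def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Squared Pearson correlation of observed versus predicted values."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)
```

Both functions would be applied to the held-out test entries in z-transformed space, so that RMSE values are directly comparable across traits.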
In order to evaluate how well not only predicted versus measured values but also trait–trait correlations are preserved in BHPMF, we performed standardized major axis (SMA) regression, the first principal component vector of a correlation matrix fitted through the data centroid (Taskinen & Warton, 2011), on the measured and imputed trait values for some key trait correlations. We also performed a Procrustes analysis with PROTEST (using the R package ‘vegan’) on a PCA of a subset of the original data versus a PCA based on the estimated values for artificially introduced gaps. Due to its good data cover, we performed this test on the RAINFOR extract from the TRY database (see below). Procrustes is a statistical shape analysis tool (least-squares orthogonal mapping) which compares two ‘superimposed’ matrices for overlap, with placement in space and object size being adjustable. We show how uncertainty in trait predictions is accounted for using the Gibbs sampler, comparing prediction confidence (SD) with prediction accuracy (RMSE).
The sensitivity of BHPMF to the fraction of gaps and the
effect of using a global database to fill gaps in local or regional
datasets were explored using two approaches. First by ‘cutting
out’ a local dataset with high coverage, adding additional gaps
(0, 10, 30, 60 and 80%; see Table S8.1 in Appendix S8) and
second by using a regional gappy dataset, filling gaps in each of
these ‘cut-outs’ using (1) the global data with information from
the local/regional data and (2) just the local/regional data. For
our local example, we extracted TRY trait data contributed by
the RAINFOR group (Fyllas et al., 2009), which shows a good
coverage (sparsity 11%) and covers most of the Amazon
(Fig. S8.1 in Appendix S8). For our regional example, we
extracted all of the European data (sparsity 72%) (Fig. S8.2 in
Appendix S8). For details on methodology please refer to
Table 1 Number of entries, sparsity and root mean square error (RMSE) of species mean (MEAN), probabilistic matrix factorization (PMF), Bayesian hierarchical PMF (BHPMF) and advanced hierarchical PMF (aHPMF) by trait, as well as R² values of the regression of imputed versus measured traits. The lowest RMSE and highest R²
SLA, specific leaf area; LDMC, leaf dry matter content; SSD, stem-specific density; Leaf N and Leaf P, leaf nitrogen and phosphorus concentrations per dry mass, respectively; Leaf N/area, leaf nitrogen concentration per leaf area; Leaf C/dry mass, leaf carbon concentration per dry mass. For definitions of all traits and data sources as well as corresponding references see the Supporting Information (Appendices S2, S3 and S11 respectively).
(leaf N) from point measurements to the whole species range of
Acer saccharum using aHPMF.
Probabilistic matrix factorization and subsequent regression
were developed and applied in MATLAB version 2012a
(MATLAB, 2012). All other analyses were performed using the
statistical platform R version 2.15 (R Core Team, 2014). The
maps reported here were produced in ArcMap 10.1 (ArcGIS
Desktop, 2011) and R, using the tree species distribution map of
A. saccharum from the US Geological Survey (Little, 1971). R
scripts to implement BHPMF are available from the authors by
request.
RESULTS
Predicted versus observed trait values
To analyse prediction accuracy we compare RMSE and the coefficient of determination (R²) for MEAN, PMF, BHPMF and aHPMF averaged across traits and for each trait separately (Table 1; for scatterplots of observed versus predicted for all traits see Fig. S9.1 in Appendix S9). On average, across all traits, BHPMF outperforms PMF, MEAN and aHPMF, with MEAN being significantly more accurate than PMF. This holds after statistical evaluation using a paired t-test with P-values smaller than 10⁻⁵ at all levels, and is supported by the evaluation of the coefficient of determination R² (Table 1). As the RMSE is calculated from z-transformed approximate normal distributions of traits, an RMSE of 0.45 for BHPMF indicates that the average error of predictions is about half a standard deviation, or about 10% of the 95% CI. BHPMF outperforms MEAN and PMF in all traits, while aHPMF shows the same or higher RMSE and higher R² than BHPMF for SLA, plant height, leaf dry matter content (LDMC), leaf carbon (C) per dry mass and leaf δ¹⁵N (D15N) (Table 1). The advantage of BHPMF over MEAN is largest for ‘physiological traits’, such as leaf N and leaf phosphorus concentration (leaf P), and smaller for more ‘structural traits’ such as seed mass or plant height. The prediction accuracy of BHPMF varies across traits: from RMSE = 0.36 (R² = 0.92) for seed mass to RMSE = 0.61 (R² = 0.61) for leaf C content per dry mass. Interestingly, prediction accuracy is not related to the number of entries per trait (Table 1).
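The statement that an RMSE of 0.45 corresponds to about 10% of the 95% CI can be checked with a short calculation. Assuming that the 95% interval refers to the central interval of a standard normal distribution in z-transformed space, its width is 2 × 1.96 standard deviations, so that

$$
\frac{\mathrm{RMSE}}{2 \times 1.96} = \frac{0.45}{3.92} \approx 0.115,
$$

i.e. roughly one-tenth of the interval width.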
Accounting for taxonomic hierarchy
The RMSE of MEAN and BHPMF decreases with increasing
taxonomic information, indicating that both methods can
utilize the hierarchical structure to their advantage (Table S7.1
in Appendix S7). This is also supported by the scatter plot of
measured versus predicted specific leaf area (SLA) and leaf N
shown in Fig. 4. With increasing taxonomic information, the
scatter plot approaches the 1:1 line, i.e. prediction accuracy
improves.
Trait–trait correlations
Although the presence of strong trait–trait correlations is a prerequisite for the accuracy of BHPMF, such correlations are not provided a priori and are thus not part of the objective function used (equation 1). This turns them into a suitable evaluation measure. An important quality criterion is to what extent the imputed values reflect the observed bivariate correlations, as this is a first indication of the extent to which the overall correlation structure of the n-dimensional trait matrix is maintained by imputation. Our dataset shows on average strong trait–trait correlations, with some exceptions (Fig. S9.5 in Appendix S9). BHPMF and MEAN capture these general trait–trait correlations, but BHPMF reproduces extreme values more accurately than MEAN and is therefore generally better at capturing the shape of the scatter of observed trait data, which is confirmed by more similar SMA R² values (Fig. S9.2 in Appendix S9). Looking at the multivariate preservation of trait–trait correlations using Procrustes analysis, our results indicate again that BHPMF does not significantly alter the correlation structure of the gap-filled matrix (Fig. 5). The first four principal component axes explain 83.4% and 83.4% of the variability in the dataset for the original and gap-filled data, respectively. None of the principal component axes are significantly different between the gappy and gap-filled data for any of the traits. The traits stem specific density and leaf carbon differ – but not significantly – along the third and fourth axes (see Fig. S9.3 in Appendix S9).
Figure 4 Scatter plots of predicted versus true values for two traits with increasing taxonomic information. Left column, leaf nitrogen concentration per dry weight; right column, specific leaf area. Row 1, no phylogenetic information is used; row 2, only the phylogenetic group is used; row 3, phylogenetic group and family are used; row 4, phylogenetic group, family and genus are used; row 5, phylogenetic group, family, genus and species are used. Predictions are based on Bayesian hierarchical probabilistic matrix factorization. The data are presented in z-transformed space. Dotted lines indicate the 1:1 correlation.
Uncertainty quantified predictions
The Gibbs sampler provides a probability distribution for every single prediction, as shown in the example of Gibbs sampler-generated density plots of BHPMF-estimated LDMC, leaf N and SLA for A. saccharum and Pinus sylvestris trees (Fig. S10.1 in Appendix S10). This distribution can be exploited to calculate indices for the best estimate (e.g. mean) and variability (e.g. SD). This provides an additional means to evaluate our imputation model by comparing prediction confidence (SD) with prediction accuracy (RMSE): when we are confident about our predictions (small SD), these predictions should also be accurate (small RMSE) and vice versa. Figure S10.2 in Appendix S10 shows that this is indeed the case for the whole 13-trait dataset, implying that our model is appropriate. This remains true when we evaluate the Gibbs sampler on each trait separately (Fig. S10.3 in Appendix S10).
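One way to carry out such a comparison of confidence and accuracy is to group test predictions by their predicted SD and compute the RMSE within each group. The sketch below illustrates this idea under stated assumptions; the binning scheme, the function name `rmse_by_sd_bin` and the argument layout are hypothetical and not taken from Appendix S10.

```python
import math


def rmse_by_sd_bin(sds, errors, edges):
    """RMSE of prediction errors, grouped by predicted SD.

    sds    -- predicted SD for each test entry
    errors -- predicted minus observed value for each test entry
    edges  -- increasing bin edges; bin i covers [edges[i], edges[i + 1])
    Returns one RMSE per bin, or None for an empty bin.
    """
    result = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        squared = [e * e for s, e in zip(sds, errors) if lo <= s < hi]
        result.append(math.sqrt(sum(squared) / len(squared)) if squared else None)
    return result
```

If the uncertainty estimates are informative, the returned RMSE values should increase from the low-SD bins to the high-SD bins.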
Gap-filling of regional/local data using BHPMF
As expected, increasing the number of gaps in the RAINFOR dataset generally resulted in a decrease of prediction accuracy (Fig. 6), although less so for structural traits, such as stem-specific density (SSD) and plant height. Reproducibility was high in all cases (Fig. 6). Prediction accuracy of BHPMF was generally approximately equal, no matter whether the regional (RforR) or global (WforR) datasets were used to fill the gaps (Figs 6 & S8.3 in Appendix S8). This was particularly the case if gap sizes were large (above 10%), whilst RforR outperformed WforR for the imputations of plant height, leaf N, SLA and leaf carbon only where additional gap sizes were small (0 and 10%).
Out-of-sample prediction (aHPMF)
We illustrate the extension of BHPMF towards out-of-sample
prediction with the example of leaf N across the species range of
A. saccharum (Fig. 7).
Figure 5 Procrustes analysis errors for the first and second principal component axes comparing a principal components analysis (PCA) performed on the original, gappy RAINFOR data with a PCA performed on the RAINFOR data with artificially introduced gaps being filled using Bayesian hierarchical probabilistic matrix factorization.
Figure 6 Root mean square error (RMSE) of performing Bayesian hierarchical probabilistic matrix factorization (BHPMF) on the RAINFOR cutout (red points in Fig. S8.1 in Appendix S8) for the whole dataset (Total), specific leaf area (SLA), plant height (PlantHt), stem-specific density (SSD), leaf nitrogen (LeafN), leaf phosphorus (LeafP), leaf nitrogen per area (LeafNArea), leaf carbon (LeafC) with increasing number of gaps added to the original RAINFOR data (inherent gappiness of 11%). For the total number of gaps for each trait and added gaps per dataset, see Table S8.1 in Appendix S8. Left- and right-hand sections for each trait (separated by a dotted line) show results when using only the RAINFOR data (RforR) or using all available data (WforR), respectively, to fill the gaps.
due to the taxonomic signal in trait variation at all levels.
This is achieved without a priori assuming a phylogenetic
signal in the trait variability, but rather by opening the door for
our model to extract a signal, if it should be there. Thus, in some
cases, BHPMF might not put any constraint on the imputed
Figure 7 Advanced hierarchical probabilistic matrix factorization (aHPMF)-predicted leaf nitrogen concentration (mg g⁻¹) of Acer saccharum (a), measured values for leaf nitrogen (mg g⁻¹) (b), MAT (mean annual temperature) (c), and MAP (mean annual precipitation) (d) across the species range of A. saccharum. For a map of the geographic location see Figure S5.1 in Appendix S5.
A “hierarchical mean” strategy is used. For example, to predict trait m of plant n, if there are plants in the same species as plant n, we use species mean for prediction; otherwise, if there are plants in the same genus as plant n, we use the genus mean, and so on. In general, among species mean, genus mean, family mean, and phylogenetic group mean, we use the first available one at the lowest level.
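The fallback order of this baseline can be sketched as follows. The data layout (dictionaries with `species`, `genus`, `family` and `group` keys and a `traits` dictionary) and the function name `hierarchical_mean` are assumptions made for illustration, and a level is treated as available only if at least one plant at that level has the trait measured.

```python
def hierarchical_mean(target, trait, plants):
    """Predict `trait` for `target` from the lowest available taxonomic level.

    target -- dict with keys 'species', 'genus', 'family', 'group'
    plants -- training plants: same keys plus 'traits' (dict of measured values)
    Returns the mean over the first level with data, or None if none has any.
    """
    for level in ("species", "genus", "family", "group"):
        values = [
            p["traits"][trait]
            for p in plants
            if p[level] == target[level] and trait in p["traits"]
        ]
        if values:
            return sum(values) / len(values)
    return None
```

The loop stops at the first level that yields any measurement, so a genus mean is only used when no conspecific plant has the trait measured.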
S1.2. BHPMF
BHPMF is a hierarchical Bayesian implementation of PMF. The latent vectors are implemented as Gaussian normal distributions with prior mean of 0 and a variance of $\sigma^2$, which results in two adjustable parameters per latent vector (mean and variance). The length of the latent vectors can be defined, but needs to be constrained to avoid over-fitting (Salakhutdinov and Mnih, 2008). We run a Gibbs sampler for optimization (see S1.3). In principle, this allows us to update $\{U^{(\ell)}\}$ and $\{V^{(\ell)}\}$, which are the matrices formed by stacking the latent vectors u and v at the taxonomic level $\ell$, in an arbitrary order. Empirically, we do it level by level iteratively following a top-down and bottom-up order. In each iteration, we first do a top-down pass to update $(\{U^{(1)}\}, \{V^{(1)}\})$ to $(\{U^{(L)}\}, \{V^{(L)}\})$, followed by a bottom-up pass to update $(\{U^{(L)}\}, \{V^{(L)}\})$ to $(\{U^{(1)}\}, \{V^{(1)}\})$, and repeat the process for several iterations. The intuition is that after updating $(\{U^{(\ell)}\}, \{V^{(\ell)}\})$, we want to immediately use it for regularization (to prevent unnecessary complexity and over-fitting) in the next level update. Empirically, we observed that such a strategy converges faster than only doing top-down updates repeatedly. The posterior over $\{U^{(\ell)}\}$ and $\{V^{(\ell)}\}$ is
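$$
p\left(\{U^{(\ell)}\}, \{V^{(\ell)}\} \mid \{X^{(\ell)}\}, \sigma^2, U^{(0)}, V^{(0)}\right) \propto \prod_{\ell=1}^{L} \left\{ \prod_{n} N\left(u_n^{(\ell)} \mid u_{p(n)}^{(\ell-1)}, \sigma_u^2 I\right) \prod_{m} N\left(v_m^{(\ell)} \mid v_m^{(\ell-1)}, \sigma_v^2 I\right) \prod_{n,m} \delta_{nm}^{(\ell)} N\left(x_{nm}^{(\ell)} \mid \langle u_n^{(\ell)}, v_m^{(\ell)} \rangle, \sigma^2\right) \right\}, \tag{S1}
$$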
where $\{\cdot\}$ denotes the set of data at all L levels (L = 5 for the TRY data), and $\delta_{nm}^{(\ell)} = 1$ when the entry (n, m) of $X^{(\ell)}$ is non-missing and 0 otherwise. MAP (maximum a posteriori) inference, which is similar to Monte Carlo inference on $\{U^{(\ell)}\}$ and $\{V^{(\ell)}\}$, can be done by maximizing the logarithm of the posterior in eqn S1, which boils down to minimizing the regularized squared loss

$$
E = \sum_{\ell=1}^{L} \left\{ \sum_{nm} \delta_{nm}^{(\ell)} \left\| x_{nm}^{(\ell)} - \langle u_n^{(\ell)}, v_m^{(\ell)} \rangle \right\|_2^2 + \lambda_u \sum_{n} \left\| u_n^{(\ell)} - u_{p(n)}^{(\ell-1)} \right\|_2^2 + \lambda_v \sum_{m} \left\| v_m^{(\ell)} - v_m^{(\ell-1)} \right\|_2^2 \right\}, \tag{S2}
$$

where $\lambda_u = \sigma^2/\sigma_u^2$ and $\lambda_v = \sigma^2/\sigma_v^2$, and $\ell - 1$ is replaced by $\ell + 1$ for the bottom-up iteration. The objective function containing $U^{(\ell)}$ and $V^{(\ell)}$ is given by
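$$
E^{(\ell)} = \sum_{n,m} \delta_{nm}^{(\ell)} \left\| x_{nm}^{(\ell)} - \langle u_n^{(\ell)}, v_m^{(\ell)} \rangle \right\|_2^2 + \lambda_u \sum_{n} \left( \left\| u_n^{(\ell)} - u_{p(n)}^{(\ell-1)} \right\|_2^2 + \mathbb{1}_{(\ell<L)} \sum_{n' \in c(n)} \left\| u_n^{(\ell)} - u_{n'}^{(\ell+1)} \right\|_2^2 \right) + \lambda_v \sum_{m} \left( \left\| v_m^{(\ell)} - v_m^{(\ell-1)} \right\|_2^2 + \mathbb{1}_{(\ell<L)} \left\| v_m^{(\ell)} - v_m^{(\ell+1)} \right\|_2^2 \right), \tag{S3}
$$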
where c(n) is the set of child nodes of n, e.g. if n is a species, c(n) denotes plants of that species, and $\mathbb{1}_{(\ell<L)}$ is an indicator function taking value 1 when $\ell < L$ and 0 otherwise. The regularization terms $\| u_n^{(\ell)} - u_{p(n)}^{(\ell-1)} \|_2^2$ and $\| v_m^{(\ell)} - v_m^{(\ell-1)} \|_2^2$ keep $u_n^{(\ell)}$ and $v_m^{(\ell)}$ close to the corresponding latent factor at level $\ell - 1$, and the regularization terms $\sum_{c(n)} \| u_n^{(\ell)} - u_{c(n)}^{(\ell+1)} \|_2^2$ and $\| v_m^{(\ell)} - v_m^{(\ell+1)} \|_2^2$ keep $u_n^{(\ell)}$ and $v_m^{(\ell)}$ close to the corresponding latent factor at level $\ell + 1$ (if applicable).

At least one trait is needed for each plant in order to run matrix factorization methods. Therefore, we split the training, test and validation sets as follows: for each plant, if it has at least three traits available, we randomly hold out one trait for test, one trait for validation, and use the rest for training; if it has two traits available, we randomly use one trait for training and one for test; if it only has one trait available, we use it for training. Following such a strategy, each plant has at least one trait in the training set. The test set is used for test and the validation set is used during the training process for early stopping, i.e. when there have been more than 5 iterations and the performance on the validation set decreases, we stop training. We repeat the holding-out process 5 times to get five randomly split datasets and then construct the upper-level matrices for training and validation; the test set only operates at the plant × trait level.
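The per-plant splitting rule described above can be sketched as a small function. This is an illustration under stated assumptions: the function name `split_plant`, the list-based representation of the traits available for a plant and the use of a seeded `random.Random` instance are not taken from the authors' code.

```python
import random


def split_plant(traits, rng):
    """Split the measured traits of one plant into (train, validation, test).

    >= 3 traits: one for test, one for validation, the rest for training
    2 traits:    one for training, one for test
    1 trait:     training only
    """
    shuffled = list(traits)
    rng.shuffle(shuffled)
    if len(shuffled) >= 3:
        return shuffled[2:], [shuffled[1]], [shuffled[0]]
    if len(shuffled) == 2:
        return [shuffled[1]], [], [shuffled[0]]
    return shuffled, [], []
```

Calling `split_plant(["SLA", "LeafN", "SSD"], random.Random(0))` for every plant and repeating with different seeds would yield several random splits, each of which leaves at least one training entry per row.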
S1.3. Gibbs sampler
Many imputation methods commonly used in ecology do not provide means to assess the uncertainty of every single predicted value. Ideally, one would like to quantify both the expected value and the range of variation (in the case of a normal distribution, the mean and standard deviation (SD)) of the imputations. Using Gibbs sampling we infer the probability distribution for each prediction (Casella and George, 1992). The Gibbs sampler is a Markov chain Monte Carlo algorithm which, based on the Metropolis algorithm (Metropolis et al., 1953), samples from the conditional distribution of one variable given all the others. In BHPMF, each variable (element in each latent factor) is conditionally independent of most other variables, thereby leading to an efficient sampler. In essence, the procedure is as follows: for a given matrix X, the sampler updates the latent factor matrices $(U^{(\ell)}, V^{(\ell)})$ at each level $\ell$, keeping the factors at all other levels fixed. Each sample at the lowest level is obtained by sampling the upper level matrices iteratively following a top-down and bottom-up order (Algorithm S1).
Each row of $U^{(\ell)}$, $u_n^{(\ell)}$, is independent of $U_{-n}^{(\ell)}$, $U_{(-p(n),-c(n))}^{(-\ell)}$, $V^{(-\ell)}$, $X^{(-\ell)}$ and $X_{-n}^{(\ell)}$ given its Markov blanket $(x_n^{(\ell)}, V^{(\ell)}, u_{p(n)}^{(\ell-1)}, u_{c(n)}^{(\ell+1)})$, where p(n) is the parent node of n in the upper level and c(n) is the set of child nodes of n in the lower level. Therefore, the conditional probability of $U^{(\ell)}$ can be factorized into the product of conditional probability of its rows

$$
p\left(U^{(\ell)} \mid X^{(\ell)}, V^{(\ell)}, U_p^{(\ell-1)}, U_c^{(\ell+1)}\right) = \prod_{n} p\left(u_n^{(\ell)} \mid x_n^{(\ell)}, V^{(\ell)}, u_{p(n)}^{(\ell-1)}, u_{c(n)}^{(\ell+1)}\right). \tag{S4}
$$
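By applying Bayes rule and given that the product of multiple Gaussian distributions is another Gaussian distribution, it can be shown that the conditional probability of $u_n$ is a Gaussian distribution

$$
p\left(u_n^{(\ell)} \mid x_n^{(\ell)}, V^{(\ell)}, u_{p(n)}^{(\ell-1)}, u_{c(n)}^{(\ell+1)}\right) = N\left(u_n^{(\ell)} \mid \mu_n^{*(\ell)}, \Sigma_n^{*(\ell)}\right) \sim \prod_{m} \left[ \delta_{nm}^{(\ell)} N\left(x_{nm}^{(\ell)} \mid \langle u_n^{(\ell)}, v_m^{(\ell)} \rangle, \sigma^2\right) \right] N\left(u_n^{(\ell)} \mid u_{p(n)}^{(\ell-1)}, \sigma_u^2 I\right) \prod_{n' \in c(n)} \left[ N\left(u_{n'}^{(\ell+1)} \mid u_n^{(\ell)}, \sigma_u^2 I\right) \right] \tag{S5}
$$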
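$$
\Sigma_n^{*(\ell)} = \left[ \frac{|c(n)| + 1}{\sigma_u^2} I + \frac{1}{\sigma^2} \sum_{m} \delta_{nm}^{(\ell)} v_m^{(\ell)} v_m^{(\ell)T} \right]^{-1}
$$

$$
\mu_n^{*(\ell)} = \Sigma_n^{*(\ell)} \left[ \frac{1}{\sigma_u^2} I \left( u_{p(n)}^{(\ell-1)} + \sum_{n' \in c(n)} u_{n'}^{(\ell+1)} \right) + \frac{1}{\sigma^2} \sum_{m} \delta_{nm}^{(\ell)} x_{nm}^{(\ell)} v_m^{(\ell)} \right]. \tag{S6}
$$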
where $|\cdot|$ denotes set cardinality.

With a similar argument, the conditional probability of $V^{(\ell)}$ can be factorized into the product of conditional probability of its rows, where $x_{:m}$ is column m of x.

$$
p\left(V^{(\ell)} \mid X^{(\ell)}, U^{(\ell)}, V^{(\ell-1)}, V^{(\ell+1)}\right) = \prod_{m} p\left(v_m^{(\ell)} \mid x_{:m}^{(\ell)}, U^{(\ell)}, v_m^{(\ell-1)}, v_m^{(\ell+1)}\right). \tag{S7}
$$
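$$
p\left(v_m^{(\ell)} \mid x_{:m}^{(\ell)}, U^{(\ell)}, v_m^{(\ell-1)}, v_m^{(\ell+1)}\right) = N\left(v_m^{(\ell)} \mid \mu_m^{*(\ell)}, \Sigma_m^{*(\ell)}\right) \sim \prod_{n} \left[ \delta_{nm}^{(\ell)} N\left(x_{nm}^{(\ell)} \mid \langle u_n^{(\ell)}, v_m^{(\ell)} \rangle, \sigma^2\right) \right] N\left(v_m^{(\ell)} \mid v_m^{(\ell-1)}, \sigma_v^2 I\right) N\left(v_m^{(\ell+1)} \mid v_m^{(\ell)}, \sigma_v^2 I\right) \tag{S8}
$$

where

$$
\Sigma_m^{*(\ell)} = \left[ \frac{2}{\sigma_v^2} I + \frac{1}{\sigma^2} \sum_{n} \delta_{nm}^{(\ell)} u_n^{(\ell)} u_n^{(\ell)T} \right]^{-1} \tag{S9}
$$

and

$$
\mu_m^{*(\ell)} = \Sigma_m^{*(\ell)} \left[ \frac{1}{\sigma_v^2} I \left( v_m^{(\ell-1)} + v_m^{(\ell+1)} \right) + \frac{1}{\sigma^2} \sum_{n} \delta_{nm}^{(\ell)} x_{nm}^{(\ell)} u_n^{(\ell)} \right]. \tag{S10}
$$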
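To make the conditional updates concrete, the following sketch draws one Gibbs sample for a row-side latent factor in the special case of a single latent dimension, where the covariance and mean of eqn S6 reduce to scalars. It is a simplification for illustration and not the authors' implementation: the function name `sample_u_scalar`, the argument layout and the restriction to one latent dimension are assumptions.

```python
import random


def sample_u_scalar(x_row, v, parent, children, sigma2, sigma2_u, rng):
    """One Gibbs draw for a scalar latent factor u_n (one latent dimension).

    x_row    -- observed row of the matrix at this level; None marks a gap
    v        -- scalar latent factors of the columns (traits)
    parent   -- latent factor of the parent node at the level above
    children -- latent factors of the child nodes (empty at the lowest level)
    Returns (draw, posterior mean, posterior variance).
    """
    observed = [(x, vm) for x, vm in zip(x_row, v) if x is not None]
    # Precision: prior from parent and children plus the data term (cf. eqn S6).
    precision = (len(children) + 1) / sigma2_u
    precision += sum(vm * vm for _, vm in observed) / sigma2
    variance = 1.0 / precision
    mean = variance * (
        (parent + sum(children)) / sigma2_u
        + sum(x * vm for x, vm in observed) / sigma2
    )
    return rng.gauss(mean, variance ** 0.5), mean, variance
```

The column-side update would follow the same pattern with the two neighbouring-level factors of the trait in place of parent and children, as in eqns S9 and S10.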
For a given matrix X, the sampler updates the latent factor matrices $(U^{(\ell)}, V^{(\ell)})$ at every level $\ell$. Each sample at the lowest level is obtained by sampling the upper level matrices iteratively following a top-down and bottom-up order. At each iteration, we first do a bottom-up pass to sample $(U^{(L)}, V^{(L)})$ to $(U^{(1)}, V^{(1)})$, followed by a top-down pass to sample $(U^{(1)}, V^{(1)})$ to $(U^{(L)}, V^{(L)})$, and repeat the procedure to generate enough samples (Algorithm S1).
Algorithm S1 Gibbs sampling for BHPMF
for $\ell = 1, \dots, L$ do
    Initialize model parameters $\{U^{1(\ell)}, V^{1(\ell)}\}$
for $t = 1, \dots, T$ do
    for $\ell = L, \dots, 1$ do (bottom-up)
        for each $n = 1, \dots, N$ sample $u_n$ in parallel (eqn S5):
            $u_n^{t+1(\ell)} \sim p(u_n^{t(\ell)} \mid x_n^{(\ell)}, V^{t(\ell)}, u_{p(n)}^{t(\ell-1)}, u_{c(n)}^{t(\ell+1)})$
        for each $m = 1, \dots, M$ sample $v_m$ in parallel (eqn S8):
            $v_m^{t+1(\ell)} \sim p(v_m^{t(\ell)} \mid x_{:m}^{(\ell)}, U^{t+1(\ell)}, v_m^{t(\ell-1)}, v_m^{t(\ell+1)})$
    for $\ell = 1, \dots, L$ do (top-down)
        for each $n = 1, \dots, N$ sample $u_n$ in parallel (eqn S5):
            $u_n^{t+2(\ell)} \sim p(u_n^{t+1(\ell)} \mid x_n^{(\ell)}, V^{t+1(\ell)}, u_{p(n)}^{t+1(\ell-1)}, u_{c(n)}^{t+1(\ell+1)})$
        for each $m = 1, \dots, M$ sample $v_m$ in parallel (eqn S8):
            $v_m^{t+2(\ell)} \sim p(v_m^{t+1(\ell)} \mid x_{:m}^{(\ell)}, U^{t+2(\ell)}, v_m^{t+1(\ell-1)}, v_m^{t+1(\ell+1)})$
Here t denotes the sampling iteration, c(n) the set of child nodes, p(n) the parent node, $\ell$ the hierarchical level, u and n the row (entity or, in our case, plant) side and v and m the column (trait) side. The burn-in period used was 100 samples, with a lag of 2 and a final number of 450 samples.
S1.4. aHPMF
HPMF was developed to fill gaps in plant trait matrices. To facilitate trait predictions at regional scale, we stop HPMF at species level, followed by a least squares regression of the residuals against environmental features. Stopping HPMF at species level separates phylogenetic conservatism from environmental drivers. Explicitly taking into account environmental conditions as co-determinants of trait variation enables out-of-sample prediction.
Let $Q_\ell$ denote the number of distinct categories available at level $\ell$ of the taxonomy, e.g. $Q_1$ is the number of species available at $\ell = 1$, and so on. We learn distinct regression model parameters $w_q^{(\ell)}$ for each category q at every level $\ell$ of the taxonomy by partitioning the observations into their respective categories, and also accounting for a hierarchical regularization among the parameters based on the taxonomic hierarchy.

Let $X \in \mathbb{R}^{Q \times 7}$ be the design matrix of climate and soil features (7 covariates, including a column of ones to handle a constant intercept), where the total number of observations, including all levels of the taxonomic hierarchy, is Q. Let $Y \in \mathbb{R}^{Q \times 1}$ be the target residuals of BHPMF to be predicted for a given trait k. Let $(X_q^{(\ell)}, Y_q^{(\ell)})$ be the subset of data that belong to the qth category at level $\ell$. With $w_q^{(\ell)}$ denoting the regression vector for category q at level $\ell$, we consider the following objective function

$$
E(w) = \sum_{\ell=1}^{L} \sum_{q=1}^{Q_\ell} \gamma_\ell \left\| Y_q^{(\ell)} - X_q^{(\ell)} w_q^{(\ell)} \right\|^2 + \lambda \sum_{\ell=1}^{L} \sum_{q=1}^{Q_\ell} \left\| w_q^{(\ell)} - w_{p(q)}^{(\ell-1)} \right\|^2 \tag{S11}
$$
where w denotes the weight vectors over all categories, γ_ℓ is a weight term for minimizing the squared errors at level ℓ, λ is the trade-off parameter of regularization, and p(q) is the parent of category q. Using vector notation, the objective function can be written as:

E(w) = (Y - Xw)^{T} \gamma (Y - Xw) + \lambda w^{T} L w ,    (S12)

where γ is a vector of γ_ℓ for all ℓ, Y is a concatenated vector of Y_q^{(ℓ)}, X is a block diagonal concatenation of X_q^{(ℓ)}, and L is the graph Laplacian of the taxonomic hierarchy. We utilize gradient descent based methods for solving the optimization problem (eqn S12), in a spirit similar to BHPMF. Finally, we use the learned parameters at the species level for predicting the plant trait residuals of BHPMF over unobserved data instances during the testing phase.
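The structure of the objective in eqn S11 can be sketched in plain Python for a toy two-level hierarchy (one genus with two species). This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `objective` and `fit`, the toy data, the step size and the choice to leave the top level without a parent penalty are all hypothetical.

```python
def objective(cats, w, lam):
    """Squared-error term per category plus a penalty tying each
    category's weights to those of its parent."""
    total = 0.0
    for q, c in cats.items():
        for x, y in zip(c["X"], c["y"]):
            r = y - sum(xi * wi for xi, wi in zip(x, w[q]))
            total += c["gamma"] * r * r
        if c["parent"] is not None:
            total += lam * sum((a - b) ** 2 for a, b in zip(w[q], w[c["parent"]]))
    return total

def fit(cats, dim, lam=1.0, lr=0.01, steps=2000):
    """Plain gradient descent on the objective above."""
    w = {q: [0.0] * dim for q in cats}
    for _ in range(steps):
        grad = {q: [0.0] * dim for q in cats}
        for q, c in cats.items():
            # data-fit term: 2 * gamma * X^T (X w - y)
            for x, y in zip(c["X"], c["y"]):
                r = sum(xi * wi for xi, wi in zip(x, w[q])) - y
                for j in range(dim):
                    grad[q][j] += 2.0 * c["gamma"] * r * x[j]
            # hierarchy term: pulls child and parent weights together
            p = c["parent"]
            if p is not None:
                for j in range(dim):
                    d = 2.0 * lam * (w[q][j] - w[p][j])
                    grad[q][j] += d
                    grad[p][j] -= d
        for q in cats:
            for j in range(dim):
                w[q][j] -= lr * grad[q][j]
    return w

# Toy data: columns are (intercept, one environmental covariate).
rows_a, y_a = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0]
rows_b, y_b = [[1.0, 1.0]], [3.5]
cats = {
    "genus": {"X": rows_a + rows_b, "y": y_a + y_b, "gamma": 1.0, "parent": None},
    "species_a": {"X": rows_a, "y": y_a, "gamma": 1.0, "parent": "genus"},
    "species_b": {"X": rows_b, "y": y_b, "gamma": 1.0, "parent": "genus"},
}
w_start = {q: [0.0, 0.0] for q in cats}
w_fit = fit(cats, dim=2)
start_value = objective(cats, w_start, 1.0)
final_value = objective(cats, w_fit, 1.0)
```

In this toy setting, `species_b` has a single observation, so its weights are determined largely by the penalty that ties them to the genus-level weights; this is the role the hierarchical regularization plays for sparsely sampled species.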
S1.5. RMSE
RMSE is used for evaluation. Assuming there are in total T test entries, where a_t is the true value and \hat{a}_t is the predicted value, RMSE is defined as:

RMSE = \sqrt{ \sum_{t} (a_t - \hat{a}_t)^2 / T } .    (S13)
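As a small worked illustration of eqn S13, the following Python sketch computes the RMSE over a list of test entries. The function name `rmse` and the example values are hypothetical.

```python
import math

def rmse(true_values, predicted_values):
    """Root mean squared error over T paired test entries (eqn S13)."""
    if not true_values or len(true_values) != len(predicted_values):
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(true_values, predicted_values))
    return math.sqrt(squared / len(true_values))

# Three test entries, one of which is off by 2: sqrt(4 / 3).
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```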
S1.6. Sensitivity analysis
RMSEs of the sensitivity analysis testing the effect of using a global versus local (RAINFOR)/regional (Europe) dataset to fill gaps, as well as the effect of different degrees of sparsity, were calculated based on the results of three BHPMF runs obtained for each subset (global data to fill local gaps, local data to fill local gaps (10%, 30%, 60% and 80% extra gaps), as well as global data to fill regional gaps and regional data to fill regional gaps). In order to enable comparison of the results, all data were z-transformed on the basis of the whole dataset. To ensure consistency in matrix length, we constrained the random introduction of gaps by retaining one data point in every row (plant), also using the same gaps for the respective a) or b) runs. Gaps were added as a percentage of the available trait data for each trait, which resulted in different absolute degrees of missingness. It should also be noted that the total RMSE was calculated independently for the whole dataset and is not the average across all RMSEs per trait. Differences in RMSE within each treatment are due to pre-processing (i.e. differences in the random setting of gaps) in the three different runs rather than BHPMF per se.
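The constrained introduction of gaps described above can be sketched as follows. This is a minimal illustration, assuming a trait matrix stored as a list of rows with `None` marking missing values; the function `add_gaps`, its rounding rule and the toy matrix are hypothetical and not taken from the original analysis scripts.

```python
import random

def add_gaps(matrix, fraction, seed=0):
    """Return a copy of `matrix` (rows = plants, columns = traits) in which
    about `fraction` of the observed entries of each trait are set to None,
    while every row keeps at least one observed value."""
    rng = random.Random(seed)
    out = [row[:] for row in matrix]
    for k in range(len(out[0])):
        observed = [i for i, row in enumerate(out) if row[k] is not None]
        n_remove = int(round(fraction * len(observed)))
        rng.shuffle(observed)
        removed = 0
        for i in observed:
            if removed == n_remove:
                break
            # retain one data point in every row (plant)
            if sum(v is not None for v in out[i]) > 1:
                out[i][k] = None
                removed += 1
    return out

data = [
    [1.2, None, 3.4],
    [0.7, 2.1, None],
    [None, 1.9, 2.8],
    [1.0, 2.2, 3.1],
]
gappy = add_gaps(data, fraction=0.3, seed=1)
```

Because the fraction is applied per trait to the entries actually observed, traits with different coverage lose different absolute numbers of values, which matches the remark above about differing absolute degrees of missingness.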
References
Casella, G., George, E.I., 1992. Explaining the Gibbs sampler. The American Statistician 46, 167–174.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equation of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087–1092.
Salakhutdinov, R., Mnih, A., 2008. Probabilistic matrix factorization. IEEE CS Press: Advances in Neural Information Processing Systems 20 (NIPS 07).
S2. Definition of traits used in this study
Table S2.1: Numbered code, units of measurement, plant functional traits, number of non-missing geo-referenced entries (Nr. geo-ref.) and definition of the respective trait.
Code | Unit | Trait | Nr. geo-ref. | Definition
1 | mm2 mg−1 | Specific leaf area | 33001 | One-sided area of a fresh leaf divided by its oven-dry mass
2 | m | Plant height | 16465 | Shortest distance between the upper boundary of main photosynthetic tissue or reproduction unit on a plant and the ground level
3 | mg | Seed dry mass | 7311 | Dry mass of a whole single seed
4 | g g−1 | Leaf dry matter content | 17331 | Leaf dry mass per unit of leaf fresh mass (hydrated)
5 | mg mm−3 | Stem specific density | 9191 | Oven-dry mass of a section of a plant’s main stem divided by its fresh volume
6 | mm2 | Leaf area | 39438 | One-sided projected surface area of a single leaf or leaf lamina
7 | mg g−1 | Leaf nitrogen per weight | 26882 | Total amount of nitrogen per unit of leaf dry mass
8 | mg g−1 | Leaf phosphorus per weight | 11975 | Total amount of phosphorus per unit of leaf dry mass
9 | g m−2 | Leaf nitrogen per area | 8180 | Total amount of nitrogen per unit of leaf area (one-sided)
10 | mg | Leaf fresh mass | 11484 | Fresh mass of a whole leaf
11 | g g−1 | Leaf nitrogen/phosphorus ratio | 5999 | Ratio of leaf total nitrogen versus total phosphorus
12 | mg g−1 | Leaf carbon per dry mass | 8125 | Total carbon per unit of leaf dry mass
13 | ‰ | Leaf δ15N | 9022 | Foliar 15N:14N ratios relative to 15N:14N ratios in atmospheric N2
S3. References of contributing databases and number of traits contributed
Table S3.1: References of contributing databases in order of decreasing contribution as measured in the number of traits, where Ref. 1 contributed 23364 traits and Ref. 56 37 traits. Where the database has not been published yet, only the name of the database is given. For a list of contributed traits see table S3.2; for the list of references, see S11.
Ref. | Database | Reference
1 | The LEDA Tb | (Kleyer et al., 2008)
2 | Global 15N Db | (Craine et al., 2009, 2005)
3 | Panama Plant Traits Db | (Wright et al., 2010)
4 | Global Leaf N, P Db | (Reich et al., 2009)
5 | Catalonian Mediterranean Forest Trait Db | (Ogaya and Penuelas, 2003)
6 | The VISTA Plant Trait Db | (Garnier et al., 2007)
7 | Herbaceous Leaf Traits Db Old Field New York |
8 | Panama Leaf Traits Db | (Messier et al., 2010)
9 | Midwestern and Southern US Herbaceous Db |
10 | The RAINFOR Plant Trait Db | (Fyllas et al., 2009)
11 | Global Seed Mass, Plant Height Db | (Moles et al., 2004, 2005)
12 | GLOPNET - Global Plant Trait Network Db | (Wright et al., 2004, 2006)
13 | Sheffield Db | (Cornelissen et al., 2001, 2003)
14 | VegClass CBM Global Db | (Gillison and Carpenter, 1997)
15 | Floridian Leaf Traits Db | (Cavender-Bares et al., 2006)
16 | Neotropic Plant Traits Db | (Wright et al., 2007)
17 | Leaf and Whole Plant Traits Db | (Shipley, 1989, 1995, 2002)
18 | Leaf Physiology Db | (Kattge et al., 2009)
19 | Overton/Wright New Zealand Db |
20 | New South Wales Db | (Fonseca et al., 2000)
21 | Traits of Bornean Trees Db | (Kurokawa and Nakashizuka, 2008)
22 | Chinese Leaf Traits Db | (Han et al., 2005)
23 | The Netherlands Plant Traits Db | (Ordonez et al., 2010)
24 | Leaf Biomechanics Db | (Onoda et al., 2011)
25 | Sheffield-Iran-Spain Db | (Díaz et al., 2004)
26 | Quercus Leaf C and N Db |
27 | Global Leaf Robustness and Physiology Db | (Niinemets, 2001)
28 | Ponderosa Pine Forest Db | (Laughlin et al., 2010)
29 | Ukraine Wetlands Plant Traits Db |
30 | ECOCRAFT | (Medlyn and Jarvis, 1999)
31 | Abisko & Sheffield Db | (Cornelissen et al., 1996, 2004)
32 | ArtDeco Db | (Cornwell et al., 2008)
33 | Photosynthesis and Leaf Characteristics Db |
34 | Caucasus Plant Traits Db |
35 | New South Wales Plant Traits Db |
36 | European Mountain Meadows Plant Traits Db | (Bahn et al., 1999)
37 | CORDOBASE | (Díaz et al., 2004)
38 | Global Respiration Db | (Reich et al., 2008)
39 | Tropical Rainforest Traits Db | (Poorter and Bongers, 2006)
40 | Global A, N, P, SLA Db | (Reich et al., 2009)
41 | ECOQUA South American Plant Traits Db | (Muller et al., 2007)
42 | FAPESP Brazil Rainforest Db |
43 | South African Woody Plants Db (ZLTP) |
44 | French Massif Central Grassland Trait Db | (Louault et al., 2005)
45 | Tropical Traits from West Java Db | (Poorter, 2009)
46 | Jasper Ridge Californian Woody Plants Db | (Preston et al., 2006)
47 | Hawaiian Leaf Traits Db | (Penuelas et al., 2010)
48 | Wetland Dunes Db | (Bodegom et al., 2005, 2008)
49 | Tropical Traits from West Java Db | (Shioder et al., 2008)
50 | Tropical Plant Traits From Borneo Db | (Swaine, 2007)
51 | Herbaceous Traits from the Oland Island Db | (Hickler, 1999)
52 | Traits from Subarctic Plant Species Db | (Freschet et al., 2010)
53 | Leaf and Whole-Plant Traits Db | (Sack et al., 2003, 2005, 2006)
54 | Plant Traits in Pollution Gradients Db |
55 | Cedar Creek Plant Physiology Db |
56 | VirtualForests Trait Db | (Gutierrez and Huth, 2012)
Table S3.2: Databases and number of traits contributed. For a list of references see table S3.1. SLA is the specific leaf area, SDM seed dry mass, LDMC leaf dry matter content, SSD stem specific density, LA leaf area, N leaf nitrogen, P leaf phosphorus, Narea leaf nitrogen per unit leaf area, LFM leaf fresh mass, NtoP the ratio of leaf nitrogen to leaf phosphorus, C leaf carbon and 15N the leaf content of delta 15 nitrogen.
Ref. | SLA | Height | SDM | LDMC | SSD | LA | N | P | Narea | LFM | NtoP | C | 15N
[Rows for Ref. 1 to 56: the cell values are fused in this copy and cannot be assigned to columns.]
sum | 42046 | 27953 | 12620 | 22571 | 9905 | 42469 | 32246 | 14152 | 10158 | 12103 | 7528 | 9204 | 10820
S4. Map of TRY measurement sites
Figure S4.1: Measurement sites of trait data contributed to the TRY project (acquisition date 2012-08-25) on a map of mean annual temperature (extracted from the WorldClim dataset). Note the sparse spatial distribution of measurement sites, especially in extremely cold or hot areas.
S5. Location of Acer saccharum range map and soil and climate across the range of Acer saccharum
Figure S5.1: Geographic location of the species range map of Acer saccharum
Figure S5.2: Soil %clay (a), soil organic carbon (b), %silt (c), %clay (d), isothermality (e) and precipitation seasonality (f) across the species range of Acer saccharum. For a map of the geographic location of these maps, see Figure S5.1.
S6. Correlation between traits and environmental variables used in aHPMF
Table S6.1: Correlations between the raw trait values and the environmental variables used within aHPMF (MAT is the mean annual temperature, MAP is the mean annual precipitation, SOC soil organic carbon, AWC the annual water capacity and ISO isothermality).
S7. Root mean squared error comparison between MEAN, BHPMF and aHPMF
Table S7.1: RMSE ± range of Species Mean (MEAN), HPMF and aHPMF with increasing taxonomic information. Latent dimension k=15 for matrix factorization methods. Lowest RMSE is shown in bold. Phylo is the phylogenetic group.
Taxonomic info | MEAN | HPMF | aHPMF
Phylo | 0.9301 ± 0.0026 | 0.8362 ± 0.0103 | ×
Figure S8.1: Location of the extracted RAINFOR data (red) and the whole TRY data in this region (black).
Table S8.1: Number of observations corresponding to the percentage of added gaps of the RAINFOR database by trait, using a) only the RAINFOR data itself (Rainfor for Rainfor) or b) the whole TRY database (World for Rainfor) to fill these gaps. SLA is the specific leaf area, PlantHt is the plant height, SSD is the stem specific density, P is phosphorus, N is nitrogen and C is carbon.
Rainfor for Rainfor | World for Rainfor
0 | 10 | 30 | 60 | 80 | 0 | 10 | 30 | 60 | 80
Figure S8.2: Location of the extracted European data (red) and the global TRY data (black).
Table S8.2: Number of observations in the exercise of gap-filling the European database, using a) only the European data itself (Europe for Europe) or b) the whole TRY database (World for Europe) to fill these gaps. SLA is the specific leaf area, PlantHt is the plant height, LDMC is the leaf dry matter content, SSD is the stem specific density, N is nitrogen, P is phosphorus and C is carbon.
Trait | Europe for Europe | World for Europe
Total | 155.147 | 204.389
SLA | 24.749 | 33.001
PlantHt | 13.959 | 16.465
Seed mass | 5.930 | 7.311
LDMC | 12.560 | 17.331
SSD | 3.675 | 9.191
Leaf area | 34.813 | 39.438
Leaf N | 18.857 | 26.882
Leaf P | 7.280 | 11.975
Leaf N/area | 5.014 | 8.180
Leaf fresh mass | 11.484 | 11.484
Leaf N/P ratio | 4.123 | 5.999
Leaf C/dry mass | 4.470 | 8.125
Leaf δ15N | 8.233 | 9.007
Figure S8.3: RMSE of performing HPMF on the European cutout (red points in Figure S8.2) for the whole dataset (Total), specific leaf area (SLA), plant height (PlantHt), seed mass, leaf dry matter content (LDMC), stem specific density (SSD), leaf area, leaf nitrogen, leaf phosphorus, leaf nitrogen per area (LeafNArea), leaf fresh mass (LeafFmass), leaf nitrogen to phosphorus ratio (LeafNP), leaf carbon and leaf delta 15 nitrogen (LeafD15N). Grey and black boxplots (on the left and right for each trait respectively) are RMSE for imputations using just the European dataset (grey) or the whole global dataset (black) to fill the gaps.
S9. Bi- and multi-variate relationships between traits, measured and imputed trait values
Figure S9.1: Scatter plot for pairs of traits on true test data (x) versus predicted test data (y) using the MEAN (blue triangles), HPMF (red circles) and aHPMF (green crosses) approach. Panels show observed versus MEAN-, HPMF- and aHPMF-predicted values for plant height, stem specific density (SSD), leaf P, LDMC, foliar N/P ratio, leaf N per area, leaf C per dry mass, seed mass, foliar N and leaf area. All values are given in natural log-normal (base e) transformed space. Note how HPMF and aHPMF imputed values are generally less scattered and closer to the 1:1 line than MEAN imputed values. This holds true for structural traits (such as plant height) as well as physiological traits (like the foliar N to P ratio). All correlations were highly significant at the p < 0.001 level.
Figure S9.2: Scatter plots of trait-trait correlations for observed data (left column), Mean predictions (second column) and BHPMF predictions (right column) for specific leaf area (SLA) versus foliar nitrogen (N) concentration (top row) and foliar phosphorus (P) concentration versus foliar N concentration (bottom row). Values are given in natural log-normal (base e) transformed space. Note the correlation coefficients (R2) given.
Figure S9.3: Procrustes analysis errors for the 3rd and 4th Principal Component axes comparing a PCA performed on the original, gappy RAINFOR data with a PCA performed on the RAINFOR data with artificially introduced gaps being filled using BHPMF.
Figure S9.4: Procrustes analysis errors for the 5th and 6th Principal Component axes comparing a PCA performed on the original, gappy RAINFOR data with a PCA performed on the RAINFOR data with artificially introduced gaps being filled using BHPMF.
Figure S9.5: Bivariate Pearson correlations between traits. Colors indicate strength of relationship, with deep red being positive and deep blue negative correlations. The plots in the diagonal show the distribution of each trait, with the number indicating the trait according to Table S2.1.
S10. Gibbs sampler results
Figure S10.1: Gibbs sampler generated density plots of BHPMF-estimated natural log-normal transformed leaf dry matter content, foliar nitrogen concentration and specific leaf area (SLA) of three Acer saccharum (top row) and Pinus sylvestris (bottom row) individuals respectively.
Figure S10.2: Gibbs sampler for the 13-trait data set (all traits) with the inverse of prediction confidence (SD) on the x-axis (percentage of data with ascending SD) and the prediction error (average RMSE) on the y-axis. The x-axis is produced by sorting the data points in ascending order of their SD, dividing the test sets evenly into 10 classes with ascending SD. I.e., the first class contains the 10% of predictions with the lowest SD, the second class the 10% of predictions with the second lowest SD, and so on. We regress the average RMSE per class against the ordered SD.
Figure S10.3: Gibbs sampler for each of the 13 traits with the inverse of prediction confidence (SD) on the x-axis (percentage of data with ascending SD) and the prediction error (average RMSE) on the y-axis. The x-axis is produced as follows. If we have 100 total entries, we rank them by SD in ascending order, so 10% represents the 10 entries with the lowest SD, 20% the 10 entries with the next lowest SD, etc. From top to bottom and left to right: specific leaf area (SLA), plant height (PlantHt), seed mass, leaf dry matter content (LDMC), stem specific density (SSD), leaf area, leaf nitrogen, leaf phosphorus, leaf nitrogen per area (LeafNArea), leaf fresh mass (LeafFmass), leaf nitrogen to phosphorus ratio (LeafNP), leaf carbon and leaf delta 15 nitrogen (LeafD15N).
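The binning behind Figures S10.2 and S10.3 can be sketched in Python as follows: predictions are sorted by their SD, split into ten equally sized classes, and the RMSE is computed within each class. The function `rmse_by_sd_class` and the synthetic inputs are hypothetical; in particular, the example sets each error equal to its SD purely to make the expected trend obvious.

```python
import math

def rmse_by_sd_class(sd, errors, n_classes=10):
    """Sort predictions by SD (ascending) and return the RMSE of the
    prediction errors within each of `n_classes` equally sized classes."""
    if len(sd) != len(errors) or len(sd) < n_classes:
        raise ValueError("need matching inputs with at least one entry per class")
    order = sorted(range(len(sd)), key=lambda i: sd[i])
    size = len(order) // n_classes
    result = []
    for c in range(n_classes):
        start = c * size
        # the last class absorbs any remainder
        end = (c + 1) * size if c < n_classes - 1 else len(order)
        idx = order[start:end]
        result.append(math.sqrt(sum(errors[i] ** 2 for i in idx) / len(idx)))
    return result

# Synthetic example: 100 predictions in shuffled order, error equal to SD.
sd_values = [((i * 37) % 100 + 1) / 100 for i in range(100)]
error_values = list(sd_values)
per_class_rmse = rmse_by_sd_class(sd_values, error_values)
```

If the SD is an informative measure of prediction confidence, the per-class RMSE should rise from the first class to the last, which is the pattern the figures are designed to show.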
S11. Additional references of data contributors
References
Bahn, M., Wohlfahrt, G., Haubner, E., Horak, I., Michaeler, W., Rottmar, K., Tappeiner, U., Cernusca, A., 1999. Leaf photosynthesis, nitrogen contents and specific leaf area of 30 grassland species in differently managed mountain ecosystems in the eastern alps, in: Cernusca, A., Tappeiner, U., Bayfield, N. (Eds.), Land-Use Changes in European Mountain Ecosystems. ECOMONT - Concept and Results. Blackwell Wissenschaft, Berlin.
Bodegom, P.M.v., Kanter, M., Aerts, C.B.R., 2005. Radial oxygen loss, a plastic property of dune slack plant species. Plant and Soil 271, 351–364.
Bodegom, P.M.v., Sorrell, B.K., Oosthoek, A., Bakker, C., Aerts, R., 2008. Separating the effects of partial submergence and soil oxygen demand on plant physiology. Ecology 89, 193–204.
Cavender-Bares, J., Keen, A., Miles, B., 2006. Phylogenetic structure of floridian plant communities depends on taxonomic and spatial scale. Ecology 87, 109–122.
Cornelissen, J., Aerts, R., Cerabolini, B., Werger, M., van der Heijden, M., 2001. Carbon cycling traits of plant species are linked with mycorrhizal strategy. Oecologia 129, 611–619.
Cornelissen, J., Cerabolini, B., Castro-Díez, P., Villar-Salvador, P., Montserrat-Martí, G., Puyravaud, J., Maestro, M., Werger, M., Aerts, R., 2003. Functional traits of woody plants: correspondence of species rankings between field adults and laboratory-grown seedlings? Journal of Vegetation Science 14, 311–22.
Cornelissen, J.H.C., Diez, P.C., Hunt, R., 1996. Seedling growth, allocation and leaf attributes in a wide range of woody plant species and types. Journal of Ecology 84, 755–765.
Cornelissen, J.H.C., Quested, H.M., Gwynn-Jones, D., Van Logtestijn, R.S.P., De Beus, M.A.H., Kondratchuk, A., Callaghan, T.V., Aerts, R., 2004. Leaf digestibility and litter decomposability are related in a wide range of subarctic plant species and types. Functional Ecology 18, 779–786.
Cornwell, W.K., Cornelissen, J.H.C., Amatangelo, K., Dorrepaal, E., Eviner, V.T., Godoy, O., Hobbie, S.E., Hoorens, B., Kurokawa, H., Pérez-Harguindeguy, N., Quested, H.M., Santiago, L.S., Wardle, D.A., Wright, I.J., Aerts, R., Allison, S.D., Van Bodegom, P., Brovkin, V., Chatain, A., Callaghan, T.V., Díaz, S., Garnier, E., Gurvich, D.E., Kazakou, E., Klein, J.A., Read, J., Reich, P.B., Soudzilovskaia, N.A., Vaieretti, M.V., Westoby, M., 2008. Plant species traits are the predominant control on litter decomposition rates within biomes worldwide. Ecology Letters 11, 1065–1071.
Craine, J.M., Elmore, A.J., Aidar, M.P.M., Bustamante, M., Dawson, T.E., Hobbie, E.A., Kahmen, A., Mack, M.C., McLauchlan, K.K., Michelsen, A., Nardoto, G.B., Pardo, L.H., Peñuelas, J., Reich, P.B., Schuur, E.A.G., Stock, W.D., Templer, P.H., Virginia, R.A., Welker, J.M., Wright, I.J., 2009. Global patterns of foliar nitrogen isotopes and their relationships with climate, mycorrhizal fungi, foliar nutrient concentrations, and nitrogen availability. New Phytologist 183, 980–992.
Craine, J.M., Lee, W.G., Bond, W.J., Williams, R.J., Johnson, L.C., 2005. Environmental constraints on a global relationship among leaf and root traits of grasses. Australian Journal of Botany 86, 12–19.
Díaz, S., Hodgson, J., Thompson, K., Cabido, M., Cornelissen, J., Jalili, A., Montserrat-Martí, G., Grime, J., Zarrinkamar, F., Asri, Y., Band, S., Basconcelo, S., Castro-Díez, P., Funes, G., Hamzehee, B., Khoshnevi, M., Perez-Harguindeguy, N., Perez-Rontome, M., Shirvany, F., Vendramini, F., Yazdani, S., Abbas-Azimi, R., Bogaard, A., Boustani, S., Charles, M., Dehghan, M., de Torres-Espuny, L., Falczuk, V., Guerrero-Campo, J., Hynd, A., Jones, G., Kowsary, E., Kazemi-Saeed, F., Maestro-Martínez, M., Romo-Díez, A., Shaw, S., Siavash, B., Villar-Salvador, P., Zak, M., 2004. The plant traits that drive ecosystems: Evidence from three continents. Journal of Vegetation Science 15, 295–304.
Fonseca, C.R., Overton, J.M., Collins, B., Westoby, M., 2000. Shifts in trait-combinations along rainfall and phosphorus gradients. Journal of Ecology 88, 964–977.
Freschet, G.T., Cornelissen, J.H.C., Van Logtestijn, R.S.P., Aerts, R., 2010. Evidence of the plant economics spectrum in a subarctic flora. Journal of Ecology 98, 362–373.
Fyllas, N.M., Patino, S., Baker, T.R., Bielefeld Nardoto, G., Martinelli, L.A., Quesada, C.A., Paiva, R., Schwarz, M., Horna, V., Mercado, L.M., Santos, A., Arroyo, L., Jimenez, E.M., Luizao, F.J., Neill, D.A., Silva, N., Prieto, A., Rudas, A., Silviera, M., Vieira, I.C.G., Lopez-Gonzalez, G., Malhi, Y., Phillips, O.L., Lloyd, J., 2009. Basin-wide variations in foliar properties of amazonian forest: phylogeny, soils and climate. Biogeosciences 6, 2677–2708.
Garnier, E., Lavorel, S., Ansquer, P., Castro, H., Cruz, P., Dolezal, J., Eriksson, O., Fortunel, C., Freitas, H., Golodets, C., Grigulis, K., Jouany, C., Kazakou, E., Kigel, J., Kleyer, M., Lehsten, V., Lepš, J., Meier, T., Pakeman, R., Papadimitriou, M., Papanastasis, V.P., Quested, H., Quétier, F., Robson, M., Roumet, C., Rusch, G., Skarpe, C., Sternberg, M., Theau, J.P., Thébault, A., Vile, D., Zarovali, M.P., 2007. Assessing the effects of land-use change on plant traits, communities and ecosystem functioning in grasslands: A standardized methodology and lessons from an application to 11 european sites. Annals of Botany 99, 967–985.
Gillison, A.N., Carpenter, G., 1997. A generic plant functional attribute set and grammar for dynamic vegetation description and analysis. Functional Ecology 11, 775–783.
Gutierrez, A.G., Huth, A., 2012. Successional stages of primary temperate rainforests of chiloe island, chile. Perspectives in Plant Ecology, Evolution and Systematics 14, 243–256.
Han, W.X., Fang, J.Y., Guo, D.L., Zhang, Y., 2005. Leaf nitrogen and phosphorus stoichiometry across 753 terrestrial plant species in china. New Phytologist 168, 377–385.
Hickler, T., 1999. Plant functional types and community characteristics along environmental gradients on Oland’s Great Alvar (Sweden). Master’s thesis. University of Lund, Sweden.
Kattge, J., Knorr, W., Raddatz, T., Wirth, C., 2009. Quantifying photosynthetic capacity and its relationship to leaf nitrogen content for global-scale terrestrial biosphere models. Global Change Biology 15, 976–991.
Kleyer, M., Bekker, R., Knevel, I., Bakker, J., Thompson, K., Sonnenschein, M., Poschlod, P., Van Groenendael, J., Klimes, L., Klimesova, J., Klotz, S., Rusch, G., Hermy, M., Adriaens, D., Boedeltje, G., Bossuyt, B., Dannemann, A., Endels, P., Gotzenberger, L., Hodgson, J., Jackel, A.K., Kuhn, I., Kunzmann, D., Ozinga, W., Romermann, C., Stadler, M., Schlegelmilch, J., Steendam, H., Tackenberg, O., Wilmann, B., Cornelissen, J., Eriksson, O., Garnier, E., Peco, B., 2008. The leda traitbase: a database of life-history traits of the northwest european flora. Journal of Ecology 96, 1266–1274.
Kurokawa, H., Nakashizuka, T., 2008. Leaf herbivory and decomposability in a malaysian tropical rain forest. Ecology 89, 2645–2656.Laughlin, D.C., Leppert, J.J., Moore, M.M., Sieg, C.H., 2010. A multi-trait test of the leaf-height-seed plant strategy scheme with 133 species from a
pine forest flora. Functional Ecology 24, 493–501.Louault, F., Pillar, V., Aufrre, J., Garnier, E., Soussana, J.F., 2005. Plant traits and functional types in response to reduced disturbance in a
semi-natural grassland. Journal of Vegetation Science 16, 151–160.Medlyn, B.E., Jarvis, P.G., 1999. Design and use of a database of model parameters from elevated co2 experiments. Ecological Modelling 124,
69–83.Messier, J., McGill, B.J., Lechowicz, M.J., 2010. How do traits vary across ecological scales? a case for trait-based ecology. Ecology Letters 13,
838–848.Moles, A.T., Ackerly, D.D., Webb, C.O., Tweddle, J.C., Dickie, J.B., Westoby, M., 2005. A brief history of seed size. Science 307, 576–580.Moles, A.T., Falster, D.S., Leishman, M.R., Westoby, M., 2004. Small-seeded species produce more seeds per square metre of canopy per year, but
not per individual per lifetime. Journal of Ecology 92, 384–396.Muller, S.C., Overbeck, G.E., Pfadenhauer, J., Pillar, V.D., 2007. Plant functional types of woody species related to fire disturbance in forestgrassland
ecotones. Plant Ecology 189, 1–14.Niinemets, U., 2001. Global-scale climatic controls of leaf dry mass per area, density, and thickness in trees and shrubs. Ecology 82, 453–469.Ogaya, R., Penuelas, J., 2003. Comparative field study of quercus ilex and phillyrea latifolia: photosynthetic response to experimental drought
conditions. Environmental and Experimental Botany 50, 137 – 148.Onoda, Y., Westoby, M., Adler, P.B., Choong, A.M.F., Clissold, F.J., Cornelissen, J.H.C., Dıaz, S., Dominy, N.J., Elgart, A., Enrico, L., Fine, P.V.A.,
Howard, J.J., Jalili, A., Kitajima, K., Kurokawa, H., McArthur, C., Lucas, P.W., Markesteijn, L., Perez-Harguindeguy, N., Poorter, L., Richards,L., Santiago, L.S., Sosinski, E.E., Van Bael, S.A., Warton, D.I., Wright, I.J., Joseph Wright, S., Yamashita, N., 2011. Global patterns of leafmechanical properties. Ecology Letters 14, 301–312.
Ordonez, J.C., Van Bodegom, P.M., Witte, J.P.M., Bartholomeus, R.P., Hal, J.R., Aerts, R., 2010. Plant strategies in relation to resource supply inmesic to wet environments: Does theory mirror nature? The American Naturalist 175, 225–239.
Penuelas, J., Sardans, J., Llusia, J., Owen, S.M., Carnicer, J., Giambelluca, T.W., Rezende, E.L., Waite, M., Niinemets, U., 2010. Faster returns onleaf economics and different biogeochemical niche in invasive compared with native plant species. Global Change Biology 16, 2171–2185.
Poorter, L., 2009. Leaf traits show different relationships with shade tolerance in moist versus dry tropical forests. New Phytologist 181, 890–900.
Poorter, L., Bongers, F., 2006. Leaf traits are good predictors of plant performance across 53 rain forest species. Ecology 87, 1733–1743.
Preston, K.A., Cornwell, W.K., DeNoyer, J.L., 2006. Wood density and vessel traits as distinct correlates of ecological strategy in 51 California coast range angiosperms. New Phytologist 170, 807–818.
Reich, P.B., Oleksyn, J., Wright, I.J., 2009. Leaf phosphorus influences the photosynthesis-nitrogen relation: a cross-biome analysis of 314 species. Oecologia 160, 207–212.
Reich, P.B., Tjoelker, M.G., Pregitzer, K.S., Wright, I.J., Oleksyn, J., Machado, J.L., 2008. Scaling of respiration to nitrogen in leaves, stems and roots of higher land plants. Ecology Letters 11, 793–801.
Sack, L., Cowan, P.D., Jaikumar, N., Holbrook, N.M., 2003. The hydrology of leaves: co-ordination of structure and function in temperate woody species. Plant, Cell and Environment 26, 1343–1356.
Sack, L., Frole, K., 2006. Leaf structural diversity is related to hydraulic capacity in tropical rain forest trees. Ecology 87, 483–491.
Sack, L., Tyree, M.T., 2005. Leaf Hydraulics and Its Implications in Plant Structure and Function. Oxford, UK.
Shiodera, S., Rahajoe, J.S., Kohyama, T., 2008. Variation in longevity and traits of leaves among co-occurring understory plants in a tropical montane forest. Journal of Tropical Ecology 24, 121–133.
Shipley, B., 1989. The use of above-ground maximum relative growth rate as an accurate predictor of whole-plant maximum relative growth rate. Functional Ecology 3, 771–775.
Shipley, B., 1995. Structured interspecific determinants of specific leaf area in 34 species of herbaceous angiosperms. Functional Ecology 9, 312–319.
Shipley, B., 2002. Trade-offs between net assimilation rate and specific leaf area in determining relative growth rate: relationship with daily irradiance. Functional Ecology 16, 682–689.
Swaine, E.K., 2007. Ecological and evolutionary drivers of plant community assembly in a Bornean rain forest. Ph.D. thesis. University of Aberdeen,
N.C.A., Poorter, L., Silman, M.R., Vriesendorp, C.F., Webb, C.O., Westoby, M., Wright, S.J., 2007. Relationships among ecologically important dimensions of plant trait variation in seven neotropical forests. Annals of Botany 99, 1003–1015.
Wright, I.J., Reich, P.B., Atkin, O.K., Lusk, C.H., Tjoelker, M.G., Westoby, M., 2006. Irradiance, temperature and rainfall influence leaf dark respiration in woody plants: evidence from comparisons across 20 sites. New Phytologist 169, 309–319.
Wright, S.J., Kitajima, K., Kraft, N.J., Reich, P.B., Wright, I.J., Bunker, D.E., Condit, R., Dalling, J.W., Davies, S.J., Díaz, S., Engelbrecht, B.M., Harms, K.E., Hubbell, S.P., Marks, C.O., Ruiz-Jaen, M.C., Salvador, C.M., Zanne, A.E., 2010. Functional traits and the growth-mortality tradeoff in tropical trees. Ecology 91, 3664–3674.
Individual contributions of the authors: The concept of the paper was developed by Franziska Schrodt, Jens Kattge, Arindam Banerjee and Peter B. Reich. The methods were developed by Hanhuai Shan, Farideh Fazayeli, Anuj Karpatne, Franziska Schrodt, Arindam Banerjee and Jens Kattge. Data analyses and sensitivity analyses were performed by Franziska Schrodt, Julia Joswig and Farideh Fazayeli. Trait data were provided by the TRY initiative, including major contributions by Jens Kattge, Gerhard Bönisch, Sandra Díaz, John Dickie, Andy Gillison, Sandra Lavorel, Paul Leadley, Christian Wirth, Ian J. Wright, S. Joseph Wright and Peter B. Reich. Franziska Schrodt, Hanhuai Shan, Farideh Fazayeli and Jens Kattge wrote the paper with contributions from all co-authors.