Introduction to Statistical Modelling Tools for Habitat Models Development, 26-28th Oct 2011, EURO-BASIN, www.euro-basin.eu
Nov 02, 2014
OUTLINE
• Why to model?
• Habitat models
• Model properties
• Steps for modelling
• What about data?
WHY TO MODEL?
• “All models are wrong, some models are useful” (G. Box)
• Models are how we understand the world:
We see the world through models
We learn about the world using formal descriptions
• Model types:
– Static vs dynamic
– Explanatory vs predictive
– Deterministic vs stochastic
– Discrete vs continuous
HABITAT MODELS
• Habitat models focus on how environmental factors control the distribution of species and communities.
• Multiple applications:
– Biogeography, impact of global change, management, conservation, ecology, …
• New conceptual and operational advances due to the growth in computing power, e.g. GIS, remote sensing, new (computer-intensive) statistical modelling tools, etc.
MODEL PROPERTIES
Some desirable model properties:
• Parsimony (Occam’s razor): “All things being equal, the simplest solution tends to be the best one”
• Tractability: easy to analyse
• Conceptually insightful: reveal fundamental properties
• Generalizability: can be applied to other situations/species/…
• Empirical consistency: consistent with the available data
• Falsifiability: can be tested by observations
• Predictive precision
MODEL PROPERTIES
Levins (1966); Sharpe (1990); Guisan and Zimmermann (2000)
Predictive habitat distribution models
MODEL PROPERTIES
A more complex model is not necessarily better…
[Figure: GENERALITY against COMPLEXITY]
STEPS FOR MODELLING
1) Conceptual phase
2) Model formulation
3) Model calibration
4) Spatial predictions
5) Model evaluation
6) Model applicability
STEPS FOR MODELLING
Guisan and Zimmermann (2000)
1. Conceptual phase
• Some sort of theoretical model should be in mind, before a statistical model is even considered
• This phase includes:
– Literature review
– Define an up-to-date conceptual model
– Set multiple hypotheses
– Assess available and missing data
– Identify an appropriate sampling strategy for new data
– Choose appropriate spatio-temporal resolution and geographic extent
– Identify the most appropriate statistical methods for the other phases
2. Model formulation
• The model depends on the type of response variable and its associated probability distribution
Distribution        Examples
Gaussian            Biomass
Poisson             Individual counts
Negative Binomial   Individual counts
Multinomial         Communities
Binomial            Presence/absence
2. Model formulation
Guisan and Zimmermann (2000)
oct-11 © AZTI-Tecnalia
2. Model formulation
REGRESSION ANALYSIS
[Figures: four plots of y against x. In the first two, x runs from 0 to 10 and y from 0 to 50; in the last two, x runs from 0 to 1 and y from -5 to 10.]
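The simplest regression model for plots of y against x is a straight line. The following is a minimal Python sketch of ordinary least squares with one explanatory variable. The function name `fit_line` and the data values are illustrative assumptions, not part of the course material.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data lying exactly on y = 2 + 4x.
x = [0, 2, 4, 6, 8, 10]
y = [2, 10, 18, 26, 34, 42]
intercept, slope = fit_line(x, y)
```

With real data the points scatter around the line, and the residuals are what the diagnostic tools of the calibration phase examine.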
2. Model formulation
REGRESSION ANALYSIS
The response variable y can follow distributions such as:
NORMAL, BINOMIAL, POISSON, GAMMA, etc.
LINK FUNCTION
McCullagh and Nelder (1989); Dobson (2008)
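To make the role of the link function concrete, here is a small sketch of a Poisson regression with a log link, fitted by iteratively reweighted least squares (IRLS) in plain Python. The function name `fit_poisson_glm`, the single-predictor restriction and the counts are assumptions made for illustration. In practice a statistical package would be used.

```python
import math


def fit_poisson_glm(x, y, n_iter=25):
    """Fit log(E[y]) = b0 + b1*x by iteratively reweighted least squares."""
    # Start from the intercept-only fit: the log of the mean count.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]   # linear predictor
        mu = [math.exp(e) for e in eta]    # inverse of the log link
        # Working response and weights for the Poisson family with log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on x, solved from the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx ** 2
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 3, 4, 9, 14]  # hypothetical individual counts
b0, b1 = fit_poisson_glm(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
```

The coefficients act on the log scale, so the fitted means are always positive, which is why the log link suits count data.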
2. Model formulation
REGRESSION ANALYSIS
The response variable y can follow distributions such as:
NORMAL, BINOMIAL, POISSON, GAMMA, etc.
LINK FUNCTION
SMOOTHS
Hastie and Tibshirani (1990); Wood (2006)
2. Model formulation
REGRESSION ANALYSIS
Linear model (LM)
Generalized linear model (GLM)
Generalized additive model (GAM)
Additive model (AM)
2. Model formulation
REGRESSION ANALYSIS
Other regression models:
• Mixed models: LMs, GLMs and GAMs that include random effect terms. Useful for meta-analysis.
• Quantile regression: the quantiles are modelled instead of the mean. Useful for finding limiting factors.
• Segmented regression: the model changes depending on a partition of the explanatory variable. Useful for detecting regime changes.
• Spatial autocorrelation and autoregressive models
2. Model formulation
CLASSIFICATION TECHNIQUES
• Classification is the placement of species and/or sample units into groups based on the environmental variables
• Many techniques are included: classification decision trees, regression decision trees, rule-based classification, maximum-likelihood classification
• Mainly two groups:
– Supervised classification: a training data set is required (groups are known beforehand)
– Unsupervised classification: groups are unknown and need to be defined, as in cluster analysis
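As an illustration of supervised classification, the sketch below trains on sample units whose groups are known and then assigns a new unit to the nearest group centroid. Nearest-centroid classification is used here only as a simple stand-in and is not one of the techniques listed above. The function names, variables and values are hypothetical.

```python
import math


def centroids(samples, labels):
    """Mean environmental vector of each known group (the training step)."""
    sums, counts = {}, {}
    for row, lab in zip(samples, labels):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def classify(row, cents):
    """Assign a new sample unit to the group with the nearest centroid."""
    return min(cents, key=lambda lab: math.dist(row, cents[lab]))


# Hypothetical training data: (temperature, salinity) for each sample unit.
train = [(10.0, 35.0), (11.0, 35.2), (18.0, 33.0), (19.0, 33.4)]
groups = ["cold", "cold", "warm", "warm"]
cents = centroids(train, groups)
predicted = classify((12.0, 34.8), cents)
```

In a real application the variables would normally be standardised first, so that the variable with the largest numerical range does not dominate the distance.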
2. Model formulation
ENVIRONMENTAL ENVELOPES
• The environmental envelope of a species is defined as the set of environments within which it is believed that the species can persist (Walker and Cocks, 1991)
• Examples of models:
– BIOCLIM: minimal rectilinear envelopes based on classification trees
– HABITAT: convex polytope envelopes based on classification trees
– DOMAIN: based on multivariate distance metrics
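A rectilinear envelope can be sketched in a few lines: for each environmental variable, take the range observed where the species is present, and call a site suitable when all its variables fall inside those ranges. This is only in the spirit of a minimal rectilinear envelope and is not the BIOCLIM software itself. The function names and records are assumptions.

```python
def fit_envelope(presences):
    """Per-variable (min, max) over the sites where the species was observed."""
    n_vars = len(presences[0])
    return [(min(p[j] for p in presences), max(p[j] for p in presences))
            for j in range(n_vars)]


def inside(site, envelope):
    """True if every environmental variable lies within its observed range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(site, envelope))


# Hypothetical presence records: (temperature, salinity).
presences = [(12.0, 35.1), (14.5, 35.4), (13.2, 34.9), (15.0, 35.0)]
envelope = fit_envelope(presences)
suitable = inside((13.0, 35.2), envelope)
unsuitable = inside((17.0, 35.2), envelope)
```

Using the full observed range makes the envelope sensitive to outlying records. Trimming each range to chosen percentiles is a common way to soften this.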
2. Model formulation
ORDINATION TECHNIQUES
• Ordination is the arrangement or ‘ordering’ of species and/or sample units along gradients
• Usually applied to community data matrices (rows: species, columns: samples, values: abundances)
2. Model formulation
ORDINATION TECHNIQUES
• Indirect gradient analysis (no environmental data used)
  – Distance-based approaches:
    • Polar ordination, Principal Coordinates Analysis, Nonmetric Multidimensional Scaling
  – Eigenanalysis-based approaches:
    • Linear model
      – Principal Components Analysis
    • Unimodal model
      – Correspondence Analysis, Detrended Correspondence Analysis
• Direct gradient analysis (environmental data used)
  – Linear model
    • Redundancy Analysis
  – Unimodal model
    • Canonical Correspondence Analysis, Detrended Canonical Correspondence Analysis
ter Braak and Prentice (1988)
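As one example of an eigenanalysis-based method, the sketch below finds the first axis of a Principal Components Analysis by power iteration on the covariance matrix and then orders the sample units along it. It is a bare-bones illustration under assumed names and data. A numerical library would be used for real community matrices.

```python
import math


def first_principal_axis(rows, n_iter=200):
    """Leading eigenvector of the covariance matrix, found by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Starting vector; it must not be orthogonal to the leading axis.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    # Scores: position of each sample unit along the first axis.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores


# Hypothetical abundances: rows are sample units, columns are species
# (the transpose of a species-by-samples community matrix).
abundance = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
axis, scores = first_principal_axis(abundance)
```

The scores give the ordering of the sample units along the main gradient of variation in the abundances.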
2. Model formulation
NEURAL NETWORKS
• Models inspired by the human brain (an interconnected group of neurons)
• They define a non-linear function that is decomposed as a weighted sum of functions, each of which can in turn be decomposed further, and so on. The result is a complex non-parametric model (a black box?)
• Adjusted by varying the parameters, the connection weights, or specifics of the architecture such as the number of neurons or their connectivity
• Few examples available yet
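The "weighted sum of functions" can be made concrete with the forward pass of a network with one hidden layer. The sketch below is only that forward pass, with made-up and untrained weights. The logistic activation and all names are assumptions chosen for illustration.

```python
import math


def logistic(t):
    """Logistic (sigmoid) activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """Output = logistic(sum_k v_k * logistic(w_k . x + b_k) + c)."""
    hidden = [logistic(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return logistic(sum(v * h for v, h in zip(out_weights, hidden)) + out_bias)


# Hypothetical weights for 2 environmental inputs and 3 hidden neurons.
hidden_weights = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
hidden_biases = [0.0, -0.5, 0.2]
out_weights = [1.2, -0.8, 0.5]
out_bias = 0.1
p = forward([0.4, 1.1], hidden_weights, hidden_biases, out_weights, out_bias)
```

Fitting the network means adjusting these weights and biases so that the outputs agree with the observations. The number of hidden neurons is part of the architecture.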
3. Model calibration
• It includes model fitting (finding the values of the unknown parameters that give the best agreement between the data and the model outputs) and model selection (deciding which explanatory variables to include)
• To take into account:
– Use of predictors that are ecologically relevant: direct vs indirect (proxy) variables
– Correlation between explanatory variables
• Each method has its own diagnostic tools according to its assumptions, e.g. the residual deviance in regression models
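A quick check for correlation between explanatory variables can be done before any model is fitted. The sketch below computes pairwise Pearson correlations and flags strongly correlated pairs. The 0.7 threshold, the function names and the values are assumptions for illustration, not a recommendation from the course.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def correlated_pairs(predictors, threshold=0.7):
    """List (name, name, r) for every pair with |r| above the threshold."""
    names = sorted(predictors)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(predictors[first], predictors[second])
            if abs(r) > threshold:
                flagged.append((first, second, r))
    return flagged


# Hypothetical explanatory variables measured at five stations.
predictors = {
    "depth": [10.0, 50.0, 100.0, 200.0, 400.0],
    "temperature": [18.0, 16.0, 13.0, 11.0, 7.0],
    "chlorophyll": [0.9, 0.3, 1.2, 0.5, 0.8],
}
flagged = correlated_pairs(predictors)
```

Here depth and temperature are flagged. Including both in the same model would make their separate effects hard to interpret.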
4. Spatial predictions
• Spatial predictions can be made on the data set used for calibration or on new data sets. Care must be taken when predicting on a new data set that contains new combinations of the explanatory variables or values outside the range covered by the calibration data set
• GIS tools are very often used, but many statistical models are still not implemented in a GIS environment
5. Model evaluation
• The aim is to evaluate the predictive power of a model
• If only one data set is available (the one used for calibration): bootstrap, cross-validation, jackknife
• If other data sets are available (independent of the calibration data set), predicted and observed values are compared using:
– the same goodness-of-fit measure as used for model calibration
– any other measure of association
The data sets for calibration and evaluation are called the training and evaluation data sets, respectively. Sometimes the original single data set is split in two (split-sample approach)
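When only one data set is available, k-fold cross-validation repeats the split-sample idea: each fold is held out in turn, the model is calibrated on the rest, and the held-out values are predicted. The sketch below does this for a straight-line model and reports the root mean squared error. The model, the function names, the choice k = 5 and the data are all assumptions for illustration.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b


def k_fold_rmse(x, y, k=5, seed=0):
    """Root mean squared prediction error over k held-out folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)          # reproducible random folds
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        # Predict only the observations that were not used for calibration.
        sq_errors += [(y[i] - (a + b * x[i])) ** 2 for i in held_out]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Hypothetical data: a line plus a fixed alternating deviation of 0.5.
x = [float(i) for i in range(20)]
y = [3.0 + 2.0 * xi + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
rmse = k_fold_rmse(x, y)
```

Because every observation is predicted by a model that never saw it, the resulting error is a fairer estimate of predictive power than the error on the calibration data.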
6. Model applicability
• It refers to the domain over which a validated model can be properly used
• Potential uses (Decoursey, 1992):
– Screening
– Research
– Planning, monitoring and assessment
WHAT ABOUT DATA?
• Data is even more important than the model itself.
• Usually from multiple sources: surveys (continuous, stations, vertical profiles), remote sensing, circulation models, …
• The scale of the response and the environmental variables might not be the same. A common scale unit needs to be defined. Sometimes interpolation might be needed, which might introduce additional uncertainties
• Simple exploratory statistics and figures can be very useful before even starting to think about any model. They also help to spot errors in the data.
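As a small example of bringing variables onto a common scale, the sketch below interpolates an environmental variable linearly to the position where the response was sampled, and refuses to extrapolate beyond the observed range. The function name, the vertical-profile setting and the numbers are hypothetical.

```python
from bisect import bisect_left


def interpolate(x_known, y_known, x_new):
    """Piecewise-linear interpolation; x_known must be sorted ascending.

    Returns None when x_new lies outside the observed range.
    """
    if x_new < x_known[0] or x_new > x_known[-1]:
        return None  # no extrapolation
    i = bisect_left(x_known, x_new)
    if x_known[i] == x_new:
        return y_known[i]
    x0, x1 = x_known[i - 1], x_known[i]
    y0, y1 = y_known[i - 1], y_known[i]
    return y0 + (y1 - y0) * (x_new - x0) / (x1 - x0)


# Hypothetical vertical profile: temperature at four depths (metres).
depths = [0.0, 10.0, 25.0, 50.0]
temps = [18.0, 17.0, 14.0, 12.0]
t_at_20m = interpolate(depths, temps, 20.0)
```

The interpolated value is an estimate, not a measurement, so any uncertainty it adds carries over into the model.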