Predictive Habitat Distribution Models, Leire Ibaibarriaga

Introduction to Statistical Modelling Tools for Habitat Models Development, 26-28th Oct 2011, EURO-BASIN, www.euro-basin.eu


OUTLINE

• Why to model?

• Habitat models

• Model properties

• Steps for modelling

• What about data?


WHY TO MODEL?

• “All models are wrong, but some are useful” (G. Box)

• Models are how we understand the world:

We see the world through models

We learn about the world using formal descriptions

• Model types:

– Static vs dynamic

– Explanatory vs predictive

– Deterministic vs stochastic

– Discrete vs continuous


HABITAT MODELS

• Habitat models focus on how environmental factors control the distribution of species and communities.

• Multiple applications:

– Biogeography, impact of global change, management, conservation, ecology, …

• New conceptual and operational advances due to the growth in computing power, e.g. GIS, remote sensing, new (computer-intensive) statistical modelling tools, etc.

MODEL PROPERTIES

Some desirable model properties:

• Parsimony (Occam’s razor): “All things being equal, the simplest solution tends to be the best one”

• Tractability: easy to analyse

• Conceptually insightful: reveal fundamental properties

• Generalizability: can be applied to other situations/species/…

• Empirical consistency: consistent with the available data

• Falsifiability: can be tested by observations

• Predictive precision

[Figure: predictive habitat distribution models placed among model types. Levins (1966); Sharpe (1990); Guisan and Zimmermann (2000)]

A more complex model is not necessarily better…

[Figure: generality against complexity]

STEPS FOR MODELLING

1) Conceptual phase

2) Model formulation

3) Model calibration

4) Spatial predictions

5) Model evaluation

6) Model applicability


STEPS FOR MODELLING

Guisan and Zimmermann (2000)


1. Conceptual phase

• Some sort of theoretical model should be in mind, before a statistical model is even considered

• This phase includes:

– Literature review

– Define an up-to-date conceptual model

– Set multiple hypotheses

– Assess available and missing data

– Identify appropriate sampling strategy for new data

– Choose appropriate spatio-temporal resolution and geographic extent

– Identify the most appropriate statistical methods for the other phases


2. Model formulation

• The model depends on the type of response variable and its associated probability distribution

Distribution          Examples
Gaussian              Biomass
Poisson               Individual counts
Negative Binomial     Individual counts
Multinomial           Communities
Binomial              Presence/absence

2. Model formulation: regression analysis

[Figure: y (0 to 50) plotted against x (0 to 10)]

[Figure: y (-5 to 10) plotted against x (0 to 1)]

oct-11 © AZTI-Tecnalia



The response variable y can follow distributions like: Normal, Binomial, Poisson, Gamma, etc.

[Equation: regression model with a LINK FUNCTION] McCullagh and Nelder (1989); Dobson (2008)

[Equation: regression model with a LINK FUNCTION and SMOOTHS] Hastie and Tibshirani (1990); Wood (2006)
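As a rough illustration of a regression with a non-Normal response and a link function, the sketch below fits a Poisson model with a log link by iteratively reweighted least squares. It is a minimal example under stated assumptions, not the method used in the course: the data are simulated, and the function name `fit_poisson_glm`, the covariate `temperature` and the coefficient values are all hypothetical.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares.

    X must contain a column of ones in position 0 (the intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start at the intercept-only fit
    for _ in range(n_iter):
        eta = X @ beta                  # linear predictor
        mu = np.exp(eta)                # inverse of the log link
        z = eta + (y - mu) / mu         # working response
        XtW = X.T * mu                  # working weights equal mu for this model
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated counts whose log-mean increases linearly with temperature
# (hypothetical data, chosen only for illustration)
rng = np.random.default_rng(0)
temperature = rng.uniform(10.0, 20.0, size=500)
X = np.column_stack([np.ones_like(temperature), temperature])
true_beta = np.array([-2.0, 0.2])
counts = rng.poisson(np.exp(X @ true_beta))

beta_hat = fit_poisson_glm(X, counts)
print(beta_hat)  # should be close to true_beta
```

In practice a statistical package would normally be used for this fit; the loop is spelled out here only to show where the link function enters.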

Linear model (LM)

Generalized linear model (GLM)

Generalized additive model (GAM)

Additive model (AM)
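The four model families can be written side by side to show how they relate. The equations on the original slides did not survive extraction, so the notation below is an assumption using the standard textbook forms: $y_i$ is the response, $x_{ij}$ the explanatory variables, $g$ the link function and $f_j$ the smooth functions.

```latex
\begin{align*}
\text{LM:}  \quad & y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
                    \qquad \varepsilon_i \sim N(0, \sigma^2) \\
\text{GLM:} \quad & g\big(E[y_i]\big) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \\
\text{AM:}  \quad & y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i,
                    \qquad \varepsilon_i \sim N(0, \sigma^2) \\
\text{GAM:} \quad & g\big(E[y_i]\big) = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij})
\end{align*}
```

Read this way, the GLM adds a link function and a non-Normal response to the LM, the AM replaces linear terms with smooths, and the GAM does both.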

Other regression models:

• Mixed models: LMs, GLMs and GAMs including random effect terms. Useful for meta-analysis.

• Quantile regression: the quantiles are modelled instead of the mean. Useful for finding limiting factors.

• Segmented regression: the model changes depending on a partition of the explanatory variable. Useful for detecting regime changes.

• Spatial autocorrelation and autoregressive models


2. Model formulation: classification techniques

• Classification is the placement of species and/or sample units into groups based on the environmental variables

• Many techniques included: classification decision trees, regression decision trees, rule-based classification, maximum-likelihood classification

• Mainly two groups:

– Supervised classification: a training data set is required (groups are known beforehand)

– Unsupervised classification: groups are unknown and need to be defined, as in cluster analysis
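The difference between the two groups can be sketched with a toy example. The supervised part assigns new sites to the nearest centroid of known groups; the unsupervised part uses plain k-means with a farthest-first start. Both are simple stand-ins chosen for brevity, not the techniques listed above, and the site data, group names and function names are hypothetical.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X (Euclidean distance)."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    """Unsupervised grouping: plain k-means with farthest-first initial centroids."""
    centroids = [X[0]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        groups = nearest_centroid(X, centroids)
        updated = np.array([X[groups == g].mean(axis=0) if np.any(groups == g)
                            else centroids[g] for g in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return nearest_centroid(X, centroids), centroids

# Hypothetical sites described by temperature and salinity
rng = np.random.default_rng(42)
coastal = rng.normal([12.0, 33.0], 0.3, size=(30, 2))
oceanic = rng.normal([18.0, 36.0], 0.3, size=(30, 2))
X = np.vstack([coastal, oceanic])
labels = np.array(["coastal"] * 30 + ["oceanic"] * 30)

# Supervised: groups are known beforehand and used as training data
classes = np.unique(labels)
class_centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
new_sites = np.array([[12.2, 33.1], [17.8, 35.9]])
predicted = classes[nearest_centroid(new_sites, class_centroids)]

# Unsupervised: groups are unknown and must be defined from the data
groups, _ = kmeans(X, k=2)
```

Note that the unsupervised result only gives arbitrary group numbers; interpreting them as "coastal" or "oceanic" still requires external knowledge.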

2. Model formulation: environmental envelopes

• The environmental envelope of a species is defined as the set of environments within which it is believed that the species can persist (Walker and Cocks, 1991)

• Examples of models:

– BIOCLIM: minimal rectilinear envelopes based on classification trees

– HABITAT: convex polytope envelopes based on classification trees

– DOMAIN: based on multivariate distance metrics
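To make the idea of a rectilinear envelope concrete, the sketch below builds a box from the minimum and maximum of each environmental variable at presence sites and checks whether new sites fall inside it. This is a simplified illustration only, not the BIOCLIM algorithm itself; the variables (temperature, salinity), the values and the function names are assumptions.

```python
import numpy as np

def fit_envelope(env_at_presences, lower_pct=0.0, upper_pct=100.0):
    """Per-variable (lower, upper) bounds from environments where the species was observed."""
    env = np.asarray(env_at_presences, dtype=float)
    lower = np.percentile(env, lower_pct, axis=0)
    upper = np.percentile(env, upper_pct, axis=0)
    return lower, upper

def inside_envelope(env_new, lower, upper):
    """True for sites whose variables all lie within the envelope bounds."""
    env_new = np.asarray(env_new, dtype=float)
    return np.all((env_new >= lower) & (env_new <= upper), axis=1)

# Hypothetical presence records: columns are temperature and salinity
presences = np.array([[12.0, 35.1],
                      [14.5, 35.4],
                      [13.2, 34.9],
                      [15.0, 35.6]])
lower, upper = fit_envelope(presences)

candidates = np.array([[13.0, 35.2],   # inside both ranges
                       [18.0, 35.2],   # too warm
                       [14.0, 33.0]])  # salinity too low
suitable = inside_envelope(candidates, lower, upper)
print(suitable)
```

Setting `lower_pct` and `upper_pct` to, say, 5 and 95 would give a tighter envelope that is less sensitive to extreme records.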

2. Model formulation: ordination techniques

• Ordination is the arrangement or ‘ordering’ of species and/or sample units along gradients

• Usually applied to community data matrices (rows: species, columns: samples, values: abundance)

• Indirect gradient analysis (no environmental data used)

  – Distance-based approaches:

    • Polar ordination, Principal Coordinates Analysis, Nonmetric Multidimensional Scaling

  – Eigenanalysis-based approaches

    • Linear model

      – Principal Components Analysis

    • Unimodal model

      – Correspondence Analysis, Detrended Correspondence Analysis

• Direct gradient analysis (environmental data used)

  – Linear model

    • Redundancy Analysis

  – Unimodal model

    • Canonical Correspondence Analysis, Detrended Canonical Correspondence Analysis

ter Braak and Prentice (1988)
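As an example of an indirect, eigenanalysis-based ordination with a linear model, the sketch below runs a Principal Components Analysis through the singular value decomposition. It is illustrative only: the community matrix is simulated, and it is arranged with samples as rows and species as columns (the transpose of the layout mentioned above) so that the scores order the samples. The name `pca_ordination` is hypothetical.

```python
import numpy as np

def pca_ordination(abundance, n_axes=2):
    """PCA of a samples-by-species matrix via SVD of the column-centred data."""
    Y = np.asarray(abundance, dtype=float)
    Yc = Y - Y.mean(axis=0)                      # centre each species
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    scores = U[:, :n_axes] * s[:n_axes]          # sample positions on the axes
    loadings = Vt[:n_axes].T                     # species contributions to the axes
    explained = (s ** 2 / np.sum(s ** 2))[:n_axes]
    return scores, loadings, explained

# Simulated community: 20 samples along a hidden gradient, 4 species
rng = np.random.default_rng(3)
gradient = np.linspace(0.0, 1.0, 20)
abundance = np.column_stack([10 * gradient,
                             8 * (1 - gradient),
                             5 * gradient,
                             np.full(20, 3.0)])
abundance = np.clip(abundance + rng.normal(0, 0.3, size=abundance.shape), 0, None)

scores, loadings, explained = pca_ordination(abundance)
print(explained)  # share of variance carried by the first two axes
```

No environmental data enter the calculation; the gradient is recovered from the species abundances alone, which is what makes the analysis indirect.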

2. Model formulation: neural networks

• Models inspired by the human brain (interconnected groups of neurons)

• They define a non-linear function, decomposed as a weighted sum of functions, each of which can be decomposed further in the same way, and so on. The result is a complex non-parametric model (black box?)

• Adjusted by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity

• Few examples available so far
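The nested "weighted sum of functions" can be written out for a network with one hidden layer. The sketch below only evaluates such a function; it does not train it. The layer sizes, the random weights and the choice of tanh and logistic functions are assumptions made for illustration.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    """Evaluate a one-hidden-layer network on the rows of x."""
    # Each hidden neuron applies a non-linear function to a weighted sum of the inputs
    hidden = np.tanh(x @ W1 + b1)
    # The output is a weighted sum of the hidden functions, squashed into (0, 1)
    return 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))

rng = np.random.default_rng(1)
n_inputs, n_hidden = 3, 4                     # architecture: 3 predictors, 4 neurons
W1 = rng.normal(size=(n_inputs, n_hidden))    # connection weights, input to hidden
b1 = np.zeros(n_hidden)
w2 = rng.normal(size=n_hidden)                # connection weights, hidden to output
b2 = 0.0

x = rng.normal(size=(5, n_inputs))            # five hypothetical sites
p = forward(x, W1, b1, w2, b2)                # e.g. read as a probability of presence
print(p)
```

Fitting the model means adjusting `W1`, `b1`, `w2` and `b2` (and possibly `n_hidden`) until the outputs agree with the observations.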


3. Model calibration

• It includes model fitting (finding the values of the unknown parameters that give the best agreement between the data and the model outputs) and model selection (choosing which explanatory variables to include)

• To take into account:

– Use of predictors that are ecologically relevant: direct vs indirect (proxy) variables

– Correlation between explanatory variables

• Each method has its own diagnostic tools according to its assumptions, e.g. the residual deviance in regression models
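One simple check on correlation between explanatory variables is to list the pairs whose pairwise correlation is strong before choosing which to include. The sketch below does this on simulated data; the variable names and the 0.7 threshold are assumptions, not values given in the course.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.7):
    """Return (name_i, name_j, r) for variable pairs with |r| above the threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Simulated predictors: salinity is built to follow temperature closely
rng = np.random.default_rng(5)
temperature = rng.normal(15.0, 2.0, size=200)
depth = rng.uniform(10.0, 200.0, size=200)
salinity = 35.0 + 0.3 * (temperature - 15.0) + rng.normal(0.0, 0.1, size=200)

names = ["temperature", "depth", "salinity"]
X = np.column_stack([temperature, depth, salinity])
pairs = correlated_pairs(X, names)
print(pairs)
```

A flagged pair does not say which variable to drop; that choice goes back to ecological relevance (direct vs proxy variables).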


4. Spatial predictions

• Spatial predictions can be made on the data set used for calibration or on new data sets. Care must be taken when a new data set contains new combinations of the explanatory variables or values outside the range covered by the calibration data set

• GIS tools are very often used, but many statistical models are still not implemented in a GIS environment
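Both risks mentioned above can be screened for before predicting. The sketch below flags new sites with values outside the calibration range and, separately, sites whose combination of values is unlike anything in the calibration data. The second check uses the Mahalanobis distance with the largest calibration distance as cut-off, which is a heuristic chosen here for illustration; data and function names are hypothetical.

```python
import numpy as np

def outside_range(X_calib, X_new):
    """True where any variable falls outside its calibration range."""
    lo, hi = X_calib.min(axis=0), X_calib.max(axis=0)
    return np.any((X_new < lo) | (X_new > hi), axis=1)

def novel_combination(X_calib, X_new):
    """True where the squared Mahalanobis distance exceeds the calibration maximum."""
    mean = X_calib.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X_calib, rowvar=False))

    def dist_sq(X):
        d = X - mean
        return np.einsum("ij,jk,ik->i", d, cov_inv, d)

    return dist_sq(X_new) > dist_sq(X_calib).max()

# Calibration data in which salinity rises with temperature
rng = np.random.default_rng(0)
temp = rng.uniform(10.0, 20.0, size=300)
sal = 34.0 + 0.1 * (temp - 10.0) + rng.normal(0.0, 0.05, size=300)
X_calib = np.column_stack([temp, sal])

X_new = np.array([[15.0, 34.5],   # typical site
                  [25.0, 35.0],   # temperature beyond the calibration range
                  [19.0, 34.3]])  # both values in range, but an unseen combination
range_flag = outside_range(X_calib, X_new)
combo_flag = novel_combination(X_calib, X_new)
print(range_flag, combo_flag)
```

The third site shows why a range check alone is not enough: each value is ordinary on its own, yet warm water with low salinity never occurs in the calibration data.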


5. Model evaluation

• The aim is to evaluate the predictive power of a model

• If only one data set is available (the one already used for calibration): bootstrap, cross-validation, jackknife

• If other data sets are available (independent of the calibration data set), predicted and observed values are compared using:

– the same goodness-of-fit measure as used for model calibration

– any other measure of association

The data sets for calibration and evaluation are called the training and evaluation data sets, respectively. Sometimes the original single data set is split in two (split-sample approach).
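When only one data set is available, cross-validation repeats the split-sample idea several times so that every observation is used once for evaluation. The sketch below is a minimal k-fold version for a linear model fitted by least squares; the simulated data, the choice of five folds and the use of RMSE as the measure are assumptions.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Root mean squared prediction error of a least-squares fit under k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    sq_errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)            # training rows for this fold
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[fold] @ beta                      # predict the held-out rows
        sq_errors.append((y[fold] - pred) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))

# Simulated data: linear response with unit-variance noise
rng = np.random.default_rng(7)
x = rng.uniform(0.0, 10.0, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=100)

cv_rmse = kfold_rmse(X, y, k=5)
print(cv_rmse)  # expected to be near the noise standard deviation
```

Because each prediction is made for rows left out of the fit, the resulting error reflects predictive power rather than goodness of fit to the calibration data.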


6. Model applicability

• It refers to the domain over which a validated model can be properly used

• Potential uses (Decoursey, 1992):

– Screening

– Research

– Planning, monitoring and assessment


WHAT ABOUT DATA?

• Data is even more important than the model itself.

• Usually from multiple sources: surveys (continuous, stations, vertical profiles), remote sensing, circulation models, …

• The scale of the response and the environmental variables might not be the same. A common scale unit needs to be defined. Sometimes interpolation might be needed, which can introduce additional uncertainty

• Simple exploratory statistics and figures can be very useful before even starting to think about any model. They also help to spot errors in the data.
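A small example of bringing two sources onto a common scale: a temperature profile measured at a few depths is linearly interpolated to the depths at which the biological samples were taken. The numbers are invented for illustration, and linear interpolation is only one of several possible choices.

```python
import numpy as np

# Hypothetical vertical profile: temperature measured at four depths (m)
profile_depth = np.array([0.0, 10.0, 20.0, 50.0])
profile_temp = np.array([18.0, 17.0, 14.0, 12.0])

# Depths at which the response variable was sampled
sample_depth = np.array([5.0, 15.0, 35.0])

# Linear interpolation; profile_depth must be increasing for np.interp
sample_temp = np.interp(sample_depth, profile_depth, profile_temp)
print(sample_temp)  # [17.5 15.5 13. ]
```

The interpolated values are estimates, not measurements, which is where the additional uncertainty comes from.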
