M. Kanevski, Palermo 2009 1 International Workshop: Intelligent Analysis of Environmental Data Institute of Geomatics and Analysis of Risk (IGAR) University of Lausanne, Switzerland Prof. Mikhail Kanevski

Intelligent analysis of environmental data: an introduction Mikhail Kanevski – Institute of Geomatics and Risk Analysis (IGAR), University of Lausanne (Switzerland)

Aug 29, 2014


Intelligent Analysis of Environmental Data (S4 ENVISA Workshop 2009)
Transcript
Page 1: Intelligent analysis of environmental data: an introduction Mikhail Kanevski – Institute of Geomatics and Risk Analysis (IGAR), University of Lausanne (Switzerland)

M. Kanevski, Palermo 2009 1

International Workshop: Intelligent Analysis of Environmental Data

Institute of Geomatics and Analysis of Risk (IGAR)

University of Lausanne, Switzerland

Prof. Mikhail Kanevski

Page 2:

Comments and questions to:

• [email protected]
– www.unil.ch/igar
– www.geokernels.org

Page 3:

General Introduction
Typical problems
Approaches
Solutions
Future research

Page 4:

Geo- and Environmental Data (classes, continuous, images, networks, geomanifolds, …)

• Spatio-temporal
• Multi-scale
• Multivariate
• Highly variable at many scales
• High-dimensional geo-feature spaces
• Uncertainties
• …

• In some cases we do have science-based models: data/knowledge/models integration

Page 5:

Spatio-temporal data in terms of patterns/structures: a. pattern recognition (pattern discovery, pattern extraction), b. pattern modelling, c. pattern prediction

Page 6:

Main Topics:
• Review and posing of typical problems.
• From “numbers” to data
• Collection of data: monitoring networks and data representativity? Monitoring network optimisation.
• Get more information value from your data – EXPLORE! Exploratory spatio-temporal data analysis (EDA, ESDA).
• Predictions/estimations or simulations? Risk analysis and mapping
• Let data speak for themselves: learning from data. Data mining, machine learning.

Page 7:

Methods:
• Monitoring networks descriptions
• Geostatistics: predictions/simulations
• Machine Learning (neural nets, SLT):
– Neural networks: MLP, PNN, GRNN, RBF, SOM. ANNEX models. Hybrid models
– Support Vector Machines
• Recent trends in geostatistics: multiple-point geostatistics, pattern-based geostatistics.
• Bayesian approach for uncertainty assessment, integration of data and science-based models (Bayesian Maximum Entropy)

Page 8:

Spatial data analysis: typical tasks

• Predict a value at a given point.
• Build a map (isolines, 3D surfaces, …).
• Estimate prediction error.
• Take into account measurement errors.
• Risk mapping: uncertainty mapping around the unknown value. Estimate the probability of exceeding a given/decision level.
• Joint predictions of several variables (improve predictions on primary variable using auxiliary data and information).
• Optimization of monitoring network (design/redesign)
• Simulations: modelling of spatial uncertainty and variability
• Data/science-based models assimilation/fusion
• Image analysis. Remote sensing
• Spatio-temporal events (forest fires, epidemiology, crime, …)
• Predictions/simulations in high dimensional spaces
• …

Page 9:

Generic Methodology

[Flowchart boxes: DATA; Statistical Description; Monitoring Network Analysis; Quick Visualisation; Variography; Deterministic Interpolations; Cross-validation; Machine Learning Algorithms; Geostatistical Predictions & Simulations; Monitoring Network Generation; Decision-oriented Mapping; Data Base Management System; GIS, Remote Sensing]

Page 10:

GEOSTATISTICAL ANALYSIS
• Basic/Naïve statistical analysis. EDA
• ESDA (regionalized EDA)
• Structural analysis. Spatial correlation analysis (variography)
• Model selection: cross-validation, jack-knife, …
• Prediction and error mapping for decision making (family of kriging models)
• Probability and risk mapping. Conditional stochastic simulations
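As an illustration of the structural analysis (variography) step, here is a minimal sketch of an omnidirectional empirical semivariogram. The function name, the binning by `lag_width`, and the use of NumPy are assumptions made for this example; they are not taken from the slides.

```python
import numpy as np

def empirical_variogram(coords, values, lag_width, n_lags):
    """Omnidirectional empirical semivariogram (hypothetical helper).

    For each distance bin, gamma(h) is half the mean squared difference
    of the values over all point pairs whose separation falls in the bin.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)        # all unordered pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2
    lags, gamma, counts = [], [], []
    for b in range(n_lags):
        mask = (dist >= b * lag_width) & (dist < (b + 1) * lag_width)
        if mask.any():                              # skip empty bins
            lags.append(dist[mask].mean())
            gamma.append(0.5 * sqdiff[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(lags), np.array(gamma), np.array(counts)
```

The pair counts are returned alongside the variogram values because bins with few pairs give unreliable estimates.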

Page 11:

Some Geostatistics

• Exploration of spatial correlations

• Family of kriging models (simple, ordinary, disjunctive, indicator,…)

• Conditional Stochastic Simulations

Page 12:

Briansk region (radioactivity, Cs137)

Page 13:

Heavy metals, Japan

Page 14:

Switzerland, indoor radon

Page 15:

Measures to characterise monitoring networks (MN)

• Topological
• Statistical
• Fractal/multifractal
• Lacunarity

Page 16:

Preferential Sampling. Declustering Problem

Page 17:

Example: geostatistical spatial co-predictions

Sr90 « expensive » information. Cs137 « cheap » exhaustive information.

Page 18:

(Cross)Variography

Page 19:

Use of Cs137 to improve Sr90 predictions (reduced errors and uncertainty).

Decision-oriented mapping: « Thick isolines »

Page 20:

Simulations and Interpolations

Page 21:

Unconditional simulations

Page 22:

SGSim of the precipitation:

Page 23:

Results of the simulations

Page 24:

Post-processing of simulations: mean and standard deviation
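A minimal sketch of this post-processing step, assuming the stochastic realizations are stored as rows of an array (one row per realization, one column per grid cell). The function name and the `threshold` argument are hypothetical. The exceedance probability is added because it is the quantity used for risk mapping earlier in the talk.

```python
import numpy as np

def postprocess_realizations(realizations, threshold):
    """Summarise a set of simulated maps cell by cell (illustrative only)."""
    sims = np.asarray(realizations, dtype=float)    # shape (n_realizations, n_cells)
    mean_map = sims.mean(axis=0)                    # cell-wise mean
    std_map = sims.std(axis=0, ddof=1)              # cell-wise sample standard deviation
    prob_exceed = (sims > threshold).mean(axis=0)   # fraction of realizations above the level
    return mean_map, std_map, prob_exceed
```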

Page 25:

Geostatistics: some comments
• Geostatistics is a powerful and well elaborated model-dependent approach.
• Geostatistics proposes a variety of models for spatial data analysis and modeling. It has a long and successful history of developments and applications.
• Some problems:
– Nonlinearity
– Non-stationarity
– Two-point statistics
– Data/models integration
– Data mining. Pattern recognition
• Hybrid Models (ANN/SVM + Geostat) can help.

Page 26:

Some useful comments, conclusions and future research

• Detection of patterns: try k-NN or GRNN as exploratory tools
• Cross-validation: leave-one-out, leave-k-out, jackknife, etc. as a control tool
• Model selection and model assessment

Page 27:

K-Nearest Neighbours

Page 28:

K-NN prediction:

NN methods use those k observations in the training data set T closest in input space to the prediction point x to estimate Y:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

where N_k(x) is the neighborhood of x defined by the k closest points in the training set.
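The k-NN prediction formula can be sketched in a few lines. This is an illustrative implementation with Euclidean distance and brute-force search; the function name and argument layout are assumptions for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    dist = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]   # indices of N_k(x)
    return y_train[nearest].mean()                  # (1/k) * sum of y_i over N_k(x)
```

For example, `knn_predict([[0, 0], [1, 0], [0, 2], [5, 5]], [1, 2, 6, 100], [0, 0], k=2)` averages the two nearest targets, 1 and 2.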

Page 29:

k-NN Classifiers

These classifiers are memory-based and do not require any model to be fit! Given a query point x, we find the k training points closest in distance to x and then classify using a MAJORITY vote among the k neighbors.
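A minimal sketch of the majority-vote rule, assuming Euclidean distance. The function name is hypothetical, and the tie-breaking (the label met first among the nearest neighbours wins) is an arbitrary choice for this example.

```python
from collections import Counter
import numpy as np

def knn_classify(X_train, labels, x, k):
    """Return the most frequent label among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    dist = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    votes = Counter(labels[i] for i in nearest)     # count labels among the k neighbours
    return votes.most_common(1)[0][0]               # majority vote
```

No model is fitted: all the work happens at query time, which is what "memory-based" means here.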

Page 30:

Because it uses only the training point closest to the query point, the bias of the 1-NN estimate is often low, but the variance is high.

A famous result of Cover and Hart (1967) shows that asymptotically the error rate of the 1-NN classifier is never more than twice the Bayes rate.

This result can provide a rough idea about the best performance that is possible in a given problem: if the 1-NN rule has a 10% error rate, then asymptotically the Bayes error rate is at least 5%.
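For reference, the asymptotic bound is usually written as follows. The symbols are introduced here for illustration: R* is the Bayes error rate, R_1NN is the asymptotic 1-NN error rate, and M is the number of classes.

$$R^* \;\le\; R_{1NN} \;\le\; R^*\left(2 - \frac{M}{M-1}\,R^*\right) \;\le\; 2R^*$$

With R_1NN = 0.10, the outer inequality gives R* ≥ 0.05, which is the 5% figure quoted above.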

Page 31:

Dirichlet cells, Thiessen tessellation, Voronoï polygons

Page 32:

• How to find k ?

Possible answer:

Cross-validation or leave-one-out
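One way to sketch leave-one-out selection of k for k-NN regression is shown below. The function names, the mean squared error criterion, and the brute-force distance matrix are assumptions for this example.

```python
import numpy as np

def loo_cv_error(X, y, k):
    """Leave-one-out mean squared error of k-NN regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                  # a point may not be its own neighbour
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    pred = y[nearest].mean(axis=1)                  # prediction at each left-out point
    return float(np.mean((pred - y) ** 2))

def choose_k(X, y, candidates):
    """Return the candidate k with the smallest leave-one-out error."""
    errors = {k: loo_cv_error(X, y, k) for k in candidates}
    return min(errors, key=errors.get), errors
```

Plotting `errors` against k gives the cross-validation error curve; its minimum suggests a value of k.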

Page 33:

k-NN prediction (n=6 ?)

[Figure: prediction point with six neighbours numbered 1 to 6 at distances r1 to r6, each with weight Wi ~ (1/n)]

Page 34:

Cross-validation

[Figure: same configuration of six neighbours numbered 1 to 6 at distances r1 to r6, each with weight Wi ~ (1/n)]

Calculate error = (prediction-data)

Page 35:

Leave-next-one-out, etc

[Figure: same configuration of six neighbours numbered 1 to 6 at distances r1 to r6, each with weight Wi ~ (1/n)]

Calculate error = (prediction-data)

Page 36:

Data and k-nn Cross-validation error curve

Page 37:

Complete data set and 500 training points linearly interpolated

Page 38:

Cross-validation curve

Page 39:

K-nn predictions

Page 40:

Machine Learning Algorithms
• Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn".

• More specifically, machine learning is a method for creating computer programs by the analysis of data sets. Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations. ...

Page 41:

Algorithms

Common algorithm types include:
• supervised learning – where the algorithm generates a function that maps inputs to desired outputs.
• unsupervised learning – which models a set of inputs: labeled examples are not available.
• semi-supervised learning – which combines both labeled and unlabeled examples to generate an appropriate function or classifier.
• reinforcement learning – where the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
• transduction – similar to supervised learning, but does not explicitly construct a function: instead, tries to predict new outputs based on training inputs, training outputs, and new inputs.
• The performance and computational analysis of machine learning algorithms is a branch of statistics known as computational learning theory.

Page 42:

ML Topics (short lists)
• Machine learning topics
• Modeling conditional probability density functions, regression and classification
– Artificial neural networks
– Decision trees
– Gene expression programming
– Genetic Programming
– Gaussian process regression
– Linear discriminant analysis
– k-Nearest Neighbor
– Minimum message length
– Perceptron
– Quadratic classifier
– Radial basis functions
– Support vector machines

Page 43:

ML Topics (continued)
• Modeling probability density functions through generative models:
– Expectation-maximization algorithm
– Graphical models including Bayesian networks and Markov Random Fields
– Generative Topographic Mapping
• Approximate inference techniques:
– Markov chain Monte Carlo method
– Variational Bayes
• Meta-Learning (Ensemble methods):
– Boosting
– Bootstrap Aggregating aka Bagging
– Random forest
– Weighted Majority Algorithm
• Optimization: most of the methods listed above either use optimization or are instances of optimization algorithms.
• Multi-objective Machine Learning: an approach that addresses multiple, and often conflicting, learning objectives explicitly using Pareto-based multi-objective optimization techniques.

Page 44:

Machine Learning
• Artificial Neural Networks
1. Multilayer perceptrons (MLP)
2. General Regression Neural Networks (GRNN)
• Statistical Learning Theory
1. Support Vector Classification
2. Support Vector Regression
3. Monitoring Networks Optimization

Page 45:

A Generic Model of Learning from Data/Examples

[Diagram boxes: Generator; Supervisor; Learning Machine]

Page 46:

The Problem of Risk Minimization

In order to choose the best available approximation to the supervisor’s response, one measures the LOSS or discrepancy L(y, f(x,α)) between the response y of the supervisor to a given input x and the response f(x,α) provided by the learning machine.

Page 47:

Three Main Learning Problems

• Regression Estimation. Let the supervisor’s answer y be a real value, and let f(x,α), α ∈ Λ, be a set of real functions which contains the regression function

$$f(x, \alpha_0) = \int y \, dF(y \mid x)$$

Page 48:

The Problem of Risk Minimization

Consider the expected value of the loss, given by the risk functional

$$R(\alpha) = \int L(y, f(x, \alpha)) \, dF(x, y)$$

The goal is to find the function f(x,α_0) which minimises the risk in the situation where the joint pdf is unknown and the only available information is contained in the training set.
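Since F(x,y) is unknown, the risk is in practice replaced by its average over the training set. The standard empirical risk functional is given here for reference; the sample size n and the subscript "emp" are notation introduced for this note.

$$R_{emp}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, \alpha))$$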

Page 49:

• Classification problem:

[Figure: scatter of points labelled A and B (two classes)]

Page 50:

Three Main Learning Problems

• Pattern Recognition (classification). y ∈ {0,1}, classification error:

$$L(y, f(x, \alpha)) = \begin{cases} 0, & \text{if } y = f(x, \alpha) \\ 1, & \text{if } y \neq f(x, \alpha) \end{cases}$$

Page 51:

• Regression problem

[Figure: unknown function f(x) mapping input x to prediction ŷ]

Page 52:

Three Main Learning Problems

• Regression Estimation. It is known that the regression function is the one which minimizes the following loss-function:

$$L(y, f(x, \alpha)) = (y - f(x, \alpha))^2$$

Page 53:

• Probability density estimation

[Figure: density p(x) plotted against x]

Page 54:

Three Main Learning Problems

• Density Estimation. For this problem we consider the following loss-function:

$$L(p(x, \alpha)) = -\log p(x, \alpha)$$
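The three loss functions of the main learning problems, together with the average loss over a sample, can be sketched as follows. The function names are hypothetical, and the model outputs are passed in as plain arrays rather than computed from a fitted f(x,α).

```python
import numpy as np

def zero_one_loss(y, f):
    """Pattern recognition: 0 if the label is predicted correctly, 1 otherwise."""
    return (np.asarray(y) != np.asarray(f)).astype(float)

def squared_loss(y, f):
    """Regression estimation: squared difference between response and prediction."""
    return (np.asarray(y, dtype=float) - np.asarray(f, dtype=float)) ** 2

def neg_log_density_loss(p):
    """Density estimation: minus the log of the model density at each sample."""
    return -np.log(np.asarray(p, dtype=float))

def empirical_risk(losses):
    """Average loss over the training sample."""
    return float(np.mean(losses))
```

For instance, `empirical_risk(zero_one_loss([0, 1, 1, 0], [0, 1, 0, 0]))` is the fraction of misclassified samples.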

Page 55:

Inductive, Deductive and Transductive

[Diagram: training samples (xi, yi); F(x,y); (ynew, xnew); arrows labelled Induction, Deduction and Transduction]

Page 56:

Why Machine Learning algorithms?

• Universal, nonlinear, robust tools
• Data adapted
• Easy data and knowledge integration
• Efficient in high dimensional spaces
• Good generalisation (low prediction error)
• Input/feature selection

Page 57:

Our experience, some applications

• Hydrogeology, pollution/contamination (soil, water, air, food chains, …), topo-climatic modelling, geophysics
• Renewable resources – wind fields
• Natural hazards/risks: forest fires, avalanches, indoor radon
• Optimization of monitoring networks
• Crime data, epidemiology
• MNL for remote sensing, change detection
• Socio-economic spatio-temporal multivariate data
• Spatial econometrics. Financial data. Econophysics
• Fractals, Chaos, EVT
• Time series

Page 58:

Model Selection & Model Evaluation

Page 59:

Guillaume d'Occam (1285 - 1349)

“Pluralitas non est ponenda sine necessitate”

Occam’s razor: “The more simple explanation of the phenomena is more likely to be correct”

Page 60:

Model Assessment and Model Selection:

Two separate goals

Page 61:

Model Selection:

Estimating the performance of different models in order to choose the (approximate) best one

Model Assessment:

Having chosen a final model, estimating its prediction error (generalization error) on new data

Page 62:

If we are in a data-rich situation, the best solution is to split the data randomly (?)

[Diagram: Raw Data split into Train: 50% (Train); Validation: 25% (test); Test: 25% (validation)]

Page 63:

Interpretation

• The training set is used to fit the models

• The validation set is used to estimate prediction error for model selection (tuning hyperparameters)

• The test set is used for assessment of the generalization error of the final chosen model

Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
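A minimal sketch of the 50% / 25% / 25% random split, returning index arrays. The function name and the fixed `seed` are assumptions for this example.

```python
import numpy as np

def split_train_val_test(n_samples, seed=0):
    """Randomly split sample indices into 50% train, 25% validation, 25% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)                # random order of all indices
    n_train = n_samples // 2
    n_val = n_samples // 4
    train = idx[:n_train]                           # used to fit the models
    val = idx[n_train:n_train + n_val]              # used to tune hyperparameters
    test = idx[n_train + n_val:]                    # used once, for the final model
    return train, val, test
```

For spatially clustered measurements a purely random split can put near-duplicate points in both training and test sets, so it may be worth checking whether the error estimate is optimistic.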

Page 64:

Bias and Variance. Model’s complexity

[Figure: two panels, both with a horizontal axis from 2 to 10 and a vertical axis from 0.5 to 3: b. Overfitting; c. Underfitting]

Page 65:

One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples.

This means that the learned function fits the training data very closely but does not generalise well; that is, it cannot model unseen data from the same task sufficiently well.

Solution: balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalization error.

Page 66:

Bias-Variance Dilemma

Assume that

$$Y = f(X) + \varepsilon, \quad \text{where } E(\varepsilon) = 0, \;\; \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$$

Page 67:

We can derive an expression for the expected prediction error of a regression fit at an input point X = x0 using squared-error loss:

Page 68:

$$\begin{aligned}
\mathrm{Err}(x_0) &= E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] \\
&= \sigma_\varepsilon^2 + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0)) \\
&= \text{Irreducible Error} + \mathrm{Bias}^2 + \text{Variance}
\end{aligned}$$

Page 69:

• The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless $\sigma_\varepsilon^2 = 0$.
• The second term is the squared bias, the amount by which the average of our estimate differs from the true mean.
• The last term is the variance, the expected squared deviation of $\hat{f}(x_0)$ around its mean.

Page 70:

For the k-NN regression fit

$$\begin{aligned}
\mathrm{Err}(x_0) &= E\big[(Y - \hat{f}_k(x_0))^2 \mid X = x_0\big] \\
&= \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}
\end{aligned}$$

Here we assume for simplicity that training inputs are fixed, and the randomness arises from the Y. The number of neighbors k is inversely related to the model complexity.
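The three terms of the k-NN decomposition can be evaluated directly when the true function is known, which is only possible in a synthetic experiment. The sketch below assumes one-dimensional fixed training inputs; the function name and arguments are hypothetical.

```python
import numpy as np

def knn_error_decomposition(x_train, f, x0, k, sigma2):
    """Return (irreducible error, squared bias, variance) of k-NN at x0.

    f is the true regression function (vectorised), sigma2 the noise variance.
    """
    x_train = np.asarray(x_train, dtype=float)
    nearest = np.argsort(np.abs(x_train - x0), kind="stable")[:k]
    bias2 = float((f(x0) - f(x_train[nearest]).mean()) ** 2)   # squared bias term
    variance = sigma2 / k                                       # variance term
    return sigma2, bias2, variance
```

Increasing k typically lowers the variance term and raises the squared bias, which is the trade-off the formula expresses.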

Page 71:

Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001

Page 72:

Page 73:

• A neural network is only as good as the training data!

• Poor training data inevitably leads to an unreliable and unpredictable network.

• Exploratory Data Analysis and data preprocessing are extremely important!!!

Page 74:

• If possible, prior to training, add some noise or other randomness to your example (such as a random scaling factor). This helps to account for noise and natural variability in real data, and tends to produce a more reliable network.
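A minimal sketch of this kind of perturbation, assuming additive Gaussian noise plus one random scaling factor per training example. The function name, the default scaling range and the noise level are illustrative assumptions, not values recommended in the slides.

```python
import numpy as np

def jitter_inputs(X, noise_std, scale_range=(0.95, 1.05), seed=0):
    """Return a perturbed copy of the training inputs X (one row per example)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    scale = rng.uniform(*scale_range, size=(X.shape[0], 1))   # random scaling per example
    noise = rng.normal(0.0, noise_std, size=X.shape)          # additive Gaussian noise
    return X * scale + noise
```

The noise level should stay small relative to the measurement scale, otherwise the perturbation hides the pattern the network is meant to learn.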

Page 75:

Hybrid Models: Geostatistics + ML

Page 76:

NNRK/CK Algorithm

[Flowchart boxes: Data F1, F2, ..., Fn; Statistical description; Trend analysis; Structural analysis; Data for training, testing, validation; ANN training; Testing; Validation; ANN architecture choice; Accuracy test; ANN residuals F1, F2, ..., Fn; Statistical description; Multivariate structural analysis; Variogram model for residuals; Cokriging; Cross-validation; Validation; Errors estimates; ANN estimates for F1, F2, ..., Fn; Final estimates (ANN + Geostatistics). Insets: raw data variogram and residual variogram, each plotted as variogram versus lag (km)]

Page 77:

Model: Neural Network Residual Cokriging

[Figure panels: Artificial Neural Network Estimate; Geostatistical Estimate of the Residuals; Final estimate of 90Sr with NNRCK]
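The structure of the hybrid model (trend estimate plus geostatistical estimate of the residuals) can be sketched with simplified stand-ins. In this illustration a linear least-squares trend replaces the neural network, and single-variable simple kriging with an assumed exponential covariance replaces cokriging. The function names, `sill` and `corr_range` are hypothetical, and this is not the NNRCK implementation from the talk.

```python
import numpy as np

def exp_cov(h, sill, corr_range):
    """Exponential covariance model (an assumed choice for this sketch)."""
    return sill * np.exp(-h / corr_range)

def trend_plus_residual_kriging(X, y, X_new, sill=1.0, corr_range=1.0):
    X, y, X_new = (np.asarray(a, dtype=float) for a in (X, y, X_new))
    # Step 1: large-scale trend (linear least squares stands in for the ANN).
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    # Step 2: simple kriging of the residuals, treated as zero-mean.
    d_train = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d_new = np.linalg.norm(X_new[:, None, :] - X[None, :, :], axis=2)
    weights = np.linalg.solve(exp_cov(d_train, sill, corr_range),
                              exp_cov(d_new, sill, corr_range).T)
    resid_new = weights.T @ resid
    # Step 3: final estimate = trend estimate + kriged residual.
    trend_new = np.column_stack([np.ones(len(X_new)), X_new]) @ coef
    return trend_new + resid_new
```

Without a nugget effect the kriged residuals reproduce the training residuals exactly, so the final estimate honours the data at the measurement locations.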

Page 78:

Conclusions
• Machine Learning: universal, data-driven, recently developed approach with many successful applications. Nonlinear, robust. Integration of different types of data and information. Efficient in high dimensional space.
• But: depends on the quality and quantity of data. Uncertainty characterization. Diagnostic tools. Hyper-parameter tuning.

Page 79:

Topics for the research

• Multitask learning
• Automatic feature selection/feature extraction
• Uncertainties characterisation
• Understanding and visualisation of high dimensional data
• Modelling on geomanifold, semi-supervised learning
• Active learning
• MLA and simulations?
• …

Page 80:

Thank you for your attention!

www.geokernels.org
www.unil.ch/igar

[Images labelled 2004, 2008 and 2009]