Intelligent analysis of environmental data: an introduction Mikhail Kanevski – Institute of Geomatics and Risk Analysis (IGAR), University of Lausanne (Switzerland)

M. Kanevski, Palermo 2009

International Workshop: Intelligent Analysis of Environmental Data

Institute of Geomatics and Analysis of Risk (IGAR)

University of Lausanne, Switzerland

Prof. Mikhail Kanevski


Comments and questions to:
• [email protected]
– www.unil.ch/igar
– www.geokernels.org



General Introduction
• Typical problems
• Approaches
• Solutions
• Future research


Geo- and Environmental Data (classes, continuous, images, networks, geomanifolds, …)

• Spatio-temporal
• Multi-scale
• Multivariate
• Highly variable at many scales
• High-dimensional geo-feature spaces
• Uncertainties
• …

• In some cases we do have science-based models: data/knowledge/models integration


Spatio-temporal data in terms of patterns/structures: a. pattern recognition (pattern discovery, pattern extraction), b. pattern modelling, c. pattern prediction


Main Topics:
• Review and posing of typical problems.
• From “numbers” to data.
• Collection of data: monitoring networks and data representativity? Monitoring network optimisation.
• Get more information value from your data: EXPLORE! Exploratory spatio-temporal data analysis (EDA, ESDA).
• Predictions/estimations or simulations? Risk analysis and mapping.
• Let data speak for themselves: learning from data. Data mining, machine learning.


Methods:
• Monitoring network descriptions
• Geostatistics: predictions/simulations
• Machine learning (neural nets, SLT):
– Neural networks: MLP, PNN, GRNN, RBF, SOM. ANNEX models. Hybrid models
– Support Vector Machines
• Recent trends in geostatistics: multiple-point geostatistics, pattern-based geostatistics.
• Bayesian approach for uncertainty assessment, integration of data and science-based models (Bayesian Maximum Entropy)


Spatial data analysis: typical tasks

• Predict a value at a given point.
• Build a map (isolines, 3D surfaces, …).
• Estimate prediction error.
• Take into account measurement errors.
• Risk mapping: uncertainty mapping around an unknown value. Estimate the probability of exceeding a given decision level.
• Joint predictions of several variables (improve predictions of the primary variable using auxiliary data and information).
• Optimization of monitoring networks (design/redesign).
• Simulations: modelling of spatial uncertainty and variability.
• Data/science-based model assimilation/fusion.
• Image analysis. Remote sensing.
• Spatio-temporal events (forest fires, epidemiology, crime, …).
• Predictions/simulations in high-dimensional spaces.
• …


Generic Methodology

[Flowchart boxes: DATA; Statistical Description; Monitoring Network Analysis; Quick Visualisation; Variography; Deterministic Interpolations; Cross-validation; Machine Learning Algorithms; Geostatistical Predictions & Simulations; Monitoring Network Generation; Decision-oriented Mapping; Data Base Management System; GIS; Remote Sensing]


GEOSTATISTICAL ANALYSIS
• Basic/naïve statistical analysis. EDA
• ESDA (regionalized EDA)
• Structural analysis. Spatial correlation analysis (variography)
• Model selection: cross-validation, jack-knife, …
• Prediction and error mapping for decision making (family of kriging models)
• Probability and risk mapping. Conditional stochastic simulations


Some Geostatistics

• Exploration of spatial correlations

• Family of kriging models (simple, ordinary, disjunctive, indicator,…)

• Conditional Stochastic Simulations


Briansk region (radioactivity, Cs137)


Heavy metals, Japan


Switzerland, indoor radon


Measures to characterise monitoring networks (MN)

• Topological
• Statistical
• Fractal/multifractal
• Lacunarity


Preferential Sampling. Declustering Problem


Example: geostatistical spatial co-predictions

Sr90 « expensive » information. Cs137 « cheap » exhaustive information.


(Cross)Variography


Use of Cs137 to improve Sr90 predictions (reduced errors and uncertainty).

Decision-oriented mapping: « Thick isolines »


Simulations and Interpolations


Unconditional simulations


SGSim of the precipitation:


Results of the simulations


Post-processing of simulations: mean and standard deviation


Geostatistics: some comments
• Geostatistics is a powerful and well-elaborated model-dependent approach.
• Geostatistics proposes a variety of models for spatial data analysis and modelling. It has a long and successful history of development and applications.
• Some problems:
– Nonlinearity
– Non-stationarity
– Two-point statistics
– Data/models integration
– Data mining. Pattern recognition
• Hybrid models (ANN/SVM + geostatistics) can help.


Some useful comments, conclusions and future research

• Detection of patterns: try k-NN or GRNN as exploratory tools
• Cross-validation (leave-one-out, leave-k-out, jackknife, etc.) as a control tool
• Model selection and model assessment


K- Nearest Neighbours


k-NN prediction:

NN methods use the k observations in the training data set T closest in input space to the prediction point x to estimate Y:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

where $N_k(x)$ is the neighbourhood of x defined by the k closest points in the training set.
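The averaging rule for k-NN prediction can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and equal weights; the function name `knn_predict` and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k nearest neighbours, i.e. the neighbourhood N_k(x)
    nearest = np.argsort(dists, kind="stable")[:k]
    return y_train[nearest].mean()

# Toy data (made up): four sample locations with one measured value each
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y = [1.0, 2.0, 3.0, 10.0]
prediction = knn_predict(X, y, [0.1, 0.1], k=3)  # mean of the three nearby values
```

With k=3 the distant point at (5, 5) is ignored, so the prediction is the plain average of the three nearby measurements.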


k-NN Classifiers

These classifiers are memory-based and do not require any model to be fit! Given a query point x, we find the k training points closest in the distance to x and then classify using MAJORITY vote among the k neighbors.
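The majority-vote rule can be sketched as follows. This is an illustrative example, assuming Euclidean distance; the name `knn_classify`, the tie-breaking behaviour and the toy data are assumptions made for this sketch.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, labels, x_query, k):
    """Return the most frequent label among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    # Count the labels of the neighbours, nearest first; on a tie,
    # most_common returns the label that was encountered first
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated classes A and B
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["A", "A", "A", "B", "B", "B"]
class_near_origin = knn_classify(X, labels, [0.5, 0.5], k=3)
class_far_away = knn_classify(X, labels, [5.2, 5.2], k=3)
```

No model is fitted: all the work happens at query time, which is what "memory-based" means here.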


Because it uses only the training point closest to the query point, the bias of the 1-nn estimate is often low, but the variance is high.

A famous result of Cover and Hart (1967) shows that asymptotically the error rate of the 1-nn classifier is never more than twice the Bayes rate.

This result can provide a rough idea about the best performance that is possible in a given problem: if the 1-nn rule has a 10% error rate, then asymptotically the Bayes error rate is at least 5%.


Dirichlet cells, Thiessen tessellation, Voronoï polygons


• How to find k ?

Possible answer:

Cross-validation or leave-one-out
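A leave-one-out search for k might look like the sketch below. It assumes k-NN regression with Euclidean distance and mean squared error as the criterion; the function names, the candidate range and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def loo_cv_error(X, y, k):
    """Leave-one-out mean squared error of k-NN regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise Euclidean distances between all samples
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    # A left-out point must not be its own neighbour
    np.fill_diagonal(dists, np.inf)
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    predictions = y[nearest].mean(axis=1)
    return float(np.mean((predictions - y) ** 2))

def choose_k(X, y, candidates):
    """Return the candidate k with the smallest leave-one-out error."""
    errors = {k: loo_cv_error(X, y, k) for k in candidates}
    best_k = min(errors, key=errors.get)
    return best_k, errors

# Synthetic 1D data (made up): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=200)
best_k, errors = choose_k(X, y, range(1, 31))
```

Plotting `errors` against k gives a cross-validation error curve; its minimum is the value of k retained.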


k-NN prediction (n=6 ?)

[Figure: six neighbours, numbered 1 to 6, at distances r1, …, r6 from the prediction point, each with weight Wi ~ (1/n)]


Cross-validation

[Figure: six neighbours, numbered 1 to 6, at distances r1, …, r6 from a left-out data point, each with weight Wi ~ (1/n)]

Calculate error = (prediction − data)


Leave-next-one-out, etc.

[Figure: six neighbours, numbered 1 to 6, at distances r1, …, r6 from the next left-out data point, each with weight Wi ~ (1/n)]

Calculate error = (prediction − data)


Data and k-nn Cross-validation error curve


Complete data set and 500 training points linearly interpolated


Cross-validation curve


K-nn predictions


Machine Learning Algorithms

• Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn".

• More specifically, machine learning is a method for creating computer programs by the analysis of data sets. Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations. ...


Algorithms

Common algorithm types include:
• supervised learning – where the algorithm generates a function that maps inputs to desired outputs.
• unsupervised learning – which models a set of inputs: labeled examples are not available.
• semi-supervised learning – which combines both labeled and unlabeled examples to generate an appropriate function or classifier.
• reinforcement learning – where the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
• transduction – similar to supervised learning, but does not explicitly construct a function: instead, tries to predict new outputs based on training inputs, training outputs, and new inputs.
• The performance and computational analysis of machine learning algorithms is a branch of statistics known as computational learning theory.



ML Topics (short lists)
• Machine learning topics
• Modeling conditional probability density functions, regression and classification
– Artificial neural networks
– Decision trees
– Gene expression programming
– Genetic Programming
– Gaussian process regression
– Linear discriminant analysis
– k-Nearest Neighbor
– Minimum message length
– Perceptron
– Quadratic classifier
– Radial basis functions
– Support vector machines





ML Topics (continued)
• Modeling probability density functions through generative models:
– Expectation-maximization algorithm
– Graphical models including Bayesian networks and Markov Random Fields
– Generative Topographic Mapping
• Approximate inference techniques:
– Markov chain Monte Carlo method
– Variational Bayes
• Meta-Learning (Ensemble methods):
– Boosting
– Bootstrap Aggregating, aka Bagging
– Random forest
– Weighted Majority Algorithm
• Optimization: most of the methods listed above either use optimization or are instances of optimization algorithms.
• Multi-objective Machine Learning: an approach that addresses multiple, and often conflicting, learning objectives explicitly using Pareto-based multi-objective optimization techniques.


Machine Learning
• Artificial Neural Networks
1. Multilayer perceptrons (MLP)
2. General Regression Neural Networks (GRNN)

• Statistical Learning Theory
1. Support Vector Classification
2. Support Vector Regression
3. Monitoring Networks Optimization


A Generic Model of Learning from Data/Examples

[Diagram boxes: Generator; Supervisor; Learning Machine]


The Problem of Risk Minimization

In order to choose the best available approximation to the supervisor’s response, one measures the loss or discrepancy $L(y, f(x,\alpha))$ between the response y of the supervisor to a given input x and the response $f(x,\alpha)$ provided by the learning machine.


Three Main Learning Problems

• Regression Estimation. Let the supervisor’s answer y be a real value, and let $f(x,\alpha)$, $\alpha \in \Lambda$, be a set of real functions which contains the regression function

$$f(x,\alpha_0) = \int y \, dF(y \mid x)$$


The Problem of Risk Minimization

Consider the expected value of the loss, given by the risk functional

$$R(\alpha) = \int L(y, f(x,\alpha)) \, dF(x,y)$$

The goal is to find the function $f(x,\alpha_0)$ which minimises the risk in the situation where the joint pdf is unknown and the only available information is contained in the training set.


• Classification problem:

[Figure: scatter of points labelled A and B forming two classes]



• Pattern Recognition (classification). $y \in \{0,1\}$, classification error:

$$L(y, f(x,\alpha)) = \begin{cases} 0, & \text{if } y = f(x,\alpha) \\ 1, & \text{if } y \neq f(x,\alpha) \end{cases}$$


• Regression problem

[Figure: data in the (x, y) plane with an unknown function f(x) to be estimated]



• Regression Estimation. It is known that the regression function is the one which minimizes the risk functional with the following loss function:

$$L(y, f(x,\alpha)) = (y - f(x,\alpha))^2$$


• Probability density estimation

[Figure: a probability density p(x) plotted against x]



• Density Estimation. For this problem we consider the following loss function:

$$L(p(x,\alpha)) = -\log p(x,\alpha)$$
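Because the joint distribution F(x, y) is unknown, the risk is usually approximated in practice by the average loss over the training set, commonly called the empirical risk (a standard term that the slides do not use explicitly). The sketch below evaluates this average for the three loss functions of the three learning problems; the function names and the numbers are illustrative assumptions.

```python
import numpy as np

def zero_one_loss(y, f):
    """Classification loss: 0 where the prediction matches, 1 otherwise."""
    return (np.asarray(y) != np.asarray(f)).astype(float)

def squared_loss(y, f):
    """Regression loss: squared difference between target and prediction."""
    return (np.asarray(y, dtype=float) - np.asarray(f, dtype=float)) ** 2

def neg_log_loss(p):
    """Density estimation loss: minus the log of the estimated density."""
    return -np.log(np.asarray(p, dtype=float))

def empirical_risk(losses):
    """Average loss over the training sample."""
    return float(np.mean(losses))

# Made-up targets and predictions, only to exercise the three losses
clf_risk = empirical_risk(zero_one_loss([0, 1, 1, 0], [0, 1, 0, 0]))
reg_risk = empirical_risk(squared_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
den_risk = empirical_risk(neg_log_loss([0.5, 0.25]))
```

Minimising such an average over the parameter α is one common way to select f(x, α) when only the training set is available.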


Inductive, Deductive and Transductive

[Diagram: training samples (xi, yi) lead by induction to F(x,y), from which (xnew, ynew) follows by deduction; transduction goes directly from the training samples to (xnew, ynew)]


Why Machine Learning algorithms?

• Universal, nonlinear, robust tools
• Data adapted
• Easy data and knowledge integration
• Efficient in high-dimensional spaces
• Good generalisation (low prediction error)
• Input/feature selection


Our experience, some applications

• Hydrogeology, pollution/contamination (soil, water, air, food chains, …), topo-climatic modelling, geophysics
• Renewable resources – wind fields
• Natural hazards/risks: forest fires, avalanches, indoor radon
• Optimization of monitoring networks
• Crime data, epidemiology
• MNL for remote sensing, change detection
• Socio-economic spatio-temporal multivariate data
• Spatial econometrics. Financial data. Econophysics
• Fractals, chaos, EVT
• Time series


Model Selection & Model Evaluation


Guillaume d'Occam (1285 - 1349)

“Pluralitas non est ponenda sine necessitate”

Occam’s razor: “The simpler explanation of a phenomenon is more likely to be correct”



Model Assessment and Model Selection:

Two separate goals


Model Selection: estimating the performance of different models in order to choose the (approximately) best one

Model Assessment: having chosen a final model, estimating its prediction error (generalization error) on new data


If we are in a data-rich situation, the best solution is to randomly (?) split the data into three parts:

[Diagram: Raw Data split into Train: 50% (train), Validation: 25% (test), Test: 25% (validation)]


Interpretation

• The training set is used to fit the models

• The validation set is used to estimate prediction error for model selection (tuning hyperparameters)

• The test set is used for assessment of the generalization error of the final chosen model

Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
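A purely random 50/25/25 split can be sketched as below. The helper name `split_indices`, the fixed seed and the sample size are assumptions for illustration; the sketch covers only the plain random case.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.5, valid_frac=0.25, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    n_valid = int(round(valid_frac * n_samples))
    train = order[:n_train]                    # used to fit the models
    valid = order[n_train:n_train + n_valid]   # used to select the model
    test = order[n_train + n_valid:]           # used once, for final assessment
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(1000)
```

The three index arrays are disjoint, so the test part plays no role in fitting or in model selection.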


Bias and Variance. Model’s complexity

[Figure: two panels fitting the same data, horizontal axis from 2 to 10 and vertical axis from 0.5 to 3; panel titles: “b. Overfitting” and “c. Underfitting”]


One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples.

This means that the learned function fits the training data very closely but does not generalise well, that is, it cannot model unseen data from the same task sufficiently well.

Solution: balance the statistical bias and statistical variance during neural network learning in order to achieve the smallest average generalization error.


Bias-Variance Dilemma

Assume that

$$Y = f(X) + \varepsilon, \quad \text{where } E(\varepsilon) = 0, \ \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$$


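We can derive an expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$ using squared-error loss:

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat{f}(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0)) \\
&= \text{Irreducible Error} + \mathrm{Bias}^2 + \text{Variance}
\end{aligned}$$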


• The first term is the variance of the target around its true mean $f(x_0)$, and cannot be avoided no matter how well we estimate $f(x_0)$, unless $\sigma_\varepsilon^2 = 0$.
• The second term is the squared bias, the amount by which the average of our estimate differs from the true mean.
• The last term is the variance, the expected squared deviation of $\hat{f}(x_0)$ around its mean.


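For the k-NN regression fit

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat{f}_k(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}
\end{aligned}$$

Here we assume for simplicity that the training inputs are fixed, and the randomness arises from the Y. The number of neighbours k is inversely related to the model complexity.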
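The three terms of the k-NN error (irreducible error, squared bias, variance) can be evaluated numerically when the true function is known, which is only the case in simulations. The sketch below assumes a 1D grid of fixed training inputs, f = sin and a noise variance of 0.09; all of these, and the function name, are illustrative choices.

```python
import numpy as np

def knn_error_terms(f, x_train, x0, k, noise_var):
    """Return (irreducible error, squared bias, variance) of k-NN at x0."""
    x_train = np.asarray(x_train, dtype=float)
    # The k fixed training inputs closest to x0
    nearest = np.argsort(np.abs(x_train - x0), kind="stable")[:k]
    # Squared bias: true value minus the average true value over the neighbours
    bias_sq = float((f(x0) - f(x_train[nearest]).mean()) ** 2)
    # Variance of an average of k independent noisy targets
    variance = noise_var / k
    return noise_var, bias_sq, variance

x_train = np.linspace(0.0, 10.0, 101)
results = {k: knn_error_terms(np.sin, x_train, 5.0, k, 0.09) for k in (1, 5, 25, 75)}
```

As k grows the variance term shrinks while the squared bias tends to grow, which is the trade-off behind the choice of k.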


Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001



• A neural network is only as good as the training data!

• Poor training data inevitably leads to an unreliable and unpredictable network.

• Exploratory Data Analysis and data preprocessing are extremely important!!!


• If possible, prior to training, add some noise or other randomness to your example (such as a random scaling factor). This helps to account for noise and natural variability in real data, and tends to produce a more reliable network.
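One way to add such randomness to training examples is sketched below, combining a random per-sample scaling factor with additive Gaussian noise. The function name, the noise level and the scaling range are assumptions for illustration and would need to be matched to the variability of the real data.

```python
import numpy as np

def augment_with_noise(X, noise_std=0.01, scale_range=(0.95, 1.05), seed=0):
    """Return a perturbed copy of X: random per-row scaling plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # One random scaling factor per training example (row)
    scale = rng.uniform(scale_range[0], scale_range[1], size=(X.shape[0], 1))
    # Independent additive noise on every input value
    noise = rng.normal(0.0, noise_std, size=X.shape)
    return X * scale + noise

X_train = np.array([[1.0, 2.0], [3.0, 4.0]])
X_augmented = augment_with_noise(X_train)
```

The perturbed copy keeps the shape of the original inputs, so it can be used in training alongside or in place of them.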


Hybrid Models: Geostatistics + ML

NNRK/CK Algorithm

[Flowchart boxes:
• Data F1, F2, ..., Fn: statistical description, trend analysis, structural analysis
• Data for training, testing and validation
• ANN training; testing and validation; ANN architecture choice; accuracy test
• ANN estimates for F1, F2, ..., Fn
• ANN residuals F1, F2, ..., Fn: statistical description, multivariate structural analysis, variogram model for residuals
• Cokriging; error estimates; cross-validation; validation
• Final estimates (ANN + Geostatistics)
Insets: raw data variogram and residual variogram, each plotted as variogram against lag (km)]


Model: Neural Network Residual Cokriging

Artificial Neural Network Estimate

Geostatistical Estimate of the Residuals

Final estimate of 90Sr with NNRCK


Conclusions

• Machine Learning: universal, data-driven, recently developed approach with many successful applications. Nonlinear, robust. Integration of different types of data and information. Efficient in high-dimensional spaces.

• But: Depends on the quality and quantity of data. Uncertainty characterization. Diagnostic tools. Hyper-parameters tuning.


Topics for research

• Multitask learning
• Automatic feature selection/feature extraction
• Uncertainty characterisation
• Understanding and visualisation of high-dimensional data
• Modelling on geomanifolds, semi-supervised learning
• Active learning
• MLA and simulations?
• …


Thank you for your attention!


www.geokernels.org

www.unil.ch/igar 2009
