Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Jan 22, 2017

Transcript
Page 1: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Giacomo Boracchi 1 and Gregory Ditzler 2

1 Politecnico di Milano, Dipartimento Elettronica e Informazione, Milano, Italy

2 The University of Arizona, Department of Electrical & Computer Engineering, Tucson, AZ, USA

[email protected], [email protected]

Page 2: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Every day, millions of credit card transactions are processed by automatic systems in charge of authorizing, analyzing, and eventually detecting frauds.

Fraud Detection

Page 3: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Dal Pozzolo A., Boracchi G., Caelen O., Alippi C., and Bontempi G., "Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information," Proceedings of IJCNN 2015.


Page 5: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Every day, millions of credit card transactions are processed by automatic systems in charge of authorizing, analyzing, and eventually detecting frauds.

Fraud detection is performed by a classifier that associates with each transaction a label: «genuine» or «fraudulent».

This is a challenging classification problem because:

§ A massive amount of transactions arrives as a stream
§ Data are high-dimensional (relative to the amount of supervised samples)
§ Classes are unbalanced
§ Concept drift: new fraudulent strategies appear over time
§ Concept drift: genuine transactions evolve over time

Concept drift “changes the problem” the classifier has to address

Page 6: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Learning problems related to predicting user preferences / interests, such as:

§ Recommendation systems
§ Spam / email filtering

When users change their own preferences, the classification problem changes.

Spam Classification

Page 7: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Prediction problems like:

§ Financial markets analysis
§ Environmental monitoring
§ Critical infrastructure monitoring / management

where data often come in the form of time series and the data-generating process typically evolves over time.

Financial Markets

Page 8: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

In practice, Concept Drift (CD) is a problem in all application scenarios where:

§ data come in the form of a stream
§ the data-generating process might evolve over time
§ data-driven models are used

In these cases, the data-driven model should autonomously adapt to preserve its performance over time.

Page 9: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

This tutorial focuses on:

§ methodologies and general approaches for adapting data-driven models to Concept Drift (i.e., in Nonstationary Environments)
§ learning aspects; related issues like change/outlier/anomaly detection are not discussed in detail
§ classification as an example of a supervised learning problem. Regression problems are not considered here, even though similar issues apply
§ the most important frameworks that can be implemented using any classifier, rather than solutions for specific classifiers
§ illustrations typically refer to scalar and numerical data, even though the methodologies often apply to multivariate and numerical or categorical data as well

Page 10: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

The tutorial is far from being exhaustive… please have a look at the very good surveys below

J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys (CSUR), vol. 46, no. 4, p. 44, 2014.

G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, "Adaptive strategies for learning in nonstationary environments," IEEE Computational Intelligence Magazine, November 2015.

C. Alippi, G. Boracchi, G. Ditzler, R. Polikar, and M. Roveri, "Adaptive Classifiers for Nonstationary Environments," Contemporary Issues in Systems Science and Engineering, IEEE/Wiley Press Book Series, 2015.

Page 11: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches


We hope this tutorial will help researchers from other disciplines familiarize themselves with the problem and possibly contribute to the development of this research field.

Let’s try to make this tutorial as interactive as possible

Page 12: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ Problem Statement
  § Drift taxonomy
  § The issue
§ Active Approaches
  § CD detection monitoring the classification error
  § CD detection monitoring raw data
  § JIT classifiers
  § Window comparison methods
§ Passive Approaches
  § Single model methods
  § Ensemble methods
  § Initially labelled environments
§ Datasets and Codes
§ Concluding Remarks and Research Perspectives

Page 13: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Learning in Nonstationary (Streaming) Environments

Page 14: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

The problem: classification over a potentially infinitely long stream of data

$X = \{\boldsymbol{x}_0, \boldsymbol{x}_1, \dots\}$

The data-generating process $\mathcal{X}$ generates tuples $(\boldsymbol{x}_t, y_t) \sim \mathcal{X}$
§ $\boldsymbol{x}_t$ is the observation at time $t$ (e.g., $\boldsymbol{x}_t \in \mathbb{R}^d$)
§ $y_t$ is the associated label, which is (often) unknown ($y_t \in \Lambda$)

The task: learn an adaptive classifier $K_t$ to predict labels $\hat{y}_t = K_t(\boldsymbol{x}_t)$ in an online manner having a low classification error

$$p(T) = \frac{1}{T}\sum_{t=1}^{T} e_t, \quad \text{where } e_t = \begin{cases} 0, & \text{if } \hat{y}_t = y_t \\ 1, & \text{if } \hat{y}_t \neq y_t \end{cases}$$
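To fix ideas, here is a minimal Python sketch of how $p(T)$ accumulates online; `stream` and `classifier` (with `predict`/`update` methods) are hypothetical placeholders, not code from the tutorial:

```python
# Minimal sketch of online (prequential) error tracking.
def running_error(stream, classifier):
    errors = 0
    for t, (x_t, y_t) in enumerate(stream, start=1):
        y_hat = classifier.predict(x_t)   # predict first, ...
        errors += int(y_hat != y_t)       # e_t = 1 on a mistake
        classifier.update(x_t, y_t)       # ... then (possibly) adapt K_t
        yield errors / t                  # p(T) = (1/T) * sum_t e_t
```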


Page 16: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Typical assumptions:

§ Independent and identically distributed (i.i.d.) inputs: $(\boldsymbol{x}_t, y_t) \sim \phi(\boldsymbol{x}, y)$
§ An initial training set $TR = \{(\boldsymbol{x}_0, y_0), \dots, (\boldsymbol{x}_n, y_n)\}$ is provided for learning $K_0$
§ $TR$ contains data generated in stationary conditions

A stationary condition of $\mathcal{X}$ is also denoted a concept.

Page 17: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Unfortunately, in the real world, the datastream $\mathcal{X}$ might change unpredictably during operation.

The data-generating process is then modeled as $(\boldsymbol{x}_t, y_t) \sim \phi_t(\boldsymbol{x}, y)$

We say that concept drift occurs at time $t$ if $\phi_t(\boldsymbol{x}, y) \neq \phi_{t+1}(\boldsymbol{x}, y)$

We also say that $\mathcal{X}$ becomes nonstationary.

Page 18: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

We assume that a few supervised samples are provided also during operation.

These supervised samples might arrive:

§ in single instances
§ batch-wise

Fresh, new supervised samples are necessary to:

§ react/adapt to concept drift
§ increase classifier accuracy in stationary conditions

The classifier $K_0$ is updated during operation and is thus denoted $K_t$.

Page 19: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Different Types of Concept Drift

Page 20: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Drift taxonomy according to two characteristics:

1. What is changing? $\phi_t(\boldsymbol{x}, y) = \phi_t(y|\boldsymbol{x}) \, \phi_t(\boldsymbol{x})$
Drift might affect $\phi_t(y|\boldsymbol{x})$ and/or $\phi_t(\boldsymbol{x})$:
§ Real
§ Virtual

2. How does the process change over time?
§ Abrupt
§ Gradual
§ Incremental
§ Recurring

Page 21: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Real Drift: $\phi_{\tau+1}(y|\boldsymbol{x}) \neq \phi_\tau(y|\boldsymbol{x})$

affects $\phi_t(y|\boldsymbol{x})$, while $\phi_t(\boldsymbol{x})$ (the distribution of unlabeled data) might change or not. Here:

$\phi_{\tau+1}(\boldsymbol{x}) \neq \phi_\tau(\boldsymbol{x})$

[Figure: 1-D stream of class 1 / class 2 samples over time $t$; the distribution shifts from $\phi_0$ to $\phi_1$ at change time $\tau$]


Page 23: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Real Drift: $\phi_{\tau+1}(y|\boldsymbol{x}) \neq \phi_\tau(y|\boldsymbol{x})$

affects $\phi_t(y|\boldsymbol{x})$, while here the distribution of unlabeled data does not change:

$\phi_{\tau+1}(\boldsymbol{x}) = \phi_\tau(\boldsymbol{x})$

E.g., the classes swap.

[Figure: 1-D stream where class 1 and class 2 swap at time $\tau$; $\phi_0$ before, $\phi_1$ after]

Page 24: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Virtual Drift: $\phi_{\tau+1}(y|\boldsymbol{x}) = \phi_\tau(y|\boldsymbol{x})$ while $\phi_{\tau+1}(\boldsymbol{x}) \neq \phi_\tau(\boldsymbol{x})$

affects only $\phi_t(\boldsymbol{x})$ and leaves the class posterior probability unchanged.

Virtual drifts are not relevant from a predictive perspective: classifier accuracy is not affected.

[Figure: 1-D stream where the input distribution shifts from $\phi_0$ to $\phi_1$ at time $\tau$ without changing the class posteriors]


Page 26: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Abrupt

$$\phi_t(\boldsymbol{x}, y) = \begin{cases} \phi_0(\boldsymbol{x}, y) & t < \tau \\ \phi_1(\boldsymbol{x}, y) & t \geq \tau \end{cases}$$

Permanent shift in the state of $\mathcal{X}$, e.g., a faulty sensor, or a system switched to an active state.

[Figure: 1-D stream with an abrupt shift from $\phi_0$ to $\phi_1$ at time $\tau$]

Page 27: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Incremental

$$\phi_t(\boldsymbol{x}, y) = \begin{cases} \phi_0(\boldsymbol{x}, y) & t < \tau \\ \phi_t(\boldsymbol{x}, y) & t \geq \tau \end{cases}$$

There is a continuously drifting condition after the change.

[Figure: 1-D stream drifting continuously away from $\phi_0$ after time $\tau$]

Page 28: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Incremental

$$\phi_t(\boldsymbol{x}, y) = \begin{cases} \phi_0(\boldsymbol{x}, y) & t < \tau_0 \\ \phi_t(\boldsymbol{x}, y) & \tau_0 \leq t < \tau_1 \\ \phi_1(\boldsymbol{x}, y) & t \geq \tau_1 \end{cases}$$

There is a continuously drifting condition after the change that might end up in another stationary state.

[Figure: 1-D stream drifting from $\phi_0$ through $\phi_t$ into the new stationary state $\phi_1$]

Page 29: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Recurring

$$\phi_t(\boldsymbol{x}, y) = \begin{cases} \phi_0(\boldsymbol{x}, y) & t < \tau_0 \\ \phi_1(\boldsymbol{x}, y) & \tau_0 \leq t < \tau_1 \\ \quad\vdots \\ \phi_0(\boldsymbol{x}, y) & t \geq \tau_n \end{cases}$$

After concept drift, it is possible that $\mathcal{X}$ goes back to its initial condition $\phi_0$.

[Figure: 1-D stream alternating between $\phi_0$ and $\phi_1$]

Page 30: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Gradual

$$\phi_t(\boldsymbol{x}, y) = \begin{cases} \phi_0(\boldsymbol{x}, y) \text{ or } \phi_1(\boldsymbol{x}, y) & t < \tau \\ \phi_1(\boldsymbol{x}, y) & t \geq \tau \end{cases}$$

The process definitively switches to the new conditions after some short drifts anticipating the change.

[Figure: 1-D stream where short spells of $\phi_1$ appear among $\phi_0$ before time $\tau$, after which $\phi_1$ persists]
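The four drift types are easy to reproduce in simulation. The sketch below is our own illustration (not from the slides): a scalar stream whose mean moves from concept $\phi_0$ (mean 0) to $\phi_1$ (mean 3) abruptly, incrementally, gradually, or in a recurring fashion.

```python
import numpy as np

rng = np.random.default_rng(0)

def drifting_mean(t, tau, kind):
    """Mean of phi_t for a scalar stream; drift is centered on time tau."""
    if kind == "abrupt":                      # phi_0 then phi_1
        return 0.0 if t < tau else 3.0
    if kind == "incremental":                 # linear ramp of length tau
        return 3.0 * min(max(t - tau, 0) / tau, 1.0)
    if kind == "recurring":                   # alternate phi_0 / phi_1
        return 3.0 if (t // tau) % 2 == 1 else 0.0
    if kind == "gradual":                     # phi_1 appears in bursts before tau
        p_new = min(t / tau, 1.0)
        return 3.0 if rng.random() < p_new else 0.0
    raise ValueError(kind)

stream = [rng.normal(drifting_mean(t, tau=500, kind="gradual"), 1.0)
          for t in range(2000)]
```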

Page 31: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches
Page 32: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Consider, as an illustrative example, a simple one-dimensional classification problem, where:

§ The initial part of the stream is provided for training ($TR$)

[Figure: 1-D stream of class 1 / class 2 samples over time; the initial segment $TR$ is marked]

Page 33: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Consider, as an illustrative example, a simple one-dimensional classification problem, where:

§ The initial part of the stream is provided for training ($TR$)
§ $K$ is simply a threshold

[Figure: as above, with the threshold learned on $TR$ superimposed]

Page 34: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Consider, as an illustrative example, a simple one-dimensional classification problem, where:

§ The initial part of the stream is provided for training ($TR$)
§ $K$ is simply a threshold

[Figure: as above; $(\boldsymbol{x}_t, y_t)$ are i.i.d. throughout the stream]

Page 35: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Consider, as an illustrative example, a simple one-dimensional classification problem, where:

§ The initial part of the stream is provided for training ($TR$)
§ $K$ is simply a threshold

As long as the data are i.i.d., the classification error is controlled.

[Figure: as above]
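For this 1-D example, $K$ can be fit as the midpoint between the two class means; a minimal sketch of ours, not the authors' code:

```python
import numpy as np

def fit_threshold(x, y):
    """1-D classifier K: threshold at the midpoint of the class means."""
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    theta = 0.5 * (m0 + m1)
    sign = 1.0 if m1 > m0 else -1.0            # orientation of the classes
    return lambda x_t: int(sign * (x_t - theta) > 0)

# Usage on a stationary training set TR:
x_tr = np.r_[np.random.normal(0, 1, 500), np.random.normal(3, 1, 500)]
y_tr = np.r_[np.zeros(500, int), np.ones(500, int)]
K = fit_threshold(x_tr, y_tr)
```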

Page 36: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Unfortunately, when concept drift occurs and $\phi$ changes…

[Figure: 1-D stream of class 1 / class 2 samples; $(\boldsymbol{x}_t, y_t)$ are i.i.d. until concept drift at time $\tau$]

Page 37: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Unfortunately, when concept drift occurs and $\phi$ changes, things can get terribly worse.

[Figure: as above]

Page 38: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Unfortunately, when concept drift occurs and $\phi$ changes, things can get terribly worse: the average classification error $p_t$ typically increases.

[Figure: the drifting stream (top) and the average classification error $p_t$ increasing after $\tau$ (bottom)]

Page 39: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Adaptation is needed to preserve classifier performance.

[Figure: the drifting stream (top) and the error $e_t$ over time (bottom)]

Page 40: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Do we Really Need Smart Adaptation Strategies?

Page 41: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Consider two simple adaptation strategies and a simple concept drift:

§ Continuously update $K_t$ using all supervised couples
§ Train $K_t$ using only the last $\delta$ supervised couples

[Figure: (a) observations of classes $\omega_1$ and $\omega_2$ with change point $T^*$; (b) classification error (%) as a function of time $T$ (1000–9000) for the JIT classifier, the Continuous Update Classifier, the Sliding Window Classifier, and the Bayes error]
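A hedged sketch reproducing the qualitative behavior of the figure on synthetic data (class means swap at T*; all names and parameters are ours): the continuously updated classifier keeps mixing pre- and post-change samples and accumulates a much higher post-change error than the sliding-window one.

```python
import numpy as np

rng = np.random.default_rng(0)
T, T_star, delta = 4000, 2000, 200
y = rng.integers(0, 2, T)
swapped = (np.arange(T) >= T_star).astype(int)   # classes swap means at T*
x = rng.normal(0.0, 1.0, T) + 2.0 * (y ^ swapped)

def fit(xs, ys):                                  # threshold + orientation
    m0, m1 = xs[ys == 0].mean(), xs[ys == 1].mean()
    return 0.5 * (m0 + m1), 1.0 if m1 > m0 else -1.0

err = {"all": 0, "window": 0}
for t in range(2 * delta, T):
    for name, lo in (("all", 0), ("window", t - delta)):
        theta, s = fit(x[lo:t], y[lo:t])
        err[name] += int((s * (x[t] - theta) > 0) != y[t])
print({k: v / (T - 2 * delta) for k, v in err.items()})
```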

Page 42: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Classification error of the two simple adaptation strategies:

§ Black dots: $K_t$ uses all supervised couples available at time $t$
§ Red line: $K_t$ uses only the last $\delta$ supervised couples

[Figure: see Page 41]

Page 43: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Classification error of the two simple adaptation strategies:

§ Black dots: $K_t$ uses all supervised couples available at time $t$
§ Red line: $K_t$ uses only the last $\delta$ supervised couples

[Figure: see Page 41]

Just including "fresh" training samples is not enough.

Page 44: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Classification error of the two simple adaptation strategies:

§ Black dots: $K_t$ uses all supervised couples available at time $t$
§ Red line: $K_t$ uses only the last $\delta$ supervised couples

[Figure: see Page 41]

Need to integrate supervised samples.

Page 45: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Classification error of the two simple adaptation strategies:

§ Black dots: $K_t$ uses all supervised couples available at time $t$
§ Red line: $K_t$ uses only the last $\delta$ supervised couples

[Figure: see Page 41]

Adaptive learning algorithms trade off the two aspects.

Page 46: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Two main solutions in the literature:

§ Active: the classifier $K_t$ is combined with statistical tools to detect concept drift and pilot the adaptation
§ Passive: the classifier $K_t$ undergoes continuous adaptation, determining at each step which supervised information to preserve

Which is best depends on the expected change rate and on memory/computational availability.

Page 47: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Giacomo Boracchi 1 and Gregory Ditzler 2

1 Politecnico di Milano, Dipartimento Elettronica e Informazione, Milano, Italy

2 The University of Arizona, Department of Electrical & Computer Engineering, Tucson, AZ, USA

[email protected], [email protected]

Page 48: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Peculiarities:

§ Rely on an explicit drift-detection mechanism: change-detection tests (CDTs)
§ Use specific post-detection adaptation procedures to isolate recent data generated after the change

Pros:

§ They also provide information that CD has occurred
§ They can improve their performance in stationary conditions, since the classifier adapts only after a detection

Cons:

§ Difficult to handle incremental and gradual drifts

Page 49: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

The simplest approach consists in monitoring the classification error (or a similar performance measure).

Pros:

§ It is the most straightforward figure of merit to monitor
§ Changes in $p_t$ prompt adaptation only when performance is actually affected

Cons:

§ CD can be detected from supervised samples only

Page 50: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches
Page 51: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ The element-wise classification error follows a Bernoulli distribution, $e_t \sim \text{Bernoulli}(\pi_0)$, where $\pi_0$ is the expected classification error in stationary conditions

§ The sum of $e_t$ over a window of length $\nu$ follows a Binomial distribution,

$$\sum_{t=T-\nu}^{T} e_t \sim \mathcal{B}(\pi_0, \nu)$$

§ A Gaussian approximation holds when $\nu$ is sufficiently large,

$$p_t = \frac{1}{\nu}\sum_{t=T-\nu}^{T} e_t \sim \frac{1}{\nu}\,\mathcal{B}(\pi_0, \nu) \approx \mathcal{N}\!\left(\pi_0,\ \frac{\pi_0 (1 - \pi_0)}{\nu}\right)$$

§ We thus have a sequence of i.i.d., (approximately) Gaussian-distributed values
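A quick numerical check of this approximation ($\pi_0$ and $\nu$ are our own choices):

```python
import numpy as np

pi0, nu = 0.2, 400                      # stationary error rate, window length
rng = np.random.default_rng(0)
e = rng.random((10_000, nu)) < pi0      # element-wise Bernoulli(pi0) errors
p = e.mean(axis=1)                      # windowed error: Binomial(pi0, nu) / nu
print(p.mean(), p.std())                # empirically ~ N(pi0, pi0*(1-pi0)/nu)
print(pi0, np.sqrt(pi0 * (1 - pi0) / nu))
```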

Page 52: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Basic idea behind Drift Detection Method (DDM):

J. Gama, P. Medas, G. Castillo, and P. Rodrigues. “Learning with Drift Detection” In Proc. of the 17th Brazilian Symp. on Artif. Intell. (SBIA). Springer, Berlin, 286–295, 2004

Page 53: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Basic idea behind the Drift Detection Method (DDM):
§ Detect CD as outliers in the classification error

Page 54: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Basic idea behind the Drift Detection Method (DDM):
§ Detect CD as outliers in the classification error
§ Since in stationary conditions the error will decrease, look for outliers in the right tail only

Page 55: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Basic idea behind the Drift Detection Method (DDM):
§ Detect CD as outliers in the classification error
§ Compute, over time, $p_i$ and $\sigma_i = \sqrt{\frac{p_i (1 - p_i)}{i}}$

[Figure: the stream and the monitored level $p_i + \sigma_i$ over time]

Page 56: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Basic idea behind the Drift Detection Method (DDM):
§ Detect CD as outliers in the classification error
§ Compute, over time, $p_i$ and $\sigma_i = \sqrt{\frac{p_i (1 - p_i)}{i}}$
§ Let $p_{\min}$ be the minimum error, with $\sigma_{\min} = \sqrt{\frac{p_{\min} (1 - p_{\min})}{i}}$

[Figure: $p_i + \sigma_i$ and the record level $p_{\min} + \sigma_{\min}$ over time]

Page 57: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Basic idea behind the Drift Detection Method (DDM):
§ Detect CD as outliers in the classification error
§ Compute, over time, $p_i$ and $\sigma_i = \sqrt{\frac{p_i (1 - p_i)}{i}}$
§ Let $p_{\min}$ be the minimum error, with $\sigma_{\min} = \sqrt{\frac{p_{\min} (1 - p_{\min})}{i}}$
§ When $p_i + \sigma_i > p_{\min} + 2\,\sigma_{\min}$, raise a warning alert

[Figure: $p_i + \sigma_i$ over time with the warning level $p_{\min} + 2\,\sigma_{\min}$]

Page 58: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Basic idea behind the Drift Detection Method (DDM):
§ Detect CD as outliers in the classification error
§ Compute, over time, $p_i$ and $\sigma_i = \sqrt{\frac{p_i (1 - p_i)}{i}}$
§ Let $p_{\min}$ be the minimum error, with $\sigma_{\min} = \sqrt{\frac{p_{\min} (1 - p_{\min})}{i}}$
§ When $p_i + \sigma_i > p_{\min} + 2\,\sigma_{\min}$, raise a warning alert
§ When $p_i + \sigma_i > p_{\min} + 3\,\sigma_{\min}$, detect concept drift

[Figure: $p_i + \sigma_i$ over time with the warning ($+2\,\sigma_{\min}$) and drift ($+3\,\sigma_{\min}$) levels]

Page 59: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Use the supervised samples in between the warning and the drift alert to reconfigure the classifier.

[Figure: the stream and the monitored error; the samples between the warning and drift alerts form the new training set $TR$]

Page 60: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Use the supervised samples in between the warning and the drift alert to reconfigure the classifier.

Warning alerts that are not followed by a drift alert are discarded and considered false-positive detections.

[Figure: as above]
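A minimal DDM-style monitor implementing the rules above (our sketch, not the reference implementation by Gama et al.; retraining on the warning-to-drift samples is left out):

```python
import math

class DDM:
    """Track p_i and sigma_i over the error stream; 2/3-sigma DDM rules."""
    def __init__(self):
        self.n, self.p = 0, 0.0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, e_t):                    # e_t in {0, 1}
        self.n += 1
        self.p += (e_t - self.p) / self.n     # running error rate p_i
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + 3 * self.s_min:
            return "drift"                    # retrain K_t, then reset monitor
        if self.p + s > self.p_min + 2 * self.s_min:
            return "warning"                  # start buffering samples here
        return "stationary"
```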

Page 61: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

The Early Drift Detection Method (EDDM) performs a similar monitoring on the average distance between misclassified samples:

§ The average distance is expected to decrease under CD
§ It aims at detecting gradual drifts

M. Baena-García, J. Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldá, R. Morales-Bueno. “Early drift detection method“ In Fourth International Workshop on Knowledge Discovery from Data Streams (2006)

Page 62: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Use the Exponentially Weighted Moving Average (EWMA) as test statistic.

Compute the EWMA statistic

$$Z_t = (1 - \lambda)\, Z_{t-1} + \lambda\, e_t, \qquad Z_0 = 0$$

Detect concept drift when

$$Z_t > p_{0,t} + L_t\, \sigma_t$$

§ $p_{0,t}$ is the average error estimated up to time $t$
§ $\sigma_t$ is the standard deviation of the above estimator
§ $L_t$ is a threshold parameter

The EWMA statistic is mainly influenced by recent data: CD is detected when the error on recent samples departs from $p_{0,t}$.

G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand, "Exponentially Weighted Moving Average Charts for Detecting Concept Drift," Pattern Recognition Letters, vol. 33, no. 2, pp. 191–198, 2012.

Page 63: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Most importantly:

§ $L_t$ can be set to control the average run length (ARL) of the test, i.e., the expected time between false positives
§ Like in DDM, classifier reconfiguration is supported by monitoring $Z_t$ also at a warning level: $Z_t > p_{0,t} + 0.5\, L_t\, \sigma_t$
§ Once CD is detected, the first sample that raised a warning is used to isolate the samples from the new distribution and retrain the classifier

G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand, "Exponentially Weighted Moving Average Charts for Detecting Concept Drift," Pattern Recognition Letters, vol. 33, no. 2, pp. 191–198, 2012.
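A sketch of the EWMA monitor. We use a fixed threshold L rather than the time-varying L_t, which Ross et al. tabulate to hit a target ARL, and the asymptotic variance formula for an EWMA of i.i.d. values:

```python
class EWMADetector:
    """Monitor Z_t = (1 - lam) Z_{t-1} + lam e_t against p_{0,t} + L sigma_t."""
    def __init__(self, lam=0.2, L=3.0):
        self.lam, self.L = lam, L
        self.z, self.n, self.p0 = 0.0, 0, 0.0

    def update(self, e_t):                     # e_t in {0, 1}
        self.n += 1
        self.p0 += (e_t - self.p0) / self.n    # average error up to time t
        self.z = (1 - self.lam) * self.z + self.lam * e_t
        # asymptotic std of the EWMA statistic in stationary conditions
        sigma = (self.p0 * (1 - self.p0) * self.lam / (2 - self.lam)) ** 0.5
        if self.z > self.p0 + self.L * sigma:
            return "drift"
        if self.z > self.p0 + 0.5 * self.L * sigma:
            return "warning"
        return "stationary"
```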

Page 64: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

In some cases, CD can be detected by ignoring class labels and monitoring the distribution of the raw, unsupervised input data.

[Figure: 1-D stream over time]

Page 65: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

In some cases, CD can be detected by ignoring class labels and monitoring the distribution of the raw, unsupervised input data.

[Figure: the same stream, before and after a change in the input distribution]

Page 66: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches
Page 67: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Pros:

§ Monitoring $\phi(\boldsymbol{x})$ does not require supervised samples
§ Enables the detection of both real and virtual concept drift

Cons:

§ CD that does not affect $\phi(\boldsymbol{x})$ is not perceivable
§ In principle, changes not affecting $\phi(y|\boldsymbol{x})$ do not require reconfiguration
§ It is difficult to design sequential detection tools, i.e., change-detection tests (CDTs), when streams are multivariate and their distribution is unknown

Page 68: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Extract Gaussian-distributed features from non-overlapping windows (such that they are i.i.d.). Examples of features are:

§ the sample mean over data windows,

$$M(s) = \frac{1}{\nu}\sum_{t=(s-1)\nu+1}^{s\nu} x_t$$

§ a power-law transform of the sample variance,

$$V(s) = \left(\frac{S(s)}{\nu - 1}\right)^{h}$$

where $S(s)$ is the sample-variance statistic computed over the window yielding $M(s)$ and $h$ is the power-law exponent.

Detection criterion: the Intersection of Confidence Intervals (ICI) rule, an adaptive filtering technique for polynomial regression.

C. Alippi, G. Boracchi, M. Roveri "A just-in-time adaptive classification system based on the intersection of confidence intervals rule", Neural Networks, Elsevier vol. 24 (2011), pp. 791-800

A. Goldenshluger and A. Nemirovski, “On spatial adaptive estimation of nonparametric regression” Math. Meth. Statistics, vol. 6, pp. 135–170,1997.
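A sketch of the feature extraction; the power-law exponent is garbled on the slide, so it is left as a parameter h here:

```python
import numpy as np

def window_features(x, nu, h=0.5):
    """M(s) and V(s) over non-overlapping windows of length nu."""
    w = np.asarray(x)[: len(x) // nu * nu].reshape(-1, nu)
    M = w.mean(axis=1)                               # sample mean per window
    S = ((w - M[:, None]) ** 2).sum(axis=1)          # sum of squared deviations
    V = (S / (nu - 1)) ** h                          # power-law transform
    return M, V
```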

Page 69: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Several features are extracted from non-overlapping windows, including:

§ Sample moments
§ Projections over the principal components
§ The Mann-Kendall statistic

Detection criterion: the cumulative sum of each of these features is monitored to detect changes in a CUSUM-like scheme.

C. Alippi and M. Roveri, "Just-in-time adaptive classifiers – part I: Detecting nonstationary changes," IEEE Transactions on Neural Networks, vol. 19, no. 7, pp. 1145–1153, 2008.

C. Alippi and M. Roveri, "Just-in-time adaptive classifiers – part II: Designing the classifier," IEEE Transactions on Neural Networks, vol. 19, no. 12, pp. 2053–2064, 2008.

Page 70: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

For multivariate streams, one typically resorts to:

§ Operating component-wise (thus not performing a truly multivariate analysis)
§ Monitoring the log-likelihood w.r.t. an additional model approximating $\phi(\boldsymbol{x})$ in stationary conditions

Page 71: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Fit a model $\hat{\phi}_0$ (e.g., by GMM or KDE) to describe the distribution of raw (multivariate) data in stationary conditions.

For each sample $\boldsymbol{x}_t$, compute the log-likelihood w.r.t. $\hat{\phi}_0$:

$$\mathcal{L}(\boldsymbol{x}_t) = \log \hat{\phi}_0(\boldsymbol{x}_t) \in \mathbb{R}$$

Idea: changes in the distribution of the log-likelihood indicate that $\hat{\phi}_0$ is unfit to describe the unsupervised data, thus concept drift (possibly virtual) has occurred.

Detection criterion: any monitoring scheme for scalar i.i.d. datastreams.

L. I. Kuncheva, "Change detection in streaming multivariate data using likelihood detectors," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 5, pp. 1175–1180, 2013.

X. Song, M. Wu, C. Jermaine, and S. Ranka, "Statistical change detection for multi-dimensional data," in Proceedings of the 13th ACM SIGKDD (KDD 2007).

C. Alippi, G. Boracchi, D. Carrera, and M. Roveri, "Change Detection in Multivariate Datastreams: Likelihood and Detectability Loss," arXiv preprint arXiv:1510.04850, 2015.
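A sketch of this scheme using scikit-learn's GaussianMixture for $\hat{\phi}_0$; monitoring the window means of the log-likelihood with a simple k-sigma rule is our simplification (any scalar CDT would do):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def loglik_monitor(X_train, X_stream, nu=200, k=3.0):
    """Fit phi_hat_0 on stationary data, flag low-likelihood windows."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)
    L0 = gmm.score_samples(X_train)                 # per-sample log-likelihood
    mu0, se0 = L0.mean(), L0.std() / np.sqrt(nu)    # reference for window means
    for i in range(0, len(X_stream) - nu + 1, nu):
        L_w = gmm.score_samples(X_stream[i:i + nu]).mean()
        if abs(L_w - mu0) > k * se0:
            return i                                # first suspicious window
    return None
```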

Page 72: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches
Page 73: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

JIT classifiers are described in terms of:

§ concept representations
§ operators for concept representations

JIT classifiers are able to:

§ detect abrupt CD (both real and virtual)
§ identify a new training set for the new concept and exploit recurrent concepts

JIT classifiers leverage:

§ sequential techniques to detect CD, monitoring both the classification error and the raw data distribution
§ statistical techniques to identify the new concept, and possibly recurrent ones

C. Alippi, G. Boracchi, M. Roveri "Just In Time Classifiers for Recurrent Concepts" IEEE Transactions on Neural Networks and Learning Systems, 2013. vol. 24, no.4, pp. 620 -634 Outstanding Paper Award 2016

Page 74: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

$C_i = (Z_i, F_i, D_i)$

§ $Z_i = \{(\boldsymbol{x}_0, y_0), \dots, (\boldsymbol{x}_n, y_n)\}$: supervised samples provided during the $i$-th concept
§ $F_i$: features describing $p(\boldsymbol{x})$ of the $i$-th concept. We take:
  § the sample mean $M(\cdot)$
  § the power-law transform of the sample variance $V(\cdot)$
  extracted from non-overlapping sequences
§ $D_i$: features for detecting concept drift. These include:
  § the sample mean $M(\cdot)$
  § the power-law transform of the sample variance $V(\cdot)$
  § the average classification error $p_t(\cdot)$
  extracted from non-overlapping sequences

In stationary conditions these features are i.i.d.
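The triplet $C_i = (Z_i, F_i, D_i)$ maps directly onto a small container; a sketch with names of our choosing:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """C_i = (Z_i, F_i, D_i) for the i-th concept."""
    Z: list = field(default_factory=list)  # supervised pairs (x_t, y_t)
    F: list = field(default_factory=list)  # features describing p(x): M, V
    D: list = field(default_factory=list)  # drift features: M, V, error p_t

    def update(self, x=None, y=None, f=None, d=None):
        """Operator U: append fresh information to the representation."""
        if y is not None:
            self.Z.append((x, y))
        if f is not None:
            self.F.append(f)
        if d is not None:
            self.D.append(d)
```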

Page 75: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Concept representation: $C = (Z, F, D)$

§ $Z$: set of supervised samples
§ $F$: set of features for assessing concept equivalence
§ $D$: set of features for detecting concept drift

Initial Training:

Use the initial training sequence to build the concept representation $C_0$.

Page 76: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Build $C_0$, a practical representation of the current concept:
• Characterize both $\phi(\boldsymbol{x})$ and $\phi(y|\boldsymbol{x})$ in stationary conditions

[Figure: timeline where the training sequence $TR$ yields $C_0$]

Page 77: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Concept representation: $C = (Z, F, D)$

§ $Z$: set of supervised samples
§ $F$: set of features for assessing concept equivalence
§ $D$: set of features for detecting concept drift

Operators for concepts:
§ $\mathcal{D}$: concept-drift detection
§ $\Upsilon$: concept split
§ $\mathcal{E}$: equivalence operator
§ $\mathcal{U}$: concept update

Page 78: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Concept Update:

During operation, each input sample is analyzed to:

§ extract features that are appended to $F_i$
§ append supervised information to $Z_i$

thus updating the current concept representation.

Page 79: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

The concept representation $C_0$ is always updated during operation:

• including supervised samples in $Z_0$ (to describe $p(y|\boldsymbol{x})$)
• computing features $F_0$ (to describe $p(\boldsymbol{x})$)
• computing features $D_0$

[Figure: timeline where $TR$ and the subsequent stream feed the growing representation $C_0$]

Page 80: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Concept Drift Detection:

The current concept representation is analyzed by 𝒟 to determine whether concept drift has occurred

Page 81: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Determine when the features in $D$ are no longer stationary:

§ $\mathcal{D}$ monitors the datastream by means of online, sequential change-detection tests (CDTs)
§ Depending on the features, both changes in $\phi(y|\boldsymbol{x})$ and $\phi(\boldsymbol{x})$ can be detected
§ $\hat{T}$ is the detection time

[Figure: timeline where $\mathcal{D}(C_0) = 1$ at detection time $\hat{T}$]

Page 82: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

$\mathcal{D}(C_i) \in \{0, 1\}$

§ Implements online change-detection tests (CDTs) based on the Intersection of Confidence Intervals (ICI) rule
§ The ICI rule is an adaptation technique used to define adaptive supports for polynomial regression
§ The ICI rule determines when the feature sequence $D_i$ cannot be fit by a zero-order polynomial, i.e., when $D_i$ is nonstationary
§ The ICI rule requires Gaussian-distributed features, but makes no assumptions on the post-change distribution

A. Goldenshluger and A. Nemirovski, “On spatial adaptive estimation of nonparametric regression” Math. Meth. Statistics, vol. 6, pp. 135–170,1997.

V. Katkovnik, "A new method for varying adaptive bandwidth selection," IEEE Transactions on Signal Processing, vol. 47, pp. 2567–2571, 1999.

Page 83: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Concept Split:

After having detected concept drift, the concept representation is split to isolate the recent data that refer to the new state of $\mathcal{X}$.

A new concept description is built.

Page 84: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Goal: estimate the change point $\tau$ (detections are always delayed) and isolate the samples in between $\hat{\tau}$ and $\hat{T}$.

Use statistical tools performing an offline, retrospective analysis over the recent data, such as:

§ hypothesis tests (HT)
§ change-point methods (CPM)

[Figure: timeline with the estimated change point $\hat{\tau}$ and the detection time $\hat{T}$]

Page 85: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Given $\hat{\tau}$, two different concept representations are built.

[Figure: timeline split at $\hat{\tau}$: samples before form $C_0$, samples after (up to $\hat{T}$) form $C_1$]

Page 86: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

$\Upsilon(C_0) = (C_0, C_1)$

§ It performs an offline analysis on $F_i$ (just the feature that detected the change) to estimate when concept drift has actually happened

§ Detections $\hat{T}$ are delayed w.r.t. the actual change point $\tau$

§ Change-point methods implement the following hypothesis test on the feature sequence:

$$\begin{cases} H_0\!: \text{``}F_i \text{ contains i.i.d. samples''} \\ H_1\!: \text{``}F_i \text{ contains a change point''} \end{cases}$$

testing all the possible partitions of $F_i$ and determining the one most likely to contain a change point

§ ICI-based CDTs implement a refinement procedure to estimate $\tau$ after having detected a change at $\hat{T}$
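A sketch of the CPM idea: scan all splits of the feature sequence and keep the one maximizing a two-sample statistic (Welch's t here; a full CPM would compare the maximum against its null distribution before accepting $\hat{\tau}$):

```python
import numpy as np
from scipy import stats

def estimate_change_point(F, min_seg=5):
    """Return the split of F most likely to separate two distributions."""
    F = np.asarray(F)
    best_t, best_stat = None, 0.0
    for t in range(min_seg, len(F) - min_seg):
        stat, _ = stats.ttest_ind(F[:t], F[t:], equal_var=False)
        if abs(stat) > best_stat:
            best_t, best_stat = t, abs(stat)
    return best_t, best_stat   # compare best_stat to an H0 threshold
```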

Page 87: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Concept Equivalence:

Look for concepts that are equivalent to the current one.

Gather supervised samples from all the representations $C_j$ that refer to the same concept.

Page 88: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Concept equivalence is assessed by:

§ comparing the features $F$ to determine whether $\phi(\boldsymbol{x})$ is the same on $C_i$ and $C_j$, using a test of equivalence
§ comparing classifiers trained on $C_i$ and $C_j$ to determine whether $\phi(y|\boldsymbol{x})$ is the same

[Figure: timeline where $\mathcal{E}(C_i, C_j) = 1$ for two matching concepts]

Page 89: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Label Prediction:

The classifier 𝐾 is reconfigured using all the available supervised couples

Page 90: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches
Page 91: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Detect CD at time $t$ by comparing two different windows. In practice, one computes:

$$\mathcal{T}(W_0, W_t)$$

§ $W_0$: reference window of past (stationary) data
§ $W_t$: sliding window of recent (possibly changed) data
§ $\mathcal{T}$ is a suitable test statistic

[Figure: stream with the reference window $W_0$ at the beginning and the sliding window $W_t$ at the current time]

Page 92: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Detect CD at time $t$ by comparing two different windows. In practice, one computes:

$$\mathcal{T}(W_0, W_t)$$

§ $W_0$: reference window of past (stationary) data
§ $W_t$: sliding window of recent (possibly changed) data
§ $\mathcal{T}$ is a suitable test statistic

[Figure: stream with two adjacent sliding windows $W_{t-\nu}$ and $W_t$]

Page 93: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Pros:

§ There are many test statistics for comparing the data distributions over two different windows

Cons:

§ The biggest drawback of comparing windows is that subtle CD might not be detected (detecting subtle changes is instead the main advantage of sequential techniques)
§ More computationally demanding than sequential techniques
§ Defining the window size is an issue

Page 94: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ The averages over two adjacent windows (ADWIN)

Bifet A., Gavaldà R. "Learning from time-changing data with adaptive windowing" In Proc. of SIAM International Conference on Data Mining 2007

Page 95: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ The averages over two adjacent windows (ADWIN)

§ Comparing the classification error over $W_t$ and $W_0$

Nishida, K. and Yamauchi, K. "Detecting concept drift using statistical testing" In DS, pp. 264–269, 2007

Page 96: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ The averages over two adjacent windows (ADWIN)

§ Comparing the classification error over $W_t$ and $W_0$

§ Computing and comparing empirical distributions of the raw data over $W_0$ and $W_t$:
  § The Kullback-Leibler divergence

T. Dasu, Sh. Krishnan, S. Venkatasubramanian, and K. Yi. "An Information-Theoretic Approach to Detecting Changes in Multi-Dimensional Data Streams". In Proc. of the 38th Symp. on the Interface of Statistics, Computing Science, and Applications, 2006

Page 97: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ The averages over two adjacent windows (ADWIN)

§ Comparing the classification error over $W_t$ and $W_0$

§ Computing and comparing empirical distributions of the raw data over $W_0$ and $W_t$:
  § The Kullback-Leibler divergence
  § The Hellinger distance

G. Ditzler and R. Polikar, “Hellinger distance based drift detection for nonstationary environments” in Computational Intelligence in Dynamic and Uncertain Environments (CIDUE), 2011 IEEE Symposium on, April 2011, pp. 41–48.
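A sketch of the histogram-based Hellinger distance between two windows (the binning is our choice); drift would be flagged when the distance exceeds a threshold calibrated on stationary data:

```python
import numpy as np

def hellinger(w0, wt, bins=20):
    """Hellinger distance between empirical distributions of two windows."""
    lo = min(w0.min(), wt.min())
    hi = max(w0.max(), wt.max())
    p, _ = np.histogram(w0, bins=bins, range=(lo, hi))
    q, _ = np.histogram(wt, bins=bins, range=(lo, hi))
    p = p / p.sum()                        # normalize counts to frequencies
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```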

Page 98: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ The averages over two adjacent windows (ADWIN)

§ Comparing the classification error over $W_t$ and $W_0$

§ Computing and comparing empirical distributions of the raw data over $W_0$ and $W_t$:
  § The Kullback-Leibler divergence
  § The Hellinger distance
  § The density ratio over the two windows, estimated with kernel methods (to overcome curse-of-dimensionality problems when computing empirical distributions)

Kawahara, Y. and Sugiyama, M. "Sequential change-point detection based on direct density-ratio estimation". Statistical Analysis and Data Mining, 5(2):114–127, 2012.

Page 99: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

In stationary conditions all data are i.i.d. (exchangeable); thus, if we

§ select a training set and a test set in a window
§ select another $TR$ and $TS$ pair after reshuffling the two

the empirical errors of the two resulting classifiers should be the same.

Vovk, V., Nouretdinov, I., and Gammerman, A., "Testing exchangeability on-line," in Proc. of ICML, 2003.

Harel M., Mannor S., El-Yaniv R., Crammer K., "Concept Drift Detection Through Resampling," ICML 2014.

[Figure: a data window split into $TR$ and $TS$]
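A sketch of the resampling idea (`fit` is a hypothetical callable returning a fitted model with a `predict` method): under stationarity, the error of the ordered split should look like the errors of the reshuffled splits.

```python
import numpy as np

def resampling_drift_score(X, y, fit, n_shuffles=50, seed=0):
    """Error of the ordered TR/TS split vs. errors of reshuffled splits."""
    rng = np.random.default_rng(seed)
    half = len(X) // 2

    def err(tr, ts):
        model = fit(X[tr], y[tr])
        return np.mean(model.predict(X[ts]) != y[ts])

    ordered = err(np.arange(half), np.arange(half, len(X)))
    shuffled = [err(p[:half], p[half:])
                for p in (rng.permutation(len(X)) for _ in range(n_shuffles))]
    return ordered - np.mean(shuffled)    # large positive values suggest CD
```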

Page 100: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

Two classifiers are trained:

§ A stable online learner ($S$) that predicts based on all the supervised samples
§ A reactive one ($R_w$) trained over a short sliding window

During operation:

§ Labels are provided by $S$
§ Predictions of $R_w$ are computed but not provided
§ As soon as $R_w$ correctly classifies enough of the most recent samples that $S$ misclassifies, detect CD

Adaptation consists in replacing $S$ with $R_w$.

Bach, S. H., and Maloof, M., "Paired Learners for Concept Drift," in Proc. of the Eighth IEEE International Conference on Data Mining (ICDM '08), pp. 23–32, Dec. 2008.
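A sketch of the paired-learners loop; `S` and `R` are hypothetical incremental classifiers (`predict`, `learn`), and the disagreement threshold theta is a parameter of ours:

```python
from collections import deque

def paired_learners(stream, S, R, w=100, theta=10):
    """Replace the stable learner S when the reactive learner R beats it."""
    recent = deque(maxlen=w)              # 1 where R is right and S is wrong
    for x_t, y_t in stream:
        s_ok = S.predict(x_t) == y_t      # S provides the actual labels
        r_ok = R.predict(x_t) == y_t      # R's predictions are only recorded
        recent.append(int(r_ok and not s_ok))
        if sum(recent) > theta:           # R fixes enough of S's mistakes: CD
            S = R                         # adaptation: S is replaced by R_w
            recent.clear()
        S.learn(x_t, y_t)                 # S learns from every sample
        R.learn(x_t, y_t)                 # R is assumed to forget beyond w
    return S
```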

Page 101: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches
Page 102: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ Typically, when monitoring the classification error, false positives hurt less than detection delay

§ Things might change when classes are unbalanced

Page 103: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ Typically, when monitoring the classification error, false positives hurt less than detection delay

§ Things might change when classes are unbalanced

§ Providing i.i.d. samples for reconfiguration seems more critical. When estimating the change time:

[Figure: timeline with the estimated change point $\hat{\tau}$ and the detection time $\hat{T}$]

Page 104: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ Typically, when monitoring the classification error, false positives hurt less than detection delay

§ Things might change when classes are unbalanced

§ Providing i.i.d. samples for reconfiguration seems more critical. When estimating the change time:
  § Overestimating $\tau$ provides too few samples
  § Underestimating $\tau$ provides non-i.i.d. data
  § It is worth using accurate SPC methods like change-point methods (CPMs)

D. M. Hawkins, P. Qiu, and C. W. Kang, "The changepoint model for statistical process control," Journal of Quality Technology, vol. 35, no. 4, pp. 355–366, 2003.

Page 105: Learning In Nonstationary Environments: Perspectives And Applications. Part1: Active Approaches

§ Typically, when monitoring the classification error, false positives hurt less than detection delay

§ Things might change when classes are unbalanced

§ Providing i.i.d. samples for reconfiguration seems more critical. When estimating the change time:
  § Overestimating $\tau$ provides too few samples
  § Underestimating $\tau$ provides non-i.i.d. data
  § It is worth using accurate SPC methods like change-point methods (CPMs)

§ Exploiting recurrent concepts is important:
  § Providing additional samples could really make the difference
  § It mitigates the impact of false positives