Page 1: overview of pred learning

NEW FORMULATIONS FOR

PREDICTIVE LEARNING

Vladimir Cherkassky

Dept. Electrical & Computer Eng.

University of Minnesota

Email [email protected]

Plenary talk at ICANN-2001

VIENNA, AUSTRIA, August 22, 2001

Page 2: overview of pred learning


OUTLINE

1. MOTIVATION and BACKGROUND

Historical background

New application needs

Methodological & scientific importance

2. STANDARD FORMULATIONS

Standard formulations of the learning problem

Possible directions for new formulations

VC-theory framework

3. EXAMPLE NEW FORMULATIONS

Transduction: learning function at given points

Signal Processing: signal denoising

Financial engineering: market timing

Spatial data mining

Page 3: overview of pred learning

4. SUMMARY

1.1 HISTORICAL BACKGROUND

• The problem of predictive learning:

GIVEN past data + reasonable assumptions

ESTIMATE unknown dependency for future predictions

• Driven by applications (NOT theory):

medicine

biology: genomics

financial engineering (e.g., program trading, hedging)

signal/ image processing

data mining (for marketing)

........

• Math disciplines:

function approximation

pattern recognition

statistics

Page 4: overview of pred learning

optimization …

CURRENT STATE-OF-THE-ART

(in predictive learning)

• Disconnect between theory and practical methods

• Cookbook of learning methods

(neural networks, soft computing…)

• Terminological confusion

(reflecting conceptual confusion?)

• REASONS for the above

- objective (inherent problem complexity)

- historical: existing data-analytical tools developed for classical formulations

Page 5: overview of pred learning

- subjective (fragmentation in science & engineering)

OBJECTIVE FACTORS

• Estimation/ Learning with finite samples

= inherently complex problem (ill-posed)

• Characterization of uncertainty

- statistical (frequentist)

- statistical (Bayesian)

- fuzzy

- etc.

• Representation of a priori knowledge

(implicit in the choice of a learning method)

• Learning in Physical vs Social Systems

- social systems may be inherently unpredictable

Page 6: overview of pred learning


1.2 APPLICATION NEEDS (vs standard formulations)

• Learning methods assume standard formulations

- classification

- regression

- density estimation

- hidden component analysis

• Theoretical basis for existing methods

- parametric (linear parameterization)

- asymptotic (large-sample)

• TYPICAL ASSUMPTIONS

- stationary distributions

- i.i.d. samples

- finite training sample / very large test sample

- particular cost function (e.g., squared loss)

Page 7: overview of pred learning

• DO NOT HOLD for MODERN APPLICATIONS

1.3 METHODOLOGICAL & SCIENTIFIC NEEDS

• Different objectives for

- scientific discipline

- engineering

- (academic) marketing

Science is an attempt to make the chaotic diversity of our sensory experiences correspond to a logically uniform system of thoughts

A. Einstein, 1942

• Scientific approach

- a few (primary) abstract concepts

- math description (quantifiable, measurable)

Page 8: overview of pred learning

- describes objective reality (via repeatable experiments)

• Predictive learning

- not a scientific field (yet)

- this inhibits (true) progress and the understanding of major issues

• Characteristics of a scientific discipline

(1) problem setting / main concepts

(2) solution approach (= inductive principle)

(3) math proofs

(4) constructive implementation ( = learning algorithm)

(5) applications

NOTE: current focus on (3), (4), (5), NOT on (1) or (2)

• Distinction between (1) and (2)

Page 9: overview of pred learning

- necessary for scientific approach

- currently lacking

TWO APPROACHES TO PREDICTIVE LEARNING

• The PROBLEM

GIVEN past data + reasonable assumptions

ESTIMATE unknown dependency for future predictions

• APPROACH 1 (most common in Soft Computing)

(a) Apply one’s favourite learning method, e.g., neural net, Bayesian, maximum likelihood, SVM, etc.

(b) Report results; suggest heuristic improvements

(c) Publish a paper (optional)

• APPROACH 2 (VC-theory)

(a) Problem formulation reflecting application needs

Page 10: overview of pred learning

(b) Select appropriate learning method

(c) Report results; publish paper etc.

Page 11: overview of pred learning


2. STANDARD FORMULATIONS for PREDICTIVE LEARNING

[Diagram: a Generator produces X-values; the System responds with output y; the Learning machine observes (X, y) and produces an approximation f*(X)]

• LEARNING as function estimation = choosing the ‘best’ function from a given set of functions [Vapnik, 1998]

• Formal specs:

Generator of random X-samples from unknown probability distribution

System (or teacher) that provides output y for every X-value according to (unknown) conditional distribution P(y|X)

Learning machine that implements a set of approximating functions f(X, w) chosen a priori, where w is a set of parameters

Page 12: overview of pred learning


• The problem of learning/estimation:

Given a finite number of samples (Xi, yi), choose from a given set of functions f(X, w) the one that best approximates the true output

Loss function L(y, f(X, w)) is a measure of discrepancy (error)

Expected Loss (Risk) R(w) = ∫ L(y, f(X, w)) dP(X, y)

The goal is to find the function f(X, w₀) that minimizes R(w) when the distribution P(X, y) is unknown.

Important special cases

(a) Classification (Pattern Recognition)

output y is categorical (class label)

approximating functions f(X,w) are indicator functions

For two-class problems common loss

L(y, f(X,w)) = 0 if y = f(X,w)

L(y, f(X,w)) = 1 if y ≠ f(X,w)

(b) Function estimation (regression)

output y and f(X,w) are real-valued functions

Commonly used squared loss measure:

L(y, f(X, w)) = [y − f(X, w)]²
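Since P(X, y) is unknown, practical methods replace R(w) with its average over the training sample (the empirical risk). A minimal Python sketch of the two standard losses above, with made-up data (my illustration, not from the talk):

```python
import numpy as np

def empirical_risk(loss, y, y_hat):
    # Sample average of the loss: a finite-sample stand-in for R(w)
    return float(np.mean([loss(yi, fi) for yi, fi in zip(y, y_hat)]))

def zero_one_loss(y, f):
    # Classification: 0 if the indicator function agrees with the label
    return 0.0 if y == f else 1.0

def squared_loss(y, f):
    # Regression: squared discrepancy between output and prediction
    return (y - f) ** 2

# Hypothetical labels and predictions:
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(empirical_risk(zero_one_loss, y_true, y_pred))  # 0.25
print(empirical_risk(squared_loss, y_true, y_pred))   # 0.25
```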

Page 13: overview of pred learning

LIMITATIONS of STANDARD FORMULATION

• Standard formulation of the learning problem

(a) global function estimation (= large test set)

(b) i.i.d. training/test samples

(c) standard regression/ classification (loss function)

(d) stationary distributions

possible research directions

(a) new formulations of the learning problem (e.g., transduction)

(b) non i.i.d. data

(c) alternative cost (error) measures

(d) non-stationary distributions

Page 14: overview of pred learning


VC LEARNING THEORY

• Statistical theory for finite-sample estimation

• Focus on predictive non-parametric formulation

• Empirical Risk Minimization approach

• Methodology for model complexity control (SRM)
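To make the SRM idea concrete, here is a toy sketch (my illustration, with several simplifications): nested models of increasing complexity, with polynomial degree standing in for VC dimension, are compared by a penalized empirical risk. The penalization factor below is a simplified VC-style surrogate, not Vapnik's exact bound:

```python
import numpy as np

def srm_select(x, y, max_degree=10):
    # SRM sketch: fit each element of a nested structure, then pick
    # the complexity minimizing a penalized (guaranteed) risk estimate
    n = len(x)
    best = None
    for h in range(1, max_degree + 1):        # h ~ model complexity
        coeffs = np.polyfit(x, y, h)
        emp_risk = float(np.mean((y - np.polyval(coeffs, x)) ** 2))
        # Simplified VC-style penalization factor (an assumption,
        # not the exact bound); it blows up as h approaches n
        denom = max(1e-12, 1.0 - np.sqrt(h * (np.log(n / h) + 1.0) / n))
        guaranteed = emp_risk / denom
        if best is None or guaranteed < best[0]:
            best = (guaranteed, h, coeffs)
    return best

x = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + 0.2 * np.random.randn(30)
risk, degree, coeffs = srm_select(x, y)
print(degree)  # complexity chosen by penalized empirical risk
```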

References on VC-theory:

V. Vapnik, The Nature of Statistical Learning Theory, Springer 1995

V. Vapnik, Statistical Learning Theory, Wiley, 1998

V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory and Methods, Wiley, 1998

B. Schölkopf et al. (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999

IEEE Trans. on Neural Networks, Special Issue on VC learning theory and its applications, Sept. 1999

Page 15: overview of pred learning


CONCEPTUAL CONTRIBUTIONS of VC-THEORY

• Clear separation between

problem statement

solution approach (i.e. inductive principle)

constructive implementation (i.e. learning algorithm)

- all 3 are usually mixed up in application studies.

• Main principle for solving finite-sample problems

Do not solve a given problem indirectly by solving a more general (harder) problem as an intermediate step

- usually not followed in statistics, neural networks and applications.

Example: maximum likelihood methods

• Worst-case analysis for learning problems

Theoretical analysis of learning should be based on the worst-case scenario (rather than average-case).

Page 16: overview of pred learning


3. EXAMPLES of NEW FORMULATIONS

TOPICS COVERED

Transduction: learning function at given points

Signal Processing: signal denoising

Financial engineering: market timing

Spatial data mining

FOCUS ON

Motivation / application domain

(Formal) problem statement

Why this formulation is important

Rather than solution approach (learning algorithm)

Page 17: overview of pred learning


COMPONENTS of PROBLEM FORMULATION

Page 18: overview of pred learning

[Diagram: APPLICATION NEEDS determine the loss function; the input, output, and other variables; the training/test data; and the admissible models. Together these yield the FORMAL PROBLEM STATEMENT, which connects to LEARNING THEORY]

TRANSDUCTION FORMULATION

Page 19: overview of pred learning

• Standard learning problem:

Given: training data (Xi, yi) i = 1,…n

Estimate: (unknown) dependency everywhere in X

min R(w) = ∫ L(y, f(X, w)) dP(X, y)

~ global function estimation ~ large test data set

[Diagram: a Generator produces X-values; the System responds with output y; the Learning machine observes (X, y) and produces an approximation f*(X)]

• Usually need to make predictions at a few points

so estimating the function everywhere may be wasteful

Page 20: overview of pred learning


• Transduction approach [Vapnik, 1995]

learning/training = induction; operation/lookup = deduction

[Diagram: a priori knowledge (assumptions) at the top, training data at the lower left, predicted output at the lower right; induction leads from the training data to the estimated function, deduction leads from the estimated function to the predicted output, and transduction leads directly from the training data to the predicted output]

• Problem statement (Transduction):

Given: training data (Xi, yi) , i = 1,…n

Estimate: y-values at given points Xn+j , j = 1,…k

Optimal prediction minimizes

R(w) = (1/k) Σ_{j=1..k} L(y_{n+j}, f(X_{n+j}, w))
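For contrast with the standard (global) risk, here is a small sketch of the transductive risk, evaluated only at the k given points. The 1-nearest-neighbor predictor and the data are hypothetical, and squared loss is just one choice of L:

```python
import numpy as np

def transductive_risk(f, X_given, y_given):
    # R(w) = (1/k) * sum_j L(y_{n+j}, f(X_{n+j}, w)); squared loss here
    preds = np.array([f(x) for x in X_given])
    return float(np.mean((np.asarray(y_given) - preds) ** 2))

def make_1nn(X_train, y_train):
    # Predict with the y-value of the nearest training point
    def f(x):
        return y_train[int(np.argmin(np.abs(np.asarray(X_train) - x)))]
    return f

X_train, y_train = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
f = make_1nn(X_train, y_train)
# The risk is computed only at the k = 2 points where predictions are needed:
print(transductive_risk(f, [0.4, 1.6], [0.16, 2.56]))
```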

Page 21: overview of pred learning


CURRENT STATE-OF-THE-ART

• Vapnik [1998] provides

- theoretical analysis of transductive inference

- SRM formulation

- SVM solution approach

• Transduction yields better predictions than the standard approach

• Potentially plenty of applications

• No practical applications reported to date

Page 22: overview of pred learning


SIGNAL DENOISING

[Figure: two examples of noisy signals for denoising; the first sampled on 0–1000 with amplitudes roughly −2 to 6, the second on 0–100 with amplitudes roughly −3 to 6]

Page 23: overview of pred learning


SIGNAL DENOISING PROBLEM STATEMENT

• REGRESSION FORMULATION

real-valued function estimation (with squared loss)

• DIFFERENCES (from standard formulation)

- fixed sampling rate (no randomness in X-space)

- training data X-values = test data X-values

- non i.i.d. data

• SPECIFICS of this formulation

computationally efficient orthogonal estimators, e.g.:

- Discrete Fourier Transform (DFT)

- Discrete Wavelet Transform (DWT)
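As an illustration, here is a generic DFT-based denoiser (a minimal sketch: keeping the m largest coefficients is simple hard thresholding, not the VC-based model selection discussed on the next slide; the signal and the choice m = 10 are hypothetical):

```python
import numpy as np

def dft_denoise(y, m):
    # Orthogonal estimator: transform, keep the m largest-magnitude
    # DFT coefficients (m controls model complexity), then invert
    coeffs = np.fft.rfft(y)
    smallest = np.argsort(np.abs(coeffs))[:-m]  # all but the m largest
    coeffs[smallest] = 0.0
    return np.fft.irfft(coeffs, n=len(y))

# Hypothetical noisy signal on a fixed sampling grid (no randomness in X):
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
y = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(1000)
y_denoised = dft_denoise(y, m=10)
```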


Page 25: overview of pred learning


RESEARCH ISSUES FOR SIGNAL DENOISING

• CONCEPTUAL SIGNIFICANCE

Signal denoising formulation =

nonlinear estimator suitable for ERM

• MODEL SELECTION / COMPLEXITY CONTROL

V. Cherkassky and X. Shao (2001), Signal estimation and denoising using VC-theory, Neural Networks, 14, 37-52

• NON I.I.D. DATA

• SPECIFYING GOOD STRUCTURES

- especially for 2D signals (images)

Page 26: overview of pred learning


FINANCIAL ENGINEERING: MARKET TIMING

• MOTIVATION

increasing market volatility, i.e. typical daily fluctuations:

- 1-2% for SP500

- 2-4% for NASDAQ 100

Example of intraday volatility: NASDAQ 100 on March 14, 2001

• AVAILABILITY of MARKET DATA

easy to quantify loss function

hard to formulate good (robust) trading strategies

Page 27: overview of pred learning


TRADING VIA PREDICTIVE LEARNING

• GENERAL SETTING

[Diagram: market indicators supply input x to a PREDICTIVE MODEL y = f(x); the prediction drives a TRADING DECISION (buy/sell/hold) executed in the MARKET, producing a gain or loss]

• CONCEPTUAL ISSUES: problem setting

binary decision / continuous loss

• PRACTICAL ISSUES

- choice of trading security (equity)

- selection of input variables X

- timing of trading (e.g., daily, weekly, etc.)

These issues are most important for practical success

BUT cannot be formalized

Page 28: overview of pred learning


FUNDS EXCHANGE APPROACH

• 100% INVESTED OR IN CASH (daily trading)

trading decisions made at the end of each day

[Diagram: a Proprietary Exchange Strategy moves the account between a Money Market fund and an Index (or Fund) via paired buy/sell transactions]

optimal trade-off between preservation of principal and short-term gains

• PROPRIETARY EXCHANGE STRATEGIES

- selection of mutual funds

- selection of predictor (input) variables

Page 29: overview of pred learning

Ref: V. Cherkassky, F. Mulier and A. Sheng, Funds Exchange: an approach for risk and portfolio management, in Proc. Conf. Computational Intelligence for Financial Engineering, N.Y., 2000

Page 30: overview of pred learning


• EXAMPLE International Fund (actual trading)

results for a 6-month period:

                    Fund exchange   Buy-and-Hold
cumulative return       16.5%           5.7%
market exposure          55%            100%
number of trades          25              1

market exposure = % of days the account was invested

number of trades (25) = 25 buy and 25 sell transactions

Page 31: overview of pred learning


[Figure: Value of exchange vs. buy-and-hold, 20-May-99 through 25-Jan-00; y-axis: value, 98–118; two curves: exchange and buy_hold]

Page 32: overview of pred learning


FORMULATION FOR FUNDS EXCHANGE

[Diagram: market indicators supply input x to a PREDICTIVE MODEL y = f(x); the prediction drives a TRADING DECISION (buy/sell/hold) executed in the MARKET, producing a gain or loss]

GIVEN

p_i : time series of daily closing prices (of a fund)

q_i = (p_i − p_{i−1}) / p_i : daily % change

X_i : time series of daily values of predictor variables

y_i = f(x_i, w) : indicator decision function

y_i = 1 means BUY, y_i = 0 means SELL

• OBJECTIVE: maximize total return over an n-day period

Q(w) = Σ_{i=1..n} f(x_i, w) q_i
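A minimal sketch of this objective (my illustration: the prices and decisions are made up, and aligning each decision y_i with the daily change q_i is an assumption):

```python
import numpy as np

def total_return(decisions, prices):
    # Q(w) = sum_i f(x_i, w) * q_i, with q_i = (p_i - p_{i-1}) / p_i
    p = np.asarray(prices, dtype=float)
    q = (p[1:] - p[:-1]) / p[1:]            # daily % change
    d = np.asarray(decisions, dtype=float)  # y_i: 1 = BUY, 0 = SELL
    return float(np.sum(d * q))

# Hypothetical daily closing prices and decisions y_i = f(x_i, w),
# one decision per daily change:
prices = [100.0, 101.0, 99.0, 102.0]
decisions = [1, 0, 1]
print(total_return(decisions, prices))  # ~0.039, i.e. about 3.9%
print(np.mean(decisions))               # market exposure: fraction of days invested
```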

Page 33: overview of pred learning


SPECIFICS of FUNDS EXCHANGE FORMULATION

• COMBINES

indicator decision functions (BUY or SELL)

continuous cost function (cumulative return)

• NOTE - classification formulation

(= predicting direction)

is not meaningful because classification cost does not reflect the magnitude of price change

- regression formulation

(= predicting next-day price)

is not appropriate either

• Training/ Test data not i.i.d. random variables

• No constructive learning methods available

Page 34: overview of pred learning


SPATIAL DATA MINING

• MOTIVATION: geographical / spatial data

LAW of GEOGRAPHY:

nearby things are more similar than distant things

Examples: real estate data, US census data, crime rate data

CRIME RATE DATA in Columbus OH

Output (response) variable: crime rate in a given neighborhood

Input (predictor) variables: ave. household income, house price

Spatial domain (given): 49 neighborhoods in Columbus

Training data: 49 samples (each sample = neighborhood)

The PROBLEM:

Using the training data, estimate a predictive (regression-type) model that accounts for spatial correlation, i.e., the crime rate in adjacent neighborhoods should be similar.

Page 35: overview of pred learning


PROBLEM STATEMENT

GIVEN

X : input (explanatory) variables

y : output (response) variable

f(X_s, w) : approximating functions

(X_s, y_s) : training data defined on domain S

S = given spatial domain (2D fixed grid)

Spatial correlation: X or y values close in S are similar

• SPATIAL CORRELATION VIA LOSS FUNCTION

Loss (Risk) R(s) = prediction loss + spatial loss

R(s) = (y_s − f(x_s, w))² + (ȳ_s − f(x_s, w))²

where N(s) = given neighborhood around each location s

ȳ_s = (1/|N(s)|) Σ_{k ∈ N(s)} y_k : average spatial response at each location s
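A small sketch of this combined loss (my reading of the formulation: the neighborhood lists, the equal weighting of the two terms, and the data are all assumptions):

```python
import numpy as np

def spatial_risk(y, y_hat, neighbors):
    # Sum over locations s of
    #   (y_s - f(x_s, w))^2  +  (ybar_s - f(x_s, w))^2,
    # where ybar_s averages the response over the neighborhood N(s)
    total = 0.0
    for s in range(len(y)):
        ybar = float(np.mean([y[k] for k in neighbors[s]]))
        total += (y[s] - y_hat[s]) ** 2 + (ybar - y_hat[s]) ** 2
    return total

# Hypothetical 4-neighborhood domain with a given adjacency structure:
y = [10.0, 12.0, 30.0, 28.0]        # observed responses (e.g., crime rates)
y_hat = [11.0, 11.0, 29.0, 29.0]    # model predictions f(x_s, w)
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(spatial_risk(y, y_hat, neighbors))
```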

Page 36: overview of pred learning


4. SUMMARY and DISCUSSION

Perfection of means and confusion of goals seem, in my opinion, to characterize our age

A. Einstein, 1942

IMPORTANCE OF

• NEW FORMULATIONS for PREDICTIVE LEARNING

• PRACTICAL SIGNIFICANCE

• NEW DIRECTIONS for RESEARCH

• CONNECTION BETWEEN THEORY and PRACTICE

DISCUSSION / QUESTIONS