Network Ensembles (Committees) for Improved Classification and Regression Radwan E. Abdel-Aal Computer Engineering Department November 2006.


Contents

• Data-based Predictive Modeling
- Approach, Advantages, Scope, and Main tools
• Need for high prediction accuracy
• The network ensemble (committee) approach
- Need for diversity among members and how to achieve it
• Some Results
- Classification: Medical diagnosis
- Regression: Electric peak load forecasting
• Summary

Data-based Predictive Modeling

• The process of creating a statistical model of future behavior based on data collected on observed past behavior

• The model uses a number of predictors (input variables that are likely to influence the output)

• The model relationship between such inputs and behavior is determined using a machine learning algorithm

Input Vector, X (Predictors, Attributes, Features) → Output, Y, where Y = F(X)

Advantages over other modeling approaches

• Thorough theoretical knowledge is not necessary
• Less user intervention (Let the data speak!) (No biases or pre-assumptions on relationships)
• Better handling of nonlinearities, complexities
• Greater tolerance to noise, uncertainties (Soft Computing)
• Faster and easier to develop
• Utilizes the loads of computerized historical data available now in many disciplines

Scope

• Environmental: Pollution monitoring, Weather forecasting
• Finance and business: Loan assessment, Fraud detection, Market forecasting, Basket analysis, Product targeting, Efficient mailing
• Engineering: Process modeling and optimization, Load forecasting, Machine diagnostics, Predictive maintenance
• Medical and Bio Informatics: Screening, Diagnosis, Prognosis, Therapy, Gene classification
• Internet: Web access analysis, Site personalization

How? Two basic steps

1. Develop the model using known cases (supervised learning): given the attributes X (IN) and the known output Y (OUT), determine F(X)

2. Use the model for new cases: given the attributes X (IN), estimate the unknown output as Y = F(X)

Data-based Predictive Modeling by Supervised Machine Learning

• Database of solved examples (input-output)
• Preparation: clean up, transform, derive new attributes
• Split data into a training set and a test set
• Training: develop the model on the training set
• Evaluation: see how the model fares on the test set
• Actual use: use a promising model on new input data to estimate the unknown output
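The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy database, the 75/25 split, and the choice of a 1-nearest-neighbor classifier as the learning tool are all hypothetical and are not taken from the slides.

```python
import random

def split_dataset(examples, test_fraction=0.25, seed=0):
    """Shuffle the solved examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def nearest_neighbor_predict(train, x):
    """Return the label of the training example closest to x (1-NN)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda example: sq_dist(example[0], x))
    return label

def accuracy(train, test):
    """Percentage of test cases whose predicted label matches the known label."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return 100.0 * hits / len(test)

# Hypothetical database of solved examples: the label is 1 when x1 + x2 > 1.
rng = random.Random(42)
database = []
for _ in range(200):
    x = (rng.random(), rng.random())
    database.append((x, int(x[0] + x[1] > 1.0)))

train_set, test_set = split_dataset(database)      # split
test_accuracy = accuracy(train_set, test_set)      # train (store cases) and evaluate
new_case_label = nearest_neighbor_predict(train_set, (0.9, 0.8))  # actual use

print(f"training cases: {len(train_set)}, test cases: {len(test_set)}")
print(f"test accuracy: {test_accuracy:.1f}%")
```

The test set plays no part in building the model, so the accuracy measured on it estimates how the model would fare on new cases.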

Example: Medical Screening

Y = F(x): true function (usually not known) for population P

1. Collect Data: “labeled” training sample drawn from P

57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 1
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 1

2. Training: Get G(x), the model learned from the training sample. Goal: $E[(F(x) - G(x))^2] \approx 0$ for future samples drawn from P. Not just data fitting!

3. Test/Use: for a new case x, the true output Y = F(x) is unknown (?) and is estimated by G(x)

71,M,160,1,132,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 ?

Data-Based Modeling Tools (Learning Paradigms)

• Decision Trees
• Nearest-Neighbor Classifiers
• Support Vector Machines
• Neural Networks
• Abductive Networks

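Neural Networks (NN)

[Figure: feed-forward neural network. The independent input variables (attributes) Age = 34, Gender = 2, Stage = 4 enter the input layer and pass through weighted connections to hidden-layer neurons with a transfer function, then through further weights to the output layer, which gives the dependent output variable. Network output: 0.60; Actual: 0.65; Error: 0.05, used for error back-propagation.]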

Limitations of Neural Networks

• Ad hoc approach by user to determine network structure and training parameters (trial and error?)
• Opacity or black-box nature gives poor explanation capabilities, which are important, e.g. in medicine
• Significant inputs are not immediately obvious
• When to stop training to avoid over-fitting?
• Local minima may hinder reaching the optimum solution
• G(x) is “distributed” in a maze of network weights

Self-Organizing Abductive (Polynomial) Networks

- GMDH-based
- Network of polynomial functional elements, not simple neurons
- No fixed a priori model structure. Model evolves with training
- Automatic selection of: Significant inputs, Network size, Element types, Connectivity, and Coefficients
- Automatic stopping criteria, with simple control on complexity
- Analytical input-output relationships

“Double” Element:

$$y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2 + w_6 x_1^3 + w_7 x_2^3$$
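As a small illustration, the two-input “Double” element can be evaluated directly from its polynomial. The coefficient values below are hypothetical; in an abductive network they would be determined automatically during training.

```python
def double_element(x1, x2, w):
    """Evaluate the two-input 'Double' polynomial element.

    w holds the eight coefficients w0..w7 in the order used in the formula:
    constant, x1, x2, x1^2, x2^2, x1*x2, x1^3, x2^3.
    """
    if len(w) != 8:
        raise ValueError("expected 8 coefficients w0..w7")
    return (w[0] + w[1] * x1 + w[2] * x2
            + w[3] * x1 ** 2 + w[4] * x2 ** 2
            + w[5] * x1 * x2
            + w[6] * x1 ** 3 + w[7] * x2 ** 3)

# Hypothetical coefficients, chosen only to exercise every term type.
w = [1.0, 2.0, -1.0, 0.5, 0.0, 0.25, 0.0, 0.1]
y = double_element(2.0, 3.0, w)
print(y)
```

Because the element is an explicit polynomial, the trained input-output relationship can be read off from the coefficients, which is the "analytical relationship" advantage listed above.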

Need for high prediction accuracy: Medical Diagnosis

| | Predicted +ive (PP) | Predicted -ive (PN) |
|---|---|---|
| Actual +ive (AP) | TP | FN |
| Actual -ive (AN) | FP | TN |

- Ideally FN = FP = 0
- FN: Actual positives missed as negatives by classifier
- FP: Actual negatives mistaken as positives by classifier

Both types of errors are costly! A “cost” can be given to each type of error.

$$\text{Classification Accuracy} = \frac{TP + TN}{AP + AN} \times 100$$
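A minimal sketch of these quantities, computed from the four cells of the confusion matrix. The counts and the per-error costs below are hypothetical and serve only to show the arithmetic.

```python
def classification_accuracy(tp, fn, fp, tn):
    """Classification accuracy in percent: (TP + TN) / (AP + AN) x 100."""
    ap = tp + fn  # actual positives
    an = fp + tn  # actual negatives
    return 100.0 * (tp + tn) / (ap + an)

def total_error_cost(fn, fp, cost_fn, cost_fp):
    """Total cost when each error type carries its own (assumed) cost."""
    return fn * cost_fn + fp * cost_fp

# Hypothetical counts for 80 evaluation cases.
tp, fn, fp, tn = 27, 8, 2, 43
acc = classification_accuracy(tp, fn, fp, tn)
# Assumed costs: a missed positive is taken to be 5 times as costly as a false alarm.
cost = total_error_cost(fn, fp, cost_fn=5.0, cost_fp=1.0)
print(f"accuracy = {acc:.1f}%, total error cost = {cost}")
```

Two classifiers with the same accuracy can have very different total costs once false negatives and false positives are weighted differently.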

Need for high prediction accuracy: Hourly Electric Load Forecasting

• Overestimation: Spin up reserve units unnecessarily
• Underestimation: Need to deploy expensive peaking units or buy costly generation from other utilities
• Either way: higher operating costs
• An extra 1% in forecast error increased the operating cost of a UK power utility by 10 million pounds sterling in 1985

How to ensure good predictive models?

• Use effective predictors
• Use representative datasets for model training and evaluation
• Large training and evaluation datasets
• Pre-process datasets to remove outliers, errors, etc. and perform normalization, transformations, etc.
• Avoid over-fitting during training (i.e. use parsimonious models)
• Use proven learning algorithms

What if a single model is not good enough? The Network Ensemble Approach

• If member networks are independent, diversity in the decision-making process boosts generalization, thus improving accuracy, robustness, and reliability of the overall prediction
• Identical members → No gain in performance
• Improvement is expected only when members err in different ways and directions, so errors can cancel out!
• The number of members, n, is usually odd to suit majority voting

Methods of combining member outputs

1. Simple combination of member outputs:

- Simple averaging of continuous outputs:

$$z_c = \frac{1}{n} \sum_{i=1}^{n} y_i$$

- Weighted averaging of continuous outputs (fixed weights):

$$z_c = \sum_{i=1}^{n} \alpha_i y_i, \quad \text{where} \quad \sum_{i=1}^{n} \alpha_i = 1$$

e.g.

$$\alpha_i = \frac{1/\sigma_i^2}{\sum_{j=1}^{n} 1/\sigma_j^2}$$

where $\sigma_i^2$ is the variance of member i output on its training set

- Majority voting of categorical outputs
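The three simple combination rules can be sketched as follows. The member outputs and variances are hypothetical, and the weights use the inverse-variance choice given as an example above.

```python
from collections import Counter

def simple_average(outputs):
    """Committee output as the plain mean of the member outputs."""
    return sum(outputs) / len(outputs)

def inverse_variance_weights(variances):
    """Fixed weights proportional to 1/variance, normalized to sum to 1."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [value / total for value in inverse]

def weighted_average(outputs, weights):
    """Committee output as a fixed-weight sum of the member outputs."""
    return sum(w * y for w, y in zip(weights, outputs))

def majority_vote(labels):
    """Most common categorical output; an odd number of members avoids ties
    for two-class problems."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical continuous outputs of a 3-member committee and their variances.
outputs = [0.6, 0.7, 0.8]
variances = [1.0, 2.0, 4.0]
weights = inverse_variance_weights(variances)

z_simple = simple_average(outputs)
z_weighted = weighted_average(outputs, weights)
z_vote = majority_vote([1, 0, 1])
print(z_simple, z_weighted, z_vote)
```

With inverse-variance weights, the member with the smallest variance (here the first) pulls the committee output toward its own value.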

Methods of combining member outputs, Contd.

2. A gating network uses the input vector to determine optimum weights for member outputs for each case to be classified

3. Stacked generalization approach: the output combiner is another higher-level network trained on the outputs of individual members to generate the committee classification output

Network Ensembles: The need for diversification

The committee error can be shown to have two components:
- One measuring the average generalization error of individual members
- The other measuring the disagreement among outputs of individual members

Therefore:
• Individual members should ideally be uncorrelated or even negatively correlated (Diversity)
• An ideal committee would consist of:
- Highly accurate members,
- which disagree among themselves as much as possible
• Possible tradeoffs between the above two requirements
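One well-known form of this two-component split, for a simple-averaging committee with squared error, is the ambiguity decomposition: committee error = average member error minus average disagreement (ambiguity) of the members around the committee output. The slides do not say which formulation they have in mind, so the sketch below is an assumption, and the member outputs and target are hypothetical.

```python
def ambiguity_decomposition(member_outputs, target):
    """Return (committee error, average member error, ambiguity) for one case.

    Uses squared error and a simple-averaging committee, for which
    committee_error == avg_member_error - ambiguity holds exactly.
    """
    n = len(member_outputs)
    z = sum(member_outputs) / n  # committee output
    avg_member_error = sum((y - target) ** 2 for y in member_outputs) / n
    ambiguity = sum((y - z) ** 2 for y in member_outputs) / n
    committee_error = (z - target) ** 2
    return committee_error, avg_member_error, ambiguity

# Hypothetical outputs of three members for a case whose true output is 0.65.
committee_error, avg_member_error, ambiguity = ambiguity_decomposition(
    [0.5, 0.8, 0.9], target=0.65
)
print(committee_error, avg_member_error - ambiguity)
```

Since the ambiguity term is never negative, the committee error cannot exceed the average member error, and more disagreement among equally accurate members lowers it further.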

Ensuring diversity: Committee of Experts

Individual member networks belong to totally different learning paradigms, e.g. Neural networks, Nearest neighbor classifiers, Classification and Regression Trees (CART), etc.

Advantages:
• Members can use the same full training dataset and the same full set of input features
• No conflict between individual quality and collective diversity

Disadvantages:
• Requires the use of different tools and expertise

Ensuring diversity: Same learning paradigm

Develop member networks with the same paradigm using:
• Different training subsets, or
• Different input features, or
• Different training conditions

Neural networks:
- Different: architectures (MLP, RBF), algorithms (BP, SA), topologies, initial random weights, neuron transfer functions, learning rate, momentum, stopping criteria (Research topic)

Abductive networks (self-organizing, limited user choices!):
- Different values for the model complexity parameter (CPM)

Sacrifice individual quality for collective diversity!

CPM (Complexity Penalty Parameter: 0.1 to 10)
• Lower CPM → More complex model
• Can be used as a method for ranking input features: those selected earlier are better predictors

Some Results: Classification for Medical Diagnosis

1. The Pima Indians Diabetes Dataset from the UCI Machine Learning Repository

• 768 cases (669 for training and 99 for evaluation)
• 8 numerical attributes on physiological measurements and medical test results
• A binary class variable (Diabetic: 1, Not Diabetic: 0)
• Percentage of positives in the total set: 34.9%
• Typical classification accuracy reported for the C4.5 decision tree tool: 74.6%

Classification of the Diabetes Dataset

• Optimum monolithic abductive network model using full training dataset and feature set
• Two abductive network ensemble approaches:
- A: 3 members trained on the same (full) training set at different CPM values (model complexity parameters) (NT = 669 cases)
- B: 3 members trained on 3 mutually exclusive subsets of the training set at the same CPM value (NT = 223 cases per member)

Ensemble-A: Members trained on the same training data with different CPMs
• Errors by different members are highly correlated (i.e. err together, less independent)

Ensemble-B: Members trained on different training data with the same CPM
• Errors by different members are poorly correlated (i.e. err differently, more independent)

Classification of the Diabetes Dataset

| Model | Members Training | NT | RMS of the 3 Error Correlation Coefficients | Average Member Classification Accuracy, % | Unanimous Errors, % | Classification Accuracy, % (committees use majority vote) |
|---|---|---|---|---|---|---|
| Monolithic | Best Single Model (CPM = 0.5) | 669 | - | - | - | 73.7 |
| Ensemble-A | Full training set, Different CPMs (0.5, 1, 2) | 669 | 0.96 | 72.7 | 23.2 | 73.7 |
| Ensemble-B | Split training set (Same CPM = 1) | 223 | 0.80 | 73.7 | 16.2 | 76.8 |
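The diagnostics reported for the two ensembles can be sketched as below. The slides do not state how the error correlation coefficients were computed; this sketch assumes Pearson correlation between the members' 0/1 error indicators, and the predictions are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def ensemble_diagnostics(member_predictions, truth):
    """Return (RMS error correlation, unanimous errors %, majority-vote accuracy %)
    for binary 0/1 predictions."""
    n_cases = len(truth)
    n_members = len(member_predictions)
    # 1 where a member misclassifies a case, 0 where it is correct.
    errors = [[int(p != t) for p, t in zip(preds, truth)]
              for preds in member_predictions]
    corrs = [pearson(e1, e2) for e1, e2 in combinations(errors, 2)]
    rms_corr = sqrt(sum(c * c for c in corrs) / len(corrs))
    unanimous = sum(all(e[k] for e in errors) for k in range(n_cases))
    votes = [int(2 * sum(p[k] for p in member_predictions) > n_members)
             for k in range(n_cases)]
    correct = sum(v == t for v, t in zip(votes, truth))
    return rms_corr, 100.0 * unanimous / n_cases, 100.0 * correct / n_cases

# Hypothetical evaluation labels and predictions of a 3-member committee.
truth = [1, 0, 1, 1, 0, 0, 1, 0]
members = [
    [1, 0, 0, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
]
rms_corr, unanimous_pct, committee_acc = ensemble_diagnostics(members, truth)
print(rms_corr, unanimous_pct, committee_acc)
```

Each member here is 75% accurate on its own, yet the majority vote reaches 87.5% because only one case is misclassified by all three members at once.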

Some Results: Classification for Medical Diagnosis

2. The Cleveland Heart Disease Dataset from the UCI Machine Learning Repository

• 270 cases (190 for training and 80 for evaluation)
• 13 numerical attributes
• A binary class variable: Presence (1) / Absence (0) of heart disease
• Percentage of positives in the total set: 44.4%
• Typical classification accuracy reported with neural networks: 81.8%

Classification of the Heart Disease Dataset

• Optimum monolithic abductive network model using full training dataset and feature set
• 3-member abductive network ensemble:
- The training set available is small, so it is not practical to split
- For diversity: members are trained on the same (full) training set but using different (mutually exclusive) subsets of input features


Classification of the Heart Disease Dataset

• To ensure good (and uniform) quality of all member networks, good-quality input features must be distributed uniformly amongst members
• First, rank the input features based on predictive quality
• Then distribute the features among the 3 members fairly
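One simple way to distribute ranked features fairly is to deal them out in turn, like cards. The slides do not specify the exact distribution rule, so the round-robin scheme below is an assumption, and the feature names f1..f13 (best first) are hypothetical.

```python
def distribute_features(ranked_features, n_members=3):
    """Deal features, ranked best first, to the members in round-robin order.

    Each member receives a mutually exclusive subset, and the best-ranked
    features are spread evenly across the members.
    """
    subsets = [[] for _ in range(n_members)]
    for rank, feature in enumerate(ranked_features):
        subsets[rank % n_members].append(feature)
    return subsets

# Hypothetical ranking of 13 input features, best predictor first.
ranked = [f"f{i}" for i in range(1, 14)]
subsets = distribute_features(ranked, n_members=3)
for member, subset in enumerate(subsets, start=1):
    print(f"member {member}: {subset}")
```

With 13 features and 3 members, one member necessarily receives five features and the other two receive four each.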


Classification of the Heart Disease Dataset

| Members Training | Sensitivity, % | Specificity, % | +ive Predictive Value, % | -ive Predictive Value, % | Classification Accuracy, % |
|---|---|---|---|---|---|
| Best Single Model | 71.4 | 91.1 | 86.2 | 80.4 | 82.5 |
| Full training set, Different input features | 77.1 | 95.6 | 93.1 | 84.3 | 87.5 |

Some Results: Regression: Electrical Load Forecasting

Forecasting horizons:
• Short term: Hours, a week
• Medium term: Months, a year
• Long term: Up to 20 years

Short term (ST) Forecasting:
- Hourly load profile
- Daily peak load

ST forecasts are useful for scheduling:
• Generator unit commitment
• Short-term maintenance
• Fuel allocation
• Evaluating interchange transactions in deregulated markets

[Figure: load curves labeled Weekend and Weekdays]

Short-Term Forecasting: Factors affecting the load

• Time, Calendar: Hourly, daily, seasonal, holidays, school year, …
• Weather (heating/cooling loads): Temperature, humidity, wind speed, cloud cover, …
• Economic, Societal (slow trending effect): Industrial growth, electricity pricing, population growth, …
• Events: Start/stop of large loads, sports and TV shows, …

[Figure: load curves labeled Weekend and Weekdays; load over 28 years showing seasonal yearly variations and a general upward trend]

Forecasting tomorrow’s Peak Load

• 47 inputs representing:
- Peak load,
- Max and min temperatures,
- Day type: WRK, SAT, SUN/HOLIDAY
over the previous 7 days and the forecasted day
• Training: 3 years (1987-89); Evaluation: 4th year (1990)
• Trend management (2 ways):
- Use an additional trend input
- Normalize all training years to the last training year and then denormalize the model output

[Figure: Tomorrow’s Peak Load Forecaster. Labels: 7 x 6 = 42; 5 (Forecasted Day); Trend; Total: 48 inputs; 24 off; O/P]

Forecasting tomorrow’s Peak Load: Monolithic Neural and Abductive Networks

Abductive Model:
• Only 8 of the 48 available inputs are automatically selected during training
• Performance over the evaluation year:
- Mean Absolute Percentage Error (MAPE) = 2.52%
- Correlation coefficient between true and predicted data = 0.986

[Figure: scatter plots of forecasted vs. actual peak load, PL(d+1), in MW, with both axes running from 1800 to 4800 MW. Panel labels: MAPE: 2.61% and 2.52%; R = 0.986]

Improving Forecasting Accuracy Using Abductive Ensembles

• Three-member ensemble to forecast load for the year 1990

Training:
• Each member is trained on raw data for one of the three preceding years: 1987, 1988, 1989
• No need for trend input when training on 1-year data

For evaluation:
• For each model:
- Normalize evaluation year load data to the model training year at input
- Denormalize model output to evaluation year before combining the 3 outputs
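The normalize, forecast, denormalize, and combine sequence might look like the sketch below. The slides do not give the normalization formula; this sketch assumes a simple ratio of load levels (for example, mean annual loads), and the three member models, load levels, and recent loads are all hypothetical stand-ins.

```python
def committee_forecast(members, recent_loads, eval_level):
    """Average the denormalized forecasts of members trained on different years.

    members: list of (model, train_level) pairs, where model maps a list of
             loads at the training-year level to a forecast at that level.
    recent_loads: evaluation-year loads used as model inputs, MW.
    eval_level: assumed load level of the evaluation year, MW.
    """
    forecasts = []
    for model, train_level in members:
        scale = train_level / eval_level                  # eval year -> training year
        normalized_inputs = [load * scale for load in recent_loads]
        normalized_output = model(normalized_inputs)
        forecasts.append(normalized_output / scale)       # back to eval year
    return sum(forecasts) / len(forecasts)

# Hypothetical stand-ins for the three trained members and their year levels.
members = [
    (lambda x: x[-1], 2500.0),                  # persistence model
    (lambda x: sum(x) / len(x), 2700.0),        # mean of recent loads
    (lambda x: 0.5 * x[-1] + 1450.0, 2900.0),   # affine model
]
forecast = committee_forecast(members, [3000.0, 3200.0, 3100.0], eval_level=3100.0)
print(f"committee forecast: {forecast:.1f} MW")
```

Scaling the inputs lets each member see loads at the level it was trained on, which is why no trend input is needed; the outputs must be scaled back before averaging so that all three forecasts refer to the evaluation year.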

Abductive Network Ensembles Results

| Scheme | RMS of Model Error Correlations | MAPE, % | MAE, MW | Maximum APE, % | % of Evaluation population with APE ≤ 3% |
|---|---|---|---|---|---|
| Monolithic Model | - | 2.52 | 70 | 14.20 | 68 |
| 3-Member Committee, Same CPM | 0.655 | 2.36 | 65.2 | 10.97 | 69 |
| 3-Member Committee, Different CPMs | 0.577 | 2.19 | 61 | 10.02 | 74 |

Statistically significant error reduction
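The error measures used for the forecasting results can be computed as in this sketch. The actual and forecasted loads are hypothetical, and the 3% threshold is treated as inclusive, which is an assumption.

```python
def absolute_percentage_errors(actual, forecast):
    """Absolute percentage error (APE) for each day, in percent."""
    return [100.0 * abs(f - a) / a for a, f in zip(actual, forecast)]

def forecast_metrics(actual, forecast, threshold=3.0):
    """Return MAPE (%), MAE (MW), maximum APE (%), and the percentage of
    days whose APE is within the threshold."""
    apes = absolute_percentage_errors(actual, forecast)
    mape = sum(apes) / len(apes)
    mae = sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)
    within = 100.0 * sum(ape <= threshold for ape in apes) / len(apes)
    return mape, mae, max(apes), within

# Hypothetical daily peak loads, MW.
actual = [2000.0, 2500.0, 4000.0]
forecast = [2100.0, 2450.0, 3900.0]
mape, mae, max_ape, within_3 = forecast_metrics(actual, forecast)
print(f"MAPE = {mape:.2f}%, MAE = {mae:.1f} MW, "
      f"max APE = {max_ape:.2f}%, within 3%: {within_3:.1f}%")
```

MAPE summarizes typical accuracy, while the maximum APE and the share of days within the threshold describe the tails, which matter for operating cost.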

Summary

• Network ensembles (committees) can lead to significant performance gains in classification and regression
• Members need to be both accurate and independent
• Independence is more difficult to achieve with abductive networks than with neural networks
• Effective ways to achieve this are: different training datasets, different input features, and different model complexity (CPMs)
• Demonstrated the technique on medical data (classification) and electrical load forecasting (regression)
