Data mining and statistical learning - lecture 11
Neural networks
- a model class providing a joint framework
for prediction and classification
Relationship to other prediction models
Some simple examples of neural networks
Parameter estimation
Joint framework for prediction and classification
Features of neural networks
Ordinary least squares regression (OLS)

[Diagram: inputs x1, x2, ..., xp feeding directly into the output y]

Model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \mathrm{error}$, or in vector form $y = \beta_0 + \beta^T X + \mathrm{error}$

Terminology:
$\beta_0$: intercept (or bias)
$\beta_1, \dots, \beta_p$: regression coefficients (or weights)

The response variable responds directly and linearly to changes in the inputs.
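As a minimal sketch, the OLS model above can be fitted with NumPy; the data here is synthetic, chosen only to illustrate the estimation step:

```python
import numpy as np

# Synthetic data: n observations, p = 2 inputs (illustrative values only)
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the intercept beta_0 is estimated as well
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)  # approximately [1.0, 2.0, -0.5]
```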
Principal components regression (PCR)

Extract principal components (linear combinations of the inputs) as derived features, and then model the target (response) as a linear function of these features:

$Z_m = \alpha_m^T X, \quad m = 1, \dots, M$

[Diagram: inputs x1, x2, ..., xp → derived features z1, z2, ..., zM → output y]

$y = \beta_0 + \beta^T Z + \mathrm{error}$

The response variable responds indirectly and linearly to changes in the inputs.
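A sketch of the two PCR steps, using scikit-learn's PCA and LinearRegression on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                      # made-up inputs
y = X @ rng.normal(size=10) + rng.normal(size=100)  # made-up response

# Step 1: derived features Z_m = alpha_m^T X, the first M principal components
M = 3
Z = PCA(n_components=M).fit_transform(X)

# Step 2: model the response as a linear function of the derived features
pcr = LinearRegression().fit(Z, y)
y_hat = pcr.predict(Z)
```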
Neural network with a single target

[Diagram: inputs x1, x2, ..., xp (input layer) → hidden layer of neurons z1, z2, ..., zM → output y]

The response to changes in inputs is indirect and nonlinear.
Neuron

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$, where $\sigma$ is a sigmoid activation function

[Figure: S-shaped sigmoid curve over inputs from -5 to 5, taking values between -1 and 1]
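A minimal sketch of a single neuron in Python, using tanh as the sigmoid (the manual-calculation slide later confirms that proc Neural uses tanh, which matches the curve's range from -1 to 1):

```python
import numpy as np

def neuron(x, alpha0, alpha):
    """Z = sigma(alpha_0 + alpha^T x) with tanh as the sigmoid,
    giving the S-shaped curve on the slide (values between -1 and 1)."""
    return np.tanh(alpha0 + np.dot(alpha, x))
```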
Neural networks with a single target

Extract linear combinations of the inputs, $\alpha_{0m} + \alpha_m^T X$, $m = 1, \dots, M$, as derived features, and then model the target (response) as a linear function of a sigmoid function (activation function) of these features:

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → output y]

$y = \beta_0 + \beta^T Z$
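A forward-pass sketch of this model, with tanh assumed as the sigmoid and array shapes chosen for illustration:

```python
import numpy as np

def forward(x, alpha0, alpha, beta0, beta):
    """Single-target network: alpha0 has shape (M,), alpha (M, p),
    beta0 is a scalar and beta has shape (M,)."""
    Z = np.tanh(alpha0 + alpha @ x)  # Z_m = sigma(alpha_0m + alpha_m^T x)
    return beta0 + beta @ Z          # y = beta_0 + beta^T Z
```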
Neural network with one input, one neuron, and one target

$Z = \sigma(\alpha_0 + \alpha_1 X)$
$y = \beta_0 + \beta_1 Z$

[Diagram: x → z → y; figure: the resulting S-shaped curve of y against x]
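A short sketch evaluating this network on a grid of x values; the parameter setting is illustrative, not taken from the slides:

```python
import numpy as np

# Illustrative parameter setting (alpha0, alpha1, beta0, beta1)
alpha0, alpha1, beta0, beta1 = 0.0, 1.0, 0.0, 1.0

x = np.linspace(-20, 20, 201)
z = np.tanh(alpha0 + alpha1 * x)  # hidden neuron
y = beta0 + beta1 * z             # network output
# A larger |alpha1| makes the S-shaped response steeper;
# beta1 rescales it and beta0 shifts it vertically.
```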
Neural network with one input, one neuron, and one target

$Z = \sigma(\alpha_0 + \alpha_1 X)$
$y = \beta_0 + \beta_1 Z$

[Figure: four panels plotting y against x (from -20 to 20) for different parameter settings $(\alpha_0, \alpha_1, \beta_0, \beta_1)$, including $(0, 0.1, 0, 1)$ and $(0, 1, 0, 1)$]
Neural network with one input, one neuron, and one target
- a simple example

Select Advanced user interface
Select 1 hidden node
Tick Outputs from Training,…

[Figure: y plotted against x from -15 to 15]
Neural network with one input, one neuron, and one target

[Figure: observed values y and predicted values P_y plotted against x from -15 to 15]
Output from proc Neural
- one input, one neuron, one target

Parameter Estimates

 N  Parameter    Estimate     Gradient Objective Function
 1  x_H11       -5.851506       0.000000103
 2  BIAS_H11    -0.032606      -0.000001516
 3  H11_y       -1.017515       1.8123827E-8
 4  BIAS_y      -0.006434       1.2814216E-8

Value of Objective Function = 0.0106538302

H11 = hidden layer 1, neuron 1
Neural network with one input, one neuron, and one target
- manual calculation of predicted values

 N  Parameter    Estimate     Gradient Objective Function
 1  x_H11       -5.851506       0.000000103
 2  BIAS_H11    -0.032606      -0.000001516
 3  H11_y       -1.017515       1.8123827E-8
 4  BIAS_y      -0.006434       1.2814216E-8

1. Standardize x to mean zero and variance one
2. Compute xstand*x_H11 + BIAS_H11
3. Take tanh to compute z
4. Compute z*H11_y + BIAS_y
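The four steps can be checked in Python with the estimates from the table; the x values below are illustrative stand-ins for the training data, and the sample standard deviation (ddof=1) is assumed for the standardization:

```python
import numpy as np

# Estimates from the proc Neural output above
x_H11, BIAS_H11 = -5.851506, -0.032606
H11_y, BIAS_y = -1.017515, -0.006434

x = np.linspace(-15.0, 15.0, 11)          # illustrative x values
x_stand = (x - x.mean()) / x.std(ddof=1)  # 1. standardize x
t = x_stand * x_H11 + BIAS_H11            # 2. xstand*x_H11 + BIAS_H11
z = np.tanh(t)                            # 3. take tanh to compute z
y_pred = z * H11_y + BIAS_y               # 4. z*H11_y + BIAS_y
```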
Neural networks with one input, two neurons, and one target

$Z_m = \sigma(\alpha_{0m} + \alpha_{1m} X), \quad m = 1, 2$
$y = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2$

[Diagram: x → z1, z2 → y; figure: y (from -0.4 to 0.4) plotted against x (from -20 to 20) for the parameter settings $(\alpha_{01}, \alpha_{11}, \beta_1) = (0, 1, 1)$ and $(\alpha_{02}, \alpha_{12}, \beta_2) = (0, 0.5, 1)$ with $\beta_0 = 0$]
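A sketch of the two-neuron network as a sum of two scaled tanh curves; the coefficients below are illustrative, not the fitted ones:

```python
import numpy as np

def net2(x, a01, a11, a02, a12, b0, b1, b2):
    z1 = np.tanh(a01 + a11 * x)   # neuron 1
    z2 = np.tanh(a02 + a12 * x)   # neuron 2
    return b0 + b1 * z1 + b2 * z2

x = np.linspace(-20, 20, 201)
y = net2(x, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0, -1.0)  # illustrative coefficients
# Two tanh curves with different slopes, combined with opposite signs,
# produce a non-monotone response that a single neuron cannot represent.
```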
Output from proc Neural
- one input, two neurons, one target

Parameter Estimates

 N  Parameter    Estimate     Gradient Objective Function
 1  x_H11       -4.040296      -0.000006221
 2  x_H12       -4.755015       0.000008922
 3  BIAS_H11     0.449445      -0.000046905
 4  BIAS_H12     0.176599       0.000092579
 5  H11_y        0.767115       0.000009568
 6  H12_y       -1.781053       0.000026628
 7  BIAS_y      -0.014300      -0.000086070

Value of Objective Function = 0.0104173896
Absorbance records for ten samples of chopped meat

[Figure: absorbance (0.0 to 5.0) plotted against channel (1 to 100) for Sample_1 through Sample_10]

1 response variable (fat)
100 predictors (absorbance at 100 wavelengths or channels)
The predictors are strongly correlated with each other
Absorbance records for 215 samples of chopped meat

The target is poorly correlated with each individual predictor.

[Figure: fat (%) from 0 to 60 plotted against absorbance in channel 50 (0 to 6)]
Neural networks with a single target and many inputs
- the fat content and absorbance dataset

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, 2, 3$

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, z3 → output y]

$y = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3$

A total of (p+2)*3 + 1 parameters are estimated.
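Checking the parameter count for the 100-channel absorbance data:

```python
# Hidden layer: 3 neurons, each with p weights and a bias -> 3*(p+1).
# Output layer: one weight per neuron plus a bias         -> 3 + 1.
p, M = 100, 3
n_params = M * (p + 1) + (M + 1)
print(n_params)  # 307, i.e. (p+2)*3 + 1 with p = 100
```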
Neural networks with a single target and many inputs
- parameter estimates for a model with three neurons

   N  Parameter        Estimate     Gradient Objective Function
   .
   .
   .
 291  Channel90_H13   -0.534226      -0.243706
 292  Channel91_H13   -0.590502      -0.245327
 293  Channel92_H13   -0.482705      -0.246851
 294  Channel93_H13   -0.528643      -0.248195
 295  Channel94_H13   -0.333949      -0.249403
 296  Channel95_H13   -0.258637      -0.250348
 297  Channel96_H13    0.162351      -0.250953
 298  Channel97_H13    0.273746      -0.251128
 299  Channel98_H13    0.711445      -0.250887
 300  Channel99_H13    0.879623      -0.250285
 301  BIAS_H11        -2.144805       0.003961
 302  BIAS_H12         0.738894       0.095724
 303  BIAS_H13        -0.771776       0.587769
 304  H11_Fat         -1.504744       0.054906
 305  H12_Fat        -15.057170      -0.025459
 306  H13_Fat        -18.345040       0.006471
 307  BIAS_Fat        16.856496      -0.029187

Value of Objective Function = 0.3045279048

A total of 307 parameters.
Neural networks with a single target and many inputs
- output from a model with three neurons
Neural networks with a single target and many inputs
- output from models with 1 to 10 neurons

[Figure: root ASE (0 to 6) on training data (Root ASE) and test data (Test: Root ASE) plotted against the number of neurons (0 to 10); annotation: convergence problems]
Neural networks with multiple targets

Extract linear combinations of the inputs, $\alpha_{0m} + \alpha_m^T X$, $m = 1, \dots, M$, as derived features, and then model each target (response) as a linear function of a sigmoid function (activation function) of these features:

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]

$y_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \dots, K$
Neural networks for K-class classification

With the softmax activation function

$g_k(y) = \frac{\exp(y_k)}{\sum_{l=1}^{K} \exp(y_l)}$

and the deviance (cross-entropy) error function

$R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$

the neural network model is exactly a logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]
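A small sketch of the softmax and the deviance criterion in NumPy, assuming a one-hot target matrix Y and fitted class probabilities F:

```python
import numpy as np

def softmax(y):
    """g_k(y) = exp(y_k) / sum_l exp(y_l), shifted for numerical stability."""
    e = np.exp(y - y.max())
    return e / e.sum()

def deviance(Y, F):
    """R(theta) = -sum_i sum_k y_ik log f_k(x_i).
    Y is an N-by-K one-hot target matrix, F the fitted probabilities."""
    return -np.sum(Y * np.log(F))
```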
Neural networks for regression and K-class classification

For regression, we use the sum-of-squared errors as our measure of fit:

$R(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2$

For classification, we normally use the deviance (cross-entropy) error function

$R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$

with the softmax activation function

$g_k(y) = \frac{\exp(y_k)}{\sum_{l=1}^{K} \exp(y_l)}$

and the corresponding classifier is $G(x) = \arg\max_k f_k(x)$.

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]
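A sketch of the regression fit criterion and the argmax classifier, with F an N-by-K matrix of fitted values $f_k(x_i)$:

```python
import numpy as np

def sse(Y, F):
    """Regression: R(theta) = sum_i sum_k (y_ik - f_k(x_i))^2."""
    return np.sum((Y - F) ** 2)

def classify(F):
    """Classification rule G(x_i) = argmax_k f_k(x_i), one label per row."""
    return np.argmax(F, axis=1)
```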
Fitting neural networks

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]

M(p+1) + K(M+1) parameters (weights)

We don't want the global minimizer of the deviance (cross-entropy) function, because that is typically an overfitted solution. Instead we use early stopping or a penalty term.
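A sketch of the penalty-term idea (weight decay): the objective becomes $R(\theta) + \lambda J(\theta)$ with $J(\theta)$ the sum of squared weights, an assumption about the penalty's form; $\lambda$ would be tuned on validation data:

```python
import numpy as np

def penalized_objective(R_theta, weights, lam):
    """R(theta) + lambda * sum of squared weights (weight decay).
    lam is a tuning parameter, typically chosen by validation."""
    return R_theta + lam * np.sum(weights ** 2)
```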
Data mining and statistical learning - lecture 11
Neural networks
Provide a joint framework for prediction and classification
Can describe both linear and nonlinear responses
Can accommodate multidimensional correlated inputs
Are normally over-fitted – validation is a must
Are difficult to interpret
Convergence problems are not uncommon
Some characteristics of different learning methods

Characteristic                                       Neural networks   Trees
Natural handling of data of “mixed” type                  poor          good
Handling of missing values                                poor          good
Robustness to outliers in input space                     poor          good
Insensitive to monotone transformations of inputs         poor          good
Computational scalability (large N)                       poor          good
Ability to deal with irrelevant inputs                    poor          good
Ability to extract linear combinations of features        good          poor