Data mining and statistical learning - lecture 11
Neural networks
- a model class providing a joint framework
for prediction and classification
Relationship to other prediction models
Some simple examples of neural networks
Parameter estimation
Joint framework for prediction and classification
Features of neural networks
Ordinary least squares regression (OLS)

[Diagram: inputs x1, x2, ..., xp feeding directly into the output y]

Model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \mathrm{error}$, or in vector form $y = \beta_0 + \beta^T X + \mathrm{error}$

Terminology:
$\beta_0$: intercept (or bias)
$\beta_1, \dots, \beta_p$: regression coefficients (or weights)

The response variable responds directly and linearly to changes in the inputs.
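As a minimal sketch, the OLS model above can be fitted with NumPy; the data here is synthetic, chosen only to illustrate the estimation step:

```python
import numpy as np

# Synthetic data: n observations, p = 2 inputs (illustrative values only)
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the intercept beta_0 is estimated as well
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)  # approximately [1.0, 2.0, -0.5]
```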
Principal components regression (PCR)

Extract principal components (linear combinations of the inputs) as derived features, and then model the target (response) as a linear function of these features:

$Z_m = \alpha_m^T X, \quad m = 1, \dots, M$

[Diagram: inputs x1, x2, ..., xp → derived features z1, z2, ..., zM → output y]

$y = \beta_0 + \beta^T Z + \mathrm{error}$

The response variable responds indirectly and linearly to changes in the inputs.
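A sketch of the two PCR steps, using scikit-learn's PCA and LinearRegression on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                      # made-up inputs
y = X @ rng.normal(size=10) + rng.normal(size=100)  # made-up response

# Step 1: derived features Z_m = alpha_m^T X, the first M principal components
M = 3
Z = PCA(n_components=M).fit_transform(X)

# Step 2: model the response as a linear function of the derived features
pcr = LinearRegression().fit(Z, y)
y_hat = pcr.predict(Z)
```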
Neural network with a single target

[Diagram: inputs x1, x2, ..., xp (input layer) → hidden layer of neurons z1, z2, ..., zM → output y]

The response to changes in inputs is indirect and nonlinear.
Neuron

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$, where $\sigma$ is a sigmoid activation function

[Figure: S-shaped sigmoid curve over inputs from -5 to 5, taking values between -1 and 1]
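A minimal sketch of a single neuron in Python, using tanh as the sigmoid (the manual-calculation slide later confirms that proc Neural uses tanh, which matches the curve's range from -1 to 1):

```python
import numpy as np

def neuron(x, alpha0, alpha):
    """Z = sigma(alpha_0 + alpha^T x) with tanh as the sigmoid,
    giving the S-shaped curve on the slide (values between -1 and 1)."""
    return np.tanh(alpha0 + np.dot(alpha, x))
```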
Neural networks with a single target

Extract linear combinations of the inputs, $\alpha_{0m} + \alpha_m^T X$, $m = 1, \dots, M$, as derived features, and then model the target (response) as a linear function of a sigmoid function (activation function) of these features:

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → output y]

$y = \beta_0 + \beta^T Z$
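A forward-pass sketch of this model, with tanh assumed as the sigmoid and array shapes chosen for illustration:

```python
import numpy as np

def forward(x, alpha0, alpha, beta0, beta):
    """Single-target network: alpha0 has shape (M,), alpha (M, p),
    beta0 is a scalar and beta has shape (M,)."""
    Z = np.tanh(alpha0 + alpha @ x)  # Z_m = sigma(alpha_0m + alpha_m^T x)
    return beta0 + beta @ Z          # y = beta_0 + beta^T Z
```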
Neural network with one input, one neuron, and one target

$Z = \sigma(\alpha_0 + \alpha_1 X)$
$y = \beta_0 + \beta_1 Z$

[Diagram: x → z → y; figure: the resulting S-shaped curve of y against x]
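A short sketch evaluating this network on a grid of x values; the parameter setting is illustrative, not taken from the slides:

```python
import numpy as np

# Illustrative parameter setting (alpha0, alpha1, beta0, beta1)
alpha0, alpha1, beta0, beta1 = 0.0, 1.0, 0.0, 1.0

x = np.linspace(-20, 20, 201)
z = np.tanh(alpha0 + alpha1 * x)  # hidden neuron
y = beta0 + beta1 * z             # network output
# A larger |alpha1| makes the S-shaped response steeper;
# beta1 rescales it and beta0 shifts it vertically.
```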
Neural network with one input, one neuron, and one target

$Z = \sigma(\alpha_0 + \alpha_1 X)$
$y = \beta_0 + \beta_1 Z$

[Figure: four panels plotting y against x (from -20 to 20) for different parameter settings $(\alpha_0, \alpha_1, \beta_0, \beta_1)$, including $(0, 0.1, 0, 1)$ and $(0, 1, 0, 1)$]
Neural network with one input, one neuron, and one target
- a simple example

Select Advanced user interface
Select 1 hidden node
Tick Outputs from Training,…

[Figure: y plotted against x from -15 to 15]
Neural network with one input, one neuron, and one target

[Figure: observed values y and predicted values P_y plotted against x from -15 to 15]
Output from proc Neural
- one input, one neuron, one target

Parameter Estimates

 N  Parameter    Estimate     Gradient Objective Function
 1  x_H11       -5.851506       0.000000103
 2  BIAS_H11    -0.032606      -0.000001516
 3  H11_y       -1.017515       1.8123827E-8
 4  BIAS_y      -0.006434       1.2814216E-8

Value of Objective Function = 0.0106538302

H11 = hidden layer 1, neuron 1
Neural network with one input, one neuron, and one target
- manual calculation of predicted values

 N  Parameter    Estimate     Gradient Objective Function
 1  x_H11       -5.851506       0.000000103
 2  BIAS_H11    -0.032606      -0.000001516
 3  H11_y       -1.017515       1.8123827E-8
 4  BIAS_y      -0.006434       1.2814216E-8

1. Standardize x to mean zero and variance one
2. Compute xstand*x_H11 + BIAS_H11
3. Take tanh to compute z
4. Compute z*H11_y + BIAS_y
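The four steps can be checked in Python with the estimates from the table; the x values below are illustrative stand-ins for the training data, and the sample standard deviation (ddof=1) is assumed for the standardization:

```python
import numpy as np

# Estimates from the proc Neural output above
x_H11, BIAS_H11 = -5.851506, -0.032606
H11_y, BIAS_y = -1.017515, -0.006434

x = np.linspace(-15.0, 15.0, 11)          # illustrative x values
x_stand = (x - x.mean()) / x.std(ddof=1)  # 1. standardize x
t = x_stand * x_H11 + BIAS_H11            # 2. xstand*x_H11 + BIAS_H11
z = np.tanh(t)                            # 3. take tanh to compute z
y_pred = z * H11_y + BIAS_y               # 4. z*H11_y + BIAS_y
```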
Neural networks with one input, two neurons, and one target

$Z_m = \sigma(\alpha_{0m} + \alpha_{1m} X), \quad m = 1, 2$
$y = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2$

[Diagram: x → z1, z2 → y; figure: y (from -0.4 to 0.4) plotted against x (from -20 to 20) for the parameter settings $(\alpha_{01}, \alpha_{11}, \beta_1) = (0, 1, 1)$ and $(\alpha_{02}, \alpha_{12}, \beta_2) = (0, 0.5, 1)$ with $\beta_0 = 0$]
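A sketch of the two-neuron network as a sum of two scaled tanh curves; the coefficients below are illustrative, not the fitted ones:

```python
import numpy as np

def net2(x, a01, a11, a02, a12, b0, b1, b2):
    z1 = np.tanh(a01 + a11 * x)   # neuron 1
    z2 = np.tanh(a02 + a12 * x)   # neuron 2
    return b0 + b1 * z1 + b2 * z2

x = np.linspace(-20, 20, 201)
y = net2(x, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0, -1.0)  # illustrative coefficients
# Two tanh curves with different slopes, combined with opposite signs,
# produce a non-monotone response that a single neuron cannot represent.
```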
Output from proc Neural
- one input, two neurons, one target

Parameter Estimates

 N  Parameter    Estimate     Gradient Objective Function
 1  x_H11       -4.040296      -0.000006221
 2  x_H12       -4.755015       0.000008922
 3  BIAS_H11     0.449445      -0.000046905
 4  BIAS_H12     0.176599       0.000092579
 5  H11_y        0.767115       0.000009568
 6  H12_y       -1.781053       0.000026628
 7  BIAS_y      -0.014300      -0.000086070

Value of Objective Function = 0.0104173896
Absorbance records for ten samples of chopped meat

[Figure: absorbance (0.0 to 5.0) plotted against channel (1 to 100) for Sample_1 through Sample_10]

1 response variable (fat)
100 predictors (absorbance at 100 wavelengths or channels)
The predictors are strongly correlated with each other
Absorbance records for 215 samples of chopped meat

The target is poorly correlated with each individual predictor.

[Figure: fat (%) from 0 to 60 plotted against absorbance in channel 50 (0 to 6)]
Neural networks with a single target and many inputs
- the fat content and absorbance dataset

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, 2, 3$

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, z3 → output y]

$y = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3$

A total of (p+2)*3 + 1 parameters are estimated.
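Checking the parameter count for the 100-channel absorbance data:

```python
# Hidden layer: 3 neurons, each with p weights and a bias -> 3*(p+1).
# Output layer: one weight per neuron plus a bias         -> 3 + 1.
p, M = 100, 3
n_params = M * (p + 1) + (M + 1)
print(n_params)  # 307, i.e. (p+2)*3 + 1 with p = 100
```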
Neural networks with a single target and many inputs
- parameter estimates for a model with three neurons

   N  Parameter        Estimate     Gradient Objective Function
   .
   .
   .
 291  Channel90_H13   -0.534226      -0.243706
 292  Channel91_H13   -0.590502      -0.245327
 293  Channel92_H13   -0.482705      -0.246851
 294  Channel93_H13   -0.528643      -0.248195
 295  Channel94_H13   -0.333949      -0.249403
 296  Channel95_H13   -0.258637      -0.250348
 297  Channel96_H13    0.162351      -0.250953
 298  Channel97_H13    0.273746      -0.251128
 299  Channel98_H13    0.711445      -0.250887
 300  Channel99_H13    0.879623      -0.250285
 301  BIAS_H11        -2.144805       0.003961
 302  BIAS_H12         0.738894       0.095724
 303  BIAS_H13        -0.771776       0.587769
 304  H11_Fat         -1.504744       0.054906
 305  H12_Fat        -15.057170      -0.025459
 306  H13_Fat        -18.345040       0.006471
 307  BIAS_Fat        16.856496      -0.029187

Value of Objective Function = 0.3045279048

A total of 307 parameters.
Neural networks with a single target and many inputs
- output from a model with three neurons
Neural networks with a single target and many inputs
- output from models with 1 to 10 neurons

[Figure: root ASE (0 to 6) on training data (Root ASE) and test data (Test: Root ASE) plotted against the number of neurons (0 to 10); annotation: convergence problems]
Neural networks with multiple targets

Extract linear combinations of the inputs, $\alpha_{0m} + \alpha_m^T X$, $m = 1, \dots, M$, as derived features, and then model each target (response) as a linear function of a sigmoid function (activation function) of these features:

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]

$y_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \dots, K$
Neural networks for K-class classification

With the softmax activation function

$g_k(y) = \frac{\exp(y_k)}{\sum_{l=1}^{K} \exp(y_l)}$

and the deviance (cross-entropy) error function

$R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$

the neural network model is exactly a logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]
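A small sketch of the softmax and the deviance criterion in NumPy, assuming a one-hot target matrix Y and fitted class probabilities F:

```python
import numpy as np

def softmax(y):
    """g_k(y) = exp(y_k) / sum_l exp(y_l), shifted for numerical stability."""
    e = np.exp(y - y.max())
    return e / e.sum()

def deviance(Y, F):
    """R(theta) = -sum_i sum_k y_ik log f_k(x_i).
    Y is an N-by-K one-hot target matrix, F the fitted probabilities."""
    return -np.sum(Y * np.log(F))
```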
Neural networks for regression and K-class classification

For regression, we use the sum-of-squared errors as our measure of fit:

$R(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2$

For classification, we normally use the deviance (cross-entropy) error function

$R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$

with the softmax activation function

$g_k(y) = \frac{\exp(y_k)}{\sum_{l=1}^{K} \exp(y_l)}$

and the corresponding classifier is $G(x) = \arg\max_k f_k(x)$.

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]
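A sketch of the regression fit criterion and the argmax classifier, with F an N-by-K matrix of fitted values $f_k(x_i)$:

```python
import numpy as np

def sse(Y, F):
    """Regression: R(theta) = sum_i sum_k (y_ik - f_k(x_i))^2."""
    return np.sum((Y - F) ** 2)

def classify(F):
    """Classification rule G(x_i) = argmax_k f_k(x_i), one label per row."""
    return np.argmax(F, axis=1)
```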
Fitting neural networks

[Diagram: inputs x1, x2, ..., xp → hidden neurons z1, z2, ..., zM → outputs y1, ..., yK]

M(p+1) + K(M+1) parameters (weights)

We don't want the global minimizer of the deviance (cross-entropy) function, because that is typically an overfitted solution. Instead we use early stopping or a penalty term.
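A sketch of the penalty-term idea (weight decay): the objective becomes $R(\theta) + \lambda J(\theta)$ with $J(\theta)$ the sum of squared weights, an assumption about the penalty's form; $\lambda$ would be tuned on validation data:

```python
import numpy as np

def penalized_objective(R_theta, weights, lam):
    """R(theta) + lambda * sum of squared weights (weight decay).
    lam is a tuning parameter, typically chosen by validation."""
    return R_theta + lam * np.sum(weights ** 2)
```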
Data mining and statistical learning - lecture 11
Neural networks
Provide a joint framework for prediction and classification
Can describe both linear and nonlinear responses
Can accommodate multidimensional correlated inputs
Are normally over-fitted – validation is a must
Are difficult to interpret
Convergence problems are not uncommon
Some characteristics of different learning methods

Characteristic                                       Neural networks   Trees
Natural handling of data of “mixed” type                  poor          good
Handling of missing values                                poor          good
Robustness to outliers in input space                     poor          good
Insensitive to monotone transformations of inputs         poor          good
Computational scalability (large N)                       poor          good
Ability to deal with irrelevant inputs                    poor          good
Ability to extract linear combinations of features        good          poor