Page 1
8/13/2019 Perceptron Multilayer
http://slidepdf.com/reader/full/perceptron-multilayer 1/67
Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo
6.873/HST.951 Medical Decision Support, Spring 2005
Artificial Neural Networks
Lucila Ohno-Machado (with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)
Page 2
Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective
Page 3
Motivation
Images removed due to copyright reasons. (Captions: benign lesion, malignant lesion.)
Page 4
Motivation
• Human brain
  – Parallel processing
  – Distributed representation
  – Fault tolerant
  – Good generalization capability
• Mimic structure and processing in a computational model
Page 5
Biological Analogy
[Figure: a biological neuron with dendrites, axon, and synapses (+/−) mapped onto network nodes and weights]
Page 6
Perceptrons
Page 7
[Figure: a perceptron with an input layer and an output layer sorting the binary input patterns 00, 01, 10, 11 into sorted patterns]
Page 8
Activation functions
Page 9
Perceptrons (linear machines)
[Figure: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis); the Δ rule changes the weights to decrease the error = what we wanted − what we got]
Page 10
Abdominal Pain Perceptron
[Figure: input units (Male, Age, Temp, WBC, Pain Intensity, Pain Duration) with example values, connected by adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis)]
Page 11
AND
[Figure: perceptron with inputs x1, x2, weights w1, w2, threshold θ = 0.5, output y]

input → output:
x1 x2 | y
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1

f(x1w1 + x2w2) = y
f(0w1 + 0w2) = 0
f(0w1 + 1w2) = 0
f(1w1 + 0w2) = 0
f(1w1 + 1w2) = 1

f(a) = 1, for a > θ; 0, for a ≤ θ

some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
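The truth table above can be checked directly. A minimal sketch in Python (the threshold function and the weight pairs follow the slide; θ = 0.5):

```python
# Threshold unit: fires 1 when the weighted input sum exceeds theta.
def f(a, theta=0.5):
    return 1 if a > theta else 0

def perceptron(x1, x2, w1, w2, theta=0.5):
    return f(x1 * w1 + x2 * w2, theta)

# Every (w1, w2) pair from the slide computes AND with theta = 0.5.
for w1, w2 in [(0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20)]:
    outputs = [perceptron(x1, x2, w1, w2)
               for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
    print(w1, w2, outputs)  # AND: [0, 0, 0, 1]
```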
Page 12
Single layer neural network
Input to unit i: ai = measured value of variable i
Input to unit j: aj = Σ wij ai
Output of unit j: oj = 1 / (1 + e−(aj + θj))
Page 13
Single layer neural network
Input to unit i: ai = measured value of variable i
Input to unit j: aj = Σ wij ai
Output of unit j: oj = 1 / (1 + e−(aj + θj))
[Figure: sigmoid curves rising from 0 to 1 over the input range −15 to 15, growing steeper with increasing weight (Series 1–3)]
Page 14
Training: Minimize Error
[Figure: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis); the Δ rule changes the weights to decrease the error = what we wanted − what we got]
Page 15
Error Functions
• Mean Squared Error (for regression problems), where t is the target and o is the output: Σ(t − o)² / n
• Cross Entropy Error (for binary classification): −Σ [ t log o + (1 − t) log(1 − o) ], where oj = 1 / (1 + e−(aj + θj))
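Both error functions fit in a few lines. A sketch in Python (function and variable names are mine):

```python
import math

def mse(targets, outputs):
    # Mean Squared Error: sum of (t - o)^2 over the n examples.
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def cross_entropy(targets, outputs):
    # Cross-entropy for binary targets t in {0, 1} and outputs o in (0, 1).
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

# Good predictions give a small error under either measure.
print(mse([1, 0], [0.9, 0.2]))            # (0.01 + 0.04) / 2 = 0.025
print(cross_entropy([1, 0], [0.9, 0.2]))
```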
Page 16
Error function
• Convention: w := (w0, w), x := (1, x)
• w0 is the "bias"
• o = f(w • x)
• Class labels ti ∈ {+1, −1}
• Error measure: E = −Σ i miscl. ti (w • xi)
• How to minimize E?
[Figure: perceptron y with inputs x1, x2 and bias input 1, weights w1, w2, θ = 0.5]
Page 17
Minimizing the Error
[Figure: error surface over weight w, with the derivative guiding descent from the initial error at w initial to the final error at w trained, past a local minimum]
Page 18
Gradient descent
[Figure: error curve over the weights, showing a local minimum and the global minimum]
Page 19
Perceptron learning
• Find the minimum of E by iterating wk+1 = wk − η gradw E
• E = −Σ i miscl. ti (w • xi) ⇒ gradw E = −Σ i miscl. ti xi
• "online" version: pick a misclassified xi and update wk+1 = wk + η ti xi
Page 20
Perceptron learning
• Update rule: wk+1 = wk + η ti xi
• Theorem: perceptron learning converges for linearly separable sets
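The online update rule is easy to run. A sketch on a small linearly separable set (the data, learning rate, and iteration cap are my choices; labels follow the slides' ti ∈ {+1, −1} convention, with x augmented by a leading 1 for the bias w0):

```python
# Online perceptron learning with labels t in {+1, -1}.
def train(data, eta=0.1, max_epochs=100):
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            x = (1.0,) + x                                       # prepend bias input
            if t * sum(wi * xi for wi, xi in zip(w, x)) <= 0:    # misclassified
                w = [wi + eta * t * xi for wi, xi in zip(w, x)]  # w <- w + eta*t*x
                errors += 1
        if errors == 0:          # no misclassifications: converged (separable case)
            return w
    return w

# A linearly separable toy set: class +1 lies above the line x1 + x2 = 1.
data = [((0, 0), -1), ((0, 2), 1), ((2, 0), 1), ((0.2, 0.2), -1)]
w = train(data)
print(w)  # a separating weight vector (w0, w1, w2)
```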
Page 21
Gradient descent
• Simple function minimization algorithm
• Gradient is the vector of partial derivatives
• Negative gradient is the direction of steepest descent
[Figure: surface and contour plots of a function, with the negative gradient pointing downhill. Figures by MIT OCW.]
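The steepest-descent idea can be demonstrated on a simple bowl-shaped function (the function and step size are my choices, not from the slide):

```python
# Gradient descent on E(w1, w2) = w1^2 + 2*w2^2, minimized at (0, 0).
# The gradient is the vector of partial derivatives (2*w1, 4*w2);
# stepping against it moves downhill.
def grad(w):
    return [2 * w[0], 4 * w[1]]

w = [3.0, 2.0]    # starting point
eta = 0.1         # learning rate
for _ in range(100):
    g = grad(w)
    w = [wi - eta * gi for wi, gi in zip(w, g)]

print(w)  # close to the minimum at (0, 0)
```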
Page 22
Classification Model
[Figure: logistic regression drawn as a network: inputs Age = 34, Gender = 1, Stage = 4 (independent variables x1, x2, x3) connected by coefficients a, b, c to the output 0.6, the "Probability of being Alive" (dependent variable p, the prediction)]
Page 23
Terminology
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
• Iterative step = cycle, epoch
Page 24
XOR
[Figure: perceptron with inputs x1, x2, weights w1, w2, threshold θ = 0.5, output y]

input → output:
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0

f(x1w1 + x2w2) = y
f(0w1 + 0w2) = 0
f(0w1 + 1w2) = 1
f(1w1 + 0w2) = 1
f(1w1 + 1w2) = 0

f(a) = 1, for a > θ; 0, for a ≤ θ

some possible values for w1 and w2:
w1    w2
(none: no single pair of weights satisfies all four constraints)
Page 25
XOR (one hidden unit)
[Figure: inputs x1, x2 feed a hidden unit z through weights w1, w2, and the output y directly through weights w3, w4; z feeds y through w5; θ = 0.5 for both units]

input → output:
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0

y = f(w1, w2, w3, w4, w5), with f(a) = 1, for a > θ; 0, for a ≤ θ

a possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2)
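The weight set on the slide can be verified by running the two-unit network. A sketch (the wiring, with w1, w2 into the hidden unit z and w3, w4, w5 into y, follows my reading of the figure):

```python
def f(a, theta=0.5):
    return 1 if a > theta else 0

def xor_net(x1, x2, w=(0.3, 0.3, 1, 1, -2)):
    w1, w2, w3, w4, w5 = w
    z = f(x1 * w1 + x2 * w2)            # hidden unit: fires only for (1, 1)
    return f(x1 * w3 + x2 * w4 + z * w5)  # z inhibits the output for (1, 1)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))      # XOR: 0, 1, 1, 0
```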
Page 26
XOR (two hidden units)
[Figure: inputs x1, x2 feed two hidden units through weights w1–w4; the hidden units feed the output through weights w5, w6; θ = 0.5 for all units]

input → output:
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0

y = f(w1, w2, w3, w4, w5, w6), with f(a) = 1, for a > θ; 0, for a ≤ θ

a possible set of values for the ws:
(w1, w2, w3, w4, w5, w6) = (0.6, −0.6, −0.7, 0.8, 1, 1)
Page 27
From perceptrons to multilayer perceptrons
Why?
Page 28
Abdominal Pain Network
[Figure: input units (Male, Age, Temp, WBC, Pain Intensity, Pain Duration) with example values, connected by adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis)]
Page 29
Heart Attack Network
[Figure: input units (Pain Duration, Pain Intensity, ECG: ST elevation, Smoker, Age, Male) with example values, feeding a network whose output 0.8 is the "probability" of Myocardial Infarction]
Page 30
Multilayered Perceptrons
Perceptron:
  Input to unit i: ai = measured value of variable i
  Input to unit j: aj = Σ wij ai
  Output of unit j: oj = 1 / (1 + e−(aj + θj))
Multilayered perceptron (adds hidden units):
  Input to unit k: ak = Σ wjk oj
  Output of unit k: ok = 1 / (1 + e−(ak + θk))
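The two pairs of equations chain into a forward pass: measured inputs feed the hidden units, whose outputs feed the output units. A minimal sketch (the layer sizes, weights, and input values are illustrative, not from the slide):

```python
import math

def sigmoid(a, theta=0.0):
    # o = 1 / (1 + e^-(a + theta)), as on the slide.
    return 1.0 / (1.0 + math.exp(-(a + theta)))

def layer(inputs, weights, thetas):
    # For each unit: a_j = sum_i w_ij * a_i, then squash with the sigmoid.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)), th)
            for row, th in zip(weights, thetas)]

x = [0.5, 1.0, 0.2]                                # measured input values a_i
W_hidden = [[0.6, -0.4, 0.8], [0.3, 0.9, -0.7]]    # illustrative weights w_ij
W_out = [[1.2, -1.5]]                              # illustrative weights w_jk
h = layer(x, W_hidden, [0.0, 0.0])                 # hidden-unit outputs o_j
o = layer(h, W_out, [0.0])                         # output-unit output o_k
print(h, o)  # all values lie in (0, 1) because of the sigmoid
```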
Page 31
Neural Network Model
[Figure: inputs (Age = 34, Gender = 2, Stage = 4, the independent variables) connected through a first set of weights to a hidden layer, then through a second set of weights to the output 0.6, the "Probability of being Alive" (the prediction for the dependent variable)]
Page 32
“Combined logistic models”
[Figure: the same network drawn as logistic models: inputs (Age = 34, Gender = 2, Stage = 4), weights, hidden layer, weights, output 0.6 = "Probability of being Alive"]
Page 33
Page 34
[Figure: another "combined logistic models" view: inputs (Age = 34, Gender = 1, Stage = 4), weights, hidden layer, weights, output 0.6 = "Probability of being Alive"]
Page 35
Not really, no target for hidden units...
[Figure: the same network: inputs (Age = 34, Gender = 2, Stage = 4), weights, hidden layer, weights, output 0.6 = "Probability of being Alive"; the hidden layer has no observed targets]
Page 36
Hidden Units and Backpropagation
[Figure: input units feed hidden units, which feed output units; the error (what we wanted − what we got) is propagated backwards, applying the Δ rule first at the output layer and then at the hidden layer (backpropagation)]
Page 37
Multilayer perceptrons
• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons
Page 38
ECG Interpretation
[Figure: input units (R-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead, P-R interval) connected to output units (SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction)]
Page 39
Linear Separation
Separate n-dimensional space using one (n − 1)-dimensional hyperplane.
[Figure: left, a 2-D example with axes Cough/No cough and Headache/No headache, quadrants labeled No disease, Pneumonia, Flu, and Meningitis; right, 3-bit input patterns (000 through 111) separated into Treatment and No treatment]
Page 40
Another way of thinking about this…
• Have data set D = {(xi, ti)} drawn from probability distribution P(x, t)
• Model P(x, t) given samples D by an ANN with adjustable parameter w
• Statistics analogy
Page 41
Maximum Likelihood Estimation
• Maximize the likelihood of the data D
• Likelihood L = Π p(xi, ti) = Π p(ti|xi) p(xi)
• Minimize −log L = −Σ log p(ti|xi) − Σ log p(xi)
• Drop the second term: it does not depend on w
• Two cases: "regression" and classification
Page 42
Likelihood for classification (i.e., categorical target)
• For classification, the targets t are class labels
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = y(xi, w)^ti (1 − y(xi, w))^(1−ti) ⇒
  −log p(ti|xi) = −ti log y(xi, w) − (1 − ti) log(1 − y(xi, w))
• Minimizing −log L is equivalent to minimizing
  −Σ [ ti log y(xi, w) + (1 − ti) log(1 − y(xi, w)) ]
  (cross-entropy error)
Page 43
Likelihood for "regression" (i.e., continuous target)
• For regression, the targets t are real values
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = 1/Z exp(−(y(xi, w) − ti)² / (2σ²)) ⇒
  −log p(ti|xi) = 1/(2σ²) (y(xi, w) − ti)² + log Z
• y(xi, w) is the network output
• Minimizing −log L is equivalent to minimizing Σ (y(xi, w) − ti)² (sum-of-squares error)
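The equivalence can be checked numerically: for fixed σ, the log Z terms cancel when comparing two candidate networks, so differences in −log L are exactly differences in the sum-of-squares error scaled by 1/(2σ²). A sketch (the σ, prediction, and target values are mine):

```python
import math

sigma = 1.5
Z = math.sqrt(2 * math.pi) * sigma        # Gaussian normalizer

def neg_log_lik(preds, targets):
    # -sum log p(t|x) for p(t|x) = 1/Z exp(-(y - t)^2 / (2 sigma^2))
    return sum((y - t) ** 2 / (2 * sigma ** 2) + math.log(Z)
               for y, t in zip(preds, targets))

def sse(preds, targets):
    # Sum-of-squares error.
    return sum((y - t) ** 2 for y, t in zip(preds, targets))

t = [1.2, 1.8, 0.9]                        # targets
p1, p2 = [1.0, 2.0, 0.5], [1.1, 2.2, 0.4]  # two candidate prediction sets
d1 = neg_log_lik(p1, t) - neg_log_lik(p2, t)
d2 = (sse(p1, t) - sse(p2, t)) / (2 * sigma ** 2)
print(d1, d2)  # equal: minimizing one minimizes the other
```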
Page 44
Backpropagation algorithm
• Minimize the error function by gradient descent: wk+1 = wk − η gradw E
• Iterative gradient calculation by propagating error signals: the backpropagation algorithm
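A minimal backpropagation loop for a network with one sigmoidal hidden layer, trained online with a sum-of-squares error on the XOR data (the layer sizes, learning rate, epoch count, and random initialization are my choices; this is a sketch, not the slides' exact algorithm):

```python
import math, random

random.seed(0)
sig = lambda a: 1.0 / (1.0 + math.exp(-a))

# 2 inputs -> 2 sigmoidal hidden units -> 1 output; biases as extra weights.
Wh = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
Wo = [random.uniform(-1, 1) for _ in range(3)]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

def forward(x1, x2):
    x = [x1, x2, 1.0]                     # inputs plus bias input
    h = [sig(sum(w * v for w, v in zip(row, x))) for row in Wh]
    o = sig(sum(w * v for w, v in zip(Wo, h + [1.0])))
    return x, h, o

def total_error():
    return sum((forward(x1, x2)[2] - t) ** 2 for (x1, x2), t in data)

err_before = total_error()
eta = 0.5
for _ in range(5000):
    for (x1, x2), t in data:
        x, h, o = forward(x1, x2)
        d_o = (o - t) * o * (1 - o)       # output error signal (delta rule)
        # Propagate the error signal back through the output weights.
        d_h = [d_o * Wo[j] * h[j] * (1 - h[j]) for j in range(2)]
        Wo = [w - eta * d_o * v for w, v in zip(Wo, h + [1.0])]
        Wh = [[w - eta * d_h[j] * v for w, v in zip(Wh[j], x)]
              for j in range(2)]
err_after = total_error()
print(err_before, "->", err_after)
```

With enough epochs the network typically learns XOR, although gradient descent can also settle in a local minimum; what the sketch guarantees is only that training reduces the total error from its starting value.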
Page 45
Backpropagation algorithm
Problem: how to set the learning rate η?
Better: use more advanced minimization algorithms (second-order information)
[Figure: contour plots comparing minimization paths. Figures by MIT OCW.]
Page 46
Backpropagation algorithm
Classification: cross-entropy error. Regression: sum-of-squares error.
Page 47
Overfitting
[Figure: the real distribution vs. an overfitted model]
Page 48
Improving generalization
Problem: memorizing (x, t) combinations ("overtraining")
[Example: a table of memorized (x, t) pairs and a new input whose target "?" cannot be answered by memorization alone]
Page 49
Improving generalization
• Need a test set to judge performance
• Goal: represent the information in the data set, not the noise
• How to improve generalization?
  – Limit network topology
  – Early stopping
  – Weight decay
Page 50
Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:
Page 51
Limit network topology
Page 52
Early Stopping
[Figure: left, CHD vs. age with an overfitted model and the "real" model; right, training and holdout error vs. training cycles, with the holdout error rising once overfitting begins]
Page 53
Early stopping
• Idea: stop training when the information (but not the noise) has been modeled
• Need a hold-out set to determine when to stop training
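The hold-out logic can be sketched independently of any particular model: track the hold-out error each epoch, remember the best epoch, and stop once the error has not improved for a while (the `patience` window and the error trace below are illustrative, not from the slides):

```python
# Early stopping: scan per-epoch hold-out errors, remember the best epoch,
# and stop when there has been no improvement for `patience` epochs.
def early_stop(holdout_errors, patience=3):
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(holdout_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break                         # overtraining: stop here
    return best_epoch, best_err

# Illustrative trace: hold-out error falls, then rises as overtraining begins.
trace = [0.9, 0.7, 0.55, 0.5, 0.52, 0.58, 0.66, 0.75]
print(early_stop(trace))  # (3, 0.5): the weights at epoch 3 are kept
```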
Page 54
Overfitting
[Figure: total sum of squares (tss) vs. epochs for b = training set (steadily decreasing) and a = hold-out set (rising again after a minimum); stopping criterion: min(Δtss) on the hold-out set]
Page 55
Early stopping
Page 56
Weight decay
• Idea: control the smoothness of the network output by controlling the size of the weights
• Add the term α||w||² to the error function
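Adding α||w||² changes each gradient step: besides following the data gradient, every weight is also shrunk toward zero. A sketch of the modified update (the α and η values are my choices):

```python
# Gradient step with weight decay: E_total = E + alpha * ||w||^2,
# so grad E_total = grad E + 2 * alpha * w.
def decayed_step(w, grad_E, eta=0.1, alpha=0.01):
    return [wi - eta * (gi + 2 * alpha * wi) for wi, gi in zip(w, grad_E)]

w = [2.0, -3.0]
# With a zero data gradient, the decay term alone shrinks the weights:
for _ in range(200):
    w = decayed_step(w, [0.0, 0.0])
print(w)  # both weights pulled toward zero, signs preserved
```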
Page 57
Page 58
Bayesian perspective
• Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution wML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)
Page 59
Bayesian perspective
• Posterior = likelihood × prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
  – Sampling
  – Gaussian approximation
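For a single weight, the posterior can be approximated on a grid: evaluate prior × likelihood pointwise and normalize. A toy sketch (the Gaussian prior, the N(w, 1) likelihood, and the observations are my choices, chosen so the answer can be checked analytically):

```python
import math

# Grid approximation of p(w|D) ~ p(D|w) p(w) for one weight w.
grid = [i / 100 for i in range(-300, 301)]             # w in [-3, 3], step 0.01
prior = [math.exp(-w ** 2 / 2) for w in grid]          # standard Gaussian prior
data = [0.8, 1.1, 0.9]                                 # toy observations t ~ N(w, 1)
lik = [math.exp(-sum((t - w) ** 2 for t in data) / 2) for w in grid]

post = [p * l for p, l in zip(prior, lik)]             # prior * likelihood
norm = sum(post)
post = [p / norm for p in post]                        # normalize over the grid

w_map = grid[post.index(max(post))]                    # posterior mode (MAP)
print(w_map)  # analytically: sum(data) / (n + 1) = 2.8 / 4 = 0.7
```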
Page 60
Sampling from p(w|D)
[Figure: prior and likelihood]
Page 61
Sampling from p(w|D)
[Figure: prior × likelihood = posterior]
Page 62
Bayesian example for regression
Page 63
Bayesian example for classification
Model Features
Page 64
(with strong personal biases)

Model                  Modeling Effort  Examples Needed  Explanation
Rule-based Exp. Syst.  high             low              high?
Classification Trees   low              high+            "high"
Neural Nets, SVM       low              high             low
Regression Models      high             moderate         moderate
Learned Bayesian Nets  low              high+            high (beautiful when it works)
Page 65
Regression vs. Neural Networks
[Figure: left, a regression model with terms X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3 feeding Y; right, a neural network with inputs X1, X2, X3 whose hidden units play the role of terms such as "X1", "X2", "X1X3", "X1X2X3"]
(2³ − 1) possible combinations
Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
Page 66
Summary
• ANNs inspired by the functionality of the brain
• Nonlinear data model
• Trained by minimizing an error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions
Page 67
Some References

Introductory and Historical Textbooks
• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986. (H)
• Hertz JA, Krogh AS, Palmer RG. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.