Page 1
Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005. Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo
6.873/HST.951 Medical Decision Support Spring 2005
Artificial Neural Networks
Lucila Ohno-Machado (with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)
Page 2
Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective
Page 3
Motivation
Images removed due to copyright reasons.
benign lesion vs. malignant lesion
Page 4
Motivation
• Human brain
  – Parallel processing
  – Distributed representation
  – Fault tolerant
  – Good generalization capability
• Mimic structure and processing in computational model
Page 5
Biological Analogy
[Diagram: a biological neuron with dendrites, axon, and synapses (excitatory + and inhibitory -), compared to network nodes connected by weighted links (synapses = weights)]
Page 7
[Diagram: input patterns (00, 01, 10, 11) presented to an input layer are mapped by the network to sorted patterns at the output layer]
Page 8
Activation functions
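The activation-function plots on this slide are an image; as an illustrative sketch (not from the original slides), the two activations used later in the lecture, the hard threshold and the logistic sigmoid, can be written as:

```python
import math

def step(a, theta=0.5):
    """Hard threshold used in the perceptron examples: 1 if a > theta, else 0."""
    return 1 if a > theta else 0

def sigmoid(a, theta=0.0):
    """Logistic activation used by the sigmoidal units: 1 / (1 + e^-(a + theta))."""
    return 1.0 / (1.0 + math.exp(-(a + theta)))

if __name__ == "__main__":
    for a in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(a, step(a), round(sigmoid(a), 3))
```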
Page 9
Perceptrons (linear machines)

[Diagram: input units (Cough, Headache) connected by adjustable weights to output units (No disease, Pneumonia, Flu, Meningitis). Δ rule: change weights to decrease the error, where error = what we wanted - what we got]
Page 10
Abdominal Pain Perceptron

[Diagram: input units (Male, Age, Temp, WBC, Pain Intensity, Pain Duration) connected through adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Cholecystitis, Small Bowel Obstruction, Pancreatitis, Non-specific Pain); the example inputs produce the output pattern 0 1 0 0 0 0 0]
Page 11
AND

y = f(x1·w1 + x2·w2), θ = 0.5

input  output
 00      0
 01      0
 10      0
 11      1

f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 0
f(1·w1 + 0·w2) = 0
f(1·w1 + 1·w2) = 1

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

some possible values for w1 and w2:
w1     w2
0.20   0.35
0.20   0.40
0.25   0.30
0.40   0.20
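A minimal check (illustrative, not part of the original slides) that a threshold unit with θ = 0.5 and one plausible weight pair from the table above reproduces the AND truth table:

```python
def f(a, theta=0.5):
    # threshold activation from the slide: 1 for a > theta, 0 otherwise
    return 1 if a > theta else 0

w1, w2 = 0.20, 0.35                         # one of the (w1, w2) pairs listed above
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = f(x1 * w1 + x2 * w2)
    print(f"{x1}{x2} -> {y}")               # prints 0, 0, 0, 1: the AND function
```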
Page 12
Single layer neural network

Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))

[Diagram: input units i feeding output unit j through weights w_ij]
Page 13
Single layer neural network

Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))

[Plot: o_j as a function of a_j for increasing θ; the sigmoid ranges from 0 to 1 and shifts as θ increases]
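A small sketch (illustrative values, not from the slides) of one sigmoidal unit, showing the weighted sum a_j = Σ_i w_ij a_i and the shift produced by increasing θ:

```python
import math

def unit_output(inputs, weights, theta):
    """Single sigmoidal unit: a_j = sum_i w_ij * a_i, o_j = 1 / (1 + e^-(a_j + theta_j))."""
    a_j = sum(w * a for w, a in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(a_j + theta)))

# Hypothetical measured values and weights, just to show the effect of theta.
inputs = [1.0, 0.5, 2.0]
weights = [0.4, -0.3, 0.1]
for theta in (0.0, 2.0, 4.0):       # increasing theta pushes the output toward 1
    print(theta, round(unit_output(inputs, weights, theta), 3))
```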
Page 14
Training: Minimize Error

[Diagram: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). Δ rule: change weights to decrease the error, where error = what we wanted - what we got]
Page 15
Error Functions
• Mean Squared Error (for regression problems), where t is target, o is output:
  Σ (t - o)² / n
• Cross Entropy Error (for binary classification):
  - Σ [ t log o + (1 - t) log(1 - o) ]
  with o_j = 1 / (1 + e^-(a_j + θ_j))
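A small numeric sketch (toy targets and outputs, not from the slides) of the two error functions:

```python
import math

def mean_squared_error(targets, outputs):
    """Sum of (t - o)^2 over the examples, divided by n (regression)."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def cross_entropy_error(targets, outputs):
    """-sum[t log o + (1 - t) log(1 - o)] (binary classification)."""
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

t = [1, 0, 1, 0]
o = [0.9, 0.2, 0.6, 0.1]
print(round(mean_squared_error(t, o), 4), round(cross_entropy_error(t, o), 4))
```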
Page 16
Error function

• Convention: w := (w0, w), x := (1, x); w0 is the "bias"
• o = f(w · x)
• Class labels t_i ∈ {+1, -1}
• Error measure: E = - Σ_{i miscl.} t_i (w · x_i)
• How to minimize E?

[Diagram: threshold unit y = f(x1·w1 + x2·w2), θ = 0.5]
Page 17
Minimizing the Error

[Figure: error surface plotted against the weight w, showing the initial error at the initial w, the derivative driving the descent, a local minimum, and the final error at the trained w]
Page 18
Gradient descent
[Figure: error curve with a local minimum and the global minimum]
Page 19
Perceptron learning

• Find minimum of E by iterating: w_k+1 = w_k - η grad_w E
• E = - Σ_{i miscl.} t_i (w · x_i)  ⇒  grad_w E = - Σ_{i miscl.} t_i x_i
• "online" version: pick a misclassified x_i and update w_k+1 = w_k + η t_i x_i
Page 20
Perceptron learning
• Update rule: w_k+1 = w_k + η t_i x_i
• Theorem: perceptron learning converges for linearly separable sets
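An illustrative implementation of the online update rule (the data set and learning rate below are made up for the example; labels are in {+1, -1} and the bias input x_0 = 1 is included in x):

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, max_epochs=100):
    """Online perceptron learning: for each misclassified x_i, w <- w + eta * t_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(X, t):
            if t_i * (w @ x_i) <= 0:          # misclassified (or on the boundary)
                w = w + eta * t_i * x_i
                errors += 1
        if errors == 0:                        # converged (guaranteed if separable)
            break
    return w

# Toy linearly separable data: first column is the bias input x_0 = 1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, -1, -1, +1])                 # the AND function with +/-1 labels
print(perceptron_train(X, t))
```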
Page 21
Gradient descent

• Simple function minimization algorithm
• Gradient is vector of partial derivatives
• Negative gradient is direction of steepest descent

[3-D surface plots of an error function, illustrating descent along the negative gradient]

Figures by MIT OCW.
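A minimal gradient-descent sketch (the example function, starting point, and step size are illustrative, not from the slides):

```python
import numpy as np

def grad_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly move against the gradient: w <- w - eta * grad(w)."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= eta * grad(w)
    return w

# Example error surface E(w) = (w1 - 1)^2 + 2*(w2 + 3)^2, minimum at (1, -3).
grad_E = lambda w: np.array([2 * (w[0] - 1), 4 * (w[1] + 3)])
print(grad_descent(grad_E, w0=[5.0, 5.0]))     # approaches [1, -3]
```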
Page 22
Classification Model

[Diagram: inputs Age = 34, Gender = 1, Stage = 4, each multiplied by a coefficient (weight) and summed (Σ) to give the output 0.6, the "Probability of being Alive". Independent variables x1, x2, x3; coefficients a, b, c; dependent variable p; the output is the prediction]
Page 23
Terminology
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
• Iterative step = cycle, epoch
Page 24
XOR

y = f(x1·w1 + x2·w2), θ = 0.5

input  output
 00      0
 01      1
 10      1
 11      0

f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

some possible values for w1 and w2:
w1     w2
(none exist: no single threshold unit can compute XOR)
Page 25
XOR

input  output
 00      0
 01      1
 10      1
 11      0

[Diagram: hidden unit z = f(x1·w1 + x2·w2); output y = f(x1·w3 + x2·w4 + z·w5); θ = 0.5 for both units]

y = f(w1, w2, w3, w4, w5)

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

a possible set of values for the weights (w1, w2, w3, w4, w5): (0.3, 0.3, 1, 1, -2)
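A quick check (the wiring is inferred from the diagram: z = f(x1·w1 + x2·w2), y = f(x1·w3 + x2·w4 + z·w5), θ = 0.5 for both units) that the weight set (0.3, 0.3, 1, 1, -2) reproduces XOR:

```python
def f(a, theta=0.5):
    return 1 if a > theta else 0

w1, w2, w3, w4, w5 = 0.3, 0.3, 1, 1, -2        # the set of values from the slide
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = f(x1 * w1 + x2 * w2)                   # hidden unit
    y = f(x1 * w3 + x2 * w4 + z * w5)          # output unit
    print(f"{x1}{x2} -> {y}")                  # prints 0, 1, 1, 0: XOR
```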
Page 26
XOR

input  output
 00      0
 01      1
 10      1
 11      0

[Diagram: two hidden units and one output unit, connected by weights w1 ... w6; θ = 0.5 for all units]

y = f(w1, w2, w3, w4, w5, w6)

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

a possible set of values for the weights (w1, w2, w3, w4, w5, w6): (0.6, -0.6, -0.7, 0.8, 1, 1)
Page 27
From perceptrons to multilayer perceptrons
Why?
Page 28
Abdominal Pain

[Diagram: input units (Male, Age, Temp, WBC, Pain Intensity, Pain Duration) connected through adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Cholecystitis, Small Bowel Obstruction, Pancreatitis, Non-specific Pain); the example inputs produce the output pattern 0 1 0 0 0 0 0]
Page 29
Heart Attack Network

[Diagram: input units (Male, Age, Smoker, ECG: ST elevation, Pain Duration, Pain Intensity) feeding a Myocardial Infarction output unit; the output value 0.8 is the "Probability" of MI]
Page 30
Multilayered Perceptrons

Input to unit i: a_i = measured value of variable i
Input to unit j (hidden): a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))
Input to unit k (output): a_k = Σ_j w_jk o_j
Output of unit k: o_k = 1 / (1 + e^-(a_k + θ_k))

[Diagram: input units, hidden units, and output units; the perceptron has no hidden layer, the multilayered perceptron does]
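A sketch of the forward pass defined on this slide (the weight matrices, biases, and layer sizes below are placeholders, not values from the lecture):

```python
import numpy as np

def sigmoid(a, theta):
    return 1.0 / (1.0 + np.exp(-(a + theta)))

def mlp_forward(a_in, W_ih, theta_h, W_ho, theta_o):
    """Forward pass of a multilayered perceptron.
    hidden: a_j = sum_i w_ij a_i,  o_j = 1 / (1 + e^-(a_j + theta_j))
    output: a_k = sum_j w_jk o_j,  o_k = 1 / (1 + e^-(a_k + theta_k))"""
    o_h = sigmoid(a_in @ W_ih, theta_h)
    o_k = sigmoid(o_h @ W_ho, theta_o)
    return o_k

rng = np.random.default_rng(0)
a_in = np.array([34.0, 2.0, 4.0])                  # e.g. Age, Gender, Stage
W_ih = rng.normal(size=(3, 2))                     # input-to-hidden weights
W_ho = rng.normal(size=(2, 1))                     # hidden-to-output weights
print(mlp_forward(a_in, W_ih, np.zeros(2), W_ho, np.zeros(1)))
```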
Page 31
Neural Network Model

[Diagram: inputs Age = 34, Gender = 2, Stage = 4, connected by weights (.6, .5, .8, .2, .1, .3, .7, .2) to hidden-layer units (Σ), which connect through weights (.4, .2) to the output 0.6, the "Probability of being Alive". Independent variables, weights, hidden layer, weights, dependent variable; the output is the prediction]
Page 32
"Combined logistic models"

[Diagram: the same network, with one path highlighted: the inputs (Age = 34, Gender = 2, Stage = 4) feed a single hidden unit (Σ), which contributes to the output 0.6, the "Probability of being Alive". Independent variables, weights, hidden layer, weights, dependent variable; prediction]
Page 33
[Diagram: the same network with another hidden unit and its weights highlighted; inputs Age = 34, Gender = 2, Stage = 4; output 0.6, the "Probability of being Alive". Independent variables, weights, hidden layer, weights, dependent variable; prediction]
Page 34
[Diagram: the same network with yet another hidden unit and its weights highlighted; inputs Age = 34, Gender, Stage; output 0.6, the "Probability of being Alive". Independent variables, weights, hidden layer, weights, dependent variable; prediction]
Page 35
Not really, no target for hidden units...

[Diagram: the full network again (inputs Age = 34, Gender = 2, Stage = 4; hidden layer of Σ units; output 0.6, the "Probability of being Alive"); unlike the output unit, the hidden units have no target values to train against]
Page 36
Hidden Units and Backpropagation
[Diagram: error = what we wanted - what we got is computed at the output units; the Δ rule updates the hidden-to-output weights, and the error signal is backpropagated through the hidden units to update the input-to-hidden weights]
Page 37
Multilayer perceptrons
• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons
Page 38
ECG Interpretation
[Diagram: input units (R-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead, P-R interval) connected to output units (SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction)]
Page 39
Linear Separation

Separate n-dimensional space using an (n - 1)-dimensional space (a hyperplane).

[Diagram: with two binary inputs, a line separates the plane into Meningitis (headache, no cough), Flu (headache, cough), No disease (no headache, no cough), and Pneumonia (no headache, cough). With three binary inputs (patterns 000 ... 111), a plane separates the corners of the cube into Treatment and No treatment]
Page 40
Another way of thinking about this…
• Have data set D = {(x_i, t_i)} drawn from probability distribution P(x,t)
• Model P(x,t) given samples D by an ANN with adjustable parameter w
• Statistics analogy:
Page 41
Maximum Likelihood Estimation
• Maximize likelihood of data D
• Likelihood L = Π_i p(x_i, t_i) = Π_i p(t_i|x_i) p(x_i)
• Minimize -log L = -Σ_i log p(t_i|x_i) - Σ_i log p(x_i)
• Drop second term: it does not depend on w
• Two cases: "regression" and classification
Page 42
Likelihood for classification (i.e. categorical target)

• For classification, targets t are class labels
• Minimize -Σ_i log p(t_i|x_i)
• p(t_i|x_i) = y(x_i,w)^t_i (1 - y(x_i,w))^(1-t_i)  ⇒
  -log p(t_i|x_i) = -t_i log y(x_i,w) - (1 - t_i) log(1 - y(x_i,w))
• Minimizing -log L is equivalent to minimizing
  -Σ_i [ t_i log y(x_i,w) + (1 - t_i) log(1 - y(x_i,w)) ]  (cross-entropy error)
Page 43
Likelihood for "regression" (i.e. continuous target)

• For regression, targets t are real values
• Minimize -Σ_i log p(t_i|x_i)
• p(t_i|x_i) = (1/Z) exp(-(y(x_i,w) - t_i)² / (2σ²))  ⇒
  -log p(t_i|x_i) = (y(x_i,w) - t_i)² / (2σ²) + log Z
• y(x_i,w) is the network output
• Minimizing -log L is equivalent to minimizing Σ_i (y(x_i,w) - t_i)²  (sum-of-squares error)
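A small numeric check (toy outputs and targets, σ assumed equal to 1) that the Gaussian negative log-likelihood and the sum-of-squares error differ only by a scale factor and an additive constant, so they share the same minimizer:

```python
import math

sigma = 1.0
Z = math.sqrt(2 * math.pi) * sigma              # Gaussian normalizing constant
y = [0.8, 0.3, 0.5]                             # network outputs y(x_i, w)
t = [1.0, 0.0, 0.4]                             # real-valued targets t_i

neg_log_L = sum((yi - ti) ** 2 / (2 * sigma ** 2) + math.log(Z)
                for yi, ti in zip(y, t))
sum_of_squares = sum((yi - ti) ** 2 for yi, ti in zip(y, t))

# Both quantities agree once the constant and scale are accounted for.
print(neg_log_L, sum_of_squares / (2 * sigma ** 2) + len(y) * math.log(Z))
```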
Page 44
Backpropagation algorithm
• Minimizing error function by gradient descent: w_k+1 = w_k - η grad_w E
• Iterative gradient calculation by propagating error signals
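A compact backpropagation sketch (illustrative, not the lecture's code: one sigmoidal hidden layer, sum-of-squares error, fixed learning rate η, trained here on the XOR patterns just to show the propagated error signals):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)      # input-to-hidden weights, biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)      # hidden-to-output weights, biases

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)    # XOR targets
eta = 0.5

for epoch in range(4001):
    h = sigmoid(X @ W1 + b1)                       # forward pass: hidden outputs
    y = sigmoid(h @ W2 + b2)                       # forward pass: network output
    if epoch % 1000 == 0:
        # sum-of-squares error; typically decreases as training proceeds
        print(epoch, round(float(((T - y) ** 2).sum()), 4))
    delta_out = (y - T) * y * (1 - y)              # error signal at the output unit
    delta_hid = (delta_out @ W2.T) * h * (1 - h)   # error backpropagated to hidden units
    # gradient-descent update: w <- w - eta * grad_w E
    W2 -= eta * h.T @ delta_out; b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid; b1 -= eta * delta_hid.sum(axis=0)
```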
Page 45
Backpropagation algorithm
Problem: how to set the learning rate η?

Better: use more advanced minimization algorithms (second-order information)

[Contour plots of an error surface comparing the descent paths]

Figures by MIT OCW.
Page 46
Backpropagation algorithm
Classification: cross-entropy error
Regression: sum-of-squares error
Page 47
Overfitting
[Figure: the real distribution, a fitted model, and an overfitted model]
Page 48
Improving generalization
Problem: memorizing (x,t) combinations (“overtraining”)
x1     x2     t
0.7    0.5    0
0.9   -0.5    1
1     -1.2   -0.2
0.3    0.6    1
0.5   -0.2    ?
Page 49
Improving generalization
• Need test set to judge performance
• Goal: represent information in data set, not noise
• How to improve generalization?
  – Limit network topology
  – Early stopping
  – Weight decay
Page 50
Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:
Page 51
Limit network topology
Page 52
Early Stopping

[Figures: an overfitted model vs. the "real" model fitted to CHD-vs-age data, and a plot of training error and holdout error over training cycles]
Page 53
Early stopping
• Idea: stop training when information (but not noise) is modeled
• Need hold-out set to determine when to stop training
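A sketch of the hold-out logic (the model, its update step, and the error function are placeholders passed in as callables; only the stopping rule follows the slide):

```python
def train_with_early_stopping(model, update, error, train_set, holdout_set,
                              patience=5, max_cycles=1000):
    """Stop training when the hold-out error has not improved for `patience` cycles."""
    best_err, best_model, since_best = float("inf"), model, 0
    for cycle in range(max_cycles):
        model = update(model, train_set)          # one training cycle (epoch)
        err = error(model, holdout_set)           # error on the hold-out set
        if err < best_err:
            best_err, best_model, since_best = err, model, 0
        else:
            since_best += 1
            if since_best >= patience:            # hold-out error keeps rising: stop
                break
    return best_model
```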
Page 54
Overfitting
[Plot: total sum of squares (tss) vs. epochs for the training set (b) and the hold-out set (a); the training-set tss keeps decreasing while the hold-out tss levels off and rises, indicating an overfitted model. Stopping criterion based on min(Δ tss) on the hold-out set]
Page 56
Weight decay
• Idea: control smoothness of network output by controlling size of weights
• Add a term α||w||² to the error function
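A sketch of the penalized error and its gradient (the helper name, the base error, and α below are placeholders for illustration):

```python
import numpy as np

def error_with_decay(w, base_error, base_grad, alpha=0.01):
    """Weight decay: E'(w) = E(w) + alpha * ||w||^2, grad E'(w) = grad E(w) + 2 * alpha * w.
    Keeping the weights small keeps the network output smooth."""
    return base_error(w) + alpha * np.dot(w, w), base_grad(w) + 2 * alpha * w

# Toy base error E(w) = ||w - 1||^2, just to exercise the function.
E = lambda w: float(np.sum((w - 1) ** 2))
gE = lambda w: 2 * (w - 1)
print(error_with_decay(np.array([2.0, 0.5]), E, gE))
```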
Page 58
Bayesian perspective
• Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution w_ML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)
Page 59
Bayesian perspective
• Posterior = likelihood * prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
  – Sampling
  – Gaussian approximation
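A grid sketch (a hypothetical one-weight logistic model with made-up data and prior width, not from the lecture) of forming the weight posterior p(w|D) ∝ p(D|w) p(w):

```python
import numpy as np

# Toy 1-parameter model y(x, w) = sigmoid(w * x) on made-up binary data.
x = np.array([-2.0, -1.0, 0.5, 1.5, 2.0])
t = np.array([0, 0, 1, 1, 1])

w_grid = np.linspace(-5, 5, 401)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Likelihood p(D|w) for each w on the grid (Bernoulli, as in the classification case).
y = sigmoid(np.outer(w_grid, x))                     # shape (len(w_grid), len(x))
likelihood = np.prod(y ** t * (1 - y) ** (1 - t), axis=1)

prior = np.exp(-w_grid ** 2 / (2 * 1.0 ** 2))        # Gaussian prior, sigma = 1
posterior = likelihood * prior                       # unnormalized p(w|D)
posterior /= posterior.sum() * (w_grid[1] - w_grid[0])   # normalize on the grid

print(w_grid[np.argmax(posterior)])                  # weight with highest posterior
```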
Page 60
Sampling from p(w|D)
prior likelihood
Page 61
Sampling from p(w|D)
prior * likelihood = posterior
Page 62
Bayesian example for regression
Page 63
Bayesian example for classification
Page 64
Model Features (with strong personal biases)

Model                    Modeling Effort   Examples Needed   Explanat.
Rule-based Exp. Syst.    high              low               high?
Classification Trees     low               high+             "high"
Neural Nets, SVM         low               high              low
Regression Models        high              moderate          moderate
Learned Bayesian Nets    low               high+             high (beautiful when it works)
Page 65
Regression vs. Neural Networks
Regression: Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
Possible terms: X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3: (2³ - 1) possible combinations

[Diagram: a neural network with inputs X1, X2, X3 whose hidden units can act as interaction detectors (e.g. "X1", "X2", "X1X3", "X1X2X3") feeding the output Y]
Page 66
Summary
• ANNs inspired by functionality of brain
• Nonlinear data model
• Trained by minimizing an error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions
Page 67
Some References
Introductory and Historical Textbooks
• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986. (H)
• Hertz JA, Palmer RG, Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.