Neural Networks and Logistic Regression
Lucila Ohno-Machado
Decision Systems Group
Brigham and Women’s Hospital
Department of Radiology
[Figures: example pattern-recognition tasks: a STOP sign and a neural net for coronary disease]
Outline
• Examples, neuroscience analogy
• Perceptrons, MLPs: How they work
• How the networks learn from examples
• Backpropagation algorithm
• Learning parameters
• Overfitting
Examples in Medical Pattern Recognition
Diagnosis
• Protein Structure Prediction
• Diagnosis of Giant Cell Arteritis
• Diagnosis of Myocardial Infarction
• Interpretation of ECGs
• Interpretation of PET scans, Chest X-rays
Prognosis
• Prognosis of Breast Cancer
• Outcomes After Spinal Cord Injury
Myocardial Infarction Network
[Figure: a network with inputs Male, Age, Smoker, ECG: ST Elevation, Pain Intensity, and Pain Duration, producing 0.8 as the “probability” of MI]
Abdominal Pain Perceptron
[Figure: inputs (Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1) connected by adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis) with outputs (0, 1, 0, 0, 0, 0, 0)]
Biological Analogy
[Figure: a biological neuron with dendrites, axon, and synapses, beside a network of nodes arranged in an input layer and an output layer; the excitatory (+) and inhibitory (-) synapses correspond to the weights]
[Figure: unsorted binary input patterns (00, 01, 10, 11) on the left are sorted into groups by the network on the right]
Perceptrons
[Figure: input units (Cough, Headache) connected by adjustable weights to output units (No disease, Pneumonia, Flu, Meningitis)]
error = what we wanted - what we got
Δ rule: change the weights to decrease the error
Perceptrons
Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))
AND
input   output
00      0
01      0
10      0
11      1

f(x1w1 + x2w2) = y
f(0·w1 + 0·w2) = 0;  f(0·w1 + 1·w2) = 0;  f(1·w1 + 0·w2) = 0;  f(1·w1 + 1·w2) = 1
θ = 0.5;  f(a) = 1 for a > θ, 0 otherwise

some possible values for w1 and w2:
w1      w2
0.20    0.35
0.20    0.40
0.25    0.30
0.40    0.20
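A quick check, in Python, that each weight pair in the table implements AND under the threshold rule above:

```python
def step(a, theta=0.5):
    # f(a) = 1 if a > theta, else 0
    return 1 if a > theta else 0

def perceptron_and(x1, x2, w1, w2):
    return step(x1 * w1 + x2 * w2)

# Verify that every weight pair from the table computes AND
for w1, w2 in [(0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20)]:
    outputs = [perceptron_and(x1, x2, w1, w2)
               for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
    assert outputs == [0, 0, 0, 1], (w1, w2)
print("all weight pairs implement AND")
```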
XOR
input   output
00      0
01      1
10      1
11      0

f(x1w1 + x2w2) = y
f(0·w1 + 0·w2) = 0;  f(0·w1 + 1·w2) = 1;  f(1·w1 + 0·w2) = 1;  f(1·w1 + 1·w2) = 0
θ = 0.5;  f(a) = 1 for a > θ, 0 otherwise

some possible values for w1 and w2:
w1      w2
(none: no single pair of weights satisfies all four constraints, because XOR is not linearly separable)
XOR
input   output
00      0
01      1
10      1
11      0

θ = 0.5 for all units;  f(a) = 1 for a > θ, 0 otherwise
A hidden unit z is added between the inputs and the output:
y = f(w1, w2, w3, w4, w5)
a possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, -2)
(one wiring consistent with these values: z = f(0.3·x1 + 0.3·x2) fires only on input 11, and y = f(1·x1 + 1·x2 - 2·z))
XOR
input   output
00      0
01      1
10      1
11      0

θ = 0.5 for all units;  f(a) = 1 for a > θ, 0 otherwise
Two hidden units now sit between the inputs (weights w1..w4) and the output (weights w5, w6):
y = f(w1, w2, w3, w4, w5, w6)
a possible set of values for the ws:
(w1, w2, w3, w4, w5, w6) = (0.6, -0.6, -0.7, 0.8, 1, 1)
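A sketch verifying that these weights reproduce XOR; the assignment of w1..w4 to the two hidden units below is one wiring that works with the values on the slide:

```python
def step(a, theta=0.5):
    # f(a) = 1 if a > theta, else 0 (theta = 0.5 for all units)
    return 1 if a > theta else 0

def xor_net(x1, x2):
    # Weights from the slide: (0.6, -0.6, -0.7, 0.8, 1, 1)
    h1 = step(0.6 * x1 - 0.6 * x2)   # fires only for input (1, 0)
    h2 = step(-0.7 * x1 + 0.8 * x2)  # fires only for input (0, 1)
    return step(1 * h1 + 1 * h2)     # effectively an OR of the hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0
```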
Linear Separation
[Figure: patients plotted by symptoms (Cough/No cough × Headache/No headache) fall into quadrants labeled No disease, Flu, Meningitis, and Pneumonia; a straight line separates Treatment from No treatment. The same idea extends from 2 inputs (patterns 00, 01, 10, 11) to 3 inputs (patterns 000 through 111)]
Linear discriminant: Y = a(X) + b
Logistic regression: Y = 1 / (1 + e^-(a(X) + b))
Abdominal Pain
[Figure: the abdominal pain network again: inputs (Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1) connected by adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis) with outputs (0, 1, 0, 0, 0, 0, 0)]
Multilayered Perceptrons
Input to unit i: a_i = measured value of variable i
Input to hidden unit j: a_j = Σ_i w_ij a_i
Output of hidden unit j: o_j = 1 / (1 + e^-(a_j + θ_j))
Input to output unit k: a_k = Σ_j w_jk o_j
Output of output unit k: o_k = 1 / (1 + e^-(a_k + θ_k))
(a perceptron has only input and output units; a multilayered perceptron adds hidden units between them)
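A minimal forward pass through one hidden layer, following the equations above; all weights here are illustrative, and the thresholds θ are set to zero for brevity:

```python
import math

def sigmoid(a, theta=0.0):
    # o = 1 / (1 + e^-(a + theta)), as in the slide equations
    return 1.0 / (1.0 + math.exp(-(a + theta)))

def forward(x, w_hidden, w_out):
    # Hidden layer: a_j = sum_i w_ij a_i, then o_j = sigmoid(a_j)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    # Output layer: a_k = sum_j w_jk o_j, then o_k = sigmoid(a_k)
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden))) for ws in w_out]

# Illustrative: 3 inputs -> 2 hidden units -> 1 output
x = [34, 1, 4]                                  # e.g., age, gender, stage
w_hidden = [[0.01, 0.5, -0.2], [-0.03, 0.1, 0.4]]
w_out = [[0.7, -1.2]]
print(forward(x, w_hidden, w_out))
```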
Regression vs. Neural Networks
Regression with interaction terms:
Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
With three variables there are (2³ - 1) = 7 possible terms: X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3, all of which must be specified in advance.
[Figure: a network with inputs X1, X2, X3 whose hidden units play the role of terms such as “X1”, “X2”, “X1X3”, and “X1X2X3” without being specified beforehand]
Logistic Regression
• One independent variable:
f(x) = 1 / (1 + e^-(ax + cte))
• Two independent variables:
f(x) = 1 / (1 + e^-(ax1 + bx2 + cte))
[Figure: f(x) plotted against x: an S-shaped curve rising from 0 to 1]
Logistic function
p = 1 / (1 + e^-(ax + cte))
log (p/(1-p)) = ax + cte
[Figure: log(p/(1-p)) plotted against x is linear with slope a]
Logistic function
p = 1 / (1 + e^-(ax + cte))
log (p/(1-p)) = ax + cte   (linear in x)
e^a is the odds ratio for a 1-unit increase in x
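A small numeric check of the odds-ratio interpretation; the coefficients a and cte below are illustrative:

```python
import math

def logistic(x, a, cte):
    # p = 1 / (1 + e^-(ax + cte)); log(p / (1 - p)) = ax + cte
    return 1.0 / (1.0 + math.exp(-(a * x + cte)))

def odds(p):
    return p / (1 - p)

a, cte = 0.8, -2.0                   # made-up coefficients
p1, p2 = logistic(3, a, cte), logistic(4, a, cte)
# The odds ratio for a 1-unit increase in x is e^a, regardless of x
print(odds(p2) / odds(p1), math.exp(a))  # both ~2.2255
```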
Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
• Cycles = epochs
Logistic Regression Model
Inputs (independent variables): x1, x2, x3
Coefficients: a, b, c
Output (dependent variable): p
Prediction
[Figure: inputs Age = 34, Gender = 1, Stage = 4, with coefficients .5, .8, .4, produce output 0.6: the “probability of being alive”]
The output is the sum of inputs × weights
[Figure: the same model; the output node first computes the weighted sum of the inputs (Age = 34, Gender = 1, Stage = 4, with coefficients .5, .8, .4)]
Logistic function
[Figure: the same model; inputs Age = 34, Gender = 1, Stage = 4, coefficients .5, .8, .4, and output 0.6, the “probability of being alive”]
p = 1 / (1 + e^-(Σ(inputs × coefficients) + cte))
Activation Functions...
• Linear
• Threshold or step function
• Logistic, sigmoid, “squash”
• Hyperbolic tangent
Neural Network Model
Inputs (independent variables): Age = 34, Gender = 2, Stage = 4
[Figure: the inputs feed a hidden layer through weights (.6, .5, .8, .2, .1, .3, .7, .2); the hidden units feed the output through weights (.4, .2); the output (dependent variable) is 0.6, the “probability of being alive”]
“Combined logistic models”
[Figure: the same network with one input-to-hidden-to-output path highlighted (weights .6, .5, .8, then .1, .7); each hidden unit acts like a logistic model on the inputs]
[Figure: the same network with a second path highlighted (weights .5, .8, .2, then .3, .2); output 0.6]
[Figure: the same network with a third path highlighted; output 0.6]
Not really, no target for hidden units...
[Figure: the same network; the hidden units have no observed target values, so their weights cannot be estimated as separate logistic regressions]
Perceptrons
[Figure: input units (Cough, Headache) connected by adjustable weights to output units (No disease, Pneumonia, Flu, Meningitis)]
error = what we wanted - what we got
Δ rule: change the weights to decrease the error
Hidden Units and Backpropagation
[Figure: input units → hidden units → output units]
error = what we wanted - what we got
Δ rule at the output layer, then the Δ rule again at the hidden layer: backpropagation
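A toy backpropagation loop, for illustration only: one hidden layer of sigmoid units trained on XOR with squared error and the Δ rule applied layer by layer. The initialization, learning rate, and epoch count are arbitrary choices, and a different random seed may land in a local minimum (see the gradient-descent slide):

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_xor(epochs=10000, lr=0.5, seed=0):
    random.seed(seed)
    # w[i][j]: input i -> hidden j (row 2 is the bias); v[j]: hidden j -> output
    w = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
    v = [random.uniform(-1, 1) for _ in range(3)]
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    for _ in range(epochs):
        for x, t in data:
            xb = x + [1]                      # inputs plus bias
            h = [sigmoid(sum(xb[i] * w[i][j] for i in range(3)))
                 for j in range(2)]
            hb = h + [1]                      # hidden outputs plus bias
            o = sigmoid(sum(hb[j] * v[j] for j in range(3)))
            delta_o = (t - o) * o * (1 - o)   # Δ rule at the output unit
            # Backpropagation: hidden deltas come through the output weights
            delta_h = [delta_o * v[j] * h[j] * (1 - h[j]) for j in range(2)]
            for j in range(3):
                v[j] += lr * delta_o * hb[j]
            for i in range(3):
                for j in range(2):
                    w[i][j] += lr * delta_h[j] * xb[i]
    return w, v

def predict(x, w, v):
    hb = [sigmoid(sum((x + [1])[i] * w[i][j] for i in range(3)))
          for j in range(2)] + [1]
    return sigmoid(sum(hb[j] * v[j] for j in range(3)))

w, v = train_xor()
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(predict(x, w, v), 2))
```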
Error Functions
• Mean Squared Error (for most problems):
E = Σ (t - o)² / n
• Cross-Entropy Error (for dichotomous or binary outcomes):
E = -Σ [t ln(o) + (1 - t) ln(1 - o)]
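Both error functions in a few lines of Python:

```python
import math

def mse(targets, outputs):
    # Mean Squared Error: sum((t - o)^2) / n
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def cross_entropy(targets, outputs):
    # Cross-entropy for binary outcomes: -sum(t ln o + (1 - t) ln(1 - o))
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

t, o = [1, 0, 1], [0.9, 0.2, 0.6]
print(mse(t, o), cross_entropy(t, o))
```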
Minimizing the Error
[Figure: error surface over the weights; starting at w_initial with the initial error, a negative derivative means a positive change in w, and training ends at w_trained, a local minimum, with the final error]
Numerical Methods
[Figure: finding the roots of a(x³) + b(x²) + c(x) + d = 0 numerically: a first pair of guessed roots brackets a sign change (+/-), then a second, narrower pair, and so on]
Gradient descent
[Figure: error plotted against a weight: gradient descent can settle in a local minimum instead of the global minimum]
Overfitting
[Figure: an overfitted model follows the sampled points exactly; the real distribution is smoother]
Overfitting
[Figure: total sum of squares (tss) vs. epochs for b = training set and a = test set; tss_b keeps falling while tss_a reaches a minimum, min(tss_a), and then rises; the stopping criterion is that minimum, beyond which the model is overfitted]
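A sketch of this stopping criterion; the model object and its methods (one_epoch, tss, get_weights, set_weights) are hypothetical placeholders, not an actual API:

```python
def train_with_early_stopping(model, train_set, holdout_set,
                              max_epochs=1000, patience=20):
    """Keep the weights at the minimum of the holdout error (tss_a),
    since training error (tss_b) keeps falling as the model overfits."""
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.one_epoch(train_set)       # one pass of weight updates
        err = model.tss(holdout_set)     # error on held-out cases
        if err < best_err:
            best_err, best_weights, bad_epochs = err, model.get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # holdout error has turned upward
                break
    model.set_weights(best_weights)      # restore the best model
    return model
```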
Overfitting in Neural Nets
[Figure: CHD vs. age: the overfitted model wiggles through the training data, while the “real” model is smooth]
[Figure: error vs. cycles: training error keeps decreasing, while holdout error turns upward where the model becomes overfitted]
Parameter Estimation
Logistic regression
• Models “just” one function
• Maximum likelihood
• Fast
• Optimizations: Fisher scoring, Newton-Raphson
Neural network
• Models several functions
• Backpropagation
• Iterative, slow
• Optimizations: Quickprop, scaled conjugate gradient descent, adaptive learning rate
What do you want? Insight versus prediction
Insight into the model
• Explain the importance of each variable
• Assess model fit to the existing data
Accurate predictions
• Make a good estimate of the “real” probability
• Assess model prediction in new data
Model Selection: Finding influential variables
Logistic
• Forward
• Backward
• Stepwise
• Arbitrary
• All combinations
• Relative risk
Neural Network
• Weight elimination
• Automatic Relevance Determination
• “Relevance”
Regression Diagnostics: Finding influential observations
Logistic
• Analysis of residuals
• Cook’s distance
• Deviance
• Difference in coefficients when a case is left out
Neural Network
• Ad hoc methods
How accurate are predictions?
• Construct training and test sets or bootstrap to assess “unbiased” error
• Assess
– Discrimination: how well the model “separates” alive and dead
– Calibration: how close the estimates are to the “real” probability
“Unbiased” Evaluation: Training and Test Sets
• Training set is used to build the model (may include holdout set to control for overfitting)
• Test set left aside for evaluation purposes
• Ideal: yet another validation data set, from a different source, to test whether the model generalizes to other settings
Small sets: Cross-validation
• Several training and test set pairs are created so that the union of all test sets corresponds exactly to the original set
• Results from the different models are pooled and overall performance is estimated
• “Leave-n-out”
• Jackknife
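A minimal sketch of building such training/test pairs so that the test sets together cover the original set exactly:

```python
import random

def cross_validation_splits(cases, k=5, seed=0):
    """Create k training/test pairs whose test sets partition the data
    ("leave-n-out"). Yields (train, test) pairs."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    folds = [cases[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [c for j, f in enumerate(folds) if j != i for c in f]
        yield train, test

# Usage: pool the test-set predictions from each fold to estimate performance
for train, test in cross_validation_splits(range(10), k=5):
    print(len(train), len(test))
```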
ECG Interpretation
[Figure: a network mapping ECG features (R-R interval, P-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead) to interpretations (SV tachycardia, ventricular tachycardia, LV hypertrophy, RV hypertrophy, myocardial infarction)]
Thyroid Diseases
[Figure: screening network: patient data (TSH, T4U, clinical finding 1, ...) → hidden layer (5 or 10 units) → partial diagnoses (Normal, Hyperthyroidism, Hypothyroidism, Other conditions); patients who will be evaluated further go on to a second network]
[Figure: second network: the same patient data plus additional inputs (T3, TT4, TBG) → hidden layer (5 or 10 units) → final diagnoses (Normal; hypothyroidism: primary, compensated, or secondary; Other conditions)]
Time Series
[Figure: input units (independent variables) Xn and Xn+1 feed hidden units through weights (the estimated parameters); the output unit (dependent variable) is Y = Xn+2]
Time Series
[Figure: the window then slides one step: inputs Xn+1 and Xn+2 predict the output Y = Xn+3]
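A sketch of how these sliding windows are built from a series; the series values below are illustrative:

```python
def make_windows(series, lag=2):
    # Inputs are (Xn, Xn+1); the target is Xn+2, sliding one step at a time
    return [(series[i:i + lag], series[i + lag])
            for i in range(len(series) - lag)]

series = [1, 2, 4, 7, 11, 16]
for inputs, target in make_windows(series):
    print(inputs, "->", target)
# ([1, 2], 4), ([2, 4], 7), ([4, 7], 11), ([7, 11], 16)
```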
Evaluation
[Figure: randomization of cases into Training, Test, and Validation sets, used for model development, model enhancement, and model evaluation, respectively]
[Table: true class (A, B) vs. assigned class (“A”, “B”); the diagonal cells are correct (OK); the two off-diagonal cells are the Type I and Type II errors]
Evaluation: Area Under ROCs
[Figure: ROC curves (sensitivity vs. 1 - specificity) for the data models and the neural network; the areas under the ROCs are compared]
ROC Analysis: Variations
• ROC curve
• Area under the ROC
• Slope and intercept
• Confidence interval
• Wilcoxon statistic
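The area under the ROC can be computed directly as the Wilcoxon (Mann-Whitney) statistic; a minimal sketch with made-up labels and scores:

```python
def auc_wilcoxon(labels, scores):
    """AUC as the Wilcoxon/Mann-Whitney statistic: the probability that a
    randomly chosen positive case scores higher than a randomly chosen
    negative case (ties count as 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.6, 0.3, 0.1]
print(auc_wilcoxon(labels, scores))  # 8.5 / 9 ~ 0.944
```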
Expert Systems and Neural Nets
[Figure: models placed along a “# Examples” axis: expert systems need few examples, neural networks need many]
Model Comparison (personal biases)

Model                   Modeling Effort   Examples Needed   Explanation Provided
Rule-based Exp. Syst.   high              low               high
Bayesian Nets           high              low               moderate
Classification Trees    low               high              “high”
Neural Nets             low               high              low
Regression Models       high              moderate          moderate
Conclusion
Neural Networks are
• mathematical models that resemble nonlinear regression models, but are also useful to model nonlinearly separable spaces
• “knowledge acquisition tools” that learn from examples
• Neural Networks in Medicine are used for:
– pattern recognition (images, diseases, etc.)
– exploratory analysis, control
– predictive models
Conclusion
• There is no definitive indication for either logistic regression or neural networks
• Try both and select the best
• Make an unbiased evaluation
• Compare the models statistically
Some References
Introductory Textbooks
• Rumelhart DE; McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986.
• Hertz JA; Palmer RG; Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Reggia JA. Neural computation in medicine. Artificial Intelligence in Medicine, 1993 Apr, 5(2):143–57.
• Miller AS; Blott BH; Hames TK. Review of neural network applications in medical imaging and signal processing. Medical and Biological Engineering and Computing, 1992 Sep, 30(5):449–64.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.