Neural Networks and Logistic Regression
Lucila Ohno-Machado
Decision Systems Group
Brigham and Women’s Hospital
Department of Radiology
[Figures: example pattern-recognition tasks: a STOP sign and a neural net for coronary disease]
Outline
• Examples, neuroscience analogy
• Perceptrons, MLPs: How they work
• How the networks learn from examples
• Backpropagation algorithm
• Learning parameters
• Overfitting
Examples in Medical Pattern Recognition
Diagnosis
• Protein Structure Prediction
• Diagnosis of Giant Cell Arteritis
• Diagnosis of Myocardial Infarction
• Interpretation of ECGs
• Interpretation of PET scans, Chest X-rays
Prognosis
• Prognosis of Breast Cancer
• Outcomes After Spinal Cord Injury
Myocardial Infarction Network
[Figure: a network with inputs Male, Age, Smoker, ECG: ST Elevation, Pain Intensity, and Pain Duration, producing 0.8 as the “probability” of MI]
Abdominal Pain Perceptron
[Figure: inputs (Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1) connected by adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis) with outputs (0, 1, 0, 0, 0, 0, 0)]
Biological Analogy
[Figure: a biological neuron with dendrites, axon, and synapses, beside a network of nodes arranged in an input layer and an output layer; the excitatory (+) and inhibitory (-) synapses correspond to the weights]
[Figure: unsorted binary input patterns (00, 01, 10, 11) on the left are sorted into groups by the network on the right]
Perceptrons
[Figure: input units (Cough, Headache) connected by adjustable weights to output units (No disease, Pneumonia, Flu, Meningitis)]
error = what we wanted - what we got
Δ rule: change the weights to decrease the error
Perceptrons
Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))
AND
input   output
00      0
01      0
10      0
11      1

f(x1w1 + x2w2) = y
f(0·w1 + 0·w2) = 0;  f(0·w1 + 1·w2) = 0;  f(1·w1 + 0·w2) = 0;  f(1·w1 + 1·w2) = 1
θ = 0.5;  f(a) = 1 for a > θ, 0 otherwise

some possible values for w1 and w2:
w1      w2
0.20    0.35
0.20    0.40
0.25    0.30
0.40    0.20
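A quick check, in Python, that each weight pair in the table implements AND under the threshold rule above:

```python
def step(a, theta=0.5):
    # f(a) = 1 if a > theta, else 0
    return 1 if a > theta else 0

def perceptron_and(x1, x2, w1, w2):
    return step(x1 * w1 + x2 * w2)

# Verify that every weight pair from the table computes AND
for w1, w2 in [(0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20)]:
    outputs = [perceptron_and(x1, x2, w1, w2)
               for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
    assert outputs == [0, 0, 0, 1], (w1, w2)
print("all weight pairs implement AND")
```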
XOR
input   output
00      0
01      1
10      1
11      0

f(x1w1 + x2w2) = y
f(0·w1 + 0·w2) = 0;  f(0·w1 + 1·w2) = 1;  f(1·w1 + 0·w2) = 1;  f(1·w1 + 1·w2) = 0
θ = 0.5;  f(a) = 1 for a > θ, 0 otherwise

some possible values for w1 and w2:
w1      w2
(none: no single pair of weights satisfies all four constraints, because XOR is not linearly separable)
XOR
input   output
00      0
01      1
10      1
11      0

θ = 0.5 for all units;  f(a) = 1 for a > θ, 0 otherwise
A hidden unit z is added between the inputs and the output:
y = f(w1, w2, w3, w4, w5)
a possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, -2)
(one wiring consistent with these values: z = f(0.3·x1 + 0.3·x2) fires only on input 11, and y = f(1·x1 + 1·x2 - 2·z))
XOR
input   output
00      0
01      1
10      1
11      0

θ = 0.5 for all units;  f(a) = 1 for a > θ, 0 otherwise
Two hidden units now sit between the inputs (weights w1..w4) and the output (weights w5, w6):
y = f(w1, w2, w3, w4, w5, w6)
a possible set of values for the ws:
(w1, w2, w3, w4, w5, w6) = (0.6, -0.6, -0.7, 0.8, 1, 1)
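A sketch verifying that these weights reproduce XOR; the assignment of w1..w4 to the two hidden units below is one wiring that works with the values on the slide:

```python
def step(a, theta=0.5):
    # f(a) = 1 if a > theta, else 0 (theta = 0.5 for all units)
    return 1 if a > theta else 0

def xor_net(x1, x2):
    # Weights from the slide: (0.6, -0.6, -0.7, 0.8, 1, 1)
    h1 = step(0.6 * x1 - 0.6 * x2)   # fires only for input (1, 0)
    h2 = step(-0.7 * x1 + 0.8 * x2)  # fires only for input (0, 1)
    return step(1 * h1 + 1 * h2)     # effectively an OR of the hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0
```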
Linear Separation
[Figure: patients plotted by symptoms (Cough/No cough × Headache/No headache) fall into quadrants labeled No disease, Flu, Meningitis, and Pneumonia; a straight line separates Treatment from No treatment. The same idea extends from 2 inputs (patterns 00, 01, 10, 11) to 3 inputs (patterns 000 through 111)]
Linear discriminant: Y = a(X) + b
Logistic regression: Y = 1 / (1 + e^-(a(X) + b))
Abdominal Pain
[Figure: the abdominal pain network again: inputs (Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1) connected by adjustable weights to output units (Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis) with outputs (0, 1, 0, 0, 0, 0, 0)]
Multilayered Perceptrons
Input to unit i: a_i = measured value of variable i
Input to hidden unit j: a_j = Σ_i w_ij a_i
Output of hidden unit j: o_j = 1 / (1 + e^-(a_j + θ_j))
Input to output unit k: a_k = Σ_j w_jk o_j
Output of output unit k: o_k = 1 / (1 + e^-(a_k + θ_k))
(a perceptron has only input and output units; a multilayered perceptron adds hidden units between them)
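A minimal forward pass through one hidden layer, following the equations above; all weights here are illustrative, and the thresholds θ are set to zero for brevity:

```python
import math

def sigmoid(a, theta=0.0):
    # o = 1 / (1 + e^-(a + theta)), as in the slide equations
    return 1.0 / (1.0 + math.exp(-(a + theta)))

def forward(x, w_hidden, w_out):
    # Hidden layer: a_j = sum_i w_ij a_i, then o_j = sigmoid(a_j)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    # Output layer: a_k = sum_j w_jk o_j, then o_k = sigmoid(a_k)
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden))) for ws in w_out]

# Illustrative: 3 inputs -> 2 hidden units -> 1 output
x = [34, 1, 4]                                  # e.g., age, gender, stage
w_hidden = [[0.01, 0.5, -0.2], [-0.03, 0.1, 0.4]]
w_out = [[0.7, -1.2]]
print(forward(x, w_hidden, w_out))
```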
Regression vs. Neural Networks
Regression with interaction terms:
Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
With three variables there are (2³ - 1) = 7 possible terms: X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3, all of which must be specified in advance.
[Figure: a network with inputs X1, X2, X3 whose hidden units play the role of terms such as “X1”, “X2”, “X1X3”, and “X1X2X3” without being specified beforehand]
Logistic Regression
• One independent variable:
f(x) = 1 / (1 + e^-(ax + cte))
• Two independent variables:
f(x) = 1 / (1 + e^-(ax1 + bx2 + cte))
[Figure: f(x) plotted against x: an S-shaped curve rising from 0 to 1]
Logistic function
p = 1 / (1 + e^-(ax + cte))
log (p/(1-p)) = ax + cte
[Figure: log(p/(1-p)) plotted against x is linear with slope a]
Logistic function
p = 1 / (1 + e^-(ax + cte))
log (p/(1-p)) = ax + cte   (linear in x)
e^a is the odds ratio for a 1-unit increase in x
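A small numeric check of the odds-ratio interpretation; the coefficients a and cte below are illustrative:

```python
import math

def logistic(x, a, cte):
    # p = 1 / (1 + e^-(ax + cte)); log(p / (1 - p)) = ax + cte
    return 1.0 / (1.0 + math.exp(-(a * x + cte)))

def odds(p):
    return p / (1 - p)

a, cte = 0.8, -2.0                   # made-up coefficients
p1, p2 = logistic(3, a, cte), logistic(4, a, cte)
# The odds ratio for a 1-unit increase in x is e^a, regardless of x
print(odds(p2) / odds(p1), math.exp(a))  # both ~2.2255
```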
Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
• Cycles = epochs
Logistic Regression Model
Inputs (independent variables): x1, x2, x3
Coefficients: a, b, c
Output (dependent variable): p
Prediction
[Figure: inputs Age = 34, Gender = 1, Stage = 4, with coefficients .5, .8, .4, produce output 0.6: the “probability of being alive”]
The output is the sum of inputs × weights
[Figure: the same model; the output node first computes the weighted sum of the inputs (Age = 34, Gender = 1, Stage = 4, with coefficients .5, .8, .4)]
Logistic function
[Figure: the same model; inputs Age = 34, Gender = 1, Stage = 4, coefficients .5, .8, .4, and output 0.6, the “probability of being alive”]
p = 1 / (1 + e^-(Σ(inputs × coefficients) + cte))
Activation Functions...
• Linear
• Threshold or step function
• Logistic, sigmoid, “squash”
• Hyperbolic tangent
Neural Network Model
Inputs (independent variables): Age = 34, Gender = 2, Stage = 4
[Figure: the inputs feed a hidden layer through weights (.6, .5, .8, .2, .1, .3, .7, .2); the hidden units feed the output through weights (.4, .2); the output (dependent variable) is 0.6, the “probability of being alive”]
“Combined logistic models”
[Figure: the same network with one input-to-hidden-to-output path highlighted (weights .6, .5, .8, then .1, .7); each hidden unit acts like a logistic model on the inputs]
[Figure: the same network with a second path highlighted (weights .5, .8, .2, then .3, .2); output 0.6]
[Figure: the same network with a third path highlighted; output 0.6]
Not really, no target for hidden units...
[Figure: the same network; the hidden units have no observed target values, so their weights cannot be estimated as separate logistic regressions]
Perceptrons
[Figure: input units (Cough, Headache) connected by adjustable weights to output units (No disease, Pneumonia, Flu, Meningitis)]
error = what we wanted - what we got
Δ rule: change the weights to decrease the error
Hidden Units and Backpropagation
[Figure: input units → hidden units → output units]
error = what we wanted - what we got
Δ rule at the output layer, then the Δ rule again at the hidden layer: backpropagation
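A toy backpropagation loop, for illustration only: one hidden layer of sigmoid units trained on XOR with squared error and the Δ rule applied layer by layer. The initialization, learning rate, and epoch count are arbitrary choices, and a different random seed may land in a local minimum (see the gradient-descent slide):

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_xor(epochs=10000, lr=0.5, seed=0):
    random.seed(seed)
    # w[i][j]: input i -> hidden j (row 2 is the bias); v[j]: hidden j -> output
    w = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
    v = [random.uniform(-1, 1) for _ in range(3)]
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    for _ in range(epochs):
        for x, t in data:
            xb = x + [1]                      # inputs plus bias
            h = [sigmoid(sum(xb[i] * w[i][j] for i in range(3)))
                 for j in range(2)]
            hb = h + [1]                      # hidden outputs plus bias
            o = sigmoid(sum(hb[j] * v[j] for j in range(3)))
            delta_o = (t - o) * o * (1 - o)   # Δ rule at the output unit
            # Backpropagation: hidden deltas come through the output weights
            delta_h = [delta_o * v[j] * h[j] * (1 - h[j]) for j in range(2)]
            for j in range(3):
                v[j] += lr * delta_o * hb[j]
            for i in range(3):
                for j in range(2):
                    w[i][j] += lr * delta_h[j] * xb[i]
    return w, v

def predict(x, w, v):
    hb = [sigmoid(sum((x + [1])[i] * w[i][j] for i in range(3)))
          for j in range(2)] + [1]
    return sigmoid(sum(hb[j] * v[j] for j in range(3)))

w, v = train_xor()
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(predict(x, w, v), 2))
```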
Error Functions
• Mean Squared Error (for most problems):
E = Σ (t - o)² / n
• Cross-Entropy Error (for dichotomous or binary outcomes):
E = -Σ [t ln(o) + (1 - t) ln(1 - o)]
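Both error functions in a few lines of Python:

```python
import math

def mse(targets, outputs):
    # Mean Squared Error: sum((t - o)^2) / n
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def cross_entropy(targets, outputs):
    # Cross-entropy for binary outcomes: -sum(t ln o + (1 - t) ln(1 - o))
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

t, o = [1, 0, 1], [0.9, 0.2, 0.6]
print(mse(t, o), cross_entropy(t, o))
```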
Minimizing the Error
[Figure: error surface over the weights; starting at w_initial with the initial error, a negative derivative means a positive change in w, and training ends at w_trained, a local minimum, with the final error]
Numerical Methods
[Figure: finding the roots of a(x³) + b(x²) + c(x) + d = 0 numerically: a first pair of guessed roots brackets a sign change (+/-), then a second, narrower pair, and so on]
Gradient descent
[Figure: error plotted against a weight: gradient descent can settle in a local minimum instead of the global minimum]
Overfitting
[Figure: an overfitted model follows the sampled points exactly; the real distribution is smoother]
Overfitting
[Figure: total sum of squares (tss) vs. epochs for b = training set and a = test set; tss_b keeps falling while tss_a reaches a minimum, min(tss_a), and then rises; the stopping criterion is that minimum, beyond which the model is overfitted]
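A sketch of this stopping criterion; the model object and its methods (one_epoch, tss, get_weights, set_weights) are hypothetical placeholders, not an actual API:

```python
def train_with_early_stopping(model, train_set, holdout_set,
                              max_epochs=1000, patience=20):
    """Keep the weights at the minimum of the holdout error (tss_a),
    since training error (tss_b) keeps falling as the model overfits."""
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.one_epoch(train_set)       # one pass of weight updates
        err = model.tss(holdout_set)     # error on held-out cases
        if err < best_err:
            best_err, best_weights, bad_epochs = err, model.get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # holdout error has turned upward
                break
    model.set_weights(best_weights)      # restore the best model
    return model
```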
Overfitting in Neural Nets
[Figure: CHD vs. age: the overfitted model wiggles through the training data, while the “real” model is smooth]
[Figure: error vs. cycles: training error keeps decreasing, while holdout error turns upward where the model becomes overfitted]
Parameter Estimation
Logistic regression
• Models “just” one function
• Maximum likelihood
• Fast
• Optimizations: Fisher scoring, Newton-Raphson
Neural network
• Models several functions
• Backpropagation
• Iterative, slow
• Optimizations: Quickprop, scaled conjugate gradient descent, adaptive learning rate
What do you want? Insight versus prediction
Insight into the model
• Explain the importance of each variable
• Assess model fit to the existing data
Accurate predictions
• Make a good estimate of the “real” probability
• Assess model prediction in new data
Model Selection: Finding influential variables
Logistic
• Forward
• Backward
• Stepwise
• Arbitrary
• All combinations
• Relative risk
Neural Network
• Weight elimination
• Automatic Relevance Determination
• “Relevance”
Regression Diagnostics: Finding influential observations
Logistic
• Analysis of residuals
• Cook’s distance
• Deviance
• Difference in coefficients when a case is left out
Neural Network
• Ad hoc methods
How accurate are predictions?
• Construct training and test sets or bootstrap to assess “unbiased” error
• Assess
– Discrimination: how well the model “separates” alive and dead
– Calibration: how close the estimates are to the “real” probability
“Unbiased” Evaluation: Training and Test Sets
• Training set is used to build the model (may include holdout set to control for overfitting)
• Test set left aside for evaluation purposes
• Ideal: yet another validation data set, from a different source, to test whether the model generalizes to other settings
Small sets: Cross-validation
• Several training and test set pairs are created so that the union of all test sets corresponds exactly to the original set
• Results from the different models are pooled and overall performance is estimated
• “Leave-n-out”
• Jackknife
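A minimal sketch of building such training/test pairs so that the test sets together cover the original set exactly:

```python
import random

def cross_validation_splits(cases, k=5, seed=0):
    """Create k training/test pairs whose test sets partition the data
    ("leave-n-out"). Yields (train, test) pairs."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    folds = [cases[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [c for j, f in enumerate(folds) if j != i for c in f]
        yield train, test

# Usage: pool the test-set predictions from each fold to estimate performance
for train, test in cross_validation_splits(range(10), k=5):
    print(len(train), len(test))
```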
ECG Interpretation
[Figure: a network mapping ECG features (R-R interval, P-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead) to interpretations (SV tachycardia, ventricular tachycardia, LV hypertrophy, RV hypertrophy, myocardial infarction)]
Thyroid Diseases
[Figure: screening network: patient data (TSH, T4U, clinical finding 1, ...) → hidden layer (5 or 10 units) → partial diagnoses (Normal, Hyperthyroidism, Hypothyroidism, Other conditions); patients who will be evaluated further go on to a second network]
[Figure: second network: the same patient data plus additional inputs (T3, TT4, TBG) → hidden layer (5 or 10 units) → final diagnoses (Normal; hypothyroidism: primary, compensated, or secondary; Other conditions)]
Time Series
[Figure: input units (independent variables) Xn and Xn+1 feed hidden units through weights (the estimated parameters); the output unit (dependent variable) is Y = Xn+2]
Time Series
[Figure: the window then slides one step: inputs Xn+1 and Xn+2 predict the output Y = Xn+3]
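A sketch of how these sliding windows are built from a series; the series values below are illustrative:

```python
def make_windows(series, lag=2):
    # Inputs are (Xn, Xn+1); the target is Xn+2, sliding one step at a time
    return [(series[i:i + lag], series[i + lag])
            for i in range(len(series) - lag)]

series = [1, 2, 4, 7, 11, 16]
for inputs, target in make_windows(series):
    print(inputs, "->", target)
# ([1, 2], 4), ([2, 4], 7), ([4, 7], 11), ([7, 11], 16)
```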
Evaluation
[Figure: randomization of cases into Training, Test, and Validation sets, used for model development, model enhancement, and model evaluation, respectively]
[Table: true class (A, B) vs. assigned class (“A”, “B”); the diagonal cells are correct (OK); the two off-diagonal cells are the Type I and Type II errors]
Evaluation: Area Under ROCs
[Figure: ROC curves (sensitivity vs. 1 - specificity) for the data models and the neural network; the areas under the ROCs are compared]
ROC Analysis: Variations
• ROC curve
• Area under the ROC
• Slope and intercept
• Confidence interval
• Wilcoxon statistic
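The area under the ROC can be computed directly as the Wilcoxon (Mann-Whitney) statistic; a minimal sketch with made-up labels and scores:

```python
def auc_wilcoxon(labels, scores):
    """AUC as the Wilcoxon/Mann-Whitney statistic: the probability that a
    randomly chosen positive case scores higher than a randomly chosen
    negative case (ties count as 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.6, 0.3, 0.1]
print(auc_wilcoxon(labels, scores))  # 8.5 / 9 ~ 0.944
```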
Expert Systems and Neural Nets
[Figure: models placed along a “# Examples” axis: expert systems need few examples, neural networks need many]
Model Comparison (personal biases)

Model                   Modeling Effort   Examples Needed   Explanation Provided
Rule-based Exp. Syst.   high              low               high
Bayesian Nets           high              low               moderate
Classification Trees    low               high              “high”
Neural Nets             low               high              low
Regression Models       high              moderate          moderate
Conclusion
Neural Networks are
• mathematical models that resemble nonlinear regression models, but are also useful to model nonlinearly separable spaces
• “knowledge acquisition tools” that learn from examples
• Neural Networks in Medicine are used for:
– pattern recognition (images, diseases, etc.)
– exploratory analysis, control
– predictive models
Conclusion
• There is no definitive indication for either logistic regression or neural networks
• Try both and select the best
• Make an unbiased evaluation
• Compare the models statistically
Some References
Introductory Textbooks
• Rumelhart DE; McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986.
• Hertz JA; Palmer RG; Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Reggia JA. Neural computation in medicine. Artificial Intelligence in Medicine, 1993 Apr, 5(2):143–57.
• Miller AS; Blott BH; Hames TK. Review of neural network applications in medical imaging and signal processing. Medical and Biological Engineering and Computing, 1992 Sep, 30(5):449–64.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.