Supervised Learning for Image Segmentation
Raphael Meier
06.10.2016
References
A. Ng, Machine Learning lecture, Stanford University.
A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Foundations and Trends in Computer Graphics and Vision, 2012.
A. Criminisi, Decision Forests for Computer Vision and Medical Image Analysis, Tutorial, http://research.microsoft.com/en-us/projects/decisionforests/.
S. J. D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
D. Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.
Part I – Supervised Learning
[Figure: learning workflow – training data and expert knowledge (manual segmentations) are used in a training phase to learn a general rule H(x), which is applied in a testing phase for fully automatic segmentation.]
Brain Tumor Segmentation
Brain tumors: Glioma (Glioblastoma)
Clinical guidelines:
- Bidimensional measures (RANO/AvaGlio)
- Desired: tumor volumetry (manual segmentation, takes hours)
Future: fully automatic segmentation
Bidimensional measures fail (Reuter et al., 2014)
Motivation (Menze et al., 2014)
The Learning Problem
[Figure: training data are used to learn a hypothesis H(x), which maps new data x to a prediction y.]
Training set: S
Input: x
Output: y
Hypothesis: H(x): x → y
Application: Image segmentation
Aim: Partition an image into disjoint, semantically meaningful image regions
- Can be seen as a learning (classification) problem
Input: image(s) consisting of voxels
Output: regions, indicated by voxel-wise labels (usually integers: 1, 2, 3, ...)
Image representation – Features
Definition: Measurable attributes of image data
Can be either hand-crafted or automatically learned (e.g. via a Restricted Boltzmann Machine)
Taxonomy of Learning Scenarios
Defined by nature of training data
Unsupervised Learning: Given a set of unlabeled feature vectors
- S_u = {x^(i) : i = 1, ..., m}
Supervised Learning: Given a set of fully labeled feature vectors
- S_ℓ = {(x^(i), y^(i)) : i = 1, ..., m}
Semi-supervised Learning: Given a set of partially labeled feature vectors
- S = S_u ∪ S_ℓ
Taxonomy of Learning Problems
Defined by the learning scenario and nature of the output
Unsupervised Learning:
- Given S_u, find interesting structure (clustering, density estimation)
- Given S_u with x ∈ R^n, find a mapping H(x) = x̃ with x̃ ∈ R^ñ such that ñ ≪ n (dimensionality reduction, manifold learning)
Supervised Learning:
- Given S_ℓ, find H(x): x → y with x ∈ R^n and y ∈ {1, 2, 3, ...} (classification)
- Given S_ℓ, find H(x): x → y with x ∈ R^n and y ∈ R (regression)
Image segmentation via Classification
[Figure: the learning workflow applied to segmentation – training images with expert segmentations are used to learn the general rule H(x), which then segments new images fully automatically.]
Training and Testing phase
[Figure: training phase – expert knowledge (manual segmentations) and training data yield the general rule H(x); testing phase – H(x) is applied to new images for fully automatic segmentation.]
Learning (Training) Algorithm
Aim: Construct a hypothesis H which relates a feature vector x to its most probable label y.
Output: Hypothesis (model) parametrized by a set of parameters θ
Assume we know p(y|x, θ); then the mapping H(x): x → y can be realized via the MAP rule:
y = arg max_y p(y|x, θ).   (1)
How do we obtain p(y|x, θ)?
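As a minimal illustration (a sketch with made-up posterior values, not from the lecture), the MAP rule amounts to an argmax over the class posteriors:

```python
import numpy as np

# Hypothetical posterior p(y | x, theta) over three classes for one sample
# (values are made up for illustration).
posterior = np.array([0.15, 0.70, 0.15])

# MAP rule (Eq. 1): pick the label with the highest posterior probability.
y_hat = np.argmax(posterior)
print(y_hat)  # -> 1
```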
Generative vs. Discriminative Models
Bayes rule:
p(y|x, θ) = p(x, y|θ) / p(x|θ) = p(x|y, θ) p(y|θ) / p(x|θ).   (2)
Generative models: Estimate p(y|x) via the likelihood p(x|y) and the prior distribution p(y).
Discriminative models: Estimate the posterior distribution p(y|x) directly
- Can also be non-probabilistic (e.g. Support Vector Machines)
Logistic regression – A Classic (1940s)
Used extensively, 1415 hits on PubMed
Supervised learning
Solves binary classification problems (y ∈ {0, 1})
Discriminative approach, we model p(y|x) directly:
- p(y = 1|x; θ) = h_θ(x) and p(y = 0|x; θ) = 1 − h_θ(x) (Bernoulli)
- More compactly:
p(y|x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)   (3)
⟺ y|x, θ ∼ Bernoulli(h_θ(x))   (4)
Linear model, hence: h_θ(x) = g(θ^T x)
Logistic regression – Sigmoid Function
Logistic (sigmoid) function:
g(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))   (5)
As before, z = θ^T x.
Motivation: Restrict the values of our hypothesis to lie between zero and one (probability)
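A minimal sketch of the logistic hypothesis h_θ(x) = g(θ^T x) in Python; the function names and toy values are mine, not part of the lecture:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(x, theta):
    # Logistic hypothesis: p(y = 1 | x; theta) = g(theta^T x)
    return sigmoid(np.dot(theta, x))

# Toy example; x_0 = 1 acts as the bias term.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, -0.3])
print(h_theta(x, theta))  # probability of class 1
```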
Logistic regression – Decision Boundary
Set of points x for which p(y = 1|x; θ) = p(y = 0|x; θ) = 0.5 holds.
Given by the hyperplane:
θ^T x = 0   (6)
For θTx > 0, feature vectors are classified as 1’s.
For θTx < 0, feature vectors are classified as 0’s.
Learning θ – Maximum Likelihood
Given a set of i.i.d. training pairs S = {(x^(i), y^(i)) : i = 1, ..., m}
θ*_ML = arg max_θ L(θ) = arg max_θ ∏_{i=1}^m p(y^(i)|x^(i), θ)   (7)
      = arg max_θ ∏_{i=1}^m (h_θ(x^(i)))^(y^(i)) (1 − h_θ(x^(i)))^(1−y^(i))   (8)
For simplification, we maximize log L(θ):
ℓ(θ) = log L(θ) = ∑_{i=1}^m y^(i) log h(x^(i)) + (1 − y^(i)) log(1 − h(x^(i)))   (9)
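A small sketch of Eq. (9) in NumPy (helper names and toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ], Eq. (9)
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: 4 samples, a bias column of ones plus 2 features.
X = np.array([[1.0, 0.2, 0.5],
              [1.0, 1.5, -0.3],
              [1.0, -0.7, 0.8],
              [1.0, 0.9, 1.1]])
y = np.array([0, 1, 0, 1])
print(log_likelihood(np.zeros(3), X, y))  # = 4 * log(0.5)
```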
Learning θ – Maximum Likelihood II
No closed-form solution to maximize the log-likelihood ℓ(θ)
[Figure: plots of ℓ(θ) versus θ.]
However: ℓ(θ) is concave
- Global maximum
- Allows optimization via gradient ascent
Ascent method: θ^(t+1) := θ^(t) + α ∇_θ ℓ(θ^(t)) with ℓ(θ^(t+1)) > ℓ(θ^(t))
Derivative w.r.t. θ_j:
∂ℓ(θ)/∂θ_j = ∑_{i=1}^m (y^(i) − h(x^(i))) x_j^(i)
Learning algorithm – Gradient ascent
initialization;
while convergence criteria not satisfied do
    for j = 0 to n do
        θ_j := θ_j + α ∑_{i=1}^m (y^(i) − h_θ(x^(i))) x_j^(i);
    end
end
Algorithm 1: Gradient ascent
Convergence: ‖∇_θ ℓ(θ)‖ ≈ 0
Magnitude of the update is proportional to the error in prediction: (y^(i) − h_θ(x^(i)))
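A vectorized sketch of Algorithm 1 (assuming a design matrix with a leading column of ones; parameter names and toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, alpha=0.1, tol=1e-6, max_iter=10000):
    # X: (m, n+1) design matrix, y: (m,) binary labels.
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Vectorized update of Algorithm 1:
        # grad_j = sum_i (y_i - h_theta(x_i)) * x_ij
        grad = X.T @ (y - sigmoid(X @ theta))
        theta += alpha * grad
        if np.linalg.norm(grad) < tol:  # convergence: ||grad l(theta)|| ~ 0
            break
    return theta

X = np.array([[1.0, 0.1], [1.0, 0.9], [1.0, 0.2], [1.0, 0.8]])
y = np.array([0, 1, 0, 1])
print(gradient_ascent(X, y))
```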
Multiple classes
Logistic regression can be generalized to situations with y ∈ {1, ..., K}
Hypothesis changes (softmax function):
p(y = k|x) = exp(θ_k^T x) / ∑_{i=1}^K exp(θ_i^T x)   (10)
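A small sketch of the softmax posterior in Eq. (10) (the weight matrix below is arbitrary):

```python
import numpy as np

def softmax_posterior(x, Theta):
    # Theta: (K, n) matrix whose k-th row is theta_k.
    # p(y = k | x) = exp(theta_k^T x) / sum_i exp(theta_i^T x), Eq. (10)
    scores = Theta @ x
    scores -= scores.max()          # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

Theta = np.array([[0.5, -1.0],
                  [1.2,  0.3],
                  [-0.4, 0.8]])     # K = 3 classes, n = 2 features
x = np.array([1.0, 2.0])
print(softmax_posterior(x, Theta))  # probabilities sum to 1
```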
Binary Image Segmentation using Logistic Regression
Pipeline: Preprocessing → Feature Extraction → Logistic Regression → Spatial Regularization (a rough sketch of such a voxel-wise pipeline follows below)
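A rough sketch of such a voxel-wise pipeline using scikit-learn; the random volume, the single intensity feature, and the omitted spatial regularization step are placeholders of mine, not the lecture's actual setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical preprocessed 3D volume and matching manual binary segmentation.
image = np.random.rand(32, 32, 32)
labels = (image > 0.7).astype(int)

# Feature extraction (toy example): raw voxel intensity as the only feature.
X = image.reshape(-1, 1)
y = labels.reshape(-1)

# Voxel-wise logistic regression.
clf = LogisticRegression().fit(X, y)

# Prediction; spatial regularization (e.g. a CRF or morphological smoothing)
# would follow as a post-processing step.
segmentation = clf.predict(X).reshape(image.shape)
```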
Generalization – Model complexity
Errors in prediction are due to:
- Bias (wrong assumptions in our model)
- Variance (limited sample size; sensitivity of the model to changes in the training data)
Generalization – Bias-Variance trade-off
Generalization error = bias² + variance + irreducible error
How can we minimize the generalization error?
- First: Employ an appropriate error measure
- Second: Vary the complexity of the model and choose the one with minimum error
Generalization – Number of samples
Generalization error decreases with an increasing number of training samples m
Dilemma: Acquisition of training data (ground truth) is usually expensive
Model evaluation – Strategies
Always best: separate training (2/3) and testing (1/3) sets
K-fold cross-validation on the full data set:
Popular choices for K are 5 or 10
Alternative: Leave-one-out cross-validation (LOOCV)
CV is often used for tuning hyperparameters (a small sketch follows below)
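A small sketch of 5-fold cross-validation with scikit-learn; the synthetic data are stand-ins for real feature vectors and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic feature vectors and binary labels (placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# K = 5 fold cross-validation of a logistic regression classifier.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())
```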
Model evaluation – Real-World example: BRATS 2013
Overfitted on training data
Part II – Decision Forests for Image Classification
Linear vs. Non-linear
Logistic regression: Linear Classifier
Real problems are very often non-linear!
Transitioning from linear to non-linear classifier
[Figure: several simple linear classifiers h_1(x) = g(θ_1^T x), h_2(x) = g(θ_2^T x), ... are combined; each models e.g. p(y = 'blue'|x) = h(x) and p(y = 'red'|x) = 1 − h(x).]
Idea: Combine simple classifiers into more complex ones
Final decision boundary is non-linear!
Decision tree
How to decide? – Weak Learner
A simple model which performs only slightly better than flipping a coin
Can be represented as (1{·} is the indicator function):
h_θ(x) = 1{g(x, θ) > τ}   (11)
Linear model: g(x, θ) = φ(x)^T θ (homogeneous coordinates)
φ(x) selects a random subset of features (Randomized Node Optimization), θ defines a geometric primitive
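A minimal sketch of an axis-aligned weak learner (decision stump) in the sense of Eq. (11); the selected feature index and threshold τ are arbitrary:

```python
import numpy as np

def axis_aligned_weak_learner(x, feature_idx, tau):
    # h(x) = 1{ g(x, theta) > tau }, where g picks out a single feature
    # (axis-aligned split); returns 0 or 1.
    return int(x[feature_idx] > tau)

x = np.array([0.3, 1.7, -0.2])
print(axis_aligned_weak_learner(x, feature_idx=1, tau=1.0))  # -> 1
```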
Examples of weak learner
[Figure: examples of weak learners – axis-aligned split, oriented line, conic section.]
How to predict? – Leaf prediction model
A feature vector is passed down the tree and ends up in a leaf
The leaf stores p(y|x) (class label histogram)
Apply the MAP rule to p(y|x)
Testing phase
[Figure: a test sample x is pushed down each of the T trees of the forest.]
Final prediction given by:
p(y|x) = (1/T) ∑_{t=1}^T p_t(y|x).   (12)
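A tiny sketch of Eq. (12); the leaf histograms p_t(y|x) below are made-up numbers for one test sample:

```python
import numpy as np

# Hypothetical leaf posteriors p_t(y | x) from T = 3 trees, each a histogram
# over 3 classes.
tree_posteriors = np.array([[0.1, 0.8, 0.1],
                            [0.2, 0.6, 0.2],
                            [0.0, 0.9, 0.1]])

# Forest posterior: p(y | x) = (1/T) * sum_t p_t(y | x), Eq. (12)
forest_posterior = tree_posteriors.mean(axis=0)
y_hat = np.argmax(forest_posterior)  # MAP rule on the averaged posterior
print(forest_posterior, y_hat)
```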
How to train? – Information gain
[Figure: example splits with high information gain (well-separated class distributions) vs. low information gain.]
How to train? – Information gain
Optimization of information gain:
IG = H(S) − ∑_{i∈{L,R}} (|S^i| / |S|) H(S^i)   (13)
where
H(S) = −∑_{y∈Y} p(y) log p(y).   (14)
θ*_j = arg max_{θ_j ∈ Θ} IG_j.   (15)
Minimizes the "impurity" of the child distributions
Optimization procedure: exhaustive search over Θ
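A small sketch of Eqs. (13)-(14) for one candidate split (the label arrays are toy data):

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_y p(y) log p(y), Eq. (14)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(parent, left, right):
    # IG = H(S) - sum_{i in {L,R}} |S_i|/|S| * H(S_i), Eq. (13)
    m = len(parent)
    return entropy(parent) - (len(left) / m) * entropy(left) \
                           - (len(right) / m) * entropy(right)

parent = np.array([0, 0, 0, 1, 1, 1])
# A perfectly separating split has maximal information gain (= log 2 here).
print(information_gain(parent, parent[:3], parent[3:]))
```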
How does all of this make sense? – Bias-Variance trade-off
A decision tree is a low-bias, high-variance model
Two key aspects (Breiman, 2001):
- Randomized Node Optimization (and bagging) de-correlates the trees
- Averaging of tree predictions
Variance of the average prediction given by:
ρσ² + ((1 − ρ)/T) σ²   (16)
Hence, grow randomized trees sufficiently deep and combine them into an ensemble
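As an illustrative calculation: with tree correlation ρ = 0.25, per-tree variance σ² = 1 and T = 100 trees, Eq. (16) gives 0.25 + 0.75/100 ≈ 0.26, i.e. roughly a four-fold variance reduction compared to a single tree.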
Forest hyperparameters
Number of trees T
Depth of trees D
Number of candidate weak learners
Number of candidate thresholds
How to tune them? Grid search (cross-validation), e.g. as sketched below
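A small sketch of tuning T and D by grid search with cross-validation in scikit-learn; the synthetic data and grid values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for voxel-wise feature vectors and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grid over the number of trees T and the tree depth D.
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [5, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```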
Toy example
(Binary) Image Segmentation using Decision Forest
Pipeline: Preprocessing → Feature Extraction → Decision Forest → Spatial Regularization (a rough sketch follows below)
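A rough sketch of the forest-based pipeline, analogous to the logistic-regression sketch above; the random volume, the two intensity features, and the omitted regularization step are placeholders of mine:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestClassifier

# Hypothetical preprocessed volume and manual segmentation (placeholders).
image = np.random.rand(32, 32, 32)
labels = (gaussian_filter(image, sigma=1.0) > 0.55).astype(int)

# Feature extraction: raw intensity plus a smoothed intensity per voxel.
features = np.stack([image.reshape(-1),
                     gaussian_filter(image, sigma=2.0).reshape(-1)], axis=1)

# Voxel-wise decision forest.
forest = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
forest.fit(features, labels.reshape(-1))

# Predicted segmentation; spatial regularization (e.g. a CRF) would follow.
segmentation = forest.predict(features).reshape(image.shape)
```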
Real-world examples – MICCAI 2014
Real-world examples – Brain Tumor Segmentation
Real-world examples – Feature Importance
I_M – voxel-wise intensity value extracted from modality M

Depth | Tumor vs. healthy | Healthy tissues | Tumor core
    1 | I_FLAIR           | I_T1            | I_T1c
    2 | I_T2              | I_T1c           | I_T1 − I_T2
    3 | I_T1 − I_FLAIR    | I_T1c           | I_T1 − I_T1c
    4 | I_T2              | I_T1 − I_T2     | I_T2
    5 | I_FLAIR           | I_T1c           | I_FLAIR
    6 | I_FLAIR           | I_T1c           | I_T1c
    7 | I_T1c             | I_T1c           | I_FLAIR
    8 | I_T2              | I_T1c           | I_T1c
    9 | I_T1c             | I_T1c           | I_FLAIR
   10 | I_T1c             | I_T1c           | I_T1c
   11 | I_T1c             | I_T1c           | I_FLAIR
   12 | I_T1c             | I_T1c           | I_T1c
   13 | I_T1c             | I_T1c           | I_T1c
   14 | I_T1c             | I_T1c           | I_T2
   15 | I_T2              | I_T1c           | I_T1c
   16 | I_T2              | I_T1c           | I_T2
   17 | I_T2              | I_T1c           | I_T1c
   18 | I_T2              | I_T1c           | I_T1c
Summary – Decision Forest
Discriminative model
The Decision Forest has two main degrees of freedom:
- Weak learner
- Objective function (information gain)
Training: Generation of de-correlated trees based on maximizing information gain
Testing: A new input is pushed down each tree; the prediction is based on the model stored in the reached leaf
A last note...
Decision forests are a flexible multi-purpose framework
Can also solve regression problems, density estimation, and manifold learning
Connection to deep learning