Supervised Learning for Image Segmentation
Raphael Meier
06.10.2016
References
A. Ng, Machine Learning lecture, Stanford University.
A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Foundations and Trends in Computer Graphics and Vision, 2012.
A. Criminisi, Decision Forests for Computer Vision and Medical Image Analysis, Tutorial, http://research.microsoft.com/en-us/projects/decisionforests/.
S. J. D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
D. Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.
Part I – Supervised Learning
[Figure: learning workflow – training data and expert knowledge (manual segmentations) are used in a training phase to learn a general rule H(x), which is applied in a testing phase for fully automatic segmentation.]
Brain Tumor Segmentation
Brain tumors: Glioma (Glioblastoma)
Clinical guidelines:
- Bidimensional measures (RANO/AvaGlio)
- Desired: tumor volumetry (manual segmentation, takes hours)
Future: fully automatic segmentation
Bidimensional measures fail (Reuter et al., 2014)
Motivation (Menze et al., 2014)
The Learning Problem
[Figure: training data are used to learn a hypothesis H(x), which maps new data x to a prediction y.]
Training set: S
Input: x
Output: y
Hypothesis: H(x): x → y
Application: Image segmentation
Aim: Partition an image into disjoint, semantically meaningful image regions
- Can be seen as a learning (classification) problem
Input: image(s) consisting of voxels
Output: regions, indicated by voxel-wise labels (usually integers: 1, 2, 3, ...)
Image representation – Features
Definition: Measurable attributes of image data
Can be either hand-crafted or automatically learned (e.g. via a Restricted Boltzmann Machine)
Taxonomy of Learning Scenarios
Defined by nature of training data
Unsupervised Learning: Given a set of unlabeled feature vectors
- S_u = {x^(i) : i = 1, ..., m}
Supervised Learning: Given a set of fully labeled feature vectors
- S_ℓ = {(x^(i), y^(i)) : i = 1, ..., m}
Semi-supervised Learning: Given a set of partially labeled feature vectors
- S = S_u ∪ S_ℓ
Taxonomy of Learning Problems
Defined by the learning scenario and nature of the output
Unsupervised Learning:
- Given S_u, find interesting structure (clustering, density estimation)
- Given S_u with x ∈ R^n, find a mapping H(x) = x̃ with x̃ ∈ R^ñ such that ñ ≪ n (dimensionality reduction, manifold learning)
Supervised Learning:
- Given S_ℓ, find H(x): x → y with x ∈ R^n and y ∈ {1, 2, 3, ...} (classification)
- Given S_ℓ, find H(x): x → y with x ∈ R^n and y ∈ R (regression)
Image segmentation via Classification
[Figure: the learning workflow applied to segmentation – training images with expert segmentations are used to learn the general rule H(x), which then segments new images fully automatically.]
Training and Testing phase
[Figure: training phase – expert knowledge (manual segmentations) and training data yield the general rule H(x); testing phase – H(x) is applied to new images for fully automatic segmentation.]
Learning (Training) Algorithm
Aim: Construct a hypothesis H which relates a feature vector x to its most probable label y.
Output: Hypothesis (model) parametrized by a set of parameters θ
Assume we know p(y|x, θ); then the mapping H(x): x → y can be realized via the MAP rule:
y = arg max_y p(y|x, θ).   (1)
How do we obtain p(y|x, θ)?
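As a minimal illustration (a sketch with made-up posterior values, not from the lecture), the MAP rule amounts to an argmax over the class posteriors:

```python
import numpy as np

# Hypothetical posterior p(y | x, theta) over three classes for one sample
# (values are made up for illustration).
posterior = np.array([0.15, 0.70, 0.15])

# MAP rule (Eq. 1): pick the label with the highest posterior probability.
y_hat = np.argmax(posterior)
print(y_hat)  # -> 1
```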
Generative vs. Discriminative Models
Bayes rule:
p(y|x, θ) = p(x, y|θ) / p(x|θ) = p(x|y, θ) p(y|θ) / p(x|θ).   (2)
Generative models: Estimate p(y|x) via the likelihood p(x|y) and the prior distribution p(y).
Discriminative models: Estimate the posterior distribution p(y|x) directly
- Can also be non-probabilistic (e.g. Support Vector Machines)
Logistic regression – A Classic (1940s)
Used extensively, 1415 hits on PubMed
Supervised learning
Solves binary classification problems (y ∈ {0, 1})
Discriminative approach, we model p(y|x) directly:
- p(y = 1|x; θ) = h_θ(x) and p(y = 0|x; θ) = 1 − h_θ(x) (Bernoulli)
- More compactly:
p(y|x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)   (3)
⟺ y|x, θ ∼ Bernoulli(h_θ(x))   (4)
Linear model, hence: h_θ(x) = g(θ^T x)
Logistic regression – Sigmoid Function
Logistic (sigmoid) function:
g(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))   (5)
As before, z = θ^T x.
Motivation: Restrict the values of our hypothesis to lie between zero and one (probability)
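A minimal sketch of the logistic hypothesis h_θ(x) = g(θ^T x) in Python; the function names and toy values are mine, not part of the lecture:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(x, theta):
    # Logistic hypothesis: p(y = 1 | x; theta) = g(theta^T x)
    return sigmoid(np.dot(theta, x))

# Toy example; x_0 = 1 acts as the bias term.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, -0.3])
print(h_theta(x, theta))  # probability of class 1
```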
Logistic regression – Decision Boundary
Set of points x for which p(y = 1|x; θ) = p(y = 0|x; θ) = 0.5 holds.
Given by the hyperplane:
θ^T x = 0   (6)
For θTx > 0, feature vectors are classified as 1’s.
For θTx < 0, feature vectors are classified as 0’s.
Learning θ – Maximum Likelihood
Given a set of i.i.d. training pairs S = {(x^(i), y^(i)) : i = 1, ..., m}
θ*_ML = arg max_θ L(θ) = arg max_θ ∏_{i=1}^m p(y^(i)|x^(i), θ)   (7)
      = arg max_θ ∏_{i=1}^m (h_θ(x^(i)))^(y^(i)) (1 − h_θ(x^(i)))^(1−y^(i))   (8)
For simplification, we maximize log L(θ):
ℓ(θ) = log L(θ) = ∑_{i=1}^m y^(i) log h(x^(i)) + (1 − y^(i)) log(1 − h(x^(i)))   (9)
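A small sketch of Eq. (9) in NumPy (helper names and toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ], Eq. (9)
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: 4 samples, a bias column of ones plus 2 features.
X = np.array([[1.0, 0.2, 0.5],
              [1.0, 1.5, -0.3],
              [1.0, -0.7, 0.8],
              [1.0, 0.9, 1.1]])
y = np.array([0, 1, 0, 1])
print(log_likelihood(np.zeros(3), X, y))  # = 4 * log(0.5)
```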
Learning θ – Maximum Likelihood II
No closed-form solution to maximize the log-likelihood ℓ(θ)
[Figure: plots of ℓ(θ) versus θ.]
However: ℓ(θ) is concave
- Global maximum
- Allows optimization via gradient ascent
Ascent method: θ^(t+1) := θ^(t) + α ∇_θ ℓ(θ^(t)) with ℓ(θ^(t+1)) > ℓ(θ^(t))
Derivative w.r.t. θ_j:
∂ℓ(θ)/∂θ_j = ∑_{i=1}^m (y^(i) − h(x^(i))) x_j^(i)
Learning algorithm – Gradient ascent
initialization;
while convergence criteria not satisfied do
    for j = 0 to n do
        θ_j := θ_j + α ∑_{i=1}^m (y^(i) − h_θ(x^(i))) x_j^(i);
    end
end
Algorithm 1: Gradient ascent
Convergence: ‖∇_θ ℓ(θ)‖ ≈ 0
Magnitude of the update is proportional to the error in prediction: (y^(i) − h_θ(x^(i)))
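A vectorized sketch of Algorithm 1 (assuming a design matrix with a leading column of ones; parameter names and toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, alpha=0.1, tol=1e-6, max_iter=10000):
    # X: (m, n+1) design matrix, y: (m,) binary labels.
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Vectorized update of Algorithm 1:
        # grad_j = sum_i (y_i - h_theta(x_i)) * x_ij
        grad = X.T @ (y - sigmoid(X @ theta))
        theta += alpha * grad
        if np.linalg.norm(grad) < tol:  # convergence: ||grad l(theta)|| ~ 0
            break
    return theta

X = np.array([[1.0, 0.1], [1.0, 0.9], [1.0, 0.2], [1.0, 0.8]])
y = np.array([0, 1, 0, 1])
print(gradient_ascent(X, y))
```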
Multiple classes
Logistic regression can be generalized to situations with y ∈ {1, ..., K}
Hypothesis changes (softmax function):
p(y = k|x) = exp(θ_k^T x) / ∑_{i=1}^K exp(θ_i^T x)   (10)
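A small sketch of the softmax posterior in Eq. (10) (the weight matrix below is arbitrary):

```python
import numpy as np

def softmax_posterior(x, Theta):
    # Theta: (K, n) matrix whose k-th row is theta_k.
    # p(y = k | x) = exp(theta_k^T x) / sum_i exp(theta_i^T x), Eq. (10)
    scores = Theta @ x
    scores -= scores.max()          # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

Theta = np.array([[0.5, -1.0],
                  [1.2,  0.3],
                  [-0.4, 0.8]])     # K = 3 classes, n = 2 features
x = np.array([1.0, 2.0])
print(softmax_posterior(x, Theta))  # probabilities sum to 1
```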
Binary Image Segmentation using Logistic Regression
Pipeline: Preprocessing → Feature Extraction → Logistic Regression → Spatial Regularization (a rough sketch of such a voxel-wise pipeline follows below)
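A rough sketch of such a voxel-wise pipeline using scikit-learn; the random volume, the single intensity feature, and the omitted spatial regularization step are placeholders of mine, not the lecture's actual setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical preprocessed 3D volume and matching manual binary segmentation.
image = np.random.rand(32, 32, 32)
labels = (image > 0.7).astype(int)

# Feature extraction (toy example): raw voxel intensity as the only feature.
X = image.reshape(-1, 1)
y = labels.reshape(-1)

# Voxel-wise logistic regression.
clf = LogisticRegression().fit(X, y)

# Prediction; spatial regularization (e.g. a CRF or morphological smoothing)
# would follow as a post-processing step.
segmentation = clf.predict(X).reshape(image.shape)
```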
Generalization – Model complexity
Errors in prediction are due to:
- Bias (wrong assumptions in our model)
- Variance (limited sample size; sensitivity of the model to changes in the training data)
Generalization – Bias-Variance trade-off
Generalization error = bias² + variance + irreducible error
How can we minimize the generalization error?
- First: Employ an appropriate error measure
- Second: Vary the complexity of the model and choose the one with minimum error
Generalization – Number of samples
Generalization error decreases with an increasing number of training samples m
Dilemma: Acquisition of training data (ground truth) is usually expensive
Model evaluation – Strategies
Always best: separate training (2/3) and testing (1/3) sets
K-fold cross-validation on the full data set:
Popular choices for K are 5 or 10
Alternative: Leave-one-out cross-validation (LOOCV)
CV is often used for tuning hyperparameters (a small sketch follows below)
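A small sketch of 5-fold cross-validation with scikit-learn; the synthetic data are stand-ins for real feature vectors and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic feature vectors and binary labels (placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# K = 5 fold cross-validation of a logistic regression classifier.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())
```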
Model evaluation – Real-World example: BRATS 2013
Overfitted on training data
Part II – Decision Forests for Image Classification
Linear vs. Non-linear
Logistic regression: Linear Classifier
Real problems are very often non-linear!
Transitioning from linear to non-linear classifier
[Figure: several simple linear classifiers h_1(x) = g(θ_1^T x), h_2(x) = g(θ_2^T x), ... are combined; each models e.g. p(y = 'blue'|x) = h(x) and p(y = 'red'|x) = 1 − h(x).]
Idea: Combine simple classifiers into more complex ones
Final decision boundary is non-linear!
Decision tree
How to decide? – Weak Learner
A simple model which performs only slightly better than flipping a coin
Can be represented as (1{·} is the indicator function):
h_θ(x) = 1{g(x, θ) > τ}   (11)
Linear model: g(x, θ) = φ(x)^T θ (homogeneous coordinates)
φ(x) selects a random subset of features (Randomized Node Optimization), θ defines a geometric primitive
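A minimal sketch of an axis-aligned weak learner (decision stump) in the sense of Eq. (11); the selected feature index and threshold τ are arbitrary:

```python
import numpy as np

def axis_aligned_weak_learner(x, feature_idx, tau):
    # h(x) = 1{ g(x, theta) > tau }, where g picks out a single feature
    # (axis-aligned split); returns 0 or 1.
    return int(x[feature_idx] > tau)

x = np.array([0.3, 1.7, -0.2])
print(axis_aligned_weak_learner(x, feature_idx=1, tau=1.0))  # -> 1
```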
Examples of weak learner
[Figure: examples of weak learners – axis-aligned split, oriented line, conic section.]
How to predict? – Leaf prediction model
A feature vector is passed down the tree and ends up in a leaf
The leaf stores p(y|x) (class label histogram)
Apply the MAP rule to p(y|x)
Testing phase
[Figure: a test sample x is pushed down each of the T trees of the forest.]
Final prediction given by:
p(y|x) = (1/T) ∑_{t=1}^T p_t(y|x).   (12)
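A tiny sketch of Eq. (12); the leaf histograms p_t(y|x) below are made-up numbers for one test sample:

```python
import numpy as np

# Hypothetical leaf posteriors p_t(y | x) from T = 3 trees, each a histogram
# over 3 classes.
tree_posteriors = np.array([[0.1, 0.8, 0.1],
                            [0.2, 0.6, 0.2],
                            [0.0, 0.9, 0.1]])

# Forest posterior: p(y | x) = (1/T) * sum_t p_t(y | x), Eq. (12)
forest_posterior = tree_posteriors.mean(axis=0)
y_hat = np.argmax(forest_posterior)  # MAP rule on the averaged posterior
print(forest_posterior, y_hat)
```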
How to train? – Information gain
[Figure: example splits with high information gain (well-separated class distributions) vs. low information gain.]
How to train? – Information gain
Optimization of information gain:
IG = H(S) − ∑_{i∈{L,R}} (|S^i| / |S|) H(S^i)   (13)
where
H(S) = −∑_{y∈Y} p(y) log p(y).   (14)
θ*_j = arg max_{θ_j ∈ Θ} IG_j.   (15)
Minimizes the "impurity" of the child distributions
Optimization procedure: exhaustive search over Θ
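A small sketch of Eqs. (13)-(14) for one candidate split (the label arrays are toy data):

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_y p(y) log p(y), Eq. (14)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(parent, left, right):
    # IG = H(S) - sum_{i in {L,R}} |S_i|/|S| * H(S_i), Eq. (13)
    m = len(parent)
    return entropy(parent) - (len(left) / m) * entropy(left) \
                           - (len(right) / m) * entropy(right)

parent = np.array([0, 0, 0, 1, 1, 1])
# A perfectly separating split has maximal information gain (= log 2 here).
print(information_gain(parent, parent[:3], parent[3:]))
```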
How does all of this make sense? – Bias-Variance trade-off
A decision tree is a low-bias, high-variance model
Two key aspects (Breiman, 2001):
- Randomized Node Optimization (and bagging) de-correlates the trees
- Averaging of tree predictions
Variance of the average prediction given by:
ρσ² + ((1 − ρ)/T) σ²   (16)
Hence, grow randomized trees sufficiently deep and combine them into an ensemble
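As an illustrative calculation: with tree correlation ρ = 0.25, per-tree variance σ² = 1 and T = 100 trees, Eq. (16) gives 0.25 + 0.75/100 ≈ 0.26, i.e. roughly a four-fold variance reduction compared to a single tree.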
Forest hyperparameters
Number of trees T
Depth of trees D
Number of candidate weak learners
Number of candidate thresholds
How to tune them? Grid search (cross-validation), e.g. as sketched below
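A small sketch of tuning T and D by grid search with cross-validation in scikit-learn; the synthetic data and grid values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for voxel-wise feature vectors and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grid over the number of trees T and the tree depth D.
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [5, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```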
Toy example
(Binary) Image Segmentation using Decision Forest
Pipeline: Preprocessing → Feature Extraction → Decision Forest → Spatial Regularization (a rough sketch follows below)
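A rough sketch of the forest-based pipeline, analogous to the logistic-regression sketch above; the random volume, the two intensity features, and the omitted regularization step are placeholders of mine:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestClassifier

# Hypothetical preprocessed volume and manual segmentation (placeholders).
image = np.random.rand(32, 32, 32)
labels = (gaussian_filter(image, sigma=1.0) > 0.55).astype(int)

# Feature extraction: raw intensity plus a smoothed intensity per voxel.
features = np.stack([image.reshape(-1),
                     gaussian_filter(image, sigma=2.0).reshape(-1)], axis=1)

# Voxel-wise decision forest.
forest = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
forest.fit(features, labels.reshape(-1))

# Predicted segmentation; spatial regularization (e.g. a CRF) would follow.
segmentation = forest.predict(features).reshape(image.shape)
```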
Real-world examples – MICCAI 2014
Real-world examples – Brain Tumor Segmentation
Real-world examples – Feature Importance
I_M – voxel-wise intensity value extracted from modality M

Depth | Tumor vs. healthy | Healthy tissues | Tumor core
    1 | I_FLAIR           | I_T1            | I_T1c
    2 | I_T2              | I_T1c           | I_T1 − I_T2
    3 | I_T1 − I_FLAIR    | I_T1c           | I_T1 − I_T1c
    4 | I_T2              | I_T1 − I_T2     | I_T2
    5 | I_FLAIR           | I_T1c           | I_FLAIR
    6 | I_FLAIR           | I_T1c           | I_T1c
    7 | I_T1c             | I_T1c           | I_FLAIR
    8 | I_T2              | I_T1c           | I_T1c
    9 | I_T1c             | I_T1c           | I_FLAIR
   10 | I_T1c             | I_T1c           | I_T1c
   11 | I_T1c             | I_T1c           | I_FLAIR
   12 | I_T1c             | I_T1c           | I_T1c
   13 | I_T1c             | I_T1c           | I_T1c
   14 | I_T1c             | I_T1c           | I_T2
   15 | I_T2              | I_T1c           | I_T1c
   16 | I_T2              | I_T1c           | I_T2
   17 | I_T2              | I_T1c           | I_T1c
   18 | I_T2              | I_T1c           | I_T1c
Summary – Decision Forest
Discriminative model
The Decision Forest has two main degrees of freedom:
- Weak learner
- Objective function (information gain)
Training: Generation of de-correlated trees based on maximizing information gain
Testing: A new input is pushed down each tree; the prediction is based on the model stored in the reached leaf
A last note...
Decision forests are a flexible multi-purpose framework
Can also solve regression problems, density estimation, and manifold learning
Connection to deep learning