Fundamentals of Image Analysis
Alexandre Xavier Falcão
Institute of Computing - UNICAMP
[email protected]
MC940/MO445 - Image Analysis
Introduction
Image analysis requires learning models for object detection and/or delineation (sample extraction), feature extraction (descriptor learning), and classification (classifier learning).
Introduction
We will start by taking object detection as an example that uses pixels as samples and requires feature extraction and classification.
Object detection evaluates candidate locations for the object(s) of interest in the image.
Agenda
Object detection and main concepts for image analysis.
Machine learning and basic concepts from Statistics.
Classic pattern recognition techniques.
Object detection
The important concepts for object detection are
Multiband image, adjacency relation, and multiband kernel.
Convolution between image and kernels (i.e., multiband image filtering).
Fast filtering through integral images for feature extraction.
Feature selection using systems of weak classifiers.
Multiband image
A multiband image $\hat{I}$ is a pair $(D_I, \mathbf{I})$, in which
$D_I \subset \mathbb{Z}^d$ is the image domain and
$\mathbf{I}$ assigns to each space element (spel) $p \in D_I$ a feature vector $\mathbf{I}(p) = (I_1(p), I_2(p), \ldots, I_n(p))$, i.e., a point in the pixel feature space $\mathbb{R}^n$.
We will focus on $d = 2$ (spel is pixel) and $n \geq 1$.
Example of a 2D grayscale image
We will use $\hat{I} = I$ for binary and grayscale images ($n = 1$).
The pixel coordinates are $p = (x_p, y_p) \in D_I$, where the image domain $D_I = \{(0,0), (1,0), \ldots, (n_x-1, 0), (0,1), (1,1), \ldots, (n_x-1, 1), \ldots, (0, n_y-1), (1, n_y-1), \ldots, (n_x-1, n_y-1)\}$.
Image domain and feature space
Pixels with similar features should be mapped onto nearby positions in the feature space $\mathbb{R}^n$.
The feature space can be changed by image filtering.
Adjacency relation
An adjacency relation $A \subseteq D_I \times D_I$ may be defined in the image domain and/or feature space as a binary relation.
$A_1: \{(p,q) \in D_I \times D_I \mid \|q - p\| \leq \alpha_1\}$,
$A_2: \{(p,q) \in D_I \times D_I \mid \|\mathbf{I}(q) - \mathbf{I}(p)\| \leq \alpha_2\}$,
$A_3: \{(p,q) \in D_I \times D_I \mid \|q - p\| \leq \alpha_1 \text{ and } \|\mathbf{I}(q) - \mathbf{I}(p)\| \leq \alpha_2\}$,
$\alpha_1, \alpha_2 \in \mathbb{R}^+$, such that $A(p)$ is the set of pixels $q$ adjacent to $p$.
For the image on the right, what is the adjacency set of $p = (2,3)$ for $A_1$, $A_2$, and $A_3$, when $\alpha_1 = \sqrt{5}$ and $\alpha_2 = 0$?
Adjacency relation
We will consider only translation-invariant adjacency relations: $A: \{(p, q_k) \in D_I \times D_I \mid (x_{q_k}, y_{q_k}) - (x_p, y_p) = (dx_k, dy_k)\}$, $k = 1, 2, \ldots, K$, where $\{(dx_k, dy_k)\}$ is a set of $K$ displacements.
One can store the displacements and generate the set $A(p) = \{q_k\}$, $q_k = (x_{q_k}, y_{q_k}) = (x_p + dx_k, y_p + dy_k)$, $k = 1, 2, \ldots, K$, for any $p \in D_I$.
For fixed displacements $\{(-2,-1), (0,2)\}$, examples of sets $A(p) = \{q_1, q_2\}$ are
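This stored-displacement representation can be sketched in a few lines of Python (a hypothetical helper, with the image-domain check included so adjacents outside $D_I$ are discarded):

```python
def adjacency_set(p, displacements, shape):
    """Generate A(p) from stored displacements (dx_k, dy_k), discarding
    adjacents that fall outside an image domain of size shape = (nx, ny)."""
    xp, yp = p
    nx, ny = shape
    return [(xp + dx, yp + dy)
            for (dx, dy) in displacements
            if 0 <= xp + dx < nx and 0 <= yp + dy < ny]

# The fixed displacements (-2, -1) and (0, 2) from the slide:
print(adjacency_set((5, 5), [(-2, -1), (0, 2)], (10, 10)))  # [(3, 4), (5, 7)]
print(adjacency_set((0, 0), [(-2, -1), (0, 2)], (10, 10)))  # [(0, 2)]
```

Note that for the pixel at the origin, the first displacement leaves the domain, so its adjacency set has a single element.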
Kernel
A kernel is a pair $(A, W)$ in which $A$ is translation-invariant and $W(q_k - p) = w_k$, $k = 1, 2, \ldots, K$.
One can then store the set $\{(dx_k, dy_k, w_k)\}$, $k = 1, 2, \ldots, K$, and interpret a kernel as a moving image whose domain $A(p)$ changes with $p \in D_I$ for a fixed scalar function $W$.
In our example, for $w_1 = 2$ and $w_2 = -3$:
Multiband kernel
A multiband kernel is a pair $(A, \mathbf{W})$ in which $A$ is translation-invariant and $\mathbf{W}(q_k - p) = \mathbf{w}_k = (w_{k1}, w_{k2}, \ldots, w_{kn})$, $k = 1, 2, \ldots, K$.
One can then store the set $\{(dx_k, dy_k, \mathbf{w}_k)\}$, $k = 1, 2, \ldots, K$, as a multiband kernel (a moving multiband image).
As we will see next, the dimension $n$ of each kernel coefficient must be the number of image bands when computing the convolution between a multiband image and a multiband kernel.
Convolution
The convolution (in fact, correlation) between a multiband image $\hat{I} = (D_I, \mathbf{I})$ and a multiband kernel $(A, \mathbf{W})$ can be defined as a grayscale image $\hat{J} = (D_J, J)$, where

$$J(p) = \sum_{k=1}^{K} \mathbf{I}(q_k) \cdot \mathbf{w}_k, \qquad \mathbf{I}(q_k) \cdot \mathbf{w}_k = \sum_{j=1}^{n} I_j(q_k)\, w_{kj},$$

for $q_k \in A(p)$ and $p \in D_J \supseteq D_I$. We usually force $D_J = D_I$.
Convolution
The moving image translates from $p = (x_p, y_p) = (-\infty, -\infty)$ to $p = (x_p, y_p) = (+\infty, +\infty)$, but $J(p)$ is computed only for $p \in D_J = D_I$.
The convolution algorithm
Input: $\hat{I} = (D_I, \mathbf{I})$ and $\{(dx_k, dy_k, \mathbf{w}_k)\}$, $k = 1, 2, \ldots, K$.
Output: $\hat{J} = (D_J, J)$.
1. For each $p = (x_p, y_p) \in D_J$, do
2.   $J(p) \leftarrow 0$.
3.   For $k \leftarrow 1, 2, \ldots, K$, do
4.     $q = (x_q, y_q) \leftarrow (x_p + dx_k, y_p + dy_k)$.
5.     If $q = (x_q, y_q) \in D_I$, then
6.       $J(p) \leftarrow J(p) + \mathbf{I}(q) \cdot \mathbf{w}_k$.
By choice of the kernel coefficients, different local features of $\hat{I}$ are extracted in the filtered image $\hat{J}$.
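The algorithm translates directly to Python; a sketch under the assumption that the multiband image is stored as a (ny, nx, n) NumPy array and the kernel as a list of displacements plus a (K, n) weight array:

```python
import numpy as np

def convolve(I, displacements, weights):
    """Steps 1-6 of the algorithm: for each pixel p, accumulate the
    inner products I(q_k) . w_k over the adjacents q_k inside D_I."""
    ny, nx, n = I.shape
    J = np.zeros((ny, nx))
    for yp in range(ny):
        for xp in range(nx):
            for (dx, dy), wk in zip(displacements, weights):
                xq, yq = xp + dx, yp + dy
                if 0 <= xq < nx and 0 <= yq < ny:  # step 5: q in D_I?
                    J[yp, xp] += I[yq, xq] @ wk    # step 6
    return J

# Two-band constant image filtered with the single-displacement kernel
# w_1 = (2, -3): every J(p) = 1*2 + 1*(-3) = -1.
I = np.ones((3, 3, 2))
J = convolve(I, [(0, 0)], np.array([[2.0, -3.0]]))
```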
Image filtering for feature extraction
The Sobel filters, for example, can enhance vertical and horizontal edges of $\hat{I}$. The corresponding moving images, in which the origin $p$ is the central pixel, are shown on the slide.
One can also compute the convolution between $\hat{I}$ and a kernel bank $(A, \mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_m)$ such that the result is a multiband image $\hat{J} = (D_J, \mathbf{J})$, where $\mathbf{J}(p) = (J_1(p), J_2(p), \ldots, J_m(p))$.
The kernel coefficients and the adjacent pixel values of each pixel can be organized into matrices, and the convolution can be efficiently computed by matrix multiplication.
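As an illustration, the two 3x3 Sobel kernels and a minimal border-aware correlation in NumPy (adjacents outside the domain contribute zero, as in the algorithm above; names are illustrative):

```python
import numpy as np

# Sobel kernels as 3x3 moving images, origin at the central pixel.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # enhances vertical edges
sobel_y = sobel_x.T                            # enhances horizontal edges

def filter2d(I, W):
    """Correlate grayscale image I with a square kernel W, skipping
    adjacents that fall outside the image domain."""
    ny, nx = I.shape
    J = np.zeros((ny, nx))
    r = W.shape[0] // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            w = W[dy + r, dx + r]
            ys, ye = max(0, -dy), ny - max(0, dy)
            xs, xe = max(0, -dx), nx - max(0, dx)
            J[ys:ye, xs:xe] += w * I[ys + dy:ye + dy, xs + dx:xe + dx]
    return J

# A vertical step edge responds strongly to sobel_x and not at all to
# sobel_y at interior pixels.
I = np.zeros((5, 5))
I[:, 2:] = 1.0
```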
Image filtering for feature extraction
Let $(A, \mathbf{W}_i)$, where $\mathbf{W}_i(q_k - p) = \mathbf{w}_{ki} \in \mathbb{R}^n$, $i = 1, 2, \ldots, m$, $k = 1, 2, \ldots, K$, be the $i$-th kernel of the bank, such that each kernel is organized as a column of a matrix $\mathbf{K}$:

$$\mathbf{K} = \begin{bmatrix} \mathbf{w}_{11} & \mathbf{w}_{12} & \ldots & \mathbf{w}_{1m} \\ \mathbf{w}_{21} & \mathbf{w}_{22} & \ldots & \mathbf{w}_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{w}_{K1} & \mathbf{w}_{K2} & \ldots & \mathbf{w}_{Km} \end{bmatrix}_{nK \times m}$$

where each vector $\mathbf{w}_{ki} \in \mathbb{R}^n$ is a column matrix.
Image filtering for feature extraction
For an image $(D_I, \mathbf{I})$ and adjacency relation $A$, we can organize the feature vectors $\mathbf{I}(q_{jk}) \in \mathbb{R}^n$ of the adjacent pixels $q_{jk} \in A(p_j)$, $k = 1, 2, \ldots, K$, of each pixel $p_j \in D_I$, $j = 1, 2, \ldots, |D_I|$, along the rows of a matrix $\mathbf{X}_I$:

$$\mathbf{X}_I = \begin{bmatrix} \mathbf{I}(q_{11}) & \mathbf{I}(q_{12}) & \ldots & \mathbf{I}(q_{1K}) \\ \mathbf{I}(q_{21}) & \mathbf{I}(q_{22}) & \ldots & \mathbf{I}(q_{2K}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{I}(q_{|D_I|1}) & \mathbf{I}(q_{|D_I|2}) & \ldots & \mathbf{I}(q_{|D_I|K}) \end{bmatrix}_{|D_I| \times nK}$$

where each vector $\mathbf{I}(q_{jk}) \in \mathbb{R}^n$ is a row matrix.
Image filtering for feature extraction
The multiplication $\mathbf{X}_I \mathbf{K}$ outputs a matrix $\mathbf{X}_J$,

$$\mathbf{X}_J = \begin{bmatrix} \mathbf{J}(p_1) \\ \mathbf{J}(p_2) \\ \vdots \\ \mathbf{J}(p_{|D_I|}) \end{bmatrix}_{|D_I| \times m}$$

where each vector $\mathbf{J}(p_j) \in \mathbb{R}^m$, $j = 1, 2, \ldots, |D_I|$, is a row matrix.
That is, $\mathbf{X}_J$ is the matrix organization of the output image $(D_J, \mathbf{J})$ for $D_J = D_I$.
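The matrix formulation can be sketched as an im2col-style routine (hypothetical names; adjacents outside $D_I$ are zero-padded):

```python
import numpy as np

def conv_as_matmul(I, displacements, Kmat):
    """Build X_I (|D_I| x nK) row by row, then filter the whole image
    with a single product X_I @ Kmat, where Kmat is nK x m with one
    kernel of the bank per column."""
    ny, nx, n = I.shape
    rows = []
    for yp in range(ny):
        for xp in range(nx):
            feats = []
            for (dx, dy) in displacements:
                xq, yq = xp + dx, yp + dy
                if 0 <= xq < nx and 0 <= yq < ny:
                    feats.append(I[yq, xq])
                else:                       # adjacent outside D_I
                    feats.append(np.zeros(n))
            rows.append(np.concatenate(feats))
    XI = np.stack(rows)             # |D_I| x nK
    XJ = XI @ Kmat                  # |D_I| x m
    return XJ.reshape(ny, nx, -1)   # back to image form (D_J = D_I)

# Sanity check: a single (0, 0) displacement with identity weights
# reproduces the input image band for band.
I = np.arange(8, dtype=float).reshape(2, 2, 2)
J = conv_as_matmul(I, [(0, 0)], np.eye(2))
```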
Image filtering for feature extraction
The Sobel vertical-edge kernel can enhance the characters of a car plate and, as we will see later, the integral image can be exploited to assign higher scores to the best candidate locations.
Integral image
The pixel values of the integral image $\hat{I}_{int} = (D_I, I_{int})$ of an image $\hat{I} = (D_I, I)$ are defined by

$$I_{int}(p) = \sum_{\forall q \in A(p)} I(q), \qquad A(p) = \{q \in D_I \mid x_q \leq x_p \text{ and } y_q \leq y_p\}.$$
Integral image
The integral value within any rectangular region $R$, delimited by pixels $p_1$, $p_2$, $p_3$, and $p_4$, is

$$\sum_{\forall p \in R} I(p) = I_{int}(p_4) - I_{int}(p_2) - I_{int}(p_3) + I_{int}(p_1).$$

This corresponds to the convolution between the image and a unitary kernel with adjacency defined by $R$ with respect to some origin $p$.
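A minimal NumPy sketch of both definitions, assuming inclusive rectangle coordinates; the integral image is padded with a zero row and column so the four-corner formula needs no border cases:

```python
import numpy as np

def integral_image(I):
    """I_int(p): sum of I(q) over all q with x_q <= x_p and y_q <= y_p,
    with a zero-padded first row and column."""
    Iint = np.zeros((I.shape[0] + 1, I.shape[1] + 1))
    Iint[1:, 1:] = I.cumsum(axis=0).cumsum(axis=1)
    return Iint

def rect_sum(Iint, y1, x1, y2, x2):
    """Sum of I over the rectangle [y1..y2] x [x1..x2] (inclusive),
    computed from the four corner values of the integral image only."""
    return (Iint[y2 + 1, x2 + 1] - Iint[y1, x2 + 1]
            - Iint[y2 + 1, x1] + Iint[y1, x1])

I = np.arange(12, dtype=float).reshape(3, 4)
Iint = integral_image(I)
```

Whatever the rectangle size, the sum costs four lookups, which is what makes the feature extraction on the next slides fast.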
Image filtering for feature extraction
By defining $R$ around each pixel $p \in D_I$, the integral of the edge-enhanced image can be used to define candidates for the plate location by thresholding (i.e., a weak classifier) and connected component analysis.
Image filtering for feature extraction
One may define kernels of different sizes and configurations based on integral images (Haar-like features): the weights are $w \geq 1$ in the white region(s) and $-w$ in the black region(s), or the other way around.
Image filtering for feature extraction
The convolution between an image and a bank of kernels generates a multiband image $\hat{J} = (D_J, \mathbf{J})$ for feature selection and classification. Viola & Jones introduced a fast scheme based on a cascade of weak classifiers for face detection [5].
Systems of weak classifiers
By providing a training set with images and the corresponding annotations of the object location in each image,
one can find the threshold that minimizes the classification error in the training set for each feature (and even at each pixel).
Object detection on an unseen image set, named the test set, can be based on the weighted combination of the decisions from all classifiers.
Some post-processing, such as the analysis of the resulting components, is likely to be required.
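The per-feature threshold search can be sketched as a decision stump (a simplification of the weak learners used in [5]; names are illustrative):

```python
import numpy as np

def best_stump(feature, labels):
    """Exhaustively pick the threshold and polarity minimizing the
    training error of the weak classifier
    h(x) = +1 if polarity * x > polarity * threshold else -1."""
    best_t, best_pol, best_err = None, None, float("inf")
    for t in np.unique(feature):
        for pol in (+1, -1):
            pred = np.where(pol * feature > pol * t, 1, -1)
            err = np.mean(pred != labels)
            if err < best_err:
                best_t, best_pol, best_err = t, pol, err
    return best_t, best_pol, best_err

# A feature that separates the two categories perfectly:
t, pol, err = best_stump(np.array([1.0, 2.0, 3.0, 4.0]),
                         np.array([-1, -1, 1, 1]))
```

A boosting scheme would then reweight the training samples and combine many such stumps, one per selected feature.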
Descriptor and classifier learning
One may rely on knowledge about a given application to define feature extraction.
However, methods based on data processing and analysis, named data-driven approaches, are more popular nowadays.
Data-driven approaches for descriptor and classifier learning may be divided into
supervised (discriminative, wrapper), unsupervised (generative, filter), or semi-supervised (transductive) methods.
Supervised deep neural networks, for instance, design the descriptor and the classifier at the same time [1].
Supervised and unsupervised learning
A sample (which has a different meaning in Statistics) may be an image, pixel, superpixel, object, or subimage around the object.
In supervised learning, we are interested in the case where a classifier assigns samples to one out of $c$ possible categories $\{\omega_k\}_{k=1}^{c}$.
In unsupervised learning, samples are grouped into one out of $g$ clusters $\{G_k\}_{k=1}^{g}$ (clustering) based on their proximity in the feature space.
A "good" descriptor should map samples from the same category into the same group and separate the groups as much as possible in the feature space, despite the absence of category (label) information.
Unsupervised learning
In unsupervised learning, one can
extract features from training samples,
group samples into $g$ clusters $\{G_k\}_{k=1}^{g}$, and
apply some clustering validity measure to evaluate each candidate solution and find the best one among them.
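The grouping step can be sketched with Lloyd's k-means (a minimal illustration, not a method from the slides; initial centers are random data points):

```python
import numpy as np

def kmeans(X, g, iters=50, seed=0):
    """Group the rows of the feature matrix X into g clusters by
    proximity in the feature space (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), g, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center (keep the old one if its cluster emptied)
        centers = np.stack([X[labels == k].mean(axis=0)
                            if np.any(labels == k) else centers[k]
                            for k in range(g)])
    return labels, centers

# Two well-separated blobs are recovered as two clusters.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centers = kmeans(X, 2)
```

A validity measure (e.g., within- versus between-cluster distances) would then score runs with different $g$ to pick the best solution.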
Supervised learning
In supervised learning, one can
take into account the labels of the samples to improve feature extraction,
classify samples into $c$ categories $\{\omega_k\}_{k=1}^{c}$, and
apply an effectiveness measure to evaluate the candidate solution, improve the process, and find the best one among all candidates.
Training samples and feature matrix
For a given training set $Z_{tr} = \{s_j\}_{j=1}^{m}$, a descriptor $D$ is a mapping $Z_{tr} \rightarrow \mathbf{X}_{tr}$ that creates an $n \times m$ feature matrix $\mathbf{X}_{tr}$:

$$\mathbf{X}_{tr} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1m} \\ x_{21} & x_{22} & \ldots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nm} \end{bmatrix}$$

where the columns $[x_{1j}, x_{2j}, \ldots, x_{nj}]^t$ are the feature vectors $\mathbf{x}_j = \mathbf{x}(s_j) = (x_1(s_j), x_2(s_j), \ldots, x_n(s_j))$ of the samples $s_j \in Z_{tr}$.
Once $D$ is defined, the pair $(Z_{tr}, \mathbf{X}_{tr})$ is called the training dataset $\mathcal{D}_{tr}$. Similarly, one can use $D: Z_{ts} \rightarrow \mathbf{X}_{ts}$ to obtain a test dataset $\mathcal{D}_{ts} = (Z_{ts}, \mathbf{X}_{ts})$ from a test set $Z_{ts}$.
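In code, building $\mathbf{X}_{tr}$ amounts to stacking the feature vectors as columns (`descriptor` here stands for any hypothetical function mapping a sample to an $n$-vector):

```python
import numpy as np

def feature_matrix(samples, descriptor):
    """Create the n x m matrix X_tr whose j-th column is the feature
    vector x(s_j) of training sample s_j."""
    return np.stack([descriptor(s) for s in samples], axis=1)

# Toy descriptor mapping a scalar sample s to (s, s^2):
X_tr = feature_matrix([1.0, 2.0, 3.0], lambda s: np.array([s, s ** 2]))
```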
Good practice in machine learning
Good practice in machine learning should
evaluate a method multiple times for statistically independent training and test sets,
use the same sets for each method, and
compare methods using a statistical test suitable for the experiment.
The methods may be distinct classifiers based on the same descriptor, distinct descriptors using the same classification model, image segmentation algorithms, etc.
Unfortunately, the number of labeled samples is not always enough to draw reliable conclusions.
Good practice in machine learning
This process must be repeated multiple times for the statistical analysis of a method or the statistical comparison among methods.
Sample selection
Samples are often randomly selected from a larger set $Z$ to compose the training set $Z_{tr}$ and the test set $Z_{ts}$, such that $Z_{tr} \cap Z_{ts} = \emptyset$ and $Z_{tr} \cup Z_{ts} = Z$.
When their true labels in $Z$ are known a priori, the selection of the same number of samples per category creates balanced sets.
Alternatively, the same percentage of samples per category creates imbalanced sets whenever $Z$ is imbalanced.
For descriptor learning, it is usually better to use balanced training sets, whereas classifier learning must respect the characteristics of the problem as represented by $Z$.
Sample selection: random variable and field
Given that $\mathbf{x}(s) = (x_1(s), x_2(s), \ldots, x_n(s)) \in \mathbb{R}^n$ changes with the random choice of $s \in Z$, $\mathbf{x}$ is said to be a random field with probability density $\rho(\mathbf{x}): \mathbf{x} \rightarrow [0,1]$.
Likewise, each feature $x_i(s) \in \mathbb{R}$, $i \in [1, n]$, changes with the random choice of $s \in Z$, so $x_i$ is said to be a random variable.
Therefore, the random choices of $s_j \in Z$ generate different sequences of observations $\mathbf{x}_j = \mathbf{x}(s_j)$, $j = 1, 2, \ldots, |Z_{tr}| = m$, for training, and likewise for testing.
Sample selection
Whenever the descriptor (or the classifier) has parameters to be optimized, the use of a third statistically independent set $Z_{ev}$, named the evaluation set, for optimization may reduce the chances of overfitting.
For a given descriptor, an apprentice classifier can improve performance along learning iterations as it selects training samples for user supervision. This is called active learning.
Sample selection is never perfect, but cross-validation methods are the preferred ones.
Sample selection
Cross-validation may be called $K$-hold-out, $K$-fold, or $N \times K$-fold [2].
$K$-hold-out: $Z$ is split $K$ times into $P\%$ of samples for $Z_{tr}$ and $(100 - P)\%$ for $Z_{ts}$, $0 < P < 100$, to obtain the statistics of the effectiveness measure in the test sets. The instances of $Z_{tr}$ and $Z_{ts}$ are not statistically independent.
$K$-fold: $Z$ is split into $K$ parts of approximately equal sizes, using each of the parts for testing and the rest for training, $K$ times. The instances of $Z_{ts}$ are statistically independent, but not the instances of $Z_{tr}$.
$N \times K$-fold: $K$-fold is repeated $N$ times, usually with $K = 2$.
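The $K$-fold scheme can be sketched as follows (a hypothetical helper; each sample appears in exactly one test fold):

```python
import numpy as np

def k_fold_splits(m, K, seed=0):
    """Split sample indices 0..m-1 into K parts of approximately equal
    size; yield (train, test) index arrays, each part used once for
    testing and the remaining parts for training."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(m), K)
    for k in range(K):
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        yield train, folds[k]

splits = list(k_fold_splits(10, 5))
```

Stratified variants would additionally preserve the per-category proportions inside each fold, which matters for imbalanced $Z$.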
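The splitting protocols above can be sketched in plain Python; the function names and the toy sample list are illustrative, not part of the course material:

```python
import random

def k_fold_splits(samples, k, seed=0):
    """K-fold: split samples into k folds of near-equal size; each fold
    serves once as the test set Zts, the remaining folds as Ztr."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = [samples[j] for j in folds[i]]
        train = [samples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def n_times_k_fold(samples, n, k):
    """N x K-fold: repeat the K-fold protocol N times with new shuffles."""
    for rep in range(n):
        yield from k_fold_splits(samples, k, seed=rep)
```

Note that, as the slide states, every sample appears in exactly one test fold per K-fold run, which is what makes the test instances statistically independent.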
Effectiveness and confusion matrix

Let nij be the number of times test samples from category ωi have been classified into category ωj, for i, j ∈ [1, c] and mts test samples. The confusion matrix is defined as

$$\begin{array}{c|cccc}
 & \omega_1 & \omega_2 & \cdots & \omega_c \\ \hline
\omega_1 & n_{11} & n_{12} & \cdots & n_{1c} \\
\omega_2 & n_{21} & n_{22} & \cdots & n_{2c} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\omega_c & n_{c1} & n_{c2} & \cdots & n_{cc}
\end{array}$$

The total number of correct classifications is $\sum_{i=1}^{c} n_{ii}$, and $m_{ts} - \sum_{i=1}^{c} n_{ii}$ is the total number of misclassifications.

Several effectiveness measures can be obtained from the confusion matrix (sensitivity, accuracy, specificity, precision, etc.). A good choice is Cohen's kappa, which is robust to imbalanced categories.
Cohen's kappa

Cohen's kappa κ measures the agreement between two raters, A and B, such that nij indicates the number of samples rater A says are from category ωi while rater B says are from category ωj.

$$\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad
P_o = \frac{1}{m_{ts}} \sum_{i=1}^{c} n_{ii}, \qquad
P_e = \frac{1}{m_{ts}^2} \sum_{i=1}^{c} N_A(i)\, N_B(i),$$

where $N_A(i) = \sum_{j=1}^{c} n_{ij}$ is the total number of samples rater A says are from category ωi, and $N_B(i) = \sum_{j=1}^{c} n_{ji}$ is the total number of samples rater B says are from category ωi.
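The totals NA(i), NB(i) and the agreement probabilities Po and Pe translate directly into code; this is a minimal sketch (the function name is illustrative) that computes κ from a c × c confusion matrix n[i][j]:

```python
def cohens_kappa(conf):
    """Cohen's kappa from a c x c confusion matrix conf[i][j]."""
    c = len(conf)
    m_ts = sum(sum(row) for row in conf)
    p_o = sum(conf[i][i] for i in range(c)) / m_ts            # observed agreement P_o
    n_a = [sum(conf[i]) for i in range(c)]                    # row totals N_A(i)
    n_b = [sum(conf[j][i] for j in range(c)) for i in range(c)]  # column totals N_B(i)
    p_e = sum(n_a[i] * n_b[i] for i in range(c)) / m_ts ** 2  # chance agreement P_e
    return (p_o - p_e) / (1 - p_e)
```

For a perfectly diagonal matrix κ = 1; for agreement no better than chance κ approaches 0.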
Statistical tests
Statistical tests provide a formal way to decide whether the results of an experiment are significant or accidental.

For example, one can measure the Cohen's kappa κi(t) of each execution t = 1, 2, ..., T of each classifier Ci, i ∈ [1, n], on T statistically independent test sets during cross validation.

A statistical test starts from a null hypothesis, such as "all classifiers are equivalent", and verifies whether it can be rejected at some significance level p (e.g., p = 0.05).
Statistical tests

First, some measure mo that indicates differences among the classifiers is obtained from the experiment. For example, for n = 2 classifiers and a 5 × 2-fold cross validation, one can compute the variances st² of the differences κ1(t) − κ2(t) over the two folds, for t = 1, 2, ..., 5, and define

$$m_o = \frac{\kappa_1(1) - \kappa_2(1)}{\sqrt{\frac{1}{5}\sum_{t=1}^{5} s_t^2}}.$$

It can be shown that mo (a random variable) satisfies some probability density ρ(mo) when the null hypothesis is satisfied; for this example, a t-distribution with five degrees of freedom.
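The statistic above can be computed as in the following sketch of the classical 5×2cv paired t statistic; the input format (five pairs of per-fold kappa scores per classifier) and the function name are illustrative assumptions:

```python
import math

def t_5x2cv(k1, k2):
    """5 x 2-fold cross-validation t statistic m_o.

    k1, k2: lists of 5 pairs (fold 1 score, fold 2 score) for
    classifiers C1 and C2. Under the null hypothesis, the result
    follows a t-distribution with 5 degrees of freedom."""
    assert len(k1) == len(k2) == 5
    variances = []
    for (a1, a2), (b1, b2) in zip(k1, k2):
        d1, d2 = a1 - b1, a2 - b2            # per-fold score differences
        mean = (d1 + d2) / 2.0
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)  # s_t^2
    d11 = k1[0][0] - k2[0][0]                # kappa_1(1) - kappa_2(1)
    return d11 / math.sqrt(sum(variances) / 5.0)
```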
Statistical tests
The areas under the curve ρ(mo) are tabulated for each value of mo, representing the probability p of the null hypothesis being correct.

If mo is observed above a critical value such that p < 0.05, for instance, we reject the null hypothesis with less than a 5% chance of being wrong.

The most popular tests are Student's t-test, the Wilcoxon signed-rank test, analysis of variance (ANOVA), Tukey's range test, the Nemenyi test, and the Friedman test.
Probability density function

A probability density function (pdf) ρ of a random variable x (e.g., a feature) is a mapping ρ : x → ℜ, such that ρ(x) ≥ 0 and $P(x_o \leq x \leq x_f) = \int_{x_o}^{x_f} \rho(x)\,dx \in [0, 1]$ measures the probability of the value x being in the interval [xo, xf].

The pdf may also be called a probability distribution and, for discrete random variables (e.g., pixel intensity), a probability mass function.

The normalized image histogram, for instance, represents the pdf of the pixel intensity. However, the color x = (x1, x2, x3) of the pixels is a discrete random field. In this case, ρ(x) ≥ 0 defines a manifold in ℜ^4.
Probability density function

In the general case, x = (x1, x2, ..., xn) defines a manifold in ℜ^{n+1}.

The simplest approach to estimate the pdf starts by counting the number Ω(x(s)) of samples t ∈ Ztr whose point x(t) falls within a hypercube of volume h^n (Parzen window) around x(s) ∈ ℜ^n.

Let A be an adjacency relation defined by

$$A(s) = \left\{ t \in Z_{tr} \;\middle|\; |x_i(t) - x_i(s)| \leq \frac{h}{2},\; i = 1, 2, \ldots, n \right\},$$

and w(t) be a kernel weight defined by

$$w(t) = \begin{cases} 1 & \text{if } t \in A(s), \\ 0 & \text{otherwise.} \end{cases}$$
Probability density function

The counting Ω(x(s)) is defined by

$$\Omega(\mathbf{x}(s)) = \sum_{\forall t \in A(s)} w(t).$$

Clearly, the choice of h is important, and a fixed scale k ≥ 1 can make it adaptive:

$$A_k(s) = \{ t \in Z_{tr} \mid \mathbf{x}(t) \text{ is a } k\text{-closest observation of } \mathbf{x}(s) \},$$

$$w(t) = \begin{cases} \exp\left[ \dfrac{-\|\mathbf{x}(t) - \mathbf{x}(s)\|^2}{2\sigma^2} \right] & \text{if } t \in A_k(s), \\ 0 & \text{otherwise,} \end{cases}$$

for $\sigma = \frac{1}{3} \max_{\forall (s,t) \in A_k} \|\mathbf{x}(t) - \mathbf{x}(s)\|$ [4].
Probability density function

Let u(1), u(2), ..., u(L) be the set of distinct observations x(s), ∀s ∈ Ztr. Then the probability density function ρ can be estimated at any point x(s) = u(j) ∈ ℜ^n, 1 ≤ j ≤ L, as

$$\rho(\mathbf{u}(j)) = \frac{\Omega(\mathbf{u}(j))}{\sum_{i=1}^{L} \Omega(\mathbf{u}(i))},$$

with Ω(u(j)) ← Ω(x(s)) and ρ(s) ← ρ(x(s)) = ρ(u(j)).
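The hypercube counting and normalization above can be sketched as follows; this minimal version assumes all observations are distinct tuples, so the estimate is computed directly at every training point (function name illustrative):

```python
def parzen_pdf(points, h):
    """Estimate a pdf at each training point using a hypercube
    Parzen window of side h: w(t) = 1 inside the window, 0 outside.
    Returns the normalized counts rho(u(j))."""
    def in_window(p, q):
        # q is adjacent to p when every coordinate differs by at most h/2
        return all(abs(pi - qi) <= h / 2.0 for pi, qi in zip(p, q))
    omega = [sum(1 for q in points if in_window(p, q)) for p in points]
    total = float(sum(omega))
    return [w / total for w in omega]
```

The values sum to 1 by construction, matching the normalization over all distinct observations.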
Probability density function
For an image I = (DI, I), for instance, one can create a pdf image by assigning to each pixel p ∈ DI a pdf value ρ(p) as estimated in the feature space defined by I.
More basic concepts in Statistics

ρ(x) ≥ 0, x = (x1, x2, ..., xn), is called the joint pdf.

When ρ(x) = ρ(x1)ρ(x2)···ρ(xn), the variables are said to be statistically independent.

Let U = {u1, u2, ..., uLi} and V = {v1, v2, ..., vLj} be the respective sets of distinct observations xi(s) and xj(s), 1 ≤ i ≠ j ≤ n, ∀s ∈ Ztr (i.e., observations of two random variables of x).
More basic concepts in Statistics
The Entropy H of a pdf ρ(xi ) measures the unpredictability ofxi — i.e., less uniform is ρ(xi ), lower is H, higher is thepredictability.
H(ρ) = −Li∑
k=1
ρ(uk) log2 ρ(uk).
Given ρ1(xi ) and ρ2(xi ), obtained from two training sets (orimages), the relative entropy, or Kullback-Leibler distanceD(ρ1, ρ2), measures their cross entropy.
D(ρ1, ρ2) =
Li∑k=1
ρ2(uk) lnρ2(uk)
ρ1(uk).
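Both quantities above fit in a few lines; this sketch uses illustrative function names, and the KL distance follows the slide's convention with ρ2 in the numerator:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def kl_distance(p1, p2):
    """Kullback-Leibler distance D(p1, p2) = sum p2 ln(p2 / p1)."""
    return sum(q * math.log(q / p) for p, q in zip(p1, p2) if q > 0)
```

A uniform two-valued distribution has the maximum entropy of 1 bit, and D(ρ1, ρ2) = 0 exactly when the two distributions coincide.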
More basic concepts in Statistics

Given ρ1(xi) and ρ2(xj), from possibly distinct training sets, the mutual information I(ρ1, ρ2) is the reduction in uncertainty about one variable due to the knowledge of the other.

$$I(\rho_1, \rho_2) = H(\rho_1) - H(\rho_1 \mid \rho_2),$$

$$I(\rho_1, \rho_2) = \sum_{k=1}^{L_i} \sum_{l=1}^{L_j} \rho(u_k, v_l) \log_2 \frac{\rho(u_k, v_l)}{\rho_1(u_k)\,\rho_2(v_l)}.$$

Mutual information is widely used when aligning two images in the same domain (image registration): the alignment aims at maximizing the mutual information.
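The double sum above can be sketched as follows, assuming the joint distribution is given as a table joint[k][l] = ρ(uk, vl) and the marginals are obtained by summing rows and columns:

```python
import math

def mutual_information(joint):
    """Mutual information in bits from a joint distribution table."""
    rows, cols = len(joint), len(joint[0])
    p1 = [sum(row) for row in joint]                         # marginal of u
    p2 = [sum(joint[k][l] for k in range(rows)) for l in range(cols)]  # marginal of v
    mi = 0.0
    for k in range(rows):
        for l in range(cols):
            p = joint[k][l]
            if p > 0:
                mi += p * math.log2(p / (p1[k] * p2[l]))
    return mi
```

For an independent joint (ρ(u, v) = ρ1(u)ρ2(v)) the result is 0, and for two perfectly dependent binary variables it is 1 bit.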
More basic concepts in Statistics

The mean μi = E[xi] of a random variable xi can be estimated by the first moment

$$\mu_i = \sum_{k=1}^{L_i} u_k\, \rho(u_k).$$

The variance σi² = E[(xi − μi)²] can be estimated by the second moment of (xi − μi)

$$\sigma_i^2 = \sum_{k=1}^{L_i} [u_k - \mu_i]^2\, \rho(u_k).$$
More basic concepts in Statistics

The cross-moment σij = σji = E[(xi − μi)(xj − μj)] (covariance) of xi and xj can be estimated as

$$\sigma_{ij} = \sum_{k=1}^{L_i} \sum_{l=1}^{L_j} [(u_k - \mu_i)(v_l - \mu_j)]\, \rho(u_k, v_l).$$

The mean vector is μ = E[x] ∈ ℜ^n and the covariance matrix Σ = E[(x(s) − μ)(x(s) − μ)ᵗ] is

$$\Sigma = \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2
\end{bmatrix}.$$
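The first and second moments above can be estimated directly from a list of observations; this sketch assumes equal sample weights (ρ is the empirical distribution) and divides by m, matching the expectation-based definitions:

```python
def mean_and_covariance(X):
    """Sample mean vector and covariance matrix of observations X,
    a list of m points, each a list of n coordinates."""
    m, n = len(X), len(X[0])
    mu = [sum(x[i] for x in X) / m for i in range(n)]
    sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in X) / m
              for j in range(n)] for i in range(n)]
    return mu, sigma
```

The diagonal of the result holds the variances σi² and the matrix is symmetric, as in the slide.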
More basic concepts in Statistics

The joint distribution satisfies ρ(uk, vl) = ρ(uk)ρ(vl | uk) = ρ(vl)ρ(uk | vl).

The Cauchy-Schwarz inequality says that σij² ≤ σi²σj².

The Pearson correlation coefficient is defined as σij / (σiσj).

The variables xi and xj are said to be uncorrelated when σij = 0.
More basic concepts in Statistics

For c categories {ωk}, k = 1, ..., c, the Bayes rule says that the occurrence probability of a category ωk given an observation x is

$$P(\omega_k \mid \mathbf{x}) = \frac{P(\omega_k)\, \rho(\mathbf{x} \mid \omega_k)}{\rho(\mathbf{x})},$$

where P(ωk | x) is named the posterior probability, P(ωk) is the prior probability, the conditional density function ρ(x | ωk) is the likelihood, and ρ(x) is the evidence.

The evidence is $\rho(\mathbf{x}) = \sum_{i=1}^{c} P(\omega_i)\, \rho(\mathbf{x} \mid \omega_i)$.

The estimation of ρ(x | ωk) can be similar to that of ρ(x), but using only adjacent samples t ∈ Ztr from category ωk.
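A minimal numeric sketch of the Bayes rule above; the priors and likelihoods are illustrative inputs, and the evidence is simply the normalizing sum over categories:

```python
def posterior(priors, likelihoods):
    """Posterior probabilities P(w_k | x) from prior probabilities
    P(w_k) and likelihoods rho(x | w_k), one entry per category."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))  # rho(x)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]
```

Because the evidence normalizes the products, the posteriors always sum to 1.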
Clustering methods
As we will see in the course, it is possible to separate the g domes of the pdf manifold ρ(x) into g clusters.

[Figure: heat map of the pdf; 4 groups; 2 categories.]
This is the result of the method in [4].
Clustering methods
For images, using the Lab color space, one can obtain the following results.

[Figure: image; pdf; groups.]
Clustering methods
The most used method, however, assumes that the clusters are hyper-spheres: the g-means clustering.

[Figure: g = 2 groups; g = 4 groups; 2 categories.]
Clustering methods

The g-means clustering algorithm finds g groups {Gk}, k = 1, ..., g (clusters), by assigning each sample s ∈ Ztr = {sj}, j = 1, ..., m, m ≫ g, to one group, such that

$$\sum_{k=1}^{g} \sum_{\forall s \in G_k} \|\mathbf{x}(s) - \boldsymbol{\mu}_k\|^2$$

is minimized, where

$$\boldsymbol{\mu}_k = \frac{1}{|G_k|} \sum_{\forall s \in G_k} \mathbf{x}(s)$$

is the centroid of group Gk. The algorithm works as follows.
Clustering methods

Input: A training dataset (Ztr, Xtr).
Output: A label map L : Ztr → {k}, k = 1, ..., g (i.e., L(s) = k ⇒ s ∈ Gk).

1. Select g random centroids {μk}, k = 1, ..., g, from {x(sj)}, j = 1, ..., m.
2. For each iteration t = 1, 2, ..., T do:
3.   For each sample s ∈ {sj}, j = 1, ..., m, do:
4.     Set L(s) ← arg min_{k=1,2,...,g} ‖x(s) − μk‖².
5.   For each group Gk, k = 1, 2, ..., g, do:
6.     Update μk ← (1/|Gk|) Σ_{∀s ∈ Ztr | L(s)=k} x(s).
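The steps above can be sketched as follows; the initialization by sampling g distinct training points and the fixed iteration count T are simplifications of the stopping criterion discussed next:

```python
import random

def g_means(X, g, iterations=20, seed=0):
    """g-means (k-means) clustering of points X (lists of floats).
    Returns the label map L as a list of group indices in [0, g)."""
    rnd = random.Random(seed)
    centroids = [list(x) for x in rnd.sample(X, g)]  # step 1: random centroids
    labels = [0] * len(X)
    for _ in range(iterations):                      # step 2: T iterations
        # steps 3-4: assign each sample to its closest centroid
        for i, x in enumerate(X):
            labels[i] = min(range(g), key=lambda k: sum(
                (a - b) ** 2 for a, b in zip(x, centroids[k])))
        # steps 5-6: recompute each centroid as the mean of its group
        for k in range(g):
            members = [X[i] for i in range(len(X)) if labels[i] == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels
```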
Clustering methods

The algorithm may be interrupted when the differences between previous and current centroids are negligible, and a test sample s ∈ Zts is assigned to the group of its closest centroid in ℜ^n.

The representative x(s) of group Gk can also be selected as the observation closest to the others in Gk:

$$\mathbf{x}(s) \leftarrow \arg\min_{\mathbf{x}(s'),\ \forall s', t \in Z_{tr} \mid L(t) = L(s') = k} \|\mathbf{x}(t) - \mathbf{x}(s')\|^2.$$

The observation x(s) is called a medoid and the method becomes g-medoids.

Other popular clustering approaches are mean-shift, normalized cut, Gaussian mixture models, and single-linkage.
Classification methods

Let Li(x) be a discriminant function that assigns a sample s ∈ Ztr with observation x(s) to category ωi, by setting label L(s) ← i, 1 ≤ i ≤ c, when

$$\omega_i = \arg\max_{j=1,2,\ldots,c} L_j(\mathbf{x}(s)).$$

A Bayesian classifier adopts Li(x) = P(ωi | x).

Indeed, one can use any other equivalent function, such as Li(x) = log[P(ωi)ρ(x | ωi)].
Bayesian classifier

Let ρ(x | ωi) be a Normal distribution N(μi, Σi). Then,

$$L_i(\mathbf{x}) = \log[P(\omega_i)] + \log\left[ \frac{1}{(2\pi)^{\frac{n}{2}} \sqrt{|\Sigma_i|}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right] \right],$$

where μi and Σi can be estimated as the mean vector and covariance matrix of the observations x(s), s ∈ Ztr, whose true label λ(s) = i (i.e., s ∈ ωi). The argument (x − μi)ᵗ Σi⁻¹ (x − μi) is the squared Mahalanobis distance between x(s) and μi.
Quadratic discriminant classifier

A quadratic discriminant classifier (QDC) adopts

$$L_i(\mathbf{x}) = w_{i0} + \mathbf{w}_i^t \mathbf{x} + \mathbf{x}^t W_i \mathbf{x},$$

where x, wi ∈ ℜ^n, Wi is an n × n matrix, and wi0 ∈ ℜ.

By adopting ρ(x | ωi) = N(μi, Σi), the Bayesian classifier becomes a QDC, where

$$w_{i0} = \log[P(\omega_i)] - \frac{1}{2}\boldsymbol{\mu}_i^t \Sigma_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2}\log(|\Sigma_i|), \qquad
\mathbf{w}_i^t = \boldsymbol{\mu}_i^t \Sigma_i^{-1}, \qquad
W_i = -\frac{1}{2}\Sigma_i^{-1}.$$

Obs: Category-independent terms are eliminated.
Linear discriminant classifier

A linear discriminant classifier (LDC) adopts

$$L_i(\mathbf{x}) = w_{i0} + \mathbf{w}_i^t \mathbf{x}.$$

By adopting ρ(x | ωi) = N(μi, ST), with $S_T = \frac{1}{m}\sum_{i=1}^{c} m_i \Sigma_i$, where mi is the number of training samples from category ωi, the Bayesian classifier becomes an LDC, where

$$w_{i0} = \log[P(\omega_i)] - \frac{1}{2}\boldsymbol{\mu}_i^t S_T^{-1} \boldsymbol{\mu}_i, \qquad
\mathbf{w}_i^t = \boldsymbol{\mu}_i^t S_T^{-1}.$$

Obs: Category-independent terms are eliminated.
The k-nearest neighbor classifier

The k-nearest neighbor classifier, k ≥ 1, adopts the k-closest adjacency relation

$$A_k(s) = \{ t \in Z_{tr} \mid \mathbf{x}(t) \text{ is a } k\text{-closest observation of } \mathbf{x}(s) \}$$

and counts the number ki of samples t whose true label λ(t) = i, for each i = 1, 2, ..., c.

It then approximates Li(x) = P(ωi | x) ≈ ki/k and assigns to s the label L(s) ∈ {i}, i = 1, ..., c (i.e., it classifies s as belonging to ωi), when Li(x(s)) = max_{j=1,2,...,c} Lj(x(s)).
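The rule above can be sketched in plain Python; the training set is assumed to be a list of (observation, label) pairs, and ties among the estimated posteriors ki/k are broken arbitrarily:

```python
def knn_classify(train, x, k):
    """k-NN: return the label with the largest count k_i among the
    k nearest neighbors of x (squared Euclidean distance)."""
    neighbors = sorted(train, key=lambda pt: sum(
        (a - b) ** 2 for a, b in zip(pt[0], x)))[:k]
    counts = {}
    for _, label in neighbors:
        counts[label] = counts.get(label, 0) + 1
    # maximizing k_i is the same as maximizing the estimate k_i / k
    return max(counts, key=counts.get)
```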
Role of feature space reduction

The reduction of the feature space ℜ^n to some dimension 1 ≤ k < n is also useful to handle the curse of high dimensionality or to understand the distribution of the observations x(s), ∀s ∈ Ztr, in ℜ^n.

The reduction can use a linear (e.g., PCA, LDA) or a non-linear (e.g., MDS, t-SNE) projection.

A linear projection is Ytr = W Xtr, where W is a k × n matrix and Xtr is the n × m feature matrix of the m training samples.

In the reduction by principal component analysis (PCA), the rows of W are the eigenvectors corresponding to the k highest eigenvalues of the covariance matrix Σ of the observations.
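Assuming NumPy is available, the PCA projection above can be sketched as follows; the shapes follow the slide's convention (Xtr is n × m with one column per sample, W is k × n), and centering the data before projecting is a standard choice not stated explicitly on the slide:

```python
import numpy as np

def pca_projection(X, k):
    """PCA reduction: X is an n x m feature matrix (one column per
    sample). Returns the k x n projection matrix W, whose rows are
    the eigenvectors of the k highest eigenvalues of the covariance
    matrix, and the projected k x m data Y = W (X - mean)."""
    mu = X.mean(axis=1, keepdims=True)
    cov = np.cov(X)                        # n x n covariance matrix
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]     # indices of the k highest
    W = vecs[:, order].T                   # rows are eigenvectors
    return W, W @ (X - mu)
```

For data that already lies on a k-dimensional subspace, projecting and back-projecting with W recovers the centered data exactly.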
Role of feature space reduction

The reduction by linear discriminant analysis (LDA) considers the true label λ(s) ∈ {i}, i = 1, ..., c (i.e., s ∈ ωi), of the training samples s ∈ Ztr, and the rows of W are the eigenvectors corresponding to the k = c − 1 highest eigenvalues of the matrix $S_B S_W^{-1}$, where

$$S_B = \sum_{i=1}^{c} m_i (\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^t, \qquad
S_W = \sum_{i=1}^{c} S_i, \qquad
S_i = \sum_{\forall s \in Z_{tr} \mid \lambda(s) = i} (\mathbf{x}(s) - \boldsymbol{\mu}_i)(\mathbf{x}(s) - \boldsymbol{\mu}_i)^t,$$

$$\boldsymbol{\mu} = \frac{1}{m} \sum_{\forall s \in Z_{tr}} \mathbf{x}(s), \qquad
\boldsymbol{\mu}_i = \frac{1}{m_i} \sum_{\forall s \in Z_{tr} \mid \lambda(s) = i} \mathbf{x}(s).$$
Role of feature space reduction
From Rauber et al. [3].
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[2] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.

[3] P.E. Rauber, A.X. Falcao, and A.C. Telea. Projections as visual aids for classification system design. Information Visualization, 2017.

[4] L.M. Rocha, F.A.M. Cappabianco, and A.X. Falcao. Data clustering as an optimum-path forest problem with applications in image analysis. Int. J. Imaging Syst. Technol., 19(2):50-68, June 2009.

[5] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511 to I-518, 2001.