Document Analysis: Parameter Estimation for Pattern Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
Outline

- Introduction
- Parameter estimation
- Non-parametric classifiers: kNN
- Neural networks
- Hidden Markov Models
- Other approaches
Introduction
Bayesian decision theory provides a theoretical framework for statistical pattern recognition
It supposes the following probabilistic information to be available:
- n, the number of classes
- P(ω_i), the a priori probability (prior) of each class ω_i
- p(x|ω_i), the distribution of the feature vector x, depending on the class ω_i

How can these values and functions be estimated? In particular, how can the class-dependent distribution (or density) functions be estimated?
Approaches for statistical pattern recognition
Several approaches try to overcome the difficulty of obtaining the class-dependent feature distributions (or densities):
- Parameter estimation: the form of the distributions is supposed to be known; only some parameters have to be estimated from training samples
- Parzen windows: densities are estimated from training samples by "smoothing" them with a window function
- k-nearest neighbors (kNN) rule: the decision is associated with the dominant class among the k nearest neighbors taken from the training samples
- Functional discrimination: the decision consists in minimizing an objective function within an augmented feature space
Parameter Estimation
By hypothesis, the following information is supposed to be known:
- n, the number of classes
- for each class ω_i:
  - the a priori probability P(ω_i)
  - the functional form of the class-conditional feature densities p(x | ω_i, θ_i), with unknown parameters θ_i
  - a labeled set of training data D_i = {x_i1, x_i2, ..., x_iNi}, supposed to be drawn randomly from ω_i

In fact, parameter estimation can be performed class by class.
Maximum likelihood criterion

Maximum likelihood estimation consists in determining the parameter vector θ_i that maximizes the likelihood of D_i, i.e.

$$p(D_i \mid \theta_i) = \prod_k p(x_{ik} \mid \theta_i)$$

For some distributions, the problem can be solved analytically through the log-likelihood equations

$$\nabla_{\theta_i} \sum_k \ln p(x_{ik} \mid \theta_i) = 0$$

(but is the solution really a maximum?)

If the solution cannot be found analytically, it can be computed iteratively by a gradient ascent method, as sketched below.
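A minimal numerical sketch of this idea (not from the slides; it assumes NumPy and SciPy are available, and the one-dimensional Gaussian data are invented for the example):

```python
import numpy as np
from scipy.optimize import minimize

# Invented 1-D training sample for one class
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_log_likelihood(theta, x):
    """Negative log-likelihood of a univariate Gaussian N(mu, sigma^2)."""
    mu, log_sigma = theta                  # optimize log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * ((x - mu) / sigma) ** 2
                   - np.log(sigma) - 0.5 * np.log(2 * np.pi))

# Gradient-based maximization of the likelihood (by minimizing its negative)
res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
print(res.x[0], np.exp(res.x[1]))          # close to the analytic ML estimates
```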
Univariate Gaussian distribution
In one dimension, the normal distribution N(μ, σ²) is defined by the expression

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]$$

- μ represents the mean
- σ² represents the variance
- the maximum of the curve corresponds to $p(\mu) = \frac{1}{\sigma\sqrt{2\pi}} \approx \frac{0.399}{\sigma}$
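A quick check of this formula (a sketch assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# The peak at x = mu equals 1/(sigma * sqrt(2*pi)) ~ 0.399/sigma
print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.3989
```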
Multivariate Gaussian distribution
In d dimensions, the generalized normal distribution N(μ, Σ) is defined by

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x-\mu)^t \Sigma^{-1} (x-\mu) \right]$$

where
- μ represents the mean vector: $\mu = E[x] = \int x\, p(x)\, dx$, with components $\mu_i = E[x_i]$
- Σ represents the covariance matrix: $\Sigma = E[(x-\mu)(x-\mu)^t] = \int (x-\mu)(x-\mu)^t\, p(x)\, dx$, with components $\sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$
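A direct transcription of this density (a sketch assuming NumPy; μ and Σ are invented):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, cov):
    """Density of N(mu, cov) at a d-dimensional feature vector x."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

mu = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
print(multivariate_normal_pdf(np.array([1.0, -1.0]), mu, cov))
```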
Interpretation of the parameters
The mean vector μ represents the center of the distribution. The covariance matrix Σ describes the scatter:
- it is symmetric: σ_ij = σ_ji
- it is positive semidefinite (usually positive definite): σ_ii = σ_i² ≥ 0
- the principal axes of the equal-density hyperellipsoids are given by the eigenvectors of Σ
- the lengths of the axes are given by the eigenvalues
- if two features x_i and x_j are statistically independent, then σ_ij = σ_ji = 0
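The eigendecomposition can be computed directly (a sketch assuming NumPy; the covariance matrix is invented):

```python
import numpy as np

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])

# Eigenvectors give the directions of the principal axes of the
# equal-density hyperellipsoids; eigenvalues determine their lengths.
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: cov is symmetric
print("axis directions:\n", eigvecs)
print("eigenvalues:", eigvals)
```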
Mahalanobis distance
Regions of constant density are hyperellipsoids centered at μ and characterized by the equation

$$(x-\mu)^t \Sigma^{-1} (x-\mu) = C$$

where C is a positive constant. The (squared) Mahalanobis distance from x to μ is defined as

$$r^2 = (x-\mu)^t \Sigma^{-1} (x-\mu)$$
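A one-function sketch (assuming NumPy; the numbers are invented):

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x-mu)^t cov^-1 (x-mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(cov, diff)   # solve() avoids an explicit inverse

mu = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
print(mahalanobis_sq(np.array([1.0, -1.0]), mu, cov))
```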
Estimation of μ and Σ for normal distributions

In the one-dimensional case, the maximum likelihood criterion leads to the following equations:

$$\sum_k \frac{1}{\sigma^{*2}} (x_k - \mu^*) = 0 \qquad -\sum_k \frac{1}{2\sigma^{*2}} + \sum_k \frac{(x_k - \mu^*)^2}{2\sigma^{*4}} = 0$$

whose solution is

$$\mu^* = \frac{1}{n} \sum_k x_k \qquad \sigma^{*2} = \frac{1}{n} \sum_k (x_k - \mu^*)^2$$

Generalized to the multi-dimensional case, we obtain

$$\mu^* = \frac{1}{n} \sum_k x_k \qquad \Sigma^* = \frac{1}{n} \sum_k (x_k - \mu^*)(x_k - \mu^*)^t$$
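These closed-form estimates are one-liners in code (a sketch assuming NumPy; the sample is invented):

```python
import numpy as np

def ml_gaussian_estimates(samples):
    """ML estimates of the mean vector and covariance matrix (1/n version)."""
    n = len(samples)
    mu = samples.mean(axis=0)
    centered = samples - mu
    sigma = centered.T @ centered / n      # biased ML estimate (see next slide)
    return mu, sigma

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=500)
mu_hat, sigma_hat = ml_gaussian_estimates(X)
print(mu_hat, sigma_hat, sep="\n")
```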
Bias Problem
The ML estimate of σ² (resp. Σ) is biased: its expected value over all sample sets of size n differs from the true variance, namely

$$E\left[ \frac{1}{n} \sum_k (x_k - \bar{x})^2 \right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

An unbiased estimate would be

$$\hat{\sigma}^2 = \frac{1}{n-1} \sum_k (x_k - \mu^*)^2$$

Both estimators converge asymptotically.

Which estimator is correct?
- they are neither right nor wrong!
- neither one has all desirable properties
- Bayesian learning theory can give an answer
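The bias is easy to observe empirically (a sketch assuming NumPy; the true variance and sample size are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
true_var, n, trials = 4.0, 10, 100_000

# Average both estimators over many samples of size n
biased, unbiased = 0.0, 0.0
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    biased += np.var(x)              # 1/n version (ML, biased)
    unbiased += np.var(x, ddof=1)    # 1/(n-1) version (unbiased)

print(biased / trials)     # ~ (n-1)/n * 4.0 = 3.6
print(unbiased / trials)   # ~ 4.0
```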
Discriminant functions for normal distributions (1)
For normal distributions, the following discriminant functions may be stated:

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1} (x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

Dropping the class-independent term (d/2) ln 2π:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

In the case where all classes share the same covariance matrix Σ, this reduces to

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma^{-1} (x-\mu_i) + \ln P(\omega_i)$$

which expands into the linear form

$$g_i(x) = \mu_i^t \Sigma^{-1} x - \frac{1}{2}\mu_i^t \Sigma^{-1} \mu_i + \ln P(\omega_i)$$

so the decision boundaries are linear.
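The linear form translates directly into code (a sketch assuming NumPy; the two classes and their shared covariance are invented):

```python
import numpy as np

def linear_discriminant(x, mu_i, cov_shared, prior_i):
    """g_i(x) for classes sharing one covariance matrix (linear in x)."""
    w = np.linalg.solve(cov_shared, mu_i)        # Sigma^-1 mu_i
    w0 = -0.5 * mu_i @ w + np.log(prior_i)
    return w @ x + w0

cov = np.array([[1.0, 0.2],
                [0.2, 1.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
x = np.array([1.2, 0.8])
scores = [linear_discriminant(x, mu, cov, prior_i=0.5) for mu in mus]
print(int(np.argmax(scores)))   # decide for the class with the largest g_i(x)
```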
Linear decision boundaries for normal distributions

[Figure: examples of linear decision boundaries between Gaussian classes sharing the same covariance matrix]
Discriminant functions for normal distributions (2)
In the case of arbitrary covariance matrices, the decision boundaries become quadratic:

$$g_i(x) = -\frac{1}{2} x^t \Sigma_i^{-1} x + \mu_i^t \Sigma_i^{-1} x - \frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
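The quadratic case in code (a sketch assuming NumPy, using the equivalent unexpanded form of g_i; the class parameters are invented):

```python
import numpy as np

def quadratic_discriminant(x, mu_i, cov_i, prior_i):
    """g_i(x) with class-specific covariance matrices (quadratic in x)."""
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.solve(cov_i, diff)
            - 0.5 * np.log(np.linalg.det(cov_i))
            + np.log(prior_i))

params = [(np.array([0.0, 0.0]), np.eye(2), 0.5),
          (np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 1.0]]), 0.5)]
x = np.array([1.0, 1.5])
print(int(np.argmax([quadratic_discriminant(x, mu, cov, p)
                     for mu, cov, p in params])))
```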
Font Recognition: 1D-Gaussian estimation (1)

Font style discrimination (■ roman ■ italic) using hpd-stdev:
- estimated models fit the real distributions
- the decision boundary is accurate
- the recognition accuracy (96.3%) is confirmed by the experimental confusion matrix:

            roman   italic
  roman      5610      390
  italic       49     5951

[Figure: estimated class-conditional densities and decision boundary]
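The accuracy figure follows from the matrix (a sketch assuming NumPy; rows are taken to be the true classes):

```python
import numpy as np

# Confusion matrix from the slide (rows: true class, columns: decision)
conf = np.array([[5610, 390],
                 [49, 5951]])

# Recognition accuracy = correctly classified samples / all samples
print(f"{np.trace(conf) / conf.sum():.1%}")   # 96.3%
```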
Font Recognition: 1D-Gaussian estimation (2)

Font boldness discrimination (■ normal ■ bold) using hr-mean:
- estimated models do not fit the real distributions
- the decision boundary is nevertheless surprisingly well adapted
- the recognition accuracy (97.6%) is high, as observed from the experimental confusion matrix:

            normal    bold
  normal      5942      58
  bold         228    5772

[Figure: estimated class-conditional densities and decision boundary]
Font Recognition: 1D-Gaussian estimation (3)

Boldness is generally dependent on the font family: hr-mean can perfectly discriminate ■ normal and ■ bold fonts if the font family is known (recognition rate > 99.9%).

[Figure: hr-mean histograms in four panels: Times, Courier, Arial, and all families together]
Font Recognition: 1D-Gaussian estimation (4)

Font family discrimination (■ Arial, ■ Courier, ■ Times) using hr-mean:
- estimated models do not fit the real distributions at all
- the decision boundaries are inadequate
- the recognition accuracy (41.9%) is bad:

             Arial  Courier  Times
  Arial       2000      403   1597
  Courier        5     2004   1991
  Times       1055     1927   1018

[Figure: estimated class-conditional densities]
Font Recognition: 1D Multi-Gaussian estimation

Font family discrimination (■ Arial, ■ Courier, ■ Times) using hr-mean, supposing the font style to be known for learning:
- estimated models fit the real distributions
- the decision boundaries are adequate
- the recognition accuracy (89.6%) is nearly optimal for the given feature:

             Arial  Courier  Times
  Arial       3722      125    153
  Courier      119     3753    128
  Times        517      209   3274

[Figure: estimated per-style Gaussian models]
Font Recognition: 2D-Gaussian estimation

Font family discrimination (■ Arial, ■ Courier, ■ Times) using two features, hr-stdev and vr-mean:
- the models fit approximately two classes, but not the third one
- the decision boundary is surprisingly well adapted
- the recognition accuracy (93.5%) is reasonable:

             Arial  Courier  Times
  Arial       3918       82      0
  Courier      175     3683    142
  Times         18      364   3618

[Figure: 2D feature scatter with estimated Gaussian models]
Font Recognition: General Gaussian estimation

The performance of font family discrimination (■ Arial, ■ Courier, ■ Times) depends on the feature set used (a classifier sketch follows the list):
- hr-stdev: recognition rate 72.7%
- hr-stdev, vr-mean: recognition rate 93.5%
- hp-mean, hr-mean, vr-mean: recognition rate 98.0%
- hp-mean, hpd-stdev, hr-mean, vr-mean, hr-stdev, vr-stdev: recognition rate 99.7%
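A generic Gaussian classifier over any chosen feature subset might look like this (a sketch assuming NumPy; the feature matrix, labels, and column indices are hypothetical):

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """ML-fit one Gaussian per class; returns (mu, cov, prior) triples."""
    models = []
    for c in np.unique(y):
        Xc = X[y == c]
        models.append((Xc.mean(axis=0),
                       np.cov(Xc, rowvar=False, bias=True),  # ML (1/n) estimate
                       len(Xc) / len(X)))
    return models

def classify(x, models):
    """Pick the class maximizing the quadratic discriminant g_i(x)."""
    scores = []
    for mu, cov, prior in models:
        diff = x - mu
        scores.append(-0.5 * diff @ np.linalg.solve(cov, diff)
                      - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))
    return int(np.argmax(scores))

# Hypothetical usage: restrict X to the columns of the chosen feature subset,
# e.g. X_sub = X[:, [idx_hr_stdev, idx_vr_mean]], then fit and evaluate.
```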
Font recognition: classifier for all 12 classes

Discrimination of all fonts using all features (hp-mean, hpd-stdev, hr-mean, hr-stdev, vr-mean, vr-stdev):
- overall recognition rate of 99.6%
- most errors are due to roman/italic confusion

Experimental confusion matrix (1000 test samples per class):

   992    0    7    0    0    1    0    0    0    0    0    0
     0  996    0    4    0    0    0    0    0    0    0    0
     1    0  996    0    0    2    0    1    0    0    0    0
     0    4    0  996    0    0    0    0    0    0    0    0
     0    0    0    0 1000    0    0    0    0    0    0    0
     0    0    0    0    0  983    0   17    0    0    0    0
     0    0    0    0    2    0  998    0    0    0    0    0
     0    0    5    0    0    4    0  991    0    0    0    0
     0    0    0    0    0    0    0    0 1000    0    0    0
     0    0    0    0    0    0    0    0    0 1000    0    0
     0    0    0    0    0    0    0    0    0    0 1000    0
     0    0    0    0    0    0    0    0    0    0    0 1000
Error types
In a Bayesian classifier using parameter estimation, several error types occur:
- Indistinguishability errors, due to the overlapping of distributions: they are inherent to the problem and cannot be reduced
- Modeling errors, due to a bad choice of the parametric density functions (models): they can be avoided by changing the models
- Estimation errors, due to the imprecision of the training data: they can be reduced by increasing the amount of training data
Influence of the size of training data
Evolution of the error rate as a function of the training set size (experiment with 4 training sets and 2 test sets, ■ average); a simulation sketch follows.

[Figure: recognition rate (about 0.994 to 0.996) versus training set size (50 to 200)]
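Such a learning curve can be reproduced on synthetic data (a self-contained sketch assuming NumPy; the two Gaussian classes, sample sizes, and equal priors are all invented):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n_per_class):
    """Two invented Gaussian classes in a 2-D feature space."""
    X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(n_per_class, 2)),
                   rng.normal([2.0, 1.0], 1.0, size=(n_per_class, 2))])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def fit_and_error(X_tr, y_tr, X_te, y_te):
    """Fit one Gaussian per class (ML) and return the test error rate."""
    models = [(X_tr[y_tr == c].mean(axis=0),
               np.cov(X_tr[y_tr == c], rowvar=False, bias=True))
              for c in (0, 1)]

    def g(x, mu, cov):
        # equal priors, so the ln P(w_i) term is constant and omitted
        d = x - mu
        return (-0.5 * d @ np.linalg.solve(cov, d)
                - 0.5 * np.log(np.linalg.det(cov)))

    pred = np.array([np.argmax([g(x, mu, cov) for mu, cov in models])
                     for x in X_te])
    return (pred != y_te).mean()

X_te, y_te = make_data(1000)
for n in (50, 100, 150, 200):       # growing training sets, as in the plot
    X_tr, y_tr = make_data(n)
    print(n, fit_and_error(X_tr, y_tr, X_te, y_te))
```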