Document Analysis: Parameter Estimation for Pattern Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
Outline

- Introduction
- Parameter estimation
- Non-parametric classifiers: kNN
- Neural networks
- Hidden Markov Models
- Other approaches
Introduction
Bayesian decision theory provides a theoretical framework for statistical pattern recognition
It supposes the following probabilistic information to be available:
- n, the number of classes
- P(ω_i), the a priori probability (prior) of each class ω_i
- p(x|ω_i), the distribution of the feature vector x, depending on the class ω_i

How can these values and functions be estimated? In particular, how can the class-dependent distribution (or density) functions be estimated?
Approaches for statistical pattern recognition
Several approaches try to overcome the difficulty of obtaining the class-dependent feature distributions (or densities):
- Parameter estimation: the form of the distributions is supposed to be known; only some parameters have to be estimated from training samples
- Parzen windows: densities are estimated from training samples by "smoothing" them with a window function
- k-nearest neighbors (kNN) rule: the decision is associated with the dominant class among the k nearest neighbors taken from the training samples
- Functional discrimination: the decision consists in minimizing an objective function within an augmented feature space
Parameter Estimation
By hypothesis, the following information is supposed to be known:
- n, the number of classes
- for each class ω_i:
  - the a priori probability P(ω_i)
  - the functional form of the class-conditional feature densities p(x | ω_i, θ_i), with unknown parameters θ_i
  - a labeled set of training data D_i = {x_i1, x_i2, ..., x_iNi}, supposed to be drawn randomly from ω_i

In fact, parameter estimation can be performed class by class.
Maximum likelihood criterion

Maximum likelihood estimation consists in determining the parameter vector θ_i that maximizes the likelihood of D_i, i.e.

$$p(D_i \mid \theta_i) = \prod_k p(x_{ik} \mid \theta_i)$$

For some distributions, the problem can be solved analytically through the log-likelihood equations

$$\nabla_{\theta_i} \sum_k \ln p(x_{ik} \mid \theta_i) = 0$$

(but is the solution really a maximum?)

If the solution cannot be found analytically, it can be computed iteratively by a gradient ascent method, as sketched below.
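A minimal numerical sketch of this idea (not from the slides; it assumes NumPy and SciPy are available, and the one-dimensional Gaussian data are invented for the example):

```python
import numpy as np
from scipy.optimize import minimize

# Invented 1-D training sample for one class
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_log_likelihood(theta, x):
    """Negative log-likelihood of a univariate Gaussian N(mu, sigma^2)."""
    mu, log_sigma = theta                  # optimize log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * ((x - mu) / sigma) ** 2
                   - np.log(sigma) - 0.5 * np.log(2 * np.pi))

# Gradient-based maximization of the likelihood (by minimizing its negative)
res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
print(res.x[0], np.exp(res.x[1]))          # close to the analytic ML estimates
```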
Univariate Gaussian distribution
In one dimension, the normal distribution N(μ, σ²) is defined by the expression

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]$$

- μ represents the mean
- σ² represents the variance
- the maximum of the curve corresponds to $p(\mu) = \frac{1}{\sigma\sqrt{2\pi}} \approx \frac{0.399}{\sigma}$
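A quick check of this formula (a sketch assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# The peak at x = mu equals 1/(sigma * sqrt(2*pi)) ~ 0.399/sigma
print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.3989
```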
Multivariate Gaussian distribution
In d dimensions, the generalized normal distribution N(μ, Σ) is defined by

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x-\mu)^t \Sigma^{-1} (x-\mu) \right]$$

where
- μ represents the mean vector: $\mu = E[x] = \int x\, p(x)\, dx$, with components $\mu_i = E[x_i]$
- Σ represents the covariance matrix: $\Sigma = E[(x-\mu)(x-\mu)^t] = \int (x-\mu)(x-\mu)^t\, p(x)\, dx$, with components $\sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$
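A direct transcription of this density (a sketch assuming NumPy; μ and Σ are invented):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, cov):
    """Density of N(mu, cov) at a d-dimensional feature vector x."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

mu = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
print(multivariate_normal_pdf(np.array([1.0, -1.0]), mu, cov))
```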
Interpretation of the parameters
The mean vector μ represents the center of the distribution. The covariance matrix Σ describes the scatter:
- it is symmetric: σ_ij = σ_ji
- it is positive semidefinite (usually positive definite): σ_ii = σ_i² ≥ 0
- the principal axes of the equal-density hyperellipsoids are given by the eigenvectors of Σ
- the lengths of the axes are given by the eigenvalues
- if two features x_i and x_j are statistically independent, then σ_ij = σ_ji = 0
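The eigendecomposition can be computed directly (a sketch assuming NumPy; the covariance matrix is invented):

```python
import numpy as np

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])

# Eigenvectors give the directions of the principal axes of the
# equal-density hyperellipsoids; eigenvalues determine their lengths.
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: cov is symmetric
print("axis directions:\n", eigvecs)
print("eigenvalues:", eigvals)
```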
Mahalanobis distance
Regions of constant density are hyperellipsoids centered at μ and characterized by the equation

$$(x-\mu)^t \Sigma^{-1} (x-\mu) = C$$

where C is a positive constant. The (squared) Mahalanobis distance from x to μ is defined as

$$r^2 = (x-\mu)^t \Sigma^{-1} (x-\mu)$$
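A one-function sketch (assuming NumPy; the numbers are invented):

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x-mu)^t cov^-1 (x-mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(cov, diff)   # solve() avoids an explicit inverse

mu = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
print(mahalanobis_sq(np.array([1.0, -1.0]), mu, cov))
```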
Estimation of μ and Σ for normal distributions

In the one-dimensional case, the maximum likelihood criterion leads to the following equations:

$$\sum_k \frac{1}{\sigma^{*2}} (x_k - \mu^*) = 0 \qquad -\sum_k \frac{1}{2\sigma^{*2}} + \sum_k \frac{(x_k - \mu^*)^2}{2\sigma^{*4}} = 0$$

whose solution is

$$\mu^* = \frac{1}{n} \sum_k x_k \qquad \sigma^{*2} = \frac{1}{n} \sum_k (x_k - \mu^*)^2$$

Generalized to the multi-dimensional case, we obtain

$$\mu^* = \frac{1}{n} \sum_k x_k \qquad \Sigma^* = \frac{1}{n} \sum_k (x_k - \mu^*)(x_k - \mu^*)^t$$
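These closed-form estimates are one-liners in code (a sketch assuming NumPy; the sample is invented):

```python
import numpy as np

def ml_gaussian_estimates(samples):
    """ML estimates of the mean vector and covariance matrix (1/n version)."""
    n = len(samples)
    mu = samples.mean(axis=0)
    centered = samples - mu
    sigma = centered.T @ centered / n      # biased ML estimate (see next slide)
    return mu, sigma

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=500)
mu_hat, sigma_hat = ml_gaussian_estimates(X)
print(mu_hat, sigma_hat, sep="\n")
```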
Bias Problem
The ML estimate of σ² (resp. Σ) is biased: its expected value over all sample sets of size n differs from the true variance, namely

$$E\left[ \frac{1}{n} \sum_k (x_k - \bar{x})^2 \right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

An unbiased estimate would be

$$\hat{\sigma}^2 = \frac{1}{n-1} \sum_k (x_k - \mu^*)^2$$

Both estimators converge asymptotically.

Which estimator is correct?
- they are neither right nor wrong!
- neither one has all desirable properties
- Bayesian learning theory can give an answer
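The bias is easy to observe empirically (a sketch assuming NumPy; the true variance and sample size are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
true_var, n, trials = 4.0, 10, 100_000

# Average both estimators over many samples of size n
biased, unbiased = 0.0, 0.0
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    biased += np.var(x)              # 1/n version (ML, biased)
    unbiased += np.var(x, ddof=1)    # 1/(n-1) version (unbiased)

print(biased / trials)     # ~ (n-1)/n * 4.0 = 3.6
print(unbiased / trials)   # ~ 4.0
```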
Discriminant functions for normal distributions (1)
For normal distributions, the following discriminant functions may be stated:

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1} (x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

Dropping the class-independent term (d/2) ln 2π:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

In the case where all classes share the same covariance matrix Σ, this reduces to

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma^{-1} (x-\mu_i) + \ln P(\omega_i)$$

which expands into the linear form

$$g_i(x) = \mu_i^t \Sigma^{-1} x - \frac{1}{2}\mu_i^t \Sigma^{-1} \mu_i + \ln P(\omega_i)$$

so the decision boundaries are linear.
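The linear form translates directly into code (a sketch assuming NumPy; the two classes and their shared covariance are invented):

```python
import numpy as np

def linear_discriminant(x, mu_i, cov_shared, prior_i):
    """g_i(x) for classes sharing one covariance matrix (linear in x)."""
    w = np.linalg.solve(cov_shared, mu_i)        # Sigma^-1 mu_i
    w0 = -0.5 * mu_i @ w + np.log(prior_i)
    return w @ x + w0

cov = np.array([[1.0, 0.2],
                [0.2, 1.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
x = np.array([1.2, 0.8])
scores = [linear_discriminant(x, mu, cov, prior_i=0.5) for mu in mus]
print(int(np.argmax(scores)))   # decide for the class with the largest g_i(x)
```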
Linear decision boundaries for normal distributions

[Figure: examples of linear decision boundaries between Gaussian classes sharing the same covariance matrix]
Discriminant functions for normal distributions (2)
In the case of arbitrary covariance matrices, the decision boundaries become quadratic:

$$g_i(x) = -\frac{1}{2} x^t \Sigma_i^{-1} x + \mu_i^t \Sigma_i^{-1} x - \frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
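The quadratic case in code (a sketch assuming NumPy, using the equivalent unexpanded form of g_i; the class parameters are invented):

```python
import numpy as np

def quadratic_discriminant(x, mu_i, cov_i, prior_i):
    """g_i(x) with class-specific covariance matrices (quadratic in x)."""
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.solve(cov_i, diff)
            - 0.5 * np.log(np.linalg.det(cov_i))
            + np.log(prior_i))

params = [(np.array([0.0, 0.0]), np.eye(2), 0.5),
          (np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 1.0]]), 0.5)]
x = np.array([1.0, 1.5])
print(int(np.argmax([quadratic_discriminant(x, mu, cov, p)
                     for mu, cov, p in params])))
```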
Font Recognition: 1D-Gaussian estimation (1)

Font style discrimination (■ roman ■ italic) using hpd-stdev:
- estimated models fit the real distributions
- the decision boundary is accurate
- the recognition accuracy (96.3%) is confirmed by the experimental confusion matrix:

            roman   italic
  roman      5610      390
  italic       49     5951

[Figure: estimated class-conditional densities and decision boundary]
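The accuracy figure follows from the matrix (a sketch assuming NumPy; rows are taken to be the true classes):

```python
import numpy as np

# Confusion matrix from the slide (rows: true class, columns: decision)
conf = np.array([[5610, 390],
                 [49, 5951]])

# Recognition accuracy = correctly classified samples / all samples
print(f"{np.trace(conf) / conf.sum():.1%}")   # 96.3%
```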
Font Recognition: 1D-Gaussian estimation (2)

Font boldness discrimination (■ normal ■ bold) using hr-mean:
- estimated models do not fit the real distributions
- the decision boundary is nevertheless surprisingly well adapted
- the recognition accuracy (97.6%) is high, as observed from the experimental confusion matrix:

            normal    bold
  normal      5942      58
  bold         228    5772

[Figure: estimated class-conditional densities and decision boundary]
Font Recognition: 1D-Gaussian estimation (3)

Boldness is generally dependent on the font family: hr-mean can perfectly discriminate ■ normal and ■ bold fonts if the font family is known (recognition rate > 99.9%).

[Figure: hr-mean histograms in four panels: Times, Courier, Arial, and all families together]
Font Recognition: 1D-Gaussian estimation (4)

Font family discrimination (■ Arial, ■ Courier, ■ Times) using hr-mean:
- estimated models do not fit the real distributions at all
- the decision boundaries are inadequate
- the recognition accuracy (41.9%) is bad:

             Arial  Courier  Times
  Arial       2000      403   1597
  Courier        5     2004   1991
  Times       1055     1927   1018

[Figure: estimated class-conditional densities]
Font Recognition: 1D Multi-Gaussian estimation

Font family discrimination (■ Arial, ■ Courier, ■ Times) using hr-mean, supposing the font style to be known for learning:
- estimated models fit the real distributions
- the decision boundaries are adequate
- the recognition accuracy (89.6%) is nearly optimal for the given feature:

             Arial  Courier  Times
  Arial       3722      125    153
  Courier      119     3753    128
  Times        517      209   3274

[Figure: estimated per-style Gaussian models]
Font Recognition: 2D-Gaussian estimation

Font family discrimination (■ Arial, ■ Courier, ■ Times) using two features, hr-stdev and vr-mean:
- the models fit approximately two classes, but not the third one
- the decision boundary is surprisingly well adapted
- the recognition accuracy (93.5%) is reasonable:

             Arial  Courier  Times
  Arial       3918       82      0
  Courier      175     3683    142
  Times         18      364   3618

[Figure: 2D feature scatter with estimated Gaussian models]
Font Recognition: General Gaussian estimation

The performance of font family discrimination (■ Arial, ■ Courier, ■ Times) depends on the feature set used (a classifier sketch follows the list):
- hr-stdev: recognition rate 72.7%
- hr-stdev, vr-mean: recognition rate 93.5%
- hp-mean, hr-mean, vr-mean: recognition rate 98.0%
- hp-mean, hpd-stdev, hr-mean, vr-mean, hr-stdev, vr-stdev: recognition rate 99.7%
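A generic Gaussian classifier over any chosen feature subset might look like this (a sketch assuming NumPy; the feature matrix, labels, and column indices are hypothetical):

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """ML-fit one Gaussian per class; returns (mu, cov, prior) triples."""
    models = []
    for c in np.unique(y):
        Xc = X[y == c]
        models.append((Xc.mean(axis=0),
                       np.cov(Xc, rowvar=False, bias=True),  # ML (1/n) estimate
                       len(Xc) / len(X)))
    return models

def classify(x, models):
    """Pick the class maximizing the quadratic discriminant g_i(x)."""
    scores = []
    for mu, cov, prior in models:
        diff = x - mu
        scores.append(-0.5 * diff @ np.linalg.solve(cov, diff)
                      - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))
    return int(np.argmax(scores))

# Hypothetical usage: restrict X to the columns of the chosen feature subset,
# e.g. X_sub = X[:, [idx_hr_stdev, idx_vr_mean]], then fit and evaluate.
```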
Font recognition: classifier for all 12 classes

Discrimination of all fonts using all features (hp-mean, hpd-stdev, hr-mean, hr-stdev, vr-mean, vr-stdev):
- overall recognition rate of 99.6%
- most errors are due to roman/italic confusion

Experimental confusion matrix (1000 test samples per class):

   992    0    7    0    0    1    0    0    0    0    0    0
     0  996    0    4    0    0    0    0    0    0    0    0
     1    0  996    0    0    2    0    1    0    0    0    0
     0    4    0  996    0    0    0    0    0    0    0    0
     0    0    0    0 1000    0    0    0    0    0    0    0
     0    0    0    0    0  983    0   17    0    0    0    0
     0    0    0    0    2    0  998    0    0    0    0    0
     0    0    5    0    0    4    0  991    0    0    0    0
     0    0    0    0    0    0    0    0 1000    0    0    0
     0    0    0    0    0    0    0    0    0 1000    0    0
     0    0    0    0    0    0    0    0    0    0 1000    0
     0    0    0    0    0    0    0    0    0    0    0 1000
Error types
In a Bayesian classifier using parameter estimation, several error types occur:
- Indistinguishability errors, due to the overlapping of distributions: they are inherent to the problem and cannot be reduced
- Modeling errors, due to a bad choice of the parametric density functions (models): they can be avoided by changing the models
- Estimation errors, due to the imprecision of the training data: they can be reduced by increasing the amount of training data
Influence of the size of training data
Evolution of the error rate as a function of the training set size (experiment with 4 training sets and 2 test sets, ■ average); a simulation sketch follows.

[Figure: recognition rate (about 0.994 to 0.996) versus training set size (50 to 200)]
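Such a learning curve can be reproduced on synthetic data (a self-contained sketch assuming NumPy; the two Gaussian classes, sample sizes, and equal priors are all invented):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n_per_class):
    """Two invented Gaussian classes in a 2-D feature space."""
    X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(n_per_class, 2)),
                   rng.normal([2.0, 1.0], 1.0, size=(n_per_class, 2))])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def fit_and_error(X_tr, y_tr, X_te, y_te):
    """Fit one Gaussian per class (ML) and return the test error rate."""
    models = [(X_tr[y_tr == c].mean(axis=0),
               np.cov(X_tr[y_tr == c], rowvar=False, bias=True))
              for c in (0, 1)]

    def g(x, mu, cov):
        # equal priors, so the ln P(w_i) term is constant and omitted
        d = x - mu
        return (-0.5 * d @ np.linalg.solve(cov, d)
                - 0.5 * np.log(np.linalg.det(cov)))

    pred = np.array([np.argmax([g(x, mu, cov) for mu, cov in models])
                     for x in X_te])
    return (pred != y_te).mean()

X_te, y_te = make_data(1000)
for n in (50, 100, 150, 200):       # growing training sets, as in the plot
    X_tr, y_tr = make_data(n)
    print(n, fit_and_error(X_tr, y_tr, X_te, y_te))
```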