Feature Selection and Weighting using Genetic Algorithm for Off-line Character Recognition Systems Faten Hussein Presented by The University of British Columbia Department of Electrical & Computer Engineering
Jan 06, 2018
Outline
- Introduction & Problem Definition
- Motivation & Objectives
- System Overview
- Results
- Conclusions
Introduction
Applications of character recognition:
- Address readers
- Bank cheque readers
- Reading data entered in forms (e.g. tax forms)
- Detecting forged signatures
Off-line Character Recognition System
Text document -> Scanning -> Pre-Processing -> Feature Extraction -> Classification -> Post-Processing -> Classified text
Character recognition is the process of converting scanned images of machine-printed or handwritten text into a computer-processable format such as ASCII. Feature extraction (FE) extracts from the raw data the information most relevant to classification. Classification (C) maps these features into classes. Post-processing (PostP) enhances the classification result, e.g. using a dictionary or user input.
Introduction
Character (symbol) shapes and sizes vary widely. Different writers have different writing styles, and even the same person's style varies. Thus, an unlimited number of variations exists for a single character.
For a typical handwritten recognition task:

Introduction
Variations in handwritten digits extracted from zip codes.
To overcome this diversity, a large number of features must be added. Examples of features we used: moment invariants, number of loops, number of end points, centroid, area, circularity, and so on.
[Figure: sample digits annotated with loop (L) and end-point (E) counts: L=2, E=0; L=1, E=1; L=0, E=3]

In our work we were interested in classifying handwritten digits, and we used several such features. However, due to variations in shape, size, and style, many features must be added.
Problem

The dilemma: to accommodate variations in symbols, we add more features, hoping to increase classification accuracy. But adding features increases the problem size, and hence the run time and memory needed for classification.

Building a character recognition system is an ad-hoc process that depends on experience and trial and error. We might add redundant or irrelevant features, which decrease accuracy.
Irrelevant features have no effect on the target concept at all; redundant features add nothing new to the target concept. A redundant feature is one whose value can be derived from the values of other features, for example if it is the average, the square, or a multiple of other feature values.
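As a concrete illustration (a hypothetical sketch; the exact feature definitions used in the thesis may differ): circularity is computed from area and perimeter, so a feature set containing all three carries a redundant feature.

```python
import math

# Hypothetical sketch: circularity is a function of area and perimeter,
# so keeping circularity, perimeter, AND area is redundant.
def circularity(area, perimeter):
    # 4*pi*A / P^2 equals 1.0 for a perfect circle, less for other shapes
    return 4 * math.pi * area / perimeter ** 2

# A unit circle: area = pi, perimeter = 2*pi
print(circularity(math.pi, 2 * math.pi))  # -> 1.0
```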
Feature Selection

Solution: Feature Selection
Definition: select a relevant subset of features from a larger set of features while maintaining or enhancing accuracy.

Advantages:
- Removes irrelevant and redundant features (e.g. a total of 40 features reduced to 16; of the 7 Hu moments, only the first three kept; area removed as redundant given circularity)
- Maintains or enhances classification accuracy (e.g. a 70% recognition rate using 40 features rose to 75% after FS using only 16 features)
- Faster classification and lower memory requirements

So a solution to this dilemma is to add an FS module to the character recognition system.
Feature Selection/Weighting
Assigning weights (binary or real-valued) to features requires a search algorithm to find the set of weights that yields the best classification accuracy (an optimization problem). A genetic algorithm is a good search method for such optimization problems.
Feature Selection (FS): the special case; binary weights (0 for irrelevant/redundant, 1 for relevant); number of feature subset combinations: 2^n.
Feature Weighting (FW): the general case; real-valued weights (variable weights depending on feature relevance); with L weight levels, number of combinations: L^n.

Exhaustively searching a space of 2^n or L^n combinations is impossible even for a moderate n, so we need a search algorithm to find the set of weights that yields the best classification rate.
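The sizes involved can be sketched directly (a minimal illustration; n = 16 features is just an example value):

```python
# Size of the weight space a search algorithm must cover:
# feature selection uses 2 weight levels (0/1), giving 2**n subsets;
# feature weighting with L discrete levels gives L**n combinations.
def search_space_size(n_features, n_levels=2):
    return n_levels ** n_features

print(search_space_size(16))      # FS over 16 features: 65536 subsets
print(search_space_size(16, 33))  # 33-level FW over the same features: vastly larger
```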
Genetic Feature Selection/Weighting
Why use a GA for FS/FW?
- GAs have proven to be a powerful search method for the FS problem.
- They require no derivative information or other extra knowledge; only the objective function (the classifier's error rate) is needed to evaluate the quality of a feature subset.
- They search a population of solutions in parallel, so they can provide a number of potential solutions, not just one.
- GAs are resistant to becoming trapped in local minima.

Some search algorithms need derivative information to find a maximum or minimum.
Objectives & Motivations
Build a genetic feature selection/weighting system, apply it to a character recognition problem, and investigate the following issues:
- Study the effect of varying the number of weight values on the number of selected features (FS often eliminates more features than FW, but by how much?)
- Compare the performance of genetic feature selection and weighting in the presence of irrelevant and redundant features (not studied before)
- Compare the performance of genetic feature selection and weighting for regular cases (test the hypothesis that FW should give better, or at least the same, results as FS)
- Evaluate the performance of the better method (GFS or GFW) in terms of optimality and time complexity (study the feasibility of genetic search)

What is the relation between the number of eliminated features and the number of weight values (not necessarily in the presence of irrelevant/redundant features)?
Methodology
- The recognition problem is to classify isolated handwritten digits.
- Used a k-nearest-neighbor classifier (k=1).
- Used a genetic algorithm as the search method.
- Applied genetic feature selection and weighting in the wrapper approach (i.e. the fitness function is the classifier's error rate).
- Used two phases during the program run: a training/testing phase and a validation phase.

Training/testing guides the GA search; validation assesses the quality of the generated solution on unseen data.
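The wrapper loop above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the particular GA operators (truncation selection, one-point crossover, bit-flip mutation), the toy two-feature data, and all parameter values are assumptions chosen for demonstration; the actual system used handwritten-digit features.

```python
import random

def knn1_accuracy(train, test, weights):
    """Wrapper fitness: classification rate of a 1-NN classifier
    using weighted squared Euclidean distance (weight 0 drops a feature)."""
    def dist(a, b):
        return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))
    correct = 0
    for x, label in test:
        _, nearest_label = min(train, key=lambda t: dist(t[0], x))
        correct += (nearest_label == label)
    return correct / len(test)

def genetic_feature_selection(train, test, n_features,
                              pop_size=20, generations=15, p_mut=0.05):
    """Binary-weight GA (feature selection). For feature weighting the
    genes would range over L discrete weight levels instead of {0, 1}."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: knn1_accuracy(train, test, c), reverse=True)
        survivors = pop[:pop_size // 2]            # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < p_mut) for g in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda c: knn1_accuracy(train, test, c))

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
random.seed(0)
train = [([0.0, random.random()], 0) for _ in range(10)] + \
        [([1.0, random.random()], 1) for _ in range(10)]
test  = [([0.0, random.random()], 0) for _ in range(10)] + \
        [([1.0, random.random()], 1) for _ in range(10)]
best = genetic_feature_selection(train, test, n_features=2)
print(best, knn1_accuracy(train, test, best))
```

In the thesis setup the fitness is evaluated on the training/testing data to guide the search, and the validation set is only used afterwards; the sketch above compresses that into a single evaluation set.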
System Overview
Pre-Processing Module -> all N extracted features -> Feature Selection/Weighting Module (GA) <-> Evaluation Module (KNN classifier): the GA proposes a feature subset, the evaluation module returns an assessment of that subset -> best feature subset (M features)
Results (Comparison 1)
Effect of varying the number of weight values on the number of selected features:
- As the number of weight values increases, the probability of a feature having weight value 0 (POZ) decreases, so the number of eliminated features decreases.
- GFS eliminates more features (thus selects fewer) than GFW because of its smaller number of weight values (0/1), and without compromising classification accuracy.
[Chart: number of zero (eliminated) features vs. number of weight values]
Number of weight values:              2   3   6  11  21  41  81
Number of zero (eliminated) features: 27  18   9   5   3   1   0
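The falling curve is consistent with a simple back-of-the-envelope model (an illustration only; it assumes the weight levels end up uniformly distributed, which the GA does not strictly guarantee, and n = 54 features is inferred from the chart's 27 eliminated features at 2 levels):

```python
# If each of the n genes independently takes one of L equally likely
# weight levels, then POZ = 1/L and the expected number of zero-weight
# (eliminated) features is n/L. With n = 54 this tracks the measured
# curve: 2 levels -> 27, 3 -> 18, 6 -> 9, 11 -> ~5, ..., 81 -> ~0.
def expected_zero_features(n_features, n_levels):
    return n_features / n_levels

for levels in (2, 3, 6, 11, 21, 41, 81):
    print(levels, round(expected_zero_features(54, levels), 1))
```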
Results (Comparison 2)
Performance of genetic feature selection/weighting in the presence of irrelevant features:
- The performance of the 1-NN classifier degrades rapidly as the number of irrelevant features increases.
- As the number of irrelevant features increases, FS outperforms all FW settings in both classification accuracy and feature elimination.
[Chart: classification rate % and number of eliminated features vs. number of irrelevant features]
Irrelevant features:   4      12     16     24     34     44     54
1-NN (all features):   59.32  41.84  34.48  30.88  28.36  25.04  22.84
FS:                    66.76  66.56  65.08  60.72  56.56  48.72  42.2
3FW:                   67     64.84  64.32  58.12  50.6   45     40.96
5FW:                   66.24  63.16  61.76  57.28  48.56  44.92  39.64
17FW:                  68     59.88  58.72  55.6   47.16  40.6   40.6
33FW:                  66.04  61.6   60.44  52.56  47.88  44.68  38.76

Eliminated features:
FS:   4  12  15  22  30  36  42
3FW:  4  11  13  16  21  24  26
5FW:  3   7   8   9  10  14  15
33FW: 0   0   0   0   0   0   0
As mentioned before, irrelevant features have no effect on the target concept at all, yet they lower the classification accuracy and increase the problem dimensionality.
Results (Comparison 3)
Performance of genetic feature selection/weighting in the presence of redundant features:
- The classification accuracy of 1-NN does not suffer much from added redundant features, but they increase the problem size.
- As the number of redundant features increases, FS has slightly better classification accuracy than all FW settings, but significantly outperforms FW in feature elimination.
[Chart: classification rate % and number of eliminated features vs. number of redundant features]
Redundant features:  4      14     24     34     44     54
1-NN (all features): 66.76  65.44  64.72  64.12  64     63.84
FS:                  67.16  66.68  66.12  66.2   66     65.36
3FW:                 67.12  66.68  66     64.76  64.48  63.92
5FW:                 66.96  66.48  64.84  64.48  64.48  63.88
33FW:                66.8   66.36  64.04  63.56  64.04  63.56

Eliminated features:
FS:   3   9  20  24  33  36
3FW:  2   7  12  13  20  25
5FW:  1   4   8   9  12  16
33FW: 0   0   0   0   0   0
Redundant features add nothing new to the target concept; they increase the problem size while contributing nothing to the classification problem.
Results (Comparison 4)
Performance of genetic feature selection/weighting for regular cases (not necessarily containing irrelevant/redundant features):
- FW achieves better training accuracies than FS, but FS generalizes better (better accuracies on unseen validation samples).
- FW over-fits the training samples.
[Chart: training and validation classification rates %]
Method:                    1-NN   FS     3FW    5FW    17FW   33FW
Training class. rate %:    67.42  67.52  67.96  68.2   68.8   68.82
Validation class. rate %:  67.76  67.76  67.36  66.28  66.24  66.24
The trend line shows that the greater the number of weight values, the higher the training accuracy and the lower the validation accuracy. Increasing the number of weight values beyond 2 or 3 increases the chance of over-fitting.
Results (Evaluation 1)
Convergence of GFS to an optimal or near-optimal set of features:
- GFS was able to return optimal or near-optimal values (as found by exhaustive search).
- The worst average value obtained by GFS was less than 1% away from the optimal value.

Number of features  Best exhaustive (class. rate %)  Best GA (class. rate %)  Average GA (5 runs)
 8                  74                               74                       74
10                  75.2                             75.2                     75.2
12                  77.2                             77.2                     77.04
14                  79                               79                       78.56
16                  79.2                             79                       78.28
18                  79.4                             79.4                     78.92

In five of the six cases, GFS returned the optimal value reached by exhaustive search; in the sixth it returned a near-optimal solution.
Results (Evaluation 2)
Convergence of GFS to an optimal or near-optimal set of features within an acceptable number of generations:
- The number of generations needed by GFS is bounded below by a linear-fit curve and above by an exponential-fit curve.
- Using GFS for high-dimensional problems requires parallel processing.
[Chart: number of generations to reach the optimum vs. number of features, with linear and exponential extrapolations]
Features:           8  10  12  14  16  18   20   22   24   26   28   30   40    50
Actual (measured):  5   5  10  10  15  20    -    -    -    -    -    -    -     -
Exponential fit:    5   5  10  10  15  20   26   37   46   66   87  117  519  2269
Linear fit:         5   5  10  10  15  20   21   25   28   32   35   38   55    72
(Extrapolated to 60 features: exponential fit ~10585 generations, linear fit ~89.)
The run time of the GA depends on the number of generations, the population size, the number of features, and the number of training samples. We investigated the relationship between the number of features and the number of generations needed to reach the optimal values (keeping the other factors unchanged). As the number of features increases, the number of generations needed to reach the optimal values increases as well. We used extrapolation because it is computationally infeasible to run exhaustive search for large numbers of features. The number of generations required for 60 features lies somewhere between 89 (linear fit) and 10585 (exponential fit); even the midpoint, 5337, would not be computationally feasible.
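The extrapolation step can be sketched with ordinary least squares (a minimal illustration: it uses the measured points from the chart but computes its own fits, so the 60-feature endpoints come out near, not equal to, the slide's 89 and 10585):

```python
import math

# Measured points from the chart: number of features vs. generations
# needed to reach the optimum.
features = [8, 10, 12, 14, 16, 18]
gens     = [5, 5, 10, 10, 15, 20]

def linfit(xs, ys):
    """Ordinary least-squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

a, b = linfit(features, gens)                           # linear model
la, lb = linfit(features, [math.log(y) for y in gens])  # exponential model, fit in log space

print(round(a + b * 60))              # linear extrapolation to 60 features
print(round(math.exp(la + lb * 60)))  # exponential extrapolation to 60 features
```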
Conclusions
- GFS is superior to GFW in feature reduction, without compromising classification accuracy.
- In the presence of irrelevant features, GFS is better than GFW in both feature reduction and classification accuracy.
- In the presence of redundant features, GFS is also preferred over GFW due to its greater ability to reduce features.
- For regular databases, it is advisable to use at most 2 or 3 weight values to avoid over-fitting.
- GFS is a reliable method for finding optimal or near-optimal solutions, but needs parallel processing for large problem sizes.
Questions ?
Pre-processing includes noise removal, image resizing, and thinning.