Feature Selection and Weighting using Genetic Algorithm for Off-line Character Recognition Systems Faten Hussein Presented by The University of British Columbia Department of Electrical & Computer Engineering
Jan 06, 2018
Outline
- Introduction & Problem Definition
- Motivation & Objectives
- System Overview
- Results
- Conclusions
Introduction
Applications of character recognition:
- Address readers
- Bank cheque readers
- Reading data entered in forms (e.g. tax forms)
- Detecting forged signatures
Off-line Character Recognition System
Text document -> Scanning -> Pre-Processing -> Feature Extraction -> Classification -> Post-Processing -> Classified text
Character recognition is the process of converting scanned images of machine-printed or handwritten text into a computer-processable format such as ASCII. Feature extraction (FE) extracts from the raw data the information most relevant to classification. Classification (C) maps these features into classes. Post-processing (PostP) enhances the classification result, e.g. using a dictionary or user input.
Introduction
Character (symbol) shapes and sizes vary widely. Different writers have different writing styles, and even the same person's style varies. Thus, an unlimited number of variations exists for a single character.
For a typical handwritten recognition task:

Introduction
Variations in handwritten digits extracted from zip codes.
To overcome this diversity, a large number of features must be added. Examples of features we used: moment invariants, number of loops, number of end points, centroid, area, circularity, and so on.
[Figure: sample digits annotated with loop (L) and end-point (E) counts: L=2, E=0; L=1, E=1; L=0, E=3]

In our work we were interested in classifying handwritten digits, and we used several such features. However, due to variations in shape, size, and style, many features must be added.
Problem

The dilemma: to accommodate variations in symbols, we add more features, hoping to increase classification accuracy. But adding features increases the problem size, and hence the run time and memory needed for classification.

Building a character recognition system is an ad-hoc process that depends on experience and trial and error. We might add redundant or irrelevant features, which decrease accuracy.
Irrelevant features have no effect on the target concept at all; redundant features add nothing new to the target concept. A redundant feature is one whose value can be derived from the values of other features, for example if it is the average, the square, or a multiple of other feature values.
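As a concrete illustration (a hypothetical sketch; the exact feature definitions used in the thesis may differ): circularity is computed from area and perimeter, so a feature set containing all three carries a redundant feature.

```python
import math

# Hypothetical sketch: circularity is a function of area and perimeter,
# so keeping circularity, perimeter, AND area is redundant.
def circularity(area, perimeter):
    # 4*pi*A / P^2 equals 1.0 for a perfect circle, less for other shapes
    return 4 * math.pi * area / perimeter ** 2

# A unit circle: area = pi, perimeter = 2*pi
print(circularity(math.pi, 2 * math.pi))  # -> 1.0
```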
Feature Selection

Solution: Feature Selection
Definition: select a relevant subset of features from a larger set of features while maintaining or enhancing accuracy.

Advantages:
- Removes irrelevant and redundant features (e.g. a total of 40 features reduced to 16; of the 7 Hu moments, only the first three kept; area removed as redundant given circularity)
- Maintains or enhances classification accuracy (e.g. a 70% recognition rate using 40 features rose to 75% after FS using only 16 features)
- Faster classification and lower memory requirements

So a solution to this dilemma is to add an FS module to the character recognition system.
Feature Selection/Weighting
Assigning weights (binary or real-valued) to features requires a search algorithm to find the set of weights that yields the best classification accuracy (an optimization problem). A genetic algorithm is a good search method for such optimization problems.
Feature Selection (FS): the special case; binary weights (0 for irrelevant/redundant, 1 for relevant); number of feature subset combinations: 2^n.
Feature Weighting (FW): the general case; real-valued weights (variable weights depending on feature relevance); with L weight levels, number of combinations: L^n.

Exhaustively searching a space of 2^n or L^n combinations is impossible even for a moderate n, so we need a search algorithm to find the set of weights that yields the best classification rate.
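The sizes involved can be sketched directly (a minimal illustration; n = 16 features is just an example value):

```python
# Size of the weight space a search algorithm must cover:
# feature selection uses 2 weight levels (0/1), giving 2**n subsets;
# feature weighting with L discrete levels gives L**n combinations.
def search_space_size(n_features, n_levels=2):
    return n_levels ** n_features

print(search_space_size(16))      # FS over 16 features: 65536 subsets
print(search_space_size(16, 33))  # 33-level FW over the same features: vastly larger
```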
Genetic Feature Selection/Weighting
Why use a GA for FS/FW?
- GAs have proven to be a powerful search method for the FS problem.
- They require no derivative information or other extra knowledge; only the objective function (the classifier's error rate) is needed to evaluate the quality of a feature subset.
- They search a population of solutions in parallel, so they can provide a number of potential solutions, not just one.
- GAs are resistant to becoming trapped in local minima.

Some search algorithms need derivative information to find a maximum or minimum.
Objectives & Motivations
Build a genetic feature selection/weighting system, apply it to a character recognition problem, and investigate the following issues:
- Study the effect of varying the number of weight values on the number of selected features (FS often eliminates more features than FW, but by how much?)
- Compare the performance of genetic feature selection and weighting in the presence of irrelevant and redundant features (not studied before)
- Compare the performance of genetic feature selection and weighting for regular cases (test the hypothesis that FW should give better, or at least the same, results as FS)
- Evaluate the performance of the better method (GFS or GFW) in terms of optimality and time complexity (study the feasibility of genetic search)

What is the relation between the number of eliminated features and the number of weight values (not necessarily in the presence of irrelevant/redundant features)?
Methodology
- The recognition problem is to classify isolated handwritten digits.
- Used a k-nearest-neighbor classifier (k=1).
- Used a genetic algorithm as the search method.
- Applied genetic feature selection and weighting in the wrapper approach (i.e. the fitness function is the classifier's error rate).
- Used two phases during the program run: a training/testing phase and a validation phase.

Training/testing guides the GA search; validation assesses the quality of the generated solution on unseen data.
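The wrapper loop above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the particular GA operators (truncation selection, one-point crossover, bit-flip mutation), the toy two-feature data, and all parameter values are assumptions chosen for demonstration; the actual system used handwritten-digit features.

```python
import random

def knn1_accuracy(train, test, weights):
    """Wrapper fitness: classification rate of a 1-NN classifier
    using weighted squared Euclidean distance (weight 0 drops a feature)."""
    def dist(a, b):
        return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))
    correct = 0
    for x, label in test:
        _, nearest_label = min(train, key=lambda t: dist(t[0], x))
        correct += (nearest_label == label)
    return correct / len(test)

def genetic_feature_selection(train, test, n_features,
                              pop_size=20, generations=15, p_mut=0.05):
    """Binary-weight GA (feature selection). For feature weighting the
    genes would range over L discrete weight levels instead of {0, 1}."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: knn1_accuracy(train, test, c), reverse=True)
        survivors = pop[:pop_size // 2]            # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < p_mut) for g in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda c: knn1_accuracy(train, test, c))

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
random.seed(0)
train = [([0.0, random.random()], 0) for _ in range(10)] + \
        [([1.0, random.random()], 1) for _ in range(10)]
test  = [([0.0, random.random()], 0) for _ in range(10)] + \
        [([1.0, random.random()], 1) for _ in range(10)]
best = genetic_feature_selection(train, test, n_features=2)
print(best, knn1_accuracy(train, test, best))
```

In the thesis setup the fitness is evaluated on the training/testing data to guide the search, and the validation set is only used afterwards; the sketch above compresses that into a single evaluation set.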
System Overview
Pre-Processing Module -> all N extracted features -> Feature Selection/Weighting Module (GA) <-> Evaluation Module (KNN classifier): the GA proposes a feature subset, the evaluation module returns an assessment of that subset -> best feature subset (M features)
Results (Comparison 1)
Effect of varying the number of weight values on the number of selected features:
- As the number of weight values increases, the probability of a feature having weight value 0 (POZ) decreases, so the number of eliminated features decreases.
- GFS eliminates more features (thus selects fewer) than GFW because of its smaller number of weight values (0/1), and without compromising classification accuracy.
[Chart: number of zero (eliminated) features vs. number of weight values]
Number of weight values:              2   3   6  11  21  41  81
Number of zero (eliminated) features: 27  18   9   5   3   1   0
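The falling curve is consistent with a simple back-of-the-envelope model (an illustration only; it assumes the weight levels end up uniformly distributed, which the GA does not strictly guarantee, and n = 54 features is inferred from the chart's 27 eliminated features at 2 levels):

```python
# If each of the n genes independently takes one of L equally likely
# weight levels, then POZ = 1/L and the expected number of zero-weight
# (eliminated) features is n/L. With n = 54 this tracks the measured
# curve: 2 levels -> 27, 3 -> 18, 6 -> 9, 11 -> ~5, ..., 81 -> ~0.
def expected_zero_features(n_features, n_levels):
    return n_features / n_levels

for levels in (2, 3, 6, 11, 21, 41, 81):
    print(levels, round(expected_zero_features(54, levels), 1))
```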
Results (Comparison 2)
Performance of genetic feature selection/weighting in the presence of irrelevant features:
- The performance of the 1-NN classifier degrades rapidly as the number of irrelevant features increases.
- As the number of irrelevant features increases, FS outperforms all FW settings in both classification accuracy and feature elimination.
[Chart: classification rate % and number of eliminated features vs. number of irrelevant features]
Irrelevant features:   4      12     16     24     34     44     54
1-NN (all features):   59.32  41.84  34.48  30.88  28.36  25.04  22.84
FS:                    66.76  66.56  65.08  60.72  56.56  48.72  42.2
3FW:                   67     64.84  64.32  58.12  50.6   45     40.96
5FW:                   66.24  63.16  61.76  57.28  48.56  44.92  39.64
17FW:                  68     59.88  58.72  55.6   47.16  40.6   40.6
33FW:                  66.04  61.6   60.44  52.56  47.88  44.68  38.76

Eliminated features:
FS:   4  12  15  22  30  36  42
3FW:  4  11  13  16  21  24  26
5FW:  3   7   8   9  10  14  15
33FW: 0   0   0   0   0   0   0
As mentioned before, irrelevant features have no effect on the target concept at all, yet they lower the classification accuracy and increase the problem dimensionality.
Results (Comparison 3)
Performance of genetic feature selection/weighting in the presence of redundant features:
- The classification accuracy of 1-NN does not suffer much from added redundant features, but they increase the problem size.
- As the number of redundant features increases, FS has slightly better classification accuracy than all FW settings, but significantly outperforms FW in feature elimination.
[Chart: classification rate % and number of eliminated features vs. number of redundant features]
Redundant features:  4      14     24     34     44     54
1-NN (all features): 66.76  65.44  64.72  64.12  64     63.84
FS:                  67.16  66.68  66.12  66.2   66     65.36
3FW:                 67.12  66.68  66     64.76  64.48  63.92
5FW:                 66.96  66.48  64.84  64.48  64.48  63.88
33FW:                66.8   66.36  64.04  63.56  64.04  63.56

Eliminated features:
FS:   3   9  20  24  33  36
3FW:  2   7  12  13  20  25
5FW:  1   4   8   9  12  16
33FW: 0   0   0   0   0   0
Redundant features add nothing new to the target concept; they increase the problem size while contributing nothing to the classification problem.
Results (Comparison 4)
Performance of genetic feature selection/weighting for regular cases (not necessarily containing irrelevant/redundant features):
- FW achieves better training accuracies than FS, but FS generalizes better (better accuracies on unseen validation samples).
- FW over-fits the training samples.
[Chart: training and validation classification rates %]
Method:                    1-NN   FS     3FW    5FW    17FW   33FW
Training class. rate %:    67.42  67.52  67.96  68.2   68.8   68.82
Validation class. rate %:  67.76  67.76  67.36  66.28  66.24  66.24
The trend line shows that the greater the number of weight values, the higher the training accuracy and the lower the validation accuracy. Increasing the number of weight values beyond 2 or 3 increases the chance of over-fitting.
Results (Evaluation 1)
Convergence of GFS to an optimal or near-optimal set of features:
- GFS was able to return optimal or near-optimal values (as found by exhaustive search).
- The worst average value obtained by GFS was less than 1% away from the optimal value.

Number of features  Best exhaustive (class. rate %)  Best GA (class. rate %)  Average GA (5 runs)
 8                  74                               74                       74
10                  75.2                             75.2                     75.2
12                  77.2                             77.2                     77.04
14                  79                               79                       78.56
16                  79.2                             79                       78.28
18                  79.4                             79.4                     78.92

In five of the six cases, GFS returned the optimal value reached by exhaustive search; in the sixth it returned a near-optimal solution.
Results (Evaluation 2)
Convergence of GFS to an optimal or near-optimal set of features within an acceptable number of generations:
- The number of generations needed by GFS is bounded below by a linear-fit curve and above by an exponential-fit curve.
- Using GFS for high-dimensional problems requires parallel processing.
[Chart: number of generations to reach the optimum vs. number of features, with linear and exponential extrapolations]
Features:           8  10  12  14  16  18   20   22   24   26   28   30   40    50
Actual (measured):  5   5  10  10  15  20    -    -    -    -    -    -    -     -
Exponential fit:    5   5  10  10  15  20   26   37   46   66   87  117  519  2269
Linear fit:         5   5  10  10  15  20   21   25   28   32   35   38   55    72
(Extrapolated to 60 features: exponential fit ~10585 generations, linear fit ~89.)
The run time of the GA depends on the number of generations, the population size, the number of features, and the number of training samples. We investigated the relationship between the number of features and the number of generations needed to reach the optimal values (keeping the other factors unchanged). As the number of features increases, the number of generations needed to reach the optimal values increases as well. We used extrapolation because it is computationally infeasible to run exhaustive search for large numbers of features. The number of generations required for 60 features lies somewhere between 89 (linear fit) and 10585 (exponential fit); even the midpoint, 5337, would not be computationally feasible.
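The extrapolation step can be sketched with ordinary least squares (a minimal illustration: it uses the measured points from the chart but computes its own fits, so the 60-feature endpoints come out near, not equal to, the slide's 89 and 10585):

```python
import math

# Measured points from the chart: number of features vs. generations
# needed to reach the optimum.
features = [8, 10, 12, 14, 16, 18]
gens     = [5, 5, 10, 10, 15, 20]

def linfit(xs, ys):
    """Ordinary least-squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

a, b = linfit(features, gens)                           # linear model
la, lb = linfit(features, [math.log(y) for y in gens])  # exponential model, fit in log space

print(round(a + b * 60))              # linear extrapolation to 60 features
print(round(math.exp(la + lb * 60)))  # exponential extrapolation to 60 features
```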
Conclusions
- GFS is superior to GFW in feature reduction, without compromising classification accuracy.
- In the presence of irrelevant features, GFS is better than GFW in both feature reduction and classification accuracy.
- In the presence of redundant features, GFS is also preferred over GFW due to its greater ability to reduce features.
- For regular databases, it is advisable to use at most 2 or 3 weight values to avoid over-fitting.
- GFS is a reliable method for finding optimal or near-optimal solutions, but needs parallel processing for large problem sizes.
Questions ?
Pre-processing includes noise removal, image resizing, and thinning.