Page 1: Search space reduction for holistic ligature recognition in Urdu Nastaliq script (Bachelor thesis presentation)

Holistic Recognition of Printed Arabic Script Ligatures
Akram El-Korashy
Supervised by: Dr. Faisal Shafait

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
Kaiserslautern, Germany

Outline
● Introduction
  ○ Segmentation-free OCR for Arabic scripts
● Approaches Used
  ○ Feature Extraction and the Shape Context method
  ○ Machine Learning Techniques (Hierarchical classification, Spectral Hashing, Random Forests)
● Improvements and Methodology
  ○ Shape Context weaknesses
  ○ New Features (dots, sizes, pixel-level matching)
  ○ Classification Methodology
● Experiments and Results
● Conclusion and Summary

Akram El-Korashy, Segmentation-free OCR, 14.08.12 1

Segmentation-free OCR for Arabic scripts
● Nastaliq writing: classify whole ligatures instead of individual characters.
● Over 20,000 valid ligatures exist in the Urdu language.
● Preprocessing becomes easier, but feature extraction and classification become harder.

Feature Extraction, Shape Context method
● Distribution of points, transformation methods, structural analysis.
● Nabocr: Shape Context feature vector.
● Contour extraction.
● Shape Context is a shape descriptor proposed by Belongie et al.

Feature Extraction, Shape Context method
● 4 histograms, one from each of 4 quadrants.
● Each histogram is a sum of point histograms.
● Point histograms are built over distance and orientation.
● Histogram: bins of value ranges.
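
The bullets above can be illustrated with a minimal sketch, assuming the contour is given as a list of (x, y) points. Each point gets a histogram of the other points binned by distance and orientation, and the point histograms are summed per quadrant. The bin counts, the log-spaced distance bins, the choice of quadrants around the centroid, and the function names are all illustrative assumptions; this is not the Nabocr implementation.

```python
import math

N_DIST_BINS = 3   # assumed number of log-distance bins
N_ANGLE_BINS = 4  # assumed number of orientation bins


def point_histogram(ref, points, max_dist):
    """Histogram of the other points relative to `ref`, binned by distance and orientation."""
    hist = [0] * (N_DIST_BINS * N_ANGLE_BINS)
    for p in points:
        if p == ref:
            continue
        dx, dy = p[0] - ref[0], p[1] - ref[1]
        # Log-spaced distance bin, clamped to the last bin.
        frac = math.log1p(math.hypot(dx, dy)) / math.log1p(max_dist)
        d_bin = min(int(frac * N_DIST_BINS), N_DIST_BINS - 1)
        # Orientation bin from the angle in [0, 2*pi), clamped to the last bin.
        angle = math.atan2(dy, dx) % (2 * math.pi)
        a_bin = min(int(angle / (2 * math.pi) * N_ANGLE_BINS), N_ANGLE_BINS - 1)
        hist[d_bin * N_ANGLE_BINS + a_bin] += 1
    return hist


def quadrant_histograms(points):
    """Four histograms, one per quadrant around the centroid (an assumed split)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    max_dist = max(math.dist(p, q) for p in points for q in points) or 1.0
    quads = [[0] * (N_DIST_BINS * N_ANGLE_BINS) for _ in range(4)]
    for ref in points:
        q = (0 if ref[0] >= cx else 1) + (0 if ref[1] >= cy else 2)
        hist = point_histogram(ref, points, max_dist)
        quads[q] = [a + b for a, b in zip(quads[q], hist)]
    return quads


contour = [(0, 0), (4, 0), (4, 4), (0, 4)]  # toy square contour
features = [v for quad in quadrant_histograms(contour) for v in quad]
print(len(features), sum(features))
```

Concatenating the four quadrant histograms gives one fixed-length feature vector per ligature, which is the form a nearest-neighbour or tree classifier expects.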

Hierarchical Classification
● Decomposing a classification problem into a set of smaller problems.
● Useful with large numbers of categories.
● Efficiency of recognition.
● Can help improve accuracy.
● Independent set of features for each branch.
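
A minimal sketch of the idea, assuming a router that picks a branch and a separate 1-NN classifier with its own feature subset in each branch. The class name, the sample fields (`n_chars`, `width`, `dots`), and the labels are hypothetical and chosen only for illustration.

```python
def nearest_label(query, examples):
    """1-NN over (feature_vector, label) pairs using squared Euclidean distance."""
    best = min(examples, key=lambda ex: sum((a - b) ** 2 for a, b in zip(query, ex[0])))
    return best[1]


class HierarchicalClassifier:
    def __init__(self, router, branches):
        self.router = router      # sample -> branch name
        self.branches = branches  # branch name -> (feature extractor, training examples)

    def predict(self, sample):
        branch = self.router(sample)
        extract, examples = self.branches[branch]
        # Only the examples of the chosen branch are compared against.
        return branch, nearest_label(extract(sample), examples)


# Hypothetical branches: each one uses its own feature subset.
branches = {
    "1": (lambda s: (s["width"],), [((10,), "label_a"), ((30,), "label_b")]),
    "2+": (lambda s: (s["width"], s["dots"]), [((40, 0), "label_c"), ((40, 2), "label_d")]),
}
clf = HierarchicalClassifier(lambda s: "1" if s["n_chars"] == 1 else "2+", branches)

print(clf.predict({"n_chars": 1, "width": 12, "dots": 0}))
print(clf.predict({"n_chars": 2, "width": 41, "dots": 2}))
```

Because each query is compared only against the examples in its branch, the number of comparisons shrinks with the branch size, at the cost of losing any sample the router sends to the wrong branch.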

Spectral Hashing
● Fast nearest-neighbour (NN) technique.
● Maps a feature vector into a binary code that is:
  ○ easily computed
  ○ small in number of bits
  ○ similarity-preserving
● Calculating the binary code:
  ○ maximum variance directions (PCA)
  ○ sinusoidal eigenfunctions
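
The code computation can be sketched as follows, assuming the samples have already been projected onto their principal directions (the PCA step is not shown). Each bit thresholds a sinusoid of one projected coordinate at zero, and the lowest frequencies are used first, so directions with a wider range receive more bits. Function names and the toy data are assumptions; this is a simplified illustration, not the code used in the experiments.

```python
import math


def train_modes(projected, n_bits):
    """Pick the n_bits (frequency, dimension) pairs with the smallest frequencies."""
    n_dims = len(projected[0])
    lows = [min(row[d] for row in projected) for d in range(n_dims)]
    highs = [max(row[d] for row in projected) for d in range(n_dims)]
    candidates = []
    for d in range(n_dims):
        span = (highs[d] - lows[d]) or 1.0
        for k in range(1, n_bits + 1):
            # Frequency of the k-th sinusoidal eigenfunction on [low, high].
            candidates.append((k * math.pi / span, d))
    candidates.sort()
    return lows, candidates[:n_bits]


def binary_code(x, lows, modes):
    """One bit per mode: the sign of sin(pi/2 + omega * (x_d - low_d))."""
    return tuple(
        1 if math.sin(math.pi / 2 + omega * (x[d] - lows[d])) > 0 else 0
        for omega, d in modes
    )


# Toy PCA-projected samples: the first coordinate has the larger range.
projected = [(0.0, 0.0), (1.0, 0.5), (9.0, 1.0), (10.0, 0.2)]
lows, modes = train_modes(projected, n_bits=2)
codes = [binary_code(x, lows, modes) for x in projected]
print(codes)
```

In this toy data the two samples at the low end of the first coordinate share one code and the two at the high end share another, which is the similarity-preserving behaviour the slide refers to.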

Random Forests
● Ensemble classifier

Ensemble learning combines the predictions of different classifiers (decision trees) by collecting an independent vote from each tree and taking the majority vote as the prediction.
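
The voting step can be sketched in a few lines, assuming each tree is any callable that maps a sample to a label. The three toy "trees" below are hypothetical one-feature decision stumps, not trained models.

```python
from collections import Counter


def majority_vote(trees, sample):
    """Collect one vote per tree and return the most common label."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stumps, each looking at a single feature.
trees = [
    lambda s: "dot" if s["area"] < 20 else "body",
    lambda s: "dot" if s["width"] < 5 else "body",
    lambda s: "dot" if s["height"] < 5 else "body",
]
prediction = majority_vote(trees, {"area": 12, "width": 4, "height": 9})
print(prediction)
```

With an odd number of trees, a two-class vote can never end in a tie.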

Shape Context weaknesses
● Scale invariance
● Missing representation of dots
● Confusion between ligatures that differ only in their dots.

New Features
● Sizes of connected components
● Locations of connected components
  ○ above, below, or interleaving
  ○ grid location
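
A minimal sketch of these features, assuming a binary glyph image given as rows of `#` (ink) and `.` (background). It labels connected components with 8-connectivity, takes the largest one as the main body, and reports the size and relative location of the rest. Treating "interleaving" as vertical overlap with the main body is an assumption, as are the function names.

```python
from collections import deque


def connected_components(grid):
    """Return a list of components, each a list of (row, col) ink pixels (8-connectivity)."""
    rows, cols = len(grid), len(grid[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            seen.add((r, c))
            queue, comp = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == "#" and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            components.append(comp)
    return components


def relative_location(comp, main):
    """Classify a component as above, below, or interleaving the main body."""
    top, bottom = min(y for y, _ in comp), max(y for y, _ in comp)
    main_top, main_bottom = min(y for y, _ in main), max(y for y, _ in main)
    if bottom < main_top:
        return "above"
    if top > main_bottom:
        return "below"
    return "interleaving"


glyph = [
    "..#....",
    ".......",
    "#.....#",
    "#######",
    ".......",
    "..#.#..",
]
comps = connected_components(glyph)
main = max(comps, key=len)
dots = [(len(c), relative_location(c, main)) for c in comps if c is not main]
print(len(main), dots)
```

The count, sizes, and positions of the small components capture exactly the dot information that a contour-only descriptor of the main body leaves out.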

New Features
● Pixel-level properties:
  ○ weights of regions
  ○ fill ratio
● Length, width, aspect ratio
  ○ Invariance to scanning resolution
  ○ Setting a reference size
  ○ Histogram of widths and heights
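
A short sketch of the size features, assuming a component is a list of (row, col) ink pixels and that a reference size is known for the page. How the reference size is chosen is not stated on the slide, so `reference_height` is a hypothetical parameter. Dividing by it is what makes the values independent of scanning resolution.

```python
def size_features(pixels, reference_height):
    """Relative width and height, aspect ratio, and fill ratio of one component."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {
        "rel_width": width / reference_height,
        "rel_height": height / reference_height,
        "aspect_ratio": width / height,
        # Share of the bounding box that is covered by ink.
        "fill_ratio": len(pixels) / (width * height),
    }


def upscale(pixels, factor):
    """Simulate a higher scanning resolution by replicating every pixel."""
    return [
        (r * factor + dr, c * factor + dc)
        for r, c in pixels
        for dr in range(factor)
        for dc in range(factor)
    ]


# A 4x2 block with one corner pixel missing (7 ink pixels).
component = [(r, c) for r in range(2) for c in range(4) if (r, c) != (0, 0)]
low_res = size_features(component, reference_height=4)
high_res = size_features(upscale(component, 2), reference_height=8)
print(low_res)
print(low_res == high_res)
```

Doubling the resolution doubles both the component and the reference size, so all four features stay the same.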

Classification Methodology
● Experiment set "1"
  ○ Spectral Hashing: reduction of the number of comparisons
● Experiment set "2"
  ○ Random Forests: hierarchy by recognizing the number of characters
● Experiment "3"
  ○ Random Forests: classification of alphabet symbols

Classification Methodology
● Spectral Hashing (sunvid project):
  ○ Training dataset (~80,000 samples)
  ○ Test dataset (~20,000 samples)
  ○ Different combinations of number of bits, number of tables, and tolerance bits (different hash structures trained in parallel)
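
One way the three parameters could interact is sketched below. It assumes that every table stores one binary code per training sample, and that a query's candidates are the training samples whose code differs from the query's in at most `tolerance` bits in at least one table. This reading of "tables" and "tolerance bits" is an assumption, not a description of the sunvid code, and all names and codes are illustrative.

```python
from itertools import combinations


def neighbours_within(code, tolerance):
    """All bit tuples within Hamming distance `tolerance` of `code`."""
    out = {code}
    for t in range(1, tolerance + 1):
        for idxs in combinations(range(len(code)), t):
            flipped = list(code)
            for i in idxs:
                flipped[i] ^= 1
            out.add(tuple(flipped))
    return out


def build_table(codes):
    """Map each binary code to the indices of the training samples that have it."""
    table = {}
    for idx, code in enumerate(codes):
        table.setdefault(code, []).append(idx)
    return table


def candidates(query_codes, tables, tolerance):
    """Union over tables of training indices whose code is close to the query's code."""
    found = set()
    for code, table in zip(query_codes, tables):
        for near in neighbours_within(code, tolerance):
            found.update(table.get(near, ()))
    return found


# Two tables of 3-bit codes for four training samples (toy values).
tables = [
    build_table([(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]),
    build_table([(0, 0, 0), (1, 1, 1), (1, 0, 0), (0, 0, 1)]),
]
query_codes = [(0, 0, 0), (1, 1, 0)]
print(candidates(query_codes, tables, tolerance=0))
print(candidates(query_codes, tables, tolerance=1))
```

The exact 1-NN comparison then runs only on the returned candidates, so more tables or more tolerance bits trade a larger candidate set for a lower chance of missing the true nearest neighbour.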

Classification Methodology
● Random Forests (python milk):
  ○ Number of decision trees: 101
  ○ 70% of the attributes
  ○ 70% of the training samples
  ○ Reduced training dataset (~20,000 samples)
  ○ Test dataset of ~18,000 samples

Classification Methodology
● New feature vector
● Classifying based on the number of characters
● Classifying the alphabet symbols

[Diagram labels: input; Random Forest classifier; 1-character classifier; 2-character classifier; 3+ character classifier]

Experiments and Results
● Spectral Hashing Results "1"
  ○ Effect of changing the number of tables
  ○ 7-bit binary code, 2 tolerance bits

Experiments and Results
● Spectral Hashing Results "1"

Accuracy   Best Reduction   Hash (bits, tables, tolerance)
81.5%      37538 (47.2%)    7, 9, 1
81%        31553 (39.7%)    7, 7, 1
80.5%      23975 (30.1%)    8, 9, 1
79.5%      20736 (26.1%)    7, 4, 1
78%        18737 (23.6%)    8, 7, 1
76%        15392 (19.4%)    7, 3, 1

Experiments and Results
● Spectral Hashing Results "1"
● Significant reduction rates
  ○ Reduction down to 19% for a difference of 6% in accuracy.
  ○ Reduction down to 24% for a difference of 4% in accuracy.
  ○ Reduction down to 47.2% for no accuracy loss.
  ○ Observation: accuracy slightly higher than 1-NN for a reduction down to 57.6%.

Experiments and Results
● Random Forest Results "2"
● Accuracy of 78.7% for labels 1, 2, 3, 4+
● Accuracy of 45.4% for labels 1, 2, 3, 4, 5+
● Accuracy of 20.7% for labels 1, 2, 3, 4, 5, 6+
● Even worse with more partitioning

Experiments and Results
● Random Forest Results "2"
● Confusion matrix for labels 1, 2, 3+: alphabet symbols can be classified separately.

Test label / result   1      2     3+      Recall
1                     1131   88    14      91.9%
2                     16     94    531     17.2%
3+                    7      2     16627   99.9%
% true positives      98%    51%   96.8%   ___
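
The per-class figures can be recomputed from the counts with a short script, taking rows as test labels and columns as results. The "% true positives" row corresponds to per-column precision. Recomputing from the counts as transcribed reproduces that row and the 99.9% recall for 3+, but gives about 91.7% and 14.7% recall for labels 1 and 2 rather than the 91.9% and 17.2% shown, so some counts or percentages may have been altered in transcription.

```python
# Rows are test labels, columns are classifier results (counts from the table).
confusion = {
    "1": {"1": 1131, "2": 88, "3+": 14},
    "2": {"1": 16, "2": 94, "3+": 531},
    "3+": {"1": 7, "2": 2, "3+": 16627},
}
labels = list(confusion)


def recall(label):
    """Share of test samples with this label that were recognised as such."""
    row = confusion[label]
    return row[label] / sum(row.values())


def precision(label):
    """Share of samples assigned this label that truly carry it."""
    column_total = sum(confusion[true][label] for true in labels)
    return confusion[label][label] / column_total


for label in labels:
    print(label, f"recall={recall(label):.1%}", f"precision={precision(label):.1%}")
```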

Experiments and Results
● Alphabet symbols
  ○ Accuracy of 80.34% for Random Forests (experiment "3")
  ○ Accuracy of 98.74% for a 1-NN classifier
  ○ A 1-NN classifier can be used for recognition within class 1.
  ○ Over 30% of ligatures are individual characters.

Conclusion and Summary
● The feature vector can be improved.
● Spectral Hashing improves the efficiency of 1-NN: significant reduction of the search space.
● Random Forests can be used to separate the 1-character alphabet symbols.
● Useful for overall performance improvement on real text data.

Future Work

Thank You

Questions?

