An OCR System for recognition of Urdu text in Nastaliq Font


By

S. Hassan Amin

Supervised By

Dr. S. Afaq Hussain

Faculty of Computer Science & Engineering

Ghulam Ishaq Khan Institute of Engineering Sciences & Technology, Topi-Swabi, 2004

Layout

Introduction

Research Scope

Objectives

Optical Character Recognition

    Steps in OCR

Urdu Writing Characteristics

Cursive Script Recognition Schemes

Methodology

    Multi-Tier Holistic Approach

    Multi-Stage Classification Approach

Results and Discussion

Conclusion

Future Directions

References

Introduction

Urdu is the national language of Pakistan, and is understood by well over 300 million people around the world.

There is a need to convert the historical corpus of Urdu literature into electronic form, so that Urdu can prosper in the age of computers.

Urdu text recognition endeavors to convert scanned Urdu documents automatically into computerized text files.

Research Scope

Paper documents have been the most important means of exchanging information for ages, but this is changing as we rapidly move towards a paperless society.

IBM has estimated that about $250 billion is spent worldwide each year (largely on operator salaries, etc.) on keying in information from paper documents, and this is the cost of manually capturing information from only 5% of the available documents [1].

Urdu Text Recognition Urdu Text Transliteration Machine Translation

Objectives

The main objective of this research is to build an OCR system for the Urdu language that is effective for the Nastaliq script irrespective of font size and orientation. To achieve this objective, there are a number of sub-goals, which are:-

To investigate the problem of Urdu OCR in depth, and to propose new and better ways to solve this problem.

To investigate the use of an appropriate set of features for Urdu OCR.

To establish a database of Urdu ligatures for investigating the problem of Urdu OCR.

To investigate classification methods that can be useful for the problem of Urdu OCR.

Optical Character Recognition (OCR)

Character Recognition, or Optical Character Recognition (OCR), is the process of converting scanned images of machine-printed or handwritten text (numerals, letters and symbols) into a computer-processable format (such as ASCII or Unicode) [2].

Offline character recognition is performed after the writing or printing has been completed.

In online character recognition, the computer recognizes the characters as they are drawn, so timing information is available.

Steps in OCR

1. Image Acquisition

2. Preprocessing

3. Segmentation

4. Feature Extraction

5. Classification

6. Post Processing

1. Image Acquisition

Image acquisition converts the document into a digital image. This conversion is accomplished by a digitizer, which can be a scanner (offline recognition), a camera, or a tablet digitizer (online recognition).

2. Preprocessing

Preprocessing involves noise reduction, skew detection, slant normalization, document decomposition, etc.

For slant estimation, methods include the projection method and the chain code method [4].

For estimating the skew angle of a page, methods include the orientation-dependent histogram [3].

3. Segmentation

Segmentation is the process of dividing an image into regions, each likely to contain a single object or a group of objects of the same type. For instance, an object can be a character on a text page or a line segment in an engineering drawing.

In OCR, the commonly used segmentation algorithms are XY tree decomposition, run-length smearing and the Hough transform.
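As an illustration of run-length smearing, the sketch below fills short white gaps between black pixels along each row, so that nearby characters merge into blocks. It is a minimal horizontal-only version; the function names, the 0/1 pixel convention and the choice to leave leading and trailing white runs untouched are assumptions made for the example, not details from the source.

```python
def smear_row(row, threshold):
    """Fill white runs of length <= threshold that lie between two black pixels."""
    out = list(row)
    last_black = None  # index of the most recent black pixel in this row
    for i, value in enumerate(row):
        if value:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply horizontal run-length smearing to a binary image (1 = black)."""
    return [smear_row(row, threshold) for row in image]
```

A full implementation usually smears vertically as well and combines the two results; the threshold would be chosen relative to the character and line spacing of the page.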

4. Feature Extraction

Selection of appropriate feature extraction method is probably the single most important factor in achieving high recognition performance [5].

A newcomer to the field is faced with the challenge of selecting appropriate features for his/her application.

Feature Extraction(Contd)

Some useful feature extraction methods in the field of OCR are :-

1. Geometric Features

2. Structural Features

3. Moment based Features

4. Template Matching

5. Unitary Image Transforms

6. Zoning

7. Contour Profiles

8. Fourier Descriptors

5. Classification

Classification is the process of identifying each character and assigning to it the correct character class. Two major approaches for classification methods are:

1. Decision theoretic method

2. Structural Methods

1. Decision theoretic method

These methods are used when the description of the character can be represented numerically in a feature vector.

The principal approaches to decision-theoretic recognition are minimum distance classifiers, statistical classifiers and neural networks.
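A minimum distance classifier is the simplest of these: each class is represented by the mean of its training feature vectors, and an unknown vector is assigned to the class with the nearest mean. The sketch below assumes Euclidean distance and uses hypothetical function names and toy data.

```python
def fit_class_means(samples):
    """samples maps a class label to a list of equal-length feature vectors."""
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in samples.items()
    }


def nearest_mean(means, x):
    """Return the label whose mean vector is closest to x (squared Euclidean distance)."""
    return min(
        means,
        key=lambda label: sum((m - v) ** 2 for m, v in zip(means[label], x)),
    )
```

For example, after `means = fit_class_means({"a": [[0, 0], [0, 2]], "b": [[10, 10], [12, 10]]})`, a call to `nearest_mean(means, [1, 1])` selects class `"a"`.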

2. Structural Methods

Within the area of the structural recognition, syntactic methods are among the most common approaches.

In Syntactic pattern recognition, measures of similarity based on the relationship between structural components are formulated using grammatical concepts.

6. Post Processing

Post processing consists of:

1. Grouping

2. Error Detection and Correction

1. Grouping

The result of plain symbol recognition is a set of individual symbols.

These symbols in themselves usually do not contain enough information.

We would like to associate the individual symbols that belong to the same string with each other, making up words and numbers.

The process of performing this association of symbols into strings is commonly referred to as grouping.

2. Error Detection and Correction

Along with the grouping of characters, another issue to take care of is the context in which each character appears.

Even the best OCR systems cannot identify every character with 100% accuracy. These errors may be detected, or even corrected, by use of context.
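One common way to use context is to compare each recognized word against a lexicon and replace it with the closest entry when the match is strong enough. The sketch below uses Python's standard `difflib` module for the comparison; the lexicon, the function name and the cutoff value are hypothetical, and the source does not say which correction method the system uses.

```python
import difflib


def correct_word(word, lexicon, cutoff=0.8):
    """Return the closest lexicon entry, or the word unchanged if none is close enough."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

A word with one misrecognized character, such as "recogmtion", would be mapped to "recognition" if that word is in the lexicon, while a word with no close match is returned unchanged.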

Urdu Writing Characteristics

Urdu is written in a cursive script and has evolved from the Arabic, Persian and Turkish languages.

According to different sources, the Urdu alphabet has 36, 37, 42, 51 or 53 characters [8].

The UZT 1.01 standard has 42 characters.

Urdu Writing Characteristics(Contd)

Figure : Urdu Character Set UZT 1.01

Urdu Writing Characteristics(Contd)

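| Characteristics | Urdu | Arabic | Latin | Hebrew | Hindi |
|---|---|---|---|---|---|
| H-Justification | RL | RL | LR | RL | LR |
| V-Justification | Center | Base | No | No | Top |
| Cursive | Yes | Yes | No | No | Yes |
| Diacritics | Yes | Yes | No | No | Yes |
| # Vowels | 2 | 2 | 5 | 11 | - |
| # Letters | 37 | 28 | 26 | 22 | 40 |
| Letter Shapes | 1-28 | 1-4 | 2 | 1 | 1 |
| Complementary Characters | 5 | 3 | - | - | - |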

Cursive Script Recognition Schemes

Two strategies have been applied to cursive script recognition. As mentioned by Amin and Khorsheed [6,7], they can be categorized as follows:

1. Holistic strategies, in which recognition is performed globally on the whole representation of words and there is no attempt to identify characters individually.

Cursive Script Recognition Schemes(Contd)

2. Analytical strategies, in which words are not considered as a whole but as sequences of small units, and recognition is not performed directly at word level but at an intermediate level dealing with these units, which can be graphemes, segments, pseudo-letters, etc.

Research Methodology

Two approaches to recognizing Urdu ligatures printed in the Nastaliq script are presented. Both approaches are holistic in nature. They are tested on the identification of a set of the most frequent ligatures printed in the Noori Nastaliq script. The suggested approaches to recognizing Urdu text are:-

1. Multi-tier Holistic Approach

2. Multi-Stage Classification Approach.

Multi-Tier Holistic Approach to Urdu Nastaliq Recognition

A multi-tier holistic approach using a feed-forward back-propagation neural network was implemented [12].

(Contd)

Figure : Multi-Tier Holistic Approach to Urdu Nastaliq Recognition

1. Segmentation

Connected Component Labeling is applied to the image of Urdu text.

This technique assigns to each connected component of binary image a distinct label.

The labels are usually natural numbers from 1 to the number of connected components in the input image.

The algorithm scans the image from left-to-right and top-to-bottom.

Segmentation(Contd)

On the first line containing black pixels, a unique label is assigned to each contiguous run of black pixels.

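For each black pixel, the pixels in its eight-neighborhood are examined. If any of these pixels has already been labeled, the same label is assigned to the current pixel; otherwise a new label is assigned to it. If the labeled neighbors carry different labels, those labels belong to the same component and must be treated as equivalent. The procedure continues to the bottom of the image.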
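The labeling procedure described above can be sketched as a two-pass algorithm. The version below is an illustrative pure-Python implementation, not the author's code: it assumes a binary image stored as a list of rows with 1 for black pixels, and it uses a union-find table (an implementation choice made here) to merge labels that turn out to belong to the same component.

```python
def label_components(image):
    """Label the 8-connected components of a binary image (1 = black pixel)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[i] is the union-find parent of provisional label i

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # First pass: scan left-to-right, top-to-bottom.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            # Neighbours already visited by the scan: W, NW, N, NE.
            seen = []
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                    seen.append(find(labels[rr][cc]))
            if seen:
                root = min(seen)
                labels[r][c] = root
                for s in seen:
                    parent[s] = root  # record that these labels are equivalent
            else:
                parent.append(len(parent))  # create a new provisional label
                labels[r][c] = len(parent) - 1

    # Second pass: replace provisional labels with consecutive labels 1..n.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = final.setdefault(root, len(final) + 1)
    return labels, len(final)
```

For a U-shaped pattern, the two vertical strokes first receive different provisional labels, which are merged when the scan reaches the connecting bottom row, so the whole shape ends up as one component.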

Feature Extraction I

In this stage, we extract some features that will help us in the recognition of special ligatures (see figure). These features are solidity, number of holes, axis ratio, eccentricity, moments, normalized segment length, curvature, and the ratio of bounding box width to height.
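Two of the listed features can be sketched directly from the pixel coordinates of a ligature. The example below assumes that the "moments" are Hu's invariant moments (the results table lists seven moment columns, which is consistent with that, but the source does not name them) and computes only the first two, together with the bounding-box ratio. Function names and the pixel-list representation are hypothetical.

```python
def hu_first_two(pixels):
    """Return Hu's first two invariant moments for a list of (x, y) black pixels."""
    n = len(pixels)  # zeroth-order moment (area in pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments.
    mu20 = sum((x - cx) ** 2 for x, _ in pixels)
    mu02 = sum((y - cy) ** 2 for _, y in pixels)
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels)
    # Normalized central moments: eta_pq = mu_pq / mu_00^2 when p + q = 2.
    eta20, eta02, eta11 = mu20 / n ** 2, mu02 / n ** 2, mu11 / n ** 2
    phi1 = eta20 + eta02
    phi2 = (eta20 - eta02) ** 2 + 4 * eta11 ** 2
    return phi1, phi2


def bbox_ratio(pixels):
    """Return bounding-box width divided by bounding-box height."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    return width / height
```

Because the moments are computed about the centroid and normalized by area, shifting the ligature within the image leaves both values unchanged, which suits features that must not depend on position.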


Special Ligature Identification

A feed-forward BPN network is trained on the feature vectors obtained in the Feature Extraction I stage. During testing, this network is used to identify each input ligature as one of the special ligatures. If no valid output is returned, the ligature is identified as a base ligature.

Feature Extraction II

In this stage, special ligatures are associated with the base ligatures. Each special ligature is associated with the base ligature whose centroid-to-centroid distance is minimum.

A number of lines are grown from the center of each special ligature; when one of these lines touches a base ligature, the special ligature is associated with that base ligature.

As a result of this association, twenty new features are added to the feature vector of the base ligature.
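The centroid-based association rule can be sketched as follows. This is a minimal illustration that covers only the minimum centroid-to-centroid distance criterion, not the line-growing step; each ligature is assumed to be a list of (x, y) pixel coordinates, and the function names are hypothetical.

```python
import math


def centroid(pixels):
    """Return the mean (x, y) position of a list of pixel coordinates."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


def associate(special, bases):
    """Return the index of the base ligature whose centroid is nearest to the special ligature's."""
    sc = centroid(special)
    return min(
        range(len(bases)),
        key=lambda i: math.dist(sc, centroid(bases[i])),
    )
```

Once the index is known, the features of the special ligature (for example its type and relative position) can be appended to the feature vector of the selected base ligature.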

Classification and Recognition

In this stage, the final feature vector, consisting of 34 features, is fed into a feed-forward back-propagation neural network. The network architecture consists of 34 inputs, 65 hidden neurons and 45 output neurons.
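To make the 34-65-45 architecture concrete, the sketch below builds a network of that shape and runs a forward pass. The sigmoid activation, the bias terms, the random untrained weights and the rule of taking the strongest output as the class are all assumptions for illustration; the source specifies only the layer sizes and that training uses back-propagation.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layer, inputs):
    """Compute sigmoid(W x + b) for one fully connected layer."""
    weights, biases = layer
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]


def predict_class(hidden, output, features):
    """Return the index of the output neuron with the highest activation."""
    scores = forward(output, forward(hidden, features))
    return max(range(len(scores)), key=scores.__getitem__)


def count_parameters(sizes):
    """Count weights plus biases for consecutive fully connected layers."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


rng = random.Random(0)
hidden_layer = init_layer(34, 65, rng)   # 34 features -> 65 hidden neurons
output_layer = init_layer(65, 45, rng)   # 65 hidden neurons -> 45 ligature classes
```

Under these assumptions the network has 34 x 65 + 65 + 65 x 45 + 45 = 5245 trainable parameters, which gives a sense of how much training data the classifier needs.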

Multi-Stage Classification Approach to Urdu Text Recognition

The motivation behind this approach is the belief that classification performance could be improved by combining multiple classifiers [9,10,11].

(Contd)

As shown in the figure, the first three stages are similar to the multi-tier approach.

Intermediate Classification

In the training phase, we train a competitive network on the feature vectors of base ligatures to divide the input data into the desired number of clusters.

Also in the training phase, an LVQ/BPN network is trained on the output of the competitive network to classify an input pattern into a particular class or cluster.

In the testing phase, the input feature vector is presented to the trained LVQ/BPN network, which gives us the desired class/cluster.
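The clustering step can be illustrated with a basic winner-take-all competitive learning rule, where only the prototype closest to each input is moved towards it. The sketch below is a simplified stand-in: it uses nearest-prototype lookup to route a test vector to its cluster, whereas the source trains a separate LVQ/BPN network for that purpose. The learning rate, epoch count and all names are assumptions.

```python
def nearest_prototype(prototypes, x):
    """Return the index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda k: sum((p - v) ** 2 for p, v in zip(prototypes[k], x)),
    )


def train_competitive(data, prototypes, lr=0.1, epochs=20):
    """Move the winning prototype a fraction lr towards each input vector."""
    protos = [list(p) for p in prototypes]
    for _ in range(epochs):
        for x in data:
            k = nearest_prototype(protos, x)
            protos[k] = [p + lr * (v - p) for p, v in zip(protos[k], x)]
    return protos


def route(prototypes, per_cluster_classifiers, x):
    """Send x to the classifier trained for its cluster and return that result."""
    return per_cluster_classifiers[nearest_prototype(prototypes, x)](x)
```

Splitting the ligatures into clusters first means that each second-stage network only has to separate the ligatures within one cluster, which is the motivation given above for combining classifiers.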

(Contd)

Ligature Identification

A BPN network is trained for all the ligatures belonging to a particular class/cluster in the classification and recognition stage of the system.

Results and Discussion

Frequency Analysis

To establish a database of Urdu images for training and testing, it was decided that the most frequent Urdu ligatures would be identified from the World Wide Web.

This was a challenge, since most Urdu sites are based on images of Urdu text, so there was no way of counting Urdu ligatures without first identifying them.

The BBC Urdu news site http://www.bbc.co.uk/urdu/ was selected for frequency analysis because it is a font-based Urdu site.

The hex codes of the BBC Urdu font were studied. A study of the Urdu font was also done. There are three types of Urdu characters, given as follows:

1. Characters which do not connect on either side, e.g. alif

2. Characters which connect on both sides, e.g. bay, tay

3. Characters which do not connect from the left, e.g. wow, ray

There are two types of breaks in an Urdu text file: a hard break, identified by 0x0020, and a soft break, identified by the nature of the character. On the basis of these breaks and punctuation marks, we decide where ligatures are separated, and hence keep a count of ligatures.
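The counting rule can be sketched as follows. This is an illustrative reconstruction rather than the author's program: the set of characters that end a ligature is an assumption made here (alif is included so that the pair kaf + alif stays one ligature), diacritics are not handled, and the function names are hypothetical. Code points are written as escapes to avoid any ambiguity between similar-looking Arabic and Urdu letters.

```python
from collections import Counter

# Assumed set of letters that do not connect to the following letter:
# alif, alif madda, dal, ddal, zal, re, rre, ze, zhe, wow, barri ye.
NON_JOINING_LEFT = set(
    "\u0627\u0622\u062F\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06D2"
)


def split_ligatures(text):
    """Split text into ligatures using hard breaks, punctuation and soft breaks."""
    ligatures, current = [], ""
    for ch in text:
        if not ch.isalpha():
            # Hard break (space, U+0020) or punctuation ends the current ligature.
            if current:
                ligatures.append(current)
            current = ""
            continue
        current += ch
        if ch in NON_JOINING_LEFT:
            # Soft break: this letter cannot join the letter that follows it.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


def count_ligatures(text):
    """Return a Counter mapping each ligature to its number of occurrences."""
    return Counter(split_ligatures(text))
```

Calling `count_ligatures(text).most_common(20)` on a page of text would produce a ranked list in the same form as the frequency table of ligatures.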

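Frequency Analysis(Contd)

| S.No. | Ligature | Count |
|---|---|---|
| 1 | ا | 2904 |
| 2 | ر | 1600 |
| 3 | و | 1240 |
| 4 | کے | 745 |
| 5 | د | 718 |
| 6 | ں | 480 |
| 7 | کی | 469 |
| 8 | نے | 456 |
| 9 | میں | 445 |
| 10 | ن | 439 |
| 11 | کا | 408 |
| 12 | ہے | 377 |
| 13 | کر | 338 |
| 14 | کو | 309 |
| 15 | ہ | 295 |
| 16 | سے | 290 |
| 17 | ی | 269 |
| 18 | ہو | 269 |
| 19 | س | 260 |
| 20 | کہ | 256 |

Table : List of 20 most frequent ligatures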

1. Segmentation

Feature Vectors

| S.No. | Name | Moment 1 | Moment 2 | Moment 3 | Moment 4 | Moment 5 | Moment 6 | Moment 7 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.bmp | 0.52283 | 0.24376 | 0.00496 | 0.004624 | 2.21E-05 | 0.002274 | -5.63E-07 |
| 2 | 10.bmp | 0.16563 | 9.28E-05 | 0.000277 | 5.48E-06 | 2.01E-10 | -1.78E-08 | -7.16E-11 |
| 3 | 100.bmp | 0.16949 | 0.000171 | 9.12E-05 | 3.05E-06 | 4.94E-11 | -2.88E-08 | 1.26E-11 |
| 4 | 101.bmp | 0.64308 | 0.37256 | 0.008243 | 0.005689 | 3.88E-05 | 0.003196 | 3.37E-06 |
| 5 | 102.bmp | 0.16488 | 0.0007 | 0.000168 | 6.87E-06 | 1.89E-10 | 1.61E-07 | -1.37E-10 |
| 6 | 103.bmp | 0.40757 | 0.03951 | 0.039031 | 0.031366 | 0.001002 | 0.006081 | -0.00045 |
| 7 | 104.bmp | 0.29624 | 0.048083 | 0.000436 | 8.78E-05 | 1.66E-08 | 1.91E-05 | 4.48E-09 |
| 8 | 105.bmp | 0.16481 | 0.000165 | 4.32E-05 | 1.16E-06 | 6.75E-12 | -7.19E-09 | 4.63E-12 |
| 9 | 106.bmp | 0.26849 | 0.033972 | 0 | 0 | 0 | 0 | 0 |

Figure : Moment based features for some ligatures

| S.No. | Name | Solidity | Minor Axis Length | Major Axis Length | Eccentricity | Orientation | Axis Ratio |
|---|---|---|---|---|---|---|---|
| 1 | 1.bmp | 0.82051 | 4.0294 | 22.8431 | 0.98432 | 86.8416 | 0.1764 |
| 2 | 10.bmp | 0.80645 | 5.7038 | 6.0321 | 0.32538 | 56.7493 | 0.94558 |
| 3 | 100.bmp | 0.75 | 5.6006 | 6.0319 | 0.37135 | -16.0531 | 0.92849 |
| 4 | 101.bmp | 0.61702 | 2.9867 | 17.0919 | 0.98461 | 41.8065 | 0.17474 |
| 5 | 102.bmp | 0.83871 | 5.4889 | 6.4133 | 0.51721 | 17.1527 | 0.85586 |
| 6 | 103.bmp | 0.44898 | 16.0802 | 27.3559 | 0.809 | 109.504 | 0.58781 |
| 7 | 104.bmp | 0.66667 | 5.3315 | 13.5202 | 0.91897 | 1.5099 | 0.39433 |
| 8 | 105.bmp | 0.81081 | 6.1484 | 6.6314 | 0.37467 | 16.9823 | 0.92716 |
| 9 | 106.bmp | 0.7 | 6.2487 | 14.2896 | 0.89932 | -3.3781 | 0.43729 |

Figure : Geometric features for some ligatures
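The tabulated geometric values appear to follow the usual relations for an ellipse fitted to the ligature: the axis ratio is the minor axis length divided by the major axis length, and the eccentricity is the square root of one minus the squared ratio. This is an observation from the numbers in the table, not a definition given in the source; the short sketch below, with hypothetical function names, reproduces the tabulated values for the first row.

```python
import math


def axis_ratio(minor, major):
    """Ratio of minor to major axis length of the fitted ellipse."""
    return minor / major


def eccentricity(minor, major):
    """Eccentricity of an ellipse with the given axis lengths."""
    return math.sqrt(1.0 - (minor / major) ** 2)


# Row 1 of the geometric feature table (1.bmp).
ratio_1 = axis_ratio(4.0294, 22.8431)
ecc_1 = eccentricity(4.0294, 22.8431)
```

Since eccentricity is a function of the axis ratio alone, the two features carry the same information, which may be relevant when studying the effectiveness of the feature set.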

Special Ligature Identification

Figure : Importance of Special ligature in identifying ligatures

| Parameter | Value |
|---|---|
| Network | BPN |
| Configuration | 52-26-8 |
| Goal | 0.01 |
| Mc | 0.4 |
| Lr | 0.1 |

Figure : Network configuration used to identify special ligatures

Special Ligature Identification(Contd)

Figure : Training to identify special ligatures

Intermediate Classification

Figure : Analysis for identification of clusters

Intermediate Classification(Contd)

| Features Used | No. of Clusters | No. of Images |
|---|---|---|
| Moment 1, Solidity, Eccentricity, Axis Ratio | 4 | 216 |

| Neural Net Used | Configuration |
|---|---|
| BPN | 64-32-4 |

Percentage Distribution of Clusters

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| 16.67 | 29.63 | 27.78 | 25.93 |

Figure : Network Configuration

Intermediate Classification(Contd)

Figure : Training to identify clusters

Feature Extraction II

Ligature Identification

Cluster 4: Configuration 80-40-8, Lr 0.1, mc 0.3

Ligature Identification(Contd)

Cluster 2: Configuration 80-40-8, Lr 0.1, mc 0.3, goal 0.019

Conclusion

Two different approaches for recognition of cursive Urdu text written in the Nastaliq script have been presented.

A set of the 1000 most frequent ligatures has been identified.

Our approach minimizes the errors due to segmentation by using a segmentation-free approach.

By using different types of features, we have improved the number of ligatures that can be identified.

Classification performance has been improved by implementing a multi-stage classification approach; this approach is especially useful for a large number of ligatures [9,10,11].

Future Directions

A number of possible directions are under consideration for enhancing the system for practical use, namely:

Study of the effectiveness of the features used, and a search for new features that can be effective for Urdu OCR.

Enhancement of the number of ligatures used for training.

Addition of special characters, numerals and Aerab for recognition as special ligatures.

Recognition of intonation marks in the document.

Addition of multilingual support to the system.

References

1. http://www.almaden.ibm.com/cs/dare.html

2. Sargur N. Srihari, Stephen W. Lam, “Character Recognition”.

3. H. Bunke and P. S. P. Wang, “Handbook of Character Recognition and Document Image Analysis”, World Scientific.

4. M. Shridhar, F. Kimura, “Segmentation Based Cursive Handwriting Recognition”, Handbook of Character Recognition.

5. Øivind Due Trier, Anil K. Jain and Torfinn Taxt, “Feature Extraction Methods for Character Recognition - A Survey”, Pattern Recognition, vol. 29, no. 4, pp. 641-662, 1996.

6. Adnan Amin, “Arabic Character Recognition”, Handbook of Character Recognition.

7. Mohammad S. Khorsheed, “Structural Features of Cursive Arabic Script”.

8. Muhammad Afzal, Sarmad Hussain, “Urdu Computing Standards: Development of Urdu Zabta Takhti - WG2 N2413-2-SC2 N3589-2 (UZT) 1.01”.

9. L. Xu, A. Krzyzak, and C. Y. Suen, “Methods of Combining Multiple Classifiers and their Applications to Handwriting Recognition,” IEEE Trans. Systems, Man and Cybernetics, vol. 22, no. 3, pp. 418-435, 1992.

10. T. K. Ho, J. J. Hull and S. N. Srihari, “Decision Combination in Multiple Classifier Systems,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66-75, 1994.

11. J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, “On Combining Classifiers,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998.

12. Syed Afaq Husain, S. Hassan Amin, “Multi-Tier Holistic Approach to Urdu Nastaliq Recognition,” IEEE INMIC, Dec. 2002, Karachi.

Questions ?

Thank You
