CHARACTER Segmentation and Ground–truth preparation for handwritten
Bangla word images
Submitted by
SANCHITA MAITY
Exam. Roll No. : MCA-3212027 of 2011-12
University Regn. No. : 108560 of 2009-10
Under the guidance of
Mr. Ram Sarkar
Department of Computer Science and Engineering,
Jadavpur University.
A dissertation submitted in partial fulfillment of the requirements for the award of Master
of Computer Application (MCA)
Department of Computer Science and Engineering
Faculty of Engineering and Technology
Jadavpur University
Kolkata - 700 032
2011-2012
CONTENTS
Chapter 1: Introduction
1.1 An overview on Optical Character Recognition (OCR)
1.1.1 Description
1.1.2 History of OCR
1.1.3 Problem of OCR
1.1.4 Recent Trends in OCR research
1.2 Characteristics of Bangla script
1.3 Character Segmentation and Ground-truthing
1.3.1 What is character segmentation?
1.3.2 What is ground-truthing?
1.3.3 Importance of handwritten Bangla Word
Chapter 2: Review of existing work
2.1 Problems of Character Segmentation from handwritten Bangla word images
2.2 Some recent character segmentation and ground-truthing methodologies
2.2.1 A fuzzy technique for character segmentation
2.2.2 A two stage approach for segmentation
2.2.3 A database for unconstrained handwritten Bangla word images
2.2.4 A complete handwritten numeral database
2.3 Motivation
Chapter 3: Present Work
3.1 Data collection methodologies
3.2 Segmentation
3.2.1 Selection of SF and DNS Components
3.2.1.1 Initial Selection of Obvious SF and DNS Class Components
3.2.1.2 Classification of SF/DNS Components using MLP
3.2.2 Determination of Matra Pixels using a Fuzzy Membership Function and Horizontalness Feature for SF components
3.2.3 Determination of Potential Segmentation Points using Two Fuzzy Membership Functions for SF components
3.2.4 Identification of Actual Segmentation Points in the SF Components
3.3 Preparation of Ground-truthed images
Chapter 4: Conclusion
References
Chapter 1
Introduction
1.1 An Overview on Optical Character Recognition
(OCR)
1.1.1 Description
Optical character recognition, usually abbreviated to OCR, is the mechanical or
electronic translation of images of handwritten, typewritten or printed text (usually
captured by a scanner) into machine-editable text.
Broadly speaking, an OCR system lowers the barrier of the keyboard interface
between man and machine to a great extent, and helps in the advancement of office
automation. By doing so, OCR systems facilitate large-scale document transcription with
a huge saving of time and human effort. Such systems have potential applications in
reading amounts from bank checks, extracting data from filled-in forms, interpreting
handwritten addresses on mail pieces for automatic routing, and so on.
OCR is a field of research in pattern recognition, artificial intelligence and machine
vision. Though academic research in the field continues, the focus of OCR has shifted to
implementation of proven techniques. Optical character recognition (using optical
techniques such as mirrors and lenses) and digital character recognition (using
scanners and computer algorithms) were originally considered separate fields. Because
very few applications survive that use true optical techniques, the term OCR has now
been broadened to include digital image processing as well.
Early systems required training (the provision of known samples of each
character) to read a specific font. "Intelligent" systems with a high degree of recognition
accuracy for most fonts are now common. Some systems are even capable of
reproducing output that closely approximates the original scanned page, including
images, columns and other non-textual components.
1.1.2 History of OCR
In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by
Handel, who obtained a US patent on OCR in 1933 (U.S. Patent 1,915,993). In
1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329).
Tauschek's machine was a mechanical device that used templates. A photo
detector was placed so that when the template and the character to be recognized were
lined up for an exact match and a light was directed towards them, no light would reach
the photo detector.
In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency
in the United States, was asked by Frank Rowlett, who had broken the Japanese PURPLE
diplomatic code, to work with Dr. Louis Tordella to recommend data automation
procedures for the Agency. This included the problem of converting printed messages
into machine language for computer processing. Shepard decided it must be possible to
build a machine to do this and, with the help of Harvey Cook, a friend, built "Gismo" in
his attic during evenings and weekends. This was reported in the Washington Daily
News on 27 April 1951 and in the New York Times on 26 December 1953, after his U.S.
Patent 2,663,758 was issued. Shepard then founded Intelligent Machines Research
Corporation (IMR), which went on to deliver the world's first several OCR systems. These
used image analysis, as opposed to character matching, and could accept some font
variation. Gismo was limited to reasonably close vertical registration, whereas the
subsequent commercial IMR scanners analyzed characters anywhere in the scanned field,
a practical necessity on real-world documents.
The first commercial system was installed at the Reader's Digest in 1955; many
years later it was donated by Reader's Digest to the Smithsonian, where it was put
on display. The second system was sold to the Standard Oil Company of California for
reading credit card imprints for billing purposes, with many more systems sold to other
oil companies. Other systems sold by IMR during the late 1950s included a bill stub
reader to the Ohio Bell Telephone Company and a page scanner to the United States Air
Force for reading typewritten messages and transmitting them by teletype. IBM and
others were later licensed on Shepard's OCR patents.
In about 1965 Reader's Digest and RCA collaborated to build an OCR document
reader designed to digitize the serial numbers on Reader's Digest coupons returned from
advertisements. The documents were printed by an RCA drum printer
using the OCR-A font. The reader was connected directly to an RCA 301 computer (one
of the first solid-state computers). This reader was followed by a specialized document
reader installed at TWA, where it processed airline ticket stock (a task made
more difficult by the carbonized backing on the ticket stock). The readers processed
documents at a rate of 1500 documents per minute and checked each document, rejecting
those they were not able to process correctly. The product became part of the RCA product
line as a reader designed to process "turn-around documents" such as the utility and
insurance bills returned with payments.
The United States Postal Service has been using OCR machines to sort mail
since 1965, based on technology devised primarily by the prolific inventor Jacob
Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO).
In 1965 it began planning an entire banking system, the National Giro, using OCR
technology, a process that revolutionized bill payment systems in the UK. Canada Post
has been using OCR systems since 1971. OCR systems read the name and address of the
addressee at the first mechanized sorting center and print a routing bar code on the
envelope based on the postal code. After that, the letters need only be sorted at later
centers by less expensive sorters, which need only read the bar code. To avoid
interference with the human-readable address field, which can be located anywhere on
the letter, a special ink is used that is clearly visible under ultraviolet light. This ink
looks orange in normal lighting conditions. Envelopes marked with the machine-readable
bar code may then be processed.
In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc.
and led development of the first omni-font optical character recognition system, a
computer program capable of recognizing text printed in any normal font. He decided
that the best application of the technology would be to create a reading machine for the
blind, which would allow blind people to understand written text by having a computer
read it to them out loud. However, this device required the invention of two enabling
technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January
13, 1976, the finished product was unveiled during a widely reported news conference
headed by Kurzweil and the leaders of the National Federation of the Blind. Called the
Kurzweil Reading Machine, the device covered an entire tabletop, but functioned
exactly as intended. On the day of the machine's unveiling, Walter Cronkite used the
machine to give his signature sign-off, "And that's the way it was, January 13, 1976."
While listening to The Today Show, musician Stevie Wonder heard a demonstration of
the device and personally purchased the first production version of the Kurzweil
Reading Machine.
In 1978 Kurzweil Computer Products began selling a commercial version of the
optical character recognition computer program. LexisNexis was one of the first
customers, and bought the program to upload paper legal and news documents onto its
nascent online databases. Two years later, Kurzweil sold his company to Xerox, which
had an interest in further commercializing paper-to-computer text conversion. Kurzweil
Computer Products thus became a subsidiary of Xerox known as ScanSoft (now
Nuance).
1.1.3 Problem of OCR
OCR of textual documents in general involves the following problems.
i) Image acquisition
ii) Text line extraction from document images
iii) Word segmentation and character segmentation
iv) Character recognition and word recognition
Optical scanners attached to PCs are mostly used for capturing digital images
of documents. Extraction of text lines from document images is a trivial problem
provided that the document image is unskewed. Text lines in such document images
can be easily extracted by identifying valleys in the horizontal pixel density histograms
of these images. But in all practical situations, document images are skewed at least to
some extent, and this technique fails to work for such images. Many text lines may
touch each other. Skewness is inherent in handwritten text. So, special techniques are
required for character segmentation of handwritten Bangla word images.
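The histogram-valley idea for unskewed pages can be illustrated with a minimal Python sketch. It is not taken from the dissertation: the function name `extract_text_lines`, the `min_ink` threshold and the convention that nonzero pixels are ink are all assumptions made for illustration.

```python
import numpy as np


def extract_text_lines(binary, min_ink=1):
    """Return (top_row, bottom_row) pairs for text lines of an unskewed page.

    binary  : 2-D array, nonzero = ink pixel, 0 = background (assumed convention).
    min_ink : hypothetical threshold; rows with fewer ink pixels are valleys.
    """
    # Horizontal pixel density histogram: number of ink pixels in each row.
    profile = (np.asarray(binary) != 0).sum(axis=1)
    is_text = profile >= min_ink

    lines = []
    start = None
    for row, flag in enumerate(is_text):
        if flag and start is None:
            start = row                      # a text line begins
        elif not flag and start is not None:
            lines.append((start, row - 1))   # a valley ends the line
            start = None
    if start is not None:                    # line touching the bottom edge
        lines.append((start, len(is_text) - 1))
    return lines


# Tiny synthetic page: one two-row line and one single-row line.
page = np.zeros((7, 5), dtype=np.uint8)
page[1:3, 1:4] = 1
page[5, 0:2] = 1
print(extract_text_lines(page))
```

On a skewed or touching-line page the rows between lines are no longer empty, so the valleys disappear and this simple scan merges lines, which is the failure described above.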
Segmentation of isolated word images, extracted from optically scanned
document images of handwritten text, is one of the major problems of optical character
recognition (OCR). If we can find a better method for segmenting handwritten words
into characters, then we can improve character recognition too. So segmentation
of words into characters makes a large contribution towards character recognition,
and thereby towards the overall performance of an OCR system.
Characters segmented from a document image are to be recognized for coding
them in ASCII or some other standard character code. For any of the widely used
non-holistic optical character recognition (OCR) approaches, the success of a specific
technique depends on how well a word can be segmented into pieces, which are subsequently
considered as candidates for its constituent characters. The better the segmentation,
the less ambiguity is encountered in recognition of candidate characters or word
pieces. To recognize a candidate character, its context also requires due consideration.
Because of variation in shapes and sizes, character segmentation of handwritten Bangla
word images requires more sophisticated techniques than that of printed characters.
1.1.4 Recent Trends in OCR Research
Research on OCR has mostly concentrated on text of European
languages based on the Roman alphabet. Possibly the popularity of European languages in
the industrialized West has interested both researchers and entrepreneurs in OCR of
text of European languages, including English. Scripts of Asian languages
like Chinese, Korean, Japanese and Arabic have also received considerable attention
from researchers working in the field of OCR. Other than these, a number of Indian
scripts, viz., Devanagari, Oriya and Bangla, have started to receive attention for OCR
related research in recent years. Out of these, Bangla is the second most popular
script and language in the Indian subcontinent. As a script, it is used for the Bangla,
Assamese and Manipuri languages. Bangla, which is also the national language of
Bangladesh, is the fifth most popular language in the world. Such is the importance of
Bangla both as a script and as a language. But reports of research on OCR of
handwritten Bangla characters, as observed in the literature, are few in number.
1.2 Characteristics of Bangla script
Characters of Bangla script can be grouped into five categories,
viz., vowel, consonant, modified shape, compound character, and punctuation symbol.
Out of these, vowels and consonants, which constitute the Bangla alphabet, are
called basic characters. There are 11 vowels and 39 consonants in the Bangla alphabet.
There is no concept of upper and lower case characters in Bangla script. Characters in
Bangla script are written from left to right. A vowel following a consonant in a word
takes a modified shape in Bangla script. Such shapes of all vowels are termed
modified shapes. It is noteworthy that some modified shapes attached to a consonant
have two isolated parts appearing on two opposite sides of the consonant. Some modified
shapes may appear just below the consonant, and some may reach its top from one of its
sides with a curved or partly curved segment. So, characters in Bangla script may not
always appear in non-overlapping consecutive positions. Depending on the mode of
pronunciation, a Bangla consonant followed by one or two consonants takes a complex
shape, which is called a compound character. There are in all 280 compound characters
in Bangla script. Apart from the basic characters, the modified shapes, and the
compound characters, Bangla script also includes 10 digit patterns. An important
feature of Bangla characters is the Matra or head line. Excluding a few, all basic and
compound characters of Bangla script have this feature. The width of a Matra is nearly
the same as the width of the character it touches. All the Matras of consecutive
characters appearing in a Bangla word are joined to form a common Matra of the
characters appearing in the word.
Fig. 1 Bangla alphabet basic shape (The first 11 characters are vowels while others are
consonants.)
Fig. 2 Examples of vowel and consonant modifiers: (a) vowel modifiers, (b) exceptional
cases of vowel modifiers and (c) consonant modifiers
1.3 Character segmentation and ground truthing
1.3.1 What is character segmentation?
Character segmentation is a necessary preprocessing step for character
recognition in many handwritten word recognition systems. The most difficult case in
character segmentation is cursive script. The fully cursive nature of Bangla handwriting
poses serious challenges for automatic character segmentation. Character
segmentation techniques are mostly script dependent. This is not only because of
variations of character shapes from one alphabet to another but also because of certain
script-specific features of text documents. Segmentation of isolated words into
constituent characters is a challenging problem for Bangla script. The appearance of
consecutive characters in overlapping column positions over a text line makes the problem
of Bangla word segmentation more complex than segmentation of English words. The problem
is compounded for handwritten Bangla words because of variation in the sizes and
shapes of handwritten characters. Considering all this, a novel technique for segmenting
images of handwritten Bangla words is presented in this work.
Before segmenting individual characters of each Bangla word in the text image,
the word is horizontally partitioned into three adjacent zones as shown in Fig. 3. The
portion of each word on and above the Matra constitutes the 'upper zone', the main body
of the characters in a word lies within the 'middle zone', and the portion of the word
below the main body, containing mainly modified shapes and period-like isolated
character components, forms the 'lower zone'. The technique of word segmentation is
based on detection of the Matra.
Fig. 3 Illustration of three zones and region boundaries of a Bangla word
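The three-zone partition can be sketched in Python as follows. This is only an illustration under stated assumptions, not the technique developed in this work: the Matra row is estimated here with a simple heuristic (the row containing the most ink pixels), and the boundary between the middle and lower zones, `base_row`, is a hypothetical value supplied by the caller. The names `estimate_matra_row` and `split_zones` are likewise invented for the example.

```python
import numpy as np


def estimate_matra_row(word):
    """Heuristic (an assumption): take the row with the most ink as the Matra row.

    word : 2-D array, nonzero = ink pixel, 0 = background.
    """
    profile = (np.asarray(word) != 0).sum(axis=1)
    return int(np.argmax(profile))


def split_zones(word, matra_row, base_row):
    """Partition a word image into upper, middle and lower zones.

    matra_row : row index of the Matra.
    base_row  : hypothetical, caller-supplied last row of the main body.
    """
    word = np.asarray(word)
    upper = word[: matra_row + 1]                 # on and above the Matra
    middle = word[matra_row + 1 : base_row + 1]   # main body of the characters
    lower = word[base_row + 1 :]                  # below the main body
    return upper, middle, lower


# Synthetic word: a full-width head line, a short stem and a mark below.
word = np.zeros((6, 5), dtype=np.uint8)
word[1, :] = 1        # Matra
word[2:4, 2] = 1      # character body
word[5, 1] = 1        # component in the lower zone
m = estimate_matra_row(word)
upper, middle, lower = split_zones(word, m, base_row=3)
print(m, upper.shape, middle.shape, lower.shape)
```

The maximum-ink heuristic relies on the Matra being close to horizontal, which holds for printed words far more reliably than for handwritten ones.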
A Matra is a horizontal line that touches the upper part of many
characters of Bangla script, as shown in Fig. 4(a). Depending on the character, it covers
at most the entire character width. The consecutive characters in a Bangla word that
have Matras are joined through a common Matra formed by joining the Matras of
individual characters, as shown in Fig. 4(b). This line may have some discontinuity at
the positions where characters in the word appear without Matras.
In handwritten Bangla words, the Matras are not as strictly horizontal as they
are in printed words. So the technique of removing the Matra of a word for segmenting
its constituent characters may leave many characters joined with each other. Such
under-segmentation may complicate classification decisions in the subsequent stage.
How to segment handwritten Bangla words into constituent characters efficiently
is still a challenging problem of OCR related research. This is a major point of
motivation behind the present work, which deals with the problem of segmenting
handwritten Bangla words into constituent characters.
Fig. 4(a) An illustration of the common Matra of a word
Fig. 4(b) An illustration of Matra of individual characters in a word
In image analysis, testing an algorithm manually consumes considerable time and
manpower, yet manual testing is still widely used around the world. Moreover, testing
schemes are not standardized: different organizations use different schemes, so reported
success rates vary. Standard databases are also scarce. Generating results for a
particular technique therefore becomes laborious for researchers, who need to collect
or prepare a database for their own work.
1.3.2 What is ground truthing?
Generation of appropriate ground truth data has always been a challenging and
time-consuming task for the kind of problem under consideration. Availability of ground
truth information, however, makes any database more useful, enabling proper evaluation
of a technique by comparing its output with the corresponding ground truth. In the
present work, we have prepared ground truth images for a subset of our database, viz.,
CMATERg1.1.1 and CMATERg1.2.1. We have prepared these ground
truth images in a semi-automatic way. More specifically, we have
employed our previously developed technique [9] to identify individual character
segments from any document image. Any errors generated
in the automated character segmentation are corrected using a software tool called GT
Gen version 1.1, which we have developed for this project. Basically, we have used GT
Gen to recolor the characters that were erroneously labeled by our previously
developed technique [9]. It may be noted that all the ground truth images are stored in
bitmap (bmp) file format, where the background is labeled in white and individual
characters are marked in different colors.
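A colour-coded ground truth image of this kind can be turned into a per-character label map and compared with a segmentation output, as in the following Python sketch. It assumes the image has already been loaded into an H x W x 3 array. The function names and the intersection-over-union (IoU) score are assumptions chosen for illustration; they are not the evaluation measure or the GT Gen implementation of this work.

```python
import numpy as np

WHITE = (255, 255, 255)  # background colour of the ground truth images


def colors_to_labels(rgb):
    """Map an H x W x 3 colour-coded image to an H x W integer label map.

    White pixels become label 0; each other distinct colour gets its own
    positive label (numbered in the sorted order of the colours).
    """
    rgb = np.asarray(rgb)
    h, w, _ = rgb.shape
    colors, inverse = np.unique(rgb.reshape(-1, 3), axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    labels = np.zeros(h * w, dtype=int)
    next_label = 1
    for idx, color in enumerate(colors):
        if tuple(int(c) for c in color) == WHITE:
            continue                      # background keeps label 0
        labels[inverse == idx] = next_label
        next_label += 1
    return labels.reshape(h, w)


def character_overlap(gt_labels, pred_labels):
    """For each ground truth character, best IoU with any predicted segment."""
    scores = {}
    for g in np.unique(gt_labels):
        if g == 0:
            continue
        gt_mask = gt_labels == g
        best = 0.0
        for p in np.unique(pred_labels[gt_mask]):
            if p == 0:
                continue
            pred_mask = pred_labels == p
            inter = np.logical_and(gt_mask, pred_mask).sum()
            union = np.logical_or(gt_mask, pred_mask).sum()
            best = max(best, float(inter) / float(union))
        scores[int(g)] = best
    return scores


# One-row example: white, red, red, blue in the ground truth;
# the (hypothetical) segmenter merged both characters into one segment.
gt_rgb = np.array([[[255, 255, 255], [255, 0, 0], [255, 0, 0], [0, 0, 255]]],
                  dtype=np.uint8)
gt = colors_to_labels(gt_rgb)
pred = np.array([[0, 1, 1, 1]])
print(gt.tolist(), character_overlap(gt, pred))
```

A low score for a ground truth character signals that it was merged with a neighbour or split into pieces, which are the two kinds of error that have to be recoloured by hand.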
1.3.3 Importance of handwritten Bangla word
Bangla is an important South Asian script widely used in India and Bangladesh.
In terms of popularity, Bangla ranks 5th in the world and 2nd in India, both as a script
and as a language. It is also the national language of Bangladesh.
Handwritten Bangla words are cursive in nature in most cases. So
identification of each character is difficult for any segmentation algorithm. In
handwritten Bangla words, the Matras are not as strictly horizontal as they are in
printed words. So the technique of removing the Matra of a word for segmenting its
constituent characters may leave many characters joined with each other. Such
under-segmentation may complicate classification decisions in the subsequent stage.
Chapter 2
Review of Existing Work
This chapter reviews previous work on character segmentation from
handwritten Bangla word images, along with its drawbacks.
2.1 Problems of Character Segmentation from
handwritten Bangla word images
Character identification is the first and most important step in the process of
OCR of document images. If the characters are not identified accurately, for example
when two or more characters are connected by a common Matra line, then none of the
characters of the word can be recognized correctly. The same problem occurs if a
character is accidentally split into two or more parts. The characters might be written so
close to one another that accents and similar features become difficult to assign
to the correct character. Adjacent characters might even touch one another at some
points, and in those cases it becomes very difficult to identify the constituent characters
that have joined to form a single component. Character segmentation of handwritten
Bangla word images faces many challenges that depend on the writing style of the
individual. In image analysis, testing an algorithm manually consumes considerable time
and manpower, yet manual testing is still widely used around the world. Moreover,
testing schemes are not standardized: different organizations use different schemes,
so reported success rates vary. The present work suggests a method based on comparison of
neighboring connected or disconnected components to determine whether they belong
to the same character.
2.2 Some recent character segmentation and ground-
truthing methodologies
A wide variety of segmentation and ground-truthing methods for handwritten Bangla
document images have been reported in the literature. Four representative works are
reviewed here, namely (i) a fuzzy technique for segmentation; (ii) a two stage approach
for segmentation; (iii) a database for unconstrained handwritten Bangla word images; and
(iv) a complete handwritten numeral database. These cannot be grouped into a single
category since they do not share a common guideline.
2.2.1 A fuzzy technique for character segmentation
A fuzzy technique for segmentation of handwritten Bangla word images has
been presented in [1]. It works in two steps. In the first step, the black pixels
constituting the Matra (i.e., the longest horizontal line joining the tops of individual
characters of a Bangla word) in the target word image are identified by using a fuzzy
feature. In the second step, some of the black pixels on the Matra are identified as
segment points (i.e., the points through which the word is to be segmented) by using
three fuzzy features. On experimentation with a set of 210 samples of handwritten Bangla
words, collected from different sources, the average success rate of the technique is
shown to be 95.32%. Apart from certain limitations, the technique can be considered a
significant step towards the development of a full-fledged Bangla OCR system, especially
for handwritten documents.
2.2.2 A two stage approach for segmentation
Segmentation of handwritten Bangla word images is a challenging problem for
researchers. Discontinuity or absence of the Matra, an important feature of Bangla
script, may lead to inherent segmentation within the word images. Around 55% of these
inherently segmented connected sub-images do not require further segmentation. The
authors of [2] have designed a novel two-stage approach for segmentation of isolated
Bangla word images. In the first stage, a feature based approach is designed to classify
the connected word segments into one of two classes, 'Segment further' and 'Do not
Segment', using a multi-layer perceptron (MLP) based classifier. In the second stage,
fuzzy segmentation features are designed to identify the Matra region and the potential
segmentation points on the Matra of the connected word segments that belong to the
'Segment further' class. Using this technique, the overall successful segmentation
accuracy achieved after the two stages is 95.87% in [2].
2.2.3 A database for unconstrained handwritten Bangla word
images
In [7], the preparation of a benchmark database for research on off-line
Optical Character Recognition (OCR) of document images of handwritten Bangla text
and Bangla text mixed with English words has been described. This is the first
handwritten database in this area, as mentioned above, available as an open source
document. As India is a multi-lingual country with a colonial past, multi-script
document pages are very common. The database contains 150 handwritten
document pages, among which 100 pages are written purely in Bangla script and the
remaining 50 pages are written in Bangla text mixed with English words. This database of
off-line handwritten scripts is collected from different data sources. After collecting
the document pages, all the documents have been preprocessed and distributed into two
groups, i.e., CMATERdb1.1.1, containing document pages written in Bangla script only,
and CMATERdb1.2.1, containing document pages written in Bangla text mixed with
English words. Finally, we have also provided useful ground truth images for
line segmentation. To generate the ground truth images, we have first labeled
each line in a document page automatically by applying one of our previously developed
line extraction techniques and then corrected any possible error by using our
tool GT Gen 1.1. Line extraction accuracies of 90.6% and 92.38% are achieved on the two
databases, respectively, using our algorithm. Both the databases, along with the ground
truth annotations and the ground truth generating tool, are freely available at
http://code.google.com/p/cmaterdb.
2.2.4 A complete handwritten numeral database
The paper [16] describes the ISI database of handwritten Bangla numerals.
Bangla is the second most popular language and script of the Indian subcontinent and it
is used by more than 200 million people all over the globe. The database has
several components, which include both on-line and off-line handwritten numerals.
Samples of numeral strings and isolated numerals have been collected under both modes
of writing. This database has been developed at the Computer Vision and Pattern
Recognition Unit laboratory of the Indian Statistical Institute, Kolkata. Samples of the
database are properly ground-truthed and subdivided into respective training
and test sets. The off-line sample images are stored in TIFF image format, and the
on-line samples are stored in ASCII file format along with various information as a header.
This database will facilitate fruitful research on handwriting recognition of Bangla
through free access for researchers.
Other methodologies are included in the works described in [12-14]. In [5], the
character segmentation problem is seen from an artificial intelligence perspective.
2.3 Motivation
Considering all the kinds of problems discussed above, we actually need an
automated evaluation tool for OCR systems, which compares the segmented results
of a technique/algorithm with ground-truthed images. The evaluation technique
consists of the following steps. First, a database of word images is prepared. Then