A SEMINAR/COLLOQUIUM REPORT ON
OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION USING DATA MINING
Submitted in Partial Fulfillment of the Requirements for the Degree of Master in Computer Applications
SUBMITTED BY
SHRIKRISHNA SHARMA
Roll No.: 1350204016
UNDER THE SUPERVISION OF
Mr. Ashok Kumar
Associate Professor, Invertis University, Bareilly
INVERTIS INSTITUTE OF COMPUTER APPLICATIONS
INVERTIS UNIVERSITY
Invertis Village, Lucknow National Highway 24, Bareilly, Uttar Pradesh 243123
Batch: 2012-2015
CERTIFICATE
This is to certify that Mr. SHRIKRISHNA SHARMA (Roll No. 1350204016) has carried out the Seminar/Colloquium work presented in this report, entitled OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION USING DATA MINING, for the award of Master of Computer Applications from Invertis University, Bareilly, under the supervision of the undersigned.

Ashok Kumar
Seminar/Colloquium Supervisor
Associate Professor, Invertis University, Bareilly

Ajay Indian
HOD (IICA)
Associate Professor, Invertis University, Bareilly
ACKNOWLEDGEMENT
The real spirit of achieving a goal lies in the way of excellence and austere discipline. I would never have succeeded in completing my task without the cooperation, encouragement and help provided to me by various people. First of all, I render my gratitude to the Almighty, who bestowed self-confidence, ability and strength in me to complete this work. Without His grace, this would never have become today's reality. With a deep sense of gratitude I express my sincere thanks to my esteemed and worthy supervisor, Mr. Ashok Kumar of the Department of Computer Applications, for his valuable guidance in carrying out this work under his effective supervision, encouragement, enlightenment and cooperation. Most of the novel ideas and solutions found in this thesis are the result of our numerous stimulating discussions. His feedback and editorial comments were also invaluable for the writing of this thesis. I would be failing in my duties if I did not express my deep sense of gratitude towards Mr. Ajay Indian, Head of the Computer Applications Department, who has been a constant source of inspiration for me throughout this work. I am also thankful to all the staff members of the Department for their full cooperation and help.

Thank you,
Shrikrishna Sharma
1350204016
Abstract
Handwritten numeral recognition plays a vital role in postal automation services, especially in countries like India where multiple languages and scripts are used. The discrete Hidden Markov Model (HMM) and hybrids of Neural Networks (NN) and HMMs are popular methods in handwritten word recognition systems. The hybrid system gives better recognition results due to the better discrimination capability of the NN.
A major problem in handwriting recognition is the huge variability and distortion of patterns. Elastic models based on local observations and dynamic programming, such as HMMs, are not efficient at absorbing this variability: their view is local, they cannot cope with length variability, and they are very sensitive to distortions. The Support Vector Machine (SVM), an alternative to the NN, is therefore used to estimate global correlations and classify the pattern. In handwriting recognition, the SVM gives better recognition results.
The aim of this paper is to develop an approach which improves the efficiency of handwriting recognition using an artificial neural network.
Keywords: Handwriting recognition, Support Vector Machine, Neural Network
Advancement in Artificial Intelligence has led to the development of various smart devices. The biggest challenge in the field of image processing is to recognize documents in both printed and handwritten formats. Character recognition is one of the most widely used biometric traits for the authentication of both persons and documents.
Optical Character Recognition (OCR) is a type of document image analysis in which a scanned digital image containing either machine-printed or handwritten script is input to an OCR software engine and translated into an editable, machine-readable digital text format. A neural network is designed to model the way in which the brain performs a particular task or function of interest. Each character image comprises 30 × 20 pixels. We have applied a feature extraction technique to calculate the features: the features extracted from characters are the directions of pixels with respect to their neighboring pixels. These inputs are given to a back-propagation neural network with a hidden layer and an output layer. We have used the back-propagation neural network for efficient recognition, where the errors are corrected through back-propagation and the rectified neuron values are transmitted by the feed-forward method through the multiple layers of the network.
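The back-propagation idea described above (errors propagated backward, corrected values fed forward) can be illustrated with a minimal pure-Python sketch. This is an illustration only, not the character recognizer itself: the toy XOR task, the 2-4-1 layer sizes, the learning rate and the epoch count are all arbitrary assumptions.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny 2-4-1 feed-forward network trained with backpropagation on XOR.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]  # hidden weights
b1 = [0.0] * 4
w2 = [random.uniform(-1, 1) for _ in range(4)]                      # output weights
b2 = 0.0
lr = 0.5

def forward(x):
    # Feed-forward pass: hidden activations, then the single output.
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(4)]
    o = sigmoid(sum(w2[j] * h[j] for j in range(4)) + b2)
    return h, o

def total_loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

loss_before = total_loss()
for _ in range(5000):
    for x, t in data:
        h, o = forward(x)
        # Backpropagation: output error first, then hidden errors.
        d_o = (o - t) * o * (1 - o)
        d_h = [d_o * w2[j] * h[j] * (1 - h[j]) for j in range(4)]
        for j in range(4):
            w2[j] -= lr * d_o * h[j]
            b1[j] -= lr * d_h[j]
            for i in range(2):
                w1[j][i] -= lr * d_h[j] * x[i]
        b2 -= lr * d_o
loss_after = total_loss()  # training should have reduced the squared error
```

The same loop structure scales to pixel-direction feature vectors as inputs and one output unit per character class.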
Handwriting recognition is the ability of a computer to receive and interpret intelligible handwritten input from sources such as photographs, touch-screens, paper documents and other devices. A written text image may be sensed "off-line" from a piece of paper by optical scanning (optical character recognition). Devnagari script has 14 vowels and 33 consonants. Vowels occur either in isolation or in combination with consonants. Apart from these basic characters (vowels and consonants), Devnagari script also has compound characters, which are formed by joining two or more basic characters. In addition, Devnagari has twelve modifier forms for each of the 33 consonants, giving rise to modified shapes that depend on whether the modifier is placed to the left, right, top or bottom of the character. The net result is several thousand different shapes or patterns, which makes a Devnagari OCR more difficult to develop. The focus here is on the recognition of offline handwritten Hindi characters, which can be used in common applications such as commercial forms, bill processing systems, bank cheques, government records, signature verification, postcode recognition, passport readers and the offline document recognition generated by the expanding technological society. In this project, Devnagari script characters are recognized from document images using a template matching algorithm.
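The template matching idea can be sketched as follows. The 3 × 3 binary "glyphs" below are hypothetical stand-ins for normalized character templates, not the actual Devnagari templates used in the project; the classifier simply picks the template with the highest pixel-agreement score.

```python
def match_score(image, template):
    """Fraction of pixels on which the image and the template agree."""
    rows, cols = len(image), len(image[0])
    agree = sum(image[r][c] == template[r][c]
                for r in range(rows) for c in range(cols))
    return agree / (rows * cols)

def classify(image, templates):
    """Return the label of the template with the highest agreement score."""
    return max(templates, key=lambda label: match_score(image, templates[label]))

# Hypothetical 3x3 binary templates standing in for normalized glyphs.
templates = {
    "bar":   [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "cross": [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    "box":   [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
}
sample = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(classify(sample, templates))  # -> bar
```

In practice the templates and the input are first normalized to the same size (step 3 of the proposed algorithm uses 50 × 50), so the per-pixel comparison is well defined.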
Table of Contents
1. Cover Page
2. Certificate
3. Abstract
4. Acknowledgements
5. Table of Contents
6. List of Tables
7. List of Figures
8. Introduction
9. Detailed review of work and problem statement
10. Proposed approach
11. Solution approach
12. Implementation and Result
13. Future work and conclusion
14. References
List of Tables
Recognition accuracy of handwritten Hindi characters
Detailed recognition performance of SVM on UCI datasets
Detailed recognition performance of SVM and HMM on UCI datasets
Recognition rate of each numeral in the datasets
List of Figures
Hindi language basic character set
Character recognition of the document image
Output saved in the form of text format
Generated 8 × 8 input matrix
Loading entries from the digits dataset into the application using the default values
Analysis performance on completely new and previously unseen data
Introduction
Handwriting recognition refers to the process of translating images of handwritten, typewritten, or printed digits into a format understood by the user, for the purposes of editing, indexing/searching, and reducing storage size. Handwriting recognition systems have their own importance and are applicable in various fields, such as online handwriting recognition on computer tablets, recognizing ZIP codes on mail for postal sorting, processing bank cheque amounts, reading numeric entries in forms filled in by hand, and so on. There are two distinct handwriting recognition domains, online and offline, which are differentiated by the nature of their input signals.
In an offline system, a static representation of a digitized document is used, in applications such as cheque, form, mail or document processing. On the other hand, online handwriting recognition (OHR) systems rely on information acquired during the production of the handwriting. They require specific equipment that allows the capture of the trajectory of the writing tool. Mobile communication systems such as Personal Digital Assistants (PDAs), electronic pads and smart-phones have an online handwriting recognition interface integrated in them. Therefore, it is important to further improve the recognition performance of these applications while constraining the space needed for parameter storage and improving processing speed. Figure 1 shows an online handwritten word recognition system. Many current systems use a discrete Hidden Markov Model based recognizer or a hybrid of a Neural Network (NN) and an HMM. After normalization, the writing is usually segmented into basic units (normally a character or part of a character) and each segment is classified and labeled. Using an HMM search algorithm in the context of a language model, the most likely word path is then returned to the user as the intended string.
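The "most likely word path" search can be illustrated with a minimal Viterbi decoder. The per-segment observation probabilities, the transition probabilities and the two states below are invented toy numbers, not trained values; in the real system the observation probabilities would come from the NN (or SVM) segment classifier.

```python
def viterbi(obs_probs, trans, init):
    """Most likely state path given per-segment observation probabilities,
    transition probabilities and initial state probabilities."""
    states = list(init)
    # Probability of the best path ending in each state after segment 0.
    best = {s: init[s] * obs_probs[0][s] for s in states}
    back = []
    for obs in obs_probs[1:]:
        prev = best
        best, ptr = {}, {}
        for s in states:
            p, arg = max((prev[r] * trans[r][s], r) for r in states)
            best[s] = p * obs[s]
            ptr[s] = arg
        back.append(ptr)
    # Trace the best final state back to the start.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy two-character model: segment classifier scores per segment.
obs = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
trans = {"a": {"a": 0.3, "b": 0.7}, "b": {"a": 0.7, "b": 0.3}}
init = {"a": 0.5, "b": 0.5}
print(viterbi(obs, trans, init))  # -> ['a', 'b']
```

A language model would additionally reweight the transition probabilities so that only valid character sequences are favored.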
The segmentation process can be performed in various ways. However, the observation probability for each segment is normally obtained using a neural network (NN), while a Hidden Markov Model (HMM) estimates the probabilities of transitions within a resulting word path. This research aims to investigate the use of support vector machines (SVM) in place of the NN in a hybrid SVM/HMM recognition system. The main objective is to further improve the recognition rate [6, 7] by using the SVM at the segment classification level. This is motivated by successful earlier work by Ganapathiraju on a hybrid SVM/HMM speech recognition (SR) system and the work by Bahlmann [8] in OHR. Ganapathiraju obtained a better recognition rate compared to a hybrid NN/HMM SR system. In this work, an SVM is first developed and used to train an OHR system using character databases. SVMs with probabilistic outputs are then developed for use in the hybrid system. Eventually, the SVM will be integrated with the HMM module for word recognition. Preliminary results of using the SVM for character recognition are given and compared with the results using NNs reported by Poisson. The following databases were used: IRONOFF, UNIPEN and the mixed IRONOFF-UNIPEN database.
Biometrics is most commonly defined as a measurable physiological or behavioral characteristic of an individual that can be used in personal identification and verification. A character recognition device is one such smart device: it acquires partial human intelligence, with the ability to capture and recognize various characters in different languages. Character recognition (in general, pattern recognition) addresses the problem of classifying input data, represented as vectors, into categories. Character recognition is a part of pattern recognition [1].
It is impossible to achieve 100% accuracy. The most basic way of recognizing patterns is through probabilistic methods, such as Bayesian network classifiers for recognizing characters. The need for character recognition software has increased greatly since the outstanding growth of the Internet. Optical Character Recognition (OCR) is a very well-studied problem in the vast area of pattern recognition. Its origins can be found as early as 1870, when an image transmission system was invented that used an array of photocells to recognize patterns.
Until the middle of the 20th century, OCR was primarily developed as an aid to the visually handicapped. With the advent of digital computers in the 1940s, OCR was realized as a data processing approach for the first time. The first commercial OCR systems began to appear in the early 1950s, and soon they were being used by the US postal service to sort mail. The accurate recognition of Latin-script typewritten text is now considered largely a solved problem in applications where clear imaging is available, such as the scanning of printed documents.
Typical accuracy rates in these applications exceed 99%; total accuracy can only be achieved by human review. Optical Character Recognition (OCR) programs are capable of reading printed text. This could be text that was scanned from a document, or handwritten text that was drawn on a hand-held device such as a Personal Digital Assistant (PDA). The character recognition software breaks the image into sub-images, each containing a single character.
The sub-images are then translated from an image format into a binary format, where each 0 and 1 represents an individual pixel of the sub-image. The binary data is then fed into a neural network that has been trained to make the association between the character image data and a numeric value that corresponds to the character. The output from the neural network is then translated into ASCII text and saved as a file. Recognition of characters is a very complex problem: characters can be written in different sizes, orientations, thicknesses, formats and dimensions, which gives infinite variations.
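The translation of a sub-image into binary format can be sketched as a simple threshold over grayscale values; the threshold of 128 and the 3 × 3 sample pixels are illustrative assumptions, not values from the report.

```python
def binarize(gray, threshold=128):
    """Convert a grayscale sub-image (0-255 values) into a 0/1 pixel grid,
    the form in which it is fed to the neural network."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

# Hypothetical 3x3 sub-image: dark stroke pixels (low values) become 1,
# light background pixels become 0.
sub_image = [
    [250, 10, 245],
    [240, 12, 250],
    [255, 15, 248],
]
print(binarize(sub_image))  # -> [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The resulting rows are flattened into one 0/1 vector per character before being presented to the network's input layer.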
The capability of a neural network to generalize, and its insensitivity to missing data [6, 7], are very beneficial in recognizing characters. An artificial neural network is used as the backend to solve the recognition problem. Neural networks have been used in a variety of different areas to solve a wide range of problems. Unlike human brains, which can identify and memorize characters such as letters or digits, computers treat them as binary graphics. The central objective of this paper is to demonstrate the capabilities of artificial neural network implementations in recognizing extended sets of image pixel data. In this paper, offline recognition of characters is performed on a printed text document. It is a process by which we convert a printed document or scanned page to ASCII characters that a computer can recognize. A back-propagation feed-forward neural network is used to recognize the characters.
After training the network with the back-propagation learning algorithm, high recognition accuracy can be achieved. Recognition of printed characters is itself a challenging problem, since there are variations of the same character due to changes of font or the introduction of different types of noise. Differences in fonts and sizes make the recognition task difficult if the pre-processing, feature extraction and recognition stages are not robust. This paper is organized as follows. The multilayer perceptron neural network for recognition is briefly described in Section 2. In Section 3, the character recognition procedure is described. In Section 4, training performance and prediction accuracy are analyzed. Section 5 contains the data description and result analysis.
Hindi handwritten character recognition is one of the major problems in today's world. Even typed Hindi characters are difficult for a computer to recognize, so Hindi handwritten characters are not recognized efficiently and accurately by machine. Much research has been done to recognize these characters, and many algorithms have been proposed for the task. Many types of software are on the market for optical Hindi character recognition. To recognize characters, many processes have to be performed; no single process or single machine can perform the recognition alone. Artificial neural networks can be used for the recognition of characters due to the simplicity of their design and their universality.
Hindi character recognition is becoming more and more important in the modern world. It helps humans ease their jobs and solve more complex problems. The problem of recognition of hand-printed characters is still an active area of research. With the increasing necessity for office automation, it is imperative to provide practical and effective solutions. All sorts of structural, topological and statistical information have been observed about the characters, yet this does not lend a helping hand in the recognition process, owing to the different writing styles and moods of persons at the time of writing. Only limited variations in the shapes of characters are considered here.
Literature Survey:-
Although the first research report on handwritten Devnagari characters was published in 1977 [1], not much research work was done after that. At present, researchers have started to work on handwritten Devnagari characters, and a few research reports have been published recently. In this paper, the implementation is done in MATLAB, which allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, Java, and Fortran. Hanmandlu and Murthy [2][3] proposed a fuzzy model based recognition of handwritten Hindi numerals and characters; they obtained 92.67% accuracy for handwritten Devnagari numerals and 90.65% accuracy for handwritten Devnagari characters. Bajaj et al. [4] employed three different kinds of features, namely density features, moment features and descriptive component features, for the classification of Devnagari numerals. They proposed a multi-classifier connectionist architecture for increasing the recognition reliability, and they obtained 89.6% accuracy for handwritten Devnagari numerals. Kumar and Singh [5] proposed a Zernike moment feature based approach for Devnagari handwritten character recognition. They used an artificial neural network for classification.
OCR is one of the oldest ideas in the history of pattern recognition using computers. In recent times, Punjabi character recognition has become a field of practical use. In character recognition, the process starts with reading a scanned image of a series of characters, determining their meaning, and finally translating the image into a computer-written text document. This process is commonly used in post offices to mechanically read the names and addresses on envelopes, and by banks to read the amounts and numbers on cheques. Companies and civilians can also use this method to quickly translate paper documents into computer-written documents. Much research has been done on character recognition in the last 56 years. Some books [6-8] and many surveys [4, 5] have been published on character recognition. Most of the work on character recognition was done on Japanese, Latin and Chinese characters in the middle of the 1960s. The work by Impedovo et al. [9] focuses on commercial OCR systems. Jain et al. [10] summarized and compared some of the well-known methods used in the various stages of a pattern recognition system; they tried to identify research topics and applications which are at the forefront of this field. Pal and Chaudhuri [8] in their report summarized different systems for Indian language script recognition, describing some commercial systems like the Bangla and Devnagari OCRs. Manish [11] in his survey report summarized a system for the recognition of Punjabi characters, reporting the scope of future work to be extended in several directions, such as OCR for poor-quality documents, multi-font OCR, and bi-script/multi-script OCR development. A bibliography of the fields of OCR and document analysis is given in [12]. Tappert et al. [13] and Wakahara et al. [14] worked on on-line handwriting recognition and described a distortion-tolerant shape matching method. Nouboud and Plamondon [15] and Suen et al. [16] proposed methods for the on-line recognition of hand-printed characters, while Connell et al. [17, 18] described on-line character recognition for Devanagari and alphanumeric characters. Bortolozzi et al. [19] have published a very useful study on recent advances in handwriting recognition. Lee et al. [20] described off-line recognition of totally unconstrained handwritten numerals using a multilayer cluster neural network: the character regions are determined using projection profiles and topographic features extracted from the gray-scale images, and then a nonlinear character segmentation path in each character region is found using a multi-stage graph search algorithm. Khaly and Ahmed [21], Amin [22] and Lorigo and Govindaraju [23] have produced bibliographies of research on Arabic optical text recognition. Hildebrandt and Liu [24] have reported on advances in handwritten Chinese character recognition, and Liu et al. [25] have discussed various techniques used for on-line Chinese character recognition.

2.1 Indian Script Recognition

Compared to English and Chinese, research on the OCR of Indian language scripts has not achieved the same level of perfection. A few attempts have been carried out on the recognition of Indian character sets in Devanagari, Bangla, Tamil, Telugu, Oriya, Gurmukhi, Gujarati and Kannada. These attempts are briefly described in the following sub-sections.

2.1.1 Recognition of Handwritten Devnagari Scripts

Devnagari is the most popular script in India. Devnagari script is used to write many Indian languages such as Hindi, Marathi, Rajasthani, Sanskrit and Nepali.
The characters of the Hindi language are shown in Figure 9. Work on handwritten Devnagari character recognition started early, in 1977, when I. K. Sethi and B. Chatterjee [26] presented a system for handwritten Devnagari characters. In this system, sets of very simple primitives were used; most of the decisions were taken on the basis of the presence/absence or positional relationships of these 18 primitives. A multistage process was used for taking these decisions: at the completion of each stage, the options for deciding the class membership of the input token decrease. In 1979, Sinha and Mahabala [27] presented a syntactic pattern analysis system with an embedded picture language for the recognition of handwritten and machine-printed Devnagari characters. In this system, mainly feature extraction techniques were used. Sethi and Chatterjee [28] also did some studies on hand-printed Devnagari numerals based upon a binary decision tree classifier; the binary decision tree was built on the basis of the presence or absence of some basic primitives, namely horizontal line segments, vertical line segments, left and right slants, D-curves, C-curves, etc., and their positions and interconnections. That decision process was also a multistage process. Brijesh K. Verma [29] presented a system using Multi-Layer Perceptron (MLP) networks and Radial Basis Function (RBF) networks for the task of handwritten Hindi Character Recognition (HCR). The error back-propagation algorithm was used to train the MLP networks.
2.1.2 Recognition of Bangla Characters

Among all the Indian scripts, the most work on the recognition of handwritten characters has been done for Bangla. Handwritten Bangla characters are shown in Figure 10. Some OCR systems for offline handwritten Bangla numeral and character recognition are available on the market. In 1982, S. K. Parui and B. B. Chaudhuri et al. [30] proposed a recognition scheme using a syntactic method for connected Bangla handwritten numerals; using automata, sub-patterns are built from one-dimensional strings of eight-direction codes. In 1998, A. F. R. Rahman and M. Kaykobad [31] proposed a complete Bangla OCR system in which they used a hybrid approach for the recognition of handwritten Bangla characters. Everybody has a different writing style; to cope with this, Pal and Chaudhuri [32] proposed a robust scheme for the recognition of isolated off-line handwritten Bangla numerals. In this scheme, the direction of the numeral, the height and position of the numeral with respect to the character bounding box, the shape of the reservoir, etc. are used for recognition. Dutta and Chaudhuri [34] reported work on the recognition of isolated Bangla alphanumeric handwritten characters using neural networks; in this method, primitives are used to represent the characters, together with the structural constraints between the primitives imposed by the junctions present in the characters. A neural network approach is also used by Bhattacharya et al. [33] for the recognition of Bangla handwritten numerals; certain features like loops, junctions, etc. present in the graph are considered to classify a numeral into a smaller group. Sural and Das [35] defined fuzzy sets on the Hough transform of character pattern pixels, from which additional fuzzy sets are synthesized using t-norms. Garain et al. [36] proposed an online handwriting recognition system for Bangla: a low-complexity classifier was designed, and the proposed similarity measure appears to be quite robust against wide variations in writing style. Pal, Wakabayashi and Kimura [37] proposed a recognition system for handwritten offline compound Bangla characters using a Modified Quadratic Discriminant Function (MQDF). The features used for recognition are mainly based on directional information obtained from the arc tangent of the gradient. To obtain the features, a 2 × 2 mean filter is first applied 4 times to the gray-level image, and non-linear size normalization is performed on the image.

2.1.3 Recognition of Tamil Characters

Work on the recognition of Tamil characters started in 1978 with Siromony et al. [38]. They described
a method for recognition of machine-printed Tamil characters using
an encoded character string dictionary. The scheme employs string
features extracted by row- and column- wise scanning of character
matrix. Features in each row (column) are encoded suitably
depending upon the complexity of the script to be recognised.
Chandrasekaran et al. [39] used similar approach for constrained
hand-printed Tamil character recognition. Chinnuswamy and
Krishnamoorthy [40] presented an approach for hand-printed Tamil
character recognition employing labeled graphs to describe
structural composition of characters in terms of line-like
primitives. Recognition is carried out by correlation matching of
the labeled graph of the unknown character with that of the
prototypes. A piece of work on on-line Tamil character recognition
is reported by Aparna et al. [41]. They used shape-based features
including dots, line terminals, bumps and cusps. Stroke identification is done by comparing an unknown stroke with a database of strokes. A finite state automaton has been used for character recognition, with an accuracy of 71.32-91.5%.

2.1.4 Recognition of Telugu Characters

A two-stage recognition system for printed Telugu
alphabets has been described by Rajasekaran and Deekshatulu [42].
In the first stage a directed curve tracing method is employed to
recognize primitives and to extract basic character from the actual
character pattern. In the second stage, the basic character is
coded, and on the basis of the knowledge of the primitives and the
basic character present in the input pattern, the classification is
achieved by means of a decision tree. Lakshmi and Patvardhan [43]
presented a Telugu OCR system for printed text of multiple sizes
and multiple fonts. After pre-processing, connected component
approach is used for segmentation characters. Real valued direction
features have been used for neural network based recognition
system. The authors have claimed an accuracy of 98.6%. Negi et al.
[2] presented a system for printed Telugu character recognition,
using connected components and fringe distance based template
matching for recognition. Fringe distances compare only the black
pixels and their positions between the templates and the input
images. 2.1.5 Recognition of Gurmukhi Characters Gurmukhi script is
used primarily for writing Punjabi language. Punjabi Language is
spoken by eighty four million native speakers and is the worlds
14th most widely spoken language. Lehal and Singh [30] developed a
complete OCR system for printed Gurmukhi script where connected
components are first segmented using thinning based approach. They
started work with discussing useful pre-processing techniques.
Lehal and Singh [30] have discussed in detail the segmentation problems for Gurmukhi script. They observed that the horizontal projection method, the most commonly used method for extracting lines from a document, fails in many cases when applied to Gurmukhi text, resulting in over-segmentation or under-segmentation. The text image is broken into horizontal text strips
using horizontal projection in each row. The gaps on the horizontal
projection profiles are taken as separators between the text
strips. Each text strip could represent: a) the core zone of one text line, consisting of the upper and middle zones and optionally the lower zone (core strip); b) the upper zone of a text line (upper strip); c) the lower zone of a text line (lower strip); d) the core zone of more than one text line (multi strip). Then, using the estimated average height of the core strip and its percentage, they identify the type of each strip.
The classification process is carried out in three stages. In the
first stage, the characters are grouped into three sets depending
on their zonal position, i.e., upper zone, middle zone and lower
zone. In the second stage, the characters in middle zone set are
further distributed into smaller sub-sets by a binary decision tree
using a set of robust and font independent features. In the third
stage, the nearest neighbor classifier is used and the special
features distinguishing the characters in each subset are used.
This enhances the computational efficiency. The system has an
accuracy of about 97.34%. An OCR post-processor for Gurmukhi script has also been developed: Lehal and Singh, and Lehal et al., proposed a post-processor for Gurmukhi OCR in which statistical information about Punjabi-language syllable combinations and certain heuristics based on Punjabi grammar rules are considered. There is also some literature dealing with the segmentation of Gurmukhi script. Lehal and Singh performed segmentation of Gurmukhi script by connected component analysis of a word, assuming the headline is not a part of the word. Goyal et al. suggested a dissection based Gurmukhi character segmentation method, which segments the characters in the different zones of a word by examining the vertical white space. Manish [11] proposed an algorithm for recognizing Gurmukhi script; in his work he recognized Punjabi characters with an efficiency of 92.56%. For Chinese and Latin scripts, the efficiency of word recognition is over 99%.
HINDI LANGUAGE: A REVIEW :-
Hindi is an Indo-Aryan language and one of the official languages of India. It is the world's third most commonly used language after Chinese and English, with approximately 500 million speakers all over the world. It is written in Devnagari script, from left to right along a horizontal line. The basic character set has 13 SWARS (vowels) and 33 VYANJANS (consonants), shown in the figure.
Figure 1: Hindi language basic character set
DEVNAGARI SCRIPT :-
Hindi is the world's third most commonly used language after Chinese and English, and there are approximately 500 million people all over the world who speak and write in Hindi. Devnagari is the basic script of many languages in India, such as Hindi and Sanskrit. It is indisputable that Devnagari has the most accurate scientific basis. For a long time it has been the script of Indian Aryan languages, and it is still used by the Sanskrit, Hindi, Marathi and Nepali languages. Hindi is a very widely spoken language, and since its script is Devnagari, Devnagari is a most popular script. As Hindi has been declared the national language by the Constitution of India, Devnagari has got the status of the national script.
In the beginning, Hindi was declared the state language and Devnagari the state script of several major states, such as Himachal Pradesh, Haryana, Rajasthan, Madhya Pradesh, Bihar and Uttaranchal. Presently, Devnagari script is regarded as a highly scientific script. Since many Indian scripts developed from the Brahmi script, Devnagari has connections with almost every other Indian script. In Devnagari, all letters are equal, i.e. there is no concept of capital or small letters. Devnagari is half-syllabic in nature.
Optical Character Recognition (OCR):-
OCR is the acronym for Optical Character Recognition. This
technology allows a machine to automatically recognize characters
through an optical mechanism. Human beings recognize many objects
in this manner: the eyes are the optical mechanism, while the brain
interprets the input, and the ability to interpret these signals
varies from person to person. Reviewing these variables makes it
easy to appreciate the challenges faced by a technologist
developing an OCR system. Documents exist as paper that a human can
read and understand, but a computer cannot understand these
documents directly.
In order to convert these documents into computer-processable
form, OCR systems are developed. OCR is the process of converting
scanned images of machine-printed or handwritten text, numerals,
letters and symbols into a computer-processable format such as
ASCII. OCR is an area of pattern recognition, and the processing of
handwritten characters is motivated largely by the desire to
improve man-machine communication.
Proposed Algorithm:-
The system performs character recognition by exploiting template
matching for its ability to recognize handwritten Hindi characters.
The following steps are followed:
1- A database of handwritten Hindi characters is created from the
handwriting of different people.
2- Preprocessing of the training image:
a) Binarization of the image using bw = im2bw(Ibw, level).
b) Edge detection using iedge = edge(uint8(BW2)).
c) Dilation using se = strel('square', 2); iedge2 = imdilate(iedge, se).
d) Region filling using ifill = imfill(iedge2, 'holes').
e) Character detection using [Ilabel num] = bwlabel(Ifill);
Iprops = regionprops(Ilabel); Ibox = [Iprops.BoundingBox];
Ibox = reshape(Ibox, [4 num]);
3- Extraction and scaling of the normalized characters to a 50*50
grid using bounding-box analysis: img{cnt} = imcrop(Ibw,
Ibox(:,cnt)); bw2 = imgcrop(img{cnt}); charvec = imresize(bw2, [50 50]);
4- Template generation using image averaging; the templates are
saved in a templates.mat file, which is used in the matching phase.
5- Binarizing the test image, matching it against the templates,
and generating a result.txt file containing the recognized
characters.
The scope of the proposed system is limited to the recognition of
a single character.
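The preprocessing and extraction steps above (binarize, detect connected character regions, crop, rescale to 50*50) can be sketched in Python/NumPy as an illustrative analogue of the MATLAB calls. This is not the report's implementation; the threshold value, 4-connectivity, and nearest-neighbour resizing are assumptions made for the sketch.

```python
# Sketch of the pipeline: binarize -> label connected components
# (analogue of bwlabel) -> crop the bounding box -> rescale to 50x50.
import numpy as np
from collections import deque

def binarize(gray, level=128):
    """Threshold a grayscale image: 1 = ink, 0 = background (assumed level)."""
    return (gray < level).astype(np.uint8)

def connected_components(bw):
    """Label 4-connected foreground regions by breadth-first flood fill."""
    labels = np.zeros_like(bw, dtype=int)
    current = 0
    h, w = bw.shape
    for i in range(h):
        for j in range(w):
            if bw[i, j] and not labels[i, j]:
                current += 1
                q = deque([(i, j)])
                labels[i, j] = current
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and bw[ny, nx] and not labels[ny, nx]:
                            labels[ny, nx] = current
                            q.append((ny, nx))
    return labels, current

def crop_and_scale(bw, labels, k, size=50):
    """Crop the bounding box of component k and resize to size x size
    by nearest-neighbour sampling (analogue of imcrop + imresize)."""
    ys, xs = np.nonzero(labels == k)
    patch = bw[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    ph, pw = patch.shape
    rows = np.arange(size) * ph // size
    cols = np.arange(size) * pw // size
    return patch[np.ix_(rows, cols)]
```

A scanned character image run through these three functions yields a 50x50 binary template of the kind matched in the template-matching phase.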
Scanning :- Handwritten character data samples are acquired on
paper from various people. These data samples are then scanned from
paper through an optical digitizing device such as an optical
scanner or camera. A flat-bed scanner is used at 300 dpi, which
converts the data on the paper being scanned into a bitmap
image.
DETAIL REVIEW OF WORK AND PROBLEM STATEMENT
DESIGNING OF MULTILAYER NEURAL NETWORK FOR RECOGNITION
There are two basic methods used for OCR: matrix matching and
feature extraction. Of the two, matrix matching is the simpler and
more common; feature extraction is more versatile and can make a
product more robust and accurate. Here we use matrix matching for
the recognition of characters. The process of character recognition
of the document image mainly involves six phases:
1. Acquisition of the grayscale image
2. Digitization/binarization
3. Line and boundary detection
4. Feature extraction
5. Feed-forward artificial neural network based matching
6. Recognition of the character based on matching score.
Fig. 2.
The scanned image must be [4, 5] a grayscale image or a binary
image, where a binary image is a contrast-stretched grayscale
image. The grayscale image then undergoes digitization. In
digitization [12], a rectangular matrix of 0s and 1s is formed from
the image, where 0 = black and 1 = white, and all RGB values are
converted into 0s and 1s. The matrix of dots represents a
two-dimensional array of bits.
Digitization is also called binarization, as it converts a
grayscale image into a binary image using an adaptive threshold.
Line and boundary detection is the process of identifying points in
a digital image at which the character's top, bottom, left and
right extents are calculated. A feed-forward neural network
approach is used to combine all the unique features, which are
taken as inputs; one hidden layer is used to integrate and
collaborate [9] similar features and, if required, adjust the
inputs by adding or subtracting weight values; finally, one output
layer is used to find the overall matching score of the character.
CHARACTER RECOGNITION PROCEDURE
Pre-processing:- The pre-processing stage yields a clean document,
in the sense that maximal shape information with maximal
compression and minimal noise is obtained on the normalized image.
Segmentation:- Segmentation is an important stage because the
extent to which words, lines or characters can be separated
directly affects the recognition rate of the script.
Feature extraction:- After segmenting the character, features such
as height, width, horizontal lines, vertical lines, and top and
bottom positions are extracted.
Classification:- For classification or recognition, the
back-propagation algorithm is used.
Output:- The output is saved in text format.
TRAINING ALGORITHM PERFORMANCE AND ACCURACY OF PREDICTION
The back-propagation algorithm requires a numerical representation
for the characters. Learning is implemented using the
back-propagation algorithm with a learning rate. The gradient is
calculated [10] after every iteration and compared with a threshold
gradient value; if the gradient is greater than the threshold, the
next iteration is performed. The batch steepest-descent training
function is used: the weights and biases are updated in the
direction of the negative gradient of the performance function. To
evaluate the model quantitatively, two error measures are employed
for evaluation and model comparison: the mean squared error (MSE)
and the mean absolute error (MAE). If y_t is the actual observation
for time period t and F_t is the forecast for the same period, then
the error is defined as

e_t = y_t - F_t    (1)

The standard statistical error measures can then be defined as

MSE = (1/n) * sum_{i=1}^{n} e_i^2    (2)

and the mean absolute error as

MAE = (1/n) * sum_{i=1}^{n} |e_i|    (3)

where n is the number of time periods. When the mean squared error
decreased gradually and became stable, the training and testing
errors produced satisfactory results, as seen in the training
performance curve of the neural network. The accuracy of the
trained network is tested against output data and assessed in two
ways. In the first, the predicted output values are compared with
the measured values; the results show the relative accuracy of the
predicted output, and the overall percentage error obtained from
the test results is 4%. In the second, the root mean squared error
and the mean absolute error are determined and compared. The
performance index for training the ANN is given in terms of the
mean squared error (MSE); the tolerance limit for the MSE is set to
0.001. The MSE of the training set became stable at 0.0070 when the
number of iterations reached 350. The closeness of the training and
testing errors validates the accuracy of the model.
EXPERIMENTAL RESULTS
We created an interface for the proposed system for character
recognition using Microsoft Visual C# 2008 Express Edition. The
implemented MLP network is composed of three layers: an input
layer, a hidden layer and an output layer. The input layer consists
of 180 neurons, which receive the printed image data from a 30x20
symbol pixel matrix. The hidden layer consists of 256 neurons, a
number [12] decided on the basis of optimal results on a
trial-and-error basis. The output layer is composed of 16 neurons.
Number of characters = 90, learning rate = 150, number of neurons
in hidden layer = 256.
TABLE I: PERCENTAGE OF ERROR FOR DIFFERENT EPOCHS
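The MSE and MAE error measures used above can be computed as in this short sketch; the sample numbers are illustrative only.

```python
# Mean squared error and mean absolute error over the per-period
# errors e_i = y_i - F_i, as defined in the error-measure equations.
import numpy as np

def mse(y, f):
    """Mean squared error between observations y and forecasts f."""
    e = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)
    return float(np.mean(e ** 2))

def mae(y, f):
    """Mean absolute error between observations y and forecasts f."""
    e = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)
    return float(np.mean(np.abs(e)))
```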
2. Existing Techniques
2.1 Modified Quadratic Discriminant Function (MQDF) Classifier
G. S. Lehal and Nivedan Bhatt [10] designed a recognition system
for handwritten Devnagari numerals using a modified quadratic
discriminant function (MQDF) classifier. A recognition rate of 89%
and a confusion rate of 4.5% were obtained.
2.2 Neural Network on Devnagari Numerals
R. Bajaj, L. Dey and S. Chaudhari [11] used a neural-network-based
classification scheme. Numerals were represented by feature vectors
of three types, and three different neural classifiers were used
for their classification. Finally, the outputs of the three
classifiers were combined using a connectionist scheme. A 3-layer
MLP was used to implement the classifier for segment-based
features. Their work produced a recognition rate of 89.68%.
2.3 Gaussian Distribution Function
R. J. Ramteke et al. applied classifiers to 2000 numeral images
obtained from individuals of different professions. The results of
PCA, correlation coefficients and perturbed moments were an
experimental success compared to MIs. This research produced a
92.28% recognition rate using 77 feature dimensions.
2.4 Fuzzy classifier on Hindi Numerals
M. Hanmandlu, A.V. Nath, A.C. Mishra and V.K. Madasu used fuzzy
membership functions for the recognition of handwritten Hindi
numerals and produced a 96% recognition rate. To recognize the
unknown numeral set, an exponential variant of the fuzzy membership
function was selected and constructed using the normalized vector
distance.
2.5 Multilayer Perceptron
Ujjwal Bhattacharya and B. B. Chaudhuri [11] used a distinct MLP
classifier. They worked on Devanagari, Bengali and English
handwritten numerals. A back-propagation (BP) algorithm was used
for training the MLP classifiers. It provided 99.27% and 99.04%
recognition accuracies on the original training and test sets of
the Devanagari numeral database, respectively.
2.6 Quadratic classifier for Devanagari Numerals
U. Pal, T. Wakabayashi, N. Sharma and F. Kimura [14] developed a
modified quadratic classifier for the recognition of offline
handwritten numerals of six popular Indian scripts. They used
64-dimensional features for high-speed recognition. A five-fold
cross-validation technique was used for result computation,
obtaining 99.56% accuracy on the Devnagari script.
PROPOSED APPROACH
3.1 Support Vector Machine (SVM)
SVM in its basic form implements two-class classification. It has
been used in recent years as an alternative to popular methods such
as neural networks. The advantage of SVM is that it takes into
account both experimental data and structural behavior for better
generalization capability, based on the principle of structural
risk minimization (SRM). Its formulation approximates the SRM
principle by maximizing the margin of class separation, which is
why it is also known as a large-margin classifier. The basic SVM
formulation is for linearly separable datasets.
It can be used for nonlinear datasets by indirectly mapping the
nonlinear inputs into a linear feature space, where the
maximum-margin decision function is approximated. The mapping is
done using a kernel function. Multi-class classification can be
performed by modifying the two-class scheme. The objective of
recognition is to interpret a sequence of numerals taken from the
test set. The architecture of the proposed system is given in fig.
3. The SVM (a binary classifier) is applied to the multi-class
numeral recognition problem using a one-versus-rest method. The SVM
is trained with the training samples using a linear kernel.
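The linear SVM with a one-versus-rest wrapper described above can be sketched as follows. This is not the report's implementation: the sub-gradient descent on the hinge loss, the learning rate, the regularization constant and the toy data are all assumptions made for illustration.

```python
# Sketch: a linear SVM trained by sub-gradient descent on the hinge
# loss, plus a one-versus-rest wrapper for multi-class recognition.
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=300, lr=0.1, seed=0):
    """Train w, b for labels y in {-1, +1} by minimizing
    lam * ||w||^2 + mean hinge loss (assumed hyperparameters)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                        # point inside the margin
                w += lr * (y[i] * X[i] - 2 * lam * w)
                b += lr * y[i]
            else:                                 # only shrink w (regularize)
                w -= lr * 2 * lam * w
    return w, b

def train_one_vs_rest(X, y, classes):
    """One binary SVM per class, as in one-versus-rest classification."""
    models = {}
    for c in classes:
        yc = np.where(y == c, 1.0, -1.0)
        models[c] = train_linear_svm(X, yc)
    return models

def predict(models, X):
    """Pick the class whose hyperplane gives the largest decision value."""
    classes = list(models)
    scores = np.stack([X @ w + b for (w, b) in models.values()])
    return np.array([classes[i] for i in np.argmax(scores, axis=0)])
```

In practice the report's feature vectors would replace the toy inputs, and a kernelized solver would be used for nonlinear data.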
The classifier performs its function in two phases: training and
testing [29]. After the preprocessing and feature extraction
processes, training is performed on the feature vectors, which are
stored in the form of matrices. The result of training is used for
testing the numerals. The training procedure is given as an
algorithm.
3.2 Statistical Learning Theory
Support Vector Machines were developed by Vapnik in the framework
of Statistical Learning Theory [13]. In statistical learning theory
(SLT), the problem of classification in supervised learning is
formulated as follows: we are given a set of l training examples
and their classes, {(x1, y1), ..., (xl, yl)} in R^n x R, sampled
according to an unknown joint probability distribution P(x, y)
characterizing how the classes are spread in R^n x R. To measure
the performance of a classifier, a loss function L(y, f(x)) is
defined: L(y, f(x)) is zero if f classifies x correctly, and one
otherwise. How f performs on average can be described by the risk
functional

R(f) = integral of L(y, f(x)) dP(x, y),

which in practice is approximated by the empirical risk

Remp(f) = (1/l) * sum_{i=1}^{l} L(yi, f(xi)).

The ERM principle states that, given the training set and a set of
possible classifiers in the hypothesis space F, we should choose
the f in F that minimizes Remp(f). However, this does not
necessarily yield a classifier which generalizes well to unseen
data, due to the overfitting phenomenon: Remp(f) is a poor,
over-optimistic approximation of R(f), the true risk. Neural
network classifiers rely on the ERM principle.
The normal practice for obtaining a more realistic estimate of the
generalization error, as in neural networks, is to divide the
available data into a training set and a test set. The training set
is used to find a classifier with minimal empirical error (e.g. to
optimize the weights of an MLP neural network), while the test set
is used to estimate the generalization error (the error rate on the
test set). If we have different classifier hypothesis spaces F1,
F2, ... (e.g. MLP neural networks with different topologies), we
can select a classifier from each hypothesis space (each topology)
with minimal Remp(f) and choose the final classifier with minimal
generalization error. However, doing so requires designing and
training a potentially large number of individual classifiers.
Using SLT, we do not need to do that: the generalization error can
be minimized directly by minimizing an upper bound of the risk
functional R(f).
The bound given below holds for any distribution P(x, y) with
probability of at least 1 - delta:

R(f) <= Remp(f) + phi(h, l, delta),

where the parameter h denotes the so-called VC
(Vapnik-Chervonenkis) dimension and phi is the confidence term
defined by Vapnik [10] as

phi(h, l, delta) = sqrt( (h * (ln(2l/h) + 1) - ln(delta/4)) / l ).

ERM alone is not sufficient to find a good classifier because, even
with small Remp(f), when h is large compared to l the confidence
term phi will be large, so R(f) will also be large, i.e. not
optimal. We actually need to minimize Remp(f) and phi at the same
time, a process which is called structural risk minimization (SRM).
With SRM, we no longer need a test set for model selection.
Taking different sets of classifiers F1, F2, ... with known h1,
h2, ..., we can select the f from each set with minimal Remp(f),
compute the bound, and choose the classifier with minimal R(f). No
further evaluation on a test set is needed, at least in theory.
However, we would still have to train a potentially very large
number of individual classifiers. To avoid this, we want to make h
tunable, i.e. to associate each candidate classifier set Fi with a
VC dimension h and choose an optimal f from an optimal Fi in a
single optimization step. This is done in large-margin
classification.
3.3 SVM formulations
SVM is realized from the above SLT framework. The simplest
formulation of SVM is linear, where the decision hyperplane lies in
the space of the input data x. In this case the hypothesis space is
a subset of all hyperplanes of the form f(x) = w.x + b. SVM finds
the optimal hyperplane as the solution to the learning problem,
namely the one geometrically furthest from both classes, since that
will generalize best for future unseen data.
There are two ways of finding the optimal decision hyperplane. The
first is to find the plane that bisects the two closest points of
the two convex hulls defined by the points of each class, as shown
in figure 2. The second is to maximize the margin between two
supporting planes, as shown in figure 3. Both methods produce the
same optimal decision plane and the same set of points that support
the solution (the closest points on the two convex hulls in figure
2, or the points on the two parallel supporting planes in figure
3). These are called the support vectors.
4. Feature Extraction
4.1 Moment Invariants
The moment invariants (MIs) [1] are used to evaluate seven
distributed parameters of a numeral image. In any character
recognition system, the characters are processed to extract
features that uniquely represent properties of the character. A set
of seven moment invariants is derived from the normalized central
moments. The resultant image was then thinned and seven further
moments were extracted. Thus we had 14 features (7 original and 7
thinned), which are applied as features for recognition using a
Gaussian distribution function. To increase the success rate, new
features are extracted by applying the affine invariant moment
method.
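As a hedged sketch of moment-invariant features: the report does not list its exact formulas, so the following follows the standard definitions of normalized central moments and the first two of Hu's seven invariants (phi1, phi2) for a binary character image.

```python
# Normalized central moments of a binary image and Hu's first two
# moment invariants, which are unchanged under translation.
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of the foreground (nonzero) pixels."""
    ys, xs = np.nonzero(img)
    xbar, ybar = xs.mean(), ys.mean()
    return float(((xs - xbar) ** p * (ys - ybar) ** q).sum())

def normalized_moment(img, p, q):
    """Scale-normalized central moment eta_pq."""
    mu00 = central_moment(img, 0, 0)          # = number of ink pixels
    return central_moment(img, p, q) / mu00 ** (1 + (p + q) / 2)

def hu_first_two(img):
    """phi1 and phi2 of Hu's seven moment invariants."""
    n20 = normalized_moment(img, 2, 0)
    n02 = normalized_moment(img, 0, 2)
    n11 = normalized_moment(img, 1, 1)
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    return phi1, phi2
```

Because the moments are taken about the shape's centroid, translating the character inside the image leaves phi1 and phi2 unchanged, which is exactly the property that makes them useful recognition features.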
4.2 Affine Moment Invariants
The affine moment invariants were derived by means of the theory
of algebraic invariants; a full derivation and a comprehensive
discussion of the properties of these invariants can be found in
the literature. Four such features can be computed for character
recognition. Thus, overall, 18 features have been used for the
support vector machine.
5. Experiment
5.1 Data Set Description
In this paper, UCI Machine Learning data sets are used. The UCI
Machine Learning Repository is a collection of databases, domain
theories, and data generators used by the machine learning
community for the empirical analysis of machine learning
algorithms. One of the available datasets is the Optical
Recognition of Handwritten Digits Data Set.
A dataset of handwritten Assamese characters was created by
collecting samples from 45 writers. Each writer contributed 52
basic characters, 10 numerals and 121 Assamese conjunct consonants,
so the total number of entries per writer is 183 (= 52 characters +
10 numerals + 121 conjunct consonants). The total number of samples
in the dataset is 8235 (= 45 x 183). The handwriting samples were
collected on an iBall 8060U external digitizing tablet connected to
a laptop, using its cordless digital stylus pen. The dataset is
distributed across 45 folders.
This file contains information about the character id (ID),
character name (Label) and actual shape of the character (Char). In
the raw Optdigits data, digits are represented as 32x32 matrices.
They are also available in a preprocessed form in which the digits
have been divided into non-overlapping 4x4 blocks and the number of
on pixels counted in each block. This generates 8x8 input matrices
in which each element is an integer in the range 0..16.
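The Optdigits preprocessing just described can be sketched in a few lines: a 32x32 binary digit matrix is divided into non-overlapping 4x4 blocks and the "on" pixels are counted per block, yielding an 8x8 matrix of integers in the range 0..16.

```python
# Reduce a (32, 32) 0/1 bitmap to (8, 8) per-block "on"-pixel counts.
import numpy as np

def block_counts(bitmap, block=4):
    """Sum each non-overlapping block x block tile of the bitmap."""
    h, w = bitmap.shape
    return bitmap.reshape(h // block, block, w // block, block).sum(axis=(1, 3))
```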
5.2 Data Preprocessing
For the experiments using SVM, the example isolated characters are
preprocessed and 7 local features are extracted for each point of
the spatially resampled online signal. For each example character
there are 350 feature values as input to the SVM. We use an SVM
with an RBF kernel, since the RBF kernel has been shown to
generally give better recognition results. A grid search was
performed to find the best values of C and gamma (the parameters of
the original RBF kernel formulation) for the final SVM models; C =
8 and the corresponding grid-searched gamma value were chosen.
Preprocessing :-
A series of operations is performed on the scanned image during
preprocessing (figure 4). The operations performed are: (i) Median
filtering is applied to reduce the noise introduced into the
character image during scanning; the filter window is usually a
template centered on the point of interest. To perform median
filtering at a point, the values of the pixel and its neighbors are
sorted by gray level and their median is determined [12]. (ii)
Global thresholding is applied to convert the image from gray scale
to binary form. (iii) The image is normalized to 7X7. (iv) Thinning
is performed by the method proposed in [10].
The back-propagation algorithm works as follows:
(i) All the weights are initialized to some small random values.
(ii) The input vector and desired output vector are presented to
the net.
(iii) Each input unit receives an input signal and transmits it to
each hidden unit.
(iv) Each hidden unit calculates its activation function and sends
the signal to each output unit.
(v) The actual output is calculated; each output unit compares the
actual output with the desired output to determine the associated
error.
(vi) The weights are adjusted to minimize the error.
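Steps (i)-(vi) above can be sketched as a minimal one-hidden-layer network trained by back-propagation on a toy two-class problem. The sigmoid activation, learning rate, layer sizes and the tiny OR dataset used below are illustrative assumptions, not the report's actual configuration.

```python
# Minimal batch back-propagation for a network with one hidden layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, T, hidden=4, lr=0.5, epochs=5000, seed=1):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # bias input
    W1 = rng.normal(0, 0.5, (Xb.shape[1], hidden))   # (i) small random weights
    W2 = rng.normal(0, 0.5, (hidden + 1, T.shape[1]))
    for _ in range(epochs):
        H = sigmoid(Xb @ W1)                         # (iii)-(iv) hidden layer
        Hb = np.hstack([H, np.ones((H.shape[0], 1))])
        Y = sigmoid(Hb @ W2)                         # (v) actual output
        dY = (Y - T) * Y * (1 - Y)                   # output error * sigmoid'
        dH = (dY @ W2[:-1].T) * H * (1 - H)          # error propagated back
        W2 -= lr * Hb.T @ dY                         # (vi) adjust weights
        W1 -= lr * Xb.T @ dH
    return W1, W2

def predict(W1, W2, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    H = sigmoid(Xb @ W1)
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    return sigmoid(Hb @ W2)
```

After training on a small labelled set, the rounded sigmoid outputs give the predicted class for each input pattern.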
In this paper, the proposed back-propagation neural net is
designed with two hidden layers, as shown in figure 7. The input
layer contains 49 nodes, as 49 features were extracted for each
character. The output layer contains 5 nodes (1 node for each
class). The size of the output was 5X5, where each character is
represented as a 5X1 output vector. The number of nodes in both
hidden layers was set to 7.
PROPOSED WORK
The character recognition task has been attempted through many
different approaches, such as template matching and statistical
techniques like NN, HMM and the quadratic discriminant function
(QDF). Template matching works effectively for recognition of
standard fonts, but gives poor performance with handwritten
characters and when the size of the dataset grows; it is not an
effective technique if there is font discrepancy. HMM models
achieved great success in the field of speech recognition in past
decades, but developing a 2-D HMM model for character recognition
has been found difficult and complex, while NNs are computationally
expensive for recognition. N. Araki et al. applied Bayesian filters
based on Bayes' theorem to handwritten character recognition.
Later, discriminative classifiers such as the Artificial Neural
Network (ANN) and the Support Vector Machine (SVM) attracted a lot
of attention. G. Vamvakas et al. compared the performance of three
classifiers, Naive Bayes, K-NN and SVM, and attained the best
performance with SVM. However, SVM suffers from the limitation of
kernel selection. ANNs can adapt to changes in the data and learn
the characteristics of the input signal; they also consume less
storage and computation than SVMs. The most used ANN-based
classifiers are the MLP and the RBFN. B.K. Verma presented a system
for HCR using MLP and RBFN networks in the task of handwritten
Hindi character recognition; the error back-propagation algorithm
was used to train the MLP networks. J. Sutha et al. showed the
effectiveness of the MLP for Tamil HCR using Fourier descriptor
features. R. Gheroie et al. proposed handwritten Farsi character
recognition using an MLP trained with the error back-propagation
algorithm.
Similar shaped characters are difficult to differentiate because
of very minor variations in their structures. T. Wakabayashi et al.
proposed an F-Ratio (Fisher Ratio) based feature extraction method
to improve results on similar shaped characters. They considered
pairs of similar shaped characters of different scripts like
English, Arabic/Persian, Devnagri, etc. and used QDF for
recognition. QDF suffers from the limitation of a minimum required
dataset size. F. Yang et al. [14] proposed a method that combines
both structural and statistical features of characters for similar
handwritten Chinese character recognition. As various feature
extraction methods and classifiers have been used for character
recognition by researchers to suit their work, we propose a novel
feature set that is expected to perform well for this application.
In this paper, the features are extracted on the basis of character
geometry and are then fed to each of the selected ML algorithms for
recognition of SSHMC.
3. Methodology for feature extraction
A device is to be designed and trained to recognize the 26 letters
of the alphabet. We assume that some imaging system digitizes each
letter centered in the system's field of vision. The result is that
each letter is represented as a 5 by 7 grid of real values. The
following figure shows perfect pictures of all 26 letters.
Figure 1: The 26 letters of the alphabet with a resolution of 5 x 7.
However, the imaging system is not perfect and the letters may
suffer from noise:
Figure 2: A perfect picture of the letter A and 4 noisy versions
(standard deviation of 0.2).
Perfect classification of ideal input vectors is required and,
more importantly, reasonably accurate classification of noisy
vectors. Before OCR can be used, the source material must be
scanned using an optical scanner (and sometimes a specialized
circuit board in the PC) to read in the page as a bitmap (a pattern
of dots). Software to recognize the images is also required. The
character recognition software then processes these scans to
differentiate between images and text and to determine what letters
are represented in the light and dark areas. Older OCR systems
match these images against stored bitmaps based on specific fonts.
The hit-or-miss results of such pattern-recognition systems helped
establish OCR's reputation for inaccuracy. Today's OCR engines add
multiple neural-network algorithms.
SOLUTION APPROACH
On-line handwriting recognition involves the automatic conversion
of text as it is written on a special digitizer or PDA, where a
sensor picks up the pen-tip movements as well as pen-up/pen-down
switching. This kind of data is known as digital ink and can be
regarded as a digital representation of handwriting. The obtained
signal is converted into letter codes which are usable within
computer and text-processing applications. The elements of an
on-line handwriting recognition interface typically include: a pen
or stylus for the user to write with; a touch-sensitive surface,
which may be integrated with, or adjacent to, an output display;
and a software application which interprets the movements of the
stylus across the writing surface, translating the resulting
strokes into digital text.
General process
The process of online handwriting recognition can be broken down
into a few general steps: preprocessing, feature extraction and
classification. The purpose of preprocessing is to discard
irrelevant information in the input data that can negatively affect
recognition; this concerns both speed and accuracy. Preprocessing
usually consists of binarization, normalization, sampling,
smoothing and denoising. The second step is feature extraction: out
of the two- or more-dimensional vector field received from the
preprocessing algorithms, higher-dimensional data is extracted. The
purpose of this step is to highlight important information for the
recognition model. This data may include information like pen
pressure, velocity or the changes of writing direction. The last
big step is classification: in this step, various models are used
to map the extracted features to different classes, thus
identifying the characters or words the features represent.
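Two of the preprocessing steps named above, smoothing and sampling of the pen trajectory, can be sketched as follows. This is a hedged illustration, not the report's actual method: a moving-average smoother plus arc-length resampling of an (x, y) pen stroke to a fixed number of points.

```python
# Smooth and resample a pen stroke given as an (n, 2) array of (x, y)
# coordinates, as is common in online-handwriting preprocessing.
import numpy as np

def smooth(points, window=3):
    """Moving-average smoothing of the coordinate sequences."""
    kernel = np.ones(window) / window
    out = np.column_stack([
        np.convolve(points[:, 0], kernel, mode="same"),
        np.convolve(points[:, 1], kernel, mode="same"),
    ])
    out[0], out[-1] = points[0], points[-1]   # keep the stroke endpoints
    return out

def resample(points, n=32):
    """Resample a stroke to n points, equally spaced by arc length."""
    d = np.sqrt((np.diff(points, axis=0) ** 2).sum(axis=1))
    s = np.concatenate([[0.0], np.cumsum(d)])  # cumulative arc length
    targets = np.linspace(0.0, s[-1], n)
    x = np.interp(targets, s, points[:, 0])
    y = np.interp(targets, s, points[:, 1])
    return np.column_stack([x, y])
```

Resampling to a fixed point count is what allows a fixed-size feature vector (for example, a fixed number of local features per point) to be fed to the classifier.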
Hardware
Commercial products incorporating handwriting recognition as a
replacement for keyboard input were introduced in the early 1980s.
Examples include handwriting terminals such as the Pencept Penpad
and the Inforite point-of-sale terminal. With the advent of the
large consumer market for personal computers, several commercial
products were introduced to replace the keyboard and mouse on a
personal computer with a single pointing/handwriting system, such
as those from Pencept, CIC and others. The first commercially
available tablet-type portable computer was the GRiDPad from GRiD
Systems, released in September 1989. Its operating system was based
on MS-DOS. In the early 1990s, hardware makers including NCR, IBM
and EO released tablet computers running the PenPoint operating
system developed by GO Corp. PenPoint used handwriting recognition
and gestures throughout and provided the facilities to third-party
software. IBM's tablet computer was the first to use the ThinkPad
name and used IBM's handwriting recognition. This recognition
system was later ported to Microsoft Windows for Pen Computing, and
IBM's Pen for OS/2. None of these were commercially successful.
Advancements in electronics allowed the computing power necessary
for handwriting recognition to fit into a smaller form factor than
tablet computers, and handwriting recognition is often used as an
input method for hand-held PDAs. The first PDA to provide written
input was the Apple Newton, which exposed the public to the
advantage of a streamlined user interface. However, the device was
not a commercial success, owing to the unreliability of the
software, which tried to learn a user's writing patterns. By the
time of the release of the Newton OS 2.0, wherein the handwriting
recognition was greatly improved, including unique features still
not found in current recognition systems such as modeless error
correction, the largely negative first impression had been made.
After the discontinuation of the Apple Newton, the feature was
ported to Mac OS X 10.2 or later in the form of Inkwell
(Macintosh). Palm later launched a successful series of PDAs based
on the Graffiti recognition system. Graffiti improved usability by
defining a set of "unistrokes", or one-stroke forms, for each
character. This narrowed the possibility for erroneous input,
although memorization of the stroke patterns did increase the
learning curve for the user. The Graffiti handwriting recognition
was found to infringe on a patent held by Xerox, and Palm replaced
Graffiti with a licensed version of the CIC handwriting recognition
which, while also supporting unistroke forms, pre-dated the Xerox
patent. The court finding of infringement was reversed on appeal,
and then reversed again on a later appeal. The parties involved
subsequently negotiated a settlement concerning this and other
patents (Graffiti, Palm OS).
A Tablet PC is a special notebook computer that is outfitted with
a digitizer tablet and a stylus, and allows a user to handwrite
text on the unit's screen. The operating system recognizes the
handwriting and converts it into typewritten text. Windows Vista
and Windows 7 include personalization features that learn a user's
writing patterns or vocabulary for English, Japanese, Chinese
Traditional, Chinese Simplified and Korean. The features include a
"personalization wizard" that prompts for samples of a user's
handwriting and uses them to retrain the system for higher-accuracy
recognition. This system is distinct from the less advanced
handwriting recognition system employed in its Windows Mobile OS
for PDAs. Although handwriting recognition is an input form that
the public has become accustomed to, it has not achieved widespread
use in either desktop computers or laptops. It is still generally
accepted that keyboard input is both faster and more reliable. As
of 2006, many PDAs offer handwriting input, sometimes even
accepting natural cursive handwriting, but accuracy is still a
problem, and some people still find even a simple on-screen
keyboard more efficient.
Software
Initial software modules could understand print handwriting where
the characters were separated. The author of the first applied
pattern recognition program, in 1962, was Shelia Guberman, then in
Moscow. Commercial examples came from companies such as
Communications Intelligence Corporation and IBM. In the early 90s,
two companies, ParaGraph International and Lexicus, came up with
systems that could understand cursive handwriting. ParaGraph was
based in Russia and founded by computer scientist Stepan Pachikov,
while Lexicus was founded by Ronjon Nag and Chris Kortge, who were
students at Stanford University. The ParaGraph CalliGrapher system
was deployed in the Apple Newton systems, and the Lexicus Longhand
system was made available commercially for the PenPoint and Windows
operating systems. Lexicus was acquired by Motorola in 1993 and
went on to develop Chinese handwriting recognition and predictive
text systems for Motorola. ParaGraph was acquired in 1997 by SGI
and its handwriting recognition team formed a P&I division, later
acquired from SGI by Vadem. Microsoft acquired CalliGrapher
handwriting recognition and other digital ink technologies
developed by P&I from Vadem in 1999. Wolfram Mathematica (8.0 or
later) also provides a handwriting or text recognition function
TextRecognize.Character recognition task has been attempted through
many different approaches like template matching, statistical
techniques like NN, HMM, Quadratic Discriminant function (QDF) etc.
Template matching works effectively for recognition of standard
fonts, but gives poor performance with handwritten characters and
when the size of dataset grows. It is not an effective technique if
there is font discrepancy .
HMM models achieved great success in the field of speech recognition in past decades; however, developing a 2-D HMM model for character recognition has been found difficult and complex. NNs are found to be computationally expensive for recognition purposes. N. Araki et al. applied Bayesian filters based on Bayes' Theorem for handwritten character recognition. Later, discriminative classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM) grabbed a lot of attention.
G. Vamvakas et al. compared the performance of three classifiers: Naive Bayes, K-NN and SVM, and attained the best performance with SVM. However, SVM suffers from the limitation of kernel selection. ANNs can adapt to changes in the data and learn the characteristics of the input signal. Also, ANNs consume less storage and computation than SVMs. The most used ANN-based classifiers are MLP and RBFN. B.K. Verma [10] presented a system for HCR using MLP and RBFN networks in the task of handwritten Hindi character recognition.
The error back-propagation algorithm was used to train the MLP networks. J. Sutha et al. showed the effectiveness of MLP for Tamil HCR using Fourier descriptor features. R. Gheroie et al. proposed handwritten Farsi character recognition using an MLP trained with the error back-propagation algorithm.
Similar-shaped characters are difficult to differentiate because of very minor variations in their structures. T. Wakabayashi et al. proposed an F-Ratio (Fisher Ratio) based feature extraction method to improve results on similar-shaped characters. They considered pairs of similar-shaped characters of different scripts, like English, Arabic/Persian, Devnagri, etc., and used QDF for recognition. QDF suffers from the limitation of a minimum required dataset size. F. Yang et al. proposed a method that combines both structural and statistical features of characters for similar handwritten Chinese character recognition.
As can be seen, researchers have used various feature extraction methods and classifiers for character recognition that suit their work; we propose a novel feature set that is expected to perform well for this application. In this paper, features are extracted on the basis of character geometry and are then fed to each of the selected ML algorithms for recognition of SSHHC.
3. MACHINE LEARNING CONCEPTS
Machine learning [15] is a scientific discipline that deals with the design and development of algorithms that allow computers to develop behaviours based on empirical data. ML algorithms, in this application, are used to map the instances of the handwritten character samples to predefined classes.
3.1. Machine Learning Algorithms
For this application of SSHHC recognition, we use the ML algorithms listed below, as implemented in WEKA 3.7.0. WEKA (Waikato Environment for Knowledge Analysis) is a Java-based open-source machine learning workbench. These algorithms have been found to perform very well for most applications and have been widely used by researchers. A brief description of each algorithm follows:
3.1.1. Bayesian Network
A Bayesian Network [17], or Belief Network, is a probabilistic model in the form of a directed acyclic graph (DAG) that represents a set of random variables by its nodes and their correlations by its edges. Bayesian Networks have the advantage that they visually represent all the relationships between the variables in the system via connecting arcs. Also, they can handle situations where the data set is incomplete.
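To make the inference step concrete, the toy sketch below applies Bayes' rule over the simplest possible network, a single Class -> Feature arc. The prior and conditional probabilities are invented for illustration and are unrelated to WEKA's BayesNet implementation.

```python
# Minimal two-node Bayesian network: Class -> Feature (both binary).
# The probability tables below are hypothetical, for illustration only.
p_class = {0: 0.5, 1: 0.5}                # prior P(Class = c)
p_feat_given_class = {0: 0.2, 1: 0.9}     # P(Feature = 1 | Class = c)

def posterior(feature_value):
    """P(Class = c | Feature = feature_value) by Bayes' rule."""
    joint = {}
    for c in (0, 1):
        p_f = p_feat_given_class[c] if feature_value == 1 else 1 - p_feat_given_class[c]
        joint[c] = p_class[c] * p_f      # P(c) * P(feature | c)
    z = sum(joint.values())              # normalizing constant P(feature)
    return {c: joint[c] / z for c in joint}

post = posterior(1)   # posterior over classes after observing Feature = 1
```

Observing the feature shifts belief toward class 1, whose conditional probability of producing it is higher.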
3.1.2. Radial Basis Function Network
An RBFN is an artificial neural network which uses radial basis functions as activation functions. Due to its non-linear approximation properties, RBF Networks are able to model complex mappings. RBF Networks do not suffer from the issue of local minima because the parameters to be adjusted are the linear mappings from the hidden layer to the output layer.
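A minimal forward pass through such a network can be sketched as follows; the centres, width and output weights here are illustrative placeholders, not trained values.

```python
import math

# Toy RBF network: two Gaussian hidden units and one linear output.
centres = [0.0, 1.0]      # hidden-unit centres (assumed)
width = 0.5               # shared Gaussian width (assumed)
weights = [1.0, -1.0]     # linear hidden-to-output weights (assumed)
bias = 0.0

def rbf_forward(x):
    """Gaussian RBF activations, then a linear combination at the output."""
    hidden = [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centres]
    return bias + sum(w * h for w, h in zip(weights, hidden))
```

Only `weights` and `bias` are linear in the output, which is why training them avoids the local-minima issue mentioned above.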
3.1.3. Multilayer Perceptron
An MLP is a feed-forward artificial neural network that computes a single output from multiple real-valued inputs by forming a linear combination according to its input weights and then passing the result through a nonlinear activation function (mainly sigmoid). An MLP is a universal function approximator and is highly efficient for solving complex problems due to the presence of one or more hidden layers.
3.1.4. C4.5
C4.5 is an extension of Ross Quinlan's earlier ID3 algorithm. It builds decision trees from a set of training data using the concepts of information gain and entropy. C4.5 uses a white-box model, due to which the explanation of results is easy to understand. Also, it performs well with even large amounts of data.
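The entropy and information-gain calculations that C4.5 inherits from ID3 can be sketched in a few lines; the labels below form a hypothetical toy split, not data from this work.

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(labels, left, right):
    """Reduction in entropy from splitting `labels` into `left` + `right`."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

labels = ['a', 'a', 'b', 'b']
gain = info_gain(labels, ['a', 'a'], ['b', 'b'])   # a perfect split
```

A perfect split of a balanced two-class set yields the maximum gain of one bit; C4.5 picks the attribute whose split maximizes this quantity (normalized as gain ratio).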
3.2. Feature Reduction
Feature extraction is the task of detecting and isolating various desired attributes (features) of an object in an image, maximizing the recognition rate with the least number of elements. However, training the classifier with the maximum number of features obtained is not always the best option, as irrelevant or redundant features can have a negative impact on a classifier's performance and, at the same time, make the built classifier computationally complex. Feature reduction, or feature selection, is the technique of selecting a subset of relevant features to improve the performance of learning models by speeding up the learning process and reducing computational complexity. The two feature reduction methods chosen for this application are CFS [22] and CON [23], as these methods have been widely used by researchers for feature reduction.
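As a rough illustration of correlation-driven selection (a simplification in the spirit of CFS, not its actual merit function), one can rank features by their absolute correlation with the class and drop the weak ones; the feature values and threshold below are invented.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

# Hypothetical dataset: two feature columns and a numeric class label.
features = {
    "f_relevant":   [0, 0, 1, 1],
    "f_irrelevant": [1, 0, 1, 0],
}
labels = [0, 0, 1, 1]

# Keep features whose |correlation| with the class exceeds a chosen threshold.
selected = [name for name, vals in features.items()
            if abs(pearson(vals, labels)) > 0.5]
```

Here the feature that tracks the class survives while the uncorrelated one is dropped, which is the basic intuition behind correlation-based subset selection.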
4. EXPERIMENTAL METHODOLOGY
The following sections describe the data-set, pre-processing and
feature extraction adopted in our proposed work of recognition of
SSHHC.
4.1. Dataset Creation
The dataset is created by asking candidates of different age groups to write the similar shaped characters (, ; , ; , ) several times in their own handwriting on plain white sheets. These image samples are scanned using an HP Scanjet G2410 at a resolution of 1200 x 1200 dpi. Each character is cropped and stored in .jpg format using MS Paint. Thus, this dataset consists of isolated handwritten Hindi characters that are to be recognized using ML algorithms. Using these character samples, three datasets are created, as described below:
Dataset 1 consists of only 100 samples of the target pair (,
).
Dataset 2 consists of increased samples of the same target pair, i.e. the size of the training dataset is increased to 342 samples by adding more samples of the target class (from other persons) to the training dataset. More samples are added in order to analyze the impact of an increase in the number of samples on the relative performance of the ML algorithms.
Dataset 3 consists of samples of both the target and non-target classes, i.e. other similar shaped character pairs (like , ; , ) are also added to the dataset (making 500 samples in total) with which the ML algorithms are trained. Non-target class characters are added to test the ability of the ML classifiers to identify the target characters among different characters. A few samples of the entire dataset are shown in Figure 1.
Figure 1: Samples of Handwritten Hindi Characters
4.2. Performance Metrics
Performance of the classifiers is evaluated on the basis of the metrics described below:
i Precision: Proportion of the examples which truly have class x among all those which were classified as class x.
ii Misclassification Rate: Number of instances that were classified incorrectly out of the total instances.
iii Model Build Time: Time taken to train a classifier on a given data set.
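The first two metrics can be computed directly from predicted and true labels; the label lists below are hypothetical, and the class names are placeholders.

```python
def precision(y_true, y_pred, cls):
    """Fraction of instances predicted as `cls` that truly belong to `cls`."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == cls]
    return sum(1 for t in predicted if t == cls) / len(predicted)

def misclassification_rate(y_true, y_pred):
    """Fraction of all instances that were classified incorrectly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)

# Hypothetical labels for four test instances of two character classes.
y_true = ['ka', 'ka', 'pha', 'pha']
y_pred = ['ka', 'pha', 'pha', 'pha']
```

In this toy case one 'ka' instance is mislabelled as 'pha', so precision for 'pha' is 2/3 while the overall misclassification rate is 1/4.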
4.3. Pre-processing
The following pre-processing steps are applied to the scanned character images:
i First, each RGB character image, after conversion to grey scale, is binarized through thresholding.
ii The image is inverted such that the background is black and the foreground is white.
iii Then, the shortest matrix that fits the entire character skeleton is obtained for each image; this is termed the universe of discourse.
iv Finally, spurious pixels are removed from the image, followed by skeletonization.
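Steps i-iii can be sketched in plain Python on a toy grayscale matrix. The threshold value is an assumption, and skeletonization (step iv) is omitted since it needs a full thinning algorithm.

```python
# Toy 5x5 grayscale image (0-255); the dark pixels form a small stroke.
image = [
    [255, 255, 255, 255, 255],
    [255,  10, 255, 255, 255],
    [255,  10,  10, 255, 255],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
]

# Steps i-ii: binarize (dark ink -> foreground) and invert, so the
# foreground is 1 (white) on a 0 (black) background.
THRESHOLD = 128   # assumed global threshold
binary = [[1 if px < THRESHOLD else 0 for px in row] for row in image]

# Step iii: universe of discourse -- the shortest matrix that fits
# the entire character, i.e. its tight bounding box.
rows = [r for r, row in enumerate(binary) if any(row)]
cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
universe = [row[min(cols):max(cols) + 1]
            for row in binary[min(rows):max(rows) + 1]]
```

The cropped matrix discards the empty border, so later zoning operates only on the character itself.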
4.4. Feature Extraction
After pre-processing, features for each character image are extracted based on the character geometry, using the technique described in [24]. The features are based on the basic line types that form the skeleton of the character. Each pixel in the image is traversed. Individual line segments, their directions and intersection points are identified from an isolated character image. For this, the image matrix is first divided into nine zones, and the number, length and type of lines and intersections present in each zone are determined. The line type can be: Horizontal, Vertical, Right Diagonal or Left Diagonal. For each zone, the following features are extracted, resulting in a feature vector of length 9 per zone:
i. Number of horizontal lines
ii. Number of vertical lines
iii. Number of right diagonal lines
iv. Number of left diagonal lines
v. Normalized length of all horizontal lines
vi. Normalized length of all vertical lines
vii. Normalized length of all right diagonal lines
viii. Normalized length of all left diagonal lines
ix. Number of intersection points
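The zoning step can be illustrated by partitioning a binary character matrix into a 3x3 grid. For brevity this sketch only counts foreground pixels per zone, standing in for the nine line-based features per zone listed above.

```python
def zones_3x3(matrix):
    """Split a binary matrix into a 3x3 grid and count foreground pixels
    in each zone, returning the nine counts in row-major order."""
    h, w = len(matrix), len(matrix[0])
    counts = []
    for zr in range(3):
        for zc in range(3):
            r0, r1 = zr * h // 3, (zr + 1) * h // 3
            c0, c1 = zc * w // 3, (zc + 1) * w // 3
            counts.append(sum(matrix[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return counts

# Toy 6x6 image with a vertical stroke down the leftmost column.
img = [[1 if c == 0 else 0 for c in range(6)] for _ in range(6)]
counts = zones_3x3(img)   # nine values, one per zone
```

The stroke shows up only in the three left-column zones, which is exactly the kind of spatial signature the zonal feature vector is meant to capture.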
A total of 81 (9x9) zonal features are obtained. After zonal feature extraction, four additional features are extracted for the entire image based on its regional properties, namely:
i. Euler Number: the difference between the number of objects and the number of holes in the image
ii. Eccentricity: the ratio of the distance between the foci of the ellipse and its major axis length
iii. Orientation: the angle between the x-axis and the major axis of the ellipse that has the same second moments as the region
iv. Extent: the ratio of pixels in the region to the pixels in the total bounding box
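Of the four regional properties, Extent is the simplest to compute directly; the sketch below derives it from a toy binary matrix (the image is invented for illustration).

```python
def extent(matrix):
    """Ratio of foreground pixels to the area of their bounding box."""
    rows = [r for r, row in enumerate(matrix) if any(row)]
    cols = [c for c in range(len(matrix[0])) if any(row[c] for row in matrix)]
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    ink = sum(sum(row) for row in matrix)
    return ink / box_area

# Toy binary image: three foreground pixels inside a 2x2 bounding box.
img = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
```

Three ink pixels over a 2x2 box give an extent of 0.75; a fully filled box would give 1.0.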
5. IMPLEMENTATION AND RESULTS
5.4 Experimental Results
5.4.1 Test Application Analysis
The test application accompanying the source code can perform recognition of handwritten digits. To do so, open the application (preferably outside Visual Studio, for better performance). Click on the File menu and select Open. This will load some entries from the Optdigits dataset into the application.
To perform the analysis, click the Run Analysis button. Please be aware that it may take some time. After the analysis is complete, the other tabs in the sample application will be populated with the analysis' information; the contribution of each factor found during the discriminant analysis is plotted in a pie graph for easy visual inspection. Once the analysis is complete, we can test its classification ability on the testing data set. The green rows have been correctly identified by the discriminant-space Euclidean distance classifier. We can see that it correctly identifies 98% of the testing data. The testing and training data sets are disjoint and independent.
Fig.6: Using the default values in the application
Results
After the analysis has been completed and validated, we can use it to classify new digits drawn directly in the application. The bars on the right show the relative response of each of the discriminant functions. Each class has a discriminant function that outputs a closeness measure for the input point. The classification is based on which function produces the maximum output.
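The maximum-output decision rule can be sketched as an argmax over per-class discriminant functions; the two quadratic discriminants below are invented stand-ins, not the functions learned by the application.

```python
# Each class scores closeness of the input to its own region; the
# discriminants below peak at hypothetical class centres (1,1) and (-1,-1).
discriminants = {
    "0": lambda x, y: -(x - 1) ** 2 - (y - 1) ** 2,
    "1": lambda x, y: -(x + 1) ** 2 - (y + 1) ** 2,
}

def classify(x, y):
    """Return the class whose discriminant produces the maximum output."""
    return max(discriminants, key=lambda cls: discriminants[cls](x, y))
```

A point near (1, 1) is assigned class "0" and one near (-1, -1) class "1", mirroring how the application picks the tallest response bar.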
Handwritten Devanagari character sets are taken from a test .bmp image. The following steps are taken to obtain the best accuracy for an input handwritten Hindi character image given to the system. First of all, the system is trained using different data sets or samples. The system is then tested on a few of the given samples, and accuracy is measured. For each character, features were computed and stored in templates for training the system.
Sets of handwritten Gurumukhi characters are made. The data set was partitioned into two parts: the first is used for training the system and the second for testing. For each character, features were computed and stored for training the network. Three network layers are taken, i.e. one input layer, one hidden layer and one output layer. If the number of neurons in the hidden layer is increased, a problem of allocating the required memory occurs. Also, if the value of the error tolerance is high, say 0.1, the desired results are not obtained; by lowering the error tolerance to, say, 0.01, a high accuracy rate is obtained. The network also takes a greater number of cycles to learn when the error tolerance value is low; with a high error tolerance the network learns in fewer cycles, but the learning is not very fine. The unit disk is taken for each character by finding the maximum radius of the character (i.e. the maximum distance between the centre of the character and its boundary), so that the character fits on the disk.
Here are some tables displaying the results obtained from the program. Sign images of the same letter are grouped together in every table. Each table gives information about the pre-processing operations that took place (i.e. noise, edge detection, filling of gaps) and also whether the image belongs to the same database as the training images. The amount of each filter is also recorded, so the maximum amount of noise the network can tolerate can be estimated. This, of course, varies from character image to character image. The result also varies every time the algorithm is executed; the variance is very small, but it is there. The following are the main results of Gurumukhi character recognition:
Table 4: Recognition Accuracy of Handwritten Hindi Characters
Character No. of Samples Train/Test % Accuracy
200 180/20 93%
196 176/20 87%
155 130/25 89%
184 169/15 71%
192 162/30 69%
160 140/20 81%
179 159/20 79%
168 148/20 84%
195 170/25 80%
177 152/25 90%
191 166/25 88%
180 165/15 86%
195 170/25 89%
187 167/20 96%
169 149/20 95%
199 174/25 92%
188 168/20 94%
166 146/20 82%
196 176/20 82%
189 164/25 88%
168 148/20 85%
178 158/20 84%
196 176/20 87%
171 151/20 81%
182 162/20 88%
184 164/20 80%
169 149/20 89%
180 155/25 76%
170 150/20 78%
193 173/20 71%
185 165/20 82%
176 146/30 70%
167 147/20 92%
157 132/25 85%
178 158/20 87%
183 153/30 69%
191 161/30 73%
185 155/30 70%
Fig 7: We can see the analysis also performs rather well on
completely new and previously unseen data.
Experiments were performed on different samples of numerals from mixed scripts, using a single hidden layer.
Table 2: Detail Recognition performance of SVM on UCI
datasets
Table 3: Detail Recognition performance of SVM and HMM on UCI
datasets
Table 4: Recognition Rate of Each Numeral in DATASET
It is observed that the recognition rate using SVM is higher than with the Hidden Markov Model. However, the free parameter storage for the SVM model is significantly higher. The memory space required for SVM is the number of support vectors multiplied by the number of feature values (in this case 350). This is significantly large compared to HMM, which only needs to store the weights; HMM needs less space due to its weight-sharing scheme.
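The storage figure quoted above is a simple product; the sketch below computes it for a hypothetical support-vector count (the 8-bytes-per-value figure, corresponding to double precision, is an assumption).

```python
# Rough SVM model-size estimate: support vectors x feature values per
# vector x bytes per stored value. 350 features per the text; the
# support-vector count and value width are illustrative assumptions.
N_FEATURES = 350

def svm_storage_bytes(n_support_vectors, bytes_per_value=8):
    """Approximate bytes needed to store all support vectors."""
    return n_support_vectors * N_FEATURES * bytes_per_value

mem = svm_storage_bytes(1000)   # e.g. a model with 1000 support vectors
```

Even a modest 1000 support vectors already cost a few megabytes, which is why reduced-set selection and compact signal storage matter.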
However, in SVM, space saving can be achieved by storing only the original online signals and the pen-up/pen-down status in a compact manner. During recognition, the model is expanded dynamically as required. Table 3 shows the comparison of recognition rates between HMM and SVM using all three databases. SVM clearly outperforms HMM in all three isolated character cases.
The results for the isolated character cases above indicate that the recognition rate of the hybrid word recognizer could be improved by using SVM instead of HMM. Thus, we are currently implementing a word recognizer using both HMM and SVM and comparing their performance.
6. FUTURE WORK AND CONCLUSION
Conclusion
An important feature of this ANN training is that the learning rates are dynamically computed each epoch by an interpolation map. The ANN error function is transformed into a lower-dimensional error space, and the reduced error function is employed to identify the variable learning rates. As the training progresses, the geometry of the ANN error function constantly changes, and therefore the interpolation map always identifies variable learning rates that gradually reduce to a lower magnitude. As a result, the error function also reduces to a smaller terminal value. The result of the structure analysis shows that as the number of hidden nodes increases, the number of epochs taken to recognize a handwritten character also increases. A lot of effort has been made to achieve higher accuracy, but there is still tremendous scope for improving recognition accuracy.
Handwriting recognition is a challenging field in machine learning, and this work identifies Support Vector Machines as a potential solution. The number of support vectors can be reduced by selecting better C and gamma parameter values through a finer grid search and by reduced-set selection. Work on integrating the SVM character recognition framework into the HMM-based word recognition framework is under way. In the hybrid system, word pre-processing and normalization need to be done before SVM is used for character hypothesis recognition and word likelihood computation using HMM. It is envisaged that, due to SVM's better discrimination capability, the word recognition rate will be better than in an HMM hybrid system. The scope of the proposed system is limited to the recognition of a single character.
Offline handwritten Hindi character recognition is a difficult problem, not only because of the great amount of variation in human handwriting, but also because of overlapped and joined characters. Recognition approaches depend heavily on the nature of the data to be recognized. Since handwritten Hindi characters can be of various shapes and sizes, the recognition process needs to be highly efficient and accurate to recognize characters written by different users. There are a few reasons that create problems in Hindi handwritten character recognition. Some characters are similar in shape (for example and ). Sometimes characters are overlapped and joined. Large numbers of character and stroke classes are present. Different users, or even the same user, can write differently at different times, depending on the pen or pencil, the width of the line, the slight rotation of the paper, the type of paper and the mood and stress level of the person. The character can be written at different locations on the paper or in the window, and characters can be written in different fonts.
Handwritten Gurumukhi character recognition using neural networks is discussed here. It has been found that recognition of handwritten Gurumukhi characters is a very difficult task. The main reasons for this difficulty are the following: some Gurumukhi characters are similar in shape (for example and ); different writers, or even the same writer, can write differently at different times, depending on the pen or pencil, the width of the line, the slight rotation of the paper, the type of paper and the mood and stress level of the person; the character can be written at different locations on the paper or in the window; and characters can be written in different fonts. These facts are justified by the work done here. A small set of Gurumukhi characters was trained using a back-propagation neural network, and testing was then performed on another character set; the accuracy of the network was very low. Then, some other character images were added to the old character set and the network was trained on the new set. Testing was again performed on some new image sets written by different people, and it was found that the accuracy of the network increased slightly in some cases. Again, some new character images were added to the old character set (on which the network had been trained) and the network was trained on this new set. When the network was presented with new character images, it was seen that recognition increased, although at a slow rate. The results of the last training, with a 50-character set, and testing, with a 10-character set, are presented. It can be concluded that as the network is trained with more sets, the accuracy of character recognition will definitely increase.
Future scope
Over the past three decades, many different methods have been explored by a large number of scientists to recognize characters. A variety of approaches have been proposed and tested by researchers in different parts of the world, including statistical methods, structural and syntactic methods and neural networks. No OCR in the world is 100% accurate to date. The recognition accuracy of the neural networks proposed here can be further improved. The number of character sets used for training is reasonably low, and the accuracy of the network can be increased by using more training character sets. This approach is used for recognition of Gurumukhi characters only; in future work, it can be extended to the recognition of Gurumukhi words.
References
[1] R. Plamondon and S. N. Srihari, "On-line and off-line handwriting recognition: a comprehensive survey", IEEE Transactions on PAMI, Vol. 22(1), pp. 63-84, 2000.
[2] Negi, C. Bhagvati and B. Krishna, "An OCR system for Telugu", in Proceedings of the Sixth International Conference on Document Processing, pp. 1110-1114, 2001.
[3] J. I. Hong and J. A. Landay, "SATIN: A Toolkit for Informal Ink-based Applications", CHI Letters: ACM Symposium on UIST, 2(2), pp. 63-72.
[4] S. Mori, C. Y. Suen and K. Yamamoto, "Historical review of OCR research and development", Proceedings of the IEEE, Vol. 80(7), pp. 1029-1058, 1992.
[5] U. Pal and B. B. Chaudhuri, "Indian script character recognition", Pattern Recognition, Vol. 37(9), pp. 1887-1899, 2004.
[6] H. Bunke and P. S. P. Wang, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1997.
[7] S. V. Rice, G. Nagy and T. A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Pub