-
Optical Character Recognition system for printed Telugu text
MTech Project Report
Submitted in partial fulfillment of the requirements for the
degree of
Master of Technology
by
Udaya Kumar Ambati
Roll No : 09305073
under the guidance of
Prof. M.R. Bhujade
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
April 2010
-
Acknowledgements
I would sincerely like to thank my guide, Prof. M.R. Bhujade, for his motivating support throughout the semester and the consistent direction that he has given to my work. I would like to thank each and every one who helped me throughout my work.
-
Abstract
Telugu is a language spoken by more than 66 million people of
South India. Not much work
has been reported on the development of optical character
recognition (OCR) systems for Telugu
text. Therefore, it is an area of current research. Some
characters in Telugu are made up of
more than one connected symbol. Compound characters are written
by associating modifiers
with consonants, resulting in a huge number of possible
combinations, running into hundreds
of thousands. A compound character may contain one or more
connected symbols. Therefore,
systems developed for documents of other scripts, like Roman,
cannot be used directly for the
Telugu language.
This project aims at developing a complete Optical Character Recognition system for printed Telugu text. The system segments the document image into lines and words. The features of each character are then extracted and passed to a Support Vector Machine, which classifies the characters using a supervised learning algorithm.
-
Contents
1 Introduction
2 Structure of Telugu text and Segmentation issues [5]
2.1 Characteristics of Telugu script
2.2 Segmentation issues in OCR of Telugu script
3 Preprocessing phase
3.1 Thresholding and noise removal
3.1.1 The Algorithm
3.2 Skew detection and correction
3.2.1 Skew angle Detection
3.2.2 Image rotation transformation
3.3 Connected Components
3.4 Line Segmentation
3.5 Word Segmentation
3.6 Feature Extraction
3.7 Pattern classification [2]
3.7.1 SVM Classifier [2]
4 Implementation
5 Results
6 Conclusion and Future work
-
List of Figures
2.1 Harshapriya and Godavari fonts
2.2 Vowels, their associated modifiers (Matras) and their phonetic English representation
2.3 Consonants, their associated modifiers (Matras) and their phonetic English representation
2.4 Various combinations forming compound characters
3.1 Original text lines
3.2 Smoothed text lines with histogram
3.3 Highest peak and vertical line drawn at the middle of highest peak
3.4 Middle line detection for considering small length text
3.5 (a) Initial segmentation line through the white pixels of horizontal histogram (b) Result after considering only the candidate lines from original histogram
3.6 Output for word segmentation
5.1 Home page of the tool
5.2 Displaying the original image
5.3 Bounding Connected Components
5.4 Line Segmentation
5.5 Word Segmentation
-
Chapter 1
Introduction
During the past few decades, substantial research effort has been devoted to optical character recognition (OCR) [7, 6]. The objective of OCR is the automatic reading of optically sensed document text, translating human-readable characters into machine-readable codes. Research in OCR is popular because of its potential applications in banks, post offices and defence organizations. Other applications include reading aids for the blind, library automation, language processing and multimedia design.
Commercial OCR packages are already available for languages like English. Considerable work has also been done for languages like Japanese and Chinese [7]. Recently, work has been done on the development of OCR systems for Indian languages. This includes work on the recognition of Devanagari, Bengali, Kannada and Tamil characters.
The Indian subcontinent has more than 18 constitutionally recognized languages with several scripts, but commercial Optical Character Recognition (OCR) products are very few. Telugu is one of the oldest and most popular languages of India. Historically, the Telugu script has evolved from the ancient Brahmi script. It also used features of the Dravidian (Pali) language for script generation. In the process of evolution, this script was carved with needles on palm leaves, and so it favored rounded letter shapes. Work on Telugu character recognition is not substantial.
Motivation. In spite of Telugu being the third most used language in India, there are only a few OCR systems for Telugu script. This motivated us to approach the problem. A further motivation for developing a Telugu OCR is the digitization of thousands of printed books in Indian languages by both the private and the public sector. For efficient access to these scanned documents, an OCR specific to printed Telugu text is urgently needed.
Scope of the report. The first section of the report explains the structure of Telugu characters and the associated segmentation issues. The second section explains the algorithm used for noise removal and binarization. The next section explains an efficient algorithm that segments the given scanned document into lines and words. The last section explains the concept of Support Vector Machines (SVM), the method of feature extraction for Telugu letters and their classification using SVM.
Most document analysis systems can be visualized as consisting of two steps: the pre-processor and the recognizer. In pre-processing, the raw image obtained by scanning a page of text is converted to a form acceptable to the recognizer by extracting individually recognizable characters. The pre-processed image of each character is then processed to obtain meaningful elements, called features. Recognition is completed by searching a database of stored feature vectors of all possible Telugu characters for the one that matches the feature vector of the character to be recognized.
In Indian scripts, one or more vowel and consonant modifiers are attached to the consonant forms in a variety of combinations, forming compound characters. The total number of possible compound characters is of the order of hundreds of thousands. Therefore, the question "What constitutes a character?" assumes many new dimensions for Indian languages. Is a modifier an independent character or not? Does being treated as an independent character depend on the way it is written, i.e. whether it is written touching the character it is to modify or separated from it? A more detailed discussion of these issues for Telugu script is provided in Sect. 2.
In this project, an approach has been presented for Telugu.
-
Chapter 2
Structure of Telugu text and
Segmentation issues[5]
2.1 Characteristics of Telugu script
Telugu is a syllabic language: each symbol in the script represents a complete syllable, which leaves little scope for confusion and spelling problems. In that sense, it is a WYSIWYG (what you see is what you get) script. This form of script is considered to be the most scientific by linguists. The Telugu script consists of 18 vowels, 36 consonants and two dual symbols. Of the vowels, sixteen are in common usage. Fig 2.1 lists some of the vowels in the Harshapriya and Godavari fonts.
[5]
Figure 2.1: Harshapriya and Godavari fonts
All vowels and consonants, along with their modifiers and phonetic equivalent symbols, are listed in Fig 2.2 and Fig 2.3, respectively. Compound characters in Telugu follow some phonetic sequences that can be represented in grammatical form, as shown in Fig 2.4. Base consonants are vowel-suppressed consonants. These are typically used when words of other languages are written in Telugu. The third combination, i.e. that of a base consonant and a vowel, is an extremely important and often used combination in Telugu script. As there are 38 (36 + 2 dual symbols) base consonants and 16 vowels, logically, 608 (38 × 16 = 608) combinations are possible.
[5]
Figure 2.2: Vowels, their associated modifiers (Matras) and their phonetic English representation
The fourth to the seventh combinations are categorized under conjunct formation. Telugu has the special feature of providing a unique dependent-form symbol for each of the consonants. In all conjunct formations, the first consonant appears in its actual form. The dependent vowel sign and the second (third) consonant act as dependent consonants in the formation of the complete character. The fourth to seventh combinations generate a large number of conjuncts in Telugu script. The fourth combination logically generates 23,104 (38 × 38 × 16) different compound characters. This is an important combination. The fifth combination is similar to the fourth combination. The second and the third consonants act as the dependent consonants. Logically, 746,496 different compound characters are possible in this combination, but their frequency of appearance in text is lower than that of the previous combination. In the sixth and seventh combinations, 1,296 and 46,656 combinations, respectively, are logically possible.
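The totals quoted above can be checked with a few lines of arithmetic. The sketch below is only a consistency check: the text gives the factors explicitly for the third and fourth combinations, while the factors used for the fifth, sixth and seventh combinations (36 consonants rather than 38 base consonants) are an assumption inferred from the quoted totals.

```python
# Consistency check of the combination counts quoted in the text.
BASE_CONSONANTS = 38   # 36 consonants + 2 dual symbols (from the text)
CONSONANTS = 36        # assumed factor for combinations five to seven
VOWELS = 16            # vowels in common usage (from the text)

counts = {
    "third": BASE_CONSONANTS * VOWELS,                     # consonant + vowel
    "fourth": BASE_CONSONANTS * BASE_CONSONANTS * VOWELS,  # two consonants + vowel
    "fifth": CONSONANTS ** 3 * VOWELS,                     # three consonants + vowel
    "sixth": CONSONANTS ** 2,                              # two consonants, no vowel
    "seventh": CONSONANTS ** 3,                            # three consonants, no vowel
}

for name, value in counts.items():
    print(f"{name} combination: {value:,}")
```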
The sixth and seventh combinations are used when words from
other languages are written in
Telugu script. In these combinations, the vowel is omitted. The
first consonant appears as a
base consonant and the other consonants act as dependent
consonants.
[5]
Figure 2.3: Consonants, their associated modifiers (Matras) and their phonetic English representation
[5]
Figure 2.4: Various combinations forming compound characters
2.2 Segmentation issues in OCR of Telugu script
A connected region in an image of Telugu text may be:
1. A part of a character or a compound character
2. A character
3. A compound character
This complicates the segmentation issues. The areas occupied by
individual characters in a
line of text are not in a horizontal line, unlike in English
text, and in some cases, the area of
a single complex character formation can be equal to the sum of
the areas of two individual
characters. The segmentation algorithm has to take these factors
into consideration. The basic
question to be answered in segmentation is: What are the symbols
that will be isolated during
segmentation and provided to the recognizer for completing the
OCR?
The first approach is to treat all types of conjuncts, together with the base consonants, as units for the purpose of segmentation and further recognition. This is not preferable for a number of reasons. The first reason is that the sheer number of possibilities has been shown to be enormous. The second reason is that, in compound characters like KRAI, we have to identify all three parts, i.e. including those below and on the left, as belonging to the same compound character, although they are not connected in the image. This is, in general, difficult because the association information is hard to generate until the recognition process is at least partially completed, and the reason we are segmenting is to perform this recognition process. This is the catch-22 situation referred to earlier and, therefore, treating all types of conjuncts together is not possible. The second alternative is to attempt to isolate the base consonants, vowel modifiers, etc. This is difficult and leads to unmanageable complications at the segmentation stage, where the symbols are yet to be recognized. This is primarily because the symbols are full of curves and their separation is not clear. However, this is a popular approach for Indian scripts like Devanagari and Bangla [3].
-
Chapter 3
Preprocessing phase
3.1 Thresholding and noise removal
The task of thresholding is to extract the foreground from the background. Generally, an OCR system expects text printed against a clean background. Usually a simple global binarization technique is adopted, which does not handle well text that is printed against shaded or textured backgrounds and/or embedded in images.
In this project, a simple yet effective algorithm is used for document image binarization and cleanup. It is especially robust for extracting text from images.
There are basically two classes of binarization techniques: global and adaptive. Global methods binarize the entire image using a single threshold. For example, a typical OCR system separates text from background by global thresholding [12, 8]. A simple way to select a global threshold automatically is to use the value at the valley of the intensity histogram of the image, assuming that there are two peaks in the histogram, one corresponding to the foreground and the other to the background. Methods have also been proposed to facilitate more robust valley picking.
There are problems with the global thresholding paradigm. First, due to noise and poor contrast, many documents do not have well-differentiated foreground and background intensities. Second, the bimodal histogram assumption is not always valid in the case of complicated documents such as photographs and advertisements. Third, the foreground peak is often overshadowed by other peaks, which makes valley detection difficult or impossible. Some research has been carried out to overcome these problems. For example, weighted histograms [1] are used to balance the size difference between the foreground and background, and/or to convert valley finding into maximum peak detection. Minimum-error thresholding models the foreground and background intensity distributions as Gaussian distributions, and the threshold is selected to minimize the classification error. Otsu [9] models the intensity histogram as a probability distribution, and the threshold is chosen to maximize the separability of the resultant background and foreground classes. Similarly, entropy measures [] have been used to select the threshold which maximizes the sum of the background and foreground entropies.
In contrast, adaptive algorithms compute a threshold for each
pixel based on information
extracted from its neighborhood. For images in which the
intensity ranges of foreground objects
and backgrounds entangle, different thresholds must be used for
different regions.
3.1.1 The Algorithm
The algorithm proposed by Wu and Manmatha [11] works under the assumption that the text in the input image, or in a region of the input image, has more or less the same intensity value. The distinctive feature of this algorithm is that it works well even if the text is printed against a shaded or hatched background.
The following are the steps in the algorithm:
1. smooth the input text chip.
2. compute the intensity histogram of the smoothed chip.
3. smooth histogram using a low-pass filter.
4. pick a threshold at the first valley counted from the left
side of the histogram.
5. binarize the smoothed text chip using the threshold.
A low-pass Gaussian filter is used to smooth the text chip in step 1. The smoothing operation affects the background more than the text because text is normally of lower frequency than the shading. Thus it cleans up the background.
The histogram generated by step 2 is often jagged, hence it
needs to be smoothed to allow
the valley to be detected. Again a Gaussian filter is used for
this purpose.
Text is normally the darkest item in the detected chips.
Therefore, a threshold is picked
at the first valley closest to the darkest side of the
histogram. To extract text against darker
background, a threshold at the last valley is picked
instead.
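The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the project: the function names, the filter widths (`image_sigma`, `hist_sigma`) and the rule of taking the middle of a flat valley floor are all hypothetical choices, and the chip is assumed to be an 8-bit grayscale array with dark text.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_image(img, sigma=1.0):
    """Step 1: separable Gaussian low-pass filter with edge padding."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def first_valley(hist):
    """Step 4: first valley counted from the dark (left) side."""
    i, n = 0, len(hist)
    while i + 1 < n and hist[i + 1] >= hist[i]:   # climb to the first peak
        i += 1
    while i + 1 < n and hist[i + 1] < hist[i]:    # descend into the valley
        i += 1
    lo = i
    while i + 1 < n and hist[i + 1] == hist[i]:   # walk along a flat floor
        i += 1
    return (lo + i) // 2                          # assumed: middle of the floor

def binarize_chip(chip, image_sigma=1.0, hist_sigma=2.0):
    smoothed = smooth_image(chip, image_sigma)                       # step 1
    levels = np.clip(np.rint(smoothed), 0, 255).astype(np.uint8)
    hist = np.bincount(levels.ravel(), minlength=256).astype(float)  # step 2
    hist = np.convolve(hist, gaussian_kernel(hist_sigma), mode="same")  # step 3
    threshold = first_valley(hist)                                   # step 4
    return levels < threshold, threshold                             # step 5
```

For text on a darker background, the same sketch would scan the histogram from the right instead, in line with the last-valley rule described above.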
3.2 Skew detection and correction
Skew estimation refers to the process of finding the angle of inclination of the document with respect to the horizontal axis, which is often introduced during document scanning. For any ensuing document image processing task (such as page layout analysis, OCR, document retrieval, etc.) to yield accurate results, the skew angle must be detected and corrected beforehand. The algorithms for skew estimation can mainly be classified as those based on (i) projection profile (PP), (ii) nearest neighbor (NN), (iii) Hough transform (HT) and (iv) cross-correlation. We used a variation of the Hough transform method [4] to detect skew in our project.
3.2.1 Skew angle Detection
The skew angle detection process used in this project can be divided into three steps:
1. detection point determination
2. coarse skew angle estimation
3. Hough transformation.
First, the skewed image is separated vertically into several blocks, each consisting of one hundred rows. Then the locations of the detection points in each block are recorded to estimate the coarse skew angle $\theta_e$. The coarse skew angle is estimated by selecting the angle supported by the most detection points. Finally, the accurate skew angle is determined by choosing the peak in the Hough plane within the small range $[\theta_e - 3^\circ, \theta_e + 3^\circ]$. A detailed description of the three steps follows.
Step 1. Detection point (DP) determination. First of all, the input image is divided vertically into several blocks. According to our empirical study, 100 rows are chosen as the size of each block. A detection point is defined as the left-most black pixel in each block. Each block is scanned from left to right and then from top to bottom to find the detection point. The first scanned pixel that is not a background pixel is declared the detection point. Following this procedure, we can find all detection points in the input image. These detection points are then fed into Step 2 for the estimation of the coarse skew angle.
Step 2. Coarse skew angle estimation. In this step, the coarse skew angle $\theta_e$ is determined by selecting the majority of the local skew angles generated from the detection points. Before the majority selection procedure, the local skew angle $\theta_i$ has to be calculated. Consider two detection points $DP_{i-1}(x_{i-1}, y_{i-1})$ and $DP_i(x_i, y_i)$ in two consecutive blocks $B_{i-1}$ and $B_i$. The local skew angle $\theta_i$ is defined as
$$\theta_i = \tan^{-1}\left(\frac{\Delta y_i}{\Delta x_i}\right) = \tan^{-1}\left(\frac{y_i - y_{i-1}}{x_i - x_{i-1}}\right) \tag{3.1}$$
Here, the value $\Delta y_i / \Delta x_i$ is adopted to represent the local skew angle $\theta_i$ in order to avoid the computational burden of the $\tan^{-1}$ function. The coarse skew angle $\theta_e$ is then assigned as the majority of the local skew angles.
Step 3. Hough transformation. Following the previous two steps, the search range of the skew angle in the Hough plane is reduced from $[-90^\circ, 90^\circ]$ to $[\theta_e - 3^\circ, \theta_e + 3^\circ]$. Last, the left-most pixel $P_i(x_i, y_i)$ in each row of the x-y plane is transformed to the Hough plane by making use of the following equation:
$$\rho_i = x_i \cos\theta + y_i \sin\theta \tag{3.2}$$
where $\theta$ lies in the range $[\theta_e - 3^\circ, \theta_e + 3^\circ]$. The skew angle of the input document can thereby be determined by selecting the angle with the largest value in the transformed Hough plane.
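The restricted Hough search can be sketched as follows. This is an illustrative sketch only: the coarse angle is taken as an input rather than computed from detection points, and the function names, the 0.1 degree angular step and the one-pixel quantization of $\rho$ are assumptions that do not come from the report.

```python
import numpy as np

def leftmost_black(binary):
    """Return (xs, ys): the left-most foreground pixel of every non-empty row."""
    ys = np.flatnonzero(binary.any(axis=1))
    xs = binary[ys].argmax(axis=1)      # index of the first True in each row
    return xs, ys

def hough_skew(binary, coarse_deg, span_deg=3.0, step_deg=0.1):
    """Search theta in [coarse - span, coarse + span] for the strongest Hough peak."""
    xs, ys = leftmost_black(binary)
    best_angle, best_votes = float(coarse_deg), -1
    angles = np.arange(coarse_deg - span_deg,
                       coarse_deg + span_deg + step_deg / 2, step_deg)
    for theta in angles:
        t = np.deg2rad(theta)
        # Equation (3.2), quantized to whole pixels (assumed bin width)
        rho = np.rint(xs * np.cos(t) + ys * np.sin(t)).astype(int)
        votes = np.bincount(rho - rho.min()).max()   # tallest accumulator cell
        if votes > best_votes:
            best_angle, best_votes = float(theta), votes
    return best_angle
```

Because only 61 angles are examined for a span of 3 degrees, the accumulator is far smaller than one covering the full range from -90 to 90 degrees, which is the saving the coarse estimate is meant to provide.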
3.2.2 Image rotation transformation
In this section, a skewed image is corrected to generate a non-skewed image by rotating it through the skew angle $\theta$ obtained in Section 3.2.1. The rotation transformation is a mapping function $f(x, y)$ which maps the coordinates of pixels in the original image to those in the output image. However, some pixel values in the output image cannot be defined via the mapping function $f$, because the range and domain used in image processing are integer. In the program implementation, we therefore devise an inverse function $f^{-1}$ to define all output pixel values from the original image. Each pixel value in the output image is thereby determined from the value in the original image via the inverse function $f^{-1}$.
Geometrically, the value of pixel $P'(x', y')$ in the output image is determined from that of the corresponding pixel $P(x, y)$ in the original image. The location of pixel $P$ is obtained from the location of pixel $P'$ via the following function $f^{-1}$:
$$(x, y) = (x', y')\begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix} = (x'\cos\theta + y'\sin\theta,\ -x'\sin\theta + y'\cos\theta) \tag{3.3}$$
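A short sketch of rotation by inverse mapping is given below. It applies equation (3.3) to every output pixel and reads the value from the original image. Rotating about the image centre, nearest-neighbour rounding and the `fill` value for pixels that map outside the original are assumptions made for the illustration; the report does not specify them.

```python
import numpy as np

def rotate_inverse(img, theta, fill=0):
    """Rotate `img` by `theta` radians using the inverse map of equation (3.3)."""
    h, w = img.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0          # assumed centre of rotation
    yp, xp = np.mgrid[0:h, 0:w]                     # output coordinates (x', y')
    c, s = np.cos(theta), np.sin(theta)
    # (x, y) = (x' cos + y' sin, -x' sin + y' cos), taken about the centre
    x = c * (xp - cx) + s * (yp - cy) + cx
    y = -s * (xp - cx) + c * (yp - cy) + cy
    xi = np.rint(x).astype(int)                     # nearest source pixel
    yi = np.rint(y).astype(int)
    inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.full_like(img, fill)
    out[inside] = img[yi[inside], xi[inside]]       # every output pixel is defined
    return out
```

Since the loop runs over output pixels, no holes appear in the result, which is the reason given above for preferring $f^{-1}$ over the forward map $f$.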
3.3 Connected Components
The connected components are computed for the whole document using a recursive labeling algorithm. The algorithm first negates the whole image: each black pixel is replaced by -1 and each white pixel by 0. Each pixel in this image is then checked to see whether it is a black (text) pixel. If it is, a search function is called which takes the text pixel and its coordinates and determines its neighbors. This function recursively searches for the black pixels that are part of the same component and labels them. The scan then continues until it reaches a new component.
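The labeling procedure can be sketched as below. Two choices here are assumptions rather than details from the report: the recursion is replaced by an explicit stack (to avoid Python's recursion limit on large components), and 8-connectivity is used for the neighborhood.

```python
import numpy as np

def label_components(binary):
    """Label connected text pixels; returns (label image, number of components)."""
    # Negated image as described: text pixels -> -1, background -> 0
    labels = np.where(binary, -1, 0).astype(int)
    h, w = labels.shape
    current = 0
    for y in range(h):
        for x in range(w):
            if labels[y, x] != -1:
                continue                      # background or already labeled
            current += 1                      # a new component starts here
            labels[y, x] = current
            stack = [(y, x)]
            while stack:                      # explicit stack instead of recursion
                cy, cx = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):     # assumed 8-connected neighborhood
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == -1:
                            labels[ny, nx] = current
                            stack.append((ny, nx))
    return labels, current
```

The bounding box of component `k` can then be read off from `np.argwhere(labels == k)` by taking the minimum and maximum row and column.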
3.4 Line Segmentation
There are several steps in the line segmentation method proposed
by Priyanka and Srikanth[10]
that are systematically described below.
Step 1: Run length smearing. A smoothing algorithm is applied to the text of a document page. In this step we use the run length smearing technique [12] to increase the strength of the histogram. We consider each consecutive run of white pixels between two black pixels and compute its length. If the length of the white run is less than five times the stroke width, the white run is filled with black. Figure 3.1 shows two original text lines, and Figure 3.2 shows the smoothed text lines with the corresponding horizontal histogram.
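The smearing rule can be sketched as follows. The stroke width is passed in as a parameter because the report does not say how it is estimated; the function name and the `factor` argument are likewise hypothetical.

```python
import numpy as np

def smear_rows(binary, stroke_width, factor=5):
    """Fill white runs shorter than factor * stroke_width that lie between black pixels."""
    out = binary.copy()
    limit = factor * stroke_width
    for row in out:                            # each row is a view into `out`
        black = np.flatnonzero(row)            # black pixel positions before filling
        for a, b in zip(black[:-1], black[1:]):
            gap = b - a - 1                    # length of the white run between a and b
            if 0 < gap < limit:
                row[a + 1:b] = True            # turn the white run black
    return out
```

White runs at the left and right margins are left untouched, since they are not enclosed by two black pixels.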
[10]
Figure 3.1: Original Text lines
[10]
Figure 3.2: Smoothed Text lines with Histogram
Step 2: Recursive procedure to get middle lines for segmentation. From the horizontal histogram (projection profile) of the smoothed document page, we take the highest peak. We find the middle point of the length of this peak and draw a vertical line from top to bottom through it, as shown in Fig 3.3.
[10]
Figure 3.3: Highest peak and vertical line drawn at the middle of highest peak
The next step is to find the middle line of every peak of the histogram. Along the vertical line (the line that passes through the middle point of the highest peak), we find the middle point of the width of each peak that it crosses and draw a horizontal line through that point. In some cases not all peaks of the histogram cross this vertical line. For these cases we compute the distances between consecutive middle lines and their average value. If the distance between two middle lines is greater than twice the average value, we assume that the region between them contains one or more text lines and needs recursive segmentation. For such a region (the region between two middle lines of peaks), we apply the same procedure: we find the vertical line through the middle of its highest peak and then the middle lines of that region. This procedure runs recursively until the middle lines of the whole image have been found, as shown in Fig 3.4.
[10]
Figure 3.4: Middle line detection for considering small length text
Step 3: Finding candidate lines. In this step, starting from the first histogram peak, we vertically scan the region between the first and second middle lines of the histogram until we get the first two white pixels. We consider these two white pixels as minimum points: the line on which the first white pixel is found is the first minimum, and the line on which the second white pixel is found is the second minimum. We then calculate the vertical distances from the first middle line to the first minimum point and from the first middle line to the second minimum point, and take the larger of the two. The minimum point with the larger vertical distance is taken as the separator between the two consecutive middle lines. In this way we find all line separators between consecutive middle lines, as shown in Fig 3.5. If we simply took the point with the fewest black pixels in the histogram as the separator line, we would get many errors.
[10]
Figure 3.5: (a).Initial segmentation line through the white
pixels of horizontal histogram (b).
Result after considering only the candidate lines from original
histogram.
3.5 Word Segmentation
The word segmentation method takes a text line as input. After a text line is segmented, it is scanned vertically, column by column. If two or fewer black pixels are encountered in a vertical scan, the scan is denoted by 0; otherwise it is denoted by the number of black pixels. In this way a vertical projection profile is constructed. If the profile contains a run of at least k1 consecutive 0s, the midpoint of that run is considered to be the boundary of a word. The value of k1 is taken as 1/3 of the text line height. Word segmentation results for a Telugu text line are shown in Fig 3.6.
[10]
Figure 3.6: Output for word segmentation
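The word segmentation rule of Section 3.5 can be sketched as follows. The function name is hypothetical, and ignoring zero runs that touch the left or right edge of the line (the page margins) is an added assumption; the rest follows the description in the text.

```python
import numpy as np

def word_boundaries(line, k1=None):
    """Return the column indices that separate words in a binary text-line image."""
    h, w = line.shape
    if k1 is None:
        k1 = max(1, h // 3)                  # k1 = 1/3 of the text line height
    profile = line.sum(axis=0)               # black pixels per vertical scan
    profile[profile <= 2] = 0                # two or fewer black pixels count as 0
    boundaries = []
    x = 0
    while x < w:
        if profile[x] != 0:
            x += 1
            continue
        start = x
        while x < w and profile[x] == 0:     # extend the run of zeros
            x += 1
        touches_margin = start == 0 or x == w
        if x - start >= k1 and not touches_margin:
            boundaries.append((start + x - 1) // 2)   # midpoint of the run
    return boundaries
```

Slicing the line image at the returned columns yields one sub-image per word.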
3.6 Feature Extraction
Feature Extraction [5]: The output of the normalization phase is a normalized image of size N × N. Real-valued directional features [] are calculated for each normalized image. These are based on the percentage of pixels in each direction range within each partition. An adaptive gradient magnitude threshold $r_t$ is computed over the whole character image gradient map. This threshold is needed to filter out spurious responses to the Sobel operator used to find the gradients. The threshold value $r_t$ is computed as
$$r_t = \frac{\sum_{i}\sum_{j} r(i, j)}{D_1 D_2}$$
where $r(i, j)$ is the gradient magnitude at pixel $(i, j)$ and $D_1 \times D_2$ is the size of the gradient map. Thresholding is performed to nullify the pixels whose gradient magnitude is below the computed threshold.
The feature vector is extracted based on the direction of the gradient at each pixel. We divided the whole character image into M × N partitions. In our project we selected M = N = 8. The directions of the gradient are quantized into K values, so each pixel has a gradient direction value from 1 to K. For each partition, the percentage of pixels with direction quantized to k is calculated for every k, so each partition gives K values. We thus have an M·N·K-dimensional feature vector for each character image. We chose K = 12. In our project we have a 192-dimensional feature vector for each normalized character image.
The steps to extract the feature vector are as follows. For each connected component:
1. Obtain the bounding box of the connected component, eliminating the blank surrounding space.
2. Calculate the gradient magnitude and direction at each pixel.
3. Calculate the adaptive threshold of the gradient magnitude and perform thresholding to obtain the new gradient direction at each pixel.
4. Partition the adaptive gradient direction map and extract the complete feature vector.
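Steps 2 to 4 can be sketched with NumPy alone. Several details are assumptions made for the illustration: the adaptive threshold is taken to be the mean gradient magnitude, directions are quantized into K equal angular sectors, and "percentage" is computed as the fraction of a partition's pixels. M, N and K are left as parameters; note that with K = 12 a 192-dimensional vector corresponds to a 4 × 4 grid, whereas an 8 × 8 grid gives 768 values.

```python
import numpy as np

def directional_features(img, M=8, N=8, K=12):
    """Return an M*N*K vector of per-partition gradient direction fractions."""
    img = img.astype(float)
    p = np.pad(img, 1, mode="edge")
    # Sobel responses via array slicing (horizontal and vertical derivatives)
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    mag = np.hypot(gx, gy)
    threshold = mag.mean()                    # assumed form of the adaptive threshold
    keep = (mag >= threshold) & (mag > 0)     # pixels below the threshold are nullified
    angle = np.arctan2(gy, gx)                # direction in (-pi, pi]
    bins = np.floor((angle + np.pi) / (2 * np.pi) * K).astype(int) % K
    h, w = img.shape
    row_edges = np.linspace(0, h, M + 1).astype(int)
    col_edges = np.linspace(0, w, N + 1).astype(int)
    feats = np.zeros((M, N, K))
    for i in range(M):
        for j in range(N):
            r0, r1 = row_edges[i], row_edges[i + 1]
            c0, c1 = col_edges[j], col_edges[j + 1]
            cell = bins[r0:r1, c0:c1][keep[r0:r1, c0:c1]]
            area = max(1, (r1 - r0) * (c1 - c0))
            feats[i, j] = np.bincount(cell, minlength=K) / area
    return feats.ravel()
```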
3.7 Pattern classification [2]
The feature vector extracted from the normalized image has to be assigned a label using a pattern classifier [2]. There are many methods for designing pattern classifiers, such as the Bayes classifier based on density estimation, neural networks, linear discriminant functions, nearest neighbor classification based on prototypes, etc. In this system we have used the Support Vector Machine (SVM) classifier. SVMs represent a new pattern classification method which grew out of some of the recent work in statistical learning theory. The solution offered by the SVM methodology for the two-class pattern recognition problem is theoretically elegant and computationally efficient, and is often found to give better performance by way of improved generalization. In the next subsection we provide a brief overview of SVMs.
3.7.1 SVM Classifier:[2]
The SVM classifier is a two-class classifier based on the use of discriminant functions. A discriminant function represents a surface which separates the patterns so that the patterns from the two classes lie on opposite sides of the surface. The SVM is essentially a separating surface which is optimal according to a criterion explained below.
Consider a two-class problem where the class labels are denoted by +1 and -1. Given a set of labeled (training) patterns $\{(x_i, y_i)\}$, $y_i \in \{-1, +1\}$, the hyperplane represented by $(w, b)$ separates the two classes if
$$w^t x_i + b > 0 \ \text{for } i : y_i = +1; \qquad w^t x_i + b < 0 \ \text{for } i : y_i = -1. \tag{3.4}$$
Here, $w^t x_i$ denotes the inner product between the two vectors, and $g(x) = w^t x + b$ is the linear discriminant function.
In general, the training set may not be linearly separable. In such a case one can employ the generalized linear discriminant function defined by
$g(x) = w^t \phi(x) + b$ where $\phi$ :
-
Let $z_i = \phi(x_i)$. Thus we now have a training sample $(z_i, y_i)$ to learn a separating hyperplane
in
-
Now it is clear that $\alpha_i = 0$ if $i \notin S$. Hence we can rewrite (3.9) as
$$w = \sum_{i \in S} \alpha_i y_i z_i \tag{3.11}$$
The patterns $\{z_i : \alpha_i > 0\}$ are called the support vectors. From (3.11), it is clear that $w$ is a linear combination of the support vectors, hence the name SVM for the classifier. The support vectors are those patterns which are closest to the hyperplane and are sufficient to completely define the optimal hyperplane. Hence these patterns can be considered to be the most important training examples.
To learn the SVM, all we need are the optimal Lagrange multipliers corresponding to the problem given by (3.7) and (3.8). This can be done efficiently by solving its dual, which is the optimization problem given by: find $\alpha_i$, $i = 1, \dots, l$, to
$$\text{Maximize: } \sum_{i} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j z_i^t z_j$$
$$\text{Subject to: } \alpha_i \ge 0,\ i = 1, 2, \dots, l, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0. \tag{3.12}$$
By solving this problem we obtain the optimal $\alpha_i$, and using these we get $w$ and $b$. It may be noted that the dual given by (3.12) is a quadratic optimization problem of dimension $l$ (recall that $l$ is the number of training patterns) with one equality constraint and nonnegativity constraints on the variables. This is so irrespective of how complicated the function $\phi$ is. Once the SVM is obtained, the classification of any new feature vector $x$ is based on the sign of (recall that $z = \phi(x)$)
$$f(x) = \phi(x)^t w + b = \sum_{i \in S} \alpha_i y_i \phi(x_i)^t \phi(x) + b \tag{3.13}$$
where we have used (3.11). Thus, both while solving the optimization problem (given by (3.12)) and while classifying a new pattern, the only way the training pattern vectors $x_i$ come into the picture is as inner products $\phi(x_i)^t \phi(x_j)$. This is also the only way $\phi$ enters into the picture.
Suppose we have a function, K :
Table 3.1: Some popular kernels for SVMs.

| Type of kernel | $K(x_i, x_j)$ | Comments |
|---|---|---|
| Polynomial kernel | $(x_i^t x_j + 1)^p$ | Power $p$ is specified a priori by the user |
| Gaussian kernel | $\exp\left(-\frac{1}{2\sigma^2}\lVert x_i - x_j \rVert^2\right)$ | The width $\sigma^2$, common to all the kernels, is specified a priori |
| Perceptron kernel | $\tanh(\beta_0 x_i^t x_j + \beta_1)$ | Mercer's condition satisfied only for certain values of $\beta_0$ and $\beta_1$ |
Given any symmetric function K :
-
this, we can change the optimization problem to
Minimize :12||w||2 + C
li=1
i, (3.14)
Subject to : 1 yi(ztiw+ b) i 0 i = 1, ...., l.i 0, i = 1, ....,
l (3.15)
Here the $\xi_i$ can be thought of as penalties for violating the
separability constraints. These are now
also variables over which the optimization is performed. The
constant $C$ is a user-specified
parameter of the algorithm, and as $C \to \infty$ we recover the old
problem. It turns out that the dual of this problem is the same as
(3.12) except that the nonnegativity constraint on $\alpha_i$ is
replaced by $0 \le \alpha_i \le C$. The optimal values of the new
variables $\xi_i$ are irrelevant to the final SVM solution.
To sum up, the SVM method for learning two-class classifiers is
as follows. We choose a
kernel function and some value for the constant $C$ in (3.14).
Then we solve its dual, which is the
same as (3.12) except that the variables $\alpha_i$ also have an
upper bound, namely $C$. (It may be
noted that here we use $K(x_i, x_j)$ in place of $z_i^t z_j$ in
(3.12).) Once we solve this problem, all we
need to store are the non-zero $\alpha_i^*$ and the corresponding
$x_i$ (which are the support vectors).
Using these, given any new feature vector $x$, we can calculate
the output of the SVM, namely $f(x)$,
through (3.13). The classification of $x$ is $+1$ if the
output of the SVM is positive; otherwise
it is $-1$.
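The evaluation step summarized above can be sketched in a few lines of Java. The sketch assumes a Gaussian kernel and stores the products $\alpha_i y_i$ directly; the class name `SvmDecision` and its fields are hypothetical and are not part of LibSVM or of the project code.

```java
/** Sketch: evaluates f(x) = sum_i alpha_i y_i K(x_i, x) + b over stored support vectors. */
final class SvmDecision {
    private final double[][] supportVectors; // the x_i with non-zero alpha_i
    private final double[] alphaTimesY;      // alpha_i * y_i for each support vector
    private final double bias;               // b
    private final double sigma;              // Gaussian kernel width (assumed kernel choice)

    SvmDecision(double[][] supportVectors, double[] alphaTimesY, double bias, double sigma) {
        if (supportVectors.length != alphaTimesY.length) {
            throw new IllegalArgumentException("one coefficient per support vector expected");
        }
        this.supportVectors = supportVectors;
        this.alphaTimesY = alphaTimesY;
        this.bias = bias;
        this.sigma = sigma;
    }

    /** Gaussian kernel exp(-||u - v||^2 / (2 sigma^2)). */
    private double kernel(double[] u, double[] v) {
        double d2 = 0.0;
        for (int k = 0; k < u.length; k++) {
            double d = u[k] - v[k];
            d2 += d * d;
        }
        return Math.exp(-d2 / (2.0 * sigma * sigma));
    }

    /** Raw SVM output f(x). */
    double output(double[] x) {
        double f = bias;
        for (int i = 0; i < supportVectors.length; i++) {
            f += alphaTimesY[i] * kernel(supportVectors[i], x);
        }
        return f;
    }

    /** +1 if the output is positive, otherwise -1. */
    int classify(double[] x) {
        return output(x) > 0.0 ? 1 : -1;
    }
}
```

Only the support vectors and their coefficients are needed at classification time, so the cost of one evaluation grows with the number of support vectors rather than with the size of the training set.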
SVM classifier for OCR. We have used SVM classifiers for labeling
each segment of a
word. As explained earlier, we have trained a number of
two-class classifiers (SVMs), each one
for distinguishing one class from all others. Thus each of our
class labels has an associated
SVM. A test example is assigned the label of the class whose SVM
gives the largest positive
output. If no SVM gives a positive output, the example is
rejected. The output of the SVM
gives a measure of the distance of the example from the
separating hyper-plane in the $\phi$-space.
Hence, the higher the (positive) output for a given
pattern, the higher the confidence in
classifying the pattern.
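The labeling rule described above amounts to an arg-max over the per-class outputs with a rejection option. A minimal sketch follows; the class name `OneVsRest`, the method `assignLabel` and the use of -1 as the rejection marker are assumptions made for illustration.

```java
/** Sketch of the one-against-all labeling rule with rejection. */
final class OneVsRest {
    /** Marker returned when no SVM gives a positive output (illustrative choice). */
    static final int REJECTED = -1;

    private OneVsRest() {}

    /**
     * outputs[k] is the output f_k(x) of the SVM trained for class k.
     * Returns the index of the largest strictly positive output, or REJECTED.
     */
    static int assignLabel(double[] outputs) {
        int best = REJECTED;
        double bestValue = 0.0; // only strictly positive outputs qualify
        for (int k = 0; k < outputs.length; k++) {
            if (outputs[k] > bestValue) {
                bestValue = outputs[k];
                best = k;
            }
        }
        return best;
    }
}
```

With outputs `{-0.3, 1.2, 0.4}` the example is assigned class 1; with outputs `{-1.0, -0.1}` it is rejected.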
-
Chapter 4
Implementation
Developing an OCR system for printed Telugu text consists of two
stages: pre-processing and recognition.
In the pre-processing phase, thresholding and noise
removal are implemented using the
algorithm specified in Section 3.1.1. Skew detection and removal
are implemented using a variant
of the Hough transform.
The OCR starts by taking the document image
as input. The image is then
converted into a grayscale image. The grayscale image is
converted to a binary image using the
method described in the section on thresholding (Section 3.1).
Connected components
in the whole document are
found, together with their bounding boxes, using a two-pass
algorithm.
These connected components are
then used to segment the whole document into lines. Line
segmentation
takes the array of connected
components as its parameter and returns the top and bottom row
numbers of each line with respect
to the image coordinate system.
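The two-pass labeling step mentioned above can be sketched as follows. The report does not state the connectivity or the data layout used, so this sketch assumes 8-connectivity, a rectangular `boolean[][]` image in which `true` marks a foreground (ink) pixel, and bounding boxes stored as `{top, left, bottom, right}`; the class name `ConnectedComponents` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of two-pass connected component labeling with bounding boxes. */
final class ConnectedComponents {
    private ConnectedComponents() {}

    /** Union-find root lookup with path halving. */
    private static int find(int[] parent, int a) {
        while (parent[a] != a) {
            parent[a] = parent[parent[a]];
            a = parent[a];
        }
        return a;
    }

    /** Records that labels a and b belong to the same component. */
    private static void union(int[] parent, int a, int b) {
        int ra = find(parent, a);
        int rb = find(parent, b);
        if (ra != rb) {
            parent[Math.max(ra, rb)] = Math.min(ra, rb);
        }
    }

    /** Returns one {top, left, bottom, right} box per 8-connected foreground component. */
    static List<int[]> boundingBoxes(boolean[][] fg) {
        int h = fg.length;
        int w = (h == 0) ? 0 : fg[0].length;
        int[][] label = new int[h][w];     // 0 means background
        int[] parent = new int[h * w + 1]; // worst case: one label per pixel
        int next = 1;

        // Pass 1: assign provisional labels from the already visited neighbours
        // (west, north-west, north, north-east) and record equivalences.
        for (int r = 0; r < h; r++) {
            for (int c = 0; c < w; c++) {
                if (!fg[r][c]) {
                    continue;
                }
                int[][] nbrs = {{r, c - 1}, {r - 1, c - 1}, {r - 1, c}, {r - 1, c + 1}};
                int assigned = 0;
                for (int[] n : nbrs) {
                    if (n[0] < 0 || n[1] < 0 || n[1] >= w) {
                        continue;
                    }
                    int lab = label[n[0]][n[1]];
                    if (lab == 0) {
                        continue;
                    }
                    if (assigned == 0) {
                        assigned = lab;
                    } else {
                        union(parent, assigned, lab);
                    }
                }
                if (assigned == 0) {
                    assigned = next++;
                    parent[assigned] = assigned;
                }
                label[r][c] = assigned;
            }
        }

        // Pass 2: resolve equivalences and grow the bounding box of each component.
        Map<Integer, int[]> boxes = new LinkedHashMap<>();
        for (int r = 0; r < h; r++) {
            for (int c = 0; c < w; c++) {
                if (label[r][c] == 0) {
                    continue;
                }
                int root = find(parent, label[r][c]);
                int[] b = boxes.get(root);
                if (b == null) {
                    boxes.put(root, new int[] {r, c, r, c});
                } else {
                    b[0] = Math.min(b[0], r);
                    b[1] = Math.min(b[1], c);
                    b[2] = Math.max(b[2], r);
                    b[3] = Math.max(b[3], c);
                }
            }
        }
        return new ArrayList<>(boxes.values());
    }
}
```

The list returned by `boundingBoxes` plays the role of the array of connected components that the later segmentation steps consume.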
Each text line is given as input to the word segmentation phase.
This function segments the
text line into words and returns the left and right column
numbers of each word. The connected
components which belong to each word are grouped.
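One simple way to realize such a word segmentation step is to sort the components of a line by their left edge and start a new word whenever the horizontal gap to the previous component reaches a threshold. The sketch below follows that idea; the threshold `minGap`, the `{left, right}` span layout and the class name `WordSegmenter` are assumptions for illustration, and the rule used in the actual tool may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch: groups the components of one text line into words by horizontal gaps. */
final class WordSegmenter {
    private WordSegmenter() {}

    /**
     * spans holds one {left, right} column pair per connected component of a line.
     * Components separated by fewer than minGap blank columns are put in the same word.
     * Returns one {left, right} column pair per word, ordered left to right.
     */
    static List<int[]> segment(int[][] spans, int minGap) {
        int[][] sorted = spans.clone(); // shallow copy; the rows themselves are not modified
        Arrays.sort(sorted, Comparator.comparingInt((int[] s) -> s[0]));

        List<int[]> words = new ArrayList<>();
        int[] current = null;
        for (int[] s : sorted) {
            // Number of blank columns between the current word and this component.
            if (current != null && s[0] - current[1] - 1 < minGap) {
                current[1] = Math.max(current[1], s[1]); // same word: extend the right edge
            } else {
                current = new int[] {s[0], s[1]};        // gap is wide enough: new word
                words.add(current);
            }
        }
        return words;
    }
}
```

A component that overlaps the current word horizontally, such as a modifier sitting above or below a consonant, yields a negative gap and is therefore absorbed into that word.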
Each component is normalized to a 48 × 48 image. This
image is given as input to the feature extraction function, which
takes just an image and returns a feature
vector of 192 dimensions computed using the Sobel operator and
the adaptive threshold gradient. The feature
vector is then given as input to the SVM classifier, which is
trained in the SVM training phase
described in later sections.
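To illustrate how a 48 × 48 image can yield a 192-dimensional gradient feature, the sketch below accumulates Sobel gradient magnitudes into direction histograms over a grid of cells. The layout used here (a 4 × 4 grid of 12 × 12 cells with 12 direction bins, giving 4 · 4 · 12 = 192 values) is an assumption chosen only because it reaches 192 dimensions; it is not necessarily the layout used in this project, and the adaptive thresholding of the gradient is not reproduced. The class name `GradientFeatures` is hypothetical.

```java
/** Sketch: Sobel-based direction histogram feature under an assumed cell/bin layout. */
final class GradientFeatures {
    static final int SIZE = 48;  // normalized image side
    static final int CELL = 12;  // assumed cell side, giving a 4 x 4 grid
    static final int BINS = 12;  // assumed number of direction bins per cell

    private GradientFeatures() {}

    /** img is a 48 x 48 array of intensities; returns a vector of length 192. */
    static double[] extract(double[][] img) {
        if (img.length != SIZE || img[0].length != SIZE) {
            throw new IllegalArgumentException("expected a 48 x 48 image");
        }
        int cellsPerSide = SIZE / CELL;
        double[] features = new double[cellsPerSide * cellsPerSide * BINS];

        // Border pixels are skipped because the 3 x 3 Sobel masks need all neighbours.
        for (int r = 1; r < SIZE - 1; r++) {
            for (int c = 1; c < SIZE - 1; c++) {
                double gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1])
                          - (img[r - 1][c - 1] + 2 * img[r][c - 1] + img[r + 1][c - 1]);
                double gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1])
                          - (img[r - 1][c - 1] + 2 * img[r - 1][c] + img[r - 1][c + 1]);
                double mag = Math.hypot(gx, gy);
                if (mag == 0.0) {
                    continue; // flat region: no direction to record
                }
                double angle = Math.atan2(gy, gx) + Math.PI; // shifted into [0, 2*pi]
                int bin = Math.min(BINS - 1, (int) (angle / (2 * Math.PI) * BINS));
                int cell = (r / CELL) * cellsPerSide + (c / CELL);
                features[cell * BINS + bin] += mag; // magnitude-weighted vote
            }
        }
        return features;
    }
}
```

The resulting vector has the same length as the feature vector described above and could be passed to a classifier in the same way, although its individual components would differ from those of the project's own feature.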
All these functions are implemented using the Java Advanced
Imaging (JAI) package from Oracle (Sun Microsystems)
in the NetBeans 6.8 IDE. LibSVM is the package used for
training and using the SVM
classifier.
-
Chapter 5
Results
Figure 5.1: Home page of the tool
Figure 5.2: Displaying the original image
Figure 5.3: Bounding Connected Components
Figure 5.4: Line Segmentation
Figure 5.5: Word Segmentation
-
Chapter 6
Conclusion and Future work
Conclusion. The main aim of this project is to develop an Optical
Character Recognition system for
printed Telugu text. Telugu script has a complex structure and
has thousands of combinations of
vowels, consonants and consonant modifiers. Hence, detecting and
recognizing basic symbols helps
in reducing the number of classes. This project develops a tool
that takes a document image
as input and displays the Unicode code point of each character.
These code points can
be further used to display the
corresponding Telugu text.
Future work. The recognition accuracy can be further increased
by post-processing that
makes use of the associations between basic symbols. For example,
it is known that some
modifiers occur very frequently with some characters while other
modifiers occur very infrequently.
The feature vector used here can also be applied to recognizing
handwritten Telugu script. The final
output of the proposed system can further be used for text-to-speech
conversion.
-
Bibliography
[1] Histogram modification for threshold selection. Systems, Man
and Cybernetics, IEEE Transactions on, 9(1):38-52, Jan. 1979.
[2] T. V. Ashwin and P. S. Sastry. A font and size-independent OCR
system for printed Kannada documents using support vector machines.
Sadhana, 27:35-58, 2002.
[3] B. B. Chaudhuri and U. Pal. A complete printed Bangla OCR
system. Pattern Recognition, 31(5):531-549, 1998.
[4] Huei-Fen Jiang, Chin-Chuan Han, and Kuo-Chin Fan. A fast
approach to the detection and correction of skew documents. Pattern
Recogn. Lett., 18(7):675-686, 1997.
[5] C. Vasantha Lakshmi and C. Patvardhan. An optical character
recognition system for printed Telugu text. Pattern Analysis and
Applications, 7:190-204, 2004. doi:10.1007/s10044-004-0217-2.
[6] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of
OCR research and development. Proceedings of the IEEE,
80(7):1029-1058, Jul. 1992.
[7] G. Nagy. Twenty years of document image analysis in PAMI.
Pattern Analysis and Machine Intelligence, IEEE Transactions on,
22(1):38-62, Jan. 2000.
[8] L. O'Gorman. Binarization and multithresholding of document
images using connectivity. CVGIP: Graphical Models and Image
Processing, 56(6):494-506, 1994.
[9] N. Otsu. A threshold selection method from grey-level
histograms. SMC, 9(1):62-66, Jan. 1979.
[10] Nallapareddy Priyanka, Srikanta Pal, and Ranju Manda.
Article: Line and word segmentation approach for printed documents.
IJCA, Special Issue on RTIPPR, (1):30-36, 2010. Published by
Foundation of Computer Science.
[11] Victor Wu and R. Manmatha. Document image clean-up and
binarization. In Proc. SPIE Symposium on Electronic Imaging,
pages 263-273, 1998.
[12] Hong Yan. Skew correction of document images using
interline cross-correlation. CVGIP: Graph. Models Image Process.,
55(6):538-543, 1993.