
Handwritten and Machine Printed Text Separation in Document Images using the Bag of Visual Words Paradigm

Jan 14, 2015


In a number of types of documents, ranging from forms to archive documents and books with annotations, machine printed and handwritten text may be present in the same document image, giving rise to significant issues within a digitisation and recognition pipeline. It is therefore necessary to separate the two types of text before applying different recognition methodologies to each. In this paper, a new approach is proposed which strives towards identifying and separating handwritten from machine printed text using the Bag of Visual Words (BoVW) paradigm. Initially, blocks of interest are detected in the document image. For each block, a descriptor is calculated based on the BoVW. The final characterization of the blocks as Handwritten, Machine Printed or Noise is made by a Support Vector Machine classifier. The promising performance of the proposed approach is shown by using a consistent evaluation methodology which couples meaningful measures along with a new dataset.
Transcript
Page 1: Handwritten and Machine Printed Text Separation in Document Images using the Bag of Visual Words Paradigm

Konstantinos Zagoris (1, 2), Ioannis Pratikakis (2), Apostolos Antonacopoulos (1), Basilis Gatos (3), Nikos Papamarkos (2)

(1) Pattern Recognition and Image Analysis (PRImA) Research Lab, School of Computing, Science and Engineering, University of Salford, Greater Manchester, UK

(2) Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece

(3) Institute of Informatics and Telecommunications, National Center for Scientific Research “Demokritos”, Athens, Greece

Page 2

Current state-of-the-art

Three (3) main approaches:

• Text Line Level

• Word Level

• Character Level

Page 3

Disadvantages

• Different Page Segmentation Algorithms

• Incompatible Feature Set

Page 4

Bag of Visual Words Model (BoVWM)

• Inspired by Information Retrieval theory

• The content of an image is described by a set of “visual words”

• A “visual word” is expressed by a group of local features

• The most well-known local feature is the Scale-Invariant Feature Transform (SIFT)

• Codebook Creation

• A codebook is defined by the set of the clusters

• A “visual word” is denoted as the vector which represents the center of each cluster

• The codebook is analogous to a dictionary

Page 5

Bag of Visual Words Model (BoVWM)

• Each visual entity is described by a BoVWM descriptor

• Each SIFT point belongs to a “visual word”

• That “visual word” is the one whose cluster center is closest to the SIFT point under a distance function (e.g. Euclidean, Manhattan)

• The descriptor reflects the frequency of each visual word that appears in the image
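The assignment and counting steps on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `nearest_word` and `bovw_descriptor` are hypothetical, 2-D toy vectors stand in for 128-D SIFT descriptors, Euclidean distance is used, and normalising the histogram to relative frequencies is an assumption.

```python
import math


def nearest_word(feature, codebook):
    """Index of the codebook centre closest to `feature` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(feature, codebook[k]))


def bovw_descriptor(features, codebook):
    """Relative frequency of each visual word among the given local features."""
    hist = [0.0] * len(codebook)
    for f in features:
        hist[nearest_word(f, codebook)] += 1.0
    total = sum(hist)
    # Normalisation is an assumption; raw counts would also be a valid descriptor.
    return [h / total for h in hist] if total else hist


# Toy data: three visual words and four local features.
codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
features = [(1.0, 0.5), (9.0, 9.5), (0.2, 0.1), (1.0, 9.0)]
descriptor = bovw_descriptor(features, codebook)
print(descriptor)
```

The descriptor has one entry per visual word, so its length equals the codebook size regardless of how many local features the image contains.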

Page 6

Proposed Method

Original Image → Page Segmentation → Block Descriptor Extraction (BoVW model) → Classification → Final Result

Page 7

Page Segmentation

Original Image → Locally Adaptive Binarisation Method [1] → Adaptive Run Length Smoothing Algorithm [2] → Final Result

1. B. Gatos, I. Pratikakis, and S. Perantonis. Adaptive degraded document image binarization. Pattern Recognition, 39(3):317–327, 2006.

2. N. Nikolaou, M. Makridis, B. Gatos, N. Stamatopoulos, and N. Papamarkos. Segmentation of historical machine-printed documents using adaptive run length smoothing and skeleton segmentation paths. Image and Vision Computing, 28(4):590–604, 2010.
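To give a feel for the smoothing step, here is a minimal sketch of the classic fixed-threshold horizontal run-length smoothing operation. It is not the adaptive variant of [2], which is not described on the slide; the function name `rlsa_horizontal`, the single `max_gap` threshold, and the 0/1 list-of-rows image representation are all assumptions made for illustration.

```python
def rlsa_horizontal(binary, max_gap):
    """Fill background runs of length <= max_gap that lie between foreground pixels.

    `binary` is a list of rows; 1 = foreground (ink), 0 = background.
    The input is left untouched; a smoothed copy is returned.
    """
    out = [row[:] for row in binary]
    for row in out:
        last_fg = None  # column of the most recent foreground pixel in this row
        for x, px in enumerate(row):
            if px == 1:
                gap = x - last_fg - 1 if last_fg is not None else 0
                if 0 < gap <= max_gap:
                    # Short gap: merge the two foreground runs.
                    for g in range(last_fg + 1, x):
                        row[g] = 1
                last_fg = x
    return out


row_image = [[1, 0, 0, 1, 0, 0, 0, 0, 1]]
smoothed = rlsa_horizontal(row_image, max_gap=2)
print(smoothed)
```

In the toy row, the two-pixel gap is filled and the four-pixel gap is kept, so nearby characters merge into one block while distant ones stay separate.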

Page 8

Block Descriptor Extraction

• This step involves the creation of the block descriptor by utilizing the BoVW model

• Codebook Properties

• It must be small enough to ensure a low computational cost

• It must be large enough to provide sufficiently high discrimination performance

• For the clustering stage the k-means algorithm is employed due to its simplicity and speed
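As a rough sketch of how a codebook could be built with k-means, the block below runs plain Lloyd iterations over a pool of local features and returns the cluster centres as visual words. The function name `kmeans_codebook`, the random initialisation, the fixed iteration count, and the 2-D toy features are assumptions; the slides state only that k-means is used, not its settings.

```python
import math
import random


def kmeans_codebook(features, k, iterations=20, seed=0):
    """Cluster feature vectors with Lloyd's k-means; the centres are the visual words."""
    rng = random.Random(seed)
    centres = [list(f) for f in rng.sample(features, k)]
    for _ in range(iterations):
        # Assignment step: group features by their nearest centre.
        clusters = [[] for _ in range(k)]
        for f in features:
            idx = min(range(k), key=lambda c: math.dist(f, centres[c]))
            clusters[idx].append(f)
        # Update step: move each centre to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres


# Toy data: two well-separated groups standing in for SIFT descriptors.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
codebook = kmeans_codebook(features, k=2)
print(sorted(codebook))
```

The parameter `k` is the codebook size, so it carries the trade-off named on the slide: a smaller `k` lowers the cost of building and matching descriptors, a larger `k` gives finer discrimination.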

Page 9

Block Descriptor Extraction

• The SIFT keypoints are calculated on the greyscale version of the block

• Keypoints whose position in the binary image does not coincide with a foreground pixel are rejected

• Each of the remaining local features is assigned a Visual Word from the Codebook

• A Visual Word Descriptor is formed based on the appearance of each Visual Word of the Codebook in this particular block

Figure panels: an example text block; initial SIFT keypoints; final SIFT keypoints
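The rejection step can be illustrated with a small filter over keypoint positions. This is a sketch under stated assumptions: the function name `keep_foreground_keypoints` is hypothetical, keypoints are plain (x, y) tuples rather than the output of any particular SIFT library, positions are rounded to the nearest pixel, and foreground is encoded as 1.

```python
def keep_foreground_keypoints(keypoints, binary):
    """Keep keypoints whose rounded position falls on a foreground pixel.

    keypoints: list of (x, y) positions; binary: list of rows, 1 = foreground.
    """
    height, width = len(binary), len(binary[0])
    kept = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width
        if inside and binary[row][col] == 1:
            kept.append((x, y))
    return kept


binary = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
initial = [(1.2, 1.0), (3.0, 0.0), (2.1, 1.9), (0.0, 2.0)]
final = keep_foreground_keypoints(initial, binary)
print(final)
```

Only the keypoints that land on ink survive, which corresponds to the difference between the initial and final keypoint panels.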

Page 10

Decision System

• A classifier decides whether the block contains handwritten text, machine printed text, or neither (noise)

• Based on Support Vector Machines (SVMs)

• Conventional approach: one against one, one against others

• Train two SVMs with the Radial Basis Function (RBF) kernel

• The first (SVM1) deals with the handwritten text problem against all the others

• The second (SVM2) deals with the machine printed text problem against all the others
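The sketch below shows how the outputs of two one-against-others RBF SVMs might be combined into a three-way decision. The kernel and the decision function are the standard SVM forms; the combination rule in `classify_block` is an assumption made for illustration, since the transcript does not include the authors' decision algorithm. All function names, the zero threshold, and the tie-breaking by larger decision value are hypothetical.

```python
import math


def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * math.dist(a, b) ** 2)


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias, with coef_i = alpha_i * y_i."""
    return sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs)
    ) + bias


def classify_block(d1, d2):
    """Assumed rule: d1 is the SVM1 (handwritten) output, d2 the SVM2 (machine printed) output."""
    if d1 <= 0 and d2 <= 0:
        return "Noise"  # neither classifier claims the block
    if d1 > 0 and d2 > 0:
        # Both claim the block: prefer the larger decision value (an assumption).
        return "Handwritten" if d1 >= d2 else "Machine Printed"
    return "Handwritten" if d1 > 0 else "Machine Printed"


print(classify_block(0.8, -0.3))
print(classify_block(-0.2, -0.1))
print(classify_block(0.4, 1.1))
```

In practice `d1` and `d2` would come from evaluating `decision_value` on the block's BoVW descriptor with the trained parameters of each SVM.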

Page 11

Decision System Algorithm

Page 12

Decision System Algorithm

Figure panels: SVM1 (Handwritten Text); SVM2 (Machine-printed Text). Figure labels: Sample, Support Vector, D1, D2

Page 13

Examples

Original Image

Output of the proposed method

Page 14

Evaluation Datasets

• 103 modified document images from the IAM Handwriting Database

• 100 representative images selected from the index cards of the UK Natural History Museum’s card archive (PRImA-NHM)

• The ground truth files adhere to the Page Analysis and Ground-truth Elements (PAGE) format framework

• http://datasets.primaresearch.org

Page 15

Evaluation

The F-measure of each method.

Method                                                   | IAM    | PRImA-NHM
Upper Bound (Proposed Segmentation)                      | 0.9887 | 0.7985
Proposed Method (Proposed Segmentation and BoVW)         | 0.9886 | 0.7689
Gabor Filters (Proposed Segmentation and Gabor Filters)  | 0.7921 | 0.5702
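For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts. The slides do not state the unit over which these are counted for this evaluation, so the function names and the count-based interface are assumptions, and the numbers in the example are made up and unrelated to the reported results.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_measure_from_counts(tp, fp, fn):
    """F-measure from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_measure(precision, recall)


# Made-up counts for illustration only.
score = f_measure_from_counts(tp=8, fp=2, fn=2)
print(score)
```

Because it is a harmonic mean, the score is pulled towards the weaker of the two quantities, so a method cannot score well by maximising precision or recall alone.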

Page 16

Page Segmentation Problems

• Binarization Failures

• Noise – Text Overlapping

• Handwritten – Machine text Overlapping

Page 17

Thank You!