Document Page Layout Analysis. Bhabatosh Chanda, Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata 700108, India

Document Page Layout Analysis - IIIT Hyderabad (cvit.iiit.ac.in › SSDA › slides › SSDA_Jaipur_BhabatoshChanda.pdf)

Jul 03, 2020

Transcript
Page 1:

Document Page Layout Analysis

Bhabatosh Chanda
Electronics and Communication Sciences Unit
Indian Statistical Institute
Kolkata 700108, India

Page 2:

Acknowledgement

• Amit Das, IIEST, Sibpur
• Sekhar Mandal, IIEST, Sibpur
• Sanjoy Kumar Saha, Jadavpur University
• Ranjan Mandal, Indian Statistical Institute

January 30, 2017 Indian Statistical Institute

Page 3:

Outline

• Introduction
• Projection method
  – Zone content classification
• Morphological operators
  – Skew correction
• Morphology based method
• Deep learning based method
• Performance evaluation
• Database: examples
• Conclusion

Page 4:

Introduction

• Problem description
• Motivation
  • Improve performance of OCR
  • Data compression
  • Graphics recognition
  • Browsing and navigation
• Physical and logical structure

Page 5:

Problem Description

Page 6:

Objective

Page 7:

Major Source of Document Pages

1. Books
2. Journals
3. Magazines
4. Newspapers
5. Forms and leaflets
6. Reports

Page 8:

Types of document pages

Consider books and journals
• Title page
• Publisher's page
• Table of Contents
• Text page
• Index page

Page 9:

Different types of pages

Figure panels: Title page; Publisher's page

Page 10:

Different types of pages

Figure panels: Table of Content page; Table of Content page

Page 11:

Different types of pages

Figure panels: Text page-1; Text page-2

Page 12:

Different types of pages

Figure panels: Text page-3; Index page

Page 13:

Issues in document page scanning

• Resolution
• Back page impression
• Granular noise
• Blotted text (especially in old documents)
• Bending of pages at the binding
• Skew (due to placement of the page in the scanner)

Page 14:

Entities of Document Page

• Text
  – Body text
    • Line, Word, Character
  – Heading
• Non-text
  – Half-tone
  – Table
  – Graphics or line drawing

Page 15:

Entities of Document Page

• Each detected zone or block must be homogeneous in terms of content or entity.
• Each zone will be input to one of the suitable modules based on its entity.
  – OCR system
  – Image compressor
  – Vectorization system
• Output of these modules may be compiled and archived using a suitable structure.

Page 16:

Geometrical / Physical structure

[Diagram: hierarchy Document → Page → Block → Line → Word → Characters, with Non-text blocks as a separate branch]

Page 17:

Logical structure

[Diagram: tree rooted at Document with branches Text (Normal, High-lighted) and Non-Text (Half-tone (image), Line drawing); further node labels: Body, Heading, Sub-heading, Abstract, Graphics, Table]

Page 18:

Logical structure

• Different entities:
  – Text (red box)
  – Halftone (green box)
  – Table (magenta box)
  – Line drawing (blue box)
• Reading direction (dark blue arrow)
• Link between entities (brown arrow)

Page 19:

Zone / block detection

• One simple way is the projection method.
• Algorithm
  – Take the horizontal (or vertical) projection of foreground pixels (may be implemented as a pixel count).
  – If there exists a characteristic change in the projection profile, put a horizontal (resp. vertical) separator.
  – Take the horizontal and vertical directions alternately.
  – Continue until no further such change is found.
• Works well for structured documents, usually the pages of technical journals, books, etc.
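The projection steps above can be sketched as a recursive split. This is only an illustrative sketch: it assumes a binary image stored as a list of rows (1 = foreground), and it takes "characteristic change" to mean an interior run of at least `min_gap` empty rows or columns. The names `projection`, `find_gap`, `xy_cut` and `min_gap` are hypothetical and not from the slides.

```python
def projection(img, axis):
    """Foreground pixel count per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in img]
    return [sum(col) for col in zip(*img)]

def find_gap(profile, min_gap):
    """Widest interior run of zeros with length >= min_gap, as (start, end), or None."""
    best, i, n = None, 0, len(profile)
    while i < n:
        if profile[i] == 0:
            j = i
            while j < n and profile[j] == 0:
                j += 1
            # Only gaps with foreground on both sides act as separators.
            if i > 0 and j < n and j - i >= min_gap:
                if best is None or j - i > best[1] - best[0]:
                    best = (i, j)
            i = j
        else:
            i += 1
    return best

def xy_cut(img, top=0, left=0, min_gap=2, axis=0):
    """Split recursively, alternating directions; zones are (top, left, bottom, right)."""
    h, w = len(img), len(img[0])
    for ax in (axis, 1 - axis):
        gap = find_gap(projection(img, ax), min_gap)
        if gap:
            s, e = gap
            if ax == 0:  # horizontal separator: split into upper and lower parts
                return (xy_cut(img[:s], top, left, min_gap, 1)
                        + xy_cut(img[e:], top + e, left, min_gap, 1))
            # vertical separator: split into left and right parts
            return (xy_cut([r[:s] for r in img], top, left, min_gap, 0)
                    + xy_cut([r[e:] for r in img], top, left + e, min_gap, 0))
    return [(top, left, top + h, left + w)]
```

On a toy page with one full-width line above two columns, the call returns one zone for the line and one zone per column. Real pages would need a noise-tolerant gap test rather than an exact zero count.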

Page 20:

Projection Method: An Example

Page 21:

Example (contd.)

Page 22:

Example (contd.)

Page 23:

Example (contd.)

Page 24:

Problems of Projection method

• Cannot say what each block contains until further analysis.
  – Extract features from a zone
  – Recognize the zone content using a classifier
• Results are highly sensitive to even a small skew in the scanned page.

Page 25:

Zone content recognition

Features:
• Black pixel ratio (no. of black pixels / zone area)
• Horizontal transition (black to white) count
• Vertical transition (black to white) count
• Normalized mean length of horizontal black pixel run
• Normalized mean length of vertical black pixel run
• Connected component ratio
Classifier:
• Two-class (text and non-text) SVM with RBF kernel (accuracy 94.89%)

Duong, Emptoz, Côté: Features for Printed Document Image Analysis, ICPR 2002.
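As a rough illustration of the first five features, the sketch below computes them for a binary zone given as a list of rows (1 = black). The normalization of run lengths by zone width and height is an assumption made here, since the slide does not state it. The connected component ratio is omitted because its definition is not given. The function names are hypothetical.

```python
def transitions(lines):
    """Count black(1)-to-white(0) transitions along each line."""
    return sum(1 for line in lines
               for a, b in zip(line, line[1:]) if a == 1 and b == 0)

def mean_run_length(lines):
    """Mean length of maximal runs of black pixels over all lines."""
    runs = []
    for line in lines:
        run = 0
        for v in line:
            if v:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return sum(runs) / len(runs) if runs else 0.0

def zone_features(zone):
    """Feature dictionary for one zone; run lengths are normalized by
    zone width / height (an assumption, see the lead-in)."""
    h, w = len(zone), len(zone[0])
    cols = [list(c) for c in zip(*zone)]
    return {
        "black_ratio": sum(map(sum, zone)) / (h * w),
        "h_transitions": transitions(zone),
        "v_transitions": transitions(cols),
        "h_run": mean_run_length(zone) / w,
        "v_run": mean_run_length(cols) / h,
    }
```

The resulting vector could then be passed to any two-class classifier; the SVM itself is outside the scope of this sketch.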

Page 26:

Zone content recognition

• Functional classification of text blocks
  – Title / Heading, Sub-heading, Body text …
• Features:
  – complexity (measured by entropy)
  – visibility values (or relative boldness)
  – directional compactness (horizontal and vertical)
  – geometric characteristics (block height, width, etc.)
• Classifier:
  – K-means clustering followed by min. distance classifier

Bres, Eglin, and Gafneux, Unsupervised Clustering of Text Entities in Heterogeneous Grey Level Documents, ICPR, 2002.

Page 27:

Problems of Projection method

• Cannot say what each block contains until further analysis.
  – Extract features from a zone
  – Recognize the zone content using a classifier
• Results are highly sensitive to even a small skew in the scanned page.
  – Detecting the base line of each text line of the document
  – Determining the orientation (slope) angle of each base line
  – Estimating the overall skew of the document page
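The last two skew steps can be sketched as follows, assuming the base lines have already been detected and are given as lists of (x, y) points. The least-squares fit per base line and the median over lines are choices made for this sketch, not something the slides specify. The names `baseline_angle` and `page_skew` are hypothetical.

```python
import math
from statistics import median

def baseline_angle(points):
    """Least-squares slope of one base line, returned as an angle in degrees."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return math.degrees(math.atan2(sxy, sxx))

def page_skew(baselines):
    """Overall skew as the median of the per-line angles (robust to a few bad lines)."""
    return median(baseline_angle(b) for b in baselines)
```

With image coordinates, y grows downward, so the sign of the angle depends on the coordinate convention in use.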

Page 28:

Processing Tool

• Spatial domain operator that can handle shape information directly
• Mathematically well defined
• Neighborhood operator such that hardware implementation should be simple

Page 29:

Mathematical Morphology

• Mathematical morphological operators are a good choice.

Objects
• All characters, figures, drawings, i.e., black components against a white background

Structuring element
• Regular geometric figures:
  – mostly line segment, square, circle, etc.

Page 30:

Morphological Operations

Set theoretic operations (including union, intersection, etc.):

1. Dilation

2. Erosion

3. Opening

4. Closing

Page 31:

Morphological operator: Dilation

• Expands the objects.

  $A \oplus B = \{\, a + b \mid a \in A,\ b \in B \,\}$

  where A is an object and B is the structuring element (SE).

• Properties: commutative, associative, distributive (over union), increasing.

Figure panels: Orig.; SE; Circ-5; Circ-9; Line-19

Page 32:

Morphological operator: Erosion

• Shrinks the objects.

  $A \ominus B = \{\, p \mid B_p \subseteq A \,\}$

  where A is an object, B is the SE, and $B_p$ is B translated by p.

• Properties: distributive (over intersection), increasing.
• Dilation and erosion are dual.

Figure panels: Orig.; SE; Circ-5; Circ-9; Line-19
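The set definitions of dilation and erosion translate almost literally into code. The sketch below represents an object and an SE as Python sets of (row, col) pairs, which is a representation chosen for illustration only. The function names are hypothetical.

```python
def dilate(A, B):
    """Dilation: every sum a + b with a in A and b in B."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}

def erode(A, B):
    """Erosion: every position p at which B, translated by p, lies inside A."""
    # If B_p is inside A, then p + b0 is in A for any fixed b0 in B,
    # so it is enough to test the candidates a - b0.
    b0 = next(iter(B))
    candidates = {(a[0] - b0[0], a[1] - b0[1]) for a in A}
    return {p for p in candidates
            if all((p[0] + b[0], p[1] + b[1]) in A for b in B)}
```

For a 3x3 block and a horizontal 3-pixel line SE centred at the origin, erosion leaves only the middle column, while dilation widens the block by one pixel on each side.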

Page 33:

Morphological operator: Opening

• Removes objects, or parts of them, that cannot contain the SE.

  $A \circ B = (A \ominus B) \oplus B$

  where A is an object and B is the SE.

• Properties: increasing, idempotent, anti-extensive.
• It is a filter.

Figure panels: Orig.; SE; Circ-5; Circ-9; Line-19

Page 34:

Morphological operator: Closing

• Appends to the objects those parts of the background in which the SE does not fit.

  $A \bullet B = (A \oplus B) \ominus B$

  where A is an object and B is the SE.

• Properties: increasing, idempotent, and extensive.
• It is a filter.
• Opening & closing are dual.

Figure panels: Orig.; SE; Circ-5; Circ-9; Line-19
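Opening and closing follow directly from the two basic operators. The sketch below uses a set-of-pixels representation, chosen here only for illustration, and checks the filtering behaviour on a toy object. All function names are hypothetical.

```python
def dilate(A, B):
    """Dilation of pixel set A by structuring element B."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}

def erode(A, B):
    """Erosion of pixel set A by structuring element B."""
    b0 = next(iter(B))
    candidates = {(a[0] - b0[0], a[1] - b0[1]) for a in A}
    return {p for p in candidates
            if all((p[0] + b[0], p[1] + b[1]) in A for b in B)}

def opening(A, B):
    """Erosion followed by dilation: drops parts that cannot contain B."""
    return dilate(erode(A, B), B)

def closing(A, B):
    """Dilation followed by erosion: fills background parts where B does not fit."""
    return erode(dilate(A, B), B)
```

With a 2x2 square SE, opening a 3x3 block plus one isolated pixel removes the isolated pixel, and closing a 3x3 block with a missing centre fills the hole. Applying either operator twice gives the same result as applying it once.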

Page 35:

Detecting base line

• Close the original image with a line SE of suitable length.
• Open the closed image with the same line SE.
• Detect black-to-white transitions in a vertical scan.

Figure panels: Orig.; SE; Close Line-29; Cl-Op Line-29; B-W trans.
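A minimal sketch of these three steps is given below. It assumes a binary image as a list of rows (1 = black) and a horizontal line SE. Closing with such an SE is implemented row by row as filling interior white gaps shorter than the SE, and opening as deleting black runs shorter than the SE. The function names and the toy SE length are hypothetical.

```python
def runs(row, value):
    """Yield (start, end) of maximal runs of `value` in a row (end exclusive)."""
    start = None
    for i, v in enumerate(row + [None]):  # None sentinel closes a trailing run
        if v == value and start is None:
            start = i
        elif v != value and start is not None:
            yield start, i
            start = None

def close_row(row, length):
    """Closing by a horizontal line: fill interior white gaps shorter than `length`."""
    out = row[:]
    for s, e in runs(row, 0):
        if s > 0 and e < len(row) and e - s < length:
            out[s:e] = [1] * (e - s)
    return out

def open_row(row, length):
    """Opening by a horizontal line: delete black runs shorter than `length`."""
    out = row[:]
    for s, e in runs(row, 1):
        if e - s < length:
            out[s:e] = [0] * (e - s)
    return out

def baseline_points(img, length):
    """Close, then open, then collect black-to-white transitions scanning downward."""
    smeared = [open_row(close_row(r, length), length) for r in img]
    points = []
    for x in range(len(smeared[0])):
        for y in range(len(smeared) - 1):
            if smeared[y][x] == 1 and smeared[y + 1][x] == 0:
                points.append((x, y))
    return points
```

In the toy case below, the closing merges the characters of a text line into one bar, the opening removes the isolated descender, and the remaining lower edge is the base line.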

Page 36:

Font

• Traditionally, in metal typesetting, a font is a particular size, weight and style of a typeface.
• The weight of a particular font is the thickness of the character outlines relative to their height.
• Font size is measured in points.

  1 point in ...         is equal to ...
  typographic units      1/12 pica
  imperial/US units      1/72 inch
  metric (SI) units      0.3528 mm

Page 37:

Size related parameters

• X-height or corpus height
• Ascender
• Descender
• Scan resolution (in dpi)
• Font style: bold, italics, ornamental
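Font size and scan resolution together determine how many pixels a character occupies, which is useful when choosing a structuring element size. The conversion below uses only the 1/72 inch definition of the point; the function names are hypothetical.

```python
def points_to_pixels(size_pt, dpi):
    """Height in pixels of a font of `size_pt` points scanned at `dpi` dots per inch."""
    return size_pt * dpi / 72.0

def points_to_mm(size_pt):
    """Same size in millimetres (1 inch = 25.4 mm)."""
    return size_pt * 25.4 / 72.0
```

For example, a 12 point font scanned at 300 dpi spans 50 pixels from the top of the ascender to the bottom of the descender, assuming the nominal size covers that full extent.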

Page 38:

Skew correction: An example

Page 39:

Pages with complex layout

Page 40:

Morphological algorithm

• A text region is composed of small objects (characters) placed at regular intervals.
• Opening the image with a small SE removes the thin object parts (strokes of characters), but has an insignificant effect on large objects in half-tone etc.
• Closing the image with a small SE fills in white holes in small objects (space within and between characters), but has an insignificant effect on large white space or half-tone.
• Thus the difference between the closed and opened images highlights the text region.
• The difference image is thresholded to detect the text region.
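The closing-minus-opening idea can be illustrated on a single scan line. The sketch below is one-dimensional and binary (1 = black) with a flat SE of odd length `k`. These simplifications, the function names, and the default threshold are assumptions for illustration, not the author's implementation.

```python
def dilate1d(sig, k):
    """Flat dilation: maximum over a window of odd length k."""
    r = k // 2
    return [max(sig[max(0, i - r): i + r + 1]) for i in range(len(sig))]

def erode1d(sig, k):
    """Flat erosion: minimum over a window of odd length k (outside counts as white)."""
    r = k // 2
    padded = [0] * r + sig + [0] * r
    return [min(padded[i: i + k]) for i in range(len(sig))]

def closing1d(sig, k):
    """Dilation then erosion; the signal is padded so borders are not eroded away."""
    padded = [0] * k + sig + [0] * k
    return erode1d(dilate1d(padded, k), k)[k:-k]

def opening1d(sig, k):
    """Erosion then dilation."""
    return dilate1d(erode1d(sig, k), k)

def text_mask(sig, k=3, threshold=1):
    """Threshold the difference between the closed and the opened signal."""
    closed, opened = closing1d(sig, k), opening1d(sig, k)
    return [1 if c - o >= threshold else 0 for c, o in zip(closed, opened)]
```

On a line containing three thin strokes followed by a solid six-pixel block, only the stroke region is marked: the closing fills it, the opening erases it, and both leave the solid block unchanged.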

Page 41:

Morphological approach: An example

(a) Original image               (b) Closed image               (c) Opened image

QUESTION:  Size of structuring element?

Page 42:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 43:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 44:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 45:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 46:

Deep learning

• Popular technique for unsupervised feature extraction for supervised applications
  – Ex. object recognition.
• Utilizes a HUGE number of instances to train a relatively simple system to perform a more complicated task.
• Training samples may be the outcome of controlled or uncontrolled data acquisition.
• Requires very high computational resources for implementing a reasonably meaningful system.

Page 47:

Detect text area using CNN

Input: A document image          Output: Text / Non-text area

Page 48:

Solution strategy

Transforming the problem into a classification problem.

• Divide the input image into MxM patches.
• Input: Image patch of size MxM
• Output: Text, Non-text, and Ambiguous
  – Text: if >80% of the patch has text
  – Non-text: if <20% of the patch has text area
  – Ambiguous: otherwise
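The labeling rule can be written as a small function. This sketch assumes the ground truth for a patch is a binary mask (1 = text pixel) stored as a list of rows; the name `label_patch` and its parameters are hypothetical.

```python
def label_patch(mask_patch, hi=0.8, lo=0.2):
    """Label a patch from the fraction of its area marked as text."""
    total = sum(len(row) for row in mask_patch)
    frac = sum(map(sum, mask_patch)) / total
    if frac > hi:
        return "text"
    if frac < lo:
        return "non-text"
    return "ambiguous"
```

A patch lying entirely inside a text block is labeled text, a patch in the margin is labeled non-text, and a patch straddling a zone boundary falls into the ambiguous class.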

Page 49:

Training data

Page 50:

Prepare training data

INPUT: document images with manually labeled text area.
• From each image, overlapping patches of size 100x100 are taken (stride along x, y is 20) and resized to 50x50.
• From each image, overlapping patches of size 50x50 are taken (stride along x, y is 10).
• Each 50x50 patch is divided into 4 patches of size 25x25, which are resized back to 50x50.
• We get a total of 825670 patches of size 50x50 as training data from 8 images.
Label: as described before.
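The sampling scheme can be sketched by enumerating patch origins. The code below only counts patches for a given image size. It does not reproduce the reported total, because the sizes of the 8 training images are not given in the slides. Function names are hypothetical.

```python
def patch_origins(height, width, size, stride):
    """Top-left corners of all size x size patches that fit, taken at the given stride."""
    return [(y, x)
            for y in range(0, height - size + 1, stride)
            for x in range(0, width - size + 1, stride)]

def count_training_patches(height, width):
    """Patches produced from one image by the three sampling rules."""
    big = len(patch_origins(height, width, 100, 20))   # 100x100, resized to 50x50
    mid = len(patch_origins(height, width, 50, 10))    # 50x50, used as-is
    return big + mid + 4 * mid                          # plus four 25x25 quarters each
```

For a hypothetical 200x300 image this gives 66 + 416 + 1664 = 2146 patches, which shows how quickly the dense overlapping sampling multiplies the training set.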

Page 51:

Training blocks: Example

Page 52:

Model description

Input: 50x50 patch of gray scale.

Layer (type)             Output Shape    Param #
==========================================================
Convolution2D(3x3 @8)    (8, 48, 48)     80
MaxPooling2D(2x2)        (8, 24, 24)     0
Convolution2D(3x3 @6)    (6, 22, 22)     438
Convolution2D(3x3 @4)    (4, 20, 20)     220
Flatten                  (1600)          0
Dense(7)                 (7)             11207
Activation(Sigmoid)      (7)             0
Dense(3)                 (3)             24
Activation(Softmax)      (3)             0
==========================================================
Total parameters: 11969
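The shapes and parameter counts in the table can be checked with a few lines of arithmetic. The sketch assumes unpadded ("valid") convolutions with stride 1 and one bias per filter or unit, which is consistent with the listed numbers. The helper names are hypothetical.

```python
def conv(shape, k, filters):
    """Valid k x k convolution: new (channels, h, w) shape and parameter count."""
    c, h, w = shape
    return (filters, h - k + 1, w - k + 1), k * k * c * filters + filters

def pool(shape, p):
    """Non-overlapping p x p pooling has no parameters."""
    c, h, w = shape
    return (c, h // p, w // p), 0

def dense(n_in, n_out):
    """Fully connected layer: weights plus one bias per output unit."""
    return n_out, n_in * n_out + n_out

def summarize():
    """Walk the layers of the table, starting from a 1 x 50 x 50 gray-scale patch."""
    rows, shape = [], (1, 50, 50)
    for name, fn, args in [("conv 3x3 @8", conv, (3, 8)),
                           ("maxpool 2x2", pool, (2,)),
                           ("conv 3x3 @6", conv, (3, 6)),
                           ("conv 3x3 @4", conv, (3, 4))]:
        shape, params = fn(shape, *args)
        rows.append((name, shape, params))
    n = shape[0] * shape[1] * shape[2]  # flatten
    for units in (7, 3):
        n_out, params = dense(n, units)
        rows.append((f"dense {units}", (n_out,), params))
        n = n_out
    return rows, sum(p for _, _, p in rows)
```

Calling `summarize()` returns the per-layer rows and the total, which can be compared against the table entry by entry.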

Page 53:

Model description

Page 54:

Training the model

• Number of epochs: 200
• Batch size: 100
• Learning rate: 0.01
• Learning weight decay: 0.95
• Optimizer: Stochastic gradient descent
• Loss function: Mean squared error

Page 55:

Testing

• Input: A test image
• Take a 50x50 patch and submit it to the trained model.
• If the predicted class is text, color that patch pink.
• If the predicted class is non-text, color the patch white.
• If the predicted class is ambiguous, then
  – Divide that patch into 4 patches of size 25x25, resize each to 50x50, and submit them to the model.
  – If a resized patch is again ambiguous, then color that patch yellow (ideally this should be done recursively until we get no ambiguous patch).
  – Else color the patch according to its text or non-text class.
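The test-time procedure can be sketched as below. The classifier is passed in as a callable `predict` that returns "text", "non-text" or "ambiguous". That interface is hypothetical and stands in for the trained model. The nearest-neighbour enlargement in `upsample2` is likewise only a stand-in for the resize step, and only one level of subdivision is done, as on the slide.

```python
COLORS = {"text": "pink", "non-text": "white", "ambiguous": "yellow"}

def crop(img, y, x, size):
    """Square sub-image with top-left corner (y, x)."""
    return [row[x:x + size] for row in img[y:y + size]]

def upsample2(patch):
    """Nearest-neighbour 2x enlargement (stand-in for resizing 25x25 to 50x50)."""
    out = []
    for row in patch:
        wide = [v for v in row for _ in (0, 1)]
        out += [wide, wide[:]]
    return out

def color_patch(img, y, x, predict, size=50):
    """Return (y, x, size, color) tuples for one patch, splitting it once if ambiguous."""
    label = predict(crop(img, y, x, size))
    if label != "ambiguous":
        return [(y, x, size, COLORS[label])]
    half = size // 2
    results = []
    for dy in (0, half):
        for dx in (0, half):
            sub = upsample2(crop(img, y + dy, x + dx, half))
            results.append((y + dy, x + dx, half, COLORS[predict(sub)]))
    return results
```

A recursive variant would call `color_patch` again on each ambiguous quarter until a minimum patch size is reached.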

Page 56:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 57:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 58:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 59:

Results

Figure panels: Input test image; Resultant (labeled) image

Page 60:

An improved network

[Diagram: 32×32 → 25×25×96 → 5×5×96 → 4×4×256 → 2×2×256 → [Text] / [Non-Text]; stages: Convolution, Average pooling, Convolution, Average pooling, Classification]

Wang, Wu, Coates and Ng, End-to-End Text Recognition with Convolutional Neural Networks, ICPR 2012.

Page 61:

Comparative results

Figure panels: Simpler system; Wang et al.

Page 62:

Comparative results

Figure panels: Simpler system; Wang et al.

Page 63:

Benchmark database

• UW-I, II, III databases, developed at University of Washington, Seattle, USA in 1996.
• Widely used earliest database with 1620 pages
• Zones contain text and non-text such as halftone, line drawing, math and chemical equation.
• The database also contains
  – Page condition file: skew angle, noise.
  – Page attribute file: dominant font and other content.
  – Page bounding box file: location and size of zones.

http://isis-data.science.uva.nl/events/dlia//datasets/uwash3.html

Page 64:

Benchmark database

• MediaTeam document database
• Developed at University of Oulu, Finland in 1998.
• One of the early databases containing

  Pattern type    Samples
  Text            4811
  Graphics        735
  Image           161
  Composite       219

Duong, Emptoz, Côté: Features for Printed Document Image Analysis, ICPR 2002.

Page 65:

Benchmark database

• Pattern Recognition and Image Analysis (PRImA) Layout Analysis dataset
• Developed at University of Salford, Manchester
• 1240 ground-truthed pages from magazines (1085 pages) and technical journals (155 pages)
• Used in the following contests
  – ICDAR 2015 Recognition of Documents with Complex Layouts (RDCL2015)
  – ICDAR2013 Historical Newspaper Layout Analysis (HNLA2013)
  – ICDAR2011 Historical Document Layout Analysis (HDLAC 2011)

http://www.primaresearch.org/datasets/Layout_Analysis


Benchmark database

• Historical Newspaper dataset (ENP dataset)
• Developed at the University of Salford, Manchester, in the Europeana Newspapers Project
• 500 ground‐truthed pages covering
– 13 languages (German, French, English, Estonian, etc.)
– 17th, 18th, 19th and 20th centuries
• Contains 61,619 regions in total, including
– 1,497 image zones
– 208 table zones
– 46,889 text zones


Clausner et al., The ENP Image and Ground Truth Dataset of Historical Newspapers, ICDAR 2015.


Performance evaluation

• A document page $D$ may be represented as an $m$-tuple $D = (E_1, E_2, \ldots, E_m)$, where the $E_i$ are entities such as text, tables, half‐tone, etc.
• Each entity has a unique property denoted by $\mathrm{Prop.}(E_i)$.
• Document page image domain $X$ has $n$ bounding boxes $B_j$ ($j = 1, \ldots, n$) such that:

(i) $\bigcup_{j=1}^{n} B_j \subseteq X$
(ii) $B_j \cap B_k = \emptyset$ for $j \neq k$
(iii) For every $B_j$ there exists one and only one $E_i$ such that $\mathrm{Prop.}(B_j) = \mathrm{Prop.}(E_i)$
(iv) $W = X \setminus \bigcup_{j=1}^{n} B_j$ is called background and $\mathrm{Prop.}(W) \neq \mathrm{Prop.}(E_i)$
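The four conditions on the bounding boxes can be checked mechanically. The following is a minimal sketch, assuming axis-aligned rectangular boxes with half-open pixel ranges and string labels standing in for Prop.; the names `Box`, `is_valid_layout` and `background_area` are hypothetical and not taken from the slides.

```python
from itertools import combinations
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Axis-aligned box covering pixels [x0, x1) x [y0, y1) (assumed convention)."""
    x0: int
    y0: int
    x1: int
    y1: int
    prop: str = ""  # hypothetical property label, e.g. "text" or "table"


def area(b: Box) -> int:
    return (b.x1 - b.x0) * (b.y1 - b.y0)


def inside(b: Box, page: Box) -> bool:
    return (page.x0 <= b.x0 and page.y0 <= b.y0
            and b.x1 <= page.x1 and b.y1 <= page.y1)


def overlap(a: Box, b: Box) -> bool:
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def is_valid_layout(page: Box, boxes: Sequence[Box],
                    entity_props: Sequence[str]) -> bool:
    # (i) every box lies inside the page domain X
    contained = all(inside(b, page) for b in boxes)
    # (ii) boxes are pairwise disjoint
    disjoint = not any(overlap(a, b) for a, b in combinations(boxes, 2))
    # (iii) each box matches exactly one entity property
    labelled = all(list(entity_props).count(b.prop) == 1 for b in boxes)
    return contained and disjoint and labelled


def background_area(page: Box, boxes: Sequence[Box]) -> int:
    # (iv) size of W = X minus the union of all boxes; this subtraction
    # is only correct when conditions (i) and (ii) hold
    return area(page) - sum(area(b) for b in boxes)


page = Box(0, 0, 100, 200)
boxes = [Box(10, 10, 90, 50, "text"), Box(10, 60, 90, 120, "table")]
print(is_valid_layout(page, boxes, ["text", "table", "half-tone"]))
print(background_area(page, boxes))
```

Rectangles keep the checks to a few comparisons; zones with arbitrary shapes would need pixel masks instead.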


Performance evaluation

• Both model and object graphs are directed acyclic graphs.
• Let us represent the model graph by $G_M = (V_M, L_M)$, where $V_M = \{M_0, M_1, M_2, \ldots, M_n\}$ represents the set of nodes or vertices and $L_M$ represents the set of links.
• Note that $M_j = (BB_j, bb_j, E_j)$ and $L_{jk} = (M_j, M_k)$.
• Similarly, the object graph is represented by $G_o = (V_o, L_o)$.
• And $O_j = (Bb_j, E_j)$ and $L_{jk} = (O_j, O_k)$.
• Finally, a graph matching algorithm is employed.
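One possible in-memory representation of the model and object graphs is sketched below. It assumes rectangular boxes stored as (x0, y0, x1, y1) tuples and links stored as ordered pairs of node indices; the class names are hypothetical, and the slides do not specify what a link encodes or which graph matching algorithm is used, so only the acyclicity requirement is checked here.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

Box = tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed rectangle encoding


@dataclass(frozen=True)
class ModelNode:
    """M_j = (BB_j, bb_j, E_j) from the ground truth."""
    outer: Box   # BB_j
    inner: Box   # bb_j
    entity: str  # E_j


@dataclass(frozen=True)
class ObjectNode:
    """O_j = (Bb_j, E_j) produced by the segmentation algorithm."""
    box: Box     # Bb_j
    entity: str  # E_j


@dataclass
class LayoutGraph:
    nodes: list
    # L_jk stored as (j, k) index pairs; the direction is an assumption
    links: set[tuple[int, int]] = field(default_factory=set)

    def is_acyclic(self) -> bool:
        # TopologicalSorter expects a mapping node -> predecessors
        preds = {i: set() for i in range(len(self.nodes))}
        for j, k in self.links:
            preds[k].add(j)
        try:
            tuple(TopologicalSorter(preds).static_order())
        except CycleError:
            return False
        return True


model = LayoutGraph(
    nodes=[
        ModelNode((0, 0, 100, 200), (0, 0, 100, 200), "page"),
        ModelNode((5, 5, 95, 55), (10, 10, 90, 50), "text"),
        ModelNode((5, 60, 95, 130), (10, 65, 90, 125), "table"),
    ],
    links={(0, 1), (0, 2)},
)
print(model.is_acyclic())
```

The object graph can reuse `LayoutGraph` with `ObjectNode` entries, which keeps the two graphs structurally comparable for whatever matching step follows.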


Das, Saha and Chanda, An empirical measure of performance of document image segmentation algorithm, IJDAR, Vol. 4(3), 2002.


Performance evaluation

• Relation between $BB_j$ and $bb_j$ in the model (ground truth): $BB_j \supseteq bb_j$ and $BB_j \cap bb_k = \emptyset$ for $j \neq k$.
• For good segmentation of an object node: $bb_j \subseteq Bb_i \subseteq BB_j$ if node $M_j$ matches node $O_i$.
• The error measure, for a segmented region $Bb_j$ compared against the model boxes $BB_i$ and $bb_i$:
(i) Correct classification (True positive) = $\#(Bb_j \cap BB_i)$.
(ii) False alarm (False positive) = $\#(Bb_j \setminus BB_i)$.
(iii) Mis‐classification (False negative) = $\#(bb_i \setminus Bb_j)$.
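The matching condition and the three error counts can be illustrated for rectangular boxes. This is a minimal sketch, assuming that # counts pixels and that each box is an (x0, y0, x1, y1) tuple with half-open ranges; the function names are hypothetical.

```python
def area(b):
    """Pixel count of a box (x0, y0, x1, y1); empty boxes count as 0."""
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def intersect(a, b):
    """Intersection of two boxes; may be empty (non-positive width or height)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def good_segmentation(bb, Bb, BB):
    """True when bb is inside Bb and Bb is inside BB."""
    return contains(Bb, bb) and contains(BB, Bb)


def error_counts(Bb, BB, bb):
    """Return (true positive, false positive, false negative) pixel counts."""
    tp = area(intersect(Bb, BB))             # #(Bb ∩ BB)
    fp = area(Bb) - tp                       # #(Bb \ BB)
    fn = area(bb) - area(intersect(bb, Bb))  # #(bb \ Bb)
    return tp, fp, fn


BB = (0, 0, 100, 50)    # outer ground-truth box
bb = (10, 10, 90, 40)   # inner ground-truth box
print(error_counts((5, 5, 95, 45), BB, bb))    # (3600, 0, 0)
print(error_counts((20, 5, 110, 45), BB, bb))  # (3200, 400, 300)
```

In the second call the detected region is shifted to the right, so it spills outside the outer box (false alarm) and misses part of the inner box (mis-classification).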


Conclusion

• Presented a document image segmentation method based on shape features
• Used mathematical morphological operators
• Necessary for OCR and data compression
• System is useful for development of digital library providing facilities for electronic storage, searching, navigation


References

• B. Chanda and D. Dutta Majumder, Digital Image Processing and Analysis, Prentice Hall of India, New Delhi, 2000.
• A. K. Das and B. Chanda, A fast algorithm for skew detection of document images using morphology, Intl. J. on Document Analysis and Recognition, Vol. 4, pp. 109‐114, 2001.
• A. K. Das, S. K. Saha and B. Chanda, An empirical measure of performance of document image segmentation algorithm, Intl. J. on Document Analysis and Recognition, Vol. 4, pp. 183‐190, 2002.
• S. Mandal, A. K. Das and B. Chanda, A Simple and Effective Table Detection System from Document Images, Intl. J. on Document Analysis and Recognition, Vol. 8, pp. 172‐182, 2006.


Thank you
