Document Analysis: Segmentation & Layout Analysis
Post on 11-Jan-2016
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
© Prof. Rolf Ingold
Outline
- Objectives of layout analysis
- Classification of layout analysis methods
- Splitting methods
- Grouping methods
- Text-Graphics-Image Separation
- Text line segmentation
- Word and character segmentation
- Field extraction from forms
Objectives of layout analysis and segmentation
The role of segmentation is to split a document image into regions of interest
Regions of interest exist at different levels of granularity: graphics or text blocks, text lines, words, characters
The goal of layout analysis is to get a hierarchical description of segmented objects
Segmentation strategies
Segmentation produces a hierarchy of physical objects
Two strategies can be used:
- top-down segmentation: starting with the entire image, split it recursively down to elementary shapes
- bottom-up segmentation: starting at pixel level, detect connected components and group them hierarchically
Hybrid methods combine both strategies
Segmentation methods can be
- data-driven, i.e., using only data properties (without contextual knowledge)
- model-driven, i.e., using contextual knowledge
Top-down methods
Top-down methods decompose the entire page into a hierarchy of rectangular regions
Top-down approaches perform
- recursive XY-cuts
- horizontal and vertical projection profile analysis
- white streams (spaces) analysis
- run length smoothing algorithm (RLSA)
Recursive XY-Cut
The page is cut alternately horizontally and vertically according to white spaces
- Robust for most modern printed documents
- Assumes page images to be unskewed
- Does not work for all kinds of layouts
  - Non-rectangular formatting
  - Complex mosaics (illustration next)
Resulting hierarchy may not reflect the natural structure (illustration below)
Top-Down Segmentation
Recursive splitting can be performed by horizontal and vertical profile analysis
- images need to be "unskewed" (deskewed) first!
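The recursive splitting by profile analysis can be sketched as follows. This is a minimal illustration, assuming a deskewed binary image stored as a list of rows (1 = black, 0 = white). The names `profile`, `widest_gap`, `trim`, `xy_cut` and the parameter `min_gap` are hypothetical and not taken from the course. Instead of strictly alternating directions, this variant cuts at the widest white gap found in either profile.

```python
def profile(img, box, axis):
    """Black-pixel counts per row (axis=0) or per column (axis=1) inside box."""
    top, bottom, left, right = box
    if axis == 0:
        return [sum(img[y][left:right]) for y in range(top, bottom)]
    return [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]

def widest_gap(prof, min_gap):
    """Widest interior run of zeros with length >= min_gap, as (start, end), or None."""
    best, start = None, None
    for i, v in enumerate(prof + [1]):          # sentinel closes a trailing run
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            interior = start > 0 and i < len(prof)
            if interior and i - start >= min_gap:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i)
            start = None
    return best

def trim(img, box):
    """Shrink box to the bounding box of its black pixels (None if it is empty)."""
    top, bottom, left, right = box
    rows = [y for y in range(top, bottom) if any(img[y][left:right])]
    if not rows:
        return None
    cols = [x for x in range(left, right)
            if any(img[y][x] for y in range(top, bottom))]
    return (rows[0], rows[-1] + 1, cols[0], cols[-1] + 1)

def xy_cut(img, box=None, min_gap=2):
    """Return leaf regions as (top, bottom, left, right), bottom/right exclusive."""
    if box is None:
        box = (0, len(img), 0, len(img[0]))
    box = trim(img, box)
    if box is None:
        return []
    top, bottom, left, right = box
    row_gap = widest_gap(profile(img, box, 0), min_gap)   # candidate horizontal cut
    col_gap = widest_gap(profile(img, box, 1), min_gap)   # candidate vertical cut
    if row_gap is None and col_gap is None:
        return [box]
    use_rows = col_gap is None or (
        row_gap is not None and row_gap[1] - row_gap[0] >= col_gap[1] - col_gap[0])
    if use_rows:
        a, b = row_gap
        return (xy_cut(img, (top, top + a, left, right), min_gap)
                + xy_cut(img, (top + b, bottom, left, right), min_gap))
    a, b = col_gap
    return (xy_cut(img, (top, bottom, left, left + a), min_gap)
            + xy_cut(img, (top, bottom, left + b, right), min_gap))

page = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    [0] * 10,
    [0] * 10,
    [1] * 10,
    [1] * 10,
]
regions = xy_cut(page, min_gap=2)
```

On this toy page the first cut is horizontal (the only white gap at page level), and the upper strip is then cut vertically into two columns.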
Top-Down Segmentation (2)
The order in which X-Y cuts are performed is critical
White streams analysis
Principle:
- detect maximal rectangular white blocks
- split regions recursively according to thresholds
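As one possible building block for the detection step, the sketch below finds the single largest all-white rectangle of a binary image with the classical histogram-and-stack technique. This is an assumption about how such rectangles could be found, not the method used in the course; the function name `largest_white_rectangle` and the image encoding (list of rows, 1 = black, 0 = white) are illustrative.

```python
def largest_white_rectangle(img):
    """Return (area, top, left, bottom, right) of the largest all-white
    rectangle; bottom and right are exclusive."""
    width = len(img[0])
    heights = [0] * width            # white run height ending at the current row
    best = (0, 0, 0, 0, 0)
    for y, row in enumerate(img):
        for x in range(width):
            heights[x] = heights[x] + 1 if row[x] == 0 else 0
        stack = []                   # (start column, height), heights increasing
        for x, h in enumerate(heights + [0]):   # trailing 0 flushes the stack
            start = x
            while stack and stack[-1][1] >= h:
                start, sh = stack.pop()
                area = sh * (x - start)
                if area > best[0]:
                    best = (area, y - sh + 1, start, y + 1, x)
            stack.append((start, h))
    return best

page = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
white_block = largest_white_rectangle(page)
```

A recursive splitter could cut the region along such a rectangle whenever its width or height exceeds a chosen threshold, then repeat on the remaining parts.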
Run Length Smearing Algorithm (RLSA)
The Run Length Smearing Algorithm (RLSA) is a morphological operator
- it replaces white runs that are shorter than or equal to a given threshold by black runs
- it can be applied horizontally as well as vertically
RLSA based segmentation
RLSA can be used to segment a page into blocks using three steps
- applied horizontally
- applied vertically
- combined by logical AND operator
Threshold values are critical and have to be chosen
- according to document class
- using statistical white space analysis
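The three steps can be sketched as below. This is a minimal illustration on a binary image stored as a list of rows (1 = black, 0 = white); the function names and the two threshold parameters are hypothetical. One assumption is labeled in the code: white runs touching the image border are left untouched, which the slides do not specify.

```python
def rlsa_row(row, threshold):
    """Blacken white runs of length <= threshold enclosed by two black pixels.
    Assumption: white runs touching the border are left white."""
    out = list(row)
    last_black = None
    for x, v in enumerate(row):
        if v == 1:
            if last_black is not None and 0 < x - last_black - 1 <= threshold:
                for i in range(last_black + 1, x):
                    out[i] = 1
            last_black = x
    return out

def rlsa_horizontal(img, threshold):
    return [rlsa_row(row, threshold) for row in img]

def rlsa_vertical(img, threshold):
    # Transpose, smear each column as if it were a row, transpose back.
    smeared = [rlsa_row(list(col), threshold) for col in zip(*img)]
    return [list(row) for row in zip(*smeared)]

def rlsa_blocks(img, h_threshold, v_threshold):
    """Horizontal smearing AND vertical smearing, pixel by pixel."""
    h = rlsa_horizontal(img, h_threshold)
    v = rlsa_vertical(img, v_threshold)
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(h, v)]

page = [
    [1, 1, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 0, 1],
]
blocks = rlsa_blocks(page, h_threshold=2, v_threshold=2)
```

With both thresholds set to 2, the one-pixel hole in the left shape is filled while the four-pixel gap to the right column stays white, so the two blocks remain separate.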
Bottom-up methods
Bottom-up methods start at the pixel level and group pixels together into a hierarchy of
- multi-rectangular regions (shapes delimited by horizontal and vertical segments)
- arbitrary shapes
Bottom-up methods use
- connected component extraction
- region grouping
Connected components
In a binary image, a connected component is a set of black pixels connected by 4- or 8-adjacency
Illustration: five 4-connected components, two 8-connected components
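The effect of the adjacency choice can be checked with a small flood-fill counter. This is only a sketch: the function name `count_components` and the toy image are made up for illustration (the image is not the one from the slide, it is merely built to give the same counts).

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_components(img, neighbors):
    """Count sets of black pixels (value 1) connected under the given adjacency."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            count += 1                      # unvisited black pixel: new component
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:                    # breadth-first flood fill
                cy, cx = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and img[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

img = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
]
n4 = count_components(img, N4)   # diagonal contacts do not connect
n8 = count_components(img, N8)   # diagonal contacts connect
```

The five black pixels only touch diagonally, so they form five components under 4-adjacency and two under 8-adjacency.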
Extraction of connected components
Connected components can be extracted by different algorithms
- By a one-pass full image scanning process, from top to bottom and from left to right
- By a border following algorithm, using as first pixel a border pixel supposed to be known
Scanning based CC Extraction
for each scan line ly
  for each black run r on ly
    if on line ly-1 there is no run k-adjacent to r
      create a new component containing r
    else if on line ly-1 there exists exactly one run r' k-adjacent to r
      add r to the component containing r'
    else (on line ly-1 there exist several runs ri k-adjacent to r)
      merge all components containing such an ri
      add r to the merged component
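The scan-line procedure above can be made runnable as follows. This is a sketch under stated assumptions: the image is a list of rows (1 = black), runs are stored as `(y, start, end)` with `end` exclusive, and the "merge" step is implemented with a union-find structure, which is one common choice rather than something the slides prescribe. All function names are illustrative.

```python
def runs_of(row):
    """Black runs of a row as (start, end) pairs, end exclusive."""
    runs, start = [], None
    for x, v in enumerate(list(row) + [0]):    # sentinel closes a trailing run
        if v == 1 and start is None:
            start = x
        elif v != 1 and start is not None:
            runs.append((start, x))
            start = None
    return runs

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def label_runs(img, k=8):
    """Group runs into k-connected components (k is 4 or 8)."""
    slack = 1 if k == 8 else 0      # 8-adjacency also accepts diagonal contact
    runs, parent, prev = [], [], []
    for y, row in enumerate(img):
        cur = []
        for s, e in runs_of(row):
            idx = len(runs)
            runs.append((y, s, e))
            parent.append(idx)                   # new component by default
            for j in prev:                       # runs on line y-1
                _, ps, pe = runs[j]
                if ps < e + slack and s < pe + slack:
                    # adjacent run: merge its component with the current one
                    parent[find(parent, j)] = find(parent, idx)
            cur.append(idx)
        prev = cur
    components = {}
    for i, run in enumerate(runs):
        components.setdefault(find(parent, i), []).append(run)
    return list(components.values())

img = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
]
cc4 = label_runs(img, k=4)
cc8 = label_runs(img, k=8)
```

Starting every run as its own component and merging with each adjacent run on the previous line covers the three cases of the pseudocode (no neighbor, one neighbor, several neighbors) with a single loop.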
Border following algorithm
(S denotes the set of pixels of the component)
consider P0 ∈ S having a 4-neighbor Q0 ∉ S
P ← P0 ; Q ← Q0 ; d ← direction of Q relative to P ;
repeat
  let Ri be the neighbor of P in direction (d+i) mod 8
  if R2 ∉ S then Q ← R2 ; d ← (d+2) mod 8 ;
  else
    if R1 ∉ S then P ← R2 ; Q ← R1 ;
    else P ← R1 ; d ← (d-2) mod 8 ;
  add P to the contour
until P = P0 and Q = Q0
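A runnable sketch of this border follower is given below. The set-membership tests follow a reading of the pseudocode in which R2 and R1 are tested for lying outside S, which is an interpretation. Further assumptions: pixels are `(y, x)` tuples, directions are numbered 0 to 7 starting east and turning by 45 degrees so that even numbers are the 4-directions, pixels outside the image count as background, and P is recorded only when it actually moves. The name `follow_border` is illustrative.

```python
# (dy, dx) offsets: E, NE, N, NW, W, SW, S, SE
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def follow_border(img, p0, q0):
    """Follow the border of the component containing p0.
    p0 must be black and q0 one of its white (or outside) 4-neighbors."""
    def in_s(p):
        y, x = p
        return 0 <= y < len(img) and 0 <= x < len(img[0]) and img[y][x] == 1

    def step(p, d):
        dy, dx = DIRS[d % 8]
        return (p[0] + dy, p[1] + dx)

    d = DIRS.index((q0[0] - p0[0], q0[1] - p0[1]))
    if d % 2 != 0 or not in_s(p0) or in_s(q0):
        raise ValueError("q0 must be a background 4-neighbor of a black p0")
    p, q = p0, q0
    contour = [p0]
    while True:
        r1, r2 = step(p, d + 1), step(p, d + 2)
        if not in_s(r2):
            q, d = r2, (d + 2) % 8      # pivot around P
        elif not in_s(r1):
            p, q = r2, r1               # straight move, background stays on the same side
            contour.append(p)
        else:
            p, d = r1, (d - 2) % 8      # diagonal move around a concave corner
            contour.append(p)
        if p == p0 and q == q0:
            break
    if len(contour) > 1:
        contour.pop()                   # the last move returned to p0
    return contour

square = follow_border([[1, 1], [1, 1]], (0, 0), (-1, 0))
corner = follow_border([[1, 0], [1, 1]], (0, 0), (-1, 0))
```

Because the loop stops only when both P and Q are back at their initial values, an isolated pixel is handled correctly: Q simply rotates once around it.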
Illustration of connected components
Connected components from RLSA
Connected components can be used to detect characters
Words can be located using RLSA
Grouping components
Grouping connected components is nontrivial
Grouping rules are based on
- relative positioning
- distances and thresholds
- component classification
Parameters can be estimated statistically
Allen's relations in 2D space
The relative positioning of two rectangles generates 169 configurations: 13 interval relations along each axis, and 13 × 13 = 169
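The count can be reproduced with a short sketch that classifies two intervals into one of Allen's 13 relations and applies it independently to the x and y extents of two rectangles. The relation labels are the usual English names; the function names and the `(left, top, right, bottom)` box layout are assumptions made for this example.

```python
def allen(a, b):
    """Allen relation of interval a = (a1, a2) to b = (b1, b2), with a1 < a2, b1 < b2."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"

def rect_relation(a, b):
    """Pair of Allen relations (horizontal, vertical) for boxes (left, top, right, bottom)."""
    return (allen((a[0], a[2]), (b[0], b[2])),
            allen((a[1], a[3]), (b[1], b[3])))

# Enumerate intervals with small integer endpoints against a fixed reference.
relations = {allen((s, e), (2, 5)) for s in range(8) for e in range(s + 1, 8)}
side_by_side = rect_relation((0, 0, 2, 2), (3, 0, 5, 2))
```

Since each axis contributes one of 13 relations independently, a pair of rectangles falls into one of 13 × 13 = 169 configurations.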
Threshold estimation
Thresholds can be estimated from statistical distributions of
- horizontal spacing, for grouping characters into words and words into text lines
- vertical spacing, for grouping text lines into text blocks
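As a simple illustration of estimating such a threshold, the sketch below assumes the gap distribution is bimodal (small gaps inside words, larger gaps between words) and places the threshold in the middle of the largest jump between sorted distinct gap values. This heuristic is an assumption chosen for brevity, not the estimator used in the course; the names `gap_threshold` and `group_by_gap` and the `(left, right)` box layout are illustrative.

```python
def gap_threshold(gaps):
    """Midpoint of the largest jump between consecutive distinct gap values."""
    values = sorted(set(gaps))
    if len(values) < 2:
        raise ValueError("need at least two distinct gap values")
    i = max(range(len(values) - 1), key=lambda n: values[n + 1] - values[n])
    return (values[i] + values[i + 1]) / 2

def group_by_gap(boxes, threshold):
    """Group (left, right) extents on one text line; a gap above threshold starts a new group."""
    boxes = sorted(boxes)
    if not boxes:
        return []
    groups = [[boxes[0]]]
    for prev, cur in zip(boxes, boxes[1:]):
        if cur[0] - prev[1] > threshold:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return groups

# Horizontal extents of four characters on one text line (made-up values).
chars = [(0, 3), (4, 7), (15, 18), (19, 22)]
gaps = [cur[0] - prev[1] for prev, cur in zip(chars, chars[1:])]
threshold = gap_threshold(gaps)
words = group_by_gap(chars, threshold)
```

The same two functions can be applied to vertical extents of text lines to group them into blocks, with a threshold estimated from the vertical gaps.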
Distributions of component sizes
Components can be classified according to their size into
- symbols
- letters
- hairlines
- punctuation
Region grouping
Docstrum
The docstrum method [O'Gorman] uses a graph that connects each connected component to its k closest neighbors
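The neighbor graph can be sketched with a brute-force search over component centroids. This is a minimal illustration, not O'Gorman's implementation: the function name `k_nearest_pairs`, the `(x, y)` centroid layout and the choice to return distance and angle for each edge are assumptions made for this example.

```python
import math

def k_nearest_pairs(centroids, k=5):
    """For each centroid (x, y), return edges (i, j, distance, angle in degrees)
    to its k nearest neighbors, found by brute force."""
    pairs = []
    for i, (xi, yi) in enumerate(centroids):
        others = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centroids) if j != i
        )
        for dist, j in others[:k]:
            xj, yj = centroids[j]
            angle = math.degrees(math.atan2(yj - yi, xj - xi))
            pairs.append((i, j, dist, angle))
    return pairs

# Three components on one text line and one on a line further down (made-up values).
centroids = [(0, 0), (10, 0), (20, 0), (0, 30)]
edges = k_nearest_pairs(centroids, k=1)
```

Histograms of the edge distances and angles can then hint at character spacing, line spacing and text orientation. For large pages a spatial index would replace the quadratic search.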
Model driven layout analysis [Azokly95]
Generic macrostructures
In a model-driven approach, generic macrostructures are used
- a formal language describes margins and separators
Formal description of macrostructures
VOLUME Article IS
  WIDTH = 160; HEIGHT = 240;
  PAGE Garde IS ... END;
  PAGE Paire IS
    HSEP hs1 = (4, 3, LEFT, RIGHT, BLANK);
    LAYER Principal IS
      VSEP vs1 = (40, 65, TOP, hs1, BLANK);
      VSEP vs2 = ([50,60], 4, hs1, BOTTOM, BLANK);
      REGION Centre = (vs2, RIGHT, hs1, BOTTOM, ANY, NORMAL);
      REGION Marge = (LEFT, vs2, hs1, BOTTOM, TEXT, SMALL);
      ...
    END;
    LAYER Secondaire IS
      HSEP hs2 = ([10,220], 2, LEFT, RIGHT, BLANK) SUBST hs1;
      HSEP hs3 = ([20,240], 2, LEFT, RIGHT, BLANK) SUBST BOTTOM;
      REGION Figure = (LEFT, RIGHT, hs2, hs3, {TABLE, GRAPHICS});
    END;
  END;
  PAGE Impaire IS ... END;
END;
Evaluation of segmentation results
Segmentation is rarely perfect; it generates
- undersegmentation: real components are merged
- oversegmentation: a single component is split
Special metrics have been developed to evaluate a segmentation result
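The two error types can be detected with a simple box-overlap check, sketched below. This is not one of the metrics used in the ICDAR contests; it is a hypothetical illustration in which a box counts as lying inside another when at least half of its area is covered. The box layout `(left, top, right, bottom)`, the function names and the 0.5 ratio are assumptions.

```python
def area(box):
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)

def intersection(a, b):
    """Intersection box (may be degenerate, in which case its area is 0)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def containers_of_many(outer, inner, min_inside=0.5):
    """Indices of outer boxes that hold at least two inner boxes,
    an inner box counting as held when min_inside of its area is covered."""
    result = []
    for i, o in enumerate(outer):
        held = [b for b in inner
                if area(b) > 0 and area(intersection(o, b)) / area(b) >= min_inside]
        if len(held) >= 2:
            result.append(i)
    return result

# Made-up ground truth and segmentation result.
truth = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]
detected = [(0, 0, 10, 4), (0, 6, 10, 10), (20, 0, 50, 10)]

oversegmented = containers_of_many(truth, detected)    # truth regions that were split
undersegmented = containers_of_many(detected, truth)   # detected regions merging several truths
```

Here the first ground-truth region is split into two detected boxes, and the third detected box merges the last two ground-truth regions.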
Scientific contests were organized at ICDAR'03 and ICDAR'05
Conclusion
Segmentation is a crucial step in document analysis
Segmentation is almost solved for
- printed documents with regular layout
- form analysis
Results are rarely perfect
- Contextual knowledge may improve the results
- Advanced pattern recognition methods are required
Segmentation remains an open problem for uncontrolled handwriting and graphical documents
Component hierarchy