JY Ramel et al. Interactive and Incremental Analysis of Document Images Laboratoire d’Informatique de Tours - FRANCE
Dec 28, 2015
Document Image Analysis Systems: strategies and tools
LABRI Seminar
Introduction
• Context of the presented work
• Let’s dive into the semantic gap…
Characterization and representation of document images
• Selection of low-level primitives
• A graph representation for the layout or structure
Analysis and recognition of the contents
• Contextual and incremental analysis
• Operators and scenarios
• AGORA and RETRO prototypes
Conclusion
Work carried out over a series of collaborations … thanks to all contributors!
Introduction: Context of the work
Preservation of the cultural heritage
• The CESR Tours: a training and research centre
• Working on various domains of the Renaissance (historians)
• A rich library of rare books (Loire Valley)
An initial project: the Humanistic Virtual Library (BVH in French)
• Collaboration with the RFAI research team
• A multidisciplinary collaboration: experts in DIA + experts in rare books + end users
Filling the semantic gap?
• A new idea: introduce more interaction into DIA systems
Introduction: Let’s dive into the semantic gap
Which segmentation methods are able to extract such EoC (Elements of Content)?
• Data-driven methods?
No: too much noise and too many parameters
• Model-driven methods?
No: the document model allows little variability
Introduction: Let’s dive into the semantic gap
Low-level information (data driven)
• Pixels, regions, contours, primitives
• Low-level processing according to image specificities (data)
High-level entities (model driven)
• Indexing = reading = a priori knowledge = model
• Domain-specific processing, genericity
• Most of the time: no user, everything is encoded
?
Introduction: Let’s dive into the semantic gap
Filling the gap with the help of the user?
Before: learning of the model or of the shapes to recognize
Introduction: Let’s dive into the semantic gap
Filling the gap with the help of the user?
Before: interactive construction of the processing sequence (Ariane / Pandore, GraphEdit / DirectShow)
After: a posteriori intervention, error correction by relevance feedback
Introduction: Let’s dive into the semantic gap
Filling the gap with the help of the user: our proposition, intervening during the analysis
Incremental analysis
• Segmentation for recognition, recognition for segmentation
• From the simplest to the most difficult
Interactive analysis
• User-driven method
• Adaptation according to images
• Adaptation according to user objectives
It requires
• An initial representation of the image content
• A set of processing operators for segmentation and recognition, with interoperability and compatibility capabilities
Part I
Characterization and representation of document images
Information about the shapes using contour vectorization
Information about the structure (layout) using a graph representation
Towards a generic structural representation: VectoGraph
Characterization and representation of document images: Information about the shapes using contour vectorization
Which primitives for describing shapes in a document?
Binarization to extract contours (vectors, quadrilaterals) and connected components (CC)
Characterization and representation of document images: Information about structure using graphs
• Idea: an evolving graph of EoC
Two types of EoC:
• Primitives / elementary EoC: connected components, vectors, quadrilaterals
• User-defined EoC: characters, words, ornamental letters, triangles, diodes, …
Characterization and representation of document images: Information about structure using graphs
• Towards a generic representation with graphs
Node = primitive or EoC
– Type of EoC
– Centre (X, Y) of the bounding box
– Bounding box (BB)
– Bounding rectangle BR: (P1, P2, P3, P4)
– Orientation = inertia axis
– Density of black and white pixels inside the BB
– Colour: average colour
– List of the elementary EoC
– Number of elementary EoC
– Confidence rate
Edge = relation between EoC
– Minimal distance between two EoC
– Angle between EoC
– Relation: Inside, Overlap, L, T, P, X, S, undefined
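The node and edge attributes listed on this slide can be pictured as a small attributed graph. The sketch below is only an illustration: the class names `EoCNode`, `EoCEdge` and `EoCGraph`, the field names and the integer node ids are assumptions, not the data structures used in AGORA, and only a subset of the listed attributes is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EoCNode:
    """Hypothetical node record: a primitive or a user-defined EoC."""
    eoc_type: str                     # e.g. "connected_component", "word"
    bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)
    orientation: float = 0.0          # inertia axis angle, in degrees
    density: float = 0.0              # share of black pixels inside the bbox
    confidence: float = 1.0           # confidence rate
    children: List[int] = field(default_factory=list)  # ids of elementary EoC

    @property
    def centre(self) -> Tuple[float, float]:
        """Centre (X, Y) of the bounding box."""
        x0, y0, x1, y1 = self.bbox
        return ((x0 + x1) / 2, (y0 + y1) / 2)


@dataclass
class EoCEdge:
    """Hypothetical edge record: a relation between two EoC."""
    distance: float               # minimal distance between the two EoC
    angle: float                  # angle between the two EoC
    relation: str = "undefined"   # Inside, Overlap, L, T, P, X, S, undefined


@dataclass
class EoCGraph:
    nodes: Dict[int, EoCNode] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], EoCEdge] = field(default_factory=dict)

    def add_node(self, node: EoCNode) -> int:
        # Ids are assigned sequentially; this sketch never removes nodes.
        node_id = len(self.nodes)
        self.nodes[node_id] = node
        return node_id

    def add_edge(self, a: int, b: int, edge: EoCEdge) -> None:
        # Edges are undirected, so the key is stored in sorted order.
        self.edges[(min(a, b), max(a, b))] = edge

    def neighbours(self, node_id: int) -> List[int]:
        return sorted(b if a == node_id else a
                      for (a, b) in self.edges if node_id in (a, b))
```

Keeping user-defined EoC in the same node type as primitives, with a list of child ids, is one way to let the graph evolve as components are grouped into words or lines.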
Initial representation for old documents: graph of EoCs + background map
Background map
Graph of the connected components
• Tagging of the nodes according to size: noise, text, images
• Edges between closed shapes (CC): horizontal/vertical neighbourhood
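Merging criterion between two components G1 and G2: a distance d(G1, G2), taken as a minimum (Min) over the pixels (i, j) of [G1, G2] from the background-map grey level Ng(i, j) on a 256-level scale, and compared with Threshold_Fusion
[Figure: example graph of four connected components (1 to 4) linked by horizontal (H) and vertical (V) neighbourhood edges]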
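To make the horizontal/vertical neighbourhood edges concrete, here is a minimal sketch that links bounding boxes lying side by side (H) or one above the other (V). It is an assumption-laden simplification: it measures the gap between bounding boxes directly instead of reading it from the background map, and the names `neighbourhood_edges` and `max_gap` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def overlap(a0: int, a1: int, b0: int, b1: int) -> bool:
    """True when the 1-D intervals [a0, a1] and [b0, b1] intersect."""
    return a0 <= b1 and b0 <= a1


def neighbourhood_edges(boxes: List[Box], max_gap: int) -> List[Tuple[int, int, str]]:
    """Return (i, j, "H" or "V") for every pair of boxes closer than max_gap."""
    edges = []
    for i, (ax0, ay0, ax1, ay1) in enumerate(boxes):
        for j in range(i + 1, len(boxes)):
            bx0, by0, bx1, by1 = boxes[j]
            # Horizontal neighbours: side by side and sharing some vertical extent.
            h_gap = max(bx0 - ax1, ax0 - bx1)
            if overlap(ay0, ay1, by0, by1) and 0 <= h_gap <= max_gap:
                edges.append((i, j, "H"))
            # Vertical neighbours: stacked and sharing some horizontal extent.
            v_gap = max(by0 - ay1, ay0 - by1)
            if overlap(ax0, ax1, bx0, bx1) and 0 <= v_gap <= max_gap:
                edges.append((i, j, "V"))
    return edges
```

With three boxes in a row and a fourth one below the first, this produces two H edges and one V edge, which matches the kind of small example graph shown on the slide.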
Characterization and representation of document images: Information about structure using graphs
• Towards a generic structural representation
Representation = structural graph (domain independent)
Documents: pixels, points, …
Shapes & EoC. Node: primitive or EoC (connected component, vector, quadrilateral)
Layout & relations. Edge: relation between primitives or EoCs (distance, angle, topology)
Analysis = domain dependent
Part II
Analysis and recognition of image contents
Contextual and incremental analysis of old document images
• Three operators with simple parameters
• Interactive construction of scenarios
• Examples
• AGORA prototype
Analysis and recognition of image contents: Strategy of analysis
Proposition: user-driven analysis (scenario), an incremental and interactive approach
• No predefined EoCs (model of document)
• Users can define the required EoCs themselves
• Interactive definition of the model of the document
• Incremental analysis (from simple to difficult)
Ease of use
• Easy-to-use interfaces, user assistant
• No complex image processing algorithms, just three operators: tagging (extraction-recognition), merging, deletion
Analysis and recognition of image contents: Three operators
Tagging (extraction) of EoC (nodes) according to
• rules about spatial position in the page
• rules about neighbourhood relationships (using edges)
• rules about internal properties (node attributes)
Merging of EoC according to rules using the distance computed from the background map (edge attributes), on a specific type of EoC
Deletion of EoC according to label and user decision
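The three operators can be sketched as three small functions over a list of node records. Everything here is hypothetical: the dictionary keys (`label`, `bbox`, `size`), the function names, and the use of a bounding-box gap as the merging distance (the slides compute it from the background map).

```python
from typing import Any, Callable, Dict, List, Tuple

Node = Dict[str, Any]            # assumed keys: "label", "bbox", "size"
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def tag(nodes: List[Node], label: str, rule: Callable[[Node], bool]) -> int:
    """Tagging: give `label` to every node satisfying `rule`; return the count."""
    count = 0
    for node in nodes:
        if rule(node):
            node["label"] = label
            count += 1
    return count


def delete(nodes: List[Node], label: str) -> List[Node]:
    """Deletion: drop every node carrying `label`."""
    return [n for n in nodes if n["label"] != label]


def box_gap(a: Box, b: Box) -> int:
    """Gap between two boxes (0 when they touch or overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return max(dx, dy)


def merge(nodes: List[Node], label: str, new_label: str, max_dist: int) -> List[Node]:
    """Merging: group nodes labelled `label` that are within max_dist, transitively."""
    targets = [n for n in nodes if n["label"] == label]
    others = [n for n in nodes if n["label"] != label]
    parent = list(range(len(targets)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(targets)):
        for j in range(i + 1, len(targets)):
            if box_gap(targets[i]["bbox"], targets[j]["bbox"]) <= max_dist:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Node]] = {}
    for i, node in enumerate(targets):
        groups.setdefault(find(i), []).append(node)

    merged = []
    for members in groups.values():
        boxes = [m["bbox"] for m in members]
        merged.append({
            "label": new_label,
            "bbox": (min(b[0] for b in boxes), min(b[1] for b in boxes),
                     max(b[2] for b in boxes), max(b[3] for b in boxes)),
            "size": sum(m["size"] for m in members),
        })
    return others + merged
```

Passing the tagging rule as a function keeps the operator itself trivial, which fits the slide's point that the operators only need simple parameters.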
Analysis and recognition of image contents: Scenarios
User-defined processing sequence
Graph analysis and modification
Defined by users on a typical image
Depending on the user objectives and on the images
Can be saved, edited and applied in batch mode …
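A scenario that can be saved, edited and replayed in batch mode can be pictured as a plain list of operator steps. The sketch below assumes a JSON encoding, two operators only, and illustrative thresholds and page names; none of this reflects the actual AGORA file format.

```python
import json
from typing import Any, Dict, List

Node = Dict[str, Any]   # assumed keys: "label", "size"
Step = Dict[str, Any]   # assumed keys: "op" plus operator parameters


def apply_step(nodes: List[Node], step: Step) -> List[Node]:
    """Apply one operator step to the node list of a page."""
    if step["op"] == "tag":
        for node in nodes:
            if step["min_size"] <= node["size"] < step["max_size"]:
                node["label"] = step["label"]
        return nodes
    if step["op"] == "delete":
        return [n for n in nodes if n["label"] != step["label"]]
    raise ValueError(f"unknown operator: {step['op']}")


def run_scenario(nodes: List[Node], scenario: List[Step]) -> List[Node]:
    """Run the steps in order: the scenario is just a processing sequence."""
    for step in scenario:
        nodes = apply_step(nodes, step)
    return nodes


# A scenario defined once on a typical image (illustrative thresholds).
scenario = [
    {"op": "tag", "label": "Noise", "min_size": 0, "max_size": 10},
    {"op": "tag", "label": "Text", "min_size": 10, "max_size": 200},
    {"op": "delete", "label": "Noise"},
]

# Saving and reloading: the scenario is plain data, so it can also be edited by hand.
saved = json.dumps(scenario, indent=2)
reloaded = json.loads(saved)

# Batch mode: the same scenario is applied to every page.
pages = {
    "page_1": [{"label": "cc", "size": 3}, {"label": "cc", "size": 50}],
    "page_2": [{"label": "cc", "size": 120}],
}
results = {name: run_scenario(nodes, reloaded) for name, nodes in pages.items()}
```

Because the steps are data rather than code, reordering or retuning a threshold does not require touching the operators.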
Analysis and recognition of image contents: Examples
Initial representation = primal EoC
Analysis and recognition of image contents: Examples
Tagging the primal EoC: Text – Graphic – Noise
Analysis and recognition of image contents: Examples
Merging of EoC = Text: Word – Line – Paragraph
Analysis and recognition of image contents: Examples
Vertical position
Horizontal position
avg = 0.46, std = 0.41
avg = 0.51, std = 0.07
Automatic tagging of ornamental letters (lettrines)
With the collaboration of Nicholas…
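One plausible reading of the avg/std figures is that a tagging rule is derived from statistics over examples, keeping only the attributes that vary little. The sketch below follows that reading under explicit assumptions: the feature names, the `max_std` cut-off and the mean ± k·std interval are all hypothetical, not the rule-learning procedure used on the slide.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def learn_interval(values: List[float], k: float = 2.0) -> Interval:
    """Interval mean +/- k * standard deviation of the example values."""
    m, s = mean(values), pstdev(values)
    return (m - k * s, m + k * s)


def learn_rule(examples: List[Dict[str, float]],
               max_std: float = 0.1, k: float = 2.0) -> Dict[str, Interval]:
    """Keep only low-variance features: they are the ones that characterise the class."""
    rule = {}
    for feature in examples[0]:
        values = [e[feature] for e in examples]
        if pstdev(values) <= max_std:
            rule[feature] = learn_interval(values, k)
    return rule


def matches(rule: Dict[str, Interval], candidate: Dict[str, float]) -> bool:
    """A candidate is tagged when every retained feature falls inside its interval."""
    return all(lo <= candidate[f] <= hi for f, (lo, hi) in rule.items())
```

Under this reading, an attribute with a standard deviation of 0.07 would be kept in the rule while one with 0.41 would be ignored; the user can then validate or adjust the resulting intervals.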
Analysis and recognition of image contents: Examples
ERR
EUR
Automatic Tagging with manual validation/modification of the rule
With the collaboration of Nicholas…
Analysis and recognition of image contents: Examples
- Primal sketch construction
- Img Type = Connected Component of size > 200
- Noise Type = Connected Component of size < 10
- Text Type = Connected Component of size between 10 and 200
- Horizontal and vertical Fusion of Text with d < 2000
- Border Type = Img with Width/Height Ratio between 3 and 10
- Ornamental Letter Type = Img close to Nothing at the Left and close to Text at the Right
- Img Type = Ornamental Letter with Width/Height Ratio < 0.8 or with Width/Height Ratio > 1.2
- Img Type = Ornamental Letter on the Right < 75%
- Left Margin Type = Text on the left with 25%
- Right Margin Type = Text on the right with 25%
- Vertical Fusion of the Left and Right Margins with d < 10000
- Horizontal Fusion of the Text with d < 3000
- Pagination Type = Text at the top with 10%
- Text Type = Pagination with a number of Connected Components > 3
- Signature Type = Text at the bottom with 25%
- Text Type = Signature with a number of Connected Components > 5
- Text Type = Signature with Text below, on the left or on the right
- Deletion of the EoC labelled Text
- Deletion of the EoC labelled Noise
Example of an obtained scenario applicable to a set of images
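A few of the rules in this scenario translate directly into threshold tests. The sketch below encodes only the size rules, the Border ratio rule and the ornamental-letter ratio check, using the thresholds listed above; the function names are hypothetical and the unit of "size" is not stated in the slides.

```python
def classify_component(size: float, width: float, height: float) -> str:
    """Size-based tagging followed by the Border rule on Img components."""
    if size < 10:
        return "Noise"
    if size <= 200:
        return "Text"
    # Img components that are much wider than tall become Border.
    ratio = width / height
    if 3 <= ratio <= 10:
        return "Border"
    return "Img"


def keeps_ornamental_letter_tag(width: float, height: float) -> bool:
    """An Ornamental Letter reverts to Img when its ratio is < 0.8 or > 1.2."""
    ratio = width / height
    return 0.8 <= ratio <= 1.2
```

The neighbourhood and position rules (margins, pagination, signature) would need the graph edges and the page dimensions, which this fragment leaves out.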
Analysis and recognition of image contents: Examples
Results: regions labelled Margin, Title, Text, Legend, Ornamental letter (lettrine), Noise
Analysis and recognition of image contents: AGORA
A user-driven approach [see IJDAR]
• Graph representation
• Simple operators
• Scenarios
• Some interfaces
Used since 2004 at the CESR
Always to be improved…
Download: http://www.rfai.li.univ-tours.fr/pagesperso/ramel/fr/work1.html
Analysis and recognition of image contents: From AGORA to RETRO
Analysis and recognition of image contents: From AGORA to RETRO
…La l[21]ngueu[7] du chevalet depuis s[21]n pied jusques au c[7][21]chet d’en haut, p[21]rte d[21]uze t[7][21]us …
Accuracy, frequency of tri-grams
Dictionary
Contextual, manual and automatic transcription
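In the sample line, bracketed numbers stand for shape clusters that have not yet been given a character. A minimal sketch of contextual labelling is to try each letter and keep the one that turns the most words into dictionary entries. This uses a dictionary only (no tri-gram frequencies), and the function names and the tiny word list in the usage example are assumptions for illustration.

```python
import re
from typing import Dict, List, Optional, Set

PLACEHOLDER = re.compile(r"\[(\d+)\]")


def fill(word: str, labels: Dict[int, str]) -> str:
    """Replace every [id] whose cluster has a label; leave the others untouched."""
    return PLACEHOLDER.sub(
        lambda m: labels.get(int(m.group(1)), m.group(0)), word)


def guess_label(cluster_id: int, words: List[str], dictionary: Set[str],
                alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> Optional[str]:
    """Return the letter producing the most dictionary words, or None if none does."""
    marker = f"[{cluster_id}]"
    best, best_hits = None, 0
    for letter in alphabet:
        hits = sum(1 for w in words
                   if marker in w and fill(w, {cluster_id: letter}) in dictionary)
        if hits > best_hits:
            best, best_hits = letter, hits
    return best
```

For instance, with the words `l[21]ngueu[7]`, `s[21]n`, `p[21]rte`, `d[21]uze`, `t[7][21]us` and a dictionary containing `longueur`, `son`, `porte`, `douze`, `trous`, cluster 21 is resolved first, and cluster 7 becomes resolvable once 21 has been filled in. This is the incremental idea: each labelled cluster adds context for the next one.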
Analysis and recognition of image contents: From AGORA to RETRO
Experiment on one complete book (Vésale)
• Book of 150 pages
• 1,062,081 connected components (pseudo-characters)
• 40,000 classes (clusters) have been built
• 90% of these classes contain fewer than 10 occurrences
Ignoring these classes during transcription means missing one character in 14, i.e. more than one on each text line!
• 57% of the classes contain a single shape
Why? Noise, spots, touching characters, split characters; the same holds for words
• The 200 largest classes correspond to 85% of the text
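Statistics of this kind can be computed from the list of cluster sizes alone. The sketch below shows the two measures involved, the share of text covered by the largest classes and the share of classes below a given size; the function names are hypothetical and no data from the book is reproduced.

```python
from typing import List


def coverage_of_largest(class_sizes: List[int], top_n: int) -> float:
    """Fraction of all occurrences that fall in the top_n largest classes."""
    ordered = sorted(class_sizes, reverse=True)
    total = sum(ordered)
    return sum(ordered[:top_n]) / total if total else 0.0


def share_of_small_classes(class_sizes: List[int], max_occurrences: int) -> float:
    """Fraction of classes with fewer than max_occurrences members."""
    if not class_sizes:
        return 0.0
    small = sum(1 for size in class_sizes if size < max_occurrences)
    return small / len(class_sizes)
```

Such a long-tailed distribution is why labelling a few hundred large clusters by hand already transcribes most of the text, while the many tiny clusters still account for roughly one character in 14 (about 7%).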
Conclusion
Proposition of a global approach: from images to their interpretation
Modelling of the data: representation of image content
• Genericity: thin and filled shapes, lines and curves, shapes and structure or layout
• Contour vectorization + relationship analysis
• Use of attributed graphs
Modelling of the processing operators: recognition
• Use of contextual information during EoC extraction and recognition
• Involvement of the user (early) in the processing sequence: user-driven analysis
• Proposition of new structural PR techniques
Thanks
Questions ?
Analysis and recognition of image contents: Scenarios
Document: image, pixels
Representation = graph of EoC (Symbol, Title, Ornamental letter, …)
User-defined scenarios = succession of operators + thresholds
[Figure: Scenario 1, Scenario 2 and Scenario 3 shown as sequences of operators (Q1 Q2 …, Q2 Q1 …, P2 Q3 …) over P1, P2, P3, P4]