UC Berkeley CS294-9 Fall 2000 21- 1 Document Image Analysis Lecture 21: Introduction to Layout Richard J. Fateman Henry S. Baird University of California – Berkeley Xerox Palo Alto Research Center

Dec 27, 2015


UC Berkeley CS294-9 Fall 2000 21- 1

Document Image Analysis
Lecture 21: Introduction to Layout

Richard J. Fateman
Henry S. Baird

University of California – Berkeley
Xerox Palo Alto Research Center


Page layout analysis

• Structural (Physical, Geometric) Layout Analysis [Segmentation]

• Functional (Syntactic, Logical) Layout Analysis [Classification]

• Read-order determination


Structural

• Isolation of columns, paragraphs, lines, words, tables, figures, and perhaps individual letters.

• Without some layout analysis, much of the previous work would be impossible!

• Without layout analysis, what is the sequence of words in a multi-column format?
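
The multi-column question can be made concrete with a small sketch. It assumes the blocks are already segmented into hypothetical `(x0, y0, x1, y1, text)` tuples, that the layout is Manhattan, and that two blocks belong to the same column when their x-ranges overlap. None of these names come from the lecture.

```python
def read_order(blocks):
    """Order blocks column by column, top to bottom within a column.

    blocks: list of (x0, y0, x1, y1, text), with y increasing downward.
    Assumption: two blocks share a column when their x-ranges overlap.
    """
    columns = []  # each entry: [x0, x1, [blocks in this column]]
    for b in sorted(blocks, key=lambda b: b[0]):
        for col in columns:
            if b[0] < col[1] and b[2] > col[0]:  # x-ranges overlap
                col[0] = min(col[0], b[0])
                col[1] = max(col[1], b[2])
                col[2].append(b)
                break
        else:
            columns.append([b[0], b[2], [b]])
    ordered = []
    for col in sorted(columns, key=lambda c: c[0]):
        ordered.extend(sorted(col[2], key=lambda b: b[1]))
    return [b[4] for b in ordered]


blocks = [
    (0, 0, 40, 10, "left-1"), (50, 0, 90, 10, "right-1"),
    (0, 12, 40, 22, "left-2"), (50, 12, 90, 22, "right-2"),
]
# Reading straight across the page interleaves the two columns.
naive = [b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0]))]
print(naive)               # ['left-1', 'right-1', 'left-2', 'right-2']
print(read_order(blocks))  # ['left-1', 'left-2', 'right-1', 'right-2']
```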


Functional

• Typically domain dependent

• May require merging or splitting of syntactic components

• Encoding into ODA (Open Document Architecture) or SGML (a DTD describes components such as section, title, ...)


Functional Components

• First page of a technical article may have
  – Title
  – Author
  – Abstract, body/column1, body/column2, footnotes
  – Pagination
  – Journal name/volume/date…

• Business letter might have
  – Sender
  – Date
  – Logo
  – Recipient
  – Body
  – Signature


Finding structural blocks


Common Approaches

• Top Down analysis
  – Horizontal and vertical profiles
  – Recursive: columns, paragraphs/lines/words
  – As illustrated earlier

• Bottom Up analysis
  – Use adjacency based on
    • Pixels / morphology of dilation (millions)
    • RLE / merge lines (thousands)
    • Connected Components (hundreds)

• Look at the background (shape-directed covers)

• Also, human hints.
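
A minimal sketch of the top-down idea, assuming a binary image stored as a list of rows (1 = ink) and a single hypothetical `min_gap` parameter for how wide a white gap must be before we cut. It alternates vertical and horizontal projection profiles recursively; it illustrates the technique only and is not the code used in the course.

```python
def profile(img, box, axis):
    """Ink count per row (axis=0) or per column (axis=1) inside box."""
    top, left, bottom, right = box
    if axis == 0:
        return [sum(img[r][left:right]) for r in range(top, bottom)]
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]


def runs_of_ink(prof, min_gap):
    """Split a profile into ink runs separated by at least min_gap empty bins."""
    runs, start, gap = [], None, 0
    for i, v in enumerate(prof):
        if v:
            if start is None:
                start = i
            gap = 0
            end = i + 1
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, end))
                start = None
    if start is not None:
        runs.append((start, end))
    return runs


def xy_cut(img, box=None, min_gap=2, axis=1):
    """Recursively split box at wide white gaps, alternating the cut direction.

    Boxes are (top, left, bottom, right), half-open.
    """
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    top, left, bottom, right = box

    def split(ax):
        runs = runs_of_ink(profile(img, box, ax), min_gap)
        if ax == 0:
            return [(top + a, left, top + b, right) for a, b in runs]
        return [(top, left + a, bottom, left + b) for a, b in runs]

    for ax in (axis, 1 - axis):
        parts = split(ax)
        if len(parts) > 1 or (parts and parts[0] != box):
            out = []
            for p in parts:
                out.extend(xy_cut(img, p, min_gap, 1 - ax))
            return out
    # No further cut possible: keep the box if it contains any ink.
    return [box] if any(profile(img, box, 0)) else []


page = [
    "###...###",
    "###...###",
    ".........",
    ".........",
    "###...###",
]
img = [[1 if ch == "#" else 0 for ch in row] for row in page]
print(xy_cut(img))
# [(0, 0, 2, 3), (4, 0, 5, 3), (0, 6, 2, 9), (4, 6, 5, 9)]
```

Cutting columns first and then lines happens to return the blocks in column-wise reading order for this toy page.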


Standard images…the Scanned Input


Smear character boxes


Smear words to get lines


Smear lines to get paragraphs
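
The smearing steps can be sketched as run-length smearing: fill any white run shorter than a threshold that lies between two ink pixels. This is a sketch under the assumption that the image is a list of 0/1 rows; the function names and the `max_gap` parameter are hypothetical, and the slides do not say which smearing variant was used.

```python
def smear_rows(img, max_gap):
    """Fill horizontal white runs of length <= max_gap that lie between ink pixels."""
    out = []
    for row in img:
        new = list(row)
        last_ink = None
        for c, v in enumerate(row):
            if v:
                if last_ink is not None and 0 < c - last_ink - 1 <= max_gap:
                    for k in range(last_ink + 1, c):
                        new[k] = 1
                last_ink = c
        out.append(new)
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def smear_cols(img, max_gap):
    """Vertical smear: the same operation applied along columns."""
    return transpose(smear_rows(transpose(img), max_gap))


row = [[1, 0, 1, 0, 0, 0, 1, 0, 1]]
print(smear_rows(row, 1))  # [[1, 1, 1, 0, 0, 0, 1, 1, 1]]  characters -> words
print(smear_rows(row, 3))  # [[1, 1, 1, 1, 1, 1, 1, 1, 1]]  words -> line
```

A small horizontal threshold merges characters into words, a larger one merges words into lines, and `smear_cols` plays the same role for merging lines into paragraphs.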


Issues:

• Sensitivity to noise. Solutions:
  – Clean up via kfill or similar filtering, ruthlessly
  – Divide the page (artificially) and keep the noise from affecting the document globally

• Slanted lines. Solution(s):
  – Deskew (since it is not too hard(?))
  – Use nearest neighbors “docstrum”

• Concave regions (text flow around a box). Solution(?): look at background

• Variation in font, spacing can throw off analysis
  – Allow for local analysis


Interactive semi-automatic zoning (RJF)


Zoom in


Scroll around


View individual pixels


Semi…

Turn up the noise filter until we start to kill some of the punctuation. How?

As we turn up the threshold, the number of connected components drops, then reaches a stable plateau after the noise is gone, and then drops again as we remove punctuation, the dots above the “i” etc.
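
The plateau can be found mechanically. The sketch below assumes we already have the area (in pixels) of every connected component and that the filter deletes components smaller than a threshold; both function names and the sample areas are made up for illustration.

```python
def surviving(areas, threshold):
    """Number of components that survive a filter removing areas below threshold."""
    return sum(1 for a in areas if a >= threshold)


def longest_plateau(areas, max_threshold):
    """Return (first_t, last_t, count) for the longest run of thresholds
    in 1..max_threshold over which the surviving count does not change."""
    best = None
    start = 1
    prev = surviving(areas, 1)
    for t in range(2, max_threshold + 2):
        cur = surviving(areas, t) if t <= max_threshold else None
        if cur != prev:
            if best is None or (t - 1 - start) > (best[1] - best[0]):
                best = (start, t - 1, prev)
            start, prev = t, cur
    return best


areas = [1, 1, 2, 1, 2,      # specks of noise
         6, 6, 7,            # periods, dots above the "i"
         40, 45, 50, 60]     # character bodies
print([surviving(areas, t) for t in range(1, 9)])  # [12, 9, 7, 7, 7, 7, 5, 4]
print(longest_plateau(areas, 8))                   # (3, 6, 7)
```

Here any threshold from 3 to 6 removes the noise and keeps the punctuation. The sweep has to stop before the threshold reaches character size, otherwise later plateaus would win.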


auto…

Turn the horizontal smear knob until the number of components drops suddenly from about 3000 to about 600.

Character boxes have been merged into wordboxes.

Turn the horizontal smear knob further until the number of components drops from about 600 to about 100.

Wordboxes have become lineboxes.
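
The sudden drops are easy to reproduce on one synthetic text line. This sketch reduces each character box to its x-interval and counts how many groups remain once gaps up to `max_gap` are bridged; the helper names and the spacing values are assumptions chosen for illustration, not measurements from the slides.

```python
def count_after_smear(intervals, max_gap):
    """Number of groups left after merging x-intervals whose gap is <= max_gap."""
    count, reach = 0, None
    for x0, x1 in sorted(intervals):
        if reach is None or x0 - reach > max_gap:
            count += 1
            reach = x1
        else:
            reach = max(reach, x1)
    return count


def make_line(words=3, chars=3, char_w=4, char_gap=1, word_gap=5):
    """Synthetic line: character boxes with small gaps inside words, larger gaps between."""
    boxes, x = [], 0
    for _ in range(words):
        for _ in range(chars):
            boxes.append((x, x + char_w))
            x += char_w + char_gap
        x += word_gap - char_gap
    return boxes


line = make_line()
print([count_after_smear(line, g) for g in range(6)])
# [9, 3, 3, 3, 3, 1]
```

The count falls from 9 characters to 3 words as soon as the knob passes the inter-character gap, stays flat, and then falls to 1 line when it passes the inter-word gap.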


matic..

Tweak the vertical smear knob. Lines become paragraphs.

(Turn further, and paragraphs become columns).


Specify read order


Interactive functional tagging: mark subject/author/etc.? Here we attempt automatic identification of math…

Automatic math zone. This is a challenge because the zone is in two parts, containing the math … f(p)=F(p)


Docstrum / L. O’Gorman

5 nearest neighbors (ogorman93)


Example of “spectrum”

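Each point represents the distance and angle from a connected component (cc) to one of its nearest neighbors.

Finding the nearest neighbors by brute force is O(N^2) in the number of components, but not so bad.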
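
A brute-force sketch of the docstrum computation, assuming each connected component has already been reduced to a centroid `(x, y)`. The function name, the default `k`, and the choice to fold angles into (-90, 90] degrees are illustrative assumptions rather than details from the slides.

```python
import math


def docstrum(centroids, k=5):
    """Return (distance, angle_degrees) from each centroid to its k nearest neighbors.

    Brute force: every centroid is compared with every other one, hence O(N^2).
    """
    pairs = []
    for i, (xi, yi) in enumerate(centroids):
        neighbors = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centroids) if j != i
        )[:k]
        for dist, j in neighbors:
            xj, yj = centroids[j]
            angle = math.degrees(math.atan2(yj - yi, xj - xi))
            # Fold into (-90, 90]: a neighbor to the left or to the right
            # indicates the same text-line direction.
            if angle > 90:
                angle -= 180
            elif angle <= -90:
                angle += 180
            pairs.append((dist, angle))
    return pairs


# Two unskewed text lines: character pitch 10, line spacing 30.
pts = [(10 * i, 0) for i in range(6)] + [(10 * i, 30) for i in range(6)]
pairs = docstrum(pts, k=2)
print(len(pairs), min(d for d, _ in pairs))
```

With k=2 every neighbor lies on the same text line, so all angles cluster at the (zero) skew angle and the smallest distance is the character pitch.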


Statistics for skew and spacing

Set the knobs?
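
One way the statistics could set the knobs, sketched under assumptions: take (distance, angle) pairs with angles in degrees in (-90, 90], read the skew off the peak of the angle histogram, then read the character spacing from neighbors roughly along the skew direction and the line spacing from neighbors roughly perpendicular to it. The bin widths, the tolerance, and all names are hypothetical.

```python
from collections import Counter


def peak(values, bin_width):
    """Center of the most populated histogram bin."""
    bins = Counter(round(v / bin_width) for v in values)
    return bins.most_common(1)[0][0] * bin_width


def skew_and_spacing(pairs, angle_bin=1.0, dist_bin=1.0, tolerance=10.0):
    """Estimate (skew, character spacing, line spacing) from docstrum pairs."""
    skew = peak([a for _, a in pairs], angle_bin)

    def diff(a, b):
        # Angular difference modulo 180 degrees.
        d = abs(a - b) % 180
        return min(d, 180 - d)

    along = [d for d, a in pairs if diff(a, skew) <= tolerance]
    across = [d for d, a in pairs if diff(a, skew + 90) <= tolerance]
    char_spacing = peak(along, dist_bin)
    line_spacing = peak(across, dist_bin) if across else None
    return skew, char_spacing, line_spacing


# Made-up pairs: most neighbors are along a line skewed by about 2 degrees.
sample = [(10, 2.0)] * 8 + [(10.4, 1.6)] * 3 + [(30, -88.0)] * 4
print(skew_and_spacing(sample))  # (2.0, 10.0, 30.0)
```

The two spacings are natural candidates for the horizontal and vertical smear thresholds.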


Extract Lines, group to paragraphs

• Statistically close enough horizontally to be words, then lines

• Statistically close enough and parallel enough and the same length as… group two lines into the same text block.

• (arguably saving time by not deskewing; dealing with non-constant skew) Example follows..


Sections with different skew

6 business cards, nearest neighbors vectors


Extracted text lines, blocks

Useful? General?


Does Docstrum work?

• Great on this page of business cards

• An attempt to remove the assumption of most previous work that layout was “Manhattan”

• Largely skew-independent.

but

• Useless if characters are not separated


Area Voronoi Diagram (Kise)

Start with connected components

Compute area ratios from pairs of neighboring connected components

Adaptively compute thresholds of intercharacter and interline gaps


Point Voronoi diagram


Area Voronoi Diagram

• Define the distance d between a point p and a figure g to be the minimum distance of p from any point in g
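
In symbols, with the usual Euclidean norm assumed, the definition above and the Voronoi region it induces for figure g_i among figures g_1, ..., g_n can be written as follows (the notation V(g_i) is ours, not the slide's):

```latex
d(p, g) = \min_{q \in g} \lVert p - q \rVert,
\qquad
V(g_i) = \{\, p \mid d(p, g_i) \le d(p, g_j) \ \text{for all } j \ne i \,\}
```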


Computing an approximate area Voronoi diagram

• Compute the point Voronoi diagram from a sampled set of points on the boundary of each figure.

• Delete Voronoi edges generated from point-to-point on the same figure

Advantage: we are not abstracting shapes into points (centroids) or into rectangles.
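
The two steps can be imitated on a pixel grid without a computational-geometry library. In this sketch each pixel is assigned to its nearest boundary sample (a discrete point Voronoi diagram); a boundary between neighboring pixels owned by different samples counts as a point Voronoi edge, and it is kept as an area Voronoi edge only when the two samples belong to different figures. The function name and data layout are assumptions, and a real implementation would compute the diagram geometrically instead of per pixel.

```python
def area_voronoi_edges(figures, width, height):
    """figures: one list of (x, y) boundary samples per figure.

    Returns (number of point Voronoi edge crossings,
             set of pixel pairs straddling an area Voronoi edge).
    """
    samples = [(x, y, f) for f, pts in enumerate(figures) for x, y in pts]

    def owner(px, py):
        # Nearest sample point and the figure it was sampled from.
        i = min(range(len(samples)),
                key=lambda i: (samples[i][0] - px) ** 2 + (samples[i][1] - py) ** 2)
        return i, samples[i][2]

    grid = [[owner(x, y) for x in range(width)] for y in range(height)]
    point_edges, area_edges = 0, set()
    for y in range(height):
        for x in range(width):
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height:
                    (p1, f1), (p2, f2) = grid[y][x], grid[ny][nx]
                    if p1 != p2:
                        point_edges += 1      # a point Voronoi edge passes here
                        if f1 != f2:          # drop edges inside a single figure
                            area_edges.add(((x, y), (nx, ny)))
    return point_edges, area_edges


# Two vertical strokes, each sampled at two boundary points.
strokes = [[(1, 1), (1, 3)], [(7, 1), (7, 3)]]
n_point, area = area_voronoi_edges(strokes, width=9, height=5)
print(n_point, len(area))  # 14 5
```

Of the 14 point-level crossings, 9 separate two samples of the same stroke and are deleted; the 5 that remain form the vertical boundary between the two strokes.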


Example

The points don’t show here…

All we have to do now is decide which of the (many) Voronoi edges are appropriate for segmentation.


Features for selecting edges

• Delete edges in narrow spaces, because they are merely separating characters or words.

• Delete edges which divide two components of about equal area

• Delete edges that don’t form loops.

Characters in the same font but in different columns will be in different segments.

Characters, even if they are close to a (large) halftone figure, will be separated from the figure.

Find the threshold based on a frequency of distances
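
A hedged sketch of how the first two features might be combined. It assumes each Voronoi edge carries the gap between the two components it separates and their areas, and it uses two hypothetical gap thresholds: edges across narrow gaps are deleted, edges across wide gaps are kept, and in between an edge survives only when the areas differ a lot (text next to a halftone). The exact rule, the factor of 2, and `ratio_limit` are assumptions; the loop condition is not implemented here.

```python
from collections import Counter


def most_frequent_gap(gaps):
    """Most common (rounded) gap; assumed here to be the inter-character gap."""
    return Counter(round(g) for g in gaps).most_common(1)[0][0]


def keep_edge(gap, area_a, area_b, narrow, wide, ratio_limit=4.0):
    """Decide whether a Voronoi edge survives as a segment boundary."""
    if gap <= narrow:
        return False              # merely separates characters or words
    if gap > wide:
        return True               # e.g. the gutter between two columns
    ratio = max(area_a, area_b) / min(area_a, area_b)
    return ratio >= ratio_limit   # about equal area: same kind of content, delete


gaps = [2, 2, 2, 3, 2, 2, 8, 8, 25]
narrow = 2 * most_frequent_gap(gaps)   # assumption: twice the commonest gap
wide = 20                              # assumption: chosen by hand
print(keep_edge(2, 50, 55, narrow, wide))     # False: inside a word
print(keep_edge(8, 50, 52, narrow, wide))     # False: between similar text lines
print(keep_edge(8, 50, 5000, narrow, wide))   # True: text beside a large halftone
print(keep_edge(25, 50, 50, narrow, wide))    # True: column gutter
```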


Example


Area Voronoi diagram


After deleting edges


Imposing loop conditions, pasting back the text (etc).


Errors

Fragmentation

Over-merging


Impressive


Reminder: Without layout analysis

• Reading across columns

• Misplacing captions

• Misplacing footnotes

• Misunderstanding page numbers (which should be REMOVED in the reformatting process)

• Need extraction of biblio data: title, author, abstract, keywords

• Nearly every subsequent step is compromised by lack of context.


A Diversion: Separating Math from Text

• Why separate math from text?

• Types of mathematics encountered

• Previous Work

• Two approaches
  – post-processing commercial OCR
  – character-based (details!)

• Errors and their correction

• Ambiguities


Why separate math/text/images/..

• OCR programs do not work for math

[Image: a typeset formula] becomes, in TextBridge,

l~F(P)=(~ ~(P)j(P) -~7~(p)
fli

Designation as a “picture” is only a partial solution


Mathematics on a Page

Inline math is harder to pick out because it may look like italic text


Previous Work

• Isolation by hand (most math parser papers)

• Texture/statistics based heuristics
  – useful for display math “paragraphs”
  – not useful for in-line math

• Character based pseudo-parsing (but without font information or true parsing feedback)

• Incomplete


Proposal: Post-Processing of OCR

• Start with commercial best-effort recognition

• Reprocess the intermediate data structure (e.g. for TextBridge, the XDOC file)

• Accept recognition of text zones with high recognition certainty. (Lines with no errors surrounded by lines with no errors are considered solved)
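
The "solved" rule is simple enough to sketch. It assumes we have one error (or low-confidence) count per text line, however the OCR output reports it, and that a line at the top or bottom of the zone is judged only against the neighbor it has. The function name and input format are hypothetical; nothing here reads an actual XDOC file.

```python
def solved_lines(error_counts):
    """Flag line i as solved when it and its immediate neighbors have zero errors."""
    solved = []
    for i in range(len(error_counts)):
        neighborhood = error_counts[max(0, i - 1): i + 2]
        solved.append(all(e == 0 for e in neighborhood))
    return solved


print(solved_lines([0, 0, 0, 2, 0, 0]))
# [True, True, False, False, False, True]
```

The clean lines next to the bad one stay unsolved, which matches the idea of keeping nearby "certain" lines in the uncertain region.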


Separate uncertain areas

• Re-consider “the rest of the image” as potential mathematics zones: uncertain regions (including nearby “certain” characters/lines)

• Isolate characters, identify fonts, etc.

• Play out heuristic rules for separating text and math zones.

• Consider eradicating math and re-submitting text; separately recognizing math and reinserting in XDOC


Alternatively, Starting from our own naïve OCR

• Connected component recognition

• Separate characters by initial classification

• Repeatedly re-examine via rules

• Determine text zones, remove math / feed remainder to commercial OCR
  – How best to blank-out math? XXX

• Most likely human interaction remains


Two bags: Math vs Text

• Initially Math
  – + - = / Greek, scientific symbols, 0-9, italics, bold, (), [], sin, cos, tan, dots, commas, decimal points

• Initially Text
  – Roman Letters, junk
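
A first-pass sorter along these lines might look like the sketch below. It assumes each recognized token comes with italic and bold flags from the classifier; the symbol sets are a small illustrative subset (no scientific symbols beyond Greek letters), and the function name is hypothetical.

```python
MATH_SYMBOLS = set("+-=/()[].,0123456789")
MATH_WORDS = {"sin", "cos", "tan"}
GREEK = set("αβγδεζηθικλμνξοπρστυφχψω" "ΓΔΘΛΞΠΣΦΨΩ")


def initial_bag(token, italic=False, bold=False):
    """First-pass guess for one recognized token: 'math' or 'text'."""
    if italic or bold or token in MATH_WORDS:
        return "math"
    if token and all(ch in MATH_SYMBOLS or ch in GREEK for ch in token):
        return "math"
    return "text"   # Roman letters and anything unrecognized (junk)


for tok in ["+", "3.14", "the", "sin", "π", "."]:
    print(tok, initial_bag(tok))
print("x", initial_bag("x", italic=True))
```

Note that a sentence-ending period lands in the math bag at this stage, so the first pass deliberately over-collects math.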


Sample Text Bag


Sample Math Bag


Second Pass

• Correct for too much Math

• Grow “clumps” (expand bounding boxes, BBs) to categorize
  – 3.14159 vs “end of sentence.”
  – (comment) vs f(x)
  – hyphen-words vs x^2 - y^2
  – horizontal lines generally
  – isolated 1 or is it l “ell” or I “eye”

• “bags” or “zones” of geometric-relation boxes containing either words or potential math
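
One such correction can be sketched as a neighbor rule: punctuation that was put in the math bag goes back to text when both of its neighbors are text. This stands in for the geometric clumping on the slide, using token order in place of bounding-box adjacency; the function name, the punctuation set, and the parallel-list format are assumptions.

```python
AMBIGUOUS = {".", ",", "-", "(", ")"}


def second_pass(tokens, bags):
    """Move ambiguous punctuation from the math bag back to text when both
    neighbors are text. tokens and bags are parallel lists; bags hold
    'math' or 'text'. Missing neighbors at the ends count as text."""
    fixed = list(bags)
    for i, tok in enumerate(tokens):
        if bags[i] != "math" or tok not in AMBIGUOUS:
            continue
        left = bags[i - 1] if i > 0 else "text"
        right = bags[i + 1] if i + 1 < len(bags) else "text"
        if left == "text" and right == "text":
            fixed[i] = "text"   # isolated punctuation between words
    return fixed


tokens = ["end", ".", "Then", "3", ".", "14"]
bags = ["text", "math", "text", "math", "math", "math"]
print(second_pass(tokens, bags))
# ['text', 'text', 'text', 'math', 'math', 'math']
```

The period after "end" returns to text, while the decimal point inside 3.14 stays with the digits around it.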


Importance of Context

Here are 12 L’s and a 1


Third Pass

• Too much is in the text bag now
  – blur the math to allow for embedded Roman text like “sin” or “l”

• Re-clump the mathematics to see if new bridges have been formed

• Some italics in the math bag may be really
  – English words in theorems
  – emphasized text


On Ambiguity and Correctness

• Can we find the math in ad - bc by ad hoc methods?

• If we are unable to disambiguate English words, why should we be able to disambiguate mathematics?

• Abuse of mathematical notation is widespread: can we insist that new papers either have a non-ambiguous notation or an underlying electronic non-ambiguous notation?


Conclusions

• We can make a first cut on separating math from text

• If we wish to “enliven” math publication with semantic underpinnings, this may help in their production

• Incorporating AI rule-based transformations, as well as hand correction, is likely to be important