Scanned Documents INST 734 Module 10 Doug Oard

Jan 17, 2018


Scanned Documents

INST 734
Module 10
Doug Oard

Agenda

• Document image retrieval

• Representation

• Retrieval

Thanks to David Doermann for most of these slides

Images

• A collection of dots called “pixels”
  – Pixels often binary-valued (black, white)
    • Greyscale or color is sometimes needed
  – Arranged in a grid and called a “bitmap”
    • 300 dots per inch (dpi) gives good results
  – Often stored in TIFF or PDF format

• Images are fairly large (~1 MB per page)

• “Content” is in the relation between pixels
  – Image analysis seeks to mimic human visual behavior
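The "~1 MB per page" figure can be checked with a little arithmetic. The sketch below assumes a US Letter page (8.5 × 11 inches) stored uncompressed; the page size and the helper name `bitmap_bytes` are assumptions made for illustration and are not given on the slide.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page stored as a bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size: US Letter, 8.5 x 11 inches, scanned at 300 dpi.
binary_page = bitmap_bytes(8.5, 11, 300, bits_per_pixel=1)
grey_page = bitmap_bytes(8.5, 11, 300, bits_per_pixel=8)

print(binary_page)  # 1051875 bytes, roughly 1 MB
print(grey_page)    # 8415000 bytes, roughly 8.4 MB
```

Under these assumptions a binary-valued page comes to about 1 MB, and an 8-bit greyscale scan of the same page is eight times larger before compression.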

[Figure: example grid of greyscale pixel values]
10 27 33 29
27 34 33 54
54 47 89 60
25 35 43  9
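As a rough illustration of the difference between greyscale and binary-valued pixels, the sketch below thresholds the pixel values shown on the slide. Reading the values as a 4 × 4 grid, treating lower values as darker, and the threshold of 40 are all assumptions made for this example.

```python
# Assumed 4x4 reading of the pixel values shown on the slide.
GREY = [
    [10, 27, 33, 29],
    [27, 34, 33, 54],
    [54, 47, 89, 60],
    [25, 35, 43, 9],
]


def binarize(grid, threshold):
    """Map each greyscale value to 1 (ink) if it is below the threshold, else 0."""
    return [[1 if value < threshold else 0 for value in row] for row in grid]


# The threshold is arbitrary; real systems usually choose it from the image itself.
for row in binarize(GREY, 40):
    print(row)
```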

Document Image Analysis

[Figure: document image analysis pipeline, annotated with target processing speed in seconds. Diagram labels: Document Images, Enhancement, Page Classification, Images w/o Text, Images w/ Text, Page Decomposition, Segmentation, Machine, Hand, Graphics, Noise, Handprint Line Detection, Zone Labeling, Signature Detection, Stamp and Logo Detection, Layout Similarity, Query Documents, Ranked Results, Genre Classification, Class Results. Target speeds shown: < .5, .25-3, 1-3, 1-3]

Page Analysis

• Skew correction
  – Based on finding the primary orientation of lines

• Image and text region detection
  – Based on texture and dominant orientation

• Structural classification
  – Infer logical structure from physical layout

• Text region classification
  – Title, author, letterhead, signature block, etc.
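One common way to find the primary orientation of text lines is a projection profile search. The slides do not say which method is used, so the sketch below is an assumed stand-in: it tries a handful of candidate angles and keeps the one that concentrates the ink pixels into the fewest rows. The function name `estimate_skew` and the synthetic page are hypothetical.

```python
import math


def estimate_skew(ink_pixels, candidate_degrees):
    """Return the candidate angle whose projection profile is sharpest."""
    best_angle, best_score = None, -1
    for degrees in candidate_degrees:
        theta = math.radians(degrees)
        profile = {}
        for x, y in ink_pixels:
            # Row the pixel would fall on if the page were deskewed by this angle.
            row = round(y * math.cos(theta) - x * math.sin(theta))
            profile[row] = profile.get(row, 0) + 1
        # Text lines aligned with the rows give a few tall peaks, so the sum
        # of squared counts tends to be largest near the true skew angle.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = degrees, score
    return best_angle


# Synthetic page (an assumption for the demo): three text lines, each drawn
# as a row of ink pixels sloping at 3 degrees.
slope = math.tan(math.radians(3))
page = [(x, round(y0 + x * slope)) for y0 in (20, 60, 100) for x in range(200)]

print(estimate_skew(page, range(-5, 6)))
```

Once the angle is known, the page can be rotated by the opposite amount before region detection and OCR.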

Page Layer Segmentation

Image Detection

Text Region Detection

Application to Page Segmentation

[Figure labels: Printed text, Handwriting, Noise]

Language Identification

• Language-independent skew detection
  – Accommodate horizontal and vertical writing

• Script class recognition
  – Asian scripts have blocky characters
  – Connected scripts can’t be segmented easily

• Language identification
  – Shape statistics work well for Western languages
  – Competing classifiers work for Asian languages

Optical Character Recognition

• Pattern-matching approach
  – Standard approach in commercial systems
  – Segment individual characters
  – Recognize using a neural network classifier

• Hidden Markov model approach
  – Experimental approach
  – Segment into sub-character slices
  – Limited lookahead to find best character choice
  – Useful for connected scripts (e.g., Arabic)

OCR Accuracy Problems

• Character segmentation errors
  – In English, segmentation often changes “m” to “rn”

• Character confusion
  – Characters with similar shapes often confounded

• OCR on copies is much worse than on originals
  – Pixel bloom, character splitting, binding bend

• Uncommon fonts can cause problems
  – If not used to train the neural network character recognizers
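The "m" versus "rn" confusion can be illustrated with a small lexicon lookup. This is a minimal sketch, not the method used by any particular OCR system: the confusion pairs, the tiny lexicon, and the function names are all hypothetical.

```python
# Hypothetical confusion pairs: (what the OCR produced, what was probably printed).
CONFUSIONS = [("rn", "m"), ("m", "rn"), ("1", "l"), ("0", "o")]

# Hypothetical word list standing in for a real lexicon.
LEXICON = {"modern", "image", "burn", "scanned"}


def variants(word, confusions):
    """Yield every string obtained by undoing one confusion at one position."""
    for seen, intended in confusions:
        start = word.find(seen)
        while start != -1:
            yield word[:start] + intended + word[start + len(seen):]
            start = word.find(seen, start + 1)


def correct(word, lexicon, confusions):
    """Return the word itself if known, else the first known variant, else the word."""
    if word in lexicon:
        return word
    for candidate in variants(word, confusions):
        if candidate in lexicon:
            return candidate
    return word  # leave unknown words unchanged


print(correct("rnodern", LEXICON, CONFUSIONS))  # modern
print(correct("irnage", LEXICON, CONFUSIONS))   # image
print(correct("burn", LEXICON, CONFUSIONS))     # burn (already a known word)
```

Checking the lexicon first matters: "burn" is a real word, and rewriting its "rn" to "m" would introduce an error rather than remove one.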

Improving OCR Accuracy

• Image preprocessing
  – Mathematical morphology for bloom and splitting
  – Particularly important for degraded images

• “Voting” between several OCR engines helps
  – Individual systems depend on specific training data

• Linguistic analysis can correct some errors
  – Use confusion statistics, word lists, syntax, …
  – But more harmful errors might be introduced
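Voting between engines can be sketched as a per-character majority vote. The example assumes the engine outputs are already aligned and of equal length, which is a strong simplification: real outputs differ in length (for example "m" versus "rn") and need an alignment step first. The function name `vote` and the sample strings are hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per character position over equal-length OCR outputs."""
    if len({len(text) for text in outputs}) != 1:
        raise ValueError("this sketch needs outputs of equal length")
    result = []
    for chars in zip(*outputs):
        # Pick the character that most engines agree on at this position.
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)


# Three hypothetical engines, each wrong in a different place.
print(vote(["scanned", "scanncd", "5canned"]))  # scanned
```

Voting helps most when the engines make different mistakes, which is plausible when each was trained on different data.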

Logical Page Analysis (Reading Order)

• Can be hard to guess in some cases
  – Newspaper columns, figure captions, appendices, …

• Sometimes there are explicit guides
  – “Continued on page 4” (but page 4 may be big!)

• Structural cues can help
  – Column 1 might continue to column 2

• Content analysis is also useful
  – Word co-occurrence statistics, syntax analysis
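The structural cue that column 1 continues into column 2 can be sketched as a sort over text blocks. This is a deliberately naive illustration: the block fields (`x`, `y`, `text`), the fixed column width, and the function name are assumptions, and it ignores captions, headlines that span columns, and the other hard cases listed above.

```python
def reading_order(blocks, column_width):
    """Order text blocks column by column, top to bottom within each column."""
    return sorted(blocks, key=lambda b: (b["x"] // column_width, b["y"]))


# Hypothetical blocks from a two-column page; x and y are the top-left
# corner of each block in pixels.
blocks = [
    {"text": "column 2, top", "x": 320, "y": 40},
    {"text": "column 1, bottom", "x": 20, "y": 400},
    {"text": "column 1, top", "x": 22, "y": 40},
]

for block in reading_order(blocks, column_width=300):
    print(block["text"])
```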

Agenda

• Document image retrieval

• Representation

• Retrieval

Thanks to David Doermann for most of these slides