UC Berkeley CS294-9 Fall 2000 5- 1 Document Image Analysis Lecture 5: Metrics Richard J. Fateman Henry S. Baird University of California – Berkeley Xerox Palo Alto Research Center
Dec 24, 2015


Logan Burke
UC Berkeley CS294-9 Fall 2000 5- 1

Document Image Analysis
Lecture 5: Metrics

Richard J. Fateman, University of California – Berkeley
Henry S. Baird, Xerox Palo Alto Research Center

UC Berkeley CS294-9 Fall 2000 5- 2

The course so far…

• Reminder: All course materials are online: http://www-inst.eecs.berkeley.edu/~cs294-9/
• Overview of the DIA Research Field
• Some applications (Postal Addresses, Checks)
• Research Objectives: more systematic modeling, design
• Some basic engineering

UC Berkeley CS294-9 Fall 2000 5- 3

How well are we doing?

• Cost to achieve a useful result
• Compare digital version to
  – hand keying / digitizing
  – verification
  – correction
• Correction cost may dominate total system cost

UC Berkeley CS294-9 Fall 2000 5- 4

When is a result nearly correct?

• Character model
  – Correct
  – Reject
  – Error
• String model
  – Insertion
  – Deletion
  – Rejection
  – Substitution [wrong letter identification]

UC Berkeley CS294-9 Fall 2000 5- 5

Using ASCII character labels

ABCDEFGHIJKL = s1
ACD~~OIIUKL = s2

• Insert B after A in s2
• Substitute E for ~, F for ~ [~ = reject]
• Substitute G for O in s2
• Substitute H for I in s2
• Substitute J for U … etc.

(Really H was recognized as II, and IJ was recognized as U.)

UC Berkeley CS294-9 Fall 2000 5- 6

ASCII labels are inadequate

• Unicode +
• Font +
• Point size +
• Tag information <author> .. </author>

UC Berkeley CS294-9 Fall 2000 5- 7

Simple measures may mislead

$$\text{error rate} = \frac{\#\text{errors}}{\#\text{characters} - \#\text{rejected}} \times 100\%$$

Increase the rejection rate and this “error rate” decreases. Reject all characters to get 0/0?

Some applications (e.g. post office) force very low error, even if (low confidence) correct results are sometimes rejected.
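The effect of rejection on this measure can be sketched in a few lines of Python. This is only an illustration: the function name `error_rate` and all counts are made up for the example, and the measure is taken to be errors divided by the characters that were not rejected.

```python
def error_rate(n_chars, n_errors, n_rejected):
    """Percent error among the characters the recognizer did not reject.

    Returns None when every character was rejected (the 0/0 case).
    """
    accepted = n_chars - n_rejected
    if accepted == 0:
        return None  # 0/0: the rate is undefined, not zero
    return 100.0 * n_errors / accepted


# Hypothetical counts for a 1000-character page.
print(error_rate(1000, 50, 0))     # nothing rejected
print(error_rate(1000, 10, 100))   # low-confidence characters rejected
print(error_rate(1000, 0, 1000))   # everything rejected: undefined
```

A recognizer can drive this number down simply by rejecting more, which is why the reject count should be reported alongside it.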

UC Berkeley CS294-9 Fall 2000 5- 8

Some errors are acceptable

• Keyword search still works if the key word occurs many times and is only occasionally rejected
• Erroneous (nonsense) words are unlikely to be found by a search
• Caveat: if a key word is consistently changed to a nearby word, it may be missed (e.g. if "dumptruck" is always read as "durnptruck", a search for "dumptruck" never finds it.)

UC Berkeley CS294-9 Fall 2000 5- 9

Example: UNLV-ISRI document collection

• 20 million pages of scientific, legal, official memos from DOE and contractors
  – Rock mining
  – Maps
  – Safe transportation of nuclear waste
  – Average length 44 pages

UC Berkeley CS294-9 Fall 2000 5- 10

Example: UNLV-ISRI document collection

• DOE’s Licensing Support System Prototype
  – 104,000 page images, 2,600 documents
  – Manually typed “correct” text
  – OCR text
• To determine relevance to queries, 3 methods used
  – Geology students ranking (0/1)
  – OCR keyword search
  – “correct” text search

UC Berkeley CS294-9 Fall 2000 5- 11

Example: UNLV-ISRI document collection

• Exact match on 71 queries
  – 632 returned by correct text
  – 617 returned by OCR
  – Essentially: OCR is OK for this application.
• Probabilistic ranking / frequency:
  – Excessive OCR errors affected ranking
  – On average, similar results
• Feedback on relevance was not helpful for poor OCR
• Benchmarking: similar relevance = good results

UC Berkeley CS294-9 Fall 2000 5- 12

Example: UNLV-ISRI document collection

One surprising result is that for some standard tests of precision and recall, processing OCR did better than actual text.

[Crummy OCR meant that some terms were not recognized; but the documents were irrelevant….]
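How OCR text can score better than clean text is easier to see with the standard definitions written out. The sketch below uses the usual set-based precision and recall; the document IDs are invented for illustration and are not taken from the UNLV-ISRI study.

```python
def precision_recall(returned, relevant):
    """Set-based precision and recall for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: documents 1-3 are relevant, 4 and 5 are not.
relevant = {1, 2, 3}
clean_text_hits = {1, 2, 3, 4, 5}   # term also occurs in irrelevant docs 4, 5
ocr_hits = {1, 2, 3, 4}             # OCR garbled the term in doc 5

print(precision_recall(clean_text_hits, relevant))
print(precision_recall(ocr_hits, relevant))
```

In this made-up case the OCR error happens to hide an irrelevant document, so precision rises while recall is unchanged.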

UC Berkeley CS294-9 Fall 2000 5- 13

A theory for computing accuracy

• Consider the result of OCR to be a string
  – Idealization: most common errors involve mis-counting the number of spaces!
  – Ignores size/font/absolute position etc.

UC Berkeley CS294-9 Fall 2000 5- 14

Computing the shortest edit distance

• Bio-informatics sequencing
• Associate a cost for each correspondence. For example,
  – Match or substitute (cost 0 or 1)
  – Insert or delete (cost 2)

UC Berkeley CS294-9 Fall 2000 5- 15

Attempt to align AUGGAA to ACUGAUGUGA. Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by a solid line connecting cells. It corresponds to the following alignment:

ACUGAUGUGA
A-UG--G-AA

[explain dynamic programming here?]

[Figure: dynamic programming matrix with AUGGAA on the rows and ACUGAUGUGA on the columns; cell values not recoverable.]
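The dynamic programming behind such an alignment can be sketched as follows. This is a minimal illustration using the costs stated on the slide (match 0, substitute 1, insert or delete 2); the function name `align` and the tie-breaking order in the traceback are choices made for this example, so the alignment it returns may be a different optimal path from the one drawn in the figure.

```python
def align(a, b, sub_cost=1, indel_cost=2):
    """Return (cost, aligned_a, aligned_b) for a minimum-cost alignment."""
    n, m = len(a), len(b)
    # d[i][j] = cheapest way to align a[:i] with b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j - 1] + s,        # match / substitute
                          d[i - 1][j] + indel_cost,   # gap in b
                          d[i][j - 1] + indel_cost)   # gap in a

    # Trace back from the bottom-right cell to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else sub_cost
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + indel_cost:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return d[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


cost, top, bottom = align("AUGGAA", "ACUGAUGUGA")
print(cost)
print(bottom)
print(top)
```

The alignment shown on the slide has four gaps (cost 2 each) and one substitution (cost 1), for a total of 9, which is also the minimum this sketch computes. The table has (n+1)(m+1) cells and each is filled in constant time.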

UC Berkeley CS294-9 Fall 2000 5- 16

Computing the shortest edit distance

• Also useful for other tasks (recognizing speech)

• Many ways to organize the dynamic programming, but still O(n^2).

• Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding and the of … etc.)
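One possible reading of "accuracy on non-stopwords" is sketched below. Everything here is an assumption made for illustration: the tiny stopword list, whitespace tokenization, and the choice to count a reference word as correct when it survives in a longest common subsequence of the two word lists.

```python
STOPWORDS = {"and", "the", "of", "a", "an", "in", "to"}  # illustrative only


def content_words(text):
    """Lower-cased whitespace tokens with stopwords removed."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def lcs_length(xs, ys):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        cur = [0]
        for j, y in enumerate(ys, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def word_accuracy(truth, ocr):
    """Fraction of non-stopword reference words recovered, in order, by OCR."""
    ref = content_words(truth)
    if not ref:
        return None  # nothing to score
    return lcs_length(ref, content_words(ocr)) / len(ref)


print(word_accuracy("the cost of the dump truck",
                    "the cost of tbe durnp truck"))
```

In this made-up example "dump" is lost, so two of the three content words are recovered; the misread stopword "tbe" does not lower the score because it is not a reference word.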

UC Berkeley CS294-9 Fall 2000 5- 17

Correct Zoning is essential

• Read order in multi-column pages

• How to compare competing programs on performance of repeated headers

• What to do with figures, logos.

[Figure: two sets of page zones, each numbered 1 to 6.]

UC Berkeley CS294-9 Fall 2000 5- 18

Document Attribute Format Specification : DAFS

``While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding.

There are three storage formats: DAFS-Unicode, DAFS-ASCII and a more compact DAFS-Binary form.

DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA.''

www.raf.com, Illuminator, UW CDRoms (English and Japanese)

UC Berkeley CS294-9 Fall 2000 5- 19

DAFS vs SGML

• DAFS = SGML + Unicode + CCITT Fax4
• SGML requires DTD (document type definition)
• SGML is intended for structure, not appearance (e.g. not bold, italic)
• Images which accidentally contain ASCII version of <tag> can be problematical
  – Solved by putting images in separate files!

UC Berkeley CS294-9 Fall 2000 5- 20

Perfect results: how to obtain ground truth?

• Painfully enter it by hand, or
• Painfully correct OCR results, or
• Compute some kind of average of OCR programs
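One way to read "some kind of average" is a per-character majority vote across several recognizers. The sketch below is only one such scheme and rests on a strong assumption: the outputs are already aligned to the same length. The name `vote` and the use of "~" for positions without a majority are choices made for this example.

```python
from collections import Counter

REJECT = "~"  # marker for positions where no character wins a majority


def vote(outputs):
    """Per-position majority vote over equal-length OCR outputs."""
    if len({len(s) for s in outputs}) != 1:
        raise ValueError("outputs must already be aligned to equal length")
    result = []
    for column in zip(*outputs):
        char, count = Counter(column).most_common(1)[0]
        # Keep the character only if more than half the programs agree.
        result.append(char if count > len(outputs) / 2 else REJECT)
    return "".join(result)


print(vote(["cat", "cot", "cat"]))
print(vote(["abc", "abd", "abe"]))
```

A voted string is still not guaranteed to be correct, since the programs can agree on the same mistake, so some manual checking would remain.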

UC Berkeley CS294-9 Fall 2000 5- 21

Perfect ground truth: a synthetic approach

• (Kanungo, UMD): start with TeX,
  – produce the ground truth for layout from TeX,
  – Extract character positions, glyphs by analyzing DVI files
  – This provides essentially every bit position of each character.

UC Berkeley CS294-9 Fall 2000 5- 22

Ground truth

• Next, commit to paper:
  – Print the DVI files
  – Scan a calibration page
  – Compute parameters of 2D→2D transformations T imposed by physics
  – Scan the printout
  – Align the page
  – Run the recognizer
  – Compare reported positions (· T^-1) to correct ones
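The final comparison step can be sketched as follows, under the assumption that T is an affine map. The slide only says "2D→2D transformations", so the affine form, the parameter layout `(a, b, c, d, tx, ty)`, and the function names are all hypothetical choices for this example.

```python
import math


def apply_affine(t, p):
    """Map point p = (x, y) through t = (a, b, c, d, tx, ty)."""
    a, b, c, d, tx, ty = t
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)


def invert_affine(t):
    """Return the parameters of the inverse affine map."""
    a, b, c, d, tx, ty = t
    det = a * d - b * c
    if det == 0:
        raise ValueError("transformation is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return (ia, ib, ic, id_, -(ia * tx + ib * ty), -(ic * tx + id_ * ty))


def position_error(t, reported, correct):
    """Distance between a reported position mapped back by T^-1 and the truth."""
    x, y = apply_affine(invert_affine(t), reported)
    return math.hypot(x - correct[0], y - correct[1])


# Hypothetical T from the calibration page: scale by 2, shift by (3, -1).
T = (2.0, 0.0, 0.0, 2.0, 3.0, -1.0)
print(apply_affine(T, (1.0, 1.0)))
print(position_error(T, (5.0, 1.0), (1.0, 1.0)))
```

In practice the parameters of T would be fitted from the scanned calibration page; that fitting step is not shown here.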

UC Berkeley CS294-9 Fall 2000 5- 23

Change of Pace

• Assignment 1
  – What does it mean to write a program?
    • Documentation
    • Demo
    • Instructions for use
    • (perhaps optional)
  – Extensions, limitations, discussion

• Discussion questions