UC Berkeley CS294-9 Fall 2000, 5-1. Document Image Analysis, Lecture 5: Metrics. Richard J. Fateman, Henry S. Baird. University of California – Berkeley, Xerox.

UC Berkeley CS294-9 Fall 2000 5-1

Document Image Analysis
Lecture 5: Metrics

Richard J. Fateman, Henry S. Baird

University of California – Berkeley, Xerox Palo Alto Research Center


The course so far…

• Reminder: All course materials are online: http://www-inst.eecs.berkeley.edu/~cs294-9/

• Overview of the DIA Research Field

• Some applications (Postal Addresses, Checks):

• Research Objectives: more systematic modeling, design

• Some basic engineering


How well are we doing?

• Cost to achieve a useful result

• Compare digital version to
  – hand keying/digitizing
  – verification
  – correction

• Correction cost may dominate total system cost


When is a result nearly correct?

• Character model
  – Correct
  – Reject
  – Error

• String model
  – Insertion
  – Deletion
  – Rejection
  – Substitution [wrong letter identification]


Using ascii character labels

ABCDEFGHIJKL = s1
ACD~~OIIUKL = s2

Insert B after A in s2
Substitute E for ~, F for ~ [~ = reject]
Substitute G for O in s2
Substitute H for I in s2
Substitute I for U … etc.
(Really, H was recognized as II, and IJ was recognized as U.)
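The correspondence between s2 and s1 can be approximated mechanically. The following is a minimal sketch, assuming Python's standard difflib module is an acceptable stand-in: SequenceMatcher pairs up matching blocks and does not guarantee a minimum-cost edit script, so the operations it reports may group characters differently from the list above. The helper name edit_ops is hypothetical.

```python
from difflib import SequenceMatcher

def edit_ops(recognized, truth):
    """List the non-matching operations that turn `recognized` into `truth`.

    Each entry is (tag, recognized_segment, truth_segment), where tag is
    "insert", "delete" or "replace" as reported by difflib.
    """
    ops = []
    matcher = SequenceMatcher(None, recognized, truth, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, recognized[i1:i2], truth[j1:j2]))
    return ops

s1 = "ABCDEFGHIJKL"  # ground truth
s2 = "ACD~~OIIUKL"   # recognizer output; "~" marks a reject

for tag, got, want in edit_ops(s2, s1):
    print(tag, repr(got), "->", repr(want))
```

Note that a character-level script like this cannot see that II was really one H, or that U was really IJ; it only reports substitutions, insertions and deletions.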


Ascii labels are inadequate

• Unicode +

• Font +

• Point size +

• Tag information <author> .. </author>


Simple measures may mislead

Increase the rejection rate and this “error rate” decreases. Reject all characters to get 0/0?

Some applications (e.g. post office) force very low error, even if (low confidence) correct results are sometimes rejected.

$$\text{error rate} = \frac{\#\,\text{errors}}{\#\,\text{characters} - \#\,\text{rejected}} \times 100\%$$
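A small sketch of why this measure can mislead, assuming the "error rate" is the number of errors divided by the number of non-rejected characters, times 100. The function name and all counts below are hypothetical, chosen only to illustrate the effect.

```python
def error_rate(n_chars, n_errors, n_rejected):
    """Percent of accepted (non-rejected) characters that are wrong.

    Returns None when every character is rejected (the 0/0 case).
    """
    accepted = n_chars - n_rejected
    if accepted == 0:
        return None
    return 100.0 * n_errors / accepted

# Hypothetical recognizer on 1000 characters.
print(error_rate(1000, 20, 0))     # no rejects: 20 errors among 1000 accepted
print(error_rate(1000, 5, 50))     # rejecting 50 doubtful characters removes 15 errors
print(error_rate(1000, 0, 1000))   # reject everything: undefined, reported as None
```

The second call reports a lower error rate than the first even though the recognizer has delivered fewer usable characters, which is why the rejection rate should be reported alongside the error rate.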


Some errors are acceptable

• Keyword search: if the key word occurs many times and is only occasionally rejected, the document is still found

• Erroneous (nonsense) words are unlikely to be found by a search

• Caveat: if a key word is consistently changed to a nearby word, it may be missed (e.g. "dumptruck" is always read as "durnptruck", so a search for it never finds it.)


Example: UNLV-ISRI document collection

• 20 million pages of scientific, legal, official memos from DOE and contractors
  – Rock mining
  – Maps
  – Safe transportation of nuclear waste
  – Average length 44 pages



• DOE’s Licensing Support System Prototype
  – 104,000 page images, 2,600 documents
  – Manually typed “correct” text
  – OCR text

• To determine relevance to queries, 3 methods used
  – Geology students ranking (0/1)
  – OCR keyword search
  – “correct” text search



• Exact match on 71 queries.
  – 632 returned by correct text
  – 617 returned by OCR.
  – Essentially: OCR is OK for this application.

• Probabilistic ranking / frequency:
  – Excessive OCR errors affected ranking
  – On average, similar results

• Feedback on relevance was not helpful for poor OCR

• Benchmarking: similar relevance = good results



One surprising result is that for some standard tests of precision and recall, processing OCR did better than actual text.

[Crummy OCR meant that some terms were not recognized; but the documents were irrelevant….]
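To make the precision/recall remark concrete, here is a minimal sketch using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The document IDs and the function name are hypothetical and are not taken from the UNLV-ISRI experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3}          # documents judged relevant to the query
clean_hits = {1, 2, 3, 4}     # the correct text also matches irrelevant document 4
ocr_hits = {1, 2, 3}          # OCR garbled the term in document 4, so it is not returned

print(precision_recall(clean_hits, relevant))  # (0.75, 1.0)
print(precision_recall(ocr_hits, relevant))    # (1.0, 1.0)
```

In this toy case the OCR text scores higher precision at equal recall, purely because the missed term sat in an irrelevant document.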


A theory for computing accuracy

• Consider the result of OCR to be a string
  – Idealization: most common errors involve mis-counting the number of spaces!
  – Ignores size/font/absolute position etc.


Computing the shortest edit distance

• Bio-informatics sequencing

• Associate a cost for each correspondence. For example,
  – Match or substitute (cost 0 or 1)
  – Insert or delete (cost 2)

http://rrna.uia.ac.be/~peter/doctoraat/ali.html


Attempt to align AUGGAA to ACUGAUGUGA. Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by a solid line connecting cells. It corresponds to the following alignment:

ACUGAUGUGA
A-UG--G-AA

[explain dynamic programming here?]

[Figure: dynamic programming matrix with AUGGAA labeling the rows and ACUGAUGUGA labeling the columns]
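The alignment of AUGGAA to ACUGAUGUGA can be reproduced with a short dynamic program. This is a minimal sketch, assuming the costs stated for the example (match 0, substitute 1, insert or delete 2); the function name align is hypothetical. Several optimal alignments exist, so the traceback may print a different one from the alignment given in the text, at the same total cost.

```python
def align(s, t, sub_cost=1, gap_cost=2):
    """Return (cost, aligned_s, aligned_t) for a minimum-cost global alignment."""
    m, n = len(s), len(t)
    # d[i][j] = cheapest way to align s[:i] with t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * gap_cost
    for j in range(1, n + 1):
        d[0][j] = j * gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            d[i][j] = min(diag, d[i - 1][j] + gap_cost, d[i][j - 1] + gap_cost)

    # Trace one optimal path back from the bottom-right cell.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        step = 0 if (i > 0 and j > 0 and s[i - 1] == t[j - 1]) else sub_cost
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + step:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap_cost:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return d[m][n], "".join(reversed(top)), "".join(reversed(bottom))

cost, row_s, row_t = align("AUGGAA", "ACUGAUGUGA")
print(cost)   # 9: four gaps at cost 2 plus one substitution
print(row_t)
print(row_s)
```

The table has (m+1)(n+1) cells and each cell is filled in constant time, which is where the quadratic running time comes from.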


Computing the shortest edit distance

• Also useful for other tasks (recognizing speech)

• Many ways to organize the dynamic programming, but all are still O(n^2).

• Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding "and", "the", "of", etc.)
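Word accuracy can be sketched the same way at the token level. The version below is an illustration under stated assumptions, not a standard tool: it defines word accuracy as the fraction of ground-truth words that survive in order in the OCR output (longest common subsequence of the two word lists), and the stopword set is a tiny hypothetical list.

```python
STOPWORDS = {"and", "the", "of", "a", "to", "in"}  # tiny illustrative list

def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def word_accuracy(truth, ocr, skip_stopwords=False):
    """Fraction of ground-truth words matched in order by the OCR text."""
    def tokens(text):
        words = text.lower().split()
        return [w for w in words if not (skip_stopwords and w in STOPWORDS)]
    t, o = tokens(truth), tokens(ocr)
    if not t:
        return None  # nothing to score
    return lcs_length(t, o) / len(t)

truth = "the safe transportation of nuclear waste"
ocr = "tbe safe transportation of nuclear wasle"
print(word_accuracy(truth, ocr))                       # 4 of 6 words correct
print(word_accuracy(truth, ocr, skip_stopwords=True))  # 3 of 4 content words correct
```

Dropping stopwords changes the score because errors in words like "the" stop counting, while errors in content words like "waste" weigh more.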


Correct Zoning is essential

• Read order in multi-column pages

• How to compare competing programs on performance of repeated headers

• What to do with figures, logos.

[Figure: two page layouts, each with zones numbered 1 to 6]


Document Attribute Format Specification : DAFS

"While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding.

There are three storage formats: DAFS-Unicode, DAFS-ASCII and a more compact DAFS-Binary form.

DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA."

www.raf.com, Illuminator, UW CDRoms (English and Japanese)

http://palimpsest.stanford.edu/bytopic/imaging/std/dafsdrft.html

http://www.raf.com/


DAFS vs SGML

• DAFS = SGML + Unicode + CCITT Group 4 fax

• SGML requires DTD (document type definition)

• SGML is intended for structure, not appearance (e.g. not bold, italic)

• Images which accidentally contain an ASCII version of <tag> can be problematic
  – Solved by putting images in separate files!


Perfect results: how to obtain ground truth?

• Painfully enter it by hand, or

• Painfully correct OCR results, or

• Compute some kind of average of OCR programs


Perfect ground truth: a synthetic approach

• (Kanungo, UMD): start with TeX,
  – produce the ground truth for layout from TeX,
  – extract character positions, glyphs by analyzing DVI files
  – This provides essentially every bit position of each character.


Ground truth

• Next, commit to paper:
  – Print the DVI files
  – Scan a calibration page
  – Compute parameters of 2D→2D transformations T imposed by physics
  – Scan the printout
  – Align the page
  – Run the recognizer
  – Compare reported positions (· T⁻¹) to correct ones
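The last step, mapping reported positions back through the inverse of T before comparing, can be sketched as follows. This assumes T is a 2D affine map (scale, rotation, shear, translation); the slides do not say which family of transformations is used, and the matrix values, point coordinates and function names below are hypothetical.

```python
import math

def apply_affine(T, p):
    """Apply T = ((a, b, tx), (c, d, ty)) to the point p = (x, y)."""
    (a, b, tx), (c, d, ty) = T
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)

def invert_affine(T):
    """Return the affine map that undoes T."""
    (a, b, tx), (c, d, ty) = T
    det = a * d - b * c
    if det == 0:
        raise ValueError("transformation is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return ((ia, ib, -(ia * tx + ib * ty)), (ic, id_, -(ic * tx + id_ * ty)))

def position_errors(T, reported, truth):
    """Distance between each reported position mapped through T^-1 and its ground truth."""
    T_inv = invert_affine(T)
    errors = []
    for r, g in zip(reported, truth):
        x, y = apply_affine(T_inv, r)
        errors.append(math.hypot(x - g[0], y - g[1]))
    return errors

# Hypothetical print-and-scan distortion estimated from the calibration page.
T = ((1.01, 0.02, 5.0), (-0.02, 0.99, -3.0))
truth = [(100.0, 200.0), (350.0, 80.0)]          # character positions from the DVI file
reported = [apply_affine(T, p) for p in truth]   # a recognizer that locates them perfectly
print(position_errors(T, reported, truth))       # errors near zero
```

With a real recognizer the reported positions would come from its output rather than from T, and the residual distances would measure its localization error.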


Change of Pace

• Assignment 1
  – What does it mean to write a program?
    • Documentation
    • Demo
    • Instructions for use
    • (perhaps optional)
  – Extensions, limitations, discussion

• Discussion questions
