UC Berkeley CS294-9 Fall 2000 5- 1 Document Image Analysis Lecture 5: Metrics Richard J. Fateman Henry S. Baird University of California – Berkeley Xerox Palo Alto Research Center
Dec 24, 2015
Document Image Analysis
Lecture 5: Metrics
Richard J. Fateman    Henry S. Baird
University of California – Berkeley    Xerox Palo Alto Research Center
The course so far…
• Reminder: All course materials are online:
http://www-inst.eecs.berkeley.edu/~cs294-9/
• Overview of the DIA Research Field
• Some applications (Postal Addresses, Checks):
• Research Objectives: more systematic modeling, design
• Some basic engineering
How well are we doing?
• Cost to achieve a useful result
• Compare digital version to
  – hand keying/digitizing
  – verification
  – correction
• Correction cost may dominate total system cost
When is a result nearly correct?
• Character model
  – Correct
  – Reject
  – Error
• String model
  – Insertion
  – Deletion
  – Rejection
  – Substitution [wrong letter identification]
Using ASCII character labels
ABCDEFGHIJKL = s1
ACD~~OIIUKL = s2
Insert B after A in s2
Substitute E for ~, F for ~ [~ = reject]
Subst G for O in s2
Subst H for I in s2
Subst I for U … etc.
(Really, H was recognized as II, and IJ was recognized as U.)
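The hand-written edit list above can also be produced mechanically. The following is a minimal sketch, assuming Python's standard `difflib` module is an acceptable stand-in for a real OCR evaluation tool; the function name `edit_script` and the choice to relabel runs of `~` as rejections are illustrative assumptions, not part of the lecture. `difflib` groups neighboring edits into runs, so its output will not list one character per line as the slide does.

```python
from difflib import SequenceMatcher

def edit_script(recognized, truth, reject="~"):
    """List the edits that turn the recognized string into the ground truth."""
    matcher = SequenceMatcher(None, recognized, truth, autojunk=False)
    script = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        got, want = recognized[i1:i2], truth[j1:j2]
        # Assumption: a replaced run made only of reject marks is a rejection.
        if tag == "replace" and set(got) == {reject}:
            tag = "reject"
        script.append((tag, got, want))
    return script

s1 = "ABCDEFGHIJKL"   # ground truth from the slide
s2 = "ACD~~OIIUKL"    # recognizer output from the slide

for tag, got, want in edit_script(s2, s1):
    if tag != "equal":
        print(f"{tag:8s} recognized={got!r} truth={want!r}")
```

Concatenating the `truth` pieces of the script rebuilds s1, and concatenating the `recognized` pieces rebuilds s2, which is a quick sanity check on any such alignment.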
ASCII labels are inadequate
• Unicode +
• Font +
• Point size +
• Tag information <author> .. </author>
Simple measures may mislead
$$\text{error rate} = \frac{\#\,\text{errors}}{\#\,\text{characters} - \#\,\text{rejected}} \times 100\%$$
Increase the rejection rate and this “error rate” decreases. Reject all characters to get 0/0?
Some applications (e.g. post office) force very low error, even if (low confidence) correct results are sometimes rejected.
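A small sketch of this measure, assuming the error rate is the number of errors divided by the number of accepted (non-rejected) characters. The function name and the sample counts are hypothetical, chosen only to show how rejecting more characters lowers the reported figure.

```python
def error_rate(n_chars, n_errors, n_rejected):
    """Percent of accepted characters that are wrong; None for the 0/0 case."""
    accepted = n_chars - n_rejected
    if accepted == 0:
        return None  # everything was rejected: the measure is undefined
    return 100.0 * n_errors / accepted

# Hypothetical counts for a 1000-character page.
print(error_rate(1000, 20, 0))     # no rejections
print(error_rate(1000, 5, 100))    # rejecting low-confidence characters hides errors
print(error_rate(1000, 0, 1000))   # reject everything: undefined
```

Reporting the rejection count next to the error rate keeps the trade-off visible.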
Some errors are acceptable
• Keyword search: if the key word occurs many times and is occasionally rejected
• Erroneous (nonsense) words are unlikely to be found by a search
• Caveat: if a key word is consistently changed to a nearby word, it may be missed (e.g. search for durnptruck and never find it.)
Example: UNLV-ISRI document collection
• 20 million pages of scientific, legal, official memos from DOE and contractors
  – Rock mining
  – Maps
  – Safe transportation of nuclear waste
  – Average length 44 pages
Example: UNLV-ISRI document collection
• DOE’s Licensing Support System Prototype
  – 104,000 page images, 2,600 documents
  – Manually typed “correct” text
  – OCR text
• To determine relevance to queries, 3 methods used
  – Geology students ranking (0/1)
  – OCR keyword search
  – “correct” text search
Example: UNLV-ISRI document collection
• Exact match on 71 queries.
  – 632 returned by correct text
  – 617 returned by OCR.
  – Essentially: OCR is OK for this application.
• Probabilistic ranking / frequency:
  – Excessive OCR errors affected ranking
  – On average, similar results
• Feedback on relevance was not helpful for poor OCR
• Benchmarking: similar relevance = good results
Example: UNLV-ISRI document collection
One surprising result is that for some standard tests of precision and recall, processing OCR did better than actual text.
[Crummy OCR meant that some terms were not recognized; but the documents were irrelevant….]
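To see how a worse transcription can score better, here is a minimal sketch of precision and recall over sets of document IDs. The IDs and the two result sets are made up for illustration and are not data from the UNLV-ISRI study; the convention of returning 0.0 for an empty set is also an assumption.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (0.0 if undefined)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: d1..d3 are relevant, d4 and d5 merely contain the keyword.
relevant = {"d1", "d2", "d3"}
clean_hits = {"d1", "d2", "d3", "d4", "d5"}  # search over hand-typed text
ocr_hits = {"d1", "d2", "d3", "d4"}          # OCR garbled the keyword in d5

print(precision_recall(clean_hits, relevant))
print(precision_recall(ocr_hits, relevant))
```

In this toy case the OCR search misses only an irrelevant document, so its precision rises while recall is unchanged.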
A theory for computing accuracy
• Consider the result of OCR to be a string
  – Idealization: most common errors involve mis-counting the number of spaces!
  – Ignores size/font/absolute position etc.
Computing the shortest edit distance
• Bio-informatics sequencing
• Associate a cost for each correspondence. For example,
  – Match or substitute (cost 0 or 1)
  – Insert or delete (cost 2)
Attempt to align AUGGAA with ACUGAUGUGA. Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by a solid line connecting cells. It corresponds to the following alignment:
ACUGAUGUGA
A-UG--G-AA
[explain dynamic programming here?]
[Figure: distance matrix with rows A U G G A A and columns A C U G A U G U G A]
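The table-filling idea behind this figure can be sketched as follows. This is a minimal illustration under the slide's stated costs (match 0, substitute 1, insert or delete 2), not the code used to produce the figure; the function and parameter names are assumptions.

```python
def align_cost(a, b, sub_cost=1, gap_cost=2):
    """Minimum total cost of aligning strings a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        d[i][0] = i * gap_cost
    for j in range(1, cols):
        d[0][j] = j * gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost)
            d[i][j] = min(diagonal,                 # match or substitute
                          d[i - 1][j] + gap_cost,   # gap in b
                          d[i][j - 1] + gap_cost)   # gap in a
    return d[-1][-1]

print(align_cost("AUGGAA", "ACUGAUGUGA"))  # -> 9
```

The alignment shown above has four gaps (cost 8) and one substitution (cost 1), which agrees with the total of 9 in the last cell of the table.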
Computing the shortest edit distance
• Also useful for other tasks (recognizing speech)
• Many ways to organize the dynamic programming, but it is still O(n²).
• Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding “and”, “the”, “of”, … etc.)
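Word accuracy can be sketched along the same lines by aligning word lists instead of characters. The version below uses `difflib.SequenceMatcher` from the Python standard library, whose matching blocks are a heuristic alignment rather than a guaranteed minimum edit distance; the function name, the tiny stopword list, and the sample sentence are illustrative assumptions.

```python
from difflib import SequenceMatcher

STOPWORDS = frozenset({"and", "the", "of"})  # illustrative, not a standard list

def word_accuracy(truth_text, ocr_text, stopwords=frozenset()):
    """Fraction of ground-truth words matched in order by the OCR output."""
    truth = [w for w in truth_text.lower().split() if w not in stopwords]
    ocr = [w for w in ocr_text.lower().split() if w not in stopwords]
    if not truth:
        return None  # nothing to score
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)

truth = "the dumptruck of the mine"
ocr = "the durnptruck of the mine"
print(word_accuracy(truth, ocr))             # all words      -> 0.8
print(word_accuracy(truth, ocr, STOPWORDS))  # non-stopwords  -> 0.5
```

Dropping stopwords makes the single misread keyword weigh much more heavily, which is closer to what matters for retrieval.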
Correct Zoning is essential
• Read order in multi-column pages
• How to compare competing programs on performance of repeated headers
• What to do with figures, logos.
Document Attribute Format Specification: DAFS
“While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding.
There are three storage formats: DAFS-Unicode, DAFS-ASCII and a more compact DAFS-Binary form.
DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA.”
www.raf.com, Illuminator, UW CD-ROMs (English and Japanese)
DAFS vs SGML
• DAFS = SGML + Unicode + CCITT Fax4
• SGML requires DTD (document type definition)
• SGML is intended for structure, not appearance (e.g. not bold, italic)
• Images which accidentally contain an ASCII version of <tag> can be problematic
  – Solved by putting images in separate files!
Perfect results: how to obtain ground truth?
• Painfully enter it by hand, or
• Painfully correct OCR results, or
• Compute some kind of average of OCR programs
Perfect ground truth: a synthetic approach
• (Kanungo, UMD): start with TeX,
  – Produce the ground truth for layout from TeX,
  – Extract character positions, glyphs by analyzing DVI files
  – This provides essentially every bit position of each character.
Ground truth
• Next, commit to paper:
  – Print the DVI files
  – Scan a calibration page
  – Compute parameters of 2D→2D transformations T imposed by physics
  – Scan the printout
  – Align the page
  – Run the recognizer
  – Compare reported positions (composed with T⁻¹) to correct ones
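The calibrate-then-compare steps can be sketched under a strong simplifying assumption: that T is a similarity transform (uniform scale, rotation, translation), so that it can be written as w = a·z + b on complex numbers and fitted from two calibration marks. The slides do not say which family of transformations was actually used; the function names and coordinates below are hypothetical.

```python
def fit_similarity(ideal, scanned):
    """Fit w = a*z + b from two calibration marks given as complex numbers."""
    (z1, z2), (w1, w2) = ideal, scanned
    a = (w2 - w1) / (z2 - z1)   # combined scale and rotation
    b = w1 - a * z1             # translation
    return a, b

def to_ideal(w, a, b):
    """Apply the inverse transform T^-1 to a scanned position."""
    return (w - b) / a

def position_error(reported, truth, a, b):
    """Distance between a reported position, mapped back, and the ground truth."""
    return abs(to_ideal(reported, a, b) - truth)

# Hypothetical calibration marks: ideal page coordinates and where the scan put them.
ideal_marks = (0 + 0j, 100 + 0j)
scanned_marks = (3 - 2j, 101 + 3j)
a, b = fit_similarity(ideal_marks, scanned_marks)

# A character whose true position is known from the DVI file.
truth = 40 + 25j
reported = a * truth + b + (0.5 + 0j)   # recognizer is off by half a unit in the scan
print(position_error(reported, truth, a, b))
```

A real scanner adds distortions that a two-point fit cannot capture, so a richer model fitted to more calibration marks would be needed in practice.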