Top Banner
Tesseract OCR Engine What it is, where it came from, where it is going. Ray Smith, Google Inc OSCON 2007
22

Tesseract OCR Engine Contents

Feb 25, 2023

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Tesseract OCR Engine Contents

Tesseract OCR Engine

What it is, where it came from,where it is going.

Ray Smith, Google Inc

OSCON 2007

Page 2: Tesseract OCR Engine Contents

Contents

• Introduction & history of OCR

• Tesseract architecture & methods

• Announcing Tesseract 2.00

• Training Tesseract

• Future enhancements

Page 3: Tesseract OCR Engine Contents

A Brief History of OCR

• What is Optical Character Recognition?

My invention relates to statistical machinesof the type in which successive comparisonsare made between a character and a charac-

OCR

Page 4: Tesseract OCR Engine Contents

A Brief History of OCR

• OCR predates electronic computers!

US Patent 1915993, Filed Apr 27, 1931

Page 5: Tesseract OCR Engine Contents

A Brief History of OCR

• 1929 – Digit recognition machine

• 1953 – Alphanumeric recognition machine

• 1965 – US Mail sorting

• 1965 – British banking system

• 1976 – Kurzweil reading machine

• 1985 – Hardware-assisted PC software

• 1988 – Software-only PC software

• 1994-2000 – Industry consolidation

Page 6: Tesseract OCR Engine Contents

Tesseract Background

• Developed on HP-UX at HP between 1985and 1994 to run in a desktop scanner.

• Came neck and neck with Caere and XISin the 1995 UNLV test.(See http://www.isri.unlv.edu/downloads/AT-1995.pdf )

• Never used in an HP product.

• Open sourced in 2005. Now on:http://code.google.com/p/tesseract-ocr

• Highly portable.

Page 7: Tesseract OCR Engine Contents

Tesseract OCR Architecture

Find TextLines and

Words

RecognizeWord

Pass 2

RecognizeWord

Pass 1

AdaptiveThresholding

ConnectedComponent

Analysis

Input: Gray or Color Image[+ Region Polygons]

Binary Image

CharacterOutlines

CharacterOutlinesOrganizedInto Words

Page 8: Tesseract OCR Engine Contents

Adaptive Thresholding is Essential

Some examples of how difficult it can be to make a binary imageTaken from the UNLV Magazine set.(http://www.isri.unlv.edu/ISRI/OCRtk )

Page 9: Tesseract OCR Engine Contents

Baselines are rarely perfectly straight

• Text Line Finding – skew independent –published at ICDAR’95 Montreal.(http://scholar.google.com/scholar?q=skew+detection+smith)

• Baselines are approximated by quadratic splinesto account for skew and curl.

• Meanline, ascender and descender lines are aconstant displacement from baseline.

• Critical value is the x-height.

Page 10: Tesseract OCR Engine Contents

Spaces between words are tricky too

• Italics, digits, punctuation all createspecial-case font-dependent spacing.

• Fully justified text in narrow columns canhave vastly varying spacing on differentlines.

Page 11: Tesseract OCR Engine Contents

Tesseract: Recognize Word

StaticCharacterClassifier

Dictionary

CharacterChopper

AdaptiveCharacterClassifier

NumberParser

CharacterAssociator

Done?

Adapt toWord

No

Yes

Page 12: Tesseract OCR Engine Contents

Outline Approximation

Original Image Outlines of components Polygonal Approximation

Polygonal approximation is a double-edged sword.Noise and some pertinent information are both lost.

Page 13: Tesseract OCR Engine Contents

Tesseract: Features and Matching

• Static classifier uses outline fragments asfeatures. Broken characters are easilyrecognizable by a small->large matchingprocess in classifier. (This is slow.)

• Adaptive classifier uses the same technique!(Apart from normalization method.)

Prototype Characterto classify

ExtractedFeatures

Match ofPrototypeTo Features

Match ofFeatures ToPrototype

Page 14: Tesseract OCR Engine Contents

Announcing tesseract-2.00

• Fully Unicode (UTF-8) capable

• Already trained for 6 Latin-basedlanguages (Eng, Fra, Ita, Deu, Spa, Nld)

• Code and documented process to train athttp://code.google.com/p/tesseract-ocr

• UNLV regression test framework

• Other minor fixes

Page 15: Tesseract OCR Engine Contents

Training Tesseract

Word ListWord-dawg,Freq-dawg

inttemp,pffmtable

normproto

unicharset

DangAmbigs

User-words

CharacterFeatures(*.tr files)

Trainingpage images

Box files unicharset

Tesseract Data Files

Wordlist2dawg

mfTraining

cnTraining

Unicharset_extractor Addition ofcharacterproperties

ManualData Entry

TesseractTesseract+manualcorrection

Page 16: Tesseract OCR Engine Contents

Tesseract Dictionaries

Word ListWord-dawg,Freq-dawg

User-words

Tesseract Data Files

Wordlist2dawg

Usually Empty

InfrequentWord List

FrequentWord List

Page 17: Tesseract OCR Engine Contents

Tesseract Shape Data

inttemp,pffmtable

normproto

CharacterFeatures(*.tr files)

Trainingpage images

Box files

Tesseract Data Files

mfTraining

cnTraining

TesseractTesseract+manualcorrection

Prototype Shape Features

Expected Feature Counts

Character Normalization Features

Page 18: Tesseract OCR Engine Contents

Tesseract Character Data

unicharset

DangAmbigs

Trainingpage images

Box files unicharset

Tesseract Data Files

Unicharset_extractor Addition ofcharacterproperties

ManualData Entry

Tesseract+manualcorrection

List of Characters + ctype information

Typical OCR errors eg e<->c, rn<->m etc

Page 19: Tesseract OCR Engine Contents

Accuracy Results

-6.58%12347-10.37%57171TotalGcc4.1

-18.77%97.51%122016.98%98.47%7524News.3BGcc4.1

-7.58%95.37%3123-1.62%97.78%14800Mag.3BGcc4.1

-4.97%95.12%6692-21.35%98.05%28589Doe3.3BGcc4.1

1.47%95.67%13125.02%98.04%6258Bus.3BGcc4.1

96.94%150298.69%6432News.3B1995

94.99%337997.74%15043Mag.3B1995

94.87%704297.52%36349Doe3.3B1995

95.73%129398.14%5959Bus.3B1995

ChangeAccuracyErrorsChangeAccuracyErrors

Non-stopwordCharacterTestsetTestid

Comparison of current results against 1995 UNLV results

Page 20: Tesseract OCR Engine Contents

Commercial OCR v Tesseract

• 6 languages + growing.

• Accuracy was good in1995.

• No UI yet.

• Page layout analysiscoming soon.

• Runs on Linux, Mac,Windows, more...

• Open source – Free!

• 100+ languages.

• Accuracy is goodnow.

• Sophisticated appwith complex UI.

• Works on complexmagazine pages.

• Windows Mostly.

• Costs $130-$500

Page 21: Tesseract OCR Engine Contents

Tesseract Future

• Page layout analysis.

• More languages.

• Improve accuracy.

• Add a UI.

Page 22: Tesseract OCR Engine Contents

The End

• For more information see:http://code.google.com/p/tesseract-ocr