Tesseract OCR Engine Svetlin Nakov and Veselin Kolev BASD (Bulgarian Association of Software Developers) .

Mar 26, 2015

Page 1:

Tesseract OCR Engine

Svetlin Nakov and Veselin Kolev
BASD (Bulgarian Association of Software Developers)

www.devbg.org

Page 2:

Hot News!

• Microsoft Corporation just announced its strategic partnership with OpenFest

• OpenFest is upgrading to Windows 7 and MS SQL Server 2008

Page 3:

What is OCR?

• Stands for Optical Character Recognition

• Extracts the text from a given image

Page 4:

What is OCR? (2)

• Invented by Gustav Tauschek

• Tauschek obtained a patent on OCR

  • 1929 in Germany

  • 1935 in USA

• Tauschek's machine

  • Was a mechanical device

  • Used templates, light and a photodetector

  • When light was directed towards a template that exactly matched the character, no light reached the photodetector

Page 5:

What is OCR? (3)

• OCR predates electronic computers!

Page 6:

Project Tesseract

• History of Tesseract

  • Open source OCR engine

  • Developed by HP between 1985 and 1995

  • Never used in an HP product

  • Rated highly at The Fourth Annual Test of OCR Accuracy in 1995

  • In 2005 HP transferred Tesseract to the ISRI and released it as open source

    • ISRI = Information Science Research Institute

  • The development is currently led by Google

Page 7:

Project Tesseract (2)

• Tesseract is an OCR engine and is NOT a complete OCR program

• Originally intended to serve as a component part of other programs

• Works from the command line

• Has no page layout analysis (will have soon)

• Has no output formatting

• Has no GUI
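Because Tesseract works from the command line, another program can drive it as a subprocess. The following is a minimal sketch, assuming the usual `tesseract <image> <outputbase> -l <lang>` invocation and that the `tesseract` binary is on the PATH. The function names `build_tesseract_command` and `run_ocr` are hypothetical and used only for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for one Tesseract run.

    Tesseract appends ".txt" to output_base on its own.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract binary is installed and on the PATH.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact command line before anything is executed.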

Page 8:

Tesseract Versions

• Stable build – version 2.04

  • Has some documentation

  • Can be easily trained on a new language

  • Has memory leaks

• Development version – 3.0 (unstable)

  • Not documented, unstable

  • Language files are not compatible (need special conversion)

Page 9:

Downloading, Compiling and Running Tesseract

(Latest Version)

Demo

Page 10:

How Does Tesseract Work?

1. Adaptive thresholding on the input image

2. Analyze connected components in the binary image

3. Find text lines and words

4. First pass of the recognition process

  • Attempts to recognize each word in turn

5. Satisfactory words are passed to the adaptive trainer

6. Lessons learned are employed in a second pass

  • Used for words that were not satisfactorily recognized

7. Producing the output text
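The first two steps of the pipeline can be illustrated on a tiny grayscale image stored as a list of rows. This is a simplified stand-in and not Tesseract's actual implementation: it assumes a local-mean threshold and 4-connected blobs, and the function names and the `radius` and `offset` parameters are chosen only for this sketch.

```python
from collections import deque


def adaptive_threshold(gray, radius=1, offset=10):
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset."""
    height, width = len(gray), len(gray[0])
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            binary[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return binary


def connected_components(binary):
    """Return bounding boxes (left, top, right, bottom) of 4-connected ink blobs."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

The bounding boxes returned by `connected_components` are the kind of raw material that a later stage would group into text lines and words.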

Page 11:

Training Tesseract

1. Prepare training images and .box files

  • Files: lang.tif and lang.box

  • 2.04 supports only uncompressed TIFFs

  • .box files contain characters with coordinates

2. Extract the character features

  tesseract lang.tif junk nobatch box.train

  • This produces lang.tr

3. Perform character clustering

  mftraining lang.tr

  cntraining lang.tr
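The .box files from step 1 pair each character with its coordinates, so they are easy to inspect with a short script before training. The sketch below assumes the plain-text layout `<symbol> <left> <bottom> <right> <top>`, one character per line; that field order is an assumption not stated on the slide and should be checked against the training documentation for the version in use. The `Box` type and both function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one line of the assumed form: <symbol> <left> <bottom> <right> <top>."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"malformed box line: {line!r}")
    left, bottom, right, top = (int(value) for value in fields[1:5])
    return Box(fields[0], left, bottom, right, top)


def parse_box_file(text):
    """Parse the full contents of a .box file, skipping blank lines."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]
```

A quick sanity check such as `all(b.left < b.right and b.bottom < b.top for b in boxes)` can catch hand-editing mistakes before the feature extraction step is run.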

Page 12:

Training Tesseract (2)

4. Compute the character set properties

  • isLetter, isDigit, isUpper, isPunctuation, …

  • Unicode provides this information

  unicharset_extractor lang.box

5. Train language dictionaries

  • List of all words in the target language

  • List of the most frequent words

  wordlist2dawg freq-words.txt lang.freq-dawg

  wordlist2dawg all-words.txt lang.word-dawg
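Both inputs for these steps can be prepared with a few lines of scripting. The sketch below is one possible approach and not part of the Tesseract tooling: it derives the four character properties named above from Unicode general categories, and builds the two word lists from a sample corpus. The function names and the `top_n` cutoff are hypothetical.

```python
import unicodedata
from collections import Counter


def char_properties(ch):
    """Derive the character properties from the Unicode general category."""
    category = unicodedata.category(ch)  # for example "Lu", "Nd" or "Po"
    return {
        "isLetter": category.startswith("L"),
        "isDigit": category == "Nd",
        "isUpper": category == "Lu",
        "isPunctuation": category.startswith("P"),
    }


def build_word_lists(corpus, top_n=3):
    """Return (all distinct words sorted, the top_n most frequent words)."""
    words = [word.lower() for word in corpus.split() if word.isalpha()]
    counts = Counter(words)
    all_words = sorted(counts)
    frequent = [word for word, _ in counts.most_common(top_n)]
    return all_words, frequent
```

Writing the two returned lists to all-words.txt and freq-words.txt, one word per line, gives files in the shape that the wordlist2dawg commands above take as input.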

Page 13:

Training Tesseract for Bulgarian and English

(Bulgarian for IT Professionals)

Demo

Page 14:

Other OCR Engines

• OCRopus

  • Open source document analysis and OCR system

  • Also funded by Google

  • Provides much of the layout analysis functionality missing from Tesseract

  • Capable of using engines other than Tesseract

  • http://code.google.com/p/ocropus/

Page 15:

Other OCR Engines (2)

• ABBYY FineReader OCR

  • Supports a large number of features

  • Known for its high accuracy

  • Commercial

• Microsoft Office Document Imaging (MODI)

  • Supports editing documents scanned by Microsoft Office Document Scanning

  • First introduced in MS Office XP

  • Commercial

Page 16:

Commercial OCR vs. Tesseract

Commercial OCR:

• 100+ languages

• Accuracy is good now

• Sophisticated app with complex UI

• Works on complex magazine pages

• Windows mostly

• Costs $130-$500

Tesseract:

• 6 languages

• Accuracy was good in 1995

• No UI yet

• Page layout analysis coming soon

• Running on Linux, Mac, Windows, more..

• Open source – Free!

Page 17:

Tesseract Future

• Page layout analysis

• More languages

• Improve accuracy

• Add a UI

• Support for connected scripts (like Arabic)

Page 18:

Links

• For more information see:

  • http://code.google.com/p/tesseract-ocr/

  • http://en.wikipedia.org/wiki/Optical_character_recognition

  • http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf

• Speakers

  • http://nakov.com/blog

  • http://veskokolev.blogspot.com

Page 19:

Questions?

Tesseract OCR