Optical Character Recognition (OCR) Introduction & Overview
Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
[email protected]
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Agenda
● ABBYY Technology in the IMPACT project
● Who is ABBYY?
  ● Company Overview
  ● Product Overview
  ● How is OCR used in real-life scenarios?
● Optical Character Recognition - Basics
  ● What is OCR?
  ● How does OCR work inside?
  ● OCR = Only Character Recognition?
  ● IMPACT – the areas of improvement
● Questions & Answers
IMPACT & ABBYY
Improving Access to Text
Mission of IMPACT: to significantly improve access to historical text and remove the barriers that stand in the way of the mass digitisation of the European cultural heritage.
Partners:
● Koninklijke Bibliotheek
● The British Library
● Österreichische Nationalbibliothek
● Universität Innsbruck
● Deutsche Nationalbibliothek
● Bayerische Staatsbibliothek
● Staats- und Universitätsbibliothek Göttingen
● ABBYY
● IBM Israel – Science and Technology Ltd
● Instituut voor Nederlandse Lexicologie
● National Centre for Scientific Research "Demokritos"
● Centrum für Informations- und Sprachverarbeitung, University of Munich
● University of Bath
● University of Salford
● Bibliothèque Nationale de France
Web: www.impact-project.eu
IMPACT & ABBYY
ABBYY is the OCR technology provider for IMPACT members
IMPACT members work with ABBYY's OCR SDK (FineReader Engine) because:
● Only development toolkits allow developers to combine new or different modules, for example complex dictionaries
● Scientific research and tests have to be implemented in custom modules
IMPACT & ABBYY
ABBYY improves the OCR core technologies for the recognition of old documents. Current focus areas are:
● Image pre-processing
● Character recognition
IMPACT currently focuses on research and not on setting up a production system ;o)
Improvements in ABBYY recognition technologies that result from the IMPACT project will be added to future products
Important:
● ABBYY FineReader 8/9/10 Professional (Box) has NO Fraktur OCR
● Fraktur OCR is only available in Recognition Server and FineReader Engine
ABBYY – an Overview
ABBYY & OCR for IMPACT
Who is ABBYY?
Leading developer of artificial intelligence software in document recognition, data capture and linguistics
Headquartered in Moscow, Russia
Founded in 1989 by Mr. David Yang as BIT Software
More than 880 employees worldwide
8 offices worldwide
Established sales and distribution network in more than 130 countries worldwide
ABBYY Worldwide
● ABBYY Headquarters / ABBYY Russia, Moscow
● ABBYY Europe GmbH, Munich, Germany
● ABBYY Europe UK
● ABBYY Ukraine, Kiev
● ABBYY USA, Fremont
● ABBYY Japan
● ABBYY Taiwan
ABBYY in Western Europe
ABBYY Europe GmbH
Located in Munich, Germany
Established in 2001
Serves partners and customers in Western European countries
ABBYY FineReader Engine SDK: comprehensive toolkit for integrating recognition and data capture technologies into third-party applications
ABBYY Mobile OCR Engine: OCR for thin clients such as mobile phones, PDAs and Web applications
ABBYY OCR Products – Usage View

Desktop/Workgroup
● OCR & Document Conversion: FineReader (Professional, Corporate, Site Licence Edition), PDF Transformer, FotoReader, ScreenshotReader
● Users are: End Users, Companies, (Libraries)
● User-driven processing, ready to use

Server/Backend
● OCR & Document Conversion: Recognition Server (Professional, Extended Edition)
● Users are: Companies, Scan Service Providers, Libraries
● Automated processing, ready to use

SDK/Integration
● OCR & Document Conversion: FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, Embedded Systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iPhone)
● Users are: Developers, Scan Service Providers, IMPACT Research
● Automated processing, development needed
OCR Basics
Designed not to be OCRed
What (ABBYY) OCR can read...
Recognition Languages
● >191 languages altogether
● Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
● 34 languages with dictionary support and spell check
● Chinese, Japanese, Korean (CJK): 4 character sets (Chinese traditional and simplified, Japanese, Korean)
● 5 languages in FineReader XIX (Gothic and other 17th to 20th century fonts)
● 6 programming languages (Basic, C/C++, COBOL, Java, etc.)
● 4 artificial languages (Esperanto, Interlingua, etc.)
● Simple chemical formulas
OCR Optimization
Step 3. Character Recognition – learn new symbols
Own Pattern Training to learn special characters on a pixel level
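To make the idea of learning special characters on a pixel level more concrete, here is a toy sketch of nearest-template matching. It is not ABBYY's pattern training. The `PatternStore` class, the 3x5 bitmaps and the label `long_s` are all assumptions made up for illustration.

```python
# Toy illustration only: nearest-template matching on tiny bitmaps.
# This does not reproduce how any ABBYY product trains or stores patterns.

def to_bits(rows):
    """Convert rows like '.#.' into a flat tuple of 0/1 pixels."""
    return tuple(1 if ch == "#" else 0 for row in rows for ch in row)


def hamming(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same size")
    return sum(x != y for x, y in zip(a, b))


class PatternStore:
    """Hypothetical store of user-trained glyph patterns."""

    def __init__(self):
        self.patterns = []  # list of (label, bitmap) pairs

    def train(self, label, rows):
        """Remember one example bitmap for the given character label."""
        self.patterns.append((label, to_bits(rows)))

    def classify(self, rows):
        """Return (label, distance) of the closest trained pattern."""
        if not self.patterns:
            raise ValueError("no patterns trained")
        bits = to_bits(rows)
        label, pattern = min(self.patterns, key=lambda p: hamming(p[1], bits))
        return label, hamming(pattern, bits)


store = PatternStore()
# A 3x5 long s and a 3x5 "f": they differ only in the crossbar pixel.
store.train("long_s", [".##", "#..", "#..", "#..", "#.."])
store.train("f", [".##", "#..", "##.", "#..", "#.."])

# A noisy scan of a long s with one stray pixel in the last row.
noisy = [".##", "#..", "#..", "#..", "#.#"]
result = store.classify(noisy)
```

The noisy glyph is one pixel away from the trained long s and two pixels away from the "f", so the long s wins. Real engines work with far richer features, but the principle of comparing against user-supplied examples is the same.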
OCR Optimization
Step 3. Character Recognition – back to the word level
Applying selected recognition languages and dictionaries
Own languages and dictionaries can be defined
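The effect of a user-defined dictionary at the word level can be sketched as follows. This is a simplified assumption of how dictionary support might influence the choice between recognition candidates, not a description of the FineReader Engine API. The function `pick_word`, the candidate scores and the word lists are hypothetical.

```python
def pick_word(candidates, dictionary):
    """Pick the best-scoring candidate that appears in the dictionary.

    candidates: list of (text, score) pairs, higher score = more confident.
    Returns (text, found_in_dictionary). If no candidate is in the
    dictionary, the overall best-scoring candidate is returned unchanged.
    """
    if not candidates:
        raise ValueError("no candidates")
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for text, _score in ranked:
        if text.lower() in dictionary:
            return text, True
    return ranked[0][0], False


# Hypothetical word lists: a modern dictionary and user-defined historical spellings.
modern = {"und", "sein"}
historical = {"vnd", "seyn"}

# Hypothetical character-level results for one word in an old print.
candidates = [("vud", 0.52), ("vnd", 0.47), ("und", 0.30)]

with_modern_only = pick_word(candidates, modern)
with_own_dictionary = pick_word(candidates, modern | historical)
without_dictionary = pick_word(candidates, set())
```

With only the modern dictionary the word is normalised to "und". Adding the historical entries keeps the original spelling "vnd", which is why special dictionaries matter for old material.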
OCR Processing Steps
Step 4. Verification by Operators (optional)
Manual validation or correction of Layout Analysis results
● Text blocks
● Image blocks
● Table blocks
Suspicious characters and word corrections using dictionaries
Re-recognition with other language settings
Recognition Server allows one to set a quality level and also to log processing results in an XML file
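A quality level and an XML log can be pictured with the sketch below. It assumes each recognised word comes with a confidence value between 0 and 1. The element names, attribute names and the threshold are invented for illustration and do not reproduce the actual Recognition Server log format.

```python
import xml.etree.ElementTree as ET


def split_for_review(words, threshold):
    """Separate (text, confidence) pairs into accepted and suspicious lists."""
    accepted, suspicious = [], []
    for text, confidence in words:
        if confidence >= threshold:
            accepted.append((text, confidence))
        else:
            suspicious.append((text, confidence))
    return accepted, suspicious


def build_log(page_id, words, threshold):
    """Build an XML processing log; all element and attribute names are made up."""
    accepted, suspicious = split_for_review(words, threshold)
    root = ET.Element("page", id=page_id, threshold=f"{threshold:.2f}")
    ET.SubElement(root, "accepted", count=str(len(accepted)))
    review = ET.SubElement(root, "suspicious", count=str(len(suspicious)))
    for text, confidence in suspicious:
        word = ET.SubElement(review, "word", confidence=f"{confidence:.2f}")
        word.text = text
    return root


# Hypothetical recognition results for one line of a Fraktur newspaper.
words = [("Die", 0.98), ("Zeitung", 0.95), ("erfcheint", 0.41), ("heute", 0.88)]
log = build_log("p0001", words, threshold=0.80)
xml_text = ET.tostring(log, encoding="unicode")
```

Only the words below the threshold are listed for the operator, so the amount of manual checking scales with the number of doubtful words rather than with the size of the page.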
ABBYY OCR Processing Steps
Step 5. Document Synthesis and Export
Generating an output document in the selected format
● TXT, Office formats, PDF, etc.
From version 9.0 on, ADRT (Adaptive Document Recognition Technology) is included. Goal: understanding the document structure and detecting e.g. headers, footers and footnotes. V10: table of contents
SDKs and Recognition Server offer more export formats, e.g.
● XML
● Internal FineReader Engine Format
OCR in General & IMPACT in Particular
OCR = Only Character Recognition?
Recreates the same layout as in the original document
● Resulting document looks just like the scanned original
● Information captured during Layout Analysis is used here
Supports popular document formats
● ABBYY products support all popular output formats the customer needs
● PDF, PDF/A, XML, HTML, TXT/CSV, Word, Excel, PowerPoint and DBF
Compliance with the regulations
● Support for selective access password protection, document encryption, support for PDF/A format, etc.
IMPACT = "Step by Step" Optimisation
Step 1. Image Quality
● Problem areas: scans of microfilms, distortions, shine-through characters
● Optimisation approach: image pre-processing, e.g. binarisation
Step 2. Document Analysis
● Problem areas: layout of old print material, e.g. narrow columns in old newspapers
● Optimisation approach: improved Layout/Document Analysis
Step 3. Character Recognition & Languages
● Problem areas: fonts used, old language (grammar & spelling)
● Optimisation approach: optimised patterns, adaptive OCR, creation of special dictionaries
Step 4. Validation & Correction
● Problem areas: frequently recurring errors during Fraktur OCR, scalability of correction
● Optimisation approach: new approaches for mass verification
Step 5. Document Synthesis, Export & Rating
● Problem areas: content classification, metadata generation, "reliable" formats
● Optimisation approach: XML, ALTO XML, XML analysis, PDF/A, …
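As an illustration of the binarisation mentioned under Step 1, the sketch below uses Otsu's method, a standard global thresholding technique. The slides only name binarisation in general, so the choice of Otsu is an assumption and not a statement about what IMPACT or ABBYY actually use. The sample pixel values are made up.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    Pixels with a value <= the returned level form the dark class (ink),
    all others the light class (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue  # no dark pixels yet, nothing to separate
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarise(pixels, threshold):
    """Map grey values to 0 (ink) or 1 (paper) using the given threshold."""
    return [1 if p > threshold else 0 for p in pixels]


# Hypothetical grey values: three dark ink pixels and three light paper pixels.
scan = [30, 35, 40, 200, 210, 220]
threshold = otsu_threshold(scan)
bitonal = binarise(scan, threshold)
```

A single global threshold like this tends to struggle with uneven backgrounds such as microfilm scans or shine-through, which is one reason pre-processing for historical material is a research topic in its own right.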
Thank you for your attention!
Questions?
Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
[email protected]