IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Michael Fuchs Senior Product Marketing Manager ABBYY Europe [email protected] Optical Character Recognition (OCR) Introduction & Overview
Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Oct 20, 2014

Page 1: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe

[email protected]

Optical Character Recognition (OCR) Introduction & Overview

Page 2: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

IMPACT + ABBYY - OCR Introduction & Overview 2

Agenda

ABBYY Technology in the IMPACT project

Who is ABBYY?
● Company Overview
● Product Overview
● How is OCR used in real-life scenarios?

Optical Character Recognition - Basics
● What is OCR?
● How does OCR work inside?
● OCR = Only Character Recognition?
● IMPACT – the areas of improvement

Questions & Answers

Page 3: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

IMPACT & ABBYY

Page 4: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Improving Access to Text

Mission of IMPACT: It aims to significantly improve access to historical text and remove the barriers that stand in the way of the mass digitisation of the European cultural heritage.

Partners:
● Koninklijke Bibliotheek
● The British Library
● Österreichische Nationalbibliothek
● Universität Innsbruck
● Deutsche Nationalbibliothek
● Bayerische Staatsbibliothek
● Staats- und Universitätsbibliothek Göttingen
● ABBYY
● IBM Israel – Science and Technology Ltd
● Instituut voor Nederlandse Lexicologie
● National Centre for Scientific Research "Demokritos"
● Centrum für Informations- und Sprachverarbeitung, University of Munich
● University of Bath
● University of Salford
● Bibliothèque Nationale de France

Web: www.impact-project.eu

Page 5: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

IMPACT & ABBYY

ABBYY is the OCR technology provider for IMPACT members

IMPACT members work with ABBYY's OCR SDK (FineReader Engine), because:
● Only development toolkits allow developers to combine new/different modules, for example complex dictionaries
● Scientific research & tests have to be implemented in custom modules

Page 6: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

IMPACT & ABBYY

ABBYY improves the OCR core technologies for the recognition of old documents. Current focus areas are:
● Image pre-processing
● Character recognition

IMPACT currently focuses on research and not on setting up a production system ;o)

Improvements in ABBYY recognition technologies that are a result of the IMPACT project will be added to future products.

Important:
● ABBYY FineReader 8/9/10 Professional (Box) has NO Fraktur OCR
● Fraktur OCR is only available in Recognition Server and FineReader Engine

Page 7: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

IMPACT + ABBYY – OCR Introduction & Overview

ABBYY – an Overview

Page 8: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY & OCR for IMPACT

Who is ABBYY?

Leading developer of artificial intelligence software in document recognition, data capture and linguistics

Headquartered in Moscow, Russia

Founded in 1989 by Mr. David Yang as BIT Software

More than 880 employees worldwide

8 offices worldwide

Established sales and distribution network in more than 130 countries worldwide

Page 9: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY Worldwide

● ABBYY Headquarters / ABBYY Russia: Moscow
● ABBYY USA: Fremont
● ABBYY Ukraine: Kiev
● ABBYY Europe GmbH: Munich, Germany
● ABBYY Europe UK
● ABBYY Japan
● ABBYY Taiwan

Page 10: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY in Western Europe

ABBYY Europe GmbH

Located in Munich, Germany

Established in 2001

Serves partners and customers in Western European countries

Sales and Marketing
● Sales: distribution, channel development, partner management
● Marketing: product marketing, channel marketing, outbound marketing (PR, advertising, direct)

More than 50 employees today

Page 11: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Product Overview

Page 12: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY Product Brands: Mainline Distribution

“Box” products:

● ABBYY FineReader: Optical character recognition (OCR)/text processing end user products
● ABBYY FotoReader: Conversion of texts taken with digital cameras
● ABBYY PDF Transformer: PDF conversion and creation for end users
● ABBYY Lingvo: Electronic dictionaries, Russian and European languages

Page 13: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY Product Brands: Direct Sales and VAR Distribution

Licensing and integration products:

● ABBYY Recognition Server: Server-based OCR
● ABBYY FormReader and ABBYY FlexiCapture: Form processing, unstructured document processing, document assembly
● ABBYY FineReader Engine SDK: Comprehensive toolkit for integrating recognition and data capture technologies into third-party applications
● ABBYY Mobile OCR Engine: OCR for thin clients such as mobile phones, PDAs and Web applications

Page 14: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY OCR Products – Usage View

OCR & Document Conversion

Desktop/Workgroup
● Products: FineReader (Professional, Corporate, Site Licence Edition), PDF Transformer, FotoReader, ScreenshotReader
● Users are: End Users, Companies, (Libraries)
● User driven processing, Ready to use

Server/Backend
● Products: Recognition Server (Professional, Extended Edition)
● Users are: Companies, Scan Service Provider, Libraries
● Automated processing, Ready to use

SDK/Integration
● Products: FineReader Engines (Windows, Linux, Mac OS X, Free BSD, Embedded Systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iPhone)
● Users are: Developers, Scan Service Provider, IMPACT Research
● Automated processing, Development needed

Page 15: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Basics

Page 16: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Designed not to be OCRed

Page 17: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

What (ABBYY) OCR can read...

Recognition Languages
● >191 languages altogether
● Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
● 34 languages with dictionary support and spell check
● Chinese, Japanese, Korean (CJK): 4 character sets (Chinese (traditional and simplified), Japanese, Korean)
● 5 languages in FineReader XIX (Gothic and other 17th-20th century fonts)
● 6 programming languages (Basic, C/C++, COBOL, Java, etc.)
● 4 artificial languages (Esperanto, Interlingua, etc.)
● Simple chemical formulas

Font Types
● Recognition of mixed font types (dot-matrix printer, typewriter, Gothic, etc.)
● OCR-A
● OCR-B
● MICR (E13B)
● CMC-7

Page 18: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Processing Steps

Step 1. Scanning, Image Loading, Pre-Processing and Modification
Compensating image defects and making the document better viewable and suited for automatic OCR

Step 2. Document Layout Analysis
Detect sections of a document, analyze layout and find barcodes

Step 3. Character Recognition
Automatic recognition of characters, apply selected recognition languages, dictionaries and other settings

Step 4. Verification by Operators (optional)
Manual validation of suspicious characters and words

Step 5. Document Synthesis and Export
Generating an output document in the selected format

Page 19: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Processing Steps

Step 1. Image Loading, Pre-Processing and Modification
Images from existing files or captured with a scanner

● Splitting images
● Scaling (e.g. low resolution images can be digitally magnified)
● Rotation (by 90, 180, or 270 degrees)
● Flipping and inverting images
● Cropping (selecting rectangular areas)
● Creating previews (small images for previews)
● Changing text colour and background in rectangular areas
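
The geometric operations listed for Step 1 can be illustrated on a tiny bitmap. This is only a minimal sketch, assuming an image is a list of rows of 0/1 pixels (1 = ink). The helper names are hypothetical and are not taken from the FineReader Engine API.

```python
# Illustrative only: a bitmap is a list of rows, each a list of 0/1 pixels.

def rotate90(img):
    """Rotate a bitmap clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]

def rotate(img, degrees):
    """Rotate clockwise by a multiple of 90 degrees (e.g. 90, 180, 270)."""
    if degrees % 90 != 0:
        raise ValueError("only multiples of 90 degrees are supported")
    for _ in range((degrees // 90) % 4):
        img = rotate90(img)
    return [list(row) for row in img]

def flip_horizontal(img):
    """Mirror the bitmap left to right."""
    return [row[::-1] for row in img]

def invert(img):
    """Swap ink and background."""
    return [[1 - p for p in row] for row in img]

def crop(img, top, left, height, width):
    """Select a rectangular area."""
    return [row[left:left + width] for row in img[top:top + height]]

page = [
    [1, 0, 0],
    [1, 1, 0],
]
print(rotate(page, 90))  # [[1, 1], [1, 0], [0, 0]]
```

Restricting rotation to multiples of 90 degrees keeps the operation lossless, because no pixel has to be interpolated.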

Page 20: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY OCR Processing Steps

Step 1. Image Loading, Pre-Processing and Modification
Compensating for scanning defects

● Automatic de-skew to proper straight position
● Straightening text lines
● Controlled de-speckle (cleaning garbage dots)
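
One common way to "clean garbage dots" is to delete ink blobs that are too small to be part of a character. The sketch below assumes a 0/1 bitmap and 4-connected blobs. The function name `despeckle` and the `min_size` parameter are hypothetical, and the slides do not say how ABBYY implements this step.

```python
from collections import deque

def despeckle(img, min_size=3):
    """Return a copy of a 0/1 bitmap with ink blobs smaller than min_size removed.

    Blobs are 4-connected groups of ink pixels (value 1).
    """
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Collect one connected blob with a breadth-first search.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase the blob if it is too small to be part of a character.
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 0
    return out

noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
clean = despeckle(noisy, min_size=3)
```

The size threshold is what makes the cleaning "controlled": set too high, it would also erase punctuation and diacritics.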

Page 21: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Processing Steps

Step 1. Image Loading, Pre-Processing and Modification

● Intelligent background filtering
● Adaptive Binarisation

General binarisation on an image level cannot deliver good results for OCR
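
The difference between image-level and adaptive binarisation can be shown on a single scan line whose background darkens from left to right. This is a minimal sketch assuming 0-255 grey values and a simple local-mean rule. The function names and the `radius` and `offset` parameters are illustrative assumptions, not ABBYY's algorithm.

```python
def global_binarise(gray, threshold=128):
    """Mark a pixel as ink (1) if it is darker than one image-wide threshold."""
    return [[1 if p < threshold else 0 for p in row] for row in gray]

def adaptive_binarise(gray, radius=1, offset=10):
    """Mark a pixel as ink (1) if it is darker than its local mean minus offset."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Grey values in the neighbourhood, clipped at the image border.
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if gray[y][x] < local_mean - offset else 0)
        out.append(row)
    return out

# Bright paper with a faded stroke (150), then stained paper with a dark stroke (40).
line = [[220, 150, 220, 220, 160, 100, 40, 100, 100]]

print(global_binarise(line))    # [[0, 0, 0, 0, 0, 1, 1, 1, 1]]
print(adaptive_binarise(line))  # [[0, 1, 0, 0, 0, 0, 1, 0, 0]]
```

The global threshold misses the faded stroke and turns the whole stained area into ink, while the local rule recovers both strokes.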

Page 22: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Processing Steps

Step 1. Image Loading, Pre-Processing and Modification

Success during IMPACT

[Figure: the same page shown as Original, State of the Art, and New. In the new result there is no text from the other page.]

Page 23: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

New Binarization Examples

[Figure panels: Original scan, Previous binarization, New binarization]

Page 24: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Camera OCR
Automatic correction of 3D perspective distortions

[Figure panels: Before, After]

Page 25: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Camera OCR
ISO noise reduction

[Figure panels: Before, After]

Page 26: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Processing Steps

Step 2. Document Layout Analysis
Detecting sections of a document, analyzing layout and finding barcodes

Page 27: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Processing Steps

Step 3. Character Recognition
After line detection, character recognition is applied with different classifiers

● Raster classifier
● Contour classifier
● Feature differentiating classifier
● Structure classifier
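
The slides only name the classifiers, so the following is an assumption about the general idea behind a raster classifier: the glyph bitmap is compared pixel by pixel with stored patterns, and the candidates are ranked by agreement. The 3x3 templates and function names are hypothetical.

```python
# Hypothetical 3x3 patterns; real patterns would be far larger.
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}

def raster_score(glyph, template):
    """Fraction of pixels on which glyph and template agree."""
    pixels = [(g, t) for grow, trow in zip(glyph, template) for g, t in zip(grow, trow)]
    return sum(g == t for g, t in pixels) / len(pixels)

def classify(glyph, templates=TEMPLATES):
    """Return (label, score) pairs sorted from best to worst match."""
    scored = [(label, raster_score(glyph, tpl)) for label, tpl in templates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

noisy_t = ["111",
           "010",
           "011"]  # a "T" with one stray pixel

ranking = classify(noisy_t)
print(ranking[0][0])  # T
```

Returning a ranked list rather than a single label leaves room for other classifiers, or a dictionary check, to settle close calls.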

Page 28: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Optimization

Step 3. Character Recognition – learn new symbols
Own Pattern Training to learn special characters on a pixel level

Page 29: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Optimization

Step 3. Character Recognition – back to the word level
Applying selected recognition languages and dictionaries

Own languages and dictionaries can be defined
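
A user-defined dictionary can help at the word level by snapping an uncertain OCR result to the nearest known word. The sketch below uses plain Levenshtein distance. The function names, the `max_distance` limit, and the sample word list are illustrative assumptions and do not describe how FineReader Engine applies its dictionaries.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def correct(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or the input if nothing is close enough."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else word

# A small custom dictionary with historical German spellings as an example.
historical = ["Theil", "seyn", "Thurm", "Bibliothek"]

print(correct("Tbeil", historical))       # Theil
print(correct("Bibliothck", historical))  # Bibliothek
```

The distance limit matters for historical text: without it, every unknown but correctly recognised word would be forced onto some dictionary entry.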

Page 30: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR Processing Steps

Step 4. Verification by Operators (optional)

Manual validation or correction of Layout Analysis Results
● Text blocks
● Image blocks
● Table blocks

Suspicious characters and word corrections using dictionaries

Re-Recognition with other language settings

Recognition Server allows one to set a quality level and also to log processing results in an XML file
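
Operator verification is usually driven by a confidence value per recognised character, so that only doubtful positions are shown to a person. This is a minimal sketch assuming each character comes with a confidence between 0 and 1. The data layout, threshold, and function names are hypothetical and are not the Recognition Server quality-level setting itself.

```python
def find_suspicious(chars, threshold=0.8):
    """Return indices of recognised characters whose confidence is below threshold."""
    return [i for i, (_, conf) in enumerate(chars) if conf < threshold]

def mark_for_review(chars, threshold=0.8):
    """Render the text with suspicious characters wrapped in brackets."""
    return "".join(
        f"[{ch}]" if conf < threshold else ch
        for ch, conf in chars
    )

# (character, confidence) pairs as a recogniser might report them.
recognised = [
    ("F", 0.99), ("r", 0.97), ("a", 0.95), ("k", 0.55),
    ("t", 0.93), ("u", 0.62), ("r", 0.96),
]

print(find_suspicious(recognised))  # [3, 5]
print(mark_for_review(recognised))  # Fra[k]t[u]r
```

Raising the threshold sends more characters to the operator, trading manual effort for accuracy.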

Page 31: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

ABBYY OCR Processing Steps

Step 5. Document Synthesis and Export
Generating an output document in the selected format

● TXT, Office formats, PDF, etc.
● From version 9.0 on, ADRT (Adaptive Document Recognition Technology) is included. Goal: understanding the document structure and detecting e.g. headers, footers, footnotes. V10: table of contents
● SDKs and Recognition Server offer more export formats, e.g. XML and the internal FineReader Engine format
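
An XML export typically keeps each recognised word together with its position on the page, which plain text cannot do. The sketch below uses Python's standard `xml.etree.ElementTree`. The element and attribute names (`page`, `word`, `left`, `top`, `right`, `bottom`) are made up for illustration and follow neither the FineReader XML schema nor ALTO.

```python
import xml.etree.ElementTree as ET

def export_xml(words):
    """Serialise recognised words with their bounding boxes to an XML string."""
    page = ET.Element("page")
    for text, (left, top, right, bottom) in words:
        word = ET.SubElement(page, "word", {
            "left": str(left), "top": str(top),
            "right": str(right), "bottom": str(bottom),
        })
        word.text = text
    return ET.tostring(page, encoding="unicode")

xml_text = export_xml([
    ("Optical", (10, 20, 90, 40)),
    ("Character", (100, 20, 210, 40)),
])
print(xml_text)
```

Keeping coordinates in the export is what later allows a viewer to highlight search hits on the original page image.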

Page 32: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR in General & IMPACT in Particular

Page 33: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

OCR = Only Character Recognition?

Recreates the same layout as in the original document
● Resulting document looks just like the scanned original
● Information captured during Layout Analysis is used here

Supports popular document formats
● ABBYY products support all popular output formats the customer needs
● PDF, PDF/A, XML, HTML, TXT/CSV, Word, Excel, PowerPoint and DBF

Supports image output
● BMP, PCX, JPEG, JPEG 2000, TIFF, PNG

Compliance with the regulations
● Support for selective access password protection, document encryption, support for PDF/A format, etc.

Page 34: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

IMPACT = “Step by Step” Optimisation

Step 1. Image Quality
● Problem areas: Scans of microfilms, distortions, shine-through characters
● Optimisation approach: Image pre-processing, e.g. binarisation

Step 2. Document Analysis
● Problem areas: Layout of old print material, e.g. narrow columns in old newspapers
● Optimisation approach: Improved Layout/Document Analysis

Step 3. Character Recognition & Languages
● Problem areas: Fonts used, old language (grammar & spelling)
● Optimisation approach: Optimised patterns, adaptive OCR, creation of special dictionaries

Step 4. Validation & Correction
● Problem areas: Frequently recurring errors during Fraktur OCR, scalability of correction
● Optimisation approach: New approaches for mass verification

Step 5. Document Synthesis, Export & Rating
● Problem areas: Content classification, metadata generation, “reliable” formats
● Optimisation approach: XML, AltoXML, XML analysis, PDF/A, …

Page 35: Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Thank you for your attention!

Questions?

Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
[email protected]