ExpressReader Pro adopted to retrodigitization of mathematical documents Kazuaki Yokota

Dec 20, 2015

Page 1:

ExpressReader Pro adopted to retrodigitization of mathematical documents

Kazuaki Yokota

Page 2:

ExpressReader Pro

■ Printed Text OCR

■ Japanese / English

■ Recognition Rate

99.7% for Japanese

99.8% for English

■ Powerful Layout Analysis

■ For x86-based Windows PCs

Features

Page 3:

Layout analysis 1

Page 4:

Layout analysis 2

Page 5:

Adoption for mathematical documents

■ Application framework

■ Detection and recognition of mathematical formulae

■ Output format

Problems

Page 6:

Flow diagram

Image scanning

Skew correction

Layout analysis

Character recognition

User modification

Output conversion

Formula recognition

Formula detection

Page 7:

Component relation

Scanning

Graphical User Interface

INFTY formula recognition

Layout analysis

Character recognition

Formula detection

Page 8:

Formula detection 1

■ Score each word obtained by character recognition twice: once as part of a mathematical formula (M) and once as a text word (T).

M:    0   90  100  100    0   90   70   90
T:  100   40   20   20  100   40   70   90
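As a rough illustration of this first pass, the sketch below takes the two score rows from the slide and labels each word by whichever score is higher. The tie rule and the names `first_pass`, `M`, and `T` are assumptions made for the example; the slide does not say how the product itself combines the scores.

```python
# Scores copied from the slide: one column per word.
M = [0, 90, 100, 100, 0, 90, 70, 90]     # score as part of a formula
T = [100, 40, 20, 20, 100, 40, 70, 90]   # score as a text word


def first_pass(math_scores, text_scores):
    """Label each word by the larger score; equal scores stay undecided.

    Assumption: a plain comparison is used here only for illustration.
    """
    labels = []
    for m, t in zip(math_scores, text_scores):
        if m > t:
            labels.append("formula")
        elif t > m:
            labels.append("text")
        else:
            labels.append("ambiguous")
    return labels


labels = first_pass(M, T)
for index, label in enumerate(labels):
    print(index, M[index], T[index], label)
```

With these numbers the last two words have equal scores, which is the kind of case that local scoring alone cannot settle and that calls for context.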

Page 9:

Formula detection 2

■ Parse the scored words with a context-free grammar (CFG). Formula is itself a non-terminal symbol of this CFG.
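The grammar used in the product is not reproduced in this transcript, so the following is only a much-simplified stand-in. It assumes a toy grammar, Line → Item Line | Item, Item → TextWord | Formula, Formula → MathWord Formula | MathWord, and a hypothetical fixed cost for every Formula node so that one long formula is preferred over several short ones. This toy grammar happens to be regular, so a Viterbi-style dynamic program finds the best-scoring parse; a genuine CFG would need a chart parser such as CYK instead.

```python
M = [0, 90, 100, 100, 0, 90, 70, 90]     # formula scores from the previous slide
T = [100, 40, 20, 20, 100, 40, 70, 90]   # text scores from the previous slide
FORMULA_START_COST = 10                  # hypothetical penalty per Formula node


def best_labeling(math_scores, text_scores, start_cost=FORMULA_START_COST):
    """Return (labels, score) of the best parse under the toy grammar.

    'T' marks a TextWord, 'M' a MathWord inside a Formula.
    """
    # best[i][s]: best total for words 0..i when word i has label s.
    best = [{"T": text_scores[0], "M": math_scores[0] - start_cost}]
    back = [{}]
    for i in range(1, len(math_scores)):
        prev = best[-1]
        row, ptr = {}, {}
        # A TextWord may follow either label.
        p = max("TM", key=lambda s: prev[s])
        row["T"], ptr["T"] = prev[p] + text_scores[i], p
        # A MathWord continues the current Formula or opens a new one.
        cont = prev["M"] + math_scores[i]
        opened = prev["T"] + math_scores[i] - start_cost
        if cont >= opened:
            row["M"], ptr["M"] = cont, "M"
        else:
            row["M"], ptr["M"] = opened, "T"
        best.append(row)
        back.append(ptr)
    # Trace the best path back from the last word.
    state = max("TM", key=lambda s: best[-1][s])
    labels = [state]
    for i in range(len(math_scores) - 1, 0, -1):
        state = back[i][state]
        labels.append(state)
    return labels[::-1], max(best[-1].values())


labels, score = best_labeling(M, T)
print("".join(labels), score)
```

On the slide's scores the first six words are labelled unambiguously (text, a three-word formula, text, formula). The last two words have equal scores, so either reading gives the same total and the choice depends on tie-breaking.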

Page 10:
Page 11:

XML based processing

■ Input: recognition parameters, image

■ While processing: layout information, etc.

■ Output: result

OCR needs various data while processing.

To integrate OCR into an application system, the user must write code to handle all of these data. Solution: unify them in XML.

Page 12:

XML Based Processing

Layout analysis

Character recognition

Formula detection

Graphical User Interface

XML

XML

XML

Page 13:

Advantage of XML

■ Easy to convert to other formats (XSLT)

■ Easy to process (DOM/SAX)

■ Extensible / Flexible

■ MathML

■ Platform independent

Page 14:

XML format 1

<OCR>
  <Parameter>
    ……Recognition Parameters
  </Parameter>
  <Document>
    <Sheet>
      <Area>
        <Text>
          ….. Recognized Results(After Recognition)
        </Text>
      </Area>
    </Sheet>
  </Document>
</OCR>
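To show why a single XML document is convenient for downstream conversion, here is a minimal sketch that flattens a result file into plain text with Python's standard `xml.etree.ElementTree`. The element names follow the XML format slides; the sample content, the function name `to_plain_text`, and the choice of one output line per `Line` element are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two lines of two characters each, using the
# OCR / Document / Sheet / Area / Text nesting shown on the slide.
SAMPLE = """<OCR>
  <Document><Sheet><Area>
    <Text tag="paragraph" language="English">
      <Field>
        <Line><Character code="0x67"/><Character code="0x6f"/></Line>
        <Line><Character code="0x6f"/><Character code="0x6e"/></Line>
      </Field>
    </Text>
  </Area></Sheet></Document>
</OCR>"""


def to_plain_text(ocr_xml):
    """Join the recognized characters of every Line, one output line each."""
    root = ET.fromstring(ocr_xml)
    lines = []
    for line in root.iter("Line"):
        # The code attribute is a hexadecimal character code such as "0x67".
        chars = [chr(int(c.get("code"), 16)) for c in line.iter("Character")]
        lines.append("".join(chars))
    return "\n".join(lines)


text = to_plain_text(SAMPLE)
print(text)
```

The same traversal could equally be written as an XSLT stylesheet or a SAX handler; ElementTree is used here only because it keeps the example short.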

Page 15:

XML format 2

<Text tag="paragraph" language="English" line_direction="horz" rect="56,308,3258,714">
  <ExpText tag_id="0"/>
  <Field>
    <Line rect="56,308,3257,392">
      <Character rect="56,332,96,392" code="0x67">g
        <ExpCharacter original_code="0x67" offset="0" size="40"/>
        <Candidate id="1" code="0x67" sim="867"/>
      </Character>
      ……
    </Line>
  </Field>
</Text>
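The per-character attributes can be read back with a few lines of code. The sketch below parses the `Character` element from the slide and collects its bounding box and candidate list. It assumes that `rect` holds left, top, right, bottom in pixels and that `sim` is a similarity score where higher is better; neither is stated on the slide, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The Line and Character elements are copied from the slide.
SAMPLE = """
<Line rect="56,308,3257,392">
  <Character rect="56,332,96,392" code="0x67">g
    <ExpCharacter original_code="0x67" offset="0" size="40"/>
    <Candidate id="1" code="0x67" sim="867"/>
  </Character>
</Line>
"""


def parse_rect(value):
    # Assumption: the four numbers are left, top, right, bottom.
    left, top, right, bottom = (int(v) for v in value.split(","))
    return left, top, right, bottom


def read_characters(line_xml):
    """Return one dict per Character with its code, box, and candidates."""
    line = ET.fromstring(line_xml.strip())
    result = []
    for ch in line.iter("Character"):
        candidates = [
            (chr(int(c.get("code"), 16)), int(c.get("sim")))
            for c in ch.iter("Candidate")
        ]
        result.append({
            "char": chr(int(ch.get("code"), 16)),
            "rect": parse_rect(ch.get("rect")),
            "candidates": candidates,
        })
    return result


characters = read_characters(SAMPLE)
print(characters)
```

Keeping the candidate list next to the chosen character is what lets a later stage, such as user modification in the GUI, offer alternatives without rerunning recognition.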

Page 16:

XML format 3

<Character rect="56,332,96,392" code="0x67">g
  <ExpCharacter original_code="0x67" offset="0" size="40"/>
  <Candidate id="1" code="0x67" sim="867"/>
</Character>
<Formula rect="98,332,205,392">
  <MathML> ….Mathematical formulae </MathML>
</Formula>
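A consumer of this format has to handle text characters and formulae side by side. The sketch below walks the children of one line and separates them. It assumes that `Formula` appears as a sibling of `Character` inside `Line`, as the slide suggests, and the MathML fragment is a placeholder invented for the example, since the slide elides the real content.

```python
import xml.etree.ElementTree as ET

# Character and Formula rects are from the slide; the <math> body is a
# hypothetical placeholder for the elided formula.
SAMPLE = """
<Line rect="56,308,3257,392">
  <Character rect="56,332,96,392" code="0x67">g</Character>
  <Formula rect="98,332,205,392">
    <MathML><math><mi>x</mi></math></MathML>
  </Formula>
</Line>
"""


def split_line(line_xml):
    """Return the line as a list of ("text", char) and ("formula", mathml)."""
    items = []
    for child in ET.fromstring(line_xml.strip()):
        if child.tag == "Character":
            items.append(("text", chr(int(child.get("code"), 16))))
        elif child.tag == "Formula":
            wrapper = child.find("MathML")
            # Serialize whatever markup sits inside the MathML wrapper.
            body = "".join(ET.tostring(e, encoding="unicode") for e in wrapper)
            items.append(("formula", body.strip()))
    return items


items = split_line(SAMPLE)
print(items)
```

Because the formula is carried as embedded MathML, it can be passed on unchanged to any MathML-aware tool while the surrounding text is converted separately.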

Page 17:

Demonstration

■ ….

Page 18:

Product form

■ Software Development Kit

■ Simple OCR Software

For x86-based Windows PCs

Page 19:

Summary

■ A more convenient GUI is needed

■ We hope our product will make your business more efficient.