CONCERT: COoperative eNgine for Correction of ExtRacted Text
Asaf Tzadok, Manager, Image and Document Analytics Group
IBM Labs in Haifa, October 2011
© 2011 IBM Corporation
Oct 19, 2014
Introduction
- An estimated 100 million books or more have been produced since Johannes Gutenberg invented the movable-type printing press in the 15th century.
- A large part of this vast literature is now being converted to digital books and moved into the world of electronic publishing.
- The digitization process involves:
  - Scanning technologies
  - OCR (Optical Character Recognition)
  - Post-correction
- OCR quality ranges between 50% and 90% word-level accuracy
- Post-correction is a must, but it is costly and slow
  - ~1 euro per A5 page
Crowd Sourcing Projects
- Distributed Proofreaders: Project Gutenberg
- National Library of Australia: Australian Newspaper Digitisation
- LDS Church: FamilySearch
- National Library of Finland: Digitalkoot
- All are purely volunteer-based crowdsourcing programs, and they work!
Gutenberg Project – 1st Gen.
NLA – Australian Newspapers – 2nd Gen.
Collaborative Correction – State of the Art cont.
- State-of-the-art systems, such as Project Gutenberg, simply show the page image and the OCR results to be corrected
- Drawbacks:
  - Slow and unproductive process
  - Prone to errors
  - Hard to cross-check and merge
  - Two passes are needed to ensure quality
- Result: a complex, hard-to-track process = a lot of manual labor = limited public participation and contribution
DIGITALKOOT - Mole Games – 3rd Gen
Collaborative Correction – Games
- Wider and younger public participation
- Easy to cross-check
- Allows parallelism
- Fully scalable
- Drawbacks:
  - Low productivity factor
  - Static process with a huge amount of work
  - Limited use of the feedback from the users
- Very long process to complete the digitization
Collaborative Correction – How does it work
- A fully web-based collaborative correction system
  - No installation on the client side
  - Intuitive enough for use by the general public
- Call for participation (optional)
  - Via the official website of the library
  - Collection based
- Volunteers keen on contributing to the preservation of their cultural heritage
  - Top-performer lists
  - Library recognition awards
  - Acknowledgements
CONCERT
- Adaptive collaborative correction platform
  - Uses the feedback from the users to improve productivity
  - Fully connected to the Adaptive OCR Engine
- Strong emphasis on productivity tools
  - Reduce the time for verification and correction
  - Patented smart-key approach
  - Motivate volunteers
- Separating the data entry process into several complementary tasks
  - Optimized application dedicated to each task
  - Break down the tasks into subtasks
  - Make them suitable for parallel processing
  - Online compilation
- Digitization flow optimizations
  - Hierarchical context level: character -> word -> page
CONCERT System Architecture
[Architecture diagram. Components shown: Scanned Book; Image Enhancements; OMNI Engine (ABBYY FRE); Book Fonts Extraction; Book Optimized Adaptive OCR Engine; Dictionaries; CONCERT Quality Control; CONCERT Productivity Tools; CONCERT Games; Web Users; High Quality Transcription]
Adaptive OCR - Requirements
- Consistent and reliable confidence level
  - Important for quality assurance
- No use of prior knowledge of the font
  - Unusual fonts can be handled
- Good use of the feedback from the users
  - At character and word level
- Robust to distortion
  - Page-level distortion and printing variations
- Easy to migrate between books from the same publisher
  - Continuous update
- Not too slow
  - Around 2-3 times slower than OMNI engines
Adaptive OCR – Technical Considerations
- Pixel domain (template matching)
  - Pros
    - Easy to implement
    - Scoring consistency
  - Cons
    - Slow
    - Sensitive to small distortion
- Feature domain
  - Pros
    - Fast
    - Robust to small distortion
    - Using invariant features can improve robustness to distortion
  - Cons
    - Inconsistent scoring mechanism
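The pixel-domain trade-off can be illustrated with a minimal sketch. It assumes glyphs are small 0/1 images stored as lists of rows and uses normalized cross-correlation as the matching score; the slides name neither choice, and `ncc`, `shift_right`, and the toy template are hypothetical.

```python
from math import sqrt

def ncc(a, b):
    """Normalized cross-correlation of two equally sized 0/1 images."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same number of pixels")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs)
               * sum((y - mean_y) ** 2 for y in ys))
    # A flat image has zero variance; report "no correlation" in that case.
    return num / den if den else 0.0

def shift_right(img, n=1):
    """Shift each row n pixels to the right, padding with background (0)."""
    return [[0] * n + row[:len(row) - n] for row in img]

# Toy 5x5 template: a one-pixel-wide vertical stroke (hypothetical data).
TEMPLATE = [[0, 0, 1, 0, 0] for _ in range(5)]

same = ncc(TEMPLATE, TEMPLATE)                # identical glyphs
moved = ncc(TEMPLATE, shift_right(TEMPLATE))  # same glyph, one pixel off
print(f"identical: {same:.2f}, shifted by one pixel: {moved:.2f}")
```

The score is computed the same way for every glyph, which is the scoring consistency listed under pros, but it falls sharply for the shifted copy, which is the sensitivity to small distortion listed under cons.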
Adaptive OCR - Hybrid Approach
Distortion Example
- Using hierarchical optical flow
  - High-quality compensation for non-linear character warping
  - Can overcome significant distortions
System flow
- Character (Carpet) Session
  - Fast validation of OCR results
  - Every word with a rejected character is routed to the Word Session
- Word Verifier Session
  - Used for cases where contextual information is necessary
  - A rejected word will be corrected in the Page Session
- Page-level Session
  - For final closure of the page
  - When a view of the entire page is required for understanding
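The escalation between sessions can be sketched as a small routing function. This is an illustration only: the `Session` enum, `next_session`, and `trace` are hypothetical names, and the sketch assumes that an accepted item needs no further session.

```python
from enum import Enum

class Session(Enum):
    CHARACTER = "character"
    WORD = "word"
    PAGE = "page"
    DONE = "done"

def next_session(current, accepted):
    """Return where an item goes after a volunteer's verdict in `current`."""
    if accepted:
        return Session.DONE          # assumption: accepted items are finished
    if current is Session.CHARACTER:
        return Session.WORD          # rejected character: review the whole word
    if current is Session.WORD:
        return Session.PAGE          # rejected word: correct it in page context
    return Session.DONE              # the page session is the final closure

def trace(verdicts):
    """List the sessions one item visits, starting from the Character Session."""
    path = [Session.CHARACTER]
    for accepted in verdicts:
        nxt = next_session(path[-1], accepted)
        if nxt is Session.DONE:
            break
        path.append(nxt)
    return path

# An item rejected twice climbs the hierarchy character -> word -> page.
print([s.value for s in trace([False, False, True])])
```

Keeping the routing in one function makes the hierarchy explicit: each level is visited only when the cheaper level below it could not settle the item.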
Character Session
- OCR results are analyzed:
  - Very high confidence results do not require verification
  - High confidence results may include some character recognition errors; hence the Character Session is used
  - Low confidence results may have been caused by segmentation errors; hence the Word Session is used
- For the Character Session, individual character images are extracted and grouped by recognition result (i.e. all the “b” characters are grouped together in the same session)
- Within a given session, the characters are grouped by confidence
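A minimal sketch of this triage and grouping follows. The thresholds `VERY_HIGH` and `HIGH` are invented for illustration, since the slides give no numeric values, and the `Glyph` record is hypothetical. Sorting each session by confidence is one possible reading of "grouped by confidence".

```python
from collections import defaultdict
from typing import NamedTuple

class Glyph(NamedTuple):
    label: str         # character the OCR engine recognized
    confidence: float  # engine confidence in [0, 1]
    image_id: str      # reference to the cropped character image

# Hypothetical thresholds; not taken from the slides.
VERY_HIGH = 0.98
HIGH = 0.80

def triage(glyph):
    """Decide which kind of verification a recognized character needs."""
    if glyph.confidence >= VERY_HIGH:
        return "accept"             # no verification required
    if glyph.confidence >= HIGH:
        return "character_session"  # possible recognition error
    return "word_session"           # possible segmentation error

def build_character_sessions(glyphs):
    """Group character-session glyphs by label, ordered by confidence."""
    sessions = defaultdict(list)
    for g in glyphs:
        if triage(g) == "character_session":
            sessions[g.label].append(g)
    for group in sessions.values():
        group.sort(key=lambda g: g.confidence)  # least certain first
    return dict(sessions)

glyphs = [
    Glyph("b", 0.99, "g1"), Glyph("b", 0.95, "g2"), Glyph("b", 0.85, "g3"),
    Glyph("h", 0.90, "g4"), Glyph("b", 0.40, "g5"),
]
for label, group in build_character_sessions(glyphs).items():
    print(label, [g.image_id for g in group])
```

Showing all glyphs with the same label side by side lets a volunteer spot the odd one out at a glance instead of reading each word.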
Word Session
- Used for words that
  - Are not in the dictionary
  - Have low-confidence characters
  - Have characters rejected in the Character Session
- Shows
  - Original word image
  - Recognition results
  - Possible spelling options
- Words are ordered alphabetically
  - Based on the lexicographic order of the recognition results
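The selection and ordering rules can be sketched as follows. The `OcrWord` record, the `LOW_CONFIDENCE` threshold, and the use of Python's `difflib.get_close_matches` for spelling options are assumptions made for illustration; the slides do not say how CONCERT generates its suggestions.

```python
from difflib import get_close_matches
from typing import NamedTuple

class OcrWord(NamedTuple):
    text: str                        # recognition result
    char_confidences: tuple          # one confidence per character
    has_rejected_char: bool = False  # set by the Character Session

LOW_CONFIDENCE = 0.80  # hypothetical threshold

def needs_word_session(word, dictionary):
    """Apply the three selection criteria listed on the slide."""
    return (
        word.text.lower() not in dictionary
        or min(word.char_confidences, default=1.0) < LOW_CONFIDENCE
        or word.has_rejected_char
    )

def build_word_session(words, dictionary):
    """Return (word, spelling options) pairs in lexicographic order."""
    selected = [w for w in words if needs_word_session(w, dictionary)]
    selected.sort(key=lambda w: w.text)
    vocab = sorted(dictionary)
    return [(w, get_close_matches(w.text.lower(), vocab, n=3)) for w in selected]

dictionary = {"house", "horse", "mouse"}
words = [
    OcrWord("house", (0.99,) * 5),
    OcrWord("mouse", (0.99, 0.50, 0.99, 0.99, 0.99)),
    OcrWord("hovse", (0.90,) * 5),
]
for word, options in build_word_session(words, dictionary):
    print(word.text, options)
```

Sorting by recognition result places identical and near-identical strings next to each other, so a recurring OCR error can be reviewed in one run.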
Word Session – Before data entry
Word Session – After data entry
Page Session
- Used to correct cases where word segmentation fails
- Can be activated in one of 4 flavors
  - Word view
  - Line view
  - Paragraph view
  - Tagging view
- The system can move automatically from one problematic word to the next
CONCERT - Page Session
Multilingual Support
- English, 1772
- French, 1668
- German Gothic, 1778
- Dutch, 1789
- Japanese
Hearst Newsreel Collection – Index Card
User Monitoring
- Wide public participation may lead to data corruption by
  - Malicious users
  - Unqualified users
- User rating and feedback motivate use of the system
- Three-way validation
  - Good injection: characters/words known with high confidence to be true
  - Similar injection: characters/words that look similar but are not identical, for example an ‘O’ injected into a ‘Q’ session
  - Error injection: characters/words known with high confidence to be false
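One way to turn the three injection types into a user rating is sketched below. The `Probe` record, the scoring rule (fraction of injected items answered as expected), and the `is_trusted` cut-offs are hypothetical; the slides do not describe how CONCERT computes its ratings.

```python
from typing import NamedTuple

# Expected verdict for each injection type: a good injection should be
# accepted, while similar and error injections should be rejected.
EXPECTED_ACCEPT = {"good": True, "similar": False, "error": False}

class Probe(NamedTuple):
    kind: str            # "good", "similar", or "error"
    user_accepted: bool  # what the volunteer actually answered

def reliability(probes):
    """Fraction of injected items the user answered as expected."""
    if not probes:
        return None  # no evidence yet
    correct = sum(EXPECTED_ACCEPT[p.kind] == p.user_accepted for p in probes)
    return correct / len(probes)

def is_trusted(probes, min_score=0.9, min_probes=10):
    """Hypothetical policy: trust only users with enough good answers."""
    score = reliability(probes)
    return score is not None and len(probes) >= min_probes and score >= min_score

history = [
    Probe("good", True),      # accepted a correct item: expected
    Probe("similar", False),  # rejected an 'O' in a 'Q' session: expected
    Probe("error", True),     # accepted a known error: unexpected
    Probe("good", True),
]
print(reliability(history))
```

Because the injected items have known answers, the same mechanism can flag both malicious and unqualified users without inspecting their real corrections.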
User Monitoring – Screenshots
CONCERT Games
CONCERT in use
- Hearst Newsreel Archive Collection
  - First production use
  - Tagging capabilities
- Pilot in Japan for the Japanese Library
  - Including customization for Japanese
- First-phase pilots in major libraries in Europe
  - KB: National Library of the Netherlands
  - BL: British Library
  - BSB: Bavarian State Library
CONCERT Future Planning
- Search over OCR
  - Beyond transcription
- Improve user feedback
  - Online advisor
  - Best-performers list
- Community building around content
  - Integrate community tools within the platform
- CONCERT Games
  - iPhone/iPad/Android/Desktop
- E-book creation
  - Fully digital transcription
  - Using the original font as an option
- Page distortion correction
  - Fully integrate the word-based page distortion correction
Thank You!