Integrating OCR and NLP to Digitize 2.3 Million Lichen and Bryophyte Specimens
Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin
Jan 01, 2016
Goals and Scope
NSF ADBC (#1115116)
~2.3 million specimens
90% of all specimens
900,000 lichens
1.4 million bryophytes
> 60 non-governmental US herbaria (95%)
Mexico, US, Canada
16 digitization centers
Digitization Workflow
National Portals
Lichen Consortium
http://lichenportal.org
34 Collections, 902,664 Records
Bryophyte Consortium
http://bryophyteportal/
26 Collections, 1,300,135 Records
Symbiota software
Imaging Stage
Capture Image: barcode in file name
Create Skeleton File: species name, country, state, exsiccati, etc.
Upload to FTP server
Image processing: extract barcode, create web versions, map to portal DBs
Herbarium Database
Automated OCR: Tesseract, ABBYY
Existing Record: simply link image
Upload to FTP server
Image URLs
Manage Specimen Data in Portal
Manage / Review Records in Portal
Symbiota Editor: review, edit, keystroke
Create New Record: barcode, image, skeletal data
Automated NLP: Darwin Core Parsing
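The image-processing step (barcode in the file name, then either link to an existing record or create a new one) might be sketched as follows. This is a minimal illustration, not the Symbiota implementation: the barcode pattern, the function names, and the in-memory `records` dictionary are all assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Hypothetical barcode format: 2-6 capital letters, an optional "-" or "_",
# then 5-9 digits (for example "ASU0012345"). Real collections differ.
BARCODE_PATTERN = re.compile(r"(?<![A-Za-z])([A-Z]{2,6})[-_]?(\d{5,9})(?!\d)")


def extract_barcode(file_name: str) -> Optional[str]:
    """Return the barcode embedded in an image file name, or None."""
    stem = Path(file_name).stem
    match = BARCODE_PATTERN.search(stem)
    if match is None:
        return None
    return match.group(1) + match.group(2)


def link_or_create(barcode: str, image_url: str, records: Dict[str, dict]) -> dict:
    """Link the image to an existing record, or create a new skeletal record."""
    record = records.setdefault(barcode, {"barcode": barcode, "images": []})
    record["images"].append(image_url)
    return record
```

Files whose names carry no recognizable barcode return `None` and would need manual attention.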
Automated OCR
1. Iterate through “unprocessed” images
2. OCR via Tesseract (version 3)
   a) In focus, good lighting, minimal noise
   b) Resolution: >20px x-height
3. Database raw text block
4. Progress to next step
   a) Low OCR return => hand processing
   b) Natural Language Processing
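The loop above can be sketched roughly as below. The OCR engine is passed in as a plain callable (for instance a thin wrapper around Tesseract), so the sketch does not depend on any particular binding. The `ImageRecord` class, the status names, and the `MIN_WORDS` threshold for a "low OCR return" are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

MIN_WORDS = 5  # hypothetical cut-off for a "low OCR return"


@dataclass
class ImageRecord:
    image_path: str
    raw_text: Optional[str] = None
    status: str = "unprocessed"


def count_words(text: str) -> int:
    """Count tokens with at least three letters, a rough signal of usable text."""
    return sum(1 for tok in text.split() if sum(c.isalpha() for c in tok) >= 3)


def run_ocr_pass(
    records: Iterable[ImageRecord],
    ocr: Callable[[str], str],
    min_words: int = MIN_WORDS,
) -> List[ImageRecord]:
    """OCR every unprocessed image, store the raw text, and route the record."""
    processed = []
    for rec in records:
        if rec.status != "unprocessed":
            continue  # already handled in an earlier pass
        rec.raw_text = ocr(rec.image_path)
        if count_words(rec.raw_text) >= min_words:
            rec.status = "needs_nlp"  # enough text: hand over to NLP
        else:
            rec.status = "hand_processing"  # low return: keystroke by hand
        processed.append(rec)
    return processed
```

The image-quality conditions (focus, lighting, x-height above 20 px) are not checked here; they would have to be enforced at imaging time or in a separate pre-processing step.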
OCR Challenges
Issues: old fonts, faded labels, form labels, handwritten labels, specialized terms
Solutions: image treatments, OCR tuning, dictionaries, consensus OCR
¢_].L.|»‘¢ .'».f.'._..‘~,(.Jfin-x‘*\'a:"511z:1 wf .~\:'i/.onli State UniversityP.’~.r"~2= ,_. gg J:.2 " J*J*" †(=:\‘-“ax "»..'\-12�‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESXZ»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“» »4 xx, ,"""‘“â€T"’ <1;-.rs f3'a,1.z>.t;;a¢f~rus ’�V4 J 'if . r°'° M '1?nies ivain.) Sav.neutal Station - " '1 ~»r';;4-\P ` 1.T11 ./P.. ,J ..-.ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE_ ,. W5. (> f- , -:‘; i f>i_T ~~ . A 1:». v\ .-v »~. 4. a xvala 8/27/73
PLANTS OF NEW r~1ExIco
Herbarium of Arizona State University
Parmelia ulophyllodes (Vain.) Sav.
COUNTY “°â€â€œâ€œ �Joranada Experimental Station -
New Mexico State University
"“““' on Juniperus
ELEV. ‘ 4400
EEILLEETUR DATEDU T. H. Nash #7914 8/27/73
T. H. N.
Automated NLP
1. Iterate through raw OCR text blocks
2. Parse text block
   a) Darwin Core
   b) Populate database
3. Review
   a) Adjust content
   b) Approve
   c) Handwritten => keystroke
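As a rough illustration of parsing a raw text block into Darwin Core fields, the sketch below pulls collector, collector number, date, and elevation out of a label like the Nash example. The field names are Darwin Core terms; the regular expressions and the `parse_label` function are assumptions made for this example and cover only this one label style.

```python
import re
from typing import Dict

# Initials followed by a surname and a "#number", e.g. "T. H. Nash #7914".
COLLECTOR_NUMBER = re.compile(r"((?:[A-Z]\.\s*)+[A-Z][a-z]+)\s*#\s*(\d+)")
# Numeric dates such as "8/27/73"; kept verbatim, not interpreted.
DATE = re.compile(r"(?<!\d)(\d{1,2}/\d{1,2}/\d{2,4})(?!\d)")
# "ELEV." followed by a little OCR noise and a number.
ELEVATION = re.compile(r"ELEV\.?\D{0,5}(\d{2,5})")


def parse_label(raw_text: str) -> Dict[str, str]:
    """Map a raw OCR text block to a few Darwin Core fields."""
    fields: Dict[str, str] = {}
    match = COLLECTOR_NUMBER.search(raw_text)
    if match:
        fields["recordedBy"] = re.sub(r"\s+", " ", match.group(1)).strip()
        fields["recordNumber"] = match.group(2)
    match = DATE.search(raw_text)
    if match:
        fields["verbatimEventDate"] = match.group(1)
    match = ELEVATION.search(raw_text)
    if match:
        fields["verbatimElevation"] = match.group(1)
    return fields
```

Fields that cannot be found are simply left out, so that the review step can see what still needs to be keyed in.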
NLP Challenges
Issues: variable layouts, loose standards, OCR error
Solutions: authority tables, Levenshtein distance, word stats, format recognition, parsing profiles, duplicate harvesting
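To show how an authority table and Levenshtein distance can work together against OCR error, here is a small sketch. The authority list passed to `best_match`, the function names, and the `max_ratio` cut-off are hypothetical; only the edit-distance algorithm itself is standard.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def best_match(
    term: str, authority: Iterable[str], max_ratio: float = 0.25
) -> Optional[str]:
    """Return the closest authority entry, or None if nothing is close enough."""
    best, best_distance = None, None
    for candidate in authority:
        distance = levenshtein(term.lower(), candidate.lower())
        if best_distance is None or distance < best_distance:
            best, best_distance = candidate, distance
    if best is None or best_distance > max_ratio * max(len(term), len(best)):
        return None
    return best
```

With an authority list containing "Parmelia ulophyllodes", an OCR reading such as "Parmelia ulophyl1odes" is one edit away and resolves to the authority spelling, while an unrelated string is rejected.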
NLP: Duplicate Harvesting
1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. High similarity indexes
4. OCR block comparison
5. Consensus record
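The harvesting steps might look roughly like the sketch below. It assumes that an "exact duplicate" shares collector last name, number, and date, and that a "duplicate event" shares last name and date but not the number; the slides do not define these terms, so that split is an assumption. The record layout, the function names, and the use of `difflib` for the similarity index are also illustrative choices.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Record = Dict[str, str]


def harvest(target: Record, consortium: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split consortium records into exact duplicates and duplicate events."""
    exact, events = [], []
    for rec in consortium:
        same_collector = rec.get("lastName", "").lower() == target["lastName"].lower()
        same_date = rec.get("date") == target["date"]
        if not (same_collector and same_date):
            continue
        if rec.get("number") == target["number"]:
            exact.append(rec)
        else:
            events.append(rec)
    return exact, events


def similarity(a: str, b: str) -> float:
    """Similarity index in [0, 1] between two raw OCR blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def consensus(records: List[Record], fields: List[str]) -> Record:
    """Build a consensus record by taking the most common non-empty value per field."""
    result: Record = {}
    for field in fields:
        values = [r[field] for r in records if r.get(field)]
        if values:
            result[field] = Counter(values).most_common(1)[0][0]
    return result
```

A simple majority vote is only one way to form a consensus; weighting records by their similarity index would be another option.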
NLP: Targeted Parsing Profiles
1. Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Targeted parsing algorithms
4. Exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
   d) County
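A targeted profile of this kind could start from a filter like the one sketched below, which accepts a raw OCR block only if "Nash" appears somewhere other than in the four excluded contexts. The context patterns are simple heuristics invented for this example (they expect line breaks to be preserved in the OCR text) and would need tuning against real labels.

```python
import re

# Contexts BEFORE the name that rule out "collector" (all heuristic).
BEFORE_EXCLUSIONS = [
    # Determiner: "Det. T. H. Nash", "Determined by Nash"
    re.compile(r"(?i:\bdet(?:ermined)?\.?(?:\s+by)?)\s*:?\s*(?:[A-Z]\.\s*)*$"),
    # Associated collector: "J. Smith & T. H. Nash", "with Nash"
    re.compile(r"(?:&|\b(?:and|with)\b)\s*(?:[A-Z]\.\s*)*$"),
    # Author of scientific name: "Genus epithet (Auth.) T. H. Nash" on one line
    re.compile(
        r"[A-Z][a-z]{2,}[ \t]+[a-z]{3,}[ \t]+(?:\([^)\n]*\)[ \t]*)?(?:[A-Z]\.[ \t]*)*$"
    ),
]

# Contexts AFTER the name that rule out "collector".
AFTER_EXCLUSIONS = [
    re.compile(r"^\s+(?i:county|co\.)"),  # "Nash County"
]


def is_target_label(raw_text: str, name: str = "Nash") -> bool:
    """True if the name occurs at least once outside the excluded contexts."""
    for match in re.finditer(r"\b" + re.escape(name) + r"\b", raw_text):
        before = raw_text[max(0, match.start() - 60):match.start()]
        after = raw_text[match.end():match.end() + 20]
        if any(p.search(before) for p in BEFORE_EXCLUSIONS):
            continue
        if any(p.search(after) for p in AFTER_EXCLUSIONS):
            continue
        return True
    return False
```

Blocks that pass the filter would then go to the parsing algorithm written for that label format; everything else stays with the general parser.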
Label Review
Thank You
Michael Adamo, Bruce Allen, Meredith Blackwell, Bill Buck, Alina Freire-Fierro, John Freudenstein, Alan Fryday, David Giblin, Karen Hughes, Steffi Ickert-Bond, Timothy James, Jennifer S. Kluse, Matt Von Konrat, Ben Legler, Tatyana Livshultz, Robert Lücking, Francois Lutzoni, Bob Magill, Andrew Miller, Brent Mishler, Donald Pfister, Richard Rabeler, Malcolm Sargent, Edward Schilling, Michaela Schmull, Blanka Shaw, Jon Shaw, Carol Shearer, Larry StClair, Barbara Thiers
Funded by the NSF ADBC program