May 26, 2015
Freeland. BHL Tech Overview. 12 May 2009 www.biodiversitylibrary.org
BHL Technology Overview
Chris Freeland
Technical Director, BHL
Director of Bioinformatics, Missouri Botanical Garden
About BHL: Usage, History
Goals of BHL
• Scan public domain biodiversity literature.
• Negotiate rights to digitize copyrighted materials.
• Ingest content digitized by others.
• Provide interfaces & APIs for repository.
– GUIs
– Services for data mining & citation resolution
http://www.biodiversitylibrary.org
Now Online
• More than:
– 33,000 volumes
– 13.3 million pages
• Avg. monthly growth rate:
– 1,500 volumes
– 600,000 pages
Monthly Usage Stats
• 45,000 unique users
• 250,000 pageviews
History
• Preliminary work: MOBOT’s Botanicus
– http://www.botanicus.org
• Funded by Keck Foundation & IMLS
• Working demonstration of how nomenclators/databases can link into digitized scientific literature
Architecture
Distributed
• Digitized content on Internet Archive servers in California
• Metadata index on MOBOT servers in Missouri
• Image server on MBL servers in Massachusetts
• Nice, but not global
[Diagram: MOBOT; Petabox cluster (Internet Archive); Image Server (MBL)]
Scanning Workflow
Scanning Operations
BHL uses scanning centers established by Internet Archive for mass scanning.
Some partner libraries also scan in-house.
Want to expand international footprint:
• mirrored content
• ingest from global data providers
[Map: Locations of BHL/IA Scanning Centers]
Workflow
[Diagram labels: Selection; Preparation; Post Production; (Re)publication; Digitization; Conservation]
Open Access Data
Flora medica, oder, Abbildung der wichtigsten officinellen Pflanzen… [Heft 1-18]
Publisher: Jena, August Schmid, 1831 [i.e. 1829-1831].
[Diagram labels: OCR; XML; JP2]
Complexities of distributed, mass scanning
from NYBG
from Smithsonian
Post Processing & Derivatives
Derivatives
• JPEG2000 (JP2) images
• OCR: ABBYY FineReader
• PDF: LuraTech PDF Compressor
• XML metadata
Name Finding via TaxonFinder
• Raw image
• Converted to text via OCR
• Name finding via TaxonFinder
• Extract names
• Submit to NameBank
• SOAP response
Name Finding in action
with Taxonomic Intelligence…
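The name-finding steps on this slide (OCR text in, candidate names out, then verification against a name service) can be sketched roughly as below. This is only an illustration under stated assumptions: the regular expression is a naive stand-in and is not TaxonFinder's algorithm, and `KNOWN_NAMES`, `find_candidate_names`, and `verify_names` are hypothetical names, with an in-memory set standing in for the NameBank SOAP call.

```python
import re

# Hypothetical stand-in for NameBank: a tiny set of accepted name strings.
KNOWN_NAMES = {"Amanita muscaria", "Agaricus campestris"}

# Naive pattern for Latin binomials: capitalized genus + lowercase epithet.
# A real name finder uses far richer evidence than this single pattern.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidate_names(ocr_text):
    """Return candidate binomials found in OCR text, in order of appearance."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]


def verify_names(candidates, known_names=KNOWN_NAMES):
    """Split candidates into (verified, unverified) lists."""
    verified = [name for name in candidates if name in known_names]
    unverified = [name for name in candidates if name not in known_names]
    return verified, unverified


page_text = "This plate shows Amanita muscaria beside Agaricus campestris."
candidates = find_candidate_names(page_text)
verified, unverified = verify_names(candidates)
```

Here "This plate" is picked up as a false candidate and ends up in `unverified`, which is one reason a verification step after extraction is useful.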
Name Finding Stats to date*
• Have mined more than 42 million name string occurrences
• More than 30 million name strings verified by NameBank
– 1.5 million unique
*12 May 2009
Content Delivery
OCR error rate for names only: 35.16%
A 2008 study found that, in a sample of 3,003 names, 1,056 were incorrectly transcribed by OCR.
http://biodiversitylibrary.blogspot.com/2008/10/evaluation-of-taxonomic-name-finding.html
Top OCR errors
Rank  Error          Rank  Error
1     Insert space   8     n->v
2     Omit space     9     l->i
3     e->c           10    r->i
4     u->I           11    u->ii
5     u->n           12    h->l
6     i->l           13    h->ii
7     c->e           14    e->o
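The error rate follows directly from the study figures (1,056 of 3,003), and the confusion list suggests one simple repair idea: undo a single confusion and check the result against known names. The sketch below assumes that "e->c" means a correct "e" was read as "c"; it uses only a subset of the listed errors, and `candidate_corrections` and `repair` are hypothetical helpers, not the method used in the 2008 study.

```python
# Subset of the listed confusions as (correct, as_read_by_ocr) pairs.
# Assumption: "e->c" on the slide means a true "e" was read as "c".
# Space insertions/omissions and the ambiguous "u->I" entry are left out.
CONFUSIONS = [("e", "c"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
              ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"),
              ("e", "o")]


def error_rate(incorrect, total):
    """Percentage of names transcribed incorrectly."""
    return 100.0 * incorrect / total


def candidate_corrections(ocr_name):
    """Yield strings made by undoing exactly one confusion at one position."""
    seen = set()
    for correct, misread in CONFUSIONS:
        start = ocr_name.find(misread)
        while start != -1:
            fixed = ocr_name[:start] + correct + ocr_name[start + len(misread):]
            if fixed not in seen:
                seen.add(fixed)
                yield fixed
            start = ocr_name.find(misread, start + 1)


def repair(ocr_name, known_names):
    """Return a known name reachable by at most one undo, else None."""
    if ocr_name in known_names:
        return ocr_name
    for candidate in candidate_corrections(ocr_name):
        if candidate in known_names:
            return candidate
    return None


rate = error_rate(1056, 3003)  # about 35.16
fixed = repair("Qucrcus alba", {"Quercus alba"})
```

A single-substitution search like this stays cheap but misses names with two or more OCR errors; whether that trade-off is acceptable depends on the corpus.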
Current image delivery: djatoka
• Images stored as JPEG2000 (.jp2)
• Decoded & delivered to browser via djatoka
– Open source JP2 image server
– Developed by digital librarians
– Scalable
– Rapid development cycle (v1.1)
– Growing community of users
BHL/IA architecture
A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
[Diagram labels]
• Browser: IIPViewer
• www.biodiversitylibrary.org (St. Louis): /page/1274907; BHLdb; locate: pageid: 1274907
• IA (San Francisco): http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2
• images.biodivlibrary.org (Woods Hole): djatoka; .jp2; .jpg
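The request flow in this diagram (page id, lookup of the JP2 location, then a djatoka request that returns a JPEG) might look roughly like the sketch below. Everything specific here is an assumption: `PAGE_INDEX` is a hypothetical stand-in for the BHLdb lookup, the example.org URLs are placeholders, the resolver path is a common djatoka default that a given deployment may change, and the parameters follow djatoka's OpenURL `getRegion` service as generally documented.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for the BHLdb lookup: page id -> location of the JP2 file.
PAGE_INDEX = {1274907: "http://example.org/scans/book_0010.jp2"}

# Assumed resolver location; a real deployment may use another host and path.
DJATOKA_RESOLVER = "http://images.example.org/adore-djatoka/resolver"


def locate_jp2(page_id):
    """Look up where the JP2 for a page is stored (raises KeyError if unknown)."""
    return PAGE_INDEX[page_id]


def djatoka_jpeg_url(jp2_url, level=3):
    """Build a djatoka OpenURL request asking for a JPEG rendering of a JP2."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_id": jp2_url,                      # the JP2 to decode
        "svc_id": "info:lanl-repo/svc/getRegion",
        "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
        "svc.format": "image/jpeg",             # output format sent to the browser
        "svc.level": level,                     # resolution level to decode
    }
    return DJATOKA_RESOLVER + "?" + urlencode(params)


image_url = djatoka_jpeg_url(locate_jp2(1274907))
```

Keeping the lookup and the image request separate mirrors the slide: the metadata index and the image server live on different hosts.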
New delivery option: IA Bookreader
• Open source
• Example: Flora medica
– http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229
IA Book Viewer
APIs & Data Sharing
• Name Service (Documentation)
– REST: XML or JSON
• Data Export (Documentation)
– Monthly export of BHL titles, volumes, pages, names, other metadata in delimited files
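As a rough idea of how the monthly delimited export could be consumed, the sketch below counts pages per item from a small in-memory sample. The column names `PageID` and `ItemID`, the tab delimiter, and the sample rows are all assumptions made for illustration; the real export layout should be taken from the BHL documentation.

```python
import csv
import io
from collections import Counter

# Made-up sample in an assumed tab-delimited layout with a header row.
SAMPLE_PAGES = "PageID\tItemID\n1274907\t100\n1274908\t100\n1300001\t200\n"


def pages_per_item(export_file, delimiter="\t"):
    """Count pages per item from a delimited export with a header row."""
    reader = csv.DictReader(export_file, delimiter=delimiter)
    return Counter(row["ItemID"] for row in reader)


counts = pages_per_item(io.StringIO(SAMPLE_PAGES))
```

Passing a file object rather than a path keeps the function usable with downloaded files, streams, or test strings alike.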
*Soon: Citation resolver via OpenURL
Beetle, A. A. 1977. Noteworthy grasses from Mexico V. Phytologia 37(4): 317–407.
http://example.edu/cgi?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:article&rft.jtitle=Phytologia&rft.atitle=Noteworthy+grasses+from+Mexico&rft.aulast=Beetle&rft.aufirst=A&rft.date=1977&rft.volume=37&rft.issue=4&rft.spage=317&rft.epage=407
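The OpenURL above is a key/value encoding of the citation's fields. A minimal sketch of producing it from a parsed citation follows; `citation_to_openurl` and the dictionary layout are hypothetical, and the resolver address is the placeholder used on the slide.

```python
from urllib.parse import urlencode


def citation_to_openurl(base_url, citation):
    """Encode journal-article citation fields as an OpenURL (Z39.88-2004) query."""
    fields = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:article"),
    ]
    # Each citation field becomes an "rft."-prefixed key, in the order given.
    fields += [(f"rft.{key}", value) for key, value in citation.items()]
    # safe=":/" keeps the format identifier readable, as in the slide's example.
    return base_url + "?" + urlencode(fields, safe=":/")


beetle_1977 = {
    "jtitle": "Phytologia",
    "atitle": "Noteworthy grasses from Mexico",
    "aulast": "Beetle",
    "aufirst": "A",
    "date": "1977",
    "volume": "37",
    "issue": "4",
    "spage": "317",
    "epage": "407",
}
url = citation_to_openurl("http://example.edu/cgi", beetle_1977)
```

With these inputs the function reproduces the example URL shown on the slide; a resolver would decode the same keys to find the matching article or page.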
Articles
Article repository
• Needed a way to display these PDFs
• Wanted to extend contribution functionality to users
• “Safe harbor” model
– BHL provides platform
– Community provides content
• Scientists, students, libraries
http://cite.biodiversitylibrary.org
• Drupal with Biblio module
• Multi-lingual interface
• Customizable display, layout
• Solr search/faceting
• OAI & other services for discovery/sharing
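For the OAI bullet, a harvester would typically issue OAI-PMH `ListRecords` requests and read record identifiers from the response. The sketch below shows that shape only: the base URL and the sample XML are made up, and nothing here reflects the actual endpoint of the article repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # Follow-up requests carry only the token, per the protocol.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Extract record identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Made-up response, trimmed to the elements this sketch reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

request_url = list_records_url("http://example.org/oai")
identifiers = record_identifiers(SAMPLE)
```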
Outreach
BHL Blog
• Updates
• Announcements
• 1,500 users / month
• twitter.com/BioDivLibrary
• Communication tool
– Connecting with LinkedData community, other users
– Receiving assistance, guidance
– FAST turnaround
If BHL-E is not a Research Project…
Technologies in hand:
• TaxonFinder
• djatoka
• IA Bookreader
• Drupal/Biblio
• OAI-PMH
• OpenURL
• Fedora Commons
Needed:
• Deduplication Tools
• Storage
• OCR
• Markup/rekeying
• UI/UX
• Interface translation
• Data synchronization
Thank you
Chris Freeland
4344 Shaw Blvd.
St. Louis, MO 63110
http://www.biodiversitylibrary.org