The PLAZI Markup System
Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter
Universität Karlsruhe (TH) Research University – founded 1825
Jan 05, 2016
Guido Sautter, Universität Karlsruhe (TH)
The PLAZI Markup System
[Figure: system overview with four components and the data exchanged between them]
• GoldenGATE Document Editor: document markup, external referencing
• PLAZI Server: XML & PDF storage, treatment server
• PLAZI Search Portal: search portal, TAPIR provider, RSS feed
• External Data Sources: taxonomic data sources & web services
• Data flow labels: Marked-Up Documents; Queries; Treatments, Detail Data, PDF Document Handles; Links, Materials Citations; Taxon LSIDs, GeoData; New Taxon Names
The PLAZI Server
• GoldenGATE Search & Retrieval Server (SRS)
  – Extracts individual treatments from XML documents
  – Stores and indexes treatments
  – Based on independent, pluggable Indexers
    • Taxonomic names
    • Materials citations
    • Document meta data
    • Full text
  – Serves treatments or indexed details
• DSpace
  – Stores PDF and XML documents
  – Issues Handles for documents
[Figure: PLAZI Server architecture. Labels: Web Service; SRS; Document Management; Indexers (TN, MC, MD, FT); Index Data; XML Documents; PostgreSQL; File System]
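The pluggable Indexers listed above can be pictured with a minimal Java sketch. All names here (`Indexer`, `TaxonNameIndexer`, `FullTextIndexer`, `TreatmentIndex`) and the `<taxonomicName>` element are hypothetical and chosen only for illustration; this is not the actual GoldenGATE SRS API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plug-in contract: each indexer extracts one kind of detail from a treatment.
interface Indexer {
    String name();
    List<String> index(String treatmentXml);
}

// Illustrative indexer: collects the text of <taxonomicName> elements (element name is an assumption).
class TaxonNameIndexer implements Indexer {
    private static final Pattern NAME =
            Pattern.compile("<taxonomicName>([^<]+)</taxonomicName>");

    public String name() { return "TN"; }

    public List<String> index(String treatmentXml) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(treatmentXml);
        while (m.find()) names.add(m.group(1).trim());
        return names;
    }
}

// Illustrative indexer: strips tags and returns lower-cased words for full-text search.
class FullTextIndexer implements Indexer {
    public String name() { return "FT"; }

    public List<String> index(String treatmentXml) {
        List<String> words = new ArrayList<>();
        String text = treatmentXml.replaceAll("<[^>]+>", " ").toLowerCase();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }
}

// Runs every registered indexer over a treatment; new indexers plug in without changing this class.
class TreatmentIndex {
    private final List<Indexer> indexers = new ArrayList<>();

    void register(Indexer indexer) { indexers.add(indexer); }

    Map<String, List<String>> indexTreatment(String treatmentXml) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Indexer i : indexers) result.put(i.name(), i.index(treatmentXml));
        return result;
    }
}

class IndexerDemo {
    public static void main(String[] args) {
        TreatmentIndex index = new TreatmentIndex();
        index.register(new TaxonNameIndexer());
        index.register(new FullTextIndexer());
        String xml = "<treatment><taxonomicName>Probolomyrmex tani</taxonomicName> new species</treatment>";
        System.out.println(index.indexTreatment(xml));
    }
}
```

Keeping each indexer behind one small interface is what makes the design pluggable in this sketch: adding a materials-citation or meta-data indexer would mean registering one more implementation.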
The PLAZI Search Portal
• Series of Java Servlets running in Apache Tomcat
• Front-end for SRS Web Service
• Linker plug-ins create hyperlinks to other web sites
• HTML-based search portal for humans
  – Search treatments & index data
  – Links submitting new search queries
  – Links to external data sources (e.g. HNS, Google Maps)
  – Links to PDF document & XML versions of treatments
• XML document access in various XML schemas
• TAPIR provider
  – Taxonomic names
  – Materials citations
• RSS feed for new treatments
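As a rough illustration of what a Linker plug-in might do, the sketch below turns the coordinates of a materials citation into an HTML link to Google Maps. The `Linker` interface and `MapLinker` class are hypothetical names, not the portal's real plug-in API; the URL follows the public Google Maps URL scheme (`/maps/search/?api=1&query=...`).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical plug-in contract: turn one indexed detail into an HTML hyperlink.
interface Linker {
    String link(double latitude, double longitude, String label);
}

// Illustrative linker that points to a map search for the given coordinates.
class MapLinker implements Linker {
    private static final String BASE = "https://www.google.com/maps/search/?api=1&query=";

    public String link(double latitude, double longitude, String label) {
        // Locale.ROOT guarantees a decimal point regardless of the server's locale.
        String query = String.format(Locale.ROOT, "%.5f,%.5f", latitude, longitude);
        String url = BASE + URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "<a href=\"" + escapeHtml(url) + "\">" + escapeHtml(label) + "</a>";
    }

    // Minimal HTML escaping so labels and URLs cannot break the surrounding markup.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}

class MapLinkerDemo {
    public static void main(String[] args) {
        System.out.println(new MapLinker().link(5.25, 117.5, "Show collecting locality"));
    }
}
```

A linker for HNS or for the PDF and XML versions of a treatment would follow the same pattern with a different base URL and input detail.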
The PLAZI Search Portal
[Screenshot: search portal page for Probolomyrmex tani]
The GoldenGATE Editor
• Java-based editor for semi-automated document markup
• Extensible through plug-in mechanism
• Independent of specific XML schema
• Element-level XML editing (XML syntax is generated)
• Flexible display for clear view on all detail levels
• Existing plug-ins provide broad spectrum of functionality:
  – NLP-based markup generation
    • Regular expressions, gazetteers, GATE JAPE
    • Homegrown and third-party NLP components
    • Import of data from external sources (e.g. LSIDs)
  – Specialized document views for correcting NLP results
  – Markup transformation & filtering
  – IO components for different data formats & storage locations (e.g. for uploading XML documents to PLAZI server)
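To make the combination of regular expressions and gazetteers concrete, here is a minimal Java sketch: a regular expression proposes candidate binomials, and a gazetteer of known genus names decides which candidates get marked up. The class `TaxonNameTagger`, the `<taxonomicName>` element, and the pattern are assumptions for illustration and do not reproduce the actual GoldenGATE plug-ins.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tagger: regex proposes candidates, gazetteer confirms them.
class TaxonNameTagger {
    // Candidate binomial: capitalized genus followed by a lower-case epithet.
    private static final Pattern BINOMIAL =
            Pattern.compile("\\b([A-Z][a-z]+) ([a-z]{2,})\\b");

    private final Set<String> knownGenera; // the gazetteer

    TaxonNameTagger(Set<String> knownGenera) {
        this.knownGenera = knownGenera;
    }

    // Wraps confirmed names in <taxonomicName> tags and leaves all other text untouched.
    String tag(String text) {
        Matcher m = BINOMIAL.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            if (!knownGenera.contains(m.group(1))) continue; // rejected by gazetteer
            out.append(text, last, m.start());
            out.append("<taxonomicName>").append(m.group()).append("</taxonomicName>");
            last = m.end();
        }
        out.append(text.substring(last));
        return out.toString();
    }
}

class TaxonNameTaggerDemo {
    public static void main(String[] args) {
        TaxonNameTagger tagger = new TaxonNameTagger(Set.of("Probolomyrmex"));
        System.out.println(tagger.tag("The species Probolomyrmex tani differs from others."));
    }
}
```

The regex alone would also flag ordinary phrases such as "The species"; the gazetteer lookup filters those out, which is why the two techniques are typically combined and the result still reviewed by a human.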
The External Data Sources
• Hymenoptera Name Server (HNS)
  – Retrieve LSIDs for taxon names
  – Enter new taxon names in HNS database
• Further LSID sources: ZooBank, Index Fungorum
• GBIF pulls materials citations via TAPIR
• EOL pulls treatments via TAPIR (to start soon)
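LSIDs retrieved from sources like these follow the general URN form `urn:lsid:authority:namespace:object[:revision]`. The sketch below parses such a string into its parts; the `Lsid` class is a hypothetical helper and the identifier in the demo is made up, not a real taxon LSID.

```java
import java.util.Optional;

// Hypothetical value class for a parsed Life Science Identifier.
class Lsid {
    final String authority;
    final String namespace;
    final String objectId;
    final String revision; // null when the LSID carries no revision

    private Lsid(String authority, String namespace, String objectId, String revision) {
        this.authority = authority;
        this.namespace = namespace;
        this.objectId = objectId;
        this.revision = revision;
    }

    // Returns an empty Optional when the string is not a well-formed LSID.
    static Optional<Lsid> parse(String s) {
        if (s == null) return Optional.empty();
        String[] p = s.trim().split(":", -1);
        if (p.length < 5 || p.length > 6) return Optional.empty();
        if (!p[0].equalsIgnoreCase("urn") || !p[1].equalsIgnoreCase("lsid")) {
            return Optional.empty();
        }
        for (int i = 2; i < p.length; i++) {
            if (p[i].isEmpty()) return Optional.empty(); // no empty components allowed
        }
        return Optional.of(new Lsid(p[2], p[3], p[4], p.length == 6 ? p[5] : null));
    }
}

class LsidDemo {
    public static void main(String[] args) {
        // Made-up identifier for illustration only.
        Lsid.parse("urn:lsid:example.org:names:12345")
            .ifPresent(l -> System.out.println(l.authority + " / " + l.namespace + " / " + l.objectId));
    }
}
```

Validating the structure before storing an LSID in the markup is a cheap way to catch copy-and-paste errors in this sketch; resolving the identifier against its authority is a separate step not shown here.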
Outlook
• Tighter integration of GoldenGATE editor with server
  – Load plug-ins from server
    → Easier update distribution
  – Upload documents directly after OCR
  – Host documents at server throughout markup
    → Users can share markup work (experts do LSIDs, etc.)
    → Treatments available in search portal as soon as marked up
  – Auto-distribute documents to different storage locations
  – Run automated markup generation on server side
  – Get corrections from community via online feedback forms
• Other extensions of GoldenGATE editor
  – Simplified, more flexible plug-in architecture
  – Extensible user interface
Thank you! Questions?
Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter
• PLAZI homepage: http://plazi.org
• PLAZI search portal: http://plazi.org:8080/GgSRS
• GoldenGATE homepage: http://idaho.ipd.uka.de/GoldenGATE
The GoldenGATE Editor V3
• Plug-in GUI extensions (hideable)
• Simplified, more flexible architecture
• Pre-OCR page images for correcting OCR errors
• Document navigator for finding stuff more quickly