Open Access Publishing on the Semantic Web
May 07, 2015
SF Semantic Meetup www.plos.org
the Public Library of Science (PLoS)
non-profit, Open Access STM (scientific, technical and medical) publisher focused on life-sciences
mission: open the doors to the world's library of scientific knowledge by giving any scientist, physician, patient, or student - anywhere in the world - unlimited access to the latest scientific research
all research articles are published under the Creative Commons Attribution License
why Open Access?
taxpayers pay for research but print and online journals are available only to subscribers
traditional publishers own the copyright to all the researchers' published materials
licensing is complex and restrictive
libraries are struggling to provide access to all required journals because of subscription fees
PLoS Journals
publish seven peer-reviewed journals
– PLoS Biology, PLoS Medicine (flagship)
– PLoS Pathogens, PLoS Computational Biology, PLoS NTDs, PLoS Genetics (community)
– PLoS ONE (disruptive force)
largest journal is PLoS ONE
– high volume, very efficient workflow
– ~6500 articles as of July 24, 2009
– publish >400 articles a month (and growing)
using semantic platform since December ‘06
– PLoS ONE first journal on new platform
– all journals migrated to platform as of May 12, ‘09
– ~13,000 articles published on semantic platform
state of STM publishing platforms
publishing platforms are proprietary or hosted by a third party (PLoS)
most publishers treat online journals as digital repositories for research articles
– “end of the road” for research articles
– online backseat to print journals ($$)
the internet changes everything
– cheap and fast
– global
– quick search and retrieval
open source solutions exist today (e.g. Open Journal Systems/Drupal, Rhaptos/Zope) but limited features in 2006
big ideas for transforming journal publishing
open source publishing platform
semantic repository to mine the unknown (semantic) relationships in research articles
a “Web 2.0” user interface
provide features for post-publication annotation and discussion allowing for a “living” document
– notes inline with the content
– comments and discussions
– ratings
© by wales.nhs.uk
…embarked down the path
Topaz non-profit development team funded by the Moore Foundation
intended as a journal publishing system for many types of publishing
– scholarly communications / Open Access
– eScience / eScholarship
– education
– libraries / museums
semantic publishing platform based on Fedora institutional repository and Mulgara triple-store
Topaz (back-end glue)
– Object to Triple Mapping (OTM)
– Object Query Language (OQL)
Ambra journal publishing system (front-end user interface)
© by Michael James
Ambra / Topaz journal publishing platform
[architecture diagram: Apache, CAS, Ambra, Topaz (Topaz OTM), Fedora + Mulgara (Files, RDF Store)]
Fedora is used to store digital objects (XML, PDF, images, etc.)
article metadata, annotations (Annotea) and user information (FOAF) are stored as triples in Mulgara
Topaz is used for storage and retrieval of the digital objects and triples through the Objects to Triples Mapping (OTM)
Ambra (user interface)
CAS single sign-on service
Apache webhead
under the hood of Topaz (1)
an Object-Triples-Mapping (OTM) library
– modeled after Hibernate Object-Relational Mapping (ORM)
– except the database is made of RDF triples instead of a relational database
provides a query language based on objects (OQL)
– an "object" based query syntax
– makes life a bit easier for developers
OQL example
select all articles with a given title:
select a.id, a.author from Article a where a.title = 'Hello Dolly';
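To show what the object-style query spares the developer, here is a minimal Python sketch (not Topaz code) that answers the same question by matching triple patterns by hand. The predicate names `dc:title` and `dc:creator`, the sample identifiers, and the helper `articles_with_title` are assumptions made for illustration.

```python
# Hypothetical in-memory triples; identifiers and predicates are illustrative only.
TRIPLES = [
    ("info:doi/10.1371/example.0001", "rdf:type", "topaz:Article"),
    ("info:doi/10.1371/example.0001", "dc:title", "Hello Dolly"),
    ("info:doi/10.1371/example.0001", "dc:creator", "A. Author"),
    ("info:doi/10.1371/example.0002", "rdf:type", "topaz:Article"),
    ("info:doi/10.1371/example.0002", "dc:title", "Another Title"),
    ("info:doi/10.1371/example.0002", "dc:creator", "B. Author"),
]


def articles_with_title(triples, title):
    """Return (id, [authors]) for every article whose title matches."""
    # Pattern 1: ?a rdf:type topaz:Article
    articles = {s for s, p, o in triples
                if p == "rdf:type" and o == "topaz:Article"}
    # Pattern 2: ?a dc:title <title>
    matching = sorted(s for s, p, o in triples
                      if s in articles and p == "dc:title" and o == title)
    results = []
    for subject in matching:
        # Pattern 3: ?a dc:creator ?author
        authors = [o for s, p, o in triples
                   if s == subject and p == "dc:creator"]
        results.append((subject, authors))
    return results


RESULT = articles_with_title(TRIPLES, "Hello Dolly")
print(RESULT)
```

Three hand-written patterns collapse into one line of OQL, which is the convenience the slide refers to.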
why Objects to Triples Mapping (OTM)?
don’t walk a tree to retrieve objects (slow)
instead, retrieve collections of objects with one query (fast)
as an online-only publisher, we need fast
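The difference between walking and querying can be sketched with a toy store that counts round trips. This is an illustration under stated assumptions, not Topaz behavior: `CountingStore`, the `hasAnnotation` predicate, and the sample data are all hypothetical.

```python
class CountingStore:
    """Toy stand-in for a triple store that counts round trips."""

    def __init__(self, triples):
        self.triples = triples
        self.queries = 0

    def match(self, subject=None, predicate=None):
        """One round trip: return the triples matching the given pattern."""
        self.queries += 1
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)]


def load_by_walking(store, article_ids):
    """Fetch each article's annotations separately (one query per article)."""
    return {a: [o for _, _, o in store.match(a, "hasAnnotation")]
            for a in article_ids}


def load_in_one_query(store, article_ids):
    """Fetch every annotation link once, then group the rows in memory."""
    grouped = {a: [] for a in article_ids}
    for s, _, o in store.match(predicate="hasAnnotation"):
        if s in grouped:
            grouped[s].append(o)
    return grouped


DATA = [("article:%d" % i, "hasAnnotation", "note:%d" % i) for i in range(100)]
IDS = [s for s, _, _ in DATA]

walk_store = CountingStore(DATA)
walked = load_by_walking(walk_store, IDS)

batch_store = CountingStore(DATA)
batched = load_in_one_query(batch_store, IDS)

print(walk_store.queries, batch_store.queries)
```

Both functions return the same mapping; only the number of round trips to the store differs.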
under the hood of Topaz (2)
maps Java classes into RDF
– Ambra defines models which are mapped into sets of triples in various graphs
– such as “article”, “annotation”, etc. models defined in Ambra
provides support for storing files to a separate blob store (Fedora and/or Akubra)
provides storage and retrieval of files and triples in a single transaction
– necessary to render an article with associated metadata (e.g. notes, ratings, etc.)
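The class-to-triples idea can be sketched in a few lines of Python. Topaz itself maps Java classes; this sketch only mirrors the concept, and the `Article` model, the `predicate` field metadata, and the `to_triples` helper are assumptions rather than the Topaz API.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Article:
    """Hypothetical model; field metadata stands in for a mapping annotation."""
    id: str
    title: str = field(metadata={"predicate": "dc:title"})
    creators: list = field(default_factory=list,
                           metadata={"predicate": "dc:creator"})

    # Plain class attribute (not a dataclass field): the RDF type to emit.
    rdf_type = "topaz:Article"


def to_triples(obj):
    """Turn a mapped object into (subject, predicate, object) triples."""
    triples = [(obj.id, "rdf:type", obj.rdf_type)]
    for f in fields(obj):
        predicate = f.metadata.get("predicate")
        if predicate is None:
            continue  # unmapped field, e.g. the id itself
        value = getattr(obj, f.name)
        values = value if isinstance(value, list) else [value]
        triples.extend((obj.id, predicate, v) for v in values)
    return triples


ARTICLE = Article(
    id="info:doi/10.1371/journal.pone.0000000",
    title="Hello Dolly",
    creators=["Bonnie Real", "Richard Cave"],
)
MAPPED = to_triples(ARTICLE)
for triple in MAPPED:
    print(triple)
```

A list-valued field yields one triple per element, which is how a multi-author article ends up with several `dc:creator` statements.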
Ambra
first application built on Topaz
journal publishing platform with “Web 2.0” features
– uses the FreeMarker templating engine to display the content received from Topaz service
– uses the DOJO JavaScript toolkit to handle complex user interactions like annotations, ratings, etc.
– provides social networking features (in-line notes, comments, trackbacks)
– turns a reader of scientific articles into a knowledge contributor, knowledge that can be used by other users
– living document!
Ambra features
[diagram: Ambra features and connected services]
– Ambra: article ingestion, article publication, search, annotations, discussions, ratings, trackbacks, security mgmt, user profile / preferences, atom feeds, multiple journals
– SignOn Server: CAS single sign-on
– CrossRef registration
– DOI resolver
– Cache for web content and digital objects
Ambra <-> Mulgara interaction
Ambra inserts data into Mulgara in the following cases
– article ingest
– post-publication annotations (comment, note, rating, trackback)
– admin actions (volume and issue collections, annotation moderation, etc.)
– user actions (create or edit a user profile)
Ambra uses OTM to pull data from Fedora and Mulgara
– Ambra transforms XML to HTML
– displays notes, comments, ratings, etc.
article ingest (1)
Ambra expects an article package that contains an XML file in NLM-DTD format (http://dtd.nlm.nih.gov/publishing/)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" dtd-version="2.0" xml:lang="EN">
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">PLoS ONE</journal-id>
      <journal-id journal-id-type="publisher-id">plos</journal-id>
      <journal-id journal-id-type="pmc">plosone</journal-id>
      <journal-title>PLoS ONE</journal-title>
      <issn pub-type="epub">1932-6203</issn>
...
article ingest (2)
Ambra transforms the XML into an OTM object that Topaz pushes into Mulgara.
<info:doi/10.1371/journal.pone.0000000> <rdf:type> <http://rdf.plos.org/RDF/articleType/Research%20Article>
<info:doi/10.1371/journal.pone.0000000> <rdf:type> <http://rdf.plos.org/RDF/articleType/research-article>
<info:doi/10.1371/journal.pone.0000000> <rdf:type> <topaz:Article>
<info:doi/10.1371/journal.pone.0000000> <rdf:type> <topaz:ObjectInfo>
<info:doi/10.1371/journal.pone.0000000> <http://prismstandard.org/namespaces/1.2/basic/eIssn> '1932-6203'
<info:doi/10.1371/journal.pone.0000000> <dc:creator> 'Bonnie Real'
<info:doi/10.1371/journal.pone.0000000> <dc:creator> 'Richard Cave'
<info:doi/10.1371/journal.pone.0000000> <dc:creator>...
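The XML-to-triples step can be approximated with Python's standard library. This is a sketch under stated assumptions, not the Ambra ingester: the `article-meta` portion of the sample is a hypothetical fragment written in NLM style (the slide elides that part of the file), and `article_to_triples` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical article in NLM style; only journal-meta mirrors the slide.
SAMPLE = """<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-title>PLoS ONE</journal-title>
      <issn pub-type="epub">1932-6203</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.pone.0000000</article-id>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Real</surname><given-names>Bonnie</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Cave</surname><given-names>Richard</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""

EISSN = "http://prismstandard.org/namespaces/1.2/basic/eIssn"


def article_to_triples(xml_text):
    """Extract a few metadata triples from an NLM-style article document."""
    root = ET.fromstring(xml_text)
    doi = root.findtext("front/article-meta/article-id[@pub-id-type='doi']")
    if doi is None:
        raise ValueError("article has no DOI")
    subject = "info:doi/" + doi
    triples = [(subject, "rdf:type", "topaz:Article")]
    issn = root.findtext("front/journal-meta/issn[@pub-type='epub']")
    if issn:
        triples.append((subject, EISSN, issn))
    author_path = ("front/article-meta/contrib-group/"
                   "contrib[@contrib-type='author']/name")
    for name in root.iterfind(author_path):
        given = name.findtext("given-names", default="")
        surname = name.findtext("surname", default="")
        triples.append((subject, "dc:creator", (given + " " + surname).strip()))
    return triples


INGESTED = article_to_triples(SAMPLE)
for triple in INGESTED:
    print(triple)
```

The real ingest also records article types, object info, and much more; the sketch covers only the DOI, the electronic ISSN, and the authors.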
Ambra – future development
article level metrics
– impact of the article above and beyond citations
RDFa
automatic article relationships
semantic enhancement
REST-based API
ingest and publish many types of content / data
– structured and unstructured
tags
enhance search and browse
direct access to Mulgara’s triple store
– SPARQL endpoint, RDFa
semantic enhancement of content
add value to the content of a research article
highlight text for selected terms
– protein names
– genus / species
– disease
– location / habitat
– etc.
provide links to external sources to create new user interactions
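Term highlighting of this kind can be sketched as a dictionary lookup plus a regular expression. The term list, the category names used as CSS classes, and the `highlight` helper are assumptions for illustration; the sketch also assumes plain input text with no HTML markup, and it does not reflect how any particular enhancement tool works.

```python
import re

# Hypothetical term dictionary: surface form -> category.
TERMS = {
    "Plasmodium falciparum": "species",
    "malaria": "disease",
}


def highlight(text, terms):
    """Wrap each known term in a <span> whose class is the term's category."""
    # Longest terms first, so multi-word names win over any shorter overlap.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")

    def wrap(match):
        term = match.group(0)
        return '<span class="%s">%s</span>' % (terms[term], term)

    return pattern.sub(wrap, text)


MARKED = highlight("Plasmodium falciparum causes malaria.", TERMS)
print(MARKED)
```

The same lookup could attach a link to an external source instead of a CSS class, which is the second kind of enhancement the slide mentions.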
© by David Shotton
system requirements
minimum - single server (Linux) with 8 GB RAM
…better (based on PLoS journals):
– 1 server for Fedora and Mulgara with 8 GB RAM
– 1 server for Ambra and Topaz with 8 GB RAM
– 1 server for Apache and CAS with 4 GB RAM
PLoS journals on Ambra / Topaz
– 800k visits / month
– ~2 million pageviews / month
Amazon AMI to test Ambra / Topaz available
resources
Ambra website http://www.ambraproject.org/
Ambra mailing lists:
http://lists.topazproject.org/mailman/listinfo/ambra-users
http://lists.topazproject.org/mailman/listinfo/ambra-dev
Topaz website http://www.topazproject.org/
Fedora Commons website http://fedoracommons.org/
Richard Cave – rcave at plos.org