Open Access Publishing on the Semantic Web

SF Semantic Meetup www.plos.org

the Public Library of Science (PLoS)

non-profit, Open Access STM (scientific, technical and medical) publisher focused on the life sciences

mission: open the doors to the world's library of scientific knowledge by giving any scientist, physician, patient, or student - anywhere in the world - unlimited access to the latest scientific research

all research articles are published under the Creative Commons Attribution License


why Open Access?

taxpayers pay for research but print and online journals are available only to subscribers

traditional publishers own the copyright to all of the researchers' published materials

licensing is complex and restrictive

libraries are struggling to provide access to all required journals because of subscription fees


PLoS Journals

publish seven peer-reviewed journals

– PLoS Biology, PLoS Medicine (flagship)

– PLoS Pathogens, PLoS Computational Biology, PLoS NTDs, PLoS Genetics (community)

– PLoS ONE (disruptive force)

largest journal is PLoS ONE

– high volume, very efficient workflow

– ~6500 articles as of July 24, 2009

– publish >400 articles a month (and growing)

using semantic platform since December 2006

– PLoS ONE first journal on new platform

– all journals migrated to platform as of May 12, 2009

– ~13,000 articles published on semantic platform


state of STM publishing platforms

publishing platforms are proprietary or hosted by a third party (PLoS)

most publishers treat online journals as digital repositories for research articles

– “end of the road” for research articles

– online backseat to print journals ($$)

the internet changes everything

– cheap and fast

– global

– quick search and retrieval

open source solutions exist today (e.g. Open Journal Systems/Drupal, Rhaptos/Zope) but they had limited features in 2006


big ideas for transforming journal publishing

open source publishing platform

semantic repository to mine the unknown (semantic) relationships in research articles

a “Web 2.0” user interface

provide features for post-publication annotation and discussion allowing for a “living” document

– notes inline with the content

– comments and discussions

– ratings

© by wales.nhs.uk


…embarked down the path

Topaz non-profit development team funded by the Moore Foundation

intended as a journal publishing system for many types of publishing

– scholarly communications / Open Access

– eScience / eScholarship

– education

– libraries / museums

semantic publishing platform based on Fedora institutional repository and Mulgara triple-store

Topaz (back-end glue)

– Object to Triple Mapping (OTM)

– Object Query Language (OQL)

© by Michael James

Ambra journal publishing system (front-end user interface)


Ambra / Topaz journal publishing platform

[architecture diagram; component labels: Apache, Ambra, CAS, Topaz, Topaz OTM, Fedora + Mulgara, Files, RDF Store]

Fedora is used to store digital objects (XML, PDF, images, etc.)

article metadata, annotations (Annotea) and user information (FOAF) are stored as triples in Mulgara

Topaz is used for storage and retrieval of the digital objects and triples through the Objects to Triples Mapping (OTM)

Ambra (user interface)

CAS single sign-on service

Apache webhead


under the hood of Topaz (1)

an Object-Triples-Mapping (OTM) library

– modeled after Hibernate Object-Relational Mapping (ORM)

– except the database is made of RDF triples instead of a relational database

provides a query language based on objects (OQL)

– an "object" based query syntax

– makes life a bit easier for developers

OQL example: select all articles with a given title

    select a.id, a.author from Article a where a.title = 'Hello Dolly';


why Objects to Triples Mapping (OTM)?

don’t walk a tree to retrieve objects (slow)

instead, retrieve collections of objects with one query (fast)

as an online-only publisher, we need fast
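
The difference between walking a tree and retrieving a collection with one query can be sketched in a few lines. This is an illustration in Python, not Topaz code: `CountingStore` is a hypothetical in-memory stand-in for the triple store, the helper names are invented, and the DOIs and titles are made-up sample data.

```python
from collections import defaultdict

# Made-up sample triples: (subject, predicate, object).
TRIPLES = [
    ("info:doi/10.1371/example.1", "dc:title", "Hello Dolly"),
    ("info:doi/10.1371/example.1", "dc:creator", "A. Author"),
    ("info:doi/10.1371/example.2", "dc:title", "Another Title"),
    ("info:doi/10.1371/example.2", "dc:creator", "B. Writer"),
]


class CountingStore:
    """Hypothetical triple store that counts how many queries it serves."""

    def __init__(self, triples):
        self.triples = list(triples)
        self.queries = 0

    def fetch(self, subject=None, predicate=None):
        """One round trip: return the triples matching the pattern."""
        self.queries += 1
        return [
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
        ]


def load_by_walking(store, subjects, predicates):
    """Slow path: one query per subject per property."""
    objects = {}
    for s in subjects:
        objects[s] = {}
        for p in predicates:
            objects[s][p] = [o for _, _, o in store.fetch(s, p)]
    return objects


def load_in_one_query(store):
    """Fast path: a single query, then group the rows into objects locally."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in store.fetch():
        grouped[s][p].append(o)
    return {s: dict(props) for s, props in grouped.items()}
```

Both functions build the same objects, but the walking version issues one query per subject and property, so its cost grows with the size of the collection, while the grouped version always issues a single query.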


under the hood of Topaz (2)

maps Java classes into RDF

– Ambra defines models which are mapped into sets of triples in various graphs

– such as the “article”, “annotation”, etc. models defined in Ambra

provides support for storing files to a separate blob store (Fedora and/or Akubra)

provides storage and retrieval of files and triples in a single transaction

– necessary to render an article with associated metadata (e.g. notes, ratings, etc.)
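
As a rough illustration of the class-to-triples idea, the sketch below maps annotated model classes to rows tagged with a graph name. It is written in Python rather than Java and does not use the Topaz API: the `predicate` field metadata, the `graph` and `rdf_type` class attributes, the `to_triples` helper, and the choice of `dc:title` and the Annotea property names are all assumptions made for the example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Article:
    # The id becomes the triple subject; other fields name their predicate.
    id: str
    title: str = field(metadata={"predicate": "dc:title"})
    creators: list = field(default_factory=list,
                           metadata={"predicate": "dc:creator"})

    rdf_type = "topaz:Article"  # plain class attributes, not dataclass fields
    graph = "article"


@dataclass
class Annotation:
    id: str
    annotates: str = field(metadata={"predicate": "annotea:annotates"})
    body: str = field(metadata={"predicate": "annotea:body"})

    rdf_type = "annotea:Annotation"
    graph = "annotation"


def to_triples(obj):
    """Map one model object to (graph, subject, predicate, object) rows."""
    cls = type(obj)
    rows = [(cls.graph, obj.id, "rdf:type", cls.rdf_type)]
    for f in fields(obj):
        predicate = f.metadata.get("predicate")
        if predicate is None:
            continue  # unmapped field, e.g. the id used as the subject
        value = getattr(obj, f.name)
        values = value if isinstance(value, list) else [value]
        rows.extend((cls.graph, obj.id, predicate, v) for v in values)
    return rows
```

Calling `to_triples` on an `Article` with two creators yields one `rdf:type` row, one title row, and one `dc:creator` row per author, all in the "article" graph; an `Annotation` lands in the "annotation" graph.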


Ambra

first application built on Topaz

journal publishing platform with “Web 2.0” features

– uses the FreeMarker templating engine to display the content received from the Topaz service

– uses the Dojo JavaScript toolkit to handle complex user interactions like annotations, ratings, etc.

– provides social networking features (in-line notes, comments, trackbacks)

– turns a reader of scientific articles into a knowledge contributor, knowledge that can be used by other users

– living document!


Ambra features

[feature diagram; labels:]

– article ingestion

– article publication

– search

– annotations

– discussions

– ratings

– trackbacks

– security mgmt

– user profile / preferences

– atom feeds

– multiple journals

– CrossRef registration

– DOI resolver

– single sign-on (CAS SignOn Server)

– cache for web content and digital objects


Ambra <-> Mulgara interaction

Ambra inserts data into Mulgara in the following cases

– article ingest

– post-publication annotations (comment, note, rating, trackback)

– admin actions (volume and issue collections, annotation moderation, etc.)

– user actions (create or edit a user profile)

Ambra uses OTM to pull data from Fedora and Mulgara

– Ambra transforms XML to HTML

– displays notes, comments, ratings, etc.


article ingest (1)

Ambra expects an article package that contains an XML file in NLM-DTD format (http://dtd.nlm.nih.gov/publishing/)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" dtd-version="2.0" xml:lang="EN">
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">PLoS ONE</journal-id>
      <journal-id journal-id-type="publisher-id">plos</journal-id>
      <journal-id journal-id-type="pmc">plosone</journal-id>
      <journal-title>PLoS ONE</journal-title>
      <issn pub-type="epub">1932-6203</issn>
      ...
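
To give a feel for what reading such a file involves, here is a small sketch that pulls the journal metadata out of an NLM-style header with Python's standard XML parser. It is not Ambra's ingest code. `SAMPLE` is a trimmed copy of the sample header for this slide, with the DOCTYPE and namespace declarations left out and the elements closed so that it parses on its own; `read_journal_meta` and the keys of the returned dictionary are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed, self-closing version of the sample header (assumption: the real
# file continues with article-meta, body, and back matter).
SAMPLE = """<article article-type="research-article" dtd-version="2.0">
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">PLoS ONE</journal-id>
      <journal-id journal-id-type="publisher-id">plos</journal-id>
      <journal-id journal-id-type="pmc">plosone</journal-id>
      <journal-title>PLoS ONE</journal-title>
      <issn pub-type="epub">1932-6203</issn>
    </journal-meta>
  </front>
</article>"""


def read_journal_meta(xml_text):
    """Extract a few journal-level fields from an NLM-style article header."""
    root = ET.fromstring(xml_text)
    meta = root.find("front/journal-meta")
    return {
        "article_type": root.get("article-type"),
        "journal_title": meta.findtext("journal-title"),
        # One entry per <journal-id>, keyed by its journal-id-type attribute.
        "journal_ids": {
            e.get("journal-id-type"): e.text
            for e in meta.findall("journal-id")
        },
        "eissn": meta.findtext("issn[@pub-type='epub']"),
    }
```

The resulting dictionary holds the kind of values that later show up as triples, such as the electronic ISSN.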


article ingest (2)

Ambra transforms the XML into an OTM object that Topaz pushes into Mulgara.

<info:doi/10.1371/journal.pone.0000000> <rdf:type> <http://rdf.plos.org/RDF/articleType/Research%20Article>
<info:doi/10.1371/journal.pone.0000000> <rdf:type> <http://rdf.plos.org/RDF/articleType/research-article>
<info:doi/10.1371/journal.pone.0000000> <rdf:type> <topaz:Article>
<info:doi/10.1371/journal.pone.0000000> <rdf:type> <topaz:ObjectInfo>
<info:doi/10.1371/journal.pone.0000000> <http://prismstandard.org/namespaces/1.2/basic/eIssn> '1932-6203'
<info:doi/10.1371/journal.pone.0000000> <dc:creator> 'Bonnie Real'
<info:doi/10.1371/journal.pone.0000000> <dc:creator> 'Richard Cave'
<info:doi/10.1371/journal.pone.0000000> <dc:creator>...


Ambra – future development

article level metrics

– impact of the article above and beyond citations

RDFa

automatic article relationships

semantic enhancement

REST-based API

ingest and publish many types of content / data

– structured and unstructured

tags

enhance search and browse

direct access to Mulgara’s triple store

– SPARQL endpoint, RDFa


semantic enhancement of content

add value to the content of a research article

highlight text for selected terms

– protein names

– genus / species

– disease

– location / habitat

– etc.

provide links to external sources to create new user interactions
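
One simple way to picture term highlighting is a dictionary lookup that wraps known terms in links. The sketch below is only an illustration of that idea, not the enhancement approach planned for Ambra: the `TERMS` table, its categories, the example.org URLs, and the `enhance` function are all hypothetical, and it assumes plain-text input with exact, case-sensitive matches.

```python
import html
import re

# Hypothetical term table: surface form -> (category, external link).
TERMS = {
    "Plasmodium falciparum": ("species", "https://example.org/taxon/plasmodium-falciparum"),
    "malaria": ("disease", "https://example.org/disease/malaria"),
}


def enhance(text, terms=TERMS):
    """Return HTML in which each known term is a link classed by category."""
    if not terms:
        return html.escape(text)
    # Try longer terms first so multi-word names win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    out, pos = [], 0
    for match in pattern.finditer(text):
        term = match.group(0)
        category, url = terms[term]
        out.append(html.escape(text[pos:match.start()]))  # text before the term
        out.append('<a class="%s" href="%s">%s</a>'
                   % (category, html.escape(url), html.escape(term)))
        pos = match.end()
    out.append(html.escape(text[pos:]))  # remaining text after the last term
    return "".join(out)
```

The category ends up as a CSS class, so a stylesheet could colour protein names, species, and diseases differently, while the link target gives the reader a way out to an external source.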

© by David Shotton



system requirements

minimum - single server (Linux) with 8 GB RAM

…better (based on PLoS journals):

– 1 server for Fedora and Mulgara with 8 GB RAM

– 1 server for Ambra and Topaz with 8 GB RAM

– 1 server for Apache and CAS with 4 GB RAM

PLoS journals on Ambra / Topaz

– 800k visits / month

– ~2 million pageviews / month

Amazon AMI to test Ambra / Topaz available


resources

Ambra website http://www.ambraproject.org/

Ambra mailing lists:

– http://lists.topazproject.org/mailman/listinfo/ambra-users

– http://lists.topazproject.org/mailman/listinfo/ambra-dev

Topaz website http://www.topazproject.org/

Fedora Commons website http://fedoracommons.org/

Richard Cave – rcave at plos.org
