Building a hosted repository service on DSpace Matthew Cockerill Director of Operations BioMed Central Ltd. Open Repository
Dec 18, 2015
Outline
Background on BioMed Central
Why is there a need for a hosted repository service?
Why build it on DSpace?
Why choose Open Repository?
Technical implementation challenges
Other challenges
Background on BioMed Central
Scientific publisher, founded in 1999
All research articles Open Access
130+ peer-reviewed journals
10,000+ articles published
Continuing to grow rapidly
Open Access research
All research distributed under the Creative Commons Attribution License:
Allows:
– Redistribution
– Reuse
– Creation of derivative works
– Commercial or non-commercial
Institutional repositories and Open Access publishing
Sometimes seen as alternative roads to Open Access
In fact, the two roads are highly complementary
Repositories can contain both:
– Manuscript copies of articles from 'traditional journals'
– Final, structured versions of articles from open access journals
We expect growth in repositories to go hand in hand with growth in Open Access publishing
Why is there a need for a hosted repository service?
Not all institutions want to operate, maintain and customize their own repository
Small institutions
– Hosted solution can offer better value, due to economies of scale
– Alternative 'shoestring' solutions are possible but do not give reliability or flexibility
Large institutions
– Hosted solution may give greater flexibility
BioMed Central's track-record as a service provider
Has developed and operated a 24/7 web-based journal workflow system for thousands of authors, reviewers, and journal editors since 2000
25,000+ manuscripts have been submitted to BioMed Central journals to date
Why was DSpace chosen as the foundation for Open Repository?
Java-based
Large, active and diverse community of developers
Designed with the big issues in mind:
– Modularity/extensibility
– Scalability
– Interoperability
– Long-term digital preservation
BSD-licensed
BSD License
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
•Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
•Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
•Neither the name of the <ORGANIZATION> nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Why choose Open Repository?
Does not require extensive in-house IT skills/resources
Flexible customization
High availability, for a fraction of the price of a dedicated HA solution
Additional features compared to standard DSpace software
Why not to choose OR?
Not for every institution
Some institutions choose to make a major investment in developing and extending the repository platform
In return for greater investment of staff and resources, an institution can:
– arbitrarily customize DSpace to its precise needs
– steer the overall direction of the DSpace platform
Impact of RCUK position statement
The draft position statement on Open Access from RCUK proposes to mandate deposition of articles in an Open Access repository if available
Only a small minority of UK institutions currently have repositories
RCUK policy likely to encourage many smaller institutions to consider setting up repositories
High Availability
Commercial Tier-1 network datacentre
24x7 monitoring, troubleshooting and fault resolution
Fully redundant infrastructure: power / internet / firewall / LAN etc.
High-end fibre-channel/RAID storage
DSpace Tomcat servers configured as an active/passive cluster
Oracle database - 2-node RAC cluster + offsite standby database
Examples of functionality added to core DSpace platform
Automatic population of repository with Open Access content
Improvements to ease-of-use of submission system
Automated conversion of proprietary file formats to PDF suitable for archiving
XML markup of submitted articles
Enhanced usage reporting tools
Tomcat application
Running multiple instances of DSpace within Tomcat is fairly straightforward and works OK
Ultimately may need to tweak DSpace code to allow a single DSpace application instance to have many 'faces' (different repositories), i.e. break the 1:1 relationship between application instance and repository
That is the approach we use to operate our 70 independent journal websites
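One way a single application instance could present many 'faces' is to pick the repository from the host name of each incoming request. The following is a minimal sketch under that assumption; the class name RepositoryResolver, the host names and the repos_id values are all hypothetical and are not taken from DSpace or Open Repository.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: one application instance serving many repository "faces".
// Host names and repos_id values are invented for illustration only.
final class RepositoryResolver {
    private final Map<String, Integer> reposIdByHost;

    RepositoryResolver(Map<String, Integer> reposIdByHost) {
        this.reposIdByHost = Map.copyOf(reposIdByHost);
    }

    // Map an HTTP Host header (optionally with a port) to a repository id.
    // IPv6 literals are not handled in this sketch.
    Optional<Integer> resolve(String hostHeader) {
        if (hostHeader == null) {
            return Optional.empty();
        }
        String host = hostHeader.trim().toLowerCase(Locale.ROOT);
        int colon = host.indexOf(':');
        if (colon >= 0) {
            host = host.substring(0, colon); // drop the port
        }
        return Optional.ofNullable(reposIdByHost.get(host));
    }

    public static void main(String[] args) {
        RepositoryResolver resolver = new RepositoryResolver(Map.of(
                "repository.example-a.edu", 1,
                "repository.example-b.ac.uk", 2));
        System.out.println(resolver.resolve("Repository.Example-A.edu:8080"));
        System.out.println(resolver.resolve("unknown.example.org"));
    }
}
```

The resolved id would then select the branding, configuration and database schema for that request, rather than these being fixed per application instance.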
Database issues
Each repository needs its own database schema (for metadata etc.)
Don't want to have to independently manage dozens or hundreds of database schemas
Need to maintain good performance
Also would like all DSpace instances to effectively share a pool of connections – difficult if each connection is tied to a different user/schema
Database solution: Part 1
1. Partition all tables by a new repos_id column
2. Create a series of schemas, one for each Open Repository, identified by repos_id
3. Generate a set of views in each schema, which filter the underlying tables by the relevant repos_id
4. End result:
– Schema appears to DSpace code to be indistinguishable from a dedicated schema
– Single set of tables provides easy manageability
– Partitioning ensures high performance
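Step 3 above can be sketched as a small generator that emits one view per shared table for a given repository. This is an illustration only: the schema names (SHARED, OR_<id>), the table and column names, and the class RepositoryViewGenerator are assumptions, and the slides do not say how repos_id is filled in when rows are inserted through a view.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: build CREATE VIEW statements that expose, inside a
// per-repository schema, only the rows of the shared tables for one repos_id.
final class RepositoryViewGenerator {

    static List<String> viewDdl(String sharedSchema, int reposId,
                                Map<String, List<String>> columnsByTable) {
        if (reposId < 0) {
            throw new IllegalArgumentException("repos_id must be non-negative");
        }
        requireIdentifier(sharedSchema);
        String repoSchema = "OR_" + reposId; // assumed naming convention
        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : columnsByTable.entrySet()) {
            String table = requireIdentifier(entry.getKey());
            List<String> columns = new ArrayList<>();
            for (String column : entry.getValue()) {
                columns.add(requireIdentifier(column));
            }
            // repos_id is left out of the column list so the view has the
            // same shape as the table in a dedicated schema would.
            statements.add("CREATE OR REPLACE VIEW " + repoSchema + "." + table
                    + " AS SELECT " + String.join(", ", columns)
                    + " FROM " + sharedSchema + "." + table
                    + " WHERE repos_id = " + reposId);
        }
        return statements;
    }

    // Accept only simple unquoted identifiers, since names are concatenated into SQL.
    private static String requireIdentifier(String name) {
        if (name == null || !name.matches("[A-Za-z][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a simple SQL identifier: " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tables = new LinkedHashMap<>();
        tables.put("item", List.of("item_id", "in_archive"));          // illustrative columns
        tables.put("collection", List.of("collection_id", "name"));    // illustrative columns
        viewDdl("SHARED", 3, tables).forEach(System.out::println);
    }
}
```

Because the application only ever sees the views, adding a repository means generating one more set of views rather than one more set of tables.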
Database solution: Part 2
1. To allow efficient sharing of database connections, all connections use same username
2. ALTER SESSION SET CURRENT_SCHEMA used to point at correct schema
3. Oracle's connection attribute functionality is used to ensure that connections already pointing at the correct schema are reused when possible
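The reuse logic in steps 2 and 3 can be modelled without a database. The sketch below is not Oracle's connection attribute API; it is a hypothetical stand-in (SchemaAwarePool and Session are invented names) showing the idea of preferring an idle connection whose session already points at the requested schema, and issuing ALTER SESSION SET CURRENT_SCHEMA only when no such connection is free.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a shared pool in which every connection logs in as
// the same user and is pointed at a repository schema on demand.
final class SchemaAwarePool {

    // Stand-in for a real database session; it only records the SQL it is sent.
    static final class Session {
        String currentSchema;                          // null until first switched
        final List<String> executed = new ArrayList<>();

        void execute(String sql) {
            executed.add(sql);
        }
    }

    private final Deque<Session> idle = new ArrayDeque<>();

    SchemaAwarePool(int size) {
        for (int i = 0; i < size; i++) {
            idle.add(new Session());
        }
    }

    // Borrow a session for the given schema, switching schema only if needed.
    Session borrow(String schema) {
        if (schema == null || !schema.matches("[A-Za-z][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a simple schema name: " + schema);
        }
        if (idle.isEmpty()) {
            throw new IllegalStateException("Pool exhausted");
        }
        Session match = null;
        for (Session candidate : idle) {
            if (schema.equals(candidate.currentSchema)) {
                match = candidate;                     // already on the right schema
                break;
            }
        }
        if (match != null) {
            idle.remove(match);
            return match;
        }
        Session reused = idle.poll();                  // least recently released
        reused.execute("ALTER SESSION SET CURRENT_SCHEMA = " + schema);
        reused.currentSchema = schema;
        return reused;
    }

    void release(Session session) {
        idle.add(session);
    }

    public static void main(String[] args) {
        SchemaAwarePool pool = new SchemaAwarePool(2);
        Session first = pool.borrow("OR_1");
        pool.release(first);
        Session second = pool.borrow("OR_1");          // same session, no extra ALTER
        System.out.println(first == second);
        System.out.println(second.executed);
    }
}
```

In this model a request for a schema that was used recently costs no extra round trip, which is the benefit the slide attributes to attribute-based reuse.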
Each DSpace instance has own connection pool
[Diagram. Labels: Webserver; Tomcat applications OR1, OR2, OR3, OR4, OR5; Database connections (Active / Inactive); Database. Verdict: INEFFICIENT]
DSpace instances share a connection pool
[Diagram. Labels: Webserver; Tomcat applications OR1, OR2, OR3, OR4, OR5; Shared connection pool; Database connections (Active / Inactive); Database. Verdict: EFFICIENT]
Contributing code back to DSpace
BioMed Central intends to contribute many of its tweaks to the core DSpace code back to the DSpace project
Where possible, all proprietary functionality is being added as distinct modules
DSpace's architectural evolution will hopefully make this easier to achieve
BioMed Central's goal is for Open Repository to remain in sync, as far as possible, with the core DSpace code
Biggest challenge
Persuading authors to contribute content to the repository
Not trivial
Need to:
– Make it as easy as possible
– Carrots and sticks
Ease of use of BioMed Central’s manuscript submission system
[Bar chart: ratings of ease of use]
Very good: 52.9%
Good: 43.9%
Neutral: 2.6%
Poor: 0.6%
Very poor: 0.0%
96.8% rate ease of use as "good" or "very good"
End-to-end service
The Open Repository service is not just about providing the technology
Provision of training and ongoing technical support to the institution's repository administrators
Provide guidelines on best practice for successfully launching a repository