EUDAT
Towards a Collaborative Data Infrastructure
Bielefeld 10th International Conference
Daan Broeder
- MPI for Psycholinguistics
- EUDAT
- CLARIN
- DASISH
Data
• These days it is so very easy to create data but still far less easy to manage it.
– Experiment data
– Sensor produced data
– Simulations
– Digital libraries
– The Web
– …
• How to store, to administrate, to find, to enrich, to link, to process, to share, to reuse, …, to publish
• For this we need a data infrastructure
• One that is efficient, sustainable and cost effective
Data creation cycle
[Figure: the data creation cycle. Stages: raw data; registration & preservation; analysis & enrichment; citable publication. Data states: temporary data, referable data, citable data.]
The current data infrastructure landscape
Long history of data management in Europe: several existing data infrastructures dealing with established and growing user communities (e.g., ESO, ESA, EBI, CERN)
New Research Infrastructures (ESFRI roadmap) are emerging and are also trying to build data infrastructure solutions to meet their needs (CLARIN, EPOS, ELIXIR, ESS, etc.)
However, most of these infrastructures and initiatives address primarily the needs of a specific discipline and user community
Challenges
• Compatibility and interoperability for cross-disciplinary research
• Data growth in volume and complexity: strong impact on costs, threatening the sustainability of the infrastructure
Opportunities
• Synergies do exist: although disciplines have different workflows and ambitions, they have common basic needs and requirements that can be matched with generic services supporting multiple communities
Strategy needed at pan-European level
Collaborative Data Infrastructure
EUDAT short fact list
Project name: EUDAT – European Data
Start date: 1st October 2011
Duration: 36 months
Budget: 16.3 M€ (including 9.3 M€ from the EC)
EC call: Call 9 (INFRA-2011-1.2.2): Data infrastructure for e-Science (11.2010)
Participants: 25 partners from 13 countries (national data centers, technology providers, research communities, and funding agencies)
Objectives: “To deliver cost-efficient and high quality Collaborative Data Infrastructure (CDI) with the capacity and capability for meeting researchers’ needs in a flexible and sustainable way, across geographical and disciplinary boundaries.”
Consortium
Research Communities
EUDAT targets all scientific disciplines (discipline neutral):
• To capture and identify cross-discipline requirements
• To involve the scientists of all communities in shaping the infrastructure and its services
Environmental Science
ENES, EPOS, Lifewatch, EMSO, IAGOS-ERI, ICOS, Euro-Argo
Social Sciences and Humanities
CLARIN, CESSDA, DARIAH
Biological and Medical Science
VPH, ELIXIR, BBMRI, ECRIN, DiXA
Physical Sciences and Engineering
WLCG, ISIS, PanData
Material Science
ESS…
Research fields
1. Capturing Communities Requirements (WP4)
1st round of interviews with the five initial communities (Oct. 2011 - Dec. 2012)
• Understand how data is organised in each community
• Collect first wishes and specific requirements for a common data service layer
Next phase: refine the analysis and expand it to other communities
2. Building the corresponding services (WP5)
Technology appraisal (ongoing)
• What is already available at partners’ sites to build the corresponding services?
• What are the gaps and market failures that should be addressed by EUDAT?
Next phase: developing candidate services
• Adapt services to match the requirements
• Integrate with community and SP services
• Test and evaluate with communities
3. Deploying the services and operating the federated infrastructure (WP6)
Designing the federated infrastructure and the interfaces for cross-site operations (ongoing)
Next phase: integrating and coordinating resource provision, operations and support
EUDAT service design activities
Core services are the building blocks of EUDAT’s Common Data Infrastructure, mainly located in the bottom layer of data services
Community-oriented services
• Simple data access and upload
• Long-term preservation
• Shared workspaces
• Execution and workflow
• Joint metadata and data visibility
• Simple storage facility for individual scientists and small projects
Enabling services (making use of existing services where possible)
• Persistent identifier service (EPIC, DataCite, ...)
• Federated AAI service (NRENs, eduGAIN)
• Network services
• Monitoring and accounting
EUDAT Core Service Areas
Data Management Service Cases
• Safe Replication
– Replicate data between selected centers
– Based on user-specified policies
– For LTA, for easy access, …
– Technology: iRODS
• Dynamic Replication (data staging)
– Move data to HPC workspaces and store the results
– Technology: iRODS + grid tools
• Usable PID framework
– Facilitate administrating data replication
– Allow identifying ‘parts’ of objects
– Data verifiability, …
– Technology: HS + EPIC and DataCite
• Center registry
– Listing EUDAT services, centers and their capabilities
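The role of the PID framework in replication and data verifiability can be illustrated with a small sketch. It is not the EPIC, Handle System or DataCite API: the `PidRecord` class, its field names, the example identifier and the example URLs are all hypothetical and only show the idea of one identifier carrying a checksum and a list of copy locations.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """Hypothetical PID record: one identifier, one checksum, many locations."""
    pid: str
    sha256: str
    locations: list[str] = field(default_factory=list)

    def add_location(self, url: str) -> None:
        # One location entry per replica; registering the same URL twice is a no-op.
        if url not in self.locations:
            self.locations.append(url)

    def verify(self, content: bytes) -> bool:
        # A copy is considered intact when its digest matches the registered one.
        return hashlib.sha256(content).hexdigest() == self.sha256


def register(pid: str, content: bytes, url: str) -> PidRecord:
    """Create a record for a new object and store its first location."""
    record = PidRecord(pid=pid, sha256=hashlib.sha256(content).hexdigest())
    record.add_location(url)
    return record


if __name__ == "__main__":
    # Identifier and URLs are made up for illustration.
    rec = register("11100/example-0001", b"raw data", "https://site-a.example.org/obj/1")
    rec.add_location("https://site-b.example.org/obj/1")
    print(rec.locations, rec.verify(b"raw data"))
```

Keeping the checksum in the record lets any center check a replica without contacting the center that holds the original.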
Data Management Service Cases
• Joint metadata domain
– A metadata catalogue for (all?) research data
– Interdisciplinary (re-)use of data
– Semantic interoperability: explicit semantics and flexible relations or hard-wired mappings, …
– Granularity: include individual resources or data sets only
– Commenting function
– Platform permitting data-set promotion: proper acknowledgements for data creators
– Technologies: ICAT, Mercury, OAI-PMH, XSD, RDF, …
• Simple Store
– A safe repository for all research data in need (YouTube or Dropbox model)
– (Detailed?) metadata
– Sharing
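Since OAI-PMH is named as a candidate technology for the joint metadata domain, the following sketch shows how a catalogue might extract a few Dublin Core fields from a harvested `ListRecords` response. The XML namespaces are the standard OAI-PMH and Dublin Core ones; the sample response, the repository identifier and the `parse_records` helper are invented for illustration, and no network request is made.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response of the kind a community repository could return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repo.example.org:corpus-1</identifier>
        <datestamp>2012-03-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example speech corpus</dc:title>
          <dc:creator>Example Lab</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text: str) -> list[dict]:
    """Return identifier, title and creators for each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("./oai:ListRecords/oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [e.text for e in rec.findall(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    for entry in parse_records(SAMPLE_RESPONSE):
        print(entry)
```

A real harvester would also have to handle resumption tokens and deleted records, which this sketch leaves out.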
EUDAT Architecture
[Figure: EUDAT architecture. Components shown: EUDAT community centers, EUDAT data centers, an EUDAT HPC center and a PRACE HPC center with HPC workspaces, LTA facilities, the EUDAT PID service, the EUDAT metadata service (harvesting metadata), the EUDAT center registry, and the EUDAT Simple Store. Data objects (marked D) are held at the centers.]
Collaborations
• With the ESFRI (cluster) projects
• With service providers: EPIC, DataCite, …
• EUDAT <-> EGI collaboration (& competition)
• US DataNet: DataONE, Data Conservancy, …
• DAITF - Data Access & Interoperability Task Force
– This task will contribute to the efforts to establish an international task force. This work will be carried out in collaboration with OpenAIRE and other relevant initiatives/projects focusing on data.
Thank you for your attention
Interlinking data and publications
[Figure: datasets & metadata on one side and publications on the other, each exposed through an API.]
Actors: data depositor, data curator, reviewer, author, editor
Identifiers for Actors (ORCID)
Identifiers for data & publications (HS, DOI, URN)
Organizations guiding data management infrastructure building
ICSU, CODATA, WDC, COAR, iCORDI, DAITF, EUDAT, OpenAIRE
Move to DAITF & iCORDI inspired by OpenAIRE and EUDAT
[Figure: DAITF and iCORDI processes.]
• Bottom-up process towards solutions, driven by science: the DAITF process (conferences, working groups, hands-on training) in the domain of data infrastructure practitioners, with a DAITF steering board
• Top-down process about strategies and needs, driven by science: the HLSF process (workshops, working groups) in the domain of top scientists, senior technologists and policy makers
• The two processes influence and interact with each other
• iCORDI programmes informing both: knowledge exchange programme, workshop programme, prototype programme, analysis programme
• Participants: horizontal data infrastructures, discipline/domain data infrastructures, data scientists, young scientists, technologists
• Funders and research organisations shown: EC, NSF, CNRS, KNAW, MPG, CNR, DFG, NWO, STFC
• Other stakeholders: RCs, ROs, funders, etc.
• How to organize and support this process? IETF? DWF?
1st Workshop March 2012, Copenhagen
next workshop in October, Washington
What has been done so far?
[Figure: timeline 2008 to 2012. Items shown: DataNet1 (2006/8), DataNet2, DAITF Prepar. Workshop, UIPIU, Data2012 Workshop, ASIST Workshop, DWF Concept. Phases: brainstorming on data issues and the need for global action; first focussed actions tackling first data topics; global interaction in place.]
EUDAT Services for Communities!?
[Table: check-mark matrix of EUDAT services per community; column labels include DiXa and DARIAH.]
Safe Replication Use Case
• Objective: allow communities to replicate data to selected data centers for storage in a robust, reliable and highly available manner, respecting existing conventions on stewardship and security.
– Using user-defined policies, e.g. make 4 copies, don’t copy to the UK, …
• Application: move data to locations where (1) curation and/or LTP services are present, (2) processing requiring HPC can take place, or (3) user access to the data is improved.
• Replicated digital objects are identified through a single PID, with multiple locations associated with the PID record: one location per copy.
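The policy examples above ("make 4 copies, don’t copy to the UK") can be read as constraints on which centers may receive a replica. The sketch below is one possible way to express that; it is not how iRODS rules are written. The `Center` and `Policy` classes, the center names and the country codes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Center:
    name: str
    country: str   # ISO 3166-1 alpha-2 code, e.g. "GB" for the UK
    has_lta: bool  # whether the center offers long-term archiving


@dataclass(frozen=True)
class Policy:
    copies: int
    excluded_countries: frozenset = frozenset()
    require_lta: bool = False


def select_targets(policy: Policy, centers: list[Center]) -> list[Center]:
    """Pick the first `policy.copies` centers that satisfy the policy."""
    eligible = [
        c for c in centers
        if c.country not in policy.excluded_countries
        and (c.has_lta or not policy.require_lta)
    ]
    if len(eligible) < policy.copies:
        raise ValueError(
            f"policy needs {policy.copies} copies, only {len(eligible)} centers qualify"
        )
    return eligible[:policy.copies]


# Made-up registry of centers, for illustration only.
CENTERS = [
    Center("center-a", "NL", True),
    Center("center-b", "GB", True),
    Center("center-c", "DE", True),
    Center("center-d", "FI", False),
    Center("center-e", "IT", True),
    Center("center-f", "ES", True),
]

if __name__ == "__main__":
    policy = Policy(copies=4, excluded_countries=frozenset({"GB"}))
    print([c.name for c in select_targets(policy, CENTERS)])
```

Failing loudly when too few centers qualify keeps a policy from being silently under-fulfilled; each selected center would then be added as one location in the object's PID record.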
Dynamic Replication Service Case
• Move an entire data set (i.e. data collection) back and forth between an EUDAT node and a non-EUDAT node: PRACE or EGI facilities
• Keep the data replicas at the non-EUDAT nodes in sync with the EUDAT nodes
• Ingest/register relevant simulation results back at the EUDAT nodes
Candidate technologies
• iRODS
• Globus Online
• FTS
• UNICORE FTP
• gTransfer
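Keeping replicas in sync, as described above, comes down to deciding which files differ between two nodes. The sketch below compares two checksum manifests and produces a transfer plan. It is independent of the candidate technologies listed and does not use any of their APIs; the manifest layout (`{path: checksum}`), the function names and the file names are assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest used as the manifest entry for a file."""
    return hashlib.sha256(data).hexdigest()


def sync_plan(source: dict[str, str], replica: dict[str, str]) -> dict[str, list[str]]:
    """Compare {path: checksum} manifests of the source node and a replica.

    Returns the paths to copy to the replica (new or changed files) and the
    paths to delete there (files that no longer exist at the source).
    """
    to_copy = sorted(p for p, digest in source.items() if replica.get(p) != digest)
    to_delete = sorted(p for p in replica if p not in source)
    return {"copy": to_copy, "delete": to_delete}


if __name__ == "__main__":
    # Made-up manifests for an EUDAT node and an HPC workspace replica.
    eudat_node = {
        "run1/input.nc": checksum(b"input v2"),
        "run1/config.txt": checksum(b"config"),
    }
    hpc_replica = {
        "run1/input.nc": checksum(b"input v1"),
        "run1/config.txt": checksum(b"config"),
        "run1/scratch.tmp": checksum(b"tmp"),
    }
    print(sync_plan(eudat_node, hpc_replica))
```

The same comparison run in the opposite direction would list simulation results present at the HPC site but not yet registered at the EUDAT node.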
[Figure: layered service landscape.]
• Community specific: CLARIN LT web service infrastructure
• SSH communities wide (DASISH): common SSH metadata catalog
• Data replication & preservation, publication (EUDAT): replication & preservation
• HPC, grid services: PRACE, EGI
• Federated identity management
• Network services: GEANT
• Communities and clusters shown: CLARIN, DARIAH, CESSDA, LifeWatch; DASISH, ENVRI