Workshop on the Future of Big Data management, London, UK – 27-28 June 2013
EUDAT
Towards a pan-European Collaborative Data Infrastructure
Mark van de Sanden
SURFsara, Dutch National HPC center, The Netherlands
Workshop on the Future of Big Data Management
Imperial College, London, UK
27-28 June 2013
Outline
• Setting the Scene
• Collaborative Data Infrastructure
• EUDAT project
• CDI Building Blocks
Setting the Scene
[Figure: the repository landscape, ranging from large volume to the long tail of small data. Axes: Repository Volume (TB, PB, EB), Repositories (1, 100, 10k), Users (#M), Variety. Labels in the figure: EB/year and PB/day (2016-2020); 200PB, ~25PB/year; 80TB, ~TB/year; 20TB, ~TB/year; ~10PB/year; ~2-3PB/year; 30 Repositories; 5 Repositories; 7 Repositories. Users: 1,3M Researchers, 15M Students, 500M Citizens.]
Doing some Math
EU institutes of higher education (research institutes not included): 4000
Average repositories per institute: 10
Average size repository: 5TB
Researchers: 1,3M
Students: 15M
4000 Institutes * 10 Rep/Institute * 5TB/Rep = 200PB
1,3M Researchers sharing 50GB = 65PB
15M students sharing 1GB = 15PB
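The three estimates above can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic; it assumes decimal units (1 PB = 1000 TB = 1,000,000 GB), and the variable and function names are illustrative.

```python
# Back-of-envelope estimate from the slide, using decimal units (assumption).
TB_PER_PB = 1_000
GB_PER_PB = 1_000_000

INSTITUTES = 4_000            # EU institutes of higher education
REPOS_PER_INSTITUTE = 10      # average repositories per institute
TB_PER_REPO = 5               # average repository size in TB
RESEARCHERS = 1_300_000
GB_PER_RESEARCHER = 50
STUDENTS = 15_000_000
GB_PER_STUDENT = 1


def estimate_pb():
    """Return the three volume estimates in petabytes."""
    return {
        "repositories": INSTITUTES * REPOS_PER_INSTITUTE * TB_PER_REPO / TB_PER_PB,
        "researchers": RESEARCHERS * GB_PER_RESEARCHER / GB_PER_PB,
        "students": STUDENTS * GB_PER_STUDENT / GB_PER_PB,
    }


estimates = estimate_pb()
for name, petabytes in estimates.items():
    print(f"{name}: {petabytes:g} PB")
```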
Data trends
[Figure: exponential growth of data volume (Gigabytes, Terabytes, Petabytes, Exabytes, Zettabytes) together with increasing complexity and variety]
• Where to store it?
• How to find it?
• How to make the most of it?
Collaborative Data Infrastructure - A framework for the future? -
[Diagram: layered framework; Trust and Data Curation are shown alongside the layers]
Data Generators
Users: User functionalities, data capture & transfer, virtual research environments
Community Support Services: Data discovery & navigation, workflow generation, annotation, interpretability
Common Data Services: Persistent storage, identification, authenticity, workflow execution, mining
Six research communities on Board
• EPOS: European Plate Observing System
• CLARIN: Common Language Resources and Technology Infrastructure
• ENES: Service for Climate Modelling in Europe
• LifeWatch: Biodiversity Data and Observatories
• VPH: The Virtual Physiological Human
• INCF: International Neuroinformatics
• All share common challenges:
– Reference models and architectures
– Persistent data identifiers
– Metadata management
– Distributed data sources
– Data interoperability
Data Centers and Communities
Building Blocks of the CDI
Data Staging: Dynamic replication to HPC workspace for processing
Safe Replication: Data curation and access optimization
Simple Store: Researcher data store (simple upload, share and access)
Metadata Catalogue: Aggregated EUDAT metadata domain. Data inventory
AAI: Network of trust among authentication and authorization actors
Where to Store it?
Safe Replication
• Robust, safe and highly available data replication service for small- and medium-sized repositories
– To guard against data loss in long-term archiving and preservation
– To optimize access for users from different regions
– To bring data closer to powerful computers for compute-intensive analysis
[Diagram: EUDAT CDI Domain of registered data, with PIDs and policy rules]
Safe Replication
[Diagram: iRODS instances in front of different storage back-ends (SAMQFS, GPFS, dCache, HPSS, DMF). The rule DoReplication() runs msiDataObjRsync(), triggerCreatePID() and updateMonitor(); the data objects carry PIDs.]
doReplication(*pid,*source,*destination,*status) {
    msiDataObjRsync(*source, "IRODS_TO_IRODS", "null", *destination, *rsyncStatus);
    triggerCreatePID("*collectionPath*child.pid.create",*pid,*destination);
    updateMonitor("*collectionPath*filepathslash.pid.update");
}
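The three steps of the rule (synchronise the object, create a PID for the replica, update the monitor) can be illustrated outside iRODS with a small Python sketch. Everything here is an assumption made for illustration: a local file copy stands in for msiDataObjRsync, a dictionary stands in for the PID service, a list stands in for the monitor, and the PID prefix is hypothetical. It is not the EUDAT implementation.

```python
import hashlib
import shutil
import tempfile
import uuid
from pathlib import Path


def sha256(path):
    """Checksum used to verify that the replica matches the source."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def do_replication(source, destination, pid_registry, monitor_log):
    """Copy source to destination, verify it, register a PID, log the event."""
    source, destination = Path(source), Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)              # stand-in for msiDataObjRsync
    if sha256(source) != sha256(destination):
        monitor_log.append(("failed", str(destination)))
        raise OSError(f"replica of {source} does not match the source")
    pid = f"hypothetical-prefix/{uuid.uuid4()}"    # stand-in for triggerCreatePID
    pid_registry[pid] = str(destination)
    monitor_log.append(("replicated", str(destination)))  # stand-in for updateMonitor
    return pid


def demo():
    """Replicate one small file between two directories acting as sites."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "site_a" / "data.txt"
        source.parent.mkdir()
        source.write_text("example data object")
        registry, log = {}, []
        pid = do_replication(source, Path(tmp) / "site_b" / "data.txt", registry, log)
        content = Path(registry[pid]).read_text()
        return pid, registry, log, content
```

Registering the PID only after the checksum comparison means a PID never points at a replica that failed verification; whether the actual rule set enforces this ordering is not stated on the slide.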
How to make the most of it?
Data Staging
• Support researchers in transferring large data collections from EUDAT storage to HPC facilities
• Reliable, efficient, and easy-to-use tools to manage data transfers
• Provide the means to re-ingest computational results back into the EUDAT infrastructure
[Diagram: EUDAT CDI Domain of registered data connected to PRACE HPC and other HPC facilities]
Data Staging
[Diagram: Community Portal, Workflow, DataStaging(), GO, GridFTP 3rd Party Transfers, iRODS, PID. The user starts the workflow, GO starts the transfers, and the user can monitor the data flow.]
datastager.py [-h] [-d] [-p PATH] [-u USER] [-y YEAR] [-n NETWORK]
              [-c CHANNEL] [-s STATION] [-P PID] [-PF PIDFILE] [-U URL]
              [-UF URLFILE] [-t TASKID] [-pF PATHFILE] [--ss SRC_SITE]
              [--ds DST_SITE] [--sd SRC_DIR] [--dd DST_DIR]
              {in,out} {seed,pid,url,taskid}
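To make the shape of that command line concrete, here is a hedged argparse sketch that accepts a subset of the options shown in the usage line. It is not the datastager.py source: the `dest` names, the names given to the two positional arguments, and the example values are assumptions, and the slide does not explain what each option does.

```python
import argparse


def build_parser():
    """Parser for a subset of the options listed in the usage line."""
    parser = argparse.ArgumentParser(prog="datastager.py")
    parser.add_argument("-d", action="store_true")            # meaning not given on the slide
    parser.add_argument("-P", dest="pid", metavar="PID")
    parser.add_argument("-U", dest="url", metavar="URL")
    parser.add_argument("-t", dest="taskid", metavar="TASKID")
    parser.add_argument("--ss", dest="src_site", metavar="SRC_SITE")
    parser.add_argument("--ds", dest="dst_site", metavar="DST_SITE")
    parser.add_argument("--sd", dest="src_dir", metavar="SRC_DIR")
    parser.add_argument("--dd", dest="dst_dir", metavar="DST_DIR")
    # The two positional choices from the usage line; the names are assumptions.
    parser.add_argument("direction", choices=["in", "out"])
    parser.add_argument("kind", choices=["seed", "pid", "url", "taskid"])
    return parser


# Hypothetical invocation: stage in one object identified by a PID.
args = build_parser().parse_args(
    ["-P", "hypothetical-prefix/1234", "--ss", "site-a", "--ds", "site-b", "in", "pid"]
)
print(args.direction, args.kind, args.pid, args.src_site, args.dst_site)
```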
How to find it?
Joint Metadata Service
• Easily find collections of scientific data – generated either by various communities or via EUDAT services
• Access those data collections through the given references in the metadata to the relevant data stores
• Europeana of scientific data
[Diagram: EUDAT CDI Domain of registered data]
Joint Metadata Service
[Diagram: metadata flow in numbered steps (1-9). Communities (e.g. EPOS, …; e.g. ENES or CLARIN) act as metadata providers: provider A is a Community OAI provider, provider B is a Community non-OAI provider. XML metadata (XML-MD) reaches the RawMDStore through the OAI Harvester or through ftp or another protocol. Adapters per schema (Adapter schema A, Adapter schema B) feed the Indexer and CKAN, backed by PostgreSQL and Lucene/SOLR. The WWW front end offers browsing on a limited set of (10?) facets and full metadata content search.]
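The adapter idea in this diagram, where each community schema is mapped onto a common set of fields that can then be browsed by facet or searched in full, can be sketched in a few lines of Python. The two XML records, their element names, and the common field names are invented for illustration; the real service relies on CKAN and Lucene/SOLR rather than in-memory lists.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two hypothetical community records that use different schemas.
RECORD_A = "<record><title>Sea surface temperature</title><domain>Climate</domain></record>"
RECORD_B = "<md><name>Seismic waveforms</name><discipline>Seismology</discipline></md>"


def adapt_schema_a(xml_text):
    """Map a schema A record onto the common fields."""
    root = ET.fromstring(xml_text)
    return {"title": root.findtext("title"), "discipline": root.findtext("domain")}


def adapt_schema_b(xml_text):
    """Map a schema B record onto the common fields."""
    root = ET.fromstring(xml_text)
    return {"title": root.findtext("name"), "discipline": root.findtext("discipline")}


def build_index(raw_store):
    """raw_store is a list of (adapter, xml_text) pairs from the raw metadata store."""
    return [adapter(xml_text) for adapter, xml_text in raw_store]


def facet_counts(index, facet):
    """Count records per value of one facet, as a facet browser would."""
    return Counter(record[facet] for record in index)


def search(index, term):
    """Naive full-content search over all mapped fields."""
    term = term.lower()
    return [r for r in index if any(term in str(v).lower() for v in r.values())]


index = build_index([(adapt_schema_a, RECORD_A), (adapt_schema_b, RECORD_B)])
print(facet_counts(index, "discipline"))
print(search(index, "seismic"))
```

Keeping the raw records and applying adapters afterwards means a mapping can be corrected and the index rebuilt without harvesting again, which is one plausible reason for the separate RawMDStore in the diagram.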
What about Homeless and Citizen scientists?
• Allow registered users to upload "long tail" data into the EUDAT store
• Enable sharing objects and collections with other researchers
• Utilize other EUDAT services to provide reliability and data retention
[Diagram: EUDAT CDI Domain of registered data; simple upload, simple metadata, PID registration]
Simple Store
[Diagram: Invenio, PID, Replicate]
• Create a user profile
• Deposit a Data Object
• Select a Science Domain
• Fill in basic metadata on basis of Science Domain
• A PID is created
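The deposit steps above can be sketched as a small in-memory model. This is an illustration only and not the Invenio-based service: the class, the per-domain metadata fields, and the PID prefix are hypothetical, and a UUID stands in for a real PID registration.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical basic metadata fields required per science domain.
DOMAIN_FIELDS = {
    "Linguistics": ["title", "language"],
    "Climate": ["title", "variable"],
}


@dataclass
class SimpleStore:
    users: set = field(default_factory=set)
    objects: dict = field(default_factory=dict)

    def create_profile(self, user):
        """Step 1: create a user profile."""
        self.users.add(user)

    def deposit(self, user, filename, domain, metadata):
        """Steps 2-5: deposit an object, check domain metadata, create a PID."""
        if user not in self.users:
            raise PermissionError(f"{user} has no profile")
        missing = [name for name in DOMAIN_FIELDS[domain] if name not in metadata]
        if missing:
            raise ValueError(f"missing metadata for {domain}: {missing}")
        pid = f"hypothetical-prefix/{uuid.uuid4()}"   # stand-in for PID registration
        self.objects[pid] = {"owner": user, "file": filename,
                             "domain": domain, "metadata": dict(metadata)}
        return pid


store = SimpleStore()
store.create_profile("alice")
pid = store.deposit("alice", "corpus.zip", "Linguistics",
                    {"title": "Example corpus", "language": "nl"})
print(pid, store.objects[pid]["domain"])
```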