EUDAT: Towards a pan-European Collaborative Data Infrastructure. Mark van de Sanden, SURFsara (Dutch National HPC center, The Netherlands). Workshop on the Future of Big Data Management, Imperial College, London, UK, 27-28 June 2013.

EUDAT

Feb 23, 2016

Transcript
Page 1: EUDAT

EUDAT: Towards a pan-European Collaborative Data Infrastructure

Mark van de Sanden
SURFsara, Dutch National HPC center, The Netherlands

Workshop on the Future of Big Data Management
Imperial College, London, UK
27-28 June 2013

Page 2: EUDAT


Outline
• Setting the Scene
• Collaborative Data Infrastructure
• EUDAT project
• CDI Building Blocks

Page 3: EUDAT


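Setting the Scene

[Figure: repository volume (TB, PB, EB) plotted against number of repositories (1, 100, 10k), with further axes for users (millions) and variety. The chart contrasts a small number of large-volume repositories with a long tail of small data. Labels surviving from the figure: EB/year and PB/day (2016-2020); 200PB, ~25PB/year; 80TB, ~TB/year; 20TB, ~TB/year; ~10PB/year; ~2-3PB/year; 30 repositories; 5 repositories; 7 repositories. Users: 1.3M researchers, 15M students, 500M citizens.]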

Page 4: EUDAT


Doing some Math

                                                      EU
Institutes of higher education
(research institutes not included)                    4000
Average repositories per institute                    10
Average size repository                               5TB
Researchers                                           1.3M
Students                                              15M

4000 Institutes * 10 Rep/Institute * 5TB/Rep = 200PB
1.3M Researchers sharing 50GB each = 65PB
15M Students sharing 1GB each = 15PB
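The three estimates on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic; it assumes decimal units (1 PB = 1000 TB = 1,000,000 GB), which the slide does not state explicitly, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the slide's estimates.
# Assumption: decimal units, 1 PB = 1000 TB = 1,000,000 GB.
TB_PER_PB = 1_000
GB_PER_PB = 1_000_000

institutes = 4_000            # institutes of higher education in the EU
repos_per_institute = 10      # average repositories per institute
tb_per_repo = 5               # average repository size in TB
researchers = 1_300_000
students = 15_000_000

# 4000 institutes * 10 repositories * 5 TB each
institutional_pb = institutes * repos_per_institute * tb_per_repo / TB_PER_PB
# 1.3M researchers sharing 50 GB each
researcher_pb = researchers * 50 / GB_PER_PB
# 15M students sharing 1 GB each
student_pb = students * 1 / GB_PER_PB

print(institutional_pb, researcher_pb, student_pb)
```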

Page 5: EUDAT


Data trends

[Figure: exponential growth in data volume, from gigabytes through terabytes, petabytes and exabytes to zettabytes, alongside increasing complexity and variety.]

• Where to store it?
• How to find it?
• How to make the most of it?

Page 6: EUDAT


Collaborative Data Infrastructure: a framework for the future?

[Figure: layered model of the Collaborative Data Infrastructure, with Trust and Data Curation shown alongside the layers.]

• Data Generators and Users: user functionalities, data capture & transfer, virtual research environments
• Community Support Services: data discovery & navigation, workflow generation, annotation, interpretability
• Common Data Services: persistent storage, identification, authenticity, workflow execution, mining

Page 7: EUDAT


Page 8: EUDAT


Six research communities on board

• EPOS: European Plate Observing System
• CLARIN: Common Language Resources and Technology Infrastructure
• ENES: Service for Climate Modelling in Europe
• LifeWatch: Biodiversity Data and Observatories
• VPH: The Virtual Physiological Human
• INCF: International Neuroinformatics Coordinating Facility

• All share common challenges:
  – Reference models and architectures
  – Persistent data identifiers
  – Metadata management
  – Distributed data sources
  – Data interoperability

Page 10: EUDAT


Building Blocks of the CDI

• Data Staging: dynamic replication to HPC workspace for processing
• Safe Replication: data curation and access optimization
• Simple Store: researcher data store (simple upload, share and access)
• Metadata Catalogue: aggregated EUDAT metadata domain; data inventory
• AAI: network of trust among authentication and authorization actors

Page 11: EUDAT


Where to Store it?

Safe Replication
• Robust, safe and highly available data replication service for small- and medium-sized repositories
  – To guard against data loss in long-term archiving and preservation
  – To optimize access for users from different regions
  – To bring data closer to powerful computers for compute-intensive analysis

[Figure: EUDAT CDI domain of registered data, governed by PIDs and policy rules]

Page 12: EUDAT


Safe Replication

[Figure: iRODS instances in front of different storage systems (SAMQFS, GPFS, dCache, HPSS, DMF). The rule DoReplication() calls msiDataObjRsync() to copy a data object between iRODS instances, triggerCreatePID() to register a PID for it, and updateMonitor() to report the result.]

    doReplication(*pid,*source,*destination,*status) {
        msiDataObjRsync(*source, "IRODS_TO_IRODS", "null", *destination, *rsyncStatus);
        triggerCreatePID("*collectionPath*child.pid.create",*pid,*destination);
        updateMonitor("*collectionPath*filepathslash.pid.update");
    }
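The DoReplication rule chains three steps: synchronise the object, register a PID for the replica, and notify a monitor. The Python sketch below mimics that sequence so the control flow is easier to follow. It is not iRODS code: plain dictionaries stand in for the two storage sites, the PID registry and monitor log are in-memory stand-ins, and every name in it (including the PID prefix) is hypothetical.

```python
import hashlib
import uuid


def checksum(data: bytes) -> str:
    """Content fingerprint used to decide whether a copy is needed."""
    return hashlib.sha256(data).hexdigest()


def do_replication(parent_pid, path, source, destination, pid_registry, monitor_log):
    """Replicate one object, register a PID for the replica, log the outcome."""
    data = source[path]

    # Step 1 (stands in for msiDataObjRsync): copy only if the destination
    # is missing the object or holds different content.
    if path not in destination or checksum(destination[path]) != checksum(data):
        destination[path] = data

    # Step 2 (stands in for triggerCreatePID): register a new identifier
    # for the replica and link it to the original PID.
    replica_pid = f"example-prefix/{uuid.uuid4()}"  # hypothetical PID format
    pid_registry[replica_pid] = {
        "path": path,
        "parent": parent_pid,
        "checksum": checksum(data),
    }

    # Step 3 (stands in for updateMonitor): record what happened.
    monitor_log.append((path, "replicated"))
    return replica_pid


# Usage with made-up sample data.
path = "/siteA/collection/file.dat"
source = {path: b"observations"}
destination, registry, log = {}, {}, []
replica_pid = do_replication("example-prefix/original", path, source,
                             destination, registry, log)
```

Note that this sketch registers a fresh PID on every call; a real policy would presumably check for an existing replica PID first.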

Page 13: EUDAT


How to make the most of it?

Data Staging
• Support researchers in transferring large data collections from EUDAT storage to HPC facilities
• Reliable, efficient, and easy-to-use tools to manage data transfers
• Provide the means to re-ingest computational results back into the EUDAT infrastructure

[Figure: data moving between the EUDAT CDI domain of registered data and HPC facilities, including PRACE HPC]

Page 14: EUDAT


Data Staging

[Figure: data staging workflow. Labels: Community Portal, Workflow, DataStaging(), GO, GridFTP 3rd Party Transfers, iRODS, PID. Sequence shown: the user starts the workflow; GO starts the transfers; the user can monitor the data flow.]

    datastager.py [-h] [-d] [-p PATH] [-u USER] [-y YEAR] [-n NETWORK]
                  [-c CHANNEL] [-s STATION] [-P PID] [-PF PIDFILE] [-U URL]
                  [-UF URLFILE] [-t TASKID] [-pF PATHFILE] [--ss SRC_SITE]
                  [--ds DST_SITE] [--sd SRC_DIR] [--dd DST_DIR]
                  {in,out} {seed,pid,url,taskid}
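The datastager.py usage line can be mirrored with Python's argparse to show how the options and the two positional choices fit together. This is a reconstruction from the usage string only, not the actual tool: the destination names are illustrative, and the meaning of -d is not given on the slide, so it is treated here as a plain boolean switch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser reconstructed from the usage line; semantics are assumptions."""
    parser = argparse.ArgumentParser(prog="datastager.py")
    parser.add_argument("-d", dest="d", action="store_true")  # meaning not stated on the slide

    # Options taking one value, in the order of the usage line:
    # (flag, illustrative destination name, metavar from the slide)
    valued_options = [
        ("-p", "path", "PATH"), ("-u", "user", "USER"), ("-y", "year", "YEAR"),
        ("-n", "network", "NETWORK"), ("-c", "channel", "CHANNEL"),
        ("-s", "station", "STATION"), ("-P", "pid", "PID"),
        ("-PF", "pid_file", "PIDFILE"), ("-U", "url", "URL"),
        ("-UF", "url_file", "URLFILE"), ("-t", "task_id", "TASKID"),
        ("-pF", "path_file", "PATHFILE"), ("--ss", "src_site", "SRC_SITE"),
        ("--ds", "dst_site", "DST_SITE"), ("--sd", "src_dir", "SRC_DIR"),
        ("--dd", "dst_dir", "DST_DIR"),
    ]
    for flag, dest, metavar in valued_options:
        parser.add_argument(flag, dest=dest, metavar=metavar)

    # Two required positionals: transfer direction and how the data is selected.
    parser.add_argument("direction", choices=["in", "out"])
    parser.add_argument("selector", choices=["seed", "pid", "url", "taskid"])
    return parser


# Example invocation with made-up values: stage in one object identified by PID.
args = build_parser().parse_args(
    ["-P", "example-prefix/1234", "--ss", "siteA", "--ds", "siteB", "in", "pid"]
)
```

The second positional appears to name which kind of selector (seed, pid, url or taskid) identifies the data to transfer, which would explain why each has matching options such as -P/-PF and -U/-UF.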

Page 15: EUDAT


How to find it?

Joint Metadata Service
• Easily find collections of scientific data – generated either by various communities or via EUDAT services
• Access those data collections through the given references in the metadata to the relevant data stores
• Europeana of scientific data

[Figure: EUDAT CDI domain of registered data]

Page 16: EUDAT


Joint Metadata Service

[Figure: architecture of the Joint Metadata Service, with steps numbered 1 to 9. Communities (e.g. EPOS, …; e.g. ENES or CLARIN) act as metadata providers: provider A exposes OAI and is read by the OAI Harvester; provider B is non-OAI and is read via ftp or another protocol. Harvested XML metadata (XML-MD) is kept in a raw metadata store (RawMDStore), passed through per-schema adapters (schema A, schema B) and an indexer, and loaded into CKAN (PostgreSQL, Lucene/SOLR). The web front end (WWW) offers browsing over a limited set of (10?) facets and full metadata content search.]
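The adapter-and-indexer idea in this architecture can be illustrated with a small Python sketch: one adapter per community schema maps a harvested record onto a shared set of facets, and an index over the mapped records supports keyword search. Everything here is hypothetical, including the facet names, the field names of schemas A and B, and the sample records; dictionaries stand in for the XML metadata, and the tiny inverted index stands in for what CKAN and SOLR would do in the real service.

```python
from collections import defaultdict

# Hypothetical common facets; the slide only says "limited set of (10?) facets".
FACETS = ("title", "creator", "community")


def adapt_schema_a(record: dict) -> dict:
    """Map a record in (made-up) community schema A onto the common facets."""
    return {"title": record["dc:title"], "creator": record["dc:creator"], "community": "A"}


def adapt_schema_b(record: dict) -> dict:
    """Map a record in (made-up) community schema B onto the common facets."""
    return {"title": record["name"], "creator": record["author"], "community": "B"}


def build_index(records: list) -> dict:
    """Inverted index: lowercase token -> positions of records containing it."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for facet in FACETS:
            for token in record[facet].lower().split():
                index[token].add(position)
    return index


def search(index: dict, records: list, term: str) -> list:
    """Return all records whose facet values contain the term as a whole word."""
    return [records[i] for i in sorted(index.get(term.lower(), ()))]


# Made-up sample records, one per schema.
raw_a = {"dc:title": "Seismic waveforms 2012", "dc:creator": "Station network"}
raw_b = {"name": "Climate model output", "author": "Modelling group"}

records = [adapt_schema_a(raw_a), adapt_schema_b(raw_b)]
index = build_index(records)
hits = search(index, records, "climate")
```

Unlike the full metadata content search named on the slide, this sketch only indexes the mapped facet values, not the raw XML.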

Page 17: EUDAT


What about homeless and citizen scientists?

• Allow registered users to upload "long tail" data into the EUDAT store
• Enable sharing objects and collections with other researchers
• Utilize other EUDAT services to provide reliability and data retention

[Figure: EUDAT CDI domain of registered data, with simple upload, simple metadata and PID registration]

Page 18: EUDAT


Simple Store

[Figure: Simple Store built on Invenio; deposited objects receive a PID and are replicated]

• Create a user profile
• Deposit a Data Object
• Select a Science Domain
• Fill in basic metadata on basis of Science Domain
• A PID is created

Page 19: EUDAT

