Globus Scientific Data Publication Services
Ben Blaiszik, Kyle Chard, Rachana Ananthakrishnan, Steve Tuecke, Ian Foster, and the Globus Team
[email protected] | www.globus.org
Computation Institute
Aug 17, 2015
Overview
• What is Globus?
• Globus Services
  – Data publication
  – Data cataloging
  – Data transfer
  – User authentication
  – Groups
  – Sharing
• > 8,000 endpoints
• > 85 U.S. campuses
• European Globus Community: http://www.egcf.eu/
Globus is ...
• Research data management delivered via SaaS
• Big data transfer, sharing, publication, and discovery...
• ...directly from your own storage systems or the cloud
Globus Delivers
SaaS market domination...
• ...for your photos
• ...for your e-mail
• ...for your entertainment
• ...for your research data
Research data management scenarios and challenges
“I need to easily, quickly, and reliably move or mirror portions of my data to other places.”
Example endpoints: public cloud, research computing HPC cluster, lab server, personal laptop, XSEDE resource
“I need to easily and securely share my data with my colleagues at other institutions.”
Example source: scientific instrumentation
“I need to publish my data so that others can find it and use it.”
Use cases: scholarly publication, reference datasets, active research collaboration
Globus Transfer
• “Fire-and-forget” transfers
  – Optimized transfer performance
  – Automatic fault recovery
  – Automatic retry
  – Seamless security integration
  – 128-bit checksums
• Intuitive web GUI and powerful APIs for automation
  – REST and Python APIs
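The fire-and-forget behavior above combines automatic retry with checksum verification. Globus performs this server-side; the sketch below is only a minimal, hypothetical illustration of the two ideas (the function names and retry policy are invented for this example, not the Globus API):

```python
import hashlib
import time

def file_checksum(path, algo="sha256"):
    """Stream a file through a hash so large files never load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def transfer_with_retry(do_transfer, src, dst, max_retries=3, backoff=1.0):
    """Fire-and-forget style: retry a transfer callable on transient failure,
    then verify integrity by comparing source and destination checksums."""
    for attempt in range(1, max_retries + 1):
        try:
            do_transfer(src, dst)
            break
        except OSError:
            if attempt == max_retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff between retries
    if file_checksum(src) != file_checksum(dst):
        raise ValueError("checksum mismatch after transfer")
```

In the real service, the retry policy, fault recovery, and checksum comparison all happen inside Globus after a single request is submitted; the user only receives a completion notification.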
[Diagram] You submit a transfer request; Globus moves the data for you between two secure endpoints (A, e.g. your laptop; B, e.g. midway) and notifies you once the transfer is complete.
Globus, the Abridged Version
[Architecture diagram] Data layer: endpoint file systems, transfers, sharing, user auth, groups. Metadata layer: catalog, data publication, discovery (plugin point; federation?).
* REST and Python APIs throughout
Globus Catalog
• Automate metadata ingestion from instrumentation and acquisition machines
  – API/CLI integration
• Allow near real-time, metadata-driven feedback to experiments
• Allow insert points throughout the workflow
  – Ingest at point of collection
  – Catalog metadata and provenance
  – Push to data store
  – Push to local or external HPC
• Allow building and sharing of typed metadata definitions
  – e.g. a definition set that specifically fits X-ray scattering data at your beamline
  – Addresses the problem of T, temp, Temp, temperature, temperature_kelvin, ...
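The T/temp/Temp/temperature problem above is what a shared, typed metadata definition solves: every ad-hoc alias maps onto one canonical, typed field. A minimal sketch of the idea (the `DEFINITION` structure and field names here are hypothetical, not the Catalog's actual schema format):

```python
# Hypothetical typed metadata definition: canonical field names, expected
# types, and the ad-hoc aliases seen in raw instrument output.
DEFINITION = {
    "temperature_kelvin": {
        "type": float,
        "aliases": {"T", "temp", "Temp", "temperature", "temperature_kelvin"},
    },
    "wavelength_angstrom": {
        "type": float,
        "aliases": {"wl", "lambda", "wavelength", "wavelength_angstrom"},
    },
}

def normalize(record):
    """Map raw key/value pairs onto canonical, typed metadata fields."""
    out = {}
    for key, value in record.items():
        for canonical, spec in DEFINITION.items():
            if key in spec["aliases"]:
                out[canonical] = spec["type"](value)
                break
        else:
            out[key] = value  # pass through unrecognized fields untouched
    return out
```

With a definition set like this shared across a beamline, every ingested record becomes searchable under one field name regardless of which acquisition machine produced it.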
Globus Catalog: Datasets
• Group data based on use and features, not location/filename
  – Logical grouping to organize, search, and describe
• Operate on datasets as units
• Tag datasets with characteristics that reflect content
• Share/move datasets for collaboration
• Interact via REST API, Python API, GUI, and CLI
Hierarchy: Catalog → Datasets → Members
[Screenshot: Globus Catalog web user interface]
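The dataset-as-unit model above (logical grouping by content tags rather than by path) can be sketched as follows; the class names and tag vocabulary are invented for illustration and do not mirror the Catalog's actual data model:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a dataset is a logical unit of members (files that may
# live on different endpoints) plus tags describing content, not location.
@dataclass
class Dataset:
    name: str
    members: list = field(default_factory=list)   # e.g. "endpoint:/path/file"
    tags: dict = field(default_factory=dict)      # e.g. {"beamline": "1-ID"}

class Catalog:
    def __init__(self):
        self.datasets = []

    def add(self, dataset):
        self.datasets.append(dataset)

    def search(self, **criteria):
        """Find datasets whose tags match every given key/value pair."""
        return [d for d in self.datasets
                if all(d.tags.get(k) == v for k, v in criteria.items())]
```

The point of the design: moving a member file to a different endpoint changes only its member entry, not the dataset's identity, tags, or search results.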
Near-field HEDM Workflow (Sharma, Almer)
** Supported by the Data Engines for Big Data LDRD (Wilde, Wozniak, Sharma, Almer, Blaiszik)

Detector output: up to 1,000 datasets/week; each dataset is 360 files, 4 GB total.
1: Median calculation (MedianImage.c, uses Swift): 75 s (90% I/O)
2: Peak search (ImageProcessing.c, uses Swift): 15 s per file; yields a reduced dataset of 360 files, 5 MB total
3: Generate parameters (FOP.c): 50 tasks, 25 s/task, 1/4 CPU hour, concurrent
4: Analysis pass (FitOrientation.c): 10^5 tasks at 20 s/task (555 CPU hours), then 1 min/task (1,667 CPU hours), concurrent

Feedback to the experiment: overnight or in real time; up to 2.2 M CPU hours per week. Ran in real time on 4/4/2014 on Orthros.
Experimenting “in the data dark”
• Feedback during each experiment was non-existent
• It took months to calculate information relevant for publication, or to find out an experiment was corrupted
• Now, initial feedback arrives over lunch, using Globus, Swift, and Catalog to leverage HPC and track metadata
Globus Data Publication
• Operated as a hosted service
• Designed for big data
• Bring your own (per-collection) storage
• Extensible metadata schemas and input forms
• Customizable publication and curation workflows
• Associate unique and persistent digital identifiers with datasets
• Rich discovery model (in development)
[Workflow diagram]
1. Researcher assembles a dataset and describes it using metadata (Dublin Core and domain-specific).
2. Curator reviews and approves; the dataset is published on a campus or other published data store.
3. Peers and the public search for and discover datasets, then access and transfer them using Globus.
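The researcher → curator → public workflow is essentially a small state machine. A hypothetical sketch of the states and allowed transitions (the state names and transition rules are assumptions for illustration, not the service's actual implementation):

```python
from enum import Enum

class State(Enum):
    ASSEMBLING = "assembling"    # researcher adds files and metadata
    IN_CURATION = "in_curation"  # curator reviews the submission
    PUBLISHED = "published"      # discoverable by peers and the public
    REJECTED = "rejected"

# Allowed transitions in the workflow sketched above.
TRANSITIONS = {
    State.ASSEMBLING: {State.IN_CURATION},
    State.IN_CURATION: {State.PUBLISHED, State.REJECTED},
    State.REJECTED: {State.ASSEMBLING},  # researcher may revise and resubmit
    State.PUBLISHED: set(),              # published datasets are immutable
}

class Submission:
    def __init__(self):
        self.state = State.ASSEMBLING

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
```

Modeling publication this way makes the key invariant explicit: once a dataset reaches the published state, no transition can take it back to an editable one.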
Data Publication Dashboard
Start a New Submission
Policies at the Collection Level
• Required metadata and schemas
• Data storage location
• Metadata curation policies
Describe Submission: 1) Dublin Core
• Scientist or representative describes the data they are submitting
• For this collection, Dublin Core and a collection metadata template are required
Describe Submission: 2) Scientific Metadata
• Scientist or representative describes the data they are submitting
• For this collection, Dublin Core and a collection metadata template are required
Assemble the dataset
Transfer Files to Submission Endpoint
• Scientist transfers dataset files to a unique publication endpoint
• The endpoint is created on the collection-specified data store
• The dataset may be assembled over any period of time
• When the submission is finished, the dataset is rendered immutable via checksum
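“Immutable via checksum” means a fingerprint is computed over the finished dataset, so any later change is detectable. A minimal sketch of the idea, assuming a single SHA-256 digest over all file paths and contents (the actual service's checksum scheme is not specified here):

```python
import hashlib
import os

def dataset_fingerprint(root):
    """Hash every file (relative path + content) under root into one digest.
    Any later change to any file changes the fingerprint, which is the idea
    behind rendering a finished submission immutable via checksum."""
    outer = hashlib.sha256()
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):   # deterministic order for stable digests
            path = os.path.join(dirpath, name)
            outer.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    outer.update(chunk)
    return outer.hexdigest()
```

Recomputing the fingerprint at access time and comparing it to the value recorded at publication verifies that the published dataset has not been altered.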
Check Dataset Assembly
• Verify size, file names, etc.
• The system attempts to determine file types
• The scientist can choose to edit, remove, or add more files
• The scientist then accepts the collection-specified license and completes the submission (not pictured)
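The “system attempts to determine file types” step can be approximated with extension-based MIME detection; this is only a plausible sketch of what such a check might do (the fallback type and function name are assumptions), not the service's actual detection logic:

```python
import mimetypes

def guess_types(filenames):
    """Best-effort file-type detection from extensions, as a submission UI
    might do; unknown extensions fall back to a generic binary type."""
    result = {}
    for name in filenames:
        mime, _ = mimetypes.guess_type(name)
        result[name] = mime or "application/octet-stream"
    return result
```

A real system would likely also inspect file contents (magic bytes), since scientific formats often reuse generic extensions like .dat.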
DOI Assignment
Submission Curation
• If configured, a curator can approve the submission, reject it, or edit its metadata
Discover a Published Dataset
• Search on ranged metadata
• Link back to the published dataset
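A ranged metadata search selects datasets whose numeric field falls within an interval, rather than matching an exact value. A minimal sketch of the idea (the record structure and field names are hypothetical; the DOI prefix shown is the DataCite test prefix, used purely as a placeholder):

```python
def ranged_search(records, field, low, high):
    """Return records whose numeric metadata field lies in [low, high].
    Records missing the field are excluded rather than treated as matches."""
    return [r for r in records
            if field in r and low <= r[field] <= high]
```

Combined with canonical field names from a typed metadata definition, this lets a peer ask, for example, for all published datasets collected between 250 K and 300 K.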
View Downloaded Dataset
• Use Globus Connect Personal to pull the files locally for analysis
...all of this via SaaS and with your own (institutional or personal) resources or cloud resources
Summary
• Transfer
• User authentication
• Groups
• Sharing
• Data publication
• Data cataloging
• Automation and workflows
Thank you to our sponsors!
• U.S. Department of Energy
• Data Engines for Big Data LDRD