Globus Scientific Data Publication Services
Ben Blaiszik, Kyle Chard, Rachana Ananthakrishnan, Steve Tuecke, Ian Foster, and the Globus Team
[email protected] | www.globus.org
Computation Institute
Aug 17, 2015
Overview
• What is Globus?
• Globus Services
  – Data publication
  – Data cataloging
  – Data transfer
  – User authentication
  – Groups
  – Sharing
• > 8,000 endpoints
• > 85 U.S. campuses
• European Globus Community: http://www.egcf.eu/
Globus is ...
• Research data management delivered via SaaS
• Big data transfer, sharing, publication, and discovery...
• ...directly from your own storage systems or the cloud
Globus Delivers
SaaS market domination...
• ...for your photos
• ...for your e-mail
• ...for your entertainment
• ...for your research data
Research data management scenarios and challenges
“I need to easily, quickly, and reliably move or mirror portions of my data to other places.”
Example endpoints: public cloud, research computing HPC cluster, lab server, personal laptop, XSEDE resource
“I need to easily and securely share my data with my colleagues at other institutions.”
Example source: scientific instrumentation
“I need to publish my data so that others can find it and use it.”
Use cases: scholarly publication, reference datasets, active research collaboration
Globus Transfer
• “Fire-and-forget” transfers
  – Optimized transfer performance
  – Automatic fault recovery
  – Automatic retry
  – Seamless security integration
  – 128-bit checksums
• Intuitive web GUI and powerful APIs for automation
  – REST and Python APIs
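The fire-and-forget behavior above combines automatic retry with checksum verification. Globus performs this server-side; the sketch below is only a minimal, hypothetical illustration of the two ideas (the function names and retry policy are invented for this example, not the Globus API):

```python
import hashlib
import time

def file_checksum(path, algo="sha256"):
    """Stream a file through a hash so large files never load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def transfer_with_retry(do_transfer, src, dst, max_retries=3, backoff=1.0):
    """Fire-and-forget style: retry a transfer callable on transient failure,
    then verify integrity by comparing source and destination checksums."""
    for attempt in range(1, max_retries + 1):
        try:
            do_transfer(src, dst)
            break
        except OSError:
            if attempt == max_retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff between retries
    if file_checksum(src) != file_checksum(dst):
        raise ValueError("checksum mismatch after transfer")
```

In the real service, the retry policy, fault recovery, and checksum comparison all happen inside Globus after a single request is submitted; the user only receives a completion notification.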
[Diagram] You submit a transfer request; Globus moves the data for you between two secure endpoints (A, e.g. your laptop; B, e.g. midway) and notifies you once the transfer is complete.
Globus, the Abridged Version
[Architecture diagram] Data layer: endpoint file systems, transfers, sharing, user auth, groups. Metadata layer: catalog, data publication, discovery (plugin point; federation?).
* REST and Python APIs throughout
Globus Catalog
• Automate metadata ingestion from instrumentation and acquisition machines
  – API/CLI integration
• Allow near real-time, metadata-driven feedback to experiments
• Allow insert points throughout the workflow
  – Ingest at point of collection
  – Catalog metadata and provenance
  – Push to data store
  – Push to local or external HPC
• Allow building and sharing of typed metadata definitions
  – e.g. a definition set that specifically fits X-ray scattering data at your beamline
  – Addresses the problem of T, temp, Temp, temperature, temperature_kelvin, ...
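The T/temp/Temp/temperature problem above is what a shared, typed metadata definition solves: every ad-hoc alias maps onto one canonical, typed field. A minimal sketch of the idea (the `DEFINITION` structure and field names here are hypothetical, not the Catalog's actual schema format):

```python
# Hypothetical typed metadata definition: canonical field names, expected
# types, and the ad-hoc aliases seen in raw instrument output.
DEFINITION = {
    "temperature_kelvin": {
        "type": float,
        "aliases": {"T", "temp", "Temp", "temperature", "temperature_kelvin"},
    },
    "wavelength_angstrom": {
        "type": float,
        "aliases": {"wl", "lambda", "wavelength", "wavelength_angstrom"},
    },
}

def normalize(record):
    """Map raw key/value pairs onto canonical, typed metadata fields."""
    out = {}
    for key, value in record.items():
        for canonical, spec in DEFINITION.items():
            if key in spec["aliases"]:
                out[canonical] = spec["type"](value)
                break
        else:
            out[key] = value  # pass through unrecognized fields untouched
    return out
```

With a definition set like this shared across a beamline, every ingested record becomes searchable under one field name regardless of which acquisition machine produced it.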
Globus Catalog: Datasets
• Group data based on use and features, not location/filename
  – Logical grouping to organize, search, and describe
• Operate on datasets as units
• Tag datasets with characteristics that reflect content
• Share/move datasets for collaboration
• Interact via REST API, Python API, GUI, and CLI
Hierarchy: Catalog → Datasets → Members
[Screenshot: Globus Catalog web user interface]
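The dataset-as-unit model above (logical grouping by content tags rather than by path) can be sketched as follows; the class names and tag vocabulary are invented for illustration and do not mirror the Catalog's actual data model:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a dataset is a logical unit of members (files that may
# live on different endpoints) plus tags describing content, not location.
@dataclass
class Dataset:
    name: str
    members: list = field(default_factory=list)   # e.g. "endpoint:/path/file"
    tags: dict = field(default_factory=dict)      # e.g. {"beamline": "1-ID"}

class Catalog:
    def __init__(self):
        self.datasets = []

    def add(self, dataset):
        self.datasets.append(dataset)

    def search(self, **criteria):
        """Find datasets whose tags match every given key/value pair."""
        return [d for d in self.datasets
                if all(d.tags.get(k) == v for k, v in criteria.items())]
```

The point of the design: moving a member file to a different endpoint changes only its member entry, not the dataset's identity, tags, or search results.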
Near-field HEDM Workflow (Sharma, Almer)
** Supported by the Data Engines for Big Data LDRD (Wilde, Wozniak, Sharma, Almer, Blaiszik)

Detector output: up to 1,000 datasets/week; each dataset is 360 files, 4 GB total.
1: Median calculation (MedianImage.c, uses Swift): 75 s (90% I/O)
2: Peak search (ImageProcessing.c, uses Swift): 15 s per file; yields a reduced dataset of 360 files, 5 MB total
3: Generate parameters (FOP.c): 50 tasks, 25 s/task, 1/4 CPU hour, concurrent
4: Analysis pass (FitOrientation.c): 10^5 tasks at 20 s/task (555 CPU hours), then 1 min/task (1,667 CPU hours), concurrent

Feedback to the experiment: overnight or in real time; up to 2.2 M CPU hours per week. Ran in real time on 4/4/2014 on Orthros.
Experimenting “in the data dark”
• Feedback during each experiment was non-existent
• It took months to calculate information relevant for publication, or to find out an experiment was corrupted
• Now, initial feedback arrives over lunch, using Globus, Swift, and Catalog to leverage HPC and track metadata
Globus Data Publication
• Operated as a hosted service
• Designed for big data
• Bring your own (per-collection) storage
• Extensible metadata schemas and input forms
• Customizable publication and curation workflows
• Associate unique and persistent digital identifiers with datasets
• Rich discovery model (in development)
[Workflow diagram]
1. Researcher assembles a dataset and describes it using metadata (Dublin Core and domain-specific).
2. Curator reviews and approves; the dataset is published on a campus or other published data store.
3. Peers and the public search for and discover datasets, then access and transfer them using Globus.
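The researcher → curator → public workflow is essentially a small state machine. A hypothetical sketch of the states and allowed transitions (the state names and transition rules are assumptions for illustration, not the service's actual implementation):

```python
from enum import Enum

class State(Enum):
    ASSEMBLING = "assembling"    # researcher adds files and metadata
    IN_CURATION = "in_curation"  # curator reviews the submission
    PUBLISHED = "published"      # discoverable by peers and the public
    REJECTED = "rejected"

# Allowed transitions in the workflow sketched above.
TRANSITIONS = {
    State.ASSEMBLING: {State.IN_CURATION},
    State.IN_CURATION: {State.PUBLISHED, State.REJECTED},
    State.REJECTED: {State.ASSEMBLING},  # researcher may revise and resubmit
    State.PUBLISHED: set(),              # published datasets are immutable
}

class Submission:
    def __init__(self):
        self.state = State.ASSEMBLING

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
```

Modeling publication this way makes the key invariant explicit: once a dataset reaches the published state, no transition can take it back to an editable one.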
Data Publication Dashboard
Start a New Submission
Policies at the Collection Level
• Required metadata and schemas
• Data storage location
• Metadata curation policies
Describe Submission: 1) Dublin Core
• Scientist or representative describes the data they are submitting
• For this collection, Dublin Core and a collection metadata template are required
Describe Submission: 2) Scientific Metadata
• Scientist or representative describes the data they are submitting
• For this collection, Dublin Core and a collection metadata template are required
Assemble the dataset
Transfer Files to Submission Endpoint
• Scientist transfers dataset files to a unique publication endpoint
• The endpoint is created on the collection-specified data store
• The dataset may be assembled over any period of time
• When the submission is finished, the dataset is rendered immutable via checksum
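“Immutable via checksum” means a fingerprint is computed over the finished dataset, so any later change is detectable. A minimal sketch of the idea, assuming a single SHA-256 digest over all file paths and contents (the actual service's checksum scheme is not specified here):

```python
import hashlib
import os

def dataset_fingerprint(root):
    """Hash every file (relative path + content) under root into one digest.
    Any later change to any file changes the fingerprint, which is the idea
    behind rendering a finished submission immutable via checksum."""
    outer = hashlib.sha256()
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):   # deterministic order for stable digests
            path = os.path.join(dirpath, name)
            outer.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    outer.update(chunk)
    return outer.hexdigest()
```

Recomputing the fingerprint at access time and comparing it to the value recorded at publication verifies that the published dataset has not been altered.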
Check Dataset Assembly
• Verify size, file names, etc.
• The system attempts to determine file types
• The scientist can choose to edit, remove, or add more files
• The scientist then accepts the collection-specified license and completes the submission (not pictured)
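The “system attempts to determine file types” step can be approximated with extension-based MIME detection; this is only a plausible sketch of what such a check might do (the fallback type and function name are assumptions), not the service's actual detection logic:

```python
import mimetypes

def guess_types(filenames):
    """Best-effort file-type detection from extensions, as a submission UI
    might do; unknown extensions fall back to a generic binary type."""
    result = {}
    for name in filenames:
        mime, _ = mimetypes.guess_type(name)
        result[name] = mime or "application/octet-stream"
    return result
```

A real system would likely also inspect file contents (magic bytes), since scientific formats often reuse generic extensions like .dat.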
DOI Assignment
Submission Curation
• If configured, a curator can approve the submission, reject it, or edit its metadata
Discover a Published Dataset
• Search on ranged metadata
• Link back to the published dataset
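A ranged metadata search selects datasets whose numeric field falls within an interval, rather than matching an exact value. A minimal sketch of the idea (the record structure and field names are hypothetical; the DOI prefix shown is the DataCite test prefix, used purely as a placeholder):

```python
def ranged_search(records, field, low, high):
    """Return records whose numeric metadata field lies in [low, high].
    Records missing the field are excluded rather than treated as matches."""
    return [r for r in records
            if field in r and low <= r[field] <= high]
```

Combined with canonical field names from a typed metadata definition, this lets a peer ask, for example, for all published datasets collected between 250 K and 300 K.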
View Downloaded Dataset
• Use Globus Connect Personal to pull the files locally for analysis
...all of this via SaaS and with your own (institutional or personal) resources or cloud resources
Summary
• Transfer
• User authentication
• Groups
• Sharing
• Data publication
• Data cataloging
• Automation and workflows
Thank you to our sponsors!
• U.S. Department of Energy
• Data Engines for Big Data LDRD