20160922 Materials Data Facility TMS Webinar

Ben Blaiszik (blaiszik@uchicago.edu), Kyle Chard, Rachana Ananthakrishnan, Michael Ondrejcek, Kenton McHenry

PIs: Ian Foster (foster@uchicago.edu), Steven Tuecke, John Towns

materialsdatafacility.org, globus.org

Materials Data Facility - Data Services to Advance Materials Science Research

http://dx.doi.org/10.1007/s11837-016-2001-3

MDF Article in JOM (August Issue)

MaterialsDataFacility.org

To get started, contact Ben Blaiszik (blaiszik@uchicago.edu)

Outline

• Overview
§ MDF Overview
§ Globus quick introduction

• MDF Data Publication Service
§ Key MDF data pub service features
§ Publication walk-through

• General Observations and Future Outlook

What is MDF?

We are developing production services to make it simpler for materials datasets and resources to be ...

Published, Identified, Described, Curated

Verifiable, Accessible, Preserved

Discovered, Searched, Browsed, Shared

Recommended, Accessed

Publishable Results, Published Results

Resource Data, Ref Data

Derived Data, Working Data

* Figure adapted from Warren et al.

Data Service Infrastructure

Publication Discovery

Compute for data interaction and viz

Resource Registration

+ - Initial Foci

Publication

• Identify datasets with persistent identifiers (e.g. DOI)

• Describe datasets with appropriate metadata and provenance

• Verify dataset contents over time

• Preserve critical datasets in a state that increases transparency, replicability, and helps encourage reuse

Discovery

• Search and query datasets in modern ways – e.g. via search against indexed metadata and harvested file contents rather than remembering opaque file paths

Future...

Spotlight for all data you have access to regardless of location

Under Development

Discovery (Under Development)

• SaaS cloud-hosted solution

• Logical metadata repository to index many external sources

• Flexible queries (boosting, full text, partial matches, etc.)

• Search results are limited by ACLs

• All MDF-published datasets will be indexed

• May use common schemas (DataCite, Dublin Core, etc.) or domain-specific schemas

• Globus endpoint contents may be indexed (owner enabled)

• Index has the flexibility of no required schema

• Built on Elasticsearch for proven scalability and speed, hosted on scalable AWS resources
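As a rough illustration of the query features listed above (boosting, full-text matching, ACL-limited results, facets), the following sketch builds an Elasticsearch-style request body. The index field names (`title`, `description`, `file_contents`, `visible_to`, `material_class`) and the principal strings are assumptions made for the example; the actual MDF index layout is not described here.

```python
import json


def build_search_body(text, user_principals, facet_field="material_class"):
    """Build an Elasticsearch-style search body (all field names are hypothetical)."""
    return {
        "query": {
            "bool": {
                # Full-text match across several fields; "title^3" boosts title hits.
                "must": {
                    "multi_match": {
                        "query": text,
                        "fields": ["title^3", "description", "file_contents"],
                    }
                },
                # ACL filter: keep only documents visible to one of the caller's principals.
                "filter": {"terms": {"visible_to": sorted(user_principals)}},
            }
        },
        # Facet counts, e.g. for a sidebar in the search results page.
        "aggs": {"by_facet": {"terms": {"field": facet_field}}},
    }


body = build_search_body("steel fatigue", {"public", "user:alice"})
print(json.dumps(body, indent=2))
```

Putting the ACL check in the `filter` clause keeps it out of relevance scoring, so access control never changes the ranking, only which documents are eligible.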

Custom boosting

Facets

Test-indexed data

Globus Background

https://www.globus.org

Globus Platform-as-a-Service (PaaS)

Identity management

User groups

Data transfer

Data sharing

• Share directly from your storage device (laptop or cluster)

• File and directory-level ACLs

• Manage user group creation and administration flows

• Share data with user groups

• High-performance data transfer from a web browser

• Optimize transfer settings and verify transfer integrity

• Add your laptop to the Globus cloud with Globus Connect Personal

• Create and manage a unique identity linked to external identities for authentication

Publication Discovery

REST APIs, Clients, and Docs

• New version of core services released in Feb.

• New Python SDK available
§ https://github.com/globusonline/globus-sdk-python

• Jupyter Notebook Examples
§ https://github.com/globus/globus-jupyter-notebooks

• Sample Data Portal
§ https://github.com/globus/globus-sample-data-portal

• (alpha) MDF Data Publication Service API

Globus Background

Globus moves the data for you

[Diagram: transfer between a secure endpoint (e.g. laptop) and a secure endpoint (e.g. midway)]

You submit a transfer request; Globus notifies you once the transfer is complete.

Endpoint

• E.g. laptop or server running a Globus client (analogous to a Dropbox client)

• Enables advanced file transfer and sharing

• Currently GridFTP, future GridFTP + HTTP

Some Key Features

• REST API for automation and interoperability

• Web UI for convenience

• Optimizes and verifies transfers

• Handles auto-restarts

• Battle tested with big data

Globus Web UI


Data Publication

Where are we Now?

Materials Data Publication/Discovery is Often a Challenge

Data Collection Data Storage and Process Publication

Data Collection

• Networked storage, sometimes many TB
• Unique identifier data for search/cite
• Custom metadata descriptions
• Data curation workflow
• Automation capabilities

Data Storage and Process Publication

Want to Discover / Use

Want to Publish

Don’t put under desk!

Needed to close the loop

Data Collection

• Need storage, sometimes many TB
• Need to uniquely identify data for search/cite
• Need custom metadata descriptions
• Need a data curation workflow
• Need automation capabilities

Data Storage and Process Publication

Want to Discover / Use

Want to Publish

Don’t put under desk!

Collection Model

• Collections might be a research group or a research topic...

• Collections have specified
§ Mapping to storage endpoint
  § Currently handled as automatically created shared endpoints
§ Metadata schemas
§ Access control policies
§ Licenses
§ Curation workflows

• Collections contain
§ Datasets
  § Data
  § Metadata

• Metadata Persistence
§ Metadata log file with dataset
§ Metadata replicated in search
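The collection model above can be pictured as a small data structure in which policies live on the collection and datasets inherit them. This is only a sketch under assumptions: the class and attribute names, the endpoint name, the license string, and the principal strings are invented for illustration and do not reflect MDF's internal representation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Curation(Enum):
    NONE = "no curation"
    METADATA_ONLY = "curation of metadata only"
    METADATA_AND_FILES = "curation of metadata and files"


@dataclass
class Dataset:
    title: str
    metadata: dict
    files: list = field(default_factory=list)


@dataclass
class Collection:
    name: str
    storage_endpoint: str       # where dataset files are stored
    metadata_schemas: tuple     # schemas every submission must fill in
    license: str
    curation: Curation
    readers: frozenset = frozenset({"public"})  # access control policy
    datasets: list = field(default_factory=list)

    def add(self, dataset):
        """Attach a dataset; it inherits the collection's policies."""
        self.datasets.append(dataset)

    def can_read(self, principal):
        """Readable if the collection is public or the principal is listed."""
        return "public" in self.readers or principal in self.readers


# Hypothetical collection for a research group, shared with one user group only.
group = Collection(
    name="light-source-group",
    storage_endpoint="example#publish",
    metadata_schemas=("dublin_core", "custom"),
    license="CC-BY-4.0",
    curation=Curation.METADATA_ONLY,
    readers=frozenset({"group:lab"}),
)
group.add(Dataset("Tomography scans", {"dublin_core": {}, "custom": {}}))
print(group.can_read("group:lab"), group.can_read("user:eve"))
```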

Hybrid Distributed Model

Petrel @ Argonne, 1.7 PB

Blue Waters Condo @ UIUC, 100 TB

Campus RDS

Cloud Metadata Index and Tools

Centralized resource

Globus endpoint

NSF (XSEDE)

ElectroCat EP

Publish Large Datasets

• Distributed data model leverages Globus production capabilities for file transfer (i.e. dataset assembly), user authentication, and access control groups

• 100s of TB of reliable storage @ NCSA, and more storage at Argonne
§ Globus endpoint at ncsa#mdf on Nebula
§ Expandable to many PBs as necessary
§ Automated tape backup for reliability (in progress)

• Researchers can optionally use their own local or institutional storage

Uniquely Identify Datasets

• Associate a unique identifier with a dataset
§ DOI, Handle

• Improve dataset discovery and citability
§ Aligning incentives and understanding the culture will be critical to driving adoption

• Your work has been cited 153 times in the last year

• Researchers from 30 institutions have downloaded your datasets

Future...

Share Data with Flexible ACLs

• Share data publicly, with a set of users, or keep data private

Leverage Curation Workflows

• Collection administrators can specify the level of curation workflow required for a given collection, e.g.
§ No curation
§ Curation of metadata only
§ Curation of metadata and files

Customize Metadata

• Build a custom metadata schema for your specific research data

• Re-use existing metadata schemas

• Working in conjunction with NIST researchers to define these schemas

• Can we build a system that allows schema:
§ Inheritance
  § E.g. a schema “polymers” might inherit and expand upon the “base material” of NIST
§ Versioning
  § E.g. Understand contextually how to map fields between versions
§ Dependence
  § E.g. Allows the ability to build consensus around schemas
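One way to picture schema inheritance and versioning is a registry in which each schema names its parent and a child's field list is resolved by walking up the chain. The sketch below is an assumption-laden illustration: the field names, the registry layout, and the rename table are hypothetical and are not taken from NIST's or MDF's schemas.

```python
# Hypothetical schema registry: each entry lists its own fields and its parent.
REGISTRY = {
    "base_material": {
        "parent": None,
        "fields": {"composition": str, "temperature_K": float},
    },
    "polymers": {
        "parent": "base_material",
        "fields": {"monomer": str, "molecular_weight": float},
    },
}


def resolved_fields(name, registry=REGISTRY):
    """Return the fields of a schema, including everything inherited from its parents."""
    schema = registry[name]
    fields = {}
    if schema["parent"] is not None:
        fields.update(resolved_fields(schema["parent"], registry))
    fields.update(schema["fields"])  # child definitions extend or override the parent
    return fields


def upgrade(record, renames):
    """Map a record to a newer schema version by renaming fields (versioning sketch)."""
    return {renames.get(key, key): value for key, value in record.items()}


print(sorted(resolved_fields("polymers")))
print(upgrade({"temp": 300.0}, {"temp": "temperature_K"}))
```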

Future...

MDF Submission Walkthrough

Example Use Case

Publishing Big, Remote Data

Collected multiple TB of data at a light source

Bundle the data with metadata and provenance

Want a citable DOI to share the raw and derived data with the community

Want their data to be discoverable by free text search and custom metadata

MDF Collection Home

MDF Collections

Recall: Policies Set at the Collection Level

• Required metadata, schemas

• Data storage location

• Metadata curation policies

MDF Metadata Entry

• Scientist or representative describes the data they are submitting

• For this collection Dublin Core and a custom metadata template are required
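A collection that requires both Dublin Core and a custom template could check a submission along the following lines. This is a minimal sketch: the Dublin Core element names used (`title`, `creator`, `date`, `description`) are real elements, but which of them a collection requires, and the custom fields `material` and `beamline`, are assumptions made for the example.

```python
# Assumed required subset of Dublin Core elements for this hypothetical collection.
DUBLIN_CORE_REQUIRED = ("title", "creator", "date", "description")
# Hypothetical custom template fields.
CUSTOM_REQUIRED = ("material", "beamline")


def missing_fields(record):
    """List required fields that are absent or blank, as 'section.field' strings."""
    problems = []
    sections = (("dublin_core", DUBLIN_CORE_REQUIRED), ("custom", CUSTOM_REQUIRED))
    for section, required in sections:
        values = record.get(section, {})
        for name in required:
            if not str(values.get(name, "")).strip():
                problems.append(f"{section}.{name}")
    return problems


submission = {
    "dublin_core": {
        "title": "In situ tomography of a steel sample",
        "creator": "A. Researcher",
        "description": "Raw and derived data from a light source experiment",
    },
    "custom": {"material": "steel"},
}
print(missing_fields(submission))
```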

MDF Custom Metadata


Dataset Assembly

• Shared endpoint is auto-created on collection-specified data store

• Scientist transfers dataset files to a unique publish endpoint

• Dataset may be assembled over any period of time

• When submission is finished, dataset will be rendered immutable via checksum

(e.g. NU) (e.g. UIUC Nebula)
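The idea of freezing a dataset "via checksum" can be sketched as a manifest of per-file digests recorded at submission time and compared later. This is an illustration using SHA-256 from the Python standard library; the choice of hash, the manifest layout, and the file name are assumptions, not a description of MDF's implementation.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(root, manifest):
    """Return paths that were added, removed, or modified since the manifest was built."""
    current = build_manifest(root)
    paths = set(manifest) | set(current)
    return sorted(p for p in paths if manifest.get(p) != current.get(p))


# Demonstration on a temporary directory standing in for a publish endpoint.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "scan.dat")
    with open(path, "wb") as handle:
        handle.write(b"raw detector frames")
    manifest = build_manifest(root)           # recorded when submission finishes
    unchanged = changed_files(root, manifest)
    with open(path, "ab") as handle:
        handle.write(b" (edited)")
    modified = changed_files(root, manifest)  # later verification detects the edit
print(unchanged, modified)
```

Re-running `changed_files` periodically against the stored manifest is also one simple way to verify dataset contents over time.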


Dataset Curation (Optional)

• Optionally specified in collection configuration

• Can be approved or rejected (i.e. sent back to the submitter)
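The optional curation step can be read as a small state machine: a submission either goes straight to publication or waits for a curator, who approves it or sends it back. The state names and function names below are hypothetical and chosen only to illustrate that flow.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    IN_CURATION = "in curation"
    PUBLISHED = "published"


def submit(state, curation_required):
    """Submitter finishes a dataset; the collection decides whether curation applies."""
    if state is not State.DRAFT:
        raise ValueError(f"cannot submit from state {state.value!r}")
    return State.IN_CURATION if curation_required else State.PUBLISHED


def review(state, approved):
    """Curator approves (publish) or rejects (send back to the submitter)."""
    if state is not State.IN_CURATION:
        raise ValueError(f"cannot review from state {state.value!r}")
    return State.PUBLISHED if approved else State.DRAFT


state = submit(State.DRAFT, curation_required=True)
state = review(state, approved=False)   # rejected: back to the submitter
state = submit(state, curation_required=True)
state = review(state, approved=True)
print(state.value)
```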

Mint a Permanent Identifier

Can be DOI or Handle

Dataset Record

Dataset Discovery

General Observations and Future Outlook

Publication Year 1 Milestones

• Opened to the public in March 2016

• Provisioned reliable storage to support researchers sharing open materials data (~200 TB)

• MDF data volume approaching 6 TB of materials data

• Started building deep relationships with many of the key materials data generating groups and communities

• Ingested a dataset > 1 TB in size

• Ingested a dataset with > 1.5M files

Integration with the Community is Key

Materials Project

Citrination, Materials Commons

Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers!

Metadata, Publishing

Metadata, MD, Pub., Compute

Metadata, Publishing

NCSA-PIRE, HV/TMS, MBDH

Understanding Incentives is Critical

Increasing Impact

• Increase paper citations [1]

• Add dataset citation capabilities

Smoothing Dislocations

• [Distance] Enable simple sharing among collaborators (near and far)

• [Personnel] Ease transitions between students

• [Format] Lessen need for ad hoc resource sharing (e.g. via group websites)

Meeting Award Requirements

• Simplify DMP compliance

[1] Citation increase of 30% (10.7717/peerj.175) to 60% (10.1371/journal.pone.0000308) [caveat: bio research]

Lessons Learned

• The demand is there from researchers and institutions

• Lots of cross-over with centers and projects
§ (NIST) CHiMaD
§ (DOE) ElectroCat, MICCoM, JCESR, PRISMS, Argonne IT, Integrated Imaging Institute
§ (NSF) T2C2 [DIBBS], AMI-CFP (PIRE), HV/TMS (I/UCRC), BD Hubs, IMaD BD Spoke*

• Data heterogeneity is a challenge
§ Metadata is the major sticking point

• Friction points
§ Need more flexible data objects, e.g. {“temperature”: 100, “unit”: “K”}
§ Need file- or directory-based metadata
§ Immutable datasets alone are not enough → versioning
§ Data gathering in retrospect
§ Schema generation and interoperability
  § Working with and following developments at NIST, RDA, Citrine et al.
§ Differing institutional approval processes
§ Lack of programmatic interface (planned)

• Support for data interactivity and visualization

• Smart versioning for large file-based datasets
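The "flexible data object" friction point, a value that carries its own unit, might look like the following in code. This is a sketch only: the `Quantity` class, the supported units, and the normalization to kelvin are assumptions chosen to show why a structured object is easier to validate than a bare number.

```python
from dataclasses import dataclass

# Conversions to kelvin for the units this sketch accepts (an assumed, minimal set).
TO_KELVIN = {
    "K": lambda value: value,
    "C": lambda value: value + 273.15,
}


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str


def parse_temperature(obj):
    """Turn {'temperature': ..., 'unit': ...} into a Quantity normalized to kelvin."""
    unit = obj["unit"]
    if unit not in TO_KELVIN:
        raise ValueError(f"unsupported temperature unit: {unit!r}")
    return Quantity(TO_KELVIN[unit](float(obj["temperature"])), "K")


print(parse_temperature({"temperature": 100, "unit": "K"}))
print(parse_temperature({"temperature": 0, "unit": "C"}))
```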

Wider Data Community

• Curated and described datasets
• Well-posed problems
• Community to share analyses
• Challenges to start “sprints”

• Great APIs and clients
• Examples to get started
• Hundreds of video tutorials

Materials Project, OQMD

• Less inherently intuitive problems
• Sometimes need advanced compute capabilities
• Often many TB

Broader Trends

• Continuous integration, QA, and testing
• Containerized solutions, microservice architecture, abstracting software from hardware
• Automation
• Internet of Things (IoT) – connect everything
• Machine Learning / AI
• Natural Language Processing (Siri, chatbots or “slack”bots, etc.)
• Search rules the world – ok this was 20 years ago…

What are the analogs and applications in the materials community?

Experimentation Ahead

No team commitments here!

Open source opportunities, contact:

blaiszik@uchicago.edu

Use Case: Scenario Generator-Consumer

• Data generator
§ Generates data periodically (perhaps from an instrument)
§ Pushes data to a public channel
§ Schema is validated before inclusion in channel stream

• Data consumer
§ Polls channel periodically
§ Wants to pull datasets by property

[Diagram: a Data Generator pushes datasets into the dataset channel “MDF-composites”; a Data Consumer creates a query (q) against the channel and receives a result]
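The generator-consumer scenario can be mocked up in memory to make the moving parts concrete: a channel validates each pushed dataset against a required set of fields, and a consumer polls from a position in the stream and filters by property. Everything here (the `Channel` class, its methods, and the dataset fields) is hypothetical; the slide describes an experimental idea, not an existing API.

```python
class Channel:
    """In-memory stand-in for a dataset channel (hypothetical, for illustration)."""

    def __init__(self, name, required_fields):
        self.name = name
        self.required_fields = set(required_fields)
        self._datasets = []

    def push(self, dataset):
        """Validate against the channel schema, then append to the stream."""
        missing = self.required_fields - set(dataset)
        if missing:
            raise ValueError(f"dataset rejected, missing fields: {sorted(missing)}")
        self._datasets.append(dict(dataset))
        return len(self._datasets)  # stream position after this dataset

    def poll(self, since=0, **properties):
        """Return datasets pushed after position `since` that match all properties."""
        return [
            d for d in self._datasets[since:]
            if all(d.get(key) == value for key, value in properties.items())
        ]


channel = Channel("MDF-composites", required_fields=("title", "material"))

# Generator side: push datasets periodically (perhaps from an instrument).
channel.push({"title": "Run 1", "material": "carbon fiber"})
position = channel.push({"title": "Run 2", "material": "glass fiber"})
channel.push({"title": "Run 3", "material": "carbon fiber"})

# Consumer side: pull by property, then poll only for datasets newer than `position`.
carbon = channel.poll(material="carbon fiber")
newer = channel.poll(since=position)
print([d["title"] for d in carbon], [d["title"] for d in newer])
```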

Automated Data Aggregation (consumer)

Aggregate, Perform ML

• Combine a cloud-published dataset, scikit-learn, and pandas to predict steel fatigue and “reproduce” data from a journal publication

Aggregate, Perform ML, Visualize

• Combine a cloud-published dataset, scikit-learn, and pandas to predict steel fatigue and validate a journal publication

What’s Currently Available?

• Web interface to support data publication (public-facing APIs coming soon)

• 100s of TB of storage at NCSA (scalable to many PB), with more at Argonne (1.7 PB total on Petrel – not all for materials…)

• Help with developing metadata schemas to describe your research datasets

MDF Tutorial on GitHub

https://github.com/blaiszik/materials-data-facility-training

What are we looking for?

• Early adopters, willing to get their hands dirty with the service and give honest feedback

• Key integration points where metadata is picked up automatically!

• Key datasets and resources of all sizes, shapes, raw or derived, that might help us understand the process better

Thanks to Our Sponsors!

U.S. Department of Energy

Publication REST APIs Discovery

• Identify datasets with persistent identifiers (e.g. DOI)

• Describe datasets with appropriate metadata and provenance

• Verify dataset contents over time

• Handle big (and small) data: we have already ingested datasets with > 1.5M files and > 1 TB in size

• Search and query datasets in modern ways

• Index metadata and harvest file contents

• Simple user interfaces (i.e., after Google and Amazon)

Opened to external users in Mar. 2016

~6 TB of data published

Materialsdatafacility.org
