Transcript

Making small data BIG: Insights from a Long-tail Geoscience Domain

Kerstin Lehnert, lehnert@ldeo.columbia.edu

Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY, 10964

www.iedadata.org

Outline

• The (super-fast) Introduction to Geochemistry

• Achievements & Challenges in Geochemical Data Management

• Sustainable data infrastructure in the Long Tail

• EarthCube

Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain" 2

Geochemistry

• Puts real numbers on geologic times.

• Fingerprints sources of material involved in geological processes.

• Reveals the history of climate and the circulations of the atmosphere and ocean.

• Constrains theories of the Earth’s deep interior.

Geochemical Observations

• Hundreds of chemical properties of different Earth materials

  – elemental or oxide concentrations

  – isotopes and isotopic ratios

• Thermodynamic properties

• Kinetics

Geochemical Data Types

• Analytical (observational)

  – Sample-based measurements

  – Sensor data

• Experimental data

• Derived data (models)

• (Samples)

Materials & Samples

Geochemistry Methods

How a Geochemist Generates Data: “Did New Zealand Dust Influence the Last Ice Age?”

Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell)

http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/

Get Samples in the Field

Get Samples in the Lab/Repository

Analyze Samples in the Lab

The Data!

Note the number of data points generated in this study (the yellow dots) in light of the effort involved, from collecting samples in New Zealand to operating expensive equipment in the lab.

Data “Sharing”

Long-tail Research Data

• heterogeneous

• customized & optimized for research questions

• lack of data standards

• data sharing limited

• lack of data infrastructure (facilities)

The Value of Long-tail Data

“While the data volumes are small when viewed individually, in total they represent a very significant portion of the country’s scientific output.”

“The long tail is a breeding ground for new ideas and never before attempted science.”

(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)

BUT: Long-tail data have no value if they are not re-usable!

Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality

Published on February 27, 2012 by R "Ray" Wang

http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/

What Makes Data BIG?

Value

The sixth ‘V’:

Adding VALUE

[Diagram: small data become BIG DATA by being made]

• findable: identification, persistence

• accessible: authorization, protocols

• interoperable: harmonized, machine-readable

• re-usable: context, provenance

“… data have no value or meaning in isolation; they exist within a knowledge infrastructure — an ecology of people, practices, technologies, institutions, material objects, and relationships.”

C.L. Borgman

https://www.force11.org/group/fairgroup/fairprinciples

Generic Repositories, Domain Repositories

Domain-specific Data Facilities

[Diagram: a domain-specific data facility and the groups it connects]

• Groups: Science Community; Libraries, Archives; CI, Computer Science; Publishers, editors; Funding Agencies; Data Facilities; Registries

• Functions: Metadata registration; Software (tool) development; Interoperability; Data policies; Persistent access; Bibliometrics; Data Curation; Data access & discovery (optimized for domain); Data products (synthesis); Data harmonization (standards); User Support

AGU FM 2014: IN14B-01

Small Data Gone BIG

IEDA Repositories: >500,000 files, 47 TB, 4 x 10^6 samples

IEDA Syntheses: 19 x 10^6 analytical values in EarthChem; 2.63 x 10^6 miles of data from 808 cruises in the Global Multi-Resolution Topography (GMRT)

EarthChem: Big Data for Geochemistry

• EarthChem Library

  – DOI registration

  – Long-term archiving

  – CC license

  – Data templates & guidelines for data documentation

  – QC by data managers

• Synthesis Databases (PetDB, EarthChem Portal)

  – QA/QC by data managers

  – Data & metadata harmonization

  – Standards-compliant data model

  – Service Oriented Architecture (ECP)

EarthChem Data Systems

[Diagram labels: Investigators; EarthChem Library (Data Repository): Data, Metadata, Search]

DOI to allow proper citation

Link to publications

Link to funding source

Data Templates

ECL Challenges

• Metadata guidelines/templates for an increasing diversity of data

• Need extended metadata for meaningful searches

  – Geospatial

  – Variables

  – Sample name

• Integration with publication workflow
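
The extended metadata needed for meaningful searches (geospatial, variables, sample name) could be checked when a data set is submitted. The sketch below is illustrative only: the field names (`latitude`, `longitude`, `variables`, `sample_name`) and the function `missing_search_metadata` are assumptions made for this example, not the EarthChem Library's actual template or API, and the two records are invented.

```python
from typing import Any, Dict, List

# Hypothetical search-relevant fields, named after the three bullets above.
REQUIRED_FIELDS = ("latitude", "longitude", "variables", "sample_name")


def missing_search_metadata(record: Dict[str, Any]) -> List[str]:
    """Return the problems that would keep a submission from being searchable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "" or value == []:
            problems.append(f"missing: {field}")
    lat, lon = record.get("latitude"), record.get("longitude")
    # Range checks apply only when a numeric coordinate was supplied.
    if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
        problems.append("longitude out of range")
    return problems


# Invented example records, not real submissions.
complete = {"latitude": -43.5, "longitude": 170.1,
            "variables": ["SiO2", "87Sr/86Sr"], "sample_name": "EXAMPLE-01"}
incomplete = {"latitude": 95.0, "variables": [], "sample_name": "EXAMPLE-02"}

print(missing_search_metadata(complete))
print(missing_search_metadata(incomplete))
```

A check like this only flags gaps; deciding which fields a given data type must carry remains the job of the metadata guidelines and the data managers.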

Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)

• Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.

• Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.

• Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.

• Statement of Commitment signed by all major Earth & Space Science publishers

• Build an online community directory of appropriate Earth science community repositories for data, tools, and models that meet leading standards on curation, quality, and access

www.copdess.org

Presentation at EarthCube workshop “Scope & Vision”, March 2015

EarthChem Data Systems

[Diagram labels: Investigators; EarthChem Data Managers; EarthChem Library (Data Repository): Data, Metadata, Search; PetDB, SedDB, EarthChem Portal (Data Synthesis): Data & Metadata, DB, Search; file formats [.xls] and [XML]]

Example of success:

This study showed new relationships between noble gases and the elemental and isotope geochemistry of the deep mantle, with implications for mantle structure and evolution.

It was possible through a synthesis of the global data set, only because the scattered data were made available by the online databases PetDB and GEOROC.

This entire community now depends on this cyberinfrastructure.

The PetDB Database

Map shows locations of mafic volcanic rock samples. Symbol color is scaled to the 87Sr/86Sr isotope ratio of the rocks, illustrating the difference in composition of the Earth’s mantle beneath the Indian and Pacific Oceans.

Data are from >300 publications, retrieved from the PetDB database in ca. 2 minutes.

PetDB Concept: BIG Data

• Data Mining

  – Fine-grained data access: Database structure ‘disintegrates’ data sets into individual values

  – Context & provenance metadata to search and filter

  – Harmonized data: controlled vocabularies, data compilation & QC by data managers

• Data Integration

  – User-defined across data sets

  – By sample (use of unique sample ID)
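
The PetDB concept can be pictured with a small sketch: each data set is broken down into individual values that carry their own context and provenance, so they can be filtered (for example by method or concentration) and re-integrated by sample ID. This is a minimal illustration under stated assumptions, not PetDB's schema or code: the class `AnalyticalValue`, the functions `filter_values` and `integrate_by_sample`, and all sample IDs and numbers are invented placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class AnalyticalValue:
    """One individual value, carrying its own context and provenance."""
    sample_id: str   # unique sample identifier
    variable: str    # harmonized variable name (controlled vocabulary)
    value: float
    unit: str
    method: str      # analytical method
    source: str      # data set or publication the value came from


def filter_values(values: Iterable[AnalyticalValue], variable: str,
                  method: Optional[str] = None,
                  min_value: Optional[float] = None) -> List[AnalyticalValue]:
    """Search by variable, optionally narrowing by method or concentration."""
    hits = []
    for v in values:
        if v.variable != variable:
            continue
        if method is not None and v.method != method:
            continue
        if min_value is not None and v.value < min_value:
            continue
        hits.append(v)
    return hits


def integrate_by_sample(values: Iterable[AnalyticalValue]) -> Dict[str, Dict[str, float]]:
    """Merge values from different data sets into one row per sample ID."""
    rows: Dict[str, Dict[str, float]] = defaultdict(dict)
    for v in values:
        rows[v.sample_id][v.variable] = v.value
    return dict(rows)


# Placeholder values invented for illustration; not real measurements.
values = [
    AnalyticalValue("S1", "MgO", 8.0, "wt%", "XRF", "dataset-A"),
    AnalyticalValue("S1", "Sr", 120.0, "ppm", "ICP-MS", "dataset-B"),
    AnalyticalValue("S2", "MgO", 6.5, "wt%", "EMP", "dataset-A"),
]

print(filter_values(values, "MgO", method="XRF"))
print(integrate_by_sample(values))
```

Note how sample `S1` ends up with values from two different data sets in one row; that join is only possible because both data sets use the same unique sample ID.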

Data Mining: Search & Filter

Filter by method or concentration

PetDB Impact

• 500 - 800 downloads per quarter

• >550 citations in the literature

• many fundamental new discoveries & insights

• new scientific approaches

Meyzen et al., 2007, Isotopic portrayal of the Earth's upper mantle flow field. Nature 447, 1069

A. W. Hofmann: “Mantle Myths, Reservoirs, and Databases”, Goldschmidt Conf. 2008

Technical Challenges

• scalability/flexibility of database schema

  – accommodate new sample and data types (time series, non-numeric data, etc.)

  – track relationships among samples

  – diverse context for new sample and data types

  – track provenance of metadata

• performance of search application

• usability & functionality of search application

• interoperability interfaces

• data ingestion & quality control
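
One of the schema challenges listed above, tracking relationships among samples, can be pictured as parent/child links between a sample and the subsamples derived from it. The sketch below is a minimal illustration under that assumption; the class `SampleGraph` and the sample names are hypothetical and do not describe the PetDB or ODM2 schema.

```python
from typing import Dict, List, Optional


class SampleGraph:
    """Tracks which sample each subsample was derived from."""

    def __init__(self) -> None:
        self._parent: Dict[str, Optional[str]] = {}

    def add(self, sample_id: str, parent_id: Optional[str] = None) -> None:
        """Register a sample; a parent must already be registered."""
        if sample_id in self._parent:
            raise ValueError(f"sample already registered: {sample_id}")
        if parent_id is not None and parent_id not in self._parent:
            raise KeyError(f"unknown parent sample: {parent_id}")
        self._parent[sample_id] = parent_id

    def lineage(self, sample_id: str) -> List[str]:
        """Return the chain from the sample up to its original parent."""
        chain = [sample_id]
        while self._parent[chain[-1]] is not None:
            chain.append(self._parent[chain[-1]])
        return chain

    def children(self, sample_id: str) -> List[str]:
        """Return the samples derived directly from the given sample."""
        return sorted(s for s, p in self._parent.items() if p == sample_id)


# Invented sample names for illustration.
graph = SampleGraph()
graph.add("CORE-1")
graph.add("CORE-1-SEC-2", parent_id="CORE-1")
graph.add("CORE-1-SEC-2-OLIVINE", parent_id="CORE-1-SEC-2")
print(graph.lineage("CORE-1-SEC-2-OLIVINE"))
```

Because a parent must exist before its child can be added and no sample can be registered twice, the links cannot form a cycle, so `lineage` always terminates.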

ODM2

ODM2 Team: J S Horsburgh, A K Aufdenkampe, L Hsu, A Jones, K Lehnert, E Mayorga, L Song, D Tarboton, I Zaslavsky

Challenges:

• migration of db content

• new user interface

• new data entry & QA/QC tools

• resources

ODM2 Problem

from: http://techdistrict.kirkk.com/2009/10/07/the-usereuse-paradox/

“In general, the more reusable we choose to make a software module, the more difficult that same software module is to use.”

New User Interface (under development)

Challenge: User Expectations

C.H. Langmuir (Harvard): “Geochemical Databases: What is needed now?” Presentation at EarthCube Domain End-user Workshop for Petrology & Geochemistry, March 2013

Access to Samples is a Community Concern

• Poor and uneven access and management of sample collections

• Incomplete sample tracking and linking of samples to analyses in the literature and databases

• Poor discoverability of existing samples

• Insufficient or uneven sample density through space and time for most geological terrains of interest

From Executive Summary of EarthCube Domain End-user Workshop Petrology & Geochemistry 2013

EarthCube Domain End-user Workshop for Petrology & Geochemistry at the National Museum of Natural History, Smithsonian Institution, March 2013

The Internet of Samples

• Central or federated online catalogs for discovery & access of samples.

• Best practices for sample identification, documentation, and citation.

• Software tools that support personal or institutional sample management & curation.

(And facilities to provide access to curated samples!)

IGSN: International GeoSample Number

• persistent unique identifier for physical objects in the Earth Sciences; centralized control mechanism via IGSN e.V.

• resolves to virtual sample representations (sample metadata profiles) managed at federated IGSN Allocating Agents.

Use of the IGSN

IGSNs in data table resolve to sample metadata in IGSN registry
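
How IGSNs in a data table lead back to sample metadata can be sketched as a simple lookup. This is an illustration only: the dictionary `REGISTRY` stands in for the IGSN registry, the functions `resolve` and `annotate_table` are hypothetical names, the identifiers and metadata are invented, and treating identifiers as case-insensitive is an assumption of this sketch. It does not call or describe any real IGSN resolver service.

```python
from typing import Dict, List, Optional

# Stand-in for an IGSN registry: identifier -> sample metadata profile.
# Identifiers and metadata are invented for illustration.
REGISTRY: Dict[str, Dict[str, str]] = {
    "XXX000001": {"name": "example-basalt-1", "material": "rock"},
    "XXX000002": {"name": "example-sediment-1", "material": "sediment"},
}


def resolve(igsn: str) -> Optional[Dict[str, str]]:
    """Look up the metadata profile for one IGSN; None if unregistered."""
    # Assumption of this sketch: identifiers are compared case-insensitively.
    return REGISTRY.get(igsn.strip().upper())


def annotate_table(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Attach resolved sample metadata to each row of a data table."""
    annotated = []
    for row in rows:
        merged = dict(row)
        merged["sample"] = resolve(str(row["igsn"]))
        annotated.append(merged)
    return annotated


# Invented data table; the last identifier is deliberately unregistered.
table = [
    {"igsn": "XXX000001", "variable": "SiO2", "value": 50.0},
    {"igsn": "xxx000002", "variable": "CaCO3", "value": 12.0},
    {"igsn": "XXX999999", "variable": "SiO2", "value": 48.0},
]
for row in annotate_table(table):
    print(row["igsn"], "->", row["sample"])
```

Rows whose identifier is not registered keep `None` as their sample metadata, which makes missing links between analyses and samples visible instead of silently dropping them.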

SESAR (www.geosamples.org)

System for Earth Sample Registration

• Allocating Agent for individual investigators, sample repositories, and science programs

• tools and services for users to catalog and manage sample metadata (MySESAR)

  – personal (authenticated) workspace

  – metadata template creator

  – label creation & printing (including QR code)

  – transfer of sample ownership

• web services for client systems

  – register sample metadata & obtain IGSNs

  – access to IGSN metadata

• preservation & persistent access of sample metadata

• Global Sample Catalog (harvest metadata from other AAs)
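
The services listed above (register sample metadata and obtain an identifier, access the metadata later, transfer sample ownership) can be pictured with a toy in-memory registry. Everything in this sketch is an assumption made for illustration: the class `ToyAllocatingAgent`, its method names, and the identifier format (a namespace prefix plus a zero-padded counter) do not describe SESAR's actual web-service interface or the real IGSN syntax.

```python
from typing import Dict


class ToyAllocatingAgent:
    """In-memory stand-in for a sample registration service (hypothetical)."""

    def __init__(self, namespace: str) -> None:
        self.namespace = namespace.upper()
        self._counter = 0
        self._profiles: Dict[str, Dict[str, str]] = {}

    def register(self, metadata: Dict[str, str]) -> str:
        """Store a sample metadata profile and return a new unique identifier."""
        if not metadata.get("name"):
            raise ValueError("sample metadata needs at least a name")
        self._counter += 1
        identifier = f"{self.namespace}{self._counter:06d}"
        self._profiles[identifier] = dict(metadata)
        return identifier

    def get_metadata(self, identifier: str) -> Dict[str, str]:
        """Return a copy of the stored profile for an identifier."""
        return dict(self._profiles[identifier])

    def transfer_ownership(self, identifier: str, new_owner: str) -> None:
        """Reassign the sample to another owner; the identifier stays the same."""
        self._profiles[identifier]["owner"] = new_owner


# Invented names for illustration.
agent = ToyAllocatingAgent("abc")
first = agent.register({"name": "field-sample-1", "owner": "investigator-a"})
second = agent.register({"name": "field-sample-2", "owner": "investigator-a"})
agent.transfer_ownership(first, "repository-b")
print(first, agent.get_metadata(first))
```

The point of the design is that the identifier is assigned once and never changes, even when ownership or other metadata are updated afterwards.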

Challenges:

• scalability of architecture for a rapidly growing number of registrations

• service-oriented architecture

• handle registrations

• software tools that support investigators with metadata capture in the field & lab

• flexibility for user specific metadata & new sample types

• inclusion of sample images (storage!)

[Diagram labels: Institutions (Collection Mgmt); Public ‘Virtual Museum’; Investigators (Sample Mgmt); (storage, software solutions, & services); Visualization; Publications; Data Systems; Sample Registries; APIs; GUIs]

Internet of Samples Initiatives

• CODATA Task Group “Physical Samples in the Digital Era”

• SciColl: Scientific Collections International (Consortium)

• iSamples (Internet of Samples in the Earth Sciences)

  – Funded EarthCube Research Coordination Network (RCN)

  – advance access and re-use of physical samples through use of innovative cyberinfrastructure

• DESC: Digital Environment for Sample Curation

• IGSN e.V.

• National Data Services test-bed

DATA FACILITIES FOR THE LONG TAIL

Scalability, Sustainability

Many Earth Science Data Communities

• Atmospheric Chemistry

• Climate & Large Scale Dynamics

• Paleo-Climate

• Meteorology

• Aeronomy

• Space Weather

• Magnetospheric Physics

• Solar Terrestrial

• Igneous Petrology & Volcanology

• Geo Ed & Workforce Training

• NCAR

• Geophysics & Geodynamics

• Geobiology & Paleontology

• Cryosphere & Ice Dynamics

• Critical Zone & Soil Science

• Chemical Oceanography

• Geomorphology

• Hydrology

• Sedimentology & Stratigraphy

• Marine Geophysics

• Physical Oceanography

• Marine Geology

• Biological Oceanography

• Ocean Education

• Ocean Drilling & Engineering

• Software & Modeling

• Bioinformatics

• Ecosystems

• Biology

• High Perf Computing

• Semantics & Ontologies

• Algorithms & Data Mining

• EarthCube CI

• Solid and Aqueous Geochemistry

IEDA: A “Long-Tail” Data Facility

www.iedadata.org

• Multiple core disciplines (focus: solid earth)

  – High-T Geochemistry

  – Low-T Geochemistry

  – Petrology

  – Marine Geophysics & Geology

  – Geochronology

• Cross-disciplinary tools & services

  – Sample registry SESAR

  – IEDA Data Browser

  – Portals (GeoPRISMS, USAP-DCC, etc.)

  – GeoMapApp

  – Data management support

From Research Data Collections to Data Facility

Formal Governance

Robust Infrastructure

Stable Expert Team

Accreditation

Adherence to Community Standards

Scalable Infrastructure

The ALLIANCE Model

Alliance Development

Proposal “Interdisciplinary Earth Data Alliance as a Model for Integrating EC Technology Resources and Engaging the Broad Community” submitted March 2015

MetPetDB

Mineral Physics

Deep Submergence

IcePod

Challenges:

• Social & organizational engineering

• Diversity of data needs

• Diversity of systems

• Business models

Conclusions

• Long-tail data can grow BIG through domain-specific data curation.

• Partnerships among data efforts can provide a solution for sustainability of data infrastructure in long-tail communities.

• Partnerships with the computer and information sciences are necessary to build the cyberinfrastructure.

EarthCube Motivations

To transform geosciences research by supporting community-driven cyberinfrastructure to integrate data and information.

Tech. Drivers

• Supports science and other User Needs

• Create a dynamic, community-driven cyberinfrastructure

• Open, evolvable, sustainable

• Easy interface with existing capabilities

Challenges

• Diversity of the geosciences

• Interdisciplinary Science Questions

• Big, Heterogeneous Data issues

• Communities that are poorly served/have no community resources

Towards an Architecture for EarthCube

• Under purview of the EarthCube Technology and Architecture Committee (TAC)

– Coordinating with Council of Data Facilities, Science Committee, and Liaison Team

• Ongoing Working Groups (since Fall 2014):

– Architecture WG

– Standards WG

– Use Cases WG

– Funded Projects and Gap Analysis WG

– Testbed WG

[Figure: EarthCube Funded Projects (2013 and 2014 Awards), grouped as Building Blocks, Architecture, Governance, and Research Coordination Networks]

TAC Workshop (ongoing now)

Learn more at:

http://earthcube.org/group/technology-architecture-committee

http://earthcube.org/document/2014/earthcube-past-present-future
