1 NASA Earth Exchange: Improving access to large-scale data and computational infrastructure ACCESS-11-0034 Annual Review August 20, 2013 Ramakrishna Nemani, Petr Votava, Andrew Michaelis, Hirofumi Hashimoto, Forrest Melton

NASA Earth Exchange: Improving access to large-scale data and computational infrastructure

ACCESS-11-0034 Annual Review
August 20, 2013

Ramakrishna Nemani, Petr Votava, Andrew Michaelis, Hirofumi Hashimoto, Forrest Melton


Background: NASA Earth Exchange

Vision: To engage and enable the Earth science community to address global Earth science challenges.

NEX is a collaborative compute platform that improves the availability of Earth science data, models, analysis tools and scientific results through a centralized environment that fosters knowledge sharing, collaboration, innovation and direct access to compute resources.

• Engage:
- Network, share and collaborate
- Discuss and formulate new ideas
- Portal, Virtual Institute

• Enable:
- Access to data
- Access to computing
- Access to knowledge


NEX Infrastructure View


Outline

• Project background
• Updated quad chart
• Review of schedule and milestones
• Description of work accomplished and results
• Technical reports and presentations
• Discussion of next 6 month activity
• Schedule and budget summary


Project Background

• The main focus of the project is on supporting the NEX community by continuously improving access to data, tools, computing and knowledge.

• By improving the above, we can engage more users and teams and provide them with better and faster support
- Need to be able to respond quickly to new requirements
- Focus on knowledge acquisition and access

• We can also help our users to significantly scale their projects


NASA Earth Exchange: Improving access to large-scale data and computational infrastructure

PI: Ramakrishna Nemani Ph.D., NASA Ames Research Center

Goals and Objectives

• Enhance access, discovery and integration of data, models and services for the NEX communities
• Provide integrated system view of NEX data, metadata, processing libraries, models and QA information
• Provide API and client libraries to NEX tools, datasets and search capabilities
• Provide streamlined way for researchers to share their results with the community

Approach

• Inventory current NEX datasets, tools and models and engage the community in gathering requirements and use cases.
• Design a common database schema for existing NEX datasets.
• Develop API that facilitates search and access to data, tools and models and use it to implement client libraries
• Develop migration and dissemination tools for NEX users

Co-Is/Partners

• Co-I: Petr Votava, Andrew Michaelis, Dr. Hirofumi Hashimoto, Forrest Melton/CSU Monterey Bay

Architecture Overview

Key Milestones

• Preliminaries completed 07/2012
• Data integration completed 11/2012
• Process integration completed 01/2013
• System interface completed 08/2013
• Migration tools completed 01/2014
• Client libraries and tools completed 02/2014

TRLin = 6

08/13


Project Schedule


Project Goal

To enhance access, discovery and integration of data, models and tools for the NEX communities.


Objectives for Activity During Review Period

• Complete inventory of current NEX data, metadata, tools and libraries

• Engage NEX users to gather additional data and tools requirements

• Complete initial data integration with the key NEX datasets and the existing infrastructure

• Continue rapid prototyping of database access tools based on user requirements

• Continue integration of utilities and tools with the NEX system

• Prototype integration with NEX semantic infrastructure.


Project Drivers = Why

1. To directly support large-scale NASA projects such as WELD, NAFD, NCA, MEASURES, CMS, CMAC and projects in applied sciences

2. Efficiently support the fast-growing NEX community both inside and outside of NASA
- Earth science research is a global undertaking and we aim to engage the largest possible community
- Large global collaboratory
  • Global knowledge pool
  • Need critical mass -> everybody benefits
- Support for large-scale science while engaging large community

3. Place for community contributions and access to these contributions:
- Knowledge, tools, data, workflows, …


NEX User and Project Evolution

• Number of active compute/data users at the beginning of this ACCESS project: less than 50

• Current number of active compute/data users: 158

• Largest data requirements at the beginning of this ACCESS project: 10s of TB (per project)

• Current data requirements: 100s of TB – 1PB+ (per project)

• On the NEX portal – currently 404 users and 1,252 projects (not all active)


ACCESS Project Overview

• Data: Provide integrated view of NEX data and metadata through API, command-line tools and query services.

• Tools: Provide mechanism to discover and manage environments for tools and utilities required by different projects and provide APIs.

• Knowledge: Cross-reference and provide access to information about datasets, tools, users, projects, publications and other docs.

• Dissemination: Establish process, policies and infrastructure for dissemination of data produced on NEX.

• Infrastructure: Components and solutions that enable the above within security and policy constraints.


Data Organization

• Started with inventory
• Currently over 450TB on-line and 500+TB near-line
• Feedback from summer school 2012 users, summer interns in 2013 and NEX users and PIs
• Two rounds of “Query Requirements” with the NEX science team
• Two-to-three tier system
- Primary on-line fast storage, secondary on-line cache, near-line tape accessed through DMF


Query Categories and Requirements

• “Standard” queries
- Temporal, spatial, match region by name, what data are available, …

• Data provenance
- How was data produced (process/workflow)?
- What were the inputs into the process?
- Who created this dataset?

• Knowledge queries
- Which projects work with dataset X? In what geographic region?
- Which publications are relevant to dataset X?

• Administration queries
- How often is the dataset updated? From where?

• Analytics queries (not addressed by this project)
- Filter based on internal QA, Landcover or statistics
- Large number of requests for these capabilities
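
The “standard” temporal and spatial queries listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the `Granule` record, its field names and the sample values are assumptions made for the example, not the NEX schema or API, and the bounding-box test ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Granule:
    """Hypothetical metadata record; field names are assumptions."""
    name: str
    acquired: date
    bbox: tuple  # (west, south, east, north) in degrees


def boxes_intersect(a, b):
    """True when two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def standard_query(granules, start, end, region):
    """Temporal + spatial filter: acquired in [start, end] and overlapping region."""
    return [g for g in granules
            if start <= g.acquired <= end and boxes_intersect(g.bbox, region)]


# Made-up sample records for illustration.
CATALOG = [
    Granule("scene_a", date(2010, 4, 3), (-124.0, 36.0, -121.0, 38.0)),
    Granule("scene_b", date(2010, 10, 9), (-124.0, 36.0, -121.0, 38.0)),
    Granule("scene_c", date(2010, 4, 20), (-70.0, -5.0, -60.0, 0.0)),
]

hits = standard_query(CATALOG, date(2010, 4, 1), date(2010, 4, 30),
                      (-123.0, 37.0, -122.0, 37.5))
print([g.name for g in hits])
```

In a relational store the same filter would normally be pushed into the database as a date-range and bounding-box predicate rather than evaluated in Python.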


Data Organization Details

• Keep metadata in the original format/naming conventions
- Researchers are used to the metadata names
- At times extensive documentation exists to describe the metadata

• Metadata are processed by custom parsers
- Different for different sensors (MODIS, Landsat, NAIP, …)

• Each dataset is stored in a separate set of tables, and when it is added to NEX a custom plug-in is written
- Overrides abstract methods from the DB class
- This is manageable, because the number of datasets is not that large (a few dozen at most) and writing generic code while maintaining the original metadata would take longer
- We are experimenting with a semantic layer that describes and maps terms in different DBs to a common taxonomy, but it requires dynamic query rewriting and its suitability for this problem is questionable.
- The best solution in this case seems to be either fully relational (current) or fully graph-based (future). The implementation needs to be hidden behind an API; however, users at times want access to a full RDBMS, in which case maintaining two consistent copies seems the best answer.
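
The per-dataset plug-in pattern described above might look roughly like the following. It is a minimal sketch under stated assumptions: `DatasetDB`, `ModisPlugin`, the table name and the column names are hypothetical, and the file-name parser stands in for a real sensor-specific metadata parser. Only the overall shape (a base class with abstract methods that each dataset overrides) comes from the slide.

```python
import re
from abc import ABC, abstractmethod


class DatasetDB(ABC):
    """Hypothetical base class standing in for the 'DB class' named above."""

    @abstractmethod
    def table_name(self):
        """Name of the table that holds this dataset's metadata."""

    @abstractmethod
    def parse_metadata(self, filename):
        """Return a dict of metadata for one file, using sensor-specific rules."""

    def insert_statement(self, filename):
        """Shared logic: build a parameterized INSERT from the parsed metadata."""
        meta = self.parse_metadata(filename)
        columns = ", ".join(meta)
        placeholders = ", ".join(["%s"] * len(meta))
        sql = f"INSERT INTO {self.table_name()} ({columns}) VALUES ({placeholders})"
        return sql, tuple(meta.values())


class ModisPlugin(DatasetDB):
    """Example plug-in for MODIS-style names: PRODUCT.AYYYYDDD.hHHvVV...."""

    PATTERN = re.compile(
        r"^(?P<product>[A-Z0-9]+)\.A(?P<year>\d{4})(?P<doy>\d{3})"
        r"\.h(?P<h>\d{2})v(?P<v>\d{2})\."
    )

    def table_name(self):
        return "modis_granules"  # assumed table name

    def parse_metadata(self, filename):
        match = self.PATTERN.match(filename)
        if match is None:
            raise ValueError(f"not a MODIS-style file name: {filename}")
        return match.groupdict()


sql, values = ModisPlugin().insert_statement(
    "MOD13Q1.A2010097.h08v05.005.2010116031311.hdf")
print(sql)
print(values)
```

Adding a new sensor then means writing one more subclass, which matches the argument that a few dozen plug-ins are cheaper than one fully generic loader.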


Tools/Utilities/Models

• Tried a number of approaches
- Users often want custom solutions with specific library/tool versions
- Management of this quickly gets complicated

• Using “modules” infrastructure to provide custom environments for NEX teams
- We can easily mix and match versions as per team’s requirements
- Also good for easy reproduction/packaging of environments
- Will be basis for tool contribution setup (nex/contrib)

• Access to almost all tools through a Python API or through regular command-line invocation
- Great for integration with VisTrails workflow management system

• Mechanism to query a list of modules to be built or request a new module to be built.

• Working on adding better search and documentation capabilities
- Also, exposing documentation externally on the NEX portal
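
As a rough illustration of mixing and matching versions per team, the sketch below merges site-wide defaults with a team's pinned versions and renders the result as `module load name/version` lines, the usual Environment Modules syntax. The tool names, version numbers and team keys are made up for the example and do not describe the actual NEX module repository.

```python
# Hypothetical defaults and per-team pins; names and versions are made up.
DEFAULT_VERSIONS = {"gdal": "1.9.2", "python": "2.7.3", "netcdf": "4.1.3"}
TEAM_OVERRIDES = {
    "weld": {"gdal": "1.8.1"},
    "nafd": {"python": "2.6.8", "netcdf": "4.2"},
}


def resolve_environment(team):
    """Merge the defaults with a team's pinned versions (pins win)."""
    versions = dict(DEFAULT_VERSIONS)
    versions.update(TEAM_OVERRIDES.get(team, {}))
    return versions


def module_commands(team):
    """Render the environment as 'module load name/version' lines."""
    env = resolve_environment(team)
    return [f"module load {name}/{version}" for name, version in sorted(env.items())]


for line in module_commands("weld"):
    print(line)
```

Keeping the pins in one small table is also what makes an environment easy to reproduce or package later.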


Knowledge Organization

• Internal NEX Knowledge graph
- Spans data, content, web portal, tools
- Provenance

• RDF/OWL representation
- Triple and quad-store (MySQL and Virtuoso)

• Knowledge Acquisition
- Manual = Documentation, blogs etc. (internal and external)
- Automatic = entity extraction from text and metadata using natural language processing
  • Location, datasets used by project, sensors
  • Build relationships

• Improves search – who is doing what where
- Who is doing work in Amazon, what sensors are they using? What are the most frequent sensors used by NEX projects?

• Can generate project concepts, so that projects can be easily related to each other (LSI)
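
To make the automatic acquisition step concrete, here is a deliberately simple sketch that finds known sensor and location names in a project description and turns them into subject-predicate-object relationships. A dictionary lookup is swapped in for real natural language processing, and the gazetteer contents, predicate names and project identifier are all assumptions for illustration.

```python
import re

# Tiny hypothetical gazetteers; a real system would use NLP models and GCMD terms.
GAZETTEER = {
    "sensor": ["MODIS", "Landsat", "NAIP"],
    "location": ["Amazon", "California"],
}

# Assumed predicate names for each entity type.
PREDICATES = {"sensor": "usesSensor", "location": "studiesRegion"}


def extract_entities(text):
    """Return {entity_type: sorted terms found in text} using whole-word matches."""
    found = {}
    for kind, terms in GAZETTEER.items():
        hits = [t for t in terms
                if re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)]
        if hits:
            found[kind] = sorted(hits)
    return found


def to_triples(project, entities):
    """Turn extracted entities into (subject, predicate, object) relationships."""
    return [(project, PREDICATES[kind], term)
            for kind, terms in sorted(entities.items()) for term in terms]


description = "We map forest loss in the Amazon using Landsat and MODIS time series."
triples = to_triples("project:42", extract_entities(description))
for triple in triples:
    print(triple)
```

The resulting triples are the kind of relationship a triple store could hold and later query.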


Relating Entities

[Figure: entity-relationship diagram centered on the NEX Graph Data Store. Entities are extracted from NEX projects, wikis, … (NEX web portal) and from publications (NEX web portal, Harvard database, …) and are linked to GCMD concepts or to a NEX extension that defines additional concepts outside the GCMD hierarchy (data hierarchy, …). The store also records provenance from running processes, links to resources and to external docs (LP DAAC, …), and answers queries.]


Example queries

• What is the provenance of file X?
• What is the bounding box of region R?
• How often is each NASA instrument used in NEX projects, sorted by number of projects?
• What instruments are used by projects doing research in the Amazon?
• What are the most cited datasets in the remote sensing publications?
• Now that the NEX portal has been migrated to NAS, we can integrate this information with the portal much more easily.
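
Two of the example queries can be expressed as triple-pattern matches. The sketch below uses an in-memory list of triples instead of a real RDF store, and every identifier, predicate and data value in it is made up for illustration.

```python
from collections import Counter

# Made-up triples for illustration; predicates and identifiers are assumptions.
TRIPLES = [
    ("project:1", "usesSensor", "MODIS"),
    ("project:1", "studiesRegion", "Amazon"),
    ("project:2", "usesSensor", "Landsat"),
    ("project:2", "usesSensor", "MODIS"),
    ("project:2", "studiesRegion", "California"),
    ("project:3", "usesSensor", "Landsat"),
    ("project:3", "usesSensor", "MODIS"),
    ("project:3", "studiesRegion", "Amazon"),
]


def match(triples, s=None, p=None, o=None):
    """Triple-pattern match; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def sensors_in_region(triples, region):
    """Which instruments are used by projects working in a region?"""
    projects = {s for s, _, _ in match(triples, p="studiesRegion", o=region)}
    return sorted({o for s, _, o in match(triples, p="usesSensor") if s in projects})


def sensor_usage(triples):
    """Instrument usage sorted by number of projects, most used first."""
    counts = Counter(o for _, _, o in set(match(triples, p="usesSensor")))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(sensors_in_region(TRIPLES, "Amazon"))
print(sensor_usage(TRIPLES))
```

Against an RDF store the same questions would typically be written as SPARQL queries; the pattern-matching idea is the same.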


Data Dissemination

• A number of facets
- Large-scale data distribution (CMIP-5 for NCA)
- Web-services application support (SIMS)
- Open Access – Amazon

• Focus not only on the mechanics and implementation, but also on the development of protocols and policies
- Often more time-consuming than implementation


CMIP-5 Dissemination

• Downscaled climate dataset produced on NEX (17TB)
- Important and highly requested by the community

• First process for NEX data -> NASA distribution facility
- Established DOI minting capabilities (through UC Digital Library)
  • http://dx.doi.org/10.7292/W0WD3XH4
- Established a technique for DOI dataset verification through checksums without extensive web services, even when the underlying naming changes.

• Data available at:
- http://dataserver.nccs.nasa.gov/thredds/idd/bypass.html
- And internally on NEX
- Data had to be aggregated and reformatted for use by NCCS
  • This raises the issue of verification against the original datasets, as well as the fact that there are effectively two copies of the data in different formats

• Required extensive work with users + many lessons learned = updated protocol with NCCS, but the protocol will differ between facilities
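
One way to verify a dataset by checksum regardless of file naming is to compare the set of content hashes on disk with the set that was published. The sketch below illustrates that idea only; it is not the technique used for the NEX DOI, and the file names, payloads and choice of SHA-256 are assumptions.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_by_content(directory, published_checksums):
    """Compare checksums of files on disk with a published set, ignoring names.

    Returns (missing, unexpected): checksums published but not found, and
    checksums found on disk that were never published.
    """
    found = {sha256_of(os.path.join(directory, name))
             for name in os.listdir(directory)
             if os.path.isfile(os.path.join(directory, name))}
    published = set(published_checksums)
    return published - found, found - published


# Demonstration with throwaway files (not real NetCDF): rename one file and
# confirm that the dataset still verifies against the original manifest.
with tempfile.TemporaryDirectory() as workdir:
    for name, payload in [("tasmax_v1.nc", b"alpha"), ("pr_v1.nc", b"beta")]:
        with open(os.path.join(workdir, name), "wb") as handle:
            handle.write(payload)
    manifest = [sha256_of(os.path.join(workdir, n)) for n in os.listdir(workdir)]
    os.rename(os.path.join(workdir, "pr_v1.nc"), os.path.join(workdir, "precip.nc"))
    missing, unexpected = verify_by_content(workdir, manifest)
    print(missing, unexpected)
```

Note that this only covers renames. Once data are aggregated or reformatted, as happened for NCCS, the bytes change and a content hash of the files no longer matches.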


NASA Satellite Irrigation Management Support (SIMS)

• ACCESS software infrastructure directly supports the SIMS project (NASA Applied Sciences)
- Builds partially on efforts from the last ACCESS project
- Provides access to near-real-time Landsat data time-series through a data cube interface
- The goal of the SIMS project is to develop new information products from satellite data to support growers in optimizing irrigation
  • Currently tested by 12 partner growers
- Data visualization and queries via web services built on OPeNDAP
- Both web-based and mobile interfaces
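
The idea of a data cube interface for time series can be illustrated with a small in-memory example: layers are stacked along a time axis and a query pulls one pixel's values across dates. The `DataCube` class, its method names and the numbers below are hypothetical and only sketch the concept; they do not describe the SIMS interface or its OPeNDAP services.

```python
from datetime import date


class DataCube:
    """Hypothetical in-memory cube indexed as values[time][row][col]."""

    def __init__(self, dates, values):
        if len(dates) != len(values):
            raise ValueError("one 2-D layer is required per date")
        self.dates = list(dates)
        self.values = values

    def time_series(self, row, col, start=None, end=None):
        """Return [(date, value)] for one pixel, optionally clipped to a date range."""
        return [(d, layer[row][col])
                for d, layer in zip(self.dates, self.values)
                if (start is None or d >= start) and (end is None or d <= end)]


# Made-up values standing in for a vegetation index on a 2 x 2 grid.
cube = DataCube(
    dates=[date(2013, 6, 1), date(2013, 6, 17), date(2013, 7, 3)],
    values=[
        [[0.21, 0.25], [0.30, 0.28]],
        [[0.35, 0.40], [0.45, 0.41]],
        [[0.52, 0.58], [0.61, 0.57]],
    ],
)

series = cube.time_series(row=1, col=0, start=date(2013, 6, 10))
print(series)
```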


[Figure panels: crop condition / % cover, crop coefficient, crop water requirement]

An example of the SIMS web / mobile data interface, which is designed to enhance grower access to satellite-derived measures of crop condition and crop water requirements across 3.7 million ha of irrigated land in California.


Amazon Web Services Space Act Agreement

• Prototype process for providing access to NEX data through public cloud facilities
- Open access to data and workflows
  • We are reaching capacity on NEX and have restrictions on access
- Different cost model – billing for computing is under users’ control
- We can add complete Virtual Machines with packaged environments and workflows developed and managed on NEX and accessible through the NEX web portal
- Prototyping effort includes
  • NCA-related activities
    - NCA downscaled data (CMIP-5)
    - NEX portal linked with Amazon Web Services (open) or internal (NEX-members only) NEX work environment


Infrastructure

• Database setup
- Access to database systems from all NEX components
- Mostly MySQL-based, experimenting with Virtuoso, Neo4j and revisiting MongoDB

• Supercomputing setup
- Work with NAS system group to enable access even from within the Pleiades supercomputer
- Needed for easier streaming of provenance information

• Applications support
- Separate OPeNDAP, THREDDS and FTP servers

• Security considerations
- Moderate system = 2-factor authentication required
- Waiver for NEX portal for OpenID and NDC users
- One of the drivers for testing public cloud solutions to improve access


Immediate Benefits for Many NEX Projects (Examples)

• Web-Enabled Landsat Data (WELD)
- Acquisition, organization and access to data and processing capabilities for monthly Landsat vegetation composites – 800+TB total data requirements

• North America Forest Dynamics (NAFD)
- Acquisition, organization and access to data, QA, metadata and processing capabilities for Landsat (80TB)

• BIOCLIM
- Acquisition and organization of global MODIS land and atmospheric products including swath mapping to acquisition regions (15 TB).


Creating Global Monthly Landsat Composites, 1999 - Present

Web Enabled Landsat Data: Going Global, Roy et al.

Takes over 10,000 scenes each month using the WELD system

[Figure: global composites for April 2010 and October 2010]


North American Forest Disturbance (NAFD, Goward et al.)

• Expanding from 23 samples to wall-to-wall coverage
• Processing 96,000 scenes from 1985-2010 on NEX


NEX Software View - Current


NEX Software View – Overall Goal

NASA Cloud/AWS/OpenStack implementation/…

Currently prototyping


Instantiate


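Instantiate

[Figure: instance launch screen with Snapshots and Instance Type selections, a START control, and a Status Monitor that reports Booting, IP Ready, … and finally INSTANCE READY]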


Summary of Activity During Review Period (1)

• Inventory of NEX tools and datasets.
- Started with 25 existing datasets on NEX comprising about 300TB of data.
- Worked with NEX users to better understand:
  • How they use the data, metadata and QA information
  • Which tools and utilities they are using the most and what functionality is missing from the existing tools and utilities.

• We have prototyped the database access for a number of use cases, and some parts of it are already being used by NEX science teams.

• As the science teams have developed a highly sought-after downscaled climate dataset, we have prototyped a process through which the data are distributed by NASA’s NCCS facility.


Summary of Activity During Review Period (2)

• Set up an initial NEX-wide repository based on the “module” utilities that enables us to customize environments for specific users’ needs in terms of tool/software versions and dependencies.

• Started to integrate some of the tools and utilities for data manipulation with the NEX semantic infrastructure, and prototyped an end-to-end process of semantic data and process integration with MODIS climatology processes that also includes provenance capture.

• Worked closely with several NEX projects to establish the initial NEX database and tools API, which is currently in use mainly for access to Landsat and MODIS data and metadata for both gridded and swath datasets.
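
Provenance capture around a processing step can be sketched with a decorator that records what ran, with which inputs, what it produced, who ran it and when. This is an assumption-laden illustration rather than the NEX mechanism: `record_provenance`, the record fields and the in-memory log are hypothetical, and a real system would stream the records to a provenance store.

```python
import functools
import os
from datetime import datetime, timezone

PROVENANCE_LOG = []  # Stand-in for a provenance store.


def record_provenance(func):
    """Decorator that logs process name, inputs, output, user and time per call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "process": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "user": os.environ.get("USER", "unknown"),
            "finished": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@record_provenance
def monthly_mean(values):
    """Toy climatology step: the mean of a list of numbers."""
    return sum(values) / len(values)


monthly_mean([1.0, 2.0, 3.0])
latest = PROVENANCE_LOG[-1]
print(latest["process"], latest["inputs"], latest["output"])
```

Records of this shape answer the provenance questions raised earlier in the deck: how the data were produced, what the inputs were, and who created the result.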


Summary of Activity During Review Period (3)

• Added a new metadata collection capability for some datasets that enables us to better estimate future data requirements as well as provide users with additional information, mainly for QA screening purposes.

• Prototyped an automated process through which users can submit requests for data, tools and models to be included on NEX using PivotalTracker.


Papers and Presentations

“NASA Earth Exchange (NEX): Earth science collaborative for global change science”, presented at IGARSS 2012.

“NASA Earth Exchange (NEX)”, presented at Supercomputing 2012.

“Connecting Provenance and Semantic Descriptions in NASA Earth Exchange (NEX)”, presented at AGU 2012.


ESDSWG Participation

• Participated at 2012 ESDSWG meeting
• Participated in Semantics Working Group until it was dissolved
• Currently participate in the Cloud Computing Working group
• Plan to attend 2013 ESDSWG meeting and expand participation to Earth Science Collaboratory WG


Relationship to other funded activities

• AIST
- Facilitates access to tools and knowledge through API for workflow integration

• CMAC (Data Mining)
- Facilitates access to data and pre-processing tools

• CMAC (Recommendations)
- Facilitates access to tools through workflows

• National Climate Assessment (NCA)
- Facilitates process for NEX-produced data distribution for NCA

• BIOCLIM
- Facilitates access to tools, data and libraries for several BIOCLIM projects.


Relationship to NEX

• Provides foundation for user/project work environments

• Provides access to metadata for integration with the NEX knowledge system

• Provides the overarching metadata architecture for data and processes integrated through a semantic layer


Project Schedule


Summary of Work During Next Review Period (through 2/14)

• Continuous integration of tools and utilities with the NEX infrastructure based on user requirements

• Continuous integration of data with the NEX infrastructure based on user requirements

• Continue to work on the data and process interface (API) – the initial API is in Python, but we are also working with users on access to data and tools through R and MATLAB
- The extent of this will be driven by user requirements

• Work with users to continue integrating documentation, FAQs and code samples for the tools and datasets so that they are available both on the computing platform and on the NEX web portal.


Cumulative Budget (3/2012 – 8/2013)

• FY12: $141,750 - All funds have been obligated

• FY13: $145,200 - All funds have been obligated

• Does it match your numbers?


Glossary

• API: Application Programming Interface
• BIOCLIM: Climate and Biological Response: Research and Applications
• CMAC: Computational Modeling Algorithms and Cyberinfrastructure
• CMS: Carbon Monitoring System
• DMF: Data Migration Facility
• HEC: High-End Computing
• HPC: High-Performance Computing
• NAFD: North American Forest Disturbance
• NCCS: NASA Center for Climate Simulation
• NEX: NASA Earth Exchange
• OWL: Web Ontology Language
• RDF: Resource Description Framework
• SIMS: Satellite Irrigation Management Support