Scenario Deployment View (S4P, S4PA, S4PM)

[Figure: S4P, S4PA, S4PM scenario deployment view. Labels: Linux Server; PERL, S4P platform components; Perl Libraries, SFTP/HTTP; S4P; S4PA; S4PM; Alg Compilers; COTS Tools; CentOS (Linux) on Virtual Machine Environment; Science Algorithm.]

Component Code Counts: S4P: 7K SLOC; S4PM: 14K SLOC; S4PA: 20K SLOC

Summary Highlights and Distinctions

Perl, S4P, S4PA, S4PM
• S4P is a framework for S4PA and S4PM, in which a standard station daemon polls for new work order files in a local directory and maintains a queue.
• Scripts and configurations are added for S4PA and S4PM functions, including handling of additional popular protocols and metadata.
• Communication among stations uses the file system and includes several conventional protocols.
• S4PA functions use station configurations to control data transfer by polling a remote host for the available data location, then constructing a request to transfer the remote data. A directory is created in the local file system from the filename, with symbolic links for access.
• S4PM includes the major functions in station components; stations look for and prepare inputs, and run the algorithm on a dedicated resource. Load is balanced via static configuration parameters.
• Creates an S4PM location for output files; links or moves them to S4PA.
• The archive is separate from the algorithm processing platform.

OODT, JAVA
• OODT functions are in Java components grouped into data ingest and workflow management.
• Java methods and configurations are added to support data type ingest and algorithm execution planning functions.
• Components communicate using XML RPCs (XML encoding, HTTP).
• Data transfer is controlled through two polling components: one polls for files in a remote subscription directory and transfers them to a local directory; the second polls for files in the local directory and moves them to an inventory file system. Utilization is maximized and delays are minimized by tuning timers and other parameters.
• Functionality is added to interface with and manage science data configurations and the data inventory, preparing input data for running algorithms on dedicated resources.
• Job queues and resource queues are used to control and run algorithms in working directories on computer cluster nodes. Symbolic links are used to access science data.
• Output products are moved to a file system monitored for ingest.
• Separate platforms are used for the archive and the processing cluster.

Simplistic Satellite Science Data System Use Case Scenario

[Figure: Use case scenario. Labels: GES DISC OMI NO2 processed Level 2 HDF; OMI directory list, latest product files; Poll Data Center and Copy New Level 2 HDF (daily Level 2, latest 2 days); local copy OMI NO2 Level 2 HDF; Generate Composite Level 3 TIF (daily Level 3, last 7 days) with geo-political boundary vectors; local copy OMI NO2 Level 3 TIF; Web Server; Browser Animation.]

[Figures: S4PM workflow concept; GRAVITE Processing Deployment View]

Jan 16, 2016


Jonah Todd
www.nasa.gov

Architectures Toward Reusable Science Data Systems

John.f.Moses@nasa.gov

Science Data Systems Branch, NASA Goddard Space Flight Center, Greenbelt, MD 20771


GRAVITE Science Data Ingest

Science Data Systems (SDS) comprise an important class of data processing systems that support product generation from remote sensors and in-situ observations. These systems enable research into new science data products, replication of experiments and verification of results. NASA has been building systems for satellite data processing since the first Earth observing satellites launched and is continuing development of systems to support NASA science research and NOAA’s Earth observing satellite operations. The basic data processing workflows and scenarios continue to be valid for remote sensor observations research as well as for the complex multi-instrument operational satellite data systems being built today.

GRAVITE Software Architecture

Satellite Data System Enterprise Architectures

• Establish a design hierarchy and process, structure of elements, properties and relationships, abstractions for managing complexity

• Partition the system into software elements (components) with responsibilities and interaction (interface) rules, hierarchical, recursive with focus on functionality

• Look for conceptual integrity: a small number of simple interaction patterns. System functions such as ingest, product generation and distribution need to be configurable and to perform consistently as the system scales

• Re-use infrastructure, framework, data models

Software Architecture Views for Re-use

Architect’s Application

• Business process description focuses on:
  • Dynamic interaction of stakeholders; roles & interfaces
  • Flow of information between the enterprise entities
• Business model can drive design; identifies stakeholders, systems and data; examples include:
  • NASA EOSDIS science discipline-specific facilities such as Science Investigator-led Processing Systems (SIPS) and Distributed Active Archive Centers (DAACs)
  • Joint Polar Satellite System (JPSS) has mission partner facilities/systems; e.g., NOAA NESDIS STAR, ESPC, FNMOC, CLASS, NASA SDS
• Manages interfaces; enables system design independence

[Figure: JPSS Ground System High-level Architecture OV-2 (Jan 31, 2014). Node labels: Space/Ground Comm. Node; Field Terminal Support Node; JPSS Ground Network Node; Management & Operations Node; Data Processing Node; Simulation Node; CGS Support Node; Support Nodes; Cal/Val Node; GRAVITE; JPSS Common Ground System. Other labels include TDRS, Launch Service, SvalSat, Fairbanks CDA, TrollSat, McMurdo, WSC, NSOF MMC, Fairmont CBU Alt MMC, NSOF IDPS, Fairmont Alt IDPS, FVS, FVTS, Flt Dynamics System, S-NPP, JPSS-n, GCOM-n, Coriolis/WindSat, MetOp-n, DMSP-Fn, CLASS, NASA SDS, ESPC, STAR, FNMOC, NAVOCEANO, LASP, NASA CARA, SARSAT/Argos, and Field Terminal Users. Data flows are labeled TT&C, MD Frames, SMD, MSD, HRD/LRD, ANC Data, xDRs and IPs. Note in figure: the network supports routing of NASA SCaN-supported missions and McMurdo NSF data.

Abbreviations: MD – Mission Data; SMD – Stored Mission Data; MSD – Mission Support Data.

Legend: Black/White Text – Block 1.2+; Red Text – Block 2.0; Purple Text – Block 3+.]

EOSDIS System Architecture

[Figure: EOSDIS System Architecture. Panel titles: Data Acquisition; Flight Operations, Data Capture, Initial Processing & Backup Archive; Data Transport to DAACs; Science Data Processing, Data Mgmt., Data Archive & Distribution; Distribution, Access, Interoperability & Reuse. Other labels: Spacecraft; Ground Stations; Polar Ground Stations; Tracking & Data Relay Satellite (TDRS); NASA Integrated Services Network (NISN) Mission Services; Data Processing & Mission Control; Science Teams (SIPS); Measurement Teams; EOSDIS Science Data Systems (DAACs); Data Pools; REASoNs; MEaSUREs; ACCESS; ECHO; WWW; Technology Infusion; Research; Education; Value-Added Providers; Interagency Data Centers; International Partners; Earth System Models; Benchmarking DSS.]

Major Functions for Satellite Science Data System:

EOSDIS: Goddard Earth Sciences Data and Information Services Center (GES DISC) Science Data System built using Simple Scalable Script-based Science Processor (S4P)
1. Perl script, S4P Archive (S4PA), S4P Missions (S4PM)
2. Process steps are organized in directory structures
3. Station daemon and configuration file provide building blocks: polls local directory for work order files, looks up commands for type of work, changes to temporary subdirectory, forks child process to execute the job, creates and writes output work order to downstream station

Instrument data systems employing S4P and Perl-based framework components (sample):
• TRMM science data system: (GES DISC)
• AQUA AIRS and AURA MLS, OMI: (GES DISC & OMI SIPS)
• TERRA ASTER: ASTER on-demand system (LP DAAC)
• TERRA MISR S4PM: (LARC ASDC)
• CALIPSO, FLASHFlux S4PM: (LARC ASDC)
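The station daemon cycle described in item 3 above can be sketched in a few lines. This is an illustrative Python sketch, not S4P's Perl implementation: the function name `run_station_once`, the `DO.<job_type>.<job_id>.wo` file naming, the `RUNNING.` directory prefix, and the choice to reuse the input work order name downstream are all assumptions made for the example. It also starts the child with `subprocess.run` rather than an explicit fork.

```python
import subprocess
from pathlib import Path
import tempfile


def run_station_once(station_dir, commands, downstream_dir):
    """Process each pending work order once; return the job ids that ran."""
    station = Path(station_dir)
    downstream = Path(downstream_dir)
    done = []
    for work_order in sorted(station.glob("DO.*.wo")):
        # Assumed naming convention: DO.<job_type>.<job_id>.wo
        parts = work_order.name.split(".")
        if len(parts) != 4:
            continue
        _, job_type, job_id, _ = parts
        # Look up the command configured for this type of work.
        command = commands.get(job_type)
        if command is None:
            continue
        # Give the job a temporary subdirectory of its own.
        job_dir = Path(tempfile.mkdtemp(prefix=f"RUNNING.{job_id}.", dir=station))
        local_order = job_dir / work_order.name
        work_order.rename(local_order)
        # Execute the job in a child process, inside the job directory.
        result = subprocess.run(command + [local_order.name], cwd=job_dir)
        if result.returncode != 0:
            continue  # leave the job directory in place for inspection
        # Write the output work order to the downstream station.
        (downstream / work_order.name).write_text(local_order.read_text())
        done.append(job_id)
    return done
```

Because stations communicate only through files in directories, a chain of stations is formed by pointing each station's `downstream_dir` at the next station's directory.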

Aura Ozone (OMI) instrument data processing and archive at GES DISC

[Figure: Aura OMI data flow. Labels: EDOS satellite science data capture; EMOS telemetry, orbit/attitude; NOAA ancillary; S4PA (L0) auraraw1; S4PA (L1) aurapar1; S4PA (L2) aurapar2; S4PA (L2-3) acdisc; OMI Science Investigator-led Processing System; ODPS; FMI; S4PA (L0); S4PM (dprep); tads.]

S4PA workflow concept

[Figure: S4PA workflow concept. Labels: Provider; Pollers; Receive Data; Store Data; Giovanni Pre-process; Subscription; Metadata Publication; Deletion; Post Office; Subscriber; ECHO; Mirador; Archive Storage. Flows are labeled data, met, pdr and links.]

Use the Aura OMI ozone instrument science data processing scenario as a model of priority functions for examining solution attributes
• Science algorithm scenario allows partitioning into sets of the most basic or general functions and interactions
• Frameworks concept prescribes the design methodology
• Two supporting middleware packages emerge as popular frameworks
• Abstract views are used to identify components with common structures and priority attributes

JPSS node: Government Resource for Algorithm Verification Integration and Test Environment (GRAVITE) data system built using the Apache Object Oriented Data Technology (OODT) framework
1. JAVA in Linux server environment
2. Process steps use components from OODT
3. Communicate via XML Remote Procedure Calls

Instrument data systems employing OODT components:
• SeaWinds/QuikSCAT science data processing
• SMAP: soil moisture science data system (JPL)
• Orbiting Carbon Observatory-2: operations pipeline (JPL)
• SNPP Sounder Product Evaluation & Test Element (PEATE)
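To make item 3 concrete, the sketch below shows what an XML-RPC message looks like on the wire, using Python's standard `xmlrpc.client` module rather than the Java libraries GRAVITE uses. The method name `example.ingestFile` and the field names `fileLocation` and `dataType` are hypothetical; they are not taken from the OODT API.

```python
import xmlrpc.client


def encode_ingest_request(file_location, data_type):
    """Marshal a hypothetical ingest call into an XML-RPC request body."""
    params = ({"fileLocation": file_location, "dataType": data_type},)
    return xmlrpc.client.dumps(params, methodname="example.ingestFile")


def decode_request(xml_text):
    """Unmarshal a request body into (method name, first parameter)."""
    params, method = xmlrpc.client.loads(xml_text)
    return method, params[0]
```

The encoded body is plain XML sent over HTTP, which is what allows components written in different languages to exchange calls.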

GES DISC Software Architecture

Pull Server
• Periodically checks in remote host location for new data files; transfers new files to source landing zone
• Configuration file contains polling parameters: e.g., remote host directory, source landing zone directory

Crawler
• Crawler instances monitor data-source subdirectories for new files
• Verifies checksum and unique product identifier, and sends data type and file location to File Manager
• After successful database insert, moves file from landing zone to inventory

File Manager
• Receives file location, data type
• Extracts HDF5 and other metadata and populates the database. Sends message to Crawler on successful insert.
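The Crawler's verify, register, then move sequence can be sketched as follows. This is a minimal Python illustration, not OODT code. Assumptions: SHA-256 as the checksum (the poster does not name the algorithm), a `register` callback standing in for the File Manager's database insert, and a data type derived from the filename prefix before the first underscore.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def crawl_once(landing_zone, inventory_dir, expected_checksums, register):
    """Move verified, registered files from the landing zone to inventory."""
    moved = []
    for path in sorted(Path(landing_zone).iterdir()):
        if not path.is_file():
            continue
        expected = expected_checksums.get(path.name)
        if expected is None or sha256_of(path) != expected:
            continue  # unverified files stay in the landing zone
        data_type = path.name.split("_", 1)[0]  # assumed naming convention
        # Move only after the File Manager stand-in reports a successful insert.
        if register(data_type, str(path)):
            shutil.move(str(path), str(Path(inventory_dir) / path.name))
            moved.append(path.name)
    return moved
```

Moving the file only after a successful insert keeps the inventory file system and the inventory database consistent: a file that is not yet in the database is still in the landing zone, where the next crawl will pick it up again.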

Poll PDR
• Periodically looks in remote subscription PDR directory, pulls PDR files and sends them to Receive Data
• Configuration file contains parameters for polling: e.g., remote host/directory, local directory for new PDRs, local file of accepted PDRs, polling protocol, format

Receive Data
• Uses science data filename from PDR to create directory for the science data file
• Extracts metadata for data type, converts to XML
• Allocates local directory using PDR filename, downloads data file named in the PDR

Store Data
• Extracts metadata, stores data type records, obs time
• Looks in configuration for compression, quality check
• Creates and stores sym links to downloaded files
• Writes a subscription PDR containing sym links

Subscribe
• Reads the PDR file and extracts data type
• Configuration gives who to notify; data filters; URL
• Prepares PDR and sends to PostOffice for ftp or email

PostOffice
• Uses PDR to extract type and file metadata (XML)
• Configuration data type provides metadata filters
• Creates Delivery Notification (DN)

Acquire Data
• Reads DN.PDR for files to get
• Uses symlinks, or FTP get if remote
• Outputs PDR with data location

Register Data
• Uses data type to identify the algorithm name from configuration

Select Data
• Data type/time, production rules determine other required data

Track Data
• Adds filename and finds expected algorithm uses in configuration

Find Data
• Locates the needed/desired inputs
• Outputs data found after timers expire

Prepare Run
• Creates a Process Control File (PCF) using algorithm-specific template

Allocate Disk (S4PM)
• Allocates disk & adds directories to PCF

Run Algorithm (S4PM & code specific)
• Executes the named algorithm

Register Data (S4PM)
• Writes file name, metadata

Track Data (S4PM)
• Stores type metadata and updates usage

Export (S4PM)
• Writes PDR

Sweep (S4PM)
• Deletes data file when use count drops to zero
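The interplay between Track Data and Sweep amounts to reference counting on files. The sketch below illustrates that idea in Python; the class `UsageTracker` and its method names are hypothetical and do not mirror S4PM's Perl code or its on-disk bookkeeping.

```python
from pathlib import Path


class UsageTracker:
    """Keep a remaining-use count per data file and delete files at zero."""

    def __init__(self):
        self._uses = {}

    def track(self, path, expected_uses):
        # Track Data: record how many algorithm runs are expected to read the file.
        self._uses[Path(path)] = expected_uses

    def release(self, path):
        # Called after an algorithm run has consumed the file.
        path = Path(path)
        self._uses[path] -= 1
        return self._uses[path]

    def sweep(self):
        # Sweep: delete every file whose use count has dropped to zero.
        deleted = []
        for path, uses in list(self._uses.items()):
            if uses <= 0:
                path.unlink(missing_ok=True)
                del self._uses[path]
                deleted.append(path)
        return deleted
```

Separating `release` from `sweep` means deletion happens on the sweeper's schedule rather than inside the algorithm run.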

Two middleware frameworks are used in many current satellite science data systems. They provide the major functions for supporting simple science data processing scenarios and offer practical reuse options at the component level.
• Data download and storage management
• Workflow management and algorithm application

They are composed of similar processing steps
• Science data transfer using standard directory polling and data protocols
• Workflow chain development for instrument data processing algorithms

Reuse is made possible through public software release and by the availability of a limited, informal set of code examples, design artifacts and user guides.

Future Work
• Examine implementations to quantify latency and scalability factors.
• Understand complexity in installation, tuning and configuration management.
• Quantify the significance of the language skill requirement for Perl vs. Java.

Planner (Java)
• Verifies all input files in inventory
• Checks the inventory database for PGE inputs
• Tells the Workflow Manager to create a working directory
• Updates PGE configuration files in the working directory

Workflow Manager (OODT)
• Reads config of conditions & tasks
• Creates a workflow instance and processing thread
• Creates a working directory with symbolic links to the input files
• Sends the executable tasks to the Resource Manager

Resource Manager (OODT)
• Resource Monitor determines state of resources on the servers
• Sends jobs to queue/scheduler when resources are available
• Batch Managers submit jobs to Resource Nodes on the servers

PGE (JAVA, PGE specific languages)
• Executes algorithms/commands
• Output moved into landing zone

Incinerator (JAVA)
• Periodically searches and removes links and folders after time expires
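Two of the steps above, building a working directory of symbolic links and later removing expired working directories, can be sketched together. This is a Python illustration under stated assumptions, not GRAVITE's Java code: the function names, the one-directory-per-job layout, and the use of directory modification time as the expiry clock are all choices made for the example.

```python
import os
import shutil
import time
from pathlib import Path


def create_working_dir(root, job_id, input_files):
    """Create a per-job directory holding symbolic links to the inputs."""
    workdir = Path(root) / job_id
    workdir.mkdir(parents=True)
    for src in input_files:
        src = Path(src).resolve()
        os.symlink(src, workdir / src.name)
    return workdir


def incinerate(root, max_age_seconds, now=None):
    """Remove working directories older than max_age_seconds."""
    now = time.time() if now is None else now
    removed = []
    for workdir in Path(root).iterdir():
        if workdir.is_symlink() or not workdir.is_dir():
            continue
        if now - workdir.stat().st_mtime > max_age_seconds:
            # rmtree deletes the links themselves, never the linked inventory files.
            shutil.rmtree(workdir)
            removed.append(workdir.name)
    return sorted(removed)
```

Linking rather than copying means inventory files are not duplicated for each run, and cleaning up a working directory leaves the inventory untouched.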

Examining Satellite Science Data System Architectures

• Look for general, recurring structures and properties: e.g., file transfer, job control, algorithm input data and run configuration

• Characterize features most important to developers and operators: e.g., functional, performance, maintainability

• Test methods to scale/extrapolate scenario

Aura OMI instrument observations of NO2 (tropospheric NO2) in Level 2 (by orbit) format are acquired from the GES DISC and used to make a multi-day Level 3 global grid for visual display.

Acquire calibrated and geo-located instrument observations covering their operating life
• File transfer protocols and methods
  • Configure for FTP, SFTP, or HTTP file transfers
  • User provides information about the type and internet location of instrument observation data
  • Data subscription with data center source protocols
  • Copy observation time/location-based data files to local directory
• Extract metadata for downstream process control
  • Support common file formats with standard metadata content: e.g., HDF, NetCDF, ISO 19115
  • Provide key content: data observation/model time, spatial resolution and coverage extent
  • Source identification: file name, headers internal to the file and/or separate configuration file

Generate higher level synoptic-based products
• The algorithm assimilates (e.g., composites) multiple observation times into a representative time period
• Integrates other external sources of observations, model or reference geophysical parameters
• Configure run criteria and data format for algorithms
  • Identify all observations and static inputs
  • Run algorithm process scripts and executables when all input data is available
  • Store results locally for distribution, downstream analysis, visualization
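The compositing step in this scenario can be illustrated with a small Python sketch that bins Level 2 observations from several orbits into a regular latitude/longitude grid and averages them per cell. It is not the scenario's actual algorithm: the unweighted mean, the `(lat, lon, value)` tuple input, and the function name `composite_level3` are assumptions, and a production gridding step would typically also apply quality screening and weighting.

```python
import math
from collections import defaultdict


def composite_level3(orbits, cell_deg=1.0):
    """Average (lat, lon, value) observations from many orbits per grid cell."""
    n_rows = int(round(180.0 / cell_deg))
    n_cols = int(round(360.0 / cell_deg))
    sums = defaultdict(float)
    counts = defaultdict(int)
    for observations in orbits:
        for lat, lon, value in observations:
            if value is None or math.isnan(value):
                continue  # skip fill values
            # Clamp so that lat = 90 and lon = 180 fall in the last cell.
            row = min(int((lat + 90.0) // cell_deg), n_rows - 1)
            col = min(int((lon + 180.0) // cell_deg), n_cols - 1)
            sums[(row, col)] += value
            counts[(row, col)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}
```

Each key of the result is a `(row, col)` cell index counted from the south-west corner of the grid; cells with no valid observations are simply absent.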

GRAVITE Automated Processing

Scenario Data Transferusing S4PA, S4P, Perl components

Scenario Data Transfer Process (using OODT & JAVA)

Scenario Workflowusing S4PA, S4P, Perl

Scenario Workflowusing OODT & Java

Scenario Data Transfer using S4PA, S4P, Perl components

[Figure: data transfer from a remote host (e.g., GES DISC) over FTP/SFTP to components on a local Linux server. Steps labeled: Poller (PDR), Receive, Subscribe, Store, PostOffice. Items labeled: Product Delivery Record (PDR) with data locations and start time, DN, science data files, work order files, configuration files. Configuration labeled: PDR polling config, metadata config, QC config, data type filters, subscription (data type, user, who to notify, data filter, URL), PGE spec. Storage labeled: S4PA Linux file system (data type, filenames, metadata). Also labeled: S4PA, S4PM.]

Scenario Data Transfer Process (using OODT & Java components)

[Figure: data transfer from a remote host (e.g., GES DISC) to the local system. Components labeled: Data Source, Pull Server, Landing Zone, Crawler, File Manager, Inventory File System, Inventory database. Transfer labeled: poll over FTP/SFTP, HTTP/HTTPS. Data labeled: science data file; file name, type, location; HDF5 metadata and file location; file metadata. Configuration labeled: XML polling rules (remote server, source location, target location; data source, location to poll), configuration files, PGE spec. Also labeled: XML RPCs, Java, OODT.]

Scenario Workflow (OODT & Java)

[Figure: workflow components. Components labeled: Workflow Manager, Planner, PGE Prep, Resource Manager, Resource Monitor, Incinerator, Product Generation Executable, PGE planner database, Inventory database, Inventory File System, Landing Zone, working directory. Data and configuration labeled: workflow configuration (XML), tasks, conditions, required input, location of input data, run status.]

Scenario Workflow using S4PM, S4P, Perl components

[Figure: S4PM station workflow from DATA/INPUT to DATA/OUTPUT. Stations labeled: Acquire, Find, Select, Register, Track, Prepare Run, Allocate Disk, Run Algorithm, Export, Sweep. Items labeled: Product Delivery Record (PDR, DN.PDR), data type delivery notification, science data type name, signal, file locations, algorithm config, algorithm name, other input data, data needed or desired, production rules, Process Control File (PCF), data type metadata, output filename, location, metadata. Storage labeled: Working, S4PA Linux file system.]

Scenario Deployment View (OODT & Java)

[Figure: software architecture deployed on a Linux server. Components labeled: Ingest, Database, PGE Manager, PGE, Science Algorithm, Java (Planner), OODT. Platform labeled: Java libraries, SFTP/HTTP, algorithm compilers, COTS tools, RDBS, CentOS (Linux) and virtual machine environment.]

Component code counts: OODT: 58K SLOC; PGE: 1K SLOC

Scenario Deployment View (S4P, S4PA, S4PM)

[Figure: software architecture deployed on a Linux server. Components labeled: S4P, S4PA, S4PM, Science Algorithm. Platform labeled: Perl libraries, SFTP/HTTP, algorithm compilers, COTS tools, CentOS (Linux) on virtual machine environment.]

Code counts: S4P: 7K SLOC; S4PM: 14K SLOC; S4PA: 20K SLOC

• S4P is the framework underlying S4PA and S4PM: a standard station daemon polls a local directory for new work order files and maintains a queue.

• Scripts and configurations are added for S4PA and S4PM functions, including handling of additional popular protocols and metadata.

• Stations communicate through the file system, and several conventional protocols are supported.

• S4PA functions use station configurations to control data transfer: a station polls the remote host for available data locations, then constructs a request to transfer the remote data. A directory is created in the local file system from the filename, with symbolic links for access.

• S4PM implements its major functions as station components: stations look for and prepare inputs, then run the algorithm on a dedicated resource. Load is balanced via static configuration parameters.

• S4PM creates a location for output files and links or moves them to S4PA. The archive is separate from the algorithm processing platform.
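
The station pattern described above, a daemon that polls a local directory for work order files and keeps a queue, can be sketched as follows. This is not S4P code (S4P is written in Perl). The `Station` class, its method names and the `.wo` suffix are assumptions made for this sketch.

```python
import os
from collections import deque


class Station:
    """Toy station: scans one directory for work order files and queues them."""

    def __init__(self, directory, suffix=".wo"):
        self.directory = directory
        self.suffix = suffix      # assumed work order file suffix (illustrative)
        self.seen = set()         # names already queued, to avoid duplicates
        self.queue = deque()      # pending work orders, oldest first

    def poll_once(self):
        """Queue any work order files not seen before; return how many were added."""
        added = 0
        for name in sorted(os.listdir(self.directory)):
            if name.endswith(self.suffix) and name not in self.seen:
                self.seen.add(name)
                self.queue.append(name)
                added += 1
        return added

    def next_job(self):
        """Return the next queued work order name, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

A daemon would call `poll_once()` on a timer and hand each `next_job()` result to the script configured for that station. Because stations only read and write files, one station can pass work to the next by writing a work order into its directory.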

• OODT functions are in Java components grouped into data ingest and workflow management.

• Java methods and configurations are added to support data type ingest and algorithm execution planning functions.

• Components communicate using XML RPCs (XML encoding over HTTP).

• Data transfer is controlled through two polling components: the first polls for files in a remote subscription directory and transfers them to a local directory; the second polls for files in the local directory and moves them to an inventory file system. Tuning timers and other parameters maximizes utilization and minimizes delays.

• Functionality is added to interface with and manage science data configurations and the data inventory, preparing input data for running algorithms on dedicated resources.

• Job queues and resource queues are used to control and run the algorithm in working directories on computer cluster nodes. Symbolic links are used to access science data. Output products are moved to a file system monitored for ingest. The archive and the processing cluster are on separate platforms.
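
To make the XML RPC point concrete, the sketch below encodes a method call as XML and decodes it again, using Python's standard `xmlrpc.client` module rather than the Java components described above. The method name `ingest.notify` and its arguments are hypothetical and are not taken from the OODT API.

```python
import xmlrpc.client

# Hypothetical call: tell an ingest component that a new file has landed.
# Neither the method name nor the arguments come from OODT itself.
request_xml = xmlrpc.client.dumps(
    ("OMNO2", "/data/landing/example.he5"),
    methodname="ingest.notify",
)

# The receiving side decodes the same XML back into parameters and a method name.
params, method = xmlrpc.client.loads(request_xml)
```

In a running system the XML body would be sent as an HTTP POST to the receiving component, which is the role `xmlrpc.client.ServerProxy` plays in Python.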

Summary Highlights and Distinctions: Perl, S4P, S4PA, S4PM compared with OODT, Java

Simplistic Satellite Science Data System Use Case Scenario

[Figure: use case flow. GES DISC holds processed OMI NO2 Level 2 HDF files; the OMI directory list gives the latest product files. Poll data center and copy new Level 2 HDF (daily Level 2, latest 2 days) to a local copy of OMI NO2 Level 2 HDF. Generate composite Level 3 TIF with geo-political boundary vectors, stored as a local copy of OMI NO2 Level 3 TIF (daily Level 3, last 7 days). Web server and browser display the animation.]

S4PM workflow concept

GRAVITE Processing Deployment View