www.nasa.gov

Architectures Toward Reusable Science Data Systems
[email protected]
Science Data Systems Branch, NASA Goddard Space Flight Center, Greenbelt, MD 20771

Science Data Systems (SDS) comprise an important class of data processing systems that support product generation from remote sensors and in-situ observations. These systems enable research into new science data products, replication of experiments and verification of results. NASA has been building systems for satellite data processing since the first Earth-observing satellites were launched and continues to develop systems that support NASA science research and NOAA’s Earth-observing satellite operations. The basic data processing workflows and scenarios remain valid for remote-sensor observation research as well as for the complex multi-instrument operational satellite data systems being built today.

[Panel titles: GRAVITE Science Data Ingest; GRAVITE Software Architecture]

Satellite Data System Enterprise Architectures
• Establish a design hierarchy and process, a structure of elements with their properties and relationships, and abstractions for managing complexity
• Partition the system into software elements (components) with responsibilities and interaction (interface) rules; the partitioning is hierarchical and recursive, with a focus on functionality
• Look for conceptual integrity: a small number of simple interaction patterns.
System functions such as ingest, product generation and distribution need to be configurable and to perform consistently as the system scales.
• Re-use infrastructure, frameworks and data models

Software Architecture Views for Re-use
Architect’s Application
• Business process description focuses on:
  • Dynamic interaction of stakeholders; roles and interfaces
  • Flow of information between the enterprise entities
• Business model can drive design; it identifies stakeholders, systems and data. Examples include:
  • NASA EOSDIS science discipline-specific facilities such as Science Investigator-led Processing Systems (SIPS) and Distributed Active Archive Centers (DAACs)
  • Joint Polar Satellite System (JPSS) mission partner facilities and systems, e.g., NOAA NESDIS STAR, ESPC, FNMOC, CLASS, NASA SDS
• Manages interfaces; enables system design independence

[Figure: JPSS Ground System High-level Architecture OV-2 (Jan 31, 2014). Nodes shown include the Field Terminal Support Node, Space/Ground Comm. Node, Management & Operations Node, Simulation Node, Data Processing Node and JPSS Ground Network Node. Abbreviations: MD = Mission Data; SMD = Stored Mission Data; MSD = Mission Support Data.]
[Figure, continued: JPSS Common Ground System, including the CGS Support Node and Cal/Val Node (GRAVITE). The data network supports routing of NASA SCaN-supported missions and McMurdo NSF data. Legend: black/white text = Block 1.2+; red text = Block 2.0; purple text = Block 3+.]

[Figure: EOSDIS System Architecture. Functional areas: Flight Operations, Data Capture, Initial Processing & Backup Archive; Data Transport to DAACs; Science Data Processing, Data Mgmt., Data Archive & Distribution; Distribution, Access, Interoperability & Reuse.]

Major Functions for Satellite Science Data System:

EOSDIS: Goddard Earth Sciences Data and Information Services Center (GES DISC) Science Data System built using the Simple Scalable Script-based Science Processor (S4P)
1. Perl scripts: S4P Archive (S4PA), S4P Missions (S4PM)
2. Process steps are organized in directory structures
3. Station daemon and configuration file provide the building blocks: the daemon polls a local directory for work order files, looks up the command for the type of work, changes to a temporary subdirectory, forks a child process to execute the job, and creates and writes an output work order to the downstream station

Instrument data systems employing S4P and Perl-based framework components (sample):
• TRMM science data system (GES DISC)

[Figure: Aura Ozone (OMI) instrument data processing and archive at GES DISC. Elements: S4PA (L2) aurapar2; S4PA (L1) aurapar1; S4PA (L0) auraraw1; OMI Science Investigator-led Processing System; ODPS; S4PA (L2-3) acdisc; FMI; S4PA (L0); S4PM (dprep); tads; EDOS satellite science data capture; EMOS telemetry and orbit/attitude; NOAA ancillary.]

[Figure: S4PA workflow concept. Stations: Pollers, Receive Data, Store Data, Pre-process, Subscription, Metadata Publication, Deletion, Post Office. External elements: Provider, Subscriber, Giovanni, ECHO, Mirador, Archive Storage. Flows: pdr, data, met, links.]

The Aura OMI ozone instrument science data processing scenario serves as a model of priority functions for examining solution attributes
• The science algorithm scenario allows partitioning into sets of the most basic or general functions and interactions
• The frameworks concept prescribes the design methodology
• Two supporting middleware packages emerge as popular frameworks
• Abstract views are used to identify components with common structures and priority attributes

JPSS node: Government Resource for Algorithm Verification Integration and Test Environment (GRAVITE) data system built using the Apache Object Oriented Data Technology (OODT) framework
1. Java in a Linux server environment
2. Process steps use components from OODT
3. Components communicate via XML Remote Procedure Calls (XML-RPC)

Instrument data systems employing OODT components:
• SeaWinds/QuikSCAT science data processing
• SMAP: soil moisture science data system (JPL)
• Orbiting Carbon Observatory-2: operations pipeline (JPL)
• SNPP Sounder Product Evaluation & Test Element (PEATE)

Pull Server
• Periodically checks a remote host location for new data files; transfers new files to the source landing zone
• Configuration file contains polling parameters, e.g., remote host directory, source landing zone directory

Crawler
• Crawler instances monitor data-source subdirectories for new files
• Verifies the checksum and the unique product identifier, then sends the data type and file location to the File Manager
• After a successful database insert, moves the file from the landing zone to the inventory

File Manager
• Receives the file location and data type
• Extracts HDF5 and other metadata and populates the database; sends a message to the Crawler on successful insert

GES DISC Software Architecture

Poll PDR (Product Delivery Record)
• Periodically looks in the remote subscription PDR directory, pulls PDR files and sends them to Receive Data.
• Configuration file contains parameters for polling, e.g., remote host/directory, local directory for new PDRs, local file of accepted PDRs, polling protocol, format

Receive Data
• Uses the science data filename from the PDR to create a directory for the science data file
• Extracts metadata for the data type and converts it to XML
• Allocates a local directory using the PDR filename and downloads the data file named in the PDR

Store Data
• Extracts metadata; stores data type records and observation time
• Looks in the configuration for compression and quality check settings
• Creates and stores symbolic links to the downloaded files
• Writes a subscription PDR containing the symbolic links

Subscribe
• Reads the PDR file and extracts the data type
• Configuration gives who to notify, data filters and URL
• Prepares a PDR and sends it to PostOffice for delivery by FTP or email

PostOffice
• Uses the PDR to extract the type and file metadata (XML)
• Configuration for the data type provides metadata filters
• Creates a Delivery Notification (DN)

Acquire Data
• Reads DN.PDR for the files to get
• Uses symbolic links, or an FTP get if the files are remote
• Outputs a PDR with the data location

Register Data
• Uses the data type to identify the algorithm name from the configuration

Select Data
• Data type/time and production rules determine the other required data

Track Data
• Adds the filename and looks up the expected algorithm uses in the configuration

Find Data
• Locates the needed and desired inputs
• Outputs the data found after timers expire

Prepare Run
• Creates a Process Control File (PCF) using an algorithm-specific template

Allocate Disk (S4PM)
• Allocates disk and adds directories to the PCF

Run Algorithm (S4PM and code specific)
• Executes the named algorithm

Register Data (S4PM)
• Writes the file name and metadata

Track Data (S4PM)
• Stores the type metadata and updates usage

Export (S4PM)
• Writes a PDR

Sweep (S4PM)
• Deletes a data file when its use count drops to zero

Two middleware frameworks are used in many current satellite science data systems.
They provide the major functions for supporting simple science data processing scenarios and offer practical reuse options at the component level.
• Data download and storage management
• Workflow management and algorithm application
They are composed of similar processing steps:
• Science data transfer using standard directory polling and data protocols
• Workflow chain development for instrument data processing algorithms
Reuse is made possible through public software release and by the availability of a limited, informal set of code examples, design artifacts and user guides.

Future Work
• Examine implementations to quantify latency and scalability factors.
• Understand complexity in installation, tuning and configuration management.
• Quantify the significance of the language skill requirement for Perl vs. Java.

Planner (Java)
• Verifies that all input files are in the inventory
• Checks the inventory database for Product Generation Executable (PGE) inputs
• Tells the Workflow Manager to create a working directory
• Updates PGE configuration files in the working directory

Workflow Manager (OODT)
• Reads the configuration of conditions and tasks
• Creates a workflow instance and processing thread
• Creates a working directory with symbolic links to the input files
• Sends the executable tasks to the Resource Manager

Resource Manager (OODT)
• Resource Monitor determines the state of resources on the servers
• Sends jobs to the queue/scheduler when resources are available
• Batch Managers submit jobs to Resource Nodes on the servers

PGE (Java, PGE-specific languages)
• Executes algorithms/commands
• Output is moved into the landing zone

Incinerator (Java)
• Periodically searches for and removes links and folders after their time expires

Examining Satellite Science Data System Architectures
• Look for general, recurring structures and properties, e.g., file transfer, job control, algorithm input data and run configuration
• Characterize the features most important to developers and operators, e.g., functionality, performance, maintainability
• Test methods to scale/extrapolate the scenario

Aura OMI instrument observations of NO2 (tropospheric NO2) in Level 2 (by orbit) format are acquired from the GES DISC and used to make a multi-day Level 3 global grid for visual display.

Acquire calibrated and geolocated instrument observations covering their operating life
• File transfer protocols and methods
• Configure for FTP, SFTP, or HTTP file transfers
• User provides information about the type and internet location of instrument observation data
• Data subscription with data center source protocols
• Copy observation time/location-based data files to a local directory
• Extract metadata for downstream process control
• Support common file formats with standard metadata content, e.g., HDF, NetCDF, ISO 19115
• Provide key content: data observation/model time, spatial resolution and coverage extent
• Source identification: file name, headers internal to the file and/or a separate configuration file

Generate higher-level synoptic-based products
• The algorithm assimilates (e.g., composites) multiple observation times into a representative time period
• Integrates other external sources of observations, model or reference geophysical parameters
• Configure run criteria and data format for algorithms
• Identify all observations and static inputs
• Run algorithm process scripts and executables when all input data is available
• Store results locally for distribution, downstream analysis and visualization

[Figure panel titles: GRAVITE Automated Processing; Scenario Data Transfer using S4PA, S4P, Perl components; Scenario Data Transfer Process (using OODT & Java); Scenario Workflow using S4PA, S4P, Perl; Scenario Workflow using OODT & Java]
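The compositing step in the scenario above (several Level 2 observation times combined into one Level 3 grid for a representative period) can be sketched in a few lines. This is only an illustrative stand-in, not the algorithm used at GES DISC: the function name `composite_level3`, the `cell_deg` parameter and the `(lat, lon, value)` sample tuples are assumptions, reading the Level 2 HDF files is not shown, and a production algorithm would also apply quality screening and weighting rather than a plain mean.

```python
def composite_level3(orbits, cell_deg=1.0, fill=None):
    """Average Level 2 samples from several orbits onto a global lat/lon grid.

    orbits is an iterable of orbits; each orbit is an iterable of
    (lat, lon, value) tuples (an assumed, simplified Level 2 layout).
    Returns a list of rows from south to north; empty cells hold `fill`.
    """
    n_rows = int(round(180 / cell_deg))
    n_cols = int(round(360 / cell_deg))
    sums = [[0.0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]

    for orbit in orbits:
        for lat, lon, value in orbit:
            # Skip missing values and samples outside the valid range.
            if value is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
                continue
            # Clamp so that lat = 90 and lon = 180 fall in the last cell.
            row = min(int((lat + 90) / cell_deg), n_rows - 1)
            col = min(int((lon + 180) / cell_deg), n_cols - 1)
            sums[row][col] += value
            counts[row][col] += 1

    # Unweighted mean per cell over all contributing orbits.
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else fill
         for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

Calling the function with the orbits from the latest two or seven days would give the multi-day grid that the scenario then renders for display; the choice of a simple mean keeps the sketch short and is the part most likely to differ in a real product.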
[Figure: Scenario Data Transfer using S4PA, S4P, Perl components. Remote host (e.g., GES DISC) to components on the local Linux server: Poller:PDR, Receive, Store, Subscribe, PostOffice. Flows: Product Delivery Record (PDR), Delivery Notification (DN), science data files, work order files and configuration files (PDR polling config, metadata config, QC config; who to notify, data filter, URL, data type filters) over FTP/SFTP into the S4PA Linux file system.]

[Figure: Scenario Data Transfer Process (using OODT & Java components). Remote host (e.g., GES DISC) to the local system: Pull Server, Crawler, File Manager; data source landing zone, inventory file system, inventory database. Polling over FTP/SFTP and HTTP/HTTPS; XML polling rules give the data source location to poll and the target location; components exchange file name, type, location and HDF5 metadata via XML RPCs.]

[Figure: Scenario Workflow (OODT & Java). PGE Planner, PGE Prep, Workflow Manager, Resource Manager, Resource Monitor, Product Generation Executable, Incinerator; planner database, inventory database, inventory file system, working directory, landing zone; workflow configuration of tasks and conditions (required input XML, conditions, run status).]

[Figure: Scenario Workflow using S4PM, S4P, Perl components. DATA/INPUT stations: Acquire, Register, Select, Track, Find, PrepareRun, AllocateDisk, RunAlgorithm. DATA/OUTPUT stations: Register, Track, Export, Sweep. Items exchanged: DN.PDR, data type, algorithm name, production rules, file locations, Process Control File (PCF), output filenames and metadata, PDR; S4PA Linux file system.]

[Figure: Scenario Deployment View (OODT & Java). Ingest, Database, PGE Manager, PGE. Component code counts: OODT 58K SLOC; PGE 1K SLOC. Linux server stack: Java and OODT platform components; Java libraries, SFTP/HTTP; algorithm compilers; COTS tools; CentOS]
[Figure, continued: Scenario Deployment View (OODT & Java). (Linux) and virtual machine environment; RDBS; science algorithm; Java (Planner).]

[Figure: Scenario Deployment View (S4P, S4PA, S4PM). Code counts: S4P 7K SLOC; S4PM 14K SLOC; S4PA 20K SLOC. Linux server stack: Perl and S4P platform components (S4P, S4PA, S4PM); Perl libraries, SFTP/HTTP; algorithm compilers; COTS tools; CentOS (Linux) on a virtual machine environment; science algorithm.]

Software Architecture
• S4P is the framework underlying S4PA and S4PM: a standard station daemon polls a local directory for new work order files and maintains a queue.
• Scripts and configurations are added for the S4PA and S4PM functions, including handling of additional popular protocols and metadata.
• Stations communicate through the file system, and several conventional protocols are supported.
• S4PA functions use station configurations to control data transfer: a station polls the remote host for available data locations, then constructs a request to transfer the remote data. A directory is created in the local file system from the filename, along with symbolic links for access.
• S4PM places its major functions in station components: stations look for and prepare inputs, then run the algorithm on a dedicated resource. Load is balanced via static configuration parameters.
• S4PM creates a location for output files, then links or moves them to S4PA. The archive is separate from the algorithm processing platform.
• OODT functions are in Java components grouped into data ingest and workflow management.
• Java methods and configurations are added to support data type ingest and algorithm execution planning functions.
• Components communicate using XML-RPC (XML encoding over HTTP).
• Data transfer is controlled through two polling components: the first polls for files in a remote subscription directory and transfers them to a local directory; the second polls for files in the local directory and moves them to an inventory file system. Utilization is maximized and delays are minimized by tuning timers and other parameters.
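The second polling stage in the data-transfer bullet above (verify a newly landed file, register it, and only then move it into the inventory) can be illustrated with a small stand-in. This is a sketch under stated assumptions and does not use the Apache OODT API: `open_inventory`, `file_manager_insert` and `crawl_once` are hypothetical names, an SQLite table stands in for the inventory database, the checksum is assumed to arrive as a `.sha256` sidecar file, the file stem is used as the product identifier, and HDF5 metadata extraction is omitted.

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def open_inventory(path=":memory:"):
    """Create the stand-in inventory database (one row per product)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inventory ("
        "product_id TEXT PRIMARY KEY, data_type TEXT, location TEXT)"
    )
    return db


def file_manager_insert(db, product_id, data_type, location):
    """Stand-in for the File Manager: insert a record and report success."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO inventory VALUES (?, ?, ?)",
                (product_id, data_type, str(location)),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # product identifier is already in the inventory


def crawl_once(landing_zone, inventory_dir, db, data_type):
    """Ingest verified files from the landing zone; return the names moved."""
    landing_zone = Path(landing_zone)
    inventory_dir = Path(inventory_dir)
    inventory_dir.mkdir(parents=True, exist_ok=True)
    ingested = []
    for data_file in sorted(landing_zone.glob("*.h5")):
        sidecar = data_file.with_name(data_file.name + ".sha256")
        if not sidecar.exists():
            continue  # checksum not delivered yet; try again on the next poll
        expected = sidecar.read_text().split()
        actual = hashlib.sha256(data_file.read_bytes()).hexdigest()
        if not expected or expected[0] != actual:
            continue  # failed verification: leave the file in the landing zone
        target = inventory_dir / data_file.name
        # Move the file only after the database insert has succeeded.
        if file_manager_insert(db, data_file.stem, data_type, target):
            shutil.move(str(data_file), str(target))
            sidecar.unlink()
            ingested.append(data_file.name)
    return ingested
```

Running `crawl_once` on a timer gives the polling behaviour; the timer interval is one of the tuning parameters the bullet refers to, and ordering the insert before the move means a file never appears in the inventory directory without a database record.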
• Functionality is added to interface with and manage science data configurations and the data inventory, preparing input data for running algorithms on dedicated resources.
• Job queues and resource queues are used to control and run algorithms in working directories on computer cluster nodes. Symbolic links are used to access science data. Output products are moved to a file system monitored for ingest. The archive and the processing cluster are separate platforms.

Summary Highlights and Distinctions
[Column headings: Perl, S4P, S4PA, S4PM; OODT, Java]

[Figure: Simplistic Satellite Science Data System Use Case Scenario. Poll the data center (GES DISC OMI NO2 processed Level 2 HDF; OMI directory list of latest product files) and copy new Level 2 HDF to a local copy (daily Level 2, latest 2 days); generate a composite Level 3 TIF with geopolitical boundary vectors (daily Level 3, last 7 days); web server and browser animation.]

[Panel titles: S4PM workflow concept; GRAVITE Processing Deployment View]
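The S4P station daemon behaviour described on this poster (poll a local directory for work order files, look up the command for the type of work, change to a temporary subdirectory, run the job in a child process, write an output work order to the downstream station) can be sketched as a single polling pass. S4P itself is written in Perl; the Python below is only a stand-in, and the work order naming (`DO.<type>.<id>.wo`), the `output.wo` file, the `RUNNING.` and `FAILED.` prefixes, `ECHO_JOB` and `run_station_once` are all assumptions made for illustration rather than the S4P interface.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

# Example job for demonstration: copy the work order to output.wo unchanged.
ECHO_JOB = [
    sys.executable, "-c",
    "import shutil, sys; shutil.copy(sys.argv[1], 'output.wo')",
]


def run_station_once(station_dir, commands, downstream_dir):
    """Handle each pending work order once; return the names that succeeded.

    commands maps a work type to an argv list (the station's configuration).
    """
    station_dir = Path(station_dir)
    downstream_dir = Path(downstream_dir)
    downstream_dir.mkdir(parents=True, exist_ok=True)
    handled = []
    # Assumed naming convention: DO.<work type>.<job id>.wo
    for work_order in sorted(station_dir.glob("DO.*.wo")):
        work_type = work_order.name.split(".")[1]
        command = commands.get(work_type)
        if command is None:
            continue  # no command configured for this type of work
        # Each job runs in its own temporary subdirectory of the station.
        job_dir = Path(tempfile.mkdtemp(prefix="RUNNING.", dir=station_dir))
        shutil.move(str(work_order), str(job_dir / work_order.name))
        # Child process executes the job with the work order as its argument.
        result = subprocess.run(command + [work_order.name], cwd=job_dir)
        output = job_dir / "output.wo"  # assumed name written by the job
        if result.returncode == 0 and output.exists():
            # Hand the output work order to the downstream station.
            shutil.move(str(output), str(downstream_dir / work_order.name))
            shutil.rmtree(job_dir)
            handled.append(work_order.name)
        else:
            # Keep the job directory for an operator to inspect.
            job_dir.rename(station_dir / ("FAILED." + work_order.name))
    return handled
```

A chain of stations is then just a chain of directories, each with its own `commands` table and downstream directory, which matches the poster's point that S4P stations communicate through the file system. A real daemon would repeat this pass on a timer and limit the number of concurrent children.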