
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE

Very Large Dataset Access and Manipulation:

Active Data Repository (ADR), DataCutter and MetaChaos

Joel Saltz
University of Maryland, College Park
and Johns Hopkins Medical Institutions

http://www.cs.umd.edu/projects/adr


What we do

• Develop database tools for interacting with large multi-scale, multi-resolution datasets
• Ad-hoc queries, produce data products, support visualization of disk and tape based datasets
• Query, subset and filter very large archival datasets
• Operating system and middleware for very large “active” network attached storage systems
• Compilers that allow users to easily specify user defined data transformations (e.g. using Java dialect)
• Tools targeted at distributed multi-architecture platforms


Tools to Manage Storage Hierarchy

• Fast secondary storage (Active Data Repository)
  • Tools for on-demand data product generation, interactive data exploration, visualization
  • Target closely coupled sets of processors/disks

• Archival Storage (DataCutter)
  • Load subset of data from tertiary storage into disk cache or client
  • Access data from distributed data collections
  • Preprocess close to data sources
  • Stand-alone and integrated into NPACI Storage Resource Broker


Tool to Couple Applications

• MetaChaos
  • Parallel programs distribute data structures between processor memories
  • Separately developed programs will use different schemes to distribute data
  • MetaChaos coordinates movement of data between separately developed, compiled parallel programs
  • Layered on standard message passing layer such as MPICH-G, PVM
  • Garlik – integration of MetaChaos with KeLP floor plans


Irregular Multi-dimensional Datasets

• Spatial/multi-dimensional multi-scale, multi-resolution datasets

• Applications select portions of one or more datasets

• Selection of data subset makes use of spatial index (e.g., R-tree, quad-tree, etc.)

• Data not used “as-is”, generally preprocessing is needed - often to reduce data volumes


Querying Irregular Multi-dimensional Datasets

• Irregular datasets
  • Think of disk-based unstructured meshes, data structures used in adaptive multiple grid calculations, sensor data
  • Indexed by spatial location (e.g., position on earth, position of microscope stage)

• Spatial query used to specify iterator
  • Computation on data obtained from spatial query
  • Computation aggregates data: resulting data product size is significantly smaller than results of range query


Loading Datasets into ADR

• A user
  • should decompose dataset into data chunks
  • optionally can distribute chunks across the disks, and provide an index for accessing them
• ADR, given data chunks and associated minimum bounding rectangles in a set of files
  • can distribute data chunks across the disks using a Hilbert-curve based declustering algorithm,
  • can create an R-tree based index on the dataset.

Loading Datasets into ADR

Disk Farm

• ADR Data Loading Service

• Distributes chunks across the disks in the system (e.g., using Hilbert curve based declustering)

• Constructs an R-tree index using bounding boxes of the data chunks


Data Loading Service

• User must decompose the dataset into chunks
• For a fully cooked dataset, User
  • moves the data and index files to disks (via ftp, for example)
  • registers the dataset using ADR utility programs
• For a half cooked dataset, ADR
  • computes placement information using a Hilbert curve-based declustering algorithm,
  • builds an R-tree index,
  • moves the data chunks to the disks
  • registers the dataset
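
The slides do not spell out the declustering algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: chunks are 2-D, their minimum bounding rectangles are given in integer grid cells, each chunk is keyed by the Hilbert index of its MBR midpoint, and chunks are dealt round-robin to disks in Hilbert order. The names Chunk, hilbertIndex and declusterChunks are hypothetical and are not part of ADR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical chunk descriptor: a 2-D minimum bounding rectangle (MBR)
// expressed in integer grid cells.
struct Chunk {
    std::uint32_t xmin, ymin, xmax, ymax;
};

// Position of grid cell (x, y) along a Hilbert curve covering an n x n grid,
// where n is a power of two (standard iterative formulation).
std::uint64_t hilbertIndex(std::uint32_t n, std::uint32_t x, std::uint32_t y) {
    std::uint64_t d = 0;
    for (std::uint32_t s = n / 2; s > 0; s /= 2) {
        std::uint32_t rx = (x & s) ? 1 : 0;
        std::uint32_t ry = (y & s) ? 1 : 0;
        d += static_cast<std::uint64_t>(s) * s * ((3 * rx) ^ ry);
        if (ry == 0) {  // rotate/flip the quadrant so the curve stays continuous
            if (rx == 1) {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::swap(x, y);
        }
    }
    return d;
}

// Assumed policy: order chunks by the Hilbert index of their MBR midpoint,
// then deal them round-robin so spatial neighbours land on different disks.
// Returns the disk number assigned to each chunk, in input order.
std::vector<int> declusterChunks(const std::vector<Chunk>& chunks,
                                 std::uint32_t gridSize, int numDisks) {
    std::vector<std::size_t> order(chunks.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    auto key = [&](std::size_t i) {
        const Chunk& c = chunks[i];
        return hilbertIndex(gridSize, (c.xmin + c.xmax) / 2, (c.ymin + c.ymax) / 2);
    };
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return key(a) < key(b); });
    std::vector<int> disk(chunks.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank)
        disk[order[rank]] = static_cast<int>(rank % static_cast<std::size_t>(numDisks));
    return disk;
}
```

Because chunks that are close in space are also close along the curve, a range query touching neighbouring chunks tends to be served by several disks in parallel under this policy.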


Query Execution in Active Data Repository

• An ADR Query contains a reference to
  • the data set of interest,
  • a query window (a multi-dimensional bounding box in input dataset’s attribute space),
  • default or user defined index lookup functions,
  • user-defined accumulator,
  • user-defined projection and aggregation functions,
  • how the results are handled (write to disk, or send back to the client).
• ADR handles multiple simultaneous active queries


ADR Query Execution

[Figure: stages of query execution: query, index lookup, generate query plan, initialize output, aggregate local input data into output, combine partial output results, send output to clients]


Dataset Structure

• Spatial and temporal resolution may depend on spatial location

• Physical quantities computed and stored vary with spatial location


Processing Irregular Datasets
Example -- Interpolation

• Specify portion of raw sensor data corresponding to some search criterion
• Output grid onto which a projection is carried out


Active Data Repository (ADR)

• Set of services for building parallel databases of multi-dimensional datasets
  • enables integration of storage, retrieval and processing of multi-dimensional datasets on parallel machines.
  • can maintain and jointly process multiple datasets.
  • provides support and runtime system for common operations such as
    • data retrieval,
    • memory management,
    • scheduling of processing across a parallel machine.
  • customizable for various application specific processing.


Data Processing Scenario

[Figure labels: source data elements, mapping function, intermediate data elements (accumulator elements), reduction function, result data elements; processors P0, P1, P2]

Data elements declustered across disks attached to processors of distributed memory machines


Data Processing

• Source and result datasets are multi-dimensional
• Result dataset often smaller than source dataset
• Perform processing near where source datasets live
• Correctness of the reduction functions does not depend on the order source elements are processed
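
As an illustration of why order independence matters, here is a small sketch of a reduction phase followed by a combine phase. It is not ADR code: the SourceElement and Accumulator types, and the choice of "maximum value and count per output cell" as the aggregation, are assumptions made for the example. Maximum and count are commutative and associative, so each processor can consume its local elements in whatever order the disks deliver them.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical source element: a value plus the accumulator cell it maps to.
struct SourceElement {
    std::size_t cell;
    double value;
};

// Hypothetical accumulator: per-cell running maximum and element count.
struct Accumulator {
    std::vector<double> maxValue;
    std::vector<long> count;
    explicit Accumulator(std::size_t cells)
        : maxValue(cells, std::numeric_limits<double>::lowest()), count(cells, 0) {}
};

// Reduction phase: each processor folds its local source elements into its
// own accumulator copy. The result does not depend on element order.
void localReduce(Accumulator& acc, const std::vector<SourceElement>& local) {
    for (const SourceElement& e : local) {
        acc.maxValue[e.cell] = std::max(acc.maxValue[e.cell], e.value);
        ++acc.count[e.cell];
    }
}

// Combine phase: merge the per-processor partial accumulators cell by cell.
Accumulator globalCombine(const std::vector<Accumulator>& partials) {
    Accumulator result(partials.empty() ? 0 : partials.front().count.size());
    for (const Accumulator& p : partials) {
        for (std::size_t c = 0; c < result.count.size(); ++c) {
            result.maxValue[c] = std::max(result.maxValue[c], p.maxValue[c]);
            result.count[c] += p.count[c];
        }
    }
    return result;
}
```

Swapping the order of the partial accumulators, or of the elements within one processor, leaves the combined result unchanged, which is the property the slide relies on.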


Order-independent Reduction Functions

Correctness of reduction function does not depend on the order elements are processed

[Figure: processors P0, P1, P2 in a reduction phase followed by a combine phase]


Data Processing Strategies

• Fully Replicated Accumulator (FRA): high memory requirement
  • initialization: replicate accumulator on all processors
  • reduction:
    • read src elements from local disks
    • process src elements with local accumulator elements
  • combine: merge replicated accumulator elements
• Sparsely Replicated Accumulator (SRA)
  • initialization: only replicate accumulator where required
• Distributed Accumulator (DA): low memory requirement
  • initialization: partition accumulator among processors
  • reduction:
    • read src elements on local disks
    • send src elements to processor that owns mapped accumulator for processing
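
The memory trade-off between the fully replicated and distributed strategies can be made concrete with a little arithmetic. The sketch below assumes a simple block partition of accumulator cells for the distributed case; the slides do not say how ADR actually partitions the accumulator, and all function names are hypothetical.

```cpp
#include <cstddef>

// Fully Replicated Accumulator: every processor holds every cell,
// regardless of how many processors there are.
std::size_t fraCellsPerProcessor(std::size_t totalCells, std::size_t /*numProcs*/) {
    return totalCells;
}

// Distributed Accumulator with an assumed block partition: each processor
// holds at most ceil(totalCells / numProcs) cells.
std::size_t daCellsPerProcessor(std::size_t totalCells, std::size_t numProcs) {
    return (totalCells + numProcs - 1) / numProcs;
}

// Processor that owns a given accumulator cell under that block partition.
// A source element mapped to this cell would be sent to this processor.
std::size_t daOwner(std::size_t cell, std::size_t totalCells, std::size_t numProcs) {
    return cell / daCellsPerProcessor(totalCells, numProcs);
}
```

With 1000 accumulator cells on 4 processors, the replicated strategy stores 1000 cells per processor while the distributed one stores 250, at the price of forwarding source elements to the owning processor.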


ADR Applications

• Visualize Thematic Mapper (TM) Landsat images
  • Global Land Cover Facility
  • Enhanced the capabilities of the GLCF TM meta data browser to allow browsing of the raw TM images
• Visualize astronomy data using MPIRE
  • MPIRE/ADR implementation extended the functionality of MPIRE to allow out-of-core computations
  • MPIRE runs on very large data sets even on relatively small numbers of processors.
• Applications were demonstrated at SC99.


ADR Applications

• Energy and Environment NPACI Alpha project
  • Data repository for flow data, mesh interpolation used in coupling flow results to projection, transport codes
  • History matching -- examining differences and similarities in a set of simulation realizations
• Virtual Microscope
  • Exploration of large microscopy datasets

Processing Remotely Sensed Data
NOAA TIROS-N w/ AVHRR sensor

AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.

A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.

One scan line is 409 IFOVs

Applications

Surface/Groundwater Modeling

Pathology Volume Rendering

Satellite Data Analysis


Example: ADR and MetaChaos
Coupling of Surface Water Codes

• Carry out a surface water pollution remediation using a chain of flow codes and reactive transport codes.

• Codes run on separate platforms and their results are stored in ADR which, along with MetaChaos, provides the coupling.

• Parallelization of Projection/Ground Water Code using KeLP


Projection Code: UTPROJ

[Map labels: Upper Chesapeake Bay, Delaware Bay, Atlantic Ocean, Aberdeen Proving Grounds, Baltimore, US Naval Academy, Philadelphia, CND Canal]

• Couples 3D surface water flow model to contaminant and salinity transport models, can be used as ground water code
• Implements conservative velocity projection method
• Improves local mass conservation
• Projection formulation based on mixed finite element method

Delaware Bay, CND Canal, and Chesapeake Bay


Current state of project

[Figure labels: Flow Codes (PADCIRC, UT-BEST); flow output; Multi-scale Database (ADR); Projection (UT-PROJ); flow input; Environmental Quality Codes (CE-QUAL-ICM)]


Water Contamination Studies

[Figure labels, along a simulation time axis: FLOW CODE (parallel program); hydrodynamics output (velocity, elevation) on unstructured grid; POST-PROCESSING (time averaging, projection); grid used by chemical transport code; CHEMICAL TRANSPORT CODE (parallel program); visualization]

* Locally conservative projection
* Management of large amounts of data


[Figure labels: FLOW CODE; partition into chunks; register, integrate; Active Data Repository (ADR) with Attribute Space Service (attribute spaces of simulators), Data Loading Service (chunks loaded to disks), Indexing Service (index created), Data Aggregation Service (POST-PROCESSING: time averaging); TRANSPORT CODE]


[Figure labels: ADR with Attribute Space Service, Data Loading Service, Indexing Service, Data Aggregation Service, Query Interface Service, Query Planning Service, Query Execution Service; TRANSPORT CODE query (time period, input grid, output grid, post-processing function (time averaging)); output grid; POST-PROCESSING (projection)]


Example: Split Parsim

• UT Austin code PARSIM models flow and reactive transport
• Applications: Bay and estuary, reservoir, blood flow
• Computationally intensive flow calculations
• Data intensive reactive transport (20+ components)
• Flow and Reactive Transport run on different platforms, coupled using MetaChaos
• Data archived on ADR in I/O cluster
• Reactive Transport data analyzed using ADR (isosurface contour)


ADR Subsets Data, Carries out Iso-surface Rendering Over Range of Timesteps (vtk client)


Other Research

• DataCutter
  • Supports data subsetting, filters connected by streams (coarse grained dataflow).
  • Integrated in NPACI SRB; end to end tests included spatial subsetting, decompression, clipping of 5TB (uncompressed) datasets
• Middleware for large scale data storage
  • Building large (50TB+) disk based clusters
  • Active disk “disklet” model for placing processing near disks
• Compilers for user-defined functions
  • Data parallel model
  • Users write procedures and customized runtime support is generated
  • Interprocedural and slicing analysis


New IBM Collaborations

• Active Network Attached Storage
• HPSS
• Assume dedicated storage cluster(s) and zero, one or more large SP configurations
• SDSC
• Hopkins
• Florida State


HPSS

• Collaborators: Bob Coyne, Otis Graf

• Stage high end computing and large scale data manipulation on a collection of clusters and parallel machines linked by a high bandwidth local area network

• Deploy HPSS to use the very large tape store at SDSC for tertiary storage but instantiate the data cache in the disk cluster at the University of Maryland

– OC-48 network connection (Abilene) will make it possible to separate HPSS disk cache and tape library

• Library routines to invoke filters on data obtained from tape.

• Library will use IBM client API to open files and to bypass disk cache to directly access data

• DataCutter filters will process data


Software for Network Attached Storage

• Douglas Pase -- Netfinity Network attached Storage (NAS)

• Extend filesystems to support pipelined communicating processes to perform computation as data is stored or retrieved.

• Filter data to implement a database operation such as a join or datacube, or to support a more specialized data mining or data intensive scientific calculation

• Determine whether and how to replicate frequently accessed files, or how to change file placement or file striping

• Related work in context of GPFS filesystem (Roger Haskin, IBM Almaden)


Details on Collaborative Work with Doug Pase

• Work distributed using Java-based software agents or disklets.
• Software transported from client to a server, executed on server.
• Client would be the application, and the server would be the disk or NAS server.
• Agent processes local data, sends results back to the client as needed.
• Disk or NAS server can maintain its configuration as an appliance, while still offering the opportunity to move computations to data.
• The agent server would restrict any agent's access to data or other resources appropriately.
• Close link with ongoing Maryland work -- DataCutter, Active Disk and Java based ADR compiler


Research Group

• Alan Sussman
• Tahsin Kurc
• Umit Catalyurek
• Chialin Chang
• Renato Ferreira
• Mike Beynon
• Henrique Andrade

Collaborators:

• Mary Wheeler’s group
• Scott Baden’s group

Architecture of Active Data Repository

[Figure labels: Client 1 (parallel) and Client 2 (sequential) exchange queries and results with the Front End (Application Front End, Query Interface Service, Query Submission Service); Back End services: Dataset Service, Attribute Space Service, Data Aggregation Service, Indexing Service, Query Planning Service, Query Execution Service]

ADR Query Execution

[Figure labels: Initialization Phase, Local Reduction Phase, Global Combine Phase, Output Handling Phase; client]


DataCutter


DataCutter

• A suite of Middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems

• Integrated with NPACI Storage Resource Broker (SRB)

• Standalone Prototype


DataCutter

• Spatial Subsetting using Range Queries
  • a hyperbox defined in the multi-dimensional space underlying the dataset
  • items whose multi-dimensional coordinates fall into the box are retrieved.
• Two-level hierarchical indexing -- summary and detailed index files
• Customizable --
  • Default R-tree index
  • User can add new indexing methods
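
The range-query semantics described above (retrieve the items whose coordinates fall inside a hyperbox) can be sketched as a brute-force scan. This is only an illustration of the selection rule, not DataCutter's implementation, which would consult an index instead of scanning. The HyperBox type and the function names are hypothetical, and treating the box boundaries as inclusive is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hyperbox: per-dimension lower and upper bounds.
struct HyperBox {
    std::vector<double> min;
    std::vector<double> max;
};

// True when the point lies inside the box in every dimension
// (boundaries treated as inclusive, which is an assumption).
bool contains(const HyperBox& box, const std::vector<double>& point) {
    if (point.size() != box.min.size() || box.min.size() != box.max.size())
        return false;
    for (std::size_t d = 0; d < point.size(); ++d) {
        if (point[d] < box.min[d] || point[d] > box.max[d])
            return false;
    }
    return true;
}

// Brute-force range query: positions of all items that fall into the box.
std::vector<std::size_t> selectItems(const std::vector<std::vector<double>>& items,
                                     const HyperBox& box) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (contains(box, items[i]))
            hits.push_back(i);
    }
    return hits;
}
```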


Processing

• Processing (filtering/aggregations) through Filters
  • to reduce the amount of data transferred to the client
  • filters can run anywhere, but intended to run near (i.e., over local area network) storage system
• Standalone system allows multiple filters placed on different platforms
• SRB release allows only a single filter which can be placed anywhere
• Motivated by Uysal’s disklet work


Filter Framework

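class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]) { … }   // called once before any data arrives
    int process(stream_t st) { … }           // called to consume data from the stream
    int finalize(void) { … }                 // called once after processing ends
};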


DataCutter -- Subsetting

• Datasets are partitioned into segments
  • used to index the dataset, unit of retrieval
• Indexing very large datasets
  • Multi-level hierarchical indexing scheme
  • Summary index files -- to index a group of segments or detailed index files
  • Detailed index files -- to index the segments
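
A two-level lookup of the kind outlined above might look like the following sketch. It assumes 2-D bounding boxes and keeps both index levels in memory as flat lists, whereas the slides describe index files and a default R-tree; the Box, DetailedEntry and SummaryEntry types are hypothetical. The point is that a summary entry whose bounds miss the query lets the whole detailed index beneath it be skipped.

```cpp
#include <vector>

// Hypothetical 2-D bounding box.
struct Box {
    double xmin, ymin, xmax, ymax;
};

// Two boxes intersect when they overlap in both dimensions.
bool intersects(const Box& a, const Box& b) {
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}

// Detailed index entry: bounding box of one segment.
struct DetailedEntry {
    Box bounds;
    int segmentId;
};

// Summary index entry: bounding box of a group of segments, plus the
// detailed index that covers that group.
struct SummaryEntry {
    Box bounds;
    std::vector<DetailedEntry> detail;
};

// Consult the summary level first and open a detailed index only when
// its summary box intersects the query.
std::vector<int> lookupSegments(const std::vector<SummaryEntry>& summary,
                                const Box& query) {
    std::vector<int> hits;
    for (const SummaryEntry& group : summary) {
        if (!intersects(group.bounds, query))
            continue;  // skip the whole detailed index for this group
        for (const DetailedEntry& seg : group.detail) {
            if (intersects(seg.bounds, query))
                hits.push_back(seg.segmentId);
        }
    }
    return hits;
}
```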


Placement

• The dynamic assignment of filters to particular hosts for execution is placement (mapping)

• Optimization criteria:
  • Communication
    • leverage filter affinity to dataset
    • minimize communication volume on slower connections
    • co-locate filters with large communication volume
  • Computation
    • expensive computation on faster, less loaded hosts
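
To show how the communication and computation criteria pull against each other, here is a toy placement rule. It is a sketch under a deliberately simple cost model that is not taken from the slides: the estimated time for a filter on a host is its compute time on that host plus the time to move its input from the data source to that host. The Host and FilterCost types and both functions are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host description for the toy cost model.
struct Host {
    double opsPerSec;          // effective compute rate (already reduced by load)
    double bytesPerSecToData;  // bandwidth between this host and the data source
};

// Hypothetical per-filter cost estimate.
struct FilterCost {
    double ops;         // amount of computation
    double inputBytes;  // volume of data the filter must receive
};

// Assumed cost model: compute time plus input transfer time.
double estimatedSeconds(const FilterCost& f, const Host& h) {
    return f.ops / h.opsPerSec + f.inputBytes / h.bytesPerSecToData;
}

// Place the filter on the host with the smallest estimated time.
// Expects a non-empty host list; returns the index of the chosen host.
std::size_t placeFilter(const FilterCost& f, const std::vector<Host>& hosts) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < hosts.size(); ++i) {
        if (estimatedSeconds(f, hosts[i]) < estimatedSeconds(f, hosts[best]))
            best = i;
    }
    return best;
}
```

Under this model a data-heavy filter stays on the slow host next to the storage, while a compute-heavy filter with little input moves to the faster remote host.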


Integration of DataCutter with the Storage Resource Broker


Storage Resource Broker (SRB)

• Middleware between clients and storage resources

• Remote Access to storage resources.
• Various types:
  • File Systems - UNIX, HPSS, UniTree, DPSS (LBL).
  • DB large objects - Oracle, DB2, Illustra.
• Uniform client interface (API).
• MCAT - MetaData Catalog
  • Datasets (files) and Collections (directories) - inodes and more.
  • Storage resources
  • User information - authentication, access privileges, etc.
• Software package
  • Server, client library, UNIX-like utilities, Java GUI
  • Platforms - Solaris, Sun OS, Digital Unix, SGI Irix, Cray T90.


SRB/DataCutter

• Support for Range Queries
  • Creation of indices over data sets (composed set of data files)
  • Subsetting of data sets

• Search for files or portions of files that intersect a given range query

• Restricted filter operations on portions of files (data segments) before returning them to the client (to perform filtering or aggregation to reduce data volume)

SRB/DataCutter System

[Figure labels: Application (SRB client); SRB I/O and MCAT API; MCAT (application meta-data, user, resource); File SID, DBLobjID, ObjSID, Range Query; DataCutter (Indexing Service, Filtering Service with filters); Distributed Storage Resources (DB2, Oracle, Illustra, ObjectStore; HPSS, UniTree; UNIX, ftp)]


SRB/DataCutter Client Interface
• Creating and Deleting Index

int sfoCreateIndex(srbConn *conn, sfoClass class, int catType,
                   char *inIndexName, char *outIndexName, char *resourceName)

int sfoDeleteIndex(srbConn *conn, sfoClass class, int catType, char *indexName)


SRB/DataCutter Client Interface
• Searching Index -- R-tree index

int sfoSearchIndex(srbConn *conn, sfoClass class, char *indexName, void *query,
                   indexSearchResult *myresult, int maxSegCount)

typedef struct {
    int dim;
    double *min, *max;
} rangeQuery;

int sfoGetMoreSearchResult(srbConn *conn, int continueIndex,
                           indexSearchResult *myresult, int maxSegCount)

typedef struct {
    int dim;        /* bounding box dimensions */
    double *min;    /* minimum in each dimension */
    double *max;    /* maximum in each dimension */
} sfoMBR;           /* Bounding box structure */

typedef struct {
    sfoMBR segmentMBR;      /* bounding box of the segment */
    char *objID;            /* object in SRB that contains the segment */
    char *collectionName;   /* collection where object is stored */
    unsigned int offset;    /* offset of the segment in the object */
    unsigned int size;      /* size of segment */
} segmentInfo;              /* segment meta-data information */

typedef struct {
    int segmentCount;       /* number of segments returned */
    segmentInfo *segments;  /* segment meta-data information */
    int continueIndex;      /* continuation flag */
} indexSearchResult;        /* search result structure */


Applying Filters

int sfoApplyFilter(srbConn *conn, sfoClass class, char *hostName, int filterID,
                   char *filterArg, int numOfInputSegments, segmentInfo *inputSegments,
                   filterDataResult *myresult, int maxSegCount)

int sfoGetMoreFilterResult(srbConn *conn, int continueIndex,
                           filterDataResult *myresult, int maxSegCount)

typedef struct {
    segmentInfo segInfo;    /* info on segment data buffer after filter oper. */
    char *segment;          /* segment data buffer after filter is applied */
} segmentData;

typedef struct {
    int segmentDataCount;   /* #segments in segmentData array */
    segmentData *segments;  /* segmentData array */
    int continueIndex;      /* continuation flag */
} filterDataResult;


Application: Virtual Microscope

• Interactive software emulation of high power light microscope for processing/visualizing image datasets
• 3-D Image Dataset (100MB to 5GB per focal plane)
• Client-server system organization
• Rectangular region queries, multiple data chunk reply
• Pipeline style processing

[Figure: filter pipeline read_data, decompress, clip, zoom, view]


VM Application using SRB/DataCutter

[Figure labels: Virtual Microscope clients; Wide Area Network; Local Area Network; Distributed Collection of Workstations; SRB/DataCutter with Indexing and the read, decompress, clip, zoom and view filters; Distributed Storage Resources]

Filters:
• read: read image chunks
• decompress: convert jpeg image chunks into RGB pixels
• clip: clip image to query boundaries
• zoom: sub-sample to the required magnification
• view: stitch image pieces together and display image
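
The clip and zoom steps of this pipeline are simple enough to sketch directly. The code below is an illustration only, not the Virtual Microscope filters: it assumes a single-channel image stored row by row, a clip rectangle that the caller guarantees lies inside the image, and zooming by keeping every factor-th pixel in each direction. The Image type and both functions are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical single-channel image stored row by row.
struct Image {
    int width;
    int height;
    std::vector<unsigned char> pixels;
};

// clip: keep only the w x h rectangle whose top-left corner is (x0, y0).
// The caller must ensure the rectangle lies inside the input image.
Image clip(const Image& in, int x0, int y0, int w, int h) {
    Image out{w, h, std::vector<unsigned char>(static_cast<std::size_t>(w) * h)};
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x)
            out.pixels[y * w + x] = in.pixels[(y0 + y) * in.width + (x0 + x)];
    }
    return out;
}

// zoom: sub-sample by keeping every factor-th pixel in each direction
// (factor must be at least 1).
Image zoom(const Image& in, int factor) {
    Image out{(in.width + factor - 1) / factor,
              (in.height + factor - 1) / factor, {}};
    for (int y = 0; y < in.height; y += factor) {
        for (int x = 0; x < in.width; x += factor)
            out.pixels.push_back(in.pixels[y * in.width + x]);
    }
    return out;
}
```

Both steps shrink the data, which is why running them close to the storage system reduces the volume sent to the client.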


Experimental Setup

• UMD 10 node IBM SP (1 4CPU, 3 2CPU, 6 1CPU)
• HPSS system (10TB tape storage, 500GB disk cache)
• 4GB JPEG compressed dataset (90GB uncompressed), 180k x 180k RGB pixels (200 x 200 jpeg blocks of 900x900 pixels each)
• 250GB JPEG compressed dataset (5.6TB uncompressed), 1.44Mx1.44M RGB pixels (1600x1600 jpeg blocks)
• Rtree index based query lookups
• server host = SP 2CPU node
• Read, Decompress, Clip, Zoom, View distributed between client and server
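
The dataset sizes quoted above can be checked with a short calculation. The sketch assumes 3 bytes per RGB pixel, 900x900 pixel blocks for both datasets, and binary units (1 GB = 2^30 bytes, 1 TB = 2^40 bytes); none of these is stated explicitly on the slide. The function names are hypothetical.

```cpp
#include <cstdint>

// Pixels along one side of a square image built from square JPEG blocks.
std::uint64_t pixelsPerSide(std::uint64_t blocksPerSide, std::uint64_t blockSide) {
    return blocksPerSide * blockSide;
}

// Uncompressed size in bytes of a square RGB image, assuming 3 bytes per pixel.
std::uint64_t rgbBytes(std::uint64_t side) {
    return side * side * 3;
}
```

For the smaller dataset, 200 blocks of 900 pixels give 180,000 pixels per side and 97.2 * 10^9 bytes, about 90.5 GB in binary units. For the larger one, 1600 blocks give 1,440,000 pixels per side and about 6.22 * 10^12 bytes, roughly 5.66 TB, in line with the 90GB and 5.6TB figures on the slide.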


Dataset -- 250 GB (Compressed), All Computation on Server

Query Size     Cold Disk Cache (Sec)   Warm Disk Cache (Sec)
4500x4500      131                     15
9000x9000      244                     48
18000x18000    416                     100


Breakdown of DataCutter Costs
250 GB dataset, 9600x9600 query

Operation                Cold Cache (Sec)   Warm Cache (Sec)
Total Query + Compute    244                48
Index lookup             107                3
Data Lookup              115                25


Effect of Filter Placement
9600x9600 Query, Warm Cache

Query Size   Everything but View on Server (Seconds)   Server: Read, Decompress, Clip (Seconds)   Server just reads, client does all else (Seconds)
4.5Kx4.5K    15                                         66                                         14
9.6Kx9.6K    48                                         251                                        46
18Kx18K      180                                        991                                        186


Effect of Dataset Size
4.5Kx4.5K Query
Server does Everything but View, Warm Cache

105755.6TB250GB

1044990GB4GB

DataCutter Data Retrieval(Sec)

DataCutter Indexing(Sec)

TotalTime(Sec)

Size Uncompressed

Dataset Size


The Future

• Integrated suite of tools for handling very deep memory hierarchies
  • Common set of tools for grid and disk cache computations
• Programmability
  • Use XML metadata
  • Ongoing data parallel compiler project -- uses Java based user defined functions
  • Applications development toolkit (Visual DataCutter)
• Implementation
  • NPACI
  • Private sector (?)
