NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE

Very Large Dataset Access and Manipulation:
Active Data Repository (ADR), DataCutter and MetaChaos

Joel Saltz
University of Maryland, College Park
and Johns Hopkins Medical Institutions
http://www.cs.umd.edu/projects/adr

What we do
• Develop database tools for interacting with large multi-scale, multi-resolution datasets
• Ad-hoc queries, produce data products, support visualization of disk and tape based datasets
• Query, subset and filter very large archival datasets
• Operating system and middleware for very large “active” network attached storage systems
• Compilers that allow users to easily specify user defined data transformations (e.g. using Java dialect)
• Tools targeted at distributed multi-architecture platforms

Tools to Manage Storage Hierarchy
• Fast secondary storage (Active Data Repository)
  • Tools for on-demand data product generation, interactive data exploration, visualization
  • Target closely coupled sets of processors/disks
• Archival Storage (DataCutter)
  • Load subset of data from tertiary storage into disk cache or client
  • Access data from distributed data collections
  • Preprocess close to data sources
  • Stand-alone and Integrated into NPACI Storage Resource Broker

Tool to Couple Applications
• MetaChaos
  • Parallel programs distribute data structures between processor memories
  • Separately developed programs will use different schemes to distribute data
  • MetaChaos coordinates movement of data between separately developed, compiled parallel programs
  • Layered on standard message passing layer such as MPICH-G, PVM
  • Garlik: integration of MetaChaos with KeLP floor plans
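To make the coupling problem concrete, here is a minimal sketch, assuming one program holds an array in a block distribution and the other holds it in a cyclic distribution. It only counts how many elements each sender must ship to each receiver. This is not the MetaChaos interface (which these slides do not show); `block_owner`, `cyclic_owner` and `build_schedule` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical helper: owner of global element i when n elements are
   split into contiguous blocks over p processes (block distribution). */
int block_owner(size_t i, size_t n, int p)
{
    size_t block = (n + (size_t)p - 1) / (size_t)p;  /* ceil(n / p) */
    return (int)(i / block);
}

/* Hypothetical helper: owner of global element i under a cyclic
   (round-robin) distribution over p processes. */
int cyclic_owner(size_t i, int p)
{
    return (int)(i % (size_t)p);
}

/* Fill counts[s * p_dst + d] with the number of elements that source
   process s (block layout) must send to destination process d (cyclic
   layout). The caller provides p_src * p_dst ints. */
void build_schedule(size_t n, int p_src, int p_dst, int *counts)
{
    for (int k = 0; k < p_src * p_dst; k++)
        counts[k] = 0;
    for (size_t i = 0; i < n; i++) {
        int s = block_owner(i, n, p_src);
        int d = cyclic_owner(i, p_dst);
        counts[s * p_dst + d]++;
    }
}
```

A real coupling layer would turn such a schedule into message-passing calls; the sketch stops at the bookkeeping.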

• Applications select portions of one or more datasets
• Selection of data subset makes use of spatial index (e.g., R-tree, quad-tree)
• Data not used “as-is”; preprocessing is generally needed, often to reduce data volumes

Querying Irregular Multi-dimensional Datasets
• Irregular datasets
  • Think of disk-based unstructured meshes, data structures used in adaptive multiple grid calculations, sensor data
  • indexed by spatial location (e.g., position on earth, position of microscope stage)
• Spatial query used to specify iterator
  • computation on data obtained from spatial query
  • computation aggregates data - resulting data product size significantly smaller than results of range query

Loading Datasets into ADR
• A user
  • should decompose dataset into data chunks
  • optionally can distribute chunks across the disks, and provide an index for accessing them
• ADR, given data chunks and associated minimum bounding rectangles in a set of files
  • can distribute data chunks across the disks using a Hilbert-curve based declustering algorithm,
  • can create an R-tree based index on the dataset.

Loading Datasets into ADR
[Figure: Disk Farm]
• ADR Data Loading Service
  • Distributes chunks across the disks in the system (e.g., using Hilbert curve based declustering)
  • Constructs an R-tree index using bounding boxes of the data chunks

Data Loading Service
• User must decompose the dataset into chunks
• For a fully cooked dataset, User
  • moves the data and index files to disks (via ftp, for example)
  • registers the dataset using ADR utility programs
• For a half cooked dataset, ADR
  • computes placement information using a Hilbert curve-based declustering algorithm,
  • builds an R-tree index,
  • moves the data chunks to the disks
  • registers the dataset
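The slides do not spell out the declustering algorithm, so the following is only a sketch of one common form of Hilbert-curve declustering: order the chunks by the Hilbert index of their bounding-box midpoint, then deal them out to disks round-robin. The `chunk` record, the 2-D power-of-two grid and the function names are all assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical chunk record: grid cell of the chunk's bounding-box
   midpoint, plus the outputs of the declustering pass. */
typedef struct {
    unsigned x, y;      /* cell coordinates, each in [0, side) */
    unsigned hilbert;   /* position along the Hilbert curve */
    int disk;           /* assigned disk */
} chunk;

/* Map cell (x, y) on a side x side grid (side a power of two) to its
   distance along the Hilbert curve. */
unsigned hilbert_index(unsigned side, unsigned x, unsigned y)
{
    unsigned d = 0;
    for (unsigned s = side / 2; s > 0; s /= 2) {
        unsigned rx = (x & s) ? 1u : 0u;
        unsigned ry = (y & s) ? 1u : 0u;
        d += s * s * ((3u * rx) ^ ry);
        if (ry == 0) {                 /* rotate/flip the quadrant */
            if (rx == 1) {
                x = side - 1 - x;
                y = side - 1 - y;
            }
            unsigned t = x; x = y; y = t;
        }
    }
    return d;
}

/* qsort comparator: ascending Hilbert index. */
static int by_hilbert(const void *a, const void *b)
{
    unsigned ha = ((const chunk *)a)->hilbert;
    unsigned hb = ((const chunk *)b)->hilbert;
    return (ha > hb) - (ha < hb);
}

/* Sort chunks along the curve, then deal them out round-robin so that
   chunks close in space tend to land on different disks. */
void decluster(chunk *chunks, size_t n, unsigned side, int ndisks)
{
    for (size_t i = 0; i < n; i++)
        chunks[i].hilbert = hilbert_index(side, chunks[i].x, chunks[i].y);
    qsort(chunks, n, sizeof *chunks, by_hilbert);
    for (size_t i = 0; i < n; i++)
        chunks[i].disk = (int)(i % (size_t)ndisks);
}
```

Because neighbouring chunks along the curve are usually neighbours in space, a range query tends to touch chunks spread over many disks, which is the point of declustering.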

Query Execution in Active Data Repository
• An ADR Query contains a reference to
  • the data set of interest,
  • a query window (a multi-dimensional bounding box in input dataset’s attribute space),
  • default or user defined index lookup functions,
  • user-defined accumulator,
  • user-defined projection and aggregation functions,
  • how the results are handled (write to disk, or send back to the client).
• ADR handles multiple simultaneous active queries

ADR Query Execution
[Figure: query execution flow]
• query
• Index lookup
• Generate query plan
• Initialize output
• Aggregate local input data into output
• Combine partial output results
• Send output to clients
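The last four steps of this flow can be sketched as plain C functions over an accumulator array. This is a sequential illustration under stated assumptions, not ADR code: the `accum` type, the averaging aggregation and every function name are hypothetical, and a real system would run the local step on each node in parallel.

```c
#include <stddef.h>

/* Hypothetical accumulator for one output element: a running sum and
   count, so that partial results from different nodes can be merged. */
typedef struct {
    double sum;
    long count;
} accum;

/* Initialize output: reset every accumulator element. */
void init_output(accum *out, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        out[i].sum = 0.0;
        out[i].count = 0;
    }
}

/* Aggregate local input into output: fold one node's input items into
   its own accumulator. cell[i] is the output element that input item i
   projects to (a stand-in for a user-defined projection function). */
void local_reduce(accum *out, const double *value, const size_t *cell, size_t n_in)
{
    for (size_t i = 0; i < n_in; i++) {
        out[cell[i]].sum += value[i];
        out[cell[i]].count++;
    }
}

/* Combine partial output results: merge another node's accumulator. */
void global_combine(accum *dst, const accum *src, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        dst[i].sum += src[i].sum;
        dst[i].count += src[i].count;
    }
}

/* Produce final output values (here, averages; empty elements yield 0). */
void finish_output(const accum *acc, double *result, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++)
        result[i] = acc[i].count ? acc[i].sum / (double)acc[i].count : 0.0;
}
```

Keeping a sum and a count, rather than a running average, is what lets partial results from different nodes be merged in any order.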

Dataset Structure
• Spatial and temporal resolution may depend on spatial location
• Physical quantities computed and stored vary with spatial location

AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.
One scan line is 409 IFOVs
Applications
Surface/Groundwater Modeling
Pathology Volume Rendering
Satellite Data Analysis

Example: ADR and MetaChaos
Coupling of Surface Water Codes
• Carry out a surface water pollution remediation using a chain of flow codes and reactive transport codes.
• Codes run on separate platforms and their results are stored in ADR which, along with MetaChaos, provides the coupling.
• Parallelization of Projection/Ground Water Code using KeLP

Projection Code: UTPROJ
[Figure: map of Delaware Bay, CND Canal, and Chesapeake Bay. Labels: Upper Chesapeake Bay; Delaware Bay; Atlantic Ocean; Aberdeen Proving Grounds; Baltimore; US Naval Academy; Philadelphia; CND Canal]
• Couples 3D surface water flow model to contaminant and salinity transport models, can be used as ground water code
• Implements conservative velocity projection method
• Improves local mass conservation
• Projection formulation based on mixed finite element method

Current state of project
[Figure: coupled code diagram. Labels: Flow Codes (PADCIRC, UT-BEST); Flow output; Projection (UT-PROJ); Flow input; Environmental Quality Codes (CE-QUAL-ICM); Multi-scale Database (ADR); Visualization]
• Locally conservative projection
• Management of large amounts of data

[Figure: flow code output loaded into the Active Data Repository (ADR). Labels: FLOW CODE; Partition into chunks; Attribute Space Service; Data Loading Service; Indexing Service; Data Aggregation Service; POST-PROCESSING (Time averaging); Register, Integrate; TRANSPORT CODE; Attribute spaces of simulators; Chunks loaded to disks; Index created]

[Figure: transport code querying ADR. Labels: Attribute Space Service; Data Loading Service; Indexing Service; Data Aggregation Service; Query Interface Service; Query Planning Service; Query Execution Service; ADR; POST-PROCESSING (Projection); Output Grid]
TRANSPORT CODE Query:
• Time period
• Input grid
• Output grid
• Post-processing function (Time Averaging)

Example: Split Parsim
• UT Austin code PARSIM models flow and reactive transport
• Applications: Bay and estuary, reservoir, blood flow
• Computationally intensive flow calculations
• Data intensive reactive transport (20+ components)
• Flow and Reactive Transport run on different platforms, coupled using MetaChaos
• Data archived on ADR in I/O cluster
• Reactive Transport data analyzed using ADR (isosurface contour)
ADR Subsets Data, Carries out Iso-surface Rendering Over Range of Timesteps (vtk client)

Other Research
• DataCutter
  • Supports data subsetting, filters connected by streams (coarse grained dataflow).
  • Integrated in NPACI SRB; end to end tests included spatial subsetting, decompression, clipping of 5TB (uncompressed) datasets
• Middleware for large scale data storage
  • Building large (50TB+) disk based clusters
  • Active disk “disklet” model for placing processing near disks
• Compilers for user-defined functions
  • Data parallel model
  • Users write procedures and customized runtime support is generated
  • Interprocedural and slicing analysis

New IBM Collaborations
• Active Network Attached Storage
• HPSS
• Assume dedicated storage cluster(s) and zero, one or more large SP configurations
  • SDSC
  • Hopkins
  • Florida State

HPSS
• Collaborators: Bob Coyne, Otis Graf
• Stage high end computing and large scale data manipulation on a collection of clusters and parallel machines linked by a high bandwidth local area network
• Deploy HPSS to use the very large tape store at SDSC for tertiary storage but instantiate the data cache in the disk cluster at the University of Maryland
  – OC-48 network connection (Abilene) will make it possible to separate HPSS disk cache and tape library
• Library routines to invoke filters on data obtained from tape.
  • Library will use IBM client API to open files and to bypass disk cache to directly access data
  • DataCutter filters will process data
Software for Network Attached Storage
• Douglas Pase -- Netfinity Network attached Storage (NAS)
• Extend filesystems to support pipelined communicating processes to perform computation as data is stored or retrieved.
• Filter data to implement a database operation such as a join or datacube, or to support a more specialized data mining or data intensive scientific calculation
• Determine whether and how to replicate frequently accessed files, or how to change file placement or file striping
• Related work in context of GPFS filesystem (Roger Haskin, IBM Almaden)

Details on Collaborative Work with Doug Pase
• Work distributed using Java-based software agents or disklets.
• Software transported from client to a server, executed on server.
• Client would be the application, and the server would be the disk or NAS server.
• Agent processes local data, sends results back to the client as needed.
• Disk or NAS server can maintain its configuration as an appliance, while still offering the opportunity to move computations to data.
• The agent server would restrict any agent's access to data or other resources appropriately.
• Close link with ongoing Maryland work -- DataCutter, Active Disk and Java based ADR compiler

Research Group
• Alan Sussman
• Tahsin Kurc
• Umit Catalyurek
• Chialin Chang
• Renato Ferreira
• Mike Beynon
• Henrique Andrade
Collaborators:
• Mary Wheeler’s group
• Scott Baden’s group

Architecture of Active Data Repository
[Figure: architecture diagram. Labels: Client 1 (parallel); Client 2 (sequential); Query; Results; Application Front End; Front End; Back End; Query Submission Service; Query Interface Service; Query Planning Service; Query Execution Service; Dataset Service; Attribute Space Service; Data Aggregation Service; Indexing Service]

ADR Query Execution
[Figure: query execution phases, driven by the Client]
• Initialization Phase
• Local Reduction Phase
• Global Combine Phase
• Output Handling Phase

DataCutter
• A suite of Middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems
• Integrated with NPACI Storage Resource Broker (SRB)
• Standalone Prototype

DataCutter
• Spatial Subsetting using Range Queries
  • a hyperbox defined in the multi-dimensional space underlying the dataset
  • items whose multi-dimensional coordinates fall into the box are retrieved.
• Two-level hierarchical indexing -- summary and detailed index files
• Customizable --
  • Default R-tree index
  • User can add new indexing methods
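The range-query semantics above (retrieve items whose coordinates fall inside a hyperbox) can be sketched with a plain linear scan. This is only an illustration of what a query selects, assuming inclusive bounds. An index such as an R-tree exists precisely to avoid scanning every item. The `hyperbox` type and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical range query: a hyperbox given by per-dimension bounds. */
typedef struct {
    int dim;
    const double *min;
    const double *max;
} hyperbox;

/* Return 1 if the point lies inside the box (bounds inclusive). */
int box_contains(const hyperbox *q, const double *point)
{
    for (int d = 0; d < q->dim; d++)
        if (point[d] < q->min[d] || point[d] > q->max[d])
            return 0;
    return 1;
}

/* Copy the indices of matching items into hits; returns how many matched.
   points holds n items of q->dim coordinates each, stored back to back. */
size_t range_select(const hyperbox *q, const double *points, size_t n, size_t *hits)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
        if (box_contains(q, points + i * (size_t)q->dim))
            hits[found++] = i;
    return found;
}
```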

Processing
• Processing (filtering/aggregations) through Filters
  • to reduce the amount of data transferred to the client
  • filters can run anywhere, but intended to run near (i.e., over local area network) storage system
• Standalone system allows multiple filters placed on different platforms
• SRB release allows only a single filter which can be placed anywhere
• Motivated by Uysal’s disklet work
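A chain of filters that reduces data before it reaches the client can be sketched in a single process as functions passing buffers to one another. This is an assumption-laden illustration, not the DataCutter filter interface (which is not shown in these slides): `filter_fn`, `keep_large`, `sum_all` and `run_pipeline` are hypothetical, and real filters would exchange data over streams between hosts.

```c
#include <stddef.h>

/* Hypothetical filter signature: consume n_in values from in, write at
   most cap values to out, and return how many were written. */
typedef size_t (*filter_fn)(const int *in, size_t n_in, int *out, size_t cap);

/* Example filter: keep only values at or above a fixed threshold. */
size_t keep_large(const int *in, size_t n_in, int *out, size_t cap)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_in && n_out < cap; i++)
        if (in[i] >= 10)
            out[n_out++] = in[i];
    return n_out;
}

/* Example filter: aggregate the whole input into a single sum. */
size_t sum_all(const int *in, size_t n_in, int *out, size_t cap)
{
    int total = 0;
    if (cap == 0)
        return 0;
    for (size_t i = 0; i < n_in; i++)
        total += in[i];
    out[0] = total;
    return 1;
}

/* Run the filters in order, feeding each one's output to the next.
   buf_a and buf_b are scratch buffers of cap elements each; buf_a must
   already hold the n input values. Returns the final buffer and stores
   its length in *n_out. */
int *run_pipeline(const filter_fn *filters, size_t n_filters,
                  int *buf_a, int *buf_b, size_t n, size_t cap, size_t *n_out)
{
    int *in = buf_a, *out = buf_b;
    for (size_t f = 0; f < n_filters; f++) {
        n = filters[f](in, n, out, cap);
        int *tmp = in; in = out; out = tmp;   /* output becomes next input */
    }
    *n_out = n;
    return in;
}
```

In this sketch five input values shrink to one before anything would be sent to a client, which is the data-reduction effect the slide describes.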

SRB/DataCutter
• Support for Range Queries
  • Creation of indices over data sets (composed set of data files)
• Subsetting of data sets
  • Search for files or portions of files that intersect a given range query
  • Restricted filter operations on portions of files (data segments) before returning them to the client (to perform filtering or aggregation to reduce data volume)

SRB/DataCutter Client Interface
• Creating and Deleting Index

int sfoCreateIndex(srbConn *conn, sfoClass class, int catType,
                   char *inIndexName, char *outIndexName, char *resourceName);

int sfoDeleteIndex(srbConn *conn, sfoClass class, int catType, char *indexName);

SRB/DataCutter Client Interface
• Searching Index -- R-tree index

int sfoSearchIndex(srbConn *conn, sfoClass class, char *indexName, void *query,
                   indexSearchResult *myresult, int maxSegCount);

typedef struct {
    int dim;
    double *min, *max;
} rangeQuery;

int sfoGetMoreSearchResult(srbConn *conn, int continueIndex,
                           indexSearchResult *myresult, int maxSegCount);

typedef struct {
    int dim;                  /* bounding box dimensions */
    double *min;              /* minimum in each dimension */
    double *max;              /* maximum in each dimension */
} sfoMBR;                     /* Bounding box structure */

typedef struct {
    sfoMBR segmentMBR;        /* bounding box of the segment */
    char *objID;              /* object in SRB that contains the segment */
    char *collectionName;     /* collection where object is stored */
    unsigned int offset;      /* offset of the segment in the object */
    unsigned int size;        /* size of segment */
} segmentInfo;                /* segment meta-data information */

typedef struct {
    int segmentCount;         /* number of segments returned */
    segmentInfo *segments;    /* segment meta-data information */
    int continueIndex;        /* continuation flag */
} indexSearchResult;          /* search result structure */
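As a small illustration of how the `rangeQuery` and `sfoMBR` structures relate, the sketch below fills both from plain arrays and tests whether a segment's bounding box overlaps a query box. The two typedefs are copied from the slides. `mbr_intersects` is a hypothetical client-side helper, not part of the SRB/DataCutter interface, and says nothing about how `sfoSearchIndex` is implemented.

```c
#include <stddef.h>

/* Types copied from the slides. */
typedef struct { int dim; double *min, *max; } rangeQuery;
typedef struct { int dim; double *min; double *max; } sfoMBR;

/* Hypothetical helper (not part of the SRB/DataCutter interface):
   returns 1 if the bounding box overlaps the query box in every
   dimension, 0 otherwise or if the dimensions differ. */
int mbr_intersects(const sfoMBR *mbr, const rangeQuery *q)
{
    if (mbr->dim != q->dim)
        return 0;
    for (int d = 0; d < q->dim; d++)
        if (mbr->max[d] < q->min[d] || mbr->min[d] > q->max[d])
            return 0;
    return 1;
}
```

Two boxes overlap only if their extents overlap in every dimension, so a single separated dimension is enough to reject a segment.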

Applying Filters

int sfoApplyFilter(srbConn *conn, sfoClass class, char *hostName, int filterID,
                   char *filterArg, int numOfInputSegments, segmentInfo *inputSegments,
                   filterDataResult *myresult, int maxSegCount);

int sfoGetMoreFilterResult(srbConn *conn, int continueIndex,
                           filterDataResult *myresult, int maxSegCount);

typedef struct {
    segmentInfo segInfo;      /* info on segment data buffer after filter oper. */
    char *segment;            /* segment data buffer after filter is applied */
} segmentData;

typedef struct {
    int segmentDataCount;     /* #segments in segmentData array */
    segmentData *segments;    /* segmentData array */
    int continueIndex;        /* continuation flag */
} filterDataResult;

Application: Virtual Microscope
• Interactive software emulation of high power light microscope for processing/visualizing image datasets
• 3-D Image Dataset (100MB to 5GB per focal plane)
• Client-server system organization
• Rectangular region queries, multiple data chunk reply
• pipeline style processing
[Figure: filter pipeline: read_data → decompress → clip → zoom → view]

VM Application using SRB/DataCutter
[Figure: filter placement diagram. Labels: Virtual Microscope Client; Client; Wide Area Network; Local Area Network; Distributed Collection of Workstations; Distributed Storage Resources; SRB/DataCutter; Indexing; read; decompress; clip; zoom; view]
Filters:
• read: read image chunks
• decompress: convert jpeg image chunks into RGB pixels
• clip: clip image to query boundaries
• zoom: sub-sample to the required magnification
• view: stitch image pieces together and display image
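The clip and zoom steps are simple enough to sketch directly. The code below is not the Virtual Microscope implementation: it assumes a row-major image with one byte per pixel (rather than RGB) to stay short, and it assumes "sub-sample" means keeping every step-th pixel. Function names are hypothetical.

```c
#include <stddef.h>

/* Copy the w x h window whose top-left corner is (x0, y0) out of a
   row-major image that is src_w pixels wide. One byte per pixel is an
   assumption made to keep the sketch short. */
void clip(const unsigned char *src, size_t src_w,
          size_t x0, size_t y0, size_t w, size_t h, unsigned char *dst)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            dst[y * w + x] = src[(y0 + y) * src_w + (x0 + x)];
}

/* Sub-sample by keeping every step-th pixel in each direction.
   Returns the output width; the output height is (h + step - 1) / step. */
size_t zoom(const unsigned char *src, size_t w, size_t h,
            size_t step, unsigned char *dst)
{
    size_t out_w = (w + step - 1) / step;
    size_t row = 0;
    for (size_t y = 0; y < h; y += step, row++)
        for (size_t x = 0, col = 0; x < w; x += step, col++)
            dst[row * out_w + col] = src[y * w + x];
    return out_w;
}
```

Both steps shrink the data, so running them near the storage system cuts what must cross the wide area network to the client.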

Experimental Setup
• UMD 10 node IBM SP (1 4CPU, 3 2CPU, 6 1CPU)
• HPSS system (10TB tape storage, 500GB disk cache)
• 4GB JPEG compressed dataset (90GB uncompressed), 180k x 180k RGB pixels (200 x 200 jpeg blocks of 900x900 pixels each)
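The dataset figures can be cross-checked with a few lines of arithmetic. The sketch assumes 3 bytes per RGB pixel (the slide only says "RGB") and hypothetical function names. Under that assumption, 200 blocks of 900 pixels give 180,000 pixels per side, and 180,000 x 180,000 x 3 bytes is 97.2 billion bytes, which is about 90 GB when a GB is counted as 2^30 bytes. Against the 4GB compressed size, that is a compression ratio of roughly 22 to 1.

```c
#include <stdint.h>

/* Pixels along one side: blocks per side times pixels per block side. */
uint64_t side_pixels(uint64_t blocks, uint64_t block_side)
{
    return blocks * block_side;
}

/* Uncompressed size in bytes of a square RGB image, 3 bytes per pixel
   (8 bits per channel is an assumption; the slide only says "RGB"). */
uint64_t rgb_bytes(uint64_t side)
{
    return side * side * 3u;
}
```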