National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids for Collection Federation Reagan W. Moore University of California, San Diego San Diego Supercomputer Center [email protected] http://www.npaci.edu/DICE/

Mar 27, 2015

Page 1: Data Grids for Collection Federation

National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center

Reagan W. Moore
University of California, San Diego
San Diego Supercomputer Center

[email protected]
http://www.npaci.edu/DICE/

Page 2: Massive Data Manipulation

• Analyze an entire sky survey
  – 10 TB per hour, or about 3 GB/sec
  – Requires caching on high-performance disk
  – Requires a teraflop computer (300 operations per byte)
• Challenges
  – 5 million images per hour
  – Latency management requires aggregation of metadata, data, and I/O commands
• Analyze two entire sky surveys
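
The rates quoted on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes), which the slide does not state, and it assumes the full 10 TB consists of the 5 million images when it estimates a mean image size. All variable names are illustrative.

```python
# Back-of-the-envelope check of the figures quoted on the slide.
TB = 10**12  # decimal terabyte (assumption; the slide does not say)
GB = 10**9

survey_bytes_per_hour = 10 * TB   # "10 TB per hour"
ops_per_byte = 300                # "300 operations per byte"
images_per_hour = 5_000_000       # "5 million images per hour"

bytes_per_sec = survey_bytes_per_hour / 3600
ops_per_sec = bytes_per_sec * ops_per_byte
images_per_sec = images_per_hour / 3600
# Assumes the whole 10 TB is made up of the 5 million images.
mean_image_bytes = survey_bytes_per_hour / images_per_hour

print(f"I/O rate:        {bytes_per_sec / GB:.2f} GB/s")
print(f"Compute rate:    {ops_per_sec / 1e12:.2f} x 10^12 operations/s")
print(f"Image rate:      {images_per_sec:.0f} images/s")
print(f"Mean image size: {mean_image_bytes / 1e6:.1f} MB")
```

Under these assumptions the I/O rate is about 2.78 GB/s and the compute rate about 0.83 x 10^12 operations per second, which is consistent with the slide's rounded "3 GB/sec" and "teraflop" figures.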

Page 3: Topics

• Data management systems
  – Data Grids, Digital Libraries, Persistent Archives
• Common data management technology
  – Logical name space, storage abstraction
• Collection federation
  – Knowledge management systems

Page 4: National Virtual Observatory Data Grid

[Figure: architecture diagram of the National Virtual Observatory Data Grid. Labels recoverable from the diagram:]
• Compute Resources, Catalogs, Data Archives
• Information Discovery, Metadata Delivery, Data Discovery, Data Delivery
• Catalog Mediator, Data Mediator
• Bulk Data Analysis, Catalog Analysis, Metadata View, Data View
• Standard Metadata format, Data model, Wire format
• Catalog/Image Specific Access
• Standard APIs and Protocols
• Concept space
• Numbered components: 1. Portals and Workbenches; 2. Knowledge & Resource Management; 4. Grid Security, Caching, Replication, Backup, Scheduling; 7. Derived Collections (the labels for 3, 5, and 6 are not recoverable)

Page 5: Digital Libraries

• Provide services on the data collection
  – Ingestion, loading of attribute values
  – Extensibility, definition of new attributes
  – Discovery, queries on attributes
  – Browsing, hierarchical listing
  – Presentation, formatting specified data models
• Communities
  – Digital library
  – Global Grid Forum, Databases and the Grid working group
  – OMG, Common Warehouse Metamodel

Page 6: Data Grids

• Manage data in a distributed environment
  – Logical name space, provide global identifier
  – Data access, storage system abstraction
  – Replication, disaster backup
  – Uniform access, common API across file systems, archives, and databases
  – Single sign-on, authenticate across administration domains
• Communities
  – Global Grid Forum, data grids
  – Discipline specific data management systems

Page 7: Persistent Archives

• Manage technology evolution
  – Storage system abstraction, support data migration across storage systems
  – Information repository abstraction, support catalog migration to new databases
  – Logical name space, support global persistent identifier
• Communities
  – Persistent archive community
  – Global Grid Forum, Persistent archive working group

Page 8: Common Capabilities

• Logical name space
  – Registration of digital entities
• Storage repository abstraction
  – Operations used to manipulate data in a storage system
• Information repository abstraction
  – Operations used to manipulate a catalog in a database

Page 9: Data Grid (Storage Resource Broker)

• Integration of collection-based management of digital entities, with
  – Remote data access through storage system abstraction
  – Catalog access through information repository abstraction
  – Automation through collection-owned data

Page 10: Storage Abstraction

• Provide common access semantics
  – Archival storage systems
  – File systems
  – Databases
• Support Unix file system operations
  – Map from the interface preferred by your application to the interfaces required by legacy storage systems
• Support database interactions
  – Map from information repository abstraction to database commands
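
The idea of common access semantics over different storage systems can be sketched as a small driver interface. This is a minimal illustration only: `StorageDriver`, `UnixFileDriver`, `InMemoryArchiveDriver`, and `copy` are hypothetical names, not the SRB driver API, and the in-memory class merely stands in for an archival system.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical common interface that every storage system implements."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        ...


class UnixFileDriver(StorageDriver):
    """Maps the common operations onto a local Unix-style file system."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _full(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def read(self, path: str) -> bytes:
        with open(self._full(path), "rb") as handle:
            return handle.read()

    def write(self, path: str, data: bytes) -> None:
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as handle:
            handle.write(data)


class InMemoryArchiveDriver(StorageDriver):
    """Stand-in for an archival storage system (assumption: no real archive here)."""

    def __init__(self) -> None:
        self._objects = {}

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = bytes(data)


def copy(src: StorageDriver, dst: StorageDriver, path: str) -> None:
    """The caller uses one set of operations regardless of the storage type."""
    dst.write(path, src.read(path))


with tempfile.TemporaryDirectory() as root:
    disk = UnixFileDriver(root)
    archive = InMemoryArchiveDriver()
    disk.write("survey/image001.fits", b"pixels")
    copy(disk, archive, "survey/image001.fits")
    print(archive.read("survey/image001.fits"))
```

Because `copy` depends only on the abstract interface, adding a new storage system means writing one more driver class rather than changing the application.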

Page 11: SDSC Storage Resource Broker & Meta-data Catalog (Storage Abstraction)

[Figure: layered architecture diagram of the SDSC Storage Resource Broker and Meta-data Catalog. Labels recoverable from the diagram:]
• Application
• Access APIs: C, C++ libraries; Unix shell; Linux I/O; DLL / Python; Java, NT browsers; GridFTP; Web, WSDL
• Servers: Consistency Management / Authorization-Authentication; Prime Server; Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction: Databases (DB2, Oracle, Sybase)
• Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OS X); HRM

Page 12: Logical Name Space (Data Grid Transparencies)

• Naming transparency - find a data set without knowing its name
  – Map from attributes to a global file name
• Location transparency - access a data set without knowing where it is
  – Map from global file name to local file name
• Access transparency - access a data set without knowing the type of storage system
  – Federated client-server architecture
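
The three transparencies can be read as a chain of lookups: attributes to global name, global name to local name, and one access call that hides the storage type. The sketch below illustrates that chain with in-process dictionaries. Every name, path, and attribute in it is hypothetical, and the dispatch table is a simplification that replaces the federated client-server architecture named on the slide.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical catalog: global (logical) name -> descriptive attributes.
ATTRIBUTES: Dict[str, Dict[str, str]] = {
    "/sky/2mass/img001": {"survey": "2MASS", "band": "J"},
    "/sky/dposs/img042": {"survey": "DPOSS", "band": "F"},
}

# Hypothetical catalog: global name -> (storage type, local file name).
LOCATIONS: Dict[str, Tuple[str, str]] = {
    "/sky/2mass/img001": ("archive", "/hpss/proj7/a0001.dat"),
    "/sky/dposs/img042": ("filesystem", "/export/data/d42.fits"),
}

# Stand-ins for storage-specific access code; each returns a description
# instead of performing real I/O.
READERS: Dict[str, Callable[[str], str]] = {
    "archive": lambda local: f"staged from tape: {local}",
    "filesystem": lambda local: f"read from disk: {local}",
}


def find(**wanted: str) -> List[str]:
    """Naming transparency: map attributes to global names."""
    return sorted(
        name
        for name, attrs in ATTRIBUTES.items()
        if all(attrs.get(key) == value for key, value in wanted.items())
    )


def locate(global_name: str) -> Tuple[str, str]:
    """Location transparency: map a global name to its local name."""
    return LOCATIONS[global_name]


def read(global_name: str) -> str:
    """Access transparency: one call, whatever the storage type is."""
    storage_type, local_name = locate(global_name)
    return READERS[storage_type](local_name)


for name in find(survey="2MASS"):
    print(name, "->", read(name))
```

The caller never supplies a local path or a storage type; both are resolved from the catalog.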

Page 13: Logical Name Space Operations

• Replication
  – One to many mapping from logical name to physical name
• Containers
  – Mapping from logical name to location in a physical container
• Shadow links
  – Registration of user owned data into the collection
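
One way to picture these three operations is as updates to a catalog record keyed by logical name. The following is a minimal sketch under that assumption; `Entry`, `LogicalNameSpace`, the method names, and the example paths are all hypothetical and do not reflect the actual catalog schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Entry:
    """Catalog record for one logical name (hypothetical layout)."""
    replicas: List[str] = field(default_factory=list)  # physical names
    container_slot: Optional[Tuple[str, int, int]] = None  # (container, offset, length)
    shadow_link: bool = False  # True when the data stays under the user's ownership


class LogicalNameSpace:
    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def _entry(self, logical: str) -> Entry:
        return self.entries.setdefault(logical, Entry())

    def replicate(self, logical: str, physical: str) -> None:
        """Replication: add one more physical name for the same logical name."""
        self._entry(logical).replicas.append(physical)

    def pack(self, logical: str, container: str, offset: int, length: int) -> None:
        """Containers: record where the object sits inside a physical container."""
        self._entry(logical).container_slot = (container, offset, length)

    def register_shadow(self, logical: str, user_path: str) -> None:
        """Shadow links: register user-owned data without copying it."""
        entry = self._entry(logical)
        entry.replicas.append(user_path)
        entry.shadow_link = True


ns = LogicalNameSpace()
ns.replicate("/sky/img001", "hpss:/proj7/a0001.dat")
ns.replicate("/sky/img001", "unix:/cache/a0001.dat")
ns.pack("/sky/thumb001", "hpss:/proj7/thumbs.cont", offset=4096, length=512)
ns.register_shadow("/sky/notes.txt", "unix:/home/user/notes.txt")
print(ns.entries["/sky/img001"].replicas)
```

Keeping all three mappings behind the logical name means a client can keep using `/sky/img001` while replicas are added or data is repacked into containers.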

Page 14: SDSC Storage Resource Broker & Meta-data Catalog (Logical Name Space)

[Figure: the SRB layered architecture diagram, carrying the same labels as the diagram on Page 11, under the slide title "Logical Name Space".]

Page 15: Digital Entities

• Digital entities are “images of reality”, made of
  – Data, the bits (zeros and ones) put on a storage system
  – Information, the attributes used to assign semantic meaning to the data
  – Knowledge, the semantic and structural relationships described by a data model
• Every digital entity requires information and knowledge to be correctly interpreted and displayed

Page 16: Types of Digital Entities

• Files
  – Physical files in the collection ID space
  – Shadow links to files in your user ID space
• Directories
  – Shadow links to directories in your user ID space
• Databases
  – Shadow links to tables
  – SQL command strings
• URLs

Page 17: Preservation (Similar requirements to a data grid)

• Name transparency
  – Find a file by attributes (map from attributes to global name)
• Location transparency
  – Access a file by a global identifier (map from global to local file name)
• Access transparency
  – Use same API to access data in archive or file cache
• Authenticity
  – Disaster recovery, replicate data across storage systems
  – Audit and process management

Page 18: SDSC Storage Resource Broker & Meta-data Catalog (Preservation)

[Figure: the SRB layered architecture diagram, carrying the same labels as the diagram on Page 11, under the slide title "Preservation".]

Page 19: Convergence of Technologies

• Data grids as basis for distributed data management
  – Federation of distributed resources
  – Creation of logical name space to automate discovery
• Digital libraries
  – Discovery based on attributes
  – Hierarchical collection management
  – Extensible schema through information repository abstraction
• Persistent archives
  – Data replication
  – Persistence management

Page 20: Storage Resource Broker (SRB)

Data brokered by SDSC instances of SRB**
As of 5/17/2002

Project Instance  Data_size (in GB)  Count (files)  Comments                     Funding Agency
NPACI                      1,972.00      1,083,230  NPACI Users                  NSF/PACI
Digsky                    17,800.00      5,139,249  2MASS, DPOSS, NVO            NSF/ITR
DigEmbryo                    433.00         31,629  Visible Embryo               NLM
HyperLter                    158.00          3,596  HyperSpectral Images         NSF/NPACI (ESS)
Hayden                     6,800.00         41,391  FlyThrough for Planetarium   AMNH/Hayden
Portal                        33.00          5,485  Grid Portal                  NSF/NPACI
SLAC                         514.00         77,168  Protein Crystallography      NSF/NPACI (Alpha)
NARA                           7.00          2,455  Archival Documents           NARA
SIO Exp                       19.20            383  SIO Explorer Documents       NSF/NSDL
ADL                            0.00              6  ADEPT Digital Library        NSF/DLI2
TRA                            5.80             92  Classroom Videos             NSF/NPACI (EOT)
DTF                          239.00          1,766  DTF users                    NSF/TCS
AfCS                          27.00          4,007  Cell Signalling Images/Docs  NIH
TOTAL                     28,008.00      6,390,457

Total: 28 TB, 6.4 million files

** Does not cover data brokered by SRB spaces administered outside SDSC. Does not cover databases; covers only files stored in file systems and archival storage systems.

Page 21: Data Naming Ontologies

Concept space           Discipline concepts
Collection              Discipline attributes
Data grid               Global Identifier
Archive / file systems  Local file name
Data model              Attributes that describe data structure

Page 22: Differentiating between Data, Information, and Knowledge

• Data
  – Digital object
  – Objects are streams of bits
• Information
  – Any tagged data, which is treated as an attribute
  – Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
• Knowledge
  – Relationships between attributes
  – Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
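
The three levels can be made concrete with a small data structure: a byte string for the data, a dictionary of tagged values for the information, and typed links between attribute names for the knowledge. This is an illustrative sketch only. The class names, attribute names, and example relationships are assumptions; only the four relationship kinds come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four relationship kinds listed on the slide.
KINDS = {"procedural/temporal", "structural/spatial", "logical/semantic", "functional"}


@dataclass
class DigitalObject:
    data: bytes                                               # data: a stream of bits
    attributes: Dict[str, str] = field(default_factory=dict)  # information: tagged data


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two attributes."""
    source: str
    target: str
    kind: str

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown relationship kind: {self.kind}")


def related_to(attribute: str, knowledge: List[Relationship]) -> List[Relationship]:
    """Return every relationship that mentions the given attribute."""
    return [r for r in knowledge if attribute in (r.source, r.target)]


image = DigitalObject(
    data=b"\x00\x01\x02",
    attributes={"ra": "10.68", "dec": "41.27", "calibrated_from": "raw-0042"},
)
knowledge = [
    Relationship("ra", "dec", "structural/spatial"),
    Relationship("calibrated_from", "ra", "procedural/temporal"),
]
print([r.kind for r in related_to("ra", knowledge)])
```

The bytes alone say nothing about position on the sky; the attributes name the values, and the relationships record how those values fit together.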

Page 23: Knowledge Creation Roadmap

• Knowledge syntax (consensus)
  – RDF, XMI, Topic Map
• Knowledge management (recursive operations)
  – Oracle parallel database
• Knowledge manipulation (spatial/procedural rules)
  – Generation of inference rules and mapping to data models
• Knowledge generation (scalable inference engine)
  – Application of inference rules in inference engine
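
To make "application of inference rules in inference engine" concrete, here is a toy forward-chaining loop over (subject, relation, object) facts. It is a sketch under stated assumptions, not the engine the roadmap refers to: the fact format, the `transitive` rule builder, the `infer` function, and the example facts are all hypothetical, and nothing here addresses scalability.

```python
from typing import Callable, Iterable, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)
Rule = Callable[[Set[Fact]], Iterable[Fact]]


def transitive(relation: str) -> Rule:
    """Build a rule: (a r b) and (b r c) together imply (a r c)."""
    def rule(facts: Set[Fact]) -> Iterable[Fact]:
        for a, r1, b in facts:
            for b2, r2, c in facts:
                if r1 == r2 == relation and b == b2:
                    yield (a, relation, c)
    return rule


def infer(facts: Set[Fact], rules: Iterable[Rule]) -> Set[Fact]:
    """Apply every rule repeatedly until no new fact appears."""
    known = set(facts)
    rule_list = list(rules)
    while True:
        derived = {fact for rule in rule_list for fact in rule(known)} - known
        if not derived:
            return known
        known |= derived


facts = {
    ("galaxy", "is_a", "extended_source"),
    ("extended_source", "is_a", "astronomical_object"),
}
closure = infer(facts, [transitive("is_a")])
print(("galaxy", "is_a", "astronomical_object") in closure)
```

The loop stops at a fixed point, so chains of any length are followed without the caller specifying how many passes to run.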

Page 24: Knowledge Based Data Grid Roadmap

[Figure: grid diagram with rows Knowledge, Information, and Data, and columns Ingest Services, Management, and Access Services. Labels recoverable from the diagram:]
• Ingest side: Relationships Between Concepts; Attributes, Semantics; Fields, Containers, Folders
• Management: Knowledge Repository for Rules; Information Repository; Storage (Replicas, Persistent IDs)
• Access side: Knowledge or Topic-Based Query / Browse; Attribute-based Query; Feature-based Query
• Layer annotations: (Model-based Access); (Data Handling System - SRB)
• Vertical interface labels: MCAT/HDF; Grids; XML DTD; SDLIP; XTM DTD; Rules - KQL