Page 1: Distributed Metadata with the  AMGA Metadata Catalog

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org CERN

Distributed Metadata with the AMGA Metadata Catalog

Nuno Santos, Birger Koblitz

20 June 2006 – Workshop on Next-Generation Distributed Data Management

Page 2: Distributed Metadata with the  AMGA Metadata Catalog


Abstract

• Metadata Catalogs on Data Grids – The case for replication

• The AMGA Metadata Catalog

• Metadata Replication with AMGA

• Benchmark Results

• Future Work/Open Challenges

Page 3: Distributed Metadata with the  AMGA Metadata Catalog


Metadata Catalogs

• Metadata on the Grid
  – File Metadata – describe files with application-specific information. Purpose: file discovery based on their contents
  – Simplified Database Service – store generic structured data on the Grid. Not as powerful as a DB, but easier to use and with better Grid integration (security, hides DB heterogeneity)

• Metadata Services are essential for many Grid applications

• Must be accessible Grid-wide

But Data Grids can be large…

Page 4: Distributed Metadata with the  AMGA Metadata Catalog


An Example - The LCG Sites

• LCG – LHC Computing Grid
  – Distributes and processes the data generated by the LHC (Large Hadron Collider) at CERN
  – ~200 sites and ~5,000 users worldwide

Taken from: http://goc03.grid-support.ac.uk/googlemaps/lcg.html

Page 5: Distributed Metadata with the  AMGA Metadata Catalog


Challenges for Catalog Services

• Scalability
  – Hundreds of grid sites
  – Thousands of users

• Geographical Distribution
  – Network latency

• Dependability
  – In a large and heterogeneous system, failures will be common

• A centralized system does not meet the requirements
  – Distribution and replication are required

Page 6: Distributed Metadata with the  AMGA Metadata Catalog


Off-the-shelf DB Replication?

• Most DB systems have DB replication mechanisms
  – Oracle Streams, Slony for PostgreSQL, MySQL replication

[Figure: two Metadata Catalogs connected by DB-level replication]

• Example: 3D Project at CERN (Distributed Deployment of Databases)
  – Uses Oracle Streams for replication
  – Being deployed only at a few LCG sites (~10 sites, Tier-0 and Tier-1s)
  – Requires Oracle ($$$) and expert on-site DBAs ($$$) – most sites don't have these resources

• Off-the-shelf replication is vendor-specific
  – But Grids are heterogeneous by nature
  – Sites have different DB systems available

Only a partial solution to the problem of metadata replication

Page 7: Distributed Metadata with the  AMGA Metadata Catalog


Replication in the Catalog

• Alternative we are exploring: replication in the Metadata Catalog

• Advantages
  – Database independent
  – Metadata-aware replication
    More efficient – replicate metadata commands
    Better functionality – partial replication, federation
  – Ease of deployment and administration
    Built into the Metadata Catalog
    No need for a dedicated DB admin

• The AMGA Metadata Catalogue is the basis for our work on replication

[Figure: metadata commands replicated directly between two Metadata Catalogs]

Page 8: Distributed Metadata with the  AMGA Metadata Catalog


The AMGA Metadata Catalog

• Metadata Catalog of the gLite Middleware (EGEE)

• Several groups of users among the EGEE community:
  – High Energy Physics
  – Biomed

• Main features
  – Dynamic schemas
  – Hierarchical organization
  – Security:
    Authentication: user/pass, X.509 certificates, GSI
    Authorization: VOMS, ACLs

[Figure: AMGA server receiving metadata commands and storing them in metadata tables]

Page 9: Distributed Metadata with the  AMGA Metadata Catalog


AMGA Implementation

• C++ implementation

• Back-ends
  – Oracle, MySQL, PostgreSQL, SQLite

• Front-end – TCP streaming
  – Text-based protocol like TELNET, SMTP, POP…

[Figure: client connecting over TCP streaming to the MDServer front-end, which talks to PostgreSQL, MySQL, Oracle or SQLite back-ends]

• Examples:

  Adding data:
  addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'

  Retrieving data:
  selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album 'like(/DLAudio:FILE, "%.mp3")'
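Because the front-end is plain TCP text streaming, any client that can open a socket and write command lines can talk to the server. The sketch below is illustrative only: the host name, port and the assumption that each command gets a single readable reply are placeholders, not the real AMGA wire protocol, and authentication (user/pass, X.509, GSI) is ignored entirely.

# Illustrative only: a hand-rolled client for a line-based text protocol,
# in the spirit of the AMGA TCP streaming front-end. Endpoint and reply
# handling are assumptions; security is not handled at all.
import socket

HOST, PORT = "amga.example.org", 8822      # hypothetical endpoint

def send_command(sock, command):
    """Send one text command and return the raw server reply."""
    sock.sendall((command + "\n").encode())
    return sock.recv(65536).decode()

with socket.create_connection((HOST, PORT)) as sock:
    # Adding data (same command as on the slide)
    print(send_command(sock,
        "addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' "
        "/DLAudio:Album 'Latest Hits'"))
    # Retrieving data
    print(send_command(sock,
        "selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album "
        "'like(/DLAudio:FILE, \"%.mp3\")'"))

A real deployment would of course go through AMGA's own client tools rather than raw sockets; the sketch only shows the text-command style.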

Page 10: Distributed Metadata with the  AMGA Metadata Catalog


Standalone Performance

• Single server scales well up to 100 concurrent clients

• Could not go past 100. Limited by the database

• WAN access one to two orders of magnitude slower than LAN

Replication can solve both bottlenecks

Page 11: Distributed Metadata with the  AMGA Metadata Catalog


Metadata Replication with AMGA

Page 12: Distributed Metadata with the  AMGA Metadata Catalog


Requirements of EGEE Communities

• Motivation: requirements of EGEE's user communities – mainly HEP and Biomed

• High Energy Physics (HEP)
  – Millions of files, 5,000+ users distributed across 200+ computing centres
  – Mainly (read-only) file metadata
  – Main concerns: scalability, performance and fault-tolerance

• Biomed
  – Manage medical images on the Grid
    Data produced in a distributed fashion by laboratories and hospitals
    Highly sensitive data: patient details
  – Smaller scale than HEP
  – Main concern: security

Page 13: Distributed Metadata with the  AMGA Metadata Catalog


Metadata Replication

Some replication models:
• Full replication
• Partial replication
• Federation
• Proxy

[Figure: metadata commands and redirected commands flowing between catalogs in each model]
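To make the federation and proxy models concrete, here is an illustrative sketch (not AMGA code) of a front-end that routes each metadata command to the catalog owning that part of the namespace, so clients see a single shared namespace; the prefix-to-endpoint mapping and the amga:// endpoints are invented for the example.

# Hypothetical federation proxy: route each command to the catalog that
# owns the namespace prefix of its first path argument.
CATALOGS = {
    "/DLAudio": "amga://cern.example.org",
    "/Biomed":  "amga://hospital.example.org",
}

def route(command):
    """Pick the target catalog from the first path argument of the command."""
    path = command.split()[1]                     # e.g. "/DLAudio/song.mp3"
    for prefix, endpoint in CATALOGS.items():
        if path.startswith(prefix):
            return endpoint
    raise LookupError(f"no federated catalog owns {path}")

print(route("addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith'"))
# -> amga://cern.example.org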

Page 14: Distributed Metadata with the  AMGA Metadata Catalog


Architecture

• Main design decisions (see the sketch below)
  – Asynchronous replication – tolerates high latencies and provides fault tolerance
  – Partial replication – replicate only what is interesting for the remote users
  – Master-slave – writes are only allowed on the master, but mastership is granted per metadata collection, not per node

[Figure: AMGA server and replication daemon, showing metadata commands, update logs, local updates, remote updates and the metadata tables]
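A minimal sketch of the master-side flow these decisions imply, assuming a simple in-memory update log and per-slave collection subscriptions; the class and field names are illustrative, not AMGA's actual internals.

# Sketch: master records each applied write in an update log, filtered by
# which collections each slave subscribed to (partial replication); the
# replication daemon later ships the pending updates asynchronously.
from collections import deque

class ReplicationLog:
    def __init__(self, subscriptions):
        # subscriptions: slave name -> set of collections it replicates
        self.subscriptions = subscriptions
        self.pending = {slave: deque() for slave in subscriptions}

    def record(self, collection, command):
        """Called after the master applies a write: log it for interested slaves only."""
        for slave, collections in self.subscriptions.items():
            if collection in collections:
                self.pending[slave].append((collection, command))

    def ship(self, slave, send):
        """Invoked later by the replication daemon to push pending updates to one slave."""
        queue = self.pending[slave]
        while queue:
            send(*queue.popleft())

# Usage: apply a command locally, record it, and let the daemon ship it later.
log = ReplicationLog({"slave1": {"/DLAudio"}, "slave2": {"/Biomed"}})
log.record("/DLAudio", "addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith'")
log.ship("slave1", lambda coll, cmd: print(f"-> slave1 [{coll}]: {cmd}"))

Partial replication falls out of the subscription filter in record(), and asynchrony comes from ship() being driven by the replication daemon rather than by the client's write.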

Page 15: Distributed Metadata with the  AMGA Metadata Catalog


Status

• Initial implementation completed
  – Available functionality:
    Full and partial replication
    Chained replication (master → slave1 → slave2)
    Federation – basic support

• Data is always copied to the slave
  – Cross-DB replication: PostgreSQL → MySQL tested
  – Other combinations should work (give or take some debugging)

• Available as part of AMGA

Page 16: Distributed Metadata with the  AMGA Metadata Catalog


Benchmark Results

Page 17: Distributed Metadata with the  AMGA Metadata Catalog


Benchmark Study

• Investigate the following:
  1) Overhead of replication and scalability of the master
  2) Behaviour of the system under faults

Page 18: Distributed Metadata with the  AMGA Metadata Catalog


Scalability

• Small increase in CPU usage as the number of slaves increases
  – 10 slaves: 20% increase over standalone operation

• Number of update logs sent scales almost linearly

• Setup:
  – Insertion rate at master: 90 entries/s
  – Total: 10,000 entries
  – 0 slaves: replication updates are saved but not shipped (slaves disconnected)

[Figure: benchmark setup with one master and 10 slaves]

Page 19: Distributed Metadata with the  AMGA Metadata Catalog


Fault Tolerance

• The next test illustrates the fault-tolerance mechanisms

• Slave fails
  – Master keeps the updates for the slave
  – Replication log grows

• Slave reconnects
  – Master sends the pending updates
  – Eventually the system recovers to a steady state with the slave up to date

• Test conditions:
  – Insertion rate at master: 50 entries/s
  – Total: 20,000 entries
  – Two slaves, both start connected
  – Slave1 disconnects temporarily

[Figure: test setup with one master and two slaves]

Page 20: Distributed Metadata with the  AMGA Metadata Catalog


Fault Tolerance and Recovery

• While slave1 is disconnected, the replication log grows
  – The log is limited in size: a slave that does not reconnect in time is unsubscribed

• After the slave reconnects, the system recovers in around 60 seconds
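An illustrative sketch of the bounded replication log and unsubscription behaviour described above, assuming one pending-update queue per slave and a made-up size cap; this is a reading of the slides, not the actual AMGA implementation.

# Sketch: pending updates are queued per slave while it is down; if the
# queue exceeds a cap, the slave is unsubscribed, otherwise the updates
# are replayed in order when it reconnects.
from collections import deque

MAX_PENDING = 10_000   # assumed cap on pending updates kept per slave

class BoundedSlaveLog:
    def __init__(self):
        self.pending = deque()
        self.subscribed = True

    def append(self, update):
        if not self.subscribed:
            return                           # unsubscribed slaves receive nothing
        self.pending.append(update)
        if len(self.pending) > MAX_PENDING:
            # the slave stayed away too long: drop its log and unsubscribe it
            self.pending.clear()
            self.subscribed = False

    def replay(self, send):
        """On reconnection, ship every pending update in order."""
        while self.pending:
            send(self.pending.popleft())

# Usage: while slave1 is down, append() keeps (or eventually drops) its updates;
# when it reconnects, replay() brings it back to an up-to-date steady state.
log = BoundedSlaveLog()
for i in range(3):
    log.append(f"update {i}")
log.replay(print)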

Page 21: Distributed Metadata with the  AMGA Metadata Catalog


Future Work/Open Challenges

Page 22: Distributed Metadata with the  AMGA Metadata Catalog


Scalability

• Support hundreds of replicas
  – HEP use case. Extreme case: one replica catalog per site

• Challenges
  – Scalability
  – Fault tolerance – tolerate failures of slaves and of the master

• The current method of shipping updates (direct streaming) might not scale
  – Chained replication (divide and conquer)
    Already possible with AMGA; performance needs to be studied
  – Group communication
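A toy sketch of the chained (tree-shaped) replication idea above: each catalog forwards updates to a small number of children, so the master's fan-out stays bounded regardless of how many replicas exist. The branching factor and node names are assumptions for illustration.

# Sketch: arrange replicas in a tree rooted at the master and propagate
# each update down the chain, node by node.
BRANCHING = 4   # each catalog streams updates to at most 4 children (assumption)

def build_tree(nodes):
    """Arrange replica catalogs into a tree rooted at the master (nodes[0])."""
    tree = {node: [] for node in nodes}
    for i in range(1, len(nodes)):
        parent = nodes[(i - 1) // BRANCHING]   # heap-style parent index
        tree[parent].append(nodes[i])
    return tree

def propagate(tree, node, update, send):
    """Each node applies an update, then forwards it to its own slaves."""
    for child in tree[node]:
        send(node, child, update)
        propagate(tree, child, update, send)

# Example: 1 master + 12 slaves; the master streams directly to only 4 nodes.
nodes = ["master"] + [f"slave{i}" for i in range(1, 13)]
tree = build_tree(nodes)
propagate(tree, "master", "addentry ...",
          lambda src, dst, upd: print(f"{src} -> {dst}: {upd}"))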

Page 23: Distributed Metadata with the  AMGA Metadata Catalog


Federation

• Federation of independent catalogs
  – Biomed use case

• Challenges
  – Provide a consistent view over the federated catalogs
  – Shared namespace
  – Security – trust management, access control and user management

• Ideas

Page 24: Distributed Metadata with the  AMGA Metadata Catalog


Conclusion

• Replication of Metadata Catalogues is necessary for Data Grids

• We are exploring replication at the Catalogue level, using AMGA

• Initial implementation completed
  – First results are promising

• Currently working on improving scalability and on federation

• More information about our current work: http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/