Lecture 4: Grid Data Management
Jaime Frey, UW-Madison Condor Group
Slides prepared in part by Scott Koranda, UW-Milwaukee & NCSA
Grid Summer Workshop, June 21-25, 2004

Extensions include
  • Strong authentication, encryption via Globus GSI
  • Multiple, parallel data channels
  • Third-party transfers
  • Tunable network & I/O parameters
  • Server-side processing, command pipelining

Necessary Semantics…
  • GridFTP is the protocol
  • A server or client that implements the GridFTP protocol is GridFTP-enabled or Grid-enabled
  • You will often hear “the GridFTP server…” or “the GridFTP client…”
  • Correct is “the GridFTP-enabled server from the Globus team” or the particular client being used
  • Let it slide: the slang is easier to use, but
  • the distinction will matter more as groups outside of Globus release GridFTP-enabled clients & servers

GridFTP Server
  • Built on top of wuftpd, our old friend
    • A brand new server written from scratch is in beta now…
  • Most configuration details are the same as for wuftpd
  • Runs as an inetd (xinetd) service
    1. Connection is attempted on port 2811
    2. Xinetd looks up the port in /etc/services and finds the responsible service
    3. Xinetd starts the service according to its configuration, with data from the connection sent on stdin

Debugging
  • Use -dbg to see control channel communication

    $ globus-url-copy -dbg gsiftp://hydra.phys.uwm.edu/tmp/file1 file:/tmp/file1
    debug: starting to get gsiftp://hydra.phys.uwm.edu/tmp/file1
    debug: connecting to gsiftp://hydra.phys.uwm.edu/tmp/file1
    debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
    220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
    debug: authenticating with gsiftp://hydra.phys.uwm.edu/tmp/file1
    debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
    230 User skoranda logged in.
    debug: sending command:
    FEAT
    debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
    211-Extensions supported:
     REST STREAM
     ESTO
     ERET
     MDTM
     SIZE
     PARALLEL
     DCAU
    211 END
    <snip>

Globus-url-copy
  • Actually a general-purpose URL copying tool
    • No GSI authentication used
    • Parallel channels and the like won’t work

GridFTP clients
  • UberFTP
    • developed and supported at the National Center for Supercomputing Applications (NCSA)
    • interactive, like our old (insecure) friend ‘ftp’
    • use -a GSI for GSI authentication
    • supports multiple channels using the -c flag

    $ uberftp -H hydra.phys.uwm.edu -a GSI
    220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
    230 User skoranda logged in.
    uberftp>

GridFTP clients
  • “Roll your own”
    • Add functionality directly to your applications
      • Should your application find and download its own data?
      • Should your application deliver output data files when finished computing?
    • Globus Toolkit offers APIs to code against
      • C
      • Java
      • Python

GridFTP and Firewalls
  • Nice document by Globus team at
  • Tip: when debugging GridFTP and firewalls, remember which way connections are established
    • 1 single data channel
      • data connection established from client to server
    • 2 or more data channels
      • data connection established in the direction the data will flow
    • control connection always established from client to server
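
The tip above can be written down as a small lookup, which may help when deciding which side's firewall has to accept inbound data connections. This is only a sketch of the rule as stated on the slide; the function name and the "client"/"server" strings are illustrative and not part of any Globus API.

```python
def data_connection_initiator(num_data_channels, data_sender):
    """Return which side opens the data connection(s), per the slide's rule.

    num_data_channels: number of data channels used for the transfer.
    data_sender: "client" or "server", whichever side the data flows from.
    """
    if data_sender not in ("client", "server"):
        raise ValueError("data_sender must be 'client' or 'server'")
    if num_data_channels < 1:
        raise ValueError("need at least one data channel")
    if num_data_channels == 1:
        # Single data channel: established from client to server.
        return "client"
    # 2 or more channels: established in the direction the data will flow,
    # so the sending side opens the connections.
    return data_sender


# The control connection is always opened by the client.
CONTROL_CONNECTION_INITIATOR = "client"
```

For example, a download with four parallel channels has the server as sender, so the server opens the data connections and the client side must allow them in.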

Hints for Experts
To make GridFTP go really fast
  • use fast disks/filesystems
    • filesystem should read/write > 30 MB/second
  • configure TCP for performance
    • See TCP Tuning Guide at http://www-didc.lbl.gov/TCP-tuning/
  • patch your Linux kernel with web100 patch
    • See http://www.web100.org
    • Important work-around for Linux TCP “feature”
  • understand your network path
Three Data Questions on the Grid
1. What data/files exist?
2. What data/files are where?
3. How do I move data/files from A to B?

What data/files are where?
Requirements
  • Catalog 10^8 files and their locations
    • What files are where (possibly at more than one place)
    • Across multiple sites within a Grid
    • Mappings from logical filenames (LFNs) to physical filenames (PFNs) or URLs
  • No single point of failure
    • No central catalog/server to be a single point of failure

Globus Replica Location Service
  • Globus RLS
  • Each RLS server usually runs two catalogs
    • LRC
      • Local replica catalog
      • Catalog of what files you have (LFNs) and mappings to URL(s) or PFNs
    • RLI
      • Replica location index
      • Catalog of which files (LFNs) other LRCs in your data grid know about
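
The two-catalog idea can be sketched with plain dictionaries: the LRC maps an LFN to the PFNs held locally, and the RLI maps an LFN to the other sites whose LRCs list it. The class and function names below are hypothetical and are not the RLS API; the site names and the example URL are borrowed from elsewhere in these slides.

```python
class RlsSite:
    """One hypothetical RLS server with its two catalogs."""

    def __init__(self, name):
        self.name = name
        self.lrc = {}  # LFN -> set of PFNs/URLs held at this site
        self.rli = {}  # LFN -> set of names of other sites whose LRC lists it

    def publish_to(self, other):
        """Tell another site's RLI which LFNs this site's LRC knows about."""
        for lfn in self.lrc:
            other.rli.setdefault(lfn, set()).add(self.name)


def locate(lfn, start, sites):
    """Find PFNs for an LFN: check the local LRC, then follow the RLI."""
    pfns = set(start.lrc.get(lfn, ()))
    for site_name in start.rli.get(lfn, ()):
        pfns |= sites[site_name].lrc.get(lfn, set())
    return sorted(pfns)


milwaukee = RlsSite("milwaukee")
mit = RlsSite("mit")
sites = {site.name: site for site in (milwaukee, mit)}

mit.lrc["H-R-714024224-16.gwf"] = {
    "gsiftp://someserver/path/to/H-R-714024224-16.gwf"
}
mit.publish_to(milwaukee)

found = locate("H-R-714024224-16.gwf", milwaukee, sites)
```

Here `found` holds the one URL registered at the other site, even though the query started at a site whose own LRC has no entry for that LFN.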

Globus RLS
  • Network of RLS servers inform each other
  • Each site has an LRC with mappings of LFNs to PFNs
    • usually contains the “local” mappings for files located at the site
    • Site at Milwaukee might have this mapping in its LRC

  • Uses a host certificate to identify itself
    • must run as root if the host cert is owned by root
    • often copy host cert/key to another non-root, limited-privilege account and configure the server to use that copy

Globus RLS: Server Perspective
  • Mappings LFNs → PFNs kept in a database
    • Uses a generic ODBC interface to talk to any (good) RDBMS
      • MySQL, PostgreSQL, Oracle, DB2, ...
    • All RDBMS details hidden from administrator and user
      • well, not quite
      • RDBMS may need to be “tuned” for performance
      • but one can start off knowing very little about RDBMSs

Globus RLS: Server Perspective
  • Mappings LFNs → LRCs stored in 1 of 2 ways
    • table in database
      • full, complete listing from the LRCs that update your RLI
      • requires each LRC to send your RLI a full, complete list
        • as the number of LFNs in the catalog grows, this becomes substantial
        • 10^8 filenames at 64 bytes per filename ~ 6 GB
    • in memory in a special hash called a Bloom filter
      • 10^8 filenames stored in as little as 256 MB
      • easy for an LRC to create a Bloom filter and send it over the network to RLIs
      • can cause the RLI to lie when asked if it knows about an LFN
        • only false positives
        • tunable error rate
        • acceptable in many contexts
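
To make the numbers concrete: 10^8 filenames at 64 bytes each is 6.4 × 10^9 bytes, while a 256 MB Bloom filter gives roughly 21 bits per filename. The sketch below is a generic textbook Bloom filter, not the RLS implementation; the class name, the SHA-256 based hashing, and the parameter choices are all assumptions made for illustration. It shows why an answer of "no" is always right and an answer of "yes" can occasionally be wrong.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (not the RLS implementation)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Sizes quoted on the slide.
full_list_bytes = 10**8 * 64                 # 6,400,000,000 bytes
bits_per_lfn = 256 * 2**20 * 8 / 10**8       # about 21.5 bits per filename

lrc_filter = BloomFilter(num_bits=8192, num_hashes=4)
lrc_filter.add("H-R-714024224-16.gwf")
```

More bits per stored name lowers the false-positive rate, which is the "tunable error rate" mentioned above. A filter cannot be searched by pattern, because only hashed bit positions are stored, not the names themselves.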

Globus RLS: Configuring the Server
  • Single configuration file
    • usually $GLOBUS_LOCATION/etc/globus-rls-server.conf
  • Send the server a HUP signal to refresh its configuration
    • kill -SIGHUP <pid>
  • Access control
    • each “client” is given one or more of
      • lrc_read : permission to query the LRC for mappings
      • lrc_update : permission to add new mappings in the LRC
      • rli_read : permission to query the RLI for mappings
      • rli_update : permission to inform the RLI of remote LRC mappings
      • stats : permission to query the server for statistics
      • admin : permission to change configuration on the fly

Globus RLS: Configuring the Server
  • Access control
    • access given to certificate subject
        acl /DC=org/DC=doegrids/OU=People/CN=Scott Koranda: lrc_read
    • access given to UID mapped in grid-mapfile
      • which grid-mapfile is examined is controlled by the GRIDMAP environment variable
        acl skoranda: lrc_read
    • must give remote LRCs permission to update your RLI
      • remote RLS server uses its host certificate to identify itself
        acl /DC=org/DC=doegrids/OU=Services/CN=ldas.mit.edu: rli_update
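
As a way to reason about what these lines grant, the sketch below parses `acl` lines of the form shown above and answers "may this identity do that?". It is a reading aid, not the server's logic: the function names are hypothetical, and matching identities by exact string comparison is an assumption (the real server may match subjects differently).

```python
# Permission names as listed on the access-control slide.
PERMISSIONS = {"lrc_read", "lrc_update", "rli_read", "rli_update", "stats", "admin"}

EXAMPLE_CONFIG = """\
acl /DC=org/DC=doegrids/OU=People/CN=Scott Koranda: lrc_read
acl skoranda: lrc_read
acl /DC=org/DC=doegrids/OU=Services/CN=ldas.mit.edu: rli_update
"""


def parse_acls(config_text):
    """Map each identity (certificate subject or local user) to its permissions."""
    acls = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line.startswith("acl "):
            continue  # ignore every other configuration line
        # Split on the last ':' so subjects containing '=' and '/' stay intact.
        identity, colon, permission_text = line[len("acl "):].rpartition(":")
        if not colon:
            raise ValueError(f"missing ':' in acl line: {line!r}")
        granted = set(permission_text.split())
        unknown = granted - PERMISSIONS
        if unknown:
            raise ValueError(f"unknown permissions: {sorted(unknown)}")
        acls.setdefault(identity.strip(), set()).update(granted)
    return acls


def is_allowed(acls, identity, permission):
    """Assumption: identities are compared as exact strings."""
    return permission in acls.get(identity, set())


acls = parse_acls(EXAMPLE_CONFIG)
```

With the three lines above, `is_allowed(acls, "skoranda", "lrc_read")` is true, while the same user has no `admin` permission.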

Globus RLS: Configuring the Server
  • globus-rls-admin tool for configuration
    • need GSI credential to talk to the server
    • must have an acl with admin privileges for your credential
    • manual page is available

    NAME
        globus-rls-admin - Replica Location Service Administration
    SYNOPSIS
        globus-rls-admin -A|-a|-C option value|-c option|-D|-d|-e|-p|-q|-r|-S|-s|-t timeout|-u|-v [ rli ] [ pattern ] [ server ]
    DESCRIPTION
        The program globus-rls-admin performs administrative operations on a RLS server (see globus-rls-server(8)).

    • ping the server to see if it is alive

    $ globus-rls-admin -p rls://localhost
    ping rls://localhost: 0 seconds

Globus RLS: Configuring the Server
  • Query the server for statistics

    $ globus-rls-admin -S rls://localhost
    Version: 2.1.5
    Uptime: 02:46:19
    LRC stats
      update method: lfnlist
      update method: bloomfilter
      updates bloomfilter: rls://mini.astro.cf.ac.uk:39281 last 06/15/04 11:39:12
      updates bloomfilter: rls://ygraine.aei.mpg.de:39281 last 12/31/69 18:00:00
      updates bloomfilter: rls://ldas-cit.ligo.caltech.edu:39281 last 12/31/69 18:00:00
      lfnlist update interval: 86400
      bloomfilter update interval: 900
      numlfn: 4110878
      numpfn: 12328767
      nummap: 12328775
    RLI stats
      updated by: rls://mini.astro.cf.ac.uk:39281 last 06/15/04 11:47:56
      updated by: rls://ygraine.aei.mpg.de:39281 last 06/15/04 11:25:23
      updated by: rls://ldas-cit.ligo.caltech.edu:39281 last 06/15/04 11:43:31
      updated via bloomfilters

Globus RLS: Configuring the Server
  • Tell the LRC which remote RLIs to update
    • local LRC should update the RLI at MIT using a Bloom filter
    • use -a if updating via lists rather than a Bloom filter

Globus RLS: Client Perspective
Two ways for clients to interact with an RLS server
  • globus-rls-cli
    • simple command-line tool
      • query
      • create new mappings
  • “roll your own” client by coding against the API
    • Java
    • C
    • Python

Globus-rls-cli
  • Simple query to the LRC to find a PFN for an LFN
  • Note that more than 1 PFN may be returned

    $ globus-rls-cli query lrc lfn H-R-714024224-16.gwf rls://dataserver:39281

    Operation is unsupported: Wildcard searches with Bloom filters

Globus-rls-cli
RLS with Bloom filter updates to RLI
  • fast and efficient
  • Bloom filter is a hash of the information in an LRC
  • remote LRC creates a Bloom filter and sends it to the RLI
  • RLI can test whether a particular LFN is in the LRC’s Bloom filter
    • can’t do a wildcard search
    • will sometimes lie!
      • only false positives
      • if you can’t have any false positives, use full list updates

Globus-rls-cli
  • Create new LFN → PFN mappings
    • use create to create the 1st mapping for an LFN

    $ globus-rls-cli create file1 gsiftp://dataserver/file1 rls://dataserver

    • use add to add more mappings for an LFN

    $ globus-rls-cli add file1 file://dataserver/file1 rls://dataserver

    • use delete to remove a mapping for an LFN
      • when the last mapping for an LFN is deleted, the LFN is also deleted
      • cannot have an LFN in the LRC without a mapping
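
The create/add/delete rules can be modelled in a few lines. This is an in-memory toy, not the RLS client API: the class name is hypothetical, and raising an error when `create` is called for an existing LFN or `add` for an unknown one is an assumption about how a catalog would enforce the "create first, then add" rule.

```python
class LocalReplicaCatalog:
    """Toy LRC: every LFN present has at least one PFN mapped to it."""

    def __init__(self):
        self._mappings = {}  # LFN -> set of PFNs

    def create(self, lfn, pfn):
        """Register the first mapping for a new LFN."""
        if lfn in self._mappings:
            raise KeyError(f"LFN already exists: {lfn}")
        self._mappings[lfn] = {pfn}

    def add(self, lfn, pfn):
        """Register a further mapping for an existing LFN."""
        if lfn not in self._mappings:
            raise KeyError(f"LFN does not exist: {lfn}")
        self._mappings[lfn].add(pfn)

    def delete(self, lfn, pfn):
        """Remove one mapping; drop the LFN when its last mapping goes."""
        pfns = self._mappings[lfn]
        pfns.remove(pfn)
        if not pfns:
            del self._mappings[lfn]

    def query(self, lfn):
        return sorted(self._mappings.get(lfn, ()))

    def __contains__(self, lfn):
        return lfn in self._mappings


lrc = LocalReplicaCatalog()
lrc.create("file1", "gsiftp://dataserver/file1")
lrc.add("file1", "file://dataserver/file1")
both = lrc.query("file1")

lrc.delete("file1", "gsiftp://dataserver/file1")
lrc.delete("file1", "file://dataserver/file1")
```

After the second delete the LFN itself is gone, which mirrors the rule that an LFN cannot exist in the LRC without a mapping.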

Globus-rls-cli
LRC can also store attributes about LFNs and PFNs
  • size of an LFN in bytes?
  • md5 checksum for an LFN?
  • ranking for a PFN or URL?
  • extensible... you choose which attributes to create and add
  • can search the catalog on the attributes
  • attributes limited to
    • strings
    • integers
    • floating point (double)
    • date/time

Globus-rls-cli
  • Create the attribute first, then add values for LFNs

    $ globus-rls-cli attribute define md5checksum lfn string
Three Data Questions on the Grid
1. What data/files exist?
2. What data/files are where?
3. How do I move data/files from A to B?

Metadata Catalog
  • Metadata catalog
    • stores data about... data!
    • helps answer the question of what data exists
  • MCS from Globus
    • still a research project
    • one realization of a metadata catalog
    • other projects offer solutions with different capabilities and limitations
    • very active research on what type of service a metadata catalog should offer
      • how should metadata information flow from site to site?
      • is there a single solution for most uses on the Grid?

Metadata Catalog
  • One scenario useful in a Data Grid
    • data generated/collected into files at some detector site
    • location of data files published into RLS
        H-R-714024224-16.gwf → gsiftp://someserver/path/to/H-R-714024224-16.gwf
    • existence of data files and important metadata published into metadata catalog
        H-R-714024224-16.gwf →
          data from detector in Hanford, WA
          raw data file contains all data (no downsampling)
          data starts at GPS time 714024224
          file contains 16 seconds of data
          detector was in “science” mode with good noise properties
          a simulated pulsar signal was being injected at the time
          the operator on duty was D. Brown
          the calibration parameters are = 1.5643 and = 2.22984
          and so on...

Metadata Catalog
  • To run an application that analyzes the data on the Grid
    1. Query the metadata catalog for LFNs that contain data of interest
       Q: “Show me files where interferometer was locked and calibration had < 1.6

Summary
  • Metadata catalog, Globus RLS, and Globus GridFTP provide a powerful way to manage data on the Grid and do more science
    • figure out what data/files are needed
    • find it
    • move it
    • do science with it!
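
The figure-out / find / move sequence can be sketched end to end with two dictionaries standing in for the metadata catalog and the RLS. Everything here is illustrative: the field names (`site`, `gps_start`, `duration_s`, `mode`) are made up for the example, the function is hypothetical, and the result is only a list of source and destination URLs that a GridFTP client could then be asked to copy.

```python
# Stand-in for a metadata catalog: LFN -> descriptive attributes.
# Values come from the example file described in the metadata slides.
METADATA = {
    "H-R-714024224-16.gwf": {
        "site": "Hanford, WA",
        "gps_start": 714024224,
        "duration_s": 16,
        "mode": "science",
    },
}

# Stand-in for the RLS: LFN -> list of PFNs/URLs.
REPLICAS = {
    "H-R-714024224-16.gwf": [
        "gsiftp://someserver/path/to/H-R-714024224-16.gwf",
    ],
}


def plan_transfers(metadata, replicas, wanted, dest_dir):
    """Return (source URL, destination URL) pairs for files of interest."""
    plan = []
    for lfn, record in sorted(metadata.items()):
        if not wanted(record):          # 1. figure out what data is needed
            continue
        pfns = replicas.get(lfn, [])    # 2. find it
        if pfns:
            # 3. move it: take the first replica and copy it to local disk
            plan.append((pfns[0], f"file:{dest_dir}/{lfn}"))
    return plan


plan = plan_transfers(
    METADATA, REPLICAS, lambda record: record["mode"] == "science", "/tmp"
)
```

Each pair in `plan` has the same shape as the two arguments passed to globus-url-copy in the debugging example earlier in the lecture.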

But…
  • What about a higher-level tool?
  • We want something that will…
    • Locate the data
    • Send data to processing sites
    • Share the results with other sites
    • Allocate and de-allocate storage
    • Clean up everything
    • Do these reliably, efficiently, and without human supervision

Stork
  • A scheduler for data placement activities in the Grid
  • What Condor is for computational jobs, Stork is for data placement
  • Stork comes with a new concept:
    “Make data placement a first class citizen in the Grid.”

The Concept
  • Stage-in
  • Execute the Job
  • Stage-out
[Figure: "Individual Jobs". The three steps above broken down into: allocate space for input & output data, stage-in, execute the job, release input space, stage-out, release output space.]

The Concept
  • Stage-in
  • Execute the Job
  • Stage-out
[Figure: the same breakdown (allocate space for input & output data, stage-in, execute the job, release input space, stage-out, release output space), with the steps labeled as "Data Placement Jobs" and "Computational Jobs".]

The Concept
DAG specification
    DaP A A.submit
    DaP B B.submit
    Job C C.submit
    …..
    Parent A child B
    Parent B child C
    Parent C child D, E
    …..
[Figure: DAGMan reads the DAG specification (nodes A to F) and feeds the Condor job queue and the Stork job queue.]
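
The DAG specification lines can be turned into a run order with a topological sort. The sketch below is not DAGMan; it is a minimal parser for the few statements visible on the slide, written under the assumption that `DaP` nodes go to the Stork queue and `Job` nodes to the Condor queue. Nodes D and E have no visible declaration on the slide, so they are reported as undeclared.

```python
from graphlib import TopologicalSorter

# The statements visible on the slide (the elided "….." lines are left out).
SPEC = """\
DaP A A.submit
DaP B B.submit
Job C C.submit
Parent A child B
Parent B child C
Parent C child D, E
"""


def parse_dag(spec):
    """Return (kinds, parents): node -> 'dap'/'job', node -> set of parent nodes."""
    kinds = {}
    parents = {}
    for line in spec.splitlines():
        words = line.replace(",", " ").split()
        if not words:
            continue
        keyword = words[0].lower()
        if keyword in ("dap", "job"):
            kinds[words[1]] = keyword
            parents.setdefault(words[1], set())
        elif keyword == "parent":
            split_at = [w.lower() for w in words].index("child")
            parent_nodes = words[1:split_at]
            for node in parent_nodes:
                parents.setdefault(node, set())
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(parent_nodes)
    return kinds, parents


def run_order(spec):
    """List (node, queue) pairs in an order that respects the dependencies."""
    kinds, parents = parse_dag(spec)
    queue_for = {"dap": "stork", "job": "condor"}
    order = TopologicalSorter(parents).static_order()
    return [(node, queue_for.get(kinds.get(node), "undeclared")) for node in order]


order = run_order(SPEC)
```

For this specification the order starts A, B, C, with the two data placement nodes routed to Stork and the computational node to Condor; D and E follow C in either order.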

Why Stork?
  • Stork understands the characteristics and semantics of data placement jobs.
  • Can make smart scheduling decisions, for reliable and efficient data placement.
  • Integrates seamlessly with Condor-G

Failure Recovery and Efficient Resource Utilization
  • Fault tolerance
    • Just submit a bunch of data placement jobs, and then go away...
  • Control the number of concurrent transfers from/to any storage system
    • Prevents overloading
  • Space allocation and de-allocation
    • Make sure space is available
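
Two of these ideas, retrying failed transfers and capping how many run at once, can be illustrated with a small worker pool. This is not how Stork is implemented; the function name, the retry count, and the use of `OSError` to signal a failed transfer are assumptions for the sketch, and `do_transfer` stands for whatever actually moves the data.

```python
from concurrent.futures import ThreadPoolExecutor


def run_transfers(transfers, do_transfer, max_concurrent=2, max_attempts=3):
    """Run every transfer, at most max_concurrent at a time, retrying failures.

    do_transfer(transfer) is a caller-supplied function that raises OSError
    when a transfer fails. Returns (transfer, status, attempts) tuples in the
    same order as the input.
    """

    def attempt(transfer):
        last_error = None
        for attempt_number in range(1, max_attempts + 1):
            try:
                do_transfer(transfer)
                return (transfer, "done", attempt_number)
            except OSError as error:
                last_error = error  # remember the failure and try again
        return (transfer, f"failed: {last_error}", max_attempts)

    # The pool size is the cap on concurrent transfers, which keeps a
    # storage system from being overloaded.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(attempt, transfers))
```

A caller would pass a list of transfer descriptions and a function that performs one transfer, then inspect the returned statuses once everything has finished.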
Support for Heterogeneity
Protocol translation using Stork memory buffer.
Support for Heterogeneity
Protocol translation using Stork Disk Cache.
Flexible Job Representation and Multilevel Policy Support[