Data Management The European DataGrid Project Team http://www.eu-datagrid.org

Jan 18, 2018


Lewis Johnston

Page 1: Data Management The European DataGrid Project Team

Data Management

The European DataGrid Project Team

http://www.eu-datagrid.org


EDG DataManagement Tutorial - n° 2

Problem Statement: How to connect User/Programs/Data?

- User
  - Logged in to a Grid “User Interface” (UI) machine, or
  - Logged in to a “desktop” machine
- Programs
  - On desktop
  - On UI
  - On Grid machines “god knows where” (GNW)
- Data
  - May need to supply (Grid or non-Grid) data to GNW programs
  - GNW program may generate data, which you need to put somewhere safe
  - How do you retrieve it from somewhere safe?

Common Grid Data Management Tasks

- Dealing with Data Your Job Generates
  - Getting the data back to your desktop
  - Putting the data “on the Grid”
- Getting Data to your Job
  - Submitting data along with your job
  - Putting your data onto the Grid (from outside)
  - Sending your Grid job to your Grid data
- Moving Data on the Grid
- How to find your data if you don’t remember where you put it
- Example scripts and files: ~dgttutor/dm-tests/

Grid Data Management Tools

- Data transfer is mostly through gsiftp
  - Like good old FTP, except that it uses Grid authentication and authorization
  - No passwords!
  - Can also use multiple streams for faster transfer
- The Resource Broker (RB) can send small amounts of data to and from jobs
- The Replica Catalog (RC) keeps track of where the various copies of “grid datasets” are located
- edg-replica-manager uses gsiftp and the RC to manage instantiation, registration, and replication of grid datasets
- The Resource Broker can use the RC to find your data and send your job to it, if you tell the RB about the data you need

Grid Program -> Data on your desktop

- You can set up your job for “data pickup”
  - The job generates data in the current working directory on the WN (worker node)
  - At job end, the data files are placed in temporary storage at the RB
  - You get them back via “dg-job-get-output”
- Key items:
  - You need to know the names of the files you want to get back

        OutputSandbox = {"higgs.root","graviton.HDF"};

  - Not intended for large files (> hundred MB): storage limitation on the Resource Broker machine
- Example: output-sandbox.{jdl,sh}
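
The data pickup set-up can be sketched as a small script that writes a job description file. This is only an illustration, not the tutorial's output-sandbox.jdl: the script name run-analysis.sh, the std.out and std.err file names, and the output path are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: write a minimal JDL file whose OutputSandbox names the files to bring back.
# run-analysis.sh, std.out and std.err are hypothetical names used for illustration.
jdl_file="${TMPDIR:-/tmp}/output-sandbox-sketch.jdl"

cat > "$jdl_file" <<'EOF'
Executable    = "run-analysis.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"run-analysis.sh"};
OutputSandbox = {"std.out", "std.err", "higgs.root", "graviton.HDF"};
EOF

# Only the files listed in OutputSandbox are held at the Resource Broker
# for later retrieval with dg-job-get-output.
cat "$jdl_file"
```

Every file you want back has to be listed by name, so the job must write to file names that are known in advance.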

Putting the data “on the Grid”

Here we talk about a running Grid program whose output you want on the Grid. Two cases:

- You let the program write its output on the WN, and after the program finishes the job script moves the data to Grid storage
- You arrange for the program to write directly to Grid storage

In both cases, data is not really “on the Grid” until it is registered in the “replica catalog”.

Grid-generated data to Grid storage I

- Your program generates data to some local file
- You have to know (or be able to figure out) what the local file name is
- Use the edg-replica-manager commands to
  - Put the data onto Grid storage
  - Register the data as a Grid dataset
- A few extras are needed
  - Some idea of where to put the data
  - A “logical file name” (LFN): a location-independent grid file name

Grid-generated data to Grid storage I (GGDGS I, cont’d)

- How to find out where to put data? You need to know which storage elements are out there:

    ldapsearch -h lxshare0225.cern.ch -p 2170 -x \
      -b "Mds-vo-name=local,o=grid" "(objectclass=storageelement)" seid

- The command that moves your data to the desired location and registers it in the replica catalog is edg-replica-manager-copyAndRegisterFile:

    edg-replica-manager-copyAndRegisterFile \
      -s $(hostname)/$(pwd)/$DFILE -l $LFN -d $DEST_SE

- See the cr-mov-reg.{sh,jdl} examples
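
A job script usually has to assemble that command from values it only knows at run time. The following is a dry-run sketch under stated assumptions: the function name, its argument order, and the host names wn.example.org and se.example.org are invented for illustration, and the command is printed rather than executed so that the sketch runs on a machine without the EDG tools.

```shell
#!/bin/sh
# Dry-run sketch: prints the copyAndRegisterFile command instead of running it.
# The argument order (HOST DIR FILE LFN DEST_SE) is this sketch's own convention.
print_copy_and_register() (
    host=$1; dir=$2; file=$3; lfn=$4; dest_se=$5
    for value in "$host" "$dir" "$file" "$lfn" "$dest_se"; do
        if [ -z "$value" ]; then
            echo "usage: print_copy_and_register HOST DIR FILE LFN DEST_SE" >&2
            exit 1
        fi
    done
    # Same source form as on the slide: $(hostname)/$(pwd)/$DFILE
    printf 'edg-replica-manager-copyAndRegisterFile -s %s/%s/%s -l %s -d %s\n' \
        "$host" "$dir" "$file" "$lfn" "$dest_se"
)

# In a real job script HOST and DIR would come from $(hostname) and $(pwd).
# wn.example.org and se.example.org are hypothetical host names.
print_copy_and_register wn.example.org /tmp/job42 higgs.root higgs.root se.example.org
```

Refusing to build the command when an argument is missing is deliberate: a half-filled command line is exactly the kind of input the replica manager tools handle poorly.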

Grid-generated data to Grid storage II

- Your program generates data directly to a “close SE”
  - Close means you can use normal file I/O to write it
  - You have to use a brokerinfo command to find out what the close SE is (you don’t know where your job will go!) and what the directory is
- You write the data
- Use the edg-replica-manager commands to
  - Register the data as a Grid dataset
- An extra is needed
  - A “logical file name”: a location-independent grid file name

Grid-generated data to Grid storage II (GGDGS II, cont’d)

- Restriction: the “local file name” has to be the same as the logical file name (at least the “base” name)
  - File on disk: /data/spool/123fred7; LFNs:
    - 123fred7 is OK
    - 123fred is not OK
    - fred7 is not OK
    - Skippy is not OK
    - spool/123fred7 is OK
- The logical file name must not already be in the catalogue
- You also probably want to check that the file doesn’t exist on disk before you start to write it
- Example files: cr-on-se-and-reg.{jdl,sh}
- Check if it was successful:

    edg-replica-manager-listReplicas -c /opt/edg/etc/tutor/rc.conf \
      -l whomp.119
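
The naming restriction is easy to get wrong, so a job script may want to test a candidate LFN before trying to register it. The check below is one reading of the OK / not OK examples above (the LFN must equal the trailing path components of the file on disk); the function name is made up for this sketch, and the real catalogue may apply further rules.

```shell
#!/bin/sh
# lfn_matches_file PATH LFN
# Returns 0 when LFN equals the trailing path components of PATH, else 1.
lfn_matches_file() {
    [ -n "$2" ] || return 1        # an empty LFN is never acceptable
    case "$1" in
        */"$2") return 0 ;;        # LFN must start right after a "/" in PATH
        *)      return 1 ;;
    esac
}

for lfn in 123fred7 123fred fred7 Skippy spool/123fred7; do
    if lfn_matches_file /data/spool/123fred7 "$lfn"; then
        echo "$lfn is OK"
    else
        echo "$lfn is not OK"
    fi
done
```

Quoting "$2" inside the case pattern makes the shell compare it literally, so characters such as * or ? in a file name are not treated as wildcards.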

Submitting Data Along With Your Job

- This is fairly easy: use the Input Sandbox
  - Careful: not a sandbox in the JavaScript sense

        InputSandbox = {"input-ntuple.root"};

- Example files: inp-sbox.{jdl,sh}

Moving Data Onto Grid from Outside

- This is almost identical to Grid-generated data to Grid storage I (GGDGS I)
- Use edg-replica-manager-copyAndRegisterFile
- You need to specify the rc.conf file, either with the RC_CONFIG_FILE variable or with the -c option; defaults are in /opt/edg/etc/<vo>/rc.conf
- Remember the restrictions:
  - LFN and remote file name have to match
  - Source and destination files must include hostnames

    edg-replica-manager-copyAndRegisterFile -c rc.conf -l whomp.145 \
      -s $(hostname)/$(pwd)/gls -d gppse05.gridpp.rl.ac.uk

Having Grid Send Job to Your Data

- You need to have the data “on the Grid”, i.e. listed in the RC
- Tell your job (JDL) about the grid data:

    InputData = "LF:myfile.dat"

- The Resource Broker puts info about data matching in the “brokerinfo” file on the remote execution node
- In your job execution script, use the “edg-brokerinfo” command (getselectedfile) to find the location of the job-local copy
- Example files: find-data.{jdl,sh}

Moving Data Around

    edg-replica-manager-replicateFile -c rc.conf -l <lfn> \
      -d <dest-SE-name> -s <source-SE-name>

- Try the previous test (with dg-job-list-match): it should find a new site willing to accept your job

Finding Your Data

    ldapsearch -LLL -h grid-vo.nikhef.nl -p 10389 -x \
      -b "rc=EDGtutorialReplicaCatalog,dc=eu-datagrid,dc=org" \
      '(filename=jtdmtest1)' dn

- Shows the “dn”s wherever the selected “filename” exists

GDMP

- Tool for replication of large sets of files between sites
- Can do a lot with it
- Easy to get commands wrong
  - Can’t recover from certain errors
  - Possible to wreck the GDMP subsystem badly enough that remote sysadmins will have to make manual fixes
- Recommend not to use unless you really need it!
  - Ex: you don’t normally use the “dd” command to copy files!

Gotchas

- edg-replica-manager commands
  - Error messages are not always on target
  - Be careful not to use commands in ways other than intended: error trapping is not good, and sometimes the command will do something, but not necessarily what you want
  - Build error checking and trapping into your job scripts
  - Remember the restrictions on LFN/PFN correspondence
- Replica catalog
  - Leaving out pieces of the command generally neither works nor provides helpful messages. Type carefully!
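
One way to build error checking into a job script is to route every data management command through a small wrapper. This is a minimal sketch: run_checked is a helper invented here, not an EDG command, and true and false stand in for the real commands so that the example runs anywhere.

```shell
#!/bin/sh
# run_checked COMMAND [ARGS...]
# Runs the command, reports a failure on stderr, and hands the command's
# exit status back to the caller so the script can decide what to do next.
run_checked() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit $status): $*" >&2
    fi
    return "$status"
}

# "true" and "false" are placeholders for real data management commands.
run_checked true  && echo "step 1 ok"
run_checked false || echo "step 2 failed, stopping before registration"
```

Because a command may exit cleanly without having done what you wanted, an exit-status check is best paired with an independent check of the result, for example listing the replicas of the LFN afterwards.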

EDG Replica Catalog

- Based upon the Globus LDAP Replica Catalog
- Stores LFN/PFN mappings and additional information (e.g. file size):
  - Physical File Name (PFN): host + full path and file name
  - Logical File Name (LFN): logical name that may be resolved to PFNs
  - LFN : PFN = 1 : n
- Only files on storage elements may be registered
- Each VO has a specific storage dir on an SE
- Example PFN: lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat
  - host: lxshare0222.cern.ch
  - storage dir: /flatfiles/SE1/iteam
- The LFN must be the full path of the file starting from the storage dir
  - LFN of the above PFN: file1.dat
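
The LFN/PFN relationship in the example can be written out as two string operations. The sketch assumes that the storage dir in the example PFN is /flatfiles/SE1/iteam (inferred from the LFN being file1.dat); the function names are this sketch's own.

```shell
#!/bin/sh
# pfn_from_lfn HOST STORAGE_DIR LFN
# PFN = host + storage dir + "/" + LFN, following the example PFN.
pfn_from_lfn() {
    printf '%s%s/%s\n' "$1" "$2" "$3"
}

# lfn_from_pfn PFN HOST STORAGE_DIR
# Strips "host + storage dir + /" from the front of the PFN to recover the LFN.
lfn_from_pfn() {
    prefix="$2$3/"
    case "$1" in
        "$prefix"*)
            rest=${1#"$prefix"}
            printf '%s\n' "$rest"
            ;;
        *)
            echo "PFN is not under $2$3" >&2
            return 1
            ;;
    esac
}

pfn_from_lfn lxshare0222.cern.ch /flatfiles/SE1/iteam file1.dat
lfn_from_pfn lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat \
    lxshare0222.cern.ch /flatfiles/SE1/iteam
```

Since one LFN can map to many PFNs, pfn_from_lfn gives a different result for each host and storage dir, while lfn_from_pfn should give the same LFN for every replica.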

globus-url-copy

- Low-level tool for secure copying

    globus-url-copy <protocol>://<source file> \
      <protocol>://<destination file>

- Main protocols:
  - gsiftp: for secure transfer, only available on SE and CE
  - file: for accessing files stored on the local file system on e.g. UI, WN

    globus-url-copy file://`pwd`/file1.dat \
      gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat
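
The two URL forms can be assembled with small helpers, which avoids stray spaces or missing slashes in hand-typed URLs. This is a dry-run sketch: the helper names and the local path /home/user/file1.dat are assumptions, and the command is printed rather than executed.

```shell
#!/bin/sh
# Helper names are this sketch's own. The command is printed, not executed.
file_url()   { printf 'file://%s\n' "$1"; }            # $1: absolute local path
gsiftp_url() { printf 'gsiftp://%s%s\n' "$1" "$2"; }   # $1: host, $2: absolute path

# print_upload LOCAL_PATH HOST REMOTE_PATH
print_upload() {
    case "$1" in
        /*) ;;                                          # file:// needs an absolute path
        *)  echo "local path must be absolute: $1" >&2; return 1 ;;
    esac
    printf 'globus-url-copy %s %s\n' "$(file_url "$1")" "$(gsiftp_url "$2" "$3")"
}

# /home/user/file1.dat is a hypothetical local file.
print_upload /home/user/file1.dat lxshare0222.cern.ch /flatfiles/SE1/EDGTutorial/file1.dat
```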

The Replica Manager APIs

- (un)registerEntry(LogicalFileName lfn, FileName source)
  - Replica Catalogue operations only, no file transfer
- copyFile(FileName source, FileName destination, String protocol)
  - Allows for third-party transfer
  - Transfer between two StorageElements, or between a ComputingElement and a StorageElement
- Space management policies are under development
- All tools support parallel streams for file transfers

The Replica Manager APIs (cont’d)

- copyAndRegisterFile(LogicalFileName lfn, FileName source, FileName destination, String protocol)
  - Third-party transfer, but files can only be registered in the Replica Catalogue if the destination PFN contains a valid SE (i.e. one that is registered in the RC)!
- replicateFile(LogicalFileName lfn, FileName source, FileName destination, String protocol)
- deleteFile(LogicalFileName lfn, FileName source)