Upgrade D0 farm
Mar 31, 2015
Reasons for upgrade
• RedHat 7 needed for D0 software
• New versions of
– ups/upd v4_6
– fbsng v1_3f+p2_1
– sam
• Use of farm for MC and analysis
• Integration in farm network
MC production on farm
• Input: requests
• Request translated into an mc_runjob macro
• Stages (see the sketch below):
1. mc_runjob on batch server (hoeve)
2. MC job on node
3. SAM store on file server (schuur)
[Diagram: MC production flow — mcc request enters on the farm server (hoeve); fbs(mcc) runs the MC job on a node (100 CPUs, 40 GB local disk); fbs(rcp,sam) moves the mcc output to the file server (schuur, 1.2 TB) and stores it in the SAM datastore at FNAL/SARA; control, data and metadata paths shown; fbs job: 1. mcc, 2. rcp, 3. sam]
[Diagram: same flow, variant in which the SAM store is run by a cron job on schuur instead of an fbs section; fbs job: 1. mcc, 2. rcp; cron: sam]
[Diagram: who runs what — hoeve: fbsuser runs mc_runjob and fbs submit; node: fbsuser runs cp and mcc; schuur: fbsuser runs rcp, willem runs sam via cron; control and data paths between hoeve, node and schuur]
SECTION mcc EXEC=/d0gstar/curr/minbias-02073214824/batch NUMPROC=1 QUEUE=FastQ STDOUT=/d0gstar/curr/minbias-02073214824/stdout STDERR=/d0gstar/curr/minbias-02073214824/stdout
SECTION rcp EXEC=/d0gstar/curr/minbias-02073214824/batch_rcp NUMPROC=1 QUEUE=IOQ DEPEND=done(mcc) STDOUT=/d0gstar/curr/minbias-02073214824/stdout_rcp STDERR=/d0gstar/curr/minbias-02073214824/stdout_rcp
#!/bin/sh
. /usr/products/etc/setups.sh
cd /d0gstar/mcc/mcc-dist
. mcc_dist_setup.sh
mkdir -p /data/curr/minbias-02073214824
cd /data/curr/minbias-02073214824
cp -r /d0gstar/curr/minbias-02073214824/* .
touch /d0gstar/curr/minbias-02073214824/.`uname -n`
sh minbias-02073214824.sh `pwd` > log
touch /d0gstar/curr/minbias-02073214824/`uname -n`
/d0gstar/bin/check minbias-02073214824
#!/bin/sh
i=minbias-02073214824
if [ -f /d0gstar/curr/$i/OK ]; then
  mkdir -p /data/disk2/sam_cache/$i
  cd /data/disk2/sam_cache/$i
  node=`ls /d0gstar/curr/$i/node*`
  node=`basename $node`
  job=`echo $i | awk '{print substr($0,length-8,9)}'`
  rcp -pr $node:/data/dest/d0reco/reco*${job}* .
  rcp -pr $node:/data/dest/reco_analyze/rAtpl*${job}* .
  rcp -pr $node:/data/curr/$i/Metadata/*.params .
  rcp -pr $node:/data/curr/$i/Metadata/*.py .
  rsh -n $node rm -rf /data/curr/$i
  rsh -n $node rm -rf /data/dest/*/*${job}*
  touch /d0gstar/curr/$i/RCP
fi
batch: runs on node
batch_rcp: runs on schuur
#!/bin/sh
locate(){
  file=`grep "import =" import_${1}_${job}.py | awk -F \" '{print $2}'`
  sam locate $file | fgrep -q [
  return $?
}
. /usr/products/etc/setups.sh
setup sam
SAM_STATION=hoeve
export SAM_STATION
tosam=$1
LIST=`cat $tosam`
for job in $LIST
do
  cd /data/disk2/sam_cache/${job}
  list='gen d0g sim'
  for i in $list
  do
    until locate $i || (sam declare import_${i}_${job}.py && locate ${i})
    do sleep 60; done
  done
  list='reco recoanalyze'
  for i in $list
  do
    sam store --descrip=import_${i}_${job}.py --source=`pwd`
    return=$?
    echo Return code sam store $return
  done
done
echo Job finished ...
declare gen, d0g, sim
store reco, recoanalyze
runs on schuur; called by fbs or cron (example cron entry below)
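For the cron case, a purely illustrative crontab line on schuur; the wrapper path and list file are assumed names, and the script above takes a file with job names as its first argument.

# Illustrative crontab entry (script path and list file are assumptions):
0 * * * * /d0gstar/bin/store_sam /d0gstar/curr/tosam.list >> /d0gstar/curr/store_sam.log 2>&1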
Filestream
• Fetch input from sam
• Read input file from schuur
• Process data on node
• Copy output to schuur (see the sketch below)
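A minimal sketch of the node's part of one filestream iteration. The sam fetch onto schuur is handled by the SAM station ("attach filestream" in the diagram that follows); all paths are illustrative, and d0exe stands in for the actual D0 executable.

#!/bin/sh
# Sketch only: one filestream iteration on a node (all paths are assumptions).
rcp -pr schuur:/data/disk2/sam_cache/input_file /data/curr/.    # read input file from schuur
cd /data/curr
d0exe input_file > log 2>&1                                     # process data on the node
rcp -pr output_file schuur:/data/disk2/sam_cache/.              # copy output back to schuur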
[Diagram: filestream job flow — mc_runjob on hoeve, fbs submit; node runs rcp (input), d0exe, rcp (output); schuur runs sam via cron and attaches the filestream; control and data paths between hoeve, node and schuur]
Analysis on farm
• Stages (sketched below; the jdf and scripts follow on the next slides):
– Read files from sam
– Copy files to node(s)
– Perform analysis on node
– Copy files to file server
– Store files in sam
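The stages sketched end to end with the commands that appear on the following slides; file and project names are illustrative, and the real jdf and scripts are shown next.

#!/bin/sh
# Sketch only: the analysis stages as single commands (paths are illustrative).
# 1. sam + rcp on the server (triviaal in the examples that follow)
sam run project get_file.py --interactive > log
rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test
# 2. analysis on the node
root -b -q /d0gstar/test.C
# 3. rcp + sam store back on the server (file names assumed)
rcp -pr node-2:/data/test/results /stage/triviaal/sam_cache/.
sam store --descrip=import.py --source=`pwd`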
[Diagram: analysis flow, variant 1 — 1. sam + rcp, 2. analyze, 3. rcp + sam, each submitted as an fbs job (fbs(1), fbs(2), fbs(3)); farm server, file server (1.2 TB), nodes (100 CPUs, 40 GB); SAM DB and datastore at FNAL/SARA; control (fbs), data and metadata paths]
[Diagram: processes per host — triviaal: willem runs sam, fbsuser runs rcp for input and output; node-2: fbsuser runs the analysis program]
SECTION sam EXEC=/home/willem/batch_sam NUMPROC=1 QUEUE=IOQ STDOUT=/home/willem/stdout STDERR=/home/willem/stdout
#!/bin/sh
. /usr/products/etc/setups.sh
setup sam
SAM_STATION=triviaal
export SAM_STATION
sam run project get_file.py --interactive > log
/usr/bin/rsh -n -l fbsuser triviaal rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test >> log
batch.jdf
batch_sam
[Diagram: analysis flow, variant 2 — 1. sam, 2. rcp + analyze + rcp, 3. rcp + sam (fbs(1), fbs(2), fbs(3)); same farm server, file server (1.2 TB), nodes (100 CPUs, 40 GB) and SAM DB/datastore at FNAL/SARA; control (fbs), data and metadata paths]
[Diagram: processes per host, variant 2 — triviaal: willem runs sam, fbsuser runs fbs submit; node-2: fbsuser runs rcp, the analysis program and rcp for input and output]
SECTION sam EXEC=/d0gstar/batch_node NUMPROC=1 QUEUE=FastQ STDOUT=/d0gstar/stdout STDERR=/d0gstar/stdout
#!/bin/sh
uname -a
date
rsh -l fbsuser triviaal fbs submit ~willem/batch_node.jdf
#!/bin/sh
. /usr/products/etc/setups.sh
setup fbsng
setup sam
SAM_STATION=triviaal
export SAM_STATION
sam run project get_file.py --interactive > log
/usr/bin/rsh -n -l fbsuser triviaal fbs submit /home/willem/batch_node.jdf
SECTION sam EXEC=/home/willem/batch NUMPROC=1 QUEUE=IOQ STDOUT=/home/willem/stdout STDERR=/home/willem/stdout
SECTION ana EXEC=/d0gstar/batch_node NUMPROC=1 QUEUE=FastQ STDOUT=/d0gstar/stdout STDERR=/d0gstar/stdout
#!/bin/sh
rcp -pr server:/stage/triviaal/sam_cache/boo /data/test
. /d0/fnal/ups/etc/setups.sh
setup root -q KCC_4_0:exception:opt:thread
setup kailib
root -b -q /d0gstar/test.C
{
  gSystem->cd("/data/test/boo");
  gSystem->Exec("pwd");
  gSystem->Exec("ls -l");
}
#
# This file sets up and runs a SAM project.
#
import os, sys, string, time, signal
from re import *
from globals import *
import run_project
from commands import *
#########################################
# Set the following variables to appropriate values

# Consult database for valid choices
sam_station = "triviaal"

# Consult database for valid choices
project_definition = "op_moriond_p1014"

# A particular snapshot version, last or new
snapshot_version = 'new'

# Consult database for valid choices
appname = "test"
version = "1"
group = "test"

# The maximum number of files to get from sam
max_file_amt = 5

# for additional debug info use "--verbose"
#verbosity = "--verbose"
verbosity = ""

# Give up on all exceptions
give_up = 1

def file_ready(filename):
    # Replace this python subroutine with whatever you want to do
    # to process the file that was retrieved.
    # This function will only be called in the event of
    # a successful delivery.
    print "File ", filename, " has been delivered!"
#    os.system('cp '+filename+' /stage/triviaal/sam')
    return
get_file.py
Disk partitioning hoeve
[Diagram: disk layout on hoeve — /d0 contains /fnal (with /d0dist, /d0usr, /ups [/db, /etc, /prd], /fbsng) and /mcc (with /mcc-dist, /mc_runjob, /curr); symbolic links: /fnal -> /d0/fnal, /d0usr -> /fnal/d0usr, /d0dist -> /fnal/d0dist, /usr/products -> /fnal/ups]
ana_runjob
• Is analogous to mc_runjob
• Creates and submits analysis jobs
• Input:
– get_file.py with the SAM project name (the project defines the files to be processed)
– analysis script
Integration with grid (1)
• At present separate clusters:
– D0, LHCb, Alice, DAS cluster
• hoeve and schuur in farm network
Present network layout
[Diagram: present network layout — hoeve, schuur and the nodes on one switch; router to hefnet and surfnet; NFS to ajax]
New network layout
[Diagram: new network layout — farm router with separate switches for the D0, LHCb and Alice nodes; hoeve, schuur and booder behind the farm router; links to hefnet and the lambda; NFS to ajax]
New network layout
[Diagram: as above, with the das-2 cluster added behind the farm router]
Server tasks
• hoeve
– software server
– farm server
• schuur
– file server
– sam node
• booder
– home directory server
– in backup scheme
Integration with grid (2)
• Replace fbs with pbs or condor
– pbs on Alice and LHCb nodes
– condor on das cluster
• Use EDG installation tool LCFG
– Install d0 software with rpm (illustrative commands below)
• Problem with sam (uses ups/upd)
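Purely illustrative rpm commands for the installation bullet above; the package names are invented for the sketch and are not actual D0 rpm names.

#!/bin/sh
# Hypothetical rpm-based installation of the D0 software on a grid node
# (package names are assumptions):
rpm -Uvh d0-mcc-dist-1.0-1.i386.rpm
rpm -Uvh d0-mc-runjob-1.0-1.i386.rpm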
Integration with grid (3)
• Package mcc in rpm
• Separate programs from working space
• Use cfg commands to steer mc_runjob
• Find better place for card files
• Input structure now created on node
Grid job
#!/bin/sh
macro=$1
pwd=`pwd`
cd /opt/fnal/d0/mcc/mcc-dist
. mcc_dist_setup.sh
cd $pwd
dir=/opt/fnal/d0/mcc/mc_runjob/py_script
python $dir/Linker.py script=$macro
[willem@tbn09 willem]$ cat test.pbs
# PBS batch job script
#PBS -o /home/willem/out
#PBS -e /home/willem/err
#PBS -l nodes=1
# Changing to directory as requested by user
cd /home/willem
# Executing job as requested by user
./submit minbias.macro
PBS job submit
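The slides do not show the submit command itself; with a standard PBS setup the script above would be submitted as follows (queue options omitted).

# Submit the PBS job script shown above:
qsub test.pbs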
RunJob class for grid

class RunJob_farm(RunJob_batch) :
    def __init__(self,name=None) :
        RunJob_batch.__init__(self,name)
        self.myType="runjob_farm"

    def Run(self) :
        self.jobname = self.linker.CurrentJob()
        self.jobnaam = string.splitfields(self.jobname,'/')[-1]
        comm = 'chmod +x ' + self.jobname
        commands.getoutput(comm)
        if self.tdconf['RunOption'] == 'RunInBackground' :
            RunJob_batch.Run(self)
        else :
            bq = self.tdconf['BatchQueue']
            dirn = os.path.dirname(self.jobname)
            print dirn
            comm = 'cd ' + dirn + '; sh ' + self.jobnaam + ' `pwd` >& stdout'
            print comm
            runcommand(comm)
To be decided
• Location of minimum bias files
• Location of MC output
Job status
• Job status is recorded in (see the sketch below):
– fbs
– /d0/mcc/curr/<job_name>
– /data/mcc/curr/<job_name>
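A minimal sketch of checking a job's status by hand; the job name is illustrative. The batch and batch_rcp scripts shown earlier leave .<node>, <node>, OK and RCP marker files in the job directory.

#!/bin/sh
# Sketch only: inspect the per-job marker files (job name is an assumption).
i=minbias-02073214824
ls -a /d0/mcc/curr/$i      # .<node>, <node>, OK and RCP markers from the batch scripts
ls /data/mcc/curr/$i       # working copy of the job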
SAM servers
• On master node:
– station
– fss
• On master and worker nodes (a quick process check is sketched below):
– stager
– bbftp
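A trivial sketch for verifying these daemons are running on a host; the exact process names are assumptions, not confirmed by the slides.

#!/bin/sh
# Sketch only: look for the SAM daemons (process names assumed).
ps -ef | egrep 'station|fss|stager|bbftp'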