USCMS T2 Site Admin Toolkit
Samir Cury
MTF Meeting – May 26th, 2011
Jan 21, 2016
How it began
• OSG All Hands Meeting 2010, Fermilab
• Yearly T2 Workshop
– Gathering of site admins
– A lot of ideas/comments
– Some code (scripts)
About site admins
• Frontline of site management
• On a daily basis they deal with:
– Many requests
– Many issues
– Many workarounds
• What happens to these?
– Relevant feedback for CMS
– Lack of features in existing software
– Lack of monitoring in existing systems, which may lead to operating them blindly
• Is there always someone to listen? Thanks, Monitoring Task Force!
Workarounds
• Following from the previous slide: this toolkit is all about workarounds.
• Complaining is not always the best way
– The request may never be implemented
– Not everyone will see the benefits/cost
• Different needs
– Developers do not always think about all user/ops needs
– Scripts are written to cover these needs
– These scripts can give ops a different approach
• Monitoring tools focused on admins' needs
– Can improve response time and error/waste detection
» Example: GridFTP Spy
» JobView / CPU efficiency on T1s
• Not essential, but normally saves some time.
The goal
• What is really missing
– An official place for unofficial code
– Encouragement for people to share
• Call for tools
– Get the generic ones → package into RPMs
– Get the specific ones → turn into generic, then package into RPMs
• Standard place (repository)
• Standard deploy procedure
– If it's not quick, no one tries → RPMs
• Helping us to help ourselves.
What it is
• Full documentation/reference available:
https://twiki.cern.ch/twiki/bin/view/CMS/SIteAdminToolkit
– This is where we document each tool included in the toolkit, future plans, etc.
• A gathering of scripts that may need some work to get running
– We try to avoid that by shipping RPMs with all dependencies included, either in the packages or in the repos.
• A free-time task for everyone involved
– We normally don't have schedules, but we do have a plan.
• Shameless "coders": that's what we need!
– We don't care how badly written it is, as long as it works
What it certainly is not
• Something that is maintained by a lot of people
– Only a few people who contribute tools, plus one dependency-solver / packager (me)
– Would appreciate some help
• Something that will solve all the problems
– That is not the goal; the goal is just to put together specific tools
• Something that has "professional quality"
– The people involved are very capable, but proportionally time-constrained
What we can learn
• "Sites" can also generate some useful code
• They will probably write it for themselves, so don't expect
– High-quality code
– Something without a lot of dependencies
• Expect
– Tools that you can adapt for your site with little effort
– To contribute and make it better instead of complaining
• "Sites" should be shameless enough to publish (and send us) tools they find useful.
• Ken Bloom gave me a slot at a USCMS T2 support meeting so I could present the proposal; after that, some tools showed up. (Thanks, Ken!)
• T2 Coordinators could inform us when they see something useful in their support meetings, and also remind these sites that the toolkit is there.
What I did learn
• Getting from the script to the RPM takes more work than I thought: many details, dependencies, etc.
• We will live better if we have a step before this:
https://github.com/samircury/US-CMS-T2-Admin-Toolkit
• People can download/edit from there; it is a shortcut for those who really want to spend some time understanding and deploying the tools that don't have an RPM yet.
• It helped me patch Stale Data, improving the CLI.
Tools we have right now
CondorView (Caltech) – RPM ready
GridFTP Spy (Caltech) – RPM ready
Condor4Web (UERJ) – RPM ready
Stale Data (Nebraska) – tested, needs packaging
Condor Extract Mail (Nebraska) – to be tested
dCache tools (Wisconsin) – to be tested
Your tool here
CondorView
• GUI for managing Condor
– Lists every single job
– Can list ALL ClassAds for a given job
– Can do what you see in the menu
• Run from the cluster frontend
– Can SSH to the node, directly into the running job's temp dir
• Run from the site's CE
– Can kill/release/restart jobs
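As an illustration of the "list ALL ClassAds for a given job" idea, here is a minimal sketch. It is not CondorView's code: it assumes the `condor_q` command is on the PATH and that its `-long` output consists of `Attr = value` lines. The function names are made up for this example.

```python
import subprocess


def parse_classad_long(text):
    """Parse 'Attr = value' lines into a dict of raw (unevaluated) strings."""
    ad = {}
    for line in text.splitlines():
        if " = " not in line:
            continue  # skip blank separators and anything unexpected
        name, _, value = line.partition(" = ")
        ad[name.strip()] = value.strip()
    return ad


def classads_for_job(job_id):
    """Hypothetical helper: ask condor_q for every attribute of one job."""
    result = subprocess.run(
        ["condor_q", "-long", str(job_id)],
        capture_output=True, text=True, check=True,
    )
    return parse_classad_long(result.stdout)
```

Values are kept as raw strings (quotes included), since evaluating ClassAd expressions is beyond a quick script like this one.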
GridFTP Spy
• Shows active GridFTP transfers in near real time
• Very useful for optimizing link usage / server settings
• Somewhat tricky to deploy
– Needs a shared FS for harvesting logs
– It works by reading the logs in real time and gathering interesting info
• Never tested it myself; testers are welcome!
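The "reading the logs in real time" approach can be sketched roughly as below. This is a generic log-following helper written for illustration, not GridFTP Spy's implementation; the function name is hypothetical and nothing here assumes a particular GridFTP log format.

```python
import os


def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for complete lines appended since offset."""
    with open(path, "rb") as fh:
        size = os.fstat(fh.fileno()).st_size
        if size < offset:
            offset = 0  # file shrank: assume it was truncated or rotated
        fh.seek(offset)
        data = fh.read()
    # Only consume up to the last newline so a half-written line is
    # picked up on the next call instead of being split in two.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A monitor would call this in a loop with a short sleep, keeping one offset per log file on the shared FS, and extract whatever fields it cares about from each returned line.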
Condor4Web
• Real-time batch system monitoring
• Visible from any corner of the world
• Your users like it
– They know what's going on with their jobs, after the CE
• MC people like it
– For the same reason.
Live demos :
http://monitor.hepgrid.uerj.br/condor/
http://www.cmsaf.mit.edu/condor4web/
If you don't use Condor, try JobView :
https://twiki.cern.ch/twiki/bin/viewauth/CMS/AnalysisOpsT2Monitoring
Stale Data
• Looks like the (un)popularity data service
• Shows which datasets people didn't run a single job against
• Tested. Works fine, but has a lot of dependencies, which should be included in the RPM
date = 15-12-2010 , Starting Date = 01-12-2010
Getting json http://dashb-datasets.cern.ch/dashboard/request.py/inputCollectionsTable_JSON?collec_name=&sites=T2_BR_UERJ&date1=01-12-2010&date2=15-12-2010
Datasets idle since 01-12-2010
/JetMET/Run2010A-Dec4ReReco_v1/AOD , 2474.004614433 GB , Owned by AnalysisOps
/G2Jets_Pt-20to60_TuneZ2_7TeV-alpgen/Fall10-START38_V12-v1/AODSIM , 190.267690679 GB , Owned by top
/W2Jets_ptW-0to100_TuneZ2_7TeV-alpgen-tauola/Fall10-START38_V12-v1/GEN , 0.686380407 GB , Owned by DataOps
/QCD6Jets_Pt120to280-alpgen/Spring10-START3X_V26_S09-v1/GEN-SIM-RECO , 42.528487201 GB , Owned by top
/W1Jets_ptW-800to1600_TuneD6T_7TeV-alpgen-tauola/Fall10-START38_V12-v1/AODSIM , 11.951159415 GB , Owned by top
(Suppressed)
Space taken by stale datasets = 408.164419749117 TB
Broken down by group:
tracker-dpg => 9.250565041201
top => 40.841314603557
AnalysisOps => 157.50586599848
undef => 15.736526476068
FacOps => 1.899973228744
b-tagging => 18.694190177731
local => 164.130428192715
DataOps => 0.105556030621
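The output above suggests two steps: build a dashboard query for a site and date range, then total the idle space per owning group. The sketch below is only an illustration in Python, not the Stale Data tool itself. It assumes the query parameters are exactly those visible in the URL above and that report lines follow the `dataset , size GB , Owned by group` pattern; the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Base URL as it appears in the sample output above.
BASE = "http://dashb-datasets.cern.ch/dashboard/request.py/inputCollectionsTable_JSON"


def build_query_url(site, date1, date2):
    """Assemble the query URL from the parameters seen in the sample run."""
    params = {"collec_name": "", "sites": site, "date1": date1, "date2": date2}
    return BASE + "?" + urlencode(params)


def parse_report_line(line):
    """Return (dataset, size_gb, group), or None if the line does not match."""
    parts = [p.strip() for p in line.split(" , ")]
    if len(parts) != 3:
        return None
    if not parts[1].endswith(" GB") or not parts[2].startswith("Owned by "):
        return None
    return parts[0], float(parts[1][:-3]), parts[2][len("Owned by "):]


def space_by_group(lines):
    """Sum the size (in GB) of the listed datasets per owning group."""
    totals = defaultdict(float)
    for line in lines:
        record = parse_report_line(line)
        if record is not None:
            _, size_gb, group = record
            totals[group] += size_gb
    return dict(totals)
```

Note that this sums the GB figures as printed per dataset; the per-group breakdown in the sample output appears to use a larger unit, so a conversion would be needed to compare the two.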
“Condor Extract Mail”
Fetches, from the grid proxies in your CEs, the e-mail addresses of the users running jobs in your cluster
[root@red ~]# ~bbockelm/extract_email "Bockelman"
What CMS can gain
• Better than the code: the ideas
• Usability: you may find here potential features for existing real software
• Adapt ideas or tools that deserve it into CMS central monitoring, like cmsweb
• Gives an overview of site admin needs and what they would like to see in the software they use.
• Some become patches, like Brian Bockelman's script
• The model / idea of a free software community is a good example to follow: small patches from many people turn small things into great ones. Share!
Thanks all involved
• Ken Bloom, Michael Thomas – initial effort to set up and make everything public
• Authors that submitted tools:
– Caltech – Michael Thomas: CondorView, GridFTP Spy
– Nebraska – Carl Lundsted and Brian Bockelman: Condor Extract Mail, Stale Data
– Wisconsin – Will: dCache Tools
– UERJ – Samir: Condor4Web
Feel free to send:
• Tools
• Suggestions
• Help
But first, we recommend some (small) reading here :
https://twiki.cern.ch/twiki/bin/view/CMS/SIteAdminToolkit
For the future
• 2 trainees at UERJ interested in helping with packaging
• Migrate YUM repos to CERN web servers
• Finish testing/packaging the tools we already have.
Recommended toolkit
http://datagrid.ucsd.edu/toolkit/
Thanks!