Analysis Operation report at CMS week Dec 7, 2010 1
Analysis Operations Experience
2010 highlights, 2011 work list
Sanjay Padhi (UCSD) on behalf of Analysis Operations
Analysis Metrics
Overall status is tracked weekly, with summary mails sent to the AnalysisOperations forum and with a series of graphs:
https://hypernews.cern.ch/HyperNews/CMS/get/analysisoperations/189.html
https://hypernews.cern.ch/HyperNews/CMS/get/analysisoperations/192.html
https://twiki.cern.ch/twiki/bin/view/CMS/AnalysisOpsMetricsPlots
Allows us to see the forest while watching the trees daily
Lots of analysis jobs
More analysis than production jobs, but x2 shorter on average
Load on infrastructure scales with the number of jobs
Will try to increase job duration via user education
Not much room left for more jobs
The number of jobs running at the same time (i.e. used slots) is showing saturation
Job wrapper time (avg. last week)
Analysis: 2h
Production: 5h
JobRobot: 20min
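If load on the infrastructure scales with the number of jobs, the same amount of processing split into longer jobs needs fewer of them. The sketch below illustrates this with the average wrapper times quoted above. The total workload of 10,000 hours is a made-up figure, and `jobs_needed` is a hypothetical helper, not part of any CMS tool.

```python
import math


def jobs_needed(total_hours: float, hours_per_job: float) -> int:
    """Number of jobs required to cover total_hours of wrapper time."""
    if hours_per_job <= 0:
        raise ValueError("hours_per_job must be positive")
    return math.ceil(total_hours / hours_per_job)


TOTAL_HOURS = 10_000  # hypothetical workload, not a measured value

# Average wrapper times from the slide: analysis 2h, production 5h.
for label, hours in [("analysis-like (2h)", 2.0), ("production-like (5h)", 5.0)]:
    print(f"{label}: {jobs_needed(TOTAL_HOURS, hours)} jobs")
```

Under these assumptions, moving the same work from 2h jobs to 5h jobs cuts the job count by a factor of 2.5.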
Too many jobs?
It is a success of CMS that so many users manage to run so much data processing
But expect more frustration as waiting time grows
May also expose us to new failure modes (like any new regime, so far)
Too much data for AnOps managed space?
AnOps just cleaned up in preparation for 39x based reprocessing
Even after cleanup, 74% of space is still used!
Commonly used versions of MC and data in 2011:
MC for two energies (7 TeV and 8 TeV), with and without pile-up, two releases
Data for two releases
Unlikely that all of this will fit into AnOps managed space unless we switch to AOD distribution only
Expect physics groups to use more of their space in 2011 to host samples that AnOps hosted in 2010.
1.5 of 3.75 PB free at official group sites
AnOps managed space is ~full
DPG/POG/PAG managed space is not full
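The sample multiplicity and space figures above can be checked with a few lines of arithmetic. This is only a sketch: the release names are placeholders (the slide does not name them), and nothing here accounts for actual dataset sizes.

```python
from itertools import product

energies = ["7TeV", "8TeV"]
pileup = ["with pile-up", "without pile-up"]
releases = ["release A", "release B"]  # placeholder names, not real release tags

# Every MC sample is needed for each energy, pile-up setting and release.
mc_variants = list(product(energies, pileup, releases))
# Data is needed once per release.
data_variants = list(releases)
print(f"{len(mc_variants)} MC variants + {len(data_variants)} data variants")

# Space at official group sites, in PB, as quoted on the slide.
group_total_pb = 3.75
group_free_pb = 1.5
group_used_fraction = (group_total_pb - group_free_pb) / group_total_pb
print(f"group sites used: {group_used_fraction:.0%}")
```

Each commonly used MC sample therefore exists in 8 variants, while group sites are at about 60% occupancy compared with 74% for the AnOps managed space after cleanup.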
Caring for users' data: placement
In 2010: 17 PB transferred to T2s, 4.5 PB transferred from T2s
Central space completely refreshed about every 3 months
Data distribution works quite well
[Plots: transfers to T2s; transfers from T2s]
Caring for users' data: StoreResults Service
Elevates datasets from local to global DBS and PhEDEx
Two instances deployed: RWTH Aachen, Imperial College London
StoreResults monitoring developed in 2010
Allows physics groups to monitor their requests in real time
Average 1-2 requests per day, usually in bunches of 2-10
340 requests in 2010, ~98% elevated successfully
Average elevation time 46h; however, 3000 job requests from SUSY take their time
Need to transition large productions to DataOps
Commissioning data transfer infrastructure
Local CRAB stage out validated at 41/48 sites
3 of the missing 7 fail due to a known CRAB shortcoming
Caring for users' jobs: CRAB feedback
Support 400 different users/week
Questions ranging from data lookup to grid and site problems
Draw a line at “cmsRun” (could be easier with better reporting; users often do not know how to read cmsRun exit codes)
Continued need for expert CRAB users willing to help with CRAB feedback operations
Opportunity for service work!
Mail volume (#messages) handled by Analysis Operations on the CrabFeedback forum in 2010
Caring for users' jobs: Crab Server
Running 6 now: 2 at CERN, 2 at UCSD, 1 each at Bari and DESY
By operational choice, one is in drain at any time to allow DB reset
Other issues continuously pop up (from hardware to software), and we usually have 4 or fewer servers actually in production
Good news
Installed, operated and (mostly) debugged by non-developers
Much improved after last summer's developer effort
More than 50% of the analysis jobs run through Crab Servers
Most of the time: simply works
Bad news
Operationally heavy
Job/task tracking can fail and users need to resubmit
Resubmission awkward at times
Status reporting obscure at times, which leads to unnecessary help requests
Give us a hand! How to be a good user
Read and use documentation
Give all information in the first message, before we need to ask: crab.cfg, crab.log, relevant stdout/err, dashboard URL, ...
Do not blindly use config files you do not understand
Make some effort to figure out the problem yourself (in the end it will save you time)
Do not expect a solution in “minutes”, be prepared for “tomorrow”
Get trained before you need to do something in a rush, ramp up in steps: 1 job, 10 jobs, 100 jobs, …
If you see time-critical CRAB work ahead: get hands-on experience well before you need it
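The "ramp up in steps" advice can be sketched as a small loop that grows the batch size by a factor of ten and stops at the first step that fails. This is only an illustration: `ramp_up_sizes` and `run_ramp_up` are hypothetical helpers, and `submit_and_check` stands in for whatever the user does to submit a batch and verify that it succeeded. It does not call CRAB.

```python
from typing import Callable, Iterator


def ramp_up_sizes(target: int, factor: int = 10) -> Iterator[int]:
    """Yield batch sizes 1, 10, 100, ... and finally the target itself."""
    if target < 1 or factor < 2:
        raise ValueError("target must be >= 1 and factor >= 2")
    size = 1
    while size < target:
        yield size
        size *= factor
    yield target


def run_ramp_up(target: int, submit_and_check: Callable[[int], bool]) -> int:
    """Run increasingly large batches; return the largest size that succeeded."""
    largest_ok = 0
    for size in ramp_up_sizes(target):
        if not submit_and_check(size):
            break  # stop and investigate before scaling further
        largest_ok = size
    return largest_ok


# Stand-in check for illustration: pretend batches of 100 or more jobs fail.
print(list(ramp_up_sizes(500)))
print(run_ramp_up(1000, lambda n: n < 100))
```

Stopping at the first failing step keeps the number of wasted jobs small and gives a concrete, reproducible case to report.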
2010 highlights
A lot of analysis activity is being performed
We believe we know what’s going on
Grid works, at least as well as our tools
We have replaced developers in Crab server daily ops and CRAB feedback daily support
No service deterioration for the user community
We successfully care for a large amount of data in central space
Current concerns
We are reaching the point of being resource constrained
Many decisions were made easy by abundance:
do not debug: resubmit
do not fix site: replicate data
Doing better requires thinking, planning, following up daily and better tools, i.e. human time
Crab2 is in maintenance mode, while still operationally heavy
Working in firefighting / complaint-driven mode
Transient, random problems in the infrastructure are the main effort drain
Major technology transition looming ahead
Must find operations resources for the transition
The year that comes
Continue doing what we already do
Transition to Crab3
Work on job monitoring and error reporting
Not good enough yet to tell where exactly resources are wasted and how to fix it
Work on Crab server monitoring, currently not good enough to:
avoid complaint-driven mode
spot systemic problems in server or middleware
Work on user problem reporting
Reduce the number of mail iterations
Learn how to use sites of very different size and reliability
Since we may need all of them
Work on better alignment of FacOps site testing with AnOps reality (continuous effort)
Summary
Analysis metrics
High level overview: under control
Error/efficiency reporting: work to do
Data placement and transfer for physics users
Centrally managed storage area: OK but filling up
T2-T2 transfer links: success story
Group data distribution: OK
CRAB operations
Peaked at more than 140K jobs/day: success story
CRAB feedback: unsustainable for 1 year, but coping
Crab server: works, but too heavy operationally
Concerns
End of abundant resources era: more work needed
New tools coming: more work needed
Things to work on in 2011: a pretty long list, given that we should be in steady operations