Your university or experiment logo here BaBar Status Report Chris Brew GridPP16 QMUL 28/06/2006
Mar 28, 2015

Ryan Ball
BaBar Status Report

Chris Brew
GridPP16, QMUL

28/06/2006

Outline

• 3 BaBar Grid Projects:
– Monte Carlo (Simulation) Production
– Skimming
– User Analysis
    • easyGrid
    • bbrbsub
• Overall experience with the Grid
• Conclusion

Usual Guff

• BaBar is a running experiment, situated at SLAC near San Francisco
• e⁺e⁻ collider tuned to investigate CP violation in B physics
• Started taking data in 1999/2000; currently has 350 fb⁻¹ of data
• Projected to have 1000 fb⁻¹ by end of 2008

Data Flow

[Diagram: data flow between Tier 0 (SLAC), Tier 1 (RAL), Tier 2s and large Tier 2s, covering Simulation Production, Skimming, Analysis and Merging]

Simulation Production

• Running at M/Cr, RAL, RALPP and B'ham
– Tests at Lancs, Oxford + others
– Still working to add other BaBar sites
– Limited by need to install Objy DB at each site
• Stable running: 500,000,000 events produced, 12% of worldwide total
• New R-GMA based job monitor: status query down from 45 minutes to 5 minutes
• Recent hiatus due to bugs found in BaBar simulation code, which caused a global halt. Production has recently restarted

C. Brew, G. Castelli

Cumulative Simulation Production on GridPP

[Chart: cumulative simulation production in 1000's of events (y-axis, 0 to 600,000) against week beginning (x-axis, 10-Jul-05 to 11-Jun-06). Series: gridpp.rl.ac.uk, pp.rl.ac.uk, ph.bham.ac.uk, tier2.hep.manchester.ac.uk, Total]

Skimming

• New Grid Project: Process real and simulated data to select ~200 subsamples, defined by the BaBar physics analysis working groups.
– Much quicker to run over a skim than the full data sample
– Skimming includes physics analysis code and saves the results, so CPU time spent in skimming is regained many times over
• Plan is to run at one or more large T2s. If we can get this into production we should be able to recover some of the UK's Common Fund rebate we've lost due to lack of T1 resources
• GridPP has funded three months of effort from Will Roethel to further this work

G. Castelli, W. Roethel, C. Brew

Status of Skimming

• Prepare code to be installed on grid: Done
• Modify BaBar framework to read data out of dCache and RFIO: Working, starting load and stability testing
• Develop tools for copying and managing data on Storage Elements: Under development (PhEDEx?)
• Integration with BaBar Task Management software:
– Task DB Creation: Done
– Task List Creation: Works
– Job Creation: Works
– Local Job Submission: Works
– Grid Job Submission: Works
– Job Monitoring: In progress, should be able to reuse code from SP Tools
– Job Recovery
– Job Output Checking: In progress
– Data Merging: Not started

User Analysis (easyGrid)

• Prototype running on the Manchester testbed (80 CPUs) since Nov 2005 without problems. Real analysis with real data by real users who know nothing about the grid.
• No errors in easyGrid job submission.
• No errors in the grid testbed, thanks to installation configuration and improvements.

J. Werner

• Many problems encountered moving from testbed to production Grid resources
– errors in RB, CE, etc. 10% of the time, with a submission rate of less than 4 jobs/second.
– errors in BDII, SE, dCache. SE fails 40% of jobs (with less than 100 jobs in parallel).
– when the SE works, performance is terrible (approx. 8 times longer to run the same software).
– lack of response to problems from site admins.
• Serious issue for a typical user analysis, which is about 2000 jobs of 8 CPU hours each
• Product development will be resumed when resources are available and reliable. Meanwhile, the easyGrid prototype and the M/Cr testbed will continue to serve users.

For more information: http://www.hep.man.ac.uk/u/jamwer/
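To put the figures above in perspective, the arithmetic can be sketched in a few lines of Python. The numbers (2000 jobs, 8 CPU hours each, 40% SE failure rate, roughly 8 times longer run time) come from the slide; the function name and the idea of treating the failure rate as a simple per-job fraction are assumptions made for illustration only.

```python
# Figures quoted on the slide; everything else is an illustrative assumption.
JOBS = 2000              # jobs in a typical user analysis
HOURS_PER_JOB = 8        # nominal CPU hours per job
SE_FAILURE_RATE = 0.40   # fraction of jobs failed by the SE
SLOWDOWN = 8             # approx. run-time factor when the SE does work


def analysis_cost(jobs, hours_per_job, failure_rate, slowdown):
    """Return nominal job-hours, expected first-pass failures and slowed job-hours."""
    nominal_hours = jobs * hours_per_job
    # Expected number of jobs lost on the first submission pass.
    expected_failures = jobs * failure_rate
    # Job-hours for one successful pass if every job takes `slowdown` times longer.
    slowed_hours = nominal_hours * slowdown
    return nominal_hours, expected_failures, slowed_hours


if __name__ == "__main__":
    nominal, failures, slowed = analysis_cost(
        JOBS, HOURS_PER_JOB, SE_FAILURE_RATE, SLOWDOWN
    )
    print(f"nominal job-hours:        {nominal}")
    print(f"expected failed jobs:     {failures:.0f}")
    print(f"job-hours at 8x run time: {slowed}")
```

Under these assumptions a single analysis is nominally 16,000 job-hours, about 800 jobs would need resubmitting after the first pass, and one successful pass would take about 128,000 job-hours at the slower rate.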

User Analysis (bbrbsub)

• Integration of Simple Job Manager + bbrbsub with Grid submission
• Take the tools already used by analysis users to submit jobs at RAL
• Transparently add RAL -> RAL grid submission
• Add RAL -> M/Cr and M/Cr -> RAL submission capabilities
• Add RAL -> RALPP and M/Cr -> RALPP
• Gradually build up full grid functionality
– Application transport and configuration
– Automatic output recovery
– Job to data matching

G. Castelli
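The last item, job to data matching, amounts to sending each job to a site that already holds its input data. The sketch below is not the bbrbsub implementation: the catalogue structure, the dataset names and the function name are all hypothetical, and only the site names (RAL, M/Cr, RALPP) are taken from the slide.

```python
def match_job_to_site(dataset, replica_catalogue, preferred_sites):
    """Return the first preferred site holding `dataset`, or None if no site has it.

    replica_catalogue: dict mapping dataset name -> set of site names (hypothetical).
    preferred_sites:   site names in order of preference.
    """
    sites_with_data = replica_catalogue.get(dataset, set())
    for site in preferred_sites:
        if site in sites_with_data:
            return site
    return None


# Hypothetical catalogue: dataset names are invented for illustration.
catalogue = {
    "skim-A": {"RAL", "M/Cr"},
    "skim-B": {"RALPP"},
}
preference = ["RAL", "M/Cr", "RALPP"]

for name in ("skim-A", "skim-B", "skim-C"):
    print(name, "->", match_job_to_site(name, catalogue, preference))
```

Returning None for an unknown dataset leaves the caller to decide whether to copy the data first or refuse the job, which keeps the matching step free of side effects.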

Overall Grid Experience

• Grid is still not reliable (worst test run):
• SP running seems to indicate that the Grid isn't getting more reliable and may be getting less so; long-term efficiency is stuck around 80%:
– RB problems (we have the capability to use multiple RBs, but efficiency drops because of the lack of failover)
– Central LFC problems
– BDII problems: sites drop in and out of the BDII
– SE problems: files randomly fail to upload or download
• Previously could run for 1-2 weeks at a time with minimal intervention; now seems to need daily (or more frequent) interventions
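The missing RB failover mentioned above could, in principle, be handled on the client side by trying each broker in turn. The following is a minimal sketch of that idea only: `BrokerError`, `submit_with_failover`, the `submit` callable and the broker host names are hypothetical, and no real Grid middleware call is used.

```python
class BrokerError(Exception):
    """Raised by a (hypothetical) submit function when a broker rejects or loses a job."""


def submit_with_failover(job, brokers, submit):
    """Try each broker in order; return (broker, job_id) from the first that accepts."""
    errors = []
    for broker in brokers:
        try:
            return broker, submit(broker, job)
        except BrokerError as exc:
            # Record the failure and fall through to the next broker.
            errors.append((broker, str(exc)))
    raise BrokerError(f"all brokers failed for {job!r}: {errors}")


def fake_submit(broker, job):
    """Stand-in for a real submission call: one broker is always down."""
    if broker == "rb-down.example.org":
        raise BrokerError("connection refused")
    return f"{broker}/{job}"


if __name__ == "__main__":
    rbs = ["rb-down.example.org", "rb-up.example.org"]
    print(submit_with_failover("sp-job-001", rbs, fake_submit))
```

Passing the submission function in as an argument keeps the failover logic independent of whichever middleware command is actually used to reach a broker.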

RAL to RAL Successful Job Rate: Grid below 50%, PBS above 99%

Conclusions

• BaBar has made good progress on moving its three main offline compute-intensive processes to the Grid
• Monte Carlo generation is in production; significant progress has been made in skimming and user analysis
• There are many things we like about the Grid
• We are adapting the BaBar software framework to integrate better with the Grid: the dependence on Objectivity will be removed and we are adding the ability to read data directly from Storage Elements
• However, reliability and ease of use are still big issues