Page 1

SouthGrid Status

Pete Gronbech: 12th March 2008

GridPP 20 Dublin

Page 2

UK Tier 2 reported CPU – Historical View

[Chart: UK Tier 2 reported CPU per month, Oct-06 to Feb-08; y-axis: K SPECint2000 hours (0 to 1,400,000); series: UK-London-Tier2, UK-NorthGrid, UK-ScotGrid, UK-SouthGrid.]

Page 3

UK Tier 2 reported CPU – Feb 2008 View

Page 4

SouthGrid Sites Accounting as reported by APEL

[Chart: SouthGrid sites' APEL-reported CPU per month, Oct-06 to Feb-08; y-axis: K SPECint2000 hours (0 to 250,000); series: JET, BHAM, BRIS, CAM, OX, RALPPD.]
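The unit on these plots is thousands of SPECint2000 hours (KSI2K-hours): the CPU time of each job is scaled by the benchmark rating of the hardware it ran on, then summed per site and month. As a rough illustration only (the job records and per-core ratings below are invented, not real APEL data), the normalisation amounts to something like this:

```python
# Rough sketch of APEL-style normalisation into KSI2K-hours.
# Job records and per-core KSI2K ratings are invented for illustration;
# real APEL accounting is fed from the sites' batch-system logs.
from collections import defaultdict

# (site, month, cpu_time_hours) -- hypothetical job records
jobs = [
    ("OX",     "Feb-08", 1200.0),
    ("OX",     "Feb-08",  800.0),
    ("RALPPD", "Feb-08", 5400.0),
    ("CAM",    "Feb-08",  950.0),
]

# Hypothetical KSI2K rating of one core at each site
ksi2k_per_core = {"OX": 1.6, "RALPPD": 1.7, "CAM": 2.8}

totals = defaultdict(float)  # (site, month) -> KSI2K-hours
for site, month, cpu_hours in jobs:
    totals[(site, month)] += cpu_hours * ksi2k_per_core[site]

for (site, month), ksi2k_hours in sorted(totals.items()):
    print(f"{site:8s} {month}: {ksi2k_hours:9.1f} KSI2K-hours")
```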

Page 5

RAL PPD: 600 KSI2K, 158 TB

• SL4 CE installed; some teething problems.

• 80 TB storage, plus a further 78 TB which was loaned to the RAL Tier 1.

• SRMv2.2 upgrade on the dCache SE proved very tricky; space tokens are not yet defined.

• 2008 hardware upgrade purchased but not yet installed.

• Some kit installed in the Atlas Centre, due to power/cooling issues in R1

• Two new sys admins started during the last month.

Page 6

Status at Cambridge: 391 KSI2K, 43 TB

• 32 Intel 'Woodcrest' servers, giving 128 CPU cores, equivalent to 358 KSI2K (a quick arithmetic check follows after this list).

• June 2007 storage upgrade of 40 TB, running DPM on SL4 64-bit.

• Plans to double the storage and update the CPUs.

• Condor version 6.8.5 is being used.

• SAM availability is high.

• Lots of work by Graeme and Santanu to get the site verified for ATLAS production, but there have been recent problems with long jobs failing.

• Now working with LHCb to solve issues.

• Problems with accounting: we still don't believe that the work done at Cambridge is reported correctly in the accounting.
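A quick back-of-the-envelope check of the figures above (a sketch only, not how the site rating was actually derived): 32 servers giving 128 cores means 4 cores per box, and 358 KSI2K over 128 cores is roughly 2.8 KSI2K per core.

```python
# Back-of-the-envelope check of the Cambridge figures quoted above.
servers = 32
total_cores = 128                       # from the slide
cores_per_server = total_cores // servers

total_ksi2k = 358.0                     # quoted rating of the Woodcrest farm
ksi2k_per_core = total_ksi2k / total_cores

print(f"{servers} servers x {cores_per_server} cores "
      f"-> {ksi2k_per_core:.2f} KSI2K per core")
# The site total quoted in the slide header is 391 KSI2K.
```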

Page 7

Birmingham Status – BaBar Cluster: 76 KSI2K, ~10-50 TB

• Had been unstable, mainly because of failing disks.

• Very few (<20 out of 120) healthy worker nodes left.

• Many workers died during two shutdowns (no power to motherboards?).

• Very time consuming to maintain.

• Recently purchased 4 twin Viglen quad-core workers – two will go to the grid (2 twin quad-core nodes = 3 racks with 120 nodes!).

• BaBar cluster withdrawn from the Grid, as effort is better spent getting new resources online.

Page 8

Birmingham Status – Atlas (grid) Farm

• Added 12 local workers to the grid; 20 workers in total -> 40 job slots.

• Will provide 60 job slots after the local twin boxes are installed.

• Upgraded to SL4.

• Installation with kickstart / Cfengine, maintained with Cfengine.

• VOs: alice atlas babar biomed calice camont cms dteam fusion gridpp hone ilc lhcb ngs.ac.uk ops vo.southgrid.ac.uk zeus.

• Several broken CPU fans are being replaced.

Page 9

Birmingham Status - Grid Storage

• 1 DPM SL3 head node with 10 TB attached to it.

• Mainly dedicated to ATLAS – no use by ALICE yet, but the latest SL4 DPM provides the xrootd needed by ALICE.

• Have just bought an extra 40 TB.

• Upgrade strategy: the current DPM head node will be migrated to a new SL4 server, then a DPM pool node will be deployed on the new DPM head node.

• Performance issues with deleting files on an ext3 filesystem were observed -> should we move to XFS? (A micro-benchmark sketch follows below.)

• SRMv2.2 with a 3 TB space token reservation for ATLAS is published.

• The latest SRMv2.2 clients (not yet in gLite) are installed on the BlueBear UI but not on PP desktops.
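On the ext3 deletion question, one way to put a number on it would be a small micro-benchmark run once on an ext3 partition and once on an XFS one: create a batch of files and time their removal. The sketch below is illustrative only; the directory, file count and file size are made up.

```python
# Minimal file-deletion micro-benchmark (illustrative sketch only).
# Point TARGET_DIR at a scratch directory on the filesystem under test,
# e.g. one run on an ext3 pool partition and one on an XFS partition.
import os
import time

TARGET_DIR = "/tmp/delete-bench"   # hypothetical scratch location
NUM_FILES = 2000
FILE_SIZE = 64 * 1024              # 64 KiB each; real pool files are larger

os.makedirs(TARGET_DIR, exist_ok=True)

payload = b"\0" * FILE_SIZE
paths = []
for i in range(NUM_FILES):
    path = os.path.join(TARGET_DIR, f"bench_{i:05d}.dat")
    with open(path, "wb") as f:
        f.write(payload)
    paths.append(path)

start = time.time()
for path in paths:
    os.remove(path)
elapsed = time.time() - start

print(f"Deleted {NUM_FILES} files in {elapsed:.2f}s "
      f"({NUM_FILES / elapsed:.0f} files/s)")
```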

Page 10

Birmingham Status - eScience Cluster

• 31 nodes (servers included), each with 2 Xeon 3.06 GHz CPUs and 2 GB of RAM, hosted by IS.

• All on a private network except one NAT node; Torque server on the private network.

• Connected to the grid via an SL4 CE in Physics – more testing needed.

• Serves as a model for gLite deployment on the BlueBear cluster -> the installation assumes no root access to the workers and uses the user tarball method (see the sketch after this list).

• Aimed to have it passing SAM tests by GridPP20, but that target may slip as work was delayed by the security challenge and by helping to set up ATLAS on BlueBear.

• The software area is not large enough to meet the ATLAS 100 GB requirement :(

• ~150 cores will be allocated to the Grid on BlueBear.
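The no-root, tarball-based approach boils down to unpacking the middleware into a user-owned prefix and having the job wrappers source an environment file from there. A very rough sketch of the idea follows; the tarball name, prefix and environment file are hypothetical, not the actual gLite tarball layout.

```python
# Illustrative sketch of a user-space (no root) tarball installation.
# The tarball path, install prefix and environment file are hypothetical.
import os
import tarfile

TARBALL = os.path.expanduser("~/downloads/glite-wn-tarball.tar.gz")  # hypothetical
PREFIX = os.path.expanduser("~/glite-wn")                            # user-owned prefix

os.makedirs(PREFIX, exist_ok=True)
with tarfile.open(TARBALL, "r:gz") as tar:
    tar.extractall(PREFIX)        # no root privileges needed anywhere

# Write a small environment file for job wrappers to source.
env_file = os.path.join(PREFIX, "grid-env.sh")
with open(env_file, "w") as f:
    f.write(f'export GLITE_LOCATION="{PREFIX}"\n')
    f.write(f'export PATH="{PREFIX}/bin:$PATH"\n')
    f.write(f'export LD_LIBRARY_PATH="{PREFIX}/lib:$LD_LIBRARY_PATH"\n')

print(f"Unpacked into {PREFIX}; jobs should source {env_file}")
```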

Page 11

Bristol Update

• Bristol is pleased to report that, after considerable hard work, LCG on the Bristol University HPC is running well, and the accounting now shows the promised much higher-spec CPU usage.

• http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php?ExecutingSite=UKI-SOUTHGRID-BRIS-HEP

• That purple credit goes to Jon Wakelin & Yves Coppens.

• Work will start soon on testing StoRM on SL4, in preparation for replacing DPM access from the HPC with StoRM.

• DPM will remain in use for the smaller cluster.

• 50 TB of storage (GPFS) will be ready for PP by 1 Sept 2008 at the latest.

• The above CE & SE are still on 'proof-of-concept' borrowed hardware.

• Purchases for a new CE/SE/MON & NAT are pending; we would also like to replace the older CE/SE/UI & WNs (depends on funding).

Page 12

Site status and future plans

• Oxford: 510 KSI2K, 102 TB

  – Two sub-clusters:

    • (2004) 76 × 2.8 GHz CPUs running SL3, to be upgraded ASAP

    • (2007) 176 × Intel 5345 (quad-core) running SL4

  – New kit installed last September in the local computer room; performing very well so far.

  – Need to move the 4 grid racks to the new Computer Room at Begbroke Science Park before the end of March.

Page 13

Oxford

• Routine updates have brought us to the level required for CCRC08, and our storage had space tokens configured to allow us to take part in CCRC and FDR successfully.

• We have been maintaining two parallel services, one with SL4 workers, one with SL3 to support VOs that are still migrating. We've been working with Zeus and now have them running on the SL4 system, so the SL3 one is now due for imminent retirement. Overall it has been useful to maintain the two clusters rather than just moving to SL4 in one go.

• We've been delivering large amounts of work to LHC VOs. In periods where there hasn't been much LHC work available, we've been delivering time to the fusion VO as part of our efforts to bring in resources from non-PP sites such as JET.

• Oxford is one of the two sites supporting the vo.southgrid.ac.uk regional VO; so far this has only really been testing work, but we have some potentially interested users whom we're hoping to introduce to the grid.

• On a technical note, Oxford's main CE (t2ce03.physics.ox.ac.uk) and site BDII (t2bdii01.physics.ox.ac.uk) are running on VMware Server virtual machines. This allows good use of the hardware and a clean separation of services, and seems to be working very well.

Page 14

EFDA JET

• Cluster upgraded at the end of November 2007, with 80 Sun Fire X2200 servers with Opteron 2218 CPUs.

• Worker nodes upgraded to SL4.

• Have provided a valuable contribution to the ATLAS VO.

• 242 KSI2K, 1.5 TB.

Page 15

SouthGrid... Issues?

• How can SouthGrid become more pro-active with VOs (ATLAS)?

• ALICE is very specific with its VOBOX.

• CMS requires PhEDEx, but RALPPD may be able to provide the interface for SouthGrid.

• Zeus and Fusion are strongly supported.

• NGS integration: Oxford has become an affiliate and Birmingham is passing conformance tests.

• The SouthGrid regional VO will be used to bring local groups to the grid.

• Considering the importance of accounting, do we need independent cross-checks? Are there manpower issues supporting APEL? (A rough cross-check sketch follows below.)

• Bham PPS nodes are broken -> PPS service suspended :(

• What strategy should SouthGrid adopt (the PPS needs to do 64-bit testing)?
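On the cross-check question, one lightweight option would be to compare each site's locally computed KSI2K-hours (from its own batch logs) against the totals published on the accounting portal, flagging large discrepancies. A rough sketch with invented numbers:

```python
# Sketch of an independent accounting cross-check (all numbers invented).
# "local" would come from site batch logs; "portal" from the published
# APEL/EGEE accounting pages. Both are hard-coded here for illustration.
local_ksi2k_hours = {"BHAM": 41000, "BRIS": 18000, "CAM": 52000,
                     "JET": 30000, "OX": 88000, "RALPPD": 210000}
portal_ksi2k_hours = {"BHAM": 40500, "BRIS": 17800, "CAM": 31000,
                      "JET": 29900, "OX": 87500, "RALPPD": 209000}

THRESHOLD = 0.10   # flag discrepancies above 10%

for site in sorted(local_ksi2k_hours):
    local = local_ksi2k_hours[site]
    portal = portal_ksi2k_hours.get(site, 0)
    diff = abs(local - portal) / local
    flag = "  <-- CHECK" if diff > THRESHOLD else ""
    print(f"{site:8s} local={local:7d}  portal={portal:7d}  diff={diff:5.1%}{flag}")
```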

Page 16

SouthGrid Summary

• Big improvements at:
  – Oxford
  – Bristol
  – JET

• Expansion expected shortly at:
  – RAL PPD
  – Birmingham

• Working hard to solve problems with exploiting the resources at Cambridge

• It's sometimes an uphill struggle

• But the top is getting closer