GridPP3
David Britton, 27 June 2006
Life after GridPP2
We propose a 7-month transition period for GridPP2, followed by a three year co-development programme with the LHC Computing Grid, the proposed European Grid Infrastructure (EGI), the Particle Physics experiments and the Institutes. The GridPP3 project, a continuation of GridPP, will deliver a full-scale Grid for Exploitation to meet the reconstruction, simulation and analysis requirements of experiments across the Particle Physics programme.
Timeframe: GridPP2+ from Sep 07 to Mar 08; GridPP3 from Apr 08 to Mar 11.
Budget: has not been pre-specified. The input to the exploitation review was £36.6m for this period, which is clearly at (or above) the upper limit.
GridPP2+
In the 7-month period from Sep-07 to Apr-08 we propose (following the suggestion of the Oversight Committee) to continue the GridPP2 project largely as-is, primarily in order to:
1) Sort out issues with the timeframe of the PPRP process and the post extensions beyond Sep-07.
2) Provide continuity of management and support over the expected start-up phase of the LHC.
3) Align future projects with financial years, with EGEE and a possible future EGI project, and with other grants in the UK.
The proposal is to continue all GridPP2 posts in this period except for the Application posts (which have been applied for via the Rolling Grant mechanism).
We hope (need) to use the GridPP2+ period to install and commission a substantial pulse of hardware to be ready for the start of the LHC.
GridPP3
[Organogram: Oversight Committee (OC); Project Management Board (PMB); Collaboration Board (CB); Deployment Board (DB) on the provision side; User Board (UB) on the utilisation side; annotated “Review / React” and with the element motifs Earth, Wind, Water and Fire.]
GridPP3 Deployment Board
In GridPP2, the Deployment Board is squeezed into a space already occupied by the Tier-2 Board, the D-TEAM and the PMB. Many meetings have been “joint” with one of these other bodies, and its identity and function have become blurred.
[Diagram: in GridPP3 the Deployment Board (T1B Chair, T2B Chair, Production Manager, Technical Coordinator) sits below the Project Management Board and above the Tier-1 Board, Tier-2 Board, D-Team and groups Grp-1, Grp-2, …, Grp-n.]
In GridPP3, we propose a combined Tier-2 Board and Deployment Board with overall responsibility for deployment strategy to meet the needs of the experiments. In particular, this is a forum where providers and users formally meet. It deals with:
1) Issues raised by the Production Manager which require strategic input.
2) Issues raised by users concerning the service provision.
3) Issues to do with Tier-1 - Tier-2 relationships.
4) Issues to do with Tier-2 allocations, service levels and performance.
5) Issues to do with collaboration with Grid Ireland and the NGS.
GridPP3 DB Membership
1) Chair
2) Production Manager
3) Technical Coordinator
4) The four Tier-2 Management Board chairs
5) Tier-1 Board Chair
6) ATLAS, CMS and LHCb representatives
7) User Board Chair
8) Grid Ireland representative
9) NGS representative
10) Technical people invited for specific issues
The above list gives ~13 core members, five of whom are probably also on the PMB. This is a move away from the technical side of the current DB: the board becomes a forum where the deployers meet each other and hear directly from the main users. The latter is designed to ensure buy-in by the users to strategic decisions.
LHC Hardware Requirements
The GridPP Exploitation Review input took the global hardware requirements and multiplied by the UK authorship fraction:
ALICE 1%, ATLAS 10%, CMS 5%, LHCb 15%.
It is problematic to use “authors” in the denominator when not all authors (globally) have an associated Tier-1: such an algorithm applied globally would not result in sufficient hardware. GridPP has therefore asked the experiments for their requirements, and their input (relative to their global requirements) is:
ALICE ~1.3%, ATLAS ~13.7%, CMS ~10.5%, LHCb ~16.8%.
Schematically:
(Global Requirements) × (Global T1 author fraction) ≈ ~50% × (Global Requirements),
so scaling by authorship alone would deliver only about half of the global requirement, whereas an equal split,
(Global Requirements) / (Number of Tier-1s),
corresponds to roughly the UK authorship fraction of the global requirement.
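To make the shortfall argument concrete, here is a worked version; the ~50% Tier-1 author fraction is the slide's figure, while the ten-Tier-1 count used in the ATLAS illustration is an assumption for the example:

\[
\sum_{c\,\in\,\mathrm{T1\ countries}} f_{\mathrm{auth}}^{(c)} \times R_{\mathrm{global}}
\;\approx\; 0.5\,R_{\mathrm{global}}
\quad\text{(authorship scaling under-provisions),}
\]
\[
\frac{R_{\mathrm{global}}}{N_{\mathrm{T1}}}
\;\approx\; \frac{R_{\mathrm{global}}}{10}
\;=\; 10\%\,R_{\mathrm{global}}
\;\approx\; f_{\mathrm{auth}}^{\mathrm{UK}}\,R_{\mathrm{global}}
\quad\text{(equal split per Tier-1, ATLAS illustration).}
\]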
Proposed Hardware
The input from the User Board was that the hardware requirements in the GridPP3 proposal should be:
• Those defined by the LHC experiments;
• plus those defined by BaBar (historically well understood);
• plus a 5% provision for “Other” experiments at the Tier-2s only.
Hardware Costs
[Chart: Estimated price per KSI2K, 2002-2012, plotted as Price/KSI2K (£k) and LN(Price/MSI2K): Moore's Law for CPU cost.]
[Chart: Storage cost, 2002-2012, plotted as Price/TB (£k) and LN(Price/PB): Kryder's Law for disk cost.]
Hardware costs are extrapolated from recent purchases. However, experience tells us there are fluctuations associated with technology steps, so there is significant uncertainty in the integrated cost.
The model must factor in:
• the operational life of equipment;
• known operational overheads;
• the lead time for delivery and deployment.
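As an illustration of this extrapolation, the sketch below models an exponentially falling unit price (linear in LN(price), as in the charts above) and integrates the cost of a purchase profile. The halving time and reference price are illustrative assumptions, not GridPP's fitted values:

# A minimal sketch of the price extrapolation behind the cost model: price
# per unit (KSI2K of CPU, TB of disk) falls exponentially, i.e. LN(price)
# falls linearly with time. Halving time and 2007 reference price are
# illustrative assumptions.

def unit_price(year, ref_year=2007, ref_price_k=1.0, halving_years=1.5):
    """Price per unit in £k, assuming a fixed halving time (Moore/Kryder)."""
    return ref_price_k * 0.5 ** ((year - ref_year) / halving_years)

def integrated_cost(purchases_by_year, **kwargs):
    """Total spend when buying `units` of new capacity in each year."""
    return sum(units * unit_price(year, **kwargs)
               for year, units in purchases_by_year.items())

print(unit_price(2010))                                # 0.25 £k per unit
print(integrated_cost({2008: 500, 2009: 700, 2010: 900}))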
Hardware Costs: Tape
Capacity model (TB, by year):

                                          2007   2008   2009   2010   2011   2012
Required capacity                          816   2538   4808   7682   9753  12085
Actual CASTOR capacity                     544   2538   4808  10516  10129  12085

9940 media:
Existing 9940 slot count                  1948      0      0      0      0      0
Media capacity (TB/tape)                 0.182  0.182  0.182  0.182  0.182  0.182
Existing 9940 capacity (TB)                324      0      0      0      0      0

T10K/T20K media:
Total required tape capacity, April (TB)   816   2538   4808   7682   9753  12085
Tapes phased out in March                    0      0      0      0    430    778
Total tapes available in March             430   1208   5639  10684  11254  10476
Total storage capacity, March (TB)         194    544   2538   9616  10129   9429
Additional TB required for April           350   1994   2270      0      0   2656
Additional tapes purchased                 778   4432   5045   1000      0   2951
Used slots, April (T10K/T20K)             1208   5639  10684  11684  11254  13428
T10K/T20K media cost (£k/tape)            0.08   0.07   0.06   0.06   0.06   0.06
Media capacity (TB/tape)                  0.45   0.45   0.45    0.9    0.9    0.9
Spent on media (£k)                         62    310    303     60      0    177

Robot infrastructure:
Spent on new robot infrastructure (£k)       -    250      -     50      -     50
New slots purchased                          -   6000      -   2000      -   2000
Maximum slot count available              5000  11000  11000  13000  13000  15000
Total used slots                          3156   5639  10684  11684  11254  13428

Bandwidth model (MB/s):
Estimated rate to fill (6 months)           32    114    151    191    137    155
In-beam double fill rate                     -    228    301    381    275    309
In-beam media conversion (6 months)          -     17      -    319      -      -
In-beam reprocessing                         -    114    151    191    137    155
Out-of-beam reprocessing read rate
  (4 months?)                                -    252    478    764    970   1202
Drive deadtime on writes                   25%    25%    25%    25%    25%    25%
Drive deadtime on reads                    25%    25%    25%    25%    25%    25%

In-beam write capacity required              -    327    401    933    366    412
Out-of-beam write capacity required          -      0      0      0      0      0
In-beam read capacity required               -    174    201    679    183    206
Out-of-beam read capacity required           -    337    638   1019   1294   1603

In-beam total required bandwidth             -    501    602   1613    549    619
Out-of-beam total required bandwidth         -    337    638   1019   1294   1603
Total available CASTOR bandwidth           555    640    720   1680   1320   1680

9940B drives:
9940B drives                                 6      3      0      0      0      0
Maintenance cost per drive (£k)            3.0    3.3    3.6    4.0    4.4    4.8
Spent on 9940B maintenance (£k)             18    9.9      0      0      0      0
Bandwidth per drive (MB/s)                  25     25     25     25     25     25
9940B bandwidth (MB/s)                     150     75      0      0      0      0

T10K server bricks:
Cost of storage brick (£k)               19.15  19.15  19.15  19.15  19.15  19.15
Maintenance cost per drive (£k)            2.3    2.3    2.3    2.3    2.3    2.3
New T10K server bricks                       3      2      1      0      -      -
Total T10K server bricks                     6      8      9      9      0      0
Bandwidth per brick (MB/s)                  80     80     80     80     80     80
Spent on server bricks (£k)              57.45   38.3  19.15      0      0      0
Spent on T10K maintenance (£k)             6.9   13.8   18.4   20.7      -      -
Total T10K bandwidth (MB/s)                480    640    720    720      0      0

T20K server bricks:
Cost of storage brick (£k)                   -      -      -  19.15  19.15  19.15
Maintenance cost per drive (£k)              -      -      -    2.3    2.3    2.3
New T20K server bricks                       -      -      -      8      3      3
Total T20K server bricks                     -      -      -      8     11     14
Bandwidth per brick (MB/s)                   -      -      -    120    120    120
Spent on server bricks (£k)                  0      0      0  153.2  57.45  57.45
Spent on T20K maintenance (£k)               0      0      0      0   18.4   25.3
Total T20K bandwidth (MB/s)                  0      0      0    960   1320   1680

Other costs (£k):
Spent on ADS maintenance                    10     10      0      0      0      0
Spent on minor parts                        10     10     10     10     10     10
Spent on Robot 1 M&O                        30     30     30     30     30     30
Spent on Robot 2 M&O                         -     50     50     55     55     60

Summary (£k):
Spent on media                              62    310    303     60      0    177
Spent on bandwidth and operation           132    412    128    319    189    258
Total spent                                195    722    430    379    189    435
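A rough sketch of the media-purchase rule embedded in the capacity model above; the function names are mine, and the slot, phase-out and bandwidth constraints of the full model are omitted:

# Buy enough new tapes each year to cover next April's capacity shortfall.
import math

def tapes_to_buy(required_tb, available_tb, tb_per_tape):
    """Number of new tapes needed to close the capacity shortfall."""
    shortfall = max(0.0, required_tb - available_tb)
    return math.ceil(shortfall / tb_per_tape)

def media_spend_k(n_tapes, cost_k_per_tape):
    return n_tapes * cost_k_per_tape

# 2008 column of the table: 2538 TB required, 544 TB on hand, 0.45 TB/tape:
n = tapes_to_buy(2538, 544, 0.45)
print(n, round(media_spend_k(n, 0.07)))   # -> 4432 tapes, ~310 £k, as tabulated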
Tier-1 Hardware
Hardware spend table (£k):

                          FY07     FY08     FY09     FY10
Tier-1 CPU               1,252    1,015    1,137      961
Tier-1 Disk              1,786      851    1,170      855
Tier-1 Tape                494      363        0       37
Tape infrastructure        132      489      247      321
Tier-1 infrastructure      113      119      125      131
Tier-1 sub-total         3,778    2,838    2,679    2,305
Tier-2 Resources
In GridPP2 we paid for staff in return for provision of hardware, which is not a sustainable model. We need a transition to a sustainable model that generates sufficient (but not excessive) hardware and that institutes will buy into.
Such a model should acknowledge that:
• we are building a Grid (not a computer centre);
• historically, Tier-2s have allowed us to lever resources and funding;
• Tier-2s are designed to provide different functions and different levels of service from the Tier-1;
• dual funding opportunities may continue for a while;
• institutes may gain strategically by continuing to be part of the “world's largest Grid”.
Tier-2 Hardware
Model (for proposal) endorsed by CB:
- GridPP funds ~15 FTE at the Tier-2s.
- Tier-2 hardware requirements are defined by the UB request.
- GridPP pays the cost of purchasing hardware to satisfy the following year's requirements at the current year's price, divided by the nominal hardware lifetime (~4 years for disk; ~5 years for CPU).
E.g. 2253 TB of disk is required in 2008. In January 2007 this would cost ~1.0 k£/TB, so with a 4-year lifetime the one-year “value” is 2253 × 1.0 / 4 ≈ £563k.
Note: this does not necessarily reimburse the full cost of the hardware, because in subsequent years the money GridPP pays depreciates with the falling cost of hardware, whereas a Tier-2 that actually made a purchase is locked into a cost determined by the purchase date. However, GridPP pays the cost up to one year before the actual purchase date, and institutes which already own resources can delay the spend further. A sketch of the payment rule follows.
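A minimal sketch of that payment rule; the function and parameter names are mine, and only the worked 2008 disk example comes from the slide:

def tier2_payment_k(next_year_requirement, current_price_k_per_unit,
                    lifetime_years):
    """One year's GridPP contribution in £k: next year's requirement at
    this year's price, spread over the nominal hardware lifetime."""
    return next_year_requirement * current_price_k_per_unit / lifetime_years

# 2253 TB of disk needed in 2008 at ~1.0 £k/TB (January 2007 price),
# with a 4-year disk lifetime:
print(round(tier2_payment_k(2253, 1.0, 4)))   # -> 563 (£k)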
Tier-2 Resources
Sanity Checks:
1) We can apply the model and compare the cost of hardware at the Tier-1 and at the Tier-2s, integrated over the lifetime of the project:

                       Tier-1    Tier-2
CPU (k£/KSI2K-year):   0.070     0.045
Disk (k£/TB-year):     0.144     0.109
Tape (k£/TB-year):     0.052     -

2) Total cost of ownership: we can compare the total cost of the Tier-2 facilities with the cost of placing the same hardware at the Tier-1 (assuming that doubling the Tier-1 hardware requires a 35% increase in staff).
Including staff and hardware, the cost of the Tier-2 facilities is ~80% of the cost of an enlarged Tier-1.
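An illustrative sketch of sanity check 2; only the 35% staffing uplift and the ~80% conclusion come from the slide, and the inputs are placeholders for the real GridPP figures:

def tier2_cost_ratio(t2_staff_k, t2_hardware_k,
                     t1_staff_k, same_hardware_at_t1_k,
                     staff_uplift=0.35):
    """Tier-2 facilities cost relative to an enlarged Tier-1 hosting the
    same hardware (doubling Tier-1 hardware adds ~35% staff)."""
    tier2_total = t2_staff_k + t2_hardware_k
    enlarged_t1 = staff_uplift * t1_staff_k + same_hardware_at_t1_k
    return tier2_total / enlarged_t1   # the slide quotes ~0.8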
Running Costs
(Work in progress)
Running costs: CPU
                       2007     2008     2009     2010
New systems             166      761      404      473
New racks                 5       24       13       15
Phased-out racks          4        3        5        0
Rack count               18       39       47       61
kW per new system      0.26     0.26     0.27     0.29
New kW                    -      198      110      136
Phased-out kW             -       18       51        0
Total load (kW)         151      330      390      525
Cost per kW (£k)          -  0.00008  0.00008  0.00009
Cost                    £0k    £347k    £430k    £609k

Running costs: Disk
                       2007     2008     2009     2010
New systems             101      201       82      134
New racks                14       29       12       19
Phased-out racks          3        4        0       10
Rack count               32       57       69       78
kW per new system     0.735     0.77     0.81     0.85
New kW                    -      155       66      114
Phased-out kW             -       14        0       49
Total load (kW)         116      257      323      388
Cost per kW (£k)          -  0.00008  0.00008  0.00009
Cost                    £0k    £270k    £357k    £450k
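The arithmetic implied by these tables is the total electrical load times an electricity tariff. In the sketch below, the 24×365 duty cycle is an assumption and the ~1.5 overhead factor (e.g. cooling) is inferred from the slide's own numbers rather than stated by it:

def running_cost_k(total_load_kw, tariff_k_per_kwh=0.00008, overhead=1.5):
    """Annual electricity cost in £k for a continuous load."""
    return total_load_kw * 24 * 365 * tariff_k_per_kwh * overhead

print(round(running_cost_k(330)))   # -> 347 (£k), the FY08 CPU figure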
Total Hardware Cost
Hardware spend table (£k):

                          FY07     FY08     FY09     FY10
Tier-1 CPU               1,252    1,015    1,137      961
Tier-1 Disk              1,786      851    1,170      855
Tier-1 Tape                494      363        0       37
Tape infrastructure        132      489      247      321
Tier-1 infrastructure      113      119      125      131
Tier-2 CPU                 612      656      740      656
Tier-2 Disk                552      638      643      626
Tier-2 Tape                  0        0        0        0
TOTAL                    4,941    4,133    4,062    3,587
Tier-1 running costs         0      628      798    1,071
Tier-2 running costs     (assumed to be included elsewhere)

This is in addition to ~£1.6m of GridPP2 money – likely to be problematic!
Tier-1 Service
“Tier1 Centres provide a distributed permanent back-up of the raw data, permanent storage and management of data needed during the analysis process, and offer a grid-enabled data service. They also perform data-intensive analysis and re-processing, and may undertake national or regional support tasks, as well as contribute to Grid Operations Services.” [LCG MoU]
The exact role of the Tier-1 varies from experiment to experiment, and is provided in detail in the individual experiments’ TDRs. However broadly the Tier-1 will carry out the following tasks:
• acceptance of an agreed share of raw data from the Tier-0 Centre, keeping up with data acquisition;
• acceptance of an agreed share of first-pass reconstructed data from the Tier-0 Centre;
• acceptance of processed and simulated data from other centres of the WLCG;
• recording and archival storage of the accepted share of raw data (distributed back-up);
• recording and maintenance of processed and simulated data on permanent mass storage;
• provision of managed disk storage providing permanent and temporary data storage for files and databases;
• provision of access to the stored data by other centres of the WLCG;
• operation of a data-intensive analysis facility;
• provision of other services according to agreed Experiment requirements;
• ensuring high-capacity network bandwidth and services for data exchange with the Tier-0 Centre, as part of an overall plan agreed amongst the Experiments, Tier-1 and Tier-0 Centres;
• ensuring network bandwidth and services for data exchange with Tier-1 and Tier-2 Centres, as part of an overall plan agreed amongst the Experiments, Tier-1 and Tier-2 Centres;
• administration of databases required by Experiments at Tier-1 Centres.
All storage and computational services shall be “grid enabled” according to standards agreed between the LHC Experiments and the regional centres.
        Tier-0                                Tier-1                                       Tier-2
ALICE   First-pass scheduled reconstruction   Reconstruction; on-demand analysis           Central simulation; on-demand analysis
ATLAS   Reconstruction                        Scheduled analysis / skimming; calibration   Simulation; on-demand analysis; calibration
CMS     Reconstruction                        Scheduled analysis / skimming                Simulation; on-demand analysis; calibration
LHCb    Reconstruction                        On-demand analysis; scheduled skimming       Simulation
Tier-1 Service
Maximum delay in responding to operational problems, and average availability measured on an annual basis:

Service                                     Service        Degradation     Degradation     Availability:     Availability:
                                            interruption   > 50%           > 20%           accelerator op.   other times
Acceptance of data from the Tier-0
  Centre during accelerator operation       12 hours       12 hours        24 hours        99%               n/a
Networking service to the Tier-0
  Centre during accelerator operation       12 hours       24 hours        48 hours        98%               n/a
Data-intensive analysis services,
  including networking to Tier-0 and
  Tier-1 Centres, outwith accelerator
  operation                                 24 hours       48 hours        48 hours        n/a               98%
All other services – prime service hours    2 hours        2 hours         4 hours         98%               98%
All other services – outwith prime
  service hours                             24 hours       48 hours        48 hours        97%               97%

(“Degradation > 50%” means degradation of the capacity of the service by more than 50%, and similarly for 20%.)
Tier-1 Growth
                          Now        Start of GridPP3   End of GridPP3
Spinning disks            ~2,000     ~10,000            ~20,000
Yearly disk failures      30-45      200-300?           400-600?
CPU systems               ~550       ~1,800             ~2,700
Yearly system failures    35-40      120-130?           180-200?
To achieve the levels of service specified in the MoU, a multi-skilled incident response unit (3 FTE) is proposed. This is intended to reduce the risk of over-provisioning other work areas to cope with long-term fluctuations in fault rate. These staff will expect that their primary daily role is dealing with whatever has gone wrong, and they will also provide the backbone of the primary callout team.
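A small sketch of the scaling assumption behind these projections: yearly failures grow roughly in proportion to component count. The per-component rate is derived from the “Now” column; the slide's quoted ranges sit somewhat above pure linear scaling:

def projected_failures(n_components, failures_now, n_now):
    """Scale today's failure count linearly with the component count."""
    return n_components * failures_now / n_now

print(projected_failures(10_000, 30, 2_000),
      projected_failures(10_000, 45, 2_000))   # -> 150.0 225.0 (slide: 200-300)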
Tier-1 Staff
GridPP3 Tier-1 staff (FTE):

Work area                  PPARC funding   CCLRC funding
CPU                        2.0             0.0
Disk                       3.0             0.0
Tape service (CASTOR)      2.0             1.3
Core services              1.0             0.5
Operations                 3.0             1.0
Incident Response Unit     3.0             0.0
Networking                 0.0             0.5
Deployment                 1.5             0.0
Experiment support         1.5             0.0
Tier-1 management          1.0             0.3
Totals                     18.0            3.6
Tier-2 Service
The following services shall be provided by each of the Tier-2 Centres in respect of the LHC Experiments that they serve, according to policies decided by these Experiments:
• provision of managed disk storage providing permanent and/or temporary data storage for files and databases;
• operation of an end-user analysis facility;
• provision of other services, such as simulation, according to agreed Experiment requirements;
• provision of network services for data exchange with Tier-1 Centres, as part of an overall plan agreed between the Experiments and the Tier-1 Centres concerned.
All storage and computational services shall be “grid enabled” according to standards agreed between the LHC Experiments and the regional centres.
Service                      Maximum delay in responding to      Average availability
                             operational problems                (measured annually)
                             Prime time      Other periods
End-user analysis facility   2 hours         72 hours            95%
Other services               12 hours        72 hours            95%
Tier-2 Staff
Institute     FTE     FTE %
Brunel        0.50    3%
Imperial      1.50    10%
QMUL          1.00    7%
RHUL          0.50    3%
UCL           1.00    7%
Lancaster     1.50    10%
Liverpool     1.00    7%
Manchester    1.50    10%
Sheffield     1.00    7%
Durham        0.25    2%
Edinburgh     0.50    3%
Glasgow       1.00    7%
Birmingham    1.00    7%
Bristol       1.00    7%
Cambridge     0.50    3%
Oxford        0.50    3%
RAL PPD       0.50    3%
Total         14.75   100%
[Chart: allocated FTE versus allocated CPU by institute; labelled points include Manchester, Birmingham, UCL, Sheffield and Imperial.]
Grid Deployment Staff (Operations)
Team of 8: A Production Manager; 4 Tier-2 Coordinators; 3 GOC staff.
Their activities include:
• resource and deployment planning, and scheduling upgrades;
• installation and configuration of Grid middleware services;
• support of these Grid services;
• Grid operations;
• user support;
• system manager support;
• monitoring, accounting and auditing;
• security (both operational and policy aspects);
• documentation;
• VO management and support.
Grid Support Staff
Grid support posts (FTE); GridPP2 posts are split into Tier-2 Expert and MSN posts per area:

Area              GridPP2 posts           to Sep 07   to Apr 08   GridPP3 post           to Apr 09   to Apr 10   to Apr 11
InfoMon           Tier-2 Expert / MSN     1.0 / 1.0   1.0 / 1.0   InfoMon Support        1.0         1.0         1.0
Network           Tier-2 Expert / MSN     1.0 / 1.0   1.0 / 1.0   Network Provision      5.0         3.0         2.0
WLMS              Tier-2 Expert / MSN     1.0 / 2.0   1.0 / 2.0   WLMS, RTM and Portal   2.5         2.0         2.0
Data Management   Tier-2 Expert / MSN     1.5 / 3.5   1.5 / 3.5   Data Management        3.0         3.0         3.0
Data Storage      Tier-2 Expert / MSN     0.0 / 3.5   0.0 / 3.5   Storage Support        3.5         3.5         3.5
Security          Tier-2 Expert / MSN     0.5 / 2.0   0.5 / 2.0   Security Support       3.0         2.5         2.0
Portal            Applications Interface  1.0         1.0
HP post           Tier-2 Expert           0.5         0.0
Totals            Tier-2 Expert           5.5         5.0
                  MSN                     13.0        13.0        GridPP3 sub-total      18.0        15.0        13.5
GRAND TOTAL                               19.5        19.0                               18.0        15.0        13.5
GridPP Staff Evolution
GridPP2 post                     to Sep 07  to Apr 08  to Apr 09  to Apr 10  to Apr 11   GridPP3 post

Management:
Project Leader                    0.67       0.67
Project Manager                   0.90       0.90
T2 Coordinator                    0.50       0.50
DB Chair                          0.30       0.30
UB Chair                          0.00       0.00
Middleware Coordinator            1.00       1.00
Application Coordinator           0.50       0.50
(plus CCLRC management)
Management sub-total              3.87       3.87       3.45       3.45       3.45       (Project Leader, Project Manager, UB Chair, Technical Coordinator, Deployment Coordinator)

Tier-1:
All Tier-1 services              13.50      13.50      18.00      18.00      18.00       All Tier-1 services

Tier-2:
Hardware support                  9.00       9.00      14.75      14.75      14.75       All Tier-2 services
Specialist posts                  5.50       5.00
Middleware:
All MSN posts                    13.00      13.00
(specialist + MSN combined)                            18.00      15.00      13.50       Technical support posts

Applications:
All application posts            18.50       1.00

Documentation:
Documentation Officer             1.00       1.00       1.00       1.00       1.00       Documentation (2 × 0.5)

Operations:
Operations Manager                1.00       1.00       1.00       1.00       1.00       Operations Manager
Tier-2 Coordinators               4.00       4.00       4.00       4.00       4.00       Tier-2 Coordinators
GOC posts                         0.00       0.00       3.00       3.00       3.00       GOC posts

Outreach:
Dissemination + events            1.50       1.50       1.50       1.50       1.50       Industry liaison + dissemination

Grand total                      70.87      52.87      64.70      61.70      60.20
Full Proposal
This compares with the exploitation review input of £36,643k, which included £1,800k of running costs.

Cost table (£k):
                         GridPP2+   GridPP3   GridPP3   GridPP3
                         FY07       FY08      FY09      FY10
Tier-1 Staff             585        1,384     1,433     1,483
Tier-1 Hardware          3,778      2,838     2,679     2,305
Tier-2 Staff             165        1,123     1,174     1,228
Tier-2 Hardware          1,163      1,295     1,383     1,282
Grid Support Staff       731        1,481     1,302     1,232
Grid Operation Staff     43         632       658       685
Management Staff         216        301       316       330
Outreach Staff           67         121       126       132
Travel and Operations    112        255       243       237
Staff Total              1,808      5,042     5,009     5,090
Hardware Total           4,941      4,133     4,062     3,587
Grand Total              6,861      9,429     9,313     8,914
Full Project Cost        34,517
Tier-1 Running Costs     0          628       798       1,071
GridPP3 Balance
[Pie chart: GridPP3 cost balance]
Tier-1 Staff 14%; Tier-1 Hardware 34%; Tier-2 Staff 11%; Tier-2 Hardware 15%; Grid Support Staff 14%; Grid Operation Staff 6%; Management Staff 3%; Outreach Staff 1%; Travel and Operations 2%.
Status
• GridPP3 proposal being drafted (deadline July 13th).
• Currently being run past the CB (by email) and the OC (on Friday).
• Request the hardware defined by the experiments.
• Request the (minimum) staff we think are required.
• Expect some iteration!