Page 1: Deployment issues and SC3

Deployment issues and SC3

Jeremy Coles

GridPP Tier-2 Board and Deployment Board Glasgow, 1st June 2005

Page 2: Deployment issues and SC3

Current deployment issues

Main GridPP concerns:
• gLite migration, fabric management & future of YAIM
• dCache
• Data migration – classic SE to SRM SE
• Security
• Ganglia deployment
• Use of ticketing system
• Use of UK testzone

General
• Jobs at sites – improving (nb. Freedom of Choice is coming!)
• Few general EGEE VOs supported at GridPP sites

Page 3: Deployment issues and SC3

2nd LCG Operations Workshop

• Took place in Bologna last week: http://infnforge.cnaf.infn.it/cdsagenda//fullAgenda.php?ida=a0517

• Covered the following areas:
– Daily operations
– Pre-production service
– gLite deployment and migration
– Future monitoring (metrics)
– Interoperation with OSG
– User support (Executive Support Committee!)
– VO management processes
– Fabric management
– Accounting (DGAS and APEL)
– Little on security! Romain presented potential tools.

Page 4: Deployment issues and SC3

LCG-2_4_0

[Chart: sites on LCG-2_4_0 versus days since release (information-system based), actual against plan.]

CPUs per release:
• 2_4_0: 10642
• 2_3_1: 912
• 2_3_0: 2167

Page 5: Deployment issues and SC3

Version change in the last 100 days

[Chart: number of sites versus days over the last 100 days, with curves for all sites, 2_4_0, 2_3_1, 2_3_0 and other.]

Others: sites on older versions or down. Covers all sites in LCG-2.

Page 6: Deployment issues and SC3

Regions with fewer than 5 sites are not shown

[Charts: sites per release over the last ~100 days for Canada, Russia, Italy and Germany/Switzerland.]

Page 7: Deployment issues and SC3

[Charts: sites per release over the last ~100 days for France, Asia Pacific, Northern and SW.]

Page 8: Deployment issues and SC3

[Charts: sites per release over the last ~100 days for Central and SE.]

Page 9: Deployment issues and SC3

UKI

[Chart: UKI sites over the last ~100 days.]

Page 10: Deployment issues and SC3

LCG-2_4_0

Lessons learned:
– Harder than expected (rate independent of packaging)
– Differences between regions --> ROCs matter
– Release definition non-trivial with 3-month intervals
– Component dependencies: X without Y and V is useless….
– During certification we still find problems
– Upgrade and installation from scratch both needed (time consuming)
– Test pilots for deployment are useful
– Early announcement of releases is useful
– We need to introduce “updates” via APT to fix bugs that show up during deployment
– Number of sites is the wrong metric to measure success: CPUs on the new release need to be tracked, not sites (a sketch of that metric follows this list)
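Not from the slides: a minimal sketch of the CPU-weighted metric, using the per-release CPU counts reported on the LCG-2_4_0 slide above. Treating those three releases as the whole infrastructure is an assumption made for illustration.

```python
# CPU-weighted upgrade coverage: measure a release by the fraction of CPUs
# running it, not by the number of sites. Counts are from the LCG-2_4_0
# slide; assuming these three releases cover the whole infrastructure is an
# illustrative simplification.
cpus_per_release = {
    "LCG-2_4_0": 10642,  # new release
    "LCG-2_3_1": 912,
    "LCG-2_3_0": 2167,
}

total_cpus = sum(cpus_per_release.values())
coverage = cpus_per_release["LCG-2_4_0"] / total_cpus

print(f"CPUs on LCG-2_4_0: {cpus_per_release['LCG-2_4_0']} of {total_cpus} "
      f"({coverage:.0%} of capacity)")
```

By this measure LCG-2_4_0 already covers roughly three quarters of the listed capacity, even though the site-count curves climb far more slowly.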

Page 11: Deployment issues and SC3

The next release

• Why?
– SC3 is approaching and the needed components are not deployed at the sites

• What?
– File transfer service (will need VDT 1.2.2): servers for the Tier-1s and Tier-0, clients for the rest
– Improved monitoring sensors for GridFTP
– RFC proxy extension for VOMS
– New version of the GLUE schema (compatible) – a query sketch follows this list
– LFC production service
– Interoperability with GRID3/OSG
– User-level stdio monitoring (maybe later)
– Bug fixes …….. as always

• When?
– Aimed at mid-June

• Who?
– Tier 1 centers and Tier 2 centers participating in SC3: as fast as possible
– Others? At their own pace
– Updated release (fixes from the 1st release) expected by July 1st.
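Not from the slides: a minimal sketch of checking what the information system publishes under the GLUE schema by querying a BDII over LDAP. It assumes the python-ldap package; the hostname is a hypothetical placeholder, while port 2170 and the o=grid base are the customary LCG conventions.

```python
# Query a BDII (the LDAP-based LCG information system) for published CEs.
# The hostname below is a hypothetical placeholder, not an endpoint from
# the slides.
import ldap  # python-ldap package

BDII_URL = "ldap://your-bdii.example.org:2170"  # 2170 is the customary BDII port

conn = ldap.initialize(BDII_URL)
results = conn.search_s(
    "o=grid",                # conventional LCG base DN
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",  # GLUE schema object class for computing elements
    ["GlueCEUniqueID", "GlueCEInfoTotalCPUs"],
)

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    cpus = attrs.get("GlueCEInfoTotalCPUs", [b"0"])[0].decode()
    print(f"{ce}: {cpus} CPUs")
```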

Page 12: Deployment issues and SC3

Coexistence & Extended Pre-Production

[Site diagram: VOMS, LFC, FIREMAN, shared LCG services, gLite SRM-SE, myProxy, gLite WLM, RB, UIs, WNs (gLite and LCG), gLite-IO, gLite-CE, LCG CE, FTS (both stacks), R-GMA, BD-II, DGAS and APEL.]

• Data from LCG is owned by VO and role; the gLite-IO service owns gLite data
• FTS for LCG uses the user proxy; gLite uses a service certificate
• R-GMAs can be merged (security ON)
• CEs use the same batch system
• Independent IS
• Catalogue and access control

Page 13: Deployment issues and SC3

Gradual Transition 1

[Site diagram: VOMS, LFC, shared LCG services, gLite SRM-SE, myProxy, gLite WLM, RB, UIs, LCG WNs, gLite-CE, LCG CE, FTS, R-GMA, BD-II, DGAS and APEL.]

• Optional additional WLM
• Data management via LCG
• Optional DGAS accounting
• FTS for LCG uses the user proxy; gLite uses a service certificate
• CEs use the same batch system

Page 14: Deployment issues and SC3

Gradual Transition 2

[Site diagram: VOMS, LFC, FIREMAN, shared LCG services, gLite SRM-SE, myProxy, gLite WLM, UIs, LCG WNs, gLite-CE, FTS, R-GMA, BD-II, DGAS and APEL.]

• Removed LCG WLM
• Optional catalogue
• R-GMA in gLite mode

Page 15: Deployment issues and SC3

Gradual Transition 3

[Site diagram: as Transition 2, with gLite-IO and a second FTS added.]

• Adding gLite-IO: a second path to data and an additional security model
• Data migration phase
• Data from LCG is owned by VO and role; the gLite-IO service owns gLite data

Page 16: Deployment issues and SC3

Gradual Transition 4

[Site diagram: VOMS, LFC, FIREMAN, shared LCG services, gLite SRM-SE, myProxy, gLite WLM, UIs, LCG WNs, gLite-CE, gLite-IO, FTS, R-GMA, BD-II, DGAS and APEL.]

• Finalize the switch to the new security model; the LFC is now a local catalogue under VO control
• BDII later replaced by R-GMA

Page 17: Deployment issues and SC3

Metrics – EGEE

• General agreement on the concept – detailed discussions on:
– Time windows: sliding windows (week, month, 3 months) – a toy computation follows this list
– Quantities to watch for (RCs, ROCs, CICs…..): ROCs based on RCs; CICs based on services; release quality has to be measured
• To make progress: a workgroup to define quantities
– Organized by Ognjen Prnjat ([email protected])
– Small (~5): Ognjen, Markus, Helene, Jeff T. and Jeremy
– Ognjen will collect input
– ROCs, CICs and OMC have to agree on ONE set of quantities
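Not on the slide: a minimal sketch of the sliding-window idea, assuming one pass/fail test result per resource centre per day (the input format is illustrative; the window lengths follow the slide).

```python
# Sliding-window availability metric for resource centres (RCs). The input
# format (one boolean test result per day, oldest first) is an assumption
# for illustration; the windows are the week/month/3-month ones discussed.
from collections import deque

WINDOWS = {"week": 7, "month": 30, "3 months": 90}

def sliding_availability(daily_results, window_days):
    """Return, per day, the fraction of passing days over the trailing window."""
    window = deque(maxlen=window_days)
    series = []
    for passed in daily_results:
        window.append(passed)
        series.append(sum(window) / len(window))
    return series

# Example: a site that fails its tests for ten days mid-period.
results = [True] * 40 + [False] * 10 + [True] * 50
for name, days in WINDOWS.items():
    print(f"{name}: final availability {sliding_availability(results, days)[-1]:.0%}")
```

The point of the differing windows: the week and month figures recover quickly after an outage, while the 3-month figure keeps it visible.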

Page 18: Deployment issues and SC3

Operations summary

• CIC-on-duty is now well established
– COD is just 6 months old!!!!!
– Tools have evolved at a dramatic pace (portal, SFT,……), with many rapid iterations
• Truly distributed effort
• Integration of the new COD partner (Russia) went smoothly
• Tuning of procedures is an ongoing process – no dramatic changes (take resource size more into account)

Page 19: Deployment issues and SC3

Accounting

Last November this was still an area of concern.
– APEL is now well established
• Support for batch systems is improving
• Several privacy-related problems have been understood and solved
– gLite accounting: DGAS
• Some concerns about the amount of information published – can this be handled by proper authorization?
• Collaboration with APEL on batch sensors (BBQS, Condor,..) – DGAS agreed to provide them; a toy sensor sketch follows
• Will be introduced initially on a voluntary basis – sites will give feedback (including privacy issues)
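Not from the slides: a minimal sketch of the core job a batch-system accounting sensor (in the spirit of APEL's) performs, turning batch log records into per-user CPU totals. The one-record-per-line "user=… cput=HH:MM:SS" format here is a simplified stand-in, not the real PBS or Condor log layout.

```python
# Toy accounting sensor: aggregate CPU time per user from batch log lines.
# The log format below is a simplified stand-in for illustration only.
import re
from collections import defaultdict

RECORD = re.compile(r"user=(?P<user>\S+).*?cput=(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)")

def cpu_seconds_by_user(lines):
    totals = defaultdict(int)
    for line in lines:
        m = RECORD.search(line)
        if m:
            totals[m["user"]] += int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"])
    return totals

log = [
    "jobid=123 user=alice cput=01:30:00",
    "jobid=124 user=bob   cput=00:45:10",
    "jobid=125 user=alice cput=02:00:50",
]
for user, secs in cpu_seconds_by_user(log).items():
    print(f"{user}: {secs / 3600:.2f} CPU-hours")
```

The privacy concerns on the slide arise exactly here: per-user records like these must be anonymised or access-controlled before publication.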

Page 20: Deployment issues and SC3

Current deployment issues (recap)

Main GridPP concerns:
• gLite migration, fabric management & future of YAIM
• dCache
• Data migration – classic SE to SRM SE
• Security
• Ganglia deployment
• Use of ticketing system
• Use of UK testzone

General
• Jobs at sites – improving (nb. Freedom of Choice is coming!)
• Few general EGEE VOs supported at GridPP sites

Page 21: Deployment issues and SC3


Freedom of choice - VO Page

Page 22: Deployment issues and SC3


Service Challenge 3

Page 23: Deployment issues and SC3

SC timelines

[Timeline 2005–2008: SC2 → SC3 → SC4 → LHC service operation; cosmics, first beams, first physics, full physics run.]

• Jun 05 – Technical Design Report
• Sep 05 – SC3 service phase
• May 06 – SC4 service phase
• Sep 06 – initial LHC service in stable operation
• Apr 07 – LHC service commissioned

• SC2 – reliable data transfer (disk-network-disk): 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
• SC3 – reliable base service: most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
• SC4 – all Tier-1s, major Tier-2s: capable of supporting the full experiment software chain including analysis; sustain the nominal final grid data throughput
• LHC service in operation – September 2006: ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput

Page 24: Deployment issues and SC3

Service Challenge 3 – Phases

High-level view:
• Throughput phase
– 2 weeks sustained in July 2005 (“obvious target” – GDB of July 20th)
– Primary goals:
• 150 MB/s disk – disk to Tier-1s
• 60 MB/s disk (T0) – tape (T1s)
(a back-of-the-envelope volume check follows this list)
– Secondary goals:
• Include a few named T2 sites (T2 -> T1 transfers)
• Encourage remaining T1s to start disk – disk transfers
• Service phase
– September – end 2005
• Start with ALICE & CMS; add ATLAS and LHCb October/November
• All offline use cases except for analysis
• More components: WMS, VOMS, catalogs, experiment-specific solutions
– Implies production setup (CE, SE, …)
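Not from the slides: a back-of-the-envelope check of what the two primary throughput goals imply in total data volume over the two-week sustained period.

```python
# Data volume implied by the SC3 throughput-phase goals, sustained for the
# two weeks stated on the slide. Rates are the slide's primary goals.
SECONDS_PER_DAY = 86_400
DAYS = 14

goals_mb_per_s = {
    "disk - disk to Tier-1s": 150,   # MB/s
    "disk (T0) - tape (T1s)": 60,    # MB/s
}

for name, rate in goals_mb_per_s.items():
    total_tb = rate * SECONDS_PER_DAY * DAYS / 1_000_000  # MB -> TB (decimal)
    print(f"{name}: {rate} MB/s for {DAYS} days = about {total_tb:.0f} TB")
```

That is roughly 181 TB disk-to-disk and 73 TB to tape, which gives a feel for the storage the participating sites had to stand behind the goals.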

Page 25: Deployment issues and SC3

SC implications

• SC3 will involve the Tier-1 sites (+ a few large Tier-2s) in July
– Must have the release to be used in SC3 available in mid-June
– Involved sites must upgrade for July
– Not reasonable to expect those sites to commit to other significant work (pre-production etc.) on that timescale
– T1s: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and …
• Expect the SC3 release to include FTS, LFC, DPM, but otherwise be very similar to LCG-2.4.0
• September–December: experiment “production” verification of SC3 services; in parallel, set up for SC4
• Expect the “normal” support infrastructure (CICs, ROCs, GGUS) to support service-challenge usage
• Bio-med is also planning data challenges – must make sure these are all correctly scheduled

Page 26: Deployment issues and SC3

SC3 issues

• The Tier-1 network is being extensively re-configured. Tests showed up to 40% packet loss! Waiting for UKLight to be fixed. Not intending to use dual-homing, but dCache have provided a solution (a basic loss-measurement sketch follows this list)
• Lancaster link is up at the link level. What is the bandwidth of the Lancaster connection?
• Edinburgh has a hardware problem with the RAID array to be used as the SE – IBM investigating
• Lancaster set up a test system and is now deploying more hardware
• Need clarification about the classification of volatile vs. permanent data in respect of Tier-2s
• The file transfer service should be ready now but has problems with the client component
• RAL would like a longer period for testing tape than suggested in the SC3 plans
• There has been an issue with CMS preferring to use PhEDEx and not FTS for transfers. We need to add into the plans a period to do PhEDEx-only transfer tests
• The dCache mailing list is very active now. There have been problems with the installation scripts
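Not from the slides: a minimal sketch of measuring packet loss to a peer with the standard ping tool, of the kind used while debugging a link such as UKLight. The target hostname is a hypothetical placeholder.

```python
# Measure packet loss to a host by running ping and parsing its summary line.
# The target below is a hypothetical placeholder, not a host from the slides.
import re
import subprocess

def packet_loss(host, count=100):
    """Run ping and return the reported packet-loss percentage, or None."""
    out = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    ).stdout
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(m.group(1)) if m else None

loss = packet_loss("tier1-gateway.example.org")
print(f"packet loss: {loss}%")
```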

Page 27: Deployment issues and SC3

SC3 issues continued

• We have questions about whether FTS uses SRM-put or SRM-cp (a hedged sketch of the SRM-cp style follows this list)
• From September onwards the SC3 infrastructure is to provide a production-quality service for all experiments – remember the comments about UKLight being a research network – risk!?
• Differing engagement with the experiments. Edinburgh needs a better relationship with LHCb
• There is an LCG workshop in mid-June where the experiment plans should be almost final!
• GridPP needs to do more load testing than is anticipated in SC3
• Planning for SC4 needs to start soon. Currently we are pushing dCache but DPM is also supposed to be available.
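Not from the slides, and the slides leave the FTS question open: a sketch of the SRM-cp style of transfer using the dCache srmcp client, where the client asks the SRMs to move the data between storage elements. Both endpoints are hypothetical placeholders; in the SRM-put style, the transfer service would instead negotiate with the destination SRM and drive a GridFTP push itself.

```python
# Sketch of an SRM-to-SRM ("SRM-cp" style) copy using the dCache srmcp
# client. Both SRM URLs are hypothetical placeholders for illustration.
import subprocess

src = "srm://se1.example.org:8443/pnfs/example.org/data/file1"
dst = "srm://se2.example.org:8443/pnfs/example.org/data/file1"

# srmcp negotiates transfer URLs with the SRM endpoints and performs the copy.
subprocess.run(["srmcp", src, dst], check=True)
```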

Page 28: Deployment issues and SC3

Imperial (London Tier-2)

• SRM/dCache status
– Production server installed: gfe02.hep.ph.ic.ac.uk (information provider still developing)
– 1.5 TB pool node added: RHEL 4, 64-bit system, installed using the dcache.org instructions http://www.dcache.org/downloads/dCache-instructions.txt
– Extra 1.5 TB ready to add when CMS is ready
– 6 TB being purchased; should be in place by the start of the Setup Phase

• CMS software
– Service node provided
– PhEDEx installed
– Confirmation on the FTS/PhEDEx issue sought

Page 29: Deployment issues and SC3

Edinburgh

Current LCG production setup:
• Compute Element (CE), classic Storage Element (SE), 3 Worker Nodes (2 machines, 3 CPUs). Monitoring takes place on the SE, running LCG 2.4.0. About to add 2 Worker Nodes (2 CPUs in 1 machine) and have a User Interface (UI) in testing. We have a 22 TB datastore available.

Plans
• £2000 available for 2 machines – one for dCache work and one to connect to EPCC's SAN (10 TB promised).
• Considering the procurement of more WNs but have no clear requirements from LHCb.

Page 30: Deployment issues and SC3


Lancaster (current)

Page 31: Deployment issues and SC3

Lancaster (planned)

1. LightPath and terminating end-box installed.
2. Still require some hardware for our internal network topology.
3. Increase in storage to ~84 TB, possibly ~92 TB with a working resilient dCache from the CE.

Page 32: Deployment issues and SC3


Other areas…

Page 33: Deployment issues and SC3

JRA4 request

• We have some idea of requirements from networking experts within JRA4
• Draft requirements document available here: https://edms.cern.ch/document/593620/1
• Draft use-case document available here: https://edms.cern.ch/document/591777/1
• We’re looking for more input from NOCs and GOCs
• If you have requirements, use cases or opinions on interfaces or needed metrics, please send them to us
• Even if you don’t have ideas at the moment, but would like to be involved in the process, please get in contact
• Contact details are at the end of the talk

Page 34: Deployment issues and SC3

DTEAM discussion

• Review of team objectives – what is the team focus for the next 3 & 5 months
• Communications with the experiments
• Using a project tool to work better as a team
• Metrics!!
• Review of plans and what needs to be done to keep them up-to-date, including GridPP challenges and SC4
• Web-page status
• Areas raised at the T2B and DB meetings
• Security challenge involvement
• Accounting – status and making further progress
• Libraries and understanding expt. needs
• Review dCache efforts
• Address issues with quarterly reports & weekly reports
• Next release, test-zone and test-zone machines
• Data management – guidelines required
• Improving robustness
• GI – documentation (esp. releases), multi-Tier R-GMA, intro of new sites, LCFGng distribution (Kickstart & Pixieboot…), jobs – how to get …