CMS Issues

Jan 18, 2016


Lynne Thomas
Transcript
Page 1

CMS Issues

Page 2

Background – RAL Infrastructure

[Diagram labels]
TM, Nsd, Xrd-mgr
TM, Rhd, stagerd, TGW
Cupv, Vmgr, Vdqm, nsd
Cupv, Vmgr, nsd
Common Layer
Instance Headnodes
Diskservers (x20)
CASTOR 2.1.14-15
XROOT 3.3.3-1

Page 3

Background – xroot infrastructure

Diskservers (x20)

Xroot manager (3.3.3-1)

Xroot redirector (4.X)

European redirector1, European redirector1, European redirector1

Global redirectors, Global redirectors, Global redirectors

Local WNs

The Grid

Page 4

The Problem…s

• Pileup workflow
  – Local jobs had 95% failure rate
  – Jobs that managed to run had only 30% efficiency
• AAA failure
  – Despite being the second site to integrate into AAA
  – 100% failure for periods of 30 minutes to several days

Page 5

Tackling the Problems

Page 6

Pileup Broken Down

• Data accessed through xroot
• >95% of data at RAL
• Two problems in one
  – Slow opening times (15 -> 600 secs)
  – Slow transfer rates
  – 100% CPU WIO

Page 7

Slow Opening Times

• No obvious place
  – Delays at all phases
  – Almost all DB time spent in SubRequestToDo

Page 8

Solution 1 (aka Go Faster Stripes Solution)

Page 9

Database Surgery

• DBMS_ALERT suspected of adding to delays under load
  – Modified DB code to sleep for 50 ms (limiting the rate to 20 calls per second for subreqtodo)
• Tested on preprod (functionally)
  – Improved open time from 3-15 secs to 0-5 secs
• Deployed on all instances
• Made NO difference for CMS problem
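A 50 ms sleep between polls caps the polling rate at 1 / 0.05 s = 20 polls per second. The following is a minimal Python sketch of that idea, replacing an alert-driven wait with a fixed-interval poll. It is not the CASTOR database code: `max_poll_rate`, `poll_for_work` and `fetch` are hypothetical names invented for illustration.

```python
import time


def max_poll_rate(sleep_seconds):
    """Upper bound on polls per second when every empty poll is followed by a fixed sleep."""
    return 1.0 / sleep_seconds


def poll_for_work(fetch, sleep_seconds=0.05, max_polls=100, sleep=time.sleep):
    """Call `fetch` until it returns something other than None.

    Hypothetical stand-in for a fixed-interval poll: after each empty poll
    the caller sleeps for `sleep_seconds` (50 ms by default).
    Returns (work, number_of_polls), or (None, max_polls) if nothing arrived.
    """
    for attempt in range(1, max_polls + 1):
        work = fetch()
        if work is not None:
            return work, attempt
        sleep(sleep_seconds)
    return None, max_polls
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays.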

Page 10

Solution 2 (aka The Heart Bypass Solution)

Page 11

Bypassing Scheduler

• Modified xroot to disable scheduling
• RISK
  – nothing restricting access to disk server
  – ONLY applied to CMS
• RESULT
  – Open times reduced to 1-30 seconds
  – WIO still flatlining at 100%
• ‘SUCCESS’

Page 12

Improving IO

• Difficult to test
  – Could not generate artificially
  – Needed pileup workflow to be executing
• Testing on production ;)
• Did ‘the usual’
  – Reducing allowed connections
  – Throttling batch jobs

Page 13

Solution 3 (aka The Don’t Do This Solution)

• Change UNIX I/O scheduler
  – Now easy and can be done in-situ
• Four schedulers (plus options)
  – cfq (default), anticipatory, deadline, noop
  – Plus associated config
• Switched to noop
  – WIO dropped to 60%
  – Network rate increased 4x
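On Linux the active block I/O scheduler for a device is exposed in sysfs, where the file lists the available schedulers and marks the active one in square brackets. The sketch below, in Python, only reads and parses that file. The device name `sda` in the usage note and the helper names are assumptions for illustration, and the slides do not say which devices or tooling were actually used.

```python
def scheduler_file(device):
    """Path of the sysfs file that lists the I/O schedulers for a block device."""
    return f"/sys/block/{device}/queue/scheduler"


def parse_scheduler(text):
    """Split the file contents into (active, available).

    The kernel writes something like "noop anticipatory deadline [cfq]",
    with the active scheduler in square brackets.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device):
    """Read and parse the scheduler file for `device` (requires Linux sysfs)."""
    with open(scheduler_file(device)) as handle:
        return parse_scheduler(handle.read())
```

For example, `read_scheduler("sda")` would report the current setting on a Linux host with that device. Switching in-situ is a root-only write of the scheduler name to the same file, which this read-only sketch deliberately does not do.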

Page 14

XROOT Problems

Page 15

Observations

• Random Failures (or more correctly random successes)

• Local access was OK (if slow – see previous)

• Lack of visibility up the hierarchy didn’t help – REALLY difficult to debug

Page 16

Investigating the Problem

• Set up parallel infrastructure
  – Replicate manager, RAL redirector and European redirector
• Immediately saw the same issue…

Page 17

Causes of Failure…

• Caching!
  – Cmsd and xrootd timed out at different times
  – Xroot can return ENOENT, but later cmsd gets a response, and subsequent accesses work
  – If cmsd doesn’t get a response, all future requests get ENOENT
• But why the slow response…?
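The three behaviours described above can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not the real cmsd/xrootd logic: it assumes a single lookup whose response arrives after `response_delay` seconds, an xrootd timeout shorter than the cmsd timeout, and a cmsd that caches whatever it knows when its own timeout expires. The function name and all timeout values are hypothetical.

```python
def lookup_outcome(response_delay, xrootd_timeout, cmsd_timeout):
    """Classify a file lookup as (first_request, later_requests).

    first_request  : what the client sees on the initial open
    later_requests : what subsequent opens see once cmsd has cached a result
    All arguments are in seconds; this is a toy model with made-up semantics.
    """
    # xrootd gives up first and reports "no such file" to the client.
    first = "OK" if response_delay <= xrootd_timeout else "ENOENT"
    # cmsd keeps waiting; if the answer arrives in time it caches the location,
    # otherwise it caches the failure and every later request gets ENOENT.
    later = "OK" if response_delay <= cmsd_timeout else "ENOENT"
    return first, later
```

With an assumed xrootd timeout of 5 s and cmsd timeout of 10 s, a 7 s response reproduces the "fails once, then works" pattern, while a 12 s response reproduces the persistent ENOENT case.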

Page 18

Log Mining…

• Each log looked like performance was good
• Part of problem
  – Time resolution in xroot 3.3.X
  – And logging generally
• Finally found delays in ‘local’ nsd
  – Processing time was good
  – But delays in servicing requests
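The distinction between good processing time and slow servicing can be made concrete by splitting each request's latency into time spent waiting to be picked up and time spent being processed. The Python sketch below assumes that three timestamps per request can be recovered from the logs; the `Request` record and its field names are hypothetical and do not reflect the actual nsd log format.

```python
from dataclasses import dataclass


@dataclass
class Request:
    received: float  # when the request arrived (seconds)
    started: float   # when the daemon began working on it
    finished: float  # when the reply was sent


def split_latency(req):
    """Return (wait, processing) for one request, in seconds."""
    return req.started - req.received, req.finished - req.started


def summarize(requests):
    """Mean wait and mean processing time over a list of requests."""
    waits, processing = zip(*(split_latency(r) for r in requests))
    count = len(requests)
    return {
        "mean_wait": sum(waits) / count,
        "mean_processing": sum(processing) / count,
    }
```

A log that records only processing time would show the small second number and hide the large first one, which is consistent with each log looking healthy on its own.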

Page 19

Solution – RAL Infrastructure

[Diagram labels]
TM, Nsd, Xrd-mgr
TM, Rhd, stagerd, TGW
EU Redirectors
The Grid
RAL
Diskservers (x20)
CASTOR 2.1.14-15
XROOT 3.3.6-1
Global Redirectors
Nsd, Xrd-mgr
Xroot redirector (4.X)
Local WNs
Remote WNs