CERN Data Services Update HEPiX 2004 / NeSC Edinburgh Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith

Mar 28, 2015


Irea Quinlan
Page 1

CERN Data Services Update

HEPiX 2004 / NeSC Edinburgh

Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith

Page 2

2004/05/26 CERN Data Services: [email protected] 2 of 19

Outline

Data Services Drivers
Disk Service
  Migration to Quattor / LEMON
  Future directions
Tape Service
  Media migration
  Future directions
Grid Data Services

Page 3

Data Flows

Tier-0 / Tier-1 for the LHC
  Data Challenges:
    CMS DC04 (finished) +80; PCP05 (Autumn) +170
    ALICE ongoing +137 TB
    LHCb ramping up +40 TB
    ATLAS ramping up +60 TB
Fixed Target Programme:
  NA48 at 80 MB/s +200 TB
  COMPASS at 70 MB/s (peak 120) +625 TB
  nToF at 45 MB/s +180 TB
  NA60 at 15 MB/s +60 TB
  Testbeams at 1~5 MB/s (x 5)
Analysis…
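
To put the fixed-target figures above in proportion, the sketch below works out how many days of continuous writing at the quoted rate would be needed to accumulate each quoted volume. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and a sustained rate; neither assumption is stated on the slide, and the function name is illustrative.

```python
# Rates (MB/s) and volumes (TB) as quoted on the slide.
# Decimal units and a constant sustained rate are assumptions.
MB = 10**6
TB = 10**12

experiments = {
    "NA48": (80, 200),
    "COMPASS": (70, 625),
    "nToF": (45, 180),
    "NA60": (15, 60),
}


def days_at_rate(volume_tb, rate_mb_s):
    """Days of continuous writing at rate_mb_s needed to reach volume_tb."""
    seconds = volume_tb * TB / (rate_mb_s * MB)
    return seconds / 86400


for name, (rate, volume) in experiments.items():
    days = days_at_rate(volume, rate)
    print(f"{name}: {volume} TB at {rate} MB/s is about {days:.1f} days")

# Aggregate nominal rate and volume across the four experiments.
total_rate = sum(rate for rate, _ in experiments.values())
total_volume = sum(volume for _, volume in experiments.values())
print(f"Total: {total_rate} MB/s nominal, {total_volume} TB")
```

The result is only a rough duty-cycle indicator: real data taking is not continuous, and COMPASS peaks above its nominal rate.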

Page 4

Disk Server Functions

[Pie chart]
  AFS                            1%
  Oracle                         8%
  CASTOR: Experiment dedicated  64%
  CASTOR: Infrastructure         9%
  CASTOR: Public Services        4%
  LCG                           14%

Page 5

Generations

0th: Jumbos
1st & 2nd: 4U
3rd & 4th: 8U

Page 6

Warranties

[Chart: Number of Disk Servers (0 to 400) against date (Jan-00 to Jan-09), grouped into 0th, 1st, 2nd, 3rd and 4th Generation and Out of Warranty]
Legend entries: ELONEX - 2.4GHz (x 2), ELONEX - 2.0GHz, ELONEX - 1.1GHz, JTT - 1.1GHz, ELONEX - 1GHz (x 2), JTT - 1GHz, ELONEX - 900 (x 3), TECH - 800, ELONEX - 700, ELONEX - 650, COGESTRA - 500, ELONEX - 500 (x 2), COGESTRA - 450

Page 7

Disk Servers: Jan 2004

370 EIDE Disk Servers
  Commodity Storage in a box
  544 TB of disk capacity
  6700 spinning disks
Storage Configuration
  HW Raid-1 mirrored for “maximum reliability”
  ext2 file systems
Operating systems
  RH6.1, 6.2, 7.2, 7.3, RHES
  13 different kernels
Application uniformity; CASTOR SW

Page 8

Quattor-ising

Motivation: Scale
  Uniformity; Manageability; Automation
Configuration Description (into CDB)
  HW and SW; nodes and services
Reinstallation
  Production machines – min service interruption!
  Eliminate peculiarities from CASTOR nodes
    MySQL, web servers
  Refocus root control
  Quiescing a disk server ≠ draining a batch node!
  Gigabit cards gymnastics (ext2 -> ext3)
Complete (except 10 RH6 boxes for Objectivity)

Page 9

LEMON-ising

MSA everywhere
  Linux box monitoring and alarms
  Automatic HW static checks
Adding
  CASTOR server specific
  Service monitoring
HW Monitoring
  lm_sensors (see tape section)
  smartmontools
    smartd deployment
    Kernel issues; firmware bugs; through 3ware controller
    smartctl auto checks; predictive monitoring
  IPMI investigations; especially remote access
    Remote reset/power-on/power-off
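
As an illustration of what a predictive SMART check could look like, the sketch below parses an attribute table in the layout printed by `smartctl -A` and flags pre-fail attributes whose normalised value is close to the vendor threshold. The sample values, the margin of 10 and the function names are assumptions made for this example; the slides do not describe the checks actually deployed.

```python
import re

# Illustrative sample in the layout of `smartctl -A`; the values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   145   145   140    Pre-fail  Always       -       412
194 Temperature_Celsius     0x0022   110   098   000    Old_age   Always       -       37
"""

# ID, name, flag, normalised value, worst, threshold, type
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)"
)


def parse_attributes(text):
    """Return one dict per attribute row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attr_id, name, value, worst, thresh, kind = match.groups()
            rows.append({
                "id": int(attr_id),
                "name": name,
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "type": kind,
            })
    return rows


def at_risk(rows, margin=10):
    """Names of pre-fail attributes within `margin` of their threshold."""
    return [
        row["name"]
        for row in rows
        if row["type"] == "Pre-fail"
        and row["thresh"] > 0
        and row["value"] - row["thresh"] <= margin
    ]


rows = parse_attributes(SAMPLE)
warnings = at_risk(rows)
print(warnings)
```

A check of this kind warns before the drive itself reports failure, which is the point of predictive monitoring; the right margin would have to be tuned per disk model.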

Page 10

Disk Replacement

Failure rate unacceptably high
  10 months to be believed
  4 weeks to execute
1224 disks exchanged (out of 6700)
  And the cages
Western Digital; type DUA
  Head instabilities

[Chart: % Broken Mirrors (0.0% to 4.5%) by month, Dec-03 to May-04]

Page 11

Disk Storage Futures

EIDE Commodity storage in a box
  Production systems
    HW Raid-1 / ext3
  Pilots (15 production systems)
    HW Raid-5 + SW Raid-0 / XFS
  (See Jan Iven’s talk next)
New tenders out…
  30TB SATA in a box
  30TB external SATA disk arrays
New CASTOR stager (see Olof’s talk)

Page 12

Tape Service

70 tape servers (Linux)
  (mostly) Single FibreChannel attached drives
2 symmetric robotic installations
  5 x STK 9310 Silos in each

[Charts: Drives and Media by type]
  Type     Drives   Media
  9940B    50       14157
  9940A    4        8889
  9840     20       8149
  3590     14       8639
  LTO      6

Page 13

Tape Server Temperatures

lm_sensors package
  General SMBus access and hardware monitoring.
  Used to access
    LM87 chip
      Fan speeds
      Voltages
      Int/Ext temperatures
    ADM1023 chip
      Int/Ext temperatures

Page 14

Tape Server Temperatures

Page 15

Media Migration

To 9940B (mainly from 9940A)
  200GB – extra capacity avoids unnecessary acquisitions
  Better performance – though hard to benefit in normal chaotic mode
  Reduced errors; fewer interventions
1-2% of A tapes cannot be read (extremely slow) on B drives
  Have not been able to return all A-drives

Page 16

Tape Service Developments

Removing tails…
  Tracking of all tape errors (18 months)
  Retiring of problematic media
  Proactive retiring of heavily used media (>5000 mounts)
    repack on new media
Checksums
  Populated writing to tape
  Verified loading back to disk
  22% already after a few weeks
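
The checksum flow above (populate on the way to tape, verify on the way back to disk) can be sketched as a copy routine that computes a running checksum over the stream. Adler-32 via Python's `zlib` is an assumption chosen for brevity, since the slide does not name the algorithm, and the in-memory buffers merely stand in for disk and tape files.

```python
import io
import zlib


def copy_with_checksum(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunks and return the Adler-32 of the bytes copied."""
    checksum = 1  # Adler-32 starting value
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        checksum = zlib.adler32(chunk, checksum)
        dst.write(chunk)
    return checksum & 0xFFFFFFFF


payload = b"event data " * 1000

# Writing to tape: the checksum is populated and stored with the file metadata.
tape = io.BytesIO()
stored = copy_with_checksum(io.BytesIO(payload), tape)

# Loading back to disk: the checksum is recomputed and compared.
tape.seek(0)
disk = io.BytesIO()
recalled = copy_with_checksum(tape, disk)
verified = recalled == stored

# A single corrupted byte on the medium is detected on recall.
corrupted = io.BytesIO(b"X" + tape.getvalue()[1:])
mismatch = copy_with_checksum(corrupted, io.BytesIO()) != stored

print(verified, mismatch)
```

Computing the checksum inside the copy loop avoids a second pass over the data, which matters when every pass costs tape-drive time.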

Page 17

Water Cooled Tapes!

Plumbing error!
  5000 tapes disabled for a few days
  550 superficially wet
  152 seriously wet – visually inspected

Page 18

Tape Storage Futures

Commodity drive studies
  LTO-2 (Collaboratively CASPUR/Valencia)
Test and evaluate High-end drives
  IBM 3592
  STK NGD
Other STK offerings
  SL8500 robotics and silos
  Indigo; managed storage, tape virtualisation

Page 19

GRID Data Management

GridFTP + SRM servers
  (Former) Standalone / experiment dedicated
    Hard to intervene; not scalable
  New load-balanced 6 node Service
    castorgrid.cern.ch
    SRM modifications to support operation behind the load balancer
    GridFTP standalone client
Retire ftp and bbftp access to CASTOR
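
The slide does not say how the load balancer chooses among the six nodes. Purely as an illustration of one common approach, the sketch below picks the least-loaded nodes to publish behind a service alias; the node names, the load metric and the choice of two nodes are all hypothetical.

```python
def pick_nodes(loads, count=2):
    """Return the `count` least-loaded node names, ties broken by name."""
    ranked = sorted(loads.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:count]]


# Hypothetical load figures for a six-node service.
loads = {
    "node1": 0.7,
    "node2": 0.2,
    "node3": 0.9,
    "node4": 0.2,
    "node5": 0.5,
    "node6": 0.4,
}

selected = pick_nodes(loads)
print(selected)
```

Whatever the selection rule, clients only ever see the alias, so a node can be taken out for intervention without client-side changes, which addresses the "hard to intervene" problem of the former standalone servers.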