CERN Data Services Update
HEPiX 2004 / NeSC Edinburgh
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith

Outline
- Data Services Drivers
- Disk Service: migration to Quattor / LEMON; future directions
- Tape Service: media migration; future directions
- Grid Data Services

Data Flows: Tier-0 / Tier-1 for the LHC
- Data Challenges:
  - CMS: DC04 (finished) +80 TB; PCP05 (Autumn) +170 TB
  - ALICE: ongoing, +137 TB
  - LHCb: ramping up, +40 TB
  - ATLAS: ramping up, +60 TB
- Fixed Target Programme:
  - NA48 at 80 MB/s, +200 TB
  - COMPASS at 70 MB/s (peak 120), +625 TB
  - nToF at 45 MB/s, +180 TB
  - NA60 at 15 MB/s, +60 TB
  - Testbeams at 1-5 MB/s (x 5)
- Analysis…

Disk Server Functions
[Pie chart: share of disk servers by function]
  CASTOR: Experiment dedicated   64%
  LCG                            14%
  CASTOR: Infrastructure          9%
  Oracle                          8%
  CASTOR: Public Services         4%
  AFS                             1%

Generations
- 0th: Jumbos
- 1st & 2nd: 4U
- 3rd & 4th: 8U

Warranties
[Chart: number of disk servers (0 to 400) in service, Jan-00 to Jan-09, broken down by purchase batch (COGESTRA 450/500 MHz; ELONEX 500 MHz to 2.4 GHz; TECH 800 MHz; JTT 1/1.1 GHz), grouped into 0th to 4th generations, with the out-of-warranty fraction indicated]

Disk Servers: Jan 2004
- 370 EIDE disk servers: commodity storage in a box; 544 TB of disk capacity; 6700 spinning disks
- Storage configuration: HW RAID-1 mirrored for "maximum reliability"; ext2 file systems
- Operating systems: RH 6.1, 6.2, 7.2, 7.3, RHES; 13 different kernels
- Application uniformity: CASTOR SW

Quattor-ising
- Motivation: scale; uniformity; manageability; automation
- Configuration description (into CDB): HW and SW; nodes and services (see the sketch after this list)
- Reinstallation of production machines with minimal service interruption!
- Eliminate peculiarities from CASTOR nodes: MySQL, web servers
- Refocus root control
- Quiescing a disk server ≠ draining a batch node!
- Gigabit card gymnastics
- ext2 -> ext3
- Complete (except 10 RH6 boxes for Objectivity)
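
As a rough illustration of what "configuration description into CDB" covers, here is a minimal sketch of the kind of per-node desired state recorded centrally and compared against reality. This is purely hypothetical: Quattor's real profiles are written in its Pan template language, not Python, and every name below is invented for illustration.

```python
# Hypothetical sketch only: illustrates "HW and SW; nodes and services"
# being captured as one desired-state description per node in a
# configuration database. Quattor's real CDB uses Pan templates.
desired = {
    "hardware": {"model": "ELONEX 2.4GHz", "disks": 18, "nic": "gigabit"},
    "software": {"os": "RH 7.3", "packages": ["castor-server"]},
    "services": {"castor": True, "mysql": False},  # peculiarities removed
}

def drift(desired, actual):
    """Return the keys where a node's actual state deviates from CDB."""
    return {key: (desired[key], actual.get(key))
            for key in desired if actual.get(key) != desired[key]}
```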
LEMON-ising
- MSA everywhere: Linux box monitoring and alarms; automatic HW static checks
- Adding CASTOR-server-specific service monitoring
- HW monitoring: lm_sensors (see tape section); smartmontools
- smartd deployment: kernel issues; firmware bugs; through the 3ware controller; smartctl auto checks; predictive monitoring
- IPMI investigations, especially remote access: remote reset / power-on / power-off (a combined SMART/IPMI sketch follows this list)
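
Two of the items above lend themselves to a short illustration: SMART health polling through the 3ware controller and IPMI remote power control. The sketch below is a minimal, assumption-laden example rather than the production LEMON sensor: the device path, 3ware port, and credentials are placeholders, though `smartctl -H -d 3ware,N` and `ipmitool chassis power` are the real command forms.

```python
#!/usr/bin/env python
"""Minimal sketch: SMART health via a 3ware RAID controller, and an
IPMI remote power cycle. Paths, ports and credentials are assumptions."""
import subprocess

def smart_healthy(device="/dev/sda", port=0):
    # smartctl -H prints the drive's overall health self-assessment;
    # -d 3ware,N selects disk N behind a 3ware controller.
    out = subprocess.run(["smartctl", "-H", "-d", "3ware,%d" % port, device],
                         capture_output=True, text=True).stdout
    return "PASSED" in out

def ipmi_power_cycle(host, user, password):
    # Remote reset over IPMI-on-LAN, one of the remote-access
    # operations under investigation.
    subprocess.run(["ipmitool", "-I", "lan", "-H", host, "-U", user,
                    "-P", password, "chassis", "power", "cycle"], check=True)

if __name__ == "__main__":
    if not smart_healthy():
        print("SMART health check failed: flag disk for replacement")
```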
Disk Replacement
- Failure rate unacceptably high: 10 months to be believed, 4 weeks to execute
- 1224 disks exchanged (out of 6700), and the cages too
- Western Digital, type DUA: head instabilities
[Chart: % broken mirrors per month, Dec-03 to May-04, scale 0.0% to 4.5%]

Disk Storage Futures
- EIDE commodity storage in a box
  - Production systems: HW RAID-1 / ext3
  - Pilots (15 production systems): HW RAID-5 + SW RAID-0 / XFS (see Jan Iven's talk next, and the sketch after this list)
- New tenders out: 30 TB SATA in a box; 30 TB external SATA disk arrays
- New CASTOR stager (see Olof's talk)
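
In the pilot layout, the hardware controller exports RAID-5 volumes, software RAID-0 stripes across them for throughput, and XFS sits on top. A minimal sketch follows, assuming two controller volumes appear as /dev/sda and /dev/sdb; the device names and mount point are illustrative only.

```python
#!/usr/bin/env python
"""Sketch of the pilot layout: SW RAID-0 over two HW RAID-5 volumes,
formatted with XFS. All device names are assumptions."""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# /dev/sda and /dev/sdb are each assumed to be a RAID-5 volume
# exported by the hardware controller; md0 stripes across them.
run(["mdadm", "--create", "/dev/md0", "--level=0",
     "--raid-devices=2", "/dev/sda", "/dev/sdb"])
run(["mkfs.xfs", "/dev/md0"])
run(["mount", "/dev/md0", "/srv/castor"])
```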
Tape Service
- 70 tape servers (Linux), (mostly) single FibreChannel-attached drives
- 2 symmetric robotic installations, 5 x STK 9310 silos in each

  Type    Drives   Media
  9940B     50     14157
  9940A      4      8889
  9840      20      8149
  3590      14      8639
  LTO        6         -

Tape Server Temperatures
- lm_sensors package: general SMBus access and hardware monitoring
- Used to access the LM87 chip: fan speeds; voltages; int/ext temperatures
- ADM1023 chip: int/ext temperatures (a polling sketch follows this list)
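
The values these chips expose can be read with the lm_sensors `sensors` command. Below is a minimal sketch of polling them and raising an alarm; the 50 C threshold and the output parsing are assumptions, since exact labels depend on the chip (LM87, ADM1023) and the sensors configuration.

```python
#!/usr/bin/env python
"""Sketch: read temperatures via lm_sensors' `sensors` command and
flag hot tape servers. Threshold and parsing are assumptions."""
import re
import subprocess

def read_temperatures():
    out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
    # Typical lines look like "temp1: +42.0 C (high = +60.0 C)";
    # exact labels depend on the chip and the sensors config file.
    return {label: float(value)
            for label, value in
            re.findall(r"^(temp\d+):\s*\+?(-?\d+\.?\d*)", out, re.M)}

for sensor, celsius in read_temperatures().items():
    if celsius > 50.0:  # assumed alarm threshold
        print("ALARM: %s at %.1f C" % (sensor, celsius))
```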
Media Migration
- To 9940B (mainly from 9940A)
- 200 GB cartridges: extra capacity avoids unnecessary acquisitions
- Better performance, though hard to benefit from in normal chaotic mode
- Reduced errors; fewer interventions
- 1-2% of A tapes cannot be read (extremely slowly) on B drives, so we have not been able to return all A-drives

Tape Service Developments
- Removing tails:
  - Tracking of all tape errors (18 months)
  - Retiring of problematic media
  - Proactive retiring of heavily used media (>5000 mounts); repack on new media
- Checksums (see the sketch after this list):
  - Populated writing to tape
  - Verified loading back to disk
  - 22% already covered after a few weeks
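
The write-then-verify loop is straightforward to sketch. Adler-32 is assumed as the checksum below (a cheap, common choice for tape file checksums); the actual CASTOR bookkeeping of where checksums are stored is not shown, and the file paths are placeholders.

```python
#!/usr/bin/env python
"""Sketch of the checksum scheme: compute while writing to tape,
verify when loading back to disk. Algorithm and paths are assumptions."""
import zlib

def adler32_of(path, chunk_size=1 << 20):
    checksum = 1  # Adler-32 starts from 1, not 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

# Populated when writing to tape: record alongside the tape copy.
recorded = adler32_of("/srv/castor/datafile")

# Verified when loading back to disk: recompute and compare.
if adler32_of("/srv/stage/datafile") != recorded:
    raise IOError("checksum mismatch: tape copy is corrupt")
```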
Water Cooled Tapes!
- Plumbing error!
- 5000 tapes disabled for a few days
- 550 superficially wet
- 152 seriously wet; visually inspected

Tape Storage Futures
- Commodity drive studies: LTO-2 (collaboratively with CASPUR/Valencia)
- Test and evaluate high-end drives: IBM 3592; STK NGD
- Other STK offerings: SL8500 robotics and silos; Indigo (managed storage, tape virtualisation)

GRID Data Management
- GridFTP + SRM servers (former): standalone / experiment dedicated; hard to intervene; not scalable
- New: load-balanced 6-node service castorgrid.cern.ch (see the lookup sketch after this list)
  - SRM modifications to operate behind the load balancer
  - GridFTP standalone client
- Retire ftp and bbftp access to CASTOR
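
From the client side, a load-balanced alias like castorgrid.cern.ch simply resolves to whichever node addresses the balancer currently publishes. A minimal sketch of inspecting that mapping; the server-side selection logic is not shown.

```python
#!/usr/bin/env python
"""Sketch: resolve a load-balanced DNS alias and list the node
addresses it currently maps to."""
import socket

ALIAS = "castorgrid.cern.ch"
name, aliases, addresses = socket.gethostbyname_ex(ALIAS)
print("%s currently resolves to %d node(s):" % (ALIAS, len(addresses)))
for ip in addresses:
    print("  ", ip)
```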