Managing Mature White Box Clusters at CERN LCW: Practical Experience Tim Smith CERN/IT
Page 1:

Managing Mature White Box Clusters

at CERN

LCW: Practical Experience

Tim Smith CERN/IT

Page 2:

2002/10/21 White Box Farms: [email protected] 2

Contents

Scale
Behind the Scenes
Hardware
Complexity
Dynamics
Practical Steps
Software
Legacy
Projects

Page 3:

Scale

~1000 boxes
140k jobs/wk
2400 int. users
50 parallel reinstalls
Parallel cmd engines
350 kSI2000
~7/38 in top 500 clusters

Page 4:

Complexity

Hardware
  12 hardware acquisitions
  38 combinations of CPU/Mem/Disk

Software
  4 versions of RedHat OS
  37 clusters (indep. configurations)

User Communities
  30 expts/user communities + Public
  12,000 users

Page 5:

Dynamics

Hardware Drift
  e.g. missing after reboot: CPUs, Memory, Disks
  Ethernet speed wrong

Volatile configurations
  e.g. passwd file every couple of hours

Hardware Failures
  Up to 4% of farm on holiday

Replacements generate new configurations

Monitoring

Inventory Tracking
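
The hardware drift listed on this slide (CPUs, memory or disks missing after a reboot, Ethernet speed wrong) is the kind of problem that inventory tracking plus monitoring can catch by comparing what a node reports against its inventory record. The following is only a minimal sketch of that comparison in Python. The record layout, the field names and the node name `node001` are assumptions made for illustration; they are not taken from the tools used at CERN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareSpec:
    """Hypothetical inventory record for one node (all fields are assumptions)."""
    cpus: int
    memory_mb: int
    disks: int
    ethernet_mbps: int


def find_drift(expected: HardwareSpec, observed: HardwareSpec) -> dict:
    """Return {field: (expected, observed)} for every field that differs."""
    drift = {}
    for name in ("cpus", "memory_mb", "disks", "ethernet_mbps"):
        want, got = getattr(expected, name), getattr(observed, name)
        if want != got:
            drift[name] = (want, got)
    return drift


# Illustrative case: a dual-CPU node that came back from a reboot with one
# CPU visible and its network link negotiated at 10 Mb/s instead of 100 Mb/s.
inventory = HardwareSpec(cpus=2, memory_mb=512, disks=1, ethernet_mbps=100)
after_reboot = HardwareSpec(cpus=1, memory_mb=512, disks=1, ethernet_mbps=10)

for field_name, (want, got) in find_drift(inventory, after_reboot).items():
    print(f"node001: {field_name} expected {want}, found {got}")
```

Run after every reboot, a check of this shape turns silent drift into an explicit report that can feed a hardware repair workflow.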

Page 6:

Vendor Call Analysis

[Bar chart: number of calls (y-axis, 0 to 45) by reason (disks, dead, motherb., memory, video, processor, floppy, power/fan, tot. calls), one series per vendor: SIEMENS, ELONEX, TECH AS, SEIL]

1 every 2 days!

Page 7:

Acquisition Cycles

[Chart: number of machines (y-axis, 0 to 1200) against date (x-axis, Jan-97 to Jan-05), one series per acquisition, with a region marked "Out of Warranty"]

Series labels:
  SEIL - 1000
  ELONEX - 800
  TECH - 600
  ELONEX - 600
  SIEMENS - 550
  ELONEX - 500
  HP - 450
  ELONEX - 450
  ELONEX - 450
  ELONEX - 300
  COGESTRA - 266
  COGESTRA - 200

Page 8:

Addressing the Challenge

Interactive: Refresh from uniform batch machines

Batch: One large production facility
  Shares (and priorities)
  Selectable resources
  Flexibility
  Redundancy to reduce sensitivity to failures

Remedy Hardware workflows

But intractable
  Scatter in job return times
  Assumed but undeclared job requirements
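
The slide lists "selectable resources" as part of the answer and "assumed but undeclared job requirements" as a remaining problem. One plausible reading is that a scheduler can only steer a job away from unsuitable hardware if the job states what it needs. The sketch below illustrates that idea in Python. The node attributes, requirement fields and node names are hypothetical and do not describe the batch system actually used at CERN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one batch node."""
    name: str
    cpu_mhz: int
    memory_mb: int
    scratch_gb: int


@dataclass(frozen=True)
class JobRequirements:
    """A value of 0 means the job declared no requirement for that resource."""
    min_cpu_mhz: int = 0
    min_memory_mb: int = 0
    min_scratch_gb: int = 0


def eligible_nodes(nodes, req):
    """Return the nodes that satisfy every declared requirement."""
    return [
        n for n in nodes
        if n.cpu_mhz >= req.min_cpu_mhz
        and n.memory_mb >= req.min_memory_mb
        and n.scratch_gb >= req.min_scratch_gb
    ]


# Illustrative heterogeneous farm (values are made up).
farm = [
    Node("old-box", cpu_mhz=450, memory_mb=256, scratch_gb=4),
    Node("mid-box", cpu_mhz=800, memory_mb=512, scratch_gb=20),
    Node("new-box", cpu_mhz=1000, memory_mb=1024, scratch_gb=40),
]

declared = JobRequirements(min_memory_mb=512)  # the job says what it needs
undeclared = JobRequirements()                 # the job only assumes it

print([n.name for n in eligible_nodes(farm, declared)])
print([n.name for n in eligible_nodes(farm, undeclared)])
```

With nothing declared, every node is eligible, so a memory-hungry job can land on the smallest box. That is one way undeclared requirements could contribute to scatter in job return times.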

Page 9:

SW: Legacy from Maturity

[Diagram. Layers: OS, Applications, Mgmt Tools. Tools: KickStart, SUE, ASIS, BIS. File-system label as extracted: /home/usr/cute/usr/local/var/opt]

Page 10:

SW: Legacy from Maturity

[Diagram. Layers: OS, Applications, Mgmt Tools. Tools: KickStart, SUE, ASIS, BIS. Other labels: BIS DB, Oracle, AFS (four times), Local, acrontabs, crontabs. File-system label as extracted: /home/usr/cute/usr/local/var/opt]

Multiple owners, methods, formats

Multiple locations

Page 11:

A Clean Restart

[Diagram of four components]
  Node Configuration System
  Monitoring System
  Installation System
  Fault Mgmt System

Page 12:

A Clean Restart: SnapShot

[Diagram of four components]
  Node Configuration System
  Monitoring System
  Installation System
  Fault Mgmt System

[Other diagram labels]
  HW, SW
  Function, State
  Software Update, Base Installation
  RPM
  API
  PXE, Kickstart

Page 13:

State and Configuration Mgt

Clean Initial State
  Linux Standard Base, RPM

Externally Specified
  Configuration System, local cache

Versioned + Repository
  CVS

No inherent drift
  No external crontabs
  No unregistered application provider triggered updates

Update verification nodes + release cycle

Procedures and Workflows
  Transactions
  Notifications
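
The points on this slide describe a node whose state is specified externally and then checked, rather than edited by hand. As a hedged illustration only, the Python sketch below compares a desired package list (as it might be read from a versioned configuration) with what a node reports as installed, and derives the changes needed to remove the drift. The function name, the package names and the version strings are all invented for the example; this is not the interface of any CERN or RPM tool.

```python
def plan_changes(desired: dict, installed: dict) -> dict:
    """Compare a desired {package: version} map with the installed one.

    Returns the packages to install, the packages whose version must change,
    and the packages that are installed but not part of the specification.
    """
    plan = {"install": {}, "change": {}, "remove": []}
    for pkg, version in sorted(desired.items()):
        if pkg not in installed:
            plan["install"][pkg] = version
        elif installed[pkg] != version:
            plan["change"][pkg] = (installed[pkg], version)
    plan["remove"] = sorted(set(installed) - set(desired))
    return plan


# Hypothetical specification and node state (all names are made up).
desired_state = {"base-tools": "1.2", "batch-client": "4.1", "monitor-agent": "0.9"}
node_state = {"base-tools": "1.2", "batch-client": "4.0", "local-hack": "0.1"}

plan = plan_changes(desired_state, node_state)
print(plan)
```

A plan like this could first be applied to a small set of verification nodes and released to the rest of the farm only once those nodes check out, which is one way to read "update verification nodes + release cycle".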

Page 14:

Conclusions

Maturity brings…
  Degradation of initial state definition (HW + SW)
  Accumulation of innocuous temporary procedures

Scale brings…
  Marginal activities become full time
  Many hands on the systems

Combat with strong management automation