ScotGrid
EGI CF 2013
Building a grid cluster from the ground up
A Tale of Two Rooms
Introduction
• ScotGrid Glasgow [GridPP]
• One of the largest Tier-2 sites in the UK NGI
• 4136 cores
• 1.3 PB online storage
A year on the grid
• Power & A/C outages from multiple causes on different scales
• Trips, larger substation drops, etc.
• Lessons learned - general good practice
• General thoughts on living with a cluster
The case of two machine rooms
• Different ages of rooms - repurposing
• Different cooling solutions
• Advantages and disadvantages
• In principle, with redundant links we could have cluster redundancy
• In reality, bridging the cluster across rooms to get that redundancy adds complexity - where are the bottlenecks?
Site diagram
[Diagram: 80 Gb/s backbone ring of Summit X670V core switches; X460-48t edge switches serving Worker Nodes, Servers and Disk in the Upper and Lower rooms; 10 Gb/s links to WN, servers and disk; multiple 1 Gb/s links; 10 Gb/s WAN uplink]
Power & A/C failures
• Can happen to anyone
• Expect failure (like the grid philosophy)
• UPSes are very useful
• Except when they’re not
• Complexities of multi-room cluster
Best case
• One large data centre with ample power, cooling and network infrastructure
• Lower maintenance overheads
• Higher production uptime
• Failure prediction and multiple redundancy
Reality
• Many clusters grow organically over time, even with careful planning
• Periodic capacity upgrades can lead to infrastructure difficulties
Essential Cluster Infrastructure
• Power
• Cooling
• Network
Power
• Clusters are not desktops - be mindful of total power draw
• Potential for a mix of 3-phase & 13 A ring main supplies
• Power work is the most likely to impact the wider user environment if changes have to be made (whole-building outages)
• Don’t mix phases within rack
• Make it clear which phases are where
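The per-phase bookkeeping above can be sketched as a simple check. This is a minimal illustration with hypothetical PDU readings and an assumed per-phase breaker rating; a real site would pull these values from monitored PDUs.

```python
# Hypothetical per-rack PDU readings (amps), keyed by supply phase.
# Checks that no single phase of a 3-phase feed exceeds an assumed limit.
PHASE_LIMIT_AMPS = 32.0  # assumed per-phase breaker rating

pdu_readings = {
    "L1": [12.4, 9.8],   # racks fed from phase L1
    "L2": [14.1, 13.7],
    "L3": [6.2],
}

def check_phases(readings, limit):
    """Return phases whose summed draw exceeds the limit."""
    return {phase: sum(amps) for phase, amps in readings.items()
            if sum(amps) > limit}

for phase, amps in check_phases(pdu_readings, PHASE_LIMIT_AMPS).items():
    print(f"WARNING: phase {phase} drawing {amps:.1f} A "
          f"(limit {PHASE_LIMIT_AMPS} A)")
```

Keeping racks on a single phase (as the slide advises) makes this accounting trivial; mixed-phase racks would need per-outlet readings.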
Cooling
• Mix of techniques (in our case)
• Compressors - gradual degradation
• 4 AHUs: 4 × 2 compressors
• Liquid cooling
• 3 AHUs: effectively 1 active chiller (with failover)
• Over-specification
• Expect maintenance downtime
Network
• An aside (not power or A/C)
• Networking now a first class citizen
• Disparate vendors -> unified structure
• 160 Gbps backbone
• 80 Gbps redundant ring
Site diagram
[Diagram repeated from above: 80 Gb/s backbone ring of Summit X670V core switches, X460-48t edge switches, Upper and Lower rooms, 10 Gb/s WAN uplink]
Best practices
• Cold starts & boot order
• Auto power on?
• Alerts for sysadmins
• Notifications & communication
• Single points of failure - startup critical path
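The "alerts for sysadmins" point can be sketched as a threshold check. This is a hedged illustration with assumed room names and temperature thresholds; in practice a site would wire such a check into an existing monitoring system (Nagios, Icinga, etc.) rather than run it standalone.

```python
# Minimal sketch of a machine-room alerting check.
# Thresholds and readings below are assumptions for illustration.
TEMP_WARN_C = 27.0   # assumed warning threshold
TEMP_CRIT_C = 32.0   # assumed critical threshold

def classify(temp_c):
    """Map a machine-room temperature to a monitoring state."""
    if temp_c >= TEMP_CRIT_C:
        return "CRITICAL"
    if temp_c >= TEMP_WARN_C:
        return "WARNING"
    return "OK"

readings = {"upper-room": 24.5, "lower-room": 29.0}  # hypothetical sensors
for room, temp in readings.items():
    state = classify(temp)
    if state != "OK":
        print(f"{state}: {room} at {temp} C - notify on-call sysadmin")
```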
Startup procedures
• Critical path
• Core infrastructure
• Core services (NFS, master services, pool nodes, DPM, WN)
• More speed less haste
• Automation
• Cluster management
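The startup critical path above is naturally a dependency graph, and the boot order falls out of a topological sort. A minimal sketch, assuming hypothetical service names based on the slide (the real dependency edges at any given site may differ):

```python
# Sketch: encode the startup critical path as a dependency graph
# and derive a safe boot order via topological sort.
from graphlib import TopologicalSorter  # Python 3.9+

# Each service maps to the set of services it depends on (assumed edges).
deps = {
    "nfs": set(),                        # core infrastructure first
    "master-services": {"nfs"},
    "dpm-head": {"master-services"},
    "pool-nodes": {"dpm-head"},
    "worker-nodes": {"pool-nodes", "nfs"},
}

boot_order = list(TopologicalSorter(deps).static_order())
print(boot_order)
```

Encoding the order this way supports the "more speed, less haste" point: automation can bring services up as fast as dependencies allow without a human racing through a checklist.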
A user’s perspective
• Depending on the size of the cluster, power and A/C concerns can have a major impact on users
• Communication
• Notification
• Posted maintenance windows
• Postmortem
Process flow
• Logging
• Preventative maintenance
• Event flow
• Postmortem
• Process revision
• Escalation
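The escalation step above can be sketched as time-based tiers for an unacknowledged event. Tier names and thresholds here are assumptions for illustration, not the site's actual policy:

```python
# Sketch: who owns an event after it has gone unacknowledged for a while.
from datetime import timedelta

# Assumed escalation tiers: (time unacknowledged, responsible party).
ESCALATION_TIERS = [
    (timedelta(minutes=0), "on-call sysadmin"),
    (timedelta(minutes=30), "team lead"),
    (timedelta(hours=2), "site manager"),
]

def escalation_target(elapsed):
    """Return who should own an event after `elapsed` unacknowledged time."""
    target = ESCALATION_TIERS[0][1]
    for threshold, who in ESCALATION_TIERS:
        if elapsed >= threshold:
            target = who
    return target

print(escalation_target(timedelta(minutes=45)))  # prints "team lead"
```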
Summary
• Cluster environment is very often externally dictated
• Organic growth
• Can happen to anyone
• Process