Hotfoot HPC Cluster
March 31, 2011
Topics
• Overview
• Execute Nodes
• Manager/Submit Nodes
• NFS Server
• Storage
• Networking
• Performance
Overview - Hotfoot Pilot
• Launched May 2009
• Original Partnership
– Astronomy
– Statistics
– CUIT
– Office of the Executive Vice President for Research
Overview - Hotfoot Expansion
• Expanded March 2011
– More Nodes
– More Storage
– Changed Scheduler
• New Participant
– Social Science Computing Committee (SSCC)
Overview – Cluster Components
• 52 Execute Nodes
• 520 Total Cores
• 2 Manager Nodes
• 1 NFS Server (1 Cold Spare)
• 52 TB Storage (72 TB Raw)
Overview - Architecture
[Diagram: Hotfoot Components]
• Manager/Submit Node 1 (Haddock) and Manager/Submit Node 2 (Mahimahi): one Manager/Submit node is active at a time; failover is manual.
• Blade chassis: the original chassis contains 32 execute nodes; the new chassis contains 24 execute nodes.
• NFS Server (Herring): provides working storage for all other systems in the cluster.
• NFS Server (Sardine): second server available to provide NFS services; currently not connected.
• RAID: 72 TB raw storage, approximately 52 TB usable under RAID 5.
Execute Nodes
Model         Quantity   CPU           Total Cores   Memory
BL2x220c G5   32         Dual 4-core   256           16 GB
BL2x220c G6   14         Dual 6-core   168           24 GB
BL2x220c G6   8          Dual 6-core   96            96 GB
Manager/Submit Nodes
• HP DL360 G5, 4 GB RAM
• Torque Resource Manager (OpenPBS descendant)
• Maui Cluster Scheduler
• User Access via virtual interface (vif)
• Failover via Torque High Availability (HA)
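With Torque handling resources and Maui handling scheduling, users interact with the cluster through PBS-style job scripts submitted on the active Manager/Submit node. A minimal sketch of such a script is below; the job name, queue defaults, and resource request are illustrative assumptions, not values taken from the slides.

```shell
# Hypothetical Torque/PBS job script (resource values are illustrative).
cat > hello.pbs <<'EOF'
#!/bin/sh
#PBS -N hello                 # job name (assumed)
#PBS -l nodes=1:ppn=1         # one core on one node
#PBS -l walltime=00:05:00     # five-minute limit
#PBS -j oe                    # merge stdout and stderr
echo "Running on $(hostname)"
EOF

# On the cluster: submit with `qsub hello.pbs`, then monitor with `qstat`.
# The #PBS directives are ordinary shell comments, so the script also
# runs standalone for a quick sanity check:
sh hello.pbs
```

Because the scheduler only reads `#PBS` lines at submission time, the same file doubles as a plain shell script during testing.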
NFS Servers
• Primary
– HP DL360 G7
– 2 x 4 cores
– 16 GB RAM
• Backup
– HP DL360 G5
– 1 x 2 cores
– 8 GB RAM
Storage
• HP P2000 Storage Array
• 32 x 2 TB Drives
• RAID 5
• ~52 TB Usable
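The usable figure follows from RAID 5's overhead: each RAID 5 group spends one drive's worth of capacity on parity. A small sketch of the arithmetic; the array's actual group layout and hot-spare count are not given in the slides, so `raid5_usable_tb` is a hypothetical helper, not the P2000's configuration.

```python
def raid5_usable_tb(drives_per_group: int, groups: int, drive_tb: float) -> float:
    """Usable capacity: RAID 5 spends one drive per group on parity."""
    return groups * (drives_per_group - 1) * drive_tb

# A single 32-drive group of 2 TB drives would yield 62 TB usable:
print(raid5_usable_tb(32, 1, 2))   # 62.0
# Smaller groups (more parity drives) plus hot spares pull the total
# down toward the ~52 TB the slides report; e.g. four 8-drive groups:
print(raid5_usable_tb(8, 4, 2))    # 56.0
```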
Networking
• Execute Nodes
– Channel-bonding mode 2 (load-balancing and fault tolerance)
– 1 Gb connection to chassis switches
– Usage records suggested this was sufficient
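Channel-bonding mode 2 is the Linux kernel's balance-xor policy: outgoing traffic is hashed across the slave NICs, and a failed link drops out of the hash. A RHEL-style configuration sketch is below; the interface names and file paths are assumptions, since the slides do not show the actual configs.

```
# /etc/modprobe.d/bonding.conf (illustrative)
alias bond0 bonding
# mode=2 (balance-xor): XOR-hash slave selection gives load balancing
# plus fault tolerance; miimon=100 polls link state every 100 ms.
options bond0 mode=2 miimon=100

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeated for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
```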
[Figure: sample traffic for an execute node]
• Chassis
– Each chassis has four Cisco 3020 switches
– 1 Gb connection to Edge switches
– Usage records suggested this was sufficient
[Figure: sample traffic for a chassis switch]
[Photo: original chassis, showing network connections for two servers]
Performance
• Concern about the ability of NFS to handle I/O demands.
• Reviewed performance of pilot system.
• Ran tests on expanded system.
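One common way to run such tests is a sequential-write pass with `dd`; this is a generic sketch of that kind of probe, not the actual test scripts from the slides. The target path is illustrative (on the cluster it would be a file on the NFS-mounted working storage), and `conv=fsync` forces a flush before `dd` reports a rate, so cached writes don't inflate the number.

```shell
# Illustrative sequential-write throughput probe. Writes 256 MB and
# reports the transfer rate on stderr when it finishes.
dd if=/dev/zero of=/tmp/nfs_io_test bs=1M count=256 conv=fsync
rm -f /tmp/nfs_io_test
```

A matching sequential-read pass (`dd` from the test file to `/dev/null`) and runs from several execute nodes at once would round out the picture.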
[Figure: memory usage on the old NFS server]
[Figure: load average on the old NFS server]
Questions?
• Comments?
• Contact: roblane@columbia.edu