Page 1:

Lessons Learned When Building a Greenfield HPC Ecosystem

Andrew Keen

Michigan State University

Page 2:

Terminology

• High Performance Computing

• Greenfield: built from scratch, in contrast to brownfield (building on existing infrastructure)

Page 3:

About iCER / HPCC

• ‘Cyber-Enabled’: iCER is the Institute for Cyber-Enabled Research

• 300+ nodes (500+ soon)

• > 1 PB storage

• High-speed networks, GPUs, Xeon Phi accelerators, large-memory nodes

• Software!

• People!

Page 4:

Page 5:

There’s more than FLOPS?

It’s an ecosystem:

• Users!

• Compute

• Storage

• Physical Infrastructure

• Management Tools

• Policies

• Education

• Community

Page 6:

MISTIC

Page 7:

Startup

• Big SMP system from a Famous Name

– RFP benchmarks looked great!

• Actual workloads…

• Didn’t have adequate I/O bandwidth

Page 8:

There’s more than FLOPS?

It’s an ecosystem:

• Users!

• Compute

• Storage

• Physical Infrastructure

• Management Tools

• Policies

• Education

• Community

Page 9:

Storage

• Fast Storage

– Lustre; 350 TB, 9 GB/s

• Safe Storage

– ZFS

– Data Integrity, Snapshots, Replication

– Fast-ish

– First TB free

– $175/TB/year, competitive with offline cloud storage (a rough cost sketch follows below)
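
For rough numbers, here is a minimal cost sketch assuming the pricing model above (first TB free, then $175 per additional TB per year); the function name and example usage are illustrative only:

def annual_storage_cost(usable_tb, free_tb=1, rate_per_tb=175):
    """Estimate yearly cost for the 'safe' (ZFS-backed) storage tier.

    Pricing model from the slide: the first TB is free, each additional
    TB is billed at $175/TB/year. Figures are illustrative.
    """
    billable_tb = max(usable_tb - free_tb, 0)
    return billable_tb * rate_per_tb

# Example: a lab keeping 10 TB pays for 9 billable TB.
print(annual_storage_cost(10))  # -> 1575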

Page 10:

Configuration Management

• Never manage your systems by hand

• Can manage appliances/devices

– NetApps

– Junipers

– VMware

• Puppet – Git-backed environments are cool (see the sketch below)
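
To illustrate the "never manage your systems by hand" idea in code (a hypothetical sketch of declarative configuration management, not how Puppet itself is implemented; the package names are made up for the example):

# Desired state is written down once; drift is detected, not hand-fixed.
desired_packages = {"slurm", "openmpi", "lmod", "lustre-client"}

def find_drift(installed_packages):
    """Compare a node's installed packages against the desired set."""
    installed = set(installed_packages)
    return {
        "missing": desired_packages - installed,
        "unexpected": installed - desired_packages,
    }

# Example: a node imaged by hand is missing lmod and has a stray package.
print(find_drift(["slurm", "openmpi", "lustre-client", "telnet"]))
# -> {'missing': {'lmod'}, 'unexpected': {'telnet'}}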

Page 11:

Availability

• HPC is not HA*

– but you can still avoid disruptive pain points

– Outages can be disruptive

– Build redundancy based on budget and tolerance for disruption

Page 12:

Cluster Management

• Hardware Management

– IPMI (see the power-status sketch after this list)

– Firmware Updates and configuration!

– Vendor lock-in
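
As a minimal sketch of IPMI-based hardware management with ipmitool (the BMC hostnames and credentials below are hypothetical; ipmitool must be installed and the BMCs reachable over the management network):

import subprocess

# Hypothetical BMC credentials and node names, for illustration only.
BMC_USER = "admin"
BMC_PASS = "changeme"
NODES = ["cn001-bmc", "cn002-bmc", "cn003-bmc"]

def power_status(bmc_host):
    """Query chassis power state over IPMI (lanplus) using ipmitool."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", BMC_USER, "-P", BMC_PASS, "chassis", "power", "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # e.g. "Chassis Power is on"

for node in NODES:
    print(node, power_status(node))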

Page 13:

Security

• Security should reflect the resource’s goals

• Environment

• Trust relationships matter

Page 14:

Use Outside Resources

• Organizational and community resources

• Don’t reinvent the wheel

• Gitlab!

• http://gitlab.msu.edu

Page 15:

Communication

• No one reads bulk email

• Few people read personal email

• Social Media?

• Ticketing / Issue Tracking is critical

Page 16:

Physical Concerns

• Lots of power, small space

• Whole-room vs. spot cooling

• Containment

– Easy to prototype!

• Long term?

Page 17:

CyberInfrastructure Days

• October 24-25

• Open to the MSU community to learn about and collaborate on MSU and national CI resources

• http://tech.msu.edu/CI-Days

Page 18:

New Compute Cluster

• 2x Intel Xeon Ivy Bridge E5-2670v2 (2.5 GHz, 20 cores total)

• 500 GB HDD

• FDR (56 gigabit) network

Type        Mem (GB)   Accelerators    Total Performance (GigaFLOPS)   Cost
Base        64         (none)          400                             $3,805
Big Mem     256        (none)          400                             $5,339
Bigger Mem  512        (none)          400                             ~$12,000
GPU         128        2x NVIDIA K20   2400                            $7,900
Phi         128        2x Phi 5110p    2400                            $9,043
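
As a back-of-the-envelope check on the 400 GigaFLOPS figure for the Base node (a sketch assuming 8 double-precision FLOPs per core per cycle, i.e. 256-bit AVX add plus multiply on Ivy Bridge):

# Theoretical peak double-precision performance of a Base node:
# 2 sockets x 10 cores x 2.5 GHz x 8 DP FLOPs/cycle = 400 GFLOPS.
sockets = 2
cores_per_socket = 10          # the E5-2670v2 is a 10-core part
clock_ghz = 2.5
flops_per_cycle = 8            # 4-wide AVX add + 4-wide AVX multiply

peak_gflops = sockets * cores_per_socket * clock_ghz * flops_per_cycle
print(peak_gflops)  # -> 400.0, matching the Base row above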

Page 19:

Conclusion

Questions?

[email protected]

http://contact.icer.msu.edu