@NCInews
National Computational Infrastructure’s Pursuit of High-Performance in OpenStack Clouds
Andrew Howard & Matthew Sanderson, HPC and Cloud Systems
National Computational Infrastructure, The Australian National University
o NCI Contributors
  o Dr. Muhammad Atif
  o Mr. Simon Fowler
  o Mr. Jakub Chrzeszczyk
  o Dr. Ching-Yeh (Leaf) Lin
  o Dr. Benjamin Menadue
Thanks to my colleagues
o NCI Overview
o Why we are interested in HPC Clouds
o NCI Cloud past and present
o What we have done to implement an HPC Cloud
o Containers
o MPI performance under Docker
o Conclusion
o Questions
Agenda
NCI: an overview
Mission: World-class, high-end computing services for Australian research and innovation

What is NCI:
• Australia's most highly integrated e-infrastructure environment
• Petascale supercomputer + highest performance research cloud + highest performance storage in the southern hemisphere
• Comprehensive and integrated expert service
• Nationally/internationally renowned support team
NCI is national and strategic:
• Driven by national research priorities and excellence
• Engaged with research institutions/collaborations and industry
• A capability beyond the capacity of any single institution
• Sustained by a collaboration of agencies/universities
NCI is important to Australia because it:
• Enables research that otherwise would be impossible
• Enables delivery of world-class science
• Enables interrogation of big data, otherwise impossible
• Enables high-impact research that matters; informs public policy
• Attracts and retains world-class researchers for Australia
• Catalyses development of young researchers' skills
[Diagram: the NCI service stack, from Research Outcomes at the top, through Communities and Institutions / Access and Services, Expertise / Support and Development, HPC Services / Virtual Laboratories / Data-intensive Services, and Integration, down to Compute (HPC/Cloud), Storage and Network Infrastructure.]
NCI today: comprehensive, integrated, quality service, innovative and valued

Facts and Figures
• Supercomputer (Raijin): 1.2 petaflops (1,200,000,000,000,000 operations/sec)
  – 57,492 cores, 160 Tbytes memory, 10 petabytes storage, 9 Tbit/sec backplane
  – Australia's highest sustained-performance research supercomputer
• HPC Cloud: 3,200 cores, supercomputer spec. for orchestrating data services
• Global integrated storage (highest performance filesystems in Australia)
  – 20 PB disk (up to 120 Gbytes/sec bandwidth); 40 petabytes of tape for archive purposes
• Power consumption: 1.6-2.0 megawatts
• Serves researchers at 30 universities, 5 national science agencies and 2 MRIs
• ~2,500 research users; 1,400 journal articles supported by NCI services
• Support for more than $50M of national competitive research grants annually
• One-third of Fellows elected to the Australian Academy of Science (2014-15) are NCI users

Scale
• HPC and data infrastructure: $47M replacement value (NCRIS, Aust. Gov't)
• Purpose-built data centre: $24M replacement value (2012)
• Recurrent operations: $17-18M p.a. (partners: $11+M; NCRIS: $5+M)
  – Co-investment: science agencies ($6M p.a.), universities and ARC ($5+M p.a.)

Expert, agile and secure
• 60 expert staff: operations, user support, high-performance computing and data, collections management/curation, visualisation, virtual lab development, etc.
• Driven by the goals of researchers and research institutions
• Annual IT security audits
Inside the 900 sq. m. machine room
Supports the full gamut of research
pure → strategic → applied → industry
• Fundamental sciences
• Mathematics, physics, chemistry, astronomy
• ARC Centres of Excellence (ARCCSS, CAASTRO, CUDOS)
• Research with an intended strategic outcome
• Environmental, medical, geoscientific
• e.g., energy (UNSW), food security (ANU), geosciences (Sydney)
• Supporting industry and innovation
• e.g., ANU/UNSW startup, Lithicon, sold for $76M to US company FEI in 2014; multinational miner
• Informing public policy; real economic impact
• Climate variation, next-gen weather forecasting, disaster management (CoE, BoM, CSIRO, GA)
Services
• Services and Technologies (~30 staff)
  – Operations: robust/expert/secure (20 staff incl. 4 vendor contracted)
  – HPC
    • Expert user support (9)
    • Largest research software library in Australia (300+ applications in all fields)
  – Cloud
    • High-performance: VMs, clusters
    • Secure, high-performance filesystem, integrated into the NCI workflow environment
  – Storage
    • Active (high-performance parallel Lustre) and archival (dual-copy HSM tape)
    • Partner shares; collections; partner dedicated
• Research Engagement and Innovation (~20 staff)
  – HPC and Data-Intensive Innovation
    • Upscaling priority applications (e.g., Fujitsu-NCI collaboration on ACCESS)
    • Bioinformatics pipelines (APN, melanoma, human genome)
  – Virtual Environments
    • Climate/Weather, All-sky Astrophysics, Geophysics, etc. (NeCTAR)
  – Data Collections
    • Management, publication, citation: strong environmental focus + other
  – Visualisation
    • Drishti, Voluminous, interactive presentations
Virtual Environments and Laboratories
Moving to friction-free environments, e.g., virtual desktops
Courtesy: Geoscience Australia
Shared Science Platforms for Shared Science Services
[Diagram: end-to-end data life cycle, with NCI providing Data as a Service. A user generates or transfers data; NCI provides fast data storage; a data manager completes a DMP and creates a catalogue via the Data Management Portal; data are curated, published and cited; the data are used by HPC/supercomputer users and by web-based real-time analytics software, the Virtual Desktop Interface, Virtual Laboratories and other services; results are visualised (NCI Vislab); the paper and data are published; and the data are shared and re-used.]
End-to-end Data Life Cycle
• The Climate & Weather Science Laboratory (CWSLab) is an innovation in climate data analysis enabled by NCI via NeCTAR funding
• Ideal for performing interactive analysis, code development, visualising data and publication writing
• Analogous to a local computer, but with access to many petabytes of climate & weather data
• Virtual Desktop Infrastructure established with access to climate data
• Users log in to a desktop interface
Earth systems & environmental science data in cloud computing
Cloud Infrastructure
o NCI has been doing cloud computing since 2009
o Red Hat OpenStack Cloud (2013)
  o 384-core private cloud
  o Enterprise grade
  o Typically for Virtual Laboratories
  o Uptime of 100% for the past two years
o Icehouse (2014)
  o Migrated nova-network to Neutron
  o 56G Ethernet
  o Ceph volume services added
  o Scaled up from 32 nodes to 100
o Kilo (2015)
  o Power-efficiency improvements reduced idle load from 120W to 65W
  o Increased overcommit ratio
o NeCTAR Research Cloud (2013 – public cloud)
  o IaaS and PaaS
  o Foundation node of NeCTAR (Australia's national e-research cloud)
  o Intel Sandy Bridge (3,200 cores with Hyper-Threading)
  o Full fat tree 56G Ethernet (Mellanox)
    o Higher initial cost, but provides consistent network performance and flexibility
  o 800GB of SSD per compute node (2x 400GB SSDs in RAID-0)
  o Access to 0.5PB of Ceph storage on the same fabric
  o Delivering on-demand research computing
Cloud Infrastructure
o Tenjin Partner Cloud (2013)
  o Flagship cloud for data-intensive compute
  o Same hardware platform as the NeCTAR cloud
  o Two zones:
    o Density (overcommit of CPUs)
    o Performance (no CPU or memory overcommit; a launch sketch targeting this zone follows below)
  o RDO with Neutron and CentOS 7.x
  o Architected to support both the high computational and I/O performance required for "big data" research
  o 2x 400GB SSDs per compute node in RAID-0 (800GB per node)
  o Access to ~1PB of Ceph storage
  o Access to 30PB of Lustre storage
  o SR-IOV, full fat tree and 56G Ethernet
  o On-demand access to GPU nodes
  o Federated with the NCI HPC environment
Cloud Infrastructure
o InfiniCloud (Experimental)
  o FDR (56Gb) InfiniBand cloud
  o Icehouse, then Kilo; heavily modified at NCI, based on a Mellanox recipe
  o Virtual Functions: the Mellanox InfiniBand HCA is presented into virtual machines via SR-IOV (a quick in-guest check is sketched below)
  o InfiniBand PKey to VLAN mapping
  o Near line-rate IB performance
  o Once stable, Tenjin may move to native IB
o Containers
  o Docker
  o Rocket?
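As a quick sanity check for the SR-IOV setup above, here is a small sketch a guest can run to confirm the virtual function shows up as an RDMA device. It only reads /sys/class/infiniband and assumes the Mellanox driver is loaded in the VM; it is not part of NCI's tooling.

```python
# Minimal sketch: confirm an SR-IOV InfiniBand virtual function is visible
# inside the guest by listing the RDMA devices registered with the kernel.
# Assumes the Mellanox (mlx4/mlx5) driver is loaded in the VM.
import os

IB_SYSFS = "/sys/class/infiniband"

def list_ib_devices():
    """Return the RDMA device names (e.g. mlx4_0) exposed to this guest."""
    if not os.path.isdir(IB_SYSFS):
        return []
    return sorted(os.listdir(IB_SYSFS))

if __name__ == "__main__":
    devices = list_ib_devices()
    if not devices:
        print("No RDMA devices visible - SR-IOV VF not passed through?")
    for dev in devices:
        # Report the link state of each port on the device.
        ports_dir = os.path.join(IB_SYSFS, dev, "ports")
        for port in sorted(os.listdir(ports_dir)):
            with open(os.path.join(ports_dir, port, "state")) as f:
                print(f"{dev} port {port}: {f.read().strip()}")
```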
Cloud Infrastructure
Job statistics on Raijin: users are really into parallel jobs
NCI’s Awesome dashboard
Why a High Performance Cloud?
o Complement NCI supercomputer offerings
  o Accelerate processing of single-node jobs
o Virtual Laboratories
  o Remote job submission
  o Visualisation
  o Serving research data to the web
  o Requiring access to the global file system at NCI
o On-demand GPU access
o Workloads not best suited to Lustre
  o Local scratch is SSD on the NCI Cloud, compared to SATA HDD on Raijin
o Pipelines and workloads that are not suited to the supercomputer
  o Packages that cannot/will not be supported
  o Proofs of concept before making a big run
o Cloud burst
  o Offloading single-node jobs to the Cloud when the supercomputer is heavily used
o Student courses
o RDMA (using NeCTAR)
Why a High Performance Cloud?
o Many research workloads utilise very large data sets
o Secure access to data in place
o Seamlessly combine resources across NCI HPC and Cloud without copying data into and out of the Cloud
o Migrate workloads transparently between domains (HPC, Cloud)
o On-demand provisioning
o Legacy and/or emerging elastic workflows
o Provide a wider range of services to NCI users
  o GPU clusters
o Utilise the most appropriate and energy-efficient hardware to achieve research outcomes
Combining computation and data
[Diagram: NCI systems connectivity. Raijin compute, login and data-mover nodes sit on a 56Gb FDR InfiniBand fabric with the Raijin filesystems (/short 7.6PB; /home, /system, /images, /apps). A second 56Gb FDR IB fabric serves the global /g/data filesystems (/g/data1 ~7.4PB, /g/data2 ~6.5PB, /g/data3 ~7.3PB) and the Massdata archive (1.0PB cache, 12.3PB tape). The OpenStack Tenjin and NeCTAR clouds (each with ~0.5PB of Ceph) and a VMware environment attach to the same fabrics; NCI data movers provide 10 GigE connectivity to the Internet and to the Huxley data centre.]
NCI Systems Connectivity
o Elements which differentiate NCI HPC and Cloud systems
  o Workflows
  o Communications architecture: InfiniBand and Ethernet
o InfiniBand
  o FDR 56Gbps and EDR 100Gbps
  o Lossless, full fat tree
  o Deterministic network latency and throughput
  o Hardware offload for communication through RDMA
  o Kernel and TCP/IP stack bypass
o Ethernet
  o 10Gbps, 40Gbps, 56Gbps and 100Gbps
  o 10G is typical for Cloud presentation
  o Can be lossless or a traditional switched network
o RDMA (Remote Direct Memory Access)
  o Offloads communication from the operating system network stack
  o Heavily used in HPC applications through various MPI libraries (a minimal ping-pong latency sketch follows below)
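To make the latency comparison concrete, a minimal OSU-style ping-pong sketch using mpi4py is shown below. It is an illustrative stand-in only; the results on the following slides come from the standard OSU micro-benchmarks, not from this script.

```python
# Minimal ping-pong latency sketch (OSU-style) between MPI ranks 0 and 1.
# Run with, e.g.: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

MSG_SIZE = 8          # bytes per message
ITERATIONS = 10000    # round trips to average over
buf = bytearray(MSG_SIZE)

comm.Barrier()
start = time.perf_counter()
for _ in range(ITERATIONS):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)
        comm.Recv([buf, MPI.BYTE], source=1, tag=0)
    elif rank == 1:
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
        comm.Send([buf, MPI.BYTE], dest=0, tag=0)
elapsed = time.perf_counter() - start

if rank == 0:
    # One-way latency is half the average round-trip time.
    print(f"{MSG_SIZE}-byte one-way latency: "
          f"{elapsed / ITERATIONS / 2 * 1e6:.2f} us")
```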
Comparing Cloud System performance
Why are packet loss and latency important?
Image: ESNet
o What are we measuring?
  o Can traditional HPC-level MPI applications run effectively within a container environment?
  o How do latency and throughput compare to our baseline HPC performance?
o Comparison of MPI RDMA performance in various environments
  o Native InfiniBand (full fat tree)
  o Ethernet and RoCE (full fat tree and switched)
o RDMA in a container
  o How does it compare to bare-metal performance? (A container launch sketch follows below.)
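One way to expose the host's InfiniBand/RDMA devices to a container is sketched below with the docker Python SDK. The image name, device paths and command are assumptions for illustration, not necessarily the exact invocation used for the results that follow.

```python
# Sketch: run a containerised MPI benchmark with the host's RDMA devices
# exposed so it can use InfiniBand/RoCE rather than the TCP stack.
# The image name, device paths and command are illustrative assumptions.
import docker

client = docker.from_env()

output = client.containers.run(
    image="example/osu-benchmarks:latest",     # hypothetical benchmark image
    command="mpirun -np 2 ./osu_latency",      # 2-rank run inside the container
    devices=[
        "/dev/infiniband/uverbs0:/dev/infiniband/uverbs0:rwm",  # verbs device
        "/dev/infiniband/rdma_cm:/dev/infiniband/rdma_cm:rwm",  # RDMA CM
    ],
    cap_add=["IPC_LOCK"],   # allow pinning memory for RDMA registration
    ulimits=[docker.types.Ulimit(name="memlock", soft=-1, hard=-1)],
    network_mode="host",    # share the host network namespace
    remove=True,
)
print(output.decode())
```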
Examining container performance
Cluster / Architecture / Interconnect / Location:
o Raijin: Intel Xeon E5-2670 @ 2.60GHz (Sandy Bridge); Mellanox FDR InfiniBand, full fat tree; NCI
o Tenjin: Intel Xeon E312xx @ 2.60GHz (Sandy Bridge); Mellanox FDR InfiniBand flashed to 56G Ethernet, full fat tree; NCI
o Tenjin (Container): Intel Xeon E312xx @ 2.60GHz (Sandy Bridge); Mellanox FDR InfiniBand flashed to 56G Ethernet, full fat tree; NCI
o InfiniCloud: Intel Xeon E5-2650 0 @ 2.00GHz; Mellanox FDR InfiniBand; NCI
o 10G-Cloud: AMD Opteron 63xx; 10G Ethernet
o Open MPI 1.10
o All applications compiled with GCC at -O3; the Intel compilers were not used, to ensure a fair comparison
o All clouds were based on OpenStack (Icehouse, Juno, Kilo)
o Preliminary results: 10 runs per case, with the maximum and minimum discarded and the remainder averaged (see the sketch below)
o Comprehensive results will be presented in a white paper
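The averaging rule above (drop the best and worst of 10 runs, average the rest) is simple to express; a sketch follows, with placeholder timing values rather than real measurements.

```python
# Sketch of the reporting rule used for the preliminary results:
# run each benchmark 10 times, discard the minimum and maximum,
# and report the mean of the remaining runs. Values are placeholders.
from statistics import mean

def trimmed_average(timings):
    """Average after discarding the single best and worst run."""
    if len(timings) < 3:
        raise ValueError("need at least 3 runs to trim min and max")
    trimmed = sorted(timings)[1:-1]
    return mean(trimmed)

# e.g. ten wall-clock times (seconds) for one benchmark/platform pair
runs = [12.4, 12.1, 12.3, 12.9, 12.2, 13.8, 12.3, 12.5, 12.2, 12.4]
print(f"reported result: {trimmed_average(runs):.2f} s over {len(runs)} runs")
```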
Preliminary Results (Platform)
Point to Point Latency
[Chart: OSU point-to-point bandwidth (MB/sec), higher is better, versus message size from 1 byte to 4MB; y-axis 0 to 7000 MB/sec. Series: BW-AWS-WEB, 10GbE-Cloud, Tenjin-TCP, Tenjin-Yalla, Tenjin-RoCE, Tenjin-Container, InfiniCloud-VM, InfiniCloud-HY, Raijin.]
Point to Point Bandwidth
Courtesy: Dr. Ching-Yeh (Leaf) Lin at NCI
Trinity is a bioinformatics de novo sequence-assembly package consisting of three programs: Inchworm (OpenMP, GCC), Chrysalis (OpenMP, GCC) and Butterfly (Java). The calculation was carried out using the procedure published by B. J. Haas et al., Nature Protocols 8, 1494–1512 (2013).
Bioinformatics Workload Speedups compared to 10G-XXX-Cloud
16 CPUs, one compute node (higher is better)
[Chart: speedup bars for Inchworm, Chrysalis and Butterfly; y-axis 0 to 2. Series: Raijin, Tenjin, 10G-XXX-Cloud.]
Bioinformatics workload – Single compute node
Speed-up of NPB Class 'C' with 32 and 64 Processes Normalized w.r.t. 32 Processes on 10G Ethernet Cloud (Higher is better)
[Chart: NPB kernels CG, EP, FT, IS, LU and MG; y-axis 0 to 10. Series: 10GbE-Cloud-32P, Tenjin-32P, Tenjin Container-32P, Raijin-32P, 10GbE-Cloud-64P, Tenjin-64P, Tenjin Container-64P, Raijin-64P.]
NAS Parallel Benchmarks
- ApoA1 benchmark, measured in seconds per time-step
- 16 CPUs per node
- Lack of NUMA awareness
- The TCP BTL on the cloud worked better than MXM
NAMD Speed-up
[Chart: speed-up versus number of CPUs (1 to 128); y-axis 0 to 50. Series: Tenjin, Tenjin-Containers, Raijin.]
Molecular Dynamics Code - NAMD
[Chart: compute time in seconds (log scale, 1 to 10,000; lower is better) versus number of CPUs (1 to 128). Series: RDO TCP, RDO TCP MXM, RDO OIB, RDO OIB MXM, RJ TCP, RJ TCP MXM, RJ OIB, RJ OIB MXM.]
Courtesy: Dr. Benjamin Menadue
Computational physics: a custom-written hybrid Monte Carlo code for generating gauge fields for Lattice QCD. For each iteration, calculating the Hamiltonian involves inverting a large complex matrix using CGNE. Written in Fortran, using pure MPI (no threading).
Scaling still an issue – NUMA
NCI's commitment to HPC in the Cloud
o NCI is engaged with many partners providing cloud-based HPC and HTC solutions to researchers. These are usually released as open source.
o Slurm-Cluster
  o Enables a researcher to quickly and easily build a cluster in the cloud, backed by the Slurm scheduler. It targets the Tenjin and NeCTAR clouds, but should work on any OpenStack deployment. https://github.com/NCI-Cloud/slurm-cluster
o Intel Grant for Cluster in the Cloud
  o Worked with Amazon via LinkDigital
  o Raijin in a Box is in preproduction and will be made available on the AWS Marketplace
  o How to build a supercomputer on AWS with spot instances: https://www.youtube.com/watch?v=KG3SKaf7yEw
NCI’s commitment to HPC in the Cloud
o Applying NCI's depth of expertise in HPC application tuning to deliver high-performance, secure computing environments in the Cloud for Australian researchers
o Bringing "Cloud to HPC"
  o Containers
  o Docker
o A "bring your own workflow" model
o We can support seamless high-performance research workloads with large data-access requirements across multiple platforms
o Parallel jobs can run on the Cloud, but is it HPC?
  o Not at the moment
  o Cloud is suited to high-throughput computing (HTC), ease of provisioning and specific workloads
  o Traditional HPC provides the best performance for larger parallel applications with MPI requirements
o A common underlying hardware architecture shared between our HPC and Cloud platforms provides application portability and flexibility in provisioning a system in either role
o QPI and NUMA can have a large impact on performance
  o Single-node performance is on par with bare metal (if the application is not memory bound)
  o Locality-aware scheduling (NUMA and network awareness)
  o Our benchmarks were limited by the QPI performance of Sandy Bridge
o NCI plans to deploy bare-metal provisioning using Ironic
Conclusion
@NCInews
Thank You