@NCInews
National Computational Infrastructure’s Pursuit of High-Performance in OpenStack Clouds
Andrew Howard & Matthew Sanderson, HPC and Cloud Systems
National Computational Infrastructure, The Australian National University
o NCI Contributors
  o Dr. Muhammad Atif
  o Mr. Simon Fowler
  o Mr. Jakub Chrzeszczyk
  o Dr. Ching-Yeh (Leaf) Lin
  o Dr. Benjamin Menadue
Thanks to my colleagues
o NCI Overview
o Why we are interested in HPC Clouds
o NCI Cloud past and present
o What we have done to implement an HPC Cloud
o Containers
o MPI performance under Docker
o Conclusion
o Questions
Agenda
NCI: an overview
Mission: World-class, high-end computing services for Australian research and innovation

What is NCI:
• Australia's most highly integrated e-infrastructure environment
• Petascale supercomputer + highest performance research cloud + highest performance storage in the southern hemisphere
• Comprehensive and integrated expert service
• Nationally/internationally renowned support team
NCI is national and strategic:
• Driven by national research priorities and excellence
• Engaged with research institutions/collaborations and industry
• A capability beyond the capacity of any single institution
• Sustained by a collaboration of agencies/universities
NCI is important to Australia because it:
• Enables research that otherwise would be impossible
• Enables delivery of world-class science
• Enables interrogation of big data, otherwise impossible
• Enables high-impact research that matters; informs public policy
• Attracts and retains world-class researchers for Australia
• Catalyses development of young researchers' skills
[Diagram: the NCI service stack, from Research Outcomes at the top, through Communities and Institutions / Access and Services, Expertise / Support and Development, HPC Services / Virtual Laboratories / Data-intensive Services, and Integration, down to Compute (HPC/Cloud), Storage and Network Infrastructure.]
NCI today: comprehensive, integrated, quality service, innovative and valued

Facts and Figures
• Supercomputer (Raijin): 1.2 petaflops (1,200,000,000,000,000 operations/sec)
  – 57,492 cores, 160 Tbytes memory, 10 petabytes storage, 9 Tbit/sec backplane
  – Australia's highest sustained-performance research supercomputer
• HPC Cloud: 3,200 cores, supercomputer spec. for orchestrating data services
• Global integrated storage (highest performance filesystems in Australia)
  – 20 PB disk (up to 120 Gbytes/sec bandwidth); 40 petabytes of tape for archive purposes
• Power consumption: 1.6-2.0 megawatts
• Serves researchers at 30 universities, 5 national science agencies and 2 MRIs
• ~2,500 research users; 1,400 journal articles supported by NCI services
• Support for more than $50M of national competitive research grants annually
• One-third of Fellows elected to the Australian Academy of Science (2014-15) are NCI users

Scale
• HPC and data infrastructure: $47M replacement value (NCRIS, Aust. Gov't)
• Purpose-built data centre: $24M replacement value (2012)
• Recurrent operations: $17-18M p.a. (partners: $11+M; NCRIS: $5+M)
  – Co-investment: science agencies ($6M p.a.), universities and ARC ($5+M p.a.)

Expert, agile and secure
• 60 expert staff: operations, user support, high-performance computing and data, collections management/curation, visualisation, virtual lab development, etc.
• Driven by the goals of researchers and research institutions
• Annual IT security audits
Inside the 900 sq. m. machine room
Supports the full gamut of research
pure → strategic → applied → industry
• Fundamental sciences
• Mathematics, physics, chemistry, astronomy
• ARC Centres of Excellence (ARCCSS, CAASTRO, CUDOS)
• Research with an intended strategic outcome
• Environmental, medical, geoscientific
• e.g., energy (UNSW), food security (ANU), geosciences (Sydney)
• Supporting industry and innovation
• e.g., ANU/UNSW startup, Lithicon, sold for $76M to US company FEI in 2014; multinational miner
• Informing public policy; real economic impact
• Climate variation, next-gen weather forecasting, disaster management (CoE, BoM, CSIRO, GA)
Services
• Services and Technologies (~30 staff)
  – Operations: robust/expert/secure (20 staff incl. 4 vendor contracted)
  – HPC
    • Expert user support (9)
    • Largest research software library in Australia (300+ applications in all fields)
  – Cloud
    • High-performance: VMs, clusters
    • Secure, high-performance filesystem, integrated into the NCI workflow environment
  – Storage
    • Active (high-performance parallel Lustre) and archival (dual-copy HSM tape)
    • Partner shares; collections; partner dedicated
• Research Engagement and Innovation (~20 staff)
  – HPC and Data-Intensive Innovation
    • Upscaling priority applications (e.g., Fujitsu-NCI collaboration on ACCESS)
    • Bioinformatics pipelines (APN, melanoma, human genome)
  – Virtual Environments
    • Climate/Weather, All-sky Astrophysics, Geophysics, etc. (NeCTAR)
  – Data Collections
    • Management, publication, citation: strong environmental focus + other
  – Visualisation
    • Drishti, Voluminous, interactive presentations
Virtual Environments and Laboratories
Moving to friction-free environments, e.g., virtual desktops
Courtesy: Geoscience Australia
Shared Science Platforms for Shared Science Services
[Diagram: end-to-end data life cycle, with NCI providing Data as a Service. A user generates or transfers data; NCI provides fast data storage; a data manager completes a DMP and creates a catalogue via the Data Management Portal; data are curated, published and cited; the data are used by HPC/supercomputer users and by web-based real-time analytics software, the Virtual Desktop Interface, Virtual Laboratories and other services; results are visualised (NCI Vislab); the paper and data are published; and the data are shared and re-used.]
End-to-end Data Life Cycle
• The Climate & Weather Science Laboratory (CWSLab) is an innovation in climate data analysis enabled by NCI via NeCTAR funding
• Ideal for performing interactive analysis, code development, visualising data and publication writing
• Analogous to a local computer, but with access to many petabytes of climate & weather data
• Virtual Desktop Infrastructure established with access to climate data
• Users log in to a desktop interface
Earth systems & environmental science data in cloud computing
Cloud Infrastructure
o NCI has been doing cloud computing since 2009
o Red Hat OpenStack Cloud (2013)
  o 384-core private cloud
  o Enterprise grade
  o Typically for Virtual Laboratories
  o Uptime of 100% for the past two years
o Icehouse (2014)
  o Migrated nova-network to Neutron
  o 56G Ethernet
  o Ceph volume services added
  o Scaled up from 32 nodes to 100
o Kilo (2015)
  o Power-efficiency improvements reduced idle load from 120W to 65W
  o Increased overcommit ratio
o NeCTAR Research Cloud (2013 – public cloud)
  o IaaS and PaaS
  o Foundation node of NeCTAR (Australia's national e-research cloud)
  o Intel Sandy Bridge (3,200 cores with Hyper-Threading)
  o Full fat tree 56G Ethernet (Mellanox)
    o Higher initial cost, but provides consistent network performance and flexibility
  o 800GB of SSD per compute node (2x 400GB SSDs in RAID-0)
  o Access to 0.5PB of Ceph storage on the same fabric
  o Delivering on-demand research computing
Cloud Infrastructure
o Tenjin Partner Cloud (2013)
  o Flagship cloud for data-intensive compute
  o Same hardware platform as the NeCTAR cloud
  o Two zones:
    o Density (overcommit of CPUs)
    o Performance (no CPU or memory overcommit; a launch sketch targeting this zone follows below)
  o RDO with Neutron and CentOS 7.x
  o Architected to support both the high computational and I/O performance required for "big data" research
  o 2x 400GB SSDs per compute node in RAID-0 (800GB per node)
  o Access to ~1PB of Ceph storage
  o Access to 30PB of Lustre storage
  o SR-IOV, full fat tree and 56G Ethernet
  o On-demand access to GPU nodes
  o Federated with the NCI HPC environment
Cloud Infrastructure
o InfiniCloud (Experimental)
  o FDR (56Gb) InfiniBand cloud
  o Icehouse, then Kilo; heavily modified at NCI, based on a Mellanox recipe
  o Virtual Functions: the Mellanox InfiniBand HCA is presented into virtual machines via SR-IOV (a quick in-guest check is sketched below)
  o InfiniBand PKey to VLAN mapping
  o Near line-rate IB performance
  o Once stable, Tenjin may move to native IB
o Containers
  o Docker
  o Rocket?
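As a quick sanity check for the SR-IOV setup above, here is a small sketch a guest can run to confirm the virtual function shows up as an RDMA device. It only reads /sys/class/infiniband and assumes the Mellanox driver is loaded in the VM; it is not part of NCI's tooling.

```python
# Minimal sketch: confirm an SR-IOV InfiniBand virtual function is visible
# inside the guest by listing the RDMA devices registered with the kernel.
# Assumes the Mellanox (mlx4/mlx5) driver is loaded in the VM.
import os

IB_SYSFS = "/sys/class/infiniband"

def list_ib_devices():
    """Return the RDMA device names (e.g. mlx4_0) exposed to this guest."""
    if not os.path.isdir(IB_SYSFS):
        return []
    return sorted(os.listdir(IB_SYSFS))

if __name__ == "__main__":
    devices = list_ib_devices()
    if not devices:
        print("No RDMA devices visible - SR-IOV VF not passed through?")
    for dev in devices:
        # Report the link state of each port on the device.
        ports_dir = os.path.join(IB_SYSFS, dev, "ports")
        for port in sorted(os.listdir(ports_dir)):
            with open(os.path.join(ports_dir, port, "state")) as f:
                print(f"{dev} port {port}: {f.read().strip()}")
```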
Cloud Infrastructure
Job statistics on Raijin: users are really into parallel jobs
NCI’s Awesome dashboard
Why a High Performance Cloud?
o Complement NCI supercomputer offerings
  o Accelerate processing of single-node jobs
o Virtual Laboratories
  o Remote job submission
  o Visualisation
  o Serving research data to the web
  o Requiring access to the global file system at NCI
o On-demand GPU access
o Workloads not best suited to Lustre
  o Local scratch is SSD on the NCI Cloud, compared to SATA HDD on Raijin
o Pipelines and workloads that are not suited to the supercomputer
  o Packages that cannot/will not be supported
  o Proofs of concept before making a big run
o Cloud burst
  o Offloading single-node jobs to the Cloud when the supercomputer is heavily used
o Student courses
o RDMA (using NeCTAR)
Why a High Performance Cloud?
o Many research workloads utilise very large data sets
o Secure access to data in place
o Seamlessly combine resources across NCI HPC and Cloud without copying data into and out of the Cloud
o Migrate workloads transparently between domains (HPC, Cloud)
o On-demand provisioning
o Legacy and/or emerging elastic workflows
o Provide a wider range of services to NCI users
  o GPU clusters
o Utilise the most appropriate and energy-efficient hardware to achieve research outcomes
Combining computation and data
[Diagram: NCI systems connectivity. Raijin compute, login and data-mover nodes sit on a 56Gb FDR InfiniBand fabric with the Raijin filesystems (/short 7.6PB; /home, /system, /images, /apps). A second 56Gb FDR IB fabric serves the global /g/data filesystems (/g/data1 ~7.4PB, /g/data2 ~6.5PB, /g/data3 ~7.3PB) and the Massdata archive (1.0PB cache, 12.3PB tape). The OpenStack Tenjin and NeCTAR clouds (each with ~0.5PB of Ceph) and a VMware environment attach to the same fabrics; NCI data movers provide 10 GigE connectivity to the Internet and to the Huxley data centre.]
NCI Systems Connectivity
o Elements which differentiate NCI HPC and Cloud systems
  o Workflows
  o Communications architecture: InfiniBand and Ethernet
o InfiniBand
  o FDR 56Gbps and EDR 100Gbps
  o Lossless, full fat tree
  o Deterministic network latency and throughput
  o Hardware offload for communication through RDMA
  o Kernel and TCP/IP stack bypass
o Ethernet
  o 10Gbps, 40Gbps, 56Gbps and 100Gbps
  o 10G is typical for Cloud presentation
  o Can be lossless or a traditional switched network
o RDMA (Remote Direct Memory Access)
  o Offloads communication from the operating system network stack
  o Heavily used in HPC applications through various MPI libraries (a minimal ping-pong latency sketch follows below)
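To make the latency comparison concrete, a minimal OSU-style ping-pong sketch using mpi4py is shown below. It is an illustrative stand-in only; the results on the following slides come from the standard OSU micro-benchmarks, not from this script.

```python
# Minimal ping-pong latency sketch (OSU-style) between MPI ranks 0 and 1.
# Run with, e.g.: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

MSG_SIZE = 8          # bytes per message
ITERATIONS = 10000    # round trips to average over
buf = bytearray(MSG_SIZE)

comm.Barrier()
start = time.perf_counter()
for _ in range(ITERATIONS):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)
        comm.Recv([buf, MPI.BYTE], source=1, tag=0)
    elif rank == 1:
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
        comm.Send([buf, MPI.BYTE], dest=0, tag=0)
elapsed = time.perf_counter() - start

if rank == 0:
    # One-way latency is half the average round-trip time.
    print(f"{MSG_SIZE}-byte one-way latency: "
          f"{elapsed / ITERATIONS / 2 * 1e6:.2f} us")
```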
Comparing Cloud System performance
Why are packet loss and latency important?
Image: ESNet
o What are we measuring?
  o Can traditional HPC-level MPI applications run effectively within a container environment?
  o How do latency and throughput compare to our baseline HPC performance?
o Comparison of MPI RDMA performance in various environments
  o Native InfiniBand (full fat tree)
  o Ethernet and RoCE (full fat tree and switched)
o RDMA in a container
  o How does it compare to bare-metal performance? (A container launch sketch follows below.)
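One way to expose the host's InfiniBand/RDMA devices to a container is sketched below with the docker Python SDK. The image name, device paths and command are assumptions for illustration, not necessarily the exact invocation used for the results that follow.

```python
# Sketch: run a containerised MPI benchmark with the host's RDMA devices
# exposed so it can use InfiniBand/RoCE rather than the TCP stack.
# The image name, device paths and command are illustrative assumptions.
import docker

client = docker.from_env()

output = client.containers.run(
    image="example/osu-benchmarks:latest",     # hypothetical benchmark image
    command="mpirun -np 2 ./osu_latency",      # 2-rank run inside the container
    devices=[
        "/dev/infiniband/uverbs0:/dev/infiniband/uverbs0:rwm",  # verbs device
        "/dev/infiniband/rdma_cm:/dev/infiniband/rdma_cm:rwm",  # RDMA CM
    ],
    cap_add=["IPC_LOCK"],   # allow pinning memory for RDMA registration
    ulimits=[docker.types.Ulimit(name="memlock", soft=-1, hard=-1)],
    network_mode="host",    # share the host network namespace
    remove=True,
)
print(output.decode())
```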
Examining container performance
Cluster / Architecture / Interconnect / Location:
o Raijin: Intel Xeon E5-2670 @ 2.60GHz (Sandy Bridge); Mellanox FDR InfiniBand, full fat tree; NCI
o Tenjin: Intel Xeon E312xx @ 2.60GHz (Sandy Bridge); Mellanox FDR InfiniBand flashed to 56G Ethernet, full fat tree; NCI
o Tenjin (Container): Intel Xeon E312xx @ 2.60GHz (Sandy Bridge); Mellanox FDR InfiniBand flashed to 56G Ethernet, full fat tree; NCI
o InfiniCloud: Intel Xeon E5-2650 0 @ 2.00GHz; Mellanox FDR InfiniBand; NCI
o 10G-Cloud: AMD Opteron 63xx; 10G Ethernet
o Open MPI 1.10
o All applications compiled with GCC at -O3; the Intel compilers were not used, to ensure a fair comparison
o All clouds were based on OpenStack (Icehouse, Juno, Kilo)
o Preliminary results: 10 runs per case, with the maximum and minimum discarded and the remainder averaged (see the sketch below)
o Comprehensive results will be presented in a white paper
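The averaging rule above (drop the best and worst of 10 runs, average the rest) is simple to express; a sketch follows, with placeholder timing values rather than real measurements.

```python
# Sketch of the reporting rule used for the preliminary results:
# run each benchmark 10 times, discard the minimum and maximum,
# and report the mean of the remaining runs. Values are placeholders.
from statistics import mean

def trimmed_average(timings):
    """Average after discarding the single best and worst run."""
    if len(timings) < 3:
        raise ValueError("need at least 3 runs to trim min and max")
    trimmed = sorted(timings)[1:-1]
    return mean(trimmed)

# e.g. ten wall-clock times (seconds) for one benchmark/platform pair
runs = [12.4, 12.1, 12.3, 12.9, 12.2, 13.8, 12.3, 12.5, 12.2, 12.4]
print(f"reported result: {trimmed_average(runs):.2f} s over {len(runs)} runs")
```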
Preliminary Results (Platform)
Point to Point Latency
[Chart: OSU point-to-point bandwidth (MB/sec), higher is better, versus message size from 1 byte to 4MB; y-axis 0 to 7000 MB/sec. Series: BW-AWS-WEB, 10GbE-Cloud, Tenjin-TCP, Tenjin-Yalla, Tenjin-RoCE, Tenjin-Container, InfiniCloud-VM, InfiniCloud-HY, Raijin.]
Point to Point Bandwidth
Courtesy: Dr. Ching-Yeh (Leaf) Lin at NCI
Trinity is a bioinformatics de novo sequence-assembly package consisting of three programs: Inchworm (OpenMP, GCC), Chrysalis (OpenMP, GCC) and Butterfly (Java). The calculation was carried out using the procedure published by B. J. Haas et al., Nature Protocols 8, 1494–1512 (2013).
Bioinformatics Workload Speedups compared to 10G-XXX-Cloud
16 CPUs, one compute node (higher is better)
[Chart: speedup bars for Inchworm, Chrysalis and Butterfly; y-axis 0 to 2. Series: Raijin, Tenjin, 10G-XXX-Cloud.]
Bioinformatics workload – Single compute node
Speed-up of NPB Class 'C' with 32 and 64 Processes Normalized w.r.t. 32 Processes on 10G Ethernet Cloud (Higher is better)
[Chart: NPB kernels CG, EP, FT, IS, LU and MG; y-axis 0 to 10. Series: 10GbE-Cloud-32P, Tenjin-32P, Tenjin Container-32P, Raijin-32P, 10GbE-Cloud-64P, Tenjin-64P, Tenjin Container-64P, Raijin-64P.]
NAS Parallel Benchmarks
- ApoA1 benchmark, measured in seconds per time-step
- 16 CPUs per node
- Lack of NUMA awareness
- The TCP BTL on the cloud worked better than MXM
NAMD Speed-up
[Chart: speed-up versus number of CPUs (1 to 128); y-axis 0 to 50. Series: Tenjin, Tenjin-Containers, Raijin.]
Molecular Dynamics Code - NAMD
[Chart: compute time in seconds (log scale, 1 to 10,000; lower is better) versus number of CPUs (1 to 128). Series: RDO TCP, RDO TCP MXM, RDO OIB, RDO OIB MXM, RJ TCP, RJ TCP MXM, RJ OIB, RJ OIB MXM.]
Courtesy: Dr. Benjamin Menadue
Computational physics: a custom-written hybrid Monte Carlo code for generating gauge fields for Lattice QCD. For each iteration, calculating the Hamiltonian involves inverting a large complex matrix using CGNE. Written in Fortran, using pure MPI (no threading).
Scaling still an issue – NUMA
NCI's commitment to HPC in the Cloud
o NCI is engaged with many partners providing cloud-based HPC and HTC solutions to researchers. These are usually released as open source.
o Slurm-Cluster
  o Enables a researcher to quickly and easily build a cluster in the cloud, backed by the Slurm scheduler. It targets the Tenjin and NeCTAR clouds, but should work on any OpenStack deployment. https://github.com/NCI-Cloud/slurm-cluster
o Intel Grant for Cluster in the Cloud
  o Worked with Amazon via LinkDigital
  o Raijin in a Box is in preproduction and will be made available on the AWS Marketplace
  o How to build a supercomputer on AWS with spot instances: https://www.youtube.com/watch?v=KG3SKaf7yEw
NCI’s commitment to HPC in the Cloud
o Applying NCI's depth of expertise in HPC application tuning to deliver high-performance, secure computing environments in the Cloud for Australian researchers
o Bringing "Cloud to HPC"
  o Containers
  o Docker
o A "bring your own workflow" model
o We can support seamless high-performance research workloads with large data-access requirements across multiple platforms
o Parallel jobs can run on the Cloud, but is it HPC?
  o Not at the moment
  o Cloud is suited to high-throughput computing (HTC), ease of provisioning and specific workloads
  o Traditional HPC provides the best performance for larger parallel applications with MPI requirements
o A common underlying hardware architecture shared between our HPC and Cloud platforms provides application portability and flexibility in provisioning a system in either role
o QPI and NUMA can have a large impact on performance
  o Single-node performance is on par with bare metal (if the application is not memory bound)
  o Locality-aware scheduling (NUMA and network awareness)
  o Our benchmarks were limited by the QPI performance of Sandy Bridge
o NCI plans to deploy bare-metal provisioning using Ironic
Conclusion
@NCInews
Thank You