8/9/2019 HPC Open Petascale Computing
1/48
PATHWAYS TO OPEN PETASCALE COMPUTING
The Sun Constellation System: designed for performance
White Paper
November 2009
"Make everything as simple as possible, but not simpler."
Albert Einstein
Sun Microsystems, Inc.
Table of Contents
Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Pathways to Open Petascale Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
The Unstoppable Rise of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
The Importance of a Balanced and Rigorous Design Methodology . . . . . . . . . . . . . . 4
The Sun Constellation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Fast, Large, and Dense InfiniBand Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 7
The Fabric Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Sun Datacenter Switches for InfiniBand Fabrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Deploying Dense and Scalable Modular Compute Nodes . . . . . . . . . . . . . . . . . . . 15
Compute Node Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
The Sun Blade 6048 Modular System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Scaling to Multiple Sun Datacenter InfiniBand Switch 648. . . . . . . . . . . . . . . . . . . . 23
Scalable and Manageable Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Storage for Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Clustered Sun Fire X4540 Servers as Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
The Sun Lustre Storage System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
ZFS and Sun Storage 7000 Unified Storage Systems. . . . . . . . . . . . . . . . . . . . . . . . . 30
Long-Term Retention and Archive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Sun HPC Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Sun HPC Software, Linux Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Seamless and Scalable Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Simplified Cluster Provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Deploying Supercomputing Clusters Rapidly with Less Risk . . . . . . . . . . . . . . . . . 37
Sun Datacenter Express Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Sun Architected HPC Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A massive supercomputing cluster at the Texas Advanced Computing Center . . . . 38
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
For More Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Executive Summary
From weather prediction and global climate modeling to minute sub-atomic analysis
and other grand-challenge problems, modern supercomputers often provide the key
technology for unlocking some of the most critical challenges in science and
engineering. These essential scientific, economic, and environmental issues are
complex and daunting and many require answers that can only come from the
fastest available supercomputing technology. In the wake of the industry-wide
migration to terascale computing systems, an open and predictable path to petascale
supercomputing environments has become essential.
Unfortunately, the design, deployment, and management of very large terascale and
petascale clusters and grids has remained elusive and complex. While a few have
accomplished petascale deployments, they have been largely proprietary in nature, and have come at a high cost. In fact, it is often difficult to reach petascale not because of inherent limitations, but because of the practicalities of attempting to scale architectures to their full potential. Seemingly simple concerns, such as heat, power, cooling, cabling, and weight, are rapidly overloading the vast majority of even the most modern datacenters.
Sun understands that the key to building petascale supercomputers lies in a balanced
and systemic infrastructure design approach, along with careful application of the
latest technology advancements. Derived from Sun's experience and innovation with very large supercomputing deployments, the Sun Constellation System provides the world's first open petascale computing environment, one built entirely with open
and standard hardware and software technologies. Cluster architects can use the Sun
Constellation System to design and rapidly deploy tightly-integrated, efficient, and cost-
effective supercomputing clusters that scale predictably from a few teraflops to over a
petaflop. With a completely modular approach, processors, memory, interconnect
fabric, and storage can all be scaled independently depending on individual needs.
Best of all, the Sun Constellation System is an enterprise-class Sun-supported offering
comprised of general-purpose compute nodes, interconnects, and storage components
that can be deployed very rapidly. In fact, existing supercomputing clusters have
already been built using the system. For instance, the Texas Advanced Computing
Center (TACC) at the University of Texas at Austin partnered with Sun to deploy the Sun
Constellation System as their Ranger supercomputing cluster¹ with a peak
performance rating of over 500 teraflops. This document describes the key challenges
and constraints involved in the build-out of petascale supercomputing architectures,
including network fabrics, multicore modular compute systems, storage, open HPC
software, and general-purpose I/O.
1.
http://www.tacc.utexas.edu/resources/hpcsystems/#constellation
Chapter 1
Pathways to Open Petascale Computing
Most practitioners in today's high-performance computing (HPC) marketplace would
readily agree that the industry is well into the age of terascale systems.
Supercomputing systems capable of processing multiple teraflops are becoming
commonplace. These systems are readily being built using mostly commercial off-the-
shelf (COTS) components with the ability to address terabytes and petabytes of storage,
and more recently, terabytes of system memory (generally as distributed shared
memory and storage pools, or even as a single system image at the high end).
Only a few years ago, general-purpose terascale computing clusters constructed of
COTS components were hard to imagine. Though they were on several industry roadmaps, such systems were widely regarded as impractical due to limitations in the scalability of the interconnects and fabrics that tie disparate systems together. Through
competitive innovation and the race to be the fastest, the industry has been driven into
the realm of practical and commercially-viable terascale systems, and now to the edge of pondering what similar limitations, if any, lie ahead in the design of open
petascale systems.
The Unstoppable Rise of Clusters
In the last five years, technologies used to build the world's fastest supercomputers
have evolved rapidly. In fact, clusters of smaller interconnected rackmount and blade
systems now represent a majority of the supercomputers on the Top500 list of supercomputing sites¹, steadily replacing vector supercomputers and other large
systems that dominated previously. Figure 1 shows the relative shares of various
supercomputing architectures comprising the Top500 list from 1993 through 2009,
establishing clear acceptance of clusters as leading supercomputing technology.
1.
www.top500.org
Figure 1. In the last five years, clusters have increasingly dominated the Top500 list architecture share (image courtesy www.top500.org)
Not only have clusters provided access to supercomputing resources for increasingly
larger groups of researchers and scientists, but the largest supercomputers in the world
are now built using cluster architectures. This trend has been assisted by an explosion
in performance, bandwidth, and capacity for key technologies, including:
• Faster processors, multicore processors, and multisocket rackmount and blade systems
• Inexpensive memory and system support for larger memory capacity
• Faster standard interconnects such as InfiniBand
• Higher aggregated storage capacity from inexpensive commodity disk drives
Unfortunately, significant challenges remain that have stifled the growth of true open
petascale-class supercomputing clusters. Time-to-deployment constraints have resulted
from the complexity of deploying and managing large numbers of compute nodes,
switches, cables, and storage systems. The programmability of extremely large clusters
remains an issue. Environmental factors too are paramount since deployments must
often take place in existing datacenter space with strict constraints on physical
footprint, as well as power and cooling.
In addition to these challenges, most petascale computational users also have unique
requirements for clustered environments beyond those of less demanding HPC users,
including:
• Scalability at the socket and core level: Some have espoused large grids of relatively low-performance systems, but lower performance only increases the number of nodes that are required to solve very large computational problems.
• Density in all things: Density is not just a requirement for compute nodes, but for interconnect fabrics and storage solutions as well.
• A scalable programming and execution model: Programmers need to be able to apply their programmatic challenges to massively-scalable computational resources without special architecture-specific coding requirements.
• A lightweight grid model: Demanding applications need to be able to start thousands of jobs quickly, distributing workloads across the available computational resources through highly-efficient distributed resource management (DRM) systems.
• Open and standards-based solutions: Programmatic solutions must not require extensive porting efforts or be dedicated to particular proprietary architectures or environments, and datacenters must remain free to purchase the latest high-performance computational gear without being locked into proprietary or dead-end architectures.
The Importance of a Balanced and Rigorous Design Methodology
As anyone who has witnessed prior generations of supercomputing and HPC
architectures can attest, scaling gracefully is not simply a matter of accelerating
systems that already perform well. Bigger versions of existing technologies are not
always better. Regrettably, the pathways to teraflop systems are littered with the
products and technologies from dozens of companies that simply failed to adapt along
the way.
Many technologies have failed because the fundamental principles that worked in
small clusters simply could not scale effectively when re-cast in a run-time environment
thousands of times larger or faster than their initial implementations. For example, Ten Gigabit Ethernet, though a significant accomplishment, is known in the supercomputing realm to be fraught with sufficiently variable latency as to make it impractical for situations where low guaranteed latency and throughput dominate
performance. Ultimately, building petascale-capable systems is about being willing to
fundamentally rethink design, using the latest available components that are capable
of meeting or exceeding specified data rates and capacities.
Put simply, getting to petascale requires balance and massive scalability in all dimensions, including scalable tools and frameworks, processors, systems, interconnects, and storage, as well as the ability to accommodate changes that
allow software to scale accordingly.
Key challenges for petascale environments include:
• Keeping floating-point operations (FLOPs) to memory bandwidth ratios balanced to minimize the effects of memory latency (with each FLOP representing at least two loads and one store)
• Allowing for the practical scaling of the interconnect fabric to allow the connection of tens of thousands of nodes
• Exploiting the considerable investment, momentum, and cost savings of commodity multicore x64 processors, tools, and software
• Overcoming software challenges such as the forward portability of HPC codes to new architectures, scalability limitations, reliability, robustness, and being able to take advantage of multicore, multiprocessor system architectures
• Architecting to account for the opportunity to take advantage of external floating point, vector, and/or general-purpose processing on graphics processing unit (GPGPU) solutions within a cluster framework
• Designing the highest levels of density into compute nodes, interconnect fabrics, and storage solutions in order to facilitate large and compact clusters
• Building systems with efficient power and cooling to accommodate the broadest range of datacenter facilities and to help ensure the highest levels of reliability
• Architecting clusters such that compute-intensive applications have access to fast cluster scratch storage space for a balanced computational approach
These challenges serve as reminders that the value of genuine innovation in the marketplace must never be underestimated, even as design-cycle times shrink and the pressures of time-to-market grow with the demand for faster, cheaper, and standards-based solutions.
The Sun Constellation System
Since its inception, Sun has been focused on building balance and even elegance into
its system designs. The Sun Constellation System represents a tangible application of
this philosophy on a grand scale in the form of a systematic approach to building
terascale and petascale supercomputing clusters. Specifically, the Sun Constellation
System delivers an open architecture that is designed to allow organizations to build
clusters that scale seamlessly from a few racks to teraflops or petaflops of performance.
With an overall datacenter focus, Sun is free to innovate at all levels of the system, from switching fabric, to core system and storage elements, to HPC and file system
software. As a systems company, Sun looks beyond existing technologies toward
solutions that optimize the simultaneous equations of cost, space, practicality, and
complexity. In the form of the Sun Constellation System, this systemic focus combines a
massively-scalable InfiniBand interconnect with very dense computational and storage
solutions in a single architecture that functions as a cohesive system. Organizations
can now obtain all of these tightly-integrated building blocks from a single vendor, and
benefit from a unified management approach.
Components of the Sun Constellation System include:
• The Sun Datacenter InfiniBand Switch 648, offering up to 648 QDR/DDR ports in a single 11 rack unit (11U) chassis, and supporting clusters of up to 5,184 nodes with multiple switches
• The Sun Datacenter InfiniBand Switch 72, offering up to 72 QDR/DDR ports in a compact 1U form factor, and supporting clusters of up to 576 nodes with multiple switches
• The Sun Datacenter InfiniBand Switch 36, offering up to 36 ports in a 1U form factor
• The Sun Blade 6048 Modular System, providing an ultra-dense InfiniBand-connected blade platform with support for up to 48 multiprocessor, multicore Sun Blade 6000 server modules and up to 96 compute nodes in a rack-sized chassis
• Sun Fire X4540 storage clusters, serving as an economical InfiniBand-connected parallel file system building block, with support for up to 48 terabytes in only four rack units and up to 480 terabytes in a single rack
• The Sun Storage 7000 Unified Storage System, integrating enterprise flash technology through ZFS hybrid storage pools and DTrace Analytics to provide economical, scalable, and transparent storage
• The Sun Lustre Storage System, a simple-to-deploy storage environment based on the Lustre parallel file system, Sun Fire servers, and Sun Open Storage platforms
• Sun HPC Software, encompassing integrated developer tools, Sun Grid Engine infrastructure, advanced ZFS and Lustre file systems, provisioning, monitoring, patching, and simplified inventory management, available in both a Linux Edition and a Solaris Operating System (OS) Developer Edition
The Sun Constellation System provides an open systems supercomputer architecture designed for petascale computing as an integrated and Sun-supported product. This
holistic approach offers key advantages to those designing and constructing the largest
supercomputing clusters:
• Massive scalability in terms of optimized compute, storage, interconnect, and software technologies and services
• Simplified cluster deployment with open HPC software that can rapidly turn bare-metal systems into functioning clusters that are ready to run
• A dramatic reduction in complexity through integrated connectivity and management to reduce start-up, development, and operational costs
• Breakthrough economics from technical innovation that results in fewer, more reliable components and high-efficiency systems in a tightly-integrated solution
Along with key technologies and the experience of helping design and deploy some of the world's largest supercomputing clusters, these strengths make Sun an ideal partner
for delivering open high-end terascale and petascale architecture.
Chapter 2
Fast, Large, and Dense InfiniBand Infrastructure
Building the largest supercomputing grids presents significant challenges, with fabric technology paramount among them. Sun set out to design InfiniBand architecture for maximum flexibility and fabric scalability, and to drastically reduce the cost and complexity of delivering large-scale HPC solutions. Achieving these goals required a delicate balancing act, one that weighed the speed and number of nodes along with a sufficiently fast interconnect to provide minimal and predictable levels of latency.
The Fabric Challenge
For many applications, the interconnect fabric is already the element that limits
performance. One unavoidable driver is that faster processors require a faster
interconnect. Beyond merely employing a fast technology, the fabric must scale
effectively with both the speed and number of systems and processors. Interconnect
fabrics for large terascale and petascale deployments require:
• Low latency
• High bandwidth
• The ability to handle fabric congestion
• High reliability to avoid interruptions
• Open standards such as OpenFabrics and the OpenMPI software stack
InfiniBand technology has emerged as an attractive fabric for building large
supercomputing clusters. As an open standard, InfiniBand presents a compelling choice
over proprietary interconnect technologies that depend on the success and innovation
of a single vendor. InfiniBand also presents a number of significant technical
advantages:
• A switched fabric offers considerable scalability, supporting large numbers of simultaneous collision-free connections with virtually no increase in latency.
• Host channel adaptors (HCAs) with remote direct memory access (RDMA) support offload communications processing from the processor and operating system, leaving more processor resources available for computation.
• Fault isolation and troubleshooting are easier in switched environments since problems can be isolated to a single connection.
• Applications that rely on bandwidth or quality of service are also well served, since they each receive their own dedicated bandwidth.
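For reference, the DDR and QDR rates discussed throughout this paper translate into usable bandwidth as follows. This small sketch uses the standard InfiniBand lane signaling rates and the 80% efficiency of 8b/10b encoding:

```python
# Usable data rate of an InfiniBand link. SDR/DDR/QDR lanes signal at
# 2.5/5/10 Gb/s respectively, and 8b/10b encoding leaves 80% for data.

LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

def link_data_rate_gbps(rate, lanes=4):
    """Usable Gb/s for an InfiniBand link after 8b/10b encoding overhead."""
    return LANE_GBPS[rate] * lanes * 0.8

for r in ("SDR", "DDR", "QDR"):
    print(r, link_data_rate_gbps(r), "Gb/s")   # 8.0, 16.0, 32.0 Gb/s for 4x
```

A 12x connector, as used for trunking later in this chapter, simply carries three such 4x links.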
Even with these advantages, building the largest InfiniBand clusters and grids has remained complex and expensive, primarily because of the need to interconnect very large numbers of computational nodes. Traditional large clusters require literally thousands of cables and connections and hundreds of individual core and leaf switches, adding considerable expense, weight, cable management complexity, and
consumption of valuable datacenter rack space. It is clear that density, consolidation,
and management efficiencies are important not just for computational platforms, but
for InfiniBand interconnect infrastructure as well.
Even with very significant accomplishments in terms of processor performance and
computational density, large clusters are ultimately constrained by real estate and
the complexities and limitations of interconnect technologies. Cable length limitations
constrain how many systems can be connected together in a given physical space while
avoiding increased latency. Interconnect topologies play a vital role in determining the
properties that clustered systems exhibit. Mesh, torus (or toroidal), and Clos topologies
are popular choices for interconnected supercomputing clusters and grids.
Mesh and 3D Torus Topologies
In mesh and 3D torus topologies, each node connects to its neighbors in the x, y, and z dimensions, with six connecting ports per node. Some of the most notable
supercomputers based upon torus topologies include IBM's BlueGene and Cray's XT3/XT4 supercomputers. Torus fabrics have had the advantage that they have generally
been easier to build than Clos topologies. Unfortunately, torus topologies represent a
blocking fabric, where interconnect bandwidth can vary between nodes. Torus fabrics
also provide variable latency due to variable hop count, and application deployment for
torus fabrics must carefully consider node locality as a result. For some specific
applications that express a nearest-neighbor type of communication pattern, torus
topologies are a good fit. Computational fluid dynamics (CFD) is one such application.
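The variable hop count of a torus is easy to quantify. The following sketch (illustrative only, not a description of any vendor's routing) computes the shortest-path hop distance between two nodes in a 3D torus, where each dimension wraps around:

```python
# Shortest-path hop count between two nodes in an X x Y x Z 3D torus.
# Each coordinate wraps, so the distance per dimension is the smaller of
# the direct path and the wrap-around path.

def torus_hops(a, b, dims):
    """Manhattan distance between coordinates a and b on a torus."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

dims = (8, 8, 8)                                 # a 512-node torus
print(torus_hops((0, 0, 0), (1, 0, 0), dims))    # nearest neighbor: 1 hop
print(torus_hops((0, 0, 0), (4, 4, 4), dims))    # worst case: 12 hops
```

Nearest neighbors are one hop apart while distant nodes may be many hops apart, which is why application deployment on torus fabrics must consider node locality.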
Clos Fat Tree Topologies
First described by Charles Clos in 1953, Clos networks have long formed the basis for
practical multistage telephone switching systems. Clos networks utilize a fat tree
topology, allowing complex switching networks to be built using many fewer
crosspoints than if the entire system were implemented as a single large crossbar
switch. Clos switches are typically comprised of multiple tiers and stages (hops), with
each tier built from a number of crossbar switches. Connectivity exists only between
switch chips on adjacent tiers.
Clos fabrics have the advantage of being non-blocking, in that each attached node has
a constant bandwidth. In addition, an equal number of stages between nodes provides
for uniform latency. Historically, the disadvantage of large Clos networks was that they
required more resources to build.
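The crosspoint savings Clos described can be illustrated with switch-chip arithmetic. The sketch below counts ports, chips, and crosspoints for a nonblocking two-tier (three-stage) fat tree built from k-port crossbar elements; with k = 36, the numbers match a 648-port fabric built from 54 chips:

```python
# Ports, chips, and crosspoints for a nonblocking two-tier (three-stage)
# fat tree built from k-port crossbar switch chips. Each leaf chip uses
# k/2 ports down (to nodes) and k/2 ports up (to spine chips).

def fat_tree(k):
    leaves = k                   # one leaf per spine-chip port
    spines = k // 2              # one spine per leaf uplink
    ports = leaves * (k // 2)    # end ports = leaves x down-links
    chips = leaves + spines
    return ports, chips

ports, chips = fat_tree(36)      # 36-port elements, e.g. InfiniScale IV
print(ports, chips)              # 648 ports from 54 chips

# Clos's point: far fewer crosspoints than one giant crossbar would need.
print(ports ** 2)                # 419,904 crosspoints as a single crossbar
print(chips * 36 ** 2)           # 69,984 crosspoints across the fat tree
```

Every path crosses leaf, spine, and leaf, so the hop count (and therefore latency) is uniform, at the cost of the extra chips the text mentions.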
Constructing Large Switched Supercomputing Clusters
Constructing very large InfiniBand Clos switches in particular is governed by a number
of practical constraints, including the number of ports available in individual switch
elements, maximum achievable printed circuit board size, and maximum connector
density. Sun has employed considerable innovation in all of these areas, and provides both dual data rate (DDR) and quad data rate (QDR) scalable InfiniBand fabrics. For
both dual data rate (DDR) and quad data rate (QDR) scalable InfiniBand fabrics. For
example, as a part of the Sun Constellation System, Sun InfiniBand infrastructure can
provide both QDR Clos clusters that can scale up to 5,184 nodes as well as 3D Torus configurations.
Sun Datacenter Switches for InfiniBand Fabrics
Recognizing the considerable promise of InfiniBand interconnects, Sun has made
InfiniBand connectivity a core competency, and has set out to design scalable and
dense switches that avoid many of the conventional limitations. Not content to accept
the status quo in terms of available InfiniBand switching, cabling, and host adapters,
Sun engineers used their considerable networking and datacenter experience to view
InfiniBand technology from a systemic perspective.
Key Technical Innovations for Sun Datacenter InfiniBand Switches
Sun Datacenter InfiniBand Switches 36, 72, and 648 are components of a complete system that is based on multiple technical innovations, including:
• The Sun Datacenter InfiniBand Switch 648 chassis implements a three-stage Clos fabric with up to 54 36-port Mellanox InfiniScale IV switching elements, integrated into a single 11U rackmount enclosure.
• Industry-standard 12x CXP connectors on the Sun Datacenter InfiniBand Switch 72 and 648 consolidate three discrete InfiniBand 4x connectors, resulting in the ability to host 72 4x ports through 24 physical 12x connectors.
• Complementing the 12x CXP connector, a 12x trunking cable carries signals from three servers to a single switch connector, offering a 3:1 cable reduction when used for server trunking, and reducing the number of cables needed to support 648 servers to 216. A splitter cable that converts one 12x connection to three 4x connections is provided for connectivity to systems and storage that require 4x QSFP connectors.
• A custom-designed double-height Network Express Module (NEM) for the Sun Blade 6048 Modular System provides seamless connectivity to both the Sun Datacenter InfiniBand Switch 648 and 72. Using the same 12x CXP connectors, the Sun Blade 6048 InfiniBand QDR Switched NEM can trunk up to 12 Sun Blade 6000 server modules (up to 24 compute nodes) in a single Sun Blade 6048 Modular System shelf. The NEM together with the 12x CXP cable facilitates connectivity of up to 5,184 servers in a 5-stage Clos topology.
• Fabric topology for forwarding InfiniBand traffic is established by a redundant host-based Subnet Manager. A host-based solution allows the Subnet Manager to take advantage of the full resources of a general-purpose multicore server.
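The cable consolidation figures above follow directly from the 3:1 trunking ratio. A minimal sketch of the arithmetic:

```python
# Cable arithmetic for 12x CXP trunking: each 12x cable carries three 4x
# InfiniBand links, so server cabling shrinks by a factor of three.

SERVERS_PER_12X_CABLE = 3

def trunk_cables(servers):
    """12x cables needed to attach the given number of 4x server links."""
    return -(-servers // SERVERS_PER_12X_CABLE)   # ceiling division

print(trunk_cables(648))     # 216 cables for a fully populated Switch 648
print(trunk_cables(5184))    # 1,728 cables across eight such switches
```

A conventional build would instead need one 4x cable per server, plus hundreds more to tie leaf and core switches together.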
Massive Switch and Cable Consolidation
Given the scale involved with building supercomputing clusters and grids, cost and
complexity figure importantly. Regrettably, traditional approaches to using InfiniBand
for massive connectivity have required very large numbers of conventional switches
and cables. In these configurations, many cables and ports are consumed redundantly
connecting core and leaf switches together, making advertised per-port switch costs
relatively meaningless, and reducing reliability through extra cabling.
In contrast, the very dense InfiniBand fabric provided by Sun Datacenter InfiniBand switches is able to potentially eliminate hundreds of switches and thousands of cables, dramatically lowering acquisition costs. In addition, replacing physical switches and cabling with switch chips and traces on printed circuit boards drastically improves reliability. Standard 12x InfiniBand cables and connectors coupled with a specialized Sun Blade 6048 Network Express Module can eliminate thousands of additional cables, providing additional cost, complexity, and reliability improvements. Overall, these
switches provide radical simplification of InfiniBand infrastructure. Sun Datacenter
Switches are available to support both DDR and QDR data rates, with fabric capacities
enumerated in Table 1.
Table 1. Sun Datacenter InfiniBand Switch capacities

InfiniBand Switch                     | Data Rate (Connector)          | Max. Nodes per Switch | Maximum Clos Fabric
Sun Datacenter InfiniBand Switch 648  | QDR or DDR (up to 216 12x CXP) | 648                   | 5,184 (a)
Sun Datacenter InfiniBand Switch 72   | QDR or DDR (24 12x CXP)        | 72                    | 576 (a)
Sun Datacenter InfiniBand Switch 36   | QDR or DDR (36 4x QSFP)        | 36                    |

a. Eight switches are required. The Sun Datacenter InfiniBand Switch 648 is capable of supporting clusters beyond 5,184 servers. The maximum number of nodes is currently determined by the number of uplink ports (eight) provided by the Sun Blade 6048 InfiniBand QDR Switched NEM.

The Sun Datacenter InfiniBand Switch 648
The Sun Datacenter InfiniBand Switch 648 is designed to drastically reduce the cost and complexity of delivering large-scale HPC solutions, such as those scaled for leadership in the Top500 list of supercomputing sites, as well as smaller and moderately-sized enterprise and HPC applications in scientific, technical, and financial markets. Each Sun Datacenter InfiniBand Switch 648 provides up to 648 QDR InfiniBand ports in only 11 rack units (11U). Up to eight Sun Datacenter InfiniBand Switch 648 can be combined to
support up to 5,184 nodes in a single cluster. As shown in Figure 2, the Sun Datacenter
InfiniBand Switch 648 also provides extensive cable support and management for clean
and efficient installations.
Figure 2. The Sun Datacenter InfiniBand Switch 648 offers up to 648 QDR/DDR/SDR 4x InfiniBand connections in an 11U rackmount chassis (shown with cable management arms deployed).
The Sun Datacenter InfiniBand Switch 648 is ideal for deploying fast, dense, and
compact Clos fabrics when used as a part of the Sun Constellation System. Based on
the Mellanox InfiniScale IV 36-port InfiniBand switch device, each switch chassis
connects up to 648 nodes using 12x CXP connectors. The switch represents a full three-
stage Clos fabric, and up to eight Sun Datacenter InfiniBand Switch 648 can be used to
combine up to 54 Sun Blade 6048 chassis in a maximal 5,184-node fabric. Up to three Sun Datacenter InfiniBand Switch 648 (and up to 1,944 QDR ports) can be deployed in a
single standard rack (Figure 3).
The Sun Datacenter InfiniBand Switch 648 is tightly integrated with the Sun Blade 6048
InfiniBand QDR Switched Network Express Module (NEM). 12x cables and CXP
connectors provide a 3:1 cable consolidation ratio. Each dual-height NEM connects up
to 24 compute nodes in a single Sun Blade 6048 shelf to a QDR InfiniBand fabric. Sun's approach to InfiniBand networking is highly flexible in that both Clos and mesh/torus
interconnects can be built using the same components. The Sun Blade 6048 InfiniBand
Switched NEM can be used by itself to build mesh and torus fabrics, or in combination
with the Sun Datacenter InfiniBand Switch 648 to build Clos InfiniBand fabrics.
The Sun Datacenter InfiniBand Switch 648 employs a passive midplane. Up to nine
fabric cards install vertically and connect to the midplane from the rear of the chassis.
Up to nine line cards install horizontally from the front of the chassis. A three-dimensional
perspective of the fabric provided by the switch is shown in Figure 4, with an example
route overlaid. With this dense switch configuration, InfiniBand packets traverse only
three hops from ingress to egress of the switch, keeping latency very low. The Sun
Blade 6048 InfiniBand QDR Switched NEM adds only two hops for a total of five. All
InfiniBand routing is managed using a redundant host-based subnet manager.

Figure 3. Up to three Sun Datacenter InfiniBand Switch 648 switches in a single
19-inch rack deliver 1,944 QDR ports.
Figure 4. A path through a Sun Datacenter InfiniBand Switch 648 core switch
connects two nodes across horizontal line cards, a vertical fabric card, and the
passive orthogonal midplane.
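The scaling figures quoted above follow from simple Clos arithmetic. A minimal sketch (the function name is illustrative, not from any Sun tool): a non-blocking three-stage folded Clos built from 36-port crossbar chips yields 648 end ports per chassis, and eight chassis federate into the 5,184-node maximum.

```python
def clos_end_ports(radix):
    """Maximum end ports of a non-blocking three-stage folded Clos
    built from crossbar chips of the given radix: each leaf chip
    splits its ports evenly between hosts and spine uplinks."""
    return radix * radix // 2

# Mellanox InfiniScale IV devices are 36-port crossbars.
assert clos_end_ports(36) == 648        # one Switch 648 chassis

# Eight chassis federate into the maximal Sun Constellation fabric.
assert 8 * clos_end_ports(36) == 5184

# Hop counts: leaf -> spine -> leaf inside the switch (three hops),
# plus one switch chip in the NEM on each end of the path.
total_hops = 3 + 2
assert total_hops == 5
```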
The Sun Datacenter InfiniBand Switch 72
The Sun Datacenter InfiniBand Switch 72 leverages many of the innovations found in
the Sun Datacenter InfiniBand Switch 648, while offering support for smaller and mid-
sized configurations. Like the larger 648-port switch, the Sun Datacenter InfiniBand
Switch 72 offers QDR and DDR connectivity, extreme density, and unrivaled cable
aggregation for Sun Blade and Sun Fire servers as well as Sun storage solutions.
Depicted in Figure 5, the Sun Datacenter InfiniBand Switch 72 occupies only one rack
unit, offering an ultraslim and ultradense complete switch fabric solution for clusters of
up to 72 nodes.
Figure 5. The Sun Datacenter InfiniBand Switch 72 offers 72 4x QDR InfiniBand
ports in a 1U form factor
When used in conjunction with the Sun Blade 6048 Modular System, up to eight Sun
Datacenter InfiniBand Switch 72 units can be combined to support clusters of up to 576
nodes. While similar solutions from competitors occupy over 17 rack units, eight 1U Sun
Datacenter InfiniBand Switch 72 units save considerable space and require roughly one
third the number of cables. In addition to simplification, this end-to-end
supercomputing solution offers extremely low latency using industry-standard
transport and commodity processors, including AMD Opteron, Intel Xeon, and Sun SPARC.
The Sun Datacenter InfiniBand Switch 72 provides the following specifications:
72 QDR/DDR/SDR 4x InfiniBand ports (expressed through 24 12x CXP connectors)
Data throughput of 4.6 Tb/sec.
Port-to-port latency of 300ns (QDR)
Eight data virtual lanes
One management virtual lane
4096 byte MTU
Sun Datacenter InfiniBand Switch 36
Leveraging the properties of the InfiniBand architecture, the Sun Datacenter InfiniBand
Switch 36 helps organizations deploy smaller high-performance fabrics in demanding
high-availability (HA) environments. The switch supports the creation of logically
isolated sub-clusters, as well as advanced features for traffic isolation and Quality of
Service (QoS) management, preventing faults from causing costly service disruptions.
The embedded InfiniBand fabric management module supports active/hot-standby
dual-manager configurations, helping to ensure a seamless migration of the fabric
management service in the event of a management module failure. The Sun
Datacenter InfiniBand Switch 36 is provisioned with redundant power and cooling for
high availability in demanding datacenter environments. The Sun Datacenter
InfiniBand Switch 36 is shown in Figure 6.
Figure 6. The Sun Datacenter InfiniBand Switch 36 offers 36 QDR InfiniBand ports
in a 1U form factor
The Sun Datacenter InfiniBand Switch 36 provides the following specifications:
36 QDR/DDR/SDR 4x InfiniBand ports (expressed through 36 4x QSFP connectors)
Data throughput of 2.3 Tb/sec.
Port-to-port latency of 100ns (QDR)
Eight data virtual lanes
One management virtual lane
4096 byte MTU
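The quoted throughput figures for both switches follow from QDR line rates: a 4x QDR link signals at 40 Gb/sec but, with 8b/10b encoding, carries 32 Gb/sec of data per direction. Summing both directions over all ports reproduces the specifications above (a back-of-envelope check, not a vendor formula):

```python
# Data rate of one 4x QDR link: 40 Gb/s signaling, 8b/10b encoding
# leaves 32 Gb/s of payload per direction.
QDR_4X_DATA_GBPS = 40 * 8 / 10

def aggregate_tbps(ports):
    # Bidirectional aggregate data throughput across all ports, in Tb/sec.
    return ports * QDR_4X_DATA_GBPS * 2 / 1000

assert round(aggregate_tbps(36), 1) == 2.3   # Switch 36 specification
assert round(aggregate_tbps(72), 1) == 4.6   # Switch 72 specification
```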
Chapter 3
Deploying Dense and Scalable Modular Compute
Nodes
Implementing terascale and petascale supercomputing clusters depends heavily on
having access to large numbers of high-performance systems with large memory
support and high memory bandwidth. As a part of the Sun Constellation System, Sun's
approach is to combine the considerable and constant performance gains in the
standard processor marketplace with the advantages of modular architecture. This
approach results in some of the fastest and most dense systems possible, all tightly
integrated with Sun Datacenter InfiniBand switches.
Compute Node Requirements
While some supercomputing architectures employ very large numbers of slower
proprietary nodes, this approach does not translate easily to petascale. The
programmatic implications alone of handling literally millions of nodes are not
particularly appealing, much less the physical realities of managing and housing
such systems. Instead, building large and open terascale and petascale systems
depends on key capabilities for compute nodes, including:
High Performance
Compute nodes must provide very high peak levels of floating-point performance.
Likewise, because floating-point performance is dependent on multiple memory
operations, equally high levels of memory bandwidth must be provided. I/O
bandwidth is also crucial, yielding high-speed access to storage and other
compute nodes.
Density, Power, and Cooling
The physical requirements of today's ever more expensive datacenter real estate
dictate that any viable solutions take the best advantage of datacenter floor space
while staying within environmental realities. Solutions must be as energy efficient
as possible, and must provide effective cooling that fits well with the latest
energy-efficient datacenter practices.
Superior Reliability and Serviceability
Due to their large numbers, computational systems must be as reliable and
serviceable as possible. Not only must systems provide redundant hot-swap
processing, I/O, power, and cooling modules, but serviceability must be a key
component of their design and management. Interconnect schemes must allow
systems to be cabled once and reconfigured at will as required.
Blade technology has offered considerable promise in these areas for some time, but
has often been constrained by legacy blade platforms that locked adopters into
expensive proprietary infrastructure. Power and cooling limitations often meant that
processors were limited to less powerful versions. Limited processing power, memory
capacity, and I/O bandwidth often severely restricted the applications that could be
deployed. Proprietary tie-ins and other constraints in chassis design dictated
networking and interconnect topologies, and I/O expansion options were limited to a
small number of expensive and proprietary modules.
The Sun Blade 6048 Modular System
To address the shortcomings of earlier blade computing platforms, Sun started with a
design point focused on the needs of the datacenter and highly-scalable deployments,
rather than with preconceptions of chassis design. With this innovative and truly
modular approach, the Sun Blade 6048 Modular System offers an ultra-dense high-
performance solution for large HPC clusters. Organizations gain the promised benefits
of blades, and can deploy thousands of nodes within the cabling, power, and cooling
constraints of existing datacenters. Fully compatible with the Sun Blade 6000 Modular
System, the Sun Blade 6048 Modular System provides distinct advantages over other
approaches to modular architecture.
Innovative Chassis Design for Industry-Leading Density and Environmentals
The Sun Blade 6048 Modular System features a standard rack-size chassis that
facilitates the deployment of high-density computational environments. By
eliminating all of the hardware typically used to rackmount individual blade
chassis, the Sun Blade 6048 Modular System provides 20% more usable space in
the same physical footprint. Up to 48 Sun Blade 6000 server modules can be
deployed in a single Sun Blade 6048 Modular System for up to 96 compute nodes
per rack. Innovative chassis features are carried forward from the Sun Blade 6000
Modular System.
A Choice of Processors and Operating Systems
Each Sun Blade 6048 Modular System chassis supports up to 48 full-performance,
full-featured Sun Blade 6000 series server modules. Server modules based on
x86/x64 architectures and ideal for HPC and supercomputing environments
include:
The Sun Blade X6440 server module, with four sockets for Six-Core AMD Opteron
8000 Series processors, and support for up to 256 GB of memory
The Sun Blade X6270 server module, with two sockets for Intel Xeon Processor
5500 Series (Nehalem) CPUs and 144 GB of memory per server module
The Sun Blade X6275 server module, with two nodes, each with two sockets for
Intel Xeon Processor 5500 Series CPUs, 96 GB of memory per node, and an on-
board QDR Mellanox InfiniBand host channel adapter (HCA)
Each server module provides significant I/O capacity as well, with up to 32 lanes of
PCI Express 2.0 bandwidth delivered from each server module to the multiple
available I/O expansion modules (a total of up to 207 Gb/sec supported per server
module). To enhance availability, server modules don't have separate power
supplies or fans. Some server modules feature up to four hot-swap hard disk drives
(HDDs) or solid state drives (SSDs) with hardware RAID options, while others
provide on-board flash technologies for fast and reliable I/O. Organizations can
deploy server modules based on the processors and operating systems that best
serve their applications or environment. Different server modules can be mixed
and matched in a single chassis, and deployed and redeployed as needs dictate.
The Solaris Operating System (OS), Linux, and Microsoft Windows are all
supported.
Complete Separation Between CPU and I/O Modules
Sun Blade 6048 Modular System design avoids compromises because it provides a
complete separation between CPU and I/O modules. Two types of I/O modules are
supported.
Up to two industry-standard PCI Express ExpressModule (EM) slots are dedicated
to each server module.
Up to two Network Express Modules (NEMs) provide bulk I/O for all of the server
modules installed in each shelf (four shelves per chassis).
Through this flexible approach, server modules can be configured with different
I/O options depending on the applications they host. All I/O modules are hot-plug
capable, and customers can choose from Sun-branded or third-party adapters for
networking, storage, clustering, and other I/O functions.
Sun Blade Transparent Management
Many blade vendors provide management solutions that lock organizations into
proprietary management tools. With the Sun Blade 6048 Modular System,
customers have the choice of using their existing management tools or Sun Blade
Transparent Management. Sun Blade Transparent Management is a standards-
based cross-platform tool that provides direct management over individual server
modules and direct management of chassis-level modules using Sun Integrated
Lights Out Manager (ILOM).
Within the Sun Blade 6048 Modular System, a chassis monitoring module (CMM)
works in conjunction with the service processor on each server module to form a
complete and transparent management solution. Individual server modules
provide support for IPMI, SNMP, CLI (through serial console or SSH), and HTTP(S)
management methods. In addition, Sun Ops Center provides discovery,
aggregated management, and bulk deployment for multiple systems.
System Overview
The Sun Blade 6048 chassis provides space for up to 12 server modules in each of its
four shelves for up to 48 Sun Blade 6000 server modules in a single chassis. This
design approach provides considerable density. Front and rear perspectives of the Sun
Blade 6048 Modular System are provided in Figure 7.
Figure 7. Front and rear perspectives of the Sun Blade 6048 Modular System
With four self-contained shelves per chassis, the Sun Blade 6048 Modular System
houses a wide range of components.
Up to 48 Sun Blade 6000 server modules insert from the front of the chassis, with 12
modules supported by each shelf.
A total of eight hot-swap power supply modules insert from the front of the chassis,
with two 8,400-watt 12-volt power supplies (N+N) provided for each shelf. Each
power supply module contains a dedicated fan module.
Up to 96 hot-plug PCI Express ExpressModules (EMs) insert from the rear of the
chassis (24 per shelf), supporting industry-standard PCI Express interfaces with two
EM slots available for use by each server module.
Up to four dual-height Sun Blade 6048 InfiniBand NEMs can be installed in a single
chassis (one per shelf). Alternately, up to eight single-height Network Express
Modules (NEMs) can be inserted from the rear, with two NEM slots serving each shelf
of the chassis.

A chassis monitoring module (CMM) and power interface module are provided for
each shelf. The CMM provides for transparent management access to individual
server modules while the Power Interface Module provides six plugs for the power
supply modules in each shelf.
Redundant (N+1) fan modules are provided at the rear of the chassis for efficient
front-to-back cooling.
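The power figures above imply a rough per-module budget. With N+N redundancy, one 8,400 W supply must be able to carry a shelf alone, so dividing a single supply's rating across the twelve server modules in a shelf gives a ballpark figure (illustrative arithmetic only; a real budget also covers NEMs, EMs, fans, and conversion losses):

```python
# Rough per-module power budget for one Sun Blade 6048 shelf.
# N+N redundancy means usable shelf power equals one supply's rating.
shelf_supply_watts = 8_400
modules_per_shelf = 12

per_module_budget = shelf_supply_watts / modules_per_shelf
assert per_module_budget == 700.0   # watts per server module, roughly
```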
Standard I/O Through a Passive Midplane
In essence, the passive midplane in the Sun Blade 6048 Modular System is a collection
of wires and connectors between different modules in the chassis. Since there are no
active components, the reliability of this printed circuit board is extremely high, in
the millions of hours. The passive midplane provides electrical connectivity between
the server modules and the I/O modules.
All front and rear modules connect directly to the passive midplane, with the exception
of the power supplies and the fan modules. The power supplies connect to the
midplane through a bus bar and to the AC inputs via a cable harness. The redundant
fan modules plug individually into a set of three fan boards, where fan speed control and
other chassis-level functions are implemented. The front fan modules that cool the PCI
Express ExpressModules each connect to the chassis via self-aligning, blind-mate
connections. The main functions of the midplane include:
Providing a mechanical connection point for all of the server modules
Providing 12 VDC from the power supplies to each customer-replaceable module
Providing 3.3 VDC power used to power the System Management Bus devices on each
module, and to power the CMM
Providing a PCI Express interconnect between the PCI Express root complexes on each
server module to the EMs and NEMs installed in the chassis
Connecting the server modules, CMMs, and NEMs to the management network
Each server module is energized through the midplane from the redundant chassis
power grid. The midplane also provides connectivity to the I2C network in the chassis,
allowing each server module to directly monitor the chassis environment, including fan
and power supply status as well as various temperature sensors. A number of I/O links
are also routed through the midplane for each server module. Connection details differ
depending on the selected server module and associated NEMs. As an example,
Figure 8 illustrates the dual-node Sun Blade X6275 server module configured with the
Sun Blade 6048 InfiniBand QDR Switched NEM with connections that include:
An x8 PCI Express 2.0 link connecting from each compute node to a dedicated EM
Two Gigabit Ethernet links to the NEM, one from each compute node
Two 4x QDR InfiniBand connections to the NEM, one from each compute node
An Ethernet connection from the server module to the CMM for management
Figure 8. Distribution of communications links from a typical Sun Blade 6000
server module
Tight Integration with Sun Datacenter InfiniBand Switches
Providing dense connectivity to servers while minimizing cables is one of the issues
facing large HPC cluster deployments. The Sun Blade 6048 QDR Switched InfiniBand
NEM solves this challenge and improves both density and reliability by integrating
connections and switch components into a dual-height NEM form factor for the Sun
Blade 6048 chassis. As a part of the Sun Constellation System, the NEM uses common
components, cables, connectors, and architecture with the Sun Datacenter InfiniBand
Switch 648 and 72.
density are key, each Sun Blade X6275 server module features two compute nodes, with
each node supporting two sockets for Intel Xeon Processor 5500 Series CPUs and up to
96 GB of memory (Figure 10).
Figure 10. The Sun Blade X6275 server module provides two compute nodes on a
single server module.
Figure 11 shows a block-level representation of how the Sun Blade X6275 server module
connects to the Sun Blade 6048 InfiniBand QDR Switched NEM. In this configuration,
twelve ports from each switch chip (24 total) are used to communicate with the two
compute nodes on each Sun Blade X6275 server module, with nine ports used to
connect the two switches together. The 30 remaining ports (15 per switch chip) are
used as uplinks to either other QDR switched NEMs or external InfiniBand switches.
Sun Blade 6048 InfiniBand QDR Switched NEMs can be connected together directly to
provide mesh or 3D torus fabrics. Alternately, one or more Sun Datacenter InfiniBand
Switch 648 or 72 can be connected to provide Clos fabric implementations. The external
ports use industry-standard CXP connectors that aggregate three 4x ports into a single
12x connector.
Figure 11. The Sun Blade 6048 InfiniBand QDR Switched NEM connects directly to
Mellanox HCAs on both nodes of the Sun Blade X6275 server module
In the default configuration, and for clusters that utilize up to four Sun Datacenter
InfiniBand Switch 648s, the switch provides a non-blocking fabric. To maintain a non-
blocking fabric in configurations of larger than four switches, an external 12x cable can
link two of the external CXP connectors (one connected to each internal switch chip) to
interconnect the two switches with an additional three 4x connections. This
configuration fully meshes the InfiniScale IV chips on the Sun Blade 6048 InfiniBand
QDR Switched NEM, with a total of 12 ports communicating between the two Mellanox
InfiniScale IV InfiniBand switches, while still leaving 24 4x ports (eight 12x CXP
connectors) available as switch uplinks.
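The port accounting described above can be tallied per 36-port chip (two chips per NEM). This sketch simply restates the budgets from the text as arithmetic:

```python
# Port budget of one 36-port InfiniScale IV chip on the Sun Blade 6048
# InfiniBand QDR Switched NEM (the NEM carries two such chips).
RADIX = 36
node_ports = 12        # one 4x QDR link per compute node served by the chip
internal_links = 9     # default chip-to-chip links on the NEM itself

# Default configuration: remaining ports serve as uplinks.
uplinks = RADIX - node_ports - internal_links
assert uplinks == 15                # 15 per chip, 30 total across both chips
assert 2 * uplinks == 30

# Fully meshed configuration: one external 12x cable adds three more
# 4x chip-to-chip links per chip, for 12 inter-switch ports in total.
meshed_uplinks = RADIX - node_ports - (internal_links + 3)
assert meshed_uplinks == 12
assert 2 * meshed_uplinks == 24     # 24 4x ports = eight 12x CXP connectors
```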
Scaling to Multiple Sun Datacenter InfiniBand Switch 648
Designers need the ability to scale supercomputing deployments without being
constrained by arbitrary limitations. The Sun Datacenter InfiniBand Switch 648 lets
organizations scale from mid-sized InfiniBand deployments that may only populate a
portion of a single Sun Datacenter InfiniBand Switch 648 chassis, to very large
deployments built from multiple Sun Datacenter InfiniBand Switch 648 units. As with single-
switch configurations, a multiswitch system still functions and is managed as a single
entity, greatly reducing management complexity.
A single Sun Datacenter InfiniBand Switch 648 can be deployed for configurations
that require up to 648 compute nodes.

Up to eight Sun Datacenter InfiniBand Switch 648 units can be configured to serve
up to 5,184 compute nodes.
Certain requirements exist for maintaining a non-blocking InfiniBand fabric. Table 2
lists the supported numbers of Sun Blade 6048 chassis, Sun Datacenter InfiniBand
Switch 648 units, line cards, Sun Blade 6048 InfiniBand QDR Switched NEMs, and 12x
cables needed to support various numbers of compute nodes using Sun Blade X6275
server modules. All listed configurations are non-blocking.

Table 2. Maximum numbers of Sun Blade X6275 server modules and Sun Blade 6048 Modular Systems
supported by various numbers of Sun Datacenter InfiniBand Switch 648.
Sun Blade    Sun Datacenter    Line Cards    Sun Blade 6048    12x Cables    Total Compute
6048         InfiniBand        Required      InfiniBand QDR    Required      Nodes
Chassis      Switch 648                      Switched NEMs                   Supported
    1              1                2              4                32              96
    2              1                3              8                64             192
    3              1                4             12                96             288
    4              1                6             16               128             384
    5              1                7             20               160             480
    6              1                8             24               192             576
    8              2               11             32               256             768
   10              2               14             40               320             960
   12              2               16             48               384           1,152
   24              4               32             96               768           2,304
   48              8               64            192             1,536           4,608
   54              8               72            216             1,728           5,184
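The rows of Table 2 follow from a few fixed ratios stated elsewhere in this paper: each chassis holds four NEMs (one per shelf), each NEM uses eight 12x uplink cables, each chassis contributes 96 compute nodes (48 dual-node X6275 modules), and each switch terminates at most 648 4x ports. The 24-cables-per-line-card figure is inferred from the table itself. A sketch that reproduces the table (helper names are illustrative):

```python
from math import ceil

NODES_PER_CHASSIS = 96      # 48 dual-node Sun Blade X6275 server modules
NEMS_PER_CHASSIS = 4        # one dual-height NEM per shelf
CABLES_PER_NEM = 8          # eight 12x CXP uplink cables per NEM
PORTS_PER_SWITCH = 648      # 4x ports per Switch 648
CABLES_PER_LINE_CARD = 24   # inferred: 12x CXP connectors per line card

def config(chassis):
    """Return one Table 2 row: (chassis, switches, line cards,
    NEMs, 12x cables, compute nodes)."""
    cables = chassis * NEMS_PER_CHASSIS * CABLES_PER_NEM
    switches = ceil(chassis * NODES_PER_CHASSIS / PORTS_PER_SWITCH)
    line_cards = ceil(cables / CABLES_PER_LINE_CARD)
    return (chassis, switches, line_cards,
            chassis * NEMS_PER_CHASSIS, cables,
            chassis * NODES_PER_CHASSIS)

# Spot checks against Table 2.
assert config(1) == (1, 1, 2, 4, 32, 96)
assert config(6) == (6, 1, 8, 24, 192, 576)
assert config(8) == (8, 2, 11, 32, 256, 768)
assert config(54) == (54, 8, 72, 216, 1728, 5184)
```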
Chapter 4
Scalable and Manageable Storage
Large-scale supercomputing clusters place significant demands on storage systems. The
enormous computational performance gains that have been realized through
supercomputing clusters are capable of generating ever-larger quantities of data at very
high rates. Effective HPC storage solutions must provide cost-effective capacity, and
throughput must be able to scale along with the performance of cluster compute
nodes. In addition, users and systems alike need fast access to data and home
directories, and longer-term retention and archival are increasingly important in HPC
and supercomputing environments. These diverse demands require a robust range of
integrated storage offerings.
Storage for Clusters
Along with the general growth in storage capacity requirements and the sheer number
of files stored, large HPC environments are seeing significant growth in the numbers of
users needing convenient access to their files. All users want to access their essential
data quickly and easily without having to perform extraneous steps. Organizations also
want to get the best utilization possible from their computational systems.
Unfortunately, storage speeds have seriously lagged behind computational
performance for years, and HPC users are increasingly concerned about storage
benchmarks, the increasing complexity of the I/O path, and the range of solutions
required to provide complete storage solutions.
Of particular importance, large HPC environments need to be able to effectively
manage the flow of high volumes of data through their storage infrastructure,
requiring:
Storage that acts as a resilient compute engine data cache, to match the streaming
rates of applications running on the compute cluster
Storage that provides longer-term retention and archive, to store massive quantities
of essential data to tiered disk or tape hierarchies
A range of scalable and parallel file systems and integrated data management
software to help move file system data from near-term cache to longer-term
retention and archiving, and back on demand
Even as the capacities of individual disk drives have risen, and prices have fallen, high-
volume parallel storage systems have remained expensive and complex. With
experience deploying petabytes of storage into large supercomputing clusters, Sun
understands the key issues needed to deliver high-capacity, high-throughput storage in
a cost-effective and manageable fashion. As an example, the Tokyo Institute of
Technology (TiTech) TSUBAME supercomputing cluster was initially deployed with 1.1
petabytes of storage provided by clustered Sun Fire X4500 storage servers and the
Lustre parallel file system.
Clustered Sun Fire X4540 Storage Servers as Data Cache
Ideal for building storage clusters to serve as cluster scratch space or data cache, the
Sun Fire X4540 storage server defines a new category of system. These innovative
systems closely couple a general-purpose enterprise-class x64 server with high-density
storage, all in a very compact form factor. Supporting up to 48 terabytes in only four
rack units, the Sun Fire X4540 storage server also provides considerable compute power
with dual sockets for Third-Generation Quad-Core and enhanced Quad-Core AMD
Opteron processors. The server can also be configured for high-throughput InfiniBand
networking allowing it to be connected directly to Sun InfiniBand switches. With
support for up to 48 internal 500 GB or 1 TB disk drives, the Sun Fire X4540 storage
server is ideal for large cluster deployments running the Linux OS and the Lustre
parallel file system.
Figure 12. The Sun Fire X4540 storage server provides up to 48 terabytes of
compact storage in only four rack units, ideal for configuration as cluster
scratch space using the Lustre parallel file system.
The Sun Fire X4540 storage server represents an innovative design that provides high
throughput and high-speed access to the 48 directly-attached, hot-plug Serial ATA (SATA)
disk drives. Designed for datacenter deployment, the efficient system is cooled
from front to back across the components and disk drives. Each Sun Fire X4540 storage
server provides:
Minimal cost per gigabyte utilizing SATA II storage and software RAID 6 with six
SATA II storage controllers connecting to 48 high-performance SATA disk drives
High performance from an industry-standard x64 server based on two Quad-Core or
enhanced Quad-Core AMD Opteron processors
Maximum memory and bandwidth scaling from embedded single-channel DDR
memory controllers on each processor, delivering up to 64 GB of capacity
High-performance I/O from two PCI-X slots that deliver over 8.5 gigabits per second
of plug-in I/O bandwidth, including support for InfiniBand HCAs
Easy maintenance and overall system reliability and availability from redundant hot-
pluggable disks, power supply units, fans, and I/O
Parallel file systems are required for moving massive amounts of data through
supercomputing clusters. Given its strengths, the Sun Fire X4540 storage server is now a
standard component of many large supercomputing cluster deployments around the
world. Large grids and clusters need high-performance heterogeneous access to data,
and the Sun Fire X4540 storage server provides both high throughput as well as
essential scalability that allows parallel file systems to perform at their best. Together
with the Lustre parallel file system and the Linux OS, the Sun Fire X4540 storage server
also serves as the key component for the Sun Lustre Storage System (discussed in the
following section).
The Sun Lustre Storage System
The Lustre file system is a software-only architecture that supports a number of
different hardware implementations. Lustre's state-of-the-art object-based storage
architecture provides ground-breaking I/O and metadata throughput, with considerable
reliability, scalability, and performance advantages. The Lustre file system currently
scales to thousands of nodes and hundreds of terabytes of storage, with the potential
to support tens of thousands of nodes and petabytes of data.
Building on the strengths of the Lustre parallel file system, the Sun Lustre Storage
System is architected using Sun Open Storage systems that deliver exceptional
performance and provide additional value. The main components of a typical Lustre
architecture include:
Lustre file system clients (Lustre clients)
Metadata Servers (MDS)
Object Storage Servers (OSS)
Metadata Servers and Object Storage Servers implement the file system and
communicate with the Lustre clients. The MDS manages and stores metadata, such as
file names, directories, permissions and file layout. Configurations also require one or
more Lustre Object Storage Server (OSS) modules, which provide scalable I/O
performance and storage capacity.
Beyond these standard components, all Sun Lustre Storage System configurations
include a High Availability Lustre Metadata Server (HA MDS) module that provides
failover. For
maximum flexibility, the Sun Lustre Storage System defines two OSS modules: a
Standard OSS module for greatest density and economy, and an HA OSS module that
provides OSS failover for environments where automated recovery from OSS failure is
important (Figure 13).
Figure 13. High availability metadata servers (HA MDS) and high availability
object storage servers (HA OSS) allow for file system failover in Lustre
configurations.
HA MDS Module
Designed to meet the critical requirement of high availability, the HA MDS
module is common to all Sun Lustre Storage System configurations. This module
includes a pair of Sun Fire X4270 servers with an attached Sun Storage J4200 array
acting as shared storage. Internal boot drives in the Sun Fire X4270 server are
mirrored for added protection. The Sun Fire X4270 server features two quad-core
Intel Xeon Processor 5500 Series (Nehalem) CPUs and is configured with
24 GB RAM.
Standard OSS Module
The Sun Fire X4540 server was chosen for use as the Standard OSS module. As
discussed, the Sun Fire X4540 server features an innovative architecture that
combines a high-performance server, high I/O bandwidth, and very high density
storage in a single integrated system.
HA OSS Module
Each HA OSS module includes two Sun Fire X4270 servers and four Sun Storage
J4400 arrays. Sun Fire X4270 servers were chosen for the HA OSS module because,
with six PCI Express slots, the Sun Fire X4270 server has the ability to drive the
high throughput required in Sun Lustre Storage System environments. The Sun
Storage J4400 array was chosen for the HA OSS module because it offers
compelling storage density, connectivity, higher availability and very low price per
gigabyte. With redundant SAS I/O Modules and front-serviceable disk drives, the
Sun Storage J4400 array helps the Sun Lustre Storage System deliver price/
performance advantages without sacrificing RAS features.
Sun can reference many storage installations that have achieved impressive scalability
results. One such reference is the Texas Advanced Computing Center's Ranger System
(see http://www.tacc.utexas.edu/resources/hpcsystems/#ranger), where Sun has
demonstrated near-linear scalability in a configuration encompassing fifty similar
previous-generation Sun OSS modules with a single HA MDS module supporting a file
system that was 1.2 petabytes in size. Figure 14 shows the Lustre file system
throughput at TACC where throughput rates of 45 GB/sec with peaks approaching 50
GB/sec have been observed. In addition, TACC has experienced near-linear throughput on a single application's use of the Lustre file system at 35 GB/sec.
Figure 14. Lustre parallel file system performance at TACC
More information on implementing the Lustre parallel file system can be found in the
Sun BluePrints article titled Solving the HPC I/O Bottleneck: Sun Lustre Storage System
(http://wikis.sun.com/display/BluePrints/Solving+the+HPC+IO+Bottleneck+-
+Sun+Lustre+Storage+System).
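The stripe-count curves in Figure 14 reflect how Lustre distributes a file round-robin across object storage targets (OSTs), so that a single file can draw bandwidth from many servers at once. The sketch below illustrates the idea with a hypothetical helper function; it is not the Lustre implementation.

```python
# Sketch: round-robin striping of a file across OSTs, as Lustre does.
# Higher stripe counts let one file draw bandwidth from more servers.

def stripe(data, stripe_count, stripe_size):
    """Assign each stripe_size chunk of data to an OST, round-robin."""
    osts = [bytearray() for _ in range(stripe_count)]
    for i in range(0, len(data), stripe_size):
        osts[(i // stripe_size) % stripe_count].extend(data[i:i + stripe_size])
    return osts

data = bytes(range(8)) * 128          # a 1 KiB "file"
osts = stripe(data, stripe_count=4, stripe_size=64)
print([len(o) for o in osts])         # [256, 256, 256, 256]
```

With a stripe count of 4, a large sequential read or write is spread evenly over four object storage servers, which is why the stripecount = 4 curves in Figure 14 sustain higher aggregate throughput than stripecount = 1 for a single application.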
[Figure 14: two panels plot Write Speed (GB/sec) against the number of writing clients, for $SCRATCH file system throughput and $SCRATCH application performance, with curves for stripe counts of 1 and 4.]
ZFS and Sun Storage 7000 Unified Storage Systems
While high-throughput cluster scratch space is critical, clusters also need storage that serves other needs. Some of an organization's most important data includes completed simulations and key source data. Clusters need storage that provides scalable, reliable, and robust storage for tier-1 data archival and users' home directories.
To address this need, Sun Storage 7000 Unified Storage Systems incorporate an open-
source operating system, commodity hardware, and industry-standard technologies.
These systems represent low-cost, fully-functional network attached storage (NAS)
storage devices designed around the following core technologies:
• General-purpose x64-based servers (that function as the NAS head), and Sun Storage products' proven high-performance commodity hardware with compelling price-performance points
• The ZFS file system, the world's first 128-bit file system, with unprecedented availability and reliability features
• A high-performance networking stack using IPv4 or IPv6
• DTrace Analytics, which provides dynamic instrumentation for real-time performance analysis and debugging
• Sun Fault Management Architecture (FMA) for built-in fault detection, diagnosis, and self-healing for common hardware problems
• A large and adaptive two-tiered caching model, based on DRAM and enterprise-class solid state devices (SSDs)
To meet varied needs for capacity, reliability, performance, and price, the product
family includes four models: the Sun Storage 7110, 7210, 7310, and 7410
Unified Storage Systems (Figure 15). Configured with appropriate data processing and
storage resources, these systems can support a wide range of requirements in HPC
environments.
Figure 15. Sun Storage 7000 Unified Storage Systems
Sun Storage 7110 Unified Storage System
Sun Storage 7210 Unified Storage System
Sun Storage 7410 Unified Storage System
Sun Storage 7310 Unified Storage System
Tight integration of the ZFS scalable file system
Sun Storage 7000 Unified Storage Systems are powered by the ZFS scalable file system.
ZFS offers a dramatic advance in data management with an innovative approach to
data integrity, tremendous performance improvements, and a welcome integration of
both file system and volume management capabilities. A true 128-bit file system, ZFS
removes all practical limitations for scalable storage, and introduces pivotal new
concepts such as hybrid storage pools that de-couple the file system from physical
storage. This radical new architecture optimizes and simplifies code paths from the
application to the hardware, producing sustained throughput at near platter speeds.
New block allocation algorithms accelerate write operations, consolidating what would
traditionally be many small random writes into a single, more efficient write sequence.
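The write-consolidation behavior described above can be sketched as a simple buffering model: absorb random writes in memory, then flush them as one ordered batch. This is a loose, hypothetical model in the spirit of ZFS transaction groups, not the actual allocator.

```python
# Sketch of consolidating small random writes into one sequential flush,
# loosely modeled on ZFS transaction groups. Hypothetical and simplified.

class WriteBuffer:
    def __init__(self):
        self.pending = {}            # block number -> data

    def write(self, block, data):
        self.pending[block] = data   # absorb random writes in memory

    def flush(self):
        # Issue one ordered write sequence instead of many scattered seeks.
        batch = sorted(self.pending.items())
        self.pending.clear()
        return batch

buf = WriteBuffer()
for block in (97, 3, 41, 3):         # random order, with one overwrite
    buf.write(block, b"x")
print([b for b, _ in buf.flush()])   # [3, 41, 97]
```

Note that the overwrite of block 3 is absorbed in memory, so only the final version reaches disk, and the batch is issued in ascending block order, turning random I/O into a near-sequential pattern.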
Silent data corruption is corruption that goes undetected, and for which no error
messages are generated. This particular form of data corruption is of special concern to HPC applications since they typically generate, store, and archive significant amounts
of data. In fact, a study by CERN1 has shown that silent data corruption, including disk
errors, RAID errors, and memory errors, is much more common than previously
imagined. ZFS provides end-to-end checksumming for all data, greatly reducing the risk
of silent data corruption.
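End-to-end checksumming can be illustrated with a minimal sketch: store a checksum separately from the data and verify it on every read. Here `hashlib` stands in for ZFS's block checksums; the store and function names are hypothetical, not the ZFS implementation.

```python
# Minimal sketch of end-to-end checksumming to catch silent corruption.
# hashlib stands in for ZFS block checksums; not the real implementation.
import hashlib

def write_block(store, key, data):
    store[key] = (data, hashlib.sha256(data).hexdigest())

def read_block(store, key):
    data, checksum = store[key]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("silent corruption detected in block %r" % key)
    return data

store = {}
write_block(store, 0, b"simulation results")
assert read_block(store, 0) == b"simulation results"

# Flip a bit "on disk" without updating the checksum: the read catches it.
data, checksum = store[0]
store[0] = (b"simulatiom results", checksum)
try:
    read_block(store, 0)
except IOError as e:
    print(e)   # silent corruption detected in block 0
```

The key point is that the checksum travels with the block pointer rather than the block itself, so a disk, controller, or cable that silently flips a bit cannot also "fix" the checksum, and the error surfaces at read time instead of going undetected.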
Sun Storage 7000 Unified Storage Systems rely heavily on ZFS for key functionality such
as Hybrid Storage Pools. By automatically allocating space from pooled storage when
needed, ZFS simplifies storage management and gives organizations the flexibility to
optimize data for performance. Hybrid Storage Pools also effectively combine the
strengths of system memory, flash memory technology in the form of enterprise solid
state drives (SSDs), and conventional hard disk drives (HDDs).
Key capabilities of ZFS related to Hybrid Storage Pools include:
• Virtual storage pools: Unlike traditional file systems that require a separate volume manager, ZFS integrates volume management functions.
• Data integrity: ZFS uses several techniques, such as copy-on-write and end-to-end checksumming, to keep on-disk data self-consistent and eliminate silent data corruption.
• High performance: ZFS simplifies the code paths from the application to the hardware, delivering sustained throughput at near platter speeds.
• Simplified administration: ZFS automates many administrative tasks to speed performance and eliminate common errors.
Sun Storage 7000 Unified Storage Systems utilize ZFS Hybrid Storage Pools to
automatically provide data placement, data protection, and data services such as RAID,
error correction, and system management. By placing data on the most appropriate
storage media, Hybrid Storage Pools help to optimize performance and contain costs.
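The tiered data placement described above can be sketched as a two-level read cache: DRAM first, then SSD, then hard disk, with blocks promoted on access and demoted on eviction. This is a hypothetical model in the spirit of a Hybrid Storage Pool, not the ZFS caching implementation.

```python
# Sketch of a two-tier read cache in the spirit of Hybrid Storage Pools:
# DRAM first, then SSD, then HDD, promoting blocks as they are accessed.
# Hypothetical model; not the ZFS implementation.
from collections import OrderedDict

class TieredCache:
    def __init__(self, dram_size, ssd_size, hdd):
        self.dram = OrderedDict()    # fastest, smallest tier
        self.ssd = OrderedDict()     # larger, slower flash tier
        self.hdd = hdd               # backing store (block -> data)
        self.dram_size, self.ssd_size = dram_size, ssd_size

    def read(self, block):
        if block in self.dram:
            return self.dram[block]
        data = self.ssd.pop(block, None) or self.hdd[block]
        self.dram[block] = data                      # promote to DRAM
        if len(self.dram) > self.dram_size:          # evict coldest to SSD
            cold, cold_data = self.dram.popitem(last=False)
            self.ssd[cold] = cold_data
            if len(self.ssd) > self.ssd_size:
                self.ssd.popitem(last=False)         # now served by HDD only
        return data

cache = TieredCache(dram_size=2, ssd_size=4,
                    hdd={n: b"d%d" % n for n in range(10)})
print(cache.read(1), cache.read(2), cache.read(3))
print(list(cache.dram))   # block 1 has been demoted to the SSD tier
```

The design point this models is that hot data migrates toward the fastest, most expensive media automatically, so administrators size the tiers rather than hand-placing data.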
Sun Storage 7000 Unified Storage Systems feature a common, easy-to-use management
1. "Silent Corruptions," Peter Kelemen, CERN After C5, June 1st, 2007
interface, along with a comprehensive analytics environment to help isolate and
resolve issues. The systems support NFS, CIFS, and iSCSI data access protocols, mirrored
and parity-based data protection, local point-in-time (PIT) copy, remote replication, data
checksum, data compression, and data reconstruction.
Long-Term Retention and Archive
Staging, storing, and maintaining HPC data requires a massive repository of on-line and
near-line storage to support data retention and archival needs. High-speed data
movement must be provided between computational and archival environments. The
Sun Constellation System addresses this need by integrating with a wealth of
sophisticated Sun StorageTek options, including:
• Sun StorageTek SL8500 and SL500 Modular Library Systems
• Sun StorageTek 6540 and 6140 Modular Arrays
• High-speed data movers
• Sun StorageTek 5800 system fixed-content archive
The comprehensive Sun StorageTek software offering is key to facilitating seamless
migration of data between cache and archival.
Sun StorageTek QFS
Sun StorageTek QFS software provides high-performance heterogeneous shared
access to data over a storage area network (SAN). Users across the enterprise get
shared access to the same large files or data sets simultaneously, speeding time to
results. Up to 256 systems running Sun StorageTek QFS technology can have
shared access to the same data while maintaining file integrity. Data can be written and accessed at device-rated speeds, providing superior application I/O
rates. Sun StorageTek QFS software also provides heterogeneous file sharing using
NFS, CIFS, Apple Filing Protocol, FTP, and Samba.
Sun StorageTek Storage Archive Manager (SAM) software
Large HPC installations must manage the considerable storage required by
multiple projects running large-scale computational applications on very large
datasets. Solutions must provide a seamless and transparent migration for
essential archival data between disk and tape storage systems. Sun StorageTek
Storage Archive Manager (SAM) addresses this need by providing data
classification and policy-driven data placement across tiers of storage.
Organizations can benefit from data protection as well as long-term retention and
data recovery to match their specific needs.
Chapter 6 provides additional detail and a graphical depiction of how caching file
systems such as the Lustre parallel file system combine with SAM-QFS in a real-world
example to provide data management in large supercomputing installations.
Chapter 5
Sun HPC Software
As clusters move from the realm of supercomputing to the enterprise, cluster software
has never been more important. Organizations deploying clusters at all levels need
better ways to control and monitor often expansive cluster deployments in ways that
benefit their users and applications. Unfortunately, collecting, assembling, testing, and
patching all of the requisite software components for effective cluster operation has
proved challenging, to say the least.
Available in both a Linux Edition and a Solaris Developer Edition, Sun HPC Software is
designed to address these needs. Sun HPC Software, Linux Edition is detailed in the
sections that follow. For more information on Sun HPC Software, Solaris Developer
Edition, please see http://wikis.sun.com/display/hpc/Sun+HPC+Software,+Solaris+Developer+Edition+1.0+Beta+1
Sun HPC Software, Linux Edition
Many HPC customers are demanding Linux-based HPC solutions with open source
components. To answer these demands, Sun has introduced Sun HPC Software, Linux
Edition, an integrated solution for Linux HPC clusters based on Sun hardware. More
than a mere collection of software components, Sun HPC Software simplifies the entire
process of deploying and managing large-scale Linux HPC clusters, providing
considerable potential savings in maintenance time and expense.
From its inception, the project's goals were to provide an open product: one that
uses as much open source software as possible, and one that depends on and enhances
the community aspects of software development and consolidation. The ongoing goals
for Sun HPC software are to:
• Provide simple, scalable provisioning of bare-metal systems into a running HPC cluster
• Validate configurations
• Dramatically reduce time-to-results
• Offer integrated management and monitoring of the cluster
• Employ a community-driven process
Seamless and Scalable Integration
Sun HPC Software, Linux Edition covers the entire cluster life-cycle. The software
provides everything needed to provision the cluster nodes, verify that the software and
hardware are working correctly, manage the cluster, and monitor the cluster's
performance and health. All of the components have been fully tested on Sun HPC
hardware, so that the likelihood of post-installation integration problems is significantly
reduced.
Because clusters can vary widely in size, Sun HPC software is designed to be scalable,
and all of the components are selected with large numbers of nodes in mind. For
example, the Lustre parallel file system and OneSIS provisioning software are both well
known for working well with clusters of thousands of nodes. Tools that provision, verify, manage, and monitor the cluster were likewise selected for their
scalability to reduce the management cost as clusters grow.
Sun HPC software, Linux Edition is built to be completely modular so that organizations
can customize it according to their own preferences and requirements. The modular
framework provides a ready-made stack that contains the components required to
deploy an HPC cluster. Add-on components let organizations make specific choices
beyond the core software installed. Figure 16 provides a high-level perspective of Sun
HPC Software, Linux Edition. For more specific information on the components provided
at each level, please see www.sun.com/software/products/hpcsoftware, or send an e-mail to [email protected] to join the community.
Figure 16. Sun HPC Software stack, Linux Edition
Sun HPC Software, Linux Edition 2.0.1 contains the components listed in Table 3.
Table 3. Sun HPC Software 2.0.1 components
Operating system and kernel: Red Hat Enterprise Linux, CentOS Linux, SuSE Linux Enterprise Server, Lustre parallel file system, perfctr
User space libraries: Allinea DDT, Env-switcher, genders, git, Heartbeat, Intel Compiler, Mellanox firmware tools, Modules, MVAPICH and MVAPICH2, OFED, OpenMPI, PGI compiler, RRDtool, Sun Studio, Sun HPC ClusterTools, TotalView
Verification: HPCC Bench Suite, Lustre IOkit, IOR, LNET selftest, NetPIPE
Schedulers: Sun Grid Engine, PBS, LSF, SLURM, MOAB, MUNGE
Monitoring: Ganglia
Provisioning: OneSIS, Cobbler
Management: CFEngine, ConMan, FreeIPMI, gdb, IBSRM, IPMItool, lshw, OpenSM, pdsh, PowerMan, Sun Ops Center
Simplified Cluster Provisioning
Sun HPC Software, Linux Edition is specifically designed to simplify the complex task of
provisioning systems as a part of a clustered environment. For example, the software
stack includes the OneSIS open source software tool developed at Sandia National Laboratories. This tool is specifically designed to ease system administration in large-scale Linux cluster environments.
The software stack itself can be downloaded from the Web, and is designed to fit onto a
single DVD. While installing the first system in a cluster might take place from the DVD,
it goes without saying that installing an entire large cluster in this fashion would
consume unacceptable amounts of time, not to mention the additional time to
maintain and update individual system images. With OneSIS, administrators can create
system images that define the behavior of the entire computing infrastructure.
A typical installation process approximates the following:
• The software is first installed onto a management node. Installing the system locally via DVD typically takes about 20 minutes from bare metal to a login prompt.
• Configuring the system requires another 20 minutes at most.
• Other cluster systems are then booted onto the master image.
The cluster can be running and ready to accept jobs in as little as 50 minutes.
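The arithmetic behind the 50-minute figure can be made explicit. The install and configuration times below are the ones quoted above; the image-boot time for the remaining nodes is an assumption introduced only to make the total work out, since the text does not break it out separately.

```python
# Rough timeline implied above: install and configure the management node
# once, then boot every other node from the shared master image in parallel.
# Install/configure figures are from the text; boot time is an assumption.

install_management_node = 20    # minutes, DVD install to login prompt
configure_management_node = 20  # minutes, at most
boot_nodes_from_image = 10      # assumed: remaining nodes netboot in parallel

total = install_management_node + configure_management_node + boot_nodes_from_image
print(total)   # 50 minutes, matching "as little as 50 minutes"
```

The important property is that the image-boot term does not grow with node count: because every compute node boots the same master image in parallel, adding nodes does not add serial installation time.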
Chapter 6
Deploying Supercomputing Clusters Rapidly
with Less Risk
Sun has considerable experience helping organizations deploy supercomputing clusters
specific to their computational, storage, and collaborative requirements.
Complementing the compelling capabilities of the Sun Constellation System, Sun
provides a range of services that are specifically aimed at delivering results for HPC-
focused organizations. Suns partnership with the Texas Advanced Computing Center
(TACC) at the University of Texas at Austin to deliver the Sun Constellation System in the
3,936-node Ranger supercomputing cluster is one such example.
Sun Datacenter Express Services
Sun's new Datacenter Express Services provide a comprehensive, al