8/9/2019 HPC Open Petascale Computing
1/48
PATHWAYS TO OPEN PETASCALE COMPUTING
The Sun Constellation System: designed for performance
White Paper
November 2009
"Make everything as simple as possible, but not simpler."
Albert Einstein
Sun Microsystems, Inc.
Table of Contents
Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Pathways to Open Petascale Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
The Unstoppable Rise of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
The Importance of a Balanced and Rigorous Design Methodology . . . . . . . . . . . . . . 4
The Sun Constellation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Fast, Large, and Dense InfiniBand Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 7
The Fabric Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Sun Datacenter Switches for InfiniBand Fabrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Deploying Dense and Scalable Modular Compute Nodes . . . . . . . . . . . . . . . . . . . 15
Compute Node Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
The Sun Blade 6048 Modular System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Scaling to Multiple Sun Datacenter InfiniBand Switch 648. . . . . . . . . . . . . . . . . . . . 23
Scalable and Manageable Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Storage for Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Clustered Sun Fire X4540 Servers as Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
The Sun Lustre Storage System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
ZFS and Sun Storage 7000 Unified Storage Systems. . . . . . . . . . . . . . . . . . . . . . . . . 30
Long-Term Retention and Archive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Sun HPC Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Sun HPC Software, Linux Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Seamless and Scalable Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Simplified Cluster Provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Deploying Supercomputing Clusters Rapidly with Less Risk . . . . . . . . . . . . . . . . . 37
Sun Datacenter Express Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Sun Architected HPC Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A massive supercomputing cluster at the Texas Advanced Computing Center . . . . 38
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
For More Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Executive Summary
From weather prediction and global climate modeling to minute sub-atomic analysis
and other grand-challenge problems, modern supercomputers often provide the key
technology for unlocking some of the most critical challenges in science and
engineering. These essential scientific, economic, and environmental issues are
complex and daunting and many require answers that can only come from the
fastest available supercomputing technology. In the wake of the industry-wide
migration to terascale computing systems, an open and predictable path to petascale
supercomputing environments has become essential.
Unfortunately, the design, deployment, and management of very large terascale and
petascale clusters and grids has remained elusive and complex. While a few have
accomplished petascale deployments, they have been largely proprietary in nature, and have come at a high cost. In fact, it is often difficult to reach petascale not because of inherent limitations, but because of the practicalities of attempting to scale architectures to their full potential. Seemingly simple concerns, such as heat, power, cooling, cabling, and weight, are rapidly overloading the vast majority of even the most modern datacenters.
Sun understands that the key to building petascale supercomputers lies in a balanced
and systemic infrastructure design approach, along with careful application of the
latest technology advancements. Derived from Sun's experience and innovation with very large supercomputing deployments, the Sun Constellation System provides the world's first open petascale computing environment, one built entirely with open
and standard hardware and software technologies. Cluster architects can use the Sun
Constellation System to design and rapidly deploy tightly-integrated, efficient, and cost-
effective supercomputing clusters that scale predictably from a few teraflops to over a
petaflop. With a completely modular approach, processors, memory, interconnect
fabric, and storage can all be scaled independently depending on individual needs.
Best of all, the Sun Constellation System is an enterprise-class Sun-supported offering
comprised of general-purpose compute nodes, interconnects, and storage components
that can be deployed very rapidly. In fact, existing supercomputing clusters have
already been built using the system. For instance, the Texas Advanced Computing
Center (TACC) at the University of Texas at Austin partnered with Sun to deploy the Sun
Constellation System as their Ranger supercomputing cluster¹ with a peak
performance rating of over 500 teraflops. This document describes the key challenges
and constraints involved in the build-out of petascale supercomputing architectures,
including network fabrics, multicore modular compute systems, storage, open HPC
software, and general-purpose I/O.
1.
http://www.tacc.utexas.edu/resources/hpcsystems/#constellation
Chapter 1
Pathways to Open Petascale Computing
Most practitioners in today's high-performance computing (HPC) marketplace would
readily agree that the industry is well into the age of terascale systems.
Supercomputing systems capable of processing multiple teraflops are becoming
commonplace. These systems are readily being built using mostly commercial off-the-
shelf (COTS) components with the ability to address terabytes and petabytes of storage,
and more recently, terabytes of system memory (generally as distributed shared
memory and storage pools, or even as a single system image at the high end).
Only a few years ago, general-purpose terascale computing clusters constructed of
COTS components were hard to imagine. Though they were on several industry roadmaps, such systems were widely regarded as impractical due to limitations in the scalability of the interconnects and fabrics that tie disparate systems together. Through
competitive innovation and the race to be the fastest, the industry has been driven into
the realm of practical and commercially-viable terascale systems, and now to the edge of pondering what similar limitations, if any, lie ahead in the design of open
petascale systems.
The Unstoppable Rise of Clusters
In the last five years, technologies used to build the world's fastest supercomputers
have evolved rapidly. In fact, clusters of smaller interconnected rackmount and blade
systems now represent a majority of the supercomputers on the Top500 list of supercomputing sites¹, steadily replacing vector supercomputers and other large
systems that dominated previously. Figure 1 shows the relative shares of various
supercomputing architectures comprising the Top500 list from 1993 through 2009,
establishing clear acceptance of clusters as leading supercomputing technology.
1.
www.top500.org
Figure 1. In the last five years, clusters have increasingly dominated the Top500 list architecture share (image courtesy www.top500.org)
Not only have clusters provided access to supercomputing resources for increasingly
larger groups of researchers and scientists, but the largest supercomputers in the world
are now built using cluster architectures. This trend has been assisted by an explosion
in performance, bandwidth, and capacity for key technologies, including:
• Faster processors, multicore processors, and multisocket rackmount and blade systems
• Inexpensive memory and system support for larger memory capacity
• Faster standard interconnects such as InfiniBand
• Higher aggregated storage capacity from inexpensive commodity disk drives
Unfortunately, significant challenges remain that have stifled the growth of true open
petascale-class supercomputing clusters. Time-to-deployment constraints have resulted
from the complexity of deploying and managing large numbers of compute nodes,
switches, cables, and storage systems. The programmability of extremely large clusters
remains an issue. Environmental factors too are paramount since deployments must
often take place in existing datacenter space with strict constraints on physical
footprint, as well as power and cooling.
In addition to these challenges, most petascale computational users also have unique
requirements for clustered environments beyond those of less demanding HPC users,
including:
• Scalability at the socket and core level: Some have espoused large grids of relatively low-performance systems, but lower performance only increases the number of nodes that are required to solve very large computational problems.
• Density in all things: Density is not just a requirement for compute nodes, but for interconnect fabrics and storage solutions as well.
• A scalable programming and execution model: Programmers need to be able to apply their programmatic challenges to massively-scalable computational resources without special architecture-specific coding requirements.
• A lightweight grid model: Demanding applications need to be able to start thousands of jobs quickly, distributing workloads across the available computational resources through highly-efficient distributed resource management (DRM) systems.
• Open and standards-based solutions: Programmatic solutions must not require extensive porting efforts or be dedicated to particular proprietary architectures or environments, and datacenters must remain free to purchase the latest high-performance computational gear without being locked into proprietary or dead-end architectures.
The Importance of a Balanced and Rigorous Design Methodology
As anyone who has witnessed prior generations of supercomputing and HPC
architectures can attest, scaling gracefully is not simply a matter of accelerating
systems that already perform well. Bigger versions of existing technologies are not
always better. Regrettably, the pathways to teraflop systems are littered with the
products and technologies from dozens of companies that simply failed to adapt along
the way.
Many technologies have failed because the fundamental principles that worked in
small clusters simply could not scale effectively when re-cast in a run-time environment
thousands of times larger or faster than their initial implementations. For example, Ten Gigabit Ethernet, though a significant accomplishment, is known in the supercomputing realm to be fraught with sufficiently variable latency as to make it impractical for situations where low guaranteed latency and throughput dominate
performance. Ultimately, building petascale-capable systems is about being willing to
fundamentally rethink design, using the latest available components that are capable
of meeting or exceeding specified data rates and capacities.
Put simply, getting to petascale requires balance and massive scalability in all dimensions, including scalable tools and frameworks, processors, systems, interconnects, and storage, as well as the ability to accommodate changes that
allow software to scale accordingly.
Key challenges for petascale environments include:
• Keeping floating-point operations (FLOPs) to memory bandwidth ratios balanced to minimize the effects of memory latency (with each FLOP representing at least two loads and one store)
• Allowing for the practical scaling of the interconnect fabric to allow the connection of tens of thousands of nodes
• Exploiting the considerable investment, momentum, and cost savings of commodity multicore x64 processors, tools, and software
• Overcoming software challenges such as the forward portability of HPC codes to new architectures, scalability limitations, reliability, robustness, and being able to take advantage of multicore, multiprocessor system architectures
• Architecting to account for the opportunity to take advantage of external floating point, vector, and/or general-purpose processing on graphics processing unit (GPGPU) solutions within a cluster framework
• Designing the highest levels of density into compute nodes, interconnect fabrics, and storage solutions in order to facilitate large and compact clusters
• Building systems with efficient power and cooling to accommodate the broadest range of datacenter facilities and to help ensure the highest levels of reliability
• Architecting clusters such that compute-intensive applications have access to fast cluster scratch storage space for a balanced computational approach
These challenges serve as reminders that the value of genuine innovation in the marketplace must never be underestimated, even as design-cycle times shrink and the pressures of time-to-market grow with the demand for faster, cheaper, and standards-based solutions.
The Sun Constellation System
Since its inception, Sun has been focused on building balance and even elegance into
its system designs. The Sun Constellation System represents a tangible application of
this philosophy on a grand scale in the form of a systematic approach to building
terascale and petascale supercomputing clusters. Specifically, the Sun Constellation
System delivers an open architecture that is designed to allow organizations to build
clusters that scale seamlessly from a few racks to teraflops or petaflops of performance.
With an overall datacenter focus, Sun is free to innovate at all levels of the system, from switching fabric, to core system and storage elements, to HPC and file system
software. As a systems company, Sun looks beyond existing technologies toward
solutions that optimize the simultaneous equations of cost, space, practicality, and
complexity. In the form of the Sun Constellation System, this systemic focus combines a
massively-scalable InfiniBand interconnect with very dense computational and storage
solutions in a single architecture that functions as a cohesive system. Organizations
can now obtain all of these tightly-integrated building blocks from a single vendor, and
benefit from a unified management approach.
Components of the Sun Constellation System include:
• The Sun Datacenter InfiniBand Switch 648, offering up to 648 QDR/DDR ports in a single 11 rack unit (11U) chassis, and supporting clusters of up to 5,184 nodes with multiple switches
• The Sun Datacenter InfiniBand Switch 72, offering up to 72 QDR/DDR ports in a compact 1U form factor, and supporting clusters of up to 576 nodes with multiple switches
• The Sun Datacenter InfiniBand Switch 36, offering up to 36 ports in a 1U form factor
• The Sun Blade 6048 Modular System, providing an ultra-dense InfiniBand-connected blade platform with support for up to 48 multiprocessor, multicore Sun Blade 6000 server modules and up to 96 compute nodes in a rack-sized chassis
• Sun Fire X4540 storage clusters, serving as an economical InfiniBand-connected parallel file system building block, with support for up to 48 terabytes in only four rack units and up to 480 terabytes in a single rack
• The Sun Storage 7000 Unified Storage System, integrating enterprise flash technology through ZFS hybrid storage pools and DTrace Analytics to provide economical, scalable, and transparent storage
• The Sun Lustre Storage System, a simple-to-deploy storage environment based on the Lustre parallel file system, Sun Fire servers, and Sun Open Storage platforms
• Sun HPC Software, encompassing integrated developer tools, Sun Grid Engine infrastructure, advanced ZFS and Lustre file systems, provisioning, monitoring, patching, and simplified inventory management, available in both a Linux Edition and a Solaris Operating System (OS) Developer Edition
The Sun Constellation System provides an open systems supercomputer architecture designed for petascale computing as an integrated and Sun-supported product. This
holistic approach offers key advantages to those designing and constructing the largest
supercomputing clusters:
• Massive scalability in terms of optimized compute, storage, interconnect, and software technologies and services
• Simplified cluster deployment with open HPC software that can rapidly turn bare-metal systems into functioning clusters that are ready to run
• A dramatic reduction in complexity through integrated connectivity and management to reduce start-up, development, and operational costs
• Breakthrough economics from technical innovation that results in fewer, more reliable components and high-efficiency systems in a tightly-integrated solution
Along with key technologies and the experience of helping design and deploy some of the world's largest supercomputing clusters, these strengths make Sun an ideal partner
for delivering open high-end terascale and petascale architecture.
Chapter 2
Fast, Large, and Dense InfiniBand Infrastructure
Building the largest supercomputing grids presents significant challenges, with fabric technology paramount among them. Sun set out to design InfiniBand architecture for maximum flexibility and fabric scalability, and to drastically reduce the cost and complexity of delivering large-scale HPC solutions. Achieving these goals required a delicate balancing act, one that weighed the speed and number of nodes along with a sufficiently fast interconnect to provide minimal and predictable levels of latency.
The Fabric Challenge
For many applications, the interconnect fabric is already the element that limits
performance. One unavoidable driver is that faster processors require a faster
interconnect. Beyond merely employing a fast technology, the fabric must scale
effectively with both the speed and number of systems and processors. Interconnect
fabrics for large terascale and petascale deployments require:
• Low latency
• High bandwidth
• The ability to handle fabric congestion
• High reliability to avoid interruptions
• Open standards such as OpenFabrics and the OpenMPI software stack
InfiniBand technology has emerged as an attractive fabric for building large
supercomputing clusters. As an open standard, InfiniBand presents a compelling choice
over proprietary interconnect technologies that depend on the success and innovation
of a single vendor. InfiniBand also presents a number of significant technical
advantages:
• A switched fabric offers considerable scalability, supporting large numbers of simultaneous collision-free connections with virtually no increase in latency.
• Host channel adaptors (HCAs) with remote direct memory access (RDMA) support offload communications processing from the processor and operating system, leaving more processor resources available for computation.
• Fault isolation and troubleshooting are easier in switched environments since problems can be isolated to a single connection.
• Applications that rely on bandwidth or quality of service are also well served, since they each receive their own dedicated bandwidth.
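For reference, the DDR and QDR rates discussed throughout this paper translate into usable bandwidth as follows. This small sketch uses the standard InfiniBand lane signaling rates and the 80% efficiency of 8b/10b encoding:

```python
# Usable data rate of an InfiniBand link. SDR/DDR/QDR lanes signal at
# 2.5/5/10 Gb/s respectively, and 8b/10b encoding leaves 80% for data.

LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

def link_data_rate_gbps(rate, lanes=4):
    """Usable Gb/s for an InfiniBand link after 8b/10b encoding overhead."""
    return LANE_GBPS[rate] * lanes * 0.8

for r in ("SDR", "DDR", "QDR"):
    print(r, link_data_rate_gbps(r), "Gb/s")   # 8.0, 16.0, 32.0 Gb/s for 4x
```

A 12x connector, as used for trunking later in this chapter, simply carries three such 4x links.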
Even with these advantages, building the largest InfiniBand clusters and grids has remained complex and expensive, primarily because of the need to interconnect very large numbers of computational nodes. Traditional large clusters require literally thousands of cables and connections and hundreds of individual core and leaf switches, adding considerable expense, weight, cable management complexity, and
consumption of valuable datacenter rack space. It is clear that density, consolidation,
and management efficiencies are important not just for computational platforms, but
for InfiniBand interconnect infrastructure as well.
Even with very significant accomplishments in terms of processor performance and
computational density, large clusters are ultimately constrained by real estate and
the complexities and limitations of interconnect technologies. Cable length limitations
constrain how many systems can be connected together in a given physical space while
avoiding increased latency. Interconnect topologies play a vital role in determining the
properties that clustered systems exhibit. Mesh, torus (or toroidal), and Clos topologies
are popular choices for interconnected supercomputing clusters and grids.
Mesh and 3D Torus Topologies
In mesh and 3D torus topologies, each node connects to its neighbors in the x, y, and z dimensions, with six connecting ports per node. Some of the most notable
supercomputers based upon torus topologies include IBM's BlueGene and Cray's XT3/XT4 supercomputers. Torus fabrics have had the advantage that they have generally
been easier to build than Clos topologies. Unfortunately, torus topologies represent a
blocking fabric, where interconnect bandwidth can vary between nodes. Torus fabrics
also provide variable latency due to variable hop count, and application deployment for
torus fabrics must carefully consider node locality as a result. For some specific
applications that express a nearest-neighbor type of communication pattern, torus
topologies are a good fit. Computational fluid dynamics (CFD) is one such application.
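The variable hop count of a torus is easy to quantify. The following sketch (illustrative only, not a description of any vendor's routing) computes the shortest-path hop distance between two nodes in a 3D torus, where each dimension wraps around:

```python
# Shortest-path hop count between two nodes in an X x Y x Z 3D torus.
# Each coordinate wraps, so the distance per dimension is the smaller of
# the direct path and the wrap-around path.

def torus_hops(a, b, dims):
    """Manhattan distance between coordinates a and b on a torus."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

dims = (8, 8, 8)                                 # a 512-node torus
print(torus_hops((0, 0, 0), (1, 0, 0), dims))    # nearest neighbor: 1 hop
print(torus_hops((0, 0, 0), (4, 4, 4), dims))    # worst case: 12 hops
```

Nearest neighbors are one hop apart while distant nodes may be many hops apart, which is why application deployment on torus fabrics must consider node locality.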
Clos Fat Tree Topologies
First described by Charles Clos in 1953, Clos networks have long formed the basis for
practical multistage telephone switching systems. Clos networks utilize a fat tree
topology, allowing complex switching networks to be built using many fewer
crosspoints than if the entire system were implemented as a single large crossbar
switch. Clos switches are typically comprised of multiple tiers and stages (hops), with
each tier built from a number of crossbar switches. Connectivity exists only between
switch chips on adjacent tiers.
Clos fabrics have the advantage of being non-blocking, in that each attached node has
a constant bandwidth. In addition, an equal number of stages between nodes provides
for uniform latency. Historically, the disadvantage of large Clos networks was that they
required more resources to build.
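The crosspoint savings Clos described can be illustrated with switch-chip arithmetic. The sketch below counts ports, chips, and crosspoints for a nonblocking two-tier (three-stage) fat tree built from k-port crossbar elements; with k = 36, the numbers match a 648-port fabric built from 54 chips:

```python
# Ports, chips, and crosspoints for a nonblocking two-tier (three-stage)
# fat tree built from k-port crossbar switch chips. Each leaf chip uses
# k/2 ports down (to nodes) and k/2 ports up (to spine chips).

def fat_tree(k):
    leaves = k                   # one leaf per spine-chip port
    spines = k // 2              # one spine per leaf uplink
    ports = leaves * (k // 2)    # end ports = leaves x down-links
    chips = leaves + spines
    return ports, chips

ports, chips = fat_tree(36)      # 36-port elements, e.g. InfiniScale IV
print(ports, chips)              # 648 ports from 54 chips

# Clos's point: far fewer crosspoints than one giant crossbar would need.
print(ports ** 2)                # 419,904 crosspoints as a single crossbar
print(chips * 36 ** 2)           # 69,984 crosspoints across the fat tree
```

Every path crosses leaf, spine, and leaf, so the hop count (and therefore latency) is uniform, at the cost of the extra chips the text mentions.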
Constructing Large Switched Supercomputing Clusters
Constructing very large InfiniBand Clos switches in particular is governed by a number
of practical constraints, including the number of ports available in individual switch
elements, maximum achievable printed circuit board size, and maximum connector
density. Sun has employed considerable innovation in all of these areas, and provides both dual data rate (DDR) and quad data rate (QDR) scalable InfiniBand fabrics. For
both dual data rate (DDR) and quad data rate (QDR) scalable InfiniBand fabrics. For
example, as a part of the Sun Constellation System, Sun InfiniBand infrastructure can
provide both QDR Clos clusters that can scale up to 5,184 nodes as well as 3D Torus configurations.
Sun Datacenter Switches for InfiniBand Fabrics
Recognizing the considerable promise of InfiniBand interconnects, Sun has made
InfiniBand connectivity a core competency, and has set out to design scalable and
dense switches that avoid many of the conventional limitations. Not content to accept
the status quo in terms of available InfiniBand switching, cabling, and host adapters,
Sun engineers used their considerable networking and datacenter experience to view
InfiniBand technology from a systemic perspective.
Key Technical Innovations for Sun Datacenter InfiniBand Switches
Sun Datacenter InfiniBand Switches 36, 72, and 648 are components of a complete system that is based on multiple technical innovations, including:
• The Sun Datacenter InfiniBand Switch 648 chassis implements a three-stage Clos fabric with up to 54 36-port Mellanox InfiniScale IV switching elements, integrated into a single 11U rackmount enclosure.
• Industry-standard 12x CXP connectors on the Sun Datacenter InfiniBand Switch 72 and 648 consolidate three discrete InfiniBand 4x connectors, resulting in the ability to host 72 4x ports through 24 physical 12x connectors.
• Complementing the 12x CXP connector, a 12x trunking cable carries signals from three servers to a single switch connector, offering a 3:1 cable reduction when used for server trunking, and reducing the number of cables needed to support 648 servers to 216. A splitter cable that converts one 12x connection to three 4x connections is provided for connectivity to systems and storage that require 4x QSFP connectors.
• A custom-designed double-height Network Express Module (NEM) for the Sun Blade 6048 Modular System provides seamless connectivity to both the Sun Datacenter InfiniBand Switch 648 and 72. Using the same 12x CXP connectors, the Sun Blade 6048 InfiniBand QDR Switched NEM can trunk up to 12 Sun Blade 6000 server modules (up to 24 compute nodes) in a single Sun Blade 6048 Modular System shelf. The NEM together with the 12x CXP cable facilitates connectivity of up to 5,184 servers in a 5-stage Clos topology.
• Fabric topology for forwarding InfiniBand traffic is established by a redundant host-based Subnet Manager. A host-based solution allows the Subnet Manager to take advantage of the full resources of a general-purpose multicore server.
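The cable consolidation figures above follow directly from the 3:1 trunking ratio. A minimal sketch of the arithmetic:

```python
# Cable arithmetic for 12x CXP trunking: each 12x cable carries three 4x
# InfiniBand links, so server cabling shrinks by a factor of three.

SERVERS_PER_12X_CABLE = 3

def trunk_cables(servers):
    """12x cables needed to attach the given number of 4x server links."""
    return -(-servers // SERVERS_PER_12X_CABLE)   # ceiling division

print(trunk_cables(648))     # 216 cables for a fully populated Switch 648
print(trunk_cables(5184))    # 1,728 cables across eight such switches
```

A conventional build would instead need one 4x cable per server, plus hundreds more to tie leaf and core switches together.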
Massive Switch and Cable Consolidation
Given the scale involved with building supercomputing clusters and grids, cost and
complexity figure importantly. Regrettably, traditional approaches to using InfiniBand
for massive connectivity have required very large numbers of conventional switches
and cables. In these configurations, many cables and ports are consumed redundantly
connecting core and leaf switches together, making advertised per-port switch costs
relatively meaningless, and reducing reliability through extra cabling.
In contrast, the very dense InfiniBand fabric provided by Sun Datacenter InfiniBand switches is able to potentially eliminate hundreds of switches and thousands of cables, dramatically lowering acquisition costs. In addition, replacing physical switches and cabling with switch chips and traces on printed circuit boards drastically improves reliability. Standard 12x InfiniBand cables and connectors coupled with a specialized Sun Blade 6048 Network Express Module can eliminate thousands of additional cables, providing additional cost, complexity, and reliability improvements. Overall, these
switches provide radical simplification of InfiniBand infrastructure. Sun Datacenter
Switches are available to support both DDR and QDR data rates, with fabric capacities
enumerated in Table 1.
Table 1. Sun Datacenter InfiniBand Switch capacities

InfiniBand Switch                     | Data Rate (Connector)          | Max. Nodes per Switch | Maximum Clos Fabric
Sun Datacenter InfiniBand Switch 648  | QDR or DDR (up to 216 12x CXP) | 648                   | 5,184 (a)
Sun Datacenter InfiniBand Switch 72   | QDR or DDR (24 12x CXP)        | 72                    | 576 (a)
Sun Datacenter InfiniBand Switch 36   | QDR or DDR (36 4x QSFP)        | 36                    |

a. Eight switches are required. The Sun Datacenter InfiniBand Switch 648 is capable of supporting clusters beyond 5,184 servers. The maximum number of nodes is currently determined by the number of uplink ports (eight) provided by the Sun Blade 6048 InfiniBand QDR Switched NEM.

The Sun Datacenter InfiniBand Switch 648
The Sun Datacenter InfiniBand Switch 648 is designed to drastically reduce the cost and complexity of delivering large-scale HPC solutions, such as those scaled for leadership in the Top500 list of supercomputing sites, as well as smaller and moderately-sized enterprise and HPC applications in scientific, technical, and financial markets. Each Sun Datacenter InfiniBand Switch 648 provides up to 648 QDR InfiniBand ports in only 11 rack units (11U). Up to eight Sun Datacenter InfiniBand Switch 648 can be combined to
support up to 5,184 nodes in a single cluster. As shown in Figure 2, the Sun Datacenter
InfiniBand Switch 648 also provides extensive cable support and management for clean
and efficient installations.
Figure 2. The Sun Datacenter InfiniBand Switch 648 offers up to 648 QDR/DDR/SDR 4x InfiniBand connections in an 11U rackmount chassis (shown with cable management arms deployed).
The Sun Datacenter InfiniBand Switch 648 is ideal for deploying fast, dense, and
compact Clos fabrics when used as a part of the Sun Constellation System. Based on
the Mellanox InfiniScale IV 36-port InfiniBand switch device, each switch chassis
connects up to 648 nodes using 12x CXP connectors. The switch represents a full three-
stage Clos fabric, and up to eight Sun Datacenter InfiniBand Switch 648 can be used to
combine up to 54 Sun Blade 6048 chassis in a maximal 5,184-node fabric. Up to three Sun Datacenter InfiniBand Switch 648 (and up to 1,944 QDR ports) can be deployed in a
single standard rack (Figure 3).
The Sun Datacenter InfiniBand Switch 648 is tightly integrated with the Sun Blade 6048
InfiniBand QDR Switched Network Express Module (NEM). 12x cables and CXP
connectors provide a 3:1 cable consolidation ratio. Each dual-height NEM connects up
to 24 compute nodes in a single Sun Blade 6048 shelf to a QDR InfiniBand fabric. Sun's approach to InfiniBand networking is highly flexible in that both Clos and mesh/torus
interconnects can be built using the same components. The Sun Blade 6048 InfiniBand
Switched NEM can be used by itself to build mesh and torus fabrics, or in combination
with the Sun Datacenter InfiniBand Switch 648 to build Clos InfiniBand fabrics.
The Sun Datacenter InfiniBand Switch 648 employs a passive midplane. Up to nine
fabric cards install vertically and connect to the midplane from the rear of the chassis.
Up to nine line cards install horizontally from the front of the chassis. A three-dimensional
perspective of the fabric provided by the switch is shown in Figure 4, with an example
route overlaid. With this dense switch configuration, InfiniBand packets traverse only
three hops from ingress to egress of the switch, keeping latency very low. The Sun
Blade 6048 InfiniBand QDR Switched NEM adds only two hops for a total of five. All
InfiniBand routing is managed using a redundant host-based subnet manager.

Figure 3. Up to three Sun Datacenter InfiniBand Switch 648 switches in a single
19-inch rack deliver 1,944 QDR ports.
Figure 4. A path through a Sun Datacenter InfiniBand Switch 648 core switch
connects two nodes across horizontal line cards, a vertical fabric card, and the
passive orthogonal midplane.
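The scaling figures quoted above follow from simple Clos arithmetic. A minimal sketch (the function name is illustrative, not from any Sun tool): a non-blocking three-stage folded Clos built from 36-port crossbar chips yields 648 end ports per chassis, and eight chassis federate into the 5,184-node maximum.

```python
def clos_end_ports(radix):
    """Maximum end ports of a non-blocking three-stage folded Clos
    built from crossbar chips of the given radix: each leaf chip
    splits its ports evenly between hosts and spine uplinks."""
    return radix * radix // 2

# Mellanox InfiniScale IV devices are 36-port crossbars.
assert clos_end_ports(36) == 648        # one Switch 648 chassis

# Eight chassis federate into the maximal Sun Constellation fabric.
assert 8 * clos_end_ports(36) == 5184

# Hop counts: leaf -> spine -> leaf inside the switch (three hops),
# plus one switch chip in the NEM on each end of the path.
total_hops = 3 + 2
assert total_hops == 5
```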
The Sun Datacenter InfiniBand Switch 72
The Sun Datacenter InfiniBand Switch 72 leverages many of the innovations found in
the Sun Datacenter InfiniBand Switch 648, while offering support for smaller and mid-
sized configurations. Like the larger 648-port switch, the Sun Datacenter InfiniBand
Switch 72 offers QDR and DDR connectivity, extreme density, and unrivaled cable
aggregation for Sun Blade and Sun Fire servers as well as Sun storage solutions.
Depicted in Figure 5, the Sun Datacenter InfiniBand Switch 72 occupies only one rack
unit, offering an ultraslim and ultradense complete switch fabric solution for clusters of
up to 72 nodes.
Figure 5. The Sun Datacenter InfiniBand Switch 72 offers 72 4x QDR InfiniBand
ports in a 1U form factor
When used in conjunction with the Sun Blade 6048 Modular System, up to eight Sun
Datacenter InfiniBand Switch 72 units can be combined to support clusters of up to 576
nodes. While similar solutions from competitors occupy over 17 rack units, eight 1U Sun
Datacenter InfiniBand Switch 72 units save considerable space and require roughly one
third the number of cables. In addition to simplification, this end-to-end
supercomputing solution offers extremely low latency using industry-standard
transport and commodity processors, including AMD Opteron, Intel Xeon, and Sun SPARC.
The Sun Datacenter InfiniBand Switch 72 provides the following specifications:
72 QDR/DDR/SDR 4x InfiniBand ports (expressed through 24 12x CXP connectors)
Data throughput of 4.6 Tb/sec.
Port-to-port latency of 300ns (QDR)
Eight data virtual lanes
One management virtual lane
4096 byte MTU
Sun Datacenter InfiniBand Switch 36
Leveraging the properties of the InfiniBand architecture, the Sun Datacenter InfiniBand
Switch 36 helps organizations deploy smaller high-performance fabrics in demanding
high-availability (HA) environments. The switch supports the creation of logically
isolated sub-clusters, as well as advanced features for traffic isolation and Quality of
Service (QoS) management, preventing faults from causing costly service disruptions.
The embedded InfiniBand fabric management module supports active/hot-standby
dual-manager configurations, helping to ensure a seamless migration of the fabric
management service in the event of a management module failure. The Sun
Datacenter InfiniBand Switch 36 is provisioned with redundant power and cooling for
high availability in demanding datacenter environments. The Sun Datacenter
InfiniBand Switch 36 is shown in Figure 6.
Figure 6. The Sun Datacenter InfiniBand Switch 36 offers 36 QDR InfiniBand ports
in a 1U form factor
The Sun Datacenter InfiniBand Switch 36 provides the following specifications:
36 QDR/DDR/SDR 4x InfiniBand ports (expressed through 36 4x QSFP connectors)
Data throughput of 2.3 Tb/sec.
Port-to-port latency of 100ns (QDR)
Eight data virtual lanes
One management virtual lane
4096 byte MTU
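The quoted throughput figures for both switches follow from QDR line rates: a 4x QDR link signals at 40 Gb/sec but, with 8b/10b encoding, carries 32 Gb/sec of data per direction. Summing both directions over all ports reproduces the specifications above (a back-of-envelope check, not a vendor formula):

```python
# Data rate of one 4x QDR link: 40 Gb/s signaling, 8b/10b encoding
# leaves 32 Gb/s of payload per direction.
QDR_4X_DATA_GBPS = 40 * 8 / 10

def aggregate_tbps(ports):
    # Bidirectional aggregate data throughput across all ports, in Tb/sec.
    return ports * QDR_4X_DATA_GBPS * 2 / 1000

assert round(aggregate_tbps(36), 1) == 2.3   # Switch 36 specification
assert round(aggregate_tbps(72), 1) == 4.6   # Switch 72 specification
```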
Chapter 3
Deploying Dense and Scalable Modular Compute
Nodes
Implementing terascale and petascale supercomputing clusters depends heavily on
having access to large numbers of high-performance systems with large memory
support and high memory bandwidth. As a part of the Sun Constellation System, Sun's
approach is to combine the considerable and constant performance gains in the
standard processor marketplace with the advantages of modular architecture. This
approach results in some of the fastest and most dense systems possible, all tightly
integrated with Sun Datacenter InfiniBand switches.
Compute Node Requirements
While some supercomputing architectures employ very large numbers of slower
proprietary nodes, this approach does not translate easily to petascale. The
programmatic implications alone of handling literally millions of nodes are not
particularly appealing, much less the physical realities of managing and housing
such systems. Instead, building large and open terascale and petascale systems
depends on key capabilities for compute nodes, including:
High Performance
Compute nodes must provide very high peak levels of floating-point performance.
Likewise, because floating-point performance is dependent on multiple memory
operations, equally high levels of memory bandwidth must be provided. I/O
bandwidth is also crucial, yielding high-speed access to storage and other
compute nodes.
Density, Power, and Cooling
The physical requirements of today's ever more expensive datacenter real estate
dictate that any viable solutions take the best advantage of datacenter floor space
while staying within environmental realities. Solutions must be as energy efficient
as possible, and must provide effective cooling that fits well with the latest
energy-efficient datacenter practices.
Superior Reliability and Serviceability
Due to their large numbers, computational systems must be as reliable and
serviceable as possible. Not only must systems provide redundant hot-swap
processing, I/O, power, and cooling modules, but serviceability must be a key
component of their design and management. Interconnect schemes must allow
systems to be cabled once and reconfigured at will as required.
Blade technology has offered considerable promise in these areas for some time, but
has often been constrained by legacy blade platforms that locked adopters into
expensive proprietary infrastructure. Power and cooling limitations often meant that
processors were limited to less powerful versions. Limited processing power, memory
capacity, and I/O bandwidth often severely restricted the applications that could be
deployed. Proprietary tie-ins and other constraints in chassis design dictated
networking and interconnect topologies, and I/O expansion options were limited to a
small number of expensive and proprietary modules.
The Sun Blade 6048 Modular System
To address the shortcomings of earlier blade computing platforms, Sun started with a
design point focused on the needs of the datacenter and highly-scalable deployments,
rather than with preconceptions of chassis design. With this innovative and truly
modular approach, the Sun Blade 6048 Modular System offers an ultra-dense high-
performance solution for large HPC clusters. Organizations gain the promised benefits
of blades, and can deploy thousands of nodes within the cabling, power, and cooling
constraints of existing datacenters. Fully compatible with the Sun Blade 6000 Modular
System, the Sun Blade 6048 Modular System provides distinct advantages over other
approaches to modular architecture.
Innovative Chassis Design for Industry-Leading Density and Environmentals
The Sun Blade 6048 Modular System features a standard rack-size chassis that
facilitates the deployment of high-density computational environments. By
eliminating all of the hardware typically used to rackmount individual blade
chassis, the Sun Blade 6048 Modular System provides 20% more usable space in
the same physical footprint. Up to 48 Sun Blade 6000 server modules can be
deployed in a single Sun Blade 6048 Modular System for up to 96 compute nodes
per rack. Innovative chassis features are carried forward from the Sun Blade 6000
Modular System.
A Choice of Processors and Operating Systems
Each Sun Blade 6048 Modular System chassis supports up to 48 full-performance,
full-featured Sun Blade 6000 series server modules. Server modules based on
x86/x64 architectures and ideal for HPC and supercomputing environments
include:
The Sun Blade X6440 server module, with four sockets for Six-Core AMD Opteron
8000 Series processors, and support for up to 256 GB of memory
The Sun Blade X6270 server module, with two sockets for Intel Xeon Processor
5500 Series (Nehalem) CPUs and 144 GB of memory per server module
The Sun Blade X6275 server module, with two nodes, each with two sockets for
Intel Xeon Processor 5500 Series CPUs, 96 GB of memory per node, and an on-
board QDR Mellanox InfiniBand host channel adapter (HCA)
Each server module provides significant I/O capacity as well, with up to 32 lanes of
PCI Express 2.0 bandwidth delivered from each server module to the multiple
available I/O expansion modules (a total of up to 207 Gb/sec supported per server
module). To enhance availability, server modules don't have separate power
supplies or fans. Some server modules feature up to four hot-swap hard disk drives
(HDDs) or solid state drives (SSDs) with hardware RAID options, while others
provide on-board flash technologies for fast and reliable I/O. Organizations can
deploy server modules based on the processors and operating systems that best
serve their applications or environment. Different server modules can be mixed
and matched in a single chassis, and deployed and redeployed as needs dictate.
The Solaris Operating System (OS), Linux, and Microsoft Windows are all
supported.
Complete Separation Between CPU and I/O Modules
Sun Blade 6048 Modular System design avoids compromises because it provides a
complete separation between CPU and I/O modules. Two types of I/O modules are
supported.
Up to two industry-standard PCI Express ExpressModule (EM) slots are dedicated
to each server module.
Up to two Network Express Modules (NEMs) provide bulk I/O for all of the server
modules installed in each shelf (four shelves per chassis).
Through this flexible approach, server modules can be configured with different
I/O options depending on the applications they host. All I/O modules are hot-plug
capable, and customers can choose from Sun-branded or third-party adapters for
networking, storage, clustering, and other I/O functions.
Sun Blade Transparent Management
Many blade vendors provide management solutions that lock organizations into
proprietary management tools. With the Sun Blade 6048 Modular System,
customers have the choice of using their existing management tools or Sun Blade
Transparent Management. Sun Blade Transparent Management is a standards-
based cross-platform tool that provides direct management over individual server
modules and direct management of chassis-level modules using Sun Integrated
Lights Out Manager (ILOM).
Within the Sun Blade 6048 Modular System, a chassis monitoring module (CMM)
works in conjunction with the service processor on each server module to form a
complete and transparent management solution. Individual server modules
provide support for IPMI, SNMP, CLI (through serial console or SSH), and HTTP(S)
management methods. In addition, Sun Ops Center provides discovery,
aggregated management, and bulk deployment for multiple systems.
System Overview
The Sun Blade 6048 chassis provides space for up to 12 server modules in each of its
four shelves for up to 48 Sun Blade 6000 server modules in a single chassis. This
design approach provides considerable density. Front and rear perspectives of the Sun
Blade 6048 Modular System are provided in Figure 7.
Figure 7. Front and rear perspectives of the Sun Blade 6048 Modular System
With four self-contained shelves per chassis, the Sun Blade 6048 Modular System
houses a wide range of components.
Up to 48 Sun Blade 6000 server modules insert from the front of the chassis, with 12
modules supported by each shelf.
A total of eight hot-swap power supply modules insert from the front of the chassis,
with two 8,400-watt 12-volt power supplies (N+N) provided for each shelf. Each
power supply module contains a dedicated fan module.
Up to 96 hot-plug PCI Express ExpressModules (EMs) insert from the rear of the
chassis (24 per shelf), supporting industry-standard PCI Express interfaces with two
EM slots available for use by each server module.
Up to four dual-height Sun Blade 6048 InfiniBand NEMs can be installed in a single
chassis (one per shelf). Alternately, up to eight single-height Network Express
Modules (NEMs) can be inserted from the rear, with two NEM slots serving each shelf
of the chassis.

A chassis monitoring module (CMM) and power interface module are provided for
each shelf. The CMM provides for transparent management access to individual
server modules while the Power Interface Module provides six plugs for the power
supply modules in each shelf.
Redundant (N+1) fan modules are provided at the rear of the chassis for efficient
front-to-back cooling.
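The power figures above imply a rough per-module budget. With N+N redundancy, one 8,400 W supply must be able to carry a shelf alone, so dividing a single supply's rating across the twelve server modules in a shelf gives a ballpark figure (illustrative arithmetic only; a real budget also covers NEMs, EMs, fans, and conversion losses):

```python
# Rough per-module power budget for one Sun Blade 6048 shelf.
# N+N redundancy means usable shelf power equals one supply's rating.
shelf_supply_watts = 8_400
modules_per_shelf = 12

per_module_budget = shelf_supply_watts / modules_per_shelf
assert per_module_budget == 700.0   # watts per server module, roughly
```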
Standard I/O Through a Passive Midplane
In essence, the passive midplane in the Sun Blade 6048 Modular System is a collection
of wires and connectors between different modules in the chassis. Since there are no
active components, the reliability of this printed circuit board is extremely high, in
the millions of hours. The passive midplane provides electrical connectivity between
the server modules and the I/O modules.
All front and rear modules connect directly to the passive midplane, with the exception
of the power supplies and the fan modules. The power supplies connect to the
midplane through a bus bar and to the AC inputs via a cable harness. The redundant
fan modules plug individually into a set of three fan boards, where fan speed control and
other chassis-level functions are implemented. The front fan modules that cool the PCI
Express ExpressModules each connect to the chassis via self-aligning, blind-mate
connections. The main functions of the midplane include:
Providing a mechanical connection point for all of the server modules
Providing 12 VDC from the power supplies to each customer-replaceable module
Providing 3.3 VDC power used to power the System Management Bus devices on each
module, and to power the CMM
Providing a PCI Express interconnect between the PCI Express root complexes on each
server module to the EMs and NEMs installed in the chassis
Connecting the server modules, CMMs, and NEMs to the management network
Each server module is energized through the midplane from the redundant chassis
power grid. The midplane also provides connectivity to the I2C network in the chassis,
allowing each server module to directly monitor the chassis environment, including fan
and power supply status as well as various temperature sensors. A number of I/O links
are also routed through the midplane for each server module. Connection details differ
depending on the selected server module and associated NEMs. As an example,
Figure 8 illustrates the dual-node Sun Blade X6275 server module configured with the
Sun Blade 6048 InfiniBand QDR Switched NEM with connections that include:
An x8 PCI Express 2.0 link connecting from each compute node to a dedicated EM
Two Gigabit Ethernet links to the NEM, one from each compute node
Two 4x QDR InfiniBand connections to the NEM, one from each compute node
An Ethernet connection from the server module to the CMM for management
Figure 8. Distribution of communications links from a typical Sun Blade 6000
server module
Tight Integration with Sun Datacenter InfiniBand Switches
Providing dense connectivity to servers while minimizing cables is one of the issues
facing large HPC cluster deployments. The Sun Blade 6048 QDR Switched InfiniBand
NEM solves this challenge and improves both density and reliability by integrating
connections and switch components into a dual-height NEM form factor for the Sun
Blade 6048 chassis. As a part of the Sun Constellation System, the NEM uses common
components, cables, connectors, and architecture with the Sun Datacenter InfiniBand
Switch 648 and 72.
density are key, each Sun Blade X6275 server module features two compute nodes, with
each node supporting two sockets for Intel Xeon Processor 5500 Series CPUs and up to
96 GB of memory (Figure 10).
Figure 10. The Sun Blade X6275 server module provides two compute nodes on a
single server module.
Figure 11 shows a block-level representation of how the Sun Blade X6275 server module
connects to the Sun Blade 6048 InfiniBand QDR Switched NEM. In this configuration,
twelve ports from each switch chip (24 total) are used to communicate with the two
compute nodes on each Sun Blade X6275 server module, with nine ports used to
connect the two switches together. The 30 remaining ports (15 per switch chip) are
used as uplinks to either other QDR switched NEMs or external InfiniBand switches.
Sun Blade 6048 InfiniBand QDR Switched NEMs can be connected together directly to
provide mesh or 3D torus fabrics. Alternately, one or more Sun Datacenter InfiniBand
Switch 648 or 72 can be connected to provide Clos fabric implementations. The external
ports use industry-standard CXP connectors that aggregate three 4x ports into a single
12x connector.
Figure 11. The Sun Blade 6048 InfiniBand QDR Switched NEM connects directly to
Mellanox HCAs on both nodes of the Sun Blade X6275 server module
In the default configuration, and for clusters that utilize up to four Sun Datacenter
InfiniBand Switch 648s, the switch provides a non-blocking fabric. To maintain a non-
blocking fabric in configurations of larger than four switches, an external 12x cable can
link two of the external CXP connectors (one connected to each internal switch chip) to
interconnect the two switches with an additional three 4x connections. This
configuration fully meshes the InfiniScale IV chips on the Sun Blade 6048 InfiniBand
QDR Switched NEM, with a total of 12 ports communicating between the two Mellanox
InfiniScale IV InfiniBand switches, while still leaving 24 4x ports (eight 12x CXP
connectors) available as switch uplinks.
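The port accounting described above can be tallied per 36-port chip (two chips per NEM). This sketch simply restates the budgets from the text as arithmetic:

```python
# Port budget of one 36-port InfiniScale IV chip on the Sun Blade 6048
# InfiniBand QDR Switched NEM (the NEM carries two such chips).
RADIX = 36
node_ports = 12        # one 4x QDR link per compute node served by the chip
internal_links = 9     # default chip-to-chip links on the NEM itself

# Default configuration: remaining ports serve as uplinks.
uplinks = RADIX - node_ports - internal_links
assert uplinks == 15                # 15 per chip, 30 total across both chips
assert 2 * uplinks == 30

# Fully meshed configuration: one external 12x cable adds three more
# 4x chip-to-chip links per chip, for 12 inter-switch ports in total.
meshed_uplinks = RADIX - node_ports - (internal_links + 3)
assert meshed_uplinks == 12
assert 2 * meshed_uplinks == 24     # 24 4x ports = eight 12x CXP connectors
```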
Scaling to Multiple Sun Datacenter InfiniBand Switch 648
Designers need the ability to scale supercomputing deployments without being
constrained by arbitrary limitations. The Sun Datacenter InfiniBand Switch 648 lets
organizations scale from mid-sized InfiniBand deployments that may only populate a
portion of a single Sun Datacenter InfiniBand Switch 648 chassis, to very large
deployments built from multiple Sun Datacenter InfiniBand Switch 648 units. As with single-
switch configurations, a multiswitch system still functions and is managed as a single
entity, greatly reducing management complexity.
A single Sun Datacenter InfiniBand Switch 648 can be deployed for configurations
that require up to 648 compute nodes.

Up to eight Sun Datacenter InfiniBand Switch 648 units can be configured to serve
up to 5,184 compute nodes.
Certain requirements exist for maintaining a non-blocking InfiniBand fabric. Table 2
lists the supported numbers of Sun Blade 6048 chassis, Sun Datacenter InfiniBand
Switch 648 units, line cards, Sun Blade 6048 InfiniBand QDR Switched NEMs, and 12x
cables needed to support various numbers of compute nodes using Sun Blade X6275
server modules. All listed configurations are non-blocking.

Table 2. Maximum numbers of Sun Blade X6275 server modules and Sun Blade 6048 Modular Systems
supported by various numbers of Sun Datacenter InfiniBand Switch 648.
Sun Blade    Sun Datacenter    Line Cards    Sun Blade 6048    12x Cables    Total Compute
6048         InfiniBand        Required      InfiniBand QDR    Required      Nodes
Chassis      Switch 648                      Switched NEMs                   Supported
    1              1                2              4                32              96
    2              1                3              8                64             192
    3              1                4             12                96             288
    4              1                6             16               128             384
    5              1                7             20               160             480
    6              1                8             24               192             576
    8              2               11             32               256             768
   10              2               14             40               320             960
   12              2               16             48               384           1,152
   24              4               32             96               768           2,304
   48              8               64            192             1,536           4,608
   54              8               72            216             1,728           5,184
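The rows of Table 2 follow from a few fixed ratios stated elsewhere in this paper: each chassis holds four NEMs (one per shelf), each NEM uses eight 12x uplink cables, each chassis contributes 96 compute nodes (48 dual-node X6275 modules), and each switch terminates at most 648 4x ports. The 24-cables-per-line-card figure is inferred from the table itself. A sketch that reproduces the table (helper names are illustrative):

```python
from math import ceil

NODES_PER_CHASSIS = 96      # 48 dual-node Sun Blade X6275 server modules
NEMS_PER_CHASSIS = 4        # one dual-height NEM per shelf
CABLES_PER_NEM = 8          # eight 12x CXP uplink cables per NEM
PORTS_PER_SWITCH = 648      # 4x ports per Switch 648
CABLES_PER_LINE_CARD = 24   # inferred: 12x CXP connectors per line card

def config(chassis):
    """Return one Table 2 row: (chassis, switches, line cards,
    NEMs, 12x cables, compute nodes)."""
    cables = chassis * NEMS_PER_CHASSIS * CABLES_PER_NEM
    switches = ceil(chassis * NODES_PER_CHASSIS / PORTS_PER_SWITCH)
    line_cards = ceil(cables / CABLES_PER_LINE_CARD)
    return (chassis, switches, line_cards,
            chassis * NEMS_PER_CHASSIS, cables,
            chassis * NODES_PER_CHASSIS)

# Spot checks against Table 2.
assert config(1) == (1, 1, 2, 4, 32, 96)
assert config(6) == (6, 1, 8, 24, 192, 576)
assert config(8) == (8, 2, 11, 32, 256, 768)
assert config(54) == (54, 8, 72, 216, 1728, 5184)
```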
Chapter 4
Scalable and Manageable Storage
Large-scale supercomputing clusters place significant demands on storage systems. The
enormous computational performance gains that have been realized through
supercomputing clusters are capable of generating ever-larger quantities of data at very
high rates. Effective HPC storage solutions must provide cost-effective capacity, and
throughput must be able to scale along with the performance of cluster compute
nodes. In addition, users and systems alike need fast access to data and home
directories, and longer-term retention and archival are increasingly important in HPC
and supercomputing environments. These diverse demands require a robust range of
integrated storage offerings.
Storage for Clusters
Along with the general growth in storage capacity requirements and the sheer number
of files stored, large HPC environments are seeing significant growth in the numbers of
users needing convenient access to their files. All users want to access their essential
data quickly and easily without having to perform extraneous steps. Organizations also
want to get the best utilization possible from their computational systems.
Unfortunately, storage speeds have seriously lagged behind computational
performance for years, and HPC users are increasingly concerned about storage
benchmarks, the increasing complexity of the I/O path, and the range of solutions
required to provide complete storage solutions.
Of particular importance, large HPC environments need to be able to effectively
manage the flow of high volumes of data through their storage infrastructure,
requiring:
Storage that acts as a resilient compute engine data cache, to match the streaming
rates of applications running on the compute cluster
Storage that provides longer-term retention and archive, to store massive quantities
of essential data to tiered disk or tape hierarchies
A range of scalable and parallel file systems and integrated data management
software to help move file system data from near-term cache to longer-term
retention and archiving, and back on demand
Even as the capacities of individual disk drives have risen, and prices have fallen, high-
volume parallel storage systems have remained expensive and complex. With
experience deploying petabytes of storage into large supercomputing clusters, Sun
understands the key issues needed to deliver high-capacity, high-throughput storage in
a cost-effective and manageable fashion. As an example, the Tokyo Institute of
Technology (TiTech) TSUBAME supercomputing cluster was initially deployed with 1.1
petabytes of storage provided by clustered Sun Fire X4500 storage servers and the
Lustre parallel file system.
Clustered Sun Fire X4540 Storage Servers as Data Cache
Ideal for building storage clusters to serve as cluster scratch space or data cache, the
Sun Fire X4540 storage server defines a new category of system. These innovative
systems closely couple a general-purpose enterprise-class x64 server with high-density
storage, all in a very compact form factor. Supporting up to 48 terabytes in only four
rack units, the Sun Fire X4540 storage server also provides considerable compute power
with dual sockets for Third-Generation Quad-Core and enhanced Quad-Core AMD
Opteron processors. The server can also be configured for high-throughput InfiniBand
networking allowing it to be connected directly to Sun InfiniBand switches. With
support for up to 48 internal 500 GB or 1 TB disk drives, the Sun Fire X4540 storage
server is ideal for large cluster deployments running the Linux OS and the Lustre
parallel file system.
Figure 12. The Sun Fire X4540 storage server provides up to 48 terabytes of
compact storage in only four rack units, ideal for configuration as cluster
scratch space using the Lustre parallel file system.
The Sun Fire X4540 storage server represents an innovative design that provides high
throughput and high-speed access to the 48 directly-attached, hot-plug Serial ATA (SATA)
disk drives. Designed for datacenter deployment, the efficient system is cooled
from front to back across the components and disk drives. Each Sun Fire X4540 storage
server provides:
Minimal cost per gigabyte utilizing SATA II storage and software RAID 6 with six
SATA II storage controllers connecting to 48 high-performance SATA disk drives
High performance from an industry-standard x64 server based on two Quad-Core or
enhanced Quad-Core AMD Opteron processors
Maximum memory and bandwidth scaling from embedded single-channel DDR
memory controllers on each processor, delivering up to 64 GB of capacity
High-performance I/O from two PCI-X slots that deliver over 8.5 gigabits per second
of plug-in I/O bandwidth, including support for InfiniBand HCAs
Easy maintenance and overall system reliability and availability from redundant hot-
pluggable disks, power supply units, fans, and I/O
Parallel file systems are required for moving massive amounts of data through
supercomputing clusters. Given its strengths, the Sun Fire X4540 storage server is now a
standard component of many large supercomputing cluster deployments around the
world. Large grids and clusters need high-performance heterogeneous access to data,
and the Sun Fire X4540 storage server provides both high throughput as well as
essential scalability that allows parallel file systems to perform at their best. Together
with the Lustre parallel file system and the Linux OS, the Sun Fire X4540 storage server
also serves as the key component for the Sun Lustre Storage System (discussed in the
following section).
The Sun Lustre Storage System
The Lustre file system is a software-only architecture that supports a number of
different hardware implementations. Lustre's state-of-the-art object-based storage
architecture provides ground-breaking I/O and metadata throughput, with considerable
reliability, scalability, and performance advantages. The Lustre file system currently
scales to thousands of nodes and hundreds of terabytes of storage, with the potential
to support tens of thousands of nodes and petabytes of data.
Building on the strengths of the Lustre parallel file system, the Sun Lustre Storage
System is architected using Sun Open Storage systems that deliver exceptional
performance and provide additional value. The main components of a typical Lustre
architecture include:
Lustre file system clients (Lustre clients)
Metadata Servers (MDS)
Object Storage Servers (OSS)
Metadata Servers and Object Storage Servers implement the file system and
communicate with the Lustre clients. The MDS manages and stores metadata, such as
file names, directories, permissions and file layout. Configurations also require one or
more Lustre Object Storage Server (OSS) modules, which provide scalable I/O
performance and storage capacity.
Beyond these standard components, all Sun Lustre Storage System configurations
include a High Availability Lustre Metadata Server (HA MDS) module that provides
failover. For
maximum flexibility, the Sun Lustre Storage System defines two OSS modules: a
Standard OSS module for greatest density and economy, and an HA OSS module that
provides OSS failover for environments where automated recovery from OSS failure is
important (Figure 13).
Figure 13. High availability metadata servers (HA MDS) and high availability
object storage servers (HA OSS) allow for file system failover in Lustre
configurations.
HA MDS Module
Designed to meet the critical requirement of high availability, the HA MDS
module is common to all Sun Lustre Storage System configurations. This module
includes a pair of Sun Fire X4270 servers with an attached Sun Storage J4200 array
acting as shared storage. Internal boot drives in the Sun Fire X4270 server are
mirrored for added protection. The Sun Fire X4270 server features two quad-core
Intel Xeon Processor 5500 Series (Nehalem) CPUs and is configured with
24 GB RAM.
Standard OSS Module
The Sun Fire X4540 server was chosen for use as the Standard OSS module. As
discussed, the Sun Fire X4540 server features an innovative architecture that
combines a high-performance server, high I/O bandwidth, and very high density
storage in a single integrated system.
HA OSS Module
Each HA OSS module includes two Sun Fire X4270 servers and four Sun Storage
J4400 arrays. Sun Fire X4270 servers were chosen for the HA OSS module because,
with six PCI Express slots, the Sun Fire X4270 server has the ability to drive the
high throughput required in Sun Lustre Storage System environments. The Sun
Storage J4400 array was chosen for the HA OSS module because it offers
compelling storage density, connectivity, higher availability and very low price per
gigabyte. With redundant SAS I/O Modules and front-serviceable disk drives, the
Sun Storage J4400 array helps the Sun Lustre Storage System deliver price/
performance advantages without sacrificing RAS features.
Sun can reference many storage installations that have achieved impressive scalability
results. One such reference is the Texas Advanced Computing Center's Ranger System
(see http://www.tacc.utexas.edu/resources/hpcsystems/#ranger), where Sun has
demonstrated near-linear scalability in a configuration encompassing fifty similar
previous-generation Sun OSS modules with a single HA MDS module supporting a file
system that was 1.2 petabytes in size. Figure 14 shows the Lustre file system
throughput at TACC where throughput rates of 45 GB/sec with peaks approaching 50
GB/sec have been observed. In addition, TACC has experienced near-linear throughput on a single application's use of the Lustre file system at 35 GB/sec.
Figure 14. Lustre parallel file system performance at TACC
More information on implementing the Lustre parallel file system can be found in the
Sun BluePrints article titled Solving the HPC I/O Bottleneck: Sun Lustre Storage System
(http://wikis.sun.com/display/BluePrints/Solving+the+HPC+IO+Bottleneck+-
+Sun+Lustre+Storage+System).
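The stripe-count curves in Figure 14 reflect how Lustre distributes a file round-robin across object storage targets (OSTs), so that a single file can draw bandwidth from many servers at once. The sketch below illustrates the idea with a hypothetical helper function; it is not the Lustre implementation.

```python
# Sketch: round-robin striping of a file across OSTs, as Lustre does.
# Higher stripe counts let one file draw bandwidth from more servers.

def stripe(data, stripe_count, stripe_size):
    """Assign each stripe_size chunk of data to an OST, round-robin."""
    osts = [bytearray() for _ in range(stripe_count)]
    for i in range(0, len(data), stripe_size):
        osts[(i // stripe_size) % stripe_count].extend(data[i:i + stripe_size])
    return osts

data = bytes(range(8)) * 128          # a 1 KiB "file"
osts = stripe(data, stripe_count=4, stripe_size=64)
print([len(o) for o in osts])         # [256, 256, 256, 256]
```

With a stripe count of 4, a large sequential read or write is spread evenly over four object storage servers, which is why the stripecount = 4 curves in Figure 14 sustain higher aggregate throughput than stripecount = 1 for a single application.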
[Figure 14: two panels plot Write Speed (GB/sec) against the number of writing clients, for $SCRATCH file system throughput and $SCRATCH application performance, with curves for stripe counts of 1 and 4.]
ZFS and Sun Storage 7000 Unified Storage Systems
While high-throughput cluster scratch space is critical, clusters also need storage that serves other needs. Some of an organization's most important data includes completed simulations and key source data. Clusters need storage that provides scalable, reliable, and robust storage for tier-1 data archival and users' home directories.
To address this need, Sun Storage 7000 Unified Storage Systems incorporate an open-
source operating system, commodity hardware, and industry-standard technologies.
These systems represent low-cost, fully-functional network attached storage (NAS)
storage devices designed around the following core technologies:
• General-purpose x64-based servers (that function as the NAS head), and Sun Storage products' proven high-performance commodity hardware with compelling price-performance points
• The ZFS file system, the world's first 128-bit file system, with unprecedented availability and reliability features
• A high-performance networking stack using IPv4 or IPv6
• DTrace Analytics, which provides dynamic instrumentation for real-time performance analysis and debugging
• Sun Fault Management Architecture (FMA) for built-in fault detection, diagnosis, and self-healing for common hardware problems
• A large and adaptive two-tiered caching model, based on DRAM and enterprise-class solid state devices (SSDs)
To meet varied needs for capacity, reliability, performance, and price, the product
family includes four models: the Sun Storage 7110, 7210, 7310, and 7410
Unified Storage Systems (Figure 15). Configured with appropriate data processing and
storage resources, these systems can support a wide range of requirements in HPC
environments.
Figure 15. Sun Storage 7000 Unified Storage Systems
Sun Storage 7110 Unified Storage System
Sun Storage 7210 Unified Storage System
Sun Storage 7410 Unified Storage System
Sun Storage 7310 Unified Storage System
Tight integration of the ZFS scalable file system
Sun Storage 7000 Unified Storage Systems are powered by the ZFS scalable file system.
ZFS offers a dramatic advance in data management with an innovative approach to
data integrity, tremendous performance improvements, and a welcome integration of
both file system and volume management capabilities. A true 128-bit file system, ZFS
removes all practical limitations for scalable storage, and introduces pivotal new
concepts such as hybrid storage pools that de-couple the file system from physical
storage. This radical new architecture optimizes and simplifies code paths from the
application to the hardware, producing sustained throughput at near platter speeds.
New block allocation algorithms accelerate write operations, consolidating what would
traditionally be many small random writes into a single, more efficient write sequence.
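The write-consolidation behavior described above can be sketched as a simple buffering model: absorb random writes in memory, then flush them as one ordered batch. This is a loose, hypothetical model in the spirit of ZFS transaction groups, not the actual allocator.

```python
# Sketch of consolidating small random writes into one sequential flush,
# loosely modeled on ZFS transaction groups. Hypothetical and simplified.

class WriteBuffer:
    def __init__(self):
        self.pending = {}            # block number -> data

    def write(self, block, data):
        self.pending[block] = data   # absorb random writes in memory

    def flush(self):
        # Issue one ordered write sequence instead of many scattered seeks.
        batch = sorted(self.pending.items())
        self.pending.clear()
        return batch

buf = WriteBuffer()
for block in (97, 3, 41, 3):         # random order, with one overwrite
    buf.write(block, b"x")
print([b for b, _ in buf.flush()])   # [3, 41, 97]
```

Note that the overwrite of block 3 is absorbed in memory, so only the final version reaches disk, and the batch is issued in ascending block order, turning random I/O into a near-sequential pattern.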
Silent data corruption is corruption that goes undetected, and for which no error
messages are generated. This particular form of data corruption is of special concern to HPC applications since they typically generate, store, and archive significant amounts
of data. In fact, a study by CERN1 has shown that silent data corruption, including disk
errors, RAID errors, and memory errors, is much more common than previously
imagined. ZFS provides end-to-end checksumming for all data, greatly reducing the risk
of silent data corruption.
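End-to-end checksumming can be illustrated with a minimal sketch: store a checksum separately from the data and verify it on every read. Here `hashlib` stands in for ZFS's block checksums; the store and function names are hypothetical, not the ZFS implementation.

```python
# Minimal sketch of end-to-end checksumming to catch silent corruption.
# hashlib stands in for ZFS block checksums; not the real implementation.
import hashlib

def write_block(store, key, data):
    store[key] = (data, hashlib.sha256(data).hexdigest())

def read_block(store, key):
    data, checksum = store[key]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("silent corruption detected in block %r" % key)
    return data

store = {}
write_block(store, 0, b"simulation results")
assert read_block(store, 0) == b"simulation results"

# Flip a bit "on disk" without updating the checksum: the read catches it.
data, checksum = store[0]
store[0] = (b"simulatiom results", checksum)
try:
    read_block(store, 0)
except IOError as e:
    print(e)   # silent corruption detected in block 0
```

The key point is that the checksum travels with the block pointer rather than the block itself, so a disk, controller, or cable that silently flips a bit cannot also "fix" the checksum, and the error surfaces at read time instead of going undetected.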
Sun Storage 7000 Unified Storage Systems rely heavily on ZFS for key functionality such
as Hybrid Storage Pools. By automatically allocating space from pooled storage when
needed, ZFS simplifies storage management and gives organizations the flexibility to
optimize data for performance. Hybrid Storage Pools also effectively combine the
strengths of system memory, flash memory technology in the form of enterprise solid
state drives (SSDs), and conventional hard disk drives (HDDs).
Key capabilities of ZFS related to Hybrid Storage Pools include:
• Virtual storage pools: Unlike traditional file systems that require a separate volume manager, ZFS integrates volume management functions.
• Data integrity: ZFS uses several techniques, such as copy-on-write and end-to-end checksumming, to keep on-disk data self-consistent and eliminate silent data corruption.
• High performance: ZFS simplifies the code paths from the application to the hardware, delivering sustained throughput at near platter speeds.
• Simplified administration: ZFS automates many administrative tasks to speed performance and eliminate common errors.
Sun Storage 7000 Unified Storage Systems utilize ZFS Hybrid Storage Pools to
automatically provide data placement, data protection, and data services such as RAID,
error correction, and system management. By placing data on the most appropriate
storage media, Hybrid Storage Pools help to optimize performance and contain costs.
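The tiered data placement described above can be sketched as a two-level read cache: DRAM first, then SSD, then hard disk, with blocks promoted on access and demoted on eviction. This is a hypothetical model in the spirit of a Hybrid Storage Pool, not the ZFS caching implementation.

```python
# Sketch of a two-tier read cache in the spirit of Hybrid Storage Pools:
# DRAM first, then SSD, then HDD, promoting blocks as they are accessed.
# Hypothetical model; not the ZFS implementation.
from collections import OrderedDict

class TieredCache:
    def __init__(self, dram_size, ssd_size, hdd):
        self.dram = OrderedDict()    # fastest, smallest tier
        self.ssd = OrderedDict()     # larger, slower flash tier
        self.hdd = hdd               # backing store (block -> data)
        self.dram_size, self.ssd_size = dram_size, ssd_size

    def read(self, block):
        if block in self.dram:
            return self.dram[block]
        data = self.ssd.pop(block, None) or self.hdd[block]
        self.dram[block] = data                      # promote to DRAM
        if len(self.dram) > self.dram_size:          # evict coldest to SSD
            cold, cold_data = self.dram.popitem(last=False)
            self.ssd[cold] = cold_data
            if len(self.ssd) > self.ssd_size:
                self.ssd.popitem(last=False)         # now served by HDD only
        return data

cache = TieredCache(dram_size=2, ssd_size=4,
                    hdd={n: b"d%d" % n for n in range(10)})
print(cache.read(1), cache.read(2), cache.read(3))
print(list(cache.dram))   # block 1 has been demoted to the SSD tier
```

The design point this models is that hot data migrates toward the fastest, most expensive media automatically, so administrators size the tiers rather than hand-placing data.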
Sun Storage 7000 Unified Storage Systems feature a common, easy-to-use management
1. "Silent Corruptions," Peter Kelemen, CERN After C5, June 1st, 2007
interface, along with a comprehensive analytics environment to help isolate and
resolve issues. The systems support NFS, CIFS, and iSCSI data access protocols, mirrored
and parity-based data protection, local point-in-time (PIT) copy, remote replication, data
checksum, data compression, and data reconstruction.
Long-Term Retention and Archive
Staging, storing, and maintaining HPC data requires a massive repository of on-line and
near-line storage to support data retention and archival needs. High-speed data
movement must be provided between computational and archival environments. The
Sun Constellation System addresses this need by integrating with a wealth of
sophisticated Sun StorageTek options, including:
• Sun StorageTek SL8500 and SL500 Modular Library Systems
• Sun StorageTek 6540 and 6140 Modular Arrays
• High-speed data movers
• Sun StorageTek 5800 system fixed-content archive
The comprehensive Sun StorageTek software offering is key to facilitating seamless
migration of data between cache and archival.
Sun StorageTek QFS
Sun StorageTek QFS software provides high-performance heterogeneous shared
access to data over a storage area network (SAN). Users across the enterprise get
shared access to the same large files or data sets simultaneously, speeding time to
results. Up to 256 systems running Sun StorageTek QFS technology can have
shared access to the same data while maintaining file integrity. Data can be written and accessed at device-rated speeds, providing superior application I/O
rates. Sun StorageTek QFS software also provides heterogeneous file sharing using
NFS, CIFS, Apple Filing Protocol, FTP, and Samba.
Sun StorageTek Storage Archive Manager (SAM) software
Large HPC installations must manage the considerable storage required by
multiple projects running large-scale computational applications on very large
datasets. Solutions must provide a seamless and transparent migration for
essential archival data between disk and tape storage systems. Sun StorageTek
Storage Archive Manager (SAM) addresses this need by providing data
classification and policy-driven data placement across tiers of storage.
Organizations can benefit from data protection as well as long-term retention and
data recovery to match their specific needs.
Chapter 6 provides additional detail and a graphical depiction of how caching file
systems such as the Lustre parallel file system combine with SAM-QFS in a real-world
example to provide data management in large supercomputing installations.
Chapter 5
Sun HPC Software
As clusters move from the realm of supercomputing to the enterprise, cluster software
has never been more important. Organizations deploying clusters at all levels need
better ways to control and monitor often expansive cluster deployments in ways that
benefit their users and applications. Unfortunately, collecting, assembling, testing, and
patching all of the requisite software components for effective cluster operation has
proved challenging, to say the least.
Available in both a Linux Edition and a Solaris Developer Edition, Sun HPC Software is
designed to address these needs. Sun HPC Software, Linux Edition is detailed in the
sections that follow. For more information on Sun HPC Software, Solaris Developer
Edition, please see http://wikis.sun.com/display/hpc/Sun+HPC+Software,+Solaris+Developer+Edition+1.0+Beta+1
Sun HPC Software, Linux Edition
Many HPC customers are demanding Linux-based HPC solutions with open source
components. To answer these demands, Sun has introduced Sun HPC Software, Linux
Edition, an integrated solution for Linux HPC clusters based on Sun hardware. More
than a mere collection of software components, Sun HPC Software simplifies the entire
process of deploying and managing large-scale Linux HPC clusters, providing
considerable potential savings in maintenance time and expense.
From its inception, the project's goals were to provide an open product: one that
uses as much open source software as possible, and one that depends on and enhances
the community aspects of software development and consolidation. The ongoing goals
for Sun HPC software are to:
• Provide simple, scalable provisioning of bare-metal systems into a running HPC cluster
• Validate configurations
• Dramatically reduce time-to-results
• Offer integrated management and monitoring of the cluster
• Employ a community-driven process
Seamless and Scalable Integration
Sun HPC Software, Linux Edition covers the entire cluster life-cycle. The software
provides everything needed to provision the cluster nodes, verify that the software and
hardware are working correctly, manage the cluster, and monitor the cluster's
performance and health. All of the components have been fully tested on Sun HPC
hardware, so that the likelihood of post-installation integration problems is significantly
reduced.
Because clusters can vary widely in size, Sun HPC software is designed to be scalable,
and all of the components are selected with large numbers of nodes in mind. For
example, the Lustre parallel file system and OneSIS provisioning software are both well
known for working well with clusters of thousands of nodes. Tools that provision, verify, manage, and monitor the cluster were likewise selected for their
scalability to reduce the management cost as clusters grow.
Sun HPC software, Linux Edition is built to be completely modular so that organizations
can customize it according to their own preferences and requirements. The modular
framework provides a ready-made stack that contains the components required to
deploy an HPC cluster. Add-on components let organizations make specific choices
beyond the core software installed. Figure 16 provides a high-level perspective of Sun
HPC Software, Linux Edition. For more specific information on the components provided
at each level, please see www.sun.com/software/products/hpcsoftware, or send an e-mail to [email protected] to join the community.
Figure 16. Sun HPC Software stack, Linux Edition
Sun HPC Software, Linux Edition 2.0.1 contains the components listed in Table 3.
Table 3. Sun HPC Software 2.0.1 components
Operating system and kernel: Red Hat Enterprise Linux, CentOS Linux, SuSE Linux Enterprise Server, Lustre parallel file system, perfctr
User space libraries: Allinea DDT, Env-switcher, genders, git, Heartbeat, Intel Compiler, Mellanox firmware tools, Modules, MVAPICH and MVAPICH2, OFED, OpenMPI, PGI compiler, RRDtool, Sun Studio, Sun HPC ClusterTools, TotalView
Verification: HPCC Bench Suite, Lustre IOkit, IOR, LNET selftest, NetPIPE
Schedulers: Sun Grid Engine, PBS, LSF, SLURM, MOAB, MUNGE
Monitoring: Ganglia
Provisioning: OneSIS, Cobbler
Management: CFEngine, ConMan, FreeIPMI, gdb, IBSRM, IPMItool, lshw, OpenSM, pdsh, PowerMan, Sun Ops Center
Simplified Cluster Provisioning
Sun HPC Software, Linux Edition is specifically designed to simplify the complex task of
provisioning systems as a part of a clustered environment. For example, the software
stack includes the OneSIS open source software tool developed at Sandia National Laboratories. This tool is specifically designed to ease system administration in large-scale Linux cluster environments.
The software stack itself can be downloaded from the Web, and is designed to fit onto a
single DVD. While installing the first system in a cluster might take place from the DVD,
it goes without saying that installing an entire large cluster in this fashion would
consume unacceptable amounts of time, not to mention the additional time to
maintain and update individual system images. With OneSIS, administrators can create
system images that define the behavior of the entire computing infrastructure.
A typical installation process approximates the following:
• The software is first installed onto a management node. Installing the system locally via DVD typically takes about 20 minutes from bare metal to a login prompt.
• Configuring the system requires another 20 minutes at most.
• Other cluster systems are then booted onto the master image.
The cluster can be running and ready to accept jobs in as little as 50 minutes.
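The arithmetic behind the 50-minute figure can be made explicit. The install and configuration times below are the ones quoted above; the image-boot time for the remaining nodes is an assumption introduced only to make the total work out, since the text does not break it out separately.

```python
# Rough timeline implied above: install and configure the management node
# once, then boot every other node from the shared master image in parallel.
# Install/configure figures are from the text; boot time is an assumption.

install_management_node = 20    # minutes, DVD install to login prompt
configure_management_node = 20  # minutes, at most
boot_nodes_from_image = 10      # assumed: remaining nodes netboot in parallel

total = install_management_node + configure_management_node + boot_nodes_from_image
print(total)   # 50 minutes, matching "as little as 50 minutes"
```

The important property is that the image-boot term does not grow with node count: because every compute node boots the same master image in parallel, adding nodes does not add serial installation time.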
Chapter 6
Deploying Supercomputing Clusters Rapidly
with Less Risk
Sun has considerable experience helping organizations deploy supercomputing clusters
specific to their computational, storage, and collaborative requirements.
Complementing the compelling capabilities of the Sun Constellation System, Sun
provides a range of services that are specifically aimed at delivering results for HPC-
focused organizations. Suns partnership with the Texas Advanced Computing Center
(TACC) at the University of Texas at Austin to deliver the Sun Constellation System in the
3,936-node Ranger supercomputing cluster is one such example.
Sun Datacenter Express Services
Sun's new Datacenter Express Services provide a comprehensive, al