
HPC Technology Compass 2014/15

Cover: High Performance Computing 2014/15 – Technology Compass
Application areas: Sciences, Risk Analysis, Simulation, Big Data Analytics, CAD, High Performance Computing


More Than 30 Years of Experience in Scientific Computing

1980 marked the beginning of a decade where numerous start-ups were created, some of which later transformed into big players in the IT market. Technical innovations brought dramatic changes to the nascent computer market. In Tübingen, close to one of Germany's prime and oldest universities, transtec was founded.

In the early days, transtec focused on reselling DEC computers and peripherals, delivering high-performance workstations to university institutes and research facilities. In 1987, SUN/Sparc and storage solutions broadened the portfolio, enhanced by IBM/RS6000 products in 1991. These were the typical workstations and server systems for high performance computing then, used by the majority of researchers worldwide.

In the late 90s, transtec was one of the first companies to offer highly customized HPC cluster solutions based on standard Intel architecture servers, some of which entered the TOP500 list of the world's fastest computing systems.

Thus, given this background and history, it is fair to say that transtec looks back on more than 30 years of experience in scientific computing; our track record shows more than 750 HPC installations. With this experience, we know exactly what customers' demands are and how to meet them. High performance and ease of management – this is what customers require today. HPC systems are certainly required to peak-perform, as their name indicates, but that is not enough: they must also be easy to handle. Unwieldy design and operational complexity must be avoided or at least hidden from administrators and particularly users of HPC computer systems.

This brochure focuses on where transtec HPC solutions excel. transtec HPC solutions use the latest and most innovative technology: Bright Cluster Manager as the technology leader for unified HPC cluster management, the leading-edge Moab HPC Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our systems, and Panasas HPC storage systems for the highest performance and the ease of management required of a reliable HPC storage system. Again, with these components, usability, reliability and ease of management are central issues that are addressed, even in a highly heterogeneous environment. transtec is able to provide customers with well-designed, extremely powerful solutions for Tesla GPU computing, as well as thoroughly engineered Intel Xeon Phi systems. Intel's InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before, and Numascale provides excellent technology for AMD-based large-SMP systems. transtec masterfully combines these excellent, well-chosen components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.

Your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations, to HPC Cloud Services.

Last but not least, transtec HPC Cloud Services provide customers with the possibility to have their jobs run on dynamically provided nodes in a dedicated datacenter, professionally managed and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as many codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud management environment, and ready to run.

Have fun reading the transtec HPC Compass 2014/15!

Technology Compass – Table of Contents and Introduction

High Performance Computing .......................................... 4
  Performance Turns Into Productivity .......................... 6
  Flexible Deployment With xCAT ................................... 8
  Service and Customer Care From A to Z ....................... 10

Advanced Cluster Management Made Easy ......................... 12
  Easy-to-use, Complete and Scalable ........................... 14
  Cloud Bursting With Bright ....................................... 18

Intelligent HPC Workload Management ............................ 30
  Moab HPC Suite – Basic Edition ................................. 32
  Moab HPC Suite – Enterprise Edition ........................... 34
  Moab HPC Suite – Grid Option .................................... 36
  Optimizing Accelerators with Moab HPC Suite ................. 38
  Moab HPC Suite – Application Portal Edition ................. 42
  Moab HPC Suite – Remote Visualization Edition ............... 44
  Using Moab With Grid Environments ............................. 46

Remote Visualization and Workflow Optimization ............... 52
  NICE EnginFrame: A Technical Computing Portal ............... 54
  Desktop Cloud Virtualization ................................... 58
  Cloud Computing .................................................. 62
  NVIDIA Grid ....................................................... 66

Intel Cluster Ready ................................................ 70
  A Quality Standard for HPC Clusters ........................... 72
  Intel Cluster Ready builds HPC Momentum ...................... 76
  The transtec Benchmarking Center ............................... 80

Parallel NFS ........................................................ 82
  The New Standard for HPC Storage ............................... 84
  What's New in NFS 4.1? ........................................... 86
  Panasas HPC Storage .............................................. 88

NVIDIA GPU Computing .............................................. 100
  What is GPU Computing? .......................................... 102
  Kepler GK110 – The Next Generation GPU ....................... 104
  Introducing NVIDIA Parallel Nsight ............................ 106
  A Quick Refresher on CUDA ....................................... 118
  Intel TrueScale InfiniBand and GPUs ........................... 124

Intel Xeon Phi Coprocessor ....................................... 128
  The Architecture ................................................. 130

InfiniBand ......................................................... 142
  High-speed Interconnects ........................................ 144
  Top 10 Reasons to Use Intel TrueScale InfiniBand ............ 146
  Intel MPI Library 4.0 Performance ............................. 148
  Intel Fabric Suite 7 ............................................. 152

Numascale .......................................................... 158
  Background ....................................................... 160
  NumaConnect Value Proposition .................................. 162
  Technology ....................................................... 163
  Redefining Scalable OpenMP and MPI ............................ 169

Glossary ........................................................... 174


High Performance Computing (HPC) has been with us from the very beginning of the computer era. High-performance computers were built to solve numerous problems which the "human computers" could not handle. The term HPC just hadn't been coined yet. More importantly, some of the early principles have changed fundamentally.

High Performance Computing – Performance Turns Into Productivity



Variations on the theme: MPP and SMP

Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) applications. MPP indicates that the individual processes can each utilize exclusive memory areas. This means that such jobs are predestined to be computed in parallel, distributed across the nodes in a cluster. The individual processes can thus utilize the separate units of the respective node – especially the RAM, the CPU power and the disk I/O.

Communication between the individual processes is implemented in a standardized way through the MPI software interface (Message Passing Interface), which abstracts the underlying network connections between the nodes from the processes. However, the MPI standard (current version 2.0) merely requires source code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are OpenMPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.
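As an illustration of the message-passing model described above, the following minimal sketch uses the mpi4py Python bindings (an assumption; any MPI implementation and language binding would serve) to exchange data between two ranks, each of which owns its own memory.

```python
# Minimal MPI (MPP-style) sketch using the mpi4py bindings.
# Run with an MPI launcher, e.g.: mpirun -np 2 python <this script>
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

if rank == 0:
    # Rank 0 sends a payload to rank 1; no memory is shared between ranks.
    comm.send({"payload": 42}, dest=1, tag=0)
    print(f"rank 0 of {size}: sent payload")
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print(f"rank 1 of {size}: received {data}")
```

The same pattern scales to many ranks distributed across cluster nodes; which interconnect is used underneath is determined by the MPI library the application was built against.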

If the individual processes engage in a large amount of communication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typically around 10 μs. High-speed interconnects such as InfiniBand reduce latency by a factor of 10, down to as low as 1 μs. Therefore, high-speed interconnects can greatly speed up total processing.

The other frequently used variant is called SMP applications. SMP, in this HPC context, stands for Shared Memory Processing. It involves the use of shared memory areas, the specific implementation of which is dependent on the choice of the underlying operating system. Consequently, SMP jobs generally only run on a single node, where they can in turn be multi-threaded and thus be parallelized across the number of CPUs per node. For many HPC applications, both the MPP and the SMP variant can be chosen.
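To contrast with the message-passing sketch above, here is a minimal shared-memory (SMP-style) sketch in Python. The use of multiprocessing.shared_memory is an assumption made purely for illustration (production SMP codes more commonly use OpenMP threads); several workers on one node fill disjoint slices of a single array that lives in shared memory instead of exchanging messages.

```python
# Shared-memory (SMP-style) sketch: workers on one node write to
# disjoint slices of one shared buffer instead of passing messages.
from multiprocessing import Process, shared_memory
import numpy as np

N, WORKERS = 1_000_000, 4   # N is divisible by WORKERS

def fill(shm_name: str, start: int, stop: int) -> None:
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray((N,), dtype=np.float64, buffer=shm.buf)
    arr[start:stop] = np.sqrt(np.arange(start, stop))  # work on own slice only
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=N * 8)
    chunk = N // WORKERS
    procs = [Process(target=fill, args=(shm.name, i * chunk, (i + 1) * chunk))
             for i in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    result = np.ndarray((N,), dtype=np.float64, buffer=shm.buf)
    print("checksum:", result.sum())
    shm.close()
    shm.unlink()
```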

Many applications are not inherently suitable for parallel execution. In such a case, there is no communication between the individual compute nodes, and therefore no need for a high-speed network between them; nevertheless, multiple computing jobs can be run simultaneously and sequentially on each individual node, depending on the number of CPUs. In order to ensure optimum computing performance for these applications, it must be determined how many CPUs and cores deliver the optimum performance. We find applications of this sequential type of work typically in the fields of data analysis or Monte Carlo simulations.
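A Monte Carlo job of this embarrassingly parallel kind can be pictured in a few lines of Python. The sketch below (a toy π estimate, purely illustrative) runs independent workers that never communicate with one another, mirroring the workload type described above.

```python
# Embarrassingly parallel Monte Carlo sketch: independent workers,
# no inter-process communication except collecting the final counts.
import random
from multiprocessing import Pool

def count_hits(samples: int) -> int:
    rng = random.Random()            # each worker gets its own RNG state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    workers, samples_per_worker = 8, 1_000_000
    with Pool(workers) as pool:
        hits = pool.map(count_hits, [samples_per_worker] * workers)
    total = workers * samples_per_worker
    print("pi ~", 4.0 * sum(hits) / total)
```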

Performance Turns Into Productivity

HPC systems in the early days were much different from those we see today. First, we saw enormous mainframes from large computer manufacturers, including a proprietary operating system and job management system. Second, at universities and research institutes, workstations made inroads, and scientists carried out calculations on their dedicated Unix or VMS workstations. In either case, if you needed more computing power, you scaled up, i.e. you bought a bigger machine.

Today the term High Performance Computing has gained a fundamentally new meaning. HPC is now perceived as a way to tackle complex mathematical, scientific or engineering problems. The integration of industry-standard, "off-the-shelf" server hardware into HPC clusters facilitates the construction of computer networks of such power as no single system could ever achieve. The new paradigm for parallelization is scaling out.

Computer-supported simulation of realistic processes (so-called Computer Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research, alongside theory and experimentation. It is nowadays inconceivable that an aircraft manufacturer or a Formula One racing team would operate without using simulation software. And scientific calculations, such as in the fields of astrophysics, medicine, pharmaceuticals and bio-informatics, will to a large extent be dependent on supercomputers in the future. Software manufacturers long ago recognized the benefit of high-performance computers based on powerful standard servers and ported their programs to them accordingly.

The main advantage of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercomputer can be expanded whenever the computational capacity of the system is no longer sufficient, simply by adding further nodes of the same kind. A cumbersome switch to a different technology can be avoided in most cases.


"transtec HPC solutions are meant to provide customers with unparalleled ease-of-management and ease-of-use. Apart from that, deciding for a transtec HPC solution means deciding for the most intensive customer care and the best service imaginable."

Dr. Oliver Tennert, Director Technology Management & HPC Solutions

Flexible Deployment With xCAT

Local Installation or Diskless Installation

We offer a diskful or a diskless installation of the cluster nodes. A diskless installation means the operating system is hosted partially within the main memory; larger parts may or may not be included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, and the cluster is up and running within a very short timescale. Updating the cluster can also be done in a very efficient way: only the boot image has to be updated and the nodes rebooted, after which the nodes run either a new kernel or even a new operating system. Moreover, with this approach, partitioning the cluster can also be done very efficiently, either for testing purposes or for allocating different cluster partitions to different users or applications.

Development Tools, Middleware, and Applications

Depending on the application, optimization strategy, or underlying architecture, different compilers lead to code of very different performance. Moreover, different, mainly commercial, applications require different MPI implementations. And even when the code is self-developed, developers often prefer one MPI implementation over another.

According to the customer's wishes, we install various compilers, MPI middleware, as well as job management systems like Parastation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for high-level cluster management.

xCAT as a Powerful and Flexible Deployment Tool

xCAT (Extreme Cluster Administration Tool) is an open source toolkit for the deployment and low-level administration of HPC cluster environments, small as well as large ones. xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and node deployment with local disks (diskful) or without (diskless); a short scripted example follows the feature list below. The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined, and user-specific scripts can be executed automatically at installation time.

xCAT provides the following low-level administrative features:

• Remote console support
• Parallel remote shell and remote copy commands
• Plugins for various monitoring tools like Ganglia or Nagios
• Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of nodes
• Automatic configuration of syslog, remote shell, DNS, DHCP, and NTP within the cluster
• Extensive documentation and man pages
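As mentioned above, here is a minimal sketch of how such commands might be driven from a script. It assumes an existing xCAT installation with the standard nodels and rpower commands on the PATH, and a node group named compute (the group name is a placeholder); it is an illustration, not part of xCAT itself.

```python
# Query node list and power state through xCAT's command-line tools.
# Assumes xCAT is installed and a node group "compute" exists (placeholder).
import subprocess

def xcat(*args: str) -> str:
    """Run an xCAT command and return its standard output."""
    return subprocess.run(args, check=True, capture_output=True,
                          text=True).stdout

nodes = xcat("nodels", "compute").split()        # expand the node group
print(f"{len(nodes)} nodes in group 'compute'")

power_state = xcat("rpower", "compute", "stat")  # e.g. "node01: on"
print(power_state)
```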

For cluster monitoring, we install and configure the open source tool Ganglia or the even more powerful open source solution Nagios, according to the customer's preferences and requirements.



HPC @ transtec: Services and Customer Care from A to Z

transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Performance clusters based on standard components to academic and industry customers across Europe, with all the high quality standards and the customer-centric approach that transtec is well known for.

Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need. In the early stages of any customer's HPC project, transtec experts provide extensive and detailed consulting to the customer – they benefit from expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration.

Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours or more if necessary. We make sure that any hardware shipped meets our and our customers' quality requirements. transtec HPC solutions are turnkey solutions. By default, a transtec HPC cluster has everything installed and configured – from hardware and operating system to important middleware components like cluster management or developer tools and the customer's production applications. Onsite delivery means onsite integration into the customer's production environment, be it establishing network connectivity to the corporate network, or setting up software and configuration parts.

transtec HPC clusters are ready-to-run systems – we deliver, you turn the key, the system delivers high performance. Every HPC project entails transfer to production: IT operation processes and policies apply to the new HPC system. Effectively, IT personnel is trained hands-on, introduced to hardware components and software, with all operational aspects of configuration management.

transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care. transtec offers a variety of support and service options, tailored to the customer's needs. When you are in need of a new installation, a major reconfiguration or an update of your solution, transtec is able to support your staff and, if you lack the resources for maintaining the cluster yourself, maintain the HPC solution for you. From Professional Services to Managed Services for daily operations and required service levels, transtec will be your complete HPC service and solution provider. transtec's high standards of performance, reliability and dependability assure your productivity and complete satisfaction.

transtec's HPC Managed Services offer customers the possibility of having the complete management and administration of the HPC cluster handled by transtec service specialists, in an ITIL-compliant way. Moreover, transtec's HPC on Demand services provide access to HPC resources whenever customers need them, for example because they do not have the possibility of owning and running an HPC cluster themselves, due to lacking infrastructure, know-how, or admin staff.

transtec HPC Cloud Services

Last but not least, transtec's services portfolio evolves as customers' demands change. Starting this year, transtec is able to provide HPC Cloud Services. transtec uses a dedicated datacenter to provide computing power to customers who are in need of more capacity than they own, which is why this workflow model is sometimes called computing-on-demand. With these dynamically provided resources, customers have the possibility to run their jobs on HPC nodes in a dedicated datacenter, professionally managed and secured, and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud and workload management environment, and ready to run.

Alternatively, whenever customers are in need of space for hosting their own HPC equipment because they do not have the space capacity or cooling and power infrastructure themselves, transtec is also able to provide Hosting Services to those customers who would like to have their equipment professionally hosted, maintained, and managed. Customers can thus build up their own private cloud!

Are you interested in any of transtec's broad range of HPC-related services? Write us an email to [email protected]. We'll be happy to hear from you!

transtec HPC as a Service
You will get a range of applications like LS-Dyna, ANSYS, Gromacs, NAMD etc. from all kinds of areas pre-installed, integrated into an enterprise-ready cloud and workload management system, and ready to run. Do you miss your application? Ask us: [email protected]

transtec Platform as a Service
You will be provided with dynamically provided compute nodes for running your individual code. The operating system will be pre-installed according to your requirements. Common Linux distributions like RedHat, CentOS, or SLES are the standard. Do you need another distribution? Ask us: [email protected]

transtec Hosting as a Service
You will be provided with hosting space inside a professionally managed and secured datacenter where you can have your machines hosted, managed, and maintained according to your requirements. Thus, you can build up your own private cloud. What range of hosting and maintenance services do you need? Tell us: [email protected]

The transtec HPC services workflow: individual presales consulting; application-, customer-, and site-specific sizing of the HPC solution; benchmarking of different systems; burn-in tests of systems; software and OS installation; application installation; onsite hardware assembly; integration into the customer's environment; customer training; maintenance, support and managed services; continual improvement.


Bright Cluster Manager removes the complexity from the installation, management and use of HPC clusters – on premise or in the cloud. With Bright Cluster Manager, you can easily install, manage and use multiple clusters simultaneously, including compute, Hadoop, storage, database and workstation clusters.

Advanced Cluster Management Made Easy

Application areas: Engineering, Life Sciences, Automotive, Price Modelling, Aerospace, CAE, Data Analytics

Easy-to-use, Complete and Scalable

The Bright Advantage

Bright Cluster Manager delivers improved productivity, increased uptime, proven scalability and security, while reducing operating cost:

Rapid Productivity Gains
• Short learning curve: intuitive GUI drives it all
• Quick installation: one hour from bare metal to compute-ready
• Fast, flexible provisioning: incremental, live, diskful, diskless, over InfiniBand, to virtual machines, auto node discovery
• Comprehensive monitoring: on-the-fly graphs, Rackview, multiple clusters, custom metrics
• Powerful automation: thresholds, alerts, actions
• Complete GPU support: NVIDIA, AMD, CUDA, OpenCL
• Full support for Intel Xeon Phi
• On-demand SMP: instant ScaleMP virtual SMP deployment
• Fast customization and task automation: powerful cluster management shell, SOAP and JSON APIs make it easy
• Seamless integration with leading workload managers: Slurm, Open Grid Scheduler, Torque, openlava, Maui, PBS Professional, Univa Grid Engine, Moab, LSF
• Integrated (parallel) application development environment
• Easy maintenance: automatically update your cluster from Linux and Bright Computing repositories
• Easily extendable, web-based User Portal
• Cloud-readiness at no extra cost, supporting the scenarios "Cluster-on-Demand" and "Cluster Extension", with data-aware scheduling
• Deploys, provisions, monitors and manages Hadoop clusters
• Future-proof: transparent customization minimizes disruption from staffing changes

Maximum Uptime
• Automatic head node failover: prevents system downtime
• Powerful cluster automation: drives pre-emptive actions based on monitoring thresholds
• Comprehensive cluster monitoring and health checking: automatic sidelining of unhealthy nodes to prevent job failure

Scalability from Deskside to TOP500
• Off-loadable provisioning: enables maximum scalability
• Proven: used on some of the world's largest clusters

Minimum Overhead / Maximum Performance
• Lightweight: single daemon drives all functionality
• Optimized: daemon has minimal impact on operating system and applications
• Efficient: single database for all metric and configuration data

Top Security
• Key-signed repositories: controlled, automated security and other updates
• Encryption option: for external and internal communications
• Safe: X509v3 certificate-based public-key authentication
• Sensible access: role-based access control, complete audit trail
• Protected: firewalls and LDAP

A Unified Approach

Bright Cluster Manager was written from the ground up as a totally integrated and unified cluster management solution. This fundamental approach provides comprehensive cluster management that is easy to use and functionality-rich, yet has minimal impact on system performance. It has a single, lightweight daemon, a central database for all monitoring and configuration data, and a single CLI and GUI for all cluster management functionality. Bright Cluster Manager is extremely easy to use, scalable, secure and reliable. You can monitor and manage all aspects of your clusters with virtually no learning curve.

Bright's approach is in sharp contrast with other cluster management offerings, all of which take a "toolkit" approach. These toolkits combine a Linux distribution with many third-party tools for provisioning, monitoring, alerting, etc. This approach has critical limitations: these separate tools were not designed to work together, were often not designed for HPC, nor designed to scale. Furthermore, each of the tools has its own interface (mostly command-line based), and each has its own daemon(s) and database(s). Countless hours of scripting and testing by highly skilled people are required to get the tools to work for a specific cluster, and much of it goes undocumented. Time is wasted, and the cluster is at risk if staff changes occur, losing the "in-head" knowledge of the custom scripts.


By selecting a cluster node in the tree on the left and the Tasks tab on the right, you can execute a number of powerful tasks on that node with just a single mouse click.

The cluster installer takes you through the installation process and offers advanced options such as “Express” and “Remote”.

Ease of Installation

Bright Cluster Manager is easy to install. Installation and testing of a fully functional cluster from "bare metal" can be completed in less than an hour. Configuration choices made during the installation can be modified afterwards. Multiple installation modes are available, including unattended and remote modes. Cluster nodes can be automatically identified based on switch ports rather than MAC addresses, improving speed and reliability of installation, as well as subsequent maintenance. All major hardware brands are supported: Dell, Cray, Cisco, DDN, IBM, HP, Supermicro, Acer, Asus and more.

Ease of Use

Bright Cluster Manager is easy to use, with two interface options: the intuitive Cluster Management Graphical User Interface (CMGUI) and the powerful Cluster Management Shell (CMSH). The CMGUI is a standalone desktop application that provides a single system view for managing all hardware and software aspects of the cluster through a single point of control. Administrative functions are streamlined as all tasks are performed through one intuitive, visual interface. Multiple clusters can be managed simultaneously. The CMGUI runs on Linux, Windows and OS X, and can be extended using plugins. The CMSH provides practically the same functionality as the CMGUI, but via a command-line interface. The CMSH can be used both interactively and in batch mode via scripts. Either way, you now have unprecedented flexibility and control over your clusters.

Support for Linux and Windows

Bright Cluster Manager is based on Linux and is available with a choice of pre-integrated, pre-configured and optimized Linux distributions, including SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS and Scientific Linux. Dual-boot installations with Windows HPC Server are supported as well, allowing nodes to boot either from the Bright-managed Linux head node or from the Windows-managed head node.

Extensive Development Environment

Bright Cluster Manager provides an extensive HPC development environment for both serial and parallel applications, including the following (some are cost options):

• Compilers, including full suites from GNU, Intel, AMD and Portland Group.
• Debuggers and profilers, including the GNU debugger and profiler, TAU, TotalView, Allinea DDT and Allinea MAP.
• GPU libraries, including CUDA and OpenCL.
• MPI libraries, including OpenMPI, MPICH, MPICH2, MPICH-MX, MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with the compilers installed on Bright Cluster Manager, and optimized for high-speed interconnects such as InfiniBand and 10GE.
• Mathematical libraries, including ACML, FFTW, Goto-BLAS, MKL and ScaLAPACK.
• Other libraries, including Global Arrays, HDF5, IPP, TBB, NetCDF and PETSc.

Bright Cluster Manager also provides Environment Modules to make it easy to maintain multiple versions of compilers, libraries and applications for different users on the cluster, without creating compatibility conflicts. Each Environment Module file contains the information needed to configure the shell for an application, and automatically sets these variables correctly for the particular application when it is loaded. Bright Cluster Manager includes many preconfigured module files for many scenarios, such as combinations of compilers, mathematical and MPI libraries.

Powerful Image Management and Provisioning

Bright Cluster Manager features sophisticated software image management and provisioning capability. A virtually unlimited number of images can be created and assigned to as many different categories of nodes as required. Default or custom Linux kernels can be assigned to individual images. Incremental changes to images can be deployed to live nodes without rebooting or re-installation. The provisioning system only propagates changes to the images, minimizing time and impact on system performance and availability. Provisioning capability can be assigned to any number of nodes on-the-fly, for maximum flexibility and scalability. Bright Cluster Manager can also provision over InfiniBand and to ramdisk or virtual machine.

Comprehensive Monitoring

With Bright Cluster Manager, you can collect, monitor, visualize and analyze a comprehensive set of metrics. Many software and hardware metrics available to the Linux kernel, and many hardware management interface metrics (IPMI, DRAC, iLO, etc.) are sampled. Examples include CPU, GPU and Xeon Phi temperatures, fan speeds, switches, hard disk SMART information, system load, memory utilization, network metrics, storage metrics, power systems statistics, and workload management metrics. Custom metrics can also easily be defined.

Metric sampling is done very efficiently – in one process, or out-of-band where possible. You have full flexibility over how and when metrics are sampled, and historic data can be consolidated over time to save disk space.


Cluster metrics, such as GPU, Xeon Phi and CPU temperatures, fan speeds and network statistics can be visualized by simply dragging and dropping them into a graphing window. Multiple metrics can be combined in one graph and graphs can be zoomed into. A Graphing wizard allows creation of all graphs for a selected combination of metrics and nodes. Graph layout and color configurations can be tailored to your requirements and stored for re-use.

The Overview tab provides instant, high-level insight into the status of the cluster.


Cloud Bursting With Bright

High performance meets efficiency

Initially, massively parallel systems constitute a challenge to both administrators and users. They are complex beasts. Anyone building HPC clusters will need to tame the beast, master the complexity and present users and administrators with an easy-to-use, easy-to-manage system landscape. Leading HPC solution providers such as transtec achieve this goal. They hide the complexity of HPC under the hood and match high performance with efficiency and ease-of-use for both users and administrators. The "P" in "HPC" gains a double meaning: "Performance" plus "Productivity".

Cluster and workload management software like Moab HPC Suite, Bright Cluster Manager or QLogic IFS provide the means to master and hide the inherent complexity of HPC systems. For administrators and users, HPC clusters are presented as single, large machines, with many different tuning parameters. The software also provides a unified view of existing clusters whenever unified management is added as a requirement by the customer at any point in time after the first installation. Thus, daily routine tasks such as job management, user management, queue partitioning and management can be performed easily with either graphical or web-based tools, without any advanced scripting skills or technical expertise required from the administrator or user.

Bright Cluster Manager unleashes and manages the unlimited power of the cloud

Create a complete cluster in Amazon EC2, or easily extend your onsite cluster into the cloud, enabling you to dynamically add capacity and manage these nodes as part of your onsite cluster. Both can be achieved in just a few mouse clicks, without the need for expert knowledge of Linux or cloud computing.

Bright Cluster Manager's unique data-aware scheduling capability means that your data is automatically in place in EC2 at the start of the job, and that the output data is returned as soon as the job is completed. With Bright Cluster Manager, every cluster is cloud-ready, at no extra cost. The same powerful cluster provisioning, monitoring, scheduling and management capabilities that Bright Cluster Manager provides to onsite clusters extend into the cloud.

The Bright advantage for cloud bursting

• Ease of use: the intuitive GUI virtually eliminates the user learning curve; there is no need to understand Linux or EC2 to manage the system. Alternatively, the cluster management shell provides powerful scripting capabilities to automate tasks.
• Complete management solution: installation/initialization, provisioning, monitoring, scheduling and management in one integrated environment.
• Integrated workload management: a wide selection of workload managers is included and automatically configured with local, cloud and mixed queues.
• Single system view, complete visibility and control: cloud compute nodes are managed as elements of the on-site cluster, visible from a single console with drill-downs and historic data.
• Efficient data management via data-aware scheduling: automatically ensures data is in position at the start of computation and delivers results back when complete.
• Secure, automatic gateway: mirrored LDAP and DNS services over an automatically created VPN connect local and cloud-based nodes for secure communication.
• Cost savings: more efficient use of cloud resources, support for spot instances, minimal user intervention.

Two cloud bursting scenarios

Bright Cluster Manager supports two cloud bursting scenarios: "Cluster-on-Demand" and "Cluster Extension".

Scenario 1: Cluster-on-Demand
The Cluster-on-Demand scenario is ideal if you do not have a cluster onsite, or need to set up a totally separate cluster. With just a few mouse clicks you can instantly create a complete cluster in the public cloud, for any duration of time.

Scenario 2: Cluster Extension
The Cluster Extension scenario is ideal if you have a cluster onsite but you need more compute power, including GPUs. With Bright Cluster Manager, you can instantly add EC2-based resources to your onsite cluster, for any duration of time.


Bursting into the cloud is as easy as adding nodes to an onsite cluster; there are only a few additional steps after providing the public cloud account information to Bright Cluster Manager.

The Bright approach to managing and monitoring a cluster in the cloud provides complete uniformity, as cloud nodes are managed and monitored the same way as local nodes:

• Load-balanced provisioning
• Software image management
• Integrated workload management
• Interfaces: GUI, shell, user portal
• Monitoring and health checking
• Compilers, debuggers, MPI libraries, mathematical libraries and environment modules

Bright Cluster Manager also provides additional features that are unique to the cloud.

Amazon spot instance support
Bright Cluster Manager enables users to take advantage of the cost savings offered by Amazon's spot instances. Users can specify the use of spot instances, and Bright will automatically schedule them as available, reducing the cost to compute without the need to monitor spot prices and schedule manually.

Hardware Virtual Machine (HVM) virtualization
Bright Cluster Manager automatically initializes all Amazon instance types, including Cluster Compute and Cluster GPU instances that rely on HVM virtualization.

Data-aware scheduling
Data-aware scheduling ensures that input data is transferred to the cloud and made accessible just prior to the job starting, and that the output data is transferred back. There is no need to wait for (and monitor) the data to load before submitting jobs, which would delay entry into the job queue, nor is there any risk of starting the job before the data transfer is complete, which would crash the job. Users submit their jobs, and Bright's data-aware scheduling does the rest.
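The mechanics that such data-aware scheduling automates can be pictured with a small, purely illustrative sketch: stage the input data to cloud object storage, run the job, then fetch the results. The boto3 client, bucket name and file names below are assumptions made for illustration and say nothing about Bright's internal implementation.

```python
# Conceptual sketch of "stage in, compute, stage out" for a cloud job.
# Assumes AWS credentials are configured; bucket/key names are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-hpc-staging"          # placeholder bucket name

def stage_in(local_path: str, key: str) -> None:
    """Upload input data before the job is allowed to start."""
    s3.upload_file(local_path, BUCKET, key)

def stage_out(key: str, local_path: str) -> None:
    """Download results once the job has finished."""
    s3.download_file(BUCKET, key, local_path)

if __name__ == "__main__":
    stage_in("input.dat", "jobs/123/input.dat")
    # ... submit the cloud job and wait for completion here (omitted) ...
    stage_out("jobs/123/output.dat", "output.dat")
```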

Bright Cluster Manager provides the choice: "To Cloud, or Not to Cloud"

Not all workloads are suitable for the cloud. The ideal situation for most organizations is to have the ability to choose between onsite clusters for jobs that require low-latency communication, complex I/O or sensitive data, and cloud clusters for many other types of jobs. Bright Cluster Manager delivers the best of both worlds: a powerful management solution for local and cloud clusters, with the ability to easily extend local clusters into the cloud without compromising provisioning, monitoring or managing the cloud resources.



One or more cloud nodes can be configured under the Cloud Settings tab.

The Add Cloud Provider Wizard and the Node Creation Wizard make the cloud bursting process easy, also for users with no cloud experience.

Cluster Management Automation

Cluster management automation takes pre-emptive actions when predetermined system thresholds are exceeded, saving time and preventing hardware damage. Thresholds can be configured on any of the available metrics. The built-in configuration wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions. For example, a temperature threshold for GPUs can be established that results in the system automatically shutting down an overheated GPU unit and sending a text message to your mobile phone. Several predefined actions are available, but any built-in cluster management command, Linux command or script can be used as an action.

Comprehensive GPU Management

Bright Cluster Manager radically reduces the time and effort of managing GPUs, and fully integrates these devices into the single view of the overall system. Bright includes powerful GPU management and monitoring capability that leverages functionality in NVIDIA Tesla and AMD GPUs. You can easily assume maximum control of the GPUs and gain instant and time-based status insight. Depending on the GPU make and model, Bright monitors a full range of GPU metrics, including:

• GPU temperature, fan speed, utilization
• GPU exclusivity, compute, display, persistence mode
• GPU memory utilization, ECC statistics
• Unit fan speed, serial number, temperature, power usage, voltages and currents, LED status, firmware
• Board serial, driver version, PCI info

Beyond metrics, Bright Cluster Manager features built-in support for GPU computing with CUDA and OpenCL libraries. Switching between current and previous versions of CUDA and OpenCL has also been made easy.
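For readers who want to see what such raw GPU metrics look like outside of any management suite, the following sketch queries a few of them directly with NVIDIA's nvidia-smi tool (assumed to be installed along with the driver). It only illustrates the underlying data; it is not a description of Bright's own sampling mechanism.

```python
# Query basic GPU metrics via nvidia-smi (ships with the NVIDIA driver).
import subprocess

QUERY = "index,name,temperature.gpu,fan.speed,utilization.gpu,memory.used"

out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
    check=True, capture_output=True, text=True,
).stdout

for line in out.strip().splitlines():
    # e.g. "0, Tesla K40m, 41, 23 %, 0 %, 0 MiB"
    print(line)
```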

Full Support for Intel Xeon Phi

Bright Cluster Manager makes it easy to set up and use the Intel Xeon Phi coprocessor. Bright includes everything that is needed to get Phi to work, including a setup wizard in the CMGUI. Bright ensures that your software environment is set up correctly, so that the Intel Xeon Phi coprocessor is available for applications that are able to take advantage of it.

Bright collects and displays a wide range of metrics for Phi, ensuring that the coprocessor is visible and manageable as a device type, as well as including Phi as a resource in the workload management system. Bright's pre-job health checking ensures that Phi is functioning properly before directing tasks to the coprocessor.

Furthermore, Bright allows you to take advantage of the two execution models available from the Intel MIC architecture and makes the workload manager aware of the models:

• Native – the application runs exclusively on the coprocessor without involving the host CPU.
• Offload – the application runs on the host and some specific computations are offloaded to the coprocessor.

Multi-Tasking Via Parallel Shell

The parallel shell allows simultaneous execution of multiple commands and scripts across the cluster as a whole, or across easily definable groups of nodes. Output from the executed commands is displayed in a convenient way with variable levels of verbosity. Running commands and scripts can be killed easily if necessary. The parallel shell is available through both the CMGUI and the CMSH.
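The idea behind a parallel shell can be illustrated with a few lines of plain Python: fan a command out over SSH to a list of nodes concurrently and collect the output. The node names and the reliance on password-less ssh are assumptions, and this is a conceptual aside rather than how CMSH is implemented.

```python
# Conceptual parallel-shell sketch: run one command on many nodes at once.
# Assumes password-less SSH to the listed nodes (hostnames are placeholders).
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:02d}" for i in range(1, 9)]   # node01 .. node08
COMMAND = "uptime"

def run_on(node: str) -> str:
    result = subprocess.run(["ssh", node, COMMAND],
                            capture_output=True, text=True, timeout=30)
    return f"{node}: {result.stdout.strip() or result.stderr.strip()}"

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    for line in pool.map(run_on, NODES):
        print(line)
```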



The automation configuration wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions

The parallel shell allows for simultaneous execution of commands or scripts across node groups or across the entire cluster.

The status of cluster nodes, switches, other hardware, as well as up to six metrics can be visualized in the Rackview. A zoomout option is available for clusters with many racks.


Integrated SMP Support

Bright Cluster Manager – Advanced Edition dynamically aggregates multiple cluster nodes into a single virtual SMP node, using ScaleMP's Versatile SMP™ (vSMP) architecture. Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the CMGUI. Virtual SMP nodes can also be launched and dismantled automatically using the scripting capabilities of the CMSH. In Bright Cluster Manager a virtual SMP node behaves like any other node, enabling transparent, on-the-fly provisioning, configuration, monitoring and management of virtual SMP nodes as part of the overall system management.

Maximum Uptime with Failover

Bright Cluster Manager – Advanced Edition allows two head nodes to be configured in active-active failover mode. Both head nodes are on active duty, but if one fails, the other takes over all tasks, seamlessly.

Maximum Uptime with Health Checking

Bright Cluster Manager – Advanced Edition includes a powerful cluster health checking framework that maximizes system uptime. It continually checks multiple health indicators for all hardware and software components and proactively initiates corrective actions. It can also automatically perform a series of standard and user-defined tests just before starting a new job, to ensure successful execution and to prevent the "black hole node syndrome". Examples of corrective actions include autonomous bypass of faulty nodes, automatic job requeuing to avoid queue flushing, and process "jailing" to allocate, track, trace and flush completed user processes. The health checking framework ensures the highest job throughput, the best overall cluster efficiency and the lowest administration overhead.
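A pre-job health check of the kind described above is, at its core, a small script that probes a node and signals pass or fail through its exit code. The sketch below is a generic, hypothetical example (the checks and thresholds are arbitrary), not one of Bright's built-in tests.

```python
# Generic pre-job node health check: exit 0 = healthy, non-zero = sideline node.
# The checks and thresholds below are arbitrary examples.
import os
import shutil
import sys

def checks():
    # Enough free space on the scratch filesystem?
    free_gb = shutil.disk_usage("/tmp").free / 1e9
    yield "scratch space", free_gb > 5.0

    # Load average not wildly above the core count?
    load1, _, _ = os.getloadavg()
    yield "load average", load1 < 4 * (os.cpu_count() or 1)

failed = [name for name, ok in checks() if not ok]
if failed:
    print("health check FAILED:", ", ".join(failed))
    sys.exit(1)
print("health check passed")
```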

Top Cluster Security

Bright Cluster Manager offers an unprecedented level of security that can easily be tailored to local requirements. Security features include:

• Automated security and other updates from key-signed Linux and Bright Computing repositories.
• Encrypted internal and external communications.
• X509v3 certificate-based public-key authentication to the cluster management infrastructure.
• Role-based access control and complete audit trail.
• Firewalls, LDAP and SSH.

User and Group Management

Users can be added to the cluster through the CMGUI or the CMSH. Bright Cluster Manager comes with a pre-configured LDAP database, but an external LDAP service, or an alternative authentication system, can be used instead.

Web-Based User Portal

The web-based User Portal provides read-only access to essential cluster information, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs. The User Portal can easily be customized and expanded using PHP and the SOAP or JSON APIs.

Integrated Workload Management

Bright Cluster Manager is integrated with a wide selection of free and commercial workload managers. This integration provides a number of benefits:

• The selected workload manager gets automatically installed and configured
• Many workload manager metrics are monitored
• The CMGUI and User Portal provide a user-friendly interface to the workload manager
• The CMSH and the SOAP & JSON APIs provide direct and powerful access to a number of workload manager commands and metrics
• Reliable workload manager failover is properly configured
• The workload manager is continuously made aware of the health state of nodes (see the section on health checking)
• The workload manager is used to save power through auto power on/off based on workload
• The workload manager is used for data-aware scheduling of jobs to the cloud

The following user-selectable workload managers are tightly integrated with Bright Cluster Manager:

• PBS Professional, Univa Grid Engine, Moab, LSF
• Slurm, openlava, Open Grid Scheduler, Torque, Maui

Alternatively, other workload managers, such as LoadLeveler and Condor, can be installed on top of Bright Cluster Manager.
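To make the workload-manager integration concrete, the sketch below submits a trivial batch job from Python by shelling out to Slurm's sbatch, one of the schedulers listed above; the job script content and resource requests are placeholders, and the same pattern applies to the other workload managers via their own submit commands.

```python
# Submit a minimal batch job to Slurm via sbatch (placeholder resources).
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
srun hostname
"""

result = subprocess.run(["sbatch"], input=job_script,
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())   # e.g. "Submitted batch job 12345"
```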


Workload management queues can be viewed and configured from the GUI, without the need for workload management expertise.

Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the GUI or a single command in the cluster management shell.

Example graphs that visualize metrics on a GPU cluster.



Multi-Cluster Capability

Bright Cluster Manager – Advanced Edition is ideal for organizations that need to manage multiple clusters, either in one or in multiple locations. Capabilities include:

• All cluster management and monitoring functionality is available for all clusters through one GUI.
• Selecting any set of configurations in one cluster and exporting them to any or all other clusters with a few mouse clicks.
• Metric visualizations and summaries across clusters.
• Making node images available to other clusters.

Fundamentally API-Based

Bright Cluster Manager is fundamentally API-based, which means that any cluster management command and any piece of cluster management data – whether it is monitoring data or configuration data – is available through the API. Both a SOAP and a JSON API are available, and interfaces for various programming languages, including C++, Python and PHP, are provided.
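As a rough illustration of what talking to such a JSON API over HTTP could look like, the snippet below issues a generic request with the Python requests library. The URL, endpoint path and certificate handling are entirely hypothetical placeholders and are not taken from Bright's documentation.

```python
# Hypothetical JSON-over-HTTP query against a cluster management API.
# URL, path and authentication details are placeholders, not Bright's API.
import requests

BASE_URL = "https://headnode.example.org:8081"   # placeholder address

def get_json(path: str) -> dict:
    # verify=False only to keep the sketch short; use proper certificate
    # verification in any real setting.
    response = requests.get(f"{BASE_URL}{path}", verify=False, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    nodes = get_json("/json/nodes")              # hypothetical endpoint
    print(f"received data for {len(nodes)} nodes")
```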

Cloud Utilization

Bright Cluster Manager supports two cloud utilization scenarios for Amazon EC2:

• Cluster-on-Demand – running stand-alone clusters in the cloud; and
• Cluster Extension – adding cloud-based resources to existing, onsite clusters and managing these cloud nodes as if they were local.

Both scenarios utilize the full range of Bright's capabilities and can be achieved in a few simple steps. In the Cluster Extension scenario, two additional capabilities significantly enhance productivity:

• Data-Aware Scheduling – ensures that the workload manager automatically transfers the input data to the cloud in time for the associated job to start. Output data is automatically transferred back to the on-premise head node.
• Automatic Cloud Resizing – allows you to specify policies for automatically increasing and decreasing the number of cloud nodes based on the load in your queues.

Bright supports Amazon VPC setups, which allow compute nodes in EC2 to be placed in an isolated network, thereby separating them from the outside world. It is even possible to route part of a local corporate IP network to a VPC subnet in EC2, so that local nodes and nodes in EC2 can communicate without any effort.

Hadoop Cluster Management

Bright Cluster Manager is the ideal basis for Hadoop clusters. Bright installs on bare metal, configuring a fully operational Hadoop cluster in less than one hour. In the process, Bright prepares your Hadoop cluster for use by provisioning the operating system and the general cluster management and monitoring capabilities required on any cluster.

The web-based User Portal provides read-only access to essential cluster information, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs.

Bright Cluster Manager can manage multiple clusters simultaneously. This overview shows clusters in Oslo, Abu Dhabi and Houston, all managed through one GUI.

"The building blocks for transtec HPC solutions must be chosen according to our goals of ease-of-management and ease-of-use. With Bright Cluster Manager, we are happy to have the technology leader at hand, meeting these requirements, and our customers value that."

Armin Jäger, HPC Solution Engineer


Bright then manages and monitors your Hadoop cluster's hardware and system software throughout its life-cycle, collecting and graphically displaying a full range of Hadoop metrics from the HDFS, RPC and JVM sub-systems. Bright significantly reduces setup time for Cloudera, Hortonworks and other Hadoop stacks, and increases both uptime and MapReduce job throughput. This functionality is scheduled to be further enhanced in upcoming releases of Bright, including dedicated management roles and profiles for name nodes and data nodes, as well as advanced Hadoop health checking and monitoring functionality.

Standard and Advanced Editions

Bright Cluster Manager is available in two editions: Standard and Advanced. The table on this page lists the differences. You can easily upgrade from the Standard to the Advanced Edition as your cluster grows in size or complexity.

Documentation and Services

A comprehensive system administrator manual and user manual are included in PDF format. Standard and tailored services are available, including various levels of support, installation, training and consultancy.

Feature                                  Standard   Advanced
Choice of Linux distributions                x          x
Intel Cluster Ready                          x          x
Cluster Management GUI                       x          x
Cluster Management Shell                     x          x
Web-Based User Portal                        x          x
SOAP API                                     x          x
Node Provisioning                            x          x
Node Identification                          x          x
Cluster Monitoring                           x          x
Cluster Automation                           x          x
User Management                              x          x
Role-based Access Control                    x          x
Parallel Shell                               x          x
Workload Manager Integration                 x          x
Cluster Security                             x          x
Compilers                                    x          x
Debuggers & Profilers                        x          x
MPI Libraries                                x          x
Mathematical Libraries                       x          x
Environment Modules                          x          x
Cloud Bursting                               x          x
Hadoop Management & Monitoring               x          x
NVIDIA CUDA & OpenCL                         -          x
GPU Management & Monitoring                  -          x
Xeon Phi Management & Monitoring             -          x
ScaleMP Management & Monitoring              -          x
Redundant Failover Head Nodes                -          x
Cluster Health Checking                      -          x
Off-loadable Provisioning                    -          x
Multi-Cluster Management                     -          x
Suggested Number of Nodes                 4–128      129–10,000+
Standard Support                             x          x
Premium Support                           Optional   Optional

Cluster health checks can be visualized in the Rackview. This screenshot shows that GPU unit 41 fails a health check called “AllFansRunning”.


While all HPC systems face challenges in workload demand, resource complexity, and scale, enterprise HPC systems face more stringent challenges and expectations. Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-oriented research and academic organizations. They have complex SLAs and priorities to balance. Their HPC workloads directly impact the revenue, product delivery, and organizational objectives of their organizations.

Intelligent HPC Workload Management


Enterprise HPC Challenges
Enterprise HPC systems must eliminate job delays and failures.

They are also seeking to improve resource utilization and man-

agement efficiency across multiple heterogeneous systems. To

maximize user productivity, they are required to make it easier to

access and use HPC resources for users or even expand to other

clusters or HPC cloud to better handle workload demand surges.

Intelligent Workload Management
As today’s leading HPC facilities move beyond petascale towards

exascale systems that incorporate increasingly sophisticated and

specialized technologies, equally sophisticated and intelligent

management capabilities are essential. With a proven history of

managing the most advanced, diverse, and data-intensive sys-

tems in the Top500, Moab continues to be the preferred workload

management solution for next generation HPC facilities.

Moab HPC Suite – Basic Edition
Moab HPC Suite - Basic Edition is an intelligent workload man-

agement solution. It automates the scheduling, managing,

monitoring, and reporting of HPC workloads on massive scale,

multi-technology installations. The patented Moab intelligence

engine uses multi-dimensional policies to accelerate running

workloads across the ideal combination of diverse resources.

These policies balance high utilization and throughput goals

with competing workload priorities and SLA requirements. The

speed and accuracy of the automated scheduling decisions

optimizes workload throughput and resource utilization. This

gets more work accomplished in less time, and in the right prio-

rity order. Moab HPC Suite - Basic Edition optimizes the value

and satisfaction from HPC systems while reducing manage-

ment cost and complexity.

Moab HPC Suite – Enterprise Edition
Moab HPC Suite - Enterprise Edition provides enterprise-ready HPC

workload management that accelerates productivity, automates

workload uptime, and consistently meets SLAs and business prior-

ities for HPC systems and HPC cloud. It uses the battle-tested and

patented Moab intelligence engine to automatically balance the

complex, mission-critical workload priorities of enterprise HPC

systems. Enterprise customers benefit from a single integrated

product that brings together key enterprise HPC capabilities, im-

plementation services, and 24x7 support. This speeds the realiza-

tion of benefits from the HPC system for the business including:

� Higher job throughput

� Massive scalability for faster response and system expansion

� Optimum utilization of 90-99% on a consistent basis

� Fast, simple job submission and management to increase

productivity


Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.


� Reduced cluster management complexity and support costs

across heterogeneous systems

� Reduced job failures and auto-recovery from failures

� SLAs consistently met for improved user satisfaction

� Reduced power usage and costs by 10-30%

Productivity Acceleration
Moab HPC Suite – Enterprise Edition gets more results deliv-

ered faster from HPC resources at lower costs by accelerating

overall system, user and administrator productivity. Moab pro-

vides the scalability, 90-99 percent utilization, and simple job

submission that is required to maximize the productivity of

enterprise HPC systems. Enterprise use cases and capabilities

include:

�Massive multi-point scalability to accelerate job response

and throughput, including high throughput computing

�Workload-optimized allocation policies and provisioning

to get more results out of existing heterogeneous resources

and reduce costs, including topology-based allocation

� Unify workload management across heterogeneous clus-

ters to maximize resource availability and administration

efficiency by managing them as one cluster

� Optimized, intelligent scheduling packs workloads and

backfills around priority jobs and reservations while balanc-

ing SLAs to efficiently use all available resources

� Optimized scheduling and management of accelerators,

both Intel MIC and GPGPUs, for jobs to maximize their utili-

zation and effectiveness

� Simplified job submission and management with advanced

job arrays, self-service portal, and templates

� Administrator dashboards and reporting tools reduce man-

agement complexity, time, and costs

�Workload-aware auto-power management reduces energy

use and costs by 10-30 percent

Uptime Automation
Job and resource failures in enterprise HPC systems lead to de-

layed results, missed organizational opportunities, and missed

objectives. Moab HPC Suite – Enterprise Edition intelligently au-

tomates workload and resource uptime in the HPC system to en-

sure that workload completes successfully and reliably, avoiding

these failures. Enterprises can benefit from:

� Intelligent resource placement to prevent job failures with

granular resource modeling to meet workload requirements

and avoid at-risk resources

� Auto-response to failures and events with configurable ac-

tions to pre-failure conditions, amber alerts, or other metrics

and monitors

�Workload-aware future maintenance scheduling that helps

maintain a stable HPC system without disrupting workload

productivity

� Real-world expertise for fast time-to-value and system

uptime with included implementation, training, and 24x7

support remote services

Auto SLA Enforcement
Moab HPC Suite – Enterprise Edition uses the powerful Moab

intelligence engine to optimally schedule and dynamically

adjust workload to consistently meet service level agree-

ments (SLAs), guarantees, or business priorities. This auto-

matically ensures that the right workloads are completed at

the optimal times, taking into account the complex number of

using departments, priorities and SLAs to be balanced. Moab

provides:

� Usage accounting and budget enforcement that schedules

resources and reports on usage in line with resource shar-

ing agreements and precise budgets (includes usage limits,


“With Moab HPC Suite, we can meet very demand-

ing customers’ requirements as regards unified

management of heterogeneous cluster environ-

ments, grid management, and provide them with

flexible and powerful configuration and reporting

options. Our customers value that highly.”

Thomas Gebert HPC Solution Architect


usage reports, auto budget management and dynamic fair-

share policies)

� SLA and priority policies that make sure the highest-priority workloads are processed first (includes Quality of Service and hierarchical priority weighting); a simple priority-weighting sketch follows this list

� Continuous plus future scheduling that ensures priorities

and guarantees are proactively met as conditions and work-

load changes (i.e. future reservations, pre-emption, etc.)
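The following toy sketch illustrates how such weighted priorities could combine a QoS level, a fair-share deficit and queue time into a single job ranking. The weights, field names and formula are invented for illustration and are not Moab's actual priority configuration:

    # Conceptual sketch of SLA/priority-driven ordering: a job's priority is a
    # weighted sum of its QoS level, how far its group is below its fair-share
    # target, and how long it has waited. Weights and fields are illustrative.

    def job_priority(job, qos_weight=1000, fairshare_weight=100, queuetime_weight=1):
        fs_deficit = job["fs_target"] - job["fs_used"]   # > 0 means under-served group
        return (qos_weight * job["qos_level"]
                + fairshare_weight * fs_deficit
                + queuetime_weight * job["minutes_queued"])

    jobs = [
        {"name": "crash_sim", "qos_level": 2, "fs_target": 0.40, "fs_used": 0.25, "minutes_queued": 10},
        {"name": "post_proc", "qos_level": 1, "fs_target": 0.20, "fs_used": 0.30, "minutes_queued": 240},
    ]
    for j in sorted(jobs, key=job_priority, reverse=True):
        print(j["name"], round(job_priority(j)))   # crash_sim is scheduled first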

Grid- and Cloud-Ready Workload Management
The benefits of a traditional HPC environment can be extended to more efficiently manage and meet workload demand with the multi-cluster grid and HPC cloud management capabilities in Moab HPC Suite - Enterprise Edition. It allows you to:

� Showback or chargeback for pay-for-use so actual resource

usage is tracked with flexible chargeback rates and report-

ing by user, department, cost center, or cluster

� Manage and share workload across multiple remote clusters

to meet growing workload demand or surges by adding on

the Moab HPC Suite – Grid Option.

Moab HPC Suite – Grid Option
The Moab HPC Suite - Grid Option is a powerful grid-workload

management solution that includes scheduling, advanced poli-

cy management, and tools to control all the components of ad-

vanced grids. Unlike other grid solutions, Moab HPC Suite - Grid

Option connects disparate clusters into a logical whole, enabling

grid administrators and grid policies to have sovereignty over all

systems while preserving control at the individual cluster.

Moab HPC Suite - Grid Option has powerful applications that al-

low organizations to consolidate reporting; information gather-

ing; and workload, resource, and data management. Moab HPC

Suite - Grid Option delivers these services in a near-transparent

way: users are unaware they are using grid resources—they

know only that they are getting work done faster and more eas-

ily than ever before.

Moab manages many of the largest clusters and grids in the

world. Moab technologies are used broadly across Fortune 500

companies and manage more than a third of the compute cores

in the top 100 systems of the Top500 supercomputers. Adaptive

Computing is a globally trusted ISV (independent software ven-

dor), and the full scalability and functionality Moab HPC Suite

with the Grid Option offers in a single integrated solution has

traditionally made it a significantly more cost-effective option

than other tools on the market today.

Components
The Moab HPC Suite - Grid Option is available as an add-on to

Moab HPC Suite - Basic Edition and -Enterprise Edition. It extends

the capabilities and functionality of the Moab HPC Suite compo-

nents to manage grid environments including:

�Moab Workload Manager for Grids – a policy-based work-

load management and scheduling multi-dimensional deci-

sion engine

�Moab Cluster Manager for Grids – a powerful and unified graphical administration tool for monitoring, managing and reporting across multiple clusters

�Moab ViewpointTM for Grids – a Web-based self-service

end-user job-submission and management portal and ad-

ministrator dashboard

Benefits

� Unified management across heterogeneous clusters provides

ability to move quickly from cluster to optimized grid

� Policy-driven and predictive scheduling ensures that jobs

start and run in the fastest time possible by selecting optimal

resources

� Flexible policy and decision engine adjusts workload pro-

cessing at both grid and cluster level

� Grid-wide interface and reporting tools provide view of grid

resources, status and usage charts, and trends over time for

capacity planning, diagnostics, and accounting

� Advanced administrative control allows various business

units to access and view grid resources, regardless of physical

or organizational boundaries, or alternatively restricts access

to resources by specific departments or entities

� Scalable architecture to support peta-scale, high-throughput

computing and beyond

Grid Control with Automated Tasks, Policies, and Reporting

� Guarantee that the most-critical work runs first with flexible

global policies that respect local cluster policies but continue

to support grid service-level agreements

� Ensure availability of key resources at specific times with ad-

vanced reservations

� Tune policies prior to rollout with cluster- and grid-level simu-

lations

� Use a global view of all grid operations for self-diagnosis, plan-

ning, reporting, and accounting across all resources, jobs, and

clusters


Moab HPC Suite - Grid Option can be flexibly configured for centralized, centralized and local, or peer-to-peer grid policies, decisions and rules. It is able to manage multiple resource managers and multiple Moab instances.


Harnessing the Power of Intel Xeon Phi and GPGPUs
The mathematical acceleration delivered by many-core General Purpose Graphics Processing Units (GPGPUs) offers significant performance advantages for many classes of numerically

intensive applications. Parallel computing tools such as NVID-

IA’s CUDA and the OpenCL framework have made harnessing

the power of these technologies much more accessible to

developers, resulting in the increasing deployment of hybrid

GPGPU-based systems and introducing significant challenges

for administrators and workload management systems.

The new Intel Xeon Phi coprocessors offer breakthrough performance for highly parallel applications with the benefit of leveraging existing code and flexibility in programming models for faster application development. Moving an application to Intel Xeon Phi coprocessors has the promising benefit of requiring substantially less effort and lower power usage. Such a small investment can reap big performance gains for both data-intensive and numerical or scientific computing applications that can make the most of the Intel Many Integrated Cores (MIC)

technology. So it is no surprise that organizations are quickly

embracing these new coprocessors in their HPC systems.

The main goal is to create breakthrough discoveries, products

and research that improve our world. Harnessing and optimiz-

ing the power of these new accelerator technologies is key to

doing this as quickly as possible. Moab HPC Suite helps organi-

zations easily integrate accelerator and coprocessor technolo-

gies into their HPC systems, optimizing their utilization for the

right workloads and problems they are trying to solve.

Moab HPC Suite automatically schedules hybrid systems incor-

porating Intel Xeon Phi coprocessors and GPGPU accelerators,

optimizing their utilization as just another resource type with

policies. Organizations can choose the accelerators that best

meet their different workload needs.

Managing Hybrid Accelerator Systems
Hybrid accelerator systems add a new dimension of manage-

ment complexity when allocating workloads to available re-

sources. In addition to the traditional needs of aligning work-

load placement with software stack dependencies, CPU type,

memory, and interconnect requirements, intelligent workload

management systems now need to consider:

� Workload’s ability to exploit Xeon Phi or GPGPU

technology

� Additional software dependencies

� Current health and usage status of available Xeon Phis

or GPGPUs

� Resource configuration for type and number of Xeon Phis

or GPGPUs

Moab HPC Suite auto detects which types of accelerators are

where in the system to reduce the management effort and costs

as these processors are introduced and maintained together in

an HPC system. This gives customers maximum choice and per-

formance in selecting the accelerators that work best for each

of their workloads, whether Xeon Phi, GPGPU or a hybrid mix of

the two.


Moab HPC Suite optimizes accelerator utilization with policies that ensure the right ones get used for the right user and group jobs at the right time.


Optimizing Intel Xeon Phi and GPGPU Utilization
System and application diversity is one of the first issues

workload management software must address in optimizing

accelerator utilization. The ability of current codes to use GP-

GPUs and Xeon Phis effectively, ongoing cluster expansion,

and costs usually mean only a portion of a system will be

equipped with one or a mix of both types of accelerators.

Moab HPC Suite has a twofold role in reducing the complex-

ity of their administration and optimizing their utilization.

First, it must be able to automatically detect new GPGPUs

or Intel Xeon Phi coprocessors in the environment and their

availability without the cumbersome burden of manual ad-

ministrator configuration. Second, and most importantly, it

must accurately match Xeon Phi-bound or GPGPU-enabled

workloads with the appropriately equipped resources in ad-

dition to managing contention for those limited resources

according to organizational policy. Moab’s powerful alloca-

tion and prioritization policies can ensure the right jobs from

users and groups get to the optimal accelerator resources at

the right time. This keeps the accelerators at peak utilization

for the right priority jobs.

Moab’s policies are a powerful ally to administrators in automatically determining the optimal Xeon Phi coprocessor or GPGPU to use, and which ones to avoid, when scheduling jobs. These allocation policies can be based on any of the GPGPU or Xeon Phi metrics, such as memory (to ensure job needs are met), temperature (to avoid hardware errors or failures), utilization, or other metrics; a conceptual sketch of such metric-based allocation follows the metric lists below.

Use the following metrics in policies to optimize allocation of

Intel Xeon Phi coprocessors:

� Number of Cores

� Number of Hardware Threads

� Physical Memory

� Free Physical Memory

� Swap Memory

� Free Swap Memory

� Max Frequency (in MHz)

Use the following metrics in policies to optimize allocation of

GPGPUs and for management diagnostics:

� Error counts; single/double-bit, commands to reset counts

� Temperature

� Fan speed

� Memory; total, used, utilization

� Utilization percent

� Metrics time stamp
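As a purely conceptual illustration of metric-based allocation, the sketch below picks a GPGPU using a few of the metrics listed above. The thresholds and scoring are assumptions for this example, not Moab policy syntax:

    # Conceptual sketch: pick the "best" GPGPU for a job using the metrics
    # listed above -- enough free memory, not too hot, lightly utilized.
    # The thresholds and scoring are illustrative only.

    def pick_gpu(gpus, required_mem_mb, max_temp_c=85):
        candidates = [g for g in gpus
                      if g["mem_free_mb"] >= required_mem_mb and g["temp_c"] < max_temp_c]
        if not candidates:
            return None                  # job stays queued until a GPU qualifies
        # prefer the least-utilized, coolest device
        return min(candidates, key=lambda g: (g["util_pct"], g["temp_c"]))

    gpus = [
        {"id": "gpu0", "mem_free_mb": 1500, "temp_c": 88, "util_pct": 10},
        {"id": "gpu1", "mem_free_mb": 5800, "temp_c": 62, "util_pct": 35},
        {"id": "gpu2", "mem_free_mb": 6000, "temp_c": 60, "util_pct": 5},
    ]
    print(pick_gpu(gpus, required_mem_mb=4000)["id"])   # -> gpu2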

Improve GPGPU Job Speed and Success
The GPGPU drivers supplied by vendors today allow multiple jobs to share a GPGPU without corrupting results, but sharing GPGPUs and other key GPGPU factors can significantly impact the speed and success of GPGPU jobs and the final level of service delivered to the system users and the organization.

Consider the example of a user estimating the time for a GPGPU job based on exclusive access to a GPGPU, but the workload manager allowing the GPGPU to be shared when the job is scheduled. The job will likely exceed the estimated time and be terminated by the workload manager unsuccessfully, leaving a very unsatisfied user and the organization, group or project they represent. Moab HPC Suite provides the ability to schedule GPGPU jobs to run exclusively and finish in the shortest time possible for certain types or classes of jobs or users, to ensure job speed and success. The small sketch below illustrates the arithmetic behind this failure mode.
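All numbers in the sketch are assumed for illustration:

    # Worked example of the failure mode described above (numbers assumed):
    # a user estimates the walltime assuming exclusive GPU access, but the
    # GPU ends up shared by two jobs, roughly halving effective throughput.

    exclusive_runtime_h = 2.0          # measured with the GPU to itself
    requested_walltime_h = 2.5         # user adds a 25% safety margin
    sharing_jobs = 2                   # scheduler packs two jobs onto the GPU

    effective_runtime_h = exclusive_runtime_h * sharing_jobs   # ~4.0 h when shared
    killed = effective_runtime_h > requested_walltime_h

    print(f"shared runtime ~{effective_runtime_h} h vs walltime {requested_walltime_h} h "
          f"-> job killed: {killed}")
    # An exclusive-access policy for this job class keeps the runtime at ~2.0 h,
    # inside the requested walltime.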


Cluster Sovereignty and Trusted Sharing

� Guarantee that shared resources are allocated fairly with global policies that fully respect local cluster configuration and needs
� Establish trust between resource owners through graphical usage controls, reports, and accounting across all shared resources
� Maintain cluster sovereignty with granular settings to control where jobs can originate and be processed
� Establish resource ownership and enforce appropriate access levels with prioritization, preemption, and access guarantees

Increase User Collaboration and Productivity

� Reduce end-user training and job management time with

easy-to-use graphical interfaces

� Enable end users to easily submit and manage their jobs

through an optional web portal, minimizing the costs of ca-

tering to a growing base of needy users

� Collaborate more effectively with multicluster co-allocation,

allowing key resource reservations for high-priority projects

� Leverage saved job templates, allowing users to submit mul-

tiple jobs quickly and with minimal changes

� Speed job processing with enhanced grid placement options

for job arrays; optimal or single cluster placement

Process More Work in Less Time to Maximize ROI

� Achieve higher, more consistent resource utilization with intelligent scheduling, matching job requests to the best-suited resources, including GPGPUs

� Use optimized data staging to ensure that remote data trans-

fers are synchronized with resource availability to minimize

poor utilization

� Allow local cluster-level optimizations of most grid workload

Unify Management Across Independent Clusters

� Unify management across existing internal, external, and

partner clusters – even if they have different resource man-

agers, databases, operating systems, and hardware

� Out-of-the-box support for both local area and wide area

grids

� Manage secure access to resources with simple credential

mapping or interface with popular security tool sets

� Leverage existing data-migration technologies, such as SCP

or GridFTP

Moab HPC Suite – Application Portal Edition
Remove the Barriers to Harnessing HPC

Organizations of all sizes and types are looking to harness the

potential of high performance computing (HPC) to enable and

accelerate more competitive product design and research. To

do this, you must find a way to extend the power of HPC resourc-

es to designers, researchers and engineers not familiar with or

trained on using the specialized HPC technologies. Simplifying

the user access to and interaction with applications running on

more efficient HPC clusters removes the barriers to discovery

and innovation for all types of departments and organizations.

Moab® HPC Suite – Application Portal Edition provides easy sin-

gle-point access to common technical applications, data, HPC

resources, and job submissions to accelerate research analysis

and collaboration process for end users while reducing comput-

ing costs.

HPC Ease of Use Drives Productivity
Moab HPC Suite – Application Portal Edition improves ease of

use and productivity for end users using HPC resources with sin-

gle-point access to their common technical applications, data,

resources, and job submissions across the design and research

process. It simplifies the specification and running of jobs on HPC resources to a few simple clicks on an application-specific template with key capabilities including:

� Application-centric job submission templates for common

manufacturing, energy, life-sciences, education and other

industry technical applications

� Support for batch and interactive applications, such as

simulations or analysis sessions, to accelerate the full proj-

ect or design cycle

� No special HPC training required to enable more users to

start accessing HPC resources with intuitive portal

� Distributed data management avoids unnecessary remote file transfers for users, storing data optimally on the network for easy access and protection, and optimizing file transfer to/from user desktops if needed

Accelerate Collaboration While Maintaining Control
More and more projects are done across teams and with part-

ner and industry organizations. Moab HPC Suite – Application

Portal Edition enables you to accelerate this collaboration be-

tween individuals and teams while maintaining control as ad-

ditional users, inside and outside the organization, can easily

and securely access project applications, HPC resources and

data to speed project cycles. Key capabilities and benefits you

can leverage are:

� Encrypted and controllable access, by class of user, ser-

vices, application or resource, for remote and partner users

improves collaboration while protecting infrastructure and

intellectual property


The Visual Cluster view in Moab Cluster Manager for Grids makes cluster and grid management easy and efficient.


� Enable multi-site, globally available HPC services, accessible anytime, anywhere, through many different types of devices via a standard web browser
� Reduced client requirements for projects mean new project members can quickly contribute without being limited by their desktop capabilities

Reduce Costs with Optimized Utilization and Sharing
Moab HPC Suite – Application Portal Edition reduces the costs

of technical computing by optimizing resource utilization and

the sharing of resources including HPC compute nodes, applica-

tion licenses, and network storage. The patented Moab intelli-

gence engine and its powerful policies deliver value in key areas

of utilization and sharing:

�Maximize HPC resource utilization with intelligent, opti-

mized scheduling policies that pack and enable more work

to be done on HPC resources to meet growing and changing

demand

� Optimize application license utilization and access by shar-

ing application licenses in a pool, re-using and optimizing us-

age with allocation policies that integrate with license man-

agers for lower costs and better service levels to a broader

set of users

� Usage budgeting and priorities enforcement with usage

accounting and priority policies that ensure resources are

shared in-line with service level agreements and project pri-

orities

� Leverage and fully utilize centralized storage capacity in-

stead of duplicative, expensive, and underutilized individual

workstation storage

Key Intelligent Workload Management Capabilities:

� Massive multi-point scalability

�Workload-optimized allocation policies and provisioning

� Unify workload management across heterogeneous clusters

� Optimized, intelligent scheduling

� Optimized scheduling and management of accelerators

(both Intel MIC and GPGPUs)

� Administrator dashboards and reporting tools

�Workload-aware auto power management

� Intelligent resource placement to prevent job failures

� Auto-response to failures and events

�Workload-aware future maintenance scheduling

� Usage accounting and budget enforcement

� SLA and priority policies

� Continuous plus future scheduling

Moab HPC Suite – Remote Visualization Edition
Removing Inefficiencies in Current 3D Visualization Methods

Using 3D and 2D visualization, valuable data is interpreted and new innovations are discovered. There are numerous inefficiencies in how current technical computing methods support users and organizations in achieving these discoveries in a cost-effective way. High-cost workstations and their software are often not fully utilized by the limited user(s) that have access to them. They are also costly to manage, and they don't keep pace long with workload technology requirements. In addition, the workstation model also requires slow transfers of the large data sets between them. This is costly and inefficient in network bandwidth, user productivity, and team collaboration, while posing increased data and security risks. There is a solution that removes these inefficiencies using new technical cloud computing models. Moab HPC Suite – Remote Visualization Edition significantly reduces the costs of visualization technical computing while improving the productivity, collaboration and security of the design and research process.

Reduce the Costs of Visualization Technical Computing
Moab HPC Suite – Remote Visualization Edition significantly

reduces the hardware, network and management costs of vi-

sualization technical computing. It enables you to create easy,

centralized access to shared visualization compute, applica-

tion and data resources. These compute, application, and data

resources reside in a technical compute cloud in your data

center instead of in expensive and underutilized individual

workstations.

� Reduce hardware costs by consolidating expensive individual

technical workstations into centralized visualization servers

for higher utilization by multiple users; reduce additive or up-

grade technical workstation or specialty hardware purchases,

such as GPUs, for individual users.

� Reduce management costs by moving from remote user workstations that are difficult and expensive to maintain, upgrade, and back up, to centrally managed visualization servers that require less admin overhead
� Decrease network access costs and congestion, as significantly lower loads of just compressed, visualized pixels are moving to users, not full data sets
� Reduce energy usage, as centralized visualization servers consume less energy for the user demand met
� Reduce data storage costs by consolidating data into common storage node(s) instead of under-utilized individual workstation storage


The Application Portal Edition easily extends the power of HPC to your users and projects to drive innovation and competitive advantage.


Consolidation and Grid
Many sites have multiple clusters as a result of having multiple

independent groups or locations, each with demands for HPC,

and frequent additions of newer machines. Each new cluster

increases the overall administrative burden and overhead. Ad-

ditionally, many of these systems can sit idle while others are

overloaded. Because of this systems-management challenge,

sites turn to grids to maximize the efficiency of their clusters.

Grids can be enabled either independently or in conjunction

with one another in three areas:

� Reporting Grids Managers want to have global reporting

across all HPC resources so they can see how users and proj-

ects are really utilizing hardware and so they can effectively

plan capacity. Unfortunately, manually consolidating all of

this information in an intelligible manner for more than just

a couple clusters is a management nightmare. To solve that

problem, sites will create a reporting grid, or share informa-

tion across their clusters for reporting and capacity-plan-

ning purposes.

� Management Grids Managing multiple clusters indepen-

dently can be especially difficult when processes change, because policies must be manually reconfigured across all clusters. To ease that difficulty, sites often set up manage-

ment grids that impose a synchronized management layer

across all clusters while still allowing each cluster some

level of autonomy.

� Workload-Sharing Grids Sites with multiple clusters often

have the problem of some clusters sitting idle while other

clusters have large backlogs. Such inequality in cluster uti-

lization wastes expensive resources, and training users to

perform different workload-submission routines across var-

ious clusters can be difficult and expensive as well. To avoid

these problems, sites often set up workload-sharing grids.

These grids can be as simple as centralizing user submission

or as complex as having each cluster maintain its own user

submission routine with an underlying grid-management

tool that migrates jobs between clusters.

Inhibitors to Grid Environments
Three common inhibitors keep sites from enabling grid environ-

ments:

� Politics Because grids combine resources across users,

groups, and projects that were previously independent, grid

implementation can be a political nightmare. To create a

grid in the real world, sites need a tool that allows clusters

to retain some level of sovereignty while participating in the

larger grid.

� Multiple Resource Managers Most sites have a variety of

resource managers used by various groups within the orga-

nization, and each group typically has a large investment in

scripts that are specific to one resource manager and that cannot be changed. To implement grid computing effectively, sites need a robust tool that has flexibility in integrating

with multiple resource managers.

� Credentials Many sites have different log-in credentials for

each cluster, and those credentials are generally indepen-

dent of one another. For example, one user might be Joe.P on

one cluster and J_Peterson on another. To enable grid envi-

ronments, sites must create a combined user space that can

recognize and combine these different credentials.

Using Moab in a Grid
Sites can use Moab to set up any combination of reporting, man-

agement, and workload-sharing grids. Moab is a grid metas-

cheduler that allows sites to set up grids that work effectively

in the real world. Moab’s feature-rich functionality overcomes

the inhibitors of politics, multiple resource managers, and vary-

ing credentials by providing:


� Grid Sovereignty Moab has multiple features that break

down political barriers by letting sites choose how each

cluster shares in the grid. Sites can control what information

is shared between clusters and can specify which workload

is passed between clusters. In fact, sites can even choose

to let each cluster be completely sovereign in making deci-

sions about grid participation for itself.

� Support for Multiple Resource Managers Moab meta-

schedules across all common resource managers. It fully

integrates with TORQUE and SLURM, the two most-common

open-source resource managers, and also has limited in-

tegration with commercial tools such as PBS Pro and SGE.

Moab’s integration includes the ability to recognize when a

user has a script that requires one of these tools, and Moab

can intelligently ensure that the script is sent to the correct

machine. Moab even has the ability to translate common

scripts across multiple resource managers.

� Credential Mapping Moab can map credentials across clusters to ensure that users and projects are tracked appropriately and to provide consolidated reporting; a toy credential-mapping illustration follows this list.
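The sketch below illustrates the idea using the Joe.P / J_Peterson example above. The mapping format is invented for this sketch and is not Moab's configuration syntax:

    # Toy illustration of credential mapping: per-cluster login names are
    # mapped to a single grid identity so usage can be accounted consistently.

    CREDENTIAL_MAP = {
        ("cluster_a", "Joe.P"):      "jpeterson",
        ("cluster_b", "J_Peterson"): "jpeterson",
    }

    def grid_identity(cluster, local_user):
        return CREDENTIAL_MAP.get((cluster, local_user), f"{cluster}:{local_user}")

    usage = [("cluster_a", "Joe.P", 120), ("cluster_b", "J_Peterson", 80)]
    totals = {}
    for cluster, user, core_hours in usage:
        gid = grid_identity(cluster, user)
        totals[gid] = totals.get(gid, 0) + core_hours

    print(totals)   # {'jpeterson': 200} -- one consolidated record for both logins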



Improve Collaboration, Productivity and Security
With Moab HPC Suite – Remote Visualization Edition, you can im-

prove the productivity, collaboration and security of the design

and research process by only transferring pixels instead of data

to users to do their simulations and analysis. This enables a wid-

er range of users to collaborate and be more productive at any

time, on the same data, from anywhere without any data transfer

time lags or security issues. Users also have improved immediate

access to specialty applications and resources, like GPUs, they

might need for a project so they are no longer limited by personal

workstation constraints or to a single working location.

� Improve workforce collaboration by enabling multiple

users to access and collaborate on the same interactive ap-

plication data at the same time from almost any location or

device, with little or no training needed

� Eliminate data transfer time lags for users by keeping data

close to the visualization applications and compute resourc-

es instead of transferring back and forth to remote worksta-

tions; only smaller compressed pixels are transferred

� Improve security with centralized data storage so data no

longer gets transferred to where it shouldn’t, gets lost, or

gets inappropriately accessed on remote workstations, only

pixels get moved

Maximize Utilization and Shared Resource Guarantees
Moab HPC Suite – Remote Visualization Edition maximizes your

resource utilization and scalability while guaranteeing shared

resource access to users and groups with optimized workload

management policies. Moab’s patented policy intelligence en-

gine optimizes the scheduling of the visualization sessions

across the shared resources to maximize standard and spe-

cialty resource utilization, helping them scale to support larger

volumes of visualization workloads. These intelligent policies

also guarantee shared resource access to users to ensure a high

service level as they transition from individual workstations.

� Guarantee shared visualization resource access for users

with priority policies and usage budgets that ensure they

receive the appropriate service level. Policies include ses-

sion priorities and reservations, number of users per node,

fairshare usage policies, and usage accounting and budgets

across multiple groups or users.

� Maximize resource utilization and scalability by packing visualization workload optimally on shared servers using Moab allocation and scheduling policies. These policies can include optimal resource allocation by visualization session characteristics (like CPU, memory, etc.), workload packing, number of users per node, GPU policies, etc.; a simple packing sketch follows this list.

� Improve application license utilization to reduce software

costs and improve access with a common pool of 3D visual-

ization applications shared across multiple users, with usage

optimized by intelligent Moab allocation policies, integrated

with license managers.

� Optimized scheduling and management of GPUs and other

accelerators to maximize their utilization and effectiveness

� Enable multiple Windows or Linux user sessions on a single

machine managed by Moab to maximize resource and power

utilization.

� Enable dynamic OS re-provisioning on resources, managed

by Moab, to optimally meet changing visualization applica-

tion workload demand and maximize resource utilization

and availability to users.
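A minimal sketch of such session packing, assuming a simple per-node user limit; it is meant only to illustrate the packing idea, not Moab's actual allocation policies:

    # Conceptual sketch of packing interactive visualization sessions onto
    # shared servers: fill nodes up to a per-node user limit before opening a
    # new one, so fewer servers stay powered on.

    def pack_sessions(sessions, nodes, users_per_node=4):
        placement = {n: [] for n in nodes}
        for user in sessions:
            target = next((n for n in nodes if len(placement[n]) < users_per_node), None)
            if target is None:
                raise RuntimeError("no capacity left for " + user)
            placement[target].append(user)
        return placement

    print(pack_sessions(["ann", "bob", "eve", "joe", "kim"],
                        ["viz01", "viz02", "viz03"]))
    # -> {'viz01': ['ann', 'bob', 'eve', 'joe'], 'viz02': ['kim'], 'viz03': []}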


The Remote Visualization Edition makes visualization more cost effective and productive with an innovative technical compute cloud.


Moab HPC Suite Use Case Basic Enterprise Application Portal Remote Visualization

Productivity Acceleration

Massive multifaceted scalability x x x x

Workload optimized resource allocation x x x x

Optimized, intelligent scheduling of workload and resources

x x x x

Optimized accelerator scheduling & management (Intel MIC & GPGPU)

x x x x

Simplified user job submission & mgmt. - Application Portal

x x xx

x

Administrator dashboard and management reporting

x x x x

Unified workload management across heterogeneous cluster(s)

x (single cluster only)

x x x

Workload optimized node provisioning x x x

Visualization Workload Optimization x

Workload-aware auto-power management x x x

Uptime Automation

Intelligent resource placement for failure prevention

x x x x

Workload-aware maintenance scheduling x x x x

Basic deployment, training and premium (24x7) support service

Standard support only

x x x

Auto response to failures and events x (limited; no re-boot/provision)

x x x

Auto SLA enforcement

Resource sharing and usage policies x x x x

SLA and priority policies x x x x

Continuous and future reservations scheduling x x x x

Usage accounting and budget enforcement x x x

Grid- and Cloud-Ready

Manage and share workload across multiple clusters in a wide-area grid

x (w/ Grid Option)

x (w/ Grid Option)

x (w/ Grid Option)

x (w/ Grid Option)

User self-service: job request & management portal

x x x x

Pay-for-use showback and chargeback x x x

Workload optimized node provisioning & re-purposing

x x x

Multiple clusters in single domain



transtec HPC solutions are designed for maximum flexibility and ease of management. We not only offer our customers the most powerful and flexible cluster management solution out there, but also provide them with customized setup and site-specific configuration. Whether a customer needs a dynamic Linux-Windows dual-boot solution, unified management of different clusters at different sites, or fine-tuning of the Moab scheduler to implement fine-grained policy configuration – transtec not only gives you the framework at hand, but also helps you adapt the system according to your special needs. Needless to say, when customers are in need of special trainings, transtec will be there to provide customers, administrators, or users with specially adjusted Educational Services.

Many years' experience in High Performance Computing has enabled us to develop efficient concepts for installing and deploying HPC clusters. For this, we leverage well-proven 3rd party tools, which we assemble into a total solution and adapt and configure according to the customer's requirements. We manage everything from the complete installation and configuration of the operating system, through necessary middleware components like job and cluster management systems, up to the customer's applications.


Remote Virtualization and Workflow Optimization

It's human nature to want to 'see' the results from simulations, tests, and analyses. Up until recently, this has meant 'fat' workstations on many user desktops. This approach provides CPU power when the user wants it – but as dataset size increases, there can be delays in downloading the results. Also, sharing the results with colleagues means gathering around the workstation - not always possible in this globalized, collaborative workplace.



Aerospace, Automotive and Manufacturing
The complex factors involved in CAE range from compute-in-

tensive data analysis to worldwide collaboration between de-

signers, engineers, OEM’s and suppliers. To accommodate these

factors, Cloud (both internal and external) and Grid Computing

solutions are increasingly seen as a logical step to optimize IT

infrastructure usage for CAE. It is no surprise then, that automo-

tive and aerospace manufacturers were the early adopters of

internal Cloud and Grid portals.

Manufacturers can now develop more “virtual products” and

simulate all types of designs, fluid flows, and crash simula-

tions. Such virtualized products and more streamlined col-

laboration environments are revolutionizing the manufac-

turing process.

With NICE EnginFrame in their CAE environment engineers can

take the process even further by connecting design and simula-

tion groups in “collaborative environments” to get even greater

benefits from “virtual products”. Thanks to EnginFrame, CAE engineers can have a simple, intuitive collaborative environment that takes care of issues related to:

� Access & Security - Where an organization must give access

to external and internal entities such as designers, engi-

neers and suppliers.

� Distributed collaboration - Simple and secure connection of

design and simulation groups distributed worldwide.

� Time spent on IT tasks - By eliminating time and resources

spent using cryptic job submission commands or acquiring

knowledge of underlying compute infrastructures, engi-

neers can spend more time concentrating on their core de-

sign tasks.

EnginFrame’s web based interface can be used to access the

compute resources required for CAE processes. This means ac-

cess to job submission & monitoring tasks, input & output data

associated with industry standard CAE/CFD applications for

Fluid-Dynamics, Structural Analysis, Electro Design and Design

Collaboration (like Abaqus, Ansys, Fluent, MSC Nastran, PAM-

Crash, LS-Dyna, Radioss) without cryptic job submission com-

mands or knowledge of underlying compute infrastructures.

EnginFrame has a long history of usage in some of the most

prestigious manufacturing organizations worldwide including

Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo

Avionica, Hamilton Sunstrand, Magellan Aerospace, MTU and

Automotive companies like Audi, ARRK, Brawn GP, Bridgestone,

Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-Lan-

dRover, Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok,

Suzuki, Toyota, TRW.

Life Sciences and Healthcare
NICE solutions are deployed in the Life Sciences sector at com-

panies like BioLab, Partners HealthCare, Pharsight and the

M.D.Anderson Cancer Center; also in leading research projects like

DEISA or LitBio in order to allow easy and transparent use of com-

puting resources without any insight of the HPC infrastructure.

The Life Science and Healthcare sectors have some very strict requirements when choosing an IT solution like EnginFrame, for instance:

� Security - To meet the strict security & privacy requirements

of the biomedical and pharmaceutical industry any solution

needs to take account of multiple layers of security and au-

thentication.

� Industry-specific software - ranging from the simplest custom tool to the more general-purpose free and open middlewares.

EnginFrame’s modular architecture allows different Grid middlewares and software, including leading Life Science applications (like Schroedinger Glide, EPFL RAxML, the BLAST family, Taverna, and R), to be exploited. Users can compose elementary services into complex applications and “virtual experiments”, and run, monitor, and build workflows via a standard Web browser. EnginFrame also has highly tunable resource sharing and fine-grained access control, where multiple authentication systems (like Active Directory, Kerberos/LDAP) can be exploited simultaneously.

NICE EnginFrame: A Technical Computing Portal

Growing complexity in globalized teams
HPC systems and Enterprise Grids deliver unprecedented time-

to-market and performance advantages to many research and

corporate customers, struggling every day with compute and

data intensive processes.

This often generates or transforms massive amounts of jobs

and data that need to be handled and archived efficiently to

deliver timely information to users distributed in multiple loca-

tions, with different security concerns.

Poor usability of such complex systems often negatively im-

pacts users’ productivity, and ad-hoc data management often

increases information entropy and dissipates knowledge and

intellectual property.

The solution

Solving distributed computing issues for our customers, we un-

derstand that a modern, user-friendly web front-end to HPC and


Grids can drastically improve engineering productivity, if prop-

erly designed to address the specific challenges of the Technical Computing market.

NICE EnginFrame overcomes many common issues in the areas

of usability, data management, security and integration, open-

ing the way to a broader, more effective use of the Technical

Computing resources.

The key components of the solution are:

� A flexible and modular Java-based kernel, with clear separation between customizations and core services
� Powerful data management features, reflecting the typical needs of engineering applications
� Comprehensive security options and a fine-grained authorization system

� Scheduler abstraction layer to adapt to different workload

and resource managers

� Responsive and competent Support services

End users can typically enjoy the following improvements:

� User-friendly, intuitive access to computing resources, using

a standard Web browser

� Application-centric job submission

� Organized access to job information and data

� Increased mobility and reduced client requirements

On the other side, the Technical Computing Portal delivers sig-

nifi cant added-value for system administrators and IT:

� Reduced training needs to enable users to access the

resources

� Centralized configuration and immediate deployment of

services

� Comprehensive authorization to access services and

information

� Reduced support calls and submission errors

Coupled with our Remote Visualization solutions, our custom-

ers quickly deploy end-to-end engineering processes on their

Intranet, Extranet or Internet.


Remote Virtualization

Desktop Cloud Virtualization

Solving distributed computing issues for our customers, it is

easy to understand that a modern, user-friendly web front-end

to HPC and grids can drastically improve engineering productiv-

ity, if properly designed to address the specific challenges of the

Technical Computing market.

Remote Visualization
Increasing dataset complexity (millions of polygons, interact-

ing components, MRI/PET overlays) means that as time comes

to upgrade and replace the workstations, the next generation

of hardware needs more memory, more graphics processing,

more disk, and more CPU cores. This makes the workstation ex-

pensive, in need of cooling, and noisy.

Innovation in the field of remote 3D processing now allows com-

panies to address these issues moving applications away from

the Desktop into the data center. Instead of pushing data to the

application, the application can be moved near the data.

Instead of mass workstation upgrades, remote visualization al-

lows incremental provisioning, on-demand allocation, better

management and efficient distribution of interactive sessions

and licenses. Racked workstations or blades typically have

lower maintenance, cooling, replacement costs, and they can

extend workstation (or laptop) life as “thin clients”.

The solution

Leveraging their expertise in distributed computing and web-

based application portals, NICE offers an integrated solution to

access, load balance and manage applications and desktop ses-

sions running within a visualization farm. The farm can include

both Linux and Windows resources, running on heterogeneous

hardware.

The core of the solution is the EnginFrame visualization plug-in,

that delivers web-based services to access and manage appli-

cations and desktops published in the farm. This solution has

been integrated with:

� NICE Desktop Cloud Visualization (DCV)

� HP Remote Graphics Software (RGS)

� RealVNC

� TurboVNC and VirtualGL

� Nomachine NX

Coupled with these third party remote visualization engines

(which specialize in delivering high frame-rates for 3D graphics),

the NICE offering for Remote Visualization solves the issues of

user authentication, dynamic session allocation, session man-

agement and data transfers.

End users can enjoy the following improvements:

� Intuitive, application-centric web interface to start, control

and re-connect to a session

� Single sign-on for batch and interactive applications

� All data transfers from and to the remote visualization farm

are handled by EnginFrame

� Built-in collaboration, to share sessions with other users

� The load and usage of the visualization cluster is monitored

in the browser

The solution also delivers significant added-value for the sys-

tem administrators:

� No need of SSH / SCP / FTP on the client machine

� Easy integration into identity services, Single Sign-On (SSO),

Enterprise portals

� Automated data life cycle management

� Built-in user session sharing, to facilitate support

� Interactive sessions are load balanced by a scheduler (LSF,

Moab or Torque) to achieve optimal performance and resour-

ce usage

� Better control and use of application licenses

� Monitor, control and manage users’ idle sessions

Desktop Cloud Virtualization

NICE Desktop Cloud Visualization (DCV) is an advanced technology that enables Technical Computing users to remotely access 2D/3D interactive applications over a standard network.

Engineers and scientists are immediately empowered by taking

full advantage of high-end graphics cards, fast I/O performance

and large memory nodes hosted in “Public or Private 3D Cloud”,

rather than waiting for the next upgrade of the workstations.

The DCV protocol adapts to heterogeneous networking infra-

structures like LAN, WAN and VPN, to deal with bandwidth and

latency constraints. All applications run natively on the remote


machines, which can be virtualized and share the same physical

GPU.

In a typical visualization scenario, a software application sends a

stream of graphics commands to a graphics adapter through an

input/output (I/O) interface. The graphics adapter renders the data

into pixels and outputs them to the local display as a video signal.

When using NICE DCV, the scene geometry and graphics state

are rendered on a central server, and pixels are sent to one or

more remote displays.

This approach requires the server to be equipped with one or

more GPUs, which are used for the OpenGL rendering, while the

client software can run on “thin” devices.
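A back-of-the-envelope comparison of the two models makes the point; all numbers below are illustrative assumptions rather than DCV measurements:

    # Back-of-the-envelope comparison of the two models described above:
    # moving the full dataset to a local workstation vs. rendering centrally
    # and streaming compressed pixels.

    dataset_gb = 50            # simulation result to be post-processed
    link_mbit_s = 100          # office network link
    frame_kb = 50              # one compressed 1080p frame (assumption)

    seconds_to_first_image_local = dataset_gb * 8 * 1024 / link_mbit_s
    seconds_to_first_image_remote = frame_kb * 8 / 1024 / link_mbit_s

    print(f"download-then-render: ~{seconds_to_first_image_local / 3600:.1f} h "
          f"before the first picture")
    print(f"render remotely, stream pixels: ~{seconds_to_first_image_remote * 1000:.0f} ms "
          f"for the first frame")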

NICE DCV architecture consists of:

� DCV Server, equipped with one or more GPUs, used for OpenGL rendering
� one or more DCV end stations, running on “thin clients”, only used for visualization
� heterogeneous networking infrastructures (like LAN, WAN and VPN), with optimized balancing of quality vs. frame rate

NICE DCV Highlights:

� enables high-performance remote access to interactive 2D/3D software applications over low-bandwidth/high-latency networks
� supports multiple heterogeneous OSes (Windows, Linux)
� enables GPU sharing
� supports 3D acceleration for OpenGL applications running on virtual machines
� supports multiple-user collaboration via session sharing
� enables attractive return on investment through resource sharing and consolidation into data centers (GPU, memory, CPU, ...)


� Keeps the data secure in the data center, reducing data load and saving time

� Enables right sizing of system allocation based on user’s dy-

namic needs

� Facilitates application deployment: all applications, updates

and patches are instantly available to everyone, without any

changes to original code

Business Benefits

The business benefits for adopting NICE DCV can be summarized into four categories:

Category: Productivity
� Increase business efficiency
� Improve team performance by ensuring real-time collaboration with colleagues and partners, anywhere
� Reduce IT management costs by consolidating workstation resources to a single point of management
� Save money and time on application deployment
� Let users work from anywhere if there is an Internet connection

Category: Business Continuity
� Move graphics processing and data to the datacenter - not on laptop/desktop
� Cloud-based platform support enables you to scale the visualization solution “on-demand” to extend business, grow new revenue, manage costs

Category: Data Security
� Guarantee secure and auditable use of remote resources (applications, data, infrastructure, licenses)
� Allow real-time collaboration with partners while protecting Intellectual Property and resources
� Restrict access by class of user, service, application, and resource

Category: Training Effectiveness
� Enable multiple users to follow application procedures alongside an instructor in real time
� Enable collaboration and session sharing among remote users (employees, partners, and affiliates)

NICE DCV is perfectly integrated into EnginFrame Views, lever-

aging 2D/3D capabilities over the web, including the ability to

share an interactive session with other users for collaborative

working.

Windows:
� Microsoft Windows 7 - 32/64 bit
� Microsoft Windows Vista - 32/64 bit
� Microsoft Windows XP - 32/64 bit

Linux:
� RedHat Enterprise 5.x and 6.x - 32/64 bit
� SUSE Enterprise Server 11 - 32/64 bit

© 2012 by NICE

“The amount of data resulting from e.g. simulations

in CAE or other engineering environments can be in

the Gigabyte range. It is obvious that remote post-

processing is one of the most urgent topics to be tack-

led. NICE EnginFrame provides exactly that, and our

customers are impressed that such great technology

enhances their workflow so significantly.”

Robin Kienecker HPC Sales Specialist


Cloud Computing With transtec and NICE
Cloud Computing is a style of computing in which dynamically

scalable and often virtualized resources are provided as a service

over the Internet. Users need not have knowledge of, expertise in,

or control over the technology infrastructure in the “cloud” that

supports them. The concept incorporates technologies that have

the common theme of reliance on the Internet for satisfying the

computing needs of the users. Cloud Computing services usually

provide applications online that are accessed from a web brows-

er, while the software and data are stored on the servers.

Companies or individuals engaging in Cloud Computing do not

own the physical infrastructure hosting the software platform

in question. Instead, they avoid capital expenditure by renting

usage from a third-party provider (except for the case of ‘Private

Cloud’ - see below). They consume resources as a service, paying

instead for only the resources they use. Many Cloud Computing

offerings have adopted the utility computing model, which is

analogous to how traditional utilities like electricity are con-

sumed, while others are billed on a subscription basis.

The main advantage offered by Cloud solutions is the reduc-

tion of infrastructure costs, and of the infrastructure’s main-

tenance. By not owning the hardware and software, Cloud

users avoid capital expenditure by renting usage from a third-

party provider. Customers pay for only the resources they use.

The advantage for the provider of the Cloud is that sharing computing power among multiple tenants improves utilization rates, as servers are not left idle, which can reduce costs and increase efficiency.

Public cloud
Public cloud or external cloud describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications/web services, from an off-site third-party provider who shares resources and bills on a fine-grained utility computing basis.

Private cloud
Private cloud and internal cloud describe offerings deploying cloud computing on private networks. These solutions aim to “deliver some benefits of cloud computing without the pitfalls”, capitalizing on concerns about data security, corporate governance, and reliability. On the other hand, users still have to buy, deploy, manage, and maintain such a cloud, and as such do not benefit from lower up-front capital costs and less hands-on management.

Architecture
The majority of cloud computing infrastructure today consists of reliable services delivered through data centers and built on servers with different levels of virtualization technologies. The services are accessible from anywhere that has access to networking infrastructure. The Cloud appears as a single point of access for all the computing needs of consumers. The offerings need to meet the quality-of-service requirements of customers, typically offer service level agreements, and at the same time overcome the typical limitations.

The architecture of the computing platform proposed by NICE (fig. 1) differs from others in some interesting ways:

- it can be deployed on an existing IT infrastructure, because it is completely decoupled from the hardware infrastructure
- it has a high level of modularity and configurability, making it easily customizable for the user's needs
- based on the NICE EnginFrame technology, it is easy to build graphical web-based interfaces that provide applications as web services, without needing to code or compile source programs
- it utilises the existing methodology in place for authentication and authorization

Further, because the NICE Cloud solution is built on advanced IT technologies, including virtualization and workload management, the execution platform is dynamically able to allocate,


monitor, and configure a new environment as needed by the application, inside the Cloud infrastructure. The NICE platform offers these important properties:

- Incremental Scalability: the quantity of computing and storage resources provisioned to the applications changes dynamically depending on the workload
- Reliability and Fault Tolerance: because of the virtualization of the hardware resources and the multiple redundant hosts, the platform adjusts the resources needed by the applications, without disruption during disasters or crashes
- Service Level Agreement: the use of advanced systems for the dynamic allocation of resources allows a service level, agreed across applications and services, to be guaranteed
- Accountability: the continuous monitoring of the resources used by each application (and user) allows the setup of services that users can access in a pay-per-use mode, or by subscribing to a specific contract. In the case of an Enterprise Cloud, this feature allows costs to be shared among the cost centers of the company (a minimal accounting sketch follows this list).
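As a minimal, purely illustrative sketch of the accountability idea (none of the record layout, rates, or names below come from NICE; they are assumptions made for the example), continuous per-user monitoring data can be rolled up into pay-per-use charges or internal cost-center allocation like this:

from collections import defaultdict

# Hypothetical usage records as a monitoring layer might collect them:
# (user, cost_center, core_hours, gpu_hours)
usage_records = [
    ("alice", "CAE",        120.0, 8.0),
    ("bob",   "CAE",         40.0, 0.0),
    ("carol", "Structures",  75.5, 2.5),
]

# Assumed, illustrative price list per resource unit (pay-per-use mode).
RATES = {"core_hour": 0.05, "gpu_hour": 0.60}

def charge(core_hours, gpu_hours):
    """Convert consumed resources into a charge."""
    return core_hours * RATES["core_hour"] + gpu_hours * RATES["gpu_hour"]

per_user = defaultdict(float)
per_cost_center = defaultdict(float)

for user, cost_center, core_h, gpu_h in usage_records:
    cost = charge(core_h, gpu_h)
    per_user[user] += cost
    per_cost_center[cost_center] += cost   # Enterprise Cloud: share costs internally

print("per user:       ", dict(per_user))
print("per cost center:", dict(per_cost_center))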

transtec has long-term experience in engineering environments, especially in the CAD/CAE sector. This allows us to provide customers from this area with solutions that greatly enhance their workflow and minimize time-to-result.

This, together with transtec's offering of all kinds of services, allows our customers to fully focus on their productive work, and have us do the environmental optimizations.


Graphics accelerated virtual desktops and applications

NVIDIA GRID offers large corporations the ability to offload graphics processing from the CPU to the GPU in virtualised environments, allowing the data center manager to deliver true PC graphics-rich experiences to more users for the first time.

Benefits of NVIDIA GRID for IT:
- Leverage industry-leading remote visualization solutions like NICE DCV
- Add your most graphics-intensive users to your virtual solution
- Improve the productivity of all users

Benefits of NVIDIA GRID for users:
- Highly responsive windows and rich multimedia experiences
- Access to all critical applications, including the most 3D-intensive
- Access from anywhere, on any device

NVIDIA GRID Boards
NVIDIA's Kepler architecture-based GRID boards are specifically designed to enable rich graphics in virtualised environments.

GPU Virtualisation
GRID boards feature NVIDIA Kepler-based GPUs that, for the first time, allow hardware virtualisation of the GPU. This means multiple users can share a single GPU, improving user density while providing true PC performance and compatibility.

Low-Latency Remote Display
NVIDIA's patented low-latency remote display technology greatly improves the user experience by reducing the lag that users feel when interacting with their virtual machine. With this technology, the virtual desktop screen is pushed directly to the remoting protocol.

H.264 Encoding
The Kepler GPU includes a high-performance H.264 encoding engine capable of encoding simultaneous streams with superior quality. This provides a giant leap forward in cloud server efficiency by offloading the CPU from encoding functions and allowing the encode function to scale with the number of GPUs in a server.

Maximum User Density
GRID boards have an optimised multi-GPU design that helps to maximise user density. The GRID K1 board features 4 GPUs and 16 GB of graphics memory, allowing it to support up to 100 users on a single board.

Power Efficiency
GRID boards are designed to provide data center-class power efficiency, including the revolutionary new streaming multiprocessor, called "SMX". The result is an innovative, proven solution that delivers revolutionary performance per watt for the enterprise data centre.

24/7 Reliability
GRID boards are designed, built, and tested by NVIDIA for 24/7 operation. Working closely with leading server vendors ensures GRID cards perform optimally and reliably for the life of the system.

Widest Range of Virtualisation Solutions
GRID cards enable GPU-capable virtualisation solutions like XenServer or Linux KVM, delivering the flexibility to choose from a wide range of proven solutions.

Technical Specifications

                         GRID K1 Board              GRID K2 Board
Number of GPUs           4 x entry Kepler GPUs      2 x high-end Kepler GPUs
Total NVIDIA CUDA cores  768                        3072
Total memory size        16 GB DDR3                 8 GB GDDR5
Max power                130 W                      225 W
Board length             10.5"                      10.5"
Board height             4.4"                       4.4"
Board width              Dual slot                  Dual slot
Display IO               None                       None
Aux power                6-pin connector            8-pin connector
PCIe                     x16                        x16
PCIe generation          Gen3 (Gen2 compatible)     Gen3 (Gen2 compatible)
Cooling solution         Passive                    Passive

Virtual GPU Technology
NVIDIA GRID vGPU brings the full benefit of NVIDIA hardware-accelerated graphics to virtualized solutions. This technology provides exceptional graphics performance for virtual desktops equivalent to local PCs when sharing a GPU among multiple users.

NVIDIA GRID vGPU is the industry's most advanced technology for sharing true GPU hardware acceleration between multiple

virtual desktops, without compromising the graphics experience. Application features and compatibility are exactly the same as they would be at the desk.

With GRID vGPU technology, the graphics commands of each virtual machine are passed directly to the GPU, without translation by the hypervisor. This allows the GPU hardware to be time-sliced to deliver the ultimate in shared virtualised graphics performance.

vGPU Profiles Mean Customised, Dedicated Graphics Memory
Take advantage of the vGPU Manager to assign just the right amount of memory to meet the specific needs of each user. Every virtual desktop has dedicated graphics memory, just as it would at the desk, so users always have the resources they need to launch and use their applications.

The vGPU Manager enables up to eight users to share each physical GPU, assigning the graphics resources of the available GPUs to virtual machines in a balanced approach. Each NVIDIA GRID K1 graphics card has up to four GPUs, allowing 32 users to share a single graphics card.

NVIDIA GRID     Virtual GPU   Graphics   Max Displays   Max Resolution   Max Users Per     Use Case
Graphics Board  Profile       Memory     Per User       Per Display      Graphics Board
GRID K2         K260Q         2,048 MB   4              2560x1600        4                 Designer/Power User
                K240Q         1,024 MB   2              2560x1600        8                 Designer/Power User
                K220Q         512 MB     2              2560x1600        16                Designer/Power User
                K200          256 MB     2              1900x1200        16                Knowledge Worker
GRID K1         K140Q         1,024 MB   2              2560x1600        16                Power User
                K120Q         512 MB     2              2560x1600        32                Power User
                K100          256 MB     2              1900x1200        32                Knowledge Worker

Up to 8 users are supported per physical GPU, depending on the vGPU profile.
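The user-density figures in the table follow directly from the frame buffer per physical GPU, the number of GPUs per board, and the limit of eight vGPUs per physical GPU. A small sketch reproducing that arithmetic (board data taken from the tables above; the helper function itself is only illustrative):

# Frame buffer per physical GPU: GRID K1 = 16 GB / 4 GPUs, GRID K2 = 8 GB / 2 GPUs.
BOARDS = {
    "GRID K1": {"gpus": 4, "fb_per_gpu_mb": 4096},
    "GRID K2": {"gpus": 2, "fb_per_gpu_mb": 4096},
}
MAX_VGPUS_PER_GPU = 8  # vGPU Manager limit stated above

PROFILES = {  # profile -> dedicated graphics memory in MB
    "K260Q": 2048, "K240Q": 1024, "K220Q": 512, "K200": 256,   # GRID K2
    "K140Q": 1024, "K120Q": 512, "K100": 256,                  # GRID K1
}

def max_users(board, profile):
    """Users per board = min(frame buffer / profile memory, 8) * GPUs per board."""
    b = BOARDS[board]
    per_gpu = min(b["fb_per_gpu_mb"] // PROFILES[profile], MAX_VGPUS_PER_GPU)
    return per_gpu * b["gpus"]

for board, profiles in (("GRID K2", ["K260Q", "K240Q", "K220Q", "K200"]),
                        ("GRID K1", ["K140Q", "K120Q", "K100"])):
    for p in profiles:
        print(f"{board} {p}: up to {max_users(board, p)} users per board")
# Matches the table: K260Q 4, K240Q 8, K220Q 16, K200 16, K140Q 16, K120Q 32, K100 32.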


Intel Cluster Ready – A Quality Standard for HPC Clusters

Intel Cluster Ready is designed to create predictable expectations for users and providers of HPC clusters, primarily targeting customers in the commercial and industrial sectors. These are not experimental “test-bed” clusters used for computer science and computer engineering research, nor high-end “capability” clusters closely targeting the specific computing requirements that power the high-energy physics at the national labs or other specialized research organizations. Intel Cluster Ready seeks to advance HPC clusters used as computing resources in production environments by providing cluster owners with a high degree of confidence that the clusters they deploy will run the applications their scientific and engineering staff rely upon to do their jobs. It achieves this by providing cluster hardware, software, and system providers with a precisely defined basis for their products to meet their customers' production cluster requirements.


Intel Cluster Ready

A Quality Standard for HPC Clusters

What are the Objectives of ICR?
The primary objective of Intel Cluster Ready is to make clusters easier to specify, easier to buy, easier to deploy, and to make it easier to develop applications that run on them. A key feature of ICR is the concept of “application mobility”, which is defined as the ability of a registered Intel Cluster Ready application – more correctly, the same binary – to run correctly on any certified Intel Cluster Ready cluster. Clearly, application mobility is important for users, software providers, hardware providers, and system providers.

- Users want to know the cluster they choose will reliably run the applications they rely on today, and will rely on tomorrow
- Application providers want to satisfy the needs of their customers by providing applications that reliably run on their customers' cluster hardware and cluster stacks
- Cluster stack providers want to satisfy the needs of their customers by providing a cluster stack that supports their customers' applications and cluster hardware
- Hardware providers want to satisfy the needs of their customers by providing hardware components that support their customers' applications and cluster stacks
- System providers want to satisfy the needs of their customers by providing complete cluster implementations that reliably run their customers' applications

Without application mobility, each group above must either try to support all combinations, which they have neither the time nor the resources to do, or pick the “winning combination(s)” that best supports their needs, and risk making the wrong choice. The Intel Cluster Ready definition of application portability supports all of these needs by going beyond pure portability (recompiling and linking a unique binary for each platform) to application binary mobility (running the same binary on multiple platforms), by more precisely defining the target system.

A further aspect of application mobility is to ensure that registered Intel Cluster Ready applications do not need special programming or alternate binaries for different message fabrics. Intel Cluster Ready accomplishes this by providing an MPI implementation supporting multiple fabrics at runtime; through this, registered Intel Cluster Ready applications obey the “message layer independence property”. Stepping back, the unifying concept of Intel Cluster Ready is “one-to-many”, that is:

- One application will run on many clusters
- One cluster will run many applications

How is one-to-many accomplished? Looking at Figure 1, you see the abstract Intel Cluster Ready “stack” components that always exist in every cluster, i.e., one or more applications, a cluster software stack, one or more fabrics, and finally the underlying cluster hardware. The remainder of that picture (to the right) shows the components in greater detail.

Applications, at the top of the stack, rely upon the various APIs, utilities, and file system structure presented by the underlying software stack. Registered Intel Cluster Ready applications are always able to rely upon the APIs, utilities, and file system structure specified by the Intel Cluster Ready Specification; if an application requires software outside this “required” set, then Intel Cluster Ready requires the application to provide that software as a part of its installation. To ensure that this additional per-application software doesn't conflict with the cluster stack or other applications, Intel Cluster Ready also requires the additional software to be installed in application-private trees, so the application knows how to find that software while not interfering with other applications. While this may well cause duplicate software to be installed, the reliability provided by the duplication far outweighs the cost of the duplicated files. A prime example supporting this comparison is the removal of a common file (library, utility, or other) that is unknowingly needed by some other application – such errors can be insidious to repair even when they cause an outright application failure.

“The Intel Cluster Checker allows us to certify that our transtec HPC clusters are compliant with an independent high quality standard. Our customers can rest assured: their applications run as they expect.”

Marcus Wiedemann, HPC Solution Engineer

(Figure 1: The ICR stack – registered applications (CFD, crash, climate, QCD, bio, ...) run on a single solution platform consisting of the Intel MPI Library and Intel MKL Cluster Edition runtimes, Intel-selected Linux cluster tools, and optional Intel development tools (C++, Intel Trace Analyzer and Collector, MKL, etc.), over fabrics such as Gigabit Ethernet, 10 Gbit Ethernet, and InfiniBand (OFED), on certified Intel Xeon processor-based cluster platforms from Intel, OEMs, and platform integrators, each adding individual value.)

Cluster platforms, at the bottom of the stack, provide the APIs, utilities, and file system structure relied upon by registered applications. Certified Intel Cluster Ready platforms ensure that the APIs, utilities, and file system structure are complete per the Intel Cluster Ready Specification; certified clusters are able to provide them by whatever means they deem appropriate. Because of the clearly defined responsibilities ensuring the presence of all software required by registered applications, system providers have high confidence that the certified clusters they build are able to run any certified applications their customers rely on. In addition to meeting the Intel Cluster Ready requirements, certified clusters can also provide their added value, that is, other features and capabilities that increase the value of their products.

How Does Intel Cluster Ready Accomplish its Objectives?
At its heart, Intel Cluster Ready is a definition of the cluster as a parallel application platform, as well as a tool to certify an actual cluster against that definition. Let's look at each of these in more detail, to understand their motivations and benefits.

A definition of the cluster as a parallel application platform
The Intel Cluster Ready Specification is very much written as the requirements for, not the implementation of, a platform upon which parallel applications, more specifically MPI applications, can be built and run. As such, the specification doesn't care whether the cluster is diskful or diskless, fully distributed or single system image (SSI), built from “Enterprise” distributions or community distributions, fully open source or not. Perhaps more importantly, with one exception, the specification doesn't have any requirements on how the cluster is built; that one exception is that compute nodes must be built with automated tools, so that new, repaired, or replaced nodes can be rebuilt identically to the existing nodes without any manual interaction, other than possibly initiating the build process.

Some items the specification does care about include:

- The ability to run both 32- and 64-bit applications, including MPI applications and X clients, on any of the compute nodes
- Consistency among the compute nodes' configuration, capability, and performance
- The identical accessibility of libraries and tools across the cluster
- The identical access by each compute node to permanent and temporary storage, as well as users' data
- The identical access to each compute node from the head node
- An MPI implementation that provides fabric independence
- Support for network booting and a remotely accessible console on all nodes

The specification also requires that the runtimes for specific Intel software products are installed on every certified cluster:

- Intel Math Kernel Library
- Intel MPI Library Runtime Environment
- Intel Threading Building Blocks

This requirement does two things. First and foremost, mainline Linux distributions do not necessarily provide a sufficient software stack to build an HPC cluster – such specialization is beyond their mission. Secondly, the requirement ensures that programs built with this software will always work on certified clusters and enjoy simpler installations. As these runtimes are directly available from the web, the requirement does not cause additional costs for certified clusters. It is also very important to note that this does not require certified applications to use these libraries, nor does it preclude alternate libraries, e.g., other MPI implementations, from being present on certified clusters. Quite clearly, an application that requires, e.g., an alternate MPI must also provide the runtimes for that MPI as a part of its installation.

A tool to certify an actual cluster to the definition
The Intel Cluster Checker, included with every certified Intel Cluster Ready implementation, is used in four modes in the life of a cluster:

- To certify a system provider's prototype cluster as a valid implementation of the specification
- To verify to the owner that the just-delivered cluster is a “true copy” of the certified prototype
- To ensure the cluster remains fully functional, reducing service calls not related to the applications or the hardware
- To help software and system providers diagnose and correct actual problems in their code or their hardware

While these are critical capabilities, in all fairness, this greatly understates the capabilities of the Intel Cluster Checker. The tool does not merely verify that the cluster is performing as expected; to do this, per-node and cluster-wide static and dynamic tests are made of the hardware and software.

(Figure: Intel Cluster Checker – the Cluster Checker engine reads a cluster definition and configuration XML file, runs test modules and parallel operation checks against the cluster nodes through a configuration/result API, and writes pass/fail results and diagnostics to STDOUT and a logfile.)


The static checks ensure the systems are configured consistently and appropriately. As one example, the tool will ensure the systems are all running the same BIOS version as well as having identical configurations among key BIOS settings. This type of problem – differing BIOS versions or settings – can be the root cause of subtle problems such as differing memory configurations that manifest themselves as differing memory bandwidths, only to be seen at the application level as slower-than-expected overall performance. As is well known, parallel program performance is very much governed by the performance of the slowest components, not the fastest. In another static check, the Intel Cluster Checker will ensure that the expected tools, libraries, and files are present on each node, identically located on all nodes, as well as identically implemented on all nodes. This ensures that each node has the minimal software stack required by the specification, as well as an identical software stack among the compute nodes.

A typical dynamic check ensures consistent system performance, e.g., via the STREAM benchmark. This particular test ensures processor and memory performance is consistent across compute nodes, which, like the BIOS setting example above, can be the root cause of overall slower application performance. An additional check with STREAM can be made if the user configures an expectation of benchmark performance; this check will ensure that performance is not only consistent across the cluster, but also meets expectations. Going beyond processor performance, the Intel MPI Benchmarks are used to ensure the network fabric(s) are performing properly and, with a configuration that describes expected performance levels, up to the cluster provider's performance expectations. Network inconsistencies due to poorly performing Ethernet NICs, InfiniBand HCAs, faulty switches, and loose or faulty cables can be identified.
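A conceptual sketch of the kind of consistency checking described above. This is not the Intel Cluster Checker itself; the node facts, the 10% tolerance, and all helper names are illustrative assumptions.

from statistics import median

# Hypothetical per-node facts, as a real tool might gather them via an agent or ssh.
nodes = {
    "node01": {"bios": "2.1b", "libs": {"libmpi.so.4", "libmkl.so"}, "stream_mb_s": 41800},
    "node02": {"bios": "2.1b", "libs": {"libmpi.so.4", "libmkl.so"}, "stream_mb_s": 41650},
    "node03": {"bios": "2.0a", "libs": {"libmpi.so.4"},              "stream_mb_s": 28400},
}

def check_identical(key):
    """Static check: a property must be identical on every compute node."""
    values = {name: facts[key] for name, facts in nodes.items()}
    reference = next(iter(values.values()))
    return [f"{n}: {key} = {v!r} (expected {reference!r})"
            for n, v in values.items() if v != reference]

def check_performance(key, tolerance=0.10):
    """Dynamic check: per-node benchmark results must stay close to the cluster median."""
    values = {n: f[key] for n, f in nodes.items()}
    ref = median(values.values())
    return [f"{n}: {key} = {v} deviates more than {tolerance:.0%} from median {ref}"
            for n, v in values.items() if abs(v - ref) / ref > tolerance]

issues = check_identical("bios") + check_identical("libs") + check_performance("stream_mb_s")
print("\n".join(issues) if issues else "cluster looks consistent")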

the Intel Cluster Checker is extensible, enabling additional tests

to be added supporting additional features and capabilities.

This enables the Intel Cluster Checker to not only support the

Intel Cluster Ready Builds HPC MomentumWith the Intel Cluster Ready (ICR) program, Intel Corporation

set out to create a win-win scenario for the major constituen-

cies in the high-performance computing (HPC) cluster market.

Hardware vendors and independent software vendors (ISVs)

stand to win by being able to ensure both buyers and users

that their products will work well together straight out of the

box. System administrators stand to win by being able to meet

corporate demands to push HPC competitive advantages

deeper into their organizations while satisfying end users’ de-

mands for reliable HPC cycles, all without increasing IT staff.

End users stand to win by being able to get their work done

faster, with less downtime, on certifi ed cluster platforms. Last

but not least, with ICR, Intel has positioned itself to win by ex-

panding the total addressable market (TAM) and reducing time

to market for the company’s microprocessors, chip sets, and

platforms.

The Worst of Times

For a number of years, clusters were largely confi ned to govern-

ment and academic sites, where contingents of graduate stu-

dents and midlevel employees were available to help program

and maintain the unwieldy early systems. Commercial fi rms

lacked this low-cost labor supply and mistrusted the favored

cluster operating system, open source Linux, on the grounds

that no single party could be held accountable if something

went wrong with it. Today, cluster penetration in the HPC mar-

ket is deep and wide, extending from systems with a handful of

processors to some of the world’s largest supercomputers, and

from under $25,000 to tens or hundreds of millions of dollars

in price. Clusters increasingly pervade every HPC vertical mar-

ket: biosciences, computer-aided engineering, chemical engi-

neering, digital content creation, economic/fi nancial services,

electronic design automation, geosciences/geo-engineering,

minimal requirements of the Intel Cluster Ready Specifi cation,

but the full cluster as delivered to the customer.

Conforming hardware and software

The preceding was primarily related to the builders of certifi ed

clusters and the developers of registered applications. For end

users that want to purchase a certifi ed cluster to run registered

applications, the ability to identify registered applications and

certifi ed clusters is most important, as that will reduce their ef-

fort to evaluate, acquire, and deploy the clusters that run their

applications, and then keep that computing resource operating

properly, with full performance, directly increasing their pro-

ductivity.

Intel Cluster Ready

Intel Cluster Ready Builds HPC Momentum

Intel, XEON and certain other trademarks and logos appearing in this

brochure, are trademarks or registered trademarks of Intel Corporation.

Cluster PlatformSoftware Tools

Reference Designs

Demand Creation

Support & Training

Specification

Co

nfi

gu

ration

Softw

are

Interco

nn

ect

Server Platfo

rm

Intel P

rocesso

r ISV Enabling

©

Intel Cluster Ready Program

Page 40: HPC Technology Compass 2014/15

78 79

mechanical design, defense, government labs, academia, and

weather/climate.

But IDC studies have consistently shown that clusters remain difficult to specify, deploy, and manage, especially for new and less experienced HPC users. This should come as no surprise, given that a cluster is a set of independent computers linked together by software and networking technologies from multiple vendors.

Clusters originated as do-it-yourself HPC systems. In the late 1990s users began employing inexpensive hardware to cobble together scientific computing systems based on the “Beowulf cluster” concept first developed by Thomas Sterling and Donald Becker at NASA. From their Beowulf origins, clusters have evolved and matured substantially, but the system management issues that plagued their early years remain in force today.

The Need for Standard Cluster Solutions
The escalating complexity of HPC clusters poses a dilemma for many large IT departments that cannot afford to scale up their HPC-knowledgeable staff to meet the fast-growing end-user demand for technical computing resources. Cluster management is even more problematic for smaller organizations and business units that often have no dedicated, HPC-knowledgeable staff to begin with.

The ICR program aims to address burgeoning cluster complexity by making available a standard solution (aka reference architecture) for Intel-based systems that hardware vendors can use to certify their configurations and that ISVs and other software vendors can use to test and register their applications, system software, and HPC management software. The chief goal of this voluntary compliance program is to ensure fundamental hardware-software integration and interoperability so that system administrators and end users can confidently purchase and deploy HPC clusters, and get their work done, even in cases where no HPC-knowledgeable staff are available to help.

The ICR program wants to prevent end users from having to become, in effect, their own systems integrators. In smaller organizations, the ICR program is designed to allow overworked IT departments with limited or no HPC expertise to support HPC user requirements more readily. For larger organizations with dedicated HPC staff, ICR creates confidence that required user applications will work, eases the problem of system administration, and allows HPC cluster systems to be scaled up in size without scaling support staff. ICR can help drive HPC cluster resources deeper into larger organizations and free up IT staff to focus on mainstream enterprise applications (e.g., payroll, sales, HR, and CRM).

The program is a three-way collaboration among hardware vendors, software vendors, and Intel. In this triple alliance, Intel provides the specification for the cluster architecture implementation, and then vendors certify their hardware configurations and register software applications as compliant with the specification. The ICR program's promise to system administrators and end users is that registered applications will run out of the box on certified hardware configurations.

ICR solutions are compliant with the standard platform architecture, which starts with 64-bit Intel Xeon processors in an Intel-certified cluster hardware platform. Layered on top of this foundation are the interconnect fabric (Gigabit Ethernet, InfiniBand) and the software stack: Intel-selected Linux cluster tools, an Intel MPI runtime library, and the Intel Math Kernel Library. Intel runtime components are available and verified as part of the certification (e.g., Intel tool runtimes) but are not required to be used by applications. The inclusion of these Intel runtime components does not exclude any other components a systems vendor or ISV might want to use. At the top of the stack are Intel-registered ISV applications.

At the heart of the program is the Intel Cluster Checker, a validation tool that verifies that a cluster is specification-compliant and operational before ISV applications are ever loaded. After the cluster is up and running, the Cluster Checker can function as a fault isolation tool in wellness mode. Certification needs to happen only once for each distinct hardware platform, while verification – which determines whether a valid copy of the specification is operating – can be performed by the Cluster Checker at any time.

The Cluster Checker is an evolving tool that is designed to accept new test modules. It is a productized tool that ICR members ship with their systems. The Cluster Checker originally was designed for homogeneous clusters but can now also be applied to clusters with specialized nodes, such as all-storage sub-clusters. The Cluster Checker can isolate a wide range of problems, including network or communication problems.


The transtec Benchmarking Center
transtec offers its customers a new and fascinating way to evaluate transtec's HPC solutions in real-world scenarios. In the transtec Benchmarking Center, solutions can be explored in detail with the actual applications the customers will later run on them. Intel Cluster Ready makes this feasible by simplifying the maintenance of the systems and by allowing clean systems to be set up easily, and as often as needed. As High Performance Computing (HPC) systems are utilized for numerical simulations, more and more advanced clustering technologies are being deployed. Because of their performance, price/performance, and energy efficiency advantages, clusters now dominate all segments of the HPC market and continue to gain acceptance. HPC computer systems have become far more widespread and pervasive in government, industry, and academia. However, rarely does the client have the possibility to test their actual application on the system they are planning to acquire.


transtec HPC solutions are used by a wide variety of clients. Among them are most of the large users of compute power at German and other European universities and research centers, as well as governmental users like the German army's compute center and clients from the high-tech, automotive, and other sectors. transtec HPC solutions have demonstrated their value in more than 500 installations. Most of transtec's cluster systems are based on SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS, or Scientific Linux. With xCAT for efficient cluster deployment, and the Moab HPC Suite by Adaptive Computing for high-level cluster management, transtec is able to efficiently deploy and ship easy-to-use HPC cluster solutions with enterprise-class management features. Moab has proven to provide easy-to-use workload and job management for small systems as well as the largest cluster installations worldwide.

However, when selling clusters to governmental customers as well as other large enterprises, it is often required that the client can choose from a range of competing offers. Many times there is a fixed budget available, and competing solutions are compared based on their performance on certain custom benchmark codes.

So, in 2007 transtec decided to add another layer to its already wide array of competence in HPC – ranging from cluster deployment and management and the latest CPU, board, and network technology to HPC storage systems. In transtec's HPC Lab the systems are assembled. transtec uses Intel Cluster Ready to facilitate testing, verification, documentation, and final testing throughout the actual build process. At the Benchmarking Center, transtec can now offer a set of small clusters with the “newest and hottest technology” through Intel Cluster Ready. A standard installation infrastructure gives transtec a quick and easy way to set systems up according to their customers' choice of operating system, compilers, workload management suite, and so on. With Intel Cluster Ready there are prepared standard set-ups available with verified performance at standard benchmarks, while system stability is guaranteed by our own test suite and the Intel Cluster Checker.

The Intel Cluster Ready program is designed to provide a common standard for HPC clusters, helping organizations design and build seamless, compatible, and consistent cluster configurations. Integrating the standards and tools provided by this program can help significantly simplify the deployment and management of HPC clusters.


Parallel NFS – The New Standard for HPC Storage

HPC computation results in the terabyte range are not uncommon. The problem in this context is not so much storing the data at rest, but the performance of the necessary copying back and forth in the course of the computation job flow, and the job turnaround time that depends on it. For interim results during a job's runtime, or for fast storage of input and results data, parallel file systems have established themselves as the standard to meet the ever-increasing performance requirements of HPC storage systems. Parallel NFS is about to become the new standard framework for a parallel file system.


Parallel NFS

The New Standard for HPC Storage

Yesterday's Solution: NFS for HPC Storage
The original Network File System (NFS), developed by Sun Microsystems at the end of the eighties and now available in version 4.1, has long been established as a de-facto standard for the provisioning of a global namespace in networked computing. A very widespread HPC cluster solution includes a central master node acting simultaneously as an NFS server, with its local file system storing input, interim, and results data and exporting them to all other cluster nodes.

There is of course an immediate bottleneck in this method: when the load on the network is high, or where there are large numbers of nodes, the NFS server can no longer keep up delivering or receiving the data. In high performance computing especially, the nodes are interconnected at least once via Gigabit Ethernet, so the total aggregate throughput is well above what an NFS server with a single Gigabit interface can achieve. Even a powerful network connection of the NFS server to the cluster, for example with 10 Gigabit Ethernet, is only a temporary solution to this problem until the next cluster upgrade. The fundamental problem remains – this solution is not scalable. In addition, NFS is a difficult protocol to cluster in terms of load balancing: either you have to ensure that multiple NFS servers accessing the same data are constantly synchronised, with the disadvantage of a noticeable drop in performance, or you manually partition the global namespace, which is also time-consuming. NFS is not suitable for dynamic load balancing, as on paper it appears to be stateless but in reality is, in fact, stateful.
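The scaling problem can be made concrete with a back-of-the-envelope calculation; the node count and link speeds below are illustrative assumptions, not a measurement:

# Aggregate client demand vs. what a single NFS server NIC can deliver.
NODES = 64                      # compute nodes, each with a Gigabit Ethernet link
MB_PER_GBIT = 125               # roughly 125 MB/s usable per Gigabit, ignoring protocol overhead

aggregate_demand = NODES * MB_PER_GBIT      # every node reading at line rate
for label, server_gbit in (("1 GbE NFS server", 1), ("10 GbE NFS server", 10)):
    capacity = server_gbit * MB_PER_GBIT
    print(f"{label}: {capacity:5d} MB/s vs. {aggregate_demand} MB/s demand "
          f"-> {aggregate_demand / capacity:.1f}x oversubscribed")
# Upgrading the server NIC only shrinks the ratio by a constant factor; adding nodes
# restores the bottleneck, which is why a faster single server does not scale.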

Today's Solution: Parallel File Systems
For some time, powerful commercial products have been available to meet the high demands on an HPC storage system. The open-source solutions FraunhoferFS (FhGFS) from the Fraunhofer Competence Center for High Performance Computing and Lustre are widely used in the Linux HPC world, and several other free as well as commercial parallel file system solutions exist.

What is new is that the time-honoured NFS is being upgraded, including a parallel version, into an Internet Standard with the aim of interoperability between all operating systems. The original problem statement for parallel NFS access was written by Garth Gibson, a professor at Carnegie Mellon University and founder and CTO of Panasas. Gibson was already a renowned figure, being one of the authors contributing to the original paper on RAID architecture from 1988. The original statement from Gibson and Panasas is clearly noticeable in the design of pNFS. The powerful HPC file system developed by Gibson and Panasas, ActiveScale PanFS, with object-based storage devices functioning as central components, is basically the commercial continuation of the “Network-Attached Secure Disk (NASD)” project also developed by Garth Gibson at Carnegie Mellon University.

Parallel NFS
Parallel NFS (pNFS) is gradually emerging as the future standard to meet requirements in the HPC environment. From the industry's as well as the user's perspective, the benefits of utilising standard solutions are indisputable: besides protecting end-user investment, standards also ensure a defined level of interoperability without restricting the choice of products available. As a result, less user and administrator training is required, which leads to simpler deployment and, at the same time, greater acceptance.

As part of the NFS 4.1 Internet Standard, pNFS will not only adopt the semantics of NFS in terms of cache consistency or security, it also represents an easy and flexible extension of the NFS 4 protocol. pNFS is optional; in other words, NFS 4.1 implementations do not have to include pNFS as a feature. The scheduled Internet Standard NFS 4.1 is today published as IETF RFC 5661.

The pNFS protocol supports a separation of metadata and data: a pNFS cluster comprises so-called storage devices, which store the data of the shared file system, and a metadata server (MDS) – called Director Blade with Panasas – which is the actual NFS 4.1 server. The metadata server keeps track of which data is stored on which storage devices and how to access the files: the so-called layout. Besides these “striping parameters”, the MDS also manages other metadata, including access rights and similar information that is usually stored in a file's inode.
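A minimal sketch of what such a stripe map lets a client do: given the striping parameters handed out by the MDS, the client can compute, for any file offset, which storage device and which offset within that device's object to access. The field names and the simple round-robin scheme are simplifications for illustration, not the literal RFC 5661 data structures.

from dataclasses import dataclass

@dataclass
class Layout:
    """Simplified striping parameters as an MDS might hand them to a pNFS client."""
    stripe_unit: int        # bytes per stripe unit
    devices: list           # ordered storage devices (e.g. their addresses/file handles)

def locate(layout, file_offset):
    """Map a file offset to (device, offset inside that device's component object)."""
    stripe_no = file_offset // layout.stripe_unit
    dev_index = stripe_no % len(layout.devices)
    dev_offset = (stripe_no // len(layout.devices)) * layout.stripe_unit \
                 + file_offset % layout.stripe_unit
    return layout.devices[dev_index], dev_offset

layout = Layout(stripe_unit=64 * 1024, devices=["osd-a", "osd-b", "osd-c", "osd-d"])
for off in (0, 64 * 1024, 300 * 1024, 1024 * 1024):
    dev, dev_off = locate(layout, off)
    print(f"file offset {off:>8}: read from {dev} at object offset {dev_off}")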

The layout types define which Storage Access Protocol is used by the clients to access the storage devices. Up until now, three storage access protocols have been defined for pNFS: file, block, and object-based layouts, the former being described directly in RFC 5661, the latter two in RFC 5663 and RFC 5664, respectively. Last but not least, a Control Protocol is used by the MDS and the storage devices to synchronise status data. This protocol is deliberately left unspecified in the standard to give manufacturers a certain flexibility. The NFS 4.1 standard does, however, specify certain conditions which a control protocol has to fulfil, for example how to deal with the change/modify time attributes of files.

(Figure: A classical NFS server is a bottleneck – all cluster nodes act as NFS clients of a single NAS head serving as the NFS server.)


What's New in NFS 4.1?
NFS 4.1 is a minor update to NFS 4 that adds new features to it. One of the optional features is parallel NFS (pNFS), but there is other new functionality as well.

One of the technical enhancements is the use of sessions: a persistent server object, dynamically created by the client. By means of sessions, the state of an NFS connection can be stored, no matter whether the connection is live or not. Sessions survive temporary downtimes both of the client and of the server. Each session has a so-called fore channel, which is the connection from the client to the server for all RPC operations, and optionally a back channel for RPC callbacks from the server, which can now also be realized across firewall boundaries. Sessions can be trunked to increase the bandwidth. Besides session trunking there is also client ID trunking for grouping several sessions together under the same client ID.

By means of sessions, NFS can be seen as a truly stateful protocol with so-called “Exactly-Once Semantics (EOS)”. Until now, a necessary but unspecified reply cache within the NFS server was implemented to handle identical RPC operations that have been sent several times. This statefulness in reality is not very robust, however, and sometimes leads to the well-known stale NFS handles. In NFS 4.1, the reply cache is now a mandatory part of the NFS implementation, storing the server replies to RPC requests persistently on disk.
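A conceptual sketch of the exactly-once idea behind the session reply cache: replies are remembered per session slot and sequence number, so a retransmitted request gets the cached reply back instead of re-executing the operation. This illustrates the mechanism only; it is not an NFS implementation, and all names are invented for the example.

class SessionReplyCache:
    """One reply slot per (session, slot_id); a retransmission reuses the same sequence id."""
    def __init__(self):
        self._cache = {}   # (session_id, slot_id) -> (seq_id, cached_reply)

    def execute(self, session_id, slot_id, seq_id, operation):
        key = (session_id, slot_id)
        cached = self._cache.get(key)
        if cached and cached[0] == seq_id:
            return cached[1]                 # retransmission: replay, don't re-execute
        reply = operation()                  # new request: run it exactly once
        self._cache[key] = (seq_id, reply)   # a real server would persist this on disk
        return reply

calls = {"count": 0}
def create_file():
    calls["count"] += 1
    return f"OK (executed {calls['count']} time(s))"

cache = SessionReplyCache()
print(cache.execute("sess-1", slot_id=0, seq_id=7, operation=create_file))  # executed once
print(cache.execute("sess-1", slot_id=0, seq_id=7, operation=create_file))  # replayed from cache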

Another new feature of NFS 4.1 is delegation for directories: NFS clients can be given temporary exclusive access to directories. Previously, this was only possible for plain files. With the forthcoming version 4.2 of the NFS standard, federated filesystems will be added as a feature, representing the NFS counterpart of Microsoft's DFS (distributed filesystem).

pNFS supports backwards compatibility with non-pNFS-capable NFS 4 clients. In this case, the MDS itself gathers the data from the storage devices on behalf of the NFS client and presents it to the NFS client via NFS 4. The MDS acts as a kind of proxy server – which is, for example, what the Director Blades from Panasas do.

pNFS Layout Types
If the storage devices act simply as NFS 4 file servers, the file layout is used. It is the only storage access protocol specified directly in the NFS 4.1 standard. Besides the stripe sizes and stripe locations (storage devices), it also includes the NFS file handles which the client needs to use to access the separate file areas. The file layout is compact and static; the striping information does not change even if changes are made to the file, enabling multiple pNFS clients to cache the layout simultaneously and avoiding synchronisation overhead between clients and the MDS, or between the MDS and the storage devices.

File system authorisation and client authentication can be implemented well with the file layout. When using NFS 4 as the storage access protocol, client authentication merely depends on the security flavor used – with the RPCSEC_GSS security flavor, client access is kerberized, for example, and the server controls access authorization using specified ACLs and cryptographic processes.

In contrast, the block/volume layout uses volume identifiers, block offsets, and extents to specify a file layout. SCSI block commands are used to access the storage devices. As the block distribution can change with each write access, the layout must be updated more frequently than with the file layout.

Block-based access to storage devices does not offer any secure authentication option for the accessing SCSI initiator. Secure SAN authorisation is possible with host granularity only, based on World Wide Names (WWNs) with Fibre Channel or iSCSI Qualified Names (IQNs) with iSCSI. The server cannot enforce access control governed by the file system. On the contrary, a pNFS client basically abides by the access rights voluntarily; the storage device has to trust the pNFS client – a fundamental access control problem that is a recurrent issue in the history of the NFS protocol.

The object layout is syntactically similar to the file layout, but it uses the SCSI object command set for data access to so-called Object-based Storage Devices (OSDs) and is heavily based on the DirectFLOW protocol of ActiveScale PanFS from Panasas. From the very start, object-based storage devices were designed for secure authentication and access. So-called capabilities are used for object access: the MDS issues capabilities to the pNFS clients, and ownership of such a capability represents the authoritative access right to an object.

pNFS can be extended to integrate other storage access protocols and operating systems, and storage manufacturers also have the option to ship additional layout drivers for their pNFS implementations.

(Figure: The pNFS architecture – pNFS clients access the metadata server via NFS 4.1 and the storage devices directly via the Storage Access Protocol, while the metadata server and the storage devices synchronise via the Control Protocol.)


Panasas HPC Storage
Having many years' experience in deploying parallel file systems like Lustre or FraunhoferFS (FhGFS), from smaller scales up to hundreds of terabytes of capacity and throughputs of several gigabytes per second, transtec chose Panasas, the leader in HPC storage solutions, as the partner for providing highest performance and scalability on the one hand, and ease of management on the other. With Panasas as the technology leader and transtec's overall experience and customer-oriented approach, customers can be assured of getting the best possible HPC storage solution available.

The Panasas file system uses parallel and redundant access to object storage devices (OSDs), per-file RAID, distributed metadata management, consistent client caching, file locking services, and internal cluster management to provide a scalable, fault-tolerant, high-performance distributed file system. The clustered design of the storage system and the use of client-driven RAID provide scalable performance to many concurrent file system clients through parallel access to file data that is striped across OSD storage nodes. RAID recovery is performed in parallel by the cluster of metadata managers, and declustered data placement yields scalable RAID rebuild rates as the storage system grows larger.

Introduction
Storage systems for high performance computing environments must be designed to scale in performance so that they can be configured to match the required load. Clustering techniques are often used to provide scalability. In a storage cluster, many nodes each control some storage, and the overall distributed file system assembles the cluster elements into one large, seamless storage system. The storage cluster can be hosted on the same computers that perform data processing, or it can be a separate cluster that is devoted entirely to storage and accessible to the compute cluster via a network protocol.

The Panasas storage system is a specialized storage cluster, and this section presents its design and a number of performance measurements to illustrate the scalability. The Panasas system is a production system that provides file service to some of the largest compute clusters in the world: in scientific labs, in seismic data processing, in digital animation studios, in computational fluid dynamics, in semiconductor manufacturing, and in general-purpose computing environments. In these environments, hundreds or thousands of file system clients share data and generate very high aggregate I/O load on the file system. The Panasas system is designed to support several thousand clients and storage capacities in excess of a petabyte.

The unique aspects of the Panasas system are its use of per-file, client-driven RAID, its parallel RAID rebuild, its treatment of different classes of metadata (block, file, system), and commodity-parts-based blade hardware with an integrated UPS. Of course, the system has many other features (such as object storage, fault tolerance, caching and cache consistency, and a simplified management model) that are not unique, but are necessary for a scalable system implementation.

Panasas File System Background
The two overall themes of the system are object storage, which affects how the file system manages its data, and clustering of components, which allows the system to scale in performance and capacity.

Object Storage
An object is a container for data and attributes; it is analogous to the inode inside a traditional UNIX file system implementation. Specialized storage nodes called Object Storage Devices (OSDs) store objects in a local OSDFS file system. The object interface addresses objects in a two-level (partition ID / object ID) namespace. The OSD wire protocol provides byte-oriented access to the data, attribute manipulation, creation and deletion of objects, and several other specialized operations. Panasas uses an iSCSI transport to carry OSD commands that are very similar to the OSDv2 standard currently in progress within SNIA and ANSI T10.

The Panasas file system is layered over the object storage. Each file is striped over two or more objects to provide redundancy and high-bandwidth access. The file system semantics are implemented by metadata managers that mediate access to objects from clients of the file system. The clients access the object storage using the iSCSI/OSD protocol for read and write operations. The I/O operations proceed directly and in parallel to the storage nodes, bypassing the metadata managers. The clients interact with the out-of-band metadata managers via RPC to obtain access capabilities and location information for the objects that store files.
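A conceptual sketch of that access pattern: the client asks the metadata manager once (via RPC) for the map of a file, then reads the component objects directly and in parallel from the storage nodes. The stubs below stand in for real iSCSI/OSD transfers and RPC calls; all paths, object names, and data are invented for the illustration.

from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the out-of-band metadata manager and the object storage devices.
METADATA = {"/scratch/result.dat": ["osd-1/obj-17", "osd-2/obj-17", "osd-3/obj-17"]}
OSD_DATA = {"osd-1/obj-17": b"AAAA", "osd-2/obj-17": b"BBBB", "osd-3/obj-17": b"CCCC"}

def get_map(path):
    """RPC to the metadata manager: where do the file's component objects live?"""
    return METADATA[path]

def read_object(component):
    """Direct iSCSI/OSD read from one storage node (stubbed here)."""
    return OSD_DATA[component]

def read_file(path):
    components = get_map(path)                       # one metadata round trip
    with ThreadPoolExecutor(len(components)) as pool:
        parts = pool.map(read_object, components)    # data flows in parallel, bypassing the MDS
    return b"".join(parts)

print(read_file("/scratch/result.dat"))   # b'AAAABBBBCCCC'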


Object attributes are used to store file-level attributes, and directories are implemented with objects that store name-to-object-ID mappings. Thus the file system metadata is kept in the object store itself, rather than being kept in a separate database or some other form of storage on the metadata nodes.

System Software Components
The major software subsystems are the OSDFS object storage system, the Panasas file system metadata manager, the Panasas file system client, the NFS/CIFS gateway, and the overall cluster management system.

- The Panasas client is an installable kernel module that runs inside the Linux kernel. The kernel module implements the standard VFS interface, so that the client hosts can mount the file system and use a POSIX interface to the storage system.
- Each storage cluster node runs a common platform that is based on FreeBSD, with additional services to provide hardware monitoring, configuration management, and overall control.
- The storage nodes use a specialized local file system (OSDFS) that implements the object storage primitives. They implement an iSCSI target and the OSD command set. The OSDFS object store and the iSCSI target/OSD command processor are kernel modules. OSDFS is concerned with traditional block-level file system issues such as efficient disk arm utilization, media management (i.e., error handling), and high throughput, as well as the OSD interface.
- The cluster manager (SysMgr) maintains the global configuration, and it controls the other services and nodes in the storage cluster. There is an associated management application that provides both a command line interface (CLI) and an HTML interface (GUI). These are all user-level applications that run on a subset of the manager nodes. The cluster manager is concerned with membership in the storage cluster, fault detection, configuration management, and overall control of operations like software upgrade and system restart.
- The Panasas metadata manager (PanFS) implements the file system semantics and manages data striping across the object storage devices. This is a user-level application that runs on every manager node. The metadata manager is concerned with distributed file system issues such as secure multi-user access, maintaining consistent file- and object-level metadata, client cache coherency, and recovery from client, storage node, and metadata server crashes. Fault tolerance is based on a local transaction log that is replicated to a backup on a different manager node.
- The NFS and CIFS services provide access to the file system for hosts that cannot use the Linux installable file system client. The NFS service is a tuned version of the standard FreeBSD NFS server that runs inside the kernel. The CIFS service is based on Samba and runs at user level. In turn, these services use a local instance of the file system client, which runs inside the FreeBSD kernel. These gateway services run on every manager node to provide a clustered NFS and CIFS service.

Commodity Hardware Platform
The storage cluster nodes are implemented as blades, very compact computer systems made from commodity parts. The blades are clustered together to provide a scalable platform. The OSD StorageBlade module and the metadata manager DirectorBlade module use the same blade form factor and fit into the same chassis slots.

Storage Management
Traditional storage management tasks involve partitioning the available storage space into LUNs (i.e., logical units that are one or more disks, or a subset of a RAID array), assigning LUN ownership to different hosts, configuring RAID parameters, creating file systems or databases on LUNs, and connecting clients to the correct server for their storage. This can be a labor-intensive scenario. Panasas provides a simplified model for storage management that shields the storage administrator from these kinds of details and allows a single, part-time administrator to manage systems that are hundreds of terabytes in size.

The Panasas storage system presents itself as a file system with a POSIX interface and hides most of the complexities of storage management. Clients have a single mount point for the entire system. The /etc/fstab file references the cluster manager, and from that the client learns the location of the metadata service instances. The administrator can add storage while the system is online, and new resources are automatically discovered. To manage the available storage, Panasas introduces two basic storage concepts: a physical storage pool called a BladeSet, and a logical quota tree called a Volume.

The BladeSet is a collection of StorageBlade modules in one or

more shelves that comprise a RAID fault domain. Panasas miti-

gates the risk of large fault domains with the scalable rebuild

performance described below.

Panasas System Components: compute nodes run the PanFS client and speak iSCSI/OSD (and RPC) to the cluster; manager nodes run SysMgr, the PanFS metadata service, and the NFS/CIFS gateways; storage nodes run OSDFS.

The BladeSet is a hard

physical boundary for the volumes it contains. A BladeSet can

be grown at any time, either by adding more StorageBlade mod-

ules, or by merging two existing BladeSets together.

The Volume is a directory hierarchy that has a quota constraint

and is assigned to a particular BladeSet. The quota can be

changed at any time, and capacity is not allocated to the Vol-

ume until it is used, so multiple volumes compete for space

within their BladeSet and grow on demand. The files in those

volumes are distributed among all the StorageBlade modules in

the BladeSet.

Volumes appear in the file system name space as directories.

Clients have a single mount point for the whole storage system,

and volumes are simply directories below the mount point.

There is no need to update client mounts when the admin cre-

ates, deletes, or renames volumes.

Automatic Capacity Balancing

Capacity imbalance occurs when expanding a BladeSet (i.e.,

adding new, empty storage nodes), merging two BladeSets, and

replacing a storage node following a failure. In the latter scenar-

io, the imbalance is the result of the RAID rebuild, which uses

spare capacity on every storage node rather than dedicating a

specific “hot spare” node. This provides better throughput dur-

ing rebuild, but causes the system to have a new, empty stor-

age node after the failed storage node is replaced. The system

automatically balances used capacity across storage nodes in a

BladeSet using two mechanisms: passive balancing and active

balancing.

Passive balancing changes the probability that a storage node will be used for a new component of a file, based on its available capacity. This takes effect when files are created, and when their stripe size is increased to include more storage nodes. Active balancing is done by moving an existing component object from one storage node to another, and updating the storage map for the affected file. During the transfer, the file is transparently marked read-only by the storage management layer, and the capacity balancer skips files that are being actively written. Capacity balancing is thus transparent to file system clients.
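To make the passive-balancing idea concrete, the following sketch (with invented names, not Panasas code) places a new component object on a node chosen with probability proportional to its free capacity:

#include <stdlib.h>

struct StorageNode { unsigned long long free_bytes; };

/* Pick an index with probability proportional to free capacity. */
int pick_storage_node(const struct StorageNode *nodes, int n)
{
    unsigned long long total = 0;
    for (int i = 0; i < n; i++)
        total += nodes[i].free_bytes;
    if (total == 0)
        return 0;                      /* nothing free anywhere            */

    unsigned long long r = (((unsigned long long)rand() << 32) | rand()) % total;
    for (int i = 0; i < n; i++) {
        if (r < nodes[i].free_bytes)
            return i;                  /* emptier nodes are chosen more often */
        r -= nodes[i].free_bytes;
    }
    return n - 1;
}

Because placement only biases where new data lands, passive balancing never moves existing objects; that is what active balancing is for.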

Object RAID and Reconstruction

Panasas protects against loss of a data object or an entire storage node by striping files across objects stored on different storage nodes, using a fault-tolerant striping algorithm such as RAID-1 or RAID-5. Small files are mirrored on two objects, and larger files are striped more widely to provide higher bandwidth and less capacity overhead from parity information. The per-file RAID layout means that parity information for different files is not mixed together, and easily allows different files to use different RAID schemes alongside each other. This property and the security mechanisms of the OSD protocol make it possible to enforce access control over files even as clients access storage nodes directly. It also enables what is perhaps the most novel aspect of our system: client-driven RAID. That is, the clients are responsible for computing and writing parity. The OSD security mechanism also allows multiple metadata managers to manage objects on the same storage device without heavyweight coordination or interference from each other.

Client-driven, per-file RAID has four advantages for large-scale storage systems. First, by having clients compute parity for their own data, the XOR power of the system scales up as the number of clients increases. We measured XOR processing during streaming write bandwidth loads at 7% of the client's CPU, with the rest going to the OSD/iSCSI/TCP/IP stack and other file system overhead. Moving XOR computation out of the storage system into the client requires some additional work to handle failures. Clients are responsible for generating good data and good parity for it. Because the RAID equation is per-file, an errant client can only damage its own data. However, if a client fails during a write, the metadata manager will scrub parity to ensure the parity equation is correct.
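As a rough illustration of what "clients compute parity" means, here is a minimal sketch of the XOR step for one RAID-5 stripe; the buffer layout and names are illustrative, not the actual DirectFLOW client code:

#include <stddef.h>
#include <string.h>

/* XOR all data units of one stripe into the parity unit that the client
 * will write to a separate storage node. */
void compute_stripe_parity(const unsigned char *const *data_units,
                           int num_data_units, size_t unit_len,
                           unsigned char *parity_out)
{
    memset(parity_out, 0, unit_len);
    for (int u = 0; u < num_data_units; u++)
        for (size_t b = 0; b < unit_len; b++)
            parity_out[b] ^= data_units[u][b];   /* running XOR across units */
}

The cost of this loop is what the 7% CPU figure above refers to; the work scales with the number of clients rather than with the number of storage controllers.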

The second advantage of client-driven RAID is that clients can

perform an end-to-end data integrity check. Data has to go

through the disk subsystem, through the network interface on

the storage nodes, through the network and routers, through

the NIC on the client, and all of these transits can introduce

errors with a very low probability. Clients can choose to read

parity as well as data, and verify parity as part of a read opera-

tion. If errors are detected, the operation is retried. If the error

is persistent, an alert is raised and the read operation fails. By

checking parity across storage nodes within the client, the sys-

tem can ensure end-to-end data integrity. This is another novel

property of per-file, client-driven RAID.

Third, per-file RAID protection lets the metadata managers rebuild files in parallel. Although parallel rebuild is theoretically possible in block-based RAID, it is rarely implemented. This is due to the fact that the disks are owned by a single RAID controller, even in dual-ported configurations. Large storage systems have multiple RAID controllers that are not interconnected. Since the SCSI Block command set does not provide fine-grained synchronization operations, it is difficult for multiple RAID controllers to coordinate a complicated operation such as an online rebuild without external communication. Even if they could, without connectivity to the disks in the affected parity group, other RAID controllers would be unable to assist. Even in a high-availability configuration, each disk is typically only attached to two different RAID controllers, which limits the potential speedup to 2x.

When a StorageBlade module fails, the metadata managers that own Volumes within that BladeSet determine which files are affected, and then they farm out file reconstruction work to every other metadata manager in the system. Metadata managers rebuild their own files first, but if they finish early or do not own any Volumes in the affected BladeSet, they are free to aid other metadata managers. Declustered parity groups spread out the I/O workload among all StorageBlade modules in the BladeSet. The result is that larger storage clusters reconstruct lost data more quickly.

The fourth advantage of per-file RAID is that unrecoverable


faults can be constrained to individual files. The most com-

monly encountered double-failure scenario with RAID-5 is an

unrecoverable read error (i.e., grown media defect) during the

reconstruction of a failed storage device. The second storage

device is still healthy, but it has been unable to read a sector,

which prevents rebuild of the sector lost from the first drive and

potentially the entire stripe or LUN, depending on the design of

the RAID controller. With block-based RAID, it is difficult or im-

possible to directly map any lost sectors back to higher-level file

system data structures, so a full file system check and media

scan will be required to locate and repair the damage. A more

typical response is to fail the rebuild entirely. RAID controllers

monitor drives in an effort to scrub out media defects and avoid

this bad scenario, and the Panasas system does media scrub-

bing, too. However, with high capacity SATA drives, the chance

of encountering a media defect on drive B while rebuilding drive

A is still significant. With per-file RAID-5, this sort of double fail-

ure means that only a single file is lost, and the specific file can

be easily identified and reported to the administrator. While

block-based RAID systems have been compelled to introduce

RAID-6 (i.e., fault tolerant schemes that handle two failures), the

Panasas solution is able to deploy highly reliable RAID-5 sys-

tems with large, high performance storage pools.

RAID Rebuild Performance

RAID rebuild performance determines how quickly the system

can recover data when a storage node is lost. Short rebuild

times reduce the window in which a second failure can cause

data loss. There are three techniques to reduce rebuild times: re-

ducing the size of the RAID parity group, declustering the place-

ment of parity group elements, and rebuilding files in parallel

using multiple RAID engines.

The rebuild bandwidth is the rate at which reconstructed data

is written to the system when a storage node is being recon-

structed. The system must read N times as much as it writes,

depending on the width of the RAID parity group, so the overall

throughput of the storage system is several times higher than

the rebuild rate. A narrower RAID parity group requires fewer

read and XOR operations to rebuild, so will result in a higher

rebuild bandwidth. However, it also results in higher capacity

overhead for parity data, and can limit bandwidth during nor-

mal I/O. Thus, selection of the RAID parity group size is a trade-

off between capacity overhead, on-line performance, and re-

build performance.

Understanding declustering is easier with a picture. In the figure

on the left, each parity group has 4 elements, which are indicat-

ed by letters placed in each storage device. They are distributed

among 8 storage devices. The ratio between the parity group

size and the available storage devices is the declustering ratio,

which in this example is ½. In the picture, capital letters repre-

sent those parity groups that all share the second storage node.

If the second storage device were to fail, the system would have

to read the surviving members of its parity groups to rebuild the

lost elements. You can see that the other elements of those par-

ity groups occupy about ½ of each other storage device.

For this simple example you can assume each parity element is

the same size so all the devices are filled equally. In a real sys-

tem, the component objects will have various sizes depending

on the overall file size, although each member of a parity group

will be very close in size. There will be thousands or millions of

objects on each device, and the Panasas system uses active bal-

ancing to move component objects between storage nodes to

level capacity.

Declustering means that rebuild requires reading a subset of

each device, with the proportion being approximately the same

as the declustering ratio. The total amount of data read is the

same with and without declustering, but with declustering it is

spread out over more devices. When writing the reconstructed

elements, two elements of the same parity group cannot be

located on the same storage node. Declustering leaves many

storage devices available for the reconstructed parity element,

and randomizing the placement of each file’s parity group lets

the system spread out the write I/O over all the storage. Thus

declustering RAID parity groups has the important property of

taking a fixed amount of rebuild I/O and spreading it out over

more storage devices.
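A back-of-envelope sketch of the effect, under the idealized assumption of equal-sized parity elements: with parity groups of width w declustered over d devices, a rebuild reads roughly (w - 1) / (d - 1) of every surviving device instead of all of w - 1 dedicated devices. For the figure's example (w = 4, d = 8) that is close to the declustering ratio of one half:

#include <stdio.h>

int main(void)
{
    int w = 4;   /* parity group width (e.g., 3 data elements + 1 parity) */
    int d = 8;   /* storage devices in the BladeSet                       */

    double ratio    = (double)w / (double)d;             /* declustering ratio   */
    double fraction = (double)(w - 1) / (double)(d - 1); /* read share per device */

    printf("declustering ratio w/d                = %.2f\n", ratio);
    printf("fraction of each surviving device read ~ %.2f\n", fraction);
    return 0;
}

The total amount of data read is unchanged; declustering simply spreads that fixed rebuild I/O over more spindles, which is why rebuild gets faster as the BladeSet grows.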

Having per-file RAID allows the Panasas system to divide the

work among the available DirectorBlade modules by assigning

different files to different DirectorBlade modules. This division

is dynamic with a simple master/worker model in which meta-

data services make themselves available as workers, and each

metadata service acts as the master for the volumes it imple-

ments. By doing rebuilds in parallel on all DirectorBlade mod-

ules, the system can apply more XOR throughput and utilize the

additional I/O bandwidth obtained with declustering.

Metadata Management

There are several kinds of metadata in the Panasas system.

These include the mapping from object IDs to sets of block ad-

dresses, mapping files to sets of objects, file system attributes

such as ACLs and owners, file system namespace information

(i.e., directories), and configuration/management information

about the storage cluster itself.

Declustered parity groups: the lettered parity group elements are spread across the storage devices of a BladeSet so that no two elements of one group share a device.

Block-level Metadata

Block-level metadata is managed internally by OSDFS, the file

system that is optimized to store objects. OSDFS uses a floating

block allocation scheme where data, block pointers, and object

descriptors are batched into large write operations. The write

buffer is protected by the integrated UPS, and it is flushed to

disk on power failure or system panics. Fragmentation was an

issue in early versions of OSDFS that used a first-fit block allo-

cator, but this has been significantly mitigated in later versions

that use a modified best-fit allocator.

OSDFS stores higher level file system data structures, such as

the partition and object tables, in a modified BTree data struc-

ture. Block mapping for each object uses a traditional direct/

indirect/double-indirect scheme. Free blocks are tracked by a

proprietary bitmap-like data structure that is optimized for co-

py-on-write reference counting, part of OSDFS’s integrated sup-

port for object- and partition-level copy-on-write snapshots.

Block-level metadata management consumes most of the

cycles in file system implementations. By delegating storage

management to OSDFS, the Panasas metadata managers have

an order of magnitude less work to do than the equivalent SAN

file system metadata manager that must track all the blocks in

the system.

File-level Metadata

Above the block layer is the metadata about files. This includes

user-visible information such as the owner, size, and modifica-

tion time, as well as internal information that identifies which

objects store the file and how the data is striped across those

objects (i.e., the file’s storage map). Our system stores this file

metadata in object attributes on two of the N objects used to

store the file’s data. The rest of the objects have basic attributes

like their individual length and modify times, but the higher-lev-

el file system attributes are only stored on the two attribute-

storing components.

File names are implemented in directories similar to traditional

UNIX file systems. Directories are special files that store an ar-

ray of directory entries. A directory entry identifies a file with a


tuple of <serviceID, partitionID, objectID>, and also includes two

<osdID> fields that are hints about the location of the attribute

storing components. The partitionID/objectID is the two-level

object numbering scheme of the OSD interface, and Panasas

uses a partition for each volume. Directories are mirrored (RAID-

1) in two objects so that the small write operations associated

with directory updates are efficient.

Clients are allowed to read, cache and parse directories, or they

can use a Lookup RPC to the metadata manager to translate

a name to a <serviceID, partitionID, objectID> tuple and the

<osdID> location hints. The serviceID provides a hint about the

metadata manager for the file, although clients may be redirect-

ed to the metadata manager that currently controls the file. The

osdID hint can become out-of-date if reconstruction or active

balancing moves an object. If both osdID hints fail, the meta-

data manager has to multicast a GetAttributes to the storage

nodes in the BladeSet to locate an object. The partitionID and

objectID are the same on every storage node that stores a com-

ponent of the file, so this technique will always work. Once the

file is located, the metadata manager automatically updates

the stored hints in the directory, allowing future accesses to

bypass this step.
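The directory entry described above can be pictured as a small record; the field names and widths below are illustrative only, not the on-disk Panasas format:

#include <stdint.h>

struct pan_dirent {
    uint32_t service_id;      /* hint: metadata manager responsible for the file */
    uint32_t partition_id;    /* OSD partition; Panasas uses one per volume      */
    uint64_t object_id;       /* object number within the partition              */
    uint32_t osd_id_hint[2];  /* likely locations of the two attribute-storing   */
                              /* component objects; may be stale after rebuild   */
    char     name[256];       /* file name within the directory                  */
};

Because partitionID and objectID are identical on every storage node holding a component of the file, a stale osdID hint never makes the file unreachable; it only costs the multicast GetAttributes lookup described above.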

File operations may require several object operations. The fig-

ure on the left shows the steps used in creating a file. The meta-

data manager keeps a local journal to record in-progress actions

so it can recover from object failures and metadata manager

crashes that occur when updating multiple objects. For exam-

ple, creating a file is fairly complex task that requires updating

the parent directory as well as creating the new file. There are

2 Create OSD operations to create the first two components of

the file, and 2 Write OSD operations, one to each replica of the

parent directory. As a performance optimization, the metadata

server also grants the client read and write access to the file

and returns the appropriate capabilities to the client as part of

the FileCreate results. The server makes a record of these write

capabilities to support error recovery if the client crashes while

writing the file. Note that the directory update (step 7) occurs

after the reply, so that many directory updates can be batched

together. The deferred update is protected by the op-log record

that gets deleted in step 8 after the successful directory update.

The metadata manager maintains an op-log that records the

object create and the directory updates that are in progress.

This log entry is removed when the operation is complete. If the

metadata service crashes and restarts, or a failure event moves

the metadata service to a different manager node, then the op-

log is processed to determine what operations were active at

the time of the failure. The metadata manager rolls the opera-

tions forward or backward to ensure the object store is consis-

tent. If no reply to the operation has been generated, then the

operation is rolled back. If a reply has been generated but pend-

ing operations are outstanding (e.g., directory updates), then

the operation is rolled forward.
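The sequence can be summarized in a short sketch, with printf stubs standing in for the real OSD operations and log records; all helper names are hypothetical and error handling is omitted:

#include <stdio.h>

static void step(const char *what) { printf("%s\n", what); }

void metadata_create_file(const char *name)
{
    step("op-log: record in-progress create");          /* survives a crash      */
    step("OSD CREATE: first component object");
    step("OSD CREATE: second component object");
    step("cap-log: record granted write capability");
    step("REPLY to client with map and capabilities");  /* client may start I/O  */
    step("OSD WRITE: update both directory replicas");  /* deferred, batched     */
    step("op-log: remove record (create complete)");
    (void)name;
}

int main(void) { metadata_create_file("newfile"); return 0; }

The placement of the reply before the directory update is what allows many directory writes to be batched; the op-log record is what makes that deferral safe.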

Creating a file: the client's request leads to two OSD Create operations, a reply to the client, and a deferred directory Write to both replicas, tracked by op-log, cap-log, and reply-cache records in the metadata server's transaction log.


The write capability is stored in a cap-log so that when a meta-

data server starts it knows which of its files are busy. In addition

to the “piggybacked” write capability returned by FileCreate,

the client can also execute a StartWrite RPC to obtain a sepa-

rate write capability. The cap-log entry is removed when the

client releases the write cap via an EndWrite RPC. If the client

reports an error during its I/O, then a repair log entry is made

and the file is scheduled for repair. Read and write capabilities

are cached by the client over multiple system calls, further re-

ducing metadata server traffic.

System-level Metadata

The final layer of metadata is information about the overall sys-

tem itself. One possibility would be to store this information in

objects and bootstrap the system through a discovery protocol.

The most difficult aspect of that approach is reasoning about

the fault model. The system must be able to come up and be

manageable while it is only partially functional. Panasas chose

instead a model with a small replicated set of system managers,

each of which stores a replica of the system configuration metadata.

Each system manager maintains a local database, outside of

the object storage system. Berkeley DB is used to store tables

that represent our system model. The different system manager

instances are members of a replication set that use Lamport’s

part-time parliament (PTP) protocol to make decisions and up-

date the configuration information. Clusters are configured

with one, three, or five system managers so that the voting quo-

rum has an odd number and a network partition will cause a

minority of system managers to disable themselves.

System configuration state includes both static state, such as

the identity of the blades in the system, as well as dynamic

state such as the online/offline state of various services and er-

ror conditions associated with different system components.

Each state update decision, whether it is updating the admin

password or activating a service, involves a voting round and an

update round according to the PTP protocol. Database updates


are performed within the PTP transactions to keep the data-

bases synchronized. Finally, the system keeps backup copies of

the system configuration databases on several other blades to

guard against catastrophic loss of every system manager blade.

Blade configuration is pulled from the system managers as part

of each blade’s startup sequence. The initial DHCP handshake

conveys the addresses of the system managers, and thereafter

the local OS on each blade pulls configuration information from

the system managers via RPC.

The cluster manager implementation has two layers. The lower

level PTP layer manages the voting rounds and ensures that

partitioned or newly added system managers will be brought

up-to-date with the quorum. The application layer above that

uses the voting and update interface to make decisions. Com-

plex system operations may involve several steps, and the sys-

tem manager has to keep track of its progress so it can tolerate

a crash and roll back or roll forward as appropriate.

For example, creating a volume (i.e., a quota-tree) involves file

system operations to create a top-level directory, object opera-

tions to create an object partition within OSDFS on each Stor-

ageBlade module, service operations to activate the appropri-

ate metadata manager, and configuration database operations

to reflect the addition of the volume. Recovery is enabled by

having two PTP transactions. The initial PTP transaction deter-

mines if the volume should be created, and it creates a record

about the volume that is marked as incomplete. Then the sys-

tem manager does all the necessary service activations, file and

storage operations. When these all complete, a final PTP trans-

action is performed to commit the operation. If the system man-

ager crashes before the final PTP transaction, it will detect the

incomplete operation the next time it restarts, and then roll the

operation forward or backward.

“Outstanding in the HPC world, the ActiveStor solutions provided by Panasas are undoubtedly the only HPC storage solutions that combine highest scalability and performance with a convincing ease of management.“

Thomas Gebert, HPC Solution Architect

NVIDIA GPU Computing – The CUDA Architecture

The CUDA parallel computing platform is now widely deployed with 1000s of GPU-accelerated applications and 1000s of published research papers, and a complete range of CUDA tools and ecosystem solutions is available to developers.


What is GPU Computing?

GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications. Pioneered more than five years ago by NVIDIA, GPU computing has quickly become an industry standard, enjoyed by millions of users worldwide and adopted by virtually all computing vendors.

GPU computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user's perspective, applications simply run significantly faster.

CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU while parallel portions run on the GPU.
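A minimal CUDA example makes this division of labor visible: the serial setup runs on the CPU, and the data-parallel loop is offloaded to the GPU as a kernel (unified memory is used here purely for brevity; sizes and names are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }   // serial on the CPU

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);             // parallel on the GPU
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                                // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}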

Most customers can immediately enjoy the power of GPU computing by using any of the GPU-accelerated applications listed in our catalog, which highlights over one hundred industry-leading applications. For developers, GPU computing offers a vast ecosystem of tools and libraries from major software vendors.

History of GPU Computing

Graphics chips started as fixed-function graphics processors but became increasingly programmable and computationally powerful, which led NVIDIA to introduce the first GPU. In the 1999-2000 timeframe, computer scientists and domain scientists from various fields started using GPUs to accelerate a range of scientific applications. This was the advent of the movement called GPGPU, or General-Purpose computation on GPU.

While users achieved unprecedented performance (over 100x compared to CPUs in some cases), the challenge was that GPGPU required the use of graphics programming APIs like OpenGL and Cg to program the GPU. This limited accessibility to the tremendous capability of GPUs for science. NVIDIA recognized the potential of bringing this performance to the larger scientific community, invested in making the GPU fully programmable, and offered a seamless experience for developers with familiar languages like C, C++, and Fortran.

GPU computing momentum is growing faster than ever before. Today, some of the fastest supercomputers in the world rely on GPUs to advance scientific discoveries; 600 universities around the world teach parallel computing with NVIDIA GPUs; and hundreds of thousands of developers are actively using GPUs.

All NVIDIA GPUs—GeForce, Quadro, and Tesla — support GPU

computing and the CUDA parallel programming model. Devel-

opers have access to NVIDIA GPUs in virtually any platform of

their choice, including the latest Apple MacBook Pro. However,

we recommend Tesla GPUs for workloads where data reliability

and overall performance are critical.

Tesla GPUs are designed from the ground up to accelerate scientific and technical computing workloads. Based on innovative features in the “Kepler architecture,” the latest Tesla GPUs offer 3x more performance compared to the previous architecture and more than one teraflop of double-precision floating-point throughput, while dramatically advancing programmability and efficiency. Kepler is the world's fastest and most efficient high performance computing (HPC) architecture.

Kepler GK110 – The Next Generation GPU Computing Architecture

As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures.

“We are very proud to be one of the leading providers of Tesla systems who are able to combine the overwhelming power of NVIDIA Tesla systems with the fully engineered and thoroughly tested transtec hardware to a total Tesla-based solution.”

Norbert Zeidler, Senior HPC Solution Engineer

The CUDA parallel computing architecture: GPU computing applications written with C, C++, Fortran, Java, Python, OpenCL, and DirectCompute all target the CUDA parallel compute engines inside NVIDIA GPUs, either through device-level APIs (OpenCL driver, DirectCompute, CUDA driver API) or through language integration (C runtime for CUDA), with compute kernels compiled to PTX (ISA).

NVIDIA's existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in

areas such as seismic processing, biochemistry simulations,

weather and climate modeling, signal processing, computa-

tional finance, computer aided engineering, computational flu-

id dynamics, and data analysis. NVIDIA’s new Kepler GK110 GPU

raises the parallel computing bar considerably and will help

solve the world’s most difficult computing problems.

By offering much higher processing power than the prior GPU

generation and by providing new methods to optimize and in-

crease parallel workload execution on the GPU, Kepler GK110

simplifies creation of parallel programs and will further revolu-

tionize high performance computing.

Kepler GK110 – Extreme Performance, Extreme Efficiency

Comprising 7.1 billion transistors, Kepler GK110 is not only the

fastest, but also the most architecturally complex microproces-

sor ever built. Adding many new innovative features focused on

compute performance, GK110 was designed to be a parallel pro-

cessing powerhouse for Tesla® and the HPC market.

Kepler GK110 provides over 1 TFlop of double precision

throughput with greater than 93% DGEMM efficiency versus

60-65% on the prior Fermi architecture.

In addition to greatly improved performance, the Kepler archi-

tecture offers a huge leap forward in power efficiency, deliver-

ing up to 3x the performance per watt of Fermi.

The following new features in Kepler GK110 enable increased

GPU utilization, simplify parallel program design, and aid in the

deployment of GPUs across the spectrum of compute environ-

ments ranging from personal workstations to supercomputers:

• Dynamic Parallelism – adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less-structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.

• Hyper-Q – Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code (see the stream sketch after this list).

• Grid Management Unit – Enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

• NVIDIA GPUDirect – NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.
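The stream sketch referenced above: each CPU thread or code path can push independent work into its own CUDA stream, and on GK110 Hyper-Q lets these streams feed separate hardware work queues instead of serializing behind a single connection. The kernel and sizes are placeholders:

#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main(void)
{
    const int n_streams = 8, n = 1 << 16;
    cudaStream_t streams[8];
    float *buf[8];

    for (int s = 0; s < n_streams; s++) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaMemset(buf[s], 0, n * sizeof(float));
        // Kernels in different streams may execute concurrently on the GPU.
        busy_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < n_streams; s++) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}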

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its

goal was to be the highest performing parallel computing mi-

croprocessor in the world. GK110 not only greatly exceeds the

raw compute horsepower delivered by Fermi, but it does so

efficiently, consuming significantly less power and generating

much less heat output.


Technical Specifications: Tesla K40 / Tesla K20X / Tesla K20 / Tesla K10
Peak double-precision floating point performance (board): 1.43 Tflops / 1.31 Tflops / 1.17 Tflops / 0.19 Tflops
Peak single-precision floating point performance (board): 4.29 Tflops / 3.95 Tflops / 3.52 Tflops / 4.58 Tflops
Number of GPUs: 1 x GK110B / 1 x GK110 / 1 x GK110 / 2 x GK104s
Number of CUDA cores: 2,880 / 2,688 / 2,496 / 2 x 1,536
Memory size per board (GDDR5): 12 GB / 6 GB / 5 GB / 8 GB
Memory bandwidth for board (ECC off): 288 Gbytes/sec / 250 Gbytes/sec / 208 Gbytes/sec / 320 Gbytes/sec
Architecture features: SMX, Dynamic Parallelism, Hyper-Q / SMX, Dynamic Parallelism, Hyper-Q / SMX, Dynamic Parallelism, Hyper-Q / SMX
System: Servers and workstations / Servers / Servers and workstations / Servers


Introducing NVIDIA Parallel Nsight

NVIDIA Parallel Nsight is the first development environment designed specifically to support massively parallel CUDA C, OpenCL, and DirectCompute applications. It bridges the productivity gap between CPU and GPU code by bringing parallel-aware hardware source code debugging and performance analysis directly into Microsoft Visual Studio, the most widely used integrated application development environment under Microsoft Windows.

Parallel Nsight allows Visual Studio developers to write and debug GPU source code using exactly the same tools and interfaces that are used when writing and debugging CPU code, including source and data breakpoints, and memory inspection. Furthermore, Parallel Nsight extends Visual Studio functionality by offering tools to manage massive parallelism, such as the ability to focus and debug on a single thread out of the thousands of threads running in parallel, and the ability to simply and efficiently visualize the results computed by all parallel threads.

Parallel Nsight is the perfect environment to develop co-processing applications that take advantage of both the CPU and GPU. It captures performance events and information across both processors, and presents the information to the developer on a single correlated timeline. This allows developers to see how their application behaves and performs on the entire system, rather than through a narrow view that is focused on a particular subsystem or processor.

Parallel Nsight Debugger for GPU Computing

• Debug your CUDA C/C++ and DirectCompute source code directly on the GPU hardware
• As the industry's only GPU hardware debugging solution, it drastically increases debugging speed and accuracy
• Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows

Parallel Nsight Analysis Tool for GPU Computing

• Isolate performance bottlenecks by viewing system-wide CPU+GPU events
• Support for all major GPU Computing APIs, including CUDA C/C++, OpenCL, and Microsoft DirectCompute

Parallel Nsight Debugger for Graphics Development

• Debug HLSL shaders directly on the GPU hardware, drastically increasing debugging speed and accuracy over emulated (SW) debugging
• Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows with HLSL shaders, including DirectCompute code
• The Debugger supports all HLSL shader types: Vertex, Pixel, Geometry, and Tessellation

Parallel Nsight Graphics Inspector for Graphics Development

• Graphics Inspector captures Direct3D rendered frames for real-time examination
• The Frame Profiler automatically identifies bottlenecks and performance information on a per-draw call basis
• Pixel History shows you all operations that affected a given pixel


transtec has strived to develop well-engineered GPU Computing solutions from the very beginning of the Tesla era.

From High-Performance GPU Workstations to rack-mounted

Tesla server solutions, transtec has a broad range of specially

designed systems available. As an NVIDIA Tesla Preferred Pro-

vider (TPP), transtec is able to provide customers with the lat-

est NVIDIA GPU technology as well as fully-engineered hybrid

systems and Tesla Preconfigured Clusters. Thus, customers can

be assured that transtec’s large experience in HPC cluster solu-

tions is seamlessly brought into the GPU computing world. Per-

formance Engineering made by transtec.


A full Kepler GK110 implementation includes 15 SMX units and

six 64-bit memory controllers. Different products will use differ-

ent configurations of GK110. For example, some products may

deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:

• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• Hardware support throughout the design to enable new programming model capabilities

Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA, see the Quick Refresher on CUDA below.) The following table compares parameters of different Compute Capabilities for the Fermi and Kepler GPU architectures:

Compute Capability parameters: Fermi GF100 / Fermi GF104 / Kepler GK104 / Kepler GK110
Compute Capability: 2.0 / 2.1 / 3.0 / 3.5
Threads per Warp: 32 / 32 / 32 / 32
Max Warps per Multiprocessor: 48 / 48 / 64 / 64
Max Threads per Multiprocessor: 1536 / 1536 / 2048 / 2048
Max Thread Blocks per Multiprocessor: 8 / 8 / 16 / 16
32-bit Registers per Multiprocessor: 32768 / 32768 / 65536 / 65536
Max Registers per Thread: 63 / 63 / 63 / 255
Max Threads per Thread Block: 1024 / 1024 / 1024 / 1024
Shared Memory Size Configurations (bytes): 16K, 48K / 16K, 48K / 16K, 32K, 48K / 16K, 32K, 48K
Max X Grid Dimension: 2^16-1 / 2^16-1 / 2^32-1 / 2^32-1
Hyper-Q: NO / NO / NO / YES
Dynamic Parallelism: NO / NO / NO / YES

Performance per Watt

A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC's 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance.

Every hardware unit in Kepler was designed and scrubbed to

provide outstanding performance per watt. The best example

of great perf/watt is seen in the design of Kepler GK110’s new

Streaming Multiprocessor (SMX), which is similar in many res-

pects to the SMX unit recently introduced in Kepler GK104, but

includes substantially more double precision units for compute

algorithms.

Streaming Multiprocessor (SMX) Architecture

Kepler GK110’s new SMX introduces several architectural inno-

vations that make it not only the most powerful multiproces-

sor we’ve built, but also the most programmable and power-

efficient.

SMX Processing Core Architecture

Each of the Kepler GK110 SMX units feature 192 single-precision

CUDA cores, and each core has fully pipelined floating-point and

integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation.

One of the design goals for the Kepler GK110 SMX was to sig-

nificantly increase the GPU’s delivered double precision perfor-

mance, since double precision arithmetic is at the heart of many

HPC applications. Kepler GK110’s SMX also retains the special

function units (SFUs) for fast approximate transcendental ope-

rations as in previous-generation GPUs, providing 8x the num-

ber of SFUs of the Fermi GF110 SM.

Similar to GK104 SMX units, the cores within the new GK110

SMX units use the primary GPU clock rather than the 2x shader

clock. Recall the 2x shader clock was introduced in the G80 Tesla-

architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs.

Kepler GK110 Full chip block diagram

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Running execution units at a higher clock rate

allows a chip to achieve a given target throughput with fewer

copies of the execution units, which is essentially an area opti-

mization, but the clocking logic for the faster cores is more po-

wer-hungry. For Kepler, our priority was performance per watt.

While we made many optimizations that benefitted both area

and power, we chose to optimize for power even at the expen-

se of some added area cost, with a larger number of processing

cores running at the lower, less power-hungry GPU clock.

Quad Warp Scheduler

The SMX schedules threads in groups of 32 parallel threads called

warps. Each SMX features four warp schedulers and eight ins-

truction dispatch units, allowing four warps to be issued and

executed concurrently. Kepler’s quad warp scheduler selects

four warps, and two independent instructions per warp can be

dispatched each cycle. Unlike Fermi, which did not permit double

precision instructions to be paired with other instructions, Kep-

ler GK110 allows double precision instructions to be paired with

other instructions.

We also looked for opportunities to optimize the power in the

SMX warp scheduler logic. For example, both Kepler and Fermi

schedulers contain similar hardware units to handle the sche-

duling function, including:

• Register scoreboarding for long latency operations (texture and load)
• Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
• Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi’s scheduler also contains a complex hardware

stage to prevent data hazards in the math datapath itself. A

multi-port register scoreboard keeps track of any registers that

are not yet ready with valid data, and a dependency checker

block analyzes register usage across a multitude of fully deco-

ded warp instructions against the scoreboard, to determine

which are eligible to issue.

For Kepler, we recognized that this information is deterministic

(the math pipeline latencies are not variable), and therefore it

is possible for the compiler to determine up front when instruc-

tions will be ready to issue, and provide this information in the

instruction itself. This allowed us to replace several complex and

power-expensive blocks with a simple hardware block that ext-

racts the pre-determined latency information and uses it to mask

out warps from eligibility at the inter-warp scheduler stage.

New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases of up to 5.3x due to the ability to use many more registers per thread and the resulting reduction in spills to local memory.

Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references, i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR “butterfly” style permutations among the threads in a warp, are also available as CUDA intrinsics.


Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

This example shows some of the variations possible using the new Shuffle instruction in Kepler.

Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.
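A typical use is a warp-wide reduction; the sketch below sums one value per lane without touching shared memory. Note that __shfl_down_sync is the current spelling of the intrinsic; Kepler-era CUDA exposed it as __shfl_down without the mask argument.

#include <cuda_runtime.h>
#include <stdio.h>

__device__ inline float warp_reduce_sum(float val)
{
    // Each step adds the value held by the lane `offset` positions higher.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;                          // lane 0 ends up with the warp total
}

__global__ void warp_sum_kernel(float *out)
{
    float total = warp_reduce_sum(1.0f); // every lane contributes 1.0
    if (threadIdx.x == 0) *out = total;
}

int main(void)
{
    float *d_out, h_out;
    cudaMalloc(&d_out, sizeof(float));
    warp_sum_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);    // expect 32.0
    cudaFree(d_out);
    return 0;
}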

Atomic Operations

Atomic memory operations are important in parallel program-

ming, allowing concurrent threads to correctly perform read-

modify-write operations on shared data structures. Atomic

operations such as add, min, max, and compare-and-swap are

atomic in the sense that the read, modify, and write operations

are performed without interruption by other threads. Atomic

memory operations are widely used for parallel sorting, reduc-

tion operations, and building data structures in parallel with-

out locks that Serialize thread execution.

Throughput of global memory atomic operations on Kepler

GK110 is substantially improved compared to the Fermi genera-

tion. Atomic operation throughput to a common global memory

address is improved by 9x to one operation per clock. Atomic

operation throughput to independent global addresses is also

significantly accelerated, and logic to handle address conflicts

has been made more efficient. Atomic operations can often be

processed at rates similar to global load operations. This speed

increase makes atomics fast enough to use frequently within

kernel inner loops, eliminating the separate reduction passes

that were previously required by some algorithms to consoli-

date results. Kepler GK110 also expands the native support for

64-bit atomic operations in global memory. In addition to atomi-

cAdd, atomicCAS, and atomicExch (which were also supported

by Fermi and Kepler GK104), GK110 supports the following:

• atomicMin
• atomicMax
• atomicAnd
• atomicOr
• atomicXor

Other atomic operations which are not supported natively (for

example 64-bit floating point atomics) may be emulated using

the compare-and-swap (CAS) instruction.
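The emulation mentioned above follows the familiar compare-and-swap loop; the sketch below implements a double-precision atomic add that way and exercises it from a small kernel (names and launch sizes are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

__device__ double atomic_add_double(double *address, double val)
{
    unsigned long long *addr_as_ull = (unsigned long long *)address;
    unsigned long long old = *addr_as_ull, assumed;
    do {
        assumed = old;
        // Retry until no other thread has changed the value in between.
        old = atomicCAS(addr_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void accumulate(double *sum) { atomic_add_double(sum, 1.0); }

int main(void)
{
    double *d_sum, h_sum = 0.0;
    cudaMalloc(&d_sum, sizeof(double));
    cudaMemcpy(d_sum, &h_sum, sizeof(double), cudaMemcpyHostToDevice);
    accumulate<<<64, 256>>>(d_sum);
    cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_sum);         // expect 16384.0 (64 x 256 adds of 1.0)
    cudaFree(d_sum);
    return 0;
}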

Texture Improvements

The GPU’s dedicated hardware Texture units are a valuable re-

source for compute programs with a need to sample or filter

image data. The texture throughput in Kepler is significantly in-

creased compared to Fermi – each SMX unit contains 16 texture

filtering units, a 4x increase vs the Fermi GF110 SM.

In addition, Kepler changes the way texture state is managed.

In the Fermi generation, for the GPU to reference a texture, it

had to be assigned a “slot” in a fixed-size binding table prior to

grid launch. The number of slots in that table ultimately limits

how many unique textures a program can read from at runtime.

Ultimately, a program was limited to accessing only 128 simulta-

neous textures in Fermi.

With bindless textures in Kepler, the additional step of using

slots isn’t necessary: texture state is now saved as an object in

memory and the hardware fetches these state objects on de-

mand, making binding tables obsolete. This effectively elimi-

nates any limits on the number of unique textures that can be

referenced by a compute program. Instead, programs can map

textures at any time and pass texture handles around as they

would any other pointer.

Kepler Memory Subsystem – L1, L2, ECC

Kepler’s memory hierarchy is organized similarly to Fermi. The

Kepler architecture supports a unified memory request path for

loads and stores, with an L1 cache per SMX multiprocessor. Ke-

pler GK110 also enables compiler-directed use of an additional

new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous generation

Fermi architecture, each SMX has 64 KB of on-chip memory that

can be configured as 48 KB of Shared memory with 16 KB of L1

cache, or as 16 KB of shared memory with 48 KB of L1 cache. Ke-

pler now allows for additional flexibility in configuring the al-

location of shared memory and L1 cache by permitting a 32KB

/ 32KB split between shared memory and L1 cache. To support

the increased throughput of each SMX unit, the shared memory

bandwidth for 64b and larger load operations is also doubled

compared to the Fermi SM, to 256B per core clock.
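From the host side, the split is selected per kernel; the following sketch (the kernel name is a placeholder) requests first the 48 KB shared / 16 KB L1 configuration and then the 32 KB / 32 KB split introduced with Kepler:

#include <cuda_runtime.h>

__global__ void stencil_kernel(float *data)
{
    if (threadIdx.x == 0) data[0] = 0.0f;   // placeholder body
}

int main(void)
{
    // Prefer 48 KB shared memory / 16 KB L1 cache for this kernel ...
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // ... or request the 32 KB / 32 KB split instead.
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferEqual);
    return 0;
}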

48KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48KB cache for

data that is known to be read-only for the duration of the func-

tion. In the Fermi generation, this cache was accessible only by

the Texture unit. Expert programmers often found it advanta-

geous to load data through this path explicitly by mapping their

data as textures, but this approach had many limitations.

NVIDIA GPU Computing

Kepler GK110 – The Next Generation GPU Computing Architecture

Page 59: HPC Technology Compass 2014/15

116 117

In Kepler, in addition to significantly increasing the capacity

of this cache along with the texture horsepower increase, we

decided to make the cache directly accessible to the SM for

general load operations. Use of the read-only path is beneficial

because it takes both load and working set footprint off of the

Shared/L1 cache path. In addition, the Read-Only Data Cache’s

higher tag bandwidth supports full speed unaligned memory

access patterns among other scenarios.

Use of the read-only path can be managed automatically by the

compiler or explicitly by the programmer. Access to any variable

or data structure that is known to be constant through pro-

grammer use of the C99-standard “const __restrict” keyword

may be tagged by the compiler to be loaded through the Read-

Only Data Cache. The programmer can also explicitly use this

path with the __ldg() intrinsic.
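A kernel-only sketch of both routes (names and sizes are illustrative): the const __restrict__ qualifiers let the compiler prove the data is read-only and route the loads through the read-only cache, while __ldg() requests that path explicitly (compute capability 3.5 and higher):

__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out,
                      float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * __ldg(&in[i]);   // explicit read-only-cache load
}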

Improved L2 Cache

The Kepler GK110 GPU features 1536KB of dedicated L2 cache

memory, double the amount of L2 available in the Fermi archi-

tecture. The L2 cache is the primary point of data unification

between the SMX units, servicing all load, store, and texture re-

quests and providing efficient, high speed data sharing across

the GPU. The L2 cache on Kepler offers up to 2x of the band-

width per clock available in Fermi. Algorithms for which data

addresses are not known beforehand, such as physics solvers,

ray tracing, and sparse matrix multiplication especially benefit

from the cache hierarchy. Filter and convolution kernels that re-

quire multiple SMs to read the same data also benefit.

Memory Protection Support

Like Fermi, Kepler’s register files, shared memories, L1 cache, L2

cache and DRAM memory are protected by a Single-Error Correct

Double-Error Detect (SECDED) ECC code. In addition, the Read-

Only Data Cache supports single-error correction through a parity

check; in the event of a parity error, the cache unit automatically in-

validates the failed line, forcing a read of the correct data from L2.

ECC checkbit fetches from DRAM necessarily consume some

amount of DRAM bandwidth, which results in a performance

difference between ECC-enabled and ECC-disabled operation,

especially on memory bandwidth-sensitive applications. Kepler

GK110 implements several optimizations to ECC checkbit fetch

handling based on Fermi experience. As a result, the ECC on-vs-

off performance delta has been reduced by an average of 66%,

as measured across our internal compute application test suite.

Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of paral-

lel code in an application to run efficiently and entirely within

the GPU improves scalability and performance as GPUs increase

in perf/watt. To accelerate these additional parallel portions of

the application, GPUs must support more varied types of paral-

lel workloads.

Dynamic Parallelism is a new feature introduced with Kepler

GK110 that allows the GPU to generate new work for itself, syn-

chronize on results, and control the scheduling of that work via

dedicated, accelerated hardware paths, all without involving

the CPU.

Fermi was very good at processing large parallel data structures

when the scale and parameters of the problem were known at

kernel launch time. All work was launched from the host CPU,

would run to completion, and return a result back to the CPU.

The result would then be used as part of the final solution, or

would be analyzed by the CPU which would then send addition-

al requests back to the GPU for additional processing.

In Kepler GK110 any kernel can launch another kernel, and can

create the necessary streams, events and manage the depen-

dencies needed to process additional work without the need

for host CPU interaction. This architectural innovation makes it

easier for developers to create and optimize recursive and data-

dependent execution patterns, and allows more of a program to

be run directly on the GPU. The system CPU can then be freed up for

additional tasks, or the system could be configured with a less

powerful CPU to carry out the same workload.

Dynamic Parallelism allows more varieties of parallel algo-

rithms to be implemented on the GPU, including nested loops

with differing amounts of parallelism, parallel teams of serial

control-task threads, or simple serial control code offloaded to

the GPU in order to promote data-locality with the parallel por-

tion of the application.

Because a kernel has the ability to launch additional work-

loads based on intermediate, on-GPU results, programmers can

now intelligently load-balance work to focus the bulk of their

resources on the areas of the problem that either require the

most processing power or are most relevant to the solution.

One example would be dynamically setting up a grid for a nu-

merical simulation – typically grid cells are focused in regions

of greatest change, requiring an expensive pre-processing pass

through the data. Alternatively, a uniformly coarse grid could be

used to prevent wasted GPU resources, or a uniformlyfine grid

could be used to ensure all the features are captured, but these

options risk missing simulation features or “over-spending”

compute resources on regions of less interest. With Dynamic

Parallelism, the grid resolution can be determined dynamically

at runtime in a data-dependent manner.


Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).


A Quick Refresher on CUDA

CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in the figure on the right side. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results. A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in global memory space after kernel-wide global synchronization.
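A minimal sketch of this hierarchy is shown below: a kernel executed by a grid of thread blocks, with per-thread indices, per-block shared memory, and a barrier. The kernel name, block size, and array length are illustrative and not taken from the text.

#include <cuda_runtime.h>

// Each thread handles one element; grid and block indices identify it.
__global__ void scaleAdd(const float *b, const float *c, float *a, float k, int n)
{
    __shared__ float tile[256];                       // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread index
    tile[threadIdx.x] = (i < n) ? k * b[i] : 0.0f;    // cooperate within the block
    __syncthreads();                                  // barrier synchronization
    if (i < n)
        a[i] = tile[threadIdx.x] + c[i];              // write result to global memory
}

int main(void)
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc((void **)&a, n * sizeof(float));       // global memory allocations
    cudaMalloc((void **)&b, n * sizeof(float));       // (initialization of b and c
    cudaMalloc((void **)&c, n * sizeof(float));       //  is omitted for brevity)

    int threadsPerBlock = 256;                                        // thread block size
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid size
    scaleAdd<<<blocksPerGrid, threadsPerBlock>>>(b, c, a, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}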

CUDA Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors

on the GPU; a GPU executes one or more kernel grids; a stream-

ing multiprocessor (SM on Fermi / SMX on Kepler) executes one

or more thread blocks; and CUDA cores and other execution

units in the SMX execute thread instructions. The SMX executes

threads in groups of 32 threads called warps. While program-

mers can generally ignore warp execution for functional cor-

rectness and focus on programming individual scalar threads,

they can greatly improve performance by having threads in

a warp execute the same code path and access memory with

nearby addresses.

CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.


Starting with a coarse grid, the simulation can “zoom in” on ar-

eas of interest while avoiding unnecessary calculation in areas

with little change. Though this could be accomplished using a

sequence of CPU-launched kernels, it would be far simpler to al-

low the GPU to refine the grid itself by analyzing the data and

launching additional work as part of a single simulation kernel,

eliminating interruption of the CPU and data transfers between

the CPU and GPU.

The above example illustrates the benefits of using a dynami-

cally sized grid in a numerical simulation. To meet peak preci-

sion requirements, a fixed resolution simulation must run at an

excessively fine resolution across the entire simulation domain,

whereas a multi-resolution grid applies the correct simulation

resolution to each area based on local variation.

Hyper-Q

One of the challenges in the past has been keeping the GPU sup-

plied with an optimally scheduled load of work from multiple

streams. The Fermi architecture supported 16-way concur-

rency of kernel launches from separate streams, but ultimately

the streams were all multiplexed into the same hardware work

queue. This allowed for false intra-stream dependencies, re-

quiring dependent kernels within one stream to complete be-

fore additional kernels in a separate stream could be executed.

While this could be alleviated to some extent through the use of

a breadth-first launch order, as program complexity increases,

this can become more and more difficult to manage efficiently.

Kepler GK110 improves on this functionality with the new
Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work
Distributor (CWD) logic in the GPU by allowing 32 simultane-

ous, hardware-managed connections (compared to the single

connection available with Fermi). Hyper-Q is a flexible solution

that allows connections from multiple CUDA streams, from mul-

tiple Message Passing Interface (MPI) processes, or even from

multiple threads within a process. Applications that previously

encountered false serialization across tasks, thereby limiting

GPU utilization, can see up to a 32x performance increase with-

out changing any existing code.

Each CUDA stream is managed within its own hardware work

queue, inter-stream dependencies are optimized, and opera-

tions in one stream will no longer block other streams, enabling

streams to execute concurrently without needing to specifically

tailor the launch order to eliminate possible false dependencies.
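The sketch below shows the kind of submission pattern Hyper-Q benefits: several independent CUDA streams, each with its own work. The stream count, kernel, and sizes are illustrative; on Fermi these launches would be multiplexed into one hardware queue, while on GK110 each stream can feed its own work queue and run concurrently.

// Sketch: independent work submitted from several CUDA streams.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;   // arbitrary per-element work
}

int main(void)
{
    const int nStreams = 8, n = 1 << 16;
    cudaStream_t streams[nStreams];
    float *buffers[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&buffers[s], n * sizeof(float));  // contents left
                                                              // uninitialized; the
                                                              // point is the pattern
        // Each launch goes to its own stream; there is no false
        // inter-stream dependency for the hardware to serialize on.
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buffers[s]);
    }
    return 0;
}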

Hyper-Q offers significant benefits for use in MPI-based paral-

lel computer systems. Legacy MPI-based algorithms were often

created to run on multi-core CPU systems, with the amount of

work assigned to each MPI process scaled accordingly. This can

lead to a single MPI process having insufficient work to fully

occupy the GPU. While it has always been possible for multiple

MPI processes to share a GPU, these processes could become

bottlenecked by false dependencies. Hyper-Q removes those

false dependencies, dramatically increasing the efficiency of

GPU sharing across MPI processes.

Grid Management Unit - Efficiently Keeping the GPU Utilized

New features in Kepler GK110, such as the ability for CUDA ker-

nels to launch work directly on the GPU with Dynamic Paral-

lelism, required that the CPU-to-GPU workflow in Kepler offer

increased functionality over the Fermi design. On Fermi, a grid

of thread blocks would be launched by the CPU and would al-

ways run to completion, creating a simple unidirectional flow


Hyper-Q permits more simultaneous connections between CPU and GPU

Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) & (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.


of work from the host to the SMs via the CUDA Work Distributor

(CWD) unit. Kepler GK110 was designed to improve the CPU-to-

GPU workflow by allowing the GPU to efficiently manage both

CPU- and CUDA-created workloads.

We discussed the ability of the Kepler GK110 GPU to allow ker-

nels to launch work directly on the GPU, and it’s important to

understand the changes made in the Kepler GK110 architec-

ture to facilitate these new functions. In Kepler, a grid can be

launched from the CPU just as was the case with Fermi; how-
ever, new grids can also be created programmatically by CUDA

within the Kepler SMX unit. To manage both CUDA-created and

host-originated grids, a new Grid Management Unit (GMU) was

introduced in Kepler GK110. This control unit manages and pri-

oritizes grids that are passed into the CWD to be sent to the SMX

units for execution.

The CWD in Kepler holds grids that are ready to dispatch, and it

is able to dispatch 32 active grids, which is double the capacity

of the Fermi CWD. The Kepler CWD communicates with the GMU

via a bidirectional link that allows the GMU to pause the dis-

patch of new grids and to hold pending and suspended grids un-

til needed. The GMU also has a direct connection to the Kepler

SMX units to permit grids that launch additional work on the

GPU via Dynamic Parallelism to send the new work back to GMU

to be prioritized and dispatched. If the kernel that dispatched

the additional workload pauses, the GMU will hold it inactive

until the dependent work has completed.

NVIDIA GPUDirect

When working with a large amount of data, increasing the data

throughput and reducing latency is vital to increasing com-

pute performance. Kepler GK110 supports the RDMA feature in

NVIDIA GPUDirect, which is designed to improve performance

by allowing direct access to GPU memory by third-party devices

such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPU-

Direct provides the following important features:

• Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering
• Significantly improved MPI_Send/MPI_Recv efficiency between the GPU and other nodes in a network
• Eliminates CPU bandwidth and latency bottlenecks
• Works with a variety of 3rd-party network, capture, and storage devices

Applications like reverse time migration (used in seismic imag-

ing for oil & gas exploration) distribute the large imaging data

across several GPUs. Hundreds of GPUs must collaborate to

crunch the data, often communicating intermediate results.

GPUDirect enables much higher aggregate bandwidth for this

GPU-to-GPU communication scenario within a server and across

servers with the P2P and RDMA features.

Kepler GK110 also supports other GPUDirect features such as

Peer-to-Peer and GPUDirect for Video.
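As a hedged illustration of the data path GPUDirect RDMA accelerates, the sketch below assumes an MPI library built with CUDA/GPUDirect support (for example a CUDA-aware Open MPI or MVAPICH2 build); in that case a buffer resident in GPU memory can be handed to MPI_Send/MPI_Recv directly, avoiding a staging copy through host memory. Buffer size and message tags are illustrative.

// Sketch: exchanging a GPU-resident buffer between two ranks with a
// CUDA-aware MPI library. Without such support, the buffer would first
// have to be copied to host memory with cudaMemcpy.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));   // buffer lives in GPU memory

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}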

Conclusion

With the launch of Fermi in 2010, NVIDIA ushered in a new era

in the high performance computing (HPC) industry based on a

hybrid computing model where CPUs and GPUs work together

to solve computationally-intensive workloads. Now, with the

new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC

industry.

Kepler GK110 was designed from the ground up to maximize

computational performance and throughput computing with

outstanding power efficiency. The architecture has many new

innovations such as SMX, Dynamic Parallelism, and Hyper-Q

that make hybrid computing dramatically faster, easier to pro-

gram, and applicable to a broader set of applications. Kepler

GK110 GPUs will be used in numerous systems ranging from

workstations to supercomputers to address the most daunting

challenges in HPC.


The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.


Executive Overview

The High Performance Computing market's continuing need for

improved time-to-solution and the ability to explore expand-

ing models seems unquenchable, requiring ever-faster HPC

clusters. This has led many HPC users to implement graphics

processing units (GPUs) into their clusters. While GPUs have tra-

ditionally been used solely for visualization or animation, today

they serve as fully programmable, massively parallel proces-

sors, allowing computing tasks to be divided and concurrently

processed on the GPU’s many processing cores. When multiple

GPUs are integrated into an HPC cluster, the performance po-

tential of the HPC cluster is greatly enhanced. This processing

environment enables scientists and researchers to tackle some

of the world’s most challenging computational problems.

HPC applications modified to take advantage of the GPU processing capabilities can benefit from significant performance

gains over clusters implemented with traditional processors. To

obtain these results, HPC clusters with multiple GPUs require a

high-performance interconnect to handle the GPU-to-GPU com-

munications and optimize the overall performance potential of

the GPUs. Because the GPUs place significant demands on the

interconnect, it takes a high-performance interconnect, such as

InfiniBand, to provide the low latency, high message rate, and

bandwidth that are needed to enable all resources in the cluster

to run at peak performance.

Intel worked in concert with NVIDIA to optimize Intel TrueScale

InfiniBand with NVIDIA GPU technologies. This solution sup-

ports the full performance potential of NVIDIA GPUs through an

interface that is easy to deploy and maintain.

Key Points

• Up to 44 percent GPU performance improvement versus implementation without GPUDirect – a GPU computing product from NVIDIA that enables faster communication between the GPU and InfiniBand
• Intel TrueScale InfiniBand offers as much as 10 percent better GPU performance than other InfiniBand interconnects
• Ease of installation and maintenance – Intel's implementation offers a streamlined deployment approach that is significantly easier than alternatives

Ease of Deployment

One of the key challenges with deploying clusters consisting of multi-GPU nodes is to maximize application performance. Without GPUDirect, GPU-to-GPU communications would require the host CPU to make multiple memory copies to avoid a memory pinning conflict between the GPU and InfiniBand. Each additional CPU memory copy significantly reduces the performance potential of the GPUs.

Intel’s implementation of GPUDirect takes a streamlined ap-

proach to optimizing NVIDIA GPU performance with Intel TrueS-

cale Infi niBand. With Intel’s solution, a user only needs to up-

date the NVIDIA driver with code provided and tested by Intel.

Other Infi niBand implementations require the user to imple-

ment a Linux kernel patch as well as a special Infi niBand driver.

The Intel approach provides a much easier way to deploy, sup-

port, and maintain GPUs in a cluster without having to sacrifi ce

performance. In addition, it is completely compatible with other

GPUDirect implementations; the CUDA libraries and application

code require no changes.

Optimized Performance

Intel used AMBER molecular dynamics simulation software to test clustered GPU performance with and without GPUDirect. Figure 15 shows that there is a significant performance gain of up to 44 percent that results from streamlining the host memory access to support GPU-to-GPU communications.

Clustered GPU Performance

HPC applications that have been designed to take advantage of parallel GPU performance require a high-performance interconnect, such as InfiniBand, to maximize that performance. In addition, the implementation or architecture of the InfiniBand interconnect can impact performance. The two industry-leading InfiniBand implementations have very different architectures and only one was specifically designed for the HPC market – Intel's TrueScale InfiniBand. TrueScale InfiniBand provides unmatched performance benefits, especially as the GPU cluster is scaled. It offers high performance in all of the key areas that influence the performance of HPC applications, including GPU-based applications. These factors include the following:
• Scalable non-coalesced message rate performance greater than 25M messages per second
• Extremely low latency for MPI collectives, even on clusters consisting of thousands of nodes
• Consistently low latency of one to two µs, even at scale

These factors and the design of Intel TrueScale InfiniBand enable it to optimize the performance of NVIDIA GPUs.

Intel is a global leader and technology innovator in high performance networking, including adapters, switches and ASICs.

Performance with and without GPUDirect (Figure 15): AMBER Cellulose test on 8 GPUs, NS/day – TrueScale with GPUDirect is 44% better than the implementation without GPUDirect.

The following tests were performed on NVIDIA Tesla 2050s interconnected with Intel TrueScale QDR InfiniBand at Intel's NETtrack Developer Center. The Tesla 2050 results for the industry's other leading InfiniBand are from the published results on the AMBER benchmark site.

Figure 16 shows performance results from the AMBER Myoglobin benchmark (2,492 atoms) when scaling from two to eight Tesla 2050 GPUs. The results indicate that Intel TrueScale InfiniBand offers up to 10 percent more performance than the industry's other leading InfiniBand when both used their versions of GPUDirect. As the figure shows, the performance difference increases as the application is scaled to more GPUs.

The next test (Figure 17) shows the impact of the InfiniBand interconnect on the performance of AMBER across models of various sizes. The following Explicit Solvent models were tested:
• DHFR: 23,558 atoms
• FactorIX: 90,906 atoms
• Cellulose: 408,609 atoms

It is important to point out that the performance of the models

is dependent on the model size, the size of the GPU cluster, and

the performance of the InfiniBand interconnect. The smaller

the model, the more it is dependent on the interconnect due

to the fact that the model’s components (atoms in the case of

AMBER) are divided across the available GPUs in the cluster to

be processed for each step of the simulation. For example, the

DHFR test with its 23,558 atoms means that each Tesla 2050 in

an eight-GPU cluster is processing only 2,945 atoms for each

step of the simulation. The processing time is relatively small

when compared to the communication time. In contrast, the

Cellulose model with its 408K atoms requires each GPU to

process 17 times more data per step than the DHFR test, so

significantly more time is spent in GPU processing than in com-

munications.

The preceding tests demonstrate that the TrueScale InfiniBand

performs better under load. The DHFR model is the most sen-

sitive to the interconnect performance, and it indicates that

TrueScale offers six percent more performance than the alter-

native InfiniBand product. Combining the results from Figure
15 and Figure 16 illustrates that TrueScale InfiniBand provides

better results with smaller models on small clusters and better

model scalability for larger models on larger GPU clusters.

Performance/Watt Advantage

Today the focus is not just on performance, but how efficiently that performance can be delivered.

This is an area in which Intel TrueScale InfiniBand excels. The National Center for Supercomputing Applications (NCSA) has a cluster based on NVIDIA GPUs interconnected with TrueScale InfiniBand. This cluster is number three on the November 2010 Green500 list with performance of 933 MFlops/Watt.

This on its own is a significant accomplishment, but it is even
more impressive when considering its original position on the
TOP500 list. In fact, the cluster is ranked at #404

on the Top500 list, but the combination of NVIDIA’s GPU perfor-

mance, Intel’s TrueScale performance, and low power consump-

tion enabled the cluster to move up 401 spots from the Top500

list to reach number three on the Green500 list. This is the most

dramatic shift of any cluster in the top 50 of the Green500. In

part, the following are the reasons for such dramatic perfor-

mance/watt results:
• Performance of the NVIDIA Tesla 2050 GPU
• Linpack performance efficiency of this cluster is 49 percent, which is almost 20 percent better than most other NVIDIA GPU-based clusters on the TOP500 list
• The Intel TrueScale InfiniBand Adapter required 25-50 percent less power than the alternative InfiniBand product

Conclusion

The performance of the InfiniBand interconnect has a significant impact on the performance of GPU-based clusters. Intel's TrueScale InfiniBand is designed and architected for the HPC marketplace, and it offers an unmatched performance profile with a GPU-based cluster. Finally, Intel's solution provides an implementation that is easier to deploy and maintain, and allows for optimal performance in comparison to the industry's other leading InfiniBand.

GPU Scalable Performance with the Industry's Leading InfiniBands (Figure 16): AMBER Myoglobin test, NS/day, scaling across 2, 4, and 8 GPUs – TrueScale versus the other InfiniBand, with TrueScale's advantage growing with scale.

Explicit Solvent Benchmark Results for the Two Leading InfiniBands (Figure 17): 8 GPUs, NS/day, for the Cellulose, FactorIX, and DHFR models – TrueScale versus the other InfiniBand.


Intel Xeon Phi Coprocessor – The Architecture

Intel Many Integrated Core (Intel MIC) architecture combines many Intel CPU cores onto a single chip. Intel MIC architecture is targeted for highly parallel High Performance Computing (HPC) workloads in a variety of fields such as computational physics, chemistry, biology, and financial services. Today such workloads are run as task parallel programs on large compute clusters.



The Intel MIC architecture is aimed at achieving high through-

put performance in cluster environments where there are rigid

floor planning and power constraints. A key attribute of the mi-

croarchitecture is that it is built to provide a general-purpose

programming environment similar to the Intel Xeon processor

programming environment. The Intel Xeon Phi coprocessors

based on the Intel MIC architecture run a full service Linux op-

erating system, support x86 memory order model and IEEE 754

floating-point arithmetic, and are capable of running applica-

tions written in industry-standard programming languages

such as Fortran, C, and C++. The coprocessor is supported by a

rich development environment that includes compilers, numer-

ous libraries such as threading libraries and high performance

math libraries, performance characterizing and tuning tools,

and debuggers.

The Intel Xeon Phi coprocessor is connected to an Intel Xeon

processor, also known as the “host”, through a PCI Express (PCIe)

bus. Since the Intel Xeon Phi coprocessor runs a Linux operating

system, a virtualized TCP/IP stack could be implemented over

the PCIe bus, allowing the user to access the coprocessor as a

network node. Thus, any user can connect to the coprocessor

through a secure shell and directly run individual jobs or submit

batch jobs to it. The coprocessor also supports heterogeneous

applications wherein a part of the application executes on the

host while a part executes on the coprocessor.
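One common way to express such a host/coprocessor split was the Intel compiler's offload pragmas. The sketch below is illustrative only: the array names and sizes are invented for the example, and the exact clause syntax depends on the compiler version; the same loop could equally be run natively on the coprocessor after logging in over ssh.

/* Sketch of a heterogeneous split using the Intel compiler's offload
 * pragmas (Language Extensions for Offload). Clause syntax shown here is
 * a sketch of the era's Intel C/C++ compiler, not a complete recipe. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }

    /* The marked region executes on the Intel Xeon Phi coprocessor;
     * b and c are copied in and a is copied back out over PCIe. */
    #pragma offload target(mic) in(b, c) out(a)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 3.0f * b[i] + c[i];
    }

    /* The rest of the application continues on the host. */
    printf("a[42] = %f\n", a[42]);
    return 0;
}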

Multiple Intel Xeon Phi coprocessors can be installed in a single

host system. Within a single system, the coprocessors can com-

municate with each other through the PCIe peer-to-peer inter-

connect without any intervention from the host. Similarly, the

coprocessors can also communicate through a network card

such as InfiniBand or Ethernet, without any intervention from

the host.

Intel’s initial development cluster named “Endeavor”, which is

composed of 140 nodes of prototype Intel Xeon Phi coproces-

sors, was ranked 150 in the TOP500 supercomputers in the world

based on its Linpack scores. Based on its power consumption,

this cluster was as good if not better than other heterogeneous

systems in the TOP500.

These results on unoptimized prototype systems demonstrate

that high levels of performance efficiency can be achieved on

compute-dense workloads without the need for a new pro-

gramming language or APIs.

The Intel Xeon Phi coprocessor is primarily composed of pro-

cessing cores, caches, memory controllers, PCIe client logic, and

a very high bandwidth, bidirectional ring interconnect. Each

core comes complete with a private L2 cache that is kept fully

coherent by a global-distributed tag directory. The memory

controllers and the PCIe client logic provide a direct interface

to the GDDR5 memory on the coprocessor and the PCIe bus, re-

spectively. All these components are connected together by the

ring interconnect.

Each core in the Intel Xeon Phi coprocessor is designed to be

power efficient while providing a high throughput for highly

parallel workloads. A closer look reveals that the core uses a

short in-order pipeline and is capable of supporting 4 threads

The first generation Intel Xeon Phi product codenamed “Knights Corner”

Linpack performance and power of Intel’s cluster

Microarchitecture

Intel Xeon Phi Coprocessor Core


in hardware. It is estimated that the cost to support IA architec-

ture legacy is a mere 2% of the area costs of the core and is even

less at a full chip or product level. Thus the cost of bringing the

Intel Architecture legacy capability to the market is very mar-

ginal.

Vector Processing Unit

An important component of the Intel Xeon Phi coprocessor's

core is its vector processing unit (VPU). The VPU features a nov-

el 512-bit SIMD instruction set, officially known as Intel Initial

Many Core Instructions (Intel IMCI). Thus, the VPU can execute

16 single-precision (SP) or 8 double-precision (DP) operations

per cycle. The VPU also supports Fused Multiply-Add (FMA) in-

structions and hence can execute 32 SP or 16 DP floating point

operations per cycle. It also provides support for integers.

Vector units are very power efficient for HPC workloads. A single

operation can encode a great deal of work and does not incur

energy costs associated with fetching, decoding, and retiring

many instructions. However, several improvements were re-

quired to support such wide SIMD instructions. For example, a

mask register was added to the VPU to allow per lane predicated

execution. This helped in vectorizing short conditional branch-

es, thereby improving the overall software pipelining efficiency.

The VPU also supports gather and scatter instructions, which

are simply non-unit stride vector memory accesses, directly in

hardware. Thus for codes with sporadic or irregular access pat-

terns, vector scatter and gather instructions help in keeping the

code vectorized.

The VPU also features an Extended Math Unit (EMU) that can

execute transcendental operations such as reciprocal, square

root, and log, thereby allowing these operations to be executed

in a vector fashion with high bandwidth. The EMU operates by

calculating polynomial approximations of these functions.
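The loop below is a sketch of the code shape the VPU is built for: one fused multiply-add per element, so a single 512-bit vector instruction covers 16 single-precision lanes, and a short conditional the compiler can express as a per-lane masked operation rather than a branch. The function name and clipping rule are illustrative; the vectorization itself is left to the compiler.

/* Ordinary C loop of the kind the VPU targets: an FMA per element and a
 * short conditional that maps to per-lane predicated (masked) execution. */
void saxpy_clip(int n, float k, const float *restrict x,
                const float *restrict y, float *restrict out)
{
    for (int i = 0; i < n; i++) {
        float v = k * x[i] + y[i];   /* maps to a fused multiply-add per lane */
        if (v < 0.0f)                /* short branch -> masked operation      */
            v = 0.0f;
        out[i] = v;
    }
}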

The Interconnect

The interconnect is implemented as a bidirectional ring. Each

direction is comprised of three independent rings. The first,

largest, and most expensive of these is the data block ring. The

data block ring is 64 bytes wide to support the high bandwidth

requirement due to the large number of cores. The address ring

is much smaller and is used to send read/write commands and

memory addresses. Finally, the smallest ring and the least ex-

pensive ring is the acknowledgement ring, which sends flow

control and coherence messages.

When a core accesses its L2 cache and misses, an address re-

quest is sent on the address ring to the tag directories. The

memory addresses are uniformly distributed amongst the tag

directories on the ring to provide a smooth traffic characteristic

on the ring. If the requested data block is found in another core’s

L2 cache, a forwarding request is sent to that core’s L2 over the

address ring and the request block is subsequently forwarded

on the data block ring. If the requested data is not found in any

caches, a memory address is sent from the tag directory to the

memory controller.

The figure below shows the distribution of the memory con-

trollers on the bidirectional ring. The memory controllers are

symmetrically interleaved around the ring. There is an all-to-all

mapping from the tag directories to the memory controllers. The

addresses are evenly distributed across the memory controllers,

thereby eliminating hotspots and providing a uniform access

pattern which is essential for a good bandwidth response.

During a memory access, whenever an L2 cache miss occurs on a

core, the core generates an address request on the address ring

and queries the tag directories. If the data is not found in the

tag directories, the core generates another address request and

queries the memory for the data. Once the memory controller

fetches the data block from memory, it is returned back to the

core over the data ring. Thus during this process, one data block,

two address requests (and by protocol, two acknowledgement

messages) are transmitted over the rings. Since the data block

rings are the most expensive and are designed to support the


required data bandwidth, we need to increase the number of

less expensive address and acknowledgement rings by a factor

of two to match the increased bandwidth requirement caused

by the higher number of requests on these rings.

Multi-Threaded Streams Triad

The figure below shows the core scaling results for the multi-

threaded streams triad workload. These results were generated

on a simulator for a prototype of the Intel Xeon Phi coproces-

sor with only one address ring and one acknowledgement ring

per direction in its interconnect. The results indicate that in this

case the address and acknowledgement rings would become

performance bottlenecks and would exhibit poor scalability

beyond 32 cores.

The production grade Intel Xeon Phi coprocessor uses two ad-

dress and two acknowledgement rings per direction and pro-

vides a good performance scaling up to 50 cores and beyond. It

is evident from the figure that the addition of the rings results in

an over 40% aggregate bandwidth improvement.

Streaming Stores

Streaming stores were another key innovation employed to further boost memory bandwidth. The pseudo code for the Streams Triad kernel is shown below:

Streams Triad

for (I=0; I<HUGE; I++)

A[I] = k*B[I] + C[I];

The stream triad kernel reads two arrays, B and C, from memory
and writes a single array, A. Historically, a core reads a cache

line before it writes the addressed data. Hence there is an ad-

ditional read overhead associated with the write. A streaming

store instruction allows the cores to write an entire cache line

without reading it first. This reduces the number of bytes trans-

ferred per iteration from 256 bytes (read A, B, and C plus write A – four 64-byte cache lines) to 192 bytes (read B and C plus write A – three cache lines).

Streaming Stores

The figure below shows the core scaling results of the streams triad

workload with streaming stores. As is evident from the results,

streaming stores provide a 30% improvement over previous re-

sults. In total, by adding two rings per direction and imple-

menting streaming stores we are able to improve bandwidth by

more than a factor of 2 for multithreaded streams triad.


                                        Without Streaming Stores   With Streaming Stores
Behavior                                Read A, B, C; write A      Read B, C; write A
Bytes transferred to/from memory
per iteration                           256                        192

Interleaved Memory Access

Interconnect: 2x AD/AK

Multi-threaded Triad – Saturation for 1 AD/AK Ring

Multi-threaded Triad – Benefit of Doubling AD/AK



Other Design Features

Other micro-architectural optimizations incorporated into

the Intel Xeon Phi coprocessor include a 64-entry second-level

Translation Lookaside Buffer (TLB), simultaneous data cache

loads and stores, and 512KB L2 caches. Lastly, the Intel Xeon

Phi coprocessor implements a 16 stream hardware prefetcher

to improve cache hits and provide higher bandwidth. The

figure below shows the net performance improvements for the

SPECfp 2006 benchmark suite for a single core, single thread

runs. The results indicate an average improvement of over 80%

per cycle not including frequency.

Caches

The Intel MIC architecture invests more heavily in L1 and L2

caches compared to GPU architectures. The Intel Xeon Phi co-

processor implements a leading-edge, very high bandwidth

memory subsystem. Each core is equipped with a 32KB L1 in-

struction cache and 32KB L1 data cache and a 512KB unified L2

cache. These caches are fully coherent and implement the x86

memory order model. The L1 and L2 caches provide an aggre-

gate bandwidth that is approximately 15 and 7 times, respec-

tively, faster compared to the aggregate memory bandwidth.

Hence, effective utilization of the caches is key to achieving

peak performance on the Intel Xeon Phi coprocessor. In addi-

tion to improving bandwidth, the caches are also more energy

efficient for supplying data to the cores than memory. The fig-

ure below shows the energy consumed per byte of data trans-

ferred from the memory and the L1 and L2 caches. In the exascale

compute era, caches will play a crucial role in achieving real per-

formance under restrictive power constraints.

Stencils

Stencils are common in physics simulations and are classic

examples of workloads which show a large performance gain

through efficient use of caches.

Stencils are typically employed in simulation of physical sys-

tems to study the behavior of the system over time. When these

workloads are not programmed to be cache-blocked, they will

be bound by memory bandwidth. Cache blocking promises

substantial performance gains given the increased bandwidth

and energy efficiency of the caches compared to memory.

Cache blocking improves performance by blocking the physical

structure or the physical system such that the blocked data fits

well into a core’s L1 and or L2 caches. For example, during each

time-step, the same core can process the data which is already

resident in the L2 cache from the last time step, and hence does

not need to be fetched from the memory, thereby improving

performance. Additionally, the cache coherence further aids

the stencil operation by automatically fetching the updated

data from the nearest neighboring blocks which are resident in

the L2 caches of other cores. Thus, stencils clearly demonstrate

the benefits of efficient cache utilization and coherence in HPC

workloads.
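The sketch below illustrates the blocking idea for a simple 2D 5-point stencil: the sweep is split into tiles small enough to stay resident in a core's caches while they are processed. The grid and tile sizes are illustrative and would in practice be tuned to the 32KB L1 and 512KB L2 described above.

/* Sketch: spatial cache blocking for a 2D 5-point stencil. Each tile is
 * small enough for a core's caches, so the neighbouring rows it touches
 * stay resident while the tile is processed. Sizes are illustrative. */
#define NX 4096
#define NY 4096
#define TX 64            /* tile width  */
#define TY 64            /* tile height */

void stencil_sweep(const float in[NY][NX], float out[NY][NX])
{
    for (int jb = 1; jb < NY - 1; jb += TY)                 /* loop over tiles   */
        for (int ib = 1; ib < NX - 1; ib += TX) {
            int jmax = (jb + TY < NY - 1) ? jb + TY : NY - 1;
            int imax = (ib + TX < NX - 1) ? ib + TX : NX - 1;
            for (int j = jb; j < jmax; j++)                 /* work inside a tile */
                for (int i = ib; i < imax; i++)
                    out[j][i] = 0.25f * (in[j][i - 1] + in[j][i + 1]
                                       + in[j - 1][i] + in[j + 1][i]);
        }
}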

Multi-threaded Triad – with Streaming Stores
Stencils Example

Caches – For or Against?

Per-Core ST Performance Improvement (per cycle)



Power Management

Intel Xeon Phi coprocessors are not suitable for all workloads.

In some cases, it is beneficial to run the workloads only on the

host. In such situations where the coprocessor is not being

used, it is necessary to put the coprocessor in a power-saving

mode. The figure above shows all the components of the Intel

Xeon Phi coprocessor in a running state. To conserve power, as

soon as all four threads on a core are halted, the clock to the

core is gated. Once the clock has been gated for some program-

mable time, the core power gates itself, thereby eliminating any

leakage.

At any point, any number of the cores can be powered down or

powered up. Additionally, when all the cores are power gated

and the uncore detects no activity, the tag directories, the inter-

connect, L2 caches and the memory controllers are clock gated.

At this point, the host driver can put the coprocessor into a

deeper sleep or an idle state, wherein all the uncore is power

gated, the GDDR is put into a self-refresh mode, the PCIe logic is

put in a wait state for a wakeup and the GDDR-IO is consuming

very little power. These power management techniques help

conserve power and make Intel Xeon Phi coprocessor an excel-

lent candidate for data centers.

Power Management: All On and Running

Power Gate Core

Package Auto C3
Clock Gate Core


Intel Xeon Phi Coprocessor 3120P and 3120A (6GB, 1.100 GHz, 57 cores)

                                              3120P         3120A
Status                                        Launched      Launched
Number of Cores                               57            57
Clock Speed                                   1.1 GHz       1.1 GHz
L2 Cache                                      28.5 MB       28.5 MB
Instruction Set                               64-bit        64-bit
Instruction Set Extensions                    IMCI          IMCI
Embedded Options Available                    No            No
Lithography                                   22 nm         22 nm
Max TDP                                       300 W         300 W
Memory Specifications
  Max Memory Size (dependent on memory type)  6 GB          6 GB
  Number of Memory Channels                   12            12
  Max Memory Bandwidth                        240 GB/s      240 GB/s
  ECC Memory Supported                        Yes           Yes
Expansion Options
  PCI Express Revision                        2.0           2.0

Intel Xeon Phi Coprocessor 5120D and 5110P (8GB, 1.053 GHz, 60 cores)

                                              5120D         5110P
Status                                        Launched      Launched
Number of Cores                               60            60
Clock Speed                                   1.053 GHz     1.053 GHz
L2 Cache                                      30 MB         30 MB
Instruction Set                               64-bit        64-bit
Instruction Set Extensions                    IMCI          IMCI
Embedded Options Available                    No            No
Lithography                                   22 nm         22 nm
Max TDP                                       245 W         225 W
Memory Specifications
  Max Memory Size (dependent on memory type)  8 GB          8 GB
  Number of Memory Channels                   16            16
  Max Memory Bandwidth                        352 GB/s      320 GB/s
  ECC Memory Supported                        Yes           Yes
Expansion Options
  PCI Express Revision                        2.0           2.0

Intel Xeon Phi Coprocessor 7120X, 7120P, 7120D, and 7120A (16GB, 1.238 GHz, 61 cores)

                                              7120X       7120P       7120D       7120A
Status                                        Launched    Launched    Launched    Launched
Number of Cores                               61          61          61          61
Clock Speed                                   1.238 GHz   1.238 GHz   1.238 GHz   1.238 GHz
L2 Cache                                      30.5 MB     30.5 MB     30.5 MB     30.5 MB
Instruction Set                               64-bit      64-bit      64-bit      64-bit
Instruction Set Extensions                    IMCI        IMCI        IMCI        IMCI
Embedded Options Available                    No          No          No          No
Lithography                                   22 nm       22 nm       22 nm       22 nm
Max TDP                                       300 W       300 W       270 W       300 W
Memory Specifications
  Max Memory Size (dependent on memory type)  16 GB       16 GB       16 GB       16 GB
  Number of Memory Channels                   16          16          16          16
  Max Memory Bandwidth                        352 GB/s    352 GB/s    352 GB/s    352 GB/s
  ECC Memory Supported                        Yes         Yes         Yes         Yes
Expansion Options
  PCI Express Revision                        2.0         2.0         2.0         2.0


InfiniBand – High-Speed Interconnects

InfiniBand (IB) is an efficient I/O technology that provides high-speed data transfers and ultra-low latencies for computing and storage over a highly reliable and scalable single fabric. The InfiniBand industry standard ecosystem creates cost-effective hardware and software solutions that easily scale from generation to generation. InfiniBand is a high-bandwidth, low-latency network interconnect solution that has gained tremendous market share in the High Performance Computing (HPC) cluster community. InfiniBand was designed to take the place of today's data center networking technology. In the late 1990s, a group of next generation I/O architects formed an open, community-driven network technology to provide scalability and stability based on successes from other network designs. Today, InfiniBand is a popular and widely used I/O fabric among customers within the Top500 supercomputers: major Universities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic, Reservoir, Modeling Applications); Computer Aided Design and Engineering; Enterprise Oracle; and Financial Applications.



InfiniBand was designed to meet the evolving needs of the

high performance computing market. Computational science

depends on InfiniBand to deliver:

• High Bandwidth: Supports host connectivity of 10Gbps with Single Data Rate (SDR), 20Gbps with Double Data Rate (DDR), and 40Gbps with Quad Data Rate (QDR), all while offering an 80Gbps switch for link switching
• Low Latency: Accelerates the performance of HPC and enterprise computing applications by providing ultra-low latencies
• Superior Cluster Scaling: Point-to-point latency remains low as node and core counts scale – 1.2 µs. Highest real message rate per adapter: each PCIe x16 adapter drives 26 million messages per second. Excellent communications/computation overlap among nodes in a cluster
• High Efficiency: InfiniBand allows reliable protocols like Remote Direct Memory Access (RDMA) communication to occur between interconnected hosts, thereby increasing efficiency
• Fabric Consolidation and Energy Savings: InfiniBand can consolidate networking, clustering, and storage data over a single fabric, which significantly lowers overall power, real estate, and management overhead in data centers. Enhanced Quality of Service (QoS) capabilities support running and managing multiple workloads and traffic classes
• Data Integrity and Reliability: InfiniBand provides the highest levels of data integrity by performing Cyclic Redundancy Checks (CRCs) at each fabric hop and end-to-end across the fabric to avoid data corruption. To meet the needs of mission critical applications and high levels of availability, InfiniBand provides fully redundant and lossless I/O fabrics with automatic failover paths and link layer multi-paths


Components of the InfiniBand Fabric

InfiniBand is a point-to-point, switched I/O fabric architecture.

Point-to-point means that each communication link extends be-

tween only two devices. Both devices at each end of a link have

full and exclusive access to the communication path. To go be-

yond a point and traverse the network, switches come into play.

By adding switches, multiple points can be interconnected to

create a fabric. As more switches are added to a network, ag-

gregated bandwidth of the fabric increases. By adding multiple

paths between devices, switches also provide a greater level of

redundancy.

The InfiniBand fabric has four primary components, which are

explained in the following sections:

• Host Channel Adapter
• Target Channel Adapter
• Switch
• Subnet Manager

Host Channel Adapter

This adapter is an interface that resides within a server and com-

municates directly with the server’s memory and processor as

well as the InfiniBand Architecture (IBA) fabric. The adapter guar-

antees delivery of data, performs advanced memory access, and

can recover from transmission errors. Host channel adapters

can communicate with a target channel adapter or a switch. A

host channel adapter can be a standalone InfiniBand card or

it can be integrated on a system motherboard. Intel TrueScale

InfiniBand host channel adapters outperform the competition

with the industry’s highest message rate. Combined with the

lowest MPI latency and highest effective bandwidth, Intel host

channel adapters enable MPI and TCP applications to scale to

thousands of nodes with unprecedented price performance.

Target Channel Adapter

This adapter enables I/O devices, such as disk or tape storage,

to be located within the network independent of a host com-

puter. Target channel adapters include an I/O controller that

is specific to its particular device’s protocol (for example, SCSI,

Fibre Channel (FC), or Ethernet). Target channel adapters can

communicate with a host channel adapter or a switch.

Switch

An InfiniBand switch allows many host channel adapters and

target channel adapters to connect to it and handles network

traffic. The switch looks at the “local route header” on each

packet of data that passes through it and forwards it to the

appropriate location. The switch is a critical component of

the InfiniBand implementation that offers higher availability,

higher aggregate bandwidth, load balancing, data mirroring,

and much more. A group of switches is referred to as a Fabric. If

a host computer is down, the switch still continues to operate.

The switch also frees up servers and other devices by handling

network traffic.


Figure 1: Typical InfiniBand High Performance Cluster – Host Channel Adapter, InfiniBand Switch (with Subnet Manager), Target Channel Adapter


Top 10 Reasons to Use Intel TrueScale InfiniBand

1. Predictable Low Latency Under Load – Less Than 1.0 µs. TrueScale is designed to make the most of multi-core nodes by providing ultra-low latency and significant message rate scalability. As additional compute resources are added to the Intel TrueScale InfiniBand solution, latency and message rates scale linearly. HPC applications can be scaled without having to worry about diminished utilization of compute resources.

2. Quad Data Rate (QDR) Performance. The Intel 12000 switch family runs at lane speeds of 10Gbps, providing a full bisectional bandwidth of 40Gbps (QDR). In addition, the 12000 switch has the unique capability of riding through periods of congestion with features such as deterministically low latency. The Intel family of TrueScale 12000 products offers the lowest latency of any IB switch and high performance transfers with the industry's most robust signal integrity.

3. Flexible QoS Maximizes Bandwidth Use. The Intel 12800 advanced design is based on an architecture that provides comprehensive virtual fabric partitioning capabilities that enable the IB fabric to support the evolving requirements of an organization.

4. Unmatched Scalability – 18 to 864 Ports per Switch. Intel offers the broadest portfolio (five chassis and two edge switches) from 18 to 864 TrueScale InfiniBand ports, allowing customers to buy switches that match their connectivity, space, and power requirements.

5. Highly Reliable and Available. Reliability and Serviceability (RAS) that is proven in the most demanding Top500 and enterprise environments is designed into Intel's 12000 series with hot-swappable components, redundant components, customer-replaceable units, and non-disruptive code load.

6. Lowest Per-Port Power and Cooling Requirements. The TrueScale 12000 offers the lowest power consumption and the highest port density – 864 total TrueScale InfiniBand ports in a single chassis makes it unmatched in the industry. This results in delivering the lowest power per port for a director switch (7.8 watts per port) and the lowest power per port for an edge switch (3.3 watts per port).

7. Easy to Install and Manage. Intel installation, configuration, and monitoring wizards reduce time-to-ready. The Intel InfiniBand Fabric Suite (IFS) assists in diagnosing problems in the fabric. Non-disruptive firmware upgrades provide maximum availability and operational simplicity.

8. Protects Existing InfiniBand Investments. Seamless virtual I/O integration at the Operating System (OS) and application levels that matches standard network interface card and adapter semantics with no OS or application changes – they just work. Additionally, the TrueScale family of products is compliant with the InfiniBand Trade Association (IBTA) open specification, so Intel products inter-operate with any IBTA-compliant InfiniBand vendor. Being IBTA-compliant makes the Intel 12000 family of switches ideal for network consolidation, for sharing and scaling I/O pools across servers, and for pooling and sharing I/O resources between servers.

9. Modular Configuration Flexibility. The Intel 12000 series switches offer configuration and scalability flexibility that meets the requirements of either a high-density or high-performance compute grid by offering port modules that address both needs. Units can be populated with Ultra High Density (UHD) leafs for maximum connectivity or Ultra High Performance (UHP) leafs for maximum performance. The highly scalable 24-port leaf modules support configurations between 18 and 864 ports, providing the right size to start and the capability to grow as your grid grows.

10. Option to Gateway to Ethernet and Fibre Channel Networks. Intel offers multiple options to enable hosts on InfiniBand fabrics to transparently access Fibre Channel based storage area networks (SANs) or Ethernet based local area networks (LANs).

Intel TrueScale architecture and the resulting family of products delivers the promise of InfiniBand to the enterprise today.


“Effective fabric management has become the

most important factor in maximizing performance

in an HPC cluster. With IFS 7.0, Intel has addressed

all of the major fabric management issues in a prod-

uct that in many ways goes beyond what others are

offering.”

Michael Wirth, HPC Presales Specialist


Intel MPI Library 4.0 Performance

Introduction

Intel's latest MPI release, Intel MPI 4.0, is now optimized to work

with Intel’s TrueScale Infi niBand adapter. Intel MPI 4.0 can now

directly call Intel’s TrueScale Performance Scale Messaging

(PSM) interface. The PSM interface is designed to optimize MPI

application performance. This means that organizations will be

able to achieve a signifi cant performance boost with a combi-

nation of Intel MPI 4.0 and Intel TrueScale Infi niBand.

Solution

Intel tuned and optimized its latest MPI release – Intel MPI Library 4.0 – to improve performance when used with Intel TrueScale InfiniBand. With MPI Library 4.0,

applications can make full use of High Performance Computing

(HPC) hardware, improving the overall performance of the ap-

plications on the clusters.

Intel MPI Library 4.0 Performance

Intel MPI Library 4.0 uses the high performance MPI-2 specification on multiple fabrics, which results in better performance for

applications on Intel architecture-based clusters. This library

enables quick delivery of maximum end user performance, even

if there are changes or upgrades to new interconnects – with-

out requiring major changes to the software or operating en-

vironment. This high-performance, message-passing interface

library develops applications that can run on multiple cluster

fabric interconnects chosen by the user at runtime.
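For reference, an MPI application of the shape sketched below runs unchanged over whichever fabric the library selects at runtime – for example the PSM path on TrueScale InfiniBand – because the program itself contains no fabric-specific code. The collective and values used here are illustrative.

/* Minimal MPI sketch: the same binary runs over whichever interconnect
 * the MPI library chooses at runtime; the application code is unchanged. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank, sum = 0.0;
    /* One collective: each rank contributes a value, all ranks get the sum. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks: %d, sum of ranks: %.0f\n", size, sum);

    MPI_Finalize();
    return 0;
}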

Testing

Intel used the Parallel Unstructured Maritime Aerodynamics

(PUMA) Benchmark program to test the performance of Intel

MPI Library versions 4.0 and 3.1 with Intel TrueScale InfiniBand.
The program analyses internal and external non-reacting compressible flows over arbitrarily complex 3D geometries. PUMA

is written in ANSI C, and uses MPI libraries for message passing.

Results

The test showed that MPI performance can improve by more

than 35 percent using Intel MPI Library 4.0 with Intel TrueScale
InfiniBand, compared to using Intel MPI Library 3.1.

Intel TrueScale InfiniBand Adapters

Intel TrueScale InfiniBand Adapters offer scalable performance,

reliability, low power, and superior application performance.

These adapters ensure superior performance of HPC applica-

tions by delivering the highest message rate for multicore

compute nodes, the lowest scalable latency, large node count

clusters, the highest overall bandwidth on PCI Express Gen1

platforms, and superior power efficiency.

Purpose: Compare the performance of Intel MPI Library 4.0 and 3.1 with Intel InfiniBand
Benchmark: PUMA Flow
Cluster: Intel/IBM iDataPlex™ cluster
System Configuration: The NETtrack IBM Q-Blue Cluster/iDataPlex nodes were configured as follows:

Processor                            Intel Xeon® CPU X5570 @ 2.93 GHz
Memory                               24GB (6x4GB) @ 1333MHz (DDR3)
QDR InfiniBand Switch                Intel Model 12300 / firmware version 6.0.0.0.33
QDR InfiniBand Host Channel Adapter  Intel QLE7340, software stack 5.1.0.0.49
Operating System                     Red Hat® Enterprise Linux® Server release 5.3
Kernel                               2.6.18-128.el5
File System                          IFS Mounted

Typical InfiniBand High Performance Cluster

Figure 3: Elapsed time (lower is better) versus number of cores (16, 32, 64) for Intel MPI Library v3.1 and v4.0 – a 35% improvement with MPI v4.0.


The Intel TrueScale 12000 family of Multi-Protocol Fab-

ric Directors is the most highly integrated cluster com-

puting interconnect solution available. An ideal solution
for HPC, database clustering, and grid utility computing

applications, the 12000 Fabric Directors maximize cluster

and grid computing interconnect performance while sim-

plifying and reducing the cost of operating a data center.

Subnet Manager

The subnet manager is an application responsible for config-

uring the local subnet and ensuring its continued operation.

Configuration responsibilities include managing switch setup

and reconfiguring the subnet if a link goes down or a new one

is added.

How High Performance Computing Helps Vertical Applications

Enterprises that want to do high performance computing must

balance the following scalability metrics as the core size and

the number of cores per node increase:

• Latency and Message Rate Scalability: must allow near-linear growth in productivity as the number of compute cores is scaled.
• Power and Cooling Efficiency: as the cluster is scaled, power and cooling requirements must not become major concerns in today's world of energy shortages and high-cost energy.

InfiniBand offers the promise of low latency, high bandwidth,

and unmatched scalability demanded by high performance

computing applications. IB adapters and switches that perform

well on these key metrics allow enterprises to meet their high-

performance and MPI needs with optimal efficiency. The IB so-

lution allows enterprises to quickly achieve their compute and

business goals.


Vertical Market – Application Segment – InfiniBand Value
Oil & Gas – Mix of Independent Service Provider (ISP) and home-grown codes: reservoir modeling – Low latency, high bandwidth
Computer Aided Engineering (CAE) – Mostly Independent Software Vendor (ISV) codes: crash, air flow, and fluid flow simulations – High message rate, low latency, scalability
Government – Home-grown codes: labs, defense, weather, and a wide range of apps – High message rate, low latency, scalability, high bandwidth
Education – Home-grown and open source codes: a wide range of apps – High message rate, low latency, scalability, high bandwidth
Financial – Mix of ISP and home-grown codes: market simulation and trading floor – High performance IP, scalability, high bandwidth
Life and Materials Science – Mostly ISV codes: molecular simulation, computational chemistry, and biology apps – Low latency, high message rates


Intel Fabric Suite 7

Maximizing Investments in High Performance Computing

Around the world and across all industries, high performance

computing (HPC) is used to solve today’s most demanding com-

puting problems. As today’s high performance computing chal-

lenges grow in complexity and importance, it is vital that the

software tools used to install, configure, optimize, and manage

HPC fabrics also grow more powerful. Today’s HPC workloads

are too large and complex to be managed using software tools

that are not focused on the unique needs of HPC.

Designed specifically for HPC, Intel Fabric Suite 7 is a complete fabric management solution that maximizes the return on HPC investments by allowing users to achieve the highest levels of performance, efficiency, and ease of management from InfiniBand-connected HPC clusters of any size.

Highlights

� Intel FastFabric and Fabric Viewer integration with leading

third-party HPC cluster management suites

� Simple but powerful Fabric Viewer dashboard for monitor-

ing fabric performance

� Intel Fabric Manager integration with leading HPC work-

loadmanagement suites that combine virtual fabrics and

compute

� Quality of Service (QoS) levels that maximize fabric efficiency and application performance

� Smart, powerful software tools that make Intel TrueScale

Fabric solutions easy to install, configure, verify, optimize,

and manage

� Congestion control architecture that reduces the effects of

fabric congestion caused by low credit availability that can

result in head-of-line blocking

� Powerful fabric routing methods – including adaptive and

dispersive routing – that optimize traffic flows to avoid or

eliminate congestion, maximizing fabric throughput

� Intel Fabric Manager’s advanced design ensures fabrics of

all sizes and topologies – from fat-tree to mesh and torus – scale to support the most demanding HPC workloads

Superior Fabric Performance and Simplified Management are

Vital for HPC

As HPC clusters scale to take advantage of multi-core and GPU-

accelerated nodes attached to ever-larger and more complex

fabrics, simple but powerful management tools are vital for

maximizing return on HPC investments.

Intel Fabric Suite 7 provides the performance and management

tools for today’s demanding HPC cluster environments. As clus-

ters grow larger, management functions, from installation and

configuration to fabric verification and optimization, are vital

in ensuring that the interconnect fabric can support growing

workloads. Besides fabric deployment and monitoring, IFS opti-

mizes the performance of message passing applications – from

advanced routing algorithms to quality of service (QoS) – that

ensure all HPC resources are optimally utilized.

Scalable Fabric Performance

� Purpose-built for HPC, IFS is designed to make HPC clusters

faster, easier, and simpler to deploy, manage, and optimize

� The Intel TrueScale Fabric on-load host architecture exploits

the processing power of today’s faster multi-core proces-

sors for superior application scaling and performance

� Policy-driven vFabric virtual fabrics optimize HPC resource

utilization by prioritizing and isolating compute and stor-

age traffic flows

� Advanced fabric routing options – including adaptive and

dispersive routing – that distribute traffic across all potential links to improve the overall fabric performance, lower-

ing congestion to improve throughput and latency

Scalable Fabric Intelligence

� Routing intelligence scales linearly as Intel TrueScale Fabric

switches are added to the fabric

� Intel Fabric Manager can initialize fabrics having several

thousand nodes within seconds

� Advanced and optimized routing algorithms overcome the

limitations of typical subnet managers

� Smart management tools quickly detect and respond to

fabric changes, including isolating and correcting problems

that can result in unstable fabrics

Standards-based Foundation

Intel TrueScale Fabric solutions are compliant with all industry

software and hardware standards. Intel Fabric Suite (IFS) redefines IBTA management by coupling powerful management tools with intuitive user interfaces. InfiniBand fabrics built using IFS deliver the highest levels of fabric performance, efficiency, and management simplicity, allowing users to realize the full benefits from their HPC investments.

Major components of Intel Fabric Suite 7 include:

� FastFabric Toolset

� Fabric Viewer

� Fabric Manager

Intel FastFabric Toolset

Ensures rapid, error-free installation and configuration of Intel TrueScale Fabric switch, host, and management software tools. Guided by an intuitive interface, users can easily install, configure, validate, and optimize HPC fabrics.


Key features include:

� Automated host software installation and configuration

� Powerful fabric deployment, verification, analysis, and re-

porting tools for measuring connectivity, latency, and band-

width

� Automated chassis and switch firmware installation and

update

� Fabric route and error analysis tools

� Benchmarking and tuning tools

� Easy-to-use fabric health checking tools

Intel Fabric Viewer

A key component that provides an intuitive, Java-based user

interface with a “topology-down” view for fabric status and di-

agnostics, with the ability to drill down to the device layer to

identify and help correct errors. IFS 7 includes fabric dashboard,

a simple and intuitive user interface that presents vital fabric

performance statistics.

Key features include:

� Bandwidth and performance monitoring

� Device and device group level displays

� Definition of cluster-specific node groups so that displays

can be oriented toward end-user node types

� Support for user-defined hierarchy

� Easy-to-use virtual fabrics interface

Intel Fabric Manager

Provides comprehensive control of administrative functions

using a commercial-grade subnet manager. With advanced

routing algorithms, powerful diagnostic tools, and full subnet

manager failover, Fabric Manager simplifies subnet, fabric, and

individual component management, making even the largest

fabrics easy to deploy and optimize.

Key features include:

� Designed and optimized for large fabric support

� Integrated with both adaptive and dispersive routing sys-

tems

� Congestion control architecture (CCA)

� Robust failover of subnet management

� Path/route management

� Fabric/chassis management

� Fabric initialization in seconds, even for very large fabrics

� Performance and fabric error monitoring

Adaptive Routing

While most HPC fabrics are configured to have multiple paths between switches, standard InfiniBand switches may not be ca-

pable of taking advantage of them to reduce congestion. Adap-

tive routing monitors the performance of each possible path,

and automatically chooses the least congested route to the

destination node. Unlike other implementations that rely on a

purely subnet manager-based approach, the intelligent path se-

lection capabilities within Fabric Manager, a key part of IFS and

Intel 12000 series switches, scale as your fabric grows larger

and more complex.

Key adaptive routing features include:

� Highly scalable—adaptive routing intelligence scales as

the fabric grows

� Hundreds of real-time adjustments per second per switch

� Topology awareness through Intel Fabric Manager

� Supports all InfiniBand Trade Association* (IBTA*)-compliant

adapters

Dispersive Routing

One of the critical roles of fabric management is the initialization and configuration of routes through the fabric between each pair of nodes. Intel Fabric Manager supports a variety of routing methods, including defining alternate routes that disperse traffic flows for redundancy, performance, and load balancing. Instead of sending all packets to a destination on a single path, Intel dispersive routing distributes traffic across all possible paths. Once received, packets are reassembled in their proper order for rapid, efficient processing. By leveraging the entire fabric to deliver maximum communications performance for all jobs, dispersive routing ensures optimal fabric efficiency.

Key dispersive routing features include:

� Fabric routing algorithms that provide the widest separa-

tion of paths possible

� Fabric “hotspot” reductions to avoid fabric congestion

� Balances traffic across all potential routes

� May be combined with vFabrics and adaptive routing

� Very low latency for small messages and time sensitive con-

trol protocols

� Messages can be spread across multiple paths—Intel Per-

formance Scaled Messaging (PSM) ensures messages are

reassembled correctly

� Supports all leading MPI libraries

Mesh and Torus Topology Support

Fat-tree configurations are the most common topology used

in HPC cluster environments today, but other topologies are

gaining broader use as organizations try to create increasingly

larger, more cost-effective fabrics. Intel is leading the way with

full support of emerging mesh and torus topologies that can

help reduce networking costs as clusters scale to thousands of

nodes. IFS has been enhanced to support these emerging op-

tions—from failure handling that maximizes performance, to

even higher reliability for complex networking environments.


transtec HPC solutions excel through their easy management

and high usability, while maintaining high performance and

quality throughout the whole lifetime of the system. As clusters

scale, issues like congestion mitigation and Quality-of-Service

can make a big difference in whether the fabric performs up to

its full potential.

With the intelligent choice of Intel InfiniBand products, trans-

tec remains true to combining the best components together

to provide the full best-of-breed solution stack to the customer.

transtec HPC engineering experts are always available to fine-tune customers’ HPC cluster systems and InfiniBand fabrics to get the maximum performance while at the same time providing them with an easy-to-manage and easy-to-use HPC solution.


Numascale

Numascale’s NumaConnect technology enables computer system vendors to build scalable servers with the functionality of enterprise mainframes at the cost level of clusters. The technology unites all the processors, memory and IO resources in the system in a fully virtualized environment controlled by standard operating systems.

NumaConnect enables significant cost savings in three dimensions: resource utilization, system management and programmer productivity.

According to long time users of both large shared memory systems and clusters in environments with a variety of applications, the former provide a much higher degree of resource utilization due to the flexibility of all system resources.


Background

Numascale’s NumaConnect technology enables computer sys-

tem vendors to build scalable servers with the functionality of

enterprise mainframes at the cost level of clusters. The technol-

ogy unites all the processors, memory and IO resources in the

system in a fully virtualized environment controlled by stan-

dard operating systems.

Systems based on NumaConnect will efficiently support all

classes of applications using shared memory or message pass-

ing through all popular high level programming models. System

size can be scaled to 4k nodes where each node can contain

multiple processors. Memory size is limited by the 48-bit physi-

cal address range provided by the Opteron processors resulting

in a total system main memory of 256 TBytes.

At the heart of NumaConnect is NumaChip; a single chip that

combines the cache coherent shared memory control logic with

an on-chip 7 way switch. This eliminates the need for a separate,

central switch and enables linear capacity and cost scaling.

The continuing trend with multi-core processor chips is en-

abling more applications to take advantage of parallel process-

ing. NumaChip leverages the multi-core trend by enabling ap-

plications to scale seamlessly without the extra programming

effort required for cluster computing. All tasks can access all

memory and IO resources. This is of great value to users and the

ultimate way to virtualization of all system resources.

No other interconnect technology outside the high-end enter-

prise servers can offer this capability. All high speed intercon-

nects now use the same kind of physical interfaces resulting in

almost the same peak bandwidth. The differentiation is in la-

tency for the critical short transfers, functionality and software

compatibility. NumaConnect differentiates from all other in-

terconnects through the ability to provide unified access to all

resources in a system and utilize caching techniques to obtain

very low latency.

Key Facts:

� Scalable, directory based Cache Coherent Shared Memory

interconnect for Opteron

� Attaches to coherent HyperTransport (cHT) through HTX con-

nector, pick-up module or mounted directly on main board

� Configurable Remote Cache for each node

� Full 48 bit physical address space (256 Tbytes)

� Up to 4k (4096) nodes

� ≈1 microsecond MPI latency (ping-pong/2)

� On-chip, distributed switch fabric for 2 or 3 dimensional to-

rus topologies

Expanding the capabilities of multi-core processors

Semiconductor technology has reached a level where proces-

sor frequency can no longer be increased much due to power

consumption with corresponding heat dissipation and thermal

handling problems. Historically, processor frequency scaled at

approximately the same rate as transistor density and resulted

in performance improvements for almost all applications with

no extra programming efforts. Processor chips are now instead

being equipped with multiple processors on a single die. Utiliz-

ing the added capacity requires software that is prepared for

parallel processing. This is quite obviously simple for individual

and separated tasks that can be run independently, but is much

more complex for speeding up single tasks.

The complexity for speeding up a single task grows with the

logic distance between the resources needed to do the task, i.e.

the fewer resources that can be shared, the harder it is. Multi-

core processors share the main memory and some of the cache

levels, i.e. they are classified as Symmetrical Multi Processors (SMP). Modern processor chips are also equipped with signals

and logic that allow connecting to other processor chips still

maintaining the same logic sharing of memory. The practical

limit is at two to four processor sockets before the overheads

reduce performance scaling instead of increasing it. This is nor-

mally restricted to a single motherboard.

Currently, scaling beyond the single/dual SMP motherboards is

done through some form of network connection using Ether-

net or a higher speed interconnect like InfiniBand. This requires processes running on the different compute nodes to communi-

cate through explicit messages. With this model, programs that

need to be scaled beyond a small number of processors have

to be written in a more complex way where the data can no

longer be shared among all processes, but need to be explicitly

decomposed and transferred between the different processors’

memories when required.

NumaConnect uses a scalable approach to sharing all memory

based on distributed directories to store information about

shared memory locations. This means that programs can be

scaled beyond the limit of a single motherboard without any

changes to the programming principle. Any process running on


any processor in the system can use any part of the memory

regardless of whether the physical location of the memory is on a differ-

ent motherboard.

NumaConnect Value Proposition

NumaConnect enables significant cost savings in three dimen-

sions; resource utilization, system management and program-

mer productivity.

According to long time users of both large shared memory sys-

tems (SMPs) and clusters in environments with a variety of ap-

plications, the former provide a much higher degree of resource

utilization due to the flexibility of all system resources. They

indicate that large mainframe SMPs can easily be kept at more

than 90% utilization and that clusters seldom can reach more

than 60-70% in environments running a variety of jobs. Better

compute resource utilization also contributes to more efficient

use of the necessary infrastructure, with power consumption and cooling as the most prominent factors (accounting for approximately one third of the overall cost) and floor space as a secondary aspect.

Regarding system management, NumaChip can reduce the

number of individual operating system images significantly. In

a system with 100 Tflops computing power, the number of system images can be reduced from approximately 1,400 to 40, a reduction factor of 35. Even if each of those 40 OS images requires somewhat more resources for management than the 1,400 smaller ones, the overall savings are significant.

Parallel processing in a cluster requires explicit message pass-

ing programming whereas shared memory systems can utilize

compilers and other tools that are developed for multi-core pro-

cessors. Parallel programming is a complex task and programs

written for message passing normally contain 50%-100% more

code than programs written for shared memory processing.

Since all programs contain errors, the probability of errors in

message passing programs is 50%-100% higher than for shared

memory programs. A significant amount of software develop-

ment time is consumed by debugging errors further increasing

the time to complete development of an application.

In principle, servers are multi-tasking, multi-user machines that

are fully capable of running multiple applications at any given

time. Small servers are very cost-efficient measured by a peak

price/performance ratio because they are manufactured in very

high volumes and use many of the same components as desk-

side and desktop computers. However, these small to medium

sized servers are not very scalable. The most widely used config-

uration has 2 CPU sockets that hold from 4 to 16 CPU cores each.

They cannot be upgraded without changing to a different main

board that also normally requires a larger power supply and a

different chassis. In turn, this means that careful capacity plan-

ning is required to optimize cost and if compute requirements

increase, it may be necessary to replace the entire server with

a bigger and much more expensive one since the price increase

is far from linear.

NumaChip contains all the logic needed to build Scale-Up sys-

tems based on volume manufactured server components. This

drives the cost per CPU core down to the same level while offer-

ing the same capabilities as the mainframe type servers.

Where IT budgets are in focus the price difference is obvious

and NumaChip represents a compelling proposition to get main-

frame capabilities at the cost level of high-end cluster technol-

ogy. The expensive mainframes still include some features for

dynamic system reconfiguration that NumaChip systems will

not offer initially. Such features depend on operating system

software and can also be implemented in NumaChip-based

systems.

Technology

Multi-core processors and shared memory

Shared memory programming for multi-processing boosts pro-

grammer productivity since it is easier to handle than the alter-

native message passing paradigms. Shared memory programs

are supported by compiler tools and require less code than the

alternatives resulting in shorter development time and fewer

program bugs. The availability of multi-core processors on all

major platforms including desktops and laptops is driving more

programs to take advantage of the increased performance po-

tential.

NumaChip offers seamless scaling within the same program-

ming paradigm regardless of system size from a single proces-

sor chip to systems with more than 1,000 processor chips.

Other interconnect technologies that do not offer cc-NUMA

capabilities require that applications are written for message

passing, resulting in larger programs with more bugs and cor-

respondingly longer development time while systems built with

NumaChip can run any program efficiently.


Virtualization

The strong trend of virtualization is driven by the desire of

obtaining higher utilization of resources in the datacenter. In

short, it means that any application should be able to run on

any server in the datacenter so that each server can be better

utilized by combining more applications on each server dynami-

cally according to user loads.

Commodity server technology represents severe limitations in

reaching this goal. One major limitation is that the memory re-

quirements of any given application need to be satisfied by the

physical server that hosts the application at any given time. In

turn, this means that if any application in the datacenter shall

be dynamically executable on all of the servers at different

times, all of the servers must be configured with the amount of

memory required by the most demanding application, but only

the one running the app will actually use that memory. This is

where the mainframes excel since these have a flexible shared

memory architecture where any processor can use any portion

of the memory at any given time, so they only need to be configured to be able to handle the most demanding application in

one instance. NumaChip offers the exact same feature, by pro-

viding any application with access to the aggregate amount of

memory in the system. In addition, it also offers all applications

access to all I/O devices in the system through the standard vir-

tual view provided by the operating system.

The two distinctly different architectures of clusters and main-

frames are shown in Figure 1. In clusters, processes are loosely coupled through a network like Ethernet or InfiniBand. An ap-

plication that needs to utilize more processors or I/O than those

present in each server must be programmed to do so from the

beginning. In the mainframe, any application can use any re-

source in the system as a virtualized resource and the compiler

can generate threads to be executed on any processor.

In a system interconnected with NumaChip, all processors can

access all the memory and all the I/O resources in the system

in the same way as on a mainframe. NumaChip provides a fully

virtualized hardware environment with shared memory and I/O

and with the same ability as mainframes to utilize compiler gen-

erated parallel processes and threads.

Operating Systems

Systems based on NumaChip can run standard operating sys-

tems that handle shared memory multiprocessing. Examples of

such operating systems are Linux, Solaris and Windows Server.

Numascale provides a bootstrap loader that is invoked after

powerup and performs initialization of the system by setting

up node address routing tables. Initially, Numascale has tested and provides a bootstrap loader for Linux.

When the standard bootstrap loader is launched, the system

will appear as a large unified shared memory system.

Cache Coherent Shared Memory

The big differentiator for NumaConnect compared to other

high-speed interconnect technologies is the shared memo-

ry and cache coherency mechanisms. These features allow

programs to access any memory location and any memory

mapped I/O device in a multiprocessor system with high de-

gree of efficiency. It provides scalable systems with a unified

programming model that stays the same from the small multi-

core machines used in laptops and desktops to the largest

imaginable single system image machines that may contain

thousands of processors.

Figure 1: Clustered vs. Mainframe Architecture

There are a number of pros for shared memory machines that

lead experts to hold the architecture as the holy grail of com-

puting compared to clusters:

� Any processor can access any data location through direct

load and store operations - easier programming, less code to

write and debug

� Compilers can automatically exploit loop level parallelism –

higher efficiency with less human effort

� System administration relates to a unified system as op-

posed to a large number of separate images in a cluster – less

effort to maintain

� Resources can be mapped and used by any processor in the

system – optimal use of resources in a virtualized environment

� Process scheduling is synchronized through a single, real-

time clock - avoids serialization of scheduling associated

with asynchronous operating systems in a cluster and the

corresponding loss of efficiency

Scalability and Robustness

The initial design aimed at scaling to very large numbers of pro-

cessors with 64-bit physical address space with 16 bits for node

identifier and 48 bits of address within each node. The current

implementation for Opteron is limited by the global physical ad-

dress space of 48 bits, with 12 bits used to address 4,096 physi-

cal nodes for a total physical address range of 256 Terabytes.

A directory based cache coherence protocol was developed

to handle scaling with a significant number of nodes sharing data to avoid overloading the interconnect between nodes with coherency traffic, which would seriously reduce real data

throughput.

The basic ring topology with distributed switching allows a

number of different interconnect configurations that are more

scalable than most other interconnect switch fabrics. This also

eliminates the need for a centralized switch and includes inher-

ent redundancy for multidimensional topologies.

Functionality is included to manage robustness issues associ-

ated with high node counts and extremely high requirements

for data integrity with the ability to provide high availability for

systems managing critical data in transaction processing and

realtime control. All data that may exist in only one copy are ECC

protected with automatic scrubbing after detected single bit er-

rors and automatic background scrubbing to avoid accumula-

tion of single bit errors.

Integrated, distributed switching

NumaChip contains an on-chip switch to connect to other

nodes in a NumaChip based system, eliminating the need to use

a centralized switch. The on-chip switch can connect systems

in one, two or three dimensions. Small systems can use one,

medium-sized systems two, and large systems will use all three dimensions to provide efficient and scalable connectivity be-

tween processors.

The two- and three-dimensional topologies (called Torus) also

have the advantage of built-in redundancy as opposed to sys-

tems based on centralized switches, where the switch repre-

sents a single point of failure.

NumaChip System Architecture – Block Diagram of NumaChip

The distributed switching reduces the cost of the system since

there is no extra switch hardware to pay for. It also reduces the

amount of rack space required to hold the system as well as the

power consumption and heat dissipation from the switch hard-

ware and the associated power supply energy loss and cooling

requirements.

Redefining Scalable OpenMP and MPI

Shared Memory Advantages

Multi-processor shared memory processing has long been the

preferred method for creating and running technical computing

codes. Indeed, this computing model now extends from a user’s

dual core laptop to 16+ core servers. Programmers often add

parallel OpenMP directives to their programs in order to take

advantage of the extra cores on modern servers. This approach

is flexible and often preserves the “sequential” nature of the

program (pthreads can of course also be used, but OpenMP is

much easier to use). To extend programs beyond a single server,

however, users must use the Message Passing Interface (MPI) to

allow the program to operate across a high-speed interconnect.
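As a small illustration of the OpenMP style described above, the following sketch parallelizes a loop with a single directive. It is a hypothetical vector example, not taken from any code mentioned in this section, and merely shows how little the “sequential” structure of the program has to change:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    /* Initialize the input vectors. */
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* One directive spreads the loop across all available cores;
       the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f, max threads = %d\n",
           sum, omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-capable compiler (for example gcc -fopenmp), the same source runs unchanged on a laptop or on a large shared memory system.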

Interestingly, the advent of multi-core servers has created a par-

allel asymmetric computing model, where programs must map

themselves to networks of shared memory SMP servers. This

asymmetric model introduces two levels of communication,

local within a node, and distant to other nodes. Programmers

often create pure MPI programs that run across multiple cores

on multiple nodes. While a pure MPI program does represent

the greatest common denominator, better performance may be

sacrificed by not utilizing the local nature of multi-core nodes.

Hybrid (combined MPI/OpenMP) models have been able to pull

more performance from cluster hardware, but often introduce

programming complexity and may limit portability.

Clearly, users prefer writing software for large shared memory

systems to MPI programming. This preference becomes more

pronounced when large data sets are used. In a large SMP sys-

tem the data are simply used in place, whereas in a distributed

memory cluster the dataset must be partitioned across com-

pute nodes.

In summary, shared memory systems have a number of highly

desirable features that offer ease of use and cost reduction over

traditional distributed memory systems:

� Any processor can access any data location through direct

load and store operations, allowing easier programming

(less time and training) for end users, with less code to write

and debug.

� Compilers, such as those supporting OpenMP, can automati-

cally exploit loop level parallelism and create more efficient codes, increasing system throughput and improving resource utilization.

� System administration of a unifi ed system (as opposed to a

large number of separate images in a cluster) results in re-

duced effort and cost for system maintenance.

� Resources can be mapped and used by any processor in the

system, with optimal use of resources in a single image op-

erating system environment.

Shared Memory as a Universal Platform

Although the advantages of shared memory systems are clear,

the actual implementation of such systems “at scale” has been

difficult prior to the emergence of NumaConnect technology.

There have traditionally been limits to the size and cost of

shared memory SMP systems, and as a result the HPC commu-

nity has moved to distributed memory clusters that now scale

into the thousands of cores. Distributed memory programming


occurs within the MPI library, where explicit communication

pathways are established between processors (i.e., data is es-

sentially copied from machine to machine). A large number of

existing applications use MPI as a programming model.
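For reference, a minimal sketch of the explicit message passing style discussed here. The two-process exchange below is hypothetical and only illustrates how data must be copied between ranks rather than shared:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Data is explicitly copied from rank 0 to rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}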

Fortunately, MPI codes can run effectively on shared memory

systems. Optimizations have been built into many MPI versions

that recognize the availability of shared memory and avoid full

message protocols when communicating between processes.

Shared memory programming using OpenMP has been useful

on small-scale SMP systems such as commodity workstations

and servers. Providing large-scale shared memory environ-

ments for these codes, however, opens up a whole new world of

performance capabilities without the need for re-programming.

Using NumaConnect technology, scalable shared memory clus-

ters are capable of efficiently running both large-scale OpenMP and MPI codes without modification.

Record-Setting OpenMP Performance

In the HPC community NAS Parallel Benchmarks (NPB) have

been used to test the performance of parallel computers

(http://www.nas.nasa.gov/publications/npb.html). The bench-

marks are a small set of programs derived from Computational

Fluid Dynamics (CFD) applications that were designed to help

evaluate the performance of parallel supercomputers. Problem

sizes in NPB are predefined and indicated as different classes

(currently A through F, with F being the largest).

Reference implementations of NPB are available in commonly-

used programming models such as MPI and OpenMP, which

make them ideal for measuring the performance of both distrib-

uted memory and SMP systems. These benchmarks were com-

piled with Intel ifort version 14.0.0. (Note: the currently generated

code is slightly faster, but Numascale is working on NumaCon-

nect optimizations for the GNU compilers and thus suggests us-

ing gcc and gfortran for OpenMP applications.)

For the following tests, the NumaConnect Shared Memory

benchmark system has 1 TB of memory and 256 cores. It utilizes eight servers, each equipped with two AMD Opteron 6380 2.5 GHz CPUs (16 cores each) and 128 GB of memory.

Figure One shows the results for running NPB-SP (Scalar Penta-

diagonal solver) over a range of 16 to 121 cores using OpenMP

for the Class D problem size.

Figure Two shows results for the NPB-LU benchmark (Lower-

Upper Gauss-Seidel solver) over a range of 16 to 121 cores, using

OpenMP for the Class D problem size.

Figure Three shows the NAS-SP benchmark E-class scaling perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2). Results indicate that larger problems

scale better on NumaConnect systems, and it was noted that

NASA has never seen OpenMP E Class results with such a high

number of cores.

OpenMP applications cannot run on InfiniBand clusters without additional software layers and kernel modifications. The Numa-

Connect cluster runs a standard Linux kernel image.

Surprisingly Good MPI Performance

Despite the excellent OpenMP shared memory performance

that NumaConnect can deliver, applications have historically

been written using MPI. The performance of these applications

is presented below. As mentioned, the NumaConnect system

can easily run MPI applications. Figure Four is a comparison of


NumaConnect and FDR InfiniBand NPB-SP (Class D). The results indicate that NumaConnect performance is superior to that of a traditional distributed-memory InfiniBand cluster. MPI tests

were run with OpenMPI and gfortran 4.8.1 using the same hard-

ware mentioned above.

Both industry-standard OpenMPI and MPICH2 work in shared

memory mode. Numascale has implemented their own version

of the OpenMPI BTL (Byte Transfer Layer) to optimize the com-

munication by utilizing non-polluting store instructions. MPI

messages require data to be moved, and in a shared memory en-

vironment there is no reason to use standard instructions that

implicitly result in cache pollution and reduced performance.

This results in very efficient message passing and excellent MPI

performance.

Similar results are shown in Figure Five for the NAS-LU (Class

D). NumaConnect’s performance over InfiniBand may be one of

the more startling results for the NAS benchmarks. Recall again

that OpenMP applications cannot run on InfiniBand clusters without additional software layers and kernel modifications.

OpenMP NAS results for NPB-SP (Class E)
NPB-SP comparison of NumaConnect to FDR InfiniBand
NPB-LU comparison of NumaConnect to FDR InfiniBand


Glossary

ACML (“AMD Core Math Library“)

A software development library released by AMD. This library

provides useful mathematical routines optimized for AMD pro-

cessors. Originally developed in 2002 for use in high-performance

computing (HPC) and scientific computing, ACML allows nearly

optimal use of AMD Opteron processors in compute-intensive

applications. ACML consists of the following main components:

� A full implementation of Level 1, 2 and 3 Basic Linear Algebra

Subprograms (→ BLAS), with optimizations for AMD Opteron

processors.

� A full suite of Linear Algebra (→ LAPACK) routines.

� A comprehensive suite of Fast Fourier transform (FFTs) in sin-

gle-, double-, single-complex and double-complex data types.

� Fast scalar, vector, and array math transcendental library

routines.

� Random Number Generators in both single- and double-pre-

cision.

AMD offers pre-compiled binaries for Linux, Solaris, and Win-

dows available for download. Supported compilers include gfor-

tran, Intel Fortran Compiler, Microsoft Visual Studio, NAG, PathS-

cale, PGI compiler, and Sun Studio.

BLAS (“Basic Linear Algebra Subprograms“)

Routines that provide standard building blocks for performing

basic vector and matrix operations. The Level 1 BLAS perform

scalar, vector and vector-vector operations, the Level 2 BLAS

perform matrix-vector operations, and the Level 3 BLAS per-

form matrix-matrix operations. Because the BLAS are efficient,

portable, and widely available, they are commonly used in the

development of high quality linear algebra software, e.g. → LA-

PACK. Although a model Fortran implementation of the BLAS is available from netlib in the BLAS library, it is not expected to

perform as well as a specially tuned implementation on most

high-performance computers – on some machines it may give

much worse performance – but it allows users to run → LAPACK

software on machines that do not offer any other implementa-

tion of the BLAS.
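A minimal sketch of calling Level 1 BLAS routines through the C interface, assuming a CBLAS implementation (the netlib reference or a vendor-tuned library) is installed and linked; the vectors are arbitrary illustration values:

#include <stdio.h>
#include <cblas.h>   /* C interface to the BLAS */

int main(void)
{
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};

    /* Level 1 BLAS: y = 2.0 * x + y (daxpy), stride 1 for both vectors. */
    cblas_daxpy(3, 2.0, x, 1, y, 1);

    /* Level 1 BLAS: dot product of x and y. */
    double d = cblas_ddot(3, x, 1, y, 1);

    printf("y = [%g %g %g], dot = %g\n", y[0], y[1], y[2], d);
    return 0;
}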

Cg (“C for Graphics”)

A high-level shading language developed by Nvidia in close col-

laboration with Microsoft for programming vertex and pixel

shaders. It is very similar to Microsoft’s → HLSL. Cg is based on

the C programming language and although they share the same

syntax, some features of C were modified and new data types

were added to make Cg more suitable for programming graph-

ics processing units. This language is only suitable for GPU pro-

gramming and is not a general programming language. The Cg

compiler outputs DirectX or OpenGL shader programs.

CISC (“complex instruction-set computer”)

A computer instruction set architecture (ISA) in which each in-

struction can execute several low-level operations, such as

a load from memory, an arithmetic operation, and a memory

store, all in a single instruction. The term was retroactively

coined in contrast to reduced instruction set computer (RISC).

The terms RISC and CISC have become less meaningful with the

continued evolution of both CISC and RISC designs and imple-

mentations, with modern processors also decoding and split-

ting more complex instructions into a series of smaller internal

micro-operations that can thereby be executed in a pipelined

fashion, thus achieving high performance on a much larger sub-

set of instructions.


cluster

Aggregation of several, mostly identical or similar systems to

a group, working in parallel on a problem. Previously known

as Beowulf Clusters, HPC clusters are composed of commodity

hardware, and are scalable in design. The more machines are

added to the cluster, the more performance can in principle be

achieved.

control protocol

Part of the → parallel NFS standard

CUDA driver API

Part of → CUDA

CUDA SDK

Part of → CUDA

CUDA toolkit

Part of → CUDA

CUDA (“Compute Uniform Device Architecture”)

A parallel computing architecture developed by NVIDIA. CUDA

is the computing engine in NVIDIA graphics processing units

or GPUs that is accessible to software developers through in-

dustry standard programming languages. Programmers use “C

for CUDA” (C with NVIDIA extensions), compiled through a Path-

Scale Open64 C compiler, to code algorithms for execution on

the GPU. CUDA architecture supports a range of computational

interfaces including → OpenCL and → DirectCompute. Third

party wrappers are also available for Python, Fortran, Java and

Matlab. CUDA works with all NVIDIA GPUs from the G8X series

onwards, including GeForce, Quadro and the Tesla line. CUDA

provides both a low level API and a higher level API. The initial

CUDA SDK was made public on 15 February 2007, for Microsoft

Windows and Linux. Mac OS X support was later added in ver-

sion 2.0, which supersedes the beta released February 14, 2008.

CUDA is the hardware and software architecture that enables

NVIDIA GPUs to execute programs written with C, C++, Fortran,

→ OpenCL, → DirectCompute, and other languages. A CUDA pro-

gram calls parallel kernels. A kernel executes in parallel across

a set of parallel threads. The programmer or compiler organizes

these threads in thread blocks and grids of thread blocks. The

GPU instantiates a kernel program on a grid of parallel thread

blocks. Each thread within a thread block executes an instance

of the kernel, and has a thread ID within its thread block, pro-

gram counter, registers, per-thread private memory, inputs, and

output results.

Figure: CUDA thread and memory hierarchy – thread, thread block, and grid, with per-thread private local memory, per-block shared memory, and per-application context global memory.


A thread block is a set of concurrently executing threads that

can cooperate among themselves through barrier synchroniza-

tion and shared memory. A thread block has a block ID within

its grid. A grid is an array of thread blocks that execute the same

kernel, read inputs from global memory, write results to global

memory, and synchronize between dependent kernel calls. In

the CUDA parallel programming model, each thread has a per-

thread private memory space used for register spills, function

calls, and C automatic array variables. Each thread block has a

per-block shared memory space used for inter-thread communi-

cation, data sharing, and result sharing in parallel algorithms.

Grids of thread blocks share results in global memory space af-

ter kernel-wide global synchronization.

CUDA’s hierarchy of threads maps to a hierarchy of processors

on the GPU; a GPU executes one or more kernel grids; a stream-

ing multiprocessor (SM) executes one or more thread blocks;

and CUDA cores and other execution units in the SM execute

threads. The SM executes threads in groups of 32 threads called

a warp. While programmers can generally ignore warp ex-

ecution for functional correctness and think of programming

one thread, they can greatly improve performance by having

threads in a warp execute the same code path and access mem-

ory in nearby addresses. See the main article “GPU Computing”

for further details.
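As a minimal sketch of the thread/block/grid model described above, written in “C for CUDA”; the kernel, array sizes, and launch parameters are illustrative only:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread computes one element; its global index is derived
   from its block ID and its thread ID within the block. */
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    float *da, *db, *dc;

    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    /* Allocate device memory and copy the inputs to the GPU. */
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* Launch a grid of thread blocks with 256 threads per block. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}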

DirectCompute

An application programming interface (API) that supports gen-

eral-purpose computing on graphics processing units (GPUs)

on Microsoft Windows Vista or Windows 7. DirectCompute is

part of the Microsoft DirectX collection of APIs and was initially

released with the DirectX 11 API but runs on both DirectX 10

and DirectX 11 GPUs. The DirectCompute architecture shares a

range of computational interfaces with → OpenCL and → CUDA.

ETL (“Extract, Transform, Load”)

A process in database usage and especially in data warehousing

that involves:

� Extracting data from outside sources

� Transforming it to fit operational needs (which can include

quality levels)

� Loading it into the end target (database or data warehouse)

The first part of an ETL process involves extracting the data from

the source systems. In many cases this is the most challenging

aspect of ETL, as extracting data correctly will set the stage for

how subsequent processes will go. Most data warehousing proj-

ects consolidate data from different source systems. Each sepa-

rate system may also use a different data organization/format.

Common data source formats are relational databases and flat

files, but may include non-relational database structures such

as Information Management System (IMS) or other data struc-

tures such as Virtual Storage Access Method (VSAM) or Indexed

Sequential Access Method (ISAM), or even fetching from outside

sources such as through web spidering or screen-scraping. The

streaming of the extracted data source and load on-the-fly to

the destination database is another way of performing ETL

when no intermediate data storage is required. In general, the

goal of the extraction phase is to convert the data into a single

format which is appropriate for transformation processing.

An intrinsic part of the extraction involves the parsing of ex-

tracted data, resulting in a check if the data meets an expected

pattern or structure. If not, the data may be rejected entirely or

in part.


The transform stage applies a series of rules or functions to the

extracted data from the source to derive the data for loading

into the end target. Some data sources will require very little or

even no manipulation of data. In other cases, one or more of the

following transformation types may be required to meet the

business and technical needs of the target database.

The load phase loads the data into the end target, usually the

data warehouse (DW). Depending on the requirements of the

organization, this process varies widely. Some data warehouses

may overwrite existing information with cumulative information; updating of extracted data is frequently done on a daily, weekly, or monthly basis.

DW) may add new data in a historicized form, for example, hour-

ly. To understand this, consider a DW that is required to main-

tain sales records of the last year. Then, the DW will overwrite

any data that is older than a year with newer data. However, the

entry of data for any one year window will be made in a histo-

ricized manner. The timing and scope to replace or append are

strategic design choices dependent on the time available and

the business needs. More complex systems can maintain a his-

tory and audit trail of all changes to the data loaded in the DW.

As the load phase interacts with a database, the constraints de-

fined in the database schema — as well as in triggers activated

upon data load — apply (for example, uniqueness, referential

integrity, mandatory fields), which also contribute to the overall

data quality performance of the ETL process.

FFTW (“Fastest Fourier Transform in the West”)

A software library for computing discrete Fourier transforms

(DFTs), developed by Matteo Frigo and Steven G. Johnson at the

Massachusetts Institute of Technology. FFTW is known as the

fastest free software implementation of the Fast Fourier trans-

form (FFT) algorithm (upheld by regular benchmarks). It can

compute transforms of real and complex-valued arrays of arbi-

trary size and dimension in O(n log n) time.
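A minimal sketch of the typical FFTW usage pattern (create a plan, execute it, clean up), assuming FFTW 3 is installed and the program is linked with -lfftw3; the transform length and input values are arbitrary:

#include <fftw3.h>
#include <stdio.h>

int main(void)
{
    const int n = 8;

    /* FFTW uses its own complex type and aligned allocator. */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Create a plan once; it can then be executed many times. */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++) {
        in[i][0] = (double)i;  /* real part */
        in[i][1] = 0.0;        /* imaginary part */
    }

    fftw_execute(p);
    printf("out[0] = %g + %gi\n", out[0][0], out[0][1]);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}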

floating point standard (IEEE 754)

The most widely-used standard for floating-point computa-

tion, and is followed by many hardware (CPU and FPU) and soft-

ware implementations. Many computer languages allow or re-

quire that some or all arithmetic be carried out using IEEE 754

formats and operations. The current version is IEEE 754-2008,

which was published in August 2008; the original IEEE 754-1985

was published in 1985. The standard defines arithmetic formats,

interchange formats, rounding algorithms, operations, and ex-

ception handling. The standard also includes extensive recom-

mendations for advanced exception handling, additional opera-

tions (such as trigonometric functions), expression evaluation,

and for achieving reproducible results. The standard defines

single-precision, double-precision, as well as 128-bit quadru-

ple-precision floating point numbers. In the proposed 754r ver-

sion, the standard also defines the 2-byte half-precision number

format.
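As a small illustration of the difference between the single- and double-precision formats defined by the standard, the following C snippet prints the machine epsilon of each type, taken from the compiler’s <float.h>:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Machine epsilon: the gap between 1.0 and the next representable value. */
    printf("single precision (32-bit): epsilon = %g, ~%d decimal digits\n",
           FLT_EPSILON, FLT_DIG);
    printf("double precision (64-bit): epsilon = %g, ~%d decimal digits\n",
           DBL_EPSILON, DBL_DIG);
    return 0;
}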

FraunhoferFS (FhGFS)

A high-performance parallel file system from the Fraunhofer

Competence Center for High Performance Computing. Built on

scalable multithreaded core components with native → Infini-

Band support, file system nodes can serve → InfiniBand and

Ethernet (or any other TCP-enabled network) connections at the

same time and automatically switch to a redundant connection

path in case any of them fails. One of the most fundamental


concepts of FhGFS is the strict avoidance of architectural bottlenecks. Striping file contents across multiple storage servers is

only one part of this concept. Another important aspect is the

distribution of file system metadata (e.g. directory information)

across multiple metadata servers. Large systems and metadata

intensive applications in general can greatly profit from the lat-

ter feature.

FhGFS requires no dedicated file system partition on the servers

– it uses existing partitions, formatted with any of the standard

Linux file systems, e.g. XFS or ext4. For larger networks, it is also

possible to create several distinct FhGFS file system partitions

with different configurations. FhGFS provides a coherent mode,

in which it is guaranteed that changes to a file or directory by

one client are always immediately visible to other clients.

Global Arrays (GA)

A library developed by scientists at Pacific Northwest National

Laboratory for parallel computing. GA provides a friendly API

for shared-memory programming on distributed-memory com-

puters for multidimensional arrays. The GA library is a predeces-

sor to the GAS (global address space) languages currently being

developed for high-performance computing. The GA toolkit has

additional libraries including a Memory Allocator (MA), Aggre-

gate Remote Memory Copy Interface (ARMCI), and functionality

for out-of-core storage of arrays (ChemIO). Although GA was ini-

tially developed to run with TCGMSG, a message passing library

that came before the → MPI standard (Message Passing Inter-

face), it is now fully compatible with → MPI. GA includes simple

matrix computations (matrix-matrix multiplication, LU solve)

and works with → ScaLAPACK. Sparse matrices are available but

the implementation is not optimal yet. GA was developed by Jar-

ek Nieplocha, Robert Harrison and R. J. Littlefield. The ChemIO li-

brary for out-of-core storage was developed by Jarek Nieplocha,

Robert Harrison and Ian Foster. The GA library is incorporated

into many quantum chemistry packages, including NWChem,

MOLPRO, UTChem, MOLCAS, and TURBOMOLE. The GA toolkit is

free software, licensed under a self-made license.

Globus Toolkit

An open source toolkit for building computing grids developed

and provided by the Globus Alliance, currently at version 5.

GMP (“GNU Multiple Precision Arithmetic Library”)

A free library for arbitrary-precision arithmetic, operating on

signed integers, rational numbers, and floating point numbers.

There are no practical limits to the precision except the ones

implied by the available memory in the machine GMP runs on (the operand dimension limit is 2^31 bits on 32-bit machines and 2^37 bits on 64-bit machines). GMP has a rich set of functions, and

the functions have a regular interface. The basic interface is

for C but wrappers exist for other languages including C++, C#,

OCaml, Perl, PHP, and Python. In the past, the Kaffe Java virtual

machine used GMP to support Java built-in arbitrary precision

arithmetic. This feature has been removed from recent releases,

causing protests from people who claim that they used Kaffe

solely for the speed benefits afforded by GMP. As a result, GMP

support has been added to GNU Classpath. The main target ap-

plications of GMP are cryptography applications and research,

Internet security applications, and computer algebra systems.
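A minimal sketch of GMP’s C interface, computing an integer far beyond the native 64-bit range (link with -lgmp); the chosen exponent is only an example:

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t result;

    /* Arbitrary-precision integer: 2^200, which cannot be held
       in any native machine word. */
    mpz_init(result);
    mpz_ui_pow_ui(result, 2, 200);

    gmp_printf("2^200 = %Zd\n", result);

    mpz_clear(result);
    return 0;
}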

GotoBLAS

Kazushige Goto’s implementation of → BLAS.


grid (in CUDA architecture)

Part of the → CUDA programming model

GridFTP

An extension of the standard File Transfer Protocol (FTP) for use

with Grid computing. It is defined as part of the → Globus toolkit,

under the organisation of the Global Grid Forum (specifically, by

the GridFTP working group). The aim of GridFTP is to provide a

more reliable and high performance file transfer for Grid com-

puting applications. This is necessary because of the increased

demands of transmitting data in Grid computing – it is frequent-

ly necessary to transmit very large files, and this needs to be

done fast and reliably. GridFTP is the answer to the problem of

incompatibility between storage and access systems. Previous-

ly, each data provider would make their data available in their

own specific way, providing a library of access functions. This

made it difficult to obtain data from multiple sources, requiring

a different access method for each, and thus dividing the total

available data into partitions. GridFTP provides a uniform way

of accessing the data, encompassing functions from all the dif-

ferent modes of access, building on and extending the univer-

sally accepted FTP standard. FTP was chosen as a basis for it

because of its widespread use, and because it has a well defined

architecture for extensions to the protocol (which may be dy-

namically discovered).

Hierarchical Data Format (HDF)

A set of file formats and libraries designed to store and orga-

nize large amounts of numerical data. Originally developed at

the National Center for Supercomputing Applications, it is cur-

rently supported by the non-profit HDF Group, whose mission

is to ensure continued development of HDF5 technologies, and

the continued accessibility of data currently stored in HDF. In

keeping with this goal, the HDF format, libraries and associated

tools are available under a liberal, BSD-like license for general

use. HDF is supported by many commercial and non-commer-

cial software platforms, including Java, MATLAB, IDL, and Py-

thon. The freely available HDF distribution consists of the li-

brary, command-line utilities, test suite source, Java interface,

and the Java-based HDF Viewer (HDFView). There currently exist

two major versions of HDF, HDF4 and HDF5, which differ signifi-

cantly in design and API.
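A minimal sketch of writing a small dataset with the HDF5 C API (assuming the HDF5 1.8 library, linked with -lhdf5); the file name, dataset name, and sizes are hypothetical:

#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    int data[4] = {1, 2, 3, 4};
    hsize_t dims[1] = {4};

    /* Create a new file, a 1-D dataspace, and an integer dataset. */
    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/values", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Write the array and close all handles. */
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);

    printf("wrote example.h5\n");
    return 0;
}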

HLSL (“High Level Shader Language“)

The High Level Shader Language or High Level Shading Lan-

guage (HLSL) is a proprietary shading language developed by

Microsoft for use with the Microsoft Direct3D API. It is analo-

gous to the GLSL shading language used with the OpenGL stan-

dard. It is very similar to the NVIDIA Cg shading language, as it

was developed alongside it.

HLSL programs come in three forms, vertex shaders, geometry

shaders, and pixel (or fragment) shaders. A vertex shader is ex-

ecuted for each vertex that is submitted by the application, and

is primarily responsible for transforming the vertex from object

space to view space, generating texture coordinates, and calcu-

lating lighting coefficients such as the vertex’s tangent, binor-

mal and normal vectors. When a group of vertices (normally 3,

to form a triangle) come through the vertex shader, their out-

put position is interpolated to form pixels within its area; this

process is known as rasterisation. Each of these pixels comes

through the pixel shader, whereby the resultant screen colour

is calculated. Optionally, an application using a Direct3D10 in-


terface and Direct3D10 hardware may also specify a geometry

shader. This shader takes as its input the three vertices of a tri-

angle and uses this data to generate (or tessellate) additional

triangles, which are each then sent to the rasterizer.

InfiniBand

Switched fabric communications link primarily used in HPC and

enterprise data centers. Its features include high throughput,

low latency, quality of service and failover, and it is designed to

be scalable. The InfiniBand architecture specification defines a

connection between processor nodes and high performance

I/O nodes such as storage devices. Like → PCI Express, and many

other modern interconnects, InfiniBand offers point-to-point

bidirectional serial links intended for the connection of pro-

cessors with high-speed peripherals such as disks. On top of

the point-to-point capabilities, InfiniBand also offers multicast

operations as well. It supports several signalling rates and, as

with PCI Express, links can be bonded together for additional

throughput.

The SDR serial connection’s signalling rate is 2.5 gigabit per sec-

ond (Gbit/s) in each direction per connection. DDR is 5 Gbit/s

and QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125

Gbit/s per lane. For SDR, DDR and QDR, links use 8B/10B encoding – every 10 bits sent carry 8 bits of data – making the useful data transmission rate four-fifths the raw rate. Thus single,

double, and quad data rates carry 2, 4, or 8 Gbit/s useful data,

respectively. For FDR and EDR, links use 64B/66B encoding – ev-

ery 66 bits sent carry 64 bits of data.

Implementers can aggregate links in units of 4 or 12, called 4X or

12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s

of useful data. As of 2009 most systems use a 4X aggregate, im-

plying a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) con-

nection. Larger systems with 12X links are typically used for

cluster and supercomputer interconnects and for inter-switch

connections.
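The useful data rate follows directly from the per-lane signalling rate, the lane count, and the encoding efficiency quoted above; a small C sketch of that arithmetic, using only the figures given in this entry:

#include <stdio.h>

/* Useful bandwidth = lanes * per-lane signalling rate * encoding efficiency. */
static double useful_gbps(double lane_gbps, int lanes, double efficiency)
{
    return lane_gbps * lanes * efficiency;
}

int main(void)
{
    /* 4X QDR: 4 lanes at 10 Gbit/s with 8B/10B encoding (8/10 efficiency). */
    printf("4X QDR: %.0f Gbit/s useful\n", useful_gbps(10.0, 4, 8.0 / 10.0));

    /* 12X QDR: 12 lanes at 10 Gbit/s -> 120 Gbit/s raw, 96 Gbit/s useful. */
    printf("12X QDR: %.0f Gbit/s useful\n", useful_gbps(10.0, 12, 8.0 / 10.0));

    /* 4X FDR: 4 lanes at 14.0625 Gbit/s with 64B/66B encoding. */
    printf("4X FDR: %.2f Gbit/s useful\n",
           useful_gbps(14.0625, 4, 64.0 / 66.0));
    return 0;
}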

The single data rate switch chips have a latency of 200 nano-

seconds, DDR switch chips have a latency of 140 nanoseconds

and QDR switch chips have a latency of 100 nanoseconds. The

end-to-end MPI latency, which depends on the host channel adapter used, ranges from about 1.07 to 2.6 microseconds.

As of 2009 various InfiniBand host channel adapters (HCA) exist

in the market, each with different latency and bandwidth char-

acteristics. InfiniBand also provides RDMA capabilities for low

CPU overhead. The latency for RDMA operations is less than 1

microsecond.

See the main article “InfiniBand” for a further description of InfiniBand features.

Intel Integrated Performance Primitives (Intel IPP)

A multi-threaded software library of functions for multimedia

and data processing applications, produced by Intel. The library

supports Intel and compatible processors and is available for

Windows, Linux, and Mac OS X operating systems. It is available

separately or as a part of Intel Parallel Studio. The library takes

advantage of processor features including MMX, SSE, SSE2,

SSE3, SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is

divided into four major processing groups: Signal (with linear ar-

ray or vector data), Image (with 2D arrays for typical color spac-

es), Matrix (with n x m arrays for matrix operations), and Cryp-

tography. Half the entry points are of the matrix type, a third

are of the signal type and the remainder are of the image and

cryptography types. Intel IPP functions are further distinguished by data type, including 8u (8-bit unsigned), 8s (8-bit signed), 16s, 32f (32-bit floating-point), 64f, etc. Typically, an application

developer works with only one dominant data type for most

processing functions, converting from input to processing

to output formats at the end points. Version 5.2 was introduced

June 5, 2007, adding code samples for data compression, new

video codec support, support for 64-bit applications on Mac OS

X, support for Windows Vista, and new functions for ray-tracing

and rendering. Version 6.1 was released with the Intel C++ Com-

piler on June 28, 2009 and Update 1 for version 6.1 was released

on July 28, 2009.

Intel Threading Building Blocks (TBB)

A C++ template library developed by Intel Corporation for writ-

ing software programs that take advantage of multi-core pro-

cessors. The library consists of data structures and algorithms

that allow a programmer to avoid some complications arising

from the use of native threading packages such as POSIX →

threads, Windows → threads, or the portable Boost Threads in

which individual → threads of execution are created, synchro-

nized, and terminated manually. Instead the library abstracts

access to the multiple processors by allowing the operations

to be treated as “tasks”, which are allocated to individual cores

dynamically by the library’s run-time engine, and by automat-

ing effi cient use of the CPU cache. A TBB program creates, syn-

chronizes and destroys graphs of dependent tasks according

to algorithms, i.e. high-level parallel programming paradigms

(a.k.a. Algorithmic Skeletons). Tasks are then executed respect-

ing graph dependencies. This approach groups TBB in a family

of solutions for parallel programming aiming to decouple the

programming from the particulars of the underlying machine.

Intel TBB is available commercially as a binary distribution with

support and in open source in both source and binary forms.

Version 4.0 was introduced on September 8, 2011.
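A minimal sketch of this task-based style is shown below; it assumes a TBB installation with C++11 lambda support (link with -ltbb), and the vector and loop body are illustrative only.

// Minimal sketch of task-based parallelism with Intel TBB.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> v(1000000, 1.0);

    // The loop body is expressed as a task; TBB's runtime splits the range
    // into chunks and schedules them dynamically on the available cores.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              v[i] = v[i] * 2.0 + 1.0;
                      });

    std::printf("v[0] = %f\n", v[0]);
    return 0;
}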

iSER (“iSCSI Extensions for RDMA“)

A protocol that maps the iSCSI protocol over a network that

provides RDMA services (like → iWARP or → InfiniBand). This

permits data to be transferred directly into SCSI I/O buffers

without intermediate data copies. The Datamover Architecture

(DA) defines an abstract model in which the movement of data

between iSCSI end nodes is logically separated from the rest of

the iSCSI protocol. iSER is one Datamover protocol. The inter-

face between the iSCSI and a Datamover protocol, iSER in this

case, is called Datamover Interface (DI).

iWARP (“Internet Wide Area RDMA Protocol”)

An Internet Engineering Task Force (IETF) update of the RDMA

Consortium’s → RDMA over TCP standard. This later standard enables zero-copy transmission over legacy TCP. Because a kernel imple-

mentation of the TCP stack is a tremendous bottleneck, a few

vendors now implement TCP in hardware. This additional hard-

ware is known as the TCP offload engine (TOE). TOE itself does

not prevent copying on the receive side, and must be combined

with RDMA hardware for zero-copy results. The main compo-

nent is the Direct Data Placement protocol (DDP), which permits the ac-

tual zero-copy transmission. The transmission itself is not per-

formed by DDP, but by TCP.

kernel (in CUDA architecture)

Part of the → CUDA programming model

LAM/MPI

A high-quality open-source implementation of the → MPI speci-

fication, including all of MPI-1.2 and much of MPI-2. Superseded

by the → OpenMPI implementation.

LAPACK (“linear algebra package”)

Routines for solving systems of simultaneous linear equations,

least-squares solutions of linear systems of equations, eigen-

value problems, and singular value problems. The original goal

of the LAPACK project was to make the widely used EISPACK and

→ LINPACK libraries run efficiently on shared-memory vector

and parallel processors. LAPACK routines are written so that as

much as possible of the computation is performed by calls to

the → BLAS library. While → LINPACK and EISPACK are based on

the vector operation kernels of the Level 1 BLAS, LAPACK was

designed at the outset to exploit the Level 3 BLAS. Highly effi-

cient machine-specific implementations of the BLAS are avail-

able for many modern high-performance computers. The BLAS

enable LAPACK routines to achieve high performance with por-

table software.
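As a hedged illustration, the following sketch solves a small dense system A·x = b through the LAPACKE C interface to LAPACK (dgesv, i.e. LU factorization with partial pivoting). It assumes an installation that provides lapacke.h (e.g. link with -llapacke -llapack -lblas); the 3x3 matrix is made up for the example.

// Minimal sketch: solving A·x = b with LAPACKE_dgesv.
#include <lapacke.h>
#include <cstdio>

int main() {
    double a[3 * 3] = { 2, 1, 1,
                        1, 3, 2,
                        1, 0, 0 };       // row-major 3x3 matrix
    double b[3]     = { 4, 5, 6 };       // right-hand side, overwritten with x
    lapack_int ipiv[3];

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, a, 3, ipiv, b, 1);
    if (info != 0) { std::printf("dgesv failed: %d\n", (int)info); return 1; }

    std::printf("x = (%f, %f, %f)\n", b[0], b[1], b[2]);
    return 0;
}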

layout

Part of the → parallel NFS standard. Currently three types of

layout exist: file-based, block/volume-based, and object-based,

the latter making use of → object-based storage devices

LINPACK

A collection of Fortran subroutines that analyze and solve linear

equations and linear least-squares problems. LINPACK was de-

signed for supercomputers in use in the 1970s and early 1980s.

LINPACK has been largely superseded by → LAPACK, which has

been designed to run efficiently on shared-memory, vector su-

percomputers. LINPACK makes use of the → BLAS libraries for

performing basic vector and matrix operations.

The LINPACK benchmarks are a measure of a system‘s float-

ing point computing power and measure how fast a computer

solves a dense N by N system of linear equations Ax=b, which is a

common task in engineering. The solution is obtained by Gauss-

ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating

point operations. The result is reported in millions of floating

point operations per second (MFLOP/s, sometimes simply called

MFLOPS).
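The operation count quoted above can be turned into a MFLOP/s figure as in the following purely illustrative sketch; both the problem size N and the solve time used here are hypothetical.

// Illustrative calculation: LINPACK operation count 2/3·N³ + 2·N² and the
// resulting MFLOP/s rating for an assumed (hypothetical) solve time.
#include <cstdio>

int main() {
    const double n = 10000.0;            // problem size N (illustrative)
    const double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    const double seconds = 95.0;         // hypothetical measured solve time

    std::printf("operations: %.3e\n", flops);                  // roughly 6.67e11
    std::printf("MFLOP/s:    %.1f\n", flops / seconds / 1e6);  // roughly 7000 MFLOP/s
    return 0;
}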

LNET

Communication protocol in → Lustre

logical object volume (LOV)

A logical entity in → Lustre


Lustre

An object-based → parallel file system

management server (MGS)

A functional component in → Lustre

MapReduce

A framework for processing highly distributable problems

across huge datasets using a large number of computers

(nodes), collectively referred to as a cluster (if all nodes use the

same hardware) or a grid (if the nodes use different hardware).

Computational processing can occur on data stored either in a

filesystem (unstructured) or in a database (structured).

“Map” step: The master node takes the input, divides it into

smaller sub-problems, and distributes them to worker nodes. A

worker node may do this again in turn, leading to a multi-level

tree structure. The worker node processes the smaller problem,

and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all

the sub-problems and combines them in some way to form the

output – the answer to the problem it was originally trying to

solve.

MapReduce allows for distributed processing of the map and

reduction operations. Provided each mapping operation is in-

dependent of the others, all maps can be performed in parallel

– though in practice it is limited by the number of independent

data sources and/or the number of CPUs near each source. Simi-

larly, a set of ‘reducers’ can perform the reduction phase - pro-

vided all outputs of the map operation that share the same key

are presented to the same reducer at the same time. While this

process can often appear inefficient compared to algorithms

that are more sequential, MapReduce can be applied to signifi-

cantly larger datasets than “commodity” servers can handle – a

large server farm can use MapReduce to sort a petabyte of data

in only a few hours. The parallelism also offers some possibility

of recovering from partial failure of servers or storage during

the operation: if one mapper or reducer fails, the work can be

rescheduled – assuming the input data is still available.
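The two phases can be sketched, in a deliberately simplified single-process form, as follows; a real framework such as Hadoop would distribute the map tasks and the grouping of keys across many nodes.

// Conceptual sketch of the MapReduce phases for a word count, reduced to a
// single process (plain C++; no framework is used here).
#include <map>
#include <sstream>
#include <string>
#include <vector>
#include <cstdio>

int main() {
    std::vector<std::string> documents = {
        "the quick brown fox", "the lazy dog", "the fox"
    };

    // "Map" step: each input record is turned into (key, value) pairs.
    std::vector<std::pair<std::string, int>> pairs;
    for (const std::string& doc : documents) {
        std::istringstream words(doc);
        std::string w;
        while (words >> w) pairs.emplace_back(w, 1);
    }

    // Shuffle + "Reduce" step: values sharing a key are combined; here the
    // reduction is a simple sum per word.
    std::map<std::string, int> counts;
    for (const auto& kv : pairs) counts[kv.first] += kv.second;

    for (const auto& kv : counts)
        std::printf("%s: %d\n", kv.first.c_str(), kv.second);
    return 0;
}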

metadata server (MDS)

A functional component in → Lustre

metadata target (MDT)

A logical entity in → Lustre

MKL (“Math Kernel Library”)

A library of optimized math routines for science, engineering,

and financial applications developed by Intel. Core math func-

tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers,

Fast Fourier Transforms, and Vector Math. The library supports

Intel and compatible processors and is available for Windows,

Linux and Mac OS X operating systems.
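As an illustration, a BLAS level-3 call such as a double-precision matrix multiplication (dgemm) can be issued through MKL's standard CBLAS interface roughly as follows; an MKL installation providing mkl.h and the usual link libraries is assumed, and the small matrices are made up for the example.

// Minimal sketch: C = alpha·A·B + beta·C via cblas_dgemm for row-major matrices.
#include <mkl.h>
#include <cstdio>

int main() {
    const int m = 2, n = 2, k = 2;
    double A[m * k] = {1, 2,
                       3, 4};
    double B[k * n] = {5, 6,
                       7, 8};
    double C[m * n] = {0, 0,
                       0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);

    std::printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}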

MPI, MPI-2 (“message-passing interface”)

A language-independent communications protocol used to

program parallel computers. Both point-to-point and collective

communication are supported. MPI remains the dominant mod-

el used in high-performance computing today. There are two

versions of the standard that are currently popular: version 1.2

(called MPI-1 for short), which emphasizes message passing and

has a static runtime environment, and MPI-2.1 (MPI-2), which in-

cludes new features such as parallel I/O, dynamic process man-


agement and remote memory operations. MPI-2 specifies over

500 functions and provides language bindings for ANSI C, ANSI

Fortran (Fortran90), and ANSI C++. Interoperability of objects de-

fined in MPI was also added to allow for easier mixed-language

message passing programming. A side effect of MPI-2 stan-

dardization (completed in 1996) was clarification of the MPI-1

standard, creating the MPI-1.2 level. MPI-2 is mostly a superset

of MPI-1, although some functions have been deprecated. Thus

MPI-1.2 programs still work under MPI implementations com-

pliant with the MPI-2 standard. The MPI Forum reconvened in

2007, to clarify some MPI-2 issues and explore developments for

a possible MPI-3.
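A minimal sketch of both point-to-point and collective communication, assuming any MPI implementation (compiled with mpicxx and started with mpirun), might look as follows:

// Minimal MPI sketch: one send/receive pair plus a global reduction.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Point-to-point: rank 0 sends a value to rank 1.
    if (rank == 0 && size > 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", payload);
    }

    // Collective: every rank contributes to a global sum.
    int local = rank, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    std::printf("rank %d: sum of ranks = %d\n", rank, global);

    MPI_Finalize();
    return 0;
}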

MPICH2

A freely available, portable → MPI 2.0 implementation, main-

tained by Argonne National Laboratory

MPP (“massively parallel processing”)

So-called MPP jobs are computer programs with several parts

running on several machines in parallel, often calculating simu-

lation problems. The communication between these parts can

e.g. be realized by the → MPI software interface.

MS-MPI

Microsoft → MPI 2.0 implementation shipped with Microsoft

HPC Pack 2008 SDK, based on and designed for maximum com-

patibility with the → MPICH2 reference implementation.

MVAPICH2

An → MPI 2.0 implementation based on → MPICH2 and devel-

oped by the Department of Computer Science and Engineer-

ing at Ohio State University. It is available under BSD licens-

ing and supports MPI over InfiniBand, 10GigE/iWARP and

RDMAoE.

NetCDF (“Network Common Data Form”)

A set of software libraries and self-describing, machine-inde-

pendent data formats that support the creation, access, and

sharing of array-oriented scientific data. The project homep-

age is hosted by the Unidata program at the University Cor-

poration for Atmospheric Research (UCAR). They are also the

chief source of NetCDF software, standards development,

updates, etc. The format is an open standard. NetCDF Clas-

sic and 64-bit Offset Format are an international standard

of the Open Geospatial Consortium. The project is actively

supported by UCAR. The recently released (2008) version 4.0

greatly enhances the data model by allowing the use of the

→ HDF5 data file format. Version 4.1 (2010) adds support for C

and Fortran client access to specified subsets of remote data

via OPeNDAP. The format was originally based on the con-

ceptual model of the NASA CDF but has since diverged and

is not compatible with it. It is commonly used in climatology,

meteorology and oceanography applications (e.g., weather

forecasting, climate change) and GIS applications. It is an in-

put/output format for many GIS applications, and for general

scientific data exchange. The NetCDF C library, and the librar-

ies based on it (Fortran 77 and Fortran 90, C++, and all third-

party libraries) can, starting with version 4.1.1, read some

data in other data formats. Data in the → HDF5 format can be

read, with some restrictions. Data in the → HDF4 format can

be read by the NetCDF C library if created using the → HDF4

Scientific Data (SD) API.
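For illustration, writing a small one-dimensional variable with the NetCDF C library can be sketched as follows; the file and variable names are invented for the example, and a NetCDF installation providing netcdf.h (link with -lnetcdf) is assumed.

// Minimal sketch: create a classic-format NetCDF file with one dimension
// and one integer variable, then write it.
#include <netcdf.h>
#include <cstdio>

int main() {
    int ncid, dimid, varid;
    int data[6] = {1, 2, 3, 4, 5, 6};

    nc_create("example.nc", NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "x", 6, &dimid);
    nc_def_var(ncid, "values", NC_INT, 1, &dimid, &varid);
    nc_enddef(ncid);                 // leave define mode

    nc_put_var_int(ncid, varid, data);
    nc_close(ncid);

    std::printf("wrote example.nc\n");
    return 0;
}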


NetworkDirect

A remote direct memory access (RDMA)-based network interface

implemented in Windows Server 2008 and later. NetworkDirect

uses a more direct path from → MPI applications to networking

hardware, resulting in very fast and efficient networking. See the

main article “Windows HPC Server 2008 R2” for further details.

NFS (Network File System)

A network file system protocol originally developed by Sun Mi-

crosystems in 1984, allowing a user on a client computer to ac-

cess files over a network in a manner similar to how local stor-

age is accessed. NFS, like many other protocols, builds on the

Open Network Computing Remote Procedure Call (ONC RPC)

system. The Network File System is an open standard defined in

RFCs, allowing anyone to implement the protocol.

Sun used version 1 only for in-house experimental purposes.

When the development team added substantial changes to NFS

version 1 and released it outside of Sun, they decided to release

the new version as V2, so that version interoperation and RPC

version fallback could be tested. Version 2 of the protocol (de-

fined in RFC 1094, March 1989) originally operated entirely over

UDP. Its designers meant to keep the protocol stateless, with

locking (for example) implemented outside of the core protocol.

Version 3 (RFC 1813, June 1995) added:

• support for 64-bit file sizes and offsets, to handle files larger than 2 gigabytes (GB)
• support for asynchronous writes on the server, to improve write performance
• additional file attributes in many replies, to avoid the need to re-fetch them
• a READDIRPLUS operation, to get file handles and attributes along with file names when scanning a directory
• assorted other improvements

At the time of introduction of Version 3, vendor support for TCP

as a transport-layer protocol began increasing. While several

vendors had already added support for NFS Version 2 with TCP

as a transport, Sun Microsystems added support for TCP as a

transport for NFS at the same time it added support for Version

3. Using TCP as a transport made using NFS over a WAN more

feasible.

Version 4 (RFC 3010, December 2000; revised in RFC 3530, April

2003), influenced by AFS and CIFS, includes performance im-

provements, mandates strong security, and introduces a state-

ful protocol. Version 4 became the first version developed with

the Internet Engineering Task Force (IETF) after Sun Microsys-

tems handed over the development of the NFS protocols.

NFS version 4 minor version 1 (NFSv 4.1) was approved by the IESG and received an RFC number in January 2010. The NFSv 4.1 specification aims to provide protocol support to take ad-

vantage of clustered server deployments including the ability

to provide scalable parallel access to files distributed among

multiple servers. NFSv 4.1 adds the parallel NFS (pNFS) capabil-

ity, which enables data access parallelism. The NFSv 4.1 protocol

defines a method of separating the filesystem meta-data from

the location of the file data; it goes beyond the simple name/

data separation by striping the data amongst a set of data serv-

ers. This is different from the traditional NFS server which holds

the names of files and their data under the single umbrella of

the server.


In addition to pNFS, NFSv 4.1 provides sessions, directory del-

egation and notifications, multi-server namespace, access con-

trol lists (ACL/SACL/DACL), retention attributions, and SECIN-

FO_NO_NAME. See the main article “Parallel Filesystems” for

further details.

Work is currently being done on a draft for a future version 4.2 of the NFS standard, including so-called federated filesystems, which constitute the NFS counterpart of Microsoft’s distributed filesystem (DFS).

NUMA (“non-uniform memory access”)

A computer memory design used in multiprocessors, where the

memory access time depends on the memory location relative

to a processor. Under NUMA, a processor can access its own local

memory faster than non-local memory, that is, memory local to

another processor or memory shared between processors.

object storage server (OSS)

A functional component in → Lustre

object storage target (OST)

A logical entity in → Lustre

object-based storage device (OSD)

An intelligent evolution of disk drives that can store and serve

objects rather than simply place data on tracks and sectors.

This task is accomplished by moving low-level storage func-

tions into the storage device and accessing the device through

an object interface. Unlike a traditional block-oriented device

providing access to data organized as an array of unrelated


blocks, an object store allows access to data by means of stor-

age objects. A storage object is a virtual entity that groups data

together that has been determined by the user to be logically

related. Space for a storage object is allocated internally by the

OSD itself instead of by a host-based file system. OSDs man-

age all necessary low-level storage, space management, and

security functions. Because there is no host-based metadata

for an object (such as inode information), the only way for an

application to retrieve an object is by using its object identifi er

(OID). The SCSI interface was modifi ed and extended by the OSD

Technical Work Group of the Storage Networking Industry Asso-

ciation (SNIA) with varied industry and academic contributors,

resulting in a draft standard to T10 in 2004. This standard was

ratified in September 2004 and became the ANSI T10 SCSI OSD

V1 command set, released as INCITS 400-2004. The SNIA group

continues to work on further extensions to the interface, such

as the ANSI T10 SCSI OSD V2 command set.

OLAP cube (“Online Analytical Processing”)

A set of data, organized in a way that facilitates non-predeter-

mined queries for aggregated information, or in other words,

online analytical processing. OLAP is one of the computer-

based techniques for analyzing business data that are collec-

tively called business intelligence. OLAP cubes can be thought

of as extensions to the two-dimensional array of a spreadsheet.

For example, a company might wish to analyze some financial

data by product, by time-period, by city, by type of revenue and

cost, and by comparing actual data with a budget. These addi-

tional methods of analyzing the data are known as dimensions.

Because there can be more than three dimensions in an OLAP

system the term hypercube is sometimes used.


OpenCL (“Open Computing Language”)

A framework for writing programs that execute across hetero-

geneous platforms consisting of CPUs, GPUs, and other proces-

sors. OpenCL includes a language (based on C99) for writing ker-

nels (functions that execute on OpenCL devices), plus APIs that

are used to define and then control the platforms. OpenCL pro-

vides parallel computing using task-based and data-based par-

allelism. OpenCL is analogous to the open industry standards

OpenGL and OpenAL, for 3D graphics and computer audio,

respectively. Originally developed by Apple Inc., which holds

trademark rights, OpenCL is now managed by the non-profit

technology consortium Khronos Group.

OpenMP (“Open Multi-Processing”)

An application programming interface (API) that supports

multi-platform shared memory multiprocessing programming

in C, C++ and Fortran on many architectures, including Unix and

Microsoft Windows platforms. It consists of a set of compiler di-

rectives, library routines, and environment variables that influ-

ence run-time behavior.

Jointly defined by a group of major computer hardware and

software vendors, OpenMP is a portable, scalable model that

gives programmers a simple and flexible interface for develop-

ing parallel applications for platforms ranging from the desk-

top to the supercomputer.

An application built with the hybrid model of parallel program-

ming can run on a computer cluster using both OpenMP and

Message Passing Interface (MPI), or more transparently through

the use of OpenMP extensions for non-shared memory systems.
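A minimal sketch of the directive-based style is shown below; it assumes an OpenMP-capable compiler (e.g. with -fopenmp), and the dot-product loop is illustrative only.

// Minimal OpenMP sketch: a parallel loop with a reduction.
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    // The loop iterations are distributed over the threads of the team;
    // the reduction clause combines the per-thread partial sums.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];

    std::printf("dot product = %f (threads available: %d)\n",
                sum, omp_get_max_threads());
    return 0;
}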

OpenMPI

An open source → MPI-2 implementation that is developed and

maintained by a consortium of academic, research, and indus-

try partners.

parallel NFS (pNFS)

A → parallel file system standard, optional part of the current →

NFS standard 4.1. See the main article “Parallel Filesystems” for

further details.

PCI Express (PCIe)

A computer expansion card standard designed to replace the

older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004,

PCIe (or PCI-E, as it is commonly called) is the latest standard

for expansion cards that is available on mainstream comput-

ers. PCIe, unlike previous PC expansion standards, is structured around point-to-point serial links rather than a shared parallel bus; a pair of links (one in each direction) makes up a lane.

These lanes are routed by a hub on the main-board acting as

a crossbar switch.

PCIe net data rate per direction, by link width and generation:

Width  PCIe 1.x   PCIe 2.x   PCIe 3.0   PCIe 4.0
x1     256 MB/s   512 MB/s   1 GB/s     2 GB/s
x2     512 MB/s   1 GB/s     2 GB/s     4 GB/s
x4     1 GB/s     2 GB/s     4 GB/s     8 GB/s
x8     2 GB/s     4 GB/s     8 GB/s     16 GB/s
x16    4 GB/s     8 GB/s     16 GB/s    32 GB/s
x32    8 GB/s     16 GB/s    32 GB/s    64 GB/s

This dynamic point-to-point behavior allows more than one pair of devices to communicate with each other at the same time. In contrast, older PC interfaces had all devices

permanently wired to the same bus; therefore, only one device

could send information at a time. This format also allows “chan-

nel grouping”, where multiple lanes are bonded to a single de-

vice pair in order to provide higher bandwidth. The number of

lanes is “negotiated” during power-up or explicitly during op-

eration. By making the lane count flexible a single standard can

provide for the needs of high-bandwidth cards (e.g. graphics

cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet

cards) while also being economical for less demanding cards.

Unlike preceding PC expansion interface standards, PCIe is a

network of point-to-point connections. This removes the need

for “arbitrating” the bus or waiting for the bus to be free and

allows for full duplex communications. This means that while

standard PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the

same data transfer rate, PCIe x4 will give better performance

if multiple device pairs are communicating simultaneously or

if communication within a single device pair is bidirectional.

Specifications of the format are maintained and developed by a

group of more than 900 industry-leading companies called the

PCI-SIG (PCI Special Interest Group). In PCIe 1.x, each lane carries

approximately 250 MB/s. PCIe 2.0, released in late 2007, adds a

Gen2-signalling mode, doubling the rate to about 500 MB/s. On

November 18, 2010, the PCI Special Interest Group officially published the finalized PCI Express 3.0 specification to its members

to build devices based on this new version of PCI Express, which

allows for a Gen3-signalling mode at about 1 GB/s per lane.

On November 29, 2011, PCI-SIG announced that it would proceed with PCI Express 4.0, featuring 16 GT/s, still on copper technology.

Additionally, active and idle power optimizations are to be in-

vestigated. Final specifications are expected to be released in

2014/15.

PETSc (“Portable, Extensible Toolkit for Scientific Computation”)

A suite of data structures and routines for the scalable (parallel)

solution of scientific applications modeled by partial differential

equations. It employs the → Message Passing Interface (MPI) stan-

dard for all message-passing communication. The current ver-

sion of PETSc is 3.2, released September 8, 2011. PETSc is intended for use in large-scale application projects; many ongoing compu-

tational science projects are built around the PETSc libraries. Its

careful design allows advanced users to have detailed control

over the solution process. PETSc includes a large suite of parallel

linear and nonlinear equation solvers that are easily used in ap-

plication codes written in C, C++, Fortran and now Python. PETSc

provides many of the mechanisms needed within parallel appli-

cation code, such as simple parallel matrix and vector assembly

routines that allow the overlap of communication and computa-

tion. In addition, PETSc includes support for parallel distributed

arrays useful for finite difference methods.

process

→ thread

PTX (“parallel thread execution”)

Parallel Thread Execution (PTX) is a pseudo-assembly language

used in NVIDIA’s CUDA programming environment. The ‘nvcc’

compiler translates code written in CUDA, a C-like language, into

PTX, and the graphics driver contains a compiler which translates

the PTX into something which can be run on the processing cores.


RDMA (“remote direct memory access”)

Allows data to move directly from the memory of one computer

into that of another without involving either one‘s operating

system. This permits high-throughput, low-latency network-

ing, which is especially useful in massively parallel computer

clusters. RDMA extends the principle of direct memory access (DMA) across the network.

RDMA supports zero-copy networking by enabling the network

adapter to transfer data directly to or from application memory,

eliminating the need to copy data between application memory

and the data buffers in the operating system. Such transfers re-

quire no work to be done by CPUs, caches, or context switches,

and transfers continue in parallel with other system operations.

When an application performs an RDMA Read or Write request,

the application data is delivered directly to the network, reduc-

ing latency and enabling fast message transfer. Common RDMA

implementations include → InfiniBand, → iSER, and → iWARP.

RISC (“reduced instruction-set computer”)

A CPU design strategy emphasizing the insight that simplified

instructions that “do less“ may still provide for higher perfor-

mance if this simplicity can be utilized to make instructions ex-

ecute very quickly. Opposite: → CISC.

ScaLAPACK (“scalable LAPACK”)

Library including a subset of → LAPACK routines redesigned

for distributed memory MIMD (multiple instruction, multiple

data) parallel computers. It is currently written in a Single-

Program-Multiple-Data style using explicit message passing

for interprocessor communication. ScaLAPACK is designed for

heterogeneous computing and is portable on any computer

that supports → MPI. The fundamental building blocks of the

ScaLAPACK library are distributed memory versions (PBLAS) of

the Level 1, 2 and 3 → BLAS, and a set of Basic Linear Algebra

Communication Subprograms (BLACS) for communication tasks

that arise frequently in parallel linear algebra computations. In

the ScaLAPACK routines, all interprocessor communication oc-

curs within the PBLAS and the BLACS. One of the design goals of

ScaLAPACK was to have the ScaLAPACK routines resemble their

→ LAPACK equivalents as much as possible.

service-oriented architecture (SOA)

An approach to building distributed, loosely coupled applica-

tions in which functions are separated into distinct services

that can be distributed over a network, combined, and reused.

See the main article “Windows HPC Server 2008 R2” for further

details.

single precision/double precision

→ floating point standard

SMP (“shared memory processing”)

So-called SMP jobs are computer programs with several

parts running on the same system and accessing a shared

memory region. SMP jobs are usually implemented as → multi-threaded programs. The communication between the

single threads can e.g. be realized by the → OpenMP software

interface standard, but also in a non-standard way by means of

native UNIX interprocess communication mechanisms.

SMP (“symmetric multiprocessing”)

A multiprocessor or multicore computer architecture where

two or more identical processors or cores can connect to a


single shared main memory in a completely symmetric way, i.e.

each part of the main memory has the same distance to each of

the cores. Opposite: → NUMA

storage access protocol

Part of the → parallel NFS standard

STREAM

A simple synthetic benchmark program that measures sustain-

able memory bandwidth (in MB/s) and the corresponding com-

putation rate for simple vector kernels.
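The central "triad" kernel and the bandwidth accounting can be sketched as follows; this is a simplified illustration with an arbitrary array size and OpenMP threading, not the official benchmark code.

// Simplified sketch of the STREAM triad kernel a[i] = b[i] + scalar*c[i].
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 20 * 1000 * 1000;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const double scalar = 3.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];     // triad: 2 loads + 1 store per element
    double t1 = omp_get_wtime();

    const double bytes = 3.0 * sizeof(double) * n;   // traffic counted by STREAM
    std::printf("triad bandwidth: %.1f MB/s\n", bytes / (t1 - t0) / 1e6);
    return 0;
}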

streaming multiprocessor (SM)

Hardware component within the → Tesla GPU series

subnet manager

Application responsible for configuring the local → InfiniBand

subnet and ensuring its continued operation.

superscalar processors

A superscalar CPU architecture implements a form of parallel-

ism called instruction-level parallelism within a single proces-

sor. It thereby allows faster CPU throughput than would other-

wise be possible at the same clock rate. A superscalar processor

executes more than one instruction during a clock cycle by si-

multaneously dispatching multiple instructions to redundant

functional units on the processor. Each functional unit is not

a separate CPU core but an execution resource within a single

CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.

Tesla

NVIDIA‘s third brand of GPUs, based on high-end GPUs from the

G80 series onwards. Tesla is NVIDIA's first dedicated General Purpose

GPU. Because of the very high computational power (measured

in floating point operations per second or FLOPS) compared to

recent microprocessors, the Tesla products are intended for the

HPC market. The primary function of Tesla products is to aid in

simulations, large scale calculations (especially floating-point

calculations), and image generation for professional and scien-

tific fields, with the use of → CUDA. See the main article “NVIDIA

GPU Computing” for further details.

thread

A thread of execution is a fork of a computer program into two

or more concurrently running tasks. The implementation of

threads and processes differs from one operating system to an-

other, but in most cases, a thread is contained inside a process.

On a single processor, multithreading generally occurs by multi-

tasking: the processor switches between different threads. On

a multiprocessor or multi-core system, the threads or tasks will

generally run at the same time, with each processor or core run-

ning a particular thread or task. Threads are distinguished from

processes in that processes are typically independent, while

threads exist as subsets of a process. Whereas processes have

separate address spaces, threads share their address space,

which makes inter-thread communication much easier than

classical inter-process communication (IPC).
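A minimal C++11 sketch of several threads operating on memory shared within one process is shown below (compile e.g. with -std=c++11 -pthread); the per-thread work is illustrative only.

// Minimal sketch: threads of one process share its address space.
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> shared(4, 0);       // visible to all threads of the process
    std::vector<std::thread> workers;

    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&shared, t] {
            shared[t] = t * t;           // each thread writes its own slot
        });

    for (std::thread& w : workers) w.join();

    for (int t = 0; t < 4; ++t)
        std::printf("slot %d = %d\n", t, shared[t]);
    return 0;
}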

thread (in CUDA architecture)

Part of the → CUDA programming model

thread block (in CUDA architecture)

Part of the → CUDA programming model

thread processor array (TPA)

Hardware component within the → Tesla GPU series

10 Gigabit Ethernet

The fastest of the Ethernet standards, first published in 2002

as IEEE Std 802.3ae-2002. It defines a version of Ethernet with

a nominal data rate of 10 Gbit/s, ten times as fast as Gigabit

Ethernet. Over the years several 802.3 standards relating to

10GbE have been published, which later were consolidated

into the IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other

amendments have been consolidated into IEEE Std 802.3-2008.

10 Gigabit Ethernet supports only full duplex links which can

be connected by switches. Half Duplex operation and CSMA/

CD (carrier sense multiple access with collision detect) are not

supported in 10GbE. The 10 Gigabit Ethernet standard encom-

passes a number of different physical layer (PHY) standards. As

of 2008 10 Gigabit Ethernet is still an emerging technology with

only 1 million ports shipped in 2007, and it remains to be seen

which of the PHYs will gain widespread commercial acceptance.

warp (in CUDA architecture)

Part of the → CUDA programming model


ALWAYS KEEP IN TOUCH WITH THE LATEST NEWS

Visit us on the Web!

Here you will find comprehensive information about HPC, IT solutions for the datacenter, services and

high-performance, efficient IT systems.

Subscribe to our technology journals, E-News or the transtec newsletter and always stay up to date.

www.transtec.de/en/hpc


transtec Germany

Tel +49 (0) 7071/703-400

[email protected]

www.transtec.de

transtec Switzerland

Tel +41 (0) 44/818 47 00

[email protected]

www.transtec.ch

transtec United Kingdom

Tel +44 (0) 1295/756 500

[email protected]

www.transtec.co.uk

ttec Netherlands

Tel +31 (0) 24 34 34 210

[email protected]

www.ttec.nl

transtec France

Tel +33 (0) 3.88.55.16.00

[email protected]

www.transtec.fr

Texts and concept: Dr. Oliver Tennert, Director Technology Management & HPC Solutions | [email protected]

Layout and design: Jennifer Kemmler, Graphics & Design | [email protected]

© transtec AG, June 2014. The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission. No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.