High Performance Computing 2014/15 – Technology Compass
Aug 19, 2014
More Than 30 Years of Experience in Scientific Computing
1980 marked the beginning of a decade where numerous start-
ups were created, some of which later transformed into big play-
ers in the IT market. Technical innovations brought dramatic
changes to the nascent computer market. In Tübingen, close to
one of Germany’s prime and oldest universities, transtec was
founded.
In the early days, transtec focused on reselling DEC computers
and peripherals, delivering high-performance workstations to
university institutes and research facilities. In 1987, SUN/Sparc
and storage solutions broadened the portfolio, enhanced by
IBM/RS6000 products in 1991. These were the typical worksta-
tions and server systems for high performance computing then,
used by the majority of researchers worldwide.
In the late 90s, transtec was one of the first companies to offer
highly customized HPC cluster solutions based on standard Intel
architecture servers, some of which entered the TOP 500 list of
the world’s fastest computing systems.
Given this background and history, it is fair to say that transtec looks back on more than 30 years of experience in scientific computing; our track record shows more than 750 HPC installations. With this experience, we know exactly what customers' demands are and how to meet them. High performance and
ease of management – this is what customers require today. HPC
systems are certainly required to deliver peak performance, as their name indicates, but that is not enough: they must also be easy to handle.
Unwieldy design and operational complexity must be avoided
or at least hidden from administrators and particularly users of
HPC computer systems.
This brochure focuses on where transtec HPC solutions excel. transtec HPC solutions use the latest and most innovative technology: Bright Cluster Manager as the technology leader for unified HPC cluster management, the leading-edge Moab HPC Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our systems, and Panasas HPC storage systems for the highest performance combined with the real ease of management required of a reliable HPC storage system. With these components, usability, reliability and ease of management are central issues that are addressed, even in a highly heterogeneous environment. transtec is able to provide customers with well-designed, extremely powerful solutions for Tesla GPU computing, as well as thoroughly engineered Intel Xeon Phi systems. Intel's InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before, and Numascale provides excellent technology for AMD-based large-SMP systems. transtec masterfully combines these excellent, well-chosen, proven components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.
Your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist
you at any stage, from HPC design to daily cluster operations, to
HPC Cloud Services.
Last but not least, transtec HPC Cloud Services provide customers with the possibility of having their jobs run on dynamically provisioned nodes in a dedicated datacenter, professionally managed
and individually customizable. Numerous standard applications
like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gro-
macs, NAMD, VMD, and others are pre-installed, integrated into
an enterprise-ready cloud management environment, and ready
to run.
Have fun reading the transtec HPC Compass 2014/15!
Table of Contents and Introduction

High Performance Computing
  Performance Turns Into Productivity
  Flexible Deployment With xCAT
  Service and Customer Care From A to Z
Advanced Cluster Management Made Easy
  Easy-to-use, Complete and Scalable
  Cloud Bursting With Bright
Intelligent HPC Workload Management
  Moab HPC Suite – Basic Edition
  Moab HPC Suite – Enterprise Edition
  Moab HPC Suite – Grid Option
  Optimizing Accelerators with Moab HPC Suite
  Moab HPC Suite – Application Portal Edition
  Moab HPC Suite – Remote Visualization Edition
  Using Moab With Grid Environments
Remote Visualization and Workflow Optimization
  NICE EnginFrame: A Technical Computing Portal
  Desktop Cloud Virtualization
  Cloud Computing
  NVIDIA Grid
Intel Cluster Ready
  A Quality Standard for HPC Clusters
  Intel Cluster Ready Builds HPC Momentum
  The transtec Benchmarking Center
Parallel NFS
  The New Standard for HPC Storage
  What's New in NFS 4.1?
  Panasas HPC Storage
NVIDIA GPU Computing
  What is GPU Computing?
  Kepler GK110 – The Next Generation GPU
  Introducing NVIDIA Parallel Nsight
  A Quick Refresher on CUDA
  Intel TrueScale InfiniBand and GPUs
Intel Xeon Phi Coprocessor
  The Architecture
InfiniBand
  High-speed Interconnects
  Top 10 Reasons to Use Intel TrueScale InfiniBand
  Intel MPI Library 4.0 Performance
  Intel Fabric Suite 7
Numascale
  Background
  NumaConnect Value Proposition
  Technology
  Redefining Scalable OpenMP and MPI
Glossary
High Performance Computing (HPC) has been with us from the very beginning of the computer era. High-performance computers were built to solve numerous problems which the “human computers” could not handle. The term HPC just hadn't been coined yet. More importantly, some of the early principles have changed fundamentally.
High Performance Computing – Performance Turns Into Productivity
Variations on the Theme: MPP and SMP

Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) applications. MPP indicates that the individual processes can each utilize exclusive memory areas. This means that such jobs are predestined to be computed in parallel, distributed across the nodes in a cluster. The individual processes can thus utilize the separate resources of the respective node – especially the RAM, the CPU power and the disk I/O.
Communication between the individual processes is imple-
mented in a standardized way through the MPI software inter-
face (Message Passing Interface), which abstracts the underlying
network connections between the nodes from the processes.
However, the MPI standard (current version 2.0) merely requires source code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are OpenMPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.
If the individual processes engage in a large amount of communication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typically around 10 μs. High-speed interconnects such as InfiniBand reduce latency by a factor of 10, down to as low as 1 μs. Therefore, high-speed interconnects can greatly speed up total processing.
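A rough way to see why latency matters is a simple linear cost model (an assumption for illustration, not a benchmark of any particular fabric): the time to transfer a message is the latency plus the message size divided by the bandwidth, so for small messages the latency term dominates.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_gbps):
    """Linear cost model: transfer time = latency + size / bandwidth."""
    bytes_per_us = bandwidth_gbps * 1e9 / 8 / 1e6  # bandwidth in bytes per microsecond
    return latency_us + size_bytes / bytes_per_us

# A 1 KiB message: Gigabit Ethernet (~10 us latency) vs. InfiniBand (~1 us, ~40 Gbit/s)
for name, lat, bw in [("GigE", 10.0, 1.0), ("InfiniBand", 1.0, 40.0)]:
    print(f"{name}: {transfer_time_us(1024, lat, bw):.1f} us")
```

With these assumed figures, the 1 KiB message spends most of its time in latency on Gigabit Ethernet, which is exactly the regime where a low-latency interconnect pays off for communication-heavy MPP jobs.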
The other frequently used variant is called SMP applications.
SMP, in this HPC context, stands for Shared Memory Processing.
It involves the use of shared memory areas, the specific implementation of which depends on the underlying operating system. Consequently, SMP jobs generally only run
on a single node, where they can in turn be multi-threaded and
thus be parallelized across the number of CPUs per node. For
many HPC applications, both the MPP and SMP variant can be
chosen.
Many applications are not inherently suitable for parallel execu-
tion. In such a case, there is no communication between the in-
dividual compute nodes, and therefore no need for a high-speed
network between them; nevertheless, multiple computing jobs
can be run simultaneously and sequentially on each individual
node, depending on the number of CPUs.
To ensure optimum computing performance for these applications, one must determine how many CPUs and cores deliver the best throughput. Applications of this sequential type are typically found in the fields of data analysis and Monte Carlo simulation.
Performance Turns Into Productivity

HPC systems in the early days were much different from those
we see today. First, we saw enormous mainframes from large
computer manufacturers, including a proprietary operating
system and job management system. Second, at universities
and research institutes, workstations made inroads and sci-
entists carried out calculations on their dedicated Unix or
VMS workstations. In either case, if you needed more comput-
ing power, you scaled up, i.e. you bought a bigger machine.
Today the term High-Performance Computing has gained a
fundamentally new meaning. HPC is now perceived as a way
to tackle complex mathematical, scientific or engineering
problems. The integration of industry standard, “off-the-shelf”
server hardware into HPC clusters facilitates the construction
of computer networks of such power that one single system
could never achieve. The new paradigm for parallelization is
scaling out.
Computer-supported simulation of realistic processes (so-called Computer Aided Engineering – CAE) has established
itself as a third key pillar in the field of science and research
alongside theory and experimentation. It is nowadays incon-
ceivable that an aircraft manufacturer or a Formula One rac-
ing team would operate without using simulation software.
And scientific calculations, such as in the fields of astrophys-
ics, medicine, pharmaceuticals and bio-informatics, will to a
large extent be dependent on supercomputers in the future.
Software manufacturers long ago recognized the benefit of
high-performance computers based on powerful standard
servers and ported their programs to them accordingly.
The main advantage of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercomputer can be given more power whenever the computational capacity of the system is no longer sufficient, simply by adding additional nodes of the same kind. A cumbersome switch to a different technology can be avoided in most cases.
“transtec HPC solutions are meant to provide cus-
tomers with unparalleled ease-of-management and
ease-of-use. Apart from that, deciding for a transtec
HPC solution means deciding for the most intensive
customer care and the best service imaginable.”
Dr. Oliver Tennert, Director Technology Management & HPC Solutions
Local Installation or Diskless Installation
We offer a diskful or a diskless installation of the cluster nodes. With a diskless installation, the operating system is hosted partly within main memory; larger parts may be mounted via NFS or other means. This approach allows large numbers of nodes to be deployed very efficiently, and the cluster is up and running within a very short timescale. Updating the cluster is equally efficient: only the boot image has to be updated and the nodes rebooted, after which the nodes run either a new kernel or even a new operating system. Moreover, with this approach, the cluster can also be partitioned very efficiently, either for testing purposes or for allocating different cluster partitions to different users or applications.
Development Tools, Middleware, and Applications
Depending on the application, optimization strategy, or underlying architecture, different compilers produce code of very different performance. Moreover, different, mainly commercial, applications require different MPI implementations.
And even when the code is self-developed, developers often
prefer one MPI implementation over another.
According to the customer’s wishes, we install various compil-
ers, MPI middleware, as well as job management systems like
Parastation, Grid Engine, Torque/Maui, or the very powerful
Moab HPC Suite for the high-level cluster management.
xCAT as a Powerful and Flexible Deployment Tool
xCAT (Extreme Cluster Administration Tool) is an open source
toolkit for the deployment and low-level administration of
HPC cluster environments, small as well as large ones.
xCAT provides simple commands for hardware control, node
discovery, the collection of MAC addresses, and the node de-
ployment with (diskful) or without local (diskless) installation.
The cluster configuration is stored in a relational database.
Node groups for different operating system images can be
defined. Also, user-specific scripts can be executed automati-
cally at installation time.
xCAT provides the following low-level administrative features:

• Remote console support
• Parallel remote shell and remote copy commands
• Plugins for various monitoring tools like Ganglia or Nagios
• Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of nodes
• Automatic configuration of syslog, remote shell, DNS, DHCP, and NTP within the cluster
• Extensive documentation and man pages
For cluster monitoring, we install and configure the open source tool Ganglia or the even more powerful open source solution Nagios, according to the customer's preferences and requirements.
Benchmarking aids customers in sizing and devising the optimal and detailed HPC configuration.
Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours or more if necessary. We make sure that any hardware shipped meets our and our customers' quality requirements. transtec HPC solutions are turnkey solutions. By default, a transtec HPC cluster has everything installed and configured – from hardware and operating system to important middleware components like cluster management or developer tools, and the customer's production applications. Onsite delivery means onsite integration into the customer's production environment, be it establishing network connectivity to the corporate network, or setting up software and configuration parts.
transtec HPC clusters are ready-to-run systems – we deliver, you
turn the key, the system delivers high performance. Every HPC proj-
ect entails transfer to production: IT operation processes and poli-
cies apply to the new HPC system. Effectively, IT personnel is trained
hands-on, introduced to hardware components and software, with
all operational aspects of configuration management.
transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care. transtec offers a variety of support and service options, tailored to the customer's needs. When you need a new installation, a major reconfiguration, or an update of your solution, transtec is able to
support your staff and, if you lack the resources for maintaining
the cluster yourself, maintain the HPC solution for you. From Pro-
fessional Services to Managed Services for daily operations and
required service levels, transtec will be your complete HPC service
and solution provider. transtec’s high standards of performance, re-
liability and dependability assure your productivity and complete
satisfaction.
transtec's HPC Managed Services offer customers the possibility of having the complete management and administration of the HPC cluster handled by transtec service specialists, in an ITIL-compliant way. Moreover, transtec's HPC on Demand services provide access to HPC resources whenever customers need them, for example because owning and running an HPC cluster themselves is not an option, due to lacking infrastructure, know-how, or admin staff.
transtec HPC Cloud Services

Last but not least, transtec's services portfolio evolves as customers' demands change. Starting this year, transtec is able to provide HPC Cloud Services. transtec uses a dedicated datacenter to provide computing power to customers who need more capacity than they own, which is why this workflow model is sometimes called computing-on-demand. With these dynamically provided resources, customers can have their jobs run on HPC nodes in a dedicated datacenter, professionally managed, secured, and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as many codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud and workload management environment, and ready to run.
Alternatively, whenever customers are in need of space for hosting
their own HPC equipment because they do not have the space ca-
pacity or cooling and power infrastructure themselves, transtec is
also able to provide Hosting Services to those customers who’d like
to have their equipment professionally hosted, maintained, and
managed. Customers can thus build up their own private cloud!
Are you interested in any of transtec’s broad range of HPC related
services? Write us an email to [email protected]. We’ll be happy to
hear from you!
High Performance Computing
Services and Customer Care from A to Z
transtec HPC as a Service

You will get a range of applications like LS-Dyna, ANSYS, Gromacs, NAMD etc. from all kinds of areas pre-installed, integrated into an enterprise-ready cloud and workload management system, and ready to run. Do you miss your application?
Ask us: [email protected]
transtec Platform as a Service

You will be provided with dynamically provisioned compute nodes for running your individual code. The operating system will be pre-installed according to your requirements. Common Linux distributions like Red Hat, CentOS, or SLES are the standard. Do you need another distribution?
Ask us: [email protected]
transtec Hosting as a Service

You will be provided with hosting space inside a professionally managed and secured datacenter where you can have your machines hosted, managed, and maintained according to your requirements. Thus, you can build up your own private cloud. What range of hosting and maintenance services do you need?
Tell us: [email protected]
HPC @ transtec: Services and Customer Care from A to Z

transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Performance clusters based on standard components to academic and industry customers across Europe, with all the high quality standards and the customer-centric approach that transtec is well known for.
Every transtec HPC solution is more than just a rack full of hard-
ware – it is a comprehensive solution with everything the HPC user,
owner, and operator need. In the early stages of any customer’s HPC
project, transtec experts provide extensive and detailed consulting to the customer, who benefits from our expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines.
[Service cycle: individual presales consulting → application-, customer-, and site-specific sizing of the HPC solution → burn-in tests of systems → benchmarking of different systems → software & OS installation → application installation → onsite hardware assembly → integration into the customer's environment → customer training → maintenance, support & managed services → continual improvement]
Bright Cluster Manager removes the complexity from the installation, management and use of HPC clusters – on premise or in the cloud. With Bright Cluster Manager, you can easily install, manage and use multiple clusters simultaneously, including compute, Hadoop, storage, database and workstation clusters.
Advanced Cluster Management Made Easy
• Comprehensive cluster monitoring and health checking: automatic sidelining of unhealthy nodes to prevent job failure

Scalability from Deskside to TOP500
• Off-loadable provisioning: enables maximum scalability
• Proven: used on some of the world's largest clusters

Minimum Overhead / Maximum Performance
• Lightweight: single daemon drives all functionality
• Optimized: daemon has minimal impact on operating system and applications
• Efficient: single database for all metric and configuration data

Top Security
• Key-signed repositories: controlled, automated security and other updates
• Encryption option: for external and internal communications
• Safe: X509v3 certificate-based public-key authentication
• Sensible access: role-based access control, complete audit trail
• Protected: firewalls and LDAP
A Unified Approach
Bright Cluster Manager was written from the ground up as a to-
tally integrated and unified cluster management solution. This
fundamental approach provides comprehensive cluster manage-
ment that is easy to use and functionality-rich, yet has minimal
impact on system performance. It has a single, light-weight
daemon, a central database for all monitoring and configura-
tion data, and a single CLI and GUI for all cluster management
functionality. Bright Cluster Manager is extremely easy to use,
scalable, secure and reliable. You can monitor and manage all
aspects of your clusters with virtually no learning curve.
Bright’s approach is in sharp contrast with other cluster man-
agement offerings, all of which take a “toolkit” approach. These
toolkits combine a Linux distribution with many third-party
tools for provisioning, monitoring, alerting, etc. This approach
has critical limitations: these separate tools were not designed
to work together; were often not designed for HPC, nor designed
to scale. Furthermore, each of the tools has its own interface
(mostly command-line based), and each has its own daemon(s)
and database(s). Countless hours of scripting and testing by
highly skilled people are required to get the tools to work for a
specific cluster, and much of it goes undocumented.
Time is wasted, and the cluster is at risk if staff changes occur,
losing the “in-head” knowledge of the custom scripts.
The Bright Advantage

Bright Cluster Manager delivers improved productivity, increased uptime, proven scalability and security, while reducing operating cost:
Rapid Productivity Gains
• Short learning curve: intuitive GUI drives it all
• Quick installation: one hour from bare metal to compute-ready
• Fast, flexible provisioning: incremental, live, disk-full, disk-less, over InfiniBand, to virtual machines, auto node discovery
• Comprehensive monitoring: on-the-fly graphs, Rackview, multiple clusters, custom metrics
• Powerful automation: thresholds, alerts, actions
• Complete GPU support: NVIDIA, AMD1, CUDA, OpenCL
• Full support for Intel Xeon Phi
• On-demand SMP: instant ScaleMP virtual SMP deployment
• Fast customization and task automation: powerful cluster management shell, SOAP and JSON APIs make it easy
• Seamless integration with leading workload managers: Slurm, Open Grid Scheduler, Torque, openlava, Maui2, PBS Professional, Univa Grid Engine, Moab2, LSF
• Integrated (parallel) application development environment
• Easy maintenance: automatically update your cluster from Linux and Bright Computing repositories
• Easily extendable, web-based User Portal
• Cloud-readiness at no extra cost3, supporting the scenarios “Cluster-on-Demand” and “Cluster-Extension”, with data-aware scheduling
• Deploys, provisions, monitors and manages Hadoop clusters
• Future-proof: transparent customization minimizes disruption from staffing changes

Maximum Uptime
• Automatic head node failover: prevents system downtime
• Powerful cluster automation: drives pre-emptive actions based on monitoring thresholds
By selecting a cluster node in the tree on the left and the Tasks tab on the right, you can execute a number of powerful tasks on that node with just a single mouse click.
The cluster installer takes you through the installation process and offers advanced options such as “Express” and “Remote”.
Extensive Development Environment
Bright Cluster Manager provides an extensive HPC development
environment for both serial and parallel applications, including
the following (some are cost options):

• Compilers, including full suites from GNU, Intel, AMD and Portland Group.
• Debuggers and profilers, including the GNU debugger and profiler, TAU, TotalView, Allinea DDT and Allinea MAP.
• GPU libraries, including CUDA and OpenCL.
• MPI libraries, including OpenMPI, MPICH, MPICH2, MPICH-MX, MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with the compilers installed on Bright Cluster Manager, and optimized for high-speed interconnects such as InfiniBand and 10GE.
• Mathematical libraries, including ACML, FFTW, GotoBLAS, MKL and ScaLAPACK.
• Other libraries, including Global Arrays, HDF5, IPP, TBB, NetCDF and PETSc.
Bright Cluster Manager also provides Environment Modules to
make it easy to maintain multiple versions of compilers, libraries
and applications for different users on the cluster, without creating
compatibility conflicts. Each Environment Module file contains the
information needed to configure the shell for an application, and
automatically sets these variables correctly for the particular application when it is loaded. Bright Cluster Manager includes many pre-configured module files for many scenarios, such as combinations of compilers, mathematical and MPI libraries.
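Conceptually, loading a module does little more than prepend an application's directories to the shell's search-path variables; a toy sketch in Python (the `/cm/shared/apps` install layout here is a hypothetical example for illustration, not Bright's actual implementation):

```python
import os

APPS_PREFIX = "/cm/shared/apps"  # hypothetical shared install prefix

def load_module(env, name, version):
    """Toy equivalent of `module load name/version`: prepend the app's
    bin/ and lib/ directories to the relevant search-path variables."""
    root = f"{APPS_PREFIX}/{name}/{version}"
    for var, sub in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        old = env.get(var, "")
        env[var] = f"{root}/{sub}" + (os.pathsep + old if old else "")
    return env

env = load_module({}, "openmpi", "1.6.5")
print(env["PATH"])  # /cm/shared/apps/openmpi/1.6.5/bin
```

Because each load only prepends paths, several versions of the same compiler or library can coexist on the cluster, and whichever module was loaded last wins the path lookup.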
Powerful Image Management and Provisioning
Bright Cluster Manager features sophisticated software image
management and provisioning capability. A virtually unlimited
number of images can be created and assigned to as many dif-
ferent categories of nodes as required. Default or custom Linux
kernels can be assigned to individual images. Incremental
changes to images can be deployed to live nodes without re-
booting or re-installation.
The provisioning system only propagates changes to the images,
minimizing time and impact on system performance and avail-
ability. Provisioning capability can be assigned to any number of
nodes on-the-fly, for maximum flexibility and scalability. Bright
Cluster Manager can also provision over InfiniBand and to ram-
disk or virtual machine.
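The idea of propagating only changes can be sketched as a content comparison between a golden image and a node's current files (a simplified illustration of incremental provisioning in general; Bright's actual mechanism is more sophisticated):

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """SHA-256 of a file's content."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest()

def changed_files(golden, node):
    """Return relative paths in the golden image that are missing from or
    differ on the node - i.e. the only files that need to be pushed."""
    diffs = []
    for root, _, files in os.walk(golden):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, golden)
            dst = os.path.join(node, rel)
            if not os.path.exists(dst) or file_digest(src) != file_digest(dst):
                diffs.append(rel)
    return sorted(diffs)

# Tiny demo: a "golden image" with two files, a "node" holding a stale copy of one.
with tempfile.TemporaryDirectory() as golden, tempfile.TemporaryDirectory() as node:
    for d, content in [(golden, b"v2"), (node, b"v1")]:
        with open(os.path.join(d, "kernel"), "wb") as f:
            f.write(content)
    with open(os.path.join(golden, "config"), "wb") as f:
        f.write(b"new file")
    to_push = changed_files(golden, node)
    print(to_push)  # ['config', 'kernel']
```

Transferring only the differing files is what keeps updates of live nodes fast and minimizes the impact on running applications.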
Comprehensive Monitoring
With Bright Cluster Manager, you can collect, monitor, visual-
ize and analyze a comprehensive set of metrics. Many software
and hardware metrics available to the Linux kernel, and many
hardware management interface metrics (IPMI, DRAC, iLO, etc.)
are sampled.
Ease of Installation
Bright Cluster Manager is easy to install. Installation and testing
of a fully functional cluster from “bare metal” can be complet-
ed in less than an hour. Configuration choices made during the
installation can be modified afterwards. Multiple installation
modes are available, including unattended and remote modes.
Cluster nodes can be automatically identified based on switch
ports rather than MAC addresses, improving speed and reliabil-
ity of installation, as well as subsequent maintenance. All major
hardware brands are supported: Dell, Cray, Cisco, DDN, IBM, HP,
Supermicro, Acer, Asus and more.
Ease of Use
Bright Cluster Manager is easy to use, with two interface op-
tions: the intuitive Cluster Management Graphical User Interface
(CMGUI) and the powerful Cluster Management Shell (CMSH).
The CMGUI is a standalone desktop application that provides
a single system view for managing all hardware and software
aspects of the cluster through a single point of control. Admin-
istrative functions are streamlined as all tasks are performed
through one intuitive, visual interface. Multiple clusters can be
managed simultaneously. The CMGUI runs on Linux, Windows
and OS X, and can be extended using plugins. The CMSH provides
practically the same functionality as the CMGUI, but via a com-
mand-line interface. The CMSH can be used both interactively
and in batch mode via scripts.
Either way, you now have unprecedented flexibility and control
over your clusters.
Support for Linux and Windows
Bright Cluster Manager is based on Linux and is available with
a choice of pre-integrated, pre-configured and optimized Linux
distributions, including SUSE Linux Enterprise Server, Red Hat
Enterprise Linux, CentOS and Scientific Linux. Dual-boot installa-
tions with Windows HPC Server are supported as well, allowing
nodes to either boot from the Bright-managed Linux head node,
or the Windows-managed head node.
Advanced Cluster Management Made Easy
Easy-to-use, complete and scalable
Cluster metrics, such as GPU, Xeon Phi and CPU temperatures, fan speeds and network statistics can be visualized by simply dragging and dropping them into a graphing window. Multiple metrics can be combined in one graph and graphs can be zoomed into. A Graphing wizard allows creation of all graphs for a selected combination of metrics and nodes. Graph layout and color configurations can be tailored to your requirements and stored for re-use.
The Overview tab provides instant, high-level insight into the status of the cluster.
Cloud Bursting With Bright
Bright Cluster Manager unleashes and manages the unlimited power of the cloud
Create a complete cluster in Amazon EC2, or easily extend your
onsite cluster into the cloud, enabling you to dynamically add
capacity and manage these nodes as part of your onsite cluster.
Both can be achieved in just a few mouse clicks, without the
need for expert knowledge of Linux or cloud computing.
Bright Cluster Manager’s unique data aware scheduling capabil-
ity means that your data is automatically in place in EC2 at the
start of the job, and that the output data is returned as soon as
the job is completed.
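The data-aware flow described above — stage input data in, run the job only once the data is in place, then return the output — can be sketched as follows. The function and argument names are hypothetical, not Bright's actual interface:

```python
# Sketch of the data-aware scheduling idea: stage input data to the
# cloud, start the job only after staging has completed, then return
# the output. Names are illustrative, not Bright's documented API.

def run_data_aware(job, stage_in, execute, stage_out, log):
    log.append(f"stage-in:{job}")
    stage_in(job)             # transfer input data to EC2
    log.append(f"run:{job}")
    execute(job)              # job starts only after data is in place
    log.append(f"stage-out:{job}")
    stage_out(job)            # results returned when the job completes
    return log

events = run_data_aware(
    "job42",
    stage_in=lambda j: None,
    execute=lambda j: None,
    stage_out=lambda j: None,
    log=[],
)
print(events)  # ['stage-in:job42', 'run:job42', 'stage-out:job42']
```

The point of the ordering is that the user never waits for, or monitors, the transfer: submission and staging are decoupled from job start.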
With Bright Cluster Manager, every cluster is cloud-ready, at no
extra cost. The same powerful cluster provisioning, monitoring,
scheduling and management capabilities that Bright Cluster
Manager provides to onsite clusters extend into the cloud.
The Bright advantage for cloud bursting
• Ease of use: Intuitive GUI virtually eliminates the user learning
curve; no need to understand Linux or EC2 to manage the system.
Alternatively, the cluster management shell provides powerful
scripting capabilities to automate tasks
• Complete management solution: Installation/initialization,
provisioning, monitoring, scheduling and management in
one integrated environment
• Integrated workload management: Wide selection of work-
load managers included and automatically configured with
local, cloud and mixed queues
• Single system view; complete visibility and control: Cloud com-
pute nodes managed as elements of the on-site cluster; visible
from a single console with drill-downs and historic data
• Efficient data management via data-aware scheduling: Auto-
matically ensures data is in position at start of computation;
delivers results back when complete
• Secure, automatic gateway: Mirrored LDAP and DNS services
over an automatically-created VPN connect local and cloud-
based nodes for secure communication
• Cost savings: More efficient use of cloud resources; support
for spot instances; minimal user intervention
Two cloud bursting scenarios
Bright Cluster Manager supports two cloud bursting scenarios:
“Cluster-on-Demand” and “Cluster Extension”.
Scenario 1: Cluster-on-Demand
The Cluster-on-Demand scenario is ideal if you do not have a
cluster onsite, or need to set up a totally separate cluster. With
just a few mouse clicks you can instantly create a complete clus-
ter in the public cloud, for any duration of time.
Scenario 2: Cluster Extension
The Cluster Extension scenario is ideal if you have a cluster onsite
but need more compute power, including GPUs. With Bright
Cluster Manager, you can instantly add EC2-based resources to
your onsite cluster, for any duration of time. Bursting into the
cloud is as easy as adding nodes to an onsite cluster — there are
only a few additional steps after providing the public cloud ac-
count information to Bright Cluster Manager.
The Bright approach to managing and monitoring a cluster in
the cloud provides complete uniformity, as cloud nodes are man-
aged and monitored the same way as local nodes:
• Load-balanced provisioning
• Software image management
• Integrated workload management
• Interfaces — GUI, shell, user portal
• Monitoring and health checking
• Compilers, debuggers, MPI libraries, mathematical libraries
and environment modules
Bright Cluster Manager also provides additional features that
are unique to the cloud.
Amazon spot instance support
Bright Cluster Manager enables users to take advantage of the
cost savings offered by Amazon’s spot instances. Users can spec-
ify the use of spot instances, and Bright will automatically sched-
ule them as they become available, reducing the cost to compute
without the need to monitor spot prices and schedule manually.
Hardware Virtual Machine (HVM) virtualization
Bright Cluster Manager automatically initializes all Amazon
instance types, including the Cluster Compute and Cluster GPU
instances that rely on HVM virtualization.
Data aware scheduling
Data aware scheduling ensures that input data is transferred to
the cloud and made accessible just prior to the job starting, and
that the output data is transferred back. There is no need to wait
for (and monitor) the data to load prior to submitting jobs (de-
laying entry into the job queue), nor any risk of starting the job
before the data transfer is complete (crashing the job). Users sub-
mit their jobs, and Bright’s data aware scheduling does the rest.
Bright Cluster Manager provides the choice:
“To Cloud, or Not to Cloud”
Not all workloads are suitable for the cloud. The ideal situation
for most organizations is to have the ability to choose between
onsite clusters for jobs that require low-latency communication,
complex I/O or sensitive data; and cloud clusters for many other
types of jobs.
Bright Cluster Manager delivers the best of both worlds: a pow-
erful management solution for local and cloud clusters, with
the ability to easily extend local clusters into the cloud without
compromising on provisioning, monitoring or managing the
cloud resources.
High performance meets efficiency
Initially, massively parallel systems constitute a challenge to
both administrators and users. They are complex beasts. Anyone
building HPC clusters will need to tame the beast, master the
complexity and present users and administrators with an easy-
to-use, easy-to-manage system landscape.
Leading HPC solution providers such as transtec achieve this
goal. They hide the complexity of HPC under the hood and match
high performance with efficiency and ease-of-use for both users
and administrators. The “P” in “HPC” gains a double meaning:
“Performance” plus “Productivity”.
Cluster and workload management software like Moab HPC
Suite, Bright Cluster Manager or QLogic IFS provide the means
to master and hide the inherent complexity of HPC systems.
For administrators and users, HPC clusters are presented as
single, large machines, with many different tuning parameters.
The software also provides a unified view of existing clusters
whenever unified management is added as a requirement by the
customer at any point in time after the first installation. Thus,
daily routine tasks such as job management, user management,
queue partitioning and management can be performed easily
with either graphical or web-based tools, without any advanced
scripting skills or technical expertise required from the adminis-
trator or user.
One or more cloud nodes can be configured under the Cloud Settings tab.
The Add Cloud Provider Wizard and the Node Creation Wizard make the cloud bursting process easy, also for users with no cloud experience.
Examples include CPU, GPU and Xeon Phi temperatures, fan
speeds, switches, hard disk SMART information, system load,
memory utilization, network metrics, storage metrics, power
systems statistics, and workload management metrics. Custom
metrics can also easily be defined.
Metric sampling is done very efficiently — in one process, or out-
of-band where possible. You have full flexibility over how and
when metrics are sampled, and historic data can be consolidat-
ed over time to save disk space.
Cluster Management Automation
Cluster management automation takes pre-emptive actions
when predetermined system thresholds are exceeded, saving
time and preventing hardware damage. Thresholds can be con-
figured on any of the available metrics. The built-in configura-
tion wizard guides you through the steps of defining a rule:
selecting metrics, defining thresholds and specifying actions.
For example, a temperature threshold for GPUs can be estab-
lished that results in the system automatically shutting down an
overheated GPU unit and sending a text message to your mobile
phone. Several predefined actions are available, but any built-in
cluster management command, Linux command or script can be
used as an action.
Comprehensive GPU Management
Bright Cluster Manager radically reduces the time and effort
of managing GPUs, and fully integrates these devices into the
single view of the overall system. Bright includes powerful GPU
management and monitoring capability that leverages function-
ality in NVIDIA Tesla and AMD GPUs.
You can easily assume maximum control of the GPUs and gain
instant and time-based status insight. Depending on the GPU
make and model, Bright monitors a full range of GPU metrics,
including:
• GPU temperature, fan speed, utilization
• GPU exclusivity, compute, display, persistence mode
• GPU memory utilization, ECC statistics
• Unit fan speed, serial number, temperature, power usage,
voltages and currents, LED status, firmware
• Board serial, driver version, PCI info
Beyond metrics, Bright Cluster Manager features built-in support
for GPU computing with CUDA and OpenCL libraries. Switching
between current and previous versions of CUDA and OpenCL has
also been made easy.
Full Support for Intel Xeon Phi
Bright Cluster Manager makes it easy to set up and use the Intel
Xeon Phi coprocessor. Bright includes everything that is needed
to get Phi to work, including a setup wizard in the CMGUI. Bright
ensures that your software environment is set up correctly, so
that the Intel Xeon Phi coprocessor is available for applications
that are able to take advantage of it.
Bright collects and displays a wide range of metrics for Phi, en-
suring that the coprocessor is visible and manageable as a de-
vice type, as well as including Phi as a resource in the workload
management system. Bright’s pre-job health checking ensures
that Phi is functioning properly before directing tasks to the co-
processor.
Furthermore, Bright allows you to take advantage of the two
execution models available for the Intel MIC architecture and
makes the workload manager aware of the models:
• Native – the application runs exclusively on the coprocessor
without involving the host CPU.
• Offload – the application runs on the host and some specific
computations are offloaded to the coprocessor.
Multi-Tasking Via Parallel Shell
The parallel shell allows simultaneous execution of multiple
commands and scripts across the cluster as a whole, or across
easily definable groups of nodes. Output from the executed com-
mands is displayed in a convenient way with variable levels of
verbosity. Running commands and scripts can be killed easily if
necessary. The parallel shell is available through both the CMGUI
and the CMSH.
The automation configuration wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions.
The parallel shell allows for simultaneous execution of commands or scripts across node groups or across the entire cluster.
The status of cluster nodes, switches, other hardware, as well as up to six metrics can be visualized in the Rackview. A zoomout option is available for clusters with many racks.
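The threshold rules behind cluster management automation can be sketched as a simple evaluation loop: a rule ties a metric and a threshold to an action. This is only an illustration of the mechanism; the metric name, sample layout and action are hypothetical:

```python
# Sketch of an automation rule: when a sampled metric crosses a
# threshold, run a configured action. The real product wires such
# rules up via its configuration wizard; names here are made up.

class Rule:
    def __init__(self, metric, threshold, action):
        self.metric = metric
        self.threshold = threshold
        self.action = action

    def evaluate(self, samples):
        """Run the action for every node whose metric exceeds the threshold."""
        fired = []
        for node, metrics in samples.items():
            if metrics.get(self.metric, 0) > self.threshold:
                self.action(node)
                fired.append(node)
        return fired

alerts = []
rule = Rule("gpu_temp", 85, lambda node: alerts.append(f"shutdown {node}"))
samples = {"node001": {"gpu_temp": 92}, "node002": {"gpu_temp": 70}}
fired = rule.evaluate(samples)
print(fired)   # ['node001']
print(alerts)  # ['shutdown node001']
```

In the product, the action could equally be any cluster management command, Linux command or script, as described above.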
Integrated SMP Support
Bright Cluster Manager — Advanced Edition dynamically aggre-
gates multiple cluster nodes into a single virtual SMP node, us-
ing ScaleMP’s Versatile SMP™ (vSMP) architecture. Creating and
dismantling a virtual SMP node can be achieved with just a few
clicks within the CMGUI. Virtual SMP nodes can also be launched
and dismantled automatically using the scripting capabilities of
the CMSH.
In Bright Cluster Manager a virtual SMP node behaves like any
other node, enabling transparent, on-the-fly provisioning, con-
figuration, monitoring and management of virtual SMP nodes as
part of the overall system management.
Maximum Uptime with Failover
Bright Cluster Manager — Advanced Edition allows two head
nodes to be configured in active-active failover mode. Both
head nodes are on active duty, but if one fails, the other takes
over all tasks, seamlessly.
Maximum Uptime with Health Checking
Bright Cluster Manager — Advanced Edition includes a power-
ful cluster health checking framework that maximizes system
uptime. It continually checks multiple health indicators for
all hardware and software components and proactively initi-
ates corrective actions. It can also automatically perform a se-
ries of standard and user-defined tests just before starting a
new job, to ensure a successful execution, and preventing the
“black hole node syndrome”. Examples of corrective actions
include autonomous bypass of faulty nodes, automatic job
requeuing to avoid queue flushing, and process “jailing” to al-
locate, track, trace and flush completed user processes. The
health checking framework ensures the highest job through-
put, the best overall cluster efficiency and the lowest admin-
istration overhead.
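The pre-job health checking idea — run a series of checks against each node and only hand healthy nodes to the workload manager, avoiding the "black hole node syndrome" — can be sketched as follows. The check names and node representation are hypothetical:

```python
# Sketch of pre-job health checking: a node is schedulable only if
# every health check passes. Check names are illustrative; the real
# framework ships standard checks and supports user-defined ones.

def node_is_healthy(node, checks):
    """Return True only if every health check passes for the node."""
    return all(check(node) for check in checks)

def schedulable_nodes(nodes, checks):
    """Filter out nodes that would silently fail incoming jobs."""
    return [n for n in nodes if node_is_healthy(n, checks)]

checks = [
    lambda n: n["disk_ok"],       # e.g. a SMART status check
    lambda n: n["fans_running"],  # e.g. an "AllFansRunning" check
]
nodes = [
    {"name": "node001", "disk_ok": True, "fans_running": True},
    {"name": "node002", "disk_ok": True, "fans_running": False},
]
ready = [n["name"] for n in schedulable_nodes(nodes, checks)]
print(ready)  # ['node001']
```

Filtering before job placement is what prevents a single broken node from swallowing and failing a whole queue of jobs.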
Top Cluster Security
Bright Cluster Manager offers an unprecedented level of secu-
rity that can easily be tailored to local requirements. Security
features include:
• Automated security and other updates from key-signed Linux
and Bright Computing repositories.
• Encrypted internal and external communications.
• X509v3 certificate-based public-key authentication to the
cluster management infrastructure.
• Role-based access control and complete audit trail.
• Firewalls, LDAP and SSH.
User and Group Management
Users can be added to the cluster through the CMGUI or the
CMSH. Bright Cluster Manager comes with a pre-configured
LDAP database, but an external LDAP service, or alternative au-
thentication system, can be used instead.
Web-Based User Portal
The web-based User Portal provides read-only access to essen-
tial cluster information, including a general overview of the clus-
ter status, node hardware and software properties, workload
manager statistics and user-customizable graphs. The User Por-
tal can easily be customized and expanded using PHP and the
SOAP or JSON APIs.
Integrated Workload Management
Bright Cluster Manager is integrated with a wide selection of
free and commercial workload managers. This integration pro-
vides a number of benefits:
• The selected workload manager gets automatically installed
and configured
• Many workload manager metrics are monitored
• The CMGUI and User Portal provide a user-friendly interface
to the workload manager
• The CMSH and the SOAP & JSON APIs provide direct and
powerful access to a number of workload manager commands
and metrics
• Reliable workload manager failover is properly configured
• The workload manager is continuously made aware of the
health state of nodes (see section on Health Checking)
• The workload manager is used to save power through auto
power on/off based on workload
• The workload manager is used for data-aware scheduling
of jobs to the cloud
The following user-selectable workload managers are tightly in-
tegrated with Bright Cluster Manager:
• PBS Professional, Univa Grid Engine, Moab, LSF
• Slurm, openlava, Open Grid Scheduler, Torque, Maui
Alternatively, other workload managers, such as LoadLeveler
and Condor, can be installed on top of Bright Cluster Manager.
Workload management queues can be viewed and configured from the GUI, without the need for workload management expertise.
Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the GUI or a single command in the cluster management shell.
Example graphs that visualize metrics on a GPU cluster.
Multi-Cluster Capability
Bright Cluster Manager — Advanced Edition is ideal for organiza-
tions that need to manage multiple clusters, either in one or in
multiple locations. Capabilities include:
• All cluster management and monitoring functionality is
available for all clusters through one GUI.
• Selecting any set of configurations in one cluster and ex-
porting them to any or all other clusters with a few mouse
clicks.
• Metric visualizations and summaries across clusters.
• Making node images available to other clusters.
Fundamentally API-Based
Bright Cluster Manager is fundamentally API-based, which
means that any cluster management command and any piece
of cluster management data — whether it is monitoring data
or configuration data — is available through the API. Both a
SOAP and a JSON API are available, and interfaces for various
programming languages, including C++, Python and PHP, are
provided.
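An API-first design means every management operation is a serializable call. As an illustration only — the request shape below is hypothetical, not Bright's documented wire format — a JSON-style management call might be built and parsed like this:

```python
import json

# Illustrative only: the document states that all management data is
# reachable through SOAP and JSON APIs, but the field names and the
# "cmdevice"/"getStatus" call below are assumptions, not the real API.

def build_request(service, call, args):
    """Serialize a JSON-RPC-style management call."""
    return json.dumps({"service": service, "call": call, "args": args})

def parse_response(payload):
    """Decode a JSON response into a plain dict."""
    return json.loads(payload)

req = build_request("cmdevice", "getStatus", ["node001"])

# A response would be decoded the same way it was built:
resp = parse_response('{"status": "UP", "node": "node001"}')
print(resp["status"])  # UP
```

Because monitoring and configuration data share one API, any scripting language with a JSON library can drive the cluster.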
Cloud Utilization
Bright Cluster Manager supports two cloud utilization sce-
narios for Amazon EC2:
• Cluster-on-Demand – running stand-alone clusters in the
cloud; and
• Cluster Extension – adding cloud-based resources to exist-
ing, onsite clusters and managing these cloud nodes as if
they were local.
Both scenarios utilize the full range of Bright’s capabilities and can
be achieved in a few simple steps. In the Cluster Extension scenar-
io, two additional capabilities significantly enhance productivity:
• Data-Aware Scheduling – this ensures that the workload
manager automatically transfers the input data to the cloud
in time for the associated job to start. Output data is automati-
cally transferred back to the on-premise head node.
• Automatic Cloud Resizing – this allows you to specify
policies for automatically increasing and decreasing the
number of cloud nodes based on the load in your queues.
Bright supports Amazon VPC setups, which allow com-
pute nodes in EC2 to be placed in an isolated network,
thereby separating them from the outside world. It is even
possible to route part of a local corporate IP network to a
VPC subnet in EC2, so that local nodes and nodes in EC2
can communicate without any effort.
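An automatic cloud-resizing policy boils down to mapping queue load to a desired node count within configured bounds. A minimal sketch, assuming a simple one-node-per-queued-job policy (the actual policy language is not specified here):

```python
# Sketch of automatic cloud resizing: grow or shrink the number of
# cloud nodes with queue depth, clamped to configured bounds. The
# one-node-per-queued-job policy is an assumption for illustration.

def target_cloud_nodes(queued_jobs, min_nodes, max_nodes):
    """Desired number of cloud nodes for the current queue depth."""
    return max(min_nodes, min(queued_jobs, max_nodes))

# Heavy queue: scale up to the configured maximum.
print(target_cloud_nodes(queued_jobs=12, min_nodes=0, max_nodes=8))  # 8
# Empty queue: scale back down to the minimum, saving cloud costs.
print(target_cloud_nodes(queued_jobs=0, min_nodes=0, max_nodes=8))   # 0
```

The clamp to a maximum caps spend during demand surges, while the minimum lets you keep a warm pool of cloud nodes if desired.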
Hadoop Cluster Management
Bright Cluster Manager is the ideal basis for Hadoop clusters.
Bright installs on bare metal, configuring a fully operational
Hadoop cluster in less than one hour. In the process, Bright pre-
pares your Hadoop cluster for use by provisioning the operating
system and the general cluster management and monitoring ca-
pabilities required on any cluster.
The web-based User Portal provides read-only access to essential cluster in-formation, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs.
Bright Cluster Manager can manage multiple clusters simultaneously. This overview shows clusters in Oslo, Abu Dhabi and Houston, all managed through one GUI.
“The building blocks for transtec HPC solutions
must be chosen according to our goals ease-of-
management and ease-of-use. With Bright Cluster
Manager, we are happy to have the technology
leader at hand, meeting these requirements, and
our customers value that.”
Armin Jäger HPC Solution Engineer
Bright then manages and monitors your Hadoop cluster’s
hardware and system software throughout its life-cycle,
collecting and graphically displaying a full range of Hadoop
metrics from the HDFS, RPC and JVM sub-systems. Bright sig-
nificantly reduces setup time for Cloudera, Hortonworks and
other Hadoop stacks, and increases both uptime and Map-
Reduce job throughput.
This functionality is scheduled to be further enhanced in
upcoming releases of Bright, including dedicated manage-
ment roles and profiles for name nodes, data nodes, as well
as advanced Hadoop health checking and monitoring func-
tionality.
Standard and Advanced Editions
Bright Cluster Manager is available in two editions: Standard
and Advanced. The table on this page lists the differences. You
can easily upgrade from the Standard to the Advanced Edition as
your cluster grows in size or complexity.
Documentation and Services
A comprehensive system administrator manual and user man-
ual are included in PDF format. Standard and tailored services
are available, including various levels of support, installation,
training and consultancy.
Feature Standard Advanced
Choice of Linux distributions x x
Intel Cluster Ready x x
Cluster Management GUI x x
Cluster Management Shell x x
Web-Based User Portal x x
SOAP API x x
Node Provisioning x x
Node Identification x x
Cluster Monitoring x x
Cluster Automation x x
User Management x x
Role-based Access Control x x
Parallel Shell x x
Workload Manager Integration x x
Cluster Security x x
Compilers x x
Debuggers & Profilers x x
MPI Libraries x x
Mathematical Libraries x x
Environment Modules x x
Cloud Bursting x x
Hadoop Management & Monitoring x x
NVIDIA CUDA & OpenCL - x
GPU Management & Monitoring - x
Xeon Phi Management & Monitoring - x
ScaleMP Management & Monitoring - x
Redundant Failover Head Nodes - x
Cluster Health Checking - x
Off-loadable Provisioning - x
Multi-Cluster Management - x
Suggested Number of Nodes 4–128 129–10,000+
Standard Support x x
Premium Support Optional Optional
Cluster health checks can be visualized in the Rackview. This screenshot shows that GPU unit 41 fails a health check called “AllFansRunning”.
While all HPC systems face challenges in workload demand, resource complexity, and scale, enterprise HPC systems face more stringent challenges and expectations. Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-oriented research and academic organizations. They have complex SLAs and priorities to balance. Their HPC workloads directly impact the revenue, product delivery, and organizational objectives of their organizations.
Intelligent HPC Workload Management
Enterprise HPC Challenges
Enterprise HPC systems must eliminate job delays and failures.
They are also seeking to improve resource utilization and man-
agement efficiency across multiple heterogeneous systems. To
maximize user productivity, they are required to make it easier
to access and use HPC resources for users, or even expand to
other clusters or an HPC cloud to better handle workload de-
mand surges.
Intelligent Workload Management
As today’s leading HPC facilities move beyond petascale towards
exascale systems that incorporate increasingly sophisticated and
specialized technologies, equally sophisticated and intelligent
management capabilities are essential. With a proven history of
managing the most advanced, diverse, and data-intensive sys-
tems in the Top500, Moab continues to be the preferred workload
management solution for next generation HPC facilities.
Moab HPC Suite – Basic Edition
Moab HPC Suite – Basic Edition is an intelligent workload man-
agement solution. It automates the scheduling, managing,
monitoring, and reporting of HPC workloads on massive scale,
multi-technology installations. The patented Moab intelligence
engine uses multi-dimensional policies to accelerate running
workloads across the ideal combination of diverse resources.
These policies balance high utilization and throughput goals
with competing workload priorities and SLA requirements. The
speed and accuracy of the automated scheduling decisions
optimizes workload throughput and resource utilization. This
gets more work accomplished in less time, and in the right prio-
rity order. Moab HPC Suite - Basic Edition optimizes the value
and satisfaction from HPC systems while reducing manage-
ment cost and complexity.
Moab HPC Suite – Enterprise Edition
Moab HPC Suite – Enterprise Edition provides enterprise-ready HPC
workload management that accelerates productivity, automates
workload uptime, and consistently meets SLAs and business prior-
ities for HPC systems and HPC cloud. It uses the battle-tested and
patented Moab intelligence engine to automatically balance the
complex, mission-critical workload priorities of enterprise HPC
systems. Enterprise customers benefit from a single integrated
product that brings together key enterprise HPC capabilities, im-
plementation services, and 24x7 support. This speeds the realiza-
tion of benefits from the HPC system for the business including:
• Higher job throughput
• Massive scalability for faster response and system expansion
• Optimum utilization of 90–99% on a consistent basis
• Fast, simple job submission and management to increase
productivity
Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.
• Reduced cluster management complexity and support costs
across heterogeneous systems
• Reduced job failures and auto-recovery from failures
• SLAs consistently met for improved user satisfaction
• Reduced power usage and costs by 10–30%
Productivity Acceleration
Moab HPC Suite – Enterprise Edition gets more results deliv-
ered faster from HPC resources at lower costs by accelerating
overall system, user and administrator productivity. Moab pro-
vides the scalability, 90-99 percent utilization, and simple job
submission that is required to maximize the productivity of
enterprise HPC systems. Enterprise use cases and capabilities
include:
• Massive multi-point scalability to accelerate job response
and throughput, including high-throughput computing
• Workload-optimized allocation policies and provisioning
to get more results out of existing heterogeneous resources
and reduce costs, including topology-based allocation
• Unified workload management across heterogeneous clus-
ters to maximize resource availability and administration
efficiency by managing them as one cluster
• Optimized, intelligent scheduling packs workloads and
backfills around priority jobs and reservations while balanc-
ing SLAs to efficiently use all available resources
• Optimized scheduling and management of accelerators,
both Intel MIC and GPGPUs, for jobs to maximize their utili-
zation and effectiveness
• Simplified job submission and management with advanced
job arrays, self-service portal, and templates
• Administrator dashboards and reporting tools reduce man-
agement complexity, time, and costs
• Workload-aware auto-power management reduces energy
use and costs by 10–30 percent
Uptime Automation
Job and resource failures in enterprise HPC systems lead to de-
layed results, missed organizational opportunities, and missed
objectives. Moab HPC Suite – Enterprise Edition intelligently au-
tomates workload and resource uptime in the HPC system to en-
sure that workload completes successfully and reliably, avoiding
these failures. Enterprises can benefit from:
• Intelligent resource placement to prevent job failures, with
granular resource modeling to meet workload requirements
and avoid at-risk resources
• Auto-response to failures and events with configurable ac-
tions tied to pre-failure conditions, amber alerts, or other
metrics and monitors
• Workload-aware future maintenance scheduling that helps
maintain a stable HPC system without disrupting workload
productivity
• Real-world expertise for fast time-to-value and system
uptime, with implementation, training, and 24x7 remote
support services included
Auto SLA Enforcement
Moab HPC Suite – Enterprise Edition uses the powerful Moab
intelligence engine to optimally schedule and dynamically
adjust workload to consistently meet service level agree-
ments (SLAs), guarantees, or business priorities. This auto-
matically ensures that the right workloads are completed at
the optimal times, taking into account the complex mix of
departments, priorities and SLAs to be balanced. Moab
provides:
• Usage accounting and budget enforcement that schedules
resources and reports on usage in line with resource shar-
ing agreements and precise budgets (includes usage limits,
usage reports, auto budget management and dynamic fair-
share policies)
• SLA and priority policies that make sure the highest priority
workloads are processed first (includes Quality of Service,
hierarchical priority weighting)
• Continuous plus future scheduling that ensures priorities
and guarantees are proactively met as conditions and work-
loads change (i.e. future reservations, pre-emption, etc.)
“With Moab HPC Suite, we can meet very demand-
ing customers’ requirements as regards unified
management of heterogeneous cluster environ-
ments, grid management, and provide them with
flexible and powerful configuration and reporting
options. Our customers value that highly.”
Thomas Gebert, HPC Solution Architect
Grid- and Cloud-Ready Workload Management
The benefits of a traditional HPC environment can be extended
to more efficiently manage and meet workload demand with
the multi-cluster grid and HPC cloud management capabilities
in Moab HPC Suite – Enterprise Edition. It allows you to:
• Showback or chargeback for pay-for-use, so actual resource
usage is tracked with flexible chargeback rates and report-
ing by user, department, cost center, or cluster
• Manage and share workload across multiple remote clusters
to meet growing workload demand or surges by adding on
the Moab HPC Suite – Grid Option.
Moab HPC Suite – Grid Option
The Moab HPC Suite – Grid Option is a powerful grid-workload
management solution that includes scheduling, advanced poli-
cy management, and tools to control all the components of ad-
vanced grids. Unlike other grid solutions, Moab HPC Suite – Grid
Option connects disparate clusters into a logical whole, enabling
grid administrators and grid policies to have sovereignty over all
systems while preserving control at the individual cluster level.
Moab HPC Suite - Grid Option has powerful applications that al-
low organizations to consolidate reporting; information gather-
ing; and workload, resource, and data management. Moab HPC
Suite - Grid Option delivers these services in a near-transparent
way: users are unaware they are using grid resources—they
know only that they are getting work done faster and more eas-
ily than ever before.
Moab manages many of the largest clusters and grids in the
world. Moab technologies are used broadly across Fortune 500
companies and manage more than a third of the compute cores
in the top 100 systems of the Top500 supercomputers. Adaptive
Computing is a globally trusted ISV (independent software ven-
dor), and the full scalability and functionality that Moab HPC
Suite with the Grid Option offers in a single integrated solution
has traditionally made it a significantly more cost-effective op-
tion than other tools on the market today.
Components
The Moab HPC Suite – Grid Option is available as an add-on to
the Moab HPC Suite Basic and Enterprise Editions. It extends
the capabilities and functionality of the Moab HPC Suite compo-
nents to manage grid environments, including:
• Moab Workload Manager for Grids – a policy-based work-
load management and scheduling multi-dimensional deci-
sion engine
• Moab Cluster Manager for Grids – a powerful and unified
graphical administration tool for monitoring, managing and
reporting across multiple clusters
• Moab Viewpoint™ for Grids – a web-based self-service
end-user job-submission and management portal and ad-
ministrator dashboard
Benefits
• Unified management across heterogeneous clusters provides the ability to move quickly from cluster to optimized grid
• Policy-driven and predictive scheduling ensures that jobs start and run in the fastest time possible by selecting optimal resources
• Flexible policy and decision engine adjusts workload processing at both grid and cluster level
• Grid-wide interface and reporting tools provide a view of grid resources, status and usage charts, and trends over time for capacity planning, diagnostics, and accounting
• Advanced administrative control allows various business units to access and view grid resources, regardless of physical or organizational boundaries, or alternatively restricts access to resources by specific departments or entities
• Scalable architecture to support peta-scale, high-throughput computing and beyond
Grid Control with Automated Tasks, Policies, and Reporting
• Guarantee that the most critical work runs first with flexible global policies that respect local cluster policies but continue to support grid service-level agreements
• Ensure availability of key resources at specific times with advanced reservations
• Tune policies prior to rollout with cluster- and grid-level simulations
• Use a global view of all grid operations for self-diagnosis, planning, reporting, and accounting across all resources, jobs, and clusters
Moab HPC Suite – Grid Option can be flexibly configured for centralized, centralized-and-local, or peer-to-peer grid policies, decisions, and rules. It can manage multiple resource managers and multiple Moab instances.
Harnessing the Power of Intel Xeon Phi and GPGPUs
The mathematical acceleration delivered by many-core General Purpose Graphics Processing Units (GPGPUs) offers significant performance advantages for many classes of numerically intensive applications. Parallel computing tools such as NVIDIA's CUDA and the OpenCL framework have made harnessing the power of these technologies much more accessible to developers, resulting in the increasing deployment of hybrid GPGPU-based systems and introducing significant challenges for administrators and workload management systems.
The new Intel Xeon Phi coprocessors offer breakthrough performance for highly parallel applications with the benefit of leveraging existing code and flexibility in programming models for faster application development. Moving an application to Intel Xeon Phi coprocessors has the promising benefit of requiring substantially less effort and lower power usage. Such a small investment can reap big performance gains for both data-intensive and numerical or scientific computing applications that can make the most of the Intel Many Integrated Cores (MIC) technology. So it is no surprise that organizations are quickly embracing these new coprocessors in their HPC systems.
The main goal is to create breakthrough discoveries, products
and research that improve our world. Harnessing and optimiz-
ing the power of these new accelerator technologies is key to
doing this as quickly as possible. Moab HPC Suite helps organi-
zations easily integrate accelerator and coprocessor technolo-
gies into their HPC systems, optimizing their utilization for the
right workloads and problems they are trying to solve.
Moab HPC Suite automatically schedules hybrid systems incor-
porating Intel Xeon Phi coprocessors and GPGPU accelerators,
optimizing their utilization as just another resource type with
policies. Organizations can choose the accelerators that best
meet their different workload needs.
Managing Hybrid Accelerator Systems
Hybrid accelerator systems add a new dimension of management complexity when allocating workloads to available resources. In addition to the traditional needs of aligning workload placement with software stack dependencies, CPU type, memory, and interconnect requirements, intelligent workload management systems now need to consider:
• The workload's ability to exploit Xeon Phi or GPGPU technology
• Additional software dependencies, including topology-based allocation
• The current health and usage status of available Xeon Phis or GPGPUs
• Resource configuration for type and number of Xeon Phis or GPGPUs
Moab HPC Suite automatically detects which types of accelerators are present where in the system, reducing the management effort and costs as these processors are introduced and maintained together in an HPC system. This gives customers maximum choice and performance in selecting the accelerators that work best for each of their workloads, whether Xeon Phi, GPGPU, or a hybrid mix of the two.
Moab HPC Suite optimizes accelerator utilization with policies that ensure the right ones get used for the right user and group jobs at the right time.
Optimizing Intel Xeon Phi and GPGPU Utilization
System and application diversity is one of the first issues workload management software must address in optimizing accelerator utilization. The ability of current codes to use GPGPUs and Xeon Phis effectively, ongoing cluster expansion, and costs usually mean only a portion of a system will be equipped with one or a mix of both types of accelerators.
Moab HPC Suite has a twofold role in reducing the complexity of their administration and optimizing their utilization. First, it must be able to automatically detect new GPGPUs or Intel Xeon Phi coprocessors in the environment, and their availability, without the cumbersome burden of manual administrator configuration. Second, and most importantly, it must accurately match Xeon Phi-bound or GPGPU-enabled workloads with the appropriately equipped resources, in addition to managing contention for those limited resources according to organizational policy. Moab's powerful allocation and prioritization policies can ensure the right jobs from users and groups get to the optimal accelerator resources at the right time. This keeps the accelerators at peak utilization for the right priority jobs.
Moab's policies are a powerful ally to administrators in automatically determining the optimal Xeon Phi coprocessor or GPGPU to use, and which ones to avoid, when scheduling jobs. These allocation policies can be based on any of the GPGPU or Xeon Phi metrics, such as memory (to ensure job needs are met), temperature (to avoid hardware errors or failures), utilization, or other metrics.
Use the following metrics in policies to optimize allocation of Intel Xeon Phi coprocessors:
• Number of Cores
• Number of Hardware Threads
• Physical Memory
• Free Physical Memory
• Swap Memory
• Free Swap Memory
• Max Frequency (in MHz)

Use the following metrics in policies to optimize allocation of GPGPUs and for management diagnostics:
• Error counts: single/double-bit, commands to reset counts
• Temperature
• Fan speed
• Memory: total, used, utilization
• Utilization percent
• Metrics time stamp
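To make the idea concrete, here is a minimal Python sketch of metric-driven accelerator selection in the spirit described above. The field names (`mem_free_mb`, `temp_c`, `util_pct`), thresholds, and ranking order are illustrative assumptions, not Moab's actual policy syntax.

```python
# Illustrative sketch (not Moab's actual configuration): rank available
# GPGPUs by the health and capacity metrics listed above, filtering out
# devices that are too hot or lack free memory for the job.

def pick_gpgpu(gpus, mem_needed_mb, max_temp_c=85):
    """Return the best GPGPU dict for a job, or None if none qualifies."""
    candidates = [
        g for g in gpus
        if g["mem_free_mb"] >= mem_needed_mb and g["temp_c"] < max_temp_c
    ]
    # Prefer lower utilization, then lower temperature; free memory breaks ties.
    candidates.sort(key=lambda g: (g["util_pct"], g["temp_c"], -g["mem_free_mb"]))
    return candidates[0] if candidates else None

gpus = [
    {"id": "gpu0", "mem_free_mb": 2048, "temp_c": 88, "util_pct": 10},  # too hot
    {"id": "gpu1", "mem_free_mb": 4096, "temp_c": 60, "util_pct": 75},
    {"id": "gpu2", "mem_free_mb": 3072, "temp_c": 65, "util_pct": 20},
]
best = pick_gpgpu(gpus, mem_needed_mb=1024)
print(best["id"])  # gpu2: least utilized among the eligible devices
```

In a real deployment these metrics would come from the resource manager's device monitoring rather than a hand-built list.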
Improve GPGPU Job Speed and Success
The GPGPU drivers supplied by vendors today allow multiple jobs to share a GPGPU without corrupting results, but sharing GPGPUs and other key GPGPU factors can significantly impact the speed and success of GPGPU jobs and the final level of service delivered to the system users and the organization.
Consider the example of a user estimating the time for a GPGPU job based on exclusive access to a GPGPU, while the workload manager allows the GPGPU to be shared when the job is scheduled. The job will likely exceed the estimated time and be terminated unsuccessfully by the workload manager, leaving a very unsatisfied user and the organization, group, or project they represent. Moab HPC Suite provides the ability to schedule GPGPU jobs to run exclusively, and finish in the shortest time possible, for certain types or classes of jobs or users, to ensure job speed and success.
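The walltime problem described above can be illustrated with a toy model; the 50%-per-co-tenant slowdown is an assumed number purely for illustration, not a measured GPGPU sharing penalty.

```python
# Toy model of the scenario above: a user estimates walltime assuming
# exclusive GPGPU access, but sharing slows the job down past its limit.

def job_outcome(exclusive_runtime_min, walltime_limit_min, sharers):
    """Return ('completed', runtime) or ('killed', limit) under sharing."""
    slowdown = 1 + 0.5 * sharers          # assumed 50% penalty per co-tenant
    runtime = exclusive_runtime_min * slowdown
    if runtime > walltime_limit_min:
        return ("killed", walltime_limit_min)
    return ("completed", runtime)

# Estimated 60 min exclusive; the user requests a 90 min walltime limit.
print(job_outcome(60, 90, sharers=0))  # completes within the limit
print(job_outcome(60, 90, sharers=2))  # killed: shared runtime exceeds 90 min
```

Scheduling such jobs exclusively, as Moab allows per job class, removes the `sharers` term and makes the user's estimate reliable again.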
Cluster Sovereignty and Trusted Sharing
• Guarantee that shared resources are allocated fairly with global policies that fully respect local cluster configuration and needs
• Establish trust between resource owners through graphical usage controls, reports, and accounting across all shared resources
• Maintain cluster sovereignty with granular settings to control where jobs can originate and be processed
• Establish resource ownership and enforce appropriate access levels with prioritization, preemption, and access guarantees
Increase User Collaboration and Productivity
• Reduce end-user training and job management time with easy-to-use graphical interfaces
• Enable end users to easily submit and manage their jobs through an optional web portal, minimizing the costs of catering to a growing base of users
• Collaborate more effectively with multi-cluster co-allocation, allowing key resource reservations for high-priority projects
• Leverage saved job templates, allowing users to submit multiple jobs quickly and with minimal changes
• Speed job processing with enhanced grid placement options for job arrays (optimal or single-cluster placement)
Process More Work in Less Time to Maximize ROI
• Achieve higher, more consistent resource utilization with intelligent scheduling, matching job requests to the best-suited resources, including GPGPUs
• Use optimized data staging to ensure that remote data transfers are synchronized with resource availability to minimize poor utilization
• Allow local cluster-level optimizations of most grid workloads
Unify Management Across Independent Clusters
• Unify management across existing internal, external, and partner clusters – even if they have different resource managers, databases, operating systems, and hardware
• Out-of-the-box support for both local-area and wide-area grids
• Manage secure access to resources with simple credential mapping or interface with popular security tool sets
• Leverage existing data-migration technologies, such as SCP or GridFTP
Moab HPC Suite – Application Portal Edition
Remove the Barriers to Harnessing HPC
Organizations of all sizes and types are looking to harness the potential of high performance computing (HPC) to enable and accelerate more competitive product design and research. To do this, you must find a way to extend the power of HPC resources to designers, researchers, and engineers not familiar with or trained on using the specialized HPC technologies. Simplifying user access to and interaction with applications running on more efficient HPC clusters removes the barriers to discovery and innovation for all types of departments and organizations. Moab® HPC Suite – Application Portal Edition provides easy single-point access to common technical applications, data, HPC resources, and job submissions to accelerate the research, analysis, and collaboration process for end users while reducing computing costs.
HPC Ease of Use Drives Productivity
Moab HPC Suite – Application Portal Edition improves ease of use and productivity for end users of HPC resources with single-point access to their common technical applications, data, resources, and job submissions across the design and research process. It simplifies the specification and running of jobs on HPC resources to a few simple clicks on an application-specific template, with key capabilities including:
• Application-centric job submission templates for common manufacturing, energy, life-sciences, education, and other industry technical applications
• Support for batch and interactive applications, such as simulations or analysis sessions, to accelerate the full project or design cycle
• No special HPC training required: the intuitive portal enables more users to start accessing HPC resources
• Distributed data management avoids unnecessary remote file transfers for users, storing data optimally on the network for easy access and protection, and optimizing file transfer to/from user desktops if needed
Accelerate Collaboration While Maintaining Control
More and more projects are done across teams and with partner and industry organizations. Moab HPC Suite – Application Portal Edition enables you to accelerate this collaboration between individuals and teams while maintaining control, as additional users, inside and outside the organization, can easily and securely access project applications, HPC resources, and data to speed project cycles. Key capabilities and benefits you can leverage are:
• Encrypted and controllable access, by class of user, service, application, or resource, for remote and partner users improves collaboration while protecting infrastructure and intellectual property
The Visual Cluster view in Moab Cluster Manager for Grids makes cluster and grid management easy and efficient.
• Enable multi-site, globally available HPC services, accessible anytime, anywhere, through many different types of devices via a standard web browser
• Reduced client requirements for projects mean that new project members can quickly contribute without being limited by their desktop capabilities
Reduce Costs with Optimized Utilization and Sharing
Moab HPC Suite – Application Portal Edition reduces the costs of technical computing by optimizing resource utilization and the sharing of resources, including HPC compute nodes, application licenses, and network storage. The patented Moab intelligence engine and its powerful policies deliver value in key areas of utilization and sharing:
• Maximize HPC resource utilization with intelligent, optimized scheduling policies that pack and enable more work to be done on HPC resources to meet growing and changing demand
• Optimize application license utilization and access by sharing application licenses in a pool, re-using and optimizing usage with allocation policies that integrate with license managers for lower costs and better service levels to a broader set of users
• Usage budgeting and priority enforcement with usage accounting and priority policies that ensure resources are shared in line with service-level agreements and project priorities
• Leverage and fully utilize centralized storage capacity instead of duplicative, expensive, and underutilized individual workstation storage
Key Intelligent Workload Management Capabilities:
• Massive multi-point scalability
• Workload-optimized allocation policies and provisioning
• Unified workload management across heterogeneous clusters
• Optimized, intelligent scheduling
• Optimized scheduling and management of accelerators (both Intel MIC and GPGPUs)
• Administrator dashboards and reporting tools
• Workload-aware auto power management
• Intelligent resource placement to prevent job failures
• Auto-response to failures and events
• Workload-aware future maintenance scheduling
• Usage accounting and budget enforcement
• SLA and priority policies
• Continuous plus future scheduling
Moab HPC Suite – Remote Visualization Edition
Removing Inefficiencies in Current 3D Visualization Methods
Using 3D and 2D visualization, valuable data is interpreted and new innovations are discovered. There are numerous inefficiencies in how current technical computing methods support users and organizations in achieving these discoveries in a cost-effective way. High-cost workstations and their software are often not fully utilized by the limited users that have access to them. They are also costly to manage, and they don't keep pace for long with workload technology requirements. In addition, the workstation model requires slow transfers of large data sets between workstations. This is costly and inefficient in network bandwidth, user productivity, and team collaboration, while posing increased data and security risks. There is a solution that removes these inefficiencies using new technical cloud computing models. Moab HPC Suite – Remote Visualization Edition significantly reduces the costs of visualization technical computing while improving the productivity, collaboration, and security of the design and research process.
Reduce the Costs of Visualization Technical Computing
Moab HPC Suite – Remote Visualization Edition significantly reduces the hardware, network, and management costs of visualization technical computing. It enables you to create easy, centralized access to shared visualization compute, application, and data resources. These compute, application, and data resources reside in a technical compute cloud in your data center instead of in expensive and underutilized individual workstations.
• Reduce hardware costs by consolidating expensive individual technical workstations into centralized visualization servers for higher utilization by multiple users; reduce additive or upgrade technical workstation or specialty hardware purchases, such as GPUs, for individual users
• Reduce management costs by moving from remote user workstations that are difficult and expensive to maintain, upgrade, and back up to centrally managed visualization servers that require less admin overhead
• Decrease network access costs and congestion, as significantly lower loads of just compressed, visualized pixels are moving to users, not full data sets
• Reduce energy usage, as centralized visualization servers consume less energy for the user demand met
• Reduce data storage costs by consolidating data into common storage node(s) instead of under-utilized individual storage
The Application Portal Edition easily extends the power of HPC to your users and projects to drive innovation and competitive advantage.
Consolidation and Grid
Many sites have multiple clusters as a result of having multiple independent groups or locations, each with demands for HPC, and frequent additions of newer machines. Each new cluster increases the overall administrative burden and overhead. Additionally, many of these systems can sit idle while others are overloaded. Because of this systems-management challenge, sites turn to grids to maximize the efficiency of their clusters. Grids can be enabled either independently or in conjunction with one another in three areas:
• Reporting Grids – Managers want to have global reporting across all HPC resources so they can see how users and projects are really utilizing hardware and so they can effectively plan capacity. Unfortunately, manually consolidating all of this information in an intelligible manner for more than just a couple of clusters is a management nightmare. To solve that problem, sites will create a reporting grid, or share information across their clusters for reporting and capacity-planning purposes.
• Management Grids – Managing multiple clusters independently can be especially difficult when processes change, because policies must be manually reconfigured across all clusters. To ease that difficulty, sites often set up management grids that impose a synchronized management layer across all clusters while still allowing each cluster some level of autonomy.
• Workload-Sharing Grids – Sites with multiple clusters often have the problem of some clusters sitting idle while other clusters have large backlogs. Such inequality in cluster utilization wastes expensive resources, and training users to perform different workload-submission routines across various clusters can be difficult and expensive as well. To avoid these problems, sites often set up workload-sharing grids. These grids can be as simple as centralizing user submission or as complex as having each cluster maintain its own user submission routine with an underlying grid-management tool that migrates jobs between clusters.
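The job-migration idea behind a workload-sharing grid can be sketched as a least-backlog placement rule. The cluster names and queue-length metric below are illustrative; a real metascheduler weighs many more factors than backlog alone.

```python
# Minimal sketch of workload sharing: submit each job to the cluster
# with the shortest backlog, as a grid metascheduler might.

def place_job(job, clusters):
    """Pick the cluster with the fewest queued jobs and enqueue there."""
    target = min(clusters, key=lambda name: len(clusters[name]))
    clusters[target].append(job)
    return target

clusters = {"clusterA": ["j1", "j2", "j3"], "clusterB": []}
print(place_job("j4", clusters))  # clusterB: the idle cluster absorbs work
print(place_job("j5", clusters))  # clusterB again (1 queued vs. 3)
```

Even this greedy rule captures the core benefit: idle clusters stop sitting idle while others hold a backlog.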
Inhibitors to Grid Environments
Three common inhibitors keep sites from enabling grid environments:
• Politics – Because grids combine resources across users, groups, and projects that were previously independent, grid implementation can be a political nightmare. To create a grid in the real world, sites need a tool that allows clusters to retain some level of sovereignty while participating in the larger grid.
• Multiple Resource Managers – Most sites have a variety of resource managers used by various groups within the organization, and each group typically has a large investment in scripts that are specific to one resource manager and that cannot be changed. To implement grid computing effectively, sites need a robust tool that has flexibility in integrating with multiple resource managers.
• Credentials – Many sites have different log-in credentials for each cluster, and those credentials are generally independent of one another. For example, one user might be Joe.P on one cluster and J_Peterson on another. To enable grid environments, sites must create a combined user space that can recognize and combine these different credentials.
Using Moab in a Grid
Sites can use Moab to set up any combination of reporting, management, and workload-sharing grids. Moab is a grid metascheduler that allows sites to set up grids that work effectively in the real world. Moab's feature-rich functionality overcomes the inhibitors of politics, multiple resource managers, and varying credentials by providing:
• Grid Sovereignty – Moab has multiple features that break down political barriers by letting sites choose how each cluster shares in the grid. Sites can control what information is shared between clusters and can specify which workload is passed between clusters. In fact, sites can even choose to let each cluster be completely sovereign in making decisions about grid participation for itself.
• Support for Multiple Resource Managers – Moab metaschedules across all common resource managers. It fully integrates with TORQUE and SLURM, the two most common open-source resource managers, and also has limited integration with commercial tools such as PBS Pro and SGE. Moab's integration includes the ability to recognize when a user has a script that requires one of these tools, and Moab can intelligently ensure that the script is sent to the correct machine. Moab even has the ability to translate common scripts across multiple resource managers.
• Credential Mapping – Moab can map credentials across clusters to ensure that users and projects are tracked appropriately and to provide consolidated reporting.
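The credential-mapping idea (recall the Joe.P / J_Peterson example) can be sketched as a simple lookup table. In a real deployment this mapping would live in Moab's configuration or an external identity service; the names and table here are illustrative.

```python
# Sketch of mapping cluster-local logins to one grid-wide identity,
# so usage by the same person is tracked and reported consistently.

CRED_MAP = {
    ("cluster1", "Joe.P"): "jpeterson",
    ("cluster2", "J_Peterson"): "jpeterson",
}

def grid_user(cluster, local_name):
    """Resolve a cluster-local login to its grid-wide identity."""
    return CRED_MAP.get((cluster, local_name), local_name)

# Both local logins resolve to a single identity for consolidated reporting.
print(grid_user("cluster1", "Joe.P"))       # jpeterson
print(grid_user("cluster2", "J_Peterson"))  # jpeterson
```

Unmapped logins fall back to their local name, so clusters can join the grid incrementally.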
Improve Collaboration, Productivity and Security
With Moab HPC Suite – Remote Visualization Edition, you can improve the productivity, collaboration, and security of the design and research process by transferring only pixels, instead of data, to the users doing simulations and analysis. This enables a wider range of users to collaborate and be more productive at any time, on the same data, from anywhere, without any data transfer time lags or security issues. Users also have improved, immediate access to specialty applications and resources, like GPUs, that they might need for a project, so they are no longer limited by personal workstation constraints or to a single working location.
• Improve workforce collaboration by enabling multiple users to access and collaborate on the same interactive application data at the same time, from almost any location or device, with little or no training needed
• Eliminate data transfer time lags for users by keeping data close to the visualization applications and compute resources instead of transferring it back and forth to remote workstations; only smaller compressed pixels are transferred
• Improve security with centralized data storage, so data no longer gets transferred to where it shouldn't, gets lost, or gets inappropriately accessed on remote workstations; only pixels get moved
Maximize Utilization and Shared Resource Guarantees
Moab HPC Suite – Remote Visualization Edition maximizes your resource utilization and scalability while guaranteeing shared resource access to users and groups with optimized workload management policies. Moab's patented policy intelligence engine optimizes the scheduling of visualization sessions across the shared resources to maximize standard and specialty resource utilization, helping them scale to support larger volumes of visualization workloads. These intelligent policies also guarantee shared resource access to users to ensure a high service level as they transition from individual workstations.
• Guarantee shared visualization resource access for users with priority policies and usage budgets that ensure they receive the appropriate service level. Policies include session priorities and reservations, number of users per node, fairshare usage policies, and usage accounting and budgets across multiple groups or users.
• Maximize resource utilization and scalability by packing visualization workload optimally on shared servers using Moab allocation and scheduling policies. These policies can include optimal resource allocation by visualization session characteristics (like CPU, memory, etc.), workload packing, number of users per node, and GPU policies.
• Improve application license utilization to reduce software costs and improve access with a common pool of 3D visualization applications shared across multiple users, with usage optimized by intelligent Moab allocation policies integrated with license managers.
• Optimize scheduling and management of GPUs and other accelerators to maximize their utilization and effectiveness.
• Enable multiple Windows or Linux user sessions on a single machine, managed by Moab, to maximize resource and power utilization.
• Enable dynamic OS re-provisioning on resources, managed by Moab, to optimally meet changing visualization application workload demand and maximize resource utilization and availability to users.
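The "workload packing" policy mentioned above amounts to a bin-packing decision. Here is a minimal first-fit sketch with illustrative server capacities and a users-per-node cap; it is not Moab's actual algorithm, just the shape of the problem.

```python
# First-fit sketch of packing visualization sessions onto shared servers:
# place each session on the first server with enough free memory and an
# open user slot. Capacities and the per-node cap are illustrative.

def pack_sessions(sessions, servers, max_users_per_node=4):
    """Assign {session: server_id} greedily; unplaced sessions map to None."""
    placement = {}
    for name, mem_gb in sessions:
        placement[name] = None
        for srv in servers:
            if srv["free_gb"] >= mem_gb and len(srv["users"]) < max_users_per_node:
                srv["free_gb"] -= mem_gb
                srv["users"].append(name)
                placement[name] = srv["id"]
                break
    return placement

servers = [{"id": "viz1", "free_gb": 16, "users": []},
           {"id": "viz2", "free_gb": 32, "users": []}]
sessions = [("s1", 8), ("s2", 8), ("s3", 12)]
print(pack_sessions(sessions, servers))
# s1 and s2 fill viz1; s3 (12 GB) no longer fits there and lands on viz2
```

Packing sessions tightly like this keeps fewer servers busy, which is what enables the power-management and consolidation savings described earlier.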
The Remote Visualization Edition makes visualization more cost-effective and productive with an innovative technical compute cloud.
Moab HPC Suite Use Case | Basic | Enterprise | Application Portal | Remote Visualization

Productivity Acceleration
Massive multifaceted scalability | x | x | x | x
Workload-optimized resource allocation | x | x | x | x
Optimized, intelligent scheduling of workload and resources | x | x | x | x
Optimized accelerator scheduling & management (Intel MIC & GPGPU) | x | x | x | x
Simplified user job submission & mgmt. (Application Portal) | x | x | x | x
Administrator dashboard and management reporting | x | x | x | x
Unified workload management across heterogeneous cluster(s) | x (single cluster only) | x | x | x
Workload-optimized node provisioning | – | x | x | x
Visualization workload optimization | – | – | – | x
Workload-aware auto power management | – | x | x | x

Uptime Automation
Intelligent resource placement for failure prevention | x | x | x | x
Workload-aware maintenance scheduling | x | x | x | x
Basic deployment, training and premium (24x7) support service | standard support only | x | x | x
Auto response to failures and events | x (limited, no re-boot/provision) | x | x | x

Auto SLA Enforcement
Resource sharing and usage policies | x | x | x | x
SLA and priority policies | x | x | x | x
Continuous and future reservations scheduling | x | x | x | x
Usage accounting and budget enforcement | – | x | x | x

Grid- and Cloud-Ready
Manage and share workload across multiple clusters in a wide area grid | x (w/ Grid Option) | x (w/ Grid Option) | x (w/ Grid Option) | x (w/ Grid Option)
User self-service: job request & management portal | x | x | x | x
Pay-for-use showback and chargeback | – | x | x | x
Workload-optimized node provisioning & re-purposing | – | x | x | x
Multiple clusters in single domain | | | |
transtec HPC solutions are designed for maximum flexibility and ease of management. We not only offer our customers the most powerful and flexible cluster management solution out there, but also provide them with customized setup and site-specific configuration. Whether a customer needs a dynamic Linux-Windows dual-boot solution, unified management of different clusters at different sites, or fine-tuning of the Moab scheduler for implementing fine-grained policy configuration – transtec not only gives you the framework at hand, but also helps you adapt the system according to your special needs. Needless to say, when customers are in need of special trainings, transtec will be there to provide customers, administrators, or users with specially adjusted Educational Services.
Many years of experience in High Performance Computing have enabled us to develop efficient concepts for installing and deploying HPC clusters. For this, we leverage well-proven third-party tools, which we assemble into a total solution and adapt and configure according to the customer's requirements. We manage everything from the complete installation and configuration of the operating system, through necessary middleware components like job and cluster management systems, up to the customer's applications.
Remote Visualization and Workflow Optimization
It's human nature to want to 'see' the results from simulations, tests, and analyses. Up until recently, this has meant 'fat' workstations on many user desktops. This approach provides CPU power when the user wants it – but as dataset size increases, there can be delays in downloading the results. Also, sharing the results with colleagues means gathering around the workstation – not always possible in this globalized, collaborative workplace.
Aerospace, Automotive and Manufacturing
The complex factors involved in CAE range from compute-intensive data analysis to worldwide collaboration between designers, engineers, OEMs, and suppliers. To accommodate these factors, Cloud (both internal and external) and Grid Computing solutions are increasingly seen as a logical step to optimize IT infrastructure usage for CAE. It is no surprise, then, that automotive and aerospace manufacturers were the early adopters of internal Cloud and Grid portals.
Manufacturers can now develop more "virtual products" and simulate all types of designs, fluid flows, and crash simulations. Such virtualized products and more streamlined collaboration environments are revolutionizing the manufacturing process.
With NICE EnginFrame in their CAE environment, engineers can take the process even further by connecting design and simulation groups in "collaborative environments" to get even greater benefits from "virtual products". Thanks to EnginFrame, CAE engineers can have a simple, intuitive collaborative environment that takes care of issues related to:
• Access & Security – Where an organization must give access to external and internal entities such as designers, engineers, and suppliers.
• Distributed collaboration – Simple and secure connection of design and simulation groups distributed worldwide.
• Time spent on IT tasks – By eliminating time and resources spent using cryptic job submission commands or acquiring knowledge of underlying compute infrastructures, engineers can spend more time concentrating on their core design tasks.
EnginFrame's web-based interface can be used to access the compute resources required for CAE processes. This means access to job submission and monitoring tasks, and to the input and output data associated with industry-standard CAE/CFD applications for Fluid Dynamics, Structural Analysis, Electro Design, and Design Collaboration (like Abaqus, Ansys, Fluent, MSC Nastran, PAM-Crash, LS-Dyna, Radioss), without cryptic job submission commands or knowledge of underlying compute infrastructures.
EnginFrame has a long history of usage in some of the most
prestigious manufacturing organizations worldwide including
Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo Avionica, Hamilton Sundstrand, Magellan Aerospace, MTU, and Automotive companies like Audi, ARRK, Brawn GP, Bridgestone, Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar Land Rover, Lear, Magneti Marelli, McLaren, P+Z, Red Bull, Swagelok, Suzuki, Toyota and TRW.
Life Sciences and Healthcare
NICE solutions are deployed in the Life Sciences sector at companies like BioLab, Partners HealthCare, Pharsight and the M.D. Anderson Cancer Center, and also in leading research projects like DEISA or LitBio, in order to allow easy and transparent use of computing resources without any insight into the HPC infrastructure.
The Life Science and Healthcare sectors have some very strict requirements when choosing an IT solution like EnginFrame, for instance:
• Security - To meet the strict security & privacy requirements of the biomedical and pharmaceutical industry, any solution needs to take account of multiple layers of security and authentication.
• Industry-specific software - ranging from the simplest custom tool to the more general-purpose free and open middlewares.
EnginFrame’s modular architecture allows for different Grid middlewares and software, including leading Life Science applications (like Schroedinger Glide, EPFL RAxML, the BLAST family, Taverna, and R), to be exploited. Users can compose elementary services into complex applications and “virtual experiments”, and run, monitor, and build workflows via a standard Web browser. EnginFrame also has highly tunable resource sharing and fine-grained access control, where multiple authentication systems (like Active Directory, Kerberos/LDAP) can be exploited simultaneously.
Growing complexity in globalized teams
HPC systems and Enterprise Grids deliver unprecedented time-to-market and performance advantages to many research and corporate customers, struggling every day with compute- and data-intensive processes.
This often generates or transforms massive amounts of jobs and data that need to be handled and archived efficiently to deliver timely information to users distributed in multiple locations, with different security concerns.
Poor usability of such complex systems often negatively im-
pacts users’ productivity, and ad-hoc data management often
increases information entropy and dissipates knowledge and
intellectual property.
Remote Virtualization
NICE EnginFrame: A Technical Computing Portal

The solution
Solving distributed computing issues for our customers, we understand that a modern, user-friendly web front-end to HPC and Grids can drastically improve engineering productivity, if properly designed to address the specific challenges of the Technical Computing market.
NICE EnginFrame overcomes many common issues in the areas
of usability, data management, security and integration, open-
ing the way to a broader, more effective use of the Technical
Computing resources.
The key components of the solution are:
• A flexible and modular Java-based kernel, with clear separation between customizations and core services
• Powerful data management features, reflecting the typical needs of engineering applications
• Comprehensive security options and a fine-grained authorization system
• Scheduler abstraction layer to adapt to different workload and resource managers
• Responsive and competent Support services
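The scheduler abstraction layer mentioned above can be sketched in a few lines. This is a hypothetical illustration, not NICE's actual implementation; the function name is invented, and the command-line flags follow common LSF, PBS and SLURM usage.

```python
# Illustrative sketch of a scheduler abstraction layer: one application-centric
# job description is translated into the submission command line of different
# workload managers, so portal services stay scheduler-neutral.

def build_submit_command(scheduler, job_name, cores, script):
    """Translate an abstract job description into a scheduler-specific command."""
    if scheduler == "lsf":
        return ["bsub", "-J", job_name, "-n", str(cores), script]
    if scheduler == "pbs":
        return ["qsub", "-N", job_name, "-l", f"nodes=1:ppn={cores}", script]
    if scheduler == "slurm":
        return ["sbatch", "--job-name", job_name, "--ntasks", str(cores), script]
    raise ValueError(f"unsupported scheduler: {scheduler}")

if __name__ == "__main__":
    # The same "16-core crash simulation" request, rendered for three schedulers:
    for s in ("lsf", "pbs", "slurm"):
        print(" ".join(build_submit_command(s, "crash-sim", 16, "run.sh")))
```

The user never sees these command lines; the portal service picks the right one for the cluster behind it.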
End users can typically enjoy the following improvements:
• User-friendly, intuitive access to computing resources, using a standard Web browser
• Application-centric job submission
• Organized access to job information and data
• Increased mobility and reduced client requirements
On the other side, the Technical Computing Portal delivers significant added value for system administrators and IT:
• Reduced training needs to enable users to access the resources
• Centralized configuration and immediate deployment of services
• Comprehensive authorization to access services and information
• Reduced support calls and submission errors
Coupled with our Remote Visualization solutions, our custom-
ers quickly deploy end-to-end engineering processes on their
Intranet, Extranet or Internet.
Remote Virtualization
Desktop Cloud Virtualization
Solving distributed computing issues for our customers, it is easy to understand that a modern, user-friendly web front-end to HPC and grids can drastically improve engineering productivity, if properly designed to address the specific challenges of the Technical Computing market.
Remote Visualization
Increasing dataset complexity (millions of polygons, interacting components, MRI/PET overlays) means that as time comes
to upgrade and replace the workstations, the next generation
of hardware needs more memory, more graphics processing,
more disk, and more CPU cores. This makes the workstation ex-
pensive, in need of cooling, and noisy.
Innovation in the field of remote 3D processing now allows companies to address these issues by moving applications away from the Desktop into the data center. Instead of pushing data to the application, the application can be moved near the data.
Instead of mass workstation upgrades, remote visualization allows incremental provisioning, on-demand allocation, better management and efficient distribution of interactive sessions and licenses. Racked workstations or blades typically have lower maintenance, cooling and replacement costs, and they can extend workstation (or laptop) life as “thin clients”.
The solution
Leveraging their expertise in distributed computing and web-
based application portals, NICE offers an integrated solution to
access, load balance and manage applications and desktop ses-
sions running within a visualization farm. The farm can include
both Linux and Windows resources, running on heterogeneous
hardware.
The core of the solution is the EnginFrame visualization plug-in,
that delivers web-based services to access and manage appli-
cations and desktops published in the farm. This solution has
been integrated with:
• NICE Desktop Cloud Visualization (DCV)
• HP Remote Graphics Software (RGS)
• RealVNC
• TurboVNC and VirtualGL
• NoMachine NX
Coupled with these third party remote visualization engines
(which specialize in delivering high frame-rates for 3D graphics),
the NICE offering for Remote Visualization solves the issues of
user authentication, dynamic session allocation, session man-
agement and data transfers.
End users can enjoy the following improvements:
• Intuitive, application-centric web interface to start, control and re-connect to a session
• Single sign-on for batch and interactive applications
• All data transfers from and to the remote visualization farm are handled by EnginFrame
• Built-in collaboration, to share sessions with other users
• The load and usage of the visualization cluster is monitored in the browser
The solution also delivers significant added value for the system administrators:
• No need of SSH / SCP / FTP on the client machine
• Easy integration into identity services, Single Sign-On (SSO), Enterprise portals
• Automated data life cycle management
• Built-in user session sharing, to facilitate support
• Interactive sessions are load balanced by a scheduler (LSF, Moab or Torque) to achieve optimal performance and resource usage
• Better control and use of application licenses
• Monitor, control and manage users’ idle sessions
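The placement decision behind load-balanced interactive sessions can be sketched as follows. This is a hypothetical illustration of the general technique, not the actual EnginFrame or scheduler logic; host names and the slot model are invented.

```python
# Minimal sketch of session placement in a visualization farm: new interactive
# sessions go to the host with the most free session slots.

def pick_host(hosts):
    """Return the name of the farm host with the most free session slots."""
    free = lambda h: h["max_sessions"] - h["sessions"]
    candidates = [h for h in hosts if free(h) > 0]
    if not candidates:
        raise RuntimeError("no free slots in the visualization farm")
    return max(candidates, key=free)["name"]

farm = [
    {"name": "viz01", "sessions": 7, "max_sessions": 8},
    {"name": "viz02", "sessions": 3, "max_sessions": 8},
]
print(pick_host(farm))  # viz02 has the most free slots
```

A real scheduler additionally weighs GPU load, licenses and user priorities, but the principle is the same.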
Desktop Cloud Virtualization
NICE Desktop Cloud Visualization (DCV) is an advanced technology that enables Technical Computing users to remotely access 2D/3D interactive applications over a standard network.
Engineers and scientists are immediately empowered by taking full advantage of high-end graphics cards, fast I/O performance and large memory nodes hosted in a “Public or Private 3D Cloud”, rather than waiting for the next upgrade of their workstations.
The DCV protocol adapts to heterogeneous networking infrastructures like LAN, WAN and VPN, to deal with bandwidth and latency constraints. All applications run natively on the remote machines, which may be virtualized and share the same physical GPU.
In a typical visualization scenario, a software application sends a
stream of graphics commands to a graphics adapter through an
input/output (I/O) interface. The graphics adapter renders the data
into pixels and outputs them to the local display as a video signal.
When using NICE DCV, the scene geometry and graphics state
are rendered on a central server, and pixels are sent to one or
more remote displays.
This approach requires the server to be equipped with one or
more GPUs, which are used for the OpenGL rendering, while the
client software can run on “thin” devices.
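A back-of-the-envelope calculation shows why the pixel stream cannot simply be sent raw, and why server-side encoding (such as DCV's H.264 path) matters; the resolution and frame rate below are illustrative assumptions.

```python
# Why remote pixel delivery needs compression: bandwidth of an uncompressed
# 1920x1080, 24-bit desktop refreshed at 30 frames per second.

width, height, bytes_per_pixel, fps = 1920, 1080, 3, 30

bytes_per_frame = width * height * bytes_per_pixel   # 6,220,800 bytes per frame
raw_bits_per_second = bytes_per_frame * fps * 8      # ~1.49 Gbit/s uncompressed

print(f"{raw_bits_per_second / 1e9:.2f} Gbit/s uncompressed")
```

Roughly 1.5 Gbit/s for a single full-HD session is far beyond typical WAN or VPN links, so the server renders the scene and ships a compressed stream instead.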
The NICE DCV architecture consists of:
• a DCV Server, equipped with one or more GPUs, used for OpenGL rendering
• one or more DCV end stations, running on “thin clients”, only used for visualization
• heterogeneous networking infrastructures (like LAN, WAN and VPN), optimized by balancing quality vs. frame rate
NICE DCV Highlights:
• enables high-performance remote access to interactive 2D/3D software applications over low-bandwidth/high-latency networks
• supports multiple heterogeneous OSs (Windows, Linux)
• enables GPU sharing
• supports 3D acceleration for OpenGL applications running on virtual machines
• supports multiple-user collaboration via session sharing
• enables attractive Return on Investment through resource sharing and consolidation to data centers (GPU, memory, CPU, ...)
• keeps the data secure in the data center, reducing data load and saving time
• enables right-sizing of system allocation based on users’ dynamic needs
• facilitates application deployment: all applications, updates and patches are instantly available to everyone, without any changes to original code
Business Benefits
The business benefits of adopting NICE DCV can be summarized into four categories:

Productivity
• Increase business efficiency
• Improve team performance by ensuring real-time collaboration with colleagues and partners, anywhere
• Reduce IT management costs by consolidating workstation resources to a single point of management
• Save money and time on application deployment
• Let users work from anywhere, as long as there is an Internet connection

Business Continuity
• Move graphics processing and data to the datacenter - not the laptop/desktop
• Cloud-based platform support enables you to scale the visualization solution “on demand” to extend business, grow new revenue, and manage costs

Data Security
• Guarantee secure and auditable use of remote resources (applications, data, infrastructure, licenses)
• Allow real-time collaboration with partners while protecting Intellectual Property and resources
• Restrict access by class of user, service, application, and resource

Training Effectiveness
• Enable multiple users to follow application procedures alongside an instructor in real time
• Enable collaboration and session sharing among remote users (employees, partners, and affiliates)
NICE DCV is perfectly integrated into EnginFrame Views, lever-
aging 2D/3D capabilities over the web, including the ability to
share an interactive session with other users for collaborative
working.
Supported operating systems:

Windows
• Microsoft Windows 7 - 32/64 bit
• Microsoft Windows Vista - 32/64 bit
• Microsoft Windows XP - 32/64 bit

Linux
• RedHat Enterprise 5.x and 6.x - 32/64 bit
• SUSE Enterprise Server 11 - 32/64 bit
© 2012 by NICE
“The amount of data resulting from e.g. simulations in CAE or other engineering environments can be in the Gigabyte range. It is obvious that remote post-processing is one of the most urgent topics to be tackled. NICE EnginFrame provides exactly that, and our customers are impressed that such great technology enhances their workflow so significantly.”
Robin Kienecker, HPC Sales Specialist
Cloud Computing With transtec and NICE
Cloud Computing is a style of computing in which dynamically
scalable and often virtualized resources are provided as a service
over the Internet. Users need not have knowledge of, expertise in,
or control over the technology infrastructure in the “cloud” that
supports them. The concept incorporates technologies that have
the common theme of reliance on the Internet for satisfying the
computing needs of the users. Cloud Computing services usually
provide applications online that are accessed from a web brows-
er, while the software and data are stored on the servers.
Companies or individuals engaging in Cloud Computing do not
own the physical infrastructure hosting the software platform
in question. Instead, they avoid capital expenditure by renting
usage from a third-party provider (except for the case of ‘Private
Cloud’ - see below). They consume resources as a service, paying only for the resources they use. Many Cloud Computing
offerings have adopted the utility computing model, which is
analogous to how traditional utilities like electricity are con-
sumed, while others are billed on a subscription basis.
The main advantage offered by Cloud solutions is the reduction of infrastructure and maintenance costs. By not owning the hardware and software, Cloud
users avoid capital expenditure by renting usage from a third-
party provider. Customers pay for only the resources they use.
The advantage for the provider of the Cloud is that sharing computing power among multiple tenants improves utilization rates, as servers are not left idle, which can reduce costs and increase efficiency.
Public cloud
Public cloud or external cloud describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications/web services, from an off-site third-party provider who shares resources and bills on a fine-grained utility computing basis.
Private cloud
Private cloud and internal cloud describe offerings deploying
cloud computing on private networks. These solutions aim to
“deliver some benefits of cloud computing without the pit-
falls”, capitalizing on data security, corporate governance,
and reliability concerns. On the other hand, users still have
to buy, deploy, manage, and maintain them, and as such do
not benefit from lower up-front capital costs and less hands-
on management.
Architecture
The majority of cloud computing infrastructure today consists of reliable services delivered through data centers and built on servers with different levels of virtualization technologies. The services are accessible from anywhere that has access to networking infrastructure. The Cloud appears as a single point of access for all the computing needs of consumers. The offerings need to meet the quality-of-service requirements of customers, typically offer service level agreements, and at the same time go beyond the typical limitations of traditional infrastructures.
The architecture of the computing platform proposed by NICE (fig. 1) differs from the others in some interesting ways:
• you can deploy it on an existing IT infrastructure, because it is completely decoupled from the hardware infrastructure
• it has a high level of modularity and configurability, resulting in being easily customizable for the user’s needs
• based on the NICE EnginFrame technology, it is easy to build graphical web-based interfaces to provide several applications, as web services, without needing to code or compile source programs
• it utilises the existing methodology in place for authentication and authorization
Further, because the NICE Cloud solution is built on advanced IT technologies, including virtualization and workload management, the execution platform is dynamically able to allocate, monitor, and configure a new environment as needed by the application, inside the Cloud infrastructure. The NICE platform offers these important properties:
• Incremental Scalability: the quantity of computing and storage resources provisioned to the applications changes dynamically depending on the workload
• Reliability and Fault-Tolerance: because of the virtualization of the hardware resources and the multiple redundant hosts, the platform adjusts the resources needed by the applications, without disruption during disasters or crashes
• Service Level Agreement: the use of advanced systems for the dynamic allocation of resources allows service levels, agreed across applications and services, to be guaranteed
• Accountability: the continuous monitoring of the resources used by each application (and user) allows the setup of services that users can access in a pay-per-use mode, or by subscribing to a specific contract. In the case of an Enterprise Cloud, this feature allows costs to be shared among the cost centers of the company.
transtec has long-term experience in Engineering environments, especially in the CAD/CAE sector. This allows us to provide customers from this area with solutions that greatly enhance their workflow and minimize time-to-result.
This, together with transtec’s offerings of all kinds of services, allows our customers to fully focus on their productive work, and have us do the environmental optimizations.
Graphics accelerated virtual desktops and applications
NVIDIA GRID for large corporations offers the ability to offload graphics processing from the CPU to the GPU in virtualised environments, allowing the data center manager to deliver true PC graphics-rich experiences to more users for the first time.
Benefits of NVIDIA GRID for IT:
• Leverage industry-leading remote visualization solutions like NICE DCV
• Add your most graphics-intensive users to your virtual solution
• Improve the productivity of all users

Benefits of NVIDIA GRID for users:
• Highly responsive windows and rich multimedia experiences
• Access to all critical applications, including the most 3D-intensive
• Access from anywhere, on any device
NVIDIA GRID Boards
NVIDIA’s Kepler architecture-based GRID boards are specifically
designed to enable rich graphics in virtualised environments.
GPU Virtualisation
GRID boards feature NVIDIA Kepler-based GPUs that, for the first
time, allow hardware virtualisation of the GPU. This means mul-
tiple users can share a single GPU, improving user density while
providing true PC performance and compatibility.
Low-Latency Remote Display
NVIDIA’s patented low-latency remote display technology
greatly improves the user experience by reducing the lag that
users feel when interacting with their virtual machine. With this
technology, the virtual desktop screen is pushed directly to the
remoting protocol.
H.264 Encoding
The Kepler GPU includes a high-performance H.264 encoding engine capable of encoding simultaneous streams with superior
quality. This provides a giant leap forward in cloud server efficien-
cy by offloading the CPU from encoding functions and allowing
the encode function to scale with the number of GPUs in a server.
Maximum User Density
GRID boards have an optimised multi-GPU design that helps to
maximise user density. The GRID K1 board features 4 GPUs and
16 GB of graphics memory, allowing it to support up to 100 users
on a single board.
Power Efficiency
GRID boards are designed to provide data center-class power efficiency, including the revolutionary new streaming multiprocessor, called “SMX”. The result is an innovative, proven solution that delivers revolutionary performance-per-watt for the enterprise data centre.
24/7 Reliability
GRID boards are designed, built, and tested by NVIDIA for
24/7 operation. Working closely with leading server vendors
ensures GRID cards perform optimally and reliably for the life
of the system.
Widest Range of Virtualisation Solutions
GRID cards enable GPU-capable virtualisation solutions like XenServer or Linux KVM, delivering the flexibility to choose from a wide range of proven solutions.
GRID K1 Board GRID K2 Board
Number of GPUs 4 x entry Kepler GPUs2 x high-end Kepler
GPUs
Total NVIDIA CUDA cores 768 3072
Total memory size 16 GB DDR3 8 GB GDDR5
Max power 130 W 225 W
Board length 10.5” 10.5”
Board height 4.4” 4.4”
Board width Dual slot Dual slot
Display IO None None
Aux power 6-pin connector 8-pin connector
PCIe x16 x16
PCIe generation Gen3 (Gen2 compatible) Gen3 (Gen2 compatible)
Cooling solution Passive Passive
Technical SpecificationsGRID K1 Board Specifications
GRID K2 Board Specifications
Virtual GPU Technology
NVIDIA GRID vGPU brings the full benefit of NVIDIA hardware-accelerated graphics to virtualized solutions. This technology
provides exceptional graphics performance for virtual desk-
tops equivalent to local PCs when sharing a GPU among mul-
tiple users.
NVIDIA GRID vGPU is the industry’s most advanced technology
for sharing true GPU hardware acceleration between multiple
virtual desktops—without compromising the graphics experi-
ence. Application features and compatibility are exactly the
same as they would be at the desk.
With GRID vGPU technology, the graphics commands of each vir-
tual machine are passed directly to the GPU, without translation
by the hypervisor. This allows the GPU hardware to be time-sliced
to deliver the ultimate in shared virtualised graphics performance.
vGPU Profiles Mean Customised, Dedicated Graphics Memory
Take advantage of vGPU Manager to assign just the right amount of memory to meet the specific needs of each user. Every virtual
desktop has dedicated graphics memory, just like they would
at their desk, so they always have the resources they need to
launch and use their applications.
vGPU Manager enables up to eight users to share each physical
GPU, assigning the graphics resources of the available GPUs to
virtual machines in a balanced approach. Each NVIDIA GRID K1
graphics card has up to four GPUs, allowing 32 users to share a
single graphics card.
NVIDIA GRID vGPU profiles:

Graphics Board  vGPU Profile  Graphics Memory  Max Displays  Max Resolution  Max Users        Use Case
                                               per User      per Display     per Graphics Board
GRID K2         K260Q         2,048 MB         4             2560×1600       4                Designer/Power User
GRID K2         K240Q         1,024 MB         2             2560×1600       8                Designer/Power User
GRID K2         K220Q         512 MB           2             2560×1600       16               Designer/Power User
GRID K2         K200          256 MB           2             1900×1200       16               Knowledge Worker
GRID K1         K140Q         1,024 MB         2             2560×1600       16               Power User
GRID K1         K120Q         512 MB           2             2560×1600       32               Power User
GRID K1         K100          256 MB           2             1900×1200       32               Knowledge Worker

Up to 8 users are supported per physical GPU, depending on the vGPU profile.
Intel Cluster Ready – A Quality Standard for HPC Clusters
Intel Cluster Ready is designed to create predictable expectations for users and providers of HPC clusters, primarily targeting customers in the commercial and industrial sectors. These are not experimental “test-bed” clusters used for computer science and computer engineering research, or high-end “capability” clusters, closely targeting their specific computing requirements, that power the high-energy physics at the national labs or other specialized research organizations. Intel Cluster Ready seeks to advance HPC clusters used as computing resources in production environments by providing cluster owners with a high degree of confidence that the clusters they deploy will run the applications their scientific and engineering staff rely upon to do their jobs. It achieves this by providing cluster hardware, software, and system providers with a precisely defined basis for their products to meet their customers’ production cluster requirements.
What are the Objectives of ICR?
The primary objective of Intel Cluster Ready is to make clusters easier to specify, easier to buy, easier to deploy, and to make it easier to develop applications that run on them. A key feature of ICR is the concept of “application mobility”, which is defined as the ability of a registered Intel Cluster Ready application – more correctly, the same binary – to run correctly on any certified Intel Cluster Ready cluster. Clearly, application mobility is important for users, software providers, hardware providers, and system providers.
• Users want to know the cluster they choose will reliably run the applications they rely on today, and will rely on tomorrow
• Application providers want to satisfy the needs of their customers by providing applications that reliably run on their customers’ cluster hardware and cluster stacks
• Cluster stack providers want to satisfy the needs of their customers by providing a cluster stack that supports their customers’ applications and cluster hardware
• Hardware providers want to satisfy the needs of their customers by providing hardware components that support their customers’ applications and cluster stacks
• System providers want to satisfy the needs of their customers by providing complete cluster implementations that reliably run their customers’ applications
Without application mobility, each group above must either try
to support all combinations, which they have neither the time
nor resources to do, or pick the “winning combination(s)” that
best supports their needs, and risk making the wrong choice.
The Intel Cluster Ready definition of application portability supports all of these needs by going beyond pure portability (recompiling and linking a unique binary for each platform) to application binary mobility (running the same binary on multiple platforms), by more precisely defining the target system.
A further aspect of application mobility is to ensure that registered Intel Cluster Ready applications do not need special programming or alternate binaries for different message fabrics. Intel Cluster Ready accomplishes this by providing an MPI implementation supporting multiple fabrics at runtime; through this, registered Intel Cluster Ready applications obey the “message layer independence property”. Stepping back, the unifying concept of Intel Cluster Ready is “one-to-many”, that is:
• One application will run on many clusters
• One cluster will run many applications
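The runtime fabric selection behind the message layer independence property can be sketched as follows. This is a hypothetical illustration: the helper function is invented, and the `I_MPI_FABRICS` values follow the Intel MPI Library 4.x convention, so check your installation's documentation before relying on them.

```python
# Sketch of runtime fabric selection with the Intel MPI Library: the same
# binary is launched on different message fabrics purely via the environment,
# with no rebuild or relink.

import os

def run_on_fabric(fabric, np, binary):
    """Build the launch command and environment for one fabric choice."""
    env = dict(os.environ, I_MPI_FABRICS=fabric)     # fabric chosen at run time
    cmd = ["mpirun", "-n", str(np), binary]          # identical binary each time
    return cmd, env

# TCP over Ethernet vs. DAPL over InfiniBand -- one binary, two fabrics:
for fabric in ("shm:tcp", "shm:dapl"):
    cmd, env = run_on_fabric(fabric, 64, "./app")
    print(env["I_MPI_FABRICS"], " ".join(cmd))
```

This is exactly what lets a registered application move from a Gigabit Ethernet cluster to an InfiniBand cluster without alternate binaries.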
How is one-to-many accomplished? Looking at Figure 1, you see the abstract Intel Cluster Ready “stack” components that always exist in every cluster, i.e., one or more applications, a cluster software stack, one or more fabrics, and finally the underlying cluster hardware. The remainder of that picture (to the right) shows the components in greater detail.
Applications, on the top of the stack, rely upon the various APIs, utilities, and file system structure presented by the underlying software stack. Registered Intel Cluster Ready applications are always able to rely upon the APIs, utilities, and file system structure specified by the Intel Cluster Ready Specification; if an application requires software outside this “required” set, then Intel Cluster Ready requires the application to provide that software as a part of its installation. To ensure that this additional per-application software doesn’t conflict with the cluster stack or other applications, Intel Cluster Ready also requires the additional software to be installed in application-private trees, so the application knows how to find that software while not interfering with other applications. While this may well cause duplicate software to be installed, the reliability provided by the duplication far outweighs the cost of the duplicated files. A prime example supporting this comparison is the removal of a common file (library, utility, or other) that is unknowingly needed by some other application – such errors can be insidious to repair even when they cause an outright application failure.

“The Intel Cluster Checker allows us to certify that our transtec HPC clusters are compliant with an independent high quality standard. Our customers can rest assured: their applications run as they expect.”
Marcus Wiedemann, HPC Solution Engineer

[Figure 1: The ICR stack – registered applications (CFD, crash, climate, QCD, bio, ...) on top of a single solution platform (Intel MPI Library runtime, Intel MKL runtime, Intel-selected Linux cluster tools, optional Intel development tools such as C++, Intel Trace Analyzer and Collector, MKL), running over fabrics (InfiniBand/OFED, Gigabit and 10Gbit Ethernet) on certified Intel Xeon processor cluster platforms from Intel, OEMs and platform integrators, with optional value-add by the individual platform integrator.]
Cluster platforms, at the bottom of the stack, provide the APIs, utilities, and file system structure relied upon by registered applications. Certified Intel Cluster Ready platforms ensure the APIs, utilities, and file system structure are complete per the Intel Cluster Ready Specification; certified clusters are able to provide them by various means as they deem appropriate. Because of the clearly defined responsibilities ensuring the presence of all software required by registered applications, system providers have a high confidence that the certified clusters they build are able to run any certified applications their customers rely on. In addition to meeting the Intel Cluster Ready requirements, certified clusters can also provide their added value, that is, other features and capabilities that increase the value of their products.
How Does Intel Cluster Ready Accomplish its Objectives?
At its heart, Intel Cluster Ready is a definition of the cluster as a parallel application platform, as well as a tool to certify an actual cluster to the definition. Let’s look at each of these in more detail, to understand their motivations and benefits.
A definition of the cluster as a parallel application platform
The Intel Cluster Ready Specification is very much written as the requirements for, not the implementation of, a platform upon which parallel applications, more specifically MPI applications, can be built and run. As such, the specification doesn’t care whether the cluster is diskful or diskless, fully distributed or single system image (SSI), built from “Enterprise” distributions or community distributions, fully open source or not. Perhaps more importantly, with one exception, the specification doesn’t have any requirements on how the cluster is built; that one exception is that compute nodes must be built with automated tools, so that new, repaired, or replaced nodes can be rebuilt identically to the existing nodes without any manual interaction, other than possibly initiating the build process.
Some items the specification does care about include:
• The ability to run both 32- and 64-bit applications, including MPI applications and X clients, on any of the compute nodes
• Consistency among the compute nodes’ configuration, capability, and performance
• The identical accessibility of libraries and tools across the cluster
• The identical access by each compute node to permanent and temporary storage, as well as users’ data
• The identical access to each compute node from the head node
• The MPI implementation provides fabric independence
• All nodes support network booting and provide a remotely accessible console
The specification also requires that the runtimes for specific Intel software products are installed on every certified cluster:
• Intel Math Kernel Library
• Intel MPI Library Runtime Environment
• Intel Threading Building Blocks
This requirement does two things. First and foremost, mainline Linux distributions do not necessarily provide a sufficient software stack to build an HPC cluster – such specialization is beyond their mission. Secondly, the requirement ensures that programs built with this software will always work on certified clusters and enjoy simpler installations. As these runtimes are directly available from the web, the requirement does not cause additional costs for certified clusters. It is also very important to note that this does not require certified applications to use these libraries, nor does it preclude alternate libraries, e.g., other MPI implementations, from being present on certified clusters. Quite clearly, an application that requires, e.g., an alternate MPI, must also provide the runtimes for that MPI as a part of its installation.
A tool to certify an actual cluster to the definition
The Intel Cluster Checker, included with every certified Intel Cluster Ready implementation, is used in four modes in the life of a cluster:
• To certify a system provider's prototype cluster as a valid implementation of the specification
• To verify to the owner that the just-delivered cluster is a "true copy" of the certified prototype
• To ensure the cluster remains fully functional, reducing service calls not related to the applications or the hardware
• To help software and system providers diagnose and correct actual problems in their code or their hardware
While these are critical capabilities, in all fairness, this greatly understates what Intel Cluster Checker can do: the tool verifies not only that the cluster works, but that it is performing as expected. To do this, per-node and cluster-wide static and dynamic tests are made of the hardware and software.
Intel Cluster Ready
A Quality Standard for HPC Clusters
[Figure: Intel Cluster Checker architecture – the Cluster Checker engine reads a cluster definition and configuration XML file, runs test modules and parallel operation checks against the cluster nodes via a configuration/result API, and reports pass/fail results and diagnostics to STDOUT and a logfile.]
The static checks ensure the systems are configured consistently and appropriately. As one example, the tool will ensure the systems are all running the same BIOS versions as well as having identical configurations among key BIOS settings. This type of problem – differing BIOS versions or settings – can be the root cause of subtle problems such as differing memory configurations that manifest themselves as differing memory bandwidths, only to be seen at the application level as slower than expected overall performance. As is well known, parallel program performance can very much be governed by the performance of the slowest components, not the fastest. In another static check, the Intel Cluster Checker will ensure that the expected tools, libraries, and files are present on each node, identically located on all nodes, as well as identically implemented on all nodes. This ensures that each node has the minimal software stack required by the specification, as well as an identical software stack among the compute nodes.
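A static consistency pass of this kind can be sketched in a few lines. The inventory fields, node names, and the reference-node comparison below are illustrative assumptions, not Intel Cluster Checker's actual data model or output:

```python
# Illustrative sketch of a static cluster-consistency check: each compute
# node's inventory (BIOS version, key BIOS settings, library locations) is
# compared against a reference node. Field names are hypothetical.

def find_inconsistencies(nodes):
    """Return {node: {field: (value, expected)}} for fields that deviate
    from the first node (sorted by name), which serves as the reference."""
    names = sorted(nodes)
    reference = nodes[names[0]]
    problems = {}
    for name in names[1:]:
        diffs = {
            field: (value, reference.get(field))
            for field, value in nodes[name].items()
            if reference.get(field) != value
        }
        if diffs:
            problems[name] = diffs
    return problems

nodes = {
    "node01": {"bios": "2.4b", "ht_enabled": False, "libm": "/usr/lib64/libm.so"},
    "node02": {"bios": "2.4b", "ht_enabled": False, "libm": "/usr/lib64/libm.so"},
    "node03": {"bios": "2.3a", "ht_enabled": True,  "libm": "/usr/lib64/libm.so"},
}

# node03 deviates in BIOS version and hyper-threading setting
print(find_inconsistencies(nodes))
```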
A typical dynamic check ensures consistent system performance, e.g., via the STREAM benchmark. This particular test ensures processor and memory performance is consistent across compute nodes, which, like the BIOS setting example above, can be the root cause of overall slower application performance. An additional check with STREAM can be made if the user configures an expectation of benchmark performance; this check will ensure that performance is not only consistent across the cluster, but also meets expectations. Going beyond processor performance, the Intel MPI Benchmarks are used to ensure the network fabric(s) are performing properly and, with a configuration that describes expected performance levels, up to the cluster provider's performance expectations. Network inconsistencies due to poorly performing Ethernet NICs, InfiniBand HBAs, faulty switches, and loose or faulty cables can be identified. Finally, the Intel Cluster Checker is extensible, enabling additional tests to be added supporting additional features and capabilities.
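The consistency and expectation checks described above can be sketched as follows. Node names, bandwidth figures, and the 5% tolerance are invented for illustration and are not the actual thresholds the Intel Cluster Checker applies:

```python
# Hypothetical sketch of a dynamic performance check: per-node STREAM
# bandwidths must be mutually consistent and, if the operator configured
# an expectation, must also meet that floor.

def check_stream(bandwidths_mb_s, tolerance=0.05, expected=None):
    """Flag nodes whose bandwidth deviates from the cluster median by more
    than `tolerance`, or falls below the configured expectation."""
    values = sorted(bandwidths_mb_s.values())
    median = values[len(values) // 2]
    failures = {}
    for node, bw in bandwidths_mb_s.items():
        if abs(bw - median) / median > tolerance:
            failures[node] = f"inconsistent: {bw} vs median {median}"
        elif expected is not None and bw < expected:
            failures[node] = f"below expectation: {bw} < {expected}"
    return failures

results = {"node01": 41800, "node02": 41650, "node03": 35200, "node04": 41900}
# node03 stands out: a likely mis-set memory configuration
print(check_stream(results, expected=40000))
```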
This extensibility enables the Intel Cluster Checker to support not only the minimal requirements of the Intel Cluster Ready Specification, but the full cluster as delivered to the customer.

Conforming hardware and software
The preceding was primarily related to the builders of certified clusters and the developers of registered applications. For end users that want to purchase a certified cluster to run registered applications, the ability to identify registered applications and certified clusters is most important, as that will reduce their effort to evaluate, acquire, and deploy the clusters that run their applications, and then keep that computing resource operating properly, with full performance, directly increasing their productivity.

[Figure: The Intel Cluster Ready program – ISV enabling, reference designs, demand creation, and support & training built around a specification covering the Intel processor, server platform, interconnect, software, and configuration, with software tools layered on the cluster platform.]

Intel, XEON and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of Intel Corporation.

Intel Cluster Ready Builds HPC Momentum
With the Intel Cluster Ready (ICR) program, Intel Corporation set out to create a win-win scenario for the major constituencies in the high-performance computing (HPC) cluster market. Hardware vendors and independent software vendors (ISVs) stand to win by being able to assure both buyers and users that their products will work well together straight out of the box. System administrators stand to win by being able to meet corporate demands to push HPC competitive advantages deeper into their organizations while satisfying end users' demands for reliable HPC cycles, all without increasing IT staff. End users stand to win by being able to get their work done faster, with less downtime, on certified cluster platforms. Last but not least, with ICR, Intel has positioned itself to win by expanding the total addressable market (TAM) and reducing time to market for the company's microprocessors, chip sets, and platforms.

The Worst of Times
For a number of years, clusters were largely confined to government and academic sites, where contingents of graduate students and midlevel employees were available to help program and maintain the unwieldy early systems. Commercial firms lacked this low-cost labor supply and mistrusted the favored cluster operating system, open source Linux, on the grounds that no single party could be held accountable if something went wrong with it. Today, cluster penetration in the HPC market is deep and wide, extending from systems with a handful of processors to some of the world's largest supercomputers, and from under $25,000 to tens or hundreds of millions of dollars in price. Clusters increasingly pervade every HPC vertical market: biosciences, computer-aided engineering, chemical engineering, digital content creation, economic/financial services, electronic design automation, geosciences/geo-engineering,
mechanical design, defense, government labs, academia, and
weather/climate.
But IDC studies have consistently shown that clusters remain difficult to specify, deploy, and manage, especially for new and less experienced HPC users. This should come as no surprise, given that a cluster is a set of independent computers linked together by software and networking technologies from multiple vendors.

Clusters originated as do-it-yourself HPC systems. In the late 1990s users began employing inexpensive hardware to cobble together scientific computing systems based on the "Beowulf cluster" concept first developed by Thomas Sterling and Donald Becker at NASA. From their Beowulf origins, clusters have evolved and matured substantially, but the system management issues that plagued their early years remain in force today.
The Need for Standard Cluster Solutions
The escalating complexity of HPC clusters poses a dilemma for many large IT departments that cannot afford to scale up their HPC-knowledgeable staff to meet the fast-growing end-user demand for technical computing resources. Cluster management is even more problematic for smaller organizations and business units that often have no dedicated, HPC-knowledgeable staff to begin with.

The ICR program aims to address burgeoning cluster complexity by making available a standard solution (aka reference architecture) for Intel-based systems that hardware vendors can use to certify their configurations and that ISVs and other software vendors can use to test and register their applications, system software, and HPC management software. The chief goal of this voluntary compliance program is to ensure fundamental hardware-software integration and interoperability so that system administrators and end users can confidently purchase and deploy HPC clusters, and get their work done, even in cases where no HPC-knowledgeable staff are available to help.
The ICR program wants to prevent end users from having to become, in effect, their own systems integrators. In smaller organizations, the ICR program is designed to allow overworked IT departments with limited or no HPC expertise to support HPC user requirements more readily. For larger organizations with dedicated HPC staff, ICR creates confidence that required user applications will work, eases the problem of system administration, and allows HPC cluster systems to be scaled up in size without scaling support staff. ICR can help drive HPC cluster resources deeper into larger organizations and free up IT staff to focus on mainstream enterprise applications (e.g., payroll, sales, HR, and CRM).

The program is a three-way collaboration among hardware vendors, software vendors, and Intel. In this triple alliance, Intel provides the specification for the cluster architecture implementation, and then vendors certify the hardware configurations and register software applications as compliant with the specification. The ICR program's promise to system administrators and end users is that registered applications will run out of the box on certified hardware configurations.
ICR solutions are compliant with the standard platform architecture, which starts with 64-bit Intel Xeon processors in an Intel-certified cluster hardware platform. Layered on top of this foundation are the interconnect fabric (Gigabit Ethernet, InfiniBand) and the software stack: Intel-selected Linux cluster tools, an Intel MPI runtime library, and the Intel Math Kernel Library. Intel runtime components are available and verified as part of the certification (e.g., Intel tool runtimes) but are not required to be used by applications. The inclusion of these Intel runtime components does not exclude any other components a systems vendor or ISV might want to use. At the top of the stack are Intel-registered ISV applications.
At the heart of the program is the Intel Cluster Checker, a validation tool that verifies that a cluster is specification compliant and operational before ISV applications are ever loaded. After the cluster is up and running, the Cluster Checker can function as a fault isolation tool in wellness mode. Certification needs to happen only once for each distinct hardware platform, while verification – which determines whether a valid copy of the specification is operating – can be performed by the Cluster Checker at any time.

Cluster Checker is an evolving tool that is designed to accept new test modules. It is a productized tool that ICR members ship with their systems. Cluster Checker originally was designed for homogeneous clusters but can now also be applied to clusters with specialized nodes, such as all-storage sub-clusters. Cluster Checker can isolate a wide range of problems, including network or communication problems.
Intel Cluster Ready
The transtec Benchmarking Center
transtec offers its customers a new and fascinating way to evaluate transtec's HPC solutions in real-world scenarios. In the transtec Benchmarking Center, solutions can be explored in detail with the actual applications the customers will later run on them. Intel Cluster Ready makes this feasible by simplifying the maintenance of the systems and allowing clean systems to be set up easily, and as often as needed. As High-Performance Computing (HPC) systems are utilized for numerical simulations, more and more advanced clustering technologies are being deployed. Because of their performance, price/performance and energy efficiency advantages, clusters now dominate all segments of the HPC market and continue to gain acceptance. HPC computer systems have become far more widespread and pervasive in government, industry, and academia. However, rarely does the client have the possibility to test their actual application on the system they are planning to acquire.
transtec HPC solutions are used by a wide variety of clients. Among those are most of the large users of compute power at German and other European universities and research centers, as well as governmental users like the German army's compute center and clients from the high-tech, automotive and other sectors. transtec HPC solutions have demonstrated their value in more than 500 installations. Most of transtec's cluster systems are based on SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS, or Scientific Linux. With xCAT for efficient cluster deployment, and Moab HPC Suite by Adaptive Computing for high-level cluster management, transtec is able to efficiently deploy and ship easy-to-use HPC cluster solutions with enterprise-class management features. Moab has proven to provide easy-to-use workload and job management for small systems as well as the largest cluster installations worldwide.
However, when selling clusters to governmental customers as well as other large enterprises, it is often required that the client can choose from a range of competing offers. Many times there is a fixed budget available, and competing solutions are compared based on their performance on certain custom benchmark codes.
So, in 2007 transtec decided to add another layer to its already wide array of competence in HPC – ranging from cluster deployment and management and the latest CPU, board and network technology to HPC storage systems. In transtec's HPC Lab the systems are assembled; transtec uses Intel Cluster Ready to facilitate testing, verification, documentation, and final testing throughout the actual build process. At the Benchmarking Center, transtec can now offer a set of small clusters with the "newest and hottest technology" through Intel Cluster Ready. A standard installation infrastructure gives transtec a quick and easy way to set systems up according to their customers' choice of operating system, compilers, workload management suite, and so on. With Intel Cluster Ready there are prepared standard set-ups available with verified performance at standard benchmarks, while system stability is guaranteed by our own test suite and the Intel Cluster Checker.
The Intel Cluster Ready program is designed to provide a common standard for HPC clusters, helping organizations design and build seamless, compatible and consistent cluster configurations. Integrating the standards and tools provided by this program can help significantly simplify the deployment and management of HPC clusters.
Parallel NFS – The New Standard for HPC Storage
HPC computation results in the terabyte range are not uncommon. The problem in this context is not so much storing the data at rest, but the performance of the necessary copying back and forth in the course of the computation job flow and the dependent job turn-around time. For interim results during a job's runtime, or for fast storage of input and results data, parallel file systems have established themselves as the standard to meet the ever-increasing performance requirements of HPC storage systems. Parallel NFS is about to become the new standard framework for a parallel file system.
Yesterday's Solution: NFS for HPC Storage
The original Network File System (NFS), developed by Sun Microsystems in the mid-1980s and now available in version 4.1, has long been established as a de-facto standard for the provisioning of a global namespace in networked computing. A very widespread HPC cluster solution includes a central master node acting simultaneously as an NFS server, with its local file system storing input, interim and results data and exporting them to all other cluster nodes.
There is of course an immediate bottleneck in this method: when the load on the network is high, or when there are large numbers of nodes, the NFS server can no longer keep up delivering or receiving the data. In high-performance computing especially, the nodes are interconnected via at least Gigabit Ethernet, so the sum total throughput is well above what an NFS server with a single Gigabit interface can achieve. Even a powerful network connection of the NFS server to the cluster, for example with 10-Gigabit Ethernet, is only a temporary solution to this problem until the next cluster upgrade. The fundamental problem remains – this solution is not scalable. In addition, NFS is a difficult protocol to cluster in terms of load balancing: either you have to ensure that multiple NFS servers accessing the same data are constantly synchronised, the disadvantage being a noticeable drop in performance, or you manually partition the global namespace, which is also time-consuming. NFS is not suitable for dynamic load balancing, as on paper it appears to be stateless but in reality is, in fact, stateful.
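The scalability argument is easy to put into numbers. A minimal sketch, with illustrative figures rather than measurements:

```python
# Back-of-the-envelope view of the bottleneck: aggregate client demand grows
# linearly with node count, while a single NFS server interface stays fixed.

GBIT_MB_S = 125  # 1 Gbit/s is roughly 125 MB/s of payload, ignoring overhead

def oversubscription(nodes, server_links=1):
    """Ratio of potential client demand to NFS server capacity, assuming one
    Gigabit link per compute node and per server interface."""
    return (nodes * GBIT_MB_S) / (server_links * GBIT_MB_S)

print(oversubscription(64))                   # 64.0 -> single GbE server link
print(oversubscription(64, server_links=10))  # 6.4  -> 10 GbE only defers the problem
```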
Today's Solution: Parallel File Systems
For some time, powerful commercial products have been available to meet the high demands on an HPC storage system. The open-source solutions FraunhoferFS (FhGFS) from the Fraunhofer Competence Center for High Performance Computing and Lustre are widely used in the Linux HPC world, and several other free as well as commercial parallel file system solutions exist.
What is new is that the time-honoured NFS is being upgraded, parallel version included, into an Internet Standard, with the aim of interoperability between all operating systems. The original problem statement for parallel NFS access was written by Garth Gibson, a professor at Carnegie Mellon University and founder and CTO of Panasas. Gibson was already a renowned figure, being one of the authors contributing to the original paper on RAID architecture from 1988. The original statement from Gibson and Panasas is clearly noticeable in the design of pNFS. The powerful HPC file system developed by Gibson and Panasas, ActiveScale PanFS, with object-based storage devices functioning as central components, is basically the commercial continuation of the "Network-Attached Secure Disk (NASD)" project, also developed by Garth Gibson at Carnegie Mellon University.
Parallel NFS
Parallel NFS (pNFS) is gradually emerging as the future standard to meet requirements in the HPC environment. From the industry's as well as the user's perspective, the benefits of utilising standard solutions are indisputable: besides protecting end-user investment, standards also ensure a defined level of interoperability without restricting the choice of products available. As a result, less user and administrator training is required, which leads to simpler deployment and, at the same time, greater acceptance.

As part of the NFS 4.1 Internet Standard, pNFS will not only adopt the semantics of NFS in terms of cache consistency and security, it also represents an easy and flexible extension of the NFS 4 protocol. pNFS is optional; in other words, NFS 4.1 implementations do not have to include pNFS as a feature. The NFS 4.1 Internet Standard has since been published as IETF RFC 5661.
The pNFS protocol supports a separation of metadata and data: a pNFS cluster comprises so-called storage devices, which store the data of the shared file system, and a metadata server (MDS) – called Director Blade with Panasas – which is the actual NFS 4.1 server. The metadata server keeps track of which data is stored on which storage devices and how the files are accessed – the so-called layout. Besides these striping parameters, the MDS also manages other metadata, including access rights and the like, which is usually stored in a file's inode.

The layout types define which Storage Access Protocol is used by the clients to access the storage devices. Up until now, three potential storage access protocols have been defined for pNFS: file, block and object-based layouts, the former being described directly in RFC 5661, the latter two in RFC 5663 and RFC 5664, respectively. Last but not least, a Control Protocol is used by the MDS and storage devices to synchronise status data. This protocol is deliberately left unspecified in the standard to give manufacturers certain flexibility. The NFS 4.1 standard does, however, specify certain conditions which a control protocol has to fulfil, for example, how to deal with the change/modify time attributes of files.
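To illustrate what a layout buys the client, here is a hypothetical sketch of simple round-robin striping: given the striping parameters obtained from the MDS, the client itself maps a file offset to a storage device and a device-local offset, then performs the I/O directly. This is a simplification for illustration, not the RFC 5661 layout format:

```python
# Sketch of client-side layout interpretation under plain round-robin
# striping. Device names and the layout structure are invented.

def locate(offset, stripe_size, devices):
    """Map a byte offset to (device, device-local offset)."""
    stripe_index = offset // stripe_size
    device = devices[stripe_index % len(devices)]
    local = (stripe_index // len(devices)) * stripe_size + offset % stripe_size
    return device, local

devices = ["osd0", "osd1", "osd2"]
print(locate(0, 65536, devices))       # ('osd0', 0)
print(locate(200000, 65536, devices))  # ('osd0', 68928): wrapped to osd0's second unit
```

Because the mapping is pure arithmetic over static striping parameters, many clients can cache the same layout and compute it independently, with no per-I/O round trip to the MDS.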
[Figure: A classical NFS server is a bottleneck – all cluster nodes act as NFS clients of a single NAS head serving as the NFS server.]
What's New in NFS 4.1?
NFS 4.1 is a minor update to NFS 4 that adds new features to it. One of the optional features is parallel NFS (pNFS), but there is other new functionality as well.

One of the technical enhancements is the use of sessions, a persistent server object dynamically created by the client. By means of sessions, the state of an NFS connection can be stored, no matter whether the connection is live or not. Sessions survive temporary downtimes of both the client and the server. Each session has a so-called fore channel, which is the connection from the client to the server for all RPC operations, and optionally a back channel for RPC callbacks from the server, which now can also be realized through firewall boundaries. Sessions can be trunked to increase the bandwidth. Besides session trunking, there is also client ID trunking for grouping together several sessions to the same client ID.

By means of sessions, NFS becomes a truly stateful protocol with so-called "Exactly-Once Semantics (EOS)". Until now, a necessary but unspecified reply cache within the NFS server handled identical RPC operations that had been sent several times. This statefulness in reality is not very robust, however, and sometimes leads to the well-known stale NFS handles. In NFS 4.1, the reply cache is now a mandatory part of the NFS implementation, storing the server replies to RPC requests persistently on disk.

Another new feature of NFS 4.1 is delegation for directories: NFS clients can be given temporary exclusive access to directories. Before, this had only been possible for plain files. With the forthcoming version 4.2 of the NFS standard, federated filesystems will be added as a feature, which represents the NFS counterpart of Microsoft's DFS (distributed filesystem).
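The exactly-once idea behind the mandatory reply cache described above can be sketched as follows. The keying by (session, slot, sequence) mirrors the concept; the class itself is purely illustrative, not an actual NFS 4.1 implementation (in particular, a real server persists the cache to disk):

```python
# Illustrative exactly-once reply cache: a retransmitted RPC with the same
# (session, slot, sequence) key gets the cached reply back instead of being
# executed a second time.

class ReplyCache:
    def __init__(self):
        self._cache = {}     # (session_id, slot, seq) -> cached reply
        self.executions = 0  # how many operations actually ran

    def handle(self, session_id, slot, seq, operation):
        key = (session_id, slot, seq)
        if key in self._cache:        # duplicate request: replay, don't redo
            return self._cache[key]
        self.executions += 1
        reply = operation()
        self._cache[key] = reply
        return reply

cache = ReplyCache()
first = cache.handle("s1", 0, 7, lambda: "file created")
retry = cache.handle("s1", 0, 7, lambda: "file created")  # retransmission
print(first == retry, cache.executions)  # True 1
```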
pNFS supports backwards compatibility with non-pNFS com-
patible NFS 4 clients. In this case, the MDS itself gathers data
from the storage devices on behalf of the NFS client and pres-
ents the data to the NFS client via NFS 4. The MDS acts as a kind
of proxy server – which is e.g. what the Director Blades from Pa-
nasas do.
pNFS Layout Types
If storage devices act simply as NFS 4 file servers, the file layout is used. It is the only storage access protocol directly specified in the NFS 4.1 standard. Besides the stripe sizes and stripe locations (storage devices), it also includes the NFS file handles which the client needs to use to access the separate file areas. The file layout is compact and static; the striping information does not change even if changes are made to the file, enabling multiple pNFS clients to simultaneously cache the layout and avoid synchronisation overhead between clients and the MDS, or the MDS and storage devices.

File system authorisation and client authentication can be well implemented with the file layout. When using NFS 4 as the storage access protocol, client authentication merely depends on the security flavor used – when using the RPCSEC_GSS security flavor, client access is kerberized, for example, and the server controls access authorization using specified ACLs and cryptographic processes.
In contrast, the block/volume layout uses volume identifiers
and block offsets and extents to specify a file layout. SCSI
block commands are used to access storage devices. As the
block distribution can change with each write access, the
layout must be updated more frequently than with the file
layout.
Block-based access to storage devices does not offer any secure authentication option for the accessing SCSI initiator. Secure SAN authorisation is possible with host granularity only, based on World Wide Names (WWNs) with Fibre Channel or iSCSI Qualified Names (IQNs) with iSCSI. The server cannot enforce access control governed by the file system. On the contrary, a pNFS client essentially abides by the access rights voluntarily; the storage device has to trust the pNFS client – a fundamental access control problem that is a recurrent issue in the NFS protocol's history.
The object layout is syntactically similar to the file layout, but it uses the SCSI object command set for data access to so-called Object-based Storage Devices (OSDs) and is heavily based on the DirectFLOW protocol of the ActiveScale PanFS from Panasas. From the very start, Object-based Storage Devices were designed for secure authentication and access: the MDS issues so-called capabilities to the pNFS clients, and ownership of such a capability represents the authoritative access right to an object.

pNFS can be upgraded to integrate other storage access protocols and operating systems, and storage manufacturers also have the option to ship additional layout drivers for their pNFS implementations.
[Figure: pNFS architecture – pNFS clients speak NFS 4.1 to the metadata server and a Storage Access Protocol to the storage devices, while the metadata server and storage devices synchronise via a Control Protocol.]
Panasas HPC Storage
Having many years' experience in deploying parallel file systems like Lustre or FraunhoferFS (FhGFS), from smaller scales up to hundreds of terabytes of capacity and throughputs of several gigabytes per second, transtec chose Panasas, the leader in HPC storage solutions, as the partner for providing highest performance and scalability on the one hand, and ease of management on the other. Therefore, with Panasas as the technology leader, and transtec's overall experience and customer-oriented approach, customers can be assured of getting the best possible HPC storage solution available.

The Panasas file system uses parallel and redundant access to object storage devices (OSDs), per-file RAID, distributed metadata management, consistent client caching, file locking services, and internal cluster management to provide a scalable, fault-tolerant, high-performance distributed file system. The clustered design of the storage system and the use of client-driven RAID provide scalable performance to many concurrent file system clients through parallel access to file data that is striped across OSD storage nodes. RAID recovery is performed in parallel by the cluster of metadata managers, and declustered data placement yields scalable RAID rebuild rates as the storage system grows larger.
Introduction
Storage systems for high-performance computing environments must be designed to scale in performance so that they can be configured to match the required load. Clustering techniques are often used to provide scalability. In a storage cluster, many nodes each control some storage, and the overall distributed file system assembles the cluster elements into one large, seamless storage system. The storage cluster can be hosted on the same computers that perform data processing, or it can be a separate cluster that is devoted entirely to storage and accessible to the compute cluster via a network protocol.
The Panasas storage system is a specialized storage cluster, and this paper presents its design and a number of performance measurements to illustrate the scalability. The Panasas system is a production system that provides file service to some of the largest compute clusters in the world – in scientific labs, in seismic data processing, in digital animation studios, in computational fluid dynamics, in semiconductor manufacturing, and in general-purpose computing environments. In these environments, hundreds or thousands of file system clients share data and generate very high aggregate I/O load on the file system. The Panasas system is designed to support several thousand clients and storage capacities in excess of a petabyte.

The unique aspects of the Panasas system are its use of per-file, client-driven RAID, its parallel RAID rebuild, its treatment of different classes of metadata (block, file, system) and a commodity-parts-based blade hardware with an integrated UPS. Of course, the system has many other features (such as object storage, fault tolerance, caching and cache consistency, and a simplified management model) that are not unique, but are necessary for a scalable system implementation.
Panasas File System Background
The two overall themes of the system are object storage, which affects how the file system manages its data, and clustering of components, which allows the system to scale in performance and capacity.

Object Storage
An object is a container for data and attributes; it is analogous to the inode inside a traditional UNIX file system implementation. Specialized storage nodes called Object Storage Devices (OSDs) store objects in a local OSDFS file system. The object interface addresses objects in a two-level (partition ID/object ID) namespace. The OSD wire protocol provides byte-oriented access to the data, attribute manipulation, creation and deletion of objects, and several other specialized operations. Panasas uses an iSCSI transport to carry OSD commands that are very similar to the OSDv2 standard currently in progress within SNIA and ANSI T10.
The Panasas file system is layered over the object storage. Each file is striped over two or more objects to provide redundancy and high-bandwidth access. The file system semantics are implemented by metadata managers that mediate access to objects from clients of the file system. The clients access the object storage using the iSCSI/OSD protocol for Read and Write operations. The I/O operations proceed directly and in parallel to the storage nodes, bypassing the metadata managers. The clients interact with the out-of-band metadata managers via RPC to obtain access capabilities and location information for the objects that store files.
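Client-driven, per-file RAID can be illustrated with a small sketch: the client splits file data into stripe units, computes the XOR parity itself, and could then write each unit to a different OSD with no RAID controller in the data path. This shows only the parity idea; Panasas' actual declustered placement and parallel rebuild machinery are far more elaborate:

```python
# Illustrative per-file striping with client-computed XOR parity, and the
# matching recovery of a lost stripe unit. OSD placement is implied by the
# unit's position; all names and sizes are invented.

def stripe_with_parity(data, stripe_size, n_data_osds):
    """Split `data` into full stripes of n_data_osds units, each padded to
    stripe_size, and append one XOR parity unit per stripe."""
    stripes = []
    chunk = stripe_size * n_data_osds
    for start in range(0, len(data), chunk):
        units = [
            data[start + i * stripe_size:start + (i + 1) * stripe_size]
            .ljust(stripe_size, b"\0")
            for i in range(n_data_osds)
        ]
        parity = bytearray(stripe_size)
        for unit in units:
            for i, b in enumerate(unit):
                parity[i] ^= b
        stripes.append(units + [bytes(parity)])
    return stripes

def rebuild(units, parity, lost_index):
    """Recover a lost data unit from the surviving units and the parity."""
    recovered = bytearray(parity)
    for i, unit in enumerate(units):
        if i != lost_index:
            for j, b in enumerate(unit):
                recovered[j] ^= b
    return bytes(recovered)

stripes = stripe_with_parity(b"abcdefgh", stripe_size=2, n_data_osds=2)
units, parity = stripes[0][:-1], stripes[0][-1]
print(rebuild(units, parity, lost_index=1))  # b'cd'
```

Because parity is computed per file, recovery work can be spread over many metadata managers in parallel, which is what makes the rebuild rate scale with the size of the storage system.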
Object attributes are used to store file-level attributes, and
directories are implemented with objects that store name to
object ID mappings. Thus the file system metadata is kept in
the object store itself, rather than being kept in a separate da-
tabase or some other form of storage on the metadata nodes.
System Software Components
The major software subsystems are the OSDFS object storage
system, the Panasas file system metadata manager, the Pana-
sas file system client, the NFS/CIFS gateway, and the overall
cluster management system.
• The Panasas client is an installable kernel module that runs inside the Linux kernel. The kernel module implements the standard VFS interface, so that the client hosts can mount the file system and use a POSIX interface to the storage system.
• Each storage cluster node runs a common platform that is based on FreeBSD, with additional services to provide hardware monitoring, configuration management, and overall control.
• The storage nodes use a specialized local file system (OSDFS) that implements the object storage primitives. They implement an iSCSI target and the OSD command set. The OSDFS object store and iSCSI target/OSD command processor are kernel modules. OSDFS is concerned with traditional block-level file system issues such as efficient disk arm utilization, media management (i.e., error handling), and high throughput, as well as the OSD interface.
• The cluster manager (SysMgr) maintains the global configuration, and it controls the other services and nodes in the storage cluster. There is an associated management application that provides both a command line interface (CLI) and an HTML interface (GUI). These are all user-level applications that run on a subset of the manager nodes. The cluster manager is concerned with membership in the storage cluster, fault detection, configuration management, and overall control for operations like software upgrade and system restart.
• The Panasas metadata manager (PanFS) implements the file system semantics and manages data striping across the object storage devices. This is a user-level application that runs on every manager node. The metadata manager is concerned with distributed file system issues such as secure multi-user access, maintaining consistent file- and object-level metadata, client cache coherency, and recovery from client, storage node, and metadata server crashes. Fault tolerance is based on a local transaction log that is replicated to a backup on a different manager node.
• The NFS and CIFS services provide access to the file system for hosts that cannot use our Linux installable file system client. The NFS service is a tuned version of the standard FreeBSD NFS server that runs inside the kernel. The CIFS service is based on Samba and runs at user level. In turn, these services use a local instance of the file system client, which runs inside the FreeBSD kernel. These gateway services run on every manager node to provide a clustered NFS and CIFS service.
Commodity Hardware Platform
The storage cluster nodes are implemented as blades – very compact computer systems made from commodity parts. The blades are clustered together to provide a scalable platform. The OSD StorageBlade module and the metadata manager DirectorBlade module use the same blade form factor and fit into the same chassis slots.
Storage Management
Traditional storage management tasks involve partitioning
available storage space into LUNs (i.e., logical units that are one
or more disks, or a subset of a RAID array), assigning LUN owner-
ship to different hosts, configuring RAID parameters, creating
file systems or databases on LUNs, and connecting clients to
the correct server for their storage. This can be a labor-intensive scenario. Panasas provides a simplified model for storage management that shields the storage administrator from these kinds of details and allows a single, part-time administrator to manage systems that are hundreds of terabytes in size.
The Panasas storage system presents itself as a file system with
a POSIX interface, and hides most of the complexities of storage
management. Clients have a single mount point for the entire
system. The /etc/fstab file references the cluster manager, and
from that the client learns the location of the metadata service
instances. The administrator can add storage while the system
is online, and new resources are automatically discovered. To
manage available storage, Panasas introduces two basic stor-
age concepts: a physical storage pool called a BladeSet, and a
logical quota tree called a Volume.
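The relationship between the two concepts can be sketched as follows. This is a minimal data model, not Panasas code: class names mirror the terms above, a Volume's quota can oversubscribe the pool, and capacity is only drawn from the shared BladeSet when it is actually used.

```python
# Sketch of the BladeSet / Volume storage model: a BladeSet is a physical
# pool of StorageBlade capacity; a Volume is a quota-constrained logical
# tree assigned to one BladeSet. All names are illustrative.

class BladeSet:
    def __init__(self, capacity_tb):
        self.capacity_tb = capacity_tb
        self.used_tb = 0.0
        self.volumes = []

    def add_volume(self, volume):
        volume.bladeset = self
        self.volumes.append(volume)

class Volume:
    def __init__(self, name, quota_tb):
        self.name = name
        self.quota_tb = quota_tb      # can be changed at any time
        self.used_tb = 0.0
        self.bladeset = None

    def allocate(self, tb):
        # capacity is consumed from the shared BladeSet only when used,
        # so multiple volumes compete for space and grow on demand
        if self.used_tb + tb > self.quota_tb:
            raise ValueError("volume quota exceeded")
        if self.bladeset.used_tb + tb > self.bladeset.capacity_tb:
            raise ValueError("BladeSet full")
        self.used_tb += tb
        self.bladeset.used_tb += tb

pool = BladeSet(capacity_tb=100)
home = Volume("/home", quota_tb=60)
scratch = Volume("/scratch", quota_tb=60)   # quotas may oversubscribe the pool
pool.add_volume(home)
pool.add_volume(scratch)
home.allocate(30)
scratch.allocate(30)
```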
The BladeSet is a collection of StorageBlade modules in one or
more shelves that comprise a RAID fault domain. Panasas miti-
gates the risk of large fault domains with the scalable rebuild
performance described below. The BladeSet is a hard physical boundary for the volumes it contains.

Figure: Panasas System Components – a compute node runs the PanFS client, which reaches the manager nodes (SysMgr, PanFS metadata manager, NFS/CIFS gateway) via RPC and the storage nodes (OSDFS) directly via iSCSI/OSD.

A BladeSet can
be grown at any time, either by adding more StorageBlade mod-
ules, or by merging two existing BladeSets together.
The Volume is a directory hierarchy that has a quota constraint
and is assigned to a particular BladeSet. The quota can be
changed at any time, and capacity is not allocated to the Vol-
ume until it is used, so multiple volumes compete for space
within their BladeSet and grow on demand. The files in those volumes are distributed among all the StorageBlade modules in the BladeSet.
Volumes appear in the file system name space as directories. Clients have a single mount point for the whole storage system, and volumes are simply directories below the mount point.
There is no need to update client mounts when the admin cre-
ates, deletes, or renames volumes.
Automatic Capacity Balancing
Capacity imbalance occurs when expanding a BladeSet (i.e.,
adding new, empty storage nodes), merging two BladeSets, and
replacing a storage node following a failure. In the latter scenar-
io, the imbalance is the result of the RAID rebuild, which uses
spare capacity on every storage node rather than dedicating a
specific “hot spare” node. This provides better throughput during rebuild, but causes the system to have a new, empty storage node after the failed storage node is replaced. The system
automatically balances used capacity across storage nodes in a
BladeSet using two mechanisms: passive balancing and active
balancing.
Passive balancing changes the probability that a storage node will be used for a new component of a file, based on its available capacity. This takes effect when files are created, and when their stripe size is increased to include more storage nodes. Active balancing is done by moving an existing component object from one storage node to another, and updating the storage map for the affected file. During the transfer, the file is transparently marked read-only by the storage management layer, and the capacity balancer skips files that are being actively written. Capacity balancing is thus transparent to file system clients.
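The core of passive balancing is a capacity-weighted placement choice, which can be sketched in a few lines. This is only an illustration of the idea described above (weights and node names are invented): nodes with more free capacity are proportionally more likely to receive new component objects, so a newly added empty node fills faster.

```python
# Sketch of passive capacity balancing: a storage node is chosen for a
# new component object with probability proportional to its free
# capacity, so emptier nodes fill faster. Purely illustrative.
import random

def choose_storage_node(free_capacity, rng=random):
    """free_capacity: dict mapping node name -> free capacity units."""
    nodes = list(free_capacity)
    weights = [free_capacity[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

# osd-3 is a newly added, nearly empty StorageBlade module
free = {"osd-1": 100, "osd-2": 100, "osd-3": 800}
counts = {n: 0 for n in free}
rng = random.Random(1)          # fixed seed for a repeatable demo
for _ in range(10_000):
    counts[choose_storage_node(free, rng)] += 1
```

Over many placements, the empty node receives roughly 80% of new components here, which is exactly the self-leveling behavior the text describes.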
Object RAID and Reconstruction
Panasas protects against loss of a data object or an entire storage node by striping files across objects stored on different storage nodes, using a fault-tolerant striping algorithm such as RAID-1 or RAID-5. Small files are mirrored on two objects, and larger files are striped more widely to provide higher bandwidth and less capacity overhead from parity information. The per-file RAID layout means that parity information for different files is not mixed together, and easily allows different files to use different RAID schemes alongside each other. This property and the security mechanisms of the OSD protocol make it possible to enforce access control over files even as clients access storage nodes directly. It also enables what is perhaps the most novel aspect of our system, client-driven RAID: the clients are responsible for computing and writing parity. The OSD security mechanism also allows multiple metadata managers to manage objects on the same storage device without heavyweight coordination or interference from each other.
Client-driven, per-file RAID has four advantages for large-scale storage systems. First, by having clients compute parity for their own data, the XOR power of the system scales up as the number of clients increases. We measured XOR processing during streaming write bandwidth loads at 7% of the client’s CPU, with the rest going to the OSD/iSCSI/TCP/IP stack and other file system overhead. Moving XOR computation out of the storage system into the client requires some additional work to handle failures. Clients are responsible for generating good data and good parity for it. Because the RAID equation is per-file, an errant client can only damage its own data. However, if a client fails during a write, the metadata manager will scrub parity to ensure the parity equation is correct.
The second advantage of client-driven RAID is that clients can
perform an end-to-end data integrity check. Data has to go
through the disk subsystem, through the network interface on
the storage nodes, through the network and routers, through
the NIC on the client, and all of these transits can introduce
errors with a very low probability. Clients can choose to read
parity as well as data, and verify parity as part of a read opera-
tion. If errors are detected, the operation is retried. If the error
is persistent, an alert is raised and the read operation fails. By
checking parity across storage nodes within the client, the system can ensure end-to-end data integrity. This is another novel property of per-file, client-driven RAID.
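The two mechanisms just described – the client computing XOR parity on write, and re-checking the parity equation on read – can be sketched together. This is a toy model of the idea, not the Panasas layout: stripe units are equal-sized byte strings, and the "transit error" is a deliberately flipped byte.

```python
# Sketch of client-driven, per-file RAID-5: the client XORs its stripe
# units into a parity unit on write, and can re-verify the parity
# equation on read as an end-to-end integrity check.

def xor_parity(units):
    """XOR equal-length byte strings into one parity unit."""
    parity = bytearray(len(units[0]))
    for unit in units:
        for i, b in enumerate(unit):
            parity[i] ^= b
    return bytes(parity)

def write_stripe(data_units):
    """Client computes and writes parity alongside its own data."""
    return list(data_units) + [xor_parity(data_units)]

def verified_read(stripe):
    """Read data plus parity; fail if the parity equation does not hold."""
    *data, parity = stripe
    if xor_parity(data) != parity:
        raise IOError("end-to-end parity check failed")
    return data

stripe = write_stripe([b"aaaa", b"bbbb", b"cccc"])
data = verified_read(stripe)                         # parity consistent
corrupted = [b"aaaa", b"bXbb", b"cccc", stripe[3]]   # bit flipped in transit
```

In the real system a failed check is first retried, and only a persistent error raises an alert; the sketch just shows where the check sits.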
Third, per-file RAID protection lets the metadata managers rebuild files in parallel. Although parallel rebuild is theoretically possible in block-based RAID, it is rarely implemented. This is due to the fact that the disks are owned by a single RAID controller, even in dual-ported configurations. Large storage systems have multiple RAID controllers that are not interconnected. Since the SCSI block command set does not provide fine-grained synchronization operations, it is difficult for multiple RAID controllers to coordinate a complicated operation such as an online rebuild without external communication. Even if they could, without connectivity to the disks in the affected parity group, other RAID controllers would be unable to assist. Even in a high-availability configuration, each disk is typically only attached to two different RAID controllers, which limits the potential speedup to 2x.
When a StorageBlade module fails, the metadata managers that own Volumes within that BladeSet determine what files are affected, and then they farm out file reconstruction work to every other metadata manager in the system. Metadata managers rebuild their own files first, but if they finish early or do not own any Volumes in the affected BladeSet, they are free to aid other metadata managers. Declustered parity groups spread out the I/O workload among all StorageBlade modules in the BladeSet. The result is that larger storage clusters reconstruct lost data more quickly.
The fourth advantage of per-file RAID is that unrecoverable faults can be constrained to individual files.

©2013 Panasas Incorporated. All rights reserved. Panasas, the Panasas logo, Accelerating Time to Results, ActiveScale, DirectFLOW, DirectorBlade, StorageBlade, PanFS, PanActive and MyPanasas are trademarks or registered trademarks of Panasas, Inc. in the United States and other countries. All other trademarks are the property of their respective owners. Information supplied by Panasas, Inc. is believed to be accurate and reliable at the time of publication, but Panasas, Inc. assumes no responsibility for any errors that may appear in this document. Panasas, Inc. reserves the right, without notice, to make changes in product design, specifications and prices. Information is subject to change without notice.

The most commonly encountered double-failure scenario with RAID-5 is an
unrecoverable read error (i.e., grown media defect) during the
reconstruction of a failed storage device. The second storage
device is still healthy, but it has been unable to read a sector,
which prevents rebuild of the sector lost from the first drive and
potentially the entire stripe or LUN, depending on the design of
the RAID controller. With block-based RAID, it is difficult or im-
possible to directly map any lost sectors back to higher-level file
system data structures, so a full file system check and media
scan will be required to locate and repair the damage. A more
typical response is to fail the rebuild entirely. RAID controllers
monitor drives in an effort to scrub out media defects and avoid
this bad scenario, and the Panasas system does media scrub-
bing, too. However, with high capacity SATA drives, the chance
of encountering a media defect on drive B while rebuilding drive
A is still significant. With per-file RAID-5, this sort of double fail-
ure means that only a single file is lost, and the specific file can
be easily identified and reported to the administrator. While
block-based RAID systems have been compelled to introduce
RAID-6 (i.e., fault tolerant schemes that handle two failures), the
Panasas solution is able to deploy highly reliable RAID-5 sys-
tems with large, high performance storage pools.
RAID Rebuild Performance
RAID rebuild performance determines how quickly the system
can recover data when a storage node is lost. Short rebuild
times reduce the window in which a second failure can cause
data loss. There are three techniques to reduce rebuild times: re-
ducing the size of the RAID parity group, declustering the place-
ment of parity group elements, and rebuilding files in parallel
using multiple RAID engines.
The rebuild bandwidth is the rate at which reconstructed data
is written to the system when a storage node is being recon-
structed. The system must read N times as much as it writes,
depending on the width of the RAID parity group, so the overall
throughput of the storage system is several times higher than
the rebuild rate. A narrower RAID parity group requires fewer
read and XOR operations to rebuild, so will result in a higher
rebuild bandwidth. However, it also results in higher capacity
overhead for parity data, and can limit bandwidth during nor-
mal I/O. Thus, selection of the RAID parity group size is a trade-
off between capacity overhead, on-line performance, and re-
build performance.
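The trade-off just described is simple arithmetic, sketched below under the usual RAID-5 assumption that rebuilding one lost element requires reading the surviving elements of its parity group. The group widths are illustrative values, not Panasas defaults.

```python
# Worked example of the rebuild trade-off: with a RAID-5 parity group of
# width w, rebuilding one lost element requires reading the w-1
# survivors, so rebuild read traffic is (w-1)x the rebuild write rate.
# A narrower group rebuilds with less I/O but costs more parity capacity.

def rebuild_read_amplification(group_width):
    """How many units must be read per reconstructed unit."""
    return group_width - 1

def parity_capacity_overhead(group_width):
    """Fraction of raw capacity spent on parity (one unit per group)."""
    return 1 / group_width

narrow, wide = 4, 10   # two illustrative parity-group widths
```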
Understanding declustering is easier with a picture. In the figure
on the left, each parity group has 4 elements, which are indicat-
ed by letters placed in each storage device. They are distributed
among 8 storage devices. The ratio between the parity group
size and the available storage devices is the declustering ratio,
which in this example is ½. In the picture, capital letters repre-
sent those parity groups that all share the second storage node.
If the second storage device were to fail, the system would have
to read the surviving members of its parity groups to rebuild the
lost elements. You can see that the other elements of those par-
ity groups occupy about ½ of each other storage device.
For this simple example you can assume each parity element is
the same size so all the devices are filled equally. In a real sys-
tem, the component objects will have various sizes depending
on the overall file size, although each member of a parity group
will be very close in size. There will be thousands or millions of
objects on each device, and the Panasas system uses active bal-
ancing to move component objects between storage nodes to
level capacity.
Declustering means that rebuild requires reading a subset of
each device, with the proportion being approximately the same
as the declustering ratio. The total amount of data read is the
same with and without declustering, but with declustering it is
spread out over more devices. When writing the reconstructed
elements, two elements of the same parity group cannot be
located on the same storage node. Declustering leaves many
storage devices available for the reconstructed parity element,
and randomizing the placement of each file’s parity group lets
the system spread out the write I/O over all the storage. Thus
declustering RAID parity groups has the important property of
taking a fixed amount of rebuild I/O and spreading it out over
more storage devices.
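The arithmetic of the figure's example can be stated directly. Per the text, the fraction of each surviving device that must be read during a rebuild is approximately the declustering ratio:

```python
# Declustering arithmetic from the example above: parity groups of
# width 4 spread over 8 storage devices give a declustering ratio of
# 1/2, so a rebuild reads only about half of each surviving device.

def declustering_ratio(group_width, num_devices):
    """Approximate fraction of each surviving device read on rebuild."""
    return group_width / num_devices

ratio = declustering_ratio(4, 8)    # the figure's example
```

Doubling the number of devices at the same group width halves the fraction read per device, which is why larger BladeSets rebuild faster.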
Having per-file RAID allows the Panasas system to divide the
work among the available DirectorBlade modules by assigning
different files to different DirectorBlade modules. This division
is dynamic with a simple master/worker model in which meta-
data services make themselves available as workers, and each
metadata service acts as the master for the volumes it imple-
ments. By doing rebuilds in parallel on all DirectorBlade mod-
ules, the system can apply more XOR throughput and utilize the
additional I/O bandwidth obtained with declustering.
Metadata Management
There are several kinds of metadata in the Panasas system.
These include the mapping from object IDs to sets of block ad-
dresses, mapping files to sets of objects, file system attributes
such as ACLs and owners, file system namespace information
(i.e., directories), and configuration/management information
about the storage cluster itself.
Block-level Metadata
Block-level metadata is managed internally by OSDFS, the file system that is optimized to store objects.

Figure: Declustered parity groups – each parity group has four elements, indicated by letters, distributed among eight storage devices; capital letters mark the parity groups that all share the second storage node.

OSDFS uses a floating
block allocation scheme where data, block pointers, and object
descriptors are batched into large write operations. The write
buffer is protected by the integrated UPS, and it is flushed to
disk on power failure or system panics. Fragmentation was an
issue in early versions of OSDFS that used a first-fit block allo-
cator, but this has been significantly mitigated in later versions
that use a modified best-fit allocator.
OSDFS stores higher level file system data structures, such as
the partition and object tables, in a modified BTree data struc-
ture. Block mapping for each object uses a traditional direct/
indirect/double-indirect scheme. Free blocks are tracked by a
proprietary bitmap-like data structure that is optimized for co-
py-on-write reference counting, part of OSDFS’s integrated sup-
port for object- and partition-level copy-on-write snapshots.
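A traditional direct/indirect/double-indirect map bounds the addressable size of each object, and the bound is easy to compute. The block size and pointer counts below are illustrative assumptions for the sketch, not OSDFS's actual parameters:

```python
# Back-of-envelope for a direct/indirect/double-indirect block map, as
# used for each object's block mapping. Parameter values are assumed
# for illustration only.

def max_object_bytes(block_size, n_direct, ptrs_per_block):
    direct = n_direct                      # blocks reachable directly
    indirect = ptrs_per_block              # via the indirect block
    double_indirect = ptrs_per_block ** 2  # via the double-indirect block
    return (direct + indirect + double_indirect) * block_size

# e.g. 4 KiB blocks, 12 direct pointers, 1024 pointers per indirect block
size = max_object_bytes(4096, 12, 1024)    # ~4 GiB per object
```

Since large files are striped across many such objects, the per-object limit is not a per-file limit.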
Block-level metadata management consumes most of the
cycles in file system implementations. By delegating storage
management to OSDFS, the Panasas metadata managers have
an order of magnitude less work to do than the equivalent SAN
file system metadata manager that must track all the blocks in
the system.
File-level Metadata
Above the block layer is the metadata about files. This includes
user-visible information such as the owner, size, and modifica-
tion time, as well as internal information that identifies which
objects store the file and how the data is striped across those
objects (i.e., the file’s storage map). Our system stores this file
metadata in object attributes on two of the N objects used to
store the file’s data. The rest of the objects have basic attributes
like their individual length and modify times, but the higher-lev-
el file system attributes are only stored on the two attribute-
storing components.
File names are implemented in directories similar to traditional
UNIX file systems. Directories are special files that store an ar-
ray of directory entries. A directory entry identifies a file with a
tuple of <serviceID, partitionID, objectID>, and also includes two
<osdID> fields that are hints about the location of the attribute
storing components. The partitionID/objectID is the two-level
object numbering scheme of the OSD interface, and Panasas
uses a partition for each volume. Directories are mirrored (RAID-
1) in two objects so that the small write operations associated
with directory updates are efficient.
Clients are allowed to read, cache and parse directories, or they
can use a Lookup RPC to the metadata manager to translate
a name to a <serviceID, partitionID, objectID> tuple and the
<osdID> location hints. The serviceID provides a hint about the
metadata manager for the file, although clients may be redirect-
ed to the metadata manager that currently controls the file. The
osdID hint can become out-of-date if reconstruction or active
balancing moves an object. If both osdID hints fail, the meta-
data manager has to multicast a GetAttributes to the storage
nodes in the BladeSet to locate an object. The partitionID and
objectID are the same on every storage node that stores a com-
ponent of the file, so this technique will always work. Once the
file is located, the metadata manager automatically updates
the stored hints in the directory, allowing future accesses to
bypass this step.
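The hint-then-fallback lookup path described above can be sketched as follows. The data structures are invented for the illustration; the "multicast GetAttributes" is modeled as simply asking every storage node in the BladeSet, after which the stale hint is repaired in place.

```python
# Sketch of directory lookup: an entry maps a name to a
# <serviceID, partitionID, objectID> tuple plus two osdID hints. If both
# hints are stale, fall back to querying every storage node (standing in
# for the multicast GetAttributes), then repair the stored hint.

def lookup(direntry, osds):
    """direntry: {'ids': (service, partition, object), 'hints': [osdID, osdID]}
    osds: dict mapping osdID -> set of (partitionID, objectID) it stores."""
    _, partition_id, object_id = direntry["ids"]
    for osd_id in direntry["hints"]:                 # fast path: try hints
        if (partition_id, object_id) in osds.get(osd_id, set()):
            return osd_id
    # both hints failed: ask every storage node in the BladeSet
    for osd_id, objects in osds.items():
        if (partition_id, object_id) in objects:
            direntry["hints"][0] = osd_id            # repair the stale hint
            return osd_id
    raise FileNotFoundError("object not found in BladeSet")

osds = {"osd-1": set(), "osd-2": {(5, 99)}, "osd-3": {(5, 99)}}
entry = {"ids": ("svc-a", 5, 99), "hints": ["osd-1", "osd-9"]}  # stale hints
found = lookup(entry, osds)
```

The fallback always succeeds when the object exists because, as the text notes, the partitionID/objectID pair is the same on every storage node holding a component of the file.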
File operations may require several object operations. The fig-
ure on the left shows the steps used in creating a file. The meta-
data manager keeps a local journal to record in-progress actions
so it can recover from object failures and metadata manager
crashes that occur when updating multiple objects. For exam-
ple, creating a file is a fairly complex task that requires updating
the parent directory as well as creating the new file. There are
2 Create OSD operations to create the first two components of
the file, and 2 Write OSD operations, one to each replica of the
parent directory. As a performance optimization, the metadata
server also grants the client read and write access to the file
and returns the appropriate capabilities to the client as part of
the FileCreate results. The server makes a record of these write
capabilities to support error recovery if the client crashes while
writing the file. Note that the directory update (step 7) occurs
after the reply, so that many directory updates can be batched
together. The deferred update is protected by the op-log record
that gets deleted in step 8 after the successful directory update.
The metadata manager maintains an op-log that records the
object create and the directory updates that are in progress.
This log entry is removed when the operation is complete. If the
metadata service crashes and restarts, or a failure event moves
the metadata service to a different manager node, then the op-
log is processed to determine what operations were active at
the time of the failure. The metadata manager rolls the opera-
tions forward or backward to ensure the object store is consis-
tent. If no reply to the operation has been generated, then the
operation is rolled back. If a reply has been generated but pend-
ing operations are outstanding (e.g., directory updates), then
the operation is rolled forward.
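The recovery rule just stated is mechanical, and a minimal sketch makes it concrete. The log-record format below is invented; the decision logic follows the text: no reply generated means roll back, reply generated with pending updates means roll forward.

```python
# Sketch of op-log recovery: after a metadata manager restart, each
# in-progress operation is rolled back if no reply was sent, or rolled
# forward if a reply was sent but follow-up updates are still pending.

def recover(op_log):
    """op_log: list of records like {'op': ..., 'replied': bool}."""
    actions = {}
    for rec in op_log:
        actions[rec["op"]] = "roll-forward" if rec["replied"] else "roll-back"
    return actions

pending = [
    {"op": "create /a/f1", "replied": True},   # replied; dir update pending
    {"op": "create /a/f2", "replied": False},  # crashed before replying
]
plan = recover(pending)
```

Rolling back unreplied operations is safe because the client never observed success; rolling forward replied ones preserves the promise already made to the client.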
Figure: Creating a file – the client’s request is journaled by the metadata server in its transaction log (op-log, cap-log, and reply cache); the server issues two OSD CREATE operations for the first two component objects, replies to the client, and then performs the deferred WRITE operations to both replicas of the parent directory, after which the op-log record is deleted.
The write capability is stored in a cap-log so that when a meta-
data server starts it knows which of its files are busy. In addition
to the “piggybacked” write capability returned by FileCreate,
the client can also execute a StartWrite RPC to obtain a sepa-
rate write capability. The cap-log entry is removed when the
client releases the write cap via an EndWrite RPC. If the client
reports an error during its I/O, then a repair log entry is made
and the file is scheduled for repair. Read and write capabilities
are cached by the client over multiple system calls, further re-
ducing metadata server traffic.
System-level Metadata
The final layer of metadata is information about the overall sys-
tem itself. One possibility would be to store this information in
objects and bootstrap the system through a discovery protocol.
The most difficult aspect of that approach is reasoning about
the fault model. The system must be able to come up and be
manageable while it is only partially functional. Panasas chose
instead a model with a small replicated set of system managers, each of which stores a replica of the system configuration metadata.
Each system manager maintains a local database, outside of
the object storage system. Berkeley DB is used to store tables
that represent our system model. The different system manager
instances are members of a replication set that use Lamport’s
part-time parliament (PTP) protocol to make decisions and up-
date the configuration information. Clusters are configured
with one, three, or five system managers so that the voting quo-
rum has an odd number and a network partition will cause a
minority of system managers to disable themselves.
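The quorum rule behind the one/three/five-manager configurations is a plain majority test, sketched here for illustration (the real decision protocol is PTP, not this function):

```python
# Sketch of the quorum rule: with an odd number of system managers, any
# network partition leaves exactly one side with a majority; the
# minority side disables itself.

def has_quorum(visible_managers, total_managers):
    """True if this side of a partition can see a strict majority."""
    return visible_managers > total_managers // 2

# a 5-manager cluster split 3 / 2 by a network partition
majority_side = has_quorum(3, 5)
minority_side = has_quorum(2, 5)
```

An odd total guarantees the two sides of a partition can never both claim a majority, which is why even-sized quorums are not offered.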
System configuration state includes both static state, such as
the identity of the blades in the system, as well as dynamic
state such as the online/offline state of various services and er-
ror conditions associated with different system components.
Each state update decision, whether it is updating the admin
password or activating a service, involves a voting round and an
update round according to the PTP protocol. Database updates
are performed within the PTP transactions to keep the data-
bases synchronized. Finally, the system keeps backup copies of
the system configuration databases on several other blades to
guard against catastrophic loss of every system manager blade.
Blade configuration is pulled from the system managers as part
of each blade’s startup sequence. The initial DHCP handshake
conveys the addresses of the system managers, and thereafter
the local OS on each blade pulls configuration information from
the system managers via RPC.
The cluster manager implementation has two layers. The lower
level PTP layer manages the voting rounds and ensures that
partitioned or newly added system managers will be brought
up-to-date with the quorum. The application layer above that
uses the voting and update interface to make decisions. Com-
plex system operations may involve several steps, and the sys-
tem manager has to keep track of its progress so it can tolerate
a crash and roll back or roll forward as appropriate.
For example, creating a volume (i.e., a quota-tree) involves file
system operations to create a top-level directory, object opera-
tions to create an object partition within OSDFS on each Stor-
ageBlade module, service operations to activate the appropri-
ate metadata manager, and configuration database operations
to reflect the addition of the volume. Recovery is enabled by
having two PTP transactions. The initial PTP transaction deter-
mines if the volume should be created, and it creates a record
about the volume that is marked as incomplete. Then the sys-
tem manager does all the necessary service activations, file and
storage operations. When these all complete, a final PTP trans-
action is performed to commit the operation. If the system man-
ager crashes before the final PTP transaction, it will detect the
incomplete operation the next time it restarts, and then roll the
operation forward or backward.
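The two-transaction pattern for volume creation can be sketched as follows. The function names and the dict standing in for the configuration database are invented; the essential shape is from the text: an "incomplete" record is committed first, the multi-step work runs, and a final transaction commits the result, so a crash in between is detectable at restart.

```python
# Sketch of the two-PTP-transaction volume create: record intent as
# "incomplete", do the multi-step work, then commit "complete". On
# restart, any record still marked incomplete is rolled forward.

def create_volume(db, name, do_work, crash_before_commit=False):
    db[name] = "incomplete"        # PTP transaction 1: record the intent
    do_work(name)                  # service, file, and storage operations
    if crash_before_commit:
        return                     # simulate a system manager crash here
    db[name] = "complete"          # PTP transaction 2: commit

def recover(db, redo_work):
    """Run at restart: finish any operation that was left incomplete."""
    for name, state in list(db.items()):
        if state == "incomplete":
            redo_work(name)        # roll the operation forward...
            db[name] = "complete"  # ...and commit it

db = {}
create_volume(db, "/vol/ok", lambda n: None)
create_volume(db, "/vol/crashed", lambda n: None, crash_before_commit=True)
recover(db, lambda n: None)        # next restart detects the stragglers
```

Recovery could equally roll the incomplete operation backward (deleting the record and undoing the work); the sketch shows the roll-forward case.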
“Outstanding in the HPC world, the ActiveStor solutions provided by Panasas are undoubtedly the only HPC storage solutions that combine highest scalability and performance with a convincing ease of management.”
Thomas Gebert, HPC Solution Architect
NVIDIA GPU Computing – The CUDA Architecture
The CUDA parallel computing platform is now widely deployed, with thousands of GPU-accelerated applications and thousands of published research papers, and a complete range of CUDA tools and ecosystem solutions is available to developers.
What is GPU Computing?
GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications. Pioneered more than five years ago by NVIDIA, GPU computing has quickly become an industry standard, enjoyed by millions of users worldwide and adopted by virtually all computing vendors.
GPU computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user’s perspective, applications simply run significantly faster.

CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU while parallel portions run on the GPU.
Most customers can immediately enjoy the power of GPU com-
puting by using any of the GPU-accelerated applications listed
in our catalog, which highlights over one hundred industry-leading applications. For developers, GPU computing offers a vast ecosystem of tools and libraries from major software vendors.
History of GPU Computing
Graphics chips started as fixed-function graphics processors but became increasingly programmable and computationally powerful, which led NVIDIA to introduce the first GPU. In the 1999-2000 timeframe, computer scientists and domain scientists from various fields started using GPUs to accelerate a range of scientific applications. This was the advent of the movement called GPGPU, or General-Purpose computation on GPU.
While users achieved unprecedented performance (over 100x compared to CPUs in some cases), the challenge was that GPGPU required the use of graphics programming APIs like OpenGL and Cg to program the GPU. This limited accessibility to the tremendous capability of GPUs for science.

NVIDIA recognized the potential of bringing this performance to the larger scientific community, invested in making the GPU fully programmable, and offered a seamless experience for developers with familiar languages like C, C++, and Fortran.
GPU computing momentum is growing faster than ever before.
Today, some of the fastest supercomputers in the world rely on GPUs to advance scientific discoveries; 600 universities around the world teach parallel computing with NVIDIA GPUs; and hundreds of thousands of developers are actively using GPUs.
All NVIDIA GPUs—GeForce, Quadro, and Tesla — support GPU
computing and the CUDA parallel programming model. Devel-
opers have access to NVIDIA GPUs in virtually any platform of
their choice, including the latest Apple MacBook Pro. However,
we recommend Tesla GPUs for workloads where data reliability
and overall performance are critical.
Tesla GPUs are designed from the ground up to accelerate scientific and technical computing workloads. Based on innovative features in the Kepler architecture, the latest Tesla GPUs offer 3x more performance compared to the previous architecture and more than one teraflops of double-precision floating point, while dramatically advancing programmability and efficiency. Kepler is the world’s fastest and most efficient high performance computing (HPC) architecture.
Kepler GK110 – The Next Generation GPU Computing Architecture
As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA’s existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis. NVIDIA’s new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world’s most difficult computing problems.

“We are very proud to be one of the leading providers of Tesla systems, able to combine the overwhelming power of NVIDIA Tesla systems with the fully engineered and thoroughly tested transtec hardware into a total Tesla-based solution.”
Norbert Zeidler, Senior HPC Solution Engineer

Figure: The CUDA parallel architecture and programming model – applications written in C, C++, Fortran, Java, and Python, or using OpenCL and DirectCompute, are compiled to PTX (ISA) compute kernels and run on the CUDA parallel compute engines inside NVIDIA GPUs, via the device-level APIs and language integration layers (CUDA driver, OpenCL driver, C runtime for CUDA) with CUDA support in the OS kernel.
By offering much higher processing power than the prior GPU
generation and by providing new methods to optimize and in-
crease parallel workload execution on the GPU, Kepler GK110
simplifies creation of parallel programs and will further revolu-
tionize high performance computing.
Kepler GK110 – Extreme Performance, Extreme Efficiency
Comprising 7.1 billion transistors, Kepler GK110 is not only the
fastest, but also the most architecturally complex microproces-
sor ever built. Adding many new innovative features focused on
compute performance, GK110 was designed to be a parallel pro-
cessing powerhouse for Tesla® and the HPC market.
Kepler GK110 provides over 1 TFlop of double precision
throughput with greater than 93% DGEMM efficiency versus
60-65% on the prior Fermi architecture.
In addition to greatly improved performance, the Kepler archi-
tecture offers a huge leap forward in power efficiency, deliver-
ing up to 3x the performance per watt of Fermi.
The following new features in Kepler GK110 enable increased
GPU utilization, simplify parallel program design, and aid in the
deployment of GPUs across the spectrum of compute environ-
ments ranging from personal workstations to supercomputers:
- Dynamic Parallelism – adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less-structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.
- Hyper-Q – enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.
- Grid Management Unit – enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.
- NVIDIA GPUDirect – a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to exchange data directly without needing to go through CPU/system memory. The RDMA feature in GPUDirect allows third-party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features, including Peer-to-Peer and GPUDirect for Video.
An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its
goal was to be the highest performing parallel computing mi-
croprocessor in the world. GK110 not only greatly exceeds the
raw compute horsepower delivered by Fermi, but it does so
efficiently, consuming significantly less power and generating
much less heat output.
Technical Specifications                Tesla K40     Tesla K20X    Tesla K20     Tesla K10
Peak double-precision floating
point performance (board)               1.43 Tflops   1.31 Tflops   1.17 Tflops   0.19 Tflops
Peak single-precision floating
point performance (board)               4.29 Tflops   3.95 Tflops   3.52 Tflops   4.58 Tflops
Number of GPUs                          1 x GK110B    1 x GK110     1 x GK110     2 x GK104
Number of CUDA cores                    2,880         2,688         2,496         2 x 1,536
Memory size per board (GDDR5)           12 GB         6 GB          5 GB          8 GB
Memory bandwidth per board (ECC off)    288 GB/sec    250 GB/sec    208 GB/sec    320 GB/sec
Architecture features                   SMX, Dynamic Parallelism, Hyper-Q         SMX
System                                  Servers and   Servers       Servers and   Servers
                                        workstations                workstations
Introducing NVIDIA Parallel Nsight
NVIDIA Parallel Nsight is the first development environment designed specifically to support massively parallel CUDA C, OpenCL, and DirectCompute applications. It bridges the productivity gap between CPU and GPU code by bringing parallel-aware
hardware source code debugging and performance analysis
directly into Microsoft Visual Studio, the most widely used inte-
grated application development environment under Microsoft
Windows.
Parallel Nsight allows Visual Studio developers to write and
debug GPU source code using exactly the same tools and inter-
faces that are used when writing and debugging CPU code, in-
cluding source and data breakpoints, and memory inspection.
Furthermore, Parallel Nsight extends Visual Studio functional-
ity by offering tools to manage massive parallelism, such as the
ability to focus and debug on a single thread out of the thousands of threads running in parallel, and the ability to simply and efficiently visualize the results computed by all parallel threads.
Parallel Nsight is the perfect environment to develop co-pro-
cessing applications that take advantage of both the CPU and
GPU. It captures performance events and information across
both processors, and presents the information to the devel-
oper on a single correlated timeline. This allows developers to
see how their application behaves and performs on the entire
system, rather than through a narrow view that is focused on a
particular subsystem or processor.
Parallel Nsight Debugger for GPU Computing
- Debug your CUDA C/C++ and DirectCompute source code directly on the GPU hardware
- As the industry's only GPU hardware debugging solution, it drastically increases debugging speed and accuracy
- Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows
Parallel Nsight Analysis Tool for GPU Computing
- Isolate performance bottlenecks by viewing system-wide CPU+GPU events
- Support for all major GPU Computing APIs, including CUDA C/C++, OpenCL, and Microsoft DirectCompute
Debugger for GPU Computing (left); Analysis Tool for GPU Computing (right)
Parallel Nsight Debugger for Graphics Development
- Debug HLSL shaders directly on the GPU hardware, drastically increasing debugging speed and accuracy over emulated (SW) debugging
- Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows with HLSL shaders, including DirectCompute code
- The Debugger supports all HLSL shader types: Vertex, Pixel, Geometry, and Tessellation
Parallel Nsight Graphics Inspector for Graphics Development
- Graphics Inspector captures Direct3D rendered frames for real-time examination
- The Frame Profiler automatically identifies bottlenecks and reports performance information on a per-draw-call basis
- Pixel History shows you all operations that affected a given pixel
transtec has strived to develop well-engineered GPU computing solutions from the very beginning of the Tesla era. From high-performance GPU workstations to rack-mounted Tesla server solutions, transtec has a broad range of specially designed systems available. As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to provide customers with the latest NVIDIA GPU technology as well as fully engineered hybrid systems and Tesla Preconfigured Clusters. Thus, customers can be assured that transtec's extensive experience in HPC cluster solutions is seamlessly brought into the GPU computing world. Performance Engineering made by transtec.
Debugger for Graphics Development (left); Graphics Inspector for Graphics Development (right)
A full Kepler GK110 implementation includes 15 SMX units and
six 64-bit memory controllers. Different products will use differ-
ent configurations of GK110. For example, some products may
deploy 13 or 14 SMXs.
Key features of the architecture that will be discussed below in
more depth include:
- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
- Hardware support throughout the design to enable new programming model capabilities
Kepler GK110 supports the new CUDA Compute Capability 3.5.
(For a brief overview of CUDA see Appendix A - Quick Refresher
on CUDA). The following table compares parameters of different
Compute Capabilities for Fermi and Kepler GPU architectures:
Performance per Watt
A principal design goal for the Kepler architecture was impro-
ving power efficiency. When designing Kepler, NVIDIA engineers
applied everything learned from Fermi to better optimize the
Kepler architecture for highly efficient operation. TSMC’s 28nm
manufacturing process plays an important role in lowering po-
wer consumption, but many GPU architecture modifications
were required to further reduce power consumption while
maintaining great performance.
                                     Fermi GF100   Fermi GF104   Kepler GK104   Kepler GK110
Compute Capability                   2.0           2.1           3.0            3.5
Threads / Warp                       32            32            32             32
Max Warps / Multiprocessor           48            48            64             64
Max Threads / Multiprocessor         1536          1536          2048           2048
Max Thread Blocks / Multiprocessor   8             8             16             16
32-bit Registers / Multiprocessor    32768         32768         65536          65536
Max Registers / Thread               63            63            63             255
Max Threads / Thread Block           1024          1024          1024           1024
Shared Memory Size Configurations    16K/48K       16K/48K       16K/32K/48K    16K/32K/48K
(bytes)
Max X Grid Dimension                 2^16-1        2^16-1        2^32-1         2^32-1
Hyper-Q                              No            No            No             Yes
Dynamic Parallelism                  No            No            No             Yes
Every hardware unit in Kepler was designed and scrubbed to
provide outstanding performance per watt. The best example
of great perf/watt is seen in the design of Kepler GK110’s new
Streaming Multiprocessor (SMX), which is similar in many res-
pects to the SMX unit recently introduced in Kepler GK104, but
includes substantially more double precision units for compute
algorithms.
Streaming Multiprocessor (SMX) Architecture
Kepler GK110’s new SMX introduces several architectural inno-
vations that make it not only the most powerful multiproces-
sor we’ve built, but also the most programmable and power-
efficient.
SMX Processing Core Architecture
Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation.
One of the design goals for the Kepler GK110 SMX was to sig-
nificantly increase the GPU’s delivered double precision perfor-
mance, since double precision arithmetic is at the heart of many
HPC applications. Kepler GK110’s SMX also retains the special
function units (SFUs) for fast approximate transcendental ope-
rations as in previous-generation GPUs, providing 8x the num-
ber of SFUs of the Fermi GF110 SM.
Similar to GK104 SMX units, the cores within the new GK110
SMX units use the primary GPU clock rather than the 2x shader
clock. Recall the 2x shader clock was introduced in the G80 Tesla-
architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs.

Kepler GK110 full chip block diagram. SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power-hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefited both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.
Quad Warp Scheduler
The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double-precision instructions to be paired with other instructions, Kepler GK110 allows them to be paired.
We also looked for opportunities to optimize the power in the
SMX warp scheduler logic. For example, both Kepler and Fermi
schedulers contain similar hardware units to handle the sche-
duling function, including:
- Register scoreboarding for long latency operations (texture and load)
- Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
- Thread block level scheduling (e.g., the GigaThread engine)
However, Fermi’s scheduler also contains a complex hardware
stage to prevent data hazards in the math datapath itself. A
multi-port register scoreboard keeps track of any registers that
are not yet ready with valid data, and a dependency checker
block analyzes register usage across a multitude of fully deco-
ded warp instructions against the scoreboard, to determine
which are eligible to issue.
For Kepler, we recognized that this information is deterministic
(the math pipeline latencies are not variable), and therefore it
is possible for the compiler to determine up front when instruc-
tions will be ready to issue, and provide this information in the
instruction itself. This allowed us to replace several complex and
power-expensive blocks with a simple hardware block that ext-
racts the pre-determined latency information and uses it to mask
out warps from eligibility at the inter-warp scheduler stage.
New ISA Encoding: 255 Registers per Thread
The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases of up to 5.3x due to the ability to use many more registers per thread and fewer spills to local memory.
Shuffle Instruction
To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references – i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR "butterfly" style permutations among the threads in a warp, are also available as CUDA intrinsics.
NVIDIA, GeForce, Tesla, CUDA, PhysX, GigaThread, NVIDIA
Parallel Data Cache and certain other trademarks and
logos appearing in this brochure, are trademarks or
registered trademarks of NVIDIA Corporation.
Each Kepler SMX contains four Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.
This example shows some of the variations possible using the new Shuffle instruction in Kepler.
Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle can also reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.
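As an illustration, a warp-level sum reduction can be written entirely with Shuffle, with no shared memory staging at all. This is a minimal sketch using the __shfl_down() intrinsic as exposed for Kepler in CUDA 5.x (newer toolkits prefer the _sync variants); the kernel and buffer names are our own:

```cuda
// Warp-level sum reduction using the Kepler Shuffle instruction.
// Each of the 32 threads in a warp contributes one value; after the
// loop, lane 0 holds the sum of all 32 values, with no shared memory.
__device__ float warpReduceSum(float val)
{
    // Offset reduction: halve the active width each step.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);   // read 'val' from lane (laneId + offset)
    return val;                            // valid in lane 0
}

__global__ void blockSums(const float *in, float *out)
{
    float v = in[blockIdx.x * blockDim.x + threadIdx.x];
    float sum = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)           // one atomic per warp instead of per thread
        atomicAdd(&out[blockIdx.x], sum);
}
```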
Atomic Operations
Atomic memory operations are important in parallel program-
ming, allowing concurrent threads to correctly perform read-
modify-write operations on shared data structures. Atomic
operations such as add, min, max, and compare-and-swap are
atomic in the sense that the read, modify, and write operations
are performed without interruption by other threads. Atomic
memory operations are widely used for parallel sorting, reduc-
tion operations, and building data structures in parallel with-
out locks that serialize thread execution.
Throughput of global memory atomic operations on Kepler
GK110 is substantially improved compared to the Fermi genera-
tion. Atomic operation throughput to a common global memory
address is improved by 9x to one operation per clock. Atomic
operation throughput to independent global addresses is also
significantly accelerated, and logic to handle address conflicts
has been made more efficient. Atomic operations can often be
processed at rates similar to global load operations. This speed
increase makes atomics fast enough to use frequently within
kernel inner loops, eliminating the separate reduction passes
that were previously required by some algorithms to consolidate results. Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following:
- atomicMin
- atomicMax
- atomicAnd
- atomicOr
- atomicXor
Other atomic operations which are not supported natively (for
example 64-bit floating point atomics) may be emulated using
the compare-and-swap (CAS) instruction.
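A 64-bit floating-point add, for example, can be emulated on top of 64-bit atomicCAS, along the lines of the well-known pattern from the CUDA C Programming Guide (sketch; the helper name is our own):

```cuda
// Emulated double-precision atomicAdd built from 64-bit atomicCAS.
// Retries until no other thread has modified the address in between.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long *addr = (unsigned long long *)address;
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        // Reinterpret bits, add in double, write back only if unchanged.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```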
Texture Improvements
The GPU’s dedicated hardware Texture units are a valuable re-
source for compute programs with a need to sample or filter
image data. The texture throughput in Kepler is significantly in-
creased compared to Fermi – each SMX unit contains 16 texture
filtering units, a 4x increase vs the Fermi GF110 SM.
In addition, Kepler changes the way texture state is managed.
In the Fermi generation, for the GPU to reference a texture, it
had to be assigned a “slot” in a fixed-size binding table prior to
grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at runtime; in Fermi, a program could access at most 128 simultaneous textures.
With bindless textures in Kepler, the additional step of using
slots isn’t necessary: texture state is now saved as an object in
memory and the hardware fetches these state objects on de-
mand, making binding tables obsolete. This effectively elimi-
nates any limits on the number of unique textures that can be
referenced by a compute program. Instead, programs can map
textures at any time and pass texture handles around as they
would any other pointer.
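In CUDA terms this is the texture object API introduced alongside Kepler. A minimal sketch for a 1D float array, with hypothetical kernel and helper names:

```cuda
// Bindless textures: texture state lives in a cudaTextureObject_t that is
// created at runtime and passed to kernels as an ordinary parameter.
__global__ void scale(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch<float>(tex, i);
}

cudaTextureObject_t makeTexture(float *devPtr, int n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = devPtr;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);  // no binding slot involved
    return tex;
}
```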
Kepler Memory Subsystem – L1, L2, ECC
Kepler’s memory hierarchy is organized similarly to Fermi. The
Kepler architecture supports a unified memory request path for
loads and stores, with an L1 cache per SMX multiprocessor. Ke-
pler GK110 also enables compiler-directed use of an additional
new cache for read-only data, as described below.
64 KB Configurable Shared Memory and L1 Cache
In the Kepler GK110 architecture, as in the previous generation
Fermi architecture, each SMX has 64 KB of on-chip memory that
can be configured as 48 KB of Shared memory with 16 KB of L1
cache, or as 16 KB of shared memory with 48 KB of L1 cache. Ke-
pler now allows for additional flexibility in configuring the al-
location of shared memory and L1 cache by permitting a 32KB
/ 32KB split between shared memory and L1 cache. To support
the increased throughput of each SMX unit, the shared memory
bandwidth for 64b and larger load operations is also doubled
compared to the Fermi SM, to 256B per core clock.
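From the host, the split is requested per kernel through the runtime API; a brief sketch (the kernel name and launch parameters are placeholders):

```cuda
// Request the 32KB/32KB shared-memory/L1 split for one kernel.
// Options: cudaFuncCachePreferShared (48/16), cudaFuncCachePreferL1 (16/48),
// and cudaFuncCachePreferEqual (32/32, new with Kepler).
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
myKernel<<<grid, block>>>(args);
```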
48KB Read-Only Data Cache
In addition to the L1 cache, Kepler introduces a 48KB cache for
data that is known to be read-only for the duration of the func-
tion. In the Fermi generation, this cache was accessible only by
the Texture unit. Expert programmers often found it advanta-
geous to load data through this path explicitly by mapping their
data as textures, but this approach had many limitations.
In Kepler, in addition to significantly increasing the capacity
of this cache along with the texture horsepower increase, we
decided to make the cache directly accessible to the SM for
general load operations. Use of the read-only path is beneficial
because it takes both load and working set footprint off of the
Shared/L1 cache path. In addition, the Read-Only Data Cache’s
higher tag bandwidth supports full speed unaligned memory
access patterns among other scenarios.
Use of the read-only path can be managed automatically by the
compiler or explicitly by the programmer. Access to any variable
or data structure that is known to be constant through pro-
grammer use of the C99-standard “const __restrict” keyword
may be tagged by the compiler to be loaded through the Read-
Only Data Cache. The programmer can also explicitly use this
path with the __ldg() intrinsic.
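Both routes can be sketched in a few lines (hypothetical kernels; __ldg() requires compute capability 3.5):

```cuda
// Two ways to route loads through the 48KB read-only data cache on GK110.

// 1) Let the compiler prove read-only-ness via const + __restrict__:
__global__ void saxpy(float a,
                      const float * __restrict__ x,   // eligible for the read-only path
                      float * __restrict__ y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// 2) Force a load through the read-only cache explicitly with __ldg():
__global__ void gather(const float *src, const int *idx, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = __ldg(&src[idx[i]]);   // cached in the read-only data cache
}
```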
Improved L2 Cache
The Kepler GK110 GPU features 1536KB of dedicated L2 cache
memory, double the amount of L2 available in the Fermi archi-
tecture. The L2 cache is the primary point of data unification
between the SMX units, servicing all load, store, and texture re-
quests and providing efficient, high speed data sharing across
the GPU. The L2 cache on Kepler offers up to 2x the band-
width per clock available in Fermi. Algorithms for which data
addresses are not known beforehand, such as physics solvers,
ray tracing, and sparse matrix multiplication especially benefit
from the cache hierarchy. Filter and convolution kernels that re-
quire multiple SMs to read the same data also benefit.
Memory Protection Support
Like Fermi, Kepler’s register files, shared memories, L1 cache, L2
cache and DRAM memory are protected by a Single-Error Correct
Double-Error Detect (SECDED) ECC code. In addition, the Read-
Only Data Cache supports single-error correction through a parity
check; in the event of a parity error, the cache unit automatically in-
validates the failed line, forcing a read of the correct data from L2.
ECC checkbit fetches from DRAM necessarily consume some
amount of DRAM bandwidth, which results in a performance
difference between ECC-enabled and ECC-disabled operation,
especially on memory bandwidth-sensitive applications. Kepler
GK110 implements several optimizations to ECC checkbit fetch
handling based on Fermi experience. As a result, the ECC on-vs-
off performance delta has been reduced by an average of 66%,
as measured across our internal compute application test suite.
Dynamic Parallelism
In a hybrid CPU-GPU system, enabling a larger amount of paral-
lel code in an application to run efficiently and entirely within
the GPU improves scalability and performance as GPUs increase
in perf/watt. To accelerate these additional parallel portions of
the application, GPUs must support more varied types of paral-
lel workloads.
Dynamic Parallelism is a new feature introduced with Kepler
GK110 that allows the GPU to generate new work for itself, syn-
chronize on results, and control the scheduling of that work via
dedicated, accelerated hardware paths, all without involving
the CPU.
Fermi was very good at processing large parallel data structures
when the scale and parameters of the problem were known at
kernel launch time. All work was launched from the host CPU,
would run to completion, and return a result back to the CPU.
The result would then be used as part of the final solution, or
would be analyzed by the CPU which would then send addition-
al requests back to the GPU for additional processing.
In Kepler GK110 any kernel can launch another kernel, and can
create the necessary streams, events and manage the depen-
dencies needed to process additional work without the need
for host CPU interaction. This architectural innovation makes it
easier for developers to create and optimize recursive and data-
dependent execution patterns, and allows more of a program to
be run directly on GPU. The system CPU can then be freed up for
additional tasks, or the system could be configured with a less
powerful CPU to carry out the same workload.
Dynamic Parallelism allows more varieties of parallel algo-
rithms to be implemented on the GPU, including nested loops
with differing amounts of parallelism, parallel teams of serial
control-task threads, or simple serial control code offloaded to
the GPU in order to promote data-locality with the parallel por-
tion of the application.
Because a kernel has the ability to launch additional work-
loads based on intermediate, on-GPU results, programmers can
now intelligently load-balance work to focus the bulk of their
resources on the areas of the problem that either require the
most processing power or are most relevant to the solution.
One example would be dynamically setting up a grid for a nu-
merical simulation – typically grid cells are focused in regions
of greatest change, requiring an expensive pre-processing pass
through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or "over-spending" compute resources on regions of less interest. With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner.
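In source code, a device-side launch looks just like a host-side one. A minimal sketch (kernel names are hypothetical; dynamic parallelism requires compute capability 3.5 and relocatable device code, i.e. nvcc -rdc=true linked against cudadevrt):

```cuda
// Hypothetical fine-grained child kernel.
__global__ void refineCell(float *cell)
{
    // ... refine one cell with 256 cooperating threads ...
}

// Parent kernel: inspects its data and launches extra work for
// interesting regions, with no round trip to the CPU.
__global__ void simulate(float *cells, const float *residual, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && residual[i] > 1e-3f) {
        refineCell<<<1, 256>>>(&cells[i]);   // device-side launch
        cudaDeviceSynchronize();             // wait for the child grid's results
    }
}
```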
Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).
A Quick Refresher on CUDA
CUDA is a combined hardware/software platform that en-
ables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in the figure on the right. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results. A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in global memory space after kernel-wide global synchronization.
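The hierarchy described above maps directly onto source code; a minimal sketch:

```cuda
// A kernel executed by a grid of thread blocks: each thread computes
// one element, identified by its block and thread IDs.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side: launch a grid with enough 256-thread blocks to cover n.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```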
CUDA Hardware Execution
CUDA’s hierarchy of threads maps to a hierarchy of processors
on the GPU; a GPU executes one or more kernel grids; a stream-
ing multiprocessor (SM on Fermi / SMX on Kepler) executes one
or more thread blocks; and CUDA cores and other execution
units in the SMX execute thread instructions. The SMX executes
threads in groups of 32 threads called warps. While program-
mers can generally ignore warp execution for functional cor-
rectness and focus on programming individual scalar threads,
they can greatly improve performance by having threads in
a warp execute the same code path and access memory with
nearby addresses.
CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
Starting with a coarse grid, the simulation can “zoom in” on ar-
eas of interest while avoiding unnecessary calculation in areas
with little change. Though this could be accomplished using a
sequence of CPU-launched kernels, it would be far simpler to al-
low the GPU to refine the grid itself by analyzing the data and
launching additional work as part of a single simulation kernel,
eliminating interruption of the CPU and data transfers between
the CPU and GPU.
The above example illustrates the benefits of using a dynami-
cally sized grid in a numerical simulation. To meet peak preci-
sion requirements, a fixed resolution simulation must run at an
excessively fine resolution across the entire simulation domain,
whereas a multi-resolution grid applies the correct simulation
resolution to each area based on local variation.
Hyper-Q
One of the challenges in the past has been keeping the GPU sup-
plied with an optimally scheduled load of work from multiple
streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this becomes more and more difficult to manage efficiently.
Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.
Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and operations in one stream will no longer block other streams, enabling streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.
Hyper-Q offers significant benefits for use in MPI-based paral-
lel computer systems. Legacy MPI-based algorithms were often
created to run on multi-core CPU systems, with the amount of
work assigned to each MPI process scaled accordingly. This can
lead to a single MPI process having insufficient work to fully
occupy the GPU. While it has always been possible for multiple
MPI processes to share a GPU, these processes could become
bottlenecked by false dependencies. Hyper-Q removes those
false dependencies, dramatically increasing the efficiency of
GPU sharing across MPI processes.
Grid Management Unit - Efficiently Keeping the GPU Utilized
New features in Kepler GK110, such as the ability for CUDA ker-
nels to launch work directly on the GPU with Dynamic Paral-
lelism, required that the CPU-to-GPU workflow in Kepler offer
increased functionality over the Fermi design. On Fermi, a grid
of thread blocks would be launched by the CPU and would al-
ways run to completion, creating a simple unidirectional flow
NVIDIA GPU Computing
Kepler GK110 – The Next Generation GPU Computing Architecture
Image attribution Charles Reid
Hyper-Q permits more simultaneous connections between CPU and GPU
Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) & (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.
of work from the host to the SMs via the CUDA Work Distributor
(CWD) unit. Kepler GK110 was designed to improve the CPU-to-
GPU workflow by allowing the GPU to efficiently manage both
CPU- and CUDA-created workloads.
We discussed the ability of the Kepler GK110 GPU to allow ker-
nels to launch work directly on the GPU, and it’s important to
understand the changes made in the Kepler GK110 architec-
ture to facilitate these new functions. In Kepler, a grid can be
launched from the CPU just as was the case with Fermi, how-
ever new grids can also be created programmatically by CUDA
within the Kepler SMX unit. To manage both CUDA-created and
host-originated grids, a new Grid Management Unit (GMU) was
introduced in Kepler GK110. This control unit manages and pri-
oritizes grids that are passed into the CWD to be sent to the SMX
units for execution.
The CWD in Kepler holds grids that are ready to dispatch, and it
is able to dispatch 32 active grids, which is double the capacity
of the Fermi CWD. The Kepler CWD communicates with the GMU
via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.
NVIDIA GPUDirect
When working with a large amount of data, increasing the data
throughput and reducing latency is vital to increasing com-
pute performance. Kepler GK110 supports the RDMA feature in
NVIDIA GPUDirect, which is designed to improve performance
by allowing direct access to GPU memory by third-party devices
such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPU-
Direct provides the following important features:
� Direct memory access (DMA) between NIC and GPU without
the need for CPU-side data buffering.
� Significantly improved MPI_Send/MPI_Recv efficiency between the GPU and other nodes in a network
� Eliminates CPU bandwidth and latency bottlenecks
� Works with a variety of 3rd-party network, capture, and storage devices
Applications like reverse time migration (used in seismic imag-
ing for oil & gas exploration) distribute the large imaging data
across several GPUs. Hundreds of GPUs must collaborate to
crunch the data, often communicating intermediate results.
GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features.
Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.
Conclusion
With the launch of Fermi in 2010, NVIDIA ushered in a new era
in the high performance computing (HPC) industry based on a
hybrid computing model where CPUs and GPUs work together
to solve computationally-intensive workloads. Now, with the
new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC
industry.
Kepler GK110 was designed from the ground up to maximize
computational performance and throughput computing with
outstanding power efficiency. The architecture has many new
innovations such as SMX, Dynamic Parallelism, and Hyper-Q
that make hybrid computing dramatically faster, easier to pro-
gram, and applicable to a broader set of applications. Kepler
GK110 GPUs will be used in numerous systems ranging from
workstations to supercomputers to address the most daunting
challenges in HPC.
The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.
GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
Intel TrueScale InfiniBand and GPUs

Executive Overview
The High Performance Computing market’s continuing need for
improved time-to-solution and the ability to explore expand-
ing models seems unquenchable, requiring ever-faster HPC
clusters. This has led many HPC users to implement graphic
processing units (GPUs) into their clusters. While GPUs have tra-
ditionally been used solely for visualization or animation, today
they serve as fully programmable, massively parallel proces-
sors, allowing computing tasks to be divided and concurrently
processed on the GPU’s many processing cores. When multiple
GPUs are integrated into an HPC cluster, the performance po-
tential of the HPC cluster is greatly enhanced. This processing
environment enables scientists and researchers to tackle some
of the world’s most challenging computational problems.
HPC applications modified to take advantage of the GPU processing capabilities can benefit from significant performance gains over clusters implemented with traditional processors. To obtain these results, HPC clusters with multiple GPUs require a high-performance interconnect to handle the GPU-to-GPU communications and optimize the overall performance potential of the GPUs. Because the GPUs place significant demands on the interconnect, it takes a high-performance interconnect, such as InfiniBand, to provide the low latency, high message rate, and bandwidth that are needed to enable all resources in the cluster to run at peak performance.
Intel worked in concert with NVIDIA to optimize Intel TrueScale InfiniBand with NVIDIA GPU technologies. This solution supports the full performance potential of NVIDIA GPUs through an interface that is easy to deploy and maintain.
Key Points
� Up to 44 percent GPU performance improvement versus implementation without GPUDirect – a GPU computing product from NVIDIA that enables faster communication between the GPU and InfiniBand
� Intel TrueScale InfiniBand offers as much as 10 percent better GPU performance than other InfiniBand interconnects
� Ease of installation and maintenance – Intel’s implementation offers a streamlined deployment approach that is significantly easier than alternatives
Ease of Deployment
One of the key challenges with deploying clusters consisting of multi-GPU nodes is to maximize application performance. Without GPUDirect, GPU-to-GPU communications would require the host CPU to make multiple memory copies to avoid a memory pinning conflict between the GPU and InfiniBand. Each additional CPU memory copy significantly reduces the performance potential of the GPUs.
Intel’s implementation of GPUDirect takes a streamlined approach to optimizing NVIDIA GPU performance with Intel TrueScale InfiniBand. With Intel’s solution, a user only needs to update the NVIDIA driver with code provided and tested by Intel. Other InfiniBand implementations require the user to implement a Linux kernel patch as well as a special InfiniBand driver. The Intel approach provides a much easier way to deploy, support, and maintain GPUs in a cluster without having to sacrifice performance. In addition, it is completely compatible with other GPUDirect implementations; the CUDA libraries and application code require no changes.
Optimized Performance
Intel used AMBER molecular dynamics simulation software to test clustered GPU performance with and without GPUDirect. Figure 15 shows that there is a significant performance gain of up to 44 percent that results from streamlining the host memory access to support GPU-to-GPU communications.
Clustered GPU Performance
HPC applications that have been designed to take advantage of parallel GPU performance require a high-performance interconnect, such as InfiniBand, to maximize that performance. In addition, the implementation or architecture of the InfiniBand interconnect can impact performance. The two industry-leading InfiniBand implementations have very different architectures, and only one was specifically designed for the HPC market – Intel’s TrueScale InfiniBand. TrueScale InfiniBand provides unmatched performance benefits, especially as the GPU cluster is scaled. It offers high performance in all of the key areas that influence the performance of HPC applications, including GPU-based applications. These factors include the following:
� Scalable non-coalesced message rate performance greater
than 25M messages per second
� Extremely low latency for MPI collectives, even on clusters
consisting of thousands of nodes
� Consistently low latency of one to two μs, even at scale
These factors and the design of Intel TrueScale InfiniBand enable it to optimize the performance of NVIDIA GPUs.
Intel is a global leader and technology innovator in high performance networking, including adapters, switches, and ASICs.
Performance with and without GPUDirect (Figure 15): AMBER Cellulose test on 8 GPUs, in NS/day – TrueScale with GPUDirect performs up to 44 percent better than without GPUDirect.
The following tests were performed on NVIDIA Tesla 2050s interconnected with Intel TrueScale QDR InfiniBand at Intel’s NETtrack Developer Center. The Tesla 2050 results for the industry’s other leading InfiniBand are from the published results on the AMBER benchmark site.
Figure 16 shows performance results from the AMBER Myoglobin benchmark (2,492 atoms) when scaling from two to eight Tesla 2050 GPUs. The results indicate that the Intel TrueScale InfiniBand offers up to 10 percent more performance than the industry’s other leading InfiniBand when both used their versions of GPUDirect. As the figure shows, the performance difference increases as the application is scaled to more GPUs.
The next test (Figure 17) shows the impact of the InfiniBand interconnect on the performance of AMBER across models of various sizes. The following Explicit Solvent models were tested:
� DHFR: 23,558 atoms
� FactorIX: 90,906 atoms
� Cellulose: 408,609 atoms
It is important to point out that the performance of the models is dependent on the model size, the size of the GPU cluster, and the performance of the InfiniBand interconnect. The smaller the model, the more it is dependent on the interconnect, because the model’s components (atoms in the case of AMBER) are divided across the available GPUs in the cluster to be processed for each step of the simulation. For example, in the DHFR test with its 23,558 atoms, each Tesla 2050 in an eight-GPU cluster is processing only about 2,945 atoms for each step of the simulation. The processing time is relatively small when compared to the communication time. In contrast, the Cellulose model with its 408K atoms requires each GPU to process 17 times more data per step than the DHFR test, so significantly more time is spent in GPU processing than in communications.
The preceding tests demonstrate that TrueScale InfiniBand performs better under load. The DHFR model is the most sensitive to the interconnect performance, and it indicates that TrueScale offers six percent more performance than the alternative InfiniBand product. Combining the results from Figure 15 and Figure 16 illustrates that TrueScale InfiniBand provides better results with smaller models on small clusters and better model scalability for larger models on larger GPU clusters.
Performance/Watt Advantage
Today the focus is not just on performance, but on how efficiently that performance can be delivered.
This is an area in which Intel TrueScale InfiniBand excels. The National Center for Supercomputing Applications (NCSA) has a cluster based on NVIDIA GPUs interconnected with TrueScale InfiniBand. This cluster is number three on the November 2010 Green500 list with performance of 933 MFlops/Watt.
This on its own is a significant accomplishment, but it is even more impressive when considering its original position on the Supercomputing Top500 list. In fact, the cluster is ranked at #404 on the Top500 list, but the combination of NVIDIA’s GPU performance, Intel’s TrueScale performance, and low power consumption enabled the cluster to move up 401 spots from the Top500 list to reach number three on the Green500 list. This is the most dramatic shift of any cluster in the top 50 of the Green500. In part, the following are the reasons for such dramatic performance/watt results:
� Performance of the NVIDIA Tesla 2050 GPU
� Linpack performance efficiency of this cluster is 49 percent, which is almost 20 percent better than most other NVIDIA GPU-based clusters on the Top500 list
� The Intel TrueScale InfiniBand Adapter required 25–50 percent less power than the alternative InfiniBand product
Conclusion
The performance of the InfiniBand interconnect has a significant impact on the performance of GPU-based clusters. Intel’s TrueScale InfiniBand is designed and architected for the HPC marketplace, and it offers an unmatched performance profile with a GPU-based cluster. Finally, Intel’s solution provides an implementation that is easier to deploy and maintain, and allows for optimal performance in comparison to the industry’s other leading InfiniBand.
GPU Scalable Performance with the Industry’s Leading InfiniBands (Figure 16): AMBER Myoglobin test on 2, 4, and 8 GPUs, in NS/day – TrueScale’s advantage over the other InfiniBand grows with GPU count, reaching roughly 10 percent at 8 GPUs.
Explicit Solvent Benchmark Results for the Two Leading InfiniBands (Figure 17): Cellulose, FactorIX, and DHFR on 8 GPUs, in NS/day – TrueScale leads on all three models, with the largest margin (six percent) on the interconnect-sensitive DHFR model.
Intel Xeon Phi Coprocessor – The Architecture
Intel Many Integrated Core (Intel MIC) architecture combines many Intel CPU cores onto a single chip. The Intel MIC architecture is targeted at highly parallel, High Performance Computing (HPC) workloads in a variety of fields such as computational physics, chemistry, biology, and financial services. Today such workloads are run as task parallel programs on large compute clusters.
Intel Xeon Phi Coprocessor
The Architecture
The Intel MIC architecture is aimed at achieving high through-
put performance in cluster environments where there are rigid
floor planning and power constraints. A key attribute of the mi-
croarchitecture is that it is built to provide a general-purpose
programming environment similar to the Intel Xeon processor
programming environment. The Intel Xeon Phi coprocessors
based on the Intel MIC architecture run a full service Linux op-
erating system, support x86 memory order model and IEEE 754
floating-point arithmetic, and are capable of running applica-
tions written in industry-standard programming languages
such as Fortran, C, and C++. The coprocessor is supported by a
rich development environment that includes compilers, numer-
ous libraries such as threading libraries and high performance
math libraries, performance characterizing and tuning tools,
and debuggers.
The Intel Xeon Phi coprocessor is connected to an Intel Xeon
processor, also known as the “host”, through a PCI Express (PCIe)
bus. Since the Intel Xeon Phi coprocessor runs a Linux operating
system, a virtualized TCP/IP stack could be implemented over
the PCIe bus, allowing the user to access the coprocessor as a
network node. Thus, any user can connect to the coprocessor
through a secure shell and directly run individual jobs or submit
batch jobs to it. The coprocessor also supports heterogeneous
applications wherein a part of the application executes on the
host while a part executes on the coprocessor.
Multiple Intel Xeon Phi coprocessors can be installed in a single
host system. Within a single system, the coprocessors can com-
municate with each other through the PCIe peer-to-peer inter-
connect without any intervention from the host. Similarly, the
coprocessors can also communicate through a network card
such as InfiniBand or Ethernet, without any intervention from
the host.
Intel’s initial development cluster named “Endeavor”, which is
composed of 140 nodes of prototype Intel Xeon Phi coproces-
sors, was ranked 150 in the TOP500 supercomputers in the world
based on its Linpack scores. Based on its power consumption,
this cluster was as good if not better than other heterogeneous
systems in the TOP500.
These results on unoptimized prototype systems demonstrate
that high levels of performance efficiency can be achieved on
compute-dense workloads without the need for a new pro-
gramming language or APIs.
The Intel Xeon Phi coprocessor is primarily composed of pro-
cessing cores, caches, memory controllers, PCIe client logic, and
a very high bandwidth, bidirectional ring interconnect. Each
core comes complete with a private L2 cache that is kept fully
coherent by a global-distributed tag directory. The memory
controllers and the PCIe client logic provide a direct interface
to the GDDR5 memory on the coprocessor and the PCIe bus, re-
spectively. All these components are connected together by the
ring interconnect.
Each core in the Intel Xeon Phi coprocessor is designed to be
power efficient while providing a high throughput for highly
parallel workloads. A closer look reveals that the core uses a
short in-order pipeline and is capable of supporting 4 threads
The first generation Intel Xeon Phi product codenamed “Knights Corner”
Linpack performance and power of Intel’s cluster
Microarchitecture
Intel Xeon Phi Coprocessor Core
in hardware. It is estimated that the cost to support IA architec-
ture legacy is a mere 2% of the area costs of the core and is even
less at a full chip or product level. Thus the cost of bringing the
Intel Architecture legacy capability to the market is very mar-
ginal.
Vector Processing Unit
An important component of the Intel Xeon Phi coprocessor’s core is its vector processing unit (VPU). The VPU features a novel 512-bit SIMD instruction set, officially known as Intel Initial Many Core Instructions (Intel IMCI). Thus, the VPU can execute 16 single-precision (SP) or 8 double-precision (DP) operations per cycle. The VPU also supports Fused Multiply-Add (FMA) instructions and hence can execute 32 SP or 16 DP floating point operations per cycle. It also provides support for integers.
Vector units are very power efficient for HPC workloads. A single
operation can encode a great deal of work and does not incur
energy costs associated with fetching, decoding, and retiring
many instructions. However, several improvements were re-
quired to support such wide SIMD instructions. For example, a
mask register was added to the VPU to allow per lane predicated
execution. This helped in vectorizing short conditional branch-
es, thereby improving the overall software pipelining efficiency.
The VPU also supports gather and scatter instructions, which
are simply non-unit stride vector memory accesses, directly in
hardware. Thus for codes with sporadic or irregular access pat-
terns, vector scatter and gather instructions help in keeping the
code vectorized.
The VPU also features an Extended Math Unit (EMU) that can
execute transcendental operations such as reciprocal, square
root, and log, thereby allowing these operations to be executed
in a vector fashion with high bandwidth. The EMU operates by
calculating polynomial approximations of these functions.
The Interconnect
The interconnect is implemented as a bidirectional ring. Each direction comprises three independent rings. The first,
largest, and most expensive of these is the data block ring. The
data block ring is 64 bytes wide to support the high bandwidth
requirement due to the large number of cores. The address ring
is much smaller and is used to send read/write commands and
memory addresses. Finally, the smallest ring and the least ex-
pensive ring is the acknowledgement ring, which sends flow
control and coherence messages.
When a core accesses its L2 cache and misses, an address re-
quest is sent on the address ring to the tag directories. The
memory addresses are uniformly distributed amongst the tag
directories on the ring to provide a smooth traffic characteristic
on the ring. If the requested data block is found in another core’s
L2 cache, a forwarding request is sent to that core’s L2 over the
address ring and the request block is subsequently forwarded
on the data block ring. If the requested data is not found in any
caches, a memory address is sent from the tag directory to the
memory controller.
The figure below shows the distribution of the memory con-
trollers on the bidirectional ring. The memory controllers are
symmetrically interleaved around the ring. There is an all-to-all
mapping from the tag directories to the memory controllers. The
addresses are evenly distributed across the memory controllers,
thereby eliminating hotspots and providing a uniform access
pattern which is essential for a good bandwidth response.
During a memory access, whenever an L2 cache miss occurs on a
core, the core generates an address request on the address ring
and queries the tag directories. If the data is not found in the
tag directories, the core generates another address request and
queries the memory for the data. Once the memory controller
fetches the data block from memory, it is returned to the core over the data ring. Thus, during this process, one data block,
two address requests (and by protocol, two acknowledgement
messages) are transmitted over the rings. Since the data block
rings are the most expensive and are designed to support the
Distributed Tag Directories
required data bandwidth, we need to increase the number of
less expensive address and acknowledgement rings by a factor
of two to match the increased bandwidth requirement caused
by the higher number of requests on these rings.
Multi-Threaded Streams Triad
The figure below shows the core scaling results for the multi-threaded Streams Triad workload. These results were generated
on a simulator for a prototype of the Intel Xeon Phi coproces-
sor with only one address ring and one acknowledgement ring
per direction in its interconnect. The results indicate that in this
case the address and acknowledgement rings would become
performance bottlenecks and would exhibit poor scalability
beyond 32 cores.
The production grade Intel Xeon Phi coprocessor uses two address and two acknowledgement rings per direction and provides good performance scaling up to 50 cores and beyond. It is evident from the figure that the addition of the rings results in an over 40% aggregate bandwidth improvement.
Streaming Stores
Streaming stores were another key innovation employed to further boost memory bandwidth. The pseudo code for the Streams Triad kernel is shown below:
Streams Triad
for (I = 0; I < HUGE; I++)
    A[I] = k*B[I] + C[I];
The Streams Triad kernel reads two arrays, B and C, from memory and writes a single array, A, back to memory. Historically, a core reads a cache line before it writes the addressed data, so there is an additional read overhead associated with the write. A streaming store instruction allows the cores to write an entire cache line without reading it first. This reduces the number of bytes transferred per iteration from 256 bytes to 192 bytes.
Streaming Stores
The figure below shows the core scaling results of the Streams Triad workload with streaming stores. As is evident from the results, streaming stores provide a 30% improvement over the previous results. In total, by adding two rings per direction and implementing streaming stores, we are able to improve bandwidth by more than a factor of 2 for the multi-threaded Streams Triad.
                                      Without Streaming Stores | With Streaming Stores
Behavior                              Read A, B, C; write A    | Read B, C; write A
Bytes transferred to/from
memory per iteration                  256                      | 192
Interleaved Memory Access
Interconnect: 2x AD/AK
Multi-threaded Triad – Saturation for 1 AD/AK Ring
Multi-threaded Triad – Benefit of Doubling AD/AK
Other Design Features
Other micro-architectural optimizations incorporated into the Intel Xeon Phi coprocessor include a 64-entry second-level Translation Lookaside Buffer (TLB), simultaneous data cache loads and stores, and 512KB L2 caches. Lastly, the Intel Xeon Phi coprocessor implements a 16-stream hardware prefetcher to improve cache hits and provide higher bandwidth. The figure below shows the net performance improvements for the SPECfp 2006 benchmark suite for single-core, single-thread runs. The results indicate an average improvement of over 80% per cycle, not including frequency.
Caches
The Intel MIC architecture invests more heavily in L1 and L2 caches compared to GPU architectures. The Intel Xeon Phi coprocessor implements a leading-edge, very high bandwidth
memory subsystem. Each core is equipped with a 32KB L1 in-
struction cache and 32KB L1 data cache and a 512KB unified L2
cache. These caches are fully coherent and implement the x86
memory order model. The L1 and L2 caches provide an aggre-
gate bandwidth that is approximately 15 and 7 times, respec-
tively, faster compared to the aggregate memory bandwidth.
Hence, effective utilization of the caches is key to achieving
peak performance on the Intel Xeon Phi coprocessor. In addi-
tion to improving bandwidth, the caches are also more energy
efficient for supplying data to the cores than memory. The figure below shows the energy consumed per byte of data transferred from the memory and the L1 and L2 caches. In the exascale
compute era, caches will play a crucial role in achieving real per-
formance under restrictive power constraints.
Stencils
Stencils are common in physics simulations and are classic
examples of workloads which show a large performance gain
through efficient use of caches.
Stencils are typically employed in simulation of physical sys-
tems to study the behavior of the system over time. When these
workloads are not programmed to be cache-blocked, they will
be bound by memory bandwidth. Cache blocking promises
substantial performance gains given the increased bandwidth
and energy efficiency of the caches compared to memory.
Cache blocking improves performance by blocking the physical
structure or the physical system such that the blocked data fits
well into a core’s L1 and or L2 caches. For example, during each
time-step, the same core can process the data which is already
resident in the L2 cache from the last time step, and hence does
not need to be fetched from the memory, thereby improving
performance. Additionally, the cache coherence further aids
the stencil operation by automatically fetching the updated
data from the nearest neighboring blocks which are resident in
the L2 caches of other cores. Thus, stencils clearly demonstrate
the benefits of efficient cache utilization and coherence in HPC
workloads.
Multi-threaded Triad – with Streaming Stores
Stencils Example
Caches – For or Against?
Per-Core ST Performance Improvement (per cycle)
Power Management
Intel Xeon Phi coprocessors are not suitable for all workloads.
In some cases, it is beneficial to run the workloads only on the
host. In such situations where the coprocessor is not being
used, it is necessary to put the coprocessor in a power-saving
mode. The figure above shows all the components of the Intel
Xeon Phi coprocessor in a running state. To conserve power, as soon as all four threads on a core are halted, the clock to the core is gated. Once the clock has been gated for some programmable time, the core power gates itself, thereby eliminating any leakage.
At any point, any number of the cores can be powered down or
powered up. Additionally, when all the cores are power gated
and the uncore detects no activity, the tag directories, the inter-
connect, L2 caches and the memory controllers are clock gated.
At this point, the host driver can put the coprocessor into a
deeper sleep or an idle state, wherein all the uncore is power
gated, the GDDR is put into a self-refresh mode, the PCIe logic is
put in a wait state for a wakeup and the GDDR-IO is consuming
very little power. These power management techniques help conserve power and make the Intel Xeon Phi coprocessor an excellent candidate for data centers.
Power Management: All On and Running
Power Gate Core
Package Auto C3
Clock Gate Core
Intel Xeon Phi Coprocessor 3120P (6GB, 1.100 GHz, 57 core) Intel Xeon Phi Coprocessor 3120A (6GB, 1.100 GHz, 57 core)
Status Launched Launched
Processor Number 3120P 3120A
Number of Cores 57 57
Clock Speed 1.1 GHz 1.1 GHz
L2 Cache 28.5 MB 28.5 MB
Instruction Set 64-bit 64-bit
Instruction Set Extensions IMCI IMCI
Embedded Options Available No No
Lithography 22 nm 22 nm
Max TDP 300 W 300 W
Memory Specifications
Max Memory Size (dependent on memory type) 6 GB 6 GB
Number of Memory Channels 12 12
Max Memory Bandwidth 240 GB/s 240 GB/s
ECC Memory Supported Yes Yes
Expansion Options
PCI Express Revision 2.0 2.0
Intel Xeon Phi Coprocessor 5120D (8GB, 1.053 GHz, 60 core) Intel Xeon Phi Coprocessor 5110P (8GB, 1.053 GHz, 60 core)
Status Launched Launched
Processor Number 5120D 5110P
Number of Cores 60 60
Clock Speed 1.053 GHz 1.053 GHz
L2 Cache 30 MB 30 MB
Instruction Set 64-bit 64-bit
Instruction Set Extensions IMCI IMCI
Embedded Options Available No No
Lithography 22 nm 22 nm
Max TDP 245 W 225 W
Memory Specifications
Max Memory Size (dependent on memory type) 8 GB 8 GB
Number of Memory Channels 16 16
Max Memory Bandwidth 352 GB/s 320 GB/s
ECC Memory Supported Yes Yes
Expansion Options
PCI Express Revision 2.0 2.0
Intel Xeon Phi Coprocessor 7120X (16GB, 1.238 GHz, 61 core)
Intel Xeon Phi Coprocessor 7120P (16GB, 1.238 GHz, 61 core)
Intel Xeon Phi Coprocessor 7120D (16GB, 1.238 GHz, 61 core)
Intel Xeon Phi Coprocessor 7120A (16GB, 1.238 GHz, 61 core)
Status Launched Launched Launched Launched
Processor Number 7120X 7120P 7120D 7120A
Number of Cores 61 61 61 61
Clock Speed 1.238 GHz 1.238 GHz 1.238 GHz 1.238 GHz
L2 Cache 30.5 MB 30.5 MB 30.5 MB 30.5 MB
Instruction Set 64-bit 64-bit 64-bit 64-bit
Instruction Set Extensions IMCI IMCI IMCI IMCI
Embedded Options Available No No No No
Lithography 22 nm 22 nm 22 nm 22 nm
Max TDP 300 W 300 W 270 W 300 W
Memory Specifications
Max Memory Size (dependent on memory type) 16 GB 16 GB 16 GB 16 GB
Number of Memory Channels 16 16 16 16
Max Memory Bandwidth 352 GB/s 352 GB/s 352 GB/s 352 GB/s
ECC Memory Supported Yes Yes Yes Yes
Expansion Options
PCI Express Revision 2.0 2.0 2.0 2.0
InfiniBand – High-Speed Interconnects
InfiniBand (IB) is an efficient I/O technology that provides high-speed data transfers and ultra-low latencies for computing and storage over a highly reliable and scalable single fabric. The InfiniBand industry standard ecosystem creates cost-effective hardware and software solutions that easily scale from generation to generation.
InfiniBand is a high-bandwidth, low-latency network interconnect solution that has gained tremendous market share in the High Performance Computing (HPC) cluster community. InfiniBand was designed to take the place of today's data center networking technology. In the late 1990s, a group of next-generation I/O architects formed an open, community-driven network technology to provide scalability and stability based on successes from other network designs. Today, InfiniBand is a popular and widely used I/O fabric among customers within the Top500 supercomputers: major universities and labs; life sciences; biomedical; oil and gas (seismic, reservoir, and modeling applications); computer aided design and engineering; enterprise Oracle; and financial applications.
InfiniBand was designed to meet the evolving needs of the
high performance computing market. Computational science
depends on InfiniBand to deliver:
� High Bandwidth: Supports host connectivity of 10Gbps
with Single Data Rate (SDR), 20Gbps with Double Data
Rate (DDR), and 40Gbps with Quad Data Rate (QDR),
while offering 80Gbps switch-to-switch links
� Low Latency: Accelerates the performance of HPC and enter-
prise computing applications by providing ultra-low latencies
� Superior Cluster Scaling: Point-to-point latency remains
low as node and core counts scale – 1.2 μs. Highest real
message rate per adapter: each PCIe x16 adapter drives
26 million messages per second. Excellent communications/
computation overlap among nodes in a cluster
� High Efficiency: InfiniBand allows reliable protocols like Re-
mote Direct Memory Access (RDMA) communication to occur
between interconnected hosts, thereby increasing efficiency
� Fabric Consolidation and Energy Savings: InfiniBand can
consolidate networking, clustering, and storage data over
a single fabric, which significantly lowers overall power,
real estate, and management overhead in data centers. En-
hanced Quality of Service (QoS) capabilities support run-
ning and managing multiple workloads and traffic classes
� Data Integrity and Reliability: InfiniBand provides the high-
est levels of data integrity by performing Cyclic Redundancy
Checks (CRCs) at each fabric hop and end-to-end across the
fabric to avoid data corruption. To meet the needs of mission
critical applications and high levels of availability, InfiniBand
provides fully redundant and lossless I/O fabrics with auto-
matic failover path and link layer multi-paths
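The SDR/DDR/QDR figures quoted above are signaling rates for the standard 4x link width. Each lane signals at 2.5 Gbps (SDR), 5 Gbps (DDR), or 10 Gbps (QDR), and these generations use 8b/10b line encoding, which carries 8 data bits in every 10 signaled bits, so the effective data rate is 80% of the signaling rate:

```python
# Derive the effective data rates behind the quoted SDR/DDR/QDR figures.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}
LANES = 4                 # 4x link width, as used by typical host adapters
ENCODING = 8 / 10         # 8b/10b: 80% of signaled bits carry payload

for rate, lane in LANE_GBPS.items():
    signaling = lane * LANES
    effective = signaling * ENCODING
    print(f"{rate}: {signaling:.0f} Gbps signaling, {effective:.0f} Gbps data")
# SDR: 10 Gbps signaling, 8 Gbps data
# DDR: 20 Gbps signaling, 16 Gbps data
# QDR: 40 Gbps signaling, 32 Gbps data
```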
Components of the InfiniBand Fabric
InfiniBand is a point-to-point, switched I/O fabric architecture.
Point-to-point means that each communication link extends be-
tween only two devices. Both devices at each end of a link have
full and exclusive access to the communication path. To go be-
yond a point and traverse the network, switches come into play.
By adding switches, multiple points can be interconnected to
create a fabric. As more switches are added to a network, ag-
gregated bandwidth of the fabric increases. By adding multiple
paths between devices, switches also provide a greater level of
redundancy.
The InfiniBand fabric has four primary components, which are
explained in the following sections:
� Host Channel Adapter
� Target Channel Adapter
� Switch
� Subnet Manager
Host Channel Adapter
This adapter is an interface that resides within a server and com-
municates directly with the server’s memory and processor as
well as the InfiniBand Architecture (IBA) fabric. The adapter guar-
antees delivery of data, performs advanced memory access, and
can recover from transmission errors. Host channel adapters
can communicate with a target channel adapter or a switch. A
host channel adapter can be a standalone InfiniBand card or
it can be integrated on a system motherboard. Intel TrueScale
InfiniBand host channel adapters outperform the competition
with the industry’s highest message rate. Combined with the
lowest MPI latency and highest effective bandwidth, Intel host
channel adapters enable MPI and TCP applications to scale to
thousands of nodes with unprecedented price performance.
Target Channel Adapter
This adapter enables I/O devices, such as disk or tape storage,
to be located within the network independent of a host com-
puter. Target channel adapters include an I/O controller that
is specific to its particular device’s protocol (for example, SCSI,
Fibre Channel (FC), or Ethernet). Target channel adapters can
communicate with a host channel adapter or a switch.
Switch
An InfiniBand switch allows many host channel adapters and
target channel adapters to connect to it and handles network
traffic. The switch looks at the “local route header” on each
packet of data that passes through it and forwards it to the
appropriate location. The switch is a critical component of
the InfiniBand implementation that offers higher availability,
higher aggregate bandwidth, load balancing, data mirroring,
and much more. A group of switches is referred to as a Fabric. If
a host computer is down, the switch still continues to operate.
The switch also frees up servers and other devices by handling
network traffic.
Intel is a global leader and technology innovator in high performance networking, including adapters, switches, and ASICs.
Figure 1: Typical InfiniBand high performance cluster – host channel adapters, InfiniBand switch with subnet manager, target channel adapter.
Top 10 Reasons to Use Intel TrueScale InfiniBand
1. Predictable Low Latency Under Load – Less Than 1.0 µs.
TrueScale is designed to make the most of multi-core nodes by providing ultra-low latency and significant message rate scalability. As additional compute resources are added to the Intel TrueScale InfiniBand solution, latency and message rates scale linearly. HPC applications can be scaled without having to worry about diminished utilization of compute resources.
2. Quad Data Rate (QDR) Performance. The Intel 12000 switch
family runs at lane speeds of 10Gbps, providing a full bi-
sectional bandwidth of 40Gbps (QDR). In addition, the 12000
switch has the unique capability of riding through periods
of congestion with features such as deterministically low
latency. The Intel family of TrueScale 12000 products offers
the lowest latency of any IB switch and high performance
transfers with the industry’s most robust signal integrity.
3. Flexible QoS Maximizes Bandwidth Use. The Intel 12800
advanced design is based on an architecture that provides
comprehensive virtual fabric partitioning capabilities that
enable the IB fabric to support the evolving requirements of
an organization.
4. Unmatched Scalability – 18 to 864 Ports per Switch. Intel offers the broadest portfolio (five chassis and two edge switches) from 18 to 864 TrueScale InfiniBand ports, allowing customers to buy switches that match their connectivity, space, and power requirements.
5. Highly Reliable and Available. Reliability and Serviceability (RAS) that is proven in the most demanding Top500 and Enterprise environments is designed into Intel's 12000 series with hot swappable components, redundant components, customer replaceable units, and non-disruptive code load.
6. Lowest Per-Port Power and Cooling Requirements. The TrueScale 12000 offers the lowest power consumption and the highest port density – 864 total TrueScale InfiniBand ports in a single chassis makes it unmatched in the industry. This results in the lowest power per port for a director switch (7.8 watts per port) and the lowest power per port for an edge switch (3.3 watts per port).
7. Easy to Install and Manage. Intel installation, configuration, and monitoring wizards reduce time-to-ready. The Intel InfiniBand Fabric Suite (IFS) assists in diagnosing problems in the fabric. Non-disruptive firmware upgrades provide maximum availability and operational simplicity.
8. Protects Existing InfiniBand Investments. Seamless virtual I/O integration at the Operating System (OS) and application levels matches standard network interface card and adapter semantics with no OS or application changes – they just work. Additionally, the TrueScale family of products is compliant with the InfiniBand Trade Association (IBTA) open specification, so Intel products inter-operate with any IBTA-compliant InfiniBand vendor. Being IBTA-compliant makes the Intel 12000 family of switches ideal for network consolidation and for pooling, sharing, and scaling I/O resources across servers.
9. Modular Configuration Flexibility. The Intel 12000 series switches offer configuration and scalability flexibility that meets the requirements of either a high-density or high-performance compute grid by offering port modules that address both needs. Units can be populated with Ultra High Density (UHD) leafs for maximum connectivity or Ultra High Performance (UHP) leafs for maximum performance. The highly scalable 24-port leaf modules support configurations between 18 and 864 ports, providing the right size to start and the capability to grow as your grid grows.
10. Option to Gateway to Ethernet and Fibre Channel Networks. Intel offers multiple options to enable hosts on InfiniBand fabrics to transparently access Fibre Channel based storage area networks (SANs) or Ethernet based local area networks (LANs).
The Intel TrueScale architecture and the resulting family of products deliver the promise of InfiniBand to the enterprise today.
“Effective fabric management has become the
most important factor in maximizing performance
in an HPC cluster. With IFS 7.0, Intel has addressed
all of the major fabric management issues in a prod-
uct that in many ways goes beyond what others are
offering.”
Michael Wirth, HPC Presales Specialist
InfiniBand
Intel MPI Library 4.0 Performance
Introduction
Intel's latest MPI release, Intel MPI 4.0, is now optimized to work with Intel's TrueScale InfiniBand adapters. Intel MPI 4.0 can now directly call Intel's TrueScale Performance Scaled Messaging (PSM) interface. The PSM interface is designed to optimize MPI application performance. This means that organizations will be able to achieve a significant performance boost with a combination of Intel MPI 4.0 and Intel TrueScale InfiniBand.
Solution
Intel tuned and optimized its latest MPI release – Intel MPI Library 4.0 – to improve performance when used with Intel TrueScale InfiniBand. With MPI Library 4.0, applications can make full use of High Performance Computing (HPC) hardware, improving the overall performance of the applications on the clusters.
Intel MPI Library 4.0 Performance
Intel MPI Library 4.0 uses the high performance MPI-2 specification on multiple fabrics, which results in better performance for applications on Intel architecture-based clusters. This library enables quick delivery of maximum end user performance, even if there are changes or upgrades to new interconnects – without requiring major changes to the software or operating environment. This high-performance message-passing interface library lets developers build applications that can run on multiple cluster fabric interconnects chosen by the user at runtime.
Testing
Intel used the Parallel Unstructured Maritime Aerodynamics (PUMA) benchmark program to test the performance of Intel MPI Library versions 4.0 and 3.1 with Intel TrueScale InfiniBand. The program analyses internal and external non-reacting compressible flows over arbitrarily complex 3D geometries. PUMA is written in ANSI C and uses MPI libraries for message passing.
Results
The test showed that MPI performance can improve by more than 35 percent using Intel MPI Library 4.0 with Intel TrueScale InfiniBand, compared to using Intel MPI Library 3.1.
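Latency figures like those behind these results are typically measured with a ping-pong microbenchmark: one rank sends a small message, a second rank echoes it back, and half the averaged round-trip time is reported as one-way latency. The sketch below illustrates that measurement method using Python pipes between two local processes instead of MPI over InfiniBand, so the absolute numbers are orders of magnitude higher than a fabric's, but the methodology is the same.

```python
# Ping-pong latency measurement in miniature: send, echo back, halve the
# round trip. Uses multiprocessing pipes as a stand-in for MPI ranks.
import time
from multiprocessing import Process, Pipe

ITERATIONS = 1000

def echo(conn):
    # the "remote rank": bounce each message straight back to the sender
    for _ in range(ITERATIONS):
        conn.send_bytes(conn.recv_bytes())

def pingpong():
    here, there = Pipe()
    worker = Process(target=echo, args=(there,))
    worker.start()
    msg = b"x" * 8                      # small message, so timing is latency-bound
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        here.send_bytes(msg)
        here.recv_bytes()
    elapsed = time.perf_counter() - start
    worker.join()
    return elapsed / ITERATIONS / 2     # one-way latency = half the round trip

if __name__ == "__main__":
    print(f"one-way pipe latency: {pingpong() * 1e6:.1f} microseconds")
```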
Intel TrueScale InfiniBand Adapters
Intel TrueScale InfiniBand Adapters offer scalable performance, reliability, low power, and superior application performance. These adapters ensure superior performance of HPC applications by delivering the highest message rate for multicore compute nodes, the lowest scalable latency for large node count clusters, the highest overall bandwidth on PCI Express Gen1 platforms, and superior power efficiency.
Purpose: Compare the performance of Intel MPI Library 4.0 and 3.1 with Intel InfiniBand
Benchmark: PUMA Flow
Cluster: Intel/IBM iDataPlex™ cluster
System Configuration: The NET track IBM Q-Blue Cluster/iDataPlex nodes were configured as follows:
Processor: Intel Xeon® CPU X5570 @ 2.93 GHz
Memory: 24 GB (6x4 GB) @ 1333 MHz (DDR3)
QDR InfiniBand Switch: Intel Model 12300, firmware version 6.0.0.0.33
QDR InfiniBand Host Channel Adapter: Intel QLE7340, software stack 5.1.0.0.49
Operating System: Red Hat® Enterprise Linux® Server release 5.3
Kernel: 2.6.18-128.el5
File System: IFS Mounted
Figure: Elapsed time of the PUMA benchmark with Intel MPI Library v3.1 versus v4.0 at 16, 32, and 64 cores – a 35% improvement (lower is better).
The Intel TrueScale 12000 family of Multi-Protocol Fabric Directors is the most highly integrated cluster computing interconnect solution available. An ideal solution for HPC, database clustering, and grid utility computing applications, the 12000 Fabric Directors maximize cluster and grid computing interconnect performance while simplifying and reducing the cost of operating a data center.
Subnet Manager
The subnet manager is an application responsible for config-
uring the local subnet and ensuring its continued operation.
Configuration responsibilities include managing switch setup
and reconfiguring the subnet if a link goes down or a new one
is added.
How High Performance Computing Helps Vertical Applications
Enterprises that want to do high performance computing must balance the following scalability metrics as the cluster size and the number of cores per node increase:
� Latency and Message Rate Scalability: must allow near line-
ar growth in productivity as the number of compute cores is
scaled.
� Power and Cooling Efficiency: as the cluster is scaled, power
and cooling requirements must not become major concerns
in today’s world of energy shortages and high-cost energy.
InfiniBand offers the promise of low latency, high bandwidth,
and unmatched scalability demanded by high performance
computing applications. IB adapters and switches that perform
well on these key metrics allow enterprises to meet their high-
performance and MPI needs with optimal efficiency. The IB so-
lution allows enterprises to quickly achieve their compute and
business goals.
Vertical Market | Application Segment | InfiniBand Value
Oil & Gas | Mix of Independent Service Provider (ISP) and home grown codes: reservoir modeling | Low latency, high bandwidth
Computer Aided Engineering (CAE) | Mostly Independent Software Vendor (ISV) codes: crash, air flow, and fluid flow simulations | High message rate, low latency, scalability
Government | Home grown codes: labs, defense, weather, and a wide range of apps | High message rate, low latency, scalability, high bandwidth
Education | Home grown and open source codes: a wide range of apps | High message rate, low latency, scalability, high bandwidth
Financial | Mix of ISP and home grown codes: market simulation and trading floor | High performance IP, scalability, high bandwidth
Life and Materials Science | Mostly ISV codes: molecular simulation, computational chemistry, and biology apps | Low latency, high message rates
InfiniBand
Intel Fabric Suite 7
Maximizing Investments in High Performance Computing
Around the world and across all industries, high performance
computing (HPC) is used to solve today’s most demanding com-
puting problems. As today’s high performance computing chal-
lenges grow in complexity and importance, it is vital that the
software tools used to install, configure, optimize, and manage
HPC fabrics also grow more powerful. Today’s HPC workloads
are too large and complex to be managed using software tools
that are not focused on the unique needs of HPC.
Designed specifically for HPC, Intel Fabric Suite 7 is a complete fabric management solution that maximizes the return on HPC investments by allowing users to achieve the highest levels of performance, efficiency, and ease of management from InfiniBand-connected HPC clusters of any size.
Highlights
� Intel FastFabric and Fabric Viewer integration with leading third-party HPC cluster management suites
� Simple but powerful Fabric Viewer dashboard for monitoring fabric performance
� Intel Fabric Manager integration with leading HPC workload management suites that combine virtual fabrics and compute
� Quality of Service (QoS) levels that maximize fabric efficiency and application performance
� Smart, powerful software tools that make Intel TrueScale Fabric solutions easy to install, configure, verify, optimize, and manage
� Congestion control architecture that reduces the effects of fabric congestion caused by low credit availability that can result in head-of-line blocking
� Powerful fabric routing methods – including adaptive and dispersive routing – that optimize traffic flows to avoid or eliminate congestion, maximizing fabric throughput
� Intel Fabric Manager's advanced design ensures fabrics of all sizes and topologies – from fat-tree to mesh and torus – scale to support the most demanding HPC workloads
Superior Fabric Performance and Simplified Management are
Vital for HPC
As HPC clusters scale to take advantage of multi-core and GPU-
accelerated nodes attached to ever-larger and more complex
fabrics, simple but powerful management tools are vital for
maximizing return on HPC investments.
Intel Fabric Suite 7 provides the performance and management tools for today's demanding HPC cluster environments. As clusters grow larger, management functions, from installation and configuration to fabric verification and optimization, are vital in ensuring that the interconnect fabric can support growing workloads. Besides fabric deployment and monitoring, IFS optimizes the performance of message passing applications – from advanced routing algorithms to Quality of Service (QoS) – ensuring all HPC resources are optimally utilized.
Scalable Fabric Performance
� Purpose-built for HPC, IFS is designed to make HPC clusters faster, easier, and simpler to deploy, manage, and optimize
� The Intel TrueScale Fabric on-load host architecture exploits the processing power of today's faster multi-core processors for superior application scaling and performance
� Policy-driven vFabric virtual fabrics optimize HPC resource utilization by prioritizing and isolating compute and storage traffic flows
� Advanced fabric routing options – including adaptive and dispersive routing – distribute traffic across all potential links, improving overall fabric performance and lowering congestion to improve throughput and latency
Scalable Fabric Intelligence
� Routing intelligence scales linearly as Intel TrueScale Fabric
switches are added to the fabric
� Intel Fabric Manager can initialize fabrics having several
thousand nodes within seconds
� Advanced and optimized routing algorithms overcome the
limitations of typical subnet managers
� Smart management tools quickly detect and respond to
fabric changes, including isolating and correcting problems
that can result in unstable fabrics
Standards-based Foundation
Intel TrueScale Fabric solutions are compliant with all industry software and hardware standards. Intel Fabric Suite (IFS) redefines IBTA management by coupling powerful management tools with intuitive user interfaces. InfiniBand fabrics built using IFS deliver the highest levels of fabric performance, efficiency, and management simplicity, allowing users to realize the full benefits from their HPC investments.
Major components of Intel Fabric Suite 7 include:
� FastFabric Toolset
� Fabric Viewer
� Fabric Manager
Intel FastFabric Toolset
Ensures rapid, error-free installation and configuration of Intel TrueScale Fabric switch, host, and management software tools. Guided by an intuitive interface, users can easily install, configure, validate, and optimize HPC fabrics.
Key features include:
� Automated host software installation and configuration
� Powerful fabric deployment, verification, analysis, and reporting tools for measuring connectivity, latency, and bandwidth
� Automated chassis and switch firmware installation and update
� Fabric route and error analysis tools
� Benchmarking and tuning tools
� Easy-to-use fabric health checking tools
Intel Fabric Viewer
A key component that provides an intuitive, Java-based user
interface with a “topology-down” view for fabric status and di-
agnostics, with the ability to drill down to the device layer to
identify and help correct errors. IFS 7 includes fabric dashboard,
a simple and intuitive user interface that presents vital fabric
performance statistics.
Key features include:
� Bandwidth and performance monitoring
� Device and device group level displays
� Definition of cluster-specific node groups so that displays can be oriented toward end-user node types
� Support for user-defined hierarchy
� Easy-to-use virtual fabrics interface
Intel Fabric Manager
Provides comprehensive control of administrative functions using a commercial-grade subnet manager. With advanced routing algorithms, powerful diagnostic tools, and full subnet manager failover, Fabric Manager simplifies subnet, fabric, and individual component management, making even the largest fabrics easy to deploy and optimize.
Key features include:
� Designed and optimized for large fabric support
� Integrated with both adaptive and dispersive routing sys-
tems
� Congestion control architecture (CCA)
� Robust failover of subnet management
� Path/route management
� Fabric/chassis management
� Fabric initialization in seconds, even for very large fabrics
� Performance and fabric error monitoring
Adaptive Routing
While most HPC fabrics are configured to have multiple paths between switches, standard InfiniBand switches may not be capable of taking advantage of them to reduce congestion. Adaptive routing monitors the performance of each possible path and automatically chooses the least congested route to the destination node. Unlike other implementations that rely on a purely subnet manager-based approach, the intelligent path selection capability within Fabric Manager, a key part of IFS and Intel 12000 series switches, scales as your fabric grows larger and more complex.
Key adaptive routing features include:
� Highly scalable – adaptive routing intelligence scales as the fabric grows
� Hundreds of real-time adjustments per second per switch
� Topology awareness through Intel Fabric Manager
� Supports all InfiniBand Trade Association* (IBTA*)-compliant adapters
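The core decision adaptive routing makes can be sketched in a few lines: track a congestion metric per candidate egress path and forward onto the least congested one, re-evaluating as loads change. This is a conceptual illustration only; path names and load units are invented, and real switches apply the policy in hardware at line rate.

```python
# Minimal sketch of least-congested path selection, as described above.
def pick_path(path_load: dict) -> str:
    # choose the egress path with the lowest observed congestion metric
    return min(path_load, key=path_load.get)

path_load = {"spine-1": 7, "spine-2": 2, "spine-3": 5}
chosen = pick_path(path_load)
print(chosen)                 # spine-2, currently the least congested
path_load[chosen] += 4        # loads change, so the choice is re-evaluated
print(pick_path(path_load))   # spine-3, once spine-2 has filled up
```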
Dispersive Routing
One of the critical roles of fabric management is the initialization and configuration of routes through the fabric between pairs of nodes. Intel Fabric Manager supports a variety of routing methods, including defining alternate routes that disperse traffic flows for redundancy, performance, and load balancing. Instead of sending all packets to a destination on a single path, Intel dispersive routing distributes traffic across all possible paths. Once received, packets are reassembled in their proper order for rapid, efficient processing. By leveraging the entire fabric to deliver maximum communications performance for all jobs, dispersive routing ensures optimal fabric efficiency.
Key dispersive routing features include:
� Fabric routing algorithms that provide the widest separation of paths possible
� Fabric "hotspot" reductions to avoid fabric congestion
� Balances traffic across all potential routes
� May be combined with vFabrics and adaptive routing
� Very low latency for small messages and time sensitive control protocols
� Messages can be spread across multiple paths – Intel Performance Scaled Messaging (PSM) ensures messages are reassembled correctly
� Supports all leading MPI libraries
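The mechanism described above, spraying one message's packets across several paths and restoring order at the receiver (PSM's job in Intel's stack), can be illustrated with a small simulation. Paths, packet names, and the round-robin spreading policy are assumptions for the sake of the example.

```python
# Dispersive routing in miniature: packets of one message travel over
# several paths, may arrive out of order, and are reassembled by
# sequence number at the receiver.
import random

PATHS = 3

def spray(packets):
    # distribute packets round-robin across all available paths
    lanes = [[] for _ in range(PATHS)]
    for seq, payload in enumerate(packets):
        lanes[seq % PATHS].append((seq, payload))
    return lanes

def deliver(lanes):
    # arrival order across independent paths is unpredictable...
    arrived = [pkt for lane in lanes for pkt in lane]
    random.shuffle(arrived)
    # ...so the receiver reassembles the message by sequence number
    return [payload for _, payload in sorted(arrived)]

message = ["pkt-%d" % i for i in range(7)]
assert deliver(spray(message)) == message
print("message reassembled in order")
```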
Mesh and Torus Topology Support
Fat-tree configurations are the most common topology used in HPC cluster environments today, but other topologies are gaining broader use as organizations try to create increasingly larger, more cost-effective fabrics. Intel is leading the way with full support of emerging mesh and torus topologies that can help reduce networking costs as clusters scale to thousands of nodes. IFS has been enhanced to support these emerging options – from failure handling that maximizes performance, to even higher reliability for complex networking environments.
transtec HPC solutions excel through their easy management
and high usability, while maintaining high performance and
quality throughout the whole lifetime of the system. As clusters
scale, issues like congestion mitigation and Quality-of-Service
can make a big difference in whether the fabric performs up to
its full potential.
With the intelligent choice of Intel InfiniBand products, transtec remains true to combining the best components together to provide the full best-of-breed solution stack to the customer. transtec HPC engineering experts are always available to fine-tune customers' HPC cluster systems and InfiniBand fabrics to get the maximum performance while at the same time providing them with an easy-to-manage and easy-to-use HPC solution.
NumascaleNumascale’s NumaConnect technology enables computer system vendors to build scal-able servers with the functionality of enterprise mainframes at the cost level of clus-ters. The technology unites all the processors, memory and IO resources in the system in a fully virtualized environment controlled by standard operating systems.
NumaConnect enables significant cost savings in three dimensions: resource utiliza-tion, system management and programmer productivity.
According to long time users of both large shared memory systems and clusters in en-vironments with a variety of applications, the former provide a much higher degree of resource utilization due to the flexibility of all system resources.
BackgroundNumascale’s NumaConnect technology enables computer sys-
tem vendors to build scalable servers with the functionality of
enterprise mainframes at the cost level of clusters. The technol-
ogy unites all the processors, memory and IO resources in the
system in a fully virtualized environment controlled by stan-
dard operating systems.
Systems based on NumaConnect will effi ciently support all
classes of applications using shared memory or message pass-
ing through all popular high level programming models. System
size can be scaled to 4k nodes where each node can contain
multiple processors. Memory size is limited by the 48-bit physi-
cal address range provided by the Opteron processors resulting
in a total system main memory of 256 TBytes.
At the heart of NumaConnect is NumaChip: a single chip that combines the cache coherent shared memory control logic with an on-chip 7-way switch. This eliminates the need for a separate, central switch and enables linear capacity and cost scaling.
The continuing trend with multi-core processor chips is enabling more applications to take advantage of parallel processing. NumaChip leverages the multi-core trend by enabling applications to scale seamlessly without the extra programming effort required for cluster computing. All tasks can access all memory and IO resources. This is of great value to users and the ultimate way to virtualize all system resources.
No other interconnect technology outside the high-end enterprise servers can offer this capability. All high speed interconnects now use the same kind of physical interfaces, resulting in almost the same peak bandwidth. The differentiation is in latency for the critical short transfers, functionality and software compatibility. NumaConnect differentiates from all other interconnects through the ability to provide unified access to all resources in a system and utilize caching techniques to obtain very low latency.
Key Facts:
� Scalable, directory based Cache Coherent Shared Memory interconnect for Opteron
� Attaches to coherent HyperTransport (cHT) through HTX connector, pick-up module or mounted directly on main board
� Configurable Remote Cache for each node
� Full 48 bit physical address space (256 TBytes)
� Up to 4k (4096) nodes
� ≈1 microsecond MPI latency (ping-pong/2)
� On-chip, distributed switch fabric for 2 or 3 dimensional torus topologies
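The 256 TByte figure in the key facts follows directly from the 48-bit physical address range:

```python
# 48 address bits span 2^48 bytes; dividing by 2^40 bytes per TByte
# gives 2^8 = 256 TBytes of addressable system main memory.
physical_bits = 48
total_bytes = 2 ** physical_bits
print(total_bytes // 2**40, "TBytes")  # 256 TBytes
```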
Expanding the capabilities of multi-core processors
Semiconductor technology has reached a level where processor frequency can no longer be increased much due to power consumption with corresponding heat dissipation and thermal handling problems. Historically, processor frequency scaled at approximately the same rate as transistor density and resulted in performance improvements for almost all applications with no extra programming effort. Processor chips are now instead being equipped with multiple processors on a single die. Utilizing the added capacity requires software that is prepared for parallel processing. This is quite obviously simple for individual and separated tasks that can be run independently, but is much more complex for speeding up single tasks.
The complexity of speeding up a single task grows with the logic distance between the resources needed to do the task, i.e. the fewer resources that can be shared, the harder it is. Multi-core processors share the main memory and some of the cache levels, i.e. they are classified as Symmetrical Multi Processors (SMP). Modern processor chips are also equipped with signals and logic that allow connecting to other processor chips while still maintaining the same logic sharing of memory. The practical limit is at two to four processor sockets before the overheads reduce performance scaling instead of increasing it. This is normally restricted to a single motherboard.
Currently, scaling beyond the single/dual SMP motherboards is
done through some form of network connection using Ether-
net or a higher-speed interconnect like InfiniBand. This requires processes running on the different compute nodes to communi-
cate through explicit messages. With this model, programs that
need to be scaled beyond a small number of processors have
to be written in a more complex way where the data can no
longer be shared among all processes, but need to be explicitly
decomposed and transferred between the different processors’
memories when required.
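The decomposition step itself can be made concrete with a small C sketch (an illustrative helper, not part of any MPI library): before any messages are exchanged, each process must compute which slice of the global data it owns.

```c
/* Block decomposition of n data items across p processes: process
   `rank` owns the half-open index range [*lo, *hi). The first n % p
   processes receive one extra item so the load stays balanced. */
static void block_range(int n, int p, int rank, int *lo, int *hi) {
    int base = n / p;   /* items every process gets */
    int rem  = n % p;   /* leftover items, one each to the lowest ranks */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}
```

In a real message-passing program, each rank would then allocate only its slice and exchange boundary data with neighbors explicitly; in a shared memory system this bookkeeping disappears.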
NumaConnect uses a scalable approach to sharing all memory
based on distributed directories to store information about
shared memory locations. This means that programs can be
scaled beyond the limit of a single motherboard without any
changes to the programming principle. Any process running on
any processor in the system can use any part of the memory
regardless of whether the physical location of the memory is on a different motherboard.
NumaConnect Value Proposition
NumaConnect enables significant cost savings in three dimensions: resource utilization, system management, and program-
mer productivity.
According to long-time users of both large shared memory systems (SMPs) and clusters in environments with a variety of applications, the former provide a much higher degree of resource utilization due to the flexible use of all system resources. They indicate that large mainframe SMPs can easily be kept at more than 90% utilization, whereas clusters seldom reach more than 60-70% in environments running a variety of jobs. Better compute resource utilization also contributes to more efficient use of the necessary infrastructure, with power consumption and cooling as the most prominent factors (accounting for approximately one third of the overall cost) and floor space as a secondary aspect.
Regarding system management, NumaChip can reduce the
number of individual operating system images significantly. In
a system with 100 Tflops of computing power, the number of system images can be reduced from approximately 1,400 to 40, a reduction factor of 35. Even if each of those 40 OS images requires somewhat more resources for management than the 1,400 smaller ones, the overall savings are significant.
Parallel processing in a cluster requires explicit message pass-
ing programming whereas shared memory systems can utilize
compilers and other tools that are developed for multi-core pro-
cessors. Parallel programming is a complex task and programs
written for message passing normally contain 50%-100% more
code than programs written for shared memory processing.
Since all programs contain errors, the probability of errors in
message passing programs is 50%-100% higher than for shared
memory programs. A significant amount of software development time is consumed by debugging, further increasing the time needed to complete development of an application.
In principle, servers are multi-tasking, multi-user machines that
are fully capable of running multiple applications at any given
time. Small servers are very cost-efficient measured by a peak
price/performance ratio because they are manufactured in very
high volumes and use many of the same components as desk-
side and desktop computers. However, these small to medium
sized servers are not very scalable. The most widely used config-
uration has 2 CPU sockets that hold from 4 to 16 CPU cores each.
They cannot be upgraded without changing to a different main board, which normally also requires a larger power supply and a different chassis. In turn, this means that careful capacity plan-
ning is required to optimize cost and if compute requirements
increase, it may be necessary to replace the entire server with
a bigger and much more expensive one since the price increase
is far from linear.
NumaChip contains all the logic needed to build Scale-Up sys-
tems based on volume manufactured server components. This
drives the cost per CPU core down to the level of volume servers while offering the same capabilities as mainframe-class servers.
Where IT budgets are in focus the price difference is obvious
and NumaChip represents a compelling proposition to get main-
frame capabilities at the cost level of high-end cluster technol-
ogy. The expensive mainframes still include some features for
dynamic system reconfiguration that NumaChip systems will
not offer initially. Such features depend on operating system software and can also be implemented in NumaChip-based systems.
Technology
Multi-core processors and shared memory
Shared memory programming for multi-processing boosts pro-
grammer productivity since it is easier to handle than the alter-
native message passing paradigms. Shared memory programs
are supported by compiler tools and require less code than the
alternatives resulting in shorter development time and fewer
program bugs. The availability of multi-core processors on all
major platforms including desktops and laptops is driving more
programs to take advantage of the increased performance po-
tential.
NumaChip offers seamless scaling within the same program-
ming paradigm regardless of system size from a single proces-
sor chip to systems with more than 1,000 processor chips.
Other interconnect technologies that do not offer cc-NUMA
capabilities require that applications are written for message
passing, resulting in larger programs with more bugs and correspondingly longer development time, while systems built with NumaChip can run any program efficiently.
Virtualization
The strong trend toward virtualization is driven by the desire to obtain higher utilization of resources in the datacenter. In
short, it means that any application should be able to run on
any server in the datacenter so that each server can be better
utilized by combining more applications on each server dynami-
cally according to user loads.
Commodity server technology imposes severe limitations on reaching this goal. One major limitation is that the memory requirements of any given application must be satisfied by the physical server that hosts the application at any given time. In turn, this means that if any application in the datacenter is to be dynamically executable on all of the servers at different times, all of the servers must be configured with the amount of memory required by the most demanding application, but only the one actually running the application will use that memory. This is where the mainframes excel, since they have a flexible shared memory architecture where any processor can use any portion of the memory at any given time, so they only need to be configured to handle the most demanding application in one instance. NumaChip offers the exact same feature, by pro-
viding any application with access to the aggregate amount of
memory in the system. In addition, it also offers all applications
access to all I/O devices in the system through the standard vir-
tual view provided by the operating system.
The two distinctly different architectures of clusters and mainframes are shown in Figure 1. In clusters, processes are loosely coupled through a network like Ethernet or InfiniBand. An application that needs to utilize more processors or I/O than those present in each server must be programmed to do so from the beginning. In a mainframe, any application can use any resource in the system as a virtualized resource, and the compiler can generate threads to be executed on any processor.
In a system interconnected with NumaChip, all processors can
access all the memory and all the I/O resources in the system
in the same way as on a mainframe. NumaChip provides a fully
virtualized hardware environment with shared memory and I/O
and with the same ability as mainframes to utilize compiler gen-
erated parallel processes and threads.
Operating Systems
Systems based on NumaChip can run standard operating sys-
tems that handle shared memory multiprocessing. Examples of
such operating systems are Linux, Solaris and Windows Server.
Numascale provides a bootstrap loader that is invoked after
powerup and performs initialization of the system by setting
up node address routing tables. Initially Numascale has tested
and provides bootstrap for Linux.
When the standard bootstrap loader is launched, the system
will appear as a large unified shared memory system.
Cache Coherent Shared Memory
The big differentiator for NumaConnect compared to other
high-speed interconnect technologies is the shared memo-
ry and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. They provide scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single-system-image machines that may contain thousands of processors.
Clustered vs Mainframe Architecture
Clustered Architecture
Mainframe Architecture
There are a number of advantages of shared memory machines that lead experts to regard the architecture as the holy grail of computing compared to clusters:
• Any processor can access any data location through direct load and store operations – easier programming, less code to write and debug
• Compilers can automatically exploit loop-level parallelism – higher efficiency with less human effort
• System administration relates to a unified system as opposed to a large number of separate images in a cluster – less effort to maintain
• Resources can be mapped and used by any processor in the system – optimal use of resources in a virtualized environment
• Process scheduling is synchronized through a single, real-time clock – avoids the serialization of scheduling associated with asynchronous operating systems in a cluster and the corresponding loss of efficiency
Scalability and Robustness
The initial design aimed at scaling to very large numbers of processors, with a 64-bit physical address space comprising 16 bits for the node identifier and 48 bits of address within each node. The current implementation for Opteron is limited by the global physical address space of 48 bits, with 12 bits used to address 4,096 physical nodes for a total physical address range of 256 Terabytes.
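The split of the 48-bit address space can be checked with a few lines of C (constants as stated above):

```c
#include <stdint.h>

/* 48-bit global physical address space, 12 bits of which select the
   node: 4,096 nodes of 64 GB each, 256 TB in total. */
enum { ADDR_BITS = 48, NODE_BITS = 12 };

static uint64_t total_space(void) { return 1ULL << ADDR_BITS; }
static uint64_t node_count(void)  { return 1ULL << NODE_BITS; }
static uint64_t per_node(void)    { return 1ULL << (ADDR_BITS - NODE_BITS); }
```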
A directory-based cache coherence protocol was developed to handle scaling with a significant number of nodes sharing data, avoiding overloading the interconnect between nodes with coherency traffic, which would seriously reduce real data throughput.
The basic ring topology with distributed switching allows a
number of different interconnect configurations that are more
scalable than most other interconnect switch fabrics. This also
eliminates the need for a centralized switch and includes inher-
ent redundancy for multidimensional topologies.
Functionality is included to manage robustness issues associ-
ated with high node counts and extremely high requirements
for data integrity with the ability to provide high availability for
systems managing critical data in transaction processing and
realtime control. All data that may exist in only one copy is ECC-protected, with automatic scrubbing after detected single-bit errors and automatic background scrubbing to avoid accumulation of single-bit errors.
Integrated, distributed switching
NumaChip contains an on-chip switch to connect to other nodes in a NumaChip-based system, eliminating the need for a centralized switch. The on-chip switch can connect systems in one, two or three dimensions. Small systems can use one, medium-sized systems two, and large systems will use all three dimensions to provide efficient and scalable connectivity between processors.
The two- and three-dimensional torus topologies also have the advantage of built-in redundancy, as opposed to systems based on centralized switches, where the switch represents a single point of failure.
NumaChip System Architecture
Block Diagram of NumaChip
The distributed switching reduces the cost of the system since
there is no extra switch hardware to pay for. It also reduces the
amount of rack space required to hold the system as well as the
power consumption and heat dissipation from the switch hard-
ware and the associated power supply energy loss and cooling
requirements.
Redefining Scalable OpenMP and MPI
Shared Memory Advantages
Multi-processor shared memory processing has long been the
preferred method for creating and running technical computing
codes. Indeed, this computing model now extends from a user’s
dual core laptop to 16+ core servers. Programmers often add
parallel OpenMP directives to their programs in order to take
advantage of the extra cores on modern servers. This approach
is flexible and often preserves the “sequential” nature of the program (pthreads can of course also be used, but OpenMP is
much easier to use). To extend programs beyond a single server,
however, users must use the Message Passing Interface (MPI) to
allow the program to operate across a high-speed interconnect.
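The point about preserving the sequential program can be illustrated with a short C sketch (illustrative; the directive is honored only when compiled with OpenMP support, e.g. -fopenmp):

```c
/* DAXPY-style update, y = a*x + y. The loop body is ordinary
   sequential code; the directive merely permits an OpenMP compiler
   to distribute iterations across cores. Compiled without OpenMP,
   the pragma is ignored and the semantics are identical. */
static void daxpy(double a, const double *x, double *y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Nothing in the source changes between the one-core and many-core cases, which is exactly the ease-of-use argument for shared memory programming.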
Interestingly, the advent of multi-core servers has created a par-
allel asymmetric computing model, where programs must map
themselves to networks of shared memory SMP servers. This
asymmetric model introduces two levels of communication,
local within a node, and distant to other nodes. Programmers
often create pure MPI programs that run across multiple cores
on multiple nodes. While a pure MPI program does represent the lowest common denominator, performance may be sacrificed by not exploiting the local nature of multi-core nodes.
Hybrid (combined MPI/OpenMP) models have been able to pull
more performance from cluster hardware, but often introduce
programming complexity and may limit portability.
Clearly, users prefer writing software for large shared memory
systems to MPI programming. This preference becomes more
pronounced when large data sets are used. In a large SMP sys-
tem the data are simply used in place, whereas in a distributed
memory cluster the dataset must be partitioned across com-
pute nodes.
In summary, shared memory systems have a number of highly desirable features that offer ease of use and cost reduction over traditional distributed memory systems:
• Any processor can access any data location through direct load and store operations, allowing easier programming (less time and training) for end users, with less code to write and debug.
• Compilers, such as those supporting OpenMP, can automatically exploit loop-level parallelism and create more efficient code, increasing system throughput and resource utilization.
• System administration of a unified system (as opposed to a large number of separate images in a cluster) results in reduced effort and cost for system maintenance.
• Resources can be mapped and used by any processor in the system, with optimal use of resources in a single-image operating system environment.
Shared Memory as a Universal Platform
Although the advantages of shared memory systems are clear, the actual implementation of such systems “at scale” was difficult prior to the emergence of NumaConnect technology.
There have traditionally been limits to the size and cost of
shared memory SMP systems, and as a result the HPC commu-
nity has moved to distributed memory clusters that now scale
into the thousands of cores. Distributed memory programming
System Topology examples
occurs within the MPI library, where explicit communication
pathways are established between processors (i.e., data is es-
sentially copied from machine to machine). A large number of
existing applications use MPI as a programming model.
Fortunately, MPI codes can run effectively on shared memory
systems. Optimizations have been built into many MPI versions
that recognize the availability of shared memory and avoid full
message protocols when communicating between processes.
Shared memory programming using OpenMP has been useful
on small-scale SMP systems such as commodity workstations
and servers. Providing large-scale shared memory environ-
ments for these codes, however, opens up a whole new world of
performance capabilities without the need for re-programming.
Using NumaConnect technology, scalable shared memory clusters are capable of efficiently running both large-scale OpenMP and MPI codes without modification.
Record-Setting OpenMP Performance
In the HPC community NAS Parallel Benchmarks (NPB) have
been used to test the performance of parallel computers
(http://www.nas.nasa.gov/publications/npb.html). The bench-
marks are a small set of programs derived from Computational
Fluid Dynamics (CFD) applications that were designed to help
evaluate the performance of parallel supercomputers. Problem sizes in NPB are predefined and indicated as different classes (currently A through F, with F being the largest).
Reference implementations of NPB are available in commonly-
used programming models such as MPI and OpenMP, which
make them ideal for measuring the performance of both distrib-
uted memory and SMP systems. These benchmarks were com-
piled with Intel ifort version 14.0.0. (Note: the currently generated code is slightly faster, but Numascale is working on NumaConnect optimizations for the GNU compilers and thus suggests using gcc and gfortran for OpenMP applications.)
For the following tests, the NumaConnect Shared Memory
benchmark system has 1TB of memory and 256 cores. It utilizes
eight servers, each equipped with two 16-core AMD Opteron 6380 2.5 GHz CPUs and 128 GB of memory.
Figure One shows the results for running NPB-SP (Scalar Penta-
diagonal solver) over a range of 16 to 121 cores using OpenMP
for the Class D problem size.
Figure Two shows results for the NPB-LU benchmark (Lower-
Upper Gauss-Seidel solver) over a range of 16 to 121 cores, using
OpenMP for the Class D problem size.
Figure Three shows the NAS-SP benchmark E-class scaling perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2). Results indicate that larger problems scale better on NumaConnect systems, and it was noted that NASA has never seen OpenMP E-class results with such a high number of cores.
OpenMP applications cannot run on InfiniBand clusters without additional software layers and kernel modifications. The NumaConnect cluster runs a standard Linux kernel image.
Surprisingly Good MPI Performance
Despite the excellent OpenMP shared memory performance
that NumaConnect can deliver, applications have historically
been written using MPI. The performance of these applications
is presented below. As mentioned, the NumaConnect system
can easily run MPI applications. Figure Four is a comparison of
OpenMP NAS Parallel results for NPB-SP (Class D)
OpenMP NAS Parallel results for NPB-LU (Class E)
NumaConnect and FDR InfiniBand NPB-SP (Class D). The results indicate that NumaConnect performance is superior to that of a traditional distributed-memory InfiniBand cluster. MPI tests
were run with OpenMPI and gfortran 4.8.1 using the same hard-
ware mentioned above.
Both industry-standard OpenMPI and MPICH2 work in shared
memory mode. Numascale has implemented their own version
of the OpenMPI BTL (Byte Transfer Layer) to optimize the com-
munication by utilizing non-polluting store instructions. MPI
messages require data to be moved, and in a shared memory en-
vironment there is no reason to use standard instructions that
implicitly result in cache pollution and reduced performance.
This results in very efficient message passing and excellent MPI
performance.
Similar results are shown in Figure Five for the NAS-LU (Class D). NumaConnect’s performance advantage over InfiniBand may be one of the more startling results for the NAS benchmarks. Recall again that OpenMP applications cannot run on InfiniBand clusters without additional software layers and kernel modifications.
OpenMP NAS results for NPB-SP (Class E)
NPB-SP comparison of NumaConnect to FDR InfiniBand
NPB-LU comparison of NumaConnect to FDR InfiniBand
ACML (“AMD Core Math Library“)
A software development library released by AMD. This library
provides useful mathematical routines optimized for AMD pro-
cessors. Originally developed in 2002 for use in high-performance
computing (HPC) and scientific computing, ACML allows nearly
optimal use of AMD Opteron processors in compute-intensive
applications. ACML consists of the following main components:
• A full implementation of Level 1, 2 and 3 Basic Linear Algebra Subprograms (→ BLAS), with optimizations for AMD Opteron processors.
• A full suite of Linear Algebra (→ LAPACK) routines.
• A comprehensive suite of Fast Fourier Transforms (FFTs) in single-, double-, single-complex and double-complex data types.
• Fast scalar, vector, and array math transcendental library routines.
• Random number generators in both single and double precision.
AMD offers pre-compiled binaries for Linux, Solaris, and Win-
dows available for download. Supported compilers include gfor-
tran, Intel Fortran Compiler, Microsoft Visual Studio, NAG, PathS-
cale, PGI compiler, and Sun Studio.
BLAS (“Basic Linear Algebra Subprograms“)
Routines that provide standard building blocks for performing
basic vector and matrix operations. The Level 1 BLAS perform
scalar, vector and vector-vector operations, the Level 2 BLAS
perform matrix-vector operations, and the Level 3 BLAS per-
form matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they are commonly used in the
development of high-quality linear algebra software, e.g. → LAPACK. Although a model Fortran implementation of the BLAS is available from netlib in the BLAS library, it is not expected to perform as well as a specially tuned implementation on most high-performance computers – on some machines it may give much worse performance – but it allows users to run → LAPACK software on machines that do not offer any other implementation of the BLAS.
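As an illustration of what a Level 3 routine computes, here is an unoptimized C sketch of the GEMM operation (C = alpha·A·B + beta·C, row-major, no transposition; this mirrors the semantics of the BLAS routine DGEMM but not its calling convention, and real BLAS libraries are heavily tuned):

```c
/* Reference matrix-matrix multiply in the style of the Level 3 BLAS
   routine DGEMM: C = alpha*A*B + beta*C, with A m-by-k, B k-by-n and
   C m-by-n, all stored row-major. Tuned BLAS implementations block
   this loop nest for caches and vector units. */
static void ref_dgemm(int m, int n, int k, double alpha,
                      const double *A, const double *B,
                      double beta, double *C) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int l = 0; l < k; l++)
                acc += A[i * k + l] * B[l * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```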
Cg (“C for Graphics”)
A high-level shading language developed by Nvidia in close col-
laboration with Microsoft for programming vertex and pixel
shaders. It is very similar to Microsoft’s → HLSL. Cg is based on
the C programming language and although they share the same
syntax, some features of C were modified and new data types
were added to make Cg more suitable for programming graph-
ics processing units. This language is only suitable for GPU pro-
gramming and is not a general programming language. The Cg
compiler outputs DirectX or OpenGL shader programs.
CISC (“complex instruction-set computer”)
A computer instruction set architecture (ISA) in which each in-
struction can execute several low-level operations, such as
a load from memory, an arithmetic operation, and a memory
store, all in a single instruction. The term was retroactively
coined in contrast to reduced instruction set computer (RISC).
The terms RISC and CISC have become less meaningful with the
continued evolution of both CISC and RISC designs and imple-
mentations, with modern processors also decoding and split-
ting more complex instructions into a series of smaller internal
micro-operations that can thereby be executed in a pipelined
fashion, thus achieving high performance on a much larger sub-
set of instructions.
Glossary
cluster
Aggregation of several, mostly identical or similar systems into a group, working in parallel on a problem. Previously known
as Beowulf Clusters, HPC clusters are composed of commodity
hardware, and are scalable in design. The more machines are
added to the cluster, the more performance can in principle be
achieved.
control protocol
Part of the → parallel NFS standard
CUDA driver API
Part of → CUDA
CUDA SDK
Part of → CUDA
CUDA toolkit
Part of → CUDA
CUDA (“Compute Uniform Device Architecture”)
A parallel computing architecture developed by NVIDIA. CUDA
is the computing engine in NVIDIA graphics processing units
or GPUs that is accessible to software developers through in-
dustry standard programming languages. Programmers use “C
for CUDA” (C with NVIDIA extensions), compiled through a Path-
Scale Open64 C compiler, to code algorithms for execution on
the GPU. CUDA architecture supports a range of computational
interfaces including → OpenCL and → DirectCompute. Third
party wrappers are also available for Python, Fortran, Java and
Matlab. CUDA works with all NVIDIA GPUs from the G8X series
onwards, including GeForce, Quadro and the Tesla line. CUDA
provides both a low level API and a higher level API. The initial
CUDA SDK was made public on 15 February 2007, for Microsoft
Windows and Linux. Mac OS X support was later added in ver-
sion 2.0, which supersedes the beta released February 14, 2008.
CUDA is the hardware and software architecture that enables
NVIDIA GPUs to execute programs written with C, C++, Fortran,
→ OpenCL, → DirectCompute, and other languages. A CUDA pro-
gram calls parallel kernels. A kernel executes in parallel across
a set of parallel threads. The programmer or compiler organizes
these threads in thread blocks and grids of thread blocks. The
GPU instantiates a kernel program on a grid of parallel thread
blocks. Each thread within a thread block executes an instance
of the kernel, and has a thread ID within its thread block, pro-
gram counter, registers, per-thread private memory, inputs, and
output results.
Figure: CUDA thread and memory hierarchy – threads, thread blocks, and grids, with per-thread private local memory, per-block shared memory, and per-application context global memory
A thread block is a set of concurrently executing threads that
can cooperate among themselves through barrier synchroniza-
tion and shared memory. A thread block has a block ID within
its grid. A grid is an array of thread blocks that execute the same
kernel, read inputs from global memory, write results to global
memory, and synchronize between dependent kernel calls. In
the CUDA parallel programming model, each thread has a per-
thread private memory space used for register spills, function
calls, and C automatic array variables. Each thread block has a
per-block shared memory space used for inter-thread communi-
cation, data sharing, and result sharing in parallel algorithms.
Grids of thread blocks share results in global memory space af-
ter kernel-wide global synchronization.
CUDA’s hierarchy of threads maps to a hierarchy of processors
on the GPU; a GPU executes one or more kernel grids; a stream-
ing multiprocessor (SM) executes one or more thread blocks;
and CUDA cores and other execution units in the SM execute
threads. The SM executes threads in groups of 32 threads called
a warp. While programmers can generally ignore warp ex-
ecution for functional correctness and think of programming
one thread, they can greatly improve performance by having
threads in a warp execute the same code path and access mem-
ory in nearby addresses. See the main article “GPU Computing”
for further details.
DirectCompute
An application programming interface (API) that supports gen-
eral-purpose computing on graphics processing units (GPUs)
on Microsoft Windows Vista or Windows 7. DirectCompute is
part of the Microsoft DirectX collection of APIs and was initially
released with the DirectX 11 API but runs on both DirectX 10
and DirectX 11 GPUs. The DirectCompute architecture shares a
range of computational interfaces with → OpenCL and → CUDA.
ETL (“Extract, Transform, Load”)
A process in database usage and especially in data warehousing
that involves:
• Extracting data from outside sources
• Transforming it to fit operational needs (which can include quality levels)
• Loading it into the end target (database or data warehouse)
The first part of an ETL process involves extracting the data from
the source systems. In many cases this is the most challenging
aspect of ETL, as extracting data correctly will set the stage for
how subsequent processes will go. Most data warehousing proj-
ects consolidate data from different source systems. Each sepa-
rate system may also use a different data organization/format.
Common data source formats are relational databases and flat
files, but may include non-relational database structures such
as Information Management System (IMS) or other data struc-
tures such as Virtual Storage Access Method (VSAM) or Indexed
Sequential Access Method (ISAM), or even fetching from outside
sources such as through web spidering or screen-scraping. The
streaming of the extracted data source and load on-the-fly to
the destination database is another way of performing ETL
when no intermediate data storage is required. In general, the
goal of the extraction phase is to convert the data into a single
format which is appropriate for transformation processing.
An intrinsic part of the extraction involves parsing the extracted data, checking whether the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
The transform stage applies a series of rules or functions to the
extracted data from the source to derive the data for loading
into the end target. Some data sources will require very little or
even no manipulation of data. In other cases, one or more of the
following transformation types may be required to meet the
business and technical needs of the target database.
The load phase loads the data into the end target, usually the
data warehouse (DW). Depending on the requirements of the
organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly or monthly basis. Other DWs (or even other parts of the same
DW) may add new data in a historicized form, for example, hour-
ly. To understand this, consider a DW that is required to main-
tain sales records of the last year. Then, the DW will overwrite
any data that is older than a year with newer data. However, the
entry of data for any one year window will be made in a histo-
ricized manner. The timing and scope to replace or append are
strategic design choices dependent on the time available and
the business needs. More complex systems can maintain a his-
tory and audit trail of all changes to the data loaded in the DW.
As the load phase interacts with a database, the constraints de-
fined in the database schema — as well as in triggers activated
upon data load — apply (for example, uniqueness, referential
integrity, mandatory fields), which also contribute to the overall
data quality performance of the ETL process.
FFTW (“Fastest Fourier Transform in the West”)
A software library for computing discrete Fourier transforms
(DFTs), developed by Matteo Frigo and Steven G. Johnson at the
Massachusetts Institute of Technology. FFTW is known as the
fastest free software implementation of the Fast Fourier trans-
form (FFT) algorithm (upheld by regular benchmarks). It can
compute transforms of real and complex-valued arrays of arbi-
trary size and dimension in O(n log n) time.
floating point standard (IEEE 754)
The most widely used standard for floating-point computation, followed by many hardware (CPU and FPU) and software implementations. Many computer languages allow or re-
quire that some or all arithmetic be carried out using IEEE 754
formats and operations. The current version is IEEE 754-2008,
which was published in August 2008; the original IEEE 754-1985
was published in 1985. The standard defines arithmetic formats,
interchange formats, rounding algorithms, operations, and ex-
ception handling. The standard also includes extensive recom-
mendations for advanced exception handling, additional opera-
tions (such as trigonometric functions), expression evaluation,
and for achieving reproducible results. The standard defines
single-precision (32-bit), double-precision (64-bit), and
128-bit quadruple-precision floating point numbers; the 2008
revision also defines a 16-bit half-precision format.
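The storage formats and their rounding behaviour can be observed from Python's struct module (a quick illustration, not part of the standard itself):

```python
import struct

# The storage sizes of the two most common interchange formats:
assert struct.calcsize('f') == 4   # single precision: 32 bits
assert struct.calcsize('d') == 8   # double precision: 64 bits

# Rounding: 0.1 has no exact binary representation, so storing it
# as binary32 and reading it back as binary64 exposes the error.
single = struct.unpack('f', struct.pack('f', 0.1))[0]
print(single == 0.1)             # False: binary32 keeps ~7 decimal digits
print(abs(single - 0.1) < 1e-8)  # True: but the rounding error is tiny
```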
FraunhoferFS (FhGFS)
A high-performance parallel file system from the Fraunhofer
Competence Center for High Performance Computing. Built on
scalable multithreaded core components with native → Infini-
Band support, file system nodes can serve → InfiniBand and
Ethernet (or any other TCP-enabled network) connections at the
same time and automatically switch to a redundant connection
path in case any of them fails. One of the most fundamental
concepts of FhGFS is the strict avoidance of architectural
bottlenecks. Striping file contents across multiple storage servers is
only one part of this concept. Another important aspect is the
distribution of file system metadata (e.g. directory information)
across multiple metadata servers. Large systems and metadata
intensive applications in general can greatly profit from the lat-
ter feature.
FhGFS requires no dedicated file system partition on the servers
– it uses existing partitions, formatted with any of the standard
Linux file systems, e.g. XFS or ext4. For larger networks, it is also
possible to create several distinct FhGFS file system partitions
with different configurations. FhGFS provides a coherent mode,
in which it is guaranteed that changes to a file or directory by
one client are always immediately visible to other clients.
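Striping can be illustrated with a toy round-robin layout function (a hypothetical sketch; this is not FhGFS's actual layout algorithm):

```python
def chunk_location(offset, chunk_size, num_targets):
    """Round-robin striping sketch: which storage target holds the
    byte at `offset`, and at which offset within that target's
    chunk stream? Illustrative only."""
    chunk_index = offset // chunk_size
    target = chunk_index % num_targets          # round-robin target
    stripe_row = chunk_index // num_targets     # how many full stripes before
    local_offset = stripe_row * chunk_size + offset % chunk_size
    return target, local_offset

# 1 MiB chunks over 4 storage targets: byte 5 MiB lands on target 1
print(chunk_location(5 * 2**20, 2**20, 4))   # -> (1, 1048576)
```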
Global Arrays (GA)
A library developed by scientists at Pacific Northwest National
Laboratory for parallel computing. GA provides a friendly API
for shared-memory programming on distributed-memory com-
puters for multidimensional arrays. The GA library is a predeces-
sor to the GAS (global address space) languages currently being
developed for high-performance computing. The GA toolkit has
additional libraries including a Memory Allocator (MA), Aggre-
gate Remote Memory Copy Interface (ARMCI), and functionality
for out-of-core storage of arrays (ChemIO). Although GA was ini-
tially developed to run with TCGMSG, a message passing library
that came before the → MPI standard (Message Passing Inter-
face), it is now fully compatible with → MPI. GA includes simple
matrix computations (matrix-matrix multiplication, LU solve)
and works with → ScaLAPACK. Sparse matrices are available but
the implementation is not optimal yet. GA was developed by Jar-
ek Nieplocha, Robert Harrison and R. J. Littlefield. The ChemIO li-
brary for out-of-core storage was developed by Jarek Nieplocha,
Robert Harrison and Ian Foster. The GA library is incorporated
into many quantum chemistry packages, including NWChem,
MOLPRO, UTChem, MOLCAS, and TURBOMOLE. The GA toolkit is
free software, licensed under its own custom license.
Globus Toolkit
An open source toolkit for building computing grids developed
and provided by the Globus Alliance, currently at version 5.
GMP (“GNU Multiple Precision Arithmetic Library”)
A free library for arbitrary-precision arithmetic, operating on
signed integers, rational numbers, and floating point numbers.
There are no practical limits to the precision except the ones
implied by the available memory in the machine GMP runs on
(the operand size limit is 2^31 bits on 32-bit machines and 2^37
bits on 64-bit machines). GMP has a rich set of functions, and
the functions have a regular interface. The basic interface is
for C but wrappers exist for other languages including C++, C#,
OCaml, Perl, PHP, and Python. In the past, the Kaffe Java virtual
machine used GMP to support Java built-in arbitrary precision
arithmetic. This feature has been removed from recent releases,
causing protests from people who claim that they used Kaffe
solely for the speed benefits afforded by GMP. As a result, GMP
support has been added to GNU Classpath. The main target ap-
plications of GMP are cryptography applications and research,
Internet security applications, and computer algebra systems.
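What arbitrary precision buys can be illustrated with Python, whose built-in integers behave like the bignums GMP supplies to C programs (where native integer types are fixed at 32 or 64 bits):

```python
# Python's int is arbitrary precision, so operations that would
# overflow a machine word just work -- the service GMP provides to C.
x = 2**512 + 1          # far beyond any 64-bit machine word
y = pow(3, 100, x)      # modular exponentiation, a staple of cryptography
print(x.bit_length())   # -> 513
```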
GotoBLAS
Kazushige Goto’s implementation of → BLAS.
Glossary
grid (in CUDA architecture)
Part of the → CUDA programming model
GridFTP
An extension of the standard File Transfer Protocol (FTP) for use
with Grid computing. It is defined as part of the → Globus toolkit,
under the organisation of the Global Grid Forum (specifically, by
the GridFTP working group). The aim of GridFTP is to provide a
more reliable and high performance file transfer for Grid com-
puting applications. This is necessary because of the increased
demands of transmitting data in Grid computing – it is frequent-
ly necessary to transmit very large files, and this needs to be
done fast and reliably. GridFTP is the answer to the problem of
incompatibility between storage and access systems. Previous-
ly, each data provider would make their data available in their
own specific way, providing a library of access functions. This
made it difficult to obtain data from multiple sources, requiring
a different access method for each, and thus dividing the total
available data into partitions. GridFTP provides a uniform way
of accessing the data, encompassing functions from all the dif-
ferent modes of access, building on and extending the univer-
sally accepted FTP standard. FTP was chosen as a basis for it
because of its widespread use, and because it has a well defined
architecture for extensions to the protocol (which may be dy-
namically discovered).
Hierarchical Data Format (HDF)
A set of file formats and libraries designed to store and orga-
nize large amounts of numerical data. Originally developed at
the National Center for Supercomputing Applications, it is cur-
rently supported by the non-profit HDF Group, whose mission
is to ensure continued development of HDF5 technologies, and
the continued accessibility of data currently stored in HDF. In
keeping with this goal, the HDF format, libraries and associated
tools are available under a liberal, BSD-like license for general
use. HDF is supported by many commercial and non-commer-
cial software platforms, including Java, MATLAB, IDL, and Py-
thon. The freely available HDF distribution consists of the li-
brary, command-line utilities, test suite source, Java interface,
and the Java-based HDF Viewer (HDFView). There currently exist
two major versions of HDF, HDF4 and HDF5, which differ signifi-
cantly in design and API.
HLSL (“High Level Shader Language“)
The High Level Shader Language or High Level Shading Lan-
guage (HLSL) is a proprietary shading language developed by
Microsoft for use with the Microsoft Direct3D API. It is analo-
gous to the GLSL shading language used with the OpenGL stan-
dard. It is very similar to the NVIDIA Cg shading language, as it
was developed alongside it.
HLSL programs come in three forms, vertex shaders, geometry
shaders, and pixel (or fragment) shaders. A vertex shader is ex-
ecuted for each vertex that is submitted by the application, and
is primarily responsible for transforming the vertex from object
space to view space, generating texture coordinates, and calcu-
lating lighting coefficients such as the vertex’s tangent, binor-
mal and normal vectors. When a group of vertices (normally 3,
to form a triangle) come through the vertex shader, their out-
put position is interpolated to form pixels within its area; this
process is known as rasterisation. Each of these pixels comes
through the pixel shader, whereby the resultant screen colour
is calculated. Optionally, an application using a Direct3D10
interface and Direct3D10 hardware may also specify a geometry
shader. This shader takes as its input the three vertices of a tri-
angle and uses this data to generate (or tessellate) additional
triangles, which are each then sent to the rasterizer.
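The interpolation step of rasterisation can be sketched with barycentric weights (a toy 2D version, not any real rasterizer's code):

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p in triangle (a, b, c).
    Rasterizers use weights like these to interpolate vertex-shader
    outputs (colours, texture coordinates) across a triangle's pixels."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    den = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    u = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / den
    v = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / den
    return u, v, 1 - u - v

# At the centroid of the triangle, each vertex contributes equally:
print(barycentric((1, 1), (0, 0), (3, 0), (0, 3)))  # each weight is 1/3
```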
InfiniBand
Switched fabric communications link primarily used in HPC and
enterprise data centers. Its features include high throughput,
low latency, quality of service and failover, and it is designed to
be scalable. The InfiniBand architecture specification defines a
connection between processor nodes and high performance
I/O nodes such as storage devices. Like → PCI Express, and many
other modern interconnects, InfiniBand offers point-to-point
bidirectional serial links intended for the connection of pro-
cessors with high-speed peripherals such as disks. On top of
the point-to-point capabilities, InfiniBand also offers multicast
operations. It supports several signalling rates and, as
with PCI Express, links can be bonded together for additional
throughput.
The SDR serial connection’s signalling rate is 2.5 gigabit per sec-
ond (Gbit/s) in each direction per connection. DDR is 5 Gbit/s
and QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125
Gbit/s per lane. For SDR, DDR and QDR, links use 8B/10B
encoding – every 10 bits sent carry 8 bits of data – making the
useful data transmission rate four-fifths the raw rate. Thus single,
double, and quad data rates carry 2, 4, or 8 Gbit/s useful data,
respectively. For FDR and EDR, links use 64B/66B encoding – ev-
ery 66 bits sent carry 64 bits of data.
Implementers can aggregate links in units of 4 or 12, called 4X or
12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s
of useful data. As of 2009 most systems use a 4X aggregate, im-
plying a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) con-
nection. Larger systems with 12X links are typically used for
cluster and supercomputer interconnects and for inter-switch
connections.
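The rate arithmetic above (per-lane signalling rate × lane count × encoding efficiency) can be checked with a small helper (illustrative Python; function and parameter names are made up):

```python
def useful_rate(lane_gbit, lanes, encoding):
    """Useful data rate in Gbit/s for an InfiniBand link, from the
    per-lane signalling rate and the line encoding."""
    eff = {'8b/10b': 8 / 10, '64b/66b': 64 / 66}[encoding]
    return lane_gbit * lanes * eff

print(useful_rate(2.5, 1, '8b/10b'))    # SDR 1X  -> 2 Gbit/s useful
print(useful_rate(10, 4, '8b/10b'))     # QDR 4X  -> 32 Gbit/s useful
print(useful_rate(10, 12, '8b/10b'))    # QDR 12X -> 96 Gbit/s useful
```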
Single data rate switch chips have a latency of 200
nanoseconds, DDR switch chips 140 nanoseconds, and QDR switch
chips 100 nanoseconds. End-to-end MPI latencies range from 1.07
microseconds to 2.6 microseconds, depending on the host channel
adapter.
As of 2009 various InfiniBand host channel adapters (HCA)
exist in the market, each with different latency and bandwidth
characteristics. InfiniBand also provides RDMA capabilities for
low CPU overhead. The latency for RDMA operations is less than
1 microsecond. See the main article “InfiniBand” for a further
description of InfiniBand features.
Intel Integrated Performance Primitives (Intel IPP)
A multi-threaded software library of functions for multimedia
and data processing applications, produced by Intel. The library
supports Intel and compatible processors and is available for
Windows, Linux, and Mac OS X operating systems. It is available
separately or as a part of Intel Parallel Studio. The library takes
advantage of processor features including MMX, SSE, SSE2,
SSE3, SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is
divided into four major processing groups: Signal (with linear ar-
ray or vector data), Image (with 2D arrays for typical color spac-
es), Matrix (with n x m arrays for matrix operations), and Cryp-
tography. Half the entry points are of the matrix type, a third
are of the signal type and the remainder are of the image and
cryptography types. Intel IPP functions operate on several data
types, including 8u (8-bit unsigned), 8s (8-bit signed), 16s,
32f (32-bit floating-point), 64f, etc. Typically, an application
developer works with only one dominant data type for most
processing functions, converting between input, processing, and
output formats at the end points. Version 5.2 was introduced
June 5, 2007, adding code samples for data compression, new
video codec support, support for 64-bit applications on Mac OS
X, support for Windows Vista, and new functions for ray-tracing
and rendering. Version 6.1 was released with the Intel C++ Com-
piler on June 28, 2009 and Update 1 for version 6.1 was released
on July 28, 2009.
Intel Threading Building Blocks (TBB)
A C++ template library developed by Intel Corporation for writ-
ing software programs that take advantage of multi-core pro-
cessors. The library consists of data structures and algorithms
that allow a programmer to avoid some complications arising
from the use of native threading packages such as POSIX →
threads, Windows → threads, or the portable Boost Threads in
which individual → threads of execution are created, synchro-
nized, and terminated manually. Instead the library abstracts
access to the multiple processors by allowing the operations
to be treated as “tasks”, which are allocated to individual cores
dynamically by the library’s run-time engine, and by automat-
ing efficient use of the CPU cache. A TBB program creates, syn-
chronizes and destroys graphs of dependent tasks according
to algorithms, i.e. high-level parallel programming paradigms
(a.k.a. Algorithmic Skeletons). Tasks are then executed respect-
ing graph dependencies. This approach groups TBB in a family
of solutions for parallel programming aiming to decouple the
programming from the particulars of the underlying machine.
Intel TBB is available commercially as a binary distribution with
support and in open source in both source and binary forms.
Version 4.0 was introduced on September 8, 2011.
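The task-based style can be contrasted with manual threading by using Python's executor pools as a stand-in for TBB's task scheduler (a sketch of the idea, not TBB itself):

```python
from concurrent.futures import ThreadPoolExecutor

# Work is submitted as tasks; the runtime maps them onto worker
# threads, so no thread is created, synchronized, or joined by hand.
def task(n):
    return sum(i * i for i in range(n))

with ThreadPoolExecutor() as pool:
    results = list(pool.map(task, [10, 100, 1000]))
print(results[0])  # -> 285
```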
iSER (“iSCSI Extensions for RDMA“)
A protocol that maps the iSCSI protocol over a network that
provides RDMA services (like → iWARP or → InfiniBand). This
permits data to be transferred directly into SCSI I/O buffers
without intermediate data copies. The Datamover Architecture
(DA) defines an abstract model in which the movement of data
between iSCSI end nodes is logically separated from the rest of
the iSCSI protocol. iSER is one Datamover protocol. The inter-
face between the iSCSI and a Datamover protocol, iSER in this
case, is called Datamover Interface (DI).
iWARP (“Internet Wide Area RDMA Protocol”)
An Internet Engineering Task Force (IETF) update of the RDMA
Consortium’s → RDMA over TCP standard. This latter standard
enables zero-copy transmission over legacy TCP. Because a
kernel implementation of the TCP stack is a tremendous
bottleneck, a few vendors now implement TCP in hardware. This
additional hardware is known as the TCP offload engine (TOE).
TOE itself does not prevent copying on the receive side, and
must be combined with RDMA hardware for zero-copy results. The
main component is the Direct Data Placement protocol (DDP),
which permits the actual zero-copy transmission. The
transmission itself is not performed by DDP, but by TCP.

InfiniBand signalling rates per link width (useful data):
SDR: Single Data Rate; DDR: Double Data Rate; QDR: Quadruple
Data Rate; FDR: Fourteen Data Rate; EDR: Enhanced Data Rate

      SDR        DDR        QDR        FDR         EDR
1X    2 Gbit/s   4 Gbit/s   8 Gbit/s   14 Gbit/s   25 Gbit/s
4X    8 Gbit/s   16 Gbit/s  32 Gbit/s  56 Gbit/s   100 Gbit/s
12X   24 Gbit/s  48 Gbit/s  96 Gbit/s  168 Gbit/s  300 Gbit/s
kernel (in CUDA architecture)
Part of the → CUDA programming model
LAM/MPI
A high-quality open-source implementation of the → MPI speci-
fication, including all of MPI-1.2 and much of MPI-2. Superseded
by the → OpenMPI implementation.
LAPACK (“linear algebra package”)
Routines for solving systems of simultaneous linear equations,
least-squares solutions of linear systems of equations, eigen-
value problems, and singular value problems. The original goal
of the LAPACK project was to make the widely used EISPACK and
→ LINPACK libraries run efficiently on shared-memory vector
and parallel processors. LAPACK routines are written so that as
much as possible of the computation is performed by calls to
the → BLAS library. While → LINPACK and EISPACK are based on
the vector operation kernels of the Level 1 BLAS, LAPACK was
designed at the outset to exploit the Level 3 BLAS. Highly effi-
cient machine-specific implementations of the BLAS are avail-
able for many modern high-performance computers. The BLAS
enable LAPACK routines to achieve high performance with por-
table software.
layout
Part of the → parallel NFS standard. Currently three types of
layout exist: file-based, block/volume-based, and object-based,
the latter making use of → object-based storage devices
LINPACK
A collection of Fortran subroutines that analyze and solve linear
equations and linear least-squares problems. LINPACK was de-
signed for supercomputers in use in the 1970s and early 1980s.
LINPACK has been largely superseded by → LAPACK, which has
been designed to run efficiently on shared-memory, vector su-
percomputers. LINPACK makes use of the → BLAS libraries for
performing basic vector and matrix operations.
The LINPACK benchmarks are a measure of a system’s float-
ing point computing power and measure how fast a computer
solves a dense N by N system of linear equations Ax=b, which is a
common task in engineering. The solution is obtained by Gauss-
ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating
point operations. The result is reported in millions of floating
point operations per second (MFLOP/s, sometimes simply called
MFLOPS).
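The benchmark's core computation can be sketched in plain Python (Gaussian elimination with partial pivoting for small N; the benchmark itself runs highly optimized BLAS code):

```python
def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting,
    the method used in the LINPACK benchmark (about
    2/3*N^3 + 2*N^2 floating point operations)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]         # partial pivoting
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# solves 2x + y = 3, x + 3y = 5  (x = 0.8, y = 1.4)
print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```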
LNET
Communication protocol in → Lustre
logical object volume (LOV)
A logical entity in → Lustre
Lustre
An object-based → parallel file system
management server (MGS)
A functional component in → Lustre
MapReduce
A framework for processing highly distributable problems
across huge datasets using a large number of computers
(nodes), collectively referred to as a cluster (if all nodes use the
same hardware) or a grid (if the nodes use different hardware).
Computational processing can occur on data stored either in a
filesystem (unstructured) or in a database (structured).
“Map” step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes. A
worker node may do this again in turn, leading to a multi-level
tree structure. The worker node processes the smaller problem,
and passes the answer back to its master node.
“Reduce” step: The master node then collects the answers to all
the sub-problems and combines them in some way to form the
output – the answer to the problem it was originally trying to
solve.
MapReduce allows for distributed processing of the map and
reduction operations. Provided each mapping operation is in-
dependent of the others, all maps can be performed in parallel
– though in practice it is limited by the number of independent
data sources and/or the number of CPUs near each source. Simi-
larly, a set of ‘reducers’ can perform the reduction phase – pro-
vided all outputs of the map operation that share the same key
are presented to the same reducer at the same time. While this
process can often appear inefficient compared to algorithms
that are more sequential, MapReduce can be applied to signifi-
cantly larger datasets than “commodity” servers can handle – a
large server farm can use MapReduce to sort a petabyte of data
in only a few hours. The parallelism also offers some possibility
of recovering from partial failure of servers or storage during
the operation: if one mapper or reducer fails, the work can be
rescheduled – assuming the input data is still available.
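The two steps can be sketched with the classic word-count example (sequential Python standing in for a distributed run; each map call is independent, so the calls could execute on different nodes):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each key (shuffle and reduce in one)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["to be or not to be", "be quick"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(counts["be"])  # -> 3
```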
metadata server (MDS)
A functional component in → Lustre
metadata target (MDT)
A logical entity in → Lustre
MKL (“Math Kernel Library”)
A library of optimized math routines for science, engineering,
and financial applications developed by Intel. Core math func-
tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers,
Fast Fourier Transforms, and Vector Math. The library supports
Intel and compatible processors and is available for Windows,
Linux and Mac OS X operating systems.
MPI, MPI-2 (“message-passing interface”)
A language-independent communications protocol used to
program parallel computers. Both point-to-point and collective
communication are supported. MPI remains the dominant mod-
el used in high-performance computing today. There are two
versions of the standard that are currently popular: version 1.2
(shortly called MPI-1), which emphasizes message passing and
has a static runtime environment, and MPI-2.1 (MPI-2), which in-
cludes new features such as parallel I/O, dynamic process
management and remote memory operations. MPI-2 specifies over
500 functions and provides language bindings for ANSI C, ANSI
Fortran (Fortran90), and ANSI C++. Interoperability of objects de-
fined in MPI was also added to allow for easier mixed-language
message passing programming. A side effect of MPI-2 stan-
dardization (completed in 1996) was clarification of the MPI-1
standard, creating the MPI-1.2 level. MPI-2 is mostly a superset
of MPI-1, although some functions have been deprecated. Thus
MPI-1.2 programs still work under MPI implementations com-
pliant with the MPI-2 standard. The MPI Forum reconvened in
2007, to clarify some MPI-2 issues and explore developments for
a possible MPI-3.
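The point-to-point model can be mimicked with standard-library threads and a queue (an analogy only, not MPI; real MPI programs use calls such as MPI_Send and MPI_Recv across separate processes):

```python
import queue
import threading

# Two "ranks" run independently; the only way to share data is an
# explicit, matched send/receive pair -- the essence of message passing.
link = queue.Queue()
results = []

def rank0():
    link.put({"tag": 0, "data": [1, 2, 3]})   # send analogue

def rank1():
    msg = link.get()                          # blocking receive analogue
    results.append(sum(msg["data"]))

t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1)
t1.start(); t0.start()
t0.join(); t1.join()
print(results[0])  # -> 6
```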
MPICH2
A freely available, portable → MPI 2.0 implementation, main-
tained by Argonne National Laboratory
MPP (“massively parallel processing”)
So-called MPP jobs are computer programs with several parts
running on several machines in parallel, often calculating simu-
lation problems. The communication between these parts can
e.g. be realized by the → MPI software interface.
MS-MPI
Microsoft → MPI 2.0 implementation shipped with Microsoft
HPC Pack 2008 SDK, based on and designed for maximum com-
patibility with the → MPICH2 reference implementation.
MVAPICH2
An → MPI 2.0 implementation based on → MPICH2 and devel-
oped by the Department of Computer Science and Engineer-
ing at Ohio State University. It is available under BSD licens-
ing and supports MPI over InfiniBand, 10GigE/iWARP and
RDMAoE.
NetCDF (“Network Common Data Form”)
A set of software libraries and self-describing, machine-inde-
pendent data formats that support the creation, access, and
sharing of array-oriented scientific data. The project homep-
age is hosted by the Unidata program at the University Cor-
poration for Atmospheric Research (UCAR). They are also the
chief source of NetCDF software, standards development,
updates, etc. The format is an open standard. NetCDF Clas-
sic and 64-bit Offset Format are an international standard
of the Open Geospatial Consortium. The project is actively
supported by UCAR. The recently released (2008) version 4.0
greatly enhances the data model by allowing the use of the
→ HDF5 data file format. Version 4.1 (2010) adds support for C
and Fortran client access to specified subsets of remote data
via OPeNDAP. The format was originally based on the con-
ceptual model of the NASA CDF but has since diverged and
is not compatible with it. It is commonly used in climatology,
meteorology and oceanography applications (e.g., weather
forecasting, climate change) and GIS applications. It is an in-
put/output format for many GIS applications, and for general
scientific data exchange. The NetCDF C library, and the librar-
ies based on it (Fortran 77 and Fortran 90, C++, and all third-
party libraries) can, starting with version 4.1.1, read some
data in other data formats. Data in the → HDF5 format can be
read, with some restrictions. Data in the → HDF4 format can
be read by the NetCDF C library if created using the → HDF4
Scientific Data (SD) API.
NetworkDirect
A remote direct memory access (RDMA)-based network interface
implemented in Windows Server 2008 and later. NetworkDirect
uses a more direct path from → MPI applications to networking
hardware, resulting in very fast and efficient networking. See the
main article “Windows HPC Server 2008 R2” for further details.
NFS (Network File System)
A network file system protocol originally developed by Sun Mi-
crosystems in 1984, allowing a user on a client computer to ac-
cess files over a network in a manner similar to how local stor-
age is accessed. NFS, like many other protocols, builds on the
Open Network Computing Remote Procedure Call (ONC RPC)
system. The Network File System is an open standard defined in
RFCs, allowing anyone to implement the protocol.
Sun used version 1 only for in-house experimental purposes.
When the development team added substantial changes to NFS
version 1 and released it outside of Sun, they decided to release
the new version as V2, so that version interoperation and RPC
version fallback could be tested. Version 2 of the protocol (de-
fined in RFC 1094, March 1989) originally operated entirely over
UDP. Its designers meant to keep the protocol stateless, with
locking (for example) implemented outside of the core protocol.
Version 3 (RFC 1813, June 1995) added:
– support for 64-bit file sizes and offsets, to handle files
  larger than 2 gigabytes (GB)
– support for asynchronous writes on the server, to improve
  write performance
– additional file attributes in many replies, to avoid the need
  to re-fetch them
– a READDIRPLUS operation, to get file handles and attributes
  along with file names when scanning a directory
– assorted other improvements
At the time of introduction of Version 3, vendor support for TCP
as a transport-layer protocol began increasing. While several
vendors had already added support for NFS Version 2 with TCP
as a transport, Sun Microsystems added support for TCP as a
transport for NFS at the same time it added support for Version
3. Using TCP as a transport made using NFS over a WAN more
feasible.
Version 4 (RFC 3010, December 2000; revised in RFC 3530, April
2003), influenced by AFS and CIFS, includes performance im-
provements, mandates strong security, and introduces a state-
ful protocol. Version 4 became the first version developed with
the Internet Engineering Task Force (IETF) after Sun Microsys-
tems handed over the development of the NFS protocols.
NFS version 4 minor version 1 (NFSv4.1) was approved by the
IESG and received an RFC number in January 2010. The NFSv4.1
specification aims to provide protocol support for clustered
server deployments, including the ability to provide scalable
parallel access to files distributed among multiple servers.
NFSv4.1 adds the parallel NFS (pNFS) capability, which enables
data access parallelism. The NFSv4.1 protocol
defines a method of separating the filesystem meta-data from
the location of the file data; it goes beyond the simple name/
data separation by striping the data amongst a set of data serv-
ers. This is different from the traditional NFS server which holds
the names of files and their data under the single umbrella of
the server.
In addition to pNFS, NFSv4.1 provides sessions, directory
delegation and notifications, multi-server namespace, access
control lists (ACL/SACL/DACL), retention attributions, and
SECINFO_NO_NAME. See the main article “Parallel Filesystems”
for further details.
Current work is under way on a draft for a future version 4.2
of the NFS standard, including so-called federated filesystems,
which constitute the NFS counterpart of Microsoft’s distributed
filesystem (DFS).
NUMA (“non-uniform memory access”)
A computer memory design used in multiprocessors, where the
memory access time depends on the memory location relative
to a processor. Under NUMA, a processor can access its own local
memory faster than non-local memory, that is, memory local to
another processor or memory shared between processors.
object storage server (OSS)
A functional component in → Lustre
object storage target (OST)
A logical entity in → Lustre
object-based storage device (OSD)
An intelligent evolution of disk drives that can store and serve
objects rather than simply place data on tracks and sectors.
This task is accomplished by moving low-level storage func-
tions into the storage device and accessing the device through
an object interface. Unlike a traditional block-oriented device
providing access to data organized as an array of unrelated
blocks, an object store allows access to data by means of stor-
age objects. A storage object is a virtual entity that groups data
together that has been determined by the user to be logically
related. Space for a storage object is allocated internally by the
OSD itself instead of by a host-based file system. OSDs manage
all necessary low-level storage, space management, and security
functions. Because there is no host-based metadata for an
object (such as inode information), the only way for an
application to retrieve an object is by using its object
identifier (OID). The SCSI interface was modified and extended
by the OSD Technical Work Group of the Storage Networking
Industry Association (SNIA) with varied industry and academic
contributors, resulting in a draft standard submitted to T10 in
2004. This standard was ratified in September 2004 and became
the ANSI T10 SCSI OSD V1 command set, released as INCITS
400-2004. The SNIA group continues to work on further
extensions to the interface, such as the ANSI T10 SCSI OSD V2
command set.
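The access-by-identifier idea can be sketched as a toy in-memory object store (hypothetical; the real T10 OSD command set is a SCSI protocol, not a Python API):

```python
import hashlib

class ObjectStore:
    """Toy object store: space management lives inside the device,
    and callers retrieve data only by object identifier (OID),
    never by block address."""
    def __init__(self):
        self._objects = {}

    def put(self, data):
        oid = hashlib.sha1(data).hexdigest()   # device-assigned OID
        self._objects[oid] = data
        return oid

    def get(self, oid):
        return self._objects[oid]

osd = ObjectStore()
oid = osd.put(b"simulation checkpoint")
print(osd.get(oid) == b"simulation checkpoint")  # -> True
```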
OLAP cube (“Online Analytical Processing”)
A set of data, organized in a way that facilitates non-predeter-
mined queries for aggregated information, or in other words,
online analytical processing. OLAP is one of the computer-
based techniques for analyzing business data that are collec-
tively called business intelligence. OLAP cubes can be thought
of as extensions to the two-dimensional array of a spreadsheet.
For example, a company might wish to analyze some financial
data by product, by time-period, by city, by type of revenue and
cost, and by comparing actual data with a budget. These addi-
tional methods of analyzing the data are known as dimensions.
Because there can be more than three dimensions in an OLAP
system the term hypercube is sometimes used.
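A cube and a one-dimension slice can be sketched with a dictionary keyed by dimension tuples (a toy in-memory illustration with made-up data):

```python
from collections import defaultdict

# Fact rows: (product, period, city, revenue). The cube aggregates
# revenue over the three dimensions product, period, and city.
facts = [
    ("widget", "2014-Q1", "Berlin", 120),
    ("widget", "2014-Q2", "Berlin", 150),
    ("gadget", "2014-Q1", "Munich", 90),
]

cube = defaultdict(int)
for product, period, city, revenue in facts:
    cube[(product, period, city)] += revenue

def slice_by(dim_index, value):
    """Fix one dimension and aggregate revenue over the others."""
    return sum(v for k, v in cube.items() if k[dim_index] == value)

print(slice_by(0, "widget"))  # 'widget' revenue across all quarters/cities
```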
OpenCL (“Open Computing Language”)
A framework for writing programs that execute across hetero-
geneous platforms consisting of CPUs, GPUs, and other proces-
sors. OpenCL includes a language (based on C99) for writing ker-
nels (functions that execute on OpenCL devices), plus APIs that
are used to define and then control the platforms. OpenCL pro-
vides parallel computing using task-based and data-based par-
allelism. OpenCL is analogous to the open industry standards
OpenGL and OpenAL, for 3D graphics and computer audio,
respectively. Originally developed by Apple Inc., which holds
trademark rights, OpenCL is now managed by the non-profit
technology consortium Khronos Group.
OpenMP (“Open Multi-Processing”)
An application programming interface (API) that supports
multi-platform shared memory multiprocessing programming
in C, C++ and Fortran on many architectures, including Unix and
Microsoft Windows platforms. It consists of a set of compiler di-
rectives, library routines, and environment variables that
influence run-time behavior.
Jointly defined by a group of major computer hardware and
software vendors, OpenMP is a portable, scalable model that
gives programmers a simple and flexible interface for develop-
ing parallel applications for platforms ranging from the desk-
top to the supercomputer.
An application built with the hybrid model of parallel program-
ming can run on a computer cluster using both OpenMP and
Message Passing Interface (MPI), or more transparently through
the use of OpenMP extensions for non-shared memory systems.
OpenMPI
An open source → MPI-2 implementation that is developed and
maintained by a consortium of academic, research, and indus-
try partners.
parallel NFS (pNFS)
A → parallel file system standard, an optional part of the current
→ NFS standard 4.1. See the main article “Parallel Filesystems” for
further details.
PCI Express (PCIe)
A computer expansion card standard designed to replace the
older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004,
PCIe (or PCI-E, as it is commonly called) is the latest standard
for expansion cards that is available on mainstream comput-
ers. PCIe, unlike previous PC expansion standards, is structured
around point-to-point serial links, a pair of which (one in
each direction) makes up a lane, rather than a shared parallel bus.
These lanes are routed by a hub on the main-board acting as
a crossbar switch.

PCIe per-direction throughput by link width:

     PCIe 1.x  PCIe 2.x  PCIe 3.0  PCIe 4.0
x1   256 MB/s  512 MB/s  1 GB/s    2 GB/s
x2   512 MB/s  1 GB/s    2 GB/s    4 GB/s
x4   1 GB/s    2 GB/s    4 GB/s    8 GB/s
x8   2 GB/s    4 GB/s    8 GB/s    16 GB/s
x16  4 GB/s    8 GB/s    16 GB/s   32 GB/s
x32  8 GB/s    16 GB/s   32 GB/s   64 GB/s

This dynamic point-to-point behavior allows more than one pair
of devices to communicate with each other at the same time. In
contrast, older PC interfaces had all devices
permanently wired to the same bus; therefore, only one device
could send information at a time. This format also allows “chan-
nel grouping”, where multiple lanes are bonded to a single de-
vice pair in order to provide higher bandwidth. The number of
lanes is “negotiated” during power-up or explicitly during op-
eration. By making the lane count flexible a single standard can
provide for the needs of high-bandwidth cards (e.g. graphics
cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet
cards) while also being economical for less demanding cards.
Unlike preceding PC expansion interface standards, PCIe is a
network of point-to-point connections. This removes the need
for “arbitrating” the bus or waiting for the bus to be free and
allows for full duplex communications. This means that while
standard PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the
same data transfer rate, PCIe x4 will give better performance
if multiple device pairs are communicating simultaneously or
if communication within a single device pair is bidirectional.
Specifications of the format are maintained and developed by a
group of more than 900 industry-leading companies called the
PCI-SIG (PCI Special Interest Group). In PCIe 1.x, each lane carries
approximately 250 MB/s. PCIe 2.0, released in late 2007, adds a
Gen2-signalling mode, doubling the rate to about 500 MB/s. On
November 18, 2010, the PCI Special Interest Group officially
published the finalized PCI Express 3.0 specification to its members
to build devices based on this new version of PCI Express, which
allows for a Gen3-signalling mode at about 1 GB/s per lane.
On November 29, 2011, the PCI-SIG announced plans for
PCI Express 4.0 featuring 16 GT/s, still on copper technology.
Additionally, active and idle power optimizations are to be in-
vestigated. Final specifications are expected to be released in
2014/15.
PETSc (“Portable, Extensible Toolkit for Scientific Computation”)
A suite of data structures and routines for the scalable (parallel)
solution of scientific applications modeled by partial differential
equations. It employs the → Message Passing Interface (MPI) stan-
dard for all message-passing communication. The current version
of PETSc is 3.2, released September 8, 2011. PETSc is intended
for use in large-scale application projects; many ongoing compu-
tational science projects are built around the PETSc libraries. Its
careful design allows advanced users to have detailed control
over the solution process. PETSc includes a large suite of parallel
linear and nonlinear equation solvers that are easily used in ap-
plication codes written in C, C++, Fortran and now Python. PETSc
provides many of the mechanisms needed within parallel appli-
cation code, such as simple parallel matrix and vector assembly
routines that allow the overlap of communication and computa-
tion. In addition, PETSc includes support for parallel distributed
arrays useful for finite difference methods.
process
→ thread
PTX (“parallel thread execution”)
Parallel Thread Execution (PTX) is a pseudo-assembly language
used in NVIDIA’s CUDA programming environment. The ‘nvcc’
compiler translates code written in CUDA, a C-like language, into
PTX, and the graphics driver contains a compiler which translates
the PTX into something which can be run on the processing cores.
RDMA (“remote direct memory access”)
Allows data to move directly from the memory of one computer
into that of another without involving either one's operating
system. This permits high-throughput, low-latency network-
ing, which is especially useful in massively parallel computer
clusters. RDMA extends the concept of DMA across the network.
RDMA supports zero-copy networking by enabling the network
adapter to transfer data directly to or from application memory,
eliminating the need to copy data between application memory
and the data buffers in the operating system. Such transfers
require no CPU involvement, caching, or context switches, and
they continue in parallel with other system operations.
When an application performs an RDMA Read or Write request,
the application data is delivered directly to the network, reduc-
ing latency and enabling fast message transfer. Common RDMA
implementations include → InfiniBand, → iSER, and → iWARP.
RISC (“reduced instruction-set computer”)
A CPU design strategy emphasizing the insight that simplified
instructions that “do less” may still provide higher perfor-
mance if this simplicity can be utilized to make instructions
execute very quickly. Opposite: → CISC.
ScaLAPACK (“scalable LAPACK”)
Library including a subset of → LAPACK routines redesigned
for distributed memory MIMD (multiple instruction, multiple
data) parallel computers. It is currently written in a Single-
Program-Multiple-Data style using explicit message passing
for interprocessor communication. ScaLAPACK is designed for
heterogeneous computing and is portable to any computer
that supports → MPI. The fundamental building blocks of the
ScaLAPACK library are distributed memory versions (PBLAS) of
the Level 1, 2 and 3 → BLAS, and a set of Basic Linear Algebra
Communication Subprograms (BLACS) for communication tasks
that arise frequently in parallel linear algebra computations. In
the ScaLAPACK routines, all interprocessor communication oc-
curs within the PBLAS and the BLACS. One of the design goals of
ScaLAPACK was to have the ScaLAPACK routines resemble their
→ LAPACK equivalents as much as possible.
service-oriented architecture (SOA)
An approach to building distributed, loosely coupled applica-
tions in which functions are separated into distinct services
that can be distributed over a network, combined, and reused.
See the main article “Windows HPC Server 2008 R2” for further
details.
single precision/double precision
→ floating point standard
SMP (“shared memory processing”)
So-called SMP jobs are computer programs with several
parts running on the same system and accessing a shared
memory region. SMP jobs are usually implemented as
→ multi-threaded programs. Communication between the
individual threads can be realized e.g. via the → OpenMP software
interface standard, or in a non-standard way by means of
native UNIX interprocess communication mechanisms.
SMP (“symmetric multiprocessing”)
A multiprocessor or multicore computer architecture where
two or more identical processors or cores can connect to a
single shared main memory in a completely symmetric way, i.e.
each part of the main memory has the same distance to each of
the cores. Opposite: → NUMA
storage access protocol
Part of the → parallel NFS standard
STREAM
A simple synthetic benchmark program that measures sustain-
able memory bandwidth (in MB/s) and the corresponding com-
putation rate for simple vector kernels.
streaming multiprocessor (SM)
Hardware component within the → Tesla GPU series
subnet manager
Application responsible for configuring the local → InfiniBand
subnet and ensuring its continued operation.
superscalar processors
A superscalar CPU architecture implements a form of parallel-
ism called instruction-level parallelism within a single proces-
sor. It thereby allows faster CPU throughput than would other-
wise be possible at the same clock rate. A superscalar processor
executes more than one instruction during a clock cycle by si-
multaneously dispatching multiple instructions to redundant
functional units on the processor. Each functional unit is not
a separate CPU core but an execution resource within a single
CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
Tesla
NVIDIA's third brand of GPUs, based on the high-end G80 GPU
and its successors. Tesla is NVIDIA's first dedicated General
Purpose GPU brand. Because of their very high computational
power (measured in floating point operations per second, or
FLOPS) compared to recent microprocessors, Tesla products are
intended for the HPC market. The primary function of Tesla
products is to aid in
simulations, large scale calculations (especially floating-point
calculations), and image generation for professional and scien-
tific fields, with the use of → CUDA. See the main article “NVIDIA
GPU Computing” for further details.
thread
A thread of execution is a fork of a computer program into two
or more concurrently running tasks. The implementation of
threads and processes differs from one operating system to an-
other, but in most cases, a thread is contained inside a process.
On a single processor, multithreading generally occurs by multi-
tasking: the processor switches between different threads. On
a multiprocessor or multi-core system, the threads or tasks will
generally run at the same time, with each processor or core run-
ning a particular thread or task. Threads are distinguished from
processes in that processes are typically independent, while
threads exist as subsets of a process. Whereas processes have
separate address spaces, threads share their address space,
which makes inter-thread communication much easier than
classical inter-process communication (IPC).
thread (in CUDA architecture)
Part of the → CUDA programming model
thread block (in CUDA architecture)
Part of the → CUDA programming model
thread processor array (TPA)
Hardware component within the → Tesla GPU series
10 Gigabit Ethernet
The fastest of the Ethernet standards, first published in 2002
as IEEE Std 802.3ae-2002. It defines a version of Ethernet with
a nominal data rate of 10 Gbit/s, ten times as fast as Gigabit
Ethernet. Over the years several 802.3 standards relating to
10GbE have been published, which later were consolidated
into the IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other
amendments have been consolidated into IEEE Std 802.3-2008.
10 Gigabit Ethernet supports only full-duplex links, which can
be connected by switches. Half-duplex operation and CSMA/
CD (carrier sense multiple access with collision detection) are not
supported in 10GbE. The 10 Gigabit Ethernet standard encom-
passes a number of different physical layer (PHY) standards. As
of 2008 10 Gigabit Ethernet is still an emerging technology with
only 1 million ports shipped in 2007, and it remains to be seen
which of the PHYs will gain widespread commercial acceptance.
warp (in CUDA architecture)
Part of the → CUDA programming model
ALWAYS KEEP IN TOUCH WITH THE LATEST NEWS
Visit us on the Web!
Here you will find comprehensive information about HPC, IT solutions for the datacenter, services and
high-performance, efficient IT systems.
Subscribe to our technology journals, E-News or the transtec newsletter and always stay up to date.
www.transtec.de/en/hpc
transtec Germany
Tel +49 (0) 7071/703-400
www.transtec.de
transtec Switzerland
Tel +41 (0) 44/818 47 00
www.transtec.ch
transtec United Kingdom
Tel +44 (0) 1295/756 500
www.transtec.co.uk
ttec Netherlands
Tel +31 (0) 24 34 34 210
www.ttec.nl
transtec France
Tel +33 (0) 3.88.55.16.00
www.transtec.fr
Texts and concept:
Layout and design:
Dr. Oliver Tennert, Director Technology Management & HPC Solutions | [email protected]
Jennifer Kemmler, Graphics & Design | [email protected]
© transtec AG, June 2014. The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission. No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.