Science on Supercomputers: Pushing the (back of) the envelope
Jeffrey P. Gardner
Pittsburgh Supercomputing Center, Carnegie Mellon University, University of Pittsburgh

Transcript
Page 1:

Science on Supercomputers:

Pushing the (back of) the envelope

Jeffrey P. Gardner

Pittsburgh Supercomputing Center
Carnegie Mellon University

University of Pittsburgh

Page 2:

Outline

• History (the past)
  • Characteristics of scientific codes
  • Scientific computing, supercomputers, and the Good Old Days
• Reality (the present)
  • Is there anything “super” about computers anymore?
  • Why “network” means more net work on your part.
• Fantasy (the future)
  • Strategies for turning a huge pile of processors into something scientists can actually use.

Page 3:

A (very brief) Introduction to Scientific Computing

Page 4:

Properties of “interesting” scientific datasets

A very large dataset where the calculation is “tightly coupled”.

Page 5:

Example Science Application: Cosmology

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM
• 100 million light years

To resolve the gravitational force on any single particle requires the entire dataset: “read-only” coupling.

Page 6:

Example Science Application: Cosmology

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM
• 100 million light years

To resolve the hydrodynamic forces requires information exchange between particles: “read-write” coupling.

Page 7:

Scientific Computing

Transaction Processing¹: A transaction is an information processing operation that cannot be subdivided into smaller operations. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state.²

Functional definition: A transaction is any computational task:
1. That cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.
2. Where any further subdivisions cannot be written in such a way that they are independent of one another.

¹ Term borrowed (and generalized, with apologies) from database management.
² From Wikipedia.

Page 8:

Scientific Computing

Functional definition: A transaction is any computational task:
1. That cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM

To resolve the gravitational force on any single particle requires the entire dataset: “read-only” coupling.

Page 9:

Scientific Computing

Functional definition: A transaction is any computational task:
2. Where any further subdivisions cannot be written in such a way that they are independent of one another.

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM

To resolve the hydrodynamic forces requires information exchange between particles: “read-write” coupling.

Page 10:

Scientific Computing

In most business and web applications:
• A single CPU usually processes many transactions per second.
• Transaction sizes are typically small.

Page 11:

Scientific Computing

In many science applications:
• A single transaction can take CPU-hours, days, or years.
• Transaction sizes can be extremely large.

Page 12:

What Made Computers “Super”?

Since the transaction must be memory-resident in order not to be I/O-bound, the next bottleneck is memory.

The original Supercomputers differed from “ordinary” computers in their memory bandwidth and latency characteristics.

Page 13:

The “Golden Age” of Supercomputing

1976-1982: The Cray-1 is the most powerful computer in the world

The Cray-1 is a vector platform: i.e., it performs the same operation on many contiguous memory elements in one clock tick.

Its memory subsystem was optimized to feed data to the processor at its maximum flop rate.
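To make the vector idea concrete, here is a minimal sketch (in C, not material from the talk itself) of the kind of loop a vector machine executes well: one operation applied uniformly to contiguous, independent array elements.

```c
/* A minimal sketch of a "vectorizable" loop: the same operation is applied
 * to contiguous, independent array elements, so a vector unit (or a modern
 * SIMD unit) can process many elements per instruction. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* y <- a*x + y, element by element */
}
```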

Page 14:

The “Golden Age” of Supercomputing

1985-1989: The Cray-2 is the most powerful computer in the world

The Cray-2 is also a vector platform

Page 15:

Scientists Liked Supercomputers. They were simple to program!

1. They were serial machines.
2. “Caches? We don’t need no stinkin’ caches!”
  • Scalar machines had no memory latency. This is as close as you get to an ideal computer.
  • Vector machines offered substantial performance increases over scalar machines if you could “vectorize” your code.

Page 16:

“Triumph” of the Masses

In the 1990s, commercial off-the-shelf (COTS) technology became so cheap, it was no longer cost-effective to produce fully-custom hardware

Page 17:

“Triumph” of the Masses

Instead of producing faster processors with faster memory, supercomputer companies built machines with lots of processors in them.

A single-processor Cray-2. A 1024-processor Cray (CRI) T3D.

Page 18:

“Triumph” of the Masses

These were known as massively parallel platforms, or MPPs.

A single-processor Cray-2. A 1024-processor Cray T3D.

Page 19:

“Triumph” of the Masses(?)

A single-processor Cray-2: the world’s fastest computer in 1989.

A 1024-processor Cray T3D: the world’s fastest computer in 1994 (almost).

Page 20:

Part II: The Present

Why “network” means more net work on your part

Page 21:

The “Social Impact” of MPPs

The transition from serial supercomputers to MPPs actually resulted in far fewer scientists using supercomputers. MPPs are really hard to program!

Developing scientific applications for MPPs became an area of study in its own right: High Performance Computing (HPC)

Page 22:

Characteristics of HPC Codes

Large dataset: data must be distributed across many compute nodes.

The CPU memory hierarchy vs. the MPP memory hierarchy (approximate access costs):
• Processor registers
• L1 cache: ~2 cycles
• L2 cache: ~10 cycles
• Main memory: ~100 cycles
• Off-processor memory: ~300,000 cycles!

[Figure: an N-Body cosmology simulation domain-decomposed across processors 0–8.]

Page 23:

What makes computers “super” anymore?

Cray T3D in 1994: Cray-built interconnect fabric.

PSC “Terascale Compute System” (TCS) in 2000: custom interconnect fabric by Quadrics.

PSC Cray XT3 in 2006: Cray-built interconnect fabric.

Page 24:

What makes computers “super” anymore?

I would propose the following definition:

A “supercomputer” differs from “a pile of workstations” in that a supercomputer is optimized to spread a single large transaction across many, many processors.

In practice, this means that the network interconnect fabric is identified as the principal bottleneck.

Page 25:

What makes computers “super” anymore?

Google’s 30-acre campus in The Dalles, Oregon

Page 26:

Review: Hallmarks of Computing

1956: FORTRAN heralded as the world’s first “high-level” language
1966: Seymour Cray develops the CDC 6600, the first “supercomputer”
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: The Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: The Cray-2 marks the end of the Golden Age of supercomputing
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1998: Google Inc. is founded
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN

Page 27:

Review: HPC

High-Performance Computing (HPC) refers to a type of computation whereby a single, large transaction is spread across 100s to 1000s of processors.

In general, this kind of computation is sensitive to network bandwidth and latency.

Therefore, most modern-day “supercomputers” seek to maximize interconnect bandwidth and minimize interconnect latency within economic limits.

Page 28:

Gasoline: N-Body Treecode (Order N log N)
• The naïve algorithm is Order N².
• Began development in 1994… and continues to this day.

[Figure: a kd-tree (a subset of a Binary Space Partitioning tree), partitioned among PEs.]
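For reference, the naïve Order-N² approach that the treecode improves upon is plain direct summation. The following is an illustrative C sketch with a softening parameter; it is not Gasoline source code.

```c
/* Direct-summation gravity: every particle interacts with every other,
 * giving O(N^2) force evaluations. This is the naive algorithm a treecode
 * such as Gasoline replaces with an O(N log N) approximation.
 * Illustrative sketch only; units and integration are omitted. */
#include <math.h>

typedef struct { double x, y, z, mass; } Particle;

void compute_accel(const Particle *p, double (*acc)[3], int n, double G, double eps2)
{
    for (int i = 0; i < n; i++) {
        acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + eps2;   /* softened distance^2 */
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            acc[i][0] += G * p[j].mass * dx * inv_r3;
            acc[i][1] += G * p[j].mass * dy * inv_r3;
            acc[i][2] += G * p[j].mass * dz * inv_r3;
        }
    }
}
```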

Page 29:

Example HPC Application: Cosmological N-Body Simulation

Page 30:

Cosmological N-Body Simulation

PROBLEM:
• Everything in the Universe attracts everything else.
• The dataset is far too large to replicate in every PE’s memory.
• Difficult to parallelize.

Page 31:

Cosmological N-Body Simulation

PROBLEM:
• Everything in the Universe attracts everything else.
• The dataset is far too large to replicate in every PE’s memory.
• Difficult to parallelize.

Only 1 in 3000 memory fetches can result in an off-processor message being sent!
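The 1-in-3000 figure follows from the memory-hierarchy costs quoted on the next slide: a local memory fetch costs roughly 100 cycles while an off-processor fetch costs roughly 300,000 cycles, so

\[
\frac{300{,}000\ \text{cycles per off-processor fetch}}{100\ \text{cycles per local memory fetch}} \approx 3000 ,
\]

and the code can therefore afford at most about one off-processor message per 3000 local memory fetches before communication time dominates the computation.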

Page 32:

Characteristics of HPC Codes

Large dataset: data must be distributed across many compute nodes.

The MPP memory hierarchy (approximate access costs):
• Processor registers
• L1 cache: ~2 cycles
• L2 cache: ~10 cycles
• Main memory: ~100 cycles
• Off-processor memory: ~300,000 cycles!

[Figure: an N-Body cosmology simulation domain-decomposed across processors 0–8.]

Page 33:

Features: Advanced interprocessor data caching

• Application data is organized into cache lines.
• Read cache: requests for off-PE data result in the fetching of a “cache line”; the cache line is stored locally and used for future requests (sketched below).
• Write cache: updates to off-PE data are processed locally, then flushed to the remote thread when necessary.
• Fewer than 1 in 100,000 off-PE requests actually result in communication.
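A hypothetical sketch of the read-cache idea in C (not Gasoline's actual MDL code): a request for a remote element first checks a small local table of cached lines, and only a miss triggers real communication. `fetch_remote_line` is a stand-in for the message to the owning PE.

```c
/* Hypothetical software read cache for off-PE data. Zero-initialized static
 * storage means every slot starts out invalid. */
#define LINE_ELEMS  32          /* elements packed into one "cache line" */
#define CACHE_SLOTS 4096        /* direct-mapped local cache             */

typedef struct {
    int    valid;                       /* does this slot hold a line?   */
    long   line_id;                     /* which remote line it holds    */
    double data[LINE_ELEMS];            /* local copy of that line       */
} CacheSlot;

static CacheSlot cache[CACHE_SLOTS];

/* stand-in for the real message to the owning PE */
static void fetch_remote_line(int owner_pe, long line_id, double *buf)
{
    for (int k = 0; k < LINE_ELEMS; k++)
        buf[k] = (double)(line_id * LINE_ELEMS + k);   /* fake remote data */
    (void)owner_pe;
}

double read_remote(int owner_pe, long idx)
{
    long line_id = idx / LINE_ELEMS;
    CacheSlot *slot = &cache[line_id % CACHE_SLOTS];

    if (!slot->valid || slot->line_id != line_id) {    /* miss: communicate once */
        fetch_remote_line(owner_pe, line_id, slot->data);
        slot->line_id = line_id;
        slot->valid = 1;
    }
    return slot->data[idx % LINE_ELEMS];               /* hit: purely local */
}
```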

Page 34:

Features: Load balancing

• The amount of work each particle required for step t is tracked.
• This information is used to distribute work evenly among processors for step t+1.
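One way to realize this, sketched below under the assumption of a simple 1-D particle ordering (illustrative only, not Gasoline's actual scheme): accumulate the measured per-particle cost from step t and cut the ordering so that each processor receives roughly equal total cost for step t+1.

```c
/* Cost-weighted partitioning: given the work each particle cost in step t,
 * choose split points so every PE gets roughly the same total cost in
 * step t+1. Assumes particles are already in a 1-D order; `split` must
 * have npes+1 entries, and PE p owns particles [split[p], split[p+1]). */
void choose_splits(const double *cost, int n, int npes, int *split)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += cost[i];

    double target  = total / npes;     /* ideal cost per PE */
    double running = 0.0;
    int pe = 0;

    split[0] = 0;
    for (int i = 0; i < n && pe < npes - 1; i++) {
        running += cost[i];
        if (running >= target * (pe + 1))
            split[++pe] = i + 1;
    }
    while (pe < npes - 1)              /* degenerate case: fewer particles than PEs */
        split[++pe] = n;
    split[npes] = n;
}
```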

Page 35:

Performance

• 85% linearity on 512 PEs with pure MPI (Cray XT3).
• 92% linearity on 512 PEs with one-sided comms (Cray T3E Shmem).
• 92% linearity on 2048 PEs on Cray XT3 for the optimal problem size (>100,000 particles per processor).

Page 36:

Features: Portability

• Interprocessor communication occurs by high-level requests to a “Machine-Dependent Layer” (MDL).
• Only ~800 lines of code per architecture: the MDL is rewritten to take advantage of each parallel architecture (e.g. one-sided communication).
• MPI-1, POSIX Threads, SHMEM, Quadrics, & more.

[Figure: two parallel threads, each running GASOLINE on top of its own MDL, with communication between the MDL layers.]
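The portability layer can be pictured as a small, fixed interface that each architecture implements behind the scenes. The header below is a hypothetical illustration of that idea with invented names; it is not the real MDL interface, which is not reproduced in these slides.

```c
/* Hypothetical machine-dependent layer (MDL) interface. The application only
 * calls these high-level requests; each parallel architecture (MPI-1, POSIX
 * threads, SHMEM, Quadrics, ...) supplies its own ~800-line implementation. */
#ifndef MDL_SKETCH_H
#define MDL_SKETCH_H

#include <stddef.h>

typedef struct mdl_context *mdl_t;

int   mdl_init(mdl_t *pmdl, int *argc, char ***argv);  /* start the parallel environment */
void  mdl_finish(mdl_t mdl);                           /* shut it down                    */
int   mdl_self(mdl_t mdl);                             /* this PE's rank                  */
int   mdl_threads(mdl_t mdl);                          /* total number of PEs             */

/* Register local data so remote PEs can request it, then read remote
 * elements through the software read cache sketched earlier. */
void  mdl_cache_open(mdl_t mdl, void *data, size_t elem_size, int n_elems);
void *mdl_fetch(mdl_t mdl, int owner_pe, int index);
void  mdl_cache_close(mdl_t mdl);

#endif /* MDL_SKETCH_H */
```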

Page 37:

Applications

Galaxy formation (10 million particles)

Page 38:

Applications

Solar system planet formation (1 million particles)

Page 39:

Applications

Asteroid collisions (2000 particles)

Page 40:

Applications

Piles of sand (?!) (~1000 particles)

Page 41:

Summary

• N-Body simulations are difficult to parallelize: gravity says everything interacts with everything else.
• GASOLINE achieves high scalability by using several beneficial concepts:
  • Interprocessor data caching for both reads and writes
  • Maximal exploitation of any parallel architecture
  • Load balancing on a per-particle basis
• GASOLINE proved useful for a wide range of applications that simulate particle interactions.
• A flexible client-server architecture aids in porting to new science domains.

Page 42:

Part III: The Future

Turning a huge pile of processors into something that scientists can actually use.

Page 43:

How to turn simulation output into scientific knowledge

Using 300 processors (circa 1996):
Step 1: Run simulation
Step 2: Analyze simulation on workstation
Step 3: Extract meaningful scientific knowledge (happy scientist)

Page 44:

How to turn simulation output into scientific knowledge

Using 1000 processors (circa 2000):
Step 1: Run simulation
Step 2: Analyze simulation on server
Step 3: Extract meaningful scientific knowledge (happy scientist)

Page 45:

How to turn simulation output into scientific knowledge

Using 2000+ processors (circa 2005):
Step 1: Run simulation
Step 2: Analyze simulation on ??? (unhappy scientist)

Page 46:

How to turn simulation output into scientific knowledge

Using 100,000 processors? (circa 2012):
Step 1: Run simulation
Step 2: Analyze simulation on ???

The NSF has announced that it will be providing $200 million to build and operate a Petaflop machine by 2012.

Page 47:

Turning TeraFlops into Scientific Understanding

Problem: The size of simulations is no longer limited by the scalability of the simulation code, but by the scientists’ inability to process the resultant data.

Page 48:

Turning TeraFlops into Scientific Understanding

As MPPs increase in processor count, analysis tools must also run on MPPs!

PROBLEM:
1. Scientists usually write their own analysis programs.
2. Parallel programs are hard to write!

The HPC world is dominated by simulations:
• Code is often reused for many years by many people, so you can afford to spend lots of time writing it.
• Example: Gasoline required 10 FTE-years of development!

Page 49:

Turning TeraFlops into Scientific Understanding

Data analysis implies:
• Rapidly changing scientific inquiries
• Much less code reuse

Data analysis requires rapid algorithm development!

We need to rethink how we as scientists interact with our data!

Page 50:

A Solution(?): N tropy

Scientists tend to write their own code, so give them something that makes that easier for them.

Build a framework that is:
• Sophisticated enough to take care of all of the parallel bits for you
• Flexible enough to be used for a large variety of data analysis applications

Page 51:

N tropy: A framework for multiprocessor development

GOAL: Minimize development time for parallel applications.

GOAL: Enable scientists with no parallel programming background (or time to learn) to still implement their algorithms in parallel by writing only serial code.

GOAL: Provide seamless scalability from single processor machines to MPPs…potentially even several MPPs in a computational Grid.

GOAL: Do not restrict inquiry space.

Page 52:

Methodology

Limited data structures:
• Astronomy deals with point-like data in an N-dimensional parameter space.
• The most efficient methods on this kind of data use trees (a kd-tree build is sketched after this list).

Limited methods:
• Analysis methods perform a limited number of fundamental operations on these data structures.
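As a concrete illustration of the tree data structure, here is a bare-bones kd-tree build over point data: a C sketch assuming 3-D points and median splits. Real astronomy codes add bounding boxes, particle reordering, better split heuristics, and so on.

```c
/* Bare-bones kd-tree build: recursively split the point set at the median of
 * the current axis, cycling through x, y, z. Illustrative only. */
#include <stdlib.h>

typedef struct { double pos[3]; } Point;

typedef struct KdNode {
    int lo, hi;              /* this node covers points[lo..hi)  */
    int axis;                /* split dimension, -1 for a leaf   */
    double split;            /* split coordinate                 */
    struct KdNode *left, *right;
} KdNode;

static int cmp_axis;         /* axis used by the comparator below */

static int cmp_point(const void *a, const void *b)
{
    double da = ((const Point *)a)->pos[cmp_axis];
    double db = ((const Point *)b)->pos[cmp_axis];
    return (da > db) - (da < db);
}

KdNode *kd_build(Point *pts, int lo, int hi, int axis, int leaf_size)
{
    KdNode *node = malloc(sizeof *node);
    node->lo = lo; node->hi = hi;
    node->split = 0.0;
    node->left = node->right = NULL;

    if (hi - lo <= leaf_size) {            /* small enough: make a leaf */
        node->axis = -1;
        return node;
    }

    cmp_axis = axis;
    qsort(pts + lo, (size_t)(hi - lo), sizeof(Point), cmp_point);

    int mid = (lo + hi) / 2;               /* median split */
    node->axis  = axis;
    node->split = pts[mid].pos[axis];
    node->left  = kd_build(pts, lo, mid, (axis + 1) % 3, leaf_size);
    node->right = kd_build(pts, mid, hi, (axis + 1) % 3, leaf_size);
    return node;
}
```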

Page 53:

N tropy Design

GASOLINE already provides a number of advanced services. GASOLINE benefits to keep:
• Flexible client-server scheduling architecture: threads respond to service requests issued by the master; to do a new task, simply add a new service.
• Portability: interprocessor communication occurs by high-level requests to the “Machine-Dependent Layer” (MDL), which is rewritten to take advantage of each parallel architecture.
• Advanced interprocessor data caching: fewer than 1 in 100,000 off-PE requests actually result in communication.

Page 54:

N tropy Design

• Dynamic load balancing (available now): workload and processor domain boundaries can be dynamically reallocated as computation progresses.
• Data pre-fetching (to be implemented): predictively request off-PE data that will be needed for upcoming tree nodes.

Page 55:

N tropy Design

Computing across grid nodes is much more difficult than computing between nodes on a tightly-coupled parallel machine:
• Network latencies between grid resources are 1000 times higher than between nodes on a single parallel machine.
• Nodes on a far-away grid resource must be treated differently than the processor next door:
  • Data mirroring or aggressive prefetching
  • Sophisticated workload management and synchronization

Page 56:

N tropy Features

By using N tropy you will get a lot of features “for free”:
• Tree objects and methods: highly optimized and flexible.
• Automatic parallelization and scalability: you only write serial bits of code!
• Portability: interprocessor communication occurs by high-level requests to the “Machine-Dependent Layer” (MDL), which is rewritten to take advantage of each parallel architecture.
  • MPI, ccNUMA, Cray XT3, Quadrics Elan (PSC TCS), SGI Altix

Page 57:

N tropy Features

By using N tropy you will get a lot of features “for free”:
• Collectives: AllToAll, AllGather, AllReduce, etc.
• Automatic reduction variables: all of your routines can return scalars to be reduced across all processors (a plain-MPI sketch of this pattern follows this list).
• Timers: 4 automatic N tropy timers, 10 custom timers.
• Automatic communication and I/O statistics: quickly identify bottlenecks.
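The reduction-variable idea maps directly onto a standard MPI collective. The sketch below shows the underlying pattern in plain MPI; this is generic MPI, not N tropy's API, and `compute_local_part` is an invented stand-in for real per-PE work.

```c
/* Each processor computes a local scalar; the collective combines them into
 * one global result visible on every PE. */
#include <mpi.h>
#include <stdio.h>

static double compute_local_part(int rank)    /* stand-in for real work */
{
    return (double)(rank + 1);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum  = compute_local_part(rank);
    double global_sum = 0.0;

    /* Reduce every PE's scalar with a sum, result available on all PEs */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```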

Page 58:

Serial Performance

N tropy vs. an existing serial n-point correlation function calculator: N tropy is 6 to 30 times faster in serial!

Conclusions:
1. Not only does it take much less time to write an application using N tropy,
2. Your application may run faster than if you wrote it from scratch!

Page 59:

Performance

Spatial 3-point correlation, 10 million particles, 3→4 Mpc.

This problem is substantially harder than gravity!

3 FTE-months of development time!

Page 60:

N tropy “Meaningful” Benchmarks

The purpose of this framework is to minimize development time!

Development time for:
1. N-point correlation function calculator: 3 months
2. Friends-of-Friends group finder: 3 weeks
3. N-body gravity code: 1 day!*

* (OK, I cheated a bit and used existing serial N-body code fragments.)

Page 61:

N tropy Conceptual Schematic

[Schematic. Computational Steering Layer: C, C++, Python (Fortran?). Web Service Layer (at least from Python): WSDL? SOAP?, connecting to the VO. Framework (“black box”): domain decomposition/tree building, tree traversal, parallel I/O, collectives, dynamic workload management. User-supplied pieces: serial collective staging and processing routines, serial I/O routines, tree traversal routines, and the tree and particle data. Key: framework components / tree services / user supplied.]

Page 62:

Summary

Prehistoric times: FORTRAN is heralded as the first “high-level” language.
Ancient times: Scientists run on serial supercomputers. Scientists write many programs for them. Scientists are happy.
Early 1990s: MPPs are born. Scientists scratch their heads and figure out how to parallelize their algorithms.
Mid 1990s: Scientists start writing scalable code for MPPs. After much effort, scientists are kind of happy again.
Early 2000s: Scientists no longer run their simulations on the biggest MPPs because they cannot analyze the output. Scientists are seriously bummed.
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN.

Page 63:

Summary

N tropy is an attempt to allow scientists to rapidly develop their analysis codes for a multiprocessor environment.

Our results so far show that it is worthwhile to invest time developing individual frameworks that are:
1. Serially optimized
2. Scalable
3. Flexible enough to be customized to many different applications, even applications that you do not currently envision.

Is this a solution for the 100,000-processor world of tomorrow??

Page 64:

Pittsburgh Supercomputing Center

• Founded in 1986.
• A joint venture between Carnegie Mellon University, the University of Pittsburgh, and Westinghouse Electric Co.
• Funded by several federal agencies as well as private industries.
• Main source of support is the National Science Foundation, Office of Cyberinfrastructure.

Page 65:

Pittsburgh Supercomputing Center

• PSC is the third largest NSF-sponsored supercomputing center,
• BUT we provide over 60% of the computer time used by NSF research,
• AND PSC is the only academic supercomputing center in the U.S. to have had the most powerful supercomputer in the world (for unclassified research).

Page 66:

Pittsburgh Supercomputing Center

GOAL: To use cutting edge computer technology to do science that would not otherwise be possible

Page 67:

Conclusions

• Most data analysis in astronomy is done using trees as the fundamental data structure.
• Most operations on these tree structures are functionally identical.
• Based on our studies so far, it appears feasible to construct a general-purpose multiprocessor framework that users can rapidly customize to their needs.

Page 68:

Cosmological N-Body Simulation

Timings:
• Time required for 1 floating-point operation: 0.25 ns
• Time required for 1 memory fetch: ~10 ns (40 floats)
• Time required for 1 off-processor fetch: ~10 µs (40,000 floats)

Lesson: only 1 in 1000 memory fetches can result in network activity!
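Taking the off-processor fetch at roughly 10 µs, which is what the 1-in-1000 lesson implies against the ~10 ns memory fetch, the same back-of-the-envelope reasoning as before gives

\[
\frac{10\,\mu\text{s per off-processor fetch}}{10\,\text{ns per local memory fetch}} = 1000 ,
\]

so at most about one memory fetch in 1000 can be allowed to trigger network activity.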

Page 69:

The very first “Super Computer”

1929: New York World newspaper coins the term “super computer” when talking about a giant tabulator custom-built by IBM for Columbia University

Page 70:

Review: Hallmarks of Computing

1956: FORTRAN heralded as the world’s first “high-level” language
1966: Seymour Cray develops the CDC 6600, the first “supercomputer”
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: The Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: The Cray-2 marks the end of the Golden Age of supercomputing
1989: Seymour Cray leaves CRI and founds Cray Computer Corp. (CCC)
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1995: Cray Computer Corporation (CCC) goes bankrupt
1996: Cray Research Inc. acquired by SGI
1998: Google Inc. is founded
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN

Page 71:

The T3D MPP

• 1024 DEC Alpha processors (COTS)
• 128 MB of RAM per processor (COTS)
• Cray custom-built network fabric ($$$)

A 1024-processor Cray T3D in 1994

Page 72:

General characteristics of MPPs

• COTS processors
• COTS memory subsystem
• Linux-based kernel
• Custom networking

Custom networking in MPPs has replaced the custom memory systems of vector machines. Why??

The 2068-processor Cray XT3 at PSC in 2006

Page 73:

Example Science Application: Weather Prediction

Looking for tornadoes (credits: PSC, Center for Analysis and Prediction of Storms)

Page 74:

Reasons for being sensitive to communication latency

• A given processor (PE) may “touch” a very large subsample of the total dataset. Example: a self-gravitating system.
• PEs must exchange information many times during a single transaction. Example: along the domain boundaries of a fluid calculation.

Page 75:

Features: Flexible client-server scheduling architecture

• Threads respond to service requests issued by the master.
• To do a new task, simply add a new service.
• Computational steering involves trivial serial programming.
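A toy sketch of that dispatch pattern in C (hypothetical, not Gasoline's actual code): workers loop on requests from the master and switch on a service ID, so adding a task means adding one more case. `wait_for_request` is a stand-in for the real master-to-worker message.

```c
/* Worker threads wait for service requests from the master and dispatch to
 * the matching handler; a new task just needs a new service ID and case. */
#include <stdio.h>

enum Service { SRV_GRAVITY, SRV_DENSITY, SRV_STOP };

static void do_gravity(void) { puts("computing gravity"); }
static void do_density(void) { puts("computing density"); }

/* stand-in for blocking on a request from the master */
static enum Service wait_for_request(void)
{
    static const enum Service script[] = { SRV_GRAVITY, SRV_DENSITY, SRV_STOP };
    static int n = 0;
    return script[n < 2 ? n++ : 2];
}

static void worker_loop(void)
{
    for (;;) {
        switch (wait_for_request()) {
        case SRV_GRAVITY: do_gravity(); break;
        case SRV_DENSITY: do_density(); break;
        case SRV_STOP:    return;        /* master tells workers to stop */
        }
    }
}

int main(void)
{
    worker_loop();
    return 0;
}
```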

Page 76:

Design: Gasoline Functional Layout

• Computational Steering Layer (serial): executes on the master processor only.
• Parallel Management Layer (parallel): coordinates execution and data distribution among processors.
• Serial Layer (Gravity Calculator, Hydro Calculator): executes “independently” on all processors.
• Machine-Dependent Layer (MDL, parallel): interprocessor communication.

Page 77:

Cosmological N-Body Simulation

SCIENCE: Simulate how structure in the Universe forms from initial linear density fluctuations:
1. Linear fluctuations in the early Universe are supplied by cosmological theory.
2. Calculate the non-linear final states of these fluctuations.
3. See if these look anything like the real Universe.
4. No? Go to step 1.