Science on Supercomputers: Pushing the (back of) the envelope
Jeffrey P. Gardner
Pittsburgh Supercomputing Center, Carnegie Mellon University, University of Pittsburgh

Transcript
Page 1:

Science on Supercomputers:

Pushing the (back of) the envelope

Jeffrey P. Gardner

Pittsburgh Supercomputing Center
Carnegie Mellon University

University of Pittsburgh

Page 2:

Outline

• History (the past)
  • Characteristics of scientific codes
  • Scientific computing, supercomputers, and the Good Old Days
• Reality (the present)
  • Is there anything “super” about computers anymore?
  • Why “network” means more net work on your part.
• Fantasy (the future)
  • Strategies for turning a huge pile of processors into something scientists can actually use.

Page 3:

A (very brief) Introduction to Scientific Computing

Page 4:

Properties of “interesting” scientific datasets

A very large dataset where the calculation is “tightly coupled”.

Page 5:

Example Science Application: Cosmology

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM
• 100 million light years

To resolve the gravitational force on any single particle requires the entire dataset: “read-only” coupling.

Page 6:

Example Science Application: Cosmology

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM
• 100 million light years

To resolve the hydrodynamic forces requires information exchange between particles: “read-write” coupling.

Page 7:

Scientific Computing

Transaction Processing¹: A transaction is an information processing operation that cannot be subdivided into smaller operations. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state.²

Functional definition: A transaction is any computational task:
1. That cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.
2. Where any further subdivisions cannot be written in such a way that they are independent of one another.

¹ Term borrowed (and generalized, with apologies) from database management.
² From Wikipedia.

Page 8:

Scientific Computing

Functional definition: A transaction is any computational task:
1. That cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM

To resolve the gravitational force on any single particle requires the entire dataset: “read-only” coupling.

Page 9:

Scientific Computing

Functional definition: A transaction is any computational task:
2. Where any further subdivisions cannot be written in such a way that they are independent of one another.

Cosmological “N-Body” simulation:
• 100,000,000 particles
• 1 TB of RAM

To resolve the hydrodynamic forces requires information exchange between particles: “read-write” coupling.

Page 10:

Scientific Computing

In most business and web applications:
• A single CPU usually processes many transactions per second.
• Transaction sizes are typically small.

Page 11:

Scientific Computing

In many science applications:
• A single transaction can take CPU-hours, days, or years.
• Transaction sizes can be extremely large.

Page 12:

What Made Computers “Super”?

Since the transaction must be memory-resident in order not to be I/O-bound, the next bottleneck is memory.

The original Supercomputers differed from “ordinary” computers in their memory bandwidth and latency characteristics.

Page 13:

The “Golden Age” of Supercomputing

1976-1982: The Cray-1 is the most powerful computer in the world

The Cray-1 is a vector platform: i.e., it performs the same operation on many contiguous memory elements in one clock tick.

Its memory subsystem was optimized to feed data to the processor at its maximum flop rate.
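To make the vector idea concrete, here is a minimal sketch (in C, not material from the talk itself) of the kind of loop a vector machine executes well: one operation applied uniformly to contiguous, independent array elements.

```c
/* A minimal sketch of a "vectorizable" loop: the same operation is applied
 * to contiguous, independent array elements, so a vector unit (or a modern
 * SIMD unit) can process many elements per instruction. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* y <- a*x + y, element by element */
}
```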

Page 14:

The “Golden Age” of Supercomputing

1985-1989: The Cray-2 is the most powerful computer in the world

The Cray-2 is also a vector platform

Page 15:

Scientists Liked Supercomputers. They were simple to program!

1. They were serial machines.
2. “Caches? We don’t need no stinkin’ caches!”
  • Scalar machines had no memory latency. This is as close as you get to an ideal computer.
  • Vector machines offered substantial performance increases over scalar machines if you could “vectorize” your code.

Page 16:

“Triumph” of the Masses

In the 1990s, commercial off-the-shelf (COTS) technology became so cheap, it was no longer cost-effective to produce fully-custom hardware

Page 17:

“Triumph” of the Masses

Instead of producing faster processors with faster memory, supercomputer companies built machines with lots of processors in them.

A single-processor Cray-2. A 1024-processor Cray (CRI) T3D.

Page 18:

“Triumph” of the Masses

These were known as massively parallel platforms, or MPPs.

A single-processor Cray-2. A 1024-processor Cray T3D.

Page 19:

“Triumph” of the Masses(?)

A single-processor Cray-2: the world’s fastest computer in 1989.

A 1024-processor Cray T3D: the world’s fastest computer in 1994 (almost).

Page 20:

Part II: The Present

Why “network” means more net work on your part

Page 21:

The “Social Impact” of MPPs

The transition from serial supercomputers to MPPs actually resulted in far fewer scientists using supercomputers. MPPs are really hard to program!

Developing scientific applications for MPPs became an area of study in its own right: High Performance Computing (HPC)

Page 22:

Characteristics of HPC Codes

Large dataset: data must be distributed across many compute nodes.

The CPU memory hierarchy vs. the MPP memory hierarchy (approximate access costs):
• Processor registers
• L1 cache: ~2 cycles
• L2 cache: ~10 cycles
• Main memory: ~100 cycles
• Off-processor memory: ~300,000 cycles!

[Figure: an N-Body cosmology simulation domain-decomposed across processors 0–8.]

Page 23:

What makes computers “super” anymore?

Cray T3D in 1994: Cray-built interconnect fabric.

PSC “Terascale Compute System” (TCS) in 2000: custom interconnect fabric by Quadrics.

PSC Cray XT3 in 2006: Cray-built interconnect fabric.

Page 24:

What makes computers “super” anymore?

I would propose the following definition:

A “supercomputer” differs from “a pile of workstations” in that a supercomputer is optimized to spread a single large transaction across many, many processors.

In practice, this means that the network interconnect fabric is identified as the principal bottleneck.

Page 25:

What makes computers “super” anymore?

Google’s 30-acre campus in The Dalles, Oregon

Page 26:

Review: Hallmarks of Computing

1956: FORTRAN heralded as the world’s first “high-level” language
1966: Seymour Cray develops the CDC 6600, the first “supercomputer”
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: The Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: The Cray-2 marks the end of the Golden Age of supercomputing
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1998: Google Inc. is founded
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN

Page 27:

Review: HPC

High-Performance Computing (HPC) refers to a type of computation whereby a single, large transaction is spread across 100s to 1000s of processors.

In general, this kind of computation is sensitive to network bandwidth and latency.

Therefore, most modern-day “supercomputers” seek to maximize interconnect bandwidth and minimize interconnect latency within economic limits.

Page 28:

Gasoline: N-Body Treecode (Order N log N)
• The naïve algorithm is Order N².
• Began development in 1994… and continues to this day.

[Figure: a kd-tree (a subset of a Binary Space Partitioning tree), partitioned among PEs.]
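For reference, the naïve Order-N² approach that the treecode improves upon is plain direct summation. The following is an illustrative C sketch with a softening parameter; it is not Gasoline source code.

```c
/* Direct-summation gravity: every particle interacts with every other,
 * giving O(N^2) force evaluations. This is the naive algorithm a treecode
 * such as Gasoline replaces with an O(N log N) approximation.
 * Illustrative sketch only; units and integration are omitted. */
#include <math.h>

typedef struct { double x, y, z, mass; } Particle;

void compute_accel(const Particle *p, double (*acc)[3], int n, double G, double eps2)
{
    for (int i = 0; i < n; i++) {
        acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + eps2;   /* softened distance^2 */
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            acc[i][0] += G * p[j].mass * dx * inv_r3;
            acc[i][1] += G * p[j].mass * dy * inv_r3;
            acc[i][2] += G * p[j].mass * dz * inv_r3;
        }
    }
}
```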

Page 29:

Example HPC Application: Cosmological N-Body Simulation

Page 30:

Cosmological N-Body Simulation

PROBLEM:
• Everything in the Universe attracts everything else.
• The dataset is far too large to replicate in every PE’s memory.
• Difficult to parallelize.

Page 31:

Cosmological N-Body Simulation

PROBLEM:
• Everything in the Universe attracts everything else.
• The dataset is far too large to replicate in every PE’s memory.
• Difficult to parallelize.

Only 1 in 3000 memory fetches can result in an off-processor message being sent!
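The 1-in-3000 figure follows from the memory-hierarchy costs quoted on the next slide: a local memory fetch costs roughly 100 cycles while an off-processor fetch costs roughly 300,000 cycles, so

\[
\frac{300{,}000\ \text{cycles per off-processor fetch}}{100\ \text{cycles per local memory fetch}} \approx 3000 ,
\]

and the code can therefore afford at most about one off-processor message per 3000 local memory fetches before communication time dominates the computation.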

Page 32:

Characteristics of HPC Codes

Large dataset: data must be distributed across many compute nodes.

The MPP memory hierarchy (approximate access costs):
• Processor registers
• L1 cache: ~2 cycles
• L2 cache: ~10 cycles
• Main memory: ~100 cycles
• Off-processor memory: ~300,000 cycles!

[Figure: an N-Body cosmology simulation domain-decomposed across processors 0–8.]

Page 33:

Features: Advanced interprocessor data caching

• Application data is organized into cache lines.
• Read cache: requests for off-PE data result in the fetching of a “cache line”; the cache line is stored locally and used for future requests (sketched below).
• Write cache: updates to off-PE data are processed locally, then flushed to the remote thread when necessary.
• Fewer than 1 in 100,000 off-PE requests actually result in communication.
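A hypothetical sketch of the read-cache idea in C (not Gasoline's actual MDL code): a request for a remote element first checks a small local table of cached lines, and only a miss triggers real communication. `fetch_remote_line` is a stand-in for the message to the owning PE.

```c
/* Hypothetical software read cache for off-PE data. Zero-initialized static
 * storage means every slot starts out invalid. */
#define LINE_ELEMS  32          /* elements packed into one "cache line" */
#define CACHE_SLOTS 4096        /* direct-mapped local cache             */

typedef struct {
    int    valid;                       /* does this slot hold a line?   */
    long   line_id;                     /* which remote line it holds    */
    double data[LINE_ELEMS];            /* local copy of that line       */
} CacheSlot;

static CacheSlot cache[CACHE_SLOTS];

/* stand-in for the real message to the owning PE */
static void fetch_remote_line(int owner_pe, long line_id, double *buf)
{
    for (int k = 0; k < LINE_ELEMS; k++)
        buf[k] = (double)(line_id * LINE_ELEMS + k);   /* fake remote data */
    (void)owner_pe;
}

double read_remote(int owner_pe, long idx)
{
    long line_id = idx / LINE_ELEMS;
    CacheSlot *slot = &cache[line_id % CACHE_SLOTS];

    if (!slot->valid || slot->line_id != line_id) {    /* miss: communicate once */
        fetch_remote_line(owner_pe, line_id, slot->data);
        slot->line_id = line_id;
        slot->valid = 1;
    }
    return slot->data[idx % LINE_ELEMS];               /* hit: purely local */
}
```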

Page 34:

Features: Load balancing

• The amount of work each particle required for step t is tracked.
• This information is used to distribute work evenly among processors for step t+1.
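One way to realize this, sketched below under the assumption of a simple 1-D particle ordering (illustrative only, not Gasoline's actual scheme): accumulate the measured per-particle cost from step t and cut the ordering so that each processor receives roughly equal total cost for step t+1.

```c
/* Cost-weighted partitioning: given the work each particle cost in step t,
 * choose split points so every PE gets roughly the same total cost in
 * step t+1. Assumes particles are already in a 1-D order; `split` must
 * have npes+1 entries, and PE p owns particles [split[p], split[p+1]). */
void choose_splits(const double *cost, int n, int npes, int *split)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += cost[i];

    double target  = total / npes;     /* ideal cost per PE */
    double running = 0.0;
    int pe = 0;

    split[0] = 0;
    for (int i = 0; i < n && pe < npes - 1; i++) {
        running += cost[i];
        if (running >= target * (pe + 1))
            split[++pe] = i + 1;
    }
    while (pe < npes - 1)              /* degenerate case: fewer particles than PEs */
        split[++pe] = n;
    split[npes] = n;
}
```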

Page 35:

Performance

• 85% linearity on 512 PEs with pure MPI (Cray XT3).
• 92% linearity on 512 PEs with one-sided comms (Cray T3E Shmem).
• 92% linearity on 2048 PEs on Cray XT3 for the optimal problem size (>100,000 particles per processor).

Page 36:

Features: Portability

• Interprocessor communication occurs by high-level requests to a “Machine-Dependent Layer” (MDL).
• Only ~800 lines of code per architecture: the MDL is rewritten to take advantage of each parallel architecture (e.g. one-sided communication).
• MPI-1, POSIX Threads, SHMEM, Quadrics, & more.

[Figure: two parallel threads, each running GASOLINE on top of its own MDL, with communication between the MDL layers.]
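The portability layer can be pictured as a small, fixed interface that each architecture implements behind the scenes. The header below is a hypothetical illustration of that idea with invented names; it is not the real MDL interface, which is not reproduced in these slides.

```c
/* Hypothetical machine-dependent layer (MDL) interface. The application only
 * calls these high-level requests; each parallel architecture (MPI-1, POSIX
 * threads, SHMEM, Quadrics, ...) supplies its own ~800-line implementation. */
#ifndef MDL_SKETCH_H
#define MDL_SKETCH_H

#include <stddef.h>

typedef struct mdl_context *mdl_t;

int   mdl_init(mdl_t *pmdl, int *argc, char ***argv);  /* start the parallel environment */
void  mdl_finish(mdl_t mdl);                           /* shut it down                    */
int   mdl_self(mdl_t mdl);                             /* this PE's rank                  */
int   mdl_threads(mdl_t mdl);                          /* total number of PEs             */

/* Register local data so remote PEs can request it, then read remote
 * elements through the software read cache sketched earlier. */
void  mdl_cache_open(mdl_t mdl, void *data, size_t elem_size, int n_elems);
void *mdl_fetch(mdl_t mdl, int owner_pe, int index);
void  mdl_cache_close(mdl_t mdl);

#endif /* MDL_SKETCH_H */
```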

Page 37:

Applications

Galaxy formation (10 million particles)

Page 38:

Applications

Solar system planet formation (1 million particles)

Page 39:

Applications

Asteroid collisions (2000 particles)

Page 40:

Applications

Piles of sand (?!) (~1000 particles)

Page 41:

Summary

• N-Body simulations are difficult to parallelize: gravity says everything interacts with everything else.
• GASOLINE achieves high scalability by using several beneficial concepts:
  • Interprocessor data caching for both reads and writes
  • Maximal exploitation of any parallel architecture
  • Load balancing on a per-particle basis
• GASOLINE proved useful for a wide range of applications that simulate particle interactions.
• A flexible client-server architecture aids in porting to new science domains.

Page 42:

Part III: The Future

Turning a huge pile of processors into something that scientists can actually use.

Page 43:

How to turn simulation output into scientific knowledge

Using 300 processors (circa 1996):
Step 1: Run simulation
Step 2: Analyze simulation on workstation
Step 3: Extract meaningful scientific knowledge (happy scientist)

Page 44:

How to turn simulation output into scientific knowledge

Using 1000 processors (circa 2000):
Step 1: Run simulation
Step 2: Analyze simulation on server
Step 3: Extract meaningful scientific knowledge (happy scientist)

Page 45:

How to turn simulation output into scientific knowledge

Using 2000+ processors (circa 2005):
Step 1: Run simulation
Step 2: Analyze simulation on ??? (unhappy scientist)

Page 46:

How to turn simulation output into scientific knowledge

Using 100,000 processors? (circa 2012):
Step 1: Run simulation
Step 2: Analyze simulation on ???

The NSF has announced that it will be providing $200 million to build and operate a Petaflop machine by 2012.

Page 47:

Turning TeraFlops into Scientific Understanding

Problem: The size of simulations is no longer limited by the scalability of the simulation code, but by the scientists’ inability to process the resultant data.

Page 48:

Turning TeraFlops into Scientific Understanding

As MPPs increase in processor count, analysis tools must also run on MPPs!

PROBLEM:
1. Scientists usually write their own analysis programs.
2. Parallel programs are hard to write!

The HPC world is dominated by simulations:
• Code is often reused for many years by many people, so you can afford to spend lots of time writing it.
• Example: Gasoline required 10 FTE-years of development!

Page 49:

Turning TeraFlops into Scientific Understanding

Data analysis implies:
• Rapidly changing scientific inquiries
• Much less code reuse

Data analysis requires rapid algorithm development!

We need to rethink how we as scientists interact with our data!

Page 50:

A Solution(?): N tropy

Scientists tend to write their own code, so give them something that makes that easier for them.

Build a framework that is:
• Sophisticated enough to take care of all of the parallel bits for you
• Flexible enough to be used for a large variety of data analysis applications

Page 51:

N tropy: A framework for multiprocessor development

GOAL: Minimize development time for parallel applications.

GOAL: Enable scientists with no parallel programming background (or time to learn) to still implement their algorithms in parallel by writing only serial code.

GOAL: Provide seamless scalability from single processor machines to MPPs…potentially even several MPPs in a computational Grid.

GOAL: Do not restrict inquiry space.

Page 52:

Methodology

Limited data structures:
• Astronomy deals with point-like data in an N-dimensional parameter space.
• The most efficient methods on this kind of data use trees (a kd-tree build is sketched after this list).

Limited methods:
• Analysis methods perform a limited number of fundamental operations on these data structures.
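As a concrete illustration of the tree data structure, here is a bare-bones kd-tree build over point data: a C sketch assuming 3-D points and median splits. Real astronomy codes add bounding boxes, particle reordering, better split heuristics, and so on.

```c
/* Bare-bones kd-tree build: recursively split the point set at the median of
 * the current axis, cycling through x, y, z. Illustrative only. */
#include <stdlib.h>

typedef struct { double pos[3]; } Point;

typedef struct KdNode {
    int lo, hi;              /* this node covers points[lo..hi)  */
    int axis;                /* split dimension, -1 for a leaf   */
    double split;            /* split coordinate                 */
    struct KdNode *left, *right;
} KdNode;

static int cmp_axis;         /* axis used by the comparator below */

static int cmp_point(const void *a, const void *b)
{
    double da = ((const Point *)a)->pos[cmp_axis];
    double db = ((const Point *)b)->pos[cmp_axis];
    return (da > db) - (da < db);
}

KdNode *kd_build(Point *pts, int lo, int hi, int axis, int leaf_size)
{
    KdNode *node = malloc(sizeof *node);
    node->lo = lo; node->hi = hi;
    node->split = 0.0;
    node->left = node->right = NULL;

    if (hi - lo <= leaf_size) {            /* small enough: make a leaf */
        node->axis = -1;
        return node;
    }

    cmp_axis = axis;
    qsort(pts + lo, (size_t)(hi - lo), sizeof(Point), cmp_point);

    int mid = (lo + hi) / 2;               /* median split */
    node->axis  = axis;
    node->split = pts[mid].pos[axis];
    node->left  = kd_build(pts, lo, mid, (axis + 1) % 3, leaf_size);
    node->right = kd_build(pts, mid, hi, (axis + 1) % 3, leaf_size);
    return node;
}
```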

Page 53:

N tropy Design

GASOLINE already provides a number of advanced services. GASOLINE benefits to keep:
• Flexible client-server scheduling architecture: threads respond to service requests issued by the master; to do a new task, simply add a new service.
• Portability: interprocessor communication occurs by high-level requests to the “Machine-Dependent Layer” (MDL), which is rewritten to take advantage of each parallel architecture.
• Advanced interprocessor data caching: fewer than 1 in 100,000 off-PE requests actually result in communication.

Page 54:

N tropy Design

• Dynamic load balancing (available now): workload and processor domain boundaries can be dynamically reallocated as computation progresses.
• Data pre-fetching (to be implemented): predictively request off-PE data that will be needed for upcoming tree nodes.

Page 55:

N tropy Design

Computing across grid nodes is much more difficult than computing between nodes on a tightly-coupled parallel machine:
• Network latencies between grid resources are 1000 times higher than between nodes on a single parallel machine.
• Nodes on a far-away grid resource must be treated differently than the processor next door:
  • Data mirroring or aggressive prefetching
  • Sophisticated workload management and synchronization

Page 56:

N tropy Features

By using N tropy you will get a lot of features “for free”:
• Tree objects and methods: highly optimized and flexible.
• Automatic parallelization and scalability: you only write serial bits of code!
• Portability: interprocessor communication occurs by high-level requests to the “Machine-Dependent Layer” (MDL), which is rewritten to take advantage of each parallel architecture.
  • MPI, ccNUMA, Cray XT3, Quadrics Elan (PSC TCS), SGI Altix

Page 57:

N tropy Features

By using N tropy you will get a lot of features “for free”:
• Collectives: AllToAll, AllGather, AllReduce, etc.
• Automatic reduction variables: all of your routines can return scalars to be reduced across all processors (a plain-MPI sketch of this pattern follows this list).
• Timers: 4 automatic N tropy timers, 10 custom timers.
• Automatic communication and I/O statistics: quickly identify bottlenecks.
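The reduction-variable idea maps directly onto a standard MPI collective. The sketch below shows the underlying pattern in plain MPI; this is generic MPI, not N tropy's API, and `compute_local_part` is an invented stand-in for real per-PE work.

```c
/* Each processor computes a local scalar; the collective combines them into
 * one global result visible on every PE. */
#include <mpi.h>
#include <stdio.h>

static double compute_local_part(int rank)    /* stand-in for real work */
{
    return (double)(rank + 1);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum  = compute_local_part(rank);
    double global_sum = 0.0;

    /* Reduce every PE's scalar with a sum, result available on all PEs */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```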

Page 58:

Serial Performance

N tropy vs. an existing serial n-point correlation function calculator: N tropy is 6 to 30 times faster in serial!

Conclusions:
1. Not only does it take much less time to write an application using N tropy,
2. Your application may run faster than if you wrote it from scratch!

Page 59:

Performance

Spatial 3-point correlation, 10 million particles, 3→4 Mpc.

This problem is substantially harder than gravity!

3 FTE-months of development time!

Page 60:

N tropy “Meaningful” Benchmarks

The purpose of this framework is to minimize development time!

Development time for:
1. N-point correlation function calculator: 3 months
2. Friends-of-Friends group finder: 3 weeks
3. N-body gravity code: 1 day!*

* (OK, I cheated a bit and used existing serial N-body code fragments.)

Page 61:

N tropy Conceptual Schematic

[Schematic. Computational Steering Layer: C, C++, Python (Fortran?). Web Service Layer (at least from Python): WSDL? SOAP?, connecting to the VO. Framework (“black box”): domain decomposition/tree building, tree traversal, parallel I/O, collectives, dynamic workload management. User-supplied pieces: serial collective staging and processing routines, serial I/O routines, tree traversal routines, and the tree and particle data. Key: framework components / tree services / user supplied.]

Page 62:

Summary

Prehistoric times: FORTRAN is heralded as the first “high-level” language.
Ancient times: Scientists run on serial supercomputers. Scientists write many programs for them. Scientists are happy.
Early 1990s: MPPs are born. Scientists scratch their heads and figure out how to parallelize their algorithms.
Mid 1990s: Scientists start writing scalable code for MPPs. After much effort, scientists are kind of happy again.
Early 2000s: Scientists no longer run their simulations on the biggest MPPs because they cannot analyze the output. Scientists are seriously bummed.
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN.

Page 63:

Summary

N tropy is an attempt to allow scientists to rapidly develop their analysis codes for a multiprocessor environment.

Our results so far show that it is worthwhile to invest time developing individual frameworks that are:
1. Serially optimized
2. Scalable
3. Flexible enough to be customized to many different applications, even applications that you do not currently envision.

Is this a solution for the 100,000-processor world of tomorrow??

Page 64:

Pittsburgh Supercomputing Center

• Founded in 1986.
• A joint venture between Carnegie Mellon University, the University of Pittsburgh, and Westinghouse Electric Co.
• Funded by several federal agencies as well as private industries.
• Main source of support is the National Science Foundation, Office of Cyberinfrastructure.

Page 65:

Pittsburgh Supercomputing Center

• PSC is the third largest NSF-sponsored supercomputing center,
• BUT we provide over 60% of the computer time used by NSF research,
• AND PSC is the only academic supercomputing center in the U.S. to have had the most powerful supercomputer in the world (for unclassified research).

Page 66:

Pittsburgh Supercomputing Center

GOAL: To use cutting edge computer technology to do science that would not otherwise be possible

Page 67:

Conclusions

• Most data analysis in astronomy is done using trees as the fundamental data structure.
• Most operations on these tree structures are functionally identical.
• Based on our studies so far, it appears feasible to construct a general-purpose multiprocessor framework that users can rapidly customize to their needs.

Page 68:

Cosmological N-Body Simulation

Timings:
• Time required for 1 floating-point operation: 0.25 ns
• Time required for 1 memory fetch: ~10 ns (40 floats)
• Time required for 1 off-processor fetch: ~10 µs (40,000 floats)

Lesson: only 1 in 1000 memory fetches can result in network activity!
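Taking the off-processor fetch at roughly 10 µs, which is what the 1-in-1000 lesson implies against the ~10 ns memory fetch, the same back-of-the-envelope reasoning as before gives

\[
\frac{10\,\mu\text{s per off-processor fetch}}{10\,\text{ns per local memory fetch}} = 1000 ,
\]

so at most about one memory fetch in 1000 can be allowed to trigger network activity.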

Page 69:

The very first “Super Computer”

1929: New York World newspaper coins the term “super computer” when talking about a giant tabulator custom-built by IBM for Columbia University

Page 70:

Review: Hallmarks of Computing

1956: FORTRAN heralded as the world’s first “high-level” language
1966: Seymour Cray develops the CDC 6600, the first “supercomputer”
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: The Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: The Cray-2 marks the end of the Golden Age of supercomputing
1989: Seymour Cray leaves CRI and founds Cray Computer Corp. (CCC)
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1995: Cray Computer Corporation (CCC) goes bankrupt
1996: Cray Research Inc. acquired by SGI
1998: Google Inc. is founded
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN

Page 71:

The T3D MPP

• 1024 DEC Alpha processors (COTS)
• 128 MB of RAM per processor (COTS)
• Cray custom-built network fabric ($$$)

A 1024-processor Cray T3D in 1994

Page 72:

General characteristics of MPPs

• COTS processors
• COTS memory subsystem
• Linux-based kernel
• Custom networking

Custom networking in MPPs has replaced the custom memory systems of vector machines. Why??

The 2068-processor Cray XT3 at PSC in 2006

Page 73:

Example Science Application: Weather Prediction

Looking for tornadoes (credits: PSC, Center for Analysis and Prediction of Storms)

Page 74:

Reasons for being sensitive to communication latency

• A given processor (PE) may “touch” a very large subsample of the total dataset. Example: a self-gravitating system.
• PEs must exchange information many times during a single transaction. Example: along the domain boundaries of a fluid calculation.

Page 75:

Features: Flexible client-server scheduling architecture

• Threads respond to service requests issued by the master.
• To do a new task, simply add a new service.
• Computational steering involves trivial serial programming.
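A toy sketch of that dispatch pattern in C (hypothetical, not Gasoline's actual code): workers loop on requests from the master and switch on a service ID, so adding a task means adding one more case. `wait_for_request` is a stand-in for the real master-to-worker message.

```c
/* Worker threads wait for service requests from the master and dispatch to
 * the matching handler; a new task just needs a new service ID and case. */
#include <stdio.h>

enum Service { SRV_GRAVITY, SRV_DENSITY, SRV_STOP };

static void do_gravity(void) { puts("computing gravity"); }
static void do_density(void) { puts("computing density"); }

/* stand-in for blocking on a request from the master */
static enum Service wait_for_request(void)
{
    static const enum Service script[] = { SRV_GRAVITY, SRV_DENSITY, SRV_STOP };
    static int n = 0;
    return script[n < 2 ? n++ : 2];
}

static void worker_loop(void)
{
    for (;;) {
        switch (wait_for_request()) {
        case SRV_GRAVITY: do_gravity(); break;
        case SRV_DENSITY: do_density(); break;
        case SRV_STOP:    return;        /* master tells workers to stop */
        }
    }
}

int main(void)
{
    worker_loop();
    return 0;
}
```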

Page 76:

Design: Gasoline Functional Layout

• Computational Steering Layer (serial): executes on the master processor only.
• Parallel Management Layer (parallel): coordinates execution and data distribution among processors.
• Serial Layer (Gravity Calculator, Hydro Calculator): executes “independently” on all processors.
• Machine-Dependent Layer (MDL, parallel): interprocessor communication.

Page 77:

Cosmological N-Body Simulation

SCIENCE: Simulate how structure in the Universe forms from initial linear density fluctuations:
1. Linear fluctuations in the early Universe are supplied by cosmological theory.
2. Calculate the non-linear final states of these fluctuations.
3. See if these look anything like the real Universe.
4. No? Go to step 1.