Page 1:

Exascale: Why It is Different

Barbara Chapman, University of Houston

High Performance Computing and Tools Group, http://www.cs.uh.edu/~hpctools

P2S2 Workshop at ICPP, Taipei, Taiwan, September 13, 2011

Page 2:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 3:

Top 10 Supercomputers (June 2011)

Page 4:

Petascale is a Global Reality

• K computer: 68,544 SPARC64 VIIIfx processors, Tofu interconnect, Linux-based enhanced OS; produced by Fujitsu
• Tianhe-1A: 7,168 Fermi GPUs and 14,336 CPUs; delivering the same performance with CPUs alone would require more than 50,000 CPUs and twice as much floor space
• Jaguar: 224,256 x86-based AMD Opteron processor cores; each compute node features two 12-core Opterons and 16 GB of shared memory
• Nebulae: 4,640 Nvidia Tesla GPUs and 9,280 Intel X5650-based CPUs
• Tsubame: 4,200 GPUs

Page 5:

Exascale Systems: The Race is On

• Town Hall Meetings, April-June 2007
• Scientific Grand Challenges Workshops, November 2008 - October 2009
  – Climate Science (11/08), High Energy Physics (12/08), Nuclear Physics (1/09), Fusion Energy (3/09), Nuclear Energy (5/09, with NE), Biology (8/09), Material Science and Chemistry (8/09), National Security (10/09, with NNSA)
• Cross-cutting workshops: Architecture and Technology (12/09); Architecture, Applied Mathematics and Computer Science (2/10)
• Meetings with industry (8/09, 11/09)
• External panels: ASCAC Exascale Charge, Trivelpiece Panel

Exascale peak performance is 10^18 floating point operations per second.

Page 6:

DOE Exascale Exploration: Findings

• Exascale is essential in important areas: climate, combustion, nuclear reactors, fusion, stockpile stewardship, materials, astrophysics, electricity production, alternative fuels, fuel efficiency, …
• Systems need to be both usable and affordable
  – Must deliver an exascale level of performance to applications
  – Power budget cannot exceed today's petascale power consumption by more than a factor of 3
  – Applications need to exhibit strong scaling, or weak scaling that does not increase memory requirements
• Need to understand the R&D implications (HW and SW): what is the roadmap?

Page 7:

IESP: International Exascale Software Project

• International effort to specify a research agenda that will lead to exascale capabilities
  – Academia, labs, agencies, industry
• Requires open international collaboration and a significant contribution of open-source software
• Revolution vs. evolution?
• Produced a detailed roadmap
• Focused meetings to determine R&D needs

Page 8:

IESP: Exascale Systems

• Given budget constraints, current predictions focus on two alternative designs:
  – Huge number of lightweight processors, e.g. 1 million chips, 1,000 cores/chip = 1 billion threads of execution
  – Hybrid processors, e.g. a 1.0 GHz processor with 10,000 FPUs/socket and 100,000 sockets/system = 1 billion threads of execution
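A quick back-of-the-envelope check of these figures (my arithmetic, not part of the slide): both designs land at roughly a billion threads, and reaching an exaflop then demands on the order of a gigaflop per second sustained by every thread, assuming ideal scaling:

\[
\begin{aligned}
10^{6}\ \text{chips} \times 10^{3}\ \text{cores/chip} &= 10^{9}\ \text{threads},\\
10^{5}\ \text{sockets} \times 10^{4}\ \text{FPUs/socket} &= 10^{9}\ \text{threads},\\
\frac{10^{18}\ \text{FLOP/s}}{10^{9}\ \text{threads}} &= 10^{9}\ \text{FLOP/s per thread} \approx 1\ \text{GFLOP/s}.
\end{aligned}
\]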

See http://www.exascale.org/

Platforms expected around 2018

Page 9:

Exascale: Anticipated Architectural Changes

• Massive (ca. 4K) increase in concurrency, mostly within the compute node
• Balance between compute power and memory changes significantly: roughly 500x the compute power but only 30x the memory of 2 PF hardware
• Memory access time lags further behind
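Taking those ratios at face value (my arithmetic, not stated on the slide), memory per unit of compute shrinks to a few percent of the 2 PF baseline; writing $M$ and $F$ for that baseline's memory capacity and compute rate:

\[
\frac{30\,M}{500\,F} = 0.06\,\frac{M}{F},
\]

i.e. bytes per flop/s drop to roughly 6% of today's value, which is why reducing data motion dominates the software discussion that follows.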

Page 10:

But Wait, There’s More…

• Need a breakthrough in power efficiency
  – Impact on all of HW, cooling, system software
  – Power-awareness in applications and the runtime
• Need a new design of I/O systems
  – Reduced I/O bandwidth and file system scaling
• Must improve system resilience and manage component failure
  – Checkpoint/restart won't work

Page 11:

DOE’s Co-design Model

• Integrate decisions and explore trade-offs across hardware and software
• Simultaneous redesign of the architecture and the application software
• Requires significant interactions, including with hardware vendors, and accurate simulators
• Programming model and system software will be key to success
• Initial set of projects funded in key areas

Page 12:

Hot Off The Press

"Exaflop supercomputer receives full funding from Senate appropriators", September 11, 2011 — 11:45pm ET | By David Perera

An Energy Department effort to create a supercomputer three orders of magnitude more powerful than today's most powerful computer--an exascale computer--would receive $126 million during the coming federal fiscal year under a Senate Appropriations Committee markup of the DOE spending bill. The Senate Appropriations Committee voted Sept. 7 to approve the amount as part of the fiscal 2012 energy and water appropriations bill; fiscal 2011 ends on Sept. 30. The $126 million is the same as requested earlier this year in the White House budget request. In a report accompanying the bill, Senate appropriators say Energy plans to deploy the first exascale system in 2018, despite challenges.

Page 13:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 14:

Heterogeneous High-Performance System

Each node has multiple CPU cores, and some of the nodes are equipped with additional computational accelerators, such as GPUs.

www.olcf.ornl.gov/wp-content/uploads/.../Exascale-ASCR-Analysis.pdf

Page 15:

Many More Cores

• Biggest change is within the node: some full-featured cores, many low-power cores, specialized cores
• Technology is rapidly evolving; these will be integrated, as the easiest way to get power efficiency and high performance
• Global memory, but a low amount of memory per core
• Coherency domains, networks on chip

[Block diagram of a heterogeneous SoC: ARM Cortex-A8 CPU, C64x+ DSP and video accelerators (3525/3530 only), POWERVR SGX graphics (3515/3530 only), L3/L4 interconnect, program/data storage, peripherals, display subsystem, serial interfaces, connectivity, camera I/F]

Page 16:

Example: Nvidia Hardware

• Echelon – "Radical and rapid evolution of GPUs for exascale systems" (SC10) – expected in 2013-2014
  – ~1024 stream cores and ~8 latency-optimized CPU cores on a single chip
  – ~20 TFLOPS in a shared memory system; 25x the performance of Fermi
• Plan is to integrate it in an exascale chip

Page 17:

Top 10 Energy-efficient Supercomputers (June 2011)

Page 18:

Processor Architecture and Performance (1993-2011)

Page 19:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 20:

Programming Challenges

• Scale and structure of parallelism
  – Hierarchical parallelism; many more cores, more nodes in clusters; heterogeneous nodes
• New methods and models to match the architecture
  – Exploit intra-node concurrency and heterogeneity
  – Adapt to reduced memory size; drastically reduce data motion
• Resilience in algorithms; fault tolerance in applications
• Need to run multi-executable jobs
  – Coupled computations; postprocessing of data while in memory
• Uncertainty quantification to clarify applicability of results

Page 21:

Programming Models? Today’s Scenario

// Excerpt (end of main) from a hybrid MPI + OpenMP + CUDA example:
// run one OpenMP thread per device per MPI node
#pragma omp parallel num_threads(devCount)
if (initDevice()) {
    // Block and grid dimensions
    dim3 dimBlock(12, 12);
    kernel<<<1, dimBlock>>>();
    cudaThreadExit();
} else {
    printf("Device error on %s\n", processor_name);
}

MPI_Finalize();
return 0;
}

www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf

Page 22:

Exascale Programming Models

• Programming models are the biggest "worry factor" for application developers
  – The MPI-everywhere model is no longer viable
  – Hybrid MPI+OpenMP is already in use in HPC (a minimal sketch follows at the end of this slide)
• Need to explore new approaches and adapt existing APIs
• Exascale models and their implementation must take account of:
  – Scale of parallelism, levels of parallelism
  – Potential coherency domains, heterogeneity
  – Need to reduce power consumption
  – Resource allocation and management
  – Legacy code, libraries; interoperability
  – Resilience in algorithms, alternatives to checkpointing

Work on programming models must begin now!
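As a concrete reference point for the hybrid MPI+OpenMP style mentioned above, here is a minimal sketch (my illustration, not code from the talk): MPI handles inter-node parallelism, an OpenMP team handles intra-node parallelism, and only the per-rank result crosses the network.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided, rank;
    /* FUNNELED: only the master thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1000000;
    double local = 0.0, global = 0.0;

    /* Intra-node parallelism: OpenMP threads share this rank's work. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / n;

    /* Inter-node parallelism: combine the per-rank partial sums. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}

Even this simple composition shows why the concerns listed on the slide matter: each additional level (accelerators, coherency domains) adds another model that has to be composed by hand.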

Page 23:

Programming Model Major Requirements

• Portable expression of scalable parallelism
  – Across exascale platforms and intermediate systems
• Uniformity
  – One model across a variety of resources; across the node, across the machine?
• Locality
  – For performance and power, at all levels of the hierarchy
• Asynchrony
  – Minimize delays, but trade off against locality

[Diagram: a node with generic cores and specialized cores connected by control and data transfers]

Page 24:

Delivering The Programming Model

• Between nodes, MPI with enhancements might work
  – Needs more help for fault tolerance and improved scalability
• Within nodes, too many models, no complete solution
  – OpenMP, PGAS, CUDA, and OpenCL are all potential starting points
• In a layered system, migrate code to the most appropriate level; develop at the most suitable level
  – Incremental path for existing codes
• Timing of programming model delivery is critical
  – Must be in place when machines arrive; needed earlier for development of new codes


Page 25:

DOE Workshop’s Reverse Timeline

Page 26:

A Layered Programming Approach

[Diagram: layered software stack mapping applications (computational science, data informatics, information technology) onto heterogeneous hardware]

• Very high level: efficient, deterministic, declarative, restricted-expressiveness languages (DSLs, perhaps)
• High level: parallel programming languages (OpenMP, PGAS, APGAS)
• Low level: low-level APIs (MPI, pthreads, OpenCL, Verilog)
• Very low level: machine code, assembly

Page 27:

OpenMP Evolution Toward Exascale

• The OpenMP language committee is actively working toward the expression of locality and heterogeneity, and toward improving the task model to enhance asynchrony
• Open questions:
  – How to identify code that should run on a certain kind of core?
  – How to share data between host cores and other devices?
  – How to minimize data motion?
  – How to support a diversity of cores?

[Diagram: a node with generic cores and specialized cores connected by control and data transfers]

Page 28:

OpenMP 4.0 Attempts To Target a Range of Acceleration Configurations

• Dedicated hardware for specific function(s), attached to a master processor
• Multiple types or levels of parallelism: process level, thread level, ILP/SIMD
• May not support a full C/C++ or Fortran compiler
  – May lack a stack or interrupts; may limit control flow and types

[Diagram: example configurations: a master processor with arrays of DSPs; a master with an accelerator that has a nonstandard programming model; a master with a massively parallel accelerator; a master with attached accelerator (ACC) units]

Page 29:

Increase Locality, Reduce Power

// Proposed accelerator directives (pre-standard syntax shown on the slide):
// the data_region keeps C resident on the accelerator across both accelerator
// regions, so it is copied out only once, reducing data motion and power.
void foo(double A[], double B[], double C[], int nrows, int ncols) {
  #pragma omp data_region acc_copyout(C), host_shared(A,B)
  {
    #pragma omp acc_region
    for (int i = 0; i < nrows; ++i)
      for (int j = 0; j < ncols; j += NLANES)
        for (int k = 0; k < NLANES; ++k) {
          int index = (i * ncols) + j + k;
          C[index] = A[index] + B[index];
        }  // end accelerator region
    print2d(A, nrows, ncols);
    print2d(B, nrows, ncols);
    Transpose(C);  // calls a function with another accelerator construct
  }  // end data_region
  print2d(C, nrows, ncols);
}

void Transpose(double X[], int nrows, int ncols) {
  #pragma omp acc_region acc_copy(X), acc_present(X)
  {
    …
  }
}

Page 30:

OpenMP Locality Research: Locations := Affinity Regions

• Locations coordinate data layout and work
• A collection of locations represents the execution environment
• Map data and threads to a location; distribute data across locations
• Align computations with their data's location, or map them explicitly
• Location is inherited unless a task is explicitly migrated

Page 31:

Research on Locality in OpenMP

• Implementation in the OpenUH compiler shows good performance
  – Eliminates unnecessary thread migrations
  – Maximizes locality of data operations
  – Affinity support for heterogeneous systems

// Sketch from the slide; distribute, location, and OnLoc are research
// extensions explored in OpenUH, not standard OpenMP clauses.
int main(int argc, char *argv[]) {
  double A[N];
  …
  #pragma omp distribute(BLOCK: A) location(0:1)
  …
  #pragma omp parallel for OnLoc(A[i])
  for (i = 0; i < N; i++) {
    foo(A[i]);
  }
  …
}

Page 32:

Enabling Asynchronous Computation

• Directed acyclic graph (DAG): each node represents a task, and the edges represent inter-task dependencies
• A task can begin execution only if all its predecessors have completed execution
• Should the user express this directly?
• The compiler can generate the tasks and the graph (at least partially) to enhance performance
• What is the "right" size of a task?

Page 33:

OpenMP Research: Enhancing Tasks for Asynchrony

• Implicitly create the DAG by specifying the data inputs and outputs of tasks (see the sketch after this list)
  – A task that produces data may need to wait for any previous child task that reads or writes the same locations
  – A task that consumes data may need to wait for any previous child task that writes the same locations
• Task weights (priorities)
• Groups of tasks and synchronization on groups

Page 34:

Asynchronous OpenMP Execution

T.-H. Weng, B. Chapman: Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution. Proc. HIPS-7, 2002

Page 35:

OpenUH "All Task" Execution Model

• Replace fork-join with a data-flow execution model
  – Reduce synchronization, increase concurrency
• The compiler transforms OpenMP code into a collection of tasks and a task graph
  – The task graph represents dependences between tasks
  – Array region analysis yields the data read/write requests for the data each task needs
• Map tasks to compute resources through task characterization using cost modeling
• Enable locality-aware scheduling and load balancing

(A hand-written sketch of this style of decomposition follows below.)
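The sketch below illustrates, by hand, the kind of block-wise, data-flow decomposition described above. In OpenUH this is derived automatically from ordinary OpenMP code; the block size, task granularity, and single-element dependence encoding here are my illustrative assumptions, again borrowing the later OpenMP 4.0 depend syntax.

/* Two loop stages re-expressed as per-block tasks: stage 2 of a block may
 * start as soon as stage 1 of that same block is done, instead of waiting
 * at a global barrier for every block. */
#define N  4096
#define BS 512                                   /* illustrative block size */

static void scale(double *c, int len) {          /* illustrative second-stage kernel */
    for (int i = 0; i < len; i++) c[i] *= 2.0;
}

void pipeline(const double *a, const double *b, double *c) {
    #pragma omp parallel
    #pragma omp single
    for (int lo = 0; lo < N; lo += BS) {
        #pragma omp task depend(out: c[lo]) firstprivate(lo)   /* stage 1: c = a + b on this block */
        for (int i = lo; i < lo + BS; i++)
            c[i] = a[i] + b[i];

        #pragma omp task depend(in: c[lo]) firstprivate(lo)    /* stage 2: runs once its block is ready */
        scale(&c[lo], BS);
    }
}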

Page 36:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 37:

Runtime Is Critical for Performance

Runtime support must:
• Adapt workload and data to the environment
• Respond, dynamically and continuously, to changes caused by application characteristics, power, faults, and system noise
• Provide feedback on application behavior
• Provide uniform support for multiple programming models on heterogeneous platforms
• Facilitate interoperability and (dynamic) mapping

Page 38:

Lessons Learned from Prior Work

• Only a tight integration of application-provided metadata and an architecture description can let the runtime system take appropriate decisions
• Good analysis of thread affinity and data locality
• Task reordering
• Hardware-aware selective information gathering

Page 39:

[Diagram: the OpenUH/Open64 compiler infrastructure. Frontends (C/C++, Fortran 90, OpenMP), OMP_PRELOWER (OpenMP preprocessing), LOWER_MP (OpenMP transformation), IPA (interprocedural analyzer), LNO (loop nest optimizer), WOPT (global scalar optimizer), WHIRL2C/WHIRL2F (IR-to-source option feeding a native compiler), and CG (Itanium, Opteron, Pentium) produce object files that are linked against the OpenMP runtime library into executables. The runtime library supplies runtime and static feedback information for compiler feedback optimizations and exposes free compute resources.]

Page 40:

Compiler’s Runtime Must Adapt

• Light-weight performance data collection
• Dynamic optimization
• Interoperation with external tools and schedulers
• Needs very low-level interfaces to facilitate its implementation

[Diagram: an OpenMP application and a collector tool both interact with the OpenMP runtime library; the tool registers for events and the runtime invokes event callbacks]

Multicore Association is defining interfaces
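As a rough illustration of that collector-style interaction, here is a hypothetical registration sketch; the type, function, and event names (tool_event_t, rt_register_callback, EV_PARALLEL_BEGIN, ...) are invented for illustration and are not the actual OpenUH collector API or the Multicore Association interfaces.

/* Hypothetical sketch: a tool registers callbacks with the runtime and the
 * runtime invokes them on events, enabling light-weight data collection. */
#include <stdio.h>

typedef enum { EV_PARALLEL_BEGIN, EV_PARALLEL_END, EV_TASK_CREATE } tool_event_t;
typedef void (*tool_callback_t)(tool_event_t ev, int thread_id);

/* Assumed, for this sketch, to be exported by the OpenMP runtime. */
extern int rt_register_callback(tool_event_t ev, tool_callback_t cb);

static void on_event(tool_event_t ev, int thread_id) {
    /* Keep the handler cheap: count or log, defer heavy analysis. */
    fprintf(stderr, "event %d on thread %d\n", (int)ev, thread_id);
}

void collector_init(void) {
    rt_register_callback(EV_PARALLEL_BEGIN, on_event);
    rt_register_callback(EV_PARALLEL_END, on_event);
    rt_register_callback(EV_TASK_CREATE, on_event);
}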

Page 41:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 42:

Interactions Across System Stack

• More interactions are needed to share information
  – To support application development and tuning
  – To increase execution efficiency
• The runtime needs architectural information and smart monitoring
• The application developer needs feedback, and so does the dynamic optimizer

[Diagram: instrumentation workflow: IPA inlining analysis and selective instrumentation, an instrumentation phase, source-to-source transformations, and optimization logs]

Oscar Hernandez, Haoqiang Jin, Barbara Chapman. Compiler Support for Efficient Instrumentation. In Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 661-668, 2007.

Page 43:

Automating the Tuning Process

K. Huck, O. Hernandez, V. Bui, S. Chandrasekaran, B. Chapman, A. Malony, L. McInnes, B. Norris. Capturing Performance Knowledge for Automated Analysis. Supercomputing 2008

Page 44:

Code Migration Tools

• From structural information to sophisticated adaptation support
• Changes in data structures, in large-grain and in fine-grained control flow
• Variety of transformations for kernel granularity and memory optimization

[Figure legend: red indicates high similarity, blue low similarity]

Page 45:

Summary

• Projected exascale hardware requires us to rethink the programming model and its execution support
  – Intra-node concurrency is fine-grained and heterogeneous
  – Memory is scarce and power is expensive; data locality has never been so important
• Opportunity for new programming models
• The runtime continues to grow in importance for ensuring good performance
• We need to migrate large applications: where are the tools?