Page 1:

Exascale: Why It is Different

Barbara Chapman, University of Houston

High Performance Computing and Tools Group, http://www.cs.uh.edu/~hpctools

P2S2 Workshop at ICPP, Taipei, Taiwan, September 13, 2011

Page 2:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 3:

Top 10 Supercomputers (June 2011)

Page 4:

Petascale is a Global Reality

• K computer: 68,544 SPARC64 VIIIfx processors, Tofu interconnect, Linux-based enhanced OS; produced by Fujitsu
• Tianhe-1A: 7,168 Fermi GPUs and 14,336 CPUs; delivering the same performance with CPUs alone would require more than 50,000 CPUs and twice as much floor space
• Jaguar: 224,256 x86-based AMD Opteron processor cores; each compute node features two 12-core Opterons and 16 GB of shared memory
• Nebulae: 4,640 Nvidia Tesla GPUs and 9,280 Intel X5650-based CPUs
• Tsubame: 4,200 GPUs

Page 5:

Exascale Systems: The Race is On

• Town Hall Meetings, April-June 2007
• Scientific Grand Challenges Workshops, November 2008 - October 2009
  – Climate Science (11/08), High Energy Physics (12/08), Nuclear Physics (1/09), Fusion Energy (3/09), Nuclear Energy (5/09, with NE), Biology (8/09), Material Science and Chemistry (8/09), National Security (10/09, with NNSA)
• Cross-cutting workshops: Architecture and Technology (12/09); Architecture, Applied Mathematics and Computer Science (2/10)
• Meetings with industry (8/09, 11/09)
• External panels: ASCAC Exascale Charge, Trivelpiece Panel

Exascale peak performance is 10^18 floating point operations per second.

Page 6:

DOE Exascale Exploration: Findings

• Exascale is essential in important areas: climate, combustion, nuclear reactors, fusion, stockpile stewardship, materials, astrophysics, electricity production, alternative fuels, fuel efficiency, …
• Systems need to be both usable and affordable
  – Must deliver an exascale level of performance to applications
  – Power budget cannot exceed today's petascale power consumption by more than a factor of 3
  – Applications need to exhibit strong scaling, or weak scaling that does not increase memory requirements
• Need to understand the R&D implications (HW and SW): what is the roadmap?

Page 7:

IESP: International Exascale Software Project

• International effort to specify a research agenda that will lead to exascale capabilities
  – Academia, labs, agencies, industry
• Requires open international collaboration and a significant contribution of open-source software
• Revolution vs. evolution?
• Produced a detailed roadmap
• Focused meetings to determine R&D needs

Page 8:

IESP: Exascale Systems

• Given budget constraints, current predictions focus on two alternative designs:
  – Huge number of lightweight processors, e.g. 1 million chips, 1,000 cores/chip = 1 billion threads of execution
  – Hybrid processors, e.g. a 1.0 GHz processor with 10,000 FPUs/socket and 100,000 sockets/system = 1 billion threads of execution
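A quick back-of-the-envelope check of these figures (my arithmetic, not part of the slide): both designs land at roughly a billion threads, and reaching an exaflop then demands on the order of a gigaflop per second sustained by every thread, assuming ideal scaling:

\[
\begin{aligned}
10^{6}\ \text{chips} \times 10^{3}\ \text{cores/chip} &= 10^{9}\ \text{threads},\\
10^{5}\ \text{sockets} \times 10^{4}\ \text{FPUs/socket} &= 10^{9}\ \text{threads},\\
\frac{10^{18}\ \text{FLOP/s}}{10^{9}\ \text{threads}} &= 10^{9}\ \text{FLOP/s per thread} \approx 1\ \text{GFLOP/s}.
\end{aligned}
\]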

See http://www.exascale.org/

Platforms expected around 2018

Page 9:

Exascale: Anticipated Architectural Changes

• Massive (ca. 4K) increase in concurrency, mostly within the compute node
• Balance between compute power and memory changes significantly: roughly 500x the compute power but only 30x the memory of 2 PF hardware
• Memory access time lags further behind
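Taking those ratios at face value (my arithmetic, not stated on the slide), memory per unit of compute shrinks to a few percent of the 2 PF baseline; writing $M$ and $F$ for that baseline's memory capacity and compute rate:

\[
\frac{30\,M}{500\,F} = 0.06\,\frac{M}{F},
\]

i.e. bytes per flop/s drop to roughly 6% of today's value, which is why reducing data motion dominates the software discussion that follows.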

Page 10:

But Wait, There’s More…

• Need a breakthrough in power efficiency
  – Impact on all of HW, cooling, system software
  – Power-awareness in applications and the runtime
• Need a new design of I/O systems
  – Reduced I/O bandwidth and file system scaling
• Must improve system resilience and manage component failure
  – Checkpoint/restart won't work

Page 11:

DOE’s Co-design Model

• Integrate decisions and explore trade-offs across hardware and software
• Simultaneous redesign of the architecture and the application software
• Requires significant interactions, including with hardware vendors, and accurate simulators
• Programming model and system software will be key to success
• Initial set of projects funded in key areas

Page 12:

Hot Off The Press

"Exaflop supercomputer receives full funding from Senate appropriators", September 11, 2011 — 11:45pm ET | By David Perera

An Energy Department effort to create a supercomputer three orders of magnitude more powerful than today's most powerful computer--an exascale computer--would receive $126 million during the coming federal fiscal year under a Senate Appropriations Committee markup of the DOE spending bill. The Senate Appropriations Committee voted Sept. 7 to approve the amount as part of the fiscal 2012 energy and water appropriations bill; fiscal 2011 ends on Sept. 30. The $126 million is the same as requested earlier this year in the White House budget request. In a report accompanying the bill, Senate appropriators say Energy plans to deploy the first exascale system in 2018, despite challenges.

Page 13:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 14:

Heterogeneous High-Performance System

Each node has multiple CPU cores, and some of the nodes are equipped with additional computational accelerators, such as GPUs.

www.olcf.ornl.gov/wp-content/uploads/.../Exascale-ASCR-Analysis.pdf

Page 15:

Many More Cores

• Biggest change is within the node: some full-featured cores, many low-power cores, specialized cores
• Technology is rapidly evolving; these will be integrated, as the easiest way to get power efficiency and high performance
• Global memory, but a low amount of memory per core
• Coherency domains, networks on chip

[Block diagram of a heterogeneous SoC: ARM Cortex-A8 CPU, C64x+ DSP and video accelerators (3525/3530 only), POWERVR SGX graphics (3515/3530 only), L3/L4 interconnect, program/data storage, peripherals, display subsystem, serial interfaces, connectivity, camera I/F]

Page 16:

Example: Nvidia Hardware

• Echelon – "Radical and rapid evolution of GPUs for exascale systems" (SC10) – expected in 2013-2014
  – ~1024 stream cores and ~8 latency-optimized CPU cores on a single chip
  – ~20 TFLOPS in a shared memory system; 25x the performance of Fermi
• Plan is to integrate it in an exascale chip

Page 17:

Top 10 Energy-efficient Supercomputers (June 2011)

Page 18:

Processor Architecture and Performance (1993-2011)

Page 19:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 20:

Programming Challenges

• Scale and structure of parallelism
  – Hierarchical parallelism; many more cores, more nodes in clusters; heterogeneous nodes
• New methods and models to match the architecture
  – Exploit intra-node concurrency and heterogeneity
  – Adapt to reduced memory size; drastically reduce data motion
• Resilience in algorithms; fault tolerance in applications
• Need to run multi-executable jobs
  – Coupled computations; postprocessing of data while in memory
• Uncertainty quantification to clarify applicability of results

Page 21:

Programming Models? Today’s Scenario

// Excerpt (end of main) from a hybrid MPI + OpenMP + CUDA example:
// run one OpenMP thread per device per MPI node
#pragma omp parallel num_threads(devCount)
if (initDevice()) {
    // Block and grid dimensions
    dim3 dimBlock(12, 12);
    kernel<<<1, dimBlock>>>();
    cudaThreadExit();
} else {
    printf("Device error on %s\n", processor_name);
}

MPI_Finalize();
return 0;
}

www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf

Page 22:

Exascale Programming Models

• Programming models are the biggest "worry factor" for application developers
  – The MPI-everywhere model is no longer viable
  – Hybrid MPI+OpenMP is already in use in HPC (a minimal sketch follows at the end of this slide)
• Need to explore new approaches and adapt existing APIs
• Exascale models and their implementation must take account of:
  – Scale of parallelism, levels of parallelism
  – Potential coherency domains, heterogeneity
  – Need to reduce power consumption
  – Resource allocation and management
  – Legacy code, libraries; interoperability
  – Resilience in algorithms, alternatives to checkpointing

Work on programming models must begin now!
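As a concrete reference point for the hybrid MPI+OpenMP style mentioned above, here is a minimal sketch (my illustration, not code from the talk): MPI handles inter-node parallelism, an OpenMP team handles intra-node parallelism, and only the per-rank result crosses the network.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided, rank;
    /* FUNNELED: only the master thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1000000;
    double local = 0.0, global = 0.0;

    /* Intra-node parallelism: OpenMP threads share this rank's work. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / n;

    /* Inter-node parallelism: combine the per-rank partial sums. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}

Even this simple composition shows why the concerns listed on the slide matter: each additional level (accelerators, coherency domains) adds another model that has to be composed by hand.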

Page 23:

Programming Model Major Requirements

• Portable expression of scalable parallelism
  – Across exascale platforms and intermediate systems
• Uniformity
  – One model across a variety of resources; across the node, across the machine?
• Locality
  – For performance and power, at all levels of the hierarchy
• Asynchrony
  – Minimize delays, but trade off against locality

[Diagram: a node with generic cores and specialized cores connected by control and data transfers]

Page 24:

Delivering The Programming Model

• Between nodes, MPI with enhancements might work
  – Needs more help for fault tolerance and improved scalability
• Within nodes, too many models, no complete solution
  – OpenMP, PGAS, CUDA, and OpenCL are all potential starting points
• In a layered system, migrate code to the most appropriate level; develop at the most suitable level
  – Incremental path for existing codes
• Timing of programming model delivery is critical
  – Must be in place when machines arrive; needed earlier for development of new codes


Page 25:

DOE Workshop’s Reverse Timeline

Page 26:

A Layered Programming Approach

[Diagram: layered software stack mapping applications (computational science, data informatics, information technology) onto heterogeneous hardware]

• Very high level: efficient, deterministic, declarative, restricted-expressiveness languages (DSLs, perhaps)
• High level: parallel programming languages (OpenMP, PGAS, APGAS)
• Low level: low-level APIs (MPI, pthreads, OpenCL, Verilog)
• Very low level: machine code, assembly

Page 27:

OpenMP Evolution Toward Exascale

• The OpenMP language committee is actively working toward the expression of locality and heterogeneity, and toward improving the task model to enhance asynchrony
• Open questions:
  – How to identify code that should run on a certain kind of core?
  – How to share data between host cores and other devices?
  – How to minimize data motion?
  – How to support a diversity of cores?

[Diagram: a node with generic cores and specialized cores connected by control and data transfers]

Page 28:

OpenMP 4.0 Attempts To Target a Range of Acceleration Configurations

• Dedicated hardware for specific function(s), attached to a master processor
• Multiple types or levels of parallelism: process level, thread level, ILP/SIMD
• May not support a full C/C++ or Fortran compiler
  – May lack a stack or interrupts; may limit control flow and types

[Diagram: example configurations: a master processor with arrays of DSPs; a master with an accelerator that has a nonstandard programming model; a master with a massively parallel accelerator; a master with attached accelerator (ACC) units]

Page 29:

Increase Locality, Reduce Power

// Proposed accelerator directives (pre-standard syntax shown on the slide):
// the data_region keeps C resident on the accelerator across both accelerator
// regions, so it is copied out only once, reducing data motion and power.
void foo(double A[], double B[], double C[], int nrows, int ncols) {
  #pragma omp data_region acc_copyout(C), host_shared(A,B)
  {
    #pragma omp acc_region
    for (int i = 0; i < nrows; ++i)
      for (int j = 0; j < ncols; j += NLANES)
        for (int k = 0; k < NLANES; ++k) {
          int index = (i * ncols) + j + k;
          C[index] = A[index] + B[index];
        }  // end accelerator region
    print2d(A, nrows, ncols);
    print2d(B, nrows, ncols);
    Transpose(C);  // calls a function with another accelerator construct
  }  // end data_region
  print2d(C, nrows, ncols);
}

void Transpose(double X[], int nrows, int ncols) {
  #pragma omp acc_region acc_copy(X), acc_present(X)
  {
    …
  }
}

Page 30:

OpenMP Locality Research: Locations := Affinity Regions

• Locations coordinate data layout and work
• A collection of locations represents the execution environment
• Map data and threads to a location; distribute data across locations
• Align computations with their data's location, or map them explicitly
• Location is inherited unless a task is explicitly migrated

Page 31:

Research on Locality in OpenMP

• Implementation in the OpenUH compiler shows good performance
  – Eliminates unnecessary thread migrations
  – Maximizes locality of data operations
  – Affinity support for heterogeneous systems

// Sketch from the slide; distribute, location, and OnLoc are research
// extensions explored in OpenUH, not standard OpenMP clauses.
int main(int argc, char *argv[]) {
  double A[N];
  …
  #pragma omp distribute(BLOCK: A) location(0:1)
  …
  #pragma omp parallel for OnLoc(A[i])
  for (i = 0; i < N; i++) {
    foo(A[i]);
  }
  …
}

Page 32:

Enabling Asynchronous Computation

• Directed acyclic graph (DAG): each node represents a task, and the edges represent inter-task dependencies
• A task can begin execution only if all its predecessors have completed execution
• Should the user express this directly?
• The compiler can generate the tasks and the graph (at least partially) to enhance performance
• What is the "right" size of a task?

Page 33:

OpenMP Research: Enhancing Tasks for Asynchrony

• Implicitly create the DAG by specifying the data inputs and outputs of tasks (see the sketch after this list)
  – A task that produces data may need to wait for any previous child task that reads or writes the same locations
  – A task that consumes data may need to wait for any previous child task that writes the same locations
• Task weights (priorities)
• Groups of tasks and synchronization on groups

Page 34:

Asynchronous OpenMP Execution

T.-H. Weng, B. Chapman: Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution. Proc. HIPS-7, 2002

Page 35:

OpenUH "All Task" Execution Model

• Replace fork-join with a data-flow execution model
  – Reduce synchronization, increase concurrency
• The compiler transforms OpenMP code into a collection of tasks and a task graph
  – The task graph represents dependences between tasks
  – Array region analysis yields the data read/write requests for the data each task needs
• Map tasks to compute resources through task characterization using cost modeling
• Enable locality-aware scheduling and load balancing

(A hand-written sketch of this style of decomposition follows below.)
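The sketch below illustrates, by hand, the kind of block-wise, data-flow decomposition described above. In OpenUH this is derived automatically from ordinary OpenMP code; the block size, task granularity, and single-element dependence encoding here are my illustrative assumptions, again borrowing the later OpenMP 4.0 depend syntax.

/* Two loop stages re-expressed as per-block tasks: stage 2 of a block may
 * start as soon as stage 1 of that same block is done, instead of waiting
 * at a global barrier for every block. */
#define N  4096
#define BS 512                                   /* illustrative block size */

static void scale(double *c, int len) {          /* illustrative second-stage kernel */
    for (int i = 0; i < len; i++) c[i] *= 2.0;
}

void pipeline(const double *a, const double *b, double *c) {
    #pragma omp parallel
    #pragma omp single
    for (int lo = 0; lo < N; lo += BS) {
        #pragma omp task depend(out: c[lo]) firstprivate(lo)   /* stage 1: c = a + b on this block */
        for (int i = lo; i < lo + BS; i++)
            c[i] = a[i] + b[i];

        #pragma omp task depend(in: c[lo]) firstprivate(lo)    /* stage 2: runs once its block is ready */
        scale(&c[lo], BS);
    }
}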

Page 36:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 37:

Runtime Is Critical for Performance

Runtime support must:
• Adapt workload and data to the environment
• Respond, dynamically and continuously, to changes caused by application characteristics, power, faults, and system noise
• Provide feedback on application behavior
• Provide uniform support for multiple programming models on heterogeneous platforms
• Facilitate interoperability and (dynamic) mapping

Page 38:

Lessons Learned from Prior Work

• Only a tight integration of application-provided metadata and an architecture description can let the runtime system take appropriate decisions
• Good analysis of thread affinity and data locality
• Task reordering
• Hardware-aware selective information gathering

Page 39:

[Diagram: the OpenUH/Open64 compiler infrastructure. Frontends (C/C++, Fortran 90, OpenMP), OMP_PRELOWER (OpenMP preprocessing), LOWER_MP (OpenMP transformation), IPA (interprocedural analyzer), LNO (loop nest optimizer), WOPT (global scalar optimizer), WHIRL2C/WHIRL2F (IR-to-source option feeding a native compiler), and CG (Itanium, Opteron, Pentium) produce object files that are linked against the OpenMP runtime library into executables. The runtime library supplies runtime and static feedback information for compiler feedback optimizations and exposes free compute resources.]

Page 40:

Compiler’s Runtime Must Adapt

• Light-weight performance data collection
• Dynamic optimization
• Interoperation with external tools and schedulers
• Needs very low-level interfaces to facilitate its implementation

[Diagram: an OpenMP application and a collector tool both interact with the OpenMP runtime library; the tool registers for events and the runtime invokes event callbacks]

Multicore Association is defining interfaces
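As a rough illustration of that collector-style interaction, here is a hypothetical registration sketch; the type, function, and event names (tool_event_t, rt_register_callback, EV_PARALLEL_BEGIN, ...) are invented for illustration and are not the actual OpenUH collector API or the Multicore Association interfaces.

/* Hypothetical sketch: a tool registers callbacks with the runtime and the
 * runtime invokes them on events, enabling light-weight data collection. */
#include <stdio.h>

typedef enum { EV_PARALLEL_BEGIN, EV_PARALLEL_END, EV_TASK_CREATE } tool_event_t;
typedef void (*tool_callback_t)(tool_event_t ev, int thread_id);

/* Assumed, for this sketch, to be exported by the OpenMP runtime. */
extern int rt_register_callback(tool_event_t ev, tool_callback_t cb);

static void on_event(tool_event_t ev, int thread_id) {
    /* Keep the handler cheap: count or log, defer heavy analysis. */
    fprintf(stderr, "event %d on thread %d\n", (int)ev, thread_id);
}

void collector_init(void) {
    rt_register_callback(EV_PARALLEL_BEGIN, on_event);
    rt_register_callback(EV_PARALLEL_END, on_event);
    rt_register_callback(EV_TASK_CREATE, on_event);
}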

Page 41:

Agenda

• What is Exascale?
• Hardware Revolution
• Programming at Exascale
• Runtime Support
• We Need Tools

Page 42:

Interactions Across System Stack

• More interactions are needed to share information
  – To support application development and tuning
  – To increase execution efficiency
• The runtime needs architectural information and smart monitoring
• The application developer needs feedback, and so does the dynamic optimizer

[Diagram: instrumentation workflow: IPA inlining analysis and selective instrumentation, an instrumentation phase, source-to-source transformations, and optimization logs]

Oscar Hernandez, Haoqiang Jin, Barbara Chapman. Compiler Support for Efficient Instrumentation. In Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 661-668, 2007.

Page 43:

Automating the Tuning Process

K. Huck, O. Hernandez, V. Bui, S. Chandrasekaran, B. Chapman, A. Malony, L. McInnes, B. Norris. Capturing Performance Knowledge for Automated Analysis. Supercomputing 2008

Page 44:

Code Migration Tools

• From structural information to sophisticated adaptation support
• Changes in data structures, in large-grain and in fine-grained control flow
• Variety of transformations for kernel granularity and memory optimization

[Figure legend: red indicates high similarity, blue low similarity]

Page 45:

Summary

• Projected exascale hardware requires us to rethink the programming model and its execution support
  – Intra-node concurrency is fine-grained and heterogeneous
  – Memory is scarce and power is expensive; data locality has never been so important
• Opportunity for new programming models
• The runtime continues to grow in importance for ensuring good performance
• We need to migrate large applications: where are the tools?