
High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Dec 26, 2015


Joella Harper
Transcript
Page 1: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

High Performance Computing – CISC 811
Dr Rob Thacker
Dept of Physics (308A)
thacker@physics

Page 2:

Today’s Lecture
Part 1: Motivations and benefits, serial libraries
Part 2: Parallel libraries, ACTS collection
Part 3: Netlib, HPL

HPC Libraries

Page 3:

Part 1: HPC Libraries
Motivations, benefits
Serial HPC libraries

Page 4:

Is our current programming model viable?

"We need to move away from a coding style suited for serial machines, where every macrostep of an algorithm needs to be thought about and explicitly coded, to a higher-level style, where the compiler and library tools take care of the details. And the remarkable thing is, if we adopt this higher-level approach right now, even on today's machines, we will see immediate benefits in our productivity."

W. H. Press and S. A. Teukolsky, 1997, "Numerical Recipes: Does This Paradigm Have a Future?"

Page 5:

Motivations, concerns
In developing large applications, three significant issues must be addressed:
  Productivity
    Time to the first solution (prototype) and time to solution (production)
  Complexity
    Increasingly sophisticated models, may need to link to other solvers
  Performance
    Increasingly complex algorithms, architectures
What strategies should be applied?
  Some goals appear mutually exclusive: chasing the best performance reduces productivity if you tailor every single part of the code

Page 6:

Unavoidable tension
Scientists frequently need highest performance
Algorithms have long lifetimes (longer than hardware)
Low level programming versus high level programming

Page 7:

Library approach
Why not use libraries? (provided they suit your problem)
  Optimization – many library functions are assembly optimized
  Well tested – libraries are used by far more people than your local research group
  Support – commercial packages frequently come with online forums or email support
Main drawback – loss of understanding of the code's inner workings
  Is this really an issue? 99.9% of the software you use you didn’t write
  Also you are forced into using the library interface, but usually this is not a significant concern
Secondary drawback may be cost

Page 8:

Library ownership
Three main possibilities
  Public Domain
    Most common for numerical software
  Commercial
    Becoming more common as Universities attempt to gain from Intellectual property
  Vendor Specific
    Many of the big vendors release platform specific optimized versions of the larger public domain packages

Page 9:

Potential benefits of libraries
Allows easier collaboration (provided library is freely available to everyone!)
Software using GPL’d libraries can be released publicly as source-code
  You can contribute back improvements to the user community
Source based libraries can be adapted to your needs
Bottom line is that your time to solution is reduced!

Page 10:

Bugs are a serious issue…

On June 4, 1996, an Ariane 5 rocket launched by the European Space Agency exploded just forty seconds after its lift-off from Kourou, French Guiana. The rocket was on its first voyage, after a decade of development costing $7 billion. The problem was a software error in the inertial reference system. Specifically a 64 bit floating point number relating to the horizontal velocity of the rocket with respect to the platform was converted to a 16 bit signed integer.

On August 23, 1991, the first concrete base structure for the Sleipner A platform sprang a leak and sank under a controlled ballasting operation during preparation for deck mating in Gandsfjorden outside Stavanger, Norway. The post accident investigation traced the error to an inaccurate finite element approximation of the linear elastic model of the tricell (using the popular finite element program NASTRAN). The shear stresses were underestimated by 47%, leading to insufficient design. In particular, certain concrete walls were not thick enough.
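The Ariane 5 failure described above came down to narrowing a 64 bit floating point value into a 16 bit signed integer without checking the range. The sketch below shows one way such a conversion can be guarded in C; the function name checked_double_to_int16 is hypothetical and is not taken from the flight software or from any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a double to int16_t only when the value lies in
   [INT16_MIN, INT16_MAX]. Returns false and leaves *out untouched
   for out-of-range input and for NaN. */
bool checked_double_to_int16(double value, int16_t *out)
{
    /* NaN fails both comparisons, so the negated test rejects it too. */
    if (!(value >= (double)INT16_MIN && value <= (double)INT16_MAX)) {
        return false;
    }
    *out = (int16_t)value;  /* truncates toward zero; range already checked */
    return true;
}
```

A caller would test the return value and take a recovery path on failure, instead of letting an unchecked cast produce an undefined or trapped result.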

Page 11:

Something to think about…
~ 20 years ago
  1x10^6 Floating Point Ops/sec (Mflop/s)
  Scalar based
~ 10 years ago
  1x10^9 Floating Point Ops/sec (Gflop/s)
  Vector & Shared memory computing, bandwidth aware
  Block partitioned, latency tolerant
~ Today
  1x10^12 Floating Point Ops/sec (Tflop/s)
  Highly parallel, distributed processing, message passing, network based
  Data decomposition, communication/computation
Coming soon
  1x10^15 Floating Point Ops/sec (Pflop/s)
  Many more levels of memory hierarchy, combination of grids & HPC
  More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
Application codes will need to address these issues

Page 12:

The Evolving Performance Gap
Peak performance is skyrocketing
  In 1990s, peak performance increased 100x; in 2000s, it will increase 1000x
But
  Efficiency for many science applications declined from 40-50% on the vector supercomputers of 1990s to as little as 5-10% on parallel supercomputers of today
Need research on
  Mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
  More efficient programming models for massively parallel supercomputers

[Figure: Peak Performance and Real Performance in Teraflops (axis 0.1 to 1,000) against year (1996, 2000, 2004); the difference between the curves is labelled "Performance Gap"]

We don’t want everyone working on the same problem though!

Page 13:

Notable Public Domain Numerical Libraries
LAPACK
  Linear equations, eigenproblems
BLAS
  Fast linear algebra kernels
LINPACK
  Linear equation solving (now incorporated in LAPACK)
ODEPACK
  Ordinary d.e. solving (see also the DASSL toolkit)
QUADPACK
  Numerical Quadrature
ITPACK
  Sparse problems
PIM
  Linear systems
Check out mathtools.net for a vast list of libraries

Page 14:

Basic Linear Algebra Subprograms (BLAS)
FORTRAN library of simple subroutines which can be used to build more sophisticated LA programs (dates back to 1970’s)
BLAS is divided into four types and three levels
  Single, double, complex and double complex
  Level 1 (vector-vector operations)
  Level 2 (matrix-vector operations)
  Level 3 (matrix-matrix operations)
Functions are prefixed with the type of the variables:
  s, d, c, or z for single, double, complex, or double complex (z).

Page 15:

BLAS routines
Some of the BLAS 1 subprograms are:
  xCOPY - copy one vector to another
  xSWAP - swap two vectors
  xSCAL - scale a vector by a constant
  xAXPY - add a multiple of one vector to another
  xDOT - inner product
  xASUM - 1-norm of a vector
  xNRM2 - 2-norm of a vector
  IxAMAX - find the index of the entry with the largest absolute value in a vector
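To make the Level 1 operations concrete, here is a plain C sketch of what DAXPY and DDOT compute, including the stride ("inc") arguments that the real routines take. The names ref_daxpy and ref_ddot are hypothetical stand-ins, the sketch assumes positive strides only, and a tuned BLAS will be considerably faster than these loops.

```c
/* y := alpha*x + y over n elements, stepping through x and y with the
   given positive strides (the operation DAXPY performs). */
void ref_daxpy(int n, double alpha, const double *x, int incx,
               double *y, int incy)
{
    for (int i = 0; i < n; ++i) {
        y[i * incy] += alpha * x[i * incx];
    }
}

/* Returns the inner product of x and y over n elements (as DDOT does). */
double ref_ddot(int n, const double *x, int incx,
                const double *y, int incy)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += x[i * incx] * y[i * incy];
    }
    return sum;
}
```

The strides are what let the same routine walk a row or a column of a matrix stored in a single array.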

Page 16:

Levels 2 & 3
Some of the BLAS 2 subprograms are:
  xGEMV - general matrix-vector multiplication
  xGER - general rank-1 update
  xSYR2 - symmetric rank-2 update
  xTRSV - solve a triangular system of equations
Some of the BLAS 3 subprograms are:
  xGEMM - general matrix-matrix multiplication
  xSYMM - symmetric matrix-matrix multiplication
  xSYRK - symmetric rank-k update
  xSYR2K - symmetric rank-2k update
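As an illustration of a Level 2 operation, the sketch below spells out y := alpha*A*x + beta*y, the non-transposed case of xGEMV, for a column-major matrix with leading dimension lda. The name ref_dgemv_n is hypothetical, unit strides are assumed for x and y, and unlike a real BLAS it does not special-case beta = 0.

```c
/* y := alpha*A*x + beta*y for an m x n matrix A stored column-major,
   where element (i, j) lives at a[i + j*lda]. */
void ref_dgemv_n(int m, int n, double alpha, const double *a, int lda,
                 const double *x, double beta, double *y)
{
    for (int i = 0; i < m; ++i) {
        y[i] *= beta;
    }
    /* Loop over columns so the inner loop reads A with unit stride. */
    for (int j = 0; j < n; ++j) {
        double t = alpha * x[j];
        for (int i = 0; i < m; ++i) {
            y[i] += t * a[i + j * lda];
        }
    }
}
```

The loop order is a design choice: for column-major storage, running down a column in the inner loop keeps memory accesses contiguous.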

Page 17:

Tuning Advantages

[Figure: PHiPAC results for the matrix product C = A * B]

Linear algebra is always faster using an optimized library!
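One reason tuned libraries win on C = A * B is cache blocking: the product is computed tile by tile so that data is reused while it is still in cache. The following is a minimal sketch of that idea only, not PHiPAC's or any vendor's implementation; the name blocked_matmul and the row-major layout are assumptions made for the example, and real libraries add register blocking, SIMD and architecture-specific tile sizes.

```c
/* C += A*B for n x n row-major matrices, processed in bs x bs tiles.
   The caller is expected to zero C first if a plain product is wanted. */
void blocked_matmul(int n, int bs, const double *a, const double *b,
                    double *c)
{
    for (int ii = 0; ii < n; ii += bs) {
        for (int kk = 0; kk < n; kk += bs) {
            for (int jj = 0; jj < n; jj += bs) {
                /* Clip the tile at the matrix edge. */
                int i_end = (ii + bs < n) ? ii + bs : n;
                int k_end = (kk + bs < n) ? kk + bs : n;
                int j_end = (jj + bs < n) ? jj + bs : n;
                for (int i = ii; i < i_end; ++i) {
                    for (int k = kk; k < k_end; ++k) {
                        double aik = a[i * n + k];
                        for (int j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The tile size bs would normally be chosen so that three tiles fit in cache together; the best value is machine dependent, which is exactly what self-tuning packages search for.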

Page 18:

BLAS and C
CBLAS is a C version of the libraries
  Available from Netlib
However, you can still call FORTRAN versions from C
  You will need to declare the involved BLAS routine as “extern”
  extern void dgemv_(char *trans, int *m, int *n, double *alpha, double *a, int *lda, double *x, int *incx, double *beta, double *y, int *incy);
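Besides passing every argument by pointer, as in the declaration above, a C caller has to remember that FORTRAN routines expect matrices in column-major order, whereas C arrays are naturally row-major. The helper below is a hypothetical sketch (to_column_major is not part of BLAS or CBLAS) that repacks a matrix before such a call.

```c
/* Copy a rows x cols matrix from row-major (C) order into
   column-major (FORTRAN) order. The two buffers must not overlap. */
void to_column_major(int rows, int cols, const double *row_major,
                     double *col_major)
{
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            col_major[i + j * rows] = row_major[i * cols + j];
        }
    }
}
```

An alternative that avoids the copy is to note that a row-major m x n array occupies memory exactly as the column-major n x m transpose does, and to ask the routine for the transposed operation through its trans argument.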

Page 19:

VSIPL
www.vsipl.org
Vector Signal and Image Processing Library
Origins in defence contracts to produce an API for embedded programming
Developed in C, bindings for C++ under development
Main functionality
  Vector based frequency domain analysis routines

Page 20:

LAPACK
BLAS is used as the building block for the Linear Algebra Package, LAPACK
Website describing and distributing a portable version of the library: http://www.netlib.org/lapack/
  Includes online manual http://www.netlib.org/lapack/lug/index.html
  Vendors frequently distribute their own assembly level optimized versions of the library (e.g. Intel MKL, and AMD ACML)
This library consists of a set of higher level linear algebra functions with interface described at: http://www.netlib.org/lapack/individualroutines.html

Page 21:

LAPACK
There are a very large number of linear algebra subroutines available in LAPACK
All follow an XYYZZZ format, where X denotes the datatype, YY the type of matrix and ZZZ describes the computation performed. For example:
  dgetrf is used to compute LU factorizations of a matrix (d=double, ge=general, trf=triangular factorization)
  dgetrs uses an LU factorization from dgetrf to solve a system
  dgetri uses the LU above to compute the inverse of a matrix
  dgesv is essentially a combined call to dgetrf and dgetrs
  dgeev computes the eigenvalues of a matrix.
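The factor-then-solve split between dgetrf and dgetrs can be sketched in plain C as below. This is an unblocked teaching version under stated assumptions, not LAPACK's code: the names lu_factor and lu_solve are hypothetical, the matrix is column-major with leading dimension n, and pivot indices are 0-based (LAPACK's are 1-based).

```c
/* Factor a column-major n x n matrix in place as P*A = L*U using partial
   pivoting (the role dgetrf plays). piv[k] records the row swapped with
   row k. Returns 0 on success, or k+1 if column k has no nonzero pivot. */
int lu_factor(int n, double *a, int *piv)
{
    for (int k = 0; k < n; ++k) {
        int p = k;
        double best = a[k + k * n] < 0.0 ? -a[k + k * n] : a[k + k * n];
        for (int i = k + 1; i < n; ++i) {
            double v = a[i + k * n] < 0.0 ? -a[i + k * n] : a[i + k * n];
            if (v > best) { best = v; p = i; }
        }
        piv[k] = p;
        if (best == 0.0) return k + 1;
        if (p != k) {                       /* swap whole rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = a[k + j * n];
                a[k + j * n] = a[p + j * n];
                a[p + j * n] = t;
            }
        }
        for (int i = k + 1; i < n; ++i) {   /* eliminate below the pivot */
            a[i + k * n] /= a[k + k * n];
            for (int j = k + 1; j < n; ++j)
                a[i + j * n] -= a[i + k * n] * a[k + j * n];
        }
    }
    return 0;
}

/* Solve A*x = b from the factors above; b is overwritten with x
   (the role dgetrs plays). */
void lu_solve(int n, const double *a, const int *piv, double *b)
{
    for (int k = 0; k < n; ++k) {           /* apply the row swaps to b */
        double t = b[k]; b[k] = b[piv[k]]; b[piv[k]] = t;
    }
    for (int k = 0; k < n; ++k)             /* forward: L*y = P*b */
        for (int i = k + 1; i < n; ++i)
            b[i] -= a[i + k * n] * b[k];
    for (int k = n - 1; k >= 0; --k) {      /* backward: U*x = y */
        b[k] /= a[k + k * n];
        for (int i = 0; i < k; ++i)
            b[i] -= a[i + k * n] * b[k];
    }
}
```

Calling lu_factor once and lu_solve for each right-hand side mirrors why the two LAPACK routines are separate: the O(n^3) factorization is reused, and each additional solve costs only O(n^2).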

Page 22:

GNU Scientific Library
http://www.gnu.org/software/gsl/
GSL is a numerical library for C and C++ programmers
Free software, available under GNU GPL
The library provides a wide range of mathematical routines
  e.g. random number generators
  special functions
  least-squares fitting
  There are over 1000 functions in total.
The project was conceived in 1996 by Dr M. Galassi and Dr J. Theiler of Los Alamos National Laboratory.

Page 23:

GSL Features
The library uses an object-oriented design
  Different algorithms can be plugged-in easily or changed at run-time without recompiling the program
It is intended for ordinary scientific users
  Users with a knowledge of C programming will be able to use the library quickly
Interface is designed to be simple to link into very high-level languages, such as GNU Guile or Python
Library is thread-safe
Many of the routines are C “re”implementations of FORTRAN routines (e.g. FFTPACK)
  Modern coding conventions and optimizations have been applied

Page 24:

Full list of functions
Complex Numbers            Roots of Polynomials       Special Functions
Vectors and Matrices       Permutations               Sorting
BLAS Support               Linear Algebra             Eigensystems
Fast Fourier Transforms    Quadrature                 Random Numbers
Quasi-Random Sequences     Random Distributions       Statistics
Histograms                 N-Tuples                   Monte Carlo Integration
Simulated Annealing        Differential Equations     Interpolation
Numerical Differentiation  Chebyshev Approximation    Series Acceleration
Discrete Hankel Transforms Root-Finding               Minimization
Least-Squares Fitting      Physical Constants         IEEE Floating-Point

Page 25:

Compiling and Linking
The library header files are installed in their own `gsl' directory
  Include statements need `gsl/' directory prefix:
  #include <gsl/gsl_math.h>
Compile objects first: gcc -c example.c
Then link: gcc example.o -lgsl -lgslcblas -lm

Page 26:

FFTW
http://www.fftw.org
“Fastest Fourier Transform in the West”
Authored by Frigo and Johnson at MIT
C subroutine library for discrete Fourier transforms
  Portable
  Multiple dimensions
  Arbitrary input sizes, real and complex transforms
    Small prime factors are best though
  Discrete cosine and sine transforms
  Parallel versions available (both shared (pthreads) and distributed memory (MPI))
  C and FORTRAN API
  Supports SIMD extensions (e.g. SSE)
Self-tuning
  Contains many different FFT algorithms and the optimal one is chosen at runtime
Has undergone a number of evolutions, and is now at version 3.0
Won 1999 J. H. Wilkinson Prize for Numerical Software

Page 27:

Using FFTW
Need to include header files
  C: #include <fftw3.h>
  FORTRAN: include “fftw3.f”
Must also link to libraries
  -lfftw3 -lm but may also need to specify path – will be installation dependent
Having created arrays (“in” and “out”), must create a “plan”
  C: plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
  FORTRAN: call dfftw_plan_dft_1d(plan, N, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
  Precise plan routine will depend upon the FFT operation you wish to perform
  Call to plan allows system to evaluate architecture and transform and then optimize the algorithm to be used in the FFT
Having created the plan the transform is executed by calling fftw_execute(plan)
See the fftw website for precise details

Page 28:

2 GHz Opteron speeds

Page 29:

2 GHz Opteron speeds

Page 30:

2 GHz Opteron speeds

Page 31:

Parallel FFTW: Shared Memory
FFTW includes a pthreads based SMP library and can also be compiled with OpenMP support on platforms where it is available
  On HPCVL it is compiled with OpenMP support
Threaded version requires additional memory
  Call fftw_init_threads() before using the threaded version
SMP parallel plans require knowledge of how many threads are going to be used
  Call fftw_plan_with_nthreads(nthreads)
  Note that since plans are specific to the number of threads, if you change the number of threads you must create a new plan
When work is completed you must call fftw_cleanup_threads() to deallocate memory for threads
At linking stage must also include parallel library
  -lfftw3_threads
Note that only fftw_execute runs in a parallel region

Page 32:

Parallel FFTW: MPI
Only available for older 2.x libraries which have a different API
MPI data decomposition is “slab” based.
  For 3d arrays this is potentially limiting – can use at most L processors if you have an L^3 array
  However, communication costs are high so this is not often a significant barrier
Uses MPI_Alltoall primitive which can occasionally lead to poor performance (depends on MPI implementation)
Must enable support when FFTW is compiled and must also link to -lfftw_mpi
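The slab limit above follows from how the leading dimension is divided: each process owns a contiguous run of planes, so with L planes no more than L processes can hold data. The sketch below shows one even split; slab_range is a hypothetical helper written for this illustration, and in FFTW 2.x the actual distribution should be read back from the library (via fftwnd_mpi_local_sizes) rather than assumed.

```c
/* Divide n planes among nprocs ranks as evenly as possible.
   On return, rank owns planes [*start, *start + *count). */
void slab_range(int n, int nprocs, int rank, int *start, int *count)
{
    int base = n / nprocs;
    int extra = n % nprocs;   /* the first `extra` ranks get one more plane */
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}
```

With n = 4 and nprocs = 8, ranks 4 to 7 receive a count of zero, which is the idle-processor situation the slide warns about.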

Page 33:

Example code

#include <fftw_mpi.h>

int main(int argc, char **argv)
{
    const int NX = ..., NY = ...;
    fftwnd_mpi_plan plan;
    fftw_complex *data;

    MPI_Init(&argc, &argv);

    /* Create a plan for a 2d forward transform distributed over all ranks */
    plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY,
                                  FFTW_FORWARD, FFTW_ESTIMATE);

    ...allocate and initialize data...

    /* Execute the transform in place, keeping the normal output ordering */
    fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);

    ...

    /* Release the plan before shutting down MPI */
    fftwnd_mpi_destroy_plan(plan);
    MPI_Finalize();
}

Page 34:

(Old) Performance results on T3D for MPI transform (3d, complex)

Page 35:

AMD Core Math Library (ACML)
Developed in collaboration with Numerical Algorithms Group (NAG)
  Latest version = 3.1
Distribution via registration (but they have never sent me spam!)
32 bit (Athlon) and 64 bit (Opteron) versions
Cannot be linked with Intel 8.1 compiler though – turf war!
  Forces you to use Intel MKL
Exploits knowledge of cache architecture to improve execution speed

Page 36:

ACML Components: Linear Algebra, FFTs
Basic Linear Algebra Subprograms (BLAS)
  Level 1 (vector-vector operations)
  Level 2 (matrix-vector operations)
  Level 3 (matrix-matrix operations)
  Plus routines for sparse vectors
Linear Algebra PACKage (LAPACK)
  28 (threaded) routines
  Use BLAS to perform complex operations
Scalable LAPACK (ScaLAPACK, MPI parallel LAPACK) also included
  Must provide your own MPI implementation (see part 2)
FFTs
  1D, 2D, single and double precision plus all combinations of real-to-complex etc
C and FORTRAN APIs

Page 37:

Intel Math Kernel Library
Version 9.0 recently released
Free for non-commercial use
  Students come under this banner, but faculty do not!
  Graduate students are becoming a grey area…
Online support forum
Library functions:
  Linear Algebra - BLAS and LAPACK
  Linear Algebra - PARDISO Sparse Solver
  Discrete Fourier Transforms
  Vector Math Library
  Vector Statistical Library
    random number generators

Page 38:

Cluster Math Kernel Library
Adds ScaLAPACK and parallel BLAS routines to MKL
Roughly 20% performance improvement over Netlib distribution of ScaLAPACK.

Page 39:

PESSL, SCSL & CXML
SGI provide their SCSL library free
  “Scientific Computing Software Library”
  Provides same basic features as ACML (linear algebra)
  Ported to Altix systems, but need to compare speed to Intel MKL before using
PESSL is IBM’s parallel library
  “Parallel Engineering and Scientific Subroutine Library”
  Again, same basic features as ACML, and also includes random number generator
CXML is Compaq’s library for the Alphaserver

Page 40:

Random Number Generators
Numerical recipes RAN2 and RAN3 are both reasonable RNGs
  Note RAN3 does fail some of the more esoteric tests
GSL library provides over 40 different generators
  Includes Knuth’s algorithms
  Mersenne Twister as well

Page 41: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Mersenne Twister

• http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
• Developed by Matsumoto and Nishimura
• Period is 2^19937-1 (about 10^6000)
• 623-dimensional equidistribution property is assured
• Fast generation
  • The algorithm behind C rand() has since been replaced, so there is now not much difference in speed
• Efficient use of memory
  • The C implementation mt19937.c consumes only 624 words of working area
• Currently the generator of choice for most problems (except cryptography)
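To make the "624 words of working area" concrete, here is a minimal sketch of the 32-bit MT19937 recurrence in C. It follows the published algorithm, but the type and function names (`mt19937_t`, `mt_seed`, `mt_next`) are made up for this illustration and are not the interface of mt19937.c or of GSL.

```c
#include <stdint.h>

#define MT_N 624   /* words of state: the "624 words of working area" */
#define MT_M 397   /* offset used by the recurrence */

/* Illustrative generator state (names are hypothetical). */
typedef struct {
    uint32_t state[MT_N];
    int index;             /* next word of state to hand out */
} mt19937_t;

/* Fill the state from a single 32-bit seed. */
void mt_seed(mt19937_t *g, uint32_t seed)
{
    g->state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = g->state[i - 1];
        g->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    g->index = MT_N;       /* force a regeneration on first use */
}

/* Return the next 32-bit output. */
uint32_t mt_next(mt19937_t *g)
{
    if (g->index >= MT_N) {
        /* Regenerate all 624 words in place ("twist"). */
        for (int i = 0; i < MT_N; i++) {
            uint32_t y = (g->state[i] & 0x80000000u)
                       | (g->state[(i + 1) % MT_N] & 0x7fffffffu);
            uint32_t next = g->state[(i + MT_M) % MT_N] ^ (y >> 1);
            if (y & 1u)
                next ^= 0x9908b0dfu;
            g->state[i] = next;
        }
        g->index = 0;
    }

    /* Tempering improves the equidistribution of the output bits. */
    uint32_t y = g->state[g->index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

Seeding with 5489 and calling `mt_next` once should return 3499211612, which is a convenient check that the recurrence has been typed in correctly. For production work, prefer a library implementation such as the one in GSL.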

Page 42: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Summary Part 1

• Libraries offer a number of benefits
  • Optimization
  • Robustness
  • Portability
  • Time to solution improvements

Page 43: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Part 2: Parallel Libraries

• BLACS
• ACTS collection
• ScaLAPACK

Page 44: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

BLACS

• Basic Linear Algebra Communication Subprograms
• Conceptual aid in design and coding (design tool)
• Associates widely known mnemonic names with communication
  • Improves readability and provides a standard interface
  • “Self documentation”

Page 45: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

BLACS data decomposition

[Figure: 2D processor grid. Nine processes, numbered 0 to 8 row by row, arranged as a 3 x 3 grid with row indices 0 to 2 and column indices 0 to 2.]

Types of BLACS routines: point-to-point communication, broadcast, combine operations, and support routines.

Communication modes:
• All processes in a row
• All processes in a column
• All grid processes
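The grid numbering and the three communication modes can be sketched in a few lines of plain C. This is only an illustration of the idea, assuming a row-major ordering of processes; `grid_coord_t`, `rank_to_coord`, `coord_to_rank` and `in_scope` are hypothetical helpers, not BLACS routines.

```c
/* Illustrative (row, column) position of a process in the 2D grid. */
typedef struct {
    int row;
    int col;
} grid_coord_t;

/* Row-major numbering: ranks fill row 0 first, then row 1, and so on. */
grid_coord_t rank_to_coord(int rank, int npcol)
{
    grid_coord_t c = { rank / npcol, rank % npcol };
    return c;
}

/* Inverse mapping from grid position back to rank. */
int coord_to_rank(int row, int col, int npcol)
{
    return row * npcol + col;
}

/* Does `other` take part in an operation issued by `me` with this scope?
 * 'R' = all processes in my row, 'C' = all processes in my column,
 * 'A' = all grid processes. Any other letter matches nobody. */
int in_scope(char scope, grid_coord_t me, grid_coord_t other)
{
    switch (scope) {
    case 'R': return me.row == other.row;
    case 'C': return me.col == other.col;
    case 'A': return 1;
    default:  return 0;
    }
}
```

On the 3 x 3 grid, rank 4 sits at row 1, column 1, so a row-scoped operation from it reaches ranks 3 and 5, and a column-scoped one reaches ranks 1 and 7.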

Page 46: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Communication Routines

• Send/Receive: send a (sub)matrix from one process to another
  • _xxSD2D(ICTXT, [UPLO, DIAG], M, N, A, LDA, RDEST, CDEST)
  • _xxRV2D(ICTXT, [UPLO, DIAG], M, N, A, LDA, RSRC, CSRC)
• _ denotes the datatype: I (integer), S (single), D (double), C (complex), Z (double complex)
• xx denotes the matrix type: GE = general, TR = trapezoidal

Page 47: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Point-to-Point example

      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL,
     &                     MYROW, MYCOL )

      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         CALL DGESD2D( ICTXT, 5, 1, X, 5, 1, 0 )
      ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
         CALL DGERV2D( ICTXT, 5, 1, Y, 5, 0, 0 )
      END IF

Page 48: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Contexts

• The concept of a communicator is embedded within BLACS as a “context”
• Contexts are thus the mechanism by which you:
  • Create arbitrary groups of processes upon which to execute
  • Create an indeterminate number of overlapping or disjoint grids
  • Isolate each grid so that grids do not interfere with each other
• Initialization routines return a context (integer), which is then passed to the communication routines
  • Equivalent to specifying COMM in MPI calls

Page 49: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ID-less communication

• Messages with BLACS are tagless
  • IDs are generated internally within the library
• Why is this an issue?
  • If tags are not unique, it is possible to create non-deterministic behaviour (race conditions on message arrival)
• BLACS allows the user to specify what range of IDs the BLACS can use
  • This ensures it can be used with other packages

Page 50: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ACTS Collection

• “Advanced CompuTational Software”
  • Set of software tools
• US Department of Energy program, run in conjunction with NSF and DARPA
  • Extended support for experimental software
  • Provide technical support ([email protected])
  • Maintain the ACTS information center (http://acts.nersc.gov)
  • Coordinate efforts with US supercomputing centers
  • Enable large-scale scientific applications
  • Educate and train
• It is unclear how far support extends beyond US borders, although there are registered users across the globe

Page 51: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ACTS is a guided project

Page 52: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ACTS Motivation
Large Scientific Codes: A Common Programming Practice

[Diagram: components of a typical large code]
• Tuned and machine-dependent modules
• Application data layout
• Control, I/O
• Algorithmic implementations

[Diagram annotations: cost of change]
• New architecture or software: extensive tuning; may require new programming paradigms; difficult to maintain!
• New architecture: extensive rewriting. New or extended physics: extensive rewriting or increased overhead.
• New architecture: may or may not need rewriting. New developments: difficult to compare.
• New architecture: minimal to extensive rewriting.

Page 53: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

The ACTS (“ideal”) Approach

[Diagram: the user's application code (main control) sits on top of the components below, each marked as covered by available libraries and packages.]
• Tuned and machine-dependent modules
• Application data layout
• I/O
• Algorithmic implementations

Page 54: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ACTS Tools and functions

Category: Numerical
• Aztec: Algorithms for the iterative solution of large sparse linear systems.
• Hypre: Algorithms for the iterative solution of large sparse linear systems, intuitive grid-centric interfaces, and dynamic configuration of parameters.
• PETSc: Tools for the solution of PDEs that require solving large-scale, sparse linear and nonlinear systems of equations.
• OPT++: Object-oriented nonlinear optimization package.
• SUNDIALS: Solvers for systems of ordinary differential equations, nonlinear algebraic equations, and differential-algebraic equations.
• ScaLAPACK: Library of high performance dense linear algebra routines for distributed-memory message-passing.
• SuperLU: General-purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations.
• TAO: Large-scale optimization software, including nonlinear least squares, unconstrained minimization, bound constrained optimization, and general nonlinear optimization.

Category: Code Development
• Global Arrays: Library for writing parallel programs that use large arrays distributed across processing nodes and that offers a shared-memory view of distributed arrays.
• Overture: Object-oriented tools for solving computational fluid dynamics and combustion problems in complex geometries.

Category: Code Execution
• CUMULVS: Framework that enables programmers to incorporate fault-tolerance, interactive visualization and computational steering into existing parallel programs.
• Globus: Services for the creation of computational Grids and tools with which applications can be developed to access the Grid.
• PAWS: Framework for coupling parallel applications within a component-like model.
• SILOON: Tools and run-time support for building easy-to-use external interfaces to existing numerical codes.
• TAU: Set of tools for analyzing the performance of C, C++, Fortran and Java programs.

Category: Library Development
• ATLAS and PHiPAC: Tools for the automatic generation of optimized numerical software for modern computer architectures and compilers.
• PETE: Extensible implementation of the expression template technique (C++ technique for passing expressions as function arguments).

Page 55: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ATLAS

• Automatically Tuned Linear Algebra Software
  • Another University of Tennessee project!
  • Largely an unsupported project though
  • http://math-atlas.sourceforge.net/
• Provides a subset of both BLAS and LAPACK functionality
  • Provided the foundation for work on BLAS and LAPACK in AMD’s ACML
• Takes the optimization step further by giving the computer itself possibilities for optimization at compile time
  • “AEOS”: Automated Empirical Optimization of Software
  • Similar motivation to FFTW

Page 56: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ATLAS Benchmarks

DGEMM performance:

ARCH                     ATLAS   COMP         % Peak   PEAK (Gflop)
900 MHz Itanium2         3.6.0   icc          90%      3.6
1.6 GHz Opteron          3.6.0   gcc          88%      3.2
1062 MHz UltraSPARC III  3.7.8   gcc 3.3      82%      2.124
600 MHz Athlon           3.5.7   gcc 2.95.3   80%      1.2
2.8 GHz Pentium4E        3.7.3   gcc 3.3.2    77%      5.6
2.6 GHz Pentium4         3.6.0   gcc          77%      5.2
1 GHz PentiumIII         3.7.7   gcc 2.95.3   76%      1
1 GHz Efficeon           3.7.7   gcc 3.2      60%      2
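The "% Peak" column relates a measured DGEMM rate to the theoretical peak of the processor. A minimal sketch of that arithmetic in C, assuming the usual convention that an m x k by k x n matrix multiply costs 2mnk floating point operations; the function names are invented for this illustration.

```c
/* Conventional operation count for C = A*B with A (m x k) and B (k x n):
 * one multiply and one add per inner-product term. */
double dgemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Convert an operation count and a wall-clock time into Gflop/s. */
double gflops(double flops, double seconds)
{
    return flops / seconds / 1.0e9;
}

/* Achieved rate as a percentage of the theoretical peak. */
double percent_of_peak(double achieved_gflops, double peak_gflops)
{
    return 100.0 * achieved_gflops / peak_gflops;
}

/* Inverse: recover the achieved rate from a "% Peak" table entry. */
double achieved_from_percent(double percent, double peak_gflops)
{
    return percent / 100.0 * peak_gflops;
}
```

For example, the Itanium2 entry (90% of a 3.6 Gflop peak) corresponds to roughly 3.24 Gflop/s sustained in DGEMM.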

Page 57: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

PETSc

• Portable, Extensible Toolkit for Scientific Computation
  • http://www-unix.mcs.anl.gov/petsc/petsc-as/
  • Argonne lab development
• Suite of data structures and routines for the scalable (parallel) solution of PDEs
  • Intended for use in large-scale application projects
  • Not a black box solution though
• Easily interfaces with solvers written in C, FORTRAN and C++
• All components are designed to be interoperable
• Works in a distributed memory environment using MPI

Page 58: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Levels of Abstraction in Mathematical Software

• Application-specific interface
  • Programmer manipulates objects associated with the application
• High-level mathematics interface
  • Programmer manipulates mathematical objects
    • Weak forms, boundary conditions, meshes
• Algorithmic and discrete mathematics interface
  • Programmer manipulates mathematical objects
    • Sparse matrices, nonlinear equations
  • Programmer manipulates algorithmic objects
    • Solvers
• Low-level computational kernels
  • BLAS-type operations
  • FFT

[Diagram label: PETSc emphasis]

Page 59: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Features

• Parallel vectors
  • scatters
  • gathers
• Parallel matrices
  • several sparse storage formats
  • easy, efficient assembly
• Scalable parallel preconditioners
• Krylov subspace methods
• Parallel Newton-based nonlinear solvers
• Parallel timestepping (ODE) solvers
• Complete documentation
• Automatic profiling of floating point and memory usage
• Consistent interface
• Intensive error checking
• Portable to UNIX and Windows
• Over one hundred examples
• PETSc is supported and will be actively enhanced for the next several years.

Page 60: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Structure of PETSc – Layered Approach

[Diagram: layered structure; boxes listed from the application level down to the kernels]
• PETSc PDE application codes
• ODE integrators; visualization interface
• Nonlinear solvers
• Linear solvers: preconditioners + Krylov methods
• Object-oriented matrices, vectors, indices; grid management
• Profiling interface
• Computation and communication kernels: MPI, MPI-IO, BLAS, LAPACK

Page 61: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Functionality example: selected vector operations

Function Name                                   Operation
VecAXPY(Scalar *a, Vec x, Vec y)                y = y + a*x
VecAYPX(Scalar *a, Vec x, Vec y)                y = x + a*y
VecWAXPY(Scalar *a, Vec x, Vec y, Vec w)        w = a*x + y
VecScale(Scalar *a, Vec x)                      x = a*x
VecCopy(Vec x, Vec y)                           y = x
VecPointwiseMult(Vec x, Vec y, Vec w)           w_i = x_i * y_i
VecMax(Vec x, int *idx, double *r)              r = max x_i
VecShift(Scalar *s, Vec x)                      x_i = s + x_i
VecAbs(Vec x)                                   x_i = |x_i|
VecNorm(Vec x, NormType type, double *r)        r = ||x||
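The three update operations in the table differ only in which vector is overwritten and which operand is scaled. The serial sketch below spells that out on plain C arrays; it is not PETSc code, and the function names `axpy`, `aypx` and `waxpy` are stand-ins chosen for this illustration.

```c
#include <stddef.h>

/* y = y + a*x  (the operation performed by VecAXPY) */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* y = x + a*y  (the operation performed by VecAYPX): y is scaled, x is not */
void aypx(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] + a * y[i];
}

/* w = a*x + y  (the operation performed by VecWAXPY): x and y are untouched */
void waxpy(size_t n, double a, const double *x, const double *y, double *w)
{
    for (size_t i = 0; i < n; i++)
        w[i] = a * x[i] + y[i];
}
```

In PETSc each process would apply the same loop to its locally owned part of a distributed vector, which is why these operations need no communication.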

Page 62: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

A Complete PETSc Program

#include "petscvec.h"
int main(int argc,char **argv)
{
  Vec         x;
  int         n = 20,ierr;
  PetscTruth  flg;
  PetscScalar one = 1.0, dot;

  PetscInitialize(&argc,&argv,0,0);
  PetscOptionsGetInt(PETSC_NULL,"-n",&n,PETSC_NULL);
  VecCreate(PETSC_COMM_WORLD,&x);
  VecSetSizes(x,PETSC_DECIDE,n);
  VecSetFromOptions(x);
  VecSet(&one,x);
  VecDot(x,x,&dot);
  PetscPrintf(PETSC_COMM_WORLD,"Vector length %d\n",(int)dot);
  VecDestroy(x);
  PetscFinalize();
  return 0;
}

Page 63: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

TAO

• Toolkit for Advanced Optimization
  • http://www-unix.mcs.anl.gov/tao/
  • Another Argonne project
• Aimed at the solution of large-scale optimization problems on high-performance architectures
  • Suitable for both single-processor and massively parallel architectures
  • Object-oriented approach
• Interoperable with other toolkits (PETSc, for example)

Page 64: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Functionality

• Systems of nonlinear equations
• Nonlinear least squares
• Bound-constrained optimization
• Linear and quadratic programming
• Nonlinearly constrained optimization
• Combinatorial optimization
• Stochastic optimization
• Global optimization

Page 65: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Example program

TAO tao;                     /* optimization solver */
Mat H;                       /* Hessian matrix */
Vec x, g;                    /* solution and gradient vectors */
double f;                    /* function to minimize */
int n;                       /* number of variables */
ApplicationCtx usercontext;  /* user-defined context */

MatCreate(MPI_COMM_WORLD,n,n,&H);
VecCreate(MPI_COMM_WORLD,n,&x);
VecDuplicate(x,&g);

TaoCreate(MPI_COMM_WORLD,&tao);
TaoSetFunction(tao,x,EvaluateFunction,usercontext);
TaoSetGradient(tao,g,EvaluateGradient,usercontext);
TaoSetHessian(tao,H,EvaluateHessian,usercontext);
TaoSolve(tao);
TaoDestroy(tao);

Page 66: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.
Page 67: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

ScaLAPACK

• Scalable LAPACK
• Development team
  • University of Tennessee
  • University of California at Berkeley
  • ORNL, Rice U., UCLA, UIUC etc.
• Support in commercial packages
  • NAG Parallel Library (including Intel MKL and AMD ACML)
  • IBM PESSL
  • CRAY Scientific Library and SGI SCSL
  • VNI IMSL
  • Fujitsu, HP/Convex, Hitachi, NEC

Page 68: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Important details

• Web page: http://www.netlib.org/scalapack
  • Includes the ScaLAPACK User’s Guide
• Language: Fortran
• Dense matrix problem solvers
  • Linear equations
  • Least squares
  • Eigenvalue

[Diagram: package dependencies among ScaLAPACK, PBLAS, LAPACK, BLACS, BLAS, and the message-passing layer (MPI, PVM, ...)]

Page 69: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Components of the API

• Drivers
  • Solve a complete problem
• Computational components
  • Perform tasks: LU factorization, etc.
• Auxiliary routines
  • Scaling, matrix norm, etc.
• Matrix redistribution/copy routine
  • Matrix on PE grid1 -> matrix on PE grid2

Page 70: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

API (cont.)

• LAPACK names with a P prefix: PXYYZZZ
  • X: data type
  • YY: matrix type
  • ZZZ: computation performed

Data type   real   double   cmplx   dble cmplx
X           S      D        C       Z
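The PXYYZZZ convention is regular enough to decode mechanically. The sketch below splits a routine name into its three fields; `pname_t`, `decode_pname` and `data_type_name` are hypothetical helpers written for this illustration and are not part of ScaLAPACK. It assumes names of five to seven characters, which covers drivers such as PDGESV and computational routines such as PDGETRF.

```c
#include <string.h>

/* Fields of a ScaLAPACK-style routine name PXYYZZZ (illustrative type). */
typedef struct {
    char data_type;        /* X: one of S, D, C, Z */
    char matrix_type[3];   /* YY, e.g. "GE"        */
    char computation[4];   /* ZZZ, e.g. "SV", "TRF" */
} pname_t;

/* Split `name` into its fields. Returns 0 on success, -1 if the name does
 * not follow the P-prefixed convention. */
int decode_pname(const char *name, pname_t *out)
{
    size_t len = strlen(name);

    if (len < 5 || len > 7 || name[0] != 'P')
        return -1;
    if (strchr("SDCZ", name[1]) == NULL)
        return -1;

    out->data_type = name[1];
    memcpy(out->matrix_type, name + 2, 2);
    out->matrix_type[2] = '\0';
    strcpy(out->computation, name + 4);   /* at most 3 characters remain */
    return 0;
}

/* Human-readable form of the data type letter. */
const char *data_type_name(char x)
{
    switch (x) {
    case 'S': return "real";
    case 'D': return "double";
    case 'C': return "complex";
    case 'Z': return "double complex";
    default:  return "unknown";
    }
}
```

Decoding "PDGESV" gives data type D (double), matrix type GE (general) and computation SV (solve), which is the parallel counterpart of LAPACK's DGESV.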

Page 71: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

TAU

• Tuning and Analysis Utilities
  • University of Oregon development
  • http://www.cs.uoregon.edu/research/paracomp/tau/tautools/
• Program and performance analysis tool framework for high-performance parallel and distributed computing
  • TAU provides a suite of tools for the analysis of C, C++, FORTRAN 77/90, Python, High Performance FORTRAN, and Java programs

Page 72: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Usage

• Instrument the program by inserting TAU macros into the program (this can be done automatically).
• Run the program. Files containing information about the program performance are automatically generated.
• View the results with TAU's pprof, the TAU visualizer racy (or paraprof), or a third-party visualizer (such as VAMPIR).

Page 73: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

pprof

Page 74: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Additional facilities

• TAU collects much more information than what is available through prof or gprof, the standard Unix utilities. Also available through TAU are:
  • Per-process, per-thread and per-host information (supports pthreads)
  • Inclusive and exclusive function times
  • Profiling groups that allow you to organize data collection
  • Access to hardware counters on some systems
  • Per-class and per-instance information
  • Separate data for each template instantiation
  • Start/stop timers for profiling arbitrary sections of code
  • Support for collection of statistics on user-defined events
• TAU is designed so that when you turn off profiling (by disabling TAU macros) there is no overhead

Page 75: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

CACTUS

• http://www.cactuscode.org/
• Developed in response to the needs of large-scale projects (initially developed for General Relativity calculations, which have a large computation-to-communication ratio)
• Numerical/computational infrastructure to solve PDEs
• Freely available, open source community framework
• Cactus is divided into the “Flesh” (core) and “Thorns” (modules or collections of subroutines)
• Multilingual: user applications in Fortran, C, C++; automated interface between them
• Abstraction: the Cactus Flesh provides an API for virtually all CS-type operations
  • Storage, parallelization, communication between processors, etc.
  • Interpolation, reduction
  • I/O (traditional, socket based, remote viz and steering…)
  • Checkpointing, coordinates
• “Grid Computing”: Cactus team and many collaborators worldwide, especially NCSA, Argonne/Chicago, LBL

Page 76: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Modularity of Cactus...

[Diagram: modules plugged into the Cactus Flesh. The user selects the desired functionality and the code is created.]
• Applications: Application 1, Application 2, ..., Sub-app, Legacy App 2, Symbolic Manip App
• Abstractions: AMR (GrACE, etc), Unstructured..., MPI layer 3, I/O layer 2, Remote Steer 2, MDS/Remote Spawn
• Globus Metacomputing Services

Page 77: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Cactus & the Grid

[Diagram: layers from the application down to the communication library]
• Cactus Application Thorns: distribution information hidden from the programmer. Initial data, evolution, analysis, etc.
• Grid Aware Application Thorns: drivers for parallelism, I/O, communication, data mapping. PUGH: parallelism via MPI (MPICH-G2, grid-enabled message passing library)
• Grid Enabled Communication Library: MPICH-G2 implementation of MPI, can run MPI programs across heterogeneous computing resources
• Standard MPI
• Single Proc

Page 78: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

The Flesh

• Abstract API
  • Evolve the same PDE with unigrid, AMR (MPI or shared memory, etc.) without having to change any of the application code.
• Interfaces
  • The set of data structures that a thorn exports to the world (global), to its friends (protected) and to nobody (private), and how these are inherited.
• Implementations
  • Different thorns may implement e.g. the evolution of the same PDE, and we select the one we want at runtime.
• Scheduling
  • Call the routines of every thorn in a certain order, and handle their interdependencies.
• Parameters
  • Many types of parameters, with all of their essential consistency checked before running.

Page 79: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Summary Part 2

• ACTS is a collection of software for HPC that includes a number of useful tools
  • Numerical libraries
  • Code development software
  • Profiling software
• ScaLAPACK extends LAPACK to distributed memory architectures
  • Built on top of PBLAS, which uses BLACS

Page 80: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Part 3: Odds and ends

• Netlib and other useful websites
• HPL library
• VTK

Page 81: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

Netlib

• The Netlib repository contains freely available software, documents, and databases of interest to the numerical and scientific computing communities
• The repository is maintained by
  • AT&T Bell Laboratories
  • University of Tennessee
  • Oak Ridge National Laboratory
• The collection is mirrored at several sites around the world
  • Kept synchronized
• Effective search engine to help locate software of potential use

Page 82: High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics.

High Performance LINPACK

Portable and freely available implementation of the LINPACK Benchmark, used for Top500 ranking

Developed at UTK Innovative Computing Laboratory
  A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary

HPL solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers
  Requires MPI 1.1 be installed
  Also requires an implementation of either the BLAS or the Vector Signal Image Processing Library (VSIPL)
  Provides a testing and timing program
    Quantifies the accuracy of the obtained solution as well as the time it took to compute it
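To make concrete what "accuracy of the obtained solution as well as the time" means, here is a minimal single-process sketch in plain Python. It is not HPL: there is no MPI, no BLAS, and no blocking. The function names (lu_solve, scaled_residual, run_benchmark) are made up for illustration, the operation count 2/3 n^3 + 2 n^2 is the nominal LINPACK figure, and the pass threshold of 16.0 is assumed to match HPL's usual default.

```python
import random
import sys
import time


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Works on copies, so the caller's matrix and right-hand side are untouched.
    """
    n = len(a)
    lu = [row[:] for row in a]
    x = b[:]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = lu[i][k] / lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
            x[i] -= m * x[k]
    # Back substitution on the upper-triangular part.
    for i in range(n - 1, -1, -1):
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / lu[i][i]
    return x


def inf_norm_vector(v):
    return max(abs(t) for t in v)


def inf_norm_matrix(a):
    return max(sum(abs(t) for t in row) for row in a)


def scaled_residual(a, x, b):
    """||Ax - b|| / (eps * (||A|| ||x|| + ||b||) * n), all in the infinity norm."""
    n = len(a)
    eps = sys.float_info.epsilon
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    scale = eps * (inf_norm_matrix(a) * inf_norm_vector(x) + inf_norm_vector(b)) * n
    return inf_norm_vector(r) / scale


def run_benchmark(n, seed=0):
    """Time the solve of a random n x n system; return (residual, seconds, Gflop/s)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    start = time.perf_counter()
    x = lu_solve(a, b)
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2  # nominal LINPACK operation count
    gflops = flops / elapsed / 1e9 if elapsed > 0 else float("inf")
    return scaled_residual(a, x, b), elapsed, gflops


if __name__ == "__main__":
    residual, seconds, rate = run_benchmark(100)
    # Assumed threshold: HPL's usual default is taken to be 16.0.
    status = "PASSED" if residual < 16.0 else "FAILED"
    print(f"n=100 time={seconds:.4f}s rate={rate:.4f} Gflop/s residual={residual:.3e} {status}")
```

The rate printed by an interpreted loop like this is orders of magnitude below what a tuned BLAS delivers, which is the point of linking HPL against an optimized library.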


Rice University HPC software

Center for High Performance Software Research (HiPerSoft)
  Established in October 1998
  http://www.hipersoft.rice.edu/

Rice has a strong history of innovative HPC tools
  HPCToolkit is an open-source suite of multi-platform tools for profile-based performance analysis of applications


HPCToolkit

The toolkit components include:
  hpcrun: a tool for profiling executions of unmodified application binaries using statistical sampling of hardware performance counters.
  hpcprof & xprof: tools for interpreting sample-based execution profiles and relating them back to program source lines.
  bloop: a tool for analyzing application binaries to recover program structure; namely, to identify where loops are present and what program source lines they contain.
  hpcview: a tool for correlating program structure information, multiple sample-based performance profiles, and program source code to produce a performance database.
  hpcviewer: a Java-based GUI for exploring databases consisting of performance information correlated with program source.

Supported platforms: Pentium+Linux, Opteron+Linux, Athlon+Linux, Itanium+Linux, Alpha+Tru64 and MIPS+Irix.

HPCToolkit is open-source software released with a BSD-like license.
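The idea behind relating a sample-based profile back to source lines can be shown with a toy example. This is not HPCToolkit code and does not use its file formats. The address-to-line table, the sampled addresses and the function names (attribute_samples, report) are all hypothetical, and a real tool would read the line map from the binary's debug information.

```python
import bisect
from collections import Counter

# Hypothetical line map: (start address, source file, line), sorted by address.
LINE_MAP = [
    (0x1000, "solver.c", 10),
    (0x1010, "solver.c", 11),
    (0x1040, "solver.c", 14),
    (0x1080, "main.c", 42),
]
END_ADDRESS = 0x10A0  # first address past the mapped code (also hypothetical)


def attribute_samples(samples, line_map=LINE_MAP, end=END_ADDRESS):
    """Count how many sampled program-counter values fall on each source line."""
    starts = [entry[0] for entry in line_map]
    counts = Counter()
    for pc in samples:
        if pc < starts[0] or pc >= end:
            counts[("<unknown>", 0)] += 1
            continue
        # The owning entry is the last one whose start address is <= pc.
        idx = bisect.bisect_right(starts, pc) - 1
        _, filename, line = line_map[idx]
        counts[(filename, line)] += 1
    return counts


def report(counts):
    """Return (file, line, samples, percent) rows, hottest line first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (filename, line, n, 100.0 * n / total)
        for (filename, line), n in counts.most_common()
    ]


if __name__ == "__main__":
    # Pretend these addresses were recorded each time a counter overflowed.
    samples = [0x1000, 0x1004, 0x1012, 0x1044, 0x1048, 0x107C, 0x1090, 0x2000]
    for filename, line, n, pct in report(attribute_samples(samples)):
        print(f"{filename}:{line}  {n} samples  {pct:.1f}%")
```

Because the samples are statistical, the per-line percentages only become meaningful once enough samples have been collected.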


CALGO

Collected Algorithms of the ACM
  http://www.acm.org/pubs/calgo/

All software is refereed for originality, accuracy, robustness, completeness, portability, and lasting value
  Use of ACM Algorithms is subject to the ACM Software Copyright and License Agreement

Available on CD


MGnet

www.mgnet.org

Site devoted to multigrid and adaptive mesh refinement algorithms
  Run by Craig Douglas

Has links to a number of packages for multigrid
  Some are public domain
  Others are copyrighted

Very useful resource for MG methods


NCSA

National Center for Supercomputing Applications
  www.ncsa.uiuc.edu

Their application repository is a very useful guide to what software is available in a given field


NHSE

National HPC Software Exchange
  www.nhse.org

Numerous reports, libraries

Unfortunately has been suspended in light of a lack of funding (2004)
  Access to the meta-repository is still available (and links therein)


VTK

The Visualization Toolkit
  http://public.kitware.com/VTK/what-is-vtk.php

Portable open-source software system for 3D computer graphics, image processing, and visualization
  Object-oriented approach

VTK is at a higher level of abstraction than rendering libraries like OpenGL

VTK applications can be written directly in C++, Tcl, Java, or Python

Large user community
  Many source code contributions


Summary Part 3

When looking for a library, the first place to stop is Netlib!


Next lecture

(Last lecture!) Productivity crisis, future of HPC