NEWSLETTER
August 2012

From the Director

The SUPER project is now up to full speed as we near the end of our first year. As can be seen from the list of publications in this newsletter and the research highlights on the SUPER website, we have made substantial progress already. Our four main focus areas – performance, energy, resilience, and overall optimization – have formed strong teams that collaborate closely.

This quarter's newsletter features the performance area, which has a broad focus including auto-tuning, performance modeling, end-to-end performance measurement, and tool integration. We aim to extend and integrate mature, robust performance measurement technologies and develop a comprehensive performance data management framework that can be used by other areas of the SUPER project. End-to-end performance measurement will correlate application-specific measurements with system-wide conditions at the time of execution. Performance measurements will feed into an integrated auto-tuning framework that encompasses identification and outlining of performance-critical kernels, compiler-based code transformations, and empirical optimization of runtime parameters.
Now that DOE is in the process of awarding the last of the SciDAC-3 Application Partnerships, we congratulate those of our SUPER researchers who are included in these recently awarded projects. We look forward to working closely with our new SciDAC colleagues when it is mutually beneficial. Our mature performance measurement and analysis tools are freely available and installed on the large systems at Argonne, NERSC, and Oak Ridge, where they can be used by all DOE application projects. The SUPER auto-tuning framework as a whole is still under development and will be made available for experimental use by SciDAC applications as it matures, along with our more speculative work on energy, resilience, and optimization.
- Bob Lucas
Upcoming Events:
• SciDAC-3 PI meeting, September 10-12, 2012, Rockville, MD. Invited SciDAC PIs.
• SUPER All-hands meeting, September 26-27, 2012, Argonne National Laboratory. http://www.mcs.anl.gov/~wild/meetings/super12.html. All SUPER participants and invited guests.
• SC'12, November 10-16, 2012, Salt Lake City, Utah. http://sc12.supercomputing.org/
• SUPER Meeting at SC'12, Monday, November 12, 2012, 9am-11:30am, Salt Lake City, Utah.
• PPoPP 2013, February 23-27, 2013, Shenzhen, China (co-located with HPCA 2013 and CGO 2013). http://ppopp2013.ics.uci.edu/. Abstracts due August 10, 2012; full papers due August 17, 2012.
• HPCA 2013, February 23-27, 2013, Shenzhen, China (co-located with PPoPP 2013 and CGO 2013). http://www.hpcaconf.org/hpca19. Abstracts due August 30, 2012; full papers, workshop and tutorial proposals due September 7, 2012.
• CGO 2013, February 23-27, 2013, Shenzhen, China (co-located with PPoPP 2013 and HPCA 2013). http://www.cgo.org/cgo2013/. Abstracts due September 6, 2012; full papers due September 10, 2012.
• IPDPS 2013, May 20-24, 2013, Boston, MA. http://www.ipdps.org/. Abstracts due September 24, 2012; full papers due October 1, 2012; new workshop proposals due August 1, 2012.
Performance Counter Monitoring for the Blue Gene/Q Architecture
By Heike McCraw, Shirley Moore, and Dan Terpstra
The Blue Gene/Q (BG/Q) system is the third generation in the IBM Blue Gene line of massively parallel, energy-efficient supercomputers. BG/Q will be capable of scaling to over a million processor cores, trading raw processor speed for lower power consumption. Performance analysis tools for parallel applications running on large-scale computing systems typically rely on hardware performance counters to gather performance-relevant data from the system. The Performance API (PAPI) has provided consistent, platform- and operating-system-independent access to CPU hardware performance counters for more than a decade [1]. PAPI has recently been extended to provide hardware performance data from other parts of the system, including the network, I/O devices, temperature sensors, power meters, and GPUs, while continuing to provide a uniform, portable interface [2,3].

In order to provide this consistency for BG/Q to the HPC community, and thanks to a close collaboration with IBM's Performance Analysis team and careful planning long before the BG/Q release, an extensive effort has been made to extend PAPI to support hardware performance monitoring for the BG/Q platform. This customization includes several PAPI components that provide valuable performance data originating not only from the processing cores but also from compute nodes, the network, and the system as a whole. More precisely, the additional components allow hardware performance counter monitoring of the 5D torus network, the I/O system, and the Compute Node Kernel, in addition to the CPU component. Details about the PAPI BG/Q components can be found in [4].
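As a minimal illustration of the measurement style PAPI supports, the sketch below (ours, not from the article) counts cycles and instructions around a region of interest using PAPI's classic high-level counter interface. PAPI_TOT_CYC and PAPI_TOT_INS are standard preset events, but their availability on any particular platform should be confirmed with the papi_avail utility.

    #include <stdio.h>
    #include <papi.h>

    /* Minimal PAPI sketch: count total cycles and instructions around a
     * region of interest. Compile with -lpapi. */
    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
        long long counts[2];
        double x = 0.0;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        for (int i = 1; i < 10000000; i++)    /* region of interest */
            x += 1.0 / i;

        if (PAPI_stop_counters(counts, 2) != PAPI_OK)
            return 1;

        printf("cycles=%lld instructions=%lld (x=%g)\n",
               counts[0], counts[1], x);
        return 0;
    }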
Many previous parallel 3D-FFT implementations have used a one-dimensional virtual processor grid, i.e., only one dimension is distributed among the processors and the remaining dimensions are kept locally. This has the advantage that one all-to-all communication is sufficient. However, for problem sizes of about one hundred points or more per dimension, this approach cannot offer scalability to several hundred or thousand processors as required for modern HPC architectures. For this reason, the developers of IBM's Blue Matter application have been promoting the use of a two-dimensional virtual processor grid for FFTs in three dimensions [6]. This requires two all-to-all type communications, as shown in Figure 1, which illustrates the parallelization of the 3D-FFT using a two-dimensional decomposition of the data. More details about the parallelization and our implementation can be found in [7].
Figure 1. Computational steps of the 3D-FFT implementation using a 2D decomposition
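The 2D decomposition can be set up with two communicator splits. The sketch below is ours (not the Blue Matter or [7] code): it builds the row and column communicators of a prows x pcols virtual grid so that each transpose step of the 3D-FFT becomes an MPI_Alltoall within a single communicator.

    #include <mpi.h>

    /* View P = prows * pcols tasks as a 2D virtual grid (e.g., 8 x 64 for
     * 512 tasks). Each of the two transpose steps of the 3D-FFT is then
     * an MPI_Alltoall within a row (or column) communicator only. */
    void make_fft_grid(MPI_Comm world, int pcols,
                       MPI_Comm *row_comm, MPI_Comm *col_comm)
    {
        int rank;
        MPI_Comm_rank(world, &rank);

        int r = rank / pcols;   /* row index in the virtual grid    */
        int c = rank % pcols;   /* column index in the virtual grid */

        MPI_Comm_split(world, r, c, row_comm);  /* tasks sharing a row    */
        MPI_Comm_split(world, c, r, col_comm);  /* tasks sharing a column */
    }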
We ran our 3D-FFT kernel on 32 nodes of the BG/Q early access system at Argonne National Laboratory, using all 16 cores per node. Hence, in total, we have 512 MPI tasks, and for the virtual process grid we chose 8 x 64, meaning that each subgroup has eight MPI tasks and we have 64 such subgroups. To learn how well the communication performs on the 5D torus, we used the new PAPI network component to sample various network-related events. The number of packets sent is shown in Figure 2; this includes packets that originate at, or pass through, the current node. We repeated the experiment for three different problem sizes (64, 128, and 256, all cubed). Note that these are the numbers for only the all-to-all communication within each subgroup, not including the second all-to-all communication between the subgroups. These numbers are so high because the default placement of MPI tasks onto the network results in a large amount of inter-node communication.
Figure 2. Number of 32-byte user point-to-point packet chunks sent, including packets originating at or passing through the current node
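In outline, sampling a network event around the subgroup all-to-all looks like the sketch below. The event name "PEVT_NW_USER_PP_SENT" is a placeholder for a BG/Q network-component native event (the real names are listed by papi_native_avail on the machine), and PAPI_library_init is assumed to have been called once at startup.

    #include <stdio.h>
    #include <mpi.h>
    #include <papi.h>

    /* Read one network counter around the subgroup all-to-all. Assumes
     * PAPI_library_init(PAPI_VER_CURRENT) was called at startup. */
    void alltoall_with_counter(void *sbuf, void *rbuf, int nbytes,
                               MPI_Comm subgroup)
    {
        int es = PAPI_NULL, code;
        long long chunks_sent = 0;

        if (PAPI_create_eventset(&es) != PAPI_OK ||
            PAPI_event_name_to_code("PEVT_NW_USER_PP_SENT", &code) != PAPI_OK ||
            PAPI_add_event(es, code) != PAPI_OK) {
            fprintf(stderr, "network event unavailable; skipping\n");
            return;
        }

        PAPI_start(es);
        MPI_Alltoall(sbuf, nbytes, MPI_BYTE, rbuf, nbytes, MPI_BYTE, subgroup);
        PAPI_stop(es, &chunks_sent);

        printf("packet chunks sent: %lld\n", chunks_sent);
        PAPI_cleanup_eventset(es);
        PAPI_destroy_eventset(&es);
    }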
The default mapping places each task of a subgroup on a different node. To reduce the amount of communication, we tried a customized mapping that placed all the tasks from one subgroup onto the same node. Since each subgroup has 8 MPI tasks, and since we have 16 cores per node, we can place two subgroups on each node. By doing so, all the high numbers reported for the network counter were reduced to zero, resulting in no inter-node communication at all. As can be seen from Figure 3, this gives us a performance improvement of up to a factor of 8.4 (depending on the problem size) for the first all-to-all. There was no degradation in performance for the second all-to-all with the customized mapping. For the entire 3D-FFT kernel we see an improvement of up to 22%, which is a decent improvement considering that we are running on a very small partition using only 32 nodes.
Figure 3. Performance comparison of the first all-to-all communication using default and customized mappings
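A customized mapping of this kind can be expressed as a map file with one "A B C D E T" line per MPI rank (see the BG/Q application development redbook [5]). The generator below is a schematic sketch only: it assumes the eight tasks of a subgroup are consecutive in rank order and invents a simple enumeration of the 32 nodes' torus coordinates; a real mapping would use the coordinates of the actual partition.

    #include <stdio.h>

    /* Schematic map-file generator: pack 16 consecutive ranks (two
     * 8-task subgroups) onto each of 32 nodes. Output: one
     * "A B C D E T" line per rank. */
    int main(void)
    {
        const int ntasks = 512, per_node = 16;

        for (int rank = 0; rank < ntasks; rank++) {
            int node = rank / per_node;   /* consecutive ranks share a node */
            int t    = rank % per_node;   /* hardware thread on that node   */

            /* Invented 5D coordinates for a 2x2x2x2x2 block of 32 nodes. */
            int a = (node >> 4) & 1, b = (node >> 3) & 1, c = (node >> 2) & 1,
                d = (node >> 1) & 1, e = node & 1;

            printf("%d %d %d %d %d %d\n", a, b, c, d, e, t);
        }
        return 0;
    }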
Acknowledgments

This material is based upon work supported by the U.S. Department of Energy Office of Science under contract DE-FC02-06ER25761. Access to the early access BG/Q system at Argonne National Laboratory was provided through the ALCF Early Science Program. We would like to thank the IBM Performance Team for collaborating with us on the implementation of PAPI for BG/Q.
References

[1] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, "A portable programming interface for performance evaluation on modern processors," International Journal of High Performance Computing Applications, vol. 14, no. 3, pp. 189-204, 2000.
[2] D. Terpstra, H. Jagode, H. You, and J. Dongarra, "Collecting performance data with PAPI-C," in 3rd Parallel Tools Workshop, Dresden, Germany, 2009, pp. 157-173.
[3] A. Malony, S. Biersdorff, S. Shende, H. Jagode, S. Tomov, G. Juckeland, R. Dietrich, D. Poole, and C. Lamb, "Parallel performance measurement of heterogeneous parallel systems with GPUs," International Conference on Parallel Processing (ICPP'11), Taipei, Taiwan, September 13-16, 2011.
[4] H. Jagode, S. Moore, and D. Terpstra, "Hardware performance monitoring for the Blue Gene/Q architecture," ScicomP 2012, Toronto, Ontario, Canada, May 2012.
[5] M. Gilge, "IBM system Blue Gene solution: Blue Gene/Q application development," IBM Redbook Draft SG24-7948-00, March 2012.
[6] M. Eleftheriou et al., "A volumetric FFT for Blue Gene/L," in G. Goos, J. Hartmanis, and J. van Leeuwen, editors, volume 2913 of Lecture Notes in Computer Science, page 194, Springer-Verlag, 2003.
[7] H. Jagode and J. Hein, "Custom assignment of MPI ranks for parallel multi-dimensional FFTs: Evaluation of BG/P versus BG/L," Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2008).
Featured SUPER Researcher: Samuel Williams

Where do you work and how are you involved with SUPER?

I am a research scientist in Erich Strohmaier's Future Technologies Group (FTG) at Lawrence Berkeley National Laboratory. As part of SUPER, I work in the performance research thrust, focusing on developing and deploying our auto-tuning technologies into SciDAC applications.
Can you briefly summarize your educational and work background?

I performed my graduate work in computer science at the University of California at Berkeley, where I received my PhD in 2008 under the guidance of David Patterson. Initially, I worked on verification, floor planning, and place-and-route for the IRAM embedded-DRAM vector processor. My focus then shifted to high-performance computing, and I received an appointment as a graduate student research assistant at LBL in 2005; my work there drove the research for my thesis. I received bachelor's degrees in Electrical Engineering, Applied Mathematics, and Physics from Southern Methodist University in 1999. While an undergraduate, I spent five semesters interning at Cyrix Corporation, where my duties included RTL verification, place-and-route, and post-silicon debug.
Where are you from originally?

Dallas, Texas.
What are your research areas of interest?

My current research focuses on performance optimization, performance modeling, and hardware-software co-design. To maximize performance, researchers leverage a technique called automatic performance tuning (or "auto-tuning") that automates the traditional benchmark-analyze-modify tuning loop. My current vein of research attempts to embed high-level knowledge of the underlying numerical method into auto-tuners. Orthogonally, I developed a bound-and-bottleneck performance model called the Roofline model that allows programmers to quickly calculate and visualize performance impediments. This can be used to limit tuning, qualify results, or, in the case of HW/SW co-design, determine the relationships between a processor's on-chip memory, bandwidth, and compute capabilities.
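The model's central bound can be written in one line; the numbers in the worked example below are illustrative, not measurements.

    % Roofline bound: attainable performance of a kernel with arithmetic
    % intensity I (flops per byte of DRAM traffic) on a machine with peak
    % compute rate P_peak and peak memory bandwidth B_peak:
    \[
      P_{\mathrm{attainable}} = \min\left( P_{\mathrm{peak}},\; I \cdot B_{\mathrm{peak}} \right)
    \]
    % Illustrative example: with P_peak = 200 GFlop/s and B_peak = 50 GB/s,
    % a kernel with I = 0.25 flop/byte attains at most
    % min(200, 0.25 * 50) = 12.5 GFlop/s, i.e., it is bandwidth-bound.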
What do you see yourself doing five years from now?

Today, we are beginning to see the specialization and diversification of architectures to suit the varying needs of the wide range of HPC applications and users. Decade-old generalizations on parallelism, memory architecture, consistency, and coherency embedded into existing implementations will not be applicable to every machine. Thus, in five years, I see our auto-tuning work expanding to compensate for the diversity of architectures and systems whilst simultaneously encompassing high-level domain-specific knowledge via DSLs. I expect this will result in me working closely with domain scientists and applied mathematicians.
What are some things you enjoy doing that don't involve computers?

I enjoy playing various board games and watching classic movies.
Autotuning Work at University of Utah
By Mary Hall

Integration with Active Harmony and TAU

We have worked extensively with the Active Harmony group at the University of Maryland and with the TAU group at the University of Oregon to integrate our tools and develop a novel performance-tuning layer. With regard to Active Harmony, we have integrated a recent "friendly release" of Active Harmony with our CUDA-CHiLL compiler to perform autotuning of parallel codes targeting Nvidia GPUs; previously, we had performed an integration for sequential node code only. The integration with TAU is new to SUPER. Utah hosted Oregon graduate student Nick Chaimov in December to develop requirements for replicating the application tuning in our prior work, but with full automation of performance data gathering using TAU; in prior work, this was done with significant user intervention. Nick has since developed extensions to TAU to support these experiments automatically. In addition, as a team we have integrated TAU's performance database with CHiLL, CUDA-CHiLL, and Active Harmony, so that the results of autotuning are collected in a tree-structured performance database that can be queried to capitalize on prior experiments. A bare-bones sketch of the empirical tuning loop these tools automate appears at the end of this article.

Application tuning for SciDAC-e

We chose to focus our application tuning in the first year on the SciDAC-e applications. Utah is involved in the SciDAC-e project entitled "Performance Engineering Research Institute SciDAC-e Augmentation: Performance Enhancement of Simulating the Dynamics of Photoexcitation for Solar Energy Conversion". We have worked extensively to optimize the application of interest, called MGDC; this work was reported at the March SUPER meeting. The optimizations we have performed employ the CHiLL autotuning framework, but in some cases require source modifications to simplify the code. We compared a number of optimized versions in this work, measured on the Hopper system at NERSC. Overall, these optimizations yield a 1.07X speedup. We see additional optimization opportunities, which require program modifications, that we will explore until the end of the year.
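The sketch promised above is a generic illustration of the empirical tuning loop, not the CHiLL or Active Harmony API; kernel_tiled stands in for a compiler-generated code variant. A search-based autotuner such as Active Harmony replaces the exhaustive sweep with a smarter exploration of the same space.

    #include <float.h>
    #include <time.h>

    /* Generic empirical tuning loop: time each candidate tile size for a
     * kernel variant and keep the fastest. */
    void kernel_tiled(int tile);   /* code variant under test (assumed) */

    int best_tile(const int *tiles, int n)
    {
        int best = tiles[0];
        double best_t = DBL_MAX;

        for (int i = 0; i < n; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            kernel_tiled(tiles[i]);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double dt = (t1.tv_sec - t0.tv_sec)
                      + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            if (dt < best_t) { best_t = dt; best = tiles[i]; }
        }
        return best;   /* parameter value with the best observed time */
    }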
Capturing Computer Performance Variability in Production Jobs
By Pat Worley

Performance variability is a common occurrence on many HPC platforms. When significant, it can represent performance degradation in a number of different ways. For example, performance variability indicates that jobs are not performing optimally (as measured by the best observed execution rate). Also, if a job runs much slower than expected, it may exceed the requested time and be aborted before finishing, wasting some of the project allocation time, depending on the frequency of checkpoints. Any job resubmitted to continue where the aborted job left off goes back into the batch queue and is subject to the typical queue delay, slowing project productivity.

The first step in addressing this issue is systematic and adequate performance measurement of production runs, to identify whether performance variability is a problem and to diagnose why it is occurring. Without this, we cannot identify mitigation strategies, nor convince those who might be able to address the issues directly that there is in fact a problem worth addressing.

We are currently looking at four different aspects of performance variability. (1) We want to identify significant differences in execution rate between similar jobs on the same platform when using the same resource requests (e.g., processor count) and in the same computing environment (system software versions, etc.). (2) We want to identify significant differences in execution rate during the execution of a single job (not related to changes in the job's execution characteristics); this type of performance variability can be identified without reference to the execution rate of other similar jobs, but does require a different type of performance instrumentation. (3) We want to track the time that a job spends in the batch queue; this variability, which is typically much larger than that in the execution rate, exacerbates the impact of execution rate variability when aborted jobs must be resubmitted. (4) We want to identify significant differences in execution rate between similar jobs on the same platform when using the same resource requests (e.g., processor count) over a period of time during which things have changed: application code version, compiler version, communication library version, etc. While some change is expected, we do not want a degradation in performance to pass unnoticed, as this may reflect a correctable performance bug or require regression to earlier versions of the code or of the system software stack.

We are still in the process of developing approaches to measurement, archiving, and analysis. However, we have started working with some climate science researchers to instrument and collect data from their production Community Earth System Model (CESM) runs on the Cray XK6 system at the Oak Ridge Leadership Computing Facility (OLCF). The current performance instrumentation is a modest augmentation to what is already used in practice and introduces no measurable instrumentation overhead, and so is suitable for "always on" instrumentation in the production runs; a sketch of this style of instrumentation appears below.

Figures 1-4 describe performance for one of the ongoing suites of experiments. They contain performance data from 21 jobs, each computing 150 simulation days, collected over a period of 24 (calendar) days. Each job ran on 4096 processor cores (512 processes, 8 OpenMP threads per process). Two of the jobs exceeded the requested 6-hour wallclock limit, one between simulation days 140 and 145, and one between days 145 and 150. Based on previous benchmark runs, the expectation was that 180 simulation days could be completed in 6 hours. Figure 1 is a plot of the total runtime for each experiment, where execution time for the failed jobs is estimated based on the execution rate exhibited during the last 5 or 10 completed simulation days. Figures 2 and 3 are estimates of the run time without I/O and for I/O only, respectively. Note that both the I/O and non-I/O portions demonstrated performance variability.
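The sketch below shows the flavor of such "always on" instrumentation, though the actual CESM timers differ: a wall-clock timestamp pair around each coarse phase (here, one simulated day), reduced across ranks and logged by rank 0. The function and log format are illustrative.

    #include <stdio.h>
    #include <mpi.h>

    /* Lightweight "always on" phase timing: time one coarse phase (e.g.,
     * one simulation day), take the max over ranks (the slowest rank
     * defines the step time), and log from rank 0. */
    void timed_step(void (*step)(void), int day, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        double t0 = MPI_Wtime();
        step();                               /* one simulation day */
        double dt = MPI_Wtime() - t0, dt_max;

        MPI_Reduce(&dt, &dt_max, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
        if (rank == 0)
            printf("day %d: %.3f s\n", day, dt_max);
    }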
Figure 4 is a plot of runtime for the 5 previous simulation days at intervals of 5 simulation days. Plotted are data for the two failed jobs, for the slowest successful job, and for the fastest successful job. The failed experiments exhibit high internal performance variability, while the successful runs primarily have different "base" (or static) performance levels. This was just happenstance; neither of these are necessary characteristics of failed or successful runs.

We are still in the process of analyzing these data. However, one interesting result is already apparent from comparing the processor subsets allocated for the fast and slow successful jobs. The Gemini interconnect on both the Cray XK6 at the OLCF and the XE6 at NERSC is implemented as a 3D torus topology. Two compute nodes share the same (X,Y,Z) coordinates and are connected via a single Gemini switch. Messages between nodes differing by one in either the X, Y, or Z coordinate go through two Gemini switches. For communication between nodes that are neighbors in the Y direction, performance differs depending on whether the smaller Y-coordinate is even ("faster") or odd ("slower"). For communication between nodes that are neighbors in the Z direction, every eighth link is "slower".

The fastest successful job was allocated an "optimal" set of processors, comprising a contiguous 4x2x16 allocation in which both compute nodes with the same (X,Y,Z) coordinates were assigned to this job. In contrast, for the slowest successful job, 53% of the processes (272) were assigned to compute nodes in which the other compute node with the same (X,Y,Z) coordinates was assigned to a different job. The 256 allocated nodes were also scattered within an 8x16x24 partition of the interconnect, with many compute nodes assigned to other concurrently running jobs. As such, the slow job was potentially exposed to more contention with other jobs, and also required communication over more, and possibly slower, links than the fastest job.

We are continuing to measure and diagnose the performance variability in this suite of experiments, as well as in a number of other experiments. Data will be archived in the TAU performance database and used to track long-term trends in variability, in end-to-end metrics such as time spent in the queue, and in base performance as the transition is made to new versions of the code and of the compiler. We are also using these results to examine mitigation strategies, in order to avoid compute node allocations that lead to poor performance. The short sketch below illustrates the kind of allocation check described above.
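This sketch is ours, with coordinate data assumed to come from the system's topology query tools: it counts how many of a job's nodes lack their Gemini switch partner, i.e., the other compute node with the same (X,Y,Z) coordinates, inside the same allocation.

    /* Count nodes in a job's allocation whose Gemini switch partner
     * (the other node with identical X,Y,Z coordinates) belongs to a
     * different job. Coordinates are assumed to have been obtained from
     * the system's topology query tools. */
    struct coord { int x, y, z; };

    int nodes_without_partner(const struct coord *c, int n)
    {
        int lonely = 0;

        for (int i = 0; i < n; i++) {
            int partnered = 0;
            for (int j = 0; j < n; j++)
                if (i != j && c[i].x == c[j].x &&
                    c[i].y == c[j].y && c[i].z == c[j].z)
                    partnered = 1;
            if (!partnered)
                lonely++;   /* partner is assigned to some other job */
        }
        return lonely;
    }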
Recent Publications

Saurabh Hukerikar, Pedro C. Diniz, and Robert F. Lucas. A programming model for resilience in extreme scale computing. Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2012), Boston, MA, June 25-28, 2012.

Jacob Lidman, Daniel Quinlan, Chunhua Liao, and Sally McKee. ROSE::FTTransform – a source-to-source transformation framework for exascale fault-tolerance research. Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2012), Boston, MA, June 25-28, 2012.

Kathryn Mohror, Adam Moody, and Bronis de Supinski. Asynchronous checkpoint migration with MRNet in the scalable checkpoint/restart library. Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2012), Boston, MA, June 25-28, 2012.

Shah Mohammad Faizur Rahman, Jichi Guo, Akshatha Bhat, Carlos Garcia, Majedul Haque Sujon, Qing Yi, Chunhua Liao, and Daniel J. Quinlan. Studying the impact of application-level optimizations on the power consumption of multi-core architectures. ACM International Conference on Computing Frontiers (CF'12), Cagliari, Italy, May 15-17, 2012.

Kiran Kasichayanula, Daniel Terpstra, Piotr Luszczek, Stan Tomov, Shirley Moore, and Greg Peterson. Power aware computing on GPUs. Symposium on Application Accelerators in High Performance Computing (SAAHPC 2012), Argonne National Laboratory, July 10-11, 2012.

Vince Weaver, Matthew Johnson, Kiran Kasichayanula, James Ralph, Piotr Luszczek, Daniel Terpstra, and Shirley Moore. Measuring energy and power with PAPI. International Workshop on Power-Aware Systems and Architectures (PASA 2012), Pittsburgh, PA, September 10, 2012.

Heike Jagode, Shirley Moore, and Daniel Terpstra. Performance counter monitoring for the Blue Gene/Q architecture. ScicomP 2012, Toronto, Ontario, Canada, May 2012.

Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, and Jack Dongarra. An evaluation of user-level failure mitigation support in MPI. Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting (EuroMPI 2012), Springer, Vienna, Austria, September 23-26, 2012.

Marc Casas Guix, Bronis R. de Supinski, Greg Bronevetsky, and Martin Schulz. Fault resilience of the algebraic multi-grid solver. International Conference on Supercomputing (ICS 2012), Venice, Italy, June 25-29, 2012.

Jeff R. Hammond, Sriram Krishnamoorthy, Sameer Shende, Nichols A. Romero, and Allen D. Malony. Performance characterization of global address space applications: a case study with NWChem. Concurrency and Computation: Practice and Experience 24(2): 135-154, 2012.

Barry Rountree, Dong Ahn, Bronis R. de Supinski, David K. Lowenthal, and Martin Schulz. Beyond DVFS: A first look at performance under a hardware-enforced power bound. 8th Workshop on High-Performance, Power-Aware Computing (HPPAC), May 2012.

Chen-Han Ho, Marc de Kruijf, Karu Sankaralingam, Barry Rountree, Martin Schulz, and Bronis de Supinski. Mechanisms and evaluation of cross-layer fault-tolerance for supercomputing. 41st International Conference on Parallel Processing (ICPP), September 2012 (to appear).
See the SUPER website for additional recent publications