Tools at Scale - Requirements and Experience
Mary Zosel, LLNL
ASCI / PSE
ASCI Simulation Development Environment
Tools Project
Prepared for SciComp 2000
La Jolla, Ca.
Aug 14-16, 2000
UCRL: VG - 139702
Presentation Outline:
Overview of Systems
Requirements for Scale
Experience/Progress in debugging and tuning
ASCI WHITE
• 8192 P3 CPUs
• NightHawk 2 nodes
• Colony switch
• 12.3 TF peak
• 160 TB disk
• 28 tractor trailers
• Classified network
Full system at IBM
120 nodes in new home at LLNL - remainder due late Aug.
White joins these IBM platforms at LLNL
• 128 cpu - SNOW (8-way P3 NH 1 nodes / Colony)
  – Experimental software development platform - unclassified
• 1344 cpu - BLUE (4-way 604e silver nodes / TB3MX)
  – Production unclassified platform
• 16 cpu - BABY (4-way 604e silver nodes / TB3MX)
  – Experimental development platform - first stop for new system software
• 64 cpu - ER (4-way 604e silver nodes / TB3MX)
  – Backup production system “parts” - and experimental software
• 5856 cpu - SKY (3 sectors of 488 silver nodes, connected with TB3MX and 6 HPGN IP routers)
  – Classified production system
• When White is complete, ~2/3 of SKY will become the unclassified production system
Why the big machines?
• The purpose of ASCI is new 3-D codes for use in place of testing for Stockpile Certification.
• ASCI program plan calls for a series of application milepost demonstrations of increasingly complex calculations, which require the very large platforms.
  – Last year: 1000 cpu requirement
  – This year: 1500 cpu requirement
  – Next year: ~4000 cpu requirement
• Tri-lab resource -> multiple code teams with large-scale requirements
What does this imply for the development environment?
Pressure, Stress, Pressure
• Deadlines: multiple code teams working against time
• Long Calculations: need to understand and optimize time requirements of each component to plan for production runs
• Large Scale: easy to push past the knee of scalability - and past the Troutbeck US (User Space) limit of 1024 tasks
• Large Memory: n**2 buffer management schemes hurt (see the sketch after this list)
• Access Contention: not easy to get large test runs - especially for tool work
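To make the n**2 point concrete, here is a back-of-the-envelope sketch; the task count and the 64 KB per-peer buffer size are assumptions for illustration, not figures from the talk:

```c
/* Hypothetical illustration of n**2 buffer growth: if each of n tasks
 * keeps one eager buffer per peer, aggregate memory grows as n*n.
 * Both constants below are assumed, not measured. */
#include <stdio.h>

int main(void)
{
    int    n        = 4096;          /* tasks, roughly next year's milepost scale */
    size_t per_peer = 64 * 1024;     /* assumed per-peer buffer: 64 KB */

    size_t per_task = (size_t)(n - 1) * per_peer;      /* one task's buffers */
    double total_gb = (double)n * (double)per_task
                      / (1024.0 * 1024.0 * 1024.0);    /* machine-wide total */

    printf("per task: %.0f MB\n", per_task / (1024.0 * 1024.0));
    printf("machine : %.0f GB in peer buffers alone\n", total_gb);
    return 0;
}
```

At 4096 tasks this comes to roughly 256 MB per task and about a terabyte machine-wide, before the application allocates anything.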
What Tools are in use?
Staying with standards helps make tools usable
• Languages/Compilers:
  – C, C++, Fortran from both IBM and KAI
• Runtime: OpenMP and MPI (a minimal hybrid sketch follows this list)
  – Production codes are not using pvm, shmem, direct LAPI use, etc., and direct use of pthreads is very limited
• Debugging / Tuning:
  – TotalView, LCF, Great Circle, ZeroFault, Guide, Vampir, xprofiler, pmapi / papi, and hopefully new IBM tools
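Since OpenMP-within-MPI is the production model named above, a minimal hybrid sketch may help fix ideas; this is not an ASCI code, and do_work is a hypothetical stand-in for real computation:

```c
/* Minimal MPI + OpenMP hybrid sketch: OpenMP threads within a node,
 * MPI tasks across nodes. do_work() is a hypothetical placeholder. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

static double do_work(int tid)
{
    return tid + 1.0;   /* stand-in for a real per-thread computation */
}

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* threads share memory inside the node */
    #pragma omp parallel reduction(+:local)
    local += do_work(omp_get_thread_num());

    /* tasks combine partial results across the machine */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global result = %g\n", global);

    MPI_Finalize();
    return 0;
}
```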
Debugging --- LLNL Experience
• Users DO want to use the debugger with large # cpus
• There have been lots of frustrations - but there is progress and expectation of further improvements
  – Slow to attach / start … what was hours is now minutes
  – Experience / education helps avoid some problems ...
    • Need large memory settings in ld
    • Now have MP_SYNC_ON_CONNECT off by default
    • Set startup timeouts (MP_TIMEOUT)
  – “Sluggish but tolerable” describes a recent 512 cpu session
• Local feature development aimed at scale ...
  – Subsetting, collapse, shortcuts, filtering, … both CLI and X versions
• Etnus continuing to address scalability
New Attach Option to get subset of tasks
Root window collapsed - shows task 4 in a different state
Same Root window opened to show all tasks
Example of thumb-screw on msg window
Cycle thru message state
Performance … status quo is less promising
• MPI scale is an issue - OpenMP reduces problem
• Understanding thread performance is an issue
• Users DO want to use the tools - this is new
  – They need estimates for their large code runs …
  – Is my job running or hung? (a minimal heartbeat sketch follows this slide)
• Tools aren’t yet ready for scale - including size-of-code scaling
• Several tools do not support threads
• Problems often not in the user’s code
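For the "running or hung?" question, one low-tech answer is a user-level heartbeat. This is a hypothetical sketch, not one of the tools listed in the talk; the file naming and loop structure are invented:

```c
/* Hypothetical "am I hung?" heartbeat: each task rewrites a per-rank
 * file every timestep, so an external `ls -l` on the files shows which
 * ranks are still making progress (via their modification times). */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static void heartbeat(int rank, int step)
{
    char name[64];
    snprintf(name, sizeof name, "heartbeat.%04d", rank);
    FILE *f = fopen(name, "w");    /* overwrite: the mtime is the signal */
    if (f) {
        fprintf(f, "step %d at %ld\n", step, (long)time(NULL));
        fclose(f);
    }
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int step = 0; step < 100; step++) {
        /* ... one timestep of real work would go here ... */
        heartbeat(rank, step);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```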
List of sample problems
User observes that …
• … as the number of tasks grows, the code becomes relatively slower and slower. The sum of the CPU time and the system time doesn't add up to wall-clock time – and this missing time is the component growing the fastest. [Diagnosis – bad adaptor software configuration was causing excessive fragmentation and retransmission of MPI messages]
• … unexplained code slow-down from previous runs and nothing in the code has changed. [Diagnosis – orphaned processes on one node slowed down entire code.]
• … threaded version of code much slower than straight MPI. [Diagnosis – code had many small malloc calls and was serializing through the malloc code; a sketch of this pitfall follows the list.]
• … certain part of code takes 10 seconds to run while the problem is small – and then after a call to a memory-intensive routine – the same portion of code takes 18 seconds to run. [Diagnosis – not sure – but believed to be memory heap fragmentation causing paging.]
• … job runs faster on Blue (604e system) than it does on Snow (P3 system). [Diagnosis – not yet known – wonder about flow-control default setting].
• … a non-blocking message-test code is taking up to 15 times longer to run on Snow than it does on Blue. [Diagnosis - not yet known - flow control setting doesn’t help.]
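The malloc-serialization diagnosis above is a common threading pitfall. Here is a hedged illustration of the pattern and one way out; the sizes, counts, and function names are invented, not taken from the ASCI codes:

```c
/* Illustration of the malloc-serialization pitfall: many small
 * allocations inside a threaded loop contend for the heap lock.
 * All constants here are invented for illustration. */
#include <omp.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 1000000
#define CHUNK 64

/* Pattern that serializes: every iteration goes through malloc/free. */
static void slow_version(void)
{
    #pragma omp parallel for
    for (int i = 0; i < ITERS; i++) {
        char *p = malloc(CHUNK);
        if (p) { memset(p, 0, CHUNK); free(p); }
    }
}

/* One fix: hoist a single allocation per thread and reuse it. */
static void faster_version(void)
{
    #pragma omp parallel
    {
        char *p = malloc(CHUNK);
        #pragma omp for
        for (int i = 0; i < ITERS; i++)
            if (p) memset(p, 0, CHUNK);
        free(p);
    }
}

int main(void)
{
    slow_version();
    faster_version();
    return 0;
}
```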
What are we doing about this?
• PathForward contracts: KAI/Pallas, Etnus, MSTI
• Infrastructure development: to facilitate new tools / probes
  – supports click-back to source
  – currently QT on DPCL … future???
• Probe components: memory usage, MPI classification (a PMPI-style sketch follows this list)
• Lightweight CoreFile … and Performance Monitors
• External observation … Monitor, PS, VMSTAT …
• Testing new IBM beta tools
• Sys admins starting performance regression database
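MPI classification probes of the kind listed above typically intercept MPI calls, and the standard hook for that is the PMPI profiling interface. This sketch times only MPI_Send and invents the reporting; it shows the shape of such a probe, not the actual ASCI implementation:

```c
/* Sketch of an MPI classification probe via the standard PMPI profiling
 * interface: user calls to MPI_Send land in this wrapper, which records
 * elapsed time and forwards to PMPI_Send. The report format is invented. */
#include <mpi.h>
#include <stdio.h>

static double send_seconds = 0.0;
static long   send_calls   = 0;

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    send_seconds += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld MPI_Send calls, %.3f s total\n",
           rank, send_calls, send_seconds);
    return PMPI_Finalize();
}
```

Linking wrappers like this ahead of the MPI library is what yields per-operation time breakdowns like the chart below.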
[Chart: time in microseconds (0 to ~175,000,000) spent in user code and in MPI operations (Wait, Send, Irecv, Init, Comm_size, Comm_rank, Bcast, Barrier, Allreduce) as the number of processors scales from 4 to 256.]
Tool Work In Progress
the faster I go, the behinder I get
… we ARE making progress, but the problems are getting harder and coming in faster ...
It’s a Team Effort
Rich Zwakenberg - debugging
Karen Warren
Bor Chan
John May - performance tools
Jeff Vetter
John Gyllenhaal
Chris Chambreau
Mike McCracken
John Engle - compiler support
Linda Stanberry - mpi related
Bronis deSupinski
Susan Post - system testing
Brian Carnes - general
Mary Zosel
Scott Taylor - emeritus
John Ranelletti