Tools at Scale - Requirements and Experience
Mary Zosel, LLNL
ASCI / PSE
ASCI Simulation Development Environment
Tools Project
Prepared for SciComp 2000
La Jolla, Ca.
Aug 14-16, 2000
UCRL: VG - 139702
Presentation Outline:
Overview of Systems
Requirements for Scale
Experience/Progress in debugging and tuning
ASCI WHITE
• 8192 P3 CPUs
• NightHawk 2 nodes
• Colony switch
• 12.3 TF peak
• 160 TB disk
• 28 tractor trailers
• Classified network
Full system at IBM
120 nodes in new home at LLNL - remainder due late Aug.
White joins these IBM platforms at LLNL
• 128 cpu - SNOW (8-way P3 NH 1 nodes / Colony)
  – Experimental software development platform - unclassified
• 1344 cpu - BLUE (4-way 604e silver nodes / TB3MX)
  – Production unclassified platform
• 16 cpu - BABY (4-way 604e silver nodes / TB3MX)
  – Experimental development platform - first stop for new system software
• 64 cpu - ER (4-way 604e silver nodes / TB3MX)
  – Backup production system “parts” - and experimental software
• 5856 cpu - SKY (3 sectors of 488 silver nodes, connected with TB3MX and 6 HPGN IP routers)
  – Classified production system
• When White is complete, ~2/3 of SKY will become the unclassified production system
Why the big machines?
• The purpose of ASCI is new 3-D codes for use in place of testing for Stockpile Certification.
• ASCI program plan calls for a series of application milepost demonstrations of increasingly complex calculations, which require the very large platforms.
  – Last year: 1000 cpu requirement
  – This year: 1500 cpu requirement
  – Next year: ~4000 cpu requirement
• Tri-lab resource -> multiple code teams with large-scale requirements
What does this imply for the development environment?
Pressure, Stress, Pressure
• Deadlines: multiple code teams working against time
• Long Calculations: need to understand and optimize time requirements of each component to plan for production runs
• Large Scale: easy to push past the knee of scalability - and past the Troutbeck US (User Space) limit of 1024 tasks
• Large Memory: n**2 buffer management schemes hurt (see the sketch after this list)
• Access Contention: not easy to get large test runs - especially for tool work
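To make the n**2 point concrete, here is a back-of-the-envelope sketch; the task count and the 64 KB per-peer buffer size are assumptions for illustration, not figures from the talk:

```c
/* Hypothetical illustration of n**2 buffer growth: if each of n tasks
 * keeps one eager buffer per peer, aggregate memory grows as n*n.
 * Both constants below are assumed, not measured. */
#include <stdio.h>

int main(void)
{
    int    n        = 4096;          /* tasks, roughly next year's milepost scale */
    size_t per_peer = 64 * 1024;     /* assumed per-peer buffer: 64 KB */

    size_t per_task = (size_t)(n - 1) * per_peer;      /* one task's buffers */
    double total_gb = (double)n * (double)per_task
                      / (1024.0 * 1024.0 * 1024.0);    /* machine-wide total */

    printf("per task: %.0f MB\n", per_task / (1024.0 * 1024.0));
    printf("machine : %.0f GB in peer buffers alone\n", total_gb);
    return 0;
}
```

At 4096 tasks this comes to roughly 256 MB per task and about a terabyte machine-wide, before the application allocates anything.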
What Tools are in use?
Staying with standards helps make tools usable
• Languages/Compilers:
  – C, C++, Fortran from both IBM and KAI
• Runtime: OpenMP and MPI (a minimal hybrid sketch follows this list)
  – Production codes are not using pvm, shmem, direct LAPI use, etc., and direct use of pthreads is very limited
• Debugging / Tuning:
  – TotalView, LCF, Great Circle, ZeroFault, Guide, Vampir, xprofiler, pmapi / papi, and hopefully new IBM tools
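Since OpenMP-within-MPI is the production model named above, a minimal hybrid sketch may help fix ideas; this is not an ASCI code, and do_work is a hypothetical stand-in for real computation:

```c
/* Minimal MPI + OpenMP hybrid sketch: OpenMP threads within a node,
 * MPI tasks across nodes. do_work() is a hypothetical placeholder. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

static double do_work(int tid)
{
    return tid + 1.0;   /* stand-in for a real per-thread computation */
}

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* threads share memory inside the node */
    #pragma omp parallel reduction(+:local)
    local += do_work(omp_get_thread_num());

    /* tasks combine partial results across the machine */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global result = %g\n", global);

    MPI_Finalize();
    return 0;
}
```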
Debugging --- LLNL Experience
• Users DO want to use the debugger with large # cpus
• There have been lots of frustrations - but there is progress and expectation of further improvements
  – Slow to attach / start … what was hours is now minutes
  – Experience / education helps avoid some problems ...
    • Need large memory settings in ld
    • Now have MP_SYNC_ON_CONNECT off by default
    • Set startup timeouts (MP_TIMEOUT)
  – “Sluggish but tolerable” describes a recent 512 cpu session
• Local feature development aimed at scale ...
  – Subsetting, collapse, shortcuts, filtering, … both CLI and X versions
• Etnus continuing to address scalability
New Attach Option to get subset of tasks
Root window collapsed - shows task 4 in a different state
Same Root window opened to show all tasks
Example of thumb-screw on msg window
Cycle thru message state
Performance … status quo is less promising
• MPI scale is an issue - OpenMP reduces problem
• Understanding thread performance is an issue
• Users DO want to use the tools - this is new
  – They need estimates for their large code runs …
  – Is my job running or hung? (a minimal heartbeat sketch follows this slide)
• Tools aren’t yet ready for scale - including size-of-code scaling
• Several tools do not support threads
• Problems often not in the user’s code
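For the "running or hung?" question, one low-tech answer is a user-level heartbeat. This is a hypothetical sketch, not one of the tools listed in the talk; the file naming and loop structure are invented:

```c
/* Hypothetical "am I hung?" heartbeat: each task rewrites a per-rank
 * file every timestep, so an external `ls -l` on the files shows which
 * ranks are still making progress (via their modification times). */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static void heartbeat(int rank, int step)
{
    char name[64];
    snprintf(name, sizeof name, "heartbeat.%04d", rank);
    FILE *f = fopen(name, "w");    /* overwrite: the mtime is the signal */
    if (f) {
        fprintf(f, "step %d at %ld\n", step, (long)time(NULL));
        fclose(f);
    }
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int step = 0; step < 100; step++) {
        /* ... one timestep of real work would go here ... */
        heartbeat(rank, step);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```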
List of sample problems
User observes that …
• … as the number of tasks grows, the code becomes relatively slower and slower. The sum of the CPU time and the system time doesn't add up to wall-clock time – and this missing time is the component growing the fastest. [Diagnosis – bad adaptor software configuration was causing excessive fragmentation and retransmission of MPI messages]
• … unexplained code slow-down from previous runs and nothing in the code has changed. [Diagnosis – orphaned processes on one node slowed down entire code.]
• … threaded version of code much slower than straight MPI. [Diagnosis – code had many small malloc calls and was serializing through the malloc code; a sketch of this pitfall follows the list.]
• … certain part of code takes 10 seconds to run while the problem is small – and then after a call to a memory-intensive routine – the same portion of code takes 18 seconds to run. [Diagnosis – not sure – but believed to be memory heap fragmentation causing paging.]
• … job runs faster on Blue (604e system) than it does on Snow (P3 system). [Diagnosis – not yet known – wonder about flow-control default setting].
• … a non-blocking message-test code is taking up to 15 times longer to run on Snow than it does on Blue. [Diagnosis - not yet known - flow control setting doesn’t help.]
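The malloc-serialization diagnosis above is a common threading pitfall. Here is a hedged illustration of the pattern and one way out; the sizes, counts, and function names are invented, not taken from the ASCI codes:

```c
/* Illustration of the malloc-serialization pitfall: many small
 * allocations inside a threaded loop contend for the heap lock.
 * All constants here are invented for illustration. */
#include <omp.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 1000000
#define CHUNK 64

/* Pattern that serializes: every iteration goes through malloc/free. */
static void slow_version(void)
{
    #pragma omp parallel for
    for (int i = 0; i < ITERS; i++) {
        char *p = malloc(CHUNK);
        if (p) { memset(p, 0, CHUNK); free(p); }
    }
}

/* One fix: hoist a single allocation per thread and reuse it. */
static void faster_version(void)
{
    #pragma omp parallel
    {
        char *p = malloc(CHUNK);
        #pragma omp for
        for (int i = 0; i < ITERS; i++)
            if (p) memset(p, 0, CHUNK);
        free(p);
    }
}

int main(void)
{
    slow_version();
    faster_version();
    return 0;
}
```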
What are we doing about this?
• PathForward contracts: KAI/Pallas, Etnus, MSTI
• Infrastructure development: to facilitate new tools / probes
  – supports click-back to source
  – currently QT on DPCL … future???
• Probe components: memory usage, MPI classification (a PMPI-style sketch follows this list)
• Lightweight CoreFile … and Performance Monitors
• External observation … Monitor, PS, VMSTAT …
• Testing new IBM beta tools
• Sys admins starting performance regression database
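MPI classification probes of the kind listed above typically intercept MPI calls, and the standard hook for that is the PMPI profiling interface. This sketch times only MPI_Send and invents the reporting; it shows the shape of such a probe, not the actual ASCI implementation:

```c
/* Sketch of an MPI classification probe via the standard PMPI profiling
 * interface: user calls to MPI_Send land in this wrapper, which records
 * elapsed time and forwards to PMPI_Send. The report format is invented. */
#include <mpi.h>
#include <stdio.h>

static double send_seconds = 0.0;
static long   send_calls   = 0;

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    send_seconds += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld MPI_Send calls, %.3f s total\n",
           rank, send_calls, send_seconds);
    return PMPI_Finalize();
}
```

Linking wrappers like this ahead of the MPI library is what yields per-operation time breakdowns like the chart below.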
[Chart: time in microseconds (0 to ~175,000,000) spent in user code and in MPI operations (Wait, Send, Irecv, Init, Comm_size, Comm_rank, Bcast, Barrier, Allreduce) as the number of processors scales from 4 to 256.]
Tool Work In Progress
the faster I go, the behinder I get
… we ARE making progress, but the problems are getting harder and coming in faster ...
It’s a Team Effort
Rich Zwakenberg - debugging
Karen Warren
Bor Chan
John May - performance tools
Jeff Vetter
John Gyllenhaal
Chris Chambreau
Mike McCracken
John Engle - compiler support
Linda Stanberry - mpi related
Bronis deSupinski
Susan Post - system testing
Brian Carnes - general
Mary Zosel
Scott Taylor - emeritus
John Ranelletti