John Morrison CCN Division Leader Nicholas C. Metropolis Center for Modeling and Simulation 7th Workshop on Distributed Supercomputing March 4, 2003 ASCI.

John Morrison, CCN Division Leader

Nicholas C. Metropolis Center for Modeling and Simulation

7th Workshop on Distributed Supercomputing, March 4, 2003

ASCI Q

LA-UR-03-0541

The ASCI Q System at Los Alamos


Q is operational for stewardship applications (1st 10T)

Many ASCI applications are experiencing significant performance increases over Blue Mountain.

Linpack performance run of 7.727 TeraOps (more than 75% efficiency)

Initial user response is very positive (with some issues!)

(Users want more cycles…)

Users from the tri-lab community are also using the system

Available to users for classified ASCI codes since August 2002

Smaller initial system available since April 2002

Los Alamos has run its December 2002 ASCI Milestone calculation on Q
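As a rough check of the Linpack efficiency quoted on this slide, the sketch below divides the measured 7.727 TeraOps by a peak rate. The peak of 10.24 TeraOps is an assumption (the slide only says "1st 10T" and does not state the exact peak); with that value the ratio comes out just above 75%.

```python
def linpack_efficiency(rmax_teraops, rpeak_teraops):
    """Fraction of peak achieved by the Linpack run."""
    return rmax_teraops / rpeak_teraops

RMAX = 7.727   # TeraOps, from the slide
RPEAK = 10.24  # TeraOps, ASSUMED peak of the first 10T segment (not on the slide)

print(f"Linpack efficiency: {linpack_efficiency(RMAX, RPEAK):.1%}")
```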

Question 1:

Is your machine living up to the performance expectations? If yes, how? If not, what is the root cause?


Performance Comparison: Q vs. White vs. Blue Mountain

[Figure: SAGE (timing.input). Cycle time (s, 0.0 to 5.0) vs. # PEs (1 to 10,000, log scale). Series: Blue Mountain, ASCI White, ASCI Q.]

Cycle time: lower is better

Weak scaling of SAGE (problem size per processor is constant) -> ideal cycle time is constant for all PE counts (but there are parallel overheads)
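The weak-scaling statement above can be turned into a simple efficiency figure: since the ideal cycle time is constant, efficiency at P PEs is the baseline cycle time divided by the cycle time at P. This is a minimal sketch; the cycle times below are hypothetical placeholders, not values read from the chart.

```python
def weak_scaling_efficiency(cycle_times):
    """Map PE count -> efficiency relative to the smallest PE count measured.

    cycle_times: dict of {pe_count: cycle_time_seconds}.
    Under weak scaling the ideal cycle time is constant, so
    efficiency = T(baseline) / T(P).
    """
    base_pes = min(cycle_times)
    base_time = cycle_times[base_pes]
    return {pes: base_time / t for pes, t in sorted(cycle_times.items())}

# Hypothetical cycle times in seconds (NOT measured values from the slides).
example = {1: 0.50, 64: 0.60, 1024: 0.80, 4096: 1.00}

for pes, eff in weak_scaling_efficiency(example).items():
    print(f"{pes:>5} PEs: efficiency {eff:.2f}")
```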

Modeled and Measured Performance

Unique capability for performance prediction developed in the Performance and Architecture Lab (PAL) at Los Alamos

Latest two sets of measurements are consistent (~70% longer than model)

[Figure: SAGE on QB, 1-rail (timing.input). Cycle time (s, 0 to 1.4) vs. # PEs (1 to 10,000, log scale). Series: Model, 21-Sep, 25-Nov. Lower is better!]

There is a difference. Why?

Using fewer PEs per Node

Test performance using 1,2,3 and 4 PEs per node

Reduces the number of compute processors available

Performance degradation appears when using all 4 procs in a node!

[Figure: SAGE on QB (timing.input). Cycle time (s, 0 to 1.4) vs. # PEs (1 to 10,000, log scale). Series: 1, 2, 3 and 4 PEs per node.]

Performance Variability

Lots of noise on the nodes: daemons and kernel activity

This noise was analyzed, quantified, modeled, and included back in the application model

This system activity has structure: it was identified and modeled

Cycle time varies from cycle to cycle

[Figure: SAGE on QB, 3584 PEs (timing.input). Cycle time (s, 0.0 to 3.0) vs. cycle number (100 to 1000). Series: Cyc_sec, Model.]
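One way to see why node noise hurts more at scale: in a synchronized cycle every PE waits for the slowest one, so a cycle is delayed if any single PE is interrupted. The sketch below assumes independent interruptions with a fixed per-PE, per-cycle probability. That is a simplification (the slides say the real noise has structure), and the probability used is hypothetical.

```python
def prob_cycle_delayed(p_hit, n_pes):
    """Probability that at least one of n_pes processes is interrupted
    during a cycle, assuming independent interruptions with per-process
    probability p_hit (a simplifying assumption)."""
    return 1.0 - (1.0 - p_hit) ** n_pes

P_HIT = 0.001  # hypothetical per-PE, per-cycle interruption probability

for n in (4, 64, 1024, 3584):
    print(f"{n:>5} PEs: P(cycle delayed) = {prob_cycle_delayed(P_HIT, n):.3f}")
```

Even with a small per-PE probability, most cycles are hit at 3584 PEs, while a few still escape, which is consistent with cycle time varying from cycle to cycle.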

Performance Variability (2)

Histogram of cycle-time over 1000 cycles

Minimum cycle-time is very close to model! (0.75 vs 0.70)

[Figure: SAGE on QB, 3584 PEs (timing.input). Histogram of cycle time: frequency (0 to 120) vs. histogram bins (s, 0.7 to 2.0 in steps of 0.1).]

Performance is variable (some cycles are not affected!)
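A histogram like the one on this slide can be built by binning per-cycle times, and the best-case cycle can then be compared with the model. This is a minimal sketch: the bin layout (0.1 s bins starting at 0.7 s) is read off the chart axis, the sample times are hypothetical, and only the 0.75 s and 0.70 s figures come from the slide.

```python
def histogram(cycle_times, lo=0.7, width=0.1, nbins=14):
    """Count cycle times into fixed-width bins; out-of-range values are
    clamped into the first or last bin."""
    counts = [0] * nbins
    for t in cycle_times:
        idx = int((t - lo) / width)
        idx = max(0, min(idx, nbins - 1))
        counts[idx] += 1
    return counts

def overshoot(measured, modeled):
    """Relative amount by which a measured time exceeds the modeled time."""
    return (measured - modeled) / modeled

# Hypothetical per-cycle times in seconds (NOT the measured data).
SAMPLES = [0.75, 0.78, 0.85, 1.12, 1.95]
print(histogram(SAMPLES))

# Minimum measured cycle time vs. model, using the values quoted on the slide.
print(f"min vs. model: {overshoot(0.75, 0.70):.1%} above model")
```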

Modeled and Experimental Data

The model is a close approximation of the experimental data

The primary bottleneck is the noise generated by the compute nodes (Tru64)

[Figure: Barrier, 1 ms granularity, modeled and experimental data. Latency (ms, 1 to 8) vs. nodes (0 to 1000). Series: experiment; model; without 0; without 1; without 31; without 0, 1 and 31; without background noise. Lower is better.]

Performance after System Optimization

[Figure: SAGE on QB (timing.input). Cycle time (s, 0 to 1.4) vs. # PEs (0 to 5000). Series: Sept 21st, Nov 25th, Jan 27th (average), Jan 27th (min), Model.]

After system mods (both kernel and daemons, and Quadrics RMS): right on target!

After these optimizations, Q will deliver the performance that it’s supposed to.

Modeling works!

Resources

Performance and Architecture Lab (PAL) in CCS-3

Work by Petrini, Kerbyson, Hoisie

Publications on this work and other architecture and performance topics at

www.c3.lanl.gov/par_arch


Plan of Attack

Find low-hanging fruit (common problems with high payback) to attack first

1. Kill unnecessary daemons

2. Look at members 1 and 2 for CFS-related activities

3. Member 31 noise!!

Kill Daemons

HP SC engineering is checking that there are no operational problems with permanently switching them off.

Daemon      Status
envmond     /sbin/init.d/envmon stop
insightd    /sbin/init.d/insightd stop
snmpd       /sbin/init.d/snmpd stop
advfsd      Not running at LANL
smsd        Not running at LANL
lat         Already off
lpd         /sbin/init.d/lpd stop
xlogin      Not running at LANL
niff        /sbin/init.d/niffd stop
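The stop commands listed on this slide could be collected into a small dry-run helper that prints what would be executed on a node. This is only a sketch: it runs nothing, the helper name is hypothetical, and the leading slash on the insightd path is assumed (the slide prints it as sbin/init.d/insightd).

```python
# Daemon -> stop command, as listed on the slide. Daemons reported as
# "Not running at LANL" or "Already off" are omitted.
STOP_COMMANDS = {
    "envmond": "/sbin/init.d/envmon stop",
    "insightd": "/sbin/init.d/insightd stop",  # leading slash ASSUMED
    "snmpd": "/sbin/init.d/snmpd stop",
    "lpd": "/sbin/init.d/lpd stop",
    "niff": "/sbin/init.d/niffd stop",
}

def planned_stop_commands(daemons):
    """Return the stop commands for the requested daemons (dry run only)."""
    return [STOP_COMMANDS[d] for d in daemons if d in STOP_COMMANDS]

for cmd in planned_stop_commands(STOP_COMMANDS):
    print("would run:", cmd)
```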

Summary on Performance

Performance of Q machine is meeting and exceeding performance expectations

Performance modeling is an integral part of Q machine system deployment

Performance testing done at each major contractual milestone

FS-QB used in the unclassified environment for performance variability testing.

Approach is to systematically evaluate and implement recommendations of performance variability testing


Question 2: What is the MTBI? What are the topmost reasons for interrupts? What is the average utilization rate?


Machine Q Interrupts and Overall MTBI

ASCI QA Categorized Failures (Unscheduled Interrupts) per Month, data through 02/22/2003

Month                       Hardware   Other   Total
May                               21       4      25
June                              60      14      74
July                             102      50     152
August                            92      52     144
September                         75      25     100
October                          108      34     142
November                          70      23      93
December                          99      17     116
January                           84      23     107
February (month to date)          67      33     100

ASCI QA System MTBI per Month, data through 02/22/2003

Month                       MTBI (hours)
May                                 29.8
June                                 9.7
July                                 4.9
August                               5.2
September                            7.2
October                              5.3
November                             7.7
December                             6.4
January                              7
February (month to date)             5.3
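The monthly MTBI values appear to follow from dividing the wall-clock hours in each month by the number of unscheduled interrupts. The sketch below makes that assumption explicit (the slides do not state the formula) and uses the monthly totals as read from the chart's data labels.

```python
import calendar

# (year, month) -> total unscheduled interrupts, as read from the chart's
# data labels (readings from a garbled extraction; treat as approximate).
TOTAL_INTERRUPTS = {
    (2002, 5): 25, (2002, 6): 74, (2002, 7): 152, (2002, 8): 144,
    (2002, 9): 100, (2002, 10): 142, (2002, 11): 93, (2002, 12): 116,
    (2003, 1): 107,
}

def mtbi_hours(year, month, interrupts, days=None):
    """Mean time between interrupts in hours, ASSUMING
    MTBI = wall-clock hours in the period / number of interrupts."""
    if days is None:
        days = calendar.monthrange(year, month)[1]
    return days * 24 / interrupts

for (year, month), count in TOTAL_INTERRUPTS.items():
    print(f"{year}-{month:02d}: {mtbi_hours(year, month, count):.1f} h")

# February 2003 is month-to-date: 100 interrupts in the 22 days through 02/22.
print(f"2003-02: {mtbi_hours(2003, 2, 100, days=22):.1f} h")
```

Under this assumption the results land within about 0.1 hour of the plotted values, which suggests the chart was computed in roughly this way.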

Topmost Reasons for HW interrupts

Detailed Scheduled and Unscheduled Categorized Hardware Interrupts

Category       July   August   September   October   November   December   January
CPU              70       23          33        65         48         62        61
Memory DIMM      15       47          32        34         19         28        20
…
Total            91       75          69       103         67         91        83

Sparse rows (the month each count belongs to is not recoverable from the flattened layout):
1 GBit Ethernet Card: 1, 1, 2
PCI Fibre Channel Adaptor: 2, 1, 4, 1, 1
PCI GBit Ethernet Board: 3
System Board: 2, 2, 1, 1

Interrupts for CPUs

QA Number of CPU Failures (Unscheduled Interrupts) per Week, data through 02/22/2003

[Figure: CPU failures per week (0 to 18) for ten weeks, December Wk3 through February Wk3. Data labels as extracted: 13, 17, 15, 10, 13, 15, 11, 13, 12, 10.]

Scientific Investigation of Cosmic Rays Impact on CPU Failures

• L2 Btag memory parity checked but not corrected
  • At altitude at Los Alamos the number of neutrons is about 6-10 times higher than at sea level
  • With large number of ES45 systems and altitude we could be finding neutron-induced CPU failures due to single-bit soft errors
• Neutron monitors installed with Q to measure neutron flux
• LANSCE beam line testing of different memories
  • Two classes of programs used
  • Some discrepancies between results, trying to figure out
  • Only testing for neutron impact, other particles being evaluated
• Statistical analysis for predicted error rates
• Attempting to map beam line test output to predicted # of CPU failures on Q based on neutron flux at SCC

Scientific Investigation of Cosmic Rays Impact on CPU Failures - continued

• Initial results seem to indicate that the system is being impacted by neutrons hitting the L2 btag memory

• Mapping of beam line results to predict # of CPU failures is not yet fully understood

• We are managing around this problem from an applications perspective as demonstrated by the recent success of the milestone runs.


Memory Interrupts

QA and QB Unscheduled Memory DIMM Failures per Week for Previous Ten Weeks through 02/22/2003

[Figure: number of failures per week (0 to 18) for QA and QB, December Wk3 through February Wk3.]

QA YTD 1/1/03 - 2/8/03

PEs per job   # jobs   Proc hrs   % of proc hrs
4              16426      48266            2.8%
8                107        486              0%
16              1377      10782            0.6%
32               194       7538            0.4%
64               212      11738            0.7%
128              462      38953            2.2%
256              251     123013            7.1%
512              287     344321           19.8%
1024             379     856987           49.4%
2048             184     292796           16.9%
4096               0          0              0%

Overall utilization rate for the initial few months is between 50% and 60%
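The percentage labels on this chart match each job-size bin's share of total processor hours. The sketch below recomputes those shares from the "Proc Hrs" row; treating the 4 to 4096 row as PEs per job is an assumption, since the extracted chart does not label it.

```python
# PEs per job (assumed meaning of the axis) -> processor hours,
# from the "Proc Hrs" row for QA, 1/1/03 - 2/8/03.
QA_PROC_HOURS = {
    4: 48266, 8: 486, 16: 10782, 32: 7538, 64: 11738, 128: 38953,
    256: 123013, 512: 344321, 1024: 856987, 2048: 292796, 4096: 0,
}

def share_percent(proc_hours):
    """Percentage of total processor hours used by each job-size bin."""
    total = sum(proc_hours.values())
    return {pes: 100.0 * hrs / total for pes, hrs in proc_hours.items()}

def large_job_share(proc_hours, min_pes):
    """Percentage of processor hours used by jobs of at least min_pes PEs."""
    total = sum(proc_hours.values())
    large = sum(hrs for pes, hrs in proc_hours.items() if pes >= min_pes)
    return 100.0 * large / total

for pes, pct in share_percent(QA_PROC_HOURS).items():
    print(f"{pes:>5} PEs: {pct:5.1f}%")

print(f"jobs with >= 512 PEs: {large_job_share(QA_PROC_HOURS, 512):.1f}%")
```

Most of the processor hours go to jobs of 512 PEs or more, with the 1024-PE bin alone accounting for about half.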

QB Final 11/1/02 - 1/22/03

PEs per job   # jobs   Proc hrs   % of proc hrs
4               4694      26052            0.5%
8               4382      75050            1.5%
16               142       1266              0%
32                82        684              0%
64               549      63379            1.3%
128              925     987403           19.7%
256              254     150619            3.0%
512             1062     649458           13.0%
1024            1031      3E+06           50.8%
2048             277     239710            4.8%
4096             559     272862            5.4%

Over 4.3 million processor hours for science runs

System utilization over 85% some weeks

Question 3:

What is the primary complaint, if any, from the users?


Historical Top Issues

• Reliability and Availability
• Message Passing Interface (MPI)
• LSF integration
• File systems
• Code development tools

Current Top User Issues
October 2002 Q Technical Quarterly Review

Highest Priority
• File system problems
• System Availability & Reliability
• HPSS “store” performance

Note the absence of MPI issues

Current Top User Issues
October 2002

Medium Priority
• Serial jobs (& login sessions) on QA
• Q file system performance poor for many small serial files
• Formal change control and preventative maintenance on all Q systems
• QA viz users need non-purged file system

Current Top User Issues
October 2002

Lower Priority
• LSF configurations on all Q systems
• Early-Beta nature of QA versus user count
• White-to-LANL (Q & HPSS) connectivity
• DFS on Q (for Sandia users)
• MPI CRC on Q
• Q “devq” 2-login limit

Highest Priority

• File system problems
  • Loss of all /scratch files (multiple times)
  • Local component failures impact entire file system
  • Files not always visible (PFS & NFS)
  • Slow performance (e.g. simple “ls” command)
• System Availability & Reliability
  • Whole machine impact
  • Long (4-8 hr) reboot time!
  • Many hung “services” require reboots

Highest Priority

• HPSS “store” performance
  • HPSS rates too low for QA capability
    • < 50 MB/s
    • 100's of GB (not unusual) require hours to store
  • SW & HW upgrades (relief is coming)
    • 150 MB/s Nov. target; 600 MB/s Jan. target
    • Parallel clients; new HW & 4- & 16-way stripes
• Totalview & F90 data in modules on Q
  • Can’t see F90 data located in modules
  • Workaround cumbersome & sometimes even crashes
  • Issue is over 1 yr old!
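To put the HPSS store rates on this slide in perspective, the sketch below estimates how long one data set takes to store at the current rate and at the two target rates. The 500 GB size is a hypothetical stand-in for "100's of GB", and decimal units (1 GB = 1000 MB) are assumed.

```python
def store_hours(size_gb, rate_mb_per_s):
    """Hours needed to store size_gb at a sustained rate_mb_per_s,
    assuming decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 3600

SIZE_GB = 500  # hypothetical data set size, standing in for "100's of GB"

# Rates from the slide: current ceiling, November target, January target.
for rate in (50, 150, 600):
    print(f"{rate:>3} MB/s: {store_hours(SIZE_GB, rate):.2f} h")
```

At 50 MB/s a data set of this size takes close to three hours, while the 600 MB/s target would bring it under fifteen minutes.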

Medium Priority

• Serial jobs (& login sessions) on QA
  • 4 PE minimum due to RMS/LSF config
• Q file system performance poor for many small serial files
  • Many codes write serial files from 1 PE
  • Some codes write 1 serial file per PE per dump time
  • Some codes write multiple sets of files at each dump time
• Formal change control and preventative maintenance on all Q systems
  • Machine needs to move to more production-like status
• QA viz users need non-purged file system
  • Interactive viz requires all files be resident simultaneously
  • No special “viz” file systems as on BlueMtn
