The ASCI Q System at Los Alamos
John Morrison, CCN Division Leader
Nicholas C. Metropolis Center for Modeling and Simulation
7th Workshop on Distributed Supercomputing, March 4, 2003
ASCI Q
LA-UR-03-0541
Q is operational for stewardship applications (1st 10T)
Many ASCI applications are experiencing significant performance increases over Blue Mountain.
Linpack performance run of 7.727 TeraOps (more than 75% efficiency)
Initial user response is very positive (with some issues!)
(Users want more cycles…)
Users from the tri-lab community are also using the system
Available to users for classified ASCI codes since August 2002
Smaller initial system available since April 2002
Los Alamos has run its December 2002 ASCI Milestone calculation on Q
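The Linpack efficiency quoted on this slide is the ratio of the sustained rate to the theoretical peak rate. A minimal sketch of that arithmetic follows. The peak figure is an assumption: the slide only says "1st 10T", and the processor count, clock rate, and operations per clock used here to derive 10.24 TeraOps are not stated in the talk.

```python
# Linpack efficiency = sustained rate / theoretical peak rate.
# ASSUMPTION: the three machine parameters below are not from the slide;
# they are chosen to reproduce a peak of about 10 TeraOps ("1st 10T").
PROCESSORS = 4096        # assumed processor count
CLOCK_GHZ = 1.25         # assumed clock rate
FLOPS_PER_CLOCK = 2      # assumed floating-point operations per clock

SUSTAINED_TERAOPS = 7.727  # Linpack result quoted on the slide


def peak_teraops(processors, clock_ghz, flops_per_clock):
    """Theoretical peak in TeraOps (GHz * ops/clock gives GigaOps per processor)."""
    return processors * clock_ghz * flops_per_clock / 1000.0


def efficiency(sustained, peak):
    """Fraction of the theoretical peak that was actually sustained."""
    return sustained / peak


peak = peak_teraops(PROCESSORS, CLOCK_GHZ, FLOPS_PER_CLOCK)
eff = efficiency(SUSTAINED_TERAOPS, peak)
print(f"peak = {peak:.2f} TeraOps, efficiency = {eff:.1%}")
```

Under these assumed parameters the ratio comes out just above 75%, in line with the "more than 75% efficiency" stated on the slide.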
Question 1:
Is your machine living up to the performance expectations? If yes, how? If not, what is the root cause?
Performance Comparison: Q vs. White vs. Blue Mountain
[Figure: SAGE (timing.input). Cycle time (s), 0.0 to 5.0, versus # PEs, 1 to 10000 (log scale). Series: Blue Mountain, ASCI White, ASCI Q.]
Cycle time: lower is better
Weak scaling of SAGE (problem size per processor is constant) -> ideal cycle time is constant for all PE counts (but there are parallel overheads)
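Because weak scaling holds the work per processor constant, any growth in cycle time with PE count is parallel overhead. One common way to summarize such a curve is the ratio of the baseline cycle time to the cycle time at each PE count. The sketch below illustrates that calculation; the function name and the sample timings are hypothetical and are not measurements from the talk.

```python
def weak_scaling_efficiency(cycle_times):
    """Map PE count -> efficiency relative to the smallest PE count measured.

    cycle_times: dict of {number_of_PEs: cycle_time_in_seconds}.
    With ideal weak scaling every efficiency would be 1.0.
    """
    base_pes = min(cycle_times)
    base_time = cycle_times[base_pes]
    return {pes: base_time / t for pes, t in sorted(cycle_times.items())}


# HYPOTHETICAL timings for illustration only (not data from the slide).
sample = {1: 0.50, 16: 0.60, 256: 0.80, 4096: 1.00}

for pes, eff in weak_scaling_efficiency(sample).items():
    print(f"{pes:>5} PEs: efficiency {eff:.2f}")
```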
Modeled and Measured Performance
Unique capability for performance prediction developed in the Performance and Architecture Lab (PAL) at Los Alamos
Latest two sets of measurements are consistent (~70% longer than model)
[Figure: SAGE on QB, 1-rail (timing.input). Cycle time (s), 0 to 1.4, versus # PEs, 1 to 10000 (log scale). Series: Model, 21-Sep, 25-Nov.]
Lower is better!
There is a difference. Why?
Using fewer PEs per Node
Test performance using 1, 2, 3, and 4 PEs per node
Using fewer PEs per node reduces the number of compute processors available
Performance degradation appears when using all 4 processors in a node!
[Figure: SAGE on QB (timing.input). Cycle time (s), 0 to 1.4, versus # PEs, 1 to 10000 (log scale). Series: 1, 2, 3, and 4 PEs per node.]
Performance Variability
Lots of noise on the nodes: daemons and kernel activity
This noise was analyzed, quantified, modeled, and included back in the application model
This system activity has structure: it was identified and modeled
Cycle time varies from cycle to cycle
[Figure: SAGE on QB, 3584 PEs (timing.input). Cycle time (s), 0.0 to 3.0, versus cycle number, 100 to 1000. Series: Cyc_sec, Model.]
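To see why node-level noise can dominate at scale, consider a bulk-synchronous application in which every cycle ends only when the slowest PE finishes. The toy simulation below is an illustrative sketch, not the PAL model: it assumes each PE is independently delayed by a fixed amount with a small probability per cycle. All parameter values and function names are hypothetical.

```python
import random


def simulate_cycle(num_pes, work_s, hit_prob, delay_s, rng):
    """One bulk-synchronous cycle: it ends when the slowest PE finishes."""
    slowest = work_s
    for _ in range(num_pes):
        # Each PE is delayed by delay_s with probability hit_prob (assumed noise).
        finish = work_s + (delay_s if rng.random() < hit_prob else 0.0)
        slowest = max(slowest, finish)
    return slowest


def mean_cycle_time(num_pes, cycles=200, work_s=0.70, hit_prob=0.001,
                    delay_s=0.5, seed=0):
    """Average cycle time over several simulated cycles (hypothetical defaults)."""
    rng = random.Random(seed)
    total = sum(simulate_cycle(num_pes, work_s, hit_prob, delay_s, rng)
                for _ in range(cycles))
    return total / cycles


def prob_cycle_delayed(num_pes, hit_prob):
    """Probability that at least one PE is hit by noise in a given cycle."""
    return 1.0 - (1.0 - hit_prob) ** num_pes


for n in (1, 64, 3584):
    print(f"{n:>5} PEs: mean cycle {mean_cycle_time(n):.3f} s, "
          f"P(delayed) = {prob_cycle_delayed(n, 0.001):.3f}")
```

In this simplified model, a disturbance that is rare on any single PE becomes nearly certain to affect some PE in every cycle once thousands of PEs synchronize, which is the qualitative effect the slide describes.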
Performance Variability (2)
Histogram of cycle-time over 1000 cycles
Minimum cycle-time is very close to model! (0.75 vs 0.70)
[Figure: SAGE on QB, 3584 PEs (timing.input). Histogram of cycle time: frequency, 0 to 120, versus histogram bins (s), 0.7 to 2.0.]
Performance is variable (some cycles are not affected!)
Modeled and Experimental Data
The model is a close approximation of the experimental data
The primary bottleneck is the noise generated by the compute nodes (Tru64)
[Figure: Barrier, 1 ms granularity, modeled and experimental data. Latency (ms), 1 to 8, versus nodes, 0 to 1000.]
After system modifications (kernel, daemons, and Quadrics RMS): right on target!
After these optimizations, Q will deliver the performance that it’s supposed to.
Modeling works!
Resources
Performance and Architecture Lab (PAL) in CCS-3
Work by Petrini, Kerbyson, Hoisie
Publications on this work and other architecture and performance topics at
www.c3.lanl.gov/par_arch
Plan of Attack
Find low-hanging fruit (common problems with high payback) to attack first
1. Kill unnecessary daemons
2. Look at members 1 and 2 for CFS-related activities
3. Member 31 noise!!
Kill Daemons
HP SC engineering is checking that there are no operational problems with permanently switching them off.
Daemon     Status
envmond    /sbin/init.d/envmon stop
insightd   /sbin/init.d/insightd stop
snmpd      /sbin/init.d/snmpd stop
advfsd     Not running at LANL
smsd       Not running at LANL
lat        Already off
lpd        /sbin/init.d/lpd stop
xlogin     Not running at LANL
niff       /sbin/init.d/niffd stop
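The daemon list above can be captured as data and applied in one pass. The following is a hedged sketch rather than the procedure actually used at LANL: the helper name `stop_daemons` is hypothetical, the stop commands are copied from the slide, and the default is a dry run that only reports what would be executed.

```python
import shlex
import subprocess

# Stop commands as listed on the slide. None means the slide lists no
# command (daemon not running at LANL, or already off).
STOP_COMMANDS = {
    "envmond": "/sbin/init.d/envmon stop",
    # Path written with a leading "/" to match the other entries (assumption).
    "insightd": "/sbin/init.d/insightd stop",
    "snmpd": "/sbin/init.d/snmpd stop",
    "advfsd": None,
    "smsd": None,
    "lat": None,
    "lpd": "/sbin/init.d/lpd stop",
    "xlogin": None,
    "niff": "/sbin/init.d/niffd stop",
}


def stop_daemons(commands, dry_run=True):
    """Return the stop commands that apply; run them only if dry_run is False."""
    planned = []
    for name, cmd in commands.items():
        if cmd is None:
            continue  # nothing to stop for this daemon
        planned.append(cmd)
        if not dry_run:
            # Raises CalledProcessError if the init script reports failure.
            subprocess.run(shlex.split(cmd), check=True)
    return planned


for cmd in stop_daemons(STOP_COMMANDS):
    print("would run:", cmd)
```

Keeping the dry run as the default reflects the caution on the slide: the daemons should only be switched off permanently once engineering has confirmed there are no operational problems.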
Summary on Performance
Performance of the Q machine is meeting and exceeding expectations
Performance modeling is an integral part of Q machine system deployment
Performance testing done at each major contractual milestone
FS-QB used in the unclassified environment for performance variability testing.
Approach is to systematically evaluate and implement recommendations of performance variability testing
Question 2:
What is the MTBI? What are the topmost reasons for interrupts? What is the average utilization rate?
Machine Q Interrupts and Overall MTBI
ASCI QA Categorized Failures (Unscheduled Interrupts) per Month, data through 02/22/2003

Month                     Hardware  Other  Total
May                             21      4     25
June                            60     14     74
July                           102     50    152
August                          92     52    144
September                       75     25    100
October                        108     34    142
November                        70     23     93
December                        99     17    116
January                         84     23    107
February (month to date)        67     33    100
ASCI QA System MTBI per Month, data through 02/22/2003

Month                     MTBI (hours)
May                       29.8
June                      9.7
July                      4.9
August                    5.2
September                 7.2
October                   5.3
November                  7.7
December                  6.4
January                   7
February (month to date)  5.3
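The charted MTBI (mean time between interrupts) values are consistent with a simple definition: wall-clock hours in the period divided by the total number of unscheduled interrupts in that period. The slide does not state the formula, so the sketch below treats it as an assumption and checks it against a few months. The interrupt totals are read from the chart's data labels, and the February period is taken as 22 days to match the "data thru 02/22/2003" note.

```python
def mtbi_hours(period_hours, interrupts):
    """Mean time between interrupts, ASSUMING MTBI = period hours / interrupt count."""
    return period_hours / interrupts


# (label, days in period, total unscheduled interrupts, MTBI shown on the chart)
# Totals and charted values are read from the slide's data labels.
MONTHS = [
    ("May", 31, 25, 29.8),
    ("June", 30, 74, 9.7),
    ("July", 31, 152, 4.9),
    ("February (to 02/22)", 22, 100, 5.3),
]

for label, days, total, charted in MONTHS:
    computed = mtbi_hours(days * 24, total)
    print(f"{label:<20} computed {computed:5.2f} h, charted {charted} h")
```

Under this assumption the computed values agree with the charted ones to within rounding for the months shown.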
Topmost Reasons for HW interrupts
Detailed Scheduled and Unscheduled Categorized Hardware Interrupts
Category      July  August  September  October  November  December  January
CPU             70      23         33       65        48        62       61
Memory DIMM     15      47         32       34        19        28       20
…
Total           91      75         69      103        67        91       83

Categories with entries in only some months (values in source order; month columns not recoverable):
1 GBit Ethernet Card: 1, 1, 2
PCI Fibre Channel Adaptor: 2, 1, 4, 1, 1
PCI GBIT Ethernet Board: 3
System Board: 2, 2, 1, 1
Interrupts for CPUs
QA Number of CPU Failures (Unscheduled Interrupts) per Week, data through 02/22/2003

Week            CPU failures
December Wk3    13
December Wk4    17
January Wk1     15
January Wk2     10
January Wk3     13
January Wk4     15
January Wk5     11
February Wk1    13
February Wk2    12
February Wk3    10
Scientific Investigation of Cosmic Rays Impact on CPU Failures
• L2 Btag memory is parity checked but not corrected
  At the altitude of Los Alamos the number of neutrons is about 6-10 times higher than at sea level
  With a large number of ES45 systems and the altitude, we could be finding neutron-induced CPU failures due to single-bit soft errors
• Neutron monitors installed with Q to measure neutron flux
• LANSCE beam line testing of different memories
  Two classes of programs used
  Some discrepancies between results, still being investigated
  Only testing for neutron impact; other particles being evaluated
• Statistical analysis for predicted error rates
• Attempting to map beam line test output to predicted # of CPU failures on Q based on neutron flux at SCC
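Mapping beam line results to a predicted failure count is usually framed as a rate calculation: a per-device cross-section measured in the beam, multiplied by the ambient neutron flux, the number of devices, and the exposure time. The sketch below illustrates that general relationship only. It is not the analysis performed for Q, and every numeric value in it is a placeholder; the only figure taken from the slide is the 6-10x altitude factor.

```python
def expected_failures(cross_section_cm2, flux_per_cm2_s, devices, seconds):
    """Expected soft-error count = cross-section * flux * device count * time."""
    return cross_section_cm2 * flux_per_cm2_s * devices * seconds


# PLACEHOLDER inputs for illustration only; none of these are from the talk.
SEA_LEVEL_FLUX = 0.004        # neutrons / cm^2 / s (placeholder)
CROSS_SECTION = 1e-9          # cm^2 per device, as from a beam test (placeholder)
DEVICES = 4096                # number of exposed devices (placeholder)
WEEK_SECONDS = 7 * 24 * 3600

# The slide gives an altitude factor of about 6-10x over sea level.
for altitude_factor in (1, 6, 10):
    flux = SEA_LEVEL_FLUX * altitude_factor
    count = expected_failures(CROSS_SECTION, flux, DEVICES, WEEK_SECONDS)
    print(f"flux x{altitude_factor:<2}: about {count:.3f} expected failures per week")
```

Because the relationship is linear in flux, a 6-10x increase in neutron flux translates directly into a 6-10x increase in expected neutron-induced failures for the same hardware.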
Scientific Investigation of Cosmic Rays Impact on CPU Failures - continued
• Initial results seem to indicate that the system is being impacted by neutrons hitting the L2 btag memory
• Mapping of beam line results to predict # of CPU failures is not yet fully understood
• We are managing around this problem from an applications perspective as demonstrated by the recent success of the milestone runs.
Memory Interrupts
QA and QB Unscheduled Memory DIMM Failures per Week for Previous Ten Weeks through 02/22/2003
[Figure: bar chart of failures per week, 0 to 18, for December Wk3 through February Wk3, two series.]
Data labels in source order (the legend mapping the two series to QA and QB is not present):
First series: 7, 8, 8, 5, 5, 2, 4, 1, 9, 16
Second series: 4, 3, 1, 1, 5, 2, 0, 0, 0, 1