Eidgenössische Technische Hochschule Zürich Ecole polytechnique fédérale de Zurich Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich 25th Annual International Symposium on Computer Architecture 7th Workshop on Scalable Shared Memory Multiprocessor Memory System Performance of High End SMPs, PCs and Clusters of PCs Ch. Kurmann, T. Stricker Laboratory for Computer Systems ETHZ - Swiss Institute of Technology CH-8092 Zurich Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/
Memory System Performance of High End SMPs, PCs and Clusters of PCs
Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich
25th Annual International Symposium on Computer Architecture
7th Workshop on Scalable Shared Memory Multiprocessor
Memory System Performance of High End SMPs, PCs and Clusters of PCs
Ch. Kurmann, T. Stricker
Laboratory for Computer Systems
ETHZ - Swiss Institute of Technology
CH-8092 Zurich
Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/
2
Memory Systems
Low End designs in PCs: extremely low cost, standard I/O interface
High End designs in “Killer” Workstations: well engineered memory systems, support for additional data streams, better I/O buses
Are Low End SMPs the universal compute nodes for parallel and distributed systems?
3
Contribution
The answer probably lies in the memory system performance.
How significant are the differences in memory system performance?
Limitations of Low End memory systems: for local computation (e.g. in scientific applications) and for inter-node communication (e.g. in databases)
4
Extended Copy Transfer Characterization
ECT is a method to characterize the performance of memory systems (ISCA95 and HPCA97):
Categories: access pattern, stride (spatial locality); working set (temporal locality)
Value: transfer bandwidth (large amount of data)
Same chart resulting from one microbenchmark: local and remote transfers, compute and communicate accesses
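To make the two categories concrete, here is a minimal sketch of a strided load measurement in C. It is not the authors' ECT benchmark: the function names, the use of `clock()` for timing, and the choice to count only the words actually loaded are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* One pass: load every `stride`-th 64-bit word of a buffer of `words` words.
   The volatile qualifier keeps the compiler from removing the loads. */
uint64_t strided_load_pass(const volatile uint64_t *buf, size_t words, size_t stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i += stride)
        sum += buf[i];
    return sum;
}

/* Number of loads issued by one pass (ceiling of words / stride). */
size_t loads_per_pass(size_t words, size_t stride)
{
    return (words + stride - 1) / stride;
}

/* Hypothetical helper: load bandwidth in MByte/s (10^6 bytes per second)
   for one (working set, stride) point. Returns -1.0 on invalid arguments,
   allocation failure, or a run too short for the clock() resolution. */
double load_bandwidth_mbytes(size_t working_set_bytes, size_t stride, size_t passes)
{
    size_t words = working_set_bytes / sizeof(uint64_t);
    if (words == 0 || stride == 0 || passes == 0)
        return -1.0;

    uint64_t *buf = malloc(words * sizeof *buf);
    if (buf == NULL)
        return -1.0;
    for (size_t i = 0; i < words; i++)   /* touch the buffer before timing */
        buf[i] = i;

    volatile uint64_t sink = 0;          /* keeps the results observable */
    clock_t t0 = clock();
    for (size_t p = 0; p < passes; p++)
        sink += strided_load_pass(buf, words, stride);
    clock_t t1 = clock();
    (void)sink;
    free(buf);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return -1.0;
    double bytes = (double)passes * (double)loads_per_pass(words, stride)
                   * (double)sizeof(uint64_t);
    return bytes / secs / 1e6;
}
```

Sweeping `working_set_bytes` over powers of two and `stride` over a set of word distances, with enough `passes` per point to get a measurable run time, would give one bandwidth value per point of a chart with these two axes.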
5
Measurement Problems
Some parameter combinations are hard to measure, even with carefully tuned C code:
Reduced performance for large strides and small working sets in L1 caches is a measurement artifact and not architecture related.
Compilers occasionally generate suboptimal instruction schedules for loads / stores.
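A small worked calculation shows how few loads such a combination contains. The slides do not give the cause of the artifact; the reading that very short passes leave loop overhead hard to amortize is an assumption here, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Loads issued in one pass over a working set when touching every
   `stride_words`-th 64-bit word. Returns 0 for a zero stride. */
size_t loads_in_one_pass(size_t working_set_bytes, size_t stride_words)
{
    size_t words = working_set_bytes / sizeof(uint64_t);
    if (stride_words == 0)
        return 0;
    return (words + stride_words - 1) / stride_words;
}
```

Assuming the smallest working set label, 0.5 K, means 512 bytes (64 words), a stride of 1 gives 64 loads per pass, a stride of 32 gives 2, and a stride of 192 gives a single load.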
6
Local Load Access: Pentium Pro PC
[Figure: load bandwidth (MByte/s, axis 0 to 600) over working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 192). Pentium Pro FX, one processor, 200 MHz. Labeled regions: L1, L2, DRAM.]
7
Local Load Access: SGI Origin
[Figure: load bandwidth (MByte/s, axis 0 to 1600) over working set (0.5 K to 64 M) and access pattern (stride between 64bit words, 1 to 192). SGI Origin 10000, one processor, 195 MHz. Labeled regions: L1, L2.]
8
Local Load Access: DEC 8400
[Figure: load bandwidth (MByte/s, axis 0 to 1200) over working set (0.5 K to 64 M) and access pattern (stride between 64bit words, 1 to 192). DEC Alpha 8400, one processor, 300 MHz. Labeled regions: L1, L2, L3.]
9
Local Load Access: Sun Enterprise
[Figure: load bandwidth (MByte/s, axis 0 to 700) over working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 192). Sun Ultra Enterprise, one Ultra SPARC II, 248 MHz. Labeled regions: L1, L2, DRAM.]
10
Local Load Access: SGI Cray T3E
[Figure: load bandwidth (MByte/s, axis 0 to 1200) over working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 192). Cray T3E, one processor, 300 MHz. Labeled regions: L1, L2, DRAM.]
11
Comparison - Local Access
[Figure: memory load bandwidth (Mbyte/s, axis 0 to 300, plus a label of 450) versus access pattern (stride between 64bit words, 1 to 192) for Pentium Pro, SGI Origin, DEC 8400, Sun Enterp. and Cray T3E.]
12
Performance in an SMP setting
Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors
Topics of interest: small working sets in caches: performance remains