Page 1
Analytical Performance Modeling of Hierarchical Interconnect Fabrics
Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella
Universitat Politècnica de Catalunya
Supported by Intel Corporation
International Symposium on Networks-on-Chip (NOCS) 2012, Copenhagen, Denmark
Page 2
Outline
• Introduction
– Hierarchical Chip Multiprocessors (CMPs)
– Performance modeling for CMPs
– The cyclic dependency between latency and traffic
• Analytical performance modeling
– Modeling traffic
– Modeling latency
– Methods to resolve the dependency
• Results and conclusions
NOCS'12 Universitat Politècnica de Catalunya 2
Page 3
The trends in CMP design
• Hundreds of computing units per chip
– Smaller, simpler, more power-efficient cores
• Advanced memory management
– Larger on-chip cache
– Increasing interconnect (IC) bandwidth
• Tiled architecture
[Figure: 4x4 tiled CMP. Each tile contains a core (C) with L1 and L2 caches and a router (R); memory controllers are attached to the mesh.]
Page 4
Hierarchical interconnects
[Figure: tiled CMP with hierarchical interconnect. Within a cluster, cores with L1 (C+L1) and L2 caches, an L3, a directory (Dir) and a network interface (NI) share a local IC (bus / ring); each cluster attaches to a mesh router (R), and memory controllers are attached to the mesh.]
• Exploit locality of memory references*
* “Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs”, R. Das et al., HPCA, 2009
Page 5
Design of CMP architecture
• Goal: efficient use of chip resources
– Maximize performance
– Fit area/power/thermal budget
• Multidimensional exploration space
(#cores / cache size /
memory hierarchy / IC topologies /…)
• Means: automated design space exploration
– Analytical performance models are essential
[Figure: example hierarchical CMP with clusters of cores (C), L3 and directory (D) on local ICs, mesh routers (R) and memory controllers (MC)]
Page 6
Contention modeling
• Contention impacts CMP performance
• Crucial when evaluating hierarchical interconnects
– Is the required bandwidth sustainable?
[Figure: tiled CMP with hierarchical interconnect, annotated with open design questions: # of wires? Router architecture? Local IC topology?]
Page 7
Motivational example
48 cores, 16 cache modules, three interconnect organizations:
(a) 8x8 mesh
(b) 4x4 mesh with bus clusters
(c) 2x2 mesh with bus clusters
[Figure: throughput (IPC, 0 to 10) of configurations (a), (b) and (c), estimated with no contention and with contention. Legend of the layouts: core, cache, IC.]
Estimation without contention is very inaccurate!
Page 8
Analytical modeling of CMP performance
• Analytical models for ICs:
– Latency L as a function of traffic λ
– λ defined by the workload
Emphasis: λ depends on L!
• This work: resolve the cyclic dependency of traffic and latency
– Formulate λ as a function of L
– Add existing model for L(λ)
– Resolve the system efficiently
[Figure: cores 1 to N each send traffic λi to the memory subsystem and observe latency Li; L and λ determine the IPC (throughput)]
Page 10
Modeling memory traffic
Parameters of a core executing some workload:
1. $CPI_0$ - ideal Cycles Per Instruction
2. $MPI$ - # Memory references Per Instruction
Real performance of an in-order core:
$CPI = CPI_0 + MPI \cdot L$
where $L$ is the average latency of a memory access and $MPI \cdot L$ is the memory access penalty.
Traffic to memory (probability of a memory reference per cycle):
$\lambda = MPI / CPI = MPI / (CPI_0 + MPI \cdot L)$
[Figure: the core sends traffic λ to the memory subsystem and observes latency L]
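A minimal sketch of the traffic model on this slide, assuming the usual in-order formulation: real CPI equals the ideal CPI plus a memory access penalty of MPI times L, and traffic is memory references per instruction divided by cycles per instruction. The function names are hypothetical; the parameter values (ideal IPC of 2.0, so ideal CPI of 0.5, and MPI of 0.5) are taken from the case-study slide.

```python
def effective_cpi(cpi0: float, mpi: float, avg_latency: float) -> float:
    """Real CPI of an in-order core: ideal CPI plus the memory access penalty."""
    return cpi0 + mpi * avg_latency


def traffic_rate(cpi0: float, mpi: float, avg_latency: float) -> float:
    """Memory references per cycle: references per instruction over cycles per instruction."""
    return mpi / effective_cpi(cpi0, mpi, avg_latency)


# Assumed parameters: ideal IPC 2.0 (CPI_0 = 0.5) and MPI = 0.5.
CPI0, MPI = 0.5, 0.5
RATES = {lat: traffic_rate(CPI0, MPI, lat) for lat in (0.0, 10.0, 50.0)}
```

The higher the latency seen by the core, the less traffic it injects, which is the source of the cyclic dependency between L and λ.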
Page 11
Modeling average memory latency
• The average latency of memory requests for a core is expressed through latencies and probabilities
• Latencies are calculated using
– Cache latencies
– Interconnect topology
– Routing algorithm (XY)
• Probabilities are calculated using
– Miss ratio dependency on cache size
[Figure: two plots of miss ratio versus cache size (0 to 10 Mb), each labeled "Application"; annotated example: 15% miss in 64K L1, 5% miss in 1M L2]
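A hedged sketch of how such an average could be computed, written as a textbook average memory access time. It is not the formula from the slide: it assumes the quoted miss ratios are local (relative to the requests that reach each level) and it ignores interconnect latency. The cache latencies (2 cycles for 64K, 6 cycles for 1M) and the 100-cycle MC latency come from the case-study slide; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def average_memory_latency(levels: Sequence[Tuple[float, float]],
                           memory_latency: float) -> float:
    """Probability-weighted latency over a cache hierarchy.

    levels: (access latency in cycles, local miss ratio) per level, nearest first.
    """
    total = 0.0
    reach = 1.0  # probability that a request reaches the current level
    for access_latency, miss_ratio in levels:
        total += reach * access_latency
        reach *= miss_ratio
    return total + reach * memory_latency


# Assumed example: 15% miss in a 64K L1 (2 cycles), 5% miss in a 1M L2 (6 cycles).
EXAMPLE = average_memory_latency([(2, 0.15), (6, 0.05)], memory_latency=100)
```

Under these assumptions the example evaluates to 2 + 0.15 * (6 + 0.05 * 100) = 3.65 cycles.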
Page 12
Modeling contention latency
[Figure: left, mesh NoC of routers (R) connecting clusters (CL) and memory controllers (MC); right, bus-based cluster with cores (C), L3, directory (D) and network interface (NI)]
Delays in queues are defined by extending the M/G/1 queuing model:
“An Analytical Approach for Network-on-Chip Performance Analysis”, Ogras et al., TCAD, 2010 (Best Paper Award)
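For orientation, the classical single-queue M/G/1 result (the Pollaczek-Khinchine formula) is sketched below. This is only the textbook starting point, not the extended router model of Ogras et al.; the function and parameter names are hypothetical.

```python
def mg1_waiting_time(arrival_rate: float,
                     mean_service: float,
                     second_moment_service: float) -> float:
    """Mean time spent waiting in an M/G/1 queue (Pollaczek-Khinchine).

    arrival_rate: Poisson arrival rate (e.g. flits/cycle)
    mean_service: E[S], mean service time (cycles)
    second_moment_service: E[S^2], second moment of the service time
    """
    utilization = arrival_rate * mean_service
    if not 0.0 <= utilization < 1.0:
        raise ValueError("queue is unstable: utilization must be in [0, 1)")
    return arrival_rate * second_moment_service / (2.0 * (1.0 - utilization))


# Deterministic 1-cycle service (E[S] = 1, E[S^2] = 1) at half load.
WAIT_DETERMINISTIC = mg1_waiting_time(0.5, 1.0, 1.0)
# Exponential service with mean 1 (E[S^2] = 2) at half load.
WAIT_EXPONENTIAL = mg1_waiting_time(0.5, 1.0, 2.0)
```

The waiting time grows without bound as utilization approaches 1, which is why latency rises steeply at high traffic rates.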
Page 13
System of non-linear equations
• Solve using numerical methods
• General methods are very slow
– 10x10 mesh (10K vars./eqns.)
– MATLAB timeout after a few hours
• Proposed methods:
– Fixed-point iteration
– Bisection search for λ
[Figure: the cyclic dependency of L and λ. The traffic equations are coupled with an analytical model for latency, which can be any “black-box” model for L(λ)!]
Page 14
Fixed-point iteration
+ Fast (10x10 mesh in several ms)
+ Converges to the exact solution
– May not converge for high λ
[Figure: L, average latency (cycles, 0 to 50), versus λ, average traffic rate (flits/cycle, 0 to 0.2). Curve L(λ): characteristic of the IC. Curve λ(L): characteristic of the cores/workload. The hop-count latency level is marked.]
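A minimal scalar sketch of the fixed-point idea, under stated assumptions: a single aggregate core whose traffic is MPI / (CPI_0 + MPI * L), and a toy black-box latency model (hop-count latency plus an M/D/1-style queueing term). The real model solves for many cores and links at once; every name and constant below is hypothetical.

```python
import math


def traffic(avg_latency: float, cpi0: float = 0.5, mpi: float = 0.5) -> float:
    """lambda(L): memory references per cycle of an in-order core (assumed model)."""
    return mpi / (cpi0 + mpi * avg_latency)


def latency(lam: float, l_hop: float = 10.0, service: float = 2.0) -> float:
    """Toy black-box L(lambda): hop-count latency plus an M/D/1-style queueing delay."""
    utilization = lam * service
    if utilization >= 1.0:
        return math.inf  # saturated interconnect
    return l_hop + lam * service ** 2 / (2.0 * (1.0 - utilization))


def fixed_point(latency_model, traffic_model, l_start, tol=1e-9, max_iter=1000):
    """Iterate L <- L(lambda(L)) until two successive latencies differ by less than tol."""
    current = l_start
    for iteration in range(1, max_iter + 1):
        updated = latency_model(traffic_model(current))
        if abs(updated - current) < tol:
            return updated, iteration
        current = updated
    raise RuntimeError("fixed-point iteration did not converge")


# Start from the hop-count latency, i.e. the latency with no contention.
L_SOLUTION, ITERATIONS = fixed_point(latency, traffic, l_start=10.0)
```

With these toy parameters the iteration settles within a handful of steps; near saturation the latency curve becomes steep and the same loop can oscillate or diverge, hence the `max_iter` guard.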
Page 15
Bisection search for λ
– As fast as fixed-point iteration
– Always converges to an approximate solution (good for homogeneous clusters)
[Figure: L, average latency (cycles, 0 to 50), versus λ, average traffic rate (flits/cycle, 0 to 0.2). Curve L(λ): characteristic of the IC. Curve λ(L): characteristic of the cores/workload. Marked points on the λ axis: λ=0 and λ(Lhop-count).]
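A hedged scalar sketch of bisection over λ. It assumes the search interval runs from λ = 0 to λ(L_hop-count), as the marked points in the plot suggest, and reuses the same toy traffic and latency models as an illustration; none of the names or constants come from the source.

```python
import math


def traffic(avg_latency: float, cpi0: float = 0.5, mpi: float = 0.5) -> float:
    """lambda(L): memory references per cycle of an in-order core (assumed model)."""
    return mpi / (cpi0 + mpi * avg_latency)


def latency(lam: float, l_hop: float = 10.0, service: float = 2.0) -> float:
    """Toy black-box L(lambda): hop-count latency plus an M/D/1-style queueing delay."""
    utilization = lam * service
    if utilization >= 1.0:
        return math.inf  # saturated interconnect
    return l_hop + lam * service ** 2 / (2.0 * (1.0 - utilization))


def bisect_traffic(latency_model, traffic_model, l_hop, tol=1e-9):
    """Find lambda such that lambda == traffic_model(latency_model(lambda))."""
    low, high = 0.0, traffic_model(l_hop)  # contention can only reduce traffic
    while high - low > tol:
        mid = 0.5 * (low + high)
        if traffic_model(latency_model(mid)) > mid:
            low = mid   # cores would inject more than mid: the solution is higher
        else:
            high = mid  # cores would inject less than mid: the solution is lower
    return 0.5 * (low + high)


LAMBDA_SOLUTION = bisect_traffic(latency, traffic, l_hop=10.0)
```

Because λ(L(λ)) is non-increasing in λ, the sign test is monotone and the interval always shrinks, so the loop terminates even where the fixed-point iteration would not.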
Page 17
Performance of analytical methods
Test  Mesh     Cont. lat.  Num. of var./eqn.  Runtime (sec)
                                              MATLAB         Fixed-Point  Bisection
T1    2 x 2    5%          236                0.023          0.001        0.001
T2    4 x 4    13%         1224               1.412          0.001        0.002
T3    6 x 6    8%          3108               30.831         0.002        0.003
T4    8 x 8    12%         6128               408.539        0.006        0.010
T5    10 x 10  23%         10260              Timeout (1hr)  0.010        0.012
T6    10 x 10  46%         10260              Timeout (1hr)  0.022        0.015
T7    10 x 10  55%         10260              Timeout (1hr)  NA           0.016
Page 18
Case study: performance exploration
Parameter         Value
Chip area         350 mm²
Core area         1.25 mm²
Core IPC0         2.0
MPI               0.5
L1 size           64, 128 Kb
L2 size           64 Kb to 3 Mb
Memory density    1 mm² / Mb
Mesh dimensions   2x2 to 16x16
MC latency        100 cycles
[Figure: miss ratio (0 to 0.25) versus cache size (0 to 10 Mb)]
1062 configurations explored
Cache size        64K    128K   256K  512K  1M   2M   4M   8M
Area* (mm²)       0.063  0.125  0.25  0.5   1.0  2.0  4.0  8.0
Latency (cycles)  2      3      4     5     6    7    8    9
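The cache table follows a simple pattern that can be sketched in a few lines: area scales with the stated memory density of 1 mm² per Mb, and latency grows by one cycle per doubling of capacity, starting at 2 cycles for 64K. These closed forms are read off the table as an assumption; the source does not state them, and the helper names are hypothetical.

```python
import math


def cache_area_mm2(size_kb: float, density_mm2_per_mb: float = 1.0) -> float:
    """Cache area from capacity, assuming a constant memory density."""
    return size_kb / 1024.0 * density_mm2_per_mb


def cache_latency_cycles(size_kb: float) -> int:
    """Access latency: 2 cycles at 64K, plus one cycle per doubling (assumed pattern)."""
    return 2 + round(math.log2(size_kb / 64.0))


SIZES_KB = [64 * 2 ** i for i in range(8)]  # 64K, 128K, ..., 8M
AREAS = [cache_area_mm2(s) for s in SIZES_KB]
LATENCIES = [cache_latency_cycles(s) for s in SIZES_KB]
```

The computed area for 64K is 0.0625 mm², which the table rounds to 0.063.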
Page 19
Simulation environment
[Figure: network simulation with a global (mesh) network and local (bus, ring, …) networks connecting core, L3 cache and memory controller nodes]
• Verify model by simulation
• Cycle-accurate NoC simulator
– On top of BookSim 2.0
• Extensions
– Hierarchical networks
– Bus topologies
– Probabilistic state-machines for cores and memories
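To illustrate what a probabilistic state machine for a core might look like, here is a small standalone sketch. It is not BookSim code and does not use the BookSim API: it assumes a two-state core (computing or stalled) that executes one instruction per compute cycle, issues a memory request with a fixed probability, and then stalls for a fixed latency. All names and values are hypothetical.

```python
import random


def simulate_core(p_request: float, latency: int, cycles: int, seed: int = 0) -> float:
    """Return the measured request rate (requests per cycle) of a two-state core."""
    rng = random.Random(seed)
    stall = 0      # remaining stall cycles; 0 means the core is computing
    requests = 0
    for _ in range(cycles):
        if stall > 0:
            stall -= 1                  # stalled: waiting for the memory reply
        elif rng.random() < p_request:  # computing: maybe issue a request
            requests += 1
            stall = latency
    return requests / cycles


# Assumed parameters: CPI_0 = 1, MPI = 0.2, fixed memory latency of 10 cycles.
MEASURED = simulate_core(p_request=0.2, latency=10, cycles=200_000)
EXPECTED = 0.2 / (1.0 + 0.2 * 10)  # MPI / (CPI_0 + MPI * L)
```

Under these assumptions the measured rate approaches MPI / (CPI_0 + MPI * L), the same relation an in-order analytical traffic model would use.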
Page 20
Faithfulness of the model
[Figure: throughput (IPC, 0 to 35) of the explored configurations, sorted in descending order of throughput. Series: Modeling, Simulation.]
• Average difference in throughput is about 10%
• Corresponds to the error of the latency model
Page 21
Best-throughput ordering
Simulation time: 5.5 hours
Modeling time: 16.8 sec (>1000x faster)
[Figure: number of best configurations by analysis that include the N best configurations by simulation, versus N; full range (0 to 1000) and zoomed range (0 to 60). Series: static latency (no contention), full latency (with contention), ideal (simulation). Labeled points: (1; 33), (4; 44), (1; 2), (4; 6), (50; 64).]
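One plausible reading of the plotted metric is: for each N, the smallest K such that the top-K configurations ranked by analysis contain all of the top-N configurations ranked by simulation. A minimal sketch of that reading follows; the function name and the toy rankings are hypothetical.

```python
from typing import Hashable, Sequence


def analysis_prefix_needed(sim_order: Sequence[Hashable],
                           model_order: Sequence[Hashable],
                           n: int) -> int:
    """Smallest K such that the top-K by analysis include the top-N by simulation."""
    position = {config: index for index, config in enumerate(model_order)}
    return max(position[config] for config in sim_order[:n]) + 1


# Toy rankings, best first. An ideal model would return K == N for every N.
SIM_ORDER = ["a", "b", "c", "d"]
MODEL_ORDER = ["b", "c", "a", "d"]
CURVE = [analysis_prefix_needed(SIM_ORDER, MODEL_ORDER, n) for n in range(1, 5)]
```

In this toy case the simulated best configuration is only third by analysis, so the curve starts at 3 and meets the ideal line at N = 3.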
Page 22
Conclusions
• Analytical modeling of contention in CMPs is essential
• There is a cyclic dependency between latency and traffic of memory requests
• This dependency can be efficiently resolved using numerical methods (fixed-point, bisection)
• Precision of the model is significantly improved
• Current work: out-of-order cores, heterogeneity
Page 23
Backup
Page 24
Fixed-point convergence issues
Sufficient condition for convergence of the iteration:
[Figure: L, average latency (cycles, 0 to 50), versus λ, average traffic rate (flits/cycle, 0 to 0.2), with curves L(λ) and λ(L); the hop-count latency level is marked]
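The condition shown on the slide did not survive extraction. As a hedged reminder, the textbook sufficient condition for a scalar fixed-point iteration is that the iterated map is a contraction near the solution. Writing the iteration as a composition of the latency and traffic models (an assumed scalar simplification, with $F$ and $L_k$ as hypothetical notation):

```latex
L_{k+1} = F(L_k), \qquad F(L) = L\bigl(\lambda(L)\bigr)

\left| F'(L) \right|
  = \left| \frac{dL}{d\lambda} \cdot \frac{d\lambda}{dL} \right| < 1
```

Close to saturation the slope of L(λ) grows very large, so the product can exceed 1, which is consistent with the iteration failing to converge for high λ.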
Page 25
Bisection search
[Figure: bisection search coupling the latency model and the traffic model]
Page 26
Average latency calculation
• Average Memory Access Time (AMAT):
Page 27
Best configuration
- 6x6 mesh, 36 clusters, 5 cores/cluster
- total 180 cores with 64K L1, 256K L2
- 68Mb total shared L3
Throughput = 30.81 IPC
[Figure: 6x6 mesh of routers (R) with four memory controllers. Each cluster connects five cores (C+L1, each with an L2), an L3, a directory (Dir) and a network interface (NI) over a bus.]
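A quick arithmetic check of this configuration against the 350 mm² chip budget, as a sketch. It assumes the budget covers only cores and caches (interconnect and memory controller area are not counted, which the source does not confirm) and uses the case-study values: 1.25 mm² per core and 1 mm² per Mb of cache. Variable names are hypothetical.

```python
CLUSTERS = 6 * 6           # 6x6 mesh, one cluster per router
CORES_PER_CLUSTER = 5
CORE_AREA_MM2 = 1.25
DENSITY_MM2_PER_MB = 1.0   # memory density from the case-study parameters

cores = CLUSTERS * CORES_PER_CLUSTER
l1_area = 64 / 1024 * DENSITY_MM2_PER_MB    # 64K L1 per core
l2_area = 256 / 1024 * DENSITY_MM2_PER_MB   # 256K L2 per core
l3_area = 68 * DENSITY_MM2_PER_MB           # 68 Mb of L3 in total

total_area = cores * (CORE_AREA_MM2 + l1_area + l2_area) + l3_area
```

Under these assumptions the 180 cores and their caches occupy 349.25 mm², just inside the 350 mm² budget.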
Page 28
Runtime: Modeling vs Simulation
[Figure: runtime (seconds, log scale from 0.001 to 1000) versus number of components (cores + memories) in the CMP (0 to 1100). Series: Analytical, Simulation.]
Modeling a CMP with ~700 components in 1 second