Page 1
1
dist-gem5: Distributed Simulation of
Computer Clusters
Illinois: Mohammad Alian, Prof. Nam Sung Kim
ARM: Gabor Dozsa, Stephan Diestelhorst, Nikos Nikoleris, Radhika Jagtap
Tutorial at IEEE International Symposium on Workload Characterization (IISWC), Seattle, USA
1 Oct 2017
Page 2
2
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45 AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
Page 4
4
▪ Definition
▪ A cluster of computers that communicate and interact with each other by passing messages over
the network to process given tasks.
▪ Examples
▪ Datacenters, supercomputers
Distributed computer systems
The IBM Blue Gene/P supercomputer "Intrepid" at Argonne National Laboratory runs 164,000 processor cores in 40 racks/cabinets connected by a high-speed 3-D torus network.
A Google datacenter
Page 5
5
▪ To maximize performance and/or energy efficiency, we must capture the intricate
interplay among computers and their HW/SW sub-systems, especially the communication
and interaction that happen by passing messages over the network
Exploring and optimizing distributed computer systems
[Figure: clients and servers exchanging requests and responses over the network, alongside a time plot correlating core frequency F(core) in GHz and core utilization U(core) with receive/transmit bandwidth BW(rx)/BW(tx)]
Page 6
6
Using physical computers
▪ Advantage
▪ Fast evaluations for large-scale distributed computer systems
▪ Disadvantage
▪ Limited design space exploration (unable to explore distributed computer systems based on future
processor and computer sub-system architectures that have not been developed yet)
Using queuing-theoretic models
▪ Advantage
▪ Simple and fast evaluations for large-scale distributed computer systems
▪ Disadvantage
▪ Inaccurate/misleading evaluations (unable to capture complex interplay b/w HW/SW sub-systems of
computers)
Past methods exploring distributed computer systems [1]
Page 7
7
Using existing (full-system) simulators
▪ Advantage
▪ More flexible design space exploration than physical computer systems
▪ More precise evaluation than queuing-theoretic models
▪ Disadvantage
▪ Limited scalability and slow evaluation (legacy gem5)
▪ Not flexible (SST + gem5)
▪ Proprietary and limited to x86 (COTSon)
Past methods exploring distributed computer systems [2]
Page 8
8
▪ Evaluating performance and power dissipation of a distributed system
▪ Complex interplay among system components at scale
▪ Demanding a full-system, cycle-level simulator which is fast enough to simulate a large-
scale computer system
▪ Enabling distributed simulation:
▪ Simulation of a distributed computer system w/ many simulation hosts
dist-gem5
[Figure: dist-gem5 spans cores, caches, memory, network, devices, OS, and ISAs, modeling performance and power at scale]
Page 9
9
▪ Product of excellent synergistic collaboration b/w industry and academia
▪ Integrating the best features of concurrently developed multi-gem5 from ARM and pd-gem5 from U.
of Illinois for fast and deterministic simulations of distributed computer systems
History of dist-gem5 development
pd-gem5 multi-gem5
U. of Illinois ARM Research
dist-gem5
[Best Paper Finalist] M. Alian, et al., “dist-gem5: Distributed Simulation of Computer Clusters,” IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017.
M. Alian, et al., “pd-gem5: Simulation Infrastructure for Parallel/Distributed Computer Systems,” IEEE Computer Architecture Letters, vol. 15, no. 1, 2016.
Page 10
10
Example of research w/ dist-gem5
Datacenter power management algorithm
▪ Desired P/C-state governor
▪ reacts to changes in core utilization in a timely manner
▪ Approaches
▪ predict changes in core utilization
▪ core utilization is highly correlated w/ network activity
▪ hide P/C-state transition latency
▪ overlap P/C-state transition w/ packet reception and processing
[Figure: BW(rx) tracks U(core); NIC receive path – DMA through the rx_desc_ring into DRAM, interrupt handler and SoftIRQ, network-stack processing of skbs, and copy to user]
[Nominated for the Best Paper Award] M. Alian, et al., “NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture,” IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 2017.
Page 11
11
Other promising research directions
▪ Exploring HW/SW cross-layer approaches for datacenter computers and their sub-
systems
▪ Exploiting information from network HW/SW layers as hints for efficient computer resource
management (e.g., prefetching pages from slow to fast memory in a hybrid memory system)
▪ Off-loading simple data-intensive operations to network interface cards (NICs)
▪ Developing efficient evaluation methodologies for large-scale distributed computer
systems
▪ Exploring systematic hybrid evaluation approaches judiciously mixing queuing-theoretic modeling
and dist-gem5-based simulation for efficiently evaluating very large-scale distributed
computer systems (e.g., obtaining detailed parameters for a queuing-theoretic analytical model
using dist-gem5)
Page 12
12
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45 AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
Page 13
13
Michigan m5 + Wisconsin GEMS = gem5
“The gem5 simulator is a modular platform for computer-system architecture research,
encompassing system-level architecture as well as processor microarchitecture.”
What is gem5?
Page 14
14
Level of detail
▪ HW Virtualization
▪ No/very limited timing
▪ Host and guest must share the same ISA
▪ Functional mode
▪ No timing, chains basic blocks of instructions
▪ Can add cache models for warming
▪ Timing mode
▪ Single time for execute and memory lookup
▪ Advances on bundles
▪ Detailed mode
▪ Full out-of-order, in-order CPU models
▪ Hit-under-miss, reordering, …
[Figure: accuracy vs. speed spectrum]
RTL simulation – Cycle Accurate (µarch exploration, HW validation, perf. validation): 1–50 KIPS
gem5 – Approximately Timed (high-level perf./power, architecture exploration): 0.2–3 MIPS
QEMU – Loosely Timed (SW dev): 50–200 MIPS
gem5 + kvm – HW Virtualization: GIPS
Page 16
16
When not to use gem5
▪ Performance validation
▪ gem5 is not a cycle-accurate microarchitecture model!
▪ This typically requires more accurate models such as RTL simulation.
▪ Commercial products such as ARM CycleModels operate in this space.
▪ Core microarchitecture exploration
▪ Only do this if you have a custom, detailed, CPU model!
▪ gem5’s core models were not designed to replace more accurate microarchitectural models.
▪ To validate functional correctness or test bleeding-edge ISA improvements
▪ gem5 is not as rigorously tested as commercial products.
▪ New (ARMv8.0+) or optional instructions are sometimes not implemented
▪ Commercial products such as ARM FastModels offer better reliability in this space.
Page 17
17
Why gem5?
▪ Runs real workloads
▪ Analyze workloads that customers use and care about
▪ … including complex workloads such as Android
▪ Comprehensive model library
▪ Memory and I/O devices
▪ Full OS, Web browsers
▪ Clients and servers
▪ Rapid early prototyping
▪ New ideas can be tested quickly
▪ System-level impact can be quantified
▪ System-level insights
▪ Enables us to study complex memory-system interactions
▪ Can be wired to custom models
Ubuntu (Linux 4.x) Android Nougat
Page 18
18
CPU models overview
▪ Class hierarchy: BaseCPU → BaseKvmCPU (X86KvmCPU, ArmV8KvmCPU), TraceCPU, BaseSimpleCPU (AtomicSimpleCPU, TimingSimpleCPU), DerivO3CPU, MinorCPU
▪ AtomicSimpleCPU / TimingSimpleCPU: some timing, caches, no/limited BPs, fast
▪ DerivO3CPU / MinorCPU: full timing, caches, branch predictors, slow
▪ KVM-based CPUs: no timing, no caches, no BP, really fast
Page 19
19
Discrete event based simulation
▪ Discrete: Handles time in discrete steps
▪ Each step is a tick
▪ Usually 1 THz in gem5 (one tick per picosecond)
▪ Simulator skips to the next event on the timeline
[Figure: timeline – MyObj::startup() schedules an event; the simulator calls its event handler, which may schedule further events]
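The tick-driven event loop above can be sketched in a few lines of Python. This is an illustrative toy, not gem5's actual EventQueue API; the class and method names are made up:

```python
import heapq

class EventQueue:
    """Toy discrete-event queue: simulated time only advances to the tick of
    the next scheduled event (illustrative, not gem5's EventQueue API)."""
    def __init__(self):
        self._heap = []     # min-heap of (tick, seq, callback)
        self._seq = 0       # tie-breaker keeps same-tick ordering stable
        self.cur_tick = 0

    def schedule(self, tick, callback):
        assert tick >= self.cur_tick, "cannot schedule in the past"
        heapq.heappush(self._heap, (tick, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._heap:
            tick, _, callback = heapq.heappop(self._heap)
            self.cur_tick = tick        # skip straight to the next event
            callback(self)

log = []
eq = EventQueue()

def startup(q):          # like MyObj::startup(): a handler may schedule more work
    log.append(("startup", q.cur_tick))
    q.schedule(q.cur_tick + 250, lambda q2: log.append(("handler", q2.cur_tick)))

eq.schedule(1000, lambda q: log.append(("later", q.cur_tick)))
eq.schedule(500, startup)
eq.run()
assert log == [("startup", 500), ("handler", 750), ("later", 1000)]
```

Note how no time is spent between ticks 750 and 1000: the simulator jumps directly to the next event, which is what makes discrete-event simulation fast when the event timeline is sparse.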
Page 20
Simulating a distributed system
Page 21
21
Distributed gem5 Simulation – high level view
▪ gem5 processes modeling full systems run in parallel
on a cluster of host machines
▪ Packet forwarding engine
▪ Forwards packets among the simulated systems
▪ Synchronizes the distributed simulation
▪ Simulates the network topology
[Figure: three host machines, each running a gem5 process that models one simulated system, connected through the packet forwarding engine]
Page 22
22
Core components
Packet forwardingDistributed
checkpointing
Synchronization
Simulated network
Page 23
23
Core components
Packet forwardingDistributed
checkpointing
Synchronization
Simulated network
Page 24
24
dist-gem5 architecture – packet forwarding
[Figure: three physical hosts, each with a physical NIC, connected through a physical switch]
Page 25
25
dist-gem5 architecture – packet forwarding
[Figure: gem5 #1 and gem5 #2 (simulated systems with simulated NICs) run on physical hosts #1 and #2; gem5 #3 (the simulated switch with simulated ports 0 and 1) runs on physical host #3, all overlaid on the physical cluster]
Page 26
26
dist-gem5 architecture – packet forwarding
▪ Simulated packets are embedded into host TCP/IP packets
[Figure: simulated packets travel between gem5 processes as payloads of host TCP packets across the physical switch]
Page 27
27
Asynchronous processing of incoming messages
▪ Simulation thread (main thread)
▪ Processes/inserts events in the event queue
▪ In case of a send pkt event, encapsulates the simulated
Ethernet packet in a message and sends it out
▪ Receiver thread
▪ Created for each gem5 process
▪ Waits for incoming packets
▪ Creates a recv pkt event and inserts it into the event queue
[Figure: within a gem5 process, the simulation thread services the event queue while the receiver thread listens on the physical NIC and injects recv pkt events]
Page 28
28
Simulation accuracy and packet forwarding
▪ What is the correct tick for the receive event?
▪ st: send tick
▪ lat: simulated link latency
▪ bw: simulated link bandwidth (bytes/tick)
▪ size: simulated packet size (bytes)
▪ rt: receive tick
rt = st + lat + size / bw
▪ Accurate simulation
▪ rt >= curTick() when the receiver gem5 gets the real message encapsulating the simulated packet
▪ the receiver gem5 can then schedule the receive event for the simulated NIC
[Figure: event queue (head to tail) with the receive frame event inserted ahead of curTick on the timeline]
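The receive-tick formula and the accuracy condition can be sketched directly; a minimal illustration, assuming integer ticks and bandwidth expressed in bytes/tick (function names are made up):

```python
def receive_tick(send_tick, link_delay, bw_bytes_per_tick, size_bytes):
    """rt = st + lat + size / bw (all times in ticks, bandwidth in bytes/tick)."""
    return send_tick + link_delay + size_bytes // bw_bytes_per_tick

def can_schedule(rt, cur_tick):
    """Accuracy condition: the receiver may only schedule the receive event
    if rt has not already passed when the real message arrives."""
    return rt >= cur_tick

# A 1500-byte packet sent at tick 100 over a link with 1000 ticks of latency
# and 2 bytes/tick of bandwidth is due at 100 + 1000 + 750 = 1850.
assert receive_tick(100, 1000, 2, 1500) == 1850
assert can_schedule(1850, 1800)       # receiver is still behind: accurate
assert not can_schedule(1850, 1900)   # receiver ran past rt: would be wrong
```

The second assertion pair is exactly the problem the next slides address: without synchronization, nothing stops the receiver from running past rt before the message arrives.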
Page 29
29
Core components
Packet forwardingDistributed
checkpointing
Synchronization
Simulated network
Page 30
30
Need for synchronization
• Receiver gem5 can run ahead of sender gem5
✓ Physical host mismatch
✓ Different events to be processed
• Slow down the receiver gem5 to ensure simulation accuracy
• Quantum-based synchronization
[Figure: wall-clock timelines of gem5#0 and gem5#1 – a packet sent at the send time with a simulated network delay misses its expected delivery time and arrives late at the receiver]
Page 31
31
Accurate packet forwarding
• quantum: interval for periodic synchronization in simulated time
• Sync event flushes inter-gem5 communication channels
• If quantum ≤ simulated link delay:
✓ expected delivery tick falls inside the next quantum
• Optimal quantum size for accurate forwarding == simulated link delay
[Figure: wall-clock timelines of gem5#0 and gem5#1 – global syncs delimit quanta; a packet sent in one quantum reaches its expected delivery time in the next]
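The quantum rule can be checked numerically. A small sketch, assuming integer ticks and quantum boundaries at multiples of the quantum size (helper names are illustrative):

```python
def next_quantum_start(tick, quantum):
    """Start tick of the quantum after the one containing `tick`."""
    return (tick // quantum + 1) * quantum

def delivery_in_next_quantum(send_tick, link_delay, quantum):
    """True if the expected delivery tick cannot precede the next global sync."""
    return send_tick + link_delay >= next_quantum_start(send_tick, quantum)

# With quantum == simulated link delay (the optimal choice), no send tick
# inside a quantum can produce a delivery tick before the next sync:
assert all(delivery_in_next_quantum(st, 1000, 1000) for st in range(5000, 6000))

# With quantum > link delay, a packet can be due before the next sync,
# so the receiver might already have run past its receive tick:
assert not delivery_in_next_quantum(4100, 1000, 2000)
```

This is why shrinking the quantum below the link delay only adds sync overhead without improving accuracy, while growing it above the link delay breaks the accuracy guarantee.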
Page 32
32
Compute nodes, switch and synchronization
▪ Simulation progress stops at each sync tick in each gem5 process
▪ Simulated compute node
▪ Sends out a ‘sync request’ message
▪ Waits until the ‘sync ack’ message comes back
▪ Simulated switch
▪ Waits until it receives a ‘sync request’ message
▪ Sends out ‘sync ack’ messages
[Figure: three compute node gem5 processes connected to the Ethernet switch gem5 process]
Page 33
33
The global sync event
▪ A vanilla global gem5 event is scheduled at each sync tick in each gem5 process
▪ A global gem5 event is a transparent thread barrier (in case of multiple simulation threads)
▪ dist-gem5 global sync is prepared to work with multi-queue/multi-threaded gem5 simulations
▪ The process() method in a compute node
▪ sends out ‘sync request’ messages for each simulated link
▪ waits on a condition variable to get notified about completion by the receiver thread
▪ The process() method in a switch
▪ waits for completion notification from the receiver thread
▪ sends out ‘sync ack’ messages for each simulated link
▪ The receiver thread keeps processing incoming messages while the simulation thread is blocked
▪ creates receive events in the event queue for simulated Ethernet frames
▪ in a compute node, notifies the blocked simulation thread when ‘sync ack’ messages arrive
▪ in a switch, notifies the blocked simulation thread when ‘sync request’ messages arrive
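The request/ack handshake between blocked simulation threads and receiver threads can be sketched with condition variables. This is a toy model of the protocol, not dist-gem5's implementation; all names and the quantum value are illustrative:

```python
import threading

QUANTUM = 1000       # sync quantum in ticks (illustrative value)
NUM_NODES = 3
NUM_QUANTA = 3

class SyncChannel:
    """Per-process mailbox: the 'receiver thread' side calls deliver(),
    the blocked 'simulation thread' side calls wait_for()."""
    def __init__(self):
        self.cond = threading.Condition()
        self.arrived = 0

    def deliver(self):
        with self.cond:
            self.arrived += 1
            self.cond.notify()

    def wait_for(self, count):
        with self.cond:
            while self.arrived < count:
                self.cond.wait()
            self.arrived = 0            # consume this round's messages

switch_ch = SyncChannel()
node_chs = [SyncChannel() for _ in range(NUM_NODES)]
ticks = [0] * NUM_NODES

def node(rank):
    for _ in range(NUM_QUANTA):
        switch_ch.deliver()             # send 'sync request' to the switch
        node_chs[rank].wait_for(1)      # block until the 'sync ack' arrives
        ticks[rank] += QUANTUM          # safe to advance into the next quantum

def switch():
    for _ in range(NUM_QUANTA):
        switch_ch.wait_for(NUM_NODES)   # a 'sync request' on every link
        for ch in node_chs:
            ch.deliver()                # broadcast 'sync ack'

threads = [threading.Thread(target=node, args=(r,)) for r in range(NUM_NODES)]
threads.append(threading.Thread(target=switch))
for t in threads:
    t.start()
for t in threads:
    t.join()
assert ticks == [3000, 3000, 3000]      # every node ends at QUANTUM * NUM_QUANTA
```

No node can enter quantum k+1 before every node has finished quantum k, which is exactly the execution-barrier half of the global sync; the real implementation additionally flushes in-flight packet messages at the barrier.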
Page 34
34
Core components
Packet forwardingDistributed
checkpointing
Synchronization
Simulated network
Page 35
35
Distributed checkpointing
▪ Checkpoint support for dist-gem5 relies on the mainline gem5 checkpoint support
▪ Each gem5 process of a dist-gem5 run creates its own checkpoint
▪ dist-gem5 adds an extra co-ordination layer to ensure correctness
▪ No in-flight message may exist among gem5 processes when the distributed checkpoint is taken
▪ Vanilla gem5: m5 checkpoint pseudo inst → exitSimLoop() → drain() → serialize() → drainResume() → simulate()
▪ dist-gem5 checkpoint co-ordination: m5 checkpoint pseudo inst → global sync → exitSimLoop() → drain() → global sync → serialize() → drainResume() → simulate()
Page 36
36
Distributed checkpointing (cont.)
▪ Checkpoint can only be initiated at a periodic global sync
▪ Simplifies the implementation without sacrificing usability

Checkpoint flavour | Condition | Example use case
collaborative checkpoint | all compute nodes signal intent | Instrumented MPI application source code to take a checkpoint at the MPI_Barrier() before the ROI
immediate checkpoint | at least one compute node signals intent | Taking a checkpoint from the bootscript before starting an MPI application (i.e. before calling ‘mpirun’)
Page 37
37
Checkpoint @ global sync
▪ In practical use cases a distributed checkpoint is taken “near” an application barrier (e.g.
MPI_Barrier() or mpirun)
▪ We want to take the checkpoint when all processes hit the barrier in the application code =>
the desired application state can be captured even if we allow checkpoint writes only at a global sync
▪ At a global sync
▪ A compute node gem5 process can signal its intention to take a checkpoint
▪ ‘m5 checkpoint’ pseudo instruction => ‘need checkpoint’ meta info in the next ‘sync request’ message
▪ The switch gem5 process can command a checkpoint write
▪ ‘write checkpoint’ meta info in the ‘sync ack’ message => exitSimLoop() in all gem5 processes
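The two checkpoint flavours reduce to a small decision the switch makes at a global sync. A hedged sketch, where the function name and flag representation are made up for illustration; `intents` stands for the 'need checkpoint' flags carried by the incoming 'sync request' messages:

```python
def checkpoint_decision(intents, flavour):
    """Switch-side decision at a global sync: should 'write checkpoint'
    be piggybacked on the outgoing 'sync ack' messages?"""
    if flavour == "collaborative":
        return all(intents)     # every compute node must have signalled intent
    if flavour == "immediate":
        return any(intents)     # a single node's intent is enough
    raise ValueError(f"unknown checkpoint flavour: {flavour}")

# Collaborative: wait until the whole cluster reaches the barrier.
assert checkpoint_decision([True, True, True], "collaborative")
assert not checkpoint_decision([True, False, True], "collaborative")
# Immediate: e.g. one node's bootscript wants a checkpoint before mpirun.
assert checkpoint_decision([False, True, False], "immediate")
```

Because the decision is only evaluated at a global sync, both flavours automatically satisfy the "no in-flight messages" requirement from the previous slide.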
Page 38
38
Writing a checkpoint
▪ A distributed checkpoint can start only at a global sync
▪ Draining may require a different number of ticks in each gem5
▪ After drain completes, we flush out in-flight messages with an extra global sync
▪ Global sync implements both an execution and a data (message) barrier
[Figure: wall-clock timelines of gem5#0 and gem5#1 – the dist checkpoint starts at a global sync; each process drains for d0/d1 ticks, writes its checkpoint, then runs q − d0/q − d1 ticks to the next global sync (q: sync quantum ticks; d0, d1: drain ticks)]
Page 39
39
Restoring from a checkpoint
▪ The checkpoint might be written at different ticks in different gem5 processes
▪ An additional global sync aligns the ticks: d0 + d’ = d1
▪ The global sync delivers the max tick value to all gem5 processes
▪ Periodic global sync always happens at the same tick in every gem5
▪ The global sync period may change at restore
▪ The same checkpoint can be used to explore different network link latency/bandwidth effects
[Figure: wall-clock timelines – after restoring from the checkpoint, the gem5 processes align their differing drain ticks (d0, d1) with an extra global sync of d’ ticks, then resume periodic syncs with quantum q’]
Page 40
40
Restoring from a checkpoint (cont.)
▪ The user is allowed to change simulated link parameters when restoring from a checkpoint
▪ The same checkpoint can be used to explore different network link latency/bandwidth effects
▪ The global sync period may change at restore (if the simulated link latency changes)
▪ The checkpoint may contain simulated packets to be received in the future
▪ Receive ticks for such packets need to be adjusted to reflect the change of the simulated link
parameters
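Adjusting the receive tick of an in-flight packet can be sketched as recovering the send tick under the old link parameters and re-applying rt = st + lat + size/bw with the new ones. An illustrative sketch, not dist-gem5's code; integer ticks and bytes/tick bandwidth are assumed:

```python
def adjust_receive_tick(rt_old, size_bytes, old_link, new_link):
    """Re-derive the receive tick of a checkpointed in-flight packet after the
    link parameters changed at restore. old_link/new_link are
    (latency_ticks, bandwidth_bytes_per_tick) pairs."""
    old_lat, old_bw = old_link
    new_lat, new_bw = new_link
    # The checkpointed rt was st + old_lat + size/old_bw; recover st first.
    send_tick = rt_old - old_lat - size_bytes // old_bw
    # Then apply the same formula with the parameters given at restore.
    return send_tick + new_lat + size_bytes // new_bw

# A 1500-byte packet due at tick 1850 on a (1000-tick, 2 B/tick) link was sent
# at tick 100; with half the latency and double the bandwidth it becomes due
# at 100 + 500 + 375 = 975.
assert adjust_receive_tick(1850, 1500, (1000, 2), (500, 4)) == 975
```

This is what lets a single checkpoint serve a whole sweep of link latency/bandwidth experiments without re-running the boot and warm-up phases.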
Page 41
41
Core components
Packet forwardingDistributed
checkpointing
Synchronization
Simulated network
Page 42
42
dist-gem5 architecture – network modeling
[Figure: 64 servers in eight racks – servers #0–#7 connect to top-of-rack switch #0, servers #8–#15 to top-of-rack switch #1, …, servers #56–#63 to top-of-rack switch #7, and the top-of-rack switches connect to an aggregate switch; the entire switching fabric is simulated in one gem5 process]
Page 43
43
Configurable network model
▪ Configurable baseline Ethernet switch model
▪ Number of ports, delay, bandwidth, buffer size
[Figure: one gem5 process on a physical host simulates the whole fabric – distEtherLinks attach the full-system processes to the top-of-rack switches (simulated ports p0–p7), simulated etherLinks (port p8) connect them to the aggregate switch, and each simulated etherSwitch contains a MAC table plus per-port in-order input/output queues (IPORT#0–#n, OPORT#0–#n)]
Page 44
Deterministic simulation
Page 45
45
▪ We assume that a single compute node gem5 simulation is deterministic
▪ Ordering and speed of dist-gem5 messages in the real world
▪ Speed of gem5 processes (relative to each other) may vary
▪ Communication speed among gem5 processes may vary
▪ Global sync guarantees deterministic packet forwarding
▪ sync quantum <= simulated link latency
▪ global sync is a message barrier
Page 46
46
Global sync and deterministic packet forwarding
▪ The receive tick for a simulated packet may not fall within the same quantum in which the message is received
▪ A message always gets sent and received within a single quantum
[Figure: wall-clock timelines for gem5#0–gem5#2 – q: global sync period in ticks (quantum); n: simulated link latency in ticks; message deliveries never cross a global sync, e.g. send tick #1/receive tick #1 and send tick #2/receive tick #2 each span exactly one sync boundary in simulated time]
Page 47
Validation and speedup
[Best Paper Finalist] M. Alian, et al., “dist-gem5: Distributed Simulation of Computer Clusters,”
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017.
Page 48
49
Methodology – simulation techniques
▪ For example, simulating a cluster w/ 7 nodes and 1 network switch:
[Figure: three configurations on quad-core physical hosts – single-threaded-gem5 runs one gem5 process simulating all 7 systems plus the switch on one host; parallel-gem5 runs multi-threaded gem5 processes, several simulated systems per host; dist-gem5 runs one gem5 process per simulated system (gem5#0–#6) plus one for the switch (gem5#7), spread across two hosts]
Page 49
50
Methodology – experimental setup
▪ Focus on off-chip network performance using network-intensive applications
▪ iperf, memcached, httperf, tcptest, netperf, NAS parallel benchmarks
▪ Verification/validation against:
▪ single-threaded-gem5
▪ a physical cluster
▪ 4-node cluster w/ AMD A10-5800K
▪ Speedup comparison against:
▪ single-threaded-gem5
▪ parallel-gem5

category | gem5 configuration
O3 core | 4 cores; 4-way superscalar
memory | 8GB DDR3 1600 MHz
network | Intel GbE NIC; 1 μs link latency
OS | Linux Ubuntu 14.04 (kernel 4.3)
Page 50
51
Validation – network latency and bandwidth
▪ iperf (left) and memcached (right)
▪ Follows the behavior of the physical setup
▪ 17.5% lower response time for memcached
[Figure: left – latency (ms) vs. bandwidth (Gbps) for dist-gem5 and the physical cluster under iperf; right – latency (ms) across memcached distribution percentiles (1st–95th) for dist-gem5 and the physical cluster]
Page 51
52
Speedup – simulation time reduction
▪ Running httperf on each simulated node, sending a fixed number of requests to a unique simulated node (apache server)
▪ Compared with single-threaded-gem5
▪ dist-gem5 simulating 63 nodes on 16 physical hosts is
▪ 83.1× faster than single-threaded-gem5
▪ 12.8× faster than parallel-gem5
[Figure: speedup normalized to single-threaded-gem5 vs. number of simulated nodes (3, 7, 15, 31, 63) – dist-gem5 reaches 2.7, 6.3, 21.8, 36.0, 83.1, while parallel-gem5 stays at 2.7, 3.7, 6.6, 6.0, 6.5]
The speedup of parallel-gem5 saturates!
Page 52
53
Scalability – simulation time vs. simulated cluster size
▪ Simulation time increase for simulating 63 vs. 3 nodes:
▪ 57.3× for single-threaded-gem5
▪ 23.9× for parallel-gem5
▪ 1.9× for dist-gem5
[Figure: normalized simulation time (log scale) vs. number of simulated nodes – single-threaded-gem5 grows 1.0, 2.6, 9.4, 25.0, 57.3; parallel-gem5 grows up to 23.9; dist-gem5 stays below 1.9]
dist-gem5 scales well!
Page 53
54
Synchronization overhead
▪ Sweep the synchronization quantum size
▪ # of http requests remains nearly constant
▪ Maximum 2.6% variance
▪ Almost the same amount of work done at each quantum size
▪ Simulation time improvement
▪ 4.9% from 0.5 μs to 1 μs
▪ 15.7% from 0.5 μs to 128 μs
[Figure: normalized simulation time and number of requests (KReq) vs. synchronization quantum size (0.5–128 μs)]
dist-gem5 synchronization is efficient!
Page 54
55
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45 AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
Page 56
Getting started with gem5 FS mode
Page 57
58
Download and build gem5
▪ Guest architecture
▪ Several architectures in the source tree.
▪ The most common ones are:
▪ ARM
▪ NULL – used for trace-driven simulation
▪ X86
▪ Optimization level:
▪ debug: debug symbols, no/few optimizations
▪ opt: debug symbols + most optimizations
▪ fast: no symbols + even more optimizations
dist-gem5 currently supports ARM. We have tested x86, and the patches are on their way.
Page 58
59
Example disk images
▪ Example kernels and disk images can be downloaded from gem5.org/Download
▪ This includes pre-compiled boot loaders
▪ Set the M5_PATH variable to point to the extracted directory
▪ Most example scripts try to find files using M5_PATH
▪ Kernels/boot loaders/device trees in ${M5_PATH}/binaries
▪ Disk images in ${M5_PATH}/disks
Page 59
60
Running an example script
▪ Simulates an arm64 system with 4 cores
▪ Uses a functional ‘atomic’ CPU model
▪ Runs the script on the simulated system after booting Linux
▪ Using “init-param” you can set a parameter that is accessible from the simulated system
Page 60
61
Sample rcS script and gem5 terminal output
sample rcS script (left) and gem5 terminal output (right)
Page 61
62
System overview
▪ gem5 will sketch the system overview for you if you install “pydot” on the host
▪ apt-get install python-pydot
[Figure: generated system diagram – Core0–Core3, MemCtrl, IO devices, Ethernet card, chipset]
Page 62
Getting started with dist-gem5
Page 63
64
▪ Before you run a dist-gem5 simulation you need to:
▪ Set up passwordless ssh between the “launch host” and the “simulation hosts”
▪ Set LSB_MCPU_HOSTS to map gem5 processes to simulation hosts
▪ The default runs all processes on localhost
▪ Assuming a simulated cluster size of 4, the following will run 2 full-system gem5 processes on 10.10.10.2 and 2 on
10.10.10.3
Getting started with dist-gem5
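The LSB_MCPU_HOSTS mapping can be sketched as a simple rank-to-host expansion. The "host slot-count host slot-count …" format mirrors the example above; the helper itself is illustrative, not part of gem5-dist.sh:

```python
def rank_to_host(lsb_mcpu_hosts, num_ranks):
    """Expand an LSB_MCPU_HOSTS-style "host n host m ..." string into a
    per-rank host list (rank i runs the i-th full-system gem5 process)."""
    tokens = lsb_mcpu_hosts.split()
    hosts = []
    # Tokens alternate: host name, then how many gem5 processes it runs.
    for host, count in zip(tokens[0::2], tokens[1::2]):
        hosts += [host] * int(count)
    assert len(hosts) >= num_ranks, "not enough host slots for the cluster"
    return hosts[:num_ranks]

# The slide's example: a 4-node simulated cluster, 2 processes per host.
assert rank_to_host("10.10.10.2 2 10.10.10.3 2", 4) == \
    ["10.10.10.2", "10.10.10.2", "10.10.10.3", "10.10.10.3"]
```

Keeping the mapping rank-major like this makes it easy to predict which physical host serves which log.N/m5out.N directory when debugging.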
Page 64
65
Running an example script
dist-gem5 launch script
Page 65
66
Running an example script
gem5-dist.sh options
▪ simulated cluster size; gem5 executable; switch node config script; full-system nodes config script
▪ root dir that stores logs and stats:
$RUNDIR/log.switch, log.0, log.1, …, log.(N-1) – gem5 stderr/stdout
$RUNDIR/m5out.switch, m5out.0, …, m5out.(N-1) – gem5 outdir
Page 66
67
Running an example script
▪ simulated cluster size; gem5 executable; switch node config script; full-system nodes config script
▪ root dir that stores logs and stats; full-system node arguments
▪ gem5 binary arguments
Page 67
68
Sample rcS script for a dist-gem5 run
Page 68
69
Sample rcS script for a dist-gem5 run
assign MAC/IP addr and bring up the NIC
ping other nodes from node 0
Page 69
70
dist-gem5 output terminal for the example rcS script
node w/ rank 0
rank 1
rank 2
rank 3
Page 70
dist-gem5 launch script walk-through
Page 71
72
1. Launch a gem5 process simulating the network switch
2. Wait for switch process to start
3. Read dist-iface port# from log.switch
4. Start full-system gem5 processes
gem5-dist.sh script big picture
[Figure: the launch host starts the switch gem5 process over ssh, reads the switch port # from log.switch, then starts the four full-system node processes over ssh, passing them the switch port #; the nodes attach via dist-etherlink]
Each FS gem5 process will connect to a corresponding switch iface at process startup
Page 72
73
gem5-dist.sh script walk through
Step 1
Step 2
Step 3
Page 73
74
gem5-dist.sh script walk through
Step 4
Page 74
75
▪ Default gem5-dist.sh runs identical full-system gem5 nodes
Heterogeneous cluster modeling
shared for all full-system nodes
Page 75
76
Heterogeneous cluster modeling
▪ Default gem5-dist.sh runs identical full-system gem5 nodes
▪ Not always desirable
▪ We do not always need to simulate the entire cluster with high fidelity
▪ Server node with an OOO CPU and clients with atomic CPUs
▪ Simulating a heterogeneous cluster
▪ Nodes with different numbers/types of CPUs
▪ Nodes with different memory sizes/types
▪ Nodes with different ISAs!
▪ Modify the gem5-dist.sh script to easily achieve that
▪ Let’s see how to have different arguments for node 0
Page 76
77
▪ Declare a new variable for node0 arguments (“N0_ARGS”)
gem5-dist.sh changes to support heterogeneous dist runs (1)
Page 77
78
▪ Define a new option flag “--node0-args” and set “N0_ARGS” from the command line
arguments
gem5-dist.sh changes to support heterogeneous dist runs (2)
Page 78
79
▪ Add “N0_ARGS” to node 0’s arguments
gem5-dist.sh changes to support heterogeneous dist runs (3)
Page 79
80
Example script with --node0-args
node0 is quad core and the rest are single core
Page 80
81
▪ The key is the gem5-dist.sh script
▪ You can easily extend it for the dist-gem5 launches you need
▪ Heterogeneous cluster simulation
▪ Arbitrary gem5 process to physical host/core mapping
▪ …
▪ Support for simulation pool management software
▪ Instead of explicitly mapping processes to nodes and using ssh to run gem5 processes, the cluster
management software maps and runs the processes
▪ E.g. an HTCondor version is available at https://publish.illinois.edu/icsl-pdgem5/download/
Other dist-gem5 simulation approaches
Page 81
dist-gem5 checkpointing/restoring
Page 82
83
Checkpointing/restoring
▪ Specify the root checkpoint directory using the “-c” option of gem5-dist.sh
▪ All the checkpoints will have the same dump tick
▪ You can restore by passing the “--checkpoint-restore” option to all gem5 processes (full-
system processes + switch process)
$CKPTDIR/m5out.switch – checkpoint files for the gem5 process simulating the network switch
$CKPTDIR/m5out.0 – checkpoint files for the gem5 process simulating full-system node #0
…
$CKPTDIR/m5out.(N-1)
Page 83
84
Restoring from a checkpoint
--cf-args will add its options to all the gem5 processes in the simulated cluster (full-system processes + switch process)
Page 84
85
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45 AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
Page 85
86
Network topologies
[Figure: two example topologies, each simulated by one gem5 process on a physical host – a tree: eight top-of-rack switches (ports p0–p7 down, p8 up) connected via simulated etherLinks to an aggregate switch, with distEtherLinks to the full-system nodes; and a star: a single 64-port top-of-rack switch (p0–p63)]
Page 86
87
Star network topology config script
Instantiate 64 DistEtherLink SimObjects (dist_size == 64)
Page 87
88
Star network topology config script
[Figure: the resulting star – one gem5 process on the physical host simulates a single 64-port top-of-rack switch (p0–p63)]
Page 88
89
Tree topology config script
Instantiate 8 top of rack and 1 aggregate switch
Page 89
90
Tree topology config script
Again we need 64 DistEtherLinks
Page 90
91
Tree topology config script (cont.)
Instantiate 8 aggregate EtherLinks
Page 91
92
Tree topology config script (cont.)
Connect DistEtherLinks to top of rack switches
Page 92
93
Tree topology config script (cont.)
Use EtherLinks to connect aggregate and top of rack switches
Page 93
94
Tree topology config script (cont.)
[Figure: the resulting tree – eight top-of-rack switches (ports p0–p7 down, p8 up) connected through etherLinks to the aggregate switch, all in one gem5 process on the physical host]
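The wiring logic of the tree config script above can be sketched in plain Python as data, with no gem5 imports; the switch/port labels (tor0…tor7, agg) are made up for illustration and do not match the real script's SimObject names:

```python
# Sketch of the 64-node tree wiring: which link plugs into which switch port.
NODES, RACKS = 64, 8
PER_RACK = NODES // RACKS            # 8 servers per top-of-rack switch

dist_links = []                      # one DistEtherLink per full-system gem5
for rank in range(NODES):
    tor = rank // PER_RACK           # which top-of-rack switch this node joins
    port = rank % PER_RACK           # downlink port p0..p7 on that switch
    dist_links.append({"rank": rank, "switch": f"tor{tor}", "port": f"p{port}"})

# Plain EtherLinks: each top-of-rack uplink p8 to one aggregate-switch port.
ether_links = [
    {"a": (f"tor{t}", "p8"), "b": ("agg", f"p{t}")} for t in range(RACKS)
]

assert len(dist_links) == 64 and len(ether_links) == 8
assert dist_links[63] == {"rank": 63, "switch": "tor7", "port": "p7"}
```

In the real config script the same rank/port arithmetic decides which DistEtherLink attaches to which simulated etherSwitch port, so getting this mapping right is most of the topology work.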
Page 94
95
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45 AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
Page 95
96
dist-gem5 debugging checklist
▪ Look into log.switch, log.0, …, log.(N-1)
▪ Unexpected abort of a gem5 process
▪ Segmentation fault, panic, failed connection, …
▪ Normal exit from the rcS script
▪ E.g. the “info: m5 exit called with non-zero delay” message in a log file
▪ Check m5out.X/system.terminal to find out why “/sbin/m5 exit” gets called
▪ Check if there is any gem5 process still running on the simulation hosts
▪ Trace-based debug
▪ Enable some debug flags and look into the log files to get more info
▪ gem5-dist.sh:
Page 96
97
dist-gem5 debugging checklist (cont.)
▪ GDB debugging
▪ Use the “-debug” option of gem5-dist.sh
▪ Debug each gem5 binary using the gdb debugger
▪ Each gem5 process will open a gdb terminal and run from there
Page 97
101
4 node simulated cluster debugging using gdb
Page 98
102
4 node simulated cluster debugging using gdb
Page 99
103
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45 AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
Page 100
104
Steps to prepare a benchmark (e.g. apache-bench) on dist-gem5
1. Install/build the benchmark on the disk image
▪ Mount the disk image, copy the source, chroot into the disk image, build the source, OR
▪ Mount the disk image, chroot into the disk image, install the application using “apt-get install”, OR
▪ Cross-compile the application, mount the disk image, copy the application binary to the disk image
Page 101
105
2. Take a checkpoint
3. Restore from checkpoint and run the benchmark
▪ Example rcS for running apache-bench; one master node and multiple slaves:
Steps to prepare a benchmark (e.g. apache-bench) on dist-gem5
Page 102
106
Annotating benchmarks for accurate stat collection
▪ The response time of a request is ts1 - ts0
▪ Monitoring response time in software disturbs the client application
▪ Client-side queueing
▪ Imprecise statistics
▪ Solution:
▪ Use an m5 pseudo instruction to measure response time
[Figure: clients and servers exchange requests and responses over the network; the client records ts0 when sending the request and ts1 when the response arrives]
Page 103
107
Annotating benchmarks for accurate stat collection
▪ Call m5_work_begin(req.id) before sending a request
▪ Call m5_work_end(req.id) when receiving a response
▪ Example output:
[Figure: the client calls m5_work_begin(req.id) when issuing the request and m5_work_end(req.id) on the response; work_item_end reports the round-trip latency]
Page 104
108
Steps to annotate an application using m5 ops
1. Generate libm5.a for your desired ISA
▪ E.g. for aarch64:
2. Copy “libm5.a” and “m5op.h” to the application’s root dir
3. Include “m5op.h” in the source code of the application
4. Add the desired m5ops to the application source code
5. Add the “-L. -lm5” flags to the gcc command and build the application
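Post-processing the resulting work-item statistics can be sketched as pairing begin/end records by request id. The record format below is assumed for illustration and is not gem5's exact stats output:

```python
def round_trip_latencies(events):
    """Pair work-item begin/end records by request id and return per-request
    round-trip latency in ticks. `events` is a list of (kind, req_id, tick)
    tuples in the order the simulator emitted them."""
    begin = {}
    latency = {}
    for kind, req_id, tick in events:
        if kind == "begin":
            begin[req_id] = tick              # m5_work_begin(req.id)
        else:                                 # "end": m5_work_end(req.id)
            latency[req_id] = tick - begin.pop(req_id)
    return latency

# Two overlapping requests, as apache-bench would issue them:
events = [("begin", 1, 1000), ("begin", 2, 1200),
          ("end", 1, 5400), ("end", 2, 6100)]
assert round_trip_latencies(events) == {1: 4400, 2: 4900}
```

Because the ticks come from the simulator rather than the client's own clock, overlapping requests can be measured without the client-side queueing distortion described above.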
Page 105
109
Sample ab.c annotation with m5 ops
C->reqId is a unique ID for each request
Page 106
110
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45 AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
Page 107
111
▪ If you have questions, you can post them to the gem5 mailing lists
▪ Or contact us directly
▪ Mohammad Alian [email protected]
▪ Gabor Dozsa [email protected]
▪ Please check the dist-gem5 web page for more updates and resources
▪ https://publish.illinois.edu/icsl-pdgem5/
Thank you
Page 108
112
dist-gem5: Distributed Simulation of
Computer Clusters
Illinois: Mohammad Alian, Prof. Nam Sung Kim
ARM: Gabor Dozsa, Stephan Diestelhorst, Nikos Nikoleris, Radhika Jagtap
Tutorial at IEEE International Symposium on Workload Characterization (IISWC), Seattle, USA
1 Oct 2017