-
The gem5 Simulator
ISCA 2011
Brad Beckmann1 Nathan Binkert2 Ali Saidi3 Joel Hestness4
Gabe Black5 Korey Sewell6 Derek Hower7
1 AMD Research 2 HP Labs 3 ARM, Inc. 4 University of Texas, Austin
5 Google, Inc. 6 University of Michigan, Ann Arbor
7 University of Wisconsin, Madison
June 5th, 2011
-
Welcome!
We're glad you're here!
  The gem5 simulator has been a multi-year effort
  A wide variety of institutions have participated
This tutorial is for you
  Please ask questions! Don't save them for the break!
  We intend the focus to be audience driven
-
Tutorial Goals and Timeline
Tutorial goals
  Introduce you to the gem5 simulator
  Answer your development questions
Two halves
  8:30-noon: Overview of the simulator, features, components, and simple examples
  After lunch: Birds-of-a-feather sessions and informal discussions of simulator internals
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
-
Introduction to gem5
Brad Beckmann
AMD Research
-
Introduction to gem5
What is gem5?
The best parts of M5
The best parts of GEMS
Overall goals, design principles, and capabilities
-
What is gem5?
The combination of M5 and GEMS into a new simulator
Google Scholar statistics
  M5 (IEEE Micro, CAECW): 440 citations
  GEMS (CAN): 588 citations
Best aspects of both glued together
  M5: CPU models, ISAs, I/O devices, infrastructure
  GEMS (essentially Ruby): cache coherence protocols, interconnect models
-
What else is new?
Many other things have changed since previous tutorials, beyond GEMS+M5
Some of the highlights:
  The world's most popular ISAs: ARM and x86
  The In-order CPU model
  New documentation
Overall, gem5 offers a broad set of capabilities
-
Android on ARM FS
[Video: android.mp4]
-
64 Processor Linux on x86 FS
-
What gem5 is Not
A hardware design language
  Higher level, for design space exploration and simulation speed
A restrictive environment
  Just C++ and Python with an event queue and a bunch of APIs you can choose to ignore
Finished!
  Always room for improvement...
-
What We Would Like gem5 to Be
Something that spares you the pain we've been through
  A community resource
  Modular enough to localize changes
  Contribute back, and spare others some pain
A path to reproducible/comparable results
  A common platform for evaluating ideas
Let us know how we can help you contribute
  Public wiki is up at http://www.gem5.org
  Please submit patches and additional features
  Ability to add modules with EXTRAS=
  The more active the community is, the more successful gem5 will be!
-
Two Views of gem5
View #1: A framework for event-driven simulation
  Events, objects, statistics, configuration
View #2: A collection of predefined object models
  CPUs, caches, busses, devices, etc.
This tutorial focuses on #2
You may find #1 useful even if #2 is not
  At least three other simulators have been created using #1
-
Main Goals
Overall Goal: Open source community tool focused on architectural modeling
Flexibility
  Multiple CPU models across the speed vs. accuracy spectrum
  Two execution modes: System-call Emulation & Full-system
  Two memory system models: Classic & Ruby
  Once you learn it, you can apply it to a wide range of investigations
Availability
  For both academic and corporate researchers
  No dependence on proprietary code
  BSD license
Collaboration
  Combined effort of many with different specialties
  Active community leveraging collaborative technologies
-
Key Features
Pervasive object-oriented design
  Provides modularity, flexibility
  Significantly leverages inheritance, e.g. SimObject
Python integration
  Powerful front-end interface
  Provides initialization, configuration, & simulation control
Domain-Specific Languages
  ISA DSL: defines ISA semantics
  Cache Coherence DSL (a.k.a. SLICC): defines coherence logic
Standard interfaces: Ports and MessageBuffers
-
Capabilities
Execution modes: System-call Emulation (SE) & Full-System (FS)
ISAs: Alpha, ARM, MIPS, Power, SPARC, x86
CPU models: AtomicSimple, TimingSimple, InOrder, and O3
Cache coherence protocols: broadcast-based, directories, etc.
Interconnection networks: Simple & Garnet (Princeton, MIT)
Devices: NICs, IDE controller, etc.
Multiple systems: communicate over TCP/IP
-
Cross-Product Matrix
Processor                      Memory System
CPU Model       System Mode    Classic    Ruby (Simple)    Ruby (Garnet)
Atomic Simple   SE, FS
Timing Simple   SE, FS
InOrder         SE, FS
O3              SE, FS
-
Basics
Nate Binkert
HP Labs
-
Basics
Compiling gem5
Running gem5
Very brief overview of a few key concepts:
  Objects
  Events
  Modes
  Ports
  Stats
-
Building Executables
Platforms
  Linux, BSD, MacOS, Solaris, etc.
  Little endian machines
    Some architectures support big endian
  64-bit machines help a lot
Tools
  GCC/G++ 3.4.6+
    Most frequently tested with 4.2-4.5
  Python 2.4+
  SCons 0.98.1+
    We generally test versions 0.98.5 and 1.2.0
    http://www.scons.org
  SWIG 1.3.31+
    http://www.swig.org
-
Compile Targets
build/<config>/<binary>
configs
  By convention, usually <ISA>_<mode>
    ALPHA_SE (Alpha syscall emulation)
    ALPHA_FS (Alpha full system)
  Other ISAs: ARM, MIPS, POWER, SPARC, X86
  Sometimes followed by Ruby protocol: ALPHA_SE_MOESI_hammer
  You can define your own configs
binary
  gem5.debug: debug build, symbols, tracing, assert
  gem5.opt: optimized build, symbols, tracing, assert
  gem5.fast: optimized build, no debugging, no symbols, no tracing, no assertions
  gem5.prof: gem5.fast + profiling support
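The naming convention above can be sketched as a small helper that composes a build target. This is only an illustration: `build_target` and its parameter names are hypothetical, and the layout is inferred from targets that appear in this tutorial, such as build/X86_FS/gem5.opt and ALPHA_SE_MOESI_hammer.

```python
def build_target(isa, mode, binary="opt", protocol=None):
    """Compose a build path from ISA, mode, optional Ruby protocol, and binary type.

    Hypothetical helper; it only mirrors the naming convention on the slide.
    """
    config = f"{isa}_{mode}"
    if protocol:
        config += f"_{protocol}"
    return f"build/{config}/gem5.{binary}"


print(build_target("X86", "FS"))
print(build_target("ALPHA", "SE", binary="debug", protocol="MOESI_hammer"))
```

The resulting string is what would be passed to scons as the target.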
-
Sample Compile
blue% scons build/X86_FS/gem5.opt
scons: Reading SConscript files ...
Checking for leading underscore in global variables...no
Checking for C header file Python.h... yes
Checking for C library pthread... yes
Reading /n/blue/z/binkert/work/m5/incoming/src/mem/ruby/SConsopts
Reading /n/blue/z/binkert/work/m5/incoming/src/mem/protocol/SConsopts
Reading /n/blue/z/binkert/work/m5/incoming/src/arch/arm/SConsopts
Building in /n/blue/z/binkert/work/m5/incoming/build/X86_FS
Variables file /n/blue/z/binkert/work/m5/incoming/build/variables/X86_FS not found,
  using defaults in /n/blue/z/binkert/work/m5/incoming/build_opts/X86_FS
scons: done reading SConscript files.
scons: Building targets ...
[ CXX] X86_FS/sim/main.cc -> .o
[ CXX] X86_FS/sim/async.cc -> .o
[ CXX] X86_FS/sim/core.cc -> .o
[ TRACING] -> X86_FS/debug/Event.hh
Defining FAST_ALLOC_STATS as 0 in build/X86_FS/config/fast_alloc_stats.hh.
Defining FORCE_FAST_ALLOC as 0 in build/X86_FS/config/force_fast_alloc.hh.
Defining NO_FAST_ALLOC as 0 in build/X86_FS/config/no_fast_alloc.hh.
[ CXX] X86_FS/sim/debug.cc -> .o
[ TRACING] -> X86_FS/debug/Config.hh
[ CXX] X86_FS/sim/eventq.cc -> .o
[ CXX] X86_FS/sim/init.cc -> .o
[ TRACING] -> X86_FS/debug/TimeSync.hh
[SO PARAM] Root -> X86_FS/params/Root.hh
[SO PARAM] SimObject -> X86_FS/params/SimObject.hh
...
-
Running Simulations
maize% ./build/ARM_FS/gem5.opt --help
Usage
=====
  gem5.opt [gem5 options] script.py [script options]
gem5 is copyrighted software; use the --copyright option for details.
Options
=======
--version               show program's version number and exit
--help, -h              show this help message and exit
--build-info, -B        Show build information
--copyright, -C         Show full copyright information
--readme, -R            Show the readme
--outdir=DIR, -d DIR    Set the output directory to DIR [Default: /tmp/m5out]
--redirect-stdout, -r   Redirect stdout (& stderr, without -e) to file
--redirect-stderr, -e   Redirect stderr to file
--stdout-file=FILE      Filename for -r redirection [Default: simout]
--stderr-file=FILE      Filename for -e redirection [Default: simerr]
--interactive, -i       Invoke the interactive interpreter after running the script
--pdb                   Invoke the python debugger before running the script
--path=PATH[:PATH], -p PATH[:PATH]
                        Prepend PATH to the system path when invoking the script
--quiet, -q             Reduce verbosity
--verbose, -v           Increase verbosity
-
Running Simulations (cont.)
Statistics Options
------------------
--stats-file=FILE       Sets the output file for statistics [Default: stats.txt]
Configuration Options
---------------------
--dump-config=FILE      Dump configuration output file [Default: config.ini]
Debugging Options
-----------------
--debug-break=TIME[,TIME]
                        Cycle to create a breakpoint
--debug-help            Print help on trace flags
--debug-flags=FLAG[,FLAG]
                        Sets the flags for tracing (-FLAG disables a flag)
--remote-gdb-port=REMOTE_GDB_PORT
                        Remote gdb base port (set to 0 to disable listening)
Trace Options
-------------
--trace-start=TIME      Start tracing at TIME (must be in ticks)
--trace-file=FILE       Sets the output file for tracing [Default: cout]
--trace-ignore=EXPR     Ignore EXPR sim objects
Help Options
------------
--list-sim-objects      List all built-in SimObjects, their params and default values
-
Sample Run
maize% ./build/ARM_SE/gem5.opt configs/example/se.py
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.
gem5 compiled Jun 2 2011 17:39:30
gem5 started Jun 3 2011 14:48:20
gem5 executing on maize
command line: ./build/ARM_SE/gem5.opt configs/example/se.py
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Hello world!
hack: be nice to actually delete the event here
Exiting @ tick 3350000 because target called exit()
-
Modes
gem5 has two fundamental modes
Full system (FS)
  For booting operating systems
  Models bare hardware, including devices
  Interrupts, exceptions, privileged instructions, fault handlers
Syscall emulation (SE)
  For running individual applications, or set of applications on MP/SMT
  Models user-visible ISA plus common system calls
  System calls emulated, typically by calling host OS
  Simplified address translation model, no scheduling
Selected via compile-time option
  Vast majority of code is unchanged, though
-
Objects
Everything you care about is an object (C++/Python)
  Derived from SimObject base class
  Common code for creation, configuration parameters, naming, checkpointing, etc.
Uniform method-based APIs for object types
  CPUs, caches, memory, etc.
  Plug-compatibility across implementations
    Functional vs. detailed CPU
    Conventional vs. indirect-index cache
Easy replication: cores, multiple systems, ...
-
Events
Standard event queue timing model
  Global logical time in ticks
  No fixed relation to real time
    Normally picoseconds in our examples
Objects schedule their own events
  Flexibility for detail vs. performance trade-offs
  E.g., a CPU typically schedules an event at regular intervals
    Every cycle or every n picoseconds
    Won't schedule self if stalled/idle
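The event-queue timing model can be sketched in a few lines of Python. This is a toy illustration rather than gem5's implementation: `EventQueue`, `ToyCPU`, and their methods are hypothetical names, and the 500-tick period is an arbitrary assumption.

```python
import heapq
import itertools


class EventQueue:
    """Toy discrete-event queue; time is a global logical tick count."""

    def __init__(self):
        self.cur_tick = 0
        self._heap = []
        self._order = itertools.count()  # tie-breaker for events on the same tick

    def schedule(self, when, callback):
        if when < self.cur_tick:
            raise ValueError("cannot schedule an event in the past")
        heapq.heappush(self._heap, (when, next(self._order), callback))

    def run(self, until):
        """Service events in tick order while the next event is at or before `until`."""
        while self._heap and self._heap[0][0] <= until:
            when, _, callback = heapq.heappop(self._heap)
            self.cur_tick = when
            callback()


class ToyCPU:
    """Hypothetical CPU that reschedules itself every `period` ticks unless idle."""

    def __init__(self, eventq, period):
        self.eventq = eventq
        self.period = period
        self.idle = False
        self.ticks_seen = []

    def tick(self):
        self.ticks_seen.append(self.eventq.cur_tick)
        if not self.idle:  # a stalled/idle CPU does not schedule itself
            self.eventq.schedule(self.eventq.cur_tick + self.period, self.tick)


eventq = EventQueue()
cpu = ToyCPU(eventq, period=500)
eventq.schedule(0, cpu.tick)
eventq.run(until=2000)
print(cpu.ticks_seen)
```

Because the object schedules its own next event, setting `cpu.idle = True` is enough to take it out of the queue; nothing polls an idle object.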
-
Ports src/mem/port.{hh,cc}
Method for connecting MemObjects together
Each MemObject subclass has its own Port subclass(es)
  Specialized to forward packets to appropriate methods of MemObject subclass
Each pair of MemObjects is connected via a pair of Ports (peers)
Function pairs pass packets across ports
  sendTiming() on one port calls recvTiming() on peer
Result: class-specific handling with arbitrary connections and only a single virtual function call
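The peer-port idea can be illustrated with a small Python sketch. It is deliberately simplified and is not the code in src/mem/port.{hh,cc}: the class names, the `connect` helper, and the tuple packet format are all hypothetical, and the reply is returned synchronously to keep the example short.

```python
class Port:
    """Toy port: forwards packets to its peer, which hands them to its owner."""

    def __init__(self, owner):
        self.owner = owner
        self.peer = None

    def send_timing(self, pkt):
        # Sending on one port invokes the receive method on the peer port.
        return self.peer.recv_timing(pkt)

    def recv_timing(self, pkt):
        # Class-specific handling lives in the object that owns the port.
        return self.owner.handle_packet(pkt)


def connect(a, b):
    """Make two ports peers of each other."""
    a.peer, b.peer = b, a


class ToyMemory:
    def __init__(self):
        self.port = Port(self)
        self.data = {}

    def handle_packet(self, pkt):
        cmd, addr, value = pkt
        if cmd == "write":
            self.data[addr] = value
            return True
        return self.data.get(addr)


class ToyCPU:
    def __init__(self):
        self.port = Port(self)

    def handle_packet(self, pkt):
        raise NotImplementedError("this toy CPU only sends requests")


cpu, mem = ToyCPU(), ToyMemory()
connect(cpu.port, mem.port)
cpu.port.send_timing(("write", 0x100, 42))
value = cpu.port.send_timing(("read", 0x100, None))
print(value)
```

The sender never needs to know what kind of object sits behind the peer, which is what allows arbitrary connections.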
-
Access Modes
Three access modes: Functional, Atomic, Timing
Selected by choosing function on initial Port:
  sendFunctional(), sendAtomic(), sendTiming()
Functional mode:
  Just make it happen
  Used for loading binaries, debugging, etc.
  Accesses happen instantaneously, updating data everywhere in the hierarchy
    If devices contain queues of packets they must be scanned and updated as well
-
Access Modes (cont'd)
Atomic mode:
  Requests complete before sendAtomic() returns
  Models state changes (cache fills, coherence, etc.)
  Returns approx. latency w/o contention or queuing delay
  Used for fast simulation, fast forwarding, or warming caches
Timing mode:
  Models all timing/queuing in the memory system
  Split transaction
    sendTiming() just initiates send of request to target
    Target later calls sendTiming() to send response packet
Atomic and Timing accesses cannot coexist in a system
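The difference between the three access modes can be sketched with a toy memory model. Everything here is an assumption made for illustration: `ToyMemory`, its method names, the fixed 100-tick latency, and the callback-based response are hypothetical and much simpler than the real packet interface.

```python
from collections import deque


class ToyMemory:
    """Hypothetical memory with a fixed access latency, in ticks."""

    def __init__(self, latency):
        self.latency = latency
        self.data = {}
        self.pending = deque()  # (ready_tick, addr, on_response) for timing mode

    def send_functional(self, addr, value=None):
        """Functional: just make it happen, with no notion of time."""
        if value is not None:
            self.data[addr] = value
        return self.data.get(addr)

    def send_atomic(self, addr):
        """Atomic: complete before returning and report an approximate latency."""
        return self.data.get(addr), self.latency

    def send_timing(self, now, addr, on_response):
        """Timing: only initiate the request; the response arrives later."""
        self.pending.append((now + self.latency, addr, on_response))

    def advance_to(self, tick):
        """Deliver every response whose ready time has been reached."""
        while self.pending and self.pending[0][0] <= tick:
            ready, addr, on_response = self.pending.popleft()
            on_response(ready, self.data.get(addr))


mem = ToyMemory(latency=100)
mem.send_functional(0x40, value=7)      # e.g. loading a binary

value, latency = mem.send_atomic(0x40)  # data and latency are available at once

responses = []
mem.send_timing(0, 0x40, lambda tick, data: responses.append((tick, data)))
mem.advance_to(50)                      # too early: no response yet
early = list(responses)
mem.advance_to(100)                     # the response is delivered at tick 100
print(value, latency, early, responses)
```

The timing path is the only one where the requester has to keep running and handle the reply later, which is why it can model queuing and contention.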
-
Statistics
Scalar
Average
Vector
Formula
Histogram
Distribution
Vector Distribution
-
Statistics Example: hh file
class MySimObject : public SimObject
{
  private:
    Stats::Scalar txBytes;
    Stats::Formula txBandwidth;
    Stats::Vector syscall;

  public:
    void regStats();
};
-
Statistics Example: cc file
txBytes
    .name(name() + ".txBytes")
    .desc("Bytes Transmitted")
    .prereq(txBytes);
txBandwidth
    .name(name() + ".txBandwidth")
    .desc("Transmit Bandwidth (bits/s)")
    .precision(0);
txBandwidth = txBytes * Stats::constant(8) / simSeconds;
syscall
    .init(SystemCalls ::Number)
    .name(name() + ".syscall")
    .desc("number of syscalls executed")
    .flags(total | pdf | nozero | nonan);
-
Statistics Output
client.tsunami.etherdev.txBandwidth   4302720
client.tsunami.etherdev.txBytes       13446
server.tsunami.etherdev.txBandwidth   4684921600
server.tsunami.etherdev.txBytes       14640380
sim_seconds                           0.025000
server.cpu.kern.syscall               492
server.cpu.kern.syscall_1             189   38.41%   38.41%
server.cpu.kern.syscall_2             249   50.61%   89.02%
server.cpu.kern.syscall_3             54    10.98%  100.00%
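The numbers in this output can be reproduced by hand from the formula in the cc file. The short Python check below is only a worked example using the values printed above; the variable names are made up, and reading the two percentage columns as per-entry share and running total is an interpretation of the output.

```python
tx_bytes = 13446      # client.tsunami.etherdev.txBytes
sim_seconds = 0.025   # sim_seconds

# txBandwidth = txBytes * Stats::constant(8) / simSeconds
tx_bandwidth = tx_bytes * 8 / sim_seconds
print(round(tx_bandwidth))

# The vector stat prints each entry with its share of the total and a running total.
syscalls = [189, 249, 54]
total = sum(syscalls)
running = 0
rows = []
for count in syscalls:
    running += count
    rows.append((count, f"{100 * count / total:.2f}%", f"{100 * running / total:.2f}%"))

for row in rows:
    print(*row)
```

The same arithmetic applied to the server's 14640380 bytes gives its reported bandwidth of 4684921600 bits/s.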
-
Debugging
Ali Saidi
ARM Research & Development
-
Debugging Facilities
Tracing
  Instruction Tracing
  Diffing Traces
Using gdb to debug gem5
  Debugging C++ and gdb-callable functions
  Remote Debugging
Python Debugging
Pipeline Viewer
-
Tracing/Debugging src/base/trace.*
printf() is a nice debugging tool
  Keep good printfs for tracing
  Lots of debug output is a very good thing
Example flags:
  Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache, Loader, O3CPUAll, etc.
  Print out all flags with --debug-help option
-
Enabling Tracing
Selecting flags:
  --debug-flags=Cache,Bus
  --debug-flags=Exec,-ExecTicks
Selecting destination:
  --trace-file=my_trace.out
  --trace-file=my_trace.out.gz
Selecting start:
  --trace-start=23000000
./build/ARM_FS/gem5.opt --debug-flags=Cache,Bus --trace-start=2400 configs/example/fs.py
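The effect of a flag list such as Exec,-ExecTicks can be sketched as set arithmetic. This is an illustration of the semantics only, not gem5's option parser: `parse_debug_flags` is a hypothetical function, and the members listed for the compound flag Exec are an assumed subset chosen for the example.

```python
# Assumed, illustrative subset; the real flag lists are printed by --debug-help.
COMPOUND_FLAGS = {
    "Exec": ["ExecEnable", "ExecTicks", "ExecUser", "ExecKernel"],
}


def parse_debug_flags(spec):
    """Return the set of enabled flags for a spec such as 'Exec,-ExecTicks'."""
    enabled = set()
    for item in spec.split(","):
        disable = item.startswith("-")
        name = item[1:] if disable else item
        members = COMPOUND_FLAGS.get(name, [name])
        if disable:
            enabled.difference_update(members)
        else:
            enabled.update(members)
    return enabled


print(sorted(parse_debug_flags("Exec,-ExecTicks")))
print(sorted(parse_debug_flags("Cache,Bus")))
```

Items are applied left to right, so a flag has to be enabled before a later -FLAG entry can remove it.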
-
Adding Debugging
Print statements put in source code
  Encourage you to add ones to your models or contribute ones you find particularly useful
Macros remove them for gem5.fast or gem5.prof binaries
  So you must be using gem5.debug or gem5.opt to get any output
Adding an extra tracing statement:
  DPRINTF(Flag, "normal printf %s\n", arguments);
Adding a new debug flag (in a SConscript):
  DebugFlag('MyNewFlag')
-
Instruction Tracing src/sim/insttracer.hh
Separate from the general debug/trace facility
  But both are enabled the same way
Per-instruction records populated as instruction executes
  Start with PC and mnemonic
  Add argument and result values as they become known
Printed to trace when instruction completes
Flags for printing cycle, symbolic addresses, etc.
4000: sys.cpu : @sym+776 : add r3, r3, #8 : IntAlu : D=0x00008358
4500: sys.cpu : @sym+780 : sub r3, r3, r7 : IntAlu : D=0x40000000
5000: sys.cpu : @sym+784 : add r5, r5, r3 : IntAlu : D=0x000173cc
5500: sys.cpu : @sym+788 : add r6, r6, r3 : IntAlu : D=0x00017400
6000: sys.cpu : @sym+792.0 : addi_uop r34, r5, #0 : IntAlu : D=0x000173cc
6500: sys.cpu : @sym+792.1 : ldr_uop r3, [r34, #0] : MemRead : D=0x000f0000 A=0x173cc
7000: sys.cpu : @sym+792.2 : ldr_uop r4, [r34, #4] : MemRead : D=0x000f0000 A=0x173d0
7500: sys.cpu : @sym+796 : and r4, r4, r9 : IntAlu : D=0x000f0000
8000: sys.cpu : @sym+800 : teqs r3, r4 : IntAlu : D=0x00000001
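Trace records in this layout are easy to post-process. The parser below is a minimal sketch that assumes exactly the field order shown in the sample (tick, object, PC, disassembly, op class, then D= and optional A= values); `parse_exec_line` and the dictionary keys are hypothetical names, and other flag combinations would change the layout.

```python
def parse_exec_line(line):
    """Split one instruction trace record into named fields."""
    tick_text, rest = line.split(":", 1)
    fields = [field.strip() for field in rest.split(" : ")]
    record = {
        "tick": int(tick_text),
        "object": fields[0],
        "pc": fields[1],
        "disassembly": fields[2],
        "op_class": fields[3],
    }
    # Trailing items look like D=<data> and, for memory ops, A=<address>.
    for item in fields[4].split():
        key, value = item.split("=")
        record[key] = int(value, 16)
    return record


rec = parse_exec_line(
    "6500: sys.cpu : @sym+792.1 : ldr_uop r3, [r34, #0] : MemRead : D=0x000f0000 A=0x173cc"
)
print(rec["tick"], rec["op_class"], hex(rec["A"]))
```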
-
Using GDB with gem5
Several gem5 functions are designed to be called from GDB:
  schedBreakCycle() (also available with --debug-break)
  setDebugFlag()/clearDebugFlag()
  dumpDebugStatus()
  eventqDump()
  SimObject::find()
  takeCheckpoint()
-
Using GDB with gem5
gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.py
GNU gdb Fedora (6.8-37.el5)
...
(gdb) b main
Breakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.
(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:40
40 main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)
(gdb) continue
Continuing.
gem5 Simulator System
...
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) p _curTick
$1 = 1000000
-
Using GDB with gem5
(gdb) call setDebugFlag("Exec")
(gdb) call schedBreakCycle(1001000)
(gdb) continue
Continuing.
1000000: system.cpu T0 : @_stext+148.1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c30
1000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) print SimObject::find("system.cpu")
$2 = (SimObject *) 0x19cba130
(gdb) print (BaseCPU*)SimObject::find("system.cpu")
$3 = (BaseCPU *) 0x19cba130
(gdb) p $3->instCnt
$4 = 431
(gdb) call clearDebugFlag("Exec")
(gdb) call takeCheckpoint(0)
(gdb) call schedBreakCycle(1001500)
(gdb) continue
Continuing.
Writing checkpoint
info: Entering event queue @ 1001001. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb)
-
Diffing Traces util/{rundiff,tracediff}
Often useful to compare traces from two simulations
  Find where known good and modified simulators diverge
Standard diff works only on files (not pipes)
  ...but you really don't want to run to completion
util/rundiff
  Perl script for diffing two pipes on the fly
util/tracediff
  Handy wrapper for using rundiff to compare gem5 outputs
  tracediff "a/gem5.opt|b/gem5.opt" --debug-flags=Exec
    compares instruction traces from two builds of gem5
  See comments for details
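The reason for diffing on the fly is that only the prefix up to the first mismatch needs to be produced. The sketch below shows that idea with lazy iterators; it is not a port of util/rundiff, and `first_divergence` and the sample lines are hypothetical.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return (index, line_a, line_b) for the first mismatch, or None if equal.

    Both arguments may be any iterables of lines, including pipes, so only as
    much of each trace is consumed as is needed to find the mismatch.
    """
    for index, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return index, a, b
    return None


good = iter(["add r3, r3, #8", "sub r3, r3, r7", "add r5, r5, r3"])
bad = iter(["add r3, r3, #8", "sub r3, r3, r6", "add r5, r5, r3"])
result = first_divergence(good, bad)
print(result)
```

If one trace ends before the other, `zip_longest` pads it with `None`, so the truncation itself is reported as the point of divergence.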
-
Advanced Trace Diffing
Sometimes if you run into a nasty bug it's hard to compare apples-to-apples traces
  Different cycle counts, different code paths from interrupts/timers
Some mechanisms that can help:
  -ExecTicks: don't print out ticks
  -ExecKernel: don't print out kernel code
  -ExecUser: don't print out user code
  ExecAsid: print out ASID of currently running process
State trace
  PTRACE program that runs binary on real system and compares cycle-by-cycle to gem5
  Supports ARM, x86, SPARC
  See wiki for more information
-
Remote Debugging
./build/ARM_FS/gem5.opt configs/example/fs.py
gem5 Simulator System
...
command line: ./build/ARM_FS/gem5.opt configs/example/fs.py
Global frequency set at 1000000000000 ticks per second
info: kernel located at: /chips/pd/randd/dist/binaries/vmlinux.arm
Listening for system connection on port 5900
Listening for system connection on port 3456
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
info: Entering event queue @ 0. Starting simulation...
Remote gdb connection listening on port 7000
-
Remote Debugging
GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) symbol-file /dist/binaries/vmlinux.arm
Reading symbols from //dist/binaries/vmlinux.arm...done.
(gdb) set remote Z-packet on
(gdb) set tdesc filename arm-with-neon.xml
(gdb) target remote 127.0.0.1:7000
Remote debugging using 127.0.0.1:7000
cache_init_objs (cachep=0xc7c00240, flags=3351249472) at mm/slab.c:2658
(gdb) step
sighand_ctor (data=0xc7ead060) at kernel/fork.c:1467
(gdb) info registers
r0    0xc7ead060   -940912544
r1    0x520        1312
r2    0xc002f1e4   -1073548828
r3    0xc7ead060   -940912544
r4    0x0          0
r5    0xc7ead020   -940912608
r6    0x0          0
r7    0xc7ead03c   -940912580
r8    0xc7c034a0   -943704928
r9    0x100100     1048832
r10   0xc7c0cee0   -943665440
r11   0x200200     2097664
r12   0xc0000000   -1073741824
sp    0xc7c29e28   0xc7c29e28
lr    0xc008ed98   -1073156712
pc    0xc002f1e4   0xc002f1e4
cpsr  0x13         19
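The two value columns in the register dump are the same 32-bit contents shown in hexadecimal and as a signed decimal. A short Python check, with `to_signed32` as a hypothetical helper name, confirms the pairing for a few of the registers above.

```python
def to_signed32(value):
    """Interpret a 32-bit register value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


for name, raw in [("r0", 0xC7EAD060), ("r1", 0x520), ("r12", 0xC0000000)]:
    print(name, hex(raw), to_signed32(raw))
```

Kernel addresses have the top bit set, which is why they appear as large negative numbers in the decimal column.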
-
Python Debugging
It is possible to drop into the python interpreter (-i flag)
  This currently happens after the script file is run
  If you want to do this before objects are instantiated, remove them from the script
It is possible to drop into the python debugger (--pdb flag)
  Occurs just before your script is invoked
  Lets you use the debugger to debug your script code
Code that enables this stuff is in src/python/m5/main.py
  At the bottom of the main function
  Can copy the mechanism directly into your scripts, if in the wrong place for your needs
    import pdb
    pdb.set_trace()
-
O3 Pipeline Viewer
Use --debug-flags=O3PipeView and util/o3-pipeview.py
-
Checkpointing and Fastforwarding
Joel Hestness
University of Texas, Austin
-
Checkpointing and Fastforwarding
Idea is simple:
  Snapshot of relevant system state
  Restore it later and/or in different CPUs, configurations
Provides flexibility:
  Test numerous different system configurations
    Exact same point in the benchmark
    Avoid re-simulating up to that point
    Avoid non-determinism inherent with different configurations
-
Checkpointing and Fastforwarding
Outline:
  Constraints
  Checkpointing demo
  Checkpointing internals
  Instrumenting a benchmark
  Fastforwarding internals
  Fastforwarding demo
-
Checkpointing and Fastforwarding
Constraints:
  Original simulation and test simulations must have
    Same ISA
    Same number of cores
    Same memory size
  Usually run original sim with atomic (functional) CPUs
-
Checkpointing DEMO!
Starting simulation
Simulated system is running
Another terminal to control simulated system
Attach to simulated system
Simulated system has booted to shell
Run a quick application
Drop a checkpoint
Exit simulation
Restore from checkpoint into different simulated system
Simulated system is running
Attach to simulated system
Run a quick application
Slower execution: detailed vs. functional simulation
Exit simulation
-
Checkpointing Output
cpt.6967183789500/
  m5.cpt: State of system components
  system.disk?.image.cow: Modified state of disk(s)
  system.physmem.physmem: State of memory
-
Specifying State to Checkpoint
To checkpoint a piece of state, serialize it
To restore that state, unserialize it
void
serialize(std::ostream &os)
{
    SERIALIZE_ARRAY(interrupts, NumInterruptLevels);
    SERIALIZE_SCALAR(intstatus);
}

void
unserialize(Checkpoint *cp, const std::string &section)
{
    UNSERIALIZE_ARRAY(interrupts, NumInterruptLevels);
    UNSERIALIZE_SCALAR(intstatus);
}
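The serialize/unserialize pairing can be tried out in plain Python. The sketch below assumes an INI-style text checkpoint with one section per object, which is an assumption made for the example; `ToyInterrupts`, the section name, and the four interrupt levels are hypothetical, and the C++ macros above are not reproduced.

```python
import configparser
import io


class ToyInterrupts:
    """Hypothetical object with the two fields used in the slide's example."""

    def __init__(self, num_levels=4):
        self.interrupts = [False] * num_levels
        self.intstatus = 0

    def serialize(self, cp, section):
        cp[section] = {
            "interrupts": " ".join(str(int(flag)) for flag in self.interrupts),
            "intstatus": str(self.intstatus),
        }

    def unserialize(self, cp, section):
        self.interrupts = [bool(int(tok)) for tok in cp[section]["interrupts"].split()]
        self.intstatus = int(cp[section]["intstatus"])


original = ToyInterrupts()
original.interrupts[2] = True
original.intstatus = 5

# Write the checkpoint as INI-style text, one section per object.
checkpoint = configparser.ConfigParser()
original.serialize(checkpoint, "system.cpu.interrupts")
buffer = io.StringIO()
checkpoint.write(buffer)

# Restore into a fresh object from the text alone.
restored_cp = configparser.ConfigParser()
restored_cp.read_string(buffer.getvalue())
restored = ToyInterrupts()
restored.unserialize(restored_cp, "system.cpu.interrupts")
print(restored.interrupts, restored.intstatus)
```

Any field left out of `serialize` silently reverts to its constructor default after a restore, so the two methods need to be kept in step.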
-
Checkpointing functionality status
Classic memory model: Does not save state of caches
Ruby memory model: Can save state of caches
-
Instrumenting a Benchmark
Copy files from ./util/m5/ into source tree:
  m5op.h
  m5ops.h
  Appropriate assembly file: m5op_<ISA>.S
Include m5op.h in source code that should take a checkpoint
  #include "m5op.h"
  ...
  // Take checkpoint in code
  m5_checkpoint(0, 0);
  1st param: no. of ticks in the future to schedule the checkpoint
  2nd param: no. of ticks between checkpoints (periodic)
Compile and link against assembly file
-
Checkpointing Functionality in Progress
Current limitation: cache warm-up
1 Take periodic checkpoints throughout execution
2 Inspect statistics for interesting sections (think Simpoints)
3 Choose interesting sections
4 Create memory access traces for cache warm-up
5 Restore from checkpoint:
  1 Start simulated system
  2 Warm up caches from trace
  3 Restore the rest of state
  4 Begin execution
-
Fastforwarding
Setup: Specify sets of CPUs
cpu_class = AtomicSimpleCPU
switch_cpu_class = DerivO3CPU
test_sys.cpu = [cpu_class(cpu_id=i) for i in xrange(np)]
switch_cpus = [switch_cpu_class(defer_registration=True, cpu_id=(np+i))
               for i in xrange(np)]
switch_cpu_list = [(test_sys.cpu[i], switch_cpus[i]) for i in xrange(np)]
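The structure of this setup can be tried without gem5 by substituting stand-in classes. In the sketch below `ToyCPU`, `ToyAtomicCPU`, and `ToyDetailedCPU` are hypothetical placeholders for the real CPU models, and the example uses Python 3 (`range` and `zip`), unlike the `xrange` in the configuration script above.

```python
class ToyCPU:
    """Stand-in for a CPU model; it only records its constructor arguments."""

    def __init__(self, cpu_id, defer_registration=False):
        self.cpu_id = cpu_id
        self.defer_registration = defer_registration


class ToyAtomicCPU(ToyCPU):
    pass


class ToyDetailedCPU(ToyCPU):
    pass


np = 2  # number of simulated cores
cpus = [ToyAtomicCPU(cpu_id=i) for i in range(np)]
switch_cpus = [ToyDetailedCPU(cpu_id=np + i, defer_registration=True) for i in range(np)]

# Each fast CPU is paired with the detailed CPU that will take over from it.
switch_cpu_list = list(zip(cpus, switch_cpus))

for old, new in switch_cpu_list:
    print(type(old).__name__, old.cpu_id, "->", type(new).__name__, new.cpu_id)
```

The detailed CPUs get distinct ids (np and up) so that both sets can exist in the same system until the switch happens.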
-
Fastforwarding DEMO!
Starting simulation
81
-
Fastforwarding DEMO!
Starting simulation
82
-
Fastforwarding DEMO!
Simulated system is running
83
-
Fastforwarding DEMO!
Another terminal to control simulated system
84
-
Fastforwarding DEMO!
Attach to simulated system
85
-
Fastforwarding DEMO!
Simulated system has booted to shell
86
-
Fastforwarding DEMO!
Run a quick application
87
-
Fastforwarding DEMO!
Switch from functional to detailed CPUs
89
-
Fastforwarding DEMO!
Run a quick application
91
-
Fastforwarding DEMO!
Slower execution: detailed v. functional simulation
92
-
Fastforwarding DEMO!
Exit simulation
93
-
Outline
1 Introduction to gem5
2 Basics
3 Debugging
4 Checkpointing and Fastforwarding
5 Break
6 Multiple Architecture Support
7 CPU Modeling
8 Ruby Memory System
9 Wrap-Up
94
-
Break
95
-
Multiple Architecture Support
Gabe Black
Google, Inc.
97
-
Overview
Tour of the ISAs
Parts of an ISA
Decoding and instructions
98
-
ISA Support
Full-System & Syscall Emulation: Alpha, ARM, SPARC, x86
Syscall Emulation: MIPS, POWER
99
-
Alpha
Alpha 21264 including the BWX, MVI, FIX, and CIX extensions
21164 PAL code
Syscall Emulation
  Linux or Tru64 binaries
  Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
Full system
  Linux or FreeBSD
  Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
  Four cores in a normal Tsunami system
  gem5 "big Tsunami" also supports 64 cores
    Custom PAL code and kernel patches required
100
-
ARM
ARMv7-A, Thumb, Thumb2, MP, VFPv3, NEON
  Doesn't (yet) include TrustZone, ThumbEE, Virtualization, LPAE
Syscall Emulation
  EABI Linux binaries - no OABI
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  Linux or Android
  Simple Atomic, Simple Timing, Out-of-Order CPU models
  Four cores in a normal ARM RealView system
  No kernel patches required
  Also supports frame buffer, and control via VNC
    Can run X11, Android, Web browsers, etc.
101
-
MIPS
32-bit, little endian
Syscall Emulation
  Linux binaries
  Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models
Full system
  Significant progress, but not actively developed
102
-
POWER
POWER ISA v2.06 B Book, 32-bit, little endian
  Most instructions available, but some FP missing; no vector support
Syscall Emulation
  Linux binaries
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  No current plans
103
-
SPARC
UltraSPARC Architecture 2005
Syscall Emulation
  Linux or Solaris binaries
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  Solaris
  Single core of an UltraSPARC T1 (Niagara) processor
  Simple Atomic CPU model only
  Significant progress on MP, but not actively developed
104
-
x86
Generic x86 CPU w/ 64-bit, 3DNow, & SSE extensions
  Effort focused on modern features
  No x87 floating point. Compile 32-bit with -msse2.
  No Windows support any time soon.
Syscall Emulation
  Linux binaries
  Simple Atomic, Simple Timing, Out-of-Order CPU models
Full system
  Linux
  Simple Atomic, Simple Timing CPU models
  MP support
105
-
Parts of an ISA
Parameterization
  Number of registers
  Endianness
  Page size
Specialized objects
  TLBs
  Faults
  Control state
  Interrupt controller
Instructions
  Instructions themselves
  Decoding mechanism
106
-
Instruction decode process
[Figure: instruction decode flow. Bytes from Memory enter the Predecoder (with Context), which produces an ExtMachInst. The Decoder maps the ExtMachInst to a StaticInst, or to a Macroop made of Microops.]
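The flow in this figure can be mimicked with a toy decoder. The Python sketch below is an assumption-laden illustration, not the gem5 decoder: the opcode values, the one-byte opcode field, and the `predecode`/`decode` helpers are all hypothetical. It only shows the shape of the pipeline, in which raw bytes plus context become an extended machine instruction that decodes either to a single static instruction or to a macroop holding microops.

```python
from dataclasses import dataclass, field


@dataclass
class StaticInst:
    mnemonic: str


@dataclass
class Macroop:
    mnemonic: str
    microops: list = field(default_factory=list)


def predecode(raw_bytes, context):
    """Gather raw bytes plus context into one extended machine instruction."""
    return (int.from_bytes(raw_bytes, "little"), context)


def decode(ext_mach_inst):
    """Map an extended machine instruction to a StaticInst or a Macroop."""
    word, _context = ext_mach_inst
    opcode = word & 0xFF  # toy encoding: the low byte selects the instruction
    if opcode == 0x01:
        return StaticInst("add")
    if opcode == 0x02:
        # A complex instruction expands into simpler microops.
        return Macroop("movs", [StaticInst("ld"), StaticInst("st")])
    return StaticInst("unknown")
```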
107
-
ISA Description Language
src/arch/isa_parser.py, src/arch/*/isa/*
Custom domain-specific language
Defines decoding & behavior of ISA
Generates C++ code
  Scads of StaticInst subclasses
  decodeInst() function
    Maps machine instruction to StaticInst instance
  Multiple scads of execute() methods
    Cross-product of CPU models and StaticInst subclasses
108
-
Definitions etc.
def bitfield OPCODE ;
def bitfield RA ;
def bitfield RB ;
def bitfield INTFUNC ; // function code
def bitfield RC < 4: 0>; // dest reg

def operands {{
    Ra: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RA] : RA, IsInteger, 1),
    Rb: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RB] : RB, IsInteger, 2),
    Rc: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RC] : RC, IsInteger, 3),
    Fa: (FloatReg, df, FA, IsFloating, 1),
    Fb: (FloatReg, df, FB, IsFloating, 2),
    Fc: (FloatReg, df, FC, IsFloating, 3),
}}

def format LoadAddress(code) {{
    // Python code here...
}}

def format IntegerOperate(code) {{
    // Python code here...
}}
109
-
Instruction Decode and Semantics
decode OPCODE {
    format LoadAddress {
        0x08: lda({{ Ra = Rb + disp; }});
        0x09: ldah({{ Ra = Rb + (disp
-
Microcode
def macroop MOVS_E_M_M {
    and t0, rcx, rcx, flags=(EZF,), dataSize=asz
    br label("end"), flags=(CEZF,)
    # Find the constant we need to either add or subtract from rdi
    ruflag t0, 10
    movi t3, t3, dsz, flags=(CEZF,), dataSize=asz
    subi t4, t0, dsz, dataSize=asz
    mov t3, t3, t4, flags=(nCEZF,), dataSize=asz

topOfLoop:
    ld t1, seg, [1, t0, rsi]
    st t1, es, [1, t0, rdi]

    subi rcx, rcx, 1, flags=(EZF,), dataSize=asz
    add rdi, rdi, t3, dataSize=asz
    add rsi, rsi, t3, dataSize=asz
    br label("topOfLoop"), flags=(nCEZF,)
end:
    fault "NoFault"
};
111
-
Key Features
Very compact representation
  Most instructions take 1 line of C code
  Alpha: 3437 lines of ISA description -> 39K lines of C++
    15K generic decode, 12K for each of 2 CPU models
Characteristics auto-extracted from C
  source, dest regs; func unit class; etc.
execute() code customized for CPU models
Thoroughly documented (for us, anyway)
  See wiki pages
112
-
CPU Modeling
Korey Sewell
University of Michigan, Ann Arbor
114
-
Overview
High Level View
Supported CPU Models
  AtomicSimpleCPU
  TimingSimpleCPU
  InOrderCPU
  O3CPU
CPU Model Internals
  Parameters
  Time Buffers
  Key Interfaces
115
-
CPU Models - System Level View
CPU Models are designed to be hot pluggable with arbitrary ISAs and Memory Systems
116
-
Supported CPU Models
src/cpu/*.hh,cc
Simple CPUs
  Models Single-Thread 1 CPI Machine
  Two Types: AtomicSimpleCPU and TimingSimpleCPU
  Common Uses:
    Fast, Functional Simulation: 2.9 million and 1.2 million instructions per second on the twolf benchmark
    Warming Up Caches
    Studies that do not require detailed CPU modeling
Detailed CPUs
  Parameterizable Pipeline Models w/SMT support
  Two Types: InOrderCPU and O3CPU
  "Execute in Execute", detailed modeling
  Slower than SimpleCPUs: 200K instructions per second on the twolf benchmark
  Models the timing for each pipeline stage
  Forces both timing and execution of simulation to be accurate
  Important for Coherence, I/O, Multiprocessor Studies, etc.
117
-
AtomicSimpleCPU
src/cpu/simple/atomic/*.hh,cc
On every CPU tick(), perform all necessary operations for an instruction
Memory accesses are atomic
Fastest functional simulation
118
-
TimingSimpleCPU
src/cpu/simple/timing/*.hh,cc
Memory accesses use timing path
CPU waits until memory access returns
Fast, provides some level of timing
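The difference between atomic and timing memory accesses can be sketched with a tiny event queue. This is a simplified model under stated assumptions, not gem5's port interface: `EventQueue`, `atomic_load`, `timing_load`, and the fixed `MEM_LATENCY` are hypothetical. An atomic access returns data and latency in one call, while a timing access returns nothing immediately and delivers the data later through a callback, so the CPU has to wait for it.

```python
import heapq

MEM_LATENCY = 10  # hypothetical memory latency in ticks


class EventQueue:
    """Minimal discrete-event queue holding (tick, sequence, callback)."""

    def __init__(self):
        self.now = 0
        self._events = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        heapq.heappush(self._events, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._events:
            self.now, _, callback = heapq.heappop(self._events)
            callback()


def atomic_load(memory, addr):
    """Atomic access: data and latency come back in the same call."""
    return memory[addr], MEM_LATENCY


def timing_load(eventq, memory, addr, on_response):
    """Timing access: the response arrives later through a callback."""
    eventq.schedule(MEM_LATENCY, lambda: on_response(memory[addr]))
```

In this model the atomic path never touches the event queue, which is one way to see why it is the faster of the two to simulate.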
119
-
InOrder CPU Model
src/cpu/inorder/*.hh,cc
Detailed in-order CPU
  InOrder is a new feature to the gem5 Simulator
Default 5-stage pipeline
  Fetch, Decode, Execute, Memory, Writeback
120
-
InOrder CPU Model
src/cpu/inorder/*.hh,cc
Detailed in-order CPU
Default 5-stage pipeline
  Fetch, Decode, Execute, Memory, Writeback
Key Resources
  CacheUnit, ExecutionUnit, BranchPredictor, etc.
Key Parameters
  Pipeline Stages, Hardware Threads
Implementation: Customizable Set of Pipeline Components
  Pipeline stages interact with Resource Pool
  Pipeline defined through Instruction Schedules
    Each instruction type defines what resources it needs in a particular stage
    If an instruction can't complete all its resource requests in one stage, it blocks the pipeline
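The idea of instruction schedules and blocking can be sketched as a lookup from instruction type and stage to the resources requested there. The Python below is a toy model, not the InOrderCPU implementation: the `SCHEDULES` table, the resource named `FetchUnit` and `DecodeUnit`, and both helper functions are hypothetical.

```python
# Hypothetical schedules: instruction type -> {stage index: resources needed}.
SCHEDULES = {
    "alu": {0: ["FetchUnit"], 1: ["DecodeUnit"], 2: ["ExecutionUnit"]},
    "load": {0: ["FetchUnit"], 1: ["DecodeUnit"], 2: ["ExecutionUnit"],
             3: ["CacheUnit"]},
}


def stage_completes(inst_type, stage, busy_resources):
    """True if every resource requested in this stage is currently free."""
    requests = SCHEDULES[inst_type].get(stage, [])
    return all(res not in busy_resources for res in requests)


def cycles_in_stage(inst_type, stage, busy_until):
    """Count cycles spent in a stage, blocking while a resource is busy.

    busy_until maps a resource name to the first cycle (counted from 0)
    at which that resource becomes free.
    """
    cycle = 0
    while True:
        busy = {r for r, free_at in busy_until.items() if cycle < free_at}
        cycle += 1
        if stage_completes(inst_type, stage, busy):
            return cycle
```

A load whose `CacheUnit` is busy for two cycles occupies its memory stage for three cycles in this model, while an instruction with no request in that stage passes through in one.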
121
-
O3 CPU Model
src/cpu/o3/*.hh,cc
Detailed out-of-order CPU
Default 7-stage pipeline
  Fetch, Decode, Rename, IEW, Commit
  IEW: Issue, Execute, and Writeback
  Model varying numbers of pipeline stages by changing delays between pipeline stages (e.g. fetchToDecodeDelay)
Key Resources
  Physical Register (PR) File, IQ, LSQ, ROB, Functional Unit (FU) Pool
Key Parameters
  Interstage pipeline delays, Hardware threads, IQ/LSQ/ROB/PR entries, FU Delays
Other Key Features
  Support for CISC decoding (e.g. x86)
  Renaming with a Physical Register (PR) File
  Functional units with varying latencies
  Branch Prediction
  Memory dependence prediction
122
-
CPU Model Internals
src/cpu/*
A key reason that the CPU Models are hot pluggable into gem5 is that the CPUs share common components and interfaces within the simulator
Parameter Definition
Shared Components
  Branch Predictors, TLBs, ISA decoding, Interrupt Handlers
TimeBuffer-Based Communication
External Interfaces
  System: ThreadContext
  ISA: StaticInst and DynInst
  Memory: Ports, {send/recv}Timing
123
-
CPU Internals - Parameters
src/cpu/{simple/inorder/o3}*.py
Parameters are defined in a *.py in each CPU's directory
  e.g. The contents of src/cpu/inorder/InOrderCPU.py are shown below:
    class InOrderCPU(BaseCPU):
        type = 'InOrderCPU'
        ...
        cachePorts = Param.Unsigned(2, "Cache Ports")
        stageWidth = Param.Unsigned(4, "Stage width")
        ...
        icache_port = Port("Instruction Port")
        dcache_port = Port("Data Port")
        ...
        predType = Param.String("tournament",
            "Branch predictor type (local, tournament)")
Use in your configuration scripts
    ...
    cpu = InOrderCPU()
    cpu.stageWidth = 2
    ...
124
-
CPU Internals - Time Buffers
src/base/timebuf.hh
Similar to queues
Are advance()d each CPU cycle
  Each pipeline stage places information into time buffer
  Next stage reads from time buffer by indexing into appropriate cycle
Used for both forwards and backwards communication
Avoids unrealistic interaction between pipeline stages
Time buffer class is templated
  Its template parameter is the communication struct between stages
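A time buffer can be pictured as a short sliding window of per-cycle slots. The sketch below is a simplified Python model and makes its own assumptions: `ToyTimeBuffer`, the `past` depth, and the positive "cycles ago" indexing are hypothetical and do not reproduce the interface in src/base/timebuf.hh. A producing stage writes into slot 0 during the current cycle, and a consuming stage reads slot k to see what was written k cycles earlier.

```python
from collections import deque


class ToyTimeBuffer:
    """Fixed-depth buffer of per-cycle slots; index 0 is the current cycle."""

    def __init__(self, past, make_slot):
        self._make_slot = make_slot
        # Slot 0 is "now"; slot k holds what was written k cycles ago.
        self._slots = deque((make_slot() for _ in range(past + 1)),
                            maxlen=past + 1)

    def advance(self):
        """Start a new cycle: a fresh slot becomes 'now', the oldest is dropped."""
        self._slots.appendleft(self._make_slot())

    def __getitem__(self, cycles_ago):
        return self._slots[cycles_ago]
```

Because a consumer can only see a slot after `advance()` has been called, information written in one cycle cannot be acted on in the same cycle, which is the property that avoids unrealistic interaction between stages.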
125
-
Time Buffer Communication
Demonstrated on out-of-order pipeline ... Red is a time buffer
[Figure: Fetch, Decode, Rename, Issue/Execute/Writeback, and Commit stages linked by time buffers, with Backwards Communication between stages]
126
-
CPU Interfaces - ThreadContext
src/cpu/thread_context.hh
Interface for accessing total architectural state of a single thread
  PC, register values, etc.
Used to obtain pointers to key classes
  CPU, process, system, ITB, DTB, etc.
Abstract base class
  Each CPU model must implement its own derived ThreadContext
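The abstract-base-class arrangement can be sketched in a few lines. The Python below is illustrative only: `ToyThreadContext`, `ToySimpleThreadContext`, `dump_state`, and the two accessor methods are hypothetical names, not the members of the real ThreadContext class. It shows why code written against the interface works with any CPU model that supplies a derived class.

```python
from abc import ABC, abstractmethod


class ToyThreadContext(ABC):
    """Hypothetical interface to one thread's architectural state."""

    @abstractmethod
    def read_pc(self):
        """Return the thread's program counter."""

    @abstractmethod
    def read_int_reg(self, idx):
        """Return the value of integer register idx."""


class ToySimpleThreadContext(ToyThreadContext):
    """One CPU model's implementation, backed by plain Python values."""

    def __init__(self, pc, regs):
        self._pc = pc
        self._regs = list(regs)

    def read_pc(self):
        return self._pc

    def read_int_reg(self, idx):
        return self._regs[idx]


def dump_state(tc, num_regs):
    """Model-independent code: uses only the abstract interface."""
    return {"pc": tc.read_pc(),
            "regs": [tc.read_int_reg(i) for i in range(num_regs)]}
```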
127
-
CPU Interfaces - StaticInst Class
src/cpu/static_inst.{hh,cc}
Represents a decoded instruction
  Has classifications of the inst
  Corresponds to the binary machine inst
  Only has static information
Has all the methods needed to execute an instruction
  Tells which regs are source and dest
  Contains the execute() function
  ISA parser generates execute() for all insts
128
-
CPU Interfaces - DynInst Class
src/cpu/base_dyn_inst.{hh,cc}
Dynamic version of StaticInst
  Used to hold extra information in detailed CPU models
BaseDynInst
  Holds PC, Results, Branch Prediction Status
  Interface for TLB translations
InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}
  Holds current status of an instruction's request to a resource
  Manages each instruction's pipeline schedule
O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}
  Holds Status of Renamed Registers
  Interfaces to the IQ, LSQ and ROB
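The split between static and dynamic instruction objects can be illustrated as follows. This is a hedged sketch, not the gem5 classes: `ToyStaticInst`, `ToyDynInst`, and their fields are hypothetical. One static object describes the decoded instruction, and each in-flight execution gets its own dynamic wrapper carrying per-execution state such as the PC and result.

```python
class ToyStaticInst:
    """Static, decoded view of an instruction; holds no per-execution state."""

    def __init__(self, mnemonic, src_regs, dest_regs):
        self.mnemonic = mnemonic
        self.src_regs = tuple(src_regs)
        self.dest_regs = tuple(dest_regs)


class ToyDynInst:
    """One in-flight execution of a static instruction."""

    def __init__(self, static_inst, pc, seq_num):
        self.static_inst = static_inst  # shared, never modified here
        self.pc = pc
        self.seq_num = seq_num
        self.result = None            # filled in when execution completes
        self.predicted_taken = None   # branch prediction status, if any


def issue_twice(static_inst, pcs):
    """Create one dynamic instruction per execution of the same static one."""
    return [ToyDynInst(static_inst, pc, seq) for seq, pc in enumerate(pcs)]
```

A loop body executed twice therefore yields two dynamic instructions that point at the same static instruction but record different sequence numbers and results.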
129
-
Ruby Memory System
Derek Hower
University of Wisconsin, Madison
131
-
Outline
Feature Overview
  Rich Configuration
  Rapid Prototyping
    SLICC
  Modular & Detailed Components
Lifetime of a Ruby memory request
132
-
Feature Overview
Flexible Memory System
  Rich configuration - Just run it
    Simulate combinations of caches, coherence, interconnect, etc...
  Rapid prototyping - Just create it
    Domain-Specific Language (SLICC) for coherence protocols
    Modular components
Detailed statistics
  e.g., Request size/type distribution, state transition frequencies, etc...
Detailed component simulation
  Network (fixed/flexible pipeline and simple)
  Caches (Pluggable replacement policies)
  Memory (DDR2)
133
-
Rich Configuration - Just run it
Can build many different memory systems
  CMPs, SMPs, SCMPs
  1/2/3 level caches
  Pt2Pt/Torus/Mesh Topologies
  MESI/MOESI coherence
Each component is individually configurable
  Build heterogeneous cache architectures (new)
  Adjust cache sizes, bandwidth, link latencies, etc...
Get research started without modifying code!
134
-
Configuration Examples
1 8-core CMP, 2-Level, MESI protocol, 32K L1s, 8MB 8-banked L2s, crossbar interconnect
    scons build/ALPHA_FS/gem5.opt PROTOCOL=MESI_CMP_directory RUBY=True
    ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 8 \
        --l1i_size=32kB --l1d_size=32kB --l2_size=8MB --num-l2caches=8 \
        --topology=Crossbar --timing
2 64-socket SMP, 2-Level on-chip Caches, MOESI protocol, 32K L1s, 8MB L2 per chip, mesh interconnect
    scons build/ALPHA_FS/gem5.opt PROTOCOL=MOESI_CMP_directory RUBY=True
    ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 64 \
        --l1i_size=32kB --l1d_size=32kB --l2_size=512MB --num-l2caches=64 \
        --topology=Mesh --timing
Many other configuration options
Protocols only work with specific architectures (see wiki)
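The cache sizes in these two examples can be cross-checked with a little arithmetic. The sketch below assumes that `--l2_size` is the total L2 capacity shared evenly across `--num-l2caches`; that reading is inferred from the second example, where 512MB over 64 caches matches the stated "8MB L2 per chip", and it should be confirmed against the configuration script. The helper names are hypothetical.

```python
UNITS = {"kB": 1024, "MB": 1024 ** 2}


def parse_size(text):
    """Convert a size string such as '32kB' or '8MB' to bytes."""
    for suffix, scale in UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * scale
    raise ValueError("unknown size suffix: " + text)


def l2_bytes_per_cache(l2_size, num_l2caches):
    """Split the total L2 capacity evenly across the L2 caches (assumption)."""
    return parse_size(l2_size) // num_l2caches
```

Under that assumption, example 1 gives 1MB per L2 bank (8MB over 8 banks) and example 2 gives 8MB per chip (512MB over 64 caches).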
135
-
Rapid Prototyping - Just create it
Modular construction
  Coherence controller (SLICC)
  Cache (C++)
  Replacement Policy (C++)
  DRAM (C++)
  Topology (Python)
  Network implementation (C++)
Debugging support
136
-
SLICC: Specification Language for Implementing Cache Coherence
Domain-Specific Language
  Syntactically similar to C/C++
  Like HDLs, constrains operations to be hardware-like (e.g., no loops)
Two generation targets
  C++ for simulation
    Coherence controller object
  HTML for documentation
    Table-driven specification (State x Event -> Actions & next state)
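A table-driven specification of the form State x Event -> Actions & next state can be modeled with a dictionary. The Python sketch below is not SLICC and not a real protocol: the states (including the transient state `IM`), events, and action names in `TRANSITIONS` are hypothetical, chosen only to show how a lookup table drives a controller.

```python
# (state, event) -> (actions, next_state); every entry here is hypothetical.
TRANSITIONS = {
    ("I", "Store"): (["issueRequest"], "IM"),
    ("IM", "Data"): (["writeDataToCache", "notifyCPU"], "M"),
    ("M", "Store"): (["notifyCPU"], "M"),
    ("M", "Fwd_GETX"): (["sendDataToRequestor"], "I"),
}


def step(state, event, log):
    """Apply one table entry: record its actions, then return the next state."""
    if (state, event) not in TRANSITIONS:
        raise KeyError("no transition for %s x %s" % (state, event))
    actions, next_state = TRANSITIONS[(state, event)]
    log.extend(actions)
    return next_state
```

Keeping the protocol in a table means a missing (state, event) pair shows up as an explicit error rather than as silently wrong behavior, and the same table can be rendered as documentation.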
137
-
SLICC Protocol Structure
Collection of Machines, e.g.
  L1 Controller
  L2 Controller
  DRAM Controller
Machines are connected through network ports (different than MemPorts)
Network can be an arbitrary topology
138
-
Machine Structure
Machines are (logically) per-block
Consist of:
  Ports - Interface to the world
  States - Both stable and transient
  Events - Triggered by incoming messages
  Transitions - Old state x Event -> New state
  Actions - Occur atomically during transition, e.g., Send/receive messages from network
139
-
MI Example
[Figure: CPUs, each with an L1 Cache, connected through a Pt-to-Pt Interconnect to Directories]
Single-level coherence protocol
  2 controller types - Cache + Directory
Cache Controller
  2 stable states: Modified (a.k.a. Valid), Invalid
Directory Controller [Not Shown]
  2 stable states: Modified (Present in cache), Valid
3 virtual networks (request, response, forward)
See src/mem/ruby/protocols/MI_example.*
140
-
MI Example - L1 Cache Controller
Machine structure
Machine Pseudo-Code
    machine(L1Cache, "MI Example L1 Cache")
      : Sequencer * sequencer, // parameters to the machine object (set at initialization)
        CacheMemory * cacheMemory,
        int cache_response_latency = 12,
        int issue_latency = 2
    {
        // 3 virtual channels to/from network + connection to CPU
        // M,I, & Load,Store, etc.
        // e.g., RequestMessage::GETX -> Fwd_GETX
        // e.g., issueRequest
        // e.g., I x Store -> M
    }
141
-
MI Example - L1 Cache Controller
Defining a machine interface
Interface to the network
    // MessageBuffers - opaque C++ communication queues
    MessageBuffer requestFromCache, network=To, virtual_network=0, ordered=true;
    MessageBuffer responseFromCache, network=To, virtual_network=1, ordered=true;
    MessageBuffer forwardToCache, network=From, virtual_network=2, ordered=true;
    MessageBuffer responseToCache, network=From, virtual_network=1, ordered=true;

    // out_port - map request type to outgoing message buffer
    out_port(requestNetwork_out, RequestMsg, requestFromCache);
    out_port(responseNetwork_out, ResponseMsg, responseFromCache);

    // in_port - map request type to incoming message buffer
    // and produce code to accept incoming messages
    in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }
    in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }
Interface to a CPU
    // The other end of mandatoryQueue attaches to Sequencer
    MessageBuffer mandatoryQueue, ordered=false;
    in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc=...) { ... }
    // There is no corresponding out_port - handled with hitCallback
142
-
MI Example - L1 Cache ControllerDefining a machine interface
Interface to the network
// MessageBuffers - opaque C++ communication queuesMessageBuffer
requestFromCache, network=To, virtual_network=0,
ordered=true;MessageBuffer responseFromCache, network=To,
virtual_network=1, ordered=true;MessageBuffer forwardToCache,
network=From, virtual_network=2, ordered=true;MessageBuffer
responseToCache, network=From, virtual_network=1, ordered=true;
// out_port - map request type to outgoing message
bufferout_port(requestNetwork_out, RequestMsg,
requestFromCache);out_port(responseNetwork_out, ResponseMsg,
responseFromCache);
// in_port - map request type to incomming message buffer// and
produce code to accept incomming
messagesin_port(forwardRequestNetwork_in, RequestMsg,
forwardToCache) { ... }in_port(responseNetwork_in, ResponseMsg,
responseToCache) { ... }
Interface to a CPU
// The other end of mandatoryQueue attaches to
SequencerMessageBuffer mandatoryQueue,
ordered=false;in_port(mandatoryQueue_in, RubyRequest,
mandatoryQueue, desc=...) { ... }// There is no corresponing
out_port - handled with hitCallback
142
-
MI Example - L1 Cache Controller: Declaring States
State Declaration
  // STATES
  state_declaration(State, desc="Cache states") {
    // Stable States
    I, AccessPermission:Invalid, desc="Not Present/Invalid";
    M, AccessPermission:Read_Write, desc="Modified";

    // Transient States
    II, AccessPermission:Busy, desc="Not Present/Invalid, issued PUT";
    MI, AccessPermission:Busy, desc="Modified, issued PUT";
    MII, AccessPermission:Busy, desc="Modified, issued PUTX, received nack";
    IS, AccessPermission:Busy, desc="Issued request for LOAD/IFETCH";
    IM, AccessPermission:Busy, desc="Issued request for STORE/ATOMIC";
  }
143
-
MI Example - L1 Cache Controller: Declaring Events
Event Declaration
  // EVENTS
  enumeration(Event, desc="Cache events") {
    // From processor
    Load, desc="Load request from processor";
    Ifetch, desc="Ifetch request from processor";
    Store, desc="Store request from processor";

    // From network (directory)
    Data, desc="Data from network";
    Fwd_GETX, desc="Forward from network";
    Inv, desc="Invalidate request from dir";
    Writeback_Ack, desc="Ack from the directory for a writeback";
    Writeback_Nack, desc="Nack from the directory for a writeback";

    // Internally generated
    Replacement, desc="Replace a block";
  }
144
-
MI Example - L1 Cache Controller: Mapping messages to events
Mapping occurs in the in_port declaration.
  peek(in_port, message_type)
    Sets variable in_msg to head of in_port queue.
  trigger(Event, address)
Event mapping
  in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
    if (forwardRequestNetwork_in.isReady()) {
      peek(forwardRequestNetwork_in, RequestMsg) {
        if (in_msg.Type == CoherenceRequestType:GETX) {
          trigger(Event:Fwd_GETX, in_msg.Address);
        }
        ...
      }
    }
  }
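The peek/trigger pattern above can be mimicked outside SLICC. The following is a minimal Python sketch, not gem5 code: the RequestMsg tuple, the function name, and the string-valued types and events are hypothetical stand-ins chosen for illustration.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for a SLICC RequestMsg; only the fields used here.
RequestMsg = namedtuple("RequestMsg", ["Type", "Address"])

def forward_request_in_port(buffer):
    """Map the message at the head of the buffer to an (event, address) pair.

    Returns None when the buffer is empty (isReady() would be false) or when
    the message type has no mapping in this sketch.
    """
    if not buffer:                      # isReady()
        return None
    in_msg = buffer[0]                  # peek: read the head, do not dequeue
    if in_msg.Type == "GETX":
        return ("Fwd_GETX", in_msg.Address)   # trigger(Event:Fwd_GETX, ...)
    return None

forward_to_cache = deque([RequestMsg("GETX", 0x40)])
print(forward_request_in_port(forward_to_cache))   # ('Fwd_GETX', 64)
```

Note that the message stays on the queue after the peek; in the slides, popping is a separate action performed by the triggered transition.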
145
-
MI Example - L1 Cache Controller: Defining Transitions
transition(Starting State(s), Event, [EndingState]) [ { Actions } ]
Transition sequence for new Store request
  transition(I, Store, IM) {
    v_allocateTBE;  // allocate TBE (a.k.a. MSHR) on transition to transient state
    i_allocateL1CacheBlock;
    a_issueRequest;
    m_popMandatoryQueue;
  }
  transition(IM, Data, M) {
    u_writeDataToCache;
    s_store_hit;
    w_deallocateTBE;  // deallocate TBE on transition back to stable state
    n_popResponseQueue;
  }
  ...
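One way to read the transition syntax is as a lookup table keyed by (state, event). The sketch below is an illustrative Python model of only the two transitions shown on this slide; the table layout and the do_transition helper are assumptions for illustration, not how SLICC generates its C++ code.

```python
# (state, event) -> (next state, ordered action names), taken from the two
# transitions on the slide; a full protocol table has many more entries.
TRANSITIONS = {
    ("I", "Store"): ("IM", ["v_allocateTBE", "i_allocateL1CacheBlock",
                            "a_issueRequest", "m_popMandatoryQueue"]),
    ("IM", "Data"): ("M", ["u_writeDataToCache", "s_store_hit",
                           "w_deallocateTBE", "n_popResponseQueue"]),
}

def do_transition(state, event):
    """Return (next_state, actions); raise ValueError on an undefined pair."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError("no transition for state %s on event %s"
                         % (state, event))

# Walk a store miss: I --Store--> IM --Data--> M
state = "I"
for event in ["Store", "Data"]:
    state, actions = do_transition(state, event)
    print(event, "->", state, actions)
```

An undefined (state, event) pair raises an error here, which is one simple way to surface a missing protocol transition during testing.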
146
-
MI Example - L1 Cache Controller: Defining Actions
action(name, abbrev, [desc]) { implementation }
Two special functions are available in an action:
  peek(in_port, message_type) { use in_msg }
    assigns in_msg to the message at the head of the port
  enqueue(out_port, message_type, [options]) { set out_msg }
    enqueues out_msg on out_port
Special variable address is available inside an action block
  Set to the address associated with the event that caused the calling transition
Example Action Definition
  action(e_sendData, "e", desc="Send data from cache to requestor") {
    peek(forwardRequestNetwork_in, RequestMsg) {
      enqueue(responseNetwork_out, ResponseMsg, latency=cache_response_latency) {
        out_msg.Address := address;
        out_msg.Type := CoherenceResponseType:DATA;
        out_msg.Sender := machineID;
        out_msg.Destination.add(in_msg.Requestor);  // uses in_msg set by peek
        out_msg.DataBlk := cacheMemory[address].DataBlk;
        out_msg.MessageSize := MessageSizeType:Response_Data;
      }
    }
  }
147
-
MI Example - L1 Cache Controller: Transition Table
148
-
MI Example: Connecting SLICC Machines with a Topology
Creating the Topology: Not in SLICC
src/mem/ruby/network/topologies/Pt2Pt.py
  # returns a SimObject for a Pt2Pt Topology
  def makeTopology(nodes, options, IntLink, ExtLink, Router):
      # Create an individual router for each controller (node),
      # and connect them (ext_links)
      routers = [Router(router_id=i) for i in range(len(nodes))]
      ext_links = [ExtLink(link_id=i, ext_node=n, int_node=routers[i])
                   for (i, n) in enumerate(nodes)]
      link_count = len(nodes)

      # Connect routers all-to-all (int_links)
      int_links = []
      for i in range(len(nodes)):
          for j in range(len(nodes)):
              if (i != j):
                  link_count += 1
                  int_links.append(IntLink(link_id=link_count,
                                           node_a=routers[i],
                                           node_b=routers[j]))

      # Return Pt2Pt Topology SimObject
      return Pt2Pt(ext_links=ext_links,
                   int_links=int_links,
                   routers=routers)
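To see how many links the all-to-all construction produces, the same loop structure can be run outside gem5. This is a sketch under stated assumptions: Router, ExtLink, and IntLink below are plain namedtuple stand-ins (not the gem5 SimObject classes), and make_pt2pt is a hypothetical name.

```python
from collections import namedtuple

# Hypothetical stand-ins for the gem5 SimObject classes, just enough to run.
Router = namedtuple("Router", ["router_id"])
ExtLink = namedtuple("ExtLink", ["link_id", "ext_node", "int_node"])
IntLink = namedtuple("IntLink", ["link_id", "node_a", "node_b"])

def make_pt2pt(nodes):
    """Build one router per node and connect every ordered router pair."""
    routers = [Router(router_id=i) for i in range(len(nodes))]
    ext_links = [ExtLink(link_id=i, ext_node=n, int_node=routers[i])
                 for (i, n) in enumerate(nodes)]
    link_count = len(nodes)
    int_links = []
    for i in range(len(nodes)):
        for j in range(len(nodes)):
            if i != j:
                link_count += 1
                int_links.append(IntLink(link_id=link_count,
                                         node_a=routers[i],
                                         node_b=routers[j]))
    return routers, ext_links, int_links

routers, ext_links, int_links = make_pt2pt(["l1_0", "l1_1", "dir_0"])
print(len(routers), len(ext_links), len(int_links))   # 3 3 6
```

With n nodes the loop yields n external links and n(n-1) internal links, since each ordered pair (i, j) with i != j gets its own link.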
149
-
Using C++ Objects in SLICC
SLICC can be arbitrarily extended with C++ objects
  e.g., Interface with a new message filter
Steps:
  Create class in C++
  Declare interface in SLICC with structure, external=yes
  Initialize object in machine
  Use!
Extending SLICC
  // MessageFilter.h
  class MessageFilter {
   public:
    MessageFilter(int param1);

    // returns 1 if message should be filtered
    int filter(RequestMsg msg);
  };

  // MessageFilter.cc
  int MessageFilter::filter(RequestMsg msg)
  {
    ...
    return 0;
  }

  // MI_example-cache.sm
  structure(MessageFilter, external=yes) {
    int filter(RequestMsg);
  };

  MessageFilter requestFilter, constructor_hack=param;

  action(af_allocateUnlessFiltered, af) {
    if (requestFilter.filter(in_msg) != 1) {
      cacheMemory.allocate(address, new Entry);
    }
  }
150
-
Detailed Component Simulation: Caches
Set-Associative Caches
  Each CacheMemory object represents one bank of cache
  Configurable bit select for indexing
  Modular replacement policy
    Tree-based pseudo-LRU
    LRU
See src/mem/ruby/system/CacheMemory.hh
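As a rough illustration of bit-select indexing combined with an LRU policy, here is a toy Python model. It is a sketch only and does not reflect the CacheMemory implementation: the class name, the parameter names, and the use of an OrderedDict per set are all assumptions made for this example.

```python
from collections import OrderedDict

class SetAssocCache:
    """Toy set-associative cache with bit-select indexing and true LRU."""

    def __init__(self, num_sets, assoc, block_bits, start_index_bit=None):
        assert num_sets & (num_sets - 1) == 0, "num_sets must be a power of two"
        self.num_sets = num_sets
        self.assoc = assoc
        self.block_bits = block_bits
        # Lowest address bit used for the set index; by default the first
        # bit above the block offset.
        self.start_index_bit = (block_bits if start_index_bit is None
                                else start_index_bit)
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def set_index(self, addr):
        return (addr >> self.start_index_bit) & (self.num_sets - 1)

    def access(self, addr):
        """Return True on a hit; on a miss, allocate and evict the LRU block."""
        block = addr >> self.block_bits
        ways = self.sets[self.set_index(addr)]
        if block in ways:
            ways.move_to_end(block)      # most recently used goes last
            return True
        if len(ways) >= self.assoc:
            ways.popitem(last=False)     # evict the least recently used
        ways[block] = True
        return False

cache = SetAssocCache(num_sets=2, assoc=2, block_bits=6)
# All four distinct blocks below map to set 0 with 64-byte blocks.
results = [cache.access(a) for a in [0x000, 0x080, 0x000, 0x100, 0x000]]
print(results)   # [False, False, True, False, True]
```

The fourth access evicts block 0x080 rather than 0x000 because 0x000 was touched more recently, so the final access still hits.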
151
-
Detailed Component Simulation: Memory
Memory controller models a single-channel DDR2 controller
  Implements closed-page policy
  Can configure ranks, tCAS, refresh, etc.
See src/mem/ruby/system/MemoryController.hh
152
-
Detailed Component Simulation: Network
Simple Network
  Idealized routers - fixed latency, no internal resources
  Does model link bandwidth
Garnet Network
  Detailed routers - both fixed and flexible pipeline model
  From Princeton, MIT
See src/mem/ruby/network/*
153
-
Ruby Debugging Support
Random testing support
  Stresses protocol by inserting random timing delays
Support for coherence transition tracing
Frequent assertions
Deadlock detection
154
-
Lifetime of a Ruby Memory Request
1 Request enters through RubyPort::recvTiming, is converted to RubyRequest, and passed to Sequencer.
2 Request enters SLICC controllers through Sequencer::makeRequest via mandatoryQueue.
3 Message on mandatoryQueue triggers an event in L1Controller.
4 Until request is completed:
  1 (Event, State) is matched to a transition.
  2 Actions in the matched transition (optionally) send messages to network & allocate TBE.
  3 Responses from network trigger more events.
5 Last event causes action that calls Sequencer::hitCallback & deallocates TBE.
6 RubyRequest is converted back into a Packet & sent to RubyPort.
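The six steps can be traced with a small queue-driven loop. The Python below is a hedged sketch for a single store miss: the function name, the log strings, and the immediate "Data" reply standing in for the network and directory are all invented for illustration and do not correspond to gem5 classes.

```python
from collections import deque

def run_store(address):
    """Trace one store through a toy L1 controller; returns the final state,
    whether a TBE is still allocated, and a log of the steps taken."""
    log = []
    mandatory_queue = deque()
    response_queue = deque()
    state, tbe_allocated = "I", False

    # Steps 1-2: the request is converted and placed on the mandatoryQueue.
    mandatory_queue.append(("Store", address))
    log.append("makeRequest")

    # Steps 3-4: queued messages trigger events until the request completes.
    while mandatory_queue or response_queue:
        if mandatory_queue:
            event, addr = mandatory_queue.popleft()
        else:
            event, addr = response_queue.popleft()
        if (state, event) == ("I", "Store"):
            state, tbe_allocated = "IM", True
            # Stand-in for the network and directory: reply with data.
            response_queue.append(("Data", addr))
        elif (state, event) == ("IM", "Data"):
            state, tbe_allocated = "M", False
            log.append("hitCallback")          # step 5
        log.append("%s -> %s" % (event, state))

    log.append("packet returned")              # step 6
    return state, tbe_allocated, log

final_state, tbe, steps = run_store(0x40)
print(final_state, tbe)
for step in steps:
    print(step)
```

The TBE is held only while the block is in the transient IM state, matching the allocate and deallocate actions in the transitions shown earlier.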
155
-
Wrap-Up
Brad Beckmann
AMD Research
157
-
Summary
Reviewed the basics
  High-level features
  Debugging
  Checkpointing
Highlighted new aspects
  ISA changes: x86 & ARM
  InOrder CPU model
  Ruby memory system
Upcoming Computer Architecture News (CAN) article
  Summarizes goals, features, and capabilities
  Please cite if you use gem5
Overall gem5 has a wide range of capabilities
  ...but not all combinations currently work
158
-
Cross-Product Table
Processor                       Memory System
CPU Model       System Mode     Classic    Ruby (Simple)    Ruby (Garnet)
Atomic Simple   SE, FS
Timing Simple   SE, FS
InOrder         SE, FS
O3              SE, FS
Spectrum of choices (light = speed, dark = accuracy)
159
-
Matrix Examples: Alpha and x86
Alpha
Processor                       Memory System
CPU Model       System Mode     Classic    Ruby (Simple)    Ruby (Garnet)
Atomic Simple   SE, FS
Timing Simple   SE, FS
InOrder         SE, FS
O3              SE, FS
x86
Processor                       Memory System
CPU Model       System Mode     Classic    Ruby (Simple)    Ruby (Garnet)
Atomic Simple   SE, FS
Timing Simple   SE, FS
InOrder         SE, FS