Heterogeneous Thread Assignment Simulation
Kris Lange, Nopparat Suwaanarat, Pree Thiengburanathum
◦ Introduction
◦ Motivation
◦ Review concepts
◦ M5 architecture
◦ Configuring M5 Simulator
◦ Simulation
◦ Results and Analysis
◦ Conclusion
Agenda
Basis: "Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures"
The paper makes 2 claims:
◦ Heterogeneous CMPs outperform homogeneous CMPs (for a fixed total die size)
◦ The benefits of heterogeneous CMPs are enhanced by dynamic thread assignment policies
Introduction
Gain a deeper understanding of the research paper
Verify the results of this paper
Gain hands-on experience running a peer-reviewed experiment
Motivation
Heterogeneous CMP system
Homogeneous CMP system
Heterogeneous vs. homogeneous in multi-programmed workloads
Review: Concepts
Heterogeneous CMP system
◦ Many simple cores = higher thread parallelism
◦ Fewer, larger cores = lower thread parallelism
We want to maximize resource utilization and achieve a high degree of inter-thread parallelism.
How? By mapping running tasks to cores and using a control mechanism.
Review: Concepts
Which one has a better total execution time?
Control mechanism: thread assignment policies
◦ Static thread assignment: random, best
◦ Dynamic thread assignment: round robin, IPC driven
Review: Concepts
            P1     P2
Thread A    1.6    0.4
Thread B    1.5    1
• Static thread assignment
◦ Usually assigns threads to the faster core.
◦ A well-studied problem; the mapping is decided before execution.
◦ Solutions rely on heuristics.
◦ Random static assignment: the workloads and IPC are not known.
◦ Pseudo-best static assignment: the workloads and IPC are known, and a heuristic finds the mapping.
• Disadvantages
◦ Threads are not reassigned at run time.
◦ Usage of the faster core(s) is not optimized.
◦ "Slow" threads on slower core(s) penalize overall system performance.
Concepts: Assignment Policies
Dynamic thread assignment
◦ Round robin assignment
  Rotates the assignment of threads to processors in a round-robin fashion.
  Ensures that the available faster cores are equally shared among the running programs.
Concepts: Assignment Policies
IPC-driven assignment
◦ Considers the characteristics of the executing threads.
◦ Looks at each thread's IPC and the ratio of its IPC between the two cores to decide the thread mapping.
◦ The thread with the higher ratio runs on the faster core.
◦ The thread with the lower ratio runs on the slower core.
Concepts: Assignment policies
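A small sketch of the IPC-ratio rule, under stated assumptions: the Thread A / Thread B numbers from the earlier table are read as per-core IPC, P1 is taken to be the faster core, and system throughput is approximated as the sum of the per-thread IPC. The function names are illustrative and not taken from the paper.

```python
# Per-core IPC values; reading the table entries as IPC is an assumption.
ipc = {
    "A": {"P1": 1.6, "P2": 0.4},
    "B": {"P1": 1.5, "P2": 1.0},
}

def ipc_ratio(thread):
    """IPC on the (assumed) faster core P1 divided by IPC on P2."""
    return ipc[thread]["P1"] / ipc[thread]["P2"]

def system_ipc(mapping):
    """Sum of the IPC each thread achieves on its assigned core."""
    return sum(ipc[thread][core] for thread, core in mapping.items())

# The thread with the higher ratio gets the faster core.
ranked = sorted(ipc, key=ipc_ratio, reverse=True)
ipc_driven = {ranked[0]: "P1", ranked[1]: "P2"}
alternative = {ranked[0]: "P2", ranked[1]: "P1"}

print("IPC-driven:", ipc_driven, round(system_ipc(ipc_driven), 2))
print("Alternative:", alternative, round(system_ipc(alternative), 2))
```

With these numbers Thread A has ratio 4.0 and Thread B has ratio 1.5, so Thread A is placed on P1; the combined IPC is 2.6 for that mapping against 1.9 for the reverse.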
Goal: duplicate the experiment in the paper (peer-reviewed)
2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs
  Using the M5 simulator
  Alpha EV5 + EV6 cores
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies
Simulation Approach
Which simulator is suitable?
◦ Rsim, Simple MP, SimOS, Simics, TFsim, SimFlex, GEMS
Introduction & Overview
◦ What is M5?
◦ A brief peek inside
What is M5?
A modular platform for simulating systems
Encompasses
◦ System-level architecture
◦ Processor microarchitecture
Key properties of M5
◦ Pervasively object-oriented
◦ Multiple interchangeable CPU models
◦ Event-driven memory system
◦ Multiprocessor / multi-system capability
Overview of M5 Architecture
[Figure: M5 system block diagram showing a CPU, L1 cache, L2 cache, bus bridges, memory, and an I/O device connected by buses]
M5's Architecture
◦ CPU models
◦ ISA
◦ Memory system
◦ Cache
◦ Buses
CPU model
• A simple CPU model
• 2 detailed CPU models
CPU model
[Figure: detailed CPU pipeline with backward communication between stages: Fetch, Decode, Rename, Issue/Execute/Writeback, Commit]
Instruction Set Architecture (ISA)
Goal: allow a human-readable ISA description
Two parts
◦ A simple part: describes the decode
◦ A declaration part: describes the global information
Memory System
Goal
◦ Combine the timing and functional models into one model
◦ Simplify the memory system code
◦ Make changes easier
Memory Architecture
[Figure: caches, a bus, and memories connected through ports; each port is paired with a peer port]
Cache
◦ Coherency
◦ Prefetching
[Figure: prefetcher class hierarchy: BasePrefetcher, Prefetcher, GHBPrefetcher, StridePrefetcher, TaggedPrefetcher]
Buses
◦ Connect memory, I/O, and CPUs
◦ Master: closer to memory
◦ Slave: closer to CPU
Setup for the M5 simulator
◦ Windows Vista running Fedora Core in VMware.
Download the simulator from the website.
◦ www.m5sim.org (open source)
Required software:
◦ g++, python, scons, zlib, swig
Configuring the M5 Simulator
FS mode
◦ Full System mode. This mode simulates a complete system including a kernel, I/O devices, etc. This mode currently only works with the ALPHA architecture.
SE mode
◦ Syscall Emulation mode. This mode simulates statically compiled binaries by functionally emulating any syscalls they make.
Example commands to build and run M5
◦ % scons build/ALPHA_SE/m5.debug
◦ % ./build/ALPHA_SE/m5.debug configs/example/se.py
Building, Compiling and running M5
What is cross compilation?
◦ Compiling a program for a target platform different from the platform the compiler is run on
M5 test programs must be compiled for Alpha + Linux
Why?
◦ M5 implements the Alpha ISA and Linux syscalls
Since we don't own Alpha hardware: cross-compile
Cross Compilation
The build toolchain must be built for the specific target
◦ gcc, glibc, binutils, etc.
Dan Kegel’s crosstool makes this easier: http://www.kegel.com/crosstool
Of the 3 Spec2000 programs we considered, we were only able to successfully cross compile gzip
Cross Compilation: Take 1
Scour the net until you run across this link:
◦ http://arch.cs.duke.edu/spec2000binaries.tar.bz2
◦ All Spec2000 binaries compiled for alpha-linux!
Cross Compilation: Take 2
M5 Output
• M5 produces simulation results at the end:
---------- Begin Simulation Statistics ----------
host_inst_rate              86899           # Simulator instruction rate (inst/s)
host_mem_usage              543680          # Number of bytes of host memory used
host_seconds                0.07            # Real time elapsed on the host
host_tick_rate              28827895        # Simulator tick rate (ticks/s)
sim_freq                    1000000000000   # Frequency of simulated ticks
sim_insts                   5997            # Number of instructions simulated
sim_seconds                 0.000002        # Number of seconds simulated
sim_ticks                   2005326         # Number of ticks simulated
system.cpu0.dtb.accesses    0               # DTB accesses
system.cpu0.dtb.acv         0               # DTB access violations
system.cpu0.dtb.hits        0               # DTB hits
system.cpu2.num_refs        1960            # Number of memory references
:
Getting M5 to Output Trace
We want an IPC trace every 1 million cycles
So we patched:
diff -Naur src/cpu/o3/cpu.cc /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc
--- src/cpu/o3/cpu.cc 2007-11-01 19:13:05.000000000 -0600
+++ /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc 2007-12-01 22:54:38.000000000 -0700
@@ -422,6 +422,21 @@
 
     ++numCycles;
 
+    ++totalCycles; // we could use numCycles...if only i could figure out how to stringificate
+    ++currentCycles;
+    if (currentCycles >= 1000000) {
+        double currentIpc = (double)currentCommittedInsts / (double)currentCycles;
+
+        cout << "IPC: "
+             << totalCycles << ","
+             << totalCommittedInstsInt << ","
+             << currentIpc << std::endl;
+
+        currentCommittedInsts = 0;
+        currentCycles = 0;
+    }
+
+
 //    activity = false;
 
     //Tick each of the stages
@@ -452,8 +467,10 @@
     if (removeInstsThisCycle) {
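The patched simulator prints lines of the form "IPC: <total cycles>,<total committed instructions>,<interval IPC>". A minimal sketch of how such output could be turned into a trace for the second simulation phase follows; the function name and the sample lines are made up for illustration and are not taken from the project.

```python
import re

# Matches the line format printed by the patch: "IPC: <cycles>,<insts>,<ipc>"
TRACE_LINE = re.compile(r"^IPC: (\d+),(\d+),([0-9.eE+-]+)$")

def parse_ipc_trace(lines):
    """Return a list of (total_cycles, total_insts, interval_ipc) tuples."""
    samples = []
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if match:  # ignore any other simulator output
            cycles, insts, ipc = match.groups()
            samples.append((int(cycles), int(insts), float(ipc)))
    return samples

# Made-up sample output, used only to exercise the parser.
sample_output = [
    "warn: some unrelated message",
    "IPC: 1000000,850000,0.85",
    "IPC: 2000000,1900000,1.05",
]
trace = parse_ipc_trace(sample_output)
print(trace)
```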
Build the processor core
EV5 configuration on M5
EV6 configuration on M5
Spec 2000
Paper:
◦ gzip
◦ gcc
◦ crafty (chess program)
◦ parser (natural language processor)
◦ bzip2
◦ wupwise (quantum chromodynamics)
◦ swim (shallow water modeling)
◦ mgrid (multi-grid solver in 3D potential field)
◦ galgel (fluid dynamics modeling)
◦ equake (earthquake modeling)
◦ lucas (prime number test)
Us:
◦ gzip
◦ bzip2
◦ crafty
Choosing Workload
Spec 2000 input is proprietary
Compromise:
◦ gzip/bzip2 input: Shakespeare plays
◦ crafty input: sample chess game
Workload Input
Obtained from M5
IPC Traces
Java
Modular design
◦ Core simulator module
◦ Common thread-assignment policy interface
◦ Policy modules
  Static
  Round Robin (dynamic)
  IPC-Driven (dynamic)
CMP Simulator
Command-line interface
◦ Example: CMPSim spec2000 10 2 1 roundrobin
Input:
◦ Workload
◦ Number of threads
  Selected randomly from 3 Spec 2000 programs
◦ # EV5 cores
◦ # EV6 cores
◦ Thread assignment policy
CMP Simulator
Output:
Threads,Experiment,System IPC
1,20EV5 RR,0.905097784767538
2,20EV5 RR,1.46127036511788
3,20EV5 RR,2.06244067869053
4,20EV5 RR,2.78590633860981
5,20EV5 RR,3.35373843898152
6,20EV5 RR,4.07299579068557
7,20EV5 RR,4.17449020511364
8,20EV5 RR,4.915937425
9,20EV5 RR,5.47383727613636
10,20EV5 RR,6.00090476193182
11,20EV5 RR,6.64824888522727
12,20EV5 RR,7.26460146590909
13,20EV5 RR,7.90477401704545
14,20EV5 RR,8.46545665397727
15,20EV5 RR,9.23393584545455
16,20EV5 RR,9.80104248465909
17,20EV5 RR,10.3671315159091
CMP Simulator
IPC data are temporal sequences
CMP Simulator Issue
Randomly assign threads to cores at startup
Repeat the process whenever a core becomes idle
Weaknesses:
◦ When one core becomes idle, it will persist in that state unless some unassigned thread exists.
◦ In the case of a heterogeneous system, this results in underutilization of "faster" cores.
◦ Execution of "slow" threads on "slower" cores may penalize overall system performance.
Static Policy
Randomly assign threads to cores at startup
Define swap_period
◦ Experimentally, swap_period = 20M cycles works well
if (current_cycle % swap_period == 0)
◦ Migrate thread from EV6 -> wait queue
◦ Migrate thread from EV5 -> EV6
◦ Migrate thread from wait queue -> EV6
When a core becomes idle, assign the longest-waiting thread
Round Robin Policy
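One way to picture the periodic swap is as a rotation over an ordered list of slots. The sketch below is a simplification under loud assumptions: slots are laid out as EV5 cores, then EV6 cores, then the wait queue; every thread advances one slot per swap; and the waiting thread takes the EV5 slot that was just freed. The thread names and helper functions are hypothetical, not the project's code.

```python
from collections import deque

SWAP_PERIOD = 20_000_000  # cycles, the value quoted above

def rotate_assignment(order):
    """Shift every thread one slot forward; the last thread wraps around."""
    rotated = deque(order)
    rotated.rotate(1)
    return list(rotated)

def split_slots(order, n_ev5, n_ev6):
    """Slots are laid out as [EV5 cores..., EV6 cores..., wait queue...]."""
    ev5 = order[:n_ev5]
    ev6 = order[n_ev5:n_ev5 + n_ev6]
    waiting = order[n_ev5 + n_ev6:]
    return ev5, ev6, waiting

# Hypothetical setup: 4 threads on 2 EV5 cores and 1 EV6 core, 1 thread waiting.
order = ["t0", "t1", "t2", "t3"]
for current_cycle in range(0, 3 * SWAP_PERIOD, SWAP_PERIOD):
    if current_cycle > 0 and current_cycle % SWAP_PERIOD == 0:
        order = rotate_assignment(order)
    print(current_cycle, split_slots(order, n_ev5=2, n_ev6=1))
```

After the first swap, t1 moves from an EV5 core to the EV6 core, t2 moves from the EV6 core to the wait queue, and t3 leaves the wait queue, so every thread eventually gets a turn on the faster core.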
Costs
◦ Inter-core context switch
  PC, registers, etc. must be transferred
◦ Cache warmup
Simple model
◦ switch_loss: 50%
◦ switch_duration: 1M cycles
Modeling Thread Migration
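A rough sketch of how the two parameters could be applied, assuming switch_loss is the fraction of a thread's IPC lost during the switch_duration cycles that follow a migration. This interpretation and the function names are assumptions made for illustration; the slides do not spell out the exact formula.

```python
SWITCH_LOSS = 0.5            # assumed: fraction of IPC lost while migrating
SWITCH_DURATION = 1_000_000  # cycles during which the penalty applies

def effective_ipc(trace_ipc, cycles_since_migration):
    """IPC credited to a thread, reduced inside the migration window."""
    if cycles_since_migration < SWITCH_DURATION:
        return trace_ipc * (1.0 - SWITCH_LOSS)
    return trace_ipc

def lost_instructions(trace_ipc):
    """Instructions forgone per migration under this model."""
    return trace_ipc * SWITCH_LOSS * SWITCH_DURATION

print(effective_ipc(1.2, 0))          # inside the penalty window
print(effective_ipc(1.2, 1_000_000))  # window has elapsed
print(lost_instructions(1.2))
```

Under this reading, a thread running at an IPC of 1.2 gives up about 600,000 instructions per migration, which is small next to a 20M-cycle swap period.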
No effort is made to optimize thread-to-core mapping
Round Robin Weakness
Optimize thread-to-core mapping
• Define IPC ratio = EV6 IPC / EV5 IPC
Heuristic: threads with the highest IPC ratio are assigned to EV6
System must compute the average IPC for each core type
◦ Requires forced migrations
To handle IPC spikes, use a weighted average:
◦ Current IPC * 0.65 + Previous IPC * 0.35
IPC-Driven Policy
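The weighted average can be illustrated on a short made-up trace. The sketch assumes "Previous IPC" means the previous raw sample rather than the previous smoothed value; the slides do not say which, and the helper names are hypothetical.

```python
CURRENT_WEIGHT = 0.65
PREVIOUS_WEIGHT = 0.35

def weighted_ipc(current_ipc, previous_ipc):
    """Current IPC * 0.65 + Previous IPC * 0.35."""
    return CURRENT_WEIGHT * current_ipc + PREVIOUS_WEIGHT * previous_ipc

def smooth(trace):
    """Apply the weighted average along a trace, using the previous raw sample."""
    smoothed = [trace[0]]
    for previous, current in zip(trace, trace[1:]):
        smoothed.append(weighted_ipc(current, previous))
    return smoothed

# Made-up trace with a one-sample spike.
raw = [1.0, 1.0, 3.0, 1.0]
print([round(value, 2) for value in smooth(raw)])
```

The spike of 3.0 is damped to 2.3, and the following sample is pulled up to 1.7, so a single outlier has less influence on the thread ranking.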
Randomly assign threads to cores at startup
Again, define swap_period
◦ Experimentally, swap_period = 20M cycles works well
if (current_cycle % swap_period == 0)
◦ Sort threads by weighted IPC ratio
◦ Migrate accordingly
When a core becomes idle, assign the thread from the wait queue with the highest IPC ratio
IPC-Driven Policy
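The sort-and-migrate step can be sketched as follows. The per-thread IPC values are invented for illustration, and sending the lowest-ratio threads to the wait queue when threads outnumber cores is an assumption, not something the slides state.

```python
def ipc_driven_mapping(threads, n_ev5, n_ev6):
    """threads maps a name to (weighted EV5 IPC, weighted EV6 IPC)."""
    def ratio(name):
        ev5_ipc, ev6_ipc = threads[name]
        return ev6_ipc / ev5_ipc  # IPC ratio = EV6 IPC / EV5 IPC

    ranked = sorted(threads, key=ratio, reverse=True)
    return {
        "EV6": ranked[:n_ev6],                # highest ratios get the fast cores
        "EV5": ranked[n_ev6:n_ev6 + n_ev5],
        "wait": ranked[n_ev6 + n_ev5:],       # assumed: leftovers wait
    }

# Made-up weighted IPC values, not measurements.
threads = {
    "gzip": (0.8, 1.2),
    "bzip2": (0.9, 1.8),
    "crafty": (0.7, 0.9),
}
mapping = ipc_driven_mapping(threads, n_ev5=1, n_ev6=1)
print(mapping)
```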
Verifying Simulator
Goal: verify the results of the paper
Repeat their experiments
Experiments
Policy Comparison
◦ Static vs. Round Robin vs. IPC-Driven
◦ Heterogeneous system: 5 x EV5, 3 x EV6
Experiment #1
Expected Policy Results
Actual Policy Results
Heterogeneous vs. Homogeneous System
• Let 1 EV6 = 5 EV5
◦ Based on die areas
Configurations
◦ 20 EV5
◦ 10 EV5, 2 EV6
◦ 5 EV5, 3 EV6
◦ 4 EV6
Experiment #2
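As a quick check that the four configurations occupy the same die area, each can be converted to EV5-equivalent units with the stated 1 EV6 = 5 EV5 rule. The helper name below is illustrative.

```python
EV6_AREA_IN_EV5 = 5  # "Let 1 EV6 = 5 EV5"

# (number of EV5 cores, number of EV6 cores) for each configuration
configurations = [(20, 0), (10, 2), (5, 3), (0, 4)]

def area_in_ev5_units(n_ev5, n_ev6):
    """Total die area expressed in EV5-equivalent cores."""
    return n_ev5 + n_ev6 * EV6_AREA_IN_EV5

areas = [area_in_ev5_units(n_ev5, n_ev6) for n_ev5, n_ev6 in configurations]
for (n_ev5, n_ev6), area in zip(configurations, areas):
    print(f"{n_ev5} EV5 + {n_ev6} EV6 = {area} EV5-equivalents")
```

All four come out to 20 EV5-equivalents, so the comparison holds total die size fixed.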
Expected Heterogeneous Results
Actual Heterogeneous Results
Simulator neglects L2 cache contention!
Simplified thread migration model
Only used 3 Spec 2000 programs
◦ Paper used 11
Didn't have access to Spec 2000 inputs
Our EV5 and EV6 configurations were not perfect
◦ Lack of M5 documentation made this difficult
Experiment Limitations
Google Code
◦ Source control
◦ Wiki
Project Organization
Confirmed that dynamic thread assignment outperforms static thread assignment
Unable to confirm that heterogeneous outperforms homogeneous
◦ Limitations of the minimal Spec 2000 workload
Learned how to design a complex, peer-reviewed experiment
Conclusion
Questions?