Kris Lange, Nopparat Suwaanarat, Pree Thiengburanathum

Heterogeneous Thread Assignment Simulation


• Introduction
• Motivation
• Review concepts
• M5 architecture
• Configuring M5 Simulator
• Simulation
• Results and Analysis
• Conclusion

Agenda

• Basis: "Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures"
• The paper makes 2 claims:
◦ Heterogeneous CMPs outperform homogeneous CMPs (for a fixed total die size)
◦ The benefits of heterogeneous CMPs are enhanced by dynamic thread assignment policies

Introduction

• Gain a deeper understanding of the research paper
• Verify the results of the paper
• Gain hands-on experience running a peer-reviewed experiment

Motivation

• Heterogeneous CMP system
• Homogeneous CMP system
• Heterogeneous vs. homogeneous under multiprogramming

Review: Concepts

• Heterogeneous CMP system
◦ Many simple cores = higher thread parallelism
◦ Fewer, larger cores = lower thread parallelism
• We want to maximize resource utilization and achieve a high degree of inter-thread parallelism.
• How? By mapping running tasks to cores and using a control mechanism.

Review: Concepts

• Which one has a better total execution time?
• Control mechanism: thread assignment policies
◦ Static thread assignment: random, best
◦ Dynamic thread assignment: round robin, IPC-driven

Review: Concepts

          P1    P2
Thread A  1.6   0.4
Thread B  1.5   1
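A rough way to read the table is to try both thread-to-processor mappings and compare the combined IPC. The sketch below assumes the numbers are per-thread IPC values on P1 and P2 and that summed IPC is a fair proxy for throughput; the slide states neither explicitly, and the helper name `total_ipc` is made up for illustration.

```python
from itertools import permutations

# Values copied from the table; treating them as IPC is an assumption.
ipc = {
    "Thread A": {"P1": 1.6, "P2": 0.4},
    "Thread B": {"P1": 1.5, "P2": 1.0},
}

def total_ipc(mapping):
    """Sum the IPC each thread achieves on the processor it is mapped to."""
    return sum(ipc[thread][proc] for thread, proc in mapping.items())

threads = list(ipc)
# Every one-to-one mapping of the two threads onto the two processors.
mappings = [dict(zip(threads, procs)) for procs in permutations(["P1", "P2"])]
best = max(mappings, key=total_ipc)

for mapping in mappings:
    print(mapping, round(total_ipc(mapping), 2))
```

Under these assumptions, placing Thread A on P1 and Thread B on P2 gives the larger combined IPC, because Thread A loses far more than Thread B when moved to P2.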

• Static thread assignment
◦ Usually assigns a thread to the faster core.
◦ A well-studied problem; the assignment is decided before execution.
◦ Solutions rely on heuristics.
• Random static assignment: the workloads and IPC are not known.
• Pseudo-best static assignment: the workloads and IPC are known, and a heuristic is used to find the assignment.
• Disadvantages:
◦ Does not assign threads at run time.
◦ Does not optimize usage of the faster core(s).
◦ "Slow" threads on slower core(s) penalize overall system performance.


Concepts: Assignment Policies

• Dynamic thread assignment
◦ Round robin assignment: rotates the assignment of threads to processors in round-robin fashion.
◦ Ensures that the available faster cores are shared equally among the running programs.


Concepts: Assignment Policies

• IPC-driven assignment
◦ Considers the characteristics of the executing threads.
◦ Looks at each thread's IPC and its ratio between the two cores to decide the thread mapping.
◦ Threads with a higher ratio run on the faster core.
◦ Threads with a lower ratio run on the slower core.


Concepts: Assignment policies

• Goal: duplicate the experiment in the (peer-reviewed) paper
• 2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs using the M5 simulator (Alpha EV5 + EV6 cores)
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

Simulation Approach

Which simulator is suitable?
• Rsim
• Simple MP
• SimOS
• Simics
• TFsim
• SimFlex
• GEMS

Introduction & Overview
• What is M5?
• A brief peek inside

What is M5?
• A modular platform for simulating systems
• Encompasses
◦ system-level architecture
◦ processor microarchitecture

Key properties of M5
• Pervasively object-oriented
• Multiple interchangeable CPU models
• Event-driven memory system
• Multiprocessor / multi-system capability

Overview of M5 Architecture

[Figure: block diagram of an M5 system showing CPU, L1 cache, L2 cache, buses, bus bridges, memory, and an I/O device.]

M5's Architecture
• CPU models
• ISA
• Memory system
• Cache
• Buses

CPU model
• A simple CPU model
• 2 detailed CPU models

CPU model

[Figure: pipeline stages Fetch, Decode, Rename, Issue/Execute/Writeback, and Commit, with backward communication between stages.]

Instruction Set Architecture (ISA)
• Goal: allow a human-readable ISA description
• Two parts
◦ A simple part: describes the decode
◦ A declaration part: describes the global information

Memory System
• Goal
◦ Combine the timing and functional models into one model
◦ Simplify the memory system code
◦ Make changes easier

Memory Architecture

[Figure: cache, bus, and memory objects connected to one another through paired (peer) ports.]

Cache
• Coherency
• Prefetching
◦ Prefetcher classes: BasePrefetcher, GHBPrefetcher, StridePrefetcher, TaggedPrefetcher

Buses
• Memory, I/O, CPUs
• Master: closer to memory
• Slave: closer to CPU

• Setup for the M5 simulator
◦ Windows Vista host running Fedora Core under VMware
• Download the simulator from the website
◦ www.m5sim.org (open source)
• Required software:
◦ g++, Python, SCons, zlib, SWIG

Configuring the M5 Simulator


• FS mode
◦ Full System mode. Simulates a complete system, including a kernel, I/O devices, etc. This mode currently works only with the Alpha architecture.
• SE mode
◦ Syscall Emulation mode. Simulates statically compiled binaries by functionally emulating any syscalls they make.
• Example commands to build and run M5
◦ % scons build/ALPHA_SE/m5.debug
◦ % ./build/ALPHA_SE/m5.debug configs/example/se.py

Building, Compiling and running M5

• What is cross compilation?
◦ Compiling a program for a target platform different from the platform the compiler runs on
• M5 test programs must be compiled for Alpha + Linux
• Why?
◦ M5 implements the Alpha ISA and Linux syscalls
• Since we don't own Alpha hardware: cross-compile

Cross Compilation

• A build toolchain must be built for the specific target
◦ gcc, glibc, binutils, etc.
• Dan Kegel's crosstool makes this easier: http://www.kegel.com/crosstool
• Of the 3 Spec2000 programs we considered, we were only able to cross-compile gzip successfully

Cross Compilation: Take 1


• Scour the net until you run across this link:
◦ http://arch.cs.duke.edu/spec2000binaries.tar.bz2
◦ All Spec2000 binaries compiled for alpha-linux!

Cross Compilation: Take 2


• M5 produces simulation results at the end:

---------- Begin Simulation Statistics ----------
host_inst_rate 86899 # Simulator instruction rate (inst/s)
host_mem_usage 543680 # Number of bytes of host memory used
host_seconds 0.07 # Real time elapsed on the host
host_tick_rate 28827895 # Simulator tick rate (ticks/s)
sim_freq 1000000000000 # Frequency of simulated ticks
sim_insts 5997 # Number of instructions simulated
sim_seconds 0.000002 # Number of seconds simulated
sim_ticks 2005326 # Number of ticks simulated
system.cpu0.dtb.accesses 0 # DTB accesses
system.cpu0.dtb.acv 0 # DTB access violations
system.cpu0.dtb.hits 0 # DTB hits
system.cpu2.num_refs 1960 # Number of memory references
:

M5 Output
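The statistics dump has a regular "name value # description" layout, so it can be loaded into a dictionary for later analysis. The following is a minimal sketch under that assumption; `parse_stats` is a hypothetical helper, not part of M5, and it deliberately skips any line that does not fit the two-column pattern.

```python
def parse_stats(text):
    """Parse 'name value # description' lines into a {name: number} dict."""
    stats = {}
    for line in text.splitlines():
        # Drop the trailing description, then expect exactly name and value.
        body = line.split("#", 1)[0].split()
        if len(body) != 2:
            continue  # banners, blank lines, or anything unexpected
        name, value = body
        try:
            stats[name] = float(value)
        except ValueError:
            continue  # non-numeric value; ignored in this sketch
    return stats

# A few lines copied from the dump shown in the slide.
sample = """\
---------- Begin Simulation Statistics ----------
host_seconds 0.07 # Real time elapsed on the host
sim_insts 5997 # Number of instructions simulated
sim_ticks 2005326 # Number of ticks simulated
"""
stats = parse_stats(sample)
print(stats["sim_insts"], stats["sim_ticks"])
```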

• We want an IPC trace every 1 million cycles
• So we patched:

Getting M5 to Output Trace

diff -Naur src/cpu/o3/cpu.cc /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc
--- src/cpu/o3/cpu.cc 2007-11-01 19:13:05.000000000 -0600
+++ /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc 2007-12-01 22:54:38.000000000 -0700
@@ -422,6 +422,21 @@
 
     ++numCycles;
 
+    ++totalCycles; // we could use numCycles...if only i could figure out how to stringificate
+    ++currentCycles;
+    if (currentCycles >= 1000000) {
+        double currentIpc = (double)currentCommittedInsts / (double)currentCycles;
+
+        cout << "IPC: "
+             << totalCycles << ","
+             << totalCommittedInstsInt << ","
+             << currentIpc << std::endl;
+
+        currentCommittedInsts = 0;
+        currentCycles = 0;
+    }
+
+
 //    activity = false;
 
     //Tick each of the stages
@@ -452,8 +467,10 @@
     if (removeInstsThisCycle) {
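The patched CPU prints one line per 1M-cycle window in the form "IPC: total cycles,total committed instructions,window IPC". A second-phase simulator could read those lines back roughly as follows. This is only a sketch: `parse_ipc_trace` is a hypothetical helper, and the sample lines are made up to match the format rather than taken from a real run.

```python
def parse_ipc_trace(lines):
    """Return (total_cycles, total_insts, window_ipc) tuples from 'IPC: a,b,c' lines."""
    prefix = "IPC: "
    samples = []
    for line in lines:
        if not line.startswith(prefix):
            continue  # ignore the simulated program's own output
        cycles, insts, ipc = line[len(prefix):].strip().split(",")
        samples.append((int(cycles), int(insts), float(ipc)))
    return samples

# Made-up lines in the patch's output format, for illustration only.
log = [
    "IPC: 1000000,800000,0.8",
    "some program output",
    "IPC: 2000000,2000000,1.2",
]
trace = parse_ipc_trace(log)
print(trace)
```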

Build the processor core

EV5 configuration on M5

EV6 configuration on M5

• Goal: duplicate the experiment in the (peer-reviewed) paper
• 2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs using the M5 simulator (Alpha EV5 + EV6 cores)
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

Simulation Approach

Spec 2000
• Paper:
◦ gzip
◦ gcc
◦ crafty (chess program)
◦ parser (natural language processor)
◦ bzip2
◦ wupwise (quantum chromodynamics)
◦ swim (shallow water modeling)
◦ mgrid (multi-grid solver in 3D potential field)
◦ galgel (fluid dynamics modeling)
◦ equake (earthquake modeling)
◦ lucas (prime number test)
• Us:
◦ gzip
◦ bzip2
◦ crafty

Choosing Workload

• Spec 2000 input is proprietary
• Compromise:
◦ gzip/bzip2 input: Shakespeare plays
◦ crafty input: sample chess game

Workload Input

Obtained from M5

IPC Traces

IPC Traces

IPC Traces

• Java
• Modular design
◦ Core simulator module
◦ Common thread-assignment policy interface
◦ Policy modules: Static, Round Robin (dynamic), IPC-Driven (dynamic)

CMP Simulator

• Command-line interface
◦ Example: CMPSim spec2000 10 2 1 roundrobin
• Input:
◦ Workload
◦ Number of threads (selected randomly from 3 Spec 2000 programs)
◦ # EV5 cores
◦ # EV6 cores
◦ Thread assignment policy

CMP Simulator

• Output:

Threads,Experiment,System IPC
1,20EV5 RR,0.905097784767538
2,20EV5 RR,1.46127036511788
3,20EV5 RR,2.06244067869053
4,20EV5 RR,2.78590633860981
5,20EV5 RR,3.35373843898152
6,20EV5 RR,4.07299579068557
7,20EV5 RR,4.17449020511364
8,20EV5 RR,4.915937425
9,20EV5 RR,5.47383727613636
10,20EV5 RR,6.00090476193182
11,20EV5 RR,6.64824888522727
12,20EV5 RR,7.26460146590909
13,20EV5 RR,7.90477401704545
14,20EV5 RR,8.46545665397727
15,20EV5 RR,9.23393584545455
16,20EV5 RR,9.80104248465909
17,20EV5 RR,10.3671315159091

CMP Simulator

IPC data are temporal sequences

CMP Simulator Issue

• Randomly assign threads to cores at startup
• Repeat the process whenever a core becomes idle
• Weaknesses:

◦ When one core becomes idle, it will persist in that state unless some unassigned thread exists.

◦ In the case of a heterogeneous system, this results in underutilization of "faster" cores.

◦ Execution of "slow" threads on "slower" cores may penalize overall system performance.

Static Policy

• Randomly assign threads to cores at startup
• Define swap_period
◦ Experimentally, swap_period = 20M cycles works well
• if (current_cycle % swap_period == 0)
◦ Migrate thread from EV6 -> wait queue
◦ Migrate thread from EV5 -> EV6
◦ Migrate thread from wait queue -> EV6
• When a core becomes idle, assign the longest-waiting thread

Round Robin Policy
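The swap step can be sketched as below. This is an illustration, not the project's Java code: the data layout (one list slot per core, `None` for an idle core, a `deque` as the wait queue) and the choice to rotate every EV6 thread at each swap are assumptions, since the slide does not say how many threads move per swap.

```python
from collections import deque

SWAP_PERIOD = 20_000_000  # cycles; the value reported to work well

def round_robin_swap(ev5, ev6, wait_queue):
    """Rotate threads: EV6 -> wait queue, EV5 -> EV6, wait queue -> idle cores."""
    # Threads leaving the fast cores join the back of the wait queue.
    for i, thread in enumerate(ev6):
        if thread is not None:
            wait_queue.append(thread)
            ev6[i] = None
    # Threads on slow cores are promoted into the vacated fast cores.
    for i in range(len(ev6)):
        for j, thread in enumerate(ev5):
            if thread is not None:
                ev6[i], ev5[j] = thread, None
                break
    # Any core that is still idle takes the longest-waiting thread.
    for cores in (ev6, ev5):
        for i, thread in enumerate(cores):
            if thread is None and wait_queue:
                cores[i] = wait_queue.popleft()

# Hypothetical system: 2 EV5 cores, 1 EV6 core, one thread waiting.
ev5, ev6, waiting = ["t0", "t1"], ["t2"], deque(["t3"])
current_cycle = 40_000_000
if current_cycle % SWAP_PERIOD == 0:
    round_robin_swap(ev5, ev6, waiting)
print(ev5, ev6, list(waiting))
```

In this run t2 leaves the EV6 core for the wait queue, t0 is promoted from EV5 to EV6, and the longest-waiting thread t3 takes the EV5 core that t0 vacated.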

• Costs
◦ Inter-core context switch: PC, registers, etc. must be transferred
◦ Cache warmup
• Simple model
◦ switch_loss: 50%
◦ switch_duration: 1M cycles

Modeling Thread Migration
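One plausible reading of the simple model is that a migrated thread runs at a reduced IPC for a fixed window after each switch. The sketch below encodes that reading; interpreting switch_loss as a fractional IPC reduction applied for switch_duration cycles is an assumption, and `effective_ipc` is a hypothetical helper name.

```python
SWITCH_LOSS = 0.5            # fraction of IPC lost while the thread settles in
SWITCH_DURATION = 1_000_000  # cycles the penalty lasts after a migration

def effective_ipc(trace_ipc, cycles_since_migration):
    """Scale the traced IPC down while the thread is inside the penalty window."""
    if cycles_since_migration < SWITCH_DURATION:
        return trace_ipc * (1.0 - SWITCH_LOSS)
    return trace_ipc

# A thread whose trace says 1.6 IPC, just after and well after a migration.
print(effective_ipc(1.6, 0), effective_ipc(1.6, 5_000_000))
```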

No effort is made to optimize thread-to-core mapping

Round Robin Weakness

• Optimize thread-to-core mapping
• Define IPC ratio = EV6 IPC / EV5 IPC
• Heuristic: threads with the highest IPC ratio are assigned to EV6
• System must compute the average IPC for each core type
◦ Requires forced migrations
• To handle IPC spikes, use a weighted average:
◦ Current IPC * 0.65 + Previous IPC * 0.35

IPC-Driven Policy
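The two formulas above translate directly into small helpers. This is a sketch with hypothetical function names; whether "Previous IPC" means the previous raw sample or the previous weighted value is not stated in the slide, so the caller decides what to pass in.

```python
def weighted_ipc(current, previous, weight=0.65):
    """Blend the newest IPC sample with the previous one to damp spikes."""
    return current * weight + previous * (1.0 - weight)

def ipc_ratio(ev6_ipc, ev5_ipc):
    """IPC ratio as defined above: EV6 IPC divided by EV5 IPC."""
    return ev6_ipc / ev5_ipc

# Made-up samples: a spike to 2.0 after a window at 1.0 is pulled back toward 1.0.
smoothed = weighted_ipc(2.0, 1.0)
print(round(smoothed, 2), ipc_ratio(1.5, 0.5))
```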

• Randomly assign threads to cores at startup
• Again, define swap_period
◦ Experimentally, swap_period = 20M cycles works well
• if (current_cycle % swap_period == 0)
◦ Sort threads by weighted IPC ratio
◦ Migrate accordingly
• When a core becomes idle, assign the thread from the wait queue with the highest IPC ratio

IPC-Driven Policy
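The sort-and-migrate step might look like the following. It is a simplified sketch, not the project's implementation: the function name, the thread ids, and the ratio values are all made up, and threads that do not fit on any core are simply returned as waiting.

```python
def assign_by_ipc_ratio(ratios, num_ev6, num_ev5):
    """Split thread ids into (ev6, ev5, waiting) by descending weighted IPC ratio."""
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    ev6 = ranked[:num_ev6]                   # highest ratios gain most from EV6
    ev5 = ranked[num_ev6:num_ev6 + num_ev5]  # next group runs on EV5
    waiting = ranked[num_ev6 + num_ev5:]     # the rest wait for an idle core
    return ev6, ev5, waiting

# Made-up weighted IPC ratios for four threads on a 1 x EV6, 2 x EV5 system.
ratios = {"t0": 1.8, "t1": 2.4, "t2": 1.2, "t3": 3.0}
ev6, ev5, waiting = assign_by_ipc_ratio(ratios, num_ev6=1, num_ev5=2)
print(ev6, ev5, waiting)
```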

Verifying Simulator

• Goal: verify the results of the paper
• Repeat their experiments

Experiments

• Policy comparison
◦ Static vs. Round Robin vs. IPC-Driven
◦ Heterogeneous system: 5 x EV5, 3 x EV6

Experiment #1

Expected Policy Results

Actual Policy Results

• Heterogeneous vs. homogeneous system
• Let 1 EV6 = 5 EV5
◦ Based on die areas
• Configurations
◦ 20 EV5
◦ 10 EV5, 2 EV6
◦ 5 EV5, 3 EV6
◦ 4 EV6

Experiment #2
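With 1 EV6 counted as 5 EV5, each of the four configurations should occupy the same die area. A quick arithmetic check, with the area expressed in EV5-sized units (the helper name is made up for illustration):

```python
EV5_PER_EV6 = 5  # die-area equivalence used in the experiment

# (EV5 cores, EV6 cores) for the four configurations listed above.
configs = [(20, 0), (10, 2), (5, 3), (0, 4)]

def area_in_ev5_units(ev5, ev6):
    """Total die area of a configuration, measured in EV5-sized cores."""
    return ev5 + ev6 * EV5_PER_EV6

areas = [area_in_ev5_units(ev5, ev6) for ev5, ev6 in configs]
print(areas)
```

Every configuration comes out at 20 EV5-equivalents, which is what makes the comparison a fixed-die-size one.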

Expected Heterogeneous Results

Actual Heterogeneous Results

• Simulator neglects L2 cache contention!
• Simplified thread migration model
• Only used 3 Spec 2000 programs
◦ Paper used 11
• Didn't have access to Spec 2000 inputs
• Our EV5 and EV6 configurations were not perfect
◦ Lack of M5 documentation made this difficult

Experiment Limitations

• Google Code
◦ Source control
◦ Wiki

Project Organization

• Confirmed that dynamic thread assignment outperforms static thread assignment
• Unable to confirm that heterogeneous outperforms homogeneous
◦ Limitations of minimal Spec 2000 workload
• Learned how to design a complex, peer-reviewed experiment

Conclusion

Questions?
