Page 1: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Heterogeneous Thread Assignment Simulation

Kris Lange

Nopparat suwaanarat

Pree Thiengburanathum

Page 2:

Agenda

• Introduction
• Motivation
• Review concepts
• M5 architecture
• Configuring M5 Simulator
• Simulation
• Results and Analysis
• Conclusion

Page 3:

Introduction

Basis: "Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures"

Paper makes 2 claims
◦ Heterogeneous CMPs outperform homogeneous CMPs (for a fixed total die size)
◦ Benefits of heterogeneous CMPs are enhanced using dynamic thread assignment policies

Page 4:

Motivation

• Gain deeper understanding of research paper
• Verify results of this paper
• Gain hands-on experience running a peer-reviewed experiment

Page 5:

Review: Concepts

• Heterogeneous CMP system
• Homogeneous CMP system
• Heterogeneous vs. homogeneous in multi-programmed.

Page 6:

Review: Concepts

Heterogeneous CMP system
• Many simple cores = higher thread parallelism
• Fewer, larger cores = lower thread parallelism

We want to maximize resource utilization and achieve a high degree of inter-thread parallelism.

How? By mapping running tasks to cores and using a control mechanism.

Page 7:

Review: Concepts

Which one has a better total execution time?

Control mechanism: thread assignment policies
• Static thread assignment
  ◦ random
  ◦ best
• Dynamic thread assignment
  ◦ round robin
  ◦ IPC driven

           P1    P2
Thread A   1.6   0.4
Thread B   1.5   1
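The question on this slide can be checked with a little arithmetic. The sketch below assumes the numbers in the table are each thread's IPC on processors P1 and P2 (the slide does not label them) and uses the sum of per-thread IPC as a stand-in for overall performance. The function and variable names are illustrative only.

```python
# Values from the slide's table, read here as per-thread IPC on each processor
# (an assumption: the slide does not label the units).
ipc = {
    "A": {"P1": 1.6, "P2": 0.4},
    "B": {"P1": 1.5, "P2": 1.0},
}

def system_ipc(assignment):
    """Sum the IPC of every thread on the processor it is assigned to."""
    return sum(ipc[thread][proc] for thread, proc in assignment.items())

option_1 = {"A": "P1", "B": "P2"}  # A on P1, B on P2
option_2 = {"A": "P2", "B": "P1"}  # A on P2, B on P1

total_1 = system_ipc(option_1)
total_2 = system_ipc(option_2)

# P1/P2 IPC ratio per thread: the thread that benefits most from P1 has the larger ratio.
ratio = {thread: values["P1"] / values["P2"] for thread, values in ipc.items()}
```

Under these assumptions the first mapping gives a combined IPC of about 2.6 against about 1.9 for the second, and Thread A has the larger P1/P2 ratio. Placing the thread with the higher ratio on P1 is the same rule the IPC-driven policy uses.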

Page 8:

Concepts: Assignment Policies

• Static thread assignment
• Usually assigns threads to the faster core.
• A well-studied problem; threads are assigned before execution.
• Solutions rely on heuristics
  ◦ A random static assignment: the workloads and IPC are not known.
  ◦ A pseudo-best static assignment: the workloads and IPC are known, and a heuristic is used to find the assignment.
• Disadvantages:
  ◦ Does not assign threads at run time.
  ◦ Does not optimize usage of the faster core(s).
  ◦ "Slow" threads on slower core(s) penalize overall system performance.

Page 9:

Concepts: Assignment Policies

Dynamic thread assignment
◦ Round Robin Assignment
  • Rotates the assignment of threads to processors in a round robin fashion.
  • Ensures that the available faster cores are equally shared among the running programs.

Page 10:

Concepts: Assignment Policies

IPC-driven Assignment
◦ Considers the characteristics of the executing threads.
◦ Looks at each thread's IPC and the ratio of its IPC between the two cores to decide the thread mapping.
◦ Threads with a higher ratio run on the faster core.
◦ Threads with a lower ratio run on the slower core.

Page 11:

Simulation Approach

Goal: duplicate experiment in paper (peer-reviewed)

2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs
  • Using M5 simulator
  • Alpha EV5 + EV6 cores
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

Page 12:

Which simulator is suitable?

• Rsim
• Simple MP
• SimOS
• Simics
• TFsim
• SimFlex
• GEMS

Page 13:

Introduction & Overview
• What is M5?
• A brief peek inside

Page 14:

What is M5?

A modular platform for simulating systems

Encompasses
• system-level architecture
• processor microarchitecture

Page 15:

Key properties of M5
• Pervasively object-oriented
• Multiple interchangeable CPU models
• Event-driven memory system
• Multiprocessor / multi-system capability

Page 16:

Overview of M5 Architecture

[Figure: block diagram of a system modeled in M5: CPU, L1 cache, L2 cache, buses, bus bridges, memory, and I/O device]

Page 17:

M5's Architecture
• CPU Models
• ISA
• Memory System
• Cache
• Buses

Page 18:

CPU model
• A simple CPU model
• 2 detailed CPU models

Page 19:

CPU model

[Figure: detailed CPU pipeline stages (Fetch, Decode, Rename, Issue/Execute/Writeback, Commit) with backward communication between stages]

Page 20:

Instruction Set Architecture (ISA)

Goal: allow a human-readable ISA description

Two parts
◦ A simple part: describes the decode
◦ A declaration part: describes the global information

Page 21:

Memory System

Goal
• Combine the timing and functional models into one model
• Simplify the memory system code
• Make changes easier

Page 22:

Memory Architecture

[Figure: caches, a bus, and memories connected through ports; each port is linked to a peer port]

Page 23:

Cache
• Coherency
• Prefetching

[Figure: prefetcher hierarchy: BasePrefetcher, with BHB Prefetcher, StridePrefetcher, and TaggedPrefetcher]

Page 24:

Buses
• Memory, I/O, CPUs
• Master: closer to memory
• Slave: closer to CPU

Page 25:

Configuring the M5 Simulator

Setup for M5 Simulator
◦ Windows Vista running VMware with Fedora Core.

Download the simulator from the website.
◦ www.m5sim.org (open source)

Required software:
◦ g++, python, scons, zlib, swig

Page 26:

Building, Compiling and Running M5

FS mode
◦ Full System mode. This mode simulates a complete system including a kernel, I/O devices, etc. This mode currently only works with the ALPHA architecture.

SE mode
◦ Syscall Emulation mode. This mode simulates statically compiled binaries by functionally emulating any syscall they make.

Example commands to build and run M5
◦ % scons build/ALPHA_SE/m5.debug
◦ % ./build/ALPHA_SE/m5.debug configs/example/se.py

Page 27:

Cross Compilation

What is cross compilation?
◦ Compiling a program for a target platform different from the platform the compiler is run on

M5 test programs must be compiled for Alpha+Linux

Why?
◦ M5 implements the Alpha ISA and Linux syscalls

Since we don't own Alpha hardware: cross-compile

Page 28:

Cross Compilation: Take 1

Build toolchain must be built for the specific target
◦ gcc, glibc, binutils, etc.

Dan Kegel's crosstool makes this easier: http://www.kegel.com/crosstool

Of the 3 Spec2000 programs we considered, we were only able to successfully cross compile gzip

Page 29:

Cross Compilation: Take 2

Scour the net until you run across this link:
◦ http://arch.cs.duke.edu/spec2000binaries.tar.bz2
◦ All Spec2000 binaries compiled for alpha-linux!

Page 30:

M5 Output

• M5 produces simulation results at end:

---------- Begin Simulation Statistics ----------
host_inst_rate 86899 # Simulator instruction rate (inst/s)
host_mem_usage 543680 # Number of bytes of host memory used
host_seconds 0.07 # Real time elapsed on the host
host_tick_rate 28827895 # Simulator tick rate (ticks/s)
sim_freq 1000000000000 # Frequency of simulated ticks
sim_insts 5997 # Number of instructions simulated
sim_seconds 0.000002 # Number of seconds simulated
sim_ticks 2005326 # Number of ticks simulated
system.cpu0.dtb.accesses 0 # DTB accesses
system.cpu0.dtb.acv 0 # DTB access violations
system.cpu0.dtb.hits 0 # DTB hits
system.cpu2.num_refs 1960 # Number of memory references
:
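Statistics in this form are easy to read back for post-processing. The following is a minimal sketch, assuming one statistic per line laid out as "name value # description". The function name `parse_stats` is illustrative and not part of M5.

```python
def parse_stats(text):
    """Parse 'name value # description' lines into a {name: float} dict."""
    stats = {}
    for line in text.splitlines():
        # Drop the trailing description, then expect exactly a name and a value.
        body = line.split("#", 1)[0].split()
        if len(body) != 2:
            continue  # skip separators, blank lines, and anything else
        name, value = body
        try:
            stats[name] = float(value)
        except ValueError:
            continue  # non-numeric value: not a statistic line
    return stats

# A few lines copied from the slide's output.
sample = """\
---------- Begin Simulation Statistics ----------
host_seconds 0.07 # Real time elapsed on the host
sim_insts 5997 # Number of instructions simulated
sim_ticks 2005326 # Number of ticks simulated
"""
stats = parse_stats(sample)
```

The dictionary can then be queried by name, for example `stats["sim_insts"]`.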

Page 31:

Getting M5 to Output Trace

We want an IPC trace every 1 million cycles

So we patched:

diff -Naur src/cpu/o3/cpu.cc /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc
--- src/cpu/o3/cpu.cc 2007-11-01 19:13:05.000000000 -0600
+++ /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc 2007-12-01 22:54:38.000000000 -0700
@@ -422,6 +422,21 @@
 
     ++numCycles;
 
+    ++totalCycles; // we could use numCycles...if only i could figure out how to stringificate
+    ++currentCycles;
+    if (currentCycles >= 1000000) {
+        double currentIpc = (double)currentCommittedInsts / (double)currentCycles;
+
+        cout << "IPC: "
+             << totalCycles << ","
+             << totalCommittedInstsInt << ","
+             << currentIpc << std::endl;
+
+        currentCommittedInsts = 0;
+        currentCycles = 0;
+    }
+
+
 //    activity = false;
 
     //Tick each of the stages
@@ -452,8 +467,10 @@
     if (removeInstsThisCycle) {
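The patch prints one line per million cycles in the form "IPC: totalCycles,totalCommittedInsts,currentIpc". A sketch of how such lines might be collected for the second simulation phase is shown here. It assumes the trace lines are interleaved with other program output on stdout, and the sample values are made up purely to show the format.

```python
def parse_ipc_trace(lines):
    """Return (total_cycles, total_committed_insts, interval_ipc) tuples."""
    samples = []
    for line in lines:
        if not line.startswith("IPC: "):
            continue  # ignore the simulated program's own output
        cycles, insts, ipc = line[len("IPC: "):].strip().split(",")
        samples.append((int(cycles), int(insts), float(ipc)))
    return samples

# Made-up sample values, only to show the format.
log = [
    "hello from the benchmark",
    "IPC: 1000000,850000,0.85",
    "IPC: 2000000,1450000,0.6",
]
trace = parse_ipc_trace(log)
```

Each tuple holds the running cycle count, the running committed-instruction count, and the IPC of the most recent one-million-cycle interval.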

Page 32:

Build the processor core

Page 33:

EV5 configuration on M5

Page 34:

EV6 configuration on M5

Page 35:

Simulation Approach

Goal: duplicate experiment in paper (peer-reviewed)

2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs
  • Using M5 simulator
  • Alpha EV5 + EV6 cores
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

Page 36:

Choosing Workload

Spec 2000

Paper:
◦ gzip
◦ gcc
◦ crafty (chess program)
◦ parser (natural language processor)
◦ bzip2
◦ wupwise (quantum chromodynamics)
◦ swim (shallow water modeling)
◦ mgrid (multi-grid solver in 3D potential field)
◦ galgel (fluid dynamics modeling)
◦ equake (earthquake modeling)
◦ lucas (prime number test)

Us:
◦ gzip
◦ bzip2
◦ crafty

Page 37:

Workload Input

Spec 2000 input is proprietary

Compromise:
◦ gzip/bzip2 input: Shakespeare plays
◦ crafty input: sample chess game

Page 38:

IPC Traces

Obtained from M5

Page 39:

IPC Traces

Page 40:

IPC Traces

Page 41:

CMP Simulator

• Java
• Modular design
• Core simulator module
• Common thread-assignment policy interface
• Policy modules
  ◦ Static
  ◦ Round Robin (dynamic)
  ◦ IPC-Driven (dynamic)

Page 42:

CMP Simulator

Command-line interface
◦ Example: CMPSim spec2000 10 2 1 roundrobin

Input:
◦ Workload
◦ Number of threads
  • Selected randomly from 3 Spec 2000 programs
◦ # EV5 cores
◦ # EV6 cores
◦ Thread assignment policy
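To make the argument order concrete, the example invocation can be mapped onto named fields as sketched below. This is an illustration in Python, not the Java simulator's actual code. `RunConfig` and `parse_args` are hypothetical names, and the field order follows the input list on the slide.

```python
from dataclasses import dataclass

@dataclass
class RunConfig:
    workload: str
    num_threads: int
    num_ev5: int
    num_ev6: int
    policy: str

def parse_args(argv):
    """Map the five positional arguments onto named fields, in the slide's order."""
    if len(argv) != 5:
        raise ValueError("expected: workload threads ev5_cores ev6_cores policy")
    workload, threads, ev5, ev6, policy = argv
    return RunConfig(workload, int(threads), int(ev5), int(ev6), policy)

# The example from the slide: 10 threads on 2 EV5 cores and 1 EV6 core, round robin.
config = parse_args("spec2000 10 2 1 roundrobin".split())
```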

Page 43:

CMP Simulator

Output:

Threads,Experiment,System IPC
1,20EV5 RR,0.905097784767538
2,20EV5 RR,1.46127036511788
3,20EV5 RR,2.06244067869053
4,20EV5 RR,2.78590633860981
5,20EV5 RR,3.35373843898152
6,20EV5 RR,4.07299579068557
7,20EV5 RR,4.17449020511364
8,20EV5 RR,4.915937425
9,20EV5 RR,5.47383727613636
10,20EV5 RR,6.00090476193182
11,20EV5 RR,6.64824888522727
12,20EV5 RR,7.26460146590909
13,20EV5 RR,7.90477401704545
14,20EV5 RR,8.46545665397727
15,20EV5 RR,9.23393584545455
16,20EV5 RR,9.80104248465909
17,20EV5 RR,10.3671315159091

Page 44:

CMP Simulator Issue

IPC data are temporal sequences

Page 45:

Static Policy

• Randomly assign threads to cores at startup
• Repeat process whenever a core becomes idle
• Weaknesses:
  ◦ When a core becomes idle, it stays idle unless some unassigned thread exists.
  ◦ In the case of a heterogeneous system, this results in underutilization of "faster" cores.
  ◦ Execution of "slow" threads on "slower" cores may penalize overall system performance.

Page 46:

Round Robin Policy

• Randomly assign threads to cores at startup
• Define swap_period
  ◦ Experimentally, swap_period = 20M cycles works well
• if (current_cycle % swap_period == 0)
  ◦ Migrate thread from EV6 -> wait queue
  ◦ Migrate thread from EV5 -> EV6
  ◦ Migrate thread from wait queue -> EV6
• When a core becomes idle, assign the longest-waiting thread
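The rotation can be sketched as follows. This is an illustrative Python model rather than the simulator's Java code. One point is an assumption that goes beyond the slide: after threads move from EV6 to the wait queue and from EV5 to EV6, any core left free (of either type, EV6 first) is refilled from the front of the wait queue. The thread names and core counts are made up.

```python
from collections import deque

def rotate(ev6, ev5, waiting, n_ev6, n_ev5):
    """One swap step; ev6/ev5 are lists of thread names, waiting is a deque."""
    # Step 1: threads on EV6 cores go to the back of the wait queue.
    waiting.extend(ev6)
    # Step 2: threads on EV5 cores move up to the freed EV6 cores.
    new_ev6, new_ev5 = ev5[:n_ev6], ev5[n_ev6:]
    # Step 3 (assumption): refill any free core from the front of the queue.
    while len(new_ev6) < n_ev6 and waiting:
        new_ev6.append(waiting.popleft())
    while len(new_ev5) < n_ev5 and waiting:
        new_ev5.append(waiting.popleft())
    return new_ev6, new_ev5

swap_period = 20_000_000  # cycles, from the slide
ev6, ev5, waiting = ["t0"], ["t1", "t2"], deque(["t3"])
history = []
for current_cycle in range(0, 4 * swap_period, swap_period):
    # Cycle 0 is the startup assignment, so no swap happens there.
    if current_cycle > 0 and current_cycle % swap_period == 0:
        ev6, ev5 = rotate(ev6, ev5, waiting, n_ev6=1, n_ev5=2)
    history.append((list(ev6), list(ev5), list(waiting)))
```

With four threads, one EV6 core, and two EV5 cores, each thread occupies the EV6 core for exactly one of the four periods, which is the equal sharing of the faster core that round robin aims for.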

Page 47:

Modeling Thread Migration

Costs
◦ Inter-core context switch
  • PC, registers, etc. must be transferred
◦ Cache warmup

Simple model
◦ switch_loss: 50%
◦ switch_duration: 1M cycles
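One way to read this model is that a migrated thread runs at (1 - switch_loss) of its trace IPC for switch_duration cycles after the move. That reading is an assumption, since the slide gives only the two parameters. The sketch below applies it and also estimates the overhead for a thread that migrates once per 20M-cycle swap period.

```python
SWITCH_LOSS = 0.50            # fraction of IPC lost while migrating (from the slide)
SWITCH_DURATION = 1_000_000   # cycles the penalty lasts (from the slide)
SWAP_PERIOD = 20_000_000      # cycles between swaps (from the policy slides)

def effective_ipc(trace_ipc, cycles_since_migration):
    """IPC credited to a thread, reduced during the migration window."""
    if cycles_since_migration < SWITCH_DURATION:
        return trace_ipc * (1.0 - SWITCH_LOSS)
    return trace_ipc

def instructions_lost(trace_ipc):
    """Instructions forfeited by one migration at a constant trace IPC."""
    return trace_ipc * SWITCH_LOSS * SWITCH_DURATION

# Fraction of throughput lost by a thread that migrates every swap period.
overhead = SWITCH_LOSS * SWITCH_DURATION / SWAP_PERIOD
```

Under this reading, a thread that moves every swap period loses about 2.5% of its throughput to migration.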

Page 48:

Round Robin Weakness

No effort is made to optimize thread-to-core mapping

Page 49:

IPC-Driven Policy

• Optimize thread-to-core mapping
• Define IPC ratio = EV6 IPC / EV5 IPC
• Heuristic: threads with the highest IPC ratio are assigned to EV6
• System must compute average IPC for each core type
  ◦ Requires forced migrations
• To handle IPC spikes, use a weighted average:
  ◦ Current IPC * 0.65 + Previous IPC * 0.35
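The weighted average and the IPC ratio can be written out directly. In this sketch the weights come from the slide. Treating "Previous IPC" as the previous smoothed value (rather than the previous raw sample) is an assumption, and the function names are illustrative.

```python
CURRENT_WEIGHT = 0.65   # from the slide
PREVIOUS_WEIGHT = 0.35  # from the slide

def weighted_ipc(current_ipc, previous_ipc):
    """Blend the newest IPC sample with the previous value to damp spikes."""
    return current_ipc * CURRENT_WEIGHT + previous_ipc * PREVIOUS_WEIGHT

def smooth(samples):
    """Apply the weighted average along a trace, feeding each result forward."""
    smoothed = [samples[0]]
    for sample in samples[1:]:
        smoothed.append(weighted_ipc(sample, smoothed[-1]))
    return smoothed

def ipc_ratio(ev6_ipc, ev5_ipc):
    """IPC ratio as defined on the slide: EV6 IPC / EV5 IPC."""
    return ev6_ipc / ev5_ipc

# A spike from 1.0 to 3.0 is damped to about 2.3 in the smoothed trace.
smoothed = smooth([1.0, 1.0, 3.0])
```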

Page 50:

IPC-Driven Policy

• Randomly assign threads to cores at startup
• Again, define swap_period
  ◦ Experimentally, swap_period = 20M cycles works well
• if (current_cycle % swap_period == 0)
  ◦ Sort threads by weighted IPC ratio
  ◦ Migrate accordingly
• When a core becomes idle, assign the thread from the wait queue with the highest IPC ratio
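The sort-and-migrate step might look like the sketch below. It assumes that the highest-ratio threads fill the EV6 cores, the next ones fill the EV5 cores, and the remainder wait. The slide only says to "migrate accordingly", so the treatment of leftover threads is an assumption, and the ratios are made-up numbers.

```python
def assign_by_ipc_ratio(ratios, n_ev6, n_ev5):
    """ratios maps thread name -> weighted IPC ratio (EV6 IPC / EV5 IPC)."""
    # Highest ratio first: these threads gain the most from the faster core.
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return {
        "EV6": ranked[:n_ev6],
        "EV5": ranked[n_ev6:n_ev6 + n_ev5],
        "waiting": ranked[n_ev6 + n_ev5:],
    }

# Made-up ratios for four threads on a system with 1 EV6 core and 2 EV5 cores.
ratios = {"gzip": 1.8, "bzip2": 2.4, "crafty": 1.2, "gzip#2": 1.7}
mapping = assign_by_ipc_ratio(ratios, n_ev6=1, n_ev5=2)
```

Here the thread with the largest ratio takes the EV6 core, the next two share the EV5 cores, and the lowest-ratio thread waits until a core becomes idle.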

Page 51:

Verifying Simulator

Page 52:

Experiments

• Goal: verify results of paper
• Repeat their experiments

Page 53:

Experiment #1

Policy Comparison
◦ Static vs Round Robin vs IPC-Driven
◦ Heterogeneous system: 5 x EV5, 3 x EV6

Page 54:

Expected Policy Results

Page 55:

Actual Policy Results

Page 56:

Experiment #2

Heterogeneous vs. Homogeneous System
• Let 1 EV6 = 5 EV5
  ◦ Based on die areas
• Configurations
  ◦ 20 EV5
  ◦ 10 EV5, 2 EV6
  ◦ 5 EV5, 3 EV6
  ◦ 4 EV6
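The four configurations are meant to occupy the same die area. A quick check, using one EV5 core as the area unit and the slide's equivalence of 1 EV6 = 5 EV5, is sketched here with illustrative variable names.

```python
EV5_AREA = 1  # area unit: one EV5 core
EV6_AREA = 5  # from the slide: 1 EV6 = 5 EV5

# (EV5 cores, EV6 cores) for each configuration on the slide.
configurations = [(20, 0), (10, 2), (5, 3), (0, 4)]
areas = [ev5 * EV5_AREA + ev6 * EV6_AREA for ev5, ev6 in configurations]
```

Every configuration comes out at 20 EV5-equivalents, so the comparison holds total die size fixed, as the paper's first claim requires.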

Page 57:

Expected Heterogeneous Results

Page 58:

Actual Heterogeneous Results

Page 59:

Experiment Limitations

• Simulator neglects L2 cache contention!
• Simplified thread migration model
• Only used 3 Spec 2000 programs
  ◦ Paper used 11
• Didn't have access to Spec 2000 inputs
• Our EV5 and EV6 configurations were not perfect
  ◦ Lack of M5 documentation made this difficult

Page 60:

Project Organization

Google Code
◦ Source Control
◦ Wiki

Page 61:

Conclusion

• Confirmed dynamic thread assignment outperforms static thread assignment
• Unable to confirm heterogeneous outperforms homogeneous
  ◦ Limitations of minimal Spec 2000 workload
• Learned how to design a complex, peer-reviewed experiment

Page 62:

Questions?