Heterogeneous Thread Assignment Simulation
Kris Lange, Nopparat suwaanarat, Pree Thiengburanathum

Dec 28, 2015

Page 1: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Heterogeneous Thread Assignment Simulation

Kris Lange
Nopparat suwaanarat
Pree Thiengburanathum

Page 2: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Agenda

• Introduction
• Motivation
• Review concepts
• M5 architecture
• Configuring M5 Simulator
• Simulation Results and Analysis
• Conclusion

Page 3: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Introduction

• Basis: "Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures"
• The paper makes 2 claims:
◦ Heterogeneous CMPs outperform homogeneous CMPs (for a fixed total die size)
◦ The benefits of heterogeneous CMPs are enhanced by dynamic thread assignment policies

Page 4: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Motivation

• Gain a deeper understanding of the research paper
• Verify the results of this paper
• Gain hands-on experience running a peer-reviewed experiment

Page 5: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Review: Concepts

• Heterogeneous CMP system
• Homogeneous CMP system
• Heterogeneous vs. homogeneous with multi-programmed workloads

Page 6: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Review: Concepts

• Heterogeneous CMP system
◦ Many simple cores = higher thread parallelism
◦ Fewer, larger cores = lower thread parallelism
• We want to maximize resource utilization and achieve a high degree of inter-thread parallelism.
• How? By mapping running tasks to cores and using a control mechanism.

Page 7: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Review: Concept

• Which one has a better total execution time?
• Control mechanism: thread assignment policies
◦ Static thread assignment: random, best
◦ Dynamic thread assignment: round robin, IPC driven

          P1    P2
Thread A  1.6   0.4
Thread B  1.5   1
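The question posed by the table can be checked by brute force. The sketch below assumes the table entries are per-thread IPC values on processors P1 and P2 (the slide does not label them) and uses summed IPC as a stand-in for "better total execution time"; the names `ipc` and `total_ipc` are illustrative only.

```python
from itertools import permutations

# Values copied from the slide's table. Treating them as IPC is an assumption.
ipc = {
    "A": {"P1": 1.6, "P2": 0.4},
    "B": {"P1": 1.5, "P2": 1.0},
}

def total_ipc(assignment):
    """Sum the IPC each thread achieves on the core it is assigned to."""
    return sum(ipc[thread][core] for thread, core in assignment.items())

threads = sorted(ipc)
cores = ["P1", "P2"]

# Enumerate every one-thread-per-core mapping and keep the best one.
candidates = [dict(zip(threads, perm)) for perm in permutations(cores)]
best = max(candidates, key=total_ipc)

for candidate in candidates:
    print(candidate, round(total_ipc(candidate), 2))
```

Under these assumptions, placing Thread A on P1 and Thread B on P2 gives the higher combined IPC, because Thread A loses far more than Thread B when moved to P2.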

Page 8: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Concepts: Assignment Policies

• Static thread assignment
◦ Usually assigns threads to the faster core.
◦ Well-studied problem; threads are assigned before execution.
◦ Solutions rely on heuristics.
• Random static assignment: workloads and IPC are not known.
• Pseudo-best static assignment: workloads and IPC are known; a heuristic is used to find the assignment.
• Disadvantages:
◦ Threads are not reassigned at run time.
◦ Usage of the faster core(s) is not optimized.
◦ "Slow" threads on slower core(s) penalize overall system performance.

Page 9: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Concepts: Assignment Policies

• Dynamic thread assignment
◦ Round robin assignment: rotates the assignment of threads to processors in a round-robin fashion. This ensures that the available faster cores are shared equally among the running programs.

Page 10: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Concepts: Assignment Policies

• IPC-driven assignment
◦ Considers the characteristics of the executing threads.
◦ Looks at each thread's IPC and its ratio between the two cores to decide the thread mapping.
◦ Threads with a higher ratio run on the faster core.
◦ Threads with a lower ratio run on the slower core.

Page 11: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Simulation Approach

• Goal: duplicate the experiment in the (peer-reviewed) paper
• 2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs, using the M5 simulator with Alpha EV5 + EV6 cores
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

Page 12: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Which simulator is suitable?

• Rsim
• Simple MP
• SimOS
• Simics
• TFsim
• SimFlex
• GEMS

Page 13: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

• Introduction & Overview
• What is M5?
• A brief peek inside

Page 14: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

What is M5?

• A modular platform for simulating systems
• Encompasses:
◦ system-level architecture
◦ processor microarchitecture

Page 15: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Key properties of M5

• Pervasively object-oriented
• Multiple interchangeable CPU models
• Event-driven memory system
• Multiprocessor / multi-system capability

Page 16: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Overview of M5 Architecture

[Figure: block diagram of simulated M5 systems, showing CPU, L1 cache, L2 cache, buses, bus bridges, memory, and an I/O device]

Page 17: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

M5's Architecture

• CPU Models
• ISA
• Memory System
• Cache
• Buses

Page 18: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

CPU model

• A simple CPU model
• 2 detailed CPU models

Page 19: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

CPU model

[Figure: pipeline stages Fetch, Decode, Rename, Issue/Execution/Writeback, Commit, with backward communication between stages]

Page 20: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Instruction Set Architecture (ISA)

• Goal: allow a human-readable ISA description
• Two parts
◦ A simple part: describes the decode
◦ A declaration part: describes the global information

Page 21: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Memory System

• Goals
◦ Combine the timing and functional models into one model
◦ Simplify the memory system code
◦ Make changes easier

Page 22: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Memory Architecture

[Figure: caches, a bus, and memories connected through ports; each port is linked to a peer port]

Page 23: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Cache

• Coherency
• Prefetching
◦ BASEPrefetcher
◦ Prefetcher
◦ BHB Prefetcher, StridePrefetcher, TaggedPrefetcher

Page 24: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

BUSES

• Memory, I/O, CPUs
• Master: closer to memory
• Slave: closer to CPU

Page 25: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Configuring the M5 Simulator

• Setup for the M5 simulator
◦ Windows Vista running VMware with Fedora Core.
• Download the simulator from the website.
◦ www.m5sim.org (open source)
• Required software:
◦ g++, python, scons, zlib, swig

Page 26: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Building, Compiling and Running M5

• FS mode
◦ Full System mode. This mode simulates a complete system including a kernel, I/O devices, etc. This mode currently only works with the ALPHA architecture.
• SE mode
◦ Syscall Emulation mode. This mode simulates statically compiled binaries by functionally emulating any syscall they make.
• Example commands to build and run M5
◦ % scons build/ALPHA_SE/m5.debug
◦ % ./build/ALPHA_SE/m5.debug configs/example/se.py

Page 27: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Cross Compilation

• What is cross compilation?
◦ Compiling a program for a target platform different from the platform the compiler is run on
• M5 test programs must be compiled for Alpha+Linux
• Why?
◦ M5 implements the Alpha ISA and Linux syscalls
• Since we don't own Alpha hardware: cross-compile

Page 28: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Cross Compilation: Take 1

• The build toolchain must be built for the specific target
◦ gcc, glibc, binutils, etc.
• Dan Kegel's crosstool makes this easier: http://www.kegel.com/crosstool
• Of the 3 Spec2000 programs we considered, we were only able to successfully cross compile gzip

Page 29: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Cross Compilation: Take 2

• Scour the net until you run across this link:
◦ http://arch.cs.duke.edu/spec2000binaries.tar.bz2
◦ All Spec2000 binaries compiled for alpha-linux!

Page 30: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

M5 Output

• M5 produces simulation results at end:

---------- Begin Simulation Statistics ----------
host_inst_rate 86899 # Simulator instruction rate (inst/s)
host_mem_usage 543680 # Number of bytes of host memory used
host_seconds 0.07 # Real time elapsed on the host
host_tick_rate 28827895 # Simulator tick rate (ticks/s)
sim_freq 1000000000000 # Frequency of simulated ticks
sim_insts 5997 # Number of instructions simulated
sim_seconds 0.000002 # Number of seconds simulated
sim_ticks 2005326 # Number of ticks simulated
system.cpu0.dtb.accesses 0 # DTB accesses
system.cpu0.dtb.acv 0 # DTB access violations
system.cpu0.dtb.hits 0 # DTB hits
system.cpu2.num_refs 1960 # Number of memory references
:
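Statistics in this `name value # description` layout are easy to post-process. The following is a minimal sketch, not part of M5 itself: `parse_stats` is a hypothetical helper, and the sample lines are copied from the output shown on this slide.

```python
def parse_stats(text):
    """Map each statistic name to its numeric value, ignoring the '#' comments."""
    stats = {}
    for line in text.splitlines():
        # Keep only the part before the comment and split it into fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) != 2:
            continue  # banner lines, blank lines, continuation markers
        name, raw_value = fields
        try:
            stats[name] = float(raw_value)
        except ValueError:
            continue  # non-numeric value, skip it
    return stats

sample = """\
---------- Begin Simulation Statistics ----------
host_seconds 0.07 # Real time elapsed on the host
sim_insts 5997 # Number of instructions simulated
sim_ticks 2005326 # Number of ticks simulated
"""

stats = parse_stats(sample)
print(stats["sim_insts"], stats["sim_ticks"])
```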

Page 31: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Getting M5 to Output Trace

• We want an IPC trace every 1 million cycles
• So we patched:

diff -Naur src/cpu/o3/cpu.cc /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc
--- src/cpu/o3/cpu.cc 2007-11-01 19:13:05.000000000 -0600
+++ /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc 2007-12-01 22:54:38.000000000 -0700
@@ -422,6 +422,21 @@
 
     ++numCycles;
 
+    ++totalCycles; // we could use numCycles...if only i could figure out how to stringificate
+    ++currentCycles;
+    if (currentCycles >= 1000000) {
+        double currentIpc = (double)currentCommittedInsts / (double)currentCycles;
+
+        cout << "IPC: "
+             << totalCycles << ","
+             << totalCommittedInstsInt << ","
+             << currentIpc << std::endl;
+
+        currentCommittedInsts = 0;
+        currentCycles = 0;
+    }
+
+
 //    activity = false;
 
     //Tick each of the stages
@@ -452,8 +467,10 @@
     if (removeInstsThisCycle) {
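The patch prints one `IPC: <total cycles>,<total committed instructions>,<interval IPC>` line per million cycles, mixed in with the simulator's other output. A small sketch of how those lines could be pulled out for the second simulation phase follows; `parse_ipc_trace` is a hypothetical helper and the sample lines are invented for illustration, not taken from a real run.

```python
def parse_ipc_trace(lines):
    """Return (total_cycles, total_insts, interval_ipc) tuples from patched M5 output."""
    prefix = "IPC: "
    samples = []
    for line in lines:
        if not line.startswith(prefix):
            continue  # unrelated simulator or benchmark output
        cycles, insts, ipc = line[len(prefix):].strip().split(",")
        samples.append((int(cycles), int(insts), float(ipc)))
    return samples

# Hypothetical output lines, shaped like what the patched code prints.
output = [
    "warn: some unrelated simulator message",
    "IPC: 1000000,812345,0.812345",
    "IPC: 2000000,1503210,0.690865",
]

trace = parse_ipc_trace(output)
print(trace)
```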

Page 32: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Build the processor core

Page 33: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

EV5 configuration on M5

Page 34: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

EV6 configuration on M5

Page 35: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Goal: duplicate experiment in paper (peer-reviewed)

• 2-phase simulation
◦ 1) Obtain IPC trace values for Spec2000 programs, using the M5 simulator with Alpha EV5 + EV6 cores
◦ 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

Simulation Approach

Page 36: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Choosing Workload

• Spec 2000
• Paper:
◦ gzip
◦ gcc
◦ crafty (chess program)
◦ parser (natural language processor)
◦ bzip2
◦ wupwise (quantum chromodynamics)
◦ swim (shallow water modeling)
◦ mgrid (multi-grid solver in 3D potential field)
◦ galgel (fluid dynamics modeling)
◦ equake (earthquake modeling)
◦ lucas (prime number test)
• Us:
◦ gzip
◦ bzip2
◦ crafty

Page 37: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Workload Input

• Spec 2000 input is proprietary
• Compromise:
◦ gzip/bzip2 input: Shakespeare plays
◦ crafty input: sample chess game

Page 38: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Obtained from M5

IPC Traces

Page 39: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

IPC Traces

Page 40: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

IPC Traces

Page 41: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

CMP Simulator

• Java
• Modular design
◦ Core simulator module
◦ Common thread-assignment policy interface
◦ Policy modules: Static, Round Robin (dynamic), IPC-Driven (dynamic)

Page 42: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

CMP Simulator

• Command-line interface
◦ Example: CMPSim spec2000 10 2 1 roundrobin
• Input:
◦ Workload
◦ Number of threads (selected randomly from 3 Spec 2000 programs)
◦ # EV5 cores
◦ # EV6 cores
◦ Thread assignment policy

Page 43: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

CMP Simulator

Output:

Threads,Experiment,System IPC
1,20EV5 RR,0.905097784767538
2,20EV5 RR,1.46127036511788
3,20EV5 RR,2.06244067869053
4,20EV5 RR,2.78590633860981
5,20EV5 RR,3.35373843898152
6,20EV5 RR,4.07299579068557
7,20EV5 RR,4.17449020511364
8,20EV5 RR,4.915937425
9,20EV5 RR,5.47383727613636
10,20EV5 RR,6.00090476193182
11,20EV5 RR,6.64824888522727
12,20EV5 RR,7.26460146590909
13,20EV5 RR,7.90477401704545
14,20EV5 RR,8.46545665397727
15,20EV5 RR,9.23393584545455
16,20EV5 RR,9.80104248465909
17,20EV5 RR,10.3671315159091

Page 44: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

IPC data are temporal sequences

CMP Simulator Issue

Page 45: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Static Policy

• Randomly assign threads to cores at startup
• Repeat the process whenever a core becomes idle
• Weaknesses:
◦ When one core becomes idle, it will persist in that state unless some unassigned thread exists.
◦ In the case of a heterogeneous system, this results in underutilization of "faster" cores.
◦ Execution of "slow" threads on "slower" cores may penalize overall system performance.

Page 46: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Round Robin Policy

• Randomly assign threads to cores at startup
• Define swap_period
◦ Experimentally, swap_period = 20M cycles works well
• if (current_cycle % swap_period == 0)
◦ Migrate thread from EV6 -> wait queue
◦ Migrate thread from EV5 -> EV6
◦ Migrate thread from wait queue -> EV6
• When a core becomes idle, assign the longest-waiting thread
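One way to read the rotation step is sketched below. This is not the authors' Java code: `round_robin_swap`, the list-per-core-type layout, and the rule that idle cores of either type are refilled from the wait queue (faster cores first) are all assumptions made to obtain a runnable example.

```python
from collections import deque

SWAP_PERIOD = 20_000_000  # cycles; the value the slide reports as working well

def round_robin_swap(ev6, ev5, wait_queue):
    """Rotate one step: EV6 -> wait queue, EV5 -> EV6, wait queue -> idle cores.

    ev6 and ev5 hold one entry per core (None marks an idle core);
    wait_queue is a deque ordered from longest-waiting to most recent.
    """
    # Threads leaving the EV6 cores will join the back of the wait queue.
    evicted = [t for t in ev6 if t is not None]
    # Threads currently on EV5 cores are promoted to the EV6 cores.
    promoted = deque(t for t in ev5 if t is not None)
    new_ev6 = [promoted.popleft() if promoted else None for _ in ev6]
    # EV5 threads that did not fit on an EV6 core stay on EV5 cores.
    new_ev5 = [promoted.popleft() if promoted else None for _ in ev5]
    wait_queue.extend(evicted)
    # Idle cores take the longest-waiting threads, faster cores first.
    for cores in (new_ev6, new_ev5):
        for i, thread in enumerate(cores):
            if thread is None and wait_queue:
                cores[i] = wait_queue.popleft()
    return new_ev6, new_ev5

# Hypothetical system: 1 EV6 core, 2 EV5 cores, 4 threads.
waiting = deque(["t3"])
ev6, ev5 = round_robin_swap(["t0"], ["t1", "t2"], waiting)
print(ev6, ev5, list(waiting))
```

A driver loop would call this whenever `current_cycle % SWAP_PERIOD == 0`, so every thread eventually gets a turn on a faster core.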

Page 47: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Modeling Thread Migration

• Costs
◦ Inter-core context switch: PC, registers, etc. must be transferred
◦ Cache warmup
• Simple model
◦ switch_loss: 50%
◦ switch_duration: 1M cycles
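The slide gives only the two parameters, so the sketch below assumes one plausible interpretation: for `switch_duration` cycles after a migration, the thread retires instructions at `(1 - switch_loss)` of the IPC recorded in its trace. The function names are illustrative.

```python
SWITCH_LOSS = 0.5            # fraction of IPC lost while the penalty applies (slide value)
SWITCH_DURATION = 1_000_000  # cycles the penalty lasts (slide value)

def effective_ipc(trace_ipc, cycles_since_migration):
    """IPC after applying the migration penalty.

    cycles_since_migration is None for a thread that has never migrated.
    """
    if cycles_since_migration is not None and cycles_since_migration < SWITCH_DURATION:
        return trace_ipc * (1.0 - SWITCH_LOSS)
    return trace_ipc

def migration_cost_in_instructions(trace_ipc):
    """Instructions forgone by one migration under this model."""
    return trace_ipc * SWITCH_LOSS * SWITCH_DURATION

print(effective_ipc(1.6, 0))                # during the penalty window
print(effective_ipc(1.6, SWITCH_DURATION))  # penalty window is over
```

Under this reading, a migration every 20M cycles costs the equivalent of 0.5M cycles of work, or 2.5% of each swap period.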

Page 48: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

No effort is made to optimize thread-to-core mapping

Round Robin Weakness

Page 49: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

IPC-Driven Policy

• Optimize thread-to-core mapping
• Define IPC ratio = EV6 IPC / EV5 IPC
• Heuristic: threads with the highest IPC ratio are assigned to EV6
• System must compute the average IPC for each core type
◦ Requires forced migrations
• To handle IPC spikes, use a weighted average:
◦ Current IPC * 0.65 + Previous IPC * 0.35
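The two formulas on this slide translate directly into code. In this sketch the function names are illustrative, and whether "Previous IPC" means the last raw sample or the last smoothed value is not stated on the slide, so the caller passes whichever it tracks.

```python
CURRENT_WEIGHT = 0.65   # weight of the newest IPC sample (slide value)
PREVIOUS_WEIGHT = 0.35  # weight of the previous IPC value (slide value)

def weighted_ipc(current_ipc, previous_ipc):
    """Damp IPC spikes by blending the current sample with the previous value."""
    return current_ipc * CURRENT_WEIGHT + previous_ipc * PREVIOUS_WEIGHT

def ipc_ratio(ev6_ipc, ev5_ipc):
    """How much a thread gains from running on EV6 instead of EV5."""
    return ev6_ipc / ev5_ipc

# Hypothetical spike: a thread's EV6 IPC jumps from 1.0 to 2.0 in one interval.
smoothed = weighted_ipc(2.0, 1.0)
print(smoothed, ipc_ratio(smoothed, 0.5))
```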

Page 50: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

IPC-Driven Policy

• Randomly assign threads to cores at startup
• Again, define swap_period
◦ Experimentally, swap_period = 20M cycles works well
• if (current_cycle % swap_period == 0)
◦ Sort threads by weighted IPC ratio
◦ Migrate accordingly
• When a core becomes idle, assign the thread from the wait queue with the highest IPC ratio
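The sort-and-migrate step can be sketched as follows. This is an illustration rather than the authors' implementation: `ipc_driven_assignment` is a hypothetical name, the ratios are invented, and sending the lowest-ratio threads to the wait queue when there are more threads than cores is an assumption.

```python
def ipc_driven_assignment(ratios, num_ev6, num_ev5):
    """Split threads across core types by descending weighted IPC ratio.

    ratios maps thread name -> weighted (EV6 IPC / EV5 IPC).
    Returns (threads on EV6, threads on EV5, threads left waiting).
    """
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    ev6 = ranked[:num_ev6]                       # biggest gain from the fast cores
    ev5 = ranked[num_ev6:num_ev6 + num_ev5]      # next best fill the simple cores
    waiting = ranked[num_ev6 + num_ev5:]         # assumption: the rest wait
    return ev6, ev5, waiting

# Hypothetical weighted IPC ratios for four threads.
ratios = {"t0": 4.0, "t1": 1.5, "t2": 2.2, "t3": 1.1}
ev6, ev5, waiting = ipc_driven_assignment(ratios, num_ev6=1, num_ev5=2)
print(ev6, ev5, waiting)
```

Any thread whose new core type differs from its current one would then be migrated, which is where the forced migrations come from.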

Page 51: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Verifying Simulator

Page 52: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Experiments

• Goal: verify the results of the paper
• Repeat their experiments

Page 53: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Experiment #1

• Policy comparison
◦ Static vs. Round Robin vs. IPC-Driven
◦ Heterogeneous system: 5 x EV5, 3 x EV6

Page 54: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Expected Policy Results

Page 55: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Actual Policy Results

Page 56: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Experiment #2

• Heterogeneous vs. homogeneous system
• Let 1 EV6 = 5 EV5
◦ Based on die areas
• Configurations
◦ 20 EV5
◦ 10 EV5, 2 EV6
◦ 5 EV5, 3 EV6
◦ 4 EV6
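The four configurations are chosen so that total die area is fixed. A quick check, using the slide's equivalence of 1 EV6 = 5 EV5 and measuring area in EV5-sized units (the variable names are illustrative):

```python
EV6_AREA_IN_EV5 = 5  # from the slide: 1 EV6 occupies the area of 5 EV5 cores

# (number of EV5 cores, number of EV6 cores) for each configuration on the slide
configurations = [(20, 0), (10, 2), (5, 3), (0, 4)]

def die_area(num_ev5, num_ev6):
    """Total die area expressed in EV5-equivalent units."""
    return num_ev5 + num_ev6 * EV6_AREA_IN_EV5

areas = [die_area(ev5, ev6) for ev5, ev6 in configurations]
print(areas)
```

Every configuration comes out to 20 EV5-equivalents, which is what makes the heterogeneous vs. homogeneous comparison a fixed-area one.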

Page 57: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Expected Heterogeneous Results

Page 58: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Actual Heterogeneous Results

Page 59: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Experiment Limitations

• Simulator neglects L2 cache contention!
• Simplified thread migration model
• Only used 3 Spec 2000 programs
◦ Paper used 11
• Didn't have access to Spec 2000 inputs
• Our EV5 and EV6 configurations were not perfect
◦ Lack of M5 documentation made this difficult

Page 60: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Project Organization

• Google Code
◦ Source control
◦ Wiki

Page 61: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Conclusion

• Confirmed that dynamic thread assignment outperforms static thread assignment
• Unable to confirm that heterogeneous outperforms homogeneous
◦ Limitations of minimal Spec 2000 workload
• Learned how to design a complex, peer-reviewed experiment

Page 62: Kris Lange Nopparat suwaanarat Pree Thiengburanathum.

Questions?