Page 1: m2s-4.1


Rafael Ubal, David Kaeli

Simulation of OpenCL and APUs on Multi2Sim 4.1

Page 2: m2s-4.1

Outline

• Introduction
  ─ Simulation methodology

• Part 1 – Simulation of an x86 CPU
  ─ Disassembler
  ─ Emulation
  ─ Timing simulation
  ─ Memory hierarchy
  ─ Visualization tool

• Part 2 – Simulation of a Southern Islands GPU
  ─ OpenCL from host to device
  ─ Disassembler
  ─ Emulation
  ─ Timing simulation
  ─ Visualization tool
  ─ Case study: ND-Range virtualization

• Part 3 – Concluding Remarks
  ─ Additional Projects
  ─ The Multi2Sim Community

Page 3: m2s-4.1

Introduction – Getting Started

• User accounts for demos

Follow our demos!

─ Machine: fusion1.ece.neu.edu

─ User: isca1, isca2, isca3, ...

─ Password: isca2013

Page 4: m2s-4.1

Introduction – Getting Started

• Connect to our server

    $ ssh isca<N>@fusion1.ece.neu.edu -X

  (Notice the X forwarding for later demos using graphics.)

• Download and compile Multi2Sim

    $ wget http://www.multi2sim.org/files/multi2sim-4.1.tar.gz
    $ tar -xzf multi2sim-4.1.tar.gz
    $ cd multi2sim-4.1
    $ ./configure && make

• Demo descriptions

    $ ls
    demo1  demo2  demo3  demo4  demo5  demo6  demo7  README

  All files needed for each demo are present in its corresponding directory. README files describe the commands to run and how to interpret the outputs.

Page 5: m2s-4.1

Introduction – First Execution

Demo 1

• Source code

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int i;
        printf("Number of arguments: %d\n", argc);
        for (i = 0; i < argc; i++)
            printf("\targv[%d] = %s\n", i, argv[i]);
        return 0;
    }

• Native execution

    $ test-args hello there
    Number of arguments: 3
        argv[0] = test-args
        argv[1] = hello
        argv[2] = there

• Execution on Multi2Sim

    $ m2s test-args hello there
    < Simulator message in stderr >
    Number of arguments: 3
        argv[0] = test-args
        argv[1] = hello
        argv[2] = there
    < Simulator statistics >

Page 6: m2s-4.1

Introduction – Simulator Input/Output Files

• Multi2Sim uses the INI file format for
  ─ Configuration files.
  ─ Output statistics.
  ─ The statistics summary printed to standard error.

• Example of the INI file format

    ; This is a comment.

    [ Section 0 ]
    Color = Red
    Height = 40

    [ OtherSection ]
    Variable = Value

Page 7: m2s-4.1

Simulation Methodology – Application-Only vs. Full-System

[Figure: a full-system simulator core virtualizes the complete processor ISA and the I/O hardware, running a full OS that in turn hosts the guest programs; an application-only simulator core virtualizes the user-space subset of the ISA and the system call interface, running the guest programs directly.]

• Full-system simulation

An entire OS runs on top of the simulator. The simulator models the entire ISA, and virtualizes native hardware devices, similar to a virtual machine. Very accurate simulations, but extremely slow.

• Application-only simulation

Only an application runs on top of the simulator. The simulator implements a subset of the ISA, and needs to virtualize the system call interface (ABI). Multi2Sim falls in this category.

Page 8: m2s-4.1

Simulation Methodology – Four-Stage Simulation Process

• Modular implementation

─ Four clearly different software modules per architecture (x86, MIPS, ...)

─ Each module has a standard interface for stand-alone execution, or interaction with other modules.

[Figure: the four modules in sequence. The disassembler turns instruction bytes from the executable ELF file into instruction fields (instruction dump). The emulator (or functional simulator) takes the executable file and program arguments, runs one instruction at a time, and produces the program output plus instruction information. The timing simulator (or detailed/architectural simulator) additionally takes a processor configuration and produces performance statistics and a pipeline trace. The visual tool consumes the trace for user interaction: cycle navigation and timing diagrams.]
Page 9: m2s-4.1

Simulation Methodology – Current Architecture Support

• In our latest Multi2Sim SVN repository

─ 4 GPU + 3 CPU architectures supported or in progress.

─ This tutorial will focus on x86 and AMD Southern Islands.

    Architecture            Disasm.       Emulation     Timing simulation   Graphic pipelines
    ARM                     X             In progress   –                   –
    MIPS                    X             –             –                   –
    x86                     X             X             X                   X
    AMD Evergreen           X             X             X                   X
    AMD Southern Islands    X             X             X                   X
    NVIDIA Fermi            X             In progress   –                   –
    NVIDIA Kepler           In progress   –             –                   –

Page 10: m2s-4.1

Part 1

Simulation of an x86 CPU

Page 11: m2s-4.1

The x86 Disassembler

Page 12: m2s-4.1

The x86 Disassembler – Methodology

─ Implementation of an efficient instruction decoder based on lookup tables.

─ When used as a stand-alone tool, the output is provided in exactly the same format as the GNU x86 disassembler, for automatic verification.

• Verification of common output

  ─ GNU x86 disassembler:       $ objdump -S -M intel test-args
  ─ Multi2Sim x86 disassembler: $ m2s --x86-disasm test-args

    08048900 <_start>:
     8048900:  31 ed                  xor    ebp,ebp
     8048902:  5e                     pop    esi
     8048903:  89 e1                  mov    ecx,esp
     8048905:  83 e4 f0               and    esp,0xfffffff0
     8048908:  50                     push   eax
     8048909:  54                     push   esp
     804890a:  52                     push   edx
     804890b:  68 70 91 04 08         push   0x8049170
     ...

Page 13: m2s-4.1

The x86 Emulator

Page 14: m2s-4.1

The x86 Emulator – Program Loading

1) Parse ELF executable

─ Read ELF sections and symbols.

─ Initialize code and data.

2) Initialize stack

─ Program headers.

─ Arguments.

─ Environment variables.

3) Initialize registers

─ Program entry → eip

─ Stack pointer → esp

[Figure: initial virtual memory image and initial values for the x86 registers. The text segment starts at 0x08000000, followed by initialized data and the heap (growing up to 0x08xxxxxx); the mmap region starts at 0x40000000 (not initialized); the stack starts at 0xc0000000 and holds the program arguments and environment variables. Register esp is set to the stack pointer and eip to the program entry (instruction pointer).]
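To make step 2 concrete, here is a minimal sketch, in C, of building the initial stack image for a context: the argument and environment strings at the top, then the NULL-terminated envp[] and argv[] pointer arrays, then argc, with the resulting stack pointer becoming the initial esp. The names (setup_stack, write32) are hypothetical, real program loading also pushes the ELF program headers and the auxiliary vector (omitted here), and this is not Multi2Sim's actual loader code.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define STACK_SIZE 0x10000       /* model only the top 64 KB of stack */
    #define STACK_TOP  0xc0000000u   /* guest address of the stack top    */

    /* Write a 32-bit little-endian word at guest address 'addr'. */
    static void write32(uint8_t *mem, uint32_t addr, uint32_t val)
    {
        memcpy(mem + (addr - (STACK_TOP - STACK_SIZE)), &val, 4);
    }

    static uint32_t setup_stack(uint8_t *mem, int argc, char **argv, char **envp)
    {
        uint32_t base = STACK_TOP - STACK_SIZE;
        uint32_t sp = STACK_TOP;
        int nenv = 0, i;

        while (envp[nenv])
            nenv++;

        /* 1) Copy the strings themselves to the top of the stack. */
        uint32_t str_addr[argc + nenv];
        for (i = 0; i < argc + nenv; i++) {
            const char *s = i < argc ? argv[i] : envp[i - argc];
            sp -= (uint32_t) strlen(s) + 1;
            strcpy((char *) mem + (sp - base), s);
            str_addr[i] = sp;
        }

        /* 2) Push NULL-terminated envp[] and argv[] arrays, then argc. */
        sp &= ~3u;                       /* keep the stack word-aligned */
        sp -= 4; write32(mem, sp, 0);    /* envp terminator */
        for (i = nenv - 1; i >= 0; i--) {
            sp -= 4; write32(mem, sp, str_addr[argc + i]);
        }
        sp -= 4; write32(mem, sp, 0);    /* argv terminator */
        for (i = argc - 1; i >= 0; i--) {
            sp -= 4; write32(mem, sp, str_addr[i]);
        }
        sp -= 4; write32(mem, sp, (uint32_t) argc);
        return sp;                       /* initial value for register esp */
    }

    int main(void)
    {
        char *argv[] = { "test-args", "hello", "there", NULL };
        char *envp[] = { "PATH=/usr/bin", NULL };
        uint8_t *mem = calloc(STACK_SIZE, 1);
        uint32_t esp = setup_stack(mem, 3, argv, envp);
        printf("initial esp = 0x%x\n", esp);
        free(mem);
        return 0;
    }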

Page 15: m2s-4.1

The x86 Emulator – Emulation Loop

• Emulation of x86 instructions

─ Update x86 registers.

─ Update memory map if needed.

─ Example: add [bp+16], 0x5

• Emulation of Linux system calls

─ Analyze system call code and arguments.

─ Update memory map.

─ Update register eax with return value.

─ Example: read(fd, buf, count)

Demo 2

[Flowchart: read the instruction at eip → decode the instruction → if the instruction is int 0x80, emulate a system call; otherwise emulate the x86 instruction → move eip to the next instruction, and repeat.]
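A minimal sketch of this loop in C, assuming hypothetical helper functions (read_guest_mem, decode, is_int_0x80, emulate_syscall, emulate_inst); it mirrors the control flow in the flowchart, not Multi2Sim's actual internals:

    #include <stdint.h>

    /* Hypothetical context and instruction types, for illustration only. */
    struct x86_context { uint32_t eip; /* eax, ..., memory map */ };
    struct x86_inst { int opcode; int size; };

    extern void read_guest_mem(struct x86_context *ctx, uint32_t addr,
                               int size, void *buf);
    extern struct x86_inst decode(const unsigned char *bytes);
    extern int  is_int_0x80(const struct x86_inst *inst);
    extern void emulate_syscall(struct x86_context *ctx);  /* result in eax */
    extern void emulate_inst(struct x86_context *ctx,
                             const struct x86_inst *inst);

    void emulation_loop(struct x86_context *ctx)
    {
        unsigned char bytes[15];           /* maximum x86 instruction length */
        for (;;) {
            read_guest_mem(ctx, ctx->eip, sizeof bytes, bytes);  /* fetch  */
            struct x86_inst inst = decode(bytes);                /* decode */
            if (is_int_0x80(&inst))
                emulate_syscall(ctx);      /* update memory map + eax */
            else
                emulate_inst(ctx, &inst);  /* update registers/memory */
            ctx->eip += inst.size;         /* move eip to next instruction */
        }
    }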

Page 16: m2s-4.1

The x86 Timing Simulator

Page 17: m2s-4.1

The x86 Timing Simulator – Superscalar Processor

• Superscalar x86 pipelines

─ 6-stage pipeline with configurable latencies.

─ Supported features include speculative execution, branch prediction, micro-instruction generation, trace caches, out-of-order execution, …

─ Modeled structures include fetch queues, reorder buffer, load-store queues, register files, register mapping tables, ...

[Figure: superscalar pipeline. Fetch reads from the instruction cache or trace cache into the fetch and trace queues; Decode fills the uop queue; Dispatch inserts uops into the reorder buffer, instruction queue, and load/store queue; Issue sends them to the ALUs and the data cache; Writeback updates the register file; Commit retires instructions.]

Page 18: m2s-4.1

The x86 Timing Simulator – Multithreaded and Multicore Processors

• Multicore

─ Fully replicated superscalar pipelines, connected through caches.

─ Running multiple programs concurrently, or one program spawning child threads (using OpenMP, pthread, etc.)

• Multithreading

─ Replicated superscalar pipelines with partially shared resources.

─ Fine-grain, coarse-grain, and simultaneous multithreading.

Demo 3

[Figure: n cores × m hardware threads connected to the memory hierarchy. Within a superscalar core, the reorder buffer, instruction queue, load/store queue, register file, and functional units can be shared among threads; each hardware thread keeps private resources: program counter, register aliasing table, and TLB.]
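For reference, the core and thread counts above are selected through the x86 configuration file passed with --x86-config. A minimal sketch in the INI format shown earlier; the [ General ] keys Cores and Threads follow the Multi2Sim Guide, but check the Guide for the full set of options:

    ; x86-config: 2 cores, 2 hardware threads per core
    [ General ]
    Cores = 2
    Threads = 2

A detailed simulation would then be run as, e.g., $ m2s --x86-sim detailed --x86-config x86-config test-args hello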

Page 19: m2s-4.1

The x86 Timing Simulator – Benchmark Support

• Multithreaded applications

─ SPLASH-2 benchmark suite with pre-compiled x86 executables and data files available on the website.

─ PARSEC-2.1 with pre-compiled x86 executables and data files.

• Single-threaded applications

─ SPEC 2000 and SPEC 2006 benchmarks are fully supported. Pre-compiled x86 binaries are available on the website.

─ The Mediabench suite includes program binaries and data files, everything needed to run them.

Page 20: m2s-4.1

The Memory Hierarchy – Configuration

• Flexible hierarchies

─ Any number of caches organized in any number of levels.

─ Cache levels connected through default cross-bar interconnects, or complex custom interconnect configurations.

─ Each architecture undergoing a timing simulation specifies its own entry point (cache memory) in the memory hierarchy, for data or instructions.

─ Cache coherence is guaranteed with an implementation of the 5-state MOESI protocol.

Page 21: m2s-4.1

The Memory Hierarchy – Configuration Examples

Example 1

Three CPU cores with private L1 caches, two L2 caches, and default cross-bar based interconnects. Cache L2-0 serves physical address range [0, 7ff...ff], and cache L2-1 serves [80...00, ff...ff].

[Figure: cores 0-2 with private caches L1-0, L1-1, L1-2, connected through a switch to L2-0 and L2-1, which connect through a second switch to main memory.]

Demo 4
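A sketch of how Example 1 could be expressed in the INI-based memory configuration passed with --mem-config. Section and key names (CacheGeometry, Module, Entry, AddressRange) follow the Multi2Sim memory configuration format as documented in the Guide; the geometry values are made up for illustration, network definitions are omitted, and only one L1 and one L2 module are spelled out:

    [ CacheGeometry geo-l1 ]
    Sets = 64
    Assoc = 2
    BlockSize = 64
    Latency = 2

    ; One module per core: mod-l1-0, mod-l1-1, mod-l1-2
    [ Module mod-l1-0 ]
    Type = Cache
    Geometry = geo-l1
    LowNetwork = net-l1-l2
    LowModules = mod-l2-0 mod-l2-1

    ; Cache L2-0 serves the lower half of the physical address space
    [ Module mod-l2-0 ]
    Type = Cache
    Geometry = geo-l2
    HighNetwork = net-l1-l2
    LowNetwork = net-l2-mm
    LowModules = mod-mm
    AddressRange = BOUNDS 0x0 0x7fffffff

    ; Attach core 0 to its private L1 for both data and instructions
    [ Entry core-0 ]
    Arch = x86
    Core = 0
    Thread = 0
    DataModule = mod-l1-0
    InstModule = mod-l1-0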

Page 22: m2s-4.1

The Memory Hierarchy – Configuration Examples

Example 2

Four CPU cores with private L1 data caches, L1 instruction caches and L2 caches shared every two cores (each serving the whole address space), and four main memory modules, connected with a custom network on a ring topology.

[Figure: cores 0 and 1 have private data caches L1-0 and L1-1 and share instruction cache Inst. L1-0 and cache L2-0 through a switch; cores 2 and 3 likewise share Inst. L1-1 and L2-1. Switches sw0-sw3 (nodes n0-n5) form a ring connecting the L2 caches to main memory modules MM-0 to MM-3.]

Page 23: m2s-4.1

The Memory Hierarchy – Configuration Examples

Example 3

Ring connection between four switches associated with end nodes, with routing tables calculated automatically based on shortest paths. The resulting routing algorithm can contain cycles, potentially leading to routing deadlocks at runtime.

[Figure: end nodes n0-n3 attached to switches s0-s3 connected in a ring.]

Page 24: m2s-4.1

The Memory Hierarchy – Configuration Examples

Example 4

Ring connection between four switches associated with end nodes, where a routing cycle has been removed by adding an additional virtual channel.

[Figure: the same four-switch ring, now with two virtual channels (Virtual Channel 0 and Virtual Channel 1) to break the routing cycle.]

Page 25: m2s-4.1

Pipeline Visualization Tool

Page 26: m2s-4.1

Pipeline Visualization Tool – Pipeline Diagrams

─ Cycle bar on main window for navigation.

─ Panel on main window shows software contexts mapped to hardware cores.

─ Clicking on the Detail button opens a secondary window with a pipeline diagram.
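For reference, a typical workflow (flag names as documented in the Multi2Sim Guide; verify with m2s --help): a detailed simulation first dumps a compressed trace, which the visual tool then loads:

    $ m2s --x86-sim detailed --trace my-trace.gz test-args hello
    $ m2s --visual my-trace.gz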

Page 27: m2s-4.1

Pipeline Visualization Tool – Memory Hierarchy

─ Panel on main window shows how memory accesses traverse the memory hierarchy.

─ Clicking on a Detail button opens a secondary window with the cache memory representation.

─ Each row is a set, each column is a way.

─ Each cell shows the tag and state (color) of a cache block.

─ Additional columns show the number of sharers and in-flight accesses.

Demo 5

Page 28: m2s-4.1

Part 2

Simulation of a Southern Islands GPU

Page 29: m2s-4.1

OpenCL on the Host – Execution Framework

─ Multi2Sim 4.1 includes a new execution framework for OpenCL, developed in collaboration with the University of Toronto.

─ The new framework more closely mirrors native execution, and is fully AMD-compliant.

─ When working with x86 kernel binaries, the OpenCL runtime can perform both native and simulated execution correctly.

─ When run natively, an OpenCL call to clGetDeviceIDs returns only the x86 device.

─ When run on Multi2Sim, clGetDeviceIDs returns one device per supported architecture: x86, Evergreen, and Southern Islands devices (more to be added).
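A small host-side sketch of the enumeration just described, using the standard OpenCL 1.x API (error checking elided for brevity):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint n, i;
        char name[256];

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &n);
        /* Natively: only the x86 CPU is listed. On Multi2Sim: one
           device per supported architecture (x86, Evergreen, S.I.). */
        for (i = 0; i < n; i++) {
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME,
                            sizeof name, name, NULL);
            printf("device %u: %s\n", i, name);
        }
        return 0;
    }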

Page 30: m2s-4.1

OpenCL on the Host – Execution Framework

─ The following slides show the modular organization of the OpenCL execution framework, based on 4 software/hardware entities.

─ In each case, we compare native execution withsimulated execution.

[Figure: the four entities. The user application performs an API call into the runtime library (user-level code); the runtime performs an ABI call into the device driver (OS-level code); the driver drives the hardware through an internal interface.]

Page 31: m2s-4.1

OpenCL on the Host – The OpenCL CPU Host Program

Native
An x86 OpenCL host program performs an OpenCL API call.

Multi2Sim
Exact same scenario.

Page 32: m2s-4.1

OpenCL on the Host – The OpenCL Runtime Library

Native
AMD's OpenCL runtime library handles the call, and communicates with the driver through system calls ioctl, read, write, etc. These are referred to as ABI calls.

Multi2Sim
Multi2Sim's OpenCL runtime library, running as guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.

Page 33: m2s-4.1

OpenCL on the Host – The OpenCL Device Driver

Native
The AMD Catalyst driver (kernel module) handles the ABI call and communicates with the GPU through the PCIe bus.

Multi2Sim
An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator.

Page 34: m2s-4.1

OpenCL on the Host – The GPU Emulator

Native
The command processor in the GPU handles the messages received from the driver.

Multi2Sim
The GPU emulator updates its internal state based on the message received from the driver.

Page 35: m2s-4.1

OpenCL on the Host – Transferring Control

• Beginning execution on the GPU

─ The key OpenCL call that effectively triggers GPU execution is clEnqueueNDRangeKernel.

• Order of events

─ The host program performs API call clEnqueueNDRangeKernel.

─ The runtime intercepts the call, and enqueues a new task in an OpenCL command queue object. A user-level thread associated with the command queue eventually processes the command, performing a LaunchKernel ABI call.

─ The driver intercepts the ABI call, reads ND-Range parameters, and launches the GPU emulator.

─ The GPU emulator enters a simulation loop until the ND-Range completes.
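To make the order of events concrete, a minimal host-side sketch using the standard OpenCL API; the queue and kernel objects are assumed to have been created by the usual setup calls, which are elided:

    #include <CL/cl.h>

    /* Enqueue a 1-D ND-Range: 1024 work-items in work-groups of 64
       (64 work-items = one Southern Islands wavefront). */
    void launch(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global = 1024, local = 64;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                               0, NULL, NULL);
        clFinish(queue);    /* block until the ND-Range completes */
    }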

Page 36: m2s-4.1

OpenCL on the Device – Execution Model

─ Work-items execute multiple instances of the same kernel code.

─ Work-groups are sets of work-items that can synchronize and communicate efficiently.

─ The ND-Range comprises all work-groups; work-groups do not communicate with each other and may execute in any order.

[Figure: the ND-Range is a grid of work-groups sharing global memory; each work-group is a set of work-items sharing local memory; each work-item runs one instance of the kernel function (__kernel func()) with its own private memory.]
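A small kernel sketch (hypothetical reduction example, assuming a power-of-two work-group size) that exercises all three levels: each work-item starts from its private value, work-items within a work-group cooperate through local memory and a barrier, and each work-group independently writes one result to global memory:

    __kernel void group_sum(__global const int *in, __global int *out,
                            __local int *tmp)
    {
        uint lid = get_local_id(0);
        uint n = get_local_size(0);

        tmp[lid] = in[get_global_id(0)];    /* private -> local */
        barrier(CLK_LOCAL_MEM_FENCE);       /* work-group sync  */

        /* Tree reduction within the work-group. */
        for (uint s = n / 2; s > 0; s /= 2) {
            if (lid < s)
                tmp[lid] += tmp[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            out[get_group_id(0)] = tmp[0];  /* one result per work-group */
    }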

Page 37: m2s-4.1

OpenCL on the Device – Execution Model

• Software-hardware mapping

─ When the kernel is launched by the Southern Islands driver, the OpenCL ND-Range is mapped to the compute device (Fig. a).

─ The work-groups are mapped to the compute units (Fig. b).

─ The work-items are executed by the SIMD lanes (Fig. c).

─ This is a simplified view of the architecture, explored later in more detail.

[Figure: (a) compute device: an ultra-threaded dispatcher feeds Compute Unit 0 through Compute Unit 31, all sharing global memory; (b) compute unit: a wavefront scheduler, SIMD units 0-3 with lanes 0-15 each, a register file, and local memory; (c) SIMD lane (stream core): functional units (integer and float).]

Page 38: m2s-4.1

The Southern Islands Disassembler

Page 39: m2s-4.1

The Southern Islands Disassembler – SIMD Execution

• Motivation

─ OpenCL follows a Single-Program Multiple-Data (SPMD) programming model, which maps to a Single-Instruction Multiple-Data (SIMD) execution model.

─ A wavefront is a fixed set of work-items forming the SIMD execution unit. One single instruction is fetched, multiple instances of it are executed.

Trade-off: 1 Wavefront = 64 Work-Items

• If the wavefront is large...

─ Amortize ports in the instruction cache and fetch hardware in general.

• If the wavefront is small...

─ Easier to create work-groups with a size equal to an exact multiple of the wavefront size, reducing waste in the wavefront tail.

─ Reduce the effects of thread divergence in conditional execution or loops.
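A quick illustration of the tail-waste argument, plain arithmetic with hypothetical sizes: with 64-work-item wavefronts, a 100-work-item work-group needs ceil(100/64) = 2 wavefronts, and the second one runs with only 36 of its 64 lanes active (28 idle):

    #include <stdio.h>

    int main(void)
    {
        int wavefront = 64;
        int sizes[] = { 64, 100, 128, 200 };
        for (int i = 0; i < 4; i++) {
            int wg = sizes[i];
            int n = (wg + wavefront - 1) / wavefront;  /* wavefronts needed  */
            int tail = n * wavefront - wg;             /* idle lanes in tail */
            printf("work-group of %3d -> %d wavefront(s), %2d idle lanes\n",
                   wg, n, tail);
        }
        return 0;
    }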

Page 40: m2s-4.1

The Southern Islands Disassembler – Scalar Instructions

• Motivation

─ Sometimes work-items within a wavefront not only execute the same instruction, but also operate on the same data.

─ The compiler can detect some of these cases and emit scalar instructions, issued to a special execution unit of much lower cost.

• Scalar opportunities

─ Loading a base address of a buffer.

─ Loading values from constant memory.

─ Loop initialization, comparison, and increments.

Page 41: m2s-4.1

The Southern Islands Disassembler – Vector Addition Kernel

• Source code

    __kernel void vector_add(
        __read_only __global int *src1,
        __read_only __global int *src2,
        __write_only __global int *dst)
    {
        int id = get_global_id(0);
        dst[id] = src1[id] + src2[id];
    }

• Assembly code (s0, s1, ... are scalar registers and instructions; v0, v1, ... are vector registers and instructions)

    s_buffer_load_dword s0, s[4:7], 0x04
    s_buffer_load_dword s1, s[4:7], 0x18
    s_buffer_load_dword s4, s[8:11], 0x00
    s_buffer_load_dword s5, s[8:11], 0x04
    s_buffer_load_dword s6, s[8:11], 0x08
    s_load_dwordx4 s[8:11], s[2:3], 0x58
    s_load_dwordx4 s[16:19], s[2:3], 0x60
    s_load_dwordx4 s[20:23], s[2:3], 0x50
    s_waitcnt lgkmcnt(0)
    s_min_u32 s0, s0, 0x0000ffff
    v_mov_b32 v1, s0
    v_mul_i32_i24 v1, s12, v1
    v_add_i32 v0, vcc, v0, v1
    v_add_i32 v0, vcc, s1, v0
    v_lshlrev_b32 v0, 2, v0
    v_add_i32 v1, vcc, s4, v0
    v_add_i32 v2, vcc, s5, v0
    v_add_i32 v0, vcc, s6, v0
    tbuffer_load_format_x v1, v1, s[8:11], 0       // the loads
    tbuffer_load_format_x v2, v2, s[16:19], 0
    s_waitcnt vmcnt(0)
    v_add_i32 v1, vcc, v1, v2                      // the addition
    tbuffer_store_format_x v1, v0, s[20:23], 0     // the store
    s_endpgm

Page 42: m2s-4.1

The Southern Islands Disassembler – Conditional Statements

• Source code

    __kernel void if_kernel(__global int *v)
    {
        uint id = get_global_id(0);
        if (id < 5)
            v[id] = 10;
    }

• Assembly code

    s_buffer_load_dword s0, s[8:11], 0x04
    s_buffer_load_dword s1, s[8:11], 0x18
    s_waitcnt lgkmcnt(0)
    s_min_u32 s0, s0, 0x0000ffff
    v_mov_b32 v1, s0
    v_mul_i32_i24 v1, s16, v1
    v_add_i32 v0, vcc, v0, v1
    v_add_i32 v0, vcc, s1, v0
    s_buffer_load_dword s0, s[12:15], 0x00
    v_cmp_lt_u32 s[2:3], v0, 5                 // the comparison
    s_and_saveexec_b64 s[2:3], s[2:3]          // save active mask
    v_lshlrev_b32 v0, 2, v0
    s_waitcnt lgkmcnt(0)
    v_add_i32 v0, vcc, s0, v0
    v_mov_b32 v1, 10
    tbuffer_store_format_x v1, v0, s[4:7], 0   // store value 10
    s_mov_b64 exec, s[2:3]                     // restore active mask
    s_endpgm

Page 43: m2s-4.1

The Southern Islands Emulator

Page 44: m2s-4.1

The Southern Islands Emulator – Program Loading

[Figure: the user application / runtime library / device driver / hardware stack from Page 30, with the device driver highlighted.]

• Responsible entity

─ The device driver is the module responsible for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction.

─ Natively, it writes on hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.

• Setup

─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary.

─ Initial global memory image, copying global buffers from CPU to GPU memory.

─ Kernel arguments.

─ ND-Range topology, including number of dimensions and sizes.

Page 45: m2s-4.1

The Southern Islands Emulator – Emulation Loop

[Flowchart: split the ND-Range into work-groups (work-group pool). While work-groups are left, grab one and split it into wavefronts (wavefront pool). While wavefronts are left, for each running wavefront not stalled in a barrier: read the instruction at the PC, emulate it (update memory and registers), and advance the PC.]

Demo 6

• Work-group execution

─ Work-groups can execute in any order. This order is irrelevant for emulation purposes.

─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.

• Wavefront execution

─ Wavefronts within a work-group can also execute in any order, as long as synchronizations are considered.

─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.
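A structural sketch of this policy in C; the types and the emulate_one_inst helper are hypothetical and greatly simplified (barrier release, in particular, assumes every unfinished wavefront eventually reaches the barrier), so this mirrors the control flow only, not Multi2Sim's data structures:

    struct wavefront { int pc; int at_barrier; int finished; };
    struct work_group { struct wavefront wf[16]; int wf_count; };

    /* Read the instruction at pc, update memory/registers, advance pc;
       may set at_barrier or finished (e.g., on a barrier or s_endpgm). */
    extern void emulate_one_inst(struct wavefront *wf);

    static void run_work_group(struct work_group *wg)
    {
        int all_finished = 0;
        while (!all_finished) {
            all_finished = 1;
            /* One wavefront at a time, until it hits a barrier or ends. */
            for (int i = 0; i < wg->wf_count; i++) {
                struct wavefront *wf = &wg->wf[i];
                while (!wf->finished && !wf->at_barrier)
                    emulate_one_inst(wf);
                if (!wf->finished)
                    all_finished = 0;
            }
            /* All runnable wavefronts reached the barrier: release it. */
            for (int i = 0; i < wg->wf_count; i++)
                wg->wf[i].at_barrier = 0;
        }
    }

    /* Work-groups run one at a time, in increasing order of ID. */
    void run_nd_range(struct work_group *wg, int count)
    {
        for (int i = 0; i < count; i++)
            run_work_group(&wg[i]);
    }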

Page 46: m2s-4.1

Southern Islands Timing Simulation

Page 47: m2s-4.1

Southern Islands Timing Simulation – The GPU Architecture

[Figure: a command processor feeds an ultra-threaded dispatcher; Compute Unit 0 through Compute Unit 31, each with an L1 cache, connect through a crossbar to the main memory hierarchy (L2 caches, memory controllers, video memory).]

─ A command processor receives and processes commands from the host.

─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.

─ The following slides present the architecture of each of the compute units.

Page 48: m2s-4.1

Southern Islands Timing Simulation – The Compute Unit

─ The instruction memory of each compute unit contains the OpenCL kernel.

─ A front-end fetches instructions and sends them to the appropriate execution unit.

─ There is one scalar unit, one vector-memory unit, one branch unit, and one LDS (local data store) unit.

─ There are multiple instances of SIMD units.

Page 49: m2s-4.1

Southern Islands Timing Simulation – The Front-End

─ Work-groups are split into wavefronts and allocated to wavefront pools.

─ Fetch and issue stages operate in a round-robin fashion.

─ There is one SIMD unit associated with each wavefront pool.

[Figure: the front-end. Each wavefront pool has its own fetch buffer in the Fetch stage; the Issue stage distributes instructions into issue buffers for the matching SIMD unit, the scalar unit, the branch unit, the vector-memory unit, and the LDS unit.]

Page 50: m2s-4.1

Southern Islands Timing Simulation – The SIMD Unit

─ Runs arithmetic-logic vector instructions.

─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.

─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.

─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles.

─ The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively.
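The execute-stage mapping in the fourth bullet can be written down directly (illustrative arithmetic only): work-item i of a wavefront runs on SIMD lane i % 16 during subwavefront i / 16, so lane 0 sees work-items 0, 16, 32, 48 on four consecutive cycles:

    #include <stdio.h>

    int main(void)
    {
        for (int i = 0; i < 64; i++) {
            int lane = i % 16;     /* which of the 16 stream cores */
            int subwf = i / 16;    /* which of the 4 subwavefronts */
            if (lane == 0)         /* print one lane as an example */
                printf("work-item %2d -> lane %d, subwavefront %d\n",
                       i, lane, subwf);
        }
        return 0;
    }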

Page 51: m2s-4.1

Southern Islands Timing Simulation – The SIMD Unit

[Figure: SIMD unit pipeline. Instructions arrive from the compute unit front-end into an issue buffer and flow through the Decode, Read, Execute, Write, and Complete stages, with a buffer between consecutive stages. Read and Write access the vector register file. The Execute stage comprises SIMD lanes 0-15 with pipelined functional units; lane 0 runs work-items 0, 16, 32, 48; lane 1 runs work-items 1, 17, 33, 49; ...; lane 15 runs work-items 15, 31, 47, 63.]

Page 52: m2s-4.1

Southern Islands Timing Simulation – The Scalar Unit

─ Runs both arithmetic-logic and memory scalar instructions.
─ Modeled with 5 stages – decode, read, execute/memory, write, complete.

[Figure: scalar unit pipeline. Instructions arrive from the compute unit front-end into an issue buffer and flow through the Decode, Read, Execute/Memory, Write, and Complete stages, with a buffer between consecutive stages; the register files are accessed in the Read and Write stages.]

Page 53: m2s-4.1

Southern Islands Timing Simulation – The Vector-Memory Unit

─ Runs vector memory instructions.
─ Modeled with 5 stages – decode, read, memory, write, complete.
─ Accesses to the global memory hierarchy happen mainly in this unit.

[Figure: vector-memory unit pipeline. Instructions flow from an issue buffer through the Decode, Read, Memory, Write, and Complete stages; the vector register file is accessed in the Read and Write stages, and global memory in the Memory stage.]

Page 54: m2s-4.1

Southern Islands Timing Simulation – The Branch Unit

─ Runs branch instructions. These instructions decide whether to make an entire wavefront jump to a target address depending on the scalar condition code.

─ Modeled with 5 stages – decode, read, execute/memory, write, complete.

[Figure: branch unit pipeline. Instructions flow from an issue buffer through the Decode, Read, Execute, Write, and Complete stages; the scalar register file provides the condition codes in the Read stage and receives the program counter in the Write stage.]

Page 55: m2s-4.1

Southern Islands Timing Simulation – The LDS (Local Data Store) Unit

─ Runs local memory access instructions.
─ Modeled with 5 stages – decode, read, execute/memory, write, complete.
─ The memory stage accesses the compute unit's local memory for reads and writes.

[Figure: LDS unit pipeline. Instructions flow from an issue buffer through the Decode, Read, Memory, Write, and Complete stages; the vector register file is accessed in the Read and Write stages, and the local memory in the Memory stage.]

Page 56: m2s-4.1

Southern Islands Timing Simulation – Benchmark Support

• AMD OpenCL SDK

─ Host programs are statically linked x86 binaries.

─ The SDK is available for device kernels using three different architectures: Evergreen, Southern Islands, and x86.

─ Packages include data files and execution commands.

• Other applications

─ Support for other applications, from the Rodinia and Parboil benchmark suites, has been added by other users. Packages are under development.

Page 57: m2s-4.1

Global Memory Hierarchy – Organization

─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU.

─ One 16KB data L1 per compute unit.

─ One scalar L1 cache shared by every 4 compute units.

─ Six L2 banks of 128KB each, every bank connected to a DRAM module.

[Figure: compute units CU 0 through CU 31, each with a private L1, plus one scalar cache per group of four compute units, connected through an interconnect to L2 banks 0-5, each backed by its own DRAM module.]

Page 58: m2s-4.1

Global Memory Hierarchy – Policies

─ Inclusive, write-back, non-coherent caches.

─ Non-coherence is implemented as a 3-state “coherence” protocol: NSI

─ Blocks in N state are merged on write-back using write bit masks.

Non-coherent – Block can be accessed for read/write access, while shared with other caches.

Shared – Block is accessible for read access, possibly shared with other caches.

Invalid – Block does not contain valid data.

[Figure: two L1 caches write back the same block to the L2 cache with complementary write bit masks (1111 0000 and 0000 1111); the L2 merges the cache block on writeback.]
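A sketch of the merge-on-writeback idea in C, at byte granularity for readability (the figure uses bit masks); names and sizes are illustrative, not Multi2Sim's implementation:

    #include <stdio.h>

    #define BLOCK_SIZE 8   /* bytes; small for the example */

    /* Merge a written-back block into the L2 copy: only bytes whose
       mask entry is set were actually written by the owner cache. */
    static void merge_writeback(unsigned char *l2_block,
                                const unsigned char *wb_block,
                                const unsigned char *mask)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            if (mask[i])
                l2_block[i] = wb_block[i];
    }

    int main(void)
    {
        unsigned char l2[BLOCK_SIZE] = { 0 };
        unsigned char a[BLOCK_SIZE]  = { 1, 2, 3, 4, 0, 0, 0, 0 };
        unsigned char ma[BLOCK_SIZE] = { 1, 1, 1, 1, 0, 0, 0, 0 }; /* 1111 0000 */
        unsigned char b[BLOCK_SIZE]  = { 0, 0, 0, 0, 5, 6, 7, 8 };
        unsigned char mb[BLOCK_SIZE] = { 0, 0, 0, 0, 1, 1, 1, 1 }; /* 0000 1111 */

        merge_writeback(l2, a, ma);   /* writeback from first L1  */
        merge_writeback(l2, b, mb);   /* writeback from second L1 */
        for (int i = 0; i < BLOCK_SIZE; i++)
            printf("%d ", l2[i]);     /* prints: 1 2 3 4 5 6 7 8  */
        printf("\n");
        return 0;
    }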

Page 59: m2s-4.1

Southern Islands Pipeline Visualization

Page 60: m2s-4.1

Demo 7

Southern Islands Pipeline Visualization – Main Window and Timing Diagram

─ Main window provides cycle-by-cycle navigation throughout simulation.

─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups.

─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.

Page 61: m2s-4.1

Case Study: ND-Range Virtualization – Motivation

• Could also be called...

─ Cooperative work-group execution.

─ Next-level system heterogeneity.

• The idea

─ The programmer writes one single OpenCL kernel.

─ The OpenCL runtime creates one kernel binary targeting each architecture present in the system.

─ The ND-Range is dynamically and automatically partitioned, and work-groups are spread across CPU and GPU cores.

• The result

─ Data-parallel programming model complexity resides in the design of the OpenCL kernel, but not in cross-device scheduling.

─ System resources (cores) are better utilized and transparently managed.

Page 62: m2s-4.1

Case Study: ND-Range Virtualization – Current Heterogeneous Execution

[Figure: the complete ND-Range (WG-0 through WG-N) is sent to the Southern Islands device. The timeline shows the host program on one x86 core and the device kernel on the S.I. compute units, with the remaining x86 cores idle.]

• Work-group mapping

─ Complete ND-Range sent to one device, selected by the user with a call to clGetDeviceIDs.

• Attained concurrency

─ Host program runs on one x86 core, device kernel runs on Southern Islands compute units.

─ Idle execution regions on x86 cores while Southern Islands compute units run the ND-Range.

Page 63: m2s-4.1

Case Study: ND-Range Virtualization – Heterogeneous Execution with ND-Range Virtualization

[Figure: work-groups WG-0 through WG-N are spread across both x86 cores and S.I. compute units. The timeline shows the x86 cores running the host program plus kernel work-groups, with no idle regions.]

• Work-group mapping

─ Portions of ND-Range executed by CPU/GPU cores with different ISAs.

─ Work-groups mapped to CPU cores or GPU compute units as they become available.

• Attained concurrency

─ x86 cores run both the host program and a portion of the ND-Range.

─ Idle regions are removed during the execution of the ND-Range.

Page 64: m2s-4.1

Case Study: ND-Range Virtualization – The Hardware

─ Protocols MOESI and NSI merged into a single 6-state protocol: NMOESI.

─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy.

─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store.

─ GPUs and CPUs running OpenCL kernels issue n-store write accesses; the rest issue regular store accesses.

[Figure: ARM, x86, Evergreen, S.I., and Fermi cores, each with an L1 cache, attached through the NMOESI interface and an interconnect to the L2 banks.]

Page 65: m2s-4.1

Case Study: ND-Range Virtualization – The Software

─ The OpenCL host program is simplified: only one command queue is needed to exploit system heterogeneity.

─ The runtime provides an additional virtual fused device, returned as a result of a call to clGetDeviceIDs.

─ The driver extends its interface to allow each physical device to run a portion of the ND-Range, at the work-group granularity.

[Figure: the user application / runtime library / device driver / hardware stack from Page 30.]

Page 66: m2s-4.1

Concluding Remarks

Page 67: m2s-4.1

The Multi2Sim Community – Additional Material

• The Multi2Sim Guide
  ─ Complete documentation of the simulator's user interface, simulation models, and additional tools.

• Multi2Sim forums and mailing list
  ─ New version releases and other important information are posted on the Multi2Sim mailing list (no spam, 1 email per month).
  ─ Users share questions and knowledge on the website forum.

• M2S-Cluster
  ─ Automatic verification framework for Multi2Sim.
  ─ Based on a cluster of computers running Condor.

Page 68: m2s-4.1

The Multi2Sim Community – Multi2C, the Multi2Sim Compiler

─ LLVM-based compiler for OpenCL and CUDA kernels.

─ Future release Multi2Sim 4.2 will include a working version.

─ Diagrams show progress as per SVN Rev. 1838 (orange = under development, yellow = arriving soon, green = supported).

[Figure: compiler flow. vec-add.cl enters the OpenCL C to LLVM front-end and vec-add.cu the CUDA to LLVM front-end, both producing vec-add.llvm. LLVM back-ends for Southern Islands, Fermi, and Kepler produce vec-add.s, which the Southern Islands, Fermi, and Kepler assemblers turn into vec-add.bin or vec-add.cubin.]

Page 69: m2s-4.1

The Multi2Sim Community – Academic Efforts at Northeastern

• The "GPU Programming and Architecture" course
  ─ We started an unofficial seminar that students can voluntarily attend. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
  ─ Average attendance of ~25 students per semester.

• Undergraduate directed studies
  ─ An official alternative, equivalent to a 4-credit course, that an undergraduate student can optionally enroll in, collaborating in Multi2Sim development.

• Graduate-level development
  ─ Many research projects at the graduate level are based on Multi2Sim, and are selectively included in the development trunk for public access.
  ─ Simulation of OpenGL pipelines and support for new CPU/GPU architectures, among others.

Page 70: m2s-4.1

The Multi2Sim Community – Collaborating Research Groups

• Universidad Politécnica de Valencia
  ─ Pedro López, Salvador Petit, Julio Sahuquillo, José Duato.

• Northeastern University
  ─ Chris Barton, Shu Chen, Zhongliang Chen, Tahir Diop, Xiang Gong, David Kaeli, Nicholas Materise, Perhaad Mistry, Dana Schaa, Rafael Ubal, Mark Wilkening, Ang Shen, Tushar Swamy, Amir Ziabari.

• University of Mississippi
  ─ Byunghyun Jang.

• NVIDIA
  ─ Norm Rubin.

• University of Toronto
  ─ Jason Anderson, Natalie Enright, Steven Gurfinkel, Tahir Diop.

• University of Texas
  ─ Rustam Miftakhutdinov.

Page 71: m2s-4.1

The Multi2Sim Community – Multi2Sim Academic Publications

• Conference papers
  ─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007.
  ─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012.

• Tutorials
  ─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2011.
  ─ Programming and Simulating Fused Devices — OpenCL and Multi2Sim, ICPE, 2012.
  ─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013.
  ─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013.

Page 72: m2s-4.1

The Multi2Sim Community – Published Academic Works Using Multi2Sim

• Recent
  ─ R. Miftakhutdinov, E. Ebrahimi, Y. Patt, Predicting Performance Impact of DVFS for Realistic Memory Systems, MICRO, 2012.
  ─ D. Lustig, M. Martonosi, Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization, HPCA, 2013.

• Other
  ─ H. Calborean, R. Jahr, T. Ungerer, L. Vintan, A Comparison of Multi-objective Algorithms for the Automatic Design Space Exploration of a Superscalar System, Advances in Intelligent Systems and Computing, vol. 187.
  ─ X. Li, C. Wang, X. Zhou, Z. Zhu, Cache Promotion Policy Using Re-reference Interval Prediction, CLUSTER, 2012.

… and 62 more citations, as per Google Scholar.

Page 73: m2s-4.1

The Multi2Sim Community – Sponsors