
HAsim FPGA-Based Processor Models: Basic Models

Feb 22, 2016

Page 1: HAsim FPGA-Based Processor Models:    Basic Models

HAsim FPGA-Based Processor Models: Basic Models

Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer

Page 2: HAsim FPGA-Based Processor Models:    Basic Models

2

Managing Complexity

How do we reduce the work of writing software models?
– Re-use components
– Split functional & timing models
– Model only what is necessary to compute target performance
– Limit data manipulation in timing models, e.g.:
  • Manage cache tags but not data
  • Let functional model handle DEC and EXE details

Same on FPGA, plus:
– Use software for operations that don’t affect model speed

Page 3: HAsim FPGA-Based Processor Models:    Basic Models

3

Functional Model

• ISA semantics only – no timing
• Similarities to software models:
  – ISA-agnostic code implements core functions:
    • Register file
    • Virtual to physical address translation
    • Store buffer
  – ISA-specific code:
    • Decode
    • Execute

HAsim currently implements Alpha and SMIPS

Page 4: HAsim FPGA-Based Processor Models:    Basic Models

4

Functional Model is Hybrid FPGA / Software

• Difficult but infrequent tasks do not need to be accelerated
• HAsim uses gem5 for:
  – Loading programs
  – Target machine address space management
  – Emulation of rare & complex instructions

Page 5: HAsim FPGA-Based Processor Models:    Basic Models

5

Functional Model is Latency Insensitive (LI)

• LI design enables hybrid FPGA / software implementation
  – Functional memory is cached on FPGA, homed on host
  – Emulate system calls (gem5 user-mode) in software
• Area-efficient FPGA implementation
  – Store buffer uses a hash table instead of a CAM
  – Choose number of register-file read ports to handle the common case
  – Serialized rewind
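The store-buffer bullet can be illustrated in software. The following is a minimal Python sketch, not HAsim code: the class `HashedStoreBuffer`, its methods, and the use of plain integers as token ages are all assumptions made for illustration. It shows the general idea of replacing a parallel associative search (CAM) with a hashed index, so each lookup inspects only one bucket.

```python
class HashedStoreBuffer:
    """Speculative store buffer indexed by a hash of the address.

    A CAM would compare every entry in parallel. Here a lookup only walks
    the one bucket selected by the address hash, which maps onto ordinary
    RAM. Integer tokens stand in for instruction age (an assumption).
    """

    def __init__(self, num_buckets=16):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, address):
        # Trivial hash: low-order bits of the address.
        return self.buckets[address % self.num_buckets]

    def insert(self, token, address, value):
        """Record a speculative store made by instruction `token`."""
        self._bucket(address).append((token, address, value))

    def lookup(self, token, address):
        """Return the value of the youngest store to `address` that is
        older than `token`, or None if there is no such store."""
        older = [e for e in self._bucket(address)
                 if e[1] == address and e[0] < token]
        if not older:
            return None
        return max(older, key=lambda e: e[0])[2]

    def remove(self, token, address):
        """Drop the store of `token` (on commit or rewind) and return it."""
        bucket = self._bucket(address)
        for i, entry in enumerate(bucket):
            if entry[0] == token and entry[1] == address:
                return bucket.pop(i)
        return None
```

Because the functional model is latency insensitive, walking a bucket over several FPGA cycles would cost host time but not change the simulated result.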

Page 6: HAsim FPGA-Based Processor Models:    Basic Models

6

Functional Memory

[Figure: timeline of a functional memory read across FPGA and software. Components: Mem Read Interface, Private Mem Cache, Central Cache, Mem Interface, Memory (m5). Events: Hit, Miss. Horizontal axis: Time.]

Page 7: HAsim FPGA-Based Processor Models:    Basic Models

7

Reducing Model Complexity: Shared Functional Model

[Figure: Functional Pipeline with stages ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit and Global Commit, operating on Functional State]

• Similar philosophy to software models:
  – Write ISA functional model once
  – Functional machine state is completely managed
  – Timing models can be ISA-independent
• Each functional pipeline stage behaves like a request/response FIFO

ISPASS 2008 Paper: Quick Performance Models Quickly: Timing-Directed Simulation on FPGAs
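The request/response FIFO behavior of a functional stage can be sketched as follows. This is a Python illustration only: `RequestResponseStage`, `step` and `call` are hypothetical names, and the method names `make_req`, `get_resp` and `deq` merely echo the `makeReq` / `getResp` / `deq` calls seen in the timing-model rules. The point is that the caller receives the same answer regardless of how many host steps the stage takes.

```python
from collections import deque


class RequestResponseStage:
    """A functional stage seen as a request FIFO and a response FIFO."""

    def __init__(self, compute, latency=1):
        self.compute = compute      # function mapping request -> response
        self.latency = latency      # host steps needed per request
        self.in_flight = deque()    # entries: [steps_remaining, response]
        self.responses = deque()

    def make_req(self, request):
        self.in_flight.append([self.latency, self.compute(request)])

    def step(self):
        """Advance one host step; finished requests retire in order."""
        for item in self.in_flight:
            if item[0] > 0:
                item[0] -= 1
        while self.in_flight and self.in_flight[0][0] == 0:
            self.responses.append(self.in_flight.popleft()[1])

    def can_get_resp(self):
        return bool(self.responses)

    def get_resp(self):
        return self.responses[0]

    def deq(self):
        self.responses.popleft()


def call(stage, request):
    """Issue one request and wait for its response.

    Returns (response, host_steps_waited)."""
    stage.make_req(request)
    steps = 0
    while not stage.can_get_resp():
        stage.step()
        steps += 1
    response = stage.get_resp()
    stage.deq()
    return response, steps
```

A fast stage and a slow stage built from the same `compute` function return identical responses; only the number of host steps differs.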

Page 8: HAsim FPGA-Based Processor Models:    Basic Models

8

FPGA Functional Detail: Token Indexed Scoreboard

• Functional model maintains a “token” for each active instruction
• Like a pointer, but better on hardware than new, delete and global storage
• Token indexed state in functional scoreboard:
  – Decode information (isLoad, isStore, …)
  – Architectural to physical register mapping
  – Memory information
  – Model verification logic (e.g. prove that an instruction properly flows through all functional stages.)
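A token-indexed scoreboard can be pictured as a set of parallel fixed-size arrays, all indexed by the token. The Python sketch below is an illustration under stated assumptions: the class `TokenScoreboard`, the shortened `STAGES` tuple, the circular allocation policy and the field names are hypothetical and are not taken from the HAsim sources.

```python
# Shortened stage list for the sketch; the real pipeline has more stages.
STAGES = ("ITranslate", "Fetch", "Decode", "Execute")


class TokenScoreboard:
    """Per-instruction state stored in arrays indexed by token."""

    def __init__(self, num_tokens=4):
        self.num_tokens = num_tokens
        self.in_use = [False] * num_tokens
        self.is_load = [False] * num_tokens
        self.is_store = [False] * num_tokens
        self.dest_phys_reg = [None] * num_tokens
        self.next_stage = [0] * num_tokens   # verification state
        self.next_token = 0

    def allocate(self):
        """Hand out tokens in circular order; None if none is free."""
        tok = self.next_token
        if self.in_use[tok]:
            return None
        self.in_use[tok] = True
        self.is_load[tok] = False
        self.is_store[tok] = False
        self.dest_phys_reg[tok] = None
        self.next_stage[tok] = 0
        self.next_token = (tok + 1) % self.num_tokens
        return tok

    def set_decode_info(self, tok, is_load, is_store, dest_phys_reg):
        self.is_load[tok] = is_load
        self.is_store[tok] = is_store
        self.dest_phys_reg[tok] = dest_phys_reg

    def advance(self, tok, stage):
        """Verification logic: stages must be visited in order."""
        position = self.next_stage[tok]
        if (not self.in_use[tok] or position >= len(STAGES)
                or STAGES[position] != stage):
            raise RuntimeError(f"token {tok} reached {stage} out of order")
        self.next_stage[tok] = position + 1

    def release(self, tok):
        """Free a token once it has passed through every stage."""
        if not self.in_use[tok] or self.next_stage[tok] != len(STAGES):
            raise RuntimeError(f"token {tok} released too early")
        self.in_use[tok] = False
```

No storage is allocated or freed at run time; a token is just an index, which is why it suits hardware better than pointers into dynamically managed memory.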

Page 9: HAsim FPGA-Based Processor Models:    Basic Models

10

Functional Model API (Named Connections)

Connection_Client#(FUNCP_REQ_DO_ITRANSLATE, FUNCP_RSP_DO_ITRANSLATE) linkToITR <- mkConnection_Client("funcp_doITranslate");

Connection_Client#(FUNCP_REQ_GET_INSTRUCTION, FUNCP_RSP_GET_INSTRUCTION) linkToFET <- mkConnection_Client("funcp_getInstruction");

Connection_Client#(FUNCP_REQ_GET_DEPENDENCIES, FUNCP_RSP_GET_DEPENDENCIES) linkToDEC <- mkConnection_Client("funcp_getDependencies");

Connection_Client#(FUNCP_REQ_GET_RESULTS, FUNCP_RSP_GET_RESULTS) linkToEXE <- mkConnection_Client("funcp_getResults");

Connection_Client#(FUNCP_REQ_DO_DTRANSLATE, FUNCP_RSP_DO_DTRANSLATE) linkToDTR <- mkConnection_Client("funcp_doDTranslate");

Connection_Client#(FUNCP_REQ_DO_LOADS, FUNCP_RSP_DO_LOADS) linkToLOA <- mkConnection_Client("funcp_doLoads");

Connection_Client#(FUNCP_REQ_DO_STORES, FUNCP_RSP_DO_STORES) linkToSTO <- mkConnection_Client("funcp_doSpeculativeStores");

Connection_Client#(FUNCP_REQ_COMMIT_RESULTS, FUNCP_RSP_COMMIT_RESULTS) linkToLCO <- mkConnection_Client("funcp_commitResults");

Connection_Client#(FUNCP_REQ_COMMIT_STORES, FUNCP_RSP_COMMIT_STORES) linkToGCO <- mkConnection_Client("funcp_commitStores");

Connection_Send#(CONTROL_MODEL_CYCLE_MSG) linkModelCycle <- mkConnection_Send("model_cycle");

Connection_Send#(CONTROL_MODEL_COMMIT_MSG) linkModelCommit <- mkConnection_Send("model_commits");

Page 10: HAsim FPGA-Based Processor Models:    Basic Models

11

Functional: Sample Data Structures

// FUNCP_REQ_DO_ITRANSLATE

typedef struct
{
    CONTEXT_ID contextId;
    ISA_ADDRESS virtualAddress;    // Virtual address to translate
}
FUNCP_REQ_DO_ITRANSLATE
    deriving (Eq, Bits);

// FUNCP_RSP_DO_ITRANSLATE

typedef struct
{
    CONTEXT_ID contextId;
    MEM_ADDRESS physicalAddress;   // Result of translation.
    MEM_OFFSET offset;             // Offset of the instruction.
    Bool fault;                    // Translation failure: fault will be raised on
                                   // attempts to commit this token. physicalAddress
                                   // is on the guard page, so it can still be used
                                   // in order to simplify timing model logic.
    Bool hasMore;                  // More translations coming? (The fetch spans two addresses.)
}
FUNCP_RSP_DO_ITRANSLATE
    deriving (Eq, Bits);

Page 11: HAsim FPGA-Based Processor Models:    Basic Models

12

Step 1: Basic Unpipelined Target

Page 12: HAsim FPGA-Based Processor Models:    Basic Models

13

Key Components

• How do I start running?
• Signal completion?
• Execute instructions?

Page 13: HAsim FPGA-Based Processor Models:    Basic Models

14

The “Starter”

• Global controller is in software
  – Orchestrates program loading
  – Tells local controllers when to begin
• Local controllers (FPGA)
  – Associated with target machine pipeline stage models
  – LI (like most of the API)

LOCAL_CONTROLLER#(MAX_NUM_CPUS) localCtrl <- mkLocalController(inports, outports);

let cpu_iid <- localCtrl.startModelCycle();

localCtrl.endModelCycle(cpu_iid, 1);

Page 14: HAsim FPGA-Based Processor Models:    Basic Models

15

Timing: Starting the Pipeline

rule stage1_itrReq (True);

    // Begin a new model cycle.
    let cpu_iid <- localCtrl.startModelCycle();
    linkModelCycle.send(cpu_iid);
    debugLog.nextModelCycle(cpu_iid);

Page 15: HAsim FPGA-Based Processor Models:    Basic Models

16

Must Drive the Functional Pipeline

[Figure: Functional Pipeline with stages ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit and Global Commit, operating on Functional State]

Page 16: HAsim FPGA-Based Processor Models:    Basic Models

17

Timing Model

[Figure: Timing Pipeline holding the IP and producing the Next IP, driving the Functional Pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) and its Functional State]

• Timing & functional models communicate state using tokens
• Minimal timing model:
  – Only state is IP
  – Drives a single token at a time
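The minimal timing model can be sketched in a few lines. This Python illustration makes several assumptions: `ToyFunctionalModel`, its methods, the three-instruction toy ISA ("nop", "jump", "halt") and the fixed 4-byte instruction size are all invented for the example and only loosely mirror the functional-model API names. What it shows is that the timing side keeps a single piece of state, the instruction pointer, and pushes one token at a time through the functional stages.

```python
class ToyFunctionalModel:
    """Stand-in for the functional model: owns all machine state."""

    def __init__(self, program):
        self.program = program   # maps pc -> ("nop",) | ("jump", target) | ("halt",)
        self.tokens = {}         # token -> (pc, instruction)
        self.next_token = 0
        self.committed = []      # pcs in commit order

    def do_itranslate(self, virtual_pc):
        return virtual_pc        # identity mapping in this sketch

    def get_instruction(self, physical_pc):
        tok = self.next_token
        self.next_token += 1
        self.tokens[tok] = (physical_pc, self.program[physical_pc])
        return tok

    def get_results(self, tok):
        pc, inst = self.tokens[tok]
        if inst[0] == "jump":
            return "taken", inst[1]
        if inst[0] == "halt":
            return "halt", pc
        return "next", pc + 4

    def commit(self, tok):
        pc, _ = self.tokens.pop(tok)
        self.committed.append(pc)


def run_unpipelined(funcp, start_pc, max_cycles=100):
    """Unpipelined target: one instruction per model cycle."""
    pc = start_pc                # the only state in the timing model
    cycles = 0
    while cycles < max_cycles:
        cycles += 1
        physical = funcp.do_itranslate(pc)
        tok = funcp.get_instruction(physical)
        kind, next_pc = funcp.get_results(tok)
        funcp.commit(tok)
        if kind == "halt":
            break
        pc = next_pc
    return cycles
```

The timing loop never looks at instruction encodings or register values; it only forwards the token and takes the next IP from the execute result.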

Page 17: HAsim FPGA-Based Processor Models:    Basic Models

18

Timing: Stage 1 (ITranslate)

rule stage1_itrReq (True);

    // Begin a new model cycle.
    let cpu_iid <- localCtrl.startModelCycle();
    linkModelCycle.send(cpu_iid);
    debugLog.nextModelCycle(cpu_iid);

    // Translate next pc.
    Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid];
    let ctx_id = getContextId(cpu_iid);
    linkToITR.makeReq(initFuncpReqDoITranslate(ctx_id, pc));
    debugLog.record_next_cycle(cpu_iid, $format("Translating virtual address: 0x%h", pc));

endrule

pcPool is the only state in the timing model!

Page 18: HAsim FPGA-Based Processor Models:    Basic Models

19

Timing: Fetch

rule stage2_itrRsp_fetReq (True);

    // Get the ITrans response started by stage1_itrReq
    let rsp = linkToITR.getResp();
    linkToITR.deq();

    let cpu_iid = getCpuInstanceId(rsp.contextId);

    debugLog.record(cpu_iid, $format("ITR Responded, hasMore: %0d", rsp.hasMore));

    // Fetch the next instruction
    linkToFET.makeReq(initFuncpReqGetInstruction(rsp.contextId, rsp.physicalAddress, rsp.offset));
    debugLog.record(cpu_iid, $format("Fetching physical address: 0x%h, offset: 0x%h", rsp.physicalAddress, rsp.offset));

endrule

Page 19: HAsim FPGA-Based Processor Models:    Basic Models

21

Timing: Execute Response

rule stage5_exeRsp (True);

    // Get the execution result
    let exe_resp = linkToEXE.getResp();
    linkToEXE.deq();

    let tok = exe_resp.token;
    let res = exe_resp.result;

    let cpu_iid = tokCpuInstanceId(tok);
    Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid];

    // If it was a branch we must update the PC.
    case (res) matches
        tagged RBranchTaken .addr:
        begin
            pc <= addr;
        end

        tagged RBranchNotTaken .addr:
        begin
            pc <= pc + exe_resp.instructionSize;
        end

        default:
        begin
            pc <= pc + exe_resp.instructionSize;
        end
    endcase

    . . .

Page 20: HAsim FPGA-Based Processor Models:    Basic Models

23

Timing: Execute (Minimal Requirement)

• Send the PC to the functional model
• Receive the updated PC from the functional model

Page 21: HAsim FPGA-Based Processor Models:    Basic Models

24

Functional: Execute

• Read input registers required from token scoreboard
• Wait for input registers to have valid values
  – Functional model supports OOO models but enforces execution in a valid order
  – Timing model must execute instructions in a valid order or simulation will deadlock
• Forward values to an ISA-specific data path
• Write result to physical register(s)
  – Functional model may return without finishing result register write (e.g. floating point emulation)
  – State returned to the timing model (e.g. branch resolution) must be computed before return
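The deadlock condition in the bullets above can be made concrete with a small sketch. This Python example rests on assumptions: `PhysRegFile`, `try_execute` and `run_in_order` are hypothetical names, and the execute stage is assumed to serve requests strictly in arrival order. An execute request whose source registers are not yet valid blocks the head of the queue, so a timing model that issues a consumer before its producer never makes progress.

```python
class PhysRegFile:
    """Physical registers with a valid bit per register."""

    def __init__(self, size):
        self.values = [0] * size
        self.valid = [False] * size

    def write(self, reg, value):
        self.values[reg] = value
        self.valid[reg] = True


def try_execute(regs, srcs, dst, op):
    """Execute only if every source register already holds a valid value."""
    if not all(regs.valid[s] for s in srcs):
        return False
    regs.write(dst, op(*(regs.values[s] for s in srcs)))
    return True


def run_in_order(regs, requests):
    """Serve execute requests in the order the timing model sent them.

    Returns how many executed before the head request blocked. A blocked
    head can never be unblocked by a request queued behind it, which is
    the deadlock case."""
    executed = 0
    for srcs, dst, op in requests:
        if not try_execute(regs, srcs, dst, op):
            break
        executed += 1
    return executed
```

With r0 = 2 and r1 = 3 valid, issuing `r2 = r0 + r1` before `r3 = r2 * r0` executes both requests; swapping the two executes none.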

Page 22: HAsim FPGA-Based Processor Models:    Basic Models

25

Hybrid Instruction Emulation (Infrequent/Complicated Instructions)

[Figure: timeline of hybrid instruction emulation across FPGA and Software. Components: Execute, Functional Cache, RRR Layer, Emulation Server, Functional Instruction Simulator, Memory Server. Messages: Sync Registers, Emulate Instruction, Write Line, Write Back or Invalidate, Ack, Emulation Done, Done. Horizontal axis: Time.]

Page 24: HAsim FPGA-Based Processor Models:    Basic Models

27

Step 2: Simple Pipelined Target

Page 25: HAsim FPGA-Based Processor Models:    Basic Models

28

Model Performance Goal: Pipeline Parallelism

[Figure: Timing Pipeline holding several IPs and producing Next IPs, driving the Functional Pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) and its Functional State]

• Unpipelined target necessarily serialized functional model calls
• Pipelined target could run each stage in parallel
• Must solve one problem: how to manage time

Page 26: HAsim FPGA-Based Processor Models:    Basic Models

29

Managing Time: A-Ports and Soft Connections

FPGA cycles != simulated cycles

– HAsim computes target time algorithmically– We are building a timing model, NOT a prototype

– 1:n cycle mapping would force us to slow the timing clock to the longest operation, even if it is infrequent

– 1:n would force us either to fit an entire design on the FPGA or synchronize clock domains

Page 27: HAsim FPGA-Based Processor Models:    Basic Models

30

Option #1: Global Controller [rejected]

Central controller advances cycle when all modules are ready
• Slowest possible cycle no longer dictates throughput
• However:
  – Place & route becomes difficult
  – Long signal to global controller is on the critical path

[Figure: FET, DEC, EXE, MEM and WB stages connected to a central Controller holding curCC]

Page 28: HAsim FPGA-Based Processor Models:    Basic Models

31

Option #2: A-Ports

• Extension of Asim named channels
• FIFO with user-specified latency and capacity
• Pass exactly one message per cycle per port
  – Beginning of model cycle: read all input ports
  – End of model cycle: write all output ports

ISFPGA 2008 Paper: A-Ports: An Efficient Abstraction for Cycle-Accurate Performance Models on FPGAs

[Figure: FET, DEC, EXE, MEM and WB stages connected by A-Ports, each port labeled with a number]
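The A-Port discipline can be sketched as a FIFO that starts out holding `latency` empty messages. The Python below is an illustration, not the HAsim or A-Ports implementation: the class `APort`, the default capacity of `latency + 1`, and the `run` driver with its named schedule are assumptions. Each module advances its own model cycle whenever its input has a message and its output has room, so no global clock is needed.

```python
from collections import deque


class APort:
    """FIFO channel with a model-cycle latency and a bounded capacity."""

    def __init__(self, latency, capacity=None):
        self.capacity = latency + 1 if capacity is None else capacity
        # Pre-fill with `latency` empty messages so that the message
        # written in model cycle c is read in model cycle c + latency.
        self.queue = deque([None] * latency)

    def can_receive(self):
        return len(self.queue) > 0

    def can_send(self):
        return len(self.queue) < self.capacity

    def receive(self):
        return self.queue.popleft()

    def send(self, message):
        self.queue.append(message)


def run(schedule, model_cycles=5, latency=2):
    """Simulate producer -> A-Port -> consumer.

    `schedule` is the order in which the host gives each module a chance
    to run; a module only advances its model cycle when its port is ready.
    Returns the consumer's log of (model_cycle, message_received)."""
    port = APort(latency)
    cycle = {"producer": 0, "consumer": 0}
    log = []
    while cycle["consumer"] < model_cycles:
        for name in schedule:
            if name == "producer" and port.can_send():
                # End of the producer's model cycle: write the output port.
                port.send(cycle["producer"])
                cycle["producer"] += 1
            elif (name == "consumer" and port.can_receive()
                    and cycle["consumer"] < model_cycles):
                # Beginning of the consumer's model cycle: read the input port.
                log.append((cycle["consumer"], port.receive()))
                cycle["consumer"] += 1
    return log
```

Running with the schedule `["consumer", "producer"]` or with `["producer", "producer", "consumer"]` yields the same log: the consumer sees nothing in model cycles 0 and 1 and then receives the producer's cycle-0 message in cycle 2. Host scheduling changes how fast the simulation runs, not what it computes.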

Page 29: HAsim FPGA-Based Processor Models:    Basic Models

32

Now the Timing Model is LI!

• Each target machine pipeline stage may take multiple FPGA cycles
• This reduces
  – Algorithmic complexity
  – FPGA timing pressure
    • Less work per cycle
    • May use FPGA area-efficient code (e.g. linear search replacing a CAM)

Page 30: HAsim FPGA-Based Processor Models:    Basic Models

33

Target Machine Model Pipeline Stage Conventions

• Separate source module for each target pipeline stage
• Private local controller for each module
  – Local controllers automatically assemble themselves onto a ring
  – All are managed by the global controller
• All external communication is via A-Ports
  – Functional model API is only exception (we have debated switching to A-Ports)

Page 31: HAsim FPGA-Based Processor Models:    Basic Models

34

Example: In-order Decode FPGA Pipeline Stages

These FPGA stages represent one target machine cycle:

1. Receive register scoreboard writebacks from EXE and MEM stages

2. Consume faults from EXE and COMMIT stages (trigger rewind)

3. Consume dependence info (scoreboard state and writebacks)

4. Attempt issue (if data ready and EXE slot available)

5. Update local state (scoreboard)

inorder-decode-stage.bsv
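The five steps can be condensed into one function to show how they combine into a single target machine cycle. This Python sketch is not derived from inorder-decode-stage.bsv: the function name, the representation of the scoreboard as a set of busy register numbers, and the choice to simply suppress issue on a fault (without modeling the rewind) are assumptions for illustration. In the FPGA model each step is a separate stage that may take its own FPGA cycles.

```python
def decode_model_cycle(busy_regs, writebacks, fault, instr, exe_slot_free):
    """One target cycle of a simplified in-order decode stage.

    busy_regs     -- set of registers with a pending write (the scoreboard)
    writebacks    -- registers written back this cycle by EXE and MEM
    fault         -- True if EXE or COMMIT reported a fault this cycle
    instr         -- (source_regs, dest_reg) waiting to issue, or None
    exe_slot_free -- True if the EXE stage can accept an instruction

    Returns (issued_instr_or_None, new_busy_regs)."""
    busy = set(busy_regs)

    # 1. Receive register scoreboard writebacks.
    busy -= set(writebacks)

    # 2. Consume faults. The rewind itself is not modeled here; the
    #    sketch only suppresses issue in the faulting cycle.
    if fault:
        return None, busy

    # 3. Consume dependence info for the waiting instruction.
    if instr is None:
        return None, busy
    srcs, dst = instr
    data_ready = busy.isdisjoint(srcs)

    # 4. Attempt issue.
    if not (data_ready and exe_slot_free):
        return None, busy

    # 5. Update local state: the destination is now pending.
    if dst is not None:
        busy.add(dst)
    return instr, busy
```

For example, an instruction reading register 3 stalls while 3 is busy and issues in the cycle in which the writeback for register 3 arrives.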

Page 32: HAsim FPGA-Based Processor Models:    Basic Models

35

Key Observation: Parallel Target Machine Yields Parallel FPGA Model

• Each pipeline stage is a separate module
• Only dependence between modules is through A-Ports
• A-Ports are parallel
• Parallelism in the model is proportional to the parallelism of the target

Page 33: HAsim FPGA-Based Processor Models:    Basic Models

36

Pipeline Parallelism

[Figure: Timing Pipeline holding several IPs and producing Next IPs, driving the Functional Pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) and its Functional State]

• Model of a pipelined design naturally runs pipelined on an FPGA
• Detailed model of a pipelined design runs faster than a trivial, unpipelined model!

Page 34: HAsim FPGA-Based Processor Models:    Basic Models

37

Aggregate Simulator Throughput (Parsec Black-Scholes)

Page 35: HAsim FPGA-Based Processor Models:    Basic Models

39

Modeling Caches

Page 36: HAsim FPGA-Based Processor Models:    Basic Models

40

What Makes Modeling Caches Hard on FPGAs?

• Storage
• Not much else – cache management is just a set of pipeline stages

Page 37: HAsim FPGA-Based Processor Models:    Basic Models

41

Modeled Caches Are Not as Big as You Might Think

• Only model tags – no data in the timing model
• Cache model is LI
  – Connected only by A-Ports
  – FPGA-latency of the cache tag storage is irrelevant!
  – Tag storage on the FPGA may be hierarchical
  – Build FPGA cache to model a target-machine cache, but they are unrelated (LEAP provides automatically generated caches)

Terminology gets messy, but the implementation does not. LEAP scratchpad storage has the same interface as an array.
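A tag-only cache model is small because it answers just one question per access: hit or miss. The Python sketch below is illustrative and assumes a set-associative organization with LRU replacement; the class `TagOnlyCache` and its parameters are hypothetical, and plain Python lists stand in for whatever tag storage (for example a scratchpad with an array-like interface) the FPGA model would use.

```python
class TagOnlyCache:
    """Set-associative cache model that stores tags but no data."""

    def __init__(self, num_sets=4, ways=2, line_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One list of tags per set, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)      # mark as most recently used
            return True
        if len(tags) == self.ways:
            tags.pop(0)           # evict the least recently used tag
        tags.append(tag)
        return False
```

The timing model only needs the hit/miss outcome to decide how many target cycles an access takes; the data itself stays in the functional model.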

Page 38: HAsim FPGA-Based Processor Models:    Basic Models

42

Multiple Cores and Shared Cache

• Later, we will consider multiplexed timing pipelines
• On-chip network connecting a shared cache becomes interesting (interleaving varies with OCN topology)