HAsim FPGA-Based Processor Models: Basic Models

Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer

Managing Complexity

How do we reduce the work of writing software models?
– Re-use components
– Split functional & timing models
– Model only what is necessary to compute target performance
– Limit data manipulation in timing models, e.g.:
  • Manage cache tags but not data
  • Let functional model handle DEC and EXE details

Same on FPGA, plus:
– Use software for operations that don’t affect model speed

Functional Model

• ISA semantics only – no timing
• Similarities to software models:
  – ISA-agnostic code implements core functions:
    • Register file
    • Virtual to physical address translation
    • Store buffer
  – ISA-specific code:
    • Decode
    • Execute

HAsim currently implements Alpha and SMIPS

http://csg.csail.mit.edu/6.884/handouts/labs/smips-spec.pdf

Functional Model is Hybrid FPGA / Software

• Difficult but infrequent tasks do not need to be accelerated
• HAsim uses gem5 for:
  – Loading programs
  – Target machine address space management
  – Emulation of rare & complex instructions

Functional Model is Latency Insensitive (LI)

• LI design enables hybrid FPGA / software implementation
  – Functional memory is cached on FPGA, homed on host
  – Emulate system calls (gem5 user-mode) in software
• Area-efficient FPGA implementation
  – Store buffer uses a hash table instead of a CAM
  – Choose number of register-file read ports to handle the common case
  – Serialized rewind
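
The hash-table store buffer can be illustrated with a small software model. This is a sketch in Python, not HAsim code: the class name, the modulo hash, and the use of monotonically increasing integer tokens to express instruction age are all assumptions made for illustration. The point is that a lookup walks one bucket sequentially, which is acceptable because the functional model is latency insensitive, instead of comparing against every entry in parallel as a CAM would.

```python
class HashedStoreBuffer:
    """Store buffer indexed by a hash of the address (hypothetical sketch)."""

    def __init__(self, num_buckets=16):
        self.num_buckets = num_buckets
        # Each bucket is a small list of (token, addr, value) entries.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, addr):
        # Assumed hash: low-order bits of the address.
        return addr % self.num_buckets

    def insert(self, token, addr, value):
        """Record a speculative store made by instruction `token`."""
        self.buckets[self._index(addr)].append((token, addr, value))

    def lookup(self, token, addr):
        """Return the value of the youngest store to `addr` that is older
        than `token`, or None if no such store is buffered."""
        best = None
        # Sequential walk of one bucket replaces a parallel CAM match.
        for st_token, st_addr, st_value in self.buckets[self._index(addr)]:
            if st_addr == addr and st_token < token:
                if best is None or st_token > best[0]:
                    best = (st_token, st_value)
        return None if best is None else best[1]

    def commit(self, token, addr):
        """Remove the store and return (addr, value) to write to memory."""
        bucket = self.buckets[self._index(addr)]
        for i, (st_token, st_addr, st_value) in enumerate(bucket):
            if st_token == token and st_addr == addr:
                del bucket[i]
                return (st_addr, st_value)
        return None
```

A lookup may take several steps when addresses collide in a bucket; in an LI design that only costs FPGA cycles, not model accuracy.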

Functional Memory

[Figure: timeline of a functional memory read, split between FPGA and software. Labels: Mem Read Interface, Private Mem Cache, Central Cache (Hit / Miss), Mem Interface, Memory (m5).]
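
The read path in the figure (private cache, then central cache, then host memory) can be sketched as a chain of lookups. This is a Python illustration under stated assumptions: the class names, the FIFO eviction policy, and the hit/miss counters are hypothetical and are not taken from HAsim or LEAP.

```python
class HostMemory:
    """Stand-in for the software-side memory that homes all data."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def read(self, addr):
        self.reads += 1
        return self.data.get(addr, 0)


class CacheLevel:
    """One cache level backed by the next level (another cache or the host)."""

    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self.lines = {}  # dicts preserve insertion order, used for FIFO eviction
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.lines:
            self.hits += 1
            return self.lines[addr]
        # Miss: forward the request to the next level, then fill.
        self.misses += 1
        value = self.backing.read(addr)
        if len(self.lines) >= self.capacity:
            oldest = next(iter(self.lines))
            del self.lines[oldest]
        self.lines[addr] = value
        return value
```

Usage: build `CacheLevel(CacheLevel(HostMemory(), 8), 2)` to get a small private cache in front of a larger central cache. Only requests that miss in both levels reach the host.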

Reducing Model Complexity: Shared Functional Model

[Figure: functional pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) operating on shared functional state.]

• Similar philosophy to software models:
  – Write ISA functional model once
  – Functional machine state is completely managed
  – Timing models can be ISA-independent
• Each functional pipeline stage behaves like a request/response FIFO

ISPASS 2008 Paper: Quick Performance Models Quickly: Timing-Directed Simulation on FPGAs

http://asim.csail.mit.edu/redmine/attachments/79/200804_ISPASS_Funcp.pdf
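
The request/response FIFO behavior of a functional stage can be sketched in a few lines. This Python model is an illustration only: the class and method names are hypothetical, loosely mirroring the makeReq / getResp / deq calls that appear in the timing-model code later in this deck, and the translation function in the usage note is made up.

```python
from collections import deque


class FunctionalStage:
    """A functional pipeline stage seen as a request/response FIFO pair."""

    def __init__(self, name, operation):
        self.name = name
        self.operation = operation  # function mapping a request to a response
        self.requests = deque()
        self.responses = deque()

    def make_req(self, req):
        """Enqueue a request; the caller does not wait for the result."""
        self.requests.append(req)

    def step(self):
        """Process at most one pending request. The caller makes no
        assumption about how many steps a response takes to appear."""
        if self.requests:
            self.responses.append(self.operation(self.requests.popleft()))

    def resp_ready(self):
        return bool(self.responses)

    def get_resp(self):
        """Peek at the oldest response without removing it."""
        return self.responses[0]

    def deq(self):
        """Remove the oldest response."""
        self.responses.popleft()
```

Usage: `itr = FunctionalStage("itranslate", lambda va: va + 0x1000)` models a stage whose responses come back in request order after an unspecified delay.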

FPGA Functional Detail: Token Indexed Scoreboard

• Functional model maintains a “token” for each active instruction
• Like a pointer, but better on hardware than new, delete and global storage
• Token indexed state in functional scoreboard:
  – Decode information (isLoad, isStore, …)
  – Architectural to physical register mapping
  – Memory information
  – Model verification logic (e.g. prove that an instruction properly flows through all functional stages)
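
A token-indexed scoreboard can be pictured as a fixed-size table where the token is simply the row index. The following Python sketch assumes a round-robin allocator and a per-token list of completed stages for the verification check; the field names and the allocation policy are hypothetical, not HAsim's.

```python
class TokenScoreboard:
    """Fixed-size table of per-instruction state, indexed by token."""

    def __init__(self, num_tokens=8):
        self.entries = [None] * num_tokens
        self.next_index = 0

    def allocate(self):
        """Hand out the next token index; fail if that slot is still live."""
        idx = self.next_index
        if self.entries[idx] is not None:
            raise RuntimeError("out of tokens")
        self.entries[idx] = {
            "is_load": False,
            "is_store": False,
            "reg_map": {},        # architectural -> physical register
            "stages_done": [],    # used by the verification check
        }
        self.next_index = (idx + 1) % len(self.entries)
        return idx

    def record_stage(self, token, stage):
        self.entries[token]["stages_done"].append(stage)

    def free(self, token, expected_stages):
        """Release a token, checking that it flowed through every stage."""
        done = self.entries[token]["stages_done"]
        if done != expected_stages:
            raise ValueError("token %d skipped a stage: %r" % (token, done))
        self.entries[token] = None
```

Because the table has a fixed size and the token is an index, no dynamic allocation is needed, which is what makes the scheme a good fit for hardware.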

Functional Model API (Named Connections)

Connection_Client#(FUNCP_REQ_DO_ITRANSLATE, FUNCP_RSP_DO_ITRANSLATE) linkToITR <- mkConnection_Client("funcp_doITranslate");

Connection_Client#(FUNCP_REQ_GET_INSTRUCTION, FUNCP_RSP_GET_INSTRUCTION) linkToFET <- mkConnection_Client("funcp_getInstruction");

Connection_Client#(FUNCP_REQ_GET_DEPENDENCIES, FUNCP_RSP_GET_DEPENDENCIES) linkToDEC <- mkConnection_Client("funcp_getDependencies");

Connection_Client#(FUNCP_REQ_GET_RESULTS, FUNCP_RSP_GET_RESULTS) linkToEXE <- mkConnection_Client("funcp_getResults");

Connection_Client#(FUNCP_REQ_DO_DTRANSLATE, FUNCP_RSP_DO_DTRANSLATE) linkToDTR <- mkConnection_Client("funcp_doDTranslate");

Connection_Client#(FUNCP_REQ_DO_LOADS, FUNCP_RSP_DO_LOADS) linkToLOA <- mkConnection_Client("funcp_doLoads");

Connection_Client#(FUNCP_REQ_DO_STORES, FUNCP_RSP_DO_STORES) linkToSTO <- mkConnection_Client("funcp_doSpeculativeStores");

Connection_Client#(FUNCP_REQ_COMMIT_RESULTS, FUNCP_RSP_COMMIT_RESULTS) linkToLCO <- mkConnection_Client("funcp_commitResults");

Connection_Client#(FUNCP_REQ_COMMIT_STORES, FUNCP_RSP_COMMIT_STORES) linkToGCO <- mkConnection_Client("funcp_commitStores");

Connection_Send#(CONTROL_MODEL_CYCLE_MSG) linkModelCycle <- mkConnection_Send("model_cycle");

Connection_Send#(CONTROL_MODEL_COMMIT_MSG) linkModelCommit <- mkConnection_Send("model_commits");

Functional: Sample Data Structures

// FUNCP_REQ_DO_ITRANSLATE

typedef struct
{
    CONTEXT_ID contextId;
    ISA_ADDRESS virtualAddress;    // Virtual address to translate
}
FUNCP_REQ_DO_ITRANSLATE
    deriving (Eq, Bits);

// FUNCP_RSP_DO_ITRANSLATE

typedef struct
{
    CONTEXT_ID contextId;
    MEM_ADDRESS physicalAddress;   // Result of translation.
    MEM_OFFSET offset;             // Offset of the instruction.
    Bool fault;                    // Translation failure: fault will be raised on
                                   // attempts to commit this token. physicalAddress
                                   // is on the guard page, so it can still be used
                                   // in order to simplify timing model logic.
    Bool hasMore;                  // More translations coming? (The fetch spans two addresses.)
}
FUNCP_RSP_DO_ITRANSLATE
    deriving (Eq, Bits);

Step 1: Basic Unpipelined Target

Key Components

• How do I start running?
• Signal completion?
• Execute instructions?

The “Starter”

• Global controller is in software
  – Orchestrates program loading
  – Tells local controllers when to begin
• Local controllers (FPGA)
  – Associated with target machine pipeline stage models
  – LI (like most of the API)

LOCAL_CONTROLLER#(MAX_NUM_CPUS) localCtrl <- mkLocalController(inports, outports);

let cpu_iid <- localCtrl.startModelCycle();

localCtrl.endModelCycle(cpu_iid, 1);

Timing: Starting the Pipeline

rule stage1_itrReq (True);

    // Begin a new model cycle.
    let cpu_iid <- localCtrl.startModelCycle();
    linkModelCycle.send(cpu_iid);
    debugLog.nextModelCycle(cpu_iid);

    …

Must Drive the Functional Pipeline

[Figure: functional pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) and its functional state.]

Timing Model

[Figure: timing pipeline (IP, Next IP) driving the functional pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) and its functional state.]

• Timing & functional models communicate state using tokens
• Minimal timing model:
  – Only state is IP
  – Drives a single token at a time

Timing: Stage 1 (ITranslate)

rule stage1_itrReq (True);

    // Begin a new model cycle.
    let cpu_iid <- localCtrl.startModelCycle();
    linkModelCycle.send(cpu_iid);
    debugLog.nextModelCycle(cpu_iid);

    // Translate next pc.
    Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid];
    let ctx_id = getContextId(cpu_iid);
    linkToITR.makeReq(initFuncpReqDoITranslate(ctx_id, pc));
    debugLog.record_next_cycle(cpu_iid, $format("Translating virtual address: 0x%h", pc));

endrule

pcPool is the only state in the timing model!

Timing: Fetch

rule stage2_itrRsp_fetReq (True);

    // Get the ITrans response started by stage1_itrReq
    let rsp = linkToITR.getResp();
    linkToITR.deq();

    let cpu_iid = getCpuInstanceId(rsp.contextId);

    debugLog.record(cpu_iid, $format("ITR Responded, hasMore: %0d", rsp.hasMore));

    // Fetch the next instruction
    linkToFET.makeReq(initFuncpReqGetInstruction(rsp.contextId, rsp.physicalAddress, rsp.offset));
    debugLog.record(cpu_iid, $format("Fetching physical address: 0x%h, offset: 0x%h", rsp.physicalAddress, rsp.offset));

endrule

Timing: Execute Response

rule stage5_exeRsp (True);

    // Get the execution result
    let exe_resp = linkToEXE.getResp();
    linkToEXE.deq();

    let tok = exe_resp.token;
    let res = exe_resp.result;

    let cpu_iid = tokCpuInstanceId(tok);
    Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid];

    // If it was a branch we must update the PC.
    case (res) matches
        tagged RBranchTaken .addr:
        begin
            pc <= addr;
        end

        tagged RBranchNotTaken .addr:
        begin
            pc <= pc + exe_resp.instructionSize;
        end

        default:
        begin
            pc <= pc + exe_resp.instructionSize;
        end
    endcase

    . . .

Timing: Execute (Minimal Requirement)

• Send the PC to the functional model
• Receive the updated PC from the functional model
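
The minimal requirement can be modeled in software as a loop whose only state is the PC. This Python sketch is an illustration, not the HAsim implementation: the `execute` stub stands in for the whole functional model, and the toy program encoding and fixed 4-byte instruction size are assumptions.

```python
INSTRUCTION_SIZE = 4  # assumed fixed instruction size for this sketch


def execute(program, pc):
    """Hypothetical stand-in for the functional model: report whether the
    instruction at `pc` is a taken branch and, if so, its target."""
    kind, target = program.get(pc, ("nop", None))
    if kind == "branch":
        return ("taken", target)
    return ("fallthrough", None)


def run_unpipelined(program, start_pc, model_cycles):
    """Drive one instruction per model cycle; the PC is the only state."""
    pc = start_pc
    trace = []
    for _ in range(model_cycles):
        trace.append(pc)
        # Send the PC to the functional model...
        result, target = execute(program, pc)
        # ...and update the PC from what it returns.
        if result == "taken":
            pc = target
        else:
            pc = pc + INSTRUCTION_SIZE
    return trace
```

Usage: with `program = {0: ("nop", None), 4: ("branch", 16)}`, calling `run_unpipelined(program, 0, 3)` visits PCs 0, 4 and 16.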

Functional: Execute

• Read input registers required from token scoreboard
• Wait for input registers to have valid values
  – Functional model supports OOO models but enforces execution in a valid order
  – Timing model must execute instructions in a valid order or simulation will deadlock
• Forward values to an ISA-specific data path
• Write result to physical register(s)
  – Functional model may return without finishing result register write (e.g. floating point emulation)
  – State returned to the timing model (e.g. branch resolution) must be computed before return

Hybrid Instruction Emulation (Infrequent/Complicated Instructions)

[Figure: timeline of hybrid instruction emulation, split between FPGA and software and connected through the RRR Layer. Components: Execute, Functional Cache, Emulation Server, Memory Server, Functional Instruction Simulator. Messages: Sync Registers, Write Back or Invalidate, Emulate Instruction, Write Line, Emulation Done, Ack, Done.]

Unpipelined Model Source

Single source file: unpipelined-pipeline.bsv

http://asim.csail.mit.edu/redmine/projects/hasim-models/repository/entry/trunk/modules/hasim/timing-models/pipeline/unpipelined/no-cache/unpipelined-pipeline.bsv

Step 2: Simple Pipelined Target

Model Performance Goal: Pipeline Parallelism

[Figure: timing pipeline (IPs, Next IPs) driving the functional pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) and its functional state.]

• Unpipelined target necessarily serialized functional model calls

• Pipelined target could run each stage in parallel

• Must solve one problem: how to manage time

Managing Time: A-Ports and Soft Connections

FPGA cycles != simulated cycles

– HAsim computes target time algorithmically– We are building a timing model, NOT a prototype

– 1:n cycle mapping would force us to slow the timing clock to the longest operation, even if it is infrequent

– 1:n would force us either to fit an entire design on the FPGA or synchronize clock domains
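
The cost of a fixed 1:n mapping can be shown with simple arithmetic. In this Python sketch the per-model-cycle FPGA latencies are made-up numbers, chosen only to show how one infrequent slow operation dominates when every model cycle must be sized for the worst case.

```python
def fpga_cycles_fixed_ratio(op_latencies):
    """1:n mapping: every model cycle takes as long as the slowest operation."""
    return max(op_latencies) * len(op_latencies)


def fpga_cycles_decoupled(op_latencies):
    """Decoupled model time: each model cycle takes only the FPGA cycles
    that its own operation needs."""
    return sum(op_latencies)


# Hypothetical trace: five model cycles, one of which needs a slow operation.
latencies = [1, 1, 1, 10, 1]
fixed = fpga_cycles_fixed_ratio(latencies)      # 10 * 5 = 50
decoupled = fpga_cycles_decoupled(latencies)    # 1 + 1 + 1 + 10 + 1 = 14
```

The rarer the slow operation, the larger the gap between the two totals.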

Option #1: Global Controller [rejected]

Central controller advances cycle when all modules are ready
• Slowest possible cycle no longer dictates throughput
• However:
  – Place & route becomes difficult
  – Long signal to global controller is on the critical path

[Figure: FET, DEC, EXE, MEM and WB stages all connected to a central Controller holding curCC.]

Option #2: A-Ports

• Extension of Asim named channels
• FIFO with user-specified latency and capacity
• Pass exactly one message per cycle per port
  – Beginning of model cycle: read all input ports
  – End of model cycle: write all output ports

[Figure: FET, DEC, EXE, MEM and WB stages connected by A-Ports, each labeled with a number.]

ISFPGA 2008 Paper: A-Ports: An Efficient Abstraction for Cycle-Accurate Performance Models on FPGAs

http://asim.csail.mit.edu/redmine/attachments/78/200802_ISFPGA_APorts.pdf
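
The port behavior described above can be approximated by a FIFO that is pre-filled with one empty message per cycle of latency. The Python sketch below is a software analogy, not the Bluespec implementation from the paper: the class name, the use of None for "no message", and the requirement that capacity exceed latency are assumptions of this sketch.

```python
from collections import deque


class APort:
    """FIFO with a model-cycle latency and a bounded capacity (sketch)."""

    def __init__(self, latency, capacity):
        if capacity <= latency:
            raise ValueError("capacity must exceed latency in this sketch")
        self.capacity = capacity
        # Pre-fill with `latency` empty messages, so the reader's model
        # cycle n sees what the writer produced in model cycle n - latency.
        self.fifo = deque([None] * latency)

    def can_write(self):
        return len(self.fifo) < self.capacity

    def write(self, msg):
        """End of the writer's model cycle: exactly one write,
        with None meaning 'no message this cycle'."""
        if not self.can_write():
            raise RuntimeError("A-Port full: writer must wait")
        self.fifo.append(msg)

    def can_read(self):
        return len(self.fifo) > 0

    def read(self):
        """Beginning of the reader's model cycle: exactly one read."""
        if not self.can_read():
            raise RuntimeError("A-Port empty: reader must wait")
        return self.fifo.popleft()
```

Writer and reader each advance their own model cycle whenever their ports allow it, so neither needs a global clock. A port with latency 0 forces the reader to wait for the writer within the same model cycle.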

Now the Timing Model is LI!

• Each target machine pipeline stage may take multiple FPGA cycles
• This reduces
  – Algorithmic complexity
  – FPGA timing pressure
• Less work per cycle
• May use FPGA area-efficient code (e.g. linear search replacing a CAM)

Target Machine Model Pipeline Stage Conventions

• Separate source module for each target pipeline stage
• Private local controller for each module
  – Local controllers automatically assemble themselves onto a ring
  – All are managed by the global controller
• All external communication is via A-Ports
  – Functional model API is only exception (we have debated switching to A-Ports)

Example: In-order Decode FPGA Pipeline Stages

These FPGA stages represent one target machine cycle:

1. Receive register scoreboard writebacks from EXE and MEM stages

2. Consume faults from EXE and COMMIT stages (trigger rewind)

3. Consume dependence info (scoreboard state and writebacks)

4. Attempt issue (if data ready and EXE slot available)

5. Update local state (scoreboard)

inorder-decode-stage.bsv

http://asim.csail.mit.edu/redmine/projects/hasim-models/repository/entry/trunk/modules/hasim/timing-models/pipeline/inorder/decode/inorder-decode-stage.bsv
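
The five steps above can be condensed into a single function that computes one target machine cycle of an in-order decode stage. This Python sketch is a heavy simplification and is not derived from inorder-decode-stage.bsv: the set-of-ready-registers scoreboard, the instruction encoding as (sources, destination), and the treatment of a fault as "nothing issues this cycle" are all assumptions.

```python
def decode_model_cycle(ready, writebacks, fault, inst, exe_slot_free):
    """Compute one model cycle of a simplified in-order decode stage.

    ready         -- set of register numbers whose values are available
    writebacks    -- registers written back by EXE and MEM this cycle
    fault         -- True if EXE or COMMIT reported a fault
    inst          -- (source_registers, destination_register) or None
    exe_slot_free -- True if EXE can accept an instruction

    Returns (new_ready, issued).
    """
    # 1. Receive scoreboard writebacks.
    ready = set(ready) | set(writebacks)

    # 2. Consume faults; in this sketch a fault just blocks issue.
    if fault or inst is None:
        return ready, False

    # 3. Consume dependence info for the instruction at the head.
    sources, dest = inst

    # 4. Attempt issue: data ready and EXE slot available.
    issued = exe_slot_free and all(r in ready for r in sources)

    # 5. Update local state: the destination is busy until written back.
    if issued and dest is not None:
        ready.discard(dest)
    return ready, issued
```

On the FPGA each numbered step may occupy its own FPGA cycle; here they run back to back, which is fine because only the result per model cycle matters.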



Key Observation: Parallel Target Machine Yields Parallel FPGA Model

• Each pipeline stage is a separate module
• Only dependence between modules is through A-Ports
• A-Ports are parallel
• Parallelism in the model is proportional to the parallelism of the target

Pipeline Parallelism

[Figure: timing pipeline (IPs, Next IPs) driving the functional pipeline (ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit) and its functional state.]

• Model of a pipelined design naturally runs pipelined on an FPGA

• Detailed model of a pipelined design runs faster than a trivial, unpipelined model!

Aggregate Simulator Throughput (Parsec Black-Scholes)

Modeling Caches

What Makes Modeling Caches Hard on FPGAs?

• Storage
• Not much else – cache management is just a set of pipeline stages

Modeled Caches Are Not as Big as You Might Think

• Only model tags – no data in the timing model
• Cache model is LI
  – Connected only by A-Ports
  – FPGA-latency of the cache tag storage is irrelevant!
  – Tag storage on the FPGA may be hierarchical
  – Build FPGA cache to model a target-machine cache, but they are unrelated (LEAP provides automatically generated caches)

Terminology gets messy, but the implementation does not. LEAP scratchpad storage has the same interface as an array.
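
A tag-only cache model needs nothing more than an array of tags per set, which is why it fits behind an array-like storage interface. The Python sketch below assumes a set-associative organization with LRU replacement; the class name, the parameters, and the replacement policy are illustrative choices, not HAsim's.

```python
class TagOnlyCache:
    """Timing-model cache: tracks which lines are present, stores no data."""

    def __init__(self, num_sets, ways, line_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One list of tags per set, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            # Hit: move the tag to the most recently used position.
            tags.remove(tag)
            tags.append(tag)
            return True
        # Miss: evict the least recently used tag if the set is full.
        if len(tags) >= self.ways:
            tags.pop(0)
        tags.append(tag)
        return False
```

The timing model only needs the hit/miss answer to decide how many model cycles an access takes; the data itself stays in the functional model.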

Multiple Cores and Shared Cache

• Later, we will consider multiplexed timing pipelines
• On-chip network connecting a shared cache becomes interesting (interleaving varies with OCN topology)
