
Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures

Calvin Lin

Ramadass Nagarajan, Sundeep Kushwaha, Doug Burger, Kathryn S. McKinley, Stephen W. Keckler

Department of Computer Sciences

The University of Texas at Austin

PACT 2004, October 1, 2004


Architecture and Technology Trends

• Increasing wire delays limit sizes of monolithic structures [Agarwal, ISCA’00]

Need aggressive partitioning

• Clock rate growth shows diminishing returns [Hrishikesh, ISCA’02] [Sprangle, ISCA’02]

Deeper pipelines approaching optimal limits

Need to improve instruction throughput (IPC)

• Conventional architectures and their schedulers are not equipped to deal with these trends

[Figure: 20 mm chip, with labels for the 100 nm, 70 nm, and 35 nm technology nodes]


The Problem with Conventional Approaches

• VLIW approach

Relies completely on compiler to schedule code

+ Eliminates need for dynamic dependence check hardware

+ Good match for partitioning

+ Can minimize communication latencies on critical paths

– Poor tolerance to unpredictable dynamic latencies

– These latencies continue to grow

• Superscalar approach

Hardware dynamically schedules code

+ Can tolerate dynamic latencies

– Quadratic complexity of dependence check hardware

– Not a good match for partitioning

– Difficult to make good placement decisions

– ISA does not allow software to help with instruction placement


Dissecting the Problem

• Scheduling is a two-part problem

Placement: Where an instruction executes

Issue: When an instruction executes

• VLIW represents one extreme

Static Placement and Static Issue (SPSI)

+ Static Placement works well for partitioned architectures

– Static Issue causes problems with unknown latencies

• Superscalars represent another extreme

Dynamic Placement and Dynamic Issue (DPDI)

+ Dynamic Issue tolerates unknown latencies

– Dynamic Placement is difficult in the face of partitioning


Our Solution: EDGE Architectures

• EDGE: Explicit Dataflow Graph Execution

Supports Static Placement and Dynamic Issue (SPDI)

Renegotiates the compiler/hardware binary interface

• An EDGE ISA explicitly encodes the dataflow graph specifying targets

RISC:

i1: movi r1, #10

i2: movi r2, #20

i3: add r3, r2, r1

EDGE:

ALU-1: movi #10, ALU-3

ALU-2: movi #20, ALU-3

ALU-3: add ALU-4

[Figure: dataflow graph in which the mov at ALU-1 and the mov at ALU-2 feed the add at ALU-3]

• Static Placement

Explicit DFG simplifies hardware: no HW dependency analysis!

Results are forwarded directly through point-to-point network: no associative issue queues! no global bypass network!

• Dynamic Instruction Issue

Instructions execute in original dataflow-order
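The EDGE example on this slide can be illustrated with a small simulation. This is a minimal sketch, not a model of the TRIPS hardware: the `Node` class, the `run` function, and the treatment of `ALU-4` as a plain output sink are assumptions made for illustration. Each instruction names its target, and an instruction fires once all of its operands have arrived, so no register lookup or dependence check is involved.

```python
from collections import deque

class Node:
    """One statically placed instruction (hypothetical model)."""
    def __init__(self, op, needed, target, imm=None):
        self.op = op            # "movi" or "add"
        self.needed = needed    # number of operands the instruction waits for
        self.target = target    # name of the consumer ALU
        self.imm = imm          # immediate value for movi
        self.operands = []      # operands delivered so far

def run(program):
    """Fire instructions in dataflow order.

    Returns the firing order and the values sent to targets that hold
    no instruction (treated here as outputs).
    """
    outputs = {}
    # Instructions with no inputs are ready immediately.
    ready = deque(name for name, n in program.items() if n.needed == 0)
    fired = []
    while ready:
        name = ready.popleft()
        node = program[name]
        value = node.imm if node.op == "movi" else sum(node.operands)
        fired.append(name)
        if node.target in program:
            consumer = program[node.target]
            consumer.operands.append(value)   # forward result directly
            if len(consumer.operands) == consumer.needed:
                ready.append(node.target)     # all operands arrived
        else:
            outputs[node.target] = value
    return fired, outputs

program = {
    "ALU-1": Node("movi", 0, "ALU-3", imm=10),
    "ALU-2": Node("movi", 0, "ALU-3", imm=20),
    "ALU-3": Node("add", 2, "ALU-4"),
}
fired, outputs = run(program)
print(fired, outputs)
```

The add at ALU-3 becomes ready only after both movi results have been delivered, which is the dynamic-issue half of SPDI; the mapping of instructions to ALU names is the static-placement half.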


Static Placement and Dynamic Issue (SPDI)

• Combines strengths of static and dynamic schedulers

Static Placement (SP)

Dynamic Issue (DI)

• Benefits for the static scheduler

Precise timing information not required

Can convey placement information to the hardware

• Benefits for the dynamic scheduler

No associative tag match

Tolerates dynamic latencies

• Scheduling Goals

Spread parallelism among numerous execution resources

Minimize on-chip communication latencies


Outline

• Architectural Overview

Execution substrate

Scheduling problem

• SPDI scheduling algorithm

Locality optimizations

Contention optimizations

• Experimental results

• Conclusions


TRIPS Architecture

[Figure: TRIPS block diagram showing the execution array of execution nodes (columns 0 to 3), register banks, I-cache 0 to 3 and I-cache H, D-cache/LSQ 0 to 3, global control, and the branch predictor]

• Topology and latency of interconnect exposed to the static scheduler

• Reduced register pressure


The Scheduling Problem

[Figure: execution node containing a router, control logic, an ALU, and instruction buffers, each entry holding opcode, src1, src2]

Instruction buffers form a logical “z-dimension” in each node

3D scheduling problem

• Instruction buffers add depth to the execution array

2D array of ALUs; 3D volume of instructions


Static Scheduling Problem

[Figure: a CFG split into hyperblocks; the instructions of one hyperblock (ld, shl, add, sub, cmp, sw, br) are mapped onto the execution array, which sits between the register file and the data caches]

• Program split into hyperblocks

• Hyperblocks scheduled onto the entire 4 × 4 × 4 volume


List Scheduling Algorithm

Determine priority order of instructions

Pick the unscheduled instruction (I) with highest priority

For each ALU compute cost of I

Pick ALU (Ai) with minimum cost

Schedule I at Ai

Cost[I] = max(Cost[P1] + Distance[A1, Ai], Cost[P2] + Distance[A2, Ai]) + Latency(I)

[Figure: hyperblock DFG in which parents P1 and P2, placed at ALUs A1 and A2, feed instruction I at candidate ALU Ai]

• Local algorithm – one hyperblock at a time

• No backtracking or re-placement of instructions
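The loop on this slide can be sketched as a single greedy pass. This is an illustrative sketch under stated assumptions, not the authors' scheduler: Distance is taken to be Manhattan hop count on a 2D grid with one cycle per hop, the priority order is supplied by the caller with parents listed before their children, each ALU has a fixed number of instruction slots, and all names (`list_schedule`, `manhattan`, `slots_per_alu`) are hypothetical.

```python
from itertools import product

def manhattan(a, b):
    """Hop count between two ALUs on a 2D grid (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def list_schedule(priority_order, parents, latency, rows, cols, slots_per_alu):
    """Greedy placement: one pass, no backtracking.

    priority_order: instruction names, highest priority first; parents must
                    appear before their children (assumed).
    parents:        dict mapping an instruction to its producer instructions.
    latency:        dict mapping an instruction to its execution latency.
    """
    alus = list(product(range(rows), range(cols)))
    free = {alu: slots_per_alu for alu in alus}   # remaining buffer slots
    place, cost = {}, {}
    for inst in priority_order:
        best_alu, best_cost = None, None
        for alu in alus:
            if free[alu] == 0:
                continue
            # Operand arrival: latest parent result plus routing distance.
            arrival = max(
                (cost[p] + manhattan(place[p], alu)
                 for p in parents.get(inst, [])),
                default=0,
            )
            c = arrival + latency[inst]
            if best_cost is None or c < best_cost:
                best_alu, best_cost = alu, c
        if best_alu is None:
            raise ValueError("no free instruction slot left")
        place[inst], cost[inst] = best_alu, best_cost
        free[best_alu] -= 1
    return place, cost

parents = {"add": ["mov1", "mov2"]}
latency = {"mov1": 1, "mov2": 1, "add": 1}
place, cost = list_schedule(["mov1", "mov2", "add"], parents, latency,
                            rows=2, cols=2, slots_per_alu=1)
print(place, cost)
```

Ties are broken by grid order here, which is an arbitrary choice; a placed instruction is never revisited, matching the "no backtracking" bullet.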


Scheduler Optimizations: 1 of 2

• Balance load among ALUs

Estimate ALU contention

• Locality optimization

Place loads and their consumers close to caches

Place register reads close to registers

Cost[I] = max(Cost[P1] + Distance[A1, Ai], Cost[P2] + Distance[A2, Ai]) + Contention(Ai) + Latency(I)

[Figure: hyperblock DFG with parents P1 and P2 at ALUs A1 and A2 feeding instruction I at candidate ALU Ai; a load is shown next to the block labeled M]
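The effect of the contention term can be seen in a small worked example. This is a sketch under assumptions: the slides do not define how contention is estimated, so the count of instructions already placed on an ALU stands in for Contention(Ai), distance is Manhattan hop count, and `placement_cost` is a hypothetical helper.

```python
def placement_cost(alu, parents, place, cost, latency, load):
    """Cost[I] at a candidate ALU, with a contention term added.

    load[alu] counts instructions already placed on that ALU; using it as
    Contention(Ai) is an assumption, the slides do not define the estimate.
    """
    arrival = max(
        (cost[p] + abs(place[p][0] - alu[0]) + abs(place[p][1] - alu[1])
         for p in parents),
        default=0,
    )
    return arrival + load.get(alu, 0) + latency

# Two parents already placed, both on ALU (0, 0).
place = {"P1": (0, 0), "P2": (0, 0)}
cost = {"P1": 1, "P2": 1}
load = {(0, 0): 2}

busy = placement_cost((0, 0), ["P1", "P2"], place, cost, 1, load)
near = placement_cost((0, 1), ["P1", "P2"], place, cost, 1, load)
print(busy, near)
```

Without the contention term the busy ALU at (0, 0) would cost 2 and win; with it, the neighbor at cost 3 beats the busy ALU at cost 4, which spreads instructions across ALUs.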


Scheduler Optimizations: 2 of 2

• Lookahead optimization

Estimate future use for register outputs or loads

• Critical path re-computation

Cost[I] = max(Cost[P1] + Distance[A1, Ai], Cost[P2] + Distance[A2, Ai]) + Contention(Ai) + Lookahead(I) + Latency(I)

[Figure: hyperblock DFG with nodes P1 through P6]


Prototype Evaluation

• Experimental Methodology

Use Trimaran infrastructure to produce hyperblocks

Schedule instructions using a custom greedy scheduler

Evaluate performance using a detailed microarchitecture simulator

• Simulated Machine Parameters

8 × 8 array of ALUs, 128 instruction slots

0.5-cycle hop-to-hop latency

64 KB, 2-way L1 instruction and L1 data caches

32 Kbit two-level local/global tournament-style branch predictor

Optimistic assumptions: oracular memory disambiguation, no TLBs, centralized data cache

• Benchmarks

8 SPECint, 8 SPECfp, 3 MediaBench


Scheduler Results – Integer Benchmarks

[Figure: bar chart of IPC (axis 0 to 3.5) for mcf, adpcm, compr, parser, gzip, twolf, m88ksim, bzip2, and HMEAN; series: No Opt, Load Balance, Load Balance + Locality]


Scheduler Results – Floating Point

[Figure: bar chart of IPC (axis 0 to 18) for equake, turb3d, hydro2d, mpeg2, art, ammp, tomcatv, swim, mgrid, dct, and HMEAN; series: No Opt, Load Balance, Load Balance + Locality]


Comparison with Ideal Scheduler: 1 of 2

[Figure: bar chart of IPC (axis 0 to 4) on the integer benchmarks mcf, adpcm, compr, parser, gzip, twolf, m88ksim, bzip2, and HMEAN; series: Best Scheduler, Ideal Scheduler]

Ideal schedules do not have communication latencies on the critical path


Comparison with Ideal Scheduler: 2 of 2

[Figure: bar chart of IPC (axis 0 to 25) on the floating point benchmarks equake, mpeg2, hydro2d, turb3d, art, ammp, tomcatv, mgrid, swim, dct, and HMEAN; series: Best Scheduler, Ideal Scheduler]

Ideal schedules do not have communication latencies on the critical path


On-Going Work

• Improving the scheduler:

Profile-guided scheduler optimizations

Code-specific heuristics

• Select heuristics based on properties of the hyperblock

Minimize network contention

• Analysis shows avoidable performance loss due to network contention

• Improving our evaluation with TRIPS-specific compiler:

Build larger hyperblocks

Aware of TRIPS-specific scheduling constraints


Conclusions

• Scheduling has two components that can be separated

Placement and issue

• EDGE architectures enable a new scheduling model

Static Placement, Dynamic Issue

Hardware dynamically tolerates unknown latencies

Compiler gives the hardware the ILP

Simpler static instruction scheduler

• Scheduler summary

Simple algorithm with well-chosen heuristics suffices

Load balancing heuristics are important

Register and cache locality heuristics are important

Performance within 20% of an optimistic upper bound