UPC MICRO35 Istanbul Nov. 2002 Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor Enric Gibert 1 Jesús Sánchez 1,2 Antonio González 1,2 1 Dept. d’Arquitectura de Computadors Universitat Politècnica de Catalunya (UPC) Barcelona 2 Intel Barcelona Research Center Intel Labs Barcelona

Mar 29, 2015

Page 1

UPC
MICRO35, Istanbul
Nov. 2002

Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor

Enric Gibert (1)
Jesús Sánchez (1,2)
Antonio González (1,2)

(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona

Page 2

Motivation

Capacity vs. Communication-bound

Clustered microarchitectures
– Simpler + faster
– Power consumption
– Communications not homogeneous

Clustering embedded/DSP domain

Page 3

Clustered Microarchitectures

[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file (Reg. File) and functional units (FUs), connected by register-to-register communication buses. The clusters reach a single L1 cache over the memory buses, with an L2 cache behind it.]

GOAL: distribute the memory hierarchy!!!

Page 4

Contributions

Distribution of data cache:
– Interleaved cache clustered VLIW processor

Hardware enhancement:
– Attraction Buffers

Effective instruction scheduling techniques
– Modulo scheduling
– Loop unrolling + smart assignment of latencies + padding

Page 5

Talk Outline

• MultiVLIW
• Interleaved-cache clustered VLIW processor
• Instruction scheduling algorithms and techniques
• Hardware enhancement: Attraction Buffers
• Simulation framework
• Results
• Conclusions

Page 6

MultiVLIW

[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and its own cache module, connected by register-to-register communication buses and backed by the L2 cache. Each cache module holds cache blocks as TAG+STATE+DATA.]

Cache-Coherence Protocol!!!

Page 8

Interleaved Cache

[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and a cache module, connected by register-to-register communication buses and backed by the L2 cache. A cache block (TAG W0 W1 W2 W3 W4 W5 W6 W7) is word-interleaved across the cache modules as subblocks: module 1 holds TAG W0 W4 (subblock 1), module 2 holds TAG W1 W5, module 3 holds TAG W2 W6, module 4 holds TAG W3 W7.]

Access types: local hit, remote hit, local miss, remote miss
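The word-to-module mapping on this slide can be sketched in a few lines of C. This is only an illustration: it assumes 4 clusters and the 4-byte interleaving factor listed later in the evaluation setup, numbers clusters from 0 instead of 1, and uses the latencies the deck quotes (LH=1, RH=5, LM=10, RM=15 cycles). The function and type names are hypothetical, not taken from the authors' tools.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters: 4 clusters, 4-byte interleaving factor. */
enum { N_CLUSTERS = 4, INTERLEAVE_BYTES = 4 };

typedef enum { LOCAL_HIT, REMOTE_HIT, LOCAL_MISS, REMOTE_MISS } access_kind;

/* Cluster (0-based) whose cache module holds the word at byte address addr. */
static unsigned home_cluster(uintptr_t addr) {
    return (unsigned)((addr / INTERLEAVE_BYTES) % N_CLUSTERS);
}

/* Classify an access issued from issuing_cluster; hit says whether the
   word is present in its home module. */
static access_kind classify(uintptr_t addr, unsigned issuing_cluster, int hit) {
    assert(issuing_cluster < N_CLUSTERS);
    int local = home_cluster(addr) == issuing_cluster;
    if (hit) return local ? LOCAL_HIT : REMOTE_HIT;
    return local ? LOCAL_MISS : REMOTE_MISS;
}

/* Latencies quoted in the slides: LH=1, RH=5, LM=10, RM=15 cycles. */
static int latency_cycles(access_kind kind) {
    static const int cycles[] = { 1, 5, 10, 15 };
    return cycles[kind];
}
```

With these assumptions, words W0 and W4 of a block map to the first module and W3 and W7 to the last, matching the layout in the figure.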

Page 10

BASE Scheduling Algorithm

[Flowchart: START → Sort nodes → Next node → Select possible clusters → HowMany? 0: II=II+1. >0: Best profit in output edges → HowMany? 1: Schedule it. >1: Least loaded → Schedule it. Edges labelled successful / not successful connect the scheduling attempts to Next node and to II=II+1.]

Page 11

Scheduling Algorithm

For word-interleaved cache clustered processors

Scheduling steps:
1. Loop unrolling
2. Assignment of latencies to memory instructions
   – Higher latencies: less stall time + more compute time
   – Lower latencies: more stall time + less compute time
3. Order instructions (DDG nodes)
4. Cluster assignment and scheduling

Page 12

STEP 1: Loop Unrolling

[Figure: array a word-interleaved over the four cache modules. CLUSTER 1 holds a[0], a[4]; CLUSTER 2 holds a[1], a[5]; CLUSTER 3 holds a[2], a[6]; CLUSTER 4 holds a[3], a[7].]

Original loop (ld r3, a[i]: 25% local accesses):

    for (i=0; i<MAX; i++) {
      ld r3, a[i]
      r4 = OP(r3)
      st r4, b[i]
    }

Unrolled loop (100% local accesses):

    for (i=0; i<MAX; i+=4) {
      ld r31, a[i]     (stride 16 bytes)
      ld r32, a[i+1]   (stride 16 bytes)
      ld r33, a[i+2]   (stride 16 bytes)
      ld r34, a[i+3]   (stride 16 bytes)
      ...
    }

Selective unrolling:
• No unrolling
• UnrollxN
• OUF unrolling

Strides multiple of NxI

Optimum Unrolling Factor (OUF)
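The slide states the goal (strides that are a multiple of NxI) but not how the factor is computed. A minimal sketch, assuming the factor for one memory instruction is the smallest U for which U times the original stride is a multiple of NxI, and assuming the loop-wide factor is the least common multiple of the per-instruction factors. Both assumptions and all function names are illustrative, not the authors' definition of the OUF.

```c
#include <assert.h>

/* Greatest common divisor (Euclid). */
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest unroll factor U such that U * stride_bytes is a multiple of
   n_clusters * interleave_bytes (the "NxI" of the slide). */
static unsigned unroll_factor(unsigned stride_bytes, unsigned n_clusters,
                              unsigned interleave_bytes) {
    assert(stride_bytes > 0 && n_clusters > 0 && interleave_bytes > 0);
    unsigned span = n_clusters * interleave_bytes;
    return span / gcd_u(stride_bytes, span);
}

/* Assumed combination rule: LCM of the per-instruction factors. */
static unsigned loop_unroll_factor(const unsigned *strides, unsigned count,
                                   unsigned n_clusters,
                                   unsigned interleave_bytes) {
    unsigned ouf = 1;
    for (unsigned k = 0; k < count; k++) {
        unsigned u = unroll_factor(strides[k], n_clusters, interleave_bytes);
        ouf = ouf / gcd_u(ouf, u) * u;
    }
    return ouf;
}
```

For the loop on the slide (4-byte stride, N=4, I=4 bytes) this gives a factor of 4, which turns each load into a 16-byte-stride access that always maps to the same cluster.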

Page 13

STEP 2: Latency Assignment

[Figure: data dependence graph (DDG) with nodes n1 (load), n2 (load), n3 (add), n4 (store), n5 (sub), n6 (load), n7 (div) and n8 (add), linked by register-flow dependences and memory dependences. Two recurrences, REC1 and REC2, are each closed by an edge with distance=1.]

Memory latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles

STEP 1

Load  Latency change  II   stall  B
n1    To LM           5    1      5
n1    To RH           10   3      3.3
n1    To LH           14   6.8    2.06
n2    To LM           5    0.25   20
n2    To RH           10   0.75   13.3
n2    To LH           14   2.95   4.75

STEP 2

Load  Latency change  II   stall  B
n1    To LM           5    1      5
n1    To RH           10   3      3.3
n1    To LH           14   6.8    2.06
n2    To LM           -    -      -
n2    To RH           5    0.5    10
n2    To LH           9    2.7    3.3

[Figure annotations across the animation steps: non-memory latencies L=1, L=1, L=1, L=8, L=1. Load latencies start at L=15, L=15, L=15 (MII=33, MII=22); one load is then lowered to L=10 (MII=28, MII=22) and to L=5 (MII=23, MII=22); the final load latencies are L=5, L=1, L=1 (MII=9, MII=10).]
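In the tables on this slide the B column equals the II column divided by the stall column (for example 5 / 0.25 = 20 and 14 / 6.8 ≈ 2.06). A minimal sketch of picking the latency change with the largest B follows; the greedy "take the maximum" rule and every name in the code are assumptions made for illustration, and the numbers are the ones shown on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate latency change for a load, as listed in the slide's table. */
typedef struct {
    const char *node;    /* e.g. "n2" */
    const char *change;  /* e.g. "To LM" */
    double ii;           /* value in the "II" column */
    double stall;        /* value in the "stall" column */
} candidate;

/* Benefit as it appears in the B column: II divided by stall. */
static double benefit(const candidate *c) {
    assert(c->stall > 0.0);
    return c->ii / c->stall;
}

/* Index of the candidate with the largest benefit (first one wins ties). */
static size_t best_candidate(const candidate *cands, size_t count) {
    assert(count > 0);
    size_t best = 0;
    for (size_t k = 1; k < count; k++) {
        if (benefit(&cands[k]) > benefit(&cands[best])) {
            best = k;
        }
    }
    return best;
}
```

Applied to the first table, this selects n2 "To LM" (B=20); applied to the second, n2 "To RH" (B=10).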

Page 14

STEPS 3 and 4

Step 3: Order instructions
Step 4: Cluster assignment and scheduling

Page 15

Scheduling Restrictions

[Figure: CLUSTER 1 (cache module holding a[0], a[4]), CLUSTER 2, CLUSTER 3 and CLUSTER 4 (cache module holding a[3], a[7]), connected by memory buses to the NEXT MEMORY LEVEL.]

Schedule (one column per cluster):

cycle i      -               -   -   store to a[0]
cycle i+1    -               -   -   -
cycle i+2    -               -   -   -
cycle i+3    load from a[0]  -   -   -

NON-DETERMINISTIC BUS LATENCY!!!

Page 16

STEPS 3 and 4

Step 3: Order instructions
Step 4: Cluster assignment and scheduling
– Non-memory instructions: same as BASE
  • Minimize register communications + maximize workload
– Memory instructions:
  • Memory instructions in same chain → same cluster
  • IPBC (Interleaved Preferred Build Chains)
    – Average “preferred cluster” of the chain
    – Padding (NxI boundary) → meaningful preferred cluster information
      » Stack frames
      » Dynamically allocated data
  • IBC (Interleaved Build Chains)
    – Minimize register communications of 1st instr. of chain

Page 17

Memory Dependent Chains

[Figure: the DDG from the latency assignment example (n1 load, n2 load, n3 add, n4 store, n5 sub, n6 load, n7 div, n8 add; register-flow dependences and memory dependences; two recurrence edges with distance=1). Annotated latencies: L=1, L=1, L=1, L=8, L=1 and L=5, L=1, L=1. Preferred clusters shown on the memory instructions: 1, 1, 2, 2.]

Memory latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles

Cluster assignment of the memory instructions n1, n2, n4, n6:
IPBC: cluster 1, cluster 2
IBC: same as n4, minimize register communications

order={n5, n4, n3, n2, n1, n8, n7, n6}
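The IPBC rule "average preferred cluster of the chain" can be read in several ways. One possible sketch, assuming profiling gives a per-cluster access count for every memory instruction and that the chain is placed in the cluster with the highest summed count. The `mem_profile` record and the voting rule are hypothetical; the slides do not spell out how the average is taken.

```c
#include <assert.h>
#include <stddef.h>

enum { N_CLUSTERS = 4 };

/* Hypothetical profile record: how many profiled accesses of one memory
   instruction fell into each cluster's cache module. */
typedef struct {
    unsigned long accesses[N_CLUSTERS];
} mem_profile;

/* Cluster (0-based) with the most profiled accesses summed over the whole
   memory dependent chain; the lowest-numbered cluster wins ties. */
static unsigned chain_preferred_cluster(const mem_profile *chain, size_t len) {
    assert(len > 0);
    unsigned long total[N_CLUSTERS] = { 0 };
    for (size_t m = 0; m < len; m++) {
        for (unsigned c = 0; c < N_CLUSTERS; c++) {
            total[c] += chain[m].accesses[c];
        }
    }
    unsigned best = 0;
    for (unsigned c = 1; c < N_CLUSTERS; c++) {
        if (total[c] > total[best]) {
            best = c;
        }
    }
    return best;
}
```

Under this reading, a chain whose members prefer clusters 1, 1 and 2 lands in cluster 1 (index 0 in the code), which agrees with the IPBC row on the slide.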

Page 19

Attraction Buffers

Cost-effective mechanism to increase local accesses

[Figure: array a word-interleaved over the four cache modules (CLUSTER 1: a[0], a[4]; CLUSTER 2: a[1], a[5]; CLUSTER 3: a[2], a[6]; CLUSTER 4: a[3], a[7]). The loads ld r3, a[3]; ld r3, a[7]; ... (stride 16 bytes) are issued from a cluster whose Attraction Buffer (ABuffer) receives a copy of a[3] a[7].]

Local accesses = 0% (without Attraction Buffer)
Local accesses = 50% (with Attraction Buffer)
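The 0% and 50% figures on this slide can be reproduced with a toy model. The sketch assumes that a remote access copies the whole remote subblock (the words of one 32-byte block that live in one module) into the issuing cluster's Attraction Buffer, and that later accesses to that subblock count as local. The buffer size, FIFO replacement and all names are illustrative assumptions, not details given in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N_CLUSTERS = 4, INTERLEAVE_BYTES = 4, BLOCK_BYTES = 32, AB_ENTRIES = 16 };

/* Toy Attraction Buffer: each entry names one remote subblock. */
typedef struct {
    int valid[AB_ENTRIES];
    uintptr_t block[AB_ENTRIES];  /* cache block number */
    unsigned home[AB_ENTRIES];    /* cluster that owns the subblock */
    unsigned next;                /* FIFO replacement pointer (assumed policy) */
} attraction_buffer;

static int ab_lookup(const attraction_buffer *ab, uintptr_t block, unsigned home) {
    for (unsigned e = 0; e < AB_ENTRIES; e++) {
        if (ab->valid[e] && ab->block[e] == block && ab->home[e] == home) {
            return 1;
        }
    }
    return 0;
}

static void ab_insert(attraction_buffer *ab, uintptr_t block, unsigned home) {
    ab->valid[ab->next] = 1;
    ab->block[ab->next] = block;
    ab->home[ab->next] = home;
    ab->next = (ab->next + 1) % AB_ENTRIES;
}

/* Number of local accesses among count loads at base, base+stride, ...
   issued from cluster. Pass ab == NULL to model a machine without buffers. */
static unsigned count_local(attraction_buffer *ab, unsigned cluster,
                            uintptr_t base, unsigned stride, unsigned count) {
    assert(cluster < N_CLUSTERS);
    unsigned local = 0;
    for (unsigned k = 0; k < count; k++) {
        uintptr_t addr = base + (uintptr_t)k * stride;
        unsigned home = (unsigned)((addr / INTERLEAVE_BYTES) % N_CLUSTERS);
        uintptr_t block = addr / BLOCK_BYTES;
        if (home == cluster) {
            local++;                        /* word lives in the local module */
        } else if (ab != NULL && ab_lookup(ab, block, home)) {
            local++;                        /* served by the Attraction Buffer */
        } else if (ab != NULL) {
            ab_insert(ab, block, home);     /* remote access fills the buffer */
        }
    }
    return local;
}
```

For the slide's access stream (a[3], a[7], a[11], ... with 4-byte elements) issued from the first cluster, every other load finds its word in the buffer, so half of the accesses become local.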

Page 21

Evaluation Framework

IMPACT C compiler
Mediabench benchmark suite

Benchmark   Profile     Execution
epicdec     test_image  titanic
epicenc     test_image  titanic
g721dec     clinton     S_16_44
g721enc     clinton     S_16_44
gsmdec      clinton     S_16_44
gsmenc      clinton     S_16_44
jpegdec     testimg     monalisa
jpegenc     testimg     monalisa
mpeg2dec    mei16v2     tek6
pegwitdec   pegwit      techrep
pegwitenc   pgptest     techrep
pgpdec      pgptext     techrep
pgpenc      pgptest     techrep
rasta       ex5_c1      ex5_c1

Page 22

Evaluation Framework

Common to all configurations:
# clusters: 4
Functional units: 1 FP / cluster + 1 integer / cluster + 1 memory / cluster
Register buses: 4 buses running at ½ the core freq.
Cache configuration: 8KB, 2-way set-associative, 32 byte blocks; L2 always hits

Per configuration:

                     Unified cache    MultiVLIW        Interleaved cache
Cache latencies      Hit=5, Miss=14   Hit=1, Miss=10   Local Hit=1, Remote Hit=5, Local Miss=10, Remote Miss=15
Algorithm            BASE             IBC              IPBC + IBC
Interleaving factor  -                -                4 bytes

Page 24

Local Accesses

[Figure: stacked bars of memory accesses (0% to 100%), broken down into local hits, remote hits, local misses, remote misses and combined, for the configurations Base, OUF, OUF+P and OUF+P+NC on epicdec, gsmdec, jpegenc, pgpenc, rasta and AMEAN.]

OUF=Optimum UF, P=Padding, NC=No Chains

Page 25

Why Remote Accesses?

Double precision accesses (mpeg2dec)

Unclear “preferred cluster” information
• Indirect accesses (e.g. a[b[i]]) (jpegdec, jpegenc, pegwitdec, pegwitenc)
• Different alignment (epicenc, jpegdec, jpegenc)
• Strides not multiple of NxI (selective unrolling, …)

Memory dependent chains (epicdec, pgpdec, pgpenc, rasta)

    for (k=0; k<MAX; k++) {
      for (i=k; i<MAX; i++)
        load a[i]
    }

Page 26

Stall Time

[Figure: normalized stall time (axis 0 to 1.2), broken down into remote hit, local misses, remote misses and combined, for IBC, IBC+AB, IPBC and IPBC+AB on epicdec, gsmdec, jpegdec, pgpenc, rasta and AMEAN.]

Page 27

Cycle Count Results

[Figure: normalized number of cycles (axis 0 to 1.6), split into compute time and stall time, for multiVLIW, IPBC+AB, IBC+AB and Unified on epicdec, gsmdec, jpegdec, pgpenc, rasta and AMEAN.]

Page 29

Conclusions

• Interleaved cache clustered VLIW processor
• Effective instruction scheduling techniques
  – Smart assignment of latencies
  – Loop unrolling + padding (27% more local hits)
• Source of remote accesses and stall time
• Attraction Buffers (reduce stall time by up to 34%)
• Cycle count results:
  – Compared with MultiVLIW: 7% slowdown but simpler hardware
  – Compared with unified cache: 11% speedup

Page 30

Questions?

Page 31

Question: Latency Assignment

MII(REC1)=20   MII(DDG)=10

Node  II  stall  B(ratio)  B(subtract)
n1    15  4      3.75      11
n2    10  5      2         5
n3    5   1      5         4
n4    5   1      5         4
n5    10  0      MAX       10
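The two benefit columns in this table rank the nodes differently, which a short sketch makes concrete. It assumes B(ratio) is II divided by stall, with a zero stall treated as unbounded (shown as MAX in the table), and B(subtract) is II minus stall; both readings match every row of the table. The type and function names are hypothetical.

```c
#include <assert.h>
#include <math.h>    /* INFINITY */
#include <stddef.h>

typedef struct {
    const char *node;
    double ii;     /* "II" column */
    double stall;  /* "stall" column */
} option;

/* B(ratio): II / stall; a zero stall is treated as unbounded (MAX). */
static double b_ratio(const option *o) {
    return o->stall == 0.0 ? (double)INFINITY : o->ii / o->stall;
}

/* B(subtract): II - stall. */
static double b_subtract(const option *o) {
    return o->ii - o->stall;
}

/* Index of the option that maximizes the given metric (first wins ties). */
static size_t argmax(const option *opts, size_t count,
                     double (*metric)(const option *)) {
    assert(count > 0);
    size_t best = 0;
    for (size_t k = 1; k < count; k++) {
        if (metric(&opts[k]) > metric(&opts[best])) {
            best = k;
        }
    }
    return best;
}
```

On the table's data the ratio metric ranks n5 first, while the subtraction metric ranks n1 first.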

Page 32

Question: Padding

    void foo(int *array, int *accum) {
      int i;
      *accum = 0;
      for (i=0; i<MAX; i++)
        *accum += array[i];
    }

    int main(void) {
      int *a, value;
      a = malloc(MAX*sizeof(int));
      foo(a, &value);
      return 0;
    }

[Figure: data placement after padding. CLUSTER 1: a[0], a[4], ...; CLUSTER 2: accum, a[1], a[5], ...; CLUSTER 3: a[2], a[6], ...; CLUSTER 4: a[3], a[7], ...]
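Padding to an NxI boundary, as the earlier slide on steps 3 and 4 suggests for stack frames and dynamically allocated data, amounts to rounding a start address up to the next multiple of NxI so that element 0 always falls in the first cluster's module. A minimal sketch of that arithmetic, with hypothetical function names and no claim about how the authors' compiler or allocator actually inserts the padding:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed so that addr lands on an N x I boundary. */
static uintptr_t padding_bytes(uintptr_t addr, unsigned n_clusters,
                               unsigned interleave_bytes) {
    assert(n_clusters > 0 && interleave_bytes > 0);
    uintptr_t span = (uintptr_t)n_clusters * interleave_bytes;
    return (span - addr % span) % span;
}

/* Address rounded up to the next N x I boundary. */
static uintptr_t pad_to_boundary(uintptr_t addr, unsigned n_clusters,
                                 unsigned interleave_bytes) {
    return addr + padding_bytes(addr, n_clusters, interleave_bytes);
}
```

With N=4 and I=4 bytes, an array that would otherwise start at byte 20 is moved to byte 32, so a[0] maps to the first cluster, a[1] to the second, and so on, as in the figure.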

Page 33

Question: Coherence

Memory Dependent Chains
– Modified data
  • Present in only one Attraction Buffer
– Data present in multiple Attraction Buffers
  • Replicated in read-only manner
– Local scheduling technique
  • At end of loop flush Attraction Buffer’s contents

[Figure: CLUSTER 1 to CLUSTER 4, each with an Attraction Buffer (ABuffer); a[2] appears in the ABuffers of CLUSTER 1, CLUSTER 2 and CLUSTER 4.]