Diverge-Merge Processor (DMP) Hyesoon Kim José A. Joao Onur Mutlu* Yale N. Patt HPS Research Group *Microsoft Research University of Texas at Austin

Dec 22, 2015

Page 1

Diverge-Merge Processor (DMP)

Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt

HPS Research Group, University of Texas at Austin
*Microsoft Research

Page 2

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 3

Predicated Execution

Convert control flow dependence to data dependence

Source code:
  if (cond) { b = 0; } else { b = 1; }

(normal branch code)
  A: p1 = (cond)
     branch p1, TARGET
  B: mov b, 1
     jmp JOIN
  C: TARGET: mov b, 0
  D: JOIN: ...

(predicated code)
  A: p1 = (cond)
  B: (!p1) mov b, 1
  C: (p1) mov b, 0

[Figure: control-flow graph in which A branches to C when taken (T) and to B when not taken (N); B and C join at D. In the predicated version, A, B, and C form a single straight-line block.]
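The conversion on this slide can be sketched in a few lines of Python. This is only an illustration of the idea: the function names are made up here, and the conditional expressions stand in for predicated moves, which a real predicated ISA would implement in hardware.

```python
def branch_version(cond: bool) -> int:
    """Normal branch code: only one of the two moves is executed."""
    if cond:            # branch p1, TARGET
        b = 0           # TARGET: mov b, 0
    else:
        b = 1           # mov b, 1 ; jmp JOIN
    return b


def predicated_version(cond: bool) -> int:
    """Predicated code: both moves are fetched; the predicate gates each write."""
    p1 = cond                   # p1 = (cond)
    b = None                    # value of b before the hammock (unknown here)
    b = 1 if not p1 else b      # (!p1) mov b, 1
    b = 0 if p1 else b          # (p1)  mov b, 0
    return b
```

Both versions compute the same `b` for either value of `cond`; the predicated one contains no control-flow decision, so there is nothing to mispredict.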

Page 4

Benefit of Predicated Execution

Predicated execution can be high performance and energy-efficient.

[Figure: animated pipeline diagram (Fetch, Decode, Rename, Schedule, RegisterRead, Execute) fetching blocks A through F. Under branch prediction a misprediction causes a pipeline flush; under predicated execution there is no flush and the figure shows a nop in the pipeline instead.]

Page 5

Limitations/Problems of Predication

ISA: Predicate registers and predicated instructions
  Dynamic-Hammock Predication [Klauser’98] can solve this problem, but it is only applicable to simple hammocks.

Adaptivity: Static predication is not adaptive to run-time branch behavior.
  Branch behavior changes based on input set, phase, control-flow path.
  Wish Branches [Kim’05]

Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
  Function calls, loops, many instructions inside a region, and complex CFGs
  Hyperblock [Mahlke’92] cannot adapt to frequently-executed paths dynamically.

Page 6

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 7

Diverge-Merge Processor (DMP)

DMP can dynamically predicate complex branches (in addition to simple hammocks).

The compiler identifies
  Diverge branches
  Control-flow merge (CFM) points

The microarchitecture decides when and what to predicate dynamically.

Page 8

Dynamic Predication

Klauser et al. [PACT’98]: Dynamic-hammock predication

Original code:
  A: p1 = (cond)
     branch p1, TARGET        (low-confidence)
  B: mov R1, 1
     jmp JOIN
  C: TARGET: mov R1, 0
  H: JOIN: add R5, R1, 1

Dynamically predicated:
  (mov R1, 1)  PR10 = 1
  (mov R1, 0)  PR11 = 0
  select-µop (φ-node in SSA): PR12 = (cond) ? PR11 : PR10

[Figure: hammock CFG in which A branches to C when taken (T) and to B when not taken (N); both join at H.]

Page 9

Diverge-Merge Processor

[Figure: control-flow graph with blocks A through H. A ends in the diverge branch and H is the CFM point. Edges are marked as frequently executed or not frequently executed paths. The dynamically predicated sequence shown is A, C, E, B, H, with select-µops inserted at the CFM point.]

Page 10

Diverge-Merge Processor

[Figure: animation of the same control-flow graph (blocks A through H, frequently executed and not frequently executed paths). Legend: diverge branch, executed block, CFM point.]

Page 11

Control-Flow Graphs

[Table: which mechanisms can handle which control-flow graph types. Columns: simple hammock, nested hammock, frequently-hammock, loop, non-merging. Rows: DMP, Dynamic Hammock, SW pred, Wish br., Dual-path. The cell marks did not survive extraction.]

Page 12

Dual-path Execution vs. DMP

[Figure: both mechanisms fork at a low-confidence branch in block A. Dual-path fetches path 1 (C, D, E, F) and path 2 (B, D, E, F), so D, E, and F are fetched on both paths. DMP fetches C on path 1 and B on path 2 up to the CFM point, after which D, E, and F are fetched once.]

Page 13

Control-Flow Graphs

[Table: which mechanisms can handle which control-flow graph types. Columns: simple hammock, nested hammock, frequently-hammock, loop, non-merging. Rows: DMP, Dynamic-hammock, SW pred, Wish br., Dual-path. Two cells are labeled "sometimes"; the remaining cell marks did not survive extraction.]

Page 14

Distribution of Mispredicted Branches

66% of mispredicted branches can be dynamically predicated in DMP.

[Figure: stacked bar chart. y-axis: mispredictions per kilo instructions (MPKI), 0 to 12. x-axis: the 17 benchmarks (gzip through m88ksim) plus amean. Stack categories: simple, nested, frequently, loop, non-merging.]

Page 16

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 17

Fetch Mechanism

Round-robin fetch

[Figure: control-flow graph with blocks A through H. A ends in the diverge branch, which has low confidence, and H is the CFM point. The predicted path is marked; the fetched blocks shown are A, C, E, B, H.]

Page 18

Dynamic Predication

Original code:
  A: branch r0, C
  C: add r1 <- r3, #1
  B: add r1 <- r2, #-1
  H: add r4 <- r1, r3

Renamed code in dynamic predication mode:
  A: branch pr10, C         p1 = pr10
  C: add pr21 <- pr13, #1   (p1)
  B: add pr31 <- pr12, #-1  (!p1)
     select-µop pr41 = p1 ? pr21 : pr31
  H: add pr24 <- pr41, pr13

Forks RAT, RAS, and GHR.

Register alias tables (Arch. -> Phy., with a modified bit M), shown as RAT1 and RAT2:
  Before the fork: R1 -> PR11, R2 -> PR12, R3 -> PR13
  One copy after renaming its path: R1 -> PR21 (M = 1)
  The other copy after renaming its path: R1 -> PR31 (M = 1)
  After the select-µop: R1 -> PR41

[Figure: blocks A, C, E, B, H. Block E is on the path, but its instructions are not listed.]
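The renaming example above can be mimicked with a small Python sketch. It is an illustration under simplifying assumptions, not the hardware algorithm from the talk: the function name, the counter-based physical register allocator, and the comparison of the two RAT copies (used here in place of the M bit) are all hypothetical.

```python
def fork_and_merge(rat, taken_writes, not_taken_writes, next_free):
    """Rename two paths on forked RAT copies, then emit select-µops at the CFM point.

    rat: dict mapping architectural register -> physical register at the diverge branch
    taken_writes / not_taken_writes: destination registers written on each path
    next_free: number of the first free physical register (hypothetical allocator)
    """
    rat1, rat2 = dict(rat), dict(rat)           # fork the RAT
    for reg in taken_writes:                    # rename the path predicated on p1
        rat1[reg] = f"pr{next_free}"
        next_free += 1
    for reg in not_taken_writes:                # rename the path predicated on !p1
        rat2[reg] = f"pr{next_free}"
        next_free += 1

    merged, select_uops = dict(rat), []
    for reg in sorted(rat):
        if rat1[reg] != rat2[reg]:              # written on at least one path
            dest = f"pr{next_free}"
            next_free += 1
            # dest = p1 ? value from taken path : value from not-taken path
            select_uops.append((dest, "p1", rat1[reg], rat2[reg]))
            merged[reg] = dest
    return merged, select_uops
```

With `rat = {"r1": "pr11", "r2": "pr12", "r3": "pr13"}`, both paths writing `r1`, and `next_free = 21`, the sketch produces one select-µop for `r1` and leaves `r2` and `r3` untouched, which mirrors the single `select-µop pr41 = p1 ? pr21 : pr31` on the slide (the physical register numbers differ because the allocator here is a plain counter).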

Page 19

DMP Support

ISA Support
  Mark diverge branches/CFM points.

Compiler Support [CGO’07]
  The compiler identifies diverge branches and the corresponding CFM points.

Hardware Support
  Confidence estimator
  Fetch mechanisms
  Load/store processing
  Instruction retirement
  Dynamic predication

Page 20

Hardware Complexity Analysis

[Table: hardware required by each mechanism. Columns: DMP, Dyn. ham., Multipath, Dual-path, SW pred., Wish br. Rows: Front-End, Check Flush/no Flush, Predicate Registers, Confidence Estimator, Rename Support, Select-Uop Gen., ST-LD Forwarding. The cell marks did not survive extraction.]

Page 21

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 22

Simulation Methodology

12 SPEC 2000 INT, 5 SPEC 95 INT
  Different input sets for profiling and evaluation
Alpha ISA execution driven simulator
Baseline processor configuration
  64KB perceptron predictor/O-GEHL (paper)
  Minimum 30-cycle branch misprediction penalty
  8-wide, 512-entry instruction window
  2 KB 12-bit history enhanced JRS confidence estimator
Less aggressive processor (paper)
Power model using Wattch
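The slide names a JRS-style confidence estimator without describing its internals. A minimal sketch of the general idea, assuming a table of resetting counters indexed by the branch PC XORed with the global history: the class name, counter width, and threshold are illustrative assumptions, not the evaluated configuration, and the changes made by the "enhanced" variant are not modeled.

```python
class ResettingCounterEstimator:
    """JRS-style confidence estimator sketch (all parameters are assumptions)."""

    def __init__(self, index_bits=12, counter_bits=4, threshold=15):
        self.mask = (1 << index_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)

    def _index(self, pc, history):
        # Combine the branch address with the global branch history.
        return (pc ^ history) & self.mask

    def is_low_confidence(self, pc, history):
        # Few consecutive correct predictions so far -> low confidence.
        return self.table[self._index(pc, history)] < self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)  # saturate
        else:
            self.table[i] = 0                                       # reset on a miss
```

In a DMP-like front end, `is_low_confidence` would be consulted when a diverge branch is fetched, and `update` would be called when the branch resolves.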

Page 23

Different CFG types

[Figure: bar chart. y-axis: IPC improvement (%), 0 to 60. x-axis: the 17 benchmarks (gzip through m88ksim) plus hmean. Series: simple; simple, nested; simple, nested, frequently; simple, nested, frequently, loop.]

Page 24

Performance Improvement

[Figure: bar chart. y-axis: performance improvement (%), 0 to 25. Bars: DMP, dynamic-hammock, dual-path, multipath, limited software predication, wish branches.]

Page 25

Energy Consumption

[Figure: bar chart. y-axis: reduction (%), -5 to 10. Bars: DMP, dynamic-hammock, dual-path, multipath, limited software predication, wish branches.]

Page 26

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 27

Conclusion

DMP introduces the concept of frequently-hammocks and it dynamically predicates complex CFGs.

DMP can overcome the three major limitations of software predication: ISA support, adaptivity, complex CFG.

DMP reduces branch mispredictions energy-efficiently
  19% performance improvement, 9% less energy

DMP divides the work between the compiler and the microarchitecture:
  The compiler analyzes the control-flow graphs.
  The microarchitecture decides when and what to predicate dynamically.

Page 28

Thank You!!

Page 29

Questions?

Page 30

Handling Mispredictions

[Figure: the control-flow graph from the Fetch Mechanism slide (diverge branch A, CFM point H, predicted path marked), annotated "Misprediction!" and "Flush", with predicate values (0), (1), (1) next to the instructions.]

Instructions shown:
  branch pr10, C             p1 = pr10
  add pr21 <- pr13, #1       (p1)
  add pr31 <- pr12, #-1      (!p1)
  select-µop pr41 = p1 ? pr21 : pr31
  add pr24 <- pr41, pr13
  add pr34 <- pr31, pr13     (block D)
  add pr44 <- pr34, #-1      (!p1)

Page 31

Loop Branches

Exit Condition
  The loop branch is predicted to exit the loop.

Benefit
  Reduced pipeline flushes: when the predicated loop is iterated more times than it should be.
    Instructions in the extra iterations of the loop become NOPs.
    Instructions after loop-exit can still be executed.

Negative Effects
  Increased execution delay of loop-carried dependencies
  The overhead of select-µops

Page 32

Loop Branches

Predicate each loop iteration separately

Original code:
  A: add r1 <- r1, #1
     r0 = (cond1)
     branch A, r0
  B: add r7 <- r1, #10

Dynamically predicated code:
  A: (first iteration, ending in)
     branch A, pr10             p1 = pr10
  A: add pr21 <- pr11, #1       (p1)
     pr20 = (cond1)             (p1)
     branch A, pr20             (p1)    p2 = pr20
     select-uop pr22 = p1 ? pr21 : pr11
     select-uop pr23 = p1 ? pr20 : pr10
  A: add pr31 <- pr22, #1       (p2)
     pr30 = (cond1)             (p2)
     branch A, pr30             (p2)
     select-uop pr32 = p2 ? pr31 : pr22
     select-uop pr33 = p2 ? pr30 : pr23
  B: add pr7 <- pr32, #10

Dynamic predication mode ends when the loop branch is predicted to exit the loop.
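The per-iteration predication above can be sketched in Python to show why extra iterations are harmless. This is a simplified model, assuming the loop body is just `r1 = r1 + 1` and that the exit test is a function of `r1`; the function name and the `fetched_iterations` parameter are hypothetical.

```python
def predicated_loop(r1, cond, fetched_iterations):
    """Execute `fetched_iterations` copies of the loop body, predicating all but the first.

    cond(r1) is True when the loop branch goes back to A.
    """
    r1 = r1 + 1                 # first iteration is on the correct path
    p = cond(r1)                # p1: predicate for the next iteration
    for _ in range(fetched_iterations - 1):
        new_r1 = r1 + 1         # executes speculatively under predicate p
        new_p = cond(new_r1)
        # select-µops: take the new values only if this iteration's predicate holds
        r1 = new_r1 if p else r1
        p = new_p if p else p
    return r1                   # value seen by block B after the loop
```

For `cond = lambda x: x < 3` and `r1 = 0`, the real loop runs three times and leaves `r1 == 3`. Fetching five predicated iterations gives the same value, because the two extra iterations have a false predicate and behave as NOPs. If fewer iterations are fetched than the loop needs, the sketch stops early; the hardware would keep fetching in that case.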

Page 33

Enhanced Mechanisms

Multiple CFM points
  The hardware chooses one CFM point for each instance of dynamic predication.

Exit Optimizations
  Counter Policy: What if one path does not reach the CFM point?
    Number of fetched instructions > Threshold
  Yield Policy: What if another low confidence diverge branch is encountered in dynamic predication mode?
    Later low confidence branch is more likely mispredicted.

[Figure: control-flow graph with blocks A through H.]
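The counter policy can be sketched as a bounded fetch loop. This is an illustration only: the function name, the representation of a path as a sequence of instruction addresses, and the threshold value are assumptions, and the slide does not say what the hardware does after a counter exit.

```python
def fetch_until_cfm(path, cfm, threshold):
    """Follow one predicted path until it reaches the CFM point or fetches too much.

    path: iterable of instruction addresses in predicted fetch order
    Returns (outcome, number of instructions fetched).
    """
    fetched = 0
    for addr in path:
        if addr == cfm:
            return "reached_cfm", fetched
        fetched += 1
        if fetched > threshold:         # counter policy: give up on merging
            return "counter_exit", fetched
    return "path_ended", fetched
```

A path that merges quickly returns `"reached_cfm"`; a path that wanders away from the CFM point is cut off once it has fetched more than `threshold` instructions, which bounds the wasted fetch bandwidth.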

Page 34

Detailed DMP Support

32 Predicate register ids
Fetch mechanism
  High performance I-Cache
    Fetch two cache lines
    Predict 3 branches
    Fetch stops at the first taken branch

Page 35

Diverge and Merge?

[Figure: bar chart. y-axis: Merge (%), 0% to 100%. x-axis: the 17 benchmarks (gzip through m88ksim) plus amean.]

Page 36

Useful Dynamic Predication Mode

[Figure: bar chart. y-axis: diverge branch actually mispredicted (%), 0 to 30. x-axis: the 17 benchmarks (gzip through m88ksim) plus amean.]

Page 37

Perfect Branch Prediction

[Figure: three bar-chart panels, each with y-axis delta (%): Performance (0 to 100), Energy (-40 to 0), and EDP (-70 to 0). Series: 4 wide-20 stages-128 window; 8 wide-30 stages-512 window.]

Page 38

Maximum Power

[Figure: bar chart. y-axis: maximum power increase (%), 0 to 8. Bars: DMP, dynamic-hammock, dual-path, multipath, software predication, wish branches.]

Page 39

Branch Predictor Effects

[Figure: bar chart. y-axis: IPC delta (%), 0 to 35. Bars: perceptron-dynamic-hammock, perceptron-dual-path, perceptron-multipath, perceptron-DMP, OGEHL-base, OGEHL-dynamic-hammock, OGEHL-dual-path, OGEHL-multipath, OGEHL-DMP.]

Page 40

Confidence Estimator Effects

[Figure: grouped bar chart. y-axis: IPC delta (%), 0 to 35. Groups: dynamic-hammock, dual-path, multipath, DMP. Series (confidence estimator size): 512B, 2KB, 4KB, 16KB, perfect.]

Page 41

Results in Less Aggressive Processors

[Figure: bar chart. y-axis: IPC delta (%), -5 to 35. x-axis: the 17 benchmarks (gzip through m88ksim) plus hmean. Series: dynamic-hammock, dual-path, multi-path, dmp.]

[Figure: bar chart. y-axis: execution time normalized to the baseline, 0.5 to 1.05. x-axis: the 17 benchmarks (gzip through m88ksim) plus amean. Series: limited software predication, wish branches, dmp.]

Page 42

DMP vs. Perfect Conditional BP

[Figure: bar chart. y-axis: IPC delta (%), 0 to 140; two off-scale bars are labeled 227 and 229. x-axis: the 17 benchmarks (gzip through m88ksim) plus hmean. Series: dmp, Perf BP.]

Page 43

Enhanced DMP Mechanisms

[Figure: bar chart. y-axis: IPC delta (%), -10 to 60. x-axis: the 17 benchmarks (gzip through m88ksim) plus hmean. Series: single-cfm, multiple-cfm, mcfm-counter, mcfm-counter-yield.]

Page 44

DMP vs. Other Mechanisms

[Figure: bar chart. y-axis: IPC improvement (%), -5 to 40; two off-scale bars are labeled 47% and 58%. x-axis: the 17 benchmarks (gzip through m88ksim) plus hmean. Series: dynamic-hammock, dual-path, multipath, DMP.]

Page 45

Comparisons with Predication/Wish Branches

[Figure: bar chart. y-axis: normalized execution time, 0 to 1. x-axis: the 17 benchmarks (gzip through m88ksim) plus amean. Series: limited software predication, wish branches, DMP. The figure also carries the label "non-predicated".]

Page 46

Reduction in Pipeline Flushes

Average overhead:
  Dynamic-hammock: 4 instructions/entry
  Dual-path: 150 instructions/entry
  Multipath: 200 instructions/entry
  DMP: 20 instructions/entry

[Figure: bar chart. y-axis: reduction in pipeline flushes (%), 0 to 80. x-axis: the 17 benchmarks (gzip through m88ksim) plus amean. Series: dynamic-hammock, dual-path, multipath, DMP.]

Page 47

Handling Nested Diverge Branches

Basic DMP
  Ignore other low confidence diverge branches

Enhanced DMP
  Exit dynamic predication mode and re-enter from the younger low confidence branch on predicted path (Yield policy)

[Figure: control-flow graph with diverge branch A, blocks B through G, and CFM point H.]

Page 48

Compiler Support [CGO’07]

Compiler analyzes the control flow and the profile data
  Step1: Identify diverge branch candidates and CFM points.
  Step2: Select diverge branches based on
    (1) the number of instructions between a branch and the CFM point
    (2) the probability of merging at the CFM point
    Heuristics or a cost-benefit model
  Step3: Mark the selected branches/CFM points.
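Step 2 can be sketched as a simple threshold heuristic over profiled candidates. This is a hypothetical illustration of the two criteria listed above, not the compiler algorithm from [CGO’07]: the `Candidate` fields, the function name, and both threshold values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    branch_pc: int
    cfm_pc: int
    insts_to_cfm: int    # (1) instructions between the branch and the CFM point
    merge_prob: float    # (2) profiled probability of merging at the CFM point


def select_diverge_branches(candidates, max_insts=50, min_merge_prob=0.3):
    """Keep candidates that are short enough and merge often enough (thresholds assumed)."""
    return [
        c for c in candidates
        if c.insts_to_cfm <= max_insts and c.merge_prob >= min_merge_prob
    ]
```

A cost-benefit model would replace the two fixed thresholds with an estimate of the flush cycles saved against the extra instructions fetched on the predicated paths.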

Page 49

Future Research

Hardware Support
  Better confidence estimators
  Efficient hardware mechanism to detect diverge branches and CFM points
    Increase hardware complexity but eliminate the need for ISA/compiler support

Compiler Support
  Better compiler algorithms [CGO’07]

Page 50

Power Measurement Configurations

100 nm Technology
Baseline processor
  4 GHz
Less aggressive processor
  1.5 GHz
CC3 clock-gating model in Wattch: unused units dissipate only 10% of their maximum power

DMP: one more RAT/RAS/GHR, select-uop generation module, additional fields in BTB, predicate registers, CFM registers, load-store forwarding, instruction retirement
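The clock-gating rule quoted above can be written as a one-line power model. This is a simplified sketch: the 10% floor for unused units comes from the slide, while the linear scaling with usage for active units is an assumption about how the gating style behaves, and the function name is made up.

```python
def unit_power(max_power, usage):
    """Per-cycle power of one unit under a CC3-style clock-gating model.

    usage: fraction of the unit's ports in use this cycle, from 0.0 to 1.0
    """
    if usage == 0:
        return 0.10 * max_power     # an unused unit still dissipates 10%
    return usage * max_power        # assumed: active power scales with usage
```

Summing `unit_power` over all units each cycle gives total power, so the extra DMP structures cost energy even in cycles when dynamic predication is not active.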

Page 51

Fetched wrong-path instructions per entry into dynamic-predication/dual-path mode

[Figure: bar chart. y-axis: wrong-path instructions per entry, 0 to 350. x-axis: the 17 benchmarks (gzip through m88ksim) plus amean. Series: dynamic-hammock, dual-path, multipath, dmp.]

Page 52

Fetched/Executed Instructions

[Figure: grouped bar chart. y-axis: delta (%), -25 to 5. Groups: baseline, less-aggressive. Series: fetched instructions, executed instructions, max power, energy, energy-delay product.]

Page 53

ISA Support

Example of Diverge Br and CFM markers

Branch marker fields: OPCODE, TARGET, and a 2-bit type code
  00 : normal branch
  10 : diverge forward branch
  11 : diverge loop branch

CFM marker field: CFM rel address

CFM = CFM rel address + PC
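Decoding these markers is straightforward. A minimal sketch, assuming the type code and the relative CFM address have already been extracted from the instruction: the function and table names are hypothetical, and the code 01 is left undefined because the slide does not list it.

```python
BRANCH_KINDS = {
    0b00: "normal branch",
    0b10: "diverge forward branch",
    0b11: "diverge loop branch",
}


def decode_marker(pc, type_code, cfm_rel_address):
    """Return (branch kind, CFM address or None) for a marked branch at `pc`."""
    kind = BRANCH_KINDS[type_code]          # raises KeyError for the unlisted code 01
    if type_code == 0b00:
        return kind, None                   # normal branches have no CFM point
    return kind, pc + cfm_rel_address       # CFM = CFM rel address + PC
```

For example, a diverge forward branch at `0x1000` with a relative CFM address of `0x40` has its CFM point at `0x1040`.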

Page 54

Entering Dynamic Predication Mode

Entry condition
  When a diverge branch has low confidence.

The Front-end
  Stores the address of the CFM point to the CFM register.
  Forks the RAS, GHR, and RAT.
  Allocates a predicate register.

Fetch Mechanisms
  Round-robin fetch from two paths
  The processor follows the branch predictor until it reaches the corresponding CFM point.
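The entry condition and the round-robin fetch can be sketched at basic-block granularity. This is a simplified model with hypothetical names: real hardware alternates per fetch cycle rather than per block, and a path that runs out of blocks is treated here as having reached the CFM point.

```python
def should_enter_dynamic_predication(is_diverge_branch, low_confidence):
    """Entry condition: a diverge branch whose prediction has low confidence."""
    return is_diverge_branch and low_confidence


def round_robin_fetch(path_a, path_b, cfm):
    """Alternate between two predicted paths until both have reached `cfm`.

    path_a / path_b: block names in predicted order, each ending at `cfm`
    Returns the order in which blocks are fetched, ending with the CFM block.
    """
    iters = [iter(path_a), iter(path_b)]
    done = [False, False]
    order, turn = [], 0
    while not all(done):
        if not done[turn]:
            block = next(iters[turn], cfm)
            if block == cfm:
                done[turn] = True       # this path waits at the CFM point
            else:
                order.append(block)
        turn = 1 - turn                 # switch to the other path
    order.append(cfm)                   # fetch continues from the CFM point once
    return order
```

With paths `["C", "E", "H"]` and `["B", "H"]` and CFM point `"H"`, the fetch order is C, B, E, H: the shorter path stops at H and waits while the longer one catches up.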

Page 55

Exiting Dynamic Predication Mode

Exit condition
  Both paths of a diverge branch have reached the corresponding CFM point.
  A diverge branch is resolved.

Select-µop mechanism
  Similar to φ-node in SSA
  Merges register values from two paths.

Page 56

Multipath Execution

Instructions after the control-flow merge point are fetched multiple times.
  Waste of resources and energy.

[Figure: control-flow graph with blocks A through I and two low-confidence branches, forking execution into four paths (path 1 through path 4). Blocks H and I, which follow the merge point, appear on every path.]

Page 57

Modeling Software Predication

Mark using a binary instrumentation tool
  All simple and nested hammocks can be predicated.
  All instructions between a branch and the control-flow merge point are fetched.
  All nested branches are predicated.