Page 1: Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs

Philip Brisk, Adam Kaplan, Majid Sarrafzadeh

Embedded and Reconfigurable Systems Lab
Computer Science Department
University of California, Los Angeles

DAC ’04. June 9, 2004. San Diego Convention Center, San Diego, CA

Page 2: Outline

• Custom Instruction Generation and Selection

• Resource Sharing

• Algorithm Description with Examples

• Datapath Synthesis Techniques

• Experimental Methodology and Results

• Summary

Page 3: Custom Instruction Generation and Selection

Custom Instruction Generation

• Compiler Profiles Application Code
• Extracts Favorable IR Patterns
• Synthesizes Patterns as Hardware Datapaths

Custom Instruction Selection

• Area Constraints Limit on-Chip Functionality
• NP-Hard 0-1 Knapsack Problem
• Formulated as an Integer Linear Program (ILP)

Page 4: ILP Formulation for Instruction Selection Problem

For each custom instruction i:
• Gain(i): Estimated Performance Gain of i
• Area(i): Estimated Area of i
• Selected(i): 1 if i is Selected; 0 Otherwise

Goal: Maximize Gain of Selected Instructions

Constraint: Area of Selected Instructions < FPGA Area
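The selection problem on this slide can be illustrated with a small dynamic program. This is only a sketch under stated assumptions: it is a textbook 0-1 knapsack DP rather than the ILP solver the slides refer to, it assumes integer areas, it treats the area bound as non-strict (the usual knapsack convention), and the function name and the numbers in the usage line are hypothetical.

```python
def select_instructions(gain, area, capacity):
    """0-1 knapsack by dynamic programming.

    gain[i], area[i] play the role of Gain(i), Area(i); capacity is the FPGA
    area budget. Returns (best total gain, sorted indices with Selected(i) = 1).
    """
    n = len(gain)
    # best[i][c] = best gain using only the first i instructions within area c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]                      # skip instruction i-1
            if area[i - 1] <= c:                             # or take it if it fits
                take = best[i - 1][c - area[i - 1]] + gain[i - 1]
                best[i][c] = max(best[i][c], take)
    # Walk the table backwards to recover which instructions were taken.
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= area[i - 1]
    return best[n][capacity], sorted(selected)


# Hypothetical gains and areas, purely for illustration.
print(select_instructions([10, 7, 4], [17, 25, 14], 40))  # (14, [0, 2])
```

Note that every Area(i) enters the constraint independently, which is exactly the additive assumption the following slides question.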

Page 5: What About Resource Sharing?

[Figure: Two DFGs (Area = 17 and Area = 25) and a single shared datapath ("My Datapath"). Area-cost legend values as extracted: 85, 1, 3, 1.5; the operator symbols were lost in extraction.]

My Datapath: Area = 28
ILP Area Estimate = 42

Page 6: Analysis

0-1 Knapsack Problem Formulation Over-Estimated Area: the ILP Estimate (42) is 150% of the Shared Datapath Area (28)

• ILP Solvers Do Not Consider Resource Sharing

How to Remedy This

• Develop a Resource Sharing Algorithm

• Avoid Additive Area Estimates Based on per-Instruction Costs
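The numbers quoted with the two-DFG example make the gap concrete. Assuming the ILP estimate is simply the sum of the two per-DFG areas (which matches 17 + 25 = 42), the arithmetic works out as follows:

```latex
\[
\mathrm{Area}_{\mathrm{ILP}} = 17 + 25 = 42,
\qquad
\frac{\mathrm{Area}_{\mathrm{ILP}}}{\mathrm{Area}_{\mathrm{shared}}} = \frac{42}{28} = 1.5
\]
```

In other words, the additive estimate is 1.5 times the area of the shared datapath, which is 50% more area than the datapath actually needs.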

Page 7: Resource Sharing for DFGs

Given:
• A Set of DFGs G* = {G1, …, Gn}

Goal:
• Construct a Consolidation Graph GC of Minimal Cost

Constraints:
• GC Must be Acyclic
• GC Must be a Supergraph of each Gi in G*

That’s Life:
• The Problem is NP-Hard
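The two constraints can be checked mechanically once a candidate GC and a node mapping are in hand. The sketch below is an illustration only: the function names, the edge-list representation, and the idea of passing an explicit mapping from each Gi into GC are assumptions, and it ignores multiplexers, which a real consolidated datapath would place on shared inputs.

```python
from collections import deque


def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph contains no cycle."""
    indeg = {v: 0 for v in nodes}
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(v for v in nodes if indeg[v] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return visited == len(nodes)  # nodes left unvisited lie on a cycle


def embeds(g_edges, g_ops, gc_edges, gc_ops, mapping):
    """True if `mapping` (node of Gi -> node of GC) is one-to-one, preserves
    operation types, and sends every edge of Gi to an edge of GC."""
    if len(set(mapping.values())) != len(mapping):
        return False
    if any(gc_ops[mapping[v]] != op for v, op in g_ops.items()):
        return False
    gc = set(gc_edges)
    return all((mapping[u], mapping[v]) in gc for u, v in g_edges)


# Hypothetical example: G1 = mul -> add, GC = mul -> {add, shl}.
g1_edges, g1_ops = [("a", "b")], {"a": "mul", "b": "add"}
gc_edges = [("m", "p"), ("m", "s")]
gc_ops = {"m": "mul", "p": "add", "s": "shl"}
print(is_acyclic(list(gc_ops), gc_edges))                           # True
print(embeds(g1_edges, g1_ops, gc_edges, gc_ops, {"a": "m", "b": "p"}))  # True
```

Checking a given mapping is cheap; finding the mappings that minimise the cost of GC is the part the slide identifies as NP-hard.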

Page 8: Resource Sharing Overview

Decompose Patterns into Input-Output Paths
• Path Based Resource Sharing (PBRS)

[Figure: example DFGs G1, G2, G3, G4.]
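Decomposing a DFG into input-output paths can be sketched as a plain depth-first enumeration of every source-to-sink path. The adjacency-dictionary representation and the function name are assumptions made for this example; the slides do not say how paths are enumerated or stored.

```python
def input_output_paths(succ):
    """Enumerate every source-to-sink path of a DAG.

    succ maps each node to the list of its successors; every node must
    appear as a key (sinks map to an empty list).
    """
    has_pred = {v for vs in succ.values() for v in vs}
    sources = [v for v in succ if v not in has_pred]
    paths = []

    def walk(v, prefix):
        prefix = prefix + [v]
        if not succ[v]:            # reached an output: record the whole path
            paths.append(prefix)
        for w in succ[v]:
            walk(w, prefix)

    for s in sources:
        walk(s, [])
    return paths


# Hypothetical DFG: inputs a and b feed c, which feeds outputs d and e.
dfg = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
for p in input_output_paths(dfg):
    print(p)
```

The number of such paths can grow exponentially with the size of the DFG, so this direct enumeration is only practical for small patterns.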

Page 10: Resource Sharing Overview

Use Substring Matching to Share Resources
• Merge DFGs Along Matched Nodes

[Figure: example DFGs G1, G2, G3, G4.]

Page 11: Resource Sharing Overview

Synthesize GC
• Requires Less Area than Synthesizing G1…G4 Separately

[Figure: G1, G2, G3, G4 and the consolidation graph GC.]

Page 12: Path-Based Resource Sharing

[Figure: two input-output paths, P1 and P2, written as strings of operations; the operator symbols were lost in extraction. Area-cost legend values as extracted: 85, 1, 3.]

Page 13: Maximum Area Common Substring

MACStr: O(L), L – Length of String

Area of MACStr = 26

[Figure: paths P1 and P2 with their maximum-area common substring; operator symbols lost in extraction.]
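A maximum-area common substring can be computed with a small variation on the classic longest-common-substring table, scoring each matched run by the summed area of its operations instead of its length. This sketch uses a simple quadratic DP, not the O(L) method quoted on the slide, and the operation names and area values are hypothetical.

```python
def macstr(p1, p2, area):
    """Maximum-area common substring (contiguous run) of two operation paths.

    Returns (area of the best run, the run itself). Assumes positive areas.
    """
    best_area, best_end, best_len = 0, 0, 0
    # prev[j] = (area, length) of the common run ending at p1[i-2], p2[j-1]
    prev = [(0, 0)] * (len(p2) + 1)
    for i in range(1, len(p1) + 1):
        cur = [(0, 0)] * (len(p2) + 1)
        for j in range(1, len(p2) + 1):
            if p1[i - 1] == p2[j - 1]:
                run_area, run_len = prev[j - 1]       # extend the diagonal run
                cur[j] = (run_area + area[p1[i - 1]], run_len + 1)
                if cur[j][0] > best_area:
                    best_area, best_end, best_len = cur[j][0], i, cur[j][1]
        prev = cur
    return best_area, p1[best_end - best_len:best_end]


# Hypothetical area costs and paths, for illustration only.
AREA = {"mul": 8, "add": 5, "shl": 3, "and": 1}
p1 = ["add", "mul", "add", "shl"]
p2 = ["and", "mul", "add", "and", "shl"]
print(macstr(p1, p2, AREA))  # (13, ['mul', 'add'])
```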

Page 14: Maximum Area Common Subsequence

MACSeq: O(L²/log L), L – Length of String

Area of MACSeq = 43

[Figure: paths P1 and P2 with their maximum-area common subsequence; operator symbols lost in extraction.]
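The subsequence variant drops the contiguity requirement, so matched operations may be separated by unmatched ones. A minimal sketch, assuming the standard quadratic weighted-LCS recurrence rather than the O(L²/log L) algorithm cited on the slide, with hypothetical operation names and area costs:

```python
def macseq(p1, p2, area):
    """Maximum-area common subsequence of two operation paths.

    Returns (total area of the matched operations, one optimal subsequence).
    """
    n, m = len(p1), len(p2)
    # best[i][j] = best matched area between p1[:i] and p2[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            if p1[i - 1] == p2[j - 1]:
                best[i][j] = max(best[i][j],
                                 best[i - 1][j - 1] + area[p1[i - 1]])
    # Trace back through the table to recover one optimal subsequence.
    seq, i, j = [], n, m
    while i > 0 and j > 0:
        matched = best[i - 1][j - 1] + area.get(p1[i - 1], 0)
        if p1[i - 1] == p2[j - 1] and best[i][j] == matched:
            seq.append(p1[i - 1])
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return best[n][m], seq[::-1]


# Hypothetical area costs and paths, for illustration only.
AREA = {"mul": 8, "add": 5, "shl": 3, "and": 1}
p1 = ["add", "mul", "add", "shl"]
p2 = ["and", "mul", "add", "and", "shl"]
print(macseq(p1, p2, AREA))  # (16, ['mul', 'add', 'shl'])
```

On the same pair of paths a subsequence can never share less area than a substring, which is consistent with the slides reporting a larger area for MACSeq (43) than for MACStr (26).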

Page 15: Resource Sharing Algorithm

Global Phase

Determine:
• Which DFGs to Merge
• An Initial Path to Merge

Local Phase

• Aggressively Apply PBRS to Share Resources Between the DFGs Selected by the Global Phase

Repeat Until all DFGs are Merged, or no Further Resource Sharing is Possible
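The overall loop can be illustrated with a heavily simplified model in which every DFG is a single path. This is not the authors' algorithm: the "global" step here just picks the pair of paths with the largest shareable area, the "local" step merges that pair along a maximum-area common subsequence, and all names and area costs are hypothetical.

```python
from itertools import combinations

# Hypothetical per-operation area costs (not the legend from the slides).
AREA = {"mul": 8, "add": 5, "shl": 3, "and": 1}


def macseq_merge(a, b):
    """Return (shared_area, merged), where merged is a common supersequence of
    a and b in which the maximum-area common subsequence appears only once."""
    n, m = len(a), len(b)
    # best[i][j] = maximum shareable area between a[i:] and b[j:]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j], best[i][j + 1])
            if a[i] == b[j]:
                best[i][j] = max(best[i][j], AREA[a[i]] + best[i + 1][j + 1])
    merged, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j] and best[i][j] == AREA[a[i]] + best[i + 1][j + 1]:
            merged.append(a[i])        # operation shared by both paths
            i += 1
            j += 1
        elif best[i][j] == best[i + 1][j]:
            merged.append(a[i])        # operation needed only by a
            i += 1
        else:
            merged.append(b[j])        # operation needed only by b
            j += 1
    return best[0][0], merged + a[i:] + b[j:]


def consolidate(paths):
    """Greedily merge paths until one remains or nothing more can be shared."""
    paths = [list(p) for p in paths]
    while len(paths) > 1:
        # "Global" step: choose the pair with the largest shareable area.
        (i, j), (shared, merged) = max(
            (((i, j), macseq_merge(paths[i], paths[j]))
             for i, j in combinations(range(len(paths)), 2)),
            key=lambda t: t[1][0])
        if shared == 0:
            break                      # no further resource sharing is possible
        # "Local" step: replace the chosen pair by its merged path.
        paths = [p for k, p in enumerate(paths) if k not in (i, j)] + [merged]
    return paths


dfgs = [["mul", "add", "shl"], ["mul", "shl"], ["and", "add"]]
result = consolidate(dfgs)
print(result)                                              # [['and', 'mul', 'add', 'shl']]
print(sum(AREA[op] for p in dfgs for op in p),             # additive estimate: 33
      sum(AREA[op] for op in result[0]))                   # merged area: 17
```

Real DFGs contain many paths, so the actual local phase has to keep applying path matching between the two selected graphs and must reject merges that would create cycles.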

Page 16: Resource Sharing Algorithm

[Figure: the four example DFGs G1, G2, G3, G4 with the area-cost legend (values as extracted: 85, 1, 3).]

Pages 17-18: Global Phase

[Figure: animation frames showing G1, G2, G3, G4, annotated "MACSeq/MACStr".]

Page 19: Entering Local Phase

[Figure: G1 and G2, annotated "MACSeq/MACStr".]

Pages 20-28: Local Phase

[Figure: animation frames in which G1 and G2 are merged step by step into G12, annotated "MACSeq/MACStr". Node labels as extracted: 1, 2, 0.]

Page 29: Returning To Global Phase

[Figure: G12, G3, G4.]

Pages 30-31: Global Phase

[Figure: animation frames showing G3, G4, G12, annotated "MACSeq/MACStr".]

Page 32: Entering Local Phase

[Figure: G12 and G4, annotated "MACSeq/MACStr".]

Pages 33-37: Local Phase

[Figure: animation frames in which G12 and G4 are merged step by step into G124, annotated "MACSeq/MACStr". Node labels as extracted: 4, 12, 0.]

Pages 38-41: A Local Decision

[Figure: animation frames showing G4, G12 and the partially built G124, annotated "MACSeq/MACStr". Node labels as extracted: 4, 12, 0.]

Pages 42-43: Cycles are Illegal

[Figure: two candidate versions of G124, annotated "MACSeq/MACStr". One candidate is marked "ILLEGAL!" and the other "LEGAL!".]
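Whether a proposed set of node merges is legal can be tested by contracting the matched nodes and looking for a cycle in the result. The sketch below is one straightforward way to do that; the function name, the edge-list representation and the example graphs are assumptions, not taken from the slides.

```python
def merge_creates_cycle(edges, pairs):
    """edges: directed edges over the nodes of both DFGs (node names disjoint).
    pairs: (u, v) node pairs proposed for merging.
    Returns True if contracting every pair yields a graph with a cycle."""
    rep = {}

    def find(x):
        # Follow representative links to the node that stands for x.
        while rep.get(x, x) != x:
            x = rep[x]
        return x

    for u, v in pairs:
        rep[find(u)] = find(v)

    merged = {(find(u), find(v)) for u, v in edges}
    nodes = {x for e in merged for x in e}
    succ = {x: [] for x in nodes}
    for u, v in merged:
        succ[u].append(v)

    # Depth-first search with colours: 1 = on the current path, 2 = finished.
    colour = {}

    def has_cycle(u):
        colour[u] = 1
        for v in succ[u]:
            if colour.get(v) == 1 or (v not in colour and has_cycle(v)):
                return True
        colour[u] = 2
        return False

    return any(x not in colour and has_cycle(x) for x in nodes)


# Hypothetical example: one DFG has a -> b, the other has c -> d.
edges = [("a", "b"), ("c", "d")]
print(merge_creates_cycle(edges, [("a", "c"), ("b", "d")]))  # False: orders agree
print(merge_creates_cycle(edges, [("a", "d"), ("b", "c")]))  # True: orders conflict
```

The second call merges nodes whose relative order differs between the two graphs, which is the kind of merge that has to be rejected to keep the consolidation graph acyclic.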

Page 44: Local Phase

[Figure: G4, G12 and the merged graph G124. Node labels as extracted: 4, 12, 0.]

Page 45: Returning To Global Phase

[Figure: G3 and G124.]

Pages 46-49: Global Phase

[Figure: animation frames showing G3, G124 and the start of G1234, annotated "MACSeq/MACStr". Node labels as extracted: 3, 124, 0.]

Pages 50-61: Local Phase

[Figure: animation frames in which G3 and G124 are merged step by step into G1234, annotated "MACSeq/MACStr". Node labels as extracted: 3, 124, 0.]

Pages 62-63: We’re Done

[Figure: G1, G2, G3, G4 and the final consolidation graph G1234.]

DFG Areas: 17, 25, 14, 20
Total Area of DFGs = 76
G1234 Area = 30
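The closing numbers of the walk-through can be checked by hand. Using only the areas reported for this example:

```latex
\[
17 + 25 + 14 + 20 = 76,
\qquad
\frac{76}{30} \approx 2.53,
\qquad
1 - \frac{30}{76} \approx 0.605
\]
```

So the additive total is about 2.5 times the area of G1234, and sharing removes roughly 60% of the additive area in this example.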

Page 64: Experimental Procedure

[Figure: flow diagram with the following blocks: Machine-SUIF Compiler; Custom Instr. Generation; Set of Patterns; Consolidation Graph Construction Algorithm; Consolidation Graph; Pipeline Synthesis; VLIW Synthesis; Estimate Area.]

Page 65: Pipelined Datapath Synthesis

Loop Bodies
• 80-90% of Program Execution Time
• Parallelism Exists Across Multiple Iterations
• Pipelined Datapath Yields Maximal Throughput

[Figure: Compiler; Data Flow Graph; Insert Registers & Muxes.]
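One common way to decide where pipeline registers go is to levelize the data flow graph and pad any edge that skips a level. The sketch below assumes unit-latency operations, one operation level per pipeline stage, and as-soon-as-possible levels; none of these details are stated on the slide, and the function names are hypothetical.

```python
def asap_levels(succ):
    """As-soon-as-possible level of every node in a DAG {node: [successors]}."""
    indeg = {v: 0 for v in succ}
    for vs in succ.values():
        for v in vs:
            indeg[v] += 1
    level = {v: 0 for v in succ}
    ready = [v for v in succ if indeg[v] == 0]
    while ready:
        u = ready.pop()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)   # must follow every producer
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return level


def balancing_registers(succ):
    """Extra delay registers per edge so all inputs of an operation arrive in
    the same pipeline stage (edges spanning one level need none)."""
    level = asap_levels(succ)
    return {(u, v): level[v] - level[u] - 1
            for u in succ for v in succ[u]
            if level[v] - level[u] > 1}


# Hypothetical DFG: a feeds b and c, and b also feeds c.
dfg = {"a": ["b", "c"], "b": ["c"], "c": []}
print(asap_levels(dfg))           # {'a': 0, 'b': 1, 'c': 2}
print(balancing_registers(dfg))   # {('a', 'c'): 1}
```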

Page 66: Pipelined Datapath Synthesis

[Figure: G1, G2, G3, G4 and the consolidation graph GC.]

Page 67: VLIW Datapath Synthesis

Non-Loop Computations
• Instruction-Level Parallelism
• Similar to Latency-Constrained Scheduling in High-Level Synthesis

[Figure: Compiler; Data Flow Graph.]
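Latency-constrained scheduling usually starts from the earliest and latest step at which each operation may execute. The following sketch computes those ASAP and ALAP bounds for unit-latency operations; it illustrates the general high-level-synthesis idea only, and the function name and example graph are assumptions rather than the procedure used in the experiments.

```python
def mobility(succ, latency):
    """ASAP and ALAP start steps for a DAG {node: [successors]} of unit-latency
    operations that must finish within `latency` steps.
    Returns {node: (asap, alap)}."""
    indeg = {v: 0 for v in succ}
    for vs in succ.values():
        for v in vs:
            indeg[v] += 1
    # Topological order via repeated removal of nodes without pending inputs.
    order, ready = [], [v for v in succ if indeg[v] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    asap = {v: 0 for v in succ}
    for u in order:                      # push consumers after their producers
        for v in succ[u]:
            asap[v] = max(asap[v], asap[u] + 1)
    alap = {v: latency - 1 for v in succ}
    for u in reversed(order):            # pull producers before their consumers
        for v in succ[u]:
            alap[u] = min(alap[u], alap[v] - 1)
    if any(asap[v] > alap[v] for v in succ):
        raise ValueError("latency bound is shorter than the critical path")
    return {v: (asap[v], alap[v]) for v in succ}


# Hypothetical DFG: a and b feed c; c and d feed e.
dfg = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
print(mobility(dfg, 3)["d"])   # (0, 1): d may run in step 0 or step 1
print(mobility(dfg, 3)["c"])   # (1, 1): c is on the critical path
```

Operations whose two bounds differ have slack, and a scheduler can move them between steps to reduce how many functional units are needed at once.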

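Page 68: Benchmark Suite

MediaBench Benchmark Suite

[Table, flattened in extraction. Columns: Exp. | Benchmark | File/Function | Num. Instrs. | Largest Instr. (Operations) | Avg. Ops per Instr. Values are listed per column in extraction order; row alignment is not fully recoverable.]

Exp.: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
Benchmark: Mesa, PGP, Rasta, Epic, JPEG, JPEG, JPEG, MPEG2, MPEG2, Rasta, Rasta
File/Function: blend.c, idea.c, mul_mdmd_md.c, collapse_pyr, jpeg_fdct_ifast, jpeg_idct_4x4, jpeg_idct_2x2, idct_col, FR4TR, Lqsolve.c, idct_row
Num. Instrs. (digits fused in extraction): 61457 2158794 10
Largest Instr. (digits fused in extraction): 18864 917125 303725
Avg. Ops per Instr.: 5.5, 3.2, 3.0, 3.0, 4.4, 7.0, 5.9, 3.1, 7.2, 20.0, 7.5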

Page 69: Experimental Results

[Figure: bar chart, "Pipelined Datapath Area Estimates". Y axis: Slices (0 to 25000). X axis: experiments 1 to 11. Series: Additive, CG/PBRS. Reference level: "XilinxE-1000 Area".]

Page 70: Experimental Results

[Figure: bar chart, "VLIW Datapath Area Estimates". Y axis: Slices (0 to 25000). X axis: experiments 1 to 11. Series: Additive, CG/PBRS. Reference level: "XilinxE-1000 Area".]

Page 71: Summary

• Area Estimates Based on Resource Sharing

• 0-1 Knapsack Problem Formulation Does Not Allow for Resource Sharing Estimates

• Resource Sharing Algorithm

• PBRS applied to Data Flow Graphs

• Experimental Results

• ILP Overestimates Area Costs by as much as 374% and 582% for Pipelined and VLIW Datapaths