Page 1
Area-Efficient Instruction Set Synthesis for Reconfigurable
System on Chip Designs
Philip Brisk Adam Kaplan Majid Sarrafzadeh
Embedded and Reconfigurable Systems Lab, Computer Science Department
University of California, Los Angeles
[email protected] [email protected] @cs.ucla.edu
DAC ’04. June 9, 2004. San Diego Convention Center, San Diego, CA
Page 2
Outline
• Custom Instruction Generation and Selection
• Resource Sharing
• Algorithm Description with Examples
• Datapath Synthesis Techniques
• Experimental Methodology and Results
• Summary
Page 3
Custom Instruction Generation and Selection
Custom Instruction Generation
• Compiler Profiles Application Code
• Extracts Favorable IR Patterns
• Synthesizes Patterns as Hardware Datapaths
Custom Instruction Selection
• Area Constraints Limit On-Chip Functionality
• NP-Hard 0-1 Knapsack Problem
• Formulated as an Integer Linear Program (ILP)
Page 4
ILP Formulation for Instruction Selection Problem
For each custom instruction i:
• Gain(i): Estimated Performance Gain of i
• Area(i): Estimated Area of i
• Selected(i): 1 if i is Selected; 0 Otherwise
Goal: Maximize Gain of Selected Instructions
Constraint: Area of Selected Instructions ≤ FPGA Area
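The selection problem on this slide can be sketched in a few lines. The following is a minimal illustration, not the authors' solver: it enumerates every 0/1 assignment of Selected(i) instead of calling an ILP package, it treats the constraint as "total area at most the FPGA area" (an assumption), and the gain and area numbers are hypothetical.

```python
from itertools import product


def select_instructions(gain, area, fpga_area):
    """Exhaustively solve the 0-1 selection problem for a small candidate set."""
    n = len(gain)
    best_gain, best_choice = 0, (0,) * n
    for selected in product((0, 1), repeat=n):
        total_area = sum(a * s for a, s in zip(area, selected))
        total_gain = sum(g * s for g, s in zip(gain, selected))
        # Keep the assignment only if it fits and improves the gain.
        if total_area <= fpga_area and total_gain > best_gain:
            best_gain, best_choice = total_gain, selected
    return best_gain, best_choice


# Hypothetical gains and areas, not taken from the slides.
print(select_instructions(gain=[10, 7, 4], area=[17, 25, 14], fpga_area=40))
# -> (14, (1, 0, 1))
```

Note that the area of a selection is computed as a plain sum of per-instruction areas, which is exactly the additive assumption the following slides question.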
Page 5
What About Resource Sharing?
Two DFGs: Area = 17 and Area = 25
ILP Area Estimate = 42
My Datapath: Area = 28
[Figure: the two DFGs and the datapath, with an Area Costs legend (85, 1, 3) and the label 1.5]
Page 6
Analysis
0-1 Knapsack Problem Formulation Over-Estimated Area: 42 vs. 28, i.e. 150% of the Shared Datapath's Area
• ILP Solvers Do Not Consider Resource Sharing
How to Remedy This
• Develop a Resource Sharing Algorithm
• Avoid Additive Area Estimates Based on per-Instruction Costs
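The arithmetic behind the additive estimate can be checked directly. This small sketch uses the three area figures quoted on the preceding example slide (17, 25, and 28); the function and variable names are illustrative only.

```python
def additive_estimate(areas):
    """Area estimate that ignores sharing: the sum of per-instruction areas."""
    return sum(areas)


dfg_areas = [17, 25]   # areas of the two DFGs in the example
shared_area = 28       # area of the datapath that shares resources

estimate = additive_estimate(dfg_areas)
ratio = estimate / shared_area
overestimate_pct = 100 * (estimate - shared_area) / shared_area

print(estimate, ratio, overestimate_pct)  # -> 42 1.5 50.0
```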
Page 7
Resource Sharing for DFGs
Given:
• A Set of DFGs G* = {G1, …, Gn}
Goal:
• Construct a Consolidation Graph GC of Minimal Cost
Constraints:
• GC Must be Acyclic
• GC Must be a Supergraph of Each Gi in G*
That’s Life:
• The Problem is NP-Hard
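The two constraints can be made concrete with a small validity check. This is a sketch under stated assumptions: graphs are successor dictionaries, each node carries an operation name, and the embedding of each Gi into GC is given explicitly as a node mapping. All names and the example graphs are hypothetical.

```python
def is_acyclic(succ):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    indegree = {n: 0 for n in nodes}
    for vs in succ.values():
        for v in vs:
            indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        n = ready.pop()
        visited += 1
        for v in succ.get(n, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return visited == len(nodes)


def is_supergraph(gc_succ, gc_op, g_succ, g_op, mapping):
    """True if `mapping` embeds DFG g into gc: ops match and edges are kept."""
    injective = len(set(mapping.values())) == len(mapping)
    ops_match = all(gc_op[mapping[n]] == op for n, op in g_op.items())
    edges_kept = all(mapping[v] in gc_succ.get(mapping[u], ())
                     for u, vs in g_succ.items() for v in vs)
    return injective and ops_match and edges_kept


# Hypothetical example: G1 is add -> mul, G2 is add -> shl; GC shares the adder.
gc_succ = {"x": ["y", "z"]}
gc_op = {"x": "add", "y": "mul", "z": "shl"}
g1 = ({"a": ["b"]}, {"a": "add", "b": "mul"}, {"a": "x", "b": "y"})
g2 = ({"c": ["d"]}, {"c": "add", "d": "shl"}, {"c": "x", "d": "z"})

valid = is_acyclic(gc_succ) and all(
    is_supergraph(gc_succ, gc_op, succ, op, m) for succ, op, m in (g1, g2))
print(valid)  # -> True
```

Checking a given GC is easy; the hard part, per the slide, is finding one of minimal cost.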
Page 8
Resource Sharing Overview
Decompose Patterns into Input-Output Paths
• Path Based Resource Sharing (PBRS)
[Figure: DFGs G1, G2, G3, and G4]
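Decomposing a DFG into input-output paths can be sketched as a depth-first enumeration from every node without predecessors to every node without successors. The successor-dictionary representation and the example graph are assumptions made for illustration; the number of such paths can grow exponentially with graph size.

```python
def input_output_paths(succ):
    """List every path from a source (no predecessors) to a sink (no successors)."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    has_pred = {v for vs in succ.values() for v in vs}
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        successors = succ.get(node, [])
        if not successors:          # reached an output node
            paths.append(prefix)
        for v in successors:
            walk(v, prefix)

    for source in sorted(nodes - has_pred):
        walk(source, [])
    return paths


# Hypothetical DFG: inputs a and b feed c, which feeds outputs d and e.
dfg = {"a": ["c"], "b": ["c"], "c": ["d", "e"]}
for path in input_output_paths(dfg):
    print(path)
```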
Page 10
Resource Sharing Overview
Use Substring Matching to Share Resources
• Merge DFGs Along Matched Nodes
[Figure: DFGs G1, G2, G3, and G4]
Page 11
Resource Sharing Overview
Synthesize GC
• Requires Less Area than Synthesizing G1…G4 Separately
[Figure: GC alongside G1, G2, G3, and G4]
Page 12
Path-Based Resource Sharing
[Figure: input-output paths P1 and P2 with the Area Costs legend]
Page 13
Maximum Area Common Substring
MACStr: O(L), where L is the Length of the String
Area of MACStr = 26
[Figure: paths P1 and P2 and their MACStr, with the Area Costs legend]
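A maximum area common substring can be sketched with a simple dynamic program over the two operation strings. This is not the O(L) method the slide refers to; it is a quadratic-time stand-in chosen for brevity, and it assumes every operation has a positive area. The operation names and area costs below are hypothetical.

```python
def macstr(p1, p2, cost):
    """Return (area, substring) of a maximum-area common substring of p1 and p2."""
    best_area, best_end, best_len = 0, 0, 0
    # prev[j] holds (area, length) of the common suffix ending at p1[i-2], p2[j-1].
    prev = [(0, 0)] * (len(p2) + 1)
    for i in range(1, len(p1) + 1):
        cur = [(0, 0)] * (len(p2) + 1)
        for j in range(1, len(p2) + 1):
            if p1[i - 1] == p2[j - 1]:
                area, length = prev[j - 1]
                cur[j] = (area + cost[p1[i - 1]], length + 1)
                if cur[j][0] > best_area:
                    best_area, best_end, best_len = cur[j][0], i, cur[j][1]
        prev = cur
    return best_area, p1[best_end - best_len:best_end]


# Hypothetical area costs and paths.
cost = {"mul": 8, "add": 5, "shl": 3, "and": 1}
p1 = ["add", "mul", "add", "shl"]
p2 = ["mul", "add", "shl", "and"]
print(macstr(p1, p2, cost))  # -> (16, ['mul', 'add', 'shl'])
```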
Page 14
Maximum Area Common Subsequence
MACSeq: O(L^2 / log L), where L is the Length of the String
Area of MACSeq = 43
[Figure: paths P1 and P2 and their MACSeq, with the Area Costs legend]
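A maximum area common subsequence is a weighted variant of the longest common subsequence problem. The sketch below uses the textbook O(L^2) dynamic program rather than the faster bound quoted on the slide, and again assumes positive area costs. Operation names and costs are hypothetical.

```python
def macseq(p1, p2, cost):
    """Return (area, subsequence) of a maximum-area common subsequence."""
    n, m = len(p1), len(p2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            if p1[i - 1] == p2[j - 1]:
                dp[i][j] = max(dp[i][j], dp[i - 1][j - 1] + cost[p1[i - 1]])
    # Walk back through the table to recover one optimal subsequence.
    seq, i, j = [], n, m
    while i > 0 and j > 0:
        if p1[i - 1] == p2[j - 1] and dp[i][j] == dp[i - 1][j - 1] + cost[p1[i - 1]]:
            seq.append(p1[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return dp[n][m], seq[::-1]


# Hypothetical area costs and paths; the matched operations need not be adjacent.
cost = {"mul": 8, "add": 5, "shl": 3, "and": 1}
p1 = ["mul", "and", "add", "shl"]
p2 = ["mul", "add", "and", "shl"]
print(macseq(p1, p2, cost))  # -> (16, ['mul', 'add', 'shl'])
```

For these two paths the best common substring is a single `mul` (area 8), so allowing gaps doubles the shared area.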
Page 15
Resource Sharing Algorithm
Global Phase
Determine:
• Which DFGs to Merge
• An Initial Path to Merge
Local Phase
• Aggressively Apply PBRS to Share Resources Between the DFGs Selected by the Global Phase
Repeat Until All DFGs are Merged, or No Further Resource Sharing is Possible
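The global phase can be sketched as a search over pairs of DFGs for the pair of paths that shares the most area. The selection criterion used here (largest common-subsequence area) is an assumption for illustration; the slides do not state the exact rule. Each DFG is represented only by its list of input-output paths, the local phase is not modelled, and all names and costs are hypothetical.

```python
from itertools import combinations


def common_area(p1, p2, cost):
    """Area of a maximum-area common subsequence of two operation paths."""
    prev = [0] * (len(p2) + 1)
    for x in p1:
        cur = [0]
        for j, y in enumerate(p2, start=1):
            best = max(prev[j], cur[j - 1])
            if x == y:
                best = max(best, prev[j - 1] + cost[x])
            cur.append(best)
        prev = cur
    return prev[-1]


def global_phase(dfg_paths, cost):
    """Pick two DFGs, and one path from each, with the largest shared area."""
    best = (0, None, None, None, None)
    for (a, paths_a), (b, paths_b) in combinations(sorted(dfg_paths.items()), 2):
        for pa in paths_a:
            for pb in paths_b:
                area = common_area(pa, pb, cost)
                if area > best[0]:
                    best = (area, a, b, pa, pb)
    return best  # an area of 0 means no further sharing is possible


# Hypothetical DFGs, each given as a list of input-output paths.
cost = {"mul": 8, "add": 5, "shl": 3, "and": 1}
dfg_paths = {
    "G1": [["add", "mul"]],
    "G2": [["add", "mul", "shl"]],
    "G3": [["and", "shl"]],
}
print(global_phase(dfg_paths, cost))
# -> (13, 'G1', 'G2', ['add', 'mul'], ['add', 'mul', 'shl'])
```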
Page 16
Resource Sharing Algorithm
[Figure: DFGs G1, G2, G3, and G4 with the Area Costs legend]
Page 17
Global Phase
[Figure sequence, Pages 17-18: G1, G2, G3, and G4; MACSeq/MACStr]
Page 19
Entering Local Phase
[Figure: G1 and G2; MACSeq/MACStr]
Page 20
Local Phase
[Figure sequence, Pages 20-28: G1 and G2 are merged step by step into G12 using MACSeq/MACStr]
Page 29
Returning To Global Phase
[Figure: G12, G3, and G4]
Page 30
Global Phase
[Figure sequence, Pages 30-31: G12, G3, and G4; MACSeq/MACStr]
Page 32
Entering Local Phase
[Figure: G12 and G4; MACSeq/MACStr]
Page 33
Local Phase
[Figure sequence, Pages 33-37: G12 and G4 are merged step by step into G124 using MACSeq/MACStr]
Page 38
A Local Decision
[Figure sequence, Pages 38-41: G12, G4, and the partially built G124; MACSeq/MACStr]
Page 42
Cycles are Illegal
[Figure sequence, Pages 42-43: candidate versions of G124, one marked ILLEGAL! (Page 42) and one marked LEGAL! (Page 43); MACSeq/MACStr]
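Why some merges are illegal can be shown with a reachability test: adding an edge u -> v to an acyclic graph creates a cycle exactly when u is already reachable from v. The sketch below assumes a successor-dictionary representation; the node names are hypothetical and do not correspond to the graphs in the figures.

```python
def reachable(succ, src, dst):
    """True if dst can be reached from src (a node reaches itself)."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(succ.get(n, ()))
    return False


def edge_is_legal(succ, u, v):
    """Adding u -> v keeps the graph acyclic iff u is not reachable from v."""
    return not reachable(succ, v, u)


# Hypothetical consolidation graph: add1 -> mul1 -> shl1.
gc = {"add1": ["mul1"], "mul1": ["shl1"]}

# Sharing that would require shl1 to feed add1 closes a loop.
print(edge_is_legal(gc, "shl1", "add1"))  # -> False
# A forward edge from add1 to shl1 is fine.
print(edge_is_legal(gc, "add1", "shl1"))  # -> True
```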
Page 44
Local Phase
[Figure: G12, G4, and G124]
Page 45
Returning To Global Phase
[Figure: G3 and G124]
Page 46
Global Phase
[Figure sequence, Pages 46-49: G3 and G124; MACSeq/MACStr; G1234 begins to form]
Page 50
Local Phase
[Figure sequence, Pages 50-61: G3 and G124 are merged step by step into G1234 using MACSeq/MACStr]
Page 62
We’re Done
[Figure: G1, G2, G3, G4, and the consolidation graph G1234]
G1: Area = 17   G2: Area = 25
G3: Area = 14   G4: Area = 20
Total Area of DFGs = 76
G1234: Area = 30
Page 64
Experimental Procedure
[Figure: flow chart with the boxes Machine-SUIF Compiler, Custom Instr. Generation, Set of Patterns, Consolidation Graph Construction Algorithm, Consolidation Graph, Pipeline Synthesis, VLIW Synthesis, and Estimate Area]
Page 65
Pipelined Datapath Synthesis
Loop Bodies
• 80-90% of Program Execution Time
• Parallelism Exists Across Multiple Iterations
• Pipelined Datapath Yields Maximal Throughput
[Figure: Compiler, Data Flow Graph, Insert Registers & Muxes]
Page 66
Pipelined Datapath Synthesis
[Figure: GC alongside G1, G2, G3, and G4]
Page 67
VLIW Datapath Synthesis
Non-Loop Computations
• Instruction-Level Parallelism
• Similar to Latency-Constrained Scheduling in High-Level Synthesis
[Figure: Compiler, Data Flow Graph]
Page 68
Benchmark Suite
MediaBench Benchmark Suite
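Exp. | Benchmark | File/Function | Num. Instrs. | Largest Instr. (Operations) | Avg. Ops per Instr.
[Table: column values listed in extraction order; digit runs whose boundaries are ambiguous are left unsplit]
Exp.: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
Benchmark: Mesa, PGP, Rasta, Epic, JPEG, JPEG, JPEG, MPEG2, MPEG2, Rasta, Rasta
File/Function: blend.c, idea.c, mul_mdmd_md.c, collapse_pyr, jpeg_fdct_ifast, jpeg_idct_4x4, jpeg_idct_2x2, idct_col, FR4TR, Lqsolve.c, idct_row
Num. Instrs.: 61457 2158794 10
Largest Instr. (Operations): 18864 917125 303725
Avg. Ops per Instr.: 5.5, 3.2, 3.0, 3.0, 4.4, 7.0, 5.9, 3.1, 7.2, 20.0, 7.5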
Page 69
Experimental Results
[Figure: Pipelined Datapath Area Estimates; y-axis: Slices, 0 to 25000; x-axis: 1 to 11; legend: Additive, CG/PBRS, Xilinx E-1000 Area]
Page 70
Experimental Results
[Figure: VLIW Datapath Area Estimates; y-axis: Slices, 0 to 25000; x-axis: 1 to 11; legend: Additive, CG/PBRS, Xilinx E-1000 Area]
Page 71
Summary
• Area Estimates Based on Resource Sharing
• 0-1 Knapsack Problem Formulation Does Not Allow for Resource Sharing Estimates
• Resource Sharing Algorithm
• PBRS applied to Data Flow Graphs
• Experimental Results
• ILP Overestimates Area Costs by as Much as 374% and 582% for Pipelined and VLIW Datapaths, Respectively