Copy Propagation Optimization for VLIW DSP Processors with Distributed Register Files
Chung-Ju Wu, Sheng-Yuan Chen, Jenq-Kuen Lee
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Source: research.ihost.com/lcpc06/presentations/56_Presentation.pdf
Outline
• Introduction & Background
• Issues & Motivations
• Enhanced Data Flow Analysis (EDFA)
  -- Cost Models
  -- Algorithm
  -- Running Examples for Algorithm
• Experimental Results & Summary
Introduction & Background
• Multi-port Design
  -- Not good on Area, Access Time, and Power
• Modern VLIW DSP design
  -- Cluster-based architecture
  -- Distributed register files
  -- Not general-purpose registers anymore !!
[Figure: a design in which the functional units (FU) share one Register File, alongside a cluster-based design. Cluster 1 and Cluster 2 each contain a Load/Store Unit, an Arithmetic Unit, Private Registers (A), Private Registers (AC), and Public Registers (D). Also shown: Memory Interface, Program Sequence Control Unit, and Scalar Unit with Private Registers (R).]
• Impact on Compiler
  -- Instruction Scheduling/Register Allocation
  -- Software Pipelining
  -- Global Register Allocation
  -- Data Flow Optimization
Distributed Register File
Impact on Compiler Techniques
• Instruction Scheduling & Register Allocation
  -- “Instruction scheduling for clustered VLIW DSPs”, R. Leupers, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2000.
  -- “Register Allocation for VLIW DSP Processors with Irregular Register Files”, Yung-Chia Lin, Yi-Ping You, and Jenq-Kuen Lee, in Proceedings of Compilers for Parallel Computers (CPC), Jan 2006.
• Software Pipelining
  -- “Optimizing Loop Performance for Clustered VLIW Architectures”, Yi Qian, Philip Sweany, Steve Carr, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2002.
• GRA (Global Register Allocation)
  -- “Global Register Partitioning”, Jason Hiser, Steve Carr, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2000.
• Data Flow Optimizations
  -- Copy Propagation
  -- Common Subexpression Elimination
Copy Propagation
• Replaces variables with earlier equivalents at compile time.
• Reduces some data dependencies.
• Exposes other available optimizations in the program.
• What happens if we apply conventional copy propagation to distributed register files?
Optimization: Copy Propagation
Before:        (1) Y := X;  (2) Z := Y;  (3) C := A + B + Z;
After step 1:  (1) Y := X;  (2) Z := X;  (3) C := A + B + Z;
After step 2:  (1) Y := X;  (2) Z := X;  (3) C := A + B + X;
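The three-statement example can be reproduced with a minimal sketch of conventional copy propagation over straight-line code. The statement encoding `(destination, [sources])` and the function name `propagate_copies` are assumptions made for illustration; they are not taken from the PAC DSP compiler.

```python
from typing import Dict, List, Tuple

# Hypothetical statement form: (destination, [source operands]).
# A statement with exactly one source operand is treated as a copy.
Stmt = Tuple[str, List[str]]


def propagate_copies(stmts: List[Stmt]) -> List[Stmt]:
    """Conventional copy propagation over straight-line code (no branches)."""
    copy_of: Dict[str, str] = {}  # variable -> earlier equivalent variable
    out: List[Stmt] = []
    for dest, srcs in stmts:
        # Replace every operand with its earlier equivalent, if one is known.
        new_srcs = [copy_of.get(s, s) for s in srcs]
        # Redefining dest invalidates every recorded copy that mentions it.
        copy_of = {k: v for k, v in copy_of.items() if dest not in (k, v)}
        # Record the statement as a copy when it has a single distinct source.
        if len(new_srcs) == 1 and new_srcs[0] != dest:
            copy_of[dest] = new_srcs[0]
        out.append((dest, new_srcs))
    return out


# (1) Y := X;  (2) Z := Y;  (3) C := A + B + Z;
program = [("Y", ["X"]), ("Z", ["Y"]), ("C", ["A", "B", "Z"])]
result = propagate_copies(program)
for dest, srcs in result:
    print(dest, ":=", " + ".join(srcs))
```

The sketch ignores where each variable lives, which is exactly the property that becomes a problem once the register files are distributed.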
PAC DSP Architecture
• PAC: Parallel Architecture Core
• Developed by STC (System-on-chip Technology Center) of the Industrial Technology Research Institute in Taiwan.
• Potentially for audio/video applications.
• Ports of the D register file: 3 read ports / 2 write ports for the D[0-15] register files
  -- 1r 1w for Scalar
  -- 2r 1w for ALU/LSU
• Presented at Microprocessor Forum 2006
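Motivation Example I (Cluster Communication)
[Figure: Cluster 1 and Cluster 2, each with an LSU, an ALU, and A, AC, and D register files. x, t6, and t8 are placed in Cluster 1; t3 is placed in Cluster 2.]
• What if the variables are placed as C1<x,t6,t8> and C2<t3>?
• After propagating the copy x := t3, the uses become t3 + t6 and t3 + t8.
• Each of these uses then needs an operand from the other cluster.
• Performance Anomaly !!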
Motivation Example II (Register Nature with Irregular Accessibility)
[Figure: one cluster with an LSU, an ALU, and A, AC, and D register files; a1 is in the A register file, ac1 and ac2 are in the AC register file.]
    (3) ADD d4,a1,ac1
    (5) ADD d6,a1,ac2
• An instruction CANNOT use A and AC registers as operands concurrently !!
• The compiler must insert extra copy assignments to move data into the D register file!
    (3.1) MOV d5,a1
    (3.2) ADD d4,d5,ac1
    (5.1) MOV d5,a1
    (5.2) ADD d6,d5,ac2
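The MOV insertion shown above can be sketched as a small legalization pass. This is an illustrative sketch only: the tuple encoding, the prefix-based `reg_file` classifier, and the fixed scratch register `d5` are assumptions chosen to match the slide, not the actual PAC DSP compiler interface.

```python
from typing import List, Tuple

# Hypothetical instruction form: (opcode, destination, [source registers]).
Instr = Tuple[str, str, List[str]]


def reg_file(reg: str) -> str:
    """Classify a register by name prefix (naming assumed from the slide)."""
    if reg.startswith("ac"):
        return "AC"
    if reg.startswith("a"):
        return "A"
    return "D"


def legalize(instrs: List[Instr], scratch: str = "d5") -> List[Instr]:
    """Insert a MOV into a D register when A and AC operands are mixed.

    Assumes at most one A-file operand per instruction, so a single
    scratch register is enough.
    """
    out: List[Instr] = []
    for op, dest, srcs in instrs:
        files = {reg_file(s) for s in srcs}
        if {"A", "AC"} <= files:
            fixed = []
            for s in srcs:
                if reg_file(s) == "A":
                    # Extra copy assignment: move the A operand into D.
                    out.append(("MOV", scratch, [s]))
                    fixed.append(scratch)
                else:
                    fixed.append(s)
            srcs = fixed
        out.append((op, dest, srcs))
    return out


code = [("ADD", "d4", ["a1", "ac1"]), ("ADD", "d6", ["a1", "ac2"])]
legal = legalize(code)
for op, dest, srcs in legal:
    print(op, ",".join([dest] + srcs))
```

Each mixed-operand ADD grows from one instruction to two, which is the penalty a propagation decision has to account for.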
Motivation Example III (Port Pressure)
Performance Anomaly
• Cluster communication: if copy propagation occurs between clusters, it may add communication overhead.
• Register nature with irregular accessibility: private registers can be accessed only by their corresponding functional units.
• Port pressure: the limited number of read/write ports causes the scheduler to split code into different instruction packets.
Enhanced Data Flow Analysis
• Definition: at every propagation decision point, for every propagation from variable n to variable m, say (n, m), a data flow profit P is computed:
    P = Gain(n, m) – Cost(n, m)
  -- Gain(n, m): the cost saved by applying copy propagation.
  -- Cost(n, m): the penalty incurred if copy propagation is performed.
• Gain() ≥ Cost(): Good !
• Gain() < Cost(): Bad !
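The decision rule can be written down directly. The sketch below assumes Gain and Cost are already available as integers in cycles; the `Candidate` type and the sample numbers are hypothetical and serve only to exercise the rule.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    n: str     # variable being propagated
    m: str     # variable it would replace
    gain: int  # Gain(n, m), assumed to be in cycles
    cost: int  # Cost(n, m), assumed to be in cycles


def profit(c: Candidate) -> int:
    """Data flow profit P = Gain(n, m) - Cost(n, m)."""
    return c.gain - c.cost


def profitable(cands: List[Candidate]) -> List[Candidate]:
    """Keep only propagations with Gain >= Cost, that is, P >= 0."""
    return [c for c in cands if profit(c) >= 0]


# Hypothetical numbers, not taken from the slides.
cands = [
    Candidate("p", "q", gain=2, cost=5),  # P = -3, rejected
    Candidate("r", "q", gain=4, cost=4),  # P = 0, accepted
]
kept = profitable(cands)
print([(c.n, c.m, profit(c)) for c in kept])
```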
EDFA Cost Function Gain(n,m)
Gain(n, m) =
RCC(n, m): the communication cost saved by propagating n to m.
ACA(c[j]): returns the number of available copy assignments that can be eliminated along the path from n to m.
EDFA Cost Function Cost(n,m)
CBC(n, m): returns the cost of propagating across clusters.
RP(n, m): returns the extra copy assignments needed to move data between private registers.
PP(n, m) =
Cost(n, m) = CBC(n, m) + RP(n, m) + PP(n, m)
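A minimal sketch of the cost sum follows. Since the formula for PP(n, m) is not given above, the sketch treats all three terms as caller-supplied integers; the `CostTerms` type and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostTerms:
    cbc: int  # CBC(n, m): cost of propagating across clusters
    rp: int   # RP(n, m): extra copy assignments between private registers
    pp: int   # PP(n, m): supplied by the caller; its formula is not given here


def cost(t: CostTerms) -> int:
    """Cost(n, m) = CBC(n, m) + RP(n, m) + PP(n, m)."""
    return t.cbc + t.rp + t.pp


# Assumed sample values: a cross-cluster propagation charged 3 cycles,
# and a same-cluster propagation that needs one extra MOV.
cross_cluster = CostTerms(cbc=3, rp=0, pp=0)
extra_move = CostTerms(cbc=0, rp=1, pp=0)
print(cost(cross_cluster), cost(extra_move))
```

Keeping the three terms separate makes it easy to see which architectural constraint is responsible for rejecting a given propagation.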
Enhanced Data Flow Analysis (EDFA) algorithm
• Data flows between nodes form an acyclic Data Flow Graph.
• Run the conventional copy propagation analysis without actually propagating variables.
• Collect all possible propagation paths, recalculate the profit of each, and output the revised result.
[Figure: data flow graphs whose nodes are the variables in statements, each node labeled I or M.]
EDFA Estimation Algorithm
• Edges can be shared on the propagation tree.
• Traverse each propagation path and revise the cost.
• This evaluates gain more accurately.
[Figure annotation: half cost]
Example
P = Gain(n, m) – Cost(n, m)
• Communication between clusters: 3 cycles
• Each instruction cost: 1 cycle
[Figure: Cluster 1 and Cluster 2, each with AC, A, and D register files; legend M (LSU) and I (ALU).]
• P_{a-c} = 1 − 3 = −2
• P_{b-c} = 1 − 0 = 1 (better!!)
Example (cont'd)
P = Gain(n, m) – Cost(n, m)
• Communication between clusters: 3 cycles
• Each instruction cost: 1 cycle
[Figure: Cluster 1 and Cluster 2, each with AC, A, and D register files; legend M (LSU) and I (ALU).]
• P_{a-c} = 6 − 1 = 5 (good!!)
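The profit values in the two example slides can be checked with a few lines of Python. The (gain, cost) pairs are copied from the slides; choosing the candidate with the highest non-negative profit is an assumption about how "better" is decided, and `best_candidate` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# (gain, cost) in cycles, copied from the two example slides.
first_example: Dict[str, Tuple[int, int]] = {"a-c": (1, 3), "b-c": (1, 0)}
second_example: Dict[str, Tuple[int, int]] = {"a-c": (6, 1)}


def profits(cands: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """P = Gain - Cost for every candidate propagation."""
    return {name: gain - cost for name, (gain, cost) in cands.items()}


def best_candidate(cands: Dict[str, Tuple[int, int]]) -> Optional[str]:
    """Return the candidate with the highest non-negative profit, if any."""
    best, best_p = None, -1
    for name, p in profits(cands).items():
        if p >= 0 and p > best_p:
            best, best_p = name, p
    return best


print(profits(first_example), best_candidate(first_example))
print(profits(second_example), best_candidate(second_example))
```

In the first example the cross-cluster candidate is rejected because the 3-cycle communication outweighs the single cycle saved; in the second the gain of 6 cycles dominates the cost of 1.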
PACDSP Compiler
• Based on Open Research Compiler (ORC)
• Intermediate Representation: WHIRL (Winning Hierarchical Intermediate Representation Language)
• Low Power Optimization (on-going work), TODAES’06
• Register Allocation
  -- SA (Simulated Annealing), LCPC’05
  -- PALF (Ping-pong Aware Local Favorable), CPC’06
[Chart legend: No Propagation, Original Data Flow Analysis, Enhanced Data Flow Analysis]
Conclusion
• Summary
  -- We address the problems of conventional data-flow equations over distributed register files.
  -- We propose an Enhanced Data Flow Analysis (EDFA) framework for compilers to avoid performance anomalies.
  -- EDFA keeps the advantages of the copy propagation optimization.
• Future Work
  -- Integrate with the common subexpression elimination module.