Optimizing Parallel Embedded Systems
Dr. Edwin Sha
Professor, Computer Science
University of Texas at Dallas
http://www.utdallas.edu/~edsha
[email protected]
Dec 29, 2015
Parallel Architecture is Ubiquitous
Parallel architecture is everywhere:
• As small as a cellular phone
• Modern DSP processors (VLIW), network processors
• Modern CPUs (instruction-level parallelism)
• Your home PC (small number of processors)
• Application-specific systems (image processing, speech processing, network routers, look-up tables, etc.)
• File servers
• Database servers and web servers
• Supercomputers
We are interested in domain-specific HW/SW parallel systems.
Organization of the Presentation
Introduction to parallel architectures
Using sorting as an example to show various implementations on parallel architectures.
Introduction to embedded systems: strict constraints
Timing optimization: parallelize loops and nested loops. Retiming and multi-dimensional retiming. Full parallelism: all the nodes can be executed in parallel.
Design space exploration and optimizations for code size, data memory, low-power, etc.
Intelligent prefetching and partitioning to hide memory latency
Conclusions
Technology Trend
• Microprocessor performance increases 50% - 100% per year. Where does the performance gain come from? Clock rate and capacity.
• Clock rate increases only 30% per year.
[Chart: clock rate (MHz), 0.1 to 1,000 on a log scale, vs. year (1970-2005), for the i4004, i8008, i8080, i8086, i80286, i80386, Pentium, and R10000.]
Technology Trend
[Chart: transistors per chip, 1,000 to 100,000,000 on a log scale, vs. year (1970-2005), for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]
Transistor count grows much faster than clock rate:
• it increases about 40% per year,
• an order of magnitude more contribution over two decades.
Exploit Parallelism at Every Level
• Algorithm level
• Thread level
  E.g., each service request is created as a thread
• Iteration level (loop level)
  E.g., For_all i = 1 to n do {loop body}.
  All n iterations can be parallelized.
• Loop-body level (instruction level)
  Parallelize instructions inside a loop body as much as possible
• Hardware level: parallelize and pipeline the execution of an instruction such as multiplication, etc.
Sorting on Linear Array of Processors
Input: x1, x2, ..., xn. Output: a sorted sequence (ascending order).
Architecture: a linear array of k processors. Assume k = n at first.
• What is the optimal time for sorting? Obviously it takes O(n) time for data to reach the rightmost processor.
• Let's consider the different sequential algorithms and then think about how to use them on a linear array of processors. This is a good example.
• Selection Sort
• Insertion Sort
• Bubble Sort
• Bucket Sort
• Sample Sort
Selection Sort
Algorithm: for i = 1 to n, pick the i-th smallest element.
Is it good?
Timing: (n-1) + ... + 2 + 1 = n(n-1)/2.
Example on 5,1,2,4: scan 5,1,2,4 and keep 1; scan 5,2,4 and keep 2; scan 5,4 and keep 4; 5 remains. 3 + 2 + 1 = 6 steps.
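For concreteness, here is a minimal sequential selection sort in C (a sketch of the scheme above; the code itself is not from the slides):

#include <stdio.h>

/* Selection sort: pass i scans the unsorted suffix and swaps the smallest
   remaining element into position i. The passes cost
   (n-1) + ... + 2 + 1 = n(n-1)/2 comparisons, matching the timing above. */
void selection_sort(int a[], int n) {
    for (int i = 0; i < n - 1; i++) {
        int min = i;
        for (int j = i + 1; j < n; j++)
            if (a[j] < a[min]) min = j;
        int t = a[i]; a[i] = a[min]; a[min] = t;
    }
}

int main(void) {
    int a[] = {5, 1, 2, 4};  /* the slide's example input */
    selection_sort(a, 4);
    for (int i = 0; i < 4; i++) printf("%d ", a[i]);  /* prints: 1 2 4 5 */
    printf("\n");
    return 0;
}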
Insertion Sort
[Figure: the input 5,1,2,4 streams into the array over time; each new element is inserted into its sorted position as it arrives.]
Timing: n only! (4 clock cycles in this example.)
Problem: needs a global bus.
Pipeline Sorting without Global Wire
Organization: a systolic array. Each cell holds a value y, receives x from its left neighbor, and emits z to its right neighbor:
  Initially, y = ∞
  If x > y then z ← x
  else { z ← y; y ← x }
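A small C simulation of this cell rule (my own sketch; INT_MAX stands in for the initial y = ∞, and the inner loop ripples values through all cells in one step, so only the final cell contents are meaningful, not the cycle-accurate timing):

#include <limits.h>
#include <stdio.h>

#define N 4

int main(void) {
    int x[N] = {5, 1, 2, 4};   /* input stream (slide example) */
    int y[N];                  /* one register per cell */
    for (int c = 0; c < N; c++) y[c] = INT_MAX;   /* initially y = "infinity" */

    for (int t = 0; t < N; t++) {
        int in = x[t];
        for (int c = 0; c < N; c++) {   /* each cell keeps the smaller value */
            int out;
            if (in > y[c]) { out = in; }              /* pass the larger on */
            else           { out = y[c]; y[c] = in; } /* keep the smaller  */
            in = out;      /* output becomes the next cell's input */
        }
    }
    for (int c = 0; c < N; c++) printf("%d ", y[c]);  /* prints: 1 2 4 5 */
    printf("\n");
    return 0;
}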
Bubble Sorting
[Figure: the input 5,1,2,4 ripples through the array step by step; 7 clock cycles in this example.]
The worst algorithm in the sequential model, but a good one in this case!
Timing: 2n-1 for n processors, i.e., O(n) time.
What if the number of processors k < n? O(n * n/k) for k processors.
Can we get O((n/k) log (n/k))?
Bucket Sort
Can sorting cost be lower than the (n log n) comparison lower bound, i.e., O(n)?
[Figure: the interval [0, 400] with splitters at 100, 200, 300, giving four buckets, e.g. {1, 19, 5, 98, ...}, {125, 167, 102, ...}, {201, 257, 207, ...}, {399, 336, 318, ...}.]
• But it assumes n elements are uniformly distributed over an interval [a, b].
• The interval [a, b] is divided into k equal-sized subintervals called buckets.
• Scan through the elements and put each one into its corresponding bucket. The number of elements in each bucket is about n/k.
Bucket Sort
• Then sort each bucket locally.
• The sequential running time is O(n + k(n/k) log(n/k)) = O(n log(n/k)) (sketched below).
• If k = n/128, then we get an O(n) algorithm.
• Parallelization is straightforward. It is pretty good: very little communication is required between processors.
• But what happens when the input data are not uniformly distributed? One bucket may get almost all the elements.
• How do we smartly pick splitters so that each bucket has at most 2n/k elements? (Sample sort)
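A sketch of the sequential bucket sort in C under the uniformity assumption (the bucket arithmetic and names are mine, not from the slides):

#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Bucket sort over [a, b) with k equal-sized buckets: count elements per
   bucket, prefix-sum the counts into offsets, scatter, then sort each
   bucket locally -- O(n + k*(n/k)*log(n/k)) when the data are uniform. */
void bucket_sort(int *x, int n, int a, int b, int k) {
    int *out = malloc(n * sizeof *out);
    int *cnt = calloc(k + 1, sizeof *cnt);
    for (int i = 0; i < n; i++)
        cnt[(long long)(x[i] - a) * k / (b - a) + 1]++;
    for (int j = 1; j <= k; j++) cnt[j] += cnt[j - 1];  /* bucket offsets */
    for (int i = 0; i < n; i++) {
        int bkt = (long long)(x[i] - a) * k / (b - a);
        out[cnt[bkt]++] = x[i];
    }
    int start = 0;                        /* sort each bucket locally */
    for (int j = 0; j < k; j++) {
        qsort(out + start, cnt[j] - start, sizeof *out, cmp_int);
        start = cnt[j];
    }
    memcpy(x, out, n * sizeof *x);
    free(out); free(cnt);
}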
Sample Sort
First Step: Splitter selection (An important step)
Smartly select k-1 splitters from some samples.
Second Step: Bucket sort using these splitters on k buckets.
Guarantee: Each bucket has at most 2n/k elements.
• Directly divide the n input elements into k blocks of size n/k each, and sort each block.
• From each sorted block, choose k-1 evenly spaced elements. Then sort these k(k-1) elements.
• Select k-1 evenly spaced elements from these k(k-1) elements as the splitters (sketched below).
• Scan through the n input elements and use the k-1 splitters to put each element into its corresponding bucket.
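A hedged C sketch of the deterministic splitter-selection step (the block layout and helper name are mine; it follows the four bullets above and assumes k divides n and n/k >= k):

#include <stdlib.h>

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Returns k-1 splitters that guarantee buckets of at most 2n/k elements. */
void select_splitters(int *x, int n, int k, int *splitters /* k-1 slots */) {
    int bs = n / k;                                 /* block size */
    int *samples = malloc((size_t)k * (k - 1) * sizeof *samples);
    int m = 0;
    for (int b = 0; b < k; b++) {
        qsort(x + b * bs, bs, sizeof *x, cmp_int);  /* sort each block */
        for (int i = 1; i < k; i++)                 /* k-1 evenly spaced */
            samples[m++] = x[b * bs + i * bs / k];
    }
    qsort(samples, m, sizeof *samples, cmp_int);    /* sort k(k-1) samples */
    for (int i = 1; i < k; i++)                     /* final k-1 splitters */
        splitters[i - 1] = samples[i * m / k];
    free(samples);
}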
Sample Sort
Sequential: O(n log(n/k)) + O(k^2 log k) + O(n log(n/k)). Not an O(n) algorithm, but it is very efficient for parallel implementation.
Step 1: sort each of the k blocks.
Step 2: sort the sampled elements and pick the final splitters.
Step 3: bucket sort using these splitters.
Randomized Sample Sort
Processor 0 randomly picks k*s samples, where s is an over-sampling ratio such as 64 or 128.
Sort these samples and select k-1 evenly spaced numbers as splitters.
With high probability the splitters are picked well, i.e., with low probability there is a big bucket.
But it cannot be used for hard real-time systems.
To sort 5 million numbers on a Sun cluster with 4 machines using MPI in our tests:
• Randomized sample sort takes 5 seconds
• Deterministic sample sort takes 10 seconds
• Radix sort takes > 500 seconds (too much communication)
Embedded Systems Overview
Embedded computing systems:
• Computing systems embedded within electronic devices
• Repeatedly carry out a particular function or a set of functions
• Nearly any computing system other than a desktop computer is an embedded system
• Billions of units produced yearly, versus millions of desktop units
• About 50 per household, 50 - 100 per automobile
Some common characteristics of embedded systems
Application-specific
• Executes a single program, repeatedly
• New ones might be adaptive and/or have multiple modes
Tightly constrained
• Low cost, low power, small, fast, etc.
Reactive and real-time
• Continually reacts to changes in the system's environment
• Must compute certain results in real time without delay
A “short list” of embedded systems
And the list grows longer each year.
Anti-lock brakes, auto-focus cameras, automatic teller machines, automatic toll systems, automatic transmissions, avionic systems, battery chargers, camcorders, cell phones, cell-phone base stations, cordless phones, cruise control, curbside check-in systems, digital cameras, disk drives, electronic card readers, electronic instruments, electronic toys/games, factory control, fax machines, fingerprint identifiers, home security systems, life-support systems, medical testing systems, modems, MPEG decoders, network cards, network switches/routers, on-board navigation, pagers, photocopiers, point-of-sale systems, portable video games, printers, satellite phones, scanners, smart ovens/dishwashers, speech recognizers, stereo systems, teleconferencing systems, televisions, temperature controllers, theft tracking systems, TV set-top boxes, VCRs, DVD players, video game consoles, video phones, washers and dryers.
An embedded system example -- a digital camera
[Block diagram: a digital camera chip; the lens and CCD feed a CCD preprocessor and A2D/D2A converters, connected to a pixel coprocessor, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, memory controller, ISA bus interface, UART, LCD controller, and display controller.]
Single-functioned -- always a digital camera
Tightly-constrained -- Low cost, low power, small, fast
Design metric competition -- improving one may worsen others
Expertise with both software and hardware is needed to optimize design metrics
• Not just a hardware or software expert, as is common
• A designer must be comfortable with various technologies in order to choose the best for a given application and constraints
We need serious design space exploration.
[Diagram: competing design metrics - size, performance, power, NRE cost.]
Processor technology
Processors vary in their customization for the problem at hand
total = 0
for i = 1 to N loop
  total += M[i]
end loop
[Diagram: the desired functionality can be mapped onto a general-purpose processor (software), an application-specific processor, or a single-purpose processor (hardware).]
Design Productivity Gap
• 1981: a leading-edge chip required 100 designer-months (10,000 transistors / 100 transistors per month)
• 2002: a leading-edge chip requires 30,000 designer-months (150,000,000 transistors / 5,000 transistors per month)
• Design cost grew from $1M to $300M
[Chart: logic transistors per chip (millions) and productivity (K transistors per staff-month), both on log scales, 1981-2009; IC capacity outpaces design productivity, leaving a widening gap.]
More challenges coming
Parallel
• Consists of multiple processors and hardware
Heterogeneous, networked
• Each processor has its own speed, memory, power, reliability, etc.
Fault tolerance, reliability & security
• A major issue for critical applications
Design space exploration: timing, code size, data memory, power consumption, cost, etc.
System-level design, analysis, and optimization are important.
The compiler plays an important role. We need more research.
Let's start with timing optimizations, then other optimizations and design space issues.
Timing Optimization
Parallelization for nested loops.
Focus on computation- or data-intensive applications. Loops are the most critical parts.
Multi-dimensional (MD) systems: uniform nested loops.
Develop efficient algorithms to obtain the schedule with the minimum execution time while hiding memory latencies.
• ALU part: MD retiming to fully parallelize computations.
• Memory part: prefetching and partitioning to hide memory latencies.
Developed by Edwin Sha's group. The results are exciting.
Graph Representation for Loops
A[0] = A[1] = 0;
For (i=2; i<n; i++)
{
A[i] = D[i-2] / 3;
B[i] = A[i] * 5;
C[i] = A[i] + 7;
D[i] = B[i] + C[i];
}
[DFG: nodes A, B, C, D with edges A->B, A->C, B->D, C->D, and edge D->A carrying two delays.]
Schedule looped DFG
DFG: Static Schedule:
B
A
C
D
AB CD
AB CD
AB CD
AB CD… …
ScheduleLength = 3
Rotation: Loop Pipelining
[Figure: the original schedule repeats {A; B, C; D}; regrouping shifts each A into the previous iteration, giving a prologue A, a shorter rotated kernel, and an epilogue {B, C; D}.]
Graph Representation Using Retiming
[Figure: the DFG as a DAG before retiming (longest zero-delay path = 3) and after retiming (longest zero-delay path = 2).]
Multi-dimensional Problems
      DO 10 J = 0, N
        DO 1 I = 0, M
          d(i,j) = b(i,j-1) * c(i-1,j)   (node D)
          a(i,j) = d(i,j) * .5           (node A)
          b(i,j) = a(i,j) + 1.           (node B)
          c(i,j) = a(i,j) + 2.           (node C)
1       CONTINUE
10    CONTINUE
Circuit optimization
[MDFG: nodes A, B, C, D with delay edges (0,1) and (1,0), i.e., z2^-1 and z1^-1.]
An Example of a DSP Processor: TI TMS320C64x
Clock speed: 1.1 GHz; up to 8800 MIPS.
One-Dimensional Retiming (Leiserson-Saxe, '91)
[Figure: a loop "For I = 1, ..." whose body multiplies by 1.3, shown before and after retiming.]
Another Example
[Figure: a loop "For I = 1, ..." multiplying by 1.3; after retiming:]
  A(1) = B(-1) + 1
  For I = 1, ...
    B(I) = A(I) × 1.3
    A(I+1) = B(I-1) + 1
Retiming
• An integer-valued transformation r on nodes; registers are re-distributed.
• G = <V, E, d> becomes Gr = <V, E, dr>, where dr(e) = d(e) + r(u) - r(v) for each edge e: u -> v.
• Legal retiming: dr(e) >= 0 for every edge.
• The number of delays on any cycle remains constant. A small legality check is sketched below.
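A small C sketch that applies a retiming to an edge-list DFG and checks legality (the graph encoding is mine; the example is the four-node DFG from the earlier slides):

#include <stdio.h>

typedef struct { int u, v, d; } Edge;   /* edge u -> v carrying d delays */

/* dr(e) = d(e) + r(u) - r(v); legal iff dr(e) >= 0 on every edge. Cycle
   delay counts are preserved automatically: r telescopes around a cycle. */
int retime(const Edge *e, int ne, const int *r, int *dr) {
    for (int i = 0; i < ne; i++) {
        dr[i] = e[i].d + r[e[i].u] - r[e[i].v];
        if (dr[i] < 0) return 0;        /* illegal retiming */
    }
    return 1;
}

int main(void) {
    /* Nodes A=0, B=1, C=2, D=3; D -> A carries two delays. */
    Edge e[] = {{0,1,0}, {0,2,0}, {1,3,0}, {2,3,0}, {3,0,2}};
    int r[] = {1, 0, 0, 0};             /* retime A once, as in the example */
    int dr[5];
    if (retime(e, 5, r, dr))
        for (int i = 0; i < 5; i++)
            printf("edge %d->%d now has %d delay(s)\n", e[i].u, e[i].v, dr[i]);
    return 0;
}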
Multi-Dimensional Retiming
[Figures: a nested loop; retiming nested loops; illegal cases - new problems arise.]
Multi-Dimensional Retiming
[Figure: the iteration space of the nested loop.]
After Retiming r(A)=(-1,1)
Multi-Dimensional Data Flow Graph
Retimed MDFG
Retimed Cell Dependence Graph
Iteration Space for Retimed Graph
Legal schedule with row-wise executions. S=(0,1)
Illegal MD Retiming
Required Solution Needs:
• To avoid illegal retiming
• To be general
• To obtain full parallelism
• To be a fast algorithm
Schedule Vector (wavefront processing)
Legal schedule: s · d > 0 for every delay vector d.
Schedule-Based MD Retiming
[Figure: legal vs. feasible schedule vectors.]
ILP Formulation
Example: s = (1,4), c(Gr) = 1
Chained MD Retiming
The delay vectors (1,1) and (-1,1) require s · (1,1) > 0 and s · (-1,1) > 0.
Pick s = (0,1); choose r orthogonal to s, e.g., r = (1,0).
[Figure: the x-y schedule plane S+ with delay vectors (1,1) and (-1,1), schedule vector s, and retiming vector r.]
s must remain feasible for the new delay vectors.
Let the new delay be d' = d + k·r. We know s · d > 0 and s · r = 0, so
  s · (d + k·r) = s · d + k·(s · r) = s · d > 0,
hence s is a legal schedule for d + k·r.
[Figure: the schedule plane with vectors s, r and delay vectors d1, d2, d3.]
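The two conditions are easy to check mechanically; a minimal C sketch for 2-D vectors (mine, using the example's numbers):

#include <stdio.h>

typedef struct { int x, y; } Vec2;

static int dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

/* s is a legal schedule vector iff s.d > 0 for every delay vector d;
   if additionally s.r = 0, retimed delays d + k*r stay legal, since
   s.(d + k*r) = s.d + k*(s.r) = s.d > 0. */
int legal_schedule(Vec2 s, const Vec2 *d, int nd) {
    for (int i = 0; i < nd; i++)
        if (dot(s, d[i]) <= 0) return 0;
    return 1;
}

int main(void) {
    Vec2 d[] = {{1, 1}, {-1, 1}};   /* delay vectors from the example */
    Vec2 s = {0, 1}, r = {1, 0};    /* s = (0,1) is legal; r is orthogonal */
    printf("s legal: %d, s.r = %d\n", legal_schedule(s, d, 2), dot(s, r));
    return 0;
}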
Chained MD Retiming Algorithm
Chained MD Retiming Example
Synchronous Circuit Optimization Example – original design
Critical path: 6 adders and 2 multipliers.
Synchronous Circuit Optimization Example – original design (cont.)
Synchronous Circuit Optimization Example – Gnanasekaran’88
Critical path is the minimum
Synchronous Circuit Optimization Example – retimed design
Embedded System Design Review
Strict requirements
• Time, power consumption, code size, data memory size, hardware cost, area, etc.
• Time-to-market, time-to-prototype
Special architectural support
• Harvard architecture, on-chip memory, register files, etc.
Increasing amount of software
• Flexibility of software, short time-to-market, easy upgrades
• The amount of software is doubling every two years.
How do we generate high-quality code for embedded systems? How do we minimize and search the design space? What is the compiler's role?
Compiler in Embedded Systems
Ordinary C compilers for embedded processors are notorious for poor code quality.
• Data memory overhead for compiled code can reach a factor of 5
• Cycle overhead can reach a factor of 8, compared with manually generated assembly code (Rozenberg et al., 1998)
Beyond code generation, compilers are included in the control-flow loops of design space exploration, so they play an important role in the design phase: exploring efficient designs in a huge n-dimensional space, where each dimension corresponds to a design choice.
Compiler in Design Space Exploration
Algorithm selection: analyze dependencies between algorithms and processor architectures. HW/SW partitioning.
Memory-related issues (program and data memory):
• Optimally place programs and data in on-chip memories, and hide off-chip memory latencies by smart prefetching (Sha et al.)
• Data mapping for processors with multiple on-chip memory modules (Zhuge and Sha)
• Code size reduction for software-pipelined applications (Zhuge and Sha)
Instruction set options:
• Search for power-optimized instruction sets (Kin et al., 1999)
• Scheduling for loops and DAGs with minimum energy (Shao and Sha)
Design Space Minimization
A direct design space is too large; we must do design space minimization before exploration.
Derive properties of the relationships among design metrics:
• A huge number of design points can be proved infeasible before performing time-consuming design space exploration.
• Using our design space minimization algorithm, the design points can be reduced from 510 to 6 for synthesizing the All-pole Filter, for example.
Approach:
1. Develop effective optimization techniques: code size, time, data memory, low power
2. Understand the relationships among them
3. Design space minimization algorithms
4. Efficient design space exploration algorithms using fuzzy logic
Example Relation of Optimizations
Retiming
• Transforms a DFG to minimize its cycle period in polynomial time by redistributing the delays in the DFG.
• The cycle period c(G) of a DFG G is the computation time of the longest zero-delay path.
[Figure: the four-node DFG before and after retiming.]
Unfolding
• The original DFG G is unfolded f times, so the unfolded graph Gf consists of f copies of the original node set.
• Iteration period: P = c(Gf)/f.
• Code size is increased f times; software pipelining increases it further.
Experimental Results
The search space size using our method is only 2% of the search space using the standard method on average.
The quality of the solutions found by our algorithm is better than that of the standard method.
Benchmark        | Iter. period req. | Search points | Ratio | Unfold | #add | #mult | Iter. period | Code size
Biquad (std.)    | 3/2  | 228 |      | 4 | 4  | 16 | 3/2  | 80
Biquad (ours)    | 3/2  | 4   | 1.5% | 3 | 3  | 10 | 5/3  | 28
DEQ (std.)       | 8/5  | 486 |      | 5 | 10 | 18 | 8/5  | 110
DEQ (ours)       | 8/5  | 6   | 1.2% | 3 | 5  | 15 | 5/3  | 37
All-pole (std.)  | 18/5 | 510 |      | 5 | 10 | 10 | 18/5 | 150
All-pole (ours)  | 18/5 | 6   | 1.2% | 3 | 10 | 3  | 11/3 | 51
Elliptic (std.)  | 5    | 694 |      | F | F  | F  | F    | F
Elliptic (ours)  | 5    | 2   | 0.3% | 2 | 6  | 9  | 5    | 76
4-Stage (std.)   | 7/4  | 909 |      | F | F  | F  | F    | F
4-Stage (ours)   | 7/4  | 55  | 6%   | 4 | 7  | 33 | 7/4  | 112
(The last five columns give the final solutions; F = no solution found.)
A Design Space Minimization Problem
Clearly understand the relationships among retiming, unfolding, and iteration period.
Program Memory Considerations with Code Size Minimization
There are multiple on-chip memory banks, but usually only one program memory.
The capacity of an on-chip memory bank is very limited:
• Motorola's DSP56K has only 512 x 24-bit program memory
• ARM940T uses a 4K instruction cache (Icache)
• StrongARM SA-1110 uses a 16K cache
A widely used performance optimization technique, software pipelining, expands the code size to several times the original.
Designers need to fit the code into the small on-chip memory to avoid slow (external) memory accesses. Code size becomes a critical concern for many embedded processors.
Code Size Expansion Caused by Software Pipelining
Schedule length is decreased from 4 cycles to 1 cycle.
Code size is expanded to 3 times the original code size.
Original loop:
for i=1 to n do
  A[i] = E[i-4] + 9;
  B[i] = A[i] * 0.5;
  C[i] = A[i] + B[i-2];
  D[i] = A[i] * C[i];
  E[i] = D[i] + 30;
end

Software-pipelined code:
A[1] = E[-3] + 9;
A[2] = E[-2] + 9;
B[1] = A[1] * 0.5;
C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;
B[2] = A[2] * 0.5;
C[2] = A[2] + B[0];
D[1] = A[1] * C[1];
for i=1 to n-3 do
  A[i+3] = E[i-1] + 9;
  B[i+2] = A[i+2] * 0.5;
  C[i+2] = A[i+2] + B[i];
  D[i+1] = A[i+1] * C[i+1];
  E[i] = D[i] + 30;
end
B[n] = A[n] * 0.5;
C[n] = A[n] + B[n-2];
D[n-1] = A[n-1] * C[n-1];
E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];
E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;
Rotation Scheduling
Resource-constrained loop scheduling based on the retiming concept. Retiming gives a clear framework for software pipelining depth.
Given an initial DAG schedule, rotation scheduling repeatedly rotates down the nodes in the first row of the schedule.
In each rotation step, the nodes in the first row are:
• retimed once, by pushing one delay off each incoming edge of the node and adding one delay to each of its outgoing edges;
• rescheduled to available locations (such as the earliest ones) in the schedule, based on the new precedence relations defined by the retimed graph.
The optimal schedule length can be obtained in polynomial time (2|V| rotations) in most cases.
The technique can be generalized to deal with code size, switching activities, branches, nested loops, etc. A sketch of one rotation phase follows.
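A minimal C sketch of one rotation phase on the four-node DFG (mine; it uses an ASAP schedule with unlimited functional units, whereas real rotation scheduling reschedules under resource constraints):

#include <stdio.h>

enum { A, B, C, D, NN };
typedef struct { int u, v, d; } Edge;
static Edge e[] = {{A,B,0},{A,C,0},{B,D,0},{C,D,0},{D,A,2}};
enum { NE = sizeof e / sizeof e[0] };

/* ASAP control step of each node under zero-delay precedences only. */
static int schedule(int step[NN]) {
    for (int v = 0; v < NN; v++) step[v] = 0;
    for (int pass = 0; pass < NN; pass++)       /* relax to a fixed point */
        for (int i = 0; i < NE; i++)
            if (e[i].d == 0 && step[e[i].v] < step[e[i].u] + 1)
                step[e[i].v] = step[e[i].u] + 1;
    int len = 0;
    for (int v = 0; v < NN; v++) if (step[v] + 1 > len) len = step[v] + 1;
    return len;
}

int main(void) {
    int step[NN];
    printf("before rotation: length %d\n", schedule(step));   /* 3 */
    for (int v = 0; v < NN; v++)
        if (step[v] == 0)              /* rotate first-row nodes (here: A) */
            for (int i = 0; i < NE; i++) {
                if (e[i].v == v) e[i].d--;  /* pull a delay off incoming edges */
                if (e[i].u == v) e[i].d++;  /* push it onto outgoing edges */
            }
    printf("after rotation:  length %d\n", schedule(step));   /* 2 */
    return 0;
}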
Rotation vs. Modulo scheduling
Schedule a Cyclic DFG
[Figure: the DFG and its schedule; each iteration executes A, then B and C, then D. Schedule length = 3.]
Rotation: Loop Pipelining
[Figure: the original schedule repeats {A; B, C; D}. Rotation moves A across the iteration boundary, producing a prologue A, a shorter kernel, and an epilogue {B, C; D}; rescheduling then packs A into a free slot of the kernel.]
Retiming View of Loop Pipelining
[Figure: the DFG before retiming (cycle period = 3) and after retiming A (cycle period = 2).]
The Second Rotation
[Figure: the schedule after the 1st rotation phase, the 2nd rotation of its first row, and the final schedule after rescheduling, with the prologue growing accordingly.]
Retiming View of Loop Pipelining
[Figure: with r(A)=1, r(B)=r(C)=r(D)=0, the cycle period is 2; with r(A)=2, r(B)=r(C)=1, r(D)=0, the cycle period is 1.]
Prologue and Retiming Function
[Figure: the original schedule, the 1st rotation (r(A)=1), and the 2nd rotation (r(A)=2), each with its prologue and epilogue.]
The number of copies of node A in the prologue = r(A).
The number of copies of node A in the epilogue = (max_u r(u)) - r(A), for u in V.
CRED Technique Using Predicate Registers
Predicate register:
• An instruction can be guarded by a predicate register.
• The instruction is executed when the value of the predicate register is true; otherwise, the instruction is disabled.
Implementing CRED using predicate registers with counters (TI's TMS320C6x):
• Set the initial value p = (max_u r(u)) - r(v).
• Decrement p by one in each iteration.
• The instruction for node v is executed when 0 >= p > -n, where n is the loop count of the original loop; it is disabled when p > 0 or p <= -n (simulated below).
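A C simulation of the counted-predicate rule (the names and printout are mine; it only demonstrates when each guarded instruction fires, using the retiming values of the example on the next slide):

#include <stdio.h>

int main(void) {
    const char *name[] = {"A", "B", "C", "D", "E"};
    int r[] = {3, 2, 2, 1, 0};       /* retiming values from the example */
    int n = 5, max_r = 3, p[5];
    for (int v = 0; v < 5; v++) p[v] = max_r - r[v];   /* initial counters */

    /* One merged kernel replaces prologue + loop + epilogue. */
    for (int it = 0; it < max_r + n; it++) {
        printf("iteration %d executes:", it);
        for (int v = 0; v < 5; v++) {
            if (p[v] <= 0 && p[v] > -n)   /* predicate: 0 >= p > -n */
                printf(" %s", name[v]);   /* guarded instruction fires */
            p[v]--;                       /* decremented every iteration */
        }
        printf("\n");                     /* each node fires exactly n times */
    }
    return 0;
}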
The New Execution Sequence
D B E C A
B CA
ADB E C A
ACBDE
EB
DE
CEpilogue
StaticPrologue
Schedule
Mult Mult Adder Adder Adder
D
DB
E
CA
[-2]A[-1]C[1]E[-1]B[0]D
[-3]A[-2]C[-2]B[-1]D [0]E[-4]A[-3]C[-1]E[-3]B[-2]D
[-5]A[-4]C[-2]E[-4]B[-3]D[-6]A[-5]C[-3]E[-5]B[-4]D
[-7]A[-6]C[-4]E[-6]B[-5]D
Mult Mult Adder Adder Adder
Prologue
Static
Epilogue
[-1]A[2]E[1]D [0]C
[2]D [1]B [3]E [1]C [0]A
[0]B
Schedule
Software-pipelined loop schedule with r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0, and n=5.
The execution sequence after performingCRED using 4 conditional registers.
The new code size.
Processor Classes
Processor Class 0: no predicate registers
• Motorola's StarCore DSP processor
Processor Class 1: "condition code" bits in instructions, no predicate registers
• Intel's StrongARM and other ARM architectures
Processor Class 2: 1-bit predicate registers
• Philips' TriMedia multimedia processor
Processor Class 3: predicate registers with counters
• TI's TMS320C6x processor
Processor Class 4: specialized hardware support for executing software-pipelined loops
• IA64
Code Size Reduction for Class 3
Software-pipelined code (as before):
A[1] = E[-3] + 9;
A[2] = E[-2] + 9;
B[1] = A[1] * 0.5;
C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;
B[2] = A[2] * 0.5;
C[2] = A[2] + B[0];
D[1] = A[1] * C[1];
for i=1 to n-3 do
  A[i+3] = E[i-1] + 9;
  B[i+2] = A[i+2] * 0.5;
  C[i+2] = A[i+2] + B[i];
  D[i+1] = A[i+1] * C[i+1];
  E[i] = D[i] + 30;
end
B[n] = A[n] * 0.5;
C[n] = A[n] + B[n-2];
D[n-1] = A[n-1] * C[n-1];
E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];
E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;

With counted predicate registers (Class 3), the same loop shrinks to:
p=0; q=1; r=2; s=3;
for i=1 to n-3 do
  [p]A[i+3] = E[i-1] + 9; p--;
  [q]B[i+2] = A[i+2] * 0.5;
  [q]C[i+2] = A[i+2] + B[i]; q--;
  [r]D[i+1] = A[i+1] * C[i+1]; r--;
  [s]E[i] = D[i] + 30; s--;
end
CRED on Various Types of Processors
The TI model and IA64 are very efficient for code size reduction. The TI model is very effective for DSP processors that support predicate registers but lack specialized hardware such as IA64's.

Code size (instruction words) and reduction vs. plain software pipelining:
Benchmark              | Orig | Soft. pipe. | Class 0 StarCore | Class 1 ARM | Class 2 TriMedia | Class 3 TI  | Class 4 IA64
IIR Filter             | 8    | 16          | 20 (-25.0%)      | 16 (0%)     | 16 (0%)          | 12 (25.0%)  | 12 (25.0%)
Differential Equation  | 11   | 22          | 23 (-4.5%)       | 19 (13.6%)  | 19 (13.6%)       | 15 (31.8%)  | 15 (31.8%)
All-pole Filter        | 15   | 60          | 39 (35.0%)       | 31 (48.3%)  | 31 (48.3%)       | 23 (61.7%)  | 19 (68.3%)
Elliptic Filter        | 34   | 68          | 46 (32.4%)       | 42 (38.2%)  | 42 (38.2%)       | 38 (44.1%)  | 38 (44.1%)
4-stage Lattice Filter | 26   | 78          | 44 (43.6%)       | 38 (51.3%)  | 38 (51.3%)       | 32 (60.0%)  | 32 (60.0%)
Voltera Filter         | 27   | 54          | 39 (27.8%)       | 35 (35.2%)  | 35 (35.2%)       | 31 (42.6%)  | 31 (42.6%)
Average improvement    |      |             | 18.2%            | 31.1%       | 31.1%            | 44.2%       | 45.6%
Experimental Results on Code Size/Performance Trade-off
Code size/performance exploration for the All-pole Filter on a modified TMS320C6x processor with only 2 predicate registers.
The code size increases as the software pipelining depth increases and the schedule length decreases.
Our approach finds the shortest schedule length satisfying a code size constraint.

Pipeline depth | Schedule length | Instr. words (std.) | Instr. words (ours) | Reduction
2              | 11              | 23                  | 11                  | 52.2%
3              | 7               | 30                  | 19                  | 36.7%
4              | 5               | 39                  | 29                  | 25.6%
Code-size Reduction for Nested Loops
[Figure (a): a data flow graph with nodes 1-12 and delay vectors (0,1), (1,0), and its cell dependence graph over iterations (0,0)-(2,2).]
• Assume 8 functional units.
• Traditional software pipelining can reach only 6 clock cycles at best.
• Interchanging the loop indices cannot help optimization.
The original loop:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15 instructions)
    Inner loop (10 cycles, 12 instr., trip count = n)
    Outer 2 (5 cycles, 15 instr.)
Assuming m = 1000, n = 10:
Total cycles = m(6 + 10n + 5) = 10mn + 11m = 111,000
Code size = 42 instr.
MD Retiming and Code Reduction
[Figure (b): the DFG after MD retiming, with delay vectors (-4,1) and (1,0) on nodes 1-12.]
Retimed DFG: r(1)=r(2)=r(3)=r(4)=(4,0), r(5)=r(6)=(3,0), r(7)=r(8)=(2,0), r(9)=r(10)=(1,0), r(11)=r(12)=(0,0)
(b) Inner-outer combined software pipelining:
  Outer loop begin (trip count = m)
    Outer 1 & prologue (12 cycles, 15+28 instr.)
    Inner loop (2 cycles, 12 instr., trip count = n-4)
    Outer 2 & epilogue (12 cycles, 15+20 instr.)
  Total cycles = m(12 + 2(n-4) + 12) = 2mn + 16m = 36,000
  Code size = 90 instr.
(c) Code size reduction: remove prologue and epilogue:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15+4 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.)
  Total cycles = m(6 + 2(n+4) + 5) = 2mn + 19m = 39,000
  Code size = 50 instr.
Outer Loop Pipelining and Code Reduction
(d) Outer loop pipelining:
  Outer loop begin (trip count = m-1)
    Outer 1 (6 cycles, 19 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 19 instr.)
  Inner loop (2 cycles, 16 instr., trip count = n+4)
  Outer 2 (5 cycles, 15 instr.)
  Total cycles = 6 + (m-1)(2(n+4) + 6) + 2(n+4) + 5 = 2mn + 14m + 5 = 34,005
  Code size = 100 instr.
(e) Reduce the new epilogue:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 19+1 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 20 instr.)
  Total cycles = 6 + m(2(n+4) + 6) = 2mn + 14m + 6 = 34,006
  Code size = 71 instr.
A quick computation checking these totals appears below.
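These totals are easy to sanity-check; a throwaway C computation (mine) for m = 1000, n = 10:

#include <stdio.h>

int main(void) {
    long m = 1000, n = 10;   /* trip counts assumed throughout the example */
    printf("(a) %ld\n", m * (6 + 10 * n + 5));             /* 111000 */
    printf("(b) %ld\n", m * (12 + 2 * (n - 4) + 12));      /* 36000 */
    printf("(c) %ld\n", m * (6 + 2 * (n + 4) + 5));        /* 39000 */
    printf("(d) %ld\n", 6 + (m - 1) * (2 * (n + 4) + 6) + 2 * (n + 4) + 5); /* 34005 */
    printf("(e) %ld\n", 6 + m * (2 * (n + 4) + 6));        /* 34006 */
    return 0;
}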
Data Memory Considerations with Optimal Data Mapping
Multiple memory banks are accessible in parallel, providing higher memory bandwidth.
Many existing compilers cannot exploit this kind of architectural feature; instead, all variables are assigned to just one bank.
Data mapping and scheduling therefore become among the most important factors in performance optimization.
[Diagram: an ALU connected to Data Memory Bank 0 and Data Memory Bank 1 through data buses DB0 and DB1.]
IIR Filter – Data Flow Graph
[Figure: the IIR filter DFG, nodes 0-24, annotated with variables A-G.]
Our Model – Variable Independence Graph
[Figure: a graph over variables A-G with edge weights such as 2, 7/8, and 1/2, cut into Partition 1 and Partition 2.]
Weight of edge e = (u,v): the "gain" of putting u and v in different memory modules. We want to find a maximum-weight partition, e.g., with the greedy sketch below.
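A hedged greedy sketch of that maximum-weight 2-partition in C (the published algorithm differs; this just makes the objective concrete, with a few weights echoing the figure):

#include <stdio.h>

#define NV 7   /* variables A..G */

int main(void) {
    double w[NV][NV] = {{0}};   /* w[u][v]: gain if u, v land in different banks */
    int side[NV] = {0};         /* current bank (0 or 1) of each variable */
    w[0][1] = w[1][0] = 2.0;    /* illustrative weights; 0..6 stand for A..G */
    w[1][2] = w[2][1] = 0.5;
    w[3][4] = w[4][3] = 0.875;
    w[0][6] = w[6][0] = 0.5;

    /* Local search: move any variable whose move increases the cut weight;
       each move strictly improves the cut, so this terminates. */
    for (int improved = 1; improved; ) {
        improved = 0;
        for (int u = 0; u < NV; u++) {
            double gain = 0;    /* change in cut weight if u switches banks */
            for (int v = 0; v < NV; v++)
                gain += (side[u] == side[v]) ? w[u][v] : -w[u][v];
            if (gain > 0) { side[u] ^= 1; improved = 1; }
        }
    }
    for (int u = 0; u < NV; u++)
        printf("%c -> bank %d\n", 'A' + u, side[u]);
    return 0;
}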
Experimental Results
The IG approach uses list scheduling and an interference graph model (M. Saghir et al., University of Toronto, Canada; R. Leupers et al., University of Dortmund, Germany).
Our approach uses rotation scheduling with a variable repartitioning algorithm and the variable independence graph.
Different approaches result in different variable partitions.
The largest improvement in schedule length using our approach is 52.9%; the average improvement over the benchmarks is 44.8%.
Benchmark              | IG | Ours | Improvement
IIR Filter             | 17 | 8    | 52.9%
Differential Equation  | 21 | 11   | 52.4%
All-pole Filter        | 29 | 18   | 37.9%
4-stage Lattice Filter | 44 | 22   | 50.0%
Elliptical Filter      | 35 | 24   | 31.4%
Voltera Filter         | 41 | 23   | 43.9%
Conclusions
An exciting area: optimizations for parallel DSP and embedded systems. This talk gave an overview; much more work is needed.
Consider both architectures and compilers.
Presented techniques:
• Multi-dimensional (MD) retiming and rotation scheduling
• Code-size minimization for software-pipelined loops
• Design space minimization
• Optimal partitioning and prefetching to completely hide memory latencies, and to decide the minimum required on-chip memory
Detailed retiming, unfolding, low-power scheduling, rate-optimal scheduling, etc. were presented in the tutorial. There is still a lot more.
Please check my web page for details: www.utdallas.edu/~edsha