Optimizing Parallel Embedded Systems
Dr. Edwin Sha
Professor, Computer Science
University of Texas at Dallas
http://www.utdallas.edu/~edsha
[email protected]
Dec 29, 2015
Parallel Architecture is Ubiquitous
Parallel architecture is everywhere:
• As small as a cellular phone
• Modern DSP processors (VLIW), network processors
• Modern CPUs (instruction-level parallelism)
• Your home PC (small number of processors)
• Application-specific systems (image processing, speech processing, network routers, look-up tables, etc.)
• File servers
• Database servers and web servers
• Supercomputers
We are interested in domain-specific HW/SW parallel systems.
Organization of the Presentation
Introduction to parallel architectures
Using sorting as an example to show various implementations on parallel architectures.
Introduction to embedded systems: strict constraints
Timing optimization: parallelize loops and nested loops. Retiming and multi-dimensional retiming. Full parallelism: all the nodes can be executed in parallel.
Design space exploration and optimizations for code size, data memory, low-power, etc.
Intelligent prefetching and partitioning to hide memory latency
Conclusions
Technology Trend
• Microprocessor performance increases 50% - 100% per year. Where does the performance gain come from? Clock rate and capacity.
• Clock rate increases only 30% per year.
[Chart: clock rate (MHz), 0.1 to 1,000 on a log scale, vs. year (1970-2005), for the i4004, i8008, i8080, i8086, i80286, i80386, Pentium, and R10000.]
Technology Trend
[Chart: transistors per chip, 1,000 to 100,000,000 on a log scale, vs. year (1970-2005), for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]
Transistor count grows much faster than clock rate:
• it increases about 40% per year,
• an order of magnitude more contribution over two decades.
Exploit Parallelism at Every Level
• Algorithm level
• Thread level
  E.g., each service request is created as a thread
• Iteration level (loop level)
  E.g., For_all i = 1 to n do {loop body}.
  All n iterations can be parallelized.
• Loop-body level (instruction level)
  Parallelize instructions inside a loop body as much as possible
• Hardware level: parallelize and pipeline the execution of an instruction such as multiplication, etc.
Sorting on Linear Array of Processors
Input: x1, x2, ..., xn. Output: a sorted sequence (ascending order).
Architecture: a linear array of k processors. Assume k = n at first.
• What is the optimal time for sorting? Obviously it takes O(n) time for data to reach the rightmost processor.
• Let's consider the different sequential algorithms and then think about how to use them on a linear array of processors. This is a good example.
• Selection Sort
• Insertion Sort
• Bubble Sort
• Bucket Sort
• Sample Sort
Selection Sort
Algorithm: for i = 1 to n, pick the i-th smallest element.
Is it good?
Timing: (n-1) + ... + 2 + 1 = n(n-1)/2.
Example on 5,1,2,4: scan 5,1,2,4 and keep 1; scan 5,2,4 and keep 2; scan 5,4 and keep 4; 5 remains. 3 + 2 + 1 = 6 steps.
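For concreteness, here is a minimal sequential selection sort in C (a sketch of the scheme above; the code itself is not from the slides):

#include <stdio.h>

/* Selection sort: pass i scans the unsorted suffix and swaps the smallest
   remaining element into position i. The passes cost
   (n-1) + ... + 2 + 1 = n(n-1)/2 comparisons, matching the timing above. */
void selection_sort(int a[], int n) {
    for (int i = 0; i < n - 1; i++) {
        int min = i;
        for (int j = i + 1; j < n; j++)
            if (a[j] < a[min]) min = j;
        int t = a[i]; a[i] = a[min]; a[min] = t;
    }
}

int main(void) {
    int a[] = {5, 1, 2, 4};  /* the slide's example input */
    selection_sort(a, 4);
    for (int i = 0; i < 4; i++) printf("%d ", a[i]);  /* prints: 1 2 4 5 */
    printf("\n");
    return 0;
}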
Insertion Sort
[Figure: the input 5,1,2,4 streams into the array over time; each new element is inserted into its sorted position as it arrives.]
Timing: n only! (4 clock cycles in this example.)
Problem: needs a global bus.
Pipeline Sorting without Global Wire
Organization: a systolic array. Each cell holds a value y, receives x from its left neighbor, and emits z to its right neighbor:
  Initially, y = ∞
  If x > y then z ← x
  else { z ← y; y ← x }
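A small C simulation of this cell rule (my own sketch; INT_MAX stands in for the initial y = ∞, and the inner loop ripples values through all cells in one step, so only the final cell contents are meaningful, not the cycle-accurate timing):

#include <limits.h>
#include <stdio.h>

#define N 4

int main(void) {
    int x[N] = {5, 1, 2, 4};   /* input stream (slide example) */
    int y[N];                  /* one register per cell */
    for (int c = 0; c < N; c++) y[c] = INT_MAX;   /* initially y = "infinity" */

    for (int t = 0; t < N; t++) {
        int in = x[t];
        for (int c = 0; c < N; c++) {   /* each cell keeps the smaller value */
            int out;
            if (in > y[c]) { out = in; }              /* pass the larger on */
            else           { out = y[c]; y[c] = in; } /* keep the smaller  */
            in = out;      /* output becomes the next cell's input */
        }
    }
    for (int c = 0; c < N; c++) printf("%d ", y[c]);  /* prints: 1 2 4 5 */
    printf("\n");
    return 0;
}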
Bubble Sorting
[Figure: the input 5,1,2,4 ripples through the array step by step; 7 clock cycles in this example.]
The worst algorithm in the sequential model, but a good one in this case!
Timing: 2n-1 for n processors, i.e., O(n) time.
What if the number of processors k < n? O(n * n/k) for k processors.
Can we get O((n/k) log (n/k))?
Bucket Sort
Can sorting cost be lower than the (n log n) comparison lower bound, i.e., O(n)?
[Figure: the interval [0, 400] with splitters at 100, 200, 300, giving four buckets, e.g. {1, 19, 5, 98, ...}, {125, 167, 102, ...}, {201, 257, 207, ...}, {399, 336, 318, ...}.]
• But it assumes n elements are uniformly distributed over an interval [a, b].
• The interval [a, b] is divided into k equal-sized subintervals called buckets.
• Scan through the elements and put each one into its corresponding bucket. The number of elements in each bucket is about n/k.
Bucket Sort
• Then sort each bucket locally.
• The sequential running time is O(n + k(n/k) log(n/k)) = O(n log(n/k)) (sketched below).
• If k = n/128, then we get an O(n) algorithm.
• Parallelization is straightforward. It is pretty good: very little communication is required between processors.
• But what happens when the input data are not uniformly distributed? One bucket may get almost all the elements.
• How do we smartly pick splitters so that each bucket has at most 2n/k elements? (Sample sort)
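A sketch of the sequential bucket sort in C under the uniformity assumption (the bucket arithmetic and names are mine, not from the slides):

#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Bucket sort over [a, b) with k equal-sized buckets: count elements per
   bucket, prefix-sum the counts into offsets, scatter, then sort each
   bucket locally -- O(n + k*(n/k)*log(n/k)) when the data are uniform. */
void bucket_sort(int *x, int n, int a, int b, int k) {
    int *out = malloc(n * sizeof *out);
    int *cnt = calloc(k + 1, sizeof *cnt);
    for (int i = 0; i < n; i++)
        cnt[(long long)(x[i] - a) * k / (b - a) + 1]++;
    for (int j = 1; j <= k; j++) cnt[j] += cnt[j - 1];  /* bucket offsets */
    for (int i = 0; i < n; i++) {
        int bkt = (long long)(x[i] - a) * k / (b - a);
        out[cnt[bkt]++] = x[i];
    }
    int start = 0;                        /* sort each bucket locally */
    for (int j = 0; j < k; j++) {
        qsort(out + start, cnt[j] - start, sizeof *out, cmp_int);
        start = cnt[j];
    }
    memcpy(x, out, n * sizeof *x);
    free(out); free(cnt);
}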
Sample Sort
First Step: Splitter selection (An important step)
Smartly select k-1 splitters from some samples.
Second Step: Bucket sort using these splitters on k buckets.
Guarantee: Each bucket has at most 2n/k elements.
• Directly divide the n input elements into k blocks of size n/k each, and sort each block.
• From each sorted block, choose k-1 evenly spaced elements. Then sort these k(k-1) elements.
• Select k-1 evenly spaced elements from these k(k-1) elements as the splitters (sketched below).
• Scan through the n input elements and use the k-1 splitters to put each element into its corresponding bucket.
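A hedged C sketch of the deterministic splitter-selection step (the block layout and helper name are mine; it follows the four bullets above and assumes k divides n and n/k >= k):

#include <stdlib.h>

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Returns k-1 splitters that guarantee buckets of at most 2n/k elements. */
void select_splitters(int *x, int n, int k, int *splitters /* k-1 slots */) {
    int bs = n / k;                                 /* block size */
    int *samples = malloc((size_t)k * (k - 1) * sizeof *samples);
    int m = 0;
    for (int b = 0; b < k; b++) {
        qsort(x + b * bs, bs, sizeof *x, cmp_int);  /* sort each block */
        for (int i = 1; i < k; i++)                 /* k-1 evenly spaced */
            samples[m++] = x[b * bs + i * bs / k];
    }
    qsort(samples, m, sizeof *samples, cmp_int);    /* sort k(k-1) samples */
    for (int i = 1; i < k; i++)                     /* final k-1 splitters */
        splitters[i - 1] = samples[i * m / k];
    free(samples);
}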
Sample Sort
Sequential: O(n log(n/k)) + O(k^2 log k) + O(n log(n/k)). Not an O(n) algorithm, but it is very efficient for parallel implementation.
Step 1: sort each of the k blocks.
Step 2: sort the sampled elements and pick the final splitters.
Step 3: bucket sort using these splitters.
Randomized Sample Sort
Processor 0 randomly picks k*s samples, where s is an over-sampling ratio such as 64 or 128.
Sort these samples and select k-1 evenly spaced numbers as splitters.
With high probability the splitters are picked well, i.e., with low probability there is a big bucket.
But it cannot be used for hard real-time systems.
To sort 5 million numbers on a Sun cluster with 4 machines using MPI in our tests:
• Randomized sample sort takes 5 seconds
• Deterministic sample sort takes 10 seconds
• Radix sort takes > 500 seconds (too much communication)
Embedded Systems Overview
Embedded computing systems:
• Computing systems embedded within electronic devices
• Repeatedly carry out a particular function or a set of functions
• Nearly any computing system other than a desktop computer is an embedded system
• Billions of units produced yearly, versus millions of desktop units
• About 50 per household, 50 - 100 per automobile
Some common characteristics of embedded systems
Application-specific
• Executes a single program, repeatedly
• New ones might be adaptive and/or have multiple modes
Tightly constrained
• Low cost, low power, small, fast, etc.
Reactive and real-time
• Continually reacts to changes in the system's environment
• Must compute certain results in real time without delay
A “short list” of embedded systems
And the list grows longer each year.
Anti-lock brakes, auto-focus cameras, automatic teller machines, automatic toll systems, automatic transmissions, avionic systems, battery chargers, camcorders, cell phones, cell-phone base stations, cordless phones, cruise control, curbside check-in systems, digital cameras, disk drives, electronic card readers, electronic instruments, electronic toys/games, factory control, fax machines, fingerprint identifiers, home security systems, life-support systems, medical testing systems, modems, MPEG decoders, network cards, network switches/routers, on-board navigation, pagers, photocopiers, point-of-sale systems, portable video games, printers, satellite phones, scanners, smart ovens/dishwashers, speech recognizers, stereo systems, teleconferencing systems, televisions, temperature controllers, theft tracking systems, TV set-top boxes, VCRs, DVD players, video game consoles, video phones, washers and dryers.
An embedded system example -- a digital camera
[Block diagram: a digital camera chip; the lens and CCD feed a CCD preprocessor and A2D/D2A converters, connected to a pixel coprocessor, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, memory controller, ISA bus interface, UART, LCD controller, and display controller.]
Single-functioned -- always a digital camera
Tightly-constrained -- Low cost, low power, small, fast
Design metric competition -- improving one may worsen others
Expertise with both software and hardware is needed to optimize design metrics
• Not just a hardware or software expert, as is common
• A designer must be comfortable with various technologies in order to choose the best for a given application and constraints
We need serious design space exploration.
[Diagram: competing design metrics - size, performance, power, NRE cost.]
Processor technology
Processors vary in their customization for the problem at hand
total = 0
for i = 1 to N loop
  total += M[i]
end loop
[Diagram: the desired functionality can be mapped onto a general-purpose processor (software), an application-specific processor, or a single-purpose processor (hardware).]
Design Productivity Gap
• 1981: a leading-edge chip required 100 designer-months (10,000 transistors / 100 transistors per month)
• 2002: a leading-edge chip requires 30,000 designer-months (150,000,000 transistors / 5,000 transistors per month)
• Design cost grew from $1M to $300M
[Chart: logic transistors per chip (millions) and productivity (K transistors per staff-month), both on log scales, 1981-2009; IC capacity outpaces design productivity, leaving a widening gap.]
More challenges coming
Parallel
• Consists of multiple processors and hardware
Heterogeneous, networked
• Each processor has its own speed, memory, power, reliability, etc.
Fault tolerance, reliability & security
• A major issue for critical applications
Design space exploration: timing, code size, data memory, power consumption, cost, etc.
System-level design, analysis, and optimization are important.
The compiler plays an important role. We need more research.
Let's start with timing optimizations, then other optimizations and design space issues.
Timing Optimization
Parallelization for nested loops.
Focus on computation- or data-intensive applications. Loops are the most critical parts.
Multi-dimensional (MD) systems: uniform nested loops.
Develop efficient algorithms to obtain the schedule with the minimum execution time while hiding memory latencies.
• ALU part: MD retiming to fully parallelize computations.
• Memory part: prefetching and partitioning to hide memory latencies.
Developed by Edwin Sha's group. The results are exciting.
Graph Representation for Loops
A[0] = A[1] = 0;
For (i=2; i<n; i++)
{
A[i] = D[i-2] / 3;
B[i] = A[i] * 5;
C[i] = A[i] + 7;
D[i] = B[i] + C[i];
}
[DFG: nodes A, B, C, D with edges A->B, A->C, B->D, C->D, and edge D->A carrying two delays.]
Schedule looped DFG
DFG: Static Schedule:
B
A
C
D
AB CD
AB CD
AB CD
AB CD… …
ScheduleLength = 3
Rotation: Loop Pipelining
[Figure: the original schedule repeats {A; B, C; D}; regrouping shifts each A into the previous iteration, giving a prologue A, a shorter rotated kernel, and an epilogue {B, C; D}.]
Graph Representation Using Retiming
[Figure: the DFG as a DAG before retiming (longest zero-delay path = 3) and after retiming (longest zero-delay path = 2).]
Multi-dimensional Problems
      DO 10 J = 0, N
        DO 1 I = 0, M
          d(i,j) = b(i,j-1) * c(i-1,j)   (node D)
          a(i,j) = d(i,j) * .5           (node A)
          b(i,j) = a(i,j) + 1.           (node B)
          c(i,j) = a(i,j) + 2.           (node C)
1       CONTINUE
10    CONTINUE
Circuit optimization
[MDFG: nodes A, B, C, D with delay edges (0,1) and (1,0), i.e., z2^-1 and z1^-1.]
An Example of a DSP Processor: TI TMS320C64x
Clock speed: 1.1 GHz; up to 8800 MIPS.
One-Dimensional Retiming (Leiserson-Saxe, '91)
[Figure: a loop "For I = 1, ..." whose body multiplies by 1.3, shown before and after retiming.]
Another Example
[Figure: a loop "For I = 1, ..." multiplying by 1.3; after retiming:]
  A(1) = B(-1) + 1
  For I = 1, ...
    B(I) = A(I) × 1.3
    A(I+1) = B(I-1) + 1
Retiming
• An integer-valued transformation r on nodes; registers are re-distributed.
• G = <V, E, d> becomes Gr = <V, E, dr>, where dr(e) = d(e) + r(u) - r(v) for each edge e: u -> v.
• Legal retiming: dr(e) >= 0 for every edge.
• The number of delays on any cycle remains constant. A small legality check is sketched below.
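A small C sketch that applies a retiming to an edge-list DFG and checks legality (the graph encoding is mine; the example is the four-node DFG from the earlier slides):

#include <stdio.h>

typedef struct { int u, v, d; } Edge;   /* edge u -> v carrying d delays */

/* dr(e) = d(e) + r(u) - r(v); legal iff dr(e) >= 0 on every edge. Cycle
   delay counts are preserved automatically: r telescopes around a cycle. */
int retime(const Edge *e, int ne, const int *r, int *dr) {
    for (int i = 0; i < ne; i++) {
        dr[i] = e[i].d + r[e[i].u] - r[e[i].v];
        if (dr[i] < 0) return 0;        /* illegal retiming */
    }
    return 1;
}

int main(void) {
    /* Nodes A=0, B=1, C=2, D=3; D -> A carries two delays. */
    Edge e[] = {{0,1,0}, {0,2,0}, {1,3,0}, {2,3,0}, {3,0,2}};
    int r[] = {1, 0, 0, 0};             /* retime A once, as in the example */
    int dr[5];
    if (retime(e, 5, r, dr))
        for (int i = 0; i < 5; i++)
            printf("edge %d->%d now has %d delay(s)\n", e[i].u, e[i].v, dr[i]);
    return 0;
}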
Multi-Dimensional Retiming
[Figures: a nested loop; retiming nested loops; illegal cases - new problems arise.]
Multi-Dimensional Retiming
[Figure: the iteration space of the nested loop.]
After Retiming r(A)=(-1,1)
Multi-Dimensional Data Flow Graph
Retimed MDFG
Retimed Cell Dependence Graph
Iteration Space for Retimed Graph
Legal schedule with row-wise executions. S=(0,1)
Illegal MD Retiming
Required Solution Needs:
• To avoid illegal retiming
• To be general
• To obtain full parallelism
• To be a fast algorithm
Schedule Vector (wavefront processing)
Legal schedule: s · d > 0 for every delay vector d.
Schedule-Based MD Retiming
[Figure: legal vs. feasible schedule vectors.]
ILP Formulation
Example: s = (1,4), c(Gr) = 1
Chained MD Retiming
The delay vectors (1,1) and (-1,1) require s · (1,1) > 0 and s · (-1,1) > 0.
Pick s = (0,1); choose r orthogonal to s, e.g., r = (1,0).
[Figure: the x-y schedule plane S+ with delay vectors (1,1) and (-1,1), schedule vector s, and retiming vector r.]
s must remain feasible for the new delay vectors.
Let the new delay be d' = d + k·r. We know s · d > 0 and s · r = 0, so
  s · (d + k·r) = s · d + k·(s · r) = s · d > 0,
hence s is a legal schedule for d + k·r.
[Figure: the schedule plane with vectors s, r and delay vectors d1, d2, d3.]
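The two conditions are easy to check mechanically; a minimal C sketch for 2-D vectors (mine, using the example's numbers):

#include <stdio.h>

typedef struct { int x, y; } Vec2;

static int dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

/* s is a legal schedule vector iff s.d > 0 for every delay vector d;
   if additionally s.r = 0, retimed delays d + k*r stay legal, since
   s.(d + k*r) = s.d + k*(s.r) = s.d > 0. */
int legal_schedule(Vec2 s, const Vec2 *d, int nd) {
    for (int i = 0; i < nd; i++)
        if (dot(s, d[i]) <= 0) return 0;
    return 1;
}

int main(void) {
    Vec2 d[] = {{1, 1}, {-1, 1}};   /* delay vectors from the example */
    Vec2 s = {0, 1}, r = {1, 0};    /* s = (0,1) is legal; r is orthogonal */
    printf("s legal: %d, s.r = %d\n", legal_schedule(s, d, 2), dot(s, r));
    return 0;
}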
Chained MD Retiming Algorithm
Chained MD Retiming Example
Synchronous Circuit Optimization Example – original design
Critical path: 6 adders and 2 multipliers.
Synchronous Circuit Optimization Example – original design (cont.)
Synchronous Circuit Optimization Example – Gnanasekaran’88
Critical path is the minimum
Synchronous Circuit Optimization Example – retimed design
Embedded System Design Review
Strict requirements
• Time, power consumption, code size, data memory size, hardware cost, area, etc.
• Time-to-market, time-to-prototype
Special architectural support
• Harvard architecture, on-chip memory, register files, etc.
Increasing amount of software
• Flexibility of software, short time-to-market, easy upgrades
• The amount of software is doubling every two years.
How do we generate high-quality code for embedded systems? How do we minimize and search the design space? What is the compiler's role?
Compiler in Embedded Systems
Ordinary C compilers for embedded processors are notorious for poor code quality.
• Data memory overhead for compiled code can reach a factor of 5
• Cycle overhead can reach a factor of 8, compared with manually generated assembly code (Rozenberg et al., 1998)
Beyond code generation, compilers are included in the control-flow loops of design space exploration, so they play an important role in the design phase: exploring efficient designs in a huge n-dimensional space, where each dimension corresponds to a design choice.
Compiler in Design Space Exploration
Algorithm selection: analyze dependencies between algorithms and processor architectures. HW/SW partitioning.
Memory-related issues (program and data memory):
• Optimally place programs and data in on-chip memories, and hide off-chip memory latencies by smart prefetching (Sha et al.)
• Data mapping for processors with multiple on-chip memory modules (Zhuge and Sha)
• Code size reduction for software-pipelined applications (Zhuge and Sha)
Instruction set options:
• Search for power-optimized instruction sets (Kin et al., 1999)
• Scheduling for loops and DAGs with minimum energy (Shao and Sha)
Design Space Minimization
A direct design space is too large; we must do design space minimization before exploration.
Derive properties of the relationships among design metrics:
• A huge number of design points can be proved infeasible before performing time-consuming design space exploration.
• Using our design space minimization algorithm, the design points can be reduced from 510 to 6 for synthesizing the All-pole Filter, for example.
Approach:
1. Develop effective optimization techniques: code size, time, data memory, low power
2. Understand the relationships among them
3. Design space minimization algorithms
4. Efficient design space exploration algorithms using fuzzy logic
Example Relation of Optimizations
Retiming
• Transforms a DFG to minimize its cycle period in polynomial time by redistributing the delays in the DFG.
• The cycle period c(G) of a DFG G is the computation time of the longest zero-delay path.
[Figure: the four-node DFG before and after retiming.]
Unfolding
• The original DFG G is unfolded f times, so the unfolded graph Gf consists of f copies of the original node set.
• Iteration period: P = c(Gf)/f.
• Code size is increased f times; software pipelining increases it further.
Experimental Results
The search space size using our method is only 2% of the search space using the standard method on average.
The quality of the solutions found by our algorithm is better than that of the standard method.
Benchmark        | Iter. period req. | Search points | Ratio | Unfold | #add | #mult | Iter. period | Code size
Biquad (std.)    | 3/2  | 228 |      | 4 | 4  | 16 | 3/2  | 80
Biquad (ours)    | 3/2  | 4   | 1.5% | 3 | 3  | 10 | 5/3  | 28
DEQ (std.)       | 8/5  | 486 |      | 5 | 10 | 18 | 8/5  | 110
DEQ (ours)       | 8/5  | 6   | 1.2% | 3 | 5  | 15 | 5/3  | 37
All-pole (std.)  | 18/5 | 510 |      | 5 | 10 | 10 | 18/5 | 150
All-pole (ours)  | 18/5 | 6   | 1.2% | 3 | 10 | 3  | 11/3 | 51
Elliptic (std.)  | 5    | 694 |      | F | F  | F  | F    | F
Elliptic (ours)  | 5    | 2   | 0.3% | 2 | 6  | 9  | 5    | 76
4-Stage (std.)   | 7/4  | 909 |      | F | F  | F  | F    | F
4-Stage (ours)   | 7/4  | 55  | 6%   | 4 | 7  | 33 | 7/4  | 112
(The last five columns give the final solutions; F = no solution found.)
A Design Space Minimization Problem
Clearly understand the relationships among retiming, unfolding, and iteration period.
Program Memory Considerations with Code Size Minimization
There are multiple on-chip memory banks, but usually only one program memory.
The capacity of an on-chip memory bank is very limited:
• Motorola's DSP56K has only 512 x 24-bit program memory
• ARM940T uses a 4K instruction cache (Icache)
• StrongARM SA-1110 uses a 16K cache
A widely used performance optimization technique, software pipelining, expands the code size to several times the original.
Designers need to fit the code into the small on-chip memory to avoid slow (external) memory accesses. Code size becomes a critical concern for many embedded processors.
Code Size Expansion Caused by Software Pipelining
Schedule length is decreased from 4 cycles to 1 cycle.
Code size is expanded to 3 times the original code size.
Original loop:
for i=1 to n do
  A[i] = E[i-4] + 9;
  B[i] = A[i] * 0.5;
  C[i] = A[i] + B[i-2];
  D[i] = A[i] * C[i];
  E[i] = D[i] + 30;
end

Software-pipelined code:
A[1] = E[-3] + 9;
A[2] = E[-2] + 9;
B[1] = A[1] * 0.5;
C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;
B[2] = A[2] * 0.5;
C[2] = A[2] + B[0];
D[1] = A[1] * C[1];
for i=1 to n-3 do
  A[i+3] = E[i-1] + 9;
  B[i+2] = A[i+2] * 0.5;
  C[i+2] = A[i+2] + B[i];
  D[i+1] = A[i+1] * C[i+1];
  E[i] = D[i] + 30;
end
B[n] = A[n] * 0.5;
C[n] = A[n] + B[n-2];
D[n-1] = A[n-1] * C[n-1];
E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];
E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;
Rotation Scheduling
Resource-constrained loop scheduling based on the retiming concept. Retiming gives a clear framework for software pipelining depth.
Given an initial DAG schedule, rotation scheduling repeatedly rotates down the nodes in the first row of the schedule.
In each rotation step, the nodes in the first row are:
• retimed once, by pushing one delay off each incoming edge of the node and adding one delay to each of its outgoing edges;
• rescheduled to available locations (such as the earliest ones) in the schedule, based on the new precedence relations defined by the retimed graph.
The optimal schedule length can be obtained in polynomial time (2|V| rotations) in most cases.
The technique can be generalized to deal with code size, switching activities, branches, nested loops, etc. A sketch of one rotation phase follows.
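A minimal C sketch of one rotation phase on the four-node DFG (mine; it uses an ASAP schedule with unlimited functional units, whereas real rotation scheduling reschedules under resource constraints):

#include <stdio.h>

enum { A, B, C, D, NN };
typedef struct { int u, v, d; } Edge;
static Edge e[] = {{A,B,0},{A,C,0},{B,D,0},{C,D,0},{D,A,2}};
enum { NE = sizeof e / sizeof e[0] };

/* ASAP control step of each node under zero-delay precedences only. */
static int schedule(int step[NN]) {
    for (int v = 0; v < NN; v++) step[v] = 0;
    for (int pass = 0; pass < NN; pass++)       /* relax to a fixed point */
        for (int i = 0; i < NE; i++)
            if (e[i].d == 0 && step[e[i].v] < step[e[i].u] + 1)
                step[e[i].v] = step[e[i].u] + 1;
    int len = 0;
    for (int v = 0; v < NN; v++) if (step[v] + 1 > len) len = step[v] + 1;
    return len;
}

int main(void) {
    int step[NN];
    printf("before rotation: length %d\n", schedule(step));   /* 3 */
    for (int v = 0; v < NN; v++)
        if (step[v] == 0)              /* rotate first-row nodes (here: A) */
            for (int i = 0; i < NE; i++) {
                if (e[i].v == v) e[i].d--;  /* pull a delay off incoming edges */
                if (e[i].u == v) e[i].d++;  /* push it onto outgoing edges */
            }
    printf("after rotation:  length %d\n", schedule(step));   /* 2 */
    return 0;
}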
Rotation vs. Modulo scheduling
Schedule a Cyclic DFG
[Figure: the DFG and its schedule; each iteration executes A, then B and C, then D. Schedule length = 3.]
Rotation: Loop Pipelining
[Figure: the original schedule repeats {A; B, C; D}. Rotation moves A across the iteration boundary, producing a prologue A, a shorter kernel, and an epilogue {B, C; D}; rescheduling then packs A into a free slot of the kernel.]
Retiming View of Loop Pipelining
[Figure: the DFG before retiming (cycle period = 3) and after retiming A (cycle period = 2).]
The Second Rotation
[Figure: the schedule after the 1st rotation phase, the 2nd rotation of its first row, and the final schedule after rescheduling, with the prologue growing accordingly.]
Retiming View of Loop Pipelining
[Figure: with r(A)=1, r(B)=r(C)=r(D)=0, the cycle period is 2; with r(A)=2, r(B)=r(C)=1, r(D)=0, the cycle period is 1.]
Prologue and Retiming Function
[Figure: the original schedule, the 1st rotation (r(A)=1), and the 2nd rotation (r(A)=2), each with its prologue and epilogue.]
The number of copies of node A in the prologue = r(A).
The number of copies of node A in the epilogue = (max_u r(u)) - r(A), for u in V.
CRED Technique Using Predicate Registers
Predicate register:
• An instruction can be guarded by a predicate register.
• The instruction is executed when the value of the predicate register is true; otherwise, the instruction is disabled.
Implementing CRED using predicate registers with counters (TI's TMS320C6x):
• Set the initial value p = (max_u r(u)) - r(v).
• Decrement p by one in each iteration.
• The instruction for node v is executed when 0 >= p > -n, where n is the loop count of the original loop; it is disabled when p > 0 or p <= -n (simulated below).
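A C simulation of the counted-predicate rule (the names and printout are mine; it only demonstrates when each guarded instruction fires, using the retiming values of the example on the next slide):

#include <stdio.h>

int main(void) {
    const char *name[] = {"A", "B", "C", "D", "E"};
    int r[] = {3, 2, 2, 1, 0};       /* retiming values from the example */
    int n = 5, max_r = 3, p[5];
    for (int v = 0; v < 5; v++) p[v] = max_r - r[v];   /* initial counters */

    /* One merged kernel replaces prologue + loop + epilogue. */
    for (int it = 0; it < max_r + n; it++) {
        printf("iteration %d executes:", it);
        for (int v = 0; v < 5; v++) {
            if (p[v] <= 0 && p[v] > -n)   /* predicate: 0 >= p > -n */
                printf(" %s", name[v]);   /* guarded instruction fires */
            p[v]--;                       /* decremented every iteration */
        }
        printf("\n");                     /* each node fires exactly n times */
    }
    return 0;
}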
The New Execution Sequence
D B E C A
B CA
ADB E C A
ACBDE
EB
DE
CEpilogue
StaticPrologue
Schedule
Mult Mult Adder Adder Adder
D
DB
E
CA
[-2]A[-1]C[1]E[-1]B[0]D
[-3]A[-2]C[-2]B[-1]D [0]E[-4]A[-3]C[-1]E[-3]B[-2]D
[-5]A[-4]C[-2]E[-4]B[-3]D[-6]A[-5]C[-3]E[-5]B[-4]D
[-7]A[-6]C[-4]E[-6]B[-5]D
Mult Mult Adder Adder Adder
Prologue
Static
Epilogue
[-1]A[2]E[1]D [0]C
[2]D [1]B [3]E [1]C [0]A
[0]B
Schedule
Software-pipelined loop schedule with r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0, and n=5.
The execution sequence after performingCRED using 4 conditional registers.
The new code size.
Processor Classes
Processor Class 0: no predicate registers
• Motorola's StarCore DSP processor
Processor Class 1: "condition code" bits in instructions, no predicate registers
• Intel's StrongARM and other ARM architectures
Processor Class 2: 1-bit predicate registers
• Philips' TriMedia multimedia processor
Processor Class 3: predicate registers with counters
• TI's TMS320C6x processor
Processor Class 4: specialized hardware support for executing software-pipelined loops
• IA64
Code Size Reduction for Class 3
Software-pipelined code (as before):
A[1] = E[-3] + 9;
A[2] = E[-2] + 9;
B[1] = A[1] * 0.5;
C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;
B[2] = A[2] * 0.5;
C[2] = A[2] + B[0];
D[1] = A[1] * C[1];
for i=1 to n-3 do
  A[i+3] = E[i-1] + 9;
  B[i+2] = A[i+2] * 0.5;
  C[i+2] = A[i+2] + B[i];
  D[i+1] = A[i+1] * C[i+1];
  E[i] = D[i] + 30;
end
B[n] = A[n] * 0.5;
C[n] = A[n] + B[n-2];
D[n-1] = A[n-1] * C[n-1];
E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];
E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;

With counted predicate registers (Class 3), the same loop shrinks to:
p=0; q=1; r=2; s=3;
for i=1 to n-3 do
  [p]A[i+3] = E[i-1] + 9; p--;
  [q]B[i+2] = A[i+2] * 0.5;
  [q]C[i+2] = A[i+2] + B[i]; q--;
  [r]D[i+1] = A[i+1] * C[i+1]; r--;
  [s]E[i] = D[i] + 30; s--;
end
CRED on Various Types of Processors
The TI model and IA64 are very efficient for code size reduction. The TI model is very effective for DSP processors that support predicate registers but lack specialized hardware such as IA64's.

Code size (instruction words) and reduction vs. plain software pipelining:
Benchmark              | Orig | Soft. pipe. | Class 0 StarCore | Class 1 ARM | Class 2 TriMedia | Class 3 TI  | Class 4 IA64
IIR Filter             | 8    | 16          | 20 (-25.0%)      | 16 (0%)     | 16 (0%)          | 12 (25.0%)  | 12 (25.0%)
Differential Equation  | 11   | 22          | 23 (-4.5%)       | 19 (13.6%)  | 19 (13.6%)       | 15 (31.8%)  | 15 (31.8%)
All-pole Filter        | 15   | 60          | 39 (35.0%)       | 31 (48.3%)  | 31 (48.3%)       | 23 (61.7%)  | 19 (68.3%)
Elliptic Filter        | 34   | 68          | 46 (32.4%)       | 42 (38.2%)  | 42 (38.2%)       | 38 (44.1%)  | 38 (44.1%)
4-stage Lattice Filter | 26   | 78          | 44 (43.6%)       | 38 (51.3%)  | 38 (51.3%)       | 32 (60.0%)  | 32 (60.0%)
Voltera Filter         | 27   | 54          | 39 (27.8%)       | 35 (35.2%)  | 35 (35.2%)       | 31 (42.6%)  | 31 (42.6%)
Average improvement    |      |             | 18.2%            | 31.1%       | 31.1%            | 44.2%       | 45.6%
Experimental Results on Code Size/Performance Trade-off
Code size/performance exploration for the All-pole Filter on a modified TMS320C6x processor with only 2 predicate registers.
The code size increases as the software pipelining depth increases and the schedule length decreases.
Our approach finds the shortest schedule length satisfying a code size constraint.

Pipeline depth | Schedule length | Instr. words (std.) | Instr. words (ours) | Reduction
2              | 11              | 23                  | 11                  | 52.2%
3              | 7               | 30                  | 19                  | 36.7%
4              | 5               | 39                  | 29                  | 25.6%
Code-size Reduction for Nested Loops
[Figure (a): a data flow graph with nodes 1-12 and delay vectors (0,1), (1,0), and its cell dependence graph over iterations (0,0)-(2,2).]
• Assume 8 functional units.
• Traditional software pipelining can reach only 6 clock cycles at best.
• Interchanging the loop indices cannot help optimization.
The original loop:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15 instructions)
    Inner loop (10 cycles, 12 instr., trip count = n)
    Outer 2 (5 cycles, 15 instr.)
Assuming m = 1000, n = 10:
Total cycles = m(6 + 10n + 5) = 10mn + 11m = 111,000
Code size = 42 instr.
MD Retiming and Code Reduction
[Figure (b): the DFG after MD retiming, with delay vectors (-4,1) and (1,0) on nodes 1-12.]
Retimed DFG: r(1)=r(2)=r(3)=r(4)=(4,0), r(5)=r(6)=(3,0), r(7)=r(8)=(2,0), r(9)=r(10)=(1,0), r(11)=r(12)=(0,0)
(b) Inner-outer combined software pipelining:
  Outer loop begin (trip count = m)
    Outer 1 & prologue (12 cycles, 15+28 instr.)
    Inner loop (2 cycles, 12 instr., trip count = n-4)
    Outer 2 & epilogue (12 cycles, 15+20 instr.)
  Total cycles = m(12 + 2(n-4) + 12) = 2mn + 16m = 36,000
  Code size = 90 instr.
(c) Code size reduction: remove prologue and epilogue:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15+4 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.)
  Total cycles = m(6 + 2(n+4) + 5) = 2mn + 19m = 39,000
  Code size = 50 instr.
Outer Loop Pipelining and Code Reduction
(d) Outer loop pipelining:
  Outer loop begin (trip count = m-1)
    Outer 1 (6 cycles, 19 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 19 instr.)
  Inner loop (2 cycles, 16 instr., trip count = n+4)
  Outer 2 (5 cycles, 15 instr.)
  Total cycles = 6 + (m-1)(2(n+4) + 6) + 2(n+4) + 5 = 2mn + 14m + 5 = 34,005
  Code size = 100 instr.
(e) Reduce the new epilogue:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 19+1 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 20 instr.)
  Total cycles = 6 + m(2(n+4) + 6) = 2mn + 14m + 6 = 34,006
  Code size = 71 instr.
A quick computation checking these totals appears below.
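These totals are easy to sanity-check; a throwaway C computation (mine) for m = 1000, n = 10:

#include <stdio.h>

int main(void) {
    long m = 1000, n = 10;   /* trip counts assumed throughout the example */
    printf("(a) %ld\n", m * (6 + 10 * n + 5));             /* 111000 */
    printf("(b) %ld\n", m * (12 + 2 * (n - 4) + 12));      /* 36000 */
    printf("(c) %ld\n", m * (6 + 2 * (n + 4) + 5));        /* 39000 */
    printf("(d) %ld\n", 6 + (m - 1) * (2 * (n + 4) + 6) + 2 * (n + 4) + 5); /* 34005 */
    printf("(e) %ld\n", 6 + m * (2 * (n + 4) + 6));        /* 34006 */
    return 0;
}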
Data Memory Considerations with Optimal Data Mapping
Multiple memory banks are accessible in parallel, providing higher memory bandwidth.
Many existing compilers cannot exploit this kind of architectural feature; instead, all variables are assigned to just one bank.
Data mapping and scheduling therefore become among the most important factors in performance optimization.
[Diagram: an ALU connected to Data Memory Bank 0 and Data Memory Bank 1 through data buses DB0 and DB1.]
IIR Filter – Data Flow Graph
[Figure: the IIR filter DFG, nodes 0-24, annotated with variables A-G.]
Our Model – Variable Independence Graph
[Figure: a graph over variables A-G with edge weights such as 2, 7/8, and 1/2, cut into Partition 1 and Partition 2.]
Weight of edge e = (u,v): the "gain" of putting u and v in different memory modules. We want to find a maximum-weight partition, e.g., with the greedy sketch below.
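A hedged greedy sketch of that maximum-weight 2-partition in C (the published algorithm differs; this just makes the objective concrete, with a few weights echoing the figure):

#include <stdio.h>

#define NV 7   /* variables A..G */

int main(void) {
    double w[NV][NV] = {{0}};   /* w[u][v]: gain if u, v land in different banks */
    int side[NV] = {0};         /* current bank (0 or 1) of each variable */
    w[0][1] = w[1][0] = 2.0;    /* illustrative weights; 0..6 stand for A..G */
    w[1][2] = w[2][1] = 0.5;
    w[3][4] = w[4][3] = 0.875;
    w[0][6] = w[6][0] = 0.5;

    /* Local search: move any variable whose move increases the cut weight;
       each move strictly improves the cut, so this terminates. */
    for (int improved = 1; improved; ) {
        improved = 0;
        for (int u = 0; u < NV; u++) {
            double gain = 0;    /* change in cut weight if u switches banks */
            for (int v = 0; v < NV; v++)
                gain += (side[u] == side[v]) ? w[u][v] : -w[u][v];
            if (gain > 0) { side[u] ^= 1; improved = 1; }
        }
    }
    for (int u = 0; u < NV; u++)
        printf("%c -> bank %d\n", 'A' + u, side[u]);
    return 0;
}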
Experimental Results
The IG approach uses list scheduling and an interference graph model (M. Saghir et al., University of Toronto, Canada; R. Leupers et al., University of Dortmund, Germany).
Our approach uses rotation scheduling with a variable repartitioning algorithm and the variable independence graph.
Different approaches result in different variable partitions.
The largest improvement in schedule length using our approach is 52.9%; the average improvement over the benchmarks is 44.8%.
Benchmark              | IG | Ours | Improvement
IIR Filter             | 17 | 8    | 52.9%
Differential Equation  | 21 | 11   | 52.4%
All-pole Filter        | 29 | 18   | 37.9%
4-stage Lattice Filter | 44 | 22   | 50.0%
Elliptical Filter      | 35 | 24   | 31.4%
Voltera Filter         | 41 | 23   | 43.9%
Conclusions
An exciting area: optimizations for parallel DSP and embedded systems. This talk gave an overview; much more work is needed.
Consider both architectures and compilers.
Presented techniques:
• Multi-dimensional (MD) retiming and rotation scheduling
• Code-size minimization for software-pipelined loops
• Design space minimization
• Optimal partitioning and prefetching to completely hide memory latencies, and to decide the minimum required on-chip memory
Detailed retiming, unfolding, low-power scheduling, rate-optimal scheduling, etc. were presented in the tutorial. There is still a lot more.
Please check my web page for details: www.utdallas.edu/~edsha