Optimizing Parallel Embedded Systems
Dr. Edwin Sha, Professor, Computer Science, University of Texas at Dallas
http://www.utdallas.edu/~edsha

Dec 29, 2015

Transcript

Page 1

Optimizing Parallel Embedded Systems

Dr. Edwin Sha

Professor, Computer Science

University of Texas at Dallas

http://www.utdallas.edu/~edsha
edsha@utdallas.edu

Page 2

Parallel Architecture is Ubiquitous

Parallel architecture is everywhere:
• As small as a cellular phone
• Modern DSP processors (VLIW), network processors
• Modern CPUs (instruction-level parallelism)
• Your home PC (small number of processors)
• Application-specific systems (image processing, speech processing, network routers, look-up tables, etc.)
• File servers
• Database servers and web servers
• Supercomputers

Interested in domain-specific HW/SW parallel systems

Page 3

Organization of the Presentation

Introduction to parallel architectures

Using sorting as an example to show various implementations on parallel architectures.

Introduction to embedded systems: strict constraints

Timing optimization: parallelize loops and nested loops
• Retiming, multi-dimensional retiming
• Full parallelism: all the nodes can be executed in parallel

Design space exploration and optimizations for code size, data memory, low-power, etc.

Intelligent prefetching and partitioning to hide memory latency

Conclusions

Page 4

Technology Trend

• Microprocessor performance increases 50% to 100% per year. Where does the performance gain come from? Clock rate and capacity.
• Clock rate increases only 30% per year

[Figure: clock rate (MHz) on a log scale from 0.1 to 1,000, over 1970-2005; data points include the i4004, i8008, i8080, i8086, i80286, i80386, Pentium, and R10000.]

Page 5

Technology Trend

[Figure: transistors per chip on a log scale from 1,000 to 100,000,000, over 1970-2005; data points include the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]

Transistor count grows much faster than clock rate.

• Transistor count increases 40% per year
• An order of magnitude more contribution over two decades

Page 6

Exploit Parallelism at Every Level

• Algorithm level
• Thread level
  E.g., each request for service is created as a thread
• Iteration level (loop level)
  E.g., for_all i = 1 to n do {loop body}
  All n iterations can be parallelized
• Loop-body level (instruction level)
  Parallelize instructions inside a loop body as much as possible
• Hardware level: parallelize and pipeline the execution of an instruction such as multiplication, etc.
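The iteration-level case above can be sketched with a small data-parallel loop. This is an illustrative Python sketch (the `work` body is a made-up stand-in, not from the slides): because no iteration reads another iteration's result, all n iterations can be dispatched to a pool and still produce the sequential result.

```python
from concurrent.futures import ThreadPoolExecutor

def work(i):
    # Stand-in loop body: each iteration depends only on its own index i,
    # so all n iterations may execute in parallel.
    return i * i

def parallel_for(n, body, workers=4):
    # "for_all i = 1 to n do {loop body}" -- iterations dispatched concurrently;
    # map preserves the iteration order of the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(body, range(1, n + 1)))

print(parallel_for(5, work))  # [1, 4, 9, 16, 25], same as the sequential loop
```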

Page 7

Sorting on Linear Array of Processors

Input: x1, x2, .. xn. Output: A sorted sequence (Ascending order)

Architecture: a linear array of k processors. Assume k = n at first.
• What is the optimal time for sorting? Obviously it takes O(n) time to reach the rightmost processor.
• Let's consider the different sequential algorithms and then think about how to use them on a linear array of processors. This is a good example.

• Selection Sort• Insertion Sort• Bubble Sort• Bucket Sort• Sample Sort

Page 8

Selection Sort

Algorithm: for i = 1 to n, pick the i-th smallest element.

Is it good?

Timing: (n-1) + … + 2 + 1 = n(n-1)/2 comparisons

Example (5,1,2,4): 3 + 2 + 1 = 6 comparisons
• 5,1,2,4 → keep 1
• 5,2,4 → keep 2
• 5,4 → keep 4
• 5
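The counting argument above can be checked with a short sketch (Python used here for illustration; the slides give only pseudocode). `selection_sort` returns the sorted sequence together with the comparison count, which for the slide's input 5,1,2,4 is 3 + 2 + 1 = 6:

```python
def selection_sort(a):
    # Repeatedly pick the i-th smallest element;
    # (n-1) + ... + 2 + 1 = n(n-1)/2 comparisons in total.
    a = list(a)
    comparisons = 0
    for i in range(len(a)):
        m = i
        for j in range(i + 1, len(a)):
            comparisons += 1
            if a[j] < a[m]:
                m = j
        a[i], a[m] = a[m], a[i]
    return a, comparisons

print(selection_sort([5, 1, 2, 4]))  # ([1, 2, 4, 5], 6) -- matching 3 + 2 + 1 = 6
```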

Page 9

Insertion Sort

[Diagram: inserting the sequence 5, 1, 2, 4, one element per time step, into the processor array.]

Timing: n only! (4 clock cycles in this example)

Problem: needs a global bus

Page 10

Pipeline Sorting without Global Wire

Organization: a systolic array. Each cell holds a value y, receives x, and emits z:

Initially, y = ∞ (so the first input is always kept)
If x > y then
    z ← x
else
    z ← y
    y ← x
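A behavioral sketch of the cell rule above (Python, for illustration): each cell keeps the smallest value it has seen in y and passes the rest along, so chaining n cells sorts n inputs. The initial value of y is assumed to be +∞, and the one-value-per-clock systolic timing is abstracted away:

```python
import math

def cell(stream):
    # One systolic cell: keep the smallest value seen so far (y), pass the rest (z).
    y = math.inf          # assumed initial value of the cell register
    passed = []
    for x in stream:
        if x > y:
            passed.append(x)      # z <- x
        else:
            passed.append(y)      # z <- y; y <- x
            y = x
    # Drop the single inf placeholder emitted before y held a real value.
    return y, [v for v in passed if v != math.inf]

def systolic_sort(xs):
    # Chain n cells: cell 1 retains the minimum, cell 2 the 2nd smallest, etc.
    kept, stream = [], xs
    for _ in range(len(xs)):
        y, stream = cell(stream)
        kept.append(y)
    return kept

print(systolic_sort([5, 1, 2, 4]))  # [1, 2, 4, 5]
```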

Page 11

Bubble Sorting

[Diagram: bubble sorting 5, 1, 2, 4 on the linear array; adjacent compare-and-swap steps over time.]

The worst algorithm in the sequential model, but a good one in this case!

time: 7 clock cycles in this example

Timing: 2n-1 for n processors, i.e., O(n) time.

How about k < n processors? O(n · n/k) for k processors.
Can we get O((n/k) log (n/k))?

Page 12

Bucket Sort

Can sorting run faster than the (n log n) comparison lower bound, i.e., in O(n)?

[Diagram: the interval [0, 400] divided by splitters at 100, 200, and 300; elements 1, 19, 5, 98 fall into the first bucket, 125, 167, 102 into the second, 201, 257, 207 into the third, and 399, 336, 318 into the fourth.]

• But it assumes the n elements are uniformly distributed over an interval [a, b].
• The interval [a, b] is divided into k equal-sized subintervals called buckets.
• Scan through the elements and put each into its corresponding bucket. The number of elements in each bucket is about n/k.

Page 13

Bucket Sort

• Then sort each bucket locally.
• The sequential running time is O(n + k·(n/k)·log(n/k)) = O(n log(n/k)).
• If k = n/128, we get an O(n) algorithm.
• Parallelization is straightforward.

It is pretty good: very little communication is required between processors.

But what happens when the input data are not uniformly distributed? One bucket may get almost all the elements.
• How to smartly pick appropriate splitters so that each bucket has at most 2n/k elements? (Sample sort)
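A minimal sketch of the bucket sort described above (Python, for illustration; the function name and interval parameters are ours, not the slides'). One scan distributes the elements into k equal-width buckets over [a, b], then each bucket is sorted locally:

```python
def bucket_sort(xs, a, b, k):
    # One scan drops each element into one of k equal-width buckets over [a, b],
    # then each bucket is sorted locally:
    # O(n + k * (n/k) * log(n/k)) = O(n log(n/k)) sequentially.
    buckets = [[] for _ in range(k)]
    width = (b - a) / k
    for x in xs:
        i = min(int((x - a) / width), k - 1)  # clamp so x == b lands in the last bucket
        buckets[i].append(x)
    out = []
    for bucket in buckets:       # buckets are already in ascending range order
        out.extend(sorted(bucket))
    return out

data = [125, 5, 399, 102, 19, 336, 257, 98, 201, 167, 318, 207, 1]
print(bucket_sort(data, 0, 400, 4))  # fully sorted list
```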

Page 14

Sample Sort

First Step: Splitter selection (An important step)

Smartly select k-1 splitters from some samples.

Second Step: Bucket sort using these splitters on k buckets.

Guarantee: Each bucket has at most 2n/k elements.

• Directly divide the n input elements into k blocks of size n/k each and sort each block.
• From each sorted block, choose k-1 evenly spaced elements. Then sort these k(k-1) elements.
• Select k-1 evenly spaced elements from these k(k-1) elements as the splitters.
• Scan through the n input elements and use these k-1 splitters to put each element into its corresponding bucket.

Page 15

Sample Sort

Sequential: O(n log(n/k)) + O(k² log k) + O(n log(n/k)). Not an O(n) algorithm, but very efficient for parallel implementation.

[Diagram:]
Step 1: sort each of the k blocks.
Step 2: sort the sampled elements and pick the final splitters.
Step 3: bucket sort using these splitters.

Page 16

Randomized Sample Sort

Processor 0 randomly picks k times an over-sampling ratio (such as 64 or 128) of samples.

Sort these samples and select k-1 evenly spaced numbers as splitters.

With high probability the splitters are picked well, i.e., with low probability there is a big bucket.

But this cannot be used for hard real-time systems.

To sort 5 million numbers on a SUN cluster of 4 machines using MPI in our tests:
• Randomized sample sort takes 5 seconds
• Deterministic sample sort takes 10 seconds
• Radix sort takes > 500 seconds (too much communication)

Page 17

Embedded Systems Overview

Embedded computing systems:
• Computing systems embedded within electronic devices
• Repeatedly carry out a particular function or a set of functions
• Nearly any computing system other than a desktop computer is an embedded system
• Billions of units produced yearly, versus millions of desktop units
• About 50 per household, 50-100 per automobile

Page 18

Some common characteristics of embedded systems

Application-specific
• Executes a single program, repeatedly
• Newer ones may be adaptive and/or multi-mode

Tightly constrained
• Low cost, low power, small, fast, etc.

Reactive and real-time
• Continually reacts to changes in the system's environment
• Must compute certain results in real time, without delay

Page 19

A “short list” of embedded systems

And the list grows longer each year.

Anti-lock brakes, auto-focus cameras, automatic teller machines, automatic toll systems, automatic transmission, avionic systems, battery chargers, camcorders, cell phones, cell-phone base stations, cordless phones, cruise control, curbside check-in systems, digital cameras, disk drives, electronic card readers, electronic instruments, electronic toys/games, factory control, fax machines, fingerprint identifiers, home security systems, life-support systems, medical testing systems, modems, MPEG decoders, network cards, network switches/routers, on-board navigation, pagers, photocopiers, point-of-sale systems, portable video games, printers, satellite phones, scanners, smart ovens/dishwashers, speech recognizers, stereo systems, teleconferencing systems, televisions, temperature controllers, theft tracking systems, TV set-top boxes, VCRs, DVD players, video game consoles, video phones, washers and dryers

Page 20

An embedded system example -- a digital camera

[Block diagram of the digital camera chip: microcontroller, CCD preprocessor, pixel coprocessor, A2D and D2A converters, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD and display controllers, multiplier/accumulator, plus the lens and CCD sensor.]

Single-functioned -- always a digital camera

Tightly-constrained -- Low cost, low power, small, fast

Page 21

Design metric competition -- improving one may worsen others

Expertise with both software and hardware is needed to optimize design metrics

• Not just a hardware or software expert, as is common

• A designer must be comfortable with various technologies in order to choose the best for a given application and constraints

Need serious Design Space Explorations

[Figure: competing design metrics: size, performance, power, NRE cost.]

Page 22

Processor technology

Processors vary in their customization for the problem at hand

total = 0
for i = 1 to N loop
    total += M[i]
end loop

[Figure: the spectrum from a general-purpose processor (software), through an application-specific processor, to a single-purpose processor (hardware), each implementing the desired functionality.]

Page 23

Design Productivity Gap

A 1981 leading-edge chip required 100 designer-months
• 10,000 transistors / 100 transistors per month

A 2002 leading-edge chip requires 30,000 designer-months
• 150,000,000 transistors / 5,000 transistors per month

Designer cost increased from $1M to $300M

[Figure: logic transistors per chip (millions) versus productivity (K transistors per staff-month), both on log scales, for 1981-2009; IC capacity grows faster than design productivity, opening the gap.]

Page 24

More challenges coming

Parallel
• Consists of multiple processors with hardware

Heterogeneous, networked
• Each processor has its own speed, memory, power, reliability, etc.

Fault tolerance, reliability & security
• A major issue for critical applications

Design space explorations: timing, code size, data memory, power consumption, cost, etc.

System-level design, analysis, and optimization are important.

The compiler plays an important role. We need more research.

Let's start with timing optimizations, then other optimizations and design-space issues.

Page 25

Timing Optimization

Parallelization for Nested Loops

Focus on computation or data intensive applications.

Loops are the most critical parts.

Multi-dimensional (MD) systems: uniform nested loops.

Develop efficient algorithms to obtain the schedule with the minimum execution time while hiding memory latencies.

ALU part: MD retiming to fully parallelize computations.
Memory part: prefetching and partitioning to hide memory latencies.

Developed by Edwin Sha’s group. The results are exciting.

Page 26

Graph Representation for Loops

A[0] = A[1] = 0;

For (i=2; i<n; i++)

{

A[i] = D[i-2] / 3;

B[i] = A[i] * 5;

C[i] = A[i] + 7;

D[i] = B[i] + C[i];

}

[Data-flow graph: nodes A, B, C, D; the edge D→A carries the two inter-iteration delays.]

Page 27

Schedule looped DFG

DFG (as on Page 26) and its static schedule:

cycle 1: A
cycle 2: B, C
cycle 3: D
(repeated for each iteration)

Schedule length = 3

Page 28

Rotation: Loop Pipelining

Original schedule: each iteration runs A; then B, C; then D (length 3).

Regrouped schedule: the first copy of A becomes a prologue, each remaining A is grouped with the previous iteration's B, C, D, and the last B, C, D form an epilogue.

Rotated schedule: A of iteration i+1 executes alongside B, C, D of iteration i, shortening the steady-state schedule.

Page 29

Graph Representation Using Retiming

[DFGs: before retiming, the DAG's longest zero-delay path (A→B→D or A→C→D) has length 3; after retiming node A, delays move onto edges A→B and A→C and the longest zero-delay path has length 2.]

Page 30

Multi-Dimensional Problems

DO 10 J = 0, N
  DO 1 I = 0, M
    d(i,j) = b(i,j-1) * c(i-1,j)   (node D)
    a(i,j) = d(i,j) * 0.5          (node A)
    b(i,j) = a(i,j) + 1.0          (node B)
    c(i,j) = a(i,j) + 2.0          (node C)
1 CONTINUE
10 CONTINUE

Circuit optimization: in the corresponding MDFG, the edges into D carry the delay vectors (0,1) (the z2^-1 term) and (1,0) (the z1^-1 term).

Page 31

An Example of DSP Processor: TI TMS320C64X

Clock speed: 1.1 GHz; up to 8800 MIPS.

Page 32

Page 33

One-Dimensional Retiming (Leiserson-Saxe, '91)

[Figure: a loop ("For I = 1, …") whose body contains a multiply by 1.3, shown before and after retiming.]

Page 34

Another Example

[Figure: a two-statement loop with a multiply by 1.3, retimed.]

A(1) = B(-1) + 1          (prologue)
For I = 1, ……
  B(I) = A(I) × 1.3
  A(I+1) = B(I-1) + 1

Page 35

Retiming

An integer-valued transformation r on nodes; registers (delays) are redistributed.
G = <V, E, d>  becomes  Gr = <V, E, dr>
dr(e) = d(e) + r(u) - r(v) for each edge e: u → v
Legal retiming: dr(e) >= 0 for every edge

The number of delays on any cycle remains constant.
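The per-edge rule can be checked mechanically. A small sketch (Python, for illustration) applies dr(e) = d(e) + r(u) - r(v) to the DFG of the earlier loop example, where edge D→A carries the two inter-iteration delays, and verifies legality; note the total delay on the cycle A→B→D→A stays at 2:

```python
def retime(edges, r):
    # edges: {(u, v): delay count}; r: retiming value per node.
    # Rule from the slide: dr(u, v) = d(u, v) + r(u) - r(v).
    return {(u, v): d + r[u] - r[v] for (u, v), d in edges.items()}

def is_legal(retimed):
    # A retiming is legal when no edge ends up with a negative delay count.
    return all(d >= 0 for d in retimed.values())

# DFG of the loop on Page 26: D -> A carries two delays, the rest carry none.
edges = {("A", "B"): 0, ("A", "C"): 0, ("B", "D"): 0, ("C", "D"): 0, ("D", "A"): 2}
r = {"A": 1, "B": 0, "C": 0, "D": 0}   # push one delay through node A

rt = retime(edges, r)                   # A->B and A->C gain a delay, D->A loses one
print(rt, is_legal(rt))
```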

Page 36

Multi-Dimensional Retiming

• A nested loop
• Retiming nested loops
• Illegal cases
• New problems …

Page 37

Multi-Dimensional Retiming

Iteration Space

Page 38

After Retiming r(A)=(-1,1)

Page 39

Multi-Dimensional Data Flow Graph

Page 40

Retimed MDFG

Page 41

Retimed Cell Dependence Graph

Page 42

Iteration Space for Retimed Graph

Legal schedule with row-wise executions. S=(0,1)

Page 43

Illegal MD Retiming

Page 44

A required solution needs:
• To avoid illegal retiming
• To be general
• To obtain full parallelism
• To be a fast algorithm

Page 45

Schedule Vector (wavefront processing)

Legal schedule: s · d > 0 for every nonzero delay vector d

Page 46

Schedule-Based MD Retiming

[Figure: legal versus feasible schedule regions.]

Page 47

ILP Formulation

Page 48

Page 49

Example: s = (1,4), c(Gr) = 1

Page 50

Chained MD Retiming

Delay vectors (1,1) and (-1,1) require s • (1,1) > 0 and s • (-1,1) > 0.

Pick s = (0,1); choose r orthogonal to s (r ⊥ s) => r = (1,0).

[Figure: the schedule plane S+ in the (x, y) plane, with the delay vectors (1,1) and (-1,1), the schedule vector s, and the retiming vector r.]

Page 51

s must remain feasible for the new delay vectors.

Let the new delay be d' = d + k·r. We know s • d > 0 and s • r = 0, so

s • (d + k·r) = s • d + k·(s • r) = s • d > 0

=> s is a legal schedule for d + k·r.

[Figure: the schedule plane with s, r, and delay vectors d1, d2, d3.]

Page 52

Chained MD Retiming Algorithm

Page 53

Chained MD Retiming Example

Page 54

Synchronous Circuit Optimization Example – original design

Page 55

Synchronous Circuit Optimization Example – original design (cont.)

Critical path: 6 adders & 2 multipliers.

Page 56

Synchronous Circuit Optimization Example – Gnanasekaran’88

Page 57

Synchronous Circuit Optimization Example – retimed design

The critical path is the minimum.

Page 58

Embedded System Design Review

Strict requirements
• Time, power consumption, code size, data memory size, hardware cost, area, etc.
• Time-to-market, time-to-prototype

Special architectural support
• Harvard architecture, on-chip memory, register files, etc.

Increasing amount of software
• Flexibility of software, short time-to-market, easy upgrades
• The amount of software is doubling every two years

How to generate high-quality code for embedded systems?
How to minimize and search the design space?
What is the compiler's role?

Page 59

Compiler in Embedded Systems

Ordinary C compilers for embedded processors are notorious for poor code quality.
• Data memory overhead of compiled code can reach a factor of 5
• Cycle overhead can reach a factor of 8, compared with manually generated assembly code (Rozenberg et al., 1998)

For code generation, compilers are generally included in the control-flow loops of design space exploration, so they play an important role in the design phase: exploring efficient designs in a huge n-dimensional space, where each dimension corresponds to a design choice.

Page 60

Compiler in Design Space Exploration

Algorithm selection: analyze dependencies between algorithms and processor architectures. HW/SW partitioning.

Memory-related issues (program and data memory)
• Optimally place programs and data in on-chip memories, and hide off-chip memory latencies by smart prefetching (Sha et al.)
• Data mapping for processors with multiple on-chip memory modules (Zhuge and Sha)
• Code size reduction for software-pipelined applications (Zhuge and Sha)

Instruction set options
• Search for power-optimized instruction sets (Kin et al., 1999)
• Scheduling for loops and DAGs with minimum energy (Shao and Sha)

Page 61

Design Space Minimization

A direct design space is too large; design space minimization must be done before exploration.

Derive properties of the relationships among design metrics
• A huge number of design points can be proved infeasible before performing time-consuming design space exploration. Using our design space minimization algorithm, the design points can be reduced from 510 to 6 for synthesizing the all-pole filter, for example.

Approach:
1. Develop effective optimization techniques: code size, time, data memory, low power
2. Understand the relationships among them
3. Design space minimization algorithms
4. Efficient design space exploration algorithms using fuzzy logic

Page 62

Example Relations Between Optimizations

Retiming
• Transforms a DFG to minimize its cycle period in polynomial time by redistributing the delays in the DFG.
• The cycle period c(G) of a DFG G is the computation time of its longest zero-delay path.

[DFGs of nodes A, B, C, D before and after retiming, as on Page 29.]

Unfolding
• The original DFG G is unfolded f times, so the unfolded graph Gf consists of f copies of the original node set.
• Iteration period: P = c(Gf)/f.
• Code size increases f times; software pipelining increases it further.

Page 63

Experimental Results

The search space size using our method is only 2% of the search space using the standard method on average.

The quality of the solutions found by our algorithm is better than that of the standard method.

Benchmark         Req. iter.  Search   Ratio   Final solution
                  period      points           unfold  #add  #mult  Iter. period  Code size
Biquad (std.)     3/2         228      -       4       4     16     3/2           80
Biquad (ours)     3/2         4        1.5%    3       3     10     5/3           28
DEQ (std.)        8/5         486      -       5       10    18     8/5           110
DEQ (ours)        8/5         6        1.2%    3       5     15     5/3           37
All-pole (std.)   18/5        510      -       5       10    10     18/5          150
All-pole (ours)   18/5        6        1.2%    3       10    3      11/3          51
Elliptic (std.)   5           694      -       F       F     F      F             F
Elliptic (ours)   5           2        0.3%    2       6     9      5             76
4-Stage (std.)    7/4         909      -       F       F     F      F             F
4-Stage (ours)    7/4         55       6%      4       7     33     7/4           112

(F: no feasible solution found.)

Page 64

A Design Space Minimization Problem

Clearly understand the relationships: retiming, unfolding and iteration period.

Page 65

Program Memory Consideration with Code Size Minimization

Multiple on-chip memory banks, but usually only one program memory.

The capacity of an on-chip memory bank is very limited:
• Motorola's DSP56K has only 512×24 bits of program memory
• ARM940T uses a 4K instruction cache (Icache)
• StrongARM SA-1110 uses a 16K cache

A widely used performance optimization technique, software pipelining, expands the code to several times its original size.

Designers need to fit the code into the small on-chip memory to avoid slow (external) memory accesses.

Code size is therefore a critical concern for many embedded processors.

Page 66

Code Size Expansion Caused bySoftware Pipelining

Schedule length is decreased from 4 cycles to 1 cycle.

Code size is expanded to 3 times the original code size.

Original loop:

for i = 1 to n do
  A[i] = E[i-4] + 9;
  B[i] = A[i] * 0.5;
  C[i] = A[i] + B[i-2];
  D[i] = A[i] * C[i];
  E[i] = D[i] + 30;
end

Software-pipelined version:

A[1] = E[-3] + 9;
A[2] = E[-2] + 9;
B[1] = A[1] * 0.5;
C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;
B[2] = A[2] * 0.5;
C[2] = A[2] + B[0];
D[1] = A[1] * C[1];
for i = 1 to n-3 do
  A[i+3] = E[i-1] + 9;
  B[i+2] = A[i+2] * 0.5;
  C[i+2] = A[i+2] + B[i];
  D[i+1] = A[i+1] * C[i+1];
  E[i] = D[i] + 30;
end
B[n] = A[n] * 0.5;
C[n] = A[n] + B[n-2];
D[n-1] = A[n-1] * C[n-1];
E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];
E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;

Page 67

Rotation Scheduling

Resource-constrained loop scheduling based on the retiming concept.

Retiming gives a clear framework for software-pipelining depth.

Given an initial DAG schedule, rotation scheduling repeatedly rotates down the nodes in the first row of the schedule. In each rotation step, the nodes in the first row are:
• retimed once, by pushing one delay from each of the node's incoming edges and adding one delay to each of its outgoing edges;
• rescheduled to available locations (such as the earliest ones) in the schedule, based on the new precedence relations defined in the retimed graph.

The optimal schedule length can be obtained in polynomial time (2|V| rotations) in most cases.

The technique can be generalized to deal with code size, switching activities, branches, nested loops, etc.

Page 68

Rotation vs. Modulo scheduling

Page 69

Schedule a Cyclic DFG

DFG: nodes A, B, C, D, with two delays on edge D→A.

Schedule:
cycle 1: A
cycle 2: B, C
cycle 3: D
(repeated for each iteration)

Schedule length = 3

Page 70

Rotation: Loop Pipelining

Original schedule: A; then B, C; then D, repeated (length 3).

Rotation: the first copy of A becomes a prologue and the last B, C, D become an epilogue.

Rescheduling: A of the next iteration fills the idle slot beside B and C, so each steady-state iteration executes B, C, and the next A in one cycle, then D (length 2).

Page 71

Retiming View of Loop Pipelining

[DFGs: the original DFG has cycle period 3; after retiming node A, the cycle period is 2.]

Page 72

The Second Rotation

The schedule after the 1st rotation phase has steady state B, C, A | D with a prologue of A. The 2nd rotation moves B and C down as well; after rescheduling, the final schedule executes the remaining nodes of different iterations in a single steady-state cycle, with a longer prologue and epilogue.

Page 73

Retiming View of Loop Pipelining

[DFGs: with r(A)=1, r(B)=r(C)=r(D)=0 the cycle period is 2; with r(A)=2, r(B)=r(C)=1, r(D)=0 the cycle period is 1.]

Page 74

Prologue and Retiming Function

[Figure: the original schedule, the schedule after the 1st rotation (r(A) = 1), and the schedule after the 2nd rotation (r(A) = 2), each with its prologue and epilogue.]

For every node v in V:
• the number of copies of v in the prologue = r(v)
• the number of copies of v in the epilogue = (max_u r(u)) - r(v)
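That bookkeeping is easy to state in code. A minimal sketch, using the retiming values reached after the second rotation (r(A) = 2, r(B) = r(C) = 1, r(D) = 0):

```python
def copy_counts(r):
    # For each node v: (copies in the prologue, copies in the epilogue),
    # i.e. (r(v), max_u r(u) - r(v)).
    m = max(r.values())
    return {v: (rv, m - rv) for v, rv in r.items()}

counts = copy_counts({"A": 2, "B": 1, "C": 1, "D": 0})
# A appears twice in the prologue and never in the epilogue;
# D appears only in the epilogue, twice.
```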

Page 75

CRED Technique Using Predicate Registers

Predicate register
• An instruction can be guarded by a predicate register.
• The instruction is executed when the value of the predicate register is true; otherwise, the instruction is disabled.

Implementing CRED with counted predicate registers (as in TI's TMS320C6x)
• Set the initial value p = (max_u r(u)) - r(v).
• Decrement p by one in each iteration.
• The instruction is executed when 0 ≥ p > -n, where n is the trip count of the original loop; it is disabled when p > 0 or p ≤ -n.
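Under these rules, each guarded instruction fires in a contiguous window of exactly n iterations, staggered by its retiming value; this is what lets the prologue and epilogue be folded into the loop body. A small sketch (the retiming values are illustrative):

```python
def enabled_iterations(r, n):
    # At iteration k (1-based), the counter of node v is
    # (max_r - r[v]) - (k - 1); v executes when 0 >= counter > -n.
    max_r = max(r.values())
    loop_trips = n + max_r   # loop extended to absorb prologue/epilogue
    windows = {}
    for v, rv in r.items():
        ks = [k for k in range(1, loop_trips + 1)
              if 0 >= (max_r - rv) - (k - 1) > -n]
        windows[v] = (ks[0], ks[-1])
    return windows

w = enabled_iterations({"A": 3, "B": 2, "C": 2, "D": 1, "E": 0}, n=5)
# Every node is enabled for exactly n = 5 consecutive iterations,
# e.g. A on iterations 1..5 and E on iterations 4..8.
```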

Page 76

The New Execution Sequence

[Figure: on the left, the software-pipelined loop schedule on two multipliers and three adders, split into prologue, static part, and epilogue; on the right, the execution sequence after performing CRED using 4 conditional registers, where [k]X denotes node X guarded by counter value k, together with the new, smaller code size.]

Software-pipelined loop schedule with r(A) = 3, r(B) = r(C) = 2, r(D) = 1, r(E) = 0, and n = 5.

Page 77

Processor Classes

Processor Class 0: no predicate registers
• Motorola's StarCore DSP processor

Processor Class 1: "condition code" bits in the instruction, no predicate registers
• Intel's StrongARM and other ARM architectures

Processor Class 2: 1-bit predicate registers
• Philips' TriMedia multimedia processor

Processor Class 3: predicate registers with counters
• TI's TMS320C6x processor

Processor Class 4: specialized hardware support for executing software-pipelined loops
• IA-64

Page 78

Code Size Reduction for Class 3

Fully expanded software-pipelined loop (prologue and epilogue explicit):

A[1] = E[-3] + 9;
A[2] = E[-2] + 9;
B[1] = A[1] * 0.5;
C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;
B[2] = A[2] * 0.5;
C[2] = A[2] + B[0];
D[1] = A[1] * C[1];
for i = 1 to n-3 do
  A[i+3] = E[i-1] + 9;
  B[i+2] = A[i+2] * 0.5;
  C[i+2] = A[i+2] + B[i];
  D[i+1] = A[i+1] * C[i+1];
  E[i] = D[i] + 30;
end
B[n] = A[n] * 0.5;
C[n] = A[n] + B[n-2];
D[n-1] = A[n-1] * C[n-1];
E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];
E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;

The same loop after CRED: the counters fold the prologue and epilogue into the guarded loop body.

p = 0; q = 1; r = 2; s = 3;
for i = 1 to n-3 do
  [p] A[i+3] = E[i-1] + 9; p--;
  [q] B[i+2] = A[i+2] * 0.5;
  [q] C[i+2] = A[i+2] + B[i]; q--;
  [r] D[i+1] = A[i+1] * C[i+1]; r--;
  [s] E[i] = D[i] + 30; s--;
end
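As a sanity check, the guarded form can be simulated against the plain recurrence. The sketch below is my own Python rendering, not the slide's code: it runs the loop for n + max r iterations so the counters absorb the prologue and epilogue, and it assumes zero boundary values for E[-3..0] and B[-1..0].

```python
# Recurrence from the slide: A[i] = E[i-4] + 9, B[i] = A[i] * 0.5,
# C[i] = A[i] + B[i-2], D[i] = A[i] * C[i], E[i] = D[i] + 30.

def boundary():
    A, B, C, D, E = {}, {}, {}, {}, {}
    for j in range(-3, 1):
        E[j] = 0.0          # assumed boundary values
    for j in range(-1, 1):
        B[j] = 0.0
    return A, B, C, D, E

def plain_loop(n):
    A, B, C, D, E = boundary()
    for i in range(1, n + 1):
        A[i] = E[i - 4] + 9
        B[i] = A[i] * 0.5
        C[i] = A[i] + B[i - 2]
        D[i] = A[i] * C[i]
        E[i] = D[i] + 30
    return [E[i] for i in range(1, n + 1)]

def cred_loop(n):
    # Retiming: r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0; max r = 3.
    A, B, C, D, E = boundary()
    p, q, r, s = 0, 1, 2, 3           # initial counters: max r - r(v)
    enabled = lambda c: 0 >= c > -n   # CRED guard condition
    for k in range(1, n + 4):         # loop extended by max r iterations
        if enabled(p): A[k] = E[k - 4] + 9
        if enabled(q):
            B[k - 1] = A[k - 1] * 0.5
            C[k - 1] = A[k - 1] + B[k - 3]
        if enabled(r): D[k - 2] = A[k - 2] * C[k - 2]
        if enabled(s): E[k - 3] = D[k - 3] + 30
        p, q, r, s = p - 1, q - 1, r - 1, s - 1
    return [E[i] for i in range(1, n + 1)]

assert plain_loop(10) == cred_loop(10)   # same results, no prologue/epilogue
```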

Page 79

CRED on Various Types of Processors

The TI and IA-64 models are the most efficient for code-size reduction.

The TI model is very effective for DSP processors that support predicate registers but lack the specialized hardware found in IA-64.

Code size (instruction words) and % improvement over plain software pipelining, per processor class:

IIR Filter (Orig 8, Soft. Pipe. 16): Class 0 StarCore 20 (-25.0%), Class 1 ARM 16 (0%), Class 2 TriMedia 16 (0%), Class 3 TI 12 (25.0%), Class 4 IA-64 12 (25.0%)
Differential Equation (Orig 11, Soft. Pipe. 22): StarCore 23 (-4.5%), ARM 19 (13.6%), TriMedia 19 (13.6%), TI 15 (31.8%), IA-64 15 (31.8%)
All-pole Filter (Orig 15, Soft. Pipe. 60): StarCore 39 (35.0%), ARM 31 (48.3%), TriMedia 31 (48.3%), TI 23 (61.7%), IA-64 19 (68.3%)
Elliptic Filter (Orig 34, Soft. Pipe. 68): StarCore 46 (32.4%), ARM 42 (38.2%), TriMedia 42 (38.2%), TI 38 (44.1%), IA-64 38 (44.1%)
4-stage Lattice Filter (Orig 26, Soft. Pipe. 78): StarCore 44 (43.6%), ARM 38 (51.3%), TriMedia 38 (51.3%), TI 32 (60.0%), IA-64 32 (60.0%)
Volterra Filter (Orig 27, Soft. Pipe. 54): StarCore 39 (27.8%), ARM 35 (35.2%), TriMedia 35 (35.2%), TI 31 (42.6%), IA-64 31 (42.6%)

Average improvement: StarCore 18.2%, ARM 31.1%, TriMedia 31.1%, TI 44.2%, IA-64 45.6%

Page 80

Experimental Results on Code Size/Performance Trade-off

Code size/performance exploration for All-pole Filter on the modified TMS320C6x processor with only 2 predicate registers.

The code size increases as the software-pipelining depth increases and the schedule length decreases.

Our approach finds the shortest schedule length that satisfies a given code-size constraint.

Pipeline Depth | Schedule Length | Instr. Words (Std.) | Instr. Words (Ours) | Reduction %
2 | 11 | 23 | 11 | 52.2
3 | 7 | 30 | 19 | 36.7
4 | 5 | 39 | 29 | 25.6

Page 81

Code-size Reduction for Nested Loops

[Figure (a): the data-flow graph of the loop body (nodes 1-12, inter-iteration dependence vectors (0,1) and (1,0)) and the cell dependency graph over iteration coordinates (0,0) through (2,2).]

• Assume 8 functional units.
• Traditional software pipelining can achieve at best 6 clock cycles.
• Interchanging the loop indices cannot help the optimization.

The original loop:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15 instructions)
    Inner loop (10 cycles, 12 instructions, trip count = n)
    Outer 2 (5 cycles, 15 instructions)

Assuming m = 1000, n = 10:
Total cycles = m(6 + 10n + 5) = 10mn + 11m = 111,000
Code size = 42 instructions

(a) Cell dependency graph and the original loop.

Page 82

MD Retiming and Code Reduction

[Figure: the DFG (nodes 1-12) after multi-dimensional retiming; the retimed edges carry delay (1,0), except one edge with delay (-4,1).]

Inner-outer combined software pipelining:
  Outer loop begin (trip count = m)
    Outer 1 & prologue (12 cycles, 15 + 28 instructions)
    Inner loop (2 cycles, 12 instructions, trip count = n - 4)
    Outer 2 & epilogue (12 cycles, 15 + 20 instructions)
Total cycles = m(12 + 2(n-4) + 12) = 2mn + 16m = 36,000
Code size = 90 instructions

Retimed DFG: r(1) = r(2) = r(3) = r(4) = (4,0), r(5) = r(6) = (3,0), r(7) = r(8) = (2,0), r(9) = r(10) = (1,0), r(11) = r(12) = (0,0)

Code-size reduction (remove prologue and epilogue):
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15 + 4 instructions)
    Inner loop (2 cycles, 16 instructions, trip count = n + 4)
    Outer 2 (5 cycles, 15 instructions)
Total cycles = m(6 + 2(n+4) + 5) = 2mn + 19m = 39,000
Code size = 50 instructions

(b) Inner-outer combined software pipelining. (c) After code-size reduction.

Page 83

Outer Loop Pipelining and Code Reduction

Outer loop pipelining:
  Outer loop begin (trip count = m - 1)
    Outer 1 (6 cycles, 19 instructions)
    Inner loop (2 cycles, 16 instructions, trip count = n + 4)
    Outer 2 (5 cycles, 15 instructions); Outer 1 (6 cycles, 19 instructions)
  Inner loop (2 cycles, 16 instructions, trip count = n + 4)
  Outer 2 (5 cycles, 15 instructions)
Total cycles = 6 + (m-1)(2(n+4) + 6) + 2(n+4) + 5 = 2mn + 14m + 5 = 34,005
Code size = 100 instructions

Reducing the new epilogue:
  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 19 + 1 instructions)
    Inner loop (2 cycles, 16 instructions, trip count = n + 4)
    Outer 2 (5 cycles, 15 instructions); Outer 1 (6 cycles, 20 instructions)
Total cycles = 6 + m(2(n+4) + 6) = 2mn + 14m + 6 = 34,006
Code size = 71 instructions

(d) Outer loop pipelining. (e) After epilogue reduction.
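The cycle counts claimed for the original loop and the four optimized schemes follow directly from the per-part cycle costs; a quick arithmetic check with the slides' m = 1000 and n = 10:

```python
# Cycle-count check for the nested-loop example (m = 1000, n = 10).
m, n = 1000, 10

original   = m * (6 + 10 * n + 5)          # no pipelining
combined   = m * (12 + 2 * (n - 4) + 12)   # inner-outer software pipelining
no_pro_epi = m * (6 + 2 * (n + 4) + 5)     # prologue/epilogue folded back
outer_pipe = 6 + (m - 1) * (2 * (n + 4) + 6) + 2 * (n + 4) + 5
no_new_epi = 6 + m * (2 * (n + 4) + 6)     # new epilogue reduced

print(original, combined, no_pro_epi, outer_pipe, no_new_epi)
# 111000 36000 39000 34005 34006
```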

Page 84

Data Memory Consideration with Optimal Data Mapping

Multiple memory banks are accessible in parallel, providing higher memory bandwidth. Many existing compilers cannot exploit this architectural feature; instead, all variables are assigned to a single bank. Data mapping and scheduling therefore become among the most important factors in performance optimization.

[Figure: an architecture with two data memory banks (Bank 0 via data bus DB0, Bank 1 via DB1) feeding one ALU.]

Page 85

IIR Filter: Data Flow Graph

[Figure: the IIR filter data-flow graph with numbered operation nodes 0-24; the letters A through G mark the variables each node accesses.]

Page 86

Our Model: the Variable Independence Graph

[Figure: the variable independence graph over variables A through G, with edge weights such as 1/2, 7/8, and 2, divided into Partition 1 and Partition 2.]

Weight(e = (u, v)) is the "gain" of putting u and v in different memory modules. We want to find a maximum-weight partition.
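Finding a maximum-weight two-way partition is the max-cut problem, which is NP-hard in general, so heuristics are used in practice; for a handful of variables it can be solved exactly by enumeration. A small sketch with hypothetical weights, not the slide's exact graph:

```python
from itertools import combinations

def max_weight_partition(nodes, weight):
    # weight: dict mapping edge (u, v) -> gain when u and v land in
    # different banks. Returns (best total gain, one side of the cut).
    # Brute-force enumeration: fine for small graphs only.
    nodes = list(nodes)
    best_gain, best_side = 0.0, set()
    for k in range(len(nodes) + 1):
        for side in combinations(nodes, k):
            s = set(side)
            gain = sum(g for (u, v), g in weight.items()
                       if (u in s) != (v in s))
            if gain > best_gain:
                best_gain, best_side = gain, s
    return best_gain, best_side

w = {("A", "B"): 2.0, ("B", "C"): 0.5, ("A", "C"): 0.5}
gain, side = max_weight_partition("ABC", w)
# Splitting A away from B and C cuts edges worth 2.0 + 0.5 = 2.5.
```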

Page 87

Experimental Results

The IG approach uses list scheduling and an interference-graph model (M. Saghir et al., University of Toronto, Canada; R. Leupers et al., University of Dortmund, Germany).

Our approach uses rotation scheduling with a variable-repartitioning algorithm and the variable independence graph.

Different approaches result in different variable partitions.

The largest improvement on schedule length using our approach is 52.9%. The average improvement on the benchmarks is 44.8%.

Benchmark | IG | Ours | Improvement
IIR Filter | 17 | 8 | 52.9%
Differential Equation | 21 | 11 | 52.4%
All-pole Filter | 29 | 18 | 37.9%
4-stage Lattice Filter | 44 | 22 | 50.0%
Elliptic Filter | 35 | 24 | 31.4%
Volterra Filter | 41 | 23 | 43.9%

[Pages 88-98: figure-only slides; no extractable text.]

Page 99

Conclusions

An exciting area: optimizations for parallel DSP and embedded systems. This talk gave an overview; much more work is needed.

Consider both architectures and compilers.

Presented techniques:
• Multi-dimensional (MD) retiming and rotation scheduling
• Code-size minimization for software-pipelined loops
• Design space minimization
• Optimal partitioning and prefetching to completely hide memory latencies, and to determine the minimum required on-chip memory

Detailed retiming, unfolding, low-power scheduling, rate-optimal scheduling, etc. were presented in the tutorial. There is still a lot more.

Please check my web page for details: www.utdallas.edu/~edsha