DFT Compiler for Custom and Adaptable Systems. Paolo D’Alberto, Electrical and Computer Engineering, Carnegie Mellon University

Jul 22, 2020

Transcript
Page 1

Carnegie Mellon

DFT Compiler for Custom and Adaptable Systems

Paolo D’Alberto

Electrical and Computer Engineering

Carnegie Mellon University

Page 2

Personal Research Background

Time: Padova, Bologna, UC Irvine, CMU

♦ Theory of computing
♦ Algorithm engineering
♦ Embedded and high-performance computing
♦ Compilers: static and dynamic

Page 3

Problem Statement

♦ Automatic DFT library generation across hardware and software with respect to different metrics
- Time (or performance: operations per second)
- Energy (or energy efficiency: operations per Joule)
- Power

Requires the following
♦ Software (code generation or selection)
♦ Hardware (HW generation or selection)
♦ Software/hardware partitioning
♦ Demonstrate performance and energy efficiency
♦ Demonstrate automatic generation
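
The three metrics are related by simple arithmetic. The following is a minimal sketch, assuming the nominal operation count 5·N·log2(N) that is commonly used to compare FFT implementations; the function name and the measurement values are made up for illustration and do not come from the slides.

```python
import math

def dft_metrics(n, seconds, joules):
    """Return (ops per second, ops per Joule, average power in W) for one run.

    Assumption: a size-n DFT is credited with 5*n*log2(n) operations,
    a common convention, not a count taken from the slides.
    """
    ops = 5 * n * math.log2(n)
    return ops / seconds, ops / joules, joules / seconds

# Hypothetical measurement: a 1024-point DFT that takes 1 ms and 0.5 mJ.
perf, efficiency, power = dft_metrics(1024, 1e-3, 0.5e-3)
```

Power is the ratio of the other two raw measurements (energy over time), which is why a configuration can win on one metric and lose on another.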

Page 4

SPIRAL’s Approach

♦ One infrastructure for SW, HW, SW/HW
♦ Optimization at the “right” level of abstraction
♦ Conquers the high level for automation

[Figure: SPIRAL flow. Levels: transform, ruletree, SPL, Σ-SPL. Annotations: problem specification; easy manipulation for search; vector/parallel/streaming optimizations; loop optimizations; traditional human domain; tools available. Outputs: SW (C/Fortran), SW vector/parallel, HW (RTL Verilog), SW/HW partitioned; machine code, netlist.]

Page 5

Performance/Energy Optimization of DSP Transforms on the Intel XScale Processor

Do Different Architectures Need Different Algorithms?

Page 6

Motivation (Why XScale?)

♦ The XScale processor deploys an interesting ISA
- Complex instructions: pre-fetching, add-shift, …
- Only integer operations
♦ XScale is a reconfigurable architecture
- Architecture = MEMORY + BUS + CPU
- Memory: τ = 99, 132, and 165 MHz
- Bus: τα/2 MHz, where α = 1, 2, and 4
- CPU: ταβ MHz, where β = 1, 1.5, 2, and 3
♦ (α, β, τ) can be tuned/changed at run time by SW
♦ 36 possible architectures
- 4 recommended (by the manual)
- 13 investigated (13 is a lucky number in Italy)
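
The count of 36 follows from the three parameter ranges (3 × 3 × 4). A small sketch that enumerates the nominal configurations from the formulas on this slide; the constant and function names are illustrative, and the frequencies use the rounded values quoted here.

```python
from itertools import product

MEMORY_MHZ = (99, 132, 165)   # tau
ALPHAS = (1, 2, 4)            # bus = tau * alpha / 2
BETAS = (1, 1.5, 2, 3)        # cpu = tau * alpha * beta

def configurations():
    """Yield (cpu, bus, memory) nominal frequencies in MHz."""
    for tau, alpha, beta in product(MEMORY_MHZ, ALPHAS, BETAS):
        yield tau * alpha * beta, tau * alpha / 2, tau

configs = list(configurations())
print(len(configs))  # 36
```

For example, τ = 165, α = 2, β = 1.5 gives a 495 MHz CPU with a 165 MHz bus and memory. This plausibly corresponds to the configuration the slides label 497-165-165, presumably because the actual base clocks are not round numbers; that mapping is an inference, not something the slides state.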

Page 7

Motivation (Why SPIRAL for XScale?)

♦ 13 different architectures
- Which one to use, and which one to write code for?
- Do we write and test code for the fastest CPU?
- The fastest bus? Memory? The slowest?
♦ SPIRAL: reconfigurable SW
- DSP-transform SW generator for every architecture
- Targeting and deploying the best SW for a given HW
♦ SPIRAL: inverse problem by fast forward problem
- Given a transform/application, what is the best code-system?
♦ SPIRAL: energy minimization by dynamic adaptation
- Slow down when possible (w/o loss of performance)
- Using techniques that are simple and fast to apply

Page 8

Related Work

♦ Power/energy modeling (each system)
- To drive the architecture selection [Contreras & Martonosi 2005]
♦ Compiler techniques
- Static decision on where to change the architecture configuration
- [Hsu & Kremer 2003, Xie et al. 2003]
♦ Run-time techniques
- Dynamically monitoring the application and changing the architecture
- [Singleton et al. 2005]
♦ SW adaptation
- The code adapts to each architecture
- [Frigo et al. 2005, Whaley et al. 2001, Im et al. 2004]
♦ One code fits all
- IPP

Page 9

Feedback/SPIRAL Approach

♦ Given a transform
♦ Choose a HW configuration
♦ Select an algorithm
♦ Determine an implementation
♦ Optimize the code
♦ Run the code

[Figure: feedback loop over DSP Transform, HW, Algorithm, Implementation, Verification]

♦ Dynamic programming/others
♦ Prune the search space
- HW, Algorithm, Implementation
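
The loop described on this slide can be sketched as a search driven by measured cost. The version below is deliberately the simplest possible one, an exhaustive search without the dynamic programming or pruning the slide mentions. The `measure` callback, the candidate names, and the toy cost model are all hypothetical stand-ins for "generate code, run it on the board, report the cost".

```python
from itertools import product

def search(hw_configs, algorithms, implementations, measure):
    """Exhaustive feedback search: evaluate every candidate, keep the best."""
    best, best_cost = None, float("inf")
    for candidate in product(hw_configs, algorithms, implementations):
        cost = measure(*candidate)        # feedback from a (simulated) run
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

# Made-up cost model: a faster clock lowers cost, unrolling helps a little.
def toy_measure(hw_mhz, algorithm, implementation):
    base = {"radix-2": 100.0, "radix-4": 80.0}[algorithm]
    return base / hw_mhz * (0.9 if implementation == "unrolled" else 1.0)

best, cost = search([398, 497, 530], ["radix-2", "radix-4"],
                    ["looped", "unrolled"], toy_measure)
```

Pruning would replace the full `product` with a smaller candidate set per level, which matters once each measurement means cross-compiling and timing on real hardware.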

Page 10

SPIRAL Approach

♦ Same infrastructure for SW, HW, SW/HW
♦ Optimization at the “right” level of abstraction
♦ Complete automation: conquers the high level for automation

[Figure: SPIRAL flow. Levels: transform, ruletree, SPL, Σ-SPL. Annotations: problem specification; easy manipulation for search; frequency/parallel/streaming opts.; loop optimizations; traditional human domain; tools available. Outputs: SW (C/Fortran), SW vector/parallel, HW (RTL Verilog), SW/HW partitioned; machine code, netlist.]

Page 11

Our Approach: From Math to Code -- Static

♦ We take a transform F
- E.g., F = DFT, FIR, or WHT, …
♦ We annotate the formula with the architecture information
♦ We generate code
- The switching points are transferred to the code easily

Page 12

Our Approach: From Math to Code -- Dynamic

♦ We take a transform (WHT: consider the input as a matrix)
♦ We annotate the formula with the architecture information
♦ We generate code
- The switching points are transferred to the code easily, through the rule tree
- Very difficult to find these from the code directly

[Figure: two phases, WHT on the rows and WHT on the columns. Annotations: e.g., poor cache use: fast memory reads/writes; e.g., good cache use: fast CPU and bus.]
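
Treating the input as a matrix splits the WHT into a row phase and a column phase. The sketch below assumes the standard factorization WHT_{mn} = (WHT_m ⊗ I_n)(I_m ⊗ WHT_n) in natural (Hadamard) order and a row-major layout; the function names are illustrative and this is not SPIRAL-generated code.

```python
def wht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) is a power of two."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return x

def wht_rows_then_columns(x, rows, cols):
    """WHT of length rows*cols using the row-major matrix view of x."""
    # Row phase: unit-stride accesses inside each contiguous row.
    m = [wht(x[r * cols:(r + 1) * cols]) for r in range(rows)]
    # Column phase: each column is read and written with stride `cols`.
    for c in range(cols):
        col = wht([m[r][c] for r in range(rows)])
        for r in range(rows):
            m[r][c] = col[r]
    return [v for row in m for v in row]

x = list(range(16))
assert wht_rows_then_columns(x, 4, 4) == wht(x)
```

The two phases have visibly different memory access patterns, and the boundary between them is explicit in the factorization, so it is a natural place to put a configuration switch.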

Page 13

Dynamic Switching of Configuration

[Figure: execution timeline (axis: Time) alternating between configurations 497-165-165 and 530-132-265]

♦ Red = 530-132-265
♦ Blue = 497-165-165
♦ Grey = no work

Ruletree = recursion strategy

Page 14

Setup: Code Generation for XScale

SPIRAL Host (Linux/Windows)
♦ Formula generation
♦ Rule tree
♦ Code optimization
♦ Code generation
♦ Cross compilation
♦ Feedback timing to guide the search

XScale (Linux)
♦ Timing
♦ Feedback to SPIRAL

[Figure: arrows between the host and the XScale board, labeled "Code for XScale (LKM)" and "Timing"]

We measure execution time, power, and energy of the entire board, and compare vs. IPP

Page 15

DFT: Experimental Results

[Figure: IPP 4.1 vs. SPIRAL; an arrow marks the "Better" direction]

Page 16

FIR 16 taps: Experimental Results

[Figure: IPP 4.1 vs. SPIRAL; an arrow marks the "Better" direction]

Page 17

WHT: Experimental Results

[Figure: WHT results; an arrow marks the "Better" direction]

♦ If the switching time is less than 0.1 ms, we could improve performance using 530-1/4-1/2 and 497-1/3-1/3
♦ Switching between 398-14-1/3 and 497-1/3-1/3 improves energy consumption by 3-5%
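
Whether switching pays off depends on how the per-phase savings compare with the switching overhead. The following is a rough sketch of that break-even reasoning; the per-phase times are invented, the configuration labels are reused only as dictionary keys, and picking the fastest configuration per phase independently is a simplification rather than the method used in the experiments.

```python
def compare_schedules(phase_times, switch_time):
    """Return (static_total, dynamic_total) in seconds.

    phase_times maps a configuration name to its run time for each phase.
    static  = best single configuration for the whole run.
    dynamic = fastest configuration per phase, plus one switch_time for
              every change of configuration between consecutive phases.
    """
    phases = len(next(iter(phase_times.values())))
    static = min(sum(times) for times in phase_times.values())
    picks = [min(phase_times, key=lambda c: phase_times[c][p])
             for p in range(phases)]
    switches = sum(a != b for a, b in zip(picks, picks[1:]))
    dynamic = sum(phase_times[c][p] for p, c in enumerate(picks))
    return static, dynamic + switches * switch_time

# Hypothetical two-phase run (rows, then columns), times in seconds.
times = {"530-1/4-1/2": [1.0e-3, 1.4e-3], "497-1/3-1/3": [1.2e-3, 1.1e-3]}
fast_switch = compare_schedules(times, 0.05e-3)
slow_switch = compare_schedules(times, 0.5e-3)
```

With these made-up numbers, switching wins when the overhead is 0.05 ms and loses when it is 0.5 ms, which mirrors the shape of the argument on this slide.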

Page 18

DFT Compiler for CPU+FPGA

What if we can build our own DFT co-processor?

Page 19

SW/HW Partitioning Backgrounder

Basic Idea
♦ Fixed set of compute-intensive primitives in HW
⇒ performance/power/energy efficiency
♦ Control-intensive SW on CPU
⇒ flexibility in functionality
♦ E.g., support a library of many DFT sizes efficiently

[Figure: HW/gates + SW/CPU. SW on the CPU; HW function on the FPGA.]

Page 20

Core Selection: Rule-Tree-Based Partitioning of DFT

♦ Recursive structure of DFT
♦ Map a set of fixed-size DFTs (that fit in HW)
♦ The rest is performed in generated SW
♦ A few HW modules can speed up an entire library

[Figure: rule trees. 1024 = 16 × 64; 4096 = 64 × 64; 192 = 3 × 64; 256 = 2 × 128, with 128 = 2 × 64]

HW/SW Partitioning: how to choose?
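
The rule trees above rely on the Cooley-Tukey recursion n = r × s, in which a size-n DFT is built from size-s and size-r DFTs. The sketch below routes every sub-DFT whose size matches a core to a simulated "HW" call and leaves the rest to software. The function names, the call counter, and the rule of always splitting off the largest matching core are assumptions for illustration; the real core is replaced by a direct O(n²) DFT.

```python
import cmath

def dft_direct(x):
    """Reference O(n^2) DFT; stands in for SW leaves and for the HW core."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft_partitioned(x, hw_sizes, stats):
    """Cooley-Tukey split n = r * s, sending size-s leaves to 'HW'."""
    n = len(x)
    if n in hw_sizes:
        stats["hw_calls"] += 1
        return dft_direct(x)                      # simulated HW core
    s = next((c for c in sorted(hw_sizes, reverse=True)
              if c < n and n % c == 0), None)
    if s is None:
        return dft_direct(x)                      # SW only
    r = n // s
    # r DFTs of size s on the decimated inputs x[a], x[a+r], x[a+2r], ...
    inner = [dft_partitioned(x[a::r], hw_sizes, stats) for a in range(r)]
    out = [0j] * n
    for k2 in range(s):
        # Twiddle factors, then one size-r DFT across the partial results.
        t = [inner[a][k2] * cmath.exp(-2j * cmath.pi * a * k2 / n)
             for a in range(r)]
        outer = dft_partitioned(t, hw_sizes, stats)
        for k1 in range(r):
            out[k2 + s * k1] = outer[k1]
    return out

stats = {"hw_calls": 0}
x = [complex(i % 7, -(i % 3)) for i in range(192)]
y = dft_partitioned(x, {64}, stats)               # 192 = 3 x 64
```

For size 192 with a single DFT 64 core, the three size-64 sub-transforms go to "HW" and the size-3 stage stays in software, matching the 192 = 3 × 64 tree.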

Page 21

Related Work

♦ The problem we try to solve is NP-hard [Arato et al. 2005]
- Optimization of metrics with area constraints
♦ SpecSyn (and other system-level design languages)
- Specification of the behavior and partitioning of a system
♦ Heuristics for the partitioning problem
- [Kalavade and Lee 1994, Knerr et al. 2006]
♦ Solutions for specific problems
- JPEG [Zhang et al. 2005] or reactive systems [Weis et al. 2005]
♦ Compiler tools
- Hot-spot determination [Stitt et al. 2004]
- Pragmas [NAPA C 1998]
♦ Architecture specification
- [Garp 1997]

Page 22

SPIRAL’s Approach

♦ One infrastructure for SW, HW, SW/HW
♦ Optimization at the “right” level of abstraction
♦ Conquers the high level for automation

[Figure: SPIRAL flow. Levels: transform, ruletree, SPL, Σ-SPL. Annotations: problem specification; easy manipulation for search; vector/parallel/streaming optimizations; loop optimizations; traditional human domain; tools available. Outputs: SW (C/Fortran), SW vector/parallel, HW (RTL Verilog), SW/HW partitioned; machine code, netlist.]

Page 23

Virtual Cores by a HW Interface

♦ A DFT of size 2^n can be used to compute DFTs of sizes 2^k, for k < n
♦ Virtual core for 2^k using a 2^n core:
- Up-sampling in time, periodicity in frequency
- Same latency as the real 2^n core
- Same throughput as a real 2^k core, for n-c ≤ k < n

[Figure: an input vector of length 256 enters the virtual core DFT 256, which is implemented on the real core DFT 512]
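
The identity behind the virtual core is that inserting zeros between time samples makes the spectrum periodic: if y[L·m] = x[m] and y is zero elsewhere, the length-L·N DFT of y repeats the length-N DFT of x L times. A minimal software sketch of that identity follows; the function names are illustrative and the "real core" is simulated by a direct DFT rather than a hardware interface.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, standing in for the real hardware core."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def virtual_dft(x, real_size, real_core=dft):
    """Compute a len(x)-point DFT using only a real_size-point core."""
    if real_size % len(x) != 0:
        raise ValueError("real core size must be a multiple of len(x)")
    factor = real_size // len(x)
    upsampled = [0j] * real_size
    upsampled[::factor] = x            # up-sampling in time (zero insertion)
    spectrum = real_core(upsampled)    # periodic in frequency, period len(x)
    return spectrum[:len(x)]           # one period is the wanted DFT

x = [complex(k, -k) for k in range(4)]
small = virtual_dft(x, 16)             # 4-point DFT on a 16-point core
```

The whole large transform still runs once per input vector, which is consistent with the slide's remark that latency stays that of the real 2^n core.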

Page 24

Experiment: Partitioning Decision

♦ HW:
- E.g., DFT cores for sizes 64 and 512
- E.g., virtual DFT cores for sizes 16, 32, 128, and 256
♦ SW:
- Generated library for power-of-two sizes

[Figure: SW on the CPU; FPGA with DFT 64, DFT 32, DFT 512, DFT 256, DFT 128, and DFT 16]

Page 25

One Real HW Core DFT64

♦ Software only vs. SW + HW core DFT64
♦ Early ramp-up and peak, 1.5x to 2.6x speed-up

[Figure: performance plot with annotations 1.5x, 2.6x, “HW speeds up SW”, and “HW only”]

Page 26

One Real HW Core DFT512

♦ Software only vs. SW + HW core DFT512
♦ Slower ramp-up but better speed-up

[Figure: performance plot with annotations 5.6x, 2.1x, “Virtual cores”, “HW speeds up SW”, and “HW only”]

Page 27

Two Real HW Cores: DFT64 and DFT512

♦ DP search: software only vs. SW + HW (both, DFT512 only, DFT64 only)
♦ For a library: 2 cores combine early ramp-up with high speed-up

[Figure: performance plot with annotations 5.6x, 2.1x, and 2.6x]
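
A dynamic-programming search over rule trees can be sketched as follows. It assumes that the best plan for a sub-size does not depend on where it is used, which is the usual simplification behind DP search. All leaf costs are invented placeholders in arbitrary units, and the cost formula for a split is a toy model, not a measurement from the talk.

```python
from functools import lru_cache

# Hypothetical leaf costs (arbitrary units); not data from the slides.
SW_LEAF = {2: 4, 4: 16, 8: 48}
HW_CORE = {64: 100, 512: 900}

def best_plan(n, hw=HW_CORE):
    """Return (cost, ruletree) for a size-n DFT, n a power of two >= 2."""
    @lru_cache(maxsize=None)
    def solve(m):
        options = []
        if m in hw:
            options.append((hw[m], f"HW{m}"))
        if m in SW_LEAF:
            options.append((SW_LEAF[m], f"SW{m}"))
        r = 2
        while r < m:
            s = m // r
            cost_r, tree_r = solve(r)
            cost_s, tree_s = solve(s)
            # s DFTs of size r, r DFTs of size s, plus m twiddle multiplies.
            options.append((s * cost_r + r * cost_s + m, (tree_r, tree_s)))
            r *= 2
        return min(options, key=lambda option: option[0])
    return solve(n)

with_cores = best_plan(1024)
software_only = best_plan(1024, hw={})
```

Calling `best_plan` with different `hw` dictionaries (both cores, one core, none) reproduces the kind of comparison listed on this slide, here with made-up costs.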

Page 28

Performance

♦ Software only vs. SW + 2 HW cores
♦ Clear winner for each size, but for the whole library?
♦ This is performance, but what about power/energy?

Page 29

Energy/Power

♦ Software only vs. SW + 2 HW cores
♦ Clear winner for each size, but for the whole library?
♦ What about the performance/power trade-off?

Page 30

Error Analysis

[Figure: three plots over N = 4 to 8192 (powers of two), each comparing SW-only, 32-256, 64-512, and 128-1024: SNR (axis 0 to 100), maximum error (axis 0 to 1.E-03), and average error (axis 0 to 4.0E-05). Each plot marks the "Better" direction.]
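
The slides do not define the three error measures, so the sketch below uses the usual definitions as an assumption: SNR in dB as the ratio of reference energy to error energy, plus the maximum and mean absolute error per output element. The 8-bit quantization is only a toy stand-in for a fixed-point core.

```python
import math

def error_metrics(reference, computed):
    """Return (SNR in dB, maximum error, average error) for complex vectors."""
    errors = [abs(r - c) for r, c in zip(reference, computed)]
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(e ** 2 for e in errors)
    snr_db = math.inf if noise == 0 else 10 * math.log10(signal / noise)
    return snr_db, max(errors), sum(errors) / len(errors)

# Toy data: round each component to 8 fractional bits.
ref = [complex(math.cos(k), math.sin(k)) for k in range(16)]
fixed = [complex(round(z.real * 256) / 256, round(z.imag * 256) / 256)
         for z in ref]
snr, max_err, avg_err = error_metrics(ref, fixed)
```

Higher SNR and lower maximum and average error are better, matching the "Better" arrows on the three plots.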

Page 31

Conclusions

♦ We can evaluate different performance metrics for different transforms and architectures
- Different applications need different architectures
- Different architectures need different algorithms
♦ We can evaluate the dynamic effects of configuration manipulation
- XScale
  • No performance improvement (unless the switching time is < 0.1 ms)
  • Energy efficiency may improve by 3-5% using the current switching
- SW/HW
  • Large difference in using different configurations
♦ The SPIRAL approach simplifies the search, the code generation, and the code evaluation

Page 32

SPIRAL Team

HW, SW