A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks Harvard University

Dec 14, 2015


A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures

Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks

Harvard University


Beyond Homogeneous Parallelism

[Figure: General-Purpose Cores (CPU), Programmable Accelerators (DSP, GPU), and Application-Specific Accelerators (ASIP, ASIC), compared on Flexibility, Programmability, Energy Efficiency, and Design Cost.]


OMAP 4 SoC

Today’s SoC


[Block diagram labels: ARM Cores, GPU, DSP (x2); System Bus, Secondary Bus (x2), Tertiary Bus; DMA (x2); SD, USB (x2), Audio, Video, Face, Imaging.]


Today’s SoC

CPU + L2$ + GPU: 39%

Other Blocks: 61%

Apple A7

Harvard VLSI-ARCH Group SoC Tapeout


Today’s SoC

[Block diagram labels: GPU/DSP, CPU (x2), Buses, Mem Interface, Acc (x9).]


Future Accelerator-Centric Architectures

Flexibility, Design Cost, Programmability

• How to decompose an application to accelerators?
• How to rapidly design lots of accelerators?
• How to design and manage the shared resources?

[Diagram labels: GPU/DSP, Big Cores, Small Cores, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators.]


Aladdin: A pre-RTL, Power-Performance Accelerator Simulator

Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW)

Aladdin: Accelerator-Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models

Outputs: Power/Area; Performance

“Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems

“Design Assistant”: Understand Algorithmic-HW Design Space before RTL

Flexibility, Design Cost, Programmability


Future Accelerator-Centric Architecture

Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.


Aladdin Overview

Inputs: C Code; Acc Design Parameters

Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG

Realization Phase: Idealistic DDDG -> Program Constrained DDDG -> Resource Constrained DDDG -> Performance, Activity -> Power/Area Models

Outputs: Power/Area; Performance

DDDG: Dynamic Data Dependence Graph



From C to Design Space

C Code:
for (i = 0; i < N; ++i)
    c[i] = a[i] + b[i];


IR Dynamic Trace:
0.  r0 = 0                  // i = 0
1.  r4 = load(r0 + r1)      // load a[i]
2.  r5 = load(r0 + r2)      // load b[i]
3.  r6 = r4 + r5
4.  store(r0 + r3, r6)      // store c[i]
5.  r0 = r0 + 1             // ++i
6.  r4 = load(r0 + r1)      // load a[i]
7.  r5 = load(r0 + r2)      // load b[i]
8.  r6 = r4 + r5
9.  store(r0 + r3, r6)      // store c[i]
10. r0 = r0 + 1             // ++i
…


Initial DDDG

0. i=0

1. ld a 2. ld b

3. +

4. st c

5. i++

6. ld a 7. ld b

8. +

9. st c

10. i++

11. ld a 12. ld b

13. +

14. st c


Idealistic DDDG

0. i=0    5. i++    10. i++

1. ld a    2. ld b    3. +    4. st c
6. ld a    7. ld b    8. +    9. st c
11. ld a   12. ld b   13. +   14. st c
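
A minimal sketch, in C, of how the Initial and Idealistic DDDGs differ for this trace. It is not Aladdin's implementation: `Op`, `build_trace`, and `dddg_depth` are hypothetical names, every op is assumed to occupy one level, and only read-after-write register dependences are tracked. The trace follows the slide's node numbering (nodes 0 to 14 for three iterations), so no `++i` is emitted after the last iteration.

```c
/* One dynamic IR op: destination register and up to three source
   registers (-1 = unused). Hypothetical layout, for illustration only. */
typedef struct { int dst; int src[3]; int is_index_update; } Op;

enum { NUM_REGS = 8, MAX_OPS = 64 };

/* Fill ops[] with the dynamic trace of c[i] = a[i] + b[i].
   Registers follow the slide: r0 = i, r1/r2/r3 = bases of a/b/c,
   r4..r6 = temporaries. Caller must keep iters <= 12 so the trace
   fits in MAX_OPS entries. Returns the number of ops written. */
int build_trace(Op *ops, int iters) {
    int n = 0;
    ops[n++] = (Op){0, {-1, -1, -1}, 1};         /* r0 = 0 */
    for (int it = 0; it < iters; ++it) {
        ops[n++] = (Op){4, {0, 1, -1}, 0};       /* r4 = load(r0 + r1) */
        ops[n++] = (Op){5, {0, 2, -1}, 0};       /* r5 = load(r0 + r2) */
        ops[n++] = (Op){6, {4, 5, -1}, 0};       /* r6 = r4 + r5 */
        ops[n++] = (Op){-1, {0, 3, 6}, 0};       /* store(r0 + r3, r6) */
        if (it + 1 < iters)
            ops[n++] = (Op){0, {0, -1, -1}, 1};  /* r0 = r0 + 1 */
    }
    return n;
}

/* Assign each op the earliest level allowed by its producers (level of
   the last writer of each source register, plus one). When
   ignore_index_chain is nonzero, an index update does not wait for the
   previous index update, mimicking the loop-index optimization.
   Writes per-op levels into level[] and returns the number of levels. */
int dddg_depth(const Op *ops, int n, int ignore_index_chain, int *level) {
    int last_writer[NUM_REGS];
    int depth = 0;
    for (int r = 0; r < NUM_REGS; ++r) last_writer[r] = -1;
    for (int i = 0; i < n; ++i) {
        level[i] = 0;
        for (int s = 0; s < 3; ++s) {
            int r = ops[i].src[s];
            if (r < 0 || last_writer[r] < 0) continue;   /* live-in value */
            if (ignore_index_chain && ops[i].is_index_update) continue;
            int cand = level[last_writer[r]] + 1;
            if (cand > level[i]) level[i] = cand;
        }
        if (ops[i].dst >= 0) last_writer[ops[i].dst] = i;
        if (level[i] + 1 > depth) depth = level[i] + 1;
    }
    return depth;
}
```

For three iterations, `dddg_depth` reports 6 levels with the index chain kept and 4 levels with it removed, because every iteration can then start as soon as its own index value exists.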


Optimization Phase: C->IR->DDDG

• Include application-specific customization strategies.
• Node-Level:
  – Bit-width Analysis
  – Strength Reduction
  – Tree-height Reduction
• Loop-Level:
  – Remove dependences between loop index variables
• Memory Optimization:
  – Memory-to-Register Conversion
  – Store-Load Forwarding
  – Store Buffer
• Extensible
  – e.g. Model CAM accelerator by matching nodes in DDDG
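
As an illustration of one node-level strategy, tree-height reduction, the sketch below compares the depth of a serial chain of additions with the depth of a balanced tree over the same inputs. It is a generic example rather than Aladdin's code; the function names are hypothetical, and it assumes integer addition so that re-association does not change the result.

```c
/* Levels of adders needed to sum n inputs as a serial chain:
   ((x0 + x1) + x2) + ... */
int chain_depth(int n) { return n > 1 ? n - 1 : 0; }

/* Levels needed after tree-height reduction: a balanced tree of
   two-input adders has ceil(log2(n)) levels. */
int tree_depth(int n) {
    int depth = 0;
    for (int width = 1; width < n; width *= 2) ++depth;
    return depth;
}

/* Pairwise (tree-shaped) sum, computed in place. Each pass of the outer
   loop corresponds to one level of the tree. Overwrites x[]. */
int tree_sum(int *x, int n) {
    if (n <= 0) return 0;
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];
    return x[0];
}
```

For eight inputs the chain needs 7 levels and the tree needs 3, which is the kind of critical-path shortening this optimization aims for.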


One Design

Acc Design Parameters: Memory BW <= 2, 1 Adder

[Figure: the Idealistic DDDG (nodes 0 to 19) scheduled cycle by cycle under these parameters, with per-cycle Resource Activity for the MEM and + units.]


Another Design

Acc Design Parameters: Memory BW <= 4, 2 Adders

[Figure: the Idealistic DDDG (nodes 0 to 19) scheduled cycle by cycle under these parameters, with per-cycle Resource Activity for the MEM and + units.]


Realization Phase: DDDG->Estimates

• Constrain the DDDG with program and user-defined resource constraints
• Program Constraints
  – Control Dependence
  – Memory Ambiguation
• Resource Constraints
  – Loop-level Parallelism
  – Loop Pipelining
  – Memory Ports
  – # of FUs (e.g., adders, multipliers)
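
The effect of resource constraints can be sketched with a small greedy scheduler for the vector-add example. This is a simplified model under stated assumptions, not Aladdin's scheduler: every op takes one cycle, loads and stores share a single per-cycle memory bandwidth limit, index values are assumed to be available up front, and ready ops are issued in trace order. `DesignParams` and `schedule_cycles` are hypothetical names.

```c
#include <assert.h>

enum { LD_A, LD_B, ADD, ST, OPS_PER_ITER };
enum { MAX_ITERS = 64 };

/* Per-cycle resource limits: memory ops (loads + stores) and adds. */
typedef struct { int mem_bw; int adders; } DesignParams;

/* Greedy cycle-by-cycle schedule of c[i] = a[i] + b[i] for `iters`
   iterations. Returns the total number of cycles used. */
int schedule_cycles(int iters, DesignParams p) {
    int done[MAX_ITERS][OPS_PER_ITER];   /* issue cycle, -1 = not issued */
    int remaining = iters * OPS_PER_ITER;
    int cycle = 0;
    assert(iters >= 0 && iters <= MAX_ITERS);
    assert(p.mem_bw > 0 && p.adders > 0);
    for (int it = 0; it < iters; ++it)
        for (int k = 0; k < OPS_PER_ITER; ++k)
            done[it][k] = -1;
    while (remaining > 0) {
        int mem_free = p.mem_bw, add_free = p.adders;
        for (int it = 0; it < iters; ++it) {
            for (int k = 0; k < OPS_PER_ITER; ++k) {
                int ready = 1;
                int *budget = &mem_free;
                if (done[it][k] >= 0) continue;
                if (k == ADD) {          /* needs both loads from earlier cycles */
                    ready = done[it][LD_A] >= 0 && done[it][LD_A] < cycle &&
                            done[it][LD_B] >= 0 && done[it][LD_B] < cycle;
                    budget = &add_free;
                } else if (k == ST) {    /* needs the add from an earlier cycle */
                    ready = done[it][ADD] >= 0 && done[it][ADD] < cycle;
                }
                if (ready && *budget > 0) {
                    done[it][k] = cycle;
                    --*budget;
                    --remaining;
                }
            }
        }
        ++cycle;
    }
    return cycle;
}
```

With four iterations, this model takes 7 cycles for Memory BW <= 2 with 1 adder and 4 cycles for Memory BW <= 4 with 2 adders. These counts come from the simplified model above, not from the slides.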


Power-Performance per Design

[Plot: Power versus Cycle, with one point per design:
  Acc Design Parameters: Memory BW <= 4, 2 Adders
  Acc Design Parameters: Memory BW <= 2, 1 Adder]


Design Space of an Algorithm

[Plot: Power versus Cycle across the design points of one algorithm.]
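
The slides do not say how points in the power versus cycle plot are compared. One common way to read such a design space is to keep only the Pareto-optimal points, meaning designs that no other design beats on both cycles and power. The sketch below shows that filter; `DesignPoint` and `pareto_front` are hypothetical names, and any numbers fed to it are the caller's own, not Aladdin results.

```c
/* One evaluated design: execution time in cycles and power in
   arbitrary units. Lower is better for both. */
typedef struct { int cycles; double power; } DesignPoint;

/* Nonzero if some other point is no worse in both metrics and strictly
   better in at least one. */
static int is_dominated(const DesignPoint *pts, int n, int i) {
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        int no_worse = pts[j].cycles <= pts[i].cycles &&
                       pts[j].power  <= pts[i].power;
        int better   = pts[j].cycles <  pts[i].cycles ||
                       pts[j].power  <  pts[i].power;
        if (no_worse && better) return 1;
    }
    return 0;
}

/* Copy the non-dominated points of pts[0..n) into out[] (capacity n),
   preserving input order. Returns how many points were kept. */
int pareto_front(const DesignPoint *pts, int n, DesignPoint *out) {
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (!is_dominated(pts, n, i))
            out[count++] = pts[i];
    return count;
}
```

The filter is quadratic in the number of points, which is usually acceptable for a sweep of a few thousand designs; a sort-based version would scale better for larger sweeps.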


Aladdin Validation

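[Flow diagram. Aladdin flow: C Code -> Aladdin -> Power/Area, Performance. RTL flow: C Code -> RTL Designer or HLS C Tuning with Vivado HLS -> Verilog -> ModelSim (Activity) and Design Compiler -> Power/Area, Performance.]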


Aladdin enables rapid design space exploration for accelerators.

[Flow diagram with run times. Aladdin flow (C Code -> Aladdin -> Power/Area, Performance): 7 mins. RTL flow (RTL Designer, HLS C Tuning, Vivado HLS, Verilog, ModelSim, Design Compiler): 52 hours.]


Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.

[Diagram labels: GPU, Big Cores, Small Cores, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators. Simulators shown alongside: GPGPU-Sim, MARSx86..., XIOSim…, Cacti/Orion2, DRAMSim2.]


Modeling Accelerators in a SoC-like Environment

[Diagram labels: Acc, Core, Cache, Memory.]


Aladdin: A pre-RTL, Power-Performance Accelerator Simulator

• Architectures with 1000s of accelerators will be radically different; new design tools are needed.

• Aladdin enables rapid design space exploration of future accelerator-centric platforms.

• You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin