
CDSC CHP Prototyping

Jan 15, 2016

Page 1: CDSC CHP Prototyping

CDSC CHP Prototyping

Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou

[Figure: tiled accelerator-rich chip layout. Legend entries: Core (C), L2 Bank ($2), Router, Accelerator (A), Accelerator & BiN Manager (ABM)]

Page 2: CDSC CHP Prototyping

Accelerator-Rich Architectures: ARC, CHARM, BiN

[Figure: tiled accelerator-rich chip layout. Legend entries: Core (C), L2 Bank ($2), Router, Accelerator (A), Accelerator & BiN Manager (ABM)]

Page 3: CDSC CHP Prototyping

Goals

Implement the architecture features and their support in the prototype system

Architecture proposals
• Accelerator-rich CMPs
• CHARM
• Hybrid cache
• Buffer-in-NUCA (BiN), etc.

Bridge different thrusts in CDSC

Page 4: CDSC CHP Prototyping

Server-Class Platform: HC-1ex Architecture

• Xeon Quad Core LV5408: 40W TDP
• Tesla C1060: 100GB/s off-chip bandwidth, 200W TDP
• 4 XC6vlx760 FPGAs: 80GB/s off-chip bandwidth, 90W design power

Page 5: CDSC CHP Prototyping

Drawback of the Commodity Systems

• Limited ability to customize from the architecture point of view
• Board-level integration rather than chip-level integration
• Commodity systems can only reach a certain level; we need further innovations

Page 6: CDSC CHP Prototyping

CHP Prototyping Plan

Create the working hardware and software

Use FPGA Extensible Processing Platform (EPP) as the platform
• Reuse existing FPGA IPs as much as possible

Work in multiple phases

Page 7: CDSC CHP Prototyping

Target Platforms: Xilinx ML605 and Zynq

• Zynq: dual-core ARM Cortex-A9 with programmable logic
• ML605: Virtex-6-based board

Page 8: CDSC CHP Prototyping

CHP Prototyping Phases

ARC Implementation

Phase 1: Basic platform
• Accelerator and software GAM

Phase 2: Adding modularity using available IP
• E.g. Xilinx DMAC IP

Phase 3: First step toward BiN
• Shared buffer
• Customized modules (e.g. DMA controller, plug-n-play accelerator)

Phase 4: System enhancement
• Crossbar
• AXI implementation

CHARM Implementation

Page 9: CDSC CHP Prototyping

ARC Phase 1 Goals

Setting up a basic environment
Multi-core + simple accelerators + OS
• Understanding the system interactions in more detail

Simple controller as GAM (global accelerator manager)
• Supports system-level sharing of multiple accelerators of the same type
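The sharing role of the GAM can be illustrated with a minimal sketch. It is not the prototype's firmware: the class name, the `reserve`/`release` methods, and the in-memory queue are all assumptions made for illustration. Only the idea of one manager arbitrating several accelerators of the same type, and the accelerator type names `vecadd` and `vecsub`, come from the slides.

```python
from collections import deque
from typing import Deque, Dict, Optional


class GlobalAcceleratorManager:
    """Toy software GAM (hypothetical): tracks free instances per accelerator type."""

    def __init__(self, instances: Dict[str, int]) -> None:
        # instances maps an accelerator type (e.g. "vecadd") to the number of copies.
        self._free: Dict[str, Deque[int]] = {
            acc_type: deque(range(count)) for acc_type, count in instances.items()
        }

    def reserve(self, acc_type: str) -> Optional[int]:
        """Return the id of a free instance of acc_type, or None if none is free."""
        free = self._free.get(acc_type)
        if not free:  # unknown type or every instance busy
            return None
        return free.popleft()

    def release(self, acc_type: str, instance_id: int) -> None:
        """Mark an instance as free so a later request can reserve it."""
        self._free[acc_type].append(instance_id)


gam = GlobalAcceleratorManager({"vecadd": 2, "vecsub": 2})
first = gam.reserve("vecadd")
second = gam.reserve("vecadd")
third = gam.reserve("vecadd")   # both vecadd instances are busy here
gam.release("vecadd", first)
fourth = gam.reserve("vecadd")  # the released instance is handed out again
```

A real GAM would also queue waiting requesters and notify them on release; the sketch simply returns `None` when nothing is free.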

Page 10: CDSC CHP Prototyping

ARC Phase 1 Example System Diagram

[Figure: block diagram. Labeled components: Microblaze-0 (Linux with MMU), Microblaze-1 (GAM; bare-metal, no MMU), AXI4 (xbar), AXI4lite (bus), DDR3, timer, uart, mutex, two vecadd accelerators, two vecsub accelerators, Mailbox (vecadd), Mailbox (vecsub), FSL links]

Page 11: CDSC CHP Prototyping

ARC Phase-2 Goals

Implementing a system similar to the original ARC design
• GAM, accelerator, DMA controller, SPM

Adding modularity using available IP
• E.g. Xilinx DMAC IP

Page 12: CDSC CHP Prototyping


ARC Phase-2 Architecture

Page 13: CDSC CHP Prototyping

ARC Phase-2 Performance and Power Results

Benchmarking kernel: y(i) = f(x(i)), for i = 0...4096 (operators lost in extraction; the surviving fragments are y(i), x(i), and x(i) paired with the constants 2 through 9)

Results

Platform                                                         Runtime (us)   Power (W)   EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz                      1,746          2           17,570X
2x Quad-core Intel Xeon CPU E5405 x64 @ 2.00GHz, 1 FPU per core  562            80          1,365X
Dual-core Intel Xeon CPU 5150 x32 @ 2.66GHz, 1 FPU per core      10,061         65          94X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                    852,163        72          1X

Page 14: CDSC CHP Prototyping

ARC Phase-2 Runtime Breakdown

[Figure: timeline from 0 to 700 us with four lanes (Core, GAM, ACC, DMAC) and four phases P0 to P3. Events shown: reservation request sent, GAM reserves acc, reservation succeeded, parameter sent, GAM passes parameter, acc wrapper partitions task, DMAC wrapper requests Pages 0 to 3, Pages 0 to 3 translated, DMAC transfers Pages 0 to 3, acc computes, acc done, GAM passes done signal, task done, acc freed. One interval is annotated 11.91 us.]

Page 15: CDSC CHP Prototyping

ARC Phase-2 Area Breakdown

Slice Logic Utilization
Number of Slice Registers: 45,283 out of 301,440: 15%
Number of Slice LUTs: 40,749 out of 150,720: 27%
• Number used as logic: 32,505 out of 150,720: 21%
• Number used as Memory: 5,248 out of 58,400: 8%

Slice Logic Distribution
Number of occupied Slices: 17,621 out of 37,680: 46%
Number of LUT Flip Flop pairs used: 54,323
• Number with an unused Flip Flop: 14,617 out of 54,323: 26%
• Number with an unused LUT: 13,574 out of 54,323: 24%
• Number of fully used LUT-FF pairs: 26,132 out of 54,323: 48%

[Figure: FPGA floorplan with labeled regions: DMAC wrapper, AXI, Microblaze (Linux), Microblaze (GAM), DRAM Controller, Ethernet, Ethernet DMA, DMAC, AXILite, Accelerator]

Page 16: CDSC CHP Prototyping

ARC Phase-3 Goals

First step toward BiN
• Shared buffer

Designing our customized modules
• Customized DMA controller
  • Handles batch TLB misses
• Plug-n-play accelerator design
  • Making the interface general enough for at least a class of accelerators

Page 17: CDSC CHP Prototyping

ARC Phase-3 Architecture

A partial realization of the proposed accelerator-rich CMP on Xilinx ML605 (Virtex-6)
• Global accelerator manager (GAM) for accelerator sharing
• Shared on-chip buffers: many more accelerators than buffer bank resources
• Virtual addressing in the accelerators, accelerator virtualization
• Virtual addressing DMA, with on-demand TLB filling from core
• No network-on-chip, no buffer sharing with cache, no customized instruction in the core

[Figure: block diagram. Labeled components: ACC0 to ACC3 with ACC wrappers 0 to 3, DMAC0 to DMAC3, Buffer0 to Buffer3, IOMMU, GAM, Core, DRAM, Mailbox 0, Mailbox 1, Mutex, INTC, MDM, Timer, UART, Ethernet. Buses: AXI, AXI_B0 to AXI_B3, AXILite. Links: Core-GAM, Core-IOMMU. Legend: bus master, bus slave; AXI bus, AXILite bus, FSL, AXIStream]
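The combination of virtual-addressing DMA, on-demand TLB filling from the core, and batch handling of TLB misses can be sketched as follows. This is an illustrative model only: the `Iommu` class, the `walk_pages` callback standing in for the core's page-table walk, the 4 KB page size, and the example page table are all assumptions, not details taken from the prototype.

```python
from typing import Callable, Dict, Iterable, List

PAGE_SIZE = 4096  # assumed page size in bytes


class Iommu:
    """Toy IOMMU (hypothetical): caches translations, fills misses in one batch."""

    def __init__(self, walk_pages: Callable[[List[int]], Dict[int, int]]) -> None:
        self._tlb: Dict[int, int] = {}   # virtual page number -> physical page number
        self._walk_pages = walk_pages    # stands in for the core servicing a fill
        self.fill_requests = 0           # how many times the core was interrupted

    def translate_batch(self, vaddrs: Iterable[int]) -> List[int]:
        vaddrs = list(vaddrs)
        # Collect every missing page first, so the core is asked once per batch
        # instead of once per miss.
        missing = sorted({va // PAGE_SIZE for va in vaddrs
                          if va // PAGE_SIZE not in self._tlb})
        if missing:
            self._tlb.update(self._walk_pages(missing))
            self.fill_requests += 1
        return [self._tlb[va // PAGE_SIZE] * PAGE_SIZE + va % PAGE_SIZE
                for va in vaddrs]


# Hypothetical page table owned by the core.
page_table = {0: 7, 1: 3, 2: 9}
iommu = Iommu(lambda vpns: {vpn: page_table[vpn] for vpn in vpns})

physical = iommu.translate_batch([0, PAGE_SIZE + 16, 2 * PAGE_SIZE + 5])
again = iommu.translate_batch([100])  # page 0 is already cached, no new fill
```

Batching matters because each fill involves the core; resolving three misses in one request costs one round trip rather than three.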

Page 18: CDSC CHP Prototyping

Performance and Power Results

Benchmarking kernel: y(i) = f(x(i)), for i = 0...4096 (operators lost in extraction; the surviving fragments are y(i), x(i), and x(i) paired with the constants 2 through 9)

Results

Platform                                                         Runtime (us)   Power (W)   EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz                      1,802          2           8,050,786X
2x Quad-core Intel Xeon CPU E5405 x64 @ 2.00GHz, 1 FPU per core  562            80          2,069,261X
Dual-core Intel Xeon CPU 5150 x32 @ 2.66GHz, 1 FPU per core      10,061         65          7,947X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                    852,163        72          1X
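The gain column can be checked from the runtime and power columns. The sketch below assumes energy = power × runtime and EDP = energy × runtime, both normalized to the UltraSPARC T1 row; the function and variable names are illustrative. Under these assumptions the gains in the table above come out as EDP ratios, while the gains in the earlier ARC Phase-2 table (17,570X, 1,365X, 94X) match plain energy ratios.

```python
from typing import Callable, Tuple

Platform = Tuple[str, int, int]  # (name, runtime in us, power in W)

# Values copied from the results tables.
BASELINE: Platform = ("16-Core UltraSPARC T1", 852_163, 72)
CHP_PHASE2: Platform = ("CHP prototype, ARC Phase-2 table", 1_746, 2)
CHP_LATER: Platform = ("CHP prototype, later table", 1_802, 2)
XEON_E5405: Platform = ("2x Quad-core Xeon E5405", 562, 80)
XEON_5150: Platform = ("Dual-core Xeon 5150", 10_061, 65)


def energy(runtime_us: int, power_w: int) -> int:
    """Energy in microjoules: watts times microseconds."""
    return power_w * runtime_us


def edp(runtime_us: int, power_w: int) -> int:
    """Energy-delay product: energy multiplied by runtime once more."""
    return energy(runtime_us, power_w) * runtime_us


def gain(metric: Callable[[int, int], int], platform: Platform) -> int:
    """How many times smaller the metric is than on the baseline, rounded."""
    _, runtime_us, power_w = platform
    _, base_runtime_us, base_power_w = BASELINE
    return round(metric(base_runtime_us, base_power_w) / metric(runtime_us, power_w))


for p in (CHP_LATER, XEON_E5405, XEON_5150):
    print(f"{p[0]}: EDP gain {gain(edp, p):,}X")
for p in (CHP_PHASE2, XEON_E5405, XEON_5150):
    print(f"{p[0]}: energy gain {gain(energy, p):,}X")
```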

Page 19: CDSC CHP Prototyping

Impact of Communication & Computation Overlapping

[Figure: two timelines (0 to 2200 us), each with four lanes (Core, GAM, ACC, DMAC), comparing "Pipelined Communication & Computation" with "No pipeline". Events shown: reservation request sent, GAM reserves acc, reservation succeeded, parameter sent, GAM passes parameter, acc wrapper partitions task, IOMMU requests pages, pages translated, DMAC transfers pages, acc computation, GAM passes done signal, task done, acc freed. The difference between the two timelines is annotated as 19%.]
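Why overlapping helps can be seen with a simple timing model. The sketch below is an assumption-laden illustration, not the prototype's scheduler: it splits a task into chunks, assumes the DMAC transfers chunks back to back with enough buffer space that transfers never stall, and lets the accelerator start a chunk as soon as both the chunk's data and the accelerator are available. The chunk times are hypothetical and are not meant to reproduce the 19% figure.

```python
from typing import Sequence


def serial_time(transfer_us: Sequence[float], compute_us: Sequence[float]) -> float:
    """No pipeline: each chunk is fully transferred, then computed, one after another."""
    return sum(transfer_us) + sum(compute_us)


def pipelined_time(transfer_us: Sequence[float], compute_us: Sequence[float]) -> float:
    """Overlapped schedule: the transfer of later chunks runs during earlier computes."""
    transfer_done = 0.0
    compute_done = 0.0
    for t, c in zip(transfer_us, compute_us):
        transfer_done += t  # DMAC moves chunks back to back
        # Compute starts when the data has arrived and the accelerator is idle.
        compute_done = max(compute_done, transfer_done) + c
    return compute_done


# Hypothetical per-chunk times in microseconds.
transfer = [100.0, 100.0, 100.0, 100.0]
compute = [60.0, 60.0, 60.0, 60.0]

no_pipeline = serial_time(transfer, compute)
pipelined = pipelined_time(transfer, compute)
saving = 1.0 - pipelined / no_pipeline
print(f"no pipeline: {no_pipeline} us, pipelined: {pipelined} us, saving: {saving:.1%}")
```

In this model the pipelined runtime approaches the larger of total transfer time and total compute time, so the benefit is greatest when the two are balanced.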

Page 20: CDSC CHP Prototyping

Overhead of Buffer Sharing: Bank Access Contention (1)

[Figure: two timelines (0 to 2200 us), each with four lanes (Core, GAM, ACC, DMAC), showing per-page translation, transfer, and compute events for Pages 0 to 7. Top: the 4 logical buffers are allocated to 4 separate buffer banks. Bottom: the 4 logical buffers are allocated to 1 buffer bank. The difference between the two timelines is annotated as 3.2%.]

Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time.

Page 21: CDSC CHP Prototyping

Overhead of Buffer Sharing: Bank Access Contention (2)

[Figure: two timelines (0 to 2300 us), each with four lanes (Core, GAM, ACC, DMAC), showing batched events: IOMMU requests Pages 0-4 and Pages 5-9, pages translated, DMAC transfers, acc computation. Top: the 4 logical buffers are allocated to 4 separate buffer banks. Bottom: the 4 logical buffers are allocated to 1 buffer bank. The difference between the two timelines is annotated as 2.7%.]

Page 22: CDSC CHP Prototyping

Area Breakdown

Slice Logic Utilization
Number of Slice Registers: 105,969 out of 301,440: 35%
Number of Slice LUTs: 93,755 out of 150,720: 62%
• Number used as logic: 80,410 out of 150,720: 53%
• Number used as Memory: 7,406 out of 58,400: 12%

Slice Logic Distribution
Number of occupied Slices: 32,779 out of 37,680: 86%
Number of LUT Flip Flop pairs used: 112,772
• Number with an unused Flip Flop: 25,037 out of 112,772: 22%
• Number with an unused LUT: 19,017 out of 112,772: 16%
• Number of fully used LUT-FF pairs: 68,718 out of 112,772: 60%

[Figure: FPGA floorplan with labeled regions: Microblaze0 (Linux), Microblaze1 (GAM), AXI-DDR, DDR Controller, Ethernet DMA, Ethernet, Accelerator (Sum of 10 SQRTs), IOMMU, Buffer Selectors, AXI-BUF0 to AXI-BUF3, DMAC0 to DMAC3, AXILite, BUF0-CRTL to BUF3-CRTL]
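The utilization percentages can be recomputed from the used and available counts. The sketch below is only a consistency check with illustrative names; it assumes the reported percentages are truncated rather than rounded, which is what the figures on both area slides are consistent with (for example, 32,779 of 37,680 is about 86.99% and is reported as 86%).

```python
from typing import Dict, Tuple


def utilization_percent(used: int, available: int) -> int:
    """Share of a resource in use, truncated to a whole percent."""
    return 100 * used // available


# (used, available) pairs copied from the two area breakdown slides.
PHASE2: Dict[str, Tuple[int, int]] = {
    "slice_registers": (45_283, 301_440),
    "slice_luts": (40_749, 150_720),
    "luts_as_logic": (32_505, 150_720),
    "luts_as_memory": (5_248, 58_400),
    "occupied_slices": (17_621, 37_680),
}
LATER: Dict[str, Tuple[int, int]] = {
    "slice_registers": (105_969, 301_440),
    "slice_luts": (93_755, 150_720),
    "luts_as_logic": (80_410, 150_720),
    "luts_as_memory": (7_406, 58_400),
    "occupied_slices": (32_779, 37_680),
}

phase2_pct = {name: utilization_percent(*pair) for name, pair in PHASE2.items()}
later_pct = {name: utilization_percent(*pair) for name, pair in LATER.items()}

for name in PHASE2:
    print(f"{name}: {phase2_pct[name]}% -> {later_pct[name]}%")
```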

Page 23: CDSC CHP Prototyping

Phase-4 ARC Goals

Finding bottlenecks and system enhancement

Communication bottleneck
• Crossbar design instead of AXI bus
• Speed up the AXI non-burst implementation

Page 24: CDSC CHP Prototyping

Accelerator Memory System Design

Crossbar
• In addition to the previously proposed design, now supports partial configuration
  • Will not affect working LCAs
• Passed on-board test

Hierarchical DMACs
• Data transfer between
  • Main memory
  • Shared buffer banks
• The number of buffer banks can be large, but we want to keep the AXI bus size unchanged
• Hence hierarchical DMACs and buses

[Figure: block diagram. Labeled components: IOMMU, Buffer bank1 through Buffer bank9, AXI buses, DMAC1 to DMAC3, Select-bit Receiver, GAM, main AXI bus to DDR, LCA1 to LCA4, OC core]
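The motivation for the hierarchy can be sketched numerically: if each DMAC serves a fixed group of buffer banks on its own local bus, the main AXI bus only needs one port per DMAC rather than one per bank. The grouping rule, the function names, and the sizes below (9 banks, 3 per DMAC) are assumptions chosen for illustration; the slide does not state how banks are assigned to DMACs.

```python
from typing import Dict, List, Tuple


def assign_banks(num_banks: int, banks_per_dmac: int) -> Dict[int, List[int]]:
    """Group consecutive bank ids under one DMAC each (hypothetical policy)."""
    groups: Dict[int, List[int]] = {}
    for bank in range(num_banks):
        groups.setdefault(bank // banks_per_dmac, []).append(bank)
    return groups


def route(bank: int, banks_per_dmac: int) -> Tuple[int, int]:
    """Return (DMAC id, select index on that DMAC's local bus) for a bank."""
    return bank // banks_per_dmac, bank % banks_per_dmac


groups = assign_banks(num_banks=9, banks_per_dmac=3)
main_bus_ports = len(groups)           # ports on the main bus with a hierarchy
flat_bus_ports = 9                     # ports if every bank sat on the main bus
dmac_id, select = route(bank=7, banks_per_dmac=3)
print(f"main bus ports: {main_bus_ports} instead of {flat_bus_ports}")
```

Adding more banks under this policy grows the number of local buses, or their width, while the main bus to DDR keeps a small, fixed number of masters.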

Page 25: CDSC CHP Prototyping

Crossbar Results