CDSC CHP Prototyping
Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat,
Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou
[Figure: accelerator-rich CMP tile layout. Legend entries: Core (C), L2 Bank ($2), Router, Accelerator (A), Accelerator & BiN Manager (ABM)]
Accelerator-Rich Architectures: ARC, CHARM, BiN
Goals
Implement the architecture features and support in the prototype system
ARC Phase-2 Area Breakdown
Slice Logic Utilization:
  Number of Slice Registers: 45,283 out of 301,440: 15%
  Number of Slice LUTs: 40,749 out of 150,720: 27%
    • Number used as logic: 32,505 out of 150,720: 21%
    • Number used as Memory: 5,248 out of 58,400: 8%
Slice Logic Distribution:
  Number of occupied Slices: 17,621 out of 37,680: 46%
  Number of LUT Flip Flop pairs used: 54,323
    • Number with an unused Flip Flop: 14,617 out of 54,323: 26%
    • Number with an unused LUT: 13,574 out of 54,323: 24%
    • Number of fully used LUT-FF pairs: 26,132 out of 54,323: 48%
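The percentages in the Phase-2 report can be re-derived from the raw counts. The sketch below is only a consistency check. It assumes the tool truncates percentages rather than rounding them, which is an inference from the numbers (32,505 / 150,720 is about 21.57% but is reported as 21%). The names `phase2_report` and `percent` are illustrative and do not come from the slides.

```python
# Phase-2 counts copied from the report: name -> (used, available, reported %).
phase2_report = {
    "Slice Registers": (45_283, 301_440, 15),
    "Slice LUTs": (40_749, 150_720, 27),
    "LUTs used as logic": (32_505, 150_720, 21),
    "LUTs used as memory": (5_248, 58_400, 8),
    "Occupied Slices": (17_621, 37_680, 46),
    "Pairs with unused Flip Flop": (14_617, 54_323, 26),
    "Pairs with unused LUT": (13_574, 54_323, 24),
    "Fully used LUT-FF pairs": (26_132, 54_323, 48),
}


def percent(used, available):
    """Integer utilization percentage, truncated (assumed tool behavior)."""
    return used * 100 // available


# Every reported percentage should match the truncated ratio.
mismatches = [
    name
    for name, (used, available, reported) in phase2_report.items()
    if percent(used, available) != reported
]

# The three LUT-FF pair categories should add up to the total pairs used.
pair_total = 14_617 + 13_574 + 26_132
```

With these inputs `mismatches` stays empty and `pair_total` equals the 54,323 LUT Flip Flop pairs listed in the report.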
[Figure: ARC Phase-2 floorplan. Labeled blocks: DMAC wrapper, AXI, Microblaze (Linux), Microblaze (GAM), DRAM Controller, Ethernet, Ethernet DMA, DMAC, AXILite, Accelerator]
ARC Phase-3 Goals
First step toward BiN: shared buffer
Plug-n-play accelerator design
• Making the interface general enough, at least for a class of accelerators
ARC Phase-3 Architecture
A partial realization of the proposed accelerator-rich CMP on the Xilinx ML605 (Virtex-6)
• Global accelerator manager (GAM) for accelerator sharing
• Shared on-chip buffers: many more accelerators than buffer bank resources
• Virtual addressing in the accelerators, accelerator virtualization
• Virtual addressing DMA, with on-demand TLB filling from the core
• No network-on-chip, no buffer sharing with cache, no customized instructions in the core
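The "on-demand TLB filling from core" item can be sketched in software as follows. This is a behavioral illustration under stated assumptions, not the prototype's hardware design. The 4 KiB page size, the FIFO eviction policy, the class name `Iommu` and the `walk_page_table` callback (standing in for the core's fill service) are all hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size; the slides do not state one


class Iommu:
    """Toy IOMMU: translates DMA virtual addresses through a small TLB."""

    def __init__(self, capacity, walk_page_table):
        self.capacity = capacity
        # Hypothetical callback standing in for the core, which resolves
        # a virtual page number to a physical frame number on a TLB miss.
        self.walk_page_table = walk_page_table
        self.tlb = OrderedDict()  # vpn -> pfn, oldest entry first
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:
            # Miss: ask the core for the mapping only when it is needed.
            self.misses += 1
            pfn = self.walk_page_table(vpn)
            if len(self.tlb) >= self.capacity:
                self.tlb.popitem(last=False)  # evict oldest (FIFO, assumed)
            self.tlb[vpn] = pfn
        return self.tlb[vpn] * PAGE_SIZE + offset


# Example page table owned by the core (illustrative values only).
page_table = {0: 7, 1: 3, 2: 9}
iommu = Iommu(capacity=2, walk_page_table=page_table.__getitem__)
```

In this sketch the DMA engine never sees physical addresses up front. A translation is requested from the core only on the first touch of a page, or after that page's entry has been evicted.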
[Figure: ARC Phase-3 block diagram. Blocks: ACC0 to ACC3 (each with an ACC wrapper 0 to 3), DMAC0 to DMAC3, Buffer0 to Buffer3, IOMMU, GAM, Core, DRAM, Mailbox 0, Mailbox 1, Mutex, INTC, MDM, Timer, UART, Ethernet. Interconnect: AXI, AXI_B0 to AXI_B3, AXILite, Core-GAM, Core-IOMMU. Legend: bus master, bus slave; AXI bus, AXILite bus, FSL, AXIStream]
Performance and Power Results
Benchmarking kernel:
[Equation: y(i) computed from x(i) terms carrying the constants 2 through 9, for i = 0...4096; operators not recoverable]
Results
The 4 logic buffers are allocated to 4 separate buffer banks
The 4 logic buffers are allocated to 1 buffer bank
Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time
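One way to read the stated reason is through a toy steady-state model. Each master repeatedly performs an access that costs an AXI transaction time (overlappable across masters) plus a bank access time (serialized among masters that share a bank). This is a sketch under assumed cycle counts. None of the numbers, function names or parameters below come from the slides, and the model ignores arbitration and burst effects.

```python
from math import ceil


def cycles_per_access_round(n_masters, n_banks, t_axi, t_bank):
    """Steady-state cycles for every master to complete one buffer access.

    A master needs at least t_axi + t_bank cycles per access. A bank that
    serves k masters needs at least k * t_bank cycles per round. The
    slower of the two limits throughput.
    """
    masters_per_bank = ceil(n_masters / n_banks)
    return max(t_axi + t_bank, masters_per_bank * t_bank)


def sharing_overhead(n_masters, t_axi, t_bank):
    """Slowdown of one shared bank relative to one bank per master."""
    separate = cycles_per_access_round(n_masters, n_masters, t_axi, t_bank)
    shared = cycles_per_access_round(n_masters, 1, t_axi, t_bank)
    return shared / separate


# Hypothetical timings: AXI transaction dominates (20 cycles vs 1 cycle).
axi_dominated = sharing_overhead(n_masters=4, t_axi=20, t_bank=1)

# Hypothetical counter-case: bank access dominates (2 cycles vs 4 cycles).
bank_dominated = sharing_overhead(n_masters=4, t_axi=2, t_bank=4)
```

Under these assumptions, sharing one bank costs nothing extra when the AXI transaction time dominates. In the counter-case the shared bank becomes the bottleneck.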
Overhead of Buffer Sharing: Bank Access Contention (2)
The 4 logic buffers are allocated to 4 separate buffer banks
The 4 logic buffers are allocated to 1 buffer bank
Area Breakdown
Slice Logic Utilization:
  Number of Slice Registers: 105,969 out of 301,440: 35%
  Number of Slice LUTs: 93,755 out of 150,720: 62%
    • Number used as logic: 80,410 out of 150,720: 53%
    • Number used as Memory: 7,406 out of 58,400: 12%
Slice Logic Distribution:
  Number of occupied Slices: 32,779 out of 37,680: 86%
  Number of LUT Flip Flop pairs used: 112,772
    • Number with an unused Flip Flop: 25,037 out of 112,772: 22%
    • Number with an unused LUT: 19,017 out of 112,772: 16%
    • Number of fully used LUT-FF pairs: 68,718 out of 112,772: 60%
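To put this breakdown next to the ARC Phase-2 figures reported earlier, the short sketch below computes the growth in the main resources and the number of slices left on the device. The dictionary and variable names are illustrative. The counts are copied from the two reports.

```python
# (Phase-2 used, Phase-3 used, available on the device), from the two reports.
resources = {
    "Slice Registers": (45_283, 105_969, 301_440),
    "Slice LUTs": (40_749, 93_755, 150_720),
    "Occupied Slices": (17_621, 32_779, 37_680),
}

# Growth factor from Phase-2 to Phase-3 for each resource.
growth = {
    name: phase3 / phase2
    for name, (phase2, phase3, _available) in resources.items()
}

# Slices still unoccupied after Phase-3.
free_slices = resources["Occupied Slices"][2] - resources["Occupied Slices"][1]

for name, factor in growth.items():
    print(f"{name}: {factor:.2f}x")
print(f"Free slices after Phase-3: {free_slices}")
```

Registers and LUTs grow by roughly 2.3x between the two phases, while occupied slices grow by less than 2x. This is consistent with the larger share of fully used LUT-FF pairs in the Phase-3 report (60% versus 48%).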
[Figure: ARC Phase-3 floorplan. Labeled blocks: Microblaze0 (Linux), Microblaze1 (GAM), AXI-DDR, DDR Controller, Ethernet, Ethernet DMA, Accelerator (Sum of 10 SQRTs), IOMMU, Buffer Selectors, DMAC0 to DMAC3, AXI-BUF0 to AXI-BUF3, BUF0-CRTL to BUF3-CRTL, AXILite]
Phase-4 ARC Goals
Finding bottlenecks and system enhancement
Communication bottleneck
• Crossbar design instead of AXI-bus