Improving Scalability of CMPs with Dense ACCs Coverage
Post on 02-Nov-2021
Nasibeh Teimouri, Hamed Tabkhi and Gunar Schirner
Embedded System Lab. (ESL)
Department of Electrical and Computer Engineering
Northeastern University, Boston (MA), USA
Context:
• Embedded performance-demanding streaming applications
• Vision, software-defined radio, multimedia
• Heterogeneous implementation to meet performance demands under
stringent constraints
• Accelerator-based Chip Multi-Processors (ACMPs)
− Trends (among others):
− Increasing ACC coverage
− Increasing density
− Adjacent nodes in HW
[Figure: ACC-based CMP. Host processor, shared memory, DMAs, and ACC 0 to ACC 4 (each with an SPM) attached to the control/streaming communication fabric, with an interrupt line to the processor.]
Challenges with Denser ACCs Coverage
• Processor-centric view
– System orchestration by processor
– Processor becomes bottleneck
• High contention on shared resources
– Memory: local/shared data
– System Communication Fabric: ACC-to-ACC traffic
– Processor
[Figure: ACC-based CMP. Numbered steps of an ACC 0 to ACC 1 transfer passing through the host processor, DMAs, shared memory, and the communication fabric.]
• Unclear ACC comm. semantics
– Rely on processor interaction
• Scalability severely limited with denser ACCs coverage [DAC’15]
– System bottlenecks
– ACCs underutilized
Problem Formulation and Contribution
1. Define semantics of ACC communication / interaction
• Foundation for direct ACC-to-ACC communication
2. Transparent Self-Synchronizing (TSS) Architecture template
• Realizes semantics
• Mitigates system bottlenecks
• Peer view between processor and ACCs
Outline
• Trend: Increasing ACC Coverage and Density
– Motivation and challenges
• Problem Definition
• ACC-to-ACC Communication Semantics
• Transparent Self-Synchronizing (TSS) Architecture Template
• Experimental Results
• Conclusions
ACC Communication Aspects
• Orchestration by processor
• Data preparation according to the ACC job size, data type
• Synchronization of the ACCs and DMA
• DMA data transfer from/to ACC’s memory through communication fabric
• Which aspects need to be defined?
1. Granularity of processing:
ACC job size?
2. Data access model:
When and which memory region accessible?
3. Marshaling / Data Representation:
Adjust data type for ACC input
4. Synchronization:
Start, stop, flow control?
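The four aspects above can be written down as a per-port descriptor. The following is only a minimal sketch: the names `AccPort`, `AccessModel`, and `adaptation_needed`, as well as the example job sizes and data types, are hypothetical and not taken from the slides. It shows which aspects would need adaptation when a producer output feeds a consumer input.

```python
from dataclasses import dataclass
from enum import Enum


class AccessModel(Enum):
    FIFO = "fifo"
    RANDOM_ACCESS = "random_access"


@dataclass(frozen=True)
class AccPort:
    """Hypothetical descriptor of one ACC port, one field per aspect."""
    job_size: int          # 1. granularity: elements per job
    access: AccessModel    # 2. data access model
    dtype: str             # 3. data representation, e.g. "int16"
    sync_signals: tuple    # 4. synchronization signals (informational here)


def adaptation_needed(producer_out, consumer_in):
    """List the aspects that differ between a producer and a consumer port."""
    needs = []
    if producer_out.job_size != consumer_in.job_size:
        needs.append("granularity")
    if producer_out.access != consumer_in.access:
        needs.append("data access")
    if producer_out.dtype != consumer_in.dtype:
        needs.append("marshalling")
    return needs


# Assumed example values, for illustration only.
out_port = AccPort(64, AccessModel.FIFO, "int16", ("ORead", "Finished"))
in_port = AccPort(16, AccessModel.FIFO, "int32", ("IReady", "Finished"))
print(adaptation_needed(out_port, in_port))
```

In a processor-centric design, each mismatch reported here is resolved in software on the host before the DMA transfer.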
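[Figure: ACC communication through the host processor, DMAs, and shared memory over the streaming/control communication fabric. Labels: DMA config (size, addr), ACC config, DMA copying of type/granularity-adjusted data, processing done, done data copy. ACC_C contains a bus IF, a compute unit, and input/output buffers I0, I1, O0, O1 (FIFO/random access).]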
ACC Communication Semantic
• Synchronization / Control
– Initializing ACC for each computation and managing FIFO access
• Synchronization signals “IReady”, “ORead” and “Finished”
• Data access model
– Double buffering
– More general: FIFO with head/tail Random Access (RA)
• Granularity and marshaling management
– Data type/size adjustment of input/output data
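Double buffering can be sketched in a few lines. This is an illustrative software model, not the hardware design from the slides: the class `DoubleBuffer` and its method names are hypothetical. One bank is filled while the other is drained, and a full bank is flagged in the way a ready signal would be raised.

```python
class DoubleBuffer:
    """Two banks: one is filled while the other is drained."""

    def __init__(self, job_size):
        self.job_size = job_size
        self.banks = [[], []]
        self.ready = [False, False]  # True when a bank holds a complete job
        self.fill = 0                # bank currently being filled
        self.drain = 0               # bank that is read next

    def write(self, value):
        """Store one element; return True when a job becomes ready."""
        if self.ready[self.fill]:
            raise RuntimeError("both banks full: producer must stall")
        bank = self.banks[self.fill]
        bank.append(value)
        if len(bank) == self.job_size:
            self.ready[self.fill] = True
            self.fill ^= 1  # switch to the other bank
            return True
        return False

    def read_job(self):
        """Return the oldest complete job and free its bank, else None."""
        if not self.ready[self.drain]:
            return None
        job = self.banks[self.drain]
        self.banks[self.drain] = []
        self.ready[self.drain] = False
        self.drain ^= 1
        return job


buf = DoubleBuffer(job_size=2)
buf.write(1)
buf.write(2)            # bank 0 is now ready
buf.write(3)            # filling bank 1 while bank 0 is still unread
first = buf.read_job()
buf.write(4)            # bank 1 is now ready
second = buf.read_job()
```

The FIFO with head/tail random access mentioned above generalizes this to more than two banks.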
[Figure: ACC communication semantic. A producer P and a consumer C in the data flow model map to ACCP and ACCC. The orchestration between them covers SynchP, granularity, marshaling, and SynchC, with signals IReady, ORead, and Finished and RA/FIFO buffers (I0, O0).]
- All semantic aspects currently involve processor!
- Even for ACC-to-ACC communication
TSS: ACC-to-ACC Communication
• Separation of computation and communication
– Input Control Mgmt (ICM) and Output Control Mgmt (OCM)
• Efficient realization of the comm. semantics
– Data access (I/O buffer)
– Synchronization, data granularity management
– Data marshalling
• Local interconnect across the ACCs
– Hides ACC-to-ACC traffic from system bus
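The granularity and marshalling work that moves from the processor into the OCM/ICM pair can be illustrated as follows. This is a rough sketch under assumptions: the function name `ocm_to_icm` and the example sizes are hypothetical, and the handshake signals are not modeled. Producer-sized jobs are re-chunked into consumer-sized jobs, and each element is converted on the way.

```python
def ocm_to_icm(producer_jobs, consumer_job_size, marshal=lambda x: x):
    """Re-chunk producer jobs into consumer-sized jobs, converting elements.

    Yields one list per complete consumer job. Elements of an incomplete
    final job stay pending and are not forwarded.
    """
    pending = []
    for job in producer_jobs:
        for element in job:
            pending.append(marshal(element))
            if len(pending) == consumer_job_size:
                yield pending
                pending = []


# Assumed example: producer emits jobs of 4 integers, consumer expects
# jobs of 3 floating-point values.
consumer_jobs = list(ocm_to_icm([[1, 2, 3, 4], [5, 6, 7, 8]], 3, float))
print(consumer_jobs)
```

No host code runs between the two ACCs in this model, which is the point of the separation of computation and communication.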
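[Figure: TSS ACC-to-ACC communication. The OCM of ACCP (synchronization, granularity, marshaling; buffers O0, O1; signals OReady, ORead) connects through the interconnect to the ICM of ACCC (synchronization, granularity, marshaling; buffers I0, I1; signals IReady, IRead). An example data flow graph with nodes P0 to P8 is also shown.]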
TSS: Interconnect Network
• Interconnection network
– Many options: MUX, NoC, Bus
– Full connectivity not needed (only feasible connections)
– Depends on domain
[Figure: The OCM of ACCP connected to the ICM of ACCC through the interconnect.]
• Current choice: MUX-based interconnect
– Simplicity
– Parallelism
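A MUX-based interconnect with only feasible connections can be modeled as one multiplexer in front of each consumer input, configured by a select value. The sketch below is an assumption-laden illustration: the class `MuxInterconnect`, its methods, and the wiring table are hypothetical and do not reproduce the topology in the slides.

```python
class MuxInterconnect:
    """Hypothetical model: one mux in front of each consumer input."""

    def __init__(self, feasible):
        # feasible maps each consumer to the ordered list of sources wired
        # to its mux inputs; any connection not listed does not exist.
        self.feasible = feasible
        self.sel = {}  # consumer -> selected mux input index (SEL value)

    def connect(self, source, consumer):
        """Program the consumer's mux to select the given source."""
        inputs = self.feasible[consumer]
        if source not in inputs:
            raise ValueError(f"{source} -> {consumer} is not wired")
        self.sel[consumer] = inputs.index(source)

    def trace_back(self, sink):
        """Follow the selected mux inputs from a sink back to the chain head."""
        path = [sink]
        node = sink
        while node in self.sel:
            node = self.feasible[node][self.sel[node]]
            if node in path:
                raise RuntimeError("mux configuration forms a loop")
            path.append(node)
        return path[::-1]


# Assumed wiring, for illustration only.
net = MuxInterconnect({
    "ACC0": ["In0"],
    "ACC1": ["ACC0", "In0"],
    "ACC2": ["ACC0", "ACC1"],
    "Out0": ["ACC1", "ACC2"],
})
net.connect("In0", "ACC0")
net.connect("ACC0", "ACC2")
net.connect("ACC2", "Out0")
chain = net.trace_back("Out0")
print(chain)
```

Because each consumer has its own mux, independent chains can be active at the same time, which matches the parallelism argument above.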
[Figure: MUX-based interconnect. ACC0 to ACC8, each with an ICM and an OCM; Mux0 to Mux5 with select lines SEL0 to SEL5; input data flows 0 and 1 and output data flows 0 and 1.]
TSS: System Integration and Benefits
• Gateway
• Interface to system for each flow/stream (ACC chain)
• Configuration & control
• Granularity adjustment
[Figure: Gateway. A control/configuration unit with select lines SEL 0 to SEL 2; per-flow OCM0 and ICM0 instances (flows 0 to 2) with double-buffered output and input SPMs (O0/O1, I0/I1); a bus interface with MMRs carrying data to/from shared memory, control from the processor, and an interrupt to the processor.]
• Benefits
• Each ACC chain appears as
one ACC to processor
• Hides all internals
• Much smaller internal granularity
• Minimal as per ACC’s algorithm
• Reduces on-chip memory
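The gateway's granularity adjustment can be illustrated with a small software model. This is a sketch only: `gateway_run`, the stage functions, and the sizes are hypothetical. One system-level job enters, is processed through the whole chain in small internal chunks, and one result leaves, so the caller sees a single ACC.

```python
def gateway_run(job, chain, internal_size):
    """Process one system-level job through an ACC chain in small chunks.

    Each stage only ever holds internal_size elements, not the full job.
    """
    result = []
    for start in range(0, len(job), internal_size):
        chunk = job[start:start + internal_size]
        for stage in chain:
            chunk = stage(chunk)
        result.extend(chunk)
    return result


# Assumed two-stage chain, for illustration only.
def double(chunk):
    return [x * 2 for x in chunk]


def add_one(chunk):
    return [x + 1 for x in chunk]


output = gateway_run(list(range(6)), [double, add_one], internal_size=2)
print(output)
```

In this model the buffers between stages scale with `internal_size` rather than with the system-level job size, which is the mechanism behind the on-chip memory reduction listed above.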
[Figure: TSS system integration. Host processor, shared memory, and the TSS gateway (with SPM) attached to the control/streaming communication fabric and the interrupt line; inside TSS, ACC0 to ACC8 with their ICMs and OCMs, Mux0 to Mux8, and a CU.]
Experimental Setup
• Compare: Processor-centric ACMP, TSS
– Same HW / SW Mapping
– Impact of architecture on performance?
• 8 streaming applications (SDF3)
– H263Dec, H263Enc, MP3dec, MP3PB, Sam.Rate, Modem, Synthetic, Satellite
• ISS-based (OVP) Virtual platforms
– Automatically generated
– 2MB total on-chip mem
Virtual Platform Settings
Processor: ARM9, 500 MHz; OS: uC/OS-II
Communication Fabric: multi-layer AMBA AHB (32-bit), 200 MHz, dedicated DMA per channel
Memory: 2 MB
ACCs: double-buffered, 200 MHz
TSS over ACMP: System Performance and Memory Saving
• Average speedup: 3 times
– Minimize interaction with the processor
• 1/7th of orchestration demand
– Self-synchronization (OCM/ICM)
– Reduces system load
• 1/7th (avg) of on-chip memory
– Smaller internal job size
• 1/10th of traffic on system fabric
– ACC-to-ACC comm. fabric.
• 1/8th energy consumption
– Fewer off-chip accesses
– Smaller on-chip mem.
Conclusions
• Defined semantic aspects of ACC communication
– Synchronization
– Data access model
– Data granularity
– Data representation / marshalling
• Introduced architecture template: Transparent Self-Synchronizing (TSS)
– Efficient realization of semantics
• Separation of computation / communication with ICM/OCM
• Internal interconnect network
• Adjustable internal granularity (through gateway)
– Each ACC chain regardless of length appears as one ACC
• Illustrated architecture benefits (processor-centric vs. TSS)
– 8 streaming apps (SDF3) mapped to ISS-based VPs
– 3x speedup (at 1/8th energy consumption) with same HW/SW mapping
Thank you!