Compilation of Parametric Dataflow Applications for Software-Defined-Radio-Dedicated MPSoCs
PhD work of Mickaël Dardaillon
Mickaël Dardaillon, Kevin Marquet (Citi), Tanguy Risset (Citi), Jérôme Martin (CEA Leti), Henri-Pierre Charles (CEA List)
June 24th, 2016
Context Programming Model for SDR Micro-Scheduling Experimentations on Magali Conclusion
Evolution of telecommunication protocols
[Figure: data rate (kbps, log scale from 10 to 1,000,000) versus year (1990 to 2010) for 2G, 3G, 4G, Bluetooth and Wi-Fi. A second build overlays the label "SDR" on the same plot.]
4G LTE-Advanced: Downlink
[Figure: one frame (10 ms) is divided into 10 sub-frames, numbered 0 to 9, of 1 ms each. A sub-frame spans 14 OFDM symbols by 2048 subcarriers (20 MHz) and carries a Control region followed by Data allocated to User 1, User 2, User 3, ...]
- MIMO: 4×2 antennas
- LTE throughput: 1.4 Gbps
- LTE-Advanced: 7 Gbps
- Latency: 2 ms
- Power budget: 500 mW
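The grid dimensions on this slide can be turned into a rough count of the frequency-domain points a receiver has to handle. The C++ sketch below is illustrative only: the function names are hypothetical, every (symbol, subcarrier) point of the 2048-wide grid is counted, and one full grid per receive antenna is an assumption, not something stated on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Values read from the slide.
const std::int64_t kSymbolsPerSubframe = 14;    // OFDM symbols per 1 ms sub-frame
const std::int64_t kSubcarriers = 2048;         // subcarriers at 20 MHz
const std::int64_t kSubframesPerSecond = 1000;  // one sub-frame lasts 1 ms

// Grid points (symbol, subcarrier) in one sub-frame for one antenna.
inline std::int64_t grid_points_per_subframe() {
    return kSymbolsPerSubframe * kSubcarriers;
}

// Grid points per second for rx_antennas receive antennas.
// Assumption: each receive antenna contributes one full grid.
inline std::int64_t grid_points_per_second(int rx_antennas) {
    assert(rx_antennas > 0);
    return grid_points_per_subframe() * kSubframesPerSecond * rx_antennas;
}
```

Under these assumptions, `grid_points_per_subframe()` is 28,672 and `grid_points_per_second(2)` is 57,344,000, before any channel estimation or decoding work is counted.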
Context
What is SDR software?
Baseband processing in software
- ZigBee
- ...
- LTE Advanced
Constraints
- Computing power ∼ GFLOPS
- Reconfiguration time < 100 µs
- Consumption < 500 mW
Architecture-independent SDR software
[Figure: dataflow graph of a two-antenna receiver. Blocks: RF Frontend 1, AGC + synchronization, FFT, CFO estimation, CFO correction, channel estimation; RF Frontend 2, FFT, CFO correction; then MIMO decoding, Demodulation, Deinterleaving, Error correction.]
Context
What is SDR software? What is an SDR hardware platform?
- EVP16?
  - VLIW
  - Vector Processor
- SB3500?
  - DSP
  - Control Processor
- Magali?
  - Configurable Units
  - NoC
- ...
⇒ No unified hardware platform model for SDR.
Problem Statement: how to program and compile a telecommunication protocol to a heterogeneous MPSoC?
[Slide overlay for EVP16: excerpt from the paper "Vector Processing for Software-Defined Radio", p. 2619]
[Figure 6: A generic vector-processor architecture. Program memory, VLIW controller and ACU; a vector path P words wide (vector memory, vector register file, vector FU) and a scalar path 1 word wide (scalar RF, scalar FU).]
[Figure 7: The OnDSP architecture. Program memory, VLIW controller and ACU; a vector path 8 words wide (vector memory, 4 vector registers, load/store, ALU, MAC, shift) and a scalar path 1 word wide (4 scalar regs., load/store, ALU, MAC, shift).]
(iii) The VLIW execution model supports parallelism among multiple vector functional units (FUs), for example, MAC, ALU. This VLIW parallelism comes in addition to vector parallelism (R3).
(iv) On top of that a VLIW instruction may also specify several operations on scalar functional units (R4).
(v) To keep many functional units busy, there is extensive support for address calculations (ACUs, e.g., postincrement, modulo) and for zero-overhead looping (R4).
Compared to other programmable architectures, SIMD execution results in low power consumption (R8), because the “overhead” of address calculations, address decoding, instruction fetching/decoding, and control is shared by P operations. A similar reasoning holds for silicon area per MOPS.
With the above in common, two vector processor instances have been developed within Philips: OnDSP targeting WLAN, and EVP targeting 3G and beyond.
4.1. OnDSP
The OnDSP vector processor is a key component of several multistandard programmable wireless LAN baseband product ICs [15]. The application of vector processing to WLAN will be addressed in Section 6.1.
The OnDSP architecture is depicted in Figure 7. The vector size equals P = 8 (128 bits). A single VLIW instruction can specify a number of vector operations, for example, load/store, ALU, MAC, address calculations, and loop-control ((R3), (R4)). OnDSP supports a couple of specific vector instructions, including word insertion/deletion, sliding, and gray coding/decoding. Data addresses must be a multiple of P. Program code is compressed vertically (“tagged VLIW” [16]).
In a 0.12 µm CMOS process, OnDSP measures about 1.5 mm² (250 kgates), runs at 160 MHz (worst-case commercial), and dissipates about 0.8 mW/MHz including a typical memory configuration (R8). A macroassembler is used for VLIW scheduling, although optimization by hand is used for critical code.
[Figure 8: The EVP architecture. Program memory, VLIW controller and ACU; a vector path 16 words wide (vector memory, 16 vector registers, load/store unit, ALU, MAC/shift unit, shuffle unit, intravector unit, code generation unit) and a scalar path 1 word wide (32 scalar regs., load/store unit, ALU, MAC unit, AXU).]
4.2. EVP
The EVP (embedded vector processor) is a productized version of the CVP [7]. Although originally developed to support 3G standards, the current architecture proves to be highly versatile. Care has been taken to cover the OnDSP capabilities for OFDM standards.
The EVP architecture is depicted in Figure 8. The main word width is 16 bits, with support for 8-bit and 32-bit data (R1). The EVP supports multiple data types, including complex numbers (R1). For example, a complex vector multiplication uses P multipliers to multiply P/2 complex numbers each two clock cycles.
The SIMD width is scalable (R2), and has been set to P = 16 (256 bits) for the first product instance EVP16. The maximum VLIW-parallelism available equals five vector operations plus four scalar operations plus three address updates plus loop-control. Specific FUs of the EVP include the following ((R3), (R4)).
(i) The shuffle unit can be used to rearrange the elements of a single vector according to an arbitrary pattern (R5).
(ii) The intravector unit supports operations such as add (or take the maximum of) the elements of a single vector, possibly split in M segments of P/M elements each (R6), with M a power of 2.
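The excerpt's figure of half of P complex numbers multiplied every two clock cycles on P multipliers can be checked with a simple count. The sketch below assumes the straightforward form of a complex product, which needs four real multiplications, and assumes each multiplier completes one real multiplication per cycle. The function name is hypothetical and the model ignores additions, load/store and pipelining.

```cpp
#include <cassert>

// Cycles needed to multiply complex_count pairs of complex numbers
// on `multipliers` real multipliers.
// Assumptions: 4 real multiplications per complex product,
// one real multiplication per multiplier per cycle.
inline int cycles_for_complex_multiplies(int complex_count, int multipliers) {
    assert(complex_count > 0 && multipliers > 0);
    const int real_multiplies = complex_count * 4;
    // Ceiling division: a partially filled cycle still costs one cycle.
    return (real_multiplies + multipliers - 1) / multipliers;
}
```

For EVP16, P = 16, so `cycles_for_complex_multiplies(8, 16)` gives 2, which matches the two clock cycles quoted in the excerpt.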
[Slide overlay for SB3500: excerpt from IEEE Signal Processing Magazine, March 2010, p. 26]
Each of the three Sandblaster cores has support for SIMD instructions and thus it can exploit the DLP available in the application. Because the platform consists of three data processing cores, inter-TLP among the different tasks in the application can be also exploited on the platform. Each Sandblaster core also offers a fine-grain intra-TLP inside a single core. This intracore parallelism is also referred to as “token triggered threading” (T³), which is a form of simultaneous multithreading (SMT). Support for SMT allows the core to switch between different threads and their contexts quickly. However, the Sandblaster core has only limited ILP where only four instructions can be executed in parallel.
INFINEON MUSIC
Infineon’s MuSIC-1 platform [9] is a heterogeneous multicore platform that consists of various accelerators along with four programmable cores. Each of these four programmable cores provides DLP and is used for the inner modem PHY processing with the help of filter accelerators. The turbo/Viterbi accelerators are used for performing the outer modem PHY processing. The block diagram of the platform is depicted in Figure 6.
The multicore nature of the MuSIC-1 platform supports intercore TLP, which allows the mapping of different tasks on different cores. Similar to Sandbridge, the ILP inside a single core is limited.
ST-ERICSSON EXTREME VECTOR PROCESSOR PLATFORM
The extreme vector processor (EVP) [13] consists of 16-wide SIMD processor with five issue slots. Three of the five slots operate on vector data and two operate on scalar data. This processor exploits both data- and instruction-level parallelism in the application. However, not much public information is available on the complete platform architecture and how many cores would be needed to support a wireless standard.
ARM/UNIVERSITY OF MICHIGAN’S ARDBEG PLATFORM
ARM/University of Michigan’s Ardbeg platform [14] consists of three processor cores. Two cores are allocated for baseband processing and one core for control. The platform also consists of a turbo coprocessor for outer-modem processing (see Figure 7). The platform enables TLP to be exploitable between the four functional blocks (control processor, two baseband cores, and a turbo accelerator). Each of the baseband cores is 512-b wide and is capable of performing 64-way, 32-way, and 16-way SIMD on 8-b, 16-b, and 32-b data, respectively. However, the baseband core does not allow a large amount of ILP inside the core. The baseband processor is also used to perform certain outer-modem functionality such as Viterbi decoding.
[FIG5] Sandbridge SB3500 platform architecture. (Block diagram: the SBX complex with Core 1 to Core 3, each with iCache, SBX memory, SHB, and IO and other interfaces; HSN; AMBA buses; ARM; IO subsystem; memory subsystem; DMA; device controller.)
[FIG4] IMEC’s ADRES processor in the BEAR platform. (Block diagram: a VLIW section with VLIW CU, instruction fetch, branch control, instruction dispatch and mode control for CGA and VLIW; a CGA section made of an array of FUs with RFs; ICache, configuration memories, DMEM, global PRF and global DRF.)
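The SIMD widths quoted for the Ardbeg baseband cores follow directly from dividing the datapath width by the element width. A minimal sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>

// Number of SIMD lanes for a datapath of datapath_bits operating on
// elements of element_bits. The element width must divide the datapath.
inline int simd_lanes(int datapath_bits, int element_bits) {
    assert(element_bits > 0 && datapath_bits % element_bits == 0);
    return datapath_bits / element_bits;
}
```

For a 512-bit datapath, `simd_lanes(512, 8)`, `simd_lanes(512, 16)` and `simd_lanes(512, 32)` give 64, 32 and 16, the three configurations listed in the excerpt.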
[Slide overlay for Magali: block diagram of the units attached to the NoC: OFDM (ofdm1 to ofdm4), TURBO (turbo), DEMOD (demod), MOD (mod), LDPC (ldpc), WIFLEX (wiflex), ARM (arm), 8051, DMA (dma1 to dma5), DSP (dsp1 to dsp5).]
Magali SDR
[Figure: the Magali chip, with unit groups highlighted in successive builds: DSP units (dsp1 to dsp5); hardware units OFDM (ofdm1 to ofdm4), TURBO (turbo), DEMOD (demod), MOD (mod), LDPC (ldpc), WIFLEX (wiflex); DMA units (dma1 to dma5).]
LTE demonstrator [Clermidy et al., 09]
Power consumption: 231 mW
Outline
Context
  SDR software?
Programming Model for SDR
  Dataflow Model of Computation
  Input Format
Dataflow Refinement and Buffer Verification
Mapping and Scheduling
Micro-Scheduling
Experimentations on Magali
  Code Generation
  Experimental Results
Conclusion
State of the Art in SDR Programming

Imperative Concurrent
Platform                                  Language
ExoCHI [Wang et al., 07]                  OpenMP + C
BEAR [Derudder et al., 09]                Matlab + C

Dataflow
Platform                                  Language
Simulink
LabView
GNU Radio                                 Python + C
RVC-CAL [Lucarz et al., 08]               XML + C
DiplodocusDF [Gonzalez-Pina et al., 12]   UML
MAPS [Castrillon et al., 13]              C like
Static Dataflow (SDF) [Lee et al., 87]
[Figure: SDF graph with actors Src1, Decod1 and Ctrl; the edge rates are the constants 10, 10, 1, 1.]
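Because every rate in a static dataflow graph is a constant, a compiler can solve the balance equations ahead of time and find how many times each actor fires per iteration. The sketch below does this for a simple chain of actors. The `Edge` type and the `repetition_vector` function are hypothetical names, and reading the figure as Src1 to Decod1 with rates (10, 10) followed by Decod1 to Ctrl with rates (1, 1) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One edge of a chain: tokens produced per firing of the upstream actor,
// tokens consumed per firing of the downstream actor.
struct Edge {
    std::int64_t produced;
    std::int64_t consumed;
};

// Euclid's algorithm; gcd64(0, x) returns x.
inline std::int64_t gcd64(std::int64_t a, std::int64_t b) {
    while (b != 0) {
        const std::int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Smallest positive integer firing counts r such that, on every edge,
// r[i] * produced == r[i + 1] * consumed (the balance equations).
inline std::vector<std::int64_t> repetition_vector(const std::vector<Edge>& chain) {
    // Firing counts as reduced fractions num/den, relative to the first actor.
    std::vector<std::int64_t> num(1, 1), den(1, 1);
    for (const Edge& e : chain) {
        assert(e.produced > 0 && e.consumed > 0);
        const std::int64_t n = num.back() * e.produced;
        const std::int64_t d = den.back() * e.consumed;
        const std::int64_t g = gcd64(n, d);
        num.push_back(n / g);
        den.push_back(d / g);
    }
    // Scale by the least common multiple of the denominators.
    std::int64_t scale = 1;
    for (std::int64_t d : den) scale = scale / gcd64(scale, d) * d;
    std::vector<std::int64_t> reps;
    for (std::size_t i = 0; i < num.size(); ++i) reps.push_back(num[i] * (scale / den[i]));
    // Divide by the common factor to get the minimal solution.
    std::int64_t common = 0;
    for (std::int64_t r : reps) common = gcd64(common, r);
    for (std::int64_t& r : reps) r /= common;
    return reps;
}
```

With the rates assumed above, every actor fires once per iteration. A chain with a single edge producing 10 and consuming 100 tokens gives firing counts of 10 and 1.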
Phase Approach with Static Dataflow
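[Figure: a control graph (Src1, Decod1, Ctrl; rates 10, 10, 1, 1) followed by a family of static graphs with constant rates, one per case: (Src2, Decod2, Sink1), (Src2, Decod2, Sink2), (Src2, Decod2, Sink3), ...]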
Dynamic Dataflow (DDF) [Buck, 93]
[Figure: an axis running from Analysable (SDF) to Expressive (DDF). A second build places KPN, MCDF, SPDF, BPDF, SADF and PiMM along it.]
- Scenario Aware DataFlow (SADF) [Theelen et al., 06]
- Mode Controlled DataFlow (MCDF) [Moreira et al., 12]
- Schedulable Parametric DataFlow (SPDF) [Fradet et al., 12]
- Parameterized and Interfaced dataflow Meta-Model (PiMM) [Desnos et al., 13]
- Boolean Parametric DataFlow (BPDF) [Bebelis et al., 13]
- Kahn Process Network (KPN) [Kahn, 74]
Schedulable Parametric DataFlow (SPDF)
[Figure: SPDF graph. The first build shows Src, Decod1 and Ctrl with rates 10, 10, 1, 1. The second build adds Decod2 and Sink: Ctrl sets the parameter p (set p[1]), and the added edges carry the rates 10, 100, 10 and the symbolic rate p.]
[Fradet et al., 12]
- Model of Computation
- Analysis
- Quasi-Static Scheduling
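Quasi-static scheduling, listed on the slide, fixes the firing order at compile time and leaves only the loop bounds that depend on parameters to be resolved at run time. The sketch below illustrates the idea on a hypothetical two-actor edge, not on the graph in the figure: a producer emits p tokens per firing and a consumer takes one token per firing, so one iteration is "producer once, consumer p times". The names `Fifo`, `produce`, `consume` and `run_iteration` are illustrative and are not part of SPDF or PaDaF.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical FIFO channel between two actors.
struct Fifo {
    std::deque<int> tokens;
};

// Producer actor: pushes p tokens per firing (parametric production rate).
inline void produce(Fifo& out, int p) {
    for (int i = 0; i < p; ++i) out.tokens.push_back(i);
}

// Consumer actor: pops exactly one token per firing.
inline int consume(Fifo& in) {
    assert(!in.tokens.empty());
    const int value = in.tokens.front();
    in.tokens.pop_front();
    return value;
}

// Quasi-static schedule for one iteration: the order of firings is fixed
// at compile time, only the loop bound depends on the run-time value of p.
inline std::vector<int> run_iteration(Fifo& fifo, int p) {
    assert(p >= 0);
    produce(fifo, p);  // producer fires once
    std::vector<int> consumed;
    for (int i = 0; i < p; ++i) consumed.push_back(consume(fifo));  // consumer fires p times
    return consumed;
}
```

Calling `run_iteration` with different values of p between iterations leaves the FIFO empty each time, which is the kind of boundedness property a parametric analysis is meant to establish before code generation.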
Parametric DataFlow Format (PaDaF)
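[Figure: PaDaF graph with actors Src, Decod1, Decod2, Ctrl and Sink. Ctrl sets the parameter p (set p[1]); the edge rates shown are 10, 10, 1, 1, 10, 100, 10 and the symbolic rate p.]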
Actor specification
class Decod: public Actor {
  PortIn<int> in;
  PortOut<int> out;
  ParamIn p;
  void compute() {
    [...]
    out.push(res, p);