Integrated Systems Laboratory

Exercise 6: PULP Programming
Introduction to the PULP Computing Platform

Antonio Pullini, Michael Gautschi, Davide Schiavone
24.05.2016
How efficient do we need to be? [RuchIBM11]

10^12 ops/J  →  1 pJ/op  →  1 GOPS/mW
23.05.2016 2
Minimum energy operation

[Figure: Energy/Cycle (nJ) vs. Logic Vcc / Memory Vcc (V), 32nm CMOS, 25°C — total, leakage, and dynamic energy curves showing a 4.7X energy reduction at the minimum-energy point. Source: Vivek De, Intel, DATE 2013]

Near-Threshold Computing (NTC):
1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
LATEST PULP ARCHITECTURE

[Block diagram: a CLUSTER with N RISC-V cores (RISC-V #0 … RISC-V #N-1) sharing an instruction cache (SHARED I$) and M tightly coupled data memory banks (TCDM BANK #0 … TCDM BANK #M-1) through a TCDM INTERCONNECT; cluster peripherals include the DMA, HWCE, and CRYPT engines. A 64-bit CLUSTER BUS connects through dual-clock FIFOs and a BRIDGE to the 32-bit SoC domain, which contains the L2 MEMORY, ROM, SoC control, FLLs, uDMA, and JTAG/debug logic (JTAG TAP, ADV DEBUG IF), plus a PERIPHERAL INTERCONNECT with SPI master/slave, UART, GPIO, I2C, and I2S behind an APB bus and a pad mux]
PULP Cluster

MCHAN
– Private non-blocking per-core command queues
– Ultra-low-latency micro-coded programming interface
– Support for multiple out-of-order outstanding transactions

TCDM interconnect
– Single-cycle access to the shared tightly coupled data memory
– Interleaved addressing to reduce memory contention

Cluster organization
– Up to 16 processing elements
– Shared I$, no D$
– Shared L1 multi-banked data memory (TCDM)

Design choices
– I$: high code locality & simple architecture
– No D$: low locality & high complexity
– Multibank: smaller energy per access, "almost" multiported
Sharing Data: TCDM Interconnect
• What: synchronous low-latency crossbar based on binary trees:
  – Single or dual channel
  – Distributed arbitration (round robin)
  – Combinational handshake (single phase)
  – Deterministic access latency (1 cycle)
  – Test-and-set supported
  – Word-level or bank-level interleaving
• Used to build Tightly Coupled Data Memory (TCDM) systems
  – Slaves are pure memories
  – Processors and accelerators directly share data
  – Bank conflicts can be alleviated by increasing the Banking Factor (BF)
• Reconfigurable pipeline stages
  – Enables usage of SRAM in LV corners where SRAM slows down more than logic
• Reconfigurable MMU
  – Reconfigures the memory map to support selective SRAM shutdown in ULV corners
  – Allocates private memory to cores to avoid contention
[Diagram: dual-channel TCDM interconnect with 2^n slave ports connecting the master ports to SCM/SRAM bank pairs, each behind a reconfigurable MMU; banking factor BF > 2]
Multi-Ported Multi-Bank Shared ICache and Filter Cache (L0)

[Diagram: the processors access the I$ banks through an icache interconnect; a 64-bit AXI node provides the refill port to L2, and a configuration port (AUX registers) is reachable from the peripheral interconnect]
What: a shared I$ system with ultra-low latency (1 cycle) and high performance.
• Cache banks are read-only and optimized for full bandwidth on a hit
• Better silicon utilization, lower leakage: no data replication
• Read multicast and the L0 buffer help hide the bank-collision penalty
• 10% better energy efficiency on average at the core level
• Average 4X improvement on DSP kernels

[Inset: the per-core filter cache (L0) holds tag+data; 32-bit core fetch port, 64/128-bit refill]
DMA for tightly coupled processor clusters
• Low-area, low-power DMA engine optimized for integration in tightly coupled processor clusters
• Dedicated, per-core non-blocking programming channels
• Ultra-low-latency programming (~10 cycles)
• Small footprint (30 kgates): avoids large local FIFOs by forwarding data directly to the TCDM (no store-and-forward)
• Support for multiple outstanding transactions
• Parallel RX/TX channels allow achieving full bandwidth for concurrent load and store operations

[Figures: integration in a tightly coupled multi-core cluster; architecture of the DMA for tightly coupled processor clusters]
Event Unit
RI5CY Architecture Overview
• 4-stage integer pipeline supporting the base RISC-V instruction set
  – Single-cycle memory access
  – Single-cycle multiplications/MACs
  – Non-aligned memory access in two cycles
  – Hardware loops, pre-/post-increment addressing, and vector support
  – Support for compressed instructions
Conventional I/O Subsystem

[Diagram: incoming data passes through protocol conversion into an I/O buffer; an interrupt notifies the CPU, which reads the I/O buffer and programs the cluster DMA to transfer the data to L1 through the interconnect]
PULP Latest I/O Subsystem

[Diagram: in the PULP SoC domain a uDMA sits behind the protocol conversion and transfers incoming data directly to the L2 memory; the CPU only configures the uDMA and receives an interrupt/event, after which the cluster's core DMA transfers the data from L2 to L1]
What will we do today?
• Try a simple benchmark to see how the system works, how to run code, how to get outputs, and how to interact with the simulator
• EX1:
  – Implement a simple HW semaphore and use it to protect access to shared resources
  – Optimize the design to consume less power
• EX2:
  – Implement and use a simple HW synchronizer
• EX3:
  – Study a matrix multiply example and understand its weaknesses
  – Improve the matrix multiply benchmark using all cores, DMA transfers, and the TCDM memory
PULP MEMORY MAP

[Memory-map figure, as recoverable from the flattened layout.
Global view (0x0000_0000–0xFFFF_FFFF): Cluster at 0x1000_0000–0x1040_0000, APB peripherals at 0x1A00_0000–0x1A20_0000, Cluster TCDM alias at 0x1B00_0000–0x1B40_0000, L2 at 0x1C00_0000, FC subsystem.
Cluster view: TCDM Load/Store and Test&Set spaces (0x1000_0000–0x1020_0000); CLUSTER PERIPHS at 0x1020_0000 (cluster control unit, timer unit, event unit, I$ control); DEMUX PERIPHS at 0x1020_4000 (event unit, DMA); debug interface at 0x1030_0000 (CORE 0 DBG at 0x1030_0000, CORE 1 DBG at 0x1030_8000, … CORE 7 DBG at 0x1033_8000).
APB view (0x1A10_0000 onward): SOC CTRL, GPIO, UDMA, SOC TIMER, CVP, ROM at 0x1A11_0000, STDOUT at 0x1A12_0000, and the EX 06 IP used in this exercise at 0x1A10_5000]
Getting Started
• Set the shell to tcsh:
$ tcsh
• Copy data from the master account and set up the environment:
$ ~soc_master/6_pulp/setup_ex_06.csh
$ cd /scratch/$USER/6_pulp
$ source setup_env.csh
• Compile the RTL:
$ source ./hdl/scripts/compile_hdl.csh (remember this command)
Running Helloworld!
• Compile the example program:
  $ cd sw/1_helloworld
  $ make build all
• Run the program:
  $ make run gui=1
  This command opens the ModelSim tool (RTL simulator). To recompile your software and relaunch, you do not need to shut down and restart ModelSim each time: press Ctrl-Z in the terminal and then use the 'bg' command to put ModelSim in the background.
• Run the program within ModelSim (the following commands are for the ModelSim shell, not the terminal!):
  modelsim$ run -a
• Check the output: in the transcript, everything after [STDOUT] is the result of a printf in your core. The strings are printed character by character.
• Core 0 should print «Hello from Core 0»
SystemVerilog Basics
• C-like hardware-description language
• In HDL coding, the data type logic should be used!

                 SV:              VHDL:
  1-bit signal:  logic            std_logic
  N-bit vector:  logic [N-1:0]    std_logic_vector(N-1 downto 0)

• SystemVerilog uses modules (entities in VHDL)
module myUnit
#(
parameter WIDTH = 0
)
(
input logic clk_i,
input logic rst_ni,
input logic [WIDTH-1:0] data_di,
output logic [WIDTH-1:0] data_do
);
<< Module body >>
endmodule
Module Declaration
…
myUnit
#(
.WIDTH(32)
)
myUnit_i
(
.clk_i ( clk ),
.rst_ni ( rst ),
.data_di ( data_di ),
.data_do( data_do )
);
…
Module Instantiation
Sequential and combinational logic in SystemVerilog
…
logic enable;
enum logic [1:0] {idle, write, read} CS, NS;
…
assign enable = 1'b1;
…
always_ff @(posedge clk_i, negedge rst_ni)
begin
if (~rst_ni) begin
Data_DP <= 32'b0;
CS <= idle;
end
else if (enable) begin
Data_DP <= Data_DN;
CS <= NS;
end
end
…
Flip-Flop (with enable and active low reset)
…
always_comb
begin
NS = CS;
Data_DO = Data_DP;
case (CS)
idle:
begin
NS = write;
Data_DO = 32'b0;
end
…
default:
NS = idle;
endcase
…
end
…
Combinational process with a state machine
• Flip-flops
  – Use the always_ff construct
  – Use non-blocking assignments ( <= )
• Combinational logic
  – Use the always_comb construct
  – Or simple continuous assignments: assign … = …;
  – Use blocking assignments ( = )
EX1: Implement a HW semaphore
• Introduction:
  – Only one core is printing helloworld. What about the others?
• Task A:
  – Extend helloworld.c such that all the cores print their hello message.
• Hints:
  – Recompile the software with:
    $ make clean all
  – Restart the simulation in ModelSim with:
    «restart -f; run -a»
  – This will automatically use the recompiled files
EX1: Implement a HW semaphore
• Introduction:
  – As you saw, the output from all the cores is interleaved, because access to stdout is not granted exclusively to each core while it prints its message. In this exercise we will add hardware support for locking by implementing a simple hardware semaphore.
• Task B:
  – Analyze the code of apb_sync_if.sv in the folder hdl/sources
  – From the ModelSim prompt type 'do waves/ex_06.do' to enable the waveform window to help with the debugging
  – Modify the HDL code so that:
    • the register r_sem is set to 0 at reset or when writing to the REG_TSTSET register
    • the register r_sem is set to 1 when REG_TSTSET is read
    • PRDATA returns the value of r_sem on bit 0 when reading REG_TSTSET
  – On the software side, modify helloworld.c and:
    • Add a simple function to check the semaphore: it should poll the register until a 0 is read (lock)
    • Add a function to set the semaphore value back to 0 (free)
    • Call the lock and free functions before and after the printf and check whether the printf results are serialized
EX1: Implement a HW semaphore
• Introduction:
  – Look at the 'clock_en_i' signal of each core. This is the signal that gates the clock of the core, putting it in a low-power state. As you can see, it is active most of the time, because we are continuously polling the status of the semaphore. In ULP systems polling has to be avoided as much as possible and replaced with more "power friendly" event management. In the last part of the exercise we will modify the HW semaphore to include support for events.
• Task C:
  – Modify the HDL code so that 'event_o' is set to 1 when the semaphore is set back to 0 by writing the REG_TSTSET register. The event has to remain high for one cycle only.
  – Change your lock function: add a wait-for-event call if the semaphore is busy, before doing the next check on the semaphore status.
  – The cores should now not remain active while other cores are printing their strings to the output.
EX1: Implement a HW semaphore
• Useful information:
  – The address at which the HW semaphore is mapped is 0x1A105000
  – A memory-mapped register can be read with:
    volatile int my_val = *(volatile int*)(address);
  – And written with:
    *(volatile int*)(address) = my_val;
  – To listen to an event we first need to configure the mask of the events we want to listen to, then call the wait function, and, for events coming from the SoC (our case), empty the event FIFO and the event buffer:
    eu_evt_maskSet(mask);                            // set the mask; our mask is 0x80000000
    eu_evt_wait();                                   // wait for the event
    soc_event_id = *(volatile int*)(REG_EVNT_ID);    // empty the FIFO
    eu_evt_clr(mask);                                // clear the event buffer
EX2: Implement a HW synchronizer
• Introduction:
  – Another very important aspect of multicore systems is synchronization between cores. In this exercise we will implement a simple HW synchronizer and use it.
• Task A (hardware):
  – Close the previous ModelSim session and change to the sw/2_sync_matrixMul directory. Open the matrixMul.c application and study the code. You will notice that the workload of the cores is not balanced, and as it is the program will never work, since core 0 will start checking the results before the other cores are done with their computation.
  – A HW synchronizer can be implemented with a simple counter that is incremented by each core with a simple write operation and that generates an event as soon as the counter reaches a certain level. In our case we can set the level to 4 and implement the sync function as a write to the increment register followed by a wait for the wakeup event.
  – Modify the HDL code such that:
    • s_counter_next is incremented at each write to the REG_HWSYNC_SET address
    • when (r_counter == r_level), r_counter is reset to 0
    • r_level is set by writing to the REG_HWSYNC_LEVEL address, with a reset value of 4
    • event_o is set to 1 for one cycle when the counter reaches the level
EX2: Implement a HW synchronizer (Software)
• Task A (software):
  – On the software side, modify the custom_bar function in matrixMul.c in order to:
    • Implement the sync function (write to REG_HWSYNC_SET)
    • Go to sleep and wait for an event (eu_evt_wait())
    • Clear the event in the FIFO (dummy read of the FIFO buffer)
    • Clear the event buffer (eu_evt_clr(mask))
  – If you succeeded, you should see the cores going to sleep after completing the matrixMul. In this state they consume only very little power.
  – Make sure that the program now finishes without errors!!
  – Hints:
    • The HWSYNC_SET address is 0x1A105008
    • To clear the FIFO containing the event from the HWSync, read from the REG_EVNT_ID register
    • eu_evt_clr(mask) clears the event buffer of the event unit
EX3: Matrix Multiplication
• Task A:
  – Go to sw/3_matrixMul
  – Study the matrixMul.c code
  – Compile it
  – Estimate the number of cycles needed to complete the matrix multiplication with some back-of-the-envelope calculations.
  – Run the simulation & check how many cycles it actually took!
• Task B:
  – Analyze where the losses come from and use the presented functionality of the cluster to speed up the matrix multiplication.
  – You might want to use the following features in your C code:
    • Multiple processing cores
      – Split the matrix into 4 chunks and compute each chunk with one core
    • Fast TCDM memory
      – Copy data to an array located in the ".heapsram" section before doing computations
    • Synchronization barriers
  – You can use the printf() function for debugging