Integrated Systems Laboratory

Exercise 6: PULP Programming
Introduction to the PULP Computing Platform

Antonio Pullini, Michael Gautschi, Davide Schiavone
24.05.2016
How efficient do we need to be? [RuchIBM11]

10^12 ops/J  →  1 pJ/op  →  1 GOPS/mW
23.05.2016 2
Minimum energy operation

[Figure: Energy/Cycle (nJ) vs. Logic Vcc / Memory Vcc (V), 32nm CMOS, 25°C — total, leakage, and dynamic energy curves showing a 4.7X energy reduction at the minimum-energy point. Source: Vivek De, Intel, DATE 2013]

Near-Threshold Computing (NTC):
1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
LATEST PULP ARCHITECTURE

[Block diagram: a CLUSTER with N RISC-V cores (RISC-V #0 … RISC-V #N-1) sharing an instruction cache (SHARED I$) and M tightly coupled data memory banks (TCDM BANK #0 … TCDM BANK #M-1) through a TCDM INTERCONNECT; cluster peripherals include the DMA, HWCE, and CRYPT engines. A 64-bit CLUSTER BUS connects through dual-clock FIFOs and a BRIDGE to the 32-bit SoC domain, which contains the L2 MEMORY, ROM, SoC control, FLLs, uDMA, and JTAG/debug logic (JTAG TAP, ADV DEBUG IF), plus a PERIPHERAL INTERCONNECT with SPI master/slave, UART, GPIO, I2C, and I2S behind an APB bus and a pad mux]
PULP Cluster

MCHAN
– Private non-blocking per-core command queues
– Ultra-low-latency micro-coded programming interface
– Support for multiple out-of-order outstanding transactions

TCDM interconnect
– Single-cycle access to the shared tightly coupled data memory
– Interleaved addressing to reduce memory contention

Cluster organization
– Up to 16 processing elements
– Shared I$, no D$
– Shared L1 multi-banked data memory (TCDM)

Design choices
– I$: high code locality & simple architecture
– No D$: low locality & high complexity
– Multibank: smaller energy per access, "almost" multiported
Sharing Data: TCDM Interconnect
• What: synchronous low-latency crossbar based on binary trees:
  – Single or dual channel
  – Distributed arbitration (round robin)
  – Combinational handshake (single phase)
  – Deterministic access latency (1 cycle)
  – Test-and-set supported
  – Word-level or bank-level interleaving
• Used to build Tightly Coupled Data Memory (TCDM) systems
  – Slaves are pure memories
  – Processors and accelerators directly share data
  – Bank conflicts can be alleviated by increasing the Banking Factor (BF)
• Reconfigurable pipeline stages
  – Enables usage of SRAM in LV corners where SRAM slows down more than logic
• Reconfigurable MMU
  – Reconfigures the memory map to support selective SRAM shutdown in ULV corners
  – Allocates private memory to cores to avoid contention
[Diagram: dual-channel TCDM interconnect with 2^n slave ports connecting the master ports to SCM/SRAM bank pairs, each behind a reconfigurable MMU; banking factor BF > 2]
Multi-Ported Multi-Bank Shared ICache and Filter Cache (L0)

[Diagram: the processors access the I$ banks through an icache interconnect; a 64-bit AXI node provides the refill port to L2, and a configuration port (AUX registers) is reachable from the peripheral interconnect]
What: a shared I$ system with ultra-low latency (1 cycle) and high performance.
• Cache banks are read-only and optimized for full bandwidth on a hit
• Better silicon utilization, lower leakage: no data replication
• Read multicast and the L0 buffer help hide the bank-collision penalty
• 10% better energy efficiency on average at the core level
• Average 4X improvement on DSP kernels

[Inset: the per-core filter cache (L0) holds tag+data; 32-bit core fetch port, 64/128-bit refill]
DMA for tightly coupled processor clusters
• Low-area, low-power DMA engine optimized for integration in tightly coupled processor clusters
• Dedicated, per-core non-blocking programming channels
• Ultra-low-latency programming (~10 cycles)
• Small footprint (30 kgates): avoids large local FIFOs by forwarding data directly to the TCDM (no store-and-forward)
• Support for multiple outstanding transactions
• Parallel RX/TX channels allow achieving full bandwidth for concurrent load and store operations

[Figures: integration in a tightly coupled multi-core cluster; architecture of the DMA for tightly coupled processor clusters]
Event Unit
RI5CY Architecture Overview
• 4-stage integer pipeline supporting the base RISC-V instruction set
  – Single-cycle memory access
  – Single-cycle multiplications/MACs
  – Non-aligned memory access in two cycles
  – Hardware loops, pre-/post-increment addressing, and vector support
  – Support for compressed instructions
Conventional I/O Subsystem

[Diagram: incoming data passes through protocol conversion into an I/O buffer; an interrupt notifies the CPU, which reads the I/O buffer and programs the cluster DMA to transfer the data to L1 through the interconnect]
PULP Latest I/O Subsystem

[Diagram: in the PULP SoC domain a uDMA sits behind the protocol conversion and transfers incoming data directly to the L2 memory; the CPU only configures the uDMA and receives an interrupt/event, after which the cluster's core DMA transfers the data from L2 to L1]
What will we do today?
• Try a simple benchmark to see how the system works, how to run code, how to get outputs, and how to interact with the simulator
• EX1:
  – Implement a simple HW semaphore and use it to protect access to shared resources
  – Optimize the design to consume less power
• EX2:
  – Implement and use a simple HW synchronizer
• EX3:
  – Study a matrix multiply example and understand its weaknesses
  – Improve the matrix multiply benchmark using all cores, DMA transfers, and the TCDM memory
PULP MEMORY MAP

[Memory-map figure, as recoverable from the flattened layout.
Global view (0x0000_0000–0xFFFF_FFFF): Cluster at 0x1000_0000–0x1040_0000, APB peripherals at 0x1A00_0000–0x1A20_0000, Cluster TCDM alias at 0x1B00_0000–0x1B40_0000, L2 at 0x1C00_0000, FC subsystem.
Cluster view: TCDM Load/Store and Test&Set spaces (0x1000_0000–0x1020_0000); CLUSTER PERIPHS at 0x1020_0000 (cluster control unit, timer unit, event unit, I$ control); DEMUX PERIPHS at 0x1020_4000 (event unit, DMA); debug interface at 0x1030_0000 (CORE 0 DBG at 0x1030_0000, CORE 1 DBG at 0x1030_8000, … CORE 7 DBG at 0x1033_8000).
APB view (0x1A10_0000 onward): SOC CTRL, GPIO, UDMA, SOC TIMER, CVP, ROM at 0x1A11_0000, STDOUT at 0x1A12_0000, and the EX 06 IP used in this exercise at 0x1A10_5000]
Getting Started
• Set the shell to tcsh:
$ tcsh
• Copy data from the master account and set up the environment:
$ ~soc_master/6_pulp/setup_ex_06.csh
$ cd /scratch/$USER/6_pulp
$ source setup_env.csh
• Compile the RTL:
$ source ./hdl/scripts/compile_hdl.csh (remember this command)
Running Helloworld!
• Compile the example program:
  $ cd sw/1_helloworld
  $ make build all
• Run the program:
  $ make run gui=1
  This command opens the ModelSim tool (RTL simulator). To recompile your software and relaunch, you do not need to shut down and restart ModelSim each time: press Ctrl-Z in the terminal and then use the 'bg' command to put ModelSim in the background.
• Run the program within ModelSim (the following commands are for the ModelSim shell, not the terminal!):
  modelsim$ run -a
• Check the output: in the transcript, everything after [STDOUT] is the result of a printf in your core. The strings are printed character by character.
• Core 0 should print «Hello from Core 0»
SystemVerilog Basics
• C-like hardware-description language
• In HDL coding, the data type logic should be used!

                 SV:              VHDL:
  1-bit signal:  logic            std_logic
  N-bit vector:  logic [N-1:0]    std_logic_vector(N-1 downto 0)

• SystemVerilog uses modules (entities in VHDL)
module myUnit
#(
parameter WIDTH = 0
)
(
input logic clk_i,
input logic rst_ni,
input logic [WIDTH-1:0] data_di,
output logic [WIDTH-1:0] data_do
);
<< Module body >>
endmodule
Module Declaration
…
myUnit
#(
.WIDTH(32)
)
myUnit_i
(
.clk_i ( clk ),
.rst_ni ( rst ),
.data_di ( data_di ),
.data_do( data_do )
);
…
Module Instantiation
Sequential and combinational logic in SystemVerilog
…
logic enable;
enum logic [1:0] {idle, write, read} CS, NS;
…
assign enable = 1'b1;
…
always_ff @(posedge clk_i, negedge rst_ni)
begin
if (~rst_ni) begin
Data_DP <= 32'b0;
CS <= idle;
end
else if (enable) begin
Data_DP <= Data_DN;
CS <= NS;
end
end
…
Flip-Flop (with enable and active low reset)
…
always_comb
begin
NS = CS;
Data_DO = Data_DP;
case (CS)
idle:
begin
NS = write;
Data_DO = 32'b0;
end
…
default:
NS = idle;
endcase
…
end
…
Combinational process with a state machine
• Flip-flops
  – Use the always_ff construct
  – Use non-blocking assignments ( <= )
• Combinational logic
  – Use the always_comb construct
  – Or simple continuous assignments: assign … = …;
  – Use blocking assignments ( = )
EX1: Implement a HW semaphore
• Introduction:
  – Only one core is printing helloworld. What about the others?
• Task A:
  – Extend helloworld.c such that all the cores print their hello message.
• Hints:
  – Recompile the software with:
    $ make clean all
  – Restart the simulation in ModelSim with:
    «restart -f; run -a»
  – This will automatically use the recompiled files
EX1: Implement a HW semaphore
• Introduction:
  – As you saw, the output from all the cores is interleaved, because access to stdout is not granted exclusively to each core while it prints its message. In this exercise we will add hardware support for locking by implementing a simple hardware semaphore.
• Task B:
  – Analyze the code of apb_sync_if.sv in the folder hdl/sources
  – From the ModelSim prompt type 'do waves/ex_06.do' to enable the waveform window to help with the debugging
  – Modify the HDL code so that:
    • the register r_sem is set to 0 at reset or when writing to the REG_TSTSET register
    • the register r_sem is set to 1 when REG_TSTSET is read
    • PRDATA returns the value of r_sem on bit 0 when reading REG_TSTSET
  – On the software side, modify helloworld.c and:
    • Add a simple function to check the semaphore: it should poll the register until a 0 is read (lock)
    • Add a function to set the semaphore value back to 0 (free)
    • Call the lock and free functions before and after the printf and check whether the printf results are serialized
EX1: Implement a HW semaphore
• Introduction:
  – Look at the 'clock_en_i' signal of each core. This is the signal that gates the clock of the core, putting it in a low-power state. As you can see, it is active most of the time, because we are continuously polling the status of the semaphore. In ULP systems polling has to be avoided as much as possible and replaced with more "power friendly" event management. In the last part of the exercise we will modify the HW semaphore to include support for events.
• Task C:
  – Modify the HDL code so that 'event_o' is set to 1 when the semaphore is set back to 0 by writing the REG_TSTSET register. The event has to remain high for one cycle only.
  – Change your lock function: add a wait-for-event call if the semaphore is busy, before doing the next check on the semaphore status.
  – The cores should now not remain active while other cores are printing their strings to the output.
EX1: Implement a HW semaphore
• Useful information:
  – The address at which the HW semaphore is mapped is 0x1A105000
  – A memory-mapped register can be read with:
    volatile int my_val = *(volatile int*)(address);
  – And written with:
    *(volatile int*)(address) = my_val;
  – To listen to an event we first need to configure the mask of the events we want to listen to, then call the wait function, and, for events coming from the SoC (our case), empty the event FIFO and the event buffer:
    eu_evt_maskSet(mask);                            // set the mask; our mask is 0x80000000
    eu_evt_wait();                                   // wait for the event
    soc_event_id = *(volatile int*)(REG_EVNT_ID);    // empty the FIFO
    eu_evt_clr(mask);                                // clear the event buffer
EX2: Implement a HW synchronizer
• Introduction:
  – Another very important aspect of multicore systems is synchronization between cores. In this exercise we will implement a simple HW synchronizer and use it.
• Task A (hardware):
  – Close the previous ModelSim session and change to the sw/2_sync_matrixMul directory. Open the matrixMul.c application and study the code. You will notice that the workload of the cores is not balanced, and as it is the program will never work, since core 0 will start checking the results before the other cores are done with their computation.
  – A HW synchronizer can be implemented with a simple counter that is incremented by each core with a simple write operation and that generates an event as soon as the counter reaches a certain level. In our case we can set the level to 4 and implement the sync function as a write to the increment register followed by a wait for the wakeup event.
  – Modify the HDL code such that:
    • s_counter_next is incremented at each write to the REG_HWSYNC_SET address
    • when (r_counter == r_level), r_counter is reset to 0
    • r_level is set by writing to the REG_HWSYNC_LEVEL address, with a reset value of 4
    • event_o is set to 1 for one cycle when the counter reaches the level
EX2: Implement a HW synchronizer (Software)
• Task A (software):
  – On the software side, modify the custom_bar function in matrixMul.c in order to:
    • Implement the sync function (write to REG_HWSYNC_SET)
    • Go to sleep and wait for an event (eu_evt_wait())
    • Clear the event in the FIFO (dummy read of the FIFO buffer)
    • Clear the event buffer (eu_evt_clr(mask))
  – If you succeeded, you should see the cores going to sleep after completing the matrixMul. In this state they consume only very little power.
  – Make sure that the program now finishes without errors!!
  – Hints:
    • The HWSYNC_SET address is 0x1A105008
    • To clear the FIFO containing the event from the HWSync, read from the REG_EVNT_ID register
    • eu_evt_clr(mask) clears the event buffer of the event unit
EX3: Matrix Multiplication
• Task A:
  – Go to sw/3_matrixMul
  – Study the matrixMul.c code
  – Compile it
  – Estimate the number of cycles needed to complete the matrix multiplication with some back-of-the-envelope calculations.
  – Run the simulation & check how many cycles it actually took!
• Task B:
  – Analyze where the losses come from and use the presented functionality of the cluster to speed up the matrix multiplication.
  – You might want to use the following features in your C code:
    • Multiple processing cores
      – Split the matrix into 4 chunks and compute each chunk with one core
    • Fast TCDM memory
      – Copy data to an array located in the ".heapsram" section before doing computations
    • Synchronization barriers
  – You can use the printf() function for debugging