Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC). Ran Manevich, Isask’har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny. Technion – Israel Institute of Technology.


Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC)

Ran Manevich, Isask’har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny

Technion – Israel Institute of Technology

May, 2009

2

Network on-Chip: the Good News

Interconnect for SoCs, CMPs and FPGAs
  Multi-hop, packet-based communication
  Efficient resource sharing
Scalable performance and efficiency in
  Power
  Area
  Design productivity

[Figure: System Bus]

3

Network on-Chip: the Bad News

Increased and hard-to-predict latency due to multi-hop and sharing
  Time critical signals
Broadcast? Multicast?
  No easy solutions
  Slow (10s of cycles)

I wish I had a bus at hand...

4

Solution: Bus-Enhanced NoC (BENoC)

Bus re-introduced as a NoC “add-on”
Use NoC for data
  Optimized for high bandwidth
Use bus for short meta-data
  Low bandwidth, low latency
  Broadcast, multicast
Overhead should be justified!

[Figure: NoC routers (R) with their attached modules]

5

Related Work

In-band support of time critical communication, and in-band multicast/broadcast
  Complex router implementation
  Suffer from multi-hop latency
Existing Bus-NoC hybrids
  Form a topological hierarchy
  Typically bus used for local communication

[Figures: modules and routers (R) in the related architectures]

6

BENoC Services

Fast unicast and multicast signaling
  CMP cache example
Anycast
  Find resources that fulfill certain conditions
  E.g., “Looking for an idle DSP”; or “Where are the 5 closest multipliers?”
Convergecast
  Efficient collection of feedback back to the initiator
  Barrier synchronization, …
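
The anycast and convergecast services listed above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the presented design: the `Module` record, the `anycast` helper and the Manhattan-distance metric are hypothetical names introduced only for this example. A condition is broadcast once, every module evaluates it locally, and the matching replies are collected back at the initiator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Module:
    # Hypothetical module record; the fields are illustrative only.
    name: str
    kind: str   # e.g. "DSP" or "MUL"
    idle: bool
    x: int      # grid position, used only by the assumed distance metric
    y: int

def anycast(modules, initiator, condition, k=1):
    """Return the k closest modules that satisfy `condition`.

    Models one broadcast of the query followed by a convergecast,
    i.e. the collection of all replies at the initiator.
    """
    replies = [m for m in modules if m is not initiator and condition(m)]
    # Assumed metric: Manhattan distance to the initiator; ties broken by name.
    replies.sort(key=lambda m: (abs(m.x - initiator.x) + abs(m.y - initiator.y), m.name))
    return replies[:k]

cpu = Module("cpu0", "CPU", False, 0, 0)
chip = [
    cpu,
    Module("dsp0", "DSP", False, 1, 0),
    Module("dsp1", "DSP", True, 3, 2),
    Module("dsp2", "DSP", True, 1, 1),
    Module("mul0", "MUL", True, 0, 1),
]

# "Looking for an idle DSP"
idle_dsp = anycast(chip, cpu, lambda m: m.kind == "DSP" and m.idle)
# "Where are the 2 closest idle modules?"
closest_idle = anycast(chip, cpu, lambda m: m.idle, k=2)
```

In this model the initiator does all the filtering by distance; a hardware realization could equally let the replies compete on the bus, which the sketch does not attempt to capture.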

7

Additional BENoC Applications

NoC control
  Router configuration
    E.g., routing table configuration
  Adapt NoC routing for load balancing
  Fault discovery and recovery
System control
  Power management
  Resource load balancing
Debug

8

Outline

Introduction
MetaBus architecture
MetaBus latency and energy analysis
CMP cache use case

9

Conventional System Buses

Figure copied from “AMBA Specification Rev 2.0” - http://www.arm.com/products/solutions/AMBA_Spec.html

Bandwidth optimized
Poor scalability
Not suitable for tasks in BENoC

10

MetaBus Design Requirements

Low area, low power
Low bandwidth
Low latency
Simple
Versatile
Scalable
Multicast and broadcast support
Acknowledgement

[Figure: NoC routers (R) and modules, with the added bus labeled “MetaBus”]

11

MetaBus Architecture

Many possible implementations
Example: tree topology with distributed arbitration

[Figure: Module#1 to Module#9 connected through a tree of five bus stations, one of which is the root]

12

Data Path

[Figure: Module#1 to Module#9 connected through a tree of five bus stations, one of which is the root. Legend: data to root; data to receivers]

13

Example: Broadcast of Two Words

[Figure: Module#1 to Module#9 connected through a tree of five bus stations, one of which is the root. Captions: address word propagates to the root; data word 1 and data word 2 propagate to the modules]
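
The two-phase data path of this example (words travel up to the root, then down to the receivers) can be sketched in Python. The tree below is a hypothetical topology with nine modules (`M1` to `M9`) and five bus stations (`BS1` is the root); the actual wiring in the figure is not recoverable from the transcript. The sketch also assumes that the address word only travels upward while each data word travels up and then down, which is one reading of the captions rather than a confirmed detail.

```python
# Hypothetical tree, child -> parent. "BS1" is the root bus station.
PARENT = {
    "BS2": "BS1", "BS3": "BS1",
    "BS4": "BS2", "BS5": "BS2",
    "M1": "BS4", "M2": "BS4", "M3": "BS4",
    "M4": "BS5", "M5": "BS5",
    "M6": "BS3", "M7": "BS3", "M8": "BS3", "M9": "BS3",
}
ROOT = "BS1"

def path_to_root(node):
    """Bus stations visited while a word travels from `node` up to the root."""
    path = []
    while node != ROOT:
        node = PARENT[node]
        path.append(node)
    return path

def broadcast_events(sender, data_words):
    """Ordered (word, direction, nodes) events for one broadcast transaction."""
    receivers = sorted(n for n in PARENT if n.startswith("M") and n != sender)
    up = path_to_root(sender)
    # Assumption: the address word is consumed at the root.
    events = [("address", "up", up)]
    for word in data_words:
        events.append((word, "up", up))           # data to root
        events.append((word, "down", receivers))  # data to receivers
    return events

for word, direction, nodes in broadcast_events("M3", ["data1", "data2"]):
    print(word, direction, nodes)
```

Because every word passes through the root, the number of stations crossed on the way up depends only on the depth of the sender in the tree, not on where the receivers sit.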

14

Distributed Arbitration Mechanism

[Figure: Module#1 to Module#3 connected through bus stations to the root bus station. Legend: bus request; bus grant]
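
A minimal Python sketch of tree-based distributed arbitration follows. It assumes a hypothetical three-station tree and a fixed-priority choice among the children of each station; the slides do not state which local policy the bus stations use, so both the topology and the policy are illustrative assumptions. Requests travel up (a station requests the bus whenever one of its children does) and the grant travels back down to exactly one module.

```python
# Hypothetical tree: station -> ordered list of children (stations or modules).
CHILDREN = {
    "Root": ["BS_A", "BS_B"],
    "BS_A": ["Module#1", "Module#2"],
    "BS_B": ["Module#3"],
}

def arbitrate(node, requesting):
    """Return the module granted the bus in the subtree of `node`, or None.

    Each station only looks at its own children. It passes the grant to
    the first child whose subtree holds a request (fixed priority, an
    assumption of this sketch), so no central arbiter sees all modules.
    """
    if node not in CHILDREN:  # leaf: a module
        return node if node in requesting else None
    for child in CHILDREN[node]:
        winner = arbitrate(child, requesting)
        if winner is not None:
            return winner
    return None

granted = arbitrate("Root", {"Module#2", "Module#3"})
```

A fair policy such as round-robin per station could replace the fixed priority without changing the request-up, grant-down structure.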

15

Masking Saves Power

Unicast from Module#3 to Module#5

[Figure: Module#1 to Module#9 connected through Bus Station 1 (root) and Bus Stations 2 to 5, with one mask per station (Mask1 to Mask5; mask bits shown in the figure: 1, 0, 1, 0, 1). Captions: address word propagates to the root; data word 1 propagates to the modules]
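
The idea behind masking can be sketched as follows: on the way down, only the bus stations that lead to a destination need to forward the data, and all other stations can stay quiet. The Python sketch below uses a hypothetical tree and a single mask bit per station, with 1 meaning "forward downward" and 0 meaning "masked". The topology, the one-bit-per-station granularity and the polarity are assumptions of this example, not details confirmed by the slides.

```python
# Hypothetical tree, child -> parent. "BS1" is the root bus station.
PARENT = {
    "BS2": "BS1", "BS3": "BS1", "BS4": "BS2", "BS5": "BS3",
    "Module#1": "BS2", "Module#2": "BS2",
    "Module#3": "BS4", "Module#4": "BS4",
    "Module#5": "BS5", "Module#6": "BS5", "Module#7": "BS5",
    "Module#8": "BS3", "Module#9": "BS3",
}
ROOT = "BS1"
STATIONS = ["BS1", "BS2", "BS3", "BS4", "BS5"]

def masks(destinations):
    """Mask bit per station: 1 if its subtree contains a destination."""
    enabled = set()
    for dest in destinations:
        node = PARENT[dest]
        enabled.add(node)
        while node != ROOT:
            node = PARENT[node]
            enabled.add(node)
    return {s: int(s in enabled) for s in STATIONS}

# Unicast to Module#5: only the stations between the root and Module#5 forward.
unicast_masks = masks(["Module#5"])
# Broadcast: every station forwards.
broadcast_masks = masks([m for m in PARENT if m.startswith("Module#")])
```

In this model a unicast activates only the stations on one root-to-leaf path, which is where the power saving relative to an unmasked broadcast would come from.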

16

(Binary) Bus Station

17

MetaBus Floorplan – An Example

64 modules, balanced binary MetaBus
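
For a sense of scale, the size of such a tree can be computed directly. The sketch assumes that the modules are the leaves of a perfectly balanced binary tree and that every internal node is one binary bus station; the function name is hypothetical.

```python
def balanced_binary_metabus(num_modules):
    """Return (depth, station_count) for a balanced binary tree of modules.

    Assumes `num_modules` is a power of two and that each internal tree
    node is one binary bus station (an assumption of this sketch).
    """
    if num_modules < 2 or num_modules & (num_modules - 1):
        raise ValueError("num_modules must be a power of two, at least 2")
    depth = num_modules.bit_length() - 1  # log2 of a power of two
    stations = num_modules - 1            # internal nodes of a full binary tree
    return depth, stations

print(balanced_binary_metabus(64))  # (6, 63)
```

Under these assumptions a word crosses at most 6 stations on its way up to the root and 6 on its way down.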

18

Outline

Introduction
MetaBus architecture
MetaBus latency and energy analysis
CMP cache use case

19

Analysis Highlights 1/4

NoC Broadcast+Unicast Energy/Transaction:

[Equations: E_{NoC broadcast} and E_{NoC unicast}, given in terms of V^2, N_flits, n, K, L_W, C_NL and C_ND]

20

Analysis Highlights 2/4

MetaBus Broadcast and Unicast Energy/Transaction:

[Equations: E_{MetaBus, broadcast} and E_{MetaBus, unicast}, given in terms of V^2, N_flits, n, B_D, B_R, C_BL, C_{BD,up} and C_{BD,down}]

21

Analysis Highlights 3/4

NoC unicast and broadcast latency:

[Equations: T_{NoC unicast}, given in terms of n, N_CiR, T_Nclk and N_flits; T_{NoC broadcast}, given in terms of n, T_Nclk and N_flits]

22

Analysis Highlights 4/4

MetaBus unicast and broadcast latency:

[Equation: T_MetaBus, given in terms of N_flits, B_D, B_R and resistance (R) and capacitance (C) terms with subscripts BL, BD,up and BD,down; numeric coefficients 1.5, 0.7 and 0.4]

23

Results - Energy Consumption

Energy consumption of broadcast and unicast transactions with 3 data words

[Figure: Bus and NoC unicast and broadcast energy per transaction. X axis: Number of Modules (0 to 40). Y axis: Energy per transaction [nJ] (0 to 3.5). Series: MetaBus Broadcast, Network Broadcast, MetaBus Unicast, Network Unicast]

10x10 mm chip
64 modules mesh
1 GHz NoC clock
Speed optimized bus
@ 0.18 um

24

Results - Latencies

3 data words broadcast and unicast transaction latencies in a system with a frequency and a speed optimized MetaBus.

[Figure 9: Bus and NoC broadcast latencies. X axis: Number of modules (0 to 40). Y axis: Latency [ns] (0 to 120). Series: MetaBus, Network Broadcast, Network Unicast]

10x10 mm chip
64 modules mesh
1 GHz NoC clock
Speed optimized bus
@ 0.18 um

25

Outline

Introduction
MetaBus architecture
MetaBus latency and energy analysis
CMP cache use case

26

Dynamic Non-Uniform Cache Access

Split large cache into independent smaller banks
  Non-uniform cache access time (NUCA)
Cache lines are moved to shorten access time
  Dynamic NUCA
Before fetching a line into its L1$, a CPU needs to find the L2 cache storing the line

[Figure: CMP (Chip Multi Processor) with several CPUs, each with an L1$, and multiple L2$ banks]
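
To make the search problem concrete, the Python sketch below contrasts two ways a CPU could locate the L2 bank that holds a line. Both functions are hypothetical illustrations: the sequential probing baseline and the single broadcast query are assumptions chosen for this example, not the protocols evaluated in the talk, and the bank contents are made up.

```python
def find_bank_by_probing(banks, line, probe_order):
    """Ask banks one by one with unicast messages. Returns (bank_id, probes)."""
    for probes, bank_id in enumerate(probe_order, start=1):
        if line in banks[bank_id]:
            return bank_id, probes
    return None, len(probe_order)

def find_bank_by_broadcast(banks, line):
    """One broadcast query; only a bank holding the line answers.

    Returns (bank_id, queries). Assumes at most one bank holds the line.
    """
    holders = [bank_id for bank_id, lines in banks.items() if line in lines]
    return (holders[0] if holders else None), 1

# Made-up bank contents: bank id -> set of cached line addresses.
banks = {0: {0x100}, 1: {0x140, 0x180}, 2: set(), 3: {0x1C0}}

probed = find_bank_by_probing(banks, 0x1C0, probe_order=[0, 1, 2, 3])
broadcasted = find_bank_by_broadcast(banks, 0x1C0)
```

With dynamic NUCA the line may have migrated since the last access, so the probing order can be wrong, while a broadcast query needs a single message regardless of where the line currently lives.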

27

Simulation Setup

16 processors, 64 L2 cache banks
PARSEC and SPLASH-2 benchmarks
Vanilla wormhole NoC
Simulation accounts for bus latency, arbitration time, etc.

28

Simulation Results

Performance improvement in BENoC compared to a NoC-based CMP

(a) average read transaction latency; (b) application speed

29

Summary

Current NoCs are largely distributed
  Borrowing concepts from off-chip networks
On-chip environment provides an opportunity
  Enhancing the network with a bus gives the best of both worlds
Advanced services are easily supported
  Anycast, management and control
Cost effective
  Power and performance
  Analysis and simulation

30

Thank you!

Questions?

zigi@tx.technion.ac.il

[Figure: Bus-Enhanced NoC]

QNoC Research Group
