TKT-2431 SoC Design
Lec 10 – On-chip communication
Erno Salminen, Tero Arpinen
Department of Computer Systems
Tampere University of Technology
Fall 2010
Copyright notice
Part of the slides adapted from slide sets
  by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley, http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
  by Timo D. Hämäläinen, Managing On-Chip Communications, SoC Symposium, Tampere 19.11.2003
Copyright (2)
Part of figures from
  L. Benini, G. De Micheli, Networks on chips: a new
  V. Lahtinen, Design and Analysis of Interconnection Architectures for On-Chip Digital Systems, PhD Thesis, Tampere University of Technology, Department of Information Technology, June 2004. http://www.tkt.cs.tut.fi/research/daci/pub_open/lahtinen_the
  W. Wolf et al., "Multiprocessor System-on-Chip (MPSoC) Technology," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol.27, no.10, pp.1701-1713, Oct. 2008
See also:
  E. Salminen, A. Kulmala, T.D. Hämäläinen, "Survey of Network-on-chip Proposals", white paper, OCP-IP, [online]: http://www.ocpip.org/socket/whitepapers/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf, April 9, 2008, 13 pages.
  E. Salminen, A. Kulmala, T.D. Hämäläinen, "On Network-on-chip comparison", Euromicro conf. on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 503-510. http://daci.digitalsystems.cs.tut.fi:8180/pubfs/fileservlet?download=true&filedir=dacifs&freal=Salminen
Make sure that simple things work before even trying more complex ones
Problem Statement - SoC Complexity
SoC consists of heterogeneous components
  Varying communication requirements/profiles
  Not all components communicate with each other
[Figure: SoC with Proc_1...Proc_N, Acc_1...Acc_N, Mem_1...Mem_N and Periph_1...Periph_N connected by a communication network]
Different requirements
1. Varying bandwidth (or throughput)
  Amount of data transferred in unit time [MB/s]
  High requirement between CPU and memory
  Low requirement between CPU and peripheral
2. Different latency expectations
[Figure: CPU_1...CPU_N, Acc_1...Acc_N, Mem_1...Mem_N, Periph_1...Periph_N; high-BW and low-BW connections marked]
Characteristics of offered traffic load
1. Spatial: where the data go, are all sources similar?
2. Temporal: average data rate
3. Temporal: when to transfer
  a) Short bursts of high transfer activity and long periods of inactivity
  b) Transfers with constant sizes and intervals
[Figure, spatial (data amount per source/destination): a) one dst: neighbor, b) one dst: some, c) few dst, d) send to all]
[Figure, temporal (data amount over time): very bursty, moderately bursty, constant bitrate]
Basic metric: Latency
Delay between start of transfer and completion
  time (last data ejected) - time (first data enters) [n cycles for transferring d words]
Interrupts usually require low latency
Cache fills require low latency
Real-time systems require guaranteed latency (always below some limit)
Stream data (voice, video) may require
One should
1. include the latency of network interface (NI)
2. exclude the headers when calculating traffic load
3. measure the latency of the whole transfers (which may be several packets, i.e. at least one full packet, not just header latency)
4. include "infinite" buffer at source to avoid throttling
[Salminen, On the credibility of load-latency measurements, SoC, 2008]
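The measurement rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Packet` record and its field names are hypothetical, timestamps are in clock cycles, and `t_enter`/`t_eject` are assumed to be taken at the network interfaces so that NI latency is included.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    header_words: int    # protocol overhead, not counted as offered load
    payload_words: int   # useful data
    t_enter: int         # cycle when the first word enters the source NI (hypothetical field)
    t_eject: int         # cycle when the last word leaves the destination NI (hypothetical field)

def transfer_latency(packets):
    """Latency of the whole transfer: time(last data ejected) - time(first data enters)."""
    return max(p.t_eject for p in packets) - min(p.t_enter for p in packets)

def offered_load(packets, window_cycles):
    """Payload words per cycle over the observation window; headers are excluded."""
    return sum(p.payload_words for p in packets) / window_cycles

# One transfer split into two packets
transfer = [Packet(1, 4, 0, 12), Packet(1, 4, 5, 20)]
print(transfer_latency(transfer))   # 20
print(offered_load(transfer, 20))   # 0.4
```

Measuring only the first packet here would report 12 cycles instead of 20, which is the kind of optimistic figure rule 3 warns about.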
Components have local clocks
  Communication needs handshaking/synchronization
[Figure: Proc_1...Proc_N, Acc_1...Acc_N, Mem_1...Mem_N, Periph_1...Periph_N; high-frequency and low-frequency clock domains marked]
Energy breakdown forecast
[Figure: energy breakdown forecast, from Mattan Erez, Stream Architectures – Programmability and Efficiency, Tampere SoC, Nov. 17 2004]
Localization
Communication must be localized to avoid long wires
  Long wires consume much energy
  Long wires are slow, prone to error, cause routing congestion
Several small components instead of few large
  Communication between non-neighboring components requires many hops
[Mattan Erez, Stream Architectures – Programmability and Efficiency, Tampere SoC, Nov. 17 2004]
Reliability problems
"Synchronization failures between clock domains will be rare but unavoidable" - Benini
Electrical noise due to crosstalk, electromagnetic interference, radiation...
  Data errors or upsets, soft errors
Data transfers become unreliable and nondeterministic
Design needs both deterministic and stochastic models
Achieving reliability
Today, designers use physical techniques to overcome reliability problems
  Wire sizing
  Length optimization
  Repeater insertion
  Shielding
  Data coding
  Bunch of others...
  Huge design effort required
In the (near) future, 100% reliability on physical level cannot be afforded anymore
  Reliability must be increased with additional HW or SW layers
  Error detecting/correcting codes
  Retransmissions
  Request/acknowledge and time-out counters
Network-on-chip (NoC)

Network-on-Chip (NoC)
Communication network on chip
NoC motivation
1. High fab cost and effort in traditional VLSI
  Design general-purpose platform
2. Flexibility - for changing application needs
3. Concurrency in transfers
4. Only short signal wires due to power and delay problems
5. On-chip wires are no longer reliable
Usually packet-switched, multi-hop network
Differences between Multiprocessors and SoC
Multiprocessor systems (past) | System-on-Chip (portable device)
Scalability important after fab (increase nodes) | Scalability an issue only at design time (reuse, easy addition of nodes)
Load balancing and even distribution of computation important for maximum performance | Energy consumption important, idle nodes must be shut down
Communication network used as means of balancing computation and communication (both adjusted for optimal performance) | Computation might already be fixed per node (functional partition). Network serves nodes (only network adjusted)
Dataflow computing | Computation is very heterogeneous, both dataflow and control style
In principle any node can compute a given task | Execution of various applications clustered within SoC (specialized nodes)
Much experience and well established research of routing, switching, scalability | Some research seems to be "re-inventing the wheel" (past multiprocessor research). New challenge: energy saving combined with tailoring according to applications
Micronetwork protocol stack
Layers are specialized and optimized according to application (domain)
[Figure: protocol stack ordered by abstraction level. Annotations: HW dependent SW; splitting long transfer into packets, reordering; routing; arbitration, packetization to increase reliability]
Network topology
Defines
  the components (e.g. routers)
  the connections (e.g. each router connected to 4 neighbours)
Vast number of topologies proposed in literature – but there's no free lunch!
[Figure legend: b=bus, hb=hierarchical bus, r=ring, p=point-to-point, ft=fat-tree, x=crossbar, c=custom, t=2-D torus]
Network topology (2)
Can be modeled with graphs
  node = router (+processing unit)
  edge = data stream
Number of nodes denoted with N
Average path length L
  Avg. num of edges between all nodes in graph
  Small L desired for small latency
Average degree <k>
  Avg. num of edges in each switch
  Large <k> may decrease L but implementation gets more complex also
Metric: Bisection bandwidth
When design is partitioned into two (nearly) equal halves, it is the minimum number of wires which must cross between the halves considering all possible partitions
  Number of nodes in halves differs at most by 1
  Also other definitions...
High number means higher number of possible routes and hence increased bandwidth, flexibility and possibly fault-tolerance
Should increase with the number of nodes in scalable networks
Generic router
Forwards data from input ports to outputs
FIFOs can be on either side of the crossbar
  1 FIFO per port is the most common
  Virtual channels allow multiple FIFOs per port
Area and delay increase rapidly with the number of ports
[Figure: generic router with input ports, FIFOs, crossbar, output ports, and routing/arbitration logic]
Routing algorithm
Selects route from source to destination
1. Deterministic
  Same route always used between source and destination
  E.g. 2-D mesh: first find correct row, then correct column
  All packets arrive in-order
  One blocked (or faulty) link/router blocks all packets on that route
2. Adaptive
  Route varies according to blockage
  Better performance (at least when reordering neglected)
  Better fault-tolerance
  Deadlock avoidance needs extra care
  Data may arrive out-of-order
  Reordering buffers required at receiver
  Buffers may consume large area/energy
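The deterministic 2-D mesh example (first find the correct row, then the correct column) can be sketched as follows. The function name `deterministic_route` and the `(row, col)` coordinate convention are assumptions made for this illustration, not part of any particular NoC.

```python
def deterministic_route(src, dst):
    """Return the list of (row, col) routers visited from src to dst in a 2-D mesh."""
    row, col = src
    path = [(row, col)]
    while row != dst[0]:               # phase 1: move to the correct row
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    while col != dst[1]:               # phase 2: move along that row to the correct column
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    return path

print(deterministic_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Because the path depends only on source and destination, two packets of the same transfer can never overtake each other, but a single blocked link on that path stalls all of them.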
Switching
1. Store-and-forward switching
  Data forwarded when whole packet received
  Whole packet buffered
  Increases area and latency
2. Virtual cut-through: Data forwarded ASAP
  Whole packet buffered if output blocked
3. Wormhole: Data forwarded ASAP
  Buffer sizes can be independent of the packet size
  Reserves the whole transfer path and hence increases contention
Some schemes drop packets when contention is high
  Highly nondeterministic
  Acknowledges required (roundtrip latency, buffers for retransfers)
  Not recommended in general
Buffering has big impact on NoC performance and router area
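A rough, contention-free latency comparison of the switching schemes, as a sketch only. It assumes a simplified textbook model that is not taken from the slides: every link moves one flit (word) per cycle, routers add no extra delay, and `hops` is the number of links traversed.

```python
def store_and_forward_latency(hops, flits):
    """Each router receives the whole packet before forwarding it."""
    return hops * flits

def cut_through_latency(hops, flits):
    """Virtual cut-through and wormhole: the header advances one hop per cycle
    and the remaining flits follow in a pipeline."""
    return hops + flits - 1

hops, flits = 4, 8
print(store_and_forward_latency(hops, flits))   # 32
print(cut_through_latency(hops, flits))         # 11
```

Under this model virtual cut-through and wormhole have the same unblocked latency. They differ in buffering: cut-through must be able to hold a whole packet when the output is blocked, while a blocked wormhole packet stays spread over the routers along its path.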
Quick terminology quiz
What is in common with the following terms?
  Koala bear
  Whale fish (valaskala in Finnish)
  Wormhole routing
Such things do not exist although many people talk about them
  Koala is marsupial
  Whale is mammal
  Wormhole is switching policy
Example topologies

(Shared multimaster) bus
Bus = set of signals connected to all devices
Shared resource
  One connection between devices reserves the whole interconnection
  Bandwidth shared among devices
  Bandwidth may be scaled by adding links
Most common SoC network
  Low implementation costs, simple
  Long signal lines problematic
[Figure: single bus, N = 16, L = 1, <k> = -; multiple bus, N = 16, L = 1, <k> = -]
Bus arbitration / addr decoding
Arbitration decides which master can use the shared resource (e.g. bus or memory)
  Single-master system does not need arbitration
  E.g. priority, round-robin, TDMA
  Two-level: e.g. TDMA + priority
  May be pipelined with previous transfer
Decoding is needed to determine the target
  Central / distributed schemes
  Address and data are broadcast to every node
  Decoder selects which node reads the data or responds
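Two of the arbitration policies named above, sketched in Python. This is a behavioural illustration only; the names `priority_arbiter` and `RoundRobinArbiter` are made up here, and `requests[i]` is assumed to be True when master i requests the bus.

```python
def priority_arbiter(requests):
    """Fixed priority: grant the requesting master with the lowest index."""
    for master, req in enumerate(requests):
        if req:
            return master
    return None  # nobody requests the bus

class RoundRobinArbiter:
    """Grant the next requesting master after the one granted last."""
    def __init__(self, n_masters):
        self.n = n_masters
        self.last = n_masters - 1   # so that master 0 is checked first

    def grant(self, requests):
        for offset in range(1, self.n + 1):
            master = (self.last + offset) % self.n
            if requests[master]:
                self.last = master
                return master
        return None

reqs = [True, False, True]
rr = RoundRobinArbiter(3)
print([rr.grant(reqs) for _ in range(4)])   # [0, 2, 0, 2]
print(priority_arbiter(reqs))               # 0
```

With fixed priority, master 2 would starve as long as master 0 keeps requesting; round-robin trades that for fairness.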
Topologies: mesh and torus
2-D mesh and torus are very popular
Simple layout for uniformly sized nodes
  Wrap-around wires in torus need special attention
[Figure: 2-D mesh, N = 16, L = 4.7, <k> = 4; 2-D torus, N = 16, L = 4.1, <k> = 5]
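The L and <k> values quoted for the 16-node mesh and torus can be reproduced with a breadth-first search. The sketch below makes two assumptions that are not stated on the slide but that match its numbers: a path between two processing nodes also counts the node-to-router link at both ends (+2), and the degree of a router includes its local node port (+1). All function names are hypothetical.

```python
from collections import deque
from itertools import product

def grid_neighbors(r, c, side, wrap):
    """Neighbor routers of (r, c) in a side x side mesh (wrap=False) or torus (wrap=True)."""
    result = set()
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if wrap:
            result.add((nr % side, nc % side))
        elif 0 <= nr < side and 0 <= nc < side:
            result.add((nr, nc))
    result.discard((r, c))
    return result

def hop_counts(start, side, wrap):
    """Breadth-first search: router-to-router hop count from start to every router."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in grid_neighbors(node[0], node[1], side, wrap):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def static_metrics(side, wrap):
    """Return (average path length L, average degree <k>)."""
    routers = list(product(range(side), repeat=2))
    total, pairs = 0, 0
    for src in routers:
        for dst, d in hop_counts(src, side, wrap).items():
            if dst != src:
                total += d + 2   # assumption: +2 for the node-to-router links at both ends
                pairs += 1
    avg_degree = sum(len(grid_neighbors(r, c, side, wrap)) + 1   # assumption: +1 local port
                     for r, c in routers) / len(routers)
    return total / pairs, avg_degree

print(static_metrics(4, wrap=False))   # 2-D mesh, N = 16
print(static_metrics(4, wrap=True))    # 2-D torus, N = 16
```

Rounded to one decimal, the mesh gives L = 4.7 and <k> = 4, and the torus L = 4.1 and <k> = 5.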
Topologies: Tree
Trad. tree has bisection bandwidth = 1
  Bottleneck for uniform traffic
  Does not matter when the traffic is localized
Fat-tree has more (or wider) links near root
  Becoming more popular as NoC topology
Trees also constructed so that each node is processing node
[Figure: rooted, complete, binary tree, N = 16, L = 6.5, <k> = 2.9; fat tree with butterfly elements and fanout of 2 (binary fat tree), N = 16, L = 6.5, <k> = 3.5]
Topologies: static analysis
Some basic properties may be analyzed statically
Simulation with real applications preferred (i.e. dynamic analysis)

Network | Number of switches | Number of wires | Links
Single bus | 0 | 1 | Bi
Multiple bus | 0 | e | Bi
Hierarchical bus (chain) | e-1 | e | Bi
Crossbar | N^2/4 | N^2/2 | Bi
One-sided crossbar | N^2/2 | N^2-N/2 | Bi
Binary tree | N-1 | 2(N-1) | Bi
Fat tree (fanout 2) | Nlog2N | 2Nlog2N | Bi
Ring | N | 2N | Bi
3-D hypercube | N | N+(N/2)log2N | Bi
2-D mesh | N | 3N-2N^(1/2) | Bi
2-D torus | N | 3N | Bi
Point-to-point, fully connected | 0 | (N^2-N)/2 | Bi
Omega network (MIN) | (N/4)(log2N-1) | (N/2)log2N | Uni

Network | Parallel transactions | Longest path | Bisection bandwidth | Links
Single bus | 1 | 1 | 1 | Bi
Multiple bus | e (e ≤ N) | 1 | e | Bi
Hierarchical bus (chain) | e (e ≤ N) | e (e ≤ N) | 1 | Bi
Crossbar | N | N | N-1 | Bi
One-sided crossbar | N | 2N-1 | N/2 | Bi
Binary tree | N | 2log2N | 1 | Bi
Fat tree (fanout 2) | N | 2log2N | N | Bi
Ring | N | N/2+2 | 2 | Bi
3-D hypercube | N | log2N+2 | N/2 | Bi
2-D mesh | N | 2N^(1/2) | N^(1/2) | Bi
2-D torus | N | N^(1/2)+2 | 2N^(1/2) | Bi
Point-to-point, fully connected | N | 1 | (N/2)*(N/2) | Bi
Omega network (MIN) | N/2 | log2N | N | Uni
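A few rows of the static analysis evaluated for a concrete size, as a small sketch. The formulas are transcribed from the ring, hypercube, 2-D mesh and 2-D torus rows as read from the table; the function and key names are hypothetical, N is assumed to be a power of two (hypercube, ring) or a perfect square (mesh, torus), and `math.isqrt` needs Python 3.8 or newer.

```python
from math import isqrt, log2

def ring(n):
    return {"wires": 2 * n, "longest_path": n // 2 + 2, "bisection": 2}

def hypercube(n):
    dim = int(log2(n))              # number of dimensions, n = 2**dim
    return {"wires": n + (n // 2) * dim, "longest_path": dim + 2, "bisection": n // 2}

def mesh_2d(n):
    side = isqrt(n)                 # N^(1/2)
    return {"wires": 3 * n - 2 * side, "longest_path": 2 * side, "bisection": side}

def torus_2d(n):
    side = isqrt(n)
    return {"wires": 3 * n, "longest_path": side + 2, "bisection": 2 * side}

for name, formula in (("ring", ring), ("hypercube", hypercube),
                      ("2-D mesh", mesh_2d), ("2-D torus", torus_2d)):
    print(name, formula(16))
```

For N = 16 the torus needs 48 wires against 40 for the mesh, in exchange for a shorter longest path (6 against 8) and twice the bisection bandwidth (8 against 4).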
W. Wolf. et al. , "Multiprocessor System-on-Chip (MPSoC) Technology," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on , vol.27, no.10, pp.1701-1713, Oct. 2008
S. Dutta et al., "Viper: A multiprocessor SOC for advanced set-top box and digital TV systems," Design & Test of Computers, IEEE , vol.18, no.5, pp.21-31, Sep-Oct 2001
ST Nomadik (2003)
Multiple buses
Erno Salminen - Nov. 2010
Cell BE by IBM/Sony/Toshiba (2005)
Four rings
Khunjush, F.; Dimopoulos, N.J., "Extended characterization of DMA transfers on the Cell BE processor," Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pp.1-8, 14-18 April 2008
See also: D. Shippy, M. Phipps, The Race for a New Game Machine: Creating the Chips Inside the XBox 360 and the Playstation 3, Citadel, 2009
Tile64 by Tilera (2008)
2-D mesh with 4 DDR controllers for external
S. Bell et al., TILE64 Processor: A 64-Core SoC with Mesh Interconnect, ISSCC 2008
Faust (2009)
Modified 2-D mesh, asynchronous NoC
[E. Beigne et al., An Asynchronous Power Aware and Adaptive NoC Based Circuit, JSSC, 2009]
Conclusion
SoC has many components, different requirements
Wire delays and power consumption becoming very problematic
  Big difference between local and global (or off-chip) communication
Fully synchronous approach becoming unfeasible
Network-on-chip = multi-hop on-chip network
  Often packet-switched
  Buffering, routing, and topology are important design decisions
NoC Survey
Note: All slides in this set are lecture material!
Survey of Network-on-chip proposals [2008]
This paper gives an overview of state-of-the-art regarding the network-on-chip (NoC) proposals.
NoC paradigm replaces dedicated, design-specific wires with scalable, general purpose, multi-hop network. Numerous examples from literature are selected to highlight the contemporary approaches and reported implementation results. The major trends of NoC research and aspects that require more investigations are pointed out.
A packet-switched 2-D mesh is the most used and studied topology so far. It is also a sort of an average NoC currently. Good results and interesting proposals are plenty.
However, large differences in implementation results, vague documentation, and lack of comparison were also observed.
--- clip clip (39 lines omitted in the slide show)---
NoC implementations
--- clip clip (14 lines omitted in the slide show)---
Average NoC 2008
[Figure from Salminen et al. Survey of NoC proposals, OCP-IP, 2008]

Average NoC 2008 (2)
[Figure from Salminen et al. Survey of NoC proposals, OCP-IP, 2008]
Case Study
Managing Interconnection Complexity in Heterogeneous IP Block Interconnection (HIBI)
Overview of Managing On-Chip Communications
[Figure: interconnection alternatives ordered from simple to complex: dedicated point-to-point links, single bus, hierarchical bus structures, regular multi-hop topologies, customized multi-hop structures. Compared along the axes NW elements, latency & BW, scalability & flexibility, # of IP blocks, network reuse. Labels at the point-to-point end: Simple, Always guaranteed, Limited, Limited, IP block specific. Labels at the customized multi-hop end: Very complex, Design once, General purpose, Best-effort/Predictable, Arbitrary]
Lessons Learned
Many communication networks have been studied in TUT
  On-chip communication research started 1997
A regular topology can well be fitted to algorithm specific comp/comm balanced implementation
In general case there is no optimal topology
  Communication-centric design was successfully conducted for performance
Important to exploit features of application(s) to optimize interconnection
  Established parallel processing doctrines can be applied to SoC
SoC challenge is heterogeneity in computation
Interconnection Implementation View
Make lowest level data transfer mechanisms simple and efficient
  Minimum number of signals
  "Every clock edge carries useful data in transaction"
Perform all high-level operations on basic mechanisms
  Layered protocol model, OCP compatible
  Message passing
Use identical HW modules to compose overall interconnection
  Translate IP specific communication operations to network
  Support all (practical) topologies
  No limits to number of IP blocks (whole design)
  Support (re-)configurability
  Fit to all communication needs, from memories to peripherals
"Gives body to build interconnect"
System Design View
Make interconnection aware of application functionality
A) System design time
  Communication profiled from application processes
  Clustering: localization of communication
  Allocation of communication resources (segments, buffers)
  Optimization of non-reconfigurable parameters
  Initial QoS and other transfer parameters
B) Run time
  Utilize knowledge of predictable communication events if available
  Guaranteed QoS in transfers
  Track communication, change QoS & other parameters if required
  Totally change mode of operation if required
HIBI Design Flow is 80% of the HIBI interconnect scheme
"Gives brains to the communication"
HIBI wrapper is the only building block used everywhere in interconnection
  Between network and IP-blocks
  Between network segments
Wrapper is parametrizable, modular, and configurable
Asynchronous FIFO buffering
[Figure: HIBI network with one HIBI wrapper per IP (P1...PN, Mem1...MemN, Acc1...AccN); FIFO / OCP interface between wrapper and IP]
HIBI Network
HIBI network consists of bus segments and bridges
  Transfers in segment: synchronous, circuit switched
  Transfers across bridges: asynchronous, packet switched
  Scales from serial point-to-point link to an arbitrary topology
Identical signals between wrappers in network side
  No dedicated point-to-point signals
  All signals shared within network segment
  Wrapper layout is independent of the number of agents
Totally distributed arbitration
  No central arbiter
  Each wrapper is aware of communication details
HIBI Network Example
[Figure: IP blocks attached through HIBI wrappers to several bus segments; segments joined by a bridge built from HIBI wrappers; clock domains marked]
Bus latency
Total latency consists of several phases. From: K. Kuusilinna, PhD Thesis, TUT, 2001.

Latency phase | Action | Available methods
Arbitration latency | Request bus ownership; wait for higher priority transactions to complete / arbitration | Central arbiter, daisy chain, wired-OR, connectionless arbitration
Initial latency | Begin transaction; wait for master ready / wait for target ready; transfer first data | Address/data multiplexing, handshaking
Subsequent data latency | Wait for master ready / wait for target ready; transfer data. Repeated until all data has been transferred or a limit for data transfers per burst is reached. Optimizing this phase has biggest impact in long transfers |
Turn-around latency | Drive or wait for the bus to settle to idle state |

Figure: Bus latency
HIBI Quality of Service
TDMA (time division multiple access) with freely run-time adjustable frame length and slot durations and allocations
Re-synchronization to application phase
Also traditional priority/round-robin
[Figure: two consecutive time frames on a time axis t; allocated time slots for agents A1, A2, A3, with competition outside the slots resolved by priority or round-robin]
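A behavioural sketch of TDMA slots combined with competition, to illustrate the idea only. This is not HIBI's actual arbitration logic: the function names, the slot representation, the rule that an unused slot falls back to competition, and the choice of fixed priority (lower agent number wins) are all assumptions made for this example.

```python
def tdma_owner(cycle, frame_length, slots):
    """slots: list of (start, duration, agent) inside one frame. Returns the slot owner or None."""
    t = cycle % frame_length
    for start, duration, agent in slots:
        if start <= t < start + duration:
            return agent
    return None

def grant(cycle, frame_length, slots, requests):
    """requests: set of agent numbers that want the bus in this cycle."""
    owner = tdma_owner(cycle, frame_length, slots)
    if owner is not None and owner in requests:
        return owner                   # guaranteed access inside the allocated slot
    # Competition (assumption: fixed priority, lower agent number wins)
    return min(requests) if requests else None

frame_length = 8
slots = [(0, 2, 1), (4, 2, 2)]         # agent 1 owns cycles 0-1, agent 2 owns cycles 4-5
print(grant(4, frame_length, slots, {2, 3}))   # 2: slot owner
print(grant(4, frame_length, slots, {1, 3}))   # 1: owner idle, competition
print(grant(6, frame_length, slots, {3}))      # 3: no slot allocated, competition
```

Changing `frame_length` or the `slots` list between frames corresponds to the run-time adjustable frame length and slot allocations mentioned above.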
HIBI Basic Transfer
Pipelined with arbitration
Split-transactions
Burst transfers
No wait cycles allowed
Non pre-emptive transfers
QoS is guaranteed with TDMA or with a combination of Send Max + Priority/Round-Robin
[Figure: pipelined address and data lines over time t, carrying write (w addr, w data), request (rq addr, rq data) and return (ret addr, ret data) transfers]
Wrapper Configuration Memory
Stores all information for distributed arbitration
  Permanent: ROM, 1 page
  Semi run-time configurable: ROM with several pages
  Full run-time configurable: RAM, with pages
[Figure: configuration memory block diagram: new conf values, demux, conf pages, curr page, mux, curr conf values, time slot values, cycle counter, time slot logic, slot signals]
HIBI Wrapper Area in ASIC
[Figure: bar chart of wrapper area [gates], axis from 0 to 35 000, for data widths 8 b, 16 b, 32 b and 64 b, with RAM and ROM configuration memory. Three configurations: lo prior FIFOs = 3 / 3, hi prior FIFOs = 0 / 0, 1-page mem; lo prior FIFOs = 5 / 5, hi prior FIFOs = 5 / 5, 1-page mem; lo prior FIFOs = 10 / 5, hi prior FIFOs = 10 / 5, 2-page mem]
Runtime comparison
Salminen et al., SAMOS 2005.
Other notes on NoC
Network topology categories
1. Static networks utilize only point-to-point or shared connection lines
2. Dynamic networks use switches (or routers) for communication
  a) Direct = each processing node connected to switch
  b) Indirect = some switches are not connected directly to any processing node
Problems with Current NoC Discussion
What is "NoC" – no common definition
  Something new, good by definition (needs no proof),...
General purpose – but to what extent
  Arbitrary connectivity between any node?
  Uniform overall transfer distribution?
Discussion about "optimal topology"
  Multiprocessor architectures for scientific computations?
  Can massive fine-grain granularity parallelism be utilized in
Retransfer buffers
If packets are dropped or corrupted in delivery, (usually) they have to be retransferred
  Variable latencies are problematic: is the packet dropped or just having longer latency?
  If the time-out latency is exceeded, the packet is assumed to be missing
Source must store packets until it receives acknowledge of successful transfer
  Sending acknowledge after each packet results in small buffer but (at least) double latency
  Sending ack after each N packets requires bigger buffers but gives better performance
[Figure: source (src) with buffer and destination (dst). a) ack for each packet, ack (ok): Latency per pkt = send_latency + ack_latency. b) ack for each N packets, ack (ok,ok,fail,ok): Latency per pkt =]
Sometimes processing units may accept out-of-order delivery or buffers can be integrated with internal memory of the processing unit
If ack is sent after 4 packets, buffer for 4 packets is needed
Furthermore, separate buffers are needed for each source as data may be received in interleaved manner
  E.g. (pkt_<n>_<src>) received: pkt_1_1, pkt_4_1, pkt_4_2, pkt_3_3...
  E.g. if ack sent after N packets and S sources
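The source-side bookkeeping can be sketched as a small class. `RetransferBuffer` and its methods are hypothetical names for this illustration; the sketch only models which packets must stay buffered, not timing or time-outs.

```python
class RetransferBuffer:
    """Source-side buffer that keeps packets until they are acknowledged."""
    def __init__(self, packets_per_ack):
        self.capacity = packets_per_ack   # an ack covering N packets needs room for N packets
        self.pending = []                 # packets sent but not yet acknowledged

    def can_send(self):
        return len(self.pending) < self.capacity

    def send(self, packet):
        if not self.can_send():
            raise RuntimeError("buffer full: wait for ack")
        self.pending.append(packet)
        return packet

    def on_ack(self, statuses):
        """statuses[i] is True if pending[i] arrived intact. Returns the packets to retransfer."""
        failed = [p for p, ok in zip(self.pending, statuses) if not ok]
        self.pending = list(failed)       # failed packets stay buffered until acknowledged
        return failed

buf = RetransferBuffer(packets_per_ack=4)
for pkt in ("a", "b", "c", "d"):
    buf.send(pkt)
print(buf.can_send())                          # False: must wait for the ack
print(buf.on_ack([True, True, False, True]))   # ['c'] has to be retransferred
```

With `packets_per_ack=1` the buffer holds a single packet, but the source stalls for a full send plus ack round trip on every packet, which is the trade-off described above.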
[Figure: message sequence of a transfer using DMA, repeated for consecutive transfers: reserve buffer, notification of the reserved buffer, configure rx DMA, ACK, actual data, (optional ACK), (copy data), consume data. The observed tx duration is marked for each transfer]
Intertwined/Reordering
Transfers from different sources may be arbitrarily intertwined
In addition, packets may arrive out-of-order
[Figure: source0 sends a, b, c and source1 sends d, e through the network to destination0, where they arrive intertwined (d, a, b, e, c). i) fixed-length packets: "FIFO"-like buffers; ii) variable-length packets: linked list buffers. The items are either single words, bursts, or packets, depending on the network]
Irregular IP size
IPs tend to have irregular size and shape
  Largest IP per row/column decides its height/width
  Some space wasted
  Links will have varying length
Reordering the IPs reduces area
  Ensure that frequently communicating IPs are still close to each other
<19.5% reduction in area>
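The effect of placement on area can be estimated with a few lines of Python. This is a simplified sketch with made-up IP sizes: `mesh_area` is a hypothetical helper, routers and links are ignored, and every row is as tall as its tallest IP and every column as wide as its widest IP.

```python
def mesh_area(ip_sizes):
    """ip_sizes: 2-D list of (width, height) per tile. Returns (total_area, wasted_area)."""
    row_heights = [max(h for _, h in row) for row in ip_sizes]
    col_widths = [max(row[c][0] for row in ip_sizes) for c in range(len(ip_sizes[0]))]
    total = sum(row_heights) * sum(col_widths)
    used = sum(w * h for row in ip_sizes for w, h in row)
    return total, total - used

# Two large and two small IPs (sizes invented for the example)
diagonal = [[(4, 4), (1, 1)],
            [(1, 1), (4, 4)]]
reordered = [[(4, 4), (4, 4)],
             [(1, 1), (1, 1)]]
print(mesh_area(diagonal))    # (64, 30)
print(mesh_area(reordered))   # (40, 6)
```

Placing the two large IPs in the same row shrinks the second row, so less space is wasted. A real floorplanner would also weigh the communication distance between the IPs, as noted above.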
Customized mesh
Connect more than one IP to one router
  Somewhat smaller bandwidth available per IP
  Usually enough, though