Design of a Network-On-Chip platform for MPSoCs using TLM 2.0 standard and FPGA implementation
Fernando Adolfo Escobar Juzga
Electric and Electronic Engineering Department
APPROVED:
Antonio García Rozo, Ph.D.
Mauricio Guerrero, MSc.
Alain Gauthier, Ph.D.
Dean of Faculty
to my
MOTHER, FATHER and SISTERS
with love
Design of a Network-On-Chip platform for MPSoCs using TLM 2.0 standard and FPGA implementation
by
Fernando Adolfo Escobar Juzga
Thesis
Presented to the Academic Faculty of the Graduate School of
Universidad de los Andes, Bogotá
in Partial Fulfilment
of the Requirements
for the Degree of
Master of Electronic Engineering
Electric and Electronic Engineering Department
Universidad de Los Andes
January 2011
Acknowledgements
I wish to thank my advisers Antonio García and Mauricio Guerrero for their guidance and
support throughout the project; it was their experience and knowledge that helped me
choose and love this research area years ago. To my parents and sisters, who have
unconditionally supported me at all times and without whom I wouldn't have gotten here.
Additionally, I want to thank all my friends, who continuously inspire me and show me
how far one can go with hard work, dedication and passion.
This thesis wouldn’t have been possible without the support of the OSCI TLM working
group and all its members. Finally, I want to thank the CMUA group for providing me with
the necessary resources and tools that were required.
Abstract
Complex systems that include a great variety of modules inside the same die require higher-
level design techniques that yield accurate models suitable for testing hardware as
well as software at early stages; Multiprocessor Systems-on-Chip (MPSoCs) are scaling
to levels where it is possible to embed tens and up to hundreds of cores on the same chip.
Such architectures cannot be integrated with traditional bus structures, as these are not
scalable; a new paradigm called Network on Chip (NoC) has gained strength as a solution
to this issue.
SystemC, an IEEE standard for electronic system-level (ESL) design, is used here to build a
NoC functional model; to simplify hardware details and speed up simulations, the new
Transaction Level Modelling standard (TLM 2.0) is also adopted. Relying on different
design constraints, variables such as router and network interface architectures, routing
algorithms, message and flit size, etc., are evaluated.
At a final stage, a VHDL synthesis is done and compared with other implementations.
Results prove this design flow to be adequate and helpful for this kind of system.

Because bus systems are the medium through which processors and most peripherals
transfer information, NICs use them to exchange data with end modules; given the
great variety of bus specifications, this work considers only the most common ones, that is,
AMBA from ARM [11] and OCP [10] from the OCP-IP group; the latter was selected for
its simplicity and strong support from the ESL community.
On the other hand, and as previously stated, the top layers of the OSI model can be
summarized into a greater group that refers to the software models necessary to access the
network; there are mainly two approaches in the field of multi-processor programming:
shared or distributed memory. As its name indicates, shared memory implies that processing
units can access the same physical or logical memory spaces at any time; a well-known API
implementation of this is called OpenMP [12], led by a non-profit corporation, composed
of several companies and researchers, named the OpenMP ARB. The distributed model, on
Figure 1: OSI Protocol Stack. Networks are usually defined according to the layers shown.
the contrary, assigns separate physical memory sectors to each unit; through message
passing, data can be shared between all modules at any moment. One of the first API
implementations of this protocol is called MPI [13] and is still developed by the MPI Forum.
Message passing has had wide application in computer networks and appears better suited
for them, as computers don't share the same memory. For the purposes of this work, some
of the MPI specifications were adopted for the NIC design.
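The pairing that message passing imposes (every write matched by a read) can be sketched in plain C++; the mailbox below is a stand-in for the network, not the actual MPI API, and all names are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <queue>
#include <vector>

// Hypothetical per-destination mailboxes standing in for the network;
// real MPI would route these through MPI_Send/MPI_Recv instead.
static std::map<int, std::queue<std::vector<uint32_t>>> mailbox;

// A write operation: the sender posts a message for node 'dest'.
void mp_send(int dest, const std::vector<uint32_t>& data) {
    mailbox[dest].push(data);
}

// A read operation: the receiver consumes the oldest pending message.
// Every mp_send must eventually be matched by an mp_recv at the destination.
bool mp_recv(int self, std::vector<uint32_t>& out) {
    auto it = mailbox.find(self);
    if (it == mailbox.end() || it->second.empty()) return false; // nothing pending
    out = it->second.front();
    it->second.pop();
    return true;
}
```

The same send/receive pairing reappears later as the NIC's write/read transaction matching.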
The following sections will provide a better insight into all the topics already mentioned;
a functional TLM 2.0 model of a Network On Chip is proposed and validated through
simulations; traffic patterns and other NoC parameters are analysed with the design, and
finally a VHDL synthesis is presented to evaluate area consumption.
Chapter 1
Networks On Chip Design
To properly tackle NoC design, several aspects need to be defined in terms of the afore-
mentioned OSI model. Through the evaluation of each layer, all aspects needed for our
high level model will be defined.
1.1 Parallel Computing Memory Model
Parallel computing has faced big challenges since its creation: task dependencies, race
conditions, mutual exclusion and parallel slowdown are concerns that can't be omitted.
Whether shared or distributed memory is used, these are software aspects that have to be
solved at a high level, so they represent an additional task for the programmer.
Because none of the previous issues makes any difference to the memory model to be
used, it is necessary to consider elements that clearly affect this decision: portability and
scalability. A shared memory configuration is shown in Figure 1.1; if a program is to be
run on such platforms, either a software compiler, aware of all system resources, has to be
provided along with the hardware, or the programmer has to know the low-level details
to write an application for it. Apart from that, if the number of cores is modified, the
cost at the software level should remain low, and again, this will depend either on the
compiler or the programmer.
The distributed memory model is shown in Figure 1.2; as indicated, processors interchange
information through messages. In contrast to the previous approach, no additional compiler
or deep knowledge of the hardware is needed; only network accessing methods are required.
In case the number of cores changes, a correctly parametrized software description would solve
the problem.
Loads more pros and cons of each configuration could be mentioned, but that goes
beyond the scope of this work; it suffices to state that the distributed memory model
better suits the NoC's behaviour and is the one implemented here.
Figure 1.1: Shared Memory Model. All processing elements share a big memory area; each core may have as many caches as desired, yet the main memory is common to all of them.
Figure 1.2: Distributed Memory Model. Interconnection between cores is done through a network; if data has to be shared, it is sent via message passing.
1.2 Networks On Chip
Designing Networks On Chip is a process that requires consideration of several variables
in order to separate communication from computation. The OSI model shown in Figure 1
can be taken as a reference for these systems; to understand this association, each level of
the stack can be defined as follows:
1. Physical Layer: Defines voltage levels, length and width of wires, timing details
and topology among others.
2. Data Link Layer: It is in charge of safe data delivery; it specifies flow control
mechanisms between hardware modules.
3. Network Layer: Controls message delivery from one node to another. It’s respon-
sible for storing data and implementing routing algorithms.
4. Transport Layer: Is in charge of establishing connections between end-nodes and
providing the information for them. This module (un)packages data and sends (receives)
it to (from) the routers.
5. Session, Presentation and Application Layers: Can be condensed into a single
Application group for NoCs and refers to higher level aspects of the communication
such as software.
By following the above-mentioned scheme, it is possible to define a functional and
synthesizable NoC model considering all its aspects. Although a high-level SystemC model
of the network is constructed, hardware details are considered for future implementation.
The following sections show the specifications on each layer for the model developed.
1.2.1 Physical Layer: Topology
Some aspects of the physical layer depend on the technology to be used for fabrication and
can't be specified from the beginning; operating frequency and voltage levels are examples
of such limitations; because synthesis is not the main target of this work, the previous items
were discarded. The bus width was selected to match most standard processors nowadays,
that is, 32-bit ones.
Another important issue and perhaps the most relevant on this layer, is the topology;
contrary to computer networks, NoCs have a fixed structure that cannot be modified for
the rest of the chip’s lifetime. On this subject, several configurations have been proposed:
Figure 1.3 illustrates the most common topologies for networks; SPIN [14], Mesh, Torus,
Folded Torus, Octagon and Trees are a few examples. According to [1], Mesh and Torus
topologies constitute 62% of the overall designs; trees represent 12%, and the rest account
for smaller percentages. There are as well specific ad-hoc implementations that can be seen in Figure
1.4; the addition of links and combinations of basic structures constitute the differences; despite
reducing worst-case paths or improving latency, the cost in area consumption and in
creating new routing algorithms might be too high.
The guidelines for picking a topology were its scalability and the availability of routing
algorithms. As mentioned before, Mesh and Torus structures are used by the majority
of researchers, mainly because of their scalability: the cost of adding one or two
cores to a grid is quite low, as it doesn't critically change the structure; additionally,
routing algorithms need not be modified. The differences between them are the turnaround links
that can significantly reduce some of the worst case conditions. In reference [18], both
structures show similar behaviour in power consumption, throughput and saturation, but
Torus topologies perform better with adaptive routing algorithms which, as will be seen in
the next section, are needed. After considering all the previous restrictions and the results of
the cited references, the Torus topology was selected for this work.
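The benefit of the turnaround links is easy to quantify: on an N-node ring, the wrap-around path roughly halves the worst-case hop count with respect to a mesh row. A small sketch (coordinates and grid size are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Hop distance along one dimension of an n-node ring (a Torus row or
// column): the turnaround link reduces the worst case from n-1 (Mesh)
// to n/2.
int ring_dist(int a, int b, int n) {
    int d = std::abs(a - b);
    return std::min(d, n - d);  // direct path vs. wrap-around path
}

// Manhattan-style hop distance on an n x n Torus.
int torus_dist(int x1, int y1, int x2, int y2, int n) {
    return ring_dist(x1, x2, n) + ring_dist(y1, y2, n);
}
```

For a 4x4 grid, opposite corners are 6 hops apart in a Mesh but only 2 in a Torus.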
1.2.2 Data Link Layer: Flow Control
Routers are complex modules that have simple handshaking protocols to transfer data.
Whether interacting with another router or a network interface card (NIC), the mechanism
is the same. Again, some differences compared to computer networks exist: modules
inside the same chip transmit data in a much more reliable way than physically separated
ones, so it suffices to control when to send and receive information, assuming that it is
properly transmitted. Some router implementations such as Æthereal [19] or MANGO [20]
Figure 1.3: Common Network Topologies. Both for computer networks and NoCs, the most common structures are shown in the graphic.
Figure 1.4: Ad-Hoc Network Topologies. Academic proposals for NoC topologies: Mesh Connected Crossbars [16] (left), Spidergon [17] (center), and Diagonal Mesh [15] (right).
offer Quality of Service (QoS) guarantees, but that requires highly specialized work that
goes beyond the scope of this work.
Flow control techniques are shown in Table 1.1; most implementations use the Credit
Based approach; STALL/GO has never been implemented, and the rest of the literature
uses handshaking and ACK/NACK-like solutions. The handshaking approach is adopted
for our design.
It is important to note that flow control on the NoC’s SystemC model is abstracted
with the TLM 2.0 standard and may correspond to any of the techniques available when
ready for synthesis.
Table 1.1: Flow Control Techniques for NoCs [2].

Credit Based: Every router keeps an internal counter of the spaces available for data
storage (credits); once a new space is free, a credit is sent back to inform of its
availability.

Handshaking Signal Based: A VALID signal is sent whenever a flit is transmitted. The
receiver acknowledges by asserting a VALID signal after consuming it.

ACK/NACK: A copy of a data packet is kept in a buffer until an ACK is received; if
asserted, the flit is deleted. If a NACK signal is asserted, the flit is scheduled for
retransmission.

STALL/GO: Two wires are used for flow control; when a buffer space is available, a GO
signal is activated. When no space is available, a STALL signal is asserted.
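Of the techniques in Table 1.1, the credit-based one is the simplest to sketch in plain C++: the sender mirrors the receiver's free buffer space with a counter (buffer depth and names are illustrative):

```cpp
#include <cassert>

// Sender-side credit counter mirroring the receiver's free buffer slots.
struct CreditSender {
    int credits;  // known free slots at the receiver
    explicit CreditSender(int depth) : credits(depth) {}

    // A flit may be sent only while credits remain.
    bool try_send() {
        if (credits == 0) return false;
        --credits;  // one receiver slot now holds our flit
        return true;
    }

    // The receiver returns a credit each time it drains a slot.
    void credit_back() { ++credits; }
};
```

The handshaking scheme adopted here replaces the counter with a per-flit VALID exchange, but the back-pressure effect is the same.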
1.2.3 Network Layer: Switching Policy and Routing Algorithm
The switching policy determines the way information is transmitted; it can be either packet
or circuit switched. Circuit switching is the least implemented and states that a path from
source to destination must be reserved before transmitting data and shall only be released
after the message has been fully delivered. This policy is time expensive and may increase
network congestion, because messages can be blocked for a long time if data is big; such a
situation may easily lead to deadlock issues.
Packet switching is widely used in both computer networks and on-chip ones; it can
be implemented in one of the following three versions:
1. Wormhole: Packets are split into smaller ones called flits (Flow Control Units).
Head flits contain address information, which each router uses for forwarding towards the
destination; body flits follow in a worm-like way. Only a 1-flit space is necessary
on each router input for implementation.
2. Store and Forward: Routers accept and send data when there is enough capacity
for fully storing the packet. A minimum space equal to the packet's maximum length
is required per router.
3. Virtual Cut Through: Data is transmitted per flit but is only accepted when
there's enough buffer space for saving the whole packet; all routers must be able to
store at least the maximum packet length.
Figure 1.5 illustrates how information is transmitted with packet switching tech-
niques; around 80% of proposed NoCs implement the wormhole one because of its low-area
requirements; wormhole switching was also selected for this work given those advantages.
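The essence of wormhole switching (the head flit reserves an output port, body flits reuse it and the tail flit releases it) can be sketched as follows; port bookkeeping is simplified and all names are illustrative:

```cpp
#include <cassert>

enum FlitType { HEAD, BODY, TAIL };

// One input channel of a wormhole router: the head flit locks an output
// port, body flits follow it, and the tail flit tears the path down.
// Only 1-flit buffering per input is needed, hence wormhole's low area.
struct WormholeChannel {
    int locked_port = -1;  // -1: no worm currently in progress

    // Returns the output port this flit is forwarded to; 'routed_port'
    // is the routing decision, consulted only for head flits.
    int forward(FlitType t, int routed_port) {
        if (t == HEAD) locked_port = routed_port;  // reserve the path
        int out = locked_port;
        if (t == TAIL) locked_port = -1;           // release the path
        return out;
    }
};
```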
Another item addressed by the Network Layer that highly affects the platform’s perfor-
mance is the routing algorithm; because of the Torus's resemblance to Mesh arrangements,
Figure 1.5: Packet Switching on NoCs: Wormhole (left), Store and Forward (center) and Virtual Cut Through (right) [19]. Only the wormhole technique significantly reduces area consumption.
most algorithms that work for the latter may as well operate on Torus networks with minor
modifications.
A good guideline for selecting an appropriate algorithm, irrespective of the structure,
is the scheme shown in Figure 1.6; several router implementation details can be established
from that graph: router complexity increases with the number of destinations it can deliver
information to. Due to area restrictions and the possibility of solving it at the software
level, multicast routing is discarded for the current work.
Routing decisions also determine the chip's design: centralized routing requires a con-
trolling entity, aware of all nodes and traffic throughout the network, to decide how
the information should traverse it; source routing might increase the packet's size for long
paths; and finally, multiphase routing also implies some of the previous problems. Distributed
routing is by far the best suited for NoCs and facilitates the adoption of the algorithms
proposed.
As for implementation, both lookup tables and FSMs are feasible to adopt; the area cost
of both options is similar and doesn't affect the design drastically; one variable that could
determine which to choose is whether the algorithm is deterministic (always the same path
between two nodes) or adaptive (relies on network congestion). Thanks to the fact that a
high-level model of the network will be created, tests are to be carried out with deterministic
and adaptive algorithms; adaptive ones can be backtracking (fault tolerant), mis-routing
(can route away from the destination if necessary) and partial (don't consider all possible
routing paths).
Figure 1.6: Guidelines for selecting a Routing Algorithm [2].
For grid-like structures the most common deterministic algorithm is the XY one, where
information travels in the X direction until it reaches the Y coordinate of the destination;
it then travels in the Y direction. Adaptive routing is more complex, as it attempts to send
data through low-congestion paths that aren't always minimal; because of that, two
conditions that usually restrict an algorithm's adaptability are deadlock, where several
messages block each other's path, preventing themselves from ever advancing, and livelock,
where data keeps travelling throughout the chip without ever reaching the target.
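The XY algorithm reduces to two comparisons per hop; a minimal plain-C++ sketch for a grid (the port naming is illustrative):

```cpp
#include <cassert>

enum Port { EAST, WEST, NORTH, SOUTH, LOCAL };

// Deterministic XY routing: correct the X coordinate first, then Y.
// The same source/destination pair always yields the same path.
Port route_xy(int x, int y, int dest_x, int dest_y) {
    if (dest_x > x) return EAST;
    if (dest_x < x) return WEST;
    if (dest_y > y) return NORTH;
    if (dest_y < y) return SOUTH;
    return LOCAL;  // packet has arrived at its destination
}
```

Each router evaluates this independently with only local state, which is what makes distributed routing cheap to implement.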
A few semi-adaptive, deadlock- and livelock-free algorithms widely adopted are known
as turn model solutions [21], [22]; from all possible 90° turns, 2 are prohibited in order
to avoid deadlock. Figure 1.7 shows three algorithms derived from this theory. To better
understand each one, a brief explanation, taken from [23], is presented:
• West-First: Packets should start going to the west if necessary; then they are adaptively
routed south, east and north. The prohibited turns are the two to the west. Figure 1.8 shows
some path examples with this algorithm.
• North-Last: When going north, packets can't turn anywhere else; the only option for
packets to go northwards is when that is the last direction to take. Examples are shown
in Figure 1.9.
• Negative-First: The prohibited turns are the two from a positive direction to a negative
one; if a packet has to go in a negative direction, it must start in that direction. Figure
1.10 exemplifies this behaviour.
Any of the aforementioned algorithms can be used with the SystemC model of the
network, as describing them doesn't require much development time; studies shown in [18]
demonstrate that no significant difference among them exists.
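West-First adaptivity can be captured by the set of output directions a router may offer a packet at each hop; the following plain-C++ sketch illustrates the rule (direction encoding and function name are illustrative):

```cpp
#include <cassert>
#include <vector>

enum Dir { EAST, WEST, NORTH, SOUTH };

// West-First turn model: turns into the west direction are prohibited,
// so if the destination lies to the west the packet must go west first;
// otherwise south, east and north may be chosen adaptively among the
// productive (minimal) moves.
std::vector<Dir> west_first_options(int x, int y, int dest_x, int dest_y) {
    if (dest_x < x) return {WEST};     // mandatory westward phase first
    std::vector<Dir> opts;             // remaining moves are adaptive
    if (dest_x > x) opts.push_back(EAST);
    if (dest_y > y) opts.push_back(NORTH);
    if (dest_y < y) opts.push_back(SOUTH);
    return opts;
}
```

A congestion-aware router would pick among the returned options; with one option the behaviour degenerates to deterministic routing.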
Other algorithms have been proposed in [24], [25], [26], [27] and many more references
but will be left for future work.
1.2.4 Transport Layer: Network Interface Card
Up to this point, most design specifications affected the router's final structure; however,
this layer has more implications for the Network Interface Card; the problems to be solved at
this level are end-to-end flow control and the (un)packing of information.
In order to control packet injection on the network, our NIC design is based on the
message passing model previously mentioned; the way processors intercommunicate with
each other can be summarized in two activities: sending and receiving data; for each
message transmitted by a core (write operation), another one should be expecting it (read
operation).
Figure 1.7: Turn model for Adaptive Routing: Two turns are prohibited in each model to avoid deadlocks; minimal and non-minimal paths are possible in all options. [22]
Figure 1.8: West First Routing Examples [23].
Figure 1.9: North Last Routing Examples [23].
Figure 1.10: Negative First Routing Examples [23].
It is clear that processors won't be synchronized at all times, and at a certain point, two
or more cores could send messages to another that isn't ready yet; this would only increase
network congestion, require retransmission protocols and message-discard support, and might
also lead to a deadlock at a high level if not properly solved.
Considering the indicated problems, and especially the area constraints, the proposed Net-
work Interface Card implements end-to-end flow control with the following protocol: when
a core requests data, that is, performs a read operation, it sends a 1-flit-size packet to the
core that is intended to write to it; upon reception, the second NIC sends the information
only if the second core has a pending write transaction that matches the requester's address;
if the second core doesn't expect that specific request, it discards it, and the first one has
to retry after some time. On the other hand, when a NIC receives a write transaction, it
starts packing data, so that when a request arrives, most if not all of the information is ready
to be transmitted; if an application is properly written, the number of read requests should
match the number of write statements.
The cost of such an implementation is that for every read/write pair, at least one flit has
to be sent between two nodes in order to “establish” a connection; this is, nonetheless, far
more efficient than allowing all cores to send their packets at any time and obliging NICs to
constantly delete them when they don't correspond to expected transactions.
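The request/matching-write handshake just described can be sketched in plain C++ (a behavioural illustration of the writer-side NIC only; all names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Writer-side NIC sketch: data is packed as soon as the core issues a
// write; a 1-flit read request releases it only if a matching pending
// write exists, otherwise the request is discarded and the requester
// must retry later.
struct WriterNic {
    // Pending writes keyed by the requester they are destined for.
    std::map<uint32_t, std::vector<uint32_t>> pending;

    // Core-side write: pre-pack the data so it is ready on request.
    void core_write(uint32_t requester, std::vector<uint32_t> data) {
        pending[requester] = std::move(data);
    }

    // Network-side request: true and payload out on a match; an
    // unmatched request is simply dropped (the requester retries).
    bool on_request(uint32_t requester, std::vector<uint32_t>& out) {
        auto it = pending.find(requester);
        if (it == pending.end()) return false;
        out = std::move(it->second);
        pending.erase(it);
        return true;
    }
};
```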
Other important items regarding NIC end-to-end flow control behaviour are:
1. No read transaction requested by a processing element is accepted by the NIC while
another read is in progress; violation of the algorithm's sequence can lead to incorrect
results.
2. If data is being transmitted, the NIC can accept a read transaction from the processor
but won’t send the request until the previous transaction has terminated.
3. If a NIC is receiving packets from the network (read transaction), a write transaction
can be started from processor to NIC; data can be stored in the send buffer but won't
be sent until a request from the correct module is received.
4. A write transaction starts when a processing element sends data to the NIC for
transmission. For the processing element, it ends when all the information has been
transferred to the NIC; for the latter, when all flits have been injected into the
network.
5. A read transaction starts when a processing element requests data from the NIC; it
ends when all the information requested is successfully delivered from the NIC to the
processing element.
6. Irrespective of the type of transaction a NIC is performing, under no circumstances
can it skip the execution order when another read/write transaction is received.
7. The buffer size for storing incoming and outgoing transactions was defined to be 64
words. Separate buffers are implemented to improve performance.
As stated before, the protocol used for communication between the NIC and the process-
ing elements is the OCP-IP one; because it belongs to another section, it is not explained
here.
1.3 SystemC and Transaction Level Modelling TLM 2.0
Transaction Level Modelling (TLM) is a standard developed by the Open SystemC Initiative
(OSCI) which provides tools to rapidly create virtual descriptions of embedded platforms;
its main objective is to decouple computation from communication at a high abstraction
level so that complex systems can be modelled. According to the OSCI group [28],
simulations run from 10X up to 1000X faster than the corresponding HDL descriptions.
The TLM 2.0 standard allows two coding styles: loosely timed (L.T) and approximately
timed (A.T). When a quick and slightly detailed model of a design is required, the loosely
timed approach can be adopted; L.T transactions are modelled as a single function call
(read or write) that either returns after some delay, or returns immediately with an additional
delay argument so that the caller reacts after that time. A.T descriptions, on the contrary,
provide mechanisms for specifying as many timing details as desired, so they are more suited
for architectural analysis and hardware verification. The Network On Chip model developed
here only uses A.T descriptions, and therefore an emphasis is made on explaining them. Figure
1.11 shows a bigger context where it's worth applying TLM 2.0.1 descriptions.
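The L.T style can be illustrated without the SystemC library: a whole transaction is one function call carrying an annotated delay that the caller accounts for later. The class below is a plain-C++ stand-in for a b_transport-style target, not the actual OSCI API:

```cpp
#include <cassert>
#include <cstdint>

// Plain-C++ stand-in for a loosely timed target: the entire read is one
// call, and timing is a single annotated delay instead of signal events.
struct LtMemory {
    uint32_t storage[16] = {0};

    // One call models the whole transaction; 'delay_ns' is incremented
    // by the target and consumed by the caller when convenient, which
    // is what lets L.T simulations run far ahead of signal-level time.
    uint32_t read(unsigned addr, uint64_t& delay_ns) {
        delay_ns += 10;  // assumed access latency, illustrative only
        return storage[addr % 16];
    }
};
```

The A.T style used in this work replaces this single call with the four-phase exchange described below.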
The basic unit in all TLM transactions is the object interchanged, the generic payload;
it's a C++ class whose members include the minimum elements to execute a transaction:
command, address and data; apart from those, additional variables such as byte en-
ables, streaming width, bus width, response status, etc., are included to model more
complex protocols. Generic payload objects also support user-defined extensions that can
carry an unlimited number of attributes if required. Table 1.2 explains the basic attributes
aforementioned.
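The shape of those attributes can be shown with a simplified stand-in (this mirrors, but is not, tlm::tlm_generic_payload; the real class also carries byte enables, streaming width, DMI hints and extension arrays):

```cpp
#include <cassert>
#include <cstdint>

// Simplified mirror of the TLM 2.0 generic payload's core attributes.
enum Command  { READ, WRITE };
enum Response { INCOMPLETE, OK_RESPONSE, ERROR_RESPONSE };

struct Payload {
    Command  command  = READ;
    uint64_t address  = 0;
    uint8_t* data     = nullptr;   // caller-owned buffer, as in TLM
    unsigned length   = 0;         // bytes to transfer
    Response response = INCOMPLETE;
};
```

As in the real standard, the initiator creates and owns the object while interconnects and targets only reference and annotate it.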
All TLM 2.0.1 transactions are carried out between an Initiator and at least one Tar-
get; the channel through which they communicate is called a socket, and the only module
allowed to start transactions is the Initiator; Target modules can only reply to in-progress
transactions; Interconnect modules (such as routers or buses) can also be integrated with
Figure 1.11: Transaction Level Modelling Use Cases, Coding Styles and Mechanisms [28].
the previous ones. Figure 1.12 shows an example of one Initiator, one Interconnect com-
ponent and a Target.
AT transactions can be split into 4 phases, as shown in Figure 1.13; communication takes
place through functions named non-blocking forward transport (nb_transport_fw) and
non-blocking backward transport (nb_transport_bw); both functions have three parameters:
1. Trans: Pointer to the generic payload object.
2. Phase: Current transaction phase; it can be either of those shown in Figure 1.13.
3. Delay: Time that a module has to wait before responding to a transaction.
Initiators call nb_transport_fw, with BEGIN_REQ as the phase argument, to start
transmitting data; they use phase END_RESP to conclude a transaction. Targets call
Table 1.2: Generic Payload Attributes according to [29].

Command: Can be either Write or Read.
Address: Target address to execute the transaction.
Data Pointer: Pointer to the data array. Data should be read from or written to this
variable.
Data Length: Length of the data to be transferred, computed as BUSWIDTH/4.
Byte Enable Pointer: Used to enable access to specific data bytes.
Byte Enable Length: Specifies the number of valid elements of the byte enable pointer.
Streaming Width: States the number of words per burst transfer.
DMI Allowed: Marks whether the Direct Memory Interface can be used or not.
Response Status: Used for storing the status of the transaction.
nb_transport_bw with phase END_REQ to acknowledge the reception of a transac-
tion, and use phase BEGIN_RESP to indicate the correct execution of it, regardless of
whether it is a read or a write.
At some points it might be unnecessary to use all four phases to model a platform's
behaviour, e.g. when a write transaction is performed: an initiator (cpu) sends data to a
target (memory) which can execute the order immediately; in this case, the target can reply
to the initiator with a phase update, changing it from BEGIN_REQ to BEGIN_RESP and
adding some delay; the way each agent is aware of such status updates is by checking the
return value of an nb_transport call. Return values can be one of TLM_ACCEPTED (no
change in phase), TLM_UPDATED (phase updated) or TLM_COMPLETED (transaction
executed).
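The phase/return-value interplay can be sketched as a plain-C++ state walk; the enum names follow the standard, while the early-completion target logic is only illustrative:

```cpp
#include <cassert>

// Phase and return enums named after the TLM 2.0 base protocol.
enum Phase  { BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP };
enum Status { TLM_ACCEPTED, TLM_UPDATED, TLM_COMPLETED };

// A target that can execute a write immediately: rather than replying
// later on the backward path, it updates the phase in place so the
// initiator sees the jump from BEGIN_REQ straight to BEGIN_RESP.
Status nb_transport_fw(Phase& phase, unsigned& delay_ns) {
    if (phase == BEGIN_REQ) {
        phase = BEGIN_RESP;  // skip END_REQ: early completion
        delay_ns += 5;       // assumed execution latency, illustrative
        return TLM_UPDATED;
    }
    if (phase == END_RESP) return TLM_COMPLETED;  // initiator closes
    return TLM_ACCEPTED;
}
```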
Specific rules concerning each module's permission to modify the generic payload at-
tributes, the possible return values from each nb_transport call, and a detailed explanation
of the whole standard can be found in [29].
Figure 1.12: TLM Transaction Flow [28]. The generic payload object is created by the Initiator but is only referenced by interconnection modules or targets. Socket arrows indicate how the information flows.
As previously mentioned, some extensions can be added to the generic payload object
for routing purposes; they can be either global or instance specific, that is, each module
can add attributes to the transaction object and be the only one able to access them;
this work adds two extensions to the generic payload object: a global one for end-to-end
verification purposes, and an instance-specific one for router operation. The next chapter
will show more details about this.
Figure 1.13: TLM Base Protocol Phases [28]; the initiator is the vertical line on the left and the target the one on the right.
1.4 Open Core Protocol
The Open Core Protocol International Partnership (OCP-IP) is a community in charge of
“proliferating a common standard for intellectual property core interfaces, or sockets, that
facilitate ‘plug and play' System-on-Chip design” [10]. Their specification for intercon-
necting modules is a bus model as complete as ARM's AMBA-AXI and can be perfectly
described with OSCI's TLM 2.0.1 standard.
Because of the amount of detail the OCP has, a light version of it will be used for this
work; all basic signals shown in Table 1.3 are used, and additional burst support is included.
The standard OCP burst extension requires 8 additional signals, of which all but MBurstLength
can be skipped; to see how this can be done, consider Table 1.4: MAtomicLength is used
when the length of data is bigger than the word size, which is not the case here; MBurstPrecise
indicates that the length of the burst is known at the start of the transmission, as it always
is in our design; MBurstSeq specifies how the addresses of the burst are emitted, which in
this work are assumed to be incrementing; MBurstSingleReq implies that only one request
is done per burst transfer; MDataLast, MReqLast and SRespLast are unnecessary, as each
module keeps track of the number of data transferred.
Table 1.3: Basic OCP Signals extracted from [10]. Signal MDataValid is skipped in our implementation. Width measured in bits.
Name Width Driver Function
Clk 1 varies OCP Clock
MAddr configurable master Transfer address
MCmd 3 master Transfer command
MData configurable master Write data
MDataValid 1 master Write data valid
MRespAccept 1 master Master accepts response
SCmdAccept 1 slave Slave accepts transfer
SData configurable slave Read data
SDataAccept 1 slave Slave accepts write data
SResp 2 slave Transfer response
Table 1.4: Burst OCP Signals [10]. Only MBurstLength is enough for this work's NIC.
Name Width Driver Function
MAtomicLength configurable master Length of atomic burst.
MBurstLength configurable master Burst Length.
MBurstPrecise 1 master Burst length precise.
MBurstSeq 3 master Address sequence.
MBurstSingleReq 1 master Single request/multiple
data protocol
MDataLast 1 master Last data in burst.
MReqLast 1 master Last request in burst.
SRespLast configurable slave Last response in burst.
To better understand how transfers with the OCP protocol work, consider Figure 1.14;
only signal MRespAccept is missing from the diagram, yet the behaviour is practically the
same. Figure 1.15 shows a scenario for burst transfers; handshaking is carried out the same
way.
Figure 1.14: OCP Read Transaction [10]; signal behaviour when performing a read request: when the master issues the command it has to wait for SCmdAccept to assert before changing the MCmd line. After some time the slave indicates valid data on the SData bus by issuing a Data Valid command on the SResp line.
Figure 1.15: OCP Burst Write Transaction [10]; signals MBurstSeq and MBurstPrecise never change. Handshaking between master and slave is basically the same as in the previous non-burst example.
Chapter 2
NoC Implementation
Once the implementation details and design flow have been clarified as in the previous
chapter, it is now possible to describe the router and the NIC at any level of abstraction.
Although most items regarding each structure are well defined, some aspects still lack
specification and will be analysed hereafter. The code of each description can be found in
the Appendix section.
2.1 Flit and Message structure
In order to determine the structure of both the router and the NIC, it is necessary to define
the units they are going to deal with: flits and messages. Messages are composed of
one or more flits, which are the units injected into the router network; because wormhole
routing is to be used, one of the flits must include information about the origin and des-
tination of the whole message; NICs, however, require additional data fields to properly
implement end-to-end flow control. As a start, a review of the explanations provided in Sec-
tion 1.2.4 and the constraints mentioned in Section 2.3 is necessary to define all constraints.
If more TCP-like control parameters are needed for high-level control, those parameters
must be set by the processing elements and transmitted to the NIC as common data;
NICs only support the minimum number of control fields needed to ensure correct functionality.
The head flit structure is displayed in Figure 2.1 and the message structure in Figure 2.2.
Flit fields are explained in Table 2.1.
Figure 2.1: Head Flit Structure
Figure 2.2: Message Structure. Payload can be up to 64 bytes long.
Table 2.1: Flit fields explanation
Field Use
Type Flits can be either: Head, Body, Tail or Single; single
flits are used to ask for data and for barrier operations.
Source X Flit’s origin X coordinate.
Source Y Flit’s origin Y coordinate.
Destination X Flit’s destination X coordinate.
Destination Y Flit’s destination Y coordinate.
Length Message length. Maximum 64 words.
Single Indicates whether flit is a single-flit transaction or not.
Message Number Message number stated by source module.
Broadcast States whether the message is broadcast or not.
BarrierID Stores a BarrierID according to the source.
ReadWrite If the message is single-flit, this bit is set when it is a barrier
write.
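The head-flit fields in Table 2.1 can be sketched as a bit-packed word. The thesis fixes the flit at 34 bits, but since the exact layout appears only in Figure 2.1, every offset and width below is an illustrative assumption for a 4×4 network, not the thesis code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit layout for the head flit of Table 2.1; widths are
// assumptions (2-bit coordinates for a 4x4 grid, 64-word messages).
enum FlitType { HEAD = 0, BODY = 1, TAIL = 2, SINGLE = 3 };

struct HeadFlit {
    FlitType type;        // 2 bits: Head, Body, Tail or Single
    unsigned srcX, srcY;  // 2 bits each: origin coordinates
    unsigned dstX, dstY;  // 2 bits each: destination coordinates
    unsigned length;      // 7 bits: message length, up to 64 words
    unsigned msgNumber;   // 8 bits: message number stated by the source
    bool     broadcast;   // 1 bit: broadcast flag
};

uint64_t pack(const HeadFlit& f) {
    uint64_t w = 0;
    w |= uint64_t(f.type & 0x3);
    w |= uint64_t(f.srcX & 0x3)       << 2;
    w |= uint64_t(f.srcY & 0x3)       << 4;
    w |= uint64_t(f.dstX & 0x3)       << 6;
    w |= uint64_t(f.dstY & 0x3)       << 8;
    w |= uint64_t(f.length & 0x7F)    << 10;
    w |= uint64_t(f.msgNumber & 0xFF) << 17;
    w |= uint64_t(f.broadcast ? 1 : 0) << 25;
    return w;
}

HeadFlit unpack(uint64_t w) {
    HeadFlit f;
    f.type      = FlitType(w & 0x3);
    f.srcX      = (w >> 2) & 0x3;
    f.srcY      = (w >> 4) & 0x3;
    f.dstX      = (w >> 6) & 0x3;
    f.dstY      = (w >> 8) & 0x3;
    f.length    = (w >> 10) & 0x7F;
    f.msgNumber = (w >> 17) & 0xFF;
    f.broadcast = ((w >> 25) & 1) != 0;
    return f;
}
```

A pack/unpack pair like this is what the NIC's BuildHeadFlit and GetHeaderInfo methods (described later) would perform.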
Through SystemC descriptions and simulations it was possible to establish the correct
behaviour of the platform. Now that the units needed by both router and NIC are defined,
their designs can be presented.
2.2 Router Architecture
Studies presented in Chapter 1 yielded the following conclusions regarding router imple-
mentation:
- Topology: Torus, displayed in Figure 2.3 (taken from [30]).
- Switching Policy: Wormhole Packet Switched.
- Flow Control Technique: Handshaking Signals.
- Routing Algorithms: Deterministic XY and Adaptive Turn Model.
Only two aspects about the router’s structure are still undefined: Arbitration tech-
niques and number of Virtual Channels. When two or more inputs attempt to use a
router’s output it is necessary to establish a mechanism to assign output control. Table 2.2
lists usual solutions to this problem; most implementations listed in [2] use Round-Robin
or First Come - First Served techniques for Best Effort routers and priority approaches for
Guaranteed Traffic (GT) ones such as [19] and [14].
Specialized routers are required when GT services are to be provided; just a few NoCs
like the ones cited have implemented GT services. Best effort Round Robin arbitration will
be used on this work.
On the other hand, Virtual Channels (VCs) are buffer additions to the router's inputs
(outputs) used for alleviating congestion on the network; despite using the same physical
paths, the addition of buffers decreases the probability of deadlock and improves performance,
as delayed messages can be held in routers and still advance toward their destinations. Area is
the main cost of adding Virtual Channels and is also one of the most critical issues in
Figure 2.3: Torus Topology NoC
Table 2.2: Router Arbitration Techniques [2].
Arbitration Technique Policy
Round Robin Output is assigned equally starting from the
first element.
First Come - First Served Output control is assigned in request order.
Priority Based All packets are assigned a priority and get
output control according to their importance.
Priority Based Round Robin Round Robin is implemented but a priority
proportional to the frequency of usage is assigned.
embedded system design; because of that, an optimal placement and integration of buffers
is required. Figure 2.4 shows a router with input VCs, which are, in principle, connected to
all possible outputs. Studies in [31] show that, for unicast routing, having one VC per output
at each input can reduce area consumption significantly; with this result, the router and
VC integration can be seen in Figure 2.5.
Figure 2.4: Block diagram of a Router with Virtual Channels. Area constraints have to be considered to choose an appropriate number of buffers.
Now that all specifications related to the router's behaviour are defined, a high-level
block diagram of it can be constructed; no major implementation details are shown, for it is
an abstraction of the real hardware and all functional blocks are software described. Figure
2.6 shows the general block diagram that will be used to describe the TLM model of the
router.
Figure 2.5: Virtual Channel connections to Router. A single VC per output is available at each input so as to decrease area consumption. Extracted from [31].
Figure 2.6: General Router Block Diagram. Four virtual channels at each input are placed to reach all possible outputs; no packets are routed back through the same input.
2.2.1 Router TLM Model
SystemC’s Transaction Level Modelling is a standard for decoupling communication from
computation in high level designs; most mechanisms offered by the standard are easily
abstracted to bus models because it’s the traditional way to interconnect Systems On
Chip (SoCs). As routers use different flow control techniques compared to traditional bus
systems, different interpretation of the TLM 2.0.1 phases is required in order for the model
to keep faithful to the hardware. Table 2.3 explains phase’s meaning for inter-router, packet
based communication.
Table 2.3: TLM 2.0.1 Phases Interpretation for Routers.
Phase Flow Direction Meaning
BEGIN REQ Init. Router To Target Router Flit is being transmitted.
END REQ Target Router To Init. Router Flit is stored; it can be erased on the initiator.
BEGIN RESP Target Router To Init. Router A new space is free. Can
send more flits.
END RESP Init. Router To Target Router Final reply.
Another addition to the TLM 2.0.1 base protocol, described in the previous chapter,
are routing extensions; as mentioned before, extensions can be locally or globally accessed.
The proposed model uses both for debugging and verification purposes; a local extension
is created for every transaction as it traverses a router, and each router adds its own
extension to the transaction. It has the following fields:
(a) Port: Stores the number of the incoming port through which the transaction entered.
(b) Port VC: Stores the number of the outgoing port through which the transaction will
go out.
(c) TimesBlocked: Counter that is incremented by one each time the transaction's transmission is attempted and blocked. This allows recognizing deadlock situations.
The global extension is created by the initiator, can be accessed by all modules and
adds the following information to the transaction object:
(a) MainInitiator: Stores the ID of the module that first issued the transaction.
(b) FinalTarget: Stores the ID of the module where the transaction is to be delivered.
(c) TransID: Records the transaction number for debugging purposes.
(d) FlitType: Stores the type of flit of the current transaction.
(e) TransCounter: Incremented every time a transaction passes through a router.
(f) TransPath: Array for storing the path the flit goes. Used for debugging.
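Taken together, the fields above can be sketched as a plain C++ struct; in the actual model this would derive from tlm::tlm_extension<> so it can travel with the generic payload. The field sizes, the visit() helper and the 16-entry path limit are illustrative assumptions, not the thesis code:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// SystemC-free sketch of the global routing extension; field names follow
// the list in the text, sizes and the helper are assumptions.
struct RoutingExtension {
    int MainInitiator = -1;   // ID of the module that issued the transaction
    int FinalTarget   = -1;   // ID of the destination module
    uint32_t TransID  = 0;    // transaction number, for debugging
    int FlitType      = 0;    // type of flit of the current transaction
    int TransCounter  = 0;    // incremented at every router traversed
    std::array<int, 16> TransPath{};  // routers visited, for debugging

    // Called by each router on the path: record the hop and bump the counter.
    void visit(int routerId) {
        if (TransCounter < static_cast<int>(TransPath.size()))
            TransPath[TransCounter] = routerId;
        ++TransCounter;
    }
};
```

With this shape, a testbench can reconstruct the exact route a flit took after simulation, which is what the debugging fields are for.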
At this point it is necessary to clarify that there are four types of flits: head flits, which
contain routing information; body flits, which are the data itself; tail flits, which mark the
end of a packet (and may or may not contain data); and full flits, which are single-flit messages
used for (a) sending read requests from one core to another (end-to-end flow control) and
(b) single-flit writes used for barrier operations.
SystemC implementation of the router is composed of five functions that act on each
port:
I Non-blocking Transport Forward: A standard, mandatory function that receives
three parameters: a transaction pointer, a TLM phase argument and a time value
called delay. When a module wants to send a flit, it calls this function with those
parameters and BEGIN REQ as the phase argument; the delay is the time at
which the target has to react after getting this call. The function checks the type of
flit and space availability, computes the output, returns TLM ACCEPTED and tells the
simulator to execute Forward Payload Event Queue at the time indicated by the delay.
Detailed behaviour of this method is shown in Algorithm 2.1.
II Non-blocking Transport Backward: Also a mandatory function that receives
the same three parameters, but correct phase arguments are either END REQ or
BEGIN RESP. If END REQ is received, a method called Backward Event Queue is notified
for execution after the delay time; if BEGIN RESP is received, the method Transaction
Update is notified.
III Forward Payload Event Queue: Function invoked by nb forward transport; it takes
the transaction object, stores it in the corresponding Virtual Channel and notifies
the Transaction Update method for execution after an internal delay. It also
returns phase END REQ back to the initiator to acknowledge the correct storage of
the transaction.
IV Backward Payload Event Queue: Function invoked by nb backward transport, in
charge of double checking that the transaction is correct. Notifies the Transaction
Update method for immediate execution.
V Transaction Update: Considered the brain of the router; it starts transactions previ-
ously stored in the VCs, deletes transactions already sent, notifies modules of newly
available spaces, and implements output arbitration. Algorithm 2.2 describes the method thoroughly.
2.2.2 Traffic Evaluation and Routing Algorithm Testing
MPSoC platforms are generic systems that can implement any algorithm whose inter-
module traffic can be known once task partitioning is done; because it is uncertain which
application will be executed on such platforms, it is necessary to test synthetic traffic
patterns on the chip to establish its performance under random circumstances. There
Algorithm 2.1 Non-blocking Transport Forward.
Require: Transaction object, phase, delay
1: if phase = BEGIN REQ then
2: if (FlitType = Head) or (FlitType = Full) then
3: OutPort = Value returned by Routing Algorithm.
4: if (VC Empty) then
5: Reserve Virtual Channel
6: Set response status to TLM OK RESPONSE
7: if (OutPort Free) then
8: Take control of OutPort
9: end if
10: else
11: Set response status to TLM GENERIC ERROR RESPONSE
12: Return TLM ACCEPTED
13: end if
14: Notify Forward Payload Event Queue to execute after delay time
15: Decrease VC Space
16: Return TLM ACCEPTED
17: else if (VC has space) then
18: Set response status to TLM OK RESPONSE
19: Decrease VC Space
20: Notify Forward Payload Event Queue to execute after delay time
21: Return TLM ACCEPTED
22: else
23: Set response status to TLM GENERIC ERROR RESPONSE
24: Return TLM ACCEPTED
25: end if
26: else if phase = END RESP then
27: Return TLM ACCEPTED
28: else
29: Abort Execution
30: end if
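A simplified, SystemC-free sketch of the decision logic in Algorithm 2.1: head/full flits try to reserve a Virtual Channel, body/tail flits only need free space, and errors travel in the response status while the call itself always returns TLM_ACCEPTED. Queue state is reduced to two counters and the event-queue notifications are omitted, so this only illustrates the branching, not the thesis code:

```cpp
#include <cassert>

// Names mirror the TLM 2.0.1 API, but these are plain enums for the sketch.
enum Phase    { BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP };
enum Status   { TLM_OK_RESPONSE, TLM_GENERIC_ERROR_RESPONSE };
enum Sync     { TLM_ACCEPTED };
enum FlitKind { HEAD, BODY, TAIL, FULL };

struct VirtualChannel {
    int  space    = 4;      // free flit slots (depth is an assumption)
    bool reserved = false;  // a head flit has claimed this VC
};

struct Transaction {
    FlitKind flit;
    Status   status = TLM_OK_RESPONSE;
};

// nb_transport_fw per Algorithm 2.1: head/full flits reserve the VC,
// body/tail flits are accepted while space remains; the return value
// is always TLM_ACCEPTED, errors go in the response status.
Sync nb_transport_fw(Transaction& tr, Phase phase, VirtualChannel& vc) {
    if (phase == BEGIN_REQ) {
        if (tr.flit == HEAD || tr.flit == FULL) {
            if (!vc.reserved) {
                vc.reserved = true;            // reserve Virtual Channel
                tr.status = TLM_OK_RESPONSE;
                --vc.space;                    // store the flit
            } else {
                tr.status = TLM_GENERIC_ERROR_RESPONSE;
            }
        } else if (vc.space > 0) {             // body or tail flit
            tr.status = TLM_OK_RESPONSE;
            --vc.space;
        } else {
            tr.status = TLM_GENERIC_ERROR_RESPONSE;
        }
    }
    return TLM_ACCEPTED;  // END_RESP likewise just returns TLM_ACCEPTED
}
```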
Algorithm 2.2 Transaction Update Method Implemented by Routers
Require: Virtual Channel to Update
Require: InputPort, OutputPort
1: if A transaction on VC can be started then
2: Call non-blocking forward method on the next module with phase BEGIN REQ.
3: end if
4: for i = 0 to V CSize do
5: if A Transaction can be freed then
6: Delete transaction.
7: Increase VC space.
8: Call non-blocking backward method on the previous module with phase BEGIN RESP to indicate that a new space is available.
9: if Transaction is type “Tail” then
10: Free Virtual Channel.
11: Stop controlling Output Port.
12: end if
13: end if
14: end for
15: if Output Port is not busy then
16: for i = 0 to Number of Router Inputs do
17: NewInput = (InputPort + i) mod Number of Router Inputs
18: if NewInput is ready to use OutputPort then
19: Give NewInput control of OutputPort.
20: Execute again from the start.
21: end if
22: end for
23: end if
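The round-robin grant at the end of Algorithm 2.2 (lines 15-22) can be sketched as a search that starts one position after the input that last owned the output. This is a behavioural sketch; in hardware it would be a rotating priority encoder:

```cpp
#include <cassert>
#include <vector>

// Round-robin output arbitration: starting from the input after the
// current owner, grant the output to the first input with a pending
// request; return -1 if no input is requesting it.
int arbitrate(int lastOwner, const std::vector<bool>& requests) {
    const int n = static_cast<int>(requests.size());
    for (int i = 1; i <= n; ++i) {
        int candidate = (lastOwner + i) % n;  // rotate past the last owner
        if (requests[candidate])
            return candidate;
    }
    return -1;  // output stays idle
}
```

Because the search always starts just past the previous owner, every requesting input is eventually served, which is the fairness property Table 2.2 attributes to Round Robin.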
are a few typical tests conducted on NoC designs that help assess routing algorithm
performance:
Uniform Traffic
Nodes communicate with each other with the same probability.
Matrix Transpose Traffic
Each node sends messages only to a destination with the upper and lower halves of
its own address transposed.
Hotspot Traffic
Each node sends messages to other nodes with an equal probability except for a
specific node (called Hotspot) which receives messages with a greater probability.
The percentage of additional messages that a Hotspot node receives compared to the
other nodes is indicated after the Hotspot name, e.g. Hotspot 15%.
Complement Traffic
Each node sends messages only to a node corresponding to the one’s complement of
its own address.
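For the address-permutation patterns above, the destination of each node is a pure function of its own address. A sketch for a 16-node network with 4-bit node addresses (the sizes are assumptions matching the 4×4 platform used in this work):

```cpp
#include <cassert>

// Node addresses are 4 bits wide for a 16-node (4x4) network.
const int ADDR_BITS = 4;

// Matrix transpose traffic: swap the upper and lower halves of the address.
int transposeDest(int node) {
    const int half = ADDR_BITS / 2;
    const int mask = (1 << half) - 1;
    return ((node & mask) << half) | (node >> half);
}

// Complement traffic: one's complement of the node address.
int complementDest(int node) {
    return ~node & ((1 << ADDR_BITS) - 1);
}
```

Uniform and hotspot patterns, by contrast, draw the destination from a (possibly biased) random distribution rather than a fixed permutation.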
Several scenarios were tested under some of these traffic conditions, and three
routing algorithms were implemented: West-First (adaptive), North-Last (adaptive) and
XY (deterministic). Additionally, an aspect that hasn't been studied yet, VC depth, was
also considered; the results can be seen in the following figures.
Figures 2.7 and 2.8 show link utilisation under Hotspot 10% (on node 7) traffic condi-
tions; two groups of figures are provided, as all links transmit information in both directions
(up-down or right-left); for better discrimination of link congestion, the plots were done sep-
arately. XY Routing in Figure 2.7(a) has 4 high-traffic links (higher bars); West First
in Figure 2.7(b) presents only two congested links and North Last in Figure 2.7(c) just
one. In the other direction, Figure 2.8(a) shows XY with 2 congested links, Figure 2.8(b)
presents West First behaviour with 1 high traffic link and Figure 2.8(c) has 2 congested
links with North Last routing.
From the previous figures, West First and North Last routing apparently spread
traffic better along the network, as they only have 3 high-traffic links across both corresponding graphs.
No turnaround links show significant utilization despite the adaptiveness of those algo-
rithms.
In order to check the overall behaviour of all routing algorithms under this traffic pat-
tern, plots for average flit latency and total simulation time are shown in Figure 2.9. From
2.9(a) it can be seen that the more Virtual Channels there are, the higher the flit latency on the
network; that is because NICs can inject more packets into the network at any given time.
Adaptive algorithms show lower values than XY's, indicating that information is forwarded
faster with them. Figure 2.9(b) gives more information about routing performance; for
large messages XY and North Last perform better than West First regardless of the Virtual
Channel depth, while for shorter transmissions West First decreases simulation time. In gen-
eral, results presented for this traffic pattern are very close to each other and might need
deeper analysis to make a routing decision.
A second pattern was studied under the same conditions as before; Matrix Transpose
traffic was implemented and results are shown in Figures 2.10, 2.11 and 2.12. From graphs
2.10(a) and 2.11(a), 8 congested links can be distinguished when using XY routing; West
First behaviour, shown in 2.10(b) and 2.11(b), presents only 2 high-traffic links, as does
North Last routing in 2.10(c) and 2.11(c). Although, as before, the adaptive algorithms
attempt to better distribute traffic throughout all available paths, results from 2.12 demon-
strate that long messages with low Virtual Channel depth reach their destination faster
with XY routing, which also has the lowest flit latency; medium-size messages are
better suited to West First routing under this pattern.
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.7: Link Utilisation for Hotspot 10 %. Traffic going Right - Down
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.8: Link Utilisation for Hotspot 10 %. Traffic going Left - Up
(a) Average Flit Latency
(b) Total Simulation Time
Figure 2.9: Timing statistics for Hotspot 10% traffic. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.10: Link Utilisation for Matrix Transpose Traffic going Right - Down
(a) XY Routing
(b) West First Routing
(c) North Last Routing
Figure 2.11: Link Utilisation for Matrix Transpose Traffic going Left - Up
(a) Average Flit Latency
(b) Total Simulation Time
Figure 2.12: Timing statistics for Matrix Transpose traffic. All routing algorithms were evaluated under several message size and Virtual Channel depth conditions.
Although more traffic patterns could have been evaluated, for the scope of
this study these are enough to show the capabilities of the constructed model. Each of the
timing graphs (3D plots) considered 8 message sizes and 8 Virtual Channel depths, that is,
64 simulations in total. Each run used 100 messages on each of the 16 nodes, for a total of
1600 messages, which amounts to between 6400 and 51200 flits being transported along the NoC.
2.2.3 Router VHDL Model
Once the router’s behaviour was validated with the high level model presented, a detailed
HDL design was implemented; three main blocks compose this design: Input Port, Output
Port and Multiplexers. Input ports are in charge of data reception control, routing and
Virtual Channel storage; Output ports control data transmission and round robin channel
arbitration; multiplexers interconnect all input buffers with the router’s outputs.
Because of the flit size, that is 34 bits, there are 34 lines for data transmission and 34
for data reception; also, two lines are used for handshaking transmission control, Tx and
Tx Ack and two lines for reception control, Rx and Rx Ack. In summary, each port has 36
inputs and 36 outputs. Figure 2.13 shows the router’s black box.
The Input Port module is composed of 4 FIFOs, a routing unit, one multiplexer and one
de-multiplexer; deterministic XY routing was chosen for the state-machine implementation.
Flow control was implemented according to the studies presented in section ??, where
handshaking signals were selected. The control module receives incoming requests from
external modules through the Rx input; it sends a request signal to the routing unit, which
replies back when the output has been computed; after getting a response from the routing
module, the flit is stored in the corresponding queue if space is available. When no space
is available, the Rx Ack line remains de-asserted.
The routing module was designed as a state machine that uses external comparators
to determine whether the coordinates of the destination are larger or smaller than the
router's. Depending on the results of all comparisons, an output is computed and stored in
Figure 2.13: VHDL Router Black Box. Five ports are needed for a torus or (most routers in) mesh configurations.
an internal register; the block diagram of this module is presented in Figure 2.14 and the one
for the whole Input Port is shown in 2.15.
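The comparator-based XY decision just described can be sketched as follows. Port names and the direction convention (Y growing downward) are assumptions, and the torus wrap-around logic is omitted for brevity, so this is the plain mesh variant:

```cpp
#include <cassert>

// Output ports of the router; LOCAL ejects the flit to the attached NIC.
enum Port { LOCAL, NORTH, SOUTH, EAST, WEST };

// XY routing: correct the X coordinate first, then Y, then deliver locally.
// The comparisons mirror the external comparators of the routing module.
Port xyRoute(int curX, int curY, int dstX, int dstY) {
    if (dstX > curX) return EAST;   // destination lies to the right
    if (dstX < curX) return WEST;   // destination lies to the left
    if (dstY > curY) return SOUTH;  // X matches: move along Y (Y grows down)
    if (dstY < curY) return NORTH;
    return LOCAL;                   // both coordinates match: eject
}
```

Because X is always resolved before Y, the only turns ever taken are X-to-Y turns, which is what makes deterministic XY routing deadlock-free on a mesh.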
The Output Port module is composed of a control unit in charge of arbitrating outputs and
negotiating data transmission with another router or NIC; two de-multiplexers, a flit
decoder and a multiplexer are also included in this module. The flit decoder is in charge of
notifying the control unit when a tail or single flit has been transmitted so that it can assign the
output to another input. A block diagram of this block is shown in 2.16.
Finally, the full block diagram of the router is shown in Figure 2.17. The VHDL code is
attached in the Appendix. Due to the number of inputs/outputs of
this module, implementation is only feasible on an ASIC; however, FPGA synthesis
allowed us to gather some information about area consumption. Studies shown in [32] list
statistics about the number of slices consumed on a Virtex-II FPGA; a 5-input/output
router consumed 397 slices. Also, [8] reports a 1762-CLB consumption on a Virtex-II-8000 FPGA.
Figure 2.14: VHDL Block Diagram for the XY Routing Module.
Figure 2.15: VHDL Block Diagram for Input Port Module. Four FIFOs are needed to route information to each output port.
Figure 2.16: VHDL Block Diagram for Output Port Module.
Our router model was synthesised on a Virtex-5. Virtual Channels were generated as
FIFO memories with Xilinx's IP Core Generator, with 16-flit depth. Resource utilisation
is shown in Table 2.4. Because low-level detailed design was not the objective of this
work, HDL simulations are omitted from this document, yet the VHDL code is attached in the
Appendix.
It is important to note that the pin-out of all previous modules exceeded the available I/O
for synthesizing a single router; however, if more of these modules are embedded, a small network
of them can be constructed and the chips could be connected to external processors through
the FPGA's outputs.
2.3 Network Interface Card Architecture
Network Interface design is intended to support and validate message passing transactions
which are composed of two tasks for communication, send and receive, and one for syn-
Figure 2.17:VHDL Block Diagram for the Router. Multiplexers shown on diagramare the same as the ones shown in Figure 2.16 for data selection.
Table 2.4: Router Area Consumption on Virtex 5 (XC5VFX30T-1FF665)
Device Utilisation
Logic Utilisation Used Available Utilization
Number of Slice Registers 730 20480 3 %
Number of Slice LUTs 846 20480 4 %
Number of fully used LUT-FF pairs 230 1346 17 %
Number of bonded IOBs 372 360 103 %
Number of Block RAM/FIFOs 10 68 14 %
Number of BUFG/BUFGCTRLs 1 32 3 %
chronization called barrier. Those functions were taken from the MPI standard and suffice
for the required functionality.
2.3.1 Network Interface TLM Model
The SystemC TLM model of the NIC has one target socket for receiving the core's transactions,
one initiator socket for sending data to the local router and another target socket to get
data from it; for each target socket there is a corresponding nb forward transport function,
and for the initiator socket a nb backward transport method is provided.
On the sockets connected to routers, TLM phases are interpreted the same way as stated
in Table 2.3; however, on sockets connected to processing elements (end-modules), phases
are treated as specified by the standard in Section 1.3.
In order for the system to react at the appropriate time (because of transaction delays),
there are three payload event queues, one linked to each nb transport function. Other methods
are in charge of standard operations such as storing data in send or receive buffers, ar-
bitrating output control, replying to processing elements, etc. Next, a list of all NIC methods'
functionality is presented.
I. BuildHeadFlit: Method in charge of creating a transaction's header flit. It stores the
message number, type of flit, and initiator and target addresses in a single word.
II. GetHeaderInfo: In charge of extracting all header information from a head flit.
III. CheckIfExpectedTransaction: Method invoked when a new request arrives; it is
in charge of establishing whether it corresponds to a write transaction started by the
local processor or not. If the transaction doesn't match the expected one, it's stored
in an incoming-requests buffer.
IV. StoreAtSendBuffer: Function used for storing flits at a send buffer (if a write
transaction) or at the send request buffer (if a read transaction).
V. StoreAtReceiveBuffer: In charge of storing flits in a receive buffer (if a read trans-
action) or in the receive request buffer (if a write transaction) when the request doesn't
match the processor's request.
VI. RESPONSE TransactionUpdate: Dynamic-event-triggered method used for send-
ing phase BEGIN RESP back to the router when a tail flit has been received; it also
sends BEGIN RESP to the local processor to indicate that data is ready to be
transmitted.
VII. REQUEST TransactionUpdate: Method invoked to send a new write request
when the timer has expired.
VIII. RRESPONSE TransactionUpdate: Is in charge of returning phase BEGIN RESP
back to the router when a read request has been received.
IX. SEND TransactionUpdate: Acts as a central control unit for the NIC module;
this method checks whether the IncomingRequestQueue has a valid transaction that
matches the one specified by the processor and, if so, grants output access to the send
queue. It then sends the first flit in the send buffer and updates debugging
information; after that, it frees already-transmitted flits from the queue and checks whether
the last one was a tail flit, in which case it releases the output port. More tasks are performed by this
function, which is better described by Algorithm 2.3.
Apart from reading and writing, all cores are capable of executing barrier operations for
synchronization. Depending on the core's ID, a barrier is implemented differently: there
must always be one master core and one or more slave cores; the master core waits for the slaves
to send a barrier message and, once all have arrived, it issues a command for all of them
to resume executing their tasks. Because it is necessary to address all nodes when issuing
barrier transactions from the master core, and because routers are unaware of that,
NICs were designed to support a broadcast command that sends the same data to every
node. This functionality is also useful when processors need to share information stored in
one of them; however, a node will not start transferring data until it gets request flits from
all the others. This approach might prove useful in some scenarios but can also decrease
overall performance in others.
In order to improve performance and reduce processor computation the NIC implements
barrier operations as follows:
Slave cores : Send a normal write request transaction to the master core and expect a
one-flit write.
Master core : Builds a single-flit write transaction and stores it at the send buffer; when
requests from all modules are received, it sends that flit to all the modules. When all
flits have been transmitted the NIC replies back to the core.
The mechanical computation implied by the barrier function is done at the NIC so that the
core can perform other operations; the cost is an increase in area consumption.
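The master-side bookkeeping just described reduces to a counter; the sketch below (a behavioural model, not the SystemC or VHDL code) shows when the single-flit broadcast release would fire:

```cpp
#include <cassert>

// Counting model of the master NIC's barrier: one request is expected from
// each slave, and the broadcast release is sent when the last one arrives.
struct BarrierMaster {
    int expected;     // number of slave cores participating
    int arrived = 0;  // barrier requests received so far

    explicit BarrierMaster(int slaves) : expected(slaves) {}

    // Called on each incoming slave barrier request; returns true when the
    // single-flit broadcast write should be sent, then re-arms the barrier.
    bool onSlaveRequest() {
        ++arrived;
        if (arrived == expected) {
            arrived = 0;   // re-arm for the next barrier
            return true;   // release all slaves now
        }
        return false;      // still waiting for the remaining slaves
    }
};
```

The slave side needs no state at all: it sends one write request and blocks until the one-flit broadcast write arrives.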
TLM phases for read operations can be seen in Figure 2.18. These transactions take
a long time to complete because, once the NIC is notified of a read transaction, it sends
a request-data flit to the appropriate module and has to wait for the information to come;
Algorithm 2.3 Transaction Update pseudo-algorithm implemented by NICs.
1: if Write Pending and Not Read In-progress then
2: Check Request queue.
3: if Transaction Requested is expected then
4: Give Output control to Send Queue
5: end if
6: end if
7: if Send Queue controls Output then
8: Send first flit on Queue
9: if Flit accepted then
10: Mark Flit as accepted.
11: Notify method for later execution to delete Flit.
12: end if
13: if Write is Unicast then
14: Delete transmitted flits in Send Queue.
15: end if
16: if Write is broadcast then
17: if Write is Burst and Burst Completed then
18: Create new Head Flit.
19: Notify method for later execution to start transmission to the next node.
20: Reset Transmission counters
21: end if
22: end if
23: if Write is Burst and Not all data packed then
24: Store next flit at Send Queue
25: end if
26: end if
27: if All data is transmitted then
28: Send phase BEGIN RESP back to Initiator to release transactions.
29: end if
it is only after getting all packets that the processing element is notified of the data
availability and the transaction is concluded. On the other hand, write transactions between
the NIC and the processing element can be finalised faster. A phase diagram for write
transactions can be seen in Figure 2.19.
Figure 2.18: TLM Phases in a NIC Read Operation. The CPU asks for data; the NIC sends a request to the corresponding module and waits for the data to arrive. After all information is received, phase BEGIN RESP is issued to the CPU to indicate the end of the transaction.
Figure 2.19: TLM Phases in a NIC Write Operation. Processing elements send all data to the NICs and finalize the transaction after transmitting all the information. NICs await a read request and send packets when the corresponding one is received.
2.3.2 Network Interface Hardware Design
The Network Interface design was extrapolated from the SystemC high-level description; high
HDL complexity was found in this module, as it has to implement part of the router's
functionality, solve end-to-end flow control and communicate with the processing element
through the OCP-IP bus model. Several control units were necessary for this design to
support all the features implemented in SystemC listed in the previous section; because of
space constraints, a general block diagram of the overall module is shown in Figure 2.20;
control signal paths are shown in red and data path ones in yellow.
To better understand the figure, a correspondence between the TLM 2.0 model and the
VHDL one is presented in Table 2.5; although the equivalence is not exact, it tries to match
the main aspects. Functions shown in the table are also described in the previous section.
One of the most complex modules of the NIC was the OCP Handshaking Control, which
required careful design in order to support transactions and respect their execution
order; from the diagram in 2.20 it can be seen that another control unit (End-to-End Flow
Control) was necessary. State diagrams of both are shown in Figures 2.21 and 2.22.
When a new transaction is started from the processing element, the handshaking control
verifies whether it is possible to initiate it internally; if so, an appropriate header
flit is stored in the corresponding queue and the information is packed (for write transactions).
The end-to-end flow control is notified of the operation in progress and commands the transmission
and reception units to carry on with the transaction: if a
read is to be performed, a request flit is sent to the corresponding module; if a write is
requested, the reception control must report when a request matching the write address
is received.
The VHDL implementation of the NIC is left for future work, as it doesn't constitute a
common test metric in the NoC field.
Table 2.5: VHDL-SystemC equivalence of NIC blocks
VHDL Block | TLM Methods/Objects | Function
OCP-Handshaking Control | nb transport fw, RESPONSE Transaction Update | Transfer data from (to) processing elements.
End to End Flow Control | Target Payload Event Queue, Check If Expected Transaction | Execute transactions tidily.
Tx Control | nb fw router, SEND Transaction Update | Initiate transactions with the router.
Rx Control | nb transport bw router, Store At Receive Buffer | Receive transactions from the router.
FIFO DataIN and Requests | Double-ended queue | Store read data and incoming requests.
Bank DataOUT | Double-ended queue | Store outgoing data and read requests.
Rest of Blocks | Build Head Flit, Get Header Info | Set and retrieve head flit information.
Figure 2.20: VHDL Block Diagram for Network Interface Card
Figure 2.21: State Machine for the Handshaking Control
Figure 2.22: State Machine for the End-to-End Flow Control
2.4 Software Performance Results
After validating both the router and NIC TLM models, software applications were pro-
grammed to analyse performance results with the whole NoC. Matrix multiplication was
implemented for its straightforward parallelization; previous performance graphs could also
be obtained but are not shown for space constrains.
2.4.1 4× 4 Matrix Multiplication
The first test scenario was a 4 × 4 matrix multiplication split across 16 cores (1 master, 15
slaves), where each one performed a row-column product and returned its result to the
master. MPI directives such as MPI Send, MPI Receive and MPI Broadcast were used for
data sharing between modules.
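The task split can be sketched as follows, with the 16 "cores" simulated by a loop; the MPI transfers are reduced to a return value, so this only illustrates the partitioning used in the test, not the platform code:

```cpp
#include <array>
#include <cassert>

// For C = A * B on 4x4 matrices, core k (k = 0..15) computes the single
// dot product C[k/4][k%4], exactly one row-column product per core.
using Mat = std::array<std::array<int, 4>, 4>;

int rowColProduct(const Mat& A, const Mat& B, int core) {
    int r = core / 4, c = core % 4, acc = 0;
    for (int k = 0; k < 4; ++k)
        acc += A[r][k] * B[k][c];
    return acc;  // each core would MPI_Send this value back to the master
}

// The master gathers all 16 results into the product matrix.
Mat multiply(const Mat& A, const Mat& B) {
    Mat C{};
    for (int core = 0; core < 16; ++core)
        C[core / 4][core % 4] = rowColProduct(A, B, core);
    return C;
}
```

In the actual test, A and B reach the slaves via MPI Broadcast and the per-core results return via MPI Send/MPI Receive, so communication, not computation, dominates the measured times.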
Figure 2.23 shows the full Network-on-Chip performance in terms of operation time
for three routing algorithms and several Virtual Channel depths. Consistent with
previous simulations, XY routing had the worst behaviour while West-First had the best
one; the increase in buffer storage allowed an exponential-like decrease in timing measures.
Figure 2.23: NoC Performance for a 4× 4 Matrix Multiplication
2.4.2 8× 8 Matrix Multiplication
In order to increase data transfer, a second test was performed with an 8 × 8 matrix
multiplication. Each core (including the master) computed 4 row-column products and
sent its results back to the master; platform behaviour is shown in Figure 2.24. Compared
to the first scenario, the decrease is linear rather than exponential for both adaptive
routing algorithms, and XY routing does not improve with larger buffers.
Because of the way MPI_Broadcast was implemented, that is, if the master core is N,
data is sent first to node N + 1 and so on, cores are expected to finish their operation
in that same order; this was verified and is presented in Figures 2.25 and
2.26.
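Assuming the described MPI_Broadcast behaviour, the delivery (and hence expected completion) order for a master node N on the 16-node network can be sketched as:

```cpp
#include <vector>

// Delivery order of the broadcast as described: starting from master N,
// data goes to node N+1, then N+2, ..., wrapping around the 16-node
// network until every slave has received it.
std::vector<int> broadcast_order(int master, int nodes = 16) {
    std::vector<int> order;
    for (int i = 1; i < nodes; ++i)
        order.push_back((master + i) % nodes);
    return order;
}
```

Since each core's work is identical, completion times follow this delivery order, which is what Figures 2.25 and 2.26 confirm.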
Results obtained from the SystemC Network-on-Chip model demonstrate the advan-
tages of having a high-level abstraction of the hardware platform and the richness of the statistics
Figure 2.24: NoC Performance for a 8× 8 Matrix Multiplication
Figure 2.25: Total simulation time at each node with North Last Routing
Figure 2.26: Total simulation time at each node with West First Routing
extracted from it. Further studies and finer tuning of the NoC can lead to sound and
robust design decisions as the design flow continues.
Chapter 3
Concluding Remarks
A co-design methodology was successfully validated through the adoption of the new IEEE
standard for high-level modelling, SystemC; it was possible to specify both hardware and
software constraints from the start, allowing a clearer and more concrete approach. As
in any design, unspecified conditions and unexpected behaviour were encountered
throughout the design process, yet correction and validation were much faster with
the virtual platform constructed.
Contrary to traditional hardware design, a virtual, portable platform of the real hard-
ware is available for quick software development and testing, and it only requires a C++
compiler to run; no specialized or licensed software is needed to start writing applications
for this system, which makes it versatile. If more statistics or performance metrics
about hardware or software behaviour are required, it suffices to add a few lines to
the libraries provided; code that does not call the simulator will not alter
the platform statistics.
Ideas borrowed from software theory for distributed programming were extrapolated
to create a hardware platform capable of running MPI-like applications; despite being an
old standard, most of its premises are used nowadays for non-shared-memory architectures
and are still under development. Thanks to this approach, an end-to-end flow control
technique was proposed to reduce undesired traffic inside the network and to avoid
wasting processing time in end modules, thereby improving performance.
Traffic simulations showed that, with the proposed NoC structure, 4x4 torus
topologies achieve similar performance with adaptive algorithms as with deterministic
ones; two turn-model [23] algorithms for mesh topologies were implemented and
adapted to a torus, but no significant difference between them was found under
the traffic patterns implemented. Real software applications are needed to select a suitable
routing algorithm that matches the required performance.
Most design decisions presented in Chapter 1 were taken to achieve the desired func-
tionality while reducing hardware complexity (area constraints); because of that, many
control tasks were left for high-level implementation, either at the NIC or at the processor
level. Contrary to studies such as [31], which allow multicast-capable routers,
the model developed here can only process unicast transactions at a low level (routers) or
broadcast transactions at a higher level (NIC). If multicast support is required, additional
logic can be integrated inside the NICs; router modifications are far more complex but can
also be added.
Implementation statistics of the VHDL router model suggest that it is possible to syn-
thesize more than one router on FPGAs, though each still consumes considerable area,
especially RAM blocks; reducing virtual channel depth would allow more routers on the
same chip, so that a NoC of realistic size can be implemented.
3.1 Significance of the Result
The intent of this work was mainly to develop and validate a high-level model of a NoC that
allows implementation of parallel algorithms targeted to this platform; although pure C++
code can be used for software development, there are other open-source tools on the market
that can also be integrated with the libraries created here. Among the most common,
Open Virtual Platforms (OVP) from Imperas is a collection of high-level descriptions of
processors and hardware modules that can be easily integrated with TLM models such as
the one created here. More detailed, almost cycle-accurate simulations are possible with OVP
if needed.
Even though TLM modelling is becoming a common design practice, to the best of our
knowledge, no TLM 2.0 model of an entire NoC has been proposed; there are approaches
such as Noxim [33] that use SystemC to simulate a customizable mesh network of routers,
but they do not implement the TLM 2.0 standard. Another advantage of this platform is that
it provides both NICs and routers, which makes it an immediate option for embedded
software development.
Versatility and high reconfiguration capability are also important characteristics of
our NoC model. Default timing parameters can be changed with minor effort and can be
set to match real hardware constraints if needed. Apart from that, if the MPI approach
does not suit a design's specifications, a new NIC model can be integrated with the router
network provided it follows the indications in Section 2.2.1. A final remarkable aspect of this
work is that more complex statistics, such as average message blocking time, most used paths,
throughput, etc., can be extracted from the model with minor additions to the source code.
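As an illustration of how little code such a statistic requires, a minimal counter hook might look as follows (the names `Stats` and `hit` are hypothetical, not the actual library API):

```cpp
#include <map>
#include <string>

// Hypothetical statistics hook: named counters that any point in the model
// can increment with a single added line. Code that never touches the
// simulator or these hooks leaves the statistics unchanged.
struct Stats {
    std::map<std::string, long> counters;

    // Record one occurrence of the named event (e.g. a blocked message).
    void hit(const std::string& name) { ++counters[name]; }

    // Read a counter; unknown names report zero rather than failing.
    long get(const std::string& name) const {
        auto it = counters.find(name);
        return it == counters.end() ? 0 : it->second;
    }
};
```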
The router developed was designed for torus-topology NoCs and therefore has 5 in-
put/output ports; nonetheless, mesh topologies are straightforward to obtain by modifying
the routing algorithms: setting them to avoid sending data beyond the limits of the mesh, i.e.
impeding the use of turnaround links, will automatically change the topology from torus to
mesh. If another topology like those presented in Section 1.2.1 is required, the base router
can always be taken as a starting point.
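The torus-to-mesh degradation can be illustrated with a one-dimensional step of the routing decision (a hedged sketch, not the VHDL router): with wraparound links enabled the shorter direction is taken; disabling them leaves only direct, mesh-style links.

```cpp
enum Dir { EAST, WEST, LOCAL };

// One-dimensional step of a dimension-ordered routing decision on one row
// (or column) of width nodes. With wraparound (torus) the shorter of the
// two directions is chosen; without it (mesh), only the direct link exists.
Dir next_dir(int cur, int dst, int width, bool wraparound) {
    if (cur == dst) return LOCAL;
    if (!wraparound)                         // mesh: never cross the edge
        return dst > cur ? EAST : WEST;
    int fwd = (dst - cur + width) % width;   // hops going east
    return fwd <= width - fwd ? EAST : WEST; // torus: take the shorter path
}
```

For example, routing from node 0 to node 3 in a 4-wide row goes west over the turnaround link on a torus, but east across three hops on a mesh.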
Finally, another contribution to the state of the art in NoC and specifically NIC
design is that, through an MPI abstraction, a hardware module was created for end-to-end
flow control; approaches that use MPI as a software layer for NoC programming, such as [34],
do not synthesize it into the NIC module but implement it at a high level. Which approach is
better remains undetermined, and benchmarks need to be conducted for better insight
on this.
3.2 Future Work
From the beginning of this work, it was stated that when creating complex platforms such
as NoCs, it is necessary to co-develop software applications to validate
their correct behaviour; here only a high-level TLM 2.0 model and a VHDL model of a NoC
were constructed, and software applications integrated with the SystemC
description are still lacking. Although typical traffic and load-balancing studies were applied,
only real software implementations can demonstrate the validity of the results shown.
Software engineering faced the problem of distributed computing years ago and has been
dealing with it for a long time; thanks to that work, concepts such as shared and distributed
memory models are now well understood and have been addressed through APIs like MPI and
OpenMP. Hardware engineers are starting to raise abstraction levels for embedded design
and are facing the same problems as software engineers; this means that a deeper integration
between both branches can lead to better design strategies, such as the ones required for this
kind of platform. This work's objective was to implement MPI support; however, MPI is
a standard created more than 10 years ago, which suggests that far better solutions
may be available but remain unknown to the hardware community.
A router VHDL model was provided along with this work, and a NIC design was mostly
described as well; future work should include a hardware implementation to verify
its correctness at that level; integration with 32-bit-compatible processors would
complete such verification.
References
[1] Salminen et al., "Survey of Network on Chip Proposals," OCP-IP White Paper, OCP-
IP, 2008.
[2] A. Agarwal, C. Iskander and R. Shankar, "Survey of Network on Chip Architectures
and Contributions," Journal of Engineering, Computing and Architectures, vol. 3,
issue 1, 2009.
[3] K. Popovici and A. Jerraya, "Virtual Platforms in Sys-
tem-on-Chip Design," 47th Design Automation Conference.