Exploiting the Dynamic Partial Reconfiguration on NoC-Based FPGA
Amr Hassan1, 2, Hassan Mostafa2, 3, Hossam A. H. Fahmy2, Yehea Ismail3
1Mentor Graphics Corporation
2Electronics and Communications Engineering Department, Cairo University, Giza 12613, Egypt
3Center for Nano-electronics & Devices, American University in Cairo & Zewail City for Science and Technology, Cairo, Egypt
Abstract—Dynamic Partial Reconfiguration (DPR) of SRAM-based Field Programmable Gate Arrays (FPGAs) has become a feature demanded by many applications for its ability to add flexibility at runtime. Recently, implementing designs that utilize DPR has become easier than before. However, the techniques that FPGAs use to perform DPR (such as ICAP and JTAG) suffer from a performance bottleneck: only one DPR is allowed at a time. In this paper, we present a state-of-the-art NoC-based FPGA simulator, NoC-DPR, which supports simulation of dynamic partial reconfiguration. The design limitations and performance degradation of using DPR on a NoC-based FPGA are estimated using the NoC-DPR simulator. Experiments are carried out with the simulator to measure the reconfiguration time overhead as the number of simultaneous DPRs on the FPGA fabric increases. It is shown that the reconfiguration time overhead increases exponentially with the number of simultaneous DPRs. However, DPR on a NoC-based FPGA can enhance performance compared to DPR on normal FPGAs, with some trade-offs.
Index Terms—Dynamic Partial Reconfiguration, Network on Chip, Field Programmable Gate Arrays, System analysis and design
I. INTRODUCTION
Dynamic partial reconfiguration (DPR) is a promising feature
for a number of applications mapped onto SRAM-based FPGAs
(Field Programmable Gate Arrays) that have a dynamic nature
at runtime, such as signal processing (including image and
video) and electronic measurement applications.
Our simulator is a command-line tool that models a 2-D mesh
network of routers simulated by NoCTweak [1]. Each node
consists of a Processor Element (PE), a Network Interface (NI),
and an associated router. Each router connects to its four
nearest neighboring routers, forming a 2-D mesh network. Using
the ReChannel [2] library, each PE can be dynamically
reconfigured by a special type of data packet generated by a
designated node (the master node (0,0)). Data packets are
injected into the network through each node's router, routed
through the network of routers by a selected routing algorithm,
and consumed immediately at their destinations.
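The mesh topology and hop-by-hop delivery described above can be illustrated with a minimal sketch. It is not taken from NoCTweak: the function names are hypothetical, and dimension-ordered XY routing is assumed here only as one common choice for the "selected routing algorithm".

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates of a router in the mesh


def mesh_neighbors(node: Node, width: int, height: int) -> List[Node]:
    """Return the routers directly linked to `node` in a width x height mesh."""
    x, y = node
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    # Edge and corner routers have fewer than four neighbors.
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < width and 0 <= cy < height]


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Hop-by-hop path under XY routing (assumed): move along X, then along Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


MASTER: Node = (0, 0)  # the node that issues reconfiguration packets
print(xy_route(MASTER, (2, 1)))
```

In this sketch a reconfiguration packet from the master node to PE (2,1) crosses three links, so the hop count, and with it the network latency added to the reconfiguration time, grows with the distance from the master.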
The main consideration when merging the DPR simulation library
with the NoC simulator is that all NoC modules must be well
defined through a clear hierarchy in SystemC. Consequently, we
had to implement a separate NI instead of the one embedded in
the NoCTweak simulator, so that DPR is performed only on the
core process and not on all node components (PE and NI).
However, latency and throughput values changed due to this
modification, as discussed in detail in the Results and
Discussion section.
A. NoCTweak Simulator
NoCTweak is an open-source 2-D mesh network-on-chip simulator
for early exploration of the performance and energy efficiency
of on-chip networks. The simulator has been developed using
SystemC, a C++ class library, which allows accurate and fast
modeling of concurrent hardware modules at cycle-level
accuracy [1]. The simulator is composed of a hierarchy of
modules (processor (core), network interface (NI), and an
associated router) that implement the different functionalities
of the network and the simulation environment. Each of these
modules has a well-defined interface that facilitates replacing
and customizing module implementations without affecting other
parts of the simulated system.
B. ReChannel Simulation Library
To model the dynamic partial reconfiguration process, we can
simplify it as follows: reconfigurable modules are “activated”
and “deactivated” by conditionally intercepting the
communication between the static and reconfigurable parts of
the design. This is achieved in the ReChannel [2] library
through the concept of switches (portals), as shown in Fig. 1.
Portals allow any SystemC channel to be used in a
reconfigurable context, which makes modeling reconfigurable
systems highly flexible. Because the main modification to the
original system, which is to be extended with
reconfigurability, takes place within the interconnection
between the static and reconfigurable parts of the system, no
changes to existing modules are required. This facilitates
interfacing the reconfigurable parts with the static parts.
Reconfiguration properties, such as configuration times, can be
added to those modules using inheritance and can be chosen
through arguments passed on the command line.
Fig. 1. A portal connecting two reconfigurable modules to a standard SystemC channel
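The interception idea behind portals can be illustrated with a small language-neutral sketch. This is not the ReChannel API: the `Channel` and `Portal` classes and the module names are hypothetical, and the sketch only shows how a switch can forward accesses from the active module while dropping those from deactivated ones.

```python
from typing import List, Optional


class Channel:
    """Stand-in for a static-side communication channel (here, a simple FIFO)."""

    def __init__(self) -> None:
        self.items: List[str] = []

    def write(self, value: str) -> None:
        self.items.append(value)


class Portal:
    """Hypothetical switch between one static channel and several modules.

    Only the currently active module reaches the channel; accesses from
    deactivated modules are intercepted and dropped.
    """

    def __init__(self, channel: Channel) -> None:
        self.channel = channel
        self.active: Optional[str] = None

    def activate(self, module_name: str) -> None:
        self.active = module_name

    def deactivate(self) -> None:
        self.active = None

    def write(self, module_name: str, value: str) -> bool:
        if module_name != self.active:
            return False  # intercepted: this module is not loaded
        self.channel.write(value)
        return True


channel = Channel()
portal = Portal(channel)
portal.activate("filter_a")
portal.write("filter_a", "a0")  # forwarded to the channel
portal.write("filter_b", "b0")  # intercepted, filter_b is not active
portal.deactivate()             # reconfiguration in progress
portal.write("filter_a", "a1")  # intercepted, nothing is active
portal.activate("filter_b")
portal.write("filter_b", "b1")  # forwarded to the channel
print(channel.items)
```

Note that neither module nor the channel needs to know about reconfiguration in this sketch; all of the switching logic lives in the interconnection, which mirrors the property described above.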
IV. RESULTS AND DISCUSSION
Inserting an explicit Network Interface (NI) with an external
decoupling buffer between the PE and the network router affects
network performance, specifically latency and throughput. The
buffer is responsible for storing and synchronizing flits (flow
control units, the minimum unit of a message). We measure the
latency after inserting the NI and compare it to the latency of
NoCTweak, using a network buffer size of 2 flits, at different
flit injection rates and on different network sizes. Fig. 2
shows the difference between the latency of the NoCTweak
simulator and that of the NoC-DPR simulator: latency can exceed
50k cycles in the former, whereas it saturates at 20k cycles in
the latter. This is because the NI of the NoC-DPR simulator
controls packet generation from the PE according to the NoC
state, and therefore exchanges control signals with the PE.
In contrast, the PE in NoCTweak has an infinite buffer, which
stores all generated flits and injects them directly into the
router input buffer without any flow control. Consequently, its
latency does not saturate, unlike NoC-DPR, whose latency
saturates at the lower values of 5k, 6k, and 20k cycles with a
2-flit buffer for 2x2, 3x3, and 14x14 NoC sizes, respectively.
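The contrast between an unbounded PE buffer and an NI that throttles the PE can be reproduced qualitatively with a toy queue model. This is only a sketch under strong assumptions: the `simulate` function, the one-flit-per-cycle offered load, and the fixed drain period are all hypothetical and do not correspond to either simulator's actual implementation.

```python
from collections import deque
from typing import Dict, Optional


def simulate(cycles: int, drain_period: int,
             capacity: Optional[int]) -> Dict[str, int]:
    """Toy PE -> buffer -> router model.

    The PE offers one flit per cycle; the router removes one flit every
    `drain_period` cycles. `capacity=None` models an unbounded PE buffer;
    an integer models an NI buffer that stalls the PE when it is full.
    """
    queue = deque()   # holds the cycle at which each flit was generated
    stalled = 0       # cycles in which the PE was told to wait
    worst_wait = 0    # longest time any delivered flit spent queued
    for cycle in range(cycles):
        if capacity is None or len(queue) < capacity:
            queue.append(cycle)
        else:
            stalled += 1  # backpressure: the PE does not generate a flit
        if cycle % drain_period == drain_period - 1 and queue:
            worst_wait = max(worst_wait, cycle - queue.popleft())
    return {"stalled": stalled, "worst_wait": worst_wait, "queued": len(queue)}


unbounded = simulate(cycles=1000, drain_period=4, capacity=None)
bounded = simulate(cycles=1000, drain_period=4, capacity=2)
print("unbounded PE buffer:", unbounded)
print("2-flit NI buffer:   ", bounded)
```

With the unbounded buffer the queueing delay keeps growing for as long as the simulation runs, whereas the bounded buffer caps the delay at the cost of stalling the PE, which is the qualitative behavior reported for the two simulators.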
On the other hand, the throughput of the NoCTweak simulator
saturates at specific values, as shown in Fig. 3a: because of
the infinite buffer at the PE, the network is loaded with the
maximum number of accepted flits at higher FIR (Flit Injection
Rate of each PE). In the NoC-DPR simulator, however, the PE
stops packet generation when the network is fully loaded. This
results in a slight increase in the peak throughput, after
which the throughput decreases exponentially because the
network is no longer loaded with the maximum number of accepted
flits, as shown in Fig. 3b. The saturated throughput values of
approximately 0.21 and 0.17 in NoCTweak become peak values of
0.22 and 0.175, reached at FIRs of 0.22 and 0.2, in the NoC-DPR
simulator for 2x2 and 3x3 NoC sizes, respectively.
Fig. 2. Average latency for a 2-flit buffer depth for: (a) the NoCTweak simulator with an infinite PE buffer; (b) the NoC-DPR simulator
Our experiment aims to simulate applying DPR on the simulated
NoC-based FPGA using different network sizes and different
numbers of parallel DPRs, so that we can compare them with
respect to Reconfiguration Time (RT). First, we used a Virtex-5
xc5vfx100t FPGA to select different partial reconfiguration
regions. Then we obtained the bitstream size of each region
using the Xilinx ISE tool and, from it, the RT. Afterwards, we
used these values as input to the NoC-DPR simulator for each
experiment. The theoretically estimated RT is calculated using
the Partial Reconfiguration Cost Calculator [10], which uses a
cost model to estimate RT. We assumed that all PEs of the NoC
are different Partial Reconfiguration (PR) regions. Other
network resources, such as routers, NIs, and wires, are
therefore assumed to be hardwired in the FPGA chip layout to
simplify the RT estimate. The main advantage of using a
NoC-based FPGA instead of a normal FPGA is that multiple
simultaneous DPRs can be achieved, since we treat each PE as a
small FPGA with its own reconfiguration control unit, like the
ICAP (Internal Configuration Access Port of Xilinx FPGAs). In
the following, we investigate which NoC sizes are suitable and
how many parallel DPRs can be afforded, and estimate the effect
on RT.
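To make the link between bitstream size and RT concrete, a first-order estimate divides the partial bitstream size by the bandwidth of the configuration port. The sketch below is not the cost model of [10], which accounts for more factors. The function name is hypothetical, the bitstream sizes are invented for illustration, and the defaults assume an ideal 32-bit port clocked at 100 MHz.

```python
def reconfiguration_time_ms(bitstream_bytes: int,
                            port_width_bits: int = 32,
                            port_clock_hz: float = 100e6) -> float:
    """First-order RT: bitstream size divided by configuration-port bandwidth.

    The default width and clock are assumed ideal values, not measurements.
    """
    bytes_per_second = (port_width_bits / 8) * port_clock_hz
    return bitstream_bytes / bytes_per_second * 1e3


# Hypothetical partial bitstream sizes, in bytes, for three PR regions.
regions = {"small": 40_000, "medium": 400_000, "large": 4_000_000}
for name, size in regions.items():
    print(f"{name}: {reconfiguration_time_ms(size):.2f} ms")
```

Under these assumptions a 400 kB partial bitstream takes about 1 ms, so RT scales linearly with the size of the PR region; any NoC transport delay comes on top of this figure.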
Fig. 3. Average throughput for a 2-flit buffer depth for: (a) the NoCTweak simulator with an infinite PE buffer; (b) the NoC-DPR simulator
The different numbers of simultaneous DPRs on the NoC are
compared based on the correspondence between the theoretical
and simulated RT for each number, as illustrated in Fig. 4.
This metric is also a good indicator of the performance
variation from one NoC size to another, as each NoC size has
different throughput and latency values that affect RT. We use
a fixed buffer size of 8 flits and an FIR of 0.1 flits/cycle
for the following experiments. At small network sizes (RT above
1 msec), the gap between theoretical and experimental RT is
narrow. For example, for one DPR at a time, a slight increase
in the gap can be noticed only as the network size increases
beyond 7x7. In contrast, the simulated RT for five simultaneous
DPRs drifts from the theoretical value from the beginning (at
small network sizes), and the gap jumps at larger network
sizes.
Fig. 5 shows the percentage difference between theoretical and
experimental RT using one, two, three, four, and five
simultaneous DPRs. At small NoC sizes, from 2x2 to 9x9, the
difference is always below 50%, because RT is large relative to
any NoC overhead (latency and throughput effects). Starting
from a NoC size of 10x10, on the other hand, NoC latency
becomes noticeable and can affect RT.
As the number of simultaneous DPRs on the NoC increases, the
difference increases markedly. With one DPR at a time, the
percentage does not exceed 35% across all NoC sizes, whereas
with three simultaneous DPRs it can reach up to 100% at a large
NoC of 18x18. This is due to the complexity of controlling
parallel DPRs and the added delay of NoC latency. Furthermore,
we have to prevent other nodes from communicating with the node
currently undergoing PR.
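The difference percentage plotted in Fig. 5 can be read as the relative deviation of the simulated RT from the theoretical RT. The sketch below assumes that definition, since the text does not spell out the formula, and the sample values are hypothetical rather than measured.

```python
def difference_percent(theoretical_rt: float, simulated_rt: float) -> float:
    """Relative deviation of the simulated RT from the theoretical RT, in %.

    Assumed definition: (simulated - theoretical) / theoretical * 100.
    """
    if theoretical_rt <= 0:
        raise ValueError("theoretical_rt must be positive")
    return (simulated_rt - theoretical_rt) / theoretical_rt * 100.0


# Hypothetical (theoretical, simulated) RT pairs in milliseconds.
samples = {"1 DPR": (1.00, 1.30), "3 DPRs": (1.00, 2.00)}
for label, (theo, sim) in samples.items():
    print(f"{label}: {difference_percent(theo, sim):.0f}%")
```

Under this reading, a difference of 100% means that the simulated reconfiguration takes twice as long as the cost model predicts.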
Fig. 4. Comparison between theoretical and simulated reconfiguration time of multiple DPRs on NoC-based FPGAs
Fig. 5. Difference percentage of reconfiguration time using multiple DPRs on NoC-based FPGAs
V. DESIGN RECOMMENDATIONS
Performing DPR on NoC-based FPGAs requires taking the
following considerations into account:
• A general-purpose NoC simulator cannot be used to simulate a
DPR application. We have to guarantee that while one
processing element (PE) is performing DPR, the other PEs are
prevented from sending to or receiving from this PE until
the DPR is finished.
• In our simulator, we assume that one PE (0,0) is the master
of the DPR process and is responsible for sending
reconfiguration packets. When the target PE receives a
reconfiguration packet, it starts to perform the
reconfiguration. When the DPR is finished, the target PE
sends an acknowledge packet back to the master node so that
all other nodes can be informed by broadcast that the target
PE is available again and any node can communicate with it.
• The clear advantage of using NoC-based FPGAs instead of
normal FPGAs is the ability to perform multiple DPRs
simultaneously, which was studied in the previous section.
• The recommended network size and number of simultaneous
DPRs can be estimated according to the desired
reconfiguration time; see Fig. 4 and Fig. 5. RT is the main
parameter, ahead of the other NoC parameters (latency and
throughput), that directly affects the deviation between
theoretical and simulated results. For instance, if the
reconfiguration time can be 100 msec, we can use up to five
simultaneous DPRs and any suitable network size from 2x2 to
9x9. In contrast, if the RT limit is 1 msec, the optimal
choice is three simultaneous DPRs and a network size larger
than 13x13.
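The master/acknowledge handshake and the blocking rule from the first two recommendations can be sketched as a small bookkeeping class. This is a hypothetical illustration, not the NoC-DPR implementation: the `DprCoordinator` name, its methods, and the log format are invented here, and packet transport over the NoC is abstracted away.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]


class DprCoordinator:
    """Tracks which PEs are being reconfigured, from the master's viewpoint."""

    def __init__(self, master: Node = (0, 0)) -> None:
        self.master = master
        self.reconfiguring: Set[Node] = set()
        self.log: List[str] = []

    def start_dpr(self, target: Node) -> None:
        # Master sends a reconfiguration packet; the target becomes unavailable.
        self.reconfiguring.add(target)
        self.log.append(f"reconfig {self.master}->{target}")

    def acknowledge(self, target: Node) -> None:
        # Target reports completion; availability is announced to all nodes.
        self.reconfiguring.discard(target)
        self.log.append(f"ack {target}->{self.master}")
        self.log.append(f"broadcast {target} available")

    def may_send(self, src: Node, dst: Node) -> bool:
        # Traffic to or from a PE under reconfiguration is held back.
        return (src not in self.reconfiguring
                and dst not in self.reconfiguring)


coord = DprCoordinator()
coord.start_dpr((1, 1))
print(coord.may_send((0, 1), (1, 1)))  # blocked while (1, 1) reconfigures
coord.acknowledge((1, 1))
print(coord.may_send((0, 1), (1, 1)))  # allowed again after the broadcast
```

Because the set can hold several targets at once, the same bookkeeping covers multiple simultaneous DPRs; each additional target adds reconfiguration, acknowledge, and broadcast traffic, which is consistent with the growing overhead observed in the previous section.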
VI. CONCLUSION
In this paper, a state-of-the-art NoC-DPR simulator is
presented and used to derive recommendations on how to use DPR
on NoC-based FPGAs and on the optimal NoC size, as stated in
the previous section.
NoC-based FPGAs enhance reconfiguration capabilities because
their multiple PEs can perform multiple DPRs at a time.
However, this requires additional resources, such as control
units and routers. The time performance of DPR with a NoC is
therefore better than that of DPR on a normal FPGA for
applications that prioritize reconfiguration time over area
overhead. Nevertheless, the number of simultaneous DPRs cannot
exceed a certain limit for a given NoC size: beyond it, no
further reduction in reconfiguration time is gained, and only
resource overhead is added.
ACKNOWLEDGMENT
This research was funded by NTRA, ITIDA, Cairo Univer-
sity, Zewail City of Science and Technology, AUC, the STDF,
Intel, Mentor Graphics, and MCIT.
REFERENCES
[1] Anh T. Tran and Bevan M. Baas. NoCTweak: a Highly Parameterizable Simulator for Early Exploration of Performance and Energy of Networks On-Chip, Dept. Electr. Comput. Eng., Univ. California, July 2012
[2] Raabe A., Hochgurtel S., Zachmann G., and Anlauf J. K. ReChannel: Describing and Simulating Reconfigurable Hardware in SystemC, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.13 n.1, p.1-18, January 2008
[3] Adriatic Consortium. 2002. Advanced methodology for designing reconfigurable SoC and application-targeted IP-entities in wireless communications webpage. http://www.imec.be/adriatic
[4] Benkhermi, I., Benkhelifa, A., Chillet, D., Pillement, S., Prévotet, J.-C., and Verdier, F. 2005. System-Level modelling for reconfigurable SoCs. In the 20th Conference on Design of Circuits and Integrated Systems (DCIS), Lisboa, Portugal
[5] Alisson V. De Brito, Elamr U. K. Melcher, Wilson Rosas, An open-source tool for simulation of partially reconfigurable systems using SystemC, Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, p.434, March 02-03, 2006
[6] Schallenberg, A., Oppenheimer, F., and Nebel, W. 2004. Designing for dynamic partially reconfigurable FPGAs with SystemC and OSSS. In the Forum on Specification and Design Languages, Lille, France
[7] N. Jiang et al., “BookSim interconnection network simulator,” Online, https://nocs.stanford.edu/cgibin/trac.cgi/wiki/Resources/BookSim
[8] R. Palesi et al., “Noxim - the noc simulator,” Online, http://noxim.sourceforge.net/
[9] A. Ehliar and D. Liu, "An FPGA based open source network-on-chip architecture", 17th International Conference on Field Programmable Logic and Applications, FPL, pp. 800-803, IEEE, Amsterdam, Holland, 2007
[10] Kyprianos Papadimitriou, Apostolos Dollas, Scott Hauck, Performance of partial reconfiguration in FPGA systems: A survey and a cost model, ACM Transactions on Reconfigurable Technology and Systems (TRETS), v.4 n.4, p.1-24, December 2011