-
Conger #233 MAPLD 2005
NARC: Network-Attached Reconfigurable Computing for High-performance, Network-based Applications
Chris Conger, Ian Troxel, Daniel Espinosa, Vikas Aggarwal, and
Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida
-
Outline
Introduction
NARC Board Architecture, Protocols
Case Study Applications
Experimental Setup
Results and Analysis
Pitfalls and Lessons Learned
Conclusions
Future Work
-
Introduction: Network-Attached Reconfigurable Computer (NARC) Project
- Inspiration: network-attached storage (NAS) devices
- Core concept: investigate challenges and alternatives for enabling direct network access to, and control over, reconfigurable computing (RC) devices
- Method: prototype a hardware interface and software infrastructure; demonstrate proof of concept for the benefits of network-attached RC resources

Motivations for the NARC project include (but are not limited to) applications such as:
- Network-accessible processing resources
  - Generic network RC resource, a viable alternative to server and supercomputer solutions
  - Power and cost savings over server-based FPGA cards are key benefits
  - No server needed to host the RC device; infrastructure provided for robust operation and interfacing with users
  - A performance increase over existing RC solutions is not a primary goal of this approach
- Network monitoring and packet analysis
  - Easy attachment; unobtrusive, fast traffic gathering and processing
  - Network intrusion and attack detection, performance monitoring, active traffic injection
  - Direct network connection of the FPGA can enable wire-speed processing of network traffic
- Aircraft and advanced munitions systems
  - Standard Ethernet interface eases the addition and integration of RC devices in aircraft and munitions systems
  - Low weight and power are also attractive characteristics of a NARC device for such applications
-
Envisioned Applications
- Aerospace & military applications
  - Modular, low-power design lends itself well to military craft and munitions deployment
  - FPGAs providing high-performance radar, sonar, and other computational capabilities
- Scientific field operations
  - Quickly provide first-level estimations in the field for geologists, biologists, etc.
- Field-deployable covert operations
  - Completely wireless device enabled through battery and WLAN
  - Passive network monitoring applications; active network traffic injection
- Distributed computing
  - Cost-effective, RC-enabled clusters or cluster resources
  - Cluster NARC devices at a fraction of the cost, power, and cooling
- Cost-effective intelligent sensor networks
  - Use FPGAs in close conjunction with sensors to provide pre-processing functions before network transmission
- High-performance network technologies
  - Fast Ethernet may be replaced by any network technology: Gig-E, InfiniBand, RapidIO, or proprietary communication protocols
-
NARC Board Architecture: Hardware
- ARM9 network control with FPGA processing power (see Figure 1)
- Prototype design consists of two boards, connected via cable:
  - Network interface board (ARM9 processor + peripherals)
  - Xilinx development board(s) (FPGA)
- Network interface peripherals include:
  - Layer-2 network connection (hardware PHY + MAC)
  - External memory: SDRAM and Flash
  - Serial port (debug communication link)
  - FPGA control and data lines
- NARC hardware specifications:
  - ARM-core microcontroller, 1.8V core, 3.3V peripheral
    - 32-bit RISC, 5-stage pipeline, in-order execution
    - 16KB data cache, 16KB instruction cache
    - Core clock speed 180MHz, peripheral clock 60MHz
    - On-chip Ethernet MAC layer with DMA
  - External memory, 3.3V
    - 32MB SDRAM, 32-bit data bus
    - 2MB Flash, 16-bit data bus
    - Port available for additional 16-bit SRAM devices
  - Ethernet transceiver, 3.3V
    - DM9161 PHY-layer transceiver
    - 100Mbps, full-duplex capable
    - RMII interface to MAC

Figure 1 – Block diagram of NARC device
-
NARC Board Architecture: Software
- ARM processor runs Linux kernel 2.4.19
  - Provides TCP(UDP)/IP stack, resource management, threaded execution, and a Berkeley Sockets interface for applications
  - Configured and compiled with drivers specifically for our board
- Applications written in C, compiled using GCC for ARM (see Figure 2)
- NARC API: low-level driver function library for basic services
  - Initialize and configure on-chip peripherals of the ARM-core processor
  - Configure the FPGA (SelectMAP protocol)
  - Transfer data to/from the FPGA, manipulate control lines
  - Monitor and initiate network traffic
- NARC protocol for job exchange (from a remote workstation)
  - NARC board application and client application must follow standard rules and procedures for responding to requests from a user
  - User appends a small header onto the data (if any) containing information about the request before sending it over the network (see Figure 3)
- Bootstrap software in on-board Flash automatically loads and executes on power-up
  - Configures clocks, memory controllers, I/O pins, etc.
  - Contacts a TFTP server running on the network, downloads Linux and a ramdisk
  - Boots Linux, automatically executes the NARC board software contained in the ramdisk
- Optional serial interface through HyperTerminal for debugging/development
Figure 3 – Request header field definitions
Figure 2 – Software development process (main.c, util.c, and narc.h are built via a Makefile with arm-linux-gcc into the NARC board application, delivered in the ramdisk alongside the Linux kernel to the NARC board; client.c is built with gcc into the client application on the user workstation)
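The job-exchange protocol can be made concrete with a small sketch. The header layout below is hypothetical (the actual fields are defined in Figure 3, and only RTYPE is named on these slides), but it shows the general pattern: a small fixed-size header, serialized in network byte order, prepended to any request data before it is sent over the socket.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htons/htonl */

/* Hypothetical request header; the real field layout is defined in
 * Figure 3, and only the RTYPE field is named on these slides. */
struct narc_req_hdr {
    uint8_t  rtype;        /* request type, e.g. 01 = configure FPGA */
    uint8_t  reserved;
    uint16_t flags;
    uint32_t payload_len;  /* bytes of request data that follow */
};

/* Serialize the header into a byte buffer in network byte order,
 * ready to be written to the socket ahead of the payload. */
size_t narc_pack_hdr(const struct narc_req_hdr *h, uint8_t buf[8])
{
    uint16_t flags = htons(h->flags);
    uint32_t len   = htonl(h->payload_len);

    buf[0] = h->rtype;
    buf[1] = h->reserved;
    memcpy(buf + 2, &flags, sizeof flags);
    memcpy(buf + 4, &len, sizeof len);
    return 8;   /* header bytes written */
}
```

A client would call narc_pack_hdr(), send() the 8 header bytes, then send() the payload; the NARC board application reverses the process with ntohs/ntohl.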
-
NARC Board Architecture: FPGA Interface
- Data communicated to/from the FPGA by means of unidirectional data paths
  - 8-bit input port, 8-bit output port, 8 control lines (Figure 4)
  - Control lines manage data transfer and also drive configuration signals
  - Data transferred one byte at a time; full-duplex communication possible
- Control lines include the following signals:
  - Clock – software-generated signal to clock data on the data ports
  - Reset – reset signal for the interface logic in the FPGA
  - Ready – indicates the device is ready to accept another byte of data
  - Valid – indicates the device has placed valid data on the port
  - SelectMAP – all signals necessary to drive SelectMAP configuration
Figure 4 – FPGA interface signal diagram (ARM Out[0:7]/In[0:7] cross-connected to FPGA In[0:7]/Out[0:7]; a_valid/f_ready and f_valid/a_ready handshake pairs; clock and reset lines; SelectMAP port D[0:7] with PROG, INIT, CS, WRITE, and DONE)
- FPGA configuration through the SelectMAP protocol
  - Fastest configuration option for Xilinx FPGAs; protocol emulated using GPIO pins of the ARM
  - NARC board enables remote configuration and management of the FPGA
- User submits a configuration request (RTYPE = 01), along with a bitfile and a function descriptor
  - Function descriptor is an ASCII string: a formatted list of functions with associated RTYPE definitions
  - ARM halts and configures the FPGA, then stores the descriptor in a dedicated RAM buffer for user queries
- All FPGA designs must restrict use of the SelectMAP pins after configuration
  - Some signals are shared between the SelectMAP port and the FPGA-ARM link
  - Once configured, SelectMAP pins must remain tri-stated and unused
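The byte-at-a-time transfer over this interface can be modeled in plain C. In the sketch below the wires are simulated variables and the FPGA is reduced to a byte-latching stub, so this illustrates only the ready/valid handshake described on this slide, not actual board or GPIO register code.

```c
#include <stdint.h>

/* Simulated wires; on the real board these are ARM GPIO lines. */
static uint8_t out_port;        /* ARM -> FPGA data, Out[0:7]        */
static int     a_valid;         /* ARM has placed valid data         */
static int     f_ready = 1;     /* FPGA ready to accept another byte */
static uint8_t fpga_rx_buf[64]; /* what the simulated FPGA latched   */
static int     fpga_rx_count;

/* Simulated FPGA interface logic: on a rising clock edge with
 * a_valid asserted, latch the byte on the data port. */
static void fpga_clock_edge(void)
{
    if (a_valid)
        fpga_rx_buf[fpga_rx_count++] = out_port;
}

/* Send one byte, honoring the handshake: wait until the FPGA is
 * ready, drive the data port, assert valid, pulse the clock. */
static void narc_send_byte(uint8_t b)
{
    while (!f_ready)
        ;                   /* spin until FPGA can accept a byte */
    out_port = b;           /* place byte on Out[0:7]            */
    a_valid  = 1;           /* assert a_valid                    */
    fpga_clock_edge();      /* rising edge of the software clock */
    a_valid  = 0;           /* falling edge; deassert valid      */
}

static void narc_send(const uint8_t *buf, int len)
{
    for (int i = 0; i < len; i++)
        narc_send_byte(buf[i]);
}
```

Widening out_port to 16 or 32 bits, as proposed later under Pitfalls and Lessons Learned, changes only the port declarations here, which is why the approach scales at the cost of pin count.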
-
Results and Analysis: Raw Performance
- FPGA interface I/O throughput (Table 1)
  - 1KB of data transferred over the link and timed; measured using hardware methods
  - Logic analyzer used to capture the raw link data rate: data sent divided by the time from first clock to last clock (see Figure 9)
  - Performance lower than desired for the prototype; the handshake protocol may add unnecessary overhead
  - Widening the data paths and optimizing the software routine will significantly improve FPGA I/O performance
- Network throughput (Table 2)
  - Measured using the Linux network benchmark Iperf
  - NARC board located on an arbitrary switch within the network; the application partner is a user workstation
  - Transfers as much data as possible in 10 seconds; throughput calculated as data sent divided by 10 seconds
  - Performed two experiments, with the NARC board serving as client in one run and server in the other
  - Both local and remote Iperf partners used (remote location ~400 miles away, at Florida State University)
  - Network interface achieves reasonably good bandwidth efficiency
- External memory throughput (Table 3)
  - 4KB transferred to external SDRAM, both read and write; measurements again taken using the logic analyzer
  - Memory throughput sufficient to provide wire-speed buffering of network traffic
  - On-chip Ethernet MAC has DMA to this SDRAM, which should help alleviate the I/O bottleneck between ARM and FPGA
Table 1 – FPGA interface I/O performance
  Mb/s             Input   Output
  Logic Analyzer   6.08    6.12

Table 2 – Network throughput
  Mb/s            Local Network   Remote Network (WAN)
  NARC-Server     75.4            4.9
  Server-Server   78.9            5.3

Table 3 – External SDRAM throughput
  Mb/s             Read    Write
  Logic Analyzer   183.2   160

Figure 9 – Logic analyzer timing
-
Results and Analysis: Raw Performance
- Reconfiguration speed
  - Includes time to transfer the bitfile over the network, plus time to configure the device (transfer the bitfile from ARM to FPGA), plus time to receive an acknowledgement
  - Our design currently completes a user-initiated reconfiguration request with a 1.2MB bitfile in 2.35 s
- Area/resource usage of the minimal wrapper for a Virtex-II Pro FPGA
  - Statistics on the resource requirements for a minimal design providing the required link control and data transfer in an application wrapper are presented below
  - Design implemented on an older Virtex-II Pro FPGA; the numbers indicate requirements for the wrapper only, with unused resources available for user applications
  - Extremely small footprint, and the footprint will be even smaller on a larger FPGA

Device utilization summary:
--------------------------------------------------------
Selected Device : 2vp20ff1152-5
Number of Slices:            143 out of  9280    1%
Number of Slice Flip Flops:  120 out of 18560    0%
Number of 4 input LUTs:      238 out of 18560    1%
Number of bonded IOBs:        24 out of   564    4%
Number of BRAMs:               8 out of    88    9%
Number of GCLKs:               1 out of    16    6%
-
Case Study Applications – Clustered RC Devices: N-Queens
- HPC application demonstrating the NARC board's role as a generic compute resource
- Application characterized by minimal communication and heavy computation within the FPGA
- NARC version of N-Queens adapted from a previously implemented application for the PCI-based Celoxica RC1000 board housed in a conventional server
- The N-Queens algorithm is part of the DoD high-performance computing benchmark suite and is representative of select military and intelligence processing algorithms
- Exercises the functionality of the various mechanisms and protocols developed for job submission, data transfer, etc. on NARC
- User specifies a single parameter N; upon completion, the algorithm returns the total number of possible solutions
- The purpose of the algorithm is to determine how many possible arrangements of N queens there are on an N × N chess board such that no queen may attack another (see Figure 5)
- Results are presented from both NARC-based and RC1000-based execution for comparison

Figure 5 – Possible 8×8 solution (figure c/o Jeff Somers)
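For reference, the computation offloaded to the FPGA can be stated as a simple backtracking count. The bitmask sketch below is a common software baseline, not the RC1000 or NARC hardware implementation; it counts placements row by row, tracking attacked columns and diagonals.

```c
#include <stdint.h>

/* Count N-queens solutions by depth-first backtracking. cols, d1,
 * and d2 are bitmasks of columns and diagonals attacked by queens
 * already placed in rows 0..row-1. */
static long nq_count(int n, int row, uint32_t cols, uint32_t d1, uint32_t d2)
{
    if (row == n)
        return 1;                           /* all queens placed */
    long total = 0;
    uint32_t open = ~(cols | d1 | d2) & ((1u << n) - 1);
    while (open) {
        uint32_t bit = open & (0u - open);  /* lowest open column */
        open -= bit;
        total += nq_count(n, row + 1,
                          cols | bit,       /* column now attacked */
                          (d1 | bit) << 1,  /* shift "/" diagonals */
                          (d2 | bit) >> 1); /* shift "\" diagonals */
    }
    return total;
}

long n_queens(int n)
{
    return nq_count(n, 0, 0, 0, 0);
}
```

n_queens(8) returns the 92 solutions of the classic 8×8 board, one of which is shown in Figure 5.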
-
Case Study Applications – Network Processing: Bloom Filter
- This application performs passive packet analysis through use of a classification algorithm known as a Bloom filter
- Application characterized by constant, bursty communication patterns
  - Most communication is Rx over the network and transmission to the FPGA
  - The filter may be programmed or queried
- NARC device copies all received network frames to memory; the ARM parses the TCP/IP header and sends it to the Bloom filter for classification
  - User can send programming requests, which include a header and a string to be programmed into the filter
  - User can also send result-collection requests, which cause a formatted results packet to be sent back to the user
  - Otherwise, the application runs continuously, querying each header against the current Bloom filter and recording match/header pair information
- A Bloom filter works by applying multiple hash functions to a given bit string, each hash function yielding indices into a separate bit vector (see Figure 6)
  - To program: hash the input string and set the resulting bit positions to 1
  - To query: hash the input string; if all resulting bit positions are 1, the string matches
- Implemented on a Virtex-II Pro FPGA
  - Uses a slightly larger, but ultimately more effective, application wrapper (see Figure 7)
  - Larger FPGA selected to demonstrate interoperability with any FPGA
Figure 6 – Bloom Filter algorithmic architecture
Figure 7 – Bloom Filter implementation architecture
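The program/query operations described above can be sketched in C. As on this slide, each hash function indexes its own bit vector; the vector size and salted FNV-1a hashes below are illustrative choices, not the parameters of the FPGA design.

```c
#include <stdint.h>

#define BF_BITS   8192   /* bits per vector (illustrative size)   */
#define BF_HASHES 4      /* k hash functions, one bit vector each */

static uint8_t bf_vec[BF_HASHES][BF_BITS / 8];

/* Salted FNV-1a hash standing in for the hardware hash functions. */
static uint32_t bf_hash(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h % BF_BITS;
}

/* Program: set the bit each hash function selects in its own vector. */
void bf_program(const char *s)
{
    for (uint32_t k = 0; k < BF_HASHES; k++) {
        uint32_t idx = bf_hash(s, k * 0x9e3779b9u);
        bf_vec[k][idx / 8] |= (uint8_t)(1u << (idx % 8));
    }
}

/* Query: a string matches only if every selected bit is set. */
int bf_query(const char *s)
{
    for (uint32_t k = 0; k < BF_HASHES; k++) {
        uint32_t idx = bf_hash(s, k * 0x9e3779b9u);
        if (!(bf_vec[k][idx / 8] & (1u << (idx % 8))))
            return 0;
    }
    return 1;
}
```

A query can return a false positive if unrelated strings happen to set all of the queried bits, but never a false negative, which is what makes Bloom filters suitable for wire-speed pre-classification.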
-
Experimental Setup
N-Queens: clustered RC devices
- NARC device located on an arbitrary switch in the network
- User interfaces through a client application on a workstation and requests the N-Queens procedure
- Figure 8 illustrates the experimental environment
  - Client application records the time required to satisfy the request
  - Power supply measures the current draw of the active NARC device
- N-Queens also implemented on an RC-enabled server equipped with a Celoxica RC1000 board
  - Client-side function call to the NARC board replaced with a function call to the RC1000 board in a local workstation, with the same timing measurement
  - Comparison offered in terms of performance, power, and cost
Figure 8 – Experimental environment (user workstation and NARC devices attached to an Ethernet network, alongside RC-enabled servers)

Bloom Filter: network processing
- Same experimental setup as the N-Queens case study
- Software on the ARM co-processor captures all Ethernet frames
  - Only packet headers (TCP/IP) are passed to the FPGA
  - Data is continuously sent to the FPGA as packets arrive over the network
- By attaching the NARC device to a switch, only limited packets can be captured
  - Only broadcast packets and packets destined for the NARC device can be seen
  - A dual-port device could be inserted in-line with a network link to monitor all flow-through traffic
-
Results and Analysis: N-Queens Case Study
- First, consider an execution-time comparison between our NARC board and a PCI-based RC card (see Figures 10a and 10b)
  - Both FPGA designs clocked at 50MHz; the performance difference between the devices is minimal
  - Being able to match the performance of a PCI-based card is a resounding success!
- Power consumption and cost of NARC devices are drastically lower than those of a server + RC-card combination
  - Multiple users may share a NARC device, while PCI-based cards are somewhat fixed in an individual server
- Power consumption calculated using the following method:
  - Three regulated power supplies exist in the complete NARC device (network interface + FPGA board): 5V, 3.3V, and 2.5V
  - Current draw from each supply was measured; power consumption is calculated as the sum of the V×I products of all three supplies
Figure 10 – Performance comparison between NARC board and PCI-based RC card on server: N-Queens execution time for (a) small board sizes (N = 5-10, both devices under 0.05 s) and (b) large board sizes (N = 11-14, up to ~70 s), NARC vs. RC-1000
-
Results and Analysis: N-Queens Case Study
- Figure 11 summarizes the performance ratio of N-Queens between the NARC and RC-1000 platforms
- Table 4 summarizes cost and power statistics
  - Unit price shown excludes the cost of the FPGA; FPGA costs offset when compared against another device
  - Price shown includes PCB fabrication and component costs
- Approximate power consumption is drastically less than a server + RC-card combination
  - Power consumption of a server varies depending on the particular hardware; typical servers operate off of 200-400W power supplies
- See Figure 12 for an example of the approximate power consumption calculation

Table 4 – Price and power figures for NARC device
  Cost per unit (prototype)     $175.00
  Approx. power consumption     3.28 W
Figure 12 – Power consumption calculation
  P = (5V)(I_5.0) + (3.3V)(I_3.3) + (2.5V)(I_2.5)
  I_5.0 ≈ 0.2A; I_3.3 ≈ 0.49A; I_2.5 ≈ 0.27A
  P = (5)(0.2) + (3.3)(0.49) + (2.5)(0.27) ≈ 3.28W
Figure 11 – NARC/RC-1000 performance ratio for N = 5-14, plotted against the equivalency line (ratio = 1)
-
Results and Analysis: Bloom Filter
- Passive, continuous network traffic analysis
- Wrapper design slightly larger than the minimal wrapper previously used with N-Queens
  - Still a small footprint on chip; the majority of the FPGA remains for the application
  - Maximum wrapper clock frequency of 183MHz should not limit the application clock if in the same clock domain
- Packets received over the network link are parsed by the ARM, with the TCP/IP header saved in a buffer
  - Headers are sent one at a time as query requests to the Bloom filter (FPGA); when a query finishes, another header is de-queued if available
- User may query the NARC device at any time for a results update or to program a new pattern
Device utilization summary:
--------------------------------------------------------
Selected Device : 2vp20ff1152-5
Number of Slices:           1174 out of  9280   13%
Number of Slice Flip Flops: 1706 out of 18560    9%
Number of 4 input LUTs:     2032 out of 18560   11%
Number of bonded IOBs:        24 out of   564    4%
Number of BRAMs:               9 out of    88   10%
Number of GCLKs:               1 out of    16    6%

Figure 13 – Device utilization statistics for Bloom Filter design
- Figure 13 shows resource usage for the Virtex-II Pro FPGA
- Maximum clock frequency of 113MHz
  - Not affected by the wrapper constraint
  - Computation speed significantly faster than the FPGA-ARM link communication speed
- FPGA-side buffer will not fill up; headers are processed before the next header is transmitted to the FPGA
- ARM-side buffer may fill up under heavy traffic loads; the 32MB ARM-side RAM provides a large buffer
-
Pitfalls and Lessons Learned
- FPGA I/O throughput capacity remains a persistent problem
  - One motivation for designing custom hardware is to remove the typical PCI bottleneck and provide wire-speed network connectivity for the FPGA
  - The under-provisioned data path between the FPGA and the network interface restricts the performance benefits of our prototype design
  - Fortunately, this problem may be solved through a variety of approaches:
    - Wider data paths (16-bit, 32-bit) double or quadruple throughput, at the expense of a higher pin count
    - Use of a higher-performance co-processor capable of faster I/O switching frequencies
    - An optimized data transfer protocol
- Having a co-processor in addition to the FPGA to handle the network interface is vital to the success of our approach
  - Required in order to permit initial remote configuration of the FPGA, as well as additional reconfigurations upon user request
  - Offloading the network stack, basic request handling, and other maintenance-type tasks from the FPGA saves a significant number of valuable slices for user designs
  - Drastically eases interfacing with the user application on a networked workstation
  - Provides an active co-processor for FPGA applications, e.g. parsing network packets as in the Bloom filter application
-
Conclusions
- A novel approach to providing FPGAs with standalone network connectivity has been prototyped and successfully demonstrated
  - Investigated issues critical to providing remote management of standalone NARC resources
  - Proposed and demonstrated solutions to discovered challenges
  - Performed a pair of case studies with two distinct, representative applications for a NARC device
- Network-attached RC devices offer potential benefits for a variety of applications
  - Impressive cost and power savings over server-based RC processing
  - Independent NARC devices may be shared by multiple users without being moved
  - Tightly coupled network interface enables the FPGA to be used directly in the path of network traffic for real-time analysis and monitoring
- Two issues that are proving to be a challenge to our approach:
  - Data latency in FPGA communication
  - Software infrastructure required to achieve a robust standalone RC unit
- While the prototype design achieves relatively good performance in some areas and limited performance in others, this is acceptable for a concept demonstration
  - Fairly complex board design; architecture and software enhancements are in development
  - As proof of the "NARC" concept, an important project goal was achieved in demonstrating an effective and efficient infrastructure for managing NARC devices
-
Future Work
- Expansion of network processing capabilities
  - Further development of the packet filtering application
  - More specific and practical activity or behavior sought from network traffic
  - Analyze streaming packets at or near wire-speed rates
- Expansion of the Ethernet link to a 2-port hub
  - Permits transparent insertion of the device into a network path
  - Provides easier access to all packets in a switched IP network
- Merging the FPGA with the ARM co-processor and network interface into one device
  - The ultimate vision for the NARC device
  - Will restrict the number of different FPGAs that may be supported, according to the FPGA socket/footprint chosen for the board
  - Increased difficulty in PCB design
- Expansion to Gig-E and other network technologies
  - Fast Ethernet was targeted for the prototyping effort and concept demonstration
  - A true high-performance device should support Gigabit Ethernet
  - Other potential technologies include (but are not limited to) InfiniBand and RapidIO
- Further development of the management infrastructure
  - Need for more robust control/decision-making middleware
  - Automatic device discovery, concurrent job execution, fault-tolerant operation