COMP4211 Project Report
Network Processor Technical Report
-- Architecture, Performance and Future Trends
Author: Jiening Jiang (2279326)
S1, 2005 CSE, UNSW
Table of Contents
1 Introduction
2 Architectures
  2.1 The Challenge of Packet Processing
  2.2 Characteristics of Packet Processing
  2.3 The Architecture Techniques
  2.4 Generic Packet Processing Architecture
  2.5 Evaluating the Architectures
  2.6 Pipelining Process Engines
  2.7 Memory architecture
  2.8 Memory bandwidth
  2.9 On-chip Communication
3 Case studies
  3.1 IXA2800
  3.2 PowerNP NP4GX
4 Conclusion and future trends
5 References
1 Introduction
With the number of network users growing and many new bandwidth-hungry
applications emerging, Internet line speeds and bandwidth have increased
tremendously and continue to do so. Edge routers now commonly connect at
10 Gbps, and 40 Gbps is coming. Today's routers must handle not only
packet forwarding but also more complex tasks such as sophisticated
queuing, quality-of-service (QoS), and encryption/decryption, which
require enormous processing power.
When the Internet was invented, routers were built on general-purpose
processors and were much like ordinary computers of the day. As users
and line speeds increased, ASICs came into use in routers. However,
Internet protocols and applications change and evolve frequently, and
developing a new ASIC is time-consuming and expensive. Hence the
dedicated network processor (NP) emerged. It provides robust, flexible,
programmable solutions for Internet routing, switching, and higher-level
applications; it targets the design trade-off between performance and
flexibility and offers a very good solution based on state-of-the-art
architectures.
2 Architectures
Before we move on to network processor architectures, let us examine
the challenges and characteristics of packet processing.
2.1 The Challenge of Packet Processing
The overriding challenge is line speed. Table 1 shows packet arrival
rates at different line speeds, assuming a packet size of 40 bytes,
which is roughly the common packet size in multimedia data streams.
Line Speed           40-byte Packet Arrival Time
2.5 Gbps (OC-48)     160 ns
10 Gbps  (OC-192)     35 ns
40 Gbps  (OC-768)      8 ns
Table 1: Inter-packet arrival times [7]
The inter-packet arrival time is roughly the time that can be spent on
processing each packet if the router is not to drop packets
intentionally. In traditional architectures, the normal memory access
time is larger than the inter-packet arrival time.
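As a sanity check on these budgets, the first-order figure is simply
packet size divided by line rate. The short C program below computes
this raw estimate; it ignores framing overhead, which is presumably why
it comes out somewhat below the cited values at the lower rates.

    #include <stdio.h>

    /* First-order per-packet time budget: packet bits / line rate.
     * Framing overhead is ignored, so the results differ somewhat
     * from the figures cited in Table 1. */
    int main(void) {
        const double packet_bits = 40 * 8;            /* 40-byte packet */
        const double rate_gbps[] = { 2.5, 10.0, 40.0 };
        const char  *name[]      = { "OC-48", "OC-192", "OC-768" };
        for (int i = 0; i < 3; i++) {
            /* bits divided by (Gbit/s) gives nanoseconds directly */
            double ns = packet_bits / rate_gbps[i];
            printf("%-7s %5.1f Gbps: %6.1f ns per 40-byte packet\n",
                   name[i], rate_gbps[i], ns);
        }
        return 0;
    }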
2.2 Characteristics of Packet Processing
A data stream is divided into a number of small packets. Each packet
carries some duplicated information, such as the IP header, so all
packets can be processed in parallel. This is called packet-level
parallelism (PLP).
Some packets are time-critical and simple to process, such as those of
a multimedia data stream, while others are complex but not
time-critical, such as routing-table updates and network-management
messages. Different types of packets need different processing
strategies.
2.3 The Architecture Techniques
The challenges are so enormous that new architectures had to be built
to meet them. A variety of architectural techniques have been used;
they fall into three categories [5]:
· Application-specific logic
i. Extending the RISC instruction set
ii. Use of customized on-chip or off-chip hardware assists
· Advanced processor architectures
i. Multithreading
ii. Instruction-level parallelism
· Macroparallelism
i. Multiple processors
ii. Pipelined processors
Commercial NP products use almost all of these techniques. Most NPs
have a so-called multi-core architecture: each core is a small-scale
microprocessor that uses multithreading and ILP techniques, while some
functions are implemented in specific logic units. See the case studies
for details.
2.4 Generic Packet Processing Architecture
Figure 1 shows the generic packet processing architecture. This
architecture meets the characters of packets. Almost all commercial
NPs are based on this architecture.
Figure 1: Generic Packet Processing Architecture
PHY-layer processing converts the analogue signal to a digital signal
with some type of frame format. Packet processing performs all the
necessary operations on the network traffic at line speed; these
operations are also known as “fast path” or “data path” operations.
Host processing handles functions such as device configuration and
network management, which are slow and not time-critical. In some
papers these are referred to as the “data plane” and “control plane”.
Finally, switching handles the forwarding of data traffic between the
ingress and egress ports of the bus, backplane, or other switch fabric
of the router.
The data plane processes the time-critical packets, while the control
plane handles the less time-critical management and system
configuration. Control-plane operations are more complex and diverse
than data-plane operations, so a general-purpose processor can be used
to execute them, while the data-plane operations are executed by a
number of dedicated processing engines.
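To make the split concrete, the hypothetical C sketch below dispatches
control traffic to the slow path and everything else to the fast path.
The names and the port-based classifier are purely illustrative, not a
real NP API.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct packet { uint16_t dst_port; int id; };

    static void fast_path_forward(struct packet *p) {  /* PEs: line speed */
        printf("pkt %d: fast path (data plane)\n", p->id);
    }

    static void slow_path_enqueue(struct packet *p) {  /* host processor */
        printf("pkt %d: slow path (control plane)\n", p->id);
    }

    static bool is_control_traffic(const struct packet *p) {
        /* e.g. routing updates (BGP, TCP port 179) or SNMP management (161) */
        return p->dst_port == 179 || p->dst_port == 161;
    }

    int main(void) {
        struct packet pkts[] = { { 80, 0 }, { 179, 1 }, { 5004, 2 } };
        for (int i = 0; i < 3; i++) {
            if (is_control_traffic(&pkts[i]))
                slow_path_enqueue(&pkts[i]);  /* complex, not time-critical */
            else
                fast_path_forward(&pkts[i]);  /* simple, time-critical */
        }
        return 0;
    }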
2.5 Evaluating the Architectures
For the control plane, a general-purpose processor architecture is
preferred. For the data plane there are many candidate architectures,
so why have most NPs chosen a multi-core structure? Crowley et al.
evaluated the performance of network packet processing on four
different architectures: superscalar, fine-grain multithreaded, chip
multiprocessor, and simultaneous multithreaded [1].
Superscalar (SS): a multiple-issue, out-of-order execution processor.
Fine-Grain Multithreaded (FGMT): extends the out-of-order superscalar
core by adding support for multiple hardware thread contexts.
Simultaneous Multithreaded (SMT): extends the FGMT architecture by
adding support for instructions to be fetched and issued from multiple
threads within one cycle.
Chip Multiprocessor (CMP): partitions chip resources rigidly in the
form of multiple processors, each of which can run a different thread.
Figure 2 shows the results of evaluating the four architectures [1].
They were run on different benchmarks, ranging from basic IPv4
forwarding to the complex cryptographic algorithms MD5 and 3DES, all at
a 500 MHz clock rate and ignoring operating-system overheads.
Figure 2: Performance results of all architectures [1]
With operating-system overheads ignored, SMT and CMP both achieved
higher performance than the other two. The results of the study suggest
that the two have roughly equivalent performance, two to four times
greater than SS and FGMT, because both are well suited to exploiting
the parallel nature of network workloads.
Many NP products choose an architecture close to CMP to exploit PLP
for high performance.
2.6 Pipelining Process Engines
On a high-speed link the processing time available for each packet is
only a few nanoseconds, and an individual PE cannot process packets in
so short a time. Pipelining PEs is therefore the solution to this
performance requirement.
The processing mode of the PEs can be programmed as a context pipeline
or a functional pipeline [5].
In a context pipeline, the pipeline stages are mapped to different PEs.
Each PE constitutes a context pipe stage, and cascading two or more
context pipe stages constitutes a context pipeline, as Figure 3 shows.
Figure 3: Context pipeline of process engines [5]
· Advantages of a context pipeline:
· The entire PE memory space can be dedicated to a single function,
which makes this model a good fit when a stage function needs a large
program memory.
· A context pipeline is also desirable when a pipe stage must maintain
state (bit vectors or tables) to perform its work: the local memory can
store the state, eliminating the latency of accessing external memory.
· Disadvantages of a context pipeline:
· If the context is very large, it takes longer to pass it through each
pipeline stage, which hurts the overall pipeline throughput.
· As each pipeline stage must execute at the maximum packet arrival
rate, it can be difficult to partition the application into stages.
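The minimal C simulation below illustrates the context-pipeline idea:
each function is bound to its own PE and the packet context is handed
from PE to PE. The stage functions are invented for illustration.

    #include <stdio.h>

    struct context { int packet_id; unsigned next_hop; };
    typedef void (*stage_fn)(struct context *);

    static void rx_stage(struct context *c) {
        printf("PE0 rx  pkt %d\n", c->packet_id);
    }
    static void lookup_stage(struct context *c) {
        c->next_hop = 7;   /* pretend routing-table result */
        printf("PE1 lkp pkt %d\n", c->packet_id);
    }
    static void tx_stage(struct context *c) {
        printf("PE2 tx  pkt %d -> port %u\n", c->packet_id, c->next_hop);
    }

    int main(void) {
        /* one function per PE */
        stage_fn pe[] = { rx_stage, lookup_stage, tx_stage };
        for (int pkt = 0; pkt < 3; pkt++) {
            struct context c = { pkt, 0 };
            for (int stage = 0; stage < 3; stage++)
                pe[stage](&c);    /* the context moves from PE to PE */
        }
        return 0;
    }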
In a functional pipeline, the context remains with a PE while different
functions are performed on the packet as time progresses. The PE
execution time is divided into n pipe stages, and each pipe stage
performs a different function. A single PE can constitute a functional
pipeline. Figure 4 shows the model.
Figure 4: Functional pipeline of a processor engine [5]
· Advantages of a functional pipeline:
· The context remains locally within the PE.
· It supports a longer execution period.
· Disadvantages of a functional pipeline:
· The entire PE program memory space must support multiple functions.
· Function control must be passed between stages, so such passing
should be minimized.
· Mutual exclusion may be more difficult because multiple PEs access
the same data structures.
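By contrast, the minimal simulation below models a functional pipeline:
each thread of one PE keeps its packet's context and steps through the
functions over time, the threads starting staggered so that one packet
can complete per pipe stage once the pipeline is full. The numbers are
again illustrative.

    #include <stdio.h>

    enum { N_THREAD = 3, N_STAGE = 3 };

    int main(void) {
        for (int t = 0; t < N_THREAD + N_STAGE - 1; t++) {  /* pipe-stage steps */
            for (int th = 0; th < N_THREAD; th++) {
                int stage = t - th;       /* thread th starts one step later */
                if (stage >= 0 && stage < N_STAGE)
                    printf("step %d: thread %d runs function %d on packet %d\n",
                           t, th, stage, th);
            }
        }
        return 0;
    }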
Both models have advantages and disadvantages, and some NPs can be
programmed in either.
2.7 Memory architecture
NPs need massive numbers of memory operations to process packets:
operations such as pattern matching, enqueue/dequeue, and
encryption/decryption require many memory reads and writes, so a good
memory architecture is a key factor in system performance.
The main parts of the memory system are a large set of local registers,
local memory and cache, high-speed SRAM, and high-bandwidth DRAM.
There are several ways to minimize the access latency and maximize
performance.
· Memory latency hiding
Modern computer architectures use multithreading to hide memory access
latency: the processor switches to another thread while waiting for a
slow memory access to complete.
· Memory co-processors
Certain complex, memory-intensive tasks such as table lookup and tree
searching require a significant number of processor cycles. A memory
co-processor receives a request from the main processor, carries out
the necessary operations, and returns the result, while the main
processor proceeds with other work.
Some memory co-processors provide Content Addressable Memory
(CAM) to accelerate the memory search operations.
· Caching
Caching can significantly improve the packet throughput of the system,
since it speeds up routing-table lookup. One caching mechanism for
addresses is the Host Address Cache (HAC), which works like a normal
cache. The architecture is shown in Figure 5 [2].
Figure 5: Host address cache
The least significant k bits of the destination IP address are used as
an index to select one of the 2^k cache sets.
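A minimal sketch of the HAC lookup is given below, assuming a
direct-mapped organisation; the design in [2] uses a programmable hash
engine and tag mask, and the field and function names here are
illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define K          10
    #define N_SETS     (1u << K)          /* 2^k sets */
    #define INDEX_MASK (N_SETS - 1)

    struct hac_entry { uint32_t tag; uint32_t next_hop; bool valid; };
    static struct hac_entry hac[N_SETS];

    /* Returns true on a hit and writes the cached next hop. */
    bool hac_lookup(uint32_t dst_ip, uint32_t *next_hop) {
        uint32_t index = dst_ip & INDEX_MASK;   /* least significant k bits */
        uint32_t tag   = dst_ip >> K;           /* remaining bits */
        if (hac[index].valid && hac[index].tag == tag) {
            *next_hop = hac[index].next_hop;    /* fast path: cached route */
            return true;
        }
        return false;  /* miss: fall back to the full routing-table lookup */
    }

    void hac_insert(uint32_t dst_ip, uint32_t next_hop) {
        uint32_t index = dst_ip & INDEX_MASK;
        hac[index] = (struct hac_entry){ dst_ip >> K, next_hop, true };
    }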
2.8 Memory bandwidth
As link bandwidth has increased dramatically, the memory bandwidth of
the NP has become another key factor in system performance; here the
system throughput is critical. Multithreaded latency hiding optimises
only the latency, not the throughput.
There are three different models:
· Replicate the memory state for each processor, or share the state:
this becomes very expensive when the problem sizes are large.
· Pipelined processors with distributed memories: it is very hard to
statically partition different data structures, e.g. different lookup
databases.
· Pipelined wide-word memory.
2.9 On-chip Communication
Traditional central-CPU memory architectures are not capable of
high-speed packet processing, so memory access schemes use distributed
memory or other high-performance organisations, as mentioned above.
Ordinary bus-based on-chip communication does not meet the demands of
high-speed links such as OC-768. An on-chip crossbar is an alternative,
but it is expensive and scales poorly. Most new-generation NPs use
high-speed buses and other mechanisms to meet the requirements: the
IXA28XX uses Hyper Task Chaining, discussed in the case studies below,
while the Motorola C-5 and Agere PayloadPlus use high-bandwidth buses.
3 Case studies
There are dozens of NP products on the market, among them the Intel
IXA, IBM PowerNP, Agere PayloadPlus, Cisco Toaster2, and Motorola C-5.
Here I study only the IXA and the PowerNP.
3.1 IXA2800
· Features
· Second-generation network processor
· Programmable parallel processing architecture
· Solves complex problems at line speed
· XScale core with sixteen 32-bit independent multithreaded
Microengines, providing more than 25 giga-operations per second
· Hyper Task Chaining technique
· Hyper Task Chaining [8]
Hyper Task Chaining implements several significant innovations
to ensure low latency communication among processes. These
mechanisms include “Next Neighbor” registers that enable individual
Microengines to rapidly pass data and state information to adjacent
Microengines. Reflector Mode pathways ensure that data and global
event signals can be shared with multiple Microengines, using
32-bit unidirectional buses that connect the network processor’s
internal processing and memory resources. A third enhancement, Ring
Buffer registers, provides a highly efficient mechanism for
flexibly linking tasks among multiple software pipelines. Ring
buffers allow developers to establish “producer-consumer”
relationships among Microengines, efficiently propagating results
along the pipeline in FIFO order. To minimize latency associated
with external memory references, register structures are
complemented by 16 entries of Content Addressable Memory (CAM)
associated with each Microengine. Configured as a distributed
cache, the CAM enables multiple threads and Microengines to
manipulate the same data simultaneously, while maintaining data
coherency.
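The producer-consumer pattern that the ring-buffer registers support in
hardware might be sketched in software as follows: a single-producer,
single-consumer FIFO linking two pipeline stages. This is an
illustration, not the actual Microengine interface.

    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 16u                 /* must be a power of two */

    struct ring {
        uint32_t slot[RING_SIZE];
        unsigned head, tail;              /* producer / consumer cursors */
    };

    bool ring_put(struct ring *r, uint32_t v) {      /* upstream stage */
        if (r->head - r->tail == RING_SIZE)
            return false;                            /* ring is full */
        r->slot[r->head++ % RING_SIZE] = v;
        return true;
    }

    bool ring_get(struct ring *r, uint32_t *v) {     /* downstream stage */
        if (r->head == r->tail)
            return false;                            /* ring is empty */
        *v = r->slot[r->tail++ % RING_SIZE];
        return true;
    }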
· Architecture overview
Figure 6: IXA2800 network processor functional block diagram [9]
The major parts of the IXA2800 are the XScale core and 16 Microengines (MEs).
The Intel XScale® core is a 32-bit general-purpose RISC processor that
incorporates an extensive list of architectural features enabling it to
achieve high performance. It is compatible with the ARM Version 5 (V5)
architecture: it implements the ARM V5 integer instruction set but does
not provide hardware support for the floating-point instructions.
The XScale core is logically in the control plane of the NP. It handles
slow, complex tasks and device configuration.
The Microengines do most of the programmable per-packet
processing in the network processor. There are 16 Microengines,
connected as shown in Figure 6. The Microengines can access all of
the shared resources (SRAM, DRAM, MSF, etc.) and the private
connections between adjacent Microengines.
The Microengines provide support for software-controlled
multi-threaded operation. Given the disparity in processor cycle
times compared to external memory times, a single thread of
execution often blocks, waiting for external memory operations to
complete. Multiple threads allow these operations to be interleaved, so
there is usually at least one thread ready to run while others are
waiting.
The Microengine detail is shown in Figure 7. Microengines are logically
in the data plane of the NP; they perform the time-critical packet
processing operations.
Figure 7: Microengine Block Diagram
The Control Store is a RAM that holds the program that is
executed by the Microengine. It holds 8192 instructions, each of
which is 40 bits wide. It is initialized by the Intel XScale ®
core.
There are eight hardware Contexts available in the Microengine.
To allow for efficient context swapping, each Context has its own
register set, Program Counter, and Context specific Local
registers. Having a copy per Context eliminates the need to move
Context specific information to/from shared memory and Microengine
registers for each Context swap. Fast context swapping allows a
Context to do computation while other Contexts wait for I/O
(typically external memory accesses) to complete or for a signal
from another Context or hardware unit. (A context swap is similar
to a taken branch in timing.)
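The C fragment below is a hypothetical illustration of why per-Context
register copies make swaps cheap: a swap merely changes which Context
the datapath uses, with nothing saved to or restored from shared
memory. The structure layout is invented for illustration.

    #include <stdint.h>

    enum { N_CONTEXT = 8 };              /* eight hardware Contexts per ME */

    struct me_context {
        uint32_t pc;                     /* private program counter */
        uint32_t gpr[32];                /* private register set (simplified) */
    };

    struct microengine {
        struct me_context ctx[N_CONTEXT];
        int active;                      /* index of the running Context */
    };

    /* Swap = switch an index; no state is copied to or from memory. */
    static inline void context_swap(struct microengine *me, int next) {
        me->active = next;
    }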
As shown in the block diagram in Figure 7, each Microengine
contains four types of 32-bit datapath registers:
· 256 General Purpose registers
· 512 Transfer registers
· 128 Next Neighbor registers
· 640 32-bit words of Local Memory
Local Memory is addressable storage within the Microengine.
Local Memory is read and written exclusively under program control.
Local Memory supplies operands to the execution datapath as a
source, and receives results as a destination.
3.2 PowerNP NP4GX
The IBM NP4GX is a network processor for OC-48 / OC-192 line speeds.
Figure 8 shows the architecture block diagram.
Figure 8: NP4GX architecture block diagram [10]
The major parts are:
· Protocol processor
NP4GX has 16 multithreaded protocol processors, arranged as 8 dyadic
protocol processor units (DPPUs).
· Coprocessor and hardware assists
Coprocessors and specific hardware assists support packet queuing and
header manipulation. Scheduling hardware is designed to provide robust
QoS functions. Four embedded search engines perform multiple lookups
into very large tables, with access to more than 700 MB of table memory
at greater than 87 Gbps of bandwidth. Advanced CRC computation hardware
performs multiple types of calculations.
· Control processor
NP4GX incorporates a PowerPC 440 as the control processor, supporting
control-plane functions.
4 Conclusion and future trends
These sophisticated architectures give NPs the enormous processing
power needed to handle very high line speeds and broad bandwidth
requirements. The most widely used approach is a parallel processing
architecture that exploits packet-level parallelism as well as
instruction-level and thread-level parallelism. NPs also use
coprocessors and high-speed on-chip communication to reach high
processing speeds.
However, using more coprocessors reduces the flexibility that NPs
initially pursued. Reconfigurable circuits are one option for solving
this problem, but recent commercial NPs do not include them.
The on-chip communication of current NPs can cope with today's line
speeds, but as line speeds keep increasing, new mechanisms must be
invented.
Current NP architectures are all based on a “store, process, forward”
strategy. Is it feasible to abolish the store stage? Can we build an
architecture based on “arrive, process, forward”, in which the NP
begins processing a packet as soon as it arrives at the ingress port
instead of waiting for and buffering the whole packet?
5 References
[1] P. Crowley et al., “Characterizing Processor Architectures for
Programmable Network Interfaces,” Proceedings of the 2000 International
Conference on Supercomputing, Santa Fe, NM, May 2000.
[2] M. Shorfuzzaman et al., “Architectures for Network Processors: Key
Features, Evaluation, and Trends,” The 2004 International
MultiConference in Computer Science & Computer Engineering, Las Vegas,
Nevada, USA, June 21-24, 2004.
[3] J. R. Allen Jr. et al., “IBM PowerNP network processor: Hardware,
software, and applications,” IBM Journal of Research & Development,
Vol. 47, No. 2/3, March/May 2003.
[4] Lin Chuang et al., “Analysis and Research on Network Processor,”
Journal of Software, 14(2): 253-267, Feb. 2003.
[5] P. Crowley et al., Network Processor Design: Issues and Practices,
Vol. 1, Morgan Kaufmann Publishers, 2003.
[6] http://biz.yahoo.com/prnews/050121/nyf067_1.html
[7] Intel, “Next Generation Network Processor Technologies: Enabling
Cost Effective Solutions for 2.5 Gbps to 40 Gbps Network Services,”
Oct. 2001.
[8] Intel, “IXA2800 Network Processor Datasheet.”
[9] Intel, “IXA2800 Network Processor Hardware Reference Manual,”
Aug. 2004.
[10] IBM, “NP4GX Datasheet.”