Thesis

Multi-Threaded End-to-End Applications on Network Processors

A Thesis

Presented to

the Faculty of the California Polytechnic State University

San Luis Obispo

In Partial Fulfillment

of the Requirements for the Degree

Master of Science in Computer Science

by

Michael S. Watts

June 2005


Copyright © 2005

by

Michael S. Watts


APPROVAL PAGE

TITLE: Multi-Threaded End-to-End Applications on Network Processors

AUTHOR: Michael S. Watts

DATE SUBMITTED: January 26, 2006

Professor Diana Franklin

Advisor or Committee Chair Signature

Professor Hugh Smith

Committee Member Signature

Professor Phil Nico

Committee Member Signature


Abstract


by

Michael S. Watts

High-speed networks put a heavy load on network processors; therefore, optimization of applications for these devices is an important area of research. Many

network processors provide multiple processing chips, and it is up to the applica-

tion developer to utilize the available parallelism. To fully exploit this power, one

must be able to parallelize full end-to-end applications that may be composed of

several less complex application kernels.

This thesis presents a multi-threaded end-to-end application benchmark suite

and a generic network processor simulator modeled after the Intel IXP1200. Us-

ing our benchmark suite we evaluate the effectiveness of network processors to

support end-to-end applications as well as the effectiveness of various paralleliza-

tion techniques to take advantage of the network processor architecture. We show

that kernel performance is an inaccurate indicator of end-to-end application per-

formance and that relying on such data can lead to sub-optimal parallelization.


Contents

1 Introduction
2 Related Work
2.1 Network Processors
2.1.1 Intel IXP1200
2.2 Network Processor Simulators
2.2.1 SimpleScalar
2.2.2 PacketBench
2.3 Benchmarks
2.3.1 MiBench
2.3.2 CommBench
2.3.3 NetBench
2.4 Application Frameworks
2.4.1 Click
2.4.2 NP-Click
2.4.3 NEPAL
2.4.4 NetBind
3 The Simulator
3.1 Processing Units
3.2 Memory Structure
3.3 Methods of Use
3.4 Application Development
4 Benchmark Applications
4.1 Message Digest
4.2 URL-Based Switch
4.3 Advanced Encryption Standard
5 Results
5.1 Isolation Tests
5.1.1 MD5
5.1.2 URL
5.1.3 AES
5.1.4 Isolation Analysis
5.2 Shared Tests
5.2.1 MD5
5.2.2 URL
5.2.3 AES
5.2.4 Shared Analysis
5.3 Static Tests
5.4 Dynamic Tests
5.5 Analysis
6 Conclusion
7 Future Work
Bibliography
A Acronyms

Chapter 1

Introduction

As available processing power has increased, devices that traditionally used

Application Specific Integrated Circuit (ASIC) chips are beginning to use pro-

grammable processors in order to take advantage of their flexibility. This increase

in flexibility has traditionally been gained at the sacrifice of speed. Network

processors aim to bridge the gap between speed and flexibility by taking advan-

tage of the benefits of both ASICs and general purpose processors. There is

no single unifying characteristic that allows all network processors to accomplish

this goal. However, there are several major strategies employed to bridge the gap:

parallel processing, special-purpose hardware, memory structure, communication

mechanisms, and peripherals [24]. Network processors have made it possible to deploy complex applications into the network at nodes that previously acted only as routers and switches.

High-speed networks put a heavy load on network processors; therefore, optimization of applications for these devices is an important area of research. It is up

to the application developer to utilize the parallelism available in network proces-

sors. Parallelization of kernels is often a trivial task compared to parallelization


of end-to-end applications.

In the context of this thesis, kernels are programs that carry out a single task.

This task is of limited use in and of itself; however, multiple kernels can often be

combined to provide a more useful solution. In the area of networking, kernels are

programs such as MD5, URL-based switching, and AES discussed in Chapter 4.

These kernels can also be applicable outside the area of networking; however, since the context of this thesis is networking, these kernels focus on packet processing.

An end-to-end application refers to a useful combination of kernels. The end-

to-end application discussed in Chapter 5 makes use of the kernels in Chapter

4 by first calculating the MD5 signature of each packet, then determining its

destination using URL-based switching, and finally encrypting it using AES. In

the proposed scenario, the integrity of the packet could be verified and its payload

decrypted at the destination node.

Our first contribution is the creation of a simulator that emulates a generic

network processor modeled on the Intel IXP1200. Our simulator fills a gap in

existing academic research by supporting multiple processing units. In this way,

interaction between the six microengines of the Intel IXP1200 can be simulated.

We chose to emulate the IXP1200 because it is a member of the commonly used

Intel IXP line of Network Processing Units (NPUs).

Our second contribution is the construction of multi-threaded, end-to-end

application benchmarks based on the NetBench [18] and MiBench [10] single-

threaded kernels. Since network processors are capable of supporting complex

applications, it is important to have benchmarks that fully utilize them. Exist-

ing benchmark suites make it difficult to research the properties of parallelized

end-to-end applications since they are made up of single-threaded kernels. Our

benchmarks have been designed to provide insight into the characteristics of end-


to-end applications.

Our third contribution is an analysis of our multi-threaded, end-to-end ap-

plication benchmarks on our network processor simulator. This analysis reveals

characteristics of the kernels making up the end-to-end applications and the end-

to-end applications themselves, as well as insight into the strengths and weak-

nesses of network processors.

This thesis is organized as follows. In the next chapter we provide background

and related work. In Chapter 3 we present our simulator. The kernels that make

up our end-to-end application benchmark are presented in Chapter 4. Chapter

5 describes our testing methodology and our evaluation of the effectiveness of

network processors to support end-to-end applications as well as the effectiveness

of various parallelization techniques to take advantage of the network processor

architecture. Finally, our conclusion is presented in Chapter 6.


Chapter 2

Related Work

As the size and capacity of the Internet continue to grow, devices within the

network and at the network edge are increasing in complexity in order to provide

more services. Traditionally, these devices have made use of ASICs, which provide

high performance and low flexibility. NPUs bridge the gap between speed and

flexibility by taking advantage of the benefits of both ASICs and general pur-

pose processors. There is no single unifying characteristic that allows all network

processors to accomplish this goal. However, there are several major strate-

gies employed to bridge the gap: parallel processing, special-purpose hardware,

memory structure, communication mechanisms, and peripherals [24]. Network processors have made it possible to deploy complex applications into the network at nodes that previously acted only as routers and switches.

2.1 Network Processors

NPU is a general term used to describe any processor designed to process

packets for network communication. Another characteristic of NPUs is that their


programmability allows applications deployed to them to access higher layers

of the network stack than traditional routers and switches. The OSI reference

model defines seven layers of network communication from the physical layer

(layer 1) to the application layer (layer 7) [15]. NPUs are capable of supporting

layer 7 applications which have traditionally been reserved for desktop and server

computers.

There are over 30 different self-identified NPUs available today [24]. These

NPUs can be classified into two categories based on their processing element

configuration: pipelined and symmetric. A processing element (PE) is a processor

able to decode an instruction stream [24]. Pipelined configurations dedicate each

PE to a particular packet processing task, while in symmetric configurations

each PE is capable of performing any task [24]. Both of these configurations are

capable of taking advantage of the inherent parallelism in packet processing.

Pipelined architectures include: Cisco PXF [25], EZChip NP1 [8], and Xel-

erator Network Processors [31]. Symmetric architectures include: Intel IXP [6]

and IBM PowerNP [1].

High-speed networks place high demands on the performance of NPUs. In

order to prevent network communication delays, NPUs must quickly and effi-

ciently process packets. Parallel processing through the use of multiple PEs is

only one strategy used in NPUs to improve performance. Another strategy is

to use special-purpose hardware to offload tasks from the PEs. Special-purpose

hardware includes co-processors and special functional units.

Co-processors are more complex than functional units. They may be attached

to several PEs, memories, and buses, and they may store state. A co-processor

can be advantageous to the programmer when implementing an application, but


can also dictate that the programmer use a specific algorithm in order to take

advantage of a particular co-processor. Special functional units are used to im-

plement common networking operations that are hard to implement efficiently in

software yet easy to implement in hardware [24].

Since memory access can potentially waste processing cycles, NPUs often use

multi-threading to efficiently utilize processing power. Dedicated hardware supports multi-threading, such as separate register banks for different threads and hardware units that schedule and swap threads with no overhead. Special units also handle

memory management and the copying of packets from network interfaces into

shared memory [24].

2.1.1 Intel IXP1200

The IXP1200 was designed to support applications requiring fast memory

access, low latency access to network interfaces, and strong processing of bit,

byte, word, and longword operations. For processors, the IXP1200 provides a

single StrongARM processor and six independent 32-bit RISC PEs called microengines. In effect, this pairs a single powerful processor with six much simpler engines for highly parallel computation. In addition, each microengine provides four hardware-supported threads with zero-overhead context

switching. The StrongARM was designed to manage complex tasks and to offload

specific tasks to individual microengines [6].

The StrongARM and microengines share 8 MBytes of SRAM for relatively

fast accesses and 256 MBytes of SDRAM for larger memory space requirements

(but slow accesses). There is also a scratch memory unit available to all proces-

sors consisting of 1 MByte SRAM. The StrongARM has a 16 KByte instruction


cache and 8 KByte data cache, providing it with fast accesses on a small amount

of data. Each microengine has a 1 KByte data cache and a large number of

transfer registers. The IXP1200 platform does not provide any built-in memory

management; therefore, application developers are responsible for maintaining

memory address space [6].

2.2 Network Processor Simulators

Simulators are often used to execute programs written to run on hardware

platforms that are inconvenient or inaccessible to developers [28]. Simulators are

also able to provide performance statistics such as cycle count, memory usage,

bus bandwidth, and cache misses. These statistics enable developers to identify

bottlenecks and tune applications to specific hardware configurations.

Simulators are an important aspect of research in network processors due to

the high cost and the wide variety of architectures found in current NPUs. High cost often makes cutting-edge NPUs inaccessible to academic researchers, although outdated NPUs are becoming more accessible. The wide variety of NPU architectures makes it difficult to develop applications that run across multiple platforms.

Since simulators can potentially be configured to simulate multiple platforms,

analysis of architectural differences can be performed.

2.2.1 SimpleScalar

SimpleScalar provides tools for developing cycle-accurate hardware simulation

software that models real-world architecture [3]. We chose to use SimpleScalar

because of its prevalence in architectural research. SimpleScalar takes as input


binaries compiled for the SimpleScalar architecture and simulates their execution

[3]. The SimpleScalar architecture is similar to MIPS, which is commonly found

in NPU platforms such as the Intel IXP. A modified version of GNU GCC allows

binaries to be compiled from FORTRAN or C into SimpleScalar binaries [3].

2.2.2 PacketBench

PacketBench is a simulator developed at the University of Massachusetts to

provide exploration and understanding of NPU workloads [22]. PacketBench

makes use of SimpleScalar ARM for cycle-accurate simulation [22]. PacketBench

also emulates some of the functionality of an NPU by providing a simple API for

sending and receiving packets and for memory management [22]. In this way, the

underlying details of specific NPU architectures are hidden from the application

developer. Although PacketBench is useful in characterizing workload, it does

not provide simulation support for multiprocessor environments. Since NPUs

make extensive use of parallelization, we chose not to use this tool.

2.3 Benchmarks

Benchmarks are applications designed to assess the performance character-

istics of computer hardware architectures [27]. One approach is to use a single

benchmark suite to compare the performance of several different architectures.

Another approach is to compare the performance of different applications on a

specific architecture. Benchmarks designed to mimic a particular type of workload are called synthetic benchmarks, while application benchmarks are real-world applications [27]. For the purposes of this thesis, our interest is in application benchmarks and, more specifically, representative benchmarks for the domain of NPUs.

2.3.1 MiBench

MiBench is a benchmark suite providing representative applications for em-

bedded microprocessors [10]. Due to the diversity of the embedded microproces-

sor domain, MiBench is composed of 35 applications divided into six categories:

Automotive and Industry Control, Network, Security, Consumer Devices, Office

Automation, and Telecommunications. The Network and Security categories in-

clude Rijndael encryption, Dijkstra, Patricia, Cyclic Redundancy Check (CRC),

Secure Hash Algorithm (SHA), Blowfish, and Pretty Good Privacy (PGP) algo-

rithms. The Telecommunications category consists of mostly signal processing

algorithms, while the other categories are not relevant to this discussion. All

MiBench applications are available in standard C source code allowing them to

be ported to any platform with compiler support.

2.3.2 CommBench

CommBench was designed to evaluate the performance of network devices

based on eight typical network applications. The applications included in Comm-

Bench are categorized into header-processing and payload-processing. Header-

processing applications include Radix-Tree Routing table lookup, FRAG packet

fragmentation, Deficit Round Robin scheduling, and tcpdump traffic monitor-

ing. Payload-processing applications include CAST block cipher encryption, ZIP

data compression, Reed-Solomon Forward Error Correction (REED) redundancy

checking, and JPEG lossy image compression [30].


2.3.3 NetBench

NetBench is a benchmarking suite consisting of a representative set of network

applications likely to be found in the network processor domain. These applica-

tions are split into three categories: micro, IP, and application. The micro-level

includes the CRC-32 checksum calculation and the table lookup routing scheme.

IP-level programs include IPv4 routing, Deficit-Round Robin (DRR) scheduling,

Network Address Translation (NAT), and the IPCHAINS firewall application.

Finally, the application level includes URL-based switching, Diffie-Hellman (DH) encryption for VPN connections, and Message-Digest 5 (MD5) packet signing [18].

Although CommBench and NetBench offer good representations of typical

network applications, they are both limited to single-threaded environments. Our

work builds on the NetBench suite by parallelizing several NetBench applications.

2.4 Application Frameworks

Application framework is a widely used term referring to a set of libraries and

a standard structure for implementing applications for a particular platform [26].

Application frameworks often promote code-reuse and good design principles.

Several frameworks for NPUs are available in academia, each offering various

benefits to application developers. NPU vendors also provide frameworks specific

to their architectures, such as the Intel IXA Software Development Kit [5].

One key advantage of academic frameworks is the possibility that they will

be able to support multiple architectures, thus enabling developers to design and

implement applications independent of a specific architecture. Unfortunately,

of the NPU-specific frameworks surveyed in this thesis, only NEPAL currently realizes cross-platform support. The others are currently striving to meet this

goal.

2.4.1 Click

Click is an application development environment designed to describe net-

working applications [13]. Applications implemented using Click are assembled

by combining packet processing elements. Each element implements a simple

autonomous function. The application is described by building a directed graph

with processing elements at the nodes and packet flow described using edges.

Click supports multi-threading but has not been extended to multiprocessor ar-

chitectures. The modularity of Click applications gives insight into their inherent

concurrency and allows alterations in parallelization to be made without changing

functionality.

2.4.2 NP-Click

NP-Click is based upon Click and designed to enable application development

on NPUs without requiring in-depth understanding of the details of the target

architecture [19]. NP-Click offers a layer of abstraction between the developer

and the hardware through the use of a programming model. The code produced

using the NP-Click programming model has been shown to run within 10% of

the performance of hand-coded solutions while significantly reducing development

time [19]. The current implementation of NP-Click targets only the Intel IXP1200

network processor although a goal of this project is to support other architectures.


2.4.3 NEPAL

The Network Processor Application Language (NEPAL) is a design environ-

ment for developing and executing module-based applications for network proces-

sors [17]. In a similar fashion to Click, application development takes place by

defining a set of modules and a module tree that defines the flow of execution

and communication between modules. The NEPAL authors verified its platform independence using their own customized version of the SimpleScalar ARM simulator for multiprocessor architectures. They provide performance results for two simulated NPUs modeled after the IXP1200 [6] and the Cisco Toaster [25].

2.4.4 NetBind

NetBind is a binding tool for dynamically constructing data paths in NPUs

[4]. Data paths are made up of components performing simple operations on

packet streams. NetBind modifies the machine code of executable components

in order to combine them into a single executable at run-time. The current

implementation of NetBind specifically targets the IXP1200 network processor,

although it could be ported to other architectures in the future.


Chapter 3

The Simulator

The simulator developed for this work was built on the SimpleScalar tool set

[3]. SimpleScalar provides tools for developing cycle-accurate hardware simula-

tion software that models real-world architecture. This simulation tool set was

chosen because of its prevalence in architectural research. For this work, we mod-

ified an existing simulator with support for multiple processors in order to create

a generic network processor simulator modeled after the Intel IXP1200. We chose

to model the IXP1200 because it is a member of the commonly used Intel IXP

line of NPUs.

3.1 Processing Units

The simulator includes a single main processor and six auxiliary processors

each supporting up to four concurrent threads. This configuration corresponds to

the StrongARM core processor and accompanying microengines on the IXP1200.

The StrongARM core is represented by an out-of-order processor. The microengines are represented by single-issue in-order processors. Since each microengine must support four threads with zero-overhead context switching [6], the simulator creates one single-issue in-order processor for each microengine thread. When a single-issue in-order processor is created, it is given the number of threads allocated to its physical microengine so that it knows to execute every n cycles, where n is the number of threads on the microengine. The total number of required threads is specified on the command line when the simulator is run; therefore, unused threads are not created.

Parameter          StrongARM           Microengines
Scheduling         Out-of-order        In-order
Width              1 (single-issue)    1 (single-issue)
L1 I Cache Size    16 KByte            SRAM (no miss penalty)
L1 D Cache Size    8 KByte             1 KByte

Table 3.1. Processor Parameters
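The every-n-cycles interleaving described in this section can be illustrated with a small C sketch. It is not taken from the simulator source: the helper names issues_on_cycle and cycles_used are hypothetical, and the sketch assumes that the threads of one microengine take turns in a fixed slot order, which the text does not specify.

```c
#include <assert.h>

/* Hypothetical helper: decide whether the simulated in-order processor that
 * models thread `slot` of a microengine may issue on `cycle`, given that the
 * microengine hosts `n` threads. Assumes threads take turns in slot order. */
int issues_on_cycle(unsigned long cycle, int slot, int n)
{
    assert(n >= 1 && n <= 4);          /* a microengine hosts 1 to 4 threads */
    assert(slot >= 0 && slot < n);
    return (cycle % (unsigned long)n) == (unsigned long)slot;
}

/* Count how many of the first `cycles` cycles a given slot is allowed to use. */
unsigned long cycles_used(unsigned long cycles, int slot, int n)
{
    unsigned long c, used = 0;
    for (c = 0; c < cycles; c++)
        if (issues_on_cycle(c, slot, n))
            used++;
    return used;
}
```

Under this assumption, a thread that shares its microengine with three others receives one issue slot in every four cycles, while a thread that has a microengine to itself issues on every cycle.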

3.2 Memory Structure

The StrongARM and microengines share 8 MBytes of SRAM and 256 MBytes

of SDRAM [6]. There is also a scratch memory unit consisting of 1 MByte SRAM.

These memory units are represented in the simulator using a single DRAM unit.

Separate DRAM caches back these memory units for the StrongARM and micro-

engines.

The StrongARM has a 16 KByte instruction cache and 8 KByte data cache

that are backed by SRAM [6]. Each microengine has a 1 KByte data cache and

unlimited instruction cache. The microengines are given unlimited instruction

cache in order to mimic the behavior of the large number of transfer registers

associated with each microengine on the IXP1200. Since the number of simulated

registers cannot exceed the number of physical registers on the host architecture,


we determined this to be the best option available.

Since the IXP1200 is capable of connecting with any number of network devices through its high-speed 64-bit IX bus interface, the amount of delay incurred to fetch a packet could vary greatly. For the purposes of this simulator, network

delay is not important and it is assumed that the next packet is available as soon

as the application is ready to receive it. In order to imitate this behavior, a large

chunk of the DRAM unit is allocated as “network” memory and is backed by a

no-penalty cache object available to all processors.

The simulator does not provide any built-in memory management; therefore, application developers are responsible for maintaining the memory address space. The simulator assigns address ranges to each of the memory units. SRAM is dedicated to the call stack, and DRAM is divided into ranges for text, global variables, and the heap.

3.3 Methods of Use

The simulator compiles on Linux using GCC 3.2.3 into an executable called

sim3ixp1200. The simulator takes a list of arguments that modify architectural

defaults and indicate the location of a SimpleScalar native executable and any

arguments that should be passed to it. Its use can be expressed as:

sim3ixp1200 [-h] [sim-args] program [program-args]

The -h option lists available simulator arguments of the form -Parameter:value.

These arguments can modify aspects of the simulation architecture including the

number of PEs, threads, cache specifications, and memory unit specifications.

Default values for each available parameter are based on the IXP1200 architec-


ture.

The most important parameter for this work was Threads, which controls the

number of microengine threads made available to the SimpleScalar application.

Threads can be any value between 0 and 24 inclusive. Zero threads indicates the

microengines will not be used and therefore the application will execute only on

the StrongARM processor. When the number of threads is greater than zero, they are allotted to the 6 possible microengines using a round-robin scheme so that the threads are distributed as evenly as possible. For instance, if 8 threads are requested, then 4 microengines will run 1 thread and 2 microengines will run 2 threads.
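The round-robin allotment can be sketched in C as follows. This is an illustration of the scheme as described, not the simulator's own code: the function names allot_threads and engines_with are hypothetical, and the sketch assumes that threads are handed out starting from the first microengine.

```c
#include <assert.h>

#define NUM_MICROENGINES 6    /* microengines on the modeled IXP1200 */
#define MAX_THREADS      24   /* 6 microengines x 4 hardware threads */

/* Fill per_engine[] with the number of threads each microengine receives
 * when `threads` threads are handed out one at a time in round-robin order. */
void allot_threads(int threads, int per_engine[NUM_MICROENGINES])
{
    int i;
    assert(threads >= 0 && threads <= MAX_THREADS);
    for (i = 0; i < NUM_MICROENGINES; i++)
        per_engine[i] = 0;
    for (i = 0; i < threads; i++)
        per_engine[i % NUM_MICROENGINES]++;
}

/* Count how many microengines were given exactly `count` threads. */
int engines_with(const int per_engine[NUM_MICROENGINES], int count)
{
    int i, n = 0;
    for (i = 0; i < NUM_MICROENGINES; i++)
        if (per_engine[i] == count)
            n++;
    return n;
}
```

Calling allot_threads(8, per_engine) reproduces the example in the text: two microengines receive 2 threads each and the remaining four receive 1 thread each.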

3.4 Application Development

The applications developed for this work were written in C and compiled

using a GCC 2.7.2.3 cross-compiler. A cross-compiler translates code written on

one computer architecture into code that is executable on another architecture.

For this work, the host architecture was Linux/x86 and the target architecture

was SimpleScalar PISA (a MIPS-like instruction set).

Since the simulator does not support POSIX threads, developing multi-threaded

applications follows a completely different path. Instead of the main process

spawning child threads, the same application code is automatically executed in

each simulator thread. In order for the application code to distinguish which

thread it is running in, a function called getcpu() that returns an integer is

made available by the simulator. This function, although mis-named, returns

the thread identifier, not the CPU identifier. Code that is meant to run in a

particular thread must be isolated in an if block that tests the return value from


getcpu(). This function requires a penalty of one cycle, but it is typically called

only once and its value stored in a local variable during the initialization of the

application. A global variable called ncpus is automatically made available by

the simulator and populated with the number of threads.
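A minimal sketch of this programming pattern is shown below. Only getcpu() and ncpus come from the text; everything else is an assumption made so that the example compiles on an ordinary host: the stand-in definitions of getcpu() and ncpus, the run_all_threads harness, the setup_runs and handled_by variables, and the premise that thread identifiers start at 0.

```c
#include <assert.h>

#define NUM_ITEMS 12

/* Host-side stand-ins (assumptions). On the simulator, getcpu() and ncpus
 * are supplied automatically and these three definitions would be removed. */
static int current_thread = 0;
int ncpus = 1;
int getcpu(void) { return current_thread; }

int setup_runs = 0;           /* how many threads ran the isolated block */
int handled_by[NUM_ITEMS];    /* which thread handled each work item */

/* The same code runs in every simulator thread and branches on its id. */
void thread_body(void)
{
    int id = getcpu();        /* call once and cache: each call costs a cycle */
    int i;

    if (id == 0)
        setup_runs++;         /* code isolated to one particular thread */

    for (i = id; i < NUM_ITEMS; i += ncpus)
        handled_by[i] = id;   /* every thread takes a strided share */
}

/* Host harness (assumption): emulate the simulator by running the body
 * once per thread id, one after another. */
void run_all_threads(int threads)
{
    assert(threads >= 1);
    ncpus = threads;
    for (current_thread = 0; current_thread < threads; current_thread++)
        thread_body();
}
```

With four threads, the isolated block runs exactly once, and work item i is handled by thread i mod 4. On the simulator the threads would run concurrently rather than in sequence, so shared data written by one thread and read by another would need additional synchronization.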

It is often necessary in application development to require all threads to reach

a particular point before any thread is allowed to proceed. This is accomplished

using another function made available by the simulator called barrier(). A call

to barrier() requires one cycle for the function call, but induces no penalty

while a thread waits.

The simulator reports statistics on the utilization of each hardware unit at the

end of each execution. For each PE this includes cycle count, instruction count,

and fetch stalls. For each memory unit this includes hits, misses, reads, and

writes. In addition, the simulator provides a function called PrintCycleCount()

that can be used at any time to print the cycle count of the current thread to

standard error and standard output. This function is useful when an application

has an initialization process that should not count towards the total cycle count.

By making a call to PrintCycleCount() at the beginning and end of a block of

code, the total cycle count for that block can be determined by analyzing the

output. When the application itself must make some calculation based on cycle count, the function GetCycles(), which returns an integer, can be

used. Both of these functions induce a penalty of one cycle for the call, but no

penalty for their execution.
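The measurement pattern can be sketched with GetCycles() as follows. The GetCycles() definition here is a host-side stand-in and an assumption, as are the checksum workload and the measure_checksum helper; on the simulator, GetCycles() is provided and the counter advances as instructions execute.

```c
#include <assert.h>

/* Host-side stand-in (assumption): on the simulator, GetCycles() is provided
 * and returns the cycle count of the current thread as an integer. */
int fake_cycles = 0;
int GetCycles(void) { return fake_cycles; }

/* Stand-in workload that "costs" one cycle per byte so the sketch runs
 * on a host machine. */
unsigned checksum(const unsigned char *buf, int len)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < len; i++) {
        sum += buf[i];
        fake_cycles++;
    }
    return sum;
}

/* Measure only the region of interest, excluding earlier initialization.
 * On the simulator, each GetCycles() call itself costs one cycle. */
int measure_checksum(const unsigned char *buf, int len, unsigned *result)
{
    int start, end;
    assert(buf != 0 && result != 0 && len >= 0);
    start = GetCycles();
    *result = checksum(buf, len);
    end = GetCycles();
    return end - start;
}
```

Because the count is taken as a difference, cycles spent before the first call do not affect the result, which is the same reason the text gives for bracketing a block of code with calls to PrintCycleCount().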


Chapter 4

Benchmark Applications

Previous research in the area of network processors has focused on exploring

their performance characteristics by running individual applications in isolation

and in a single-threaded environment. Network processors are capable of supporting more complex applications that guide packets through a series of applications

running in parallel. For the first stage of this work we ported three typical network

applications to our simulator: MD5, URL-switching, and Advanced Encryption

Standard (AES). This process involved modifying memory allocations to use

appropriate simulator address space and reorganizing each application to take

advantage of multiple threads. For the second stage of this work we combined

these three applications into three types of end-to-end applications: shared, sta-

tic, and dynamic. These distinctions refer to three different ways of utilizing the

available threads.


4.1 Message Digest

The MD5 algorithm [23] creates a 128-bit signature of an arbitrary-length

input message. Until recently, it was believed to be infeasible to produce two

messages with the same signature or to produce the original message given a

signature. However, in March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne

de Weger demonstrated [14] that two valid X.509 certificates [11] could be created

with identical MD5 signatures. Although more robust algorithms exist, MD5 is

still extensively used in public-key cryptography and in verifying data integrity.

Our implementation of MD5 was adapted from the NetBench suite of applications for network processors [18, 16]. The NetBench implementation was

designed to process packets in a serial fashion utilizing a single thread. The

multi-threaded, multiprocessor nature of NPUs is better utilized by processing

packets in parallel. In order to analyze the performance characteristics in this

environment, our implementation of MD5 offloads the processing of packets to

available microengine threads. In this way, the number of packets processed in

parallel is equal to the number of microengine threads.

As shown in Figure 4.1, the StrongARM processor is responsible for accepting

incoming packets and distributing them to idle microengine threads. Commu-

nication between the StrongARM and microengines is done through the use of

semaphores. When the StrongARM finds an idle thread, it copies a pointer to

the current packet and the length of the current packet to shared memory lo-

cations known by both the StrongARM and the thread. The StrongARM then

sets a semaphore that triggers the thread to begin executing. When all packets

have been processed, the StrongARM waits for each thread to become idle, then

notifies them to exit before exiting itself.

Figure 4.1. MD5 - StrongARM Algorithm

Each microengine thread proceeds as shown in Figure 4.2. It waits until its

semaphore has changed, then either exits or copies the current packet to its stack

before processing it to generate a 128-bit signature. It then resets its semaphore

and returns to waiting.
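
The handshake described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the thesis code: Python threads stand in for microengine threads, a shared condition variable stands in for the simulator's semaphores, hashlib.md5 stands in for the NetBench MD5 routine, and the names Slot, worker, and run_md5 are hypothetical.

```python
import hashlib
import threading

IDLE, WORK, EXIT = 0, 1, 2  # hypothetical semaphore values


class Slot:
    """Shared memory locations known by the controller and one worker."""

    def __init__(self):
        self.state = IDLE    # plays the role of the thread's semaphore
        self.packet = None   # "pointer" to the current packet
        self.index = -1      # where the worker stores its signature


def worker(slot, cv, results):
    """Microengine-thread loop: wait, then either exit or hash one packet."""
    while True:
        with cv:
            cv.wait_for(lambda: slot.state != IDLE)
            if slot.state == EXIT:
                return
            packet = bytes(slot.packet)  # copy the packet to local storage
            index = slot.index
        results[index] = hashlib.md5(packet).hexdigest()  # 128-bit signature
        with cv:
            slot.state = IDLE            # reset the semaphore
            cv.notify_all()


def run_md5(packets, num_threads):
    """Controller loop: hand each packet to an idle worker, then shut down."""
    cv = threading.Condition()
    slots = [Slot() for _ in range(num_threads)]
    results = [None] * len(packets)
    threads = [threading.Thread(target=worker, args=(s, cv, results))
               for s in slots]
    for t in threads:
        t.start()
    for i, packet in enumerate(packets):
        with cv:
            cv.wait_for(lambda: any(s.state == IDLE for s in slots))
            slot = next(s for s in slots if s.state == IDLE)
            slot.packet, slot.index = packet, i
            slot.state = WORK            # trigger the worker
            cv.notify_all()
    with cv:
        # Wait for every worker to become idle, then tell all of them to exit.
        cv.wait_for(lambda: all(s.state == IDLE for s in slots))
        for s in slots:
            s.state = EXIT
        cv.notify_all()
    for t in threads:
        t.join()
    return results
```

In this sketch the number of packets in flight never exceeds num_threads, which mirrors the statement that the number of packets processed in parallel equals the number of microengine threads.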

4.2 URL-Based Switch

URL-based switching directs network traffic based on the Uniform Resource

Locator (URL) found in a packet. Other terms for URL-based switch include

Layer 7 switch, content-switch, and web-switch. The purpose of switching based

on Layer 7 content is to realize improved performance and reliability of web-based

Figure 4.2. MD5 - Microengine Algorithm

services. A Layer 4 switch located in front of the cluster of servers can control

how each Transmission Control Protocol (TCP) connection is established on a

per-connection basis. How requests are directed within a connection is beyond the

reach of a Layer 4 switch. Traffic can be managed per request, rather than per

connection, by a URL-based switch [12].

In order to manage requests, a URL-switch acts as the end point of each

TCP connection and establishes its own connections to the servers containing

the content requested by the client. It then relays content to the client. In this

way, the switch can perform load-balancing and fault detection and recovery.

For instance, if one server is overloaded or unreachable, the switch can send its

request to a different server.

Our URL-based switching algorithm is based on the implementation found

in NetBench [18, 16]. The algorithm searches the contents of a packet for a list

of matching patterns. Each pattern has an associated destination that can be

used to switch the packet or begin another process. The focus of our URL-based

switch is the pattern matching algorithm.

Unlike our implementation of MD5, our URL-based switch does not utilize

parallelism by processing multiple packets at once; instead, it uses multiple threads

to process each packet. The data structure used to store patterns is a list of lists.

Each element of the primary list is made up of a secondary list and the largest

common substring of the patterns in the secondary list. The algorithm proceeds

as shown in Figures 4.3 and 4.4.

Figure 4.3. URL - StrongARM Algorithm

Each packet received by the StrongARM is copied to the stack and run

through the Internet checksum algorithm to verify its integrity. For each ele-

ment in the list of largest common substrings, the StrongARM copies the el-

ement’s secondary list pointer to a shared memory location known by an idle

microengine thread. A pointer to the current packet is also copied to shared

memory and then the idle microengine’s semaphore is set to notify it to begin

Figure 4.4. URL - Microengine Algorithm

executing. The microengine thread first copies the packet to its stack and then

uses a Boyer-Moore search function to determine whether the packet contains

the largest common substring. If this test is positive, then the thread proceeds to

search for a matching pattern in the secondary list. Otherwise, the microengine

resets its semaphore and returns to an idle state. If the thread finds a matching

pattern, it sets its semaphore to reflect this before returning to an idle state.

The StrongARM continues until it reaches the end of the primary list or until a

thread finds a matching pattern; it then processes the next packet.
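
The list-of-lists lookup can be illustrated with a short Python sketch. Several things here are assumptions made for illustration: the search routine is the Boyer-Moore-Horspool simplification rather than the NetBench Boyer-Moore function, the primary list is walked sequentially instead of handing each element to a separate microengine thread, and the names horspool_find, switch_packet, and ROUTES, as well as the table contents, are hypothetical.

```python
def horspool_find(text: bytes, pattern: bytes) -> int:
    """Return the first index of pattern in text, or -1 (Boyer-Moore-Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Shift for each byte: distance from its rightmost occurrence
    # (excluding the last position) to the end of the pattern.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        pos += shift.get(text[pos + m - 1], m)
    return -1


# Primary list: (largest common substring, secondary list of (pattern, destination)).
ROUTES = [
    (b".jpg", [(b"/images/logo.jpg", "image-server")]),
    (b"/cgi-bin/", [(b"/cgi-bin/search", "search-server"),
                    (b"/cgi-bin/login", "auth-server")]),
]


def switch_packet(packet: bytes, routes=ROUTES):
    """Return the destination of the first matching pattern, or None."""
    for common, secondary in routes:
        # Cheap pre-filter: skip the secondary list unless the packet
        # contains the substring shared by all of its patterns.
        if horspool_find(packet, common) < 0:
            continue
        for pattern, destination in secondary:
            if horspool_find(packet, pattern) >= 0:
                return destination
    return None
```

The pre-filter is what makes the structure worthwhile: a packet that lacks the common substring cannot match any pattern in the corresponding secondary list, so that list is never searched.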

4.3 Advanced Encryption Standard

The AES is an encryption standard adopted by the US government in 2001

[20]. The standard was proposed by Vincent Rijmen and Joan Daemen under the

name Rijndael [7]. AES is a block cipher encryption algorithm based on 128, 192,

or 256 bit keys. The algorithm is known to perform efficiently in both hardware

and software.

Our implementation of AES is based on the Rijndael algorithm found in

the MiBench embedded benchmark suite [10, 21]. In much the same way that

our MD5 algorithm processes packets in parallel, our AES algorithm offloads

the encryption of packets to microengine threads. The encryption is performed

using a 256 bit key that is loaded into each thread’s stack during startup. This

algorithm executes on the simulator in the same manner as MD5 above (Figures

4.1 and 4.2).

Chapter 5

Results

In order to evaluate the effectiveness of NPUs to support multi-threaded end-

to-end applications and the effectiveness of various parallelization techniques to

take advantage of the NPU architecture, we performed four types of tests: Isola-

tion, Shared, Static, and Dynamic. The Isolation tests establish a baseline and ex-

plore application behavior on the multi-threading NPU architecture. The Shared

tests explore how each application is affected by the concurrent execution of other

applications. The Static tests reveal characteristics of an end-to-end application

and how to best distribute threads. Finally, the Dynamic tests serve to compare

an on-demand thread allocation algorithm to statically allocated threads.

5.1 Isolation Tests

The purpose of the Isolation tests is twofold: to establish a baseline for sub-

sequent tests and to explore the effects of multi-threading on the NPU. The

Isolation tests consisted of independent tests for each application. For each in-

dependent test, the number of microengine threads available to the application

was varied between 1 and 24, since the simulator supports up to 24 threads. A

data point was also gathered for the serial version of each application in which

no microengine threads were used.

5.1.1 MD5

Figure 5.1. MD5 Isolated Speedup on 1000 Packets

Test results in Figure 5.1 show that parallelization of the MD5 algorithm

offers significant speedup compared to its serial counterpart. The data point at

zero threads represents the serial version of MD5 executed on the StrongARM

processor. The data point at 1 thread represents the multi-threaded version

making use of the StrongARM and a single microengine. This case is slower than

the serial version because of the overhead involved in communication between the

StrongARM and the microengine and because the microengine offers less

processing power than the StrongARM. As the number of threads increases,

the combined processing power of the microengines outweighs the communication

overhead.

The slope of the speedup graph in Figure 5.1 decreases suddenly at 7, 13,

and 19 threads. These changes can be attributed to the fact that there are 6

microengines; therefore, below 7 threads each microengine is responsible for

at most a single thread. From 7 to 12 threads, each microengine is burdened with up to 2

threads. Similarly, as the number of threads increases to 24, each microengine is

burdened with 3 and then 4 threads causing the speedup to approach a flat line.
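
The thread counts at which the slope changes follow from simple arithmetic, shown here as a small Python check. The sketch assumes that threads are spread as evenly as possible over the 6 microengines; the function name is hypothetical.

```python
import math

MICROENGINES = 6  # number of microengines in the simulated NPU


def max_threads_per_engine(total_threads: int) -> int:
    """Heaviest per-microengine load when threads are spread evenly."""
    return math.ceil(total_threads / MICROENGINES)


# Thread counts at which some microengine first takes on one more thread.
load_steps = [n for n in range(2, 25)
              if max_threads_per_engine(n) > max_threads_per_engine(n - 1)]
print(load_steps)  # [7, 13, 19]
```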

5.1.2 URL

Figure 5.2. URL Isolated Speedup on 100 Packets (non-polling)

Although test results for the parallelization of URL show improvements over

the serial version, characteristics of the algorithm limited speedup. As stated in

the previous chapter, the URL algorithm is parallelized in such a way that multi-

ple threads work together to process each packet. Each thread is responsible for

searching the packet for a particular set of patterns, and the first match preempts

further execution. The drawback of this algorithm is that since only one thread

will find a match, the other threads do work that in hindsight is unnecessary.

Figure 5.3. URL Isolated Speedup on 100 Packets (polling)

This in itself would not be detrimental to the application’s performance except

that all threads are vying for a limited number of shared resources.

We developed two variations of the URL algorithm in an attempt to minimize

the cycles spent searching false leads. The first version allows each thread to

run to completion after a matching pattern is found. Once a thread reports to

the StrongARM that a match has been found, the StrongARM stops spawning

new threads and simply waits for the active threads to finish, although their

processing is immaterial. In the alternative approach, when a match is found,

the StrongARM sets a global flag that is constantly polled by each thread. When

a thread detects that the flag has changed, it stops executing.
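
A rough Python sketch of the polling variant follows. It is only an illustration under assumptions: Python threads stand in for microengine threads, a threading.Event stands in for the global flag, and the matching thread sets the flag itself rather than reporting to a controller that then sets it. The function name and its parameters are hypothetical.

```python
import threading


def search_with_polling(packet: bytes, pattern_sets):
    """Search packet for any pattern; each thread owns one set of patterns."""
    stop = threading.Event()   # the global flag polled by every thread
    lock = threading.Lock()
    found = []                 # holds the first reported match, if any

    def scan(patterns):
        for pattern in patterns:
            if stop.is_set():  # poll the flag before each unit of work
                return         # another thread already found a match
            if pattern in packet:
                with lock:
                    if not found:
                        found.append(pattern)
                stop.set()     # tell the other threads to stop
                return

    threads = [threading.Thread(target=scan, args=(patterns,))
               for patterns in pattern_sets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found[0] if found else None
```

The non-polling variant corresponds to removing the stop.is_set() check, so that every thread runs through its whole pattern set. The check itself costs cycles on every iteration, and it only pays off when some thread actually finds a match.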

Although it was expected that the polling version of URL would perform

better, it actually performed slightly worse than the non-polling version. As

shown in Figures 5.2 and 5.3, the highest speedup attained by non-polling was

1.75 and for polling 1.64. Analysis of the application’s output shows that a

matching pattern is found in only about 40% of the trace packets, thus polling

is unable to preempt execution 60% of the time. The difference in speedup is due to

the fact that the polling version is doing unnecessary work 60% of the time and

that polling itself wastes too many cycles.

In both versions of URL, the speedup drops off after reaching a maximum

between 4 and 6 threads. This indicates that contention for shared resources

becomes a problem after this point.

5.1.3 AES

Figure 5.4. AES Isolated Speedup on 100 Packets

Speedup tests on AES show that this algorithm performs poorly when of-

floaded to the microengines. The AES encryption algorithm requires each packet

be read and processed 16 bytes at a time. State is maintained for the lifetime

of each packet in an accumulator that is made up of the encryption key and state

variables. In addition, a static lookup table of 8 Kbytes is required. The L1 data

cache for the StrongARM is 8 Kbytes compared to 1 Kbytes for the microengines.

Due to the limited size of the microengine caches, AES suffers from substantial

cache misses.

Processing of each packet consumes roughly 1.36 million simulator cycles when

encryption is performed on the StrongARM. The same process consumes roughly

11.4 million simulator cycles on a microengine thread when it is the only micro-

engine thread running. This is an increase by a factor of 8.4. In contrast, MD5

consumes roughly 0.518 million cycles on the StrongARM and 0.922 million cycles

on a single microengine thread. This is an increase by a factor of roughly 1.8.

Thus, AES requires a substantially higher increase in cycles when moving from

the StrongARM to a microengine thread.

Figure 5.4 shows that although performance on the microengine threads is

far worse than the serial version, it remains relatively constant as the number of

threads increases. Therefore, the poor performance of AES on the microengine

threads is primarily a result of processing power and cache size, not memory

contention between threads, which would cause the speedup to tail off.

5.1.4 Isolation Analysis

These tests reveal general characteristics of each kernel on both the Stron-

gARM and microengines. MD5 has been shown to offer strong speedup on mi-

croengine threads using conventional parallelization. URL, using an alternative

approach to multi-threading, has been shown to provide maximum speedup be-

tween 4 and 6 threads employing either a polling or non-polling scheme. Finally,

AES reveals an algorithm with poor performance on the microengines that cannot

be overcome by multi-threading.

5.2 Shared Tests

The purpose of the Shared tests is to determine how sensitive each kernel

is to the concurrent execution of the other kernels. For these tests we ran all

three kernels on the simulator at the same time. The StrongARM served as the

controller, passing incoming packets to available microengine threads. We ran

one test for each kernel, in which the number of threads available to the kernel

under test was varied, while the threads available to the other kernels remained

constant. Our baseline for each of these tests was 1 thread for MD5, 4 threads

for URL, and 1 thread for AES. This baseline was chosen because running

URL with fewer than 4 threads was found to cause a significant bottleneck. The

number of threads available to the kernel under examination was increased for

each subsequent run. Each kernel processed a separate packet stream until the

kernel under test completed the desired number of packets, in this case 50.

Figure 5.5 shows the speedup results from all three tests on the same graph

revealing the relative speedup of each kernel. Clearly, MD5 and AES have much

greater speedup than URL, indicating they are less sensitive to the concurrent

processing of other kernels. However, it is more interesting to compare the Shared

speedup of each kernel with its Isolated speedup. This comparison is covered in

the following subsections.

5.2.1 MD5

The speedup results of MD5 in the Isolation and Shared tests, shown in

Figures 5.1 and 5.5 respectively, show few differences. The slope of each graph

is approximately the same and both peak near a speedup of six. This indicates

that MD5 is not substantially affected by the concurrent execution of URL and

Figure 5.5. Shared Speedup on 50 Packets

AES. The lightweight nature of MD5 with regard to memory is the most likely

explanation for this behavior.

Figure 5.6 compares the MD5 Isolation and Shared tests with regard to the

number of cycles consumed by the StrongARM while 50 packets are processed,

revealing that more cycles are required to process the same packet stream when

MD5 is sharing the resources of the NPU. The horizontal axis corresponds to the

number of MD5 threads employed to process the packets while the vertical axis

corresponds to the number of cycles spent processing the packet stream. Since

with the Shared tests 4 threads are allocated to URL and 1 to AES, these threads

cause contention for access to shared resources and therefore higher cycle counts

than the Isolation tests.

Figure 5.6. MD5 Isolated vs. Shared Cycles on 50 Packets

5.2.2 URL

Although the Shared speedup of URL shown in Figure 5.5 steadily increases,

its maximum of 1.17 with 22 threads does not match the Isolation speedup shown

in Figure 5.2 that peaks at 1.75 and degrades to 1.41 with 22 threads. This

indicates that URL is affected by the concurrent execution of other applications

due to its memory access requirements.

5.2.3 AES

The Shared speedup of AES shown in Figure 5.5 is an order of magnitude

greater than the Isolation speedup shown in Figure 5.4. This high speedup is

due to the fact that the baseline for this test performed extremely poorly. This

can be attributed to two characteristics of the AES kernel. Firstly, as shown

in the Isolation tests, AES performs poorly on the microengines due to their

lack of processing power and limited size of their cache. Secondly, since the

StrongARM is the controller for all three kernels it continuously monitors all of

the microengine threads and distributes incoming packets as necessary. In the

baseline, the StrongARM has to monitor one thread for each kernel, thus only

one-third of its time is spent monitoring the AES thread. Therefore, the AES

thread occasionally finishes processing a packet and wastes idle cycles waiting

for the StrongARM to send it another packet. As more threads are allocated to

AES, the StrongARM spends a larger percentage of time monitoring AES threads

therefore increasing throughput.

5.2.4 Shared Analysis

The Shared tests reveal that MD5 and AES are relatively insensitive to the

concurrent execution of the other kernels on a single NPU. URL, however, is

sensitive, and its speedup suffers when it is run alongside the other kernels.

5.3 Static Tests

The Static tests were designed to reveal characteristics of the end-to-end ap-

plication, such as the location of bottlenecks and the ideal thread configuration.

The testing process was similar to that of the Shared tests. The difference is

that instead of processing independent packet streams, the applications worked

together to process a single packet stream. Each incoming packet was processed

first by MD5, then by URL, and finally by AES. This scenario represents a

possible end-to-end application running on an NPU as shown in Figure 5.7. The

purpose of this application is to distribute sensitive information from a trusted

internal network through the Internet to a variety of hosts. Each packet is re-

ceived by the application from the internal network, the application calculates

its MD5 signature, determines its destination based on a deep inspection of the

packet, and then encrypts it. Finally, the encrypted packet along with its

signature is sent to a host machine, although this step is not included in the

simulated application. To complete this scenario, the host machine would decrypt

the packet and verify that the contents were not modified in transit by comparing

the included signature to a newly generated one. This is also not included in the

simulation.

Figure 5.7. End-to-End Application Scenario

Figure 5.8. Optimization with Static Allocation of Threads

For these tests, the number of threads allocated to each stage of the end-to-

end application is static for each run. Once again, the baseline test is 1 thread for

MD5, 4 threads for URL, and 1 thread for AES. Each subsequent test increases

the number of threads by one and attempts to determine the optimal configura-

tion. The optimal configuration is determined by giving the additional thread to

each of the applications in turn, and observing which configuration yields the best

speedup. This configuration is then used as a starting point for the subsequent

test.
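
The search procedure described above is a greedy one and can be sketched as follows. This is an illustrative Python outline, not the scripts used for the tests: the measure callback stands in for a full simulator run that returns a cycle count, and the toy cost model and its work values are hypothetical numbers chosen only to exercise the loop.

```python
def greedy_allocate(measure, start, max_threads):
    """Grow a thread configuration one thread at a time.

    measure(config) returns a cost such as cycles (lower is better);
    start maps each kernel name to its initial thread count.
    Returns the list of configurations chosen at each thread total.
    """
    config = dict(start)
    history = [dict(config)]
    while sum(config.values()) < max_threads:
        best = None
        for kernel in config:
            trial = dict(config)
            trial[kernel] += 1          # give the extra thread to this kernel
            cost = measure(trial)
            if best is None or cost < best[0]:
                best = (cost, trial)
        config = best[1]                # starting point for the next total
        history.append(dict(config))
    return history


# Hypothetical cost model: the slowest stage determines the cost.
TOY_WORK = {"md5": 1, "url": 48, "aes": 10}


def toy_measure(config):
    return max(TOY_WORK[k] / config[k] for k in config)


baseline = {"md5": 1, "url": 4, "aes": 1}
history = greedy_allocate(toy_measure, baseline, 8)
```

Each step needs one run per kernel, so the number of runs grows with the thread total multiplied by the number of kernels instead of with the number of all possible configurations. A greedy search of this kind is not guaranteed to find the global optimum.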

Figure 5.8 shows the resulting optimal configurations for each number of avail-

able threads between 6 and 24. These configurations were found through test runs

of 50 packets. MD5 never became a bottleneck point and 1 thread remained suf-

ficient throughout the tests. URL and AES almost evenly split the remaining

threads, with a final configuration of 12 threads for AES, 11 for URL, and

1 for MD5. These results show that the demands of AES and URL are similar

and parallelization offers increased performance for these applications, while the

simplicity of MD5 makes parallelization of it in the context of this end-to-end

application unnecessary.

The above discovery reveals an interesting characteristic of this end-to-end

application. Although MD5 provided the best speedup in the Isolation tests,

parallelizing it in the Static tests resulted in less performance improvement than

further parallelization of the other applications. This can be explained by Am-

dahl’s Law [2], which states that the overall speedup achievable from the improve-

ment of a proportion of the required computation is affected by the size of that

proportion. If P is the proportion and S is the speedup, Amdahl’s Law states

that the overall speedup will be:

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - P) + \frac{P}{S}}
\]

In this end-to-end application, the computation required to perform MD5 is a

small proportion of the overall computation. Consequently, overall speedup

benefits more from increased parallelization of URL and AES.
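
A short numeric illustration of Amdahl's Law may help here. The proportions and speedups below are hypothetical values picked to show the effect; they are not measurements from these tests, and the function name is likewise an assumption.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a proportion p of the work is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)


# A kernel that is 5% of the work, sped up sixfold, barely helps overall.
small_share = amdahl_speedup(0.05, 6.0)   # about 1.04

# A kernel that is 50% of the work, sped up only twofold, helps far more.
large_share = amdahl_speedup(0.50, 2.0)   # about 1.33
```

Under these assumed numbers, a modest improvement to a dominant kernel outweighs a large improvement to a minor one, which is the pattern observed for MD5 relative to URL and AES.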

It is also interesting to note that although AES did not benefit from additional

microengines during the Isolation tests (Figure 5.4), in the high-load context of

this end-to-end application additional AES threads benefit overall performance.

Figure 5.8 also shows that initially more threads were allocated to URL and

after 14 threads more threads were allocated to AES. Since URL is required to

finish processing each packet before it can be sent to AES, URL caused more of

a bottleneck when it had fewer than 10 threads. After that point, AES required 4

threads to every 1 for URL in order to keep pace.

5.4 Dynamic Tests

The Dynamic tests present an alternative approach to the Static tests. Where

the Static tests represent ideal configurations, the Dynamic tests represent real-

istic configurations. Static allocation of microengine threads is also much less

feasible since all possible configurations must be run in order to determine the

best one for the given end-to-end application. This could become an extremely

complex and lengthy process. The trade-off with a dynamic heuristic is increased

complexity in the logic of the application.

The purpose of these tests was to determine how an on-demand allocation

of threads performs against a static approach. The Dynamic tests consist of

all three kernels processing the same packet stream in serial, as in the Static

tests, but with threads dynamically allocated based on demand. Once again,

the StrongARM serves as the controller and is responsible for allocating threads.

Allocation is implemented through the use of queues for each stage of the end-

to-end application. Each queue stores pointers to packets that are waiting to be

processed by the next stage. The StrongARM detects when a queue has packets

and creates threads to process them.
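
The queue-based allocation can be sketched as a toy discrete-time model in Python. This is a simplified illustration, not the simulator's controller: per-stage costs are hypothetical step counts, the policy of serving later stages first is an assumption made so that packets drain out of the pipeline, and the names STAGES and run_dynamic are invented for this example.

```python
from collections import deque

STAGES = ("md5", "url", "aes")  # order in which each packet is processed


def run_dynamic(num_packets, num_threads, cost):
    """Return the time steps needed to push all packets through all stages.

    cost maps each stage name to a positive number of steps per packet.
    """
    queues = {stage: deque() for stage in STAGES}
    queues["md5"].extend(range(num_packets))   # incoming packets
    busy = []    # one [remaining_steps, stage_index, packet] per busy thread
    done = 0
    steps = 0
    while done < num_packets:
        # Controller: give idle threads to any stage with waiting packets.
        for idx in reversed(range(len(STAGES))):
            queue = queues[STAGES[idx]]
            while queue and len(busy) < num_threads:
                busy.append([cost[STAGES[idx]], idx, queue.popleft()])
        # Every busy thread advances by one step.
        steps += 1
        still_busy = []
        for job in busy:
            job[0] -= 1
            if job[0] > 0:
                still_busy.append(job)
            elif job[1] + 1 < len(STAGES):
                queues[STAGES[job[1] + 1]].append(job[2])  # hand to next stage
            else:
                done += 1                                  # packet finished
        busy = still_busy
    return steps


toy_cost = {"md5": 1, "url": 3, "aes": 2}
```

Because no thread is tied to a particular stage in this model, a thread is idle only when every queue is empty.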

Figure 5.9. Dynamic Speedup on 50 Packets

Figure 5.9 shows the speedup of the Dynamic application using as a baseline

the Static configuration consisting of 1 MD5, 4 URL, and 1 AES thread. The

speedup increases from 4.29 with 6 threads to 4.39 with 24 threads, a substantial

increase over the Static baseline.

Figure 5.10 shows the difference between the number of cycles required for

each of the applications to process the same number of packets. While the Static

version spent in the neighborhood of 1.3 billion cycles per 50 packets, the Dynamic

version spent closer to 300 million, a ratio of 4.3:1.

Figure 5.10. Static vs. Dynamic Cycles on 50 Packets

This discrepancy can be attributed to cycles wasted on idle threads. With the

Static version, each thread is statically assigned to perform either MD5, URL, or

AES. Since the URL kernel requires much longer to run than MD5, the queue

of packets waiting for URL processing is quickly filled forcing the MD5 thread

to stop processing new packets until URL can reduce the queue. At the same

time, when the URL threads were unable to process packets as quickly as the

AES threads, some AES threads wasted idle cycles. The Dynamic version did

not suffer from these bottleneck issues because idle threads were put to use by

whichever kernel required them.

Another benefit of the Dynamic version is that it is able to adjust to changes

in load caused by varying packet sizes and payloads. Specifically, since URL

performs a thorough string matching on the payload of each packet, the size of

the packet has a large effect on the number of cycles required to process it. The

Dynamic version is able to minimize bottlenecks in URL due to large packets by

putting more threads to work on the bottleneck.

Figure 5.10 also shows that neither the Static nor the Dynamic versions of

the end-to-end application benefit much from additional threads. The number

of cycles remains relatively constant from 6 to 24 threads. The Isolation tests

show that between 6 and 24 threads MD5 is the only kernel to experience signif-

icant performance improvement. The speedup of URL declines slightly and AES

remains relatively constant. Therefore, with the exception of the MD5 kernel,

the end-to-end applications experience performance characteristics similar to the

Isolation tests. Once again, this can be explained by Amdahl’s Law [2], because

MD5 constitutes only a small percentage of the overall computation. Thus, the

performance of the end-to-end application is driven by the performance of the

URL and AES kernels.

5.5 Analysis

We performed four types of tests for our analysis: Isolation, Shared, Static,

and Dynamic. The Isolation tests established a baseline and explored kernel

behavior on the multi-threading NPU architecture. The Shared tests explored

how each kernel was affected by the concurrent execution of other kernels. The

Static tests revealed characteristics of an end-to-end application and how to best

distribute threads. Finally, the Dynamic tests served to compare an on-demand

thread allocation algorithm to statically allocated threads.

The Isolation tests revealed general characteristics of each kernel on both

the StrongARM and microengines. MD5 offered strong speedup on microengine

threads using conventional parallelization. URL, using an alternative approach to

multi-threading, provided maximum speedup between 4 and 6 threads employing

either a polling or non-polling scheme. Finally, AES revealed an algorithm with

poor performance on the microengines that could not be overcome by multi-

threading.

The Shared tests revealed that MD5 and AES are relatively insensitive to the

concurrent execution of the other applications on a single NPU. URL, however,

was shown to be sensitive because its speedup suffered when it was run alongside

the other kernels.

The Static tests provided a baseline for the Dynamic tests and revealed the

optimal thread configurations for running the end-to-end application. Results

showed that the demands of AES and URL are similar and parallelization offered

increased performance for these applications, while the simplicity of MD5 made

parallelization of it in the context of this end-to-end application unnecessary.

As an alternative to statically allocating threads, the Dynamic tests explored

the benefits of dynamically allocating threads. Overall, the Dynamic tests required

less than 25% as many cycles to process each 50-packet test as the Static

counterpart.

Chapter 6

Conclusion

We have presented a network processor simulator, multi-threaded end-to-end

benchmark applications, and an analysis of the characteristics of these appli-

cations on NPUs. Our first contribution was the creation of a simulator that

emulates a generic network processor modeled on the Intel IXP1200. Our simu-

lator fills a gap in existing academic research by supporting multiple processing

units. Our second contribution was the construction of multi-threaded, end-to-

end application benchmarks. These benchmarks extend the functionality of ex-

isting benchmarks based on single-threaded kernels. Our final contribution was

an analysis of the characteristics of our benchmarks on our network processor

simulator.

Our analysis in Chapter 5 found several interesting results. Firstly, although

the MD5 kernel scaled well in the Isolation and Shared tests, parallelization of

it in an end-to-end application had little effect due to Amdahl’s Law. Secondly,

the Static and Dynamic tests found that the end-to-end application did not have

much performance gain from the addition of more than 6 threads. Finally, the

Dynamic version of the end-to-end application required less than 25% as many

cycles to process the same packet stream compared to the Static version.

In an attempt to bridge the gap between the speed of ASIC chips and the

flexibility of general purpose processors, NPUs utilize parallel processing and

special-purpose hardware and memory structure, as well as other techniques.

While NPUs make it possible to deploy complex end-to-end applications into the

network, high speed networks put heavy load on these devices making application

optimization an important area of research. The simulator presented in this paper

made possible the development and analysis of two end-to-end application benchmarks as well

as the kernels making up these applications. Through the development of these

kernels and applications we explored several parallelization techniques. Using our

simulator and testing methodology, we unveiled the performance characteristics

of these kernels and application benchmarks on a typical NPU.

Chapter 7

Future Work

The simulator developed in this work provides a tool that can be used in a

variety of future projects. Thus far, the simulator has been used by Gridley in

his Master’s thesis on active network algorithm performance [9] and by Tsudama to

test his denial-of-service detection algorithm as part of his Master’s thesis [29].

As future work, several improvements could be made to the existing simulator

including support for dedicated processing chips, larger cycle count capability,

and updates necessary to model the current generation of NPUs.

Other future work could include testing the existing end-to-end applications

on an updated simulator to determine whether or not the performance problems

found in this work have been overcome by the current generation of NPUs. If

the same performance problems remain, further investigation into methods of

designing parallel applications to avoid bottlenecks on NPUs will be required.

Additionally, the parameters of the NPU architecture could be adjusted to deter-

mine which changes lead to performance improvements. However, if performance

bottlenecks are not found on current NPUs, then larger scale end-to-end appli-

cations should be developed to push the performance limits of the architecture

and reveal new bottlenecks.

The benchmark suite could be extended by including additional kernels. The

end-to-end applications could be extended to include these kernels or new end-

to-end applications could be developed to model other real-world scenarios. Op-

timization of the current and future kernels and end-to-end applications will

continue to be an open area of research.

Bibliography

[1] J. Allen, B. Bass, C. Basso, R. Boivie, J. Calvignac, G. Davis, L. Frelechoux,

M. Heddes, A. Herkersdorf, A. Kind, J. Logan, M. Peyravian, M. Rinaldi,

R. Sabhikhi, M. Siegel, and M. Waldvogel. IBM PowerNP network proces-

sor: Hardware, software, and applications. IBM Journal of Research and

Development, 2003.

[2] Gene Amdahl. Validity of the single processor approach to achieving large-

scale computing capabilities. In AFIPS Conference Proceedings, pages 483–

485, Atlantic City, N.J., 1967.

[3] Douglas C. Burger and Todd M. Austin. The simplescalar tool set, version

2.0. Technical Report CS-TR-1997-1342, Computer Sciences Department,

University of Wisconsin, June 1997.

[4] A. Campbell, S. Chou, M. Kounavis, V. Stachtos, and J. Vicente. Netbind: A

binding tool for constructing data paths in network processor-based routers.

In Proceedings of IEEE OPENARCH 2002, New York City, NY, June 2002.

[5] Intel Corporation. Intel internet exchange architecture (IXA) software

development kit. http://www.intel.com/design/network/products/npfamily/sdk_download.htm.

Accessed June 3, 2005.

[6] Intel Corporation. Ixp1200 network processor datasheet, September 2003.

[7] J. Daemen and V. Rijmen. AES proposal: Rijndael. First Advanced En-

cryption Standard (AES) Conference, August 1998.

[8] EZchip technologies. Network processor designs for next-generation networking

equipment. White Paper, December 1999.

http://www.ezchip.com/html/tech_nsppaper.html.

[9] Dave Gridley. Active network algorithm performance on a network processor:

Adaptive metric based routing and multicast. Master’s thesis, California

Polytechnic State University, San Luis Obispo, June 2004.

[10] M. Guthaus, J. Ringenberg, T. Austin, T. Mudge, and R. Brown. Mibench:

A free, commercially representative embedded benchmark suite. In Pro-

ceedings of the IEEE 4th Annual Workshop on Workload Characterization,

Austin, TX, December 2001.

[11] R. Housley. Internet X.509 public key infrastructure certificate and certifi-

cate revocation list (CRL) profile. RFC 3280, Internet Engineering Task

Force, April 2002.

[12] PMC-Sierra Inc. URL-based switching. WHITE PAPER PMC-2002232,

February 2001.

[13] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans

Kaashoek. The Click modular router. ACM Transactions on Computer

Systems, 18(3):263–297, August 2000.

[14] Arjen Lenstra, Xiaoyun Wang, and Benne de Weger. Colliding X.509

certificates. Cryptology ePrint Archive, Report 2005/067, 2005.

http://eprint.iacr.org/.

[15] Alberto Leon-Garcia and Indra Widjaja. Communication Networks: Fun-

damental Concepts and Key Architectures. McGraw-Hill School Education

Group, 2000.

[16] Gokhan Memik. NetBench web site. http://cares.icsl.ucla.edu/NetBench/, 2002.

[17] Gokhan Memik and William H. Mangione-Smith. NEPAL: A framework for

efficiently structuring applications for network processors. Second Workshop

on Network Processors (NP-2), February 2003.

[18] G. Memik, W. H. Mangione-Smith, and W. Hu. NetBench: A benchmarking suite for

network processors. In Proceedings of IEEE International Conference on

Computer-Aided Design, November 2001.

[19] N. Shah, W. Plishker, and K. Keutzer. NP-Click: A programming model for

the Intel IXP1200. HPCA-9 2nd Workshop on Network Processors (NP-2),

February 2003.

[20] National Institute of Standards and Technology, National Bureau of

Standards, U.S. Department of Commerce. Advanced encryption standard.

Federal Information Processing Standard (FIPS) 197, November 2001.

http://csrc.nist.gov/publications/fips.

[21] University of Michigan at Ann Arbor. MiBench version 1.

http://www.eecs.umich.edu/mibench/, 2002.

[22] R. Ramaswamy and T. Wolf. PacketBench: A tool for workload

characterization of network processing. In Proceedings of IEEE 6th Annual Workshop

on Workload Characterization (WWC-6), pages 42–50, Austin, TX, October

2003.

[23] Ronald L. Rivest. The MD5 message-digest algorithm. RFC 1321, Internet

Engineering Task Force, April 1992.

[24] N. Shah and K. Keutzer. Network processors: Origin of species. In Proceed-

ings of ISCIS XVII, The Seventeenth International Symposium on Computer

and Information Sciences, 2002.

[25] Cisco Systems. Parallel express forwarding. White Paper,

2002.

http://www.cisco.com/en/US/products/hw/routers/ps133/products_white_paper09186a008008902a.shtml.

[26] Wikipedia, the free encyclopedia. Application framework.

http://en.wikipedia.org/wiki/Application_framework, May 2005.

[27] Wikipedia, the free encyclopedia. Benchmark (computing).

http://en.wikipedia.org/wiki/Benchmark_%28computing%29, June 2005.

[28] Wikipedia, the free encyclopedia. Simulator.

http://en.wikipedia.org/wiki/Simulator#Simulation_in_computer_science, May 2005.

[29] Brett Tsudama. A novel distributed denial-of-service detection algorithm.

Master’s thesis, California Polytechnic State University, San Luis Obispo,

June 2004.

[30] T. Wolf and M. A. Franklin. CommBench - a telecommunications benchmark

for network processors. In Proceedings of IEEE International Symposium on

Performance Analysis of Systems and Software, pages 154–162, Austin, TX,

April 2000.

[31] Xelerated. Xelerator X10q network processor. Product Brief, 2004.

http://www.xelerated.com/file.aspx?file_id=62.

Appendix A

Acronyms

AES Advanced Encryption Standard

ASIC Application Specific Integrated Circuit

CRC Cyclic Redundancy Check

DH Diffie-Hellman

DMM Dynamic Module Manager

DRR Deficit Round Robin

HTTP Hypertext Transfer Protocol

MD5 Message-Digest Algorithm 5

NAT Network Address Translation

NEPAL Network Processor Application Language

NPU Network Processing Unit

PE Processing Element

PISA Portable Instruction Set Architecture

PGP Pretty Good Privacy

REED Reed-Solomon Forward Error Correction

SHA Secure Hash Algorithm

TCP Transmission Control Protocol

URL Uniform Resource Locator
