Page 1: Thesis Defense

Multi-Threaded End-to-End Applications on Network

Processors

Michael Watts

January 26th, 2006

Page 2: Thesis Defense

The Stage

• Demand to move applications from end nodes to network edge

• Increased processing power at edge makes this possible

[Diagram: end nodes connecting through edge devices across the Internet]

Page 3: Thesis Defense

Example

• All communication between corporate offices secured at Internet edge

[Diagram: Corporate Office West and Corporate Office East communicating across the Internet]

• End nodes responsible for establishing secure communication

Page 4: Thesis Defense

Applications at Network Edge

• Provide service to end nodes
  – Security
  – Quality of Service
  – Intrusion detection
  – Load balancing

• Kernels carry out a single task
  – Such as MD5, URL-based switching, and AES

• End-to-end applications combine multiple kernels

Page 5: Thesis Defense

Intelligent Devices

• High-level applications at network edge
  – Demand processing power
  – Demand flexibility of general-purpose processors

• Application-Specific Integrated Circuit (ASIC)
  – Speed without flexibility
  – Customized for particular use

• Network Processing Unit (NPU)
  – Programmable flexibility
  – Performance through parallelization

Page 6: Thesis Defense

Benchmarks

• Increasing complexity of next-generation applications
  – More demand on NPUs
  – Benchmark applications used to test performance of NPUs

• Current network benchmarks
  – Single-threaded kernels
  – Insufficient for NPU multi-processor architecture

Page 7: Thesis Defense

Contributions

• Multi-threaded end-to-end application benchmark suite

• Generic NPU simulator

• Analysis shows kernel performance is an inaccurate indicator of end-to-end application performance

Page 8: Thesis Defense

Overview

1. Network Processors and Simulators

2. The NPU Simulator

3. Benchmark Applications

4. Tests and Results

5. Conclusion

6. Future Work

Page 9: Thesis Defense

Network Processors

• NPU
  – Programmable packet processing device
  – Over 30 self-identified NPUs

• NPU Architecture
  – Dedicated co-processors
  – High-speed network interfaces
  – Multiple processing units
    • Pipelined
    • Symmetric

Page 10: Thesis Defense

Pipelined vs. Symmetric

• Pipelined

• Symmetric

[Diagram: a pipelined architecture passes each packet through a chain of processing units; a symmetric architecture assigns each packet to one of several identical processing units]

Page 11: Thesis Defense

Intel IXP1200

• Symmetric architecture
• Processors (266 MHz, 32-bit RISC)
  – 1 x StrongARM controller
    • L1 and L2 cache
  – 6 x microengines (ME)
    • 4 hardware-supported threads each
    • No cache, lots of registers
• Shared Memory
  – 8 MBytes SRAM
  – 256 MBytes SDRAM
  – StrongARM and MEs share memory bus
  – No built-in memory management

Page 12: Thesis Defense

Intel IXP1200 Architecture

Page 13: Thesis Defense

NPU Simulators

• Purpose
  – Execute programs on a foreign platform
  – Provide performance statistics

• SimpleScalar
  – Cycle-accurate hardware simulation
  – Architecture similar to MIPS
  – Modified GNU GCC generates binaries

Page 14: Thesis Defense

PacketBench

• Developed at University of Massachusetts

• Uses SimpleScalar

• Provides API for basic NPU functions

• NPU platform independence

• Drawback: no support for multiprocessor architectures

Page 15: Thesis Defense

Benchmarks

• Applications designed to assess performance characteristics of a single platform or differences between platforms
  – Synthetic
    • Mimic a particular type of workload
  – Application
    • Real-world applications

• Our focus: application benchmarks for the domain of NPUs

Page 16: Thesis Defense

Benchmark Suites

• MiBench
  – Target: embedded microprocessors
  – Including Rijndael encryption (AES)

• NetBench
  – Target: NPUs
  – Including Message-Digest 5 (MD5) and URL-based switching

• Source available in C
• Limitation: single-threaded

Page 17: Thesis Defense

The Simulator

• Modified existing multiprocessor simulator
• Built on SimpleScalar
• Modeled after Intel IXP1200
  – Modeled processing units, memory, and cache structure
  – Processors share memory bus
  – SRAM reserved for instruction stacks

Parameter       | StrongARM        | Microengines
----------------|------------------|---------------------------
Scheduling      | Out-of-order     | In-order
Width           | 1 (single-issue) | 1 (single-issue)
L1 I Cache Size | 16 KByte         | SRAM (0 penalty)
L1 D Cache Size | 8 KByte          | 1 KByte (replace registers)

Page 18: Thesis Defense

Methods of Use

• Simulator compiles on Linux using GCC
• Takes SimpleScalar binary as input

sim3ixp1200 [-h] [sim-args] program [program-args]

• Threads argument controls number of microengine threads (0-24)
• 6 microengines allotted threads using round-robin

Page 19: Thesis Defense

Application Development

• Developed in C
• Compiled using GCC 2.7.2.3 cross-compiler
  – Linux/x86 to SimpleScalar
• No POSIX thread support; same binary executed by each thread
• No memory management
• Multi-threading primitives
  – getcpu()
  – barrier()
  – ncpus

Page 20: Thesis Defense

Example Code

// common initialization…
barrier();

int thread_id = getcpu();

if (thread_id == 0) {
    // StrongARM
} else if (thread_id == 1) {
    // 1st microengine thread
} else {
    // threads 2 through ncpus: microengine threads
}

Page 21: Thesis Defense

Benchmark Applications

• Modified 3 kernels from MiBench and NetBench
  – Message-Digest 5 (MD5)
  – URL-based switching (URL)
  – Advanced Encryption Standard (AES) [Rijndael]

• Modified memory allocations
• Modified source of incoming packets
• Parallelized

Page 22: Thesis Defense

MD5

• Creates a 128-bit signature of input

• Used extensively in public-key cryptography and verification of data integrity

• Packet processing offloaded to microengine (ME) threads

• Packets processed in parallel

Page 23: Thesis Defense

MD5 Algorithm

• Every packet processed on separate ME thread

• StrongARM monitors for idle threads and assigns work

[Diagram: incoming packets dispatched across microengine threads]

Page 24: Thesis Defense

MD5 Parallelization

[Diagram: MD5 parallelization across the StrongARM and microengines]

Page 25: Thesis Defense

URL

• Directs packets based on payload content

• Useful for load-balancing, fault detection and recovery

• Also called a Layer 7 switch, content switch, or web switch

• Uses pattern matching algorithm

Page 26: Thesis Defense

URL Algorithm

• Work for each packet split among ME threads

• StrongARM iterates over search tree, assigning work to idle ME threads

• ME threads report when match found

[Diagram: StrongARM distributes search-tree work to microengine threads as packets arrive]

Page 27: Thesis Defense

URL Parallelization

[Diagram: URL parallelization across the StrongARM and microengines]

Page 28: Thesis Defense

AES

• Block cipher encryption algorithm

• Made US government standard in 2001

• 256-bit key

• Same parallelization technique as MD5

• Key loaded into each ME’s stack during initialization

• Packet encryption performed in parallel

Page 29: Thesis Defense

Performance Tests

• Purpose
  – Evaluate multi-threading kernels and end-to-end applications

• Tests
  – Isolation
  – Shared
  – Static
  – Dynamic

Page 30: Thesis Defense

Isolation Tests

• Establish baseline

• Explore effects of multi-threading kernels

• Each kernel run in isolation

• Number of ME threads varied from 1 to 24

• Speedup graphed against serial version

Page 31: Thesis Defense

MD5 Isolation Results

• 0: serial on StrongARM
• 1-24: parallel on MEs
• Decreased speedup on 1 ME
• Significant speedup overall
• Note decreasing slope at 7, 13, and 19 threads

Page 32: Thesis Defense

URL Isolation Results

• When 1 thread finds a match, it must wait for the other threads to finish
  – Polling version required polling of a global flag
  – Performed slightly worse (1.64 compared to 1.75)
  – Matching pattern found in 40% of packets

• When too many threads work at once, shared-resource bottlenecks limit speedup

Page 33: Thesis Defense

AES Isolation Results

• Performs poorly on MEs
• Packets processed in 16-byte chunks
• State maintained in accumulator for packet lifetime
• Static lookup table of 8 KBytes
• L1 data cache: 8 KBytes for StrongARM, 1 KByte for MEs
• Consumes more cycles on ME by a factor of 8.4

Page 34: Thesis Defense

Shared Tests

• Reveal sensitivity of each kernel to concurrent execution of other kernels

• StrongARM serves as controller

• Baseline of 1 MD5, 4 URL, and 1 AES thread

• Separate packet streams for each kernel

• Number of threads increased for kernel under test

Page 35: Thesis Defense

Shared Results

• MD5: not substantially affected
• URL: maximum of 1.17 (compared to 1.75)
• AES: order of magnitude higher
  – Baseline uses ME, not StrongARM

Page 36: Thesis Defense

Static Tests

• Characteristics of end-to-end application

• Location of bottlenecks

• Kernels work together to process single packet stream

• Find optimal thread configuration

Page 37: Thesis Defense

End-to-End Application

• Distribution of sensitive information from trusted network over Internet to different hosts
  1. Calculate MD5 signature
  2. Determine destination host using URL
  3. Encrypt packet using AES
  4. Send packet and signature to host

Page 38: Thesis Defense

Static Results

• Baseline of 1 MD5, 4 URL, and 1 AES thread
• Additional thread tried on each kernel
• Best configuration used as starting point for next
• Final result: 1 MD5, 11 URL, and 12 AES threads

Page 39: Thesis Defense

Static Results (cont.)

• Although MD5 had the best speedup in Isolation, it was unable to improve speedup in Static
  – Amdahl's Law: 1 / ((1 - P) + (P / S))

• More threads initially allocated to URL
  – URL bottleneck until 10 threads

Page 40: Thesis Defense

Dynamic Tests

• MEs not dedicated to a single kernel; instead assigned work by the StrongARM based on demand

• StrongARM responsible for allocating threads and maintaining wait-queues

• Realistic configuration

• Increased development complexity

Page 41: Thesis Defense

Dynamic Algorithm

[Diagram: StrongARM feeds microengine threads MD5, URL, and AES work from AES and URL packet queues]

• StrongARM monitors MEs

• Assigns work to idle threads

• First from queues, then from incoming packet stream
  – AES queue
  – URL queue
  – Network

• URL queue fills as MD5 outperforms URL

• Additional threads created for URL

• AES threads created each time URL finishes


Page 42: Thesis Defense

Dynamic Results

• Baseline same as Static
• Substantial speedup over Static

Page 43: Thesis Defense

Dynamic Results (cont.)

• 25% as many cycles as Static
• Some ME threads in Static waste idle cycles
• Less affected by URL bottleneck
• Able to adjust to varying packet sizes

Page 44: Thesis Defense

Analysis

• Isolation
  – Established baseline

• Shared
  – Explored concurrent kernels

• Static
  – End-to-end application characteristics
  – Thread allocation optimization

• Dynamic
  – Contrast on-demand to static thread allocation

Page 45: Thesis Defense

Conclusion

• NPU multi-processor simulator

• Multi-threaded end-to-end benchmark applications

• Analysis of benchmarks on NPU simulator
  – Kernel performance is not indicative of end-to-end application performance
  – MD5 scaled well in Isolation and Shared, but had little effect in end-to-end applications

Page 46: Thesis Defense

Future Work

• NPU simulator
  – Already used in two other M.S. thesis projects
  – Larger cycle count capability
  – Updated to model current NPU generation

• End-to-end applications
  – Simulated on next-generation simulator
  – Further investigation into bottlenecks

Page 47: Thesis Defense

Future Work (cont.)

• Benchmark suite
  – Include additional kernels
  – Model more real-world end-to-end applications

Page 48: Thesis Defense

Thank You, Questions