Page 1: Thesis Defense

Multi-Threaded End-to-End Applications on Network

Processors

Michael Watts

January 26th, 2006

Page 2: Thesis Defense

The Stage

• Demand to move applications from end nodes to network edge

• Increased processing power at edge makes this possible

[Diagram: end nodes connecting through edge devices across the Internet]

Page 3: Thesis Defense

Example

• All communication between corporate offices secured at Internet edge

[Diagram: Corporate Office West and Corporate Office East communicating across the Internet]

• End nodes responsible for establishing secure communication

Page 4: Thesis Defense

Applications at Network Edge

• Provide service to end nodes
  – Security
  – Quality of Service
  – Intrusion detection
  – Load balancing

• Kernels carry out a single task
  – Such as MD5, URL-based switching, and AES

• End-to-end applications combine multiple kernels

Page 5: Thesis Defense

Intelligent Devices

• High-level applications at network edge
  – Demand processing power
  – Demand flexibility of general-purpose processors

• Application-Specific Integrated Circuit (ASIC)
  – Speed without flexibility
  – Customized for particular use

• Network Processing Unit (NPU)
  – Programmable flexibility
  – Performance through parallelization

Page 6: Thesis Defense

Benchmarks

• Increasing complexity of next-generation applications
  – More demand on NPUs
  – Benchmark applications used to test performance of NPUs

• Current network benchmarks
  – Single-threaded kernels
  – Insufficient for NPU multi-processor architecture

Page 7: Thesis Defense

Contributions

• Multi-threaded end-to-end application benchmark suite

• Generic NPU simulator

• Analysis shows kernel performance is an inaccurate indicator of end-to-end application performance

Page 8: Thesis Defense

Overview

1. Network Processors and Simulators

2. The NPU Simulator

3. Benchmark Applications

4. Tests and Results

5. Conclusion

6. Future Work

Page 9: Thesis Defense

Network Processors

• NPU
  – Programmable packet processing device
  – Over 30 self-identified NPUs

• NPU Architecture
  – Dedicated co-processors
  – High-speed network interfaces
  – Multiple processing units
    • Pipelined
    • Symmetric

Page 10: Thesis Defense

Pipelined vs. Symmetric

• Pipelined

• Symmetric

[Diagram: a pipelined architecture passes each packet through a chain of processing units; a symmetric architecture assigns each packet to one of several identical processing units]

Page 11: Thesis Defense

Intel IXP1200

• Symmetric architecture
• Processors (266 MHz, 32-bit RISC)
  – 1 x StrongARM controller
    • L1 and L2 cache
  – 6 x microengines (ME)
    • 4 hardware-supported threads each
    • No cache, lots of registers
• Shared Memory
  – 8 MBytes SRAM
  – 256 MBytes SDRAM
  – StrongARM and MEs share memory bus
  – No built-in memory management

Page 12: Thesis Defense

Intel IXP1200 Architecture

Page 13: Thesis Defense

NPU Simulators

• Purpose
  – Execute programs on a foreign platform
  – Provide performance statistics

• SimpleScalar
  – Cycle-accurate hardware simulation
  – Architecture similar to MIPS
  – Modified GNU GCC generates binaries

Page 14: Thesis Defense

PacketBench

• Developed at University of Massachusetts

• Uses SimpleScalar

• Provides API for basic NPU functions

• NPU platform independence

• Drawback: no support for multiprocessor architectures

Page 15: Thesis Defense

Benchmarks

• Applications designed to assess performance characteristics of a single platform or differences between platforms
  – Synthetic
    • Mimic a particular type of workload
  – Application
    • Real-world applications

• Our focus: application benchmarks for the domain of NPUs

Page 16: Thesis Defense

Benchmark Suites

• MiBench
  – Target: embedded microprocessors
  – Including Rijndael encryption (AES)

• NetBench
  – Target: NPUs
  – Including Message-Digest 5 (MD5) and URL-based switching

• Source available in C
• Limitation: single-threaded

Page 17: Thesis Defense

The Simulator

• Modified existing multiprocessor simulator
• Built on SimpleScalar
• Modeled after Intel IXP1200
  – Modeled processing units, memory, and cache structure
  – Processors share memory bus
  – SRAM reserved for instruction stacks

Parameter       | StrongARM        | Microengines
----------------|------------------|---------------------------
Scheduling      | Out-of-order     | In-order
Width           | 1 (single-issue) | 1 (single-issue)
L1 I Cache Size | 16 KByte         | SRAM (0 penalty)
L1 D Cache Size | 8 KByte          | 1 KByte (replace registers)

Page 18: Thesis Defense

Methods of Use

• Simulator compiles on Linux using GCC
• Takes SimpleScalar binary as input

sim3ixp1200 [-h] [sim-args] program [program-args]

• Threads argument controls number of microengine threads (0-24)
• 6 microengines allotted threads using round-robin

Page 19: Thesis Defense

Application Development

• Developed in C
• Compiled using GCC 2.7.2.3 cross-compiler
  – Linux/x86 to SimpleScalar
• No POSIX thread support; same binary executed by each thread
• No memory management
• Multi-threading primitives
  – getcpu()
  – barrier()
  – ncpus

Page 20: Thesis Defense

Example Code

// common initialization…
barrier();

int thread_id = getcpu();

if (thread_id == 0) {
    // StrongARM
} else if (thread_id == 1) {
    // 1st microengine thread
} else {
    // threads 2 through ncpus: microengine threads
}

Page 21: Thesis Defense

Benchmark Applications

• Modified 3 kernels from MiBench and NetBench
  – Message-Digest 5 (MD5)
  – URL-based switching (URL)
  – Advanced Encryption Standard (AES) [Rijndael]

• Modified memory allocations
• Modified source of incoming packets
• Parallelized

Page 22: Thesis Defense

MD5

• Creates a 128-bit signature of input

• Used extensively in public-key cryptography and verification of data integrity

• Packet processing offloaded to microengine (ME) threads

• Packets processed in parallel

Page 23: Thesis Defense

MD5 Algorithm

• Every packet processed on separate ME thread

• StrongARM monitors for idle threads and assigns work

[Diagram: incoming packets dispatched across microengine threads]

Page 24: Thesis Defense

MD5 Parallelization

[Diagram: MD5 parallelization across the StrongARM and microengines]

Page 25: Thesis Defense

URL

• Directs packets based on payload content

• Useful for load-balancing, fault detection and recovery

• Also called a Layer 7 switch, content switch, or web switch

• Uses pattern matching algorithm

Page 26: Thesis Defense

URL Algorithm

• Work for each packet split among ME threads

• StrongARM iterates over search tree, assigning work to idle ME threads

• ME threads report when match found

[Diagram: StrongARM distributes search-tree work to microengine threads as packets arrive]

Page 27: Thesis Defense

URL Parallelization

[Diagram: URL parallelization across the StrongARM and microengines]

Page 28: Thesis Defense

AES

• Block cipher encryption algorithm

• Made US government standard in 2001

• 256-bit key

• Same parallelization technique as MD5

• Key loaded into each ME’s stack during initialization

• Packet encryption performed in parallel

Page 29: Thesis Defense

Performance Tests

• Purpose
  – Evaluate multi-threading kernels and end-to-end applications

• Tests
  – Isolation
  – Shared
  – Static
  – Dynamic

Page 30: Thesis Defense

Isolation Tests

• Establish baseline

• Explore effects of multi-threading kernels

• Each kernel run in isolation

• Number of ME threads varied from 1 to 24

• Speedup graphed against serial version

Page 31: Thesis Defense

MD5 Isolation Results

• 0: serial on StrongARM
• 1-24: parallel on MEs
• Decreased speedup on 1 ME
• Significant speedup overall
• Note decreasing slope at 7, 13, and 19 threads

Page 32: Thesis Defense

URL Isolation Results

• When 1 thread finds a match, it must wait for the other threads to finish
  – Polling version required polling of a global flag
  – Performed slightly worse (1.64 compared to 1.75)
  – Matching pattern found in 40% of packets

• When too many threads work at once, shared-resource bottlenecks limit speedup

Page 33: Thesis Defense

AES Isolation Results

• Performs poorly on MEs
• Packets processed in 16-byte chunks
• State maintained in accumulator for packet lifetime
• Static lookup table of 8 KBytes
• L1 data cache: 8 KBytes for StrongARM, 1 KByte for MEs
• Consumes more cycles on ME by a factor of 8.4

Page 34: Thesis Defense

Shared Tests

• Reveal sensitivity of each kernel to concurrent execution of other kernels

• StrongARM serves as controller

• Baseline of 1 MD5, 4 URL, and 1 AES thread

• Separate packet streams for each kernel

• Number of threads increased for kernel under test

Page 35: Thesis Defense

Shared Results

• MD5: not substantially affected
• URL: maximum of 1.17 (compared to 1.75)
• AES: order of magnitude higher
  – Baseline uses ME, not StrongARM

Page 36: Thesis Defense

Static Tests

• Characteristics of end-to-end application

• Location of bottlenecks

• Kernels work together to process single packet stream

• Find optimal thread configuration

Page 37: Thesis Defense

End-to-End Application

• Distribution of sensitive information from trusted network over Internet to different hosts
  1. Calculate MD5 signature
  2. Determine destination host using URL
  3. Encrypt packet using AES
  4. Send packet and signature to host

Page 38: Thesis Defense

Static Results

• Baseline of 1 MD5, 4 URL, and 1 AES thread
• Additional thread tried on each kernel
• Best configuration used as starting point for next
• Final result: 1 MD5, 11 URL, and 12 AES threads

Page 39: Thesis Defense

Static Results (cont.)

• Although MD5 had the best speedup in Isolation, it was unable to improve speedup in Static
  – Amdahl's Law: 1 / ((1 - P) + (P / S))

• More threads initially allocated to URL
  – URL bottleneck until 10 threads

Page 40: Thesis Defense

Dynamic Tests

• MEs not dedicated to a single kernel; instead assigned work by the StrongARM based on demand

• StrongARM responsible for allocating threads and maintaining wait-queues

• Realistic configuration

• Increased development complexity

Page 41: Thesis Defense

Dynamic Algorithm

[Diagram: StrongARM feeds microengine threads MD5, URL, and AES work from AES and URL packet queues]

• StrongARM monitors MEs

• Assigns work to idle threads

• First from queues, then from incoming packet stream
  – AES queue
  – URL queue
  – Network

• URL queue fills as MD5 outperforms URL

• Additional threads created for URL

• AES threads created each time URL finishes


Page 42: Thesis Defense

Dynamic Results

• Baseline same as Static
• Substantial speedup over Static

Page 43: Thesis Defense

Dynamic Results (cont.)

• 25% as many cycles as Static
• Some ME threads in Static waste idle cycles
• Less affected by URL bottleneck
• Able to adjust to varying packet sizes

Page 44: Thesis Defense

Analysis

• Isolation
  – Established baseline

• Shared
  – Explored concurrent kernels

• Static
  – End-to-end application characteristics
  – Thread allocation optimization

• Dynamic
  – Contrast on-demand to static thread allocation

Page 45: Thesis Defense

Conclusion

• NPU multi-processor simulator

• Multi-threaded end-to-end benchmark applications

• Analysis of benchmarks on NPU simulator
  – Kernel performance is not indicative of end-to-end application performance
  – MD5 scaled well in Isolation and Shared, but had little effect in end-to-end applications

Page 46: Thesis Defense

Future Work

• NPU simulator
  – Already used in two other M.S. thesis projects
  – Larger cycle count capability
  – Updated to model current NPU generation

• End-to-end applications
  – Simulated on next-generation simulator
  – Further investigation into bottlenecks

Page 47: Thesis Defense

Future Work (cont.)

• Benchmark suite
  – Include additional kernels
  – Model more real-world end-to-end applications

Page 48: Thesis Defense

Thank You, Questions