An Efficient Programmable 10 Gigabit Ethernet
Network Interface Card
Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai
Designing a 10 Gigabit NIC
Programmability for performance
  Computation offloading improves performance
NICs have power, area concerns
  Architecture solutions should be efficient
Above all, must support 10 Gb/s links
  What are the computation, memory requirements?
  What architecture efficiently meets them?
  What firmware organization should be used?
Mechanisms for an Efficient Programmable 10 Gb/s NIC
A partitioned memory system
  Low-latency access to control structures
  High-bandwidth, high-capacity access to frame data
A distributed task-queue firmware
  Utilizes frame-level parallelism to scale across many simple, low-frequency processors
New RMW instructions
  Reduce firmware frame-ordering overheads by 50% and reduce clock frequency requirement by 17%
Outline
Motivation
How Programmable NICs Work
Architecture Requirements, Design
Frame-parallel Firmware
Evaluation
How Programmable NICs Work
[Diagram: the PCI bus feeds a PCI Interface and the network feeds an Ethernet Interface; both connect over an internal bus to the NIC's Memory and Processor(s)]
Per-frame Requirements
            Instructions   Data Accesses
TX Frame    281            101
RX Frame    253            85

Processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions
Aggregate Requirements: 10 Gb/s, Maximum-Sized Frames

            Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
TX Frame    229 MIPS                 2.6 Gb/s                 19.75 Gb/s
RX Frame    206 MIPS                 2.2 Gb/s                 19.75 Gb/s
Total       435 MIPS                 4.8 Gb/s                 39.5 Gb/s

1514-byte frames at 10 Gb/s = 812,744 frames/s
Meeting 10 Gb/s Requirements with Hardware
Processor Architecture
  At least 435 MIPS within an embedded device
  Does NIC firmware have ILP?
Memory Architecture
  Low latency control data
  High bandwidth, high capacity frame data
  ... both, how?
ILP Processors for NIC Firmware?
ILP limited by data, control dependences
Analysis of dynamic traces reveals dependences

              Perfect BP   Perfect 1BP   No BP
In-order 1    0.87         0.87          0.87
In-order 2    1.19         1.19          1.13
In-order 4    1.34         1.33          1.17
Out-order 1   1.00         1.00          0.88
Out-order 2   1.96         1.74          1.21
Out-order 4   2.65         2.00          1.29
Processors: 1-Wide, In-order
2x performance is costly
  Branch prediction, reorder buffer, renaming logic, wakeup logic
  Overheads translate to greater than 2x core power, area costs
  Great for a GP processor; not for an embedded device
Other opportunities for parallelism? YES!
  Many steps to process a frame - run them simultaneously
  Many frames need processing - process simultaneously
Use parallel single-issue cores

              Perfect 1BP   No BP
In-order 1    0.87          0.87
Out-order 2   1.74          1.21
Memory Architecture
Competing demands
  Frame data: high bandwidth, high capacity for many offload mechanisms
  Control data: low latency; coherence among processors, PCI Interface, and Ethernet Interface
The traditional solution: caches
  Advantages: low latency, transparent to the programmer
  Disadvantages: hardware costs (tag arrays, coherence)
  In many applications, advantages outweigh costs
Are Caches Effective?
SMPCache trace analysis of a 6-processor NIC architecture
[Chart: hit ratio (percent) vs. cache size (16 B to 32 KB) for the 6-processor trace; y-axis spans 0-60%]
Choosing a Better Organization
[Diagram: a cache hierarchy vs. a partitioned organization]
Putting it All Together
[Diagram: P CPUs (CPU 0 ... CPU P-1), each with its own I-cache backed by a shared instruction memory; a (P+4)x(S) 32-bit crossbar connects the CPUs, PCI Interface, Ethernet Interface, and an external memory interface to off-chip DRAM with S scratchpad banks (Scratchpad 0 ... S-pad S-1)]
Parallel Firmware
NIC processing steps already well-defined
Previous Gigabit NIC firmware divides steps between 2 processors
... but does this mechanism scale?
Task Assignment with an Event Register
Event register fields: PCI Read Bit | SW Event Bit | ... Other Bits

PCI Interface finishes work
Processor(s) inspect transactions
Processor(s) need to enqueue TX data
Processor(s) pass data to Ethernet Interface
Task-level Parallel Firmware
[Timeline: PCI Read hardware status over time against the function running on each processor; Proc 0 transfers DMAs 0-4 and then DMAs 5-9 while Proc 1 processes DMAs 0-4 and then 5-9, each processor idling while it waits for the other]
Frame-level Parallel Firmware
[Timeline: each processor transfers, processes, and builds the event for its own DMAs (0-4 on Proc 0, 5-9 on Proc 1), overlapping the steps and leaving less idle time]
Evaluation Methodology
Spinach: a library of cycle-accurate LSE simulator modules for network interfaces
  Memory latency, bandwidth, contention modeled precisely
  Processors modeled in detail
  NIC I/O (PCI, Ethernet Interfaces) modeled in detail
  Verified when modeling the Tigon 2 Gigabit NIC (LCTES 2004)
Idea: model everything inside the NIC
  Gather performance, trace data
Scaling in Two Dimensions
[Chart: throughput (Gb/s, 0-20) vs. core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with the Ethernet limit shown]
Processor Performance
Processor Behavior            IPC Component
Execution                     0.72
Miss Stalls                   0.01
Load Stalls                   0.12
Scratchpad Conflict Stalls    0.05
Pipeline Stalls               0.10
Total                         1.00

Achieves 83% of theoretical peak IPC (0.72 execution IPC vs. the 0.87 a single-issue in-order core achieves even with perfect prediction)
Small I-caches work
Sensitive to memory stalls
  Half of loads are part of a load-to-use sequence
  Conflict stalls could be reduced with more ports, more banks
Reducing Frame Ordering Overheads
Firmware ordering costly - 30% of execution
Synchronization, bitwise check/updates occupy processors, memory
Solution: Atomic bitwise operations that also update a pointer according to last set location
Maintaining Frame Ordering
[Diagram: a Frame Status Array with one bit per frame index; CPUs A and B prepare frames and set status bits out of order, while CPU C detects completed frames: LOCK, iterate over the array, notify the Ethernet Interface hardware, UNLOCK]
RMW Instructions Reduce Clock Frequency
Performance: 6x166 MHz = 6x200 MHz
  Performance is equivalent at all frame sizes
  17% reduction in frequency requirement
Dynamically tasked firmware balances the benefit
  Send cycles reduced by 28.4%
  Receive cycles reduced by 4.7%
Conclusions: A Programmable 10 Gb/s NIC

This NIC architecture relies on:
  Data memory system - partitioned organization, not coherent caches
  Processor architecture - parallel scalar processors
  Firmware - frame-level parallel organization
  RMW instructions - reduce ordering overheads
A programmable NIC: a substrate for offload services
Comparing Frame Ordering Methods
[Chart: full-duplex throughput (Gb/s, 0-20) vs. UDP datagram size (0-1400 bytes), comparing the duplex Ethernet limit, 6x200 MHz software-only firmware, and 6x166 MHz RMW-enhanced firmware]