An Efficient Programmable 10 Gigabit Ethernet
Network Interface Card
Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai
Designing a 10 Gigabit NIC
Programmability for performance
  Computation offloading improves performance
NICs have power, area concerns
  Architecture solutions should be efficient
Above all, must support 10 Gb/s links
  What are the computation, memory requirements?
  What architecture efficiently meets them?
  What firmware organization should be used?
Mechanisms for an Efficient Programmable 10 Gb/s NIC
A partitioned memory system
  Low-latency access to control structures
  High-bandwidth, high-capacity access to frame data
A distributed task-queue firmware
  Utilizes frame-level parallelism to scale across many simple, low-frequency processors
New RMW instructions
  Reduce firmware frame-ordering overheads by 50% and reduce clock frequency requirement by 17%
Outline
Motivation
How Programmable NICs Work
Architecture Requirements, Design
Frame-parallel Firmware
Evaluation
How Programmable NICs Work
[Diagram: the PCI bus feeds a PCI Interface and the network feeds an Ethernet Interface; both connect over an internal bus to the NIC's Memory and Processor(s)]
Per-frame Requirements
            Instructions   Data Accesses
TX Frame    281            101
RX Frame    253            85

Processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions
Aggregate Requirements: 10 Gb/s, Maximum-Sized Frames

            Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
TX Frame    229 MIPS                 2.6 Gb/s                 19.75 Gb/s
RX Frame    206 MIPS                 2.2 Gb/s                 19.75 Gb/s
Total       435 MIPS                 4.8 Gb/s                 39.5 Gb/s

1514-byte frames at 10 Gb/s = 812,744 frames/s
Meeting 10 Gb/s Requirements with Hardware
Processor Architecture
  At least 435 MIPS within an embedded device
  Does NIC firmware have ILP?
Memory Architecture
  Low latency control data
  High bandwidth, high capacity frame data
  ... both, how?
ILP Processors for NIC Firmware?
ILP limited by data, control dependences
Analysis of dynamic traces reveals dependences

              Perfect BP   Perfect 1BP   No BP
In-order 1    0.87         0.87          0.87
In-order 2    1.19         1.19          1.13
In-order 4    1.34         1.33          1.17
Out-order 1   1.00         1.00          0.88
Out-order 2   1.96         1.74          1.21
Out-order 4   2.65         2.00          1.29
Processors: 1-Wide, In-order
2x performance is costly
  Branch prediction, reorder buffer, renaming logic, wakeup logic
  Overheads translate to greater than 2x core power, area costs
  Great for a GP processor; not for an embedded device
Other opportunities for parallelism? YES!
  Many steps to process a frame - run them simultaneously
  Many frames need processing - process simultaneously
Use parallel single-issue cores

              Perfect 1BP   No BP
In-order 1    0.87          0.87
Out-order 2   1.74          1.21
Memory Architecture
Competing demands
  Frame data: high bandwidth, high capacity for many offload mechanisms
  Control data: low latency; coherence among processors, PCI Interface, and Ethernet Interface
The traditional solution: caches
  Advantages: low latency, transparent to the programmer
  Disadvantages: hardware costs (tag arrays, coherence)
  In many applications, advantages outweigh costs
Are Caches Effective?
SMPCache trace analysis of a 6-processor NIC architecture
[Chart: hit ratio (percent) vs. cache size (16 B to 32 KB) for the 6-processor trace; y-axis spans 0-60%]
Choosing a Better Organization
[Diagram: a cache hierarchy vs. a partitioned organization]
Putting it All Together
[Diagram: P CPUs (CPU 0 ... CPU P-1), each with its own I-cache backed by a shared instruction memory; a (P+4)x(S) 32-bit crossbar connects the CPUs, PCI Interface, Ethernet Interface, and an external memory interface to off-chip DRAM with S scratchpad banks (Scratchpad 0 ... S-pad S-1)]
Parallel Firmware
NIC processing steps already well-defined
Previous Gigabit NIC firmware divides steps between 2 processors
... but does this mechanism scale?
Task Assignment with an Event Register
Event register fields: PCI Read Bit | SW Event Bit | ... Other Bits

PCI Interface finishes work
Processor(s) inspect transactions
Processor(s) need to enqueue TX data
Processor(s) pass data to Ethernet Interface
Task-level Parallel Firmware
[Timeline: PCI Read hardware status over time against the function running on each processor; Proc 0 transfers DMAs 0-4 and then DMAs 5-9 while Proc 1 processes DMAs 0-4 and then 5-9, each processor idling while it waits for the other]
Frame-level Parallel Firmware
[Timeline: each processor transfers, processes, and builds the event for its own DMAs (0-4 on Proc 0, 5-9 on Proc 1), overlapping the steps and leaving less idle time]
Evaluation Methodology
Spinach: a library of cycle-accurate LSE simulator modules for network interfaces
  Memory latency, bandwidth, contention modeled precisely
  Processors modeled in detail
  NIC I/O (PCI, Ethernet Interfaces) modeled in detail
  Verified when modeling the Tigon 2 Gigabit NIC (LCTES 2004)
Idea: model everything inside the NIC
  Gather performance, trace data
Scaling in Two Dimensions
[Chart: throughput (Gb/s, 0-20) vs. core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with the Ethernet limit shown]
Processor Performance
Processor Behavior            IPC Component
Execution                     0.72
Miss Stalls                   0.01
Load Stalls                   0.12
Scratchpad Conflict Stalls    0.05
Pipeline Stalls               0.10
Total                         1.00

Achieves 83% of theoretical peak IPC (0.72 execution IPC vs. the 0.87 a single-issue in-order core achieves even with perfect prediction)
Small I-caches work
Sensitive to memory stalls
  Half of loads are part of a load-to-use sequence
  Conflict stalls could be reduced with more ports, more banks
Reducing Frame Ordering Overheads
Firmware ordering costly - 30% of execution
Synchronization, bitwise check/updates occupy processors, memory
Solution: Atomic bitwise operations that also update a pointer according to last set location
Maintaining Frame Ordering
[Diagram: a Frame Status Array with one bit per frame index; CPUs A and B prepare frames and set status bits out of order, while CPU C detects completed frames: LOCK, iterate over the array, notify the Ethernet Interface hardware, UNLOCK]
RMW Instructions Reduce Clock Frequency
Performance: 6x166 MHz = 6x200 MHz
  Performance is equivalent at all frame sizes
  17% reduction in frequency requirement
Dynamically tasked firmware balances the benefit
  Send cycles reduced by 28.4%
  Receive cycles reduced by 4.7%
Conclusions: A Programmable 10 Gb/s NIC

This NIC architecture relies on:
  Data memory system - partitioned organization, not coherent caches
  Processor architecture - parallel scalar processors
  Firmware - frame-level parallel organization
  RMW instructions - reduce ordering overheads
A programmable NIC: a substrate for offload services
Comparing Frame Ordering Methods
[Chart: full-duplex throughput (Gb/s, 0-20) vs. UDP datagram size (0-1400 bytes), comparing the duplex Ethernet limit, 6x200 MHz software-only firmware, and 6x166 MHz RMW-enhanced firmware]