Flexible High Performance Traffic Generation on Commodity Multi-core Platforms
Nicola Bonelli, Andrea Di Pietro, Stefano Giordano, Gregorio Procissi
CNIT and Dip. di Ingegneria dell’Informazione - Università di Pisa
PF_DIRECT@TMA12, Dec 05, 2014
Transcript
Page 1: PF_DIRECT@TMA12

Flexible High Performance Traffic Generation on Commodity Multi-core Platforms

Nicola Bonelli, Andrea Di Pietro, Stefano Giordano, Gregorio Procissi
CNIT and Dip. di Ingegneria dell’Informazione - Università di Pisa

Page 2

Introduction and Motivations

• New network devices are emerging… (probes, NIDSs, shapers)
• Traffic generators available on the market:
  • Expensive black-box solutions (e.g. the Spirent AX analyzer)
  • Not extensible enough: limited traffic patterns, poor semantics for randomization, etc.
• Solutions based on PCs and professional NICs (Endace, Napatech, Invea-Tech) are cheaper
  • They enable fast packet transmission, but usually do not provide a framework for traffic generation
• A traffic generator should combine the flexibility of software with the power of modern hardware
  • Multi-core architectures equipped with multi-queue NICs are today commodity hardware
• Is it possible to create traffic-generation software that, running on top of such a parallel architecture, is able to provide hardware-class performance?

Page 3

Software for traffic generation

• A number of software solutions for traffic generation exist (trafgen, iperf, rude/crude, mgen)
• Ostinato and Brute make use of PF_PACKET sockets and are therefore able to customize traffic at the data-link layer:
  • Packet rate hardly exceeds a few million packets per second (no scalability)
  • No explicit support for multi-queue NICs
  • No support for time-stamping to control the times at which packets are transmitted

Fast packet transmission…

• Recently, accelerated drivers have also emerged: netmap (Luigi Rizzo)
  • Memory-maps the DMA descriptors of NICs to user space and can transmit the same packet, or a small set of packets, at wire speed (14.8 Mpps)
  • A single thread generating packets with random IP addresses does not fill the pipe (~6-8 Mpps each)
  • This holds even when using the very fast Mersenne-twister random generator (~50 CPU cycles per draw)
• Additional investigation is required…
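To make the per-packet randomization cost concrete, the following is a minimal sketch (not the authors' code) of the inner loop such a generator thread runs: two Mersenne-twister draws per 64-byte frame to fill the IPv4 source and destination fields. The frame layout and field offsets are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <random>

// Hypothetical 64-byte frame: 14-byte Ethernet header followed by an
// IPv4 header, so the src/dst address fields sit at offsets 26 and 30.
struct Frame {
    std::array<uint8_t, 64> bytes{};
};

// Fill the IPv4 src/dst fields with pseudo-random addresses.
// std::mt19937 is the standard Mersenne-twister engine the slides
// mention (~50 CPU cycles per 32-bit draw): two draws per packet.
inline void randomize_addresses(Frame& f, std::mt19937& gen)
{
    uint32_t src = gen();
    uint32_t dst = gen();
    std::memcpy(f.bytes.data() + 26, &src, 4);
    std::memcpy(f.bytes.data() + 30, &dst, 4);
}
```

At ~100 cycles of randomization per packet plus header checksum updates and the send call itself, a single ~2.5 GHz core saturates well below the 14.8 Mpps wire rate of 10G with minimum-size frames, which is consistent with the ~6-8 Mpps per-thread figure above.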

Page 4: PF_DIRECT@TMA12

PF_DIRECT featuresWe implemented a brand new socket, named PF_DIRECT:

• A socket designed for the traffic generation (and transmission)• Compliant with vanilla drivers (not a custom driver)• Designed to run on top of commodity parallel hardware

• Support of timestamp in transmission

• Decoupling the traffic generation from packet transmission• Packets are generated by a user-space thread and transmitted by

multiple kernel threads

• Simple patterns are generated and transmitted nearly at wire speed• More complex patterns, most likely, do not have this requirement

Page 5: PF_DIRECT@TMA12

PF_DIRECT architecturePF_DIRECT kernel module consists of:

• A user-space library written in C++11 supposed to handle memory mapping, packet dispatching among k-thread, etc.

• A special memory mapped byte-oriented SPSC queue• Amortizes traffic coherence between cores (of queue index invalidations)

• Kernel thread supposed to transmit the packets buffered at the SPSC queues, each at the given timestamp

• Active wait or reschedule in case of long wait…• TSC of different cores are synchronized on modern CPUs (INVARIANT_TSC)

• A ring of pre-allocated socket buffers (skb) which are re-used by the kernel module and never get deallocated by network drivers

• User-counter trick

Page 6

PF_DIRECT architecture

Page 7

Traffic generation with PF_DIRECT

Our experimental traffic generator, built on top of PF_DIRECT, consists of:
• A user-space application in which each thread of execution represents a source of traffic
• A traffic-source “engine” (which can concurrently make use of different traffic models)
  • One user-space thread per core, running a deadline scheduler (~20 ns context switch)
• A user-defined traffic model (micro-thread) that is in charge of:
  • Creating the packet to be transmitted
  • Scheduling the timestamp for the packet transmission
  • Sending the packet through the PF_DIRECT socket (buffering it in the SPSC queue)
• XML composition blocks that allow a given source to be instantiated and bound to a core and to a given hardware queue
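The deadline scheduler above can be sketched as a min-heap of traffic sources keyed by their next transmission timestamp; the names and the callable-based interface below are illustrative assumptions, not the actual engine API.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// A traffic source (micro-thread): on each activation it builds and
// enqueues one packet, then returns the deadline of its next packet.
struct Source {
    uint64_t deadline;                       // next transmission time (e.g. TSC)
    std::function<uint64_t(uint64_t)> step;  // build + send, return next deadline
};

struct Later {
    bool operator()(const Source& a, const Source& b) const {
        return a.deadline > b.deadline;      // min-heap on deadline
    }
};

using SourceHeap =
    std::priority_queue<Source, std::vector<Source>, Later>;

// Run the source with the earliest deadline, re-insert it with its new
// deadline, and repeat until no source is due before max_time.
inline void run_until(SourceHeap& q, uint64_t max_time)
{
    while (!q.empty() && q.top().deadline <= max_time) {
        Source s = q.top();
        q.pop();
        s.deadline = s.step(s.deadline);
        q.push(s);
    }
}
```

With this structure, switching between sources is just a heap pop/push, which is consistent with the ~20 ns context-switch figure for cooperative micro-threads.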

Page 8

Traffic generator architecture

Page 9

Experimental results: 1G (testbed “Monsters”)

• 1 Gb link
• Xeon 6-core X5650 @ 2.57 GHz, 12 GB RAM
• Intel 82599 multi-queue 10G Ethernet adapter, ixgbe 3.4.24 device driver
• PF_DIRECT for traffic generation
• Spirent AX-4000 traffic analyzer
• Traffic model: CBR, 64-byte frames with random IP addresses
  • Single source: 1 user-space thread
  • One hardware queue: 1 kernel thread

Page 10

1G link: CBR at 100 kpps, inter-arrival times

Page 11

1G link: variable rate up to 1.4 Mpps

Page 12

1G link: inter-arrival times of a Poisson process at 100 kpps

Page 13

1G link: inter-arrival times of a Poisson process at 1 Mpps
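A Poisson source like the one measured here can be scheduled by drawing exponentially distributed inter-departure times and accumulating them into absolute per-packet transmission timestamps. The sketch below is an assumption about how such a traffic model would compute its deadlines, not the generator's actual code.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Produce n absolute transmission timestamps (in seconds) for a
// Poisson packet process with the given mean rate in packets/s:
// inter-departure gaps are i.i.d. exponential with mean 1/rate.
inline std::vector<double> poisson_timestamps(double rate_pps, std::size_t n,
                                              std::mt19937& gen)
{
    std::exponential_distribution<double> gap(rate_pps);
    std::vector<double> ts(n);
    double t = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        t += gap(gen);   // next inter-departure gap
        ts[i] = t;       // absolute send time
    }
    return ts;
}
```

In the PF_DIRECT pipeline these timestamps would be attached to the buffered packets, with the kernel threads releasing each packet when the (invariant) TSC reaches its deadline.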

Page 14

Experimental results: 10G (hosts “Mascara” and “Monsters”)

• 10 Gb link

Sender:
• Xeon 6-core X5650 @ 2.57 GHz, 12 GB RAM
• Intel 82599 multi-queue 10G Ethernet adapter, ixgbe 3.4.24 device driver
• PF_DIRECT for traffic generation

Receiver:
• Xeon 6-core X5650 @ 2.57 GHz, 12 GB RAM
• Intel 82599 multi-queue 10G Ethernet adapter, ixgbe 3.4.24 device driver
• PFQ for traffic capture

Traffic model: CBR, 64-byte frames with random IP addresses
• 1 user-space thread
• Multiple hardware queues: 4 kernel threads

Page 15

10G link: variable rate up to 12.8 Mpps

Page 16

10G link: inter-arrival times of a Poisson process at 4 Mpps

Page 17

10G link: throughput (bps)

Page 18

10G link: throughput (bps)

Page 19

Conclusions

• PF_DIRECT is a Linux socket that leverages the potential of multi-core architectures and multi-queue NICs
• PF_DIRECT decouples the task of packet generation from that of transmission
  • A single thread is able to generate non-trivial traffic close to the wire rate (~13 Mpps)
  • Multiple kernel threads transmit packets through multiple queues
  • Transmission timestamps are supported (in TSC units)
• An experimental traffic generator has been built on top of PF_DIRECT

Page 20

Future work

• Release the PF_DIRECT source code
• Additional performance improvements in PF_DIRECT
• Performance: identify a small set of changes, common to different drivers, that could define a “PF_DIRECT-aware driver”
• Implement a stable version of the traffic generator with complex traffic models