Page 1: 1 intro to_dpdk_and_hw

Intro to DPDK & HW

Network Platforms Group

Page 2: 1 intro to_dpdk_and_hw

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Intel® Active Management Technology requires the platform to have an Intel® AMT-enabled chipset, network hardware and software, as well as connection with a power source and a corporate network connection. With regard to notebooks, Intel AMT may not be available or certain capabilities may be limited over a host OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. For more information, see http://www.intel.com/technology/iamt.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information.

†Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Pentium® 4 Processor supporting HT Technology and an HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support HT Technology.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.

* Other names and brands may be claimed as the property of others.

Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice.

Copyright © 2013, Intel Corporation. All rights reserved.

Page 3: 1 intro to_dpdk_and_hw

Topics

Why DPDK – PMD vs Linux interrupt driver, memory config, user space.

Licensing

Memory IA – NUMA, huge pages, TLBs on IA

Memory DPDK – mem pools, buffers, allocation etc.

Cache handling, DDIO

Page 4: 1 intro to_dpdk_and_hw

Intel® Data Plane Development Kit (Intel® DPDK)

• Big Idea: software solution for accelerating packet-processing workloads on IA
• Deployment Models
• Performance – delivers a 25X performance jump over Linux
• Commercial Support
• Free, Open Source, BSD License
• Comprehensive Virtualization support
• Enjoys vibrant community support

Concepts – Code – Commercial

[Chart: per-core L3 forwarding performance (Mpps) by platform – Linux: 1.1 Mpps, Intel® DPDK: 28.5 Mpps]

Disclaimer: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Page 5: 1 intro to_dpdk_and_hw

What Problems Does DPDK Address?

Page 6: 1 intro to_dpdk_and_hw

What Problem Does DPDK Address?

40 Gbps line rate (or 4x10G): Rx → Process Packet → Tx

                        64-byte packets          1024-byte packets
40G packets/second      59.5 million each way    4.8 million each way
Packet arrival rate     16.8 ns                  208.8 ns
2 GHz clock cycles      33 cycles                417 cycles

[Chart: packets per second vs. packet size (64–1472 B), contrasting typical server packet sizes with network infrastructure packet sizes]
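The 64-byte numbers follow directly from the line rate; a quick back-of-the-envelope check, assuming the usual 20 bytes of Ethernet preamble plus inter-frame gap per packet:

    64 B frame + 20 B preamble/IFG  = 84 B = 672 bits on the wire
    40 Gbit/s / 672 bits            ≈ 59.5 million packets/s each way
    1 / 59.5 Mpps                   ≈ 16.8 ns between packets
    16.8 ns x 2 GHz                 ≈ 33 clock cycles per packet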

Page 7: 1 intro to_dpdk_and_hw

The Problem Intel® DPDK Addresses

[Chart repeated: packets per second vs. packet size (64–1472 B) for typical server vs. network infrastructure traffic, at 40 Gbps line rate (or 4x10G)]

                        64-byte packets          1024-byte packets
40G packets/second      59.5 million each way    4.8 million each way
Packet arrival rate     16.8 ns                  208.8 ns
2 GHz clock cycles      33 cycles                417 cycles

From a CPU perspective:
• L3 cache hit latency is ~40 cycles
• L3 miss, memory read is ~70 ns (140 cycles at 2 GHz)

Intel® silicon and Intel® software advances are proactively addressing this problem statement.

Page 8: 1 intro to_dpdk_and_hw

Benefits – Eliminating / Hiding Overheads

How?

Overhead                                Eliminated by
Interrupt / context-switch overhead     Polling
Kernel / user overhead                  User-mode driver
Core-to-thread scheduling overhead      Pthread affinity
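A minimal sketch of the receive path these techniques enable: a poll-mode loop running on a pinned lcore and calling the DPDK burst API instead of waiting for interrupts (the port/queue numbers and burst size are illustrative):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Busy-poll one RX queue forever: no interrupts, no kernel/user crossings. */
    static void rx_loop(uint16_t port_id, uint16_t queue_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Poll the NIC RX ring from user space via the poll-mode driver. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... process the packet ... */
                rte_pktmbuf_free(bufs[i]);   /* return the buffer to its mempool */
            }
        }
    }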

Page 9: 1 intro to_dpdk_and_hw

• DPDK is BSD licensed:

• http://opensource.org/licenses/BSD-3-Clause

• User is free to modify, copy and re-use code

• No need to provide source code in derived software (unlike GPL license)

Licensing

Page 10: 1 intro to_dpdk_and_hw

DPDK Packet Processing Concepts

• DPDK is designed for high-speed packet processing on IA. This is achieved by optimizing the software libraries for IA using concepts such as:
  • Huge pages, cache alignment, pthreads with affinity
  • Prefetching, new instructions, NUMA
  • Intel® DDIO, memory interleave, memory channel awareness

• Intel® Data Direct I/O Technology (Intel® DDIO)
  • Enabled by default on all Intel® Xeon® processor E5-based platforms
  • Enables PCIe adapters to route I/O traffic directly to the L3 cache, reducing unnecessary trips to system memory; this provides more than double the throughput of previous-generation servers while further reducing power consumption and I/O latency.

• Pthreads
  • On startup, DPDK specifies the cores to be used and ties each application thread to a core via a pthread call with affinity (see the sketch after this list). This prevents the kernel from migrating the thread to another local or remote core, which would hurt performance.
  • The user may still use pthread or fork calls after DPDK has started, to allow threads to float or to tie multiple threads to a single core.
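A minimal sketch of the underlying affinity mechanism on Linux (DPDK's EAL does this for you based on its core mask; the core number here is illustrative):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single CPU core. */
    static int pin_to_core(int core_id)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core_id, &set);

        /* Returns 0 on success; the kernel will no longer migrate this thread. */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }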

Page 11: 1 intro to_dpdk_and_hw

DPDK Packet Processing Concepts

• NUMA
  • DPDK allocates resources from NUMA-local memory to improve performance for processing and for PCIe I/O local to a processor.
  • Without NUMA awareness, memory in a dual-socket system is interleaved between the two sockets.

• Huge Pages
  • DPDK uses 2 MB and 1 GB huge pages to reduce TLB misses, which can significantly affect a core's overall performance.

• Cache Alignment
  • Better performance is achieved by aligning structures on 64-byte cache lines.

• Software Prefetching
  • Allows the cache to be populated before data is accessed (see the sketch after this list).
  • Needs to be issued "appropriately" ahead of time to be effective; too early can cause eviction before use.

• Memory channel use
  • Memory pools add padding to objects to ensure even use of memory channels.
  • The number of channels is specified at application start-up.
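A minimal sketch of the cache-alignment and software-prefetch ideas above, using plain compiler facilities (DPDK offers equivalent wrappers such as __rte_cache_aligned and rte_prefetch0; the structure and loop are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Keep each entry on its own 64-byte cache line to avoid false sharing. */
    struct flow_stats {
        uint64_t packets;
        uint64_t bytes;
    } __attribute__((aligned(64)));

    static uint64_t total_packets(const struct flow_stats *tbl, size_t n)
    {
        uint64_t sum = 0;

        for (size_t i = 0; i < n; i++) {
            /* Prefetch the next entry while working on the current one. */
            if (i + 1 < n)
                __builtin_prefetch(&tbl[i + 1], 0 /* read */, 3 /* keep in cache */);
            sum += tbl[i].packets;
        }
        return sum;
    }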

Page 12: 1 intro to_dpdk_and_hw

Memory Configuration – Intel Architecture

Page 13: 1 intro to_dpdk_and_hw

Memory – performance topics

• NUMA architecture

• Caching

• TLBs

• Huge pages

• Memory allocation

Page 14: 1 intro to_dpdk_and_hw

Intel® Core™ Microarchitecture Platform Architecture

Integrated Memory Controller

• 4 DDR3 channels per socket

• Massive memory bandwidth

• Memory Bandwidth scales with # of processors

• Very low memory latency

Intel® QuickPath Interconnect (Intel® QPI)

• New point-to-point interconnect

• Socket to socket connections

• Socket to chipset connections

• Build scalable solutions

[Diagram: two IVB-EP processors connected to each other and to the PCH]

Significant performance leap for new platform

Page 15: 1 intro to_dpdk_and_hw

Non-Uniform Memory Access (NUMA)

FSB architecture (legacy)

• All memory in one location

Starting with Intel® Core™ microarchitecture (Nehalem)

• Memory located in multiple places

Latency to memory dependent on location

Local memory

• Highest BW

• Lowest latency

Remote Memory

• Higher latency

[Diagram: two IVB-EP processors and the PCH]

Ensure software is NUMA-optimized for best performance
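A minimal sketch of NUMA-aware allocation with the DPDK heap API, assuming an initialised EAL (the zone name is illustrative; rte_socket_id() returns the socket of the calling lcore):

    #include <rte_malloc.h>
    #include <rte_lcore.h>

    /* Allocate a table on the same socket as the core that will use it,
     * so lookups hit local memory instead of crossing QPI. */
    static void *alloc_local_table(size_t size)
    {
        return rte_malloc_socket("flow_table", size,
                                 64,                /* cache-line alignment */
                                 rte_socket_id());  /* this lcore's socket  */
    }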


Page 16: 1 intro to_dpdk_and_hw

NUMA Considerations for Data Structure Allocation

[Diagram: platform block diagram – Core 0–3 with I$/D$ and L2 caches, memory, QPI, DMI, PCH, and an Intel® NIC on PCIe with DCA delivering packets to rx_queue 0–3]

Flow hash over the TCP/IP ports and addresses, used to index the lookup table (a self-contained version follows after this slide):

    hash = (tcp->th_sport) ^ (tcp->th_dport) ^
           (ip->ip_src.s_addr) ^ (ip->ip_dst.s_addr);
    hash = hash % PRIME_NUMBER;
    return lookup_table[hash];

PTU metrics:
• MEM_UNCORE_RETIRED.REMOTE_DRAM
• MEM_INSTRUCTIONS_RETIRED.LATENCY_ABOVE_THRESHOLD
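A self-contained version of the hash fragment from the diagram (a sketch: the header structs, PRIME_NUMBER and lookup_table are illustrative stand-ins for the BSD-style ip/tcp headers used on the slide):

    #include <stdint.h>

    #define PRIME_NUMBER 251
    #define NB_QUEUES    4

    struct ip_hdr  { uint32_t ip_src, ip_dst; };
    struct tcp_hdr { uint16_t th_sport, th_dport; };

    /* Maps a hash bucket to an RX queue; pre-filled with values 0..NB_QUEUES-1. */
    static uint8_t lookup_table[PRIME_NUMBER];

    /* Spread flows across queues so each queue and its data stay local to one core. */
    static uint8_t pick_rx_queue(const struct ip_hdr *ip, const struct tcp_hdr *tcp)
    {
        uint32_t hash = tcp->th_sport ^ tcp->th_dport ^ ip->ip_src ^ ip->ip_dst;

        return lookup_table[hash % PRIME_NUMBER];
    }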

Page 17: 1 intro to_dpdk_and_hw

Caching on IA

Page 18: 1 intro to_dpdk_and_hw

• IA Processors have cache integrated on processor die.

• Fast access SRAM

• Code & data from system memory (DRAM) stored in fast access cache memory

• Without a cache – the CPU must fetch instructions and data directly from system memory

• CPU Core “stalls” – waiting for data

• Cache miss (data not in cache)

• CPU needs to get data from system memory

• Cache populated with required data

• Not just the data required, but a block of info is copied

• “Cache line” – 64 Bytes on IA (IVB, HSW etc.)

Cache hit – data present in cache

Caching on IA

Page 19: 1 intro to_dpdk_and_hw

• Cache Consistency

• Cache is a copy of a piece of memory

• Needs to always reflect what is contained in system memory

• Snoop

• Cache watches address lines for transaction

• Cache sees if any transactions access memory contained within cache

• Cache keeps consistent with caches of other CPU cores

• Dirty data

• Data modified in cache but not in main memory

• Stale data

• Data modified in main memory, but not in cache

Caching on IA – some terms

Page 20: 1 intro to_dpdk_and_hw

• 3 Levels of cache (SNB, IVB, HSW processors)

• L1 cache – 32KB data and 32KB instruction caches

• L2 cache – 256KB – unified (holds code & data)

• L3 cache (LLC) – 25MB (IVB) , 30MB (HSW) common cache for all cores in CPU socket.

• L1 cache is smallest, and fastest.

• CPU tries to access data – not in L1 cache?

• Try L2 cache - not in L2 cache?

• Try L3 cache – not in L3 cache?

• Cache miss - need to access system memory (DRAM).

• L1 & L2 cache is per physical core (shared per logical core)

• L3 cache is shared (per CPU socket)

Caching on IA

Page 21: 1 intro to_dpdk_and_hw

• What can be cached?

• Only DRAM can be cached

• IO, MMIO never cached

• L1 cache is smallest, and fastest.

• L1 Code cache is read-only

• Address residing in L1/L2 must be present in L3 cache –“inclusive cache”

Caching on IA

Page 22: 1 intro to_dpdk_and_hw

Huge Pages

Page 23: 1 intro to_dpdk_and_hw

• All memory addresses virtual

• Memory appears contiguous to applications, even if physically fragmented

• Map virtual address to physical address

• Use page tables to translate virtual address to physical address

• Default page size in Linux on IA is 4kB.

• 4 layers of page tables

Huge Pages

Page 24: 1 intro to_dpdk_and_hw

Why Hugepages?

TLB maps page numbers to page frames. Each TLB miss requires a page walk.
• 4 KB pages: four memory accesses (page-table levels) to get to the page data
• 2 MB pages: three memory accesses to get to the page data

DTLB:
• 4 KB pages: 64 entries, maps 256 KB; to access 16 GB of memory, 32 MB of PTE tables must be read by the CPU
• 2 MB pages: 32 entries, maps 64 MB; to access 16 GB of memory, only 64 KB of PDE tables must be read by the CPU – fits into the CPU cache

One 2 MB page = 512 x 4 KB pages, so 512x fewer page-cross penalties.

Page 25: 1 intro to_dpdk_and_hw

• Use Linux hugepage support through “hugetlbfs” filesystem

• Each page is 2MB in size equivalent to 512 4KB pages

• Each page requires only 1 DTLB entry

• Reduce DTLB misses, and therefore page walks

• Gives improved performance

• Need to enable & allocate huge pages with the Linux boot command line (in the GRUB file) – see the example below

• Better to enable at boot time – prevents fragmentation in physical memory

Huge Pages
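A minimal example of the boot-time setup described above, assuming a GRUB-based Linux distribution (the page counts are illustrative):

    # /etc/default/grub – reserve huge pages at boot to avoid fragmentation
    GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=1024"

    # Mount the hugetlbfs filesystem so applications such as DPDK can use the pages
    mkdir -p /mnt/huge
    mount -t hugetlbfs nodev /mnt/huge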

Page 26: 1 intro to_dpdk_and_hw

Translation Lookaside Buffers

TLBs – virtual to physical memory address translation

References:
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3, System Programming Guide, Chapter 4.10: Caching Translation Information
• Intel® 64 and IA-32 Architectures Optimization Reference Manual

Page 27: 1 intro to_dpdk_and_hw

Translation Lookaside Buffers (TLBs)

• TLBs – Translation Lookaside Buffers – 2 types

• Instruction TLB

• Data TLB

• TLB is cache – maps virtual memory to physical memory

• When memory requested by application, OS maps virtual address from process to physical address in memory

• Mapping of virtual to physical memory – Page Table Entry (PTE)

• TLB is a cache for the Page Table

• If data is found in TLB during address lookup

• TLB hit

• Otherwise – TLB miss (page walk) - performance hit

• Huge pages (Linux) – can alleviate

Page 28: 1 intro to_dpdk_and_hw

Translation Lookaside Buffers (TLBs)

• TLBs are a cache for page tables

• If memory address lookup is not in TLB -> TLB miss

• We must then “walk the page tables”

• This is slow, and costly

• We need to minimise TLB misses

• Solution is to use huge pages

• Use 2M or 1G huge pages instead of default 4k pages

Page 29: 1 intro to_dpdk_and_hw

TLB Invalidation

• On multi-core systems one core may change the page table which is used by other cores

• Page table changes need to be propagated to the other cores' TLBs

• This process is known as “TLB shootdown”

• Need to invalidate the TLBs to avoid using “stale” data

• Need to be aware of other CPU cores invalidating TLBs

• Costly for data plane applications.

• Examples – page faults, VM transitions (VM exit & entry)

• More info in section 4.10.4 of Volume 3A of Intel® 64 and IA-32 Architectures Software Developer’s Manual

• https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Page 30: 1 intro to_dpdk_and_hw

IOTLBs

• As well as TLBs for memory, there are TLBs for DMA – IOTLBs

• Page table structure for DMA address translation

• Sandy Bridge – no huge page support in IOTLBs – page table fragmentation

• 2M and 1G huge pages fragmented to 4k page size

• Causes more IOTLB misses

• SNB could not achieve near line rate for 64-byte packets on a 10G NIC

• Huge page support added in IVB

• SR-IOV performance in IVB greatly enhanced

Page 31: 1 intro to_dpdk_and_hw

Large Page Table Support

Reducing TLB and IOTLB misses with large page table support

• Intel® Data Plane Development Kit (Intel® DPDK) utilizes large page tables to create large contiguous buffers

[Diagram: Intel® Architecture platform with NICs, the Intel® VT-d IOTLB / translation cache, Extended Page Tables in memory, and a Virtual Machine Monitor hosting an Intel DPDK forwarding sample; guest physical addresses (GPA) are translated to host physical addresses (HPA). Intel® Virtualization Technology for Directed I/O (Intel® VT-d)]

Page 32: 1 intro to_dpdk_and_hw

Memory Virtualization Challenges

[Diagram: VM0…VMn with guest page tables; the VMM maintains shadow page tables and remaps to memory; induced VM exits; CPU TLB]

Address Translation
• Guest OS expects contiguous, zero-based physical memory
• VMM must preserve this illusion

Page-table Shadowing
• VMM intercepts paging operations
• Constructs a copy of the page tables

Overheads
• VM exits add to execution time
• Shadow page tables consume significant host memory

Page 33: 1 intro to_dpdk_and_hw

Memory Virtualization with EPT

[Diagram: VM0…VMn with guest page tables; Intel® VT-x with EPT; Extended Page Tables (EPT) with a hardware EPT walker; no VM exits; the VMM handles I/O virtualization]

Extended Page Tables (EPT)
• Map guest physical to host address
• New hardware page-table walker

Performance Benefit
• Guest OS can modify its own page tables freely
• Eliminates VM exits

Memory Savings
• Shadow page tables required for each guest user process (w/o EPT)
• A single EPT supports the entire VM

Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)

Page 34: 1 intro to_dpdk_and_hw

Memory Configuration – DPDK

Page 35: 1 intro to_dpdk_and_hw

Memory Object Hierarchy

[Diagram: 2 MB huge pages are grouped into physically contiguous Memory Segments 0…N; segments are carved into Memory Zones (e.g. RG_RX_RING_0, RG_TX_RING_0, MP_mbuf_pool, MALLOC_HEAP0), which back the rings RX_RING_0 and TX_RING_0, the memory pool mbuf_pool, and the malloc heap]

Page 36: 1 intro to_dpdk_and_hw

Hugepages

• Use Linux hugepage support through the "hugetlbfs" filesystem
• Each page is 2 MB in size, equivalent to 512 4 KB pages
• Each page requires only 1 DTLB entry
• Reduce DTLB misses, and therefore page walks
• Gives improved performance

[Diagram repeated: huge pages → memory segments → memory zones → rings / mbuf_pool / malloc heap]

Page 37: 1 intro to_dpdk_and_hw

Memory Segments

• Internal unit for memory management is the memory segment
• Always backed by huge page (2 MB / 1 GB page) memory
• Each segment is contiguous in physical and virtual memory
• Broken out into smaller memory zones for individual objects

[Diagram repeated: huge pages → memory segments → memory zones → rings / mbuf_pool / malloc heap]

Page 38: 1 intro to_dpdk_and_hw

Memory Zones

• Most basic unit of memory allocation – a named block of memory (see the sketch below)
• Allocate-only, cannot free
• Cannot span a segment boundary – contiguous memory
• Physical address of the allocated block available to the caller

[Diagram repeated: huge pages → memory segments → memory zones → rings / mbuf_pool / malloc heap]
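A minimal sketch of reserving a named zone (the name and size are illustrative; rte_memzone_reserve carves the block out of hugepage-backed segments):

    #include <rte_memzone.h>
    #include <rte_lcore.h>

    /* Reserve a named, physically contiguous block on the local socket. */
    static const struct rte_memzone *reserve_ring_zone(void)
    {
        const struct rte_memzone *mz;

        mz = rte_memzone_reserve("RG_MY_RING", 64 * 1024,
                                 rte_socket_id(), 0 /* default flags */);
        if (mz == NULL)
            return NULL;

        /* mz->addr holds the virtual address; the zone descriptor also carries
         * the physical address of the block. */
        return mz;
    }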

Page 39: 1 intro to_dpdk_and_hw

Malloc support – rte_malloc / rte_free

• Malloc library provided to allow easier application porting (see the sketch below)
• Backed by one or more memzones
• Uses hugepage memory, but supports memory freeing
• Not lock-free – avoid in the data path
• Physical address information not available per-allocation

[Diagram repeated: huge pages → memory segments → memory zones → rings / mbuf_pool / malloc heap]
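A small sketch of control-path use of this library (the type string and size are illustrative; alignment 0 lets the library choose a default):

    #include <rte_malloc.h>

    /* Control-path allocation from hugepage-backed memory; not for the fast path. */
    static int setup_config(size_t size)
    {
        void *cfg = rte_zmalloc("app_config", size, 0);

        if (cfg == NULL)
            return -1;

        /* ... use cfg during initialisation ... */

        rte_free(cfg);   /* unlike memzones, rte_malloc memory can be freed */
        return 0;
    }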

Page 40: 1 intro to_dpdk_and_hw

Memory Pools

• Pool of fixed-size buffers
• One pool can be safely shared among many threads
• Lock-free allocation and freeing of buffers to/from the pool
• Designed for fast-path use

[Diagram repeated: huge pages → memory segments → memory zones → rings / mbuf_pool / malloc heap]

Page 41: 1 intro to_dpdk_and_hw

Memory Pools (continued)

[Diagram: Processor 0 with four Intel® DPDK data-plane cores (C1–C4) and two 10G ports sharing a memory pool of packet buffers (60K x 2K buffers) and two event pools (2K x 100 B buffers), with per-core cached buffers]

• Size fixed at creation time:
  • Fixed-size elements
  • Fixed number of elements
• Multi-producer / multi-consumer safe
• Safe for fast-path use
• Typical usage is packet buffers
• Optimized for performance:
  • No locking – uses CAS instructions
  • All objects cache aligned
  • Per-core caches to minimise contention / use of CAS instructions
  • Support for bulk allocation / freeing of buffers
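A minimal sketch of the bulk interface mentioned above (the pool is assumed to have been created elsewhere; object use is elided):

    #include <rte_mempool.h>

    #define BULK 32

    /* Grab and return a burst of fixed-size objects; requests are served from the
     * per-core cache when possible, falling back to the lock-free shared pool. */
    static void touch_objects(struct rte_mempool *mp)
    {
        void *objs[BULK];

        if (rte_mempool_get_bulk(mp, objs, BULK) != 0)
            return;                      /* pool temporarily empty */

        /* ... fill or process the objects ... */

        rte_mempool_put_bulk(mp, objs, BULK);
    }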

Page 42: 1 intro to_dpdk_and_hw

• For a DPDK application – all memory is allocated from huge pages

• Allocate all memory at initialisation time (not during run time).

• Pools of buffers created.

• Buffers taken from pools as needed for packet processing

• Returned to pool after use

• Never need to use “malloc” at runtime.

• DPDK takes care of aligning memory to cache lines

Memory allocation - summary

Page 43: 1 intro to_dpdk_and_hw

• rte_eal_init()

• Initialises Environment Abstraction Layer

• Takes care of allocating memory from huge pages

• rte_mempool_create()

• Create pool of message buffers (mbufs)

• This pool is used to hold packet data

• mbufs taken from and returned to this pool

Memory allocation
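A minimal initialisation sketch tying the two calls above together, following the classic DPDK sample-application pattern (the buffer count, element size and cache size are illustrative):

    #include <rte_eal.h>
    #include <rte_mempool.h>
    #include <rte_mbuf.h>
    #include <rte_lcore.h>

    #define NB_MBUF   8192
    #define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)

    int main(int argc, char **argv)
    {
        struct rte_mempool *mbuf_pool;

        /* Set up hugepage memory, memory segments/zones and the lcores. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* One pool of fixed-size packet buffers, created once at init time. */
        mbuf_pool = rte_mempool_create("MBUF_POOL", NB_MBUF, MBUF_SIZE,
                                       32,   /* per-core cache size */
                                       sizeof(struct rte_pktmbuf_pool_private),
                                       rte_pktmbuf_pool_init, NULL,
                                       rte_pktmbuf_init, NULL,
                                       rte_socket_id(), 0);
        if (mbuf_pool == NULL)
            return -1;

        /* ... configure ports, then enter the packet-processing loop ... */
        return 0;
    }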

Page 44: 1 intro to_dpdk_and_hw

Memory Buffer – mbuf

Memory buffer structure used throughout the Intel® DPDK.

Header holds meta-data about the packet and buffer:

• Buffer & packet length

• Buffer physical address

• RSS hash or flow director filter information

• Offload flags

Body holds packet data plus room for additional headers and footers.
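A small sketch of reading that metadata and the packet body (names follow the rte_mbuf/rte_ether API of this DPDK generation; the Ethernet-header cast is illustrative):

    #include <rte_mbuf.h>
    #include <rte_ether.h>

    /* Inspect one received packet held in an mbuf. */
    static void inspect(struct rte_mbuf *m)
    {
        /* Meta-data lives in the mbuf header. */
        uint32_t pkt_len = rte_pktmbuf_pkt_len(m);   /* total packet length  */
        uint16_t seg_len = rte_pktmbuf_data_len(m);  /* bytes in this buffer */

        /* Packet data lives in the body, after the headroom. */
        struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);

        (void)pkt_len; (void)seg_len; (void)eth;
        /* ... parse headers, check offload flags, etc. ... */
    }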

Page 45: 1 intro to_dpdk_and_hw

Memory Buffer – chained mbuf

Mbufs are generally used with memory pools.

Size of mbuf fixed when the mempool is created

For packets too big for a single mbuf, the mbufs can be linked together in an “mbuf chain”
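A small sketch of walking such a chain, assuming the flat mbuf layout of recent DPDK releases (older releases nest these fields under m->pkt):

    #include <rte_mbuf.h>

    /* Sum the payload bytes across all segments of a chained mbuf. */
    static uint32_t chain_bytes(const struct rte_mbuf *m)
    {
        uint32_t total = 0;

        while (m != NULL) {                    /* nb_segs segments in total */
            total += rte_pktmbuf_data_len(m);
            m = m->next;                       /* NULL on the last segment  */
        }
        return total;
    }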

Page 46: 1 intro to_dpdk_and_hw

DDIO

Page 47: 1 intro to_dpdk_and_hw

Data Direct I/O (DDIO)

• Ethernet controllers & NICs talk directly with CPU cache

• DDIO makes processor cache the primary source and destination of I/O data, rather than main memory

• DDIO reduces latency, power consumption, and memory bandwidth usage

• Lower latency – I/O data does not need to go via main memory

• Lower power consumption – reduced memory access

• More scalable I/O bandwidth – reduced memory bottlenecks

Page 48: 1 intro to_dpdk_and_hw

Page 49: 1 intro to_dpdk_and_hw

DDIO requires no complex setup

• DDIO is enabled by default on all Romley platforms, including pre-release platforms for OEMs, IHVs, and ISVs

− DDIO has been active on all Intel and industry Romley development and validation platforms

• DDIO has no hardware dependencies

• DDIO is invisible to software

− No driver changes are required

− No OS or VMM changes are required

− No application changes are required

Page 50: 1 intro to_dpdk_and_hw