DPDK Based Networking Products Enhance and Expand Container Networking
[email protected], Jingdong Digital Technology
Kubernetes Overview
[Diagram: a master runs kube-apiserver, kube-controller-manager, and kube-scheduler; worker nodes node1 and node2 each run kubelet and kube-proxy and host pods pod1–pod6.]
• Pod to Pod communication
• Pod to Service communication
Flannel Overview
VXLAN encapsulation
[Diagram: node1 (eth0 192.168.0.1) and node2 (eth0 192.168.0.2) are connected by the underlying network. Each node has a VXLAN device flannel.1 (10.10.10.0/32 on node1, 10.10.20.0/32 on node2) and a bridge cni0 (10.10.10.1/24 on node1, 10.10.20.1/24 on node2). pod1 (10.10.10.2/24) and pod2 (10.10.10.3/24) attach to node1's bridge; pod3 (10.10.20.2/24) and pod4 (10.10.20.3/24) attach to node2's bridge.]
Encapsulated packet layout:
Outer Ethernet header | Outer IP header (src: 192.168.0.1, dst: 192.168.0.2) | Outer UDP header | VXLAN header | Inner Ethernet header | Inner IP header (src: 10.10.10.2, dst: 10.10.20.3) | Payload
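The header stack above can be quantified. A quick sketch of the per-packet encapsulation cost, assuming IPv4 and untagged Ethernet:

```python
# Sizes (bytes) of the headers VXLAN prepends to the original pod frame
# (IPv4, untagged Ethernet, no IP options):
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14  # already part of the original pod frame

# Extra bytes carried on the wire by every pod-to-pod packet:
overhead = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER
assert overhead == 50

# The underlay MTU (1500, excluding Ethernet) must also fit the inner
# Ethernet header, so the inner IP MTU shrinks accordingly:
inner_ip_mtu = 1500 - (OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET)
assert inner_ip_mtu == 1450  # why flannel.1 typically runs with MTU 1450
```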
1. When pods communicate with endpoints inside the k8s cluster, packets must be encapsulated.
2. When pods communicate with endpoints outside the k8s cluster, packets must be masqueraded.
This leads to extra overhead. Besides, it cannot meet some demands, e.g. a pod that wants to access a whitelist-protected application outside the k8s cluster, since that application sees the node IP rather than the pod IP.
Our goals:
• no encapsulation
• no network address translation
• pods can be reached directly from everywhere
[Diagram: pod1 on node1 and pod2 on node2 exchanging traffic directly with non-Kubernetes nodes.]
Our choice:
• Contiv with layer-3 routing mode
Contiv Overview
• OVS forwards pod packets
• BGP publishes pod IPs
Routes on the layer-3 switch:
10.10.0.1 nexthop 192.168.1.1
10.10.0.2 nexthop 192.168.1.1
10.10.0.3 nexthop 192.168.1.2
10.10.0.4 nexthop 192.168.1.2
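A minimal sketch (names illustrative) of the layer-3 switch's view once BGP has published the pod host routes above: each /32 pod IP maps to the node that advertised it.

```python
# Host routes learned from the BGP speakers on node1 (192.168.1.1)
# and node2 (192.168.1.2); every pod IP is a /32 entry.
routes = {
    "10.10.0.1": "192.168.1.1",
    "10.10.0.2": "192.168.1.1",
    "10.10.0.3": "192.168.1.2",
    "10.10.0.4": "192.168.1.2",
}

def next_hop(dst_ip: str) -> str:
    """Return the node hosting the destination pod."""
    return routes[dst_ip]

assert next_hop("10.10.0.3") == "192.168.1.2"  # pod3 traffic goes to node2
```

Because every route is a plain host route, no tunnel endpoint mapping is needed: the switch forwards pod traffic exactly like any other routed traffic.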
[Diagram: node1 (inb01 192.168.1.1) and node2 (inb01 192.168.1.2) each run OVS, netplugin, and a BGP speaker. On node1, pod1 (eth0 10.10.0.1/24) and pod2 (eth0 10.10.0.2/24) attach to OVS via vvport1/vvport2; on node2, pod3 (10.10.0.3/24) and pod4 (10.10.0.4/24) do the same. Both nodes peer over eth0 with the layer-3 switch.]
Contiv Implementation Detail
1. A user creates a new pod in the k8s cluster.
2. netplugin requests a free IP, e.g. 10.10.0.1, from netmaster.
3. netplugin creates a veth pair, e.g. vport1 and vvport1.
4. netplugin moves interface vport1 into the pod network namespace and renames it to eth0.
5. netplugin sets the IP and routes inside the pod network namespace.
6. netplugin adds vvport1 to OVS.
7. netplugin publishes 10.10.0.1/32 to the BGP neighbor switch.
Resulting OVS rules:
• nw_dst=10.10.0.1 output:vvport1
• nw_dst=10.10.0.2 output:vvport2
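The per-node forwarding behaviour of those rules can be modelled with a tiny sketch (not real OVS): an exact match on destination IP selects the output port, and anything else leaves via the host uplink.

```python
# One entry per *local* pod, installed by netplugin (step 6 above).
flows = {
    "10.10.0.1": "vvport1",
    "10.10.0.2": "vvport2",
}

def forward(nw_dst: str) -> str:
    # Destinations with no flow entry fall through to the host uplink;
    # modelled here simply as "eth0".
    return flows.get(nw_dst, "eth0")

assert forward("10.10.0.1") == "vvport1"
assert forward("172.16.0.1") == "eth0"  # non-local traffic leaves the node
```

Since the table only needs entries for pods on this node, its size scales with the node rather than the cluster.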
[Diagram: one node (inb01 192.168.1.1) running OVS, netplugin, and BGP; pod1 (eth0 10.10.0.1/24) and pod2 (eth0 10.10.0.2/24) attach via vvport1/vvport2, and the node's eth0 connects to the layer-3 switch.]
Pod IP is Reachable in IDC Scope
10.10.0.2 (in cluster) pings 172.16.0.1 (outside cluster)
1. pod2 sends the packet out through its eth0.
   [IP header src: 10.10.0.2, dst: 172.16.0.1 | Payload]
2. OVS receives the packet on vvport2 and forwards it to the host's eth0.
   [IP header src: 10.10.0.2, dst: 172.16.0.1 | Payload]
3. The switch receives the packet and forwards it to host 172.16.0.1.
   [IP header src: 10.10.0.2, dst: 172.16.0.1 | Payload]
In the pod, on the host, and in the underlying infrastructure, the packet's IP header is always the same: no encapsulation, no NAT.
[Diagram: the same node (inb01 192.168.1.1) with pod1 (10.10.0.1/24) and pod2 (10.10.0.2/24) behind OVS, connected via the layer-3 switch to a machine outside the k8s cluster at 172.16.0.1.]
Contiv Optimization
1. Support multiple BGP neighbors.
2. Reduce the number of OVS rules per node from the magnitude of the cluster to that of the node.
3. Remove the DNS and load-balancing modules from netplugin.
4. Add support for non-Docker container runtimes, e.g. containerd.
5. Add IPv6 support.
Load Balance: Native Kube-Proxy (control flow vs. data flow)
[Diagram: kube-apiserver distributes services and endpoints (control flow) to a kube-proxy on every node, and each kube-proxy programs local iptables rules. Service traffic (data flow) from pod1 (10.10.0.1/24) travels via the layer-3 switch to a node, where iptables translates it before it reaches a backend pod (pod2 10.10.0.2/24 or pod3 10.10.0.3/24).]
Load Balance: DPDK-SLB (control flow vs. data flow)
[Diagram: a DPDK SLB cluster sits behind the layer-3 switch. The slb-controller watches kube-apiserver for services and endpoints (control flow) and programs the SLB cluster; service traffic (data flow) from pod1 (10.10.0.1/24) goes through the layer-3 switch to the DPDK-SLB cluster and on to backend pods (pod2 10.10.0.2/24, pod3 10.10.0.3/24). Kube-proxy on the nodes is no longer in the data path.]
• Kube-proxy is no longer needed on any node
• SLB-Controller watches services and endpoints in Kubernetes and dynamically pushes virtual server (VS) and real server (RS) information to DPDK-SLB
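The controller's translation step can be sketched roughly as below. The function name and field shapes are hypothetical, not the actual controller API: a Service becomes one VS, and its Endpoints become the RS list.

```python
# Hypothetical sketch: map a k8s Service plus its Endpoints to the VS/RS
# record pushed to DPDK-SLB. Field names are illustrative.
def build_vs(service, endpoint_ips):
    return {
        "vip": service["clusterIP"],
        "vport": service["port"],
        "real_servers": [(ip, service["targetPort"]) for ip in endpoint_ips],
    }

svc = {"clusterIP": "10.20.0.1", "port": 80, "targetPort": 8080}
eps = ["10.10.0.2", "10.10.0.3"]

vs = build_vs(svc, eps)
assert vs["vip"] == "10.20.0.1"
assert vs["real_servers"] == [("10.10.0.2", 8080), ("10.10.0.3", 8080)]
```

When endpoints change (pods scale up or down), the controller simply rebuilds and re-pushes the RS list, so no per-node iptables reprogramming is involved.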
DPDK-SLB: Control Plane
• SLB-Daemon: the core process, which does load balancing and full NAT
• SLB-Agent: monitors and configures SLB-Daemon
• OSPFD: publishes service subnets to the layer-3 switch
[Diagram: inside slb-daemon, an admin core, a KNI core, and worker cores worker_1…worker_n each hold their own config; slb-agent and ospfd run alongside the daemon, and slb-controller pushes configuration down to them.]
• The admin core pushes VS and RS configuration to the worker cores
• The KNI core forwards OSPF packets to the kernel, which delivers them to OSPFD
• The worker cores do the load balancing
All data (config data, session data, local addresses) is per-CPU, fully parallelizing packet processing.
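The per-CPU, share-nothing layout can be sketched as below (structure illustrative): each worker owns private copies of config, sessions, and local addresses, so the fast path never takes a lock.

```python
# Illustrative per-worker state: nothing is shared between cores.
NUM_WORKERS = 4

workers = [
    {
        "config": {},        # VS/RS tables pushed by the admin core
        "sessions": {},      # connections this worker has established
        # Local IPs this worker uses for full NAT; disjoint per worker.
        "local_addrs": [f"10.30.0.{i + 1}"],
    }
    for i in range(NUM_WORKERS)
]

# No two workers share a local address, so reply traffic can be steered
# back to the owning worker by destination IP alone (via Flow Director).
all_addrs = [a for w in workers for a in w["local_addrs"]]
assert len(all_addrs) == len(set(all_addrs))
```

The disjoint local-address pools are what make the Flow Director steering on the next slides possible.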
DPDK-SLB: OSPF Neighbor
• OSPF uses the multicast address 224.0.0.5
• Flow Director: destination IP 224.0.0.5 is bound to queue_x
• A dedicated KNI core processes OSPF packets
• OSPFD publishes service subnets to the layer-3 switch
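A simplified model of that queue steering (the dispatch logic is illustrative; real Flow Director matching happens in NIC hardware): OSPF packets to 224.0.0.5 always land on the dedicated KNI queue, never on a worker queue.

```python
# Toy model of the Flow Director rule: match on destination IP, steer
# OSPF multicast to the KNI core's queue, spread the rest over workers.
OSPF_ALL_ROUTERS = "224.0.0.5"
KNI_QUEUE = "queue_x"

def fdir_select_queue(dst_ip: str, n_worker_queues: int) -> str:
    if dst_ip == OSPF_ALL_ROUTERS:
        return KNI_QUEUE
    # Other traffic is distributed across worker queues (trivial hash here).
    return f"queue_{hash(dst_ip) % n_worker_queues + 1}"

assert fdir_select_queue("224.0.0.5", 8) == "queue_x"
assert fdir_select_queue("10.20.0.1", 8) != "queue_x"
```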
[Diagram: the NIC's queue_1, queue_2, …, queue_n feed the worker cores, while queue_x feeds the KNI core; OSPF packets from the layer-3 switch (1) are steered to queue_x and passed through the kernel to ospfd (2).]
DPDK-SLB: Data Plane
Client → server:
1. The packet arrives as {client_ip, client_port, vip, vport}.
2. RSS selects a queue according to the 5-tuple.
3. worker_1 does full NAT; the packet leaves as {local_ip1, local_port, server_ip, server_port}.
4. worker_1 saves the session {cip, cport, vip, vport, lip1, lport, sip, sport}.
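Step 2 relies on RSS being deterministic per flow. A toy stand-in (real NICs use a Toeplitz hash; Python's `hash()` here only illustrates the property):

```python
# Toy RSS: hash the 5-tuple, pick a worker queue. The point is stability,
# not the particular hash: one flow always maps to one queue.
def rss_queue(five_tuple, n_queues: int) -> int:
    return hash(five_tuple) % n_queues

flow = ("10.0.0.7", 40000, "10.20.0.1", 80, "tcp")
q_first = rss_queue(flow, 8)
q_second = rss_queue(flow, 8)
assert q_first == q_second  # every packet of a flow hits the same worker
```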
[Diagram: client→server packets enter a NIC whose RSS spreads them across queue_1…queue_n to worker_1…worker_n; server→client packets enter a NIC whose Flow Director (fdir) steers each one to the queue of the worker that owns its session.]
Server → client:
1. The packet arrives as {server_ip, server_port, local_ip1, local_port}.
2. Flow Director selects a queue according to the destination IP address (local_ip1 is bound to queue_1).
3. worker_1 looks up the session {cip, cport, vip, vport, lip1, lport, sip, sport}.
4. worker_1 does full NAT; the packet leaves as {vip, vport, client_ip, client_port}.
The key point is that the server-to-client packet must be placed on queue_1, because only worker_1 has the session.
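Both directions can be condensed into a compact sketch of worker_1's session handling. All names are illustrative; in the real daemon these tables live in per-core memory, which is exactly why the reply must come back to the same core.

```python
# worker_1's private session table, keyed by (local_ip, local_port).
sessions = {}

def fullnat_out(cip, cport, vip, vport, sip, sport, lip, lport):
    """Client→server: rewrite both src and dst, remember the session."""
    sessions[(lip, lport)] = (cip, cport, vip, vport, sip, sport)
    return (lip, lport, sip, sport)  # the packet as the real server sees it

def fullnat_in(sip, sport, lip, lport):
    """Server→client: look up the session and undo the translation."""
    cip, cport, vip, vport, _, _ = sessions[(lip, lport)]
    return (vip, vport, cip, cport)  # the packet as the client sees it

# Client 10.0.0.7:40000 hits VIP 10.20.0.1:80; worker_1 picks local
# address 10.30.0.1:50000 and real server 10.10.0.2:8080.
out = fullnat_out("10.0.0.7", 40000, "10.20.0.1", 80,
                  "10.10.0.2", 8080, "10.30.0.1", 50000)
assert out == ("10.30.0.1", 50000, "10.10.0.2", 8080)

back = fullnat_in("10.10.0.2", 8080, "10.30.0.1", 50000)
assert back == ("10.20.0.1", 80, "10.0.0.7", 40000)
```

Because `sessions` exists only on worker_1, Flow Director's binding of 10.30.0.1 to queue_1 is what makes `fullnat_in` possible at all.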
Make Apps Run in the Container Cloud Seamlessly
• Layer-3 switch routes:
  10.10.0.1 nexthop node1
  10.10.0.4 nexthop node2
  service subnets nexthop dpdk-slb
• Pod IPs are reachable from vm1, outside the k8s cluster
• Service IPs are reachable from vm2, outside the k8s cluster
• This helps apps run in the container cloud and the traditional environment at the same time
[Diagram: the layer-3 switch connects two DPDK-SLB instances, node1 (pod1 10.10.0.1 and pod2 10.10.0.2 attached to OVS via vvport1/vvport2), node2 (pod3 10.10.0.3 and pod4 10.10.0.4 attached to OVS via vvport1/vvport2), and the external machines vm1 and vm2.]