TCP Servers: Offloading TCP Processing in Internet Servers.
Design, Implementation, and Performance

M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel.

Presented by: Thomas Repantis
[email protected]

CS260-Seminar in Computer Science, Fall 2004

Overview

Goal: execute the TCP/IP processing on a dedicated processor, node, or device (the TCP server), using low-overhead, non-intrusive communication between it and the host(s) running the server application.

Three TCP Server architectures:

1. A dedicated network processor on a symmetric multiprocessor (SMP) server.

2. A dedicated node on a cluster-based server built around a memory-mapped communication interconnect such as VIA.

3. An intelligent network interface in a cluster of intelligent devices with a switch-based I/O interconnect such as InfiniBand.

Introduction

• The network subsystem is nowadays one of the major performance bottlenecks in web servers: every outgoing data byte has to go through the same processing path in the protocol stack down to the network device.

• Proposed solution: a TCP Server architecture that decouples the TCP/IP protocol stack processing from the server host and executes it on a dedicated processor/node.

Introductory Details

• The communication between the server host and the TCP server can dramatically benefit from using low-overhead, non-intrusive, memory-mapped communication.

• The network programming interface provided to the server application must use and tolerate asynchronous socket communication to avoid data copying.

Apache Execution Time Breakdown

Motivation

• The web server spends only 20% of its execution time in user space.

• Network processing, which includes TCP send/receive, interrupt processing, bottom half processing, and IP send/receive, takes about 71% of the total execution time.

• The cost is twofold: processor cycles devoted to TCP processing, plus cache and TLB pollution (OS intrusion on the application execution).
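
The 71% figure gives a rough ceiling on what offloading could buy the host. A minimal sketch, assuming an Amdahl-style model in which the offloaded work costs the host nothing and the TCP server itself never becomes the bottleneck. Both are simplifying assumptions rather than claims from the paper, and `max_host_speedup` is a hypothetical helper:

```python
def max_host_speedup(offloaded_fraction: float) -> float:
    """Idealized upper bound on host speedup when `offloaded_fraction`
    of the execution time is moved off the host at zero cost."""
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


NETWORK_FRACTION = 0.71  # network processing share quoted on the slide

bound = max_host_speedup(NETWORK_FRACTION)
print(f"ideal host speedup bound: {bound:.2f}x")  # prints 3.45x
```

This is only an upper bound: it ignores the cost of tunneling socket calls to the TCP server and the finite capacity of the TCP server itself.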

TCP Server Architecture

• The application host avoids TCP processing by tunneling the socket I/O calls to the TCP server using fast communication channels.

• Shared memory and memory-mapped communication are used for tunneling.

Advantages

• Kernel Bypassing.

• Asynchronous Socket Calls.

• No Interrupts.

• No Data Copying.

• Process Ahead.

• Direct Communication with File Server.

Kernel Bypassing

• Bypassing the host OS kernel.

• Establishing a socket channel between the application and the TCP server for each open socket.

• The socket channel is created by the host OS kernel during the socket call.

Asynchronous Socket Calls

• Maximum overlapping between the TCP processing of the socket call and the application execution.

• Avoid context switches whenever possible.
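
The overlap between TCP processing and application execution can be sketched as follows. This is an illustration only: `AsyncSocketStub` and `send_async` are hypothetical names, a single worker thread stands in for the TCP server, and no real network I/O takes place.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


class AsyncSocketStub:
    """Hypothetical stand-in for a socket whose sends run on a TCP server."""

    def __init__(self) -> None:
        # One worker thread plays the role of the dedicated TCP server.
        self._tcp_server = ThreadPoolExecutor(max_workers=1)
        self.sent: list[bytes] = []

    def _do_send(self, data: bytes) -> int:
        time.sleep(0.01)  # placeholder for TCP/IP processing time
        self.sent.append(bytes(data))
        return len(data)

    def send_async(self, data: bytes) -> Future:
        # Returns immediately; the caller checks completion later.
        return self._tcp_server.submit(self._do_send, data)

    def close(self) -> None:
        self._tcp_server.shutdown(wait=True)


sock = AsyncSocketStub()
pending = sock.send_async(b"HTTP/1.0 200 OK\r\n\r\n")
app_work = sum(range(1000))   # application work overlaps with the send
bytes_sent = pending.result() # block only when the result is needed
sock.close()
```

The application blocks only at `pending.result()`, so the time spent in `_do_send` is hidden behind useful work whenever the application has something else to do.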

No Interrupts

• Since the TCP server exclusively executes TCP processing, interrupts can apparently be replaced with polling easily and beneficially.

• Too high a polling frequency would lead to bus congestion, while too low a frequency would result in an inability to handle all events.
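
The polling trade-off can be illustrated with a small deterministic simulation. It is a sketch under stated assumptions: time is measured in abstract ticks, the device holds at most `ring_size` undelivered events between polls, and the number of polls serves as a proxy for bus traffic. None of these parameters come from the paper.

```python
from collections import deque


def simulate_polling(arrival_ticks, poll_period, ring_size):
    """Return (handled, dropped, polls) for a fixed-period polling loop.

    Events arriving between two polls wait in a ring of `ring_size`
    slots; events that find the ring full are dropped.
    """
    arrivals = deque(sorted(arrival_ticks))
    handled = dropped = polls = 0
    now = 0
    while arrivals:
        now += poll_period
        polls += 1
        ring_fill = 0
        # Collect everything that arrived since the previous poll.
        while arrivals and arrivals[0] <= now:
            arrivals.popleft()
            if ring_fill < ring_size:
                ring_fill += 1
            else:
                dropped += 1
        handled += ring_fill  # the poll drains the ring
    return handled, dropped, polls


events = list(range(1, 101))  # one event per tick, hypothetical load

fast = simulate_polling(events, poll_period=1, ring_size=8)
slow = simulate_polling(events, poll_period=20, ring_size=8)
tuned = simulate_polling(events, poll_period=8, ring_size=8)
```

With this load, polling every tick handles everything but needs 100 polls, polling every 20 ticks needs only 5 polls but drops events, and a period matched to the ring size handles everything in 13 polls.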

No Data Copying

• With asynchronous system calls, the TCP server can avoid the double copying performed in the traditional TCP kernel implementation of the send operation.

• The application must tolerate the wait for completion of the send.

• For retransmission, the TCP server can read the data again from the application send buffer.
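
The idea of sending and retransmitting straight from the application buffer can be sketched with `memoryview` slices, which reference the underlying buffer without copying it. `ZeroCopySend`, `mss`, and the acknowledgment interface are hypothetical and only illustrate the bookkeeping; this is not the paper's implementation.

```python
class ZeroCopySend:
    """Tracks an in-flight send that references the application buffer."""

    def __init__(self, app_buffer: bytearray, mss: int) -> None:
        self._view = memoryview(app_buffer)  # a reference, not a copy
        self._mss = mss
        self._acked = 0  # bytes acknowledged so far

    def unacked_segments(self) -> list:
        # Slicing a memoryview yields views into the same buffer.
        return [
            self._view[off:off + self._mss]
            for off in range(self._acked, len(self._view), self._mss)
        ]

    def ack(self, upto: int) -> None:
        self._acked = max(self._acked, min(upto, len(self._view)))

    @property
    def complete(self) -> bool:
        # Only now may the application reuse its buffer.
        return self._acked == len(self._view)


buf = bytearray(b"abcdefghij")
tx = ZeroCopySend(buf, mss=4)
first_try = [bytes(seg) for seg in tx.unacked_segments()]
tx.ack(4)                           # only the first segment was acknowledged
retransmit = tx.unacked_segments()  # re-read from the application buffer
```

The application must leave `buf` untouched until `tx.complete` is true, which is the wait for send completion that the slide says the application has to tolerate.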

Process Ahead

• The TCP server can execute certain operations ahead of time, before they are actually requested by the host.

• Specifically, this applies to the accept and receive system calls.
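
A minimal sketch of processing ahead, assuming a TCP server that eagerly accepts connections and receives their first request before the host asks for them. `EagerTcpServer` and its methods are hypothetical names, and the network is simulated with in-memory queues.

```python
from collections import deque


class EagerTcpServer:
    """Hypothetical TCP server that accepts and receives ahead of time."""

    def __init__(self) -> None:
        self._incoming = deque()  # raw network events not yet processed
        self._ready = deque()     # connections accepted and read in advance

    def network_event(self, conn_id: int, payload: bytes) -> None:
        self._incoming.append((conn_id, payload))

    def poll(self) -> None:
        # Do the accept and receive work before the host requests it.
        while self._incoming:
            self._ready.append(self._incoming.popleft())

    def accept_and_recv(self):
        # Host-side call: returns at once if work was done ahead of time.
        return self._ready.popleft() if self._ready else None


server = EagerTcpServer()
server.network_event(1, b"GET /a")
server.network_event(2, b"GET /b")
before_poll = server.accept_and_recv()  # nothing processed yet
server.poll()
first = server.accept_and_recv()
second = server.accept_and_recv()
```

Once `poll` has run, the host's accept and receive calls reduce to a queue lookup, because the protocol work has already been done on the TCP server.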

Direct Communication with File Server

• In a multi-tier architecture, a TCP server can be instructed to perform direct communication with the file server.

TCP Server in an SMP-based Architecture

• Dedicating a subset of the processors to in-kernel TCP processing.

• Network-generated interrupts are routed to the dedicated processors.

• The communication between the application and the TCP server is through queues in shared memory.
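
The queue-based communication can be sketched with two in-process queues: one carrying tunneled socket calls to the TCP server and one carrying completions back. This is an analogy only: a Python thread stands in for a dedicated processor, `queue.Queue` stands in for a shared-memory queue, and the request tuple layout is an assumption made for illustration.

```python
import queue
import threading


def tcp_server_loop(requests: queue.Queue, completions: queue.Queue) -> None:
    """Runs on the 'dedicated processor': serve tunneled socket calls."""
    while True:
        req = requests.get()
        if req is None:  # shutdown marker
            break
        op, sock_id, payload = req
        if op == "send":
            # Real TCP processing would happen here.
            completions.put((sock_id, "send_done", len(payload)))
        else:
            completions.put((sock_id, "unsupported", op))


requests: queue.Queue = queue.Queue()
completions: queue.Queue = queue.Queue()
worker = threading.Thread(target=tcp_server_loop, args=(requests, completions))
worker.start()

# Application side: enqueue the call instead of trapping into the TCP stack.
requests.put(("send", 7, b"hello"))
requests.put(None)
worker.join()

result = completions.get_nowait()
```

The application never executes protocol code itself; it only enqueues a request and later picks up the completion record.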

SMP-based Architecture Details

• Offloading interrupts and receive processing.

• Offloading TCP send processing.

TCP Server in a Cluster-based Architecture

• Dedicating a subset of nodes to TCP processing.

• VIA-based SAN interconnect.

Cluster-based Architecture Operation

• The TCP server node acts as the network endpoint for the outside world.

• The network data is transferred between the host node and the TCP server node across the SAN using low-latency memory-mapped communication.

Cluster-based Architecture Details

• The socket call interface is implemented as a user-level communication library.

• With this library, a socket call is tunneled across the SAN to the TCP server.

• Several implementations:
1. Split-TCP (synchronous)
2. AsyncSend
3. Eager Receive
4. Eager Accept
5. Setup With Accept

TCP Server in an Intelligent-NIC-based Architecture

• Cluster of intelligent devices over a switch-based I/O interconnect (InfiniBand).

• The devices are considered to be "intelligent", i.e., each device has a programmable processor and local memory.

Intelligent-NIC-based Architecture Details

• Each open connection is associated with a memory-mapped channel between the host and the I-NIC.

• During a message send, the message is transferred directly from user-space to a send buffer at the interface.

• A message receive is first buffered at the network interface and then copied directly to user-space at the host.

4-way SMP-based Evaluation

• Dedicating two processors to network processing is always better than dedicating only one.

• Throughput benefits of up to 25-30%.

4-way SMP-based Evaluation

4-way SMP-based Evaluation

• When only one processor is dedicated to the network processing, the network processor becomes a bottleneck and, consequently, the application processor suffers from idle time.

• When we apply two processors to the handling of the network overhead, there is enough network processing capacity and the application processor becomes the bottleneck.

• The best system would be one in which the division of labor between the network and application processors is more flexible, allowing for some measure of load balancing.
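
The bottleneck shift described above can be reproduced with a toy pipeline model in which throughput is limited by the slower of the two stages. The per-request costs below are hypothetical numbers chosen only to mirror the qualitative behaviour on the slide; they are not measurements from the paper.

```python
def throughput(app_cpus, net_cpus, app_cost, net_cost):
    """Return (requests per time unit, bottleneck stage) for a static split."""
    app_rate = app_cpus / app_cost
    net_rate = net_cpus / net_cost
    bottleneck = "application" if app_rate < net_rate else "network"
    return min(app_rate, net_rate), bottleneck


APP_COST = 2.0  # hypothetical CPU time per request in the application
NET_COST = 1.5  # hypothetical CPU time per request in network processing
TOTAL_CPUS = 4

one_net = throughput(3, 1, APP_COST, NET_COST)  # network stage saturates
two_net = throughput(2, 2, APP_COST, NET_COST)  # application stage saturates

# With perfectly flexible load balancing, every CPU stays busy.
flexible = TOTAL_CPUS / (APP_COST + NET_COST)
```

In this model neither static split keeps all four processors busy, which is why a flexible division of labor gives the highest rate.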

2-node Cluster-based Evaluation for Static Load

• Asynchronous send operations outperform their synchronous counterparts.

2-node Cluster-based Evaluation for Static Load

• Smaller gain than that achievable with the SMP-based architecture.

• 17% is the greatest throughput improvement we can achieve with this architecture/workload combination.

2-node Cluster-based Evaluation for Static Load

• In the case of Split-TCP and AsyncSend, the host has idle time available, since it is the network processing at the TCP server that proves to be the bottleneck.

2-node Cluster-based Evaluation for Static and Dynamic Load

• Split-TCP and AsyncSend systems saturate later than Regular TCP.

2-node Cluster-based Evaluation for Static and Dynamic Load

• At an offered load of about 500 reqs/sec, the host CPU is effectively saturated.

• 18% is the greatest throughput improvement we can achieve with this architecture.

2-node Cluster-based Evaluation for Static and Dynamic Load

• Balanced configurations depend heavily on the particular characteristics of the workload.

• A dynamic load balancing scheme between host and TCP server nodes is required for ideal performance in dynamic workloads.

Intelligent-NIC-based Simulation Evaluation

• For all the simulated processor speeds, the Split-TCP system outperforms all the other implementations.

• The improvements over a conventional system range from 20% to 45%.

Intelligent-NIC-based Simulation Evaluation

• The ratio of processing power at the host to that available at the NIC plays an important role in determining the server performance.

• In Split-TCP, the processor on the NIC saturates much earlier than the host processor or the network.

• We can achieve better performance with a Split-TCP implementation only with a fast processor on the NIC.

Conclusions about TCP Servers 1/2

• Offloading TCP/IP processing is beneficial to overall system performance when the server is overloaded.

• An SMP-based approach to TCP servers is more efficient than a cluster-based one.

• The benefits of SMP and cluster-based TCP servers reach 30% in the scenarios we studied.

• The simulated results show greater gains of up to 45% for a cluster of devices.

Conclusions about TCP Servers 2/2

• TCP servers require substantial computing resources for complete offloading.

• The type of workload plays a significant role in the efficiency of TCP servers.

• Depending on the application workload, either the host processor or the TCP Server can become the bottleneck.

• Hence, a scheme to balance the load between the host and the TCP Server would be beneficial for server performance.

Thank you!

Questions/comments?
