STREAMS vs. Sockets Performance Comparison for UDP
Experimental Test Results for Linux
Brian F. G. Bidulock∗
OpenSS7 Corporation
June 16, 2007
Abstract
With the objective of contrasting performance between STREAMS and legacy approaches to system facilities, a comparison is made between the tested performance of the Linux Native Sockets UDP implementation and STREAMS TPI UDP and XTIoS UDP implementations using the Linux Fast-STREAMS package [LfS].
1 Background
UNIX networking has a rich history. The TCP/IP protocol suite was first implemented by BBN using Sockets under a DARPA research project on 4.1aBSD and then incorporated by the CSRG into 4.2BSD [MBKQ97]. Lachmann and Associates (Legent) subsequently implemented one of the first TCP/IP protocol suites based on the Transport Provider Interface (TPI) [TLI92] and STREAMS [GC94]. Two other predominant TCP/IP implementations on STREAMS surfaced at about the same time: Wollongong and Mentat.
1.1 STREAMS
STREAMS is a facility first presented in a paper by Dennis M. Ritchie in 1984 [Rit84], originally implemented on 4.1BSD and later part of Bell Laboratories Eighth Edition UNIX, incorporated into UNIX System V Release 3 and enhanced in UNIX System V Release 4 and further in UNIX System V Release 4.2. STREAMS was used in SVR4 for terminal input-output, pseudo-terminals, pipes, named pipes (FIFOs), interprocess communication and networking. STREAMS was used in SVR3 for networking (with the NSU package). Since its release in System V Release 3, STREAMS has been implemented across a wide range of UNIX, UNIX-like and UNIX-based systems, making its implementation and use an ipso facto standard.
STREAMS is a facility that allows for a reconfigurable full duplex communications path, Stream, between a user process and a driver in the kernel. Kernel protocol modules can be pushed onto and popped from the Stream between the user process and driver. The Stream can be reconfigured in this way by a user process. The user process, neighbouring protocol modules and the driver communicate with each other using a message passing scheme. This permits a loose coupling between protocol modules, drivers and user processes, allowing a third-party and loadable kernel module approach to be taken toward the provisioning of protocol modules on platforms supporting STREAMS.
On UNIX System V Release 4.2, STREAMS was used for terminal input-output, pipes, FIFOs (named pipes), and network communications. Modern UNIX, UNIX-like and UNIX-based systems providing STREAMS normally support some degree of network communications using STREAMS; however, many do not support STREAMS-based pipes and FIFOs1 or terminal input-output.2
UNIX System V Release 4.2 supported four Application Programmer Interfaces (APIs) for accessing the network communications facilities of the kernel:
Transport Layer Interface (TLI). TLI is an acronym for the Transport Layer Interface [TLI92]. The TLI was the non-standard interface provided by SVR4, later standardized by X/Open as the XTI described below. This interface is now deprecated.
X/Open Transport Interface (XTI). XTI is an acronym for the X/Open Transport Interface [XTI99]. The X/Open Transport Interface is a standardization of the UNIX System V Release 4 Transport Layer Interface. The interface consists of an Application Programming Interface implemented as a shared object library. The shared object library communicates with a transport provider Stream using a service primitive interface called the Transport Provider Interface.
While XTI was implemented directly over STREAMS devices supporting the Transport Provider Interface (TPI) [TPI99] under SVR4, several non-traditional approaches exist in implementation:
Berkeley Sockets. Sockets uses the BSD interface that was developed by BBN for the TCP/IP protocol suite under DARPA contract on 4.1aBSD and released in 4.2BSD. BSD Sockets provides a set of primary API functions that are typically implemented as system calls. The BSD Sockets interface is non-standard and is now deprecated in favour of the POSIX/SUS standard Sockets interface.
POSIX Sockets. Sockets were standardized by the Open Group [OG] and IEEE in the POSIX standardization process. They appear in XNS 5.2 [XNS99], SUSv1 [SUS95], SUSv2 [SUS98] and SUSv3 [SUS03].
On systems traditionally supporting Sockets and then retrofitted to support STREAMS, there is one approach toward supporting XTI without refitting the entire networking stack:3
XTI over Sockets. Several implementations of STREAMS on UNIX utilize the concept of TPI over Sockets. Following this approach, a STREAMS pseudo-device driver is provided that hooks directly into internal socket system calls to implement the driver, and yet the networking stack remains fundamentally BSD in style.
Typically there are two approaches to implementing XTI on systems not supporting STREAMS:
XTI Compatibility Library. Several implementations of XTI on UNIX utilize the concept of an XTI compatibility library.4
This is purely a shared object library approach to providing XTI. Under this approach it is possible to use the XTI application programming interface, but it is not possible to utilize any of the STREAMS capabilities of an underlying Transport Provider Interface (TPI) stream.

1. For example, AIX.
2. For example, HP-UX.
3. This approach is taken by Tru64 (Digital) UNIX.
4. One was even available for Linux at one point.
TPI over Sockets. An alternate approach, taken by the Linux iBCS package, was to provide a pseudo-transport provider using a legacy character device to present the appearance of a STREAMS transport provider.
Conversely, on systems supporting STREAMS, but not traditionally supporting Sockets (such as SVR4), there are four approaches toward supporting BSD and POSIX Sockets based on STREAMS:
Compatibility Library. Under this approach, a compatibility library (libsocket.o) contains the socket calls as library functions that internally invoke the TLI or TPI interface to an underlying STREAMS transport provider. This is the approach originally taken by SVR4 [GC94], but this approach has subsequently been abandoned due to the difficulties regarding fork(2) and fundamental incompatibilities deriving from a library-only approach.
Library and cooperating STREAMS module. Under this approach, a cooperating module, normally called sockmod, is pushed on a Transport Provider Interface (TPI) Stream. The library, normally called socklib or simply socket, and the cooperating sockmod module provide the BSD or POSIX Socket API. [VS90] [Mar01]
Library and System Calls. Under this approach, the BSD or POSIX Sockets API is implemented as system calls with the sole exception of the socket(3) call. The underlying transport provider is still a TPI-based STREAMS transport provider; it is just that system calls instead of library calls are used to implement the interface. [Mar01]
System Calls. Under this approach, even the socket(3) call is moved into the kernel. Conversion between POSIX/BSD Sockets calls and TPI service primitives is performed completely within the kernel. The sock2path(5) configuration file is used to configure the mapping between STREAMS devices and socket types and domains [Mar01].
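For illustration, a sock2path(5) mapping might be sketched as below. The format shown (one line per mapping: address family, socket type, protocol, STREAMS device path) and the device names are assumptions drawn from common STREAMS systems; the authoritative format is given by sock2path(5) on the system in question:

```
# sock2path (illustrative sketch): family  type  protocol  device
2   2   0    /dev/udp     # AF_INET, SOCK_DGRAM  -> UDP transport provider
2   2   17   /dev/udp
2   1   0    /dev/tcp     # AF_INET, SOCK_STREAM -> TCP transport provider
2   1   6    /dev/tcp
```

With such a table in place, a socket(AF_INET, SOCK_DGRAM, 0) call can be resolved in the kernel to an open of the corresponding transport provider Stream.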
1.1.1 Standardization.
During the POSIX standardization process, networking and Sockets interfaces were given special treatment to ensure that both the legacy Sockets approach and the STREAMS approach to networking were compatible. POSIX has standardized both the XTI and Sockets programmatic interfaces to networking. STREAMS networking has been POSIX compliant for many years; the BSD Sockets, POSIX Sockets, TLI and XTI interfaces were all compliant in the SVR4.2 release. The STREAMS networking provided by the Linux Fast-STREAMS package provides POSIX compliant networking.
Therefore, any application utilizing a Socket or Stream in a POSIX compliant manner will also be compatible with STREAMS networking.5
1.2 Linux Fast-STREAMS
The first STREAMS package for Linux that provided SVR4 STREAMS capabilities was the Linux STREAMS (LiS) package originally available from GCOM [LiS]. This package exhibited incompatibilities with SVR 4.2 STREAMS and other STREAMS implementations, was buggy and performed very poorly on Linux. These difficulties prompted the OpenSS7 Project [SS7] to implement an SVR 4.2 STREAMS package from scratch, with the objective of production quality and high performance, named Linux Fast-STREAMS [LfS].
The OpenSS7 Project also maintains public and internal versions of the LiS package. The last public release was LiS-2.18.3; the current internal release version is LiS-2.18.6. The current production public release of Linux Fast-STREAMS is streams-0.9.3.
2 Objective
The question has been asked whether there are performance differences between a purely BSD-style approach and a STREAMS approach to TCP/IP networking, cf. [RBD97]. However, there did not exist a system which permitted both approaches to be tested on the same operating system. Linux Fast-STREAMS running on the GNU/Linux operating system now permits this comparison to be made. The objective of the current study, therefore, was to determine whether, for the Linux operating system, a STREAMS-based approach to TCP/IP networking is a viable replacement for the BSD-style sockets approach provided by Linux, termed NET4.
When developing STREAMS, the authors oft times found that there were a number of preconceptions espoused by Linux advocates about both STREAMS and STREAMS-based networking, as follows:
• STREAMS is slow.
• STREAMS is more flexible, but less efficient [LML].
• STREAMS performs poorly on uniprocessor and even poorer on SMP.
• STREAMS networking is slow.
• STREAMS networking is unnecessarily complex and cumbersome.
For example, the Linux kernel mailing list has this to say about STREAMS:
(REG) STREAMS allow you to "push" filters onto a network stack. The idea is that you can have a very primitive network stream of data, and then "push" a filter ("module") that implements TCP/IP or whatever on top of that. Conceptually, this is very nice, as it allows clean separation of your protocol layers. Unfortunately, implementing STREAMS poses many performance problems. Some Unix STREAMS based server telnet implementations even ran the data up to user space and back down again to a pseudo-tty driver, which is very inefficient.
STREAMS will never be available in the standard Linux kernel; it will remain a separate implementation with some add-on kernel support (that comes with the STREAMS package). Linus and his networking gurus are unanimous in their decision to keep STREAMS out of the kernel. They have stated several times on the kernel list when this topic comes up that even optional support will not be included.
(REW, quoting Larry McVoy) "It's too bad, I can see why some people think they are cool, but the performance cost - both on uniprocessors and even more so on SMP boxes - is way too high for STREAMS to ever get added to the Linux kernel."
Please stop asking for them; we have agreement amongst the head guy, the networking guys, and the fringe folks like myself that they aren't going in.
(REG, quoting Dave Grothe, the STREAMS guy) STREAMS is a good framework for implementing complex and/or deep protocol stacks having nothing to do with TCP/IP, such as SNA. It trades some efficiency for flexibility. You may find the Linux STREAMS package (LiS) to be quite useful if you need to port protocol drivers from Solaris or UnixWare, as Caldera did.
The Linux STREAMS (LiS) package is available for download if you want to use STREAMS for Linux. The following site also contains a dissenting view, which supports STREAMS.
The current study attempts to determine the validity of these preconceptions.
5. This compatibility is exemplified by the netperf program, which does not distinguish between BSD or STREAMS based networking in its implementation or use.
3 Description
Three implementations are tested:
Linux Kernel UDP (udp).
The native Linux socket and networking system.
OpenSS7 STREAMS XTIoS inet Driver.
A STREAMS pseudo-device driver that communicates with a socket internal to the kernel.

This is the OpenSS7 XTI over Sockets implementation of UDP. While the implementation uses the Transport Provider Interface and STREAMS to communicate with the driver, internal to the driver a UDP Socket is opened and conversion between STREAMS and Sockets performed.
OpenSS7 STREAMS TPI UDP Driver udp.
A STREAMS pseudo-device driver that fully implements UDP and communicates with the IP layer in the kernel.
The three implementations tested vary in their implementation details. These implementation details are described below.
3.1 Linux Kernel UDP
Normally, in BSD-style implementations of Sockets, Sockets is not merely the Application Programmer Interface, but also consists of a more general purpose network protocol stack implementation [MBKQ97], even though the mechanism is not used for more than TCP/IP networking. [GC94]
Although BSD networking implementations consist of a number of networking layers with soft interrupts used for each layer of the networking stack [MBKQ97], the Linux implementation, although based on the BSD approach, tightly integrates the socket, protocol, IP and interface layers using specialized interfaces. Although roughly corresponding to the BSD stack as illustrated in Figure 1, the socket, protocol and interface layers in the BSD stack have well defined, general purpose interfaces applicable to a wider range of networking protocols.
Figure 1: Sockets: BSD and Linux
Both Linux UDP implementations are a good example of the tight integration between the components of the Linux networking stack.
Write side processing. On the write side of the Socket, bytes are copied from the user into allocated socket buffers. Write side socket buffers are charged against the send buffer. Socket buffers are immediately dispatched to the IP layer for processing. When the IP layer (or a driver) consumes the socket buffer, it releases the amount of send buffer space that was charged for the send buffer. If there is insufficient space in the send buffer to accommodate the write, the calling process is either blocked or the system call returns an error (ENOBUFS).
For loop-back operation, immediately sending the socket buffer to the IP layer has the additional ramification that the socket buffer is immediately struck from the send buffer and immediately added to the receive buffer on the receiving socket. Therefore, neither the size of the send buffer nor the send low water mark has any effect.
Read side processing. On the read side of the Socket, the network layer calls the protocol's receive function. The receive function checks whether the socket is locked (by a reading or writing user). If the socket is locked, the socket buffer is placed in the socket's backlog queue. The backlog queue can hold a maximum number of socket buffers. If this maximum is exceeded, the packet is dropped. If the socket is unlocked, and the socket buffer will fit in the socket's receive buffer, the socket buffer is charged against the receive buffer. If the socket buffer will not fit in the receive buffer, the socket buffer is dropped.
Read side processing under Linux does not differ from BSD, except for loop-back devices. Normally, for non-loop-back devices, skbuffs received by the device are queued against the IP layer and the IP layer software interrupt is raised. When the software interrupt runs, skbuffs are delivered directly to the transport protocol layer without intermediate queueing [MBKQ97].
For loop-back operation, however, Linux skips queueing at the IP protocol layer (which does not exist as it does in BSD) and, instead, delivers skbuffs directly to the transport protocol.
Due to this difference between Linux and BSD on the read side, it is expected that performance results for Linux would vary from those of BSD, and the results of this testing would therefore not be directly applicable to BSD.
Buffering. Buffering at the Socket consists of a send buffer and low water mark and a receive buffer and low water mark. When the send buffer is consumed with outstanding messages, writing processes will either block or the system call will fail with an error (ENOBUFS). When the send buffer is filled higher than the low water mark, a blocked writing process will not be awoken (regardless of whether the process is blocked in write or blocked in poll/select). The send low water mark for Linux is fixed at one-half of the send buffer.
It should be noted that for loop-back operation under Linux, the send buffering mechanism is effectively defeated.
When the receive buffer is consumed with outstanding messages, received messages will be discarded. This is in rather stark contrast to BSD, where messages are effectively returned to the network layer when the socket receive buffer is full and the network layer can determine whether messages should be discarded or queued further [MBKQ97].
When there is no data in the receive buffer, the reading process will either block or return from the system call with an error (ENOBUFS again). When the receive buffer has fewer bytes of data in it than the low water mark, a blocked reading process will not be awoken (regardless of whether the process is blocked in read or blocked in poll/select). The receive low water mark for Linux is typically set to the BSD default of 1 byte.6
6. The fact that Linux sets the receive low water mark to 1 byte is an indication that the buffering mechanism on the read side simply does not work.
It should be noted that the Linux buffering mechanism does not have hysteresis like that of STREAMS. When the amount of data in the send buffer exceeds the low water mark, poll will cease to return POLLOUT; when the data in the receive buffer is less than the low water mark, poll will cease to return POLLIN.
Scheduling. Scheduling of processes and the buffering mechanism are closely related.
Writing processes for loop-back operation under UDP are allowed to spin wildly. Written data charged against the send buffer is immediately released when the loop-back interface is encountered and immediately delivered to the receiving socket (or discarded). If the writing process is writing data faster than the reading process is consuming it, the excess will simply be discarded, and no back-pressure signalled to the sending socket.
If receive buffer sizes are sufficiently large, the writing process will lose the processor on uniprocessor systems and the reading process will be scheduled before the buffer overflows; if they are not, the excess will be discarded. On multiprocessor systems, provided that the read operation takes less time than the write operation, the reading process will be able to keep pace with the writing process. If the receiving process is run with a very low priority, the writing process will always have the processor and a large percentage of the written messages will be discarded.
It should be noted that this is likely a Linux-specific deficiency, as the BSD system introduces queueing, even on loop-back.
Reading processes for loop-back operation under UDP are awoken whenever a single byte is received (due to the default receive low water mark). If the reading process has higher priority than the writing process on uniprocessors, the reading process will be awoken for each message sent and the reading process will read that message before the writing process is permitted to write another. On SMP systems, because reading processes will likely have the socket locked while reading each message, backlog processing will likely be invoked.
3.2 Linux Fast-STREAMS
Linux Fast-STREAMS is an implementation of SVR4.2 STREAMS for the GNU/Linux system developed by the OpenSS7 Project [SS7] as a replacement for the buggy, under-performing and now deprecated Linux STREAMS (LiS) package. Linux Fast-STREAMS provides the STREAMS executive and interprocess communication facilities (pipes and FIFOs). Add-on packages provide compatibility between Linux Fast-STREAMS and other STREAMS implementations, a complete XTI shared object library, and transport providers. Transport providers for the TCP/IP suite consist of an inet driver that uses the XTI over Sockets approach as well as a full STREAMS implementation of SCTP (Stream Control Transmission Protocol), UDP (User Datagram Protocol) and RAWIP (Raw Internet Protocol).
3.2.1 XTI over Sockets
The XTI over Sockets implementation is the inet STREAMS driver developed by the OpenSS7 Project [SS7]. As illustrated in Figure 2, this driver is implemented as a STREAMS pseudo-device driver and uses STREAMS for passing TPI service primitives to and from upstream modules or the Stream head. Within the driver, data and other TPI service primitives are translated into kernel socket calls to a socket that was opened by the driver corresponding to the transport provider instance. Events received from this internal socket are also translated into transport provider service primitives and passed upstream.
Figure 2: XTI over Sockets inet Driver

Write side processing. Write side processing uses standard STREAMS flow control mechanisms as are described for TPI, below, with the exception that once the message blocks arrive at the driver they are passed to the internal socket. Therefore, a unique characteristic of the write side processing for the XTI over Sockets driver is that data is first copied from user space into STREAMS message blocks and then copied again from the STREAMS message blocks to the socket. This constitutes two copies per byte versus one copy per byte and has a significant impact on the performance of the driver at large message sizes.7
Read side processing. Read side processing uses standard STREAMS flow control mechanisms as are described for TPI, below. A unique characteristic of the read side processing for the XTI over Sockets driver is that data is first copied from the internal socket to a STREAMS message block and then copied again from the STREAMS message block to user space. This constitutes two copies per byte versus one copy per byte and has a significant impact on the performance of the driver at large message sizes.8
Buffering. Buffering uses standard STREAMS queueing and flow control mechanisms as are described for TPI, below.
Scheduling. Scheduling resulting from queueing and flow control is the same as described for TPI, below. Considering that the internal socket used by the driver is on the loop-back interface, data written on the sending socket appears immediately at the receiving socket or is discarded.
3.2.2 STREAMS TPI
The STREAMS TPI implementation of UDP is a direct STREAMS implementation that uses the udp driver developed by the OpenSS7 Project [SS7]. As illustrated in Figure 3, this driver interfaces to Linux at the network layer, but provides a complete STREAMS implementation of the transport layer. Interfacing with Linux at the network layer provides for a de-multiplexed STREAMS architecture [RBD97]. The driver presents the Transport Provider Interface (TPI) [TPI99] for use by upper level modules and the XTI library [XTI99].
Linux Fast-STREAMS also provides a raw IP driver (raw) and an SCTP driver (sctp) that operate in the same fashion as the udp driver. That is, they perform all transport protocol functions within the driver and interface to the Linux NET4 IP layer. One of the project objectives of performing the current testing was to determine whether it would be worth the effort to write a STREAMS transport implementation of TCP, the only missing component in the TCP/IP suite that necessitates the continued support of the XTI over Sockets (inet) driver.

7. This expectation of performance impact is held up by the test results.
8. This expectation of performance impact is held up by the test results.

Figure 3: STREAMS udp Driver
Write side processing. Write side processing follows standard STREAMS flow control. When a write occurs at the Stream head, the Stream head checks for downstream flow control on the write queue. If the Stream is flow controlled, the calling process is blocked or the write system call fails (EAGAIN). When the Stream is not flow controlled, user data is transferred to allocated message blocks and passed downstream. When the message blocks arrive at a downstream queue, the count of the data in the message blocks is added to the queue count. If the queue count exceeds the high water mark defined for the queue, the queue is marked full and subsequent upstream flow control tests will fail.
Read side processing. Read side processing follows standard STREAMS flow control. When a read occurs at the Stream head, the Stream head checks the read queue for messages. If the read queue has no messages queued, the queue is marked to be enabled when messages arrive and the calling process is either blocked or the system call returns an error (EAGAIN). If messages exist on the read queue, they are dequeued and data copied from the message blocks to the user supplied buffer. If the message block is completely consumed, it is freed; otherwise, the message block is placed back on the read queue with the remaining data.
Buffering. Buffering follows the standard STREAMS queueing and flow control mechanisms. When a queue is found empty by a reading process, the fact that the queue requires service is recorded. Once the first message arrives at the queue following a process finding the queue empty, the queue's service procedure will be scheduled with the STREAMS scheduler. When a queue is tested for flow control and the queue is found to be full, the fact that a process wishes to write to the queue is recorded. When the count of the data on the queue falls beneath the low water mark, previous queues will be back-enabled (that is, their service procedures will be scheduled with the STREAMS scheduler).
Scheduling. When a queue downstream from the Stream head write queue is full, writing processes either block or fail with an error (EAGAIN). When the forward queue's count falls below its low water mark, the Stream head write queue is back-enabled. Back-enabling consists of scheduling the queue's service procedure for execution by the STREAMS scheduler. Only later, when the STREAMS scheduler runs pending tasks, does any writing process blocked on flow control get woken.
When a stream head read queue is empty, reading processes either block or fail with an error (EAGAIN). When a message arrives at the stream head read queue, the service procedure associated with the queue is scheduled for later execution by the STREAMS scheduler. Only later, when the STREAMS scheduler runs pending tasks, does any reading process blocked awaiting messages get woken.
4 Method
To test the performance of STREAMS networking, the Linux Fast-STREAMS package was used [LfS]. The Linux Fast-STREAMS package builds and installs Linux loadable kernel modules and includes the modified netperf and iperf programs used for testing.
Test Program. One program used is a version of the netperf network performance measurement tool developed and maintained by Rick Jones for Hewlett-Packard. This modified version is available from the OpenSS7 Project [Jon07]. While the program is able to test using both POSIX Sockets and XTI STREAMS interfaces, modifications were required to the package to allow it to compile for Linux Fast-STREAMS.
The netperf program has many options. Therefore, a benchmark script (called netperf benchmark) was used to obtain repeatable raw data for the various machines and distributions tested. This benchmark script is included in the netperf distribution available from the OpenSS7 Project [Jon07]. A listing of this script is provided in Appendix A.
4.1 Implementations Tested
The following implementations were tested:
UDP Sockets. This is the Linux NET4 Sockets implementation of UDP, described in Section ??, with normal scheduling priorities. Normal scheduling priority means invoking the sending and receiving processes without altering their run-time scheduling priority.
UDP Sockets with artificial process priorities. This is the same Linux NET4 Sockets implementation of UDP, invoked with altered run-time scheduling priorities for the sending and receiving processes.
STREAMS XTIoS UDP. This is the OpenSS7 STREAMS implementation of XTI over Sockets for UDP, described in Section 3.2.1. This implementation is tested using normal run-time scheduling priorities.
STREAMS TPI UDP. This is the OpenSS7 STREAMS implementation of UDP using XTI/TPI directly, described in Section 3.2.2. This implementation is tested using normal run-time scheduling priorities.
4.2 Distributions Tested
To remove the dependence of test results on a particular Linux kernel or machine, various Linux distributions were used for testing. The distributions tested are as follows:
5 Results
The results for the various distributions and machines are tabulated in Appendix B. The data is tabulated as follows:
Performance. Performance is charted by graphing the number of messages sent and received per second against the logarithm of the message send size.
Delay. Delay is charted by graphing the number of seconds per send and receive against the sent message size. The delay can be modelled as a fixed overhead per send or receive operation and a fixed overhead per byte sent. This model results in a linear graph with the zero x-intercept representing the fixed per-message overhead, and the slope of the line representing the per-byte cost. As all implementations use the same primary mechanism for copying bytes to and from user space, it is expected that the slope of each graph will be similar and that the intercept will reflect most implementation differences.
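The linear model described above can be stated explicitly; the symbols below are introduced here for illustration only:

```latex
d(n) \;=\; t_{\mathrm{msg}} + t_{\mathrm{byte}}\,n
```

where $d(n)$ is the measured delay (seconds per send and receive) at message size $n$, $t_{\mathrm{msg}}$ is the fixed per-message overhead (the intercept) and $t_{\mathrm{byte}}$ is the per-byte copy cost (the slope).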
Throughput. Throughput is charted by graphing the logarithm of the product of the number of messages per second and the message size against the logarithm of the message size. It is expected that these graphs will exhibit strong log-log-linear (power function) characteristics. Any curvature in these graphs represents throughput saturation.
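To make the log-log-linear claim concrete (notation introduced here for illustration): if the message rate at size $n$ follows a power function $r(n) \approx a\,n^{b}$, then the throughput $T(n) = n\,r(n)$ obeys

```latex
\log T(n) \;=\; \log a + (b+1)\log n
```

so each curve should plot as a straight line on log-log axes, and any curvature indicates departure from the model, i.e. saturation.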
Improvement. Improvement is charted by graphing the quotient of the bytes per second of the implementation and the bytes per second of the Linux sockets implementation as a percentage against the message size. Values over 0% represent an improvement over Linux sockets, whereas values under 0% represent the lack of an improvement.
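Expressed as a formula (notation introduced here for illustration), the improvement plotted is

```latex
I(n) \;=\; \left(\frac{B_{\mathrm{impl}}(n)}{B_{\mathrm{sock}}(n)} - 1\right)\times 100\%
```

where $B_{\mathrm{impl}}(n)$ and $B_{\mathrm{sock}}(n)$ are the bytes per second achieved at message size $n$ by the tested implementation and by Linux sockets respectively, so that parity with Linux sockets corresponds to 0%.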
The results are organized in the sections that follow in order of the machine tested.
5.1 Porky
Porky is a 2.57GHz Pentium IV (i686) uniprocessor machine with 1Gb of memory. Linux distributions tested on this machine are as follows:
5.1.1 Fedora Core 6
Fedora Core 6 is the most recent full release Fedora distribution. This distribution sports a 2.6.20-1.2933.fc6 kernel with the latest patches. This is the x86 distribution with recent updates.
Performance. Figure 4 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
Delay. Figure 5 plots the average message delay of UDP Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
From the figure, it can be seen that the slopes of the delay graphs for STREAMS and Sockets are about the same. This is expected as both implementations use the same function to copy message bytes to and from user space. The slope of the XTI over Sockets graph is over twice the slope of the Sockets graph, which reflects the fact that XTI over Sockets performs multiple copies of the data: two copies on the send side and two copies on the receive side.
The intercept for STREAMS is lower than Sockets, indicating that STREAMS has a lower per-message overhead than Sockets, despite the fact that the destination address is being copied to and from user space for each message.
Throughput. Figure 6 plots the effective throughput of UDP Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
As can be seen from the figure, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation. The slight concave downward curvature of the graphs at large message sizes indicates some degree of saturation.
Improvement. Figure 7 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements (approx. 30% improvement) at message sizes below 1024 bytes. Perhaps surprising is that the XTI over Sockets approach rivals (95%) Sockets alone at small message sizes (where multiple copies are not controlling).
The results for Fedora Core 6 on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
5.1.2 CentOS 4.0
CentOS 4.0 is a clone of the RedHat Enterprise 4 distribution. This is the x86 version of the distribution. The distribution sports a 2.6.9-5.0.3.EL kernel.
Performance. Figure 8 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
As can be seen from the figure, Linux Fast-STREAMS outperforms Linux at all message sizes. Also, and perhaps surprisingly, the XTI over Sockets implementation also performs as well as Linux at lower message sizes.
Delay. Figure 9 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
Both STREAMS and Sockets exhibit the same slope, and XTI over Sockets exhibits over twice the slope, indicating that copies of data control the per-byte characteristics. STREAMS exhibits a lower intercept than both other implementations, indicating that STREAMS has the lowest per-message overhead, despite copying the destination address to and from the user with each sent and received message.
Throughput. Figure 10 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
As can be seen from the figure, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation. Again, the slight concave downward curvature at large message sizes indicates memory bus saturation.
Improvement. Figure 11 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements (approx. 30-40% improvement) at message sizes below 1024 bytes. Perhaps surprising is that the XTI over Sockets approach rivals (97%) Sockets alone at small message sizes (where multiple copies are not controlling).
The results for CentOS on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
5.1.3 SuSE 10.0 OSS
SuSE 10.0 OSS is the public release version of the SuSE/Novell distribution. There have been two releases subsequent to this one: the 10.1 and recent 10.2 releases. The SuSE 10 release sports a 2.6.13 kernel and the 2.6.13-15-default kernel was the tested kernel.
Performance. Figure 12 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
Delay. Figure 13 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
Again, STREAMS and Sockets exhibit the same slope, and XTI over Sockets more than twice the slope. STREAMS again has a significantly lower intercept and the XTI over Sockets and Sockets intercepts are similar, indicating that STREAMS has a smaller per-message overhead, despite copying destination addresses with each message.
Throughput. Figure 14 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
As can be seen from Figure 14, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation.
Improvement. Figure 15 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements (25-30%) at all message sizes.
The results for SuSE 10 OSS on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
5.1.4 Ubuntu 6.10
Ubuntu 6.10 is the current release of the Ubuntu distribution. The Ubuntu 6.10 release sports a 2.6.15 kernel. The tested distribution had current updates applied.
Performance. Figure 16 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates marginal improvements (approx. 5%) at all message sizes.
Delay. Figure 17 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates marginal improvements at all message sizes.
Although STREAMS exhibits the same slope (per-byte processing overhead) as Sockets, Ubuntu and the 2.6.17 kernel are the only combination where the STREAMS intercept is not significantly lower than Sockets. Also, the XTI over Sockets slope is steeper and the XTI over Sockets intercept is much larger than Sockets alone.
Throughput. Figure 18 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates marginal improvements at all message sizes.
As can be seen from Figure 18, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation.
Improvement. Figure 19 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates marginal improvements (approx. 5%) at all message sizes.
Ubuntu is the only distribution tested where STREAMS does not show significant improvements over Sockets. Nevertheless, STREAMS does show marginal improvement (approx. 5%) over all message sizes and performed better than Sockets at all message sizes.
5.1.5 Ubuntu 7.04
Ubuntu 7.04 is the current release of the Ubuntu distribution. The Ubuntu 7.04 release sports a 2.6.20 kernel. The tested distribution had current updates applied.
Performance. Figure 20 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements (approx. 20-60%) at all message sizes.
Delay. Figure 21 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
STREAMS and Sockets exhibit the same slope, and XTI over Sockets more than twice the slope. STREAMS, however, has a significantly lower intercept and the XTI over Sockets and Sockets intercepts are similar, indicating that STREAMS has a smaller per-message overhead, despite copying destination addresses with each message.
Throughput. Figure 22 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
As can be seen from Figure 22, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation.
Improvement. Figure 23 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements (approx. 20-60%) at all message sizes.
The results for Ubuntu 7.04 on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
5.2 Pumbah
Pumbah is a 2.57GHz Pentium IV (i686) uniprocessor machine with 1Gb of memory. This machine differs from Porky in memory type only (Pumbah has somewhat faster memory than Porky). Linux distributions tested on this machine are as follows:
Distribution Kernel
RedHat 7.2 2.4.20-28.7
Pumbah is a control machine and is used to rule out differences between recent 2.6 kernels and one of the oldest and most stable 2.4 kernels.
5.2.1 RedHat 7.2
RedHat 7.2 is one of the oldest (and arguably the most stable) glibc2 based releases of the RedHat distribution. This distribution sports a 2.4.20-28.7 kernel. The distribution has all available updates applied.
Performance. Figure 24 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes, and staggering improvements at large message sizes.
Delay. Figure 25 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes, and staggering improvements at large message sizes.
The slope of the STREAMS delay curve is much lower than (almost half that of) the Sockets delay curve, indicating that STREAMS is exploiting some memory efficiencies not possible in the Sockets implementation.
Throughput. Figure 26 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates improvements at all message sizes.
As can be seen from Figure 26, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation.
The Linux NET4 UDP implementation results deviate more sharply from power function behaviour at high message sizes. This also is rather different from the 2.6 kernel situation. One contributing factor is the fact that neither the send nor receive buffers can be set above 65,536 bytes on this version of the Linux 2.4 kernel. Tests were performed with send and receive buffer size requests of 131,072 bytes. Both the STREAMS XTI over Sockets UDP implementation and the Linux NET4 UDP implementation suffer from the maximum buffer size, whereas the STREAMS UDP implementation implements and permits the larger buffers.
Improvement. Figure 27 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
The more dramatic improvements over Linux NET4 UDP and XTI over Sockets UDP are likely due in part to the restriction on buffer sizes in 2.4 as described above.
Unfortunately, the RedHat 7.2 system does not appear to have acted as a very good control system. The differences in maximum buffer size make any differences from other tested behaviour obvious.
5.3 Daisy
Daisy is a 3.0GHz i630 (x86 64) hyper-threaded machine with 1Gb of memory. Linux distributions tested on this machine are as follows:
This machine is used as an SMP control machine. Most of the tests were performed on uniprocessor non-hyper-threaded machines. This machine is hyper-threaded and runs full SMP kernels. This machine also supports EM64T and runs x86 64 kernels. It is used to rule out both SMP differences as well as 64-bit architecture differences.
5.3.1 Fedora Core 6 (x86 64)
Fedora Core 6 is the most recent full release Fedora distribution. This distribution sports a 2.6.20-1.2933.fc6 kernel with the latest patches. This is the x86 64 distribution with recent updates.
Performance. Figure 28 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
Delay. Figure 29 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
The slope of the delay curve either indicates that Sockets is using slightly larger buffers than STREAMS, or that Sockets is somehow exploiting some per-byte efficiencies at larger message sizes not achieved by STREAMS. Nevertheless, the STREAMS intercept is so low that the delay curve for STREAMS is everywhere beneath that of Sockets.
Throughput. Figure 30 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
As can be seen from Figure 30, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation.
Improvement. Figure 31 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements (approx. 40% improvement) at message sizes below 1024 bytes. That STREAMS UDP gives a 40% improvement over a wide range of message sizes on SMP is a dramatic improvement. Statements regarding STREAMS networking running poorer on SMP than on UP are quite wrong, at least with regard to Linux Fast-STREAMS.
5.3.2 CentOS 5.0 (x86 64)
CentOS 5.0 is the most recent full release CentOS distribution. This distribution sports a 2.6.18-8.1.3.el5 kernel with the latest patches. This is the x86 64 distribution with recent updates.
Performance. Figure 32 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
Delay. Figure 33 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes.
The slope of the delay curve either indicates that Sockets is using slightly larger buffers than STREAMS, or that Sockets is somehow exploiting some per-byte efficiencies at larger message sizes not achieved by STREAMS. Nevertheless, the STREAMS intercept is so low that the delay curve for STREAMS is everywhere beneath that of Sockets.
Throughput. Figure 34 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
As can be seen from Figure 34, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation.
Improvement. Figure 35 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements (approx. 40% improvement) at message sizes below 1024 bytes. That STREAMS UDP gives a 40% improvement over a wide range of message sizes on SMP is a dramatic improvement. Statements regarding STREAMS networking running poorer on SMP than on UP are quite wrong, at least with regard to Linux Fast-STREAMS.
5.4 Mspiggy
Mspiggy is a 1.7GHz Pentium IV (M-processor) uniprocessor notebook (Toshiba Satellite 5100) with 1Gb of memory. Linux distributions tested on this machine are as follows:
Distribution Kernel
SuSE 10.0 OSS 2.6.13-15-default
Note that this is the same distribution that was also tested on Porky. The purpose of testing on this notebook is to rule out the differences between machine architectures on the test results. Tests performed on this machine are control tests.
5.4.1 SuSE 10.0 OSS
SuSE 10.0 OSS is the public release version of the SuSE/Novell distribution. There have been two releases subsequent to this one: the 10.1 and recent 10.2 releases. The SuSE 10 release sports a 2.6.13 kernel and the 2.6.13-15-default kernel was the tested kernel.
Performance. Figure 36 plots the measured performance of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes, and staggering improvements at large message sizes.
Delay. Figure 37 plots the average message delay of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes, and staggering improvements at large message sizes.
The slope of the STREAMS delay curve is much lower than (almost half that of) the Sockets delay curve, indicating that STREAMS is exploiting some memory efficiencies not possible in the Sockets implementation.
Throughput. Figure 38 plots the effective throughput of Sockets compared to XTI over Socket and XTI approaches. STREAMS demonstrates improvements at all message sizes.
As can be seen from Figure 38, all implementations exhibit strong power function characteristics (at least at lower write sizes), indicating structure and robustness for each implementation.
The Linux NET4 UDP implementation results deviate more sharply from power function behaviour at high message sizes. One contributing factor is the fact that neither the send nor receive buffers can be set above about 111,000 bytes on this version of the Linux 2.6 kernel running on this speed of processor. Tests were performed with send and receive buffer size requests of 131,072 bytes. Both the STREAMS XTI over Sockets UDP implementation and the Linux NET4 UDP implementation suffer from the maximum buffer size, whereas the STREAMS UDP implementation implements and permits the larger buffers.
Improvement. Figure 39 plots the comparison of Sockets to XTI over Socket and XTI approaches. STREAMS demonstrates significant improvements at all message sizes.
The more dramatic improvements over Linux NET4 UDP and XTI over Sockets UDP are likely due in part to the restriction on buffer sizes in 2.6 on slower processors as described above.
Unfortunately, this SuSE 10.0 OSS system does not appear to have acted as a very good control system. The differences in maximum buffer size make any differences from other tested behaviour obvious.
6 Analysis
With some caveats as described at the end of this section, the results are consistent enough across the various distributions and machines tested to draw some conclusions regarding the efficiency of the implementations tested. This section analyses the results and draws conclusions consistent with the experimental results.
6.1 Discussion
The test results reveal that the maximum throughput performance, as tested by the netperf program, of the STREAMS implementation of UDP is superior to that of the Linux NET4 Sockets implementation of UDP. In fact, STREAMS implementation performance at smaller message sizes is significantly (as much as 30-40%) greater than that of Linux NET4 UDP. As the common belief is that STREAMS would exhibit poorer performance, this is perhaps a startling result to some.
Looking at both implementations, the differences can be described by implementation similarities and differences:
Send processing. When Linux NET4 UDP receives a send request, the available send buffer space is checked. If the current data would cause the send buffer fill to exceed the send buffer maximum, either the calling process blocks awaiting available buffer, or the system call returns with an error (ENOBUFS). If the current send request will fit into the send buffer, a socket buffer (skbuff) is allocated, data is copied from user space to the buffer, and the socket buffer is dispatched to the IP layer for transmission.
Linux 2.6 kernels have an amazing amount of special case code that gets executed for even a simple UDP send operation. Linux 2.4 kernels are far more direct. The result is the same, even though they differ in the depths to which they must delve before discovering that a send is just a simple send. This might explain part of the rather striking differences in the performance comparison between STREAMS and NET4 on 2.6 and 2.4 kernels.
When the STREAMS Stream head receives a putmsg(2) request, it checks downstream flow control. If the Stream is flow controlled downstream, either the calling process blocks awaiting cessation of flow control, or the putmsg(2) system call returns with an error (EAGAIN). If the Stream is not flow controlled on the write side, message blocks are allocated to hold the control and data portions of the request and the message blocks are passed downstream to the driver. When the driver receives an M_DATA or M_PROTO message block from the Stream head, in its put procedure, it simply queues it to the driver write queue with putq(9). putq(9) will result in the enabling of the service procedure for the driver write queue under the proper circumstances. When the service procedure runs, messages will be dequeued from the driver write queue, transformed into IP datagrams and sent to the IP layer for transmission on the network interface.
Linux Fast-STREAMS has a feature whereby the driver can request that the Stream head allocate a Linux socket buffer (skbuff) to hold the data buffer associated with an allocated message block. The STREAMS UDP driver utilizes this feature (but the STREAMS XTIoS UDP driver cannot). STREAMS also has the feature that a write offset can be applied to all data blocks allocated and passed downstream. The STREAMS UDP driver uses this capability also. The write offset set by the tested driver was the maximum hard header length.
Network processing. Network processing (that is, the bottom end under the transport protocol) for both implementations is effectively the same, with only minor differences. In the STREAMS UDP implementation, no sock structure exists, so issuing socket buffers to the network layer is performed in a slightly more direct fashion.
Loop-back processing is identical, as this is performed by the Linux NET4 IP layer in both cases.
For Linux Sockets UDP, when the IP layer frees or orphans the socket buffer, the amount of data associated with the socket buffer is subtracted from the current send buffer fill. If the current buffer fill is less than 1/2 of the maximum, all processes blocked on write or blocked on poll are woken.
For STREAMS UDP, when the IP layer frees or orphans the socket buffer, the amount of data associated with the socket buffer is subtracted from the current send buffer fill. If the current send buffer fill is less than the send buffer low water mark (SO_SNDLOWAT or XTI_SNDLOWAT), and the write queue is blocked on flow control, the write queue is enabled.
One disadvantage expected to slow STREAMS UDP performance is the fact that on the sending side, a STREAMS buffer is allocated along with an skbuff and the skbuff is passed to Linux NET4 IP and the loop-back device. For Linux Sockets UDP, the same skbuff is reused on both sides of the interface. For STREAMS UDP, there is (currently) no mechanism for passing through the original STREAMS message block, and a new message block must be allocated. This results in two message block allocations per skbuff.
Receive processing. Under Linux Sockets UDP, when a socket buffer is received from the network layer, a check is performed whether the associated socket is locked by a user process or not. If the associated socket is locked, the socket buffer is placed on a backlog queue awaiting later processing by the user process when it goes to release the lock. A maximum number of socket buffers are permitted to be queued against the backlog queue per socket (approx. 300).
If the socket is not locked, or if the user process is processing a backlog before releasing the lock, the message is processed: the receive socket buffer is checked and if the received message would cause the buffer to exceed its maximum size, the message is discarded and the socket buffer freed. If the received message fits into the buffer, its size is added to the current receive buffer fill and the message is queued on the socket receive queue. If a process is sleeping on read or in poll, an immediate wakeup is generated.
In the STREAMS UDP implementation on the receive side, again there is no sock structure, so the socket locking and backlog techniques performed by UDP at the lower layer do not apply. When the STREAMS UDP implementation receives a socket buffer from the network layer, it tests the receive side of the Stream for flow control and, when not flow controlled, allocates a STREAMS buffer using esballoc(9) and passes the buffer directly to the upstream queue using putnext(9). When flow control is in effect and the read queue of the driver is not full, a STREAMS message block is still allocated and placed on the driver read queue. When the driver read queue is full, the received socket buffer is freed and the contents discarded. While different in mechanism from the socket buffer and backlog approach taken by Linux Sockets UDP, this bottom end receive mechanism is similar in both complexity and behaviour.
Buffering. For Linux Sockets, when a send side socket buffer is allocated, the true size of the socket buffer is added to the current send buffer fill. After the socket buffer has been passed to the IP layer, and the IP layer consumes (frees or orphans) the socket buffer, the true size of the socket buffer is subtracted from the current send buffer fill. When the resulting fill is less than 1/2 the send buffer maximum, sending processes blocked on send or poll are woken up. When a send will not fit within the maximum send buffer size considering the size of the transmission and the current send buffer fill, the calling process blocks or is returned an error (ENOBUFS). Processes that are blocked or subsequently block on poll(2) will not be woken up until the send buffer fill drops beneath 1/2 of the maximum; however, any process that subsequently attempts to send and has data that will fit in the buffer will be permitted to proceed.
STREAMS networking, on the other hand, performs queueing, flow control and scheduling on both the sender and the receiver. Sent messages are queued before delivery to the IP subsystem. Received messages from the IP subsystem are queued before delivery to the receiver. Both sides implement full hysteresis high and low water marks. Queues are deemed full when they reach the high water mark and do not enable feeding processes or subsystems until the queue subsides to the low water mark.
Scheduling. Linux Sockets schedule by waking a receiving process whenever data is available in the receive buffer to be read, and waking a sending process whenever there is one-half of the send buffer available to be written. While accomplishing buffering on the receive side, full hysteresis flow control is only performed on the sending side. Due to the way that Linux handles the loop-back interface, the full hysteresis flow control on the sending side is defeated.
STREAMS networking, on the other hand, uses the queueing, flow control and scheduling mechanism of STREAMS. When messages are delivered from the IP layer to the receiving stream head and a receiving process is sleeping, the service procedure for the reading stream head's read queue is scheduled for later execution. When the STREAMS scheduler later runs, the receiving process is awoken. When messages are sent on the sending side they are queued in the driver's write queue and the service procedure for the driver's write queue is scheduled for later execution. When the STREAMS scheduler later runs, the messages are delivered to the IP layer. When sending processes are blocked on a full driver write queue, and the count drops to the low water mark defined for the queue, the service procedure of the sending stream head is scheduled for later execution. When the STREAMS scheduler later runs, the sending process is awoken.
Linux Fast-STREAMS is designed to run tasks queued to the STREAMS scheduler on the same processor as the process or task that queued them. This avoids unnecessary context switches.
The STREAMS networking approach results in fewer blocking and wakeup events being generated on both the sending and receiving sides. Because there are fewer blocking and wakeup events, there are fewer context switches. The receiving process is permitted to consume more messages before the sending process is awoken; and the sending process is permitted to generate more messages before the reading process is awoken.
Result. The result of the differences between the Linux NET and STREAMS approaches is that better flow control is exerted on the sending side because of intermediate queueing toward the IP layer. This intermediate queueing on the sending side, not present in BSD-style networking, is in fact responsible for reducing the number of blocking and wakeup events on the sender, and permits the sender, when running, to send more messages in a quantum.
On the receiving side, the STREAMS queueing, flow control and scheduling mechanisms are similar to the BSD-style software interrupt approach. However, Linux does not use software interrupts on loop-back (messages are passed directly to the socket with possible backlogging due to locking). The STREAMS approach is more sophisticated, as it performs backlogging, queueing and flow control simultaneously on the read side (at the stream head).
6.2 Caveats
The following limitations in the test results and analysis must be considered:
6.2.1 Loop-back Interface
Tests compare performance on the loop-back interface only. Several characteristics of the loop-back interface make it somewhat different from regular network interfaces:
1. Loop-back interfaces do not require checksums.
2. Loop-back interfaces have a null hard header.
This means that there is less difference between putting each data chunk in a single packet versus putting multiple data chunks in a packet.
3. Loop-back interfaces have negligible queueing and emission times, making RTT times negligible.
4. Loop-back interfaces do not normally drop packets.
5. Loop-back interfaces preserve the socket buffer from the sending to the receiving interface.
This also provides an advantage to Sockets TCP. Because STREAMS SCTP cannot pass a message block along with the socket buffer (socket buffers are orphaned before passing to the loop-back interface), a message block must also be allocated on the receiving side.
7 Conclusions
These experiments have shown that the Linux Fast-STREAMS implementation of STREAMS UDP, as well as STREAMS UDP using XTIoS networking, outperforms the Linux Sockets UDP implementation by a significant amount (up to 40% improvement).
The Linux Fast-STREAMS implementation of STREAMS UDP networking is superior by a significant factor across all systems and kernels tested.
All of the conventional wisdom with regard to STREAMS and STREAMS networking is undermined by these test results for Linux Fast-STREAMS.
• STREAMS is fast.
Contrary to the preconception that STREAMS must be slower because it is more general purpose, the reverse has been shown to be true in these experiments for Linux Fast-STREAMS. The STREAMS flow control and scheduling mechanisms adapt well and improve both code and data cache efficiency as well as scheduler efficiency.
• STREAMS is more flexible and more efficient.
Contrary to the preconception that STREAMS trades flexibility or general-purpose architecture for efficiency (that is, that STREAMS is somehow less efficient because it is more flexible and general purpose), this has in fact been shown to be untrue. Linux Fast-STREAMS is both more flexible and more efficient. Indeed, the performance gains achieved by STREAMS appear to derive from its more sophisticated queueing, scheduling and flow control model.
• STREAMS exploits parallelism on SMP better than other approaches.
Contrary to the preconception that STREAMS must be slower due to complex locking and synchronization mechanisms, Linux Fast-STREAMS performed better on SMP (hyperthreaded) machines than on UP machines, and outperformed Linux Sockets UDP by an even more significant factor (about 40% improvement at most message sizes). Indeed, STREAMS appears to be able to exploit inherent parallelism that Linux Sockets is unable to exploit.
• STREAMS networking is fast.
Contrary to the preconception that STREAMS networking must be slower because STREAMS is more general purpose and has a rich set of features, the reverse has been shown in these experiments for Linux Fast-STREAMS. By utilizing STREAMS queueing, flow control and scheduling, STREAMS UDP indeed performs better than Linux Sockets UDP.
• STREAMS networking is neither unnecessarily complex nor cumbersome.
Contrary to the preconception that STREAMS networking must be poorer because it uses a complex yet general-purpose framework, these experiments for Linux Fast-STREAMS have shown this to be untrue. Also, the fact that STREAMS and Linux conform to the same standard (POSIX) means that they are no more cumbersome from a programming perspective. Indeed, a POSIX-conforming application will not know the difference between the implementations (with the exception that superior performance will be experienced on STREAMS networking).
8 Future Work
Local Transport Loop-back
UNIX domain sockets are the advocated primary interprocess communications mechanism in the 4.4BSD system: 4.4BSD even implements pipes using UNIX domain sockets [MBKQ97]. Linux also implements UNIX domain sockets, but uses the 4.1BSD/SVR3 legacy approach to pipes. XTI has an equivalent to the UNIX domain socket. This consists of connectionless, connection-oriented, and connection-oriented with orderly release loop-back transport providers. The netperf program has the ability to test UNIX domain sockets, but does not currently have the ability to test the XTI equivalents.
BSD claims that in 4.4BSD pipes were implemented using sockets (UNIX domain sockets) instead of using the file system as they were in 4.1BSD [MBKQ97]. One of the reasons cited for implementing pipes on UNIX domain sockets using the networking subsystem was performance. Another paper released by the OpenSS7 Project [SS7] shows that experimental results on Linux file-system-based pipes (using the SVR3 or 4.1BSD approaches) perform poorly when compared to STREAMS-based pipes. Because Linux uses a similar approach to file-system-based pipes in its implementation of UNIX domain sockets, it can be expected that UNIX domain sockets under Linux will also perform poorly when compared to loop-back transport providers under STREAMS.
Sockets interface to STREAMS
There are several mechanisms for providing BSD/POSIX Sockets interfaces to STREAMS networking [VS90] [Mar01]. The experiments in this report indicate that it could be worthwhile to complete one of these implementations for Linux Fast-STREAMS [Soc] and test whether STREAMS networking using the Sockets interface is also superior to Linux Sockets, just as it has been shown to be with the XTI/TPI interface.
9 Related Work
A separate paper comparing the STREAMS-based pipe implementation of Linux Fast-STREAMS to the legacy 4.1BSD/SVR3-style Linux pipe implementation has also been prepared. That paper also shows significant performance improvements for STREAMS attributable to similar causes.
A separate paper comparing a STREAMS-based SCTP implementation of Linux Fast-STREAMS to the Linux NET4 Sockets approach has also been prepared. That paper also shows significant performance improvements for STREAMS attributable to similar causes.
References
[GC94] Berny Goodheart and James Cox. The magic garden explained: the internals of UNIX System V Release 4, an open systems design / Berny Goodheart & James Cox. Prentice Hall, Australia, 1994. ISBN 0-13-098138-9.
[Jon07] Rick Jones. Network performance with netperf – An OpenSS7 Modified Version. http://www.openss7.org/download.html, 2007.
[LfS] Linux Fast-STREAMS – A High-Performance SVR 4.2 MP STREAMS Implementation for Linux. http://www.openss7.org/download.html.
[LiS] Linux STREAMS (LiS). http://www.openss7.org/download.html.
[LML] Linux Kernel Mailing List – Frequently Asked Questions. http://www.kernel.org/pub/linux/docs/lkml/#s9-9.
[Mar01] Jim Mario. Solaris sockets, past and present. Unix Insider, September 2001.
[MBKQ97] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and John S. Quarterman. The design and implementation of the 4.4BSD operating system. Addison-Wesley, third edition, November 1997. ISBN 0-201-54979-4.
[OG] The Open Group. http://www.opengroup.org/.
[RBD97] Vincent Roca, Torsten Braun, and Christophe Diot. Demultiplexed architectures: A solution for efficient STREAMS-based communications stacks. IEEE Network, July/August 1997.
[Rit84] Dennis M. Ritchie. A Stream Input-Output System. AT&T Bell Laboratories Technical Journal, 63(8):1897–1910, October 1984. Part 2.
[Soc] Sockets for Linux Fast-STREAMS. http://www.openss7.org/download.html.
[SS7] The OpenSS7 Project. http://www.openss7.org/.
[SUS95] Single UNIX Specification, Version 1. Open Group Publication, The Open Group, 1995. http://www.opengroup.org/onlinepubs/.
[SUS98] Single UNIX Specification, Version 2. Open Group Publication, The Open Group, 1998. http://www.opengroup.org/onlinepubs/.
[SUS03] Single UNIX Specification, Version 3. Open Group Publication, The Open Group, 2003. http://www.opengroup.org/onlinepubs/.
[TLI92] Transport Provider Interface Specification, Revision 1.5. Technical Specification, UNIX International, Inc., Parsippany, New Jersey, December 10, 1992. http://www.openss7.org/docs/tpi.pdf.
[TPI99] Transport Provider Interface (TPI) Specification, Revision 2.0.0, Draft 2. Technical Specification, The Open Group, Parsippany, New Jersey, 1999. http://www.opengroup.org/onlinepubs/.
[VS90] Ian Vessey and Glen Skinner. Implementing Berkeley Sockets in System V Release 4. In Proceedings of the Winter 1990 USENIX Conference. USENIX, 1990.
[XNS99] Network Services (XNS), Issue 5.2, Draft 2.0. Open Group Publication, The Open Group, 1999. http://www.opengroup.org/onlinepubs/.