
Improving Asynchronous Invocation Performance in Client-Server Systems

Shungeng Zhang†, Qingyang Wang†, Yasuhiko Kanemasa‡

†Computer Science and Engineering, Louisiana State University
‡Software Laboratory, FUJITSU LABORATORIES LTD.

Abstract: In this paper we conduct an experimental study of asynchronous invocation on the performance of client-server systems. Through extensive measurements of both realistic macro- and micro-benchmarks, we show that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, for two non-trivial reasons. First, the traditional wisdom of one-event-one-handler event processing flow can create large amounts of intermediate context switches that significantly degrade the performance of an asynchronous server. Second, some runtime workload characteristics (e.g., response size) and network conditions (e.g., network latency) may cause a significant negative performance impact on asynchronous event-driven servers but not on thread-based ones. We provide a hybrid solution that takes advantage of different asynchronous architectures to adapt to varying workload and network conditions. Our hybrid solution searches for the most efficient execution path for each client request based on runtime request profiling and type checking. Our experimental results show that the hybrid solution outperforms all the other types of servers by up to 19%~90% in throughput, depending on the specific workload and network conditions.

Index Terms: asynchronous, event-driven, threads, client-server applications, performance

I. INTRODUCTION

The asynchronous event-driven architecture for high performance internet servers has been studied extensively before [33], [37], [42], [32]. Many people advocate the asynchronous event-driven architecture as a better choice than the thread-based RPC version to handle high workload concurrency because of reduced multi-threading overhead [42]. Though conceptually simple, taking advantage of the asynchronous event-driven architecture to construct high performance internet servers is a non-trivial task. For example, asynchronous servers are well-known to be difficult to program and debug due to the obscured control flow [16], compared to the thread-based version.

In this paper we show that building high performance internet servers using the asynchronous event-driven architecture requires careful design of the event processing flow and the ability to adapt to runtime-varying workload and network conditions. Concretely, we conduct extensive benchmark experiments to study some non-trivial design deficiencies of asynchronous event-driven server architectures that lead to inferior performance compared to the thread-based version when facing high concurrency workload. For example, the traditional wisdom of one-event-one-handler event processing flow may generate a large amount of unnecessary intermediate events and context switches that significantly degrade the performance of an asynchronous server. We also observed that some runtime workload and network conditions may cause frequent unnecessary I/O system calls (due to the non-blocking nature of asynchronous function calls) for asynchronous event-driven servers but not for the thread-based ones.

Fig. 1. Upgrading Tomcat from a thread-based version (V7) to an asynchronous version (V8) in a 3-tier system leads to significant performance degradation. (Plot of throughput [req/s] and response time [s] vs. workload [# of users] for SYStomcatV7 and SYStomcatV8.)

The first contribution is an experimental illustration that simply upgrading a thread-based component server to its asynchronous version in an n-tier web system can lead to significant performance degradation of the whole system at high utilization levels. For instance, we observed that the maximum achievable throughput of a 3-tier system decreases by 23% after we upgrade the Tomcat application server from the traditional thread-based version (Version 7) to the latest asynchronous version (Version 8) (see Figure 1) in a standard n-tier application benchmark (RUBBoS [11]). Our analysis reveals that the unexpected performance degradation of the asynchronous Tomcat results from its poor design of event processing flow, which causes significantly higher CPU context switch overhead than the thread-based version when the server approaches saturation. Our further study shows that such poor design of event processing flow also exists in other popular asynchronous servers/middleware such as Jetty [3], GlassFish [9], and the MongoDB Java Asynchronous Driver [6].

The second contribution is a detailed analysis of various workload and network conditions that impact the performance of asynchronous invocation. Concretely, we have observed that a moderate-sized response message (e.g., 100KB) for an asynchronous server can cause a non-trivial write-spin problem, where the server makes frequent unnecessary I/O system calls resulting from the default small TCP send buffer size and the TCP wait-ACK mechanism, wasting about 12%~24% of server CPU resources. We also observed that some network conditions such as latency can exacerbate the write-spin problem of asynchronous event-driven servers even further.

The third contribution is a hybrid solution that takes advantage of different asynchronous event-driven architectures to adapt to various runtime workload and network conditions. We studied a popular asynchronous network I/O library named Netty [7], which can mitigate the write-spin problem through some write operation optimizations. However, such optimization techniques in Netty also bring non-trivial overhead when the write-spin problem does not occur during the asynchronous invocation period. Our hybrid solution extends Netty by monitoring the occurrence of write-spin and oscillating between alternative asynchronous invocation mechanisms to avoid the unnecessary optimization overhead.

In general, given the strong economic interest in achieving high resource efficiency and high quality of service simultaneously in cloud data centers, our results suggest that asynchronous invocation has a potentially significant performance advantage over RPC synchronous invocation, but still needs careful tuning according to various non-trivial runtime workload and network conditions. Our work also points to significant future research opportunities, since the asynchronous architecture has been widely adopted by many distributed systems (e.g., pub/sub systems [23], AJAX [26], and ZooKeeper [31]).

The rest of the paper is organized as follows. Section II shows a case study of performance degradation after a software upgrade from a thread-based version to an asynchronous version. Section III describes the context switch problem of servers with the asynchronous architecture. Section IV explains the write-spin problem of the asynchronous architecture when it handles requests with large responses. Section V evaluates two practical solutions. Section VI summarizes related work and Section VII concludes the paper.

II. BACKGROUND AND MOTIVATION

A. RPC vs. Asynchronous Network I/O

Modern internet servers usually adopt a few connectors to communicate with other component servers or end users. The main activities of a connector include managing upstream and downstream network connections, reading/writing data through the established connections, and parsing and routing the incoming requests to the application (business logic) layer. Though similar in functionality, synchronous and asynchronous connectors have very different mechanisms to interact with the application layer logic.

Synchronous connectors are mainly adopted by RPC thread-based servers. Once it accepts a new connection, the main thread dispatches the connection to a dedicated worker thread until the connection closes. In this case, each connection consumes one worker thread, and the operating system transparently switches among worker threads for concurrent request processing. Although relatively easy to program due to the user-perceived sequential execution flow, synchronous connectors bring the well-known multi-threading overhead (e.g., context switches, scheduling, and lock contention).
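For illustration, a minimal thread-per-connection connector in Java might look like the sketch below (not Tomcat's actual connector implementation; buildResponse() is a hypothetical stand-in for the application logic):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SyncConnector {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(200); // worker thread pool
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket conn = server.accept();        // main thread accepts
                workers.submit(() -> handle(conn));   // one dedicated worker per connection
            }
        }
    }

    static void handle(Socket conn) {
        try (Socket c = conn) {
            // Blocking read of the request and blocking write of the response;
            // the OS transparently switches among worker threads.
            byte[] buf = new byte[8192];
            int n = c.getInputStream().read(buf);
            if (n > 0) {
                c.getOutputStream().write(buildResponse(buf, n));
            }
        } catch (IOException ignored) { }
    }

    // Hypothetical placeholder for the application (business logic) layer.
    static byte[] buildResponse(byte[] req, int len) { return req; }
}
```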

Asynchronous connectors accept new connections and manage all established connections through an event-driven mechanism using only one or a few threads. Given a pool of established connections in a server, an asynchronous connector handles requests received from these connections by repeatedly looping over two phases. The first phase (event monitoring phase) determines which connections have pending events of interest; these events typically indicate that a particular connection (i.e., socket) is readable or writable. The asynchronous connector pulls the connections with pending events by taking advantage of an event notification mechanism such as select, poll, or epoll supported by the underlying operating system. The second phase (event handling phase) iterates over each of the connections that have pending events. Based on the context information of each event, the connector dispatches the event to an appropriate event handler, which performs the actual business logic computation. More details can be found in previous asynchronous server research [42], [32], [38], [37], [28].
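A minimal sketch of this two-phase loop using the Java NIO Selector API (illustrative only; a real connector would keep per-connection buffers and handle partial reads and writes):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class EventLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                        // Phase 1: event monitoring (epoll/poll/select)
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {             // new connection
                    SocketChannel conn = server.accept();
                    conn.configureBlocking(false);
                    conn.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {        // Phase 2: event handling
                    SocketChannel conn = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(8192);
                    if (conn.read(buf) < 0) { conn.close(); continue; }
                    // ... parse the request, run business logic, then send the response
                    buf.flip();
                    conn.write(buf);                  // non-blocking write
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```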

In practice, there are two general designs of asynchronous servers using asynchronous connectors. The first one is the single-threaded asynchronous server, which only uses one thread to loop over the aforementioned two phases, for example Node.js [8] and Lighttpd [5]. Such a design is especially beneficial for in-memory workloads, because context switches are minimal and the single thread will not be blocked by disk I/O activities [35]. Multiple single-threaded servers (also called the N-copy approach [43], [28]) can be launched together to fully utilize multiple processors. The second design uses a worker thread pool in the second phase to concurrently process connections that have pending events. Such a design is supposed to efficiently utilize CPU resources in case of transient disk I/O blocking or a multi-core environment [38]. Several variants of the second design have been proposed before, mostly known as the staged design adopted by SEDA [42] and WatPipe [38]. Instead of having only one worker thread pool, the staged design decomposes the request processing into a pipeline of stages separated by event queues, each of which has its own worker thread pool, with the aim of modular design and fine-grained management of worker threads.

In general, asynchronous event-driven servers are believed to be able to achieve higher throughput than the thread-based version because of reduced multi-threading overhead, especially when the server is handling high concurrency, CPU-intensive workload. However, our experimental results in the following section will show the opposite.

B. Performance Degradation after Tomcat Upgrade

System software upgrades are a common practice for internet services due to the fast evolution of software components. In this section, we show that simply upgrading a thread-based component server to its asynchronous version in an n-tier system may cause unexpected performance degradation at high resource utilization. We show one such case with RUBBoS [11], a representative web-facing n-tier system benchmark modeled after the popular news website Slashdot [15]. Our experiments adopt a typical 3-tier configuration with one Apache web server, one Tomcat application server, and one MySQL database server (details in Appendix A). At the beginning we use Tomcat 7 (noted as TomcatSync), which uses a thread-based synchronous connector for inter-tier communication. We then upgrade the Tomcat server to Version 8 (the latest version at the time, noted as TomcatAsync), which by default uses an asynchronous connector, with the expectation of system performance improvement after the Tomcat upgrade.

Fig. 2. Throughput comparison between TomcatSync and TomcatAsync under different workload concurrencies and response sizes: (a) the 0.1KB response size case, (b) the 10KB response size case, (c) the 100KB response size case (throughput [req/s] vs. workload concurrency [# of connections], with the crossover point marked). As the response size increases from 0.1KB in (a) to 100KB in (c), TomcatSync outperforms TomcatAsync over a wider concurrency range, indicating the performance degradation of TomcatAsync with large response sizes.

Unfortunately, we observed a surprising system performance degradation after the Tomcat server upgrade, as shown in Figure 1. We call the system with TomcatSync SYStomcatV7 and the one with TomcatAsync SYStomcatV8. The figure shows that SYStomcatV7 saturates at workload 11,000 while SYStomcatV8 saturates at workload 9,000. At workload 11,000, SYStomcatV7 outperforms SYStomcatV8 by 28% in throughput, and its average response time is one order of magnitude lower (226ms vs. 2820ms). Such a result is counter-intuitive, since we upgrade Tomcat from an older thread-based version to a newer asynchronous one. We note that in both cases the Tomcat server CPU is the bottleneck resource in the system; all the hardware resources (e.g., CPU and memory) of the other component servers are far from saturation (< 60%).

We use Collectl [2] to collect system-level metrics (e.g., CPU and context switches). Another interesting phenomenon we observed is that TomcatAsync encounters a significantly higher number of context switches than TomcatSync at the same workload. For example, at workload 10,000, TomcatAsync encounters 12,950 context switches per second while TomcatSync encounters only 5,930, less than half of the former. It is reasonable to suggest that the high context switch rate in TomcatAsync causes high CPU overhead, leading to inferior throughput compared to TomcatSync. However, the traditional wisdom tells us that a server with an asynchronous architecture should have fewer context switches than a thread-based server. So why did we observe the opposite here? We will discuss the cause in the next section.

Fig. 3. Illustration of the event processing flow when TomcatAsync processes one request: the reactor thread dispatches the read event to worker A, which reads and processes the request and generates a write event; the reactor thread then dispatches the write event to worker B, which sends the response and returns control back. In total there are four context switches between the reactor thread and the worker threads.

III. INEFFICIENT EVENT PROCESSING FLOW IN ASYNCHRONOUS SERVERS

In this section we explain why the performance of the 3-tier benchmark system degrades after we upgrade Tomcat from the thread-based version TomcatSync to the asynchronous version TomcatAsync. To simplify and quantify our analysis, we design micro-benchmarks to test the performance of both versions of Tomcat.

We use JMeter [1] to generate HTTP requests that access a standalone Tomcat directly. These HTTP requests are categorized into three types, small, medium, and large, for which the Tomcat server (either TomcatSync or TomcatAsync) first conducts some simple computation before responding with 0.1KB, 10KB, and 100KB of in-memory data, respectively. We choose these three sizes because they are representative response sizes in our RUBBoS benchmark application. JMeter uses one thread to simulate each end-user. We set the think time between consecutive requests sent from the same thread to zero; thus we can precisely control the concurrency of the workload to the target Tomcat server by specifying the number of threads in JMeter.

We compare the server throughput of TomcatSync and TomcatAsync under different workload concurrencies and response sizes, as shown in Figure 2. The three subfigures show that, as workload concurrency increases from 1 to 3200, TomcatAsync achieves lower throughput than TomcatSync up to a certain workload concurrency. For example, TomcatAsync performs worse than TomcatSync before workload concurrency 64 when the response size is 10KB, and the crossover workload concurrency is even higher (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions the average response size of Tomcat per request is about 20KB and the workload concurrency for Tomcat is about 35 when the system saturates. So, based on our micro-benchmark results in Figure 2, it is not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also leads to the performance degradation of the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8 (unit: ×1000/sec).

Response size    TomcatAsync    TomcatSync
0.1KB            40             16
10KB             25             7
100KB            28             2

We found that the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates significant amounts of intermediate context switches, causing non-trivial CPU overhead. Table I compares the context switches of TomcatAsync and TomcatSync at workload concurrency 8. This table shows results consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encounters significantly more context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switch rate of TomcatAsync is caused by its poor design of event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

Thus, to handle one client request, there are four context switches in total among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow design also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. On the other hand, in TomcatSync each client request is handled by a dedicated worker thread, from the initial reading of the request to preparing the response to sending the response out. No context switch occurs during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementations of TomcatAsync and TomcatSync by removing all the unrelated modules (e.g., servlet life cycle management, cache management, and logging) and only keeping the essential code related to request processing; we refer to these as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and the write event from the same request by using the same worker thread. In this case, once a worker thread finishes preparing the response, it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist); thus processing one client request only requires two context switches, from the reactor thread to a worker thread and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server, in which the single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches for each server type when it processes one client request.1 Interested readers can check out our server implementations on GitHub [13] for further reference.

TABLE II: Context switches among user-space threads when the server processes one client request.

Server type          Context switches   Note
sTomcat-Async        4                  Read and write events are handled by different worker threads (Figure 3).
sTomcat-Async-Fix    2                  Read and write events are handled by the same worker thread.
sTomcat-Sync         0                  Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration.
SingleT-Async        0                  No context switches; one thread handles both event monitoring and processing.

We compare the throughput and context switches among the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figures 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during runtime. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput while incurring 34% fewer context switches. In our experiments the CPU demand for each request is positively correlated with the response size; a small response size means a small CPU computation demand, so the portion of CPU cycles wasted in context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async reflects their throughput difference. This hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). Such performance differences are also due to fewer context switches, as shown in Figure 4(d); for example, the context switch rate of SingleT-Async is a few hundred per second, three orders of magnitude lower than that of sTomcat-Async.

1 In order to simplify analysis and reasoning, we do not count the context switches caused by interrupts or swapping by the operating system.

Fig. 4. Throughput and context switch comparison among different server architectures as the server response size increases from 0.1KB to 100KB: (a) throughput with 0.1KB responses, (b) throughput with 10KB responses, (c) throughput with 100KB responses, (d) context switches with 0.1KB responses, (e) context switches with 10KB responses, (f) context switches with 100KB responses (throughput [req/s] and context switches [/s] vs. workload concurrency [# of connections] for SingleT-Async, sTomcat-Sync, sTomcat-Async-Fix, and sTomcat-Async). (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the server response size is small (0.1KB). However, as the response size increases to 100KB, (c) shows that sTomcat-Sync outperforms the asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case in Figures 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively. The throughput difference among the four server architectures becomes narrower, indicating less performance impact from context switches.

In fact, an interesting phenomenon is observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). This observation suggests that there are other factors causing overhead in the asynchronous SingleT-Async but not in the thread-based sTomcat-Sync when the server response size is large, which we discuss in the next section.

IV. WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation problem of an asynchronous server sending large responses. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that it is the default small TCP send buffer size and the TCP wait-ACK mechanism that lead to a severe write-spin problem when sending a relatively large response, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that can exacerbate the negative impact of the write-spin problem, further degrading the performance of an asynchronous server.

A. Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB) the throughput of the asynchronous SingleT-Async is 20% higher than that of the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, SingleT-Async's throughput is surprisingly 31% lower than sTomcat-Sync's under the same workload concurrency 8 (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size for both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases by 25% (from 55% to 80%), while that of SingleT-Async increases by 34% (from 58% to 92%). This comparison suggests that increasing the response size has more impact on the user-space CPU utilization of the asynchronous SingleT-Async than on that of the thread-based sTomcat-Sync.

TABLE III: SingleT-Async consumes more user-space CPU compared to sTomcat-Sync. The workload concurrency is kept at 100.

                           sTomcat-Sync         SingleT-Async
Response size              0.1KB     100KB      0.1KB     100KB
Throughput [req/sec]       35000     590        42800     520
User CPU total [%]         55        80         58        92
System CPU total [%]       45        20         42        8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the total number of socket.write() calls in SingleT-Async with different response sizes during a one-minute experiment.

Resp. size   # req     # write()   socket.write() per req
0.1KB        238530    238530      1
10KB         9400      9400        1
100KB        2971      303795      102

We further use JProfiler [4] to profile the SingleT-Async case when the response size increases from 0.1KB to 100KB and see what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In the case of a thread-based server like sTomcat-Sync, socket.write() is called only once for each client request. While one write per request also holds for the 0.1KB and 10KB cases in SingleT-Async, it calls socket.write() 102 times per request on average in the 100KB case. System calls in general are expensive due to the related kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead of SingleT-Async shown in Table III.

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small TCP send buffer size (16KB by default) for each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP send buffer through the system call socket.write(), the first socket.write() can only copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is set by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data only after the server receives the ACKs of the previously sent-out packets. Since socket.write() is a non-blocking system call in SingleT-Async, each call returns how many bytes were written to the TCP send buffer, and it returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync tries to copy 100KB of data from user space to the kernel-space TCP send buffer, only one blocking socket.write() system call is invoked for each request; the worker thread waits until the kernel sends the 100KB response out, and the write-spin problem is avoided.
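The mechanism can be seen directly with non-blocking Java NIO. The sketch below (not the paper's SingleT-Async code) naively copies a response into the kernel send buffer and spins whenever write() returns zero:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class WriteSpinDemo {
    // Naive non-blocking send: keeps calling write() until the whole
    // response has been copied into the kernel TCP send buffer.
    static int sendResponse(SocketChannel conn, ByteBuffer response) throws IOException {
        int writeCalls = 0;
        while (response.hasRemaining()) {
            int written = conn.write(response);  // non-blocking: may copy only part of the data
            writeCalls++;
            if (written == 0) {
                // TCP send buffer is full; the sliding window cannot advance until
                // ACKs return from the client, so this loop spins here and keeps
                // issuing useless socket.write() system calls.
            }
        }
        // Table IV measured ~102 calls per request for a 100KB response
        // with the default 16KB send buffer.
        return writeCalls;
    }
}
```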

Fig. 5. Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, a worker thread write-spins on the system call socket.write() (the call unexpectedly returns zero while the buffer is full) and can only copy more data to the kernel once ACKs for the previously sent packets come back from the client.

An intuitive solution is to increase the TCP send buffer size to match the server response size, so as to avoid the write-spin problem. Our experimental results indeed show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance; for example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]; for example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results), all of which can be pushed back in answer to one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on a server that may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
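For reference, enlarging the per-connection send buffer uses a standard socket option; the sketch below assumes a 100KB target matching our workload, and the kernel may still round or cap the requested value:

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

public class SendBufferTuning {
    static void enlargeSendBuffer(SocketChannel conn) throws IOException {
        // Request a 100KB kernel send buffer for this connection so a 100KB
        // response fits in a single socket.write(). Every concurrent connection
        // now pins this much kernel memory, which is the over-provisioning cost
        // discussed above.
        conn.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
    }
}
```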

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on runtime network conditions. Once it is turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-Delay Product rule [17]; it lacks application-level information such as the response size. Therefore the auto-tuned send buffer can be large enough to maximize the throughput over the link but still inadequate for the application, which may still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the case with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger if there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

Fig. 6. The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled (throughput [req/s] vs. network latency for SingleT-Async-100KB and SingleT-Async-autotuning).

B. Network Latency Exacerbates the Write-Spin Problem

Network latency is common in cloud data centers, considering that the component servers of an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers; the resulting latency can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command tc (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

Fig. 7. Throughput degradation of two asynchronous servers in subfigure (a), resulting from the response time amplification in (b), as the network latency increases from ~0ms to ~20ms (throughput [req/sec] and response time [s] for SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync).

We found that the surprising throughput degradation results from response time amplification when the write-spin problem happens. This is because sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small network latency increase is amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b); for example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency is added. According to Little's Law, a server's throughput is negatively correlated with the response time of the server, given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18s to 3.60s) means a 95% decrease in server throughput for SingleT-Async, as shown in Figure 7(a).
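A back-of-the-envelope check with Little's Law (N = X * R, with the workload concurrency N fixed at 100) matches the measured numbers:

```latex
% Little's Law: N = X \cdot R, so X = N / R with N = 100
X_{\text{0ms}} = \frac{100}{0.18\,\text{s}} \approx 556\ \text{req/s},
\qquad
X_{\text{5ms}} = \frac{100}{3.60\,\text{s}} \approx 28\ \text{req/s}
```

which is roughly the 95% throughput reduction observed in Figure 7(a).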

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is supposed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but at the cost of non-trivial optimization overhead. Then we propose a hybrid solution that takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server, using a reactor thread to accept new connections and a worker thread pool to process the various I/O events from each connection.

Fig. 8. Netty mitigates the write-spin problem by runtime checking: the worker thread keeps calling socket.write() only while (1) the returned size is non-zero, (2) the writeSpin counter is below a threshold, and (3) the data copied so far is less than the data size; it jumps out of the loop if any of the three conditions is not met and moves on to the next event.

Fig. 9. Throughput comparison under various workload concurrencies and response sizes, with the default 16KB TCP send buffer: (a) response size 100KB, (b) response size 0.1KB (throughput [req/s] vs. workload concurrency [# of connections] for SingleT-Async, NettyServer, and sTomcat-Sync). Subfigure (a) shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem, while (b) shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

Though it uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes the roles of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase) and then dispatches each event to an available worker thread for proper event handling (event handling phase); such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case, the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event, with the output of each handler being the input of the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing the unnecessary context switches between the reactor thread and the worker threads.
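A minimal Netty 4 server skeleton reflecting this division of roles is sketched below (the EchoHandler is an illustrative stand-in for real business-logic handlers, not part of our NettyServer):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class NettyServerSketch {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // accepts new connections only
        EventLoopGroup workers = new NioEventLoopGroup(); // each worker thread monitors AND handles
                                                          // events for its assigned connections
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // A chain of handlers attached to the connection; the output of one
                     // handler feeds the next, avoiding intermediate dispatch events.
                     ch.pipeline().addLast(new EchoHandler());
                 }
             });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }

    // Stand-in handler: echoes whatever bytes arrive.
    static class EchoHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            ctx.writeAndFlush(msg); // Netty's write path applies its write-spin checks internally
        }
    }
}
```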

In order to mitigate the write-spin problem, Netty performs a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, the return_size is zero, indicating that the TCP send buffer is already full; second, the writeSpin counter exceeds a pre-defined threshold (the default value is 16 in Netty v4). Once it jumps out, the worker thread saves the context and resumes the current connection's data transfer after it loops over other connections with pending events. Such a write optimization avoids blocking the worker thread on a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
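The idea behind this bounded write loop can be sketched as follows (illustrative code, not Netty's internal implementation; the threshold of 16 mirrors Netty 4's default writeSpinCount):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class BoundedWriteLoop {
    static final int WRITE_SPIN_THRESHOLD = 16; // mirrors Netty 4's default writeSpinCount

    // Returns true if the whole response was copied to the kernel, false if the
    // connection should be parked and resumed after other pending events are served.
    static boolean writeBounded(SocketChannel conn, ByteBuffer response) throws IOException {
        int writeSpin = 0;
        while (response.hasRemaining() && writeSpin < WRITE_SPIN_THRESHOLD) {
            int returnSize = conn.write(response); // non-blocking write
            if (returnSize == 0) {
                break; // TCP send buffer full: stop spinning, retry on the next writable event
            }
            writeSpin++;
        }
        // The caller saves the remaining ByteBuffer as connection context and loops
        // over other connections with pending events before resuming this transfer.
        return !response.hasRemaining();
    }
}
```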

We validate the effectiveness of Netty in mitigating the write-spin problem, as well as the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. The figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% lower than that of SingleT-Async in the 0.1KB response case, indicating non-trivial overhead from the unnecessary write operation optimization when there is no write-spin problem. Therefore, neither NettyServer nor SingleT-Async achieves the best performance under all workload conditions.

B. A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, no single asynchronous solution always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

Fig. 10. Worker thread processing flow in the hybrid solution. In the event monitoring phase, select() returns the pool of connections with pending events; in the event handling phase, the worker thread takes each available connection, checks the request type, and chooses the execution path: parsing and encoding followed by the write operation optimization (the NettyServer path) for heavy requests, or parsing and encoding followed by a plain socket.write() (the SingleT-Async path) for light requests, before moving on to the next connection.

The first assumption excludes initializing the server with a large but fixed TCP send buffer size for each connection to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) we discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable, since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, Memcached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem at runtime. In an initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus, when HybridNetty receives a new incoming request, it first checks the map object to determine which category the request belongs to and then chooses the most efficient execution path. In practice, the response size even for the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object during runtime once a request is detected to have been classified into the wrong category, in order to keep track of the latest category of such requests.
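A sketch of this request-classification logic is shown below (illustrative class and method names, not the exact HybridNetty implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HybridDispatcher {
    // Maps a request type (e.g., a URI pattern) to whether its response write-spins.
    private final Map<String, Boolean> heavyRequestMap = new ConcurrentHashMap<>();

    // Profiled during the warm-up phase using the writeSpin counter.
    void recordProfile(String requestType, boolean writeSpinObserved) {
        heavyRequestMap.put(requestType, writeSpinObserved);
    }

    // Pick the execution path for a new request; default to the lightweight
    // SingleT-Async-style path when the request type has not been seen yet.
    boolean useWriteOptimizedPath(String requestType) {
        return heavyRequestMap.getOrDefault(requestType, Boolean.FALSE);
    }

    // Called after serving a request: if the observed behavior contradicts the
    // recorded category (e.g., the dataset changed), update the map at runtime.
    void reclassify(String requestType, boolean writeSpinObserved) {
        heavyRequestMap.put(requestType, writeSpinObserved);
    }
}
```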

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large responses (e.g., 100KB), and the light requests, which have small responses (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients in all cases is kept at 100, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we compare normalized throughput and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution in which light requests dominate the workload [22], our hybrid solution makes even more sense in dealing with realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance as an asynchronous event-driven one. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers under high concurrency workload by using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high performance and resource-efficient servers that meet the requirements of current cloud data centers.

The optimization of asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the interfaces of network I/O for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, epoll, or I/O operations under high concurrency workload. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

Fig. 11. The hybrid solution performs the best in different mixes of light/heavy request workload, with or without network latency: (a) no network latency between client and server, (b) ~5ms network latency between client and server (normalized throughput vs. ratio of large-size responses, from 0% to 100%, for SingleT-Async, NettyServer, and HybridNetty). The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare the normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µserver) and pipeline-based (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the use of blocking/non-blocking sendfile system calls. Brecht et al. [21] improve the performance of the event-driven µserver by modifying the strategy of accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's congestion window [24]; they show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections. Our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, resulting from an inferior event processing flow that creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers, the write-spin problem when handling large responses, and the associated exacerbating factors such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation by CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

APPENDIX A
RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Section II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

Fig. 12. Details of the RUBBoS experimental setup: (a) software and hardware setup, (b) a 1/1/1 sample topology (clients sending HTTP requests to Apache, Tomcat, and MySQL).



system calls resulting from the default small TCP send buffer size and the TCP wait-ACK mechanism, wasting about 12%~24% of the server CPU resource. We also observed that some network conditions, such as latency, can exacerbate the write-spin problem of asynchronous event-driven servers even further.

The third contribution is a hybrid solution that takes advantage of different asynchronous event-driven architectures to adapt to various runtime workload and network conditions. We studied a popular asynchronous network I/O library named "Netty" [7], which can mitigate the write-spin problem through some write operation optimizations. However, such optimization techniques in Netty also bring non-trivial overhead in the case when the write-spin problem does not occur during the asynchronous invocation period. Our hybrid solution extends Netty by monitoring the occurrence of write-spin and oscillating between alternative asynchronous invocation mechanisms to avoid the unnecessary optimization overhead.

In general, given the strong economic interest in achieving high resource efficiency and high quality of service simultaneously in cloud data centers, our results suggest that asynchronous invocation has a potentially significant performance advantage over RPC-style synchronous invocation, but it still needs careful tuning according to various non-trivial runtime workload and network conditions. Our work also points to significant future research opportunities, since the asynchronous architecture has been widely adopted by many distributed systems (e.g., pub/sub systems [23], AJAX [26], and ZooKeeper [31]).

The rest of the paper is organized as follows. Section II shows a case study of performance degradation after a software upgrade from a thread-based version to an asynchronous version. Section III describes the context switch problem of servers with the asynchronous architecture. Section IV explains the write-spin problem of the asynchronous architecture when it handles requests with large size responses. Section V evaluates two practical solutions. Section VI summarizes related work and Section VII concludes the paper.

    II BACKGROUND AND MOTIVATION

    A RPC vs Asynchronous Network IO

Modern internet servers usually adopt a few connectors to communicate with other component servers or end users. The main activities of a connector include managing upstream and downstream network connections, reading/writing data through the established connections, and parsing and routing the incoming requests to the application (business logic) layer. Though similar in functionality, synchronous and asynchronous connectors have very different mechanisms to interact with the application layer logic.

Synchronous connectors are mainly adopted by RPC thread-based servers. Once accepting a new connection, the main thread dispatches the connection to a dedicated worker thread until the close of the connection. In this case, each connection consumes one worker thread, and the operating system transparently switches among worker threads for concurrent request processing. Although relatively easy to program due to the user-perceived sequential execution flow, synchronous connectors bring the well-known multi-threading overhead (e.g., context switches, scheduling, and lock contention).
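For concreteness, the following is a minimal sketch of a thread-per-connection synchronous connector in Java. It is our own illustrative code (not the simplified Tomcat implementation from [13]); class names, the port, and the pool size are arbitrary choices.

import java.io.*;
import java.net.*;
import java.util.concurrent.*;

// Minimal thread-per-connection (synchronous) connector sketch.
public class SyncConnector {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(200); // worker thread pool
        try (ServerSocket listener = new ServerSocket(8080)) {
            while (true) {
                Socket conn = listener.accept();        // main thread accepts the connection
                workers.submit(() -> handle(conn));     // a dedicated worker handles it until close
            }
        }
    }

    static void handle(Socket conn) {
        try (Socket c = conn;
             BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
             OutputStream out = c.getOutputStream()) {
            String requestLine = in.readLine();         // blocking read
            out.write(buildResponse(requestLine));      // blocking write: one call per response
            out.flush();
        } catch (IOException ignored) { }
    }

    static byte[] buildResponse(String requestLine) {
        return "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok".getBytes();
    }
}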

Asynchronous connectors accept new connections and manage all established connections through an event-driven mechanism using only one or a few threads. Given a pool of established connections in a server, an asynchronous connector handles requests received from these connections by repeatedly looping over two phases. The first phase (event monitoring phase) determines which connections have pending events of interest. These events typically indicate that a particular connection (i.e., socket) is readable or writable. The asynchronous connector pulls the connections with pending events by taking advantage of an event notification mechanism such as select, poll, or epoll supported by the underlying operating system. The second phase (event handling phase) iterates over each of the connections that have pending events. Based on the context information of each event, the connector dispatches the event to an appropriate event handler performing the actual business logic computation. More details can be found in previous asynchronous server research [42], [32], [38], [37], [28].
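A minimal sketch of this two-phase loop using Java NIO (again our own illustrative code, not taken from Tomcat or Netty) might look as follows; the single thread alternates between event monitoring (select) and event handling:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;

// Minimal single-threaded event loop: one thread does both
// event monitoring (select) and event handling (read/write).
public class AsyncConnector {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                                  // phase 1: event monitoring
            for (SelectionKey key : selector.selectedKeys()) {  // phase 2: event handling
                if (key.isAcceptable()) {
                    SocketChannel conn = server.accept();
                    conn.configureBlocking(false);
                    conn.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel conn = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(4096);
                    if (conn.read(buf) < 0) { conn.close(); continue; }
                    // parse the request, then switch interest to writing the response
                    key.attach(ByteBuffer.wrap("HTTP/1.1 200 OK\r\n\r\n".getBytes()));
                    key.interestOps(SelectionKey.OP_WRITE);
                } else if (key.isWritable()) {
                    SocketChannel conn = (SocketChannel) key.channel();
                    ByteBuffer response = (ByteBuffer) key.attachment();
                    conn.write(response);                       // non-blocking write
                    if (!response.hasRemaining()) key.interestOps(SelectionKey.OP_READ);
                }
            }
            selector.selectedKeys().clear();
        }
    }
}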

In practice, there are two general designs of asynchronous servers using the asynchronous connectors. The first one is a single-threaded asynchronous server, which only uses one thread to loop over the aforementioned two phases, for example, Node.js [8] and Lighttpd [5]. Such a design is especially beneficial for in-memory workloads because context switches will be minimal while the single thread will not be blocked by disk I/O activities [35]. Multiple single-threaded servers (also called the N-copy approach [43], [28]) can be launched together to fully utilize multiple processors. The second design is to use a worker thread pool in the second phase to concurrently process connections that have pending events. Such a design is supposed to efficiently utilize CPU resources in case of transient disk I/O blocking or a multi-core environment [38]. Several variants of the second design have been proposed before, mostly known as the staged design adopted by SEDA [42] and WatPipe [38]. Instead of having only one worker thread pool, the staged design decomposes the request processing into a pipeline of stages separated by event queues, each of which has its own worker thread pool, with the aim of modular design and fine-grained management of worker threads.

In general, asynchronous event-driven servers are believed to be able to achieve higher throughput than the thread-based version because of reduced multi-threading overhead, especially when the server is handling a high concurrency, CPU-intensive workload. However, our experimental results in the following section will show the opposite.

    B Performance Degradation after Tomcat Upgrade

System software upgrade is a common practice for internet services due to the fast evolution of software components. In this section we show that simply upgrading a thread-based component server to its asynchronous version in an n-tier system may cause unexpected performance degradation at high resource utilization. We show one such case with RUBBoS [11], a representative web-facing n-tier system benchmark modeled after the popular news website Slashdot [15]. Our experiments adopt a typical 3-tier configuration with 1 Apache web server, 1 Tomcat application server, and 1 MySQL database server (details in Appendix A). At the beginning we use Tomcat 7 (noted as TomcatSync), which uses a thread-based synchronous connector for inter-tier communication. We then upgrade the Tomcat server to Version 8 (the latest version at the time, noted as TomcatAsync), which by default uses an asynchronous connector, with the expectation of system performance improvement after the Tomcat upgrade.

Fig. 2. Throughput comparison between TomcatSync and TomcatAsync under different workload concurrencies and response sizes: (a) the 0.1KB response size case, (b) the 10KB response size case, (c) the 100KB response size case. As the response size increases from 0.1KB in (a) to 100KB in (c), TomcatSync outperforms TomcatAsync over a wider concurrency range, indicating the performance degradation of TomcatAsync with large response sizes.

Unfortunately, we observed a surprising system performance degradation after the Tomcat server upgrade, as shown in Figure 1. We call the system with TomcatSync SYStomcatV7 and the one with TomcatAsync SYStomcatV8. This figure shows that SYStomcatV7 saturates at workload 11,000 while SYStomcatV8 saturates at workload 9,000. At workload 11,000, SYStomcatV7 outperforms SYStomcatV8 by 28% in throughput, and its average response time is one order of magnitude lower (226ms vs. 2820ms). Such a result is counter-intuitive since we upgraded Tomcat from an older thread-based version to a newer asynchronous one. We note that in both cases the Tomcat server CPU is the bottleneck resource in the system; all the hardware resources (e.g., CPU and memory) of the other component servers are far from saturation (< 60%).

We use Collectl [2] to collect system level metrics (e.g., CPU and context switches). Another interesting phenomenon we observed is that TomcatAsync encounters a significantly higher number of context switches than TomcatSync when the system is at the same workload. For example, at workload 10,000 TomcatAsync encounters 12,950 context switches per second while TomcatSync encounters only 5,930, less than half of the former. It is reasonable to suggest that the high context switch rate in TomcatAsync causes high CPU overhead, leading to inferior throughput compared to TomcatSync. However, the traditional wisdom tells us that a server with an asynchronous architecture should have fewer context switches than a thread-based server. So why did we observe the opposite here? We will discuss the cause in the next section.

Fig. 3. Illustration of the event processing flow when TomcatAsync processes one request: the reactor thread dispatches the read event to worker A, which reads and processes the request and generates a write event; the reactor thread then dispatches the write event to worker B, which sends the response and returns control back. In total there are four context switches between the reactor thread and worker threads.

III INEFFICIENT EVENT PROCESSING FLOW IN ASYNCHRONOUS SERVERS

In this section we explain why the performance of the 3-tier benchmark system degrades after we upgrade Tomcat from the thread-based version TomcatSync to the asynchronous version TomcatAsync. To simplify and quantify our analysis, we design micro-benchmarks to test the performance of both versions of Tomcat.

We use JMeter [1] to generate HTTP requests to access the standalone Tomcat directly. These HTTP requests are categorized into three types, small, medium, and large, with which the Tomcat server (either TomcatSync or TomcatAsync) first conducts some simple computation before responding with 0.1KB, 10KB, and 100KB of in-memory data, respectively. We choose these three sizes because they are representative response sizes in our RUBBoS benchmark application. JMeter uses one thread to simulate each end-user. We set the think time between consecutive requests sent from the same thread to zero; thus we can precisely control the concurrency of the workload to the target Tomcat server by specifying the number of threads in JMeter.

We compare the server throughput between TomcatSync and TomcatAsync under different workload concurrencies and response sizes, as shown in Figure 2. The three sub-figures show that as the workload concurrency increases from 1 to 3200, TomcatAsync achieves lower throughput than TomcatSync before a certain workload concurrency. For example, TomcatAsync performs worse than TomcatSync before workload concurrency 64 when the response size is 10KB, and the crossover workload concurrency is even higher (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions the average response size of Tomcat per request is about 20KB and the workload concurrency for Tomcat is about 35 when the system saturates. So based on our micro-benchmark results in Figure 2, it is not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also leads to the performance degradation of the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8 (context switches in thousands per second).

  Response size   TomcatAsync   TomcatSync
  0.1KB           40            16
  10KB            25            7
  100KB           28            2

As we found out, the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates significant amounts of intermediate context switches, causing non-trivial CPU overhead. Table I compares the context switches between TomcatAsync and TomcatSync at workload concurrency 8. This table shows results consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encountered significantly more context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switch rate of TomcatAsync is because of its poor design of the event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

So to handle one client request, there are in total 4 context switches among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow design also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. On the other hand, in TomcatSync each client request is handled by a dedicated worker thread, from the initial reading of the request, to preparing the response, to sending the response out. There is no context switch during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementation of TomcatAsync and TomcatSync by removing all the unrelated modules (e.g., servlet life cycle management, cache management, and logging) and only keeping the essential code related to request processing, which we refer to as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and the write event from the same request by using the same worker thread. In this case, once a worker thread finishes preparing the response, it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist); thus processing one client request only requires two context switches: from the reactor thread to a worker thread, and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server. The single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches for each server type when it processes one client request.1 Interested readers can check out our server implementations on GitHub [13] for further reference.

TABLE II: Context switches among user-space threads when the server processes one client request.

  Server type         Context switches   Note
  sTomcat-Async       4                  Read and write events are handled by different worker threads (Figure 3)
  sTomcat-Async-Fix   2                  Read and write events are handled by the same worker thread
  sTomcat-Sync        0                  Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration
  SingleT-Async       0                  No context switches; one thread handles both event monitoring and processing

We compare the throughput and context switches among the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figure 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during the runtime experiments. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput while its context switch rate is 34% lower. In our experiments the CPU demand of each request is positively correlated with the response size; a small response size means a small CPU computation demand, thus the portion of CPU cycles wasted on context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async reflects their throughput difference. This hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). This performance difference is also because of fewer context switches, as shown in Figure 4(d). For example, the context switch rate of SingleT-Async is a few hundred per second, three orders of magnitude lower than that of sTomcat-Async.

1 To simplify the analysis and reasoning, we do not count the context switches caused by interrupts or by swapping by the operating system.

Fig. 4. Throughput and context switch comparison among different server architectures as the server response size increases from 0.1KB to 100KB: (a) throughput with 0.1KB responses, (b) throughput with 10KB responses, (c) throughput with 100KB responses, (d) context switches with 0.1KB responses, (e) context switches with 10KB responses, (f) context switches with 100KB responses. (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the server response size is small (0.1KB). However, as the response size increases to 100KB, (c) shows that sTomcat-Sync outperforms the asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case shown in Figure 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively. The throughput difference becomes narrower among the four server architectures, indicating less performance impact from context switches.

In fact, one interesting phenomenon has been observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). Such an observation suggests that there are other factors causing overhead in the asynchronous SingleT-Async but not in the thread-based sTomcat-Sync when the server response size is large, which we will discuss in the next section.

IV WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation problem of an asynchronous server sending a large size response. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that the default small TCP send buffer size and the TCP wait-ACK mechanism lead to a severe write-spin problem when sending a relatively large size response, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that could exacerbate the negative impact of the write-spin problem, which further degrades the performance of an asynchronous server.

    A Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB), the throughput of the asynchronous SingleT-Async is 20% higher than the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, SingleT-Async throughput is surprisingly 31% lower than sTomcat-Sync under the same workload concurrency 8 (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different server response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size for both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases by 25% (from 55% to 80%) while that of SingleT-Async increases by 34% (from 58% to 92%). Such a comparison suggests that increasing the response size has more impact on the asynchronous SingleT-Async than on the thread-based sTomcat-Sync in user-space CPU utilization.

We further use JProfiler [4] to profile the SingleT-Async case when the response size increases from 0.1KB to 100KB and examine what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In the case of a thread-based server like sTomcat-Sync, socket.write() is called only once for each client request. While such one write per request also holds for the 0.1KB and 10KB cases in SingleT-Async, it calls socket.write() on average 102 times per request in the 100KB case. System calls in general are expensive due to the related kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead of SingleT-Async shown in Table III.

TABLE III: SingleT-Async consumes more user-space CPU compared to sTomcat-Sync. The workload concurrency is kept at 100.

  Server Type             sTomcat-Sync         SingleT-Async
  Response Size           0.1KB     100KB      0.1KB     100KB
  Throughput [req/sec]    35,000    590        42,800    520
  User CPU total [%]      55        80         58        92
  System CPU total [%]    45        20         42        8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the total number of socket.write() calls in SingleT-Async with different response sizes during a one-minute experiment.

  Resp. size   # req     # socket.write()   write() per req
  0.1KB        238,530   238,530            1
  10KB         9,400     9,400              1
  100KB        2,971     303,795            102

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small TCP send buffer size (16KB by default) for each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP send buffer through the system call socket.write(), the first socket.write() can only copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is set by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data to be copied in only after the server receives the ACKs of the previously sent-out packets. Since socket.write() is a non-blocking system call in SingleT-Async, every time it returns how many bytes were written to the TCP send buffer; the system call returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync tries to copy 100KB of data from user space to the kernel-space TCP send buffer, only one blocking socket.write() system call is invoked for each request; the worker thread waits until the kernel sends the 100KB response out, and the write-spin problem is avoided.
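The write-spin behavior can be illustrated with a simplified non-blocking write loop over a Java NIO channel. This is our own sketch of the effect, not the servers' actual code:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class WriteSpinDemo {
    // Naive non-blocking send: keeps calling write() until the whole
    // response is copied into the kernel TCP send buffer. When the
    // 16KB send buffer is full, write() returns 0 and the loop spins
    // until ACKs from the client free up buffer space.
    static void sendResponse(SocketChannel conn, ByteBuffer response) throws IOException {
        int writeCalls = 0;
        while (response.hasRemaining()) {
            int written = conn.write(response); // non-blocking, may return 0
            writeCalls++;
            // written == 0 means the send buffer is full; a real server
            // should register OP_WRITE and come back later instead of spinning.
        }
        System.out.println("socket.write() called " + writeCalls + " times");
    }
}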

Fig. 5. Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, a worker thread write-spins on the system call socket.write() (each call returns the number of bytes written, or zero when the buffer is full) and can only send more data after ACKs come back from the client for the previously sent packets.

An intuitive solution is to increase the TCP send buffer size to match the server response size in order to avoid the write-spin problem. Our experimental results indeed show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance. For example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]. For example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results); all this content can be pushed back by answering one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on a server that may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
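For reference, a per-connection send buffer can be enlarged explicitly in Java NIO; the sketch below is ours, and the 100KB value simply mirrors the response size used in our experiments rather than a general recommendation (the kernel may round or cap the requested value):

import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

public class SendBufferTuning {
    // Request a larger kernel send buffer for one connection so a 100KB
    // response fits in a single socket.write() call.
    static void enlargeSendBuffer(SocketChannel conn) throws IOException {
        conn.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
        int effective = conn.getOption(StandardSocketOptions.SO_SNDBUF);
        System.out.println("Effective SO_SNDBUF: " + effective + " bytes");
    }
}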

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on the runtime network conditions. Once turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-Delay Product rule [17]; it lacks sufficient application information such as the response size. Therefore the auto-tuned send buffer may be large enough to maximize throughput over the link but still inadequate for the application, which may still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the case with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger if there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

Fig. 6. The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled (SingleT-Async with a fixed 100KB send buffer vs. SingleT-Async with autotuning, under increasing network latency).

B Network Latency Exacerbates the Write-Spin Problem

Network latency is common in cloud data centers: the component servers of an n-tier application may run on VMs located in different physical nodes, across different racks, or even in different data centers, so the latency between two servers can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB; the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from the response time amplification that occurs when the write-spin problem happens. This is because sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size; each data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small network latency increase can be amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency is added. According to Little's Law, a server's throughput is negatively correlated with the response time of the server given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, the server response time increasing 20 times (from 0.18s to 3.60s) means a 95% decrease in server throughput for SingleT-Async, as shown in Figure 7(a).

Fig. 7. Throughput degradation of the two asynchronous servers in subfigure (a) resulting from the response time amplification in (b) as the network latency increases: (a) throughput comparison, (b) response time comparison.
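Concretely, with a closed workload this is a direct application of Little's Law; the response times and the concurrency below are the measured values quoted above, and the throughputs follow from the arithmetic:

% Little's Law: N = X * R, so X = N / R for a fixed concurrency N.
\begin{align*}
X_{\text{0ms}} &= \frac{N}{R_{\text{0ms}}} = \frac{100}{0.18\,\text{s}} \approx 556~\text{req/s} \\
X_{\text{5ms}} &= \frac{N}{R_{\text{5ms}}} = \frac{100}{3.60\,\text{s}} \approx 28~\text{req/s} \\
\frac{X_{\text{5ms}}}{X_{\text{0ms}}} &= \frac{0.18}{3.60} = 0.05 \quad\Rightarrow\quad \text{a 95\% throughput drop.}
\end{align*}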

    V SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is supposed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but with non-trivial optimization overhead. Then we propose a hybrid solution which takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

    A Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server, using a reactor thread to accept new connections and a worker thread pool to process the various I/O events from each connection.

Though using a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes the roles of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase); it then dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to each worker thread. In this case, the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event; the output of each handler is the input of the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing unnecessary context switches between the reactor thread and worker threads.

Fig. 8. Netty mitigates the write-spin problem by runtime checking: the worker thread keeps writing to the socket only while the returned size is non-zero, the writeSpin counter is below the threshold, and the bytes copied so far are less than the data size; the write spin jumps out of the loop if any of the three conditions is not met.

Fig. 9. Throughput comparison under various workload concurrencies and response sizes with the default 16KB TCP send buffer: (a) 100KB response size, (b) 0.1KB response size. Subfigure (a) shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem, while (b) shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

In order to mitigate the write-spin problem, Netty adopts write-spin checking when a worker thread calls socket.write() to copy a large size response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, the return_size is zero, indicating the TCP send buffer is already full; second, the counter writeSpin exceeds a pre-defined threshold (the default value is 16 in Netty v4). Once it jumps out, the worker thread saves the context and resumes the current connection's data transfer after it loops over the other connections with pending events. Such write optimization mitigates the blocking of the worker thread by a connection transferring a large size response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
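The bounded write loop can be sketched as follows. This is our own illustration of the idea; Netty's actual implementation differs in details such as buffer management and how the pending write is re-registered:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class BoundedWrite {
    static final int WRITE_SPIN_THRESHOLD = 16; // Netty v4 default is 16

    // Returns true if the whole response was flushed, false if the caller
    // should register OP_WRITE and resume this connection later.
    static boolean writeWithSpinLimit(SocketChannel conn, ByteBuffer response) throws IOException {
        for (int spin = 0; spin < WRITE_SPIN_THRESHOLD && response.hasRemaining(); spin++) {
            int written = conn.write(response);  // non-blocking write
            if (written == 0) {
                return false;                    // send buffer full: stop spinning
            }
        }
        return !response.hasRemaining();         // may also stop after too many spins
    }
}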

We validate the effectiveness of Netty in mitigating the write-spin problem, and also the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. This figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% lower than that of SingleT-Async in the 0.1KB response case, indicating the non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore neither NettyServer nor SingleT-Async is able to achieve the best performance under various workload conditions.

    B A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, there is no single asynchronous solution that always performs the best. For example, SingleT-Async suffers from the write-spin problem for large size responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small size responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

Fig. 10. Worker thread processing flow in the hybrid solution: in the event monitoring phase a worker thread selects the pool of connections with pending events; in the event handling phase it checks the request type of each available connection and, after parsing and encoding, either follows the SingleT-Async path (plain socket.write()) or the NettyServer path (write operation optimization), then gets the next connection.

The first assumption excludes initiating the server with a large but fixed TCP send buffer size for each connection to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) we have discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of the near-zero latency requirement [30]; for example, Memcached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem during runtime. In the initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus when HybridNetty receives a new incoming request, it checks the map object first to figure out which category the request belongs to, and then chooses the most efficient execution path. In practice the response size even for the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object during runtime once a request is detected to be classified into the wrong category, in order to keep track of the latest category of such requests.
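A minimal sketch of this classification-and-dispatch logic is shown below. It is our own illustration of the idea described above, not HybridNetty's actual code; in the real implementation the dispatch hooks into Netty's channel pipeline, and the request key (here a plain string) would be something like the request URL:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class HybridDispatcher {
    // true = heavy request (caused write-spin before), false = light request
    private final ConcurrentMap<String, Boolean> heavyMap = new ConcurrentHashMap<>();

    void handle(String requestKey, Request req) {
        boolean heavy = heavyMap.getOrDefault(requestKey, false);
        boolean spinObserved;
        if (heavy) {
            spinObserved = nettyStylePath(req);    // write operation optimization path
        } else {
            spinObserved = singleThreadPath(req);  // plain non-blocking write path
        }
        // Re-classify at runtime if the observed behavior contradicts the map.
        if (spinObserved != heavy) {
            heavyMap.put(requestKey, spinObserved);
        }
    }

    // Both paths report whether a write-spin (write() returning 0 or the spin
    // threshold being exceeded) was observed while sending the response.
    boolean nettyStylePath(Request req)   { /* placeholder */ return true; }
    boolean singleThreadPath(Request req) { /* placeholder */ return false; }

    static class Request { /* request payload omitted */ }
}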

    C Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large response sizes (e.g., 100KB), and the light requests, which have small response sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution where light requests dominate the workload [22], our hybrid solution makes even more sense in dealing with realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

    VI RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than the asynchronous event-driven one. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers under high concurrency workload using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high performance and resource-efficient servers that meet the requirements of current cloud data centers.

The optimization for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the interfaces of network I/O for application level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, epoll, or I/O operations under high concurrency workload. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating the event monitoring and handling into the kernel.

Fig. 11. The hybrid solution performs the best in different mixes of light/heavy request workload, with or without network latency: (a) no network latency between client and server, (b) ~5ms network latency between client and server. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare the normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µserver) and pipeline-based (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µserver by modifying the strategy of accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's initial congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections. Our work complements their research but focuses on more general network conditions.

    VII CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, resulting from an inferior event processing flow which creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers: the write-spin problem when handling large size responses, and the associated exacerbating factors such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution by utilizing different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high performance asynchronous event-driven servers needs to take both the event processing flow and the runtime varying workload/network conditions into consideration.

    ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

APPENDIX A
RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. The workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

Fig. 12. Details of the RUBBoS experimental setup: (a) software and hardware setup, (b) a 1/1/1 sample topology (clients, Apache, Tomcat, MySQL).

    REFERENCES

[1] Apache JMeter. http://jmeter.apache.org.
[2] Collectl. http://collectl.sourceforge.net.
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty.
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html.
[5] lighttpd. https://www.lighttpd.net.
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async/.
[7] Netty. http://netty.io.
[8] Node.js. https://nodejs.org/en/.
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html.
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly/.
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html.
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org.
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging.
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw.
[15] ADLER, S. The Slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126–134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135–148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIX ATC '10, USENIX Association, pp. 11–11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1–7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA, 2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33–46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4–4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.

    [41] VON BEHREN R CONDIT J ZHOU F NECULA G C ANDBREWER E Capriccio Scalable threads for internet services InProceedings of the Nineteenth ACM Symposium on Operating SystemsPrinciples (New York NY USA 2003) SOSP rsquo03 ACM pp 268ndash281

    [42] WELSH M CULLER D AND BREWER E Seda An architecturefor well-conditioned scalable internet services In Proceedings of theEighteenth ACM Symposium on Operating Systems Principles (NewYork NY USA 2001) SOSP rsquo01 ACM pp 230ndash243

    [43] ZELDOVICH N YIP A DABEK F MORRIS R MAZIERES DAND KAASHOEK M F Multiprocessor support for event-drivenprograms In USENIX Annual Technical Conference General Track(2003) pp 239ndash252

    • Introduction
    • Background and Motivation
      • RPC vs Asynchronous Network IO
      • Performance Degradation after Tomcat Upgrade
        • Inefficient Event Processing Flow in Asynchronous Servers
        • Write-Spin Problem of Asynchronous Invocation
          • Profiling Results
          • Network Latency Exaggerates the Write-Spin Problem
            • Solution
              • Mitigating Context Switches and Write-Spin Using Netty
              • A Hybrid Solution
              • Validation of HybridNetty
                • Related Work
                • Conclusions
                • Appendix A RUBBoS Experimental Setup
                • References

[Figure 2: Throughput comparison between TomcatSync and TomcatAsync under different workload concurrencies and response sizes. Each panel plots throughput [req/s] against workload concurrency [# of connections] from 1 to 3200 and marks the crossover point between the two servers: (a) the 0.1KB response size case, (b) the 10KB response size case, and (c) the 100KB response size case. As the response size increases from 0.1KB in (a) to 100KB in (c), TomcatSync outperforms TomcatAsync over a wider concurrency range, indicating the performance degradation of TomcatAsync with large response sizes.]

after the popular news website Slashdot [15]. Our experiments adopt a typical 3-tier configuration with one Apache web server, one Tomcat application server, and one MySQL database server (details in Appendix A). At the beginning we use Tomcat 7 (noted as TomcatSync), which uses a thread-based synchronous connector for inter-tier communication. We then upgrade the Tomcat server to Version 8 (the latest version at the time, noted as TomcatAsync), which by default uses an asynchronous connector, with the expectation of a system performance improvement after the Tomcat upgrade.

Unfortunately, we observed a surprising system performance degradation after the Tomcat server upgrade, as shown in Figure 1. We call the system with TomcatSync SYS_tomcatV7 and the one with TomcatAsync SYS_tomcatV8. The figure shows that SYS_tomcatV7 saturates at workload 11,000 while SYS_tomcatV8 saturates at workload 9,000. At workload 11,000, SYS_tomcatV7 outperforms SYS_tomcatV8 by 28% in throughput, and its average response time is one order of magnitude lower (226ms vs. 2820ms). Such a result is counter-intuitive since we upgraded Tomcat from an older thread-based version to a newer asynchronous one. We note that in both cases the Tomcat server CPU is the bottleneck resource in the system; all the hardware resources (e.g., CPU and memory) of the other component servers are far from saturation (< 60%).

We use Collectl [2] to collect system-level metrics (e.g., CPU and context switches). Another interesting phenomenon we observed is that TomcatAsync encounters a significantly higher number of context switches than TomcatSync at the same workload. For example, at workload 10,000, TomcatAsync encounters 12,950 context switches per second while TomcatSync encounters only 5,930, less than half as many. It is reasonable to suggest that the high context switch rate in TomcatAsync causes high CPU overhead, leading to inferior throughput compared to TomcatSync. However, traditional wisdom says that a server with an asynchronous architecture should have fewer context switches than a thread-based server. So why did we observe the opposite here? We discuss the cause in the next section.

[Figure 3: Illustration of the event processing flow when TomcatAsync processes one request. The reactor thread dispatches the read event to worker A, which reads and processes the request and generates a write event; control returns to the reactor thread, which dispatches the write event to worker B to send the response, after which control returns to the reactor thread again. In total, four context switches occur between the reactor thread and the worker threads.]

III. INEFFICIENT EVENT PROCESSING FLOW IN ASYNCHRONOUS SERVERS

In this section we explain why the performance of the 3-tier benchmark system degrades after we upgrade Tomcat from the thread-based version TomcatSync to the asynchronous version TomcatAsync. To simplify and quantify our analysis, we design micro-benchmarks to test the performance of both versions of Tomcat.

We use JMeter [1] to generate HTTP requests to access the standalone Tomcat directly. These HTTP requests are categorized into three types: small, medium, and large, for which the Tomcat server (either TomcatSync or TomcatAsync) first conducts some simple computation before responding with 0.1KB, 10KB, and 100KB of in-memory data, respectively. We choose these three sizes because they are representative response sizes in our RUBBoS benchmark application. JMeter uses one thread to simulate each end-user. We set the think time between consecutive requests sent from the same thread to zero; thus we can precisely control the concurrency of the workload to the target Tomcat server by specifying the number of threads in JMeter.
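For concreteness, a minimal sketch of the kind of request handler such a micro-benchmark implies is shown below; the class name, the "size" request parameter, and the fixed byte arrays are illustrative assumptions rather than the handler actually used in our experiments.

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Returns 0.1KB, 10KB, or 100KB of in-memory data after a trivial computation,
    // mirroring the three representative response sizes of the micro-benchmark.
    public class SizeServlet extends HttpServlet {
        private static final byte[] SMALL  = new byte[100];         // ~0.1KB
        private static final byte[] MEDIUM = new byte[10 * 1024];   // 10KB
        private static final byte[] LARGE  = new byte[100 * 1024];  // 100KB

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String size = req.getParameter("size");   // "small" | "medium" | "large"
            byte[] body = "large".equals(size) ? LARGE
                        : "medium".equals(size) ? MEDIUM
                        : SMALL;
            resp.setContentType("application/octet-stream");
            resp.setContentLength(body.length);
            resp.getOutputStream().write(body);       // in-memory data only, no disk I/O
        }
    }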

We compare the server throughput of TomcatSync and TomcatAsync under different workload concurrencies and response sizes, as shown in Figure 2. The three sub-figures show that as the workload concurrency increases from 1 to 3200, TomcatAsync achieves lower throughput than TomcatSync before a certain workload concurrency. For example, TomcatAsync performs worse than TomcatSync

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8.

Response size   TomcatAsync [×1000/sec]   TomcatSync [×1000/sec]
0.1KB           40                        16
10KB            25                        7
100KB           28                        2

before workload concurrency 64 when the response size is 10KB, and the crossover workload concurrency is even higher (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions the average response size of Tomcat per request is about 20KB and the workload concurrency for Tomcat is about 35 when the system saturates. So, based on our micro-benchmark results in Figure 2, it is not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also leads to the performance degradation of the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

We found that the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates a significant number of intermediate context switches and thus non-trivial CPU overhead. Table I compares the context switches of TomcatAsync and TomcatSync at workload concurrency 8. The table shows results consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encounters significantly more context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switch rate of TomcatAsync is caused by the poor design of its event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

Thus, to handle one client request, there are in total four context switches among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. On the other hand, in TomcatSync each client request is handled by a dedicated worker thread, from the initial reading of the request to preparing the response to sending the response out. No context switch occurs during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.
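A minimal sketch of this one-event-one-handler dispatch pattern is shown below (the names are illustrative, not Tomcat's internal code); each submission to the worker pool and each return of control to the reactor thread is one of the four user-space context switches counted above.

    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.concurrent.ExecutorService;

    // Reactor loop sketch: monitors events and dispatches every event to the worker pool.
    class ReactorSketch {
        void reactorLoop(Selector selector, ExecutorService workerPool) throws Exception {
            while (true) {
                selector.select();                                  // event monitoring
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        workerPool.submit(() -> handleRead(key));   // switch 1: reactor -> worker A
                        // worker A reads and processes the request, generates a write
                        // event, and hands control back to the reactor (switch 2)
                    } else if (key.isWritable()) {
                        workerPool.submit(() -> handleWrite(key));  // switch 3: reactor -> worker B
                        // worker B sends the response and returns control (switch 4)
                    }
                }
                selector.selectedKeys().clear();
            }
        }

        private void handleRead(SelectionKey key)  { /* read request, prepare response, enqueue write event */ }
        private void handleWrite(SelectionKey key) { /* send the prepared response */ }
    }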

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementations of TomcatAsync and TomcatSync by removing all the unrelated modules (e.g., servlet life cycle management, cache management, and logging) and keeping only the essential code related to request processing, which we

TABLE II: Context switches among user-space threads when the server processes one client request.

Server type          Context switches   Note
sTomcat-Async        4                  Read and write events are handled by different worker threads (Figure 3).
sTomcat-Async-Fix    2                  Read and write events are handled by the same worker thread.
sTomcat-Sync         0                  Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration.
SingleT-Async        0                  No context switches; one thread handles both event monitoring and processing.

refer to as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and the write event of the same request into the same worker thread. In this case, once a worker thread finishes preparing the response, it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist); thus processing one client request only requires two context switches, from the reactor thread to a worker thread and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server, in which a single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches for each server type when it processes one client request.¹ Interested readers can check out our server implementations on GitHub [13] for further reference.
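As a reference point, the single-threaded design can be sketched as follows in plain Java NIO (the request-handling helper is an assumption, not the code released at [13]): one thread performs both event monitoring and event handling, so no user-space context switches are needed to process a request.

    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;

    public class SingleThreadedAsyncServer {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isAcceptable()) {
                        SocketChannel ch = server.accept();
                        ch.configureBlocking(false);
                        ch.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        // read, process, and write back on the same thread
                        handleRequest((SocketChannel) key.channel());
                    }
                }
                selector.selectedKeys().clear();
            }
        }

        private static void handleRequest(SocketChannel ch) { /* parse, compute, write response */ }
    }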

We compare the throughput and context switches of the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figure 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during the runtime experiments. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput while incurring 34% fewer context switches. In our experiments the CPU demand of each request is positively correlated with the response size; a small response size means a small CPU computation demand, so the portion of CPU cycles wasted on context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async reflects their throughput difference. This hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). This performance difference is also due to fewer context switches, as shown in Figure 4(d); for example, the context switch rate of SingleT-Async is a few hundred per second, three orders of magnitude less than that of sTomcat-Async.

¹ To simplify the analysis, we do not count the context switches caused by interrupts or by swapping performed by the operating system.

[Figure 4: Throughput and context switch comparison among the different server architectures (SingleT-Async, sTomcat-Sync, sTomcat-Async-Fix, sTomcat-Async) as the server response size increases from 0.1KB to 100KB. Panels (a)-(c) show throughput [req/s] and panels (d)-(f) show context switches [#/s] for 0.1KB, 10KB, and 100KB responses, plotted against workload concurrency [# of connections] from 1 to 3200. Panels (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the response size is small (0.1KB). However, as the response size increases to 100KB, panel (c) shows that sTomcat-Sync outperforms the other asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.]

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case in Figure 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively: the throughput difference among the four server architectures becomes narrower, indicating less performance impact from context switches.

In fact, an interesting phenomenon is observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). This observation suggests that there are other factors causing overhead in the asynchronous SingleT-Async but not in the thread-based sTomcat-Sync when the server response size is large, which we discuss in the next section.

IV. WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation problem of an asynchronous server sending a large response. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that the default small TCP send buffer size and the TCP wait-ACK mechanism lead to a severe write-spin problem when sending a relatively large response, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that can exacerbate the negative impact of the write-spin problem, further degrading the performance of an asynchronous server.

A. Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB), the throughput of the asynchronous SingleT-Async is 20% higher than that of the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, the throughput of SingleT-Async is surprisingly 31% lower than that of sTomcat-Sync under the same workload concurrency 8 (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different server response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size for both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases by 25% (from 55% to 80%), while that of SingleT-Async increases by 34% (from 58% to 92%). This comparison suggests that increasing the response size has more impact on the user-space CPU utilization of the asynchronous SingleT-Async than of the thread-based sTomcat-Sync.

We further use JProfiler [4] to profile the SingleT-Async case when the response size increases from 0.1KB to 100KB

TABLE III: SingleT-Async consumes more user-space CPU than sTomcat-Sync. The workload concurrency is kept at 100.

                          sTomcat-Sync          SingleT-Async
Response size             0.1KB      100KB      0.1KB      100KB
Throughput [req/sec]      35,000     590        42,800     520
User CPU total [%]        55         80         58         92
System CPU total [%]      45         20         42         8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the total number of socket.write() calls in SingleT-Async with different response sizes during a one-minute experiment.

Resp. size   # req      # write()   socket.write() per req
0.1KB        238,530    238,530     1
10KB         9,400      9,400       1
100KB        2,971      303,795     102

and see what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In a thread-based server like sTomcat-Sync, socket.write() is called only once for each client request. While one write per request also holds for the 0.1KB and 10KB cases in SingleT-Async, it calls socket.write() 102 times per request on average in the 100KB case. System calls in general are expensive due to the related kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead of SingleT-Async shown in Table III.

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small TCP send buffer size (16KB by default) of each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP buffer through the system call socket.write(), the first socket.write() can copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is set by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data to be copied in only after the server receives the ACKs of the previously sent-out packets. socket.write() is a non-blocking system call in SingleT-Async: every time it returns how many bytes were written to the TCP send buffer, and it returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync tries to copy 100KB of data from user space to the kernel-space TCP send buffer, only one blocking socket.write() system call is invoked for each request; the worker thread waits until the kernel sends the whole 100KB response out, and the write-spin problem is avoided.
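The write-spin can be seen directly in a non-blocking write loop; the following is a minimal sketch assuming a 100KB in-memory response, the default 16KB send buffer, and an established non-blocking SocketChannel.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    // Copies a large response into the kernel TCP send buffer with non-blocking writes.
    // The first write() copies at most ~16KB; subsequent calls return 0 until ACKs
    // from the client free buffer space, so the loop spins through repeated
    // socket.write() system calls (102 per request on average in the 100KB case).
    void sendResponse(SocketChannel channel, byte[] responseBody) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(responseBody);   // e.g., 100KB in-memory payload
        while (buf.hasRemaining()) {
            int n = channel.write(buf);                   // non-blocking: may return 0
            // n == 0 means the TCP send buffer is full; a naive asynchronous server
            // simply retries, which is exactly the write-spin measured in Table IV
        }
    }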

[Figure 5: Illustration of the write-spin problem in an asynchronous server. The server reads the request from the receive buffer, parses and encodes it, and then calls the write-to-socket system call repeatedly; each call returns the number of bytes written (zero when the send buffer is full), so the worker thread write-spins while the copied sum is less than the data size and must wait for TCP ACKs from the client before the data copy to the kernel can finish. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, a worker thread write-spins on the system call socket.write() and can only send more data after ACKs come back from the client for the previously sent packets.]

An intuitive solution is to increase the TCP send buffer size to the same size as the server response to avoid the write-spin problem. Our experimental results indeed show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance. For example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]. For example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results), all of which can be pushed back in answer to one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on a server that may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
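For reference, enlarging the per-connection send buffer is a one-line socket option in Java NIO; the sketch below uses 100KB only as an example value, which, as argued above, is hard to choose well in advance and costs kernel memory on every connection.

    import java.io.IOException;
    import java.net.StandardSocketOptions;
    import java.nio.channels.SocketChannel;

    // Enlarge the TCP send buffer of one accepted connection so that a 100KB
    // response fits into a single socket.write(); the kernel may round or cap it.
    void tuneSendBuffer(SocketChannel ch) throws IOException {
        ch.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
    }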

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on runtime network conditions. Once turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the

[Figure 6: The write-spin problem still exists with the TCP send buffer "autotuning" feature enabled. Throughput [req/s] of SingleT-Async-100KB (fixed 100KB send buffer) vs. SingleT-Async-autotuning under network latencies of ~0ms, ~5ms, ~10ms, and ~20ms.]

Bandwidth-Delay Product rule [17]; it lacks application-level information such as the response size. Therefore the auto-tuned send buffer can be large enough to maximize throughput over the link but still inadequate for the application, which may still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the case with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger if there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

B. Network Latency Exaggerates the Write-Spin Problem

Network latency is common in cloud data centers: the component servers of an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers, so the latency between them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by such network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 at all times. The response size of each client request is 100KB; the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from the response time amplification that occurs when the write-spin problem happens. This is because sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small network latency increase can be amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency

[Figure 7: Throughput degradation of two asynchronous servers in subfigure (a), resulting from the response time amplification in (b), as the network latency increases from ~0ms to ~20ms. Panels: (a) throughput comparison [req/sec] and (b) response time comparison [s] for SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync.]

is added. According to Little's Law, a server's throughput is negatively correlated with its response time, given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18s to 3.60s) means a 95% decrease in the throughput of SingleT-Async, as shown in Figure 7(a).
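The 95% figure can be checked directly with Little's Law, using the numbers above and the concurrency N fixed at 100:

    X = N / R                          (Little's Law: throughput = concurrency / response time)
    X(~0ms latency) = 100 / 0.18 s  ≈ 556 req/s
    X(~5ms latency) = 100 / 3.60 s  ≈  28 req/s
    1 − 28/556 ≈ 95% throughput decrease, consistent with Figure 7(a).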

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is designed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but with non-trivial optimization overhead. Then we propose a hybrid solution which takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server, using a reactor thread to accept new connections and a worker thread pool to process the various I/O events of each connection.

Though it uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes

[Figure 8: Netty mitigates the write-spin problem by runtime checking. The application thread keeps issuing the write-to-socket system call only while (1) the last write returned a non-zero byte count, (2) the writeSpin counter is below the threshold TH, and (3) the bytes copied so far are less than the data size; if any condition fails, the write spin is abandoned and the thread moves on to the next event processing, resuming the transfer later, otherwise the data copy to the kernel finishes.]

[Figure 9: Throughput comparison among SingleT-Async, NettyServer, and sTomcat-Sync under various workload concurrencies (1 to 3200 connections) and response sizes, with the default 16KB TCP send buffer: (a) response size 100KB, (b) response size 0.1KB. Subfigure (a) shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem, while (b) shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.]

the role of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase), and then it dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case, the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event; the output of each handler is the input to the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing unnecessary context switches between the reactor thread and the worker threads.
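The resulting thread model is visible in how a Netty 4 server is bootstrapped; the following is a minimal sketch, where the application handler and the use of HttpServerCodec are illustrative assumptions rather than the configuration used in our NettyServer.

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;
    import io.netty.handler.codec.http.HttpServerCodec;

    // The boss group only accepts connections; each worker event loop both monitors
    // and handles the I/O events of the connections assigned to it, so events are
    // not dispatched back and forth between a reactor thread and separate handlers.
    public class MinimalNettyServer {
        static class AppHandler extends ChannelInboundHandlerAdapter { }  // placeholder business handler

        public static void main(String[] args) throws InterruptedException {
            EventLoopGroup boss = new NioEventLoopGroup(1);
            EventLoopGroup workers = new NioEventLoopGroup();
            try {
                ServerBootstrap b = new ServerBootstrap();
                b.group(boss, workers)
                 .channel(NioServerSocketChannel.class)
                 .childHandler(new ChannelInitializer<SocketChannel>() {
                     @Override
                     protected void initChannel(SocketChannel ch) {
                         // handler chain (pipeline): each handler's output feeds the next,
                         // avoiding intermediate events and extra context switches
                         ch.pipeline().addLast(new HttpServerCodec(), new AppHandler());
                     }
                 });
                b.bind(8080).sync().channel().closeFuture().sync();
            } finally {
                boss.shutdownGracefully();
                workers.shutdownGracefully();
            }
        }
    }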

In order to mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty

maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, return_size is zero, indicating the TCP send buffer is already full; second, the writeSpin counter exceeds a pre-defined threshold (the default value is 16 in Netty 4). Once it jumps out, the worker thread saves the context and resumes the data transfer of the current connection after it loops over the other connections with pending events. Such write optimization mitigates the blocking of the worker thread by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
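The same idea can be sketched in plain Java NIO terms as follows; the loop mimics the behavior described above, but the variable names and the selector re-registration are illustrative, not Netty's internal code.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.SocketChannel;

    // Bounded write loop: stop after TH attempts or when write() returns 0 (send
    // buffer full), then re-register OP_WRITE and go serve other connections.
    void boundedWrite(SocketChannel ch, ByteBuffer response, SelectionKey key) throws IOException {
        final int TH = 16;                             // default writeSpin threshold in Netty 4
        for (int writeSpin = 0; writeSpin < TH && response.hasRemaining(); writeSpin++) {
            int returnSize = ch.write(response);       // bytes copied into the send buffer
            if (returnSize == 0) {
                break;                                 // buffer full: give up for now
            }
        }
        if (response.hasRemaining()) {
            // resume this connection's transfer later, after other pending events
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
        } else {
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        }
    }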

We validate the effectiveness of Netty in mitigating the write-spin problem, and also the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. This figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% less than that of SingleT-Async in the 0.1KB response case, indicating non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore neither NettyServer nor SingleT-Async achieves the best performance under all workload conditions.

B. A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, no single asynchronous solution always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

[Figure 10: Worker thread processing flow in the hybrid solution. In the event monitoring phase, select() returns the pool of connections with pending events; in the event handling phase, each available connection's request type is checked, the request is parsed and encoded, and the response is written either through the NettyServer-style write operation optimization or directly via socket.write() (the SingleT-Async path), before the thread returns to fetch the next connection.]

The first assumption rules out initiating the server with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) we discussed in Section IV-A. The second assumption rules out a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, MemCached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem during runtime. In the initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus when HybridNetty receives a new incoming request, it first checks the map object to figure out which category the request belongs to, and then chooses the most efficient execution path. In practice the response size, even for the same type of request, may change over time (due to runtime environment changes such as the dataset), so we update the map object during runtime once a request is detected to have been classified into the wrong category, in order to keep track of the latest category of such requests.
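A minimal sketch of this dispatch logic follows; the map key, the two write-path helpers, and the re-classification rule are illustrative assumptions about the approach, not the released HybridNetty code.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.util.concurrent.ConcurrentHashMap;

    // Requests profiled as "heavy" (they have write-spun before) take the
    // NettyServer-style bounded-write path; "light" requests take the plain
    // SingleT-Async-style path. The map is corrected whenever a request turns
    // out to have been placed in the wrong category.
    public abstract class HybridDispatcher {
        private final ConcurrentHashMap<String, Boolean> heavy = new ConcurrentHashMap<>();

        void dispatch(String requestType, SocketChannel ch, ByteBuffer response) throws IOException {
            if (heavy.getOrDefault(requestType, Boolean.FALSE)) {
                int spins = writeWithSpinGuard(ch, response);     // optimized write path
                if (spins <= 1) heavy.put(requestType, Boolean.FALSE);
            } else {
                int spins = writeDirect(ch, response);            // plain non-blocking write path
                if (spins > 1) heavy.put(requestType, Boolean.TRUE);
            }
        }

        // Both helpers return how many socket.write() calls the response required.
        protected abstract int writeWithSpinGuard(SocketChannel ch, ByteBuffer buf) throws IOException;
        protected abstract int writeDirect(SocketChannel ch, ByteBuffer buf) throws IOException;
    }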

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large response sizes (e.g., 100KB), and the light requests, which have small response sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution, in which light requests dominate the workload [22], our hybrid solution makes even more sense for realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than an asynchronous event-driven one. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers under high-concurrency workloads using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

Optimizations for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, and epoll, or by I/O operations under high-concurrency workloads. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

[Figure 11: The hybrid solution performs the best in different mixes of light/heavy request workload, with or without network latency. Both panels plot normalized throughput for SingleT-Async, NettyServer, and HybridNetty as the ratio of large-size responses grows (0%, 2%, 5%, 10%, 20%, 100%): (a) no network latency between client and server, (b) ~5ms network latency between client and server. The workload concurrency is kept at 100 in all cases; to clearly show the throughput differences, throughput is normalized to HybridNetty as the baseline.]

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µServer) and pipeline (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µServer by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's initial congestion window [24]; they show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections; our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than their thread-based counterparts, resulting from an inferior event processing flow that creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers, the write-spin problem when handling large responses, and the associated exacerbating factors such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

      ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation (CISE's CNS 1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

[Figure 12: Details of the RUBBoS experimental setup: (a) software and hardware setup, (b) a 1/1/1 sample topology with clients sending HTTP requests through Apache and Tomcat to MySQL.]

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

      REFERENCES

[1] Apache JMeter. http://jmeter.apache.org.
[2] Collectl. http://collectl.sourceforge.net.
[3] Jetty: a Java HTTP (Web) server and Java Servlet container. http://www.eclipse.org/jetty.
[4] JProfiler: the award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html.
[5] lighttpd. https://www.lighttpd.net.
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async.
[7] Netty. http://netty.io.
[8] Node.js. https://nodejs.org/en.
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html.
[10] Project Grizzly: NIO event development simplified. https://javaee.github.io/grizzly.
[11] RUBBoS: bulletin board benchmark. http://jmob.ow2.org/rubbos.html.
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org.
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging.
[14] TPC-W: a transactional web e-commerce benchmark. http://www.tpc.org/tpcw.
[15] Adler, S. The Slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] Adya, A., Howell, J., Theimer, M., Bolosky, W. J., and Douceur, J. R. Cooperative task management without manual stack management. In Proceedings of the USENIX Annual Technical Conference, General Track (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289-302.
[17] Allman, M., Paxson, V., and Blanton, E. TCP congestion control. Tech. rep., 2009.
[18] Banga, G., Druschel, P., and Mogul, J. C. Resource containers: a new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45-58.
[19] Belshe, M., Thomson, M., and Peon, R. Hypertext Transfer Protocol version 2 (HTTP/2).
[20] Boyd-Wickizer, S., Chen, H., Chen, R., Mao, Y., Kaashoek, F., Morris, R., Pesterev, A., Stein, L., Wu, M., Dai, Y., Zhang, Y., and Zhang, Z. Corey: an operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43-57.
[21] Brecht, T., Pariag, D., and Gammo, L. Acceptable strategies for improving web server performance. In Proceedings of the USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20-20.
[22] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: evidence and implications. In INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126-134.
[23] Canas, C., Zhang, K., Kemme, B., Kienzle, J., and Jacobsen, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241-252.
[24] Dukkipati, N., Refice, T., Cheng, Y., Chu, J., Herbert, T., Agarwal, A., Jain, A., and Sutin, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26-33.
[25] Fisk, M., and Feng, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] Garrett, J. J., et al. Ajax: a new approach to web applications.
[27] Han, S., Marshall, S., Chun, B.-G., and Ratnasamy, S. MegaPipe: a new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135-148.
[28] Harji, A. S., Buhr, P. A., and Brecht, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1-1:12.
[29] Hassan, O. A.-H., and Shargabi, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358-380.
[30] Huang, Q., Birman, K., van Renesse, R., Lloyd, W., Kumar, S., and Li, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167-181.
[31] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIX ATC '10, USENIX Association, pp. 11-11.
[32] Krohn, M., Kohler, E., and Kaashoek, M. F. Events can make sense. In Proceedings of the 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1-7:14.
[33] Krohn, M., Kohler, E., and Kaashoek, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA, 2007).
[34] Lever, C., Eriksen, M. A., and Molloy, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] Li, C., Shen, K., and Papathanasiou, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189-202.
[36] Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385-398.
[37] Pai, V. S., Druschel, P., and Zwaenepoel, W. Flash: an efficient and portable web server. In Proceedings of the USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15-15.
[38] Pariag, D., Brecht, T., Harji, A., Buhr, P., Shukla, A., and Cheriton, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231-243.
[39] Soares, L., and Stumm, M. FlexSC: flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33-46.
[40] von Behren, R., Condit, J., and Brewer, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2003), HotOS '03, USENIX Association, pp. 4-4.
[41] von Behren, R., Condit, J., Zhou, F., Necula, G. C., and Brewer, E. Capriccio: scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268-281.
[42] Welsh, M., Culler, D., and Brewer, E. SEDA: an architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230-243.
[43] Zeldovich, N., Yip, A., Dabek, F., Morris, R., Mazieres, D., and Kaashoek, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239-252.

• Introduction
• Background and Motivation
  • RPC vs. Asynchronous Network IO
  • Performance Degradation after Tomcat Upgrade
• Inefficient Event Processing Flow in Asynchronous Servers
• Write-Spin Problem of Asynchronous Invocation
  • Profiling Results
  • Network Latency Exaggerates the Write-Spin Problem
• Solution
  • Mitigating Context Switches and Write-Spin Using Netty
  • A Hybrid Solution
  • Validation of HybridNetty
• Related Work
• Conclusions
• Appendix A: RUBBoS Experimental Setup
• References

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8.

Response size    TomcatAsync [×1000/sec]    TomcatSync [×1000/sec]
0.1KB            40                         16
10KB             25                         7
100KB            28                         2

before the workload concurrency 64 when the response size is 10KB, and the crossover point workload concurrency is even higher (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions the average response size of Tomcat per request is about 20KB and the workload concurrency for Tomcat is about 35 when the system saturates. So, based on our micro-benchmark results in Figure 2, it is not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also leads to the performance degradation of the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

We found that the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates a significant amount of intermediate context switches and causes non-trivial CPU overhead. Table I compares the context switches of TomcatAsync and TomcatSync at workload concurrency 8. The table shows results consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encountered significantly more context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switch count of TomcatAsync stems from the poor design of its event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

So to handle one client request there are in total 4 context switches among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow design also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. On the other hand, in TomcatSync each client request is handled by a dedicated worker thread, from the initial reading of the request to preparing the response to sending the response out. No context switch occurs during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.
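To make the dispatching overhead concrete, the sketch below is our own minimal illustration of the reactor-plus-worker-pool pattern described above (class and method names are hypothetical, not Tomcat's actual code): a reactor thread blocks on Selector.select() and hands every ready event to a worker pool, so each read or write event costs a thread hand-off.

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ReactorSketch {
        private final Selector selector;
        private final ExecutorService workers = Executors.newFixedThreadPool(8);

        public ReactorSketch(Selector selector) {
            this.selector = selector;
        }

        // Reactor thread: only monitors events, never handles them itself.
        public void eventLoop() throws IOException {
            while (true) {
                selector.select();                       // event monitoring phase
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    // Each dispatch is a hand-off to another thread; these hand-offs
                    // are where the intermediate context switches come from.
                    // (Interest-op management is omitted for brevity.)
                    if (key.isReadable()) {
                        workers.submit(() -> handleRead(key));
                    } else if (key.isWritable()) {
                        workers.submit(() -> handleWrite(key));
                    }
                }
            }
        }

        private void handleRead(SelectionKey key)  { /* parse request, register write interest */ }
        private void handleWrite(SelectionKey key) { /* send response */ }
    }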

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementations of TomcatAsync and TomcatSync by removing all the unrelated modules (e.g., servlet life cycle management, cache management, and logging) and only keeping the essential code related to request processing, which we

TABLE II: Context switches among user-space threads when the server processes one client request.

Server type          Context switches    Note
sTomcat-Async        4                   Read and write events are handled by different worker threads (Figure 3)
sTomcat-Async-Fix    2                   Read and write events are handled by the same worker thread
sTomcat-Sync         0                   Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration
SingleT-Async        0                   No context switches; one thread handles both event monitoring and processing

refer to as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and the write event of the same request into the same worker thread. In this case, once a worker thread finishes preparing the response it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist), so processing one client request only requires two context switches: from the reactor thread to a worker thread and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server, in which a single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches for each server type when it processes one client request.1 Interested readers can check out our server implementations on GitHub [13] for further reference.

We compare the throughput and context switches among the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figures 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during the runtime experiments. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput while its context switches are 34% fewer. In our experiments the CPU demand for each request is positively correlated with the response size; a small response size means a small CPU computation demand, so the portion of CPU cycles wasted on context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async reflects their throughput difference. This hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). This performance difference is also due to fewer context switches, as shown in Figure 4(d); for example, the context switches of SingleT-Async are a few hundred per second, three orders of magnitude fewer than those of sTomcat-Async.

1 To simplify the analysis and reasoning, we do not count the context switches caused by interrupts or by swapping performed by the operating system.

[Fig. 4: Throughput and context switch comparison among the different server architectures as the server response size increases from 0.1KB to 100KB. Panels: (a) throughput at 0.1KB, (b) throughput at 10KB, (c) throughput at 100KB, (d) context switches at 0.1KB, (e) context switches at 10KB, (f) context switches at 100KB; x-axis: workload concurrency [# of connections]; curves: SingleT-Async, sTomcat-Sync, sTomcat-Async-Fix, sTomcat-Async. (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the server response size is small (0.1KB). However, as the response size increases to 100KB, (c) shows that sTomcat-Sync outperforms the asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.]

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case in Figures 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively. The throughput difference among the four server architectures becomes narrower, indicating less performance impact from context switches.

In fact, one interesting phenomenon is observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). This observation suggests that there are other factors causing overhead in the asynchronous SingleT-Async but not in the thread-based sTomcat-Sync when the server response size is large, which we discuss in the next section.

IV WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation problem of an asynchronous server sending a large response. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that it is the default small TCP send buffer size and the TCP wait-ACK mechanism that lead to a severe write-spin problem when sending a relatively large response, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that exacerbate the negative impact of the write-spin problem and further degrade the performance of an asynchronous server.

        A Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB), the throughput of the asynchronous SingleT-Async is 20% higher than that of the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, the throughput of SingleT-Async is surprisingly 31% lower than that of sTomcat-Sync under the same workload concurrency 8 (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different server response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size for both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases by 25% (from 55% to 80%), while that of SingleT-Async increases by 34% (from 58% to 92%). This comparison suggests that increasing the response size has more impact on the user-space CPU utilization of the asynchronous SingleT-Async than on that of the thread-based sTomcat-Sync.

We further use JProfiler [4] to profile the SingleT-Async case when the response size increases from 0.1KB to 100KB

TABLE III: SingleT-Async consumes more user-space CPU compared to sTomcat-Sync. The workload concurrency is kept at 100.

Server Type              sTomcat-Sync          SingleT-Async
Response Size            0.1KB      100KB      0.1KB      100KB
Throughput [req/sec]     35,000     590        42,800     520
User total [%]           55         80         58         92
System total [%]         45         20         42         8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the total number of socket.write() calls in SingleT-Async with different response sizes during a one-minute experiment.

Resp. size    # req      # socket.write()    socket.write() per req
0.1KB         238,530    238,530             1
10KB          9,400      9,400               1
100KB         2,971      303,795             102

and see what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In the case of a thread-based server like sTomcat-Sync, socket.write() is called only once for each client request. While one write per request also holds for the 0.1KB and 10KB cases in SingleT-Async, it calls socket.write() on average 102 times per request in the 100KB case. System calls in general are expensive due to the related kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead of SingleT-Async shown in Table III.

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small TCP send buffer size (16KB by default) of each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP buffer through the system call socket.write(), the first socket.write() can copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is set by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data to be copied in only after the server receives the ACKs of the previously sent-out packets. Since socket.write() is a non-blocking system call in SingleT-Async, every time it is called it returns how many bytes were written to the TCP send buffer; the system call returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync tries to copy 100KB of data from user space to the kernel-space TCP send buffer, only one blocking socket.write() system call is invoked for each request; the worker thread waits until the kernel sends the 100KB response out, and the write-spin problem is avoided.
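To make the write-spin behavior concrete, the following sketch is our own illustration (class and method names are hypothetical, not the paper's server code) of a naive non-blocking send loop over Java NIO's SocketChannel; write() returns the number of bytes copied to the kernel send buffer, possibly zero, so a 100KB response on a 16KB send buffer makes this loop spin until ACKs from the client free up buffer space.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class WriteSpinDemo {
        /**
         * Naive non-blocking send loop (illustrative sketch; assumes the channel
         * was configured with configureBlocking(false)). channel.write() copies
         * at most the free space of the kernel TCP send buffer and returns 0
         * when the buffer is full, so this loop "write-spins", repeatedly
         * issuing socket.write() system calls until ACKs arrive.
         */
        public static void sendResponse(SocketChannel channel, ByteBuffer response)
                throws IOException {
            int writeSpinCount = 0;
            while (response.hasRemaining()) {
                int written = channel.write(response); // non-blocking, may return 0
                if (written == 0) {
                    writeSpinCount++;                  // TCP send buffer is full
                }
            }
            System.out.println("write() returned 0 " + writeSpinCount + " times");
        }
    }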

[Fig. 5: Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, the worker thread write-spins on the system call socket.write(): each write returns the number of bytes copied to the kernel (possibly zero, which is unexpected), and more data can only be sent after ACKs for the previously sent packets come back from the client.]

An intuitive solution is to increase the TCP send buffer size to match the size of the server response, so as to avoid the write-spin problem. Our experimental results indeed show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance; for example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]; for example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results), all of which can be pushed back in answer to one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on a server that may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
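For reference, the per-connection send buffer can be enlarged from application code; the snippet below is a minimal sketch of this manual approach (our own illustration, with a hypothetical 100KB constant and class name) using the standard Java socket option SO_SNDBUF, which the kernel treats as a hint and may adjust.

    import java.io.IOException;
    import java.net.StandardSocketOptions;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;

    public class SendBufferConfig {
        // Hypothetical per-connection buffer sized for the largest expected response.
        private static final int SEND_BUFFER_BYTES = 100 * 1024;

        public static SocketChannel acceptWithLargeBuffer(ServerSocketChannel server)
                throws IOException {
            SocketChannel client = server.accept();
            // SO_SNDBUF enlarges the kernel TCP send buffer for this connection,
            // trading memory per connection for fewer socket.write() spins.
            client.setOption(StandardSocketOptions.SO_SNDBUF, SEND_BUFFER_BYTES);
            return client;
        }
    }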

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on the runtime network conditions. Once turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-Delay Product rule [17]; it lacks sufficient application information such as the response size. Therefore, the auto-tuned send buffer may be large enough to maximize the throughput over the link but still inadequate for the application, which can still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the case with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger if there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

[Fig. 6: The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled. Throughput [req/s] vs. network latency (~0ms, ~5ms, ~10ms, ~20ms) for SingleT-Async-100KB and SingleT-Async-autotuning.]

B Network Latency Exacerbates the Write-Spin Problem

Network latency is common in cloud data centers. Consider the component servers of an n-tier application, which may run on VMs located on different physical nodes, across different racks, or even in different data centers; the resulting latency can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by network latency.

The impact of network latency on the performance of the different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from response time amplification when the write-spin problem happens. This is because sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small network latency increase can be amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency

[Fig. 7: Throughput degradation of the two asynchronous servers in subfigure (a), resulting from the response time amplification in (b), as the network latency increases. (a) Throughput [req/sec] and (b) response time [s] vs. network latency (~0ms, ~5ms, ~10ms, ~20ms) for SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync.]

is added. According to Little's Law, a server's throughput is negatively correlated with the response time of the server given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency of each server at 100, a 20-fold increase in server response time (from 0.18 to 3.60 seconds) means a 95% decrease in the throughput of SingleT-Async, as shown in Figure 7(a).
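As a back-of-the-envelope check of this argument (our own arithmetic, using the numbers reported above), Little's Law N = X · R with fixed concurrency N = 100 gives

    X = N / R:    X(~0ms) = 100 / 0.18 s ≈ 556 req/s,    X(~5ms) = 100 / 3.60 s ≈ 28 req/s,

i.e., a drop of roughly 1 - 28/556 ≈ 95%, consistent with the throughput degradation in Figure 7(a).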

        V SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is supposed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but at the cost of non-trivial optimization overhead. Then we propose a hybrid solution which takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

        A Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server: it uses a reactor thread to accept new connections and a worker thread pool to process the various I/O events of each connection.

Though it uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes

[Fig. 8: Netty mitigates the write-spin problem by runtime checking. The worker thread keeps calling socket.write() only while (1) the returned size is non-zero, (2) the writeSpin counter is below the threshold, and (3) the accumulated bytes written are less than the data size; if any condition is not met, it jumps out of the loop and proceeds to the next event processing, resuming the data copy later.]

[Fig. 9: Throughput comparison under various workload concurrencies and response sizes; the default TCP send buffer size is 16KB. (a) Response size is 100KB: NettyServer performs the best, suggesting effective mitigation of the write-spin problem. (b) Response size is 0.1KB: NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty. Both panels plot throughput [req/s] vs. workload concurrency [# of connections] for SingleT-Async, NettyServer, and sTomcat-Sync.]

the roles of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase); it then dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event, where the output of each handler is the input to the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing unnecessary context switches between the reactor thread and the worker threads.
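As a concrete illustration of this structure, the sketch below uses Netty 4's public API (the port number, the HTTP codec choice, and the empty handler body are our own assumptions; this is not the NettyServer implementation evaluated in this paper): the boss group only accepts connections, each accepted channel is bound to one worker event loop that both monitors and handles its events, and the pipeline chains handlers so that one handler's output feeds the next.

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.*;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;
    import io.netty.handler.codec.http.HttpServerCodec;

    public class NettyServerSketch {
        public static void main(String[] args) throws InterruptedException {
            EventLoopGroup boss = new NioEventLoopGroup(1);   // reactor: accepts connections only
            EventLoopGroup workers = new NioEventLoopGroup(); // each worker monitors and handles its connections
            try {
                ServerBootstrap b = new ServerBootstrap();
                b.group(boss, workers)
                 .channel(NioServerSocketChannel.class)
                 .childHandler(new ChannelInitializer<SocketChannel>() {
                     @Override
                     protected void initChannel(SocketChannel ch) {
                         // Handler chain (pipeline): the codec output feeds the business handler.
                         ch.pipeline().addLast(new HttpServerCodec());
                         ch.pipeline().addLast(new SimpleChannelInboundHandler<Object>() {
                             @Override
                             protected void channelRead0(ChannelHandlerContext ctx, Object msg) {
                                 // build and write the response here
                             }
                         });
                     }
                 });
                b.bind(8080).sync().channel().closeFuture().sync();
            } finally {
                boss.shutdownGracefully();
                workers.shutdownGracefully();
            }
        }
    }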

In order to mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty

maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, return_size is zero, indicating that the TCP send buffer is already full; second, the writeSpin counter exceeds a pre-defined threshold (the default value is 16 in Netty-v4). Once it jumps out, the worker thread saves the context and resumes the data transfer of the current connection after it loops over the other connections with pending events. Such write optimization prevents a worker thread from being blocked by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
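The following sketch is our own simplified rendering of the bounded write loop described above (not Netty's actual doWrite() code; the class name and threshold constant are illustrative):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class BoundedWriteSketch {
        private static final int WRITE_SPIN_THRESHOLD = 16; // Netty-v4 default per the text

        /**
         * Returns true if the whole response was flushed, false if the worker
         * should move on to other connections and resume this one later.
         */
        public static boolean writeBounded(SocketChannel channel, ByteBuffer response)
                throws IOException {
            for (int spin = 0; spin < WRITE_SPIN_THRESHOLD && response.hasRemaining(); spin++) {
                int written = channel.write(response);
                if (written == 0) {
                    // TCP send buffer full: stop spinning, wait for a writable event instead
                    return false;
                }
            }
            return !response.hasRemaining();
        }
    }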

We validate the effectiveness of Netty in mitigating the write-spin problem, as well as the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. The figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% less than that of SingleT-Async in the 0.1KB response case, indicating the non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore, neither NettyServer nor SingleT-Async is able to achieve the best performance under all workload conditions.

        B A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, no single asynchronous solution always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

[Fig. 10: Worker thread processing flow in the hybrid solution. In the event monitoring phase, the worker thread select()s over the pool of connections with pending events; in the event handling phase, it checks the request type and, after parsing and encoding, either applies NettyServer's write operation optimization or calls socket.write() directly as SingleT-Async does, then returns to get the next connection.]

The first assumption excludes the case where the server is initialized with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature of HTTP/2.0) we discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, Memcached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem at runtime. In the initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus when HybridNetty receives a new incoming request, it first checks the map object to figure out which category the request belongs to, and then chooses the most efficient execution path. In practice the response size, even for the same type of request, may change over time (due to runtime environment changes such as the dataset), so we update the map object at runtime whenever a request is detected to have been classified in the wrong category, in order to keep track of the latest category of such requests.
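A minimal sketch of this request classification and dispatch logic is shown below (our own illustration; the request-key extraction, the two path callbacks, and the class name are assumptions, not HybridNetty's actual code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class HybridDispatchSketch {
        // true = "heavy" (write-spin observed), false = "light"
        private final ConcurrentMap<String, Boolean> requestCategory = new ConcurrentHashMap<>();

        /** Pick the execution path based on the profiled category of this request type. */
        public void handle(String requestKey, Runnable directWritePath, Runnable optimizedWritePath) {
            boolean heavy = requestCategory.getOrDefault(requestKey, false);
            if (heavy) {
                optimizedWritePath.run();   // NettyServer-style bounded write loop
            } else {
                directWritePath.run();      // SingleT-Async-style direct socket.write()
            }
        }

        /** Called after sending: reclassify if the observed behavior contradicts the map. */
        public void recordObservation(String requestKey, boolean wroteSpun) {
            requestCategory.put(requestKey, wroteSpun);
        }
    }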

        C Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have a large response size (e.g., 100KB), and the light requests, which have a small response size (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution in which light requests dominate the workload [22], our hybrid solution makes even more sense in dealing with realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

        VI RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than the asynchronous event-driven one. For example, von Behren et al. developed a thread-based web server, Knot [40], which can compete with event-driven servers under high-concurrency workloads by using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

The optimization of asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, and epoll, or by I/O operations under high-concurrency workloads. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

[Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workloads, with or without network latency. (a) No network latency between client and server; (b) ~5ms network latency between client and server. Both panels plot normalized throughput vs. the ratio of large-size responses (0%, 2%, 5%, 10%, 20%, 100%) for SingleT-Async, NettyServer, and HybridNetty. The workload concurrency is kept at 100 in all cases; to clearly show the throughput difference, throughput is normalized with HybridNetty as the baseline.]

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µserver) and pipeline (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µserver by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's initial congestion window [24]; they show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections; our work complements their research and focuses on more general network conditions.

        VII CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, resulting from an inferior event processing flow that creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers: the write-spin problem when handling large responses, together with the associated exacerbating factors such as network latency (Section IV). Since no single solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

        ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation under CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

[Fig. 12: Details of the RUBBoS experimental setup. (a) Software and hardware setup; (b) a sample 1/1/1 topology in which clients send HTTP requests to Apache, Tomcat, and MySQL.]

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the experiments of Subsection II-B. Each server in the 3-tier topology is deployed on a dedicated machine; all other client-server experiments are conducted with one client machine and one server machine.

        REFERENCES

[1] Apache JMeter™. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] ADLER, S. The Slashdot effect: an analysis of three Internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Proceedings, IEEE (1999), vol. 1, IEEE, pp. 126–134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135–148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11–11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1–7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA, 2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33–46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4–4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.

        • Introduction
        • Background and Motivation
          • RPC vs Asynchronous Network IO
          • Performance Degradation after Tomcat Upgrade
            • Inefficient Event Processing Flow in Asynchronous Servers
            • Write-Spin Problem of Asynchronous Invocation
              • Profiling Results
              • Network Latency Exaggerates the Write-Spin Problem
                • Solution
                  • Mitigating Context Switches and Write-Spin Using Netty
                  • A Hybrid Solution
                  • Validation of HybridNetty
                    • Related Work
                    • Conclusions
                    • Appendix A RUBBoS Experimental Setup
                    • References

          0

          10K

          20K

          30K

          40K

          1 4 8 16 64 100 400 1000 3200

          Thro

          ughp

          ut [

          req

          s]

          Workload Concurrency [ of Connections]

          SingleT-AsyncsTomcat-Sync

          sTomcat-Async-FixsTomcat-Async

          (a) Throughput when response size is 01KB

          0

          2K

          4K

          6K

          8K

          10K

          1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

          SingleT-AsyncsTomcat-Sync

          sTomcat-Async-FixsTomcat-Async

          (b) Throughput when response size is 10KB

          0

          200

          400

          600

          1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

          SingleT-AsyncsTomcat-Sync

          sTomcat-Async-FixsTomcat-Async

          (c) Throughput when response size is 100KB

          0

          40K

          80K

          120K

          160K

          1 4 8 16 64 100 400 1000 3200

          Cont

          ext

          Switc

          hing

          [s

          ]

          Workload Concurrency [ of Connections]

          SingleT-AsyncsTomcat-Sync

          sTomcat-Async-FixsTomcat-Async

          (d) Context switch when response size is 01KB

          0

          20K

          40K

          60K

          80K

          100K

          1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

          SingleT-AsyncsTomcat-Sync

          sTomcat-Async-FixsTomcat-Async

          (e) Context switch when response size is 10KB

          0

          2K

          4K

          6K

          8K

          10K

          1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

          SingleT-AsyncsTomcat-Sync

          sTomcat-Async-FixsTomcat-Async

          (f) Context switch when response size is 100KBFig 4 Throughput and context switch comparison among different server architectures as the server response sizeincreases from 01KB to 100KB (a) and (d) show that the maximum achievable throughput by each server type is negativelycorrelated with their context switch freqency when the server response size is small (01KB) However as the response sizeincreases to 100KB (c) shows sTomcat-Sync outperforms other asynchronous servers before the workload concurrency400 indicating factors other than context switches cause overhead in asynchronous servers

          We note that as the server response size becomes larger theportion of CPU overhead caused by context switches becomessmaller since more CPU cycles will be consumed by process-ing request and sending response This is the case as shownin Figure 4(b) and 4(c) where the response sizes are 10KBand 100KB respectively The throughput difference becomesnarrower among the four server architectures indicating lessperformance impact from context switches

          In fact one interesting phenomenon has been observedas the response size increases to 100KB Figure 4(c) showsthat SingleT-Async performs worse than the thread-basedsTomcat-Sync before the workload concurrency 400 eventhough SingleT-Async has much less context switchesthan sTomcat-Sync as shown in Figure 4(f) Such obser-vation suggests that there are other factors causing overheadin asynchronous SingleT-Async but not in thread-basedsTomcat-Sync when the server response size is largewhich we will discuss in the next section

          IV WRITE-SPIN PROBLEM OF ASYNCHRONOUSINVOCATION

          In this section we study the performance degradation prob-lem of an asynchronous server sending a large size responseWe use fine-grained profiling tools such as Collectl [2] andJProfiler [4] to analyze the detailed CPU usage and some keysystem calls invoked by servers with different architecturesAs we found out that it is the default small TCP sendbuffer size and the TCP wait-ACK mechanism that leadsto a severe write-spin problem when sending a relativelylarge size response which causes significant CPU overheadfor asynchronous servers We also explored several network-

          related factors that could exacerbate the negative impact of thewrite-spin problem which further degrades the performance ofan asynchronous server

          A Profiling Results

          Recall that Figure 4(a) shows when the response sizeis small (ie 01KB) the throughput of the asynchronousSingleT-Async is 20 higher than the thread-basedsTomcat-Sync at workload concurrency 8 However as re-sponse size increases to 100KB SingleT-Async through-put is surprisingly 31 lower than sTomcat-Sync underthe same workload concurrency 8 (see Figure 4(c)) Sincethe only change is the response size it is natural to specu-late that large response size brings significant overhead forSingleT-Async but not for sTomcat-Sync

          To investigate the performance degradation ofSingleT-Async when the response size is large wefirst use Collectl [2] to analyze the detailed CPU usageof the server with different server response sizes asshown in Table III The workload concurrency for bothSingleT-Async and sTomcat-Sync is 100 and theCPU is 100 utilized under this workload concurrencyAs the response size for both server architectures increasedfrom 01KB to 100KB the table shows the user-space CPUutilization of sTomcat-Sync increases 25 (from 55 to80) while 34 (from 58 to 92) for SingleT-AsyncSuch comparison suggests that increasing response size hasmore impact on the asynchronous SingleT-Async than thethread-based sTomcat-Sync in user-space CPU utilization

          We further use JProfiler [4] to profile the SingleT-Asynccase when the response size increases from 01KB to 100KB

          TABLE III SingleT-Async consumes more user-spaceCPU compared to sTomcat-Sync The workload concur-rency keeps 100

          Server Type sTomcat-Sync SingleT-Async

          Response Size 01KB 100KB 01KB 100KBThroughput [reqsec] 35000 590 42800 520User total 55 80 58 92System total 45 20 42 8

          TABLE IV The write-spin problem occurs when theresponse size is 100KB This table shows the measurementof total number of socketwrite() in SingleT-Asyncwith different response size during a one-minute experiment

          Resp size req write()socketwrite() per req

          01KB 238530 238530 110KB 9400 9400 1100KB 2971 303795 102

          and see what has changed in application level We found thatthe frequency of socketwrite() system call is especiallyhigh in the 100KB case as shown in Table IV We note thatsocketwrite() is called when a server sends a responseback to the corresponding client In the case of a thread-based server like sTomcat-Sync socketwrite() iscalled only once for each client request While such onewrite per request is true for the 01KB and 10KB casein SingleT-Async it calls socketwrite() averagely102 times per request in the 100KB case System calls ingeneral are expensive due to the related kernel crossing over-head [20] [39] thus high frequency of socketwrite() inthe 100KB case helps explain high user-space CPU overheadin SingleT-Async as shown in Table III

          Our further analysis shows that the multiple socket writeproblem of SingleT-Async is due to the small TCP sendbuffer size (16K by default) for each TCP connection and theTCP wait-ACK mechanism When a processing thread triesto copy 100KB data from the user space to the kernel spaceTCP buffer through the system call socketwrite() thefirst socketwrite() can only copy at most 16KB data tothe send buffer which is organized as a byte buffer ring ATCP sliding window is set by the kernel to decide how muchdata can actually be sent to the client the sliding windowcan move forward and free up buffer space for new data to becopied in only if the server receives the ACKs of the previouslysent-out packets Since socketwrite() is a non-blockingsystem call in SingleT-Async every time it returns howmany bytes are written to the TCP send buffer the systemcall will return zero if the TCP send buffer is full leadingto the write-spin problem The whole process is illustratedin Figure 5 On the other hand when a worker thread in thesynchronous sTomcat-Sync tries to copy 100KB data fromthe user space to the kernel space TCP send buffer only oneblocking system call socketwrite() is invoked for eachrequest the worker thread will wait until the kernel sends the100KB response out and the write-spin problem is avoided

Fig. 5: Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, a worker thread write-spins on the socket.write() system call and can only send more data after ACKs for the previously sent packets come back from the client. (The timeline shows repeated write-to-socket system calls returning zero while the server waits for TCP ACKs, until the data copy to the kernel finishes.)

An intuitive solution is to increase the TCP send buffer size to the same size as the server response to avoid the write-spin problem. Our experimental results indeed show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance. For example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]. For example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results); all of this content can be pushed back by answering one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on the server, which may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
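For reference, a per-connection send buffer can be enlarged through the standard Java socket option SO_SNDBUF, as in the sketch below; the 100KB value is only an illustration matching our workload, and the kernel may clamp the request to its configured maximum.

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

public class SendBufferTuning {
    // Request a 100KB kernel send buffer for one connection; the kernel may
    // round or cap the value (e.g., by net.core.wmem_max on Linux), so read
    // the effective size back instead of assuming it.
    static int enlargeSendBuffer(SocketChannel ch) throws IOException {
        ch.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
        return ch.getOption(StandardSocketOptions.SO_SNDBUF);
    }
}
```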

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on runtime network conditions. Once turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, because the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-Delay Product rule [17], it lacks sufficient application information such as the response size.

Fig. 6: The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled. (Throughput [req/s] of SingleT-Async-100KB vs. SingleT-Async-autotuning under ~0ms, ~5ms, ~10ms, and ~20ms network latency.)

Therefore, the auto-tuned send buffer can be large enough to maximize throughput over the link yet still be inadequate for the application, which may still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the case with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger if there is non-trivial network latency between the client and the server, which is the topic of the next subsection.
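As a rough, illustrative calculation (the link speed and round-trip time below are assumed values, not measurements from our testbed), a buffer sized by the Bandwidth-Delay Product can easily be smaller than a single response:

```latex
\[
\mathrm{BDP} = \text{bandwidth} \times \mathrm{RTT}
             = 100\,\mathrm{Mbps} \times 1\,\mathrm{ms}
             \approx 12.5\,\mathrm{KB}
\qquad\Longrightarrow\qquad
\mathrm{BDP} \ll 100\,\mathrm{KB}\ \text{(response size)}
\]
% An auto-tuned buffer near the BDP saturates the link, yet the server still
% needs multiple write rounds per response, so the write-spin persists.
```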

          B Network Latency Exaggerates the Write-Spin Problem

Network latency is common in cloud data centers. The component servers of an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers, so the latency between them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by such network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 at all times. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from response time amplification when the write-spin problem happens. Sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each round of data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small increase in network latency can be amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency is added.

Fig. 7: Throughput degradation of two asynchronous servers in subfigure (a), resulting from the response time amplification in (b), as the network latency increases. ((a) Throughput [req/sec] and (b) response time [s] of SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync under ~0ms, ~5ms, ~10ms, and ~20ms network latency.)

According to Little's Law, a server's throughput is negatively correlated with its response time, given that the workload concurrency (the number of queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18s to 3.60s) means a 95% decrease in server throughput for SingleT-Async, as shown in Figure 7(a).
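The arithmetic behind this claim follows directly from Little's Law, N = X · R, where N is the workload concurrency, X the throughput, and R the response time:

```latex
\[
X = \frac{N}{R}, \qquad
X_{\sim 0\,\mathrm{ms}} = \frac{100}{0.18\,\mathrm{s}} \approx 556\ \mathrm{req/s}, \qquad
X_{\sim 5\,\mathrm{ms}} = \frac{100}{3.60\,\mathrm{s}} \approx 28\ \mathrm{req/s}
\]
\[
1 - \frac{X_{\sim 5\,\mathrm{ms}}}{X_{\sim 0\,\mathrm{ms}}} = 1 - \frac{0.18}{3.60} = 0.95
\]
% i.e., a 20x longer response time at fixed concurrency is a 95% throughput drop.
```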

          V SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem widely exist in other popular open-source asynchronous application servers and middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which mitigates the context switch overhead through an event processing flow optimization and mitigates the write-spin problem of asynchronous messaging through a write operation optimization, but at the cost of non-trivial optimization overhead. Then we propose a hybrid solution which takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

          A Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server: it uses a reactor thread to accept new connections and a worker thread pool to process the various I/O events from each connection.

Though it uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes the roles of the reactor thread and the worker threads.

Fig. 8: Netty mitigates the write-spin problem by runtime checking. The worker thread keeps writing only while (1) the returned size is non-zero, (2) writeSpin < threshold, and (3) the copied sum is less than the data size; if any condition fails, it jumps out of the loop and moves on to the next event.

Fig. 9: Throughput comparison of SingleT-Async, NettyServer, and sTomcat-Sync under workload concurrencies from 1 to 3200 connections, with the default 16KB TCP send buffer. (a) With 100KB responses, NettyServer performs the best, suggesting effective mitigation of the write-spin problem; (b) with 0.1KB responses, NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase); it then dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event; the output of each handler is the input to the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing unnecessary context switches between the reactor thread and the worker threads.
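The sketch below (a minimal illustration, not the NettyServer used in our experiments) shows both points in Netty 4's public API: a one-thread boss group that only accepts connections, worker event loops that both monitor and handle I/O for the connections assigned to them, and a per-channel pipeline of chained handlers; RequestDecoder and BusinessHandler are hypothetical placeholders.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class MinimalNettyServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // reactor: accepts connections only
        EventLoopGroup workers = new NioEventLoopGroup(); // each worker monitors and handles its own connections
        try {
            ServerBootstrap b = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // Handler chain (pipeline): each handler's output feeds the next,
                        // so no intermediate events or extra context switches are generated.
                        ch.pipeline()
                          .addLast(new RequestDecoder())   // hypothetical protocol decoder
                          .addLast(new BusinessHandler()); // hypothetical request handler
                    }
                });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }

    // Placeholder handlers: a real server would parse requests and build responses here.
    static class RequestDecoder extends ChannelInboundHandlerAdapter { }
    static class BusinessHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            ctx.writeAndFlush(msg); // echo back whatever was decoded
        }
    }
}
```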

In order to mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, denoted return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, the return_size is zero, indicating that the TCP send buffer is already full; second, the counter writeSpin exceeds a pre-defined threshold (the default value is 16 in Netty-v4). Once it jumps out, the worker thread saves the context and resumes the current connection's data transfer after it loops over the other connections with pending events. Such a write optimization prevents a worker thread from being blocked by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
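A minimal sketch of this bounded write loop is shown below. It mirrors the conditions of Figure 8, but it is our own illustration rather than Netty's internal implementation (Netty applies equivalent logic in its channel write path, with the threshold exposed as the channel's write-spin-count option).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class BoundedWriter {
    static final int WRITE_SPIN_THRESHOLD = 16; // default write-spin limit in Netty v4

    /**
     * Attempts to flush one response without monopolizing the event loop.
     * Returns true if the whole buffer was copied to the kernel, false if the
     * caller should save the context and come back to this connection later.
     */
    static boolean tryFlush(SocketChannel ch, ByteBuffer response) throws IOException {
        for (int spin = 0; spin < WRITE_SPIN_THRESHOLD && response.hasRemaining(); spin++) {
            int returnSize = ch.write(response); // non-blocking write
            if (returnSize == 0) {
                break; // TCP send buffer full: stop spinning, yield to other connections
            }
        }
        return !response.hasRemaining();
    }
}
```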

We validate the effectiveness of Netty in mitigating the write-spin problem, as well as the associated optimization overhead, in Figure 9. We built a simple application server based on Netty, named NettyServer. The figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% less than that of SingleT-Async in the 0.1KB response case, indicating the non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore, neither NettyServer nor SingleT-Async is able to achieve the best performance under all workload conditions.

          B A Hybrid Solution

In the previous section we showed that asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, there is no single asynchronous solution that always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

Fig. 10: Worker thread processing flow in the hybrid solution. In the event monitoring phase a worker thread select()s over the pool of connections with pending events; in the event handling phase it checks the request type and follows either the NettyServer path (parsing and encoding plus the write operation optimization) or the SingleT-Async path (parsing and encoding followed by a direct socket.write()), then gets the next connection.

The first assumption excludes initializing the server with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of their near-zero latency requirement [30]; for example, MemCached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem at runtime. In an initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus, when HybridNetty receives a new incoming request, it first checks the map object to determine which category the request belongs to and then chooses the most efficient execution path. In practice the response size even for the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object at runtime whenever a request is detected to be classified into the wrong category, in order to keep track of the latest category of such requests.
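A condensed sketch of this dispatch logic is shown below; the request key, the two path methods, and the class names are hypothetical stand-ins for the corresponding pieces of HybridNetty described above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class HybridDispatcher {
    // true  = "heavy" request type (observed to write-spin; use the NettyServer path)
    // false = "light" request type (use the SingleT-Async path)
    private final ConcurrentMap<String, Boolean> categoryMap = new ConcurrentHashMap<>();

    void handle(Request req) {
        boolean heavy = categoryMap.getOrDefault(req.typeKey(), false);
        // Each path sends the response and reports whether a write-spin was observed.
        boolean spun = heavy ? nettyStylePath(req) : singleThreadedPath(req);
        if (spun != heavy) {
            // Misclassified (e.g., the response size changed at runtime): update the
            // map so future requests of this type take the more efficient path.
            categoryMap.put(req.typeKey(), spun);
        }
    }

    // Placeholders for the two execution paths of Figure 10.
    private boolean nettyStylePath(Request req)     { /* write with spin check */ return false; }
    private boolean singleThreadedPath(Request req) { /* plain non-blocking write */ return false; }

    interface Request { String typeKey(); }
}
```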

          C Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: heavy requests, which have large response sizes (e.g., 100KB), and light requests, which have small response sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; in all other cases HybridNetty performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution, where light requests dominate the workload [22], our hybrid solution is even more attractive for realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

          VI RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than an asynchronous event-driven one. For example, von Behren et al. developed a thread-based web server, Knot [40], which can compete with event-driven servers under high-concurrency workloads by using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

The optimization of asynchronous event-driven servers falls into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts are motivated by reducing the overhead incurred by system calls such as select, poll, epoll, or I/O operations under high-concurrency workloads. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

Fig. 11: The hybrid solution performs the best under different mixes of light/heavy requests, with or without network latency. The workload concurrency is kept at 100 in all cases; to clearly show the throughput difference, we compare normalized throughput and use HybridNetty as the baseline. ((a) No network latency and (b) ~5ms network latency between client and server; both plot the normalized throughput of SingleT-Async, NettyServer, and HybridNetty as the ratio of large-size responses grows from 0% to 100%.)

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µServer) and pipelined (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µServer by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections. Our work complements their research but focuses on more general network conditions.

          VII CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, resulting from an inferior event processing flow which creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers: the write-spin problem when handling large responses, and its associated exaggerating factors such as network latency (Section IV). Since no single solution fits all conditions, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

          ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu.

Fig. 12: Details of the RUBBoS experimental setup: (a) software and hardware setup; (b) a sample 1/1/1 topology in which clients send HTTP requests through the Apache, Tomcat, and MySQL tiers.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Section II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

          REFERENCES

[1] Apache JMeter. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] ADLER, S. The Slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289-302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45-58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43-57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20-20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126-134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241-252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26-33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135-148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1-1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358-380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167-181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11-11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1-7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA, 2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189-202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385-398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15-15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231-243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33-46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4-4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268-281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230-243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239-252.

          • Introduction
          • Background and Motivation
            • RPC vs Asynchronous Network IO
            • Performance Degradation after Tomcat Upgrade
              • Inefficient Event Processing Flow in Asynchronous Servers
              • Write-Spin Problem of Asynchronous Invocation
                • Profiling Results
                • Network Latency Exaggerates the Write-Spin Problem
                  • Solution
                    • Mitigating Context Switches and Write-Spin Using Netty
                    • A Hybrid Solution
                    • Validation of HybridNetty
                      • Related Work
                      • Conclusions
                      • Appendix A RUBBoS Experimental Setup
                      • References

            TABLE III SingleT-Async consumes more user-spaceCPU compared to sTomcat-Sync The workload concur-rency keeps 100

            Server Type sTomcat-Sync SingleT-Async

            Response Size 01KB 100KB 01KB 100KBThroughput [reqsec] 35000 590 42800 520User total 55 80 58 92System total 45 20 42 8

            TABLE IV The write-spin problem occurs when theresponse size is 100KB This table shows the measurementof total number of socketwrite() in SingleT-Asyncwith different response size during a one-minute experiment

            Resp size req write()socketwrite() per req

            01KB 238530 238530 110KB 9400 9400 1100KB 2971 303795 102

            and see what has changed in application level We found thatthe frequency of socketwrite() system call is especiallyhigh in the 100KB case as shown in Table IV We note thatsocketwrite() is called when a server sends a responseback to the corresponding client In the case of a thread-based server like sTomcat-Sync socketwrite() iscalled only once for each client request While such onewrite per request is true for the 01KB and 10KB casein SingleT-Async it calls socketwrite() averagely102 times per request in the 100KB case System calls ingeneral are expensive due to the related kernel crossing over-head [20] [39] thus high frequency of socketwrite() inthe 100KB case helps explain high user-space CPU overheadin SingleT-Async as shown in Table III

            Our further analysis shows that the multiple socket writeproblem of SingleT-Async is due to the small TCP sendbuffer size (16K by default) for each TCP connection and theTCP wait-ACK mechanism When a processing thread triesto copy 100KB data from the user space to the kernel spaceTCP buffer through the system call socketwrite() thefirst socketwrite() can only copy at most 16KB data tothe send buffer which is organized as a byte buffer ring ATCP sliding window is set by the kernel to decide how muchdata can actually be sent to the client the sliding windowcan move forward and free up buffer space for new data to becopied in only if the server receives the ACKs of the previouslysent-out packets Since socketwrite() is a non-blockingsystem call in SingleT-Async every time it returns howmany bytes are written to the TCP send buffer the systemcall will return zero if the TCP send buffer is full leadingto the write-spin problem The whole process is illustratedin Figure 5 On the other hand when a worker thread in thesynchronous sTomcat-Sync tries to copy 100KB data fromthe user space to the kernel space TCP send buffer only oneblocking system call socketwrite() is invoked for eachrequest the worker thread will wait until the kernel sends the100KB response out and the write-spin problem is avoided

            Client

            Receive Buffer Read from socket

            Parsing and encoding

            Write to socket

            sum lt data size

            syscall

            syscall

            syscall

            Return of bytes written

            Return zero

            Un

            expected

            Wait fo

            r TCP

            AC

            Ks

            Server

            Tim

            e

            Write to socket

            sum lt data size

            Write to socket

            sum lt data size

            The worker thread write-spins until ACKs come back from client

            Data Copy to Kernel Finish

            Return of bytes written

            Fig 5 Illustration of the write-spin problem in an asyn-chronous server Due to the small TCP send buffer size andthe TCP wait-ACK mechanism a worker thread write-spins onthe system call socketwrite() and can only send moredata until ACKs back from the client for previous sent packets

            An intuitive solution is to increase the TCP send buffer sizeto the same size as the server response to avoid the write-spinproblem Our experimental results actually show the effective-ness of manually increasing the TCP send buffer size to solvethe write-spin problem for our RUBBoS workload Howeverseveral factors make setting a proper TCP send buffer sizea non-trivial challenge in practice First the response size ofan internet server can be dynamic and is difficult to predictin advance For example the response of a Tomcat servermay involve dynamic content retrieved from the downstreamdatabase the size of which can range from hundreds of bytesto megabytes Second HTTP20 enables a web server to pushmultiple responses for a single client request which makes theresponse size for a client request even more unpredictable [19]For example the response of a typical news website (egCNNcom) can easily reach tens of megabytes resulting froma large amount of static and dynamic content (eg imagesand database query results) all these content can be pushedback by answering one client request Third setting a largeTCP send buffer for each TCP connection to prepare for thepeak response size consumes a large amount of memory ofthe server which may serve hundreds or thousands of endusers (each has one or a few persistent TCP connections) suchover-provisioning strategy is expensive and wastes computingresources in a shared cloud computing platform Thus it ischallenging to set a proper TCP send buffer size in advanceand prevent the write-spin problem

            In fact Linux kernels above 24 already provide an auto-tuning function for TCP send buffer size based on the runtimenetwork conditions Once turned on the kernel dynamicallyresizes a serverrsquos TCP send buffer size to provide optimizedbandwidth utilization [25] However the auto-tuning functionaims to efficiently utilize the available bandwidth of the linkbetween the sender and the receiver based on Bandwidth-

            0

            100

            200

            300

            400

            ~0ms ~5ms ~10ms ~20ms

            Thro

            ughp

            ut [

            req

            s]

            Network Latency

            SingleT-Async-100KBSingleT-Async-autotuning

            Fig 6 Write-spin problem still exists when TCP sendbuffer ldquoautotuningrdquo feature enabled

            Delay Product rule [17] it lacks sufficient application infor-mation such as response size Therefore the auto-tuned sendbuffer could be enough to maximize the throughput over thelink but still inadequate for applications which may still causethe write-spin problem for asynchronous servers Figure 6shows SingleT-Async with auto-tunning performs worsethan the other case with a fixed large TCP send buffer size(100kB) suggesting the occurrence of the write-spin problemOur further study also shows the performance difference iseven bigger if there is non-trivial network latency between theclient and the server which is the topic of the next subsection

            B Network Latency Exaggerates the Write-Spin Problem

            Network latency is common in cloud data centers Con-sidering the component servers in an n-tier application thatmay run on VMs located in different physical nodes acrossdifferent racks or even data centers which can range froma few milliseconds to tens of milliseconds Our experimentalresults show that the negative impact of the write-spin problemcan be significantly exacerbated by the network latency

            The impact of network latency on the performance ofdifferent types of servers is shown in Figure 7 In this set ofexperiments we keep the workload concurrency from clients tobe 100 all the time The response size of each client request is100KB the TCP send buffer size of each server is the default16KB with which an asynchronous server encounters thewrite-spin problem We use the Linux command ldquotc(TrafficControl)rdquo in the client side to control the network latencybetween the client and the server Figure 7(a) shows that thethroughput of the asynchronous servers SingleT-Asyncand sTomcat-Async-Fix is sensitive to network latencyFor example when the network latency is 5ms the throughputof SingleT-Async decreases by about 95 which issurprising considering the small amount of latency increased

            We found that the surprising throughput degradation resultsfrom the response time amplification when the write-spinproblem happens This is because sending a relatively largesize response requires multiple rounds of data transfer due tothe small TCP send buffer size each data transfer has to waituntil the server receives the ACKs from the previously sent-outpackets (see Figure 5) Thus a small network latency increasecan amplify a long delay for completing one response transferSuch response time amplification for asynchronous servers canbe seen in Figure 7(b) For example the average response timeof SingleT-Async for a client request increases from 018seconds to 360 seconds when 5 milliseconds network latency

            0

            200

            400

            600

            800

            ~0ms ~5ms ~10ms ~20ms

            Thro

            ughp

            ut [

            req

            sec]

            Network Latency

            SingleT-AsyncsTomcat-Async-Fix

            sTomcat-Sync

            (a) Throuthput comparison

            0

            3

            6

            9

            12

            15

            ~0ms ~5ms ~10ms ~20ms

            Res

            pons

            e Ti

            me

            [s]

            Network Latency

            SingleT-AsyncsTomcat-Async-Fix

            sTomcat-Sync

            (b) Response time comparisonFig 7 Throughput degradation of two asynchronousservers in subfigure (a) resulting from the response timeamplification in (b) as the network latency increases

            is added According to Littlersquos Law a serverrsquos throughput isnegatively correlated with the response time of the server giventhat the workload concurrency (queued requests) keeps thesame Since we always keep the workload concurrency foreach server to be 100 server response time increases 20 times(from 018 to 360) means 95 decrease in server throughputin SingleT-Async as shown in Figure 7(a)

            V SOLUTION

            So far we have discussed two problems of asynchronousinvocation the context switch problem caused by ineffi-cient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response sizeand the TCP wait-ACK mechanism (see Figure 5) Thoughour research is motivated by the performance degradationof the latest asynchronous Tomcat we found that the in-appropriate event processing flow and the write-spin prob-lems widely exist in other popular open-source asynchronousapplication serversmiddleware including network frameworkGrizzly [10] and application server Jetty [3]

            An ideal asynchronous server architecture should avoid bothproblems under various workload and network conditions Wefirst investigate a popular asynchronous network IO librarynamed Netty [7] which is supposed to mitigate the contextswitch overhead through an event processing flow optimiza-tion and the write-spin problem of asynchronous messagingthrough write operation optimization but with non-trivialoptimization overhead Then we propose a hybrid solutionwhich takes advantage of different types of asynchronousservers aiming to solve both the context switch overhead andthe write-spin problem while avoid the optimization overhead

            A Mitigating Context Switches and Write-Spin Using Netty

            Netty is an asynchronous event-driven network IO frame-work which provides optimized read and write operations inorder to mitigate the context switch overhead and the write-spin problem Netty adopts the second design strategy (seeSection II-A) to support an asynchronous server using areactor thread to accept new connections and a worker threadpool to process the various IO events from each connection

            Though using a worker thread pool Netty makes two signif-icant changes compared to the asynchronous TomcatAsyncto reduce the context switch overhead First Netty changes

            syscall Write to socket

            Conditions

            1 Return_size = 0 ampamp 2 writeSpin lt TH ampamp 3 sum lt data size

            Application ThreadKernel

            Next Event Processing

            Data Copy to Kernel Finish

            True

            Data Sending Finish

            Data to be written

            False

            False

            syscall

            Fig 8 Netty mitigates the write-spin problem by runtimechecking The write spin jumps out of the loop if any of thethree conditions is not met

            0

            200

            400

            600

            800

            1 4 8 16 64 100 400 10003200

            Thr

            ough

            put [

            req

            s]

            Workload Concurrency [ of Connections]

            SingleT-AsyncNettyServer

            sTomcat-Sync

            (a) Response size is 100KB

            0

            10K

            20K

            30K

            40K

            1 4 8 16 64 100 400 10003200

            Thr

            ough

            put [

            req

            s]

            Workload Concurrency [ of Connections]

            SingleT-AsyncNettyServer

            sTomcat-Sync

            (b) Response size is 01KB

            Fig 9 Throughput comparison under various workloadconcurrencies and response sizes The default TCP sendbuffer size is 16KB Subfigure (a) shows that NettyServerperforms the best suggesting effective mitigation of the write-spin problem and (b) shows that NettyServer performsworse than SingleT-Async indicating non-trivial writeoptimization overhead in Netty

            the role of the reactor thread and the worker threads Inthe asynchronous TomcatAsync case the reactor threadis responsible to monitor events for each connection (eventmonitoring phase) then it dispatches each event to an avail-able worker thread for proper event handling (event handlingphase) Such dispatching operation always involves the contextswitches between the reactor thread and a worker threadNetty optimizes this dispatching process by letting a workerthread take care of both event monitoring and handling thereactor thread only accepts new connections and assigns theestablished connections to each worker thread In this case thecontext switches between the reactor thread and the workerthreads are significantly reduced Second instead of havinga single event handler attached to each event Netty allows achain of handlers to be attached to one event the output ofeach handler is the input to the next handler (pipeline) Sucha design avoids generating unnecessary intermediate eventsand the associate system calls thus reducing the unnecessarycontext switches between reactor thread and worker threads

            In order to mitigate the write-spin problem Nettyadopts a write-spin checking when a worker thread callssocketwrite() to copy a large size response to the kernelas shown in Figure 8 Concretely each worker thread in Netty

            maintains a writeSpin counter to record how many timesit has tried to write a single response into the TCP sendbuffer For each write the worker thread also tracks how manybytes have been copied noted as return_size The workerthread will jump out the write spin if either of two conditionsis met first the return_size is zero indicating the TCPsend buffer is already full second the counter writeSpinexceeds a pre-defined threshold (the default value is 16 inNetty-v4) Once jumping out the worker thread will savethe context and resume the current connection data transferafter it loops over other connections with pending eventsSuch write optimization is able to mitigate the blocking ofthe worker thread by a connection transferring a large sizeresponse however it also brings non-trivial overhead whenall responses are small and there is no write-spin problem

            We validate the effectiveness of Netty for mitigating thewrite-spin problem and also the associate optimization over-head in Figure 9 We build a simple application serverbased on Netty named NettyServer This figure comparesNettyServer with the asynchronous SingleT-Asyncand the thread-based sTomcat-Sync under various work-load concurrencies and response sizes The default TCP sendbuffer size is 16KB so there is no write-spin problem when theresponse size is 01KB and severe write-spin problem in the100KB case Figure 9(a) shows that NettyServer performsthe best among three in the 100KB case for example when theworkload concurrency is 100 NettyServer outperformsSingleT-Async and sTomcat-Sync by about 27 and10 in throughput respectively suggesting NettyServerrsquoswrite optimization effectively mitigates the write-spin problemencountered by SingleT-Async and also avoids the heavymulti-threading overhead encountered by sTomcat-SyncOn the other hand Figure 9(b) shows that the maximumachievable throughput of NettyServer is 17 less than thatof SingleT-Async in the 01KB response case indicatingnon-trivial overhead of unnecessary write operation optimiza-tion when there is no write-spin problem Therefore neitherNettyServer nor SingleT-Async is able to achieve thebest performance under various workload conditions

            B A Hybrid Solution

            In the previous section we showed that the asynchronoussolutions if chosen properly (see Figure 9) can always out-perform the corresponding thread-based version under variousworkload conditions However there is no single asynchronoussolution that can always perform the best For exampleSingleT-Async suffers from the write-spin problem forlarge size responses while NettyServer suffers from theunnecessary write operation optimization overhead for smallsize responses In this section we propose a hybrid solutionwhich utilizes both SingleT-Async and NettyServerand adapts to workload and network conditions

            Our hybrid solution is based on two assumptionsbull The response size of the server is unpredictable and can

            vary during runtimebull The workload is in-memory workload

            Select()

            Pool of connections with pending events

            Conn is available

            Check req type

            Parsing and encoding

            Parsing and encoding

            Write operation optimization

            Socketwrite()

            Return to Return to

            No

            NettyServer SingleT-Async

            Event Monitoring Phase

            Event Handling Phase

            Yes

            get next conn

            Fig 10 Worker thread processing flow in Hybrid solution

            The first assumption excludes the server being initiated witha large but fixed TCP send buffer size for each connectionin order to avoid the write-spin problem This assumption isreasonable because of the factors (eg dynamically generatedresponse and the push feature in HTTP20) we have discussedin Section IV-A The second assumption excludes a workerthread being blocked by disk IO activities This assumption isalso reasonable since in-memory workload becomes commonfor modern internet services because of near-zero latencyrequirement [30] for example MemCached server has beenwidely adopted to reduce disk activities [36] The solutionfor more complex workloads that involve frequent disk IOactivities is challenging and will require additional research

            The main idea of the hybrid solution is to take advan-tage of different asynchronous server architectures such asSingleT-Async and NettyServer to handle requestswith different response sizes and network conditions as shownin Figure 10 Concretely our hybrid solution which we callHybridNetty profiles different types of requests based onwhether or not the response causes a write-spin problem duringthe runtime In initial warm-up phase (ie workload is low)HybridNetty uses the writeSpin counter of the originalNetty to categorize all requests into two categories the heavyrequests that can cause the write-spin problem and the lightrequests that can not HybridNetty maintains a map objectrecording which category a request belongs to Thus whenHybridNetty receives a new incoming request it checks themap object first and figures out which category it belongs toand then chooses the most efficient execution path In practicethe response size even for the same type of requests maychange over time (due to runtime environment changes suchas dataset) so we update the map object during runtime oncea request is detected to be classified into a wrong category inorder to keep track of the latest category of such requests

            C Validation of HybridNetty

            To validate the effectiveness of our hybrid solution Fig-ure 11 compares HybridNetty with SingleT-Asyncand NettyServer under various workload conditions andnetwork latencies Our workload consists of two classes ofrequests the heavy requests which have large response sizes(eg 100KB) and the light requests which have small response

            size (eg 01KB) heavy requests can cause the write-spinproblem while light requests can not We increase the percent-age of heavy requests from 0 to 100 in order to simulatingdifferent scenarios of realistic workloads The workload con-currency from clients in all cases keeps 100 under which theserver CPU is 100 utilized To clearly show the effectivenessof our hybrid solution we adopt the normalized throughputcomparison and use the HybridNetty throughput as thebaseline Figure 11(a) and 11(b) show that HybridNettybehaves the same as SingleT-Async when all requests arelight (0 heavy requests) and the same as NettyServerwhen all requests are heavy other than that HybridNettyalways performs the best For example Figure 11(a) showsthat when the heavy requests reach to 5 HybridNettyachieves 30 higher throughput than SingleT-Async and10 higher throughput than NettyServer This is becauseHybridNetty always chooses the most efficient path to pro-cess request Considering that the distribution of requests forreal web applications typically follows a Zipf-like distributionwhere light requests dominate the workload [22] our hybridsolution makes more sense in dealing with realistic workloadIn addition SingleT-Async performs much worse than theother two cases when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b))This is because of the write-spin problem exacerbated bynetwork latency (see Section IV-B for more details)

            VI RELATED WORK

            Previous research has shown that a thread-based serverif implemented properly can achieve the same or even bet-ter performance as the asynchronous event-driven one doesFor example Von et al develop a thread-based web serverKnot [40] which can compete with event-driven servers at highconcurrency workload using a scalable user-level threadingpackage Capriccio [41] However Krohn et al [32] show thatCapriccio is a cooperative threading package that exports thePOSIX thread interface but behaves like events to the underly-ing operating system The authors of Capriccio also admit thatthe thread interface is still less flexible than events [40] Theseprevious research results suggest that the asynchronous event-driven architecture will continue to play an important role inbuilding high performance and resource efficiency servers thatmeet the requirements of current cloud data centers

The optimizations for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the interfaces of network I/O for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, epoll, or I/O operations under high-concurrency workloads. For example, to avoid the kernel-crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

[Figure 11 plots normalized throughput versus the ratio of large-size responses (0%, 2%, 5%, 10%, 20%, 100%) for SingleT-Async, NettyServer, and HybridNetty: (a) no network latency between client and server; (b) ~5ms network latency between client and server.]

Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workloads, with or without network latency. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare the normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µserver) and pipeline-based (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µserver by modifying the strategy of accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections. Our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than their thread-based counterparts, resulting from an inferior event processing flow that creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers: the write-spin problem when handling large responses, and the associated exaggerating factors such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload and network conditions into consideration.

            ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation under CISE's CNS grant 1566443, by the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and by gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

[Figure 12 shows (a) the software and hardware setup and (b) the 1/1/1 sample topology, in which clients send HTTP requests to Apache, backed by Tomcat and MySQL.]

Fig. 12: Details of the RUBBoS experimental setup.

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. Such a workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.
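The sketch below outlines how such an emulated user could be structured, assuming placeholder page names and a made-up transition matrix rather than the actual RUBBoS interaction set and probabilities.

```java
import java.util.Random;

// Sketch of a RUBBoS-style emulated user: navigate between pages following a
// Markov chain and pause ~7 seconds of "think time" between page requests.
// The pages and transition probabilities below are placeholders.
public class EmulatedUser implements Runnable {

    private static final String[] PAGES = {"StoriesOfTheDay", "ViewStory", "PostComment"};
    // TRANSITION[i][j] = probability of moving from page i to page j (rows sum to 1).
    private static final double[][] TRANSITION = {
            {0.2, 0.7, 0.1},
            {0.3, 0.4, 0.3},
            {0.6, 0.3, 0.1}
    };
    private static final long THINK_TIME_MS = 7000; // ~7-second think time

    private final Random rng = new Random();
    private int current = 0; // start at the first page

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                issueRequest(PAGES[current]);  // download the current page
                Thread.sleep(THINK_TIME_MS);   // think time before the next click
                current = nextPage(current);   // follow the Markov chain
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private int nextPage(int from) {
        double r = rng.nextDouble(), sum = 0.0;
        for (int to = 0; to < TRANSITION[from].length; to++) {
            sum += TRANSITION[from][to];
            if (r < sum) return to;
        }
        return from;
    }

    private void issueRequest(String page) { /* send the HTTP request to the web tier */ }
}
```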

            REFERENCES

[1] Apache JMeter. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] ADLER, S. The Slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext Transfer Protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126–134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135–148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11–11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In Proceedings of the 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1–7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA, 2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33–46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4–4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.


              The main idea of the hybrid solution is to take advan-tage of different asynchronous server architectures such asSingleT-Async and NettyServer to handle requestswith different response sizes and network conditions as shownin Figure 10 Concretely our hybrid solution which we callHybridNetty profiles different types of requests based onwhether or not the response causes a write-spin problem duringthe runtime In initial warm-up phase (ie workload is low)HybridNetty uses the writeSpin counter of the originalNetty to categorize all requests into two categories the heavyrequests that can cause the write-spin problem and the lightrequests that can not HybridNetty maintains a map objectrecording which category a request belongs to Thus whenHybridNetty receives a new incoming request it checks themap object first and figures out which category it belongs toand then chooses the most efficient execution path In practicethe response size even for the same type of requests maychange over time (due to runtime environment changes suchas dataset) so we update the map object during runtime oncea request is detected to be classified into a wrong category inorder to keep track of the latest category of such requests

              C Validation of HybridNetty

              To validate the effectiveness of our hybrid solution Fig-ure 11 compares HybridNetty with SingleT-Asyncand NettyServer under various workload conditions andnetwork latencies Our workload consists of two classes ofrequests the heavy requests which have large response sizes(eg 100KB) and the light requests which have small response

              size (eg 01KB) heavy requests can cause the write-spinproblem while light requests can not We increase the percent-age of heavy requests from 0 to 100 in order to simulatingdifferent scenarios of realistic workloads The workload con-currency from clients in all cases keeps 100 under which theserver CPU is 100 utilized To clearly show the effectivenessof our hybrid solution we adopt the normalized throughputcomparison and use the HybridNetty throughput as thebaseline Figure 11(a) and 11(b) show that HybridNettybehaves the same as SingleT-Async when all requests arelight (0 heavy requests) and the same as NettyServerwhen all requests are heavy other than that HybridNettyalways performs the best For example Figure 11(a) showsthat when the heavy requests reach to 5 HybridNettyachieves 30 higher throughput than SingleT-Async and10 higher throughput than NettyServer This is becauseHybridNetty always chooses the most efficient path to pro-cess request Considering that the distribution of requests forreal web applications typically follows a Zipf-like distributionwhere light requests dominate the workload [22] our hybridsolution makes more sense in dealing with realistic workloadIn addition SingleT-Async performs much worse than theother two cases when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b))This is because of the write-spin problem exacerbated bynetwork latency (see Section IV-B for more details)

              VI RELATED WORK

              Previous research has shown that a thread-based serverif implemented properly can achieve the same or even bet-ter performance as the asynchronous event-driven one doesFor example Von et al develop a thread-based web serverKnot [40] which can compete with event-driven servers at highconcurrency workload using a scalable user-level threadingpackage Capriccio [41] However Krohn et al [32] show thatCapriccio is a cooperative threading package that exports thePOSIX thread interface but behaves like events to the underly-ing operating system The authors of Capriccio also admit thatthe thread interface is still less flexible than events [40] Theseprevious research results suggest that the asynchronous event-driven architecture will continue to play an important role inbuilding high performance and resource efficiency servers thatmeet the requirements of current cloud data centers

              The optimization for asynchronous event-driven servers canbe divided into two broad categories improving operatingsystem support and tuning software configurations

              Improving operating system support mainly focuseson either refining underlying event notification mecha-nisms [18] [34] or simplifying the interfaces of network IOfor application level asynchronous programming [27] Theseresearch efforts have been motivated by reducing the overheadincurred by system calls such as select poll epoll or IOoperations under high concurrency workload For example toavoid the kernel crossings overhead caused by system callsTUX [34] is implemented as a kernel-based web server byintegrating the event monitoring and handling into the kernel

              0

              02

              04

              06

              08

              10

              0 2 5 10 20 100

              Nor

              mal

              ized

              Thr

              ough

              put

              Ratio of Large Size Response

              SingleT-Async NettyServer HybridNetty

              (a) No network latency between client and server

              0

              02

              04

              06

              08

              10

              0 2 5 10 20 100

              Nor

              mal

              ized

              Thr

              ough

              put

              Ratio of Large Size Response

              SingleT-Async NettyServer HybridNetty

              (b) sim5ms network latency between client and server

              Fig 11 Hybrid solution performs the best in different mixes of lightheavy request workload with or without networklatency The workload concurrency keeps 100 in all cases To clearly show the throughput difference we compare the normalizedthroughput and use HybridNetty as the baseline

              Tuning software configurations to improve asynchronousweb serversrsquo performance has also been studied before Forexample Pariag et al [38] show that the maximum achievablethroughput of event-driven (microServer) and pipeline (WatPipe)servers can be significantly improved by carefully tuning thenumber of simultaneous TCP connections and blockingnon-blocking sendfile system call Brecht et al [21] improvethe performance of event-driven microServer by modifying thestrategy of accepting new connections based on differentworkload characteristics Our work is closely related to Googleteamrsquos research about TCPrsquos congestion window [24] Theyshow that increasing TCPrsquos initial congestion window to atleast ten segments (about 15KB) can improve average latencyof HTTP responses by approximately 10 in large-scaleInternet experiments However their work mainly focuses onshort-lived TCP connections Our work complements theirresearch but focuses on more general network conditions

              VII CONCLUSIONS

              We studied the performance impact of asynchronous in-vocation on client-server systems Through realistic macro-and micro-benchmarks we showed that servers with theasynchronous event-driven architecture may perform signif-icantly worse than the thread-based version resulting fromthe inferior event processing flow which creates high contextswitch overhead (Section II and III) We also studied a generalproblem for all the asynchronous event-driven servers thewrite-spin problem when handling large size responses andthe associate exaggeration factors such as network latency(Section IV) Since there is no one solution fits all weprovide a hybrid solution by utilizing different asynchronousarchitectures to adapt to various workload and network condi-tions (Section V) More generally our research suggests thatbuilding high performance asynchronous event-driven serversneeds to take both the event processing flow and the runtimevarying workloadnetwork conditions into consideration

              ACKNOWLEDGMENT

              This research has been partially funded by National ScienceFoundation by CISErsquos CNS (1566443) Louisiana Board ofRegents under grant LEQSF(2015-18)-RD-A-11 and gifts or

              HTTP Requests

              Apache Tomcat MySQLClients

              (b) 111 Sample Topology

              (a) Software and Hardware Setup

              Fig 12 Details of the RUBBoS experimental setup

              grants from Fujitsu Any opinions findings and conclusionsor recommendations expressed in this material are those ofthe author(s) and do not necessarily reflect the views of theNational Science Foundation or other funding agencies andcompanies mentioned above

              APPENDIX ARUBBOS EXPERIMENTAL SETUP

              We adopt the RUBBoS standard n-tier benchmark whichis modeled after the famous news website Slashdot Theworkload consists of 24 different web interactions The defaultworkload generator emulates a number of users interactingwith the web application layer Each userrsquos behavior follows aMarkov chain model to navigate between different web pagesthe think time between receiving a web page and submitting anew page download request is about 7-second Such workloadgenerator has a similar design as other standard n-tier bench-marks such as RUBiS [12] TPC-W [14] and Cloudstone [29]We run the RUBBoS benchmark on our testbed Figure 12outlines the software configurations hardware configurationsand a sample 3-tier topology used in the Subsection II-Bexperiments Each server in the 3-tier topology is deployedin a dedicated machine All other client-server experimentsare conducted with one client and one server machine

              REFERENCES

              [1] Apache JMeterTM httpjmeterapacheorg[2] Collectl httpcollectlsourceforgenet[3] Jetty A Java HTTP (Web) Server and Java Servlet Container http

              wwweclipseorgjetty[4] JProfiler The award-winning all-in-one Java profiler rdquohttpswww

              ej-technologiescomproductsjprofileroverviewhtmlrdquo[5] lighttpd httpswwwlighttpdnet[6] MongoDB Async Java Driver httpmongodbgithubio

              mongo-java-driver35driver-async[7] Netty httpnettyio[8] Nodejs httpsnodejsorgen[9] Oracle GlassFish Server httpwwworaclecomtechnetwork

              middlewareglassfishoverviewindexhtml[10] Project Grizzly NIO Event Development Simplified httpsjavaee

              githubiogrizzly[11] RUBBoS Bulletin board benchmark httpjmobow2orgrubboshtml[12] RUBiS Rice University Bidding System httprubisow2org[13] sTomcat-NIO sTomcat-BIO and two alternative asynchronous

              servers httpsgithubcomsgzhangAsynMessaging[14] TPC-W A Transactional Web e-Commerce Benchmark httpwwwtpc

              orgtpcw[15] ADLER S The slashdot effect an analysis of three internet publications

              Linux Gazette 38 (1999) 2[16] ADYA A HOWELL J THEIMER M BOLOSKY W J AND

              DOUCEUR J R Cooperative task management without manual stackmanagement In Proceedings of the General Track of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2002) ATEC rsquo02 USENIX Association pp 289ndash302

              [17] ALLMAN M PAXSON V AND BLANTON E Tcp congestion controlTech rep 2009

              [18] BANGA G DRUSCHEL P AND MOGUL J C Resource containers Anew facility for resource management in server systems In Proceedingsof the Third Symposium on Operating Systems Design and Implemen-tation (Berkeley CA USA 1999) OSDI rsquo99 USENIX Associationpp 45ndash58

              [19] BELSHE M THOMSON M AND PEON R Hypertext transferprotocol version 2 (http2)

              [20] BOYD-WICKIZER S CHEN H CHEN R MAO Y KAASHOEK FMORRIS R PESTEREV A STEIN L WU M DAI Y ZHANGY AND ZHANG Z Corey An operating system for many coresIn Proceedings of the 8th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2008) OSDIrsquo08USENIX Association pp 43ndash57

              [21] BRECHT T PARIAG D AND GAMMO L Acceptable strategiesfor improving web server performance In Proceedings of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2004) ATEC rsquo04 USENIX Association pp 20ndash20

              [22] BRESLAU L CAO P FAN L PHILLIPS G AND SHENKER SWeb caching and zipf-like distributions Evidence and implicationsIn INFOCOMrsquo99 Eighteenth Annual Joint Conference of the IEEEComputer and Communications Societies Proceedings IEEE (1999)vol 1 IEEE pp 126ndash134

              [23] CANAS C ZHANG K KEMME B KIENZLE J AND JACOBSENH-A Publishsubscribe network designs for multiplayer games InProceedings of the 15th International Middleware Conference (NewYork NY USA 2014) Middleware rsquo14 ACM pp 241ndash252

              [24] DUKKIPATI N REFICE T CHENG Y CHU J HERBERT TAGARWAL A JAIN A AND SUTIN N An argument for increasingtcprsquos initial congestion window SIGCOMM Comput Commun Rev 403 (June 2010) 26ndash33

              [25] FISK M AND FENG W-C Dynamic right-sizing in tcp httplib-www lanl govla-pubs00796247 pdf (2001) 2

              [26] GARRETT J J ET AL Ajax A new approach to web applications[27] HAN S MARSHALL S CHUN B-G AND RATNASAMY S

              Megapipe A new programming interface for scalable network io InProceedings of the 10th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2012) OSDIrsquo12USENIX Association pp 135ndash148

              [28] HARJI A S BUHR P A AND BRECHT T Comparing high-performance multi-core web-server architectures In Proceedings of the5th Annual International Systems and Storage Conference (New YorkNY USA 2012) SYSTOR rsquo12 ACM pp 11ndash112

              [29] HASSAN O A-H AND SHARGABI B A A scalable and efficientweb 20 reader platform for mashups Int J Web Eng Technol 7 4(Dec 2012) 358ndash380

              [30] HUANG Q BIRMAN K VAN RENESSE R LLOYD W KUMAR SAND LI H C An analysis of facebook photo caching In Proceedingsof the Twenty-Fourth ACM Symposium on Operating Systems Principles(New York NY USA 2013) SOSP rsquo13 ACM pp 167ndash181

              [31] HUNT P KONAR M JUNQUEIRA F P AND REED B ZookeeperWait-free coordination for internet-scale systems In Proceedings of the2010 USENIX Conference on USENIX Annual Technical Conference(Berkeley CA USA 2010) USENIXATCrsquo10 USENIX Associationpp 11ndash11

              [32] KROHN M KOHLER E AND KAASHOEK M F Events can makesense In 2007 USENIX Annual Technical Conference on Proceedings ofthe USENIX Annual Technical Conference (Berkeley CA USA 2007)ATCrsquo07 USENIX Association pp 71ndash714

              [33] KROHN M KOHLER E AND KAASHOEK M F Simplified eventprogramming for busy network applications In Proceedings of the 2007USENIX Annual Technical Conference (Santa Clara CA USA (2007)

              [34] LEVER C ERIKSEN M A AND MOLLOY S P An analysis ofthe tux web server Tech rep Center for Information TechnologyIntegration 2000

              [35] LI C SHEN K AND PAPATHANASIOU A E Competitive prefetch-ing for concurrent sequential io In Proceedings of the 2Nd ACMSIGOPSEuroSys European Conference on Computer Systems 2007(New York NY USA 2007) EuroSys rsquo07 ACM pp 189ndash202

              [36] NISHTALA R FUGAL H GRIMM S KWIATKOWSKI M LEEH LI H C MCELROY R PALECZNY M PEEK D SAABP STAFFORD D TUNG T AND VENKATARAMANI V Scalingmemcache at facebook In Presented as part of the 10th USENIXSymposium on Networked Systems Design and Implementation (NSDI13) (Lombard IL 2013) USENIX pp 385ndash398

              [37] PAI V S DRUSCHEL P AND ZWAENEPOEL W Flash An efficientand portable web server In Proceedings of the Annual Conferenceon USENIX Annual Technical Conference (Berkeley CA USA 1999)ATEC rsquo99 USENIX Association pp 15ndash15

              [38] PARIAG D BRECHT T HARJI A BUHR P SHUKLA A ANDCHERITON D R Comparing the performance of web server archi-tectures In Proceedings of the 2Nd ACM SIGOPSEuroSys EuropeanConference on Computer Systems 2007 (New York NY USA 2007)EuroSys rsquo07 ACM pp 231ndash243

              [39] SOARES L AND STUMM M Flexsc Flexible system call schedulingwith exception-less system calls In Proceedings of the 9th USENIXConference on Operating Systems Design and Implementation (BerkeleyCA USA 2010) OSDIrsquo10 USENIX Association pp 33ndash46

              [40] VON BEHREN R CONDIT J AND BREWER E Why events area bad idea (for high-concurrency servers) In Proceedings of the 9thConference on Hot Topics in Operating Systems - Volume 9 (BerkeleyCA USA 2003) HOTOSrsquo03 USENIX Association pp 4ndash4

              [41] VON BEHREN R CONDIT J ZHOU F NECULA G C ANDBREWER E Capriccio Scalable threads for internet services InProceedings of the Nineteenth ACM Symposium on Operating SystemsPrinciples (New York NY USA 2003) SOSP rsquo03 ACM pp 268ndash281

              [42] WELSH M CULLER D AND BREWER E Seda An architecturefor well-conditioned scalable internet services In Proceedings of theEighteenth ACM Symposium on Operating Systems Principles (NewYork NY USA 2001) SOSP rsquo01 ACM pp 230ndash243

              [43] ZELDOVICH N YIP A DABEK F MORRIS R MAZIERES DAND KAASHOEK M F Multiprocessor support for event-drivenprograms In USENIX Annual Technical Conference General Track(2003) pp 239ndash252

              • Introduction
              • Background and Motivation
                • RPC vs Asynchronous Network IO
                • Performance Degradation after Tomcat Upgrade
                  • Inefficient Event Processing Flow in Asynchronous Servers
                  • Write-Spin Problem of Asynchronous Invocation
                    • Profiling Results
                    • Network Latency Exaggerates the Write-Spin Problem
                      • Solution
                        • Mitigating Context Switches and Write-Spin Using Netty
                        • A Hybrid Solution
                        • Validation of HybridNetty
                          • Related Work
                          • Conclusions
                          • Appendix A RUBBoS Experimental Setup
                          • References

                syscall Write to socket

                Conditions

                1 Return_size = 0 ampamp 2 writeSpin lt TH ampamp 3 sum lt data size

                Application ThreadKernel

                Next Event Processing

                Data Copy to Kernel Finish

                True

                Data Sending Finish

                Data to be written

                False

                False

                syscall

                Fig 8 Netty mitigates the write-spin problem by runtimechecking The write spin jumps out of the loop if any of thethree conditions is not met

                0

                200

                400

                600

                800

                1 4 8 16 64 100 400 10003200

                Thr

                ough

                put [

                req

                s]

                Workload Concurrency [ of Connections]

                SingleT-AsyncNettyServer

                sTomcat-Sync

                (a) Response size is 100KB

                0

                10K

                20K

                30K

                40K

                1 4 8 16 64 100 400 10003200

                Thr

                ough

                put [

                req

                s]

                Workload Concurrency [ of Connections]

                SingleT-AsyncNettyServer

                sTomcat-Sync

                (b) Response size is 01KB

                Fig 9 Throughput comparison under various workloadconcurrencies and response sizes The default TCP sendbuffer size is 16KB Subfigure (a) shows that NettyServerperforms the best suggesting effective mitigation of the write-spin problem and (b) shows that NettyServer performsworse than SingleT-Async indicating non-trivial writeoptimization overhead in Netty

                the role of the reactor thread and the worker threads Inthe asynchronous TomcatAsync case the reactor threadis responsible to monitor events for each connection (eventmonitoring phase) then it dispatches each event to an avail-able worker thread for proper event handling (event handlingphase) Such dispatching operation always involves the contextswitches between the reactor thread and a worker threadNetty optimizes this dispatching process by letting a workerthread take care of both event monitoring and handling thereactor thread only accepts new connections and assigns theestablished connections to each worker thread In this case thecontext switches between the reactor thread and the workerthreads are significantly reduced Second instead of havinga single event handler attached to each event Netty allows achain of handlers to be attached to one event the output ofeach handler is the input to the next handler (pipeline) Sucha design avoids generating unnecessary intermediate eventsand the associate system calls thus reducing the unnecessarycontext switches between reactor thread and worker threads

                In order to mitigate the write-spin problem Nettyadopts a write-spin checking when a worker thread callssocketwrite() to copy a large size response to the kernelas shown in Figure 8 Concretely each worker thread in Netty

                maintains a writeSpin counter to record how many timesit has tried to write a single response into the TCP sendbuffer For each write the worker thread also tracks how manybytes have been copied noted as return_size The workerthread will jump out the write spin if either of two conditionsis met first the return_size is zero indicating the TCPsend buffer is already full second the counter writeSpinexceeds a pre-defined threshold (the default value is 16 inNetty-v4) Once jumping out the worker thread will savethe context and resume the current connection data transferafter it loops over other connections with pending eventsSuch write optimization is able to mitigate the blocking ofthe worker thread by a connection transferring a large sizeresponse however it also brings non-trivial overhead whenall responses are small and there is no write-spin problem

                We validate the effectiveness of Netty for mitigating thewrite-spin problem and also the associate optimization over-head in Figure 9 We build a simple application serverbased on Netty named NettyServer This figure comparesNettyServer with the asynchronous SingleT-Asyncand the thread-based sTomcat-Sync under various work-load concurrencies and response sizes The default TCP sendbuffer size is 16KB so there is no write-spin problem when theresponse size is 01KB and severe write-spin problem in the100KB case Figure 9(a) shows that NettyServer performsthe best among three in the 100KB case for example when theworkload concurrency is 100 NettyServer outperformsSingleT-Async and sTomcat-Sync by about 27 and10 in throughput respectively suggesting NettyServerrsquoswrite optimization effectively mitigates the write-spin problemencountered by SingleT-Async and also avoids the heavymulti-threading overhead encountered by sTomcat-SyncOn the other hand Figure 9(b) shows that the maximumachievable throughput of NettyServer is 17 less than thatof SingleT-Async in the 01KB response case indicatingnon-trivial overhead of unnecessary write operation optimiza-tion when there is no write-spin problem Therefore neitherNettyServer nor SingleT-Async is able to achieve thebest performance under various workload conditions

                B A Hybrid Solution

                In the previous section we showed that the asynchronoussolutions if chosen properly (see Figure 9) can always out-perform the corresponding thread-based version under variousworkload conditions However there is no single asynchronoussolution that can always perform the best For exampleSingleT-Async suffers from the write-spin problem forlarge size responses while NettyServer suffers from theunnecessary write operation optimization overhead for smallsize responses In this section we propose a hybrid solutionwhich utilizes both SingleT-Async and NettyServerand adapts to workload and network conditions

                Our hybrid solution is based on two assumptionsbull The response size of the server is unpredictable and can

                vary during runtimebull The workload is in-memory workload

                Select()

                Pool of connections with pending events

                Conn is available

                Check req type

                Parsing and encoding

                Parsing and encoding

                Write operation optimization

                Socketwrite()

                Return to Return to

                No

                NettyServer SingleT-Async

                Event Monitoring Phase

                Event Handling Phase

                Yes

                get next conn

                Fig 10 Worker thread processing flow in Hybrid solution

                The first assumption excludes the server being initiated witha large but fixed TCP send buffer size for each connectionin order to avoid the write-spin problem This assumption isreasonable because of the factors (eg dynamically generatedresponse and the push feature in HTTP20) we have discussedin Section IV-A The second assumption excludes a workerthread being blocked by disk IO activities This assumption isalso reasonable since in-memory workload becomes commonfor modern internet services because of near-zero latencyrequirement [30] for example MemCached server has beenwidely adopted to reduce disk activities [36] The solutionfor more complex workloads that involve frequent disk IOactivities is challenging and will require additional research

                The main idea of the hybrid solution is to take advan-tage of different asynchronous server architectures such asSingleT-Async and NettyServer to handle requestswith different response sizes and network conditions as shownin Figure 10 Concretely our hybrid solution which we callHybridNetty profiles different types of requests based onwhether or not the response causes a write-spin problem duringthe runtime In initial warm-up phase (ie workload is low)HybridNetty uses the writeSpin counter of the originalNetty to categorize all requests into two categories the heavyrequests that can cause the write-spin problem and the lightrequests that can not HybridNetty maintains a map objectrecording which category a request belongs to Thus whenHybridNetty receives a new incoming request it checks themap object first and figures out which category it belongs toand then chooses the most efficient execution path In practicethe response size even for the same type of requests maychange over time (due to runtime environment changes suchas dataset) so we update the map object during runtime oncea request is detected to be classified into a wrong category inorder to keep track of the latest category of such requests

                C Validation of HybridNetty

                To validate the effectiveness of our hybrid solution Fig-ure 11 compares HybridNetty with SingleT-Asyncand NettyServer under various workload conditions andnetwork latencies Our workload consists of two classes ofrequests the heavy requests which have large response sizes(eg 100KB) and the light requests which have small response

                size (eg 01KB) heavy requests can cause the write-spinproblem while light requests can not We increase the percent-age of heavy requests from 0 to 100 in order to simulatingdifferent scenarios of realistic workloads The workload con-currency from clients in all cases keeps 100 under which theserver CPU is 100 utilized To clearly show the effectivenessof our hybrid solution we adopt the normalized throughputcomparison and use the HybridNetty throughput as thebaseline Figure 11(a) and 11(b) show that HybridNettybehaves the same as SingleT-Async when all requests arelight (0 heavy requests) and the same as NettyServerwhen all requests are heavy other than that HybridNettyalways performs the best For example Figure 11(a) showsthat when the heavy requests reach to 5 HybridNettyachieves 30 higher throughput than SingleT-Async and10 higher throughput than NettyServer This is becauseHybridNetty always chooses the most efficient path to pro-cess request Considering that the distribution of requests forreal web applications typically follows a Zipf-like distributionwhere light requests dominate the workload [22] our hybridsolution makes more sense in dealing with realistic workloadIn addition SingleT-Async performs much worse than theother two cases when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b))This is because of the write-spin problem exacerbated bynetwork latency (see Section IV-B for more details)

                VI RELATED WORK

                Previous research has shown that a thread-based serverif implemented properly can achieve the same or even bet-ter performance as the asynchronous event-driven one doesFor example Von et al develop a thread-based web serverKnot [40] which can compete with event-driven servers at highconcurrency workload using a scalable user-level threadingpackage Capriccio [41] However Krohn et al [32] show thatCapriccio is a cooperative threading package that exports thePOSIX thread interface but behaves like events to the underly-ing operating system The authors of Capriccio also admit thatthe thread interface is still less flexible than events [40] Theseprevious research results suggest that the asynchronous event-driven architecture will continue to play an important role inbuilding high performance and resource efficiency servers thatmeet the requirements of current cloud data centers

Optimizations for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts are motivated by reducing the overhead incurred under high concurrency workload by system calls such as select, poll, and epoll, or by I/O operations. For example, to avoid the kernel-crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server that integrates event monitoring and handling into the kernel.
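For context, the event-notification path that this line of work optimizes looks roughly like the following single-threaded Java NIO loop; this is a generic illustration, not code from the paper or from TUX. On Linux, Selector is typically backed by epoll.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // Generic single-threaded event loop: one select() call monitors many connections,
    // and ready events are handled one by one in user space.
    public class MiniEventLoop {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocate(4096);
            while (true) {
                selector.select();                       // event monitoring phase (one system call per batch)
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {                   // event handling phase
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        if (client.read(buf) < 0) { key.cancel(); client.close(); continue; }
                        buf.flip();
                        client.write(buf);               // echo back; a non-blocking write may be partial (cf. write-spin)
                    }
                }
            }
        }
    }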

[Figure 11: Normalized throughput (y-axis, 0 to 1.0) versus the ratio of large-size responses (x-axis: 0, 2, 5, 10, 20, 100%) for SingleT-Async, NettyServer, and HybridNetty. (a) No network latency between client and server; (b) ~5ms network latency between client and server.]

Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workload, with or without network latency. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µServer) and pipeline (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the choice between the blocking and non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µServer by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections. Our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, resulting from an inefficient event processing flow that creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers: the write-spin problem when handling large responses, and the associated exaggerating factors such as network latency (Section IV). Since no single solution fits all cases, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to varying workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload and network conditions into consideration.

                ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

[Figure 12: Details of the RUBBoS experimental setup: (a) software and hardware setup; (b) a 1/1/1 sample topology, in which clients send HTTP requests to Apache Tomcat, which queries MySQL.]

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. Such a workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and the sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.
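As an illustration of this user-emulation model, the loop below sketches one emulated user with a fixed ~7-second think time and a Markov-style choice of the next page; the page names and transition table are simplified placeholders rather than RUBBoS's actual 24 interactions.

    import java.util.Map;
    import java.util.Random;

    // Sketch of an emulated user: request a page, think ~7s, then pick the next page
    // according to a (simplified, illustrative) Markov transition table.
    public class EmulatedUser implements Runnable {
        private static final long THINK_TIME_MS = 7_000;   // ~7-second think time between pages
        private final Random rnd = new Random();
        private final Map<String, String[]> transitions = Map.of(
                "Home",            new String[]{"StoriesOfTheDay", "Search"},
                "StoriesOfTheDay", new String[]{"ViewStory", "Home"},
                "ViewStory",       new String[]{"PostComment", "Home"},
                "PostComment",     new String[]{"ViewStory"},
                "Search",          new String[]{"ViewStory", "Home"});

        @Override public void run() {
            String page = "Home";
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    request(page);                               // issue the HTTP request for this page
                    Thread.sleep(THINK_TIME_MS);                 // think time before the next navigation
                    String[] next = transitions.get(page);
                    page = next[rnd.nextInt(next.length)];       // Markov-style next-page choice
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void request(String page) {
            System.out.println("GET /" + page);                  // placeholder for a real HTTP call
        }
    }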

                REFERENCES

[1] Apache JMeter. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] ADLER, S. The Slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126–134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135–148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11–11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1–7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA) (2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33–46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4–4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.



