
SAX: A Tool for Studying Congestion-induced Surfer Behavior

Dinh Nguyen Tran, Wei Tsang Ooi, and Y.C. Tay

National University of Singapore

Abstract. User reaction to traffic congestion can have a severe impact on network stability and significant implications for traffic engineering. For example, users who persist in large peer-to-peer transfers despite congestion can drive the network into congestion collapse. On the other hand, users who abort large transfers can smoothen self-similar traffic. We present a tool, called SAX, for studying congestion-induced behavior of web surfers. SAX extracts information from HTTP packet traces and infers clicks, abortions and sessions. Measurements with SAX show how users back off when congestion occurs.

1 Introduction

Despite the exponential growth of Internet traffic in the last ten years, widespread outages similar to the congestion collapse observed in the early years [1, 2] have not happened, even with the occasional flash crowds (Olympics, 9/11, etc.). Much credit for the Internet's ability to deal with such traffic bursts goes to TCP's congestion control mechanism. TCP congestion control, however, only affects the traffic in a connection, not the number of connections. This number is determined by the users, who therefore play a role in controlling the traffic.

Such traffic control by users is particularly prominent during web surfing. A web surfer reacts to congestion in two ways: (U1) she may abort a slow download by clicking "Stop", "Reload" or another hyper-link, and (U2) she may cut short her surfing session. Such behavior is a form of congestion-induced user back-off: (U1) stops a download and (U2) reduces the number of completed downloads.

User reaction to network congestion can significantly affect network stability and traffic engineering. Indeed, aborting large HTTP transfers pares down the tail of the file size distribution. This in effect smoothens out self-similar traffic, possibly making elaborate traffic engineering for countering burstiness unnecessary.

As part of a larger study on the interaction between bandwidth supply and demand [3], we are interested in finding evidence for (U1) and (U2) in traffic traces. Figs. 1 and 2 show our main results, obtained from analysis of a 50GB tcpdump trace taken from a link in an academic network over two work days. Fig. 1 shows evidence for user back-off (U1): as download bandwidth decreases, the probability of aborting a download increases. The cumulative distribution function (cdf) is represented by the smoother curve. Fig. 2 shows evidence for user back-off (U2): as session bandwidth decreases, the number of completed downloads per session decreases; here, we focus on session bandwidths below 20KBps (the cdf indicates that they make up 95% of the data).


[Figure: p_abort and the download bandwidth cdf (y-axis: probability, 0.0 to 1.0) plotted against download bandwidth (0 to 1400 KBps).]

Fig. 1. Evidence for user back-off (U1).

[Figure: # completed downloads/session (0 to 120) and the session bandwidth cdf (0.0 to 1.0) plotted against session bandwidth (0 to 120 KBps).]

Fig. 2. Evidence for user back-off (U2).

The rest of this paper presents how we analyze the HTTP packet trace to obtain the results in Figs. 1 and 2 using SAX, a tool we developed to infer surfer actions from packet traces.

2 Models for Congestion and User Surfing Actions

Before we present SAX, we need to propose a good measure of "congestion". It is difficult to quantify network congestion in general using some metric. For example, download time is not a good measure of congestion, since it is affected by round-trip time, server load, etc. We therefore focus our analysis on a single bottleneck link¹, and define the congestion level on the link as the number of concurrent downloads on this link at any instant, denoted as $k$. Fig. 3 confirms that this metric is appropriate: our measurements show $k$ traces the session arrival rate as it rises and falls over 48 hours.

An alternative measure of congestion level is the per-download bandwidth $b_k = (\text{link bandwidth})/k$. Essentially, a $b_k$-axis reverses the $k$-axis, and the interesting, high-congestion part of the curve is compressed near the vertical axis; thus, presentation-wise, $k$ seems a better choice than $b_k$.

To formalize actions (U1) and (U2), we need to model sessions, downloads and aborts. In our model, a user surfs the web through a series of sessions, each consisting of a series of HTTP requests. A user sends HTTP requests by typing in URLs, clicking on bookmarks or hyper-links, etc. Each of these actions is modeled as a click, and each click generates one download. A download may consist of multiple and possibly parallel HTTP requests. (Our download corresponds to Barford and Crovella's web object [4] and Choi and Limb's web-request [5].) One user may launch multiple downloads in parallel from different browser windows.

A user may be frustrated by, and abort, a download that takes too long to finish. For example, a user who is presented with several web search results may click one link, find the download too slow, and abort it by clicking on another link. We denote the probability that a download is aborted as $p_{abort}$.

¹ This singling out of one link from the Internet is analogous to how classical demand-supply analysis isolates one market in a larger economy.


[Figure: # concurrent connections k (0 to 90) and session arrival rate (0.00 to 0.12 #/sec) plotted against time (0 to 70 hours).]

Fig. 3. How congestion k and the session arrival rate change over time.

[Figure: state diagram with states wait-abort, wait-complete and think; transitions labeled session arrival, click, p_abort, 1 - p_abort, p_retry, p_next and exit session.]

Fig. 4. User surfing model.

After the download is aborted, the user might click again. We denote the probability of a click following an aborted download in the same session as $p_{retry}$. We say a download is completed if it is not aborted. Probability $p_{next}$ denotes the proportion of completed downloads that are followed by another click in the session. Fig. 4 shows this user behavior model, with three user states. A user stays in either the wait-abort state (for an aborted download) or the wait-complete state (for a completed download) while a download is in progress. A think state, where the user is viewing the completed download, follows the wait-complete state. One can elaborate on this simple model by adding sleep time between sessions and non-reactive elephantine downloads [3].
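To make the model concrete, here is a minimal sketch (in Python; not part of SAX) that walks the states of Fig. 4 for one session. Treating $p_{abort}$, $p_{retry}$ and $p_{next}$ as fixed constants is an assumption for illustration only; in our measurements they vary with the congestion level $k$.

```python
import random

def simulate_session(p_abort, p_retry, p_next, rng=random.Random(0)):
    """Walk the user model of Fig. 4 for one session.

    Returns (n_click, n_abort, n_completed). Constant probabilities
    are an illustrative assumption; in our measurements they vary
    with the congestion level k.
    """
    n_click = n_abort = n_completed = 0
    surfing = True
    while surfing:
        n_click += 1                            # a click starts a download
        if rng.random() < p_abort:              # wait-abort state
            n_abort += 1
            surfing = rng.random() < p_retry    # retry, or exit session
        else:                                   # wait-complete, then think
            n_completed += 1
            surfing = rng.random() < p_next     # next click, or exit session
    return n_click, n_abort, n_completed

# Example: a congested setting with a high abort probability.
print(simulate_session(p_abort=0.4, p_retry=0.8, p_next=0.95))
```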

Analysis of the same HTTP packet trace as in Section 1 illustrates how users behave when the congestion level changes. Fig. 5 plots $p_{abort}$, $p_{next}$ and $p_{retry}$ against $k$. As expected, $p_{abort}$ increases with the congestion measure $k$, while $p_{next}$ decreases. As for $p_{retry}$, there are two possibilities: at low congestion levels, when throughput is still satisfactory, an increase in congestion may prompt a user to retry; with poor throughput at high congestion levels, however, any increase in congestion may prompt a user to abandon the session. These possibilities are consistent with the slight increase in $p_{retry}$ for small $k$ in Fig. 5, and the slight decrease for large $k$. Note that, as expected, $p_{retry}$ is less than $p_{next}$ at every congestion level $k$, indicating that a user is less likely to continue a session after an abort. These graphs provide further evidence for user reaction to congestion.

[Figure: three panels plotting p_abort (0.0 to 1.0), p_next (0.90 to 1.00) and p_retry (0.80 to 1.00) against # concurrent connections k (0 to 120), each overlaid with the cdf of # concurrent connections.]

Fig. 5. How p_abort, p_next and p_retry vary with congestion measure k.


We obtain Fig. 5 by inferring user surfing actions indirectly, through analysis of packet-level HTTP traces. We chose this approach because we found the alternatives impractical: using human observers to log surfing activities is time consuming, and modifying existing web browsers [6] to log user actions requires large-scale deployment of the modified browsers. Comparatively, analyzing packet traces is simpler, as it involves only passive collection of packet traces and the development of an analysis tool. We describe this analysis tool next.

3 SAX: Surfer Actions eXtractor

We developed SAX as an off-line tool to infer user actions from HTTP packet traces. SAX takes an HTTP packet dump as input, analyzes the relationships among the packets, and groups the packets into downloads. Each download is classified as either completed or aborted, and downloads are further grouped into sessions.

The difficulties of extracting HTTP information from packet-level data are well documented by Feldmann [7]. Additional issues encountered by SAX are:

(D1) Packets from parallel connections may belong to the same download, and packets from a persistent connection may belong to different downloads.

(D2) Two downloads from the same user might overlap in time: before a web page finishes downloading, the user may click on one of its links, thus initiating another download.

(D3) Aborted and completed downloads need to be distinguished.

(D4) Software-generated, automatic downloads do not reflect user behavior and need to be filtered out.

Grouping Packets into Downloads. A request-response is a collection of packets containing an HTTP request, followed by an HTTP response with the response header and data for that request. A download is a collection of request-responses, the first of which is initiated by a click. A click takes place when (C1) the user requests a page by entering the URL in the address bar of a browser or launches the browser through another application (such as an RSS reader), or (C2) the user clicks on a URL in a previously retrieved page.

To identify a click, we consider the two cases above separately. A click generated by (C1) is identified by an HTTP request without a referrer, while a click generated by (C2) is identified by an HTTP request whose URL corresponds to a hyper-link or submit button, with a Referer field pointing to a previously downloaded web page.

After identifying the click that starts a download X, we proceed to group the request-responses that belong to the same download. The subsequent request-responses in X consist of HTTP requests and responses for embedded objects required to display the containing web page. Such embedded objects are downloaded automatically by the browser. Just as in (C2), the request header for such objects contains a Referer field that specifies the containing web page. However, the URLs of the requested objects appear as embedded objects in a previously downloaded document. Using this fact, SAX includes in X any HTTP requests for embedded objects whose Referer fields (recursively) point to objects that are already included in X. Browser-initiated requests for redirect links are also included in X.

HTTP requests generated by browsers due to auto-update, however, should be excluded. SAX uses a temporal threshold $\tau_{auto}$ to identify such requests. An embedded object Y with a Referer pointing to download X is added to X only if the arrival time of Y is within $\tau_{auto}$ of X's last packet; otherwise, Y is identified as an auto-update.

One expects the period for auto-updates to be in minutes, whereas the requests for a web page and its embedded objects should occur within seconds of each other [8]. There is therefore considerable latitude in specifying the threshold $\tau_{auto}$. Our measurements show that 95% of the silent gaps between Y and X are within 2 minutes. We thus set $\tau_{auto}$ to 2 minutes in our measurements.
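As an illustration, the following Python sketch combines the click test with the Referer-based grouping and the $\tau_{auto}$ filter. The record fields and the helper is_clickable_link(url, referer), which should report whether the URL appeared as a hyper-link or submit button in the page named by the Referer, are hypothetical simplifications of SAX's internal structures.

```python
from dataclasses import dataclass, field

TAU_AUTO = 120.0  # seconds; the 2-minute threshold from our measurements

@dataclass
class ReqResp:
    url: str
    referer: str | None   # None models case (C1): no referrer
    t_start: float        # arrival time of the HTTP request
    t_end: float          # arrival time of the last response packet

@dataclass(eq=False)
class Download:
    members: list = field(default_factory=list)
    urls: set = field(default_factory=set)
    t_last: float = 0.0

def group_into_downloads(reqresps, is_clickable_link):
    """Group request-responses into downloads (a simplification of SAX)."""
    downloads = []
    for rr in sorted(reqresps, key=lambda r: r.t_start):
        if rr.referer is None or is_clickable_link(rr.url, rr.referer):
            d = Download()                      # cases (C1), (C2): a click
            downloads.append(d)
        else:
            # Embedded object: attach to the download that (recursively)
            # contains the Referer, unless the silent gap exceeds tau_auto,
            # in which case the request is treated as an auto-update.
            d = next((x for x in downloads
                      if rr.referer in x.urls
                      and rr.t_start - x.t_last <= TAU_AUTO), None)
            if d is None:
                continue                        # auto-update or unmatched
        d.members.append(rr)
        d.urls.add(rr.url)
        d.t_last = max(d.t_last, rr.t_end)
    return downloads
```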

Identifying Aborted Downloads. Having identified downloads, we now classify these downloads as either completed or aborted. Such classification was not performed in previous web traffic characterization efforts.

We say a request-response is not completed if (i) the entity's length is specified in the response header and the amount of data received by the browser is less than the entity's length; or (ii) HTTP 1.1 and chunked transfer-coding are used and the browser has not received an end-of-chunk indication; or (iii) HTTP 1.0 is used and the entity's length is not specified in the response header (i.e., SAX assumes the server will send a FIN when the download completes).

SAX detects an abort of a request-response if the first FIN or RST is sent by the browser and the request-response is not completed. A download is aborted if and only if one of its request-responses is aborted; otherwise, the download is completed. Some browsers terminate a download immediately if they see certain special response headers (e.g., "304 Not Modified"); for such special cases, SAX considers the download completed.
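A sketch of these rules in Python; the field names (content_length, bytes_received and so on) are our own illustrative assumptions about what a trace parser would recover, not SAX's actual data structures.

```python
def is_completed(rr):
    """Rules (i)-(iii) for whether a request-response finished.

    Assumed fields: content_length (entity length from the response
    header, or None), bytes_received, chunked (chunked transfer-coding
    in use), saw_last_chunk (end-of-chunk seen), http_version ("1.0"
    or "1.1"), server_sent_fin.
    """
    if rr.content_length is not None:                  # rule (i)
        return rr.bytes_received >= rr.content_length
    if rr.http_version == "1.1" and rr.chunked:        # rule (ii)
        return rr.saw_last_chunk
    if rr.http_version == "1.0":                       # rule (iii)
        return rr.server_sent_fin                      # FIN marks the end
    return True

def download_is_aborted(download):
    """A download is aborted iff some request-response is aborted: the
    browser sent the first FIN/RST while the transfer was incomplete.
    Special replies such as "304 Not Modified" count as completed."""
    for rr in download.members:
        if rr.status == 304:
            continue
        if rr.first_close_from == "browser" and not is_completed(rr):
            return True
    return False
```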

Extracting URLs from HTML Documents. SAX's method for detecting clicks and grouping request-responses into downloads relies on its accuracy in extracting URLs from HTML documents. Issues that SAX addresses in this task are:

• When a user clicks on the submit button of an HTML document X, the requested URL Y may contain the form's information. Some extra processing is necessary to match Y to X.

• Relative URLs need to be converted to absolute URLs for matching.

• A browser may request an embedded object immediately when it finds the URL in a partially received HTML document. SAX therefore needs to extract URLs upon arrival of any HTML fragment to keep pace with this behavior (a sketch follows this list).

• The message body of a request-response can be encoded using chunked transfer-coding or content-coding (gzip, compress, etc.). SAX partially decompresses the packets in main memory to extract the URLs.
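The last two points can be combined in a streaming extractor. The following sketch uses Python's zlib in streaming mode and a crude regular expression in place of SAX's actual parser; it is an assumption-laden illustration, not SAX's implementation.

```python
import re
import zlib

HREF_RE = re.compile(rb'(?:href|src)\s*=\s*["\']?([^"\'\s>]+)', re.I)

class IncrementalURLExtractor:
    """Extract candidate URLs from gzip/zlib-encoded HTML as fragments
    arrive, without waiting for the full document."""

    def __init__(self):
        self.inflater = zlib.decompressobj(wbits=47)  # auto gzip/zlib
        self.tail = b""     # partial tag left over at a fragment boundary
        self.urls = []

    def feed(self, compressed_fragment):
        html = self.tail + self.inflater.decompress(compressed_fragment)
        cut = html.rfind(b"<")          # hold back any unfinished tag
        if cut == -1:
            cut = len(html)
        html, self.tail = html[:cut], html[cut:]
        self.urls.extend(m.group(1).decode("latin-1")
                         for m in HREF_RE.finditer(html))
        return self.urls
```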


Excluding Background Downloads. Besides the background downloads generated by auto-updates of web pages, SAX needs to exclude background HTTP downloads generated by non-browser applications (e.g., Windows Update). SAX therefore compiles a list of the most common URLs that it has seen; any background downloads that appear on this list are excluded during post-processing.

To define "most common", SAX checks periodically (every 20 minutes), for each user, whether a previously seen URL X is requested again; if so (no matter how many times), X's counter is incremented by 1. URLs with large counters are treated as most common. This accounting allows SAX to differentiate downloads by applications that appear regularly over extended periods of time from URLs that attract flash crowds.
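A minimal sketch of this accounting; the per-download fields (user, url, t_start) and the cut-off min_count are illustrative assumptions.

```python
from collections import defaultdict

PERIOD = 20 * 60  # seconds: one check every 20 minutes

def most_common_urls(downloads, min_count):
    """Increment a URL's counter at most once per (user, 20-minute
    period) in which the URL is requested again, then flag URLs with
    large counters. Regular background traffic accumulates a count in
    almost every period; a flash-crowd URL does not persist."""
    counters = defaultdict(int)
    seen_before = set()     # (user, url) pairs already observed
    counted = set()         # (user, url, period) already counted
    for d in sorted(downloads, key=lambda d: d.t_start):
        period = int(d.t_start // PERIOD)
        key = (d.user, d.url, period)
        if (d.user, d.url) in seen_before and key not in counted:
            counters[d.url] += 1
            counted.add(key)
        seen_before.add((d.user, d.url))
    return {url for url, c in counters.items() if c >= min_count}
```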

Limitations of SAX. SAX is unable to process the encrypted packets that belong to secure HTTP flows. SAX also cannot identify some HTTP requests that are triggered by JavaScript, since the URL can be created dynamically by the program; HTTP requests generated by JavaScript may have an empty Referer field, leading SAX to treat them as new clicks (C1). Finally, since SAX depends on an HTTP packet dump, it cannot identify a download that is partially served from the browser cache.

4 From SAX to the Model

The output from SAX includes download descriptions (URLs, timestamps, sizes, abort/complete status, etc.), request-response and Referer information, TCP connections, etc. We now describe how these are processed to give Figs. 1, 2 and 5.

Session Definition. Our user model groups clicks into sessions. Two sessions are separated by a sleep time $T_{sleep}$, during which the user is not actively surfing the web. Within a session, a think time $T_{think}$ separates a click from the previous download completion, while the download is viewed (think state in Fig. 4); in contrast, think time in CARENA separates one click instant from the next [6].

Think time and sleep time vary from one person to another and, for each person, from one session to another. In the traffic trace, think time and sleep time both appear as a silent gap $t_{silent}$ between packets. To distinguish between think time and sleep time, we use a threshold $\tau_{session}$, where $T_{think} < \tau_{session} < T_{sleep}$ in most cases. There is considerable latitude in specifying $\tau_{session}$ since, in general, $T_{think} \ll T_{sleep}$. Other studies have found average think time within a session to be less than a minute [9, 4, 5, 10]. In our experiments, we generously set $\tau_{session}$ to 10 minutes, since our measurements show that 95% of silent gaps between two downloads are within 10 minutes. (Hlavacs and Kotsis used thresholds of 8.3 minutes and 30 minutes in their model, depending on whether there was a change in web server address [11].)
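In code, session grouping reduces to cutting each user's time-ordered download sequence at silent gaps longer than $\tau_{session}$. A minimal sketch, assuming each download record carries its start and end times:

```python
TAU_SESSION = 600.0  # seconds: our 10-minute threshold

def group_into_sessions(user_downloads):
    """Split one user's downloads into sessions: a silent gap longer
    than tau_session between the end of one download and the next
    click is taken to be sleep time rather than think time."""
    sessions, current = [], []
    last_end = None
    for d in sorted(user_downloads, key=lambda d: d.t_start):
        if last_end is not None and d.t_start - last_end > TAU_SESSION:
            sessions.append(current)   # gap too long: a new session
            current = []
        current.append(d)
        last_end = d.t_end if last_end is None else max(last_end, d.t_end)
    if current:
        sessions.append(current)
    return sessions
```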

Session Bandwidth and Download Bandwidth. Having identified downloads and sessions, we can now compute the session bandwidth $b_{session}$ in Fig. 2. We are interested in this metric since the length of a session may depend on the surfer's aggregated experience of multiple downloads. We quantify this experience with $b_{session}$, defined as the number of bytes transferred in a session (aborted or completed) divided by the sum of download transfer times.

A user may abort a slow download. Therefore, another metric of interest is the download bandwidth $b_{download}$, i.e., the number of bytes transferred in a download (aborted or completed) divided by its time span (see Fig. 1).

$\delta$-intervals. Recall that we use $k$, the number of concurrent downloads on the bottleneck link, as a metric for network congestion (Fig. 5). Measuring $k$, however, is not trivial, as a download may contain silent gaps, may spread over parallel connections, and may share a TCP connection with another download.

This issue led us to the following idea: partition the trace into equal (non-overlapping) intervals of time. Let $(t, t + \delta)$ denote the interval between times $t$ and $t + \delta$, where $\delta > 0$, and $\mathrm{length}(t, t + \delta) = \delta$; we call $(t, t + \delta)$ a $\delta$-interval. Let $D$ be the set of downloads, $d \in D$, and $I_d$ be the time interval between the start and end of download $d$. We measure the number of concurrent downloads in $(t, t + \delta)$ by

$$k = \frac{\sum_{d \in D} \mathrm{length}(I_d \cap (t, t + \delta))}{\delta}. \qquad (1)$$

This idea is illustrated in Fig. 6. Note that, if no download starts or ends during the interval, then Eqn. (1) gives precisely the number of downloads spanning that interval.
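Eqn. (1) translates directly into code: sum each download's overlap with the $\delta$-interval and divide by $\delta$. A sketch, with times in seconds:

```python
DELTA = 60.0  # seconds: delta = 1 minute in our measurements

def concurrency(downloads, t, delta=DELTA):
    """Eqn. (1): the number of concurrent downloads in (t, t + delta)."""
    total = 0.0
    for d in downloads:
        # length of I_d intersected with (t, t + delta)
        overlap = min(d.t_end, t + delta) - max(d.t_start, t)
        if overlap > 0:
            total += overlap
    return total / delta
```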

We also use $\delta$-intervals to measure the probabilities, as described below.

Measuring the Probabilities. Our user model has parameters $p_{abort}$, $p_{next}$ and $p_{retry}$. To measure these probabilities, let $n_{click}$, $n_{completed}$, $n_{abort}$, $n_{retry}$ and $n_{next}$ be (respectively) the number of downloads, completed downloads, aborted downloads, retries after aborts, and clicks after think time. Then, one could calculate the probabilities by

$$p_{abort} = \frac{n_{abort}}{n_{click}}, \qquad p_{retry} = \frac{n_{retry}}{n_{abort}}, \qquad p_{next} = \frac{n_{next}}{n_{completed}}. \qquad (2)$$

However, it is not clear how $n_{click}$, etc. are to be measured. An obvious choice is to measure them per session (Fig. 4), calculate each probability for every session, then aggregate the probabilities over the sessions. This approach has three problems:

• $p_{retry}$ is a conditional probability, so it is ill-defined if $n_{abort} = 0$ for a session; $p_{next}$ has a similar problem if $n_{completed} = 0$ for a session.

• How should the per-session values for $p_{retry}$ (say) be aggregated over the sessions to give one $p_{retry}$ value for each $k$?

• If a session ends with a completed download, $n_{retry} = n_{abort}$, so $p_{retry} = 1$ for that session; if a session ends with an aborted download, then $n_{retry} = n_{abort} - 1$, so $p_{retry}$ for a session can only take the values $0, \frac{1}{2}, \frac{2}{3}, \frac{3}{4}, \ldots$. This discrete spread makes any smooth aggregation over all sessions difficult.

Therefore, instead of aggregating after the division (2), we first aggregate the values for $n_{abort}$, etc., then do the division. This can be done as follows: consider each $\delta$-interval and measure the number $n_{abort}$ of aborted downloads and the number $n_{click}$ of downloads in that interval, then divide one by the other to get $p_{abort}$. Each $\delta$-interval thus gives a $(k, p_{abort})$ pair, from which we derive the relationship between the two metrics.

However, the size of $\delta$ forces a tradeoff: a large $\delta$ gives a poor measurement of $k$, while a small $\delta$ gives noisy measurements of $n_{click}$ and $n_{abort}$; furthermore, user reaction to congestion within the $\delta$-interval may occur after that interval.

We resolve this tension as follows: for each $\delta$-interval, let $n_{click}$ and $n_{abort}$ be the number of downloads and aborted downloads from the start of that $\delta$-interval to the end of the session; their ratio then gives $p_{abort}$ for that $\delta$-interval. We measure $p_{retry}$ and $p_{next}$ similarly.
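As a sketch, counting from each $\delta$-interval's start to the session's end looks as follows; the aborted flag on each download record is assumed, and pairing each interval's $p_{abort}$ with its $k$ from Eqn. (1) happens separately.

```python
def abort_prob_per_interval(session, delta=60.0):
    """For each delta-interval of a session, compute p_abort from the
    downloads between that interval's start and the session's end."""
    clicks = sorted(session, key=lambda d: d.t_start)
    t_end = max(d.t_end for d in clicks)
    pairs = []
    t = clicks[0].t_start
    while t < t_end:
        later = [d for d in clicks if d.t_start >= t]
        if later:
            n_click = len(later)
            n_abort = sum(1 for d in later if d.aborted)
            pairs.append((t, n_abort / n_click))
        t += delta
    return pairs
```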

Curve Smoothing. By using $\delta$-intervals for the measurements, we have discretized time; add to this the bursty nature of network traffic, and the data become jittery, making it difficult to discern trends.

To smooth out the jitter, we use a sliding window of size $L$ units. For example, to smooth out a function of $k$, a unit is one integer value. If we have $(k, p_{abort})$ measurements sequenced by $k$, then we consider consecutive $(k, p_{abort})$ measurements within that window (i.e., such that $x \le k \le x + L$, where $x$ is the start of the window), and average the $k$ values and the $p_{abort}$ values to give an aggregate pair; we then slide the window by 1 (i.e., consider the window $x + 1 \le k \le x + 1 + L$), and repeat the aggregation.

In our measurements, we chose $L = 6$ for smoothing over $k$ in Fig. 5; for smoothing a function of download bandwidth in Figs. 1 and 2, each unit is 50KBps. We set $\delta = 1$ minute in our measurements. The smoothed curves yield our main results, presented in Figs. 1, 2 and 5.
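The smoothing step itself is a short routine; this sketch sorts the $(k, p)$ pairs, averages everything in $[x, x + L]$, and advances the window by one unit.

```python
def smooth(pairs, L=6, step=1):
    """Sliding-window smoothing of (k, p) measurements (L = 6 in Fig. 5).
    Each output point averages the k values and the p values of all
    input pairs with x <= k <= x + L."""
    pairs = sorted(pairs)
    if not pairs:
        return []
    out = []
    x, x_max = pairs[0][0], pairs[-1][0]
    while x <= x_max:
        window = [(k, p) for k, p in pairs if x <= k <= x + L]
        if window:
            out.append((sum(k for k, _ in window) / len(window),
                        sum(p for _, p in window) / len(window)))
        x += step
    return out
```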

[Figure: four downloads overlapping a δ-interval on a time axis; their overlap lengths x1, x2, x3 and x4 are summed and divided by δ to give k = (x1 + x2 + x3 + x4)/δ.]

Fig. 6. Using intervals to measure the number of concurrent downloads k.


5 Related Work

User Behavioral Model. Our model is the first to study how users react to congestion. Choi and Limb's behavioral model [5] and Barford and Crovella's user equivalent [4] do not include sessions (U2), whereas the layered model by Hlavacs and Kotsis does not provide for user reaction to delays [11].

Rossi et al.'s measurement study of how transfer delays cause users to interrupt TCP connections is relevant [12], but they do not offer a user model. Furthermore, a download may span more than one TCP connection.

Studies on how users react to server delays [13, 14] are only marginally relevant since, in our context, the users may be visiting different web sites, and each user may visit multiple web sites.

Characterizing Web Traffic. Mah [15] was one of the first to model HTTP traffic by analyzing packet dumps. He used a threshold (1 second) between packet arrival times to determine whether two HTTP connections belong to the same download. Such a heuristic can fail if two downloads overlap (D2). A similar approach is used by Barford-Crovella [4], Lan-Heidemann [16], Smith et al. [17], Molina et al. [18], and Abrahamsson-Ahlgren [19]. Choi and Limb pointed out the inadequacy of relying on a 1-second threshold [5]; instead, they parsed the HTTP headers to detect the start of downloads. However, header information may not be enough, and it may be necessary to extract information from the body as well.

None of the previous work characterizes downloads as completed or aborted (D3). Rossi et al. [12] used TCP FIN and RST to distinguish completed from aborted TCP connections, but connections do not correspond to downloads, due to parallel and persistent connections (D1).

Among the related work, Abrahamsson-Ahlgren [19] is the only study that groups downloads into sessions. They use a threshold of 15 minutes to separate HTTP requests into different sessions. Other previous studies did not distinguish between think time and sleep time.

Packet Analysis Tool. Feldmann's BLT is a tool for extracting HTTP information from sniffed packets [7], much like HTTPdump [20] and the more general Nprobe [21]. One could use such information for further studies of compression, traffic invariants, proxy caching, etc. In principle, we could process BLT output to identify user actions, such as clicks and aborts. However, the heuristics used by such general tools for handling missing packets, erroneous HTTP formats, etc. filter out some information that is needed by studies like ours, e.g., for distinguishing between a click and the download of an embedded object.

6 Conclusion

We presented an analysis tool called SAX that infers user surfing behavior from HTTP packet traces. SAX was designed to confirm the existence of congestion-induced user back-off during surfing. We believe, however, that SAX is useful in its own right, for instance, to researchers studying the effects of web caches, or the surfing behavior of different social groups. Furthermore, the user surfing model constructed using SAX can be used in a network simulator such as ns-2 to generate web traffic that incorporates user behavior.

References

1. Nagle, J.: Congestion Control in IP/TCP. IETF (1984) RFC 896
2. Jacobson, V.: Congestion avoidance and control. In: Symposium Proceedings on Communications Architectures and Protocols, ACM Press (1988) 314–329
3. Tay, Y.C., Tran, D.N., Liu, E.Y., Ooi, W.T., Morris, R.: Modeling web surfers and bandwidth demand/supply for congestion-induced behavior. Submitted (2005)
4. Barford, P., Crovella, M.: Generating representative web workloads for network and server performance evaluation. In: Proc. SIGMETRICS (1998) 151–160
5. Choi, H.K., Limb, J.O.: A behavioral model of web traffic. In: Proc. Int. Conf. Network Protocols (1999) 327–334
6. Nino, I.J., de la Ossa, B., Gil, J.A., Sahuquillo, J., Pont, A.: CARENA, a tool to capture and replay web navigation sessions. In: E2EMON (2005)
7. Feldmann, A.: BLT: Bi-layer tracing of HTTP and TCP/IP. Computer Networks 33 (2000) 321–335
8. Casilari, E., Reyes-Lecuona, A., Gonzalez, F., Diaz-Estrella, A., Sandoval, F.: Characterization of web traffic. In: GLOBECOM (2001) 1862–1866
9. Arlitt, M., Williamson, C.: A synthetic workload model for Internet Mosaic traffic. In: Proc. Summer Computer Simulation Conference (1995) 852–857
10. Reyes-Lecuona, A., Gonzalez, E., Casilari, E., Casasola, J., Diaz-Estrella, A.: A page-oriented WWW traffic model for wireless system simulations. In: Proc. Int. Teletraffic Congress (ITC-16) (1999) 817–826
11. Hlavacs, H., Kotsis, G.: Modeling user behavior: a layered approach. In: MASCOTS (1999) 218–225
12. Rossi, D., Casetti, C., Mellia, M.: User patience and the web: a hands-on investigation. In: GLOBECOM (2003) 4163–4168
13. Arlitt, M., Williamson, C.: Internet web servers: workload characterization and performance implications. IEEE/ACM Trans. on Networking 5 (1997) 631–645
14. Dalal, A., Jordan, S.: Improving user-perceived performance at a world wide web server. In: GLOBECOM (2001) 2465–2469
15. Mah, B.A.: An empirical model of HTTP network traffic. In: INFOCOM (2) (1997) 592–600
16. Lan, K.C., Heidemann, J.: Rapid model parameterization from traffic measurements. ACM Trans. on Modeling and Computer Simulation 12 (2002) 201–229
17. Smith, F.D., Campos, F.H., Jeffay, K., Ott, D.: What TCP/IP protocol headers can tell us about the Web. In: SIGMETRICS/Performance (2001) 245–256
18. Molina, M., Castelli, P., Foddis, G.: Web traffic modeling exploiting TCP connections' temporal clustering through HTML-REDUCE. IEEE Network 14 (2000) 46–55
19. Abrahamsson, H., Ahlgren, B.: Using empirical distributions to characterize web client traffic and to generate synthetic traffic. In: GLOBECOM (2000) 428–433
20. Wooster, R., Williams, S., Brooks, P.: HTTPDUMP: Network HTTP packet snooper. http://ei.cs.vt.edu/~succeed/96httpdump/ (1996)
21. Moore, A., Hall, J., Kreibich, C., Harris, E., Pratt, I.: Architecture of a network monitor. In: PAM (2003)