Modeling and generating realistic streaming media server workloads

Wenting Tang a,1, Yun Fu b,2, Ludmila Cherkasova a,*, Amin Vahdat b

a Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94303, United States
b Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, United States

Computer Networks 51 (2007) 336–356
www.elsevier.com/locate/comnet

Received 28 May 2004; received in revised form 27 February 2006; accepted 2 May 2006
Available online 14 June 2006

Responsible Editor: U. Krieger

Abstract

Currently, Internet hosting centers and content distribution networks leverage statistical multiplexing to meet the performance requirements of a number of competing hosted network services. Developing efficient resource allocation mechanisms for such services requires an understanding of both the short-term and long-term behavior of client access patterns to these competing services. At the same time, streaming media services are becoming increasingly popular, presenting new challenges for designers of shared hosting services. These new challenges result from fundamentally new characteristics of streaming media relative to traditional web objects, principally different client access patterns and significantly larger computational and bandwidth overhead associated with a streaming request. To understand the characteristics of these new workloads we use two long-term traces of streaming media services to develop MediSyn, a publicly available streaming media workload generator. In summary, this paper makes the following contributions: (i) we propose a framework for modeling long-term behavior of network services by capturing the process of file introduction, non-stationary popularity of media accesses, file duration, encoding bit rate, and session duration. (ii) We propose a variety of practical models based on the study of the two workloads. (iii) We develop an open-source synthetic streaming service workload generator to demonstrate the capability of our framework to capture the models.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Streaming media server workload; Synthetic workload generator; Media access patterns; Temporal and static properties; Non-stationary popularity; Zipf–Mandelbrot law; File life span; Modeling

1389-1286/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.comnet.2006.05.003

* Corresponding author. Tel.: +1 650 857 3753; fax: +1 650 857 7029.
E-mail addresses: [email protected], [email protected] (W. Tang), [email protected], [email protected] (Y. Fu), lucy.cherka- [email protected] (L. Cherkasova), [email protected] (A. Vahdat).
1 This work was originated and largely completed while W. Tang worked at HPLabs. Currently, W. Tang is with Arcsight Inc., 5 Result Way, Cupertino, CA, United States.
2 This work was mostly done while Y. Fu worked at HPLabs during his summer internship. Currently, Y. Fu is with Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA 95054, United States.


1. Introduction

Two recent trends in network services motivate this work: a move toward shared service hosting centers and the growing popularity of streaming media. Traditionally, service providers over-provision their sites to address highly bursty client access patterns. These access patterns can vary by an order of magnitude on an average day [10] and by three orders of magnitude in the case of flash crowds. In fact, services are often most valuable exactly when the unexpected takes place. Consider the example of a news service when an important event takes place; load on CNN reportedly doubled every seven minutes shortly after 9 AM on September 11, 2001 [7].

Thus, we are pursuing a vision where large-scale hosting infrastructures simultaneously provide “resource-on-demand” capabilities to competing Internet services [19,8]. The idea is that the system can use statistical multiplexing and efficient resource allocation to dynamically satisfy the requirements of services subject to highly bursty access patterns. For instance, surplus resources resulting from “troughs” in accesses to one service may be reallocated to satisfy the requirements of a second service experiencing a peak. Further, Service Level Agreements (SLAs) may specify that, under resource constraints, one service should preferentially receive resources over other services.

A second emerging trend is the growing popularity of streaming media services. Streaming media takes the form of video and audio clips from news, sports, entertainment, and educational sites. Streaming media is also gaining momentum in enterprise intranets for training purposes and company broadcasts. These workloads differ from traditional web workloads in many respects, presenting a number of challenges to system designers and media service providers [13,18]. For instance, transmitting media files requires more computing power, bandwidth and storage and is more sensitive to network jitter than transmitting web objects. Further, media access lasts for a much longer period of time and allows for user interaction (pause, fast forward, rewind, etc.).

The long-term goal of our work is to study resource provisioning and resource allocation at the confluence of the above two trends: network service hosting infrastructures for next-generation streaming workloads. A key obstacle to carrying out such a study is the lack of understanding of changing client access patterns over a long period of time. For both hosting centers and content distribution networks (CDNs), we require such an understanding to determine, for example, how to place objects at individual sites (potentially spread across the network) and how to allocate resources to individual streams and to individual clients.

Thus, we use long-term traces from two streaming media services to construct an open-source media workload generator called MediSyn. For MediSyn, we develop a number of novel models to capture a broad range of characteristics for network services. We also demonstrate how these models generalize to capture the characteristics of traditional web services. Overall, this paper makes the following contributions:

• A primary contribution of our work is its focus on the long-term behavior of network services. Among the features of our synthetic generator is the ability to reflect the dynamics and evolution of content at media sites and the change of access rate to this content over time. Existing workload generators assume that there is a set of active objects fixed at the beginning of the “trace”. Similarly, existing techniques assume that object popularity remains the same over the entire duration of the experiment. While these are reasonable assumptions for experiments designed to last for minutes, we are interested in long-term provisioning and resource allocation, as well as the resource allocation for simultaneous competing services (consider a CDN simultaneously hosting hundreds of individual services).

• It was observed [2,12] that the popularity distribution in media workloads collected over a significant period of time (more than 6 months) does not follow a Zipf-like distribution. We show that a special version of the Zipf–Mandelbrot law can be used to capture the popularity distribution in such workloads. The traditional Zipf-like distribution is a special case of the Zipf–Mandelbrot distribution.

• We designed a set of new models to capture a number of characteristics critical to streaming media services, including file duration, file access prefix duration, non-stationary file popularity, new file introduction process and diurnal access patterns.

The rest of this paper is organized as follows. Section 2 outlines the workload properties that MediSyn attempts to capture and presents the workload generation process adopted by MediSyn. Section 3 outlines the real-world workloads used in our study, introduces the models used in MediSyn, and discusses their specifics. We review previous related work in Section 5. Finally, we conclude with a summary and future work in Section 6.

Table 1
Summary for two media logs used to develop property models in MediSyn

                      HPC          HPL
Log duration          29 months    21 months
Number of files       2999         412
Number of sessions    666,074      14,489

2. Media workload properties and their generation in MediSyn

Accurate workload characterization is critical for successful generation of realistic workloads. A synthetic media workload generator can produce traces with targeted, controllable parameters and desired distributions for performance experiments studying effective streaming media delivery architectures and strategies. For such experiments, the generated workload must not only mimic the highly dynamic resource-utilization patterns found on today’s media systems but also provide flexible means to generate more intensive, bursty and diverse workloads for future media systems. Challenges to designing a useful analytical workload generator include:

• identifying essential properties of workloads targeted by synthetic workload generators, and those that most affect the behavior of hosting centers,

• designing appropriate mathematical models that closely reproduce the identified workload properties from real traces.

In this section, we highlight the main properties of streaming media workloads modeled in MediSyn and how these properties are composed during the workload generation process in MediSyn.

We partition media workload properties into two groups: static and temporal properties.

• Static properties provide the characteristics of the underlying media fileset, reflect the aggregate, quantitative properties of client accesses (independent of the access time), and present the properties of individual file accesses. Static properties include:
  – file duration that represents the advertised duration of the file (in seconds),
  – file encoding bit rate that reflects the rate (in bits/s) used for file encoding and that defines bandwidth requirements for the file transfer,
  – file access popularity that defines the number of accesses to a file within a certain period of time,
  – file access prefix that represents the elapsed time of the requested media file when the play ended (a play is ended prematurely when the client hits the stop button).

• Temporal properties reflect the dynamics and evolution of accesses to media content over time, and determine the ordering and the timing of session arrivals. The temporal properties of media workloads include:
  – new file introduction process that reflects at what rate the new content is introduced at the media site (and hence, when it appears in the media workload),
  – file life span that defines the file popularity changes over a daily time scale within a certain period of time,
  – diurnal access pattern that specifies how the number of accesses to a site varies during a given period of time, e.g., a day.

MediSyn’s goal is to generate a synthetic trace representing a sequence of file accesses to a media service. This process consists of generating values/distributions for all the properties introduced above for each media file.

Once all the file properties are generated, MediSyn generates a sequence of accesses to each file according to the assigned popularity distributions and file temporal properties. At the end, the media sessions for all the files are combined and sorted by global time to produce the synthetic trace.
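The final combine-and-sort step can be illustrated with a short Python sketch. It is not taken from MediSyn: the `Session` tuple layout and the function name `merge_sessions` are assumptions made only for this example.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical session record: (start_time_sec, file_id, prefix_sec).
Session = Tuple[float, int, float]

def merge_sessions(per_file: Iterable[List[Session]]) -> List[Session]:
    """Merge per-file session lists into one trace ordered by global start time."""
    # Sort each file's sessions by start time, then perform a k-way merge.
    sorted_lists = [sorted(sessions) for sessions in per_file]
    return list(heapq.merge(*sorted_lists))

# Two files with sessions generated independently of each other.
trace = merge_sessions([
    [(30.0, 1, 12.0), (5.0, 1, 60.0)],
    [(10.0, 2, 45.0)],
])
```

Because tuples compare element by element, sessions are ordered by start time first; ties fall back to the file identifier.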

3. Main models of workload generation in MediSyn

This section describes the models used in MediSyn to capture static and temporal properties of streaming media workloads.

Throughout this paper, we use two representative streaming media server logs, collected over a period of years, to demonstrate the chosen properties and to validate our mathematical models introduced to reflect these properties. The streaming media server logs represent two different media services: HP Corporate Media Solutions Server (HPC) and HPLabs Media Server (HPL). We define a session as a client access to a particular file. Table 1 briefly summarizes the workloads.

Fig. 1. PDF of the HPC duration distribution: (a) four normal distributions to capture the four peaks and (b) the aggregate distribution of the four normal distributions. (Axes: Duration (sec) vs. Number of files.)

In Table 1, the HPC media server shows more activity than the HPLabs server. While the HPLabs server serves a small number of research-oriented communities within HP, the HPC workload represents a reasonably busy media server with 300–800 client sessions every day and a peak rate of 12,000 sessions per day. Given this difference, it is even more interesting to see whether we can design common models that are capable of generating these diverse enterprise media workloads.

In this paper, we chose to present our models and results with the help of visualization. Most of the figures are used to visually characterize the nature of the studied workloads and to help in understanding the main workload properties and their specifics. We used MatLab for fitting our data with distribution candidates. In particular, we used $\chi^2$ (chi-squared) and $\lambda^2$ for goodness-of-fit measurement based on [22,21].

3.1. Static properties

3.1.1. Duration

Prior studies [12,3] observed that media files might be classified into a set of groups according to their durations. Different workloads can be grouped based on the content of media files hosted by a streaming service. For example, music sites may have file durations from 3 to 5 min, while movie sites may have file durations from one and a half to two hours. While a particular workload might be captured by a certain statistical distribution, the same distribution may fail to capture another workload. In our case, although the file duration distribution of the HPC trace can be modeled by a heavy-tail distribution such as a Weibull distribution [15], the same distribution fails to capture the file duration distribution of the HPL trace.

As shown in Figs. 1(a) and 2(a), the file durations in our traces are concentrated around a set of hot points. These hot points are usually some common durations, semantically meaningful to a particular type of media content. Based on this observation, we classify these hot points into a set of groups and use a set of normal distributions to model the grouped file duration distribution as shown in Figs. 1(a) and 2(a). Here, each group is modeled by a normal distribution with the mean ($\mu$) of each distribution defined by the hot point of that group. The standard deviation ($\sigma$) of each normal distribution determines the concentration of the durations within that group.

Note that we do not use segmented probability density functions (PDFs) to model the duration distribution. We assume a hot point can affect the entire duration scope rather than just a segment. Thus, we use an aggregated distribution, whose PDF sums the PDFs of all normal distributions proportionally. To proportionally sum all duration groups, we associate each group with a ratio determined by the number of files in the group compared with the total number of files in the trace. So the normal distribution PDF of each group is normalized against the ratio of that group. If only a fraction of a normal distribution for a group is used, normalization is performed on the adopted fraction of the distribution. For example, since the mean of the first group in Table 2 is 0, only half of the normal distribution is used. Tables 2 and 3 present the mean ($\mu$), the standard deviation ($\sigma$) and the ratio of each normal distribution for the HPC and HPL trace respectively. They show that the HPC and HPL traces have different hot points.

Fig. 2. PDF of the HPL duration distribution: (a) five normal distributions to capture the five peaks and (b) the aggregate distribution of the five normal distributions. (Axes: Duration (sec) vs. Number of files.)

Table 2
Parameters of the normal distributions for the HPC trace

Group             1       2       3       4
$\mu$ (sec)       0       2000    3300    5821
$\sigma$ (sec)    600     400     600     1223
Ratio             63%     10%     18%     9%

Table 3
Parameters of the normal distributions for the HPL trace

Group             1       2       3       4       5
$\mu$ (sec)       117     2900    4200    5160    6300
$\sigma$ (sec)    1200    240     360     180     1000
Ratio             19%     26%     30%     10%     15%

Fig. 3. PDF of the MediSyn duration distribution: (a) four normal distributions to capture the four peaks and (b) the aggregate distribution of the four normal distributions. (Axes: Duration (sec) vs. Number of files.)

In MediSyn, users can specify a set of duration groups with different $\mu$, $\sigma$ and ratios based on the nature of the media workload they want to generate. For each duration group, MediSyn generates a sequence of durations according to the ratio and the normal distribution of the group. We use the rejection method [15] to generate the duration sequence according to the parameterized normal distribution. Fig. 3(a) and (b) show the durations generated by MediSyn to simulate the HPC workload based on the parameters presented in Table 2.

We use the $\lambda^2$ discrepancy measure introduced by Vern Paxson [26] to compare the duration set generated by MediSyn with the original data set. The $\lambda^2$ value is 0.1140.
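As an illustration of the duration model, the Python sketch below evaluates the aggregated PDF and draws per-group durations by rejection sampling. It is not MediSyn's code: the function names, the uniform proposal clipped to ±4σ around each hot point, and the rounding of per-group file counts are all assumptions made for this example. Only the group parameters come from Table 2.

```python
import math
import random

# (mu, sigma, ratio) per duration group, taken from Table 2 (HPC trace).
HPC_GROUPS = [(0, 600, 0.63), (2000, 400, 0.10), (3300, 600, 0.18), (5821, 1223, 0.09)]

def mass_above_zero(mu, sigma):
    """Fraction of N(mu, sigma) lying at x >= 0 (0.5 when mu == 0)."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))

def aggregate_pdf(x, groups):
    """Ratio-weighted sum of the group PDFs, each renormalized on x >= 0."""
    if x < 0:
        return 0.0
    total = 0.0
    for mu, sigma, ratio in groups:
        density = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += ratio * density / mass_above_zero(mu, sigma)
    return total

def sample_group(mu, sigma, count, rng, width=4.0):
    """Rejection sampling from N(mu, sigma) restricted to x >= 0."""
    # Assumed proposal: uniform on [max(0, mu - width*sigma), mu + width*sigma].
    lo, hi = max(0.0, mu - width * sigma), mu + width * sigma
    out = []
    while len(out) < count:
        x = rng.uniform(lo, hi)
        # Accept with probability proportional to the normal density at x.
        if rng.random() <= math.exp(-0.5 * ((x - mu) / sigma) ** 2):
            out.append(x)
    return out

def sample_durations(n_files, groups, seed=0):
    """Generate durations group by group, sized by each group's ratio."""
    rng = random.Random(seed)
    durations = []
    for mu, sigma, ratio in groups:
        # Rounded counts may not add up to n_files exactly for arbitrary ratios.
        durations += sample_group(mu, sigma, round(ratio * n_files), rng)
    rng.shuffle(durations)
    return durations

durations = sample_durations(1000, HPC_GROUPS)
```

For the first group, whose mean is 0, the clipping at zero keeps only the upper half of the normal distribution, while the group still receives its full 63% share of the files.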

Page 6: Modeling and generating realistic streaming media server ...€¦ · Modeling and generating realistic streaming media server workloads Wenting Tang a,1, Yun Fu b,2, Ludmila Cherkasova

W. Tang et al. / Computer Networks 51 (2007) 336–356 341

3.1.2. Encoding bit rate

Since most commercial media servers are designed to stream media files encoded at some constant bit rates, the current version of MediSyn is designed to generate only a set of constant bit rates for the underlying fileset.

MediSyn models encoding bit rates by a discrete distribution, where the value of each bit rate and the fraction of the fileset encoded at that bit rate can be specified. Based on the discrete distribution provided by users, MediSyn generates a sequence of bit rates for the fileset and matches the bit rates with the file durations randomly, since we observe that there is no correlation between them in our traces (the correlation coefficient is 0.0144).
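A discrete bit-rate distribution of this kind can be sampled in a few lines of Python. The sketch below is only illustrative: the function name `assign_bit_rates` and the example rates and shares are hypothetical and do not come from the traces.

```python
import random

def assign_bit_rates(n_files, rate_ratios, seed=0):
    """Draw one constant encoding bit rate per file from a discrete distribution."""
    rng = random.Random(seed)
    rates = list(rate_ratios.keys())
    weights = list(rate_ratios.values())
    # Independent draws: pairing the i-th rate with the i-th duration is a random match.
    return rng.choices(rates, weights=weights, k=n_files)

# Hypothetical distribution: bit rate in bits/s -> share of the fileset.
EXAMPLE_RATES = {56_000: 0.5, 112_000: 0.3, 256_000: 0.2}
bit_rates = assign_bit_rates(1000, EXAMPLE_RATES)
```

Because each rate is drawn independently of the file it is assigned to, no correlation with duration is introduced, in line with the observation above.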

3.1.3. Popularity

Earlier studies [13,18] found that media file popularity can often be captured by a Zipf-like distribution. A Zipf-like distribution states that the access frequency of the $i$th most popular file is proportional to $1/i^{\alpha}$. If the frequencies of files and the corresponding popularity ranks are plotted on a log–log scale, a Zipf-like distribution can be fitted by a straight line. A larger $\alpha$ implies more sessions are concentrated on the most popular files. Some synthetic workload generators [6,17] also adopt a Zipf-like distribution in generating file popularity.

However, several studies [2,12,3,5,9] analyzing the properties of workloads collected over significant periods of time observed that for some web and streaming media workloads, a Zipf-like distribution does not accurately capture the file popularity distribution. The popularity distribution of these workloads shows a circular curve on a log–log scale.

Fig. 4. The original popularity distributions of the HPC trace and the HPL trace on a log–log scale. (Axes: Popularity rank vs. Frequency; series: HPC log, HPL log.)

For reference, Fig. 4 shows the file popularity distributions of the HPC and the HPL traces over the entire trace periods on a log–log scale. They are more like circular curves similar to those distributions noticed in previous web studies [5,9] and media workload studies [2,12].

If we use a straight line (a Zipf-like distribution) to fit the circular curve and generate session frequencies based on the value of $\alpha$ obtained by curve fitting, the generated frequencies will necessarily deviate from the original session frequencies. Breslau et al. [9] calculated $\alpha$ by excluding the top 100 files. For our traces, not only the beginning but also the end of the curves cannot be fitted by straight lines. Moreover, since the most popular files are especially important for synthetic streaming media workloads, we cannot ignore the first 100 files.

• Zipf–Mandelbrot law

The Zipf–Mandelbrot law [25] is a discrete probability distribution that generalizes the Zipf distribution. The law can be described as

$$f(x) = \frac{C}{(x + k)^{\alpha}}, \tag{1}$$

where $x$ is the file popularity rank, $k$ is a constant, $\alpha$ is the same as the parameter of the Zipf distribution, and $C$ is a normalization constant, $C = 1 / \sum_{i=1}^{N} 1/(i + k)^{\alpha}$, with $N$ the number of files. We observe that the circular curves of file popularity can be captured by the Zipf–Mandelbrot law.

• k-transformation

To help users select popularity distributions in our workload generator, we provide a simple transformation (k-transformation) that can assist the parameter fitting of the Zipf–Mandelbrot law and intuitively illustrate the meaning of the parameters. The k-transformation is defined as follows: given $x$ as a file rank and $y$ as the corresponding access frequency for the file, the following k-transformation can transform $x$ and $y$ to a Zipf-like distribution between $x_k$ and $y_k$ with the same $\alpha$,

$$x_k = \frac{x + k_x - 1}{k_x}, \tag{2}$$

$$y_k = \frac{y + k_y - 1}{k_y}, \tag{3}$$

where $k_x$ and $k_y$ are scale parameters. Since $y_k = C_k / x_k^{\alpha}$ ($C_k$ is the normalization constant),

$$\frac{y + k_y - 1}{k_y} = \frac{C_k}{\left( (x + k_x - 1)/k_x \right)^{\alpha}}, \tag{4}$$

$$y = \frac{C_k k_x^{\alpha} k_y}{(x + k_x - 1)^{\alpha}} + 1 - k_y. \tag{5}$$

Eq. (5) is a centered Zipf–Mandelbrot distribution. The parameters $C$ and $k$ of the Zipf–Mandelbrot distribution on $x$ and $y$ can be described as $C = C_k k_x^{\alpha} k_y$, $k = k_x - 1$. We also introduce a constant $a = 1 - k_y$.

Fig. 5 shows the relationship between $x_k$ and $y_k$ of the HPC and HPL traces on a log–log scale. We observe that they closely follow straight lines. The $\alpha$ value of the Zipf k-transformation is derived through linear regression [14]. Table 4 shows some critical parameters related to the curve fitting of the two workloads. As shown in the table, $R^2$ is 0.995 for both the HPC and HPL traces, indicating that straight lines fit both distributions very well.

Fig. 5. The popularity distributions after Zipf k-transformation on a log–log scale. (Axes: Popularity rank vs. Frequency; series: HPC log, HPL log.)

Table 4
Parameters of Zipf k-transformation of the HPC and the HPL traces

Trace    $\alpha$    $R^2$    $k_x$    $k_y$    Maximum frequency    Number of files
HPC      1.561       0.995    12       12       17831                1434
HPL      1.23        0.995    7        7        961                  364

The reason that the original traces do not show perfectly straight lines at the heads of the curves is that there is little differentiation in the frequencies of the most popular files (with smaller $x$). This can be attributed to the fact that a long-term trace can collect enough files with similar popularities over time, and thus these files can be considered as a group (equivalence class), where a group rank is a better reflection of the file popularities. Intuitively, the effect of the k-transformation is that the popularity follows a Zipf-like distribution if we check every group of $k_x$ files. We divide $x$ by $k_x$ to scale the file ranks so that the $((i - 1)k_x + 1)$th rank becomes the $i$th rank and now reflects the corresponding file group rank. So we actually move all points on the log–log scale along the x-axis to the left and squeeze the points into a straighter line.

Similarly, the reason that the traces do not show perfectly straight lines at the tails of the curves is that there is not enough differentiation in the number of files with the lowest frequencies. So we divide $y$ by $k_y$ to squeeze those points along the y-axis on the log–log scale. The value of $k_y$ is not necessarily the same as $k_x$. However, MediSyn uses the same value for $k_x$ and $k_y$ based on our observations for both the HPC and HPL traces, which we simply refer to as $k$. The scale parameter $k$ of our k-transformation is similar to the scale parameter $k$ of a general Pareto distribution [1].

An explanation for the k-transformation is that the original frequency sequence cannot be fitted by a Zipf-like distribution starting from rank 1, but it can be fitted into part of a Zipf-like distribution starting from rank $k_x$. To describe this file rank starting from $k_x$ by a Zipf-like distribution, we have to divide its original rank by $k_x$. A similar explanation can be applied for $k_y$.
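The k-transformation of Eqs. (2) and (3) and the regression step can be sketched in Python as follows. This is a minimal illustration rather than the fitting procedure actually used with MatLab: the function names are hypothetical, and the check at the end uses synthetic frequencies built from Eq. (5) with assumed parameters, not trace data.

```python
import math

def k_transform(values, k):
    """Apply v -> (v + k - 1) / k, the k-transformation of Eqs. (2) and (3)."""
    return [(v + k - 1) / k for v in values]

def fit_alpha(ranks, freqs, kx, ky):
    """Least-squares fit of log(y_k) against log(x_k); returns (alpha, r_squared)."""
    lx = [math.log(v) for v in k_transform(ranks, kx)]
    ly = [math.log(v) for v in k_transform(freqs, ky)]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    # On a log-log scale, y_k = C_k / x_k**alpha is a line with slope -alpha.
    return -slope, (sxy * sxy) / (sxx * syy)

# Synthetic check: frequencies built from Eq. (5) with C_k = 1000, alpha = 1.2, kx = ky = 7.
ranks = list(range(1, 501))
freqs = [(1000.0 / ((x + 6) / 7) ** 1.2 - 1) * 7 + 1 for x in ranks]
alpha, r2 = fit_alpha(ranks, freqs, kx=7, ky=7)
```

On this synthetic input the transformed points lie on an exact line, so the fit recovers the exponent used to generate them; on real traces the scale parameters would have to be chosen first, for example by trying several values and comparing $R^2$.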

For popularity generation, instead of specifying the total number of requests, we follow the traditional load generator style and ask users to specify the maximum access frequency of the most popular file; the popularity of the other files is then generated based on the Zipf–Mandelbrot law. To generate a sequence of frequencies, users of MediSyn only need to specify the maximum frequency $M$ for the most popular file, the number of files $N$, the scale parameter $k$, and the Zipf-like distribution parameter $\alpha$. MediSyn computes the frequency of the $x$th most popular file ($x \in [1, N]$) using the following formula:

$$\left( \frac{M_k}{\left( \frac{x + k - 1}{k} \right)^{\alpha}} - 1 \right) k + 1, \tag{6}$$

where $M_k = \frac{M - 1}{k} + 1$. Fig. 6 compares the frequencies generated by MediSyn with the original frequencies in our traces.

Fig. 6. Comparison between the popularity distribution generated by MediSyn and the original traces on a log–log scale. (Axes: Popularity rank vs. Frequency; series: HPC log, MediSyn HPC, HPL log, MediSyn HPL.)

Table 5
Correlation coefficient between file popularity and file duration

Workload                   HPC frequency    HPL frequency    HPC rank    HPL rank
Correlation coefficient    −0.03            0.05             −0.20       −0.002

Fig. 8. Popularity rank vs. duration for the HPL trace. (Axes: Popularity rank vs. Duration.)
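Eq. (6) translates directly into code. The Python sketch below is an illustration under assumptions: the function name `popularity_frequencies` is hypothetical, and the values are left real-valued because the text does not say how MediSyn rounds them to session counts. The parameters are those reported for the HPL trace in Table 4.

```python
def popularity_frequencies(max_freq, n_files, k, alpha):
    """Access frequency of the x-th most popular file, x = 1..N, following Eq. (6)."""
    m_k = (max_freq - 1) / k + 1
    return [
        (m_k / ((x + k - 1) / k) ** alpha - 1) * k + 1
        for x in range(1, n_files + 1)
    ]

# Parameters reported for the HPL trace in Table 4.
hpl = popularity_frequencies(max_freq=961, n_files=364, k=7, alpha=1.23)
```

For $x = 1$ the expression reduces to $M$, so the most popular file receives exactly the requested maximum frequency, and the sequence decreases monotonically with rank.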

To determine whether there is a correlation between file duration and file popularity, we compute the correlation coefficient between file popularity and file duration for both of our workloads. Table 5 shows these results. We use both the file frequency and the file rank as the popularity metric to compute the correlation coefficient.

Fig. 7. Popularity rank vs. duration for the HPC trace. (Axes: Popularity rank vs. Duration.)

Figs. 7 and 8 show the relationship between popularity and file duration for the HPC and HPL traces.

Overall, we observe no strong correlation between file popularity and file duration. So file duration and file popularity are randomly matched in MediSyn.

We also check for possible correlation between popularity and encoding bit rate. Once again, there is no correlation between them, so MediSyn matches popularity and encoding bit rate randomly.

3.1.4. Prefix

One major characteristic of streaming workloads is that a significant number of clients do not finish playing an entire media file [12,3]. Typically, this reflects the browsing nature of client accesses, client time constraints, or QoS-related issues. Most incomplete sessions (i.e., sessions terminated by clients before the video has finished) access only the initial segments of media files. In the HPC (HPL) trace, only 29% (12.6%) of the accesses finish the playback of the files, and 50% (60%) of the accesses last less than 2 min. This high percentage of incomplete accesses, together with the large number of sessions accessing only the initial part of a file, creates a very special resource usage model, which is widely considered in streaming media cache design [24].

We refer to the duration between the start of a media session and the time when the session is terminated by the client as the prefix duration of the session, or simply the prefix. Figs. 9 and 10 show the histograms of the prefixes of two typical example files in the HPC trace.

Fig. 9. A typical access prefix distribution of a short-duration media file. [Histogram: ratio versus prefix (sec).]

Fig. 10. A typical access prefix distribution of a long-duration media file. [Histogram: ratio versus prefix (sec).]

Fig. 11. Fraction of complete sessions ($r_c$) versus the corresponding media file durations. [Plot: $r_c$ versus duration (sec).]

The "spikes" in the figures correspond to successfully completed media sessions for the files, while the other prefixes in the figures are incomplete sessions. We observe that there is a strong correlation between the file duration and the prefix distribution:

• Complete sessions. The fraction of complete sessions of a file depends strongly on the file duration: a short file tends to have more complete sessions. For example, the file durations in Figs. 9 and 10 are 723 s and 4133 s, respectively, and the file in Fig. 9 has proportionally more complete sessions than the file in Fig. 10. We use $r_c$ to denote the ratio of the complete sessions of a file to the total number of sessions of the file. Fig. 11 shows the relationship between file duration and $r_c$ and confirms that the $r_c$ of each file depends strongly on the file duration.

• Incomplete sessions. The prefix distribution of the incomplete sessions of a file also depends on the file duration. Fig. 9 shows that the prefixes of incomplete sessions for a short-duration media file can be captured by an exponential distribution. For a long-duration file such as the one in Fig. 10, however, the prefixes of incomplete sessions cannot be captured by an exponential distribution.

Thus, the overall prefix distribution of a media workload depends strongly on each file's prefix distribution, which in turn depends on the duration of the file. There is no straightforward way to capture the overall prefix distribution of the entire workload directly. Instead, to generate each file's prefix distribution, we first generate $r_c$ for the file and then model the distribution of incomplete sessions for the file based on the assigned $r_c$.

To generate $r_c$ for a file, we need to determine the relationship between $r_c$ and the file duration shown in Fig. 11. We observe that the contour of the dotted area in Fig. 11 follows a Zipf-like distribution. To obtain this curve, we segment the durations of all files into 1-min bins. The maximum $r_c$ value of each bin (denoted as $r_c^{max}$) constitutes the contour of the dotted area in Fig. 11. Fig. 12 shows the relationship between the $r_c^{max}$ of each bin and the duration (in minutes) of the corresponding bin on a log–log scale. Because the maximum value of $r_c^{max}$ is 0.74, the curve is flat in the beginning. The other points can be fitted with a straight line. Thus, we can use a Zipf-like distribution to capture the distribution of $r_c^{max}$ for all bins.

Fig. 12. $r_c^{max}$ of all bins versus the corresponding bin duration on a log–log scale. [Plot: $r_c$ versus duration (min).]

Fig. 13. $r_e$ versus $r_c$. [Plot: $r_e$ versus $r_c$; series: Ratio, Upper bound, Lower bound.]

We also observe that the $r_c$ values of the other files within a bin are uniformly distributed between 0 and the $r_c^{max}$ value of the bin. So, to generate $r_c$ for each file, MediSyn first classifies files into 1-min bins based on their durations. Then, MediSyn generates $r_c^{max}$ for each bin. For the files in each bin, the $r_c$ values are chosen according to a uniform distribution between 0 and the $r_c^{max}$ of the bin. Through this process, the $r_c$ of each file is determined.
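The binning procedure above can be sketched as follows. The text gives the 0.74 cap but not the fitted parameters of the Zipf-like envelope, so `scale` and `exponent` below are placeholder values chosen only for illustration, and all function names are hypothetical.

```python
import math
import random


def r_max_for_bin(bin_minutes, scale, exponent, cap):
    """Zipf-like envelope for the largest r_c in a 1-min bin, flat at `cap`."""
    return min(cap, scale * bin_minutes ** (-exponent))


def generate_rc(durations_sec, scale=5.0, exponent=1.0, cap=0.74, rng=None):
    """Draw r_c for each file uniformly between 0 and the r_c^max of its bin.

    `scale` and `exponent` are NOT taken from the paper; they are placeholders.
    """
    rng = rng or random.Random()
    ratios = []
    for duration in durations_sec:
        bin_minutes = max(1, math.ceil(duration / 60))   # 1-min duration bins
        bound = r_max_for_bin(bin_minutes, scale, exponent, cap)
        ratios.append(rng.uniform(0.0, bound))
    return ratios
```

For example, `generate_rc([723, 4133], rng=random.Random(1))` yields one ratio per file, with the longer file restricted to a much smaller upper limit.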

After the $r_c$ value of each file is determined, the distribution of incomplete sessions needs to be determined. As mentioned above, depending on the file duration, it can be captured either by an exponential distribution or by a mix of an exponential and a uniform distribution. Additionally, both Figs. 9 and 10 show a similar shape at the beginning of the distributions within a certain range of duration. We observe that for more than 90% of the media files in the HPC trace, the distributions of prefixes within the first 5 min can be fitted by exponential distributions. These results confirm similar findings for an educational workload studied by Almeida et al. [3]. Given that prefixes within a certain duration range (e.g., the first 5 min) account for a high percentage of all incomplete sessions, we introduce a cut-off point and use the following method to model the prefix distribution of a given media file:

• If a media file duration is less than the cut-off point, its incomplete prefixes are modeled by an exponential distribution.

• If a media file duration is longer than the cut-off point, the distribution of incomplete prefixes is modeled by the concatenation of two distributions. The distribution of incomplete prefixes shorter than the cut-off point is modeled by an exponential distribution. The distribution of the remaining incomplete prefixes, which are longer than the cut-off point, is approximated by a uniform distribution.

In the HPC trace, the cut-off point is 5 min. Users of MediSyn can specify their own cut-off point. We use the following notation:

• $r_e$ denotes the ratio of incomplete sessions whose prefixes are within the cut-off point to the total number of sessions of the file,

• $r_u$ denotes the ratio of incomplete sessions whose prefixes are longer than the cut-off point to the total number of sessions of the file. If a file duration is less than the cut-off point, $r_u$ is 0.

Given $r_c$ for a file, MediSyn needs to generate the values of $r_e$ and $r_u$ for the file. The strategy is to generate $r_e$ for the file and to set $r_u = 1 - r_c - r_e$. Fig. 13 shows the relationship between $r_c$ and $r_e$ for all files in the HPC trace. We can see that for a given $r_c$, the value of $r_e$ is bounded on both the lower and upper sides. If we denote the maximum of all $r_c$ values as $R_c^{max}$ (0.74 for the HPC trace), the upper bound $r_e^{upper}$ and the lower bound $r_e^{lower}$ for a given $r_c$ can be computed by Eqs. (7) and (8), respectively.

$$ r_e^{upper} = 1 - r_c \qquad (7) $$

$$ r_e^{lower} = r_e^{upper} \cdot \left( 0.6 + 0.4 \cdot r_c / R_c^{max} \right) \qquad (8) $$
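As a worked instance of Eqs. (7) and (8), take a hypothetical file with $r_c = 0.37$, i.e., half of $R_c^{max} = 0.74$ (the value of $r_c$ is chosen purely for illustration, and Eq. (8) is read as a product, in line with the 60% figure quoted for the lower bound):

```latex
\begin{align*}
r_e^{upper} &= 1 - 0.37 = 0.63, \\
r_e^{lower} &= 0.63 \cdot \left( 0.6 + 0.4 \cdot \frac{0.37}{0.74} \right)
             = 0.63 \cdot 0.8 = 0.504 .
\end{align*}
```

Any $r_e$ in $[0.504, 0.63]$ then leaves $r_u = 1 - r_c - r_e$ between 0 and 0.126.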

Fig. 14. CDFs of the prefixes generated by MediSyn and the HPC trace. [Plot: cumulative distribution function versus session prefixes (sec); series: HPC log, MediSyn.]

Fig. 13 plots these calculated upper and lower bounds. The upper bound follows from the constraint that $r_c$, $r_e$, and $r_u$ sum to 1. The lower bound rises from 60% of the upper bound to the upper bound itself as $r_c$ increases from 0 to $R_c^{max}$. So during workload generation, for a given $r_c$, MediSyn generates the value of $r_e$ according to a uniform distribution between the corresponding lower bound $r_e^{lower}$ and upper bound $r_e^{upper}$.

After generating $r_c$, $r_e$ and $r_u$ for each media file, MediSyn still needs to generate the mean ($\mu$) of the exponential distribution for the incomplete prefixes. Since each file has its own exponential distribution for prefixes, we have to derive the distribution of the $\mu$ values over all media files. Analysis of session prefixes in the HPC trace shows that the $\mu$ values of all media files follow a normal distribution.

MediSyn generates a sequence of prefixes according to the generated values of $r_c$, $r_u$, $r_e$ and $\mu$ for each file. Then, these prefixes are randomly matched with all the sessions of the file. In this way, MediSyn can generate all session prefixes for every file. Fig. 14 shows the cumulative distribution function (CDF) of the prefixes in the HPC trace compared with the CDF of the prefixes generated by MediSyn.
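The per-file prefix model can be sketched in Python as below. This is a minimal illustration under stated assumptions, not MediSyn's implementation: the function names are hypothetical, each session's category is drawn independently (so the order is already random), Eq. (8) is read as a product, and the exponential part is truncated at the cut-off point by inverse-CDF sampling, which the text does not prescribe.

```python
import math
import random


def split_incomplete_ratio(r_c, r_c_max, duration, cutoff, rng):
    """Return (r_e, r_u) for one file, following Eqs. (7) and (8)."""
    upper = 1.0 - r_c                                # Eq. (7)
    if duration <= cutoff:
        return upper, 0.0                            # r_u is 0 for short files
    lower = upper * (0.6 + 0.4 * r_c / r_c_max)      # Eq. (8), read as a product
    r_e = rng.uniform(lower, upper)
    return r_e, upper - r_e                          # r_u = 1 - r_c - r_e


def truncated_exponential(mean, limit, rng):
    """Inverse-CDF draw from an exponential restricted to [0, limit]."""
    u = rng.random()
    return -mean * math.log(1.0 - u * (1.0 - math.exp(-limit / mean)))


def generate_prefixes(n_sessions, duration, r_c, r_e, mu, cutoff, rng):
    """Draw one prefix (in seconds) per session of a file."""
    short_file = duration <= cutoff
    prefixes = []
    for _ in range(n_sessions):
        u = rng.random()
        if u < r_c:
            prefixes.append(float(duration))         # complete session
        elif short_file or u < r_c + r_e:
            limit = min(cutoff, duration)            # exponential part
            prefixes.append(truncated_exponential(mu, limit, rng))
        else:
            prefixes.append(rng.uniform(cutoff, duration))   # uniform tail
    return prefixes
```

Here `mu` would itself be drawn per file from the normal distribution mentioned above, whose parameters are not given in this section.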

3.2. Temporal properties

3.2.1. Causes of temporal locality in media workloads collected over a long period of time

Temporal reference locality, which is universally observed in web and media workloads [13,12,4], is the primary factor that affects session arrival ordering. Temporal locality means that recently accessed objects are likely to be accessed again in the near future. Two factors can cause temporal locality in the access stream: skewed popularity distribution and temporal correlation [11,16]. Since popular files have a higher probability of being accessed within the access stream, a file's popularity contributes to its temporal locality. If we randomly permute the access stream, the temporal locality caused by skewed popularity is preserved. However, temporal locality caused by temporal correlation cannot be preserved under random permutation. In MediSyn, to generate a stream of session arrivals exhibiting proper temporal locality, we need to clearly understand and distinguish the causes of temporal locality.

To check for possible temporal correlation among sessions for the same files in our traces, we compare reference distances [4] between the original HPC trace and a randomly permuted HPC trace. Let $s_1, s_2, \ldots, s_n$ be the stream of accesses representing the media sessions of the entire HPC trace. Let $s_{i_1}, s_{i_2}, \ldots, s_{i_m}$ be the sequence of sessions to the same file $f_i$. The reference distances of $s_{i_2}, s_{i_3}, \ldots, s_{i_m}$ are defined as $i_2 - i_1, i_3 - i_2, \ldots, i_m - i_{m-1}$. We calculate the reference distances for all the files and their sessions over the entire HPC trace. Then we apply the same procedure to the randomly permuted HPC trace.

Fig. 15(a) shows the histogram of the reference distances in the original HPC trace on a log–log scale. The X-axis shows the reference distances, and the Y-axis shows the number of sessions in the original HPC trace with the corresponding reference distance. Fig. 15(b) shows the histogram of the reference distances in the randomly permuted HPC trace on a log–log scale. The two curves differ: Fig. 15(b) has fewer references with small distances. For example, there are 122,765 references with distance 1 in Fig. 15(a) but only 5468 in Fig. 15(b). Since the reference distances in Fig. 15(b) are determined only by file popularity, we attribute the difference between Fig. 15(a) and (b) to temporal correlation among sessions for the same files over the entire trace.
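The reference-distance comparison can be sketched as follows, with the access stream represented as a plain list of file identifiers in arrival order. The function names are hypothetical.

```python
import random
from collections import Counter


def reference_distances(stream):
    """Distance in stream positions between consecutive accesses to the same file."""
    last_pos = {}
    distances = []
    for pos, file_id in enumerate(stream):
        if file_id in last_pos:
            distances.append(pos - last_pos[file_id])
        last_pos[file_id] = pos
    return distances


def distance_histogram(stream):
    """Number of references observed for each reference distance."""
    return Counter(reference_distances(stream))


def permuted(stream, seed=0):
    """Randomly permuted copy of the stream; file popularity is unchanged."""
    shuffled = list(stream)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Comparing `distance_histogram(stream)` with `distance_histogram(permuted(stream))` separates the locality that survives permutation (popularity) from the part that does not (temporal correlation).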

To verify whether our media traces exhibit short-term temporal correlation, we calculate reference distances within every day and sum the number of references with the same distance over all days for the HPC trace. Then, we permute accesses within every day and calculate reference distances again for the permuted trace. The results are shown in Fig. 16(a) and (b). We observe that there is almost no difference between the original trace and the permuted trace in terms of reference distances. This implies that there is no temporal correlation for sessions within a single day and that the temporal locality of sessions within a single day is determined purely by the file popularity distribution within that day.

Fig. 15. (a) Reference distances calculated over the entire trace period in the original HPC trace and (b) reference distances calculated over the entire trace period in the permuted HPC trace. [Log–log plots: number of references versus reference distance.]

Fig. 16. (a) Reference distances calculated within each day in the original HPC trace and (b) reference distances calculated within each day in the permuted HPC trace. [Log–log plots: number of references versus reference distance.]

The analysis above implies that temporal correlation in media workloads collected over a long period of time is due to long-term temporal correlation exhibited on a daily time scale, and that there is no temporal correlation for sessions within a single day. These observations motivate our choice of temporal properties in MediSyn. The temporal properties described below are intended to reflect the dynamics and evolution of accesses to media content over time and to define the proper temporal locality and long-term temporal correlation found in media workloads.

3.2.2. New file introduction process

One recent study [12] observes that accesses to new files constitute most of the accesses in any given month for enterprise media servers. We envision that this access pattern is even more pronounced for media news and sports sites. While the rate of new file introduction and of accesses to new files might be different for educational media sites, we aim to design a generic parameterized model capable of capturing the specifics of new content introduction for different media workloads. Among the design goals of our synthetic generator is the ability to reflect the evolution of media content provided by different media sites over a long period of time (months). Since we design MediSyn to support detailed resource allocation studies, we must account for the dynamic introduction of new content and its relative popularity.

The process of new file introduction mimics how files are introduced at a media site and attempts to answer the following questions:

• What is the new file arrival process on a daily time scale?

• What is the new file arrival process within an introduction day?

To model new file arrival on a daily level, we capture the time gap, measured in days, between two introduction days and the number of new files introduced on each introduction day. Fig. 17 shows the distribution of new file introduction gaps measured in days for the HPC trace. This distribution can be captured by a Pareto distribution with $\alpha = 2.0164$. Fig. 20 shows the histogram of the original trace and the PDF of the fitted Pareto distribution. Fig. 18 shows the introduction gap distribution for the HPL trace, which can be captured by an exponential distribution with $\mu = 4.2705$. Fig. 21 shows the histogram of the original trace and the PDF of the fitted exponential distribution. We center the exponential distribution at $x = 1$.

Fig. 17. New file introduction gaps measured in days for the HPC trace. [Histogram: number of introduction days versus new file introduction gap (day).]

Fig. 18. New file introduction gaps measured in days for the HPL trace. [Histogram: number of introduction days versus new file introduction gap (day).]

Fig. 19. The number of new files introduced per introduction day. [Histogram: number of introduction days versus number of new files introduced per introduction day.]

Fig. 20. The histogram and the PDF of the new file introduction gap for the HPC trace. [Plot: frequency versus file introduction gap in days.]

Fig. 21. The histogram and the PDF of the new file introduction gap for the HPL trace. [Plot: frequency versus file introduction gap in days.]

Fig. 23. New file introduction time gaps within an introduction day. [Histogram: number of files versus new file introduction gap (sec) within a day.]

MediSyn can generate new file introduction time gaps according to one of three possible distributions: (1) a Pareto distribution, (2) an exponential distribution, (3) a fixed interval. If users specify a Pareto distribution for the new file introduction process, files tend to be introduced into the system in clusters over time. If the introduction process is specified by an exponential distribution, the new file arrival process is a Poisson arrival process, which means that the interarrival times are independent. The fixed interval is used to model artificial introduction processes with regular patterns.
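The three options can be sketched as a single gap generator. The defaults below reuse the fitted values quoted for the HPC trace ($\alpha = 2.0164$) and the HPL trace ($\mu = 4.2705$); everything else is an assumption made for illustration: the function names, a Pareto minimum of 1 day, reading $\mu$ as a mean in days, reading "centered at $x = 1$" as a shift by one day, and rounding gaps down to whole days.

```python
import math
import random


def pareto_sample(alpha, rng, x_min=1.0):
    """Inverse-CDF draw from a Pareto distribution with minimum x_min."""
    return x_min / (1.0 - rng.random()) ** (1.0 / alpha)


def introduction_gap_days(kind, rng, alpha=2.0164, mean=4.2705, fixed=7):
    """Gap in whole days between two consecutive introduction days."""
    if kind == "pareto":
        gap = pareto_sample(alpha, rng)
    elif kind == "exponential":
        gap = 1.0 + rng.expovariate(1.0 / mean)   # assumed shift to start at 1
    elif kind == "fixed":
        gap = fixed
    else:
        raise ValueError(f"unknown distribution: {kind}")
    return max(1, math.floor(gap))


def introduction_days(n_days, kind, rng):
    """Day indices on which new files are introduced, starting from day 0."""
    days, day = [], 0
    while day < n_days:
        days.append(day)
        day += introduction_gap_days(kind, rng)
    return days
```

A call such as `introduction_days(180, "pareto", random.Random(3))` produces the introduction days of a hypothetical six-month trace.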

Since there may be multiple new files introduced in a given day, we must also model the number of files introduced per introduction day. Fig. 19 shows the distribution of the number of files introduced in a given day for the HPC trace. The distribution can be fitted by a Pareto distribution with $\alpha = 1.1323$. Fig. 22 shows the histogram and the PDF of the fitted Pareto distribution.

Fig. 22. The histogram and the PDF for the number of new files introduced per day. [Plot: frequency versus number of files introduced per introduction day.]

After determining the number of files introduced in a given day, MediSyn needs to model the new file arrival process within that day. We model this process by capturing the gap between two file arrivals. Fig. 23 shows the time gaps for new files introduced within a day. Since the distribution is too sparse on the time scale of seconds, we measure the time gaps in multiples of 900 s (15 min). The distribution can be captured by a Pareto distribution with $\alpha = 1.0073$.

Due to the properties of the Pareto distribution, if we only model the time gap between two file arrivals and start to introduce new files from the beginning of a day, then most of the files will be introduced at the beginning of every day. So we also capture the start times of the new file introduction process within every introduction day. Fig. 24 shows this distribution.

Fig. 24. The start times of new file introduction within introduction days. [Histogram: number of files versus time (hour).]

Fig. 25. The rotated start times of new file introduction within introduction days. [Histogram: number of files versus time (hour).]

Fig. 26. The correlation of file introduction time between the HPC workload and the workload generated by MediSyn. [Plot titled "Introduction Time Correlation": MediSyn generated file introduction time (sec) versus HPC file introduction time (sec).]

Since the distribution in Fig. 24 looks like a rotated normal distribution with the peak at 0, we rotate the distribution by 12 h as shown in Fig. 25. The result is a normal distribution with mean 43,200 s and standard deviation 21,600 s.
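A possible way to combine the start-time model with the intra-day gaps is sketched below. The mean, standard deviation, 900 s slot and $\alpha = 1.0073$ come from the text; the rest is assumed for illustration: the function names, wrapping the normal draw around midnight to undo the 12 h rotation, a minimum gap of one slot, and clamping arrivals that would fall past the end of the day.

```python
import math
import random

DAY = 86400      # seconds per day
SLOT = 900       # gaps are measured in multiples of 15 min


def first_introduction_time(rng, mean=43200.0, std=21600.0):
    """Start time (sec) of the introduction process within a day."""
    rotated = rng.gauss(mean, std)        # sample in the rotated domain
    return (rotated - DAY / 2) % DAY      # undo the 12 h rotation, wrap the day


def introduction_times_in_day(n_files, rng, alpha=1.0073):
    """Introduction times (sec) of n_files >= 1 files within one day."""
    t = first_introduction_time(rng)
    times = [t]
    for _ in range(n_files - 1):
        # Pareto gap in slots with an assumed minimum of one slot.
        gap_slots = math.floor(1.0 / (1.0 - rng.random()) ** (1.0 / alpha))
        t = min(t + gap_slots * SLOT, DAY - 1)   # assumed clamp at end of day
        times.append(t)
    return times
```

Because the rotated distribution peaks at 12 h, undoing the rotation places most start times near midnight, matching the peak at 0 noted for Fig. 24.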

Fig. 26 shows the correlation of file introduction times between the HPC workload and a workload generated by MediSyn to simulate the HPC workload. The correlation coefficient is 0.9889, so the generated file introduction time sequence matches the original trace quite well.

3.2.3. Life span

Since temporal correlation is observed in media workloads, an independent reference model combined with a global popularity distribution [9] is insufficient for a synthetic workload generator to generate a file access stream. SURGE [6] uses a stack distance model to generate web reference streams with reference locality. Both the independent reference model [9] and the stack distance model [6,4] assume that each file's popularity is stationary over the entire trace period and that each file is introduced at the start of the trace. Since we observe non-stationary popularity in streaming media workloads, such models are unsuitable for generating session arrivals in streaming media workloads.

A new property called life span has recently been proposed in [12] to measure the change in access rate of newly introduced files. Life spans reflect the timeliness of accesses to the introduced files. We observe that accesses to a media file are not uniformly distributed over the entire trace period. Instead, most accesses to a file occur shortly after the file is introduced, with the access frequency gradually decreasing over time. For example, for the HPC (HPL) log, 52% (51%) of the accesses occur during the first week after file introduction, while only 16% (10%) of the accesses occur during the second week. Hence, the file access frequency (file popularity) changes over time. In other words, file popularity is non-stationary over the trace period. This phenomenon implies that session arrivals might be very bursty when new files are introduced at a site.

To accurately model the non-stationarity of file popularity, we use the new file introduction process to mimic how media files are introduced at the media sites, as described above. In addition, each file has its own life span, which characterizes its changing popularity after the file's introduction. Thus, the global file popularity distribution, the file introduction process and the life spans of individual files together capture the popularity change of media files over the entire trace.

We define the relative access time of a file as a random variable whose value is the time, measured in days, at which the file is accessed by a client after the file is introduced. The distribution of a file's relative access times describes the temporal correlation of all accesses to the file. We also call this distribution the life span distribution of the file. In our traces, we observe two types of life span distributions, as illustrated in Figs. 27 and 28, respectively.

Fig. 27. A regular life span. [Plot: ratio versus life span (day).]

Fig. 28. A news-like life span. [Plot: ratio versus life span (day).]

Fig. 29. PDF that a file at a certain rank has a Pareto life span distribution. [Plot: ratio versus normalized file rank (0–1].]

Table 6. The parameters of the normal distributions that model the parameters of the lognormal and Pareto life span distributions

| Normal distribution parameters | Lognormal $\mu$ | Lognormal $\sigma$ | Pareto $\alpha$ |
|---|---|---|---|
| $\mu$ | 3.0935 | 1.1417 | 0.7023 |
| $\sigma$ | 0.9612 | 0.3067 | 0.2092 |

Since most files in our traces have life spans similar to Fig. 27, we call this type of life span a regular life span.

News-like streaming contents typically have life span distributions similar to Fig. 28, where most accesses occur shortly after the file introduction and the access frequency diminishes relatively quickly. So we refer to this kind of life span as a news-like life span.

We experimented with gamma, Pareto, exponential and lognormal distributions to fit the relative access times of our traces. Although gamma distributions can capture both news-like and regular life spans to some extent, the combination of Pareto and lognormal distributions fits them better. Thus, we model news-like life spans with Pareto distributions and regular life spans with lognormal distributions.

To generate a sequence of regular life spans and news-like life spans, we need to model the distributions of the mean ($\mu$) and the standard deviation ($\sigma$) for regular life spans, and the distribution of $\alpha$ for news-like life spans. Our analysis of the HPC and HPL traces shows that these parameters follow normal distributions. Table 6 shows the parameters of these normal distributions, derived from the HPC log, which capture the parameters of regular life spans ($\mu$ and $\sigma$) and news-like life spans ($\alpha$).

There is a strong correlation between file popularity and life span shape. A file with a higher popularity rank tends to have a higher probability of having a news-like life span. Fig. 29 shows the PDF for this probability. The distribution can be captured by an exponential distribution. File ranks have been normalized to the interval (0, 1] so that $\mu$ for the exponential distribution is independent of the number of media files generated. In the HPC trace, we observed 82 news-like life spans among the 400 most popular files. Users of MediSyn can specify their own ratio according to the workload they want to generate. A workload including more files with news-like life spans has a more bursty access pattern.
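The life span model can be sketched with the Table 6 values as follows. The sketch takes the news-like/regular decision as an input, since the parameter of the rank-dependent exponential is not given in this section. The function names, the rejection of non-positive parameter draws, the Pareto minimum of 1 day and the `max_days` clamp are assumptions made for illustration.

```python
import math
import random

# (mean, standard deviation) of the normal distributions in Table 6.
TABLE6 = {
    "lognormal_mu": (3.0935, 0.9612),
    "lognormal_sigma": (1.1417, 0.3067),
    "pareto_alpha": (0.7023, 0.2092),
}


def positive_normal(mean, std, rng):
    """Normal draw, redrawn until positive (an assumed safeguard)."""
    while True:
        value = rng.gauss(mean, std)
        if value > 0:
            return value


def draw_life_span(news_like, rng):
    """Pick the life span parameters of one file."""
    if news_like:
        return ("pareto", positive_normal(*TABLE6["pareto_alpha"], rng))
    mu = rng.gauss(*TABLE6["lognormal_mu"])
    sigma = positive_normal(*TABLE6["lognormal_sigma"], rng)
    return ("lognormal", mu, sigma)


def relative_access_day(life_span, rng, max_days=365.0):
    """Draw one relative access time (days after introduction)."""
    if life_span[0] == "pareto":
        # Inverse-CDF Pareto draw in log space to avoid overflow.
        log_day = -math.log(1.0 - rng.random()) / life_span[1]
        return math.exp(min(log_day, math.log(max_days)))
    _, mu, sigma = life_span
    return min(rng.lognormvariate(mu, sigma), max_days)
```

Drawing as many relative access times as the file's popularity and counting them per day gives the number of accesses for each day of the file's life span.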

Fig. 31. The PDF of session access interarrival time gaps for a file measured in an hour. [Plot: ratio versus gap (sec).]

3.2.4. Diurnal access pattern

Earlier studies observed diurnal access patterns for streaming media workloads [13,18,2,3,17]. The diurnal access pattern defines how the number of accesses to a site varies during a given period of time, e.g., a day. Diurnal access patterns might vary significantly for different media sites. For mixed media workloads utilizing a shared infrastructure, the diurnal access patterns have to be taken into account when designing efficient resource allocation. Additionally, the diurnal access pattern is important for capturing the burstiness of resource consumption within a given time period. In our synthetic workload generator, the diurnal access patterns are defined on the time scale of seconds within a given period, e.g., within a day.

After determining the life span and the global popularity of every file, MediSyn can generate the number of accesses for every day of a file's life span. Distributing these accesses over a day is challenging because we wish to model both session interarrival times and diurnal access patterns.

Fig. 30 shows a typical session interarrival time distribution for a file measured over a day. It is a heavy-tailed distribution and is fitted better by a Pareto distribution than by an exponential distribution. However, if we generate all interarrival times within a day from this Pareto distribution, it is difficult to ensure the diurnal pattern at the same time. Fig. 31 shows the interarrival time distribution for the same file within one hour of the same day. This distribution is not heavy-tailed and can be captured by an exponential distribution. Thus, if we can determine the number of accesses in each hour of a day according to a certain diurnal pattern, we can use an exponential distribution to generate the interarrival times of the accesses in this hour. In this way, we can both generate the diurnal pattern and satisfy the observed exponential distribution of interarrival times.

Fig. 30. The PDF of session access interarrival time gaps for a file measured in a day. [Plot: ratio versus gap (sec).]

Diurnal access patterns are universally observed by other streaming workload analyses. However, we do not find explicit diurnal patterns for single files; we only observe an aggregate diurnal access pattern over all file accesses. Fig. 32 shows the average ratios of accesses in each hour for all files in the HPC trace.

Fig. 32. The session access diurnal pattern for the HPC trace. Each bin is an hour. [Plot: ratio versus hour.]

In MediSyn, a user can specify a global diurnal pattern like Fig. 32, which contains a set of bins. Each bin specifies a time period and the ratio of accesses in this bin. Since we believe there is no temporal correlation among file accesses within a day (i.e., the temporal locality within a day is entirely determined by file popularities), we can make every file follow the diurnal pattern. Essentially, each file's session arrival process within a given day is modeled as a non-homogeneous Poisson process [23], where only the session arrivals within each bin can be modeled by a Poisson process. The session arrival rate of the file for a given bin is computed from the diurnal pattern specified by the user and the number of accesses within the day determined by the file life span. MediSyn generates the interarrival time gaps within each bin and constructs a sequence of sessions for the file on the scale of seconds.
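A minimal sketch of the per-bin arrival generation is shown below, assuming hourly bins. The function name is hypothetical. Because arrivals in each bin are drawn from a Poisson process whose rate is the expected number of accesses divided by the bin length, the total number of generated sessions matches the daily target only in expectation; whether MediSyn enforces exact counts is not stated in the text.

```python
import random


def sessions_for_day(n_accesses, hourly_ratios, rng, bin_sec=3600):
    """Arrival times (sec since midnight) of one file's sessions in one day.

    hourly_ratios[h] is the fraction of the day's accesses falling in bin h.
    """
    arrivals = []
    for hour, ratio in enumerate(hourly_ratios):
        expected = n_accesses * ratio
        if expected <= 0:
            continue
        rate = expected / bin_sec            # arrivals per second in this bin
        t = rng.expovariate(rate)            # exponential interarrival gaps
        while t < bin_sec:
            arrivals.append(hour * bin_sec + t)
            t += rng.expovariate(rate)
    return arrivals
```

For instance, `sessions_for_day(600, [1 / 24] * 24, random.Random(2))` spreads roughly 600 sessions evenly over the day, whereas a skewed ratio vector concentrates them in the busy hours.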

4. Workload generation process in MediSyn

MediSyn's goal is to generate a synthetic trace representing a sequence of file accesses to a media service. This process consists of two steps: file property generation and file access generation.

• File property generation

The first step is to generate values for all proper-ties introduced above for each media file. Staticproperties define file set parameters (durationand encoding bit rate) and the principal accesspatterns (popularity and prefix). Temporal prop-erties define the ordering and timing of mediasessions. MediSyn defines these properties usinga parameterized set of distributions in the inputconfiguration file. If a property can be describedby a value such as global file popularity, dura-tion, encoding bit rate, MediSyn first generatesa sequence of values according to the given distri-bution, and then selects a value for each file. If aproperty is modeled as a distribution, the choiceof the distribution and parameter(s) of the distri-bution are generated for each file.Thus, a file isthe basic unit to which the property values are

Table 7Properties generated for each file

File id Duration Bit rate (Kbps) Popularity File introduct

1 3600 112 20,000 1002 200 350 14,300 50... ..

. ... ..

. ...

n 600 28.8 1 10,000

propagated at the first step of workload genera-tion. At the end of the first step, the set of corre-sponding static and temporal properties shown inTable 7 is generated for each file. Section 3describes each property generation and correla-tions among the properties in detail.

• File access generation

Taking the assigned file popularities as the basis, MediSyn independently generates the arrival of media sessions to each file using: (i) the initial file introduction time, (ii) the life span of the file, and (iii) the diurnal access pattern of the file. Each file access includes the following three fields:
– timestamp indicating the session arrival time,
– file id specifying the target file accessed during the media session,
– file access prefix describing the duration of the media session.
Once a sequence of media sessions is generated for each file, all the media sessions are sorted according to a global time and merged together to generate the synthetic trace.
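The two-step process above can be sketched as follows. This is an illustrative simplification under stated assumptions, not MediSyn's implementation: the names `MediaFile`, `generate_files`, `sessions_for_file`, and `generate_trace` are hypothetical, and the distributions used (lognormal durations, a plain Zipf-like popularity, uniform session times over a fixed life span, uniform prefixes) are placeholders for the models of Section 3. The point of the sketch is the structure: properties are assigned per file, sessions are generated independently per file, and the sorted per-file sequences are merged by timestamp into one trace.

```python
import heapq
import random
from dataclasses import dataclass


@dataclass
class MediaFile:
    file_id: int
    duration_s: float    # static property
    popularity: int      # total number of sessions for the file
    intro_time_s: float  # temporal property: file introduction time


def generate_files(n, rng):
    """Step 1: draw a sequence of values per property, then assign one per file."""
    # Placeholder distributions (assumptions, not the paper's fitted models).
    durations = [rng.lognormvariate(6.0, 1.0) for _ in range(n)]
    popularities = [max(1, round(100 / rank)) for rank in range(1, n + 1)]
    intro_times = [rng.uniform(0.0, 86400.0) for _ in range(n)]
    return [
        MediaFile(i + 1, durations[i], popularities[i], intro_times[i])
        for i in range(n)
    ]


def sessions_for_file(media_file, life_span_s, rng):
    """Step 2a: one file's sessions as (timestamp, file id, prefix), sorted by time."""
    times = sorted(
        media_file.intro_time_s + rng.uniform(0.0, life_span_s)
        for _ in range(media_file.popularity)
    )
    return [
        (t, media_file.file_id, rng.uniform(0.0, media_file.duration_s))
        for t in times
    ]


def generate_trace(files, life_span_s, rng):
    """Step 2b: merge the per-file sequences into one trace ordered by global time."""
    per_file = [sessions_for_file(f, life_span_s, rng) for f in files]
    return list(heapq.merge(*per_file, key=lambda session: session[0]))


rng = random.Random(7)
example_files = generate_files(5, rng)
example_trace = generate_trace(example_files, 3600.0, rng)
```

Since each per-file sequence is already sorted, a k-way merge such as `heapq.merge` produces the global ordering without re-sorting the full set of sessions.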

5. Related work

Accurate workload characterization lays the foundation for the successful synthesis of realistic workloads. A number of studies on multimedia workload analysis have been reported in the literature [2,3,12,13,18,20].

Acharya et al. [2] presented an analysis of six-month trace data from the mMOD (multicast Media on Demand) system, which had a mix of educational and entertainment videos. They observed high temporal locality of accesses, a distinctive client browsing pattern in which clients prefer to preview the initial portion of a video, and rankings of video titles by popularity that do not fit a Zipfian distribution.

Almeida et al. [3] performed an analysis of two educational media server workloads. The authors provide a detailed study of the client session arrival process: the client session arrivals in one workload can be characterized as a Poisson process, and the interarrival times in the second workload follow a heavy-tailed Pareto distribution. They also observed that the amount of media delivered per session depends on the media file length.

The study by Chesire et al. [13] analyzed the media proxy workload at a large university. The authors presented a detailed characterization of session duration (most of the media streams are less than 10 min), object popularity (78% of objects are accessed only once), and sharing patterns of streaming media among the clients, and found that the popularity distribution follows a Zipf-like distribution (the trace covers one week).

Two enterprise media server workloads have been extensively studied in [12]. The data was collected over a significant period of time. Thus, the authors concentrated on the analysis of media server access trends, access locality, and the dynamics and evolution of the media workload over time. They reported non-Zipfian and non-stationary popularity of files observed in their data.

In our work, we attempt to summarize findings from the earlier work and build a general, unified model for workload characteristics that captures the unique properties of streaming media workloads as well as the dynamics in media workloads observed over a long period of time.

Since HTTP requests and streaming media sessions are very different, streaming media workloads exhibit many new properties relative to traditional web workloads. Thus, existing synthetic web workload generators [6] are not suitable for generating streaming media workloads.

The only synthetic workload generator for streaming media reported in the literature is GISMO [17]. MediSyn adopts an approach similar to that of GISMO, organizing the synthetic trace generation in two steps: (i) defining the individual session characteristics, and (ii) determining the media session arrival process. GISMO operates over a "fixed" set of media files already "introduced" at a media site, with the assumption that object popularity follows a Zipf-like distribution and remains the same over the entire duration of the experiment. Since we pursue the goal of developing a synthetic workload generator that reflects the dynamics and evolution of media workloads over time, we propose a set of new models to reflect these new temporal properties of streaming media workloads in MediSyn.

6. Conclusion and future work

Development of efficient resource allocation mechanisms for Internet hosting centers and CDNs serving streaming media content requires performing experiments with realistic streaming media workloads that can be scaled, parameterized, and mixed in a controllable and desirable way.

In this work, we present a synthetic streaming media workload generator, MediSyn, which is specially designed to accomplish this goal. In MediSyn, we develop a number of novel models to capture a set of characteristics critical to streaming media services, including file duration, file access prefix, non-stationary file popularity, new file introduction process, and diurnal access pattern. Among the primary features of our synthetic generator is the ability to reflect the dynamics and evolution of content at media sites and the change of access rate to the sites over time. Our evaluation, based on two long-term traces of streaming media services, demonstrates that MediSyn accurately captures the essential properties of media workloads, which are chosen to represent the unique (while generic) properties of streaming media workloads and their dynamics over time.

The MediSyn implementation is based on a modular design allowing the particular system properties to be customized, enhanced or extended to reflect the requirements of individual scenarios. In future work, we plan to extend MediSyn with an implementation of client interactivity within media sessions.

As part of MediSyn, we plan to release a workload analysis tool reflecting the property profiles generated by MediSyn. These profiles can be conveniently used for tuning the workload generator parameters to specify the desired properties.

Acknowledgements

Our sincere thanks to the anonymous referees for useful remarks and numerous suggestions during several thorough revisions of the paper. Their insightful questions and constructive criticism helped us to significantly improve the content and presentation of the paper.

References

[1] General Pareto Distribution. Available from: <http://www.math.uah.edu/stat/special/special12.html>.


[2] Soam Acharya, Brian Smith, Peter Parnes, Characterizing user access to videos on the world wide Web, in: Proceedings of ACM/SPIE Multimedia Computing and Networking, January 2000.

[3] Jussara Almeida, Jeffrey Krueger, Derek Eager, Mary Vernon, Analysis of educational media server workloads, in: Proceedings of NOSSDAV, June 2001.

[4] Virgílio Almeida, Azer Bestavros, Mark Crovella, Adriana de Oliveira, Characterizing reference locality in the WWW, in: Proceedings of PDIS, December 1996.

[5] Virgilio Augusto Almeida, Marcio Cesirio, Rodrigo Fonseca, Wagner Meira Jr., Cristina Murta, Analyzing the behavior of a proxy server in the light of regional and cultural issues, in: Proceedings of WCW, June 1998.

[6] Paul Barford, Mark Crovella, Generating representative Web workloads for network and server performance evaluation, in: Proceedings of SIGMETRICS, June 1998.

[7] Dave Bianchi, CNN.com: Facing A World Crisis. Available from: <http://www.tcsa.org/lisa2001/cnn.txt>.

[8] Rebecca Braynard, Dejan Kostic, Adolfo Rodriguez, Jeffrey Chase, Amin Vahdat, Opus: an overlay peer utility service, in: Proceedings of the 5th International Conference on Open Architectures and Network Programming (OPENARCH), June 2002.

[9] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, Scott Shenker, Web caching and Zipf-like distributions: evidence, and implications, in: Proceedings of INFOCOM, March 1999.

[10] Jeffrey Chase, Darrell Anderson, Prachi Thakar, Amin Vahdat, Ronald Doyle, Managing Energy and Server Resources in Hosting Centers, in: Proceedings of the 18th ACM SOSP, October 2001.

[11] Ludmila Cherkasova, Gianfranco Ciardo, Characterizing temporal locality and its impact on Web server performance, in: Proceedings of ICCCN, October 2000.

[12] Ludmila Cherkasova, Minaxi Gupta, Characterizing locality, evolution, and life span of accesses in enterprise media server workloads, in: Proceedings of NOSSDAV, May 2002.

[13] Maureen Chesire, Alec Wolman, Geoffrey Voelker, Henry Levy, Measurement and analysis of a streaming-media workload, in: Proceedings of USITS, March 2001.

[14] Morris DeGroot, Mark Schervish, Probability and Statistics, third ed., Addison-Wesley, 2002.

[15] Raj Jain, The art of computer systems performance analysis: technique for experimental design, measurement, simulation and modeling, John Wiley & Sons, 1992.

[16] Shudong Jin, Azer Bestavros, Temporal locality in Web requests streams: sources, characteristics, and caching implications. Technical report BUCS-TR-1999-009, Department of Computer Science, Boston University, August 1999.

[17] Shudong Jin, Azer Bestavros, GISMO: A generator of internet streaming media objects and workloads. Technical report BUCS-TR-2001-020, Department of Computer Science, Boston University, October 2001.

[18] Dario Luperello, Sarit Mukherjee, Sanjoy Paul, Streaming media traffic: an empirical study, in: Proceedings of Web Caching Workshop, June 2002.

[19] Hewlett Packard, Utility Data Center. Available from: <http://www.hp.com/go/hpudc>.

[20] Jitendra Padhye, Jim Kurose, An empirical study of client interactions with a continuous-media courseware server, in: Proceedings of NOSSDAV, June 1998.

[21] V. Paxson, Empirically-derived analytic models of wide-area TCP connections, IEEE/ACM Transactions on Networking 2 (4) (1994).

[22] S. Pederson, M. Johnson, Estimating model discrepancy, Technometrics 32 (3) (1990) 305–314.

[23] Sheldon Ross, Introduction to Probability Models, Academic Press, 1997.

[24] Subhabrata Sen, Jennifer Rexford, Don Towsley, Proxy prefix caching for multimedia streams, in: Proceedings of INFOCOM, March 1999.

[25] Zipf–Mandelbrot law. Available from: <http://en.wikipedia.org/wiki/Zipf-Mandelbrot_law>.

[26] Vern Paxson, Empirically-derived analytic models of wide-area TCP connections, IEEE/ACM Transactions on Networking 2 (4) (1994).

Wenting Tang currently works at Arcsight Inc., developing a next-generation Security Event and Information Management (SEIM) system. Before that, he worked at VMware, developing virtual infrastructure management software. He conducted research in the areas of utility computing, content delivery networks, streaming media and scalable web server systems at HPLabs before he joined VMware. He received his Ph.D. from Michigan State University.

Yun Fu received his B.S. and M.Eng. in Computer Science from Nankai University, China, in 1995 and 1998, respectively, and his M.S. and Ph.D. in Computer Science from Duke University, USA, in 2001 and 2004, respectively. He is currently a senior engineer at Yahoo! Inc. His research interests include distributed systems, operating systems, computer networks, databases, and data mining.

Ludmila Cherkasova is a senior scientist in the Enterprise Software and Systems Laboratory at HPLabs, Palo Alto. She joined Hewlett-Packard Laboratories in 1991. Before joining HPLabs, she was a senior researcher at the Institute of Computing Systems, Russia, and an adjunct associate professor at Novosibirsk State University. Her current research interests are in distributed systems, internet technologies and networking, performance measurement and monitoring, characterization of next-generation system workloads and emerging applications in large-scale enterprise data centers.

Amin Vahdat is an Associate Professor in the Computer Science and Engineering Department at UC San Diego and the Director of the Center for Networked Systems. He received his Ph.D. from the University of California, Berkeley and then was on the faculty at Duke University from 1998 to 2003 before joining UC San Diego in 2004. Vahdat received the NSF CAREER award in 2000, the Alfred P. Sloan Fellowship in 2003 and the Duke University David and Janet Vaughn Distinguished Teaching Award in 2003. He co-founded the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI) in 2003. His research focuses broadly on computer systems and networks, with recent focuses on availability, resource allocation, programming models and languages for distributed systems, and scalable network emulation environments.