Cluster-Based Scalable Network Services

Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, Paul Gauthier

University of California at Berkeley Inktomi Corporation

{fox, gribble, yatin, brewer}@cs.berkeley.edu {gauthier}@inktomi.com

We identify three fundamental requirements for scalable network services: incremental scalability and overflow growth provisioning, 24x7 availability through fault masking, and cost-effectiveness. We argue that clusters of commodity workstations interconnected by a high-speed SAN are exceptionally well-suited to meeting these challenges for Internet-server workloads, provided the software infrastructure for managing partial failures and administering a large cluster does not have to be reinvented for each new service. To this end, we propose a general, layered architecture for building cluster-based scalable network services that encapsulates the above requirements for reuse, and a service-programming model based on composable workers that perform transformation, aggregation, caching, and customization (TACC) of Internet content. For both performance and implementation simplicity, the architecture and TACC programming model exploit BASE, a weaker-than-ACID data semantics that results from trading consistency for availability and relying on soft state for robustness in failure management. Our architecture can be used as an “off the shelf” infrastructural platform for creating new network services, allowing authors to focus on the “content” of the service (by composing TACC building blocks) rather than its implementation. We discuss two real implementations of services based on this architecture: TranSend, a Web distillation proxy deployed to the UC Berkeley dialup IP population, and HotBot, the commercial implementation of the Inktomi search engine. We present detailed measurements of TranSend’s performance based on substantial client traces, as well as anecdotal evidence from the TranSend and HotBot experience, to support the claims made for the architecture.

1 Introduction

“One of the overall design goals is to create a computing system which is capable of meeting almost all of the requirements of a large computer utility. Such systems must run continuously and reliably 7 days a week, 24 hours a day... and must be capable of meeting wide service demands.

“Because the system must ultimately be comprehensive and able to adapt to unknown future requirements, its framework must be general, and capable of evolving over time.”

— Corbató and Vyssotsky on Multics, 1965 [17]

Although it is normally viewed as an operating system, Multics (Multiplexed Information and Computer Service) was originally conceived as an infrastructural computing service, so it is not surprising that its goals as stated above are similar to our own. The primary obstacle to deploying Multics was the absence of the network infrastructure, which is now in place. Network applications have exploded in popularity in part because they are easier to manage and evolve than their desktop application counterparts: they eliminate the need for software distribution, and simplify customer service and bug tracking by avoiding the difficulty of dealing with multiple platforms and versions. Also, basic queueing theory shows that a large central (virtual) server is more efficient in both cost and utilization than a collection of smaller servers; standalone desktop systems represent the degenerate case of one “server” per user. All of these support the argument for Network Computers [28].



However, network services remain difficult to deploy because of three fundamental challenges: scalability, availability, and cost effectiveness.

• By scalability, we mean that when the load offered to the service increases, an incremental and linear increase in hardware can maintain the same per-user level of service.

• By availability, we mean that the service as a whole must be available 24x7, despite transient partial hardware or software failures.

• By cost effectiveness, we mean that the service must be economical to administer and expand, even though it potentially comprises many workstation nodes.

We observe that clusters of workstations have some fundamental properties that can be exploited to meet these requirements: using commodity PCs as the unit of scaling allows the service to ride the leading edge of the cost/performance curve, the inherent redundancy of clusters can be used to mask transient failures, and “embarrassingly parallel” network service workloads map well onto networks of workstations. However, developing cluster software and administering a running cluster remain complex. The primary contributions of this work are the design, analysis, and implementation of a layered framework for building network services that addresses this complexity. New services can use this framework as an off-the-shelf solution to scalability, availability, and several other problems, and focus instead on the content of the service being developed. The lower layer handles scalability, availability, load balancing, support for bursty offered load, and system monitoring and visualization, while the middle layer provides extensible support for caching, transformation among MIME types, aggregation of information from multiple sources, and personalization of the service for each of a large number of users (mass customization). The top layer allows composition of transformation and aggregation into a specific service, such as accelerated Web browsing or a search engine.

Pervasive throughout our design and implementation strategies is the observation that much of the data manipulated by a network service can tolerate semantics weaker than ACID [26]. We combine ideas from prior work on tradeoffs between availability and consistency, and the use of soft state for robust fault-tolerance, to characterize the data semantics of many network services, which we refer to as BASE semantics (basically available, soft state, eventual consistency). In addition to demonstrating how BASE simplifies the implementation of our architecture, we present a programming model for service authoring that is a good fit for BASE semantics and that maps well onto our cluster-based service framework.

1.1 Validation: Two Real Services

Our framework reflects the implementation of two real network services in use today: TranSend, a scalable transformation and caching proxy for the 25,000 Berkeley dialup IP users (connecting through a bank of 600 modems), and the Inktomi search engine (commercialized as HotBot), which performs millions of queries per day against a database of over 50 million web pages.

The Inktomi search engine is an aggregation server that was initially developed to explore the use of cluster technology to handle the scalability and availability requirements of network services. The commercial version, HotBot, handles several million queries per day against a full-text database of 54 million web pages. It has been incrementally scaled up to 60 nodes, provides high availability, and is extremely cost effective. Inktomi predates the framework we describe, and thus differs from it in some respects. However, it strongly influenced the framework’s design, and we will use it to validate particular design decisions.

We focus our detailed discussion on TranSend, which provides Web caching and data transformation. In particular, real-time, datatype-specific distillation and refinement [22] of inline Web images results in an end-to-end latency reduction by a factor of 3-5, giving the user a much more responsive Web surfing experience with only modest image quality degradation. TranSend was developed at UC Berkeley and has been deployed for the 25,000 dialup IP users there, and is being deployed to a similar community at UC Davis.


In the remainder of this section we argue that clusters are an excellent fit for Internet services, provided the challenges we describe for cluster software development can be surmounted. In Section 2 we describe the proposed layered architecture for building new services, and a programming model for creating services that maps well onto the architecture. We show how TranSend and HotBot map onto this architecture, using HotBot to justify specific design decisions within the architecture. Sections 3 and 4 describe the TranSend implementation and its measured performance, including experiments on its scalability and fault tolerance properties. Section 5 discusses related work and the continuing evolution of this work, and we summarize our observations and contributions in Section 6.

1.2 Advantages of Clusters

Particularly in the area of Internet service deployment, clusters provide three primary benefits over single larger machines, such as SMPs: incremental scalability, high availability, and the cost/performance and maintenance benefits of commodity PCs. We elaborate on each of these in turn.

Scalability: Clusters are well suited to Internet service workloads, which are highly parallel (many independent simultaneous users) and for which the grain size typically corresponds to at most a few CPU-seconds on a commodity PC. For these workloads, large clusters can dwarf the power of the largest machines. For example, Inktomi’s HotBot cluster contains 60 nodes with 120 processors, 30 GB of physical memory, and hundreds of commodity disks. Wal-Mart uses a cluster from TeraData with 768 processors and 16 terabytes of online storage.

Furthermore, the ability to grow clusters incrementally over time is a tremendous advantage in areas such as Internet service deployment, where capacity planning depends on a large number of unknown variables. Incremental scalability replaces capacity planning with relatively fluid reactionary scaling. Clusters correspondingly eliminate the “forklift upgrade”, in which you must throw out the current machine (and related investments) and replace it via forklift with an even larger one.

High Availability: Clusters have natural redundancy due to the independence of the nodes: each node has its own busses, power supply, disks, etc., so it is “merely” a matter of software to mask (possibly multiple simultaneous) transient faults. A natural extension of this capability is to temporarily disable a subset of nodes and then upgrade them in place (“hot upgrade”). Such capabilities are essential for network services, whose users have come to expect 24-hour uptime despite the inevitable reality of hardware and software faults due to rapid system evolution.

Commodity Building Blocks: The final set of advantages of clustering follows from the use of commodity building blocks rather than high-end, low-volume machines. The obvious advantage is cost/performance, since memory, disks, and nodes can all track the leading edge; for example, we changed the building block every time we grew the HotBot cluster, each time picking the reliable, high-volume, previous-generation commodity units, helping to ensure stability and robustness. Furthermore, since many commodity vendors compete on service (particularly for PC hardware), it is easy to get high-quality configured nodes in 48 hours or less. In contrast, large SMPs typically have a lead time of 45 days, are more cumbersome to purchase, install, and upgrade, and are supported by a single vendor, so it is much harder to get help when difficulties arise. Once again, it is a “simple matter of software” to tie together a collection of possibly heterogeneous commodity building blocks.

To summarize, clusters have significant advantages in scalability, growth, availability, and cost. Although fundamental, these advantages are not easy to realize.

1.3 Challenges of Cluster Computing

There are a number of areas in which clusters are at a disadvantage relative to SMPs. In this section we describe some of these challenges and how they influenced the architecture we will propose in Section 2.

Administration: Administration is a serious concern for systems of many nodes. We leverage ideas in prior work [1], which describes how a unified monitoring/reporting framework with data visualization support was an effective tool for simplifying cluster administration.

Component vs. system replication: Each commodity PC in a cluster is not usually powerful enough to support an entire service, but can probably support some components of the service. Component-level rather than whole-system replication therefore allows commodity PCs to serve as the unit of incremental scaling, provided the software can be naturally decomposed into loosely coupled modules. We address this challenge by proposing an architecture in which each component has well-circumscribed functional responsibilities and is largely “interchangeable” with other components of the same type. For example, a cache node can run anywhere that a disk is available, and a worker that performs a specific kind of data compression can run anywhere that significant CPU cycles are available.

Partial failures: Component-level replication leads directly to the fundamental issue separating clusters from SMPs: the need to handle partial failures (i.e., the ability to survive and adapt to failures of subsets of the system). Traditional workstations and SMPs never face this issue, since the machine is either up or down.

Shared state: Unlike SMPs, clusters have no shared state. Although much work has been done to emulate global shared state through software distributed shared memory [33,34,36], we can improve performance and reduce complexity if we can avoid or minimize the need for shared state across the cluster.

These last two concerns, partial failure and shared state, lead us to focus on the sharing semantics actually required by network services.

1.4 BASE Semantics

We believe that the design space for network services can be partitioned according to the data semantics that each service demands. At one extreme is the traditional transactional database model with the ACID properties (atomicity, consistency, isolation, durability) [26], providing the strongest semantics at the highest cost and complexity. ACID makes no guarantees regarding availability; indeed, it is preferable for an ACID service to be unavailable than to function in a way that relaxes the ACID constraints. ACID semantics are well suited for Internet commerce transactions, billing users, or maintaining user profile information for personalized services.

For other Internet services, however, the primary value to the user is not necessarily strong consistency or durability, but rather high availability of data:

• Stale data can be temporarily tolerated as long as all copies of data eventually reach consistency after a short time (e.g., DNS servers do not reach consistency until entry timeouts expire [41]).

• Soft state, which can be regenerated at the expense of additional computation or file I/O, is exploited to improve performance; data is not durable.

• Approximate answers (based on stale data or incomplete soft state) delivered quickly may be more valuable than exact answers delivered slowly.

We refer to the data semantics resulting from the combination of these techniques as BASE—Basically Available, Soft State, Eventual Consistency. By definition, any data semantics that are not strictly ACID are BASE. BASE semantics allow us to handle partial failure in clusters with less complexity and cost. Like pioneering systems such as Grapevine [9], BASE reduces the complexity of the service implementation, essentially trading consistency for simplicity; like later systems such as Bayou [21] that allow trading consistency for availability, BASE provides opportunities for better performance. For example, where ACID requires durable and consistent state across partial failures, BASE semantics often allows us to avoid communication and disk activity or to postpone it until a more convenient time.
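To make the BASE trade-off concrete, the following Python sketch (purely illustrative; not from the paper) shows soft state that is always served immediately, even if possibly stale, with regeneration postponed to a convenient time rather than blocking the request:

```python
import time

class SoftStateCache:
    """Illustrative BASE-style cache: entries are soft state that can
    always be regenerated, so a read never blocks on a refresh."""

    def __init__(self, regenerate, ttl=30.0):
        self.regenerate = regenerate      # function that recomputes a value
        self.ttl = ttl                    # freshness interval in seconds
        self.entries = {}                 # key -> (value, timestamp)
        self.stale_keys = set()           # work postponed to a convenient time

    def read(self, key):
        entry = self.entries.get(key)
        if entry is None:
            # No copy at all: pay the computation cost once.
            value = self.regenerate(key)
            self.entries[key] = (value, time.time())
            return value
        value, stamp = entry
        if time.time() - stamp > self.ttl:
            # Possibly stale, but still "basically available": return it
            # now and remember to refresh later (eventual consistency).
            self.stale_keys.add(key)
        return value

    def refresh_when_convenient(self):
        # Called from a background loop, off the request path.
        for key in list(self.stale_keys):
            self.entries[key] = (self.regenerate(key), time.time())
            self.stale_keys.discard(key)

# Example use: distilled pages are regenerable, hence safe to treat as BASE.
cache = SoftStateCache(regenerate=lambda url: f"distilled({url})", ttl=5.0)
print(cache.read("http://example.com/a.gif"))
```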

In practice, it is simplistic to categorize every service as either ACID or BASE; instead, different components of services demand varying data semantics. Directories such as Yahoo! [64] maintain a database of soft state with BASE semantics, but keep user customization profiles in an ACID database. Transformation proxies [23,57] interposed between clients and servers transform Internet content on-the-fly; the transformed content is BASE data that can be regenerated by computation, but if the service bills the user per session, the billing should certainly be delegated to an ACID database.

We focus on services that have an ACID component, but manipulate primarily BASE data. Web servers, search/aggregation servers [58], caching proxies [14,44], and transformation proxies are all examples of such services; our framework supports a superset of these services by providing integrated support for the requirements of all four. As we will show, BASE semantics greatly simplify the implementation of fault tolerance and availability and permit performance optimizations within our framework that would be precluded by ACID.

2 Cluster-Based Scalable Service Architecture

In this section we propose a system architecture and service-programming model for building scalable network services on clusters. The architecture attempts to address both the challenges of cluster computing and the challenges of deploying network services, while exploiting clusters’ strengths. We view our contributions as follows:

• A proposed system architecture for scalable network services that exploits the strengths of cluster computing, as exemplified by cluster-based servers such as TranSend and HotBot.

• Separation of the content of network services (i.e., what the services do) from their implementation, by encapsulating the “scalable network service” (SNS) requirements of high availability, scalability, and fault tolerance in a reusable layer with narrow interfaces.

• A programming model based on composition of stateless worker modules into new services. The model maps well onto our system architecture, and numerous existing services map directly onto it.

• Detailed measurements of a production service that instantiates the architecture and validates our performance and reliability claims.

In the remainder of this section we review the benefits and challenges of cluster computing and propose a network service architecture that exploits these observations and allows encapsulation of the SNS requirements. We then describe a programming model that minimizes service development effort by allowing implementation of new services entirely at the higher layers.

2.1 Proposed Functional Organization of an SNS

The above observations lead us to the software-component block diagram of a generic SNS shown in Figure 1. Each physical workstation in a network of workstations (NOW [2]) supports one or more software components in the figure, but each component in the diagram is confined to one node. In general, the components whose tasks are naturally parallelizable are replicated for scalability, fault tolerance, or both. In our measurements (Section 4), we will argue that the performance demands on the non-replicated components are not significant for the implementation of a large class of services, and that the practical bottlenecks are bandwidth into and out of the system and bandwidth in the system area network (SAN).

Figure 1: Architecture of a generic SNS. Components include front ends (FE), a pool of workers (W), some of which may be caches ($), a user profile database, a graphical monitor, and a fault-tolerant load manager, whose functionality logically extends into the manager stubs (MS) and worker stubs (WS).


Front Ends provide the interface to the SNS as seen by the outside world (e.g., HTTP server). They “shepherd” incoming requests by matching them up with the appropriate user profile from the customization database, and queueing them for service by one or more workers. Front ends maximize system throughput by maintaining state for many simultaneous outstanding requests, and can be replicated for both scalability and availability.

The Worker Pool consists of caches and service-specific modules that implement the actual service (data transformation/filtering, content aggregation, etc.). Each type of module may be instantiated zero or more times, depending on offered load.

The Customization Database stores user profiles that allow mass customization of request processing.

The Manager balances load across workers and spawns additional workers as offered load fluctuates or faults occur. When necessary, it may assign work to machines in the overflow pool, a set of backup machines (perhaps on desktops) that can be harnessed to handle load bursts and provide a smooth transition during incremental growth. The overflow pool is discussed in Section 2.2.3.

The Graphical Monitor for system management supports tracking and visualization of the system’s behavior, asynchronous error notification via email or pager, and temporary disabling of system components for hot upgrades.

The System-Area Network provides a low-latency, high-bandwidth interconnect, such as switched 100-Mb/s Ethernet or Myrinet [43]. Its main goal is to prevent the interconnect from becoming the bottleneck as the system scales.

2.2 Separating Service Content From Implementation: A Reusable SNS Support Layer

Layered software models allow layers to be isolated from each other and allow existing software in one layer to be reused in different implementations. We observe that the components in the above architecture can be grouped naturally into three layers of functionality, as shown in Figure 2: SNS (scalable network service implementation), TACC (transformation, aggregation, caching, customization), and Service. The key contributions of our architecture are the reusability of the SNS layer, and the ability to add simple, stateless “building blocks” at the TACC layer and compose them in the Service layer. We discuss TACC in Section 2.3. The SNS layer, which we describe here, provides scalability, load balancing, fault tolerance, and high availability; it comprises the front ends, manager, SAN, and monitor in Figure 1.

Figure 2: The three software layers of a scalable network service (Service, TACC, and SNS) and the functionality each provides.

Service: Service-specific code
• Workers that present human interface to what TACC modules do, including device-specific presentation
• User interface to control the service

TACC: Transformation, Aggregation, Caching, Customization
• API for composition of stateless data transformation and content aggregation modules
• Uniform caching of original, post-aggregation, and post-transformation data
• Transparent access to the customization database

SNS: Scalable Network Service support
• Incremental and absolute scalability
• Worker load balancing and overflow management
• Front-end availability, fault tolerance mechanisms
• System monitoring and logging

2.2.1 Scalability

Components in our SNS architecture may be replicated for fault tolerance or high availability, but we also use replication to achieve scalability. When the offered load to the system saturates the capacity of some component class, more instances of that component can be launched on incrementally added nodes. The duties of our replicated components are largely independent of each other (because of the nature of the Internet services’ workload), which means the amount of additional resources required is a linear function of the increase in offered load. Although the components are mostly independent, they do have some dependence on the shared, non-replicated system components: the SAN, the resource manager, and possibly the user profile database. Our measurements in Section 4 confirm that even for very large systems, these shared components do not become a bottleneck.

The static partitioning of functionality between front ends and workers reflects our desire to keep workers as simple as possible, by localizing in the front ends the control decisions associated with satisfying user requests. In addition to managing the network state for outstanding requests, a front end encapsulates service-specific worker dispatch logic, accesses the profile database to pass the appropriate parameters to the workers, notifies the end user in a service-specific way (e.g., constructing an HTML page describing the error) when one or more workers fails unrecoverably, provides the user interface to the profile database, and so forth. This division of responsibility allows workers to remain simple and stateless, and allows the behavior of the service as a whole to be defined almost entirely in the front end. If the workers are analogous to processes in a Unix pipeline, the front end is analogous to an interactive shell.

2.2.2 Centralized Load Balancing

Load balancing is controlled by a centralized policy implemented in the manager. The manager collects load information from the workers, synthesizes load balancing hints based on the policy, and periodically transmits the hints to the front ends, which make local scheduling decisions based on the most recent hints. The load balancing and overflow policies are left to the system operator. We describe our experiments with load balancing and overflow in Section 4.5.
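Section 3.1.2 notes that the manager aggregates worker load reports and computes weighted moving averages before piggybacking them on its beacons; the Python sketch below is one assumed rendering of that hint synthesis (the smoothing factor and class names are hypothetical):

```python
class LoadHintSynthesizer:
    """Illustrative sketch of the manager's hint synthesis: raw load
    reports (e.g., queue lengths) from each worker are smoothed with an
    exponentially weighted moving average, and the smoothed values are
    periodically sent to the front ends as load balancing hints."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha          # weight given to the newest report
        self.smoothed = {}          # worker id -> smoothed load estimate

    def report(self, worker_id, queue_length):
        prev = self.smoothed.get(worker_id, float(queue_length))
        self.smoothed[worker_id] = (
            self.alpha * queue_length + (1.0 - self.alpha) * prev
        )

    def hints(self):
        # The front ends only need a recent snapshot; slightly stale
        # hints are acceptable under BASE semantics.
        return dict(self.smoothed)

synth = LoadHintSynthesizer()
synth.report("distiller-1", 4)
synth.report("distiller-2", 9)
print(synth.hints())
```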

The decision to centralize rather than distribute load balancing is intentional: if the load balancer can be made fault tolerant, and if we can ensure it does not become a performance bottleneck, centralization makes it easier to implement and reason about the behavior of the load balancing policy. In Section 3.1.3 we discuss the evolution that led to this design decision and its implications for performance, fault tolerance, and scalability.

2.2.3 Prolonged Bursts and Incremental Growth

Although we would like to assume that there is a well-defined average load and that arriving traffic follows a Poisson distribution, burstiness has been demonstrated for Ethernet traffic [35], file system traffic [27], and Web requests [18], and is confirmed by our traces of web traffic (discussed later). In addition, Internet services can experience relatively rare but prolonged bursts of high load: after the recent landing of Pathfinder on Mars, its web site served over 220 million hits in a 4-day period [45]. Often, it is during such bursts that uninterrupted operation is most critical.

Our architecture includes the notion of an overflow pool for absorbing these bursts. The overflow machines are not dedicated to the service, and normally do not have workers running on them, but the manager can spawn workers on the overflow machines on demand when unexpected load bursts arrive, and release the machines when the burst subsides. In an institutional or corporate setting, the overflow pool could consist of workstations on individuals’ desktops. Because worker nodes are already interchangeable, workers do not need to know whether they are running on a dedicated or an overflow node, since load balancing is handled externally. In addition to absorbing sustained bursts, the ability to temporarily harness overflow machines eases incremental growth: when the overflow machines are being recruited unusually often, it is time to purchase more dedicated nodes for the service.
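The spawning and overflow policies themselves are left to the operator (Section 4.5 reports the authors' experiments). The sketch below shows one assumed placement policy, with hypothetical thresholds, in which the manager prefers dedicated nodes and recruits an overflow machine only when the dedicated pool is saturated:

```python
def place_new_worker(dedicated_nodes, overflow_nodes, load,
                     saturation_threshold=10):
    """Illustrative overflow policy sketch (thresholds are hypothetical).

    dedicated_nodes, overflow_nodes: lists of node names.
    load: dict mapping node name -> number of workers currently running.
    Returns (node, is_overflow) for the node that should host a new worker.
    """
    # Prefer the least-loaded dedicated node if any has spare capacity.
    candidates = [n for n in dedicated_nodes
                  if load.get(n, 0) < saturation_threshold]
    if candidates:
        return min(candidates, key=lambda n: load.get(n, 0)), False

    # Dedicated pool saturated: temporarily recruit an overflow machine.
    idle_overflow = [n for n in overflow_nodes if load.get(n, 0) == 0]
    if idle_overflow:
        return idle_overflow[0], True

    # Nothing left; the caller can queue the request or shed load.
    return None, False

node, overflow = place_new_worker(
    dedicated_nodes=["node1", "node2"],
    overflow_nodes=["desktop-a", "desktop-b"],
    load={"node1": 10, "node2": 12},
)
print(node, overflow)   # -> desktop-a True
```

Frequent recruitment of overflow nodes is then the signal, as noted above, that it is time to buy more dedicated nodes.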

2.2.4 Soft State for Fault Tolerance and Availability

The technique of constructing robust entities by relying on cached soft state refreshed by periodic messages from peers has been enormously successful in wide-area TCP/IP networks [5,20,39], another arena in which transient component failure is a fact of life. Correspondingly, our SNS components operate in this manner, and monitor one another using process peer fault tolerance¹: when a component fails, one of its peers restarts it (on a different node if necessary), while cached stale state carries the surviving components through the failure. After the component is restarted, it gradually rebuilds its soft state, typically by listening to multicasts from other components. We give specific examples of this mechanism in Section 3.1.3.

We use timeouts as an additional fault-tolerance mechanism, to infer certain failure modes that cannot be otherwise detected. If the condition that caused the timeout can be automatically resolved (e.g., if workers lost because of a SAN partition can be restarted on still-visible nodes), the manager performs the necessary actions. Otherwise, the SNS layer reports the suspected failure condition, and the service layer determines how to proceed (e.g., report the error or fall back to a simpler task that does not require the failed worker).
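A minimal sketch of process-peer monitoring, assuming each component periodically beacons its liveness: a peer records when it last heard from each component and, after a timeout, restarts the missing component rather than taking over its duties (names and timeout values are illustrative, not the paper's code):

```python
import time

class ProcessPeerWatcher:
    """Illustrative process-peer watchdog: infer failure from missing
    beacons and restart the peer; no state is mirrored, because the
    restarted peer rebuilds its soft state from subsequent beacons."""

    def __init__(self, restart_fn, timeout=15.0):
        self.restart_fn = restart_fn     # e.g., lambda name: spawn(name)
        self.timeout = timeout
        self.last_beacon = {}            # peer name -> last beacon time

    def on_beacon(self, peer_name):
        self.last_beacon[peer_name] = time.time()

    def check(self):
        now = time.time()
        for peer, seen in list(self.last_beacon.items()):
            if now - seen > self.timeout:
                # Suspected failure: restart the peer and wait for its
                # first new beacon before trusting it again.
                self.restart_fn(peer)
                self.last_beacon[peer] = now

watcher = ProcessPeerWatcher(restart_fn=lambda name: print("restart", name))
watcher.on_beacon("manager")
watcher.last_beacon["manager"] -= 100    # simulate a long silence
watcher.check()                          # -> restart manager
```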

2.2.5 Narrow Interface to Service-Specific Workers

To allow new services to reuse all these facilities, the manager and front ends provide a narrow API, shown as the Manager Stubs and Worker Stubs in Figure 1, for communicating with the workers, the manager, and the graphical system monitor. The worker stub provides mechanisms for workers to implement some required behaviors for participating in the system (e.g., supplying load data to assist the manager in load balancing decisions and reporting detectable failures in their own operation). The worker stub hides fault tolerance, load balancing, and multithreading considerations from the worker code, which may use all the facilities of the operating system, need not be thread-safe, and can, in fact, crash without taking the system down. The minimal restrictions on worker code allow worker authors to focus instead on the content of the service, even using off-the-shelf code (as we have in TranSend) to implement the worker modules.

¹ Not to be confused with process pairs, a different fault-tolerance mechanism for hard-state processes, discussed in [6]. Process peers are similar to the fault tolerance mechanism explored in the early “Worm” programs [55] and to “Robin Hood/Friar Tuck” fault tolerance: “Each ghost-job would detect the fact that the other had been killed, and would start a new copy of the recently slain program within a few milliseconds. The only way to kill both ghosts was to kill them simultaneously (very difficult) or to deliberately crash the system.” [50]

The manager stub linked to the front ends provides support for implementing the dispatch logic that selects which worker type(s) are needed to satisfy a request; since the dispatch logic is independent of the core load balancing and fault tolerance mechanisms, a variety of services can be built using the same set of workers.
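The paper does not publish the stub API itself; the sketch below is an assumed rendering of the division of labor it describes: a worker is essentially a function from (input, parameters) to output, and the surrounding stub queues requests, reports load to the manager, and reports failures of the worker code instead of letting them take the system down.

```python
class WorkerStub:
    """Illustrative worker stub sketch: wraps a service-specific worker
    function, queues requests, exposes load data for the manager, and
    converts worker crashes into reported failures."""

    def __init__(self, name, worker_fn, report_load, report_failure):
        self.name = name
        self.worker_fn = worker_fn          # e.g., a distiller function
        self.report_load = report_load      # callback to the manager
        self.report_failure = report_failure
        self.queue = []

    def enqueue(self, request, params):
        self.queue.append((request, params))
        # Load is characterized here simply as queue length.
        self.report_load(self.name, len(self.queue))

    def run_one(self):
        if not self.queue:
            return None
        request, params = self.queue.pop(0)
        try:
            return self.worker_fn(request, params)
        except Exception as exc:
            # Worker code need not be robust; the stub reports the
            # failure and the SNS layer decides what to do next.
            self.report_failure(self.name, request, exc)
            return None

stub = WorkerStub(
    name="html-munger",
    worker_fn=lambda data, params: data.upper(),   # stand-in worker
    report_load=lambda n, q: None,
    report_failure=lambda n, r, e: print("failed:", n, e),
)
stub.enqueue("<p>hello</p>", {"quality": 25})
print(stub.run_one())
```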

2.3 TACC: A Programming Model for Internet Services

Having encapsulated the “SNS requirements” into a separate software layer, we now require a programming model for building the services themselves in higher layers. We focus on a particular subset of services, based on transformation, aggregation, caching, and customization of Internet content (TACC). Transformation is an operation on a single data object that changes its content; examples include filtering, transcoding, re-rendering, encryption, and compression. Aggregation involves collecting data from several objects and collating it in a prespecified way; for example, collecting all listings of cultural events from a prespecified set of Web pages, extracting the date and event information from each, and composing the result into a dynamically-generated “culture this week” page. Our initial implementation allows Unix-pipeline-like chaining of an arbitrary number of stateless transformations and aggregations; this results in a very general programming model that subsumes transformation proxies [22], proxy filters [67], customized information aggregators [59,13], and search engines. The selection of which workers to invoke for a particular request is service-specific and controlled outside the workers themselves; for example, given a collection of workers that convert images between pairs of encodings, a correctly chosen sequence of transformations can be used for general image conversion.

Customization represents a fundamental advantage of the Internet over traditional wide-area media such as television. Many online services, including the Wall Street Journal, the Los Angeles Times, and C/Net, have deployed “personalized” versions of their service as a way to increase loyalty and the quality of the service. Such mass customization requires the ability to track users and keep profile data for each user, although the content of the profiles differs across services. The customization database, a traditional ACID database, maps a user identification token (such as an IP address or cookie) to a list of key-value pairs for each user of the service. A key strength of the TACC model is that the appropriate profile information is automatically delivered to workers along with the input data for a particular user request; this allows the same workers to be reused for different services. For example, an image-compression worker can be run with one set of parameters to reduce image resolution for faster Web browsing, and a different set of parameters to reduce image size and bit depth for handheld devices. We have found composable, customizable workers to be a powerful building block for developing new services, and we discuss our experience with TACC and its continuing evolution in Section 5.

Caching is important because recomputing or storing data has become cheaper than moving it across the Internet. For example, a study of the UK National web cache has shown that even a small cache (400 MB) can reduce the load on the network infrastructure by 40% [61], and SingNet, the largest ISP in Singapore, has saved 40% of its telecom charges using web caching [60]. In the TACC model, caches can store post-transformation (or post-aggregation) content and even intermediate-state content, in addition to caching original Internet content.

Many existing services are subsumed by the TACC model and fit well with it. (In Section 5.4 we describe some that do not.) For example, the HotBot search engine collects search results from a number of database partitions and collates the results. Transformation involves converting the input data from one form to another. In TranSend, graphic images can be scaled and filtered through a low-pass filter to tune them for a specific client or to reduce their size. A key strength of our architecture is the ease of composition of tasks; this affords considerable flexibility in the transformations and aggregations the service can perform, without requiring workers to understand service-specific dispatch logic, load balancing, etc., any more than programs in a Unix pipeline need to understand the implementation of the pipe mechanism.
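To make the pipeline analogy concrete, here is a minimal sketch, with hypothetical worker names, of Unix-pipeline-like chaining of stateless TACC workers: each worker receives the previous worker's output plus the user's profile arguments, and the chain itself is chosen by dispatch logic outside the workers.

```python
# Hypothetical stateless TACC workers: each maps (data, profile) -> data.
def strip_comments(html, profile):
    return html.replace("<!-- ad -->", "")

def annotate_images(html, profile):
    quality = profile.get("jpeg_quality", 25)
    return html.replace("<img ", f'<img data-distill-quality="{quality}" ')

def add_toolbar(html, profile):
    return html + "\n<div class='toolbar'>TranSend-style controls</div>"

def run_chain(workers, data, profile):
    """Unix-pipeline-like composition: the output of one stateless
    worker feeds the next; the chain itself is chosen by service-
    specific dispatch logic, not by the workers."""
    for worker in workers:
        data = worker(data, profile)
    return data

profile = {"jpeg_quality": 10}     # key-value pairs from the customization DB
chain = [strip_comments, annotate_images, add_toolbar]
print(run_chain(chain, "<html><!-- ad --><img src='a.gif'></html>", profile))
```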

We claim that a large number of interesting services can be implemented entirely at the service and TACC layers, and that relatively few services will benefit from direct modification to the SNS layer unless they have very specific low-level performance needs. In Section 5.1 we describe our experience adding functionality at both the TACC and service layers.

3 Service Implementation

This section focuses on the implementation of TranSend, a scalable Web distillation proxy, and compares it with HotBot. The goals of this section are to demonstrate how each component shown in Figure 1 maps into the layered architecture, to discuss relevant implementation details and trade-offs, and to provide the necessary context for the measurements we report in the next section.

3.1 TranSend SNS Components

3.1.1 Front Ends

TranSend runs on a cluster of SPARCstation 10 and 20 machines, interconnected by switched 10 Mb/s Ethernet and connected to the dialup IP pool by a single 10 Mb/s segment. The TranSend front end presents an HTTP interface to the client population. A thread is assigned to each arriving TCP connection. Request processing involves fetching Web data from the caching subsystem (or from the Internet on a cache miss), pairing up the request with the user’s customization preferences, sending the request and preferences to a pipeline of one or more distillers (the TranSend lossy-compression workers) to perform the appropriate transformation, and returning the result to the client. Alternatively, if an appropriate distilled representation is available in the cache, it can be sent directly to the client. A large thread pool allows the front end to sustain throughput and maximally exploit parallelism despite the large number of potentially long, blocking operations associated with each task, and provides a clean programming model. The production TranSend runs with a single front end of about 400 threads.
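The per-request logic just described can be summarized in a short sketch; this is an assumed paraphrase in Python rather than TranSend's code, and the helpers (cache lookup, origin fetch, distiller pipeline) are stand-ins:

```python
def handle_request(url, user_prefs, cache, fetch_origin, distill_pipeline):
    """Illustrative TranSend-style front-end request flow.

    cache:            maps a url or (url, prefs key) to stored content.
    fetch_origin:     fetches the original object from the Internet.
    distill_pipeline: runs the distiller chain for these preferences.
    """
    prefs_key = (url, tuple(sorted(user_prefs.items())))

    # 1. Best case: a suitable distilled representation is already cached.
    distilled = cache.get(prefs_key)
    if distilled is not None:
        return distilled

    # 2. Otherwise get the original, from cache or from the origin server.
    original = cache.get(url)
    if original is None:
        original = fetch_origin(url)
        cache[url] = original

    # 3. Distill according to the user's preferences and cache the result.
    distilled = distill_pipeline(original, user_prefs)
    cache[prefs_key] = distilled
    return distilled

cache = {}
result = handle_request(
    "http://example.com/a.jpg", {"quality": 25}, cache,
    fetch_origin=lambda u: b"<original bytes>",
    distill_pipeline=lambda data, prefs: b"<smaller bytes>",
)
print(result, len(cache))
```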

3.1.2 Load Balancing Manager

Client-side JavaScript support [46] balances load across multiple front ends and masks transient front end failures, although other mechanisms such as round-robin DNS [12] or commercial routers [16] could also be used. For internal load balancing, TranSend uses a centralized manager whose responsibilities include tracking the location of distillers, spawning new distillers on demand, balancing load across distillers of the same class, and providing the assurance of fault tolerance and system tuning. We argue for a centralized as opposed to distributed manager because it is easier to change the load balancing policy and reason about its behavior; the next section discusses the fault-tolerance implications of this decision.

The manager periodically beacons its existence on an IP multicast group to which the other components subscribe. The use of IP multicast provides a level of indirection and relieves components of having to explicitly locate each other. When the front end has a task for a distiller, the manager stub code contacts the manager, which locates an appropriate distiller, spawning a new one if necessary. The manager stub caches the new distiller’s location for future requests.

The worker stub attached to each distiller accepts and queues requests on behalf of the distiller and periodically reports load² information to the manager. The manager aggregates load information from all distillers, computes weighted moving averages, and piggybacks the resulting information on its beacons to the manager stub. The manager stub (at the front end) caches the information in these beacons and uses lottery scheduling [63] to select a distiller for each request. The cached information provides a backup so that the system can continue to operate (using slightly stale load data) even if the manager crashes. Eventually, the fault tolerance mechanisms (discussed in Section 3.1.3) restart the manager and the system returns to normal.
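Lottery scheduling [63] picks a distiller at random with probability weighted by its (inverse) load; the sketch below is one plausible rendering of the manager stub's selection step over the cached, possibly slightly stale hints, not the actual TranSend code:

```python
import random

def pick_distiller(load_hints):
    """Lottery-style selection from cached load hints.

    load_hints: dict mapping distiller id -> smoothed queue length.
    Each distiller gets tickets inversely proportional to its load,
    so lightly loaded distillers win more often but none starves.
    """
    if not load_hints:
        raise RuntimeError("no distillers known to the manager stub")
    tickets = {d: 1.0 / (1.0 + load) for d, load in load_hints.items()}
    total = sum(tickets.values())
    draw = random.uniform(0.0, total)
    running = 0.0
    for distiller, share in tickets.items():
        running += share
        if draw <= running:
            return distiller
    return distiller  # floating-point edge case: fall through to the last one

hints = {"distiller-1": 2.0, "distiller-2": 8.0}
print(pick_distiller(hints))
```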

To allow the system to scale as the load increases, the manager can automatically spawn a new distiller on an unused node if it detects excessive load on distillers of a particular class. (The spawning and load balancing policies are described in detail in Section 4.5.) Another mechanism used for adjusting to bursts in load is overflow: if all the nodes in the system are used up, the manager can resort to starting up temporary distillers on a set of overflow nodes. Once the burst subsides, the distillers may be reaped.

² In the current implementation, distiller load is characterized in terms of the queue length at the distiller, optionally weighted by the expected cost of distilling each item.

3.1.3 Fault Tolerance and Crash Recovery

In the original prototype for the manager, information about distillers was kept as hard state, using a log file and crash recovery protocols similar to those used by ACID databases. Resilience against crashes was via process-pair fault tolerance, as in [6]: the primary manager process was mirrored by a secondary whose role was to maintain a current copy of the primary’s state, and take over the primary’s tasks if it detects that the primary has failed. In this scenario, crash recovery is seamless, since all state in the secondary process is up-to-date.

However, by moving entirely to BASE semantics, we were able to simplify the manager greatly and increase our confidence in its correctness. In TranSend, all state maintained by the manager is explicitly designed to be soft state. When a distiller starts up, it registers itself with the manager, whose existence it discovers by subscribing to a well-known multicast channel. If the distiller crashes before de-registering itself, the manager detects the broken connection; if the manager crashes and restarts, the distillers detect beacons from the new manager and re-register themselves. Timeouts are used as a backup mechanism to infer failures. Since all state is soft and is periodically beaconed, no explicit crash recovery or state mirroring mechanisms are required to regenerate lost state. Similarly, the front end does not require any special crash recovery code, since it can reconstruct its state as it receives the next few beacons from the manager.
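The essential point, that the manager's table of distillers is soft state rebuilt entirely from re-registrations, can be sketched as follows; the incarnation counter and callback wiring are assumptions made for illustration:

```python
class Distiller:
    """Illustrative distiller-side view of soft-state registration:
    the distiller re-registers whenever it hears a beacon from a
    manager instance it has not yet registered with (e.g., after the
    manager crashes and restarts with a new incarnation id)."""

    def __init__(self, name, send_registration):
        self.name = name
        self.send_registration = send_registration
        self.registered_with = None     # manager incarnation we know about

    def on_manager_beacon(self, manager_incarnation):
        if manager_incarnation != self.registered_with:
            # New (or restarted) manager: re-register so it can rebuild
            # its soft-state table of distillers from scratch.
            self.send_registration(self.name)
            self.registered_with = manager_incarnation

class Manager:
    """Manager-side soft state: nothing is logged to disk; the table is
    regenerated from registrations after a restart."""

    def __init__(self, incarnation):
        self.incarnation = incarnation
        self.distillers = set()

    def on_registration(self, distiller_name):
        self.distillers.add(distiller_name)

manager = Manager(incarnation=1)
d = Distiller("jpeg-distiller", send_registration=manager.on_registration)
d.on_manager_beacon(manager.incarnation)
print(manager.distillers)    # {'jpeg-distiller'}

# Manager crash and restart: soft state is simply rebuilt.
manager = Manager(incarnation=2)
d.send_registration = manager.on_registration
d.on_manager_beacon(manager.incarnation)
print(manager.distillers)    # {'jpeg-distiller'} again, regenerated
```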

With this use of soft state, each “watcher” process only needs to detect that its peer is alive (rather than mirroring the peer’s state) and, in some cases, be able to restart the peer (rather than take over the peer’s duties). Broken connections, timeouts, or loss of beacons are used to infer component failures and restart the failed process. The manager, distillers, and front ends are process peers:

• The manager reports distiller failures to the manager stubs, which update their caches of where distillers are running.

• The manager detects and restarts a crashed front end.

• The front end detects and restarts a crashed manager.

This process peer functionality is encapsulated within the manager stub code. Simply by linking against the stub, front ends are automatically recruited as process peers of the manager.

3.1.4 User Profile Database

The service interface to TranSend allows each user to register a series of customization settings, using either HTML forms or a Java/JavaScript combination applet. The actual database is implemented using gdbm because it is freely available and its performance is adequate for our needs: user preference reads are much more frequent than writes, and the reads are absorbed by a write-through cache in the front end.
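A write-through read cache over a dbm-style store takes only a few lines; the sketch below is illustrative (an ordinary dictionary stands in for the gdbm file) rather than TranSend's front-end code:

```python
class WriteThroughProfileCache:
    """Illustrative write-through cache for user preferences: reads are
    served from memory when possible, and every write goes both to the
    cache and to the backing store (a gdbm file in TranSend; a plain
    dict here as a stand-in)."""

    def __init__(self, backing_store):
        self.store = backing_store
        self.cache = {}

    def get_prefs(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]          # common case: cheap read
        prefs = self.store.get(user_id, {})     # miss: go to the database
        self.cache[user_id] = prefs
        return prefs

    def set_prefs(self, user_id, prefs):
        self.store[user_id] = prefs             # write through to the DB
        self.cache[user_id] = prefs             # and keep the cache fresh

db = {}                                          # stand-in for gdbm
profiles = WriteThroughProfileCache(db)
profiles.set_prefs("10.0.0.7", {"jpeg_quality": 25, "max_width": 320})
print(profiles.get_prefs("10.0.0.7"))
```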

3.1.5 Cache Nodes

TranSend runs Harvest object cache workers [10] on four separate nodes. Harvest suffers from three functional/performance deficiencies, two of which we resolved.

First, although a collection of Harvest caches can be treated as “siblings”, by default all siblings are queried on each request, so that the cache service time would increase as the load increases even if more cache nodes were added. Therefore, for both scalability and improved fault tolerance, the manager stub can manage a number of separate cache nodes as a single virtual cache, hashing the key space across the separate caches and automatically re-hashing when cache nodes are added or removed. Second, we modified Harvest to allow data to be injected into it, allowing distillers (via the worker stub) to store post-transformed or intermediate-state data into the large virtual cache. Finally, because the interface to each cache node is HTTP, a separate TCP connection is required for each cache request. We did not repair this deficiency due to the complexity of the Harvest code, and as a result some of the measurements reported in Section 4.4 are slightly pessimistic.
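The paper does not specify the hash function; a minimal sketch of the single-virtual-cache idea, assuming a simple hash of the URL modulo the current node list, looks like this. Re-hashing when nodes are added or removed only costs extra misses, never correctness, because everything cached is regenerable BASE data:

```python
import hashlib

def cache_node_for(url, cache_nodes):
    """Map a URL to one of the currently available cache nodes.

    cache_nodes: ordered list of node names; changing the list re-hashes
    the key space, which only costs extra misses because all cached
    content is regenerable soft state.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    index = int(digest, 16) % len(cache_nodes)
    return cache_nodes[index]

nodes = ["cache1", "cache2", "cache3", "cache4"]
print(cache_node_for("http://example.com/a.gif", nodes))
# Removing a node changes some assignments but the system keeps working:
print(cache_node_for("http://example.com/a.gif", nodes[:3]))
```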

Caching in TranSend is only an optimization. All cached data can be thrown away at the cost of performance—cache nodes are workers whose only job is the management of BASE data.

3.1.6 Datatype-Specific Distillers

The second group of workers is the distillers, which perform transformation and aggregation. Because the worker stub insulates distiller code from the SNS mechanisms (Section 2.2.5), we were able to leverage a large amount of off-the-shelf code for our distillers. We have built three parameterizable distillers for TranSend: scaling and low-pass filtering of JPEG images using the off-the-shelf jpeg-6a library [29], GIF-to-JPEG conversion followed by JPEG degradation³, and a Perl HTML “munger” that marks up inline image references with distillation preferences, adds extra links next to distilled images so that users can retrieve the original content, and adds a “toolbar” (Figure 4) to each page that allows users to control various aspects of TranSend’s operation. The user interface for TranSend is thus controlled by the HTML distiller, under the direction of the user preferences from the front end.

Each of these distillers took approximately 5-6 hours to implement, debug, and optimize. Although pathological input data occasionally causes a distiller to crash, the process-peer fault tolerance guaranteed by the SNS layer means that we don’t have to worry about eliminating all such possible bugs and corner cases from the system.
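As an illustration of what a parameterizable distiller looks like (not the actual jpeg-6a-based code), the sketch below uses the Pillow imaging library as a stand-in; the scale factor and quality would come from the user's customization profile, the same parameters Figure 3 refers to.

```python
from io import BytesIO
from PIL import Image   # Pillow as a stand-in for the jpeg-6a library

def distill_jpeg(jpeg_bytes, scale=2, quality=25):
    """Illustrative JPEG distiller: shrink each dimension by `scale`
    and re-encode at the requested quality. Both parameters would be
    taken from the user's customization profile."""
    img = Image.open(BytesIO(jpeg_bytes))
    img = img.resize((max(1, img.width // scale), max(1, img.height // scale)))
    out = BytesIO()
    img.convert("RGB").save(out, format="JPEG", quality=quality)
    return out.getvalue()
```

With the parameters shown in Figure 3 (scale factor 2, quality 25), such a worker produces the roughly 10 KB to 1.5 KB reduction reported there.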

³ We chose this approach after discovering that the JPEG representation is smaller and faster to operate on for most images, and produces aesthetically superior results.

Figure 3: Scaling this JPEG image by a factor of 2 in each dimension and reducing JPEG quality to 25 results in a size reduction from 10KB to 1.5KB.

Figure 4: User interface for manipulating preferences.

3.1.7 Graphical Monitor

Our extensible Tcl/Tk [48] graphical monitor presents a unified view of the system as a single virtual entity. Components of the system report state information to the monitor using a multicast group, allowing multiple monitors to run at geographically dispersed locations for remote management. The monitor can page or email the system operator if a serious error occurs, for example, if it stops receiving reports from some component.

The benefits of visualization are not just cosmetic: we can immediately detect by looking at the visualization panel what state the system as a whole is in, whether any component is currently causing a bottleneck (such as cache-miss time, distillation queueing delay, or the interconnect), what resources the system is using, and other such figures of interest.

3.1.8 How TranSend Exploits BASE

Distinguishing ACID vs. BASE semantics in the design of service components greatly simplifies TranSend’s fault-tolerance and improves its availability. Only the user-profile database is ACID; everything else exploits some aspect of BASE semantics, both in manipulating application data (i.e., Web content) and in the implementation of the system components themselves.

Stale load balancing data: The load balancing data in the manager stub is slightly stale between updates from the manager, which arrive a few seconds apart. The use of stale data for the load balancing and overflow decisions improves performance and helps to hide faults, since using cached data avoids communicating with the source. Timeouts are used to recover from cases where stale data causes an incorrect load balancing choice. For example, if a request is sent to a worker that no longer exists, the request will time out and another worker will be chosen. From the standpoint of performance, as we will show in our measurements, the use of slightly stale data is not a problem in practice.

Soft state: The two advantages of soft state are improved performance from avoiding (blocking) commits and trivial recovery. Transformed content is cached and can be regenerated from the original (which may also be cached).

Approximate answers: Users of TranSend request objects that are named by the object URL and the user preferences, which are used to derive distillation parameters. However, if the system is too heavily loaded to perform distillation, it can return a somewhat different version from the cache; if the user clicks the “Reload” button later, they will get the distilled representation they asked for if the system now has sufficient resources to perform the distillation. If the required distiller has temporarily or permanently failed, the system can return the original content. In all cases, an approximate answer delivered quickly is more useful than the exact answer delivered slowly.
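This fallback chain can be written compactly; the sketch below is an assumed rendering, not TranSend's code: prefer the exact distilled variant, then any cached variant of the same object, then the original, distilling only when resources permit.

```python
def approximate_answer(url, prefs_key, cache, overloaded, distiller_available,
                       distill, fetch_original):
    """Illustrative fallback chain for BASE-style approximate answers.
    cache keys are (url, prefs_key) tuples naming distilled variants."""
    exact = cache.get((url, prefs_key))
    if exact is not None:
        return exact                          # exact distilled answer

    if overloaded or not distiller_available:
        # Too busy or distiller down: any cached variant, else the original.
        for (cached_url, _), variant in cache.items():
            if cached_url == url:
                return variant                # approximate but fast
        return fetch_original(url)            # last resort: original content

    original = fetch_original(url)
    result = distill(original, prefs_key)
    cache[(url, prefs_key)] = result
    return result

cache = {("http://e.com/a.jpg", "q50"): b"<q50 variant>"}
print(approximate_answer("http://e.com/a.jpg", "q25", cache,
                         overloaded=True, distiller_available=True,
                         distill=None, fetch_original=lambda u: b"<orig>"))
# -> b'<q50 variant>': an approximate answer delivered quickly
```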

3.2 HotBot Implementation

In this section we highlight the principal differences between the implementations of TranSend and HotBot. The original Inktomi work, which is the basis of HotBot, predates the layered model and scalable server architecture presented here and uses ad hoc rather than generalized mechanisms in some places.

Front ends and service interface: HotBot runs on a mixture of single- and multiple-CPU SPARCstation server nodes, interconnected by Myrinet [43]. The HTTP front ends in HotBot run 50-80 threads per node and handle the presentation and customization of results based on user preferences and browser type. The presentation is performed using a form of “dynamic HTML” based on Tcl macros [54].

Load balancing: HotBot workers statically partition the search-engine database for load balancing. Thus each worker handles a subset of the database proportional to its CPU power, and every query goes to all workers in parallel.
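The scatter/gather pattern this implies can be sketched as follows, with hypothetical partition functions standing in for Inktomi's search code: every query goes to all partitions, each partition searches only its own slice of the document set, and the front end collates the partial results.

```python
def search_all_partitions(query, partitions):
    """Illustrative HotBot-style aggregation: the query is scattered to
    every statically assigned partition, and the partial hit lists are
    merged by score. A missing partition only shrinks the result set."""
    hits = []
    for partition in partitions:
        try:
            hits.extend(partition(query))      # each returns (score, doc_id)
        except ConnectionError:
            continue                           # degrade gracefully
    return sorted(hits, reverse=True)[:10]

# Hypothetical partitions, each responsible for a slice of the database.
partition_a = lambda q: [(0.9, "doc-17"), (0.4, "doc-3")]
partition_b = lambda q: [(0.7, "doc-88")]
print(search_all_partitions("berkeley now clusters", [partition_a, partition_b]))
```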

Failure management: Unlike the workers in TranSend, HotBot worker nodes are not interchangeable, since each worker uses a local disk to store its part of the database. The original Inktomi nodes cross-mounted databases, so that there were always multiple nodes that could reach any database partition. Thus, when a node went down, other nodes would automatically take over responsibility for that data, maintaining 100% data availability with graceful degradation in performance.

Since the database partitioning distributes documents randomly and it is acceptable to lose part of the database temporarily, HotBot moved to a model in which RAID storage handles disk failures, while fast restart minimizes the impact of node failures. For example, with 26 nodes the loss of one machine results in the database dropping from 54M to about 51M documents, which is still significantly larger than other search engines (such as Alta Vista at 30M).

The success of the fault management of HotBot is exemplified by the fact that during February 1997, HotBot was physically moved (from Berkeley to San Jose) without ever being down, by moving half of the cluster at a time and changing DNS resolution in the middle. Although various parts of the database were unavailable at different times during the move, the overall service was still up and useful—user feedback indicated that few people were affected by the transient changes.

| Component | TranSend | HotBot |
|---|---|---|
| Load balancing | Dynamic, by queue lengths at worker nodes | Static partitioning of read-only data |
| Application layer | Composable TACC workers | Fixed search service application |
| Service layer | Worker dispatch logic, HTML/JavaScript UI | Dynamic HTML generation, HTML UI |
| Failure management | Centralized but fault-tolerant using process peers | Distributed to each node |
| Worker placement | FE's and caches bound to their nodes | All workers bound to their nodes |
| User profile (ACID) database | Berkeley DB with read caches | Parallel Informix server |
| Caching | Harvest caches store pre- and post-transformation Web data | Integrated cache of recent searches, for incremental delivery |

User profile database: We expect commercial systems to use a real database for ACID components. HotBot uses Informix with primary/backup failover for the user profile and ad revenue tracking database, with each front end linking in an Informix SQL client. However, all other HotBot data is BASE, and as in TranSend, timeouts are used to recover from stale cluster-state data.

3.3 Summary

The TranSend implementation quite closely maps into the layered architecture presented in Section 2, while the HotBot implementation differs in the use of a distributed manager, static load balancing by data partitioning, and workers that are tied to particular machines. The careful separation of responsibility into different components of the system, and the layering of components according to the architecture, made the implementation complexity manageable.

4 Measurements of the TranSend Implementation

We took measurements of TranSend using a cluster of 15 Sun SPARC Ultra-1 workstations connected by 100 Mb/s switched Ethernet and isolated from external load or network traffic. For measurements requiring Internet access, the access was via a 10 Mb/s switched Ethernet network connecting our workstation to the outside world. In the following subsections we analyze the size distribution and burstiness characteristics of TranSend’s expected workload, describe the performance of two throughput-critical components (the cache nodes and data-transformation workers) in isolation, and report on experiments that stress TranSend’s fault tolerance, responsiveness to bursts, and scalability.


4.1 HTTP Traces and the Playback Engine

Many of the performance tests are based upon HTTP trace data that we gathered from our intended user population, namely the 25,000 UC Berkeley dialup IP users, up to 600 of whom may be connected via a bank of 14.4K or 28.8K modems. The modems’ connection to the Internet passes through a single 10 Mb/s Ethernet segment; we placed a tracing machine running an IP packet filter on this segment for a month and a half, and unobtrusively gathered a trace of approximately 20 million (anonymized) HTTP requests. GIF, HTML, and JPEG were by far the three most common MIME types observed in our traces (50%, 22%, and 18%, respectively), and hence our three implemented distillers cover these common cases. Data for which no distiller exists is passed unmodified to the user.

Figure 5 illustrates the distribution of sizes occurring for these three MIME types. Although most content accessed on the web is small (considerably less than 1 KB), the average byte transferred is part of large content (3-12 KB). This means that the users’ modems spend most of their time transferring a few, large files. It is the goal of TranSend to eliminate this bottleneck by distilling this large content into smaller, but still useful representations; data under 1 KB is transferred to the client unmodified, since distillation of such small content rarely results in a size reduction.

Figure 5: Distribution of content lengths for HTML, GIF, and JPEG files. The spikes to the left of the main GIF and JPEG distributions are error messages mistaken for image data, based on file name extension. Average content lengths: HTML 5131 bytes, GIF 3428 bytes, JPEG 12070 bytes.

Figure 5 also reveals a number of interesting properties of the individual data types. The GIF distribution has two plateaus: one for data sizes under 1 KB (which correspond to icons, bullets, etc.) and one for data sizes over 1 KB (which correspond to photos or cartoons). Our 1 KB distillation threshold therefore exactly separates these two classes of data, and deals with each correctly. JPEGs do not show this same distinction: the distribution falls off rapidly below the 1 KB mark.

In order to realistically stress test TranSend, we created a high performance trace playback engine. The engine can generate requests at a constant (and dynamically tunable) rate, or it can faithfully play back a trace according to the timestamps in the trace file. We thus had fine-grained control over both the amount and nature of the load offered to our implementation during our experimentation.

4.2 Burstiness

Burstiness is a fundamental property of a great variety of computing systems, and can be observed across all time scales [18,27,35]. Our HTTP traces show that the offered load to our implementation will contain bursts: Figure 6 shows the request rate observed from the user base across a 24 hour, 3.5 hour, and 3.5 minute time interval. The 24 hour interval exhibits a strong 24 hour cycle that is overlaid with shorter time-scale bursts. The 3.5 hour and 3.5 minute intervals reveal finer grained bursts.

We described in Section 2.2.3 how our architecture allows an arbitrary subset of machines to be managed as an overflow pool during temporary but prolonged periods of high load. The overflow pool can also be used to absorb bursts on shorter time scales. We argue that there are two possible administrative avenues for managing the overflow pool:

1. Select an average desired utilization level for the dedicated worker pool. Since we can observe a daily cycle, this amounts to drawing a line across Figure 6a (i.e., picking a number of tasks/sec) such that the fraction of black under the line is the desired utilization level.

2. Select an acceptable percentage of time that the system will resort to the overflow pool. This amounts to drawing a line across Figure 6a such that the fraction of columns that cross the line is this percentage.4

Since we have measured the average number of requests/s that a distiller of a given class can handle, the number of tasks/s that we picked (from step 1 or 2 above) dictates how many distillers will need to be in the dedicated (non-overflow) pool.
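As a minimal illustration of that sizing step, the sketch below divides the chosen tasks/s level by the measured per-distiller capacity. The particular numbers in the example are only placeholders, except that the roughly 23 requests/s per distiller figure is taken from the scalability results in Section 4.6.

    import math

    def dedicated_pool_size(target_tasks_per_sec, per_distiller_tasks_per_sec):
        """Sizing rule sketched from the text: the tasks/s line drawn across
        Figure 6a, divided by the measured per-distiller capacity."""
        return math.ceil(target_tasks_per_sec / per_distiller_tasks_per_sec)

    # e.g., a line drawn at 10 tasks/s, with distillers measured at ~23 requests/s
    # each, would call for a single dedicated distiller:
    print(dedicated_pool_size(10, 23))   # -> 1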

4.3 Distiller Performance

If the system is behaving well, the distillation of images is the most computationally expensive task performed by TranSend. We measured the performance of our distillers by timing distillation latency as a function of input data size, calculated across approximately 100,000 items from the dialup IP trace file. Figure 7 shows that for the GIF distiller, there is an approximately linear relationship between distillation time and input size, although a large variation in distillation time is observed for any particular data size. The slope of this relationship is approximately 8 milliseconds per kilobyte of input. Similar results were observed for the JPEG and HTML distillers, although the HTML distiller is far more efficient.

Figure 6: The number of requests per second for traced dialup IP users, showing burstiness across different time scales. (a) 24 hours with 2 minute buckets, 5.8 req/s avg., 12.6 req/s max. (b) 3 hr 20 min with 30 second buckets, 5.6 req/s avg., 10.3 req/s peak. (c) 3 min 20 sec with 1 second buckets, 8.1 req/s avg., 20 req/s peak.

4: Note that the utilization level cannot necessarily be predicted given a certain acceptable percentage, and vice versa.

4.4 Cache Partition Performance

In [10], a detailed performance analysis of the Harvest caching system is presented. We summarize the results here:

• The average cache hit takes 27 ms to service, including network and OS overhead, implying a maximum average service rate from each partitioned cache instance of 37 requests per second. TCP connection and teardown overhead is attributed to 15 ms of this service time.

• 95% of all cache hits take less than 100 ms to service, implying that cache hit service time has low variation.

• The miss penalty (i.e., the time to fetch data from the Internet) varies widely, from 100 ms to 100 seconds. This implies that should a cache miss occur, it is likely to dominate the end-to-end latency through the system, and therefore much effort should be expended to minimize the cache miss rate.

Figure 7: Average distillation latency vs. GIF size, based on GIF data gathered from the dialup IP trace.

As a supplement to these results, we ran a number of cache simulations to explore the relationship between user population size, cache size, and cache hit rate, using LRU replacement. We observed that the size of the user population greatly affects the attainable hit rate. Cache hit rate increases monotonically as a function of cache size, but plateaus at a level that is a function of the user population size. For the user population observed across the traces (approximately 8000 people over the 1.5 month period), six gigabytes of cache space (in total, partitioned over all instances) gave us a hit rate of 56%. Similarly, we observed that for a given cache size, increasing the size of the user population increases the hit rate in the cache (due to an increase in locality across the users), until the point at which the sum of the users’ working sets exceeds the cache size, causing the cache hit rate to fall.
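The simulation logic behind such curves can be captured in a few lines; the sketch below is an illustrative LRU simulator over (URL, size) pairs, not the tool used to produce the numbers above.

    from collections import OrderedDict

    def simulate_lru(trace, cache_bytes):
        """Return the hit rate of an LRU cache of 'cache_bytes' capacity
        over 'trace', an iterable of (url, size_in_bytes) pairs."""
        cache, used, hits, total = OrderedDict(), 0, 0, 0
        for url, size in trace:
            total += 1
            if url in cache:
                hits += 1
                cache.move_to_end(url)           # mark as most recently used
                continue
            cache[url] = size
            used += size
            while used > cache_bytes:            # evict least recently used
                _, evicted_size = cache.popitem(last=False)
                used -= evicted_size
        return hits / total if total else 0.0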

From these results, we can deduce that the capacity of a single front end will be limited by the high cache miss penalties. The number of simultaneous, outstanding requests at a front end is equal to N × T, where N is the number of requests arriving per second and T is the average service time of a request. A high cache miss penalty implies that T will be large. Because two TCP connections (one between the client and front end, the other between the front end and a cache partition) and one thread context are maintained in the front end for each outstanding request, front ends are vulnerable to state management and context switching overhead. As an example, for offered loads of 15 requests per second to a front end, we have observed 150-350 outstanding requests and therefore up to 700 open TCP connections and 300 active thread contexts at any given time. As a result, the front end spends more than 70% of its time in the kernel (as reported by the top utility) under this load. Eliminating this overhead is the subject of ongoing research.
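The N × T relationship is Little's law applied to the front end; plugging in the figures quoted above gives a quick sanity check (the numbers are the ones from the text, not new measurements).

    # Little's law: outstanding requests = arrival rate (req/s) x average service time (s).
    def outstanding(n_req_per_sec, avg_service_time_sec):
        return n_req_per_sec * avg_service_time_sec

    # At the observed 15 req/s, 150-350 outstanding requests correspond to
    # average service times of roughly 10-23 seconds per request:
    print(outstanding(15, 10))    # -> 150
    print(outstanding(15, 23.3))  # -> 349.5, i.e. about 350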

4.5 Self Tuning and Load Balancing

TranSend uses queue lengths at the distillers as a metric for load balancing. As queue lengths grow due to increased load, the moving average of the queue length maintained by the manager starts increasing; when the average crosses a configurable threshold H, the manager spawns a new distiller to absorb the load. The threshold H maps to the greatest delay the user is willing to tolerate when the system is under high load. To allow the new distiller to stabilize the system, the spawning mechanism is disabled for D seconds; the parameter D represents a tradeoff between stability (rate of spawning and reaping distillers) and user-perceptible delay.
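A minimal sketch of that policy is shown below. H, D, and the smoothing constant ALPHA are assumed to be configuration parameters, and spawn_distiller is a hypothetical hook into the manager rather than TranSend's real interface.

    import time

    class Spawner:
        """Sketch of the threshold/cooldown spawning policy described above."""
        ALPHA = 0.3   # weight of the newest report in the moving average (assumed)

        def __init__(self, H, D, spawn_distiller):
            self.H, self.D = H, D
            self.spawn = spawn_distiller
            self.avg = 0.0
            self.disabled_until = 0.0

        def on_queue_report(self, mean_queue_length):
            # exponentially weighted moving average of reported queue lengths
            self.avg = self.ALPHA * mean_queue_length + (1 - self.ALPHA) * self.avg
            now = time.time()
            if self.avg > self.H and now >= self.disabled_until:
                self.spawn()
                # give the new distiller D seconds to absorb load before
                # considering another spawn
                self.disabled_until = now + self.D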

Figure 8(a) shows the variation in distiller queue lengths over time. The system was bootstrapped with one front end and the manager. On-demand spawning of the first distiller was observed as soon as load was offered. With increasing load, the distiller queue gradually increased until the manager decided to spawn a second distiller, which reduced the queue length of the first distiller and balanced the load across both distillers within five seconds. Continued increase in load caused a third distiller to start up, which again reduced and balanced the queue lengths within five seconds.

Figure 8: Distiller queue lengths observed over time as the load presented to the system fluctuates, and as distillers are manually brought down. (b) is an enlargement of (a).

Figure 8(b) shows an enlarged view of the graph in Figure 8(a). During the experiment, we manually killed the first two distillers, causing the load on the remaining distiller to rapidly increase. The manager immediately reacted and started up a new distiller. Even after D seconds, the manager discovered that the system was overloaded and started up one more distiller, causing the load to stabilize.

When we first ran this experiment, we noticed rapid oscillations in queue lengths. Inspection revealed that since the front end’s manager stubs only periodically received distiller queue length reports, they were making load balancing decisions based on stale data. To repair this, we changed the manager stub to keep a running estimate of the change in distiller queue lengths between successive reports; these estimates were sufficient to eliminate the oscillations. The data in Figure 8 reflects the modified load balancing functionality.
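The text does not spell out the exact estimator, but one plausible form, sketched below under assumed field names, corrects the last report by the observed inter-report trend and by the requests the stub itself has dispatched since that report.

    class QueueEstimator:
        """Hedged sketch of a manager-stub queue-length estimator between reports."""

        def __init__(self):
            self.last_report = {}   # distiller -> last reported queue length
            self.delta = {}         # distiller -> change between last two reports
            self.sent_since = {}    # distiller -> requests sent since last report

        def on_report(self, report):
            for d, qlen in report.items():
                self.delta[d] = qlen - self.last_report.get(d, qlen)
                self.last_report[d] = qlen
                self.sent_since[d] = 0

        def on_dispatch(self, d):
            self.sent_since[d] = self.sent_since.get(d, 0) + 1

        def estimate(self, d):
            # stale value, corrected by the observed trend and local dispatches
            return (self.last_report.get(d, 0)
                    + self.delta.get(d, 0)
                    + self.sent_since.get(d, 0))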

4.6 Scalability

To demonstrate the scalability of the system, we needed to eliminate two bottlenecks that limit the load we could offer: the overhead associated with having a very large number of open file descriptors, and the bottleneck 10 Mb/s Ethernet connecting our cluster to the Internet. To do this, we prepared a trace file that repeatedly requested a fixed number of JPEG images, all approximately 10 KB in size, based on the distributions we observed (Section 4.1). These images would then remain resident in the cache partitions, eliminating cache miss penalty and the resulting buildup of file descriptors in the front end. We recognize that although a non-zero cache miss penalty does not introduce any additional network, stable storage, or computational burden on the system, it does result in an increase in the amount of state in the front end, which as we mentioned in Section 4.4 limits the performance of a single front end. On the other hand, by turning off caching of distilled images, we force our system to re-distill the image every time it is requested, and in that respect our measurements are pessimistic relative to the system’s normal mode of operation.

Our strategy for the experiment was as follows:

1. Begin with a minimal instance of the system: one front end, one distiller, the manager, and some fixed number of cache partitions. (Since for these experiments we repeatedly requested the same subset of images, the cache was effectively not tested.)

2. Increase the offered load until some system component saturates (e.g., distiller queues grow too long, front ends cannot accept additional connections, etc.).

3. Add more resources to the system to eliminate this saturation (in many cases the system does this automatically, as when it recruits overflow nodes to run more workers), and record the amount of resources added as a function of the increase in offered load, measured in requests per second.

4. Continue until the saturated resource cannot be replenished (i.e., we run out of hardware), or until adding more of the saturated resource no longer results in a linear or close-to-linear improvement in performance.

Requests/second | # Front Ends | # Distillers | Element that saturated
0-24 | 1 | 1 | distillers
25-47 | 1 | 2 | distillers
48-72 | 1 | 3 | distillers
73-87 | 1 | 4 | FE Ethernet
88-91 | 2 | 4 | distillers
92-112 | 2 | 5 | distillers
113-135 | 2 | 6 | distillers & FE Ethernet
136-159 | 3 | 7 | distillers

Table 2: Results of the scalability experiment

Table 2 presents the results of this experiment. At 24 requests per second, as the offered load exceeded the capacity of the single available distiller, the manager automatically spawned one additional distiller, and then subsequent distillers as necessary. At 87 requests per second, the Ethernet segment leading into the front end saturated, requiring a new front end to be spawned. We were unable to test the system at rates higher than 159 requests per second, as all of our cluster’s machines were hosting distillers, front ends, or playback engines. We did observe nearly perfectly linear growth of the system over the scaled range: a distiller can handle approximately 23 requests per second, and a 100 Mb/s Ethernet segment into a front end can handle approximately 70 requests per second.5 We were unable to saturate the front end, the cache partitions, or fully saturate the interior SAN during this experiment. We draw two conclusions from this result:

• Even with a commodity 100 Mb/s SAN, linear scaling is limited primarily by bandwidth into the system rather than bandwidth inside the system.

• Although we run TranSend on four SPARC 10’s, a single Ultra-1 class machine would suffice to serve the entire dialup IP population of UC Berkeley (25,000 users officially, over 8000 of whom surfed during the trace).

Ultimately, the scalability of our system is limited by the shared or centralized components of the system, namely the user profile database, the manager, and the SAN. In our experience, neither the database nor the manager has ever been close to saturation. The main task of the manager (in steady state) is to accumulate load announcements from all distillers and multicast this information to the front ends. We conducted an experiment to test the capability of the manager to handle these load announcements. Nine hundred distillers were created on four machines. Each of these distillers generated a load announcement packet for the manager every half a second. The manager was easily able to handle this aggregate load of 1800 announcements per second. With each distiller capable of processing over 20 front end requests per second, the manager is computationally capable of sustaining a total number of distillers equivalent to 18000 requests per second. This number is nearly three orders of magnitude greater than the peak load ever seen on UC Berkeley’s modem pool, which is comparable to a modest-sized ISP. Similarly, HotBot’s ACID database (parallel Informix server), used for ad revenue tracking and user profiles, can serve about 400 requests per second, significantly greater than HotBot’s load.

5: We believe that TCP connection setup and processing overhead is the dominating factor. Using a more efficient TCP implementation such as Fast Sockets [52] may alleviate this limitation, although more investigation is needed.

On the other hand, SAN saturation is a potential concern for communication-intensive workloads such as TranSend’s. The problem of optimizing component placement given a specific network topology, technology, and workload is an important topic for future research. As a preliminary exploration of how TranSend behaves as the SAN saturates, we repeated the scalability experiments using a 10 Mb/s switched Ethernet. As the network was driven closer to saturation, we noticed that most of our (unreliable) multicast traffic was being dropped, crippling the ability of the manager to balance load and the ability of the monitor to report system conditions.

One possible solution to this problem is the addition of a low-speed utility network to isolate control traffic from data traffic, allowing the system to more gracefully handle (and perhaps avoid) SAN saturation. Another possibility is to use a higher-performance SAN interconnect: a Myrinet [43] microbenchmark run on the HotBot implementation measured 32 MBytes/s all-pairs traffic between 40 nodes, far greater than the traffic experienced during the normal use of the system, suggesting that Myrinet will support systems of at least several tens of nodes.

5 Discussion

In previous sections we presented detailed measurements of a scalable network service implementation that confirmed the effectiveness of our layered architecture. In this section, we discuss some of the more interesting and novel aspects of our architecture, reflect on further potential applications of this research, and compare our work with others’ efforts.

5.1 Extensibility: New Workers and Composition

One of our goals was to make the system easily extensible at the TACC and Service layers by making it easy to create workers and chain them together. Our HTML and JPEG distillers consist almost entirely of off-the-shelf code, and each took an afternoon to write. Debugging the pathological cases for the HTML distiller was spread out over a period of days; since the system masked transient faults by bypassing original content “around” the faulting distiller, we could only deduce the existence of bugs by noticing (using the Monitor display) that the HTML distiller had been restarted several times over a period of hours.

The other aspect of extensibility is the ease with which new services can be added by composing workers and modifying the service presentation interface. We now discuss several examples of new services in various stages of construction, indicating what must be changed in the TACC and Service layers for each. The services share the following common features, which make them amenable to implementation using our framework:

• Compute-intensive transformation or aggregation

• Computation is parallelizable with granularity of a few CPU seconds

• Substantial value added by mass customization

• Data manipulated has BASE semantics

We restrict our discussion here to services that can be implemented using the HTTP proxy model (i.e., transparent interposition of computation between Web clients and Web servers). The following applications have all been prototyped using TranSend.

Keyword Filtering: The keyword filter aggregator is very simple (about 10 lines of Perl). It allows users to specify a Perl regular expression as customization preference. This regular expression is then applied to all HTML before delivery. A simple example filter marks all occurrences of the chosen keywords with large, bold, red typeface.
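A Python rendering of the same idea is sketched below; the deployed worker was Perl, and the exact highlighting markup is our assumption. A production worker would also avoid rewriting text that happens to fall inside HTML tags, a detail this sketch ignores.

    import re

    def keyword_filter(html, user_pattern):
        """Wrap matches of the user-supplied regex in large, bold, red markup
        (the specific tags are illustrative)."""
        highlight = r'<font color="red" size="+2"><b>\g<0></b></font>'
        return re.sub(user_pattern, highlight, html)

    # e.g., a user preference of "Berkeley|Inktomi" marks every occurrence of
    # either keyword in the page before delivery:
    print(keyword_filter("<p>HotBot is built by Inktomi.</p>", r"Berkeley|Inktomi"))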

Bay Area Culture Page: This service retrieves scheduling information from a number of cultural pages on the web, and collates the results into a single, comprehensive calendar of upcoming events, bounded by dates stored as part of each user’s profile. The service is implemented as a single aggregator in the TACC layer, and is composed with the unmodified TranSend service layer, delivering the benefits of distillation automatically. This service exploits BASE “approximate answers” semantics at the application layer: extremely general, layout-independent heuristics are used to extract scheduling information from the cultural pages. About 10-20% of the time, the heuristics spuriously pick up non-date text (and the accompanying non-descriptions of events), but the service is still useful and users simply ignore spurious results. Early experience with services such as this one suggests that our SNS architecture may be a promising platform for deploying certain kinds of simple network agents.

TranSend Metasearch: The metasearch service is similar to the Bay Area Culture Page in that it collates content from other sources in the Internet. This content, however, is dynamically produced: an aggregator accepts a search string from a user, queries a number of popular search engines, and collates the top results from each into a single result page. Commercial metasearch engines already exist [58], but the TranSend metasearch engine was implemented using 3 pages of Perl code in roughly 2.5 hours, and inherits scalability, fault tolerance, and high availability from the SNS layer.
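The structure of such an aggregator is sketched below in Python under hypothetical engine URLs and a deliberately crude result parser; the actual service was about three pages of Perl, with its own engine list and parsing.

    import concurrent.futures
    import re
    import urllib.parse
    import urllib.request

    # Hypothetical engine list; real engines and query formats would differ.
    ENGINES = {
        "engine-a": "http://engine-a.example.com/search?q=",
        "engine-b": "http://engine-b.example.com/find?query=",
    }

    def parse_results(page):
        # crude, illustrative extraction of link titles from an HTML results page
        return re.findall(r'<a [^>]*href="[^"]+"[^>]*>([^<]+)</a>', page)

    def query_engine(name, base_url, terms, top_n=10):
        url = base_url + urllib.parse.quote(terms)
        with urllib.request.urlopen(url, timeout=10) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        return name, parse_results(page)[:top_n]

    def metasearch(terms):
        """Fan the query out to all engines in parallel and collate the top hits."""
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(query_engine, n, u, terms) for n, u in ENGINES.items()]
            sections = [f.result() for f in concurrent.futures.as_completed(futures)]
        body = "".join(
            "<h2>%s</h2><ul>%s</ul>" % (name, "".join("<li>%s</li>" % r for r in results))
            for name, results in sections
        )
        return "<html><body><h1>Results for %s</h1>%s</body></html>" % (terms, body)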

Anonymous Rewebber: Just as anonymous remailer chains [24] allow email authors to anonymously disseminate their content, an anonymous rewebber network allows web authors to anonymously publish their content. The rewebber described in [25] was implemented in one week using our TACC architecture. The rewebber’s workers perform encryption and decryption, its user profile database maintains public key information for anonymous servers, and its cache stores decrypted versions of frequently accessed pages. Since encryption and decryption of distinct pages requested by independent users is both computationally intensive and highly parallelizable, this service is a natural fit for our architecture.

Real Web Access for PDAs and Smart Phones: We have already extended TranSend to support graphical Web browsing on the USR PalmPilot [62], a typical “thin client” device. Previous attempts to provide Web browsing on such devices have foundered on the severe limitations imposed by small screens, limited computing capability, and austere programming environments, and virtually all have fallen back to simple text-only browsing. But the ability of our architecture to move complexity into the service workers rather than the client allows us to approach this problem from a different perspective. We have built TranSend workers that output simplified markup and scaled-down images ready to be “spoon fed” to an extremely simple browser client, given knowledge of the client’s screen dimensions and font metrics. This greatly simplifies client-side code since no HTML parsing, layout, or image processing is necessary, and as a side benefit, the smaller and more efficient data representation reduces transmission time to the client.

5.2 Economic Feasibility

Given the improved quality of service provided by TranSend, an interesting question is the additional cost required to operate this service. From our performance data, a US$5000 Pentium Pro server should be able to support about 750 modems, or about 15,000 subscribers (assuming a 20:1 subscriber to modem ratio). Amortized over 1 year, the marginal cost per user is an amazing 25 cents/month.

If we include the savings to the ISP due to a cache hit rate of 50% or more, as we observed in our cache experiments, then we can eliminate the equivalent of 1-2 T1 lines per TranSend installation, which reduces operating costs by about US$3000 per month. Thus, we expect that the server would pay for itself in only two months. In this argument we have ignored the cost of administration, which is nontrivial, but we believe administration costs for TranSend would be minimal: we run TranSend at Berkeley with essentially no administration except for feature upgrades and bug fixes, both of which are performed without bringing the service down.

5.3 Related Work

Content transformation by proxy: Filtering and on-the-fly compression have become particularly popular for HTTP [31], whose proxy mechanism was originally intended for users behind security firewalls. The mechanism has been used to shield clients from the effects of poor (especially wireless) networks [22,37], perform filtering [67] and anonymization, and perform value-added transformations on content, including Kanji transcoding [56], Kanji-to-GIF conversion [65], application-level stream transducing [13,59], and personalized agent services for Web browsing [7].

Fault tolerance and high availability: The Worm programs [55] are an early example of process-peer fault tolerance. Tandem Computer and others explored a related mechanism, process-pair fault tolerance [6], in which a secondary (backup) process ran in parallel with the primary and maintained a mirror of the primary’s internal state by processing the same message traffic as the primary, allowing it to immediately replace the primary in the event of failure. Tandem also advocated the use of simple “building blocks” to ensure high availability. The Open Group SHAWS project [49] plans to build scalable highly available web servers using a fault tolerance toolkit called CORDS, but that project is still in progress.

BASE: Grapevine [9] was an important early example of trading consistency for simplicity; Bayou [21] later explored trading consistency for availability in application-specific ways, providing an operational spectrum between ACID and BASE for a distributed database. The use of soft state to provide improved performance and increased robustness in fault tolerance has been well explored in the wide-area Internet, in the context of IP packet routing [39], multicast routing [20], and wireless TCP optimizations such as TCP Snoop [5]; the lessons learned in those areas strongly influenced our design philosophy for the TACC server architecture.

Load balancing and scaling: WebOS [66] and SWEB++ [3] have exploited the extensibility of client browsers via Java and JavaScript to enhance scalability of network-based services by dividing labor between the client and server. We note that our system does not preclude, and in fact benefits from, exploiting intelligence and computational resources at the client, as we do for the TranSend user interface and coarse-grained load balancing. However, as discussed in the Introduction, we expect the utility of centralized, highly-available services to continue to increase, and this cannot occur without the growth path provided by linear incremental scalability in the SNS sense.

5.4 Future Work

Our past work on adaptation via distillation [23,22] described how distillation could be dynamically tuned to match the behavior of the user’s network connection. We have successfully demonstrated adaptation to network changes by combining our original WWW proxy prototype with the Event Notification mechanisms developed by Welling and Badrinath [4], and we plan to leverage these mechanisms to provide an adaptive solution for Web access from wireless clients.

We have not investigated how well our proposed architecture works outside the Internet-server domain. In particular, we do not believe it will work well for write-intensive services where the writes carry hard state or where strong consistency is desired, such as commerce servers, file systems, or online voting systems.

The programming model for TACC services is still embryonic. We plan to develop it into a well-defined programming environment with an SDK, and we will encourage our colleagues to author services of their own using our system.

Previous research into operating systems support for busy Internet servers [32,42] has identified inadequacies in OS implementations and the set of abstractions available to applications. We plan to investigate similar issues related specifically to cluster-based middleware services, as motivated by our observations in Section 4.4.

6 Conclusions

We proposed a layered architecture for cluster-based scalable network services. We identified challenges of cluster-based computing, and showed how our architecture addresses these challenges. The architecture is reusable: authors of new network services write and compose stateless workers that transform, aggregate, cache, and customize (TACC) Internet content, but are shielded from the software complexity of automatic scaling, high availability, and failure management. We argued that a large class of network services can get by with BASE, a weaker-than-ACID data semantics that results from the combination of trading consistency for availability and exploiting soft state for performance and failure management.

We discussed in depth the design and implementation of two cluster-based scalable network services: the TranSend distillation Web proxy and the HotBot search engine. Using extensive client traces, we conducted detailed performance measurements of TranSend. While gathering these measurements, we scaled TranSend up to 10 Ultra-1 workstations serving 159 web requests per second, and demonstrated that a single such workstation is sufficient to serve the needs of the entire 600 modem UC Berkeley dialup IP bank.

Since the class of cluster-based scalable network services we have identified can substantially increase the value of Internet access to end users while remaining cost-efficient to deploy and administer, we believe that cluster-based value-added network services will become an important Internet-service paradigm.

7 Acknowledgments

This paper has benefited from the detailed and perceptive comments of our reviewers, especially our shepherd Hank Levy. We thank Randy Katz and Eric Anderson for their detailed readings of early drafts of this paper, and David Culler for his ideas on TACC’s potential as a model for cluster programming. Ken Lutz and Eric Fraser configured and administered the test network on which the TranSend scaling experiments were performed. Cliff Frost of the UC Berkeley Data Communications and Networks Services group allowed us to collect traces on the Berkeley dialup IP network and has worked with us to deploy and promote TranSend within Berkeley. Undergraduate researchers Anthony Polito, Benjamin Ling, and Andrew Huang implemented various parts of TranSend’s user profile database and user interface. Ian Goldberg and David Wagner helped us debug TranSend, especially through their implementation of the rewebber.

8 References

[1] E. Anderson and David A. Patterson. Extensible, Scalable Monitoring for Clusters of Computers. Proc. 1997 Large Installation System Administration Conference (LISA XI), to appear.
[2] T. E. Anderson et al. The Case for NOW (Networks of Workstations). IEEE Micro, February 1995.
[3] D. Andresen, T. Yang, O. Egecioglu, O. H. Ibarra, and T. R. Smith. Scalability Issues for High Performance Digital Libraries on the World Wide Web. Proceedings of ADL '96, Forum on Research and Technology Advances in Digital Libraries, IEEE, Washington D.C., May 1996.
[4] B. R. Badrinath and G. Welling. A Framework for Environment Aware Mobile Applications. International Conference on Distributed Computing Systems, May 1997 (to appear).
[5] H. Balakrishnan, S. Seshan, E. Amir, R. Katz. Improving TCP/IP Performance over Wireless Networks. Proceedings of the 1st ACM Conference on Mobile Computing and Networking, Berkeley, CA, November 1995.
[6] J. F. Bartlett. A NonStop Kernel. Proc. 8th SOSP and Operating Systems Review 15(5), December 1981.
[7] Rob Barrett, Paul Maglio, and Daniel Kellem. How to Personalize the Web. Proc. CHI 97.
[8] Berkeley Home IP Service FAQ. http://ack.berkeley.edu/dcns/modems/hip/hip_faq.html.
[9] A. D. Birrell et al. Grapevine: An Exercise in Distributed Computing. Communications of the ACM 25(4), Feb. 1984.
[10] C. M. Bowman et al. Harvest: A Scalable, Customizable Discovery and Access System. Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, August 1994.
[11] Tim Bray. Measuring the Web. Proc. WWW-5, Paris, May 1996.
[12] T. Brisco. RFC 1764: DNS Support for Load Balancing, April 1995.
[13] C. Brooks, M. S. Mazer, S. Meeks and J. Miller. Application-Specific Proxy Servers as HTTP Stream Transducers. Proc. WWW-4, Boston, May 1996. http://www.w3.org/pub/Conferences/WWW4/Papers/56.
[14] A. Chankhunthod, P. B. Danzig, C. Neerdaels, M. F. Schwartz and K. J. Worrell. A Hierarchical Internet Object Cache. Proceedings of the 1996 USENIX Annual Technical Conference, 153-163, January 1996.
[15] D. Clark. Policy Routing in Internet Protocols. Internet Request for Comments 1102, May 1989.
[16] Cisco Systems. Local Director. http://www.cisco.com/warp/public/751/lodir/index.html.
[17] F. J. Corbató and V. A. Vyssotsky. Introduction and Overview of the Multics System. AFIPS Conference Proceedings, 27, 185-196 (1965 Fall Joint Computer Conference), 1965. http://www.lilli.com/fjcc1.html.
[18] M. E. Crovella and A. Bestavros. Explaining World Wide Web Traffic Self-Similarity. Tech. Rep. TR-95-015, Computer Science Department, Boston University, October 1995.
[19] P. B. Danzig, R. S. Hall and M. F. Schwartz. A Case for Caching File Objects Inside Internetworks. Proceedings of SIGCOMM '93, 239-248, September 1993.
[20] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C.-G. Liu, and L. Wei. An Architecture for Wide-Area Multicast Routing. Proceedings of SIGCOMM '94, University College London, London, U.K., September 1994.
[21] A. Demers, K. Petersen, M. Spreitzer, D. Terry, M. Theimer, B. Welch. The Bayou Architecture: Support for Data Sharing Among Mobile Users.
[22] A. Fox and E. A. Brewer. Reducing WWW Latency and Bandwidth Requirements via Real-Time Distillation. Proc. WWW-5, Paris, May 1996.
[23] A. Fox, S. D. Gribble, E. Brewer and E. Amir. Adapting to Network and Client Variation Via On-Demand Dynamic Distillation. Proceedings of ASPLOS-VII, Boston, October 1996.
[24] Ian Goldberg, David Wagner, and Eric Brewer. Privacy-enhancing Technologies for the Internet. Proc. of IEEE Spring COMPCON, 1997.
[25] Ian Goldberg and David Wagner. TAZ Servers and the Rewebber Network: Enabling Anonymous Publishing on the World Wide Web. Unpublished manuscript, May 1997, available at http://www.cs.berkeley.edu/~daw/cs268/.
[26] J. Gray. The Transaction Concept: Virtues and Limitations. Proceedings of VLDB, Cannes, France, September 1981, 144-154.
[27] S. D. Gribble, G. S. Manku, and E. Brewer. Self-Similarity in File-Systems: Measurement and Applications. Unpublished, available at http://www.cs.berkeley.edu/~gribble/papers/papers.html.
[28] T. R. Halfhill. Inside the Web PC. Byte Magazine, March 1996.
[29] Independent JPEG Group. jpeg6a library.
[30] Inktomi Corporation: The Inktomi Technology Behind HotBot. May 1996. http://www.inktomi.com/whitepap.html.
[31] Internet Engineering Task Force. Hypertext Transfer Protocol—HTTP 1.1. RFC 2068, March 1, 1997.
[32] M. F. Kaashoek, D. R. Engler, G. R. Ganger, and D. A. Wallach. Server Operating Systems. Proceedings of the SIGOPS European Workshop, September 1996.
[33] P. Keleher, A. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. Proceedings of the 19th Annual Symposium on Computer Architecture, May 1992.
[34] P. Keleher, A. Cox, S. Swarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Proceedings of the 1994 Winter USENIX Conference, January 1994.
[35] W. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson. On the Self-Similar Nature of Ethernet Traffic (extended version). IEEE/ACM Transactions on Networking, v2, February 1994.
[36] K. Li. Shared Virtual Memory on Loosely Coupled Microprocessors. PhD Thesis, Yale University, September 1986.
[37] M. Liljeberg et al. Enhanced Services for World Wide Web in Mobile WAN Environment. University of Helsinki CS Technical Report No. C-1996-28, April 1996.
[38] Bruce A. Mah. An Empirical Model of HTTP Network Traffic. Proc. INFOCOM 97, Kobe, Japan, April 1997.
[39] J. McQuillan, I. Richer, E. Rosen. The New Routing Algorithm for the ARPANET. IEEE Transactions on Communications COM-28, No. 5, pp. 711-719, May 1980.
[40] M. Meeker and C. DePuy. The Internet Report. Morgan Stanley Equity Research, April 1996. http://www.mas.com/misc/inet/morganh.html.
[41] P. V. Mockapetris and K. J. Dunlap. Development of the Domain Name System. ACM SIGCOMM Computer Communication Review, 1988.
[42] Jeffrey C. Mogul. Operating Systems Support for Busy Internet Servers. Proceedings of HotOS-V, Orcas Island, Washington, May 1995.
[43] Myricom Inc. Myrinet: A Gigabit Per Second Local Area Network. IEEE Micro, Vol. 15, No. 1, February 1995, pp. 29-36.
[44] National Laboratory for Applied Network Research. The Squid Internet Object Cache. http://squid.nlanr.net.
[45] National Aeronautics and Space Administration. The Mars Pathfinder Mission Home Page. http://mpfwww.jpl.nasa.gov/default1.html.
[46] Netscape Communications Corporation. Netscape Proxy Automatic Configuration. http://home.netscape.com/eng/mozilla/2.02/relnotes/unix-2.02.html#Proxies.
[47] Nokia Communicator 9000 Press Release. Available at http://www.club.nokia.com/support/9000/press.html.
[48] J. K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley, 1994.
[49] Open Group Research Institute. Scalable Highly Available Web Server Project (SHAWS). http://www.osf.org/RI/PubProjPgs/SFTWWW.htm.
[50] Eric S. Raymond, ed. The New Hackers’ Dictionary. Cambridge, MA: MIT Press, 1991. Also http://www.ccil.org/jargon/jargon.html.
[51] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proceedings of 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, NC.
[52] S. H. Rodrigues and T. E. Anderson. High-Performance Local-Area Communication Using Fast Sockets. Proc. 1997 Winter USENIX, Anaheim, CA.
[53] Jacques A. J. Roufs. Perceptual Image Quality: Concept and Measurement. Philips Journal of Research, 47:3-14, 1992.
[54] A. Sah, K. E. Brown and E. Brewer. Programming the Internet from the Server-Side with Tcl and Audience1. Proceedings of Tcl96, July 1996.
[55] J. F. Shoch and J. A. Hupp. The “Worm” Programs—Early Experience with a Distributed System. CACM 25(3):172-180, March 1982.
[56] Y. Sato. DeleGate Server. Documentation available at http://www.aubg.edu:8080/cii/src/delegate3.0.17/doc/Manual.txt.
[57] B. Schilit and T. Bickmore. Digestor: Device-Independent Access to the World Wide Web. Proc. WWW-6, Santa Clara, CA, April 1997.
[58] E. Selberg, O. Etzioni and G. Lauckhart. Metacrawler: About Our Service. http://www.metacrawler.com/about.html.
[59] M. A. Schickler, M. S. Mazer, and C. Brooks. Pan-Browser Support for Annotations and Other Meta-Information on the World Wide Web. Proc. WWW-5, Paris, May 1996. http://www5conf.inria.fr/fich_html/papers/P15/Overview.html.
[60] SingNet (Singapore ISP). Heretical Caching Effort for SingNet Customers. http://www.singnet.com.sg/cache/proxy.
[61] N. Smith. The UK National Web Cache - The State of the Art. Proc. WWW-5, Paris, May 1996. http://www5conf.inria.fr/fich_html/papers/P45/Overview.html.
[62] US Robotics Palm Pilot home page. http://www.usr.com/palm/.
[63] C. Waldspurger and W. Weihl. Lottery Scheduling: Flexible Proportional Share Resource Management. Proceedings of the First OSDI, November 1994.
[64] Yahoo!, Inc. http://www.yahoo.com.
[65] Kao-Ping Yee. Shoduoka Mediator Service. http://www.lfw.org/shodouka.
[66] C. Yoshikawa, B. Chun, P. Eastham, A. Vahdat, T. Anderson, and D. Culler. Using Smart Clients to Build Scalable Services. Proceedings of Winter 1997 USENIX, January 1997.
[67] B. Zenel. A Proxy Based Filtering Mechanism for the Mobile Environment. Ph.D. Thesis Proposal, Department of Computer Science, Columbia University, March 1996.