A Scalable Distributed Information Management System*

Praveen Yalagandula and Mike Dahlin
Department of Computer Sciences
The University of Texas at Austin

* This is an extended version of the SIGCOMM 2004 paper. Please cite the original SIGCOMM paper.
Abstract
We present a Scalable Distributed Information Management System (SDIMS) that aggregates information about large-scale networked systems and that can serve as a basic building block for a broad range of large-scale distributed applications by providing detailed views of nearby information and summary views of global information. To serve as a basic building block, a SDIMS should have four properties: scalability to many nodes and attributes, flexibility to accommodate a broad range of applications, administrative isolation for security and availability, and robustness to node and network failures. We design, implement, and evaluate a SDIMS that (1) leverages Distributed Hash Tables (DHTs) to create scalable aggregation trees, (2) provides flexibility through a simple API that lets applications control propagation of reads and writes, (3) provides administrative isolation through simple extensions to current DHT algorithms, and (4) achieves robustness to node and network reconfigurations through lazy reaggregation, on-demand reaggregation, and tunable spatial replication. Through extensive simulations and micro-benchmark experiments, we observe that our system is an order of magnitude more scalable than existing approaches, achieves isolation properties at the cost of modestly increased read latency in comparison to flat DHTs, and gracefully handles failures.
1 Introduction
The goal of this research is to design and build a Scalable Distributed Information Management System (SDIMS) that aggregates information about large-scale networked systems and that can serve as a basic building block for a broad range of large-scale distributed applications. Monitoring, querying, and reacting to changes in the state of a distributed system are core components of applications such as system management [15, 31, 37, 42], service placement [14, 43], data sharing and caching [18, 29, 32, 35, 46], sensor monitoring and control [20, 21], multicast tree formation [8, 9, 33, 36, 38], and naming and request routing [10, 11]. We therefore speculate that a SDIMS in a networked system would provide a “distributed operating systems backbone” and facilitate the development and deployment of new distributed services.
For a large scale information system, hierarchical aggregation is a fundamental abstraction for scalability. Rather than expose all information to all nodes, hierarchical aggregation allows a node to access detailed views of nearby information and summary views of global information. In a SDIMS based on hierarchical aggregation, different nodes can therefore receive different answers to the query “find a [nearby] node with at least 1 GB of free memory” or “find a [nearby] copy of file foo.” A hierarchical system that aggregates information through reduction trees [21, 38] allows nodes to access information they care about while maintaining system scalability.
To be used as a basic building block, a SDIMS shouldhave four
properties. First, the system should be scal-able: it should
accommodate large numbers of participat-ing nodes, and it should
allow applications to install andmonitor large numbers of data
attributes. Enterprise andglobal scale systems today might have
tens of thousandsto millions of nodes and these numbers will
increase overtime. Similarly, we hope to support many
applications,and each application may track several attributes
(e.g., theload and free memory of a system’s machines) or mil-lions
of attributes (e.g., which files are stored on whichmachines).
Second, the system should have flexibility to accommodate a broad range of applications and attributes. For example, read-dominated attributes like numCPUs rarely change in value, while write-dominated attributes like numProcesses change quite often. An approach tuned for read-dominated attributes will consume high bandwidth when applied to write-dominated attributes. Conversely, an approach tuned for write-dominated attributes will suffer from unnecessary query latency or imprecision for read-dominated attributes. Therefore, a SDIMS should provide mechanisms to handle different types of attributes and leave the policy decision of tuning replication to the applications.
Third, a SDIMS should provide administrative isolation. In a large system, it is natural to arrange nodes in an organizational or an administrative hierarchy (e.g., Figure 1). A SDIMS should support administrative isolation in which queries about an administrative domain’s information can be satisfied within the domain so that the system can operate during disconnections from other domains, so that an external observer cannot monitor or affect intra-domain queries, and to support domain-scoped queries efficiently.

[Figure 1: Administrative hierarchy – an example tree with ROOT, top-level domains com and edu, universities univ1 through univ4, departments such as cs, ee, and math, and machines pc1 through pc6.]
Fourth, the system must be robust to node failures and disconnections. A SDIMS should adapt to reconfigurations in a timely fashion and should also provide mechanisms so that applications can trade off the cost of adaptation with the consistency level in the aggregated results when reconfigurations occur.
We draw inspiration from two previous works: Astrolabe [38] and Distributed Hash Tables (DHTs).
Astrolabe [38] is a robust information management system. Astrolabe provides the abstraction of a single logical aggregation tree that mirrors a system’s administrative hierarchy. It provides a general interface for installing new aggregation functions and provides eventual consistency on its data. Astrolabe is robust due to its use of an unstructured gossip protocol for disseminating information and its strategy of replicating all aggregated attribute values for a subtree to all nodes in the subtree. This combination allows any communication pattern to yield eventual consistency and allows any node to answer any query using local information. This high degree of replication, however, may limit the system’s ability to accommodate large numbers of attributes. Also, although the approach works well for read-dominated attributes, an update at one node can eventually affect the state at all nodes, which may limit the system’s flexibility to support write-dominated attributes.
Recent research in peer-to-peer structured networks resulted in Distributed Hash Tables (DHTs) [18, 28, 29, 32, 35, 46] – a data structure that scales with the number of nodes and that distributes the read-write load for different queries among the participating nodes. It is interesting to note that although these systems export a global hash table abstraction, many of them internally make use of what can be viewed as a scalable system of aggregation trees to, for example, route a request for a given key to the right DHT node. Indeed, rather than export a general DHT interface, Plaxton et al.’s [28] original application makes use of hierarchical aggregation to allow nodes to locate nearby copies of objects. It seems appealing to develop a SDIMS abstraction that exposes this internal functionality in a general way so that scalable trees for aggregation can be a basic system building block alongside the DHTs.
At first glance, it might appear obvious that simply fusing DHTs with Astrolabe’s aggregation abstraction will result in a SDIMS. However, meeting the SDIMS requirements forces a design to address four questions: (1) How to scalably map different attributes to different aggregation trees in a DHT mesh? (2) How to provide flexibility in the aggregation to accommodate different application requirements? (3) How to adapt a global, flat DHT mesh to attain the administrative isolation property? and (4) How to provide robustness without unstructured gossip and total replication?
The key contributions of this paper that form the foundation of our SDIMS design are as follows.
1. We define a new aggregation abstraction that specifies both attribute type and attribute name and that associates an aggregation function with a particular attribute type. This abstraction paves the way for utilizing the DHT system’s internal trees for aggregation and for achieving scalability with both nodes and attributes.
2. We provide a flexible API that lets applications control the propagation of reads and writes and thus trade off update cost, read latency, replication, and staleness.
3. We augment an existing DHT algorithm to ensure path convergence and path locality properties in order to achieve administrative isolation.
4. We provide robustness to node and network reconfigurations by (a) providing temporal replication through lazy reaggregation that guarantees eventual consistency and (b) ensuring that our flexible API allows demanding applications to gain additional robustness by using tunable spatial replication of data aggregates, by performing fast on-demand reaggregation to augment the underlying lazy reaggregation, or by doing both.
We have built a prototype of SDIMS. Through simulations and micro-benchmark experiments on a number of department machines and PlanetLab [27] nodes, we observe that the prototype achieves scalability with respect to both nodes and attributes through use of its flexible API, inflicts an order of magnitude lower maximum node stress than unstructured gossiping schemes, achieves isolation properties at a cost of modestly increased read latency compared to flat DHTs, and gracefully handles node failures.
This initial study discusses key aspects of an ongoing system building effort, but it does not address all issues in building a SDIMS. For example, we believe that our strategies for providing robustness will mesh well with techniques such as supernodes [22] and other ongoing efforts to improve DHTs [30] for further improving robustness. Also, although splitting aggregation among many trees improves scalability for simple queries, this approach may make complex and multi-attribute queries more expensive compared to a single tree. Additional work is needed to understand the significance of this limitation for real workloads and, if necessary, to adapt query planning techniques from DHT abstractions [16, 19] to scalable aggregation tree abstractions.
In Section 2, we explain the hierarchical aggregation abstraction that SDIMS provides to applications. In Sections 3 and 4, we describe the design of our system for achieving the flexibility, scalability, and administrative isolation requirements of a SDIMS. In Section 5, we detail the implementation of our prototype system. Section 6 addresses the issue of adaptation to topological reconfigurations. In Section 7, we present the evaluation of our system through large-scale simulations and micro-benchmarks on real networks. Section 8 details the related work, and Section 9 summarizes our contribution.
2 Aggregation Abstraction
Aggregation is a natural abstraction for a large-scale distributed information system because aggregation provides scalability by allowing a node to view detailed information about the state near it and progressively coarser-grained summaries about progressively larger subsets of a system’s data [38].
Our aggregation abstraction is defined across a tree spanning all nodes in the system. Each physical node in the system is a leaf and each subtree represents a logical group of nodes. Note that logical groups can correspond to administrative domains (e.g., department or university) or groups of nodes within a domain (e.g., 10 workstations on a LAN in a CS department). An internal non-leaf node, which we call a virtual node, is simulated by one or more physical nodes at the leaves of the subtree for which the virtual node is the root. We describe how to form such trees in a later section.
Each physical node has local data stored as a set of (attributeType, attributeName, value) tuples such as (configuration, numCPUs, 16), (mcast membership, session foo, yes), or (file stored, foo, myIPaddress). The system associates an aggregation function f_type with each attribute type, and for each level-i subtree T_i in the system, the system defines an aggregate value V_{i,type,name} for each (attributeType, attributeName) pair as follows. For a (physical) leaf node T_0 at level 0, V_{0,type,name} is the locally stored value for the attribute type and name, or NULL if no matching tuple exists. Then the aggregate value for a level-i subtree T_i is the aggregation function for the type, f_type, computed across the aggregate values of each of T_i's k children: V_{i,type,name} = f_type(V^0_{i-1,type,name}, V^1_{i-1,type,name}, ..., V^{k-1}_{i-1,type,name}).
Although SDIMS allows arbitrary aggregation functions, it is often desirable that these functions satisfy the hierarchical computation property [21]: f(v_1, ..., v_n) = f(f(v_1, ..., v_{s1}), f(v_{s1+1}, ..., v_{s2}), ..., f(v_{sk+1}, ..., v_n)), where v_i is the value of an attribute at node i. For example, the average operation, defined as avg(v_1, ..., v_n) = (1/n) * sum_{i=1}^{n} v_i, does not satisfy the property. Instead, if an attribute stores values as tuples (sum, count), the attribute satisfies the hierarchical computation property while still allowing the applications to compute the average from the aggregate sum and count values.
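To make the hierarchical computation property concrete, the following sketch (illustrative Java, not the prototype's code) shows an average kept as a (sum, count) tuple, so that the aggregation function can be applied to partial aggregates in any grouping and still yield the same result.

    // Illustrative sketch: an average maintained as a (sum, count) tuple whose
    // aggregation function satisfies the hierarchical computation property.
    final class SumCount {
        final double sum;
        final long count;
        SumCount(double sum, long count) { this.sum = sum; this.count = count; }

        // f((s1,c1), ..., (sk,ck)) = (s1+...+sk, c1+...+ck); applying f to partial
        // aggregates of any grouping of the leaves yields the same result.
        static SumCount aggregate(java.util.List<SumCount> children) {
            double s = 0; long c = 0;
            for (SumCount v : children) { s += v.sum; c += v.count; }
            return new SumCount(s, c);
        }

        // The application derives the average only at the point of use.
        double average() { return count == 0 ? Double.NaN : sum / count; }
    }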
Finally, note that for a large-scale system, it is difficult or impossible to insist that the aggregation value returned by a probe corresponds to the function computed over the current values at the leaves at the instant of the probe. Therefore our system provides only weak consistency guarantees – specifically, eventual consistency as defined in [38].
3 Flexibility
A major innovation of our work is enabling flexible aggregate computation and propagation. The definition of the aggregation abstraction allows considerable flexibility in how, when, and where aggregate values are computed and propagated. While previous systems [15, 29, 38, 32, 35, 46] implement a single static strategy, we argue that a SDIMS should provide flexible computation and propagation to efficiently support a wide variety of applications with diverse requirements. In order to provide this flexibility, we develop a simple interface that decomposes the aggregation abstraction into three pieces of functionality: install, update, and probe.
This definition of the aggregation abstraction allows our system to provide a continuous spectrum of strategies ranging from lazy aggregate computation and propagation on reads to aggressive immediate computation and propagation on writes. In Figure 2, we illustrate both extreme strategies and an intermediate strategy. Under the lazy Update-Local computation and propagation strategy, an update (or write) only affects local state. Then, a probe (or read) that reads a level-i aggregate value is sent up the tree to the issuing node's level-i ancestor and then down the tree to the leaves. The system then computes the desired aggregate value at each layer up the tree until the level-i ancestor that holds the desired value. Finally, the level-i ancestor sends the result down the tree to the issuing node. In the other extreme case of the aggressive Update-All immediate computation and propagation on writes [38], when an update occurs, changes are aggregated up the tree, and each new aggregate value is flooded to all of a node's descendants. In this case, each level-i node not only maintains the aggregate values for the level-i subtree but also receives and locally stores copies of all of its ancestors' level-j (j > i) aggregation values. Also, a leaf satisfies a probe for a level-i aggregate using purely local data. In an intermediate Update-Up strategy, the root of each subtree maintains the subtree's current aggregate value, and when an update occurs, the leaf node updates its local state and passes the update to its parent, and then each successive enclosing subtree updates its aggregate value and passes the new value to its parent. This strategy satisfies a leaf's probe for a level-i aggregate value by sending the probe up to the level-i ancestor of the leaf and then sending the aggregate value down to the leaf. Finally, notice that other strategies exist. For example, an Update-UpRoot-Down1 strategy (not shown) would aggregate updates up to the root of a subtree and send a subtree's aggregate values to only the children of the root of the subtree. In general, an Update-Upk-Downj strategy aggregates up to the kth level and propagates the aggregate values of a node at level l (s.t. l <= k) downward for j levels.

[Figure 2: Flexible API – illustrates the Update-Local, Update-Up, and Update-All strategies on an update, on a probe for the global aggregate value, and on a probe for a level-1 aggregate value.]
A SDIMS must provide a wide range of flexible computation and propagation strategies to applications for it to be a general abstraction. An application should be able to choose a particular mechanism based on its read-to-write ratio that reduces the bandwidth consumption while attaining the required responsiveness and precision. Note that the read-to-write ratios of the attributes that applications install vary extensively. For example, a read-dominated attribute like numCPUs rarely changes in value, while a write-dominated attribute like numProcesses changes quite often. An aggregation strategy like Update-All works well for read-dominated attributes but suffers high bandwidth consumption when applied to write-dominated attributes. Conversely, an approach like Update-Local works well for write-dominated attributes but suffers from unnecessary query latency or imprecision for read-dominated attributes.
parameter   description                                                optional
attrType    Attribute Type
aggrfunc    Aggregation Function
up          How far upward each update is sent (default: all)          X
down        How far downward each aggregate is sent (default: none)    X
domain      Domain restriction (default: none)                         X
expTime     Expiry Time

Table 1: Arguments for the install operation
SDIMS also allows non-uniform computation and propagation across the aggregation tree with different up and down parameters in different subtrees so that applications can adapt to the spatial and temporal heterogeneity of read and write operations. With respect to spatial heterogeneity, access patterns may differ for different parts of the tree, requiring different propagation strategies for different parts of the tree. Similarly, with respect to temporal heterogeneity, access patterns may change over time, requiring different strategies over time.
3.1 Aggregation API
We provide the flexibility described above by splitting the aggregation API into three functions: Install() installs an aggregation function that defines an operation on an attribute type and specifies the update strategy that the function will use, Update() inserts or modifies a node's local value for an attribute, and Probe() obtains an aggregate value for a specified subtree. The install interface allows applications to specify the k and j parameters of the Update-Upk-Downj strategy along with the aggregation function. The update interface invokes the aggregation of an attribute on the tree according to the corresponding aggregation function's aggregation strategy. The probe interface not only allows applications to obtain the aggregated value for a specified tree but also allows a probing node to continuously fetch the values for a specified time, thus enabling an application to adapt to spatial and temporal heterogeneity. The rest of the section describes these three interfaces in detail.
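The following Java sketch illustrates one possible shape of this API; the method names, parameter types, and ProbeMode enum are assumptions for exposition based on Tables 1-3, not the prototype's actual interface.

    // Illustrative shape of the SDIMS API (names and types are assumptions).
    interface Sdims {
        // Install an aggregation function for an attribute type with an
        // Update-Up(k)-Down(j) strategy, optionally restricted to a domain.
        void install(String attrType, AggregationFunction aggrFunc,
                     int up, int down, String domain, long expTimeMillis);

        // Insert or modify this node's local value for (attrType, attrName).
        void update(String attrType, String attrName, Object value);

        // Obtain an aggregate for a subtree at the given level; mode may be one-shot
        // or continuous, and up/down can force on-demand re-aggregation (Section 6).
        Object probe(String attrType, String attrName, ProbeMode mode,
                     int level, int up, int down, long expTimeMillis);
    }

    interface AggregationFunction {
        Object aggregate(java.util.List<Object> childValues);
    }

    enum ProbeMode { ONE_SHOT, CONTINUOUS }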
3.1.1 Install
The Install operation installs an aggregation function in the system. The arguments for this operation are listed in Table 1. The attrType argument denotes the type of attributes on which this aggregation function is invoked. Installed functions are soft state that must be periodically renewed or they will be garbage collected at expTime.

The arguments up and down specify the aggregate computation and propagation strategy Update-Upk-Downj. The domain argument, if present, indicates that the aggregation function should be installed on all nodes in the specified domain; otherwise the function is installed on all nodes in the system.

parameter   description       optional
attrType    Attribute Type
attrName    Attribute Name
val         Value

Table 2: Arguments for the update operation

parameter   description                                                    optional
attrType    Attribute Type
attrName    Attribute Name
mode        Continuous or One-shot (default: one-shot)                     X
level       Level at which aggregate is sought (default: at all levels)    X
up          How far up to go and re-fetch the value (default: none)        X
down        How far down to go and re-aggregate (default: none)            X
expTime     Expiry Time

Table 3: Arguments for the probe operation
3.1.2 Update
The Update operation takes three arguments attrType, attrName, and value and creates a new (attrType, attrName, value) tuple or updates the value of an old tuple with matching attrType and attrName at a leaf node.
The update interface meshes with the installed aggregate computation and propagation strategy to provide flexibility. In particular, as outlined above and described in detail in Section 5, after a leaf applies an update locally, the update may trigger re-computation of aggregate values up the tree and may also trigger propagation of changed aggregate values down the tree. Notice that our abstraction associates an aggregation function with only an attrType but lets updates specify an attrName along with the attrType. This technique helps achieve scalability with respect to nodes and attributes as described in Section 4.
3.1.3 Probe
The Probe operation returns the value of an attribute to an application. The complete argument set for the probe operation is shown in Table 3. Along with the attrName and the attrType arguments, a level argument specifies the level at which the answers are required for an attribute. In our implementation we choose to return results at all levels k <= l for a level-l probe because (i) it is inexpensive, as the nodes traversed for a level-l probe also contain the level-k aggregates for k < l and as we expect the network cost of transmitting the additional information to be small for the small aggregates on which we focus, and (ii) it is useful, as applications can efficiently get several aggregates with a single probe (e.g., for domain-scoped queries as explained in Section 4.2).
Probes with mode set to continuous and with finite expTime enable applications to handle spatial and temporal heterogeneity. When node A issues a continuous probe at level l for an attribute, then regardless of the up and down parameters, updates for the attribute at any node in A's level-l ancestor's subtree are aggregated up to level l and the aggregated value is propagated down along the path from the ancestor to A. Note that continuous mode enables SDIMS to support a distributed sensor-actuator mechanism where a sensor monitors a level-i aggregate with a continuous mode probe and triggers an actuator upon receiving new values for the probe.
The up and down arguments enable applications to perform on-demand fast re-aggregation during reconfigurations, where a forced re-aggregation is done for the corresponding levels even if the aggregated value is available, as we discuss in Section 6. When present, the up and down arguments are interpreted as described in the install operation.
3.1.4 Dynamic Adaptation
At the API level, the up and down arguments in the install API can be regarded as hints, since they suggest a computation strategy but do not affect the semantics of an aggregation function. A SDIMS implementation can dynamically adjust its up/down strategies for an attribute based on its measured read/write frequency. But a virtual intermediate node needs to know the current up and down propagation values to decide if the local aggregate is fresh in order to answer a probe. This is the key reason why up and down need to be statically defined at install time and cannot be specified in the update operation. For dynamic adaptation, we implement a lease-based mechanism where a node issues a lease to a parent or a child denoting that it will keep propagating the updates to that parent or child. We are currently evaluating different policies to decide when to issue a lease and when to revoke a lease.
4 Scalability
Our design achieves scalability with respect to both nodes and attributes through two key ideas. First, it carefully defines the aggregation abstraction to mesh well with its underlying scalable DHT system. Second, it refines the basic DHT abstraction to form an Autonomous DHT (ADHT) to achieve the administrative isolation properties that are crucial to scaling for large real-world systems. In this section, we describe these two ideas in detail.
4.1 Leveraging DHTs
In contrast to previous systems [4, 15, 38, 39, 45], SDIMS's aggregation abstraction specifies both an attribute type and attribute name and associates an aggregation function with a type rather than just specifying and associating a function with a name. Installing a single function that can operate on many different named attributes matching a type improves scalability for “sparse attribute types” with large, sparsely-filled name spaces. For example, to construct a file location service, our interface allows us to install a single function that computes an aggregate value for any named file. A subtree's aggregate value for (FILELOC, name) would be the ID of a node in the subtree that stores the named file. Conversely, Astrolabe copes with sparse attributes by having aggregation functions compute sets or lists and suggests that scalability can be improved by representing such sets with Bloom filters [6]. Supporting sparse names within a type provides at least two advantages. First, when the value associated with a name is updated, only the state associated with that name needs to be updated and propagated to other nodes. Second, splitting values associated with different names into different aggregation values allows our system to leverage Distributed Hash Tables (DHTs) to map different names to different trees and thereby spread the function's logical root node's load and state across multiple physical nodes.
Given this abstraction, scalably mapping attributes to DHTs is straightforward. DHT systems assign a long, random ID to each node and define an algorithm to route a request for key k to a node root_k such that the union of paths from all nodes forms a tree DHTtree_k rooted at the node root_k. Now, as illustrated in Figure 3, by aggregating an attribute along the aggregation tree corresponding to DHTtree_k for k = hash(attribute type, attribute name), different attributes will be aggregated along different trees.
In comparison to a scheme where all attributes are aggregated along a single tree, aggregating along multiple trees incurs lower maximum node stress: whereas in a single aggregation tree approach, the root and the intermediate nodes pass around more messages than leaf nodes, in a DHT-based multi-tree, each node acts as an intermediate aggregation point for some attributes and as a leaf node for other attributes. Hence, this approach distributes the onus of aggregation across all nodes.
[Figure 3: The DHT tree corresponding to key 111 (DHTtree_111) and the corresponding aggregation tree, over nodes with 3-bit IDs 000 through 111 arranged in levels L0 through L3.]
4.2 Administrative Isolation
Aggregation trees should provide administrative isolation by ensuring that for each domain, the virtual node at the root of the smallest aggregation subtree containing all nodes of that domain is hosted by a node in that domain. Administrative isolation is important for three reasons: (i) for security – so that updates and probes flowing in a domain are not accessible outside the domain, (ii) for availability – so that queries for values in a domain are not affected by failures of nodes in other domains, and (iii) for efficiency – so that domain-scoped queries can be simple and efficient.
To provide administrative isolation to aggregation trees, a DHT should satisfy two properties:

1. Path Locality: Search paths should always be contained in the smallest possible domain.

2. Path Convergence: Search paths for a key from different nodes in a domain should converge at a node in that domain.
Existing DHTs support path locality [18] or can easily support it by using domain nearness as the distance metric [7, 17], but they do not guarantee path convergence, as those systems try to optimize the search path to the root to reduce response latency. For example, Pastry [32] uses prefix routing in which each node's routing table contains one row per hexadecimal digit in the nodeId space, where the ith row contains a list of nodes whose nodeIds differ from the current node's nodeId in the ith digit, with one entry for each possible digit value. Notice that for a given row and entry (viz. digit and value) a node n can choose the entry from many different alternative destination nodes, especially for small i, where a destination node needs to match n's ID in only a few digits to be a candidate for inclusion in n's routing table. A common policy is to choose a nearby node according to a proximity metric [28] to minimize the network distance for routing a key. Under this policy, the nodes in a routing table sharing a short prefix will tend to be nearby since there are many such nodes spread roughly evenly throughout the system due to random nodeId assignment. Pastry is self-organizing: nodes come and go at will. To maintain Pastry's locality properties, a new node must join with one that is nearby according to the proximity metric. Pastry provides a seed discovery protocol that finds such a node given an arbitrary starting point.
Given a routing topology, to route a packet to an arbitrary destination key, a node in Pastry forwards a packet to the node with a nodeId prefix matching the key in at least one more digit than the current node. If such a node is not known, the current node uses an additional data structure, the leaf set, containing the L immediate higher and lower neighbors in the nodeId space, and forwards the packet to a node with an identical prefix but that is numerically closer to the destination key in the nodeId space. This process continues until the destination node appears in the leaf set, after which the message is routed directly. Pastry's expected number of routing steps is log n, where n is the number of nodes, but as Figure 4 illustrates, this algorithm does not guarantee path convergence: if two nodes in a domain have nodeIds that match a key in the same number of bits, both of them can route to a third node outside the domain when routing for that key.
Simple modifications to Pastry's route table construction and key-routing protocols yield an Autonomous DHT (ADHT) that satisfies the path locality and path convergence properties. As Figure 5 illustrates, whenever two nodes in a domain share the same prefix with respect to a key and no other node in the domain has a longer prefix, our algorithm introduces a virtual node at the boundary of the domain corresponding to that prefix plus the next digit of the key; such a virtual node is simulated by the existing node whose id is numerically closest to the virtual node's id. Our ADHT's routing table differs from Pastry's in two ways. First, each node maintains a separate leaf set for each domain of which it is a part. Second, nodes use two proximity metrics when populating the routing tables – hierarchical domain proximity is the primary metric and network distance is secondary. Then, to route a packet to a global root for a key, the ADHT routing algorithm uses the routing table and the leaf set entries to route to each successive enclosing domain's root (the virtual or real node in the domain matching the key in the maximum number of digits).
The routing algorithm we use to route a key from a node with a given nodeId is shown in Algorithm 1. By routing at the lowest possible domain until the root of that domain is reached, we ensure that the routing paths conform to the path convergence property. The routing algorithm guarantees that as long as the leaf set membership is correct, the path convergence property is satisfied. We use the Pastry leaf set maintenance algorithm for maintaining leaf sets at all levels; after reconfigurations, once the repairs are done on a particular domain level's leaf set, the autonomy properties are met at that domain level. Note that the modifications proposed for Pastry still preserve the fault-tolerance properties of the original algorithm; rather, they enhance the fault tolerance of the algorithm, but at the cost of extra maintenance overhead.
[Figure 4: Example showing how the isolation property is violated with original Pastry, along with the corresponding aggregation tree: nodes 010XX, 011XX, 100XX, 101XX, and 110XX in domains dep1 and dep2 under univ route for key 111XX.]
[Figure 5: Autonomous DHT satisfying the isolation property, along with the corresponding aggregation tree, for the same nodes and key 111XX; a virtual node is introduced at the domain boundary.]
Algorithm 1 ADHTroute(key)
 1: flipNeigh := checkRoutingTable(key);
 2: l := numDomainLevels - 1;
 3: while (l >= 0) do
 4:   if (commLevels(flipNeigh, nodeId) >= l) then
 5:     send the key to flipNeigh; return;
 6:   else
 7:     leafNeigh := an entry in leafset[l] closer to key than nodeId;
 8:     if (leafNeigh != null) then
 9:       send the key to leafNeigh; return;
10:     end if
11:   end if
12:   l := l - 1;
13: end while
14: this node is the root for this key
Properties. Maintaining a different leaf set for each administrative hierarchy level increases the number of neighbors that each node tracks to (2^b - 1) * ceil(log_{2^b} n) + c*l from (2^b - 1) * ceil(log_{2^b} n) + c in unmodified Pastry, where b is the number of bits in a digit, n is the number of nodes, c is the leaf set size, and l is the number of domain levels. Routing requires O(log_{2^b} n + l) steps compared to O(log_{2^b} n) steps in Pastry; also, each routing hop may be longer than in Pastry because the modified algorithm's routing table prefers same-domain nodes over nearby nodes. We experimentally quantify the additional routing costs in Section 7.
In a large system, the ADHT topology allows domains to improve security for sensitive attribute types by installing them only within a specified domain. Then, aggregation occurs entirely within the domain and a node external to the domain can neither observe nor affect the updates and aggregation computations of the attribute type. Furthermore, though we have not implemented this feature in the prototype, the ADHT topology would also support domain-restricted probes that could ensure that no one outside of a domain can observe a probe for data stored within the domain.

[Figure 6: Example for domain-scoped queries – machines A1, A2, and B1 with NumMachines tuples such as ((A1.A.,1),(A.,2),(.,2)) and ((B1.B.,1),(B.,1),(.,3)) at levels L0 through L2.]
The ADHT topology also enhances availability by allowing the common case of probes for data within a domain to depend only on a domain's nodes. This, for example, allows a domain that becomes disconnected from the rest of the Internet to continue to answer queries for local data.
Aggregation trees that provide administrative isolation also enable the definition of simple and efficient domain-scoped aggregation functions to support queries like “what is the average load on machines in domain X?” For example, consider an aggregation function to count the number of machines in an example system with three machines illustrated in Figure 6. Each leaf node l updates attribute NumMachines with a value v_l containing a set of tuples of the form (Domain, Count) for each domain of which the node is a part. In the example, the node A1 with name A1.A. performs an update with the value ((A1.A.,1),(A.,1),(.,1)). An aggregation function at an internal virtual node hosted on node N with child set C computes the aggregate as a set of tuples: for each domain D that N is part of, form a tuple (D, sum_{c in C} {count | (D, count) in v_c}). This computation is illustrated in Figure 6. Now a query for NumMachines with level set to MAX will return the aggregate values at each intermediate virtual node on the path to the root as a set of tuples (tree level, aggregated value) from which it is easy to extract the count of machines at each enclosing domain. For example, A1 would receive ((2, ((B1.B.,1),(B.,1),(.,3))), (1, ((A1.A.,1),(A.,2),(.,2))), (0, ((A1.A.,1),(A.,1),(.,1)))). Note that supporting domain-scoped queries would be less convenient and less efficient if aggregation trees did not conform to the system's administrative structure. It would be less efficient because each intermediate virtual node would have to maintain a list of all values at the leaves in its subtree along with their names, and it would be less convenient because applications that need an aggregate for a domain would have to pick values of nodes in that domain from the list returned by a probe and perform the computation.
[Figure 7: Example illustrating the data structures and their organization at a node with id 1001XXX: a local MIB at level 0; child MIBs, reduction MIBs, aggregation functions, and ancestor MIBs at the virtual nodes for prefixes 1X.., 10X.., 100X.., and 1001X..; and the MIBs exchanged with the parent and children at each level.]
5 Prototype Implementation
The internal design of our SDIMS prototype comprises two layers: the Autonomous DHT (ADHT) layer manages the overlay topology of the system and the Aggregation Management Layer (AML) maintains attribute tuples, performs aggregations, and stores and propagates aggregate values. Given the ADHT construction described in Section 4.2, each node implements an Aggregation Management Layer (AML) to support the flexible API described in Section 3. In this section, we describe the internal state and operation of the AML layer of a node in the system.
We refer to a store of (attribute type, attribute name, value) tuples as a Management Information Base or MIB, following the terminology from Astrolabe [38] and SNMP [34]. We refer to an (attribute type, attribute name) tuple as an attribute key.
As Figure 7 illustrates, each physical node in the system acts as several virtual nodes in the AML: a node acts as a leaf for all attribute keys, as a level-1 subtree root for keys whose hash matches the node's ID in b prefix bits (where b is the number of bits corrected in each step of the ADHT's routing scheme), as a level-i subtree root for attribute keys whose hash matches the node's ID in the initial i * b bits, and as the system's global root for attribute keys whose hash matches the node's ID in more prefix bits than any other node (in case of a tie, the first non-matching bit is ignored and the comparison is continued [46]).
To support hierarchical aggregation, each virtual node at the root of a level-i subtree maintains several MIBs that store (1) child MIBs containing raw aggregate values gathered from children, (2) a reduction MIB containing locally aggregated values across this raw information, and (3) an ancestor MIB containing aggregate values scattered down from ancestors. This basic strategy of maintaining child, reduction, and ancestor MIBs is based on Astrolabe [38], but our structured propagation strategy channels information that flows up according to its attribute key, and our flexible propagation strategy only sends child updates up and ancestor aggregate results down as far as specified by the attribute key's aggregation function. Note that in the discussion below, for ease of explanation, we assume that the routing protocol corrects a single bit at a time (b = 1). Our system, built upon Pastry, handles multi-bit correction (b = 4) and is a simple extension to the scheme described here.
For a given virtual node n_i at level i, each child MIB contains the subset of a child's reduction MIB that contains tuples that match n_i's node ID in i bits and whose up aggregation function attribute is at least i. These local copies make it easy for a node to recompute a level-i aggregate value when one child's input changes. Nodes maintain their child MIBs in stable storage and use a simplified version of the Bayou log exchange protocol (sans conflict detection and resolution) for synchronization after disconnections [26].
Virtual node n_i at level i maintains a reduction MIB with a tuple for each key present in any child MIB; each such tuple contains the attribute type, attribute name, and the output of the attribute type's aggregation function applied to the children's tuples.
A virtual node n_i at level i also maintains an ancestor MIB to store tuples containing an attribute key and a list of aggregate values at different levels scattered down from ancestors. Note that the list for a key might contain multiple aggregate values for the same level but aggregated at different nodes (see Figure 5). So, the aggregate values are tagged not only with level information, but also with the ID of the node that performed the aggregation.
Level 0 differs slightly from other levels. Each level-0 leaf node maintains a local MIB rather than maintaining child MIBs and a reduction MIB. This local MIB stores information about the local node's state inserted by local applications via update() calls. We envision various “sensor” programs and applications inserting data into the local MIB. For example, one program might monitor local configuration and perform updates with information such as total memory, free memory, etc. A distributed file system might perform an update for each file stored on the local node.
Along with these MIBs, a virtual node maintains two other tables: an aggregation function table and an outstanding probes table. An aggregation function table contains the aggregation function and installation arguments (see Table 1) associated with an attribute type or an attribute type and name. Each aggregation function is installed on all nodes in a domain's subtree, so the aggregation function table can be thought of as a special case of the ancestor MIB with domain functions always installed up to a root within a specified domain and down to all nodes within the domain. The outstanding probes table maintains temporary information regarding in-progress probes.
Given these data structures, it is simple to support the three API functions described in Section 3.1.
Install The Install operation (see Table 1) installs on a domain an aggregation function that acts on a specified attribute type. Execution of an install operation for function aggrFunc on attribute type attrType proceeds in two phases: first the install request is passed up the ADHT tree with the attribute key (attrType, null) until it reaches the root for that key within the specified domain. Then, the request is flooded down the tree and installed on all intermediate and leaf nodes.
Update When a level-i virtual node receives an update for an attribute from a child below, it first recomputes the level-i aggregate value for the specified key, stores that value in its reduction MIB, and then, subject to the function's up and domain parameters, passes the updated value to the appropriate parent based on the attribute key. Also, the level-i (i >= 1) virtual node sends the updated level-i aggregate to all its children if the function's down parameter exceeds zero. Upon receipt of a level-i aggregate from a parent, a level-k virtual node stores the value in its ancestor MIB and, if k > i - down, forwards this aggregate to its children.
Probe A Probe collects and returns the aggregate value for a specified attribute key for a specified level of the tree. As Figure 2 illustrates, the system satisfies a probe for a level-i aggregate value using a four-phase protocol that may be short-circuited when updates have previously propagated either results or partial results up or down the tree. In phase 1, the route probe phase, the system routes the probe up the attribute key's tree to either the root of the level-i subtree or to a node that stores the requested value in its ancestor MIB. In the former case, the system proceeds to phase 2 and in the latter it skips to phase 4. In phase 2, the probe scatter phase, each node that receives a probe request sends it to all of its children unless the node's reduction MIB already has a value that matches the probe's attribute key, in which case the node initiates phase 3 on behalf of its subtree. In phase 3, the probe aggregation phase, when a node receives values for the specified key from each of its children, it executes the aggregate function on these values and either (a) forwards the result to its parent (if its level is less than i) or (b) initiates phase 4 (if it is at level i). Finally, in phase 4, the aggregate routing phase, the aggregate value is routed down to the node that requested it. Note that in the extreme case of a function installed with up = down = 0, a level-i probe can touch all nodes in a level-i subtree, while in the opposite extreme case of a function installed with up = down = ALL, a probe is a completely local operation at a leaf.
For probes that include phases 2 (probe scatter) and 3 (probe aggregation), an issue is how to decide when a node should stop waiting for its children to respond and send up its current aggregate value. A node stops waiting for its children when one of three conditions occurs: (1) all children have responded, (2) the ADHT layer signals one or more reconfiguration events that mark all children that have not yet responded as unreachable, or (3) a watchdog timer for the request fires. The last case accounts for nodes that participate in the ADHT protocol but that fail at the AML level.
At a virtual node, continuous probes are handled similarly to one-shot probes except that such probes are stored in the outstanding probes table for the time period expTime specified in the probe. Thus each update for an attribute triggers re-evaluation of continuous probes for that attribute.
We implement a lease-based mechanism for dynamic adaptation. A level-l virtual node for an attribute can issue a lease for the level-l aggregate to a parent or a child only if up is greater than l or it has leases from all its children. A virtual node at level l can issue a lease for a level-k aggregate with k > l to a child only if down >= k - l or if it has the lease for that aggregate from its parent. Now a probe for a level-k aggregate can be answered by a level-l virtual node if it has a valid lease, irrespective of the up and down values. We are currently designing different policies to decide when to issue a lease and when to revoke a lease and are also evaluating them with the above mechanism.
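The lease conditions above can be summarized by the following illustrative Java sketch; it reflects one reading of the rules described in the text rather than the prototype's implementation.

    // Illustrative lease-issuing conditions for dynamic adaptation.
    final class LeaseRules {
        // A level-l virtual node may issue a lease for its own level-l aggregate.
        static boolean canIssueOwnLevelLease(int l, int up, boolean leasesFromAllChildren) {
            return up > l || leasesFromAllChildren;
        }
        // A level-l virtual node may issue a lease for a level-k aggregate (k > l).
        static boolean canIssueAncestorLease(int l, int k, int down, boolean leaseFromParent) {
            return k > l && (down >= k - l || leaseFromParent);
        }
    }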
Our current prototype does not implement access control on install, update, and probe operations, but we plan to implement Astrolabe's [38] certificate-based restrictions. Also, our current prototype does not restrict the resource consumption in executing the aggregation functions; however, techniques from research on resource management in server systems and operating systems [2, 3] can be applied here.
6 Robustness
In large scale systems, reconfigurations are common. Our two main principles for robustness are to guarantee (i) read availability – probes complete in finite time, and (ii) eventual consistency – updates by a live node will be visible to probes by connected nodes in finite time. During reconfigurations, a probe might return a stale value for two reasons. First, reconfigurations lead to incorrectness in the previous aggregate values. Second, the nodes needed for aggregation to answer the probe become unreachable. Our system also provides two hooks that applications can use for improved end-to-end robustness in the presence of reconfigurations: (1) on-demand re-aggregation and (2) application-controlled replication.
Our system handles reconfigurations at two levels – adaptation at the ADHT layer to ensure connectivity and adaptation at the AML layer to ensure access to the data in SDIMS.
6.1 ADHT Adaptation
Our ADHT layer adaptation algorithm is the same as Pastry's adaptation algorithm [32]: the leaf sets are repaired as soon as a reconfiguration is detected and the routing table is repaired lazily. Note that maintaining extra leaf sets does not degrade the fault-tolerance property of the original Pastry; indeed, it enhances the resilience of ADHTs to failures by providing additional routing links. Due to redundancy in the leaf sets and the routing table, updates can be routed towards their root nodes successfully even during failures. Also note that the administrative isolation property satisfied by our ADHT algorithm ensures that reconfigurations in a level-i domain do not affect probes for level i in a sibling domain.
6.2 AML Adaptation
Broadly, we use two types of strategies for AML adaptation in the face of reconfigurations: (1) replication in time as a fundamental baseline strategy, and (2) replication in space as an additional performance optimization that falls back on replication in time when the system runs out of replicas. We provide two mechanisms for replication in time. First, lazy re-aggregation propagates already received updates to new children or new parents in a lazy fashion over time. Second, applications can reduce the probability of probe response staleness during such repairs through our flexible API with an appropriate setting of the down parameter.
Lazy Re-aggregation: The DHT layer informs the AML layer about reconfigurations in the network using the following three function calls – newParent, failedChild, and newChild. On newParent(parent, prefix), all probes in the outstanding-probes table corresponding to prefix are re-evaluated. If parent is not null, then aggregation functions and already existing data are lazily transferred in the background. Any new updates, installs, and probes for this prefix are sent to the parent immediately. On failedChild(child, prefix), the AML layer marks the child as inactive and any outstanding probes that are waiting for data from this child are re-evaluated. On newChild(child, prefix), the AML layer creates space in its data structures for this child.
Figure 8 shows the time line for the default lazy re-aggregation upon reconfiguration. Probes initiated between points 1 and 2 that are affected by the reconfiguration are reevaluated by AML upon detecting the reconfiguration. Probes that complete or start between points 2 and 8 may return stale answers.
[Figure 8: Default lazy data re-aggregation time line. Points 1 through 8 mark, in order, when the reconfiguration happens, when the DHT notices it, when the DHT partial and complete repairs finish, when successive lazy data re-aggregations start, and when lazy re-aggregation ends.]
On-demand Re-aggregation: The default lazy aggregation scheme lazily propagates the old updates in the system. Additionally, using the up and down knobs in the Probe API, applications can force on-demand fast re-aggregation of updates to avoid staleness in the face of reconfigurations. In particular, if an application detects or suspects an answer as stale, then it can re-issue the probe with increased up and down parameters to force the refreshing of the cached data. Note that this strategy will be useful only after the DHT adaptation is completed (point 6 on the time line in Figure 8).
Replication in Space: Replication in space is more challenging in our system than in a DHT file location application because replication in space can be achieved easily in the latter by just replicating the root node's contents. In our system, however, all internal nodes have to be replicated along with the root.
In our system, applications control replication in space using the up and down knobs in the Install API; with large up and down values, aggregates at the intermediate virtual nodes are propagated to more nodes in the system. By reducing the number of nodes that have to be accessed to answer a probe, applications can reduce the probability of incorrect results occurring due to the failure of nodes that do not contribute to the aggregate. For example, in a file location application, using a non-zero positive down parameter ensures that a file's global aggregate is replicated on nodes other than the root. Probes for the file location can then be answered without accessing the root; hence they are not affected by the failure of the root. However, note that this technique is not appropriate in some cases. An aggregated value in a file location system is valid as long as the node hosting the file is active, irrespective of the status of other nodes in the system; whereas an application that counts the number of machines in a system may receive incorrect results irrespective of the replication. If reconfigurations are only transient (like a node temporarily not responding due to a burst of load), the replicated aggregate closely or correctly resembles the current state.
7 Evaluation
We have implemented a prototype of SDIMS in Java using the FreePastry framework [32] and performed large-scale simulation experiments and micro-benchmark experiments on two real networks: 187 machines in the department and 69 machines on the PlanetLab [27] testbed.
[Figure 9: Flexibility of our approach, with different UP and DOWN values in a network of 4096 nodes for different read-to-write ratios. The plot shows the average number of messages per operation versus the read-to-write ratio for Update-All, Up=ALL/Down=9, Up=ALL/Down=6, Update-Up, Up=5/Down=0, Up=2/Down=0, and Update-Local.]
In all experiments, we use static up and down values and turn off dynamic adaptation. Our evaluation supports four main conclusions. First, the flexible API provides different propagation strategies that minimize communication resources at different read-to-write ratios. For example, in our simulation we observe Update-Local to be efficient for read-to-write ratios below 0.0001, Update-Up around 1, and Update-All above 50000. Second, our system is scalable with respect to both nodes and attributes. In particular, we find that the maximum node stress in our system is an order of magnitude lower than observed with an Update-All, gossiping approach. Third, in contrast to unmodified Pastry, which violates the path convergence property in up to 14% of cases, our system conforms to the property. Fourth, the system is robust to reconfigurations and adapts to failures within a few seconds.
7.1 Simulation Experiments
Flexibility and Scalability: A major innovation of our system is its ability to provide flexible computation and propagation of aggregates. In Figure 9, we demonstrate the flexibility exposed by the aggregation API explained in Section 3. We simulate a system with 4096 nodes arranged in a domain hierarchy with branching factor (bf) of 16 and install several attributes with different up and down parameters. We plot the average number of messages per operation incurred for a wide range of read-to-write ratios of the operations for different attributes. Simulations with other sizes of networks with different branching factors reveal similar results. This graph clearly demonstrates the benefit of supporting a wide range of computation and propagation strategies. Although having a small UP value is efficient for attributes with low read-to-write ratios (write dominated applications), the probe latency, when reads do occur, may be high since the probe needs to aggregate the data from all the nodes that did not send their aggregate up. Conversely, applications that wish to improve probe overheads or latencies can increase their UP and DOWN propagation at a potential cost of an increase in write overheads.
[Figure 10: Maximum node stress for a gossiping approach vs. the ADHT-based approach for different numbers of nodes with an increasing number of sparse attributes. The plot shows maximum node stress versus the number of attributes installed for Gossip and DHT configurations with 256, 4096, and 65536 nodes.]
Compared to an existing Update-All single aggregation tree approach [38], scalability in SDIMS comes from (1) leveraging DHTs to form multiple aggregation trees that split the load across nodes and (2) flexible propagation that avoids propagation of all updates to all nodes. Figure 10 demonstrates SDIMS's scalability with nodes and attributes. For this experiment, we build a simulator to simulate both Astrolabe [38] (a gossiping, Update-All approach) and our system for an increasing number of sparse attributes. Each attribute corresponds to the membership in a multicast session with a small number of participants. For this experiment, the session size is set to 8, the branching factor is set to 16, the propagation mode for SDIMS is Update-Up, and the participant nodes perform continuous probes for the global aggregate value. We plot the maximum node stress (in terms of messages) observed in both schemes for different sized networks with an increasing number of sessions when the participant of each session performs an update operation. Clearly, the DHT-based scheme is more scalable with respect to attributes than an Update-All gossiping scheme. Observe that at some constant number of attributes, as the number of nodes increases in the system, the maximum node stress increases in the gossiping approach, while it decreases in our approach as the load of aggregation is spread across more nodes. Simulations with other session sizes (4 and 16) yield similar results.
Administrative Hierarchy and Robustness: Although the routing protocol of ADHT might lead to an increased number of hops to reach the root for a key as compared to original Pastry, the algorithm conforms to the path convergence and locality properties and thus provides the administrative isolation property. In Figure 11, we quantify the increased path length by comparison with unmodified Pastry for different sized networks with different branching factors of the domain hierarchy tree.
[Figure 11: Average path length to root in Pastry versus ADHT for different branching factors (ADHT bf=4, 16, 64; Pastry bf=4, 16, 64), plotted against the number of nodes. Note that all lines corresponding to Pastry overlap.]
[Figure 12: Percentage of probe pairs whose paths to the root did not conform to the path convergence property with Pastry, plotted against the number of nodes for branching factors 4, 16, and 64.]
To quantify the path convergence property, we perform simulations with a large number of probe pairs, each pair probing for a random key starting from two randomly chosen nodes. In Figure 12, we plot the percentage of probe pairs for unmodified Pastry that do not conform to the path convergence property. When the branching factor is low, the domain hierarchy tree is deeper, resulting in a large difference between Pastry and ADHT in the average path length; but it is at these small domain sizes that path convergence fails most often with the original Pastry.
7.2 Testbed Experiments
We run our prototype on 180 department machines (some machines run multiple node instances, so this configuration has a total of 283 SDIMS nodes) and also on 69 machines of the PlanetLab [27] testbed. We measure the performance of our system with two micro-benchmarks. In the first micro-benchmark, we install three aggregation functions of types Update-Local, Update-Up, and Update-All, perform an update operation on all nodes for all three aggregation functions, and measure the latencies incurred by probes for the global aggregate from all nodes in the system.
[Figure 13: Latency (in ms) of probes for the aggregate at the global root level with three different modes of aggregate propagation (Update-All, Update-Up, Update-Local) on (a) department machines and (b) PlanetLab machines.]
[Figure 14: Micro-benchmark on the department network showing the behavior of probes from a single node (probe latency in ms and observed values versus time in seconds, with node-kill events marked) while failures occur at other nodes. All 283 nodes assign a value of 10 to the attribute.]
Figure 13 shows the observed latencies for both testbeds. Notice that the latency in Update-Local is high compared to the Update-Up policy. This is because latency in Update-Local is affected by the presence of even a single slow machine or a single machine with a high-latency network connection.
In the second benchmark, we examine robustness. We install one aggregation function of type Update-Up that performs a sum operation on an integer-valued attribute. Each node updates the attribute with the value 10. Then we monitor the latencies and results returned on the probe operation for the global aggregate on one chosen node, while we kill some nodes after every few probes. Figure 14 shows the results on the departmental testbed. Due to the nature of the testbed (machines in a department), there is little change in the latencies even in the face of reconfigurations. In Figure 15, we present the results of the experiment on the PlanetLab testbed. The root node of the aggregation tree is terminated after about 275 seconds. There is a 5X increase in the latencies after the death of the initial root node as a more distant node becomes the root node after repairs. In both experiments, the values returned on
probes start reflecting the correct situation within a short time after the failures.

[Figure 15: Probe performance (probe latency in ms and observed values versus time in seconds, with node-kill events marked) during failures on 69 machines of the PlanetLab testbed.]
From both the testbed benchmark experiments and the simulation experiments on flexibility and scalability, we conclude that (1) the flexibility provided by SDIMS allows applications to trade off read-write overheads (Figure 9), read latency, and sensitivity to slow machines (Figure 13), (2) a good default aggregation strategy is Update-Up, which has moderate overheads on both reads and writes (Figure 9), has moderate read latencies (Figure 13), and is scalable with respect to both nodes and attributes (Figure 10), and (3) small domain sizes are the cases where DHT algorithms fail to provide path convergence more often, and SDIMS ensures path convergence with only a moderate increase in path lengths (Figure 12).
7.3 Applications
SDIMS is designed as a general distributed monitoring and control infrastructure for a broad range of applications. Above, we discuss some simple microbenchmarks including a multicast membership service and a calculate-sum function. Van Renesse et al. [38] provide detailed examples of how such a service can be used for a peer-to-peer caching directory, a data-diffusion service, a publish-subscribe system, barrier synchronization, and voting. Additionally, we have initial experience using SDIMS to construct two significant applications: the control plane for a large-scale distributed file system [12] and a network monitor for identifying "heavy hitters" that consume excess resources.
Distributed file system control: The PRACTI (Partial Replication, Arbitrary Consistency, Topology Independence) replication system provides a set of mechanisms for data replication over which arbitrary control policies can be layered. We use SDIMS to provide several key functions in order to create a file system over the low-level PRACTI mechanisms.
First, nodes use SDIMS as a directory to handle read
misses. When a node n receives an object o, it updates the (ReadDir, o) attribute with the value n; when n discards o from its local store, it resets (ReadDir, o) to NULL. At each virtual node, the ReadDir aggregation function simply selects a random non-null child value (if any), and we use the Update-Up policy for propagating updates. Finally, to locate a nearby copy of an object o, a node n1 issues a series of probe requests for the (ReadDir, o) attribute, starting with level = 1 and increasing the level value with each repeated probe request until a non-null node ID n2 is returned. n1 then sends a demand read request to n2, and n2 sends the data if it has it. Conversely, if n2 does not have a copy of o, it sends a nack to n1, and n1 issues a retry probe with the down parameter set to a value larger than that used in the previous probe in order to force on-demand re-aggregation, which will yield a fresher value for the retry.
Second, nodes subscribe to invalidations and updates to interest sets of files, and nodes use SDIMS to set up and maintain per-interest-set network-topology-sensitive spanning trees for propagating this information. To subscribe to invalidations for interest set i, a node n1 first updates the (Inval, i) attribute with its identity n1, and the aggregation function at each virtual node selects one non-null child value. Finally, n1 probes increasing levels of the (Inval, i) attribute until it finds the first node n2 ≠ n1; n1 then uses n2 as its parent in the spanning tree. n1 also issues a continuous probe for this attribute at this level so that it is notified of any change to its spanning tree parent. Spanning trees for streams of pushed updates are maintained in a similar manner.
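A sketch of this parent-selection step appears below. The update/probe calls, the continuous flag, and the callback argument are assumed names used only for illustration; they are not the actual SDIMS interface.

def join_spanning_tree(sdims, my_id, interest_set, max_level=10):
    # Advertise this node as a subscriber for interest set i.
    sdims.update(("Inval", interest_set), my_id)
    for level in range(1, max_level + 1):
        parent = sdims.probe(("Inval", interest_set), level=level)
        if parent is not None and parent != my_id:
            # Continuous probe (hypothetical flag) at this level so the node
            # is notified whenever its spanning-tree parent changes.
            sdims.probe(("Inval", interest_set), level=level,
                        continuous=True, callback=on_parent_change)
            return parent
    return None  # no other subscriber found: this node is the tree root

def on_parent_change(new_parent):
    # Hypothetical handler: rewire to the new parent when notified.
    print("spanning-tree parent changed to", new_parent)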
In the future, we plan to use SDIMS for at least two additional services within this replication system. First, we plan to use SDIMS to track the read and write rates to different objects; prefetch algorithms will use this information to prioritize replication [40, 41]. Second, we plan to track the ranges of invalidation sequence numbers seen by each node for each interest set in order to augment the spanning trees described above with additional "hole filling" to allow nodes to locate specific invalidations they have missed.
Overall, our initial experience with using SDIMS for the PRACTI replication system suggests that (1) the general aggregation interface provided by SDIMS simplifies the construction of distributed applications: given the low-level PRACTI mechanisms, we were able to construct a basic file system that uses SDIMS for several distinct control tasks in under two weeks, and (2) the weak consistency guarantees provided by SDIMS meet the requirements of this application: each node's controller effectively treats information from SDIMS as hints, and if a contacted node does not have the needed data, the controller retries, using SDIMS on-demand re-aggregation to obtain a fresher hint.
Distributed heavy hitter problem: The goal of the heavy hitter problem is to identify network sources, destinations, or protocols that account for significant or unusual amounts of traffic. As noted by Estan et al. [13], this information is useful for a variety of applications such as intrusion detection (e.g., port scanning), denial of service detection, worm detection and tracking, fair network allocation, and network maintenance. Significant work has been done on developing high-performance stream-processing algorithms for identifying heavy hitters at one router, but this is just a first step; ideally these applications would like not just one router's view of the heavy hitters but an aggregate view.
We use SDIMS to allow local information about heavy hitters to be pooled into a view of global heavy hitters. For each destination IP address IPx, a node updates the attribute (DestBW, IPx) with the number of bytes sent to IPx in the last time window. The aggregation function for attribute type DestBW is installed with the Update-Up strategy and simply adds the values from child nodes. Nodes perform a continuous probe for the global aggregate of the attribute and raise an alarm when the global aggregate value goes above a specified limit. Note that only nodes sending data to a particular IP address perform probes for the corresponding attribute. Also note that the techniques from [25] can be extended to the hierarchical case to trade off precision for communication bandwidth.
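The following sketch summarizes this usage. The install/update/probe calls, the GLOBAL level token, the continuous-probe flag, and the alarm threshold are assumptions made for illustration rather than the actual interfaces.

ALARM_BYTES = 100 * 1024 * 1024  # hypothetical per-window alarm threshold

def install_destbw(sdims):
    # Sum child aggregates; install with the Update-Up strategy.
    sdims.install("DestBW", func=lambda children: sum(children),
                  up="ALL", down=0)

def report_window(sdims, bytes_sent_by_dest):
    """bytes_sent_by_dest maps a destination IP to bytes sent in the last window."""
    for dest_ip, nbytes in bytes_sent_by_dest.items():
        sdims.update(("DestBW", dest_ip), nbytes)
        # Only nodes that actually send to dest_ip probe its global aggregate.
        sdims.probe(("DestBW", dest_ip), level="GLOBAL", continuous=True,
                    callback=lambda total, ip=dest_ip: check_alarm(ip, total))

def check_alarm(dest_ip, total_bytes):
    if total_bytes > ALARM_BYTES:
        print("heavy hitter:", dest_ip, "-", total_bytes, "bytes in the last window")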
8 Related Work
The aggregation abstraction we use in our work is heavily influenced by the Astrolabe [38] project. Astrolabe adopts a Propagate-All approach and unstructured gossiping techniques to attain robustness [5]. However, any gossiping scheme requires aggressive replication of the aggregates. While such aggressive replication is efficient for read-dominated attributes, it incurs a high message cost for attributes with a small read-to-write ratio. Our approach provides a flexible API for applications to set propagation rules according to their read-to-write ratios. Other closely related projects include Willow [39], Cone [4], DASIS [1], and SOMO [45]. Willow, DASIS, and SOMO build a single tree for aggregation. Cone builds a tree per attribute and requires a total order on the attribute values.
Several academic [15, 21, 42] and commercial [37] distributed monitoring systems have been designed to monitor the status of large networked systems. Some of them are centralized, where all the monitoring data is collected and analyzed at a central host. Ganglia [15, 23] uses a hierarchical system where the attributes are replicated within clusters using multicast and then cluster aggregates are further aggregated along a single tree. Sophia [42] is a distributed monitoring system designed with a declarative logic programming model where the location of query
execution is both explicit in the language and can be calculated during evaluation. This research is complementary to our work. TAG [21] collects information from a large number of sensors along a single tree.
The observation that DHTs internally provide a scalable forest of reduction trees is not new. Plaxton et al.'s [28] original paper describes not a DHT, but a system for hierarchically aggregating and querying object location data in order to route requests to nearby copies of objects. Many systems, building upon both Plaxton's bit-correcting strategy [32, 46] and upon other strategies [24, 29, 35], have chosen to hide this power and export a simple and general distributed hash table abstraction as a useful building block for a broad range of distributed applications. Some of these systems internally make use of the reduction forest not only for routing but also for caching [32]; but for simplicity, these systems do not generally export this powerful functionality in their external interface. Our goal is to develop and expose the internal reduction forest of DHTs as a similarly general and useful abstraction.
Although object location is a predominant target application for DHTs, several other applications, such as multicast [8, 9, 33, 36] and DNS [11], are also built using DHTs. All these systems implicitly perform aggregation on some attribute, and each one of them must be designed to handle any reconfigurations in the underlying DHT. With the aggregation abstraction provided by our system, designing and building such applications becomes easier.
Internal DHT trees typically do not satisfy the domain locality properties required in our system. Castro et al. [7] and Gummadi et al. [17] point out the importance of path convergence from the perspective of achieving efficiency and investigate the performance of Pastry and other DHT algorithms, respectively. SkipNet [18] provides domain-restricted routing, where a key search is limited to the specified domain. This interface can be used to ensure path convergence by searching in the lowest domain and moving up to the next domain when the search reaches the root in the current domain. Although this strategy guarantees path convergence, it loses the aggregation tree abstraction property of DHTs, as the domain-constrained routing might touch a node more than once (as it searches forward and then backward to stay within a domain).
There are some ongoing efforts to provide the relational database abstraction on DHTs: PIER [19] and Gribble et al. [16]. This research mainly focuses on supporting the "Join" operation for tables stored on the nodes in a network. We consider this research to be complementary to our work; the approaches can be used in our system to handle composite probes, e.g., find the nearest machine that has file "foo" and more than 2 GB of memory.
9 Conclusions
This paper presents a Scalable Distributed Information Management System (SDIMS) that aggregates information in large-scale networked systems and that can serve as a basic building block for a broad range of applications. For large-scale systems, hierarchical aggregation is a fundamental abstraction for scalability. We build our system by extending ideas from Astrolabe and DHTs to achieve (i) scalability with respect to both nodes and attributes through a new aggregation abstraction that helps leverage DHT's internal trees for aggregation, (ii) flexibility through a simple API that lets applications control propagation of reads and writes, (iii) administrative isolation through simple augmentations of current DHT algorithms, and (iv) robustness to node and network reconfigurations through lazy reaggregation, on-demand reaggregation, and tunable spatial replication.
Our system is still in a nascent state. The initial work does provide evidence that we can achieve scalable distributed information management by leveraging the aggregation abstraction and DHTs. Our work also opens up many research issues on different fronts that need to be solved. Below we enumerate some future research directions.
1. Robustness: In our current system, despite the techniques described above, reconfigurations are costly (O(m · d · log² N), where m is the number of attributes, N is the number of nodes in the system, and d is the fraction of attributes that each node is interested in [44]). Malkhi et al. [22] propose Supernodes to reduce the number of reconfigurations at the DHT level; this technique can be leveraged to reduce the number of reconfigurations at the Aggregation Management Layer.
2. Self-tuning adaptation: The read-to-write ratios for applications are dynamic. Instead of applications choosing the right strategy, the system should be able to self-tune the aggregation and propagation strategy according to the changing read-to-write ratios.
3. Handling Composite Queries: Queries involving multiple attributes pose an issue in our system, as different attributes are aggregated along different trees.
4. Caching: While caching is employed effectively in DHT file location applications, further research is needed to apply this concept in our general framework.
Acknowledgements
We are grateful to J.C. Browne, Robbert van Renesse, Amin Vahdat, Jay Lepreau, and the anonymous reviewers for their helpful comments on this work.
References
[1] K. Albrecht, R. Arnold, M. Gahwiler, and R. Wattenhofer. Join and Leave in Peer-to-Peer Systems: The DASIS approach. Technical report, CS, ETH Zurich, 2003.
[2] G. Back, W. H. Hsieh, and J. Lepreau. Processes in KaffeOS: Isolation, Resource Management, and Sharing in Java. In Proc. OSDI, Oct. 2000.
[3] G. Banga, P. Druschel, and J. Mogul. Resource Containers: A New Facility for Resource Management in Server Systems. In OSDI, Feb. 1999.
[4] R. Bhagwan, P. Mahadevan, G. Varghese, and G. M. Voelker. Cone: A Distributed Heap-Based Approach to Resource Selection. Technical Report CS2004-0784, UCSD, 2004.
[5] K. P. Birman. The Surprising Power of Epidemic Communication. In Proceedings of FuDiCo, 2003.
[6] B. Bloom. Space/time tradeoffs in hash coding with allowable errors. Comm. of the ACM, 13(7):422–425, 1970.
[7] M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron. Exploiting Network Proximity in Peer-to-Peer Overlay Networks. Technical Report MSR-TR-2002-82, MSR.
[8] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. SplitStream: High-bandwidth Multicast in a Cooperative Environment. In SOSP, 2003.
[9] M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron. SCRIBE: A Large-scale and Decentralised Application-level Multicast Infrastructure. IEEE JSAC (Special issue on Network Support for Multicast Communications), 2002.
[10] J. Challenger, P. Dantzig, and A. Iyengar. A scalable and highly available system for serving dynamic data at frequently accessed web sites. In Proceedings of ACM/IEEE Supercomputing '98 (SC98), Nov. 1998.
[11] R. Cox, A. Muthitacharoen, and R. T. Morris. Serving DNS using a Peer-to-Peer Lookup Service. In IPTPS, 2002.
[12] M. Dahlin, L. Gao, A. Nayate, A. Venkataramani, P. Yalagandula, and J. Zheng. PRACTI replication for large-scale systems. Technical Report TR-04-28, The University of Texas at Austin, 2004.
[13] C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Internet Measurement Conference 2003, 2003.
[14] Y. Fu, J. Chase, B. Chun, S. Schwab, and A. Vahdat. SHARP: An architecture for secure resource peering. In Proc. SOSP, Oct. 2003.
[15] Ganglia: Distributed Monitoring and Execution System. http://ganglia.sourceforge.net.
[16] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What Can Peer-to-Peer Do for Databases, and Vice Versa? In Proceedings of WebDB, 2001.
[17] K. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica. The Impact of DHT Routing Geometry on Resilience and Proximity. In SIGCOMM, 2003.
[18] N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A Scalable Overlay Network with Practical Locality Properties. In USITS, March 2003.
[19] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In Proceedings of the VLDB Conference, May 2003.
[20] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed diffusion: a scalable and robust communication paradigm for sensor networks. In MobiCom, 2000.
[21] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks. In OSDI, 2002.
[22] D. Malkhi. Dynamic Lookup Networks. In FuDiCo, 2002.
[23] M. L. Massie, B. N. Chun, and D. E. Culler. The ganglia distributed monitoring system: Design, implementation, and experience. In submission.
[24] P. Maymounkov and D. Mazieres. Kademlia: A Peer-to-peer Information System Based on the XOR Metric. In Proceedings of the IPTPS, March 2002.
[25] C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In VLDB, pages 144–155, Sept. 2000.
[26] K. Petersen, M. Spreitzer, D. Terry, M. Theimer, and A. Demers. Flexible Update Propagation for Weakly Consistent Replication. In Proc. SOSP, Oct. 1997.
[27] PlanetLab. http://www.planet-lab.org.
[28] C. G. Plaxton, R. Rajaraman, and A. W. Richa. Accessing Nearby Copies of Replicated Objects in a Distributed Environment. In ACM SPAA, 1997.
[29] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content Addressable Network. In Proceedings of ACM SIGCOMM, 2001.
[30] S. Ratnasamy, S. Shenker, and I. Stoica. Routing Algorithms for DHTs: Some Open Questions. In IPTPS, March 2002.
[31] T. Roscoe, R. Mortier, P. Jardetzky, and S. Hand. InfoSpect: Using a Logic Language for System Health Monitoring in Distributed Systems. In Proceedings of the SIGOPS European Workshop, 2002.
[32] A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-scale Peer-to-peer Systems. In Middleware, 2001.
[33] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-level Multicast using Content-addressable Networks. In Proceedings of the NGC, November 2001.
[34] W. Stallings. SNMP, SNMPv2, and CMIP. Addison-Wesley, 1993.
[35] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable Peer-To-Peer lookup service for internet applications. In ACM SIGCOMM, 2001.
[36] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz. Bayeux: An Architecture for Scalable and Fault-tolerant Wide-Area Data Dissemination. In NOSSDAV, 2001.
[37] IBM Tivoli Monitoring. www.ibm.com/software/tivoli/products/monitor.
[38] R. VanRenesse, K. P. Birman, and W. Vogels. Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring, Management, and Data Mining. TOCS, 2003.
[39] R. VanRenesse and A. Bozdog. Willow: DHT, Aggregation, and Publish/Subscribe in One Protocol. In IPTPS, 2004.
[40] A. Venkataramani, P. Weidmann, and M. Dahlin. Bandwidth constrained placement in a WAN. In PODC, Aug. 2001.
[41] A. Venkataramani, P. Yalagandula, R. Kokku, S. Sharif, and M. Dahlin. Potential costs and benefits of long-term prefetching for content distribution. Elsevier Computer Communications, 25(4):367–375, Mar. 2002.
[42] M. Wawrzoniak, L. Peterson, and T. Roscoe. Sophia: An Information Plane for Networked Systems. In HotNets-II, 2003.
[43] R. Wolski, N. Spring, and J. Hayes. The network weather service: A distributed resource performance forecasting service for metacomputing. Journal of Future Generation Computing Systems, 15(5-6):757–768, Oct. 1999.
[44] P. Yalagandula. SDIMS: A Scalable Distributed Information Management System, Feb. 2004. Ph.D. Proposal.
[45] Z. Zhang, S.-M. Shi, and J. Zhu. SOMO: Self-Organized Metadata Overlay for Resource Management in P2P DHT. In IPTPS, 2003.
[46] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. Technical Report UCB/CSD-01-1141, UC Berkeley, Apr. 2001.