UNIVERSITY OF CINCINNATI
Modeling Large-scale Peer-to-Peer Networks and a Case Study of Gnutella
A thesis submitted to the
Division of Graduate Studies and Research of
the University of Cincinnati
in partial fulfillment of the
requirements for the degree of
MASTER OF SCIENCE
in the Department of
Electrical and Computer Engineering and Computer Science of the College of Engineering
June, 2000
by
Mihajlo A. Jovanovic B.S., Department of Mathematics and Computer Science, Otterbein College, Westerville, Ohio, 1997.
Thesis Advisor and Committee Chair: Dr. Fred S. Annexstein and Dr. Kenneth A. Berman
Abstract
The ongoing digital revolution has brought on the emergence of novel network applications such as Gnutella, Freenet, and Napster, intended to facilitate worldwide sharing of information. These applications have embraced the familiar peer-to-peer (P2P) architecture model of the original Internet in new and innovative ways, forever changing the world of personal computing. However, if P2P is to truly replace the well-established client-server model as the computing paradigm of the future, more efficient decentralized algorithms must first be designed. This requires better understanding of the P2P network model on which those algorithms would be operating. Such a model includes both network topology and traffic.
In this thesis, we study both of these factors using as our case study Gnutella - a fully-decentralized file sharing network application. In order to study the Gnutella network topology, we have developed a network crawler that allows topology discovery to be performed in parallel. Upon analyzing the obtained topology data, we discovered it exhibits strong "small-world" properties. More specifically, we observed the properties of small diameter and clustering in the Gnutella network topology. In addition, we report evidence of four different power laws previously observed in other technological networks, such as the Internet and the WWW.
In the second part of our thesis, we utilize our topology model in order to study network traffic. Specifically, we show that heterogeneous latencies present in many large-scale P2P network applications, when combined with the standard protocol mechanisms of time-to-live (TTL) and unique message identification (UID) used to govern flooding message transmissions, can potentially have a devastating effect on the reachability of message broadcast. We call this combined effect "short-circuiting," and we investigate consequences of this phenomenon. We show through experimentation that, in the worst case, short-circuiting can near-completely eliminate the reach of broadcast messages. We report measurements obtained through both network simulation studies and experimental studies performed on Gnutella. Our results indicate that, on average, the real effects of short-circuiting are significant, but not devastating to the performance of an overall large-scale system.
We believe our discoveries of both network topology properties and short-circuiting are an important step toward a uniform model of P2P network applications, and could serve as a valuable tool in analyzing the performance of existing algorithms, as well as designing new, more scalable solutions.
Acknowledgments
First, I would like to thank my advisers, Dr. Fred Annexstein and Dr. Kenneth
Berman, for hours of intellectually stimulating discussions, suggestions and ideas.
For the duration of this thesis, they have been not just my advisers but also my
mentors, providing constant encouragement as well as financial support in the form
of a Research Assistantship.
I would also like to thank Dr. Yizong Cheng for taking the time out of his busy
schedule to be on my thesis committee, and Dr. John Schlipf for attending my
thesis defense. Special thanks goes to Dr. John Franco for providing motivation and
guidance, particularly during my first year at UC, and also Linda Gruber for her
always kind and helpful attitude.
I extend my sincere gratitude to the Department of Electrical and Computer En-
gineering and Computer Science for its generous support without which this work
would not be possible. The department has provided me with a Graduate Assis-
tantship during my first year and a University Graduate Scholarship for three full
academic years.
Finally, I dedicate this work to my parents, Aleksandar and Mirjana; without their love
and support, even from half a world away, I could not have done it.
Figure 2.5: Log-log plot of eigenvalues versus rank (power-law 4) for four snapshots of the Gnutella topology
Chapter 3
Modeling Network Latencies
In this chapter we further refine our model of P2P networks to include traffic. In par-
ticular, we study the effects of heterogeneous latencies on reachability in P2P network
applications operating under flooding protocols. We call this potentially devastating
effect “short-circuiting.” Traditionally, latency has been studied to model network
performance as it relates to throughput. Network reachability has traditionally been
studied through the analysis of distance in graphs. In this work, we point out the
novel fact that heterogeneous latencies can significantly impact reachability, inde-
pendent of distance.
We begin with a brief introduction of short-circuiting. We then present our formal
model for studying the effects of short-circuiting. Finally, we report our results from
both network simulation studies and empirical tests performed on Gnutella. We
conclude based on these results that, on average, the real effects of short-circuiting
are significant, but not devastating to the performance of an overall system.
3.1 Latency Effects
We have seen in chapter 1 that P2P applications are inherently decentralized, and
therefore rely on efficient decentralized algorithms for communication between hosts.
As a result, many of these applications, including Gnutella, have adopted a flood-
ing mechanism to forward messages in an effort to maximize reachability. Notice
that reachability, or the number of hosts receiving a particular message, is an im-
portant performance metric for many P2P applications, particularly those used for
file-sharing.
Flooding dictates that each host is to simply forward each received message to
all of its neighbors, except the one from which the message was received. As such,
flooding provides a simple and effective way of broadcasting messages in a dynam-
ically changing network without requiring the use of routing tables or knowledge
of the global network topology. However, it clearly does not scale for Internet-wide
applications, as it generates a large number of redundant messages and uses all avail-
able paths across the network. For this reason, in practice, flooding is typically
implemented in combination with one or more of the following standard governing
mechanisms designed to restrict its scope and limit redundant messages:
Mechanism 1. Time-to-Live Bounds Time-to-Live (TTL) is a governing mech-
anism that prevents messages from traveling farther than a specified number
of hops, defined by an initial TTL value. TTL bounds are implemented by
including in each message header a TTL value field. When a node receives a
message, it first checks to see if its TTL value is greater than zero. If so, the
node continues the flood with a decremented TTL. Otherwise the message is
dropped.
Mechanism 2. Unique Message Identification Unique Message Identification is
a mechanism that prevents unique messages from being transmitted more than
once from each node. This mechanism is implemented by including in each
message header a UID (a unique ID label, or unique sequence number). When
a node receives a message it checks to see if it has previously seen that message.
If it has, the message is dropped and not forwarded. Otherwise, the node stores
the new UID in a local table, and then continues the flood.
Mechanism 3. Path Identification Path Identification is a mechanism that pre-
vents message paths from looping. This mechanism is implemented by including
in each message a header that records which nodes of the network have already
encountered the message. Before forwarding messages, each node simply checks
the header to verify whether or not it has previously seen the message. If so,
the message is dropped and not forwarded. If not, the node adds its name to
the header, and then continues the flood.
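The three mechanisms differ in where they keep their state. As a minimal illustration, the following Java sketch (class and method names are ours, purely hypothetical) implements Mechanism 3: the message carries its own path header, and a node forwards only if its name is absent from that header.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Mechanism 3 (Path Identification): each message carries the
// list of nodes it has visited; a node forwards only if its own name is
// absent, then appends itself. Unlike Mechanism 2, no per-node UID table
// is needed, at the cost of a growing message header.
class PathId {
    // Returns the updated path header if `node` should forward the
    // message, or null if the message has looped and must be dropped.
    static List<String> onReceive(List<String> pathHeader, String node) {
        if (pathHeader.contains(node)) return null;   // loop detected: drop
        List<String> updated = new ArrayList<>(pathHeader);
        updated.add(node);                            // record this hop
        return updated;
    }
}
```

Note that the header grows by one entry per hop, which is why Gnutella itself relies on the fixed-size UID of Mechanism 2 instead.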
Ordinarily, a broadcast operation functioning under these mechanisms should
reach all nodes within the TTL bound of the broadcast source. However, we have
discovered that network latencies can negatively impact the reachability of broadcast
operations. We define latency as the time it takes a message to traverse a link in the
network. We will show that, when Mechanisms 1 and 2 are implemented together,
heterogeneous network latencies can potentially have a devastating effect on reach-
ability. We call this phenomenon the "short-circuiting effect," and describe it as
follows:
Short-circuiting Effect. Consider a message broadcast from a source node a, and
consider a path P = {u1, u2, . . . , up}, joining nodes a = u1 and b = up. It is
possible that there may be no throughput of the broadcast messages from a to b
along P , even if the hop-length p of the path P is less than or equal to the TTL
value t. This can result from heterogeneous latencies, as the following scenario
shows. Suppose there exists a message path Q from a to some intermediate
node x = ui of P , having a strictly smaller latency (but, with possibly a greater
hop number). Then a broadcast message originating from a, and following path
P will be killed (by Mechanism 2) when it reaches x, since it is the duplicate
of an earlier arriving message originating from a, but following path Q. Notice
that there may also be no throughput along path R consisting of the path Q
together with the subpath of P from x to b. This effect results from the fact
that R may possibly have a hop-length strictly greater than t, and hence, by
Mechanism 1 there is no throughput of the broadcast message originating at
a along path R. And, indeed, there may be no throughput of the broadcast
message along any path from a to b; it is this latency effect on reachability
which we call short-circuiting.
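The scenario above can be reproduced in a small discrete-event simulation. The sketch below is illustrative Java of our own (not the gnutsim simulator of appendix B): packet arrivals are processed in time order, each node honors only the first copy of a message (Mechanism 2), and a copy stops propagating once its hop count reaches the TTL (Mechanism 1). In a four-node example with a fast two-hop path a-x-m and slow direct links a-m and m-b, node b lies only two hops from a, yet with TTL = 2 the copy reaching m via the fast path has exhausted its TTL, and the later direct copy is dropped as a duplicate, so b is short-circuited out of the 2-horizon.

```java
import java.util.*;

// Event-driven sketch of flooding under Mechanisms 1 and 2 with edge
// latencies. Nodes are 0..n-1; lat.get(v) maps each neighbor of v to the
// latency of the connecting link. Returns the set of nodes (other than
// the source) that receive the broadcast: the t-horizon R(s, t).
class ShortCircuit {
    static void link(List<Map<Integer, Double>> lat, int u, int v, double w) {
        lat.get(u).put(v, w);   // undirected link with latency w
        lat.get(v).put(u, w);
    }

    static Set<Integer> horizon(List<Map<Integer, Double>> lat, int s, int ttl) {
        // Each event is {arrivalTime, node, hopsUsed}, processed in time order.
        PriorityQueue<double[]> pq =
            new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
        Set<Integer> seen = new HashSet<>();   // UID tables: who has seen the packet
        seen.add(s);
        for (Map.Entry<Integer, Double> e : lat.get(s).entrySet())
            pq.add(new double[]{e.getValue(), e.getKey(), 1});
        while (!pq.isEmpty()) {
            double[] ev = pq.poll();
            double time = ev[0];
            int v = (int) ev[1], h = (int) ev[2];
            if (!seen.add(v)) continue;        // Mechanism 2: duplicate dropped
            if (h >= ttl) continue;            // Mechanism 1: TTL spent, no forwarding
            for (Map.Entry<Integer, Double> e : lat.get(v).entrySet())
                pq.add(new double[]{time + e.getValue(), e.getKey(), h + 1});
        }
        Set<Integer> reached = new HashSet<>(seen);
        reached.remove(s);                     // the source does not "receive" its own packet
        return reached;
    }
}
```

With a = 0, x = 1, m = 2, b = 3 and latencies {a-x: 1, x-m: 1, a-m: 10, m-b: 10}, `horizon(lat, 0, 2)` returns {1, 2}; replacing all latencies by 1 yields {1, 2, 3}, so the heterogeneous latencies alone remove b from the horizon.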
For the remainder of this chapter, we will consider broadcasts as operating under
the combination of Mechanisms 1 and 2. Note that short-circuiting-like effects cannot
be caused by the combination of Mechanisms 1 and 3, since, in that case, all
loop-free paths within the TTL bound are valid message paths.
3.2 Modeling the Short-Circuiting Effect
In order to analyze the problem of short-circuiting, we refine our network model from chapter
1 to include edge weights representing latency values on communication links. We
consider the latency of a message path to be the sum of the latencies of its edges.
The flooding operation governed by mechanisms 1 and 2 in a network G is defined
by the following protocol regimen. We denote packets in the network by p(u, t, h),
with unique message identifier UID = u, initial TTL value TTL = t, and current
hop-value HOP = h. The hop-value denotes the number of hops from the packet’s
source node. We will denote a packet (ready for broadcast) originating at node s,
with initial TTL = t, by p(us, t, 0). The broadcast regimen operates as follows, and
defines the valid message paths associated with the transmission of the broadcast
packet.
1. Source s sends p(us, t, 0) to all the neighbors of s, injecting the packet on all
links connected to s at the same time.
2. Nodes process packets on first-come-first-served basis as follows: when a node v
receives packet p(us, t, h) it checks whether the UID us has been seen previously.
If it has, then the packet is dropped with no further processing.
3. If not, then v records us in its local table, and checks whether t = h. If t > h,
then v replicates and forwards the message p(us, t, h+1) (with incremented hop
count) to all neighbors except u, the node from which it received the packet. If
t = h then the packet is dropped and not forwarded.
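Under uniform (unit) latencies, the first-come-first-served processing of this regimen reduces to a breadth-first search, so the set of reached nodes is exactly the set within t hops of the source. A minimal Java sketch (the names are ours, not taken from the thesis code) makes this concrete:

```java
import java.util.*;

// Sketch of the flooding regimen under Mechanisms 1 and 2 with uniform
// (unit) latencies: with all links equally fast, first-come-first-served
// processing is a breadth-first search. `adj` is an adjacency list;
// returns the nodes reached by a broadcast from `s` with initial TTL `t`
// (the t-horizon, which here equals the t-neighborhood N(s, t)).
class UniformFlood {
    static Set<Integer> horizon(List<List<Integer>> adj, int s, int t) {
        Set<Integer> seen = new HashSet<>();     // UID tables: nodes that saw the packet
        Deque<int[]> queue = new ArrayDeque<>(); // entries are {node, hopsUsed}
        seen.add(s);
        queue.add(new int[]{s, 0});
        while (!queue.isEmpty()) {
            int[] cur = queue.poll();
            int v = cur[0], h = cur[1];
            if (h == t) continue;                // Mechanism 1: TTL exhausted
            for (int w : adj.get(v)) {
                if (seen.add(w)) {               // Mechanism 2: forward first copy only
                    queue.add(new int[]{w, h + 1});
                }
            }
        }
        seen.remove(s);                          // the source does not receive its own packet
        return seen;
    }
}
```

On a five-node path 0-1-2-3-4, a broadcast from node 0 with t = 2 reaches exactly {1, 2}, as the definitions above predict.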
When latencies are introduced into this model of a flooding broadcast, complica-
tions arise as to the reachability of nodes. To determine reachability it is not sufficient
to consider only minimum-cost paths from s to v.
In order to quantify reachability, we introduce the notion of a horizon, defined as
following:
Definition 5 The t-horizon R(s, t) from a source node s is the set of all nodes v
which receive a packet p(us, t, −) broadcast from s with TTL = t. The t-neighborhood
N(s, t) from a source node s is the set of all nodes within a hop-distance of
t from s. Likewise, for a set of source nodes S, we denote by R(S, t) and N(S, t)
the t-horizon and t-neighborhood, respectively, from S, where we assume that the
broadcast is initiated by each s ∈ S simultaneously.
In the subsequent sections, we present our experimental results on the size of
t-horizon as a function of latencies under the described broadcast model.
3.3 Empirical experiments
We have conducted a series of experiments to empirically test the effects of short-
circuiting. These experiments are divided into two categories: simulations performed
on various static network topologies and empirical tests performed on a real P2P
network application. For the latter, we use Gnutella as our case study.
3.3.1 Gnutella Studies
We have already mentioned Gnutella as a rapidly evolving technology based on the
peer-to-peer network model. In this section we continue our case study of Gnutella
with the analysis of short-circuiting effects on reachability. In order to see why
Gnutella presents a meaningful testbed for studying the problem of short-circuiting,
let us briefly describe its design. Gnutella’s application-level protocol supports two
basic types of broadcast requests: ping, which is essentially a request for a host to
announce itself, and a query. These messages are propagated through the network by
means of a flooding broadcast. The response messages are then routed back along the
same path on which the original request arrived, by means of dynamically updated routing
tables maintained by each host. The flooding in Gnutella is implemented using mech-
anisms 1 and 2 described in previous sections, with the Gnutella software generally
limiting TTL values to at most 7. Its routing protocol, together with heterogeneous
latencies, make Gnutella potentially vulnerable to the short-circuiting effects we have
described.
Our original interest in the effects of short-circuiting arose from an experiment
that involved crawling and mapping the entire Gnutella network. In particular, we
noted that the number of reachable hosts reported by a client was substantially less
than the number given by an off-line analysis of the generated topology map. This analysis consisted of
calculating the number of elements in the BFS tree rooted at a node representing that
particular client. We consistently noted discrepancies of this nature of approximately
one half. After conjecturing that short-circuiting may play a substantial role in such
discrepancies, we set out to verify this empirically.
Figure 3.1: The results of level-1 short-circuiting effects on the broadcast horizon on the Gnutella network, October 2000. The y-axis represents the broadcast horizon size, and the x-axis labels each of 68 broadcast trials. The top line is the resulting horizon from multiple distinct broadcasts from the same source, and the lower line is the resulting horizon from a single broadcast message from a single source. The discrepancy represents "level-1 short-circuiting" effects.
To test our hypothesis, we have devised an experimental method of discovering
what we call the “level-1 short-circuiting” effect. These are the effects of short-
circuiting caused by the paths interfering at the first level, that is, in our experiments
we compare the 7-horizon of a message broadcast from v with the 6-horizon of distinct
message broadcasts from the neighbors of v. The idea is that sending messages with
distinct ID labels will prevent them from interfering with each other, and thereby
allows us to measure a subset of the total short-circuiting effect. The actual number
of hosts reached by the broadcast of the shared message is compared to a union of
host sets reached by the set of distinct broadcast messages. More refined estimates
of short-circuiting effects can be obtained by comparing the hop counts of messages
responding to a shared broadcast to the hop counts of messages responding to distinct
broadcasts: if the former is larger than the minimum of the latter, then we posit that
short-circuiting has occurred. Figure 3.1 shows the results of a particular experiment
of this nature conducted in October of 2000. We note that the observed reductions
average 55%.
Figure 3.2: Horizon-size versus t
In another set of experiments we focused on the t-horizon as a function of the TTL
value. We performed the experiment by connecting to a set of servers and sending
successive ping messages with increasing TTL. Figure 3.2 shows the results of one such
experiment using two and three broadcast servers. As predicted by short-circuiting,
we observed a decrease in t-horizon after TTL has exceeded a certain threshold, typ-
ically around 5. We have been able to explain this phenomenon analytically in [9].
This particular experiment required connections to selected servers to persist over a
longer period of time, so that a number of test trials could be performed.
Difficulties in conducting experiments on Gnutella. Overall, we have found it
quite challenging to isolate the effects of short-circuiting, as well as other phenomena,
on the Gnutella application. The challenge has been mainly due to system instability,
both in terms of topology and latencies. One of our preliminary experiments focused
on measuring variance in the size of the broadcast horizon over time. We have found
that several identical tests of horizon size, which were performed consecutively, can
differ drastically in their results. Figure 3.3 shows the size of the broadcast horizon
over time using four broadcast servers. Each data point represents the horizon size
for a particular broadcast trial, with trials performed consecutively in six minute
intervals.
Figure 3.3: Horizon-size variation over time with broadcasting client using multiple connections on the Gnutella network, March 2001. The y-axis represents the horizon size, and the x-axis labels each of 180 broadcast trials, performed consecutively in six minute intervals.
We attribute this phenomenon to the highly dynamic nature of the network and
constantly changing network conditions and topology. (We remark that in our net-
work simulations, we have also observed that slight changes in latency distribution
can result in dramatic changes in the size of the t-horizon.) Such high variance, as
well as the existence of a number of factors influencing the actual number of hosts
reached, makes it challenging to obtain meaningful results.
By far the biggest challenge to isolating the effects of short-circuiting on Gnutella
is the emergence of a new generation of "intelligent" Gnutella clients. These
clients contain built-in application logic designed to promote overall network health
by conserving bandwidth. While such clients have succeeded in allowing the Gnutella
network to scale up to about five times its original size, they have also created a
serious obstacle to conducting sophisticated experimental studies on the network.
In order to see this, consider a simple procedure for calculating the size of the t-
horizon in Gnutella, performed by sending a ping message and counting the number
of responses. Figure 3.4 shows the results of an experiment in which eight of these
procedures were performed simultaneously.
Figure 3.4: Difficulty in conducting experiments on today’s Gnutella network
As the figure shows, typically only one of these procedures will result in a considerable
number of responses. The reason for this is that Gnutella clients are now "intelligent"
enough to realize when messages are the same, and will forward only one of them.
In addition, many clients will now cache the responses to ping and query messages
for a certain amount of time. While such design decisions are understandable from
the performance standpoint, they also effectively take away the ability to accurately
determine the exact size of the broadcast horizon in Gnutella at any given time. As
a result we have found it extremely difficult to repeat experiments such as those
reported in figures 3.1 and 3.2 on the current system. Because of the difficulties with
measuring short-circuiting effects directly on the application, we turned our attention
to a series of network simulation studies in which we were able to precisely isolate the
effects of short-circuiting on theoretical network topologies.
3.3.2 Network Simulation Studies
In order to study the practical impact of short-circuited t-horizon reductions, we
needed to carefully consider both the topology of the network and the assignment of
latencies. Simulated studies allowed us to isolate the effects of short-circuiting on fixed
topologies. We conducted the simulations using our network simulator gnutsim, based
on a modified version of Dijkstra’s shortest path algorithm. The Java source code for
gnutsim is given in appendix B. To carry out these simulations, we needed to choose
the network topological model, as well as the network latency model. We report in
this chapter on a number of well-known regular topologies, such as the mesh and the
hypercube, as well as the Watts-Strogatz “small world” topology and snapshots of the
Gnutella topology obtained through crawling. To model network latencies we used
several classes of weights representing various commonly used Internet connection
bandwidths. We conducted our experiments by using random distributions of these
weights.
We present the statistics of our simulation studies as tables, which report the
reduction ratios in reachability caused by short-circuiting, given by randomly chosen
latencies on a fixed topology. Each table is associated with a fixed topology. Each
Figure 3.5: Short-circuiting effects for the Watts-Strogatz topology (nodes = 10000, k = 3, p = 0.2). (a) Reduction ratios for the Watts-Strogatz topology. (b) Histogram of 1000 trials with random distribution of latencies (t = 10).
row of the table represents results from 100 trials using random latencies. In each
row we report for a fixed t, the worst, average, and best observed t-horizon, and
t-neighborhood (which is equal to t-horizon when using uniform latencies). We then
give the reduction ratios by dividing the worst over t-neighborhood, and the average
over t-neighborhood.
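The derivation of one such table row can be sketched as follows (an illustrative Java fragment of ours, not taken from the thesis code): given the t-horizon sizes observed across the random-latency trials and the t-neighborhood size |N(s, t)| for the same t, we report the worst, average, and best horizons along with the two reduction ratios.

```java
// Sketch of one table row: worst/average/best observed t-horizon over a
// set of trials, followed by the two reduction ratios (worst divided by
// the t-neighborhood size, and average divided by it).
class ReductionRow {
    static double[] row(int[] horizonSizes, int neighborhoodSize) {
        int worst = Integer.MAX_VALUE, best = 0;
        long sum = 0;
        for (int h : horizonSizes) {
            worst = Math.min(worst, h);
            best = Math.max(best, h);
            sum += h;
        }
        double avg = (double) sum / horizonSizes.length;
        return new double[]{
            worst, avg, best,
            (double) worst / neighborhoodSize,  // worst-case reduction ratio
            avg / neighborhoodSize              // average reduction ratio
        };
    }
}
```

For horizon sizes {10, 20, 30} against a neighborhood of 40 nodes, the row reads worst 10, average 20, best 30, with reduction ratios 0.25 and 0.5.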
Figure 3.5 represents the results for the Watts-Strogatz small-world topology. The
histogram on the right represents distribution of t-horizon values over 100 trials using
random latencies for t = 10, which is the value of t for which the reduction ratios are
the most severe. The results for other topologies are presented in appendix C.
Observations and Conclusions. Our empirical results indicate that, in practice,
the effects of short-circuiting are not as devastating as suggested by the theoretical re-
sults in [9]. We have observed the most significant impact on “small-world” topologies
such as our Gnutella snapshots and Watts-Strogatz network models. For these graphs,
we have observed reduction ratios in t-horizon size of over 90% in the worst case,
for certain values of t. In other words, we have observed that with random latencies
one can expect instances where the ratio of sizes of the t-neighborhood divided by
t-horizon is greater than 10 to 1, as shown in figure 3.5. Furthermore, the histogram
in the same figure shows that the reduction in reachability caused by short-circuiting
was always greater than 50% using random latencies.
In our experimental studies we have also observed that both random graphs and
highly structured graphs such as the mesh and hypercube tend to have, on aver-
age, less pronounced short-circuiting effects, as compared with “small-world” graphs.
Intuitively, this can be best understood if one considers how the clustering property
defined in chapter 2 can amplify short-circuiting.
In general, for a fixed TTL = t, the distribution of t-horizon sizes tends to be
normally distributed with small variance, independent of network topology. We have
also observed that, independent of topology, mean reduction ratios are dependent on
the TTL= t. Our results suggest that the reduction ratio increases as t increases,
until certain thresholds are reached, usually at about the point where t equals half the
network radius or diameter, after which the reduction decreases.
Chapter 4
Gnutella Crawler Implementation
In this chapter we discuss issues related to design and implementation of our Gnutella
network crawler. We begin by providing a brief introduction to Gnutella and its
protocol, necessary for understanding the remainder of this chapter. We then present
both the sequential and parallel algorithms for discovering the topology of the Gnutella
network, followed by a discussion of our distributed implementation using Java
RMI.
4.1 Introduction to Gnutella
Gnutella can be best explained as a fully distributed, information sharing technology.
It originated as a project at Nullsoft, a subsidiary of America Online, but was aban-
doned out of fear of its potential use for copyright infringement. After being quickly
reverse-engineered by several programmers and open-source enthusiasts, Gnutella’s
popularity really took off. Gnutella enables distributed file sharing by letting each
user specify directories on their local machine that they want to share. In this sense,
Gnutella can be viewed as a distributed file storage system with search capabilities.
Unlike its predecessor Napster, which relies on a centralized search database,
Gnutella promotes decentralization of all network functions. As we have already seen,
Gnutella is based on a peer-to-peer model. This means that users connect to each
other directly through a piece of client-server software, forming a high-level network.
Throughout this thesis, we have and will continue to refer to this high-level network
as the Gnutella network, or GnutellaNet. Because Gnutella software functions as
both a server and a client, it is sometimes referred to as a "servant." In this thesis
we may use the terms client, servant, and host interchangeably to refer to Gnutella
software running on a particular machine.
4.1.1 Gnutella Protocol
Each Gnutella client implements the application level Gnutella protocol, which spec-
ifies how messages are routed between GnutellaNet hosts. We have already described
Gnutella’s protocol design at a high-level in chapter 3. We will now complete our
description with a few implementation details.
The Gnutella protocol supports four basic types of messages, summarized in table 4.1.
The routing technique employed by the Gnutella protocol is a form of controlled
flooding, where messages are passed recursively between hosts. Flooding operates
by each Gnutella host forwarding the received ping and search messages to all of its
neighbors, except to the one that sent the message. To limit exponential spread of
messages through the network, each message header contains a time-to-live (TTL)
field. TTL is used in the same fashion as in the IP protocol: at each hop its value
is decremented until it reaches zero, at which point the message is dropped. This
is equivalent to mechanism 1 described in chapter 3. The maximum TTL value
specified by the Gnutella protocol is seven. Recall that this restriction effectively
segments the Gnutella network into subnets, imposing on each user a virtual "horizon"
beyond which their messages cannot reach. In practice, this situation is acceptable
Type         Description                Contains
Ping         Request for a host to      No body
             announce itself
Pong         Reply to Ping message      IP and port of responding host,
                                        number and size of files shared
Query        Search request             Minimum speed requirement for
                                        responding host, search string
Query Hits   Reply to Query message     IP, port, and speed of responding
                                        host, number of matching files and
                                        their indexed result set

Table 4.1: Gnutella protocol message description
as information may still get around. Each Gnutella message is also flagged with a
unique ID. Message ID is used by peers to detect and subsequently drop duplicate
messages, indicating a loop in GnutellaNet topology (mechanism 2). In addition, it is
also used to route the response messages along the same path on which the original request
arrived. This is implemented by each host maintaining a dynamic routing table of
message IDs and connection labels indicating a particular connection along which
that specific message arrived. When a response message arrives at a host, it should
contain the same message ID as the original request. The host then checks its routing
table to determine along which link the response message should be forwarded. This
technique greatly improves efficiency while also preserving network bandwidth.
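The reverse-path routing just described can be sketched with a small table keyed by message ID. The Java fragment below is illustrative only (the class and method names are ours, not Gnutella's):

```java
import java.util.*;

// Illustrative sketch of reverse-path routing: on a request, the host
// records (message UID -> connection it arrived on); when a response
// carrying the same UID arrives, the table names the link leading back
// toward the original requester.
class ReversePathRouter {
    private final Map<String, Integer> routeTable = new HashMap<>(); // UID -> connection id

    // Called when a broadcast request with identifier `uid` arrives on
    // connection `conn`; returns false for duplicates (Mechanism 2),
    // in which case the request is dropped rather than forwarded.
    boolean onRequest(String uid, int conn) {
        return routeTable.putIfAbsent(uid, conn) == null;
    }

    // Called when a response to `uid` arrives; returns the connection the
    // response should be forwarded on, or -1 if the original request was
    // never seen here (the response is then dropped).
    int routeResponse(String uid) {
        return routeTable.getOrDefault(uid, -1);
    }
}
```

A real servant would also expire old table entries; we omit that bookkeeping here.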
4.1.2 Discovering Gnutella Network Topology
Topology discovery in IP networks is a well-studied area of research [26]. Generally
the approach is based on some protocol-specific feature, as in the case of traceroute.
Although the Gnutella protocol is much simpler than IP and provides no feedback regard-
ing message delivery, it nevertheless provides the necessary functionality for mapping
GnutellaNet topology. Notice that, according to the Gnutella protocol, it is possible
to discover neighbors of a particular host by connecting to that host and sending a
ping message with TTL = 2. As a result, pong messages would be sent back from
the connected host and all of its immediate neighbors. A complete network topology
could therefore be discovered by connecting to all the hosts, discovering their neigh-
bors, and combining the information into a single graph. We refer to this process
as crawling. Notice that, by following the described procedure, each edge would be
discovered twice, thus introducing a level of redundancy. However, it is still necessary
to connect to all the hosts in order to guarantee that the obtained topology map is
complete.
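The crawling procedure can be summarized in code. In the hypothetical Java sketch below, the network is simulated by an adjacency map and `pingTtl2` stands in for the real operation of connecting to a host and collecting its pongs; the crawl is then a breadth-first traversal that merges the per-host neighbor sets into a single edge set.

```java
import java.util.*;

// Sketch of crawling: for each known host, a TTL = 2 ping yields that
// host's immediate neighbors; neighbor sets are merged into one edge set,
// and newly discovered hosts are queued until every reachable host has
// been probed. Each edge is reported from both endpoints, so edges are
// canonicalized as "u--v" with u < v to absorb the redundancy.
class Crawler {
    // Simulated network: true adjacency, visible only through pingTtl2().
    private final Map<String, Set<String>> network;
    Crawler(Map<String, Set<String>> network) { this.network = network; }

    // Stand-in for connecting to `host`, sending a TTL = 2 ping, and
    // collecting the pongs announcing its immediate neighbors.
    Set<String> pingTtl2(String host) {
        return network.getOrDefault(host, Collections.emptySet());
    }

    // Breadth-first crawl from a seed host; returns the discovered edges.
    Set<String> crawl(String seed) {
        Set<String> visited = new HashSet<>();
        Set<String> edges = new TreeSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        visited.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String u = frontier.poll();
            for (String v : pingTtl2(u)) {
                edges.add(u.compareTo(v) < 0 ? u + "--" + v : v + "--" + u);
                if (visited.add(v)) frontier.add(v);
            }
        }
        return edges;
    }
}
```

The parallel crawler described later distributes the `pingTtl2` probes across workers; the merging of edge sets is unchanged.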
Compared with IP networks, GnutellaNet is highly dynamic. This means that its
topology is constantly changing - nodes and edges are added and removed as hosts
join and leave the network, establish new connections, and close the existing ones.
Therefore any topology discovery algorithm operating on the Gnutella network is
really capturing an instance, or a snapshot of the topology at a specific point in time.
Clearly, this poses an additional requirement for any topology discovery algorithm
to be efficient, since the accuracy of the topology map is inversely proportional to
the actual running time of an algorithm that was used to obtain it. In designing our
crawler, we have paid close attention to this requirement.
4.2 Design
In this section we discuss some issues related to the design of our Gnutella network
crawler. We present an informal performance analysis for both our sequential and parallel
algorithms for discovering Gnutella network topology.
4.2.1 Algorithm
Based on the procedure described in the previous section for discovering GnutellaNet
topology, an intuitive design solution might be to use breadth-first search (BFS) to crawl the network,
applying the algorithm for discovering direct neighbors to each encountered host.
However, there are some practical issues that make this approach inefficient. In order
to see this, let us first examine the basic operation of discovering neighbors of a single
Gnutella host. This operation requires establishing a connection, sending a ping
message, and waiting for all pong messages to be received - overall a time-consuming
process with a running time on the order of several minutes. However, it is clear that such
an operation represents a lower bound for any topology discovery algorithm operating
on Gnutella and based on the procedure described in the previous section. We will
therefore use this basic operation as a unit in our performance analysis of algorithms
for discovering GnutellaNet topology.
The complexity of the BFS algorithm for discovering the topology of the Gnutella
network with N hosts is clearly O(N) in this unit. Also, for the moment, let us assume that
our crawling workstation is capable of maintaining up to b simultaneous network
connections. Then if b ≥ N and we had a list of addresses for all the Gnutella hosts,
we could simply connect to all of them simultaneously and obtain the entire network
topology map in constant time. Fortunately, such a list is available, as every Gnutella
client maintains a dynamically updated list of live hosts. Using this list as input, we
can now formulate our new algorithm for discovering GnutellaNet topology as follows:
Procedure buildTopoMap (G, l)
Input: An empty graph G, and a complete host list l
Output: A graph G representing the Gnutella network topology
for each element h of l
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge h - h2 to G
                if (h2 is not in l)
                    add h2 to the end of l
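The procedure above can be sketched in Java, with the actual Gnutella I/O (connect, ping with TTL = 2, collect pongs) abstracted behind a hypothetical NeighborSource interface so that only the control flow is shown; this is a sketch, not the thesis implementation itself:

```java
import java.util.*;

public class Crawler {
    public interface NeighborSource {
        // Returns the neighbors of h, or null if the connection fails.
        List<String> neighborsOf(String h);
    }

    // Sequential sketch of buildTopoMap: crawl every host on the work
    // list, appending hosts discovered at run-time to the end of the list.
    public static Map<String, Set<String>> buildTopoMap(List<String> hosts,
                                                        NeighborSource net) {
        Map<String, Set<String>> g = new HashMap<>();
        List<String> l = new ArrayList<>(hosts);   // work list; may grow
        Set<String> known = new HashSet<>(hosts);
        for (int i = 0; i < l.size(); i++) {
            String h = l.get(i);
            List<String> neighbors = net.neighborsOf(h);
            if (neighbors == null) continue;       // stale entry: host has left
            for (String h2 : neighbors) {
                if (h2.equals(h)) continue;
                g.computeIfAbsent(h, k -> new HashSet<>()).add(h2);
                g.computeIfAbsent(h2, k -> new HashSet<>()).add(h);
                if (known.add(h2)) l.add(h2);      // host discovered at run-time
            }
        }
        return g;
    }
}
```

A failed connection simply yields null and the host is skipped, matching the tolerance for stale list entries discussed in the text.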
Due to the highly dynamic nature of the network, the input list of hosts is not
guaranteed to be either complete or perfectly accurate. This means that new hosts
not contained in the list could have just joined the network and, furthermore, hosts
contained in the list may no longer be active. Nevertheless our algorithm will still
work, as new hosts will be discovered at run-time and added to the end of the list.
Similarly, hosts that are no longer active will simply be ignored. The ability of our
algorithm to work with incomplete input data is particularly important considering the
highly dynamic nature of the Gnutella network. However, the more complete the list
is, the closer the performance of our algorithm will be to optimal.
Notice that our algorithm in effect partitions the problem of discovering Gnutella
network topology into two steps, or phases: discovering nodes (host list) and discov-
ering edges (connections). Since the functionality for solving the first phase is already
provided through the existing Gnutella client software, our algorithm’s focus is on the
second phase of the problem.
4.2.2 Initial Implementation
We have implemented the algorithm presented in the previous section as a Java
application. We chose Java as our development platform primarily for its support
for networking and threads. Platform-independence was also an important benefit,
particularly for our distributed implementation described in the subsequent sections.
The main problem with our initial implementation is due to our original assump-
tion that the number of connections that could be maintained simultaneously is
greater than the total number of Gnutella hosts. In practice, this assumption does not
hold, as the number of live Gnutella hosts at any given time is typically on the order
of thousands. To cope with this situation, we were forced to organize threads into
groups of b, where b is the maximum number of simultaneous connections that our
system could handle. This strategy introduces additional complexity and, as already
discussed, sacrifices the integrity of a time-critical task such as topology discovery in
a highly dynamic network. However, since connections to different Gnutella hosts can
be made asynchronously, a natural solution would be to run the crawler in parallel.
The following section describes issues involved in discovering GnutellaNet topology
in parallel, as well as our implementation using Java RMI.
4.2.3 Parallel Algorithm
The simplest and perhaps the most natural way to make our topology discovery algo-
rithm run in parallel would be to partition the initial list of Gnutella host addresses.
Each processor would then be responsible for discovering neighbors of only a subset of
hosts. In addition, each processor would need to have some way of knowing whether
a newly discovered host address has already been “crawled” by another processor.
One way this could be done is by hashing the host address string and checking the
result (modulo the number of processors participating in the crawl) against the
processor's index. If there is a match, the processor would know that it should go ahead
and crawl the host. If not, it would then need to pass the information to the appro-
priate processor. In fact, this technique is commonly used for indexing the WWW
by many search engines, including Google, primarily because it results in good load
balancing. However, it also requires additional inter-processor communication in order
to pass the Gnutella host addresses discovered at run-time to the appropriate
processors. Instead, we have opted for a perhaps less elegant but more robust solution.
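The hashing scheme described above can be sketched as follows; the names are illustrative, and this is not the approach our crawler ultimately uses:

```java
// Sketch of the hash-based ownership test: a processor crawls a host
// only when the hash of the host's address string, taken modulo the
// number of processors, equals the processor's own index.
public class HashPartition {
    public static int ownerOf(String hostAddress, int numProcs) {
        // Math.floorMod guards against negative hashCode values.
        return Math.floorMod(hostAddress.hashCode(), numProcs);
    }

    public static boolean shouldCrawl(String hostAddress, int procId, int numProcs) {
        return ownerOf(hostAddress, numProcs) == procId;
    }
}
```

Every processor computes the same owner for a given address, so a host discovered at run-time is crawled by exactly one processor, at the cost of forwarding non-owned addresses.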
Our algorithm provides each processor with a complete input list of active hosts.
Each processor then executes an algorithm for calculating the subset for which it is
responsible, based on its unique processor number and the total number of processors
involved in the computation. For example, processor 0 of 10 would only attempt to
discover neighbors of the first 10% of hosts from the input list. The parallel version of
the topology discovery algorithm presented in the previous section is formulated
below. For clarity, we assume that the size of the initial list of hosts is a multiple
of the number of processors.
Procedure parallelBuildTopoMap (G, l)
Input: An empty graph G, and a complete host list l
Output: A graph G representing the Gnutella network topology
startIndex = (sizeof hosts / numberOfProcs) * procID
endIndex = startIndex + (sizeof hosts / numberOfProcs) - 1
l2 = hosts[startIndex..endIndex]
for each element h of l2
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge h - h2 to G
                if (h2 is not in l)
                    add h2 to the end of l2
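The index computation at the top of parallelBuildTopoMap can be sketched as follows, assuming (as above) that the list size is a multiple of the number of processors; the names are illustrative:

```java
import java.util.List;

// Sketch of the per-processor subset computation from parallelBuildTopoMap.
public class RangePartition {
    // procId is the zero-based processor index; numProcs divides hosts.size().
    public static List<String> subsetFor(List<String> hosts, int procId, int numProcs) {
        int chunk = hosts.size() / numProcs;
        int startIndex = chunk * procId;
        int endIndex = startIndex + chunk;   // exclusive, i.e. the pseudocode's endIndex + 1
        return hosts.subList(startIndex, endIndex);
    }
}
```

For example, processor 0 of 10 receives exactly the first 10% of the list, as described in the text.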
Despite its apparent simplicity, due to the highly asynchronous nature of the task, our
parallel algorithm in the best case achieves optimal speed-up. In addition, as long as
the total number of Gnutella hosts N ≤ pb, where p is the number of processors and b
is the maximum number of connections each processor can maintain simultaneously,
our algorithm will run in constant time. In practice, we were typically able to satisfy
this requirement with only a few processors, as the size of the largest connected public
segment of the Gnutella network at the time rarely exceeded two thousand users.
One potential problem with our algorithm is that its performance is dependent
on the “completeness” of the input list of host addresses. Recall from our previous
discussion that the input list is not guaranteed to be complete, as new hosts could
have joined the network. Because our algorithm only partitions the initial set of
hosts, each processor would discover new hosts independently. This would result in
redundant work being performed by all the processors. Notice that this would not
be a problem had we used the hashing solution mentioned above. However, it is easy
to show that, as long as the number of hosts discovered at run-time is within b, the
performance of our algorithm will be within a factor of two of optimal. This is true
because only a single additional step will be required by each processor.
Typically an important issue in designing parallel algorithms is load balancing. In
our case, this refers to the actual number of connections each processor is required to
make. Recall that the input list of potential hosts may also contain some hosts that
have recently left the network. Therefore, even though each processor will receive an
equal number of potential hosts to connect to, the number of actual live hosts in a
list is likely to be smaller and will vary between processors. However, our experiments
indicate this is not a significant problem. In order to see this, recall that, even though
the actual number of connections made by each processor could vary, they are still
handled simultaneously by each processor in a single logical step.
4.2.4 Limitations
The main limitation of our crawler is related to the notion of private networks. Since
a significant portion of Gnutella users reside behind firewalls that prevent anyone
on the outside from establishing a direct connection to them, our crawler will not be
able to accurately discover the topology between such hosts. Notice that these hosts may
still appear in the final topology graph, due to their connections with hosts outside
the firewall. In this sense, the topology obtained by our crawler can be viewed as a
subgraph of the actual Gnutella network topology.
In addition, even though the running time of our algorithm is optimal for any topology
discovery algorithm based on the Gnutella protocol, the actual execution time is still
bounded by the round-trip time (RTT) of messages in the Gnutella network and can take up
to several minutes. One could therefore question the integrity of our topology data,
based on the fact that the network structure may have significantly changed over
the course of several minutes. Despite these limitations we believe our crawler is a
valuable tool, able to accurately capture important structural properties of the actual
Gnutella network topology.
4.3 Distributed Computing Solution Using Java
RMI
We have implemented our parallel algorithm for GnutellaNet topology discovery for a
network of workstations (NOW), primarily because we felt it would give the greatest
amount of flexibility and portability to our code. In addition, we felt that the task at
hand would be perfectly suited for a distributed computing model, since it requires
very little inter-processor communication. In fact, in our design, communication only
occurs at the beginning of the process, to distribute input, and at the end, to gather
the output at a central location. The mechanism for this communication is provided
by Java RMI. Remote method invocation (RMI) is JavaSoft’s implementation of
remote procedure calls (RPC). It is distributed as a standard Java library, providing
necessary functionality for distributed object communication. In our implementation,
crawling a subset of the Gnutella network is provided as a service residing on various
remote locations throughout our network. In other words, our parallel algorithm
described in the previous section is implemented as a distributed object residing on
remote machines.
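As an illustration, the remote crawling service might be declared as the following RMI interface; the name and signature are hypothetical, not the actual interface used in the thesis implementation:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Hypothetical RMI interface for the distributed crawling service.
// Each remote object crawls its portion of the host list and returns a
// serializable result (here simplified to a list of "h1|h2" edge strings).
public interface CrawlService extends Remote {
    List<String> crawlSubset(List<String> hosts, int procId, int numProcs)
            throws RemoteException;
}
```

Each participating workstation would export an implementation of such an interface; the central object looks up the remote objects, invokes crawlSubset on each, and merges the returned edge lists into a single graph.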
Our distributed computing system includes an object serving as the “brain” of
the entire computation. This central object is responsible for “bootstrapping” the
entire topology discovery process by distributing the initial list of Gnutella hosts
to other remote objects. Upon receiving the input, each remote object performs
topology discovery of its portion of the network, and subsequently returns a graph
object representing network topology to the central object. The central object is then
responsible for merging all the output graphs into a single one representing topology
of the entire Gnutella network. We should mention that our crawler utilized some
Java classes providing functionality related to Gnutella protocol compliance from furi
- a full-fledged open-source Gnutella client developed by William Wong [3].
The main feature of our distributed implementation is that it allows a heterogeneous
network of workstations to participate in the discovery of the Gnutella network
topology. As explained, this topology discovery can be executed in constant time
using only a few processors. In addition, the output graph representing Gnutella
network topology is provided in GML format [18], which is a fast-growing standard
for representing graph data structures, and can immediately be viewed using
visualization tools such as LEDA's graphwin [8]. Several visualizations of the Gnutella
network topology data obtained using our crawler are presented in appendix A.
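For illustration, emitting the crawled adjacency structure as a minimal GML node/edge skeleton might look as follows; this is a sketch, not the crawler's actual output code:

```java
import java.util.*;

public class GmlWriter {
    // Emit an undirected adjacency structure as a minimal GML skeleton.
    public static String toGml(Map<String, Set<String>> g) {
        StringBuilder sb = new StringBuilder("graph [\n");
        Map<String, Integer> ids = new HashMap<>();
        for (String host : new TreeSet<>(g.keySet())) {   // stable node ordering
            ids.put(host, ids.size());
            sb.append("  node [ id ").append(ids.get(host))
              .append(" label \"").append(host).append("\" ]\n");
        }
        for (String a : new TreeSet<>(g.keySet()))
            for (String b : new TreeSet<>(g.get(a)))
                // write each undirected edge once, from the lower-ordered endpoint
                if (a.compareTo(b) < 0 && ids.containsKey(b))
                    sb.append("  edge [ source ").append(ids.get(a))
                      .append(" target ").append(ids.get(b)).append(" ]\n");
        sb.append("]\n");
        return sb.toString();
    }
}
```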
Chapter 5
Conclusions and future research
5.1 Conclusions
Modeling complex network structures produced by modern P2P network applications
is a difficult task. The main contribution of this thesis to the task at hand is two-fold.
First, we made several important discoveries regarding the structure of the underlying
network topology of a P2P network application known as Gnutella. Specifically, we
discovered that it exhibits “small-world” properties of clustering and small diameter. In
addition, we observed four different power law relationships of various graph metrics.
It is our thesis that these empirical observations must be accounted for by any accu-
rate graph-based model of P2P network topology. Second, we pointed out the potentially
devastating effects of heterogeneous latencies on the reachability of message broadcast in
P2P network applications operating under flooding protocols. Even though our em-
pirical results indicate that this problem we call “short-circuiting” is on average not
devastating to the overall system performance, we believe it should be taken seriously
by protocol designers. It is our hope that our results can be used in designing the
new generation of application-level protocols for P2P network applications.
5.2 Future Directions
Future research directions can be divided into three categories: those dealing with
network topology, visualization, and server placement. In the following sections, we
briefly discuss each one.
5.2.1 Network Topology Modeling
In this thesis we have reported discoveries of some structural properties of P2P net-
work topologies. However, the search continues toward a uniform model of P2P network
topology, encompassing all of those structural properties observed in real network
applications. We speculate that for many P2P network applications, including
Gnutella, such a model will be a modification of the discussed Barabasi-Albert model,
perhaps accounting for hosts leaving the network and dynamically-changing connec-
tions. In addition, more research needs to be done on spectral analysis of the topology
graph’s eigenvalues and their relationship with the structural properties.
5.2.2 Network Visualization
Better graph drawing algorithms need to be designed for visualizing the topology
of large-scale P2P networks. Such algorithms should be able to present the topological
structure of a network in a way that allows meaningful conclusions to be drawn. Network
visualizations can then be used by engineers to identify network-related problems.
5.2.3 Server Placement
The problem of finding an optimal placement of servers has received a lot of attention
in the Internet community. Many P2P file-sharing applications such as Gnutella
present another attractive practical application of this problem. For example, each
time a Gnutella user connects to the network, the event can be modeled as a graph
augmentation problem. This problem can be formulated as adding a single vertex and t edges to
a graph G so that the size of the t-horizon is optimized. In the future, we plan
to examine some theoretical issues behind this problem using the knowledge we’ve
obtained on the Gnutella topology model.
Bibliography
[1] Cooperative Association for Internet Data Analysis (CAIDA).