Analyzing Peer-To-Peer Traffic Across Large Networks
Subhabrata Sen, Member, IEEE, and Jia Wang, Member, IEEE
Group members: 李英宗 (d96725004), 林慶和 (d95725005)
June 15, 2009
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 12, NO. 2, APRIL 2004
ACN 2009
Authors
Subhabrata Sen received the B.Eng. degree in computer science from Jadavpur University, India, in 1992, and the M.S. and Ph.D. degrees in computer science from the University of Massachusetts, Amherst, in 1997 and 2001, respectively.
Jia Wang received the B.S. degree in computer science from the State University of New York, Binghamton, in 1996, and the M.S. and Ph.D. degrees in computer science from Cornell University, Ithaca, NY, in 1999 and 2001, respectively.
They are currently members of the Internet and Networking Systems Research Center at AT&T Labs–Research in Florham Park, NJ. Their research interests include network measurement, routing and topology analysis, traffic flow measurement, overlay networks and applications, network security and anomaly detection, Web performance, content distribution networks, and other Internet-related research work. Dr. Sen and Dr. Wang are members of the Association for Computing Machinery (ACM).
Introduction: Motivation & Goals
P2P applications are used for distributed file sharing
Their large and growing traffic volume impacts the underlying network
Goal: characterize P2P behavior to understand how these systems impact the network, and to gain insights into developing P2P systems with superior performance
Previous research focused almost exclusively on P2P signaling traffic, setting up P2P crawlers on the Internet (an "active probing" approach)
Earlier work, based on data from edge networks, provides a view of local P2P usage
This work provides a complementary "backbone view" from a large tier-1 ISP, gathering data at multiple border routers across the ISP
Outline
Methodology
Characterization Metrics
Views and Analysis Results
P2P vs. Web
Methodology
Popular P2P Applications
Three systems: Gnutella, FastTrack, DirectConnect
All decentralized and self-organizing
Data and index information distributed over peers
Transient peer membership

Measurement Approach
Large-scale passive measurement
Flow-level data gathered from routers across a large tier-1 ISP's backbone
Analyze both signaling and data traffic
Three levels of granularity: IP address, network prefix, Autonomous System (AS)
Collect data using Cisco's NetFlow
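The three-granularity rollup described above can be sketched in Python. The flow records, prefixes, and AS numbers below are illustrative assumptions standing in for NetFlow records and a real BGP routing table, not the paper's data:

```python
import ipaddress

# Hypothetical flow records (src_ip, bytes) and a toy routing table
# mapping prefixes to origin AS numbers.
flows = [("192.0.2.10", 500), ("192.0.2.77", 300), ("198.51.100.5", 200)]
prefix_table = {"192.0.2.0/24": 64500, "198.51.100.0/24": 64501}

def aggregate(flows, prefix_table):
    """Roll up per-IP byte counts to the prefix and AS granularities."""
    nets = {ipaddress.ip_network(p): asn for p, asn in prefix_table.items()}
    by_ip, by_prefix, by_asn = {}, {}, {}
    for ip, nbytes in flows:
        by_ip[ip] = by_ip.get(ip, 0) + nbytes
        addr = ipaddress.ip_address(ip)
        for net, asn in nets.items():  # match IP to its covering prefix
            if addr in net:
                by_prefix[str(net)] = by_prefix.get(str(net), 0) + nbytes
                by_asn[asn] = by_asn.get(asn, 0) + nbytes
                break
    return by_ip, by_prefix, by_asn
```

A real implementation would use longest-prefix matching against the full routing table; the toy table here has only one covering prefix per IP.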
Methodology
Advantages
Requires only knowledge of the P2P protocols' default port numbers
Non-intrusive measurement
Easier than deploying a crawler
More complete view of P2P traffic
Allows localized analysis

Limitations
Flow-level data only, no application-level details
May not capture the complete flow
Characterization Metrics
Characterization
Topology: host distributions, application-level overlay
Traffic distribution: downstream & upstream
Dynamic behavior: how frequently hosts join and leave the system, how long a host stays, ...
Characterization Metrics
Metrics
Host distribution
Traffic volume
Host connectivity
Traffic pattern over time
Connection duration and on-time

Data cleaning
Invalid (private) IPs: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
No matching prefix in the routing tables
Invalid AS numbers (≥ 64512, the private AS range)
These filters remove about 4% of flow records
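The cleaning filters above can be sketched as a single predicate over flow records. The `routed_prefixes` argument stands in for the ISP's routing tables and the record fields are illustrative assumptions:

```python
import ipaddress

# RFC 1918 private address blocks, treated as invalid source IPs.
PRIVATE = [ipaddress.ip_network(p) for p in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_valid_record(src_ip, asn, routed_prefixes):
    """Keep a flow record only if its IP is public and routed,
    and its AS number is outside the private/reserved range."""
    addr = ipaddress.ip_address(src_ip)
    if any(addr in net for net in PRIVATE):
        return False
    if asn >= 64512:  # private AS number range
        return False
    # No matching prefix in the routing tables -> discard.
    return any(addr in net for net in routed_prefixes)
```

For example, a record from 10.1.2.3 is dropped regardless of its AS number, while a routed public IP with a valid AS number is kept.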
Overview of P2P traffic
TABLE I. NetFlow data set of P2P traffic over TCP
Around 800 million flow records in total
Host distribution
Fig. 2. Host density: the distribution of the hosts participating in three P2P systems per day (y-axis is in logscale).
Traffic volume distribution
Fig. 3. Cumulative distribution of traffic volume associated with IP addresses ranked in decreasing order of volume, for September 14, 2001 (x-axis is in logscale). Aggregate traffic observed for FastTrack on this day was 960 GB.
Significant skews in traffic volume across all granularities
A few entities source/receive most of the traffic
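The ranked cumulative distribution behind Fig. 3 can be computed as follows; the toy volumes here (one heavy hitter dominating) are illustrative, not the paper's data:

```python
def cumulative_share(volumes):
    """Rank entities by traffic volume (descending) and return the
    cumulative fraction of total traffic contributed by the top k,
    for every k -- the curve plotted in Fig. 3."""
    ranked = sorted(volumes, reverse=True)
    total = sum(ranked)
    shares, running = [], 0.0
    for v in ranked:
        running += v
        shares.append(running / total)
    return shares

# Skewed toy data: the top entity alone sources 96% of the traffic.
shares = cumulative_share([960, 20, 10, 5, 5])
```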
Host connectivity
Fig. 5. Cumulative distribution of network connectivity at the IP and network prefix (PR) levels, for hosts participating in FastTrack on September 14, 2001.
Connectivity is very small for most hosts, very high for a few hosts
The distribution is less skewed at the prefix and AS levels
Time of day effect
Fig. 6. Distribution of number of IP addresses and traffic volume across hours in FastTrack on September 14, 2001 (GMT). (a) The traffic volume transferred in each bin. (b) The number of unique IP addresses, network prefixes, and ASes that are active in each bin.
Host connection duration & on-time
Substantial transience: most hosts stay in the system for a short time
The distribution is less skewed at the prefix and AS levels
FastTrack (9/14/2001), on-time threshold = 30 min
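A minimal sketch of an on-time computation with a 30-minute gap threshold: flows separated by no more than the threshold are merged into one active period, and on-time is the total length of the periods. The flow representation (start/end seconds) is an assumption, not the paper's exact procedure:

```python
def on_time(flow_times, thd=30 * 60):
    """Total on-time of a host given its flows as (start_sec, end_sec):
    flows whose gap is at most `thd` seconds belong to the same active
    period; on-time is the summed length of all periods."""
    periods = []
    for start, end in sorted(flow_times):
        if periods and start - periods[-1][1] <= thd:
            periods[-1][1] = max(periods[-1][1], end)  # extend period
        else:
            periods.append([start, end])  # gap too large: new period
    return sum(end - start for start, end in periods)
```

For example, two flows 5 minutes apart merge into one period, while flows more than 30 minutes apart count as separate periods.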
Mean bandwidth usage
Fig. 9. Cumulative distribution of the mean upstream and downstream bandwidth usage of hosts participating in FastTrack and DirectConnect on September 14, 2001 (x-axis is in logscale). (a) FastTrack. (b) DirectConnect.
Upstream < downstream: consistent with ADSL access links and upstream rate limiting
Traffic Characterization
The P2P traffic does not fit well with power law distributions.
Relationships between measures: traffic volume, host connectivity (#IPs), on-times, mean bandwidth usage
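One simple way to check the power-law claim is a least-squares fit of log(value) against log(rank), as in the rank-frequency plots of Fig. 10: a power law would give a straight line with constant slope. This sketch uses synthetic Zipf data, not the paper's measurements:

```python
import math

def loglog_slope(values):
    """Least-squares slope of log(value) vs. log(rank) for values
    ranked in decreasing order; a power law implies a straight line."""
    ranked = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(v) for v in ranked]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# A perfect Zipf distribution (value = 1000 / rank) has slope -1.
slope = loglog_slope([1000 / r for r in range(1, 101)])
```

On the measured P2P metrics, the paper finds that such a fit is poor: the rank-frequency curves bend away from a straight line, so the distributions are skewed but not strictly power-law.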
The power laws
Fig. 10. Rank-frequency plots of the P2P metrics for FastTrack on September 14, 2001: (a) overall host connectivity; (b) host connectivity for the top 10% IP addresses; (c) traffic volume of the top 10% IP addresses; (d) on-time of the top 10% IP addresses (both x-axis and y-axis are labeled in logscale).
Relationships: traffic volume vs. on-time, connectivity, and mean bandwidth
Volume heavy hitters are likely to have long on-times; hosts with short on-times contribute small traffic volumes.
A host communicating with many others may transmit only a small amount of traffic; a host communicating with few others can still source significant traffic.
Volume heavy hitters are likely to have large bandwidths; hosts with small bandwidths contribute small traffic volumes.
Traffic volume vs. on-time, connectivity, and mean bandwidth
Fig. 11. FastTrack data set for September 14, 2001—top 1%. IP addresses ranked by volume of data sent out. Scatter plots (log-log scale): (a) upstream volume versus upstream on-time; (b) upstream volume versus number of unique upstream IP addresses that an IP address connects to; (c) upstream volume versus average upstream bandwidth of an IP address.
Connectivity, on-time, and mean bandwidth
Hosts with high connectivity have long on-times; hosts with short on-times communicate with few other hosts.
Hosts with high upstream bandwidths have low connectivity counts; hosts that send traffic to many others span a range of bandwidths, but none are among the highest-bandwidth hosts.
Hosts with low upstream bandwidths can have very long on-times (perhaps downloading large files, or acting as SuperNodes).
Connectivity, on-time, and mean bandwidth
Fig. 12. FastTrack data set for September 14, 2001—top 1% IP addresses ranked by volume of data sent out. Scatter plots (log-log scale): (a) number of unique upstream IP addresses that a host connects to versus total upstream on-time of the IP address; (b) number of unique upstream IP addresses versus average upstream bandwidth; (c) average upstream bandwidth versus total upstream on-time.
P2P vs Web
97% of prefixes contributing P2P traffic also contribute Web traffic
Heavy hitter prefixes for P2P traffic tend to be heavy hitters for Web traffic
P2P traffic contributed by the top heavy hitter prefixes is more stable than either Web or total traffic
The top 0.01%, 0.1%, 1%, and 10% heavy hitters contribute roughly 10%, 30%, 50%, and 90% of the traffic volume, respectively
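The heavy-hitter shares above can be computed with a small helper: rank entities by volume and sum the top fraction. The toy population below (10 heavy hitters among 1000 hosts) is illustrative, not the paper's data:

```python
def top_fraction_share(volumes, fraction):
    """Fraction of total traffic contributed by the top `fraction`
    of entities ranked by volume (at least one entity)."""
    ranked = sorted(volumes, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Skewed toy population: 10 hosts source 1000 units each,
# the remaining 990 hosts source 1 unit each.
vols = [1000] * 10 + [1] * 990
```

Here the top 1% (10 hosts) contribute about 91% of the traffic, illustrating the kind of skew the slide describes.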
P2P vs Web
Fig. 13. Cumulative distribution of the traffic volume changes for top heavy hitter prefixes. (a) Top 0.01% prefixes. (b) Top 1% prefixes.
Summary
The analysis covers both signaling and data traffic, complementing previous work on Gnutella.
Significant increase in both traffic volume and number of users.
The traffic volume generated by individual hosts is extremely variable: fewer than 10% of the IPs contribute about 99% of the traffic volume.
Traffic distributions are extremely skewed in traffic volume, connectivity, on-time, and average bandwidth usage, but do not strictly obey power laws.
Summary
All three P2P systems exhibit a high level of system dynamics; only a small fraction of hosts are persistent over long time periods.
P2P is a significant but stable component of Internet traffic, more stable than Web traffic or overall traffic.
Application-specific layer-3 traffic engineering is a promising way to manage the P2P workload in an ISP's network.