Data-Driven Network Analysis: Do You Really Know Your Data? Walter Willinger, AT&T Labs-Research

Dec 30, 2015

Transcript
Page 1

Data-Driven Network Analysis: Do You Really Know Your Data?

Walter Willinger
AT&T Labs-Research

Page 2

Heard about “Network Science”?

• Recent “hot topic” area in science
  – Thousands of papers, many in high-impact journals such as Science or Nature
  – Interdisciplinary flavor: (Stat.) Physics, Math, CS
  – Main apps: Internet, social science, biology, …
• Offers an alluring new recipe for doing network analysis
  – Largely measurement-driven
  – Main focus is on universal properties
  – Exploiting the predictive power of simple models
    • Small-world networks: clustering and path lengths
    • Scale-free networks: power-law degree distributions
  – Emphasis on self-organization and emergence
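
The three statistics named above (clustering, path lengths, degree distribution) can be computed in a few lines, which is part of their appeal. The sketch below is a minimal illustration on a hypothetical four-node toy graph; the function names and the graph are assumptions made for this example, not material from the talk.

```python
from collections import Counter, deque

def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        nbrs = sorted(nbrs)
        k = len(nbrs)
        if k < 2:
            continue
        # Count the links that exist among this node's neighbors.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean hop count over all reachable ordered node pairs (BFS per source)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def degree_counts(adj):
    """Histogram of node degrees: degree -> number of nodes."""
    return Counter(len(nbrs) for nbrs in adj.values())

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(toy), average_path_length(toy), dict(degree_counts(toy)))
```

None of these numbers says anything about whether the input graph is a faithful picture of the measured network, which is the question the rest of the talk pursues.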

Page 3

NETWORK SCIENCE

http://www.nap.edu/catalog/11516.html

• “First, networks lie at the core of the economic, political, and social fabric of the 21st century.”

• “Second, the current state of knowledge about the structure, dynamics, and behaviors of both large infrastructure networks and vital social networks at all scales is primitive.”

• “Third, the United States is not on track to consolidate the information that already exists about the science of large, complex networks, much less to develop the knowledge that will be needed to design the networks envisaged…”

January, 2006

Page 4

Network Science

• What?
  – “The study of network representations of physical, biological, and social phenomena leading to predictive models of these phenomena.” (National Research Council Report, 2006)
• Why?
  – “To develop a body of rigorous results that will improve the predictability of the engineering design of complex networks and also speed up basic research in a variety of applications areas.” (National Research Council Report, 2006)
• Who?
  – Physicists (statistical physics), mathematicians (graph theory), computer scientists (algorithm design), etc.

Page 5

As Internet researchers, why should we care?

• The teaching of “Network Science”

Page 6

The “New Science of Networks”

Page 7

Why should we care?

• The teaching of “Network Science”
• The claims “Network Science” makes about the Internet
  – High-degree nodes form a hub-like core
  – Fragile/vulnerable to targeted node removal
  – Achilles’ heel
  – Zero epidemic threshold
• Network Science and the Internet
  – Lies, damned lies, statistics …
  – Rich source for wrong/bad models/theories
  – The published claims about the Internet are not “controversial”; they are simply wrong!

Page 8

What is wrong with “Network Science”?

• No critical assessment of available data

• Ignores all networking-related “details”

• Overarching desire to reproduce observed properties of the data even though the quality of the data is insufficient to say anything about those properties with sufficient confidence

• Reduces model validation to the ability to reproduce an observed statistic of the data (e.g., node degree distribution)

Page 9

How to fix “Network Science”?

• Know your data!
  – Importance of data hygiene
• Take model validation more seriously!
  – Model validation ≠ data fitting
• Apply an engineering perspective to engineered systems!
  – Design principles vs. random coin tosses
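
A small sketch of why “model validation ≠ data fitting”: two graphs can match exactly on the node degree sequence and still differ in basic structure. The two six-node graphs below are hypothetical examples chosen for this illustration, not data from the talk.

```python
from collections import deque

def degree_sequence(edges, n):
    """Sorted list of node degrees for an undirected graph on nodes 0..n-1."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg)

def is_connected(edges, n):
    """True if every node is reachable from node 0 (BFS)."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n

# Hypothetical examples: one ring of six nodes vs. two separate triangles.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]

print(degree_sequence(ring, 6) == degree_sequence(two_triangles, 6))
print(is_connected(ring, 6), is_connected(two_triangles, 6))
```

A model that is judged only by its degree distribution cannot tell these two graphs apart, even though one of them is not a single network at all.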

Page 10

Some illustrative examples

• Example 1
  – Data: Traceroute measurements
  – Objective: Inferring Internet topology at the router level
• Example 2
  – Data: Traceroute measurements
  – Objective: Inferring Internet topology at the level of Autonomous Systems (ASes)
• Example 3
  – Data: BGP measurements
  – Objective: Inferring Internet topology at the level of Autonomous Systems (ASes)

Page 11

Measurement tool: traceroute

• traceroute www.duke.edu

traceroute to www.duke.edu (152.3.189.3), 30 hops max, 60 byte packets
 1 fp-core.research.att.com (135.207.16.1) 2 ms 1 ms 1 ms
 2 ngx19.research.att.com (135.207.1.19) 1 ms 0 ms 0 ms
 3 12.106.32.1 1 ms 1 ms 1 ms
 4 12.119.12.73 2 ms 2 ms 2 ms
 5 tbr1.n54ny.ip.att.net (12.123.219.129) 4 ms 5 ms 3 ms
 6 ggr7.n54ny.ip.att.net (12.122.88.21) 3 ms 3 ms 3 ms
 7 192.205.35.98 4 ms 4 ms 8 ms
 8 jfk-core-02.inet.qwest.net (205.171.30.5) 3 ms 3 ms 4 ms
 9 dca-core-01.inet.qwest.net (67.14.6.201) 11 ms 11 ms 11 ms
10 dca-edge-04.inet.qwest.net (205.171.9.98) 11 ms 15 ms 11 ms
11 gw-dc-mcnc.ncren.net (63.148.128.122) 18 ms 18 ms 18 ms
12 rlgh7600-gw-to-rlgh1-gw.ncren.net (128.109.70.38) 18 ms 18 ms 18 ms
13 roti-gw-to-rlgh7600-gw.ncren.net (128.109.70.18) 20 ms 20 ms 20 ms
14 art1sp-tel1sp.netcom.duke.edu (152.3.219.118) 23 ms 20 ms 20 ms
15 webhost-lb-01.oit.duke.edu (152.3.189.3) 21 ms 38 ms 20 ms

• 1 traceroute measurement: about 1KB
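
As a rough illustration of how such output is typically turned into topology data, the sketch below parses hop lines in the format shown above and links consecutive hops. The regular expression and the function names are assumptions made for this example; it does not handle timeouts ("*") or hops that answer from several addresses, both of which occur in real output.

```python
import re

# One hop line: TTL, then either "name (ip)" or a bare ip, then RTTs in ms.
HOP_RE = re.compile(
    r"^\s*(?P<ttl>\d+)\s+"
    r"(?:(?P<name>\S+)\s+\((?P<ip1>\d+\.\d+\.\d+\.\d+)\)"
    r"|(?P<ip2>\d+\.\d+\.\d+\.\d+))"
    r"(?P<rtts>(?:\s+[\d.]+\s+ms)+)\s*$"
)

def parse_hop(line):
    """Return a dict for one hop line, or None if the line does not match."""
    m = HOP_RE.match(line)
    if m is None:
        return None
    rtts = [float(x) for x in re.findall(r"([\d.]+)\s+ms", m.group("rtts"))]
    return {
        "ttl": int(m.group("ttl")),
        "name": m.group("name"),  # None when only an address was printed
        "ip": m.group("ip1") or m.group("ip2"),
        "rtts_ms": rtts,
    }

def interface_links(hops):
    """Pairs of interface addresses seen at consecutive TTLs."""
    return [(a["ip"], b["ip"]) for a, b in zip(hops, hops[1:])
            if b["ttl"] == a["ttl"] + 1]

sample = [
    " 1 fp-core.research.att.com (135.207.16.1) 2 ms 1 ms 1 ms",
    " 2 ngx19.research.att.com (135.207.1.19) 1 ms 0 ms 0 ms",
    " 3 12.106.32.1 1 ms 1 ms 1 ms",
]
hops = [parse_hop(line) for line in sample]
print(interface_links(hops))
```

Note that the result is a list of links between interface addresses, not between routers. That gap is exactly what the following slides are about.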

Page 12

Large-scale traceroute experiments

1 million x 1 million traceroutes: 1PB
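
The 1PB figure follows from the per-measurement size quoted earlier. A quick check, assuming "about 1KB" means 1000 bytes and that every one of the 1 million sources probes every one of the 1 million destinations once:

```python
sources = 10**6
destinations = 10**6
bytes_per_traceroute = 10**3  # assumption: "about 1KB" taken as 1000 bytes

total_bytes = sources * destinations * bytes_per_traceroute
petabytes = total_bytes / 10**15  # 1 PB = 10^15 bytes (decimal convention)
print(total_bytes, petabytes)
```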

Page 13

http://www.isi.edu/scan/mercator/mercator.html

Two Examples of inferred ISP topology

Page 14

About the Traceroute tool (1)

• traceroute is strictly about IP-level connectivity
  – Originally developed by Van Jacobson (1988)
  – Designed to trace out the route to a host
• Using traceroute to map the router-level topology
  – Engineering hack
  – Example of what we can measure, not what we want to measure!
• Basic problem #1: IP alias resolution problem
  – How to map interface IP addresses to IP routers
  – Largely ignored or badly dealt with in the past
  – New efforts in 2008 for better heuristics …
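
To make the alias resolution problem concrete, the sketch below merges interface addresses that are known to sit on the same router and compares the resulting router-level graph with the unmerged one. It is a minimal illustration only: the addresses come from documentation ranges, the alias pair is simply given as input (discovering such pairs is the hard part), and the function names are hypothetical.

```python
from collections import Counter

def resolve_aliases(interface_links, alias_pairs):
    """Collapse interface-level links into router-level links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in alias_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    routers = {find(x) for link in interface_links for x in link}
    router_links = {frozenset((find(a), find(b)))
                    for a, b in interface_links if find(a) != find(b)}
    return routers, router_links

def max_degree(router_links):
    """Largest number of links attached to any single router."""
    counts = Counter(node for link in router_links for node in link)
    return max(counts.values())

# Two hypothetical traceroutes cross the same router via different interfaces.
links = [
    ("198.51.100.1", "192.0.2.1"), ("192.0.2.1", "198.51.100.2"),
    ("203.0.113.1", "192.0.2.2"), ("192.0.2.2", "203.0.113.2"),
]
raw_routers, raw_links = resolve_aliases(links, [])
true_routers, true_links = resolve_aliases(links, [("192.0.2.1", "192.0.2.2")])
print(len(raw_routers), max_degree(raw_links))
print(len(true_routers), max_degree(true_links))
```

In this toy case, skipping alias resolution yields six "routers" of degree at most 2 instead of five routers with one of degree 4, so both the node count and the degree statistics are distorted.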

Page 15

Interfaces 1 and 2 belong to the same router

Page 16

IP Alias Resolution Problem for Abilene (thanks to Adam Bender)

Page 17

About the Traceroute tool (2)

• traceroute is strictly about IP-level connectivity
• Basic problem #2: Layer-2 technologies (e.g., MPLS, ATM)
  – MPLS is an example of a circuit technology that hides the network’s physical infrastructure from IP
  – Sending traceroutes through an opaque Layer-2 cloud results in the “discovery” of high-degree nodes, which are simply an artifact of an imperfect measurement technique.
  – This problem has been largely ignored in all large-scale traceroute experiments to date.
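
The artifact described above can be reproduced with a toy model. In the sketch below, a hypothetical ingress router `I` reaches four egress routers `E1`..`E4` through core routers `P1`..`P3`. The model assumes that routers inside the opaque cloud never show up as traceroute hops and that probes follow shortest paths in a connected graph; the topology and all names are invented for illustration.

```python
from collections import defaultdict, deque

def shortest_path(adj, src, dst):
    """Hop list from src to dst via BFS (assumes dst is reachable)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in sorted(adj[u]):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def inferred_degrees(adj, src, targets, hidden):
    """Node degrees in the graph built from traceroute-visible hops only."""
    seen = defaultdict(set)
    for dst in targets:
        visible = [n for n in shortest_path(adj, src, dst) if n not in hidden]
        for a, b in zip(visible, visible[1:]):
            seen[a].add(b)
            seen[b].add(a)
    return {n: len(nbrs) for n, nbrs in seen.items()}

true_edges = [("I", "P1"), ("P1", "P2"), ("P1", "P3"),
              ("P2", "E1"), ("P2", "E2"), ("P3", "E3"), ("P3", "E4")]
adj = defaultdict(set)
for u, v in true_edges:
    adj[u].add(v)
    adj[v].add(u)

targets = ["E1", "E2", "E3", "E4"]
true_degree = {n: len(nbrs) for n, nbrs in adj.items()}
without_cloud = inferred_degrees(adj, "I", targets, hidden=set())
with_cloud = inferred_degrees(adj, "I", targets, hidden={"P1", "P2", "P3"})
print(max(true_degree.values()), without_cloud["I"], with_cloud["I"])
```

No router in the true topology has more than three links, yet with the core hidden the ingress router appears directly connected to every egress router. Its inferred degree grows with the number of destinations probed, not with anything physical.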

Page 18

[Figure: panels (a) and (b)]

Page 20

About the Traceroute tool (3)

• The irony of traceroute measurements
  – The high-degree nodes in the middle of the network that traceroute reveals are not for real …
  – If there are high-degree nodes in the network, they can only exist at the edge of the network where they will never be revealed by generic traceroute-based experiments …
• Additional irony
  – Bias in (mathematical abstraction of) traceroute
  – Has been a major focus within CS/Networking literature
  – Non-issue in the presence of above-mentioned problems

Page 21

Example 1: Lessons learned

• Know your measurement technique!
  – Question: Can you trust the data obtained by your tool?
• Know your data!
  – Critical role of Data Hygiene in the Petabyte Age
  – Corollary: Petabytes of garbage = garbage
  – Data hygiene is often viewed as “dirty/unglamorous” work
  – Question: Can the data be used for the purpose at hand?
• Regarding Example 1:
  – (Current) traceroute measurements are of (very) limited use for inferring router-level connectivity
  – It is unlikely that future traceroute measurements will be more useful for the purpose of router-level inference

Page 22

A textbook example for what can go wrong …

• J.-J. Pansiot and D. Grad, “On routes and multicast trees in the Internet,” ACM Computer Communication Review 28(1), 1998.
  – Original traceroute data; the purpose for using the data is explicitly stated
  – Most of the issues with traceroute are listed!
• M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On the power-law relationships of the Internet topology”, Proc. ACM SIGCOMM’99, 1999.
  – Rely on the Pansiot-Grad data, but use it for a very different purpose
  – Take the available data at face value, even though Pansiot/Grad list most of the problems
  – There is no scientific basis for the reported power-law findings!
• R. Albert, H. Jeong, and A.-L. Barabasi, “Error and attack tolerance of complex networks”, Nature, 2000.
  – Do not even cite original data source (i.e., Pansiot/Grad)
  – Take the results of FFF’99 at face value
  – The reported results are all wrong!

Page 23

Applying lessons to Example 2

• Example 2: Use of traceroute measurements to infer Internet topology at the level of Autonomous Systems (ASes)
• Know your measurement technique!
  – traceroute (see Example 1)
• Know your data!
  – Main source of errors: IP address sharing between BGP neighbors makes mapping traceroute paths to AS paths very difficult
  – Up to 50% of traceroute-derived AS adjacencies appear to be bogus
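
One way address sharing can produce a bogus adjacency is sketched below. The scenario is an assumption made for illustration: a probe truly crosses AS 64500, AS 64501, and AS 64502, but the single router it touches in AS 64501 answers from an interface numbered out of AS 64500's address space. The prefixes and AS numbers are documentation values, and the longest-prefix lookup is a simplified stand-in for a BGP-derived IP-to-AS table.

```python
import ipaddress

def ip_to_asn(ip, prefix_table):
    """Longest-prefix match of an address against a {network: asn} table."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, asn in prefix_table.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return None if best is None else best[1]

def as_path_from_hops(hop_ips, prefix_table):
    """Map each hop to an AS and collapse consecutive duplicates."""
    path = []
    for ip in hop_ips:
        asn = ip_to_asn(ip, prefix_table)
        if asn is not None and (not path or path[-1] != asn):
            path.append(asn)
    return path

def as_links(path):
    """Unordered AS adjacencies implied by an AS path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

prefix_table = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
    ipaddress.ip_network("203.0.113.0/24"): 64502,
}

true_as_path = [64500, 64501, 64502]
# Hop 2 is a router in AS 64501 replying from an address in 64500's space.
hop_ips = ["192.0.2.1", "192.0.2.254", "203.0.113.1"]

inferred = as_path_from_hops(hop_ips, prefix_table)
bogus = as_links(inferred) - as_links(true_as_path)
print(inferred, bogus)
```

Under these assumptions the mapping is internally consistent at every step and still yields a link between two ASes that are not neighbors, while the AS in the middle disappears.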

Page 24

Applying lessons to Example 2 (cont.)

• Regarding Example 2
  – (Current) traceroute measurements are of (very) limited use for inferring AS-level connectivity
  – Obtaining the “ground truth” is very challenging
  – It is possible that in the future, more targeted traceroute measurements in conjunction with BGP data will be more useful for the purpose of inferring AS-level connectivity

Page 25

Applying lessons to Example 3

• Example 3: Use of BGP data to infer Internet topology at the level of Autonomous Systems (ASes)
• Know your measurement technique!
  – BGP: de facto inter-domain routing protocol
  – BGP: designed to propagate reachability information among ASes, not connectivity information
  – Engineering hack: not designed to obtain connectivity information
  – Example of what we can measure, not what we want to measure!
  – Collect BGP routing information base (RIB) information from as many routers as possible
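
The usual way to turn collected RIB data into an AS graph is to read adjacencies off the AS paths. The sketch below shows that step on hypothetical AS paths (documentation AS numbers, first element nearest the vantage point); the function name is an assumption, and real RIB dumps need parsing and cleaning that is omitted here.

```python
def as_links_from_paths(as_paths):
    """Collect undirected AS adjacencies from a list of AS paths."""
    links = set()
    for path in as_paths:
        # Collapse AS-path prepending (the same AS repeated back to back).
        collapsed = [asn for i, asn in enumerate(path)
                     if i == 0 or asn != path[i - 1]]
        for a, b in zip(collapsed, collapsed[1:]):
            links.add((min(a, b), max(a, b)))
    return links

# Hypothetical AS paths as seen from one vantage point in AS 64500.
as_paths = [
    [64500, 64501, 64502],
    [64500, 64501, 64501, 64503],  # 64501 prepended once
]
links = as_links_from_paths(as_paths)
print(sorted(links))
```

Only links that lie on some collected path can appear in the result. If ASes 64502 and 64503 shared a link that none of the monitored routers ever routes over, this procedure would have no way to see it.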

Page 26

Applying lessons to Example 3 (cont.)

• Know your data!
  – Examining the hygiene of BGP measurements requires significant commitment and domain knowledge
  – Parts of the available data seem accurate and solid (i.e., customer-provider links, nodes)
  – Parts of the available data are highly problematic and incomplete (i.e., peer-to-peer links)
  – “Ground truth” is hard to come by
• Regarding Example 3
  – (Current) BGP-based measurements are of questionable quality for inferring AS-level connectivity
  – Obtaining the “ground truth” is very challenging
  – It is possible that in the future, more targeted traceroute measurements in conjunction with BGP data will be more useful for the purpose of inferring AS-level connectivity

Page 27

A Reminder

• Data-driven network analysis in the presence of high-quality data that can be taken at face value
  – “All models are wrong … but some are useful” (G.E.P. Box)
• Data-driven network analysis in the presence of highly ambiguous data that should not be taken at face value
  – “When exactitude is elusive, it is better to be approximately right than certifiably wrong.” (B.B. Mandelbrot)

Page 28

SOME RELATED REFERENCES

• L. Li, D. Alderson, W. Willinger, and J. Doyle. A first-principles approach to understanding the Internet’s router-level topology. Proc. ACM SIGCOMM 2004.
• J.C. Doyle, D. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka, and W. Willinger. The "robust yet fragile" nature of the Internet. PNAS 102(41), 2005.
• D. Alderson, L. Li, W. Willinger, and J.C. Doyle. Understanding Internet Topology: Principles, Models, and Validation. IEEE/ACM Trans. on Networking 13(6), 2005.
• L. Li, D. Alderson, J.C. Doyle, and W. Willinger. Toward a Theory of Scale-Free Networks: Definition, Properties, and Implications. Internet Mathematics 2(4), 2006.
• R. Oliveira, D. Pei, W. Willinger, B. Zhang, and L. Zhang. In Search of the elusive Ground Truth: The Internet's AS-level Connectivity Structure. Proc. ACM SIGMETRICS 2008.
• B. Krishnamurthy and W. Willinger. What are our standards for validation of measurement-based networking research? Proc. ACM HotMetrics Workshop 2008.
• W. Willinger, D. Alderson, and J.C. Doyle. Mathematics and the Internet: A Source of Enormous Confusion and Great Potential. Notices of the AMS, Vol. 56, No. 2, 2009.