Data Mining using Fractals and Power laws. Christos Faloutsos, School of Computer Science, Carnegie Mellon University. Boston U., 2005.

Dec 21, 2015

Transcript
Page 1:

Boston U., 2005 C. Faloutsos 1

School of Computer Science, Carnegie Mellon

Data Mining using Fractals and Power laws

Christos Faloutsos

Carnegie Mellon University

Page 2:

THANK YOU!

• Prof. Azer Bestavros

• Prof. Mark Crovella

• Prof. George Kollios

Page 3:

Overview

• Goals / motivation: find patterns in large datasets:

– (A) Sensor data

– (B) network/graph data

• Solutions: self-similarity and power laws

• Discussion

Page 4:

Applications of sensors/streams

• ‘Smart house’: monitoring temperature, humidity etc

• Financial, sales, economic series

Page 5:

Motivation - Applications

• Medical: ECGs +; blood pressure etc monitoring

• Scientific data: seismological; astronomical; environment / anti-pollution; meteorological [Kollios+, ICDE’04]

[Figure: Sunspot Data]

Page 6:

Motivation - Applications (cont’d)

• civil/automobile infrastructure

– bridge vibrations [Oppenheim+02]

– road conditions / traffic monitoring

[Figure: Automobile traffic; # cars vs. time]

Page 7:

Motivation - Applications (cont’d)

• Computer systems

– web servers (buffering, prefetching)

– network traffic monitoring

– ...

http://repository.cs.vt.edu/lbl-conn-7.tar.Z

Page 8:

Web traffic

• [Crovella + Bestavros, SIGMETRICS’96]

[Figure: web traffic plot; time scale: 1000 sec]

Page 9:

Self-* Storage (Ganger+)

“self-*” = self-managing, self-tuning, self-healing, …

Goal: 1 petabyte (PB) for CMU researchers

www.pdl.cmu.edu/SelfStar

[Figure: survivable, self-managing storage infrastructure; a storage brick (0.5–5 TB); ~1 PB]

Page 10:

Problem definition

• Given: one or more sequences x_1, x_2, …, x_t, …; (y_1, y_2, …, y_t, …)

• Find: patterns; clusters; outliers; forecasts

Page 11:

Problem #1

• Find patterns, in large datasets

[Figure: # bytes vs. time]

Page 12:

Problem #1

• Find patterns, in large datasets

[Figure: # bytes vs. time]

Poisson; independent, identically distributed


Page 14:

Problem #1

• Find patterns, in large datasets

[Figure: # bytes vs. time]

Poisson; independent, identically distributed

Q: Then, how to generate such bursty traffic?

Page 15:

Overview

• Goals / motivation: find patterns in large datasets:

– (A) Sensor data

– (B) network/graph data

• Solutions: self-similarity and power laws

• Discussion

Page 16:

Problem #2 - network and graph mining

• What does the Internet look like?

• What does the web look like?

• What constitutes a ‘normal’ social network?

• What is the ‘network value’ of a customer?

• Which gene/species affects the others the most?

Page 17:

Network and graph mining

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

Graphs are everywhere!

Page 18:

Problem #2

Given a graph:

• which node to market to / defend / immunize first?

• Are there unnatural subgraphs (e.g., criminals’ rings)?

[from Lumeta: ISPs 6/1999]

Page 19:

Solutions

• New tools: power laws, self-similarity and ‘fractals’ work, where traditional assumptions fail

• Let’s see the details:

Page 20:

Overview

• Goals / motivation: find patterns in large datasets:

– (A) Sensor data

– (B) network/graph data

• Solutions: self-similarity and power laws

• Discussion

Page 21:

What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

... zero area: (3/4)^inf

infinite length! (4/3)^inf

Q: What is its dimensionality??

Page 22:

What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

... zero area: (3/4)^inf

infinite length! (4/3)^inf

Q: What is its dimensionality??

A: log3 / log2 = 1.58 (!?!)
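A worked version of the answer, using the standard similarity-dimension argument (added for illustration; the symbols N, s and D are not on the slide). The Sierpinski triangle consists of N = 3 copies of itself, each scaled down by a factor s = 1/2, so:

```latex
D = \frac{\log N}{\log (1/s)} = \frac{\log 3}{\log 2} \approx 1.58
```

The same formula gives D = 1 for a line segment (N = 2, s = 1/2) and D = 2 for a filled square (N = 4, s = 1/2).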

Page 23:

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• Q: fd of a plane?

Page 24:

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1 (‘power law’: y = x^a)

• Q: fd of a plane?

• A: nn ( <= r ) ~ r^2

fd == slope of ( log(nn) vs. log(r) )
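A minimal sketch of the slope idea, assuming synthetic integer-grid points rather than any data set from the talk. It counts pairs of points within distance r (proportional to the neighbor count nn, so the slope is the same) and fits a least-squares line in log-log scale. The helper names `pair_counts` and `loglog_slope` are made up for this illustration.

```python
import math
from itertools import combinations

def pair_counts(points, radii):
    """For each radius r, count unordered pairs of 2-d points at distance <= r."""
    limits = [r * r for r in radii]
    counts = [0] * len(radii)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
        for i, limit in enumerate(limits):
            if d2 <= limit:
                counts[i] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

radii = [2, 4, 8]
line = [(i, 0) for i in range(500)]                      # points along a line
plane = [(i, j) for i in range(40) for j in range(40)]   # points filling a square

fd_line = loglog_slope(radii, pair_counts(line, radii))
fd_plane = loglog_slope(radii, pair_counts(plane, radii))
print(f"line:  slope ~ {fd_line:.2f}")
print(f"plane: slope ~ {fd_plane:.2f}")
```

The two slopes should come out close to 1 and close to 2. The plane estimate sits a little below 2 because points near the border of the square have fewer neighbors.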

Page 25:

Sierpinski triangle

[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

== ‘correlation integral’

= CDF of pairwise distances
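The correlation integral can be tried on a synthetic Sierpinski triangle. This is a sketch under stated assumptions: the point set is the 3^6 corner points of a level-6 construction with unit side, the radii are picked by hand, and the function names are hypothetical.

```python
import math
from itertools import combinations

def sierpinski_points(level):
    """Lower-left corners of the 3**level small triangles (unit-side triangle)."""
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    points = [(0.0, 0.0)]
    scale = 0.5
    for _ in range(level):
        points = [(x + scale * cx, y + scale * cy)
                  for (x, y) in points for (cx, cy) in corners]
        scale /= 2
    return points

def correlation_integral(points, radii):
    """Number of unordered pairs within distance r, for each r."""
    limits = [r * r for r in radii]
    counts = [0] * len(radii)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
        for i, limit in enumerate(limits):
            if d2 <= limit:
                counts[i] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

points = sierpinski_points(6)            # 729 points
radii = [0.03, 0.06, 0.12, 0.24]
slope = loglog_slope(radii, correlation_integral(points, radii))
print(f"estimated fractal dimension ~ {slope:.2f}")
```

With a finite point set the estimate is only approximate; it should land in the neighborhood of log 3 / log 2 ≈ 1.58 rather than exactly on it.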

Page 26:

Observations: Fractals <-> power laws

Closely related:

• fractals <=>

• self-similarity <=>

• scale-free <=>

• power laws ( y = x^a ; F = K r^(-2) )

• (vs y = e^(-ax) or y = x^a + b)

[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

Page 27:

Outline

• Problems

• Self-similarity and power laws

• Solutions to posed problems

• Discussion

Page 28:

Solution #1: traffic

• disk traces: self-similar (also: [Leland+94])

• How to generate such traffic?

[Figure: #bytes vs. time]

Page 29:

Solution #1: traffic

• disk traces (80-20 ‘law’) – ‘multifractals’

[Figure: #bytes vs. time, split 20% / 80%]

Page 30:

80-20 / multifractals

[Figure: recursive 20 / 80 split]

Page 31:

80-20 / multifractals

• p ; (1-p) in general

• yes, there are dependencies

[Figure: recursive 20 / 80 split]
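A minimal sketch of an 80-20 generator, assuming the recursive split described on these slides: each time interval is halved, one half receives a fraction p of the bytes and the other 1 - p, and a coin flip decides which half is the busy one. The function name `b_model`, the seed and the sizes are illustrative choices, not taken from the talk.

```python
import random

def b_model(total, levels, p=0.8, seed=0):
    """Split `total` recursively: one half of each interval gets p, the other 1 - p."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in series:
            big, small = p * volume, (1 - p) * volume
            # which half receives the larger share is a coin flip
            finer.extend((big, small) if rng.random() < 0.5 else (small, big))
        series = finer
    return series

traffic = b_model(total=1_000_000, levels=10)     # 2**10 = 1024 time ticks
busiest = sorted(traffic, reverse=True)[: len(traffic) // 5]
share = sum(busiest) / sum(traffic)
print(len(traffic), round(sum(traffic)), round(max(traffic)))
print(f"busiest 20% of ticks carry {share:.0%} of the bytes")
```

The total volume is conserved at every level, while a few ticks end up carrying most of the bytes, which is the bursty behavior a Poisson model does not reproduce.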

Page 32:

More on 80/20: PQRS

• Part of ‘self-* storage’ project

[Figure: cylinder# vs. time]

Page 33:

More on 80/20: PQRS

• Part of ‘self-* storage’ project

[Figure: nested quadrants labeled p, q, r, s]

Page 34:

Overview

• Goals / motivation: find patterns in large datasets:

– (A) Sensor data

– (B) network/graph data

• Solutions: self-similarity and power laws

– sensor/traffic data

– network/graph data

• Discussion

Page 35:

Problem #2 - topology

What does the Internet look like? Any rules?

Page 36:

Patterns?

• avg degree is, say, 3.3

• pick a node at random

– guess its degree, exactly (-> “mode”)

[Figure: count vs. degree; avg: 3.3]

Page 37:

Patterns?

• avg degree is, say, 3.3

• pick a node at random

– guess its degree, exactly (-> “mode”)

• A: 1!!

[Figure: count vs. degree; avg: 3.3]

Page 38:

Patterns?

• avg degree is, say, 3.3

• pick a node at random

– what is the degree you expect it to have?

• A: 1!!

• A’: very skewed distr.

• Corollary: the mean is meaningless!

• (and std -> infinity (!))

[Figure: count vs. degree; avg: 3.3]
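The gap between mode, mean and standard deviation can be checked on a synthetic power-law histogram. This is a sketch under assumptions: count(d) is taken proportional to d^(-2.2), an exponent picked for illustration only, and the degree range is cut off at a maximum value.

```python
import math

def degree_stats(gamma, max_degree):
    """Mode, mean and standard deviation when count(d) is proportional to d**(-gamma)."""
    weights = {d: d ** -gamma for d in range(1, max_degree + 1)}
    total = sum(weights.values())
    mode = max(weights, key=weights.get)
    mean = sum(d * w for d, w in weights.items()) / total
    second = sum(d * d * w for d, w in weights.items()) / total
    return mode, mean, math.sqrt(second - mean ** 2)

mode, mean, std_small = degree_stats(gamma=2.2, max_degree=1_000)
_, _, std_large = degree_stats(gamma=2.2, max_degree=10_000)
print(f"mode = {mode}, mean = {mean:.2f}")
print(f"std with max degree 1,000:  {std_small:.1f}")
print(f"std with max degree 10,000: {std_large:.1f}")
```

The most likely degree is 1 even though the mean is several times larger, and the standard deviation keeps growing as the allowed maximum degree grows, which is the sense in which it tends to infinity.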

Page 39:

Solution #2: Rank exponent R

• A1: Power law in the degree distribution [SIGCOMM99]

[Figure: internet domains; log(degree) vs. log(rank); slope -0.82; labeled points: att.com, ibm.com]
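A sketch of how a rank exponent could be estimated: sort the degrees in decreasing order, then fit a least-squares line to log(degree) versus log(rank). The degrees below are synthetic, generated to follow an exact power law with exponent -0.82 so that the fit has a known answer; they are not the Internet measurements from the slide, and `rank_exponent` is a hypothetical helper name.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

def rank_exponent(degrees):
    """Slope of log(degree) vs. log(rank), with rank 1 = highest degree."""
    ordered = sorted(degrees, reverse=True)
    ranks = range(1, len(ordered) + 1)
    return loglog_slope(ranks, ordered)

# synthetic degrees: degree = 1000 * rank ** -0.82 (illustrative only)
degrees = [1000 * rank ** -0.82 for rank in range(1, 201)]
estimated = rank_exponent(degrees)
print(f"rank exponent ~ {estimated:.2f}")
```

On real data the points scatter around the line, so the quality of the fit should be inspected as well as the slope.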

Page 40:

Solution #2’: Eigen Exponent E

• A2: power law in the eigenvalues of the adjacency matrix

[Figure: eigenvalue vs. rank of decreasing eigenvalue, May 2001; exponent = slope; E = -0.48]

Page 41:

Power laws - discussion

• do they hold, over time?

• do they hold on other graphs/domains?

Page 42:

Power laws - discussion

• do they hold, over time?

• Yes! for multiple years [Siganos+]

• do they hold on other graphs/domains?

• Yes!

– web sites and links [Tomkins+], [Barabasi+]

– peer-to-peer graphs (gnutella-style)

– who-trusts-whom (epinions.com)

Page 43:

Time Evolution: rank R

[Figure: rank exponent (-1 to -0.5) vs. instances in time, Nov'97 and on (0 to 800)]

• The rank exponent has not changed! [Siganos+]

[Figure: domain level; log(degree) vs. log(rank); slope -0.82; labeled points: att.com, ibm.com]

Page 44:

The Peer-to-Peer Topology

• Number of immediate peers (= degree), follows a power-law [Jovanovic+]

[Figure: count vs. degree]

Page 45:

epinions.com

• who-trusts-whom [Richardson + Domingos, KDD 2001]

[Figure: count vs. (out) degree]

Page 46:

Why care about these patterns?

• better graph generators [BRITE, INET]

– for simulations

– extrapolations

• ‘abnormal’ graph and subgraph detection

Page 47:

Outline

• problems

• Fractals

• Solutions

• Discussion

– what else can they solve?

– how frequent are fractals?

Page 48:

What else can they solve?

• separability [KDD’02]

• forecasting [CIKM’02]

• dimensionality reduction [SBBD’00]

• non-linear axis scaling [KDD’02]

• disk trace modeling [PEVA’02]

• selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00]

• ...

Page 49:

Full Content Indexing, Search and Retrieval from Digital Video Archives

www.informedia.cs.cmu.edu

[Figure: Informedia screenshots: query (6TB of data); search results (ranked); storyboard; collage with maps, common phrases, named entities and dynamic query sliders]

Page 50:

What else can they solve?

• separability [KDD’02]

• forecasting [CIKM’02]

• dimensionality reduction [SBBD’00]

• non-linear axis scaling [KDD’02]

• disk trace modeling [PEVA’02]

• selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00]

• ...

Page 51:

Problem #3 - spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol)

- ‘spiral’ and ‘elliptical’ galaxies

- patterns? (not Gaussian; not uniform)

- attraction/repulsion?

- separability??

Page 52:

Solution #3: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r); curves: spi-spi, spi-ell, ell-ell]

- 1.8 slope

- plateau!

- repulsion!

CORRELATION INTEGRAL!

Page 53:

Solution #3: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r); curves: spi-spi, spi-ell, ell-ell]

- 1.8 slope

- plateau!

- repulsion!

[w/ Seeger, Traina, Traina, SIGMOD00]
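For two point sets, such as spirals and ellipticals, the quantity on the vertical axis is the number of cross pairs within distance r. A brute-force sketch, with a made-up function name and four toy points instead of galaxy coordinates:

```python
def cross_pair_counts(set_a, set_b, radii):
    """For each r, count pairs (a, b) with a in set_a, b in set_b, distance <= r."""
    limits = [r * r for r in radii]
    counts = [0] * len(radii)
    for a in set_a:
        for b in set_b:
            d2 = sum((u - v) ** 2 for u, v in zip(a, b))   # squared distance
            for i, limit in enumerate(limits):
                if d2 <= limit:
                    counts[i] += 1
    return counts

# toy coordinates, for illustration only
spirals = [(0, 0), (1, 0)]
ellipticals = [(0, 1), (3, 0)]
counts = cross_pair_counts(spirals, ellipticals, radii=[1, 2, 3])
print(counts)   # [1, 3, 4]
```

Plotting log(count) against log(r) for the spi-spi, spi-ell and ell-ell combinations gives curves of the kind shown on the slide. The double loop is quadratic in the number of points, so it is only meant for small samples.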

Page 54:

spatial d.m.

[Figure: radii r1 and r2]

Heuristic on choosing # of clusters

Page 55:

Solution #3: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r); curves: spi-spi, spi-ell, ell-ell]

- 1.8 slope

- plateau!

- repulsion!

Page 56:

Problem #4: dim. reduction

• given attributes x_1, ..., x_n

– possibly, non-linearly correlated

• drop the useless ones

[Figure: scatter plot of mpg and cc]

Page 57:

Problem #4: dim. reduction

• given attributes x_1, ..., x_n

– possibly, non-linearly correlated

• drop the useless ones

(Q: why? A: to avoid the ‘dimensionality curse’)

Solution: keep on dropping attributes, until the f.d. changes! [SBBD’00]

[Figure: scatter plot of mpg and cc]
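A rough sketch of the "drop attributes until the f.d. changes" idea, not the algorithm of [SBBD’00] itself. Assumptions: the f.d. is estimated as the log-log slope of pair counts over three hand-picked radii, the tolerance of 0.5 is arbitrary, and the data are synthetic, with attribute z = x^2 being a non-linear function of x and therefore redundant.

```python
import math
from itertools import combinations

def fractal_dim(rows, radii=(0.1, 0.2, 0.4)):
    """Slope of log(#pairs within r) vs. log(r), by least squares."""
    limits = [r * r for r in radii]
    counts = [0] * len(radii)
    for p, q in combinations(rows, 2):
        d2 = sum((a - b) ** 2 for a, b in zip(p, q))
        for i, limit in enumerate(limits):
            if d2 <= limit:
                counts[i] += 1
    lx = [math.log(r) for r in radii]
    ly = [math.log(c) for c in counts]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

def project(table, names):
    """Keep only the named attributes of every row."""
    return [tuple(row[n] for n in names) for row in table]

def drop_useless(table, names, tolerance=0.5):
    """Drop attributes one by one while the f.d. stays within `tolerance`."""
    kept = list(names)
    full_fd = fractal_dim(project(table, kept))
    while len(kept) > 1:
        # f.d. of the data with each single attribute left out
        partial = {n: fractal_dim(project(table, [k for k in kept if k != n]))
                   for n in kept}
        candidate = min(partial, key=lambda n: abs(full_fd - partial[n]))
        if abs(full_fd - partial[candidate]) > tolerance:
            break      # dropping anything else would change the f.d.
        kept.remove(candidate)
    return kept, full_fd

grid = [i / 19 for i in range(20)]
table = [{"x": x, "y": y, "z": x * x} for x in grid for y in grid]
kept, fd = drop_useless(table, ["x", "y", "z"])
print(f"f.d. of all attributes ~ {fd:.2f}; kept: {kept}")
```

Three attributes go in, but the estimated f.d. is about 2, so one of the correlated pair (x, z) can be dropped without changing it, while y has to stay.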

Page 58:

Outline

• problems

• Fractals

• Solutions

• Discussion

– what else can they solve?

– how frequent are fractals?

Page 59:

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

• <and many-many more! see [Mandelbrot]>

Page 60:

Fractals: Brain scans

• brain-scans

[Figure: log(#octants) vs. octree levels; slope 2.63 = fd]

Page 61:

fMRI brain scans

• Center for Cognitive Brain Imaging @ CMU

• Tom Mitchell, Marcel Just, ++

fMRI Goal: human brain function

Which voxels are active, for a given cognitive task?

Page 62:

More fractals

• periphery of malignant tumors: ~1.5

• benign: ~1.3

• [Burdet+]

Page 63:

More fractals:

• cardiovascular system: 3 (!)

• lungs: ~2.9

Page 64:

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

Page 65:

More fractals:

• Coastlines: 1.2-1.58

[Figure: example curves labeled 1, 1.1, 1.3]


Page 67:

Cross-roads of Montgomery county:

• any rules?

[Figure: GIS points]

Page 68:

GIS

A: self-similarity:

• intrinsic dim. = 1.51

[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]

Page 69:

Examples: LB county

• Long Beach county of CA (road end-points)

[Figure: log(#pairs) vs. log(r); slope 1.7]

Page 70:

More power laws: areas – Korcak’s law

Scandinavian lakes

Any pattern?

Page 71:

More power laws: areas – Korcak’s law

Scandinavian lakes: area vs complementary cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]
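A sketch of how such a plot could be built: for every area, count the items at least that large, then look at the slope in log-log scale. The areas below are synthetic, constructed so that count(>= area) follows area^(-0.75) exactly; the exponent 0.75 is an arbitrary illustrative value, not the one for Scandinavian lakes, and `ccdf` is a hypothetical helper name.

```python
import math

def ccdf(values):
    """Pairs (v, number of items >= v), one per distinct value, largest first."""
    ordered = sorted(values, reverse=True)
    result = []
    for rank, v in enumerate(ordered, start=1):
        if result and result[-1][0] == v:
            result[-1] = (v, rank)        # ties: count everything >= v
        else:
            result.append((v, rank))
    return result

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

n, b = 500, 0.75
areas = [(n / i) ** (1 / b) for i in range(1, n + 1)]   # synthetic areas
pairs = ccdf(areas)
slope = loglog_slope([a for a, _ in pairs], [c for _, c in pairs])
print(f"slope of log(count >= area) vs. log(area) ~ {slope:.2f}")
```

A straight line in this plot is the signature of a power law; with measured areas the tail is usually noisy, since only a few very large items exist.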

Page 72:

More power laws: Korcak

Japan islands: area vs cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]

Page 73:

More power laws

• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Figure: energy released vs. day]

[Figure: log(count) vs. magnitude = log(energy)]

Page 74:

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

Page 75:

A famous power law: Zipf’s law

• Bible - rank vs. frequency (log-log)

[Figure: “Rank/frequency plot”; log(freq) vs. log(rank); labeled points: “the”, “a”]
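A minimal sketch of how the rank/frequency table behind such a plot could be computed. The sample sentence is a toy input and the function name is made up; a real experiment would read a large text and plot log(freq) against log(rank).

```python
from collections import Counter

def rank_frequency(text):
    """(rank, word, frequency) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

table = rank_frequency("the cat and the dog and the bird")
for rank, word, freq in table:
    print(rank, word, freq)
```

Splitting on whitespace is a simplification; punctuation and word forms would need more careful handling on real text.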

Page 76:

TELCO data

[Figure: count of customers vs. # of service units; labeled point: ‘best customer’]

Page 77:

SALES data – store#96

[Figure: count of products vs. # units sold; labeled point: “aspirin”]

Page 78:

Olympic medals (Sydney’00, Athens’04):

[Figure: log(#medals) (0 to 2.5) vs. log(rank) (0 to 2); series: Athens, Sydney]

Page 79:

Even more power laws:

• Income distribution (Pareto’s law)

• size of firms

• publication counts (Lotka’s law)

Page 80:

Even more power laws:

library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)

[Figure: log(count) vs. log(#citations); labeled point: Ullman]

Page 81:

Even more power laws:

• web hit counts [w/ A. Montgomery]

[Figure: Web Site Traffic; log(count) vs. log(freq); Zipf; labeled point: “yahoo.com”]

Page 82:

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

Page 83:

Power laws, cont’d

• In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]

[Plot: axes log indegree and log(freq); from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins]

Page 84:

Power laws, cont’d

• In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]

• length of file transfers [Crovella+Bestavros ‘96]

• duration of UNIX jobs [Harchol-Balter]

Page 85:

Conclusions

• Fascinating problems in Data Mining: find patterns in

– sensors/streams

– graphs/networks

Page 86:

Conclusions - cont’d

New tools for Data Mining: self-similarity & power laws, which appear in many cases

Bad news: they lead to skewed distributions (no Gaussian, Poisson, uniformity, independence, mean, variance)

Good news:

• ‘correlation integral’ for separability

• rank/frequency plots

• 80-20 (multifractals)

• (Hurst exponent, strange attractors, renormalization theory, ++)
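The slide lists the ‘correlation integral’ without defining it. A common formulation counts the pairs of points within distance r and reads the slope of log(pair count) against log(r) as an estimate of the correlation fractal dimension. The sketch below follows that formulation under stated assumptions: the function names, the radii, and the two synthetic point sets are all hypothetical, and the brute-force pair loop is only suitable for small inputs.

```python
import math

def pair_count(points, radius):
    """Count distinct pairs of points at Euclidean distance <= radius."""
    limit = radius * radius  # compare squared distances, so no sqrt is needed
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 <= limit:
                count += 1
    return count

def correlation_slope(points, radii):
    """Least-squares slope of log(pair count) against log(radius).

    Assumes every radius captures at least one pair (log(0) is undefined).
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(pair_count(points, r)) for r in radii]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

radii = [1, 2, 4]
line_points = [(i, 0) for i in range(200)]                    # points on a line
grid_points = [(i, j) for i in range(15) for j in range(15)]  # filled square
line_slope = correlation_slope(line_points, radii)
grid_slope = correlation_slope(grid_points, radii)
print(round(line_slope, 2), round(grid_slope, 2))
```

The points on a line give a slope close to 1, while the filled grid gives a clearly higher slope that stays below 2 because a small grid has strong edge effects. Real data sets would need more points, a wider range of radii, and a faster method than comparing all pairs.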

Page 87:

Resources

• Manfred Schroeder, “Fractals, Chaos, Power Laws”, 1991

• Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2001

Page 88:

References

• [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension, Proc. of VLDB, pp. 299-310, 1995

• M. Crovella and A. Bestavros, Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes, SIGMETRICS ’96.

Page 89:

References

• J. Considine, F. Li, G. Kollios and J. Byers, Approximate Aggregation Techniques for Sensor Databases (ICDE’04, best paper award).

• [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13

Page 90:

References

• [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the `80-20 Law’, Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996.

• [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000

Page 91:

References

• [vldb96] Christos Faloutsos and Volker Gaede, Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension, VLDB, Bombay, India, Sept. 1996

• [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999

Page 92:

References

• [ieeeTN94] W. E. Leland, M. S. Taqqu, W. Willinger, D. V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE/ACM Transactions on Networking, 2, 1, pp. 1-15, Feb. 1994.

• [brite] Alberto Medina, Anukool Lakhina, Ibrahim Matta, and John Byers. BRITE: An Approach to Universal Topology Generation. MASCOTS '01

Page 93:

References

• [icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree (ICDE’99)

• Stan Sclaroff, Leonid Taycher and Marco La Cascia, "ImageRover: A content-based image browser for the world wide web", Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, pp. 2-9, 1997.

Page 94:

References

• [kdd2001] Agma J. M. Traina, Caetano Traina Jr., Spiros Papadimitriou and Christos Faloutsos: Tri-plots: Scalable Tools for Multidimensional Data Mining, KDD 2001, San Francisco, CA.

Page 95:

Thank you!

Contact info: christos <at> cs.cmu.edu

www.cs.cmu.edu/~christos

(w/ papers, datasets, code for fractal dimension estimation, etc)