Charalampos (Babis) E. Tsourakakis Modern Data Mining Algorithms 1 Data Analysis Project 20 Apr. 2010

Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Mar 10, 2016

Transcript
Page 1: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Charalampos (Babis) E. Tsourakakis

Modern Data Mining Algorithms 1

Data Analysis Project 20 Apr. 2010

Page 2: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Introduction
• PART I: Graphs
  ▪ Triangles
  ▪ Diameter
• PART II: Tensors
  ▪ 2 Heads method
  ▪ MACH
• Conclusion/Research Directions

Page 3: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Leonhard Euler (1707-1783)

Seven Bridges of Königsberg; Eulerian paths

Page 4: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Internet Map [lumeta.com]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

Page 5: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Market Basket Analysis: m customers × n products

Documents-Terms: m documents × n words (example terms in figure: freedom, dance, prison)

Page 6: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

[Figure: four sensor time series from the Intel Berkeley lab (panels: Temperature, Light, Voltage, Humidity; x-axis: time (min); y-axis: value)]

Page 7: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Data modeled as a tensor, i.e., a multidimensional matrix, of size T × (#sensors) × (#types of measurements), with modes time, location, and measurement type.

Multi-dimensional time series can be modeled in this way.

Page 8: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


voxel x subjects x trials x task conditions x timeticks

Functional Magnetic Resonance Imaging (fMRI)

Page 9: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Introduction
• PART I: Graphs
  ▪ Triangles
  ▪ Diameter
• PART II: Tensors
  ▪ 2 Heads method
  ▪ MACH
• Conclusion/Research Directions

Page 10: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Spam Detection
• Exponential random graphs
• Clustering Coefficients & Transitivity Ratio
• Uncovering the Hidden Thematic Structure of the web
• Link Recommendation

Friends of friends tend to become friends themselves

Page 11: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Contributions:
• Spectral family
• Triangle Sparsifiers
• Randomized SVD

Page 12: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Theorem 1

$\Delta(G) = \frac{1}{6} \sum_{i=1}^{n} \lambda_i^3$

where Δ(G) = # triangles in graph G(V,E), n = |V|, and $\lambda_1, \dots, \lambda_n$ = eigenvalues of the adjacency matrix $A_G$.

Page 13: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Theorem 2

$\Delta(i) = \frac{1}{2} \sum_{j=1}^{n} \lambda_j^3 u_{i,j}^2$

where Δ(i) = # Δs vertex i participates in, $u_j$ = j-th eigenvector, and $u_{i,j}$ = i-th entry of $u_j$.

[Figure: example vertex i with Δ(i) = 2]
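The two spectral identities (total triangles as one sixth of the sum of cubed eigenvalues, per-vertex triangles as one half of the eigenvector-weighted sum) can be checked numerically on a small graph. The following is a minimal sketch, assuming a dense symmetric 0/1 adjacency matrix with zero diagonal; the function name `triangles_from_spectrum` is hypothetical and not from the slides.

```python
import numpy as np

def triangles_from_spectrum(A):
    """Return (total triangles, per-vertex triangles) from the full spectrum of A."""
    # eigh is for symmetric matrices; column j of U is the j-th eigenvector.
    lam, U = np.linalg.eigh(A)
    cubes = lam ** 3
    total = cubes.sum() / 6.0          # Theorem 1: sum of cubes divided by 6
    per_vertex = (U ** 2) @ cubes / 2  # Theorem 2: sum_j lam_j^3 * u_{i,j}^2 / 2
    return total, per_vertex

# Complete graph on 4 vertices: 4 triangles, each vertex lies in 3 of them.
A = np.ones((4, 4)) - np.eye(4)
total, per_vertex = triangles_from_spectrum(A)
print(round(total), np.round(per_vertex))
```

The full eigendecomposition is used here only for clarity; the point of the slides is that a few top eigenvalues usually suffice.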

Page 14: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

[Figure panels: Airports, Political blogs]

Page 15: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Very important for us because:
• Few eigenvalues contribute a lot!
• Cubes amplify this even more!
• Lanczos converges fast due to large spectral gaps!

Page 16: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Almost symmetric around 0!
• Sum of cubes almost cancels out!

[Figure: Political Blogs eigenvalue spectrum, annotated "Omit!" and "Keep only 3!"]

Page 17: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Nodes    Edges    Description
~75K     ~405K    Epinions network
~404K    ~2.1M    Flickr
~27K     ~341K    Arxiv Hep-Th
~1K      ~17K     Political blogs
~13K     ~148K    Reuters news
~3M      35M      Wikipedia 2006-Sep-05
~3.15M   ~37M     Wikipedia 2006-Nov-04
~13.5K   ~37.5K   AS Oregon
~23.5K   ~47.5K   CAIDA AS 2004 to 2008 (means over 151 timestamps)

Dataset categories: Social Networks, Co-authorship network, Information Networks, Web Graphs, Internet Graphs

Page 18: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Page 19: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Page 20: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

[Figure: scatter plot. x-axis: triangles node i participates in; y-axis: triangles node i participates in according to our estimation]

Page 21: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

With 2-3 eigenvalues, the results are almost ideal!

Page 22: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Kronecker graphs are a model for generating graphs that mimic properties of real-world networks. The basic operation is the Kronecker product [Leskovec et al.].

Initiator graph, adjacency matrix A[0]:

0 1 1
1 0 1
1 1 0

Taking the Kronecker product with the initiator gives adjacency matrix A[1], then A[2], and so on; repeating k times gives adjacency matrix A[k].

Page 23: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Theorem [KroneckerTRC]: Let B = A[k] be the k-th Kronecker product (with A[0] = A) and Δ(G_A), Δ(G_B) the total number of triangles in G_A, G_B. Then, the following equality holds:

$\Delta(G_B) = 6^k \, \Delta(G_A)^{k+1}$
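The Kronecker triangle count can be sanity-checked numerically, since the trace of a Kronecker product is the product of the traces, so the closed 3-walk count of A[k] is the (k+1)-th power of that of A. The sketch below assumes A[0] = A is the initiator and A[k] = A[k-1] ⊗ A; `kronecker_power` and `triangles` are hypothetical helper names.

```python
import numpy as np

def kronecker_power(A, k):
    """Return A[k], i.e. the initiator A Kronecker-multiplied with itself k times."""
    B = A
    for _ in range(k):
        B = np.kron(B, A)
    return B

def triangles(A):
    """Exact triangle count: trace(A^3) / 6 for a symmetric 0/1 matrix with zero diagonal."""
    return int(round(np.trace(np.linalg.matrix_power(A, 3)) / 6))

# Initiator: the triangle graph (the 3x3 matrix of all ones minus the identity).
A = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)
for k in range(3):
    print(k, triangles(kronecker_power(A, k)), 6 ** k * triangles(A) ** (k + 1))
```

Both columns of the printout should agree for every k, which is what the closed-form count predicts without ever materializing the large graph.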

Page 24: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Observation 1: Eigendecomposition <-> SVD when the matrix is symmetric, i.e.,
  ▪ eigenvectors = left singular vectors
  ▪ $\lambda_i = \sigma_i \, \mathrm{sgn}(u_i^T v_i)$, where $\lambda_i$, $\sigma_i$ are the eigenvalue and singular value respectively, and $u_i$, $v_i$ are the left and right singular vectors respectively.

• Observation 2: We care about a low rank approximation of A
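Observation 1 can be illustrated with a few lines of NumPy: take the SVD of a symmetric adjacency matrix and recover signed eigenvalues from the sign of the inner product of matching left and right singular vectors. This is a minimal sketch, and `eigenvalues_from_svd` is a hypothetical name. It assumes that no two eigenvalues have equal magnitude and opposite sign; in that degenerate case the singular vectors are not uniquely paired and the sign rule can fail.

```python
import numpy as np

def eigenvalues_from_svd(A):
    """Recover eigenvalues of a symmetric matrix as sigma_i * sgn(u_i . v_i)."""
    U, s, Vt = np.linalg.svd(A)
    # Column i of U is u_i; row i of Vt is v_i, so Vt.T has v_i as column i.
    signs = np.sign(np.sum(U * Vt.T, axis=0))
    return s * signs

# Complete graph on 4 vertices: eigenvalues are 3, -1, -1, -1.
A = np.ones((4, 4)) - np.eye(4)
lam = eigenvalues_from_svd(A)
print(np.sort(np.round(lam, 6)), round(np.sum(lam ** 3) / 6))
```

With the signs recovered, the sum of cubes divided by 6 gives the triangle count, which is why an approximate SVD is enough for the triangle estimators.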

Page 25: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Frieze, Kannan, Vempala:
(1) Pick column i with probability proportional to its squared length.
(2) Use the sampled matrix to obtain a good low rank approximation to the original one.

• Idea: Sample c columns, obtain $\tilde{A}$, and find $\tilde{A}_k$ instead of the optimal $A_k$. Recover signs from the left and right singular vectors. Use EigenTriangle!

• Results: c=100, k=6 for Flickr: EigenTriangle 95.6% accuracy, approximation 95.46%
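The column sampling step can be sketched as follows. This is an illustrative reading of length-squared sampling, not the author's code: columns are drawn with replacement with probability proportional to their squared length and rescaled by 1/sqrt(c * p_i), a standard scaling that is an assumption here since the slide does not spell it out. The names `sample_columns` and `approx_top_singular` are hypothetical.

```python
import numpy as np

def sample_columns(A, c, rng):
    """Draw c columns with probability proportional to squared length, then rescale."""
    col_sq = np.sum(A * A, axis=0)
    probs = col_sq / col_sq.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    # Rescaling makes C @ C.T an unbiased estimate of A @ A.T.
    return A[:, idx] / np.sqrt(c * probs[idx])

def approx_top_singular(A, c, k, rng):
    """Approximate the top-k left singular vectors and singular values of A."""
    C = sample_columns(A, c, rng)
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], s[:k]

rng = np.random.default_rng(0)
A = np.ones((4, 4)) - np.eye(4)
U_k, s_k = approx_top_singular(A, c=3, k=2, rng=rng)
print(U_k.shape, s_k.shape)
```

A convenient property of this scaling is that the squared Frobenius norm of the sampled matrix equals that of A exactly, whatever columns are drawn, because every rescaled column has squared length ||A||_F^2 / c.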

Page 26: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Contributions:
• Spectral family
• Triangle Sparsifiers
• Randomized SVD

Page 27: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Approximate a given graph G with a sparse graph H, such that H is close to G under a certain notion.

• Examples:
  ▪ Cut-preserving sparsifiers: Benczúr-Karger
  ▪ Spectral sparsifiers: Spielman-Teng

What about Triangle Sparsifiers?

Page 28: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Toss a coin for each edge of G(V,E); t = # Δ in G.

HEADS! Edge (i,j) "survives" with probability p.

Page 29: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

TAILS! Edge (k,m) of G(V,E) "dies".

Now count the triangles in the sparsified graph G'(V,E'), call their number T = # Δ in G', and let $T/p^3$ be the estimate of t = # Δ in G.

Main Theoretical Results: Under mild conditions on the triangle density (at least a nearly linear number of triangles), our estimate is strongly concentrated around the true number of triangles!
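The coin-tossing estimator can be written in a few lines. The sketch below assumes an undirected simple graph given as a list of unique edges; `count_triangles` and `sparsified_triangle_estimate` are hypothetical names, and the exact counter used on the sparsified graph is a simple neighbor-intersection method chosen for brevity, not necessarily the counter used in the talk.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count via common neighbors of each edge's endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def sparsified_triangle_estimate(edges, p, seed=0):
    """Keep each edge independently with probability p; return T / p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3

edges = list(combinations(range(30), 2))  # complete graph on 30 vertices
print(count_triangles(edges), sparsified_triangle_estimate(edges, p=0.5))
```

Each triangle survives with probability p^3 because its three edges are kept independently, which is why dividing by p^3 gives an unbiased estimate.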

Page 30: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Page 31: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

1 day = 86400 seconds

Expected speedup: $1/p^2$

Page 32: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Introduction
• PART I: Graphs
  ▪ Triangles
  ▪ Diameter
• PART II: Tensors
  ▪ 2 Heads method
  ▪ MACH
• Conclusion/Research Directions

Page 33: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Milgram 1967

The "small world experiment":
• Pick 300 people at random
• Ask them to get a letter to a stockbroker in Boston by passing it through friends. How many steps does it take?

Only 6! Typically, the diameter of a real-world network is surprisingly small!

Page 34: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Does the same observation hold on the Yahoo Web Graph (2002), where #nodes=1.4B and #edges=6.83B?

Page 35: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Assume we have a multiset M={x1,..,xm} and we want to count the number n of distinct elements in M. How can we do this using a small amount of space?

Flajolet & G. Nigel Martin

Page 36: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Hash function h(x in U): $[0, \dots, 2^L - 1]$
• $y = \sum_k \mathrm{bit}(y,k) \, 2^k$
• ρ(y) = minimum k s.t. bit(y,k)=1, o/w L
• Let's keep a BITMASK[0..L]
• Hash every x in M and find ρ(h(x)). If BITMASK[ρ(h(x))] is 0, then flip it to 1!

How will the bitmask look at the end? (position 0 on the right)
0000000000 ... 010110 ... 1111111111111
i >> log(n): all 0 | i ~= log(n): mixed | i << log(n): all 1

Page 37: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

How will the bitmask look at the end? (position 0 on the right)
0000000000 ... 010110 ... 1111111111111
i >> log(n): all 0 | i ~= log(n): mixed | i << log(n): all 1

The mixed region (i ~= log(n)) will give us the information. Flajolet-Martin prove that for the random variable R = the lowest position i with BITMASK[i] = 0:

$E(R) \approx \log_2(0.77351 \cdot n)$
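A single Flajolet-Martin bitmask can be sketched as below. The hash (first 32 bits of SHA-256) and the choice L = 32 are assumptions made only to keep the example deterministic and self-contained; the names `h`, `rho`, `fm_bitmask`, and `fm_estimate` are hypothetical. A single bitmask has high variance, so practical implementations average R over many hash functions.

```python
import hashlib

L = 32          # number of hash bits (assumed)
PHI = 0.77351   # Flajolet-Martin correction constant

def h(x):
    """Hash x to an integer in [0, 2^L - 1] (SHA-256 prefix, an illustrative choice)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def rho(y):
    """Position of the least significant 1-bit of y, or L if y == 0."""
    return L if y == 0 else (y & -y).bit_length() - 1

def fm_bitmask(items):
    """Set BITMASK[rho(h(x))] to 1 for every x; duplicates change nothing."""
    bitmask = [0] * (L + 1)
    for x in items:
        bitmask[rho(h(x))] = 1
    return bitmask

def fm_estimate(bitmask):
    """Estimate the distinct count as 2^R / PHI, R = lowest index holding a 0."""
    R = bitmask.index(0) if 0 in bitmask else len(bitmask)
    return 2 ** R / PHI

stream = [i % 1000 for i in range(10_000)]  # 1000 distinct values, each repeated
print(fm_estimate(fm_bitmask(stream)))
```

Because the bitmask only records which positions were hit, two sketches can be merged with a bitwise OR, which is what makes the structure convenient for parallel settings.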

Page 38: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• For every h = 1, 2, ...: estimate the cardinality of the set N(h), i.e., the pairs of nodes reachable within h steps.
• When the cardinality stabilizes, output the number of steps needed to reach that cardinality as the diameter.

• Scalability: O(diam(G)*m), m=#edges
• Efficient access to the file (very important)
• Parallelizable (also very important)
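The neighborhood-function loop can be illustrated on a toy graph. To keep the example verifiable, the sketch below uses exact Python sets for the reachable nodes instead of Flajolet-Martin bitmasks; this is a deliberate simplification, and in the approximate version each set would be a bitmask and each union a bitwise OR. The function name `diameter_by_neighborhood` is hypothetical.

```python
def diameter_by_neighborhood(nodes, edges):
    """Grow N(h) one hop at a time; return the h at which |N(h)| stops growing."""
    reach = {v: {v} for v in nodes}           # N(0): each node reaches itself
    size = sum(len(s) for s in reach.values())
    h = 0
    while True:
        # One pass over the edge list per hop, as in the O(diam(G) * m) bound.
        new_reach = {v: set(s) for v, s in reach.items()}
        for u, v in edges:
            new_reach[u] |= reach[v]
            new_reach[v] |= reach[u]
        new_size = sum(len(s) for s in new_reach.values())
        if new_size == size:                  # cardinality has stabilized
            return h
        reach, size, h = new_reach, new_size, h + 1

path = [(0, 1), (1, 2), (2, 3)]
print(diameter_by_neighborhood(range(4), path))
```

Each iteration only reads the edge list sequentially, which matches the "efficient access to the file" and "parallelizable" points above.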

Page 39: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

The diameter of the Yahoo Web Graph is surprisingly small (7~8)

Page 40: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Introduction
• PART I: Graphs
  ▪ Triangles
  ▪ Diameter
• PART II: Tensors
  ▪ 2 Heads method
  ▪ MACH
• Conclusion/Research Directions

Page 41: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

[Figure: SVD of a document-to-term matrix (terms: data, graph, java, brain, lung; document groups: CS, MD)]

Document to term matrix = (Documents to Document HCs) x (Strength of each concept) x (Term to Term HCs)

Page 42: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Tucker is an SVD-like decomposition of a tensor: one projection matrix per mode, and a core tensor giving the correlation among the projection matrices.
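One simple way to obtain a Tucker decomposition is the HOSVD: take the leading left singular vectors of each mode unfolding as the projection matrix for that mode, then project the tensor onto them to get the core. The sketch below is a plain NumPy illustration under that assumption (HOOI would refine these factors iteratively); the helper names `unfold`, `mode_product`, `hosvd`, and `reconstruct` are hypothetical.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that `mode` indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, M, mode):
    """Multiply tensor X by matrix M along the given mode."""
    out = np.tensordot(M, np.moveaxis(X, mode, 0), axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(X, ranks):
    """Return (core G, projection matrices), one matrix per mode."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    G = X
    for mode, U in enumerate(factors):
        G = mode_product(G, U.T, mode)
    return G, factors

def reconstruct(G, factors):
    """Map the core back to the original space."""
    X = G
    for mode, U in enumerate(factors):
        X = mode_product(X, U, mode)
    return X

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
G, factors = hosvd(X, ranks=(2, 2, 2))
print(G.shape, [U.shape for U in factors])
```

With full ranks the reconstruction is exact; with smaller ranks the core is a compressed summary of the tensor.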

Page 43: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

In: D
Out: D' = [G; U0, U1, U2]

1. Spatial compression: Tucker decomposition
2. Temporal compression: wavelet transform
3. Sparsify the core tensor G

$e^2 = 1 - \|G\|^2 / \|D\|^2$

[Figure: In: tensor D (modes shown: location, modality) -> Tucker-2 -> X, U1, U2^T -> sparsify -> Out: G', U1, U2^T. U0 is the (fixed) transform matrix; G holds the wavelet coefficients.]
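The error formula $e^2 = 1 - \|G\|^2 / \|D\|^2$ equals the relative squared reconstruction error whenever the projection matrices have orthonormal columns, because the reconstruction is then an orthogonal projection of D. The sketch below checks this numerically for a 3-mode tensor; the random orthonormal matrices and the name `tucker_error` are assumptions for illustration only, not the Tucker factors or wavelet transform used in the talk.

```python
import numpy as np

def tucker_error(D, U0, U1, U2):
    """Return (e^2, reconstruction) for projections with orthonormal columns."""
    # Core: project every mode of D onto the corresponding column space.
    G = np.einsum("ijk,ia,jb,kc->abc", D, U0, U1, U2)
    e2 = 1.0 - np.linalg.norm(G) ** 2 / np.linalg.norm(D) ** 2
    # Reconstruction: map the core back to the original space.
    D_hat = np.einsum("abc,ia,jb,kc->ijk", G, U0, U1, U2)
    return e2, D_hat

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 5, 4))
# Orthonormal columns obtained from a QR factorization (illustrative choice).
U0 = np.linalg.qr(rng.standard_normal((6, 3)))[0]
U1 = np.linalg.qr(rng.standard_normal((5, 2)))[0]
U2 = np.linalg.qr(rng.standard_normal((4, 2)))[0]
e2, D_hat = tucker_error(D, U0, U1, U2)
print(e2, np.linalg.norm(D - D_hat) ** 2 / np.linalg.norm(D) ** 2)
```

The two printed numbers should agree, so the error can be tracked from the core alone without reconstructing the full tensor.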

Page 44: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

In: sensor measurements

Out:
• Projection matrices U1 and U2
• Core G' (wavelet coefficients)

Mining guide:
• U1 and U2 reveal the patterns on location and modality, respectively
• G' provides the patterns on time

[Figure: D (location × modality) decomposed into G', U1, U2^T; time series panels: Temperature, Light, Humidity, Voltage; x-axis: time (min); y-axis: value]

Page 45: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• 1st HC: dominant trend, e.g., daily periodicity
• 2nd HC: exceptions

[Figure: G', U1, U2^T; panels "1st Hidden Concept: Daily Periodicity" and "2nd Hidden Concept: Exceptions", each over sensors 1..54]

Page 46: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• 1st HC indicates the main sensor modality correlations
  ▪ Temperature and light are positively correlated, while humidity is anti-correlated with the rest
• 2nd HC indicates an abnormal pattern which is due to battery outage for some sensors

[Figure: G', U1, U2^T; panels "1st Hidden Concept" and "2nd Hidden Concept" over the four modalities (volt, humid, temp, light)]

Page 47: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• 1st scalogram indicates daily periodicity
• 2nd scalogram gives an abnormal flat trend due to battery outage

[Figure: scalograms of the core G', shown alongside U1 and U2^T]

Page 48: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Introduction
• PART I: Graphs
  ▪ Triangles
  ▪ Diameter
• PART II: Tensors
  ▪ 2 Heads method
  ▪ MACH
• Conclusion/Research Directions

Page 49: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Most real-world processes result in sparse tensors. However, there exist important processes which result in dense tensors:

Physical Process                                                Percentage of non-zero entries
Sensor network (sensor x measurement type x timeticks)          85%
Computer network (machine x measurement type x timeticks)       81%

Page 50: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Performing a Tucker decomposition on a dense tensor can be either very slow or, due to memory constraints, impossible.

• Can we trade a little bit of accuracy for efficiency?

Page 51: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

[Photos: McSherry, Achlioptas]

MACH extends the work of Achlioptas-McSherry for fast low rank approximations to the multilinear setting.

Page 52: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Toss a coin for each non-zero entry with probability p
  ▪ If it "survives", reweigh it by 1/p.
  ▪ If not, make it zero!

• Perform Tucker on the sparsified tensor!
• For the theoretical results, see Tsourakakis, SDM 2010.
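The sparsification step is short enough to write out. The sketch below assumes a dense NumPy tensor and draws one coin per entry (tossing coins on zero entries is harmless, since they stay zero); `mach_sparsify` is a hypothetical name, and the subsequent Tucker step is left to whichever routine is at hand.

```python
import numpy as np

def mach_sparsify(X, p, rng):
    """Keep each entry with probability p and reweigh it by 1/p; zero out the rest."""
    keep = rng.random(X.shape) < p
    return np.where(keep, X / p, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 12, 50))
X_sparse = mach_sparsify(X, p=0.1, rng=rng)
print(np.count_nonzero(X_sparse) / X.size)  # roughly p
```

The 1/p reweighing keeps the sparsified tensor equal to the original in expectation, which is what lets the decomposition of the sparse tensor stand in for the decomposition of the dense one.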

Page 53: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Intemon (Carnegie Mellon University Self-Monitoring system)

• Tensor X: 100 machines x 12 types of measurement x 10080 timeticks

• Jimeng Sun showed in his thesis that Tucker decompositions can be used to monitor the system efficiently

Page 54: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

For p=0.1, we obtain a Pearson's correlation coefficient of 0.99.

Ideal: ρ=1

Page 55: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

[Figure: side-by-side panels, Exact vs. MACH. Find the differences!]

The qualitative analysis, which is important for our goals, remains the same!

Page 56: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Berkeley Lab

• Tensor: 54 sensors x 4 types of measurement x 5385 timeticks

Page 57: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

The qualitative analysis, which is important for our goals, remains the same!

Page 58: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

The spatial principal mode is also preserved, and Pearson's correlation coefficient is again almost 1!

Page 59: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

REMARKS
1) Daily periodicity is apparent.
2) Pearson's correlation coefficient is 0.99 with the exact component.

Page 60: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Introduction
• PART I: Graphs
  ▪ Triangles
  ▪ Diameter
• PART II: Tensors
  ▪ 2 Heads method
  ▪ MACH
• Conclusion/Research Directions

Page 61: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• More Applications of Probabilistic Combinatorics in Large Scale Graph Mining
• Randomized algorithms work very well (e.g., sublinear time algorithms), but are typically hard to analyze.

• Smallest p* for tensor sparsification for the (messy) HOOI algorithm

Page 62: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

• Better sparsification (Edge (1,2) is important, Weighted Graphs!)

• Property Testing: Is a graph triangle free?
• Does Boolean Matrix Multiplication have a truly subcubic algorithm?

Page 63: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Faloutsos, Miller, Schwartz, Frieze, Kolountzakis, Koutis, Drineas, Kang, Leskovec

Page 64: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Page 65: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Page 66: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Concentration appears

Concentration becomes stronger

Pick p=1/ Keep doubling until concentration

Page 67: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Mildness, pick p=1

Concentration

How to choose p?

Page 68: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

I want to compute the number of triangles!

1. Use Lanczos to compute the first two eigenvalues, please!
2. Is the cube of the second one significantly smaller than the cube of the first?
   • NO: Iterate then! After some iterations (hopefully few!), compute the k-th eigenvalue. Is [...] much smaller than [...]?
   • YES: Algorithm terminates! The estimated # of Δs is the sum of cubes of the λi's divided by 6!
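The flowchart can be approximated by the loop below. Two things in it are assumptions rather than the talk's method: a full `numpy.linalg.eigvalsh` decomposition stands in for Lanczos (which would compute only the top eigenvalues), and the stopping rule, which stops once the next cube is at most `tol` times the magnitude of the running sum, is a guess at the test whose symbols are missing from the transcript. The name `eigen_triangle` and the parameter `tol` are hypothetical.

```python
import numpy as np

def eigen_triangle(A, tol=0.05):
    """Estimate # triangles from the eigenvalues of largest magnitude."""
    lam = np.linalg.eigvalsh(A)              # stand-in for Lanczos
    lam = lam[np.argsort(-np.abs(lam))]      # largest magnitude first
    total = lam[0] ** 3
    for k in range(1, len(lam)):
        cube = lam[k] ** 3
        # Assumed stopping rule: the next cube is negligible vs. the running sum.
        if abs(cube) <= tol * abs(total):
            break
        total += cube
    return total / 6.0

# Complete graph on 4 vertices (4 triangles); eigenvalues are 3, -1, -1, -1.
A = np.ones((4, 4)) - np.eye(4)
print(eigen_triangle(A, tol=1e-6), eigen_triangle(A, tol=0.05))
```

A tight tolerance uses every eigenvalue and recovers the exact count, while a loose one stops after the first eigenvalue and returns a rougher estimate, which is the accuracy-for-speed trade the flowchart describes.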

Page 69: Data Mining with MapReduce: Graph and Tensor Algorithms with Applications

Remark: Even though our theoretical results refer to HOSVD, MACH also works for HOOI.