Transcript
Page 1: Spacey random walks and higher-order data analysis

Spacey random walks for higher-order data analysis

David F. Gleich, Purdue University

May 20, 2016. Joint work with Austin Benson, Lek-Heng Lim, Tao Wu; supported by NSF CAREER CCF-1149756, IIS-1422918, DARPA SIMPLEX.

Papers: arXiv:1602.02102, arXiv:1603.00395

TMA 2016 David Gleich · Purdue 1

Page 2: Spacey random walks and higher-order data analysis

Markov chains, matrices, and eigenvectors have a long relationship. Kemeny and Snell, 1976: “In the land of Oz they never have two nice days in a row. If they have a nice day, they are just as likely to have snow as rain the next day. If they have snow or rain, they have an even chance of having the same the next day. If there is a change from snow or rain, only half of the time is this a change to a nice day.”

[Scanned pages 28–29 of Kemeny and Snell, Finite Markov Chains: Examples 4–7 cover random walks with various boundary behaviors and a random-digit process; Example 8 is the Land of Oz weather chain, a three-state Markov chain with states R, N, and S for rain, nice, and snow, and its 3 × 3 transition matrix.]

Column-stochastic in my talk

P =
        R    N    S
  R [ 1/2  1/2  1/4 ]
  N [ 1/4   0   1/4 ]
  S [ 1/4  1/2  1/2 ]

x = stationary distribution:  x_i = Σ_j P(i, j) x_j,  with x_i ≥ 0 and Σ_i x_i = 1

x = [2/5, 1/5, 2/5]

⇒ x is an eigenvector
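The stationary distribution of the Land of Oz chain can be checked numerically. A minimal sketch in Python (NumPy assumed; the matrix follows the slide, and power iteration is one standard way to find the eigenvector for eigenvalue 1):

```python
import numpy as np

# Column-stochastic Land of Oz transition matrix; rows and columns ordered R, N, S.
P = np.array([[1/2, 1/2, 1/4],
              [1/4, 0.0, 1/4],
              [1/4, 1/2, 1/2]])

# Power iteration: repeatedly apply P; the limit is the stationary distribution.
x = np.ones(3) / 3
for _ in range(200):
    x = P @ x

print(x)  # -> [0.4, 0.2, 0.4], i.e., [2/5, 1/5, 2/5]
```

The second-largest eigenvalue of this matrix has modulus 1/4, so the iteration converges very quickly here.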

Page 3: Spacey random walks and higher-order data analysis

Markov chains, matrices, and eigenvectors have a long relationship.
1.  Start with a Markov chain
2.  Inquire about the stationary distribution
3.  This question gives rise to an eigenvector problem on the transition matrix

X1, X2, ..., Xt, Xt+1, ...

x_i = lim_{N→∞} (1/N) Σ_{t=1}^{N} Ind[X_t = i]

This is the limiting fraction of time the chain spends in state i. In general, Xt will be a stochastic process in this talk.
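The limiting-fraction definition can be illustrated by simulating the Land of Oz chain from the previous slide and comparing empirical visit frequencies to its stationary distribution. A sketch; the seed and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# P[i, j] = Prob(next = i | current = j); the Land of Oz chain, column-stochastic.
P = np.array([[1/2, 1/2, 1/4],
              [1/4, 0.0, 1/4],
              [1/4, 1/2, 1/2]])

N = 100_000
counts = np.zeros(3)
state = 0
for _ in range(N):
    state = rng.choice(3, p=P[:, state])  # sample the next state from column `state`
    counts[state] += 1

freq = counts / N
print(freq)  # close to the stationary distribution [2/5, 1/5, 2/5]
```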

Page 4: Spacey random walks and higher-order data analysis

Higher-order Markov chains are more useful for modern data problems. Higher-order means more history!

Rosvall et al. (Nature Comm. 2014) found
•  Higher-order Markov chains were critical to finding multidisciplinary journals in citation data and patterns in air traffic networks.

Chierichetti et al. (WWW 2012) found
•  Higher-order Markov models capture browsing behavior more accurately than first-order models.

(and more!)

[Excerpt from Rosvall et al., Nature Communications 5:4630 (2014), doi:10.1038/ncomms5630, shown on the slide: methods for assembling pathway data (bigrams and trigrams) into networks with and without memory, significance analysis with bootstrap resampling and surrogate data testing of the Markov order, and community detection with second-order Markov dynamics via the map equation framework. Figure 6 | From pathway data to networks with and without memory. (a) Itineraries weighted by passenger number. (b) Aggregated bigrams for links between physical nodes. (c) Aggregated trigrams for links between memory nodes. (d) Network without memory. (e) Network with memory. Corresponding dynamics in Fig. 1a,b.]

Rosvall et al. 2014


Page 5: Spacey random walks and higher-order data analysis

Stationary dist. of higher-order Markov chains are still matrix eqns.

P(i, j, k) = Prob. of state i given hist. j, k:  P[X_{t+1} = i | X_t = j, X_{t−1} = k] = P(i, j, k)

Convert into a first-order Markov chain on pairs of states:

X_{i,j} = Σ_k P(i, j, k) X_{j,k},   X_{i,j} ≥ 0,   Σ_{i,j} X_{i,j} = 1

x_i = Σ_j X(i, j)   Marginal for the stat. dist.

Last state         |     1      |     2      |     3
Current state      | 1   2   3  | 1   2   3  | 1   2   3
P[next state = 1]  | 0   0   0  | 1/4 0   0  | 1/4 0   3/4
P[next state = 2]  | 3/5 2/3 0  | 1/2 0   1/2| 0   1/2 0
P[next state = 3]  | 2/5 1/3 1  | 1/4 1   1/2| 3/4 1/2 1/4

Page 6: Spacey random walks and higher-order data analysis

Stationary dist. of higher-order Markov chains are still matrix eqns.

[Figure: the nine pair-states (1,1), (2,1), (3,1), (1,2), (2,2), (3,2), (1,3), (2,3), (3,3) with their transition probabilities, and three sample walks with their probabilities.]

Fig. 1. The first-order Markov chain that results from converting the second-order Markov chain in (1.2) into a first-order Markov chain on pairs of states. Here, state (i, j) means that the last state was i and the second-last state was j. Because we are modeling a second-order Markov chain, the only transitions from state (i, j) are to state (k, i) for some k.

probability of transitioning to state i, given that the last state is j and the second-last state was k. So for this example, we have P_{3,2,1} = 1/3. In this case, Σ_i P_{i,j,k} = 1 for all j and k, and we call these stochastic hypermatrices. (For an m-order Markov chain, we will have an (m+1)-order stochastic hypermatrix.) The stationary distribution of a second-order Markov chain, it turns out, is simply computed by converting the second-order Markov chain into a first-order Markov chain on a larger state space given by pairs of states. To see how this conversion takes place, note that if the previous two states were 2 and 1 as in the example above, then we view these as an ordered pair (2, 1). The next pair of states will be either (2, 2) with probability 2/3 or (3, 2) with probability 1/3. Thus, the sequence of states generated by a second-order Markov chain can be extracted from the sequence of states of a first-order Markov chain defined on pairs of states. Figure 1 shows a graphical illustration of the first-order Markov chain that arises from our small example.

The stationary distribution of the first-order chain is an N × N matrix X where X_{ij} is the stationary probability associated with the pair of states (i, j). This matrix satisfies the stationary equations:

X_{ij} = Σ_k P_{i,j,k} X_{jk},   Σ_{i,j} X_{ij} = 1,   X_{ij} ≥ 0,   1 ≤ i, j ≤ N.   (1.3)

(These equations can be further transformed into a matrix and a vector x ∈ R^{N²} if desired, but for our exposition, this is unnecessary.) The conditions when X exists can be deduced by applying the Perron–Frobenius theorem to this more complicated equation. Once the matrix X is found, the stationary distribution over states of the second-order chain is given by the row and column sums of X.

Computing the stationary distribution in this manner requires Θ(N²) storage, regardless of any possible efficiency in storing and manipulating P. For modern problems on large datasets, this is infeasible.¹

¹We wish to mention that our goal is not the stationary distribution of the higher-order chain in a space-efficient manner. For instance, a memory-friendly alternative is directly simulating the

The implicit Markov chain

P(i, j, k) = Prob. of state i given hist. j, k:  P[X_{t+1} = i | X_t = j, X_{t−1} = k] = P(i, j, k)

Last state         |     1      |     2      |     3
Current state      | 1   2   3  | 1   2   3  | 1   2   3
P[next state = 1]  | 0   0   0  | 1/4 0   0  | 1/4 0   3/4
P[next state = 2]  | 3/5 2/3 0  | 1/2 0   1/2| 0   1/2 0
P[next state = 3]  | 2/5 1/3 1  | 1/4 1   1/2| 3/4 1/2 1/4
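The conversion described above can be written out directly: build the 9 × 9 pair-state matrix from the table, find its stationary distribution, and marginalize. A sketch (NumPy, power iteration, and the iteration count are implementation choices, not the paper's code):

```python
import numpy as np

# T[i, j, k] = Prob(next = i | current = j, last = k); the 3x3x3 table from the slide.
T = np.zeros((3, 3, 3))
T[:, 0, 0] = [0,   3/5, 2/5]
T[:, 1, 0] = [0,   2/3, 1/3]
T[:, 2, 0] = [0,   0,   1  ]
T[:, 0, 1] = [1/4, 1/2, 1/4]
T[:, 1, 1] = [0,   0,   1  ]
T[:, 2, 1] = [0,   1/2, 1/2]
T[:, 0, 2] = [1/4, 0,   3/4]
T[:, 1, 2] = [0,   1/2, 1/2]
T[:, 2, 2] = [3/4, 0,   1/4]

n = 3
# First-order chain on pairs: state (j, k) can only move to (i, j), with prob T[i, j, k].
M = np.zeros((n * n, n * n))
for j in range(n):
    for k in range(n):
        for i in range(n):
            M[i * n + j, j * n + k] = T[i, j, k]

# Stationary distribution X of the pair chain by power iteration, then marginalize.
X = np.ones(n * n) / (n * n)
for _ in range(2000):
    X = M @ X
x = X.reshape(n, n).sum(axis=1)  # x_i = sum_j X_(i,j)
print(x)  # stationary distribution over the original three states
```

This is exactly the Θ(N²)-storage computation the text warns about: the pair chain has N² states.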

Page 7: Spacey random walks and higher-order data analysis

Hypermatrices, tensors, and tensor eigenvectors have been used too. Z-eigenvectors (above) proposed by Lim (2005), Qi (2005). Many references to using tensors for data analysis (1970+). Anandkumar et al. 2014:
•  Tensor eigenvector decompositions are optimal to recover latent variable models based on higher-order moments.

tensor A: n × n × n

tensor eigenvector:  Σ_{j,k} A(i, j, k) x_j x_k = λ x_i,   i.e.,  A x² = λ x

Page 8: Spacey random walks and higher-order data analysis

But there were few results connecting hypermatrices, tensors, and higher-order Markov chains.


Page 9: Spacey random walks and higher-order data analysis

Li and Ng proposed a link between tensors and high-order MC:
1.  Start with a higher-order Markov chain
2.  Look at the stationary distribution
3.  Assume/approximate as rank 1
4.  … and we have a tensor eigenvector

Li and Ng 2014.

X_{i,j} = x_i x_j

x_i = Σ_{j,k} P(i, j, k) x_j x_k

X_{i,j} = Σ_k P(i, j, k) X_{j,k},   X_{i,j} ≥ 0,   Σ_{i,j} X_{i,j} = 1

Page 10: Spacey random walks and higher-order data analysis

Li and Ng proposed an algebraic link between tensors and high-order MC. The Li and Ng stationary distribution (Li and Ng 2014)

x_i = Σ_{j,k} P(i, j, k) x_j x_k,   i.e.,  P x² = x

•  Is a tensor z-eigenvector
•  Is non-negative and sums to one
•  Can sometimes be computed [Li and Ng, 14; Chu and Wu, 14; Gleich, Lim, Yu 15]
•  May or may not be unique
•  Almost always exists

Our question: Is there a stochastic process underlying this tensor eigenvector?
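The "can sometimes be computed" above can be seen with the simplest method: fixed-point ("power") iteration x ← P x². A hedged sketch on an illustrative random tensor; convergence is not guaranteed in general, which is exactly why it only sometimes works:

```python
import numpy as np

rng = np.random.default_rng(1)

# An illustrative 4 x 4 x 4 transition tensor: each column P[:, j, k] is a probability vector.
n = 4
P = rng.random((n, n, n))
P /= P.sum(axis=0, keepdims=True)

def apply_P(P, x):
    """Compute (P x^2)_i = sum_{j,k} P[i, j, k] * x[j] * x[k]."""
    return np.einsum('ijk,j,k->i', P, x, x)

# Fixed-point iteration x <- P x^2; P x^2 stays stochastic when x is stochastic.
x = np.ones(n) / n
for _ in range(500):
    x = apply_P(P, x)
    x /= x.sum()  # guard against round-off only

res = np.linalg.norm(apply_P(P, x) - x)
print(x, res)  # res near zero indicates a tensor z-eigenvector with eigenvalue 1
```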

Page 11: Spacey random walks and higher-order data analysis

Intro:
  Markov chain → matrix equation
  X1, X2, ... → P x = x

Li & Ng, Multilinear PageRank:
  Markov chain → matrix equation → approximation
  X1, X2, ... → P x² = x

Our question (is there a stochastic process underlying this tensor eigenvector?):
  Desired stochastic process → approx. equations
  X1, X2, ... → “P X = X” → P x² = x

Page 12: Spacey random walks and higher-order data analysis

The spacey random walk. Consider a higher-order Markov chain

P[X_{t+1} | history] = P[X_{t+1} | X_t = j, X_{t−1} = k]

If we were perfect, we’d figure out the stationary distribution of that. But we are spacey!
•  On arriving at state j, we promptly “space out” and forget we came from k.
•  But we still believe we are “higher-order.”
•  So we invent a state k by drawing a random state from our history.

走神 (“spacing out”), according to my students.

Benson, Gleich, Lim arXiv:2016

Page 13: Spacey random walks and higher-order data analysis

[Figure: the walker at state X_t, with X_{t−1} behind it and Y_t drawn from its history.]

The Spacey Random Walk

Higher-order Markov:  P[X_{t+1} = i | X_t = j, X_{t−1} = k] = P(i, j, k)
Spacey random walk:   P[X_{t+1} = i | X_t = j, Y_t = g] = P(i, j, g)

Key insight: limiting distributions of this process are tensor eigenvectors.

Benson, Gleich, Lim arXiv:2016

Page 14: Spacey random walks and higher-order data analysis

The spacey random walk process. Let C_t(k) = 1 + Σ_{s=1}^{t} Ind{X_s = k} count how often we’ve visited state k in the past, and let F_t be the σ-algebra generated by the history {X_s : 0 ≤ s ≤ t}. Then

P(X_{t+1} = i | F_t) = Σ_k P(i, X_t, k) · C_t(k) / (t + n)

This is a reinforced stochastic process, or a (generalized) vertex-reinforced random walk! Diaconis; Pemantle, 1992; Benaïm, 1997; Pemantle 2007.

Benson, Gleich, Lim arXiv:2016
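The process above is straightforward to simulate. A sketch under illustrative assumptions (a random dense transition tensor, arbitrary seed and length); when the process converges, the occupation vector c approximately satisfies the tensor eigenvector equation P c² = c:

```python
import numpy as np

rng = np.random.default_rng(0)

# P[i, j, k] = Prob(next = i | current = j, guessed past = k); illustrative random columns.
n = 4
P = rng.random((n, n, n))
P /= P.sum(axis=0, keepdims=True)

steps = 20_000
counts = np.ones(n)          # the "+1" smoothing from C_t(k) = 1 + sum Ind{X_s = k}
state = 0
occupation = np.zeros(n)
for t in range(steps):
    y = rng.choice(n, p=counts / counts.sum())   # space out: draw Y_t from the history
    state = rng.choice(n, p=P[:, state, y])      # transition as if the past state were y
    counts[state] += 1
    occupation[state] += 1

c = occupation / steps
residual = np.linalg.norm(np.einsum('ijk,j,k->i', P, c, c) - c)
print(c, residual)  # residual is small (up to Monte Carlo noise) if the walk converged
```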

Page 15: Spacey random walks and higher-order data analysis

Generalized vertex-reinforced random walks (VRRW). A vertex-reinforced random walk at time t transitions according to a Markov matrix M given the observed frequencies of visiting each state (M. Benaïm 1997):

P(X_{t+1} = i | F_t) = [M(t)]_{i, X_t} = [M(c(t))]_{i, X_t},   c(t) = observed frequency

The map c ↦ M(c) from the simplex of probability distributions to Markov chains is key to VRRW: how often we’ve been where determines where we are going to next.

Page 16: Spacey random walks and higher-order data analysis

Stationary distributions of VRRWs correspond to ODEs.

THEOREM [Benaïm, 1997], paraphrased. The sequence of empirical observation probabilities c_t is an asymptotic pseudo-trajectory for the dynamical system

dx/dt = π[M(x)] − x,   where π(M(x)) is the map to the stat. dist.

Thus, convergence of the ODE to a fixed point is equivalent to stationary distributions of the VRRW.
•  M must always have a unique stationary distribution!
•  The map to M must be very continuous
•  Asymptotic pseudo-trajectories satisfy

lim_{t→∞} ‖c(t + T) − x(t + T)‖ = 0,   where x solves the ODE with initial condition x(t) = c(t)

Page 17: Spacey random walks and higher-order data analysis

The Markov matrix for Spacey Random Walks

M(c) = Σ_k P(:, :, k) c_k

This is the transition probability associated with guessing the last state based on history!

A necessary condition for a stationary distribution of dx/dt = π[M(x)] − x (otherwise it makes no sense):

Property B. Let P be an order-m, n-dimensional probability table. Then P has property B if there is a unique stationary distribution associated with all stochastic combinations of the last m − 2 modes. That is, M = Σ_{k,ℓ,...} P(:, :, k, ℓ, ...) λ_{k,ℓ,...} defines a Markov chain with a unique Perron root when all λs are positive and sum to one.

Benson, Gleich, Lim arXiv:2016

Page 18: Spacey random walks and higher-order data analysis

Stationary points of the ODE for the Spacey Random Walk are tensor evecs. With

M(c) = Σ_k P(:, :, k) c_k   and   dx/dt = π[M(x)] − x,

dx/dt = 0  ⇔  π(M(x)) = x  ⇔  M(x) x = x  ⇔  Σ_{j,k} P(i, j, k) x_j x_k = x_i

But not all tensor eigenvectors are stationary points!

Benson, Gleich, Lim arXiv:2016

Page 19: Spacey random walks and higher-order data analysis

Some results on spacey random walk models:
1.  If you give it a Markov chain hidden in a hypermatrix, then it works like a Markov chain.
2.  All 2 × 2 × 2 × … × 2 problems have a stationary distribution (with a few corner cases).
3.  This shows that an “exotic” class of Pólya urns always converges.
4.  Spacey random surfer models have unique stationary distributions in some regimes.
5.  Spacey random walks model Hardy–Weinberg laws in population genetics.
6.  Spacey random walks are a plausible model of taxicab behavior.

Benson, Gleich, Lim arXiv:2016

Page 20: Spacey random walks and higher-order data analysis

All 2-state spacey random walk models have a stationary distribution. If we unfold P(i, j, k) for a 2 × 2 × 2, then

R = [ a   b   c   d ;  1−a  1−b  1−c  1−d ]

M(x) = R(x ⊗ I) = [ c − x₁(c−a)   d − x₁(d−b) ;  1 − c + x₁(c−a)   1 − d + x₁(d−b) ]

π([ p  1−q ; 1−p  q ]) = (1−q)/(2 − p − q)

Key idea: reduce to a one-dimensional ODE.

Benson, Gleich, Lim arXiv:2016

Page 21: Spacey random walks and higher-order data analysis

The one-dimensional ODE has a really simple structure.

[Figure: dx₁/dt as a function of x₁ on [0, 1], with two stable points and one unstable point.]

In general, dx₁/dt(0) ≥ 0 and dx₁/dt(1) ≤ 0, so there must be a stable point by continuity.

Benson, Gleich, Lim arXiv:2016
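The endpoint signs in this argument can be checked directly from the formulas on the previous slide. A small numerical verification; the random choices of (a, b, c, d) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def dx1_dt(x1, a, b, c, d):
    # First column of M(x) is (p, 1-p), second column is (1-q, q), per the unfolding.
    p = c - x1 * (c - a)
    q = 1 - d + x1 * (d - b)
    return (1 - q) / (2 - p - q) - x1   # pi(M(x))_1 - x1

# For random 2 x 2 x 2 models, the ODE points inward at both ends of [0, 1].
for _ in range(100):
    a, b, c, d = rng.random(4)
    assert dx1_dt(0.0, a, b, c, d) >= 0
    assert dx1_dt(1.0, a, b, c, d) <= 0
print("endpoint signs verified for 100 random models")
```

Algebraically, dx₁/dt(0) = d/(1 + d − c) ≥ 0 and dx₁/dt(1) = (a − 1)/(1 + b − a) ≤ 0, matching the slide.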

Page 22: Spacey random walks and higher-order data analysis

With multiple states, the situation is more complicated. If P is irreducible, there always exists a fixed point of the algebraic equation

P x² = x

by Li and Ng 2013 using Brouwer’s theorem.

State of the art computation:
•  Power method [Li and Ng], more analysis in [Chu & Wu; Gleich, Lim, Yu] and more today
•  Shifted iteration, Newton iteration [Gleich, Lim, Yu]

New idea:
•  Integrate the ODE dx/dt = π[M(x)] − x with M(c) = Σ_k P(:, :, k) c_k

Benson, Gleich, Lim arXiv:2016
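The "integrate the ODE" idea can be sketched with plain forward Euler, computing π(M(x)) from the Perron eigenvector at each step. Everything concrete here (the random tensor, step size, iteration count) is an illustrative assumption, not the authors' actual integrator (the talk reports using Matlab's ode45):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 4
P = rng.random((n, n, n))
P /= P.sum(axis=0, keepdims=True)   # columns P[:, j, k] are probability vectors

def stationary(M):
    """Stationary distribution of a column-stochastic matrix via its Perron eigenvector."""
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return np.abs(v) / np.abs(v).sum()

# Forward-Euler integration of dx/dt = pi[M(x)] - x with M(x) = sum_k P[:, :, k] x_k.
# Each step is a convex combination of stochastic vectors, so x stays on the simplex.
x = np.ones(n) / n
h = 0.1
for _ in range(2000):
    M = np.einsum('ijk,k->ij', P, x)
    x = x + h * (stationary(M) - x)

residual = np.linalg.norm(np.einsum('ijk,j,k->i', P, x, x) - x)
print(x, residual)   # a fixed point of the ODE is a tensor eigenvector: P x^2 = x
```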

Page 23: Spacey random walks and higher-order data analysis

Spacey random surfers are a refined model with some structure, akin to the PageRank modification of a Markov chain:
1.  With probability α, follow the spacey random walk.
2.  With probability 1 − α, teleport based on a distribution v.

The solution of

x = α P x² + (1 − α) v

is unique if α < 0.5.

THEOREM (Benson, Gleich, Lim). The spacey random surfer model always has a stationary dist. if α < 0.5. In other words, the ODE

dx/dt = (1 − α)[I − α R(x ⊗ I)]⁻¹ v − x

always converges to a stable point.

Gleich, Lim, Yu, SIMAX 2015; Benson, Gleich, Lim, arXiv:2016

Yongyang Yu Purdue
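Since the map x ↦ αPx² + (1 − α)v is a contraction on the simplex when α < 1/2, a plain fixed-point iteration suffices to solve the surfer equation. A sketch with an illustrative random tensor and uniform v:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 4
P = rng.random((n, n, n))
P /= P.sum(axis=0, keepdims=True)
v = np.ones(n) / n        # teleportation distribution (uniform here, for illustration)
alpha = 0.45              # alpha < 1/2 guarantees a unique solution

# Fixed-point iteration for x = alpha * P x^2 + (1 - alpha) * v.
x = v.copy()
for _ in range(200):
    x = alpha * np.einsum('ijk,j,k->i', P, x, x) + (1 - alpha) * v

residual = np.linalg.norm(
    alpha * np.einsum('ijk,j,k->i', P, x, x) + (1 - alpha) * v - x)
print(x, residual)   # residual near zero: x solves the multilinear PageRank equation
```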

Page 24: Spacey random walks and higher-order data analysis

Some nice open problems in this model:
•  For all the problems we have, Matlab’s ode45 has never failed to converge to an eigenvector (even when all other algorithms will not converge).
•  Can we show that if the power method converges to a fixed point, then the ODE converges? (The converse is false.)
•  There is also a family of models (e.g., pick the “second” state based on history instead of the “third”); how can we use this fact?

Page 25: Spacey random walks and higher-order data analysis

Here’s what we are using spacey random walks to do:
1.  Model the behavior of taxicabs in a large city. Involves fitting transition probabilities to data. Benson, Gleich, Lim arXiv:2016
2.  Cluster higher-order data in a type of “generalized” spectral clustering. Involves a useful asymptotic property of spacey random walks. Benson, Gleich, Leskovec SDM2016; Wu, Benson, Gleich, arXiv:2016

Page 26: Spacey random walks and higher-order data analysis

Taxicabs are a plausible spacey random walk model.

1,2,2,1,5,4,4,…
1,2,3,2,2,5,5,…
2,2,3,3,3,3,2,…
5,4,5,5,3,3,1,…

Model people by locations:
1.  A passenger with location k is drawn at random.
2.  The taxi picks up the passenger at location j.
3.  The taxi drives the passenger to location i with probability P(i, j, k).

Approximate locations by history → spacey random walk.

Beijing Taxi image from Yu Zheng (Urban Computing, Microsoft Asia); image from nyc.gov

Benson, Gleich, Lim arXiv:2016

Page 27: Spacey random walks and higher-order data analysis

NYC Taxi Data support the spacey random walk hypothesis. One year of 1000 taxi trajectories in NYC. States are neighborhoods in Manhattan. P(i, j, k) = probability of the taxi going from j to i when the passenger is from location k.

Evaluation            RMSE
First-order Markov    0.846
Second-order Markov   0.835
Spacey                0.835

Benson, Gleich, Lim arXiv:2016

Page 28: Spacey random walks and higher-order data analysis

A property of spacey random walks makes the connection to clustering. Spacey random walks (with stationary distributions) are asymptotically Markov chains:
•  once the occupation vector c converges, future transitions are according to the Markov chain M(c).

This makes a connection to clustering:
•  spectral clustering methods can be derived by looking for partitions of reversible Markov chains (and research is on non-reversible ones too).

We had an initial paper on using this idea for “motif-based clustering” of a graph, but there is a much better technique we have now.

Benson, Leskovec, Gleich. SDM 2015; Wu, Benson, Gleich. arXiv:2016

Jure Leskovec, Stanford

Page 29: Spacey random walks and higher-order data analysis

Given data bricks, we can cluster them using these ideas, with one more step.

A cube: [i1, i2, …, in]³.  A brick: [i1, i2, …, i_{n1}] × [j1, j2, …, j_{n2}] × [k1, k2, …, k_{n3}].

If the data is a symmetric cube, we can normalize it to get a transition tensor. If the data is a brick, we symmetrize using Ragnarsson and van Loan’s idea, a generalization of

A → [ 0  A ; Aᵀ  0 ]

Wu, Benson, Gleich arXiv:2016

Page 30: Spacey random walks and higher-order data analysis

The clustering methodology:
1.  Symmetrize the brick (if necessary).
2.  Normalize to be a column-stochastic tensor.
3.  Estimate the stationary distribution of the spacey random walk (spacey random surfer), or a generalization (super-spacey RW).
4.  Form the asymptotic Markov model.
5.  Bisect using eigenvectors or properties of that asymptotic Markov model; then recurse.
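Steps 1–2 of the methodology can be sketched for a dense cube. The random data, seed, and symmetrization over all index permutations are illustrative assumptions; real data is sparse, which needs more care (that is what the super-spacey variant addresses):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(5)

n = 5
A = rng.random((n, n, n))                                  # a nonnegative data cube (e.g., counts)
S = sum(A.transpose(p) for p in permutations(range(3)))    # step 1: fully symmetric version

# Step 2: normalize each column S[:, j, k] into a probability vector.
col_sums = S.sum(axis=0, keepdims=True)
col_sums[col_sums == 0] = 1.0    # guard for empty columns (none here; matters for sparse data)
P = S / col_sums

print(P.shape)   # (5, 5, 5): a column-stochastic transition tensor, ready for step 3
```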

Page 31: Spacey random walks and higher-order data analysis

Clustering airport-airport-airline networks

UNCLUSTERED (no structure apparent) → CLUSTERED (diagonal structure evident) for the airport-airport-airline network.

Name             Airports  Airlines  Notes
World Hubs          250       77     Beijing, JFK
Europe              184       32     Europe, Morocco
United States       137        9     U.S. and Cancun
China/Taiwan        170       33     China, Taiwan, Thailand
Oceania/SE Asia     302       52     Canadian airlines too
Mexico/Americas     399       68

Page 32: Spacey random walks and higher-order data analysis

Clusters in symmetrized three-gram and four-gram data

Data: 3-gram and 4-gram data from COCA (ngrams.info)

“Best clusters”: pronouns & articles (the, we, he, …); prepositions & linking verbs (in, of, as, to, …)

Fun 3-gram clusters
{cheese, cream, sour, low-fat, frosting, nonfat, fat-free}
{bag, plastic, garbage, grocery, trash, freezer}
{church, bishop, catholic, priest, greek, orthodox, methodist, roman, priests, episcopal, churches, bishops}

Fun 4-gram clusters
{german, chancellor, angela, merkel, gerhard, schroeder, helmut, kohl}


Page 33: Spacey random walks and higher-order data analysis

Clusters in 3-gram Chinese text


社会 – society – economy – develop – “ism”

国家 – nation
政府 – government

We also get stop words in the Chinese text (frequently occurring words). But then we also get some strange words. The reason: Google’s Chinese corpus has a bias in its books.

Page 34: Spacey random walks and higher-order data analysis

One more problem

Tensor methods for network alignment

Network alignment is the problem of computing an approximate isomorphism between two networks. In collaboration with Mohsen Bayati, Amin Saberi, Ying Wang, and Margot Gerritsen, the PI has developed a state-of-the-art belief propagation method (Bayati et al., 2009).

FIGURE 6 – Previous work from the PI tackled network alignment with matrix methods for edge overlap:

[figure: an edge (i, j) in A matched through L to an edge (i′, j′) in B; overlapping edges highlighted]

This proposal is for matching triangles using tensor methods:

[figure: a triangle (i, j, k) in A matched through L to a triangle (i′, j′, k′) in B; matched triangles highlighted]

If x_i, x_j, and x_k are indicators associated with the edges (i, i′), (j, j′), and (k, k′), then we want to include the product x_i x_j x_k in the objective, yielding a tensor problem.

We propose to study tensor methods to perform network alignment with triangle and other higher-order graph moment matching. Similar ideas were proposed by Svab (2007); Chertok and Keller (2010) also proposed using triangles to aid in network alignment problems. In Bayati et al. (2011), we found that triangles were a key missing component in a network alignment problem with a known solution. Given that preserving a triangle requires three edges between two graphs, this yields a tensor problem:

maximize   Σ_{i∈L} w_i x_i  +  Σ_{i∈L} Σ_{j∈L} x_i x_j S_{i,j}  +  Σ_{i∈L} Σ_{j∈L} Σ_{k∈L} x_i x_j x_k T_{i,j,k}   (the triple sum is the triangle overlap term)
subject to  x is a matching.

Here, T_{i,j,k} = 1 when the edges corresponding to i, j, and k in L result in a triangle in the induced matching. Maximizing this objective is an intractable problem. We plan to investigate a heuristic based on a rank-1 approximation of the tensor T and a maximum-weight matching based rounding. Similar heuristics have been useful in other matrix-based network alignment algorithms (Singh et al., 2007; Bayati et al., 2009). The work involves extending the symmetric shifted higher-order power method due to Kolda and Mayo (2011) to incredibly large and sparse tensors. On this aspect, we plan to collaborate with Tamara G. Kolda. In an initial evaluation of this triangle matching on synthetic problems, using the tensor rank-1 approximation alone produced results that identified the correct solution, whereas all matrix approaches could not.
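A toy version of the rank-1 heuristic (my sketch, not the proposed method: an unshifted higher-order power method in place of SS-HOPM, and greedy rounding in place of maximum-weight matching):

```python
import numpy as np

def tensor_power_method(T, niter=100):
    """Rank-1 heuristic: a dominant tensor eigenvector of a symmetric tensor T
    via the basic higher-order power method (SS-HOPM adds a shift for robustness)."""
    n = T.shape[0]
    x = np.ones(n) / np.sqrt(n)
    for _ in range(niter):
        y = np.einsum('ijk,j,k->i', T, x, x)   # y_i = sum_{j,k} T[i,j,k] x_j x_k
        nrm = np.linalg.norm(y)
        if nrm == 0:
            break
        x = y / nrm
    return x

def greedy_matching(scores):
    """Round a score matrix (rows: nodes of A, cols: nodes of B) to a matching,
    greedily by score (a cheap stand-in for max-weight matching rounding)."""
    nA, nB = scores.shape
    match, used_b = {}, set()
    for idx in np.argsort(-scores, axis=None):
        i, j = divmod(int(idx), nB)
        if i not in match and j not in used_b:
            match[i] = j
            used_b.add(j)
    return match

# sanity check: the power method recovers the factor of a rank-1 tensor
v = np.array([3.0, 2.0, 1.0]); v /= np.linalg.norm(v)
T = np.einsum('i,j,k->ijk', v, v, v)
x = tensor_power_method(T)
```

In the real setting x has one entry per candidate pair in L, so x is reshaped into a score matrix over node pairs before rounding.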

vision for the future

All of these projects fit into the PI’s vision for modernizing the matrix-computation paradigm to match the rapidly evolving space of network computations. This vision extends beyond the scope of the current proposal. For example, the web is a huge network with over one trillion unique URLs (Alpert and Hajaj, 2008), and search engines have indexed over 180 billion of them (Cuil, 2009). Yet, why do we need to compute with the entire network? By way of analogy, note that we do not often solve partial differential equations or model macro-scale physics by explicitly simulating the motion or interaction of elementary particles. We need something equivalent for the web and other large networks. Such investigations may take many forms: network models, network geometry, or network model reduction. It is the vision of the PI that the language, algebra, and methodology of matrix computations will …

Triangular Alignment (TAME): A Tensor-based Approach for Higher-order Network Alignment
Joint with Shahin Mohammadi, Ananth Grama, and Tamara Kolda
http://arxiv.org/abs/1510.06482

max_x  xᵀ(A ⊗ B)x   s.t. ‖x‖ = 1    (A, B are edge adjacency matrices)
max_x  (A ⊗ B)x³    s.t. ‖x‖ = 1    (A, B are triangle hypergraph adjacencies)

“Solved” with x of dimension 86 million; A ⊗ B has 5 trillion non-zeros.
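One reason this scale is feasible (my illustration, not the TAME code itself): the Kronecker product is never formed explicitly; matrix-vector products can use the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 4))    # adjacency-like matrix for one network
B = rng.random((3, 3))    # ... and for the other
X = rng.random((3, 4))    # the iterate x, reshaped: x = vec(X), column-major

# explicit Kronecker product: only feasible at toy scale
explicit = np.kron(A, B) @ X.flatten(order='F')

# implicit product: (A kron B) vec(X) = vec(B X A^T), never forming A kron B
implicit = (B @ X @ A.T).flatten(order='F')
```

The implicit form costs two small matrix products instead of one enormous matrix-vector product; the triangle (tensor) version exploits sparsity similarly.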

Page 35: Spacey random walks and higher-order data analysis

www.cs.purdue.edu/homes/dgleich

Summary
Spacey random walks are a new type of stochastic process that provides a direct interpretation of the tensor eigenvectors of the probability tables of higher-order Markov chains.

We are excited!
•  Many potential new applications of the spacey random walk process
•  Many open theoretical questions for us (and others) to follow up on.

Code
https://github.com/dgleich/mlpagerank
https://github.com/arbenson/tensor-sc
https://github.com/arbenson/spacey-random-walks
https://github.com/wutao27/GtensorSC

Papers
Gleich, Lim, Yu. Multilinear PageRank, SIMAX 2015
Benson, Gleich, Leskovec. Tensor spectral clustering, SDM 2015
Benson, Gleich, Lim. Spacey random walks, arXiv:1602.02102
Wu, Benson, Gleich. Tensor spectral co-clustering, arXiv:1603.00395
