Bipartite Edge Prediction via Transductive Learning over Product Graphs

Hanxiao Liu, Yiming Yang

School of Computer Science, Carnegie Mellon University

July 8, 2015

ICML 2015


Outline

1 Problem Description

2 The Proposed Framework

3 Formulation: Product Graph Construction; Graph-based Transductive Learning

4 Optimization

5 Experiment

6 Conclusion



Problem Description

Many applications involve predicting the edges of a bipartite graph.

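[Figure: a bipartite graph whose left vertices (I, II) come from Graph G and whose right vertices (A, B, C) come from Graph H. Two edges carry observed labels (-2 and +5); the remaining edges are unknown (?).]

1 Recommender System
2 Host-Pathogen Interaction
3 Question-Answering Mapping
4 Citation Network
. . .

Sometimes, vertex sets on both sides are intrinsically structured.

Heterogeneous info: G + H + partial observations

Combine them to make better edge predictions?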




The Proposed Framework

[Figure: the bipartite graph between Graph G (vertices I, II) and Graph H (vertices A, B, C), with labeled edges (-2, +5) in red and unlabeled edges (?) in gray.]

Transductive learning should be effective
1 Labeled edges (red) are highly sparse
2 Unlabeled edges (gray) are massively available

Assumption: similar edges should have similar labels

Prerequisite: a similarity measure among the edges, i.e. a “Graph of Edges” (not directly provided)

Can be induced from G and H via Graph Product!








The “Graph of Edges” can be induced by taking the product of G and H

In the product graph G ◦ H
Each Vertex ∼ edge (in the original bipartite graph)
Each Edge ∼ edge-edge similarity

The adjacency matrix of the product graph is defined by “◦” (to be discussed later).








Problem Mapping

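Edge Prediction (Original Problem)
Given G, H and labeled edges, predict the unlabeled edges

[Figure: the bipartite graph between G (I, II) and H (A, B, C), with edge (I, A) labeled -2, edge (II, B) labeled +5, and the other edges unknown.]

Vertex Prediction (Equivalent Problem)
Given G ◦ H and labeled vertices, predict the unlabeled vertices

Vertex of G ◦ H   Label
(I, A)            -2
(I, B)            ?
(I, C)            ?
(II, A)           ?
(II, B)           +5
(II, C)           ?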
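The mapping from edges to product-graph vertices can be sketched in a few lines. This is only an illustration: the row-major indexing convention and the variable names (`g_vertices`, `h_vertices`, `observed`) are assumptions made here, not taken from the slides.

```python
import numpy as np

g_vertices = ["I", "II"]          # vertex set of G
h_vertices = ["A", "B", "C"]      # vertex set of H
observed = {("I", "A"): -2.0, ("II", "B"): 5.0}   # labeled edges from the example

m, n = len(g_vertices), len(h_vertices)

# F[i, j] holds the label of edge (i, j); NaN marks an unlabeled edge.
F = np.full((m, n), np.nan)
for (g, h), label in observed.items():
    F[g_vertices.index(g), h_vertices.index(h)] = label

# Row-major flattening: edge (i, j) becomes product-graph vertex i * n + j.
f = F.ravel()
product_vertices = [(g, h) for g in g_vertices for h in h_vertices]
labeled_mask = ~np.isnan(f)
```

Each of the m·n candidate edges becomes one vertex of G ◦ H, so predicting the NaN entries of `f` is the vertex prediction problem stated above.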


Formulation



Product Graph Construction






Q: When should vertex (i, j) ∼ (i′, j′) in the product graph?

Tensor GP: i ∼ i′ in G AND j ∼ j′ in H

Cartesian GP: (i ∼ i′ in G AND j = j′) OR (i = i′ AND j ∼ j′ in H)

Can be trivially generalized to weighted graphs.

To compute the adjacency matrices of the PG:

$$G \circ_{\text{Tensor}} H = G \otimes H \quad \text{(Kronecker, a.k.a. Tensor, Product)}$$

$$G \circ_{\text{Cartesian}} H = G \otimes I + I \otimes H = G \oplus H \quad \text{(Kronecker Sum)}$$
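As a small sanity check of the two constructions, the following sketch builds both product graphs with NumPy for two toy path graphs. The toy graphs and the helper `idx` are hypothetical; vertex (i, j) is assumed to sit at row-major index i * n + j.

```python
import numpy as np

# Toy graphs (hypothetical): G is the path 0 - 1, H is the path 0 - 1 - 2.
G = np.array([[0, 1],
              [1, 0]])
H = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
m, n = G.shape[0], H.shape[0]

# Tensor GP: Kronecker product of the adjacency matrices.
tensor_pg = np.kron(G, H)

# Cartesian GP: Kronecker sum G (x) I_n + I_m (x) H.
cartesian_pg = np.kron(G, np.eye(n, dtype=int)) + np.kron(np.eye(m, dtype=int), H)

def idx(i, j):
    """Index of product-graph vertex (i, j) under row-major ordering."""
    return i * n + j
```

For instance, vertices (0, 0) and (1, 1) are adjacent in the Tensor GP because both coordinates move along an edge, while in the Cartesian GP they are not, since exactly one coordinate must stay fixed.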















Both GPs can be written in the form of spectral decomposition, where (λ_i, u_i) and (µ_j, v_j) are the eigenpairs of G and H:

$$G \circ_{\text{Tensor}} H = \sum_{i,j} \underbrace{(\lambda_i \times \mu_j)}_{\text{soft AND}} (u_i \otimes v_j)(u_i \otimes v_j)^\top \tag{1}$$

$$G \circ_{\text{Cartesian}} H = \sum_{i,j} \underbrace{(\lambda_i + \mu_j)}_{\text{soft OR}} (u_i \otimes v_j)(u_i \otimes v_j)^\top \tag{2}$$

The interplay of graphs is captured by the interplay of their spectrum!


Generalization: Spectral Graph Product

$$G \circ H \overset{\text{def}}{=} \sum_{i,j} (\lambda_i \circ \mu_j)(u_i \otimes v_j)(u_i \otimes v_j)^\top \tag{3}$$

where “◦” can be an arbitrary binary operator (“×”, “+”, . . . )
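A minimal sketch of definition (3), assuming symmetric adjacency matrices so that `np.linalg.eigh` applies. The function name `spectral_graph_product` and the toy graphs are illustrative choices, not from the slides; the dense mn × mn result is built only to compare against the Kronecker product and Kronecker sum.

```python
import numpy as np

def spectral_graph_product(G, H, op):
    """Build sum_{i,j} op(lam_i, mu_j) (u_i kron v_j)(u_i kron v_j)^T for symmetric G, H."""
    lam, U = np.linalg.eigh(G)        # G = U diag(lam) U^T
    mu, V = np.linalg.eigh(H)         # H = V diag(mu) V^T
    W = np.kron(U, V)                 # column i * n + j is u_i kron v_j
    theta = op(lam[:, None], mu[None, :]).ravel()   # combined spectrum, same ordering
    return (W * theta) @ W.T          # scale each column by its eigenvalue

# Toy symmetric graphs (hypothetical data).
G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])

tensor_pg = spectral_graph_product(G, H, np.multiply)   # op = "x"
cartesian_pg = spectral_graph_product(G, H, np.add)     # op = "+"
```

Passing `np.multiply` recovers G ⊗ H and passing `np.add` recovers G ⊕ H, which matches equations (1) and (2); any other elementwise binary function can be passed the same way.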









Commutative Property: G ◦ H and H ◦ G are isomorphic.


Graph-based Transductive Learning






With the product graph $A \overset{\text{def}}{=} G \circ H$ constructed, we solve a standard graph-based transductive learning problem over A

Learning Objective

$$\min_f \; \underbrace{\ell(f)}_{\text{Loss Function}} + \underbrace{\lambda f^\top A^{-1} f}_{\text{Graph Regularization}} \tag{4}$$

$f_i$: system-predicted value for vertex i in A

$\ell(f)$ quantifies the gap between f and partially observed labels.

$\lambda f^\top A^{-1} f$ quantifies the smoothness over the graph

Underlying assumption: $f \sim \mathcal{N}(0, A)$
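To make objective (4) concrete, here is a sketch under one specific assumption: the loss ℓ is taken to be squared error on the labeled vertices. The slides do not fix ℓ, so this choice, the function name `transduce_squared_loss`, and the toy kernel are all illustrative. With that loss the minimizer has a closed form that avoids inverting A.

```python
import numpy as np

def transduce_squared_loss(A, y, labeled, lam):
    """Minimize sum over labeled i of (f_i - y_i)^2 + lam * f^T A^{-1} f.

    Assumes A is symmetric positive definite and lam > 0.
    """
    M = np.diag(labeled.astype(float))   # selects the labeled vertices
    n = A.shape[0]
    # Stationarity: M (f - y) + lam * A^{-1} f = 0. Left-multiplying by A
    # gives (A M + lam I) f = A M y, so A^{-1} is never formed.
    return np.linalg.solve(A @ M + lam * np.eye(n), A @ M @ y)

# Toy kernel (hypothetical): A = (I + L)^{-1} for a 4-vertex path graph.
adj = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
L = np.diag(adj.sum(axis=1)) - adj
A = np.linalg.inv(np.eye(4) + L)

y = np.array([1.0, 0.0, 0.0, -1.0])             # entries at unlabeled vertices are ignored
labeled = np.array([True, False, False, True])
f = transduce_squared_loss(A, y, labeled, lam=0.1)

# Residual of the stationarity condition, using A^{-1} = I + L for this toy kernel.
residual = labeled * (f - y) + 0.1 * (np.eye(4) + L) @ f
```

In this toy run the two unlabeled middle vertices receive values that interpolate between their labeled neighbors, which is the smoothness behavior the regularizer is meant to enforce.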










The enhanced learning objective

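$$\min_f \; \underbrace{\ell(f)}_{\text{Loss Function}} + \underbrace{\lambda f^\top \kappa(A)^{-1} f}_{\text{Graph Regularization}} \tag{5}$$

to incorporate a variety of graph transduction patterns:

k-step Random Walk: $\kappa(A) = A^k$

Regularized Laplacian: $\kappa(A) = (\epsilon I - A)^{-1} = I + A + A^2 + A^3 + \cdots$ (the series form holds for $\epsilon = 1$ when the spectral radius of A is below 1)

Diffusion Process: $\kappa(A) = \exp(A) \equiv I + A + \frac{1}{2!}A^2 + \frac{1}{3!}A^3 + \cdots$

All can be viewed as transforming the spectrum of $A := \sum_i \theta_i u_i u_i^\top$

$$A^k = \sum_i \theta_i^k u_i u_i^\top \qquad (\epsilon I - A)^{-1} = \sum_i \frac{1}{\epsilon - \theta_i} u_i u_i^\top \qquad \exp(A) = \sum_i e^{\theta_i} u_i u_i^\top$$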
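The three spectral identities can be checked numerically. The sketch below assumes a symmetric A whose eigenvalues lie below ε; the helper name `spectral_transform` and the toy matrix are hypothetical.

```python
import numpy as np

def spectral_transform(A, kappa):
    """Apply a scalar function kappa to the eigenvalues of a symmetric matrix A."""
    theta, U = np.linalg.eigh(A)
    return (U * kappa(theta)) @ U.T

# Toy symmetric matrix (hypothetical), scaled so all |eigenvalues| < 1.
A = 0.2 * np.array([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
eps, k = 1.0, 3

random_walk = spectral_transform(A, lambda t: t ** k)            # A^k
reg_laplacian = spectral_transform(A, lambda t: 1.0 / (eps - t)) # (eps I - A)^{-1}
diffusion = spectral_transform(A, np.exp)                        # exp(A)

# Truncated Taylor series I + A + A^2/2! + ... as an independent reference for exp(A).
taylor = np.zeros_like(A)
term = np.eye(3)
for p in range(1, 25):
    taylor += term
    term = term @ A / p
```

Each transform touches only the eigenvalues, which is what later allows κ to be applied to the product graph without ever forming it.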











Optimization

Transductive Learning over Product Graph

$$\min_f \; \ell(f) + \lambda \underbrace{f^\top \kappa(A)^{-1} f}_{r(f)} \tag{6}$$

Challenge: $\kappa(A) = \kappa(\underbrace{G}_{m \times m} \circ \underbrace{H}_{n \times n})$ is a huge $mn \times mn$ matrix!

Prohibitive to load it into memory → No need to store κ(A)

Prohibitive to compute its inverse → No need for a matrix inverse

Even if $\kappa(A)^{-1}$ is given, it is expensive to compute $\nabla r(f)$ naively → Can be performed much more efficiently
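One way to see why the huge matrix is never needed is to evaluate r(f) and its gradient directly from the eigendecompositions of G and H. The sketch below is an illustration under the assumptions that G and H are symmetric and that f = vec(F) uses row-major ordering; the function name and the toy data are hypothetical, and the dense reference at the end is feasible only because the example is tiny.

```python
import numpy as np

def regularizer_and_gradient(F, G, H, op, kappa):
    """Evaluate r(f) = f^T kappa(G o H)^{-1} f and its gradient for f = vec(F).

    Only the m x m and n x n eigendecompositions are used; the mn x mn
    matrix kappa(A) is never formed.
    """
    lam, U = np.linalg.eigh(G)                      # G = U diag(lam) U^T
    mu, V = np.linalg.eigh(H)                       # H = V diag(mu) V^T
    inv_spec = 1.0 / kappa(op(lam[:, None], mu[None, :]))  # spectrum of kappa(A)^{-1}
    C = U.T @ F @ V                                 # coordinates of F in the eigenbasis
    r = float(np.sum(inv_spec * C ** 2))
    grad = 2.0 * U @ (inv_spec * C) @ V.T           # gradient, same shape as F
    return r, grad

# Toy symmetric graphs (hypothetical data): m = 2, n = 3.
G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
F = np.arange(6, dtype=float).reshape(2, 3)

# Diffusion (kappa = exp) over the Cartesian product graph (op = +).
r, grad = regularizer_and_gradient(F, G, H, np.add, np.exp)

# Dense reference for comparison.
A = np.kron(G, np.eye(3)) + np.kron(np.eye(2), H)
theta, W = np.linalg.eigh(A)
K_inv = (W * np.exp(-theta)) @ W.T                  # exp(A)^{-1}
f = F.ravel()
r_dense = f @ K_inv @ f
grad_dense = (2.0 * K_inv @ f).reshape(2, 3)
```

The matrix-free version costs two small eigendecompositions plus a few m × n products, whereas the dense reference needs the full mn × mn matrix.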



Keys for complexity reduction

1 Work on eigenvalues instead of matrices:
κ only manipulates eigenvalues
◦ only manipulates the interplay of eigenvalues

2 The “vec” trick:
Bottleneck: multiplication $(X \otimes Y) f$
$f = \mathrm{vec}(F)$, where $F_{ij} \overset{\text{def}}{=}$ system-predicted score for edge (i, j)

$$\underbrace{(X \otimes Y) f}_{O(m^2 n^2)\ \text{time/space}} = (X \otimes Y)\,\mathrm{vec}(F) \equiv \underbrace{\mathrm{vec}(X F Y^\top)}_{O(mn(m+n))\ \text{time},\ O((m+n)^2)\ \text{space}} \tag{7}$$
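Identity (7) is easy to confirm numerically. In this sketch vec is assumed to be row-major flattening (NumPy's default `ravel`), which is the convention under which the identity takes the form vec(X F Yᵀ); the random test matrices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.standard_normal((m, m))
Y = rng.standard_normal((n, n))
F = rng.standard_normal((m, n))

f = F.ravel()                     # row-major vec(F), length m * n

# Naive route: materialize the (mn x mn) Kronecker product.
naive = np.kron(X, Y) @ f

# vec trick: two small matrix products, no Kronecker product formed.
fast = (X @ F @ Y.T).ravel()
```

With column-major vec the same identity would read (X ⊗ Y) vec(F) = vec(Y F Xᵀ), so the flattening order has to match the formula being used.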



Optimization with Low-rank Constraint

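Further speedup is possible by factorizing F into two low-rank matrices

The cost of each alternating gradient step is proportional to rank(F) · rank(Σ)

Σ: a “Characteristic Matrix” where $\Sigma_{ij} = \frac{1}{\kappa(\lambda_i \circ \mu_j)}$

An interesting observation: rank(Σ) is usually a small constant!

Example: Diffusion process over the Cartesian PG

$$\Sigma = \begin{bmatrix} e^{-(\lambda_1+\mu_1)} & \cdots & e^{-(\lambda_1+\mu_n)} \\ \vdots & \ddots & \vdots \\ e^{-(\lambda_m+\mu_1)} & \cdots & e^{-(\lambda_m+\mu_n)} \end{bmatrix} = \begin{bmatrix} e^{-\lambda_1} \\ \vdots \\ e^{-\lambda_m} \end{bmatrix} \begin{bmatrix} e^{-\mu_1} & \cdots & e^{-\mu_n} \end{bmatrix} \implies \mathrm{rank}(\Sigma) = 1$$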
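The rank-1 claim for the diffusion/Cartesian case can be verified on toy spectra. The graphs below are hypothetical, and the last lines only illustrate one consequence of the factorization; they are not the alternating gradient step itself.

```python
import numpy as np

# Toy symmetric graphs (hypothetical data).
G = np.array([[0., 1.], [1., 0.]])
H = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
lam = np.linalg.eigvalsh(G)       # eigenvalues of G
mu = np.linalg.eigvalsh(H)        # eigenvalues of H

# Sigma_ij = 1 / kappa(lam_i o mu_j) with kappa = exp and o = +.
Sigma = 1.0 / np.exp(lam[:, None] + mu[None, :])

# The same matrix as an outer product of two vectors.
a, b = np.exp(-lam), np.exp(-mu)
rank = np.linalg.matrix_rank(Sigma)

# With a rank-1 Sigma, the elementwise product Sigma * C reduces to
# a row scaling by a followed by a column scaling by b.
C = np.arange(6, dtype=float).reshape(2, 3)
scaled = (a[:, None] * C) * b[None, :]
```

The factorization works because exp turns the sum λ_i + µ_j into a product of a term depending only on i and a term depending only on j.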






Experiment



Datasets and Baselines

Datasets

Dataset          G              H
Movielens-100K   Users          Movies
Cora             Publications   Publications
Courses          Courses        Prerequisite Courses

Baselines

MC: Matrix Completion. Ignores the info of G and H.
TK: Tensor Kernel. Implicitly constructs the PG, no transduction.
GRMC: Graph Regularized Matrix Completion. Transduction over G and H, no PG constructed.



Results

Performance of several interesting combinations of ◦ and κ

Dataset     Graph Transduction   Graph Product   MAP     AUC     ndcg@3
Courses     Random Walk          Tensor          0.488   0.827   0.461
Courses     Diffusion            Cartesian       0.518   0.872   0.500
Courses     von-Neumann          Tensor          0.472   0.861   0.449
Courses     von-Neumann          Cartesian       0.366   0.531   0.359
Courses     Sigmoid              Cartesian       0.443   0.617   0.431
Cora        Random Walk          Tensor          0.222   0.764   0.205
Cora        Diffusion            Cartesian       0.256   0.884   0.232
Cora        von-Neumann          Tensor          0.230   0.853   0.211
Cora        von-Neumann          Cartesian       0.218   0.633   0.212
Cora        Sigmoid              Cartesian       0.192   0.443   0.188
MovieLens   Random Walk          Tensor          -       -       0.7695
MovieLens   Diffusion            Cartesian       -       -       0.7702
MovieLens   von-Neumann          Tensor          -       -       0.7720
MovieLens   von-Neumann          Cartesian       -       -       0.7624
MovieLens   Sigmoid              Cartesian       -       -       0.7650



Results

Proposed method (Diffusion + Cartesian GP) vs. Baselines

Dataset     Method     MAP     AUC     ndcg@3
Courses     MC         0.319   0.758   0.294
Courses     GRMC       0.366   0.777   0.343
Courses     TK         0.449   0.810   0.446
Courses     Proposed   0.490   0.838   0.473
Cora        MC         0.101   0.697   0.086
Cora        GRMC       0.115   0.702   0.101
Cora        TK         0.248   0.872   0.231
Cora        Proposed   0.268   0.894   0.243
MovieLens   MC         -       -       0.748
MovieLens   GRMC       -       -       0.752
MovieLens   TK         -       -       0.718
MovieLens   Proposed   -       -       0.765





Conclusion

Summary

Problem: Predicting the missing edges of a bipartite graph with graph-structured vertex sets on both sides.

Contribution: A novel approach via transductive learning over the product graph, an efficient algorithmic solution, and good results.

On-going Work

Extend to k Graphs (k > 2)
Bipartite Graph → k-partite Graph
Edge → Hyperedge

Determine the “optimal” graph product for any given problem.


