Page 1: Relational machine-learning

Relational Machine Learning: Applications and Models

Bhushan Kotnis

Heidelberg University

Apr 12, 2017

Page 2: Relational machine-learning

Table of contents

1. Introduction

2. Models

Page 3: Relational machine-learning

Introduction

Page 4: Relational machine-learning

Networks and Graphs

• Social Networks: Link Prediction, Relevant Ads, Feed Recommendation.

• Biological Networks: Gene Ontology, Protein Interaction Networks, Cellular Networks.

• Financial Networks: Assessing risk and exposure, providing information, detecting fraud.

• Knowledge Graphs: Background knowledge for AI, intelligent search engines.

Page 8: Relational machine-learning

Social Networks

• Problem: Rank ads/feeds, suggest relevant articles.

• Users are connected to one another and share interests, demographic data, and news preferences.

• Linked machine learning problem: Predict ads, article recommendations, feeds, etc. using a unified model.

Page 11: Relational machine-learning

Genetic Regulatory Network

• Gene Regulatory Network: A molecular interaction network in which genes interact with proteins and other molecules.

• Problem: Infer the family and function of a gene from its interactions; identify mutations leading to diseases.

• Link prediction problem: A linked ML problem, because each prediction depends on other predictions.

Page 14: Relational machine-learning

Financial Networks

• Interconnected banks, companies, commodities, products,events, people, locations.

• Problem: Infer missing connections for estimating exposure.

• Problem: Reasoning using path correlations.

Page 17: Relational machine-learning

Knowledge Graphs

Page 18: Relational machine-learning

The KGC Problem

• Knowledge Graph: A set G of triples (s, r, t), with s, t ∈ E and r ∈ R.

• Ranking Problem: Given a query (s, r, ?) and a target set e1, e2, …, en, rank the targets by the plausibility of relation r existing between s and ei (a toy ranking sketch follows this list).

• (Frankfurt, cityliesonriver, ?) Choices: Rhine, Mosel, Thames, Main, Hudson.

• (user_id_201345, user_prefers_genre, ?) Choices: Fiction, Non-Fiction, Horror, Romance, Fantasy.

• (TP53, disease, ?) Choices: none, Breast Cancer, Liver Cancer, Lung Cancer.
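
As a rough illustration (not from the slides themselves), here is a toy numpy sketch of the ranking setup: score every candidate target for a query (s, r, ?) and sort. The bilinear scorer, embedding size, and all parameters are hypothetical, untrained stand-ins; the following slides define concrete choices of f.

```python
import numpy as np

# Toy sketch: answer (Frankfurt, cityliesonriver, ?) by ranking candidates.
rng = np.random.default_rng(0)
d = 8
entities = ["Frankfurt", "Rhine", "Mosel", "Thames", "Main", "Hudson"]
emb = {e: rng.normal(size=d) for e in entities}  # entity embeddings
W_r = rng.normal(size=(d, d))  # stand-in parameters for cityliesonriver

def f(s, W, t):
    """Plausibility score of the triple (s, r, t) under a bilinear model."""
    return emb[s] @ W @ emb[t]

candidates = ["Rhine", "Mosel", "Thames", "Main", "Hudson"]
ranked = sorted(candidates, key=lambda t: f("Frankfurt", W_r, t), reverse=True)
print(ranked)  # untrained parameters, so this ordering is arbitrary
```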

Page 23: Relational machine-learning

Models

Page 24: Relational machine-learning

Recommendation Engines

• Recommend movies. $u_i$ is a vector representing user $i$ and $v_j$ a vector representing product $j$, with $u, v \in \mathbb{R}^d$.

• Minimize $\sum_{i,j} (r_{i,j} - u_i^T v_j)^2 + \text{regularizer}$ (a minimal SGD sketch follows this list).

• If the rating $r_{i,j}$ is very high, then we want high similarity (dot product) between the user and product vectors.

• These vectors are called latent factors. They are not interpretable; they could correspond to genres, topics, or themes. They help generalization.

• Initialize them randomly and learn them using SGD. They capture the structure of the matrix.
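
A minimal sketch of the factorization above, assuming toy made-up ratings and sizes: learn the user and product latent factors by SGD on the squared error plus an L2 regularizer.

```python
import numpy as np

# Observed ratings as (user, item, r_ij) triples -- hypothetical toy data.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
n_users, n_items, d = 3, 2, 4   # d latent factors
lam, lr = 0.1, 0.01             # regularizer weight, learning rate

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, d))  # user latent factors (random init)
V = rng.normal(scale=0.1, size=(n_items, d))  # product latent factors

for epoch in range(500):
    for i, j, r in ratings:
        err = r - U[i] @ V[j]  # residual r_ij - u_i^T v_j
        # SGD step on (r_ij - u_i^T v_j)^2 + lam * (||u_i||^2 + ||v_j||^2)
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * U[i] - lam * V[j])

print(round(float(U[0] @ V[0]), 2))  # reconstructed rating for user 0, item 0
```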

Page 29: Relational machine-learning

RESCAL Model

• Capture graph structure. The graph has multiple relations: users × products, users × demographics, products × categories.

• Solution: One matrix factorization problem for every relation.

• $f(s, r, t) = x_s^T W_r x_t$, where $x_s, x_t \in \mathbb{R}^d$ and $W_r \in \mathbb{R}^{d \times d}$.

• Max-margin loss: $\max[0,\, 1 - (f(s, r, t) - f(s, r, t'))]$. One can also use a softmax or $\ell_2$ loss, as in collaborative filtering (a scoring sketch follows this list).
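
A minimal sketch of the RESCAL score and the max-margin loss above, assuming hypothetical, untrained entity and relation parameters:

```python
import numpy as np

n_entities, n_relations, d = 100, 10, 8
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(n_entities, d))      # shared entity factors
W = rng.normal(scale=0.1, size=(n_relations, d, d))  # one d x d matrix per relation

def score(s, r, t):
    """Bilinear score f(s, r, t) = x_s^T W_r x_t."""
    return X[s] @ W[r] @ X[t]

def margin_loss(s, r, t, t_neg):
    """Hinge loss: push the observed triple above a corrupted one by margin 1."""
    return max(0.0, 1.0 - (score(s, r, t) - score(s, r, t_neg)))

print(margin_loss(0, 3, 1, t_neg=2))
```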

Page 33: Relational machine-learning

Interpretation

Figure 1: RESCAL as a neural network with three latent factors (d = 3); the output node computes score(s, r, t).

• $x_s \otimes x_t$: the outer product, a $(d \times d)$ matrix of all possible pairwise latent-factor interactions. The matrix $W_r$ acts like a mask, boosting or suppressing pairwise interactions (see the check below).

• Entities appear in multiple relations as subjects or objects. Information sharing!
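
A quick numerical check of this "mask" reading (toy vectors): the bilinear score equals $W_r$ weighted elementwise against the outer product of the latent factors and summed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x_s, x_t = rng.normal(size=d), rng.normal(size=d)
W_r = rng.normal(size=(d, d))

interactions = np.outer(x_s, x_t)            # all d x d pairwise interactions
masked_sum = np.sum(W_r * interactions)      # W_r boosts/suppresses each one
assert np.isclose(masked_sum, x_s @ W_r @ x_t)  # identical to the RESCAL score
```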

Page 35: Relational machine-learning

Bilinear Diag. and TransE Models

• RESCAL [2]: Requires $O(N_e d + N_r d^2)$ parameters. Scalability issues for large $N_r$.

• Bilinear Diag [4]: Enforce $W_r$ to be a diagonal matrix. This assumes symmetric relations. Why? (A symmetry check follows this list.) Memory complexity: $O(N_e d + N_r d)$.

• TransE [1]: $f(s, r, t) = -\|(x_s + x_r) - x_t\|_2$.

• TransE: Can it model all types of relations? Why?

• Takeaway: Make sure parameters are shared, through either a shared representation or a shared layer.
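
Sketches of the two scoring functions above, with toy untrained embeddings. The final assertion illustrates why Bilinear Diag is symmetric: swapping subject and object leaves the score unchanged.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
x_s, x_t = rng.normal(size=d), rng.normal(size=d)  # subject/object embeddings
w_r = rng.normal(size=d)  # diagonal of W_r, hence O(N_e d + N_r d) memory
x_r = rng.normal(size=d)  # TransE relation vector (a translation)

def bilinear_diag(x_s, w_r, x_t):
    return np.sum(x_s * w_r * x_t)  # x_s^T diag(w_r) x_t

def transe(x_s, x_r, x_t):
    return -np.linalg.norm(x_s + x_r - x_t, ord=2)  # -||(x_s + x_r) - x_t||_2

# Symmetric in s and t: the model cannot distinguish (s, r, t) from (t, r, s).
assert np.isclose(bilinear_diag(x_s, w_r, x_t), bilinear_diag(x_t, w_r, x_s))
```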

Page 40: Relational machine-learning

Negative Sampling

• How do we generate negative samples? Negatives may not be provided.

• Closed World Assumption: If a triple is not a known positive, then it must be a negative.

• Max-margin: $\max[0,\, 1 - (f(s, r, t) - f(s, r, t'))]$. Softer notion of negatives: $(s, r, t')$ only needs to be less plausible than $(s, r, t)$.

• Softmax loss: $\log(1 + \exp(-y_i f(s_i, r_i, t_i)))$. Negatives are treated as 'really' negative.

• The number of negative samples during training affects performance; see [3]. A sampling sketch follows this list.
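
A minimal sketch of negative sampling under the closed-world assumption: corrupt the target entity of a positive triple and treat any triple not among the known positives as a negative. The graph below is toy data, and sample_negatives is an illustrative helper, not code from the slides.

```python
import numpy as np

positives = {(0, 0, 1), (1, 0, 2), (2, 1, 0)}  # known (s, r, t) triples
n_entities = 5
rng = np.random.default_rng(0)

def sample_negatives(s, r, k):
    """Draw k corrupted triples (s, r, t') absent from the positive set."""
    negatives = []
    while len(negatives) < k:
        t_neg = int(rng.integers(n_entities))
        if (s, r, t_neg) not in positives:
            negatives.append((s, r, t_neg))
    return negatives

print(sample_negatives(0, 0, k=3))  # k trades off cost vs. performance [3]
```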

Page 45: Relational machine-learning

Deep Learning

[Figure: a Path-RNN consumes the entity and relation vectors along the path (Microsoft, isBasedIn, Seattle, locatedIn, Washington, locatedIn, USA), with a dummy relation at the final entity; the path vector is the RNN's last hidden state, and its dot product with the target relation countryofHQ gives a confidence score (0.94 here), with higher scores indicating that the query relation holds between the entity pair.]

Figure 2: Source: Das et al. (2016). The RNN generates a representation for the path; the similarity between the path representation and the query relation indicates whether the path supports the query. A toy scoring sketch follows.
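
A minimal, untrained sketch of a Path-RNN-style scorer as described in the figure: an RNN composes the relation vectors along a path, and the dot product of the final hidden state with the query-relation vector gives a confidence score. All shapes, weights, and relation ids here are hypothetical.

```python
import numpy as np

d, h = 8, 16
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(h, h))  # hidden-to-hidden weights
W_ih = rng.normal(scale=0.1, size=(d, h))  # relation-input weights
relation_vecs = rng.normal(size=(4, d))    # embeddings for 4 relation types
query_vec = rng.normal(size=h)             # embedding of the query relation

def path_score(path_relations):
    """Run the RNN over the path's relations; score the final hidden state."""
    state = np.zeros(h)
    for rel in path_relations:
        # h_t = sigmoid(h_{t-1} W_hh + y_{r_t} W_ih)
        state = 1.0 / (1.0 + np.exp(-(state @ W_hh + relation_vecs[rel] @ W_ih)))
    return float(state @ query_vec)  # higher => the path supports the query

print(path_score([0, 2, 2]))  # e.g. isBasedIn, locatedIn, locatedIn
```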

Page 46: Relational machine-learning

Questions

I am convinced that the crux of the problem of learning is recognizing relationships and being able to use them.

Christopher Strachey in a letter to Alan Turing, 1954.

Page 47: Relational machine-learning

References I

[1] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787–2795. Curran Associates, Inc., 2013.

[2] M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, 2011.

Page 48: Relational machine-learning

References II

[3] T. Trouillon, C. R. Dance, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Knowledge graph completion via complex tensor factorization. arXiv preprint arXiv:1702.06879, 2017.

[4] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
