
Neural Word Embedding as Implicit Matrix Factorization

Levy & Goldberg, 2014

Geneviève Chafouleas & David Ferland

March 23, 2020


Intro

This paper shows that the objective function of Word2Vec Skip-gram with negative sampling (SGNS) is an implicit weighted matrix factorization of a shifted PMI matrix.

They propose using SVD decomposition of the shifted PPMI matrix as an alternative word embedding technique.


Outline

Context and Motivation

Word-context Matrix

Review of Word2Vec Skip-gram with negative sampling (SGNS)

Implicit matrix factorization

Proposed Alternative Word representations

Empirical Results


Context - Word Representations

NLP/NLU tasks generally require a word representation

String token => numeric vector
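As a purely illustrative sketch (toy vocabulary, randomly initialized vectors standing in for learned ones), this is what "string token => numeric vector" looks like in code, both as a sparse one-hot vector and as a dense embedding lookup:

```python
import numpy as np

# Toy vocabulary: every known token gets an integer index (hypothetical example).
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
d = 5  # embedding dimension

# One-hot: a sparse, |V|-dimensional, context-free representation.
def one_hot(token):
    v = np.zeros(len(vocab))
    v[vocab[token]] = 1.0
    return v

# Dense embedding: each row of W is a d-dimensional vector for one word.
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), d))  # random stand-in for learned embeddings

print(one_hot("cat"))      # [0. 1. 0. 0.]
print(W[vocab["cat"]])     # dense 5-dimensional vector for "cat"
```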


Context - Distributional Hypothesis

Simple representations treat individual words as unique symbols (e.g. one-hot encoding, bag of words) => do not consider context

But many tasks benefit from capturing semantic (meaning-related) relationships between words => consider context

Common paradigm: the Distributional Hypothesis (Harris, Firth)

"You shall know a word by the company it keeps" (Firth)


Distributed word representations

Count-based

Based on a matrix $M \in \mathbb{R}^{|V_w| \times |V_c|}$

Rows are sparse vectors

PMI (pointwise mutual information)

PPMI (positive PMI)

Prediction-based(neural/word embedding)

Learned $W \in \mathbb{R}^{|V_w| \times d}$, $C \in \mathbb{R}^{|V_c| \times d}$

Rows are dense vectors

word2vec: CBOW, Skip-Gram

Skip-gram with Negative Sampling (SGNS)

Main goal

Show that SGNS can be cast as a weighted factorization of the shifted PMI matrix


Distributed word representations

Count-based

Based on a matrix $M \in \mathbb{R}^{|V_w| \times |V_c|}$

Rows are sparse vectors

PMI (pointwise mutual information)

PPMI (positive PMI)

Prediction-based(neural/word embedding)

Learned $W \in \mathbb{R}^{|V_w| \times d}$, $C \in \mathbb{R}^{|V_c| \times d}$

Rows are dense vectors

word2vec: CBOW, Skip-Gram

SGNS


PMI Matrix

Word-context matrix: $M \in \mathbb{R}^{|V_w| \times |V_c|}$

Row $i$: $w_i \in V_w$

Column $j$: $c_j \in V_c$

$M_{i,j} = f(w_i, c_j)$: a measure of association

Co-occurrence matrix: $f(w, c) = P(w, c)$

Pointwise Mutual Information (PMI) matrix:

$$f(w, c) = PMI(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$$

Intuition on PMI

How much more (or less) likely the co-occurrence of (w, c) is than observing w and c independently.
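A minimal sketch (toy corpus, symmetric window of size 2, counts only, all names hypothetical) of how the collection D of word-context pairs and the counts behind M can be gathered:

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2  # symmetric context window

# D is the multiset of observed (w, c) pairs; pair_counts holds #(w, c).
pair_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pair_counts[(w, tokens[j])] += 1

# Marginal counts #(w), #(c) and the total |D| follow from the pair counts.
w_counts, c_counts = Counter(), Counter()
for (w, c), n in pair_counts.items():
    w_counts[w] += n
    c_counts[c] += n
D = sum(pair_counts.values())
print(pair_counts[("sat", "on")], w_counts["sat"], D)
```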


(P)PMI Matrix

For $w \in V_W$, $c \in V_C$, and word-context pairs $(w, c)$ observed in $D$:

Empirical PMI:

$$P(w, c) = \frac{\#(w, c)}{|D|}, \quad P(w) = \frac{\#(w)}{|D|}, \quad P(c) = \frac{\#(c)}{|D|}$$

$$PMI(w, c) = \log \left( \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)} \right)$$

Issue for unseen (w, c) pairs:

$$PMI(w, c) = \log 0 = -\infty$$

Alternative: PPMI

$$PPMI(w, c) = \max(PMI(w, c), 0)$$
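Continuing the earlier sketch (reusing its pair_counts, w_counts, c_counts and D), the empirical PMI and PPMI matrices can be built directly from the counts. Dense numpy arrays are used here only for readability; a real vocabulary calls for sparse storage:

```python
import numpy as np

words = sorted(w_counts)        # rows: w in V_W
contexts = sorted(c_counts)     # columns: c in V_C

pmi = np.full((len(words), len(contexts)), -np.inf)  # log 0 for unseen pairs
for i, w in enumerate(words):
    for j, c in enumerate(contexts):
        n_wc = pair_counts.get((w, c), 0)
        if n_wc > 0:
            pmi[i, j] = np.log(n_wc * D / (w_counts[w] * c_counts[c]))

ppmi = np.maximum(pmi, 0.0)     # PPMI(w, c) = max(PMI(w, c), 0)
```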


Distributed word representations

Count-based

Based on a matrix $M \in \mathbb{R}^{|V_w| \times |V_c|}$

Rows are sparse vectors

PMI (pointwise mutual information)

PPMI (positive PMI)

Prediction-based(neural/word embedding)

Learned $W \in \mathbb{R}^{|V_w| \times d}$, $C \in \mathbb{R}^{|V_c| \times d}$

Rows are dense vectors

word2vec: CBOW, Skip-Gram

SGNS


Word2Vec


Word2Vec - Skip-Gram Notation

Notation:

D ≡ collection of observed (w,c) pairs

Each $w \in V_W$ is associated with a vector $\vec{w} \in \mathbb{R}^d$

Each $c \in V_C$ is associated with a vector $\vec{c} \in \mathbb{R}^d$

Expressing these vectors as matrices: $W \in \mathbb{R}^{|V_W| \times d}$, $C \in \mathbb{R}^{|V_C| \times d}$

Vc = Vw

Output layer: Hierarchical Softmax or Negative Sampling


Skip-Gram Negative Sampling (SGNS)

Softmax: for each context word $c_i$ to predict, we have

$$p(c_i \mid w_{center}) = \frac{\exp(\vec{c}_i \cdot \vec{w}_{center})}{\sum_{j=1}^{|V_c|} \exp(\vec{c}_j \cdot \vec{w}_{center})}$$

Costly to train due to the large $|V_c|$ (must update weights for the whole vocabulary)

Alternative: Skip-Gram with Negative Sampling. For each training sample: 1 positive and k random negative samples

k+1 binary classifications using Logistic Regression

⇒ Only k+1 weight updates for each training sample


Word2Vec - SGNS Objective

$P(D \mid w, c)$ is modeled as:

$$P(D = 1 \mid w, c) = \sigma(\vec{w} \cdot \vec{c}) = \frac{\exp(\vec{w} \cdot \vec{c})}{1 + \exp(\vec{w} \cdot \vec{c})}$$

$$P(D = 0 \mid w, c) = 1 - \sigma(\vec{w} \cdot \vec{c}) = \sigma(-\vec{w} \cdot \vec{c})$$

SGNS objective for a given $(w, c)$ pair:

$$\log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)]$$

where $c_N$ is drawn from $P_D(c) = \frac{\#(c)}{|D|}$.

Total loss:

$$\ell = \sum_{(w, c) \in D} \#(w, c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)] \right) \quad (1)$$
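As a rough sketch (not the word2vec implementation; vectors and counts below are random stand-ins), the per-pair term of (1) can be estimated by drawing the k negatives from the empirical unigram distribution $P_D(c) = \#(c)/|D|$:

```python
import numpy as np

def sgns_pair_objective(w_vec, c_vec, C, context_counts, k, rng):
    """Monte-Carlo estimate of log sigma(w.c) + k * E_{c_N ~ P_D}[log sigma(-w.c_N)]."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    p_neg = context_counts / context_counts.sum()        # P_D(c) = #(c) / |D|
    negatives = rng.choice(len(C), size=k, p=p_neg)      # k negative context samples
    positive = np.log(sigmoid(w_vec @ c_vec))            # log sigma(w . c)
    negative = np.log(sigmoid(-(C[negatives] @ w_vec)))  # log sigma(-w . c_N), one per sample
    return positive + negative.sum()                     # sum over k samples ~ k * E[...]

rng = np.random.default_rng(0)
d, num_contexts = 50, 1000
w_row = rng.normal(size=d)                    # vector for one word w (random stand-in)
C = rng.normal(size=(num_contexts, d))        # context vectors (random stand-ins)
counts = rng.integers(1, 100, size=num_contexts).astype(float)   # hypothetical #(c) counts
print(sgns_pair_objective(w_row, C[3], C, counts, k=5, rng=rng))
```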


SGNS as Implicit Matrix Factorization

SGNS embeds words and contexts into matrices W and C

Consider $M = W \cdot C^T$

$M_{ij} = \vec{w}_i \cdot \vec{c}_j$ represents an implicit association measure $f(w_i, c_j)$

What is the matrix M that Word2vec implicitly factorizes?


Characterizing the Implicit Matrix

$$\ell = \sum_{(w, c) \in D} \#(w, c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)] \right)$$

For a specific $(w, c)$ pair:

$$\ell(w, c) = \underbrace{\#(w, c)}_{\text{positive obs. weight}} \log \sigma(\vec{w} \cdot \vec{c}) + \underbrace{k \cdot \#(w) \cdot \frac{\#(c)}{|D|}}_{\text{negative obs. weight}} \log \sigma(-\vec{w} \cdot \vec{c})$$

We take the derivative with respect to $\vec{w} \cdot \vec{c}$, set it to zero, and solve:

$$\vec{w} \cdot \vec{c} = \log \left( \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)} \cdot \frac{1}{k} \right) = \log \left( \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)} \right) - \log k$$

SGNS is implicitly factorizing:

$$M^{SGNS}_{ij} = \vec{w}_i \cdot \vec{c}_j = PMI(w_i, c_j) - \log k$$
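A quick numeric sanity check of this derivation (hypothetical counts for a single pair): with $a = \#(w, c)$ and $b = k \cdot \#(w) \cdot \#(c) / |D|$, the per-pair objective $a \log \sigma(x) + b \log \sigma(-x)$, with $x = \vec{w} \cdot \vec{c}$, should peak at $x = \log(a/b)$, i.e. at $PMI(w, c) - \log k$:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical counts for one (w, c) pair.
n_wc, n_w, n_c, D, k = 40.0, 500.0, 300.0, 100000.0, 5
a = n_wc                  # positive observation weight #(w, c)
b = k * n_w * n_c / D     # negative observation weight k * #(w) * #(c) / |D|

xs = np.linspace(-10, 10, 200001)
objective = a * np.log(sigmoid(xs)) + b * np.log(sigmoid(-xs))
x_star = xs[np.argmax(objective)]

pmi_minus_log_k = np.log(n_wc * D / (n_w * n_c)) - np.log(k)
print(x_star, pmi_minus_log_k)   # the two values agree up to the grid resolution
```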


Alternative Word Representation


Shifted PPMI

Shifted PPMI

$$M^{SPPMI_k} = SPPMI_k(w, c) = \max(PMI(w, c) - \log k,\ 0)$$

where k is a hyperparameter

Solves the issue of cell values equal to $\log(0) = -\infty$. $M^{SPPMI_k}$ is a sparse matrix, so SVD can be applied efficiently.
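A one-step sketch, assuming a pmi array built as in the earlier PPMI example (with $-\infty$ for unseen pairs): the shift and the clipping keep every cell finite and mostly zero, hence sparse.

```python
import numpy as np

k = 5
# SPPMI_k(w, c) = max(PMI(w, c) - log k, 0); unseen pairs (PMI = -inf) map to 0.
sppmi = np.maximum(pmi - np.log(k), 0.0)
```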


SVD over Shifted PPMI

Truncated SVD

Given a matrix $M$, $M_d = U_d \cdot \Sigma_d \cdot V_d^T$ is the rank-$d$ matrix that best approximates $M$ under the L2 norm:

$$M_d = \underset{\mathrm{Rank}(M') = d}{\arg\min} \|M' - M\|_2$$

A popular approach in NLP is factorizing $M^{PPMI}$ with SVD:

$$W^{SVD} = U_d \cdot \Sigma_d, \quad C^{SVD} = V_d$$

Symmetric SVD of $M^{SPPMI}$:

$$W^{SVD_{1/2}} = U_d \cdot \sqrt{\Sigma_d}, \quad C^{SVD_{1/2}} = V_d \cdot \sqrt{\Sigma_d}$$
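A minimal sketch of the symmetric factorization using scipy's sparse truncated SVD, assuming an sppmi array as in the sketch above (svds returns singular values in ascending order, so they are re-sorted):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

d = 2                                     # embedding dimension (tiny, for the toy matrix)
U, S, Vt = svds(csr_matrix(sppmi), k=d)   # truncated SVD: M_d = U_d Sigma_d V_d^T
order = np.argsort(-S)                    # largest singular values first
U, S, Vt = U[:, order], S[order], Vt[order]

W_svd_half = U * np.sqrt(S)               # W^{SVD_1/2} = U_d * sqrt(Sigma_d)
C_svd_half = Vt.T * np.sqrt(S)             # C^{SVD_1/2} = V_d * sqrt(Sigma_d)
```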


SVD versus SGNS

SVD over shifted PPMI matrix

Advantages

No hyperparameter tuning.

Easily applied to count-aggregated data (i.e. {(w, c, #(w, c))} triplets).

More efficient for large corpora.

Disadvantages

Unweighted L2 loss when solving for the best SVD; the objective does not distinguish between unobserved and observed pairs.

Must arbitrarily define W from the decomposed matrices.

SGNS

Advantages

The objective weights different (w, c) pairs differently.

Trained over observed pairs and learns the embedding W directly.

Disadvantages

Requires hyperparameter tuning.

Requires each observation (w, c) to be presented separately in training.


Experimental Setup

Trained on English Wikipedia.

Trained SGNS models and word representation alternatives.


Optimizing the Objective

Deviation is calculated as $\frac{\ell - \ell_{opt}}{\ell_{opt}}$

Optimal objective: attained at $\vec{w} \cdot \vec{c} = PMI(w, c) - \log k$


Performance of Word Representations on Linguistic Tasks


Conclusion

SGNS implicitly factorizes the (shifted) word-context PMI matrix.

Presented SPPMI as a word representation.

Presented matrix factorization of SPPMI as a word representation.


References

[1] Levy & Goldberg (2014). Neural Word Embedding as Implicit Matrix Factorization. https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf

[2] https://medium.com/radix-ai-blog/unifying-word-embeddings-and-matrix-factorization-part-1-cb3984e95141

[3] https://medium.com/radix-ai-blog/unifying-word-embeddings-and-matrix-factorization-part-2-a0174ace78b8

[4] https://medium.com/radix-ai-blog/unifying-word-embeddings-and-matrix-factorization-part-3-4269d9a07470


The End
