
Lecture: Vector Semantics (aka Distributional Semantics)

Apr 16, 2017

Marina Santini
Transcript
Page 1: Lecture: Vector Semantics (aka Distributional Semantics)

Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Vector Semantics

(aka Distributional Semantics)

Marina Santini [email protected]

 

Department of Linguistics and Philology

Uppsala  University,  Uppsala,  Sweden  

 

Spring  2016  

 

 1  

Page 2: Lecture: Vector Semantics (aka Distributional Semantics)

Previous Lecture: Word Sense Disambiguation

2  

Page 3: Lecture: Vector Semantics (aka Distributional Semantics)

Similarity measures (dictionary-based)

Page 4: Lecture: Vector Semantics (aka Distributional Semantics)

Collocational features: supervised

• Position-specific information about the words and collocations in the window

• guitar and bass player stand

• Word 1-, 2-, and 3-grams in a window of ±3 are common

[…] manually tagged with WordNet senses (Miller et al. 1993, Landes et al. 1998). In addition, sense-tagged corpora have been built for the SENSEVAL all-word tasks. The SENSEVAL-3 English all-words test data consisted of 2081 tagged content word tokens, from 5,000 total running words of English from the WSJ and Brown corpora (Palmer et al., 2001).

The first step in supervised training is to extract features that are predictive of word senses. The insight that underlies all modern algorithms for word sense disambiguation was famously first articulated by Weaver (1955) in the context of machine translation:

If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words. [. . .] But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word. [. . .] The practical question is: "What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?"

We first perform some processing on the sentence containing the window, typically including part-of-speech tagging, lemmatization, and, in some cases, syntactic parsing to reveal headwords and dependency relations. Context features relevant to the target word can then be extracted from this enriched input. A feature vector consisting of numeric or nominal values encodes this linguistic information as an input to most machine learning algorithms.

Two classes of features are generally extracted from these neighboring contexts, both of which we have seen previously in part-of-speech tagging: collocational features and bag-of-words features. A collocation is a word or series of words in a position-specific relationship to a target word (i.e., exactly one word to the right, or the two words starting 3 words to the left, and so on). Thus, collocational features encode information about specific positions located to the left or right of the target word. Typical features extracted for these context words include the word itself, the root form of the word, and the word's part-of-speech. Such features are effective at encoding local lexical and grammatical information that can often accurately isolate a given sense.

For example, consider the ambiguous word bass in the following WSJ sentence:

(16.17) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.

A collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts-of-speech, and pairs of words, that is,

$[w_{i-2}, \mathrm{POS}_{i-2}, w_{i-1}, \mathrm{POS}_{i-1}, w_{i+1}, \mathrm{POS}_{i+1}, w_{i+2}, \mathrm{POS}_{i+2}, w^{i-1}_{i-2}, w^{i+1}_{i}]$   (16.18)

would yield the following vector:

[guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]

High-performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of words 3 to the left and 3 to the right (Zhong and Ng, 2010).
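To make the collocational features concrete, here is a minimal sketch (not from the slides) of extracting a feature vector in the spirit of (16.18) around the target word bass; the tokenization and POS tags are assumed to be given in advance, and the two pair features are simply the left and right word bigrams.

```python
# A minimal sketch of collocational feature extraction (cf. (16.18)), assuming a
# pre-tagged sentence; the POS tags below are illustrative, not from a real tagger.
tagged = [("an", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
          ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RP")]

def collocational_features(tagged_sentence, i):
    """Words and POS tags in a +/-2 window around position i, plus word bigrams."""
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    return [words[i - 2], tags[i - 2], words[i - 1], tags[i - 1],
            words[i + 1], tags[i + 1], words[i + 2], tags[i + 2],
            " ".join(words[i - 2:i]),       # left word bigram
            " ".join(words[i + 1:i + 3])]   # right word bigram

i = [w for w, _ in tagged].index("bass")
print(collocational_features(tagged, i))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB',
#  'guitar and', 'player stand']
```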

The second type of feature consists of bag-of-words information about neighboring words. A bag-of-words means an unordered set of words, with their exact position ignored.

Page 5: Lecture: Vector Semantics (aka Distributional Semantics)

Bag-of-words features: supervised

• Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:

  [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]

• The vector for "guitar and bass player stand": [0,0,0,1,0,0,0,0,0,0,1,0]
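A small sketch of this binary bag-of-words encoding in Python, using the 12-word vocabulary above:

```python
# Binary bag-of-words vector over a fixed vocabulary (the 12 "bass" words above).
vocab = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

def bow_vector(context, vocab):
    """Indicator vector: 1 if the vocabulary word occurs in the context, else 0."""
    words = set(context.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(bow_vector("guitar and bass player stand", vocab))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```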

 

Page 6: Lecture: Vector Semantics (aka Distributional Semantics)

Practical activity: Lesk algorithms

• Michael Lesk (1986): Original Lesk
  • Compare the target word's signature with the signature of each of the context words

• Kilgarriff and Rosenzweig (2000): Simplified Lesk
  • Compare the target word's signature with the context words

• Vasilescu et al. (2004): Corpus Lesk
  • Add all the words in a labelled corpus sentence for a word sense into the signature of that sense (remember the labelled sentences in Senseval 2).

signature <- set of words in the gloss and examples of a sense
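A minimal sketch of Simplified Lesk, assuming NLTK and its WordNet data are installed (they are not part of the slides); the signature of a sense is built from its gloss and examples, exactly as defined above.

```python
# Simplified Lesk sketch, assuming NLTK's WordNet interface is available
# (pip install nltk; then nltk.download('wordnet') once).
from nltk.corpus import wordnet as wn

def signature(sense):
    """Signature = set of words in the gloss and examples of a sense."""
    words = sense.definition().split()
    for example in sense.examples():
        words += example.split()
    return {w.lower().strip('.,;:!?"') for w in words}

def simplified_lesk(target, context):
    """Pick the sense of `target` whose signature overlaps most with the context words."""
    context_words = {w.lower() for w in context.split()}
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(target):
        overlap = len(signature(sense) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bass", "an electric guitar and bass player stand off to one side"))
```

A Corpus Lesk variant would additionally add the words of sense-tagged example sentences (e.g. from SemCor) to each sense's signature.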

Page 7: Lecture: Vector Semantics (aka Distributional Semantics)

Simplified Lesk: Time flies like an arrow

•  Common  sense:  

• Modern English speakers unambiguously understand the sentence to mean "As a generalisation, time passes in the same way that an arrow generally flies (i.e. quickly)" (as in the common metaphor time goes by quickly).

7  

Page 8: Lecture: Vector Semantics (aka Distributional Semantics)

Ref: Wikipedia

• But formally/logically/syntactically/semantically → ambiguous:

1. (as an imperative) Measure the speed of flies like you would measure that of an arrow, i.e. (You should) time flies as you would time an arrow.

2. (imperative) Measure the speed of flies like an arrow would, i.e. (You should) time flies in the same manner that an arrow would time them.

3. (imperative) Measure the speed of flies that are like arrows, i.e. (You should) time those flies that are like an arrow.

4. (declarative) Time moves in a way an arrow would.

5. (declarative, i.e. neutrally stating a proposition) Certain flying insects, "time flies," enjoy an arrow.


8  

Page 9: Lecture: Vector Semantics (aka Distributional Semantics)

Simplified Lesk algorithm (2000) and WordNet (3.1)

• Disambiguating time:
  • time#n#5 shares "pass" and "time flies as an arrow" with flies#v#8

• Disambiguating flies:
  • flies#v#8 shares "pass" and "time flies as an arrow" with time#n#5

So we select the following senses: time#n#5 and flies#v#8.

9  

Page 10: Lecture: Vector Semantics (aka Distributional Semantics)

like  &  arrow  

Disambiguating like:
  • like#a#1 shares "like" with flies#v#8

Arrow cannot be disambiguated.

10  

Page 11: Lecture: Vector Semantics (aka Distributional Semantics)

[Slide figure: WordNet sense boxes for Time n#5, fly v#8, like a#1, Similar a#3]

Page 12: Lecture: Vector Semantics (aka Distributional Semantics)

Corpus Lesk Algorithm

• Expands the approach by:
  • Adding all the words of any sense-tagged corpus data (like SemCor) for a word sense into the signature for that sense.

• Signature = gloss + examples of a word sense

12  

Page 13: Lecture: Vector Semantics (aka Distributional Semantics)

MacMillan dictionary

   

13  

[Slide figure: MacMillan dictionary senses: Time n#1, Fly v#6, Like a#1, Arrow???]

Page 14: Lecture: Vector Semantics (aka Distributional Semantics)

Arrow  ???  

14  

Page 15: Lecture: Vector Semantics (aka Distributional Semantics)

Implementation?

• What if the next activity was:

• Build an implementation of your solution of the simplified Lesk?

• Watch out: licences (commercial, academic, creative commons, etc.)

15  

Page 16: Lecture: Vector Semantics (aka Distributional Semantics)

Problems with thesaurus-based meaning

• We don't have a thesaurus for every language
• Even if we do, they have problems with recall:
  • Many words are missing
  • Most (if not all) phrases are missing
  • Some connections between senses are missing
  • Thesauri work less well for verbs and adjectives

Page 17: Lecture: Vector Semantics (aka Distributional Semantics)

End  of  previous  lecture  

17  

Page 18: Lecture: Vector Semantics (aka Distributional Semantics)

Vector/Distributional Semantics

• The meaning of a word is computed from the distribution of words around it.

• These words are represented as a vector of numbers.

• Very popular and very intriguing!

18  

Page 19: Lecture: Vector Semantics (aka Distributional Semantics)

http://esslli2016.unibz.it/?page_id=256

19  

Page 20: Lecture: Vector Semantics (aka Distributional Semantics)

(Oversimplified) Preliminaries (cf. also Lect 03: SA, Turney Algorithm)

• Probability
• Joint probability
• Marginals
• PMI
• PPMI
• Smoothing
• Dot product (aka inner product)
• Window

Page 21: Lecture: Vector Semantics (aka Distributional Semantics)

Probability  

•  Probability  is  the  measure  of  how  likely  an  event  is.      

21  

Ex: John has a box with a book, a map and a ruler in it (Cantos Gomez, 2013). This sentence has 14 words and 5 nouns. The probability of picking up a noun is: P(noun) = 5/14 = 0.357

Page 22: Lecture: Vector Semantics (aka Distributional Semantics)

Joints and Marginals (oversimplifying)

• Joint: the probability of word A occurring together with word B → the frequency with which the two words appear together

• P(A,B)

• Marginals: the probability of word A and the probability of the other word B

•  P(A)          P(B)  

22  

Page 23: Lecture: Vector Semantics (aka Distributional Semantics)

Can also be said in other ways: Dependent and independent events: Joints & Marginals

• Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second so that the probability is changed.
  • Consider two dependent events, A and B. The joint probability that A and B occur together is:
  • P(A and B) = P(A)*P(B given A)  or  P(A and B) = P(B)*P(A given B)

• If two events are independent, each probability is multiplied together to find the overall probability for the set of events:
  • P(A and B) = P(A)*P(B)

Marginal probability is the probability of the occurrence of a single event in a joint probability.

Equivalent notations (joint): P(A,B) or P(A ∩ B)

Page 24: Lecture: Vector Semantics (aka Distributional Semantics)

Association measure

• Pointwise mutual information:
  • How much more do events x and y co-occur than if they were independent?

Read: the joint probability of two dependent events (i.e., the two words that are supposed to be associated) divided by the product of the individual probabilities (i.e., we assume that the words are not associated, that they are independent), and we take the log of it. It tells us how much more the two events co-occur than if they were independent.

   

$\mathrm{PMI}(x,y) = \log_2 \dfrac{P(x,y)}{P(x)\,P(y)}$

Page 25: Lecture: Vector Semantics (aka Distributional Semantics)

POSITIVE  PMI  

• We replace all the negative values with 0.

25  
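As a small illustration (not from the slides), PMI and PPMI can be computed directly from a joint probability and the two marginals; the first set of numbers below is the one used in the worked information/data example later in the lecture, the second is an invented low-association pair.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log2 of observed vs. chance co-occurrence."""
    return math.log2(p_xy / (p_x * p_y))

def ppmi(p_xy, p_x, p_y):
    """Positive PMI: negative values are replaced with 0."""
    return max(pmi(p_xy, p_x, p_y), 0.0)

print(round(pmi(0.32, 0.58, 0.37), 2))   # co-occur more than chance -> ~0.58
print(ppmi(0.001, 0.10, 0.20))           # negative PMI is clipped to 0.0
```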

Page 26: Lecture: Vector Semantics (aka Distributional Semantics)

Smoothing (additive, Laplace, etc.)

• In very simple words: we add an arbitrary value to the counts.

• In a bag-of-words model of natural language processing and information retrieval, additive smoothing allows the assignment of non-zero probabilities to words which do not occur in the sample. This addresses data sparseness: without smoothing, an unseen word has count 0, and multiplying by its 0 probability wipes out the whole product.

• Additive smoothing is commonly a component of naive Bayes classifiers.
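A minimal sketch of add-k smoothing over word counts; the tiny vocabulary and counts are assumptions for illustration only.

```python
# Add-k smoothing: every vocabulary word gets a pseudo-count of k, so that
# unseen words no longer have zero probability.
def smoothed_probs(counts, vocab, k=1):
    total = sum(counts.get(w, 0) for w in vocab) + k * len(vocab)
    return {w: (counts.get(w, 0) + k) / total for w in vocab}

counts = {"guitar": 3, "player": 2}            # "band" was never observed
probs = smoothed_probs(counts, ["guitar", "player", "band"], k=1)
print(probs)   # every word, including "band", now has probability > 0
```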

Page 27: Lecture: Vector Semantics (aka Distributional Semantics)

Dot product (aka inner product)

• Given two vectors $\vec{a} = [a_1, a_2, \ldots, a_n]$ and $\vec{b} = [b_1, b_2, \ldots, b_n]$,

• the dot product is: $\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n$

• The dot product is written using a central dot.

27  
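In code, the dot product is just a component-wise multiply-and-sum (a tiny sketch with made-up vectors):

```python
def dot(v, w):
    """Dot product: multiply the vectors component-wise and sum the results."""
    return sum(vi * wi for vi, wi in zip(v, w))

print(dot([1, 2, 3], [4, 0, 2]))   # 1*4 + 2*0 + 3*2 = 10
```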

Page 28: Lecture: Vector Semantics (aka Distributional Semantics)

Window (around the ambiguous word)

• The number of words that we take into account before and after the word we want to disambiguate.

• We can decide any arbitrary value, e.g.: -3 ??? +3
• Ex: The president said central banks should maintain flows of cheap credit to households

28  
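A minimal sketch of extracting a ±3 window, taking "banks" as the target word purely for illustration:

```python
def window(tokens, target, size=3):
    """Return the `size` words before and after the target word."""
    i = tokens.index(target)
    return tokens[max(0, i - size):i], tokens[i + 1:i + 1 + size]

sent = ("The president said central banks should maintain "
        "flows of cheap credit to households").split()
left, right = window(sent, "banks")
print(left, right)   # ['president', 'said', 'central'] ['should', 'maintain', 'flows']
```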

Page 29: Lecture: Vector Semantics (aka Distributional Semantics)

Acknowledgements

Most slides borrowed or adapted from:

Dan Jurafsky and James H. Martin

Dan  Jurafsky  and  Christopher  Manning,  Coursera    

 

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/

 

     

Page 30: Lecture: Vector Semantics (aka Distributional Semantics)

Distributional Semantics: Term-context matrix

Page 31: Lecture: Vector Semantics (aka Distributional Semantics)

Distributional models of meaning

• Also called vector-space models of meaning
• Offer much higher recall than hand-built thesauri
  • Although they tend to have lower precision

• Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments…. If A and B have almost identical environments we say that they are synonyms."

• Firth (1957): "You shall know a word by the company it keeps!"


Page 32: Lecture: Vector Semantics (aka Distributional Semantics)

Intuition of distributional word similarity

• Examples:
  A bottle of tesgüino is on the table.
  Everybody likes tesgüino.
  Tesgüino makes you drunk.
  We make tesgüino out of corn.

• From context words humans can guess tesgüino means:
  • an alcoholic beverage like beer

• Intuition for algorithm:
  • Two words are similar if they have similar word contexts.

Page 33: Lecture: Vector Semantics (aka Distributional Semantics)

IR: Term-document matrix

• Each cell: count of term t in document d: tf_{t,d}
• Each document is a count vector in ℕ^|V|: a column below

              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle               1                1               8           15
soldier              2                2              12           36
fool                37               58               1            5
clown                6              117               0            0

33  

Page 34: Lecture: Vector Semantics (aka Distributional Semantics)

Document similarity: Term-document matrix

•  Two  documents  are  similar  if  their  vectors  are  similar  

34  

              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle               1                1               8           15
soldier              2                2              12           36
fool                37               58               1            5
clown                6              117               0            0
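As a small illustration (not part of the slides), each play can be treated as its column of counts and compared with the cosine measure introduced later in this lecture:

```python
# Compare documents as count-vector columns of the term-document matrix above
# (counts for battle, soldier, fool, clown).
import math

docs = {
    "As You Like It":  [1, 2, 37, 6],
    "Twelfth Night":   [1, 2, 58, 117],
    "Julius Caesar":   [8, 12, 1, 0],
    "Henry V":         [15, 36, 5, 0],
}

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

# Compare the three values: the two histories come out most similar.
print(cosine(docs["Julius Caesar"], docs["Henry V"]))
print(cosine(docs["As You Like It"], docs["Twelfth Night"]))
print(cosine(docs["As You Like It"], docs["Julius Caesar"]))
```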

Page 35: Lecture: Vector Semantics (aka Distributional Semantics)

The words in a term-document matrix

• Each word is a count vector in ℕ^D: a row below

35  

              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle               1                1               8           15
soldier              2                2              12           36
fool                37               58               1            5
clown                6              117               0            0

Page 36: Lecture: Vector Semantics (aka Distributional Semantics)

The words in a term-document matrix

•  Two  words  are  similar  if  their  vectors  are  similar  

36  

              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle               1                1               8           15
soldier              2                2              12           36
fool                37               58               1            5
clown                6              117               0            0

Page 37: Lecture: Vector Semantics (aka Distributional Semantics)

The intuition of distributional word similarity…

• Instead of using entire documents, use smaller contexts:
  • Paragraph
  • Window of 10 words

•  A  word  is  now  defined  by  a  vector  over  counts  of  context  words  

37  

Page 38: Lecture: Vector Semantics (aka Distributional Semantics)

Sample contexts: 20 words (Brown corpus)

• equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of clove and nutmeg,

• on board for their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened to that of

38  

• of a recursive type well suited to programming on the digital computer. In finding the optimal R-stage policy from that of

• substantially affect commerce, for the purpose of gathering data and information necessary for the study authorized in the first section of this

Page 39: Lecture: Vector Semantics (aka Distributional Semantics)

Term-context matrix for word similarity

•  Two  words  are  similar  in  meaning  if  their  context  vectors  are  similar  

39  

              aardvark   computer   data   pinch   result   sugar   …
apricot            0          0        0      1        0       1
pineapple          0          0        0      1        0       1
digital            0          2        1      0        1       0
information        0          1        6      0        4       0
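A minimal sketch (not from the slides) of how such a term-context matrix is accumulated from a corpus with a fixed window; the two toy sentences and the ±4 window are assumptions for illustration only.

```python
# Accumulate word-context co-occurrence counts with a +/-4 word window.
from collections import defaultdict

corpus = [
    "a pinch of sugar and a sliced apricot".split(),
    "store the data on a digital computer".split(),
]

window_size = 4
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window_size), min(len(sent), i + window_size + 1)):
            if i != j:
                counts[word][sent[j]] += 1

print(dict(counts["apricot"]))   # context counts for "apricot"
```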

Page 40: Lecture: Vector Semantics (aka Distributional Semantics)

Should  we  use  raw  counts?  

• For the term-document matrix
  • We used tf-idf instead of raw term counts

• For the term-context matrix
  • Positive Pointwise Mutual Information (PPMI) is common

40  

Page 41: Lecture: Vector Semantics (aka Distributional Semantics)

Pointwise Mutual Information

• Pointwise mutual information:
  • Do events x and y co-occur more than if they were independent?

• PMI between two words (Church & Hanks 1989):
  • Do words x and y co-occur more than if they were independent?

• Positive PMI between two words (Niwa & Nitta 1994):
  • Replace all PMI values less than 0 with zero

$\mathrm{PMI}(x,y) = \log_2 \dfrac{P(x,y)}{P(x)\,P(y)}$

$\mathrm{PMI}(\mathrm{word}_1,\mathrm{word}_2) = \log_2 \dfrac{P(\mathrm{word}_1,\mathrm{word}_2)}{P(\mathrm{word}_1)\,P(\mathrm{word}_2)}$

Page 42: Lecture: Vector Semantics (aka Distributional Semantics)

Computing PPMI on a term-context matrix

• Matrix F with W rows (words) and C columns (contexts)
• f_ij is the number of times w_i occurs in context c_j

$p_{ij} = \dfrac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$

$p_{i*} = \dfrac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$   (the count of all the contexts where the word appears)

$p_{*j} = \dfrac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$   (the count of all the words that occur in that context)

The shared denominator is the sum of all words in all contexts, i.e. all the numbers in the matrix.

$\mathrm{pmi}_{ij} = \log_2 \dfrac{p_{ij}}{p_{i*}\,p_{*j}}$

$\mathrm{ppmi}_{ij} = \begin{cases} \mathrm{pmi}_{ij} & \text{if } \mathrm{pmi}_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$
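A sketch of this computation in Python with NumPy (not part of the slides), applied to the small term-context matrix shown earlier; the value for (information, data) matches the worked example on the next slides.

```python
import numpy as np

words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

N = F.sum()                           # sum of all counts in the matrix
p = F / N                             # joint probabilities p_ij
p_w = p.sum(axis=1, keepdims=True)    # row marginals p_i*
p_c = p.sum(axis=0, keepdims=True)    # column marginals p_*j

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p / (p_w * p_c))
ppmi = np.where(pmi > 0, pmi, 0.0)    # replace negative (and undefined) values with 0

print(round(ppmi[words.index("information"), contexts.index("data")], 2))  # ~0.57
```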

Page 43: Lecture: Vector Semantics (aka Distributional Semantics)

p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37

$p_{ij} = \dfrac{f_{ij}}{N}$

$p(w_i) = \dfrac{\sum_{j=1}^{C} f_{ij}}{N}$   (the count of all the contexts where the word appears)

$p(c_j) = \dfrac{\sum_{i=1}^{W} f_{ij}}{N}$   (the count of all the words that occur in that context)

N = the sum of all words in all contexts = all the numbers in the matrix

Page 44: Lecture: Vector Semantics (aka Distributional Semantics)

$\mathrm{pmi}_{ij} = \log_2 \dfrac{p_{ij}}{p_{i*}\,p_{*j}}$

• pmi(information, data) = log2( .32 / (.37 * .58) ) = .58

PPMI(w,context)
              computer   data   pinch   result   sugar
apricot           -        -     2.25      -      2.25
pineapple         -        -     2.25      -      2.25
digital         1.66       -       -       -        -
information       -      0.57      -     0.47       -

(cells marked "-" have a zero co-occurrence count, so their PMI is undefined/negative and their PPMI is 0)

Page 45: Lecture: Vector Semantics (aka Distributional Semantics)

Weighing  PMI  

• PMI is biased toward infrequent events
  • Various weighting schemes help alleviate this
  • See Turney and Pantel (2010)

• Add-one smoothing can also help

45  

Page 46: Lecture: Vector Semantics (aka Distributional Semantics)

46  

Add-2 Smoothed Count(w,context)
              computer   data   pinch   result   sugar
apricot           2        2       3       2       3
pineapple         2        2       3       2       3
digital           4        3       2       3       2
information       3        8       2       6       2

p(w,context) [add-2]                                      p(w)
              computer   data   pinch   result   sugar
apricot         0.03      0.03    0.05    0.03    0.05    0.20
pineapple       0.03      0.03    0.05    0.03    0.05    0.20
digital         0.07      0.05    0.03    0.05    0.03    0.24
information     0.05      0.14    0.03    0.10    0.03    0.36

p(context)      0.19      0.25    0.17    0.22    0.17

Page 47: Lecture: Vector Semantics (aka Distributional Semantics)

Original vs add-2 smoothing

PPMI(w,context) [add-2]
              computer   data   pinch   result   sugar
apricot         0.00      0.00    0.56    0.00    0.56
pineapple       0.00      0.00    0.56    0.00    0.56
digital         0.62      0.00    0.00    0.00    0.00
information     0.00      0.58    0.00    0.37    0.00

PPMI(w,context) [unsmoothed]
              computer   data   pinch   result   sugar
apricot           -        -     2.25      -      2.25
pineapple         -        -     2.25      -      2.25
digital         1.66       -       -       -        -
information       -      0.57      -     0.47       -

Page 48: Lecture: Vector Semantics (aka Distributional Semantics)

Distributional Semantics: Dependency relations

Page 49: Lecture: Vector Semantics (aka Distributional Semantics)

Using syntax to define a word's context

• Zellig Harris (1968):
  • "The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities"

• Two words are similar if they have similar parse contexts
  • Duty and responsibility (Chris Callison-Burch's example)

Modified by adjectives:   additional, administrative, assumed, collective, congressional, constitutional …
Objects of verbs:         assert, assign, assume, attend to, avoid, become, breach …

Page 50: Lecture: Vector Semantics (aka Distributional Semantics)

Co-occurrence vectors based on syntactic dependencies

• The contexts C are different dependency relations
  • Subject-of "absorb"
  • Prepositional-object of "inside"

• Counts for the word cell:

Dekang Lin, 1998, "Automatic Retrieval and Clustering of Similar Words"
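As an illustration only (not from the slides), dependency contexts of this kind can be extracted with an off-the-shelf parser; the sketch below assumes spaCy and its small English model are installed, and the two sentences are invented.

```python
# Hypothetical sketch: count (word, relation, head) dependency contexts with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cell absorbs water. There is water inside the cell.")

contexts = Counter()
for token in doc:
    if token.dep_ not in ("punct", "det"):
        # each word is described by its dependency relation and the head it attaches to
        contexts[(token.text.lower(), token.dep_, token.head.text.lower())] += 1

print(contexts.most_common())
```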

Page 51: Lecture: Vector Semantics (aka Distributional Semantics)

PMI applied to dependency relations

• "Drink it" is more common than "drink wine"
• But "wine" is a better "drinkable" thing than "it"

Object of "drink"   Count   PMI
it                    3      1.3
anything              3      5.2
wine                  2      9.3
tea                   2     11.8
liquid                2     10.5

Hindle, Don. 1990. Noun Classification from Predicate-Argument Structure. ACL

The same objects sorted by PMI:

Object of "drink"   Count   PMI
tea                   2     11.8
liquid                2     10.5
wine                  2      9.3
anything              3      5.2
it                    3      1.3

Page 52: Lecture: Vector Semantics (aka Distributional Semantics)

Cosine for computing similarity

$\cos(\vec{v},\vec{w}) = \dfrac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \dfrac{\vec{v}}{|\vec{v}|} \cdot \dfrac{\vec{w}}{|\vec{w}|} = \dfrac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

The numerator is the dot product; dividing each vector by its length gives unit vectors.

v_i is the PPMI value for word v in context i; w_i is the PPMI value for word w in context i. cos(v,w) is the cosine similarity of v and w.
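The formula translated directly into code (a minimal sketch over plain Python lists):

```python
import math

def cosine(v, w):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) * math.sqrt(sum(wi * wi for wi in w)))

print(cosine([1, 2, 3], [1, 2, 3]))   # same direction -> 1.0
print(cosine([1, 0, 0], [0, 1, 0]))   # orthogonal -> 0.0
```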


Page 53: Lecture: Vector Semantics (aka Distributional Semantics)

Cosine as a similarity metric

• -1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal

• Raw frequency or PPMI values are non-negative, so the cosine ranges from 0 to 1.

53  

Page 54: Lecture: Vector Semantics (aka Distributional Semantics)

              large   data   computer
apricot         1       0        0
digital         0       1        2
information     1       6        1

Which pair of words is more similar?

$\cos(\vec{v},\vec{w}) = \dfrac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \dfrac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

cosine(apricot, information) = $\dfrac{1+0+0}{\sqrt{1+0+0}\,\sqrt{1+36+1}} = \dfrac{1}{\sqrt{38}} = .16$

cosine(digital, information) = $\dfrac{0+6+2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \dfrac{8}{\sqrt{5}\,\sqrt{38}} = .58$

cosine(apricot, digital) = $\dfrac{0+0+0}{\sqrt{1+0+0}\,\sqrt{0+1+4}} = 0$
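The same three values, checked in code (a small NumPy sketch over the count table above):

```python
import numpy as np

apricot = np.array([1, 0, 0])
digital = np.array([0, 1, 2])
information = np.array([1, 6, 1])

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

print(round(cosine(apricot, information), 2))   # 1/sqrt(38)           -> 0.16
print(round(cosine(digital, information), 2))   # 8/(sqrt(5)*sqrt(38)) -> 0.58
print(round(cosine(apricot, digital), 2))       # 0.0
```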

Page 55: Lecture: Vector Semantics (aka Distributional Semantics)

Other  possible  similarity  measures  

Page 56: Lecture: Vector Semantics (aka Distributional Semantics)

The end