Natural Language Processing with Deep Learning CS224N/Ling284 Richard Socher Lecture 2: Word Vectors
Transcript
Page 1:

Natural Language Processing with Deep Learning

CS224N/Ling284

Richard Socher

Lecture 2: Word Vectors

Page 2:

Organization

• PSet 1 is released. Coding Session 1/22 (Monday; PA1 due Thursday)

• Some of the questions from Piazza:
  • Sharing the choose-your-own final project with another class seems fine --> Yes*
  • But how about the default final project? Can that also be used as a final project for a different course? --> Yes*
  • Are we allowing students to bring one sheet of notes for the midterm? --> Yes
  • Azure computing resources for Projects/PSet 4: part of milestone

Page 3:

Lecture Plan

1. Word meaning (15 mins)
2. Word2vec introduction (20 mins)
3. Word2vec objective function gradients (25 mins)
4. Optimization refresher (10 mins)

Page 4:

1. How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)

• the idea that is represented by a word, phrase, etc.

• the idea that a person wants to express by using words, signs, etc.

• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:

signifier (symbol) ⟺ signified (idea or thing) = denotation

Page 5:

How do we have usable meaning in a computer?
Common solution: Use e.g. WordNet, a resource containing lists of synonym sets and hypernyms ("is a" relationships).

e.g. hypernyms of "panda":
[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

e.g. synonym sets containing "good":
(adj) full, good
(adj) estimable, good, honorable, respectable
(adj) beneficial, good
(adj) good, just, upright
(adj) adept, expert, good, practiced, proficient, skillful
(adj) dear, good, near
(adj) good, right, ripe
…
(adv) well, good
(adv) thoroughly, soundly, good
(n) good, goodness
(n) commodity, trade good, good
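For reference, a minimal sketch of how such lookups can be done with NLTK's WordNet interface (assuming nltk is installed and the wordnet corpus has been downloaded via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# Synonym sets containing "good", with their parts of speech
for synset in wn.synsets("good"):
    print(f"({synset.pos()}) {', '.join(synset.lemma_names())}")

# Hypernym ("is a") chain for "panda"
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
print(list(panda.closure(hyper)))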

Page 6:

Problems with resources like WordNet

• Great as a resource but missing nuance

  • e.g. "proficient" is listed as a synonym for "good". This is only correct in some contexts.

• Missing new meanings of words

  • e.g. wicked, badass, nifty, wizard, genius, ninja, bombest

  • Impossible to keep up-to-date!

• Subjective

• Requires human labor to create and adapt

• Hard to compute accurate word similarity

Page 7:

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel

Words can be represented by one-hot vectors (one 1, the rest 0s):

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g. 500,000)
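As a concrete illustration, a small sketch of one-hot vectors in numpy (the toy vocabulary and indices are made up; a real vocabulary would have hundreds of thousands of entries):

import numpy as np

vocab = ["a", "conference", "hotel", "in", "motel", "seattle", "the"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

motel, hotel = one_hot("motel"), one_hot("hotel")
print(motel)                 # [0. 0. 0. 0. 1. 0. 0.]
print(np.dot(motel, hotel))  # 0.0: any two distinct one-hot vectors are orthogonal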

Page 8:

Problem with words as discrete symbols

Example: in web search, if a user searches for "Seattle motel", we would like to match documents containing "Seattle hotel". (Sec. 9.2.2)

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors!

Solution:
• Could rely on WordNet's list of synonyms to get similarity?
• Instead: learn to encode similarity in the vectors themselves

Page 9:

Representing words by their context

• Core idea: A word's meaning is given by the words that frequently appear close by.

• "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)

• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

• Use the many contexts of w to build up a representation of w.

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking.
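To make "context" concrete, a short sketch (toy corpus, window of size 2, deliberately naive whitespace tokenization) that collects the context words observed around each word:

from collections import defaultdict

corpus = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that Europe needs unified banking regulation to replace the hodgepodge",
    "India has just given its banking system a shot in the arm",
]

window_size = 2
contexts = defaultdict(list)

for sentence in corpus:
    tokens = sentence.lower().split()
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window_size), min(len(tokens), t + window_size + 1)
        contexts[center].extend(tokens[lo:t] + tokens[t + 1:hi])   # words in the window around the center

print(contexts["banking"])
# ['turning', 'into', 'crises', 'as', 'needs', 'unified', 'regulation', 'to', 'given', 'its', 'system', 'a']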

Page 10:

Word vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

Note: word vectors are sometimes called word embeddings or word representations.

linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Page 11:

2. Word2vec: Overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.

Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context ("outside") words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

Page 12:

Word2Vec Overview

• Example windows and process for computing P(w_{t+j} | w_t):

… problems turning into banking crises as …

Here "into" is the center word at position t; "problems", "turning" (to the left) and "banking", "crises" (to the right) are the outside context words in a window of size 2. The probabilities computed for this window are:

P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Page 14:

Word2vec: objective function

For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t.

Likelihood:

L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ)

where θ is all variables to be optimized.

The objective function J(θ) (sometimes called the cost or loss function) is the (average) negative log likelihood:

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

Minimizing the objective function ⟺ Maximizing predictive accuracy
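To make the averaging concrete, a small sketch that evaluates this negative log likelihood over a toy corpus, assuming some probability function prob(outside, center) is available (for example the softmax defined on the next slide); everything here is illustrative:

import math

def neg_log_likelihood(tokens, prob, m=2):
    """J(theta) = -(1/T) * sum over t, and over -m <= j <= m with j != 0, of log P(w_{t+j} | w_t)."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue
            total += math.log(prob(tokens[t + j], tokens[t]))
    return -total / T

# Example usage with a deliberately silly uniform model over a 5-word vocabulary:
tokens = "problems turning into banking crises".split()
print(neg_log_likelihood(tokens, prob=lambda o, c: 1.0 / 5))  # ≈ 4.51 = (14 window pairs / 5 positions) * log 5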

Page 15:

Word2vec: objective function

• We want to minimize the objective function:

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

• Question: How to calculate P(w_{t+j} | w_t; θ)?

• Answer: We will use two vectors per word w:
  • v_w when w is a center word
  • u_w when w is a context word

• Then for a center word c and a context word o:

P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)
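A small numpy sketch of this prediction function, with randomly initialized center vectors V and context vectors U (the matrix names and toy vocabulary are only for illustration):

import numpy as np

vocab = ["problems", "turning", "into", "banking", "crises", "as"]
idx = {w: i for i, w in enumerate(vocab)}
d = 8                                            # dimensionality of the word vectors
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(len(vocab), d))  # v_w: center-word vectors (one row per word)
U = rng.normal(scale=0.1, size=(len(vocab), d))  # u_w: context-word vectors

def prob(outside, center):
    """P(o | c) = exp(u_o . v_c) / sum over w of exp(u_w . v_c)"""
    scores = U @ V[idx[center]]                  # dot product of v_c with every u_w
    exp_scores = np.exp(scores - scores.max())   # shift by the max for numerical stability
    return exp_scores[idx[outside]] / exp_scores.sum()

print(prob("banking", "into"))   # roughly 1/6 before any training, since the vectors are random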

Page 16:

Word2Vec Overview with Vectors

• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into; u_problems, v_into, θ)

… problems turning into banking crises as …

With "into" as the center word at position t, the outside context words in a window of size 2 give:

P(u_problems | v_into), P(u_turning | v_into), P(u_banking | v_into), P(u_crises | v_into)

Page 17:

Word2Vec Overview with Vectors

• Example windows and process for computing P(w_{t+j} | w_t)

… problems turning into banking crises as …

Now "banking" is the center word at position t, and the outside context words in a window of size 2 give:

P(u_turning | v_banking), P(u_into | v_banking), P(u_crises | v_banking), P(u_as | v_banking)

Page 18:

Word2vec: prediction function

P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)

  • The dot product compares the similarity of o and c. Larger dot product = larger probability.
  • After taking the exponent, normalize over the entire vocabulary.

• This is an example of the softmax function ℝ^n → ℝ^n:

softmax(x)_i = exp(x_i) / ∑_{j=1}^{n} exp(x_j) = p_i

• The softmax function maps arbitrary values x_i to a probability distribution p_i
  • "max" because it amplifies the probability of the largest x_i
  • "soft" because it still assigns some probability to smaller x_i
  • Frequently used in Deep Learning
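A short sketch of the softmax itself, written the numerically stable way (subtracting the maximum before exponentiating, which leaves the result unchanged), to show the "soft" and "max" behavior described above:

import numpy as np

def softmax(x):
    """Map arbitrary scores x to a probability distribution."""
    exp_x = np.exp(x - x.max())    # subtracting a constant does not change the output
    return exp_x / exp_x.sum()

scores = np.array([3.0, 1.0, 0.2])
print(softmax(scores))             # [0.836 0.113 0.051]: the largest score is amplified,
                                   # but the smaller scores still receive some probability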

Page 19:

To train the model: Compute all vector gradients!

• Recall: θ represents all model parameters, in one long vector.
• In our case, with d-dimensional vectors and V-many words, θ stacks every word's center vector v and context vector u, so θ ∈ ℝ^{2dV}.
• Remember: every word has two vectors.
• We then optimize these parameters.

Page 20:

3. Derivations of gradient

• Whiteboard: see the video if you're not in class ;)

• The basic Lego piece

• Useful basics:

• If in doubt: write out with indices

• Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

  dy/dx = (dy/du)(du/dx)

Page 21:

Chain Rule

• Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

  dy/dx = (dy/du)(du/dx)

• Simple example:

Page 22:

Interactive Whiteboard Session!

Let's derive the gradient for the center word together.
For one example window and one example outside word, this means differentiating log P(o | c) with respect to v_c.

You then also need the gradient for context words (it's similar; left for homework). That's all of the parameters θ here.
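For reference, the result this derivation arrives at for the naive softmax (a standard identity worth checking your whiteboard work against; it is not spelled out on the slide itself):

∂/∂v_c log P(o | c) = ∂/∂v_c [ u_o^T v_c − log ∑_{w ∈ V} exp(u_w^T v_c) ]
                    = u_o − ∑_{w ∈ V} P(w | c) u_w

i.e. the observed context vector minus the model's expected context vector under the current probabilities.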

Page 23:

Calculating all gradients!

• We went through the gradient for each center vector v in a window
• We also need gradients for the outside vectors u
• Derive at home!
• Generally, in each window we will compute updates for all parameters that are being used in that window. For example:

… problems turning into banking crises as …

With "banking" as the center word at position t and a window of size 2, the window uses:

P(u_turning | v_banking), P(u_into | v_banking), P(u_crises | v_banking), P(u_as | v_banking)

Page 24:

Word2vec: More details

Why two vectors? → Easier optimization. Average both at the end.

Two model variants:
1. Skip-grams (SG): Predict context ("outside") words (position independent) given the center word
2. Continuous Bag of Words (CBOW): Predict the center word from a (bag of) context words

This lecture so far: the Skip-gram model

Additional efficiency in training:
1. Negative sampling

So far: focus on naive softmax (a simpler training method)

Page 25:

Gradient Descent

• We have a cost function J(θ) we want to minimize
• Gradient Descent is an algorithm to minimize J(θ)
• Idea: for the current value of θ, calculate the gradient of J(θ), then take a small step in the direction of the negative gradient. Repeat.

Note: our objectives are not convex like this :(

Page 26:

Intuition

For a simple convex function over two parameters.

Contour lines show levels of the objective function.

Page 27:

Gradient Descent

• Update equation (in matrix notation):

  θ^{new} = θ^{old} − α ∇_θ J(θ)

  where α = step size or learning rate

• Update equation (for a single parameter):

  θ_j^{new} = θ_j^{old} − α ∂J(θ)/∂θ_j

• Algorithm:
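A minimal sketch of this algorithm in Python (the names gradient_descent and grad_J are illustrative, and a fixed number of steps stands in for a real stopping criterion):

import numpy as np

def gradient_descent(theta, grad_J, alpha=0.01, num_steps=1000):
    """Repeatedly take a small step in the direction of the negative gradient."""
    for _ in range(num_steps):
        theta = theta - alpha * grad_J(theta)   # theta_new = theta_old - alpha * grad J(theta_old)
    return theta

# Example usage on a toy convex objective J(theta) = ||theta||^2, whose gradient is 2*theta:
theta = gradient_descent(np.array([3.0, -2.0]), grad_J=lambda th: 2 * th)
print(theta)   # very close to [0, 0], the minimum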

Page 28:

Stochastic Gradient Descent

• Problem: J(θ) is a function of all windows in the corpus (potentially billions!)
  • So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!
• Very bad idea for pretty much all neural nets!
• Solution: Stochastic gradient descent (SGD)
  • Repeatedly sample windows, and update after each one
• Algorithm:
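A corresponding SGD sketch (evaluate_gradient is a hypothetical helper and random.choice over a precomputed list stands in for window sampling; the point is that each update uses one sampled window rather than the whole corpus):

import random

def sgd(theta, corpus_windows, evaluate_gradient, alpha=0.05, num_steps=100_000):
    """Update the parameters after each sampled window instead of after a full pass."""
    for _ in range(num_steps):
        window = random.choice(corpus_windows)      # sample one window
        grad = evaluate_gradient(theta, window)     # gradient of J on that window only
        theta = theta - alpha * grad                # same update rule as before (theta, grad are numpy arrays)
    return theta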

Page 30:

PSet 1: The skip-gram model and negative sampling

• From the paper: "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)

• Overall objective function (they maximize):

  J(θ) = (1/T) ∑_{t=1}^{T} J_t(θ),  where  J_t(θ) = log σ(u_o^T v_c) + ∑_{i=1}^{k} E_{j ∼ P(w)} [ log σ(−u_j^T v_c) ]

• The sigmoid function (we'll become good friends soon):

  σ(x) = 1 / (1 + e^{−x})

• So we maximize the probability of two words co-occurring in the first log term.
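A small sketch of this objective for one (center, outside) pair plus k sampled negatives, written as a loss to minimize (the negative of J_t above); the vector arguments mirror the notation u_o, v_c:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, u_negs):
    """-log sigma(u_o . v_c) - sum over negatives of log sigma(-u_k . v_c)"""
    loss = -np.log(sigmoid(u_o @ v_c))                 # push the real outside word's score up
    loss -= np.sum(np.log(sigmoid(-(u_negs @ v_c))))   # push the k random words' scores down
    return loss

# Example usage with random 8-dimensional vectors and k = 5 negative samples:
rng = np.random.default_rng(0)
print(neg_sampling_loss(rng.normal(size=8), rng.normal(size=8), rng.normal(size=(5, 8))))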

Page 31:

PSet 1: The skip-gram model and negative sampling

• Simpler notation, more similar to class and PSet:

• We take k negative samples.
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word.

• P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power. (We provide this function in the starter code.)

• The power makes less frequent words be sampled more often.
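A sketch of how negative samples can be drawn from the unigram distribution raised to the 3/4 power (the toy vocabulary and counts are made up; the starter code's actual implementation may differ):

import numpy as np

# Toy unigram counts; in practice these come from the training corpus.
vocab  = np.array(["the", "banking", "crises", "into", "hodgepodge"])
counts = np.array([1000.0, 50.0, 20.0, 200.0, 1.0])

unigram = counts / counts.sum()       # U(w)
smoothed = unigram ** 0.75            # raise to the 3/4 power ...
smoothed /= smoothed.sum()            # ... and renormalize (the Z in U(w)^{3/4} / Z)

print(dict(zip(vocab, smoothed.round(3))))
# Rare words ("hodgepodge") gain probability mass relative to the raw unigram distribution.

k = 5
negatives = np.random.default_rng(0).choice(vocab, size=k, p=smoothed)
print(negatives)                      # the k negative samples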

Page 32:

PSet 1: The continuous bag of words model

• Main idea for continuous bag of words (CBOW): Predict the center word from the sum of surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model (a small sketch follows below).

• To make the assignment slightly easier: implementation of the CBOW model is not required (you can do it for a couple of bonus points!), but you do have to do the theory problem on CBOW.
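An illustrative sketch of the CBOW prediction step (not the required PSet implementation): the context words' vectors are summed and the center word is scored with a softmax over dot products, mirroring the skip-gram prediction function above; all names and sizes here are toy choices:

import numpy as np

vocab = ["problems", "turning", "into", "banking", "crises", "as"]
idx = {w: i for i, w in enumerate(vocab)}
d = 8
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(len(vocab), d))   # center-word ("output") vectors
U = rng.normal(scale=0.1, size=(len(vocab), d))   # context-word ("input") vectors

def cbow_center_probs(context_words):
    """P(center | context) computed from the sum of the surrounding words' vectors."""
    h = U[[idx[w] for w in context_words]].sum(axis=0)   # sum of the context vectors
    scores = V @ h
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

probs = cbow_center_probs(["turning", "into", "crises", "as"])
print(vocab[int(np.argmax(probs))])   # the model's guess for the center word ("banking" once trained)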
