Natural Language Processing with Deep Learning CS224N/Ling284 Richard Socher Lecture 2: Word Vectors
Transcript
Page 1:

Natural Language Processing with Deep Learning

CS224N/Ling284

Richard Socher

Lecture 2: Word Vectors

Page 2:

Organization

• PSet 1 is released. Coding Session 1/22 (Monday; PA1 due Thursday)

• Some of the questions from Piazza:
  • Sharing the choose-your-own final project with another class seems fine --> Yes*
  • But how about the default final project? Can that also be used as a final project for a different course? --> Yes*
  • Are we allowing students to bring one sheet of notes for the midterm? --> Yes
  • Azure computing resources for Projects/PSet 4: part of milestone

Page 3:

Lecture Plan

1. Word meaning (15 mins)
2. Word2vec introduction (20 mins)
3. Word2vec objective function gradients (25 mins)
4. Optimization refresher (10 mins)

Page 4:

1. How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)

• the idea that is represented by a word, phrase, etc.

• the idea that a person wants to express by using words, signs, etc.

• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:

signifier (symbol) ⟺ signified (idea or thing) = denotation

Page 5:

How do we have usable meaning in a computer?
Common solution: Use e.g. WordNet, a resource containing lists of synonym sets and hypernyms ("is a" relationships).

e.g. hypernyms of "panda":
[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

e.g. synonym sets containing "good":
(adj) full, good
(adj) estimable, good, honorable, respectable
(adj) beneficial, good
(adj) good, just, upright
(adj) adept, expert, good, practiced, proficient, skillful
(adj) dear, good, near
(adj) good, right, ripe
…
(adv) well, good
(adv) thoroughly, soundly, good
(n) good, goodness
(n) commodity, trade good, good
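For reference, a minimal sketch of how such lookups can be done with NLTK's WordNet interface (assuming nltk is installed and the wordnet corpus has been downloaded via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# Synonym sets containing "good", with their parts of speech
for synset in wn.synsets("good"):
    print(f"({synset.pos()}) {', '.join(synset.lemma_names())}")

# Hypernym ("is a") chain for "panda"
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
print(list(panda.closure(hyper)))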

Page 6:

Problems with resources like WordNet

• Great as a resource but missing nuance

  • e.g. "proficient" is listed as a synonym for "good". This is only correct in some contexts.

• Missing new meanings of words

  • e.g. wicked, badass, nifty, wizard, genius, ninja, bombest

  • Impossible to keep up-to-date!

• Subjective

• Requires human labor to create and adapt

• Hard to compute accurate word similarity

Page 7:

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel

Words can be represented by one-hot vectors (one 1, the rest 0s):

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g. 500,000)
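As a concrete illustration, a small sketch of one-hot vectors in numpy (the toy vocabulary and indices are made up; a real vocabulary would have hundreds of thousands of entries):

import numpy as np

vocab = ["a", "conference", "hotel", "in", "motel", "seattle", "the"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

motel, hotel = one_hot("motel"), one_hot("hotel")
print(motel)                 # [0. 0. 0. 0. 1. 0. 0.]
print(np.dot(motel, hotel))  # 0.0: any two distinct one-hot vectors are orthogonal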

Page 8:

Problem with words as discrete symbols

Example: in web search, if a user searches for "Seattle motel", we would like to match documents containing "Seattle hotel". (Sec. 9.2.2)

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors!

Solution:
• Could rely on WordNet's list of synonyms to get similarity?
• Instead: learn to encode similarity in the vectors themselves

Page 9:

Representing words by their context

• Core idea: A word's meaning is given by the words that frequently appear close by.

• "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)

• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

• Use the many contexts of w to build up a representation of w.

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking.
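To make "context" concrete, a short sketch (toy corpus, window of size 2, deliberately naive whitespace tokenization) that collects the context words observed around each word:

from collections import defaultdict

corpus = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that Europe needs unified banking regulation to replace the hodgepodge",
    "India has just given its banking system a shot in the arm",
]

window_size = 2
contexts = defaultdict(list)

for sentence in corpus:
    tokens = sentence.lower().split()
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window_size), min(len(tokens), t + window_size + 1)
        contexts[center].extend(tokens[lo:t] + tokens[t + 1:hi])   # words in the window around the center

print(contexts["banking"])
# ['turning', 'into', 'crises', 'as', 'needs', 'unified', 'regulation', 'to', 'given', 'its', 'system', 'a']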

Page 10:

Word vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

Note: word vectors are sometimes called word embeddings or word representations.

linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Page 11:

2. Word2vec: Overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.

Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context ("outside") words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

Page 12:

Word2Vec Overview

• Example windows and process for computing P(w_{t+j} | w_t):

… problems turning into banking crises as …

Here "into" is the center word at position t; "problems", "turning" (to the left) and "banking", "crises" (to the right) are the outside context words in a window of size 2. The probabilities computed for this window are:

P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

Page 14:

Word2vec: objective function

For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t.

Likelihood:

L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ)

where θ is all variables to be optimized.

The objective function J(θ) (sometimes called the cost or loss function) is the (average) negative log likelihood:

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

Minimizing the objective function ⟺ Maximizing predictive accuracy
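To make the averaging concrete, a small sketch that evaluates this negative log likelihood over a toy corpus, assuming some probability function prob(outside, center) is available (for example the softmax defined on the next slide); everything here is illustrative:

import math

def neg_log_likelihood(tokens, prob, m=2):
    """J(theta) = -(1/T) * sum over t, and over -m <= j <= m with j != 0, of log P(w_{t+j} | w_t)."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue
            total += math.log(prob(tokens[t + j], tokens[t]))
    return -total / T

# Example usage with a deliberately silly uniform model over a 5-word vocabulary:
tokens = "problems turning into banking crises".split()
print(neg_log_likelihood(tokens, prob=lambda o, c: 1.0 / 5))  # ≈ 4.51 = (14 window pairs / 5 positions) * log 5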

Page 15:

Word2vec: objective function

• We want to minimize the objective function:

J(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

• Question: How to calculate P(w_{t+j} | w_t; θ)?

• Answer: We will use two vectors per word w:
  • v_w when w is a center word
  • u_w when w is a context word

• Then for a center word c and a context word o:

P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)
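A small numpy sketch of this prediction function, with randomly initialized center vectors V and context vectors U (the matrix names and toy vocabulary are only for illustration):

import numpy as np

vocab = ["problems", "turning", "into", "banking", "crises", "as"]
idx = {w: i for i, w in enumerate(vocab)}
d = 8                                            # dimensionality of the word vectors
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(len(vocab), d))  # v_w: center-word vectors (one row per word)
U = rng.normal(scale=0.1, size=(len(vocab), d))  # u_w: context-word vectors

def prob(outside, center):
    """P(o | c) = exp(u_o . v_c) / sum over w of exp(u_w . v_c)"""
    scores = U @ V[idx[center]]                  # dot product of v_c with every u_w
    exp_scores = np.exp(scores - scores.max())   # shift by the max for numerical stability
    return exp_scores[idx[outside]] / exp_scores.sum()

print(prob("banking", "into"))   # roughly 1/6 before any training, since the vectors are random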

Page 16:

Word2Vec Overview with Vectors

• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into; u_problems, v_into, θ)

… problems turning into banking crises as …

With "into" as the center word at position t, the outside context words in a window of size 2 give:

P(u_problems | v_into), P(u_turning | v_into), P(u_banking | v_into), P(u_crises | v_into)

Page 17:

Word2Vec Overview with Vectors

• Example windows and process for computing P(w_{t+j} | w_t)

… problems turning into banking crises as …

Now "banking" is the center word at position t, and the outside context words in a window of size 2 give:

P(u_turning | v_banking), P(u_into | v_banking), P(u_crises | v_banking), P(u_as | v_banking)

Page 18:

Word2vec: prediction function

P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)

  • The dot product compares the similarity of o and c. Larger dot product = larger probability.
  • After taking the exponent, normalize over the entire vocabulary.

• This is an example of the softmax function ℝ^n → ℝ^n:

softmax(x)_i = exp(x_i) / ∑_{j=1}^{n} exp(x_j) = p_i

• The softmax function maps arbitrary values x_i to a probability distribution p_i
  • "max" because it amplifies the probability of the largest x_i
  • "soft" because it still assigns some probability to smaller x_i
  • Frequently used in Deep Learning
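A short sketch of the softmax itself, written the numerically stable way (subtracting the maximum before exponentiating, which leaves the result unchanged), to show the "soft" and "max" behavior described above:

import numpy as np

def softmax(x):
    """Map arbitrary scores x to a probability distribution."""
    exp_x = np.exp(x - x.max())    # subtracting a constant does not change the output
    return exp_x / exp_x.sum()

scores = np.array([3.0, 1.0, 0.2])
print(softmax(scores))             # [0.836 0.113 0.051]: the largest score is amplified,
                                   # but the smaller scores still receive some probability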

Page 19:

To train the model: Compute all vector gradients!

• Recall: θ represents all model parameters, in one long vector.
• In our case, with d-dimensional vectors and V-many words, θ stacks every word's center vector v and context vector u, so θ ∈ ℝ^{2dV}.
• Remember: every word has two vectors.
• We then optimize these parameters.

Page 20:

3. Derivations of gradient

• Whiteboard: see the video if you're not in class ;)

• The basic Lego piece

• Useful basics:

• If in doubt: write out with indices

• Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

  dy/dx = (dy/du)(du/dx)

Page 21:

Chain Rule

• Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

  dy/dx = (dy/du)(du/dx)

• Simple example:

Page 22:

Interactive Whiteboard Session!

Let's derive the gradient for the center word together.
For one example window and one example outside word, this means differentiating log P(o | c) with respect to v_c.

You then also need the gradient for context words (it's similar; left for homework). That's all of the parameters θ here.
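For reference, the result this derivation arrives at for the naive softmax (a standard identity worth checking your whiteboard work against; it is not spelled out on the slide itself):

∂/∂v_c log P(o | c) = ∂/∂v_c [ u_o^T v_c − log ∑_{w ∈ V} exp(u_w^T v_c) ]
                    = u_o − ∑_{w ∈ V} P(w | c) u_w

i.e. the observed context vector minus the model's expected context vector under the current probabilities.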

Page 23:

Calculating all gradients!

• We went through the gradient for each center vector v in a window
• We also need gradients for the outside vectors u
• Derive at home!
• Generally, in each window we will compute updates for all parameters that are being used in that window. For example:

… problems turning into banking crises as …

With "banking" as the center word at position t and a window of size 2, the window uses:

P(u_turning | v_banking), P(u_into | v_banking), P(u_crises | v_banking), P(u_as | v_banking)

Page 24:

Word2vec: More details

Why two vectors? → Easier optimization. Average both at the end.

Two model variants:
1. Skip-grams (SG): Predict context ("outside") words (position independent) given the center word
2. Continuous Bag of Words (CBOW): Predict the center word from a (bag of) context words

This lecture so far: the Skip-gram model

Additional efficiency in training:
1. Negative sampling

So far: focus on naive softmax (a simpler training method)

Page 25:

Gradient Descent

• We have a cost function J(θ) we want to minimize
• Gradient Descent is an algorithm to minimize J(θ)
• Idea: for the current value of θ, calculate the gradient of J(θ), then take a small step in the direction of the negative gradient. Repeat.

Note: our objectives are not convex like this :(

Page 26:

Intuition

For a simple convex function over two parameters.

Contour lines show levels of the objective function.

Page 27:

Gradient Descent

• Update equation (in matrix notation):

  θ^{new} = θ^{old} − α ∇_θ J(θ)

  where α = step size or learning rate

• Update equation (for a single parameter):

  θ_j^{new} = θ_j^{old} − α ∂J(θ)/∂θ_j

• Algorithm:
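A minimal sketch of this algorithm in Python (the names gradient_descent and grad_J are illustrative, and a fixed number of steps stands in for a real stopping criterion):

import numpy as np

def gradient_descent(theta, grad_J, alpha=0.01, num_steps=1000):
    """Repeatedly take a small step in the direction of the negative gradient."""
    for _ in range(num_steps):
        theta = theta - alpha * grad_J(theta)   # theta_new = theta_old - alpha * grad J(theta_old)
    return theta

# Example usage on a toy convex objective J(theta) = ||theta||^2, whose gradient is 2*theta:
theta = gradient_descent(np.array([3.0, -2.0]), grad_J=lambda th: 2 * th)
print(theta)   # very close to [0, 0], the minimum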

Page 28:

Stochastic Gradient Descent

• Problem: J(θ) is a function of all windows in the corpus (potentially billions!)
  • So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!
• Very bad idea for pretty much all neural nets!
• Solution: Stochastic gradient descent (SGD)
  • Repeatedly sample windows, and update after each one
• Algorithm:
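A corresponding SGD sketch (evaluate_gradient is a hypothetical helper and random.choice over a precomputed list stands in for window sampling; the point is that each update uses one sampled window rather than the whole corpus):

import random

def sgd(theta, corpus_windows, evaluate_gradient, alpha=0.05, num_steps=100_000):
    """Update the parameters after each sampled window instead of after a full pass."""
    for _ in range(num_steps):
        window = random.choice(corpus_windows)      # sample one window
        grad = evaluate_gradient(theta, window)     # gradient of J on that window only
        theta = theta - alpha * grad                # same update rule as before (theta, grad are numpy arrays)
    return theta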

Page 30:

PSet 1: The skip-gram model and negative sampling

• From the paper: "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)

• Overall objective function (they maximize):

  J(θ) = (1/T) ∑_{t=1}^{T} J_t(θ),  where  J_t(θ) = log σ(u_o^T v_c) + ∑_{i=1}^{k} E_{j ∼ P(w)} [ log σ(−u_j^T v_c) ]

• The sigmoid function (we'll become good friends soon):

  σ(x) = 1 / (1 + e^{−x})

• So we maximize the probability of two words co-occurring in the first log term.
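A small sketch of this objective for one (center, outside) pair plus k sampled negatives, written as a loss to minimize (the negative of J_t above); the vector arguments mirror the notation u_o, v_c:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, u_negs):
    """-log sigma(u_o . v_c) - sum over negatives of log sigma(-u_k . v_c)"""
    loss = -np.log(sigmoid(u_o @ v_c))                 # push the real outside word's score up
    loss -= np.sum(np.log(sigmoid(-(u_negs @ v_c))))   # push the k random words' scores down
    return loss

# Example usage with random 8-dimensional vectors and k = 5 negative samples:
rng = np.random.default_rng(0)
print(neg_sampling_loss(rng.normal(size=8), rng.normal(size=8), rng.normal(size=(5, 8))))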

Page 31:

PSet 1: The skip-gram model and negative sampling

• Simpler notation, more similar to class and PSet:

• We take k negative samples.
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word.

• P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power. (We provide this function in the starter code.)

• The power makes less frequent words be sampled more often.
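A sketch of how negative samples can be drawn from the unigram distribution raised to the 3/4 power (the toy vocabulary and counts are made up; the starter code's actual implementation may differ):

import numpy as np

# Toy unigram counts; in practice these come from the training corpus.
vocab  = np.array(["the", "banking", "crises", "into", "hodgepodge"])
counts = np.array([1000.0, 50.0, 20.0, 200.0, 1.0])

unigram = counts / counts.sum()       # U(w)
smoothed = unigram ** 0.75            # raise to the 3/4 power ...
smoothed /= smoothed.sum()            # ... and renormalize (the Z in U(w)^{3/4} / Z)

print(dict(zip(vocab, smoothed.round(3))))
# Rare words ("hodgepodge") gain probability mass relative to the raw unigram distribution.

k = 5
negatives = np.random.default_rng(0).choice(vocab, size=k, p=smoothed)
print(negatives)                      # the k negative samples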

Page 32:

PSet 1: The continuous bag of words model

• Main idea for continuous bag of words (CBOW): Predict the center word from the sum of surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model (a small sketch follows below).

• To make the assignment slightly easier: implementation of the CBOW model is not required (you can do it for a couple of bonus points!), but you do have to do the theory problem on CBOW.
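An illustrative sketch of the CBOW prediction step (not the required PSet implementation): the context words' vectors are summed and the center word is scored with a softmax over dot products, mirroring the skip-gram prediction function above; all names and sizes here are toy choices:

import numpy as np

vocab = ["problems", "turning", "into", "banking", "crises", "as"]
idx = {w: i for i, w in enumerate(vocab)}
d = 8
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(len(vocab), d))   # center-word ("output") vectors
U = rng.normal(scale=0.1, size=(len(vocab), d))   # context-word ("input") vectors

def cbow_center_probs(context_words):
    """P(center | context) computed from the sum of the surrounding words' vectors."""
    h = U[[idx[w] for w in context_words]].sum(axis=0)   # sum of the context vectors
    scores = V @ h
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

probs = cbow_center_probs(["turning", "into", "crises", "as"])
print(vocab[int(np.argmax(probs))])   # the model's guess for the center word ("banking" once trained)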
