Page 1: Copy/Pointer + Self-Attention

Copy/Pointer + Self-Attention

Wei Xu (many slides from Greg Durrett)

Page 2: Copy/Pointer + Self-Attention

Administrivia

‣ Mid-semester feedback survey

‣ Thanks to many of you who have filled it in!
‣ If you haven't yet, today is a good time to do it.
‣ We've responded to some comments on Piazza (likely one more update)

‣ Final course project: will discuss more next class.

‣ Midterm is released (due Nov 1st)

Page 3: Copy/Pointer + Self-Attention

This Lecture

‣ Transformer architecture (if time)

‣ Copy mechanisms for copying words to the output

‣ Decoding in seq2seq models

‣ Applications of Seq2Seq (beyond MT)

Page 4: Copy/Pointer + Self-Attention

Other Applications of Seq2Seq

Page 5: Copy/Pointer + Self-Attention

Regex Prediction

‣ Seq2seq models can be used for many other tasks!

‣ Predict a regex from a text description

‣ Problem: requires a lot of data: 10,000 examples needed to get ~60% accuracy on pretty simple regexes

Locascio et al. (2016)

Page 6: Copy/Pointer + Self-Attention

Semantic Parsing as Translation

Jia and Liang (2015)

"what states border Texas"

‣ Write down a linearized form of the semantic parse, then train seq2seq models to directly translate into this representation

‣ No need to have an explicit grammar, simplifies algorithms

‣ Might not produce well-formed logical forms, might require lots of data

Semantic Parsing / Lambda Calculus: https://www.youtube.com/watch?v=OocGXG-BY6k&t=200s

Page 7: Copy/Pointer + Self-Attention

SQL Generation

‣ Convert a natural language description into a SQL query against some DB

‣ How to ensure that well-formed SQL is generated?

‣ Three components

‣ How to capture column names + constants?

‣ Pointer mechanisms

Zhong et al. (2017)

Page 8: Copy/Pointer + Self-Attention

Text Simplification (Text-to-Text)

Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification." ACL (2020)

Page 9: Copy/Pointer + Self-Attention

Text Simplification

[Figure: aligned simplification corpora with 94k and 394k sentence pairs]

Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification." ACL (2020)

Page 10: Copy/Pointer + Self-Attention

Copy/Pointer Networks

Page 11: Copy/Pointer + Self-Attention

Unknown Words

Jean et al. (2015), Luong et al. (2015)

‣ Want to be able to copy named entities like Pont-de-Buis

P(y_i | x, y_1, ..., y_{i-1}) = softmax(W [c_i; h̄_i])

(c_i comes from attention; h̄_i is the RNN hidden state)

‣ Problems: the target word has to be in the vocabulary, and attention + RNN need to generate a good embedding to pick it
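As a concrete reading of this equation, here is a minimal numpy sketch, assuming c_i (attention context), h̄_i (decoder hidden state), and a learned output matrix W; all names and shapes are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def vocab_distribution(c_i, h_bar_i, W):
    """P(y_i | x, y_1..y_{i-1}) = softmax(W [c_i; h̄_i])."""
    return softmax(W @ np.concatenate([c_i, h_bar_i]))

# Toy shapes: context and hidden state of size 8, vocab of 1000
rng = np.random.default_rng(0)
c_i, h_bar_i = rng.normal(size=8), rng.normal(size=8)
W = rng.normal(size=(1000, 16))    # maps [c_i; h̄_i] to vocab scores
print(vocab_distribution(c_i, h_bar_i, W).sum())   # 1.0
```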

Page 12: Copy/Pointer + Self-Attention

Copying

[Figure: French example with source words "Le", "matin", "Pont-de-Buis", "ecotax"]

‣ Some words we want to copy may not be in the fixed output vocab (Pont-de-Buis)

‣ Solution: vocabulary contains "normal" vocab as well as words in the input.

Page 13: Copy/Pointer + Self-Attention

Pointer Networks

‣ Standard decoder (P_vocab): softmax over the vocabulary

P_vocab(y_i | x, y_1, ..., y_{i-1}) = softmax(W [c_i; h̄_i])

‣ Pointer network (P_pointer): predict from source words, instead of the target vocabulary

[Figure: distribution over source tokens w_1 ... w_n ("Le", "matin", "Pont-de-Buis", "portico", "in") rather than over the target vocabulary]

P_pointer(y_i | x, y_1, ..., y_{i-1}) ∝ exp(h_j^T V h̄_i) if y_i = w_j, and 0 otherwise

Page 14: Copy/Pointer + Self-Attention

Pointer Generator Mixture Models

‣ Define the decoder model as a mixture model of P_vocab and P_pointer

‣ Predict P(copy) based on the decoder state, input, etc.

‣ Marginalize over the copy variable during training and inference

‣ Model will be able to both generate and copy, and flexibly adapt between the two

Gulcehre et al. (2016), Gu et al. (2016)

[Figure: final distribution mixes the vocabulary softmax (weight 1 - P(copy)) with the pointer distribution over source words such as Pont-de-Buis and portico (weight P(copy))]
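One way to realize the mixture (a sketch, not the exact parameterization of these papers): scatter-add the pointer weights into the vocabulary distribution. P(copy) would be predicted from the decoder state, e.g. via a sigmoid gate, which is elided here; training then maximizes the log of this mixed probability, marginalizing out the copy decision.

```python
import numpy as np

def mixture_distribution(p_vocab, p_pointer, source_ids, p_copy):
    """Final distribution: p_copy * P_pointer + (1 - p_copy) * P_vocab.

    p_vocab:    (V,) softmax over the output vocabulary
    p_pointer:  (src_len,) distribution over source positions
    source_ids: (src_len,) vocabulary id of each source token
    p_copy:     scalar gate (predicted from the decoder state, not shown)
    """
    final = (1.0 - p_copy) * p_vocab
    # Route each source position's pointer mass to that word's vocab slot;
    # np.add.at accumulates when a word occurs several times in the source.
    np.add.at(final, source_ids, p_copy * p_pointer)
    return final

# Toy usage: vocab of 6 words, source = tokens [2, 5, 2]
p_vocab = np.full(6, 1 / 6)
p_pointer = np.array([0.7, 0.2, 0.1])
print(mixture_distribution(p_vocab, p_pointer, np.array([2, 5, 2]), 0.5).sum())  # 1.0
```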

Page 15: Copy/Pointer + Self-Attention

Copying in Summarization

See et al. (2017)


Page 18: Copy/Pointer + Self-Attention

Decoding Strategies

Page 19: Copy/Pointer + Self-Attention

Greedy Decoding

‣ Generate the next word conditioned on the previous word as well as the hidden state

[Figure: encoder reads "the movie was great"; decoder emits "le film était bon [STOP]" starting from <s>]

‣ During inference: need to compute the argmax over the word predictions and then feed that to the next RNN state. This is greedy decoding

P(y_i | x, y_1, ..., y_{i-1}) = softmax(W h̄)   (or attention/copying/etc.)

y_pred = argmax_y P(y | x, y_1, ..., y_{i-1})
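A schematic greedy loop, assuming a step function that returns the next-token distribution and updated hidden state (step, fake_step, and the token ids are placeholders, not a real API):

```python
import numpy as np

def greedy_decode(step, hidden, start_id, stop_id, max_len=50):
    """step(token_id, hidden) -> (probs over vocab, new hidden) is assumed."""
    out, tok = [], start_id
    for _ in range(max_len):
        probs, hidden = step(tok, hidden)
        tok = int(np.argmax(probs))   # greedy choice: take the argmax word
        if tok == stop_id:
            break
        out.append(tok)               # feed the chosen word back in next step
    return out

# Toy usage with a fake model that prefers token 3 twice, then stop (id 0)
def fake_step(tok, h):
    probs = np.full(10, 0.01)
    probs[3 if h < 2 else 0] = 0.9
    return probs / probs.sum(), h + 1

print(greedy_decode(fake_step, 0, start_id=1, stop_id=0))   # [3, 3]
```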

Page 20: Copy/Pointer + Self-Attention

Problems with Greedy Decoding

‣ Only returns one solution, and it may not be optimal

‣ Can address this with beam search, which usually works better… but even beam search may not find the correct answer! (the max-probability sequence)

Stahlberg and Byrne (2019)

"A sentence is classified as a search error if the decoder does not find the global best model score."
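Beam search keeps the k best partial hypotheses at each step instead of one. A compact sketch under the same assumed step interface as the greedy sketch above; note it maximizes raw model score:

```python
import numpy as np

def beam_search(step, hidden, start_id, stop_id, k=5, max_len=50):
    """Keep the k highest-scoring partial hypotheses at each step."""
    # Each hypothesis: (log prob, token list, hidden state, finished?)
    beams = [(0.0, [start_id], hidden, False)]
    for _ in range(max_len):
        candidates = []
        for lp, toks, h, done in beams:
            if done:                   # finished hypotheses still compete
                candidates.append((lp, toks, h, True))
                continue
            probs, h2 = step(toks[-1], h)
            for tok in np.argsort(probs)[-k:]:           # top-k expansions
                candidates.append((lp + np.log(probs[tok] + 1e-12),
                                   toks + [int(tok)], h2, int(tok) == stop_id))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(done for *_, done in beams):
            break
    # Raw model score, no length normalization: short outputs can win,
    # which connects to the empty-translation problem on the next slide.
    return beams[0][1]
```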

Page 21: Copy/Pointer + Self-Attention

"Problems" with Beam Decoding

‣ For machine translation, the highest-probability sequence is often the empty string, i.e., a single </s> token! (>50% of the time)

Stahlberg and Byrne (2019)

‣ Beam search results in fortuitous search errors that avoid these bad solutions

‣ Exact inference uses depth-first search, but cuts off branches that fall below a lower bound.

Page 22: Copy/Pointer + Self-Attention

Sampling

‣ Beam search may give many similar sequences, and these actually may be too close to the optimal. Can sample instead:

‣ Text degeneration: the greedy solution can be uninteresting/vacuous for various reasons. Sampling can help.

P(y_i | x, y_1, ..., y_{i-1}) = softmax(W h̄)

y_sampled ~ P(y | x, y_1, ..., y_{i-1})

Page 23: Copy/Pointer + Self-Attention

Beam Search vs. Sampling

Holtzman et al. (2019)

Page 24: Copy/Pointer + Self-Attention

Decoding Strategies

‣ Greedy

‣ Beam search

‣ Sampling (e.g., top-k or nucleus sampling)

‣ Top-k: take the top k most likely words (e.g., k = 5), sample from those

‣ Nucleus: take the top p% (e.g., 95%) of the distribution, sample from within that
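Both truncation schemes in a few lines, operating on a plain next-word probability vector (a sketch; k = 5 and p = 0.95 match the examples above):

```python
import numpy as np

def top_k_sample(probs, k=5, rng=None):
    """Sample among the k most likely words only."""
    rng = rng or np.random.default_rng()
    keep = np.argsort(probs)[-k:]           # indices of the top-k words
    p = probs[keep] / probs[keep].sum()     # renormalize over them
    return int(rng.choice(keep, p=p))

def nucleus_sample(probs, p=0.95, rng=None):
    """Sample within the smallest set of words whose total mass reaches p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]         # most likely first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

# Toy usage on a peaked distribution over 6 words
probs = np.array([0.55, 0.25, 0.1, 0.05, 0.03, 0.02])
print(top_k_sample(probs, k=5), nucleus_sample(probs, p=0.95))
```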

Page 25: Copy/Pointer + Self-Attention

Beam Search vs. Sampling

Holtzman et al. (2019)

‣ These are samples from an unconditioned language model (not a seq2seq model)

‣ Sampling is better but sometimes draws too far from the tail of the distribution

Page 26: Copy/Pointer + Self-Attention

Generation Tasks

‣ There is a range of seq2seq modeling tasks we will address

[Figure: tasks arranged from less constrained to more constrained, including unconditioned sampling (e.g., story generation), dialogue, summarization, text-to-text, translation, data-to-text, and text-to-code]

‣ For more constrained problems: greedy/beam decoding are usually best

‣ For less constrained problems: nucleus sampling introduces favorable variation in the output

Page 27: Copy/Pointer + Self-Attention

Transformers

Page 28: Copy/Pointer + Self-Attention

Attention Is All You Need

Vaswani et al. (2017)

Page 29: Copy/Pointer + Self-Attention

Readings

‣ "The Annotated Transformer" by Sasha Rush: https://nlp.seas.harvard.edu/2018/04/03/attention.html

‣ "The Illustrated Transformer" by Jay Alammar: http://jalammar.github.io/illustrated-transformer/

Page 30: Copy/Pointer + Self-Attention

Sentence Encoders

[Example: "the movie was great"]

‣ LSTM abstraction: maps each vector in a sentence to a new, context-aware vector

‣ CNNs do something similar with filters

‣ Attention can give us a third way to do this

Vaswani et al. (2017)

Page 31: Copy/Pointer + Self-Attention

Self-Attention

Vaswani et al. (2017)

"The ballerina is very excited that she will dance in the show."

‣ Assume we're using GloVe: what do we want our neural network to do?

‣ Q: What words need to be contextualized here?

Page 32: Copy/Pointer + Self-Attention

Self-Attention

Vaswani et al. (2017)

"The ballerina is very excited that she will dance in the show."

‣ Assume we're using GloVe: what do we want our neural network to do?

‣ What words need to be contextualized here?

‣ Pronouns need to look at antecedents

‣ Ambiguous words should look at context

‣ Words should look at syntactic parents/children

‣ Problem: LSTMs and CNNs don't do this

Page 33: Copy/Pointer + Self-Attention

Self-Attention

Vaswani et al. (2017)

"The ballerina is very excited that she will dance in the show."

‣ Want: [figure contrasts desired long-distance connections over the sentence with what LSTMs/CNNs capture]

‣ LSTMs/CNNs: tend to look at local context

‣ To appropriately contextualize embeddings, we need to pass information over long distances dynamically for each word

Page 34: Copy/Pointer + Self-Attention

Self-Attention

Vaswani et al. (2017)

[Example: "the movie was great"; each x_i is mapped to a contextualized x'_i]

‣ Each word forms a "query" which then computes attention over each word

α_{i,j} = softmax_j(x_i^T x_j)   (scalar)

x'_i = Σ_{j=1}^n α_{i,j} x_j   (vector = sum of scalar * vector)

‣ Multiple "heads": analogous to different convolutional filters. Use parameters W_k and V_k to get different attention values and transform vectors

α_{k,i,j} = softmax_j(x_i^T W_k x_j)

x'_{k,i} = Σ_{j=1}^n α_{k,i,j} V_k x_j
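The equations above, directly in numpy: one parameter-free head, then K heads with per-head W_k and V_k (a sketch of the mechanism on this slide, not the full scaled dot-product attention of the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """α_{i,j} = softmax_j(x_i^T x_j), x'_i = Σ_j α_{i,j} x_j."""
    alpha = softmax(X @ X.T, axis=-1)    # (n, n): row i attends over all j
    return alpha @ X                      # each row is a weighted sum of words

def multi_head_self_attention(X, Ws, Vs):
    """K heads: α_{k,i,j} = softmax_j(x_i^T W_k x_j), x'_{k,i} = Σ_j α V_k x_j."""
    heads = []
    for W, V in zip(Ws, Vs):
        alpha = softmax(X @ W @ X.T, axis=-1)
        heads.append(alpha @ (X @ V.T))   # rows of X @ V.T are (V_k x_j)^T
    return np.concatenate(heads, axis=-1)

# Toy usage: n = 4 words, d = 8 dims, K = 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Ws = [rng.normal(size=(8, 8)) for _ in range(2)]
Vs = [rng.normal(size=(8, 8)) for _ in range(2)]
print(multi_head_self_attention(X, Ws, Vs).shape)   # (4, 16)
```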

Page 35: Copy/Pointer + Self-Attention

What can self-attention do?

Vaswani et al. (2017)

"The ballerina is very excited that she will dance in the show."

‣ Attend nearby + to semantically related terms

[Figure: two example attention distributions over the sentence: 0.5, 0.2, 0.1, 0.1, 0.1, 0, ... and 0.5, 0, 0.4, 0, 0.1, 0, ...]

‣ Why multiple heads? Softmaxes end up being peaked; a single distribution cannot easily put weight on multiple things

‣ This is a demonstration; we will revisit what these models actually learn when we discuss BERT

Page 36: Copy/Pointer + Self-A3en4on

�36

TransformerUses

‣ Supervised:transformercanreplaceLSTMasencoder,decoder,orboth;willrevisitthiswhenwediscussMT

‣ Unsupervised:transformersworkbe8erthanLSTMforunsupervisedpre-trainingofembeddings:predictwordgivencontextwords

‣ BERT(BidirecPonalEncoderRepresentaPonsfromTransformers):pretrainingtransformerlanguagemodelssimilartoELMo

‣ Strongerthansimilarmethods,SOTAon~11tasks(includingNER—92.8F1)

suchasinmachinetransla4onandnaturallanguagegenera4ontasks.

Vaswanietal.(2017)

‣ Encoderanddecoderarebothtransformers

‣ Decoderconsumesthepreviousgeneratedtoken(anda3endstoinput),buthasnorecurrentstate

‣Manyotherdetailstogetittowork:residualconnec4ons,layernormaliza4on,posi4onalencoding,op4mizerwithlearningrateschedule,labelsmoothing….

Page 37: Copy/Pointer + Self-Attention

Transformer Uses

‣ Unsupervised: transformers work better than LSTMs for unsupervised pre-training of embeddings: predict a word given context words

‣ BERT (Bidirectional Encoder Representations from Transformers): pretraining transformer language models similar to ELMo (based on LSTMs)

‣ Stronger than similar methods, SOTA on ~11 tasks (including NER: 92.8 F1)

Page 38: Copy/Pointer + Self-Attention

Takeaways

‣ Attention is very helpful for seq2seq models, and explicit copying can extend this even further

‣ Carefully choose a decoding strategy

‣ Up next: Transformers (to finish up)

‣ Then: pre-trained models