Attention – DLAI – Marta R. Costa-jussà (slides adapted from Graham Neubig's lectures)

May 21, 2020

Transcript
Page 1:

Attention – DLAI – Marta R. Costa-jussà

Slides adapted from Graham Neubig's lectures

Page 2:

What advancements excite you most in the field? I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning.

ILYA SUTSKEVER, RESEARCH DIRECTOR AND CO-FOUNDER OF OPENAI

Page 3:

Outline

1. Sequence modeling & sequence-to-sequence models [WRAP-UP FROM PREVIOUS RNN SESSION]
2. Attention-based mechanism
3. Attention varieties
4. Attention improvements
5. Applications
6. “Attention is all you need”
7. Summary

Page 4:

Sequence modeling

Model the probability of sequences of words.

From the previous lecture… we model sequences with RNNs.

[Figure: an RNN language model reading “<s> I'm fine .” and predicting p(I'm), p(fine | I'm), p(. | fine), and finally EOS.]
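
To make the factorization concrete, the figure's probabilities multiply out as p(I'm) · p(fine | I'm) · p(. | fine) · p(EOS | .). Below is a minimal numpy sketch of one RNN language-model step, a vanilla (Elman) recurrence with toy sizes, random weights, and made-up word ids; all names here are illustrative, not from the slides.

    import numpy as np

    # Toy RNN language model: one step consumes a word and predicts the next.
    rng = np.random.default_rng(0)
    vocab_size, embed_dim, hidden_dim = 10, 8, 16
    E = rng.normal(size=(vocab_size, embed_dim))      # word embeddings
    W_xh = rng.normal(size=(embed_dim, hidden_dim))   # input-to-hidden weights
    W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
    W_hy = rng.normal(size=(hidden_dim, vocab_size))  # hidden-to-output weights

    def rnn_lm_step(word_id, h):
        """Update the state with one word and return p(next word | history)."""
        h = np.tanh(E[word_id] @ W_xh + h @ W_hh)     # vanilla RNN recurrence
        logits = h @ W_hy
        probs = np.exp(logits - logits.max())
        return probs / probs.sum(), h

    # Accumulate log p(w1) + log p(w2 | w1) + ... over a toy id sequence.
    h, log_prob = np.zeros(hidden_dim), 0.0
    for prev, nxt in [(0, 1), (1, 2), (2, 3)]:        # e.g. <s> I'm fine .
        probs, h = rnn_lm_step(prev, h)
        log_prob += np.log(probs[nxt])
    print("log p(sequence) =", log_prob)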

Page 5:

Sequence-to-sequence models

[Figure: an encoder-decoder pair translating “how are you ?” into “¿ Cómo estás ?”. The encoder reads the source; its final state, the THOUGHT/CONTEXT VECTOR, initializes the decoder, which emits “¿ Cómo estás ?” token by token from “<s>” until EOS.]
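
A minimal numpy sketch of the encoder-decoder idea (toy random vectors stand in for word embeddings; every name and size here is illustrative): note that everything the decoder knows about the source sentence must pass through the single thought/context vector.

    import numpy as np

    # Toy encoder-decoder: the source is squeezed into one thought vector.
    rng = np.random.default_rng(1)
    d = 8                                             # hidden size (illustrative)
    W_enc, U_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    W_dec, U_dec = rng.normal(size=(d, d)), rng.normal(size=(d, d))

    def encode(source_vectors):
        """Run the encoder RNN; its final state is the thought vector."""
        h = np.zeros(d)
        for x in source_vectors:
            h = np.tanh(x @ W_enc + h @ U_enc)
        return h

    def decode_step(y_prev, s):
        """One decoder step; the source is visible only through the state s."""
        return np.tanh(y_prev @ W_dec + s @ U_dec)

    source = rng.normal(size=(4, d))                  # e.g. "how are you ?"
    s = encode(source)                                # THOUGHT/CONTEXT VECTOR
    y = np.zeros(d)                                   # embedding of <s>
    for _ in range(3):                                # a few decoder steps
        s = decode_step(y, s)
        y = s                                         # toy feedback choice
    print("final decoder state:", np.round(s[:4], 2), "...")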

Page 6:

Any problem with these models?

Page 7:

Page 8:

2. Attention-based mechanism

Page 9:

Motivation in the case of MT

Page 10:

Motivation in the case of MT

Page 11:

Attention

[Figure: an encoder and a decoder connected through a weighted combination (+) of the encoder states.]

Attention allows the model to use multiple vectors, whose number depends on the length of the input, rather than a single fixed-size vector.

Page 12:

Attention Key Ideas

• Encode each word in the input and output sentence into a vector

• When decoding, perform a linear combination of these vectors, weighted by “attention weights”

• Use this combination in picking the next word

Page 13:

Attention Computation I

• Use a “query” vector (the decoder state) and “key” vectors (all encoder states)

• For each query-key pair, calculate a weight

• Normalize the weights to add to one using the softmax

[Figure: a query vector scored against four key vectors gives a1=2.1, a2=-0.1, a3=0.3, a4=-1.0; after the softmax, a1=0.5, a2=0.3, a3=0.1, a4=0.1.]
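
A quick numpy sketch of the normalization step on the slide's raw scores. (The slide's weights 0.5/0.3/0.1/0.1 appear rounded for illustration; the exact softmax of these scores differs, but the point, a non-negative distribution summing to one, holds either way.)

    import numpy as np

    scores = np.array([2.1, -0.1, 0.3, -1.0])    # query-key scores from the slide
    weights = np.exp(scores - scores.max())      # subtract the max for stability
    weights /= weights.sum()
    print(np.round(weights, 3), weights.sum())   # non-negative, sums to 1.0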

Page 14:

Attention Computation II

• Combine together the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum

[Figure: value vectors multiplied by a1=0.5, a2=0.3, a3=0.1, a4=0.1 and summed.]
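
And the combination step, sketched with toy random value vectors and the slide's illustrative weights:

    import numpy as np

    rng = np.random.default_rng(2)
    values = rng.normal(size=(4, 8))            # one value vector per source word
    weights = np.array([0.5, 0.3, 0.1, 0.1])    # attention weights from the slide
    context = weights @ values                  # weighted sum of the values
    print(context.shape)                        # (8,): a single context vector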

Page 15:

Attention Score Functions (q is the query and k is the key)

Multi-layer Perceptron (Bahdanau et al., 2015): $a(q, k) = w_2^\top \tanh(W_1 [q; k])$. Flexible, often very good with large data.

Bilinear (Luong et al., 2015): $a(q, k) = q^\top W k$.

Dot Product (Luong et al., 2015): $a(q, k) = q^\top k$. No parameters! But requires the sizes to be the same.

Scaled Dot Product (Vaswani et al., 2017): $a(q, k) = \frac{q^\top k}{\sqrt{|k|}}$. Scale by the size of the vector.
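
The four score functions as a numpy sketch; the parameter shapes are illustrative assumptions, and the MLP variant is written in the additive form of Bahdanau et al. (2015):

    import numpy as np

    rng = np.random.default_rng(3)
    d = 8
    q, k = rng.normal(size=d), rng.normal(size=d)
    W1 = rng.normal(size=(2 * d, d))            # MLP input weights
    w2 = rng.normal(size=d)                     # MLP output weights
    W = rng.normal(size=(d, d))                 # bilinear weight matrix

    def mlp_score(q, k):                        # Bahdanau et al., 2015
        return w2 @ np.tanh(np.concatenate([q, k]) @ W1)

    def bilinear_score(q, k):                   # Luong et al., 2015
        return q @ W @ k

    def dot_score(q, k):                        # Luong et al., 2015; equal sizes only
        return q @ k

    def scaled_dot_score(q, k):                 # Vaswani et al., 2017
        return q @ k / np.sqrt(k.shape[0])

    for f in (mlp_score, bilinear_score, dot_score, scaled_dot_score):
        print(f.__name__, round(float(f(q, k)), 3))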

Page 16:

Attention Integration

Page 17:

Attention Integration

Page 18:

3. Attention Varieties

Page 19:

Hard Attention

Instead of a soft interpolation, make a zero-one decision about where to attend (Xu et al., 2015)

Page 20:

Monotonic Attention

This approach “softly” prevents the model from assigning attention probability before the position it attended to at a previous time step, by taking the attention at the previous time step into account.

[Figure: encoder states E with attention advancing monotonically over them.]

Page 21:

Intra-Attention / Self-Attention

Each element in the sentence attends to other elements from the SAME sentence → context-sensitive encodings!
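
A minimal self-attention sketch, assuming scaled dot-product scoring over toy random word vectors (real models first map each word to queries, keys, and values with learned projections):

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 5, 8                                  # 5 words, toy dimension
    X = rng.normal(size=(n, d))                  # one vector per word, SAME sentence

    scores = X @ X.T / np.sqrt(d)                # every word scores every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    contextual = weights @ X                     # context-sensitive encodings
    print(contextual.shape)                      # (5, 8): one new vector per word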

Page 22:

Multiple Sources

Attend to multiple sentences (Zoph et al., 2015)

Attend to a sentence and an image (Huang et al., 2016)

Page 23:

Multi-headed Attention I

Multiple attention “heads” focus on different parts of the sentence

$a(q, k) = \frac{q^\top k}{\sqrt{|k|}}$

Page 24:

Multi-headed Attention II

Multiple attention “heads” focus on different parts of the sentence

E.g. multiple independently learned heads (Vaswani et al., 2017)

$a(q, k) = \frac{q^\top k}{\sqrt{|k|}}$
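
A hedged sketch of independently learned heads with illustrative shapes (a complete implementation, as in Vaswani et al. (2017), would add a learned output projection after the concatenation):

    import numpy as np

    rng = np.random.default_rng(5)
    n, d_model, n_heads = 5, 16, 4
    d_head = d_model // n_heads
    X = rng.normal(size=(n, d_model))            # toy word encodings

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    heads = []
    for _ in range(n_heads):                     # independently learned projections
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))   # scaled dot-product per head
        heads.append(A @ V)                      # each head attends differently

    out = np.concatenate(heads, axis=-1)         # concatenate the head outputs
    print(out.shape)                             # (5, 16): back to d_model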

Page 25:

4. Improvements in Attention

IN THE CONTEXT OF MT

Page 26:

Coverage

Problem: neural models tend to drop or repeat content.

In MT:

1. Over-translation: some words are unnecessarily translated multiple times;

2. Under-translation: some words are mistakenly left untranslated.

SRC: Señor Presidente, abre la sesión.
TRG: Mr President Mr President Mr President.

Solution: model how many times words have been covered, e.g. by maintaining a coverage vector to keep track of the attention history (Tu et al., 2016).
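
A deliberately simplified sketch of the coverage idea: keep a running sum of past attention and use it to discourage re-attending to already-covered source words. The subtraction penalty below is a hypothetical stand-in; Tu et al. (2016) instead feed the coverage vector into a learned attention scorer.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(6)
    src_len, steps = 4, 3
    coverage = np.zeros(src_len)            # cumulative attention per source word

    for t in range(steps):
        raw = rng.normal(size=src_len)      # stand-in for the usual q-k scores
        attn = softmax(raw - coverage)      # hypothetical penalty on covered words
        coverage += attn                    # update the attention history
        print(f"t={t} attn={np.round(attn, 2)} coverage={np.round(coverage, 2)}")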

Page 27:

Incorporating Markov Properties

Intuition: attention from the last time step tends to be correlated with attention at this time step.

Approach: add information about the last attention when making the next decision.

Page 28:

Bidirectional Training

- Background: it is established that, for latent-variable translation models, alignments improve if both directional models are combined (Koehn et al., 2005)

- Approach: joint training of two directional models

Page 29:

Supervised Training

Sometimes we can get “gold standard” alignments a priori:
◦ Manual alignments
◦ Pre-trained with a strong alignment model

Train the model to match these strong alignments.

Page 30:

5. Applications

Page 31:

Chatbots

A chatbot is a computer program that conducts a conversation.

Human: what is your job
Enc-dec: i'm a lawyer
Human: what do you do ?
Enc-dec: i'm a doctor .

[Figure: an encoder-decoder with attention reading “what is your job” and generating “i'm a lawyer” token by token from “<s>” until EOS.]

Page 32:

Natural Language Inference

Page 33:

Other NLP Tasks

Text summarization: the process of shortening a text document with software to create a summary with the major points of the original document.

Question answering: automatically producing an answer to a question given a corresponding document.

Semantic parsing: mapping natural language into a logical form that can be executed on a knowledge base and return an answer.

Syntactic parsing: the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.

Page 34:

Image Captioning I

[Figure: an encoder over the image and a decoder generating the caption “a cat on the mat” token by token from “<s>”.]

Page 35:

Image Captioning II

Page 36:

Other Computer Vision Tasks with Attention

Visual question answering: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Video caption generation: attempts to generate a complete and natural sentence, enriching the single label of video classification, to capture the most informative dynamics in videos.

Page 37:

Speech recognition / translation

Page 38:

6. “Attention is all you need”

SLIDES BASED ON HTTPS://RESEARCH.GOOGLEBLOG.COM/2017/08/TRANSFORMER-NOVEL-NEURAL-NETWORK.HTML

Page 39:

Motivation

The sequential nature of RNNs makes it difficult to take advantage of modern computing devices such as TPUs (Tensor Processing Units).

Page 40:

Transformer

“I arrived at the bank after crossing the river”

Page 41:

Transformer I

[Figure: the Transformer Encoder and Decoder.]

Page 42:

Transformer II

Page 43:

Transformer results

Page 44:

Attention weights

Page 45:

Attention weights

Page 46:

7. Summary

Page 47:

RNNs and Attention

RNNs are used to model sequences.

Attention is used to enhance the modeling of long sequences.

The versatility of these models allows them to be applied to a wide range of applications.

Page 48:

Implementations of Encoder-Decoder

LSTM, CNN

Page 49:

Attention-based mechanisms

Soft vs Hard: soft attention weights all pixels; hard attention crops the image and forces attention only on the kept part.

Global vs Local: a global approach always attends to all source words; a local one only looks at a subset of source words at a time.

Intra vs External: intra-attention is within the encoder's input sentence; external attention is across sentences.

Page 50:

One large encoder-decoder

• Text, speech, image… is it all converging to a signal paradigm?

• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...

• Or you may train them together to achieve zero-shot AI.

*And other references on this research direction….

Page 51:


Research going on… [email protected]

Q&A?

Page 52:

Quiz

1. Mark all statements that are true.

A. Sequence modeling only refers to language applications
B. The attention mechanism can be applied to an encoder-decoder architecture
C. Neural machine translation systems require recurrent neural networks
D. If we want to have a fixed representation (thought vector), we cannot apply attention-based mechanisms

2. Given the query vector q = [], the key vector 1 k1 = [] and the key vector 2 k2 = []:

A. What are attention weights 1 & 2 when computing the dot product?
B. And when computing the scaled dot product?
C. To which key vector are we giving more attention?
D. What is the advantage of computing the scaled dot product?
