Attention – DLAI – Marta R. Costa-jussà (slides adapted from Graham Neubig's lectures)

May 21, 2020

Transcript
Page 1:

Attention – DLAI – Marta R. Costa-jussà

Slides adapted from Graham Neubig's lectures

Page 2:

What advancements excite you most in the field? I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning.

ILYA SUTSKEVER, RESEARCH DIRECTOR AND CO-FOUNDER OF OPENAI

Page 3:

Outline

1. Sequence modeling & sequence-to-sequence models [WRAP-UP FROM PREVIOUS RNN SESSION]
2. Attention-based mechanism
3. Attention varieties
4. Attention improvements
5. Applications
6. “Attention is all you need”
7. Summary

Page 4:

Sequence modeling

Model the probability of sequences of words.

From the previous lecture… we model sequences with RNNs.

[Figure: an RNN language model reading “<s> I'm fine .” and predicting p(I'm), p(fine | I'm), p(. | fine), and finally EOS.]
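
To make the factorization concrete, the figure's probabilities multiply out as p(I'm) · p(fine | I'm) · p(. | fine) · p(EOS | .). Below is a minimal numpy sketch of one RNN language-model step, a vanilla (Elman) recurrence with toy sizes, random weights, and made-up word ids; all names here are illustrative, not from the slides.

    import numpy as np

    # Toy RNN language model: one step consumes a word and predicts the next.
    rng = np.random.default_rng(0)
    vocab_size, embed_dim, hidden_dim = 10, 8, 16
    E = rng.normal(size=(vocab_size, embed_dim))      # word embeddings
    W_xh = rng.normal(size=(embed_dim, hidden_dim))   # input-to-hidden weights
    W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
    W_hy = rng.normal(size=(hidden_dim, vocab_size))  # hidden-to-output weights

    def rnn_lm_step(word_id, h):
        """Update the state with one word and return p(next word | history)."""
        h = np.tanh(E[word_id] @ W_xh + h @ W_hh)     # vanilla RNN recurrence
        logits = h @ W_hy
        probs = np.exp(logits - logits.max())
        return probs / probs.sum(), h

    # Accumulate log p(w1) + log p(w2 | w1) + ... over a toy id sequence.
    h, log_prob = np.zeros(hidden_dim), 0.0
    for prev, nxt in [(0, 1), (1, 2), (2, 3)]:        # e.g. <s> I'm fine .
        probs, h = rnn_lm_step(prev, h)
        log_prob += np.log(probs[nxt])
    print("log p(sequence) =", log_prob)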

Page 5:

Sequence-to-sequence models

[Figure: an encoder-decoder pair translating “how are you ?” into “¿ Cómo estás ?”. The encoder reads the source; its final state, the THOUGHT/CONTEXT VECTOR, initializes the decoder, which emits “¿ Cómo estás ?” token by token from “<s>” until EOS.]
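
A minimal numpy sketch of the encoder-decoder idea (toy random vectors stand in for word embeddings; every name and size here is illustrative): note that everything the decoder knows about the source sentence must pass through the single thought/context vector.

    import numpy as np

    # Toy encoder-decoder: the source is squeezed into one thought vector.
    rng = np.random.default_rng(1)
    d = 8                                             # hidden size (illustrative)
    W_enc, U_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    W_dec, U_dec = rng.normal(size=(d, d)), rng.normal(size=(d, d))

    def encode(source_vectors):
        """Run the encoder RNN; its final state is the thought vector."""
        h = np.zeros(d)
        for x in source_vectors:
            h = np.tanh(x @ W_enc + h @ U_enc)
        return h

    def decode_step(y_prev, s):
        """One decoder step; the source is visible only through the state s."""
        return np.tanh(y_prev @ W_dec + s @ U_dec)

    source = rng.normal(size=(4, d))                  # e.g. "how are you ?"
    s = encode(source)                                # THOUGHT/CONTEXT VECTOR
    y = np.zeros(d)                                   # embedding of <s>
    for _ in range(3):                                # a few decoder steps
        s = decode_step(y, s)
        y = s                                         # toy feedback choice
    print("final decoder state:", np.round(s[:4], 2), "...")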

Page 6:

Any problem with these models?

Page 7:

Page 8:

2. Attention-based mechanism

Page 9:

Motivation in the case of MT

Page 10:

Motivation in the case of MT

Page 11:

Attention

[Figure: an encoder and a decoder connected through a weighted combination (+) of the encoder states.]

Attention allows the model to use multiple vectors, whose number depends on the length of the input, rather than a single fixed-size vector.

Page 12:

Attention Key Ideas

• Encode each word in the input and output sentence into a vector

• When decoding, perform a linear combination of these vectors, weighted by “attention weights”

• Use this combination in picking the next word

Page 13:

Attention Computation I

• Use a “query” vector (the decoder state) and “key” vectors (all encoder states)

• For each query-key pair, calculate a weight

• Normalize the weights to add to one using the softmax

[Figure: a query vector scored against four key vectors gives a1=2.1, a2=-0.1, a3=0.3, a4=-1.0; after the softmax, a1=0.5, a2=0.3, a3=0.1, a4=0.1.]
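
A quick numpy sketch of the normalization step on the slide's raw scores. (The slide's weights 0.5/0.3/0.1/0.1 appear rounded for illustration; the exact softmax of these scores differs, but the point, a non-negative distribution summing to one, holds either way.)

    import numpy as np

    scores = np.array([2.1, -0.1, 0.3, -1.0])    # query-key scores from the slide
    weights = np.exp(scores - scores.max())      # subtract the max for stability
    weights /= weights.sum()
    print(np.round(weights, 3), weights.sum())   # non-negative, sums to 1.0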

Page 14:

Attention Computation II

• Combine together the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum

[Figure: value vectors multiplied by a1=0.5, a2=0.3, a3=0.1, a4=0.1 and summed.]
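
And the combination step, sketched with toy random value vectors and the slide's illustrative weights:

    import numpy as np

    rng = np.random.default_rng(2)
    values = rng.normal(size=(4, 8))            # one value vector per source word
    weights = np.array([0.5, 0.3, 0.1, 0.1])    # attention weights from the slide
    context = weights @ values                  # weighted sum of the values
    print(context.shape)                        # (8,): a single context vector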

Page 15:

Attention Score Functions (q is the query and k is the key)

Multi-layer Perceptron (Bahdanau et al., 2015): $a(q, k) = w_2^\top \tanh(W_1 [q; k])$. Flexible, often very good with large data.

Bilinear (Luong et al., 2015): $a(q, k) = q^\top W k$.

Dot Product (Luong et al., 2015): $a(q, k) = q^\top k$. No parameters! But requires the sizes to be the same.

Scaled Dot Product (Vaswani et al., 2017): $a(q, k) = \frac{q^\top k}{\sqrt{|k|}}$. Scale by the size of the vector.
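
The four score functions as a numpy sketch; the parameter shapes are illustrative assumptions, and the MLP variant is written in the additive form of Bahdanau et al. (2015):

    import numpy as np

    rng = np.random.default_rng(3)
    d = 8
    q, k = rng.normal(size=d), rng.normal(size=d)
    W1 = rng.normal(size=(2 * d, d))            # MLP input weights
    w2 = rng.normal(size=d)                     # MLP output weights
    W = rng.normal(size=(d, d))                 # bilinear weight matrix

    def mlp_score(q, k):                        # Bahdanau et al., 2015
        return w2 @ np.tanh(np.concatenate([q, k]) @ W1)

    def bilinear_score(q, k):                   # Luong et al., 2015
        return q @ W @ k

    def dot_score(q, k):                        # Luong et al., 2015; equal sizes only
        return q @ k

    def scaled_dot_score(q, k):                 # Vaswani et al., 2017
        return q @ k / np.sqrt(k.shape[0])

    for f in (mlp_score, bilinear_score, dot_score, scaled_dot_score):
        print(f.__name__, round(float(f(q, k)), 3))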

Page 16:

Attention Integration

Page 17:

Attention Integration

Page 18:

3. Attention Varieties

Page 19:

Hard Attention

Instead of a soft interpolation, make a zero-one decision about where to attend (Xu et al., 2015)

Page 20:

Monotonic Attention

This approach “softly” prevents the model from assigning attention probability before the position it attended to at a previous time step, by taking the attention at the previous time step into account.

[Figure: encoder states E with attention advancing monotonically over them.]

Page 21:

Intra-Attention / Self-Attention

Each element in the sentence attends to other elements from the SAME sentence → context-sensitive encodings!
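
A minimal self-attention sketch, assuming scaled dot-product scoring over toy random word vectors (real models first map each word to queries, keys, and values with learned projections):

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 5, 8                                  # 5 words, toy dimension
    X = rng.normal(size=(n, d))                  # one vector per word, SAME sentence

    scores = X @ X.T / np.sqrt(d)                # every word scores every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    contextual = weights @ X                     # context-sensitive encodings
    print(contextual.shape)                      # (5, 8): one new vector per word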

Page 22:

Multiple Sources

Attend to multiple sentences (Zoph et al., 2015)

Attend to a sentence and an image (Huang et al., 2016)

Page 23:

Multi-headed Attention I

Multiple attention “heads” focus on different parts of the sentence

$a(q, k) = \frac{q^\top k}{\sqrt{|k|}}$

Page 24:

Multi-headed Attention II

Multiple attention “heads” focus on different parts of the sentence

E.g. multiple independently learned heads (Vaswani et al., 2017)

$a(q, k) = \frac{q^\top k}{\sqrt{|k|}}$
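
A hedged sketch of independently learned heads with illustrative shapes (a complete implementation, as in Vaswani et al. (2017), would add a learned output projection after the concatenation):

    import numpy as np

    rng = np.random.default_rng(5)
    n, d_model, n_heads = 5, 16, 4
    d_head = d_model // n_heads
    X = rng.normal(size=(n, d_model))            # toy word encodings

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    heads = []
    for _ in range(n_heads):                     # independently learned projections
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))   # scaled dot-product per head
        heads.append(A @ V)                      # each head attends differently

    out = np.concatenate(heads, axis=-1)         # concatenate the head outputs
    print(out.shape)                             # (5, 16): back to d_model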

Page 25:

4. Improvements in Attention

IN THE CONTEXT OF MT

Page 26:

Coverage

Problem: neural models tend to drop or repeat content.

In MT:

1. Over-translation: some words are unnecessarily translated multiple times;

2. Under-translation: some words are mistakenly left untranslated.

SRC: Señor Presidente, abre la sesión.
TRG: Mr President Mr President Mr President.

Solution: model how many times words have been covered, e.g. by maintaining a coverage vector to keep track of the attention history (Tu et al., 2016).
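
A deliberately simplified sketch of the coverage idea: keep a running sum of past attention and use it to discourage re-attending to already-covered source words. The subtraction penalty below is a hypothetical stand-in; Tu et al. (2016) instead feed the coverage vector into a learned attention scorer.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(6)
    src_len, steps = 4, 3
    coverage = np.zeros(src_len)            # cumulative attention per source word

    for t in range(steps):
        raw = rng.normal(size=src_len)      # stand-in for the usual q-k scores
        attn = softmax(raw - coverage)      # hypothetical penalty on covered words
        coverage += attn                    # update the attention history
        print(f"t={t} attn={np.round(attn, 2)} coverage={np.round(coverage, 2)}")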

Page 27:

Incorporating Markov Properties

Intuition: attention from the last time step tends to be correlated with attention at this time step.

Approach: add information about the last attention when making the next decision.

Page 28:

Bidirectional Training

- Background: it is established that, for latent-variable translation models, alignments improve if both directional models are combined (Koehn et al., 2005)

- Approach: joint training of two directional models

Page 29:

Supervised Training

Sometimes we can get “gold standard” alignments a priori:
◦ Manual alignments
◦ Pre-trained with a strong alignment model

Train the model to match these strong alignments.

Page 30:

5. Applications

Page 31:

Chatbots

A chatbot is a computer program that conducts a conversation.

Human: what is your job
Enc-dec: i'm a lawyer
Human: what do you do ?
Enc-dec: i'm a doctor .

[Figure: an encoder-decoder with attention reading “what is your job” and generating “i'm a lawyer” token by token from “<s>” until EOS.]

Page 32:

Natural Language Inference

Page 33:

Other NLP Tasks

Text summarization: the process of shortening a text document with software to create a summary with the major points of the original document.

Question answering: automatically producing an answer to a question given a corresponding document.

Semantic parsing: mapping natural language into a logical form that can be executed on a knowledge base and return an answer.

Syntactic parsing: the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.

Page 34:

Image Captioning I

[Figure: an encoder over the image and a decoder generating the caption “a cat on the mat” token by token from “<s>”.]

Page 35:

Image Captioning II

Page 36:

Other Computer Vision Tasks with Attention

Visual question answering: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Video caption generation: attempts to generate a complete and natural sentence, enriching the single label of video classification, to capture the most informative dynamics in videos.

Page 37:

Speech recognition / translation

Page 38:

6. “Attention is all you need”

SLIDES BASED ON HTTPS://RESEARCH.GOOGLEBLOG.COM/2017/08/TRANSFORMER-NOVEL-NEURAL-NETWORK.HTML

Page 39:

Motivation

The sequential nature of RNNs makes it difficult to take advantage of modern computing devices such as TPUs (Tensor Processing Units).

Page 40:

Transformer

“I arrived at the bank after crossing the river”

Page 41:

Transformer I

[Figure: the Transformer Encoder and Decoder.]

Page 42:

Transformer II

Page 43:

Transformer results

Page 44:

Attention weights

Page 45:

Attention weights

Page 46:

7. Summary

Page 47:

RNNs and Attention

RNNs are used to model sequences.

Attention is used to enhance the modeling of long sequences.

The versatility of these models allows them to be applied to a wide range of applications.

Page 48:

Implementations of Encoder-Decoder

LSTM, CNN

Page 49:

Attention-based mechanisms

Soft vs Hard: soft attention weights all pixels; hard attention crops the image and forces attention only on the kept part.

Global vs Local: a global approach always attends to all source words; a local one only looks at a subset of source words at a time.

Intra vs External: intra-attention is within the encoder's input sentence; external attention is across sentences.

Page 50:

One large encoder-decoder

• Text, speech, image… is it all converging to a signal paradigm?

• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...

• Or you may train them together to achieve zero-shot AI.

*And other references on this research direction….

Page 51:


Research going on… [email protected]

Q&A?

Page 52:

Quiz

1. Mark all statements that are true.

A. Sequence modeling only refers to language applications
B. The attention mechanism can be applied to an encoder-decoder architecture
C. Neural machine translation systems require recurrent neural networks
D. If we want to have a fixed representation (thought vector), we cannot apply attention-based mechanisms

2. Given the query vector q = [], the key vector 1 k1 = [] and the key vector 2 k2 = []:

A. What are attention weights 1 & 2 when computing the dot product?
B. And when computing the scaled dot product?
C. To which key vector are we giving more attention?
D. What is the advantage of computing the scaled dot product?
