Deep Learning and Lexical, Syntactic and Semantic Analysis
Wanxiang Che and Yue Zhang
2016-10
Part 2: Introduction to Deep Learning
Part 2.1: Deep Learning Background
What is Machine Learning?
• From Data to Knowledge
– Traditional program: Input + Algorithm → Output
– ML program: Input + Output → “Algorithm”
A Standard Example of ML
• The MNIST (Modified NIST) database of hand-written digit recognition
– Publicly available
– A huge amount is known about how well various ML methods do on it
– 60,000 + 10,000 hand-written digits (28×28 pixels each)
[Figure: sample digit images; it is very hard to say what makes a 2]
Traditional Model (before 2012)
• Fixed/engineered features + trainable classifier
– Designing a feature extractor requires considerable effort by experts
[Figure: examples of hand-crafted features: SIFT, GIST, Shape context]
Deep Learning (after 2012)
• Learning hierarchical representations
• DEEP means more than one stage of non-linear feature transformation
Deep Learning Architecture
Deep Learning is Not New
• 1980s technology (Neural Networks)
About Neural Networks
• Pros
– Simple to learn p(y|x)
– Results OK for shallow nets
• Cons
– Does not learn p(x)
– Trouble with more than 3 layers
– Overfits
– Slow to train
Deep Learning beats NN
• The pros remain: simple to learn p(y|x); results OK for shallow nets
• Each con of classic neural networks now has a remedy:
– Does not learn p(x) → unsupervised feature learning: RBMs, DAEs, …
– Overfits → Dropout, Maxout, Stochastic Pooling
– Slow to train → GPUs
– Trouble with more than 3 layers → new activation functions (ReLU, …), gated mechanisms
Results on MNIST
• Naïve Neural Network: 96.59%
• SVM (default settings for libsvm): 94.35%
• Optimal SVM [Andreas Mueller]: 98.56%
• The state of the art: Convolutional NN (2013): 99.79%
Deep Learning Wins
• 9. MICCAI 2013 Grand Challenge on Mitosis Detection
• 8. ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images
• 7. ISBI 2012 Brain Image Segmentation Challenge (with superhuman pixel error rate)
• 6. IJCNN 2011 Traffic Sign Recognition Competition (only our method achieved superhuman results)
• 5. ICDAR 2011 offline Chinese Handwriting Competition
• 4. Online German Traffic Sign Recognition Contest
• 3. ICDAR 2009 Arabic Connected Handwriting Competition
• 2. ICDAR 2009 Handwritten Farsi/Arabic Character Recognition Competition
• 1. ICDAR 2009 French Connected Handwriting Competition
• Compare the overview page on handwriting recognition: http://people.idsia.ch/~juergen/deeplearning.html
Deep Learning for Speech Recognition
Deep Learning for NLP
[Figure: big data (monolingual, multi-lingual, and multi-modal) feeds deep learning models (Recurrent NN, Convolutional NN, Recursive NN), each illustrated on the sentence 我 喜欢 红 苹果 ("I like red apples"); the models map text into a semantic vector space that supports applications such as word segmentation, POS tagging, parsing, QA, MT, caption generation, dialog, MCTest, …]
Part 2.2: Feedforward Neural Networks
The Traditional Paradigm for ML
1. Convert the raw input vector into a vector of feature activations
– Use hand-written programs based on common sense to define the features
2. Learn how to weight each of the feature activations to get a single scalar quantity
3. If this quantity is above some threshold, decide that the input vector is a positive example of the target class
The Standard Perceptron Architecture
[Figure: input units → hand-coded programs → feature units → learned weights → decision unit]
[Example: input "IPhone is very good ."; features good, very/good, very, … with weights 0.8, 0.9, 0.1, …; decision unit: score > 5?]
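A minimal sketch of this architecture (not the authors' code): hand-coded binary features, learned weights taken from the example above, and a decision threshold that is a hypothetical value for illustration.

```python
# A perceptron sketch: hand-coded features + learned weights + threshold.
# The feature set matches the slide example; the threshold is hypothetical.

def features(sentence):
    """Hand-coded feature extractor: binary indicators over words/bigrams."""
    words = sentence.split()
    bigrams = {f"{a}/{b}" for a, b in zip(words, words[1:])}
    return {"good": "good" in words,
            "very": "very" in words,
            "very/good": "very/good" in bigrams}

WEIGHTS = {"good": 0.8, "very/good": 0.9, "very": 0.1}  # learned weights
THRESHOLD = 0.5  # decision unit: positive if score > threshold (illustrative)

def classify(sentence):
    score = sum(WEIGHTS[f] for f, on in features(sentence).items() if on)
    return score > THRESHOLD

print(classify("IPhone is very good ."))  # True
```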
The Limitations of Perceptrons
• The hand-coded features
– Have a great influence on performance
– Finding suitable features costs a lot of effort
• A linear classifier with a hyperplane
– Cannot separate non-linear data; e.g., the XOR function cannot be learned by a single-layer perceptron
[Figure: the four points (0,0), (0,1), (1,0), (1,1); the positive and negative cases cannot be separated by a plane]
Learning with Non-linear Hidden Layers
Feedforward Neural Networks
• The information is propagated from the inputs to the outputs
• Time has no role (NO cycle between outputs and inputs)
• Multi-layer Perceptron (MLP)
• Learning the weights of hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
– More layers of linear units do not help: the result is still linear
– Fixed output non-linearities are not enough
[Figure: inputs x1, x2, …, xn → 1st hidden layer → 2nd hidden layer → output layer]
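A tiny sketch of why one non-linear hidden layer already helps: with hand-picked weights (hypothetical, for illustration), a two-layer network computes XOR, which no single-layer perceptron can represent.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)  # non-linear activation

# Hand-picked weights (hypothetical): h1 = OR(x1, x2), h2 = AND(x1, x2),
# output = h1 AND NOT h2 = XOR(x1, x2).
W1 = np.array([[1.0, 1.0],   # h1
               [1.0, 1.0]])  # h2
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)      # hidden layer: non-linear features
    return step(W2 @ h + b2)   # output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # 0, 1, 1, 0
```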
Multiple Layer Neural Networks
• What are those hidden neurons doing?
– Maybe representing outlines
General Optimizing (Learning) Algorithms
• Gradient Descent
• Stochastic Gradient Descent (SGD)
– Minibatch SGD (m > 1), Online GD (m = 1); see the sketch below
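A minimal sketch of the minibatch SGD loop, assuming a generic loss gradient grad(w, batch); the function names are illustrative, not from the slides.

```python
import numpy as np

def sgd(w, data, grad, lr=0.1, batch_size=32, epochs=10):
    """Minibatch SGD; batch_size=1 gives online gradient descent."""
    n = len(data)
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            w = w - lr * grad(w, batch)         # step along the negative gradient
    return w
```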
Computational/Flow Graphs
• Describing mathematical expressions
• For example
– e = (a + b) * (b + 1)
• c = a + b, d = b + 1, e = c * d
– If a = 2, b = 1, then c = 3, d = 2, e = 6
Derivatives on Computational Graphs
Computational Graph Backward Pass (Backpropagation)
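A worked sketch of the forward and backward pass on this graph (gradients derived by hand from the chain rule): $\partial e/\partial c = d$, $\partial e/\partial d = c$, so $\partial e/\partial a = d$ and, since $b$ feeds both $c$ and $d$, $\partial e/\partial b = d + c$.

```python
# Forward and backward pass on e = (a + b) * (b + 1), with a = 2, b = 1.
a, b = 2.0, 1.0

# Forward pass: evaluate each node.
c = a + b        # c = 3
d = b + 1        # d = 2
e = c * d        # e = 6

# Backward pass: propagate de/d(node) from the output back to the inputs.
de_de = 1.0
de_dc = de_de * d                  # e = c * d  ->  de/dc = d = 2
de_dd = de_de * c                  # de/dd = c = 3
de_da = de_dc * 1.0                # c = a + b  ->  dc/da = 1, so de/da = 2
de_db = de_dc * 1.0 + de_dd * 1.0  # b feeds both c and d: de/db = d + c = 5

print(e, de_da, de_db)  # 6.0 2.0 5.0
```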
An FNN POS Tagger
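The slide presents this tagger only as a figure; below is a hedged sketch of the standard window-based design in the spirit of Collobert and Weston (2008): the embeddings of a fixed word window are concatenated, passed through a tanh hidden layer, and softmaxed over the tag set. All sizes and names are illustrative assumptions.

```python
import numpy as np

V, D, WIN, H, T = 10000, 50, 5, 100, 45   # vocab, emb dim, window, hidden, tags
rng = np.random.default_rng(0)
E  = rng.normal(0, 0.1, (V, D))           # word embedding table
W1 = rng.normal(0, 0.1, (H, WIN * D)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (T, H));       b2 = np.zeros(T)

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def tag_scores(window_word_ids):
    """POS distribution for the center word of a 5-word window."""
    x = E[window_word_ids].reshape(-1)    # concatenate the window's embeddings
    h = np.tanh(W1 @ x + b1)              # hidden layer
    return softmax(W2 @ h + b2)           # distribution over tags

print(tag_scores([3, 17, 256, 42, 8]).argmax())
```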
Part 2.3: Word Embeddings
Typical Approaches for Word Representation
• 1-hot representation (orthogonality)
– bag-of-words model
– sun: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …]
– star: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …]
– sim(star, sun) = 0
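A two-line check of this orthogonality (the vector indices are illustrative):

```python
import numpy as np

vocab_size = 13
sun, star = np.zeros(vocab_size), np.zeros(vocab_size)
sun[8], star[7] = 1, 1   # each word gets its own dimension
print(sun @ star)        # 0.0: any two distinct 1-hot vectors are orthogonal
```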
Distributed Word Representation
• Each word is associated with a low-dimensional (compressed, 50-1000), dense (non-sparse), and real-valued (continuous) vector (word embedding)
– Learning word vectors through supervised models
• Nature
– Semantic similarity as vector similarity
How to Obtain Word Embeddings
Neural Network Language Models
• Neural Network Language Models (NNLM)
– Feed-forward (Bengio et al. 2003)
• Maximum-likelihood estimation
• Back-propagation
• Input: $(n-1)$ embeddings
Predict Word Vector Directly
• SENNA (Collobert and Weston, 2008)
• word2vec (Mikolov et al. 2013)
Word2vec: CBOW (Continuous Bag-of-Words)
• Add inputs from words within a short window to predict the current word
• The weights for different positions are shared
• Computationally much more efficient than a normal NNLM
• The hidden layer is just linear (see the sketch below)
• Each word is an embedding $v(w)$
• Each context is an embedding $v'(c)$
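A minimal numpy sketch of the CBOW scoring step (dimensions and names are illustrative): the averaged context embeddings are scored against the output embeddings $v'$.

```python
import numpy as np

V, D = 10000, 100
rng = np.random.default_rng(0)
v_in  = rng.normal(0, 0.1, (V, D))   # input embeddings v(w)
v_out = rng.normal(0, 0.1, (V, D))   # output embeddings v'(c)

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def cbow_predict(context_ids):
    """P(current word | context): average context vectors, then score all words."""
    h = v_in[context_ids].mean(axis=0)   # linear hidden layer: just an average
    return softmax(v_out @ h)            # full softmax (slow; see training tricks)

p = cbow_predict([12, 48, 94, 5])        # words in the window around the target
print(p.argmax(), p.max())
```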
Word2vec: Skip-Gram
• Predicting surrounding words using the current word
• Similar performance to CBOW
• Each word is an embedding $v(w)$
• Each context is an embedding $v'(c)$
Word2vec Training
• SGD + backpropagation
• Most of the computational cost is a function of the size of the vocabulary (millions)
• Training acceleration
– Negative sampling (Mikolov et al. 2013)
– Hierarchical decomposition (Morin and Bengio 2005; Mnih and Hinton 2008; Mikolov et al. 2013)
– Graphics Processing Units (GPU)
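In practice these models are rarely written by hand; a hedged usage sketch with gensim (assuming gensim 4.x; the toy corpus is illustrative):

```python
from gensim.models import Word2Vec

sentences = [["i", "like", "red", "apples"],
             ["i", "like", "green", "apples"]]   # toy corpus, illustration only

# sg=0 -> CBOW, sg=1 -> skip-gram; negative=5 enables negative sampling,
# hs=1 would switch to hierarchical softmax instead.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1,
                 negative=5, min_count=1, epochs=50)

print(model.wv["apples"][:5])                  # the learned embedding v(w)
print(model.wv.similarity("red", "green"))     # cosine similarity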
Word Analogy
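The classic example is $v(\text{king}) - v(\text{man}) + v(\text{woman}) \approx v(\text{queen})$. A hedged usage sketch with gensim's downloader (assumes network access; the dataset name is one of gensim's published pretrained models, and any pretrained KeyedVectors would do):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained word vectors

# v(king) - v(man) + v(woman) is closest to v(queen)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```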
Part 2.4: Recurrent and Other Neural Networks
Language Models
• A language model computes a probability for a sequence of words, $P(w_1, \cdots, w_m)$, or predicts a probability for the next word, $P(w_{m+1} \mid w_1, \cdots, w_m)$
• Useful for machine translation, speech recognition, and so on
• Word ordering
– P(the cat is small) > P(small the is cat)
• Word choice
– P(there are four cats) > P(there are for cats)
Traditional Language Models
• An incorrect but necessary Markov assumption!
• Probability is usually conditioned on a window of $n$ previous words
• $P(w_1, \cdots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \cdots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \cdots, w_{i-1})$
• How to estimate probabilities (a sketch follows after this list)
– $p(w_2 \mid w_1) = \frac{count(w_1, w_2)}{count(w_1)}$, $p(w_3 \mid w_1, w_2) = \frac{count(w_1, w_2, w_3)}{count(w_1, w_2)}$
• Performance improves by keeping counts for higher-order n-grams and doing smoothing, such as backoff (e.g., if a 4-gram is not found, try the 3-gram, etc.)
• Disadvantages
– There are A LOT of n-grams!
– Cannot see too long a history
– e.g., P(坐/作 了一整天的 火车/作业): choosing between 坐 ("ride") and 作 ("do") depends on the distant word 火车 ("train") vs. 作业 ("homework")
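A sketch of the bigram MLE estimate with a crude backoff to the unigram (toy corpus; this backoff is illustrative, not a recipe from the slides):

```python
from collections import Counter

corpus = "the cat is small . the cat is cute .".split()  # toy data

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_bigram(w1, w2):
    """MLE p(w2 | w1), backing off to the unigram p(w2) if the bigram is unseen."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return unigrams[w2] / N   # crude backoff, illustration only

print(p_bigram("the", "cat"))    # 1.0: "the" is always followed by "cat" here
print(p_bigram("cat", "small"))  # unseen bigram -> unigram backoff
```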
Recurrent Neural Networks (RNNs)
• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs
[Figure: an unrolled RNN: inputs $x_{t-1}, x_t, x_{t+1}$ → hidden states $h_{t-1}, h_t, h_{t+1}$ → outputs $y_{t-1}, y_t, y_{t+1}$, with weights $W_1, W_2, W_3$ shared across time steps]
Recurrent Neural Networks (RNNs)
• At a single time step $t$:
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $\hat{y}_t = \mathrm{softmax}(W_3 h_t)$
[Figure: one RNN cell drawn unrolled ($h_{t-1} \to h_t$) and as a self-loop over $W_1$; see the numpy transcription below]
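A direct numpy transcription of these two equations (all sizes are illustrative):

```python
import numpy as np

D, H, V = 50, 100, 1000                     # input dim, hidden dim, output vocab
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (H, H))             # hidden-to-hidden
W2 = rng.normal(0, 0.1, (H, D))             # input-to-hidden
W3 = rng.normal(0, 0.1, (V, H))             # hidden-to-output

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W1 @ h_prev + W2 @ x_t)   # h_t = tanh(W1 h_{t-1} + W2 x_t)
    y_t = softmax(W3 @ h_t)                 # y_t = softmax(W3 h_t)
    return h_t, y_t

h = np.zeros(H)
for x in rng.normal(size=(8, D)):           # run over a sequence of 8 inputs
    h, y = rnn_step(h, x)
```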
Training RNNs is hard
• Ideally, inputs from many time steps ago could modify the output $y$
• For example, consider dependencies spanning 2 time steps
[Figure: the unrolled RNN again, with shared weights $W_1, W_2, W_3$]
Back-Propagation Through Time (BPTT)
• The total error is the sum of the errors at each time step $t$: $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
• $\frac{\partial E_t}{\partial W_3} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial W_3}$ is easy to calculate
• But calculating $\frac{\partial E_t}{\partial W_1} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial W_1}$ is hard (likewise for $W_2$)
– Because $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$ depends on $h_{t-1}$, which in turn depends on $W_1$ and $h_{t-2}$, and so on
• So $\frac{\partial E_t}{\partial W_1} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_1}$
The vanishing gradient problem
• $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$, with $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_1 \, \mathrm{diag}[\tanh'(\cdots)]$
• $\left\| \frac{\partial h_t}{\partial h_{t-1}} \right\| \le \gamma \, \|W_1\| \le \gamma \lambda_1$
– where $\gamma$ bounds $\|\mathrm{diag}[\tanh'(\cdots)]\|$ and $\lambda_1$ is the largest singular value of $W_1$
• $\left\| \frac{\partial h_t}{\partial h_k} \right\| \le (\gamma \lambda_1)^{t-k} \to 0$ if $\gamma \lambda_1 < 1$
• This can become very small or very large quickly → vanishing or exploding gradient
• Trick for the exploding gradient: the clipping trick (set a threshold; see the sketch below)
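A sketch of gradient-norm clipping in plain numpy (in PyTorch the same idea is available as torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds a threshold (exploding-gradient trick)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])   # ||g|| = 50
print(clip_gradient(g))      # [3. 4.]: same direction, norm 5
```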
A “solution”
• Intuition
– Ensure $\gamma \lambda_1 \ge 1$ to prevent vanishing gradients
• So…
– Proper initialization of the $W$
– Use ReLU instead of tanh or sigmoid activation functions
A better “solution”
• Recall the original transition equation
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• We can instead update the state additively
– $u_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $h_t = h_{t-1} + u_t$
– then $\frac{\partial h_t}{\partial h_{t-1}} = 1 + \frac{\partial u_t}{\partial h_{t-1}} \ge 1$
• On the other hand
– $h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = \cdots$
[Figure: an RNN cell with additive state updates]
A better “solution” (cont.)
• Interpolate between the old state and the new state (“choosing to forget”)
– $f_t = \sigma(W_f x_t + U_f h_{t-1})$
– $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot u_t$
• Introduce a separate input gate $i_t$
– $i_t = \sigma(W_i x_t + U_i h_{t-1})$
– $h_t = f_t \odot h_{t-1} + i_t \odot u_t$
• Selectively expose the memory cell $c_t$ with an output gate $o_t$
– $o_t = \sigma(W_o x_t + U_o h_{t-1})$
– $c_t = f_t \odot c_{t-1} + i_t \odot u_t$
– $h_t = o_t \odot \tanh(c_t)$
Long Short-Term Memory (LSTM)
• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating
[Figure: the LSTM cell: input $x_t$ and previous $(c_{t-1}, h_{t-1})$ pass through three $\sigma$ gates and a $\tanh$ candidate to produce $(c_t, h_t)$]
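The equations above transcribe directly into a numpy sketch of one LSTM step ($u_t$ is the candidate update; weights are random and biases are omitted for brevity, so treat this as illustrative):

```python
import numpy as np

D, H = 50, 100
rng = np.random.default_rng(0)
def mats():  # one (W, U) pair per gate/candidate
    return rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wf, Uf = mats(); Wi, Ui = mats(); Wo, Uo = mats(); Wu, Uu = mats()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t):
    f = sigmoid(Wf @ x_t + Uf @ h_prev)   # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev)   # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)   # output gate
    u = np.tanh(Wu @ x_t + Uu @ h_prev)   # candidate update u_t
    c = f * c_prev + i * u                # additive memory update
    h = o * np.tanh(c)                    # selectively exposed state
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(8, D)):
    h, c = lstm_step(h, c, x)
```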
Gated Recurrent Units, GRU (Cho et al. 2014)
• Main ideas
– Keep around memories to capture long-distance dependencies
– Allow error messages to flow at different strengths depending on the inputs
• Update gate
– Based on the current input and hidden state
– $z_t = \sigma(W_z x_t + U_z h_{t-1})$
• Reset gate
– Computed similarly, but with different weights
– $r_t = \sigma(W_r x_t + U_r h_{t-1})$
GRU
• New memory content
– $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$
• The update gate $z$ controls how much of the past state should matter now
– If $z$ is close to 1, we can copy information in that unit through many time steps → less vanishing gradient!
• If the reset gate $r$ is close to 0, the unit ignores the previous memory and only stores the new input information → allows the model to drop information that is irrelevant in the future
• Units with long-term dependencies have active update gates $z$
• Units with short-term dependencies often have very active reset gates $r$
• The final memory at a time step combines the current and previous time steps (see the sketch below)
– $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
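And the corresponding one-step GRU sketch (same illustrative conventions as the LSTM sketch above):

```python
import numpy as np

D, H = 50, 100
rng = np.random.default_rng(1)
Wz, Uz = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wr, Ur = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
W,  U  = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory content
    return z * h_prev + (1 - z) * h_tilde          # interpolate old and new state
```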
LSTM vs. GRU
• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture
• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize
• If you have enough data, the greater expressive power of LSTMs may lead to better results
More RNNs
• Bidirectional RNN
• Stacked Bidirectional RNN
Tree-LSTMs
• Traditional sequential composition
• Tree-structured composition
More Applications of RNNs
• Neural Machine Translation
• Handwriting Generation
• Image Caption Generation
• ……
Neural Machine Translation
• RNN trained end-to-end: encoder-decoder
[Figure: an encoder RNN reads “I love you” into hidden states $h_1, h_2, h_3$; the final state needs to capture the entire sentence, and the decoder RNN then generates 我 …]
Attention Mechanism – Scoring
• Bahdanau et al. 2015
[Figure: at each decoder step, scores $\alpha = \mathrm{score}(h_t, \bar{h}_s)$ over the encoder states (e.g., 0.3, 0.6, 0.1) are combined into a context vector $c$; a sketch follows below]
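A sketch of one attention step. The score function here is a simple dot product; Bahdanau et al. actually use an additive MLP score, so treat this as illustrative of the mechanism rather than their exact model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def attend(h_t, encoder_states):
    """One attention step: score each encoder state, normalize, mix."""
    scores = encoder_states @ h_t   # score(h_t, h_s) as a dot product
    alpha = softmax(scores)         # attention weights, e.g. [0.3, 0.6, 0.1]
    c = alpha @ encoder_states      # context vector: weighted sum
    return c, alpha

rng = np.random.default_rng(0)
enc = rng.normal(size=(3, 100))     # h_1..h_3 for "I love you"
c, alpha = attend(rng.normal(size=100), enc)
print(alpha, alpha.sum())           # weights sum to 1
```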
Convolutional Neural Network
CS231n: Convolutional Neural Networks for Visual Recognition.
Pooling
CNN for NLP
Zhang,Y.,&Wallace,B.(2015).ASensitivityAnalysisof(andPractitioners’Guideto)ConvolutionalNeuralNetworksforSentenceClassification.
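A minimal numpy sketch of the sentence-classification CNN this guide analyzes: filters of width k convolve over the stacked word embeddings, followed by max-over-time pooling into a fixed-size feature vector. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k, F = 7, 50, 3, 100        # sentence length, emb dim, filter width, #filters
X = rng.normal(size=(n, D))       # sentence as a stack of word embeddings
filters = rng.normal(0, 0.1, (F, k, D))

def conv_and_pool(X):
    """Convolve each width-k filter over all word windows, then max-over-time pool."""
    windows = np.stack([X[i:i + k] for i in range(len(X) - k + 1)])  # (n-k+1, k, D)
    feature_maps = np.einsum("wkd,fkd->wf", windows, filters)        # (n-k+1, F)
    return np.tanh(feature_maps).max(axis=0)                         # (F,) fixed size

print(conv_and_pool(X).shape)     # (100,) -> fed to a softmax classifier
```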
Recursive Neural Network
Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks. NIPS.