Deep Learning and Lexical, Syntactic and Semantic Analysis
Wanxiang Che and Yue Zhang
2016-10
Part 2: Introduction to Deep Learning
Part 2.1: Deep Learning Background
What is Machine Learning?
• From Data to Knowledge
– Traditional program: Input + Algorithm → Output
– ML program: Input + Output → “Algorithm”
A Standard Example of ML
• The MNIST (Modified NIST) database of hand-written digit recognition
– Publicly available
– A huge amount is known about how well various ML methods do on it
– 60,000 + 10,000 hand-written digits (28×28 pixels each)
[Figure: sample digit images; it is very hard to say what makes a 2]
Traditional Model (before 2012)
• Fixed/engineered features + trainable classifier
– Designing a feature extractor requires considerable effort by experts
[Figure: examples of hand-crafted features: SIFT, GIST, Shape context]
Deep Learning (after 2012)
• Learning hierarchical representations
• DEEP means more than one stage of non-linear feature transformation
Deep Learning Architecture
Deep Learning is Not New
• 1980s technology (Neural Networks)
About Neural Networks
• Pros
– Simple to learn p(y|x)
– Results OK for shallow nets
• Cons
– Does not learn p(x)
– Trouble with more than 3 layers
– Overfits
– Slow to train
Deep Learning beats NN
• The pros remain: simple to learn p(y|x); results OK for shallow nets
• Each con of classic neural networks now has a remedy:
– Does not learn p(x) → unsupervised feature learning: RBMs, DAEs, …
– Overfits → Dropout, Maxout, Stochastic Pooling
– Slow to train → GPUs
– Trouble with more than 3 layers → new activation functions (ReLU, …), gated mechanisms
Results on MNIST
• Naïve Neural Network: 96.59%
• SVM (default settings for libsvm): 94.35%
• Optimal SVM [Andreas Mueller]: 98.56%
• The state of the art: Convolutional NN (2013): 99.79%
Deep Learning Wins
• 9. MICCAI 2013 Grand Challenge on Mitosis Detection
• 8. ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images
• 7. ISBI 2012 Brain Image Segmentation Challenge (with superhuman pixel error rate)
• 6. IJCNN 2011 Traffic Sign Recognition Competition (only our method achieved superhuman results)
• 5. ICDAR 2011 offline Chinese Handwriting Competition
• 4. Online German Traffic Sign Recognition Contest
• 3. ICDAR 2009 Arabic Connected Handwriting Competition
• 2. ICDAR 2009 Handwritten Farsi/Arabic Character Recognition Competition
• 1. ICDAR 2009 French Connected Handwriting Competition
• Compare the overview page on handwriting recognition: http://people.idsia.ch/~juergen/deeplearning.html
Deep Learning for Speech Recognition
Deep Learning for NLP
[Figure: big data (monolingual, multi-lingual, and multi-modal) feeds deep learning models (Recurrent NN, Convolutional NN, Recursive NN), each illustrated on the sentence 我 喜欢 红 苹果 ("I like red apples"); the models map text into a semantic vector space that supports applications such as word segmentation, POS tagging, parsing, QA, MT, caption generation, dialog, MCTest, …]
Part 2.2: Feedforward Neural Networks
The Traditional Paradigm for ML
1. Convert the raw input vector into a vector of feature activations
– Use hand-written programs based on common sense to define the features
2. Learn how to weight each of the feature activations to get a single scalar quantity
3. If this quantity is above some threshold, decide that the input vector is a positive example of the target class
The Standard Perceptron Architecture
[Figure: input units → hand-coded programs → feature units → learned weights → decision unit]
[Example: input "IPhone is very good ."; features good, very/good, very, … with weights 0.8, 0.9, 0.1, …; decision unit: score > 5?]
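A minimal sketch of this architecture (not the authors' code): hand-coded binary features, learned weights taken from the example above, and a decision threshold that is a hypothetical value for illustration.

```python
# A perceptron sketch: hand-coded features + learned weights + threshold.
# The feature set matches the slide example; the threshold is hypothetical.

def features(sentence):
    """Hand-coded feature extractor: binary indicators over words/bigrams."""
    words = sentence.split()
    bigrams = {f"{a}/{b}" for a, b in zip(words, words[1:])}
    return {"good": "good" in words,
            "very": "very" in words,
            "very/good": "very/good" in bigrams}

WEIGHTS = {"good": 0.8, "very/good": 0.9, "very": 0.1}  # learned weights
THRESHOLD = 0.5  # decision unit: positive if score > threshold (illustrative)

def classify(sentence):
    score = sum(WEIGHTS[f] for f, on in features(sentence).items() if on)
    return score > THRESHOLD

print(classify("IPhone is very good ."))  # True
```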
The Limitations of Perceptrons
• The hand-coded features
– Have a great influence on performance
– Finding suitable features costs a lot of effort
• A linear classifier with a hyperplane
– Cannot separate non-linear data; e.g., the XOR function cannot be learned by a single-layer perceptron
[Figure: the four points (0,0), (0,1), (1,0), (1,1); the positive and negative cases cannot be separated by a plane]
Learning with Non-linear Hidden Layers
Feedforward Neural Networks
• The information is propagated from the inputs to the outputs
• Time has no role (NO cycle between outputs and inputs)
• Multi-layer Perceptron (MLP)
• Learning the weights of hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
– More layers of linear units do not help: the result is still linear
– Fixed output non-linearities are not enough
[Figure: inputs x1, x2, …, xn → 1st hidden layer → 2nd hidden layer → output layer]
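A tiny sketch of why one non-linear hidden layer already helps: with hand-picked weights (hypothetical, for illustration), a two-layer network computes XOR, which no single-layer perceptron can represent.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)  # non-linear activation

# Hand-picked weights (hypothetical): h1 = OR(x1, x2), h2 = AND(x1, x2),
# output = h1 AND NOT h2 = XOR(x1, x2).
W1 = np.array([[1.0, 1.0],   # h1
               [1.0, 1.0]])  # h2
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)      # hidden layer: non-linear features
    return step(W2 @ h + b2)   # output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # 0, 1, 1, 0
```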
Multiple Layer Neural Networks
• What are those hidden neurons doing?
– Maybe representing outlines
General Optimizing (Learning) Algorithms
• Gradient Descent
• Stochastic Gradient Descent (SGD)
– Minibatch SGD (m > 1), Online GD (m = 1); see the sketch below
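A minimal sketch of the minibatch SGD loop, assuming a generic loss gradient grad(w, batch); the function names are illustrative, not from the slides.

```python
import numpy as np

def sgd(w, data, grad, lr=0.1, batch_size=32, epochs=10):
    """Minibatch SGD; batch_size=1 gives online gradient descent."""
    n = len(data)
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            w = w - lr * grad(w, batch)         # step along the negative gradient
    return w
```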
Computational/Flow Graphs
• Describing mathematical expressions
• For example
– e = (a + b) * (b + 1)
• c = a + b, d = b + 1, e = c * d
– If a = 2, b = 1, then c = 3, d = 2, e = 6
Derivatives on Computational Graphs
Computational Graph Backward Pass (Backpropagation)
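A worked sketch of the forward and backward pass on this graph (gradients derived by hand from the chain rule): $\partial e/\partial c = d$, $\partial e/\partial d = c$, so $\partial e/\partial a = d$ and, since $b$ feeds both $c$ and $d$, $\partial e/\partial b = d + c$.

```python
# Forward and backward pass on e = (a + b) * (b + 1), with a = 2, b = 1.
a, b = 2.0, 1.0

# Forward pass: evaluate each node.
c = a + b        # c = 3
d = b + 1        # d = 2
e = c * d        # e = 6

# Backward pass: propagate de/d(node) from the output back to the inputs.
de_de = 1.0
de_dc = de_de * d                  # e = c * d  ->  de/dc = d = 2
de_dd = de_de * c                  # de/dd = c = 3
de_da = de_dc * 1.0                # c = a + b  ->  dc/da = 1, so de/da = 2
de_db = de_dc * 1.0 + de_dd * 1.0  # b feeds both c and d: de/db = d + c = 5

print(e, de_da, de_db)  # 6.0 2.0 5.0
```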
An FNN POS Tagger
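The slide presents this tagger only as a figure; below is a hedged sketch of the standard window-based design in the spirit of Collobert and Weston (2008): the embeddings of a fixed word window are concatenated, passed through a tanh hidden layer, and softmaxed over the tag set. All sizes and names are illustrative assumptions.

```python
import numpy as np

V, D, WIN, H, T = 10000, 50, 5, 100, 45   # vocab, emb dim, window, hidden, tags
rng = np.random.default_rng(0)
E  = rng.normal(0, 0.1, (V, D))           # word embedding table
W1 = rng.normal(0, 0.1, (H, WIN * D)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (T, H));       b2 = np.zeros(T)

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def tag_scores(window_word_ids):
    """POS distribution for the center word of a 5-word window."""
    x = E[window_word_ids].reshape(-1)    # concatenate the window's embeddings
    h = np.tanh(W1 @ x + b1)              # hidden layer
    return softmax(W2 @ h + b2)           # distribution over tags

print(tag_scores([3, 17, 256, 42, 8]).argmax())
```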
Part 2.3: Word Embeddings
Typical Approaches for Word Representation
• 1-hot representation (orthogonality)
– bag-of-words model
– sun: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …]
– star: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …]
– sim(star, sun) = 0
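A two-line check of this orthogonality (the vector indices are illustrative):

```python
import numpy as np

vocab_size = 13
sun, star = np.zeros(vocab_size), np.zeros(vocab_size)
sun[8], star[7] = 1, 1   # each word gets its own dimension
print(sun @ star)        # 0.0: any two distinct 1-hot vectors are orthogonal
```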
Distributed Word Representation
• Each word is associated with a low-dimensional (compressed, 50-1000), dense (non-sparse), and real-valued (continuous) vector (word embedding)
– Learning word vectors through supervised models
• Nature
– Semantic similarity as vector similarity
How to Obtain Word Embeddings
Neural Network Language Models
• Neural Network Language Models (NNLM)
– Feed-forward (Bengio et al. 2003)
• Maximum-likelihood estimation
• Back-propagation
• Input: $(n-1)$ embeddings
Predict Word Vector Directly
• SENNA (Collobert and Weston, 2008)
• word2vec (Mikolov et al. 2013)
Word2vec: CBOW (Continuous Bag-of-Words)
• Add inputs from words within a short window to predict the current word
• The weights for different positions are shared
• Computationally much more efficient than a normal NNLM
• The hidden layer is just linear (see the sketch below)
• Each word is an embedding $v(w)$
• Each context is an embedding $v'(c)$
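A minimal numpy sketch of the CBOW scoring step (dimensions and names are illustrative): the averaged context embeddings are scored against the output embeddings $v'$.

```python
import numpy as np

V, D = 10000, 100
rng = np.random.default_rng(0)
v_in  = rng.normal(0, 0.1, (V, D))   # input embeddings v(w)
v_out = rng.normal(0, 0.1, (V, D))   # output embeddings v'(c)

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def cbow_predict(context_ids):
    """P(current word | context): average context vectors, then score all words."""
    h = v_in[context_ids].mean(axis=0)   # linear hidden layer: just an average
    return softmax(v_out @ h)            # full softmax (slow; see training tricks)

p = cbow_predict([12, 48, 94, 5])        # words in the window around the target
print(p.argmax(), p.max())
```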
Word2vec: Skip-Gram
• Predicting surrounding words using the current word
• Similar performance to CBOW
• Each word is an embedding $v(w)$
• Each context is an embedding $v'(c)$
Word2vec Training
• SGD + backpropagation
• Most of the computational cost is a function of the size of the vocabulary (millions)
• Training acceleration
– Negative sampling (Mikolov et al. 2013)
– Hierarchical decomposition (Morin and Bengio 2005; Mnih and Hinton 2008; Mikolov et al. 2013)
– Graphics Processing Units (GPU)
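In practice these models are rarely written by hand; a hedged usage sketch with gensim (assuming gensim 4.x; the toy corpus is illustrative):

```python
from gensim.models import Word2Vec

sentences = [["i", "like", "red", "apples"],
             ["i", "like", "green", "apples"]]   # toy corpus, illustration only

# sg=0 -> CBOW, sg=1 -> skip-gram; negative=5 enables negative sampling,
# hs=1 would switch to hierarchical softmax instead.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1,
                 negative=5, min_count=1, epochs=50)

print(model.wv["apples"][:5])                  # the learned embedding v(w)
print(model.wv.similarity("red", "green"))     # cosine similarity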
Word Analogy
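The classic example is $v(\text{king}) - v(\text{man}) + v(\text{woman}) \approx v(\text{queen})$. A hedged usage sketch with gensim's downloader (assumes network access; the dataset name is one of gensim's published pretrained models, and any pretrained KeyedVectors would do):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained word vectors

# v(king) - v(man) + v(woman) is closest to v(queen)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```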
Part 2.4: Recurrent and Other Neural Networks
Language Models
• A language model computes a probability for a sequence of words, $P(w_1, \cdots, w_m)$, or predicts a probability for the next word, $P(w_{m+1} \mid w_1, \cdots, w_m)$
• Useful for machine translation, speech recognition, and so on
• Word ordering
– P(the cat is small) > P(small the is cat)
• Word choice
– P(there are four cats) > P(there are for cats)
Traditional Language Models
• An incorrect but necessary Markov assumption!
• Probability is usually conditioned on a window of $n$ previous words
• $P(w_1, \cdots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \cdots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \cdots, w_{i-1})$
• How to estimate probabilities (a sketch follows after this list)
– $p(w_2 \mid w_1) = \frac{count(w_1, w_2)}{count(w_1)}$, $p(w_3 \mid w_1, w_2) = \frac{count(w_1, w_2, w_3)}{count(w_1, w_2)}$
• Performance improves by keeping counts for higher-order n-grams and doing smoothing, such as backoff (e.g., if a 4-gram is not found, try the 3-gram, etc.)
• Disadvantages
– There are A LOT of n-grams!
– Cannot see too long a history
– e.g., P(坐/作 了一整天的 火车/作业): choosing between 坐 ("ride") and 作 ("do") depends on the distant word 火车 ("train") vs. 作业 ("homework")
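A sketch of the bigram MLE estimate with a crude backoff to the unigram (toy corpus; this backoff is illustrative, not a recipe from the slides):

```python
from collections import Counter

corpus = "the cat is small . the cat is cute .".split()  # toy data

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_bigram(w1, w2):
    """MLE p(w2 | w1), backing off to the unigram p(w2) if the bigram is unseen."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return unigrams[w2] / N   # crude backoff, illustration only

print(p_bigram("the", "cat"))    # 1.0: "the" is always followed by "cat" here
print(p_bigram("cat", "small"))  # unseen bigram -> unigram backoff
```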
Recurrent Neural Networks (RNNs)
• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs
[Figure: an unrolled RNN: inputs $x_{t-1}, x_t, x_{t+1}$ → hidden states $h_{t-1}, h_t, h_{t+1}$ → outputs $y_{t-1}, y_t, y_{t+1}$, with weights $W_1, W_2, W_3$ shared across time steps]
Recurrent Neural Networks (RNNs)
• At a single time step $t$:
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $\hat{y}_t = \mathrm{softmax}(W_3 h_t)$
[Figure: one RNN cell drawn unrolled ($h_{t-1} \to h_t$) and as a self-loop over $W_1$; see the numpy transcription below]
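A direct numpy transcription of these two equations (all sizes are illustrative):

```python
import numpy as np

D, H, V = 50, 100, 1000                     # input dim, hidden dim, output vocab
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (H, H))             # hidden-to-hidden
W2 = rng.normal(0, 0.1, (H, D))             # input-to-hidden
W3 = rng.normal(0, 0.1, (V, H))             # hidden-to-output

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W1 @ h_prev + W2 @ x_t)   # h_t = tanh(W1 h_{t-1} + W2 x_t)
    y_t = softmax(W3 @ h_t)                 # y_t = softmax(W3 h_t)
    return h_t, y_t

h = np.zeros(H)
for x in rng.normal(size=(8, D)):           # run over a sequence of 8 inputs
    h, y = rnn_step(h, x)
```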
Training RNNs is hard
• Ideally, inputs from many time steps ago could modify the output $y$
• For example, consider dependencies spanning 2 time steps
[Figure: the unrolled RNN again, with shared weights $W_1, W_2, W_3$]
Back-Propagation Through Time (BPTT)
• The total error is the sum of the errors at each time step $t$: $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
• $\frac{\partial E_t}{\partial W_3} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial W_3}$ is easy to calculate
• But calculating $\frac{\partial E_t}{\partial W_1} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial W_1}$ is hard (likewise for $W_2$)
– Because $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$ depends on $h_{t-1}$, which in turn depends on $W_1$ and $h_{t-2}$, and so on
• So $\frac{\partial E_t}{\partial W_1} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_1}$
The vanishing gradient problem
• $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$, with $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_1 \, \mathrm{diag}[\tanh'(\cdots)]$
• $\left\| \frac{\partial h_t}{\partial h_{t-1}} \right\| \le \gamma \, \|W_1\| \le \gamma \lambda_1$
– where $\gamma$ bounds $\|\mathrm{diag}[\tanh'(\cdots)]\|$ and $\lambda_1$ is the largest singular value of $W_1$
• $\left\| \frac{\partial h_t}{\partial h_k} \right\| \le (\gamma \lambda_1)^{t-k} \to 0$ if $\gamma \lambda_1 < 1$
• This can become very small or very large quickly → vanishing or exploding gradient
• Trick for the exploding gradient: the clipping trick (set a threshold; see the sketch below)
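A sketch of gradient-norm clipping in plain numpy (in PyTorch the same idea is available as torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds a threshold (exploding-gradient trick)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])   # ||g|| = 50
print(clip_gradient(g))      # [3. 4.]: same direction, norm 5
```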
A “solution”
• Intuition
– Ensure $\gamma \lambda_1 \ge 1$ to prevent vanishing gradients
• So…
– Proper initialization of the $W$
– Use ReLU instead of tanh or sigmoid activation functions
A better “solution”
• Recall the original transition equation
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• We can instead update the state additively
– $u_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $h_t = h_{t-1} + u_t$
– then $\frac{\partial h_t}{\partial h_{t-1}} = 1 + \frac{\partial u_t}{\partial h_{t-1}} \ge 1$
• On the other hand
– $h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = \cdots$
[Figure: an RNN cell with additive state updates]
A better “solution” (cont.)
• Interpolate between the old state and the new state (“choosing to forget”)
– $f_t = \sigma(W_f x_t + U_f h_{t-1})$
– $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot u_t$
• Introduce a separate input gate $i_t$
– $i_t = \sigma(W_i x_t + U_i h_{t-1})$
– $h_t = f_t \odot h_{t-1} + i_t \odot u_t$
• Selectively expose the memory cell $c_t$ with an output gate $o_t$
– $o_t = \sigma(W_o x_t + U_o h_{t-1})$
– $c_t = f_t \odot c_{t-1} + i_t \odot u_t$
– $h_t = o_t \odot \tanh(c_t)$
Long Short-Term Memory (LSTM)
• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating
[Figure: the LSTM cell: input $x_t$ and previous $(c_{t-1}, h_{t-1})$ pass through three $\sigma$ gates and a $\tanh$ candidate to produce $(c_t, h_t)$]
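The equations above transcribe directly into a numpy sketch of one LSTM step ($u_t$ is the candidate update; weights are random and biases are omitted for brevity, so treat this as illustrative):

```python
import numpy as np

D, H = 50, 100
rng = np.random.default_rng(0)
def mats():  # one (W, U) pair per gate/candidate
    return rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wf, Uf = mats(); Wi, Ui = mats(); Wo, Uo = mats(); Wu, Uu = mats()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t):
    f = sigmoid(Wf @ x_t + Uf @ h_prev)   # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev)   # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)   # output gate
    u = np.tanh(Wu @ x_t + Uu @ h_prev)   # candidate update u_t
    c = f * c_prev + i * u                # additive memory update
    h = o * np.tanh(c)                    # selectively exposed state
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(8, D)):
    h, c = lstm_step(h, c, x)
```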
Gated Recurrent Units, GRU (Cho et al. 2014)
• Main ideas
– Keep around memories to capture long-distance dependencies
– Allow error messages to flow at different strengths depending on the inputs
• Update gate
– Based on the current input and hidden state
– $z_t = \sigma(W_z x_t + U_z h_{t-1})$
• Reset gate
– Computed similarly, but with different weights
– $r_t = \sigma(W_r x_t + U_r h_{t-1})$
GRU
• New memory content
– $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$
• The update gate $z$ controls how much of the past state should matter now
– If $z$ is close to 1, we can copy information in that unit through many time steps → less vanishing gradient!
• If the reset gate $r$ is close to 0, the unit ignores the previous memory and only stores the new input information → allows the model to drop information that is irrelevant in the future
• Units with long-term dependencies have active update gates $z$
• Units with short-term dependencies often have very active reset gates $r$
• The final memory at a time step combines the current and previous time steps (see the sketch below)
– $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
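And the corresponding one-step GRU sketch (same illustrative conventions as the LSTM sketch above):

```python
import numpy as np

D, H = 50, 100
rng = np.random.default_rng(1)
Wz, Uz = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wr, Ur = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
W,  U  = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory content
    return z * h_prev + (1 - z) * h_tilde          # interpolate old and new state
```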
LSTM vs. GRU
• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture
• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize
• If you have enough data, the greater expressive power of LSTMs may lead to better results
More RNNs
• Bidirectional RNN
• Stacked Bidirectional RNN
Tree-LSTMs
• Traditional sequential composition
• Tree-structured composition
More Applications of RNNs
• Neural Machine Translation
• Handwriting Generation
• Image Caption Generation
• ……
Neural Machine Translation
• RNN trained end-to-end: encoder-decoder
[Figure: an encoder RNN reads “I love you” into hidden states $h_1, h_2, h_3$; the final state needs to capture the entire sentence, and the decoder RNN then generates 我 …]
Attention Mechanism – Scoring
• Bahdanau et al. 2015
[Figure: at each decoder step, scores $\alpha = \mathrm{score}(h_t, \bar{h}_s)$ over the encoder states (e.g., 0.3, 0.6, 0.1) are combined into a context vector $c$; a sketch follows below]
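A sketch of one attention step. The score function here is a simple dot product; Bahdanau et al. actually use an additive MLP score, so treat this as illustrative of the mechanism rather than their exact model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def attend(h_t, encoder_states):
    """One attention step: score each encoder state, normalize, mix."""
    scores = encoder_states @ h_t   # score(h_t, h_s) as a dot product
    alpha = softmax(scores)         # attention weights, e.g. [0.3, 0.6, 0.1]
    c = alpha @ encoder_states      # context vector: weighted sum
    return c, alpha

rng = np.random.default_rng(0)
enc = rng.normal(size=(3, 100))     # h_1..h_3 for "I love you"
c, alpha = attend(rng.normal(size=100), enc)
print(alpha, alpha.sum())           # weights sum to 1
```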
Convolutional Neural Network
CS231n: Convolutional Neural Networks for Visual Recognition.
Pooling
CNN for NLP
Zhang,Y.,&Wallace,B.(2015).ASensitivityAnalysisof(andPractitioners’Guideto)ConvolutionalNeuralNetworksforSentenceClassification.
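A minimal numpy sketch of the sentence-classification CNN this guide analyzes: filters of width k convolve over the stacked word embeddings, followed by max-over-time pooling into a fixed-size feature vector. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k, F = 7, 50, 3, 100        # sentence length, emb dim, filter width, #filters
X = rng.normal(size=(n, D))       # sentence as a stack of word embeddings
filters = rng.normal(0, 0.1, (F, k, D))

def conv_and_pool(X):
    """Convolve each width-k filter over all word windows, then max-over-time pool."""
    windows = np.stack([X[i:i + k] for i in range(len(X) - k + 1)])  # (n-k+1, k, D)
    feature_maps = np.einsum("wkd,fkd->wf", windows, filters)        # (n-k+1, F)
    return np.tanh(feature_maps).max(axis=0)                         # (F,) fixed size

print(conv_and_pool(X).shape)     # (100,) -> fed to a softmax classifier
```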
Recursive Neural Network
Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks. NIPS.