Deep Learning in NLP: Word Representation and How to Use It for Parsing
Sig Group Seminar Talk
Wenyi Huang, harrywy@gmail.com


Dec 17, 2015

Transcript
Page 1: Deep Learning in NLP: Word Representation and How to Use It for Parsing. Sig Group Seminar Talk. Wenyi Huang, harrywy@gmail.com

Page 2:

Existing NLP Applications

• Language Modeling
• Speech Recognition
• Machine Translation

• Part-Of-Speech Tagging
• Chunking
• Named Entity Recognition
• Semantic Role Labeling
• Sentiment Analysis
• Paraphrasing
• Question-Answering
• Word-Sense Disambiguation

Page 3:

Word Representation

• One-hot representation:
  movie = [0 0 0 0 0 0 0 0 1 0 0 0 0]
  film  = [0 0 0 0 0 1 0 0 0 0 0 0 0]
  movie · film = 0: any two distinct words are orthogonal, so this encoding carries no notion of similarity (see the sketch after this list).

• Distributional-similarity-based representations: you can get a lot of value by representing a word by means of its neighbors.
  • Class-based (hard) clustering word representations
    • Brown clustering (Brown et al. 1992)
    • Exchange clustering (Martin et al. 1998, Clark 2003)
  • Soft clustering word representations
    • LSA/LSI
    • LDA
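A minimal sketch of the contrast above, using NumPy; the dense vectors are made up purely for illustration, not taken from any trained model:

```python
import numpy as np

# One-hot vectors of two different words: the dot product is always 0,
# so "movie" and "film" look completely unrelated.
movie = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
film  = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
print(movie @ film)  # 0

# Dense distributional vectors: similar words can have high cosine similarity.
movie_d = np.array([0.8, 0.1, -0.3])
film_d  = np.array([0.7, 0.2, -0.2])
cos = movie_d @ film_d / (np.linalg.norm(movie_d) * np.linalg.norm(film_d))
print(round(cos, 2))  # close to 1
```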

Page 4:

Brown Clustering: Example Clusters (from Brown et al, 1992)

Page 5:

The Formulation

• V is the set of all words seen in the corpus w_1, w_2, ..., w_n
• Say C : V → {1, 2, ..., k} is a partition of the vocabulary into k classes
• The model:
  p(w_1, ..., w_n) = ∏_{i=1}^{n} p(C(w_i) | C(w_{i-1})) · p(w_i | C(w_i))
• Quality of a partition C:
  Quality(C) = (1/n) log p(w_1, ..., w_n) ≈ ∑_{c,c'} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + constant
  i.e. the mutual information between adjacent classes (a small sketch of computing it follows below)

In other words, a class-based bigram model: an HMM with one class per word.
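A minimal sketch (not from the talk) of scoring a given partition by the mutual information between adjacent classes; `cluster_of` is a hypothetical dict mapping each word to its class:

```python
import math
from collections import Counter

def brown_quality(words, cluster_of):
    """Quality(C) ≈ sum over class pairs of p(c, c') * log( p(c, c') / (p(c) p(c')) ).
    The word-entropy term is constant in C, so it is omitted here."""
    classes = [cluster_of[w] for w in words]
    pair_counts = Counter(zip(classes, classes[1:]))
    class_counts = Counter(classes)
    n_pairs = sum(pair_counts.values())
    n = len(classes)
    quality = 0.0
    for (c1, c2), cnt in pair_counts.items():
        p_cc = cnt / n_pairs
        quality += p_cc * math.log(p_cc / ((class_counts[c1] / n) * (class_counts[c2] / n)))
    return quality
```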

Page 6:

A Sample Hierarchy (from Miller et al., NAACL 2004)

Like a Huffman coding of the vocabulary: words whose bit-string paths share a prefix are similar.

Page 7:

Language Model

• N-gram model (n = 3):
  p(w_1, ..., w_m) ≈ ∏_i p(w_i | w_{i-2}, w_{i-1})

• Calculated from n-gram frequency counts (sketched below):
  p(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1})
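A minimal sketch of this maximum-likelihood count-based estimate; the toy sentence and the helper name `trigram_mle` are illustrative:

```python
from collections import Counter

def trigram_mle(tokens):
    """p(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1})."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(a, b, c): n / bigrams[(a, b)] for (a, b, c), n in trigrams.items()}

probs = trigram_mle("the cat sat on the mat the cat sat on the rug".split())
print(probs[("the", "cat", "sat")])  # 1.0 -- "the cat" is always followed by "sat" here
```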

Page 8:

A Neural Probabilistic Language Model (Bengio et al., NIPS 2000 and JMLR 2003)

• Motivation:
  • An n-gram LM does not take into account contexts farther than n-1 words (here, 2 words).
  • An n-gram LM does not take into account the "similarity" between words.
• Idea:
  • A word is associated with a distributed feature vector (a real-valued vector in R^m, where m is much smaller than the size of the vocabulary).
  • Express the joint probability function of a word sequence in terms of the feature vectors of its words.
  • Learn simultaneously the word feature vectors and the parameters of that probability function.

Page 9:

Neural Language Model

Neural architecture

Page 10:

Neural Language Model

• Softmax output layer:
  P(w_t = i | w_{t-1}, ..., w_{t-n+1}) = e^{y_i} / ∑_j e^{y_j}

• y = b + U tanh(d + H x) are the unnormalized log-probabilities for each output word (sketched below)

• x = (C(w_{t-1}), ..., C(w_{t-n+1})) is the word-features layer activation vector: the concatenated feature vectors of the context words

• The free parameters of the model are:
  • output biases (b)
  • hidden layer biases (d)
  • hidden-to-output weights (U)
  • hidden layer weights (H)
  • word features (C)

4 weeks of training (40 CPUs) on a 14,000,000-word training set, |V| = 17,964
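A minimal NumPy sketch of one forward pass of such a model, without the optional direct input-to-output connections; all dimensions and the function name `nplm_forward` are toy assumptions, not the talk's code:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def nplm_forward(context_ids, C, H, d, U, b):
    """x: concatenated feature vectors of the n-1 context words.
    y = b + U tanh(d + H x): unnormalized log-probabilities over the vocabulary."""
    x = np.concatenate([C[i] for i in context_ids])  # word-features layer
    y = b + U @ np.tanh(d + H @ x)                   # hidden layer -> output
    return softmax(y)                                # P(w_t | context)

# Toy sizes: |V| = 10 words, feature dimension m = 4, n-1 = 2 context words, 8 hidden units.
V, m, ctx, h = 10, 4, 2, 8
rng = np.random.default_rng(0)
C = rng.normal(size=(V, m))
H = rng.normal(size=(h, ctx * m)); d = np.zeros(h)
U = rng.normal(size=(V, h));       b = np.zeros(V)
print(nplm_forward([3, 7], C, H, d, U, b).sum())     # sums to 1.0
```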

Page 11:

Neural Word Embeddings as a Distributed Representation

http://metaoptimize.com/projects/wordreprs/

Page 12:

A neural network for learning word vectors (Collobert et al. JMLR 2011)

• A word in its context is a positive training sample; a random word substituted into the same context gives a negative training sample, and the network is trained so that the positive score is higher:
  • [+] positive: Score(Cat chills [on] a mat) >
  • [-] negative: Score(Cat chills [god] a mat)

• What to feed into the NN:
  • each word is an n-dimensional vector, taken from a look-up table (the embedding matrix)

• Training objective, a pairwise ranking loss (sketched below):
  J = ∑_{x} ∑_{w ∈ V} max(0, 1 − f(x) + f(x^{(w)}))
  where x ranges over text windows in the corpus and x^{(w)} is x with its centre word replaced by w

• 3-layer NN: f is the score function computed by the network.

• SENNA: http://ml.nec-labs.com/senna/

Window size n = 11, |V| = 130,000, 7 weeks of training
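A minimal sketch of that ranking loss, assuming a hypothetical `score_fn` (the network's scoring function) and a sample of corrupting words:

```python
def cw_ranking_loss(score_fn, window, corrupt_word_ids):
    """sum_w max(0, 1 - f(x) + f(x_corrupt)): the corrupted window replaces the
    centre word of `window` with a random vocabulary word w."""
    pos = score_fn(window)
    centre = len(window) // 2
    loss = 0.0
    for w in corrupt_word_ids:
        corrupted = list(window)
        corrupted[centre] = w          # e.g. "Cat chills [god] a mat"
        loss += max(0.0, 1.0 - pos + score_fn(corrupted))
    return loss
```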

Page 13:

Linguistic Regularities in Continuous Space Word Representations (Mikolov, et al. 2013)
• Recurrent Neural Network Model

[Figure: the recurrent network. The input w(t) is the one-hot representation of the current word, the context layer s(t-1) carries the hidden state from time t-1, and the output y(t) is compared with the desired vector d(t), giving Error = y(t) - d(t).]

Page 14:

Linguistic Regularities in Continuous Space Word Representations (Mikolov, et al. 2013)

• Recurrent Neural Network Model
  • The input vector w(t) represents the input word at time t, encoded using one-hot coding.
  • The output layer y(t) produces a probability distribution over words.
  • The hidden layer s(t) maintains a representation of the sentence history.
  • w(t) and y(t) are of the same dimension as the vocabulary.

• Model (sketched below):
  s(t) = f(U w(t) + W s(t-1))
  y(t) = g(V s(t))
  where f is the sigmoid function and g is the softmax function.
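A minimal NumPy sketch of one step of this recurrent model; matrix shapes and the name `rnnlm_step` are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(w_t, s_prev, U, W, V):
    """s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t)).
    w_t is a one-hot vector of size |V|; U is (h, |V|), W is (h, h), V is (|V|, h)."""
    s_t = sigmoid(U @ w_t + W @ s_prev)   # hidden state: sentence history
    y_t = softmax(V @ s_t)                # distribution over the next word
    return s_t, y_t
```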

Page 15:

• Training: Stochastic Gradient Descent (SGD)

• Objective (error) function:
  error(t) = d(t) − y(t)
  where d(t) is the desired one-hot vector, i.e. the encoding of the target word
• Go through all the training data iteratively, and update the weight matrices U, V and W online (after processing every word)
• Training is performed in several epochs (usually 5-10)

• Where is the word representation?
  In the input weight matrix U, with each column representing one word.

Linguistic Regularities in Continuous Space Word Representations (Mikolov, et al. 2013)

Page 16:

• Measuring Linguistic Regularity
  • Syntactic/Semantic analogy tests (e.g. man : king :: woman : ?)

Linguistic Regularities in Continuous Space Word Representations (Mikolov, et al. 2013)

These representations are surprisingly good at capturing syntactic and semantic regularities in language, and each relationship is characterized by a relation-specific vector offset.
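A minimal sketch of answering such an analogy with the relation-specific vector offset; `vectors` is an assumed dict mapping each word to its embedding:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?': form vec(b) - vec(a) + vec(c) and return
    the nearest vocabulary word by cosine similarity (excluding the query words)."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy("man", "king", "woman", vectors) is expected to return "queen"
```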

Page 17:

Exploiting Similarities among Languages for Machine Translation (Mikolov, et al. 2013, http://arxiv.org/pdf/1309.4168.pdf)

Figure 1: Distributed word vector representations of numbers and animals in English (left) and Spanish (right). The five vectors in each language were projected down to two dimensions using PCA, and then manually rotated to accentuate their similarity. It can be seen that these concepts have similar geometric arrangements in both spaces, suggesting that it is possible to learn an accurate linear mapping from one space to another.
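The caption above describes learning a linear mapping between the two embedding spaces. A minimal sketch of fitting such a map by least squares from a small seed dictionary; the function name and shapes are assumptions, not the paper's code:

```python
import numpy as np

def fit_translation_matrix(X, Z):
    """Fit W minimizing ||X W - Z||^2, where row i of X is the source-language
    vector and row i of Z is the target-language vector of the i-th dictionary pair."""
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

# To translate a new source word vector x: compute x @ W and look up the nearest
# target-language word by cosine similarity.
```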

Page 18:

Improving Word Representations via Global Context and Multiple Word Prototypes (Huang, et al. ACL 2012)

Page 19:

• Improves Collobert & Weston's model by adding a global, document-level context to the score
• Training objective:
  C_{s,d} = ∑_{w ∈ V} max(0, 1 − g(s, d) + g(s^{(w)}, d))
  where s is the local word window, s^{(w)} is s with its last word replaced by w, g(s, d) is the network's score, and d is the document context (a weighted sum of the word vectors in the document)

Improving Word Representations via Global Context and Multiple Word Prototypes (Huang, et al. ACL 2012)

Page 20:

• The Model

• Cluster each word's occurrences based on their contexts and retrain the model, yielding multiple prototype vectors per word

Improving Word Representations via Global Context and Multiple Word Prototypes (Huang, et al. ACL 2012)

Page 21:

Summary: Word Representation and Language Model

Bengio et al. 2003: Associated Press (AP) News from 1995 and 1996, 14,000,000 words; |V| = 17,964; trained for 4 weeks (40 CPUs)

C&W: English Wikipedia + Reuters RCV1; words: 130,000; dimensions: 50; trained for 7 weeks

Mikolov: Broadcast News; words: 82,390; dimensions: 80, 640, 1600; trained for several days

Huang 2012: English Wikipedia; words: 100,232; dimensions: 50; |V| = 6,000 for the multi-prototype vectors, 10 clusters for each word

Page 22:

Parsing

• What we want:

Page 23:

Using the word vector space model

The meaning (vector) of a sentence is determined by
(1) the meanings of its words, and
(2) the rules that combine them.

Page 24:

Recursive Neural Networks for Structure Prediction
• Inputs: the representations of two candidate children
• Outputs (see the sketch after this list):
  • the semantic representation of the node formed if the two children are merged
  • a score of how plausible the new node would be
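A minimal sketch of that merge step, with assumed parameter names W, b and u (a composition matrix, bias, and scoring vector):

```python
import numpy as np

def compose(c1, c2, W, b, u):
    """Merge two children: parent p = tanh(W [c1; c2] + b), plausibility score = u . p."""
    children = np.concatenate([c1, c2])
    p = np.tanh(W @ children + b)      # semantic representation of the merged node
    score = float(u @ p)               # how plausible the new node is
    return p, score
```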

Page 25:

Recursive Neural Networks

Page 26:

Parsing Example with an RNN

Page 27:

Parsing Example with an RNN

Page 28:

Parsing Example with an RNN

Page 29:

Parsing Example with an RNN

Page 30:

Parsing with Compositional Vector Grammars (Socher, et al. ACL 2013)

• A Compositional Vector Grammar (CVG) model, which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations.

Page 31:

Parsing with Compositional Vector Grammars (Socher, et al. ACL 2013)

• Probabilistic Context-Free Grammars (PCFGs)
• A PCFG consists of:
  • A context-free grammar G = (N, Σ, S, R).
  • A parameter q(α → β) for each rule α → β ∈ R. The parameter q(α → β) can be interpreted as the conditional probability of choosing rule α → β in a left-most derivation, given that the non-terminal being expanded is α. For any X ∈ N, we have the constraint:
    ∑_{α → β ∈ R : α = X} q(α → β) = 1
• Given a parse tree t containing rules α_1 → β_1, ..., α_n → β_n, the probability of t under the PCFG is
  p(t) = ∏_{i=1}^{n} q(α_i → β_i)
  (a small numeric sketch follows below)

Chomsky Normal Form -> binary parse tree
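A minimal sketch of that product of rule probabilities; the toy grammar and the encoding of rules as tuples are illustrative assumptions:

```python
def pcfg_tree_probability(rules_used, q):
    """p(t) = product of q(rule) over the rules used in the tree."""
    p = 1.0
    for rule in rules_used:
        p *= q[rule]
    return p

# Toy grammar in which every rule has probability 1, so the tree has probability 1.
q = {("S", ("NP", "VP")): 1.0, ("NP", ("dogs",)): 1.0, ("VP", ("bark",)): 1.0}
tree_rules = [("S", ("NP", "VP")), ("NP", ("dogs",)), ("VP", ("bark",))]
print(pcfg_tree_probability(tree_rules, q))  # 1.0
```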

Page 32:

• Define a structured margin loss Δ(y_i, ŷ) for predicting a tree ŷ when the correct tree is y_i (it penalizes spans of ŷ that do not appear in y_i).

• For a given set of training instances (x_i, y_i), we search for the scoring function s, parameterized by θ, with the smallest expected loss on a new sentence; the regularized objective is
  J(θ) = (1/m) ∑_{i=1}^{m} r_i(θ) + (λ/2) ||θ||²
  where
  r_i(θ) = max_{ŷ ∈ Y(x_i)} [ s(x_i, ŷ) + Δ(y_i, ŷ) ] − s(x_i, y_i)

Parsing with Compositional Vector Grammars (Socher, et al. ACL 2013)

Page 33:

Parsing with Compositional Vector Grammars (Socher, et al. ACL 2013)

• The CVG computes the first parent vector via the SU-RNN:
  p = f( W^{(B,C)} [b; c] )
  where W^{(B,C)} is now a matrix that depends on the syntactic categories B and C of the two children.
• The score for each node consists of summing two elements (see the sketch after this list):
  s(p) = (v^{(B,C)})ᵀ p + log P(P_1 → B C)
  where v^{(B,C)} is a vector of parameters that needs to be trained, and log P(P_1 → B C) comes from the PCFG.
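A minimal sketch of that category-dependent ("syntactically untied") composition and scoring; the dictionaries keyed by child-category pairs are an assumed way to store the untied parameters:

```python
import numpy as np

def cvg_node(a, b, cat_a, cat_b, W_by_cats, v_by_cats, log_pcfg_rule_prob):
    """Parent p = tanh(W^{(A,B)} [a; b]); score = v^{(A,B)} . p + log P(rule)."""
    W = W_by_cats[(cat_a, cat_b)]                 # matrix chosen by the children's categories
    v = v_by_cats[(cat_a, cat_b)]
    p = np.tanh(W @ np.concatenate([a, b]))       # syntactico-semantic parent vector
    score = float(v @ p) + log_pcfg_rule_prob     # neural score plus PCFG log-probability
    return p, score
```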

Page 34:

Parsing with CVGs

• Bottom-up beam search, keeping a k-best list at every cell of the chart.

• The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%.

• In effect, the SU-RNN acts as a reranker over the candidate trees proposed by the base PCFG.

Page 35:

Q&A

Thanks!