Deep Learning, an interactive introduction for NLP-ers

Roelof Pieters (@graphific)

Introduction to Deep Learning for NLP

22 January 2015 Stockholm Natural Language Processing Meetup

FEEDA

Slides at: http://www.slideshare.net/roelofp/220115dlmeetup


https://twitter.com/graphific

http://www.feeda.com
http://www.csc.kth.se/~roelof/


Deep Learning ???


A couple of headlines… [all November ’14]


(source: Google Trends)

Machine Learning ??

- Audience Check -


• “Brain” inspired / simulations:

• vision: make learning algorithms better and easier to use

• goal: revolutions in (practical) advances for machine learning and AI

• Deep Learning = subfield of Machine Learning

Deep Learning ??


Biological Inspiration


Deep Learning ??


DL: Impact


Speech Recognition

DL: Impact


Deep Learning for the win! A few examples:

• IJCNN 2011 Traffic Sign Recognition Competition
• ISBI 2012 Segmentation of neuronal structures in EM stacks challenge
• ICDAR 2011 Chinese handwriting recognition

• Deals with “construction and study of systems that can learn from data”

Machine Learning ??

A computer program is said to learn from experience (E) with respect to some class of tasks (T) and performance measure (P), if its performance at tasks in T, as measured by P, improves with experience E

— T. Mitchell 1997


Machine Learning ??

Traditional Programming:

Data + Program → Output

Machine Learning:

Data + Output → Program


Supervised (inductive) learning

• Training data includes desired outputs

Unsupervised learning

• Training data does not include desired outputs

Semi-supervised learning

• Training data includes a few desired outputs

Reinforcement learning

• Rewards from sequence of actions

Types of Learning


ML: Traditional Approach

For each new problem/question:

1. Gather as much LABELED data as you can get

2. Throw some algorithms at it (mainly put in an SVM and keep it at that)

3. If you have actually tried more algorithms: pick the best

4. Spend hours hand-engineering some features / feature selection / dimensionality reduction (PCA, SVD, etc.)

5. Repeat…


Machine Learning for NLP

Data

Classic Approach: Data is fed into a learning algorithm:

Learning Algorithm



some of the (many) treebank datasets

source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks


Penn Treebank

That’s a lot of “manual” work:


• the students went to class

DT NN VB P NN

• plays well with others

VB ADV P NN

NN NN P DT

• fruit flies like a banana

NN NN VB DT NN

NN VB P DT NN

NN NN P DT NN

NN VB VB DT NN

With a lot of issues:

Penn Treebank



Data

“Features”

Learning Algorithm

Prediction/Classifier

Prediction

train set

test set
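The train set / test set split in the diagram can be sketched in a few lines of plain Python. This is only an illustration: the function name `train_test_split`, the 80/20 ratio, and the toy data are assumptions, not taken from the slides.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled data: (feature vector, label) pairs, purely illustrative.
data = [([i, i % 3], i % 2) for i in range(10)]
train_set, test_set = train_test_split(data)
```

The learning algorithm only ever sees `train_set`; `test_set` is held back to estimate how well the resulting classifier generalizes.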



• Until the early 1990s, NLP systems were built manually with hand-crafted dictionaries and rules.

• As large electronic text corpora became increasingly available, researchers began using machine learning techniques to automatically build NLP systems.

• Today, the vast majority of NLP systems use machine learning.


2. Neural Networks and a short history lesson


Perceptron (1957)

Frank Rosenblatt (1928-1971)

Original Perceptron

Simplified model:

(From Perceptrons by M. L. Minsky and S. Papert, 1969, Cambridge, MA: MIT Press. Copyright 1969 by MIT Press.)


Perceptron (1957)

Perceptron Research, youtube clip: https://www.youtube.com/watch?v=cNxadbrN_aI&feature=youtu.be&t=12


Perceptron (1957)


or

Multilayer Perceptron (1986)

inputs

weights

bias

activation


Neuron Model

All you need to know:
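Since the slide's figure did not survive, here is a minimal sketch of a single neuron: a weighted sum of the inputs plus a bias, passed through an activation function. The sigmoid activation and the example numbers are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of inputs plus bias, followed by the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Arbitrary example values (hypothetical, not from the talk).
output = neuron(inputs=[1.0, 0.0, 1.0], weights=[0.5, -0.3, 0.8], bias=-1.3)
```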


Activation functions
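The plot on this slide is lost, so it is not known which functions it showed. As a hedged stand-in, the sketch below defines three commonly used activation functions (sigmoid, tanh, and the rectified linear unit) and prints their values at a few points.

```python
import math

def sigmoid(z):
    """Output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output in (-1, 1), zero-centered."""
    return math.tanh(z)

def relu(z):
    """Zero for negative inputs, identity otherwise."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```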


Backpropagation (1974/1986)

1974: Paul Werbos invents the backpropagation algorithm for NNs
1986: Backprop popularized by Rumelhart, Hinton, Williams
1990: Renewed interest in NNs


Backprop Renaissance

Forward Propagation

• Sum inputs, produce activation, feed-forward
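As a rough sketch of "sum inputs, produce activation, feed-forward": each layer computes its activations from the previous layer's outputs and hands them on. The 2-3-1 layout, the sigmoid activation, and all weights below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One layer: each row of `weights` holds the incoming weights of one unit."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the activations of each layer forward into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical 2-3-1 network with arbitrary weights.
layers = [
    ([[0.1, 0.4], [-0.2, 0.3], [0.5, -0.5]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.6, 0.9]], [0.05]),
]
prediction = forward([1.0, 2.0], layers)
```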


Backprop Renaissance

Back Propagation (of error)

• Calculate total error at the top

• Calculate contributions to error at each step going backwards


• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss (example, parameters) is O(n) computation, then so is computing the gradient

Backpropagation
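To make the chain-rule bookkeeping concrete, here is a small sketch for a toy 1-1-1 sigmoid network with squared-error loss. The network shape, loss, and numbers are assumptions for illustration. The analytic gradients from the backward pass are compared against finite differences as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, target):
    """Forward pass of a tiny 1-1-1 network followed by squared error."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return 0.5 * (y - target) ** 2

def gradients(w1, w2, x, target):
    """Backward pass: reuse forward values, apply the chain rule step by step."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    dloss_dy = y - target                # error at the top
    dloss_dz2 = dloss_dy * y * (1 - y)   # through the output sigmoid
    dloss_dw2 = dloss_dz2 * h
    dloss_dh = dloss_dz2 * w2            # contribution passed backwards
    dloss_dz1 = dloss_dh * h * (1 - h)   # through the hidden sigmoid
    dloss_dw1 = dloss_dz1 * x
    return dloss_dw1, dloss_dw2

w1, w2, x, target = 0.5, -0.3, 1.5, 1.0
g1, g2 = gradients(w1, w2, x, target)

# Finite-difference check of the analytic gradients.
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
```

Note that the backward pass costs about as much as the forward pass, which is the O(n) point made in the bullets.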


Simple Chain Rule
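The figure for this slide is missing. For reference, the rule itself, written with assumed symbols where z depends on y and y depends on x:

```latex
z = f(y), \quad y = g(x)
\qquad\Longrightarrow\qquad
\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}
```

Backpropagation applies this identity repeatedly, layer by layer, from the loss back to each weight.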


Training procedure

• Initialize randomly
• Sequentially give it data.
• See what the difference is between network output and actual output.
• Update the weights according to this error.
• End result: give a model input, and it produces a proper output.

Quest for the weights. The weights are the model!
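The training procedure can be sketched for the simplest possible case, a single sigmoid neuron. The toy task (logical OR), the learning rate, and the number of epochs are assumptions for illustration; the numbered comments map onto the steps of the procedure.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train(examples, epochs=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # 1. Initialize randomly.
    weights = [rng.uniform(-0.5, 0.5) for _ in examples[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        # 2. Sequentially give it data.
        for inputs, target in examples:
            # 3. Difference between network output and actual output.
            error = predict(weights, bias, inputs) - target
            # 4. Update the weights according to this error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Toy task (assumed for illustration): the logical OR function.
or_examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(or_examples)
outputs = [round(predict(weights, bias, x)) for x, _ in or_examples]
```

After training, `weights` and `bias` are the whole model: feeding an input through `predict` is all that is left.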

To reiterate:


So why only now?

• Inspired by the architectural depth of the brain, researchers wanted for decades to train deep multi-layer neural networks.

• No successful attempts were reported before 2006…

Exception: convolutional neural networks, LeCun 1998

• SVM: Vapnik and his co-workers developed the Support Vector Machine (1993) (shallow architecture).

• Breakthrough in 2006!


2006 Breakthrough

• More data

• Faster hardware: GPUs, multi-core CPUs

• Working ideas on how to train deep architectures


2006 Breakthrough

Stacked Restricted Boltzmann Machines* (RBM): Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.

Stacked Autoencoders (AE): Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19.

* called Deep Belief Networks (DBN)

https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf

https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf

3. Deep Learning onwards we go…


Hierarchies

Efficient

Generalization

Distributed

Sharing

Unsupervised*

Black Box

Training Time

Major PWNAGE!

Much Data

Why go Deep ?


No More Handcrafted Features !


— Andrew Ng

“I’ve worked all my life in Machine Learning, and I’ve never seen one algorithm knock over benchmarks like Deep Learning”

Deep Learning: Why?


Biological Justification

Deep Learning = Brain “inspired”

Audio/Visual Cortex has multiple stages == Hierarchical

• Computational Biology • CVAP

• Jorge Dávila-Chacón • “that guy”

“Brainiacs” vs “Pragmatists”


Different Levels of Abstraction


Hierarchical Learning

• Natural progression from low level to high level structure as seen in natural complexity

Different Levels of Abstraction

Feature Representation




• Easier to monitor what is being learnt and to guide the machine to better subspaces






• A good lower level representation can be used for many distinct tasks



• Shared Low Level Representations

• Multi-Task Learning

• Unsupervised Training


• Partial Feature Sharing

• Mixed Mode Learning

• Composition of Functions

Generalizable Learning


Classic Deep Architecture

Input layer

Hidden layers

Output layer


Modern Deep Architecture

Input layer

Hidden layers

Output layer


Deep Learning: Why? (again)

Beat state of the art in many areas:

• Language Modeling (2012, Mikolov et al)
• Image Recognition (Krizhevsky won 2012 ImageNet competition)
• Sentiment Classification (2011, Socher et al)
• Speech Recognition (2010, Dahl et al)
• MNIST hand-written digit recognition (Ciresan et al, 2010)


One Model rules them all?

DL approaches have been successfully applied to:

Deep Learning: Why for NLP ?

Automatic summarization Coreference resolution Discourse analysis

Machine translation Morphological segmentation Named entity recognition (NER)

Natural language generation

Natural language understanding

Optical character recognition (OCR)

Part-of-speech tagging

Parsing

Question answering

Relationship extraction

Sentence boundary disambiguation

Sentiment analysis

Speech recognition

Speech segmentation

Topic segmentation and recognition

Word segmentation

Word sense disambiguation

Information retrieval (IR)

Information extraction (IE)

Speech processing


- COFFEE BREAK -

After the break we return with: CODE

Download the code samples now from: https://github.com/graphific/DL-Meetup-intro

shortened url: http://goo.gl/abX1E2

• Deep Neural Network

• Multilayer Perceptron (MLP) or Artificial Neural Network (ANN)

1. MLP

Logistic regression

Training regime: Stochastic Gradient Descent (SGD) with minibatches

MNIST dataset

Simple hidden layer
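The code samples for this part live in the repository linked above and are not reproduced here. As a dependency-free sketch of the training regime only, the example below fits a logistic regression with minibatch SGD. The synthetic 2-d dataset stands in for MNIST, and every name and hyperparameter is an assumption.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))

def minibatches(examples, batch_size, rng):
    """Yield the examples in shuffled chunks of `batch_size`."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def sgd_logistic_regression(examples, epochs=50, batch_size=10,
                            learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    n_features = len(examples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for batch in minibatches(examples, batch_size, rng):
            grad_w, grad_b = [0.0] * n_features, 0.0
            for x, y in batch:
                error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
                grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
                grad_b += error
            # One update per minibatch, using the average gradient.
            weights = [w - learning_rate * g / len(batch)
                       for w, g in zip(weights, grad_w)]
            bias -= learning_rate * grad_b / len(batch)
    return weights, bias

# Synthetic stand-in for a real dataset: label is 1 when x0 + x1 > 1.
data_rng = random.Random(1)
points = [[data_rng.random(), data_rng.random()] for _ in range(200)]
dataset = [(p, 1 if p[0] + p[1] > 1 else 0) for p in points]
weights, bias = sgd_logistic_regression(dataset)
accuracy = sum(
    (sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) > 0.5) == (y == 1)
    for x, y in dataset
) / len(dataset)
```

An MLP follows the same loop; the difference is that a hidden layer sits between input and output, so the gradient is obtained by backpropagation.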


2. Convolutional Neural Network


from: Krizhevsky, Sutskever, Hinton. (2012). ImageNet Classification with Deep Convolutional Neural Networks [breakthrough in object recognition, ImageNet 2012]

http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Convolutional Neural Network

http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
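A minimal sketch of feature extraction by convolution: a small kernel slides over the image and produces one response per position. The toy image and the edge-detecting kernel are hypothetical. As is conventional in CNN layers, the kernel is not flipped (strictly speaking this is cross-correlation).

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy 4x4 "image" with a vertical edge, and a hypothetical 2x2 edge detector.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d_valid(image, kernel)
```

The feature map responds only where the kernel straddles the edge. In a convolutional network the kernel values are learned rather than hand-picked.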

movie time: http://www.cs.toronto.edu/~hinton/adi/index.htm


That’s it, no more code! (for now)


Deep Learning: Future Developments

Currently an explosion of developments:
• Hessian-Free networks (2010)
• Long Short Term Memory (2011)
• Large Convolutional nets, max-pooling (2011)
• Nesterov’s Gradient Descent (2013)

Currently state of the art, but...
• No way of doing logical inference (extrapolation)
• No easy integration of abstract knowledge
• Hypothesis space bias might not conform with reality


Deep Learning: Future Challenges


Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R. (2013). Intriguing properties of neural networks

L: correctly identified, Center: added noise x10, R: “Ostrich”

• cuda-convnet2 (Alex Krizhevsky, Toronto) (c++/CUDA, optimized for GTX 580) https://code.google.com/p/cuda-convnet2/

• Caffe (Berkeley) (C++/CUDA, with Python and MATLAB interfaces) http://caffe.berkeleyvision.org/

• OverFeat (NYU) http://cilvr.nyu.edu/doku.php?id=code:start

Wanna Play ?


• Theano - CPU/GPU symbolic expression compiler in python (from LISA lab at University of Montreal). http://deeplearning.net/software/theano/

• Pylearn2 - library designed to make machine learning research easy. http://deeplearning.net/software/pylearn2/

• Torch - Matlab-like environment for state-of-the-art machine learning algorithms in lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu) http://torch.ch/

• more info: http://deeplearning.net/software%20links/

Wanna Play ?


In touch!

Academic/Research

as PhD candidate KTH/CSC: “Always interested in discussing Machine Learning, Deep Architectures, Graphs, and Language Technology”

[email protected]

http://www.csc.kth.se/~roelof/

Internship / Entrepreneurship

as CIO/CTO Feeda: “Always looking for additions to our brand new R&D team”

[Internships upcoming on KTH exjobb website…]

[email protected]

Feeda


We’re Hiring!

[email protected]

Feeda

• Dev Ops • Software Developers • Data Scientists



Thanks for listening

Mingling time!


Can’t get enough? Come to my talk tomorrow (Friday)

Visual-Semantic Embeddings: some thoughts on Language

Roelof Pieters, TCS/CSC

Friday Jan 23, 13:30

Room 304, Teknikringen 14, level 3

Description on KTH website:

https://www.kth.se/en/csc/forskning/cvap/seminars/visual-semantic-embeddings-some-thoughts-on-language-1.534764

Addendum

Some of the exciting recent developments in NLP, especially Distributed Semantics


Word Embeddings: Turian (2010)

Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning

code & info: http://metaoptimize.com/projects/wordreprs/

Word Embeddings: Collobert & Weston (2011)

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (almost) from Scratch


Multi-embeddings: Stanford (2012)

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012). Improving Word Representations via Global Context and Multiple Word Prototypes


Linguistic Regularities: Mikolov (2013)

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations

code & info: https://code.google.com/p/word2vec/
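The "linguistic regularities" idea is that relations between words show up as roughly constant vector offsets, so an analogy can be answered by vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made 3-d toy vectors, not learned embeddings; the helper names are assumptions and this is not the word2vec API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, embeddings):
    """Word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc
              in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Hand-made toy vectors (NOT learned); axes loosely: royalty, male, female.
toy = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}
result = analogy("man", "king", "woman", toy)  # "man is to king as woman is to ?"
```

With real embeddings the vocabulary holds many thousands of words and the match is approximate rather than exact.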


Word Embeddings for MT: Mikolov (2013)

Mikolov, T., Le, Q. V., Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation

Recursive Deep Models & Sentiment: Socher (2013)

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Chris Manning, Andrew Ng and Chris Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013

code & demo: http://nlp.stanford.edu/sentiment/index.html

Paragraph Vectors: Le & Mikolov (2014)

Le, Q., Mikolov, T. (2014). Distributed Representations of Sentences and Documents


• add context (sentence, paragraph, document) to word vectors during training


Results on Stanford Sentiment Treebank dataset:

http://www-cs.stanford.edu/~quocle/paragraph_vector.pdf

Global Vectors, GloVe: Stanford (2014)

Pennington, J., Socher, R., Manning, C. D. (2014). GloVe: Global Vectors for Word Representation

code & demo: http://nlp.stanford.edu/projects/glove/

vs

results on the word analogy task

“similar accuracy”


https://twitter.com/RichardSocher/status/497235079903473664

https://docs.google.com/document/d/1ydIujJ7ETSZ688RGfU5IMJJsbxAi-kRl8czSwpti15s/mobilebasic?pli=1

Dependency-based Embeddings: Levy & Goldberg (2014)

Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings

code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

- Syntactic Dependency Context

Australian scientist discovers star with telescope

- Bag of Words (BoW) Context
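The two kinds of context can be contrasted on the slide's example sentence. In this sketch the dependency arcs are written by hand (an assumption, not parser output), with the preposition "with" collapsed into the relation label; the window size of 2 and the function names are also illustrative.

```python
def bow_contexts(tokens, index, window=2):
    """Bag-of-words context: words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def dependency_contexts(word, arcs):
    """Contexts from (head, relation, modifier) arcs: the word's modifiers
    labeled with the relation, and its head labeled with the inverse relation."""
    contexts = []
    for head, relation, modifier in arcs:
        if head == word:
            contexts.append(f"{modifier}/{relation}")
        if modifier == word:
            contexts.append(f"{head}/{relation}-1")
    return contexts

tokens = "Australian scientist discovers star with telescope".split()

# Hand-written parse of the example sentence (hypothetical).
arcs = [
    ("scientist", "amod", "Australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
    ("discovers", "prep_with", "telescope"),
]

bow = bow_contexts(tokens, tokens.index("discovers"))
deps = dependency_contexts("discovers", arcs)
```

The window picks up "Australian" and misses "telescope", while the dependency contexts do the opposite. This is one intuition for why the two context types lead to different notions of similarity.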

[Figure: precision-recall curves; x-axis: Recall, y-axis: Precision]

“Dependency-based embeddings have more functional similarities”
