Machine Learning and AI via Brain simulations
Andrew Ng, Stanford University
Adam Coates Quoc Le Honglak Lee Andrew Saxe Andrew Maas Chris Manning Jiquan Ngiam Richard Socher Will Zou
Thanks to:
Google: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Andrea Frome, Rajat Monga, Marc’Aurelio Ranzato, Paul Tucker, Kay Le
This talk: Deep Learning
Using brain simulations:
• Make learning algorithms much better and easier to use.
• Make revolutionary advances in machine learning and AI.
Vision shared with many researchers:
E.g., Samy Bengio, Yoshua Bengio, Tom Dean, Jeff Dean, Nando de Freitas, Jeff Hawkins, Geoff Hinton, Quoc Le, Yann LeCun, Honglak Lee, Tommy Poggio, Marc’Aurelio Ranzato, Ruslan Salakhutdinov, Josh Tenenbaum, Kai Yu, Jason Weston, ….
I believe this is our best shot at progress towards real AI.
“It’s not who has the best algorithm that wins. It’s who has the most data.”
Unsupervised Learning
Large numbers of features are critical. The specific learning algorithm is important, but an algorithm that can scale to many features has a big advantage.
[Adam Coates]
Learning from Labeled Data
[Diagram: a model partitioned across multiple machines (model partitions), with many cores per machine; each partition trains on the shared training data]
Basic DistBelief Model Training
• Unsupervised or supervised objective
• Minibatch stochastic gradient descent (SGD)
• Model parameters sharded by partition
• 10s, 100s, or 1000s of cores per model
[Diagram: one model instance with its training data]
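To make these bullets concrete, here is a minimal sketch of one training step with the parameter matrix sharded across two simulated partitions (the toy sizes and the synthetic regression target are illustrative assumptions, not DistBelief's actual code):

```python
import numpy as np

# Toy "model parallelism": a linear model's weights are split column-wise
# across two simulated partitions; each partition computes its slice of the
# forward pass, and gradients are applied shard-locally.
rng = np.random.default_rng(0)
n_in, n_out, batch = 8, 4, 32
shards = [rng.normal(size=(n_in, n_out // 2)) for _ in range(2)]  # two weight shards

def forward(x):
    # Each "machine" computes the activations for its own output columns.
    return np.concatenate([x @ W for W in shards], axis=1)

for step in range(100):
    x = rng.normal(size=(batch, n_in))           # minibatch of inputs
    y = np.tanh(x @ np.ones((n_in, n_out)))      # synthetic regression target
    err = forward(x) - y                         # squared-error gradient signal
    for i, W in enumerate(shards):
        # The gradient for this shard touches only its slice of the error.
        g = x.T @ err[:, i * (n_out // 2):(i + 1) * (n_out // 2)] / batch
        shards[i] = W - 0.1 * g                  # local SGD update
```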
Parallelize across ~100 machines (~1600 cores).
But training is still slow with large data sets.
Add another dimension of parallelism, and have multiple model instances in parallel.
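Here is a minimal sketch of that second dimension of parallelism, in the spirit of DistBelief's asynchronous Downpour SGD (simulated sequentially here; the tiny least-squares model, shard count, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w_server = np.zeros(5)                     # central parameter server
X = rng.normal(size=(1000, 5))             # full data set...
y = X @ np.array([1., -2., 0.5, 3., -1.])  # ...with known true weights
data_shards = np.array_split(np.arange(1000), 4)  # one shard per model replica

for epoch in range(50):
    for shard in data_shards:              # each replica trains on its own shard
        w_local = w_server.copy()          # fetch current parameters
        Xi, yi = X[shard], y[shard]
        grad = Xi.T @ (Xi @ w_local - yi) / len(shard)  # least-squares gradient
        w_server -= 0.05 * grad            # push gradient; server applies update
# w_server now approximates the true weights
```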
50 thousand 32x32 images → 10 million parameters
10 million 200x200 images → 1 billion parameters
Training procedure
What features can we learn if we train a massive model on a massive amount of data? Can we learn a “grandmother cell”?
• Train on 10 million images (YouTube).
• 1000 machines (16,000 cores) for 1 week.
• Test on novel images.
Training set: YouTube. Test set: FITW + ImageNet.
The face neuron: top stimuli from the test set, and the optimal stimulus found by numerical optimization.
Le et al., Building high-level features using large-scale unsupervised learning. ICML 2012.
The cat neuron: top stimuli from the test set, and the average of the top stimuli from the test set.
Unsupervised Feature Learning Summary
• Deep learning and self-taught learning: let’s learn our features rather than manually design them.
• Discover the fundamental computational principles that underlie perception?
• Sparse coding and its deep versions are very successful on vision and audio tasks. Other variants exist for learning recursive representations.
• To get this to work for yourself, see the online tutorial: http://deeplearning.stanford.edu/wiki or go/brain
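For reference, here is the sparse coding objective mentioned above, in its standard form (after Olshausen & Field; the talk does not write out the formula). Given unlabeled inputs x^(i), we jointly learn basis vectors phi_j and sparse activations a^(i):

```latex
\min_{a,\,\phi} \;\; \sum_{i=1}^{m} \Big\| x^{(i)} - \sum_{j=1}^{k} a_j^{(i)} \phi_j \Big\|_2^2
\;+\; \lambda \sum_{i=1}^{m} \sum_{j=1}^{k} \big| a_j^{(i)} \big|
```

The L1 penalty drives most activations to zero, and the bases are typically constrained (e.g., ||phi_j|| <= 1) to rule out degenerate scalings.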
[Diagram: unlabeled images, plus labeled examples of cars and motorcycles]
Stanford: Adam Coates, Quoc Le, Honglak Lee, Andrew Saxe, Andrew Maas, Chris Manning, Jiquan Ngiam, Richard Socher, Will Zou
Google: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Andrea Frome, Rajat Monga, Marc’Aurelio Ranzato, Paul Tucker, Kay Le
Advanced Topics
Andrew Ng, Stanford University & Google
Language: Learning Recursive Representations
Feature representations of words
Imagine taking each word and, instead of a one-hot vector (e.g., 0 0 0 0 1 0 0 0), computing an n-dimensional feature vector for it. [Distributional representations; Bengio et al., 2003; Collobert & Weston, 2008.]
A 2-d embedding example is shown below, but in practice we use ~100-d embeddings.
[2-d embedding plot with axes x1 and x2: Monday (2,4), Tuesday (2.1,3.3), Britain (9,2), France (9.5,1.5), On (8,5)]
On Monday, Britain ….
Representation: (8,5) (2,4) (9,2)
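A minimal sketch of this lookup-table idea, using the toy 2-d vectors from the figure above (real systems learn ~100-d embeddings from data rather than hard-coding them):

```python
import numpy as np

# Toy embedding table with the 2-d vectors from the figure above.
embedding = {
    "on":      np.array([8.0, 5.0]),
    "monday":  np.array([2.0, 4.0]),
    "tuesday": np.array([2.1, 3.3]),
    "britain": np.array([9.0, 2.0]),
    "france":  np.array([9.5, 1.5]),
}

def represent(sentence):
    # A sentence becomes a sequence of per-word feature vectors.
    return [embedding[w.strip(",.").lower()] for w in sentence.split()]

print(represent("On Monday, Britain"))  # [(8,5), (2,4), (9,2)]
```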
“Generic” hierarchy on text doesn’t make sense
A node would have to represent the sentence fragment “cat sat on,” which doesn’t make sense.
[Diagram: a generic binary hierarchy over “The cat sat on the mat.”; the feature representation for each word is a 2-d vector: The (9,1), cat (5,3), sat (7,1), on (8,5), the (9,1), mat (4,3)]
What we want (illustration)
[Diagram: a parse tree over “The cat sat on the mat.” with phrase labels NP, VP, PP, and S; each word keeps its 2-d vector, and each internal node gets its own 2-d vector, e.g., (5,2) for “The cat”, (3,3) for “the mat”, (8,3) for “on the mat”, (5,4) for “sat on the mat”, and (7,3) for the sentence. This node’s job is to represent “on the mat.”]
What we want (illustration)
[2-d embedding plot: the word vectors for Monday, Tuesday, Britain, and France, together with recursively computed phrase vectors; “The country of my birth” and “The day after my birthday” map to nearby points in the same space]
Learning recursive representations
[Diagram: building the representation of “on the mat” from on (8,5), the (9,1), and mat (4,3): first “the mat” = (3,3), then “on the mat” = (8,3). This node’s job is to represent “on the mat.”]
Basic computational unit: a neural network that takes two candidate children’s representations as input, and outputs:
• Whether we should merge the two nodes.
• The semantic representation if the two nodes are merged.
[Diagram: the network takes (8,5) and (3,3) as input and outputs “Yes” plus the merged representation (8,3)]
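A minimal sketch of this computational unit (the composition and scoring weights are random here, whereas the real model of Socher, Manning & Ng learns them):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                                        # toy 2-d representations as in the slides
W = rng.normal(scale=0.5, size=(d, 2 * d))   # composition weights (learned in practice)
b = np.zeros(d)
v = rng.normal(scale=0.5, size=d)            # scoring vector: should we merge?

def merge(c1, c2):
    """Combine two children's vectors into a parent vector plus a merge score."""
    parent = np.tanh(W @ np.concatenate([c1, c2]) + b)
    score = float(v @ parent)                # higher score = better merge
    return parent, score

on, the_mat = np.array([8.0, 5.0]), np.array([3.0, 3.0])
parent, score = merge(on, the_mat)           # candidate node for "on the mat"
```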
Parsing a sentence
[Diagram: the network scores every pair of adjacent nodes in “The cat sat on the mat.”; it answers Yes to merging “The” and “cat” into (5,2) and “the” and “mat” into (3,3), and No to the other candidate pairs]
[Diagram: the process repeats on the reduced sequence; the network answers Yes to merging “on” with “the mat” into (8,3), and No to the remaining candidates]
[Socher, Manning & Ng]
[Diagram: the completed parse tree, with (5,2) for “The cat”, (3,3) for “the mat”, (8,3) for “on the mat”, (5,4) for “sat on the mat”, and (7,3) for the whole sentence]
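A minimal sketch of this greedy procedure, reusing the merge unit sketched above: repeatedly score every adjacent pair of nodes and merge the best-scoring pair until a single node spans the sentence.

```python
def greedy_parse(vectors):
    """Greedily merge adjacent nodes until one vector spans the whole input.

    `vectors` is a list of per-word vectors; `merge` is the scoring/composition
    unit sketched above. Returns the final sentence-level vector.
    """
    nodes = list(vectors)
    while len(nodes) > 1:
        # Score every adjacent pair, then merge the highest-scoring one.
        candidates = [merge(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, _ = candidates[best]
        nodes[best:best + 2] = [parent]
    return nodes[0]
```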
Finding Similar Sentences
• Each sentence has a feature vector representation.
• Pick a sentence (“center sentence”) and list nearest neighbor sentences.
• Often either semantically or syntactically similar. (Digits all mapped to 2.)
Similarities: Bad news
Center sentence: Both took further hits yesterday
Nearest neighbor sentences (most similar feature vector):
1. We’re in for a lot of turbulence ...
2. BSN currently has 2.2 million common shares outstanding
3. This is panic buying
4. We have a couple or three tough weeks coming

Similarities: Something said
Center sentence: I had calls all night long from the States, he said
Nearest neighbor sentences:
1. Our intent is to promote the best alternative, he says
2. We have sufficient cash flow to handle that, he said
3. Currently, average pay for machinists is 22.22 an hour, Boeing said
4. Profit from trading for its own account dropped, the securities firm said

Similarities: Gains and good news
Center sentence: Fujisawa gained 22 to 2,222
Nearest neighbor sentences:
1. Mochida advanced 22 to 2,222
2. Commerzbank gained 2 to 222.2
3. Paris loved her at first sight
4. Profits improved across Hess’s businesses

Similarities: Unknown words which are cities
Center sentence: Columbia, S.C
Nearest neighbor sentences:
1. Greenville, Miss
2. UNK, Md
3. UNK, Miss
4. UNK, Calif
Finding Similar Sentences
Similarities: Declining to comment = not disclosing
Center sentence: Hess declined to comment
Nearest neighbor sentences (most similar feature vector):
1. PaineWebber declined to comment
2. Phoenix declined to comment
3. Campeau declined to comment
4. Coastal wouldn’t disclose the terms

Similarities: Large changes in sales or revenue
Center sentence: Sales grew almost 2 % to 222.2 million from 222.2 million
Nearest neighbor sentences:
1. Sales surged 22 % to 222.22 billion yen from 222.22 billion
2. Revenue fell 2 % to 2.22 billion from 2.22 billion
3. Sales rose more than 2 % to 22.2 million from 22.2 million
4. Volume was 222.2 million shares, more than triple recent levels

Similarities: Negation of different types
Center sentence: There’s nothing unusual about business groups pushing for more government spending
Nearest neighbor sentences:
1. We don’t think at this point anything needs to be said
2. It therefore makes no sense for each market to adopt different circuit breakers
3. You can’t say the same with black and white
4. I don’t think anyone left the place UNK UNK

Similarities: People in bad situations
Center sentence: We were lucky
Nearest neighbor sentences:
1. It was chaotic
2. We were wrong
3. People had died
4. They still are
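A minimal sketch of how such neighbor lists could be produced once every sentence has a feature vector (cosine similarity is an assumed choice; the talk does not specify the metric):

```python
import numpy as np

def nearest_neighbors(center_idx, vectors, k=4):
    """Return indices of the k sentences whose vectors are most similar."""
    V = np.asarray(vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize rows
    sims = V @ V[center_idx]                           # cosine similarities
    order = np.argsort(-sims)                          # most similar first
    return [i for i in order if i != center_idx][:k]
```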
Application: Paraphrase Detection
• Task: Decide whether or not two sentences are paraphrases of each other. (MSR Paraphrase Corpus)
Method                                                       F1
Baseline                                                     79.9
Rus et al. (2008)                                            80.5
Mihalcea et al. (2006)                                       81.3
Islam et al. (2007)                                          81.3
Qiu et al. (2006)                                            81.6
Fernando & Stevenson (2008) (WordNet-based features)         82.4
Das et al. (2009)                                            82.7
Wan et al. (2006) (many features: POS, parsing, BLEU, etc.)  83.0
Stanford Feature Learning                                    83.4
Parsing sentences and parsing images
A small crowd quietly enters the historic church.
Each node in the hierarchy has a “feature vector” representation.
Nearest neighbor examples for image patches
• Each node (e.g., a set of merged superpixels) in the hierarchy has a feature vector.
• Select a node (“center patch”) and list nearest neighbor nodes.
• I.e., which image patches/superpixels get mapped to similar features?
Unsupervised Feature Learning
• Many choices in feature learning algorithms:
– Sparse coding, RBM, autoencoder, etc.
– Pre-processing steps (whitening)
– Number of features learned
– Various hyperparameters
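As a concrete example of the whitening pre-processing step listed above, here is a minimal ZCA whitening sketch (a common choice before sparse coding or autoencoders; the epsilon regularizer value is an assumption, not from the talk):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X: decorrelate features and equalize their variance."""
    X = X - X.mean(axis=0)                 # center each feature
    cov = (X.T @ X) / X.shape[0]           # feature covariance matrix
    U, S, _ = np.linalg.svd(cov)           # eigendecomposition of covariance
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W                           # whitened data, same shape as X
```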