Mastering the game of Go
with deep neural networks and tree search
Karel Ha
article by Google DeepMind
Spring School of Combinatorics 2016
Why AI?
Applications of AI
• spam filters
• recommender systems (Netflix, YouTube)
• predictive text (SwiftKey)
• audio recognition (Shazam, SoundHound)
• music generation (DeepHear - Composing and harmonizing music with neural networks)
• self-driving cars
Auto Reply Feature of Google Inbox
Corrado 2015
Artistic-style Painting
[1] Gatys, Ecker, and Bethge 2015 [2] Li and Wand 2016
Baby Names Generated Character by Character
• Baby Killiel Saddie Char Ahbort With
• Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy
• Marylen Hammine Janye Marlise Jacacrie Hendred Romand
Charienna Nenotto Ette Dorane Wallen Marly Darine Salina
Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille
Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha
Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen
Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn
Lusine Charyanne Sales Sanny Resa Wallon Martine Merus
Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne
Arnande Karella Roselina Alessia Chasty Deland Berther
Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen
Karpathy 2015
C code Generated Character by Character
Karpathy 2015
Algebraic Geometry Generated Character by Character
Karpathy 2015
DeepDrumpf
https://twitter.com/deepdrumpf
= a Twitter bot that has learned the language of Donald Trump from his speeches
Hayes 2016
Atari Player by Google DeepMind
https://youtu.be/0X-NdPtFKq0?t=21m13s
Mnih et al. 2015
Heads-up Limit Hold'em Poker Is Solved!
Cepheus http://poker.srv.ualberta.ca/
Bowling et al. 2015
Basics of Machine learning
Supervised versus Unsupervised Learning
Supervised learning:
• data set must be labelled
• e.g. which e-mail is regular/spam, which image is duck/face, ...
Unsupervised learning:
• data set is not labelled
• it can try to cluster the data into different groups
• e.g. grouping similar news, ...
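The contrast between the two settings can be sketched on toy data. The points, the labels, the nearest-centroid classifier and the two-cluster procedure below are all illustrative assumptions, not something taken from the talk.

```python
# Toy 1-D data: two groups of points around 1.0 and 5.0 (made-up values).
points = [0.8, 1.0, 1.3, 4.7, 5.0, 5.4]
labels = ["regular", "regular", "regular", "spam", "spam", "spam"]  # used only by the supervised part


def mean(values):
    return sum(values) / len(values)


def fit_centroids(xs, ys):
    """Supervised: use the labels to compute one centroid per class."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def two_means(xs, iterations=10):
    """Unsupervised: split xs into two clusters without looking at any labels."""
    lo, hi = min(xs), max(xs)  # initial cluster centres
    for _ in range(iterations):
        left = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        right = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(left), mean(right)
    return left, right


centroids = fit_centroids(points, labels)   # needs labels
clusters = two_means(points)                # needs no labels
```

The supervised part can name the class of a new point, while the unsupervised part only discovers that there are two groups and leaves their meaning open.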
Supervised Learning
1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server...
2. training on training set
3. testing on testing set
4. deployment
http://www.nickgillian.com/
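The four steps above can be sketched end to end on synthetic data. The data, the 150/50 split and the one-parameter threshold model are assumptions chosen only to keep the example short.

```python
import random

random.seed(0)

# 1. Data collection (synthetic): the label is 1 when the feature exceeds 0.5.
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(200))]

# Hold out part of the data so that testing uses examples never seen in training.
random.shuffle(data)
train_set, test_set = data[:150], data[150:]


def accuracy(threshold, examples):
    """Fraction of examples that the rule `x > threshold` labels correctly."""
    return sum(int(x > threshold) == y for x, y in examples) / len(examples)


# 2. Training: pick the threshold that does best on the training set.
candidates = [i / 100 for i in range(101)]
best_threshold = max(candidates, key=lambda t: accuracy(t, train_set))

# 3. Testing: measure accuracy on the held-out test set.
test_accuracy = accuracy(best_threshold, test_set)


# 4. Deployment: the trained model is just a function applied to new inputs.
def model(x):
    return int(x > best_threshold)
```

Keeping the test set separate is what makes the reported accuracy an estimate of performance on unseen data.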
Regression
Mathematical Regression
https://thermanuals.wordpress.com/descriptive-analysis/sampling-and-regression/
Classification
https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png
Underfitting and Overfitting
Beware of overfitting!
It is like learning for a math exam by memorizing proofs.
https://www.researchgate.net/post/How_to_Avoid_Overfitting
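The "memorizing proofs" analogy can be made concrete with a deliberately extreme sketch. The data values and the lookup-table "memorizer" are assumptions for illustration; real overfitting is usually less blatant.

```python
# Noisy samples of an assumed true rule y = 2x + 1 (made-up values).
train = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.0)]
test = [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0), (3.5, 8.0)]


def fit_line(points):
    """Least-squares straight line: a simple model that has to generalize."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorizer(points):
    """Overfit caricature: returns the stored answer, or 0.0 for unseen inputs."""
    table = dict(points)
    return lambda x: table.get(x, 0.0)


def mean_squared_error(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


line, memo = fit_line(train), memorizer(train)
```

The memorizer is perfect on the training set and useless on the test set, while the line makes small errors on both.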
Reinforcement Learning
In particular: games of self-play
https://youtu.be/0X-NdPtFKq0?t=16m57s
Monte Carlo Tree Search
Tree Search
Optimal value $v^*(s)$ determines the outcome of the game:
• from every board position or state $s$
• under perfect play by all players.
It is computed by recursively traversing a search tree containing
approximately $b^d$ possible sequences of moves, where
• $b$ is the game's breadth (number of legal moves per position)
• $d$ is its depth (game length)
Silver et al. 2016
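The recursion can be sketched on a game small enough to search exhaustively. The game below (remove 1, 2 or 3 stones, whoever takes the last stone wins) is an assumption chosen for illustration and is not part of the talk.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player removes 1, 2 or 3 stones; taking the last stone wins


@lru_cache(maxsize=None)
def optimal_value(stones):
    """v*(s) from the point of view of the player to move: +1 = win, -1 = loss."""
    if stones == 0:
        return -1  # the previous player took the last stone, so the player to move has lost
    # Negamax recursion: my value is the best of the negated values of the successor states.
    return max(-optimal_value(stones - m) for m in MOVES if m <= stones)


def count_sequences(stones):
    """Number of distinct move sequences, i.e. leaves of the full search tree."""
    if stones == 0:
        return 1
    return sum(count_sequences(stones - m) for m in MOVES if m <= stones)
```

Here the breadth is at most 3 and the depth at most the number of stones, so the tree stays tiny. For Go the same recursion is hopeless, which motivates the rest of the talk.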
Game tree of Go
Sizes of trees for various games:
• chess: b ≈ 35, d ≈ 80
• Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe!
That makes Go a googol times more complex than chess.
https://deepmind.com/alpha-go.html
How to handle the size of the game tree?
• for the breadth: a neural network to select moves
• for the depth: a neural network to evaluate the current position
• for the tree traversal: Monte Carlo tree search (MCTS)
Allis et al. 1994
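The tree sizes implied by these estimates can be checked with a few lines of arithmetic. Working with base-10 logarithms avoids printing numbers with hundreds of digits; the helper name is an illustrative choice.

```python
import math


def log10_tree_size(breadth, depth):
    """log10 of breadth**depth, the approximate number of move sequences."""
    return depth * math.log10(breadth)


chess = log10_tree_size(35, 80)    # about 123.5, i.e. roughly 10^123 sequences
go = log10_tree_size(250, 150)     # about 359.7, i.e. roughly 10^360 sequences
ratio = go - chess                 # about 236 orders of magnitude
```

With these values of b and d, the gap between Go and chess exceeds a googol (10^100), and both trees are far larger than the commonly quoted estimate of about 10^80 atoms in the observable universe.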
Monte Carlo tree search
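A minimal sketch of the four MCTS phases (selection, expansion, simulation, backpropagation) follows. It uses the common UCT selection rule with uniformly random playouts on an assumed toy game (remove 1 to 3 stones, taking the last stone wins). It is not AlphaGo's variant, which guides the search with neural networks.

```python
import math
import random

MOVES = (1, 2, 3)  # toy game: remove 1-3 stones, taking the last stone wins


class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left for the player to move at this node
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in MOVES if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who just moved INTO this node

    def uct_child(self, c=1.4):
        # Exploitation term plus exploration bonus for rarely visited children.
        return max(self.children, key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))


def rollout(stones):
    """Play random moves to the end; True if the player to move at the start wins."""
    starter_to_move = True
    starter_wins = False          # with no stones left, the player to move has already lost
    while stones > 0:
        stones -= random.choice([m for m in MOVES if m <= stones])
        if stones == 0:
            starter_wins = starter_to_move   # whoever just took the last stone wins
        starter_to_move = not starter_to_move
    return starter_wins


def mcts(root_stones, iterations=2000):
    root = Node(root_stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCT.
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: add one new child if any move is still untried.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the reached node.
        just_moved_wins = not rollout(node.stones)
        # 4. Backpropagation: update statistics, flipping the perspective at each level.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if just_moved_wins else 0.0
            just_moved_wins = not just_moved_wins
            node = node.parent
    # Play the most visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move
```

For example, `mcts(5)` is expected to settle on removing one stone, which leaves the opponent with a losing position of four stones.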
Neural networks
Neural Network: Inspiration
• inspired by the neuronal structure of the mammalian cerebral cortex
• but on much smaller scales
• suitable to model systems with a high tolerance to error
• e.g. audio or image recognition
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
Neural Network: Modes
Two modes
• feedforward for making predictions
• backpropagation for learning
Dieterle 2003
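The two modes can be sketched for a tiny network with one hidden layer. The architecture (sigmoid units, no bias terms, squared error) and all names are assumptions chosen for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feedforward(w_hidden, w_out, x):
    """Prediction mode: propagate the input x through one hidden layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def backpropagation(w_hidden, w_out, x, target):
    """Learning mode: gradients of the error 0.5 * (output - target)**2."""
    hidden, output = feedforward(w_hidden, w_out, x)
    delta_out = (output - target) * output * (1.0 - output)
    grad_out = [delta_out * h for h in hidden]
    # The error signal is passed backwards through the output weights (chain rule).
    grad_hidden = [[delta_out * w * h * (1.0 - h) * xi for xi in x]
                   for w, h in zip(w_out, hidden)]
    return grad_hidden, grad_out


def train_step(w_hidden, w_out, x, target, learning_rate=0.5):
    """One gradient-descent update of all weights; returns the new weights."""
    grad_hidden, grad_out = backpropagation(w_hidden, w_out, x, target)
    new_hidden = [[w - learning_rate * g for w, g in zip(row, grad_row)]
                  for row, grad_row in zip(w_hidden, grad_hidden)]
    new_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out
```

Feedforward alone is what a deployed network does. Backpropagation reuses the same forward pass and then distributes the output error over the weights.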
Neural Network: an example of feedforward
http://stevenmiller888.github.io/mind-how-to-build-a-neural-network/
Gradient Descent in Neural Networks
Motto: “Learn from mistakes!”
However, error functions are not necessarily convex, nor so “smooth”.
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
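Why non-convexity matters can be seen on a one-dimensional example. The function below is an arbitrary non-convex polynomial chosen for illustration, not an error function from the talk.

```python
def f(x):
    """A non-convex 'error function' with two separate local minima."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, learning_rate=0.01, steps=1000):
    """Repeatedly step against the gradient, starting from x."""
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


left = gradient_descent(-2.0)   # ends near x = -1.30, the global minimum
right = gradient_descent(2.0)   # ends near x = 1.13, a worse local minimum
```

Both runs follow the same rule, yet the starting point decides which minimum is found.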
Deep Neural Network: Inspiration
The hierarchy of concepts is captured in the number of layers (the deep in “deep learning”)
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
Convolutional Neural Network
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
Rules of Go
Classic games (1/2)
Backgammon: Man vs. Fate
Chess: Man vs. Man
Classic games (2/2)
Go: Man vs. Self
Robert Samal (White) versus Karel Kral (Black), Spring School of Combinatorics 2016
Rules of Go
Black versus White. Black starts the game.
the rule of liberty
the “ko” rule
Handicap for difference in ranks: Black can place 1 or more stones in advance (compensation for White’s greater strength).
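The rule of liberty says that a connected group of stones stays on the board only while it touches at least one empty point (a liberty). A minimal sketch of counting liberties by flood fill follows; the board encoding (`B`, `W`, `.`) and the function name are assumptions.

```python
def group_and_liberties(board, row, col):
    """Return (stones of the connected group at (row, col), set of its liberties)."""
    colour = board[row][col]
    assert colour in "BW", "there must be a stone at the given point"
    size = len(board)
    group, liberties, frontier = {(row, col)}, set(), [(row, col)]
    while frontier:
        r, c = frontier.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < size and 0 <= nc < size):
                continue                      # off the board
            if board[nr][nc] == ".":
                liberties.add((nr, nc))       # adjacent empty point
            elif board[nr][nc] == colour and (nr, nc) not in group:
                group.add((nr, nc))           # same-coloured neighbour joins the group
                frontier.append((nr, nc))
    return group, liberties


# Hypothetical 5x5 position: a two-stone White group pressed by three Black stones.
board = [
    ".B...",
    "BWB..",
    ".W...",
    ".....",
    ".....",
]
```

A group whose liberty set becomes empty is captured and removed from the board.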
Scoring Rules: Area Scoring
A player’s score is:
• the number of stones that the player has on the board
• plus the number of empty intersections surrounded by that player’s stones
• plus komi (komidashi) points for the White player, which compensate for the Black player's first-move advantage
https://en.wikipedia.org/wiki/Go_(game)
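Area scoring as described above can be sketched with a flood fill over empty regions. The board encoding, the komi value of 7.5 and the function name are assumptions, and the sketch ignores dead stones, which real scoring removes first.

```python
def area_scores(board, komi=7.5):
    """Stones on the board plus empty regions touching one colour only, plus komi for White."""
    size = len(board)
    scores = {"B": 0.0, "W": float(komi)}
    seen = set()
    for r in range(size):
        for c in range(size):
            point = board[r][c]
            if point in scores:
                scores[point] += 1            # a stone on the board
            elif (r, c) not in seen:
                # Flood-fill one empty region and record which colours border it.
                region, borders, frontier = {(r, c)}, set(), [(r, c)]
                while frontier:
                    cr, cc = frontier.pop()
                    for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                        if 0 <= nr < size and 0 <= nc < size:
                            if board[nr][nc] == "." and (nr, nc) not in region:
                                region.add((nr, nc))
                                frontier.append((nr, nc))
                            elif board[nr][nc] != ".":
                                borders.add(board[nr][nc])
                seen |= region
                if len(borders) == 1:         # surrounded by a single colour
                    scores[borders.pop()] += len(region)
    return scores


# Hypothetical final position: Black wall on the left, White wall on the right.
board = [
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
]
scores = area_scores(board)
```

In this position each side has 5 stones and 5 surrounded points, the middle column is neutral, and komi decides the game for White.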
Ranks of Players
Kyu and Dan ranks
or alternatively, Elo ratings
https://en.wikipedia.org/wiki/Go_(game)
Chocolate micro-break
AlphaGo: Inside Out
Policy and Value Networks
Silver et al. 2016
Training the (Deep Convolutional) Neural Networks
Silver et al. 2016
SL Policy Networks (1/3)
• 13-layer deep convolutional neural network
• goal: to predict expert human moves
• task of classification
• trained from 30 million positions from the KGS Go Server
• stochastic gradient ascent:
$$\Delta\sigma \propto \frac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}$$
(to maximize the likelihood of the human move $a$ selected in state $s$)
Results:
• 44.4% accuracy (the state-of-the-art from other groups)
• 55.7% accuracy (raw board position + move history as input)
• 57.0% accuracy (all input features)
Silver et al. 2016
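The gradient-ascent update can be sketched for a much simpler policy than the 13-layer network. Here the parameters σ are assumed to be one logit per move in a single fixed position, and the expert moves are made-up data. For a softmax policy the gradient of log p(a|s) with respect to logit k is 1[k = a] − p(k|s).

```python
import math


def softmax(logits):
    m = max(logits)                              # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sl_update(logits, expert_move, learning_rate=0.1):
    """One stochastic gradient ASCENT step on log p_sigma(a | s)."""
    probs = softmax(logits)
    # d log p(a|s) / d logit_k = 1[k == a] - p(k|s)
    return [v + learning_rate * ((1.0 if k == expert_move else 0.0) - p)
            for k, (v, p) in enumerate(zip(logits, probs))]


# Hypothetical data: in one fixed position, experts played move 2 most often.
expert_moves = [2, 2, 1, 2, 2, 0, 2, 2]
logits = [0.0, 0.0, 0.0, 0.0]                    # sigma: one logit per candidate move
for _ in range(200):
    for a in expert_moves:
        logits = sl_update(logits, a)
policy = softmax(logits)
```

After training, the policy puts most of its probability on the move the experts preferred and very little on the move they never played.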
SL Policy Networks (2/3)
Small improvements in accuracy led to large improvements
in playing strength (see the next slide)
Silver et al. 2016
SL Policy Networks (3/3)
Move probabilities taken directly from the SL policy network $p_\sigma$ (reported as a percentage if above 0.1%).
Silver et al. 2016
Rollout Policy
• Rollout policy $p_\pi(a \mid s)$ is faster but less accurate than the SL policy network.
• accuracy of 24.2%
• It takes 2 µs to select an action, compared to 3 ms for the SL policy network.
Silver et al. 2016
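The two timings translate into a large difference in throughput, which a few lines of arithmetic make explicit. The variable names are illustrative.

```python
ROLLOUT_SECONDS = 2e-6      # 2 µs per action for the rollout policy
SL_NETWORK_SECONDS = 3e-3   # 3 ms per action for the SL policy network

speedup = SL_NETWORK_SECONDS / ROLLOUT_SECONDS      # about 1500 times faster
rollout_moves_per_second = 1 / ROLLOUT_SECONDS      # about 500,000
network_moves_per_second = 1 / SL_NETWORK_SECONDS   # about 333
```

This is why the less accurate rollout policy is still useful: it can play out many complete games in the time the network needs for a handful of moves.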
RL Policy Networks (1/2)
• identical in structure to the SL policy network
• goal: to win in the games of self-play
• task of classification
• weights ρ initialized to the same values, ρ := σ
• games of self-play
  • between the current RL policy network and a randomly selected previous iteration
  • to prevent overfitting to the current policy
• stochastic gradient ascent:
$$\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} z_t$$
at time step $t$, where reward function $z_t$ is +1 for winning and −1 for losing.
Silver et al. 2016
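The update differs from the supervised one only by the factor $z_t$, which reinforces moves from won games and discourages moves from lost ones. The sketch below replaces self-play with an assumed toy environment in which each move wins with a fixed probability. The parameters ρ are one logit per move, and all names are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, z, learning_rate=0.05):
    """One step of: delta rho proportional to d log p_rho(a_t | s_t) / d rho * z_t."""
    probs = softmax(logits)
    return [v + learning_rate * z * ((1.0 if k == action else 0.0) - p)
            for k, (v, p) in enumerate(zip(logits, probs))]


# Toy stand-in for self-play (an assumption): each move wins with a fixed probability.
WIN_PROBABILITY = [0.2, 0.8, 0.5]

random.seed(0)
rho = [0.0, 0.0, 0.0]
for _ in range(5000):
    probs = softmax(rho)
    action = random.choices(range(len(rho)), weights=probs)[0]   # sample a_t from p_rho(. | s_t)
    z = 1 if random.random() < WIN_PROBABILITY[action] else -1   # +1 for a win, -1 for a loss
    rho = reinforce_update(rho, action, z)
policy = softmax(rho)
```

No expert labels are involved: the policy shifts towards the move that wins most often purely from the sign of the outcome.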
RL Policy Networks (2/2)
Results (by sampling each move $a_t \sim p_\rho(\cdot \mid s_t)$):
• 80% win rate against the SL policy network
• 85% win rate against the strongest open-source Go program, Pachi (Baudis and Gailly 2011)
• The previous state of the art, based only on SL of CNNs: 11% “win” rate against Pachi
Silver et al. 2016 40
Training the (Deep Convolutional) Neural Networks
Silver et al. 2016 41
Value Network (1/2)
• similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
• goal: to estimate a value function
  $$v^p(s) = \mathbb{E}\left[ z_t \mid s_t = s,\ a_{t \ldots T} \sim p \right]$$
  that predicts the outcome from position $s$ (of games played by using policy $p_\rho$)
• Specifically, $v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$.
• task of regression
• stochastic gradient descent:
  $$\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta} \, (z - v_\theta(s))$$
  (to minimize the mean squared error (MSE) between the predicted $v_\theta(s)$ and the true $z$)
Silver et al. 2016 42
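The regression update can be tried on a toy value function. This is a minimal sketch under loud assumptions: instead of a convolutional network it uses v_θ(s) = tanh(θ · features(s)), and the feature vector, the learning rate `lr` and the function names are all hypothetical.

```python
import math

def value(theta, features):
    """Toy value function: tanh(theta . features), always in (-1, 1)."""
    return math.tanh(sum(t * x for t, x in zip(theta, features)))

def sgd_step(theta, features, z, lr=0.1):
    """theta += lr * (z - v) * dv/dtheta, a descent step on (z - v)^2 / 2."""
    v = value(theta, features)
    dv_dscore = 1.0 - v * v  # derivative of tanh at the current score
    return [t + lr * (z - v) * dv_dscore * x for t, x in zip(theta, features)]

theta = [0.0, 0.0]
features = [1.0, -0.5]   # made-up encoding of a single position
z = 1.0                  # the game continued from this position was won

before = (z - value(theta, features)) ** 2
for _ in range(50):
    theta = sgd_step(theta, features, z)
after = (z - value(theta, features)) ** 2
```

After the updates the squared error `after` is smaller than `before`, which is the direction the proportionality on the slide describes.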
Value Network (2/2)
Beware of overfitting!
• Successive positions are strongly correlated.
• The value network memorized the game outcomes rather than generalizing to new positions.
• Solution: generate 30 million (new) positions, each sampled from a separate game.
• Result: almost the accuracy of Monte Carlo rollouts (using $p_\rho$), but with 15,000 times less computation!
Silver et al. 2016 43
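The decorrelation idea (one training position per game) can be sketched as follows. This is a simplification: it assumes the games are already available as lists of positions with a final outcome, and it only shows the sampling step. The function name and the toy data are hypothetical.

```python
import random

def sample_training_positions(games, rng):
    """Pick exactly one (position, outcome) pair from each game."""
    samples = []
    for positions, outcome in games:
        samples.append((rng.choice(positions), outcome))
    return samples

# Made-up data: each game is (list of position ids, outcome z).
games = [
    (["g0_p0", "g0_p1", "g0_p2"], +1),
    (["g1_p0", "g1_p1"], -1),
]
samples = sample_training_positions(games, random.Random(0))
```

Because no two samples come from the same game, consecutive, nearly identical positions never share a label in the training set.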
Selection of Moves by the Value Network
evaluation of all successors $s'$ of the root position $s$, using $v_\theta(s')$
Silver et al. 2016 44
Evaluation accuracy in various stages of a game
Move number is the number of moves that had been played in the given position.
Each position was evaluated by either:
• a single forward pass of the value network $v_\theta$, or
• the mean outcome of 100 rollouts, played out using the corresponding policy
Silver et al. 2016 45
Training the (Deep Convolutional) Neural Networks
Silver et al. 2016 46
Elo Ratings for Various Combinations of Networks
Silver et al. 2016 47
MCTS Algorithm
The next action is selected by lookahead search, using simulation:
1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at end of simulation)
Each edge $(s, a)$ keeps:
• action value $Q(s, a)$
• visit count $N(s, a)$
• prior probability $P(s, a)$ (from SL policy network $p_\sigma$)
The tree is traversed by simulation (descending the tree) from the
root state.
Silver et al. 2016 48
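One possible way to hold the per-edge statistics is a small record type. The sketch below is only an illustration; the class name `Edge`, its field names, the move labels and the prior values are assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics stored on one tree edge (s, a)."""
    prior: float                # P(s, a), from the policy network
    visit_count: int = 0        # N(s, a)
    action_value: float = 0.0   # Q(s, a)

# A node maps each legal action to its edge; the priors here are made up.
root = {"D4": Edge(prior=0.6), "Q16": Edge(prior=0.4)}
```

New edges start with zero visits and zero action value, so early in the search only the prior distinguishes them.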
MCTS Algorithm: Selection
At each time step $t$, an action $a_t$ is selected from state $s_t$:
$$a_t = \arg\max_a \left( Q(s_t, a) + u(s_t, a) \right)$$
where the bonus is
$$u(s_t, a) \propto \frac{P(s_t, a)}{1 + N(s_t, a)}$$
Silver et al. 2016 49
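The selection rule can be written in a few lines. The slide only gives the bonus up to proportionality, so the constant `c` below is an assumption, as are the function name and the toy statistics.

```python
def select_action(edges, c=1.0):
    """edges maps action -> (Q, N, P); return argmax of Q + c * P / (1 + N)."""
    def score(item):
        _, (q, n, p) = item
        return q + c * p / (1 + n)
    return max(edges.items(), key=score)[0]

edges = {
    "A": (0.50, 10, 0.2),  # well explored, decent action value
    "B": (0.00, 0, 0.6),   # never visited, but a strong prior
    "C": (0.10, 3, 0.2),
}
chosen = select_action(edges)                 # bonus included
greedy = select_action(edges, c=0.0)          # action value only
```

With these numbers the bonus makes the unvisited move "B" win the argmax, while the purely greedy choice is "A". The bonus decays as the visit count grows, so the search gradually shifts from the prior towards the measured action values.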
MCTS Algorithm: Expansion
A leaf position may be expanded (just once) by the SL policy network $p_\sigma$.
The output probabilities are stored as priors $P(s, a) := p_\sigma(a \mid s)$.
Silver et al. 2016 50
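A minimal sketch of the expansion step, assuming the tree is a dictionary keyed by state and the policy is any callable returning move probabilities. `toy_policy` is a hypothetical stand-in for the SL policy network, and its output is made up.

```python
def expand(tree, state, policy):
    """Add `state` to the tree once, storing priors P(s, a) = policy(state)[a]."""
    if state in tree:  # already expanded: leave the statistics untouched
        return False
    tree[state] = {a: {"P": p, "N": 0, "Q": 0.0}
                   for a, p in policy(state).items()}
    return True

def toy_policy(state):
    """Hypothetical stand-in for p_sigma; returns fixed probabilities."""
    return {"a1": 0.7, "a2": 0.3}

tree = {}
first = expand(tree, "leaf", toy_policy)   # creates the edges
second = expand(tree, "leaf", toy_policy)  # no effect on an expanded node
```

The guard at the top is what makes the expansion happen "just once" per position.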
MCTS: Evaluation
• evaluation from the value network $v_\theta(s)$
• evaluation by the outcome $z$ of a rollout played with the fast rollout policy $p_\pi$ until the end of the game
Using a mixing parameter $\lambda$, the final leaf evaluation $V(s)$ is
$$V(s) = (1 - \lambda)\, v_\theta(s) + \lambda z$$
Silver et al. 2016 51
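The mixing formula is a plain convex combination, sketched below. The default `lam=0.5` and the input numbers are chosen only for illustration.

```python
def leaf_value(v_theta, z, lam=0.5):
    """V(s) = (1 - lam) * v_theta(s) + lam * z."""
    return (1.0 - lam) * v_theta + lam * z

mixed = leaf_value(v_theta=0.2, z=1.0)             # equal weight on both
network_only = leaf_value(0.2, 1.0, lam=0.0)       # ignores the rollout
rollout_only = leaf_value(0.2, 1.0, lam=1.0)       # ignores the value network
```

Setting λ to 0 or 1 recovers the two pure evaluators, which correspond to the "value network only" and "rollouts only" variants.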
Tree Evaluation from Value Network
action values Q(s, a) for each tree-edge (s, a) from root position s (averaged over value network evaluations only)
Silver et al. 2016 52
Tree Evaluation from Rollouts
action values Q(s, a), averaged over rollout evaluations only
Silver et al. 2016 53
MCTS: Backup
At the end of simulation, each traversed edge is updated by accumulating:
• the action values $Q$
• visit counts $N$
Once the search is complete, the algorithm
chooses the most visited move from the root
position.
Silver et al. 2016 54
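The backup and the final move choice can be sketched as below. The code assumes that Q is kept as the running mean of the leaf evaluations passed through an edge, and it ignores the alternation of players along the path. The function names and the toy values are hypothetical.

```python
def backup(path, leaf_value):
    """Update every edge traversed in one simulation with the leaf value."""
    for edge in path:
        edge["N"] += 1
        # Incremental mean: Q stays equal to the average of all values seen.
        edge["Q"] += (leaf_value - edge["Q"]) / edge["N"]

def most_visited(root_edges):
    """Final move choice: the action with the largest visit count."""
    return max(root_edges, key=lambda a: root_edges[a]["N"])

root = {"a1": {"N": 0, "Q": 0.0}, "a2": {"N": 0, "Q": 0.0}}
for v in (1.0, 0.0, 1.0):          # three simulations through a1
    backup([root["a1"]], v)
backup([root["a2"]], 1.0)          # one simulation through a2
best = most_visited(root)
```

Here "a2" ends with the higher action value but only one visit, so the visit-count criterion still picks "a1".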
Percentage of Simulations
percentage frequency with which actions were selected from the root during simulations
Silver et al. 2016 55
Principal Variation (Path with Maximum Visit Count)
The moves are presented in a numbered sequence.
• AlphaGo selected the move indicated by the red circle;
• Fan Hui responded with the move indicated by the white square;
• in his post-game commentary, he preferred the move (labelled 1) predicted by AlphaGo.
Silver et al. 2016 56
Scalability
• asynchronous multi-threaded search
• simulations on CPUs
• computation of neural networks on GPUs
AlphaGo:
• 40 search threads
• 48 CPUs
• 8 GPUs
Distributed version of AlphaGo (on multiple machines):
• 40 search threads
• 1202 CPUs
• 176 GPUs
Silver et al. 2016 57
Elo Ratings for Various Combinations of Threads
Silver et al. 2016 58
Results: the strength of AlphaGo
Tournament with Other Go Programs
Silver et al. 2016 59
Fan Hui
• professional 2 dan
• European Go Champion in 2013, 2014 and 2015
• European Professional Go Champion in 2016
• biological neural network:
  • 100 billion neurons
  • 100 to 1,000 trillion neuronal connections
https://en.wikipedia.org/wiki/Fan_Hui 60
AlphaGo versus Fan Hui
AlphaGo won 5-0 in a formal match in October 2015.
“[AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person.”
Fan Hui
Lee Sedol “The Strong Stone”
• professional 9 dan
• the 2nd in international titles
• the 5th youngest (12 years 4 months) to become a professional Go player in South Korean history
• Lee Sedol would win 97 out of 100 games against Fan Hui.
• biological neural network, comparable to Fan Hui’s (in number of neurons and connections)
https://en.wikipedia.org/wiki/Lee_Sedol 62
“I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time.”
Lee Sedol
“...even beating AlphaGo by 4-1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind.”
interview in JTBC Newsroom
AlphaGo versus Lee Sedol
In March 2016 AlphaGo won 4-1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won
by resignation.
The winner of the match was slated to win $1 million.
Since AlphaGo won, Google DeepMind stated that the prize would be
donated to charities, including UNICEF, and to Go organisations.
Lee received $170,000 ($150,000 for participating in all five
games, and an additional $20,000 for each game won).
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 63
Conclusion
Difficulties of Go
• challenging decision-making
• intractable search space
• complex optimal solution
The optimal solution is so complex that it appears infeasible to approximate it directly using a policy or value function!
Silver et al. 2016 64
AlphaGo: summary
• Monte Carlo tree search
• effective move selection and position evaluation
  • through deep convolutional neural networks
  • trained by a novel combination of supervised and reinforcement learning
• new search algorithm combining
  • neural network evaluation
  • Monte Carlo rollouts
• scalable implementation
  • multi-threaded simulations on CPUs
  • parallel GPU computations
  • distributed version over multiple machines
Silver et al. 2016 65
AlphaGo: summary
� Monte Carlo tree search
� effective move selection and position evaluation
� through deep convolutional neural networks
� trained by novel combination of supervised and reinforcement
learning
� new search algorithm combining
� neural network evaluation
� Monte Carlo rollouts
� scalable implementation
� multi-threaded simulations on CPUs
� parallel GPU computations
� distributed version over multiple machines
Silver et al. 2016 65
AlphaGo: summary
� Monte Carlo tree search
� effective move selection and position evaluation
� through deep convolutional neural networks
� trained by novel combination of supervised and reinforcement
learning
� new search algorithm combining
� neural network evaluation
� Monte Carlo rollouts
� scalable implementation
� multi-threaded simulations on CPUs
� parallel GPU computations
� distributed version over multiple machines
Silver et al. 2016 65
AlphaGo: summary
� Monte Carlo tree search
� effective move selection and position evaluation
� through deep convolutional neural networks
� trained by novel combination of supervised and reinforcement
learning
� new search algorithm combining
� neural network evaluation
� Monte Carlo rollouts
� scalable implementation
� multi-threaded simulations on CPUs
� parallel GPU computations
� distributed version over multiple machines
Silver et al. 2016 65
AlphaGo: summary
� Monte Carlo tree search
� effective move selection and position evaluation
� through deep convolutional neural networks
� trained by novel combination of supervised and reinforcement
learning
� new search algorithm combining
� neural network evaluation
� Monte Carlo rollouts
� scalable implementation
� multi-threaded simulations on CPUs
� parallel GPU computations
� distributed version over multiple machines
Silver et al. 2016 65
Novel approach
During the match against Fan Hui, AlphaGo evaluated thousands
of times fewer positions than Deep Blue did against Kasparov.
It compensated for this by:
• selecting those positions more intelligently (policy network)
• evaluating them more precisely (value network)
Deep Blue relied on a handcrafted evaluation function.
AlphaGo was trained directly and automatically from gameplay.
It used general-purpose learning.
This approach is not specific to the game of Go. The algorithm
can be used for a much wider class of (so far seemingly)
intractable problems in AI!
Silver et al. 2016 66
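How a policy network can make the search select positions "more intelligently" is sketched below. The sketch assumes the selection rule described in Silver et al. 2016: pick the move maximising Q(s, a) + u(s, a), where the bonus u(s, a) is proportional to the prior P(s, a) from the policy network and decays with the visit count N(s, a). The Edge class, select_move function and the default constant c_puct are hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    prior: float               # P(s, a): policy-network probability of the move
    visits: int = 0            # N(s, a): number of simulations through this edge
    total_value: float = 0.0   # sum of leaf evaluations backed up through it

    @property
    def q(self) -> float:
        """Mean action value Q(s, a); 0 for an unvisited edge."""
        return self.total_value / self.visits if self.visits else 0.0


def select_move(edges: Dict[str, Edge], c_puct: float = 5.0) -> str:
    """Return the move with the highest Q + u score."""
    total_visits = sum(e.visits for e in edges.values())

    def score(e: Edge) -> float:
        # Exploration bonus: large for high-prior, rarely visited moves.
        u = c_puct * e.prior * math.sqrt(total_visits) / (1 + e.visits)
        return e.q + u

    return max(edges, key=lambda move: score(edges[move]))


# Toy example (made-up numbers).
edges = {
    "A": Edge(prior=0.6, visits=10, total_value=5.0),
    "B": Edge(prior=0.3, visits=2, total_value=1.2),
    "C": Edge(prior=0.1, visits=0),
}
best = select_move(edges)
```

In this toy position the search prefers B: it has a decent mean value and few visits, so its bonus is still large, whereas A has already been explored heavily and C has a low prior.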
Thank you!
Questions?
66
Backup slides
Input features for rollout and tree policy
Silver et al. 2016
Results of a tournament between different Go programs
Silver et al. 2016
Results of a tournament between AlphaGo and distributed AlphaGo,
testing scalability with hardware
Silver et al. 2016
AlphaGo versus Fan Hui: Game 1
Silver et al. 2016
AlphaGo versus Fan Hui: Game 2
Silver et al. 2016
AlphaGo versus Fan Hui: Game 3
Silver et al. 2016
AlphaGo versus Fan Hui: Game 4
Silver et al. 2016
AlphaGo versus Fan Hui: Game 5
Silver et al. 2016
AlphaGo versus Lee Sedol: Game 1
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 2 (1/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 2 (2/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 3
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 4
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 5 (1/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 5 (2/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
Further Reading I
AlphaGo:
• Google Research Blog
http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html
• an article in Nature
http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234
• a reddit article claiming that AlphaGo is even stronger than it appears to be:
“AlphaGo would rather win by less points, but with higher probability.”
https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/
Articles by Google DeepMind:
• Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih
et al. 2015)
• Neural Turing Machines (Graves, Wayne, and Danihelka 2014)
Artificial Intelligence:
• Artificial Intelligence course at MIT
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm
• Introduction to Artificial Intelligence at Udacity
https://www.udacity.com/course/intro-to-artificial-intelligence--cs271
Further Reading II
• General Game Playing course https://www.coursera.org/course/ggp
• Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2
• The Singularity Is Near (Kurzweil 2005)
Combinatorial Game Theory (founded by John H. Conway to study endgames in Go):
• Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory
• On Numbers and Games (Conway 1976)
Machine Learning:
• Machine Learning course
https://www.coursera.org/learn/machine-learning/
• Reinforcement Learning http://reinforcementlearning.ai-depot.com/
• Deep Learning (LeCun, Bengio, and Hinton 2015)
• Deep Learning course https://www.udacity.com/course/deep-learning--ud730
• Two Minute Papers https://www.youtube.com/user/keeroyz
• Applications of Deep Learning https://youtu.be/hPKJBXkyTKM
Neuroscience:
• http://www.brainfacts.org/
References I
Allis, Louis Victor et al. (1994). Searching for solutions in games and artificial intelligence. Ponsen & Looijen.
Baudis, Petr and Jean-loup Gailly (2011). “Pachi: State of the art open source Go program”. In: Advances in
Computer Games. Springer, pp. 24–38.
Bowling, Michael et al. (2015). “Heads-up limit hold’em poker is solved”. In: Science 347.6218, pp. 145–149. url:
http://poker.cs.ualberta.ca/15science.html.
Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6.
Corrado, Greg (2015). Computer, respond to this email. url:
http://googleresearch.blogspot.cz/2015/11/computer-respond-to-this-email.html#1 (visited on
03/31/2016).
Dieterle, Frank Jochen (2003). “Multianalyte quantifications by means of integration of artificial neural networks,
genetic algorithms and chemometrics for time-resolved analytical data”. PhD thesis. Universität Tübingen.
Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge (2015). “A Neural Algorithm of Artistic Style”. In:
CoRR abs/1508.06576. url: http://arxiv.org/abs/1508.06576.
Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural turing machines”. In: arXiv preprint
arXiv:1410.5401.
Hayes, Bradley (2016). url: https://twitter.com/deepdrumpf.
References II
Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. url:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (visited on 04/01/2016).
Kurzweil, Ray (2005). The singularity is near: When humans transcend biology. Penguin.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444.
Li, Chuan and Michael Wand (2016). “Combining Markov Random Fields and Convolutional Neural Networks for
Image Synthesis”. In: CoRR abs/1601.04589. url: http://arxiv.org/abs/1601.04589.
Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518.7540,
pp. 529–533. url:
https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.
Munroe, Randall. Game AIs. url: https://xkcd.com/1002/ (visited on 04/02/2016).
Silver, David et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature
529.7587, pp. 484–489.
Sun, Felix. DeepHear - Composing and harmonizing music with neural networks. url:
http://web.mit.edu/felixsun/www/neural-music.html (visited on 04/02/2016).