Transcript
Page 1

Temporal-difference search in computer Go

David Silver · Richard S. Sutton · Martin Müller

March 3, 2014

Page 2

Three Major Sections

Section 3: Shape features (almost exact duplicate of a previous paper from 2004)

Section 2: TD-Search (Simulation-based search methods for game play)

Section 4: Dyna-2 algorithm (Introducing the concept of long and short-term memory in simulated search)

Page 3

AI Game Playing

In AI the most successful game-playing strategies have followed these steps:

1st: Positions are evaluated by a linear combination of many features; the position is broken down into small local components.
2nd: Weights are trained by TD learning and self-play.
Finally: The linear evaluation function is combined with a suitable search algorithm to produce a high-performing game player.

Page 4

Sec 3.0: TD Learning with Local Shape Features

Approach: Shapes are an important part of Go. Master-level players think in terms of general shapes.

Objective is to win the game of Go. Rewards: r = 1 if Black wins and r = 0 if White wins.

The state s assigns one of three values to each intersection of the board: empty, black, or white.

Local shape features li are taken from all possible configurations of all square regions up to 3x3. φi(s) = 1 if the board matches local shape li, and 0 otherwise.

The value function V π(s) is the expected total reward from state s when following policy π (the probability of winning).

The value function is approximated by a logistic-linear combination of shape features; this is model-free learning since we learn the value function directly: V(s) = σ(φ(s) · θLI + φ(s) · θLD), Black's winning probability from state s (location-independent and location-dependent weight vectors). A minimal sketch of these features and the logistic value function follows below.

Use a two-ply update: the TD(0) error is calculated after both the player and the opponent have made a move, i.e. against V(st+2).

Using self-play: agent and opponent use the same policy π(s, a).
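
A minimal sketch (not the authors' implementation) of what binary local shape features and the logistic-linear value function could look like; the board encoding, feature indexing, and the restriction to 1x1 templates are illustrative assumptions.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2
BOARD_SIZE = 9                                  # illustrative board size
NUM_FEATURES = BOARD_SIZE * BOARD_SIZE * 3      # one indicator per (point, colour)

def local_shape_features(board):
    """Binary feature vector phi(s): phi_i(s) = 1 if the board matches shape l_i.

    For brevity only 1x1 shapes (the colour of each intersection) are
    enumerated; 2x2 and 3x3 square templates would be indexed the same way,
    with one indicator per template location and configuration.
    """
    phi = np.zeros(NUM_FEATURES)
    for x in range(BOARD_SIZE):
        for y in range(BOARD_SIZE):
            colour = board[x, y]                        # EMPTY, BLACK or WHITE
            phi[(x * BOARD_SIZE + y) * 3 + colour] = 1.0
    return phi

def value(board, theta):
    """V(s) = sigma(phi(s) . theta): Black's estimated winning probability."""
    z = local_shape_features(board).dot(theta)
    return 1.0 / (1.0 + np.exp(-z))                     # logistic (sigmoid) function
```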

Page 5

Page 6

Page 7

Sec 3.2.1: Training Procedure

Objective: Train the weights of the value function V(s) and update them using logistic TD(0). A minimal sketch of this training loop follows the list below.

Initialize all weights to zero and run a million games of self-play to find the weights of the value function V(s).

Black and White select moves using an ε-greedy policy over the same value function.

Using self-play: agent and opponent use the same ε-greedy policy π(s, a).

Terminate when both players pass
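
A sketch of the self-play training loop under stated assumptions: it reuses local_shape_features, value, NUM_FEATURES, and BLACK from the earlier sketch, and assumes a hypothetical Go environment object (legal_moves, peek, play, state, to_play, both_passed, black_won). The two-ply logistic TD(0) update follows the description above, but the exact update schedule in the paper may differ (e.g. updating online during the game rather than after it).

```python
import numpy as np

def epsilon_greedy_move(env, theta, epsilon=0.1):
    """Pick a legal move: random with probability epsilon, otherwise the move
    whose successor position looks best for the player to move."""
    moves = env.legal_moves()
    if np.random.rand() < epsilon:
        return moves[np.random.randint(len(moves))]
    values = [value(env.peek(m), theta) for m in moves]   # V(s') after each move
    best = np.argmax(values) if env.to_play == BLACK else np.argmin(values)
    return moves[best]

def train_self_play(make_env, n_games=1_000_000, alpha=0.1, epsilon=0.1):
    """Self-play training with two-ply logistic TD(0), as described above."""
    theta = np.zeros(NUM_FEATURES)                # initialize all weights to zero
    for _ in range(n_games):
        env = make_env()
        history = [env.state()]
        while not env.both_passed():              # terminate when both players pass
            env.play(epsilon_greedy_move(env, theta, epsilon))
            history.append(env.state())
        reward = 1.0 if env.black_won() else 0.0  # r = 1 if Black wins, else 0
        for t, s_t in enumerate(history):
            # Two-ply target: V(s_{t+2}) (same player to move again), or the
            # final reward if the game ends within the next two plies.
            if t + 2 < len(history):
                target = value(history[t + 2], theta)
            else:
                target = reward
            delta = target - value(s_t, theta)    # TD(0) error
            theta += alpha * delta * local_shape_features(s_t)
    return theta
```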

Page 8

Results with different sets of local shape features

Page 9

Results with different sets of local shape features cont’d

Page 10

Alpha Beta Search

AB Search: During game play this is a technique used to search the potential moves from state s to s′ (note: we are now using the learnt value function V(s)).

For example, at a depth of 2, if it is White's move we consider all of White's moves and all of Black's responses to those moves. We maintain alpha and beta as the lower and upper bounds on the value of the move. A sketch of such a search follows below.
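
A sketch of a depth-limited alpha-beta search that scores leaf positions with the learnt V(s); the environment interface (game_over, to_play, legal_moves, after returning the successor position, state) is a hypothetical assumption, and value and BLACK come from the earlier sketch.

```python
def alphabeta(env, theta, depth, alpha=float("-inf"), beta=float("inf")):
    """Depth-limited alpha-beta search; Black is the maximizing player and
    leaf positions are evaluated with the learnt value function V(s)."""
    if depth == 0 or env.game_over():
        return value(env.state(), theta)           # leaf: Black's win probability
    if env.to_play == BLACK:                       # maximizing node
        best = float("-inf")
        for move in env.legal_moves():
            best = max(best, alphabeta(env.after(move), theta, depth - 1, alpha, beta))
            alpha = max(alpha, best)               # alpha: lower bound on the value
            if alpha >= beta:                      # cut-off: opponent avoids this line
                break
        return best
    best = float("inf")                            # minimizing node (White)
    for move in env.legal_moves():
        best = min(best, alphabeta(env.after(move), theta, depth - 1, alpha, beta))
        beta = min(beta, best)                     # beta: upper bound on the value
        if beta <= alpha:
            break
    return best
```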

Page 11

AB Search Results

Page 12

Section 4.0: Temporal-difference search

Idea: If we are in a current state st there is always a subgame G′ of the original game G. Apply TD learning to G′ using subgames of self-play that start from the current state st.

Simulation-based search: the agent samples episodes of simulated experience and updates its value function from that simulated experience.

Begin in state s0. At each step u of the simulation, an action au is selected according to a simulation policy, and a new state su+1 and reward ru+1 are generated by the MDP model. Repeat until a terminal state is reached.
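
A minimal sketch of the simulation loop just described, assuming a hypothetical MDP model object with is_terminal and step methods and a callable simulation policy.

```python
def simulate_episode(model, policy, s0):
    """Sample one episode of simulated experience starting from s0."""
    trajectory = []
    s, u = s0, 0
    while not model.is_terminal(s):
        a = policy(s)                      # a_u selected by the simulation policy
        s_next, r = model.step(s, a)       # s_{u+1} and r_{u+1} from the MDP model
        trajectory.append((s, a, r, s_next))
        s, u = s_next, u + 1
    return trajectory
```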

Page 13

MCTS and temporal-difference search cont'd

TD search uses value function approximation on the current subgame (our V(s) from before). We can update all similar states, since the value function approximates the value of any position s ∈ S.

MCTS must wait many time-steps until a final outcome is obtained, and that outcome depends on all of the agent's decisions throughout the simulation. TD search can bootstrap, as before, between subsequent states: it does not need to wait until the end of the simulation to correct the TD error (just as in TD learning). A small illustration of the two update targets follows below.

MCTS is currently the best known example of simulation-based search.
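
A toy illustration (hypothetical helper names) of the difference in update targets: Monte-Carlo methods such as MCTS back up the final outcome of the whole simulation, while TD search bootstraps from the current estimate of a later state.

```python
def monte_carlo_target(final_outcome):
    """MCTS-style target: available only once the simulation has finished."""
    return final_outcome

def td_target(value_fn, s_later):
    """TD-search-style target: bootstraps from the current value estimate of a
    later state (e.g. V(s_{t+2})), so the correction can be made immediately."""
    return value_fn(s_later)
```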

Page 14

Linear TD Search

Linear TD search is an implementation of TD search where the value function is updated online. The agent simulates episodes of experience from the current state by sampling from its current policy π(s, a), the transition model Pπss′, and the reward model Rπss′ (note: P gives the transition probabilities and R is the reward function).

Linear TD search is applied to the subgame at the current state. Instead of using a search tree, the agent approximates the value function by a linear combination: Qu(s, a) = φ(s, a) · θu. Q is the action value function, Qπ(s, a) = E[Rt | st = s, at = a].

θ is the weight vector, u is the current simulation time step, and φ(s, a) is the feature vector representing states and actions.

After each step the agent updates the parameters by TD learning, using TD(λ). A sketch of such an update loop follows below.
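
A sketch of an online linear TD search loop under stated assumptions: phi(s, a) is a feature function, model is a hypothetical simulation model (actions, is_terminal, step), and the TD(λ) update is written as an on-policy Sarsa(λ)-style update with accumulating eligibility traces, which may differ in detail from the variant used in the paper.

```python
import numpy as np

def linear_td_search(model, phi, num_features, s0,
                     n_episodes=1000, alpha=0.01, lam=0.8, epsilon=0.1):
    """Simulate episodes from s0, represent Q_u(s, a) = phi(s, a) . theta_u,
    and update theta online with TD(lambda) after each simulated step."""
    theta = np.zeros(num_features)

    def q(s, a):
        return phi(s, a).dot(theta)

    def pick(s):                                   # epsilon-greedy simulation policy
        actions = model.actions(s)
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    for _ in range(n_episodes):
        z = np.zeros(num_features)                 # eligibility trace
        s = s0
        a = pick(s)
        while True:
            z = lam * z + phi(s, a)                # accumulate trace for (s, a)
            s_next, r = model.step(s, a)           # sample the simulation model
            if model.is_terminal(s_next):
                delta = r - q(s, a)                # terminal target: final reward
                theta += alpha * delta * z
                break
            a_next = pick(s_next)
            delta = r + q(s_next, a_next) - q(s, a)  # undiscounted TD error
            theta += alpha * delta * z
            s, a = s_next, a_next
    return theta
```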

Page 15

TD Search Results

Page 16

TD search in computer Go

In section 3.0 the value of each shape was learned offline, using TD learning by self-play. This is considered myopic, since each shape is evaluated without knowledge of the rest of the board.

The idea is to use local shape features in TD search. TD search can learn the value of each feature in the context of the current board position (the current state), as discussed previously. This allows the agent to focus on what works well now.

Issue: By starting simulations from the current position we break the symmetries of the empty board, so the weight sharing (feature reduction) based on those symmetries is lost.

Page 17

Change to TD search

Remember that with shapes in section 3.0 we were learning the value function V(s).

We modify the TD search algorithm to update the weights of our value function.

Linear TD search is applied to the subgame at the current state. Instead of using a search tree, the agent approximates the value function by a linear combination of local shape features, and updates the weights with the normalised two-ply TD(0) step: ∆θ = α φ(st) / ||φ(st)||² · (V(st+2) − V(st)).
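
A few lines showing this normalised two-ply update, reusing the local_shape_features and value helpers from the earlier sketches and assuming the normaliser is the squared norm of the feature vector φ(st).

```python
def td_search_weight_update(theta, s_t, s_t_plus_2, alpha=0.1):
    """Normalised two-ply TD(0) step:
    delta_theta = alpha * phi(s_t) / ||phi(s_t)||^2 * (V(s_{t+2}) - V(s_t))."""
    phi_t = local_shape_features(s_t)
    td_error = value(s_t_plus_2, theta) - value(s_t, theta)
    return theta + alpha * td_error * phi_t / phi_t.dot(phi_t)
```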

Page 18

Experiments with TD search

Ran tournaments between different versions of RLGO of at least 200 games.

Used the bayeselo program to calculate Elo ratings.

Recall that for TD search we are doing simulation-based search (slide 13) and there is no hand-crafted simulation policy: at each step the simulation policy simply maximizes over the actions available from the current state of the MDP.

Fuego 0.1 provides a "vanilla" policy that they use as a default policy. They do this to incorporate some prior knowledge into the simulation policy, since TD search itself assumes no prior knowledge. They switch to this policy every T moves.

Switching policy every 2-8 moves resulted in a 300-point Elo improvement.

The results show the importance of what they call "temporality": focusing the agent's resources on the current moment.

Page 19

TD Results

Page 20

TD Results

Page 21

Dyna-2: Integrating short and long-term memories

Learning algorithms slowly extract knowledge from the complete history of training data.

Search algorithms use and extend this knowledge, rapidly and online, so as to evaluate local states more accurately.

Dyna-2: combines both TD learning and TD search.

Sutton had a previous algorithm, Dyna (1990), that applied TD learning both to real experience and to simulated experience.

The key idea in Dyna-2 is to maintain two separate memories: a long-term memory that is learnt from real experience, and a short-term memory that is used during search and is updated from simulated experience.

Page 22

Dyna-2 cont’d

Define a short-term value function Q̄(s, a) and a long-term value function Q(s, a):

Q(s, a) = φ(s, a) · θ
Q̄(s, a) = φ(s, a) · θ + φ̄(s, a) · θ̄

Q is the action value function, Qπ(s, a) = E[Rt | st = s, at = a]; θ is the long-term weight vector, θ̄ the short-term weight vector, and φ(s, a), φ̄(s, a) are the corresponding feature vectors representing states and actions.

The short-term value function, Q̄(s, a), uses both memories to approximate the true value function.

Two-phase search: an AB search is performed after each TD search. A minimal sketch of the two memories follows below.
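
A minimal sketch of the two Dyna-2 memories and how they combine, under the assumptions that both memories are linear in (possibly different) feature vectors and that the short-term weights are the ones adjusted by TD search; whether the short-term memory is cleared or retained between real moves is a design choice not settled by this summary.

```python
import numpy as np

class Dyna2Memories:
    """Long-term weights theta (learnt from real experience) and short-term
    weights theta_bar (adjusted by TD search on simulated experience)."""

    def __init__(self, num_features, num_search_features):
        self.theta = np.zeros(num_features)              # long-term memory
        self.theta_bar = np.zeros(num_search_features)   # short-term memory

    def q_long(self, phi):
        """Q(s, a) = phi(s, a) . theta  (long-term memory only)."""
        return phi.dot(self.theta)

    def q_short(self, phi, phi_bar):
        """Q_bar(s, a) = phi(s, a) . theta + phi_bar(s, a) . theta_bar:
        the short-term value function uses both memories."""
        return phi.dot(self.theta) + phi_bar.dot(self.theta_bar)
```

During play, TD search would adjust theta_bar from simulated experience before each real move, the subsequent AB search would select moves using q_short, and real experience would update theta.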

Page 23

Dyna-2 results

Page 24

Discussion

What is the significance of the 2nd author?

What are your thoughts on the overall performance of this algorithm?

Why didn’t they outperform modern MCTS methods?

Are there any other applications where this might be useful?

Did you think the paper did a good job explaining their approach? Was it descriptive enough?

What feature of Go, as compared to chess, checkers, or backgammon, makes it different in the reinforcement learning environment?

Is using only a 1x1 feature set of shapes equivalent to the notion of "over-fitting"?

What is the advantage of the two-ply update versus a 1-ply update that they referred to in section 3.2? What is the trade-off as we go up to 6 ply?
