Transcript
Page 1

Temporal-difference search in computer Go

David Silver · Richard S. Sutton · Martin Müller

March 3, 2014

Page 2

Three Major Sections

Section 3: Shape features (almost exact duplicate of a previous paper from 2004)

Section 2: TD-Search (Simulation-based search methods for game play)

Section 4: Dyna-2 algorithm (Introducing the concept of long and short-term memory in simulated search)

Page 3

AI Game Playing

In AI the most successful game-playing strategies have followed these steps:

1st: Positions are evaluated by a linear combination of many features; the position is broken down into small local components.
2nd: Weights are trained by TD learning and self-play.
Finally: The linear evaluation function is combined with a suitable search algorithm to produce a high-performing game player.

Page 4

Sec 3.0: TD Learning with Local Shape Features

Approach: Shapes are an important part of Go. Master-level players think in terms of general shapes.

Objective is to win the game of Go. Rewards: r = 1 if Black wins and r = 0 if White wins.

The state s assigns one of three values to each intersection of the board: empty, black, or white.

Local shape features li are taken from all possible configurations of all square regions up to 3x3. φi(s) = 1 if the board matches local shape li, and 0 otherwise.

The value function V π(s) is the expected total reward from state s when following policy π (the probability of winning).

The value function is approximated by a logistic-linear combination of shape features; this is model-free learning since we learn the value function directly: V(s) = σ(φ(s) · θLI + φ(s) · θLD), Black's winning probability from state s (location-independent and location-dependent weight vectors). A minimal sketch of these features and the logistic value function follows below.

Use a two-ply update: the TD(0) error is calculated after both the player and the opponent have made a move, i.e. against V(st+2).

Using self-play: agent and opponent use the same policy π(s, a).
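
A minimal sketch (not the authors' implementation) of what binary local shape features and the logistic-linear value function could look like; the board encoding, feature indexing, and the restriction to 1x1 templates are illustrative assumptions.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2
BOARD_SIZE = 9                                  # illustrative board size
NUM_FEATURES = BOARD_SIZE * BOARD_SIZE * 3      # one indicator per (point, colour)

def local_shape_features(board):
    """Binary feature vector phi(s): phi_i(s) = 1 if the board matches shape l_i.

    For brevity only 1x1 shapes (the colour of each intersection) are
    enumerated; 2x2 and 3x3 square templates would be indexed the same way,
    with one indicator per template location and configuration.
    """
    phi = np.zeros(NUM_FEATURES)
    for x in range(BOARD_SIZE):
        for y in range(BOARD_SIZE):
            colour = board[x, y]                        # EMPTY, BLACK or WHITE
            phi[(x * BOARD_SIZE + y) * 3 + colour] = 1.0
    return phi

def value(board, theta):
    """V(s) = sigma(phi(s) . theta): Black's estimated winning probability."""
    z = local_shape_features(board).dot(theta)
    return 1.0 / (1.0 + np.exp(-z))                     # logistic (sigmoid) function
```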

Page 5

Page 6

Page 7

Sec 3.2.1: Training Procedure

Objective: Train the weights of the value function V(s) and update them using logistic TD(0). A minimal sketch of this training loop follows the list below.

Initialize all weights to zero and run a million games of self-play to find the weights of the value function V(s).

Black and White select moves using an ε-greedy policy over the same value function.

Using self-play: agent and opponent use the same ε-greedy policy π(s, a).

Terminate when both players pass
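
A sketch of the self-play training loop under stated assumptions: it reuses local_shape_features, value, NUM_FEATURES, and BLACK from the earlier sketch, and assumes a hypothetical Go environment object (legal_moves, peek, play, state, to_play, both_passed, black_won). The two-ply logistic TD(0) update follows the description above, but the exact update schedule in the paper may differ (e.g. updating online during the game rather than after it).

```python
import numpy as np

def epsilon_greedy_move(env, theta, epsilon=0.1):
    """Pick a legal move: random with probability epsilon, otherwise the move
    whose successor position looks best for the player to move."""
    moves = env.legal_moves()
    if np.random.rand() < epsilon:
        return moves[np.random.randint(len(moves))]
    values = [value(env.peek(m), theta) for m in moves]   # V(s') after each move
    best = np.argmax(values) if env.to_play == BLACK else np.argmin(values)
    return moves[best]

def train_self_play(make_env, n_games=1_000_000, alpha=0.1, epsilon=0.1):
    """Self-play training with two-ply logistic TD(0), as described above."""
    theta = np.zeros(NUM_FEATURES)                # initialize all weights to zero
    for _ in range(n_games):
        env = make_env()
        history = [env.state()]
        while not env.both_passed():              # terminate when both players pass
            env.play(epsilon_greedy_move(env, theta, epsilon))
            history.append(env.state())
        reward = 1.0 if env.black_won() else 0.0  # r = 1 if Black wins, else 0
        for t, s_t in enumerate(history):
            # Two-ply target: V(s_{t+2}) (same player to move again), or the
            # final reward if the game ends within the next two plies.
            if t + 2 < len(history):
                target = value(history[t + 2], theta)
            else:
                target = reward
            delta = target - value(s_t, theta)    # TD(0) error
            theta += alpha * delta * local_shape_features(s_t)
    return theta
```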

Page 8

Results with different sets of local shape features

Page 9

Results with different sets of local shape features cont’d

Page 10

Alpha Beta Search

AB Search: During game play this is a technique used to search the potential moves from state s to s′ (note: we are now using the learnt value function V(s)).

For example, at a depth of 2, if it is White's move we consider all of White's moves and all of Black's responses to those moves. We maintain alpha and beta as the lower and upper bounds on the value of the move. A sketch of such a search follows below.
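
A sketch of a depth-limited alpha-beta search that scores leaf positions with the learnt V(s); the environment interface (game_over, to_play, legal_moves, after returning the successor position, state) is a hypothetical assumption, and value and BLACK come from the earlier sketch.

```python
def alphabeta(env, theta, depth, alpha=float("-inf"), beta=float("inf")):
    """Depth-limited alpha-beta search; Black is the maximizing player and
    leaf positions are evaluated with the learnt value function V(s)."""
    if depth == 0 or env.game_over():
        return value(env.state(), theta)           # leaf: Black's win probability
    if env.to_play == BLACK:                       # maximizing node
        best = float("-inf")
        for move in env.legal_moves():
            best = max(best, alphabeta(env.after(move), theta, depth - 1, alpha, beta))
            alpha = max(alpha, best)               # alpha: lower bound on the value
            if alpha >= beta:                      # cut-off: opponent avoids this line
                break
        return best
    best = float("inf")                            # minimizing node (White)
    for move in env.legal_moves():
        best = min(best, alphabeta(env.after(move), theta, depth - 1, alpha, beta))
        beta = min(beta, best)                     # beta: upper bound on the value
        if beta <= alpha:
            break
    return best
```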

Page 11

AB Search Results

Page 12

Section 4.0: Temporal-difference search

Idea: If we are in a current state st there is always a subgame G′ of the original game G. Apply TD learning to G′ using subgames of self-play that start from the current state st.

Simulation-based search: the agent samples episodes of simulated experience and updates its value function from that simulated experience.

Begin in state s0. At each step u of the simulation, an action au is selected according to a simulation policy, and a new state su+1 and reward ru+1 are generated by the MDP model. Repeat until a terminal state is reached.
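
A minimal sketch of the simulation loop just described, assuming a hypothetical MDP model object with is_terminal and step methods and a callable simulation policy.

```python
def simulate_episode(model, policy, s0):
    """Sample one episode of simulated experience starting from s0."""
    trajectory = []
    s, u = s0, 0
    while not model.is_terminal(s):
        a = policy(s)                      # a_u selected by the simulation policy
        s_next, r = model.step(s, a)       # s_{u+1} and r_{u+1} from the MDP model
        trajectory.append((s, a, r, s_next))
        s, u = s_next, u + 1
    return trajectory
```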

Page 13

MCTS and temporal-difference search cont'd

TD search uses value function approximation on the current subgame (our V(s) from before). We can update all similar states, since the value function approximates the value of any position s ∈ S.

MCTS must wait many time-steps until a final outcome is obtained, and that outcome depends on all of the agent's decisions throughout the simulation. TD search can bootstrap, as before, between subsequent states: it does not need to wait until the end of the simulation to correct the TD error (just as in TD learning). A small illustration of the two update targets follows below.

MCTS is currently the best known example of simulation-based search.
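
A toy illustration (hypothetical helper names) of the difference in update targets: Monte-Carlo methods such as MCTS back up the final outcome of the whole simulation, while TD search bootstraps from the current estimate of a later state.

```python
def monte_carlo_target(final_outcome):
    """MCTS-style target: available only once the simulation has finished."""
    return final_outcome

def td_target(value_fn, s_later):
    """TD-search-style target: bootstraps from the current value estimate of a
    later state (e.g. V(s_{t+2})), so the correction can be made immediately."""
    return value_fn(s_later)
```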

Page 14

Linear TD Search

Linear TD search is an implementation of TD search where the value function is updated online. The agent simulates episodes of experience from the current state by sampling from its current policy π(s, a), the transition model Pπss′, and the reward model Rπss′ (note: P gives the transition probabilities and R is the reward function).

Linear TD search is applied to the subgame at the current state. Instead of using a search tree, the agent approximates the value function by a linear combination: Qu(s, a) = φ(s, a) · θu. Q is the action value function, Qπ(s, a) = E[Rt | st = s, at = a].

θ is the weight vector, u is the current simulation time step, and φ(s, a) is the feature vector representing states and actions.

After each step the agent updates the parameters by TD learning, using TD(λ). A sketch of such an update loop follows below.
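
A sketch of an online linear TD search loop under stated assumptions: phi(s, a) is a feature function, model is a hypothetical simulation model (actions, is_terminal, step), and the TD(λ) update is written as an on-policy Sarsa(λ)-style update with accumulating eligibility traces, which may differ in detail from the variant used in the paper.

```python
import numpy as np

def linear_td_search(model, phi, num_features, s0,
                     n_episodes=1000, alpha=0.01, lam=0.8, epsilon=0.1):
    """Simulate episodes from s0, represent Q_u(s, a) = phi(s, a) . theta_u,
    and update theta online with TD(lambda) after each simulated step."""
    theta = np.zeros(num_features)

    def q(s, a):
        return phi(s, a).dot(theta)

    def pick(s):                                   # epsilon-greedy simulation policy
        actions = model.actions(s)
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    for _ in range(n_episodes):
        z = np.zeros(num_features)                 # eligibility trace
        s = s0
        a = pick(s)
        while True:
            z = lam * z + phi(s, a)                # accumulate trace for (s, a)
            s_next, r = model.step(s, a)           # sample the simulation model
            if model.is_terminal(s_next):
                delta = r - q(s, a)                # terminal target: final reward
                theta += alpha * delta * z
                break
            a_next = pick(s_next)
            delta = r + q(s_next, a_next) - q(s, a)  # undiscounted TD error
            theta += alpha * delta * z
            s, a = s_next, a_next
    return theta
```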

Page 15

TD Search Results

Page 16

TD search in computer Go

In section 3.0 the value of each shape was learned offline, using TD learning by self-play. This is considered myopic, since each shape is evaluated without knowledge of the rest of the board.

The idea is to use local shape features in TD search. TD search can learn the value of each feature in the context of the current board position (the current state), as discussed previously. This allows the agent to focus on what works well now.

Issue: By starting simulations from the current position we break the symmetries of the empty board, so the weight sharing (feature reduction) based on those symmetries is lost.

Page 17

Change to TD search

Remember that with shapes in section 3.0 we were learning the value function V(s).

We modify the TD search algorithm to update the weights of our value function.

Linear TD search is applied to the subgame at the current state. Instead of using a search tree, the agent approximates the value function by a linear combination of local shape features, and updates the weights with the normalised two-ply TD(0) step: ∆θ = α φ(st) / ||φ(st)||² · (V(st+2) − V(st)).
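
A few lines showing this normalised two-ply update, reusing the local_shape_features and value helpers from the earlier sketches and assuming the normaliser is the squared norm of the feature vector φ(st).

```python
def td_search_weight_update(theta, s_t, s_t_plus_2, alpha=0.1):
    """Normalised two-ply TD(0) step:
    delta_theta = alpha * phi(s_t) / ||phi(s_t)||^2 * (V(s_{t+2}) - V(s_t))."""
    phi_t = local_shape_features(s_t)
    td_error = value(s_t_plus_2, theta) - value(s_t, theta)
    return theta + alpha * td_error * phi_t / phi_t.dot(phi_t)
```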

Page 18

Experiments with TD search

Ran tournaments between different versions of RLGO of at least 200 games.

Used the bayeselo program to calculate Elo ratings.

Recall that for TD search we are doing simulation-based search (slide 13) and there is no hand-crafted simulation policy: at each step the simulation policy simply maximizes over the actions available from the current state of the MDP.

Fuego 0.1 provides a "vanilla" policy that they use as a default policy. They do this to incorporate some prior knowledge into the simulation policy, since TD search itself assumes no prior knowledge. They switch to this policy every T moves.

Switching policy every 2-8 moves resulted in a 300-point Elo improvement.

The results show the importance of what they call "temporality": focusing the agent's resources on the current moment.

Page 19

TD Results

Page 20

TD Results

Page 21

Dyna-2: Integrating short and long-term memories

Learning algorithms slowly extract knowledge from the complete history of training data.

Search algorithms use and extend this knowledge, rapidly and online, so as to evaluate local states more accurately.

Dyna-2: combines both TD learning and TD search.

Sutton had a previous algorithm, Dyna (1990), that applied TD learning both to real experience and to simulated experience.

The key idea in Dyna-2 is to maintain two separate memories: a long-term memory that is learnt from real experience, and a short-term memory that is used during search and is updated from simulated experience.

Page 22

Dyna-2 cont’d

Define a short-term value function Q̄(s, a) and a long-term value function Q(s, a):

Q(s, a) = φ(s, a) · θ
Q̄(s, a) = φ(s, a) · θ + φ̄(s, a) · θ̄

Q is the action value function, Qπ(s, a) = E[Rt | st = s, at = a]; θ is the long-term weight vector, θ̄ the short-term weight vector, and φ(s, a), φ̄(s, a) are the corresponding feature vectors representing states and actions.

The short-term value function, Q̄(s, a), uses both memories to approximate the true value function.

Two-phase search: an AB search is performed after each TD search. A minimal sketch of the two memories follows below.
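
A minimal sketch of the two Dyna-2 memories and how they combine, under the assumptions that both memories are linear in (possibly different) feature vectors and that the short-term weights are the ones adjusted by TD search; whether the short-term memory is cleared or retained between real moves is a design choice not settled by this summary.

```python
import numpy as np

class Dyna2Memories:
    """Long-term weights theta (learnt from real experience) and short-term
    weights theta_bar (adjusted by TD search on simulated experience)."""

    def __init__(self, num_features, num_search_features):
        self.theta = np.zeros(num_features)              # long-term memory
        self.theta_bar = np.zeros(num_search_features)   # short-term memory

    def q_long(self, phi):
        """Q(s, a) = phi(s, a) . theta  (long-term memory only)."""
        return phi.dot(self.theta)

    def q_short(self, phi, phi_bar):
        """Q_bar(s, a) = phi(s, a) . theta + phi_bar(s, a) . theta_bar:
        the short-term value function uses both memories."""
        return phi.dot(self.theta) + phi_bar.dot(self.theta_bar)
```

During play, TD search would adjust theta_bar from simulated experience before each real move, the subsequent AB search would select moves using q_short, and real experience would update theta.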

Page 23

Dyna-2 results

Page 24

Discussion

What is the significance of the 2nd author?

What are your thoughts on the overall performance of this algorithm?

Why didn’t they outperform modern MCTS methods?

Are there any other applications where this might be useful?

Did you think the paper did a good job explaining their approach? Was it descriptive enough?

What feature of Go, as compared to chess, checkers, or backgammon, makes it different in the reinforcement learning environment?

Is using only a 1x1 feature set of shapes equivalent to the notion of "over-fitting"?

What is the advantage of the two-ply update versus a 1-ply update that they referred to in section 3.2? What is the trade-off as we go up to 6 ply?
