Transcript
Page 1

4/21

Make-up class on this Friday. No class on next Tuesday.

Progression corresponds to finding a single path in the transition graph. What about regression?

Page 2

Interpreting progression and regression in the transition graph

• In the transition graph (corresponding to the atomic model):
  – Progression search corresponds to finding a single path.
  – Regression search corresponds to simultaneously starting from multiple states (all of which satisfy the goal conditions) and effectively searching in parallel until one of the paths reaches the initial state.

• Alternatively, you can see regression as searching in the space of sets of states, with the termination condition being that any of the states is an initial state.

• In contrast, planning with an incomplete state is also a search in the space of belief states (remember the vacuum world), except that the termination condition requires every state in the belief state to be a goal state.

Page 3

CSE 574: Planning & Learning Subbarao Kambhampati

Handling Conditional Effects

Conditional effects don't change progression much at all.
  – Why? Because the state in which the operator is being applied is known, so you know whether or not the conditional effect actually happens.

Handling conditional effects in regression planning introduces "secondary" preconditions.
  – Consider regressing goals {P,Q} over an action A with two conditional effects: R=>P; J=>~Q (see the sketch below).
  – What happens if A has two more effects: U=>P; N=>~Q?
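As a concrete illustration of these secondary preconditions, here is a minimal sketch (not the course's code) of regressing a goal set over an action with conditional effects; the `CondEffect`/`Action` classes and the literal-string encoding are illustrative assumptions.

```python
from dataclasses import dataclass

def negate(lit: str) -> str:
    return lit[1:] if lit.startswith("~") else "~" + lit

@dataclass(frozen=True)
class CondEffect:
    condition: frozenset   # secondary condition under which the effect fires
    adds: frozenset        # literals added when it fires
    deletes: frozenset     # literals deleted when it fires

@dataclass(frozen=True)
class Action:
    name: str
    preconds: frozenset
    effects: tuple         # tuple of CondEffect

def regress(goals, action):
    """Regress a goal set over an action with conditional effects.

    Supporting effects contribute 'causation' preconditions (their conditions
    must hold), and effects that would delete a goal contribute 'preservation'
    preconditions (their conditions must NOT hold). This sketch greedily picks
    the first supporter; with several (e.g. R=>P and U=>P) a real regression
    planner would branch over the choices.
    """
    regressed = set(action.preconds)
    for g in goals:
        supporters = [e for e in action.effects if g in e.adds]
        if supporters:
            regressed |= supporters[0].condition          # e.g. P regresses to R
        else:
            regressed.add(g)                              # g must persist through A
        for e in action.effects:
            if g in e.deletes:                            # e.g. J => ~Q threatens Q
                regressed |= {negate(c) for c in e.condition}
    return frozenset(regressed)

# Regressing {P, Q} over A with conditional effects R=>P and J=>~Q:
A = Action("A", frozenset(), (
    CondEffect(frozenset({"R"}), frozenset({"P"}), frozenset()),
    CondEffect(frozenset({"J"}), frozenset(), frozenset({"Q"})),
))
print(sorted(regress(frozenset({"P", "Q"}), A)))          # ['Q', 'R', '~J']
```

Here R becomes a "causation" precondition for P, and ~J a "preservation" precondition keeping Q from being deleted. With the extra effects U=>P and N=>~Q, a regression planner would additionally branch over whether R=>P or U=>P supports P, and would also carry back ~N.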

Page 4

CSE 574: Planning & Learning Subbarao Kambhampati

Page 5

Don’t look at curved lines for now…

[Figure: planning graph for the cake example, showing the propositions Have(cake), ~Have(cake), eaten(cake), ~eaten(cake) across successive levels, connected through the actions Eat, Bake, and the No-ops.]

The graph has leveled off when the proposition list has not changed from the previous iteration.

Note that the graph has leveled off now, since the last two proposition lists are the same (we could actually have stopped at the previous level, since we already have all possible literals by step 2).
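A minimal sketch, under my own assumptions about the data layout, of growing such a proposition-level graph (ignoring mutexes) until it levels off; the `StripsAction` namedtuple and the literal strings are illustrative.

```python
from collections import namedtuple

# A STRIPS-style action: preconditions and effect literals as sets of strings.
StripsAction = namedtuple("StripsAction", ["name", "pre", "eff"])

def build_planning_graph(init_props, actions):
    """Grow proposition levels P0, P1, ... until the prop list stops changing
    (the graph has 'leveled off'). Negative interactions are not tracked, so
    this is the relaxed graph used for reachability heuristics."""
    levels = [frozenset(init_props)]
    while True:
        current = levels[-1]
        applicable = [a for a in actions if a.pre <= current]
        nxt = set(current)                      # no-ops carry every literal forward
        for a in applicable:
            nxt |= a.eff
        if frozenset(nxt) == current:           # leveled off
            return levels
        levels.append(frozenset(nxt))

# Cake example (illustrative encoding):
eat  = StripsAction("Eat",  {"Have(cake)"},  {"eaten(cake)", "~Have(cake)"})
bake = StripsAction("Bake", {"~Have(cake)"}, {"Have(cake)"})
for i, lv in enumerate(build_planning_graph({"Have(cake)", "~eaten(cake)"}, [eat, bake])):
    print(i, sorted(lv))
```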

Page 6

[Figure: one level of planning-graph expansion for the blocks world. Level-0 propositions: onT-A, onT-B, cl-A, cl-B, he. Actions: Pick-A, Pick-B. Level-1 propositions: onT-A, onT-B, cl-A, cl-B, he, h-A, h-B, ~cl-A, ~cl-B, ~he.]

Page 7

[Figure: two levels of planning-graph expansion for the blocks world. Level 1 adds h-A, h-B, ~cl-A, ~cl-B, ~he via Pick-A and Pick-B; level 2 adds on-A-B and on-B-A via St-A-B, St-B-A, Ptdn-A, Ptdn-B, Pick-A, and Pick-B.]

Page 8

[Figure: the two-level blocks-world planning graph drawn side by side, once for progression and once for regression.]

How do we use reachability heuristics for regression?

Page 9

Neither hlev nor hsum always works well

[Figure: two planning graphs (P0, A0, P1) over the literals p1…p100 and q. In the first, a separate action Bi is needed to achieve each pi; in the second, a single action B* achieves all of p1…p100.]

True cost of {p1…p100} is 100 (needs 100 actions to reach). Hlev says the cost is 1; Hsum says the cost is 100.

Hsum better than Hlev

True cost of {p1…p100} is 1 (needs just one action to reach). Hlev says the cost is 1; Hsum says the cost is 100.

Hlev better than Hsum

Hrelax will get it correct both times..

Page 10

[Figure: the two-level blocks-world planning graph, with the actions and literals used by the extracted relaxed plan highlighted.]

Relaxed plan for our blocks example

Page 11

“Relaxed plan”
• Suppose you want to find a relaxed plan for supporting literals g1…gm on a k-length PG. You do it this way (a sketch follows below):
  – Start at the kth level. Pick an action for supporting each gi (the actions don't have to be distinct; one action can support more than one goal). Let the actions chosen be {a1…aj}.
  – Take the union of the preconditions of a1…aj. Let these be the set p1…pv.
  – Repeat steps 1 and 2 for p1…pv; continue until you reach the initial proposition list.
• The plan is called “relaxed” because you are assuming that sets of actions can be done together without negative interactions.
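A minimal sketch of that backward extraction, assuming the graph is available as proposition layers `prop_levels`, action layers `action_levels`, and per-action `pre`/`add` sets; these names are illustrative, not the course's API.

```python
def extract_relaxed_plan(goals, prop_levels, action_levels, pre, add):
    """Backward-extract a relaxed plan from a leveled planning graph.

    prop_levels[i]   : set of propositions present at level i
    action_levels[i] : set of actions applicable between level i and i+1
    pre[a], add[a]   : precondition / add sets of action a
    Returns a list of action sets, one per graph level (earliest first).
    Assumes every goal literal is present in the last proposition level.
    """
    k = len(prop_levels) - 1
    plan = [set() for _ in range(k)]
    needed = set(goals)
    for level in range(k, 0, -1):
        next_needed = set()
        for g in needed:
            if g in prop_levels[level - 1]:
                next_needed.add(g)            # no-op: carry the subgoal back
                continue
            # pick any action at this level that adds g (greedy, no backtracking)
            a = next(a for a in action_levels[level - 1] if g in add[a])
            plan[level - 1].add(a)
            next_needed |= pre[a]             # its preconditions become subgoals
        needed = next_needed
    return plan
```

The total number of actions collected is the h-relax value; because negative interactions are ignored the extraction needs no backtracking, although finding the optimal (shortest) relaxed plan is still NP-hard, as noted below.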

[Figure: the same blocks-world planning graph, with the actions picked during relaxed-plan extraction marked.]

No backtracking needed!

Optimal relaxed plan is still NP-hard

Page 12

h-sum; h-lev; h-relax

• Given a set of literals {l1…lk}:
  – h-lev is the earliest level in which all of them are present together.
  – h-sum is the sum of the earliest levels in which each of them is present.
  – h-relax is the length of the relaxed plan to support the literals.
• h-lev is lower than or equal to h-relax.
• h-sum is larger than or equal to h-lev.
• h-lev is admissible.
• h-relax is not admissible unless you find the optimal relaxed plan, which is NP-hard.
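A minimal sketch of h-lev and h-sum computed from the proposition levels of a leveled-off planning graph (the same illustrative `prop_levels` layout assumed in the earlier sketches):

```python
def first_level(prop, prop_levels):
    """Earliest graph level at which a proposition appears."""
    for i, props in enumerate(prop_levels):
        if prop in props:
            return i
    return float("inf")                      # never reachable in the relaxed graph

def h_lev(goals, prop_levels):
    """Earliest level at which *all* goal literals are present together."""
    for i, props in enumerate(prop_levels):
        if all(g in props for g in goals):
            return i
    return float("inf")

def h_sum(goals, prop_levels):
    """Sum of the earliest levels of the individual goal literals."""
    return sum(first_level(g, prop_levels) for g in goals)
```

On the two p1…p100 examples a few slides back, h_lev returns 1 in both cases and h_sum returns 100 in both, which is exactly the mismatch the slide points out; h-relax (the extracted plan length) distinguishes them.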

Page 13

Planning Graphs for heuristics

Construct planning graph(s) at each search node. Extract a relaxed plan to achieve the goal; its length serves as the heuristic.
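A sketch of how this is typically wired into a forward (progression) search loop; `build_planning_graph` and `h_lev` are the illustrative helpers sketched earlier, `applicable`/`apply_action` are assumed domain functions, and any of h-lev, h-sum, or the relaxed-plan length could be plugged in as h.

```python
import heapq, itertools

def plan_forward(init, goals, actions, applicable, apply_action):
    """Greedy best-first progression search; h(s) comes from a planning graph
    built at s (here h_lev stands in for the relaxed-plan length)."""
    def h(state):
        return h_lev(goals, build_planning_graph(state, actions))

    counter = itertools.count()                      # tie-breaker for the heap
    frontier = [(h(init), next(counter), frozenset(init), [])]
    seen = set()
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if goals <= state:
            return path                              # sequence of actions
        if state in seen:
            continue
        seen.add(state)
        for a in applicable(state, actions):
            s2 = frozenset(apply_action(state, a))
            heapq.heappush(frontier, (h(s2), next(counter), s2, path + [a]))
    return None
```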

[Figure: a progression search tree in which a planning graph is grown at each search node and a relaxed plan to the goal set G is extracted to score it, giving heuristic values such as h(·) = 5 for the initial node.]

Page 14

What if actions have non-uniform costs?

Page 15

Challenges in Cost Propagation

Page 16

Page 17

Planning → PSP → MDPs
• In addition to actions having costs, we might have goals with rewards, with the understanding that if you achieve a goal, you get the corresponding reward.
• So now the objective of planning is to find a plan that has the highest net benefit, measured as the difference between the cumulative reward of the goals achieved and the cumulative cost of the actions used.
• This problem, called partial satisfaction planning, is both easy (since an "empty" plan is a solution, just not a very good one) and hard (since now the "quality of the plan" in terms of its net benefit is what matters).
  – It is possible to extend the planning-graph heuristics to this problem.
• On top of this, we might also want to say that rewards are not limited to goals achieved in the final state, but can also be gathered for visiting certain good states along the way.
  – Such goals are called "trajectory constraints".
• Even further, we can consider a scenario where the actions are stochastic.
  – By this point it is not even clear that a sequence of actions is an adequate form for the solution. We need to understand it first at the atomic level, and we shall do so. By the way, this problem is called a Markov Decision Process. [MDPs can be done at the propositional and relational level, but we won't discuss that in this class.]
• If our masochism continues unabated, we can also say that in addition to actions being stochastic, we have partial observability.
  – This leads to a generalization of MDPs called POMDPs (Partially Observable MDPs); we won't cover this in this course.
• ...but as long as we are naming things: if we consider actions with durations, we get Semi-MDPs; if we consider other agents, we get decentralized MDPs (and in each case we can have PO versions..).

Page 18

[can generalize to have action costs C(a,s)]

If the Mij matrix is not known a priori, then we have a reinforcement learning scenario..


Page 19

What does a solution to an MDP look like?

• The solution should tell us the optimal action to do in each state (called a "Policy").
  – A policy is a function from states to actions (*see the finite horizon case below*).
  – It is not a sequence of actions anymore.
    • This is needed because of the non-deterministic actions.
  – If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies.
• How do we get the best policy?
  – Pick the policy that gives the maximal expected reward.
  – For each policy:
    • Simulate the policy (take the actions suggested by the policy) to get behavior traces.
    • Evaluate the behavior traces.
    • Take the average value of the behavior traces.

We will concentrate on infinite horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state).

Page 20

Optimal Policies depend on rewards..


Page 21

4/23

If you are twenty and not a liberal, you are heartless.
If you are sixty and not a conservative, you are mindless.

--Winston Churchill

But why is Rao putting this here? He better not be hinting that the campus republicans are heartless or geriatric..

Make-up class: Tomorrow (Friday) 10:30—11:45 in DCDC Conference Room 175 [pass Bisonwitches, turn right]

Page 22

Horizon & Policy• We said policy is a function from states to

actions.. but we sort of lied. • Best policy is non-stationary, i.e., depends on

how long the agent has to “live” – which is called “horizon”

• More generally, a policy is a mapping from <state, time-to-death> <action>– So, if we have a horizon of k, then we will have k

policies• If the horizon is infinite, then policies must all be

the same.. (So infinite horizon case is easy!)

If you are twenty and not a liberal, you are heartless If you are sixty and not a conservative, you are mindless --Churchill

Page 23

What does a solution to an MDP look like?

• The solution should tell us the optimal action to do in each state (called a "Policy").
  – A policy is a function from states to actions (*see the finite horizon case below*).
  – It is not a sequence of actions anymore.
    • This is needed because of the non-deterministic actions.
  – If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies.
• How do we get the best policy?
  – Pick the policy that gives the maximal expected reward.
  – For each policy:
    • Simulate the policy (take the actions suggested by the policy) to get behavior traces.
    • Evaluate the behavior traces.
    • Take the average value of the behavior traces.

We will concentrate on infinite horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state).

Page 24

Horizon & Policy

• How long should behavior traces be?
  – Each trace is no longer than k (finite horizon case).
    • The policy will be horizon-dependent (the optimal action depends not just on what state you are in, but on how far away your horizon is).
    • E.g., financial portfolio advice for yuppies vs. retirees.
  – No limit on the size of the trace (infinite horizon case).
    • The policy is not horizon-dependent.

We will concentrate on infinite horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state).

If you are twenty and not a liberal, you are heartless If you are sixty and not a conservative, you are mindless --Churchill

Page 25

How to evaluate a policy?

• Step 1: Define the utility of a sequence of states in terms of their rewards.
  – Assume "stationarity" of preferences:
    • If you prefer future f1 to f2 starting tomorrow, you should prefer them the same way even if they start today.
  – Then there are only two reasonable ways to define the utility of a sequence of states:
    • U(s1, s2, …, sn) = Σi R(si)
    • U(s1, s2, …, sn) = Σi γ^i R(si), with 0 ≤ γ ≤ 1
      – The maximum utility is bounded from above by Rmax/(1 − γ).
• Step 2: The utility of a policy π is the expected utility of the behaviors exhibited by an agent following it: E[ Σt=0..∞ γ^t R(st) | π ].
• Step 3: The optimal policy π* is the one that maximizes this expectation: π* = argmax_π E[ Σt=0..∞ γ^t R(st) | π ].
  – Since there are only |A|^|S| different policies, you can evaluate them all in finite time (Haa haa..).
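A minimal sketch of estimating a policy's utility by simulation, matching the simulate/evaluate/average recipe above; `step(state, action)` is an assumed environment function returning a sampled next state (or None at a sink), and all names and parameter values here are illustrative.

```python
def discounted_return(trace_rewards, gamma):
    """U(s0, s1, ...) = sum_t gamma^t * R(st) for one behavior trace."""
    return sum((gamma ** t) * r for t, r in enumerate(trace_rewards))

def evaluate_policy(policy, start, step, reward, gamma=0.9,
                    n_traces=1000, max_len=200):
    """Estimate E[ sum_t gamma^t R(st) | policy ] by averaging sampled traces."""
    total = 0.0
    for _ in range(n_traces):
        s, rewards = start, []
        for _ in range(max_len):                 # truncate very long traces
            rewards.append(reward(s))
            s = step(s, policy(s))               # stochastic transition
            if s is None:                        # sink / absorbing state reached
                break
        total += discounted_return(rewards, gamma)
    return total / n_traces
```

Evaluating all |A|^|S| policies this way is of course only feasible for tiny MDPs, which is the point of the Bellman-equation methods that follow.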

Page 26

How to handle unbounded state sequences?

• If we don't have a horizon, then we can have potentially infinitely long state sequences. Three ways to handle them:
  1. Use a discounted reward model (the ith state in the sequence contributes only γ^i R(si)).
  2. Assume that the policy is proper (i.e., each sequence terminates in an absorbing state with non-zero probability).
  3. Consider the "average reward per step".

Page 27

Utility of a State

• The (long-term) utility of a state s with respect to a policy π is the expected value of all state sequences starting with s:
  – U^π(s) = E[ Σt=0..∞ γ^t R(st) | π, s0 = s ]
• The true utility of a state s is just its utility w.r.t. the optimal policy: U(s) = U^π*(s).
• Thus, U and π* are closely related:
  – π*(s) = argmax_a Σs' M^a_ss' U(s')
• As are the utilities of neighboring states:
  – U(s) = R(s) + γ max_a Σs' M^a_ss' U(s')   (Bellman equation)

Page 28

Optimal Utility depends on Optimal Policy

If you go to Tiger Hill near Darjeeling, and only look towards the direction the Sun is rising, you may not understand what the brouhaha is all about; but if you look the other side, you see this enchanting view of Kanchanjunga.

Page 29

Think of these as h*() values…

Called the value function U*. Think of these as related to h* values.


U* is the maximal expected utility (value) assuming optimal policy

Page 30

Bellman Equations as a basis for computing optimal policy

• Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
  – Yes...
• The optimal value and the optimal policy are related by the Bellman equations:
  – U(s) = R(s) + γ max_a Σs' M^a_ss' U(s')
  – π*(s) = argmax_a Σs' M^a_ss' U(s')
• The equations can be solved exactly through:
  – "value iteration" (iteratively compute U and then compute π*),
  – "policy iteration" (iterate over policies),
  – or approximately through "real-time dynamic programming".

Page 31

[Figure: value iteration on the grid-world example; the transition model is marked with probability 0.8 for the intended move and 0.1 for each sideways slip.]

U(i) = R(i) + γ max_a Σj M^a_ij U(j)
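A minimal value-iteration sketch for this update, using the max-norm termination test described on the "Terminating Value Iteration" slide; the dictionary encoding of M and R and all parameter values are illustrative assumptions.

```python
def value_iteration(states, actions, M, R, gamma=0.9, eps=1e-4):
    """M[a][s][s2] = P(s2 | s, a); R[s] = reward of state s.
    Repeats U(s) <- R(s) + gamma * max_a sum_s2 M[a][s][s2] * U(s2)
    until the max-norm change ||U_i - U_{i+1}|| drops below eps."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            best = max(sum(M[a][s].get(s2, 0.0) * U[s2] for s2 in states)
                       for a in actions)
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

def greedy_policy(states, actions, M, U):
    """pi*(s) = argmax_a sum_s2 M[a][s][s2] * U(s2)."""
    return {s: max(actions,
                   key=lambda a: sum(M[a][s].get(s2, 0.0) * U[s2] for s2 in states))
            for s in states}
```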

Page 32

Why are values coming down first? Why are some states reaching optimal value faster?

Updates can be done synchronously OR asynchronously --convergence guaranteed as long as each state updated infinitely often


Page 33

Value Iteration Demo

• http://www.cs.ubc.ca/spider/poole/demos/mdp/vi.html

• Things to note:
  – The way the values change (states far from absorbing states may first reduce and then increase their values).
  – The difference in convergence speed between the policy and the values.

Page 34

Terminating Value Iteration

• The basic idea is to terminate the value iteration when the values have "converged" (i.e., are not changing much from iteration to iteration).
  – Set a threshold ε and stop when the change across two consecutive iterations is less than ε.
  – There is a minor problem, since the value is a vector:
    • We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε.
    • The max norm ||.|| of a vector is the maximal value among all its dimensions. We are basically terminating when ||Ui − Ui+1|| < ε.

Page 35

4/28 (held on 4/24)

Policy Iteration
Real-time Dynamic Programming
Min-max Search
Alpha-beta pruning

Page 36

Policies converge earlier than values
• There are a finite number of policies but an infinite number of value functions.
• So entire regions of the value-vector space are mapped to a specific policy.
• So policies may be converging faster than values. Search in the space of policies.
• Given a utility vector Ui we can compute the greedy policy πi.
• The policy loss of πi is ||U^πi − U*|| (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension).

[Figure: for an MDP with 2 states and 2 actions, the value-vector space (V(S1), V(S2)) is partitioned into regions P1…P4, each of which maps to a single greedy policy; U* lies within one of these regions.]

Page 37

We can either solve the linear equations exactly, or solve them approximately by running the value iteration a few times (the update won't have the "max" operation).

n linear equations with n unknowns.
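A minimal sketch of the exact option, solving the n linear equations U^π = R + γ·M_π·U^π with numpy for a fixed policy π; the dictionary MDP encoding matches the earlier value-iteration sketch and is illustrative.

```python
import numpy as np

def evaluate_policy_exact(states, policy, M, R, gamma=0.9):
    """Solve (I - gamma * M_pi) U = R for the fixed policy pi."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    M_pi = np.zeros((n, n))
    for s in states:
        for s2, p in M[policy[s]][s].items():   # transition row under pi(s)
            M_pi[idx[s], idx[s2]] = p
    R_vec = np.array([R[s] for s in states])
    U = np.linalg.solve(np.eye(n) - gamma * M_pi, R_vec)
    return {s: U[idx[s]] for s in states}
```

Policy iteration then alternates this evaluation step with greedy improvement (e.g. the `greedy_policy` helper from the value-iteration sketch) until the policy stops changing.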

Page 38

Bellman equations when actions have costs

• The model discussed in class ignores action costs and only considers state rewards.
  – C(s,a) is the cost of doing action a in state s.
    • Assume costs are just negative rewards.
  – The Bellman equation then becomes:
    U(s) = R(s) + γ max_a [ −C(s,a) + Σs' M^a_ss' U(s') ]
• Notice that the only difference is that −C(s,a) is now inside the maximization.
• With this model, we can talk about "partial satisfaction" planning problems where actions have costs, goals have utilities, and the optimal plan may not satisfy all the goals.

Page 39

Real Time Dynamic Programming
• Value and policy iteration are the bedrock methods for solving MDPs. Both give optimality guarantees.
  – Both of them tend to be very inefficient for large (several-thousand-state) MDPs (polynomial in |S|).
• Many ideas are used to improve efficiency while giving up optimality guarantees.
  – E.g., consider only the part of the policy covering the more likely states (envelope extension method).
  – Interleave "search" and "execution" (Real-Time Dynamic Programming).
    • Do a limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing, which is the action that is sending you the best value).
    • The values of the leaf nodes are set to their immediate rewards.
      – Alternatively, to some admissible estimate of the value function (h*).
    • If all the leaf nodes are terminal nodes, then the backed-up value will be the true optimal value. Otherwise, it is an approximation...

RTDP

For leaf nodes, can use R(s) or some heuristic value h(s)
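A simplified sketch of the limited-depth backup described above, with leaf values taken from R(s) or an optional heuristic h(s); real RTDP additionally caches the backed-up values between trials, which this sketch omits, and the dictionary encoding and names are illustrative.

```python
def lookahead_value(s, depth, actions, M, R, gamma=0.9, h=None):
    """Depth-limited backup: at depth 0 use R(s) (or h(s)) as the leaf value;
    otherwise back up R(s) + gamma * max_a E[value of successor]."""
    if depth == 0:
        return h(s) if h else R[s]
    best = max(sum(p * lookahead_value(s2, depth - 1, actions, M, R, gamma, h)
                   for s2, p in M[a][s].items())
               for a in actions)
    return R[s] + gamma * best

def rtdp_action(s, depth, actions, M, R, gamma=0.9, h=None):
    """Greedy action w.r.t. the limited-depth values (interleave with execution)."""
    return max(actions,
               key=lambda a: sum(p * lookahead_value(s2, depth - 1, actions, M, R, gamma, h)
                                 for s2, p in M[a][s].items()))
```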

Page 40

What if you see this as a game? The expected-value computation is fine if you are maximizing "expected" return. What if you are risk-averse (and think "nature" is out to get you)? Then V2 = min(V3, V4).

If you are a perpetual optimist, then V2 = max(V3, V4).

If you have deterministic actions, then RTDP becomes RTA* (if you use h(.) to evaluate the leaves).

Page 41

Von Neumann (Min-Max theorem)

Claude Shannon (finite look-ahead)

Chaturanga, India (~550 AD) (Proto-Chess)

John McCarthy (pruning)

Donald Knuth (analysis)

Page 42

Game Playing (Adversarial Search)

• Perfect play
  – Do minmax on the complete game tree
• Alpha-beta pruning (a neat idea that is the bane of many a CSE471 student)
• Resource limits
  – Do limited-depth lookahead
  – Apply evaluation functions at the leaf nodes
  – Do minmax
• Miscellaneous
  – Games of chance
  – Status of computer games..

Page 43

Snakes-and-ladders is a perfect-information game with chance. Think of the utter boringness of deterministic snakes and ladders. Not that normal snakes-and-ladders has any real scope for showing your thinking power (your only action is dictated by the dice, so the dice can play it as a solitaire; at most they need your hand..).

Kriegspiel (blind-fold chess)

Page 44

Page 45

Searching Tic Tac Toe using Minmax

A game is considered solved if it can be shown that the MAX player has a winning (or at least non-losing) strategy.

This means that the backed-up value in the full min-max tree is +ve.

Page 46

[Figure: an alpha-beta example; evaluated leaves 2, 14, 5, and 2 give MIN-node bounds ≤2, ≤14, ≤5, and ≤2, and a cutoff occurs below the node bounded by ≤2.]

• Whenever a node gets its "true" value, its parent's bound gets updated.

• When all children of a node have been evaluated (or a cutoff occurs below that node), the current bound of that node is its true value.

• Two types of cutoffs:
  – If a min node n has bound ≤ k, and a max ancestor of n, say m, has a bound ≥ j, then a cutoff occurs as long as j ≥ k.
  – If a max node n has bound ≥ k, and a min ancestor of n, say m, has a bound ≤ j, then a cutoff occurs as long as j ≤ k.
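A minimal alpha-beta sketch over an explicit game tree (a leaf is a number, an internal node a list of children); it states the cutoffs in the usual α/β form rather than the per-node bound bookkeeping above, and the tree encoding is illustrative.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Return the minimax value of `node`, pruning branches that cannot
    affect the result. A leaf is a number; an internal node is a list."""
    if not isinstance(node, list):          # leaf: static evaluation value
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:               # cutoff: a MIN ancestor bound is hit
                break
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:               # cutoff: a MAX ancestor bound is hit
                break
        return value

# A small 3-ply example: the minimax value is 3, and some leaves are never visited.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, maximizing=True))     # 3
```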

Page 47

Another alpha-beta example
Project 2 assigned

Page 48

Click for an animation of Alpha-beta search in action on Tic-Tac-Toe

(order nodes in terms of their static eval values)

Page 49

Page 50

4/24 class ended here

Page 51

Page 52

Page 53

• How does it feel to be black and poor?
  – A. Very bad
  – B. Somewhat bad
  – C. Neither bad nor good
  – D. Somewhat good
  – E. Very good
  – F. F*** you

Page 54

Page 55

Page 56

Evaluation Functions: TicTacToe

If win for Max: +infty
If lose for Max: -infty
If draw for Max: 0
Else: # rows/cols/diags open for Max − # rows/cols/diags open for Min
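A minimal sketch of this evaluation function, assuming a 3x3 board of 'X' (Max), 'O' (Min), or None; the helper names are illustrative.

```python
import math

# The 8 winning lines: 3 rows, 3 columns, 2 diagonals.
LINES = [[(r, c) for c in range(3)] for r in range(3)] + \
        [[(r, c) for r in range(3)] for c in range(3)] + \
        [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]

def evaluate(board):
    """+inf / -inf / 0 for terminal positions; otherwise
    (#lines still open for Max) - (#lines still open for Min)."""
    def winner(p):
        return any(all(board[r][c] == p for r, c in line) for line in LINES)
    if winner('X'):
        return math.inf
    if winner('O'):
        return -math.inf
    if all(board[r][c] is not None for r in range(3) for c in range(3)):
        return 0                                    # draw
    open_for = lambda p: sum(all(board[r][c] in (p, None) for r, c in line)
                             for line in LINES)
    return open_for('X') - open_for('O')
```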

Page 57

Page 58

What depth should we go to?
  -- The deeper the better (but why?)

Should we go to uniform depth?
  -- Go deeper in branches where the game is in a flux (backed-up values are changing fast) [called "Quiescence" search]

Can we avoid the horizon effect?

Page 59

Why is “deeper” better?

• Possible reasons:
  – Taking mins/maxes of the evaluation values of the leaf nodes improves their collective accuracy.
  – Going deeper makes the agent notice "traps", thus significantly improving the evaluation accuracy.

• All evaluation functions first check for termination states before computing the non-terminal evaluation.

Page 60

(just as human weight lifters refuse to compete against cranes)

Page 61

Page 62

MDPs and Deterministic Search
• Problem-solving agent search corresponds to what special case of an MDP?
  – Actions are deterministic; goal states are all equally valued and are all sink states.
• Is it worth solving the problem using MDPs?
  – The construction of an optimal policy is overkill.
    • The policy, in effect, gives us the optimal path from every state to the goal state(s).
  – The value function, or its approximations, on the other hand, are useful. How?
    • As heuristics for the problem-solving agent's search.
• This shows an interesting connection between dynamic programming and "state search" paradigms:
  – DP solves many related problems on the way to solving the one problem we want.
  – State search tries to solve just the problem we want.
  – We can use DP to find heuristics to run state search..

Page 63

End of Gametrees

Page 64

Multi-player Games

Everyone maximizes their utility --How does this compare to 2-player games? (Max’s utility is negative of Min’s)

Page 65

Expecti-Max