
Transcript
Page 1


Special thanks to Joelle Pineau for presenting our paper - July 2012

A Bayesian Approach to Finding Compact Representations for Reinforcement Learning

Page 2

Authors

Stefanie Tellex
Alborz Geramifard

Jonathan How

David Wingate

Nicholas Roy


Page 3

Vision

Solving Large Sequential Decision Making Problems Formulated as MDPs.


Page 4

Reinforcement Learning

[Diagram: agent-environment loop with policy $\pi(s) : S \to A$, action $a_t$, and feedback $s_t, r_t$.]

2. Background

Reinforcement learning (RL) is a powerful framework for sequential decision making in which an agent interacts with an environment on every time step. The environment is often modeled as a Markov Decision Process (MDP), defined by a tuple $(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$, where $S$ is the finite set of states, $A$ is the finite set of actions, $P^a_{ss'}$ is the probability of transitioning from state $s$ to state $s'$ when taking action $a$, $R^a_{ss'}$ is the corresponding reward along the way, and $\gamma \in [0, 1]$ is a discount factor weighting immediate rewards against future rewards.¹ A trajectory of experience is a sequence $s_0, a_0, r_1, s_1, a_1, r_2, \ldots$, where at time $i$ the agent in state $s_i$ took action $a_i$, received reward $r_{i+1}$, and transitioned to state $s_{i+1}$. The behavior of the agent is captured through the notion of a policy $\pi : S \times A \to [0, 1]$ governing the probability of taking each action in each state. We limit our attention to deterministic policies mapping each state to one action. The value of a state given policy $\pi$ is defined as the expected cumulative discounted reward obtained by starting from $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\, \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s \right]$$

Similarly, the value of a state-action pair is defined as:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\, \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s,\, a_0 = a \right]$$
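Both expectations can be estimated from sampled trajectories. As a small worked example (not from the original; the function name is ours), the inner discounted sum for one trajectory of rewards $r_1, \ldots, r_T$ is:

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=1}^{T} gamma^(t-1) * r_t for one sampled trajectory.

    `rewards` is [r_1, ..., r_T]; averaging this over many trajectories
    started from s (or from (s, a)) gives a Monte-Carlo estimate of
    V^pi(s) (or Q^pi(s, a)).
    """
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# Example: three -0.001 step penalties followed by the +1 goal reward, gamma = 1.
print(discounted_return([-0.001, -0.001, -0.001, 1.0], gamma=1.0))  # 0.997
```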

The objective is to find the optimal policy, defined as:

$$\pi^*(s) = \operatorname*{argmax}_{a}\, Q^{\pi^*}(s, a).$$

One popular thrust of online reinforcement learning methods, such as SARSA (Rummery and Niranjan, 1994) and Q-Learning (Watkins and Dayan, 1992), tackles the problem by updating the estimated value of a state based on the temporal difference error (TD-error), while acting mostly greedily with respect to the estimated values. The TD-error is defined as

$$\delta_t(Q^\pi) = r_t + \gamma\, Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t).$$
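Written as code, the TD-error of a single stored interaction is one line. A minimal sketch (the dictionary-based $Q$ and the function name are our own illustration, not the paper's implementation):

```python
def td_error(Q, transition, gamma):
    """SARSA TD-error: delta_t = r_t + gamma * Q(s', a') - Q(s, a).

    `Q` maps (state, action) pairs to estimated values; `transition` is one
    (s, a, r, s', a') tuple, matching the interactions stored in the dataset.
    """
    s, a, r, s_next, a_next = transition
    return r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
```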

One of the main challenges facing researchers is that most realistic domains involve large state spaces and continuous state variables. Function approximators have been used as machinery to overcome these obstacles, enabling an agent to generalize its experience so that it can act appropriately in states it may never have encountered during training. Linear function approximators, which are the focus of this paper, have been favored for their theoretical properties and low computational complexity (Sutton, 1996; Tsitsiklis and Van Roy, 1997; Geramifard et al., 2006). Using linear function approximation, $Q^\pi(s, a)$ is approximated by $w^\top \phi(s, a)$, where $\phi : S \times A \to \mathbb{R}^n$ is the mapping function and $w$ is the weight vector. For simplicity, we call $\phi(s, a)$ the feature vector and each element of the vector a feature.

1. $\gamma = 1$ is only valid for episodic tasks.
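Concretely, the approximation is a dot product between the weight vector and the feature vector, and the greedy policy just maximizes it over actions. A minimal sketch under those definitions (names are illustrative):

```python
import numpy as np

def q_value(w, phi_sa):
    """Q^pi(s, a) ~= w^T phi(s, a), with phi_sa a (sparse) binary vector."""
    return w @ phi_sa

def greedy_action(w, phi, s, actions):
    """Deterministic greedy policy: argmax_a w^T phi(s, a)."""
    return max(actions, key=lambda a: q_value(w, np.asarray(phi(s, a))))
```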

Finding a suitable mapping function is one of the critical elements in obtaining an adept policy. Early studies on random feature generation showed promising directions for expanding the representation using some basic set of features (Sutton and Whitehead, 1993). Representational Policy Iteration (RPI) (Mahadevan, 2005) is another approach for discovering task-independent representations, fusing the theory of smooth functions on a Riemannian manifold with the least-squares method. Another popular line of methods migrated the idea of Cascade-Correlation (Fahlman and Lebiere, 1991) to the reinforcement learning realm using temporal difference learning (Rivest and Precup, 2003), approximate dynamic programming (Girgin and Preux, 2007), and LSPI (Girgin and Preux, 2008). Geramifard et al. (2011) described a method for incrementally adding the feature that maximally reduces TD-error. However, none of these techniques facilitates a regularization scheme by which the designer can incorporate prior knowledge over the set of hypotheses.

From the Bayesian cognitive science community, Goodman et al. (2008) used a grammar-based induction scheme to learn human concepts in a supervised learning setting. In their approach, new concepts (features) were derived from a limited set of initial propositions using a generative grammar. This work motivated us to revisit representational expansion within the RL community from a Bayesian perspective.

3. Our Approach

We adopt a Bayesian approach to finding well-performing policies. The core idea is to find a representation for which, given a dataset, the resulting policy is most likely to be optimal.


Page 5


Linear Function Approximation

[Diagram: state $s$ mapped to features $\phi_1, \phi_2, \ldots, \phi_n$, weighted by $\theta_1, \theta_2, \ldots, \theta_n$.]

$$Q^\pi(s, a) \approx \phi(s, a)^\top \theta$$

Page 6

Challenge

[Figure 1: Graphical models (a) compact, with nodes $D$ and $G$, and (b) detailed, with nodes $D$, $Q$, and $G$. The gray node (Data) is given.]

For this purpose, we define a variable $G \in \{0, 1\}$ that indicates whether a given policy is optimal.² Define $D$ as our observed variable: a set of interactions of the form $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$. Our search for a good representation is then formulated as:

$$\phi^* = \operatorname*{argmax}_{\phi}\, P(\phi \mid G, D). \tag{1}$$

Figure 1-(a) depicts the corresponding graphical model; $D$ is filled to show that it is a known variable. We factor the distribution to break it down into a prior and a likelihood. The solution to the optimization problem in Equation 1 can then be stated as finding the maximum a posteriori (MAP) solution of the following distribution:

$$P(\phi \mid G, D) \propto P(G \mid \phi, D)\, P(\phi \mid D) \propto P(G \mid \phi, D)\, P(\phi). \tag{2}$$

We define a family of representations by forming new features that combine existing features using logical operators. This space is infinite and could lead to very complex representations; we address this issue by defining a prior on representations that favors simplicity, so that concise representations have a higher a priori probability than complex ones. The likelihood function is based on the performance of the representation at obtaining reward. Defining the likelihood requires first computing the value of each state, $Q$, using LSPI. Then the policy $\pi$ is calculated as a function of $Q$. Finally, $G$ probes the quality of the computed policy. This chain of relations is captured in Figure 1-(b).

2. From this point on, we use $G$ instead of $G = 1$ for brevity.

Since inference for Equation 2 means searching an exponential space, our algorithm samples representations from the posterior distribution using the Metropolis-Hastings (MH) algorithm. Because it integrates Metropolis-Hastings with LSPI, we call the algorithm Metropolis-Hastings Policy Iteration (MHPI).
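The slides do not spell out the acceptance rule, but integrating the asymmetric proposal $T$ of Section 3.3.2 into standard Metropolis-Hastings gives the usual corrected ratio. A minimal sketch in log space (the split of the posterior score into likelihood and prior terms follows Equation 2; function names are assumptions):

```python
import math
import random

def mh_accept(log_post_cur, log_post_new, log_T_fwd, log_T_bwd):
    """Metropolis-Hastings acceptance test for an asymmetric proposal.

    log_post_*: log P(G | phi, D) + log P(phi) for the current and the
                proposed representation (Equation 2, up to a constant).
    log_T_fwd:  log T(phi' | phi); log_T_bwd: log T(phi | phi').
    """
    log_ratio = (log_post_new - log_post_cur) + (log_T_bwd - log_T_fwd)
    return math.log(random.random()) < min(0.0, log_ratio)
```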

3.1. Space of Representations

In this work, we assume the presence of a set of primitive binary features. Starting with the initial set of features, the system adds extended features, which are logical combinations of primitive features built using the $\wedge$ and $\vee$ operators. Each representation is maintained as a directed acyclic graph (DAG), where nodes are features and edges are logical operators. We use sparse binary features (i.e., feature vectors with very few non-zero values) to reduce the computational complexity of learning methods, a common practice within the RL community (Bowling and Veloso, 2002; Sherstov and Stone, 2005). While adding negation ($\neg$) to the set of operators would expand the hypothesis space, it eliminates the sparsity characteristic, so we do not include it in the space of representations. Figure 2 (left) shows an example feature set, where each feature is marked with its corresponding index. This set has 8 features: 6 are primitive, and the 2 extended features are $f_8 = f_4 \wedge f_6$ and $f_7 = f_2 \vee f_3$. Notice that more complex extended features can be built on top of existing extended features (e.g., $f_9 = f_1 \wedge f_7$); a sketch of this construction follows.
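Because features are binary, the $\wedge$ and $\vee$ operators are just element-wise AND/OR on feature columns. A minimal sketch using the indices of Figure 2 (the example feature values are invented):

```python
import numpy as np

# Primitive binary features f1..f6, each evaluated on a batch of two states.
f = {i: np.array(v, dtype=bool)
     for i, v in enumerate([[1, 0], [0, 1], [1, 1], [1, 0], [0, 0], [1, 0]],
                           start=1)}

# Extended features from Figure 2: f7 = f2 v f3, f8 = f4 ^ f6,
# and the deeper f9 = f1 ^ f7 built on top of an extended feature.
f[7] = f[2] | f[3]
f[8] = f[4] & f[6]
f[9] = f[1] & f[7]
```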

To discourage representations with complex structures, we adopt the following Poisson distribution, akin to the work of Goodman et al. (2008), to mathematically encourage conciseness:

$$P(\phi) \propto \prod_{i=1}^{n} \frac{\lambda^{d_i}\, e^{-\lambda}}{d_i!},$$

where $n$ is the total number of features and $d_i$ is the depth of feature $f_i$ in the DAG structure. Lower values of $\lambda > 0$ make complex DAG structures less and less likely.
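In log space the (unnormalized) prior is a sum of Poisson log-densities of the feature depths. A minimal sketch, assuming `depths` lists $d_i$ for every feature in the DAG:

```python
import math

def log_prior(depths, lam):
    """Unnormalized log P(phi): sum_i [d_i * log(lam) - lam - log(d_i!)]."""
    return sum(d * math.log(lam) - lam - math.lgamma(d + 1) for d in depths)

# With lam = 0.01 (the paper's setting), one extra level of depth costs
# roughly log(0.01) ~ -4.6 in log-prior, so deeper DAGs are heavily penalized.
print(log_prior([1, 1, 1], 0.01) - log_prior([1, 1, 2], 0.01))  # ~ 5.3
```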

Good Representation ($\phi$)


Good Policy ($\pi$)


Good Value Function ($Q$)

Our focus


Page 7

Approach

[Diagram: Observed Data Samples ($D$) and the Representation ($\phi$) determine the Value Function ($Q$), which yields the Policy; $G \in \{0, 1\}$ indicates whether the policy is good.]


Page 8


Approach

[Diagram from Page 7 repeated.]

Ideally:

$$\phi^* = \operatorname*{argmax}_{\phi}\, P(\phi \mid G, D), \qquad G \in \{0, 1\}$$

(using $G$ instead of $G = 1$ for brevity)

Page 9


Problem: the space of representations is big!

[Figure 2: Representation of primitive features (1-6) and extended features (7-9, joined by $\wedge$/$\vee$ edges), and the possible outcomes of the propose function: Add, Mutate, Remove.]

3.3.2. The Transition Probability Function ($T$)

When the transition probability function is symmetric, it can be dropped from the MH algorithm, as it cancels out during the calculation. In our setting, however, $T$ is not symmetric. Given that hypothesis $\phi$ has $p$ primitive features, $e$ extended features, and $h$ header features, and that representation $\phi'$ is proposed by taking action $a$ from $\phi$, the transition probability function is defined as:

$$T(\phi' \mid \phi) = \begin{cases} \dfrac{2}{(p+e)(p+e-1)} & a = \text{add} \\[6pt] 1/h & a = \text{remove} \\[6pt] 1/e & a = \text{mutate} \end{cases}$$
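Read off the case definition, the proposal probability is straightforward to compute; the add case is consistent with a proposal that picks an unordered pair among the $p + e$ existing features (our reading of the reconstructed term):

```python
def proposal_prob(action, p, e, h):
    """T(phi' | phi) when `action` was applied to phi.

    p: # primitive features, e: # extended features, h: # header features.
    """
    if action == "add":       # pick one unordered pair of existing features
        return 2.0 / ((p + e) * (p + e - 1))
    if action == "remove":    # pick one of the h header features to drop
        return 1.0 / h
    if action == "mutate":    # pick one of the e extended features to alter
        return 1.0 / e
    raise ValueError(f"unknown action: {action}")
```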

4. Empirical Results

In this section, we investigate the performance of MHPI in three domains: a maze, BlocksWorld, and the inverted pendulum problem. For each domain, samples were gathered by the SARSA algorithm (Rummery and Niranjan, 1994) using the initial feature representation, with learning rates generated from the following series:

$$\alpha_t = \alpha_0\, \frac{N_0 + 1}{N_0 + (\text{Episode}\#)^{1.1}},$$

where N0 was set to 100, and ↵0 was initialized at1 due to the short amount of interaction. For explo-ration, we chose the ✏-greedy approach with ✏ = .1

(i.e., 10% chance of taking a random action on eachtime step). The � parameter of the Poisson distribu-tion was set to 0.01 while ⌘ for the exponential distri-bution was set to 1. The initial representation used for

MH included all basic features. Additionally �(s, a)

was built by copying �(s) vector into the correspond-ing action slot. Therefore �(s, a) has |A| times morefeatures compared to �(s). For LSPI, we limited thenumber of policy iterations to 5, while the value of theinitial state for each policy, V ⇡i

(s0), was evaluated bya single Monte-Carlo run.
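Two of these details are easy to make concrete: the learning-rate series and the construction of $\phi(s, a)$ by copying $\phi(s)$ into the slot of the chosen action. A minimal sketch (the exact code is not in the slides):

```python
import numpy as np

def alpha(episode, alpha0=1.0, N0=100.0):
    """Learning-rate series: alpha_t = alpha0 * (N0 + 1) / (N0 + episode**1.1)."""
    return alpha0 * (N0 + 1.0) / (N0 + episode ** 1.1)

def phi_sa(phi_s, action, num_actions):
    """Copy phi(s) into the block of the chosen action, zeros elsewhere,
    so that phi(s, a) has |A| times as many features as phi(s)."""
    phi_s = np.asarray(phi_s, dtype=float)
    n = phi_s.size
    out = np.zeros(n * num_actions)
    out[action * n:(action + 1) * n] = phi_s
    return out
```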

Maze. Figure 3-(a) shows a simple $11 \times 11$ navigation problem where the initial state is in the top-left corner of the maze and the goal is in the bottom-right corner. Light blue cells indicate blocked areas. The action set consists of one-step moves along the four cardinal directions. Actions are noiseless and possible only if the destination cell is not blocked. The reward is $-0.001$ for all interactions except the move leading to the goal, which has a reward of $+1$. The episodic task terminates when the goal is reached or after 100 steps. There were 22 initial features used for $\phi(s)$, corresponding to the 11 rows and 11 columns of the maze. $\gamma$ was set to 1.
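The 22 initial maze features are simply one-hot row and column indicators. A minimal sketch (the 1-based coordinate convention follows the text; everything else is an assumption):

```python
import numpy as np

def maze_phi(x, y, size=11):
    """phi(s) for the 11 x 11 maze: a row-indicator block followed by a
    column-indicator block, i.e., 22 sparse binary features in total.

    x is the row number and y the column number, both 1-based.
    """
    phi = np.zeros(2 * size)
    phi[x - 1] = 1.0          # row indicator
    phi[size + y - 1] = 1.0   # column indicator
    return phi

# An extended feature such as (X = 2 ^ Y = 8) is then the conjunction of
# two primitive indicators: phi[1] AND phi[size + 7].
```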

We used 200 samples over 2 episodes, gathered in the domain using SARSA. The agent reached the goal in the first episode by following the top-right corner of the middle blocked square. The second episode failed, as the agent got stuck behind the blocked area at the bottom. Figure 3-(b) shows the distribution of representation sizes sampled over 1,000 iterations of the MH algorithm, while Figure 3-(c) shows the corresponding performance of the sampled representations. The distribution, together with the performance measure, suggests that a desirable representation should have 3 extended features. After 100 iterations, all sampled hypotheses were expressive enough to solve the task. It is interesting to see Occam's razor at work throughout the process: the MH algorithm spent most of its time exploring hypotheses with 3 extended features, while more complicated representations were of less interest since they provided the same performance (i.e., likelihood) yet had a lower prior.

Figure 3-(d) shows the value function (green indicates positive, white represents zero, and red stands for blocked areas) and the corresponding policy (arrows) for the best-performing representation. This representation had 3 extended features: $(X = 2 \wedge Y = 11)$, $(X = 3 \wedge Y = 6)$, and $(X = 2 \wedge Y = 8)$, where $X$ is the row number and $Y$ is the column number. Notice that the policy successfully guides the agent from the starting point to the goal along the shortest path.


Approach

Logical combinations of primitive features such as $f_8 = f_4 \wedge f_6$

Page 10


Problem: the space of representations is big!

Insight:

[Figure 2 repeated: primitive and extended features and the possible outcomes of the propose function (Add, Mutate, Remove).]


Approach

Logical combinations of primitive features such as $f_8 = f_4 \wedge f_6$

Page 11


PriorLikelihood

9

Approach

Page 12: A Bayesian Approach to Finding Compact Representations for ...people.csail.mit.edu/agf/Files/12EWRL-MHPI-Slides.pdffor sequential decision making in which an agent in-teracts with

Figure 1: Graphical models (a) compact and (b) detailed. The gray node (Data) is given.

Our goal is to find the representation φ for which, given a dataset, the resulting policy is most likely to be optimal. For this purpose, we define a variable G ∈ {0, 1} that indicates whether a given policy is optimal.² Define D as our observed variable: a set of interactions of the form (s_t, a_t, r_t, s_{t+1}, a_{t+1}). Hence our search for a good representation is formulated as:

φ* = argmax_φ P(φ | G, D).   (1)

Figure 1-(a) depicts the corresponding graphical model. D is filled to show that it is a known variable. We factor the distribution in order to break it down into a prior and a likelihood. The solution to our optimization problem in Equation 1 can then be stated as finding the Maximum-a-Posteriori (MAP) solution to the following distribution:

P(φ | G, D) ∝ P(G | φ, D) P(φ | D)
            ∝ P(G | φ, D) P(φ).   (2)

We define a family of representations by forming new features that are combinations of existing features using logical operators. This space is infinite and could lead to very complex representations; we address this issue by defining a prior on representations that favors simplicity, so that concise representations have a higher a priori probability than complex ones. The likelihood function is based on the performance of the representation at obtaining reward. Defining the likelihood requires first computing the value of each state, Q, using LSPI. The policy π is then calculated as a function of Q. Finally, G probes the quality of the computed policy. This chain of relations is captured in Figure 1-(b).

2. From this point on, we write G instead of G = 1 for brevity.

Since inference for Equation 2 amounts to searching an exponential space, our algorithm samples representations from the posterior distribution using the Metropolis-Hastings algorithm. Because it integrates Metropolis-Hastings with LSPI, we call the algorithm Metropolis-Hastings Policy Iteration (MHPI).

3.1. Space of Representations

In this work, we assume the presence of a set of primitive binary features. Starting with this initial set, the system adds extended features, which are logical combinations of existing features built using the ∧ and ∨ operators. Each representation is maintained as a directed acyclic graph (DAG), where nodes are features and edges are logical operators. We use sparse binary features (i.e., feature vectors with very few non-zero values) to reduce the computational complexity of learning methods, a common practice within the RL community (Bowling and Veloso, 2002; Sherstov and Stone, 2005). While adding negation (¬) to the set of operators would expand the hypothesis space, it eliminates the sparsity characteristic, so we do not include it in the space of representations. Figure 2 (left) shows an example feature set, where each feature is marked with its corresponding index: 8 features in total, of which 6 are primitive and 2 are extended (f8 = f4 ∧ f6 and f7 = f2 ∨ f3). Notice that more complex extended features can be built on top of existing extended features (e.g., f9 = f1 ∧ f7). To discourage representations with complex structures, we adopted the following Poisson distribution, akin to the work of Goodman et al. (2008), to mathematically encourage conciseness:
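To make the DAG representation concrete, here is a minimal sketch of one way such a feature set could be stored and evaluated. This is our illustration, not the paper's code; `Feature`, `evaluate`, and the convention that primitives sit at depth 0 are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Feature:
    """A node in the representation DAG: primitives have no children;
    extended features combine two existing features with 'and' / 'or'."""
    index: int
    op: Optional[str] = None                    # None marks a primitive
    children: Optional[Tuple[int, int]] = None  # operand feature indices

    def depth(self, features: Dict[int, "Feature"]) -> int:
        # The depth d_i used by the prior below; primitives at depth 0 (assumed).
        if self.children is None:
            return 0
        return 1 + max(features[c].depth(features) for c in self.children)

def evaluate(f: Feature, features: Dict[int, "Feature"], active: set) -> bool:
    """Value of feature f given the set of active primitive indices."""
    if f.children is None:
        return f.index in active
    a = evaluate(features[f.children[0]], features, active)
    b = evaluate(features[f.children[1]], features, active)
    return (a and b) if f.op == "and" else (a or b)

# The example from the text: f7 = f2 OR f3 and f8 = f4 AND f6.
features = {i: Feature(i) for i in range(1, 7)}
features[7] = Feature(7, "or", (2, 3))
features[8] = Feature(8, "and", (4, 6))
print(evaluate(features[8], features, active={4, 6}))  # True
print(features[8].depth(features))                     # 1
```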

P(φ) ∝ ∏_{i=1}^{n} λ^{d_i} e^{−λ} / d_i! ,

where n is the total number of features and d_i is the depth of feature f_i in the DAG structure. Lower values of λ > 0 make complex DAG structures less and less likely.
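In practice one would work with the log of this unnormalized prior. A small sketch under the same assumptions, using λ = 0.01 (the value chosen in the experiments below) and `math.lgamma` to compute log d_i!:

```python
import math

def log_prior(depths, lam=0.01):
    """Unnormalized log P(phi): sum_i [d_i log(lam) - lam - log(d_i!)]."""
    return sum(d * math.log(lam) - lam - math.lgamma(d + 1) for d in depths)

# A deeper DAG is penalized much more heavily for small lambda:
print(log_prior([0, 0, 0, 1]))  # shallow representation
print(log_prior([0, 0, 2, 3]))  # deeper structure, far lower log-prior
```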


Likelihood: find the best policy π given φ and D (we used LSPI; Lagoudakis et al., 2003), then score the representation by

P(G | φ, D) ∝ e^{η V^π(s_0)}.


A well-performing policy is more likely to be a good policy! To score a candidate, trajectories are simulated to estimate V^π(s_0).
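A sketch of how this unnormalized log-likelihood might be computed. Here `run_lspi` and `rollout_return` are hypothetical stand-ins for the LSPI solver and a single Monte-Carlo rollout from the initial state s_0:

```python
def log_likelihood(phi, data, run_lspi, rollout_return, eta=1.0, n_rollouts=1):
    """Unnormalized log P(G | phi, D) = eta * V^pi(s0).

    run_lspi(phi, data) -> policy      (stand-in for LSPI)
    rollout_return(policy) -> float    (return of one Monte-Carlo rollout
                                        starting from s0)
    """
    pi = run_lspi(phi, data)  # best policy found under representation phi
    v_s0 = sum(rollout_return(pi) for _ in range(n_rollouts)) / n_rollouts
    return eta * v_s0
```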


Prior: representations with fewer features are more likely, and representations with simpler features are more likely (Goodman et al., 2008).
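Putting the two factors of Equation 2 together, the quantity scored during inference is the unnormalized log-posterior. A sketch reusing the `log_prior` and `log_likelihood` helpers sketched earlier (still illustrative, not the authors' implementation):

```python
def log_posterior(phi, data, run_lspi, rollout_return, eta=1.0, lam=0.01):
    """Unnormalized log P(phi | G, D) of Equation 2:
    performance term (likelihood) plus simplicity term (prior)."""
    depths = [f.depth(phi) for f in phi.values()]  # phi: dict of Feature nodes
    return (log_likelihood(phi, data, run_lspi, rollout_return, eta=eta)
            + log_prior(depths, lam=lam))
```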


Inference: use Metropolis-Hastings (MH) to sample from the posterior. This is Markov chain Monte Carlo: propose φ → φ′, then accept it probabilistically based on the posterior. MH + LSPI = MHPI.
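A minimal sketch of that loop, assuming a `propose` function that returns a candidate φ′ together with the forward and reverse proposal probabilities T(φ′|φ) and T(φ|φ′) (needed because, as Section 3.3.2 explains, T is not symmetric here), and a `log_post` such as the one sketched above:

```python
import math
import random

def mhpi(phi0, log_post, propose, n_iters=1000):
    """The MH half of MHPI: a random walk over representations.

    propose(phi) -> (phi_new, t_fwd, t_rev), where t_fwd = T(phi'|phi)
    and t_rev = T(phi|phi'); log_post is the unnormalized log-posterior
    of Equation 2.
    """
    phi, lp = phi0, log_post(phi0)
    samples = []
    for _ in range(n_iters):
        phi_new, t_fwd, t_rev = propose(phi)
        lp_new = log_post(phi_new)
        # Hastings ratio: posterior ratio times reverse/forward proposal
        # probabilities (T does not cancel because it is asymmetric).
        log_alpha = (lp_new - lp) + math.log(t_rev) - math.log(t_fwd)
        if random.random() < math.exp(min(0.0, log_alpha)):
            phi, lp = phi_new, lp_new  # accept the proposal
        samples.append(phi)            # otherwise keep the current sample
    return samples
```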


Figure 2: Representation of primitive and extended features and the possible outcomes of the propose function (add, mutate, remove).

3.3.2. The Transition Probability Function (T)

When the transition probability function is symmetric, it can be removed from the MH algorithm, as it cancels out during the calculation. In our setting, however, T is not symmetric. Given that hypothesis φ has p primitive features, e extended features, and h header features, and that representation φ′ is proposed by taking action a from φ, the transition probability function is defined as:

T(φ′ | φ) =
    2 / ((p + e)(p + e − 1))   if a = add
    1/h                        if a = remove
    1/e                        if a = mutate
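Read as code, one plausible rendering of this function is the following sketch. The add case reflects our reading of the reconstructed equation, namely that an add chooses one of the (p+e)(p+e−1)/2 unordered pairs of existing features:

```python
def transition_prob(action, p, e, h):
    """T(phi'|phi) when phi' was produced from phi by `action`, for a
    hypothesis with p primitive, e extended, and h header features."""
    if action == "add":
        # our reading: pick one of the (p+e)(p+e-1)/2 unordered pairs
        # of existing features to combine into a new extended feature
        return 2.0 / ((p + e) * (p + e - 1))
    if action == "remove":
        return 1.0 / h   # any of the h header features may be removed
    if action == "mutate":
        return 1.0 / e   # any of the e extended features may be mutated
    raise ValueError(f"unknown proposal action: {action}")
```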

4. Empirical Results

In this section, we investigate the performance of MHPI in three domains: maze, BlocksWorld, and the inverted pendulum. For each domain, samples were gathered by the SARSA algorithm (Rummery and Niranjan, 1994) using the initial feature representation, with learning rates generated from the following series:

α_t = α_0 (N_0 + 1) / (N_0 + Episode#^{1.1}),

where N_0 was set to 100 and α_0 was initialized at 1 due to the short amount of interaction. For exploration, we chose the ε-greedy approach with ε = 0.1 (i.e., a 10% chance of taking a random action on each time step). The λ parameter of the Poisson distribution was set to 0.01, while η for the exponential distribution was set to 1. The initial representation used for MH included all basic features. Additionally, φ(s, a) was built by copying the φ(s) vector into the corresponding action slot; φ(s, a) therefore has |A| times more features than φ(s). For LSPI, we limited the number of policy iterations to 5, while the value of the initial state for each policy, V^{π_i}(s_0), was evaluated by a single Monte-Carlo run.
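As a worked example of this learning-rate schedule (a sketch; `episode` is the 1-indexed episode count):

```python
def alpha(episode, alpha0=1.0, n0=100):
    """Decaying SARSA learning rate used to gather the samples."""
    return alpha0 * (n0 + 1) / (n0 + episode ** 1.1)

print(alpha(1))    # 1.0 on the first episode
print(alpha(100))  # ~0.39 after one hundred episodes
```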

Maze. Figure 3-(a) shows a simple 11 × 11 navigation problem where the initial state is in the top left corner of the maze and the goal is in the bottom right corner. Light blue cells indicate blocked areas. The action set consists of one-step moves along the four cardinal directions. Actions are noiseless and possible only if the destination cell is not blocked. The reward is −0.001 for all interactions except the move leading to the goal, which yields +1. The episodic task terminates when the goal is reached or after 100 steps. There were 22 initial features for φ(s), corresponding to the 11 rows and 11 columns of the maze. γ was set to 1.

We used 200 samples from 2 episodes, gathered in the domain using SARSA. The agent reached the goal in the first episode by following the top right corner of the middle blocked square; the second episode failed as the agent got stuck behind the blocked area on the bottom. Figure 3-(b) shows the distribution of the representation sizes sampled along 1,000 iterations of the MH algorithm, while Figure 3-(c) shows the corresponding performance of the sampled representations. Together, the distribution and the performance measure suggest that a desirable representation should have 3 extended features. After 100 iterations, all sampled hypotheses were expressive enough to solve the task. It is interesting to see how Occam's Razor is carried through the whole process: the MH algorithm spent most of its time exploring hypotheses with 3 extended features, while more complicated representations were of less interest, as they provided the same performance (i.e., likelihood) yet had lower prior probability.

Figure 3-(d) shows the value function (green indicates positive, white represents zero, and red stands for blocked areas) and the corresponding policy (arrows) for the best performing representation. This representation had 3 extended features: (X = 2 ∧ Y = 11), (X = 3 ∧ Y = 6), and (X = 2 ∧ Y = 8), where X is the row number and Y is the column number. Notice that the policy successfully guides the agent from the starting point to the goal along the shortest path.


Figure 3: Maze domain empirical results. (a) Domain; (b) posterior distribution (# of samples vs. # of extended features); (c) sampled performance (# of steps to the goal vs. MH iteration); (d) resulting policy.

Figure 4: BlocksWorld (1,000 initial samples; initial features on(A,B); 20% chance of dropping the block). (a) Domain (start and goal configurations); (b) posterior distribution (# of samples vs. # of extended features); (c) sampled performance (# of steps to make the tower vs. MH iteration); (d) performance distribution (# of steps to make the tower vs. # of extended features).

Figure 5-(c) shows the sampled performance along each iteration. Unlike the other domains, where more features often helped performance early on, in this domain irrelevant features dropped the performance, leading MH to reject them. This process took a while, until interesting features started to emerge. This effect is usually avoided by setting a burn-in value that discards a limited number of initial samples in the MH setting; we include this data to highlight the fact that arbitrarily expanding the representation does not necessarily improve performance in light of limited data. Figure 5-(d) shows the performance of the representations as a function of the number of extended features. In our experiments, while adding most extended features hurt performance, the extended feature (−π/21 ≤ θ < 0) ∧ (0.4 ≤ θ̇ < 0.6) enabled the agent to complete the task successfully. This feature identifies an intuitive and frequently visited situation in which the pendulum is almost balanced with a velocity in the opposite direction. This is a very interesting result because, out of all possible correlations among the initial features (21 × 21), capturing this one intuitive feature made the task solvable with a very limited amount of data.

In our work, we found that the adjustment of priors played a critical role in the success of MHPI, as priors compete against the performance of the resulting policies. We also found MHPI to be robust in stochastic domains: for example, adding 20% noise to the movement of the agent in the maze domain did not change the performance noticeably.

5. Conclusion

This paper introduced a Bayesian approach for finding concise yet expressive representations for solving MDPs. We introduced MHPI, a new RL technique that builds well-performing representations from a limited number of simple features. Our approach uses a prior distribution that encourages representation simplicity and a likelihood function based on LSPI that encourages representations leading to capable policies; MHPI samples representations from the resulting posterior distribution. Although the idea of MHPI is general, in our implementation we narrowed the representation space to DAG structures over primitive binary features. The empirical results show that MHPI finds simple yet effective representations for three classical RL problems.

There are several immediate extensions to this work. In our implementation, we excluded the samples generated during the performance test in order to take advantage of caching old representation evaluations; one could use such samples along the way, while being aware of the increase in runtime complexity. Another extension is to relax the need for the simulation box in LSPI by measuring the performance using off-policy evaluation.

Maze200 Initial SamplesInitial features: row and column indicatorsNoiseless Actions: →,←,↓,↑

11

MHPI Iteration

Page 19: A Bayesian Approach to Finding Compact Representations for ...people.csail.mit.edu/agf/Files/12EWRL-MHPI-Slides.pdffor sequential decision making in which an agent in-teracts with

BlocksWorld1000 Initial SamplesInitial features: on(A,B)20% noise of dropping the block

12

(a) Domain

0 1 2 3 4 5 6 7 8 90

100

200

300

400

500

600

700

800

900

1000

# of Extended Features#

of S

ampl

es

(b) Posterior Distribution

0 200 400 600 800 100010

20

30

40

50

60

70

80

90

100

Steps

Iteration

# of

Ste

ps to

the

Goa

l

MH Iteration

(c) Sampled Performance1 2 3 4 5 6 7 8 9 10 11

1234567891011

(d) Resulting PolicyFigure 3: Maze domain empirical results

start

goal

(a) Domain

# of

Sam

ples

0 1 2 3 4 5 6 7 8 9 10 11 120

20

40

60

80

100

120

140

160

180

200

# of Extended Features

(b) Posterior Distribution

0 200 400 600 800 10000

10

20

30

40

50

60

70

80

90

100

Steps

Iteration#

of S

teps

to M

ake

the

Tow

erMH Iteration

(c) Sampled Performance# of Extended Features

2.75

3.75

4.75

5.75

6.75

7.75

1 2 3 4 5 6 7 8 9 10 11 12# of

Ste

ps to

Mak

e th

e To

wer

(d) Performance Dist.Figure 4: BlocksWorld

mance along each iteration. Unlike other domainsthat more features often helped the performance earlyon. In this domain irrelevant features dropped theperformance resulting in MH to reject them. Thisprocess took a while until interesting features startedto emerge. This effect is usually avoided by settinga burn-in value discarding limited number of initialsamples in the MH setting. Yet we added this data tohighlight the fact that expanding the representation ar-bitrary does not necessarily improve the performancein light of limited data. Figure 5-(d) shows the perfor-mance of the representations based on the number ofextended features. In our experiments, while addingmost extended features hurt the performance, the ex-tended feature (� ⇡

21 ✓ < 0) ^ (0.4 ˙

✓ < 0.6)

enabled the agent to complete the task successfully.This feature identifies an intuitive situation where thependulum is almost balanced with a velocity on theopposite direction, which would be often visited. Thisis very interesting results, because out of all possiblecorrelations among the initial features (21⇥ 21), cap-turing one such intuitive feature made the task solv-able with very limited amount of data.

In our work, we found that the adjustment of priors played a critical role in the success of MHPI, as the prior competes against the performance of the resulting policies. We also found MHPI to be robust in stochastic domains: for example, adding 20% noise to the movement of the agent in the maze domain did not change the performance noticeably.

5. Conclusion

This paper introduces a Bayesian approach for finding concise yet expressive representations for solving MDPs. We introduced MHPI, a new RL technique that builds well-performing representations out of simple features using a limited number of samples. Our approach uses a prior distribution that encourages representation simplicity and a likelihood function, based on LSPI, that encourages representations leading to capable policies; MHPI samples representations from the resulting posterior distribution. Although the idea of MHPI is general, in our implementation we narrowed the representation space to DAG structures over primitive binary features. The empirical results show that MHPI finds simple yet effective representations for three classical RL problems.
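As a rough illustration of this sampling scheme (a generic sketch under our own assumptions, not the paper's pseudocode), the Metropolis-Hastings loop over representations can be written with the simplicity prior and the LSPI-based performance score supplied as callables:

import math
import random

def mh_over_representations(initial_rep, propose, log_prior,
                            log_likelihood, iterations=1000):
    # propose(rep)        -> a candidate representation (assumed symmetric)
    # log_prior(rep)      -> e.g., -lambda * (number of features in rep)
    # log_likelihood(rep) -> score of the policy that LSPI finds under rep
    rep = initial_rep
    score = log_prior(rep) + log_likelihood(rep)
    samples = [rep]
    for _ in range(iterations):
        cand = propose(rep)
        cand_score = log_prior(cand) + log_likelihood(cand)
        # Standard MH accept/reject for a symmetric proposal.
        if random.random() < math.exp(min(0.0, cand_score - score)):
            rep, score = cand, cand_score
        samples.append(rep)
    return samples

Caching old representation evaluations, as mentioned below, amounts to memoizing log_likelihood.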

There are immediate extensions to this work. In our implementation, we excluded the samples generated during the performance test in order to take advantage of caching old representation evaluations; one could use such samples along the way, while being aware of the resulting increase in runtime complexity. Another extension is to relax the need for the simulation box in LSPI by measuring the performance using off-policy evaluation techniques such as importance sampling (Sutton and Barto, 1998) and model-free Monte Carlo (Fonteneau et al., 2010).
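For the off-policy direction, a per-trajectory (ordinary) importance-sampling estimate of V^π(s_0), in the spirit of Sutton and Barto (1998), could look like the following sketch (our own illustration; pi and behavior are assumed to return action probabilities):

def is_value_estimate(trajectories, pi, behavior, gamma=1.0):
    # trajectories: list of [(s, a, r), ...] collected under `behavior`.
    # Each trajectory's discounted return is reweighted by the product
    # of pi(a, s) / behavior(a, s) over the actions it contains.
    estimates = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            weight *= pi(a, s) / behavior(a, s)
            ret += discount * r
            discount *= gamma
        estimates.append(weight * ret)
    return sum(estimates) / len(estimates)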


[Figure 5: Inverted pendulum. (a) Domain (state θ, θ̇; control torque τ); (b) posterior distribution (# of samples vs. # of extended features); (c) performance (# of balancing steps vs. MH iteration); (d) performance distribution (# of balancing steps for 0, 1, and >1 extended features).]


References

M. Bowling and M. Veloso. Scalable learning in stochastic games, 2002.
Michael Bowling, Alborz Geramifard, and David Wingate. Sigma point policy iteration. In AAMAS '08: Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 379–386, 2008.
S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
Scott E. Fahlman and Christian Lebiere. The Cascade-Correlation Learning Architecture, 1991.
Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Model-free Monte Carlo-like policy evaluation. Journal of Machine Learning Research - Proceedings Track, 9:217–224, 2010.
Alborz Geramifard, Michael Bowling, and Richard S. Sutton. Incremental least-squares temporal difference learning. In The Twenty-First National Conference on Artificial Intelligence (AAAI), pages 356–361, 2006.
Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, and Jonathan How. Online discovery of feature dependencies. In Lise Getoor and Tobias Scheffer, editors, International Conference on Machine Learning (ICML), pages 881–888, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.
Sertan Girgin and Philippe Preux. Feature discovery in reinforcement learning using genetic programming. Research Report RR-6358, INRIA, 2007.
Sertan Girgin and Philippe Preux. Basis function construction in reinforcement learning using cascade-correlation learning architecture. In ICMLA '08: Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, pages 75–82, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3495-4. doi: 10.1109/ICMLA.2008.24.
N. D. Goodman, J. B. Tenenbaum, T. L. Griffiths, and J. Feldman. Compositionality in rational analysis: Grammar-based induction for concept learning. In M. Oaksford and N. Chater, editors, The Probabilistic Mind: Prospects for Bayesian Cognitive Science, 2008.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, April 1970. doi: 10.1093/biomet/57.1.97.
Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
Sridhar Mahadevan. Representation policy iteration. In Proceedings of the 21st International Conference on Uncertainty in Artificial Intelligence, 2005.
François Rivest and Doina Precup. Combining TD-learning with cascade-correlation networks. In Proceedings of the Twentieth International Conference on Machine Learning, pages 632–639. AAAI Press, 2003.
G. A. Rummery and M. Niranjan. Online Q-learning using connectionist systems (Tech. Rep. No. CUED/F-INFENG/TR 166). Cambridge University Engineering Department, 1994.
Alexander A. Sherstov and Peter Stone. Function approximation via tile coding: Automating parameter choice. In J.-D. Zucker and I. Saitta, editors, SARA 2005, volume 3607 of Lecture Notes in Artificial Intelligence, pages 194–205. Springer Verlag, Berlin, 2005.
Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pages 1038–1044. The MIT Press, 1996.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Richard S. Sutton and Steven D. Whitehead. Online learning with random representations. In Proceedings of the Tenth International Conference on Machine Learning, pages 314–321. Morgan Kaufmann, 1993.
Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553501.
John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, May 1992. doi: 10.1007/BF00992698.

Inverted Pendulum
- 1000 initial samples
- Initial features: θ and θ̇, each discretized into 21 buckets
- Gaussian noise added to torque values
- Many proposed representations were rejected initially
- Key feature: (−π/21 ≤ θ < 0) ∧ (0.4 ≤ θ̇ < 0.6)


Contributions
- Introduced a Bayesian approach for finding concise yet expressive representations for solving MDPs.
- Introduced MHPI, a new RL technique that expands the representation using limited samples.
- Empirically demonstrated the effectiveness of our approach in 3 domains.


Future Work
- Reuse the data for estimating V^π(s_0) for policy iteration.
- Relax the need of a simulator to generate trajectories: importance sampling [Sutton and Barto, 1998]; model-free Monte Carlo [Fonteneau et al., 2010].