A Bayesian Approach to Finding Compact Representations for Reinforcement Learning

Alborz Geramifard†, Stefanie Tellex*, David Wingate†, Nicholas Roy*, Jonathan How†
†{AGF,WINGATED,JHOW}@MIT.EDU
*{STEFIE10,NICKROY}@CSAIL.MIT.EDU

Abstract

Feature-based function approximation methods have been applied to reinforcement learning to learn policies in a data-efficient way, even when the learner may not have visited all states during training. For these methods to work, it is important to identify the right set of features in order to reduce over-fitting, enable efficient learning, and provide insight into the structure of the problem. In this paper, we propose a Bayesian method for reinforcement learning that finds a policy by identifying a compact set of high-performing features. Empirical results in classic RL domains demonstrate that our algorithm learns concise representations which focus representational resources on regions of the state space that are necessary for good performance.

Keywords: reinforcement learning, representation learning, policy iteration, Metropolis-Hastings

1. Introduction

For many learning problems, the choice of representation is critical to an algorithm's performance: a good representation can make learning easy, but a poor one can make learning impossible. In reinforcement learning (RL), this problem is most pronounced when designing function approximators to solve complex problems: underpowered representations require only a small number of samples to fit, yet they often lead to poor policies. Expressive representations provide good approximations, yet they often require a plethora of training data, and the learned representation may not provide insight into the structure of the problem. Hence, finding a concise set of features that leads to a good approximation is of high importance when samples and computation are limited.

While many function approximators have been proposed, linear value function approximators have enjoyed special attention because of a combination of analytic tractability and good empirical performance (Sutton, 1988; Bradtke and Barto, 1996; Lagoudakis and Parr, 2003; Bowling et al., 2008; Sutton et al., 2009). In any linear architecture, a function f(x) is approximated by a linear combination of features: f(x) ≈ wᵀφ(x), where φ(x) is known as a feature extractor and w is a corresponding weight vector. In order for the system to avoid over-fitting and generalize to previously unseen states, a compact set of predictive features must be identified. Previous approaches have identified heuristics for expanding the feature representation (Geramifard et al., 2011; Sutton and Whitehead, 1993; Mahadevan, 2005; Fahlman and Lebiere, 1991), but these methods are not biased towards learning a compact set of features that captures only the essential elements necessary to solve the planning problem.
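To make the linear architecture concrete, here is a minimal sketch (our own illustration, not code from the paper): when φ(x) is a sparse binary feature vector, the dot product wᵀφ(x) reduces to summing the weights of the active features.

```python
import numpy as np

def linear_value(active_features, w):
    """Evaluate f(x) = w^T phi(x) when phi(x) is binary and sparse.

    active_features: indices i with phi_i(x) = 1 (all other entries are 0).
    w: weight vector with one entry per feature.
    """
    return sum(w[i] for i in active_features)

# Example: six features, of which only features 1 and 4 are active in state x.
w = np.array([0.0, 2.5, -1.0, 0.3, 1.2, 0.0])
print(linear_value([1, 4], w))  # 2.5 + 1.2 = 3.7
```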

This paper proposes a Bayesian approach to feature construction which addresses these concerns. We use the performance of a particular representation as a likelihood term and combine it with a prior that favors simple representations. Inference in the resulting probabilistic model jointly optimizes both terms, identifying a set of features which is simultaneously compact and high-performing. For example, for the inverted pendulum problem, our algorithm identifies a single extra feature which, when added to the representation, enables the system to perform optimally.


2. Background

Reinforcement learning (RL) is a powerful framework for sequential decision making in which an agent interacts with an environment on every time step. The environment is often modeled using a Markov Decision Process (MDP), defined by a tuple (S, A, P^a_{ss'}, R^a_{ss'}, γ), where S is the finite set of states, A is the finite set of actions, P^a_{ss'} is the probability of transitioning from state s to state s' when taking action a, R^a_{ss'} is the corresponding reward along the way, and γ ∈ [0, 1] is a discount factor emphasizing the relative significance of immediate rewards versus future rewards.¹ A trajectory of experience is identified by the sequence s_0, a_0, r_1, s_1, a_1, r_2, ..., where at time i the agent in state s_i took action a_i, received reward r_{i+1}, and transitioned to state s_{i+1}. The behavior of the agent is captured through the notion of a policy π : S × A → [0, 1] governing the probability of taking each action in each state. We limit our attention to deterministic policies mapping each state to one action. The value of a state given policy π is defined as the expected cumulative discounted reward obtained starting from s and following π thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s\right].$$

Similarly, the value of a state-action pair is defined as:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s,\ a_0 = a\right].$$

The objective is to find the optimal policy, defined as:

$$\pi^*(s) = \operatorname*{argmax}_{a} Q^{\pi^*}(s, a).$$

One popular thrust of online reinforcement learning methods, such as SARSA (Rummery and Niranjan, 1994) and Q-Learning (Watkins and Dayan, 1992), tackles the problem by updating the estimated value of a state based on the temporal difference error (TD-error), while acting mostly greedily with respect to the estimated values. The TD-error is defined as

$$\delta_t(Q^\pi) = r_t + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t).$$
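As a concrete illustration (a sketch under our own conventions, not the paper's code), the TD-error above can be computed and used in a tabular SARSA update as follows; `alpha` is a step-size parameter.

```python
from collections import defaultdict

def td_error(Q, s, a, r, s_next, a_next, gamma):
    """delta_t = r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)."""
    return r + gamma * Q[(s_next, a_next)] - Q[(s, a)]

def sarsa_update(Q, s, a, r, s_next, a_next, gamma, alpha):
    """Tabular SARSA: move Q(s_t, a_t) a fraction alpha along the TD-error."""
    Q[(s, a)] += alpha * td_error(Q, s, a, r, s_next, a_next, gamma)

Q = defaultdict(float)  # unseen state-action pairs default to 0
sarsa_update(Q, s=0, a=1, r=-0.001, s_next=2, a_next=0, gamma=1.0, alpha=0.1)
print(Q[(0, 1)])        # -0.0001
```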

One of the main challenges facing researchers is that most realistic domains consist of large state spaces and continuous state variables. Function approximators have been used as machinery to overcome these obstacles, enabling an agent to generalize its experience in order to act appropriately in states it may have never encountered during training. Linear function approximators, which are the focus of this paper, have been favored due to their theoretical properties and low computational complexity (Sutton, 1996; Tsitsiklis and Van Roy, 1997; Geramifard et al., 2006). Using a linear function approximation, Q^π(s, a) is approximated by wᵀφ(s, a), where φ : S × A → ℝⁿ is the mapping function and w is the weight vector. For simplicity, we call φ(s, a) the feature vector and each element of the vector a feature.

1. γ = 1 is only valid for episodic tasks.

Finding a suitable mapping function is one of the critical elements in obtaining an adept policy. Early studies on random feature generation methods showed promising directions for expanding the representation using some basic set of features (Sutton and Whitehead, 1993). Representation Policy Iteration (RPI) (Mahadevan, 2005) is another approach for discovering task-independent representations, fusing the theory of smooth functions on a Riemannian manifold with the least-squares method. Another popular line of work migrated the idea of Cascade-Correlation (Fahlman and Lebiere, 1991) to the reinforcement learning realm using temporal difference learning (Rivest and Precup, 2003), approximate dynamic programming (Girgin and Preux, 2007), and LSPI (Girgin and Preux, 2008). Geramifard et al. (2011) described a method for incrementally adding the feature that maximally reduces the TD-error. However, none of these techniques facilitates a regularization scheme by which the designer incorporates prior knowledge over the set of hypotheses.

From the Bayesian cognitive science community, Goodman et al. (2008) used a grammar-based induction scheme to learn human concepts in a supervised learning setting. In their approach, new concepts (features) were derived from a limited set of initial propositions using a generative grammar. This work motivated us to revisit representational expansion within the RL community from a Bayesian perspective.

3. Our Approach

We adopt a Bayesian approach to finding well-performing policies. The core idea is to find a representation for which, given a dataset, the resulting policy is most likely to be optimal.


[Figure 1: Graphical models, (a) compact (variables D, G) and (b) detailed (variables D, Q, G). The gray node (Data) is given.]

For this purpose, we define a variable G ∈ {0, 1} that indicates whether a given policy is optimal.² Define D as our observed variable, a set of interactions of the form (s_t, a_t, r_t, s_{t+1}, a_{t+1}). Hence our search for a good representation is formulated as:

$$\phi^* = \operatorname*{argmax}_{\phi} P(\phi \mid G, D). \quad (1)$$

Figure 1-(a) depicts the corresponding graphical model. D is filled to show that it is a known variable. We factor the distribution in order to break it down into a prior and a likelihood. The solution to our optimization problem in Equation 1 can be stated as finding the Maximum-a-Posteriori (MAP) solution of the following distribution:

$$P(\phi \mid G, D) \propto P(G \mid \phi, D)\, P(\phi \mid D) \propto P(G \mid \phi, D)\, P(\phi). \quad (2)$$

We define a family of representations by forming new features that are combinations of existing features using logical operators. This space is infinite and could lead to very complex representations; we address this issue by defining a prior on representations that favors simplicity, so that concise representations have a higher a priori probability than complex ones. The likelihood function is based on the performance of the representation at obtaining reward. Defining the likelihood requires first computing the value function Q using LSPI. Then the policy π is calculated as a function of Q. Finally, G probes the quality of the computed policy. This chain of relations is captured in Figure 1-(b).

2. From this point, we use G instead of G = 1 for brevity.

Since inference for Equation 2 requires searching an exponential space, our algorithm samples representations from the posterior distribution using the Metropolis-Hastings algorithm. Because our algorithm integrates Metropolis-Hastings with LSPI, we call it Metropolis-Hastings Policy Iteration (MHPI).

3.1. Space of Representations

In this work, we assume the presence of a set of primitive binary features. Starting with this initial set, the system adds extended features, which are logical combinations of primitive features built using the ∧ and ∨ operators. Each representation is maintained as a directed acyclic graph (DAG) structure, where nodes are features and edges are logical operators. We use sparse binary features (i.e., feature vectors with very few non-zero values) to reduce the computational complexity of learning methods, a common practice within the RL community (Bowling and Veloso, 2002; Sherstov and Stone, 2005). While adding negation (¬) to the set of operators expands the hypothesis space, it eliminates the sparsity characteristic, so we do not include it in the space of representations. Figure 2 (left) shows an example feature set, where each feature is marked with its corresponding index. This set has 8 features, of which 6 are primitive; the 2 extended features are f8 = f4 ∧ f6 and f7 = f2 ∨ f3. Notice that more complex extended features can be built on top of existing extended features (e.g., f9 = f1 ∧ f7). In order to discourage representations with complex structures, we adopted the following Poisson distribution, akin to the work of Goodman et al. (2008), to mathematically encourage conciseness:

$$P(\phi) \propto \prod_{i=1}^{n} \frac{\lambda^{d_i} e^{-\lambda}}{d_i!},$$

where n is the total number of features and d_i is the depth of feature f_i in the DAG structure. Lower values of λ > 0 make complex DAG structures less and less likely.
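A minimal sketch of evaluating this prior in log space (our own helper, not from the paper; how the depth d_i of a primitive feature is counted is an assumption here).

```python
import math

def log_prior(depths, lam):
    """log P(phi) up to an additive constant, with P(phi) ∝ prod_i lam^{d_i} e^{-lam} / d_i!.

    depths: DAG depth d_i of every feature in the representation.
    lam:    Poisson parameter lambda > 0; smaller values penalize deep features more strongly.
    """
    return sum(d * math.log(lam) - lam - math.lgamma(d + 1) for d in depths)

# A representation containing a depth-4 feature is a priori less likely than one with a depth-2 feature.
print(log_prior([1, 1, 1, 2], lam=0.01) > log_prior([1, 1, 1, 4], lam=0.01))  # True
```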


3.2. Likelihood

The likelihood function states the probability that the policy resulting from a fixed representation and set of data is optimal. This term generally favors expressive representations. One approach is to relate the likelihood to the accuracy of the value function after policy iteration.³ The main drawback of this approach is that the likelihood function encourages representations with good value function approximation rather than good resulting policies. In general, a greedy policy with respect to the exact value function is guaranteed to be optimal (Sutton and Barto, 1998), but when the value function is approximated, more accurate value functions do not necessarily yield better policies, because the greedy policy depends on the ranking of the values, not their accuracy. Consequently, we adopt a likelihood function that assigns high values to high-performing policies, based on the reward earned from the initial state:

$$P(G \mid \phi, D) \propto e^{\eta V^{\pi_i}(s_0)}, \quad (3)$$

where η > 0 is the distribution parameter. Higher values of η make well-performing policies more likely to be optimal.

To obtain π_i and V^{π_i}(s_0), we use the least-squares policy iteration (LSPI) (Lagoudakis and Parr, 2003) algorithm, with one modification. It is known that each iteration of LSPI does not necessarily improve the resulting policy. Hence, on each iteration of LSPI, we test the performance of the policy using Monte-Carlo simulations. After all iterations, the highest-performing policy and its performance are returned. Notice that if a simulator is not available, off-policy evaluation techniques such as importance sampling (Sutton and Barto, 1998) and model-free Monte Carlo (Fonteneau et al., 2010) can be used.
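The sketch below shows one way to implement this modified evaluation loop together with the likelihood of Equation 3. It is our own rendering: `lspi_iteration`, `rollout_return`, and `initial_policy` are hypothetical callables standing in for an LSPI step, a Monte-Carlo rollout in a simulator, and a starting policy, none of which are specified by the paper.

```python
def evaluate_representation(phi, data, s0, eta, lspi_iteration, rollout_return,
                            initial_policy, num_iterations=5):
    """Score a representation phi with the likelihood of Equation 3 (in log space).

    lspi_iteration(policy, phi, data) -> the greedy policy after one LSPI iteration.
    rollout_return(policy, s0)        -> Monte-Carlo estimate of V^pi(s0).
    initial_policy(phi)               -> policy used to seed policy iteration.
    """
    best_policy, best_value = None, float("-inf")
    policy = initial_policy(phi)
    for _ in range(num_iterations):
        policy = lspi_iteration(policy, phi, data)
        value = rollout_return(policy, s0)   # check every iterate: LSPI is not monotone
        if value > best_value:
            best_policy, best_value = policy, value
    # log P(G | phi, D) = eta * V^{pi_i}(s0) up to an additive constant
    return best_policy, eta * best_value
```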

3.3. Inference

So far, we have introduced a probability distribution over representations, where representations with the best trade-off between conciseness and expressivity are most likely. While P(φ|G, D) can be evaluated up to a normalizing constant for any given representation, its shape is not known, posing a challenge for direct inference. To mitigate this problem, we use the Metropolis-Hastings (MH) algorithm (Hastings, 1970).

3. Assuming each TD-error δ_i is sampled independently from N(0, σ²), then P(D|φ) = ∏_{i=1}^{|D|} G(δ_i; 0, σ²), where G is the Gaussian probability density function.

Input: φ, propose, T, P(φ|G, D), SampleSize
Output: Samples
foreach i ∈ {1, ..., SampleSize} do
    Samples(i) = φ
    φ' ← propose(φ)
    if rand < min{1, [P(φ'|G, D) T(φ|φ')] / [P(φ|G, D) T(φ'|φ)]} then
        φ ← φ'
return Samples

Algorithm 1: Metropolis-Hastings Policy Iteration

MH is a Markov chain Monte Carlo method for sampling random variables whose underlying distribution is not known in closed form; it is shown in Algorithm 1. The algorithm is seeded with the initial representation φ. The propose function generates a new candidate sample φ' on each iteration. The T function is the transition probability; T(φ'|φ) states the probability of the propose function generating φ' from φ.⁴ The posterior distribution, P(φ|G, D), is calculated using Equation 2. SampleSize states the number of samples generated through MH. On each iteration, the new candidate sample φ' is accepted stochastically based on a probability value that reflects the desirability of the sample. As the number of samples generated through MH goes to infinity, the distribution of the samples converges to P.
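Algorithm 1 translates almost directly into code. The sketch below is our own rendering in log space (for numerical stability); `propose`, `log_T`, and `log_posterior` are supplied by the caller and correspond to the quantities described above.

```python
import math
import random

def mhpi_sample(phi0, propose, log_T, log_posterior, sample_size, rng=random):
    """Metropolis-Hastings over representations (Algorithm 1, written in log space).

    propose(phi)            -> candidate representation phi'
    log_T(phi_to, phi_from) -> log T(phi_to | phi_from), the proposal probability
    log_posterior(phi)      -> log P(phi | G, D) up to an additive constant (Equation 2)
    """
    samples, phi = [], phi0
    for _ in range(sample_size):
        samples.append(phi)
        phi_new = propose(phi)
        log_ratio = (log_posterior(phi_new) + log_T(phi, phi_new)
                     - log_posterior(phi) - log_T(phi_new, phi))
        if rng.random() < math.exp(min(0.0, log_ratio)):  # accept with prob min{1, ratio}
            phi = phi_new
    return samples
```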

3.3.1. The propose Function

As stated in Section 3.1, we focus on sets of features built using Boolean logic. Figure 2 illustrates the three actions used in our propose function: add, mutate, and remove. The add action attaches a node with a random operator in {∧, ∨} connecting two children selected uniformly at random from the existing features (e.g., f9). The second action is to mutate the operator of one of the extended nodes (e.g., f7). Because the representation is maintained as a DAG, the effect of a mutation propagates to the features above the affected node, allowing fast exploration of the representation space. The last action is to remove an extended feature. Note that in order to keep the representation sound at all times, we only remove features at the top of the DAG structure with no parents (i.e., "header" features). On each iteration of MH, the propose method selects one of the three actions with equal probability and generates a candidate representation.
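One possible implementation of propose is sketched below. The data structure is our own choice (features stored in a dict keyed by a stable id, extended features as (operator, child, child) triples); the paper does not prescribe it.

```python
import random

def propose(features, rng=random):
    """Apply add, mutate, or remove (chosen with equal probability) and return a new representation.

    features: dict mapping feature id -> "primitive" or (operator, child_id, child_id)
              with operator in {"and", "or"}. Stable ids mean removing a feature never
              invalidates the child references stored in other features.
    """
    features = dict(features)                         # copy; the current sample stays intact
    extended = [k for k, v in features.items() if v != "primitive"]
    used = {c for k in extended for c in features[k][1:]}
    headers = [k for k in extended if k not in used]  # extended features with no parent

    action = rng.choice(["add", "mutate", "remove"])
    if action == "add":
        i, j = rng.sample(sorted(features), 2)        # two distinct existing features
        features[max(features) + 1] = (rng.choice(["and", "or"]), i, j)
    elif action == "mutate" and extended:
        op, i, j = features[k := rng.choice(extended)]
        features[k] = ("or" if op == "and" else "and", i, j)  # parents see the change via the DAG
    elif action == "remove" and headers:
        del features[rng.choice(headers)]             # only parentless ("header") features
    return features

# Example from Figure 2: six primitives plus f7 = f2 OR f3 and f8 = f4 AND f6.
phi = {i: "primitive" for i in range(1, 7)}
phi.update({7: ("or", 2, 3), 8: ("and", 4, 6)})
print(propose(phi))
```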

4. In the original MH algorithm the convention is to use Q for the transition probability, yet this notation collides with the Q function of the RL framework. Hence we use T.


[Figure 2: Representation of primitive features (1-6) and extended features (7, 8, and a newly added 9), and the possible outcomes of the propose function: add, mutate, and remove.]

3.3.2. The Transition Probability Function (T)

When the transition probability function is symmetric, it cancels out of the MH acceptance ratio and can be dropped. In our setting, however, T is not symmetric. Given that hypothesis φ has p primitive features, e extended features, and h header features, and that representation φ' is proposed by taking action a from φ, the transition probability function is defined as:

$$T(\phi' \mid \phi) = \begin{cases} \frac{2}{(p+e)(p+e-1)} & a = \text{add} \\ \frac{1}{h} & a = \text{remove} \\ \frac{1}{e} & a = \text{mutate} \end{cases}$$
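A direct encoding of T as a sketch (the p, e, and h counts would come from the representation, e.g., the dict used in the propose sketch above; the add case uses the expression as reconstructed here).

```python
def transition_prob(action, p, e, h):
    """T(phi' | phi) when phi has p primitive, e extended, and h header features."""
    n = p + e
    if action == "add":
        return 2.0 / (n * (n - 1))   # 2 / ((p+e)(p+e-1))
    if action == "remove":
        return 1.0 / h               # uniform over header features
    if action == "mutate":
        return 1.0 / e               # uniform over extended features
    raise ValueError(f"unknown action: {action}")
```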

4. Empirical Results

In this section, we investigate the performance of MHPI in three domains: a maze, BlocksWorld, and the inverted pendulum problem. For each domain, samples were gathered by the SARSA (Rummery and Niranjan, 1994) algorithm using the initial feature representation, with learning rates generated from the following series:

$$\alpha_t = \alpha_0 \, \frac{N_0 + 1}{N_0 + \text{Episode}\#^{1.1}},$$

where N_0 was set to 100 and α_0 was initialized to 1 due to the short amount of interaction. For exploration, we chose the ε-greedy approach with ε = 0.1 (i.e., a 10% chance of taking a random action on each time step). The λ parameter of the Poisson distribution was set to 0.01, while η for the exponential distribution was set to 1. The initial representation used for MH included all basic features. Additionally, φ(s, a) was built by copying the φ(s) vector into the corresponding action slot; therefore φ(s, a) has |A| times more features than φ(s). For LSPI, we limited the number of policy iterations to 5, while the value of the initial state for each policy, V^{π_i}(s_0), was evaluated by a single Monte-Carlo run.
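The decay schedule above in code form (a small sketch; `episode` is the 1-based episode counter).

```python
def learning_rate(alpha0, N0, episode):
    """alpha_t = alpha0 * (N0 + 1) / (N0 + episode**1.1)."""
    return alpha0 * (N0 + 1) / (N0 + episode ** 1.1)

print(learning_rate(alpha0=1.0, N0=100, episode=1))   # 1.0 at the first episode
print(learning_rate(alpha0=1.0, N0=100, episode=50))  # gradually decays with more episodes
```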

Maze. Figure 3-(a) shows a simple 11 × 11 navigation problem where the initial state is in the top left corner of the maze and the goal is in the bottom right corner. Light blue cells indicate blocked areas. The action set consists of one-step moves along the four cardinal directions. Actions are noiseless and possible only if the destination cell is not blocked. The reward is −0.001 for all interactions except the move leading to the goal, which has a reward of +1. The episodic task is terminated if the goal is reached or after 100 steps. There were 22 initial features used for φ(s), corresponding to the 11 rows and 11 columns of the maze. γ was set to 1.

We used 200 samples gathered over 2 episodes in the domain using SARSA. The agent reached the goal in the first episode, passing the top right corner of the middle blocked square. The second episode failed as the agent got stuck behind the blocked area at the bottom. Figure 3-(b) shows the distribution of the representation sizes sampled along 1,000 iterations of the MH algorithm, while Figure 3-(c) shows the corresponding performance of the sampled representations. The distribution, together with the performance measure, suggests that a desirable representation should have 3 extended features. After 100 iterations, all sampled hypotheses were expressive enough to solve the task. It is interesting to see how Occam's Razor plays out through the whole process: the MH algorithm spent most of its time exploring hypotheses with 3 extended features, while more complicated representations were of less interest, as they provided the same performance (i.e., likelihood) yet had lower prior.

Figure 3-(d) shows the value function (green indicates positive, white represents zero, and red stands for blocked areas) and the corresponding policy (arrows) for the best-performing representation. This representation had 3 extended features: (X = 2 ∧ Y = 11), (X = 3 ∧ Y = 6), and (X = 2 ∧ Y = 8), where X is the row number and Y is the column number. Notice that the policy guides the agent successfully from the starting point to the goal on the shortest path.


At the same time, the policy is not optimal over the whole state space. This observation, which is aligned with our initial statement, is due to two factors. First, the gathered samples were sparse and did not cover the whole state space; hence the algorithm generalizes the values incorrectly in parts of the state space it has not been exposed to (e.g., ignoring the presence of a blockade in the last column). Second, the performance measure was based solely on the execution of the policies from the initial state. If a globally optimal policy is desired, V^{π_i} in Equation 3 should be evaluated from various initial states. In the next experiment, we probed a larger domain that also includes uncertainty.

BlocksWorld. Figure 4-(a) depicts the classical BlocksWorld domain. The task starts with 4 blocks lying on the table. The episodic task is to build a tower out of all blocks with a predefined color order. Each episode is finished after 100 steps. In each state, the agent can take any clear block (i.e., a block with no other block on top of it) and attempt to move it onto any other clear block or the table. Any move involves a 20% chance of failure, resulting in dropping the block on the table. φ(s) was derived directly from the logical representation of on(A, B), resulting in 16 basic features. The reward function was identical to the maze domain.

In this domain, we increased the number of samples to 1,000, as the environment was noisy and seeing the goal state was more challenging than in the previous deterministic task. The agent finished making the tower only once. Figure 4-(b) shows the distribution of the representation size after running MH, while Figure 4-(c) shows the corresponding performance through the number of steps to reach the goal. It is interesting to see that the performance started at the top, which means the primitive representation was not expressive enough to solve the task, yet after a few iterations the extended features made the task learnable. The small perturbation in the performance graph is due to the stochasticity, causing some trajectories to be longer than others. According to MH, representations with 4 and 5 extended features were most likely, even though fewer extended features could solve the task. We conjecture that if our agent gets lucky and does not drop blocks by accident, it does not need more extended features to solve the puzzle. On the other hand, if blocks are dropped during the movement, which happens frequently, the agent experiences new parts of the state space; hence it needs more features to differentiate the value function correctly. In order to verify our hypothesis, we plotted the average number of steps it took the agent to build the tower based on the size of the representation. On failed trials, the episode cap (i.e., 100 steps) was accumulated. Figure 4-(d) shows the corresponding result, including standard error bars with 95% confidence.⁵ Adding one extended feature rendered the task learnable, yet the resulting policies were not robust to the stochasticity, as the error bars highlight. Overall, extended features reduced variance and improved the result as long as there were fewer than 6. The optimal expected number of steps for this problem is 3.75. Adding more than 5 extended features increased variance and increased the expected number of steps due to overfitting. This trend coincides with the sample distribution shown in Figure 4-(d). Next, we probe MHPI in a domain with continuous state variables.

Inverted Pendulum. Figure 5-(a) depicts the inverted pendulum domain based on the previous work of Lagoudakis and Parr (2003). The episodic task is to balance the pendulum upright as long as possible. Each episode is finished when the pendulum hits the ground or reaches the cap of 3,000 steps. The state of the system is defined by the angle and angular velocity of the pendulum, [θ, θ̇]. Both dimensions of the initial state were sampled from the uniform distribution over [−0.2, +0.2] on each episode. The set of actions was limited to three values of force, {−50, 0, +50} + ω, where ω is uniform noise in [−10, 10]. The reward was +0.01 for all steps except the step at which the pendulum hits the ground, resulting in a −1 reward. Notice that we deviated from the reward function in the original work in order to differentiate between the performance of different representations; otherwise LSPI would have returned −1 for all representations incapable of balancing the pendulum for 3,000 steps. γ was set to 0.95. Basic features were generated by discretizing each dimension separately into 21 buckets, translating into 42 initial features.

We gathered 1,000 steps of experience through 70 trajectories. Figure 5-(b) shows the histogram of the extended feature size through 500 iterations of MHPI, while Figure 5-(c) depicts the corresponding performance along each iteration.

5. For representations with more than 9 extended features, we did not have more than 30 samples, hence we excluded the standard error.


[Figure 3: Maze domain empirical results. (a) Domain; (b) posterior distribution (number of samples vs. number of extended features); (c) sampled performance (number of steps to the goal vs. MH iteration); (d) resulting policy.]

[Figure 4: BlocksWorld empirical results. (a) Domain (start and goal configurations); (b) posterior distribution (number of samples vs. number of extended features); (c) sampled performance (number of steps to make the tower vs. MH iteration); (d) performance distribution (number of steps to make the tower vs. number of extended features).]

Unlike the other domains, where more features often helped performance early on, in this domain irrelevant features hurt performance, causing MH to reject them. This process took a while until interesting features started to emerge. This effect is usually avoided by setting a burn-in value, discarding a limited number of initial samples in the MH setting; yet we included this data to highlight the fact that expanding the representation arbitrarily does not necessarily improve performance in light of limited data. Figure 5-(d) shows the performance of the representations based on the number of extended features. In our experiments, while adding most extended features hurt the performance, the extended feature (−π/21 ≤ θ < 0) ∧ (0.4 ≤ θ̇ < 0.6) enabled the agent to complete the task successfully. This feature identifies an intuitive situation where the pendulum is almost balanced with a velocity in the opposite direction, a situation that would often be visited. This is a very interesting result, because out of all possible correlations among the initial features (21 × 21), capturing one such intuitive feature made the task solvable with a very limited amount of data.

In our work, we found that the adjustment of priors played a critical role in the success of MHPI, as priors compete against the performance of the resulting policies. We also found MHPI to be robust in handling stochastic domains. For example, adding 20% noise to the movement of the agent in the maze domain did not change the performance noticeably.

5. Conclusion

This paper introduces a Bayesian approach for finding concise yet expressive representations for solving MDPs. We introduced MHPI, a new RL technique that builds, from a limited number of simple features, new representations that perform well. Our approach uses a prior distribution that encourages representation simplicity and a likelihood function based on LSPI that encourages representations leading to capable policies. MHPI samples representations from the resulting posterior distribution. Although the idea of MHPI is general, in our implementation we narrowed the representation space to DAG structures over primitive binary features. The empirical results show that MHPI finds simple yet effective representations for three classical RL problems.

There are immediate extensions to this work. In our implementation, we excluded the samples generated during the performance tests in order to take advantage of caching old representation evaluations; one could use such samples along the way while being aware of the increase in runtime complexity. Another extension is to relax the need for a simulator in LSPI by measuring performance using off-policy evaluation techniques such as importance sampling (Sutton and Barto, 1998) and model-free Monte Carlo (Fonteneau et al., 2010).


[Figure 5: Inverted pendulum empirical results. (a) Domain (pendulum with angle θ, angular velocity θ̇, and torque τ); (b) posterior distribution (number of samples vs. number of extended features); (c) performance (number of balancing steps vs. MH iteration); (d) performance distribution (number of balancing steps vs. number of extended features: 0, 1, >1).]


References

M. Bowling and M. Veloso. Scalable learning in stochastic games, 2002.
Michael Bowling, Alborz Geramifard, and David Wingate. Sigma point policy iteration. In AAMAS '08: Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 379-386, 2008.
S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.
Scott E. Fahlman and Christian Lebiere. The Cascade-Correlation Learning Architecture, 1991.
Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Model-free Monte Carlo-like policy evaluation. Journal of Machine Learning Research - Proceedings Track, 9:217-224, 2010.
Alborz Geramifard, Michael Bowling, and Richard S. Sutton. Incremental least-square temporal difference learning. In The Twenty-First National Conference on Artificial Intelligence (AAAI), pages 356-361, 2006.
Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, and Jonathan How. Online discovery of feature dependencies. In Lise Getoor and Tobias Scheffer, editors, International Conference on Machine Learning (ICML), pages 881-888, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.
Sertan Girgin and Philippe Preux. Feature discovery in reinforcement learning using genetic programming. Research Report RR-6358, INRIA, 2007.
Sertan Girgin and Philippe Preux. Basis function construction in reinforcement learning using cascade-correlation learning architecture. In ICMLA '08: Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, pages 75-82, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3495-4. doi: 10.1109/ICMLA.2008.24.
N. D. Goodman, J. B. Tenenbaum, T. L. Griffiths, and J. Feldman. Compositionality in rational analysis: Grammar-based induction for concept learning. In M. Oaksford and N. Chater, editors, The Probabilistic Mind: Prospects for Bayesian Cognitive Science, 2008.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, April 1970. doi: 10.1093/biomet/57.1.97.
Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
Sridhar Mahadevan. Representation policy iteration. In Proceedings of the 21st International Conference on Uncertainty in Artificial Intelligence, 2005.
François Rivest and Doina Precup. Combining TD-learning with cascade-correlation networks. In Proceedings of the Twentieth International Conference on Machine Learning, pages 632-639. AAAI Press, 2003.
G. A. Rummery and M. Niranjan. Online Q-learning using connectionist systems (Tech. Rep. No. CUED/F-INFENG/TR 166). Cambridge University Engineering Department, 1994.
Alexander A. Sherstov and Peter Stone. Function approximation via tile coding: Automating parameter choice. In J.-D. Zucker and I. Saitta, editors, SARA 2005, volume 3607 of Lecture Notes in Artificial Intelligence, pages 194-205. Springer Verlag, Berlin, 2005.
Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pages 1038-1044. The MIT Press, 1996.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Richard S. Sutton and Steven D. Whitehead. Online learning with random representations. In Proceedings of the Tenth International Conference on Machine Learning, pages 314-321. Morgan Kaufmann, 1993.
Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 993-1000, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553501.
John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.
Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279-292, May 1992. doi: 10.1007/BF00992698.