
Mar 31, 2015

Page 1:

Python Programming, 2/e

Page 2:

Most useful

• Presentation of algorithms
• Requiring us to read & respond (2)
• Discussion (2) or in-class exercises
• Examples (x2)

Page 3:

Least Useful

• The book
• Book is hard to read
• Book is redundant
• Book goes too quickly
• Reading Responses
• No deadline for exercises
• Long uninterrupted lectures
• Easy to get confused during discussions

Page 4:

What could students do?

• Talk more / less (x4)
• Work towards mastery
• Summarize chapters and discuss with others
• Rephrase problems / algorithms in own words
• Read the chapter more than once

Page 5:

What could Matt do?

• Provide more examples / code (x3)
• Show some real applications
• Present challenge / problem before method
• Go through textbook slower

Page 6:

• δ is the normal TD error
• e is the vector of eligibility traces
• θ is a weight vector

Page 7:

Linear Methods

• Why are these a particularly important type of function approximation?

• Parameter vector θ_t

• Column vector of features φ_s for every state (same number of components as θ_t)

Page 8:

Tile coding

Page 9:

Mountain-Car Task

Page 10:

3D Mountain Car

• X: position and acceleration
• Y: position and acceleration

http://eecs.wsu.edu/~taylorm/traj.gif

Page 11:

• Control with FA
• Bootstrapping

Page 12:

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML-09.

• Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator.

• We introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD).

• The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero.

• In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD.

Page 13:

van Seijen, H., Sutton, R. S. True online TD(λ). ICML-14.

• The appeal of TD(λ) rests on its equivalence to a clear and conceptually simple forward view, and on the fact that it can be implemented online in an inexpensive manner.

• Equivalence between TD(λ) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate.

• In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(λ) that exactly achieves it.

• In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems, by adhering more truly to the original goal of TD(λ)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(λ).

Page 14:

Efficiency in ML / AI

1. Data efficiency (rate of learning)
2. Computational efficiency (memory, computation, communication)
3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)

Page 15:

• Would it have been fine to skip the review [of RL], or is that bad form? Also, it says V*(s) is the optimal policy, but shouldn't that be the optimal state-value function, with π* being the optimal policy?

• Why are action-value functions preferred to value functions?

• I’m still confused by the concepts of expert or domain knowledge. “The knowledge the domain expert must supply to IFSA is less detailed”: what does this mean, and what is the benefit (e.g., does it need less information about features)?

• The performance of IFSA-rev dropped off very quickly, then jumped around for a while before eventually finding a better solution than Sarsa; it finds the optimal policy before either IFSA or Sarsa does. But why does it behave so erratically at the beginning? Do we know?

• Different initial intuitions may produce totally different orderings of the features, and I believe this will influence the performance of IFSA and IFSA-rev: it is easy to see that picking the more relevant features first, so that the more important concepts are learned early, usually speeds up learning. But I'm not sure whether they will generally converge to the same level in the end.

• The paper states that the agent adds a feature to the feature set once the algorithm has reasonably converged. In the case of a continuous state space, how does the algorithm handle the values of the new state?

• I can imagine a greedy heuristic approach where the algorithm tries each feature set as the first one and, after x runs, selects whichever is doing best as the first feature set. Then fork that run and try each remaining feature set from there, again keeping the one that performs best, and so on. This gives a run time quadratic in the number of feature sets, rather than the exponential time it takes to find the optimal ordering, and it could be used when having an expert select the order is not feasible (a sketch follows this list).

• If in chess the positions of the pawns didn't matter that much, you could train the RL agent with just the more important pieces, thereby shrinking your potential state space tremendously. Then, when you later re-add the pawns, you would already know something about every state you encounter for a particular configuration of the main eight pieces, independent of how the pawns are situated. That would let you ignore millions of states where the pawns were shifted just slightly but you already knew the primary eight pieces had low value.

Page 16:

• Background of 1st author
• Self-citations
• Understandability
  – "utilize"
• Paper length
• "Reasonably Converged"
• Related work
• Error bars / stat sig
• How to pick test domains
• XOR Keepaway