Faculty of Science and Bio-Engineering Sciences
Department of Computer Science
Unsupervised Feature Extraction for
Reinforcement Learning
Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Engineering Sciences: Computer Science
Yoni Pervolarakis
Promotors: Prof. Dr. Peter Vrancx
Prof. Dr. Ann Nowé
June 2016
Abstract
When working with high-dimensional features, chances are that most of them
are not relevant to the problem at hand. Several approaches exist to
eliminate such features and potentially obtain better ones: feature
extraction, which transforms the original input features into a new,
lower-dimensional feature set, and feature selection, which keeps only the
features that are more informative than the others. Both can be performed
in a supervised or an unsupervised manner. In this thesis, we investigate
whether autoencoders can serve as an unsupervised feature extraction method
on data that is not necessarily interpretable. The data is represented as
RAM states, which are a black box since we cannot interpret them directly.
The autoencoders receive a high-dimensional feature set and transform it
into a lower-dimensional one; these new features are then given to a
reinforcement learning agent that tries to learn from them. The results are
compared against a manual feature selection method and against using no
feature selection at all.
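To make the pipeline described above concrete, the sketch below (not the code used in this thesis) shows how an autoencoder could compress a 128-byte RAM state into a small feature vector for an agent; the Keras-style API, layer sizes, and placeholder data are assumptions for illustration only.

    import numpy as np
    from keras.models import Model
    from keras.layers import Input, Dense

    RAM_DIM = 128    # the Atari 2600 exposes 128 bytes of RAM per state
    CODE_DIM = 16    # illustrative size of the compressed feature vector

    inputs = Input(shape=(RAM_DIM,))
    code = Dense(CODE_DIM, activation='sigmoid')(inputs)           # encoder
    reconstruction = Dense(RAM_DIM, activation='sigmoid')(code)    # decoder

    autoencoder = Model(inputs, reconstruction)
    encoder = Model(inputs, code)
    autoencoder.compile(optimizer='adam', loss='mse')

    # ram_states would normally be RAM observations collected from the
    # emulator, scaled to [0, 1]; random data stands in for them here.
    ram_states = np.random.rand(1000, RAM_DIM)
    autoencoder.fit(ram_states, ram_states, epochs=10, batch_size=32, verbose=0)

    # The encoder's output is the low-dimensional state representation
    # that would be handed to the reinforcement learning agent.
    features = encoder.predict(ram_states)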
Acknowledgements
First and foremost, I would like to thank Prof. Dr. Peter Vrancx for helping
me find a subject I am passionate about, for taking the time for weekly
updates, and for all his suggestions and numerous conversations on how this
subject could be tackled.
Secondly, I would also like to thank Prof. Dr. Ann Nowé for piquing my
interest in the Master in Artificial Intelligence when I took her course
during my first year at the Vrije Universiteit Brussel.
Finally, I would like to thank my mother for supporting me in pursuing my
studies at university, and my girlfriend for her endless support.
Contents
1 Introduction
  1.1 Research Question
2 Machine Learning
  2.1 Supervised learning
    2.1.1 Classification
    2.1.2 Regression
  2.2 Unsupervised learning
  2.3 Underfitting and overfitting
  2.4 Bias-Variance
  2.5 Ensemble methods
    2.5.1 Bagging
    2.5.2 Boosting
  2.6 Curse of dimensionality
  2.7 Evaluating models
    2.7.1 Cross validation
3 Artificial Neural Networks
  3.1 Perceptrons
  3.2 Training perceptrons
  3.3 Multilayer perceptron
  3.4 Activation functions
    3.4.1 Sigmoid
    3.4.2 Hyperbolic tangent
    3.4.3 Rectified Linear Unit
    3.4.4 Which is better?
  3.5 Tips and tricks
  3.6 Backpropagation
  3.7 Autoencoders
  3.8 Conclusion
4 Reinforcement Learning
  4.1 The setting
  4.2 Rewards
  4.3 Markov Decision Process
  4.4 Value functions
  4.5 Action Selection
  4.6 Incrementing Q-values
  4.7 Monte Carlo & Dynamic Programming
  4.8 Temporal Difference
    4.8.1 Q-Learning
    4.8.2 SARSA
  4.9 Eligibility traces
  4.10 Function approximation
5 Experiments and results
  5.1 ALE
  5.2 Space Invaders
  5.3 Reconstruction
  5.4 Flow of experiments
  5.5 Manual features and basic RAM
  5.6 Difference between bits and bytes
  5.7 Comparing different activation functions
  5.8 Initializing Q-values
  5.9 Pretraining and extracting other layers
  5.10 Combination of RAM and layer
  5.11 Visualizing high dimensional data
6 Conclusions
  6.1 Future work
Appendices
A Extended graphs and tables
Bibliography
List of Figures
1 Architecture of data processing
2 Example of a decision tree
3 Classification
4 Regression
5 Data of two features
6 k-means clustering
7 Unsupervised learning: reduction of dimensions
  7a MNIST example of the number 2
  7b MNIST reduction of dimensions
8 Difference between under- and overfitting
9 Dartboard analogy from (Sammut & Webb, 2011)
10 Bias-Variance trade-off
11 Random Forest
12 Searching in different dimensions
  12a 1D space
  12b 2D space
  12c 3D space
13 Example of a perceptron
14 Bitwise operations
  14a AND operator
  14b OR operator
  14c XOR operator
15 XOR with decision boundaries learnt by an MLP
16 Multilayer perceptron
17 Other activation functions: linear and step function
18 Sigmoid activation function
19 Hyperbolic tangent activation function
20 ReLU activation function
21 Example of an autoencoder
22 A Skinner box from (Skinner, 1938)
23 Agent-environment setting
24 Another view of the agent-environment setting
25 Mountain car; image from (RL-Library, n.d.)
26 Pole balancing; image from (Anji, n.d.)