Hauptprojekt
Tasmin Herrmann
Neural Networks for Buy Predictions
Supervision by: Prof. Dr. Kai von Luck
Submitted on: March 20, 2019
Fakultät Technik und Informatik, Department Informations- und Elektrotechnik
Faculty of Computer Science and Engineering, Department Information and Electrical Engineering
In this work, neural networks are trained on session-based click data to predict purchases of items. The dataset is from the RecSys Challenge 2015. The challenge was dominated by ensemble learning models, and I want to investigate whether neural networks perform better than these models on the dataset. The neural networks are based on feature sets: I use the features of the winning model and of another approach from the challenge. In the first step, these features are implemented to build the training sets for the models. Experiments with different parameters follow. These experiments are continued in the subsequent master thesis to clarify whether neural networks are more suitable than the ensemble learning models on this dataset.
Recommendation systems help the user to find relevant areas of interest, e.g. on e-commerce sites. Most academic research deals with approaches that personalize recommendations with the help of long-term user profiles. In many real-world applications, however, such long-term profiles often do not exist. In this case, recommendations must be based exclusively on the observed behavior of a user during a running session [11]. This is called session-based recommendation.
Session-based recommendations were also a topic at the Association for Computing Machinery’s Recommender Systems conference (RecSys) in 2015. The conference takes place every year and is the premier international forum for the presentation of new research results, systems and techniques in the broad field of recommender systems [3]. It is accompanied by a competition, the so-called RecSys Challenge. In 2015, the task was to predict session-based product purchases in a web shop. The winners chose an ensemble learning model to solve the problem.
I would like to investigate whether deep learning models are suitable for buy prediction and whether they perform better than the ensemble learning models used in the RecSys Challenge 2015. Specifically, I would like to test whether the accuracy of deep learning models is higher than that of ensemble learning models.
In this work I begin with experiments with neural networks on the dataset of the challenge. I want to compare the results with those of the winning model of Romov and Sokolov [15]. In addition, I consider an approach by Cohen et al. [2], since they also used an ensemble learning approach. My approach is to train neural networks on the same features to check whether they reach a similar quality with the same knowledge. For this, I calculate the features from the click and purchase data of the RecSys Challenge 2015. Example data are shown in A.1.
1.2 Outline
In section 2 the methods are presented. This includes a brief presentation of two papers that influence this work and the experiments derived from them. The results of the experiments are presented in section 3. This includes the created feature sets and the models. In section 4 the results and the structure of the experiments are discussed, as well as the basis used by the two papers. In section 5 a conclusion is drawn and an outlook to the following master thesis is given.
2 Methods
In this section I first describe the two works on which my experiments are based, and then I describe my own experiments, which build on parts of these works. The first work is by Romov and Sokolov [15], who created the winning model. The second work is by Cohen et al. [2] and was also created within the framework of the RecSys Challenge.
2.1 Romov Experiment
The first thing Romov and Sokolov [15] noticed was that the classes in the data are not balanced: only 5.5% of users buy at least one item. They therefore decided to first classify who buys and then, for those predicted as buyers, which items they buy. This is a two-stage classifier, with one stage for purchase detection and one for item detection. They decided to work with thresholds instead of binary predictions so that the thresholds can be optimized directly.
The purchase detection classifier uses only session features, and the item detection classifier uses session features and session-item features. The features include some categorical ones, which are not supported by most ensemble learning libraries. Instead of methods like one-hot encoding [12, p. 215], they use hash tables with the MatrixNet [16] tool from Yandex [17]. The model they used is an implementation of gradient boosting [7, p. 359-360] over oblivious decision trees. They chose binary log-likelihood [7, p. 31] as the loss function, after the mean squared error [7, p. 24] tended to overfit the model. The training set consisted of 90% of the data, and the remaining 10% was used as the validation set. The threshold values were optimized on the validation set.
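The two-stage idea with thresholds can be illustrated with a minimal sketch. It assumes that both stages already return scores between 0 and 1; the function name, the dictionaries and the threshold values are hypothetical and are not taken from Romov and Sokolov's implementation.

```python
from typing import Dict, List


def two_stage_predict(
    session_scores: Dict[str, float],
    item_scores: Dict[str, Dict[str, float]],
    session_threshold: float,
    item_threshold: float,
) -> Dict[str, List[str]]:
    """Return the predicted purchases per session for a two-stage classifier."""
    predictions: Dict[str, List[str]] = {}
    for session_id, p_buy in session_scores.items():
        # Stage 1 (purchase detection): skip sessions scored below the threshold.
        if p_buy < session_threshold:
            continue
        # Stage 2 (item detection): keep only items scored at or above the threshold.
        items = [
            item
            for item, p_item in item_scores.get(session_id, {}).items()
            if p_item >= item_threshold
        ]
        if items:
            predictions[session_id] = sorted(items)
    return predictions


# Hypothetical scores for three sessions.
session_scores = {"s1": 0.9, "s2": 0.2, "s3": 0.7}
item_scores = {"s1": {"a": 0.8, "b": 0.3}, "s2": {"a": 0.9}, "s3": {"c": 0.4}}
result = two_stage_predict(session_scores, item_scores, 0.5, 0.5)
print(result)
```

Because the decision depends only on the two thresholds, both values can be tuned on a validation set without retraining the underlying models.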
2.2 Cohen Experiment
Cohen et al. [2] followed two different approaches. One approach, which is similar to that of Romov and Sokolov, uses an ensemble learning model of two trees; the other uses a single tree to directly predict the items purchased. They randomly sampled 500,000 negative records and an equal number of positive records for model training, because there are only 509,696 positive sessions in the dataset. They chose REPTree from Weka [1] as the classifier for all models.
For the ensemble learning model, the first model, for purchase detection, was trained with 509,696 positive sessions and 509,696 randomly sampled negative sessions from their training data. The second model, for item detection, was trained with both feature sets, their session features and their item-in-session features. In the approach with one model, the model is trained on both feature sets. Finally, they optimized the thresholds of both approaches. The model that consists of only one classifier gave better predictions.
2.3 Experiment Design
My models are based on the features of Romov and Sokolov and of Cohen et al., but neural networks are used for modeling instead of ensemble learning approaches. This allows me to investigate how neural networks with the same prior knowledge perform compared to ensemble learning models. The experiments are performed with the technologies and the environment described in [8], where I describe how I calculate features with pandas and train models with Keras.
In the following I explain my experimental setups.
Experiments with feature sets of Romov and Sokolov
In these investigations I use the features Romov and Sokolov described in their work, with a split of 90% training data to 10% test data. I balance the data, because neural networks do not learn well from unbalanced data. To do so, I undersample the data with the sample function of pandas [13]: I take all positive sessions of the training data and an equally sized random sample of negative sessions. For the preparation of the data I use Dask [4], which is based on the pandas library and uses multi-threading, because the whole dataset is too big to process in one kernel.
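The undersampling step can be sketched with pandas as follows. This is a minimal illustration on a toy frame, not the code used for the experiments; the column names `clicks` and `buy` and the helper `undersample` are assumptions.

```python
import pandas as pd


def undersample(df: pd.DataFrame, label: str = "buy", seed: int = 42) -> pd.DataFrame:
    """Keep all positive rows and an equally sized random sample of negative rows."""
    positives = df[df[label] == 1]
    negatives = df[df[label] == 0]
    # Draw as many negatives as there are positives, without replacement.
    sampled_negatives = negatives.sample(n=len(positives), random_state=seed)
    # Concatenate both classes and shuffle the rows.
    balanced = pd.concat([positives, sampled_negatives]).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)


# Toy data: 2 buying rows and 8 non-buying rows (hypothetical values).
toy = pd.DataFrame({"clicks": range(10), "buy": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]})
balanced = undersample(toy)
print(balanced["buy"].value_counts())
```

Fixing `random_state` makes the sample reproducible, which helps when the balanced set is stored and reused across experiments.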
Logistic regression. I train a logistic regression model to check my feature sets, which are shown in A.2.2. For this I use the implementation from the scikit-learn library [14]. The model is trained on all 395 features of the balanced training set with its default settings.
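A baseline of this kind can be sketched with scikit-learn as shown below. The data are synthetic and generated with `make_classification` purely for illustration; the sample and feature counts are assumptions and do not correspond to the real feature sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a balanced training set (hypothetical sizes).
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=0)

# LogisticRegression() without arguments uses the library's default settings.
model = LogisticRegression()
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
print(f"training accuracy: {train_accuracy:.2f}")
```

If such a linear baseline already separates the classes better than chance, the feature set carries usable signal, which is the purpose of this check.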
Multilayer perceptron. I train fully connected neural networks on the balanced training data. Since the feature sets have 395 features, I decided to do a feature selection: one with 10% of the features and one with 25% of the features. I perform the feature selection with SelectPercentile from scikit-learn, which selects features according to a percentile of the highest scores. As scoring function I use f_classif. For model development I use the Sequential API from Keras [9]. Depending on the experiment (see Table 1), I define the dimensions of the input layer and add layers with the corresponding number of neurons. In the first experiment the input layer has 40 dimensions, because I use 10% of the features. Two hidden layers with 64 neurons each follow. I apply the regularization method dropout [6, p. 258-268] with a rate of 0.5 after each layer. This means that during training 50% of the neurons in the hidden layers of the network are randomly switched off and not considered in the next calculation step. As activation function I use the rectified linear unit (ReLU) [6, p. 135]. The output layer has one neuron and uses the sigmoid function [6, p. 67-70] as activation function. I use binary_crossentropy [6, p. 132] as loss function and rmsprop [6, p. 307-308] as optimizer.
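The feature selection step can be sketched as follows. The data are random and only serve to show the calls; the number of rows and the random seed are assumptions, while the 395 columns mirror the size of the feature set.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 395))   # 395 synthetic feature columns
y = rng.integers(0, 2, size=300)  # synthetic binary labels

# Keep the top 10% and the top 25% of the features according to the ANOVA F-score.
selector_10 = SelectPercentile(score_func=f_classif, percentile=10)
selector_25 = SelectPercentile(score_func=f_classif, percentile=25)
X_10 = selector_10.fit_transform(X, y)
X_25 = selector_25.fit_transform(X, y)

print(X_10.shape, X_25.shape)
```

With 395 columns and no tied scores, the two selectors should keep about 40 and 99 columns, in line with the input dimensions given in the text. The fitted selector has to be applied unchanged to the validation and test data via `transform`.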
In the second experiment I use 25% of the features as input, i.e. 99 features. Here I use two hidden layers with 128 neurons each. In the third experiment I use all 395 features as input, with two hidden layers of 512 neurons each.
layers  neurons  features
2       64       40
2       128      99
2       512      395

Table 1: Hyperparameters of models with features of Romov and Sokolov
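The architectures in Table 1 can be sketched with the Keras Sequential API as follows. This is a minimal sketch under assumptions: the helper `build_mlp` and the constant `TABLE_1` are hypothetical names, dropout is placed after each hidden layer, and the original training code may differ in these details.

```python
from typing import List, Tuple

# (hidden layers, neurons per hidden layer, input features) as listed in Table 1.
TABLE_1: List[Tuple[int, int, int]] = [(2, 64, 40), (2, 128, 99), (2, 512, 395)]


def build_mlp(n_hidden: int, n_neurons: int, n_features: int, dropout: float = 0.5):
    """Build a fully connected binary classifier with the Keras Sequential API."""
    # Imported here so that the configuration above can be inspected without Keras.
    from keras.layers import Dense, Dropout
    from keras.models import Sequential

    model = Sequential()
    for i in range(n_hidden):
        if i == 0:
            # The first hidden layer also defines the input dimension.
            model.add(Dense(n_neurons, activation="relu", input_shape=(n_features,)))
        else:
            model.add(Dense(n_neurons, activation="relu"))
        # Assumption: dropout follows every hidden layer.
        model.add(Dropout(dropout))
    # One sigmoid output neuron for the binary buy / no-buy decision.
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_mlp(*TABLE_1[0])` would construct the first configuration, and `model.summary()` can then be used to inspect the resulting layers.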
Experiments with feature sets of Cohen et al.
In these investigations I use the features Cohen et al. described in their work. Item previous buy count and item previous buy percentage are not used: I was not able to calculate item previous buy count, and since the other feature is based on it, it is missing too. More on this in the discussion in section 4. Here I use a split of 70% training data to 30% test data. I balance the data like Cohen et al. and use RandomUnderSampler from the imbalanced-learn library [10] for this. Since I work with far fewer features here, I can use pandas [13] and NumPy [5] for the data preparation.
Logistic regression. Again, I start with a logistic regression to test the feature sets, which are shown in A.2.1. I use the implementation of the scikit-learn library again. The model is trained with 33 features of the balanced training set with its default settings.
Multilayer perceptron. I train neural networks on balanced training data with the features of Cohen et al. For this I use the Sequential API from Keras. The input layer has 33 dimensions, because I use 33 features. The hidden layers follow; their number and their number of neurons depend on the experiment (see Table 2). After each layer I use a dropout of 0.5, and I use relu as activation function. The output layer has one neuron and uses sigmoid as activation function. I use binary_crossentropy as loss function and rmsprop as optimizer.
For the first neural network I use two hidden layers with 20 neurons each. For the second network I use two hidden layers with 64 neurons each. In the third experiment I use ten hidden layers with 64 neurons each.
layers  neurons
2       20
2       64
10      64

Table 2: Hyperparameters of models with features of Cohen et al.
For the experiments based on the features of Romov and Sokolov and of Cohen et al., the test sets are divided into a validation set and a test set before model training. For this I divide the data into two equal halves with train_test_split from the scikit-learn library, which randomly splits the data between the two sets. The validation set is used during model training to adjust hyperparameters. The test set is only used for models that are not to be adjusted further and are considered final. Thus, generalization can be tested on a dataset that the model has not yet seen. The model learns from the training data, and the developer makes further adjustments using the validation set to improve hyperparameters and select the best model. Both can lead to overfitting to these two sets, as described in [12, p. 263-265]. Here the test set is not used to check whether the models also achieve good predictions on this set. I keep the set for later use in the experiments for my master thesis, since the experiments I did here are a preliminary step to those experiments.
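Splitting a held-out set into two equal halves can be sketched as follows. The arrays are toy data, and the variable names and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy held-out data: 10 rows with 2 features each (hypothetical values).
X_holdout = np.arange(20).reshape(10, 2)
y_holdout = np.array([0, 1] * 5)

# test_size=0.5 assigns one random half to the validation set and the other to the test set.
X_val, X_test, y_val, y_test = train_test_split(
    X_holdout, y_holdout, test_size=0.5, random_state=0
)

print(len(X_val), len(X_test))
```

The first returned pair is used here as the validation set and the second as the test set; no row appears in both.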
3 Results
The first results of the experiments are the features I calculated from the clickstream data. I have one training set and one test set for each of the two experiments. For the features of Romov and Sokolov I also calculated a balanced training set in advance, because balancing the complete training set directly before modeling caused memory problems. I balanced the training set with the features of Cohen et al. directly before modeling.
Features            Set                 Rows        Columns
Romov and Sokolov   train set           24,046,352  396
Romov and Sokolov   balanced train set  4,131,574   396
Romov and Sokolov   test set            2,569,224   396
Cohen et al.        train set           18,882,229  34
Cohen et al.        balanced train set  3,198,846   34
Cohen et al.        test set            7,733,347   34
Table 3: Quantity of train and test sets
Table 3 shows the number of rows and columns of the datasets. The test set for model training with features of Romov and Sokolov is 10% of the data, and for model training with features of Cohen et al. I use 30% of the data as the test set. Therefore, the sizes of the datasets differ, but the total number of rows remains the same.
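The statement about the total number of rows can be checked directly from the values in Table 3. The short calculation below uses only those numbers; the variable names are chosen for illustration.

```python
# Row counts taken from Table 3 (unbalanced train set and test set).
romov_train, romov_test = 24_046_352, 2_569_224
cohen_train, cohen_test = 18_882_229, 7_733_347

total_romov = romov_train + romov_test
total_cohen = cohen_train + cohen_test

# Share of rows that ends up in the test set for each split.
romov_test_share = romov_test / total_romov
cohen_test_share = cohen_test / total_cohen

print(total_romov, total_cohen)
print(round(romov_test_share, 3), round(cohen_test_share, 3))
```

Both totals come to 26,615,576 rows. The row shares of the test sets are close to, but not exactly, the nominal 10% and 30%; one possible explanation, which is an assumption and not stated in the text, is that the split is made per session rather than per row.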
The individual features are described in A.2. Once both feature sets have been set up, the experiments can be performed as described in section 2. Table 4 lists the model metrics on the training data of each model and, for comparison, the metrics on the validation set. The first row of the table contains the metrics of the winning model, so that I can directly compare the values with my own results. I do not know the exact dataset on which these metrics were calculated. As metrics I use the model’s accuracy, precision and recall [12, p. 283-285]. The following rows contain the results of my models. The Model column gives the name of the model: LogR stands for logistic regression and NN for neural network. Further details follow to distinguish the experiments: first the number of hidden layers and the number of neurons per hidden layer, so that, for example, 2x20 stands for two hidden layers with 20 neurons each. If a feature selection is carried out in the experiment, the abbreviation FS and the number of features used follow in the name, for example FS 40. In this way all experiments can be identified by their model name. The Features column indicates whether the model is trained with features from Romov and Sokolov or from Cohen et al.
Model  Features  Set  Accuracy  Precision  Recall
Winning Solution RecSys 2015
In binary classification the accuracy [12, p. 284] is the number of correct predictions divided by the total number of predictions. Precision [12, p. 285] is the fraction of relevant instances among the retrieved instances, and recall [12, p. 285] is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. In binary classification, the relevant instances are those of the positive class, i.e. in this case the user buys the item. For unbalanced data we must consider metrics like precision and recall, because a model which only predicts the predominant class can have a very high accuracy, as in [12, p. 279].
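The effect described above can be made concrete with a small sketch. It computes the three metrics from scratch and applies them to a hypothetical set of 1,000 sessions with the 5.5% buyer rate mentioned in section 2.1; the helper `binary_metrics` and the convention of reporting 0.0 for an undefined ratio are assumptions.

```python
from typing import Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: report 0.0 when a ratio is undefined (empty denominator).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# 1,000 hypothetical sessions, 55 of them with a purchase.
y_true = [1] * 55 + [0] * 945
always_negative = [0] * 1000  # a model that only predicts the predominant class

accuracy, precision, recall = binary_metrics(y_true, always_negative)
print(accuracy, precision, recall)
```

The model that never predicts a purchase reaches an accuracy of 0.945 while its recall is 0, which is why accuracy alone is not sufficient on unbalanced data.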
The winning model has an accuracy of 0.77, which means it assigns 77% of the records to their correct classes. Its precision means that 16% of the records predicted as positive actually belong to the positive class. Its recall means that it recognizes 77% of the positive class as such.
My trained models have an accuracy between 0.09 and 0.90. The precision is between 0.09 and 0.71. The recall is between 0.03 and 1.00.
4 Discussion
First, I interpret the model results here and then I will discuss the feature sets.
4.1 Model Results
As Müller and Guido describe in [12, p. 285], I use recall and precision as additional metrics for unbalanced data. But there is a trade-off between optimizing recall and optimizing precision. A perfect recall is achieved by predicting for all examples that they belong to the positive class. Then there are no false negatives and no true negatives. Predicting all examples as positive, however, leads to many false alarms, so that the precision is very low. On the other hand, a model achieves perfect precision if it only predicts those examples as positive where it is certain and predicts the rest as negative. In this case, the recall will be very poor because many positive examples are not recognized.
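The extreme case of predicting every example as positive can be written out as a short worked calculation. The symbols are assumptions introduced here: $P$ and $N$ are the numbers of positive and negative examples, and $TP$, $FP$, $TN$, $FN$ are the usual counts of true positives, false positives, true negatives and false negatives.

```latex
\begin{align*}
\text{all examples predicted positive:}\quad & TP = P,\; FP = N,\; FN = 0,\; TN = 0 \\
\text{recall} &= \frac{TP}{TP + FN} = \frac{P}{P + 0} = 1 \\
\text{precision} &= \frac{TP}{TP + FP} = \frac{P}{P + N} \\
\text{accuracy} &= \frac{TP + TN}{P + N} = \frac{P}{P + N}
\end{align*}
```

For such a model, precision and accuracy both equal the share of positive examples in the data. This would be consistent with the lowest accuracy and the lowest precision reported in section 3 both being 0.09 while the highest recall is 1.00.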
The case of perfect recall occurs here with the two models NN 10x64 trained on features of Cohen et al. and NN 2x512 trained on features of Romov and Sokolov. The two models did not learn to distinguish between the classes in this experimental setup.
The model NN 2x20 trained on features of Cohen et al. rarely recognizes the purchase sessions: it recognizes only 3% of the purchases as such. The accuracy is accordingly high at 0.90, since the model almost always predicts no purchase of the item. This model has learned few patterns for recognizing purchases.
The logistic regressions both have a recall above 0.70 and achieve an accuracy of at least 0.60 on their validation sets. LogR trained on features of Cohen et al. is 17 percentage points below the accuracy of the winning model, and LogR trained on features of Romov and Sokolov is 11 percentage points below it. The logistic regression on features of Romov and Sokolov thus achieves a higher accuracy than the one on the features of Cohen et al.; its feature set includes more features. Note that the validation sets are not the same and cannot be compared exactly. Neither model reaches the result of the winning model. But I can see from the values of the model metrics that the models do not predict only one class and that they generalize through training. So, models can be trained with the feature sets. A logistic regression can only recognize linear relationships; for the recognition of non-linear relationships, models like neural networks are used.
Among the neural networks, three models have learned to generalize, to varying degrees. On the features of Cohen et al., it is the model NN 2x64, with an accuracy of 0.66 on the validation set. The precision is 0.18 and the recall is 0.79. Precision and accuracy are similar to those of the LogR trained on features of Romov and Sokolov. The model NN 2x64 trained on features of Cohen et al. thus achieves similar results on 33 features as the logistic regression on the 395 features.
The other two neural networks are trained on features of Romov and Sokolov. The metric values of the models NN 2x64 FS 40 and NN 2x128 FS 99 are close to each other. The precision is 0.19 for both models. The recall is 0.61 for NN 2x64 FS 40 and 0.64 for NN 2x128 FS 99. The accuracy is slightly higher for NN 2x64 FS 40 than for NN 2x128 FS 99, because it predicts the negative class more often. There is no clear advantage in using 25% instead of 10% of the features here, but I also use more neurons in the layers, which has an influence as well. Both models have a higher accuracy than LogR trained on features of Romov and Sokolov but are below the winning model.
4.2 Feature Sets
Neither feature set was designed independently of the future; knowledge about the future flows in at different points. The features of Romov and Sokolov include the 100 most common items and the 50 most common categories, calculated over the whole dataset. This means that the model knows which items and categories are often bought or clicked. However, this distribution does not have to be transferable to other time periods and can then lead to errors. For example, different items are of interest during the pre-Christmas period than in summer. Under the assumption that only data from the same period is to be predicted, the approach is feasible.
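The difference between computing the most common items over all data and computing them only from data before a cut-off can be illustrated with a small sketch. The click records, the item names and the helper `top_items` are hypothetical and do not reproduce the feature code of Romov and Sokolov.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Click = Tuple[int, str]  # (timestamp, item_id)


def top_items(clicks: Iterable[Click], n: int, before: Optional[int] = None) -> List[str]:
    """Return the n most clicked items, optionally using only clicks before a cut-off."""
    counts = Counter(item for ts, item in clicks if before is None or ts < before)
    return [item for item, _ in counts.most_common(n)]


# Hypothetical clicks: a "summer" period (timestamps < 10) followed by a "winter" period.
clicks = [
    (1, "sunscreen"), (2, "sunscreen"), (3, "towel"),
    (10, "candles"), (11, "candles"), (12, "candles"), (13, "towel"),
]

top_all = top_items(clicks, n=1)                # uses clicks from both periods
top_past = top_items(clicks, n=1, before=10)    # uses only clicks before timestamp 10
print(top_all, top_past)
```

In this toy example the most common item over all data is `candles`, while a feature restricted to the earlier period would only know `sunscreen`. A popularity feature computed over the whole dataset therefore carries information from later sessions into earlier ones.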
Knowledge about the future is also used in the features of Cohen et al. However, this is difficult to judge because, unlike with Romov and Sokolov, the code is not freely available. I therefore estimate it from the names and descriptions of the features. They also use a kind of interest in items, but they rank the items by popularity.
In my opinion, too much prior knowledge about the future does not allow an independent prediction and should not be used for modelling.
As already described in section 2, I have not used two features from the feature sets of Cohen et al. I could not calculate item previous buy count in the way I interpreted and implemented it. For each item in a session, I looked at the previous purchases. This led to a lot of data, because each item in the clicks was linked to all items in the buys and then filtered for those bought before the session time. There were memory problems in the random-access memory even with distributed calculation with Dask. Unfortunately, there is no detailed description of how Cohen et al. implemented it. I therefore excluded both features from my experiment. Another possibility is to calculate the purchases on all data without temporal filtering, but this argues against independence from the future.
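One possible reading of item previous buy count, namely the number of purchases of an item before the start of the session in which it is clicked, can be computed without linking every click to every buy. The sketch below is an assumption about the feature's meaning, not the implementation of Cohen et al., and it has not been tested on the full dataset; the function and variable names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Event = Tuple[int, str]  # (timestamp, item_id)


def previous_buy_counts(clicks: Sequence[Event], buys: Sequence[Event]) -> List[int]:
    """For each (session start, item) pair, count buys of that item strictly before the start."""
    buy_times: Dict[str, List[int]] = defaultdict(list)
    for ts, item in buys:
        buy_times[item].append(ts)
    for times in buy_times.values():
        times.sort()
    # bisect_left returns how many sorted buy timestamps are smaller than the session start.
    return [bisect_left(buy_times.get(item, []), start) for start, item in clicks]


# Hypothetical events.
buys = [(5, "a"), (7, "a"), (9, "b")]
clicks = [(6, "a"), (8, "a"), (8, "b"), (10, "b"), (3, "c")]
counts = previous_buy_counts(clicks, buys)
print(counts)
```

The memory needed grows with the number of buys plus the number of clicks rather than with their product, because only one sorted list of timestamps per item is kept.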
5 Conclusion
With the results presented here, the neural networks do not exceed the results of the winning model of Romov and Sokolov. However, the neural network NN 2x64 trained on features of Cohen et al. achieved results similar to those of the logistic regression on the features of Romov and Sokolov. So far it has only been possible to train a generalizing neural network on fewer than the 395 features of Romov and Sokolov. Limiting knowledge by feature selection always means deciding which information is more important or more suitable for solving the problem and which information is not relevant. As in my case, this decision can be based on statistics or on domain knowledge. This means, however, that I have not put all available knowledge into the model, and that I take away from the model the decision as to which of the features are the decisive ones. I do not want to limit this knowledge and would rather calculate on all features, like Romov and Sokolov. The model itself should extract the knowledge that is important for the purchase prediction. I expect the metrics to improve if a model can be trained on all features, and the goal is still to exceed the ensemble learning models.
There can be many reasons why it was not possible to train a generalizing model on the 395 features. Therefore, I would like to pursue this further in my master thesis, following three approaches.
The first approach is to adapt the model architecture by tuning the hyperparameters, for example the number of layers and the number of neurons. The current choice may be inappropriate and may prevent the model from generalizing.
The second approach is to reduce the dimensions by representations. Procedures such as embeddings reduce the dimensions of the data by means of representations, but do not incur the loss of information that comes with feature selection. This follows from the observation that I have already successfully trained networks on features of Romov and Sokolov with a smaller input dimension.
The third approach is to try out other network architectures like recurrent neural networks. They are trained with sequence data and thus follow a different method to surpass the ensemble learning models.
So, there are other ways to structure models that I need to explore before I can draw a conclusion as to whether ensemble learning methods or deep learning methods are better suited to this problem. I will pursue these further possibilities in my master thesis. The goal is to train a model that uses the knowledge of the 395 features of Romov and Sokolov and to test other network architectures like recurrent neural networks.
Acknowledgements
This work was developed within the context of the machine learning working group ML AG at HAW Hamburg. I would like to thank Prof. Dr. Kai von Luck and M.Sc. Tobias Eichler, who actively support the working group, encourage the exchange between the students and supervise our projects. I would especially like to thank Matthias Nitsche, Stephan Halbritter and Timo Lange for the professional exchange in this group and for the work on the systems to provide a platform for machine learning projects. I would also like to thank Henrik Wortmann from the /* CREATIVE SPACE FOR TECHNICAL INNOVATIONS */ laboratory at HAW Hamburg, who is driving the development of an infrastructure for machine learning projects at HAW Hamburg together with Tobias Eichler. I would also like to thank all members of the ML AG who are not mentioned by name for the exchange and participation which contributed to this work.
References
[1] Machine Learning Group at the University of Waikato. Weka 3: Data mining software in Java, 2019. URL https://www.cs.waikato.ac.nz/ml/weka/. Accessed 2019-03-15.
[2] Nadav Cohen, Adi Gerzi, David Ben-Shimon, Bracha Shapira, Lior Rokach, and Michael Friedmann. In-House Solution for the RecSys Challenge 2015. Proceedings of the 2015 International ACM Recommender Systems Challenge on - RecSys ’15 Challenge, pages 1–4, 2015. doi: 10.1145/2813448.2813519. URL http://dl.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, second edition, 2009. ISBN 978-0-387-84857-0.
[8] Tasmin Herrmann. Development Environment for Buy Predictions with Machine Learning. 2018. URL http://users.informatik.haw-hamburg.de/~ubicomp/projekte/master2018-proj/herrmann.pdf.
[9] Keras. Keras: The Python deep learning library, 2019. URL https://keras.io/. Accessed 2019-03-15.
[10] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017. URL http://jmlr.org/papers/v18/16-365.
[11] Malte Ludewig and Dietmar Jannach. Evaluation of session-based recommendation algorithms. CoRR, abs/1803.09587, 2018. URL http://arxiv.org/abs/1803.
[12] Andreas C. Müller and Sarah Guido. Introduction to Machine Learning with Python. O’Reilly Media, Inc., third edition, 2017. ISBN 978-1-449-36941-5.
[13] Pandas. Pandas: Python data analysis library, 2019. URL https://pandas.pydata.org/index.html. Accessed 2019-03-15.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[15] Peter Romov and Evgeny Sokolov. RecSys Challenge 2015. In Proceedings of the 2015 International ACM Recommender Systems Challenge on - RecSys ’15 Challenge, pages 1–4, New York, New York, USA, 2015. ACM Press. ISBN 9781450336659. doi: 10.1145/2813448.2813510. URL http://dl.acm.org/citation.cfm?doid=2813448.2813510.
[16] Yandex. Matrixnet: New level of search quality, 2019. URL https://yandex.