
Hauptprojekt (Main Project)

Tasmin Herrmann

Neural Networks for Buy Predictions

Supervision by: Prof. Dr. Kai von Luck
Submitted on: March 20, 2019

Faculty of Computer Science and Engineering
Department Information and Electrical Engineering

Contents

1 Introduction
  1.1 Research Question
  1.2 Outline
2 Methods
  2.1 Romov Experiment
  2.2 Cohen Experiment
  2.3 Experiment Design
3 Results
4 Discussion
  4.1 Model Results
  4.2 Feature Sets
5 Conclusion
A Appendices
  A.1 RecSys Challenge 2015 Data
  A.2 Feature Sets
    A.2.1 Feature Sets of Cohen et al.
    A.2.2 Feature Sets of Romov and Sokolov


Abstract

In this work, neural networks are trained on session-based click data to predict purchases of items. The dataset is from the RecSys Challenge 2015. The challenge was dominated by ensemble learning models, and I want to investigate whether neural networks perform better than these models on the dataset. The neural networks are based on feature sets. Here I use the features of the winning model and of another approach from the challenge. In the first step, these features are implemented to build the training sets for the models. Experiments with different parameters follow. These experiments are continued in the subsequent master thesis to clarify whether neural networks are more suitable than the ensemble learning models on this dataset.

Keywords: Purchase prediction; neural networks; RecSys Challenge 2015

1 Introduction

1.1 Research Question

Recommendation systems help users find relevant areas of interest, e.g. on e-commerce sites. Most academic research deals with approaches that personalize recommendations with the help of long-term user profiles. In many real-world applications, however, such long-term profiles do not exist. In this case, recommendations must be based exclusively on the observed behavior of a user during a running session [11]. This is called session-based recommendation.

Session-based recommendations were also discussed at the Association for Computing Machinery's Recommender Systems conference (RecSys) in 2015. This conference takes place every year. It is the premier international forum for the presentation of new research results, systems and techniques in the broad field of recommender systems [3]. The conference is accompanied by a competition, the so-called RecSys Challenge. In 2015, the task was to predict session-based product purchases in a web shop. The winners chose an ensemble learning model to solve the problem.

I would like to investigate whether deep learning models are suitable for buy prediction, and then whether they are better than ensemble learning models as used in the RecSys Challenge 2015. Specifically, I would like to test whether the accuracy of deep learning models is better than that of ensemble learning models.

In this thesis I begin with experiments on neural networks on the data set of the challenge. I want to compare the results with the results of the winning model of Romov and Sokolov [15]. In addition, I consider the approach of Cohen et al. [2], since they also used an ensemble learning approach. My approach is to train neural networks on the same features to check whether the neural networks reach a similar quality with the same knowledge. For this I calculate the features from the click and purchase data of the RecSys Challenge 2015. Example data are shown in A.1.

1.2 Outline

In section 2 the methods are presented. This includes a brief presentation of two papers that influence this work and the experiments derived from them. The results of the experiments are presented in section 3. This includes the created feature sets and the models. In section 4 the results and the structure of the experiments are discussed, as well as the basis used by the two papers. In section 5 a conclusion is drawn and an outlook to the following master thesis is given.

2 Methods

In this section I first describe the two works on which my experiments are based, and then I describe my own experiments, which build on parts of these works. The first work is by Romov and Sokolov [15], who created the winning model. The second work I describe is by Cohen et al. [2] and was also created within the framework of the RecSys Challenge.

2.1 Romov Experiment

The first thing Romov and Sokolov [15] noticed was that the classes of the data records are not balanced. Only 5.5% of users buy at least one item. So, they decided to first classify who buys and then, for those predicted as buyers, which items they buy. This is a two-stage classifier: one stage for purchase detection and one for item detection. They decided to work with thresholds instead of binary predictions in order to optimize them directly.

The purchase detection classifier uses only session features, and the item detection classifier uses session features and session-item features. The features include some categorical ones that are not supported by most ensemble learning libraries. Instead of methods like one-hot encoding [12, p. 215] they use hash tables with the MatrixNet [16] tool from Yandex [17]. The model they used is an implementation of gradient boosting [7, p. 359-360] over oblivious decision trees. They decided to use binary log-likelihood [7, p. 31] as loss function, because the mean squared error [7, p. 24] tended to overfit the model. The training set consisted of 90% of the data, and the remaining 10% was used as validation set. The threshold values were optimized on the validation set.
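The two-stage decision with thresholds described in this section can be sketched as follows. This is only an illustration of the thresholding logic, not the implementation of Romov and Sokolov: the function name `predict_purchases`, the probability values and the threshold values are hypothetical, and in practice the probabilities would come from the two trained classifiers.

```python
from typing import Dict, List


def predict_purchases(
    session_prob: float,
    item_probs: Dict[int, float],
    session_threshold: float,
    item_threshold: float,
) -> List[int]:
    """Return the item IDs predicted as bought for one session.

    Stage 1 decides whether the session contains a purchase at all.
    Stage 2 is only consulted for sessions that pass stage 1.
    """
    if session_prob < session_threshold:
        return []  # the session is predicted as non-buying
    return sorted(
        item for item, prob in item_probs.items() if prob >= item_threshold
    )


# Hypothetical scores for two sessions (the thresholds are made up as well).
buyer = predict_purchases(0.40, {214536502: 0.70, 214536500: 0.20}, 0.10, 0.50)
browser = predict_purchases(0.02, {214662742: 0.90}, 0.10, 0.50)
print(buyer)    # [214536502]
print(browser)  # []
```

Because both stages expose a threshold rather than a fixed binary decision, the two values can be tuned jointly on a validation set.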

2.2 Cohen Experiment

Cohen et al. [2] followed two different approaches. One approach, which is similar to that of Romov and Sokolov, uses an ensemble learning model of two trees; the other uses a single tree to directly predict the items purchased. They randomly sampled 500,000 negative records and an equal number of positive records for model training, because there are only 509,696 positive sessions in the dataset. They chose REPTree from Weka [1] as classifier for all models.

With the ensemble learning model, the first model for purchase detection was trained with 509,696 positive sessions and 509,696 randomly sampled negative sessions from their training data. The second model for item detection was trained with both feature sets, their session features and their item-in-session features. In the approach with one model, the model is trained on both feature sets. Finally, they optimized the thresholds of both approaches. The approach that consists of only one classifier gave better predictions.

2.3 Experiment Design

My models are based on the features of Romov and Sokolov and of Cohen et al., but neural networks are used for modeling instead of ensemble learning approaches. In this way I can investigate how neural networks with the same prior knowledge perform compared to ensemble learning models. The experiments are performed with the technologies and the environment described in [8]. There I describe how I calculate features with pandas and train models with Keras.

In the following I explain my experimental setups.

Experiments with feature sets of Romov and Sokolov

In these investigations I use the features Romov and Sokolov described in their work. Here I use a split of 90% training data to 10% test data. I balance the data, because neural networks do not learn well from unbalanced data. To do so, I undersample the data with the sample function of pandas [13]: I take all positive sessions of the training data and an equally sized random sample of negative sessions. For the preparation of the data I use Dask [4], which is based on the pandas library and uses multi-threading, because the whole dataset is too big to do the calculations in one kernel.
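The undersampling step can be sketched with plain pandas as follows. This is a minimal sketch on toy data, not the code used for the experiments: the column name `buy`, the helper `undersample` and the random seed are assumptions made for illustration.

```python
import pandas as pd


def undersample(df: pd.DataFrame, label: str = "buy", seed: int = 42) -> pd.DataFrame:
    """Keep all positive rows and an equally sized random sample of negatives."""
    positives = df[df[label] == 1]
    negatives = df[df[label] == 0].sample(n=len(positives), random_state=seed)
    # Shuffle so that the two classes are not stored in contiguous blocks.
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)


# Toy data: 3 buying sessions out of 20.
toy = pd.DataFrame({"session_id": range(20), "buy": [1] * 3 + [0] * 17})
balanced = undersample(toy)
print(balanced["buy"].value_counts().to_dict())
```

Only the training data is balanced in this way; the validation and test data keep their original class distribution.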

Logistic regression. I train a model with logistic regression to check my feature sets. The feature sets are shown in A.2.2. For this I use the implementation from the scikit-learn library [14]. The model is trained on all 395 features of the balanced train set. The default settings of the model are used.

Multilayer perceptron. I train fully connected neural networks with the balanced training data. Since the feature sets have 395 features, I decided to do a feature selection. I do one feature selection with 10% of the features and one with 25% of the features. I perform the feature selection with SelectPercentile from scikit-learn. This method selects features according to a percentile of the highest scores. As scoring function I use f_classif. For model development I use the Sequential API from Keras [9]. Depending on the experiment (see Table 1) I define the dimensions of the input layer and add layers with the corresponding number of neurons. In the first experiment the input layer has 40 dimensions, because I use 10% of the features. Two hidden layers with 64 neurons each follow. I use the regularization method dropout [6, p. 258-268] with a rate of 0.5 after each layer. This means that during training 50% of the neurons in the hidden layers are randomly switched off and not considered for the next calculation step. As activation function I use the rectified linear unit (ReLU) [6, p. 135]. The output layer has one neuron and uses the sigmoid function [6, p. 67-70] as activation function. I use binary_crossentropy [6, p. 132] as loss function and rmsprop [6, p. 307-308] as optimizer.

In the second experiment I use 25% of the features as input, which are 99 features. Here I use two hidden layers with 128 neurons each. In the third experiment I use all 395 features as input, with two hidden layers of 512 neurons each.

Layers  Neurons  Features
2       64       40
2       128      99
2       512      395

Table 1: Hyperparameters of models with features of Romov and Sokolov
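The feature selection behind the input sizes in Table 1 can be illustrated with scikit-learn as follows. This is a sketch on random stand-in data, not on the challenge features: the arrays `X` and `y`, the sample size and the seed are assumptions, and only `SelectPercentile` and `f_classif` are taken from the description above.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 395))      # stand-in for the 395 features
y = rng.integers(0, 2, size=1000)     # stand-in for the buy label
X[:, :5] += y[:, None]                # make a few features informative

# Keep the 10% and 25% of features with the highest ANOVA F-scores.
X_10 = SelectPercentile(f_classif, percentile=10).fit_transform(X, y)
X_25 = SelectPercentile(f_classif, percentile=25).fit_transform(X, y)
print(X_10.shape[1], X_25.shape[1])
```

When the feature scores are all distinct, selecting 10% and 25% of 395 features keeps 40 and 99 columns, which corresponds to the input dimensions listed in Table 1.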


Experiments with feature sets of Cohen et al.

In these investigations I use the features Cohen et al. described in their work. Item previous buy count and item previous buy percentage are not used. I was not able to calculate item previous buy count, and since item previous buy percentage is based on it, that feature is missing too. More about this follows in the Discussion in section 4. Here I use a split of 70% training data to 30% test data. I balance the data like Cohen et al. and use RandomUnderSampler from the imbalanced-learn library [10] for this. Since I work with far fewer features here, I can use pandas [13] and NumPy [5] for the data preparation.

Logistic regression. Again, I start with a logistic regression to test the feature sets. The feature sets are shown in A.2.1. I use the implementation of the scikit-learn library again. The model is trained with 33 features of the balanced training set. The default settings of the model are used.

Multilayer perceptron. I train neural networks on balanced training data with the features of Cohen et al. For this I use the Sequential API from Keras. The input layer has 33 dimensions, because I use 33 features. The hidden layers follow; their number and their number of neurons depend on the experiment (see Table 2). After each layer I use a dropout of 0.5, and I use relu as activation function. The output layer has one neuron and uses sigmoid as activation function. I use binary_crossentropy as loss function and rmsprop as optimizer.

For the first neural network I use two hidden layers with 20 neurons each. For the second network I use two hidden layers with 64 neurons each. In the third experiment I use ten hidden layers with 64 neurons each.

Layers  Neurons
2       20
2       64
10      64

Table 2: Hyperparameters of models with features of Cohen et al.

For the experiments based on the features of Romov and Sokolov and of Cohen et al., the test sets are divided into a validation set and a test set before model training. For this I divide the data into two equal halves with train_test_split from the scikit-learn library. The method randomly splits the data between the two sets. The validation set is used during model training to adjust hyperparameters. The test set is only used for models that are not to be adjusted further and are considered final. Thus, the generalization can be tested on a data set that the model has not yet seen. The model learns from the training data, and further adjustments are made by the developer using the validation set to improve hyperparameters and select the best model. Both can lead to overfitting to these two sets of data, as described in [12, p. 263-265]. In this work the test set is not yet used to check whether the models also achieve good predictions on it. I keep the set for later use in the experiments for my master thesis, since the experiments I did here are a preliminary step to those experiments.
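The division of the held-out data into two equal halves can be sketched as follows. The arrays `X_holdout` and `y_holdout` and the random seed are hypothetical placeholders; only `train_test_split` is taken from the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_holdout = np.arange(20).reshape(10, 2)   # hypothetical held-out features
y_holdout = np.array([0, 1] * 5)           # hypothetical held-out labels

# test_size=0.5 assigns one random half to each of the two sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_holdout, y_holdout, test_size=0.5, random_state=0
)
print(len(X_val), len(X_test))
```

Fixing `random_state` keeps the assignment reproducible, so that the test half stays untouched across repeated runs.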

3 Results

The first results of the experiments are the features I calculated from the click stream data. I have one training set and one test set for each of the two experiments. For the features of Romov and Sokolov I also calculated a balanced training set in advance, because balancing the complete training set directly before modeling caused memory problems. I balanced the training set with the features of Cohen et al. directly before modeling.

Features            Set                 Rows        Columns
Romov and Sokolov   train set           24,046,352  396
Romov and Sokolov   balanced train set   4,131,574  396
Romov and Sokolov   test set             2,569,224  396
Cohen et al.        train set           18,882,229   34
Cohen et al.        balanced train set   3,198,846   34
Cohen et al.        test set             7,733,347   34

Table 3: Quantity of train and test sets

Table 3 shows the number of rows and columns of the data sets. The test set for model training with the features of Romov and Sokolov is 10% of the data, and for model training with the features of Cohen et al. I use 30% of the data as the test set. Therefore, the sizes of the individual datasets differ, but the total number of rows (train set plus test set) remains the same.

The individual features are described in A.2. Once both feature sets have been set up, the experiments can be performed as described in section 2. Table 4 lists the model metrics on the training data of each model and, for comparison, the metrics on the validation set. The first row of the table contains the model metrics of the winning model, so I can directly compare the values with my own results. I do not know the exact dataset on which these metrics were calculated. As metrics I use the model's accuracy, precision and recall [12, p. 283-285]. The following rows contain the results of my models. The Model column gives the name of the model: LogR stands for logistic regression and NN for neural network. Further details follow to distinguish the experiments, namely the number of hidden layers and the number of neurons per hidden layer. For example, 2x20 stands for two hidden layers with 20 neurons each in a neural network. If a feature selection is carried out in the experiment, the abbreviation FS follows in the name, together with the number of features used, for example FS 40. In this way all experiments can be identified by their model name. The Features column indicates whether the model is trained with features from Romov and Sokolov or from Cohen et al.

Model                         Features            Set             Accuracy  Precision  Recall
Winning Solution RecSys 2015  Romov and Sokolov   -               0.77      0.16       0.77
LogR                          Cohen et al.        balanced train  0.71      0.71       0.70
LogR                          Cohen et al.        validation      0.60      0.16       0.79
NN 2x20                       Cohen et al.        balanced train  0.51      0.80       0.03
NN 2x20                       Cohen et al.        validation      0.90      0.27       0.03
LogR                          Romov and Sokolov   balanced train  0.69      0.70       0.67
LogR                          Romov and Sokolov   validation      0.66      0.18       0.71
NN 2x64 FS 40                 Romov and Sokolov
NN 2x128 FS 99                Romov and Sokolov
NN 2x512                      Romov and Sokolov

Table 4: Model training results

In binary classification the accuracy [12, p. 284] is the number of correct predictions divided by the total number of predictions. Precision [12, p. 285] is the fraction of relevant instances among the retrieved instances, and recall [12, p. 285] is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. In binary classification, the relevant instances are those of the positive class, i.e. in this case the user buys the item. For unbalanced data we must consider metrics like precision and recall, because a model that only predicts the predominant class can have a very high accuracy, as in [12, p. 279].
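The three metrics and the effect of unbalanced data can be made concrete with a small worked example. The helper `metrics` and the counts are illustrative assumptions: 1,000 sessions with 55 buyers mirror the 5.5% buyer rate mentioned in section 2.1, and returning 0.0 for an undefined precision is a convention chosen here.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Convention for this sketch: undefined ratios are reported as 0.0.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# A model that predicts "no purchase" for all 1,000 sessions (55 buyers).
always_negative = metrics(tp=0, fp=0, fn=55, tn=945)
print(always_negative)
# {'accuracy': 0.945, 'precision': 0.0, 'recall': 0.0}
```

The accuracy of 0.945 looks good although the model does not recognize a single purchase, which is why precision and recall are reported alongside accuracy.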

The winning model has an accuracy of 0.77, which means it assigns 77% of the records to their correct classes. The precision of 0.16 means that 16% of the records predicted as positive actually belong to the positive class. The recall of 0.77 means that the model recognizes 77% of the positive class.

My trained models have an accuracy between 0.09 and 0.90. The precision is between 0.09 and 0.71. The recall is between 0.03 and 1.00.

4 Discussion

First, I interpret the model results here and then I will discuss the feature sets.

4.1 Model Results

As Müller and Guido describe in [12, p. 285], I use recall and precision as additional metrics for unbalanced data. But there is a trade-off between optimizing recall and optimizing precision. A perfect recall is achieved by predicting that all examples belong to the positive class. Then there are no false negatives and no true negatives. Predicting all examples as positive, however, leads to many false alarms, so that the precision is very low. On the other hand, a model achieves perfect precision if it only predicts those examples as positive where it is certain and predicts the rest as negative. In this case, the recall will be very poor because many positive examples are not recognized.
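The trade-off can be observed by sweeping a decision threshold over model scores. The following sketch uses synthetic scores, not the outputs of the models in this work: the 10% positive rate, the score distributions and the helper `precision_recall` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.random(2000) < 0.1                    # roughly 10% positives
# Positives receive higher scores on average; scores are clipped to [0, 1].
scores = np.clip(rng.normal(0.3 + 0.3 * y_true, 0.15), 0, 1)


def precision_recall(threshold: float) -> tuple:
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = scores >= threshold
    tp = np.sum(predicted & y_true)
    # Convention for this sketch: precision is 1.0 if nothing is predicted positive.
    precision = tp / predicted.sum() if predicted.any() else 1.0
    recall = tp / y_true.sum()
    return float(precision), float(recall)


for t in (0.0, 0.3, 0.5, 0.7):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

At threshold 0.0 every example is predicted as positive, so the recall is 1.0 and the precision equals the share of positives in the data. Raising the threshold can only lower the recall.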

The case of perfect recall occurs here with the two models NN 10x64 trained on the features of Cohen et al. and NN 2x512 trained on the features of Romov and Sokolov. These two models did not learn to distinguish between the classes in this experimental setup.

The model NN 2x20 trained on the features of Cohen et al. rarely recognizes the purchase sessions: it recognizes only 3% of the purchases. The accuracy is accordingly high at 0.90, since the model almost always predicts that the item is not purchased. This model has learned few patterns for recognizing purchases.

The logistic regressions both have a recall above 0.70 and achieve an accuracy of at least 0.60 on their validation sets. For LogR trained on the features of Cohen et al. this is a difference of 17 percentage points in accuracy to the winning model. LogR trained on the features of Romov and Sokolov has a gap of 11 percentage points to the winning model. The logistic regression on the features of Romov and Sokolov thus achieves a higher accuracy than on the features of Cohen et al.; it also has more features available. Note that the validation sets are not the same and cannot be compared exactly. Neither model reaches the result of the winning model. But I can see from the values of the model metrics that the models do not predict only one class and that they generalize through training. So, models can be trained with the feature sets. A logistic regression can only recognize linear relationships; for the recognition of non-linear relationships, models like neural networks are used.

Among the neural networks there are three models that have learned to generalize, to different degrees. On the features of Cohen et al., it is the model NN 2x64, with an accuracy of 0.66 on the validation set. The precision is 0.18 and the recall is 0.79. Precision and accuracy are the same as for the LogR trained on the features of Romov and Sokolov. The model NN 2x64 trained on the features of Cohen et al. thus achieves similar results on 33 features as the logistic regression on the 395 features.

The other two neural networks are trained on the features of Romov and Sokolov. The metric values of the models NN 2x64 FS 40 and NN 2x128 FS 99 are close to each other. The precision is 0.19 for both models. The recall is 0.61 for NN 2x64 FS 40 and 0.64 for NN 2x128 FS 99. The accuracy is slightly higher for NN 2x64 FS 40 than for NN 2x128 FS 99, because it predicts the negative class more often. There is no clear advantage here in using 25% instead of 10% of the features, but I also use more neurons in the layers, which has an influence as well. Both models have a higher accuracy than LogR trained on the features of Romov and Sokolov but are below the winning model.

4.2 Feature Sets

Neither feature set was designed independently of the future; knowledge about the future flows in at different points. The features of Romov and Sokolov include the 100 most common items and the 50 most common categories, calculated over the whole dataset. This means that the model knows which items and categories are often bought or clicked. However, this distribution is not necessarily transferable to other time periods and can then lead to errors. For example, other items are of interest during the pre-Christmas period than in summer. Under the assumption that only data from the same period are to be predicted, the approach is feasible.

Knowledge about the future is also used in the features of Cohen et al. However, this is difficult to judge because their code is not freely available, as it is for Romov and Sokolov. I therefore estimate it from the names and descriptions of the features. They also use a kind of interest in items, but they rank the items by popularity.


In my opinion, too much prior knowledge about the future does not allow an independent prediction and should not be used for modelling.

As already described in section 2, I have not used two features from the feature sets of Cohen et al. I could not calculate item previous buy count in the way I interpreted and implemented it. For each item in a session, I looked at the previous purchases. This led to a lot of data, because each item in the clicks was linked to all purchases of that item in the buys, and the result was then filtered for purchases made before the session time. This caused problems with the random-access memory even with distributed calculation with Dask. Unfortunately, there is no detailed description of how Cohen et al. implemented the feature. I therefore excluded both features from my experiment. Another possibility is to calculate the purchases on all data without temporal filtering, but this conflicts with independence from the future.
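The interpretation of item previous buy count described above can be sketched on toy data with pandas. This is an illustration of the join-then-filter idea only, not the original Dask code: the column names, the toy values and the assumption that the session time is the time of the first click in the session are all hypothetical.

```python
import pandas as pd

clicks = pd.DataFrame({
    "session_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "timestamp": pd.to_datetime(
        ["2014-04-02 10:00", "2014-04-02 10:05", "2014-04-03 09:00"]),
})
buys = pd.DataFrame({
    "item_id": [10, 10, 11],
    "buy_time": pd.to_datetime(
        ["2014-04-01 08:00", "2014-04-02 12:00", "2014-04-04 08:00"]),
})

# Assumption: the "session time" is the time of the first click in the session.
clicks["session_start"] = clicks.groupby("session_id")["timestamp"].transform("min")
pairs = clicks[["session_id", "item_id", "session_start"]].drop_duplicates()

# The join creates one row per (session item, purchase of that item) pair.
# On the full dataset, this intermediate table is what becomes very large.
joined = pairs.merge(buys, on="item_id", how="left")
earlier = joined[joined["buy_time"] < joined["session_start"]]

index = pd.MultiIndex.from_frame(pairs[["session_id", "item_id"]])
item_previous_buy_count = (
    earlier.groupby(["session_id", "item_id"]).size().reindex(index, fill_value=0)
)
print(item_previous_buy_count)
```

In the toy data, item 10 has one earlier purchase for session 1 and two for session 2, while item 11 has none, because its only purchase happens after the session.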

5 Conclusion

With the results presented here, the neural networks do not exceed the results of the winning model of Romov and Sokolov. However, the neural net NN 2x64 trained on the features of Cohen et al. achieved results similar to those of the logistic regression on the features of Romov and Sokolov. So far it has only been possible to train a generalizing neural net on fewer than the 395 features of Romov and Sokolov. Limiting knowledge by feature selection always means deciding which information is more important or more suitable for solving the problem and which information is not relevant. As in my case, this decision can be based on statistics or on domain knowledge. This means, however, that I have not put all available knowledge into the model, and that I take the decision about which features are the decisive ones away from the model. I do not want to limit this knowledge; instead I want to calculate on all features like Romov and Sokolov. The model itself should extract the knowledge that is important for the purchase prediction. I expect an improvement of the metrics if it is possible to train a model on all features, and the goal is still to exceed the ensemble learning models.

There can be many reasons why it was not possible to train a generalizing model on the 395 features. Therefore, I would like to pursue this further in my master thesis, following three approaches.

The first approach is the adaptation of the model architecture by tuning hyperparameters, for example the number of layers and the number of neurons. The current choice may be inappropriate and may prevent the model from generalizing.

The second approach is to reduce the dimensions by learned representations. Procedures such as embeddings reduce the dimensions of the data, but do not incur the loss of information that comes with feature selection. This builds on the observation that I have already successfully trained networks on the features of Romov and Sokolov with a smaller input dimension.

The third approach is to try out other network architectures like recurrent neural networks. They are trained on sequence data and thus follow a different method to surpass the ensemble learning models.

So, there are other ways to structure models that need to be explored before I can draw a conclusion as to whether ensemble learning methods or deep learning methods are better suited to this problem. I will pursue these further possibilities in my master thesis. The goal is to train a model that uses the knowledge of the 395 features of Romov and Sokolov and to test other network architectures like recurrent neural networks.

Acknowledgements

This work was developed within the context of the machine learning working group ML AG at HAW Hamburg. I would like to thank Prof. Dr. Kai von Luck and M.Sc. Tobias Eichler, who actively support the working group, encourage the exchange between the students and supervise our projects. I would especially like to thank Matthias Nitsche, Stephan Halbritter and Timo Lange for the professional exchange in this group and for the work on the systems that provide a platform for machine learning projects. I would also like to thank Henrik Wortmann from the /* CREATIVE SPACE FOR TECHNICAL INNOVATIONS */ laboratory at HAW Hamburg, who is driving the development of an infrastructure for machine learning projects at HAW Hamburg together with Tobias Eichler. Finally, I would like to thank all members of the ML AG who are not mentioned by name for the exchange and participation that contributed to this work.


References

[1] Machine Learning Group at the University of Waikato. Weka 3: Data mining software in Java, 2019. URL https://www.cs.waikato.ac.nz/ml/weka/. Accessed 2019-03-15.

[2] Nadav Cohen, Adi Gerzi, David Ben-Shimon, Bracha Shapira, Lior Rokach, and Michael Friedmann. In-House Solution for the RecSys Challenge 2015. Proceedings of the 2015 International ACM Recommender Systems Challenge on - RecSys '15 Challenge, pages 1–4, 2015. doi: 10.1145/2813448.2813519. URL http://dl.acm.org/citation.cfm?doid=2813448.2813519.

[3] RecSys Community. RecSys, 2019. URL https://recsys.acm.org/recsys19/. Accessed 2019-03-08.

[4] Dask. Dask, 2018. URL https://docs.dask.org/en/latest/. Accessed 2019-03-15.

[5] NumPy developers. NumPy, 2018. URL http://www.numpy.org/. Accessed 2019-03-15.

[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

[7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, second edition, 2009. ISBN 978-0-387-84857-0.

[8] Tasmin Herrmann. Development Environment for Buy Predictions with Machine Learning. 2018. URL http://users.informatik.haw-hamburg.de/~ubicomp/projekte/master2018-proj/herrmann.pdf.

[9] Keras. Keras: The Python deep learning library, 2019. URL https://keras.io/. Accessed 2019-03-15.

[10] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017. URL http://jmlr.org/papers/v18/16-365.

[11] Malte Ludewig and Dietmar Jannach. Evaluation of session-based recommendation algorithms. CoRR, abs/1803.09587, 2018. URL http://arxiv.org/abs/1803.09587.


[12] Andreas C. Müller and Sarah Guido. Introduction to Machine Learning with Python. O'Reilly Media, Inc., third edition, 2017. ISBN 978-1-449-36941-5.

[13] Pandas. Pandas: Python data analysis library, 2019. URL https://pandas.pydata.org/index.html. Accessed 2019-03-15.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[15] Peter Romov and Evgeny Sokolov. RecSys Challenge 2015. In Proceedings of the 2015 International ACM Recommender Systems Challenge on - RecSys '15 Challenge, pages 1–4, New York, New York, USA, 2015. ACM Press. ISBN 9781450336659. doi: 10.1145/2813448.2813510. URL http://dl.acm.org/citation.cfm?doid=2813448.2813510.

[16] Yandex. MatrixNet: New level of search quality, 2019. URL https://yandex.com/company/technologies/matrixnet/. Accessed 2019-03-15.

[17] Yandex. Yandex, 2019. URL https://yandex.com/company/. Accessed 2019-03-15.


A Appendices


A.1 RecSys Challenge 2015 Data

Session ID  Timestamp                 Item ID    Category
1           2014-04-07T10:51:09.277Z  214536502  0
1           2014-04-07T10:54:09.868Z  214536500  0
1           2014-04-07T10:54:46.998Z  214536506  0
1           2014-04-07T10:57:00.306Z  214577561  0
2           2014-04-07T13:56:37.614Z  214662742  0
2           2014-04-07T13:57:19.373Z  214662742  0
2           2014-04-07T13:58:37.446Z  214825110  0
2           2014-04-07T13:59:50.710Z  214757390  0

Table 5: Examples of the click dataset

Session ID  Timestamp                 Item ID    Price  Quantity
420374      2014-04-06T18:44:58.314Z  214537888  12462  1
420374      2014-04-06T18:44:58.325Z  214537850  10471  1
281626      2014-04-06T09:40:13.032Z  214535653   1883  1
420368      2014-04-04T06:13:28.848Z  214530572   6073  1
420368      2014-04-04T06:13:28.858Z  214835025   2617  1
140806      2014-04-07T09:22:28.132Z  214668193    523  1
140806      2014-04-07T09:22:28.176Z  214587399   1046  1
140806      2014-04-07T09:22:28.219Z  214586690    837  1

Table 6: Examples of the buy dataset


A.2 Feature Sets

A.2.1 Feature Sets of Cohen et al.

Name                            Data type  Description
Session time                    Float64    Session time in seconds
Average time clicks             Float64    Average time between two clicks in seconds
Maximal time clicks             Float64    Maximal time between two clicks in seconds
Day week                        Int64      Day of the week
Month year                      Int64      Month of the year
Time day                        Int64      Time during the day: morning | afternoon | evening | night
Number clicks                   Int64      Number of clicks
Maximal clicks item             Int64      Maximal number of clicks on one item
Percentage popular items        Float64    Percentage of items in the session that are popular; I compute an overall popularity score for each item as #buys/#clicks
Distinct items                  Int64      Number of distinct items
Average price                   Float64    Average price of clicked items
Percentage S                    Float64    Percentage of "S" category from all categories of the items
Percentage items with category  Float64    Percentage of items that have category

Table 7: Feature set for describing a session
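Several of the session features in Table 7 follow directly from grouping the clicks by session. The sketch below computes some of them for the two sessions shown in Table 5. The column names are assumptions, and taking the weekday and month from the first click of the session is also an assumption, since the table does not specify it.

```python
import pandas as pd

# Clicks of sessions 1 and 2, copied from Table 5.
clicks = pd.DataFrame({
    "session_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2014-04-07T10:51:09.277Z", "2014-04-07T10:54:09.868Z",
        "2014-04-07T10:54:46.998Z", "2014-04-07T10:57:00.306Z",
        "2014-04-07T13:56:37.614Z", "2014-04-07T13:57:19.373Z",
        "2014-04-07T13:58:37.446Z", "2014-04-07T13:59:50.710Z",
    ]),
    "item_id": [214536502, 214536500, 214536506, 214577561,
                214662742, 214662742, 214825110, 214757390],
})
clicks = clicks.sort_values(["session_id", "timestamp"])

# Seconds between consecutive clicks of a session (NaN for the first click).
clicks["gap"] = clicks.groupby("session_id")["timestamp"].diff().dt.total_seconds()

grouped = clicks.groupby("session_id")
start = grouped["timestamp"].min()
end = grouped["timestamp"].max()
clicks_per_item = clicks.groupby(["session_id", "item_id"]).size()

session_features = pd.DataFrame({
    "session_time": (end - start).dt.total_seconds(),
    "average_time_clicks": grouped["gap"].mean(),
    "maximal_time_clicks": grouped["gap"].max(),
    "day_week": start.dt.dayofweek,  # assumption: taken from the first click, Monday = 0
    "month_year": start.dt.month,
    "number_clicks": grouped.size(),
    "distinct_items": grouped["item_id"].nunique(),
    "maximal_clicks_item": clicks_per_item.groupby(level="session_id").max(),
})
print(session_features)
```

The price and popularity features additionally need the buy dataset and are left out of this sketch.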


Name                                 Data type  Description
Item clicked more than once          Int64      Whether the item appears more than once in the session
Item clicked first                   Int64      Whether the item was clicked first in the session
Item clicked last                    Int64      Whether the item was clicked last in the session
Item appearances                     Int64      Number of appearances in the session
Item click most clicked ratio        Float64    Number of clicks on that item divided by the number of clicks on the most clicked item in the session
Item click avg clicks ratio          Float64    Number of clicks on that item divided by the average number of clicks on items in that session
Click count before first item click  Int64      Number of clicks in the session before the first click on that item
Click count before last item click   Int64      Number of clicks in the session before the last click on that item
Click count after first item click   Int64      Number of clicks in the session after the first click on that item
Click count after last item click    Int64      Number of clicks in the session after the last click on that item
Item before click ratio              Float64    Number of clicks before first and last click on the item divided by the total number of clicks in the session
Item price                           Float64    Price of the item
Item popularity score                Float64    Popularity score of the item
Item rank popularity score           Int64      Rank of the item according to the popularity score
Item previous buy count              Int64      Number of previous buys recorded for the item
Item previous buy percentage         Float64    Percentage of previous buys for that item
Item time gap                        Float64    Time gap from first to last click on the item in that session
Item avg time gap                    Float64    Average time gap between clicks on the item in the session
Item max time gap                    Float64    Maximum time gap between clicks on the item in the session
Item time gap ratio                  Float64    Time gap from the first click on the item to the last click on it divided by the length of the session
Item min time gap                    Float64    Minimum time gap between the clicks on that item in the session
Item sales peak                      Int64      Whether the item is at the peak of its sales
Label                                Int64      1 if the user bought this item in this session, 0 otherwise

Table 8: Feature set for describing an item in a session
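The click-position features in Table 8 can be derived from the order of the clicks alone. The following sketch does this for session 2 of Table 5, in which item 214662742 is clicked twice. The column names and the 0-based `position` column are assumptions for illustration, and only a subset of the features is shown.

```python
import pandas as pd

# Item IDs of session 2 from Table 5, in chronological order.
session = pd.DataFrame({
    "item_id": [214662742, 214662742, 214825110, 214757390],
})
session["position"] = range(len(session))  # 0-based click index in the session
n_clicks = len(session)

per_item = session.groupby("item_id")["position"].agg(
    appearances="count", first_pos="min", last_pos="max")

features = pd.DataFrame({
    "item_appearances": per_item["appearances"],
    "item_clicked_more_than_once": (per_item["appearances"] > 1).astype(int),
    "item_clicked_first": (per_item["first_pos"] == 0).astype(int),
    "item_clicked_last": (per_item["last_pos"] == n_clicks - 1).astype(int),
    # The 0-based position equals the number of clicks before that click.
    "click_count_before_first_item_click": per_item["first_pos"],
    "click_count_before_last_item_click": per_item["last_pos"],
    "click_count_after_first_item_click": n_clicks - 1 - per_item["first_pos"],
    "click_count_after_last_item_click": n_clicks - 1 - per_item["last_pos"],
    "item_click_most_clicked_ratio":
        per_item["appearances"] / per_item["appearances"].max(),
    "item_click_avg_clicks_ratio":
        per_item["appearances"] / per_item["appearances"].mean(),
})
print(features)
```

Applied to the whole click dataset, the same computation would run once per session, for example inside a `groupby("session_id")`.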

A.2.2 Feature Sets of Romov and Sokolov

Name                      Data type                 Quantity  Description
Session start             Int64, Float64, Category  14        Numerical and categorical time features of the start of the session (month, day, hour, minute, second, etc.)
Session end               Int64, Float64, Category  14        Numerical and categorical time features of the end of the session (month, day, hour, minute, second, etc.)
Session length seconds    Float64                   1         Length of the session in seconds
Number of clicks          Int64                     1         Number of clicks in the session
Unique items              Int64                     1         Number of unique items in the session
Unique categories         Int64                     1         Number of unique categories in the session
Unique item category      Int64                     1         Number of unique item-category pairs in the session
Top items                 Category                  10        Top 10 items by the number of clicks in the session
Top categories            Category                  5         Top 5 categories by the number of clicks in the session
First last least clicked  Category                  12        IDs of the first/last item clicked at least k = 1, 2, ..., 6 times in the session
Clicks item               Float64                   100       Click numbers for 100 items that were the most popular in the whole training set
Duration item             Float64                   100       Total durations for 100 items that were the most popular in the whole training set
Clicks category           Float64                   50        Click numbers for 50 categories that were the most popular in the whole training set
Duration category         Float64                   50        Total durations for 50 categories that were the most popular in the whole training set

Table 9: Feature set for describing a session
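The "Clicks item" block of Table 9 yields one column per globally popular item. A minimal sketch of this idea on the rows of Table 5 is given below. `TOP_N` is reduced from 100 to 2 to keep the toy example readable, popularity is assumed to mean the number of clicks in the training set, and all column names are assumptions.

```python
import pandas as pd

# Sessions 1 and 2 from Table 5 serve as a toy training set.
clicks = pd.DataFrame({
    "session_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "item_id": [214536502, 214536500, 214536506, 214577561,
                214662742, 214662742, 214825110, 214757390],
    "category": ["0"] * 8,
})

TOP_N = 2  # Table 9 uses 100 items; 2 is an assumption for this toy example.

# Assumption: "most popular" means most clicks over the whole training set.
popular_items = clicks["item_id"].value_counts().head(TOP_N).index

# One column per popular item with its click count inside each session.
clicks_item = (
    clicks[clicks["item_id"].isin(popular_items)]
    .groupby(["session_id", "item_id"]).size()
    .unstack(fill_value=0)
    .reindex(index=clicks["session_id"].unique(),
             columns=popular_items, fill_value=0)
)

# Simple per-session counts from the same table.
counts = clicks.groupby("session_id").agg(
    number_of_clicks=("item_id", "count"),
    unique_items=("item_id", "nunique"),
    unique_categories=("category", "nunique"),
)
print(clicks_item)
print(counts)
```

Sessions that contain none of the popular items get a row of zeros through the `reindex` call, so every session ends up with the same number of columns.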


Name                    Data type                 Quantity  Description
Item id                 Category                  1         Item ID
Item clicks             Int64                     1         Total number of clicks in the session for the given item
Item relative clicks    Float64                   1         Relative number of clicks in the session for the given item
Times first last        Int64, Float64, Category  28        Numerical and categorical time features of the first/last click on the item (month, day, month-day, month-day-hour, hour, minute, weekday)
Item diff seconds       Float64                   1         Number of seconds between the first and the last click on the item
Total duration          Float64                   2         Total duration of the clicks on the item in the session and of all item’s categories seen in the session
Item unique categories  Int64                     1         Number of unique categories seen in the session for a given item

Table 10: Feature set for describing an item in a session
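Some of the item features in Table 10 can be sketched for a single session as follows, again using session 2 of Table 5. The table does not define the duration of a click; the sketch assumes that a click lasts until the next click in the session and that the last click has a duration of 0. Column names are likewise assumptions.

```python
import pandas as pd

# Session 2 from Table 5.
session = pd.DataFrame({
    "item_id": [214662742, 214662742, 214825110, 214757390],
    "timestamp": pd.to_datetime([
        "2014-04-07T13:56:37.614Z", "2014-04-07T13:57:19.373Z",
        "2014-04-07T13:58:37.446Z", "2014-04-07T13:59:50.710Z",
    ]),
}).sort_values("timestamp")

# Assumption: a click lasts until the next click; the last click gets 0 seconds.
session["duration"] = (
    (session["timestamp"].shift(-1) - session["timestamp"])
    .dt.total_seconds()
    .fillna(0.0)
)

per_item = session.groupby("item_id").agg(
    item_clicks=("timestamp", "count"),
    first_click=("timestamp", "min"),
    last_click=("timestamp", "max"),
    total_duration=("duration", "sum"),
)
per_item["item_relative_clicks"] = per_item["item_clicks"] / len(session)
per_item["item_diff_seconds"] = (
    per_item["last_click"] - per_item["first_click"]
).dt.total_seconds()
print(per_item)
```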
