UNIVERSITÀ DEGLI STUDI DI CAGLIARI - FACOLTÀ DI SCIENZE
Corso di Laurea Magistrale in Informatica

Price Probe - Price Forecasting using ARIMA on Amazon's Items

Supervisor: Prof. Diego Reforgiato Recupero
Candidates: Andrea Medda (student no. 65034), Alessio Pili (student no. 65040)
Academic year 2016-2017
A unit root is a stochastic trend in a time series. If a time series has a unit root, it shows a systematic pattern that is unpredictable. The augmented Dickey-Fuller (ADF) test is built with the null hypothesis that there is a unit root. The null hypothesis can be rejected if the p-value of the test result is less than 5%. Furthermore, we also checked that the Dickey-Fuller statistic is more negative than the associated t-distribution critical value; the more negative the statistic, the more strongly we can reject the hypothesis that there is a unit root. If the test shows that we cannot reject the hypothesis, we have to difference the series and repeat the test. Usually, a series that needs differencing more than twice is not well suited to an ARIMA fit.
5.2. Algorithm 23
Figure 5.2: Example of a non-stationary series.
• Parameters estimation: In time series analysis, the Partial AutoCorrelation Function (PACF) gives the partial correlation of a time series with its own lagged values, controlling for the values of the time series at all shorter lags. It contrasts with the AutoCorrelation Function (ACF), which does not control for other lags. These functions play an important role in time series analysis, helping to identify the extent of the lag in an autoregressive model (PACF) and in a moving average model (ACF). More information about p and q estimation and their importance is given by the authors in [20].
24 Chapter 5. Method
Figure 5.3: Example of an ACF plot. This plot shows spikes for lag values less than 4, based on the 95% confidence criterion, so we choose q = 4 as the upper bound when searching for the best (p, d, q) configuration.
The use of these functions was introduced as part of the Box-Jenkins approach to time series modeling: by computing the partial autocorrelation function one can determine the appropriate lag p of an ARIMA (p, d, q) model, and by computing the autocorrelation function one can determine the appropriate lag q. The partial autocorrelation of an AR(p) process becomes zero at lag p+1 and greater, so we examine the sample partial autocorrelation function for evidence of a departure from zero. This is usually determined by placing a 95% confidence interval on the sample partial autocorrelation plot. If the software does not generate the confidence band, it is approximately ±2/√N, with N denoting the sample size. The autocorrelation function of an MA(q) process becomes zero at lag q+1 and greater, so we examine the sample autocorrelation function to see where it essentially becomes zero, by placing the 95% confidence interval for the sample autocorrelation function on the sample autocorrelation plot.
Figure 5.4: Example of a PACF plot. This plot shows spikes for lag values less than 2, based on the 95% confidence criterion, so we choose p = 2 as the upper bound when searching for the best (p, d, q) configuration.
• Model evaluation: To determine the best ARIMA model for each of Amazon's items, we always follow these steps:
– check stationarity with the ADF test, which is also useful to find an appropriate d value;
– find p and q based on the PACF and ACF. The ACF and PACF results give us upper bounds for iterating the fit of the model, keeping the best (p, d, q) combination based on the lowest Mean Squared Error value, described as:
MSE = (1/n) Σ_{i=1}^{n} (Ŷ_i − Y_i)²

where Ŷ is a vector of n predictions and Y is the vector of observed values of the variable being predicted.
5.2.2 Application
This subsection describes how we applied ARIMA to our study. As stated in the previous chapters, we have different features (in time series format) that can be treated as exogenous variables for ARIMA. Each of them has the date field that is mandatory to merge the Data Frames containing this information. Essentially, we merge the exogenous data with the basic one on the date key. Basic Data is formed by (price, date) rows; the exogenous ones carry different values but always have a date column, so an exogenous row looks like a (date, value) entry, where value is a float or a list of floats. For instance, when merging Basic Data with Google Trends Data we merge (date, price) with (date, popularity) to obtain (date, popularity, price) rows. Our algorithm is very specialized since, for each item:
• ARIMA's parameters - Calculate (p, d, q) as stated in 5.2.1
• Data Retrieval - Retrieve the item's external data. For instance, if the item has Google Trends entries for its Manufacturer, we retrieve them and add them to the current external features
• Model Fit - Fit a model for each combination of the item's available basic and external features. This also depends on the test size used
• Results - Compare the results obtained in the previous steps and select only the best one, in terms of the lowest MAPE produced by the difference between the original trend and the forecast under consideration
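The date-key merge described in this subsection can be sketched with pandas; the column names below are illustrative, mirroring the (date, price) and (date, popularity) rows in the text:

```python
import pandas as pd

basic = pd.DataFrame({"date": ["2017-02-17", "2017-02-18", "2017-02-19"],
                      "price": [259.9, 261.5, 258.0]})
trends = pd.DataFrame({"date": ["2017-02-17", "2017-02-18", "2017-02-19"],
                       "popularity": [71, 74, 80]})

# Inner merge on the mandatory date key: (date, price) + (date, popularity)
# becomes (date, price, popularity) rows.
merged = basic.merge(trends, on="date", how="inner")
print(list(merged.columns))   # ['date', 'price', 'popularity']
```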
These points can be formalized as follows. Given an item I, we retrieve its available data (features) F_i, which can be split into two subsets of features such that:

F_{B_i} ⊆ F_i,  F_{E_i} ⊆ F_i
F_{B_i} ∩ F_{E_i} = {date}
F_{B_i} ∪ F_{E_i} = F_i
F_{B_i} = {date, price}
F_{E_i} ∈ P((F_i − F_{B_i}) ∪ {date})

We have a function P that takes F_{B_i} and a subset f_{E_i} as inputs and returns a new set of features containing all the elements of F_{B_i} together with a possible combination of exogenous features (possibly ∅); the function C returns all the possible combinations of features in F_{E_i}:

C(F_{E_i}) = {f_{E_i} : f_{E_i} ∈ P(F_{E_i})}
P(F_{B_i}, f_{E_i}) = F_{B_i} ∪ f_{E_i}
F_{T_i} = {P(F_{B_i}, f_{E_i}) : f_{E_i} ∈ C(F_{E_i})}
Having obtained all the feature combinations F_{T_i}, we proceed to fit the ARIMA model with each one of them, coupled with its MAPE, saving these results as a map (Map[Fitted Model, MAPE Score]) defined as m[M_{T_i}, E_{T_i}]. Given O_i, the real trend described by F_{B_i}, we end up with m (fitModelScores in the pseudo-code), containing as key the fitted model of a given combination of features and as value the MAPE score, described as follows:
MAPE = (100/n) Σ_{j=1}^{n} |(O_j − R_j) / O_j|

Taking a real price trend O and a forecast R, we compute the difference at each pair of points sharing the same date field, accessing them through the index j. Summing all of these relative differences yields the percentage error of the average difference between all the points considered. More about MAPE can be read in [22].
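The MAPE above translates directly to code; a minimal sketch, assuming the real and forecast points are already aligned by date:

```python
import numpy as np

def mape(real, forecast):
    """Mean absolute percentage error between aligned points."""
    real = np.asarray(real, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 / len(real) * np.sum(np.abs((real - forecast) / real))

# 10% + 5% + 0% error over three points averages to 5%.
print(round(mape([100, 200, 400], [110, 190, 400]), 2))   # 5.0
```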
5.2.3 Results
To obtain the final result, we simply look for the entry in m with the lowest value: the key such that m[key] = min(m.values()) identifies the best combination of features for item i.
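This lookup is a one-line minimum over the map; the scores below are illustrative, not taken from our results:

```python
# m maps each fitted feature combination to its MAPE score.
m = {"(price, date)": 9.2, "(price, date, trend)": 4.8, "(price, date, flag)": 9.2}
best = min(m, key=m.get)
print(best, m[best])   # (price, date, trend) 4.8
```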
Chapter 6
Evaluation
In this section we describe the results obtained and our considerations on them:
• Results 6.1
• Comments 6.2
6.1 Results
For experiment purposes, we considered the 1000 items with the most price entries. We did this for two reasons. The first is that we are trying to emulate daily price tracking, so items with more prices reflect this scenario more closely. The second is that the process is computationally expensive and its runtime depends on how many features are involved and on the p, d, q values. We used three different test sizes: 10%, 20% and 30%. The test size also determines the number of predictions made. So, given a set of 100 prices and a test set size of 10%, we use 90 of them to train the model, perform 10 forecasts on the days excluded from the data set, and then calculate the MAPE on them. At the moment, we exclude the Categories feature because ARIMA is not well suited to clustered analysis of items sharing one or more categories. We also excluded currencies because we noticed that this feature negatively influenced each prediction; it seems that price trend and currencies are not bound. Below we report a table showing how many times a feature has been found and used for each item:
Table 6.1: Items and Features Numbers

Features Combination                                  N. of Items   % of 1000 Items
price, date                                           1000          100%
price, date, manufacturer                             360           36%
price, date, manufacturer, trend                      230           23%
price, date, sentiment                                552           55.2%
price, date, stars                                    552           55.2%
price, date, sentiment, stars                         552           55.2%
price, date, manufacturer, stars                      331           33.1%
price, date, manufacturer, sentiment                  331           33.1%
price, date, manufacturer, sentiment, stars           331           33.1%
price, date, manufacturer, trend, stars               211           21.1%
price, date, manufacturer, trend, sentiment           211           21.1%
price, date, manufacturer, trend, sentiment, stars    211           21.1%
Then, we report three tables showing, for each combination of features, on how many items it is available and its average MAPE, for each of the test sizes used:
Table 6.2: Test Size 10%

Features Combination                          N. of Items   Average MAPE
price, date                                   1000          2.31%
price, date, flag                             1000          2.31%
price, date, trend                            330           1.98%
price, date, trend, flag                      330           1.98%
price, date, sentiment                        40            1.83%
price, date, stars                            40            1.83%
price, date, flag, sentiment                  40            1.83%
price, date, flag, stars                      40            1.83%
price, date, flag, sentiment, stars           40            1.83%
price, date, sentiment, stars                 40            1.83%
price, date, trend, sentiment                 30            3.59%
price, date, trend, stars                     30            3.59%
price, date, trend, sentiment, stars          30            3.59%
price, date, trend, sentiment, flag           30            3.59%
price, date, trend, stars, flag               30            3.59%
price, date, trend, flag, sentiment, stars    30            3.59%
Table 6.3: Test Size 20%

Features Combination                          N. of Items   Average MAPE
price, date                                   1000          9.56%
price, date, flag                             1000          9.56%
price, date, trend                            330           5.69%
price, date, trend, flag                      330           5.69%
price, date, sentiment                        40            7.69%
price, date, stars                            40            7.69%
price, date, flag, sentiment                  40            7.69%
price, date, flag, stars                      40            7.69%
price, date, flag, sentiment, stars           40            7.69%
price, date, sentiment, stars                 40            7.69%
price, date, trend, sentiment                 30            11.37%
price, date, trend, stars                     30            11.37%
price, date, trend, sentiment, stars          30            11.37%
price, date, trend, sentiment, flag           30            11.37%
price, date, trend, stars, flag               30            11.37%
price, date, trend, flag, sentiment, stars    30            11.37%
Table 6.4: Test Size 30%

Features Combination                          N. of Items   Average MAPE
price, date                                   1000          9.25%
price, date, flag                             1000          9.25%
price, date, trend                            330           6.64%
price, date, trend, flag                      330           6.64%
price, date, sentiment                        40            7.9%
price, date, stars                            40            7.9%
price, date, flag, sentiment                  40            7.9%
price, date, flag, stars                      40            7.9%
price, date, flag, sentiment, stars           40            7.9%
price, date, sentiment, stars                 40            7.9%
price, date, trend, sentiment                 30            10.65%
price, date, trend, stars                     30            10.65%
price, date, trend, sentiment, stars          30            10.65%
price, date, trend, sentiment, flag           30            10.65%
price, date, trend, stars, flag               30            10.65%
price, date, trend, flag, sentiment, stars    30            10.65%
Next, we report three tables showing, for each combination of features, on how many items it has been the best and its average MAPE score on those items, for each test size:
Table 6.5: Test Size 10%

Features Combination        N. of Items Where Best   Average MAPE
price, date                 830                      2.77%
price, date, trend          110                      1.29%
price, date, flag           30                       0%
price, date, trend, flag    30                       0.43%
Table 6.6: Test Size 20%

Features Combination        N. of Items Where Best   Average MAPE
price, date                 830                      10.59%
price, date, trend          130                      4.85%
price, date, flag           40                       0%
Table 6.7: Test Size 30%

Features Combination        N. of Items Where Best   Average MAPE
price, date                 850                      9.97%
price, date, trend          110                      6.14%
price, date, flag           40                       0%
In the tables above, we have a 0% score because those 40 items have a flat price history (the price value never changes over time).
Next, we report a table containing the averages of the tables above over the common feature combinations:
Table 6.8: Average Best Scores on different Test Sizes

Features Combination    N. of Items Where Best   Average MAPE
price, date             837                      7.77%
price, date, trend      116                      4.1%
price, date, flag       33                       0%
As highlighted in the tables above, in most cases the smaller the test size, the more accurate the score. We noticed that the (price, date, trend) feature combination is the most promising, having the lowest overall average MAPE. It is interesting to analyze all the items having a Manufacturer, how many of those have Google Trends entries, and on how many of the latter (price, date, trend) was the best combination, for each test size used:
Table 6.9: Manufacturer and Google Trends based predictions

Test Size   Items with a Manufacturer   Items with Google Trends Entries   Times (price, date, trend) Was the Best Combination
10%         360                         230                                140
20%         360                         230                                130
30%         360                         230                                100
As stated in chapter 4, unfortunately it was impossible to retrieve some Google Trends entries given a Manufacturer; otherwise we would surely have had more of them. It is interesting to notice that, out of those 230 items, on average 123 had this best feature combination. Therefore, on 53.47% of the items having both Manufacturer and Google Trends entries, (price, date, trend) is the best configuration. That highlights that popularity consistently influences the prediction.
For completeness, we present a full example of the results obtained for one item having trend, stars and sentiment entries, for each test size. We show three graphs, one per test size used, comparing the real price with the forecasted one on the same days. The item considered has (date, price, trend) as its best feature combination:
Figure 6.1: This plot shows a simple time series of an item, with the price changing over time. Aside from some spikes, the overall price stays between 260 and 280 £.
Figure 6.2: This plot shows the time series above compared with the forecast made with a test size of 10%. It is notable how close the forecast is to the real price.
Figure 6.3: This plot shows the time series above compared with the forecast made with a test size of 20%. The forecast is still very close to the real price; the forecast value decreases in a similar way to the real one.
Figure 6.4: This plot shows the time series above compared with the forecast made with a test size of 30%. The forecast is not as close to the real price as with the smaller test sizes, because the algorithm learns from less data.
Below we report the same charts described above, normalized to a 0-100 range, and we compare both the real trend and the forecast with their Google Trends history on the same days:
Figure 6.5: This plot shows the normalized price and forecast compared with the item's manufacturer popularity, with a test size of 10%. As we can see, the normalized values are very close to the trend ones.
Figure 6.6: This plot shows the normalized price and forecast compared with the item's manufacturer popularity, with a test size of 20%. As we can see, the normalized values come back close to the trend ones.
Figure 6.7: This plot shows the normalized price and forecast compared with the item's manufacturer popularity, with a test size of 30%. As we can see, the normalized values come back close to the trend ones, with the price decreasing while the popularity increases.
6.2 Comments
In this section we comment on the results obtained in the previous one (6.1). Table 6.1 shows that we had a consistent number of different feature combinations. Overall, we had a consistent share of (price, date, stars) and (price, date, sentiment), at 55.2%. As stated in 3 and 4.1, sometimes the Amazon Affiliate APIs did not respond or did not have any information on reviews, so having such features on at least 55.2% of items is a good percentage. The same goes for Google Trends Data: we have these entries for 23% of items because sometimes the Amazon Affiliate APIs did not respond or had no information about an item's manufacturer, and with no manufacturer it is impossible to get any Google Trends history for that item. Tables 6.2, 6.3 and 6.4 show how the MAPE score changes based on the test size (10%, 20% and 30%) and on the feature combination used. In Table 6.2 we can see that the lowest average MAPE score is bound to the flag. As stated in chapter 3, the flag is a boolean that is true if a given date lies in a range of festivity days. In the other two related tables, 6.3 and 6.4, we can see how the best result is bound to the trend feature; also in 6.2, trend gives a very good score. In these tables, we can see that reviews do not influence the trend in a consistent way, even though they give a pretty good score. In tables 6.5, 6.6 and 6.7 we highlighted the best feature combinations in terms of how many times each has been used and its average MAPE score. There we can notice that the basic configuration is the most used, but this is justified by the fact that we did not have many Google Trends entries for these items. Indeed, we can see that the trend feature positively influences the score in most of the cases in which it is present for an item. We can also notice that (price, date, flag) has a 0% score, but this is justified by the fact that those items had a flat price history: they always had the same price over time. It also emerges that no feature combination including the review features has ever been a best configuration. This last evidence strengthens our previous statement that the trend is most likely not influenced by reviews. Table 6.8 highlights that the average over the previously considered tables still shows that the trend feature leads to a high accuracy. In 6.9 we show that for 64% of the items having a manufacturer it was possible to retrieve their Google Trends history, and on how many of the latter (price, trend, date) was the best combination; it is clearly a good number of entries. Next, we analyze an item that had entries for all features. In Chart 6.1 we plot the given item's price trend over time; it has a markedly non-linear pattern. In Charts 6.2, 6.3 and 6.4 we can see the real trend compared with the best forecast, made with the (price, date, trend) feature combination. It is clear that as the test size increases, the MAPE score also increases. In charts 6.5, 6.6 and 6.7 we compared the real trend, the prediction based on the test size used, and the Google Trends history over time. All the entries have been normalized to a range from 0 to 100, since Google Trends entries lie on that range. We can see that the two trends behave quite similarly and, on small test sizes, they are quite close, while in the last chart the price decreases as the Google Trends score reaches its maximum value.
Chapter 7
Conclusion
The number of online sales is growing quickly, and customers do not have a clear idea of how prices are influenced aside from sale periods. A way to predict prices would help customers make better choices about which marketplace to use to purchase a certain product, and in which period. We have shown how Price Probe predicts prices on Amazon, one of the biggest e-commerce players, over a remarkable time span and with high precision: using ARIMA with proper external features and fine-tuned parameters leads to high accuracy. Our method highly depends on how the external features have been chosen and collected. As stated in the paper, working with such closed data is very expensive. Given that, the results could surely be more accurate with more resources, which could be used to crawl items so as to have a daily (price, date) tuple for each of them; indeed, having daily tracking of items would improve our results, as highlighted in the previous chapter. Without a shadow of a doubt, it would also be interesting to use more external features about a product's popularity over time (for instance, the popularity of a specific item, e.g. the iPhone, on Twitter), since we highlighted how significantly Google Trends influences our results; similar studies already exist, like the one highlighted by the authors of [24]. The future of purchases relies on online marketplaces, and Price Probe lays the first stone in predicting when and where it is more profitable to purchase a product.
Bibliography
[1] Pai PF, Lin CS. A hybrid ARIMA and support vector machines model in stock price forecasting. Omega. 2005;33(6):497-505. Available from: http://www.sciencedirect.com/