Forecasting Copper Spot Prices:
A Knowledge-Discovery Approach
A DISSERTATION SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE
DEGREE OF MASTER OF SCIENCE
IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES
2016
By
Adegbenga Olayiwola
School of Computer Science
Table of Contents
Abstract ...................................................................................................................................... 6
Declaration ................................................................................................................................. 7
Intellectual Property Statement .................................................................................................. 8
Acknowledgements .................................................................................................................... 9
Dedication ................................................................................................................................ 10
Preface...................................................................................................................................... 10
Chapter 1 Introduction ........................................................................................................ 11
1.1. Motivation ................................................................................................................. 13
1.2. Aims and Objectives ................................................................................................. 13
1.3. Outline of the Report ................................................................................................. 14
Chapter 2 Background and Related Work .......................................................................... 15
2.1. Time Series ................................................................................................................ 17
2.1.1. Trend, Seasonality.............................................................................................. 17
2.1.2. Stationarity ......................................................................................................... 18
2.1.3. Differencing ....................................................................................................... 19
2.2. Models ....................................................................................................................... 20
2.2.1. Econometric Model – ARIMA .......................................................................... 20
2.2.2. Data Mining Model – Decision Trees ................................................................ 24
2.3. Previous Work ........................................................................................................... 30
Chapter 3 Research Design and Methodology ................................................................... 34
3.1. CRISP-DM Methodology ......................................................................................... 34
3.1.1. Business Understanding ..................................................................................... 34
3.1.2. Data Understanding ........................................................................................... 35
3.1.3. Data Preparation................................................................................................. 35
3.1.4. Modelling ........................................................................................................... 35
3.1.5. Evaluation .......................................................................................................... 35
3.1.6. Deployment ........................................................................................................ 36
3.2. Project Evaluation ..................................................................................................... 36
Chapter 4 Data Analysis and Modelling .................................................................................. 38
4.1. Data Understanding ................................................................................................... 38
4.1.1. Distribution and Statistical Characteristics ........................................................ 38
4.1.2. Correlation Analysis .......................................................................................... 43
4.2. Data Preparation ........................................................................................................ 43
4.2.1. Transformation ................................................................................................... 43
4.2.2. Standardisation ................................................................................................... 43
4.3. Decision Tree Modelling ........................................................................................... 44
4.3.1. Run Set 1 – Log Difference with CHAID ......................................................... 44
4.3.2. Run Set 2 – Price Movement with C5.0 ............................................................ 47
4.3.3. Run Set 3 – Price Change Rate with CHAID .................................................... 47
4.4. ARIMA (Time Series) Modelling ............................................................................. 47
Chapter 5 Evaluation and Results ....................................................................................... 49
5.1. Decision Trees ........................................................................................................... 49
5.1.1. Log Difference Modelling ................................................................................. 49
5.1.2. Price Movement Modelling ............................................................................... 61
5.1.3. Price Change Rate Modelling ............................................................................ 65
5.2. ARIMA...................................................................................................................... 66
5.3. Models Comparison .................................................................................................. 69
5.4. Deployment ............................................................................................................... 70
Chapter 6 Conclusions and Future Works ............................................................................... 71
6.1. Conclusions ............................................................................................................... 71
6.2. Recommendations for Future Work .......................................................................... 72
List of References .................................................................................................................... 73
Appendix .................................................................................................................................. 76
Word Count: 18,801
List of Figures
Figure 2.1: Pie Chart showing the relative use of copper in industrial sectors (Crowson, 2008;
Black, 1995) ..................................................................................................................... 16
Figure 2.2: Global Demand for Copper over five decades (Crowson, 2008) .......................... 16
Figure 2.3 Time series decomposed (Bontempi, 2013) ........................................................... 18
Figure 2.4 Plot of white noise signal ....................................................................................... 19
Figure 2.5 Sample Time series, difference and ACF/PACF plots ........................................... 23
Figure 2.6 Decision tree for deciding whether to play tennis .................................................. 24
Figure 2.7 Entropy as a function of a binary valued distribution ............................................ 26
Figure 3.1: CRISP-DM Methodology Phases .......................................................................... 34
Figure 4.1 Composite Time Series plot of Copper Spot and Crude Oil Spot Prices ............... 38
Figure 5.1: Predictor variables in ranked order of importance (Scenario 1ii) ......................... 53
Figure 5.2: Predictor variables in ranked order of importance (Scenario 3) ........................... 54
Figure 5.3: Predictor variables in ranked order of importance (Scenario 3 Improved) ........... 55
Figure 5.4: Predictor variables in ranked order of importance (Scenario 4) ........................... 56
Figure 5.5a CHAID Left Tree Subsection ............................................................................... 58
Figure 5.5b CHAID Middle Tree Subsection .......................................................................... 59
Figure 5.5c CHAID Right Tree Subsection ............................................................................. 60
Figure 5.6: Predictor variables in ranked order of importance (Price Movement Target –
Scenario 2) ........................................................................................................................ 63
Figure 5.7: C5.0 Decision Tree and Ruleset (Directional Forecasting) ................................... 64
Figure 5.8: Predictor variables in ranked order of importance (Price Change Rate Target) ... 66
Figure 5.9: Time series plot of original dataset showing variance and trend. ........................ 67
Figure 5.10: Time series plot of transformed dataset showing reduced variance. ................. 68
Figure 5.11: Time series plot of differenced transformed dataset eliminating trend to make
stationary. ......................................................................................................................... 68
Figure A.1: SPSS Modeler Stream Design (CHAID Decision Tree, Scenario 3) ................... 80
Figure A.2: SPSS Modeler Stream Design (ARIMA) ............................................................. 80
List of Tables
Table 2.1: Copper Properties and Uses .................................................................................... 15
Table 2.2 Standard AR(1) models ........................................................................................... 20
Table 2.3: Decision Tree Algorithms....................................................................................... 29
Table 2.4 Summary of Previous Work on Prediction of Metals Prices and other Time Series
Quantities .......................................................................................................................... 32
Table 4.1: LME Copper Spot Price Month-on-Month Movement .......................................... 47
Table 4.2: Expert Modeler with Constant and Transformation Option Value Runs ............... 48
Table 5.1: Trained Model Variables under Scenario 1 ............................................................ 49
Table 5.2: Trained Model Variables under Scenario 2 ............................................................ 49
Table 5.3 Trained Model Variables under Scenario 3 (Entire Dataset) ................................... 50
Table 5.4 Trained Model Variables under Scenario 4 ............................................................. 50
Table 5.5 Evaluation of Tree Models Built under Various Conditions/Combinations............ 51
Table 5.6: C5.0 Decision Tree Evaluation using Price (month-on-month) Direction Target.. 62
Table 5.7: Evaluation of Tree Models (month-on-month rate of change target) ..................... 65
Table 5.8: Expert Modeler Result: ARIMA(2,1,0)(1,0,1) with Constant and Transformation
options .............................................................................................................................. 66
Table 5.9: ARIMA Model Holdout (Out-of-Sample) Dataset Test Error Rates ...................... 69
Table A.1 Dataset Variables Showing Statistical Characterisation (First Variable is Target) 76
Abstract
The importance of copper as an industrial metal has grown with time due to increasing
technological applications. This has led to the metal being quoted on major commodities
exchanges, and the stakeholders interested in the price trend of the commodity have thus
transcended just the producing and consuming nations and industries to include investors. As
a result, there has been increasing interest in developing models for the prediction of the price
of the metal.
The autoregressive integrated moving average (ARIMA) model has traditionally been used
in forecasting time series quantities such as commodity prices. Recent research reflects
attempts to improve on the performance of ARIMA by instead using data mining techniques,
of which artificial neural networks (ANNs) have been the model of choice. However, because
ANNs are black-box models, no insight can be drawn from the results they produce.
Furthermore, there has been a lack of a clear methodological framework enabling a
systematic and standard approach to the analysis process.
This research work addresses the aforementioned gaps by presenting a knowledge-
discovery methodology applied to the development of (open-box) decision tree models for
forecasting copper spot prices, thereby revealing the prime predictor variables for the metal.
The accuracy of the decision tree model is also contrasted with that of a developed ARIMA
model.
Metal fundamentals as well as economic and financial variables selected as predictors by
the decision tree model include Chinese copper import levels, the volatility index (VIX), the
Baltic Dry Index and the Standard & Poor’s GSCI index, amongst others. With a root mean
square error (RMSE) of 18.65, the decision tree model performed far more accurately
than ARIMA.
Declaration
This dissertation is original work; any research material it contains is clearly
referenced. No portion of the work referred to in this dissertation has been submitted in
support of an application for another degree or qualification of this or any other university or
other institute of learning.
Intellectual Property Statement
i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he
has given The University of Manchester certain rights to use such Copyright,
including for administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright, Designs and
Patents Act 1988 (as amended) and regulations issued under it or, where appropriate,
in accordance with licensing agreements which the University has entered into. This
page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trademarks and other
intellectual property (the “Intellectual Property”) and any reproductions of copyright
works in the dissertation, for example graphs and tables (“Reproductions”), which
may be described in this dissertation, may not be owned by the author and may be
owned by third parties. Such Intellectual Property and Reproductions cannot and must
not be made available for use without the prior written permission of the owner(s) of
the relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University IP
Policy, in any relevant Dissertation restriction declarations deposited in the University
Library, and The University Library’s regulations.
Acknowledgements
A special thank you goes to my supervisor, Dr. Charalampos Theodoulidis (Babis) for
his mentorship and support throughout the entire dissertation research period.
Special thanks also must go to the School of Computer Science and Alliance Manchester
Business School, most especially my lecturers: Dr Sandra Sampaio, Prof. John Keane, Dr.
Gavin Brown, Prof. Christopher Holland, Dr. Daniel Dresner, Dr. Yu-wang Chen, Dr. Julia
Handl and Prof Peter Kawalek for providing me with the necessary foundations to execute
this research.
To my wonderfully supportive friends back home in Nigeria who through their
generosity have helped in rendering financial support through this period of my study I say a
big thank you. I would like to specially thank my Manchester families: the Idowus, the
Osinugas, the Onches and Chuks. You folks have been great pillars of support.
My wife Ayo, and the twins, Kemi and Femi, thank you for your patience, prayers and
support. And to my mother who always heeds my call, thank you so much. Special
appreciation goes to my dad for his constant prayers and support. My siblings who have
always rallied support and help, thank you so much.
To God who alone has been my One and All be all the glory and adoration.
Dedication
With profound gratitude and appreciation to the Almighty God...
In great love and admiration of my dear wife, Ayo ma Cherie…
For my lovely twins, Kemi and Femi…
To my wonderful Parents...
To my Siblings...
To my Friends and Colleagues...
Preface
The author has a B.Sc. degree in Computer Engineering from the Obafemi Awolowo
University Ile-Ife, Nigeria. With ten years’ working experience as a systems analyst in the
Nigeria National Petroleum Corporation, he has skills in software development as well as
database development and support using technologies including Microsoft .NET, SharePoint,
SQL Server, Oracle amongst others.
Chapter 1 Introduction
Copper is a non-ferrous, corrosion-resistant metal with antimicrobial properties as well
as very high thermal and electrical conductivity (second only to silver, which is far more
expensive). As a result of these qualities, copper is in very high demand and is one of the top
industrial metals, used in electronic/electrical applications, construction, medical and general
engineering (Anyadike, 2002). Considering the importance of these industries in the modern
world, “movements in copper prices can therefore be seen as an early indicator of global
economic performance” (Buncic and Moretto, 2015). Thus copper is one of the metal
commodities traded on the major commodities exchanges: the London Metal Exchange
(LME), the New York Commodity Exchange (COMEX) and the Shanghai Metal Exchange
(SHME) (Lasheras et al., 2015).
The ability to reliably forecast the future value of the metal therefore becomes very
valuable to investors, speculators and even more so to the world’s top exporter, Chile (Fisher
et al., 1972). Similar arguments can be made for China which is the top importing nation of
copper ores and concentrates as well as the global top producer and consumer of refined
copper (ICSG, 2016). Also, due to the intricacies of trading in copper whereby there is a time
lag between contracting, payment and delivery as well as storage and insurance
considerations (Lasheras et al., 2015), contracts are usually agreed upon based on a future
price figure.
Several methods have been used in the attempt at forecasting copper prices with mixed
results. This is due to the high volatility inherent in the price of the metal in the global
markets over any period (time series). ARIMA has been well known as a forecasting model
for time series quantities since the work of Box, Jenkins and Reinsel (1970). Using ARIMA
for copper price forecasting, which involves fitting the model to the time series data, has
limitations: it focuses only on the trend of prices over time and uses that to extrapolate into
the future, without considering the external factors (industrial, economic, financial, etc.) that
affect the fluctuation of prices. Also, since ARIMA is a linear model, it can only produce
approximations when modelling complex non-linear problems.
Other research, in considering these factors, has explored the use of data mining models
such as support vector machines (SVM) and neural networks (NN), as well as other analytical
tools like regression, Fourier transforms, etc. These have produced results with better
accuracy than ARIMA (Adebiyi, Adewumi and Ayo, 2014; Lasheras et al., 2015;
Kriechbaumer et al., 2014). In other cases, a hybrid approach has been adopted, combining
ARIMA with neural networks, as undertaken for example by Zhang (2003) as well as Jan and
Katarina (2010), which also produced results with better accuracy than using either approach
alone. In the case of Jan and Katarina (2010), for example, results showed ARIMA having a
mean absolute percentage error (MAPE) of 3.2%, compared to 2.4% for the ARIMA-NN
hybrid. However, because these are black-box models, the nature of the effect of the
predictor variables (factors) in determining prices is not known (Lai et al., 2009).
The use of the decision tree model as a forecasting tool does not suffer from the black-box
limitation mentioned above. Chang et al. (2011) and Lai et al. (2009) used clustering
techniques in conjunction with genetic algorithms to develop fuzzy decision trees as a
decision support tool for stock trading based on prices. The decision rules derived from the
tree revealed the nature of the effect that the factors considered have on the price of stocks
and, upon application to test data, produced results with superior performance (in terms of hit
rate) compared to other models (random walk, ARIMA, neural nets etc.). However, these
papers lack a clear methodological approach to the investigation that would allow the
process to be adapted for use in other domains. Also, the variable selection process
does not take into account the considerable effect of business (economic) cycles (Diaz et al.,
2016) on time series quantities such as stock prices, which primarily respond to global
economic and financial activity.
Of the many existing decision tree algorithms, the more commonly used include the
successive generations developed by Ross Quinlan: ID3, C4.5, C5.0 (Quinlan, 1990, 1993,
2004); CHAID (CHi-squared Automatic Interaction Detector) (Kass, 1980) and CART
(Classification And Regression Tree) (Breiman et al., 1984). C4.5 improves on ID3 with the
ability to handle missing data, support for differently weighted attributes, and pruning to
simplify the tree and improve generalisation (Hssina et al., 2014; Khoonsari and Motie,
2012). C5.0 further improves on C4.5 in terms of speed and memory efficiency. CHAID uses
multiway splits, which makes the resulting trees easier to read. This research work uses
CHAID and C5.0 (which, as implemented in IBM SPSS Modeler, can only handle
categorical data).
In terms of the use of decision trees in forecasting metal prices, Malliaris and
Malliaris (2015) used them in forecasting the direction of gold price movements. Their work
did put some focus on the effect of business cycles by clustering the dataset (into four groups
in and around the global recession of 2008) before applying decision tree models to each
cluster, producing varied results in terms of the effect of the predictor variables. However,
only six predictor variables were considered and no clear or standard methodology was used in the
analysis process, making it less generalizable to other domains. As far as the literature
review shows, however, decision trees have not yet been applied to forecasting copper prices.
In this project, the use of decision trees as forecasting tools for copper spot prices is
investigated using the CRISP-DM (CRoss Industry Standard Process for Data Mining)
methodology, considering relevant economic and financial predictor variables and thereby
revealing the nature of the impact these variables have on the price of copper over time. A
comparison with ARIMA is also carried out to determine the relative predictive accuracy of
the models. CRISP-DM, a widely accepted and used methodology in industry (Wirth and
Hipp, 2000), enables a structured approach to the analysis process, from understanding the
data through to evaluation of the results and deployment.
1.1. Motivation
The following considerations are worthy of note:

First, given the importance of copper today (which is only likely to continue in the
foreseeable future, based on the metal's current vast application areas in technology and
industry), the ability to reliably forecast its price, with a good understanding of the economic
and financial indicators that determine its value, will be of increasing benefit to stakeholders.

Second, the dynamic nature of global business calls for a standard and adaptable method
for determining the above as the business climate evolves.

These form the basis of motivation for this research work.
1.2. Aims and Objectives
The aim of this research work is to explore the use of the decision tree data mining model
as a forecast tool for copper spot prices.
The following specific research objectives have been identified:
To use the CRISP-DM methodology in developing a decision tree model for the
prediction of copper spot prices using time series data of LME monthly copper prices
from January 1970 to January 2012.
To determine the nature of the effect of relevant economic and financial predictor
variables on copper spot prices from the resultant decision tree.
To empirically evaluate the use of the decision tree contrasted against ARIMA in
forecasting copper spot prices in terms of prediction accuracy using the Root Mean
Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage
Error (MAPE) metrics.
1.3. Outline of the Report
The structure of this research work is outlined here. In this first chapter, an introduction to
the dissertation has been presented. Chapter 2 discusses the theoretical background and
related work, giving an account of the historical development of the importance of copper as
well as prior research into the prediction of copper prices and similar commodities or
time series quantities using various models and tools. The research design and
methodology are developed in Chapter 3. Model development and
implementation is the subject of Chapter 4. In Chapter 5, an evaluation of the models is
carried out and the results are presented and discussed. Lastly, Chapter 6 contains final
conclusions and recommendations for further work.
Chapter 2 Background and Related Work
Copper has played a critical role in human civilization from prehistoric times to the
present day. It is established that copper was the first metal to be mined and used for
tool making and other purposes. Due to its malleability and ductility, it is easily shaped,
drawn into wires or hammered into sheets for various applications. Table 2.1 below shows
some of the properties and uses of the metal.
Table 2.1: Copper Properties and Uses
1. A Natural Element: Relatively safe, non-radioactive production methods
2. Recyclable: Sustainable balance of demand and supply
3. Malleable & Ductile: Easily hammered into sheets and drawn into wires
4. Aesthetic: Used for making various ornamental items
5. A Family of Alloys: From bronze to brass to a host of modern alloys developed from
advances in material science
6. Antifouling: Inhibits the adhesion of marine life to surfaces, thereby preventing drag on
ship hulls, for example
7. Antimicrobial: Used in alloys to reduce germ transmission rates from frequently touched
surfaces like door knobs
8. Easily Shaped: Used for making bells and musical instruments like trumpets
9. Durable: Due to anti-corrosion properties, used extensively in piping
10. Conductive: Very high electrical and heat conductivity makes it the metal of choice in
electrical/electronic as well as various industrial applications
11. Easy to Join: By welding, soldering, brazing, bolting etc.; makes for an excellent choice
in piping and electrical distribution
In modern times, the excellent electrical and heat conductivity of copper, together with its
corrosion resistance, has made it an essential metal in industry, with applications in
electronics, wiring, building and construction, piping and plumbing. Figure 2.1 shows the
relative use of copper across these different industrial applications.
Figure 2.1: Pie Chart showing the relative use of copper in industrial sectors (Crowson, 2008;
Black, 1995)
Given the mostly electrical/electronic use of copper in cabling and wiring mentioned
above, demand for the metal has been rising as more and more people require housing,
transportation and so on. Figure 2.2 below shows the rising demand for copper over
the decades. According to Crowson (2008), this represents an average annual growth rate of
about 3.7%, and “world exports of refined copper metal accounted for 38 percent of
production, worth almost $23 billion, in 2005”. This clearly shows the importance of the
metal in global industry and economy.
Figure 2.2: Global Demand for Copper over five decades (Crowson, 2008)
[Figure 2.2 data — Global Copper Demand (million tonnes): 1960: 3.7; 1970: 6.8; 1980: 9;
1990: 10.9; 2000: 15.1; 2005: 16.5]
[Figure 2.1 data — sector shares: 48%, 18%, 12.5%, 10%, 11%; sectors: Electrical and
electronic products, wires & cables; Construction, piping; Transport; Industrial machinery;
Consumer products & others]
2.1. Time Series
In forecasting the price of copper, we start by looking at the history of copper prices over
time. The copper spot prices in the obtained dataset are monthly figures from 1970 to 2012,
which is essentially a univariate time series, since they form a single sequence of ordered
observations at equal discrete time intervals.
Univariate time series can be represented by the general model
s_t = v(t) + φ_t,   t = 1, …, T   (2.1)

where v(t) represents a deterministic part, φ_t is the residual term, and t represents each time
point in the total period T.
The signal component represents the (possible) trend and seasonality observed in the
distribution. Being deterministic, it can easily be characterised using some closely fitting
function. The residual term, however, is stochastic and can only be evaluated using
probabilistic methods. As such, the residual is the more difficult aspect to model and is the
focus of one of the analytical models for time series forecasting used in this research work. In
order to model residuals, some important properties of time series need to be adequately
managed; these are discussed next.
2.1.1. Trend, Seasonality
As mentioned earlier, the signal component of a time series can incorporate trend and
seasonality. Trend is a change in the mean value over the long term. Seasonality, on the other
hand, refers to periodic highs and lows observed in the series at roughly equal intervals or
seasons (e.g. monthly, annually). These traits make for an ‘unstable’ series which does not
lend itself easily to modelling. Figure 2.3 below illustrates these properties of a time series.
Figure 2.3 Time series decomposed (Bontempi, 2013)
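The additive structure in equation (2.1) can be illustrated with a small synthetic example (a sketch for intuition only; the series, numbers and use of numpy are illustrative assumptions, not the dissertation's LME data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120)  # e.g. 10 years of monthly observations

# Deterministic part v(t): a linear trend plus a 12-month seasonal cycle
trend = 0.5 * t
seasonal = 10.0 * np.sin(2.0 * np.pi * t / 12.0)
v = trend + seasonal

# Stochastic residual term (the hard-to-model part)
residual = rng.normal(0.0, 2.0, size=t.size)

# Observed series: s_t = v(t) + residual_t, as in equation (2.1)
s = v + residual
```

Subtracting the known v from s recovers the residual exactly here; in practice v is unknown and must itself be estimated, which is what decomposition methods do.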
2.1.2. Stationarity
A time series is said to be stationary when its joint distribution remains unchanged when
shifted in time: the distribution of observations is independent of any particular time origin.
As such, the identified properties of the series are independent of the time of observation.
This also implies that whichever subset of the series one observes, the plot looks generally
the same; the statistical properties of mean and variance are therefore constant. White noise
(illustrated in Figure 2.4), an example of a stationary series, has zero mean and constant
variance.
Figure 2.4 Plot of white noise signal
On the other hand, a time series with trend and/or seasonality is not stationary because
the distribution of the series depends on time as can be seen in Figure 2.3 above. One way of
making a time series stationary is called differencing.
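One crude way of seeing the distinction is to compare summary statistics over different subsets of a series: for a stationary series they should look alike. The sketch below does this on invented data; in practice a formal unit-root test (such as the augmented Dickey-Fuller test) would be used.

```python
# Crude stationarity check: compare the mean of the first and second
# halves of a series. Both series here are synthetic illustrations.
import numpy as np

rng = np.random.default_rng(1)
white_noise = rng.normal(0.0, 1.0, 800)                        # stationary
trending = 0.01 * np.arange(800) + rng.normal(0.0, 1.0, 800)   # not stationary

def half_means(y):
    """Mean of the first and second halves of the series."""
    mid = len(y) // 2
    return y[:mid].mean(), y[mid:].mean()

a, b = half_means(white_noise)   # both close to zero
c, d = half_means(trending)      # clearly different levels
```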
2.1.3. Differencing
Differencing is a transformation that can be applied to time-series data so as to make it
stationary. To do this, a fresh series is generated by taking the difference between consecutive
observations in the original. Thus if 𝑦𝑡 is a time series quantity where t = 1,2,…,T, then a first
degree differenced time series (or first lag) 𝑦𝑡′ is given by
yt′ = yt − yt−1 (2.2)
Differencing stabilises the mean of a time series thereby getting rid of the changes in the
level of the series and thus eliminating trend and seasonality. If 𝑦𝑡′ is also not stationary, the
process can be repeated again producing a second order differenced series 𝑦𝑡′′ (second lag)
yt′′ = yt′ − yt−1′
     = (yt − yt−1) − (yt−1 − yt−2)
     = yt − 2yt−1 + yt−2 (2.3)
Another transformation that can be applied is by taking logarithms. This helps to stabilise the
variance in the series. Having outlined some of the basic properties of time series as above,
the development of predictive models is discussed next.
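The differencing operations of equations 2.2 and 2.3, together with the log transform just mentioned, can be computed directly. A short numpy sketch on an invented series:

```python
# First and second differencing (equations 2.2 and 2.3) on a short
# illustrative series.
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])

first_diff = np.diff(y)          # y't  = y_t - y_{t-1}
second_diff = np.diff(y, n=2)    # y''t = y_t - 2*y_{t-1} + y_{t-2}

print(first_diff)    # [2. 3. 4. 5.]
print(second_diff)   # [1. 1. 1.]

# Taking logarithms before differencing helps stabilise the variance
log_diff = np.diff(np.log(y))
```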
2.2. Models
Two main classes of models developed as forecasting tools are discussed in this section.
2.2.1. Econometric Model – ARIMA
In mathematics or statistics, linear regression is one of the methods for deriving a
relationship function or line of best fit between a dependent variable y and independent
variable x such that in the case of simple linear regression, y can be expressed as
y = mx + c. (2.4)
And more generally or for multiple regression with 𝑛 independent variables,
y = m1x1 + m2x2 + ⋯ + mnxn + c (2.5)
With the above in mind, the building blocks of the ARIMA model are hence discussed.
Autoregressive Model
An autoregression model is one in which the independent (predictor) variables are made
up of past values of the (time series) target variable. “The term autoregression indicates that it
is a regression of the variable against itself” (Hyndman and Athanasopoulos, 2014).
Therefore an autoregressive model of order p, AR(p) can be expressed as
yt = c + ɸ1yt−1 + ɸ2yt−2 + ⋯ + ɸpyt−p + et (2.6)
where c is a constant and et is white noise.
As such the AR(p) model is basically a multiple regression but with lagged values of the
target, 𝑦𝑡, as predictors. Table 2.2 below shows some standard AR(1) models
Table 2.2 Standard AR(1) models
ɸ1 = 0:        yt is white noise
ɸ1 = 1, c = 0: yt is a random walk
ɸ1 = 1, c ≠ 0: yt is a random walk with drift
ɸ1 < 0:        yt oscillates between positive and negative values
Autoregressive models are best applied to stationary time series data, and the parameters also
need to be constrained to some certain values (Hyndman and Athanasopoulos, 2014).
For an AR(1) model: −1 < ɸ1 < 1.
For an AR(2) model: −1 < ɸ2 < 1, ɸ1 + ɸ2 < 1, ɸ2 − ɸ1 < 1.
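The special cases of Table 2.2 are easy to reproduce by simulation. The sketch below is illustrative only; the noise scale, series length and seed are arbitrary choices.

```python
# Simulating the AR(1) special cases of Table 2.2.
import numpy as np

def simulate_ar1(phi1, c, n=500, seed=0):
    """y_t = c + phi1 * y_{t-1} + e_t, with white-noise errors e_t."""
    rng = np.random.default_rng(seed)
    e = rng.normal(0.0, 1.0, n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = c + phi1 * y[t - 1] + e[t]
    return y

white_noise = simulate_ar1(phi1=0.0, c=0.0)   # y_t = e_t
random_walk = simulate_ar1(phi1=1.0, c=0.0)   # y_t = y_{t-1} + e_t
with_drift = simulate_ar1(phi1=1.0, c=0.5)    # random walk with drift
```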
Moving Average Model
The moving average model (MA), similarly to the AR, also builds some form of linear
regression on time series data but rather than regressing on past values of the target it uses the
past forecast error terms thus
yt = c + et + θ1et−1 + θ2et−2 + ⋯ + θqet−q, (2.7)
where et is white noise. This is referred to as an MA(q) model. From equation 2.7, yt can be
seen as a weighted (θ) moving average of prior forecast errors, hence the name of the model.
Similarly to the AR, the MA also has some value constraints placed on its parameters
when used:
For an MA(1) model: −1 < θ1 < 1.
For an MA(2) model: −1 < θ2 < 1, θ1 + θ2 > −1, θ1 − θ2 < 1.
ARIMA Model
When the autoregression and moving average models are combined and used on a
differenced time series data we have the non-seasonal ARIMA (AutoRegressive Integrated
Moving Average) model. The ARIMA model was popularised by the Box-Jenkins approach
(Box and Jenkins, 1970) to time series forecasting and is useful for both stationary and
non-stationary time series datasets.
When a forecast model is required for a time series dataset that is strongly seasonal, and
seasonality needs to be taken into account, additional seasonal terms are added to the non-
seasonal ARIMA model to make a seasonal ARIMA model. The seasonal ARIMA model
contains a separate set of autoregressive, difference and moving average terms to account for
seasonality in the data. The models are denoted in generalized form as follows:
𝐴𝑅𝐼𝑀𝐴(𝑝, 𝑑, 𝑞) - non-seasonal ARIMA
𝐴𝑅𝐼𝑀𝐴(𝑝, 𝑑, 𝑞)(𝑃, 𝐷, 𝑄) - seasonal ARIMA
where:
p is the order of the autoregression
d, the degree of differencing and
q, the order of the moving average
of the non-seasonal (part of the) model while
P, D and Q are similar terms as above for the seasonal part of the model.
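The mechanics of a non-seasonal ARIMA fit can be sketched in a few lines. In practice a library routine (for example statsmodels' ARIMA class) would be used; the sketch below instead fits an assumed ARIMA(1,1,0) by ordinary least squares on the first-differenced series, purely to make the "difference, then autoregress" idea concrete. The series is synthetic.

```python
# Sketch of an ARIMA(1,1,0) fit: difference once (d=1), then estimate
# an AR(1) on the differences by ordinary least squares. Illustrative
# only; a real analysis would use a dedicated ARIMA implementation.
import numpy as np

rng = np.random.default_rng(2)
# Synthetic non-stationary series: a random walk with drift
y = np.cumsum(0.3 + rng.normal(0.0, 1.0, 400))

d = np.diff(y)                                        # d=1 differencing
X = np.column_stack([np.ones(d.size - 1), d[:-1]])    # constant + lag-1 term
c, phi1 = np.linalg.lstsq(X, d[1:], rcond=None)[0]    # OLS estimates

# One-step forecast: predict the next difference, then undifference
next_diff = c + phi1 * d[-1]
forecast = y[-1] + next_diff
```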
ACF/PACF
An autocorrelation function (ACF) shows a plot of correlation between the time series
data and itself over different lags. That is, a plot of yt against yt-k for different values of k. The
partial autocorrelation (PACF) plot on the other hand, indicates the level of autocorrelation at
lag k that is not supported by lower-order autocorrelations. It shows the correspondence
between yt and yt-k after removing the effects of the intervening lags (that is, at 1,2,3,…,k-1).
In determining the values of p and q (P and Q), use is made of the ACF and PACF plots.
An ARIMA(p,d,0) model is inferred if the plots of the data (differenced in order to
make the series stationary) reveal a pattern where:
the ACF decays exponentially;
the PACF has a significant spike at lag p that cuts off sharply thereafter.
An ARIMA(0,d,q) model is inferred if the plots of the differenced data reveal a pattern
where:
the PACF decays exponentially;
the ACF has a significant spike at lag q that cuts off sharply thereafter.
Figure 2.5 below shows a sample set of plots where an ARIMA(0,1,1) has been inferred.
The first panel shows the plot of a time series with an upward trend. In the second
panel we see a plot of the series after taking a first difference. And finally the third panel
shows the ACF and PACF plots.
Figure 2.5 Sample Time series, difference and ACF/PACF plots
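The sample autocorrelations that underlie such plots are simple to compute by hand. Plotting libraries (e.g. statsmodels' plot_acf/plot_pacf) would normally produce the figures; the sketch below just computes ACF values for a synthetic MA(1) process, whose ACF should cut off sharply after lag 1.

```python
# Computing sample autocorrelations directly, on a synthetic MA(1)
# process (coefficient 0.8, chosen arbitrarily for illustration).
import numpy as np

def acf(y, max_lag):
    """Sample autocorrelation of y at lags 1..max_lag."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    denom = np.dot(y, y)
    return np.array([np.dot(y[:-k], y[k:]) / denom
                     for k in range(1, max_lag + 1)])

rng = np.random.default_rng(3)
e = rng.normal(0.0, 1.0, 2000)
y = e[1:] + 0.8 * e[:-1]          # MA(1): y_t = e_t + 0.8 * e_{t-1}
print(np.round(acf(y, 4), 2))     # large at lag 1, near zero thereafter
```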
2.2.2. Data Mining Model – Decision Trees
Decision trees are a non-parametric supervised learning method used for classification
(discrete valued target) and regression (continuous valued target) data mining or machine
learning tasks (Kavitha and Iyakutti, 2014). Given tuples of data with attribute and target
pairs, the decision tree algorithm produces a tree structure that enables the categorisation or
description of the dataset, determining the target value by simple hierarchical rules as the
tree is traversed from root to leaf across nodes. Each node represents a simple test based on
an attribute's values, a branch is a path based on the outcome of the test, and ultimately a
leaf represents the target class or value. Figure 2.6 shows a decision tree structure.
Figure 2.6 Decision tree for deciding whether to play tennis
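The root-to-leaf traversal can be written as nested rules. The sketch below encodes a play-tennis tree of the kind shown in Figure 2.6; the attribute names and values are assumed from the classic textbook example, not taken from this project's data.

```python
# A play-tennis decision tree as nested rules: each `if` is a node test,
# each return value is a leaf. Attribute values follow the classic example.
def play_tennis(outlook, humidity, wind):
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        return "no" if wind == "strong" else "yes"

print(play_tennis("sunny", "normal", "weak"))   # yes
```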
The decision tree has several advantages one of which has already been mentioned, that
of its being an open-box model. Others include:
Simple to understand and communicate
Can be combined into an ensemble (Random Forests)
Can be used as a pre-processor (feature selection) in combination with other classifiers.
Some drawbacks of decision trees and mitigation techniques are listed below:
Sensitivity to outliers and noise. This can be dealt with by adequate dataset preprocessing
to identify (as much as possible) cases of outliers.
Missing values. The use of surrogate splits can help overcome this challenge which is a
technique of identifying similar or suitable feature values which can be used for splitting
in place of the missing-value feature.
Based on the logic of the algorithm, decision trees determine and pick the most
discriminative attributes in splitting at each node of the tree. Tree growth is stopped when the
algorithm arrives at an optimal tree size based on maximum depth specification, minimum
node size or pruning. As a result, decision trees are also widely used for optimal attribute
selection as a pre-processing step to other machine learning methods, like neural networks,
which are designed to use all the attributes fed into them.
Decision tree algorithms use different criteria for determining attributes upon which to
split based on some tests and measures. These are discussed next.
Entropy
In information theory, entropy is a measure of the purity (or impurity) in a distribution of
examples. Consider the case of a binary target collection of examples T, which can have
either a positive (𝑝) or negative (𝑛) value, the entropy E of T is given by
E(T) = − P(p)log2P(p) − P(n)log2P(n) (2.8)
where P(x) is the proportion of examples of class x in the distribution.
Thus as seen in figure 2.7 below, entropy has its minimum value of 0 when P(𝑥) = 0,1
(when there is certainty) and a maximum of 1 when P(𝑥) = 0.5 (when there is maximum
uncertainty with an equal chance of positive or negative outcome).
Generally, for a distribution T with v possible values, the entropy of T is defined by
E(T) = − ∑i=1..v P(i)log2P(i) (2.9)
having a maximum value of log2 v.
Figure 2.7 Entropy as a function of a binary valued distribution
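Equations 2.8 and 2.9 translate directly into code. A minimal sketch, with the three values annotated against the behaviour shown in Figure 2.7:

```python
# Entropy of a distribution (equations 2.8 and 2.9), written directly
# from the definition.
import math

def entropy(proportions):
    """E(T) = -sum(p_i * log2(p_i)), skipping zero proportions."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 -- maximum uncertainty, binary case
print(entropy([1.0]))        # 0.0 -- complete certainty
print(entropy([0.25] * 4))   # 2.0 -- log2(4) for four equally likely values
```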
Information Gain
Information gain is one of the metrics used in determining the effectiveness of an
attribute as a choice for splitting when building a decision tree. It is the measure of reduction
in entropy after splitting on the attribute. As seen above, the more entropy, the more
uncertainty in the predictability of the outcome, and vice versa. Therefore the attribute that in
effect has the least entropy upon being used for splitting will have contributed the most
information (maximum information gain) in the effort to arrive at a leaf node. The entropy of
a split on attribute A is calculated by taking the sum of the entropies of each subset of the
collection having each possible value of A, weighted by the fraction of the whole's examples
that the subset contains, that is
∑v∈Values(A) (|Tv|/|T|) E(Tv) (2.10)
where Values(A) is the set containing all the possible values of attribute A,
Tv is the subset of collection T with attribute A value v,
|Tv|/|T| is the proportion of examples in the distribution T with attribute A value v, and
E(Tv) is the entropy of Tv.
From the above, the information gain on splitting on attribute A, IA, is then simply given by
IA = E(T) − ∑v∈Values(A) (|Tv|/|T|) E(Tv) (2.11)
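Equation 2.11 can be checked numerically. The sketch below uses the classic play-tennis counts (9 positive and 5 negative examples, split on the Outlook attribute) as an assumed worked example; these counts come from the textbook version of the dataset, not from this project's data.

```python
# Information gain of a split (equation 2.11), on the classic
# play-tennis counts for the Outlook attribute.
import math

def entropy(counts):
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """I_A = E(T) - sum(|T_v|/|T| * E(T_v)) over the subsets of the split."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

# 9 yes / 5 no overall; Outlook splits into sunny (2 yes, 3 no),
# overcast (4 yes, 0 no) and rain (3 yes, 2 no)
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain, 3))   # 0.247
```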
GINI Impurity
The Gini Index is another measure of impurity that can be used for attribute split
selection in decision trees. Using the same notation as equation (2.9) above, it is given by
G = 1 − ∑i=1..v P(i)² (2.12)
The Gini index of equation (2.12) has a maximum value of 1 − 1/v, reached when all v values
are equally likely. Similarly to entropy, the Gini index can be used in calculating information
gain.
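A minimal sketch of equation 2.12, on the same illustrative proportions used for entropy above:

```python
# Gini impurity (equation 2.12), written directly from the definition.
def gini(proportions):
    return 1.0 - sum(p * p for p in proportions)

print(gini([0.5, 0.5]))    # 0.5  -- maximum for a binary split
print(gini([1.0, 0.0]))    # 0.0  -- pure node
print(gini([0.25] * 4))    # 0.75 -- i.e. 1 - 1/4 for four equal classes
```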
Algorithms
Several decision tree algorithms have been developed over the years with some being
improvements over previous versions. Some of the more commonly used algorithms are
summarised in Table 2.3 indicating some main characteristics:
Other more advanced algorithm developments include ensemble methods which produce
multiple decision trees known as random forests which are then combined (and using voting
techniques for example) to produce a final target prediction/classification. Random forests
have the advantage of producing models that are more stable where only decision trees may
overfit training data and perform with worse generalisation error. However, because of the
multiple trees involved, random forests lose the transparency and ease of interpretation which
is the hallmark of decision trees.
In this research work, the C5.0 (for categorical targets) and CHAID decision tree
algorithms are used. Of the decision tree algorithms available in IBM SPSS Modeler 15 (the
analysis tool used), these two proved to be the most accurate. Below is the pseudocode of the
CHAID algorithm (as mentioned earlier, the C5.0 is proprietary):
CHAID Pseudocode
CHAID(predictors, target, alpha_to_merge, alpha_to_split)
    do
        splitting = false
        for each predictor in predictors
            if predictor is continuous then
                divide into equal-sized category bins
            else if predictor is categorical then
                take each unique value as a category
            end if
            do
                merging = false
                for each (category, nextcategory) pair in category_set
                    if target is continuous then
                        mergeTest = F_Test(category, nextcategory)
                    else if target is categorical then
                        mergeTest = ChiSquare_Test(category, nextcategory)
                    end if
                    if mergeTest >= alpha_to_merge then
                        merge(category, nextcategory)
                        merging = true
                        break
                    end if
                end for
            loop while merging = true
            PValue[predictor] = BonferroniP(category_set)
        end for
        if min(PValue) <= alpha_to_split then
            split on the predictor with minimum PValue
            splitting = true
        else
            designate node as terminal node (leaf)
        end if
    loop while splitting = true
Table 2.3: Decision Tree Algorithms
*C5.0 is implemented in proprietary commercial software.

| Name | Author | Split Determinant | Data Type | Split Levels | Other Features |
| CART (Classification And Regression Trees) | Leo Breiman | Least Squares Deviation, Gini Index | Continuous & Categorical Predictor, Categorical Target | Binary | |
| CHAID (CHi-squared Automatic Interaction Detection) | Gordon V. Kass | Chi-square test, F-Test, Bonferroni adjusted p-value | Continuous & Categorical (Continuous Predictor by Binning) | Multiway | |
| MARS (Multivariate Adaptive Regression Splines) | Jerome H. Friedman | Basis Function | Continuous & Categorical | Multiway | |
| ID3 (Iterative Dichotomiser 3) | Ross Quinlan | Entropy/Information Gain | Continuous (more time-consuming) & Categorical | Multiway | |
| C4.5 | Ross Quinlan | Entropy/Normalised Information Gain | Continuous & Categorical | Multiway | Improvement over ID3: Pruning, Varying Attribute Costs, Missing Data |
| C5.0 | Ross Quinlan | * | Continuous & Categorical | Multiway | Improvement over C4.5: Winnowing, Boosting, Speed |
2.3. Previous Work
The value of forecasting in planning and investment has been established. Previous work
in the literature on forecasting time series quantities is discussed below, in terms of copper
in particular and commodities in general.
Given the global importance of copper trade, many attempts have been made to forecast
copper prices in the literature. Since copper spot prices form a time series, the most used
forecast tool has been the ARIMA. This has been shown to be good enough in producing
forecasts especially in the long run (Lasheras et al., 2015). However, according to
Kriechbaumer et al (2014), “Normal ARIMA models were shown to be rather unsuitable for
predicting monthly base metal prices”.
Lasheras et al. (2015) also showed that Elman recurrent neural networks (RNN) produce
forecast results for copper spot prices with better error rates than ARIMA. A similar
finding was made by Adebiyi et al. (2014), who showed that artificial neural networks
performed better than ARIMA. Thus some evidence has been established as to the benefits of
data mining algorithms and models in producing forecast results with better accuracy and
variance.
Other researchers have also tried a hybrid approach where some data mining technique
has been combined with ARIMA in an attempt to increase the overall forecast accuracy as
seen in Zhang (2003) as well as Jan and Katarina (2010), where neural nets were combined
with ARIMA. Kriechbaumer et al. (2014), considering the cyclical behaviour of metal prices,
applied wavelet analysis and multiresolution analysis prior to ARIMA for much improved
forecast accuracy for copper, lead, zinc and aluminium.
This literature review reveals that it is mostly neural networks that have been considered
and investigated in this regard, as seen in the papers mentioned above as well as in Lai et al.
(2009), who also underscored the fact that neural networks “… do not provide an insight into
the nature of the interactions between technical indicators and … fluctuations” since they are
black-box models.
Decision trees were used by Ongsritrakul et al. (2003) in the prediction of gold prices, but
only for feature selection, the results of which were then fed to Support Vector Machine
(SVM), linear regression and neural net models. Similarly, Malliaris and Malliaris (2015) used decision
trees for gold price forecast but limited to just predicting the price movement direction (up or
down). Lai et al. (2009), on the other hand, used ID3, a decision tree algorithm, along with
case-based reasoning and weighted clustering as a decision support tool for stock price
decisions (buy/hold/sell). In terms of predicting actual values of a time series quantity, Diaz et al.
(2016), demonstrated the value of using decision trees as the tool of choice both for
prediction with reasonable accuracy as well as being able to realise the nature of the
relationship between economic variables and risk-free interest rates. Table 2.4 below is a
summary of these previous works researched.
As seen from the above, the use of the business cycle as a predictor variable in the literature
is sparse. Yet its role and importance in metal pricing has been established. Cuddington and
Jerrett (2011) demonstrated the significance of the business cycle as a determinant of metals
prices stating that “…metals and oil prices are much more responsive to cyclical than trend
movements in economic activity”. Fama and French (1988) also concluded that “the variation
of spot and forward prices for metals has a strong business-cycle component”.
Table 2.4 Summary of Previous Work on Prediction of Metals Prices and other Time Series Quantities

| SN | Author(s) | Year | Target | Model | Predictors |
| 1 | Lasheras et al. | 2015 | Copper spot price | ARIMA and neural networks | Copper spot price time series |
| 2 | Adebiyi et al. | 2014 | Stock price | ARIMA and neural networks | Stock price time series |
| 3 | Zhang | 2003 | Sunspot observations; Canadian lynx annual mortality; British Pound to US Dollar exchange rate | ARIMA and neural networks hybrid | Time series of quantities |
| 4 | Jan and Katarina | 2010 | Monthly water volume consumption | ARIMA and neural networks hybrid | Water consumption time series |
| 5 | Kriechbaumer | 2014 | Monthly price of aluminium, copper, lead and zinc | Wavelet analysis and ARIMA hybrid | Metal nominal price time series |
| 6 | Lai et al. | 2009 | Stock price | Fuzzy decision trees; genetic algorithms; k-means | Stock price time series; six days moving average (MA); six days bias (BIAS); six days relative strength index (RSI); nine days stochastic line (K,D); moving average convergence and divergence (MACD); 13 days psychological line (PSY); volume |
| 7 | Goss and Avser | 2013 | Copper spot and futures prices | Simultaneous rational expectations model of LME copper | Inventory; production volume; industrial production index; tin spot price; high grade zinc spot price |
| 8 | Malliaris and Malliaris | 2015 | Gold price movement direction | Decision tree | Cleveland Financial Stress Indicator; Cushing Oil; S&P 500; VIX; Euro to US Dollar exchange rate |
| 9 | Ongsritrakul | 2003 | Gold price | Support vector regression using decision tree for feature selection | South Africa Rand to US Dollar exchange rate; Australian Dollar to US Dollar rate; Canadian Dollar to US Dollar exchange rate; gold lease rate |
| 10 | Buncic and Moretto | 2015 | Copper monthly returns | Dynamic Model Averaging and Selection (DMA/DMS) framework | Excess demand; inventory; convenience yield; TED spread; Volatility Index (VIX); equity price of large resource based firms; Chilean Peso to US Dollar rate; Australian Dollar to US Dollar rate; US industrial production; US term spread; Baltic Dry index; broad S&P 500 index; gold price; crude oil price |
Chapter 3 Research Design and Methodology
The CRISP-DM methodology (Wirth and Hipp, 2000) will be used in this project for the
investigation and evaluation of the models on the dataset. It is a widely used methodology for
data mining tasks (Piatetsky, 2014) and highly compatible with data mining software such as
SPSS Modeler. The methodology involves six phases as illustrated in Fig. 3.1 below and
discussed thereafter.
Figure 3.1: CRISP-DM Methodology Phases
3.1. CRISP-DM Methodology
3.1.1. Business Understanding
This phase involves comprehension of the project aims and objectives from a business
perspective and then creating a data mining problem based on this understanding. This
essentially is summed up in the high value placed on the ability to reliably predict copper spot
prices as well as to understand how relevant economic variables affect the price fluctuations.
This capacity for forecasting with a reliable degree of accuracy is highly sought after by
stakeholders in the copper industry. Chile, the world’s largest producer of copper, depends on
the metal for almost half of its exports (Meller and Simpasa, 2011). Spilimbergo (in Buncic
and Moretto, 2015) also highlighted the strong dependence of the Chilean economy on
copper. Having a good model for predicting copper prices as well as being able to understand
the effect that various economic and financial variables have on the price of the metal will
thus be of invaluable use for planning and fiscal control of the economy for a country like
Chile. Similar arguments can be made for the world’s largest importer nation of copper:
China.
In the same vein, investors and traders in the various commodities exchanges globally
will find the decision tree model deliverable of this project highly useful in helping them
make informed trading decisions in order to maximize returns on their investment in the
metal.
3.1.2. Data Understanding
The data understanding phase requires the preliminary analysis and characterization of
the dataset with the goal of identifying data quality issues (outliers, missing values, etc.) and
interesting statistical properties and patterns. At this stage, the project dataset is appraised
with a view to determining the nature of the sources and context of the data as well as the
extent to which the dataset is sufficient for the purposes of the project investigation.
3.1.3. Data Preparation
This phase follows from the previous one to attempt to clean out the data quality
problems identified and organise the dataset (feature selection, transformation, etc.) into the
format to be fed into the model for analysis. This step is carried out iteratively on the dataset
to enhance the quality of data used for the modelling thereby improving the modelling
outcome.
3.1.4. Modelling
In this phase, the models to be used for analysis are applied to the dataset and the necessary
parameters are tuned to optimal values. Two models (decision tree and ARIMA) are used for
analysis of the dataset.
3.1.5. Evaluation
At this stage, model(s) have been developed and they are assessed to ensure that all
necessary business requirements were adequately taken into consideration in the construction
of the models. At this point, a decision is reached on whether the data mining results are good
enough to be used or not. This stage is critical because it involves the interpretation of
the modelling results.
3.1.6. Deployment
This is the phase where the model is implemented ‘in the field’. The end user or
customer should be able to effectively apply the model to fresh datasets and produce results
in a format that is clear and interpretable. For this project, the deployment phase involves an
understanding of the implications for the economy (global and individual countries) of the
effect of economic indices on copper price variations. The model produced can also be
utilised in the finance industry for copper price forecasting.
It is to be noted that, just as the diagram in Figure 3.1 depicts, the various phases are not
worked through in a simple waterfall fashion. Rather, each stage can involve multiple passes
and can lead to the revisiting of prior or other stages in order to achieve the ultimate business
objective of the data mining endeavour.
3.2. Project Evaluation
Considering that the forecasting of copper price is being examined as a process (which
can be adapted to similar domains) rather than a one-off, a methodology (CRISP-DM) is
being adopted for use in the execution of this project. This methodology will be critically
evaluated in terms of interpretability as well as accuracy as further discussed below.
The resultant decision tree model developed for forecasting will show the nature of the
effect that the various predictor variables have on the target: copper spot price. This openness
is a quality lacking in the ARIMA model. The relationships that exist between the variables
and copper prices can thus easily be interpreted from the results.
The forecast results of the models will be analysed and evaluated with a view to ranking
them by accuracy. This will be realised using the Root Mean Square Error (RMSE),
Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) metrics.
These are defined as follows:
RMSE = √( (1/n) ∑t=1..n (qt − pt)² )

MAE = (1/n) ∑t=1..n |qt − pt|

MAPE = (100/n) ∑t=1..n |qt − pt| / qt

where n = length of forecast period
qt = actual price at time t
pt = predicted price at time t
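The three metrics translate directly into code. A minimal sketch on invented actual/predicted values:

```python
# RMSE, MAE and MAPE computed directly from their definitions, on
# invented actual (q) and predicted (p) values.
import numpy as np

def rmse(q, p):
    return np.sqrt(np.mean((q - p) ** 2))

def mae(q, p):
    return np.mean(np.abs(q - p))

def mape(q, p):
    return np.mean(np.abs(q - p) / q) * 100.0

q = np.array([100.0, 110.0, 120.0])   # actual prices
p = np.array([98.0, 113.0, 118.0])    # predicted prices
print(rmse(q, p), mae(q, p), mape(q, p))
```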
Chapter 4 Data Analysis and Modelling
4.1. Data Understanding
An initial assay of the dataset is carried out in order to gain some insight into the nature,
distribution, statistical characteristics and correlation inherent in the data.
4.1.1. Distribution and Statistical Characteristics
The dataset used includes copper spot prices from the LME as obtained from Bloomberg
data services. Table A.1 in the appendix lists the economic variables (as contained in the
obtained dataset) considered for use in this project showing their summary statistics. In order
to gain some insight into the trend behaviour of the target and one of the main predictor
variables, a composite time series plot of copper prices and crude oil prices is made and
shown in Figure 4.1 below. It can be clearly seen for example that there is cyclicality
(business cycle) of periodic peaks and troughs in the plot of copper prices. Also, both
quantities have a similar trend in values and are affected in similar proportion by the 2008
global recession where we see a significant drop in prices.
Figure 4.1 Composite time series plot of copper spot prices ($/MT) and crude oil prices ($/barrel)
In part of the scenarios for analysis, twenty (20) of the 142 variables available in the
dataset have been chosen for decision tree learning and prediction of the target variable,
LME Copper Spot ($). In the dataset provided, all variables are presented at a
granularity of one month. The rationale for the inclusion of each variable in this data
analysis is as follows:
(1) Lagged Copper Returns
As noted by Buncic and Moretto (2015), “…there can be periods of momentum in asset
returns due to some market participants adopting a trend following trading strategy”.
This implies that a time series quantity like commodity prices, which tends to have an
inherent trend in its values over time, is likely to be recognised as such by traders, who
then speculate based on this knowledge. Thus the lagged copper return, calculated as the log
change of the monthly series, is also used as a predictor.
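The lagged log-return predictor can be computed as follows. The prices are invented for illustration, not actual LME figures.

```python
# Lagged copper return as a log change of the monthly series.
import numpy as np

monthly_price = np.array([5000.0, 5100.0, 4950.0, 5200.0])

log_return = np.diff(np.log(monthly_price))   # month-on-month log change
lagged_return = log_return[:-1]               # previous month's return, as predictor
target_return = log_return[1:]                # current return, aligned with its lag
```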
(2) WTI Cushing Crude Oil Spot Px
Considering that crude oil remains the primary source of energy for industry globally, the
price of crude is likely to affect, and serve as a good predictor of, the price of copper.
(3) S&P GSCI Copper Inx Spot
An index published by Standard & Poor’s which gives an indication of the investment
performance of the copper commodity market. Since copper is one of the main industrial
metals in the commodities market, this index is likely to have an effect on the price of the
metal as investors either rush to buy or shy away from it based on its perceived
performance in the market.
(4) LME COPPER TOTAL
The level of stock inventory of copper indicates its level of ready availability as well as
showing how much excess capacity exists. It therefore can be a factor in the
determination of the price of the metal based on the basic economics laws of demand and
supply.
(5) Global Refined Copper Production – World
This is a field derived from the aggregation of the global refined production figures of
the continents. Representing the supply of the metal, it is a factor that can determine the
price of copper based on the basic economics laws of demand and supply.
(6) Global Refined Copper Demand – World
Also derived from the aggregation of global refined demand figures of the continents,
this is included based on a similar rationale as #4 above.
(7) Known Copper Ore & Concentrate Inventories
In terms of the source material of the metal, the raw ore concentrate inventories also give
an indication of the level of source availability of the metal which in turn should affect
finished copper pricing.
(8) Chicago Board Options Exchange SPX Volatility Index
A key measure of market expectations of near-term volatility conveyed by S&P 500
stock index option prices, the VIX index gives a measure of investor sentiment and
market volatility. Thus it reveals the level of investor appetite for investment which in
turn affects commodities pricing.
(9) Baltic Dry Index
The index measures the demand for shipping capacity versus the supply of dry bulk
carriers for haulage of dry cargo like grain and metal ores. Since the supply of ships is
relatively inelastic due to the cost and time it takes to build one, the index becomes much
more sensitive to the level of demand for shipping of these dry raw materials. It therefore
is a leading economic indicator of future economic activity. Widely used in literature, the
BDI is a strong candidate for use as a predictor variable.
(10) US CPI Urban Consumers NSA
An index showing the change in price of a basket of goods and services purchased by
urban consumers, the CPI is effectively a measure of inflation in the US. This in turn has
an inverse effect on the buying power and demand for these goods and services by
consumers which can be used by producers to determine the level of production to target
in order to maximise sales and reduce waste or inventory. Thus the CPI is a good
candidate for an industrial metal price prediction.
(11) S&P 500 Index
The S&P 500 is an American stock market index based on the market capitalizations of
500 large select companies across several industries in the US economy having common
stock listed on the NYSE or NASDAQ. It is thus a good representation of the U.S. stock
market and a leading indicator of the general health of the U.S. economy (Investopedia,
2016).
(12) Dow Jones Industrial Average
This is an index that shows how 30 large publicly owned companies based in the US
have traded during a standard trading session in the stock market. It is a price-weighted
scaled average which also is computed to gauge the performance of the industrial sector
of the US economy.
Variables (11) and (12) are included as predictors considering that they are indicators of
the performance of the US stock exchange, which themselves are also representative of
“returns … [of] various other global stock markets” (Buncic and Moretto, 2015). Thus
they show the level of investor activity and the general performance of stock markets.
(13) USDCLP Spot Exchange Rate - Price of 1 USD in CLP
The exchange rate of the US Dollar to the Chilean Peso is included considering that
Chile is the world’s largest exporter of copper and the Chilean economy depends to a
very large extent on copper exports. Fluctuations in the currency exchange rate are
therefore, to a large extent, an indication of the level of copper exports and market
performance.
(14) BHP Billiton Ltd
Share Price of Anglo-Australian multinational mining, metals and petroleum company
and the world's largest mining company
(15) Rio Tinto PLC
Share Price of British-Australian multinational and one of the world’s largest metals and
mining corporations
(16) Freeport-McMoRan Copper & Gold Inc
Share Price of world's largest copper producing and mining company based in the US
Variables (14), (15) and (16) are included as predictors considering that they are
indicators of the performance of the world's main mining organisations. They are
widely used in the literature as predictor variables in data mining research involving
industrial metals.
(17) LME ALUMINUM 3MO ($)
Aluminium is a close substitute for copper in one of its highest application areas:
electrical wiring and electronics. Thus the price of aluminium in the LME commodities
market is considered for inclusion as a predictor.
(18) United States Money Supply M2
A measure of the money supply as published by the United States Federal
Reserve System. M2 (a broader definition of money that encompasses M1) is an
economic indicator of inflation. Like the CPI, this variable is a good predictor
candidate since it gives a measure of inflation.
(19) US Industrial Production 2007=100 SA
The Industrial Production Index is an economic indicator that measures real output for all
facilities located in the United States in manufacturing, mining, and electric and gas
utilities. It is thus a measure of economic activity. This variable is also used in the literature.
(20) Generic 1st 'LP' Future
A future is a contract to buy/sell a financial instrument or asset at a predetermined future
date and price. The Generic 1st ‘LP’ future is the contract price of copper against the next
month and is therefore the shortest-length futures contract. Being itself a form of forecast
of the next month's price of the metal, this variable is included among the predictors.
4.1.2. Correlation Analysis
The correlation of the variables to the target variable as seen in Table A.1 of the
Appendix shows the degree of association of each individual variable to copper spot prices.
The Peso/Dollar exchange rate, for example, is negatively correlated with the target: the
lower the price of copper, the more Pesos required to buy 1 dollar (and vice versa). This is to
be expected, as Chile’s currency, the Peso, is highly sensitive to the price of the metal,
copper being the country's main export item. Also worthy of note is that the S&P GSCI
Copper Inx Spot is almost perfectly positively correlated with the lagged copper prices,
which perhaps gives an indication of how the index is computed.
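Such a correlation table can be sketched with a pure-Python Pearson correlation; the `copper` and `fx_rate` series below are made-up illustrative values, not the dissertation's data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A currency that strengthens (fewer units per dollar) as copper rises
# shows up as a negative correlation, as observed for USDCLP.
copper = [100.0, 120.0, 90.0, 150.0, 130.0]    # illustrative prices
fx_rate = [700.0, 650.0, 730.0, 600.0, 640.0]  # illustrative pesos per USD
r = pearson_r(copper, fx_rate)
```

In practice this is done per column of the dataset against the copper spot price column, producing a table like Table A.1.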
4.2. Data Preparation
4.2.1. Transformation
The dataset consisting of the twenty predictor variables chosen as above had their actual
values transformed to log difference values using the following formula:
X_T = LN(X_t / X_{t-1}) × 100        (4.1)

where X_T = transformed value,
X_t = X value in month t, and
X_{t-1} = X value in month t-1.
This transformation was applied to all non-target variable fields. It is necessary in order
to avoid ‘discovering’ spurious correlations in the data: since the dataset is a time series,
most of the variables are non-stationary and trended. The only exception is the Volatility
Index field, whose value is computed to give an indication of investor appetite and does
not necessarily follow a trend over time.
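Equation 4.1 can be sketched in a few lines of Python (the prices are illustrative only):

```python
import math

def log_diff(series):
    """Equation 4.1: X_T = LN(X_t / X_{t-1}) * 100, the month-on-month
    growth rate; the first observation has no predecessor and is dropped."""
    return [math.log(curr / prev) * 100 for prev, curr in zip(series, series[1:])]

prices = [100.0, 105.0, 102.0, 110.0]  # illustrative monthly values
growth = log_diff(prices)              # one value per month-on-month change
```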
4.2.2. Standardisation
Then a copy of the dataset was made with all values scaled using standardisation
according to the following formula:
X_S = (X_T − mean(X_T)) / σ_T        (4.2)

where X_S = standardised value,
mean(X_T) = mean of X_T, and
σ_T = standard deviation of X_T.
This was done in order to bring the variables onto a common scale, since they are in
different units and therefore have widely varying value ranges. This is especially necessary
when building models such as neural networks. Although decision trees can handle
variables with widely varying ranges of values, the data was still standardised into a
separate dataset in order to use it for further modelling and compare the results.
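A minimal sketch of Equation 4.2 (the population standard deviation is assumed here; SPSS Modeler may use the sample form):

```python
import math

def standardise(values):
    """Equation 4.2: z-score scaling, (X_T - mean(X_T)) / stddev(X_T)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std dev
    return [(v - mean) / std for v in values]

scaled = standardise([1.0, 2.0, 3.0, 4.0, 5.0])  # zero mean, unit variance
```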
4.3. Decision Tree Modelling
4.3.1. Run Set 1 – Log Difference with CHAID
Several modelling approaches are investigated and analysed with a view to determining
which is most appropriate, that is, which produces results with the lowest error rates.
With the actual dollar value of the copper spot price as the target, models were developed
through several train/test runs covering different combinations of column-choice
scenarios, entire/partial data row sets and, finally, tree depth values. These are further
explained as follows:
I. Sessions
Datasets are split at random into two equal parts using a Partition node and then used for
a. Train and
b. Test runs
II. Scenarios
Different scenarios are generated considering different column choices as follows:
a. Scenario 1: 20 Variable Selection
The twenty variables identified (as above) were used in the modelling runs under
this scenario. These variables (chosen based on rationale also explained prior) were
applied using the following values:
i. Log Difference Values
As mentioned under Data Transformation, the data values for each variable in the
dataset were transformed to derive growth or change rate using log difference.
ii. Actual Values
b. Scenario 2: Maximum Missing Value (MV) rate
Based on the missing value rate and the correlation of the variables to the target, two
cut-off points were identified that allowed the retention of some highly correlated
variables while discarding those with very high MV rates. Of the 142 variables, retain
only those with:
i. a 60% maximum MV rate, comprising a total of 56 variables; or
ii. a 76% maximum MV rate. From Table A.1, there are quite a number of variables
with strong target correlation in the 73%-76% MV rate range, informing this
second set of 105 variables in total.
c. Scenario 3: Entire Dataset
In this scenario all 142 columns of the original dataset are retained and fed into the
algorithm for model training.
III. Data Row Sets
Where missing values exist, they always start from the beginning up to a certain row for
each column of the dataset. Two data row sets are thus formed:
a. All Rows
Entire dataset of the scenario is used for modelling.
b. Missing Values reduced Rows
For each scenario, rows are discarded from the beginning of the dataset until every
column has at most a 50% MV rate. This cut-off was used in order to observe model
performance where the effect of missing values is limited to at most half of all rows
for each variable, at the expense of fewer examples to train (and test) on.
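Since missing values occupy a leading block of each column, the row-reduction step can be sketched as follows (`None` stands for a missing cell; this is a hypothetical stand-in for the SPSS Modeler operation):

```python
def trim_for_mv_rate(rows, max_rate=0.5):
    """Drop rows from the start of the dataset until every column's
    missing-value rate is at or below max_rate.  Works because missing
    values only occur in a leading block of each column."""
    for start in range(len(rows) + 1):
        kept = rows[start:]
        if not kept:
            return kept
        n = len(kept)
        if all(sum(row[c] is None for row in kept) / n <= max_rate
               for c in range(len(kept[0]))):
            return kept
    return []

# Column 0 is missing for the first three of four months (75% MV rate);
# trimming two rows brings it down to the 50% threshold.
data = [[None, 1.0], [None, 2.0], [None, 3.0], [4.0, 4.0]]
reduced = trim_for_mv_rate(data)
```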
IV. Tree Depth
Using the CHAID decision tree model node, there are different parameters that can be
tuned while training the model including:
- Building single trees vs ensembles, with a bias for enhancing accuracy
(boosting) or stability (bagging)
- Stopping rules based on minimum records (percentage or value) in parent and
child branches
- Significance level for splitting and merging
- Maximum tree depth
Amongst the above, tree depth was chosen as the parameter for variation because the
others either produced no change in the resulting trees or produced multiple trees (in
the case of ensembles). The following maximum tree depth values were used:
a. 5 (default)
Higher values were not used because, in several preliminary trials with higher
values, the resulting trees were consistently 5 levels deep or fewer. This may be a
result of the other features (stopping rule and significance level) limiting the
complexity of the tree, since deeper growth could lead to overfitting.
b. 4
c. 3
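CHAID itself is specific to SPSS Modeler, but the effect of the maximum-depth parameter can be illustrated with a toy CART-style regression tree on a single feature (a sketch, not the dissertation's model):

```python
def best_split(xs, ys):
    """Threshold on one feature minimising total squared error of a
    two-leaf split (CART-style greedy search)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        sse = (sum((y - sum(left) / len(left)) ** 2 for y in left)
               + sum((y - sum(right) / len(right)) ** 2 for y in right))
        if best is None or sse < best[1]:
            best = (t, sse)
    return best

def grow(xs, ys, depth, max_depth):
    """Recursively grow a tree, stopping at max_depth or on a pure node;
    a leaf predicts the mean of its training targets."""
    if depth >= max_depth or len(set(xs)) < 2:
        return sum(ys) / len(ys)
    t, _ = best_split(xs, ys)
    lx = [(x, y) for x, y in zip(xs, ys) if x <= t]
    rx = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (t,
            grow([x for x, _ in lx], [y for _, y in lx], depth + 1, max_depth),
            grow([x for x, _ in rx], [y for _, y in rx], depth + 1, max_depth))

def predict(node, x):
    if not isinstance(node, tuple):
        return node  # leaf value
    t, left, right = node
    return predict(left if x <= t else right, x)

# A depth-1 tree ("stump") splits once; raising max_depth allows finer fits
# at the risk of overfitting, which is the trade-off varied in these runs.
tree = grow([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 10.0, 10.0], 0, 1)
```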
In all the runs the log difference value (growth rate) was used except in Scenario 1ii
where the actual values of the variables are applied.
Also, considering the nature of the Generic 1st ‘LP’ future as an immediate (next-month)
forecast of the price of copper, a copy of this variable column is created and staggered
forward one record so that it coincides with the next month’s copper price record. A
fresh run using all columns (now 143) and all rows is made to serve as Scenario 4.
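The staggering step amounts to shifting the column one record forward (illustrative values):

```python
def stagger_forward(column):
    """Shift a column forward one record so that last month's futures price
    lines up with this month's spot-price record; the first record then has
    no prior value and becomes missing (None)."""
    return [None] + column[:-1]

futures = [101.0, 103.5, 99.8]        # illustrative contract prices
staggered = stagger_forward(futures)  # [None, 101.0, 103.5]
```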
4.3.2. Run Set 2 – Price Movement with C5.0
Another set of modelling runs was conducted but now with a categorical target
developed to capture the copper spot price movement (UP/DOWN) from month to month.
The distribution of the movements as shown in Table 4.1 indicates that from 1970 to 2012
there has been a month-on-month rise in copper prices slightly more than half of the time.
This represents a fairly balanced dataset, and thus stratification or oversampling
techniques (Alpaydin, 2010) are not required in the data preparation for modelling.
Table 4.1: LME Copper Spot Price Month-on-Month Movement
Price Direction Count %Count
UP 271 53.77%
DOWN 233 46.23%
Total 504
Using a similar set of scenarios to Run Set 1, the C5.0 decision tree algorithm is used in this
instance (the C5.0 node in IBM SPSS Modeler can handle only categorical targets).
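Deriving the UP/DOWN target and checking the class balance of Table 4.1 can be sketched as follows (illustrative prices; an unchanged price is counted as DOWN here, an assumption the dissertation does not spell out):

```python
def price_direction(prices):
    """Categorical month-on-month movement target: 'UP' if the price rose
    from the previous month, otherwise 'DOWN'."""
    return ["UP" if curr > prev else "DOWN"
            for prev, curr in zip(prices, prices[1:])]

moves = price_direction([100.0, 104.0, 101.0, 108.0])
up_share = moves.count("UP") / len(moves)  # class balance, as in Table 4.1
```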
4.3.3. Run Set 3 – Price Change Rate with CHAID
A final set of decision tree modelling runs using month-on-month rate of change as target
was conducted. As this is also a continuous value, the CHAID algorithm was used in this run
set as well.
4.4. ARIMA (Time Series) Modelling
Using the LME copper spot prices in the dataset, a total of 505 monthly records, an
ARIMA model was built in the SPSS Modeler environment using the Expert Modeler node.
This node automatically trials and selects the ARIMA model that best fits the dataset. Table
4.2 shows the permutations of option values used in a number of runs.
Table 4.2: Expert Modeler with Constant and Transformation Option Value Runs
Constant Transformation
No None
Yes None
No Square Root
Yes Square Root
No Natural Log
Yes Natural Log
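The option grid of Table 4.2 and the pre-fitting preparation (variance-stabilising transform, then differencing) can be sketched as below; the actual ARIMA fitting is performed by the Expert Modeler node and is not reproduced here:

```python
import math
from itertools import product

TRANSFORMS = {
    "None": lambda v: v,
    "Square Root": math.sqrt,
    "Natural Log": math.log,
}

# The six (constant, transformation) permutations trialled in Table 4.2.
runs = list(product([False, True], TRANSFORMS))

def prepare(series, transform="Natural Log", d=1):
    """Apply the chosen transform, then difference d times, which is the
    preparation implied by an ARIMA integration order of d."""
    out = [TRANSFORMS[transform](v) for v in series]
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out
```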
Chapter 5 Evaluation and Results
5.1. Decision Trees
5.1.1. Log Difference Modelling
Based on runs combining the sessions, scenarios, data row sets and tree depths
explained in Section 4.3, the variable sets selected by the best evaluated model tree
under each run are captured in Tables 5.1, 5.2, 5.3 and 5.4.
Table 5.1: Trained Model Variables under Scenario 1

SN | Scenario 1i: Best Trained Model Variables | Scenario 1ii: Best Trained Model Variables
1 | Chicago Board Options Exchange SPX Volatility Index | Generic 1st 'LP' Future
2 | Generic 1st 'LP' Future | LME COPPER TOTAL
3 | Global Refined Copper Production - World | S&P 500 Index
4 | LME ALUMINUM 3MO ($) | S&P GSCI Copper Inx Spot
5 | LME COPPER TOTAL | US CPI Urban Consumers NSA
6 | S&P GSCI Copper Inx Spot | US Industrial Production 2007=100 SA
7 | United States Money Supply M2 | WTI Cushing Crude Oil Spot Px
8 | US CPI Urban Consumers NSA |
9 | WTI Cushing Crude Oil Spot Px |
Table 5.2: Trained Model Variables under Scenario 2

SN | Scenario 2i: Best Trained Model Variables | Scenario 2ii: Best Trained Model Variables
1 | Baltic Dry Index | Baltic Dry Index
2 | Chicago Board Options Exchange SPX Volatility Index | Chicago Board Options Exchange SPX Volatility Index
3 | Comex Copper Inventory Data | Commodity Research Bureau BLS/US Spot Raw Industrials
4 | Eurostat Industrial Production Eurozone Industry Ex Construction SA | Federal Funds Target Rate US
5 | Federal Funds Target Rate US | Generic 1st 'LA' Future
6 | LME COPPER TOTAL | Global Refined Copper Demand - South & Central America
7 | United States Money Supply M1 | Global Refined Copper Production - Oceania
8 | USDPEN Spot Exchange Rate - Price of 1 USD in PEN | LME COPPER TOTAL
9 | | S&P GSCI Index Spot CME
10 | | US PPI By Processing Stage Finished Goods Total SA
11 | | USDPEN Spot Exchange Rate - Price of 1 USD in PEN
Table 5.3: Trained Model Variables under Scenario 3 (Entire Dataset)

SN | Scenario 3: Best Trained Model Variables | Variable Category
1 | China Import Commodity Value - Copper Products | Metal Fundamentals
2 | Zambia Copper Prices | Metal Fundamentals
3 | LME COPPER TOTAL | Metal Fundamentals
4 | LME CNCL WRNT COPPER TOT | Metal Fundamentals
5 | Commodity Research Bureau BLS/US Spot Raw Industrials | Economic Activity Indicators
6 | BBA LIBOR USD 3 Month | Economic Activity Indicators
7 | Federal Funds Target Rate US | Economic Activity Indicators
8 | US PPI By Processing Stage Finished Goods Total SA | Economic Activity Indicators
9 | Baltic Dry Index | Economic Activity Indicators
10 | Chicago Board Options Exchange SPX Volatility Index | Economic Activity Indicators
11 | S&P GSCI Index Spot CME | Finance Indicators
12 | USDPEN Spot Exchange Rate - Price of 1 USD in PEN | Finance Indicators
Table 5.4 Trained Model Variables under Scenario 4
SN Scenario 4: Best Trained Model Variables
1 Baltic Dry Index
2 BBA LIBOR USD 3 Month
3 Chicago Board Options Exchange SPX Volatility Index
4 Commodity Research Bureau BLS/US Spot Raw Industrials
5 Federal Funds Effective Rate US
6 Federal Funds Target Rate US
7 LME ALUMINUM 3MO ($)
8 S&P GSCI Copper Exc Tot
9 Staggered Generic 1st 'LP' Future
10 US New Privately Owned Housing Units Started by Structure Total SAAR
11 USDPEN Spot Exchange Rate - Price of 1 USD in PEN
The evaluation of the trees using the following 3 metrics is as shown in Table 5.5 (best
performance per run in bold):
i. Mean Absolute Error (MAE)
The MAE metric captures the average of the absolute deviations of predicted from
target values. It is often used in the literature for model comparisons because it is
simple to calculate and understand (Hyndman and Athanasopoulos, 2014).
ii. Mean Absolute Percentage Error (MAPE)
In dividing by the target value to arrive at the MAPE, this error metric becomes
dimensionless and allows for comparison of models forecasting quantities on different
scales. It however, is undesirable when one of the possible target values is zero or
near zero at which point the MAPE value becomes extremely large or infinite.
iii. Root Mean Square Error (RMSE)
The RMSE, like the MAE, captures the error in the same unit as the target.
However, because the deviations are squared before averaging (after which a
square root is taken), this metric puts progressively greater penalties on larger
deviations. It is thus susceptible to outliers (which bloat its value), and this is the
primary reason cited for avoiding its use (Chai and Draxler, 2014). However, when
the distribution of deviations is normal (no significant outliers), the advantage of
weighting different degrees of deviation differently makes it more desirable than
the MAE, especially where accuracy is of more concern than stability.
Since the models being developed are all decision trees and the dataset is the same
throughout, the MAPE is not strictly required. In addition, the RMSE produces values
with greater ranges for discrimination, so that where MAE values are close it shows
more clearly the differences in accuracy among models.
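The three metrics are straightforward to compute; a minimal sketch (note the MAPE instability when an actual value is near zero, as discussed above):

```python
import math

def mae(actual, pred):
    """Mean Absolute Error, in the target's own units."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean Absolute Percentage Error; unstable if any actual value is
    zero or near zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root Mean Square Error; squaring penalises large deviations more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))
```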
Table 5.5: Evaluation of Tree Models Built under Various Conditions/Combinations
(each cell lists MAE / MAPE / RMSE)

All Rows:
Scenario | Depth 3 | Depth 4 | Depth 5
1i | 36.65 / 28.50 / 68.93 | 36.68 / 28.50 / 68.91 | 36.56 / 28.30 / 68.89
1ii | 6.81 / 5.90 / 10.73 | 6.72 / 5.70 / 10.68 | 6.72 / 5.70 / 10.68
2i | 22.98 / 19.30 / 43.98 | 20.41 / 16.20 / 43.38 | 20.12 / 15.40 / 43.49
2ii | 22.31 / 18.00 / 43.28 | 20.61 / 15.70 / 42.86 | 21.15 / 16.10 / 43.21
3 | 13.36 / 12.20 / 19.60 | 11.64 / 9.80 / 18.65 | 12.13 / 10.10 / 19.38
4 | 13.83 / 12.10 / 21.81 | 12.43 / 10.30 / 21.17 | 12.5 / 10.10 / 21.47

MV Reduced Rows:
Scenario | Depth 3 | Depth 4 | Depth 5
1i | 63.77 / 45.80 / 88.70 | 63.72 / 45.70 / 88.70 | 63.72 / 45.70 / 88.70
1ii | 5.97 / 4.10 / 9.23 | 6.08 / 4.10 / 9.54 | 6.08 / 4.10 / 9.54
2i | 20.54 / 16.00 / 41.24 | 20.54 / 16.00 / 41.24 | 21.00 / 16.40 / 41.40
2ii | 34.74 / 17.60 / 66.33 | 34.25 / 17.10 / 65.63 | 34.19 / 17.10 / 65.86
3 | 21.46 / 10.40 / 38.72 | 21.73 / 10.60 / 39.02 | 21.73 / 10.60 / 39.02
4 | (not run)
The modelling forecast accuracy results in Table 5.5 show the varying levels of
performance obtained under the various scenarios and run conditions. The following
particular observations were made:
- Using more rows of data, even with very high MV rates, tended to produce better
results than using fewer.
- Retaining more columns to feed into the modelling algorithm also produced better
results than filtering them out, even when the columns are known to have very high
MV rates. By retaining columns with an MV rate of up to 76% (a marginal increase
from 60%) where the added variables have high correlation with the target, the
accuracy of the model increases. Thus where variables are well correlated with the
target, they should be retained even if they have many missing values.
- The above confirms again that decision trees are versatile for use in forecasting
even when predictor variables have high rates of missing values.
- Generally, a tree depth of 4 is observed to be optimal. This is seen when comparing
the Train session results with the Test session.
If E_x(d) is the error rate of session x using tree depth d, it is seen (in Scenario 3, for
example) that

E_train(5) < E_train(4) < E_train(3)

whereas

E_test(4) < E_test(5) < E_test(3)
Therefore a tree depth of 5 tends to overfit, while a depth of 3 underfits.
Log Difference (Growth Rate) vs Actual Values
In Scenario 1ii, with the 20 variables identified earlier, the actual values of the
variables were used for training and testing rather than growth rates. As seen in Table 5.5,
under the MV-reduced data row set with a tree depth of 3, the result shows a remarkable
drop in the error rates, with best MAE and RMSE values of 5.97 and 9.23 respectively.
Indeed, looking through the table generally, the Scenario 1ii error rate figures are far
lower (mostly below 10) than those of the other scenarios. However, inspection of the
predictor variables and their ranking in the produced model, as shown in Table 5.1 and
Figure 5.1, reveals that the model is basically using just one variable (S&P GSCI Copper
Inx Spot) for prediction. This results from the very high (almost perfect) correlation
between the variable's actual values and the target, which is what the index is by
definition designed to track. Thus, being a coincident indicator, using the actual values is
not practical as they may not be available early enough to be applied. Also, since
most of the variables of the entire dataset are lagging indicators, it becomes necessary to use
their growth rates rather than actual values for prediction as done in Scenarios 2 and 3.
Figure 5.1: Predictor variables in ranked order of importance (Scenario 1ii)
Scenario 3 produced the next best error rate results, with ranked predictor variables as
shown in Figure 5.2. This was further improved by remodelling after removing the least
important variable (Global Refined Copper Production – South and Central America),
which as can be seen in Figure 5.2 is ranked far below the rest. The process was repeated
iteratively until no further improvement could be realised, producing the variable ranking
in Figure 5.3 and the final results recorded in Table 5.5.
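The iterative improvement loop is a simple backward elimination; it can be sketched with hypothetical `train_fn`/`eval_fn` callables standing in for the SPSS Modeler train/evaluate cycle:

```python
def backward_eliminate(variables, importance, train_fn, eval_fn):
    """Iteratively drop the least important variable and retrain while
    the evaluation error keeps improving.  `train_fn` builds a model from
    a variable list; `eval_fn` returns its error (both hypothetical)."""
    current = list(variables)
    best_err = eval_fn(train_fn(current))
    while len(current) > 1:
        weakest = min(current, key=lambda v: importance[v])
        candidate = [v for v in current if v != weakest]
        cand_err = eval_fn(train_fn(candidate))
        if cand_err >= best_err:
            break  # no further improvement; keep the previous set
        current, best_err = candidate, cand_err
    return current, best_err
```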
Figure 5.2: Predictor variables in ranked order of importance (Scenario 3)
Figure 5.3: Predictor variables in ranked order of importance (Scenario 3 Improved)
The predictive power of the Staggered Generic 1st ‘LP’ (Copper) Future variable in
Scenario 4 can be seen in Figure 5.4. The error rates realised were very close to those of
Scenario 3, as detailed in Table 5.5. As such, this model may still be used where the
business understanding favours the set of variables listed in Table 5.4.
Figure 5.4: Predictor variables in ranked order of importance (Scenario 4)
Predictor Variables
The chosen best decision tree model is the Scenario 3 (improved) model with the next
best error rate results (after Scenario 1ii which as earlier mentioned is impractical). The
variables of the model are as listed in Table 5.3 and ranked in Figure 5.3 showing the
foremost predictor as China Import Commodity Value - Copper Products. This goes to show
that Chinese demand for copper, over and above other factors, largely determines the price of
the metal, making it highly susceptible to the vagaries of the Chinese economy. The second
most important predictor, Zambia Copper Prices, is rather an obvious one: since the
commodity is marketed globally, international copper prices are bound to be comparable
and to move with a similar trend. Along with two other inventory indicators, these make up
the metal-fundamentals variables that are predictive of copper spot prices. In the predictor
importance ranking, the other variables are mostly at similar levels of impact.
The Volatility Index, PPI and Federal Funds Target Rate are different indicators of the
US economy, measuring investor appetite, variations in manufactured goods prices and
interbank lending rates respectively. Thus the health of the US economy also largely
determines the price of copper. The Baltic Dry Index, being a measure of the level of
shipping activity of ‘dry’ commodities including copper, is a leading indicator of demand
levels and so, as expected, demonstrates good predictive power. This corroborates its high
correlation with the copper price, as seen earlier.
Finance indicator variables include the S&P GSCI Index, a widely quoted measure of
general commodity price inflation. Finally in this category is the US Dollar to Peruvian
Sol exchange rate (USDPEN). Peru is the third-largest copper-producing country after
Chile and China, and its currency thus reflects the price of the metal, copper being one of
its important export commodities.
Model Tree and Ruleset
The decision tree structure for the most accurate model (scenario 3) is shown in Figure
5.5a-c where it can be observed that the root node is the China Import Commodity Value –
Copper Products variable. At the 2nd level of the tree and indicating the next level of
variable importance we see the following variables:
1. USDPEN Spot Exchange Rate (PEN Curncy)
Here the inverse relationship between the exchange rate and copper spot price can be
seen from the rule:
If PEN Curncy <= 3.34 then LOCADY Comdty = 150.31
If PEN Curncy > 3.34 then LOCADY Comdty = 126.93
2. Zambia Copper Prices (ZMCMCOPP Index)
3. Baltic Dry Index (BDIY Index)
The importance of the Baltic Dry Index variable is further reflected in the fact that it
is used to split further on 82.6% of the dataset. This is consistent with its high
correlation (0.6) with the target variable. Being a leading indicator, the Baltic Dry
Index is thus a very useful predictor, as seen in its wide use in the literature.
Figure 5.5a CHAID Left Tree Subsection
Figure 5.5b CHAID Middle Tree Subsection
Figure 5.5c CHAID Right Tree Subsection
With a four-level depth beyond the root and a total of 23 leaf nodes, the decision tree is
quite intricate; an exhaustive analysis of its rules is therefore treated as out of scope and
not undertaken.
The design of the stream (in IBM SPSS Modeler) to load the dataset and train a CHAID
node in creating the decision tree model for scenario 3 is as shown in Figure A.2 under the
Appendix.
5.1.2. Price Movement Modelling
As described in Section 4.3.2, another set of modelling runs was conducted with a
categorical target capturing the copper spot price movement (UP/DOWN) from month to
month. Using the same set of scenarios as earlier, the C5.0 algorithm in SPSS Modeler
(since the target is now categorical) was used to build decision tree models through
several runs.
results as shown in Table 5.6 indicate that scenario 2 produced the tree with the highest
accuracy of 96% during test run. The variables selected by the algorithm for the decision tree
and their ranking as shown in Figure 5.6 reveals the significant impact of the S&P GSCI Inx
Spot variable in predicting price movement.
Table 5.6: C5.0 Decision Tree Evaluation using Price (month-on-month) Direction Target

20 Field Log Rate
Train: Actual DOWN → Predicted DOWN 93, UP 27; Actual UP → Predicted DOWN 8, UP 127 (Accuracy 0.86)
Test: Actual DOWN → Predicted DOWN 86, UP 27; Actual UP → Predicted DOWN 3, UP 133 (Accuracy 0.88)

20 Field Actual Values
Train: Actual DOWN → Predicted DOWN 102, UP 12; Actual UP → Predicted DOWN 80, UP 53 (Accuracy 0.63)
Test: Actual DOWN → Predicted DOWN 89, UP 30; Actual UP → Predicted DOWN 94, UP 44 (Accuracy 0.52)

Scenario 1
Train: Actual DOWN → Predicted DOWN 84, UP 5; Actual UP → Predicted DOWN 6, UP 103 (Accuracy 0.94)
Test: Actual DOWN → Predicted DOWN 90, UP 2; Actual UP → Predicted DOWN 14, UP 93 (Accuracy 0.92)

Scenario 2
Train: Actual DOWN → Predicted DOWN 51, UP 2; Actual UP → Predicted DOWN 0, UP 62 (Accuracy 0.98)
Test: Actual DOWN → Predicted DOWN 44, UP 3; Actual UP → Predicted DOWN 2, UP 71 (Accuracy 0.96)

Scenario 3
Train: Actual DOWN → Predicted DOWN 87, UP 27; Actual UP → Predicted DOWN 5, UP 128 (Accuracy 0.87)
Test: Actual DOWN → Predicted DOWN 92, UP 27; Actual UP → Predicted DOWN 7, UP 131 (Accuracy 0.87)
Figure 5.6: Predictor variables in ranked order of importance
(Price Movement Target – Scenario 2)
Variables and Ruleset
Figure 5.7 shows the tree structure and derived ruleset from the decision tree model for
copper price direction forecasting. Rule 1 shows that when the S&P GSCI Inx Spot growth
rate falls to -1.76 or less (well below its mean value of 0.41), the copper spot price trends
downward. Otherwise it swings upward, provided the Euro to US Dollar exchange rate
growth does not fall more than about one and a third standard deviations below its mean
(-0.01), that is, below -3.348 (Rule 3).
These two rules determine the price direction in almost all the examples tested (51 and
60 respectively, out of a total of 115). Crude oil is a major source of energy, and this is
reflected in its common appearance as a predictor variable in the literature; in the forecast of
copper price movement, however, the decision tree model uses it to discriminate in just 4
examples.
Rule 1 for CopperSpotPriceDirection(Down) (51)
If S&P GSCI Inx Spot <= -1.76
then DOWN
Rule 2 for CopperSpotPriceDirection(Down) (2)
If S&P GSCI Inx Spot > -1.76
and EURUSD Spot Exchange Rate <= -3.348
and WTI Cushing Crude Oil Spot Px <= -8.23
then DOWN
Rule 3 for CopperSpotPriceDirection(Up) (60)
If S&P GSCI Inx Spot > -1.76
and EURUSD Spot Exchange Rate > -3.348
then UP
Rule 4 for CopperSpotPriceDirection(Up) (2)
If S&P GSCI Inx Spot > -1.76
and EURUSD Spot Exchange Rate <= -3.348
and WTI Cushing Crude Oil Spot Px > -8.23
then UP
Figure 5.7: C5.0 Decision Tree and Ruleset (Directional Forecasting)
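The ruleset of Figure 5.7 translates directly into code (growth-rate inputs are assumed, matching the log-difference values used in modelling):

```python
def copper_direction(gsci, eurusd, wti):
    """Direct encoding of the C5.0 ruleset in Figure 5.7; arguments are
    month-on-month growth rates of S&P GSCI Inx Spot, the EURUSD rate
    and WTI crude oil."""
    if gsci <= -1.76:
        return "DOWN"                            # Rule 1
    if eurusd <= -3.348:
        return "DOWN" if wti <= -8.23 else "UP"  # Rules 2 and 4
    return "UP"                                  # Rule 3
```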
5.1.3. Price Change Rate Modelling
A final set of modelling runs was conducted using the month-on-month rate of change as
a developed target. As this is also a continuous value, the CHAID algorithm was again used
in this run set. Table 5.7 shows the evaluation results, where Scenario 3 with missing-value-
reduced rows gives the best accuracy figures. As seen in Figure 5.8, the S&P GSCI Inx Spot
variable is again largely the determinant predictor. With an importance of 0.97, and given
the very low error rates recorded, it can be inferred that the variable has an almost perfect
correlation with the price change rate target.
Table 5.7: Evaluation of Tree Models (month-on-month rate of change target)
Row Set All Rows MV Reduced Rows
Tree Depth 3 4 5 3 4 5
Metric MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE
Scenario 1i 0.025 0.746 0.045 0.016 0.467 0.032
Scenario 1ii 0.053 2.762 0.071 6.716 0.057 10.682 0.057 2.602 0.084 0.056 2.557 0.077
Scenario 2i 0.030 2.030 0.045 0.030 2.030 0.045 0.019 0.698 0.032
Scenario 2ii 0.030 2.020 0.045 0.030 2.020 0.045 0.016 0.501 0.032 0.016 0.506 0.032
Scenario 3 0.030 2.020 0.045 0.030 2.020 0.045 0.015 0.491 0.032 0.015 0.496 0.032
Figure 5.8: Predictor variables in ranked order of importance (Price Change Rate Target)
5.2. ARIMA
Using the Expert Modeler, a seasonal ARIMA(2,1,0)(1,0,1) with a constant, on a
natural-log-transformed dataset, proved to be the model producing the best performance in
fitting the dataset. Table 5.8 shows the detailed results of a number of runs with
permutations of option values.
Table 5.8: Expert Modeler Result: ARIMA(2,1,0)(1,0,1) with Constant and Transformation options
Constant | Transform. | Stationary R² | R² | RMSE | MAPE | MAE | MaxAPE | MaxAE | Norm. BIC | Q | df | Sig.
No | None | 0.14 | 0.99 | 11.15 | 4.54 | 6.10 | 38.59 | 86.23 | 4.87 | 73.84 | 14 | 0
Yes | None | 0.15 | 0.99 | 11.16 | 4.53 | 6.09 | 38.78 | 86.64 | 4.89 | 73.84 | 14 | 0
No | Square Root | 0.16 | 0.98 | 11.18 | 4.52 | 6.10 | 38.52 | 86.06 | 4.88 | 43.08 | 14 | 8.30E-05
Yes | Square Root | 0.16 | 0.98 | 11.19 | 4.53 | 6.10 | 38.75 | 86.58 | 4.89 | 43.05 | 14 | 8.40E-05
No | Natural Log | 0.16 | 0.98 | 11.21 | 4.53 | 6.11 | 38.95 | 87.04 | 4.88 | 21.05 | 14 | 0.10032
Yes | Natural Log | 0.17 | 0.98 | 11.24 | 4.53 | 6.11 | 39.27 | 87.73 | 4.90 | 20.94 | 14 | 0.10306
These results indicate that an ARIMA(2,1,0)(1,0,1) with constant on a natural log
transformed dataset is the best fit model as seen in Table 5.8 (in bold). This is so considering
that this trial run has the highest Stationary R² value of 0.17 as well as the best model
adequacy result, with a Ljung-Box significance of 0.103 (> 0.05) at the 95% confidence
level. The runs without the natural log transformation all return significance values below
0.05, indicating statistically significant autocorrelation remaining in their residuals.
Thus the copper spot price series is a high-variance, trended time series requiring a
natural log transformation and first differencing to make it stationary. It has a non-seasonal
second-order autoregressive signature, AR(2), and a seasonal part fitted with an
ARMA(1,1) model. Figure 5.9 shows a time series plot of the original dataset, where the
variance and trend can be clearly seen. In Figure 5.10 the dataset has been transformed,
thereby reducing variance, and Figure 5.11 shows a stationary plot after first differencing.
The model design stream is shown in Figure A.2 in the Appendix.
It should be noted that careful scrutiny of the dataset output of the ARIMA model reveals
that the error rate figures (RMSE, MAPE, MAE) in Table 5.8 are computed with the entire
dataset used as the training set; in essence they are training error rates. Using several holdout
values, test error rates for the model are shown in Table 5.9, where far higher figures are
seen. This indicates that the ARIMA model is at best useful for forecasts in the very near term.
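The holdout evaluation differs from the random split used for the decision trees: the test records must come from the end of the chronologically ordered series. A minimal sketch:

```python
def holdout_split(series, holdout):
    """Chronological split for out-of-sample time-series evaluation:
    the last `holdout` records are reserved for testing."""
    return series[:-holdout], series[-holdout:]

# The 505 monthly records with the 252-record (50%) holdout of Table 5.9.
train, test = holdout_split(list(range(505)), 252)
```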
Figure 5.9: Time series plot of the original dataset showing variance and trend.
Figure 5.10: Time series plot of the transformed dataset showing reduced variance.
Figure 5.11: Time series plot of the differenced transformed dataset, eliminating trend to make it stationary.
Table 5.9: ARIMA Model Holdout (Out-of-Sample) Dataset Test Error Rates
Holdout % of Total Count MAE MAPE RMSE
252 50% 93.41 89.09% 113.60
100 20% 104.38 56.33% 124.40
10 2% 54.91 15.45% 70.39
3 0.6% 33.54 9.50% 35.16
1 0.2% 22.51 6.17% 22.51
5.3. Models Comparison
Using the figures from Tables 5.5 and 5.9 (rather than Table 5.8, as explained earlier)
to compare the two model types, we see that the ARIMA performs far worse than the
decision tree. At the 50% holdout (test) level, the decision tree's RMSE of 18.65 is far
more reliable than the ARIMA's 113.60. A limitation of this comparison, however, is that
while the decision tree uses 50% of the dataset picked at random for testing, the ARIMA,
by the very nature of the model, must use the half taken from the end of the
chronologically ordered dataset. The test datasets presented to the two models are
therefore not exactly the same. Nevertheless, considering that even the lowest ARIMA
RMSE (22.51, at a holdout of just one record) is still worse than that of the decision
tree, it can safely be concluded that the decision tree model performs better and will
continue to do so when presented with real-world data.
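The split mismatch noted above can be made concrete. A minimal sketch, assuming a series of 504 monthly records (so that the 50% holdout contains the 252 records of Table 5.9): the ARIMA must hold out the chronological tail, while the tree holds out a random sample, so the two test sets only partially overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 504                        # assumed series length, so 50% = 252 records
idx = np.arange(n)

# ARIMA-style holdout: the final 50% of the chronologically ordered series.
arima_test = idx[n // 2:]

# Decision-tree-style holdout: 50% of the records drawn at random.
tree_test = rng.choice(idx, size=n // 2, replace=False)

# The two test sets coincide only partially, which is why the comparison in
# the text is indicative rather than exact.
overlap = np.intersect1d(arima_test, tree_test).size
```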
As the decision tree model uses several variables selected from a large pool of
potential predictors, it draws on more information to arrive at a forecast value. The
ARIMA, on the other hand, has only the historical target values to rely on for its
modelling and forecasting. It is not surprising, then, that the decision tree performs
much better. This is the reason for the growing research interest in applying data
mining and machine learning algorithms to time series forecasting, especially where
promising predictor variables have been shown to exist (Chen et al., 2010; Lai et al.,
2009; Chang et al., 2011).
Apart from the decision tree, several other machine learning models exist, including
support vector machines (SVM), logistic regression, k-nearest neighbours (KNN) and
artificial neural networks (ANN). In fact, the neural network has been shown in some
studies to achieve better accuracy figures than the decision tree (Diaz et al., 2016).
However, neural nets and other highly sophisticated, high-accuracy models are
essentially black-box techniques that do not lend themselves to easy interpretation of
results, as “they do not provide an insight into the nature of the interactions between
the technical indicators and the [target]” (Lai et al., 2009). The main advantage of the
decision tree model, as mentioned before, is its interpretability: it identifies the
predictor variables and the threshold values that determine the forecast value of the
target. Table 5.3 lists the economic and financial indicators as well as the metal
fundamentals that the decision tree model found most predictive of copper spot prices.
Figure 5.5 shows the generated decision tree, indicating the threshold values of these
variables from which rulesets for prediction are derived.
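The dissertation builds its tree in SPSS Modeler, so the following is only an open-source stand-in: a sketch, using scikit-learn's CART regressor on synthetic data, of how predictor names and split thresholds can be read straight out of a fitted tree, which is precisely the interpretability advantage claimed above. The feature names `spgsic` and `vix` are borrowed from the dataset variables merely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))  # two synthetic predictors standing in for the real ones
# Target jumps when the first predictor crosses 0.5, plus a little noise.
y = np.where(X[:, 0] > 0.5, 300.0, 100.0) + rng.normal(scale=5.0, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# export_text prints the tree as human-readable rules: each line names the
# predictor and the threshold value at the split, e.g. "spgsic <= 0.5...".
rules = export_text(tree, feature_names=["spgsic", "vix"])
print(rules)
```

A ruleset of this form is what lets an analyst see not only *which* variables matter but *at what values* they change the forecast.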
As seen in Table 5.6, the decision tree can also be used for categorical target
prediction, which in this case is price direction. The high accuracy figures achieved in
this instance are borne out by the literature, where decision trees have been
demonstrated to perform very well, even better than KNN, SVM and ANN, in directional
forecasting (Diaz et al., 2016). The model can therefore be used reliably by investors
(hedgers and speculators) to decide on a buy/hold strategy when trading the metal.
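The categorical use of the tree amounts to relabelling the target as a direction and fitting a classification tree instead of a regression tree. A minimal sketch on synthetic data (scikit-learn again standing in for SPSS Modeler; the direction labels and predictor construction are assumptions for illustration only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))  # synthetic predictor set

# Label each record "up" or "down"; here the direction is driven mostly by the
# first predictor, with a small amount of noise.
direction = np.where(X[:, 0] + 0.1 * rng.normal(size=300) > 0, "up", "down")

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, direction)
accuracy = clf.score(X, direction)   # training accuracy only, for illustration
```

In practice the labels would come from month-on-month price changes, and accuracy would be measured on a holdout, as in Table 5.6.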
5.4. Deployment
With these insightful models for copper price prediction successfully developed, the
final step in the methodology is deployment. For the purposes of this research, some
promising deployment scenarios are outlined for the following stakeholders:
1. Governments
Chile, Peru and Zambia are the top copper-producing countries in South America
and Africa, and their economies rely heavily on exports of the commodity. The
models developed in this research can readily provide useful data to help inform
national economic planning strategy.
2. Investors
Hedgers and speculators on exchanges globally can apply the models (for
example, the price-direction model) to make informed trading decisions that help
maximise portfolio returns.
3. Academia
Based on the outcome of this research, the academic community can use the
findings to improve existing theories and methods in financial time series
analysis and similar research areas.
Chapter 6 Conclusions and Future Work
This final chapter presents the summary findings of the research, the conclusions
reached and recommendations for future research.
6.1. Conclusions
This research work focuses on the prediction of copper spot prices using data from the
LME as well as a set of other economic and financial predictor variables. The value to
various stakeholders of predicting the future price of an industrial metal like copper
has been amply demonstrated and is reflected in the literature.
Several attempts to forecast the price of industrial metals using various approaches
have been made. The ARIMA has been applied in a number of studies, and more
recently the use of data mining techniques such as linear regression, ANN and SVM has
been investigated (Ongsritrakul and Soonthornphisaj, 2003). While the data mining
approach has demonstrated better accuracy (Adebiyi et al., 2014; Lasheras et al., 2015),
the results produced have been difficult to interpret owing to the black-box nature of
these algorithms. Also lacking is the use of a methodological framework in the analysis
process, which would give the benefit of a systematic, generic approach that can be
adapted to other domains.
In this research, the use of decision trees as a forecasting tool for copper spot prices is
investigated; being an open-box model, its results are easily interpreted and applied.
The use of the CRISP-DM methodology ensured a systematic approach to the
investigation. A rigorous search of the literature reveals that neither this
methodological approach nor the use of decision trees has previously been applied to
copper price forecasting. The decision tree is also contrasted with the ARIMA in terms
of model accuracy.
The CRISP-DM methodology comprises six phases: business understanding, data
understanding, data preparation, modelling, evaluation and deployment. The phases
are navigated not in a single waterfall pass but through several iterations that revisit
previous stages, ensuring continuous improvement in the overall modelling effort.
The results obtained are consistent with the literature in terms of the metal
fundamentals and economic/financial variables identified and selected as predictors by
the model. There is clear evidence of the leading impact of Chinese demand on the
price of copper; the metal's price performance is thus highly predicated on the Chinese
economy. The Standard & Poor's copper commodity index (S&P GSCI Copper Inx Spot)
has also been shown to
have very good predictive capacity for the metal. The US economy, being the largest in
the world, also determines the price of copper to a large extent, as seen in the
significant effect of several US economic indicators (the Volatility Index, PPI and
Federal Funds Target Rate) as predictor variables. The Peruvian Sol to US Dollar
exchange rate completes the set of forecast variables identified. The threshold values
at which the predictors determine prices are captured in the produced decision tree.
A comparison with the ARIMA model showed clearly that the decision tree produced
far more accurate results and will thus deliver more reliably upon deployment in
real-world scenarios.
Deployment scenarios considered include economic planning by governments of
countries like Chile, Peru and Zambia whose economies depend strongly on copper exports.
Investors are also able to utilise the developed models in planning more profitable trading
strategies.
Due to time and other resource constraints, the research was limited by the dataset, in
which a number of variables that the literature suggests are promising had many
missing values. Also, business cycles, identified as a strong potential predictor
(Cuddington and Jerrett, 2011; Diaz et al., 2016) on the basis of the cycles observed in
copper prices over the years, were not investigated thoroughly enough to be developed
into a forecast variable.
6.2. Recommendations for Future Work
Future work recommended in this research area includes the following:
1. A more thorough study of the variables available in the dataset, with a view to
further discovering the interrelationships among them and thereby fine-tuning
the data understanding and preparation processes.
2. Further modelling runs tuning more parameter values for possible improvement
in the accuracy of the resulting model.
3. Analysis and investigation of business cycles for development as a predictor variable.
4. Application of decision trees using the CRISP-DM methodology to other
industrial metals, clearly revealing the predictor variables and their threshold
values in determining prices (or other suitable targets).
List of References
Adebiyi, A.A., Adewumi, A.O. and Ayo, C.K. (2014), “Comparison of ARIMA and
Artificial Neural Networks Models for Stock Price Prediction”, Journal of Applied
Mathematics, Vol. 2014 No. 1, pp. 1–7.
Alpaydin, E., 2014. Introduction to machine learning. MIT press.
Anyadike, N. (2002), Copper: A material for the new millennium, Woodhead Publishing,
Cambridge, England.
Black, W.T., 1995. Trends in the use of copper wire & cable in the USA. Electrical &
Electronic Markets.
Bontempi, G. (2013). Machine Learning Strategies for Time Series Prediction. Machine
Learning Summer School. ULB, Brussels.
Box, G.E., Jenkins, G.M. and Reinsel, G.C., 1970. Time Series Analysis: Forecasting
and Control.
Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A., 1984. Classification and
Regression Trees. CRC Press.
Buncic, D. and Moretto, C. (2015), “Forecasting copper prices with dynamic averaging
and selection models”, The North American Journal of Economics and Finance,
Vol. 33, pp. 1–38.
Chang, P.C., Fan, C.Y. and Lin, J.L., 2011. Trend discovery in financial time series data
using a case based fuzzy decision tree. Expert Systems with Applications, 38(5),
pp.6070-6080.
Chai, T. and Draxler, R.R., 2014. Root mean square error (RMSE) or mean absolute
error (MAE)?–Arguments against avoiding RMSE in the literature. Geoscientific
Model Development, 7(3), pp.1247-1250
Chen, Y., Rogoff, K., & Rossi, B. (2010). Can exchange rates forecast commodity
prices? Quarterly Journal of Economics, 125(3),1145–1194
Crowson, Philip (2008), “Copper industry”, International Encyclopedia of the Social
Sciences. Retrieved May 13, 2016, from,
http://www.encyclopedia.com/topic/Copper_industry.aspx.
Cruse, H. (2006), Neural Networks as Cybernetic Systems, 2nd ed., Brains, Minds &
Media, Bielefeld.
Cuddington, J.T. and Jerrett, D., 2011. Business Cycle Effects on Metal and Oil Prices:
Understanding the Price Retreat of 2008-9.
Diaz, D., Theodoulidis, B. and Dupouy, C. (2016), “Modelling and forecasting interest
rates during stages of the economic cycle. A knowledge-discovery approach”,
Expert Systems with Applications, Vol. 44, pp. 245–264.
Fama, E.F. and French, K.R. (1988), “Business Cycles and the Behavior of Metals
Prices”, The Journal of Finance, Vol. 43 No. 5, pp. 1075–1093.
Fisher, F.M., Cootner, P.H. and Baily, M.N. (1972), “An econometric model of the world
copper industry”, The Bell Journal of Economics and Management Science, pp.
568–609.
Hssina, B., Merbouha, A., Ezzikouri, H. and Erritali, M., 2014. A comparative study of
decision tree ID3 and C4.5. International Journal of Advanced Computer Science
and Applications, 4(2).
Hyndman, R.J. and Athanasopoulos, G., 2014. Forecasting: principles and practice.
OTexts.
IBM (2013) IBM SPSS Modeler 16 Modeling Nodes
International Copper Study Group (ICSG), 2016. The World Copper Factbook 2015,
http://www.icsg.org/index.php/component/jdownloads/viewdownload/170/2092.
Ján, Š. and Katarina, H. (2010), “The Implementation of Hybrid ARIMA-Neural
Network Prediction Model for Aggregate Water Consumption Prediction”, Journal
of Applied Mathematics, Vol. 3 No. 3.
Kass, G.V., 1980. An exploratory technique for investigating large quantities of
categorical data. Applied statistics, pp.119-127.
Kavitha, C. and Iyakutti, K., 2014. Optimized Anomaly based Risk Reduction using
PCA based Genetic Classifier. Global Journal of Computer Science and
Technology, 14(7).
Khoonsari, P.E. and Motie, A., 2012. A comparison of efficiency and robustness of ID3
and C4.5 algorithms using dynamic test and training data sets. International Journal
of Machine Learning and Computing, 2(5), p.540.
Kriechbaumer, T., Angus, A., Parsons, D. and Casado, M.R., 2014. An improved
wavelet–ARIMA approach for forecasting metal prices. Resources Policy, 39,
pp.32-41.
Lai, R.K., Fan, C.Y., Huang, W.H. and Chang, P.C., 2009. Evolving and clustering fuzzy
decision tree for financial time series data forecasting. Expert Systems with
Applications, 36(2), pp.3761-3773.
Lasheras, F.S., de Cos Juez, F.J., Sánchez, A.S., Krzemień, A. and Fernández, P.R.,
2015. Forecasting the COMEX copper spot price by means of neural networks and
ARIMA models. Resources Policy, 45, pp.37-43.
Maier, H.R. and Dandy, G.C. (2000), “Neural networks for the prediction and
forecasting of water resources variables. A review of modelling issues and
applications”, Environmental Modelling & Software, Vol. 15 No. 1, pp. 101–124.
Malliaris, A.G. and Malliaris, M. (2015), “What drives gold returns? A decision tree
analysis”, Finance Research Letters, Vol. 13, pp. 45–53.
Meller, P. and Simpasa, A.M. (2011), “Role of Copper in the Chilean & Zambian
Economies: Main Economic and Policy Issues”, GDN Working Paper Series,
Vol. 43.
Ongsritrakul, P. and Soonthornphisaj, N. (2003), Apply Decision Tree and Support
Vector Regression to Predict the Gold Price: Proceedings of the International Joint
Conference on Neural Networks 2003, Doubletree Hotel, Jantzen Beach, Portland,
Oregon, July 20-24, 2003 / co-sponsored by the International Neural Network
Society, the IEEE Neural Networks Society. Vol. 1, IEEE, Piscataway, N. J.
Piatetsky, G. (2014) KDnuggets Methodology Poll, [Online], Available:
http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-
methodology.html [10 Mar 2016].
Quinlan, J.R. (1990). Learning logical definitions from relations. Machine Learning, 5,
239–266. doi:10.1007/BF00117105.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, California:
Morgan Kaufmann.
Quinlan, R., 2004. Data mining tools See5 and C5.0.
‘Standard & Poor's 500 Index - S&P 500’ (2016) Investopedia. Available at:
http://www.investopedia.com/terms/s/sp500.asp (Accessed: 22 June 2016)
Wirth, R. and Hipp, J., 2000, April. CRISP-DM: Towards a standard process model for
data mining. In Proceedings of the 4th international conference on the practical
applications of knowledge discovery and data mining (pp. 29-39).
Zhang, G. (2003), “Time series forecasting using a hybrid ARIMA and neural network
model”, Neurocomputing, Vol. 50, pp. 159–175.
Appendix
Table A.1 Dataset Variables Showing Statistical Characterisation (First Variable is Target)
SN Variable Description Mean Median Std. Dev Skew Kurtosis Min Max Missing Target Corr.
1 LOCADY Comdty LME COPPER SPOT ($) OFF 120.27 84.64 90.72 2.01 2.97 45.65 447.59 0 1
2 USCRWTIC Index
WTI Cushing Crude Oil Spot Px 0.35 1.03 8.51 -0.41 3.29 -39.50 39.09 161 0.08
3 spgsci index S&P GSCI Index Spot CME 0.37 0.58 4.81 -0.43 2.95 -26.13 17.35 1 0.08
4 spgsin index S&P GSCI Industrial Metals Index Spt 0.32 0.50 5.53 -0.56 3.78 -28.03 19.10 85 0.07
5 MEPRCUPR Index
FOB Copper Pricing Usd per lb 0.27 2.79 9.92 -1.43 3.05 -34.90 16.56 457 0.15
6 spgsic Index S&P GSCI Copper Inx Spot 0.41 0.53 6.35 -0.63 4.62 -35.80 22.50 85 0.09
7 spgsintr index S&P GSCI Ind Met Tot Ret 0.66 0.76 5.73 -0.28 3.73 -27.73 22.25 85 0.07
8 SPGSICP Index S&P GSCI Copper Exc Tot 0.46 0.53 6.56 -0.42 4.10 -35.69 23.08 85 0.12
9 SPGSPMP Index S&P GSCI PREC METAL ER 0.19 -0.22 5.63 1.12 9.18 -23.73 38.74 37 0.09
10 SPGSAMP Index
S&P GSCI All Metals Index Excess Return 0.52 1.38 4.36 -0.63 -0.59 -8.51 6.76 479 0.29
11 SPGSICTR Index S&P GSCI Copper Tot Ret 0.92 1.10 6.53 -0.43 4.25 -35.60 23.45 85 0.10
12 SPGSESP Index S&P GSCI Enhanc ER 0.71 1.51 5.29 -1.23 3.58 -25.79 11.96 301 0.02
13 SPGSHG Index S&P GSCI North American Copper Index Spot 0.48 0.94 6.97 -0.73 4.76 -35.87 22.80 301 0.10
14 SPGSAM Index S&P GSCI All Metals Index Spot 0.68 1.55 4.35 -0.61 -0.61 -8.25 6.91 479 0.28
15 SPGSAP Index
S&P GSCI All Metals Capped Commodity 35/20 Index Spot 0.94 2.03 4.96 -0.68 -0.70 -9.54 7.66 479 0.25
16 LSCA Index LME COPPER TOTAL 288785.58 204775.00 220004.19 1.01 0.24 11663.00 965427.00 0 -0.09
17 COMXCOPR Comdty
Comex Copper Inventory Data 0.31 -0.07 23.46 0.78 5.92 -90.47 132.25 273 -0.03
18 SHFCCOPD index
Shanghai Futures Exchange Copper Deliverable Stocks 0.50 0.11 26.16 0.25 0.12 -64.75 69.15 397 0.01
19 lfca Index LME CNCL WRNT COPPER TOT 23441.12 15782.00 21573.94 1.91 4.38 914.00 125044.00 333 -0.23
20 CNIVCOPP Index
China Import Commodity Unwrought Copper & Copper Products 242.39 223.86 85.03 0.98 0.54 73.77 508.94 372 0.45
21 CNIVCOPA Index
China Import Commodity Volume - Unwrought Copper & Copper Alloy 0.49 2.71 21.87 -0.70 0.92 -76.85 42.07 418 -0.06
22 CNIVCOPR Index
China Import Commodity Copper Products 82.11 83.72 14.99 -0.58 0.30 30.88 109.26 384 -0.19
23 CHIVCORE Index
China Import Commodity Value - Copper Ores & Concentrates 2.47 0.93 23.55 -0.07 0.68 -76.46 70.82 420 0.04
24 CHIVCOPR Index
China Import Commodity Value - Copper Products 540.73 592.49 148.73 -0.46 -0.96 199.48 799.67 414 0.87
25 CHIVSCPR Index
China Import Commodity Value - Scrap Copper 2.36 1.55 25.95 0.03 1.11 -78.36 84.64 418 0.00
26 CNMDCCCD Index
Implied % of China Construction Copper Demand 99.86 99.86 0.03 -0.13 0.02 99.79 99.92 447 -0.63
27 CNMDCRCY Index
YTD China Refined Copper Apparent Consumption mt 549433.00 586140.00 139795.62 -0.21 -0.75 291710.91 859489.00 449 -0.01
28 MEPRMCAS Index
Global Mined Copper Production - Asia 220839.63 220836.00 25328.95 -0.22 0.16 148894.00 276330.00 387 0.37
29 MEPRMCME Index
Global Mined Copper Production - Middle East 3.25 0.00 21.15 6.01 43.37 -38.44 172.38 388 0.15
30 MEPRMCNA Index
Global Mined Copper Production - North America 173274.17 173216.00 12156.25 -0.35 -0.24 139116.00 196851.00 387 0.07
31 MEPRMCSA Index
Global Mined Copper Production - South & Central America 558747.36 565338.50 52318.34 -0.22 -0.24 423371.00 677839.00 387 0.52
32 MEPRRCAF Index
Global Refined Copper Production - Africa 0.91 0.93 8.34 0.08 0.45 -21.41 28.22 388 -0.02
33 MEPRRCAS Index
Global Refined Copper Production - Asia 0.57 0.68 3.46 -0.04 1.76 -10.53 11.00 388 0.01
34 MEPRRCEU Index
Global Refined Copper Production - Europe 0.09 -0.16 2.13 0.38 0.75 -5.72 6.42 388 0.07
35 MEPRRCME Index
Global Refined Copper Production - Middle East -0.17 0.00 20.08 -0.43 15.16 -112.74 94.29 388 0.05
36 MEPRRCNA Index
Global Refined Copper Production - North America -0.25 0.22 5.82 -0.16 0.12 -17.58 13.56 388 0.03
37 MEPRRCOC Index
Global Refined Copper Production - Oceania 39070.62 39667.00 4570.40 -0.68 1.52 24333.00 48000.00 387 -0.34
38 MEPRRCSA Index
Global Refined Copper Production - South & Central America 0.04 -1.08 6.40 0.20 -0.32 -14.71 15.47 388 0.00
39 MHMCWC Index
Mongolian Production of Major Commodities Copper with Concentrate 47.26 43.60 39.24 10.63 115.13 29.50 471.00 385 0.07
40 SAMPCPPM Index
South Africa Mining Production Volume Index 2005=100 Copper NSA MoM 1.92 -1.19 22.91 2.32 12.24 -59.44 171.21 121 -0.03
41 ZMCMCPPM Index
Zambia Copper production MoM 1.77 -1.63 12.66 0.79 0.38 -23.99 35.28 457 -0.08
42 MEPRCDAF Index
Global Refined Copper Demand - Africa 17798.19 17021.50 6142.36 0.80 0.31 5330.00 35438.00 387 0.51
43 MEPRCDAS Index
Global Refined Copper Demand - Asia 0.59 0.34 7.84 0.77 3.09 -16.74 37.11 388 -0.02
44 MEPRCDEU Index
Global Refined Copper Demand - Europe -0.14 -1.31 11.88 0.31 -0.02 -24.43 35.05 388 0.00
45 MEPRCDNA Index
Global Refined Copper Demand - North America -0.34 -1.36 9.18 0.58 0.24 -17.82 27.88 388 0.00
46 MEPRCDOC Index
Global Refined Copper Demand - Oceania 12685.01 13100.00 4719.73 0.03 0.14 918.00 25925.00 387 -0.24
47 MEPRCDSA Index
Global Refined Copper Demand - South & Central America 46193.16 46323.00 6679.45 -0.16 0.50 26999.00 65619.00 387 0.65
48 MEPRCOCI Index
Known Copper Ore & Concentrate Inventories 134180.80 134522.50 23125.93 0.00 -0.60 76306.00 184598.00 387 -0.50
49 SPWIICP Index S&P World Commodity Copper - Grade A Index ER 1.09 1.87 7.54 -0.91 4.92 -35.94 23.48 360 0.07
50 SPWIICTR Index
S&P World Commodity Copper - Grade A Index TR 1.28 1.98 7.53 -0.91 5.04 -35.85 23.86 360 0.07
51 SPWIIC Index S&P World Commodity Copper - Grade A Index 1.01 2.04 7.43 -0.95 5.20 -36.04 22.90 360 0.06
52 ZMCMCOPP Index Zambia Copper Prices 324.57 342.75 82.51 -0.75 -0.23 139.05 447.59 456 1.00
53 MEPRCCOW Index
Inventory Statistics- COMEX Copper st - On Warrants - % of Total Inventory 19560.37 97.54 130577.01 6.78 46.00 85.41 895497.56 459 0.02
54 MEPRCCCW Index
Inventory Statistics-COMEX Copper st - Cancelled Warrants - % of Total Inventory 10086.25 2.81 67628.21 6.78 46.00 0.02 463750.10 459 0.02
55 CNIVCORE Index
China Import Commodity Volume - Copper Ore & Concentrate 0.83 2.87 23.48 -0.42 1.31 -74.72 63.60 421 -0.03
56 CNMDCRCA Index
China Refined Copper Apparent Consumption mt 549433.00 586140.00 139795.62 -0.21 -0.75 291710.91 859489.00 449 -0.01
57 MEPRCICW Index
LME Copper Inventories mt - On/Cancelled Warrants 2061.47 2116.02 182.33 -0.68 -0.84 1710.69 2270.59 495 0.31
58 MEPRCUCW Index
Inventory Statistics- LME Copper mt - Cancelled Warrants - % of Total Inventory 6.91 6.51 4.62 1.15 1.48 0.33 21.45 432 0.12
59 MEPRCUOW Index
Inventory Statistics- LME Copper mt - On Warrants - % of Total Inventory 93.09 93.49 4.62 -1.15 1.48 78.55 99.67 432 -0.12
60 MEPRCTOI Index
LME Copper Total Open Interest Number of Contracts 666745.43 658012.77 53580.25 0.67 0.00 573964.47 794079.85 459 0.16
61 MEPRCUWP Index
Copper Wire Pricing USd per mt 36677.64 38598.84 8006.53 -0.82 -0.20 16995.76 46641.64 459 0.91
62 CEI1CNCL Index
CFTC CEI High-Grade Copper Non-Commercial Long Contracts/Futures Only 20321.47 15781.00 12962.72 1.13 0.72 1095.00 63843.00 276 0.41
63 CEI1CNCS Index
CFTC CEI High-Grade Copper Non-Commercial Short Contracts/Futures Only 15815.44 14455.00 10290.18 0.40 -1.05 930.00 39368.00 276 0.52
64 CEI1CCOL Index
CFTC CEI High-Grade Copper Commercial Long Contracts/Futures Only 0.42 0.47 14.57 0.09 3.14 -64.37 50.32 277 0.00
65 CEI1CTLL Index
CFTC CEI High-Grade Copper Total Long Contracts/Futures Only 0.62 0.92 10.25 0.27 0.80 -26.43 43.56 277 -0.03
66 vix index Chicago Board Options Exchange SPX Volatility Index 20.58 19.26 7.98 1.81 5.59 10.82 62.64 240 -0.01
67 BDIY Index Baltic Dry Index 2118.73 1471.00 1791.67 2.64 7.65 572.00 10844.00 180 0.60
68 CRB RIND Index
Commodity Research Bureau BLS/US Spot Raw Industrials 0.17 0.08 2.45 -0.64 4.14 -13.30 7.23 137 0.13
69 CPURNSA Index
US CPI Urban Consumers NSA 0.36 0.32 0.37 -0.17 3.65 -1.93 1.79 1 -0.13
70 EUITEMU Index
Eurostat Industrial Production Eurozone Industry Ex Construction SA 0.10 0.12 1.08 -0.53 1.86 -4.15 3.34 181 0.02
71 IP Index US Industrial Production 2007=100 SA 0.19 0.24 0.76 -1.14 4.94 -4.21 2.38 1 -0.06
72 CHVAIOY Index
China Value Added of Industry YoY 13.48 13.50 5.25 -0.92 6.90 -21.10 29.40 241 0.20
73 EUR CURNCY EURUSD Spot Exchange Rate - Price of 1 EUR in USD -0.01 -0.11 2.52 -0.01 0.14 -7.96 6.67 61 0.03
74 USTWBROA INDEX
US Trade Weighted Broad Dollar January 1997=100 0.23 0.24 1.32 0.21 1.11 -4.17 6.43 37 -0.14
75 SPX Index S&P 500 Index 0.53 0.77 3.79 -0.99 3.87 -22.81 11.35 1 -0.02
76 INDU Index Dow Jones Industrial Average 0.55 0.81 3.75 -0.82 2.70 -19.15 10.12 1 -0.01
77 US0003M Index BBA LIBOR USD 3 Month -0.85 -0.08 8.49 -1.64 10.69 -57.71 38.62 180 0.02
78 CLP Curncy USDCLP Spot Exchange Rate - Price of 1 USD in CLP 0.49 0.28 2.60 1.50 8.29 -7.49 16.29 177 -0.15
79 PEN Curncy USDPEN Spot Exchange Rate - Price of 1 USD in PEN 2.94 3.02 0.52 -0.78 -0.10 1.29 3.62 271 -0.03
80 CNFREXP$ Index China Export Trade 1.50 2.03 20.03 -2.14 10.67 -121.46 54.94 241 0.01
81 CNFRIMP$ Index China Import Trade 1.46 3.01 24.39 -1.95 9.92 -142.76 70.40 241 0.00
82 FDTR Index Federal Funds Target Rate US 5.95 5.50 3.63 0.76 1.10 0.25 20.00 12 -0.42
83 USGG10YR Index
US Generic Govt 10 Year Yield -0.28 -0.44 4.80 -0.99 9.22 -37.62 17.77 1 -0.06
84 1636659 Index
IMF Euro Area Industrial Production SA by Reporting Country 0.06 0.15 1.11 -0.98 2.34 -4.17 2.36 337 0.05
85 djushg Index
Dow Jones US Household Goods & Home Construction Index 0.55 0.83 4.07 -1.35 5.21 -21.97 11.35 265 -0.04
86 AAUKY US Equity Anglo American PLC 0.84 1.15 10.78 -0.82 2.13 -46.37 28.70 352 -0.09
87 BHP UN Equity BHP Billiton Ltd 0.92 0.80 8.88 -0.41 1.58 -39.73 26.57 208 -0.03
88 XSRAF US Equity Xstrata PLC 1.31 2.87 14.98 -1.71 5.30 -62.16 29.14 406 -0.18
89 RIO US Equity Rio Tinto PLC 0.74 0.57 10.31 -1.14 6.02 -61.99 27.33 245 -0.05
90 FCX US Equity Freeport-McMoRan Copper & Gold Inc 0.64 1.45 14.03 -0.80 2.68 -67.11 34.22 306 -0.05
91 SCO US Equity Southern Copper Corp 1.24 -0.34 12.38 -0.10 0.41 -37.16 34.80 312 -0.05
92 SBCRP Index Citigroup BIG Corporate 0.75 0.75 2.01 0.27 4.77 -7.75 11.48 120 -0.09
93 SBBIG Index Citigroup BIG Bond 0.71 0.74 1.60 0.71 5.36 -6.01 10.63 120 -0.08
94 SBWBL Index Citigroup WorldBIG Local Currency 0.39 0.49 0.82 -0.06 0.44 -2.11 3.25 348 -0.04
95 SBCI Index Citigroup BIG Industrial 0.76 0.84 1.98 0.08 5.27 -9.84 11.22 120 -0.07
96 SBGT Index Citigroup Treas Local Currency 0.68 0.65 1.61 0.34 2.27 -5.23 9.00 120 -0.06
97 SBEB13 Index Citigroup EuroBIG 1 to 3 Year 0.31 0.30 0.40 0.16 -0.31 -0.63 1.38 348 -0.11
98 SBWBINL Index
Citigroup WorldBIG Industrial Local Currency 0.51 0.64 1.38 -1.20 7.22 -7.58 4.76 348 -0.05
99 LMAHDS03 LME Comdty LME ALUMINUM 3MO ($) 0.13 -0.36 6.05 -0.16 1.13 -24.59 17.78 209 -0.03
100 LMAADS03 LME Comdty LME ALUM ALY 3MO ($) 0.29 0.22 5.59 -1.00 7.21 -35.40 19.52 273 -0.06
101 LA1 Comdty Generic 1st 'LA' Future 0.17 -0.36 5.82 0.06 0.42 -17.67 15.66 330 -0.03
102 LA3 Comdty Generic 3rd 'LA' Future 0.18 -0.15 5.67 0.00 0.53 -17.26 15.10 330 -0.03
103 LA6 Comdty Generic 6th 'LA' Future 0.20 -0.11 5.38 -0.03 0.88 -16.86 14.48 330 -0.03
104 LA12 Comdty Generic 12th 'LA' Future 0.22 -0.16 4.99 -0.09 1.34 -15.93 13.60 330 -0.01
105 US.MONEY.M2 FED Index
United States Money Supply M2 0.56 0.53 0.39 0.91 3.20 -0.40 2.74 2 -0.14
106 US.MONEY.M1 FED Index
United States Money Supply M1 0.47 0.45 0.74 1.65 12.17 -3.25 5.99 2 0.04
107 FEDL01 Index Federal Funds Effective Rate US -0.94 0.00 10.11 -3.31 27.27 -91.11 38.30 2 -0.08
108 US.CCONF CNFB Index
United States Consumer Confidence -0.08 -0.05 8.40 -0.26 5.77 -45.90 41.66 89 -0.05
109 PPI INDX Index US PPI By Processing Stage Finished Goods Total SA 0.32 0.29 0.61 -0.08 4.89 -3.08 3.46 2 0.05
110 US.HHSPNR BEA Index
United States Consumer Spending (Real) 0.23 0.23 0.39 0.38 4.17 -1.07 2.38 300 -0.22
111 NHSPSTOT Index
US New Privately Owned Housing Units Started by Structure Total SAAR -0.12 -0.17 7.84 -0.08 0.99 -30.67 25.67 2 -0.06
112 USGG5YR Index US Generic Govt 5 Year Yield -0.42 -0.47 7.64 -0.13 3.90 -39.01 31.35 2 -0.09
113 USGG2YR Index US Generic Govt 2 Year Yield -0.74 -0.42 10.52 -0.31 5.69 -57.74 53.77 77 -0.11
114 lp1 Comdty Generic 1st 'LP' Future 0.65 1.46 7.16 -0.77 4.97 -35.92 23.05 330 0.09
115 lp2 Comdty Generic 2nd 'LP' Future 0.70 1.34 7.11 -0.79 5.17 -36.11 22.94 331 0.09
116 lp3 Comdty Generic 3rd 'LP' Future 0.71 1.24 7.06 -0.79 5.16 -35.86 22.80 331 0.09
117 lp4 Comdty Generic 4th 'LP' Future 0.72 1.17 7.01 -0.79 5.19 -35.55 22.60 331 0.09
118 lp5 Comdty Generic 5th 'LP' Future 0.72 1.07 6.95 -0.81 5.26 -35.30 22.54 331 0.09
119 lp6 Comdty Generic 6th 'LP' Future 0.73 1.03 6.89 -0.82 5.30 -35.05 22.51 331 0.09
120 lp7 Comdty Generic 7th 'LP' Future 0.73 1.25 6.83 -0.85 5.35 -34.79 22.48 331 0.09
121 lp8 Comdty Generic 8th 'LP' Future 0.74 1.34 6.77 -0.86 5.40 -34.53 22.46 331 0.10
122 lp9 Comdty Generic 9th 'LP' Future 0.74 1.28 6.72 -0.88 5.44 -34.28 22.49 331 0.10
123 lp10 Comdty Generic 10th 'LP' Future 0.74 1.14 6.66 -0.89 5.49 -34.01 22.53 331 0.10
124 lp11 Comdty Generic 11th 'LP' Future 0.75 1.12 6.61 -0.91 5.53 -33.76 22.59 331 0.10
125 lp12 Comdty Generic 12th 'LP' Future 0.75 1.13 6.57 -0.92 5.58 -33.51 22.65 331 0.10
126 lp13 Comdty Generic 13th 'LP' Future 0.75 1.09 6.52 -0.93 5.62 -33.26 22.74 331 0.10
127 lp14 Comdty Generic 14th 'LP' Future 0.76 1.09 6.48 -0.94 5.65 -33.02 22.84 331 0.11
128 lp15 Comdty Generic 15th 'LP' Future 0.77 1.09 6.47 -0.94 5.62 -32.78 22.94 332 0.11
129 lp16 Comdty Generic 16th 'LP' Future 0.76 1.16 6.43 -0.95 5.56 -32.55 23.05 331 0.11
130 lp17 Comdty Generic 17th 'LP' Future 0.76 1.04 6.37 -0.95 5.72 -32.32 23.18 331 0.11
131 lp18 Comdty Generic 18th 'LP' Future 0.77 1.03 6.34 -0.95 5.75 -32.09 23.31 331 0.11
132 lp19 Comdty Generic 19th 'LP' Future 0.98 1.59 6.48 -1.03 5.62 -31.86 23.44 344 0.09
133 lp20 Comdty Generic 20th 'LP' Future 0.98 1.56 6.45 -1.04 5.66 -31.63 23.57 344 0.09
134 lp21 Comdty Generic 21st 'LP' Future 0.98 1.53 6.42 -1.03 5.69 -31.40 23.74 344 0.09
135 lp22 Comdty Generic 22nd 'LP' Future 0.98 1.49 6.39 -1.03 5.72 -31.17 23.91 344 0.09
136 lp23 Comdty Generic 23rd 'LP' Future 0.98 1.46 6.36 -1.03 5.75 -30.95 24.07 344 0.10
137 lp24 Comdty Generic 24th 'LP' Future 0.98 1.54 6.34 -1.03 5.77 -30.73 24.21 344 0.10
138 lp30 Comdty Generic 30th 'LP' Future 1.45 2.65 7.07 -1.18 4.88 -29.42 25.09 393 0.04
139 lp40 Comdty Generic 40th 'LP' Future 1.43 2.20 6.85 -1.16 5.18 -27.33 26.03 393 0.06
140 lp50 Comdty Generic 50th 'LP' Future 1.41 1.96 6.72 -1.16 5.39 -25.47 26.31 393 0.08
141 lp60 Comdty Generic 60th 'LP' Future 1.39 1.83 6.67 -1.17 5.41 -26.46 25.86 393 0.10
142 lp70 Comdty Generic 70th 'LP' Future 0.57 1.88 7.22 -1.27 1.42 -19.21 9.88 465 0.23
Figure A.1: SPSS Modeler Stream Design (CHAID Decision Tree, Scenario 3)
Figure A.2: SPSS Modeler Stream Design (ARIMA)