Forecasting Copper Spot Prices:
A Knowledge-Discovery Approach
A DISSERTATION SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE
DEGREE OF MASTER OF SCIENCE
IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES
2016
By
Adegbenga Olayiwola
School of Computer Science
Table of Contents
Abstract ...................................................................................................................................... 6
Declaration ................................................................................................................................. 7
Intellectual Property Statement .................................................................................................. 8
Acknowledgements .................................................................................................................... 9
Dedication ................................................................................................................................ 10
Preface...................................................................................................................................... 10
Chapter 1 Introduction ........................................................................................................ 11
1.1. Motivation ................................................................................................................. 13
1.2. Aims and Objectives ................................................................................................. 13
1.3. Outline of the Report ................................................................................................. 14
Chapter 2 Background and Related Work .......................................................................... 15
2.1. Time Series ................................................................................................................ 17
2.1.1. Trend, Seasonality.............................................................................................. 17
2.1.2. Stationarity ......................................................................................................... 18
2.1.3. Differencing ....................................................................................................... 19
2.2. Models ....................................................................................................................... 20
2.2.1. Econometric Model – ARIMA .......................................................................... 20
2.2.2. Data Mining Model – Decision Trees ................................................................ 24
2.3. Previous Work ........................................................................................................... 30
Chapter 3 Research Design and Methodology ................................................................... 34
3.1. CRISP-DM Methodology ......................................................................................... 34
3.1.1. Business Understanding ..................................................................................... 34
3.1.2. Data Understanding ........................................................................................... 35
3.1.3. Data Preparation................................................................................................. 35
3.1.4. Modelling ........................................................................................................... 35
3.1.5. Evaluation .......................................................................................................... 35
3.1.6. Deployment ........................................................................................................ 36
3.2. Project Evaluation ..................................................................................................... 36
Chapter 4 Data Analysis and Modelling .................................................................................. 38
4.1. Data Understanding ................................................................................................... 38
4.1.1. Distribution and Statistical Characteristics ........................................................ 38
4.1.2. Correlation Analysis .......................................................................................... 43
4.2. Data Preparation ........................................................................................................ 43
4.2.1. Transformation ................................................................................................... 43
4.2.2. Standardisation ................................................................................................... 43
4.3. Decision Tree Modelling ........................................................................................... 44
4.3.1. Run Set 1 – Log Difference with CHAID ......................................................... 44
4.3.2. Run Set 2 – Price Movement with C5.0 ............................................................ 47
4.3.3. Run Set 3 – Price Change Rate with CHAID .................................................... 47
4.4. ARIMA (Time Series) Modelling ............................................................................. 47
Chapter 5 Evaluation and Results ....................................................................................... 49
5.1. Decision Trees ........................................................................................................... 49
5.1.1. Log Difference Modelling ................................................................................. 49
5.1.2. Price Movement Modelling ............................................................................... 61
5.1.3. Price Change Rate Modelling ............................................................................ 65
5.2. ARIMA...................................................................................................................... 66
5.3. Models Comparison .................................................................................................. 69
5.4. Deployment ............................................................................................................... 70
Chapter 6 Conclusions and Future Works ............................................................................... 71
6.1. Conclusions ............................................................................................................... 71
6.2. Recommendations for Future Work .......................................................................... 72
List of References .................................................................................................................... 73
Appendix .................................................................................................................................. 76
Word Count: 18,801
List of Figures
Figure 2.1: Pie Chart showing the relative use of copper in industrial sectors (Crowson, 2008;
Black, 1995) ..................................................................................................................... 16
Figure 2.2: Global Demand for Copper over five decades (Crowson, 2008) .......................... 16
Figure 2.3 Time series decomposed (Bontempi, 2013) ........................................................... 18
Figure 2.4 Plot of white noise signal ....................................................................................... 19
Figure 2.5 Sample Time series, difference and ACF/PACF plots ........................................... 23
Figure 2.6 Decision tree for deciding whether to play tennis .................................................. 24
Figure 2.7 Entropy as a function of a binary valued distribution ............................................ 26
Figure 3.1: CRISP-DM Methodology Phases .......................................................................... 34
Figure 4.1 Composite Time Series plot of Copper Spot and Crude Oil Spot Prices ............... 38
Figure 5.1: Predictor variables in ranked order of importance (Scenario 1ii) ......................... 53
Figure 5.2: Predictor variables in ranked order of importance (Scenario 3) ........................... 54
Figure 5.3: Predictor variables in ranked order of importance (Scenario 3 Improved) ........... 55
Figure 5.4: Predictor variables in ranked order of importance (Scenario 4) ........................... 56
Figure 5.5a CHAID Left Tree Subsection ............................................................................... 58
Figure 5.5b CHAID Middle Tree Subsection .......................................................................... 59
Figure 5.5c CHAID Right Tree Subsection ............................................................................. 60
Figure 5.6: Predictor variables in ranked order of importance (Price Movement Target –
Scenario 2) ........................................................................................................................ 63
Figure 5.7: C5.0 Decision Tree and Ruleset (Directional Forecasting) ................................... 64
Figure 5.8: Predictor variables in ranked order of importance (Price Change Rate Target) ... 66
Figure 5.9: Time series plot of original dataset showing variance and trend. ........................ 67
Figure 5.10: Time series plot of transformed dataset showing reduced variance. ................. 68
Figure 5.11: Time series plot of differenced transformed dataset eliminating trend to make
stationary. ......................................................................................................................... 68
Figure A.1: SPSS Modeler Stream Design (CHAID Decision Tree, Scenario 3) ................... 80
Figure A.2: SPSS Modeler Stream Design (ARIMA) ............................................................. 80
List of Tables
Table 2.1: Copper Properties and Uses .................................................................................... 15
Table 2.2 Standard AR(1) models ........................................................................................... 20
Table 2.3: Decision Tree Algorithms....................................................................................... 29
Table 2.4 Summary of Previous Work on Prediction of Metals Prices and other Time Series
Quantities .......................................................................................................................... 32
Table 4.1: LME Copper Spot Price Month-on-Month Movement .......................................... 47
Table 4.2: Expert Modeler with Constant and Transformation Option Value Runs ............... 48
Table 5.1: Trained Model Variables under Scenario 1 ............................................................ 49
Table 5.2: Trained Model Variables under Scenario 2 ............................................................ 49
Table 5.3 Trained Model Variables under Scenario 3 (Entire Dataset) ................................... 50
Table 5.4 Trained Model Variables under Scenario 4 ............................................................. 50
Table 5.5 Evaluation of Tree Models Built under Various Conditions/Combinations............ 51
Table 5.6: C5.0 Decision Tree Evaluation using Price (month-on-month) Direction Target.. 62
Table 5.7: Evaluation of Tree Models (month-on-month rate of change target) ..................... 65
Table 5.8: Expert Modeler Result: ARIMA(2,1,0)(1,0,1) with Constant and Transformation
options .............................................................................................................................. 66
Table 5.9: ARIMA Model Holdout (Out-of-Sample) Dataset Test Error Rates ...................... 69
Table A.1 Dataset Variables Showing Statistical Characterisation (First Variable is Target) 76
Abstract
The importance of copper as an industrial metal has grown with time due to increasing
technological applications. This has led to the metal being quoted on major commodities
exchanges, and the stakeholders interested in the price trend of the commodity have thus
transcended just the producing and consuming nations and industries to include investors. As
a result, there has been increasing interest in developing models for the prediction of the price
of the metal.
The autoregressive integrated moving average (ARIMA) model has traditionally been used
in forecasting time series quantities such as commodity prices. Recent research reflects
attempts to improve on the performance of ARIMA by instead using data mining techniques,
of which artificial neural networks (ANNs) have been the model of choice. However, because
ANNs are black-box models, no insight can be drawn from the results they produce.
Furthermore, there has been a lack of a clear methodological framework enabling a
systematic and standard approach to the analysis process.
This research work addresses the aforementioned gaps by presenting a knowledge-
discovery methodology applied to the development of (open-box) decision tree models for
forecasting copper spot prices, thereby revealing the prime predictor variables for the metal.
The accuracy of the decision tree model is also contrasted with that of a developed ARIMA
model.
Metal fundamentals as well as economic and financial variables selected as predictors by
the decision tree model include Chinese copper import levels, the volatility index (VIX), the
Baltic Dry Index and the Standard & Poor’s GSCI index, amongst others. With a root mean
square error (RMSE) of 18.65, the decision tree model performed far more accurately
than ARIMA.
Declaration
This dissertation is original work; any research material it contains is clearly
referenced. No portion of the work referred to in this dissertation has been submitted in
support of an application for another degree or qualification of this or any other university or
other institute of learning.
Intellectual Property Statement
i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he
has given The University of Manchester certain rights to use such Copyright,
including for administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright, Designs and
Patents Act 1988 (as amended) and regulations issued under it or, where appropriate,
in accordance with licensing agreements which the University has entered into. This
page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trademarks and other
intellectual property (the “Intellectual Property”) and any reproductions of copyright
works in the dissertation, for example graphs and tables (“Reproductions”), which
may be described in this dissertation, may not be owned by the author and may be
owned by third parties. Such Intellectual Property and Reproductions cannot and must
not be made available for use without the prior written permission of the owner(s) of
the relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University IP
Policy, in any relevant Dissertation restriction declarations deposited in the University
Library, and The University Library’s regulations.
Acknowledgements
A special thank you goes to my supervisor, Dr. Charalampos Theodoulidis (Babis) for
his mentorship and support throughout the entire dissertation research period.
Special thanks also must go to the School of Computer Science and Alliance Manchester
Business School, most especially my lecturers: Dr Sandra Sampaio, Prof. John Keane, Dr.
Gavin Brown, Prof. Christopher Holland, Dr. Daniel Dresner, Dr. Yu-wang Chen, Dr. Julia
Handl and Prof Peter Kawalek for providing me with the necessary foundations to execute
this research.
To my wonderfully supportive friends back home in Nigeria who through their
generosity have helped in rendering financial support through this period of my study I say a
big thank you. I would like to specially thank my Manchester families: the Idowus, the
Osinugas, the Onches and Chuks. You folks have been great pillars of support.
My wife Ayo, and the twins, Kemi and Femi, thank you for your patience, prayers and
support. And to my mother who always heeds my call, thank you so much. Special
appreciation goes to my dad for his constant prayers and support. My siblings who have
always rallied support and help, thank you so much.
To God who alone has been my One and All be all the glory and adoration.
Dedication
With profound gratitude and appreciation to the Almighty God...
In great love and admiration of my dear wife, Ayo ma Cherie…
For my lovely twins, Kemi and Femi…
To my wonderful Parents...
To my Siblings...
To my Friends and Colleagues...
Preface
The author has a B.Sc. degree in Computer Engineering from the Obafemi Awolowo
University Ile-Ife, Nigeria. With ten years’ working experience as a systems analyst in the
Nigeria National Petroleum Corporation, he has skills in software development as well as
database development and support using technologies including Microsoft .NET, SharePoint,
SQL Server, Oracle amongst others.
Chapter 1 Introduction
Copper is a non-ferrous, corrosion-resistant metal with antimicrobial properties as well
as very high thermal and electrical conductivity (second only to silver, which is far more
expensive). As a result of these qualities, copper is in very high demand and is one of the top
industrial metals, used in electronic/electrical applications, construction, medical and general
engineering (Anyadike, 2002). Considering the importance of these industries in the modern
world, “movements in copper prices can therefore be seen as an early indicator of global
economic performance” (Buncic and Moretto, 2015). Thus copper is one of the metal
commodities traded on the major commodities exchanges: the London Metal Exchange
(LME), the New York Commodity Exchange (COMEX) and the Shanghai Metal Exchange
(SHME) (Lasheras et al., 2015).
The ability to reliably forecast the future value of the metal therefore becomes very
valuable to investors, speculators and even more so to the world’s top exporter, Chile (Fisher
et al., 1972). Similar arguments can be made for China which is the top importing nation of
copper ores and concentrates as well as the global top producer and consumer of refined
copper (ICSG, 2016). Also, due to the intricacies of trading in copper whereby there is a time
lag between contracting, payment and delivery as well as storage and insurance
considerations (Lasheras et al., 2015), contracts are usually agreed upon based on a future
price figure.
Several methods have been used in the attempt at forecasting copper prices with mixed
results. This is due to the high volatility inherent in the price of the metal in the global
markets over any period (time series). ARIMA has been well known as a forecasting model
for time series quantities since the work of Box, Jenkins and Reinsel (1970). Using ARIMA
for copper price forecasting, which involves fitting the model to the time series data, has
limitations: it focuses only on the trend of prices over time and uses that to extrapolate into
the future, without considering the external factors (industrial, economic, financial, etc.) that
affect the fluctuation of prices. Also, since ARIMA is a linear model, it can only produce
approximations when modelling complex non-linear problems.
Other research, in considering these factors, has explored the use of data mining models
such as support vector machines (SVM) and neural networks (NN), as well as other analytical
tools like regression, Fourier transforms, etc. These have produced results with better
accuracy than ARIMA (Adebiyi, Adewumi and Ayo, 2014; Lasheras et al., 2015;
Kriechbaumer et al., 2014). In other cases, a hybrid approach has been adopted, combining
ARIMA with neural networks, as undertaken for example by Zhang (2003) as well as Jan and
Katarina (2010), which also produced results with better accuracy than using either approach
alone. In the case of Jan and Katarina (2010), for example, results showed ARIMA having a
mean absolute percentage error (MAPE) of 3.2%, compared to 2.4% for the ARIMA-NN
hybrid. However, because these are black-box models, the nature of the effect of the
predictor variables (factors) in determining prices is not known (Lai et al., 2009).
The use of the decision tree model as a forecasting tool does not suffer from the black-box
limitation mentioned above. Chang et al. (2011) and Lai et al. (2009) used clustering
techniques in conjunction with genetic algorithms to develop fuzzy decision trees as a
decision support tool for stock trading based on prices. The decision rules derived from the
tree revealed the nature of the effect that the factors considered have on the price of stocks
and, upon application to test data, produced results with superior performance (in terms of hit
rate) compared to other models (random walk, ARIMA, neural nets etc.). However, these
papers lack a clear methodological approach to the investigation that would allow the
process to be adapted for use in other domains. Also, the variable selection process
does not take into account the considerable effect of business (economic) cycles (Diaz et al.,
2016) on time series quantities such as stock prices, which primarily respond to global
economic and financial activity.
Of the many existing decision tree algorithms, the more commonly used include the
successive generations developed by Ross Quinlan: ID3, C4.5, C5.0 (Quinlan, 1990, 1993,
2004); CHAID (CHi-squared Automatic Interaction Detector) (Kass, 1980) and CART
(Classification And Regression Tree) (Breiman et al., 1984). C4.5 improves on ID3 with the
ability to handle missing data, support for differently weighted attributes, and pruning to
simplify the tree and improve generalisation (Hssina et al., 2014; Khoonsari and Motie,
2012). C5.0 further improves on C4.5 in terms of speed and memory efficiency. CHAID uses
multiway splits, which makes the resulting trees easier to read. This research work uses
CHAID and C5.0 (which, as implemented in IBM SPSS Modeler, can only handle
categorical data).
In terms of the use of decision trees in forecasting metal prices, Malliaris and
Malliaris (2015) used them in forecasting the direction of gold price movements. Their work
did put some focus on the effect of business cycles by clustering the dataset (into four groups
in and around the global recession of 2008) before applying decision tree models to each
cluster, producing varied results in terms of the effect of the predictor variables. However,
only six predictor variables were considered and no clear or standard methodology was used in the
analysis process, making it less generalizable to other domains. As far as the literature
review shows, however, decision trees have not yet been applied to forecasting copper prices.
In this project, the use of decision trees as forecasting tools for copper spot prices is
investigated using the CRISP-DM (CRoss Industry Standard Process for Data Mining)
methodology, considering relevant economic and financial predictor variables and thereby
revealing the nature of the impact these variables have on the price of copper over time. A
comparison with ARIMA is also carried out to determine the relative predictive accuracy of
the models. CRISP-DM, a widely accepted and used methodology in industry (Wirth and
Hipp, 2000), enables a structured approach to the analysis process, from understanding the
data through to evaluation of the results and deployment.
1.1. Motivation
The following considerations are worthy of note:

First, given the importance of copper today (which is only likely to continue in the
foreseeable future, based on the metal's current vast application areas in technology and
industry), the ability to reliably forecast its price, with a good understanding of the economic
and financial indicators that determine its value, will be of increasing benefit to stakeholders.

Second, the dynamic nature of global business calls for a standard and adaptable method
for determining the above as the business climate evolves.

These form the basis of motivation for this research work.
1.2. Aims and Objectives
The aim of this research work is to explore the use of the decision tree data mining model
as a forecast tool for copper spot prices.
The following specific research objectives have been identified:
To use the CRISP-DM methodology in developing a decision tree model for the
prediction of copper spot prices using time series data of LME monthly copper prices
from January 1970 to January 2012.
To determine the nature of the effect of relevant economic and financial predictor
variables on copper spot prices from the resultant decision tree.
To empirically evaluate the use of the decision tree contrasted against ARIMA in
forecasting copper spot prices in terms of prediction accuracy using the Root Mean
Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage
Error (MAPE) metrics.
1.3. Outline of the Report
The structure of this research work is outlined here. In this first chapter, an introduction to
the dissertation has been presented. Chapter 2 discusses the theoretical background and
related work, giving an account of the historical development of the importance of copper as
well as prior research into the prediction of copper prices and similar commodities or
time series quantities using various models and tools. The research design and
methodology are developed in Chapter 3. Model development and
implementation is the subject of Chapter 4. In Chapter 5, an evaluation of the models is
carried out and the results are presented and discussed. Lastly, Chapter 6 contains final
conclusions and recommendations for further work.
Chapter 2 Background and Related Work
Copper has played a critical role in human civilization from prehistoric times to the
present day. It is established that copper was the first metal to be mined and used for
tool making and other purposes. Due to its malleability and ductility, it is easily shaped,
drawn into wires or hammered into sheets for various applications. Table 2.1 below shows
some of the properties and uses of the metal.
Table 2.1: Copper Properties and Uses
1. A Natural Element: Relatively safe, non-radioactive production methods
2. Recyclable: Sustainable balance of demand and supply
3. Malleable & Ductile: Easily hammered into sheets and drawn into wires
4. Aesthetic: Used for making various ornamental items
5. A Family of Alloys: From bronze to brass to a host of modern alloys developed from
advances in material science
6. Antifouling: Inhibits the adhesion of marine life to surfaces, thereby preventing drag on
ship hulls, for example
7. Antimicrobial: Used in alloys to reduce germ transmission rates from frequently touched
surfaces like door knobs
8. Easily Shaped: Used for making bells and musical instruments like trumpets
9. Durable: Due to anti-corrosion properties, used extensively in piping
10. Conductive: Very high electrical and heat conductivity makes it the metal of choice in
electrical/electronic as well as various industrial applications
11. Easy to Join: By welding, soldering, brazing, bolting etc.; makes for an excellent choice
in piping and electrical distribution
In modern times, the excellent electrical and heat conductivity of copper, together with its
corrosion resistance, has made it an essential metal in industry, with applications in
electronics, wiring, building and construction, piping and plumbing. Figure 2.1 shows the
relative use of copper across these different industrial applications.
Figure 2.1: Pie Chart showing the relative use of copper in industrial sectors (Crowson, 2008;
Black, 1995)
Given the mostly electrical/electronic use of copper in cabling and wiring mentioned
above, demand for the metal has been rising as more and more people require housing,
transportation and so on. Figure 2.2 below shows the rising demand for copper over
the decades. According to Crowson (2008), this represents an average annual growth rate of
about 3.7%, and “world exports of refined copper metal accounted for 38 percent of
production, worth almost $23 billion, in 2005”. This clearly shows the importance of the
metal in global industry and economy.
Figure 2.2: Global Demand for Copper over five decades (Crowson, 2008)
[Figure 2.2 data — Global Copper Demand (million tonnes): 1960: 3.7; 1970: 6.8; 1980: 9;
1990: 10.9; 2000: 15.1; 2005: 16.5]
[Figure 2.1 data — sector shares: 48%, 18%, 12.5%, 10%, 11%; sectors: Electrical and
electronic products, wires & cables; Construction, piping; Transport; Industrial machinery;
Consumer products & others]
2.1. Time Series
In forecasting the price of copper, we start by looking at the history of copper prices over
time. The copper spot prices in the obtained dataset are monthly figures from 1970 to 2012,
which is essentially a univariate time series, since they form a single sequence of ordered
observations at equal discrete time intervals.
Univariate time series can be represented by the general model
s_t = v(t) + φ_t,   t = 1, …, T   (2.1)

where v(t) represents a deterministic part, φ_t is the residual term, and t represents each time
point in the total period T.
The signal component represents the (possible) trend and seasonality observed in the
distribution. Being deterministic, it can easily be characterised using some closely fitting
function. The residual term, however, is stochastic and can only be evaluated using
probabilistic methods. As such, the residual is the more difficult aspect to model and is the
focus of one of the analytical models for time series forecasting used in this research work. In
order to model residuals, some important properties of time series need to be adequately
managed; these are discussed next.
2.1.1. Trend, Seasonality
As mentioned earlier, the signal component of a time series can incorporate trend and
seasonality. Trend is a change in the mean value over the long term. Seasonality, on the other
hand, refers to periodic highs and lows observed in the series at roughly equal intervals or
seasons (e.g. monthly, annually). These traits make for an ‘unstable’ series which does not
lend itself easily to modelling. Figure 2.3 below illustrates these properties of a time series.
Figure 2.3 Time series decomposed (Bontempi, 2013)
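The additive structure in equation (2.1) can be illustrated with a small synthetic example (a sketch for intuition only; the series, numbers and use of numpy are illustrative assumptions, not the dissertation's LME data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120)  # e.g. 10 years of monthly observations

# Deterministic part v(t): a linear trend plus a 12-month seasonal cycle
trend = 0.5 * t
seasonal = 10.0 * np.sin(2.0 * np.pi * t / 12.0)
v = trend + seasonal

# Stochastic residual term (the hard-to-model part)
residual = rng.normal(0.0, 2.0, size=t.size)

# Observed series: s_t = v(t) + residual_t, as in equation (2.1)
s = v + residual
```

Subtracting the known v from s recovers the residual exactly here; in practice v is unknown and must itself be estimated, which is what decomposition methods do.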
2.1.2. Stationarity
A time series is said to be stationary when its joint distribution remains unchanged when
shifted in time: the distribution of observations is independent of any particular time origin.
As such, the identified properties of the series are independent of the time of observation.
This also implies that whichever subset of the series one observes, the plot looks generally
the same; the statistical properties of mean and variance are therefore constant. White noise
(illustrated in Figure 2.4), an example of a stationary series, has zero mean and constant
variance.
Figure 2.4 Plot of white noise signal
On the other hand, a time series with trend and/or seasonality is not stationary because
the distribution of the series depends on time as can be seen in Figure 2.3 above. One way of
making a time series stationary is called differencing.
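One crude way of seeing the distinction is to compare summary statistics over different subsets of a series: for a stationary series they should look alike. The sketch below does this on invented data; in practice a formal unit-root test (such as the augmented Dickey-Fuller test) would be used.

```python
# Crude stationarity check: compare the mean of the first and second
# halves of a series. Both series here are synthetic illustrations.
import numpy as np

rng = np.random.default_rng(1)
white_noise = rng.normal(0.0, 1.0, 800)                        # stationary
trending = 0.01 * np.arange(800) + rng.normal(0.0, 1.0, 800)   # not stationary

def half_means(y):
    """Mean of the first and second halves of the series."""
    mid = len(y) // 2
    return y[:mid].mean(), y[mid:].mean()

a, b = half_means(white_noise)   # both close to zero
c, d = half_means(trending)      # clearly different levels
```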
2.1.3. Differencing
Differencing is a transformation that can be applied to time-series data so as to make it
stationary. To do this, a fresh series is generated by taking the difference between consecutive
observations in the original. Thus if 𝑦𝑡 is a time series quantity where t = 1,2,…,T, then a first
degree differenced time series (or first lag) 𝑦𝑡′ is given by
yt′ = yt − yt−1 (2.2)
Differencing stabilises the mean of a time series thereby getting rid of the changes in the
level of the series and thus eliminating trend and seasonality. If 𝑦𝑡′ is also not stationary, the
process can be repeated again producing a second order differenced series 𝑦𝑡′′ (second lag)
yt′′ = yt′ − yt−1′
     = (yt − yt−1) − (yt−1 − yt−2)
     = yt − 2yt−1 + yt−2 (2.3)
Another transformation that can be applied is by taking logarithms. This helps to stabilise the
variance in the series. Having outlined some of the basic properties of time series as above,
the development of predictive models is discussed next.
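The differencing operations of equations 2.2 and 2.3, together with the log transform just mentioned, can be computed directly. A short numpy sketch on an invented series:

```python
# First and second differencing (equations 2.2 and 2.3) on a short
# illustrative series.
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])

first_diff = np.diff(y)          # y't  = y_t - y_{t-1}
second_diff = np.diff(y, n=2)    # y''t = y_t - 2*y_{t-1} + y_{t-2}

print(first_diff)    # [2. 3. 4. 5.]
print(second_diff)   # [1. 1. 1.]

# Taking logarithms before differencing helps stabilise the variance
log_diff = np.diff(np.log(y))
```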
2.2. Models
Two main classes of models developed as forecasting tools are discussed in this section.
2.2.1. Econometric Model – ARIMA
In mathematics or statistics, linear regression is one of the methods for deriving a
relationship function or line of best fit between a dependent variable y and independent
variable x such that in the case of simple linear regression, y can be expressed as
y = mx + c. (2.4)
And more generally or for multiple regression with 𝑛 independent variables,
y = m1x1 + m2x2 + ⋯ + mnxn + c (2.5)
With the above in mind, the building blocks of the ARIMA model are hence discussed.
Autoregressive Model
An autoregression model is one in which the independent (predictor) variables are made
up of past values of the (time series) target variable. “The term autoregression indicates that it
is a regression of the variable against itself” (Hyndman and Athanasopoulos, 2014).
Therefore an autoregressive model of order p, AR(p) can be expressed as
yt = c + ɸ1yt−1 + ɸ2yt−2 + ⋯ + ɸpyt−p + et (2.6)
where c is a constant and et is white noise.
As such the AR(p) model is basically a multiple regression but with lagged values of the
target, 𝑦𝑡, as predictors. Table 2.2 below shows some standard AR(1) models
Table 2.2 Standard AR(1) models
ɸ1 = 0:        yt is white noise
ɸ1 = 1, c = 0: yt is a random walk
ɸ1 = 1, c ≠ 0: yt is a random walk with drift
ɸ1 < 0:        yt oscillates between positive and negative values
Autoregressive models are best applied to stationary time series data, and the parameters also
need to be constrained to some certain values (Hyndman and Athanasopoulos, 2014).
For an AR(1) model: −1 < ɸ1 < 1.
For an AR(2) model: −1 < ɸ2 < 1, ɸ1 + ɸ2 < 1, ɸ2 − ɸ1 < 1.
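The special cases of Table 2.2 are easy to reproduce by simulation. The sketch below is illustrative only; the noise scale, series length and seed are arbitrary choices.

```python
# Simulating the AR(1) special cases of Table 2.2.
import numpy as np

def simulate_ar1(phi1, c, n=500, seed=0):
    """y_t = c + phi1 * y_{t-1} + e_t, with white-noise errors e_t."""
    rng = np.random.default_rng(seed)
    e = rng.normal(0.0, 1.0, n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = c + phi1 * y[t - 1] + e[t]
    return y

white_noise = simulate_ar1(phi1=0.0, c=0.0)   # y_t = e_t
random_walk = simulate_ar1(phi1=1.0, c=0.0)   # y_t = y_{t-1} + e_t
with_drift = simulate_ar1(phi1=1.0, c=0.5)    # random walk with drift
```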
Moving Average Model
The moving average model (MA), similarly to the AR, also builds some form of linear
regression on time series data but rather than regressing on past values of the target it uses the
past forecast error terms thus
yt = c + et + θ1et−1 + θ2et−2 + ⋯ + θqet−q, (2.7)
where et is white noise. This is referred to as an MA(q) model. From equation 2.7, yt can be
seen as a weighted (θ) moving average of prior forecast errors, hence the name of the model.
Similarly to the AR, the MA also has some value constraints placed on its parameters
when used:
For an MA(1) model: −1 < θ1 < 1.
For an MA(2) model: −1 < θ2 < 1, θ1 + θ2 > −1, θ1 − θ2 < 1.
ARIMA Model
When the autoregression and moving average models are combined and used on a
differenced time series data we have the non-seasonal ARIMA (AutoRegressive Integrated
Moving Average) model. The ARIMA model was popularised by the Box-Jenkins approach
(Box and Jenkins, 1970) to time series forecasting and is useful for both stationary and
non-stationary time series datasets.
When a forecast model is required for a time series dataset that is strongly seasonal, and
seasonality needs to be taken into account, additional seasonal terms are added to the non-
seasonal ARIMA model to make a seasonal ARIMA model. The seasonal ARIMA model
contains a separate set of autoregressive, difference and moving average terms to account for
seasonality in the data. The models are denoted in generalized form as follows:
𝐴𝑅𝐼𝑀𝐴(𝑝, 𝑑, 𝑞) - non-seasonal ARIMA
𝐴𝑅𝐼𝑀𝐴(𝑝, 𝑑, 𝑞)(𝑃, 𝐷, 𝑄) - seasonal ARIMA
where:
p is the order of the autoregression
d, the degree of differencing and
q, the order of the moving average
of the non-seasonal (part of the) model while
P, D and Q are similar terms as above for the seasonal part of the model.
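The mechanics of a non-seasonal ARIMA fit can be sketched in a few lines. In practice a library routine (for example statsmodels' ARIMA class) would be used; the sketch below instead fits an assumed ARIMA(1,1,0) by ordinary least squares on the first-differenced series, purely to make the "difference, then autoregress" idea concrete. The series is synthetic.

```python
# Sketch of an ARIMA(1,1,0) fit: difference once (d=1), then estimate
# an AR(1) on the differences by ordinary least squares. Illustrative
# only; a real analysis would use a dedicated ARIMA implementation.
import numpy as np

rng = np.random.default_rng(2)
# Synthetic non-stationary series: a random walk with drift
y = np.cumsum(0.3 + rng.normal(0.0, 1.0, 400))

d = np.diff(y)                                        # d=1 differencing
X = np.column_stack([np.ones(d.size - 1), d[:-1]])    # constant + lag-1 term
c, phi1 = np.linalg.lstsq(X, d[1:], rcond=None)[0]    # OLS estimates

# One-step forecast: predict the next difference, then undifference
next_diff = c + phi1 * d[-1]
forecast = y[-1] + next_diff
```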
ACF/PACF
An autocorrelation function (ACF) shows a plot of correlation between the time series
data and itself over different lags. That is, a plot of yt against yt-k for different values of k. The
partial autocorrelation (PACF) plot on the other hand, indicates the level of autocorrelation at
lag k that is not supported by lower-order autocorrelations. It shows the correspondence
between yt and yt-k after removing the effects of the intervening lags (that is, at 1,2,3,…,k-1).
In determining the values of p and q (P and Q), use is made of the ACF and PACF plots.
An ARIMA(p,d,0) model is inferred if the plots of the data (differenced in order to
make the series stationary) reveal a pattern where:
the ACF decays exponentially;
the PACF has a significant spike at lag p that cuts off sharply thereafter.
An ARIMA(0,d,q) model is inferred if the plots of the differenced data reveal a pattern
where:
the PACF decays exponentially;
the ACF has a significant spike at lag q that cuts off sharply thereafter.
Figure 2.5 below shows a sample set of plots where an ARIMA(0,1,1) has been inferred.
The first panel shows the plot of a time series with an upward trend. In the second
panel we see a plot of the series after taking a first difference. And finally the third panel
shows the ACF and PACF plots.
Figure 2.5 Sample Time series, difference and ACF/PACF plots
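The sample autocorrelations that underlie such plots are simple to compute by hand. Plotting libraries (e.g. statsmodels' plot_acf/plot_pacf) would normally produce the figures; the sketch below just computes ACF values for a synthetic MA(1) process, whose ACF should cut off sharply after lag 1.

```python
# Computing sample autocorrelations directly, on a synthetic MA(1)
# process (coefficient 0.8, chosen arbitrarily for illustration).
import numpy as np

def acf(y, max_lag):
    """Sample autocorrelation of y at lags 1..max_lag."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    denom = np.dot(y, y)
    return np.array([np.dot(y[:-k], y[k:]) / denom
                     for k in range(1, max_lag + 1)])

rng = np.random.default_rng(3)
e = rng.normal(0.0, 1.0, 2000)
y = e[1:] + 0.8 * e[:-1]          # MA(1): y_t = e_t + 0.8 * e_{t-1}
print(np.round(acf(y, 4), 2))     # large at lag 1, near zero thereafter
```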
2.2.2. Data Mining Model – Decision Trees
Decision trees are a non-parametric supervised learning method used for classification
(discrete valued target) and regression (continuous valued target) data mining or machine
learning tasks (Kavitha and Iyakutti, 2014). Given tuples of data with attribute and target
pairs, the decision tree algorithm produces a tree structure that enables the categorisation or
description of the dataset, determining the target value by simple hierarchical rules as the
tree is traversed from root to leaf across nodes. Each node represents a simple test based on
an attribute's values, a branch is a path based on the outcome of the test, and ultimately a
leaf represents the target class or value. Figure 2.6 shows a decision tree structure.
Figure 2.6 Decision tree for deciding whether to play tennis
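The root-to-leaf traversal can be written as nested rules. The sketch below encodes a play-tennis tree of the kind shown in Figure 2.6; the attribute names and values are assumed from the classic textbook example, not taken from this project's data.

```python
# A play-tennis decision tree as nested rules: each `if` is a node test,
# each return value is a leaf. Attribute values follow the classic example.
def play_tennis(outlook, humidity, wind):
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        return "no" if wind == "strong" else "yes"

print(play_tennis("sunny", "normal", "weak"))   # yes
```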
The decision tree has several advantages one of which has already been mentioned, that
of its being an open-box model. Others include:
Simple to understand and communicate
Can be combined into an ensemble (Random Forests)
Can be used as a pre-processor (feature selection) in combination with other classifiers.
Some drawbacks of decision trees and mitigation techniques are listed below:
Sensitivity to outliers and noise. This can be dealt with by adequate dataset preprocessing
to identify (as much as possible) cases of outliers.
Missing values. The use of surrogate splits can help overcome this challenge which is a
technique of identifying similar or suitable feature values which can be used for splitting
in place of the missing-value feature.
Based on the logic of the algorithm, decision trees determine and pick the most
discriminative attributes in splitting at each node of the tree. Tree growth is stopped when the
algorithm arrives at an optimal tree size based on maximum depth specification, minimum
node size or pruning. As a result, decision trees are also widely used for optimal attribute
selection as a pre-processing step to other machine learning methods, like neural networks,
which are designed to use all the attributes fed into them.
Decision tree algorithms use different criteria for determining attributes upon which to
split based on some tests and measures. These are discussed next.
Entropy
In information theory, entropy is a measure of the purity (or impurity) in a distribution of
examples. Consider the case of a binary target collection of examples T, which can have
either a positive (𝑝) or negative (𝑛) value, the entropy E of T is given by
E(T) = − P(p)log2P(p) − P(n)log2P(n) (2.8)
where P(x) is the proportion of examples of class x in the distribution.
Thus as seen in figure 2.7 below, entropy has its minimum value of 0 when P(𝑥) = 0,1
(when there is certainty) and a maximum of 1 when P(𝑥) = 0.5 (when there is maximum
uncertainty with an equal chance of positive or negative outcome).
Generally, for a distribution T with v possible values, the entropy of T is defined by
E(T) = − ∑i=1..v P(i)log2P(i) (2.9)
having a maximum value of log2 v.
Figure 2.7 Entropy as a function of a binary valued distribution
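Equations 2.8 and 2.9 translate directly into code. A minimal sketch, with the three values annotated against the behaviour shown in Figure 2.7:

```python
# Entropy of a distribution (equations 2.8 and 2.9), written directly
# from the definition.
import math

def entropy(proportions):
    """E(T) = -sum(p_i * log2(p_i)), skipping zero proportions."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 -- maximum uncertainty, binary case
print(entropy([1.0]))        # 0.0 -- complete certainty
print(entropy([0.25] * 4))   # 2.0 -- log2(4) for four equally likely values
```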
Information Gain
Information gain is one of the metrics used in determining the effectiveness of an
attribute as a choice for splitting when building a decision tree. It is the measure of reduction
in entropy after splitting on the attribute. As seen above, the more entropy, the more
uncertainty in the predictability of the outcome, and vice versa. Therefore the attribute that in
effect has the least entropy upon being used for splitting will have contributed the most
information (maximum information gain) in the effort to arrive at a leaf node. The entropy of
a split on attribute A is calculated by taking the sum of the entropies of each subset of the
collection having each possible value of A, weighted by the fraction of the whole's examples
that the subset contains, that is
∑v∈Values(A) (|Tv|/|T|) E(Tv) (2.10)
where Values(A) is the set containing all the possible values of attribute A,
Tv is the subset of collection T with attribute A value v,
|Tv|/|T| is the proportion of examples in the distribution T with attribute A value v, and
E(Tv) is the entropy of Tv.
From the above, the information gain on splitting on attribute A, IA, is then simply given by
IA = E(T) − ∑v∈Values(A) (|Tv|/|T|) E(Tv) (2.11)
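Equation 2.11 can be checked numerically. The sketch below uses the classic play-tennis counts (9 positive and 5 negative examples, split on the Outlook attribute) as an assumed worked example; these counts come from the textbook version of the dataset, not from this project's data.

```python
# Information gain of a split (equation 2.11), on the classic
# play-tennis counts for the Outlook attribute.
import math

def entropy(counts):
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """I_A = E(T) - sum(|T_v|/|T| * E(T_v)) over the subsets of the split."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

# 9 yes / 5 no overall; Outlook splits into sunny (2 yes, 3 no),
# overcast (4 yes, 0 no) and rain (3 yes, 2 no)
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain, 3))   # 0.247
```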
GINI Impurity
The Gini Index is another measure of impurity that can be used for attribute split
selection in decision trees. Using the same notation as equation (2.9) above, it is given by
G = 1 − ∑i=1..v P(i)² (2.12)
The Gini index of equation (2.12) has a maximum value of 1 − 1/v, reached when all v values
are equally likely. Similarly to entropy, the Gini index can be used in calculating information
gain.
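A minimal sketch of equation 2.12, on the same illustrative proportions used for entropy above:

```python
# Gini impurity (equation 2.12), written directly from the definition.
def gini(proportions):
    return 1.0 - sum(p * p for p in proportions)

print(gini([0.5, 0.5]))    # 0.5  -- maximum for a binary split
print(gini([1.0, 0.0]))    # 0.0  -- pure node
print(gini([0.25] * 4))    # 0.75 -- i.e. 1 - 1/4 for four equal classes
```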
Algorithms
Several decision tree algorithms have been developed over the years with some being
improvements over previous versions. Some of the more commonly used algorithms are
summarised in Table 2.3 indicating some main characteristics:
Other more advanced algorithm developments include ensemble methods which produce
multiple decision trees known as random forests which are then combined (and using voting
techniques for example) to produce a final target prediction/classification. Random forests
have the advantage of producing models that are more stable where only decision trees may
overfit training data and perform with worse generalisation error. However, because of the
multiple trees involved, random forests lose the transparency and ease of interpretation which
is the hallmark of decision trees.
In this research work, the C5.0 (for categorical targets) and CHAID decision tree
algorithms are used. Of the decision tree algorithms available in IBM SPSS Modeler 15 (the
analysis tool used), these two proved to be the most accurate. Below is the pseudocode of the
CHAID algorithm (as mentioned earlier, the C5.0 is proprietary):
CHAID Pseudocode
CHAID(predictors, target, alpha_to_merge, alpha_to_split)
    do
        splitting = false
        for each predictor in predictors
            if predictor is continuous then
                divide into equal-sized category bins
            else if predictor is categorical then
                take each unique value as a category
            end if
            do
                merging = false
                for each (category, nextcategory) pair in category_set
                    if target is continuous then
                        mergeTest = F_Test(category, nextcategory)
                    else if target is categorical then
                        mergeTest = ChiSquare_Test(category, nextcategory)
                    end if
                    if mergeTest >= alpha_to_merge then
                        merge(category, nextcategory)
                        merging = true
                        break
                    end if
                end for
            loop while merging = true
            PValue[predictor] = BonferroniP(category_set)
        end for
        if min(PValue) <= alpha_to_split then
            split on the predictor with minimum PValue
            splitting = true
        else
            designate node as terminal node (leaf)
        end if
    loop while splitting = true
Table 2.3: Decision Tree Algorithms
*C5.0 is implemented in proprietary commercial software.

| Name | Author | Split Determinant | Data Type | Split Levels | Other Features |
| CART (Classification And Regression Trees) | Leo Breiman | Least Squares Deviation, Gini Index | Continuous & Categorical Predictor, Categorical Target | Binary | |
| CHAID (CHi-squared Automatic Interaction Detection) | Gordon V. Kass | Chi-square test, F-Test, Bonferroni adjusted p-value | Continuous & Categorical (Continuous Predictor by Binning) | Multiway | |
| MARS (Multivariate Adaptive Regression Splines) | Jerome H. Friedman | Basis Function | Continuous & Categorical | Multiway | |
| ID3 (Iterative Dichotomiser 3) | Ross Quinlan | Entropy/Information Gain | Continuous (more time-consuming) & Categorical | Multiway | |
| C4.5 | Ross Quinlan | Entropy/Normalised Information Gain | Continuous & Categorical | Multiway | Improvement over ID3: Pruning, Varying Attribute Costs, Missing Data |
| C5.0 | Ross Quinlan | * | Continuous & Categorical | Multiway | Improvement over C4.5: Winnowing, Boosting, Speed |
2.3. Previous Work
The value of forecasting in planning and investment has been established. Previous work
in the literature on forecasting time series quantities is discussed below, in terms of copper
in particular and commodities in general.
Given the global importance of copper trade, many attempts have been made to forecast
copper prices in the literature. Since copper spot prices form a time series, the most used
forecast tool has been the ARIMA. This has been shown to be good enough in producing
forecasts especially in the long run (Lasheras et al., 2015). However, according to
Kriechbaumer et al (2014), “Normal ARIMA models were shown to be rather unsuitable for
predicting monthly base metal prices”.
Lasheras et al. (2015) also showed that Elman recurrent neural networks (RNN) produce
forecast results for copper spot prices with better error rates than ARIMA. A similar
finding was made by Adebiyi et al. (2014), who showed that artificial neural networks
performed better than ARIMA. Thus some evidence has been established as to the benefits of
data mining algorithms and models in producing forecast results with better accuracy and
variance.
Other researchers have also tried a hybrid approach where some data mining technique
has been combined with ARIMA in an attempt to increase the overall forecast accuracy as
seen in Zhang (2003) as well as Jan and Katarina (2010), where neural nets were combined
with ARIMA. Kriechbaumer et al. (2014), considering the cyclical behaviour of metal prices,
applied wavelet analysis and multiresolution analysis prior to ARIMA for much improved
forecast accuracy for copper, lead, zinc and aluminium.
This literature review reveals that it is mostly neural networks that have been considered
and investigated in this regard, as seen in the papers mentioned above as well as in Lai et al.
(2009), who also underscored the fact that neural networks “… do not provide an insight into
the nature of the interactions between technical indicators and … fluctuations” since they are
black-box models.
Decision trees were used by Ongsritrakul et al. (2003) in the prediction of gold prices, but
only for feature selection, the results of which were then fed to Support Vector Machine
(SVM), linear regression and neural net models. Similarly, Malliaris and Malliaris (2015) used decision
trees for gold price forecast but limited to just predicting the price movement direction (up or
down). Lai et al. (2009), on the other hand, used ID3, a decision tree algorithm, along with
case-based reasoning and weighted clustering as a decision support tool for stock price
decisions (buy/hold/sell). In terms of predicting actual values of a time series quantity, Diaz et al.
(2016), demonstrated the value of using decision trees as the tool of choice both for
prediction with reasonable accuracy as well as being able to realise the nature of the
relationship between economic variables and risk-free interest rates. Table 2.4 below is a
summary of these previous works researched.
As seen from the above, the use of the business cycle as a predictor variable in the literature
is sparse. Yet its role and importance in metal pricing has been established. Cuddington and
Jerrett (2011) demonstrated the significance of the business cycle as a determinant of metals
prices stating that “…metals and oil prices are much more responsive to cyclical than trend
movements in economic activity”. Fama and French (1988) also concluded that “the variation
of spot and forward prices for metals has a strong business-cycle component”.
Table 2.4 Summary of Previous Work on Prediction of Metals Prices and other Time Series Quantities

| SN | Author(s) | Year | Target | Model | Predictors |
| 1 | Lasheras et al. | 2015 | Copper spot price | ARIMA and neural networks | Copper spot price time series |
| 2 | Adebiyi et al. | 2014 | Stock price | ARIMA and neural networks | Stock price time series |
| 3 | Zhang | 2003 | Sunspot observations; Canadian lynx annual mortality; British Pound to US Dollar exchange rate | ARIMA and neural networks hybrid | Time series of quantities |
| 4 | Jan and Katarina | 2010 | Monthly water volume consumption | ARIMA and neural networks hybrid | Water consumption time series |
| 5 | Kriechbaumer | 2014 | Monthly price of aluminium, copper, lead and zinc | Wavelet analysis and ARIMA hybrid | Metal nominal price time series |
| 6 | Lai et al. | 2009 | Stock price | Fuzzy decision trees; genetic algorithms; k-means | Stock price time series; six days moving average (MA); six days bias (BIAS); six days relative strength index (RSI); nine days stochastic line (K,D); moving average convergence and divergence (MACD); 13 days psychological line (PSY); volume |
| 7 | Goss and Avser | 2013 | Copper spot and futures prices | Simultaneous rational expectations model of LME copper | Inventory; production volume; industrial production index; tin spot price; high grade zinc spot price |
| 8 | Malliaris and Malliaris | 2015 | Gold price movement direction | Decision tree | Cleveland Financial Stress Indicator; Cushing Oil; S&P 500; VIX; Euro to US Dollar exchange rate |
| 9 | Ongsritrakul | 2003 | Gold price | Support vector regression using decision tree for feature selection | South Africa Rand to US Dollar exchange rate; Australian Dollar to US Dollar rate; Canadian Dollar to US Dollar exchange rate; gold lease rate |
| 10 | Buncic and Moretto | 2015 | Copper monthly returns | Dynamic Model Averaging and Selection (DMA/DMS) framework | Excess demand; inventory; convenience yield; TED spread; Volatility Index (VIX); equity price of large resource based firms; Chilean Peso to US Dollar rate; Australian Dollar to US Dollar rate; US industrial production; US term spread; Baltic Dry index; broad S&P 500 index; gold price; crude oil price |
Chapter 3 Research Design and Methodology
The CRISP-DM methodology (Wirth and Hipp, 2000) will be used in this project for the
investigation and evaluation of the models on the dataset. It is a widely used methodology for
data mining tasks (Piatetsky, 2014) and highly compatible with data mining software such as
SPSS Modeler. The methodology involves six phases as illustrated in Fig. 3.1 below and
discussed thereafter.
Figure 3.1: CRISP-DM Methodology Phases
3.1. CRISP-DM Methodology
3.1.1. Business Understanding
This phase involves comprehension of the project aims and objectives from a business
perspective and then creating a data mining problem based on this understanding. This
essentially is summed up in the high value placed on the ability to reliably predict copper spot
prices as well as to understand how relevant economic variables affect the price fluctuations.
This capacity for forecasting with a reliable degree of accuracy is highly sought after by
stakeholders in the copper industry. Chile, the world’s largest producer of copper, depends on
the metal for almost half of its exports (Meller and Simpasa, 2011). Spilimbergo (in Buncic
and Moretto, 2015) also highlighted the strong dependence of the Chilean economy on
copper. Having a good model for predicting copper prices as well as being able to understand
the effect that various economic and financial variables have on the price of the metal will
thus be of invaluable use for planning and fiscal control of the economy for a country like
Chile. Similar arguments can be made for the world’s largest importer nation of copper:
China.
In the same vein, investors and traders in the various commodities exchanges globally
will find the decision tree model deliverable of this project highly useful in helping them
make informed trading decisions in order to maximize returns on their investment in the
metal.
3.1.2. Data Understanding
The data understanding phase requires the preliminary analysis and characterization of
the dataset with the goal of identifying data quality issues (outliers, missing values, etc.) and
interesting statistical properties and patterns. At this stage, the project dataset is appraised
with a view to determining the nature of the sources and context of the data as well as the
extent to which the dataset is sufficient for the purposes of the project investigation.
3.1.3. Data Preparation
This phase follows from the previous one to attempt to clean out the data quality
problems identified and organise the dataset (feature selection, transformation, etc.) into the
format to be fed into the model for analysis. This step is carried out iteratively on the dataset
to enhance the quality of data used for the modelling thereby improving the modelling
outcome.
3.1.4. Modelling
In this phase, the models to be used for analysis are applied to the dataset and the necessary
parameters are tuned to optimal values. Two models (decision tree and ARIMA) are used for
analysis of the dataset.
3.1.5. Evaluation
At this stage, model(s) have been developed and they are assessed to ensure that all
necessary business requirements were adequately taken into consideration in the construction
of the models. At this point, a decision is reached on whether the data mining results are good
enough to be used or not. This stage is critical because it involves the interpretation of
the modelling results.
3.1.6. Deployment
This is the phase where the model is implemented ‘in the field’. The end user or
customer should be able to effectively apply the model to fresh datasets and produce results
in a format that is clear and interpretable. For this project, the deployment phase involves an
understanding of the implications for the economy (global and individual countries) of the
effect of economic indices on copper price variations. The model produced can also be
utilised in the finance industry for copper price forecasting.
It is to be noted that, just as the diagram in Figure 3.1 depicts, the various phases are not
worked through in a simple waterfall fashion. Rather, each stage can involve multiple passes
and can lead to the revisiting of prior or other stages in order to achieve the ultimate business
objective of the data mining endeavour.
3.2. Project Evaluation
Considering that the forecasting of copper price is being examined as a process (which
can be adapted to similar domains) rather than a one-off, a methodology (CRISP-DM) is
being adopted for use in the execution of this project. This methodology will be critically
evaluated in terms of interpretability as well as accuracy as further discussed below.
The resultant decision tree model developed for forecasting will show the nature of the
effect that the various predictor variables have on the target: copper spot price. This openness
is a quality lacking in the ARIMA model. The relationships that exist between the variables
and copper prices can thus easily be interpreted from the results.
The forecast results of the models will be analysed and evaluated with a view to ranking
them by accuracy. This will be realised using the Root Mean Square Error (RMSE),
Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) metrics.
These are defined as follows:
RMSE = √( (1/n) ∑t=1..n (qt − pt)² )

MAE = (1/n) ∑t=1..n |qt − pt|

MAPE = (100/n) ∑t=1..n |qt − pt| / qt

where n = length of forecast period
qt = actual price at time t
pt = predicted price at time t
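The three metrics translate directly into code. A minimal sketch on invented actual/predicted values:

```python
# RMSE, MAE and MAPE computed directly from their definitions, on
# invented actual (q) and predicted (p) values.
import numpy as np

def rmse(q, p):
    return np.sqrt(np.mean((q - p) ** 2))

def mae(q, p):
    return np.mean(np.abs(q - p))

def mape(q, p):
    return np.mean(np.abs(q - p) / q) * 100.0

q = np.array([100.0, 110.0, 120.0])   # actual prices
p = np.array([98.0, 113.0, 118.0])    # predicted prices
print(rmse(q, p), mae(q, p), mape(q, p))
```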
Chapter 4 Data Analysis and Modelling
4.1. Data Understanding
An initial assay of the dataset is carried out in order to gain some insight into the nature,
distribution, statistical characteristics and correlation inherent in the data.
4.1.1. Distribution and Statistical Characteristics
The dataset used includes copper spot prices from the LME as obtained from Bloomberg
data services. Table A.1 in the appendix lists the economic variables (as contained in the
obtained dataset) considered for use in this project showing their summary statistics. In order
to gain some insight into the trend behaviour of the target and one of the main predictor
variables, a composite time series plot of copper prices and crude oil prices is made and
shown in Figure 4.1 below. It can be clearly seen for example that there is cyclicality
(business cycle) of periodic peaks and troughs in the plot of copper prices. Also, both
quantities have a similar trend in values and are affected in similar proportion by the 2008
global recession where we see a significant drop in prices.
Figure 4.1 Composite time series plot of copper spot prices ($/MT) and crude oil prices ($/barrel)
In part of the scenarios for analysis, twenty (20) of the 142 variables available in the
dataset have been chosen for decision tree learning and prediction of the target variable,
LME Copper Spot ($). In the dataset provided, all variables are presented at a
granularity of one month. The rationale for the inclusion of each variable in this data
analysis is as follows:
(1) Lagged Copper Returns
As noted by Buncic and Moretto (2015), “…there can be periods of momentum in asset
returns due to some market participants adopting a trend following trading strategy”.
This implies that a time series quantity like commodity prices, which tends to have an
inherent trend in its values over time, is likely to be recognised as such by traders, who
then speculate based on this knowledge. Thus the lagged copper return, calculated as the log
change of the monthly series, is also used as a predictor.
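The lagged log-return predictor can be computed as follows. The prices are invented for illustration, not actual LME figures.

```python
# Lagged copper return as a log change of the monthly series.
import numpy as np

monthly_price = np.array([5000.0, 5100.0, 4950.0, 5200.0])

log_return = np.diff(np.log(monthly_price))   # month-on-month log change
lagged_return = log_return[:-1]               # previous month's return, as predictor
target_return = log_return[1:]                # current return, aligned with its lag
```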
(2) WTI Cushing Crude Oil Spot Px
Considering that crude oil remains the primary source of energy for industry globally, the
price of crude is likely to affect, and serve as a good predictor of, the price of copper.
(3) S&P GSCI Copper Inx Spot
An index published by Standard & Poor’s which gives an indication of the investment
performance of the copper commodity market. Since copper is one of the main industrial
metals in the commodities market, this index is likely to have an effect on the price of the
metal as investors either rush to buy or shy away from it based on its perceived
performance in the market.
(4) LME COPPER TOTAL
The level of stock inventory of copper indicates its level of ready availability as well as
showing how much excess capacity exists. It therefore can be a factor in the
determination of the price of the metal based on the basic economics laws of demand and
supply.
(5) Global Refined Copper Production – World
This is a field derived from the aggregation of the global refined production figures of
the continents. Representing the supply of the metal, it is a factor that can determine the
price of copper based on the basic economics laws of demand and supply.
(6) Global Refined Copper Demand – World
Also derived from the aggregation of global refined demand figures of the continents,
this is included based on a similar rationale as #4 above.
(7) Known Copper Ore & Concentrate Inventories
In terms of the source material of the metal, the raw ore concentrate inventories also give
an indication of the level of source availability of the metal which in turn should affect
finished copper pricing.
(8) Chicago Board Options Exchange SPX Volatility Index
A key measure of market expectations of near-term volatility conveyed by S&P 500
stock index option prices, the VIX index gives a measure of investor sentiment and
market volatility. Thus it reveals the level of investor appetite for investment which in
turn affects commodities pricing.
(9) Baltic Dry Index
The index measures the demand for shipping capacity versus the supply of dry bulk
carriers for haulage of dry cargo like grain and metal ores. Since the supply of ships is
relatively inelastic due to the cost and time it takes to build one, the index becomes much
more sensitive to the level of demand for shipping of these dry raw materials. It therefore
is a leading economic indicator of future economic activity. Widely used in literature, the
BDI is a strong candidate for use as a predictor variable.
(10) US CPI Urban Consumers NSA
An index showing the change in price of a basket of goods and services purchased by
urban consumers, the CPI is effectively a measure of inflation in the US. This in turn has
an inverse effect on the buying power and demand for these goods and services by
consumers which can be used by producers to determine the level of production to target
in order to maximise sales and reduce waste or inventory. Thus the CPI is a good
candidate for an industrial metal price prediction.
(11) S&P 500 Index
The S&P 500 is an American stock market index based on the market capitalizations of
500 large select companies across several industries in the US economy having common
stock listed on the NYSE or NASDAQ. It is thus a good representation of the U.S. stock
market and a leading indicator of the general health of the U.S. economy (Investopedia,
2016).
(12) Dow Jones Industrial Average
This is an index that shows how 30 large publicly owned companies based in the US
have traded during a standard trading session in the stock market. It is a price-weighted
scaled average which also is computed to gauge the performance of the industrial sector
of the US economy.
Variables (11) and (12) are included as predictors considering that they are indicators of
the performance of the US stock exchange, which themselves are also representative of
“returns … [of] various other global stock markets” (Buncic and Moretto, 2015). Thus
they show the level of investor activity and the general performance of stock markets.
(13) USDCLP Spot Exchange Rate - Price of 1 USD in CLP
The exchange rate of the US Dollar to the Chilean Peso is included considering that
Chile is the world’s largest exporter of copper and the Chilean economy depends to a
very large extent on copper exports. Fluctuations in the currency exchange rate are
therefore, to a large extent, an indication of the level of copper exports and market
performance.
(14) BHP Billiton Ltd
Share Price of Anglo-Australian multinational mining, metals and petroleum company
and the world's largest mining company
(15) Rio Tinto PLC
Share Price of British-Australian multinational and one of the world’s largest metals and
mining corporations
(16) Freeport-McMoRan Copper & Gold Inc
Share Price of world's largest copper producing and mining company based in the US
Variables (14), (15) and (16) are included as predictors considering that they are
indicators of the performance of the world's main mining organisations. They are
widely used in the literature as predictor variables in data mining research involving
industrial metals.
(17) LME ALUMINUM 3MO ($)
Aluminium is a close substitute for copper in one of its highest application areas:
electrical wiring and electronics. Thus the price of aluminium in the LME commodities
market is considered for inclusion as a predictor.
(18) United States Money Supply M2
A measure of the money supply as published by the United States Federal
Reserve System. M2 (a broader definition of money that encompasses M1) is an
economic indicator of inflation. Like the CPI, this variable is a good predictor
candidate since it gives a measure of inflation.
(19) US Industrial Production 2007=100 SA
The Industrial Production Index is an economic indicator that measures real output for all
facilities located in the United States in manufacturing, mining, and electric and gas
utilities. It is thus a measure of economic activity. This variable is also used in the literature.
(20) Generic 1st 'LP' Future
A future is a contract to buy/sell a financial instrument or asset at a predetermined future
date and price. The Generic 1st ‘LP’ future is the contract price of copper against the next
month and is therefore the shortest-length futures contract. Being itself a form of forecast
of the next month's price of the metal, this variable is included among the predictors.
4.1.2. Correlation Analysis
The correlation of the variables to the target variable as seen in Table A.1 of the
Appendix shows the degree of association of each individual variable to copper spot prices.
The Peso/Dollar exchange rate, for example, is negatively correlated with the target: the
lower the price of copper, the more Pesos required to buy 1 dollar (and vice versa). This is to
be expected, as Chile’s currency, the Peso, is highly sensitive to the price of the metal,
copper being the country's main export item. Also worthy of note is that the S&P GSCI
Copper Inx Spot is almost perfectly positively correlated with the lagged copper prices,
which perhaps gives an indication of how the index is computed.
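Such a correlation table can be sketched with a pure-Python Pearson correlation; the `copper` and `fx_rate` series below are made-up illustrative values, not the dissertation's data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A currency that strengthens (fewer units per dollar) as copper rises
# shows up as a negative correlation, as observed for USDCLP.
copper = [100.0, 120.0, 90.0, 150.0, 130.0]    # illustrative prices
fx_rate = [700.0, 650.0, 730.0, 600.0, 640.0]  # illustrative pesos per USD
r = pearson_r(copper, fx_rate)
```

In practice this is done per column of the dataset against the copper spot price column, producing a table like Table A.1.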
4.2. Data Preparation
4.2.1. Transformation
The dataset consisting of the twenty predictor variables chosen as above had their actual
values transformed to log difference values using the following formula:
X_T = LN(X_t / X_{t-1}) × 100        (4.1)

where X_T = transformed value,
X_t = X value in month t, and
X_{t-1} = X value in month t-1.
This transformation was applied to all non-target variable fields. It is necessary in order
to avoid ‘discovering’ spurious correlations in the data: since the dataset is a time series,
most of the variables are non-stationary and trended. The only exception is the Volatility
Index field, whose value is computed to give an indication of investor appetite and does
not necessarily follow a trend over time.
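Equation 4.1 can be sketched in a few lines of Python (the prices are illustrative only):

```python
import math

def log_diff(series):
    """Equation 4.1: X_T = LN(X_t / X_{t-1}) * 100, the month-on-month
    growth rate; the first observation has no predecessor and is dropped."""
    return [math.log(curr / prev) * 100 for prev, curr in zip(series, series[1:])]

prices = [100.0, 105.0, 102.0, 110.0]  # illustrative monthly values
growth = log_diff(prices)              # one value per month-on-month change
```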
4.2.2. Standardisation
Then a copy of the dataset was made with all values scaled using standardisation
according to the following formula:
X_S = (X_T − mean(X_T)) / σ_T        (4.2)

where X_S = standardised value,
mean(X_T) = mean of X_T, and
σ_T = standard deviation of X_T.
This was done in order to bring the variables onto a common scale, since they are in
different units and therefore have widely varying value ranges. This is especially necessary
when building models such as neural networks. Although decision trees can handle
variables with widely varying ranges of values, the data was still standardised into a
separate dataset in order to use it for further modelling and compare the results.
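A minimal sketch of Equation 4.2 (the population standard deviation is assumed here; SPSS Modeler may use the sample form):

```python
import math

def standardise(values):
    """Equation 4.2: z-score scaling, (X_T - mean(X_T)) / stddev(X_T)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std dev
    return [(v - mean) / std for v in values]

scaled = standardise([1.0, 2.0, 3.0, 4.0, 5.0])  # zero mean, unit variance
```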
4.3. Decision Tree Modelling
4.3.1. Run Set 1 – Log Difference with CHAID
Several modelling approaches are investigated and analysed with a view to determining
which is most appropriate, that is, which produces results with the lowest error rates.
With the actual dollar value of the copper spot price as the target, models were developed
through several train/test runs covering different combinations of column-choice
scenarios, entire/partial data row sets and, finally, tree depth values. These are further
explained as follows:
I. Sessions
Datasets are split at random into two equal parts using a Partition node and then used for
a. Train and
b. Test runs
II. Scenarios
Different scenarios are generated considering different column choices as follows:
a. Scenario 1: 20 Variable Selection
The twenty variables identified (as above) were used in the modelling runs under
this scenario. These variables (chosen based on rationale also explained prior) were
applied using the following values:
i. Log Difference Values
As mentioned under Data Transformation, the data values for each variable in the
dataset were transformed to derive growth or change rate using log difference.
ii. Actual Values
b. Scenario 2: Maximum Missing Value (MV) rate
Based on the missing value rate and the correlation of the variables to the target, two
cut-off points were identified that allowed the retention of some highly correlated
variables while discarding those with very high MV rates. Of the 142 variables, retain
only those with:
i. a 60% maximum MV rate, comprising a total of 56 variables; or
ii. a 76% maximum MV rate. From Table A.1, there are quite a number of variables
with strong target correlation in the 73%-76% MV rate range, informing this
second set of 105 variables in total.
c. Scenario 3: Entire Dataset
In this scenario all 142 columns of the original dataset are retained and fed into the
algorithm for model training.
III. Data Row Sets
Where missing values exist, they always start from the beginning up to a certain row for
each column of the dataset. Two data row sets are thus formed:
a. All Rows
Entire dataset of the scenario is used for modelling.
b. Missing Values reduced Rows
For each scenario, rows are discarded from the beginning of the dataset until every
column has at most a 50% MV rate. This cut-off was used in order to observe model
performance where the effect of missing values is limited to at most half of all rows
for each variable, at the expense of fewer examples to train (and test) on.
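Since missing values occupy a leading block of each column, the row-reduction step can be sketched as follows (`None` stands for a missing cell; this is a hypothetical stand-in for the SPSS Modeler operation):

```python
def trim_for_mv_rate(rows, max_rate=0.5):
    """Drop rows from the start of the dataset until every column's
    missing-value rate is at or below max_rate.  Works because missing
    values only occur in a leading block of each column."""
    for start in range(len(rows) + 1):
        kept = rows[start:]
        if not kept:
            return kept
        n = len(kept)
        if all(sum(row[c] is None for row in kept) / n <= max_rate
               for c in range(len(kept[0]))):
            return kept
    return []

# Column 0 is missing for the first three of four months (75% MV rate);
# trimming two rows brings it down to the 50% threshold.
data = [[None, 1.0], [None, 2.0], [None, 3.0], [4.0, 4.0]]
reduced = trim_for_mv_rate(data)
```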
IV. Tree Depth
Using the CHAID decision tree model node, there are different parameters that can be
tuned while training the model including:
- Building single trees vs ensembles, with a bias for enhancing accuracy
(boosting) or stability (bagging)
- Stopping rules based on minimum records (percentage or value) in parent and
child branches
- Significance level for splitting and merging
- Maximum tree depth
Amongst the above, tree depth was chosen as the parameter for variation because the
others either produced no change in the resulting trees or produced multiple trees (in
the case of ensembles). The following maximum tree depth values were used:
a. 5 (default)
Higher values were not used because, in several preliminary trials with higher
values, the resulting trees were consistently 5 levels deep or fewer. This may be a
result of the other features (stopping rule and significance level) limiting the
complexity of the tree, since deeper growth could lead to overfitting.
b. 4
c. 3
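CHAID itself is specific to SPSS Modeler, but the effect of the maximum-depth parameter can be illustrated with a toy CART-style regression tree on a single feature (a sketch, not the dissertation's model):

```python
def best_split(xs, ys):
    """Threshold on one feature minimising total squared error of a
    two-leaf split (CART-style greedy search)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        sse = (sum((y - sum(left) / len(left)) ** 2 for y in left)
               + sum((y - sum(right) / len(right)) ** 2 for y in right))
        if best is None or sse < best[1]:
            best = (t, sse)
    return best

def grow(xs, ys, depth, max_depth):
    """Recursively grow a tree, stopping at max_depth or on a pure node;
    a leaf predicts the mean of its training targets."""
    if depth >= max_depth or len(set(xs)) < 2:
        return sum(ys) / len(ys)
    t, _ = best_split(xs, ys)
    lx = [(x, y) for x, y in zip(xs, ys) if x <= t]
    rx = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (t,
            grow([x for x, _ in lx], [y for _, y in lx], depth + 1, max_depth),
            grow([x for x, _ in rx], [y for _, y in rx], depth + 1, max_depth))

def predict(node, x):
    if not isinstance(node, tuple):
        return node  # leaf value
    t, left, right = node
    return predict(left if x <= t else right, x)

# A depth-1 tree ("stump") splits once; raising max_depth allows finer fits
# at the risk of overfitting, which is the trade-off varied in these runs.
tree = grow([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 10.0, 10.0], 0, 1)
```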
In all the runs the log difference value (growth rate) was used except in Scenario 1ii
where the actual values of the variables are applied.
Also, considering the nature of the Generic 1st ‘LP’ future as an immediate (next-month)
forecast of the price of copper, a copy of this variable column is created and staggered
forward one record so that it coincides with the next month’s copper price record. A
fresh run using all columns (now 143) and all rows is made to serve as Scenario 4.
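The staggering step amounts to shifting the column one record forward (illustrative values):

```python
def stagger_forward(column):
    """Shift a column forward one record so that last month's futures price
    lines up with this month's spot-price record; the first record then has
    no prior value and becomes missing (None)."""
    return [None] + column[:-1]

futures = [101.0, 103.5, 99.8]        # illustrative contract prices
staggered = stagger_forward(futures)  # [None, 101.0, 103.5]
```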
4.3.2. Run Set 2 – Price Movement with C5.0
Another set of modelling runs was conducted but now with a categorical target
developed to capture the copper spot price movement (UP/DOWN) from month to month.
The distribution of the movements as shown in Table 4.1 indicates that from 1970 to 2012
there has been a month-on-month rise in copper prices slightly more than half of the time.
This represents a fairly balanced dataset, and thus stratification or oversampling
techniques (Alpaydin, 2010) are not required in the data preparation for modelling.
Table 4.1: LME Copper Spot Price Month-on-Month Movement
Price Direction Count %Count
UP 271 53.77%
DOWN 233 46.23%
Total 504
Using a similar set of scenarios to Run Set 1, the C5.0 decision tree algorithm is used in this
instance (the C5.0 node in IBM SPSS Modeler can handle only categorical targets).
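Deriving the UP/DOWN target and checking the class balance of Table 4.1 can be sketched as follows (illustrative prices; an unchanged price is counted as DOWN here, an assumption the dissertation does not spell out):

```python
def price_direction(prices):
    """Categorical month-on-month movement target: 'UP' if the price rose
    from the previous month, otherwise 'DOWN'."""
    return ["UP" if curr > prev else "DOWN"
            for prev, curr in zip(prices, prices[1:])]

moves = price_direction([100.0, 104.0, 101.0, 108.0])
up_share = moves.count("UP") / len(moves)  # class balance, as in Table 4.1
```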
4.3.3. Run Set 3 – Price Change Rate with CHAID
A final set of decision tree modelling runs using month-on-month rate of change as target
was conducted. As this is also a continuous value, the CHAID algorithm was used in this run
set as well.
4.4. ARIMA (Time Series) Modelling
Using the LME copper spot prices in the dataset, a total of 505 monthly records, an
ARIMA model was built in the SPSS Modeler environment using the Expert Modeler node.
This node automatically trials and selects the ARIMA model that best fits the dataset. Table
4.2 shows the permutations of option values used in a number of runs.
Table 4.2: Expert Modeler with Constant and Transformation Option Value Runs
Constant Transformation
No None
Yes None
No Square Root
Yes Square Root
No Natural Log
Yes Natural Log
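The option grid of Table 4.2 and the pre-fitting preparation (variance-stabilising transform, then differencing) can be sketched as below; the actual ARIMA fitting is performed by the Expert Modeler node and is not reproduced here:

```python
import math
from itertools import product

TRANSFORMS = {
    "None": lambda v: v,
    "Square Root": math.sqrt,
    "Natural Log": math.log,
}

# The six (constant, transformation) permutations trialled in Table 4.2.
runs = list(product([False, True], TRANSFORMS))

def prepare(series, transform="Natural Log", d=1):
    """Apply the chosen transform, then difference d times, which is the
    preparation implied by an ARIMA integration order of d."""
    out = [TRANSFORMS[transform](v) for v in series]
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out
```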
Chapter 5 Evaluation and Results
5.1. Decision Trees
5.1.1. Log Difference Modelling
Based on runs combining the sessions, scenarios, data row sets and tree depths
explained in Section 4.3, the variable sets selected by the best evaluated model tree
under each run are captured in Tables 5.1, 5.2, 5.3 and 5.4.
Table 5.1: Trained Model Variables under Scenario 1

SN | Scenario 1i: Best Trained Model Variables | Scenario 1ii: Best Trained Model Variables
1 | Chicago Board Options Exchange SPX Volatility Index | Generic 1st 'LP' Future
2 | Generic 1st 'LP' Future | LME COPPER TOTAL
3 | Global Refined Copper Production - World | S&P 500 Index
4 | LME ALUMINUM 3MO ($) | S&P GSCI Copper Inx Spot
5 | LME COPPER TOTAL | US CPI Urban Consumers NSA
6 | S&P GSCI Copper Inx Spot | US Industrial Production 2007=100 SA
7 | United States Money Supply M2 | WTI Cushing Crude Oil Spot Px
8 | US CPI Urban Consumers NSA |
9 | WTI Cushing Crude Oil Spot Px |
Table 5.2: Trained Model Variables under Scenario 2

SN | Scenario 2i: Best Trained Model Variables | Scenario 2ii: Best Trained Model Variables
1 | Baltic Dry Index | Baltic Dry Index
2 | Chicago Board Options Exchange SPX Volatility Index | Chicago Board Options Exchange SPX Volatility Index
3 | Comex Copper Inventory Data | Commodity Research Bureau BLS/US Spot Raw Industrials
4 | Eurostat Industrial Production Eurozone Industry Ex Construction SA | Federal Funds Target Rate US
5 | Federal Funds Target Rate US | Generic 1st 'LA' Future
6 | LME COPPER TOTAL | Global Refined Copper Demand - South & Central America
7 | United States Money Supply M1 | Global Refined Copper Production - Oceania
8 | USDPEN Spot Exchange Rate - Price of 1 USD in PEN | LME COPPER TOTAL
9 | | S&P GSCI Index Spot CME
10 | | US PPI By Processing Stage Finished Goods Total SA
11 | | USDPEN Spot Exchange Rate - Price of 1 USD in PEN
Table 5.3: Trained Model Variables under Scenario 3 (Entire Dataset)

SN | Scenario 3: Best Trained Model Variables | Variable Category
1 | China Import Commodity Value - Copper Products | Metal Fundamentals
2 | Zambia Copper Prices | Metal Fundamentals
3 | LME COPPER TOTAL | Metal Fundamentals
4 | LME CNCL WRNT COPPER TOT | Metal Fundamentals
5 | Commodity Research Bureau BLS/US Spot Raw Industrials | Economic Activity Indicators
6 | BBA LIBOR USD 3 Month | Economic Activity Indicators
7 | Federal Funds Target Rate US | Economic Activity Indicators
8 | US PPI By Processing Stage Finished Goods Total SA | Economic Activity Indicators
9 | Baltic Dry Index | Economic Activity Indicators
10 | Chicago Board Options Exchange SPX Volatility Index | Economic Activity Indicators
11 | S&P GSCI Index Spot CME | Finance Indicators
12 | USDPEN Spot Exchange Rate - Price of 1 USD in PEN | Finance Indicators
Table 5.4 Trained Model Variables under Scenario 4
SN Scenario 4: Best Trained Model Variables
1 Baltic Dry Index
2 BBA LIBOR USD 3 Month
3 Chicago Board Options Exchange SPX Volatility Index
4 Commodity Research Bureau BLS/US Spot Raw Industrials
5 Federal Funds Effective Rate US
6 Federal Funds Target Rate US
7 LME ALUMINUM 3MO ($)
8 S&P GSCI Copper Exc Tot
9 Staggered Generic 1st 'LP' Future
10 US New Privately Owned Housing Units Started by Structure Total SAAR
11 USDPEN Spot Exchange Rate - Price of 1 USD in PEN
The evaluation of the trees using the following 3 metrics is as shown in Table 5.5 (best
performance per run in bold):
i. Mean Absolute Error (MAE)
The MAE metric captures the average of the absolute deviations of predicted from
target values. It is often used in the literature for model comparisons because it is
simple to calculate and understand (Hyndman and Athanasopoulos, 2014).
ii. Mean Absolute Percentage Error (MAPE)
In dividing by the target value to arrive at the MAPE, this error metric becomes
dimensionless and allows for comparison of models forecasting quantities on different
scales. It however, is undesirable when one of the possible target values is zero or
near zero at which point the MAPE value becomes extremely large or infinite.
iii. Root Mean Square Error (RMSE)
The RMSE, like the MAE, captures the error in the same unit as the target.
However, because the deviations are squared before averaging (after which a
square root is taken), this metric puts progressively greater penalties on larger
deviations. It is thus susceptible to outliers (which bloat its value), and this is the
primary reason cited for avoiding its use (Chai and Draxler, 2014). However, when
the distribution of deviations is normal (no significant outliers), the advantage of
weighting different degrees of deviation differently makes it more desirable than
the MAE, especially where accuracy is of more concern than stability.
Since the models being developed are all decision trees and the dataset is the same
throughout, the MAPE is not strictly required. In addition, the RMSE produces values
with greater ranges for discrimination, so that where MAE values are close it shows
more clearly the differences in accuracy among models.
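The three metrics are straightforward to compute; a minimal sketch (note the MAPE instability when an actual value is near zero, as discussed above):

```python
import math

def mae(actual, pred):
    """Mean Absolute Error, in the target's own units."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean Absolute Percentage Error; unstable if any actual value is
    zero or near zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root Mean Square Error; squaring penalises large deviations more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))
```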
Table 5.5: Evaluation of Tree Models Built under Various Conditions/Combinations
(each cell lists MAE / MAPE / RMSE)

All Rows:
Scenario | Depth 3 | Depth 4 | Depth 5
1i | 36.65 / 28.50 / 68.93 | 36.68 / 28.50 / 68.91 | 36.56 / 28.30 / 68.89
1ii | 6.81 / 5.90 / 10.73 | 6.72 / 5.70 / 10.68 | 6.72 / 5.70 / 10.68
2i | 22.98 / 19.30 / 43.98 | 20.41 / 16.20 / 43.38 | 20.12 / 15.40 / 43.49
2ii | 22.31 / 18.00 / 43.28 | 20.61 / 15.70 / 42.86 | 21.15 / 16.10 / 43.21
3 | 13.36 / 12.20 / 19.60 | 11.64 / 9.80 / 18.65 | 12.13 / 10.10 / 19.38
4 | 13.83 / 12.10 / 21.81 | 12.43 / 10.30 / 21.17 | 12.5 / 10.10 / 21.47

MV Reduced Rows:
Scenario | Depth 3 | Depth 4 | Depth 5
1i | 63.77 / 45.80 / 88.70 | 63.72 / 45.70 / 88.70 | 63.72 / 45.70 / 88.70
1ii | 5.97 / 4.10 / 9.23 | 6.08 / 4.10 / 9.54 | 6.08 / 4.10 / 9.54
2i | 20.54 / 16.00 / 41.24 | 20.54 / 16.00 / 41.24 | 21.00 / 16.40 / 41.40
2ii | 34.74 / 17.60 / 66.33 | 34.25 / 17.10 / 65.63 | 34.19 / 17.10 / 65.86
3 | 21.46 / 10.40 / 38.72 | 21.73 / 10.60 / 39.02 | 21.73 / 10.60 / 39.02
4 | (not run)
The modelling forecast accuracy results in Table 5.5 show the varying levels of
performance obtained under the various scenarios and run conditions. The following
particular observations were made:
- Using more rows of data, even with very high MV rates, tended to produce better
results than using fewer.
- Retaining more columns to feed into the modelling algorithm also produced better
results than filtering them out, even when the columns are known to have very high
MV rates. By retaining columns with an MV rate of up to 76% (a marginal increase
from 60%) where the added variables have high correlation with the target, the
accuracy of the model increases. Thus where variables are well correlated with the
target, they should be retained even if they have many missing values.
- The above confirms again that decision trees are versatile for use in forecasting
even when predictor variables have high rates of missing values.
- Generally, a tree depth of 4 is observed to be optimal. This is seen when comparing
the Train session results with the Test session.
If E_x(d) is the error rate of session x using tree depth d, it is seen (in Scenario 3, for
example) that

E_train(5) < E_train(4) < E_train(3)

whereas

E_test(4) < E_test(5) < E_test(3)
Therefore a tree depth of 5 tends to overfit, while a depth of 3 underfits.
Log Difference (Growth Rate) vs Actual Values
In Scenario 1ii, with the 20 variables identified earlier, the actual values of the
variables were used for training and testing rather than growth rates. As seen in Table 5.5,
under the MV-reduced data row set with a tree depth of 3, the result shows a remarkable
drop in the error rates, with best MAE and RMSE values of 5.97 and 9.23 respectively.
Indeed, looking through the table generally, the Scenario 1ii error rate figures are far
lower (mostly below 10) than those of the other scenarios. However, inspection of the
predictor variables and their ranking in the produced model, as shown in Table 5.1 and
Figure 5.1, reveals that the model is basically using just one variable (S&P GSCI Copper
Inx Spot) for prediction. This results from the very high (almost perfect) correlation
between the variable's actual values and the target, which is what the index is by
definition designed to track. Thus, being a coincident indicator, using the actual values is
not practical as they may not be available early enough to be applied. Also, since
most of the variables of the entire dataset are lagging indicators, it becomes necessary to use
their growth rates rather than actual values for prediction as done in Scenarios 2 and 3.
Figure 5.1: Predictor variables in ranked order of importance (Scenario 1ii)
Scenario 3 produced the next best error rate results, with ranked predictor variables as
shown in Figure 5.2. This was further improved by remodelling after removing the least
important variable (Global Refined Copper Production – South and Central America),
which as can be seen in Figure 5.2 is ranked far below the rest. The process was repeated
iteratively until no further improvement could be realised, producing the variable ranking
in Figure 5.3 and the final results recorded in Table 5.5.
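The iterative improvement loop is a simple backward elimination; it can be sketched with hypothetical `train_fn`/`eval_fn` callables standing in for the SPSS Modeler train/evaluate cycle:

```python
def backward_eliminate(variables, importance, train_fn, eval_fn):
    """Iteratively drop the least important variable and retrain while
    the evaluation error keeps improving.  `train_fn` builds a model from
    a variable list; `eval_fn` returns its error (both hypothetical)."""
    current = list(variables)
    best_err = eval_fn(train_fn(current))
    while len(current) > 1:
        weakest = min(current, key=lambda v: importance[v])
        candidate = [v for v in current if v != weakest]
        cand_err = eval_fn(train_fn(candidate))
        if cand_err >= best_err:
            break  # no further improvement; keep the previous set
        current, best_err = candidate, cand_err
    return current, best_err
```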
Figure 5.2: Predictor variables in ranked order of importance (Scenario 3)
Figure 5.3: Predictor variables in ranked order of importance (Scenario 3 Improved)
The predictive power of the Staggered Generic 1st ‘LP’ (Copper) Future variable in
Scenario 4 can be seen in Figure 5.4. The error rates realised were very close to those of
Scenario 3, as detailed in Table 5.5. As such, this model may still be used where the
business understanding favours the set of variables listed in Table 5.4.
Figure 5.4: Predictor variables in ranked order of importance (Scenario 4)
Predictor Variables
The chosen best decision tree model is the Scenario 3 (improved) model with the next
best error rate results (after Scenario 1ii which as earlier mentioned is impractical). The
variables of the model are as listed in Table 5.3 and ranked in Figure 5.3 showing the
foremost predictor as China Import Commodity Value - Copper Products. This goes to show
that Chinese demand for copper, over and above other factors, largely determines the price of
the metal, making it highly susceptible to the vagaries of the Chinese economy. The second
most important predictor, Zambia Copper Prices, is rather an obvious one: since the
commodity is marketed globally, international copper prices are bound to be comparable
and to move with a similar trend. Along with two other inventory indicators, these make up
the metal-fundamentals variables that are predictive of copper spot prices. In the predictor
importance ranking, the other variables are mostly at similar levels of impact.
The Volatility Index, PPI and Federal Funds Target Rate are different indicators of the
US economy, measuring investor appetite, variations in manufactured goods prices and
interbank lending rates respectively. Thus the health of the US economy also largely
determines the price of copper. The Baltic Dry Index, being a measure of the level of
shipping activity of ‘dry’ commodities including copper, is a leading indicator of demand
levels and so, as expected, demonstrates good predictive power. This corroborates its high
correlation with the copper price, as seen earlier.
Finance indicator variables include the S&P GSCI Index, a widely quoted measure of
general commodity price inflation. Finally in this category is the US Dollar to Peruvian
Sol exchange rate (USDPEN). Peru is the third-largest copper-producing country after
Chile and China, and its currency thus reflects the price of the metal, copper being one of
its important export commodities.
Model Tree and Ruleset
The decision tree structure for the most accurate model (scenario 3) is shown in Figure
5.5a-c where it can be observed that the root node is the China Import Commodity Value –
Copper Products variable. At the 2nd level of the tree and indicating the next level of
variable importance we see the following variables:
1. USDPEN Spot Exchange Rate (PEN Curncy)
Here the inverse relationship between the exchange rate and copper spot price can be
seen from the rule:
If PEN Curncy <= 3.34 then LOCADY Comdty = 150.31
If PEN Curncy > 3.34 then LOCADY Comdty = 126.93
2. Zambia Copper Prices (ZMCMCOPP Index)
3. Baltic Dry Index (BDIY Index)
The importance of the Baltic Dry Index variable is further reflected in the fact that it
is used to split further on 82.6% of the dataset. This is consistent with its high
correlation (0.6) with the target variable. Being a leading indicator, the Baltic Dry
Index is thus a very useful predictor, as seen in its wide use in the literature.
Figure 5.5a CHAID Left Tree Subsection
Figure 5.5b CHAID Middle Tree Subsection
Figure 5.5c CHAID Right Tree Subsection
With a four-level depth beyond the root and a total of 23 leaf nodes, the decision tree is
quite intricate; an exhaustive analysis of its rules is therefore treated as out of scope and
not undertaken.
The design of the stream (in IBM SPSS Modeler) to load the dataset and train a CHAID
node in creating the decision tree model for scenario 3 is as shown in Figure A.2 under the
Appendix.
5.1.2. Price Movement Modelling
As described in Section 4.3.2, another set of modelling runs was conducted with a
categorical target capturing the copper spot price movement (UP/DOWN) from month to
month. Using the same set of scenarios as earlier, the C5.0 algorithm in SPSS Modeler
(since the target is now categorical) was used to build decision tree models through
several runs.
results as shown in Table 5.6 indicate that scenario 2 produced the tree with the highest
accuracy of 96% during test run. The variables selected by the algorithm for the decision tree
and their ranking as shown in Figure 5.6 reveals the significant impact of the S&P GSCI Inx
Spot variable in predicting price movement.
Table 5.6: C5.0 Decision Tree Evaluation using Price (month-on-month) Direction Target

20 Field Log Rate
Train: Actual DOWN → Predicted DOWN 93, UP 27; Actual UP → Predicted DOWN 8, UP 127 (Accuracy 0.86)
Test: Actual DOWN → Predicted DOWN 86, UP 27; Actual UP → Predicted DOWN 3, UP 133 (Accuracy 0.88)

20 Field Actual Values
Train: Actual DOWN → Predicted DOWN 102, UP 12; Actual UP → Predicted DOWN 80, UP 53 (Accuracy 0.63)
Test: Actual DOWN → Predicted DOWN 89, UP 30; Actual UP → Predicted DOWN 94, UP 44 (Accuracy 0.52)

Scenario 1
Train: Actual DOWN → Predicted DOWN 84, UP 5; Actual UP → Predicted DOWN 6, UP 103 (Accuracy 0.94)
Test: Actual DOWN → Predicted DOWN 90, UP 2; Actual UP → Predicted DOWN 14, UP 93 (Accuracy 0.92)

Scenario 2
Train: Actual DOWN → Predicted DOWN 51, UP 2; Actual UP → Predicted DOWN 0, UP 62 (Accuracy 0.98)
Test: Actual DOWN → Predicted DOWN 44, UP 3; Actual UP → Predicted DOWN 2, UP 71 (Accuracy 0.96)

Scenario 3
Train: Actual DOWN → Predicted DOWN 87, UP 27; Actual UP → Predicted DOWN 5, UP 128 (Accuracy 0.87)
Test: Actual DOWN → Predicted DOWN 92, UP 27; Actual UP → Predicted DOWN 7, UP 131 (Accuracy 0.87)
Figure 5.6: Predictor variables in ranked order of importance
(Price Movement Target – Scenario 2)
Variables and Ruleset
Figure 5.7 shows the tree structure and derived ruleset from the decision tree model for
copper price direction forecasting. Rule 1 shows that when the S&P GSCI Inx Spot growth
rate falls to -1.76 or less (well below its mean value of 0.41), the copper spot price trends
downward. Otherwise it swings upward, provided the Euro to US Dollar exchange rate
growth does not fall more than about one and a third standard deviations below its mean
(-0.01), that is, below -3.348 (Rule 3).
These two rules determine the price direction in almost all the examples tested (51 and
60 respectively, out of a total of 115). Crude oil is a major source of energy, and this is
reflected in its common appearance as a predictor variable in the literature; in the forecast of
copper price movement, however, the decision tree model uses it to discriminate in just 4
examples.
Rule 1 for CopperSpotPriceDirection(Down) (51)
If S&P GSCI Inx Spot <= -1.76
then DOWN
Rule 2 for CopperSpotPriceDirection(Down) (2)
If S&P GSCI Inx Spot > -1.76
and EURUSD Spot Exchange Rate <= -3.348
and WTI Cushing Crude Oil Spot Px <= -8.23
then DOWN
Rule 3 for CopperSpotPriceDirection(Up) (60)
If S&P GSCI Inx Spot > -1.76
and EURUSD Spot Exchange Rate > -3.348
then UP
Rule 4 for CopperSpotPriceDirection(Up) (2)
If S&P GSCI Inx Spot > -1.76
and EURUSD Spot Exchange Rate <= -3.348
and WTI Cushing Crude Oil Spot Px > -8.23
then UP
Figure 5.7: C5.0 Decision Tree and Ruleset (Directional Forecasting)
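The ruleset of Figure 5.7 translates directly into code (growth-rate inputs are assumed, matching the log-difference values used in modelling):

```python
def copper_direction(gsci, eurusd, wti):
    """Direct encoding of the C5.0 ruleset in Figure 5.7; arguments are
    month-on-month growth rates of S&P GSCI Inx Spot, the EURUSD rate
    and WTI crude oil."""
    if gsci <= -1.76:
        return "DOWN"                            # Rule 1
    if eurusd <= -3.348:
        return "DOWN" if wti <= -8.23 else "UP"  # Rules 2 and 4
    return "UP"                                  # Rule 3
```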
5.1.3. Price Change Rate Modelling
A final set of modelling runs was conducted using the month-on-month rate of change as
a developed target. As this is also a continuous value, the CHAID algorithm was again used
in this run set. Table 5.7 shows the evaluation results, where Scenario 3 with missing-value-
reduced rows gives the best accuracy figures. As seen in Figure 5.8, the S&P GSCI Inx Spot
variable is again largely the determinant predictor. With an importance of 0.97, and given
the very low error rates recorded, it can be inferred that the variable has an almost perfect
correlation with the price change rate target.
Table 5.7: Evaluation of Tree Models (month-on-month rate of change target)
Row Set All Rows MV Reduced Rows
Tree Depth 3 4 5 3 4 5
Metric MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE MAE MAPE RMSE
Scenario 1i 0.025 0.746 0.045 0.016 0.467 0.032
Scenario 1ii 0.053 2.762 0.071 6.716 0.057 10.682 0.057 2.602 0.084 0.056 2.557 0.077
Scenario 2i 0.030 2.030 0.045 0.030 2.030 0.045 0.019 0.698 0.032
Scenario 2ii 0.030 2.020 0.045 0.030 2.020 0.045 0.016 0.501 0.032 0.016 0.506 0.032
Scenario 3 0.030 2.020 0.045 0.030 2.020 0.045 0.015 0.491 0.032 0.015 0.496 0.032
Figure 5.8: Predictor variables in ranked order of importance (Price Change Rate Target)
5.2. ARIMA
Using the Expert Modeler, a seasonal ARIMA(2,1,0)(1,0,1) with a constant, on a
natural-log-transformed dataset, proved to be the model producing the best performance in
fitting the dataset. Table 5.8 shows the detailed results of a number of runs with
permutations of option values.
Table 5.8: Expert Modeler Result: ARIMA(2,1,0)(1,0,1) with Constant and Transformation options
Constant | Transform. | Stationary R² | R² | RMSE | MAPE | MAE | MaxAPE | MaxAE | Norm. BIC | Q | df | Sig.
No | None | 0.14 | 0.99 | 11.15 | 4.54 | 6.10 | 38.59 | 86.23 | 4.87 | 73.84 | 14 | 0
Yes | None | 0.15 | 0.99 | 11.16 | 4.53 | 6.09 | 38.78 | 86.64 | 4.89 | 73.84 | 14 | 0
No | Square Root | 0.16 | 0.98 | 11.18 | 4.52 | 6.10 | 38.52 | 86.06 | 4.88 | 43.08 | 14 | 8.30E-05
Yes | Square Root | 0.16 | 0.98 | 11.19 | 4.53 | 6.10 | 38.75 | 86.58 | 4.89 | 43.05 | 14 | 8.40E-05
No | Natural Log | 0.16 | 0.98 | 11.21 | 4.53 | 6.11 | 38.95 | 87.04 | 4.88 | 21.05 | 14 | 0.10032
Yes | Natural Log | 0.17 | 0.98 | 11.24 | 4.53 | 6.11 | 39.27 | 87.73 | 4.90 | 20.94 | 14 | 0.10306
These results indicate that an ARIMA(2,1,0)(1,0,1) with constant on a natural log
transformed dataset is the best fit model as seen in Table 5.8 (in bold). This is so considering
that this trial run has the highest Stationary R² value of 0.17 as well as the best model
adequacy result, with a Ljung-Box significance of 0.103 (> 0.05) at the 95% confidence
level. The runs without the natural log transformation all return significance values below
0.05, indicating statistically significant autocorrelation remaining in their residuals.
Thus the copper spot price series is a high-variance, trended time series requiring a
natural log transformation and first differencing to make it stationary. It has a non-seasonal
second-order autoregressive signature, AR(2), and a seasonal part fitted with an
ARMA(1,1) model. Figure 5.9 shows a time series plot of the original dataset, where the
variance and trend can be clearly seen. In Figure 5.10 the dataset has been transformed,
thereby reducing variance, and Figure 5.11 shows a stationary plot after first differencing.
The model design stream is shown in Figure A.2 in the Appendix.
It should be noted that careful scrutiny of the dataset output of the ARIMA model reveals
that the error rate figures (RMSE, MAPE, MAE) in Table 5.8 are computed with the entire
dataset used as the training set; in essence they are training error rates. Using several holdout
values, test error rates for the model are shown in Table 5.9, where far higher figures are
seen. This indicates that the ARIMA model is at best useful for forecasts in the very near term.
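The holdout evaluation differs from the random split used for the decision trees: the test records must come from the end of the chronologically ordered series. A minimal sketch:

```python
def holdout_split(series, holdout):
    """Chronological split for out-of-sample time-series evaluation:
    the last `holdout` records are reserved for testing."""
    return series[:-holdout], series[-holdout:]

# The 505 monthly records with the 252-record (50%) holdout of Table 5.9.
train, test = holdout_split(list(range(505)), 252)
```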
Figure 5.9: Time series plot of the original dataset showing variance and trend.
Figure 5.10: Time series plot of the transformed dataset showing reduced variance.
Figure 5.11: Time series plot of the differenced transformed dataset, eliminating trend to make it stationary.
Table 5.9: ARIMA Model Holdout (Out-of-Sample) Dataset Test Error Rates
Holdout % of Total Count MAE MAPE RMSE
252 50% 93.41 89.09% 113.60
100 20% 104.38 56.33% 124.40
10 2% 54.91 15.45% 70.39
3 0.6% 33.54 9.50% 35.16
1 0.2% 22.51 6.17% 22.51
5.3. Models Comparison
Using the figures from Tables 5.5 and 5.9 (rather than Table 5.8, as explained earlier)
to compare the two model types, we see that the ARIMA performs far worse than the
decision tree. At the 50% holdout (test) level, the decision tree's RMSE of 18.65 is far
more reliable than the ARIMA's 113.60. A limitation of this comparison, however, is that
while the decision tree uses 50% of the dataset picked at random for testing, the ARIMA,
by the very nature of the model, must use the half taken from the end of the
chronologically ordered dataset. The test datasets presented to the two models are
therefore not exactly the same. Nevertheless, considering that even the lowest ARIMA
RMSE (22.51, at a holdout of just one record) is still worse than that of the decision
tree, it can safely be concluded that the decision tree model performs better and will
continue to do so when presented with real-world data.
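The split mismatch noted above can be made concrete. A minimal sketch, assuming a series of 504 monthly records (so that the 50% holdout contains the 252 records of Table 5.9): the ARIMA must hold out the chronological tail, while the tree holds out a random sample, so the two test sets only partially overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 504                        # assumed series length, so 50% = 252 records
idx = np.arange(n)

# ARIMA-style holdout: the final 50% of the chronologically ordered series.
arima_test = idx[n // 2:]

# Decision-tree-style holdout: 50% of the records drawn at random.
tree_test = rng.choice(idx, size=n // 2, replace=False)

# The two test sets coincide only partially, which is why the comparison in
# the text is indicative rather than exact.
overlap = np.intersect1d(arima_test, tree_test).size
```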
As the decision tree model uses several variables selected from a large pool of
potential predictors, it draws on more information to arrive at a forecast value. The
ARIMA, on the other hand, has only the historical target values to rely on for its
modelling and forecasting. It is not surprising, then, that the decision tree performs
much better. This is the reason for the growing research interest in applying data
mining and machine learning algorithms to time series forecasting, especially where
promising predictor variables have been shown to exist (Chen et al., 2010; Lai et al.,
2009; Chang et al., 2011).
Apart from the decision tree, several other machine learning models exist, including
support vector machines (SVM), logistic regression, k-nearest neighbours (KNN) and
artificial neural networks (ANN). In fact, the neural network has been shown in some
studies to achieve better accuracy figures than the decision tree (Diaz et al., 2016).
However, neural nets and other highly sophisticated, high-accuracy models are
essentially black-box techniques that do not lend themselves to easy interpretation of
results, as “they do not provide an insight into the nature of the interactions between
the technical indicators and the [target]” (Lai et al., 2009). The main advantage of the
decision tree model, as mentioned before, is its interpretability: it identifies the
predictor variables and the threshold values that determine the forecast value of the
target. Table 5.3 lists the economic and financial indicators as well as the metal
fundamentals that the decision tree model found most predictive of copper spot prices.
Figure 5.5 shows the generated decision tree, indicating the threshold values of these
variables from which rulesets for prediction are derived.
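The dissertation builds its tree in SPSS Modeler, so the following is only an open-source stand-in: a sketch, using scikit-learn's CART regressor on synthetic data, of how predictor names and split thresholds can be read straight out of a fitted tree, which is precisely the interpretability advantage claimed above. The feature names `spgsic` and `vix` are borrowed from the dataset variables merely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))  # two synthetic predictors standing in for the real ones
# Target jumps when the first predictor crosses 0.5, plus a little noise.
y = np.where(X[:, 0] > 0.5, 300.0, 100.0) + rng.normal(scale=5.0, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# export_text prints the tree as human-readable rules: each line names the
# predictor and the threshold value at the split, e.g. "spgsic <= 0.5...".
rules = export_text(tree, feature_names=["spgsic", "vix"])
print(rules)
```

A ruleset of this form is what lets an analyst see not only *which* variables matter but *at what values* they change the forecast.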
As seen in Table 5.6, the decision tree can also be used for categorical target
prediction, which in this case is price direction. The high accuracy figures achieved in
this instance are borne out by the literature, where decision trees have been
demonstrated to perform very well, even better than KNN, SVM and ANN, in directional
forecasting (Diaz et al., 2016). The model can therefore be used reliably by investors
(hedgers and speculators) to decide on a buy/hold strategy when trading the metal.
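The categorical use of the tree amounts to relabelling the target as a direction and fitting a classification tree instead of a regression tree. A minimal sketch on synthetic data (scikit-learn again standing in for SPSS Modeler; the direction labels and predictor construction are assumptions for illustration only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))  # synthetic predictor set

# Label each record "up" or "down"; here the direction is driven mostly by the
# first predictor, with a small amount of noise.
direction = np.where(X[:, 0] + 0.1 * rng.normal(size=300) > 0, "up", "down")

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, direction)
accuracy = clf.score(X, direction)   # training accuracy only, for illustration
```

In practice the labels would come from month-on-month price changes, and accuracy would be measured on a holdout, as in Table 5.6.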
5.4. Deployment
With these insightful models for copper price prediction successfully developed, the
final step in the methodology is deployment. For the purposes of this research, some
promising deployment scenarios are outlined for the following stakeholders:
1. Governments
Chile, Peru and Zambia are the top copper-producing countries in South America
and Africa, and their economies rely heavily on exports of the commodity. The
models developed in this research can readily provide useful data to help inform
national economic planning strategy.
2. Investors
Hedgers and speculators on exchanges globally can apply the models (for
example, the price-direction model) to make informed trading decisions that help
maximise portfolio returns.
3. Academia
Based on the outcome of this research, the academic community can use the
findings to improve existing theories and methods in financial time series
analysis and similar research areas.
Chapter 6 Conclusions and Future Work
This final chapter presents the summary findings of the research, the conclusions
reached and recommendations for future research.
6.1. Conclusions
This research work focuses on the prediction of copper spot prices using data from the
LME as well as a set of other economic and financial predictor variables. The value to
various stakeholders of predicting the future price of an industrial metal like copper
has been amply demonstrated and is reflected in the literature.
Several attempts to forecast the price of industrial metals using various approaches
have been made. The ARIMA has been applied in a number of studies, and more
recently the use of data mining techniques such as linear regression, ANN and SVM has
been investigated (Ongsritrakul and Soonthornphisaj, 2003). While the data mining
approach has demonstrated better accuracy (Adebiyi et al., 2014; Lasheras et al., 2015),
the results produced have been difficult to interpret owing to the black-box nature of
these algorithms. Also lacking is the use of a methodological framework in the analysis
process, which would give the benefit of a systematic, generic approach that can be
adapted to other domains.
In this research, the use of decision trees as a forecasting tool for copper spot prices is
investigated; being an open-box model, its results are easily interpreted and applied.
The use of the CRISP-DM methodology ensured a systematic approach to the
investigation. A rigorous search of the literature reveals that neither this
methodological approach nor the use of decision trees has previously been applied to
copper price forecasting. The decision tree is also contrasted with the ARIMA in terms
of model accuracy.
The CRISP-DM methodology comprises six phases: business understanding, data
understanding, data preparation, modelling, evaluation and deployment. The phases
are navigated not in a single waterfall pass but through several iterations that revisit
previous stages, ensuring continuous improvement in the overall modelling effort.
The results obtained are consistent with the literature in terms of the metal
fundamentals and economic/financial variables identified and selected as predictors by
the model. There is clear evidence of the leading impact of Chinese demand on the
price of copper; the metal's price performance is thus highly predicated on the Chinese
economy. The Standard & Poor's copper commodity index (S&P GSCI Copper Inx Spot)
has also been shown to
have very good predictive capacity for the metal. The US economy, being the largest in
the world, also determines the price of copper to a large extent, as seen in the
significant effect of several US economic indicators (the Volatility Index, PPI and
Federal Funds Target Rate) as predictor variables. The Peruvian Sol to US Dollar
exchange rate completes the set of forecast variables identified. The threshold values
at which the predictors determine prices are captured in the produced decision tree.
A comparison with the ARIMA model showed clearly that the decision tree produced
far more accurate results and will thus deliver more reliably upon deployment in
real-world scenarios.
Deployment scenarios considered include economic planning by governments of
countries like Chile, Peru and Zambia whose economies depend strongly on copper exports.
Investors are also able to utilise the developed models in planning more profitable trading
strategies.
Due to time and other resource constraints, the research was limited by the dataset, in
which a number of variables that the literature suggests are promising had many
missing values. Also, business cycles, identified as a strong potential predictor
(Cuddington and Jerrett, 2011; Diaz et al., 2016) on the basis of the cycles observed in
copper prices over the years, were not investigated thoroughly enough to be developed
into a forecast variable.
6.2. Recommendations for Future Work
Future work recommended in this research area includes the following:
1. A more thorough study of the variables available in the dataset, with a view to
further discovering the interrelationships among them and thereby fine-tuning
the data understanding and preparation processes.
2. Further modelling runs tuning more parameter values for possible improvement
in the accuracy of the resulting model.
3. Analysis and investigation of business cycles for development as a predictor variable.
4. Application of decision trees using the CRISP-DM methodology to other
industrial metals, clearly revealing the predictor variables and their threshold
values in determining prices (or other suitable targets).
List of References
Adebiyi, A.A., Adewumi, A.O. and Ayo, C.K. (2014), “Comparison of ARIMA and
Artificial Neural Networks Models for Stock Price Prediction”, Journal of Applied
Mathematics, Vol. 2014 No. 1, pp. 1–7.
Alpaydin, E., 2014. Introduction to machine learning. MIT press.
Anyadike, N. (2002), Copper: A material for the new millennium, Woodhead Publishing,
Cambridge, England.
Black, W.T., 1995. Trends in the use of copper wire & cable in the USA. Electrical &
Electronic Markets.
Bontempi, G. (2013). Machine Learning Strategies for Time Series Prediction. Machine
Learning Summer School. ULB, Brussels.
Box, G.E., Jenkins, G.M. and Reinsel, G.C., 1970. Time Series Analysis: Forecasting
and Control.
Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A., 1984. Classification and
Regression Trees. CRC Press.
Buncic, D. and Moretto, C. (2015), “Forecasting copper prices with dynamic averaging
and selection models”, The North American Journal of Economics and Finance,
Vol. 33, pp. 1–38.
Chang, P.C., Fan, C.Y. and Lin, J.L., 2011. Trend discovery in financial time series data
using a case based fuzzy decision tree. Expert Systems with Applications, 38(5),
pp.6070-6080.
Chai, T. and Draxler, R.R., 2014. Root mean square error (RMSE) or mean absolute
error (MAE)?–Arguments against avoiding RMSE in the literature. Geoscientific
Model Development, 7(3), pp.1247-1250
Chen, Y., Rogoff, K., & Rossi, B. (2010). Can exchange rates forecast commodity
prices? Quarterly Journal of Economics, 125(3),1145–1194
Crowson, Philip (2008), “Copper industry”, International Encyclopedia of the Social
Sciences. Retrieved May 13, 2016, from,
http://www.encyclopedia.com/topic/Copper_industry.aspx.
Cruse, H. (2006), Neural Networks as Cybernetic Systems, 2nd ed., Brains, Minds &
Media, Bielefeld.
Cuddington, J.T. and Jerrett, D., 2011. Business Cycle Effects on Metal and Oil Prices:
Understanding the Price Retreat of 2008-9.
Diaz, D., Theodoulidis, B. and Dupouy, C. (2016), “Modelling and forecasting interest
rates during stages of the economic cycle. A knowledge-discovery approach”,
Expert Systems with Applications, Vol. 44, pp. 245–264.
Fama, E.F. and French, K.R. (1988), “Business Cycles and the Behavior of Metals
Prices”, The Journal of Finance, Vol. 43 No. 5, pp. 1075–1093.
Fisher, F.M., Cootner, P.H. and Baily, M.N. (1972), “An econometric model of the world
copper industry”, The Bell Journal of Economics and Management Science, pp.
568–609.
Hssina, B., Merbouha, A., Ezzikouri, H. and Erritali, M., 2014. A comparative study of
decision tree ID3 and C4.5. International Journal of Advanced Computer Science
and Applications, 4(2).
Hyndman, R.J. and Athanasopoulos, G., 2014. Forecasting: principles and practice.
OTexts.
IBM (2013) IBM SPSS Modeler 16 Modeling Nodes
International Copper Study Group (ICSG), 2016. The World Copper Factbook 2015,
http://www.icsg.org/index.php/component/jdownloads/viewdownload/170/2092.
Ján, Š. and Katarina, H. (2010), “The Implementation of Hybrid ARIMA-Neural
Network Prediction Model for Aggregate Water Consumption Prediction”, Journal
of Applied Mathematics, Vol. 3 No. 3.
Kass, G.V., 1980. An exploratory technique for investigating large quantities of
categorical data. Applied statistics, pp.119-127.
Kavitha, C. and Iyakutti, K., 2014. Optimized Anomaly based Risk Reduction using
PCA based Genetic Classifier. Global Journal of Computer Science and
Technology, 14(7).
Khoonsari, P.E. and Motie, A., 2012. A comparison of efficiency and robustness of ID3
and C4.5 algorithms using dynamic test and training data sets. International Journal
of Machine Learning and Computing, 2(5), p.540.
Kriechbaumer, T., Angus, A., Parsons, D. and Casado, M.R., 2014. An improved
wavelet–ARIMA approach for forecasting metal prices. Resources Policy, 39,
pp.32-41.
Lai, R.K., Fan, C.Y., Huang, W.H. and Chang, P.C., 2009. Evolving and clustering fuzzy
decision tree for financial time series data forecasting. Expert Systems with
Applications, 36(2), pp.3761-3773.
Lasheras, F.S., de Cos Juez, F.J., Sánchez, A.S., Krzemień, A. and Fernández, P.R.,
2015. Forecasting the COMEX copper spot price by means of neural networks and
ARIMA models. Resources Policy, 45, pp.37-43.
Maier, H.R. and Dandy, G.C. (2000), “Neural networks for the prediction and
forecasting of water resources variables. A review of modelling issues and
applications”, Environmental Modelling & Software, Vol. 15 No. 1, pp. 101–124.
Malliaris, A.G. and Malliaris, M. (2015), “What drives gold returns? A decision tree
analysis”, Finance Research Letters, Vol. 13, pp. 45–53.
Meller, P. and Simpasa, A.M. (2011), “Role of Copper in the Chilean & Zambian
Economies: Main Economic and Policy Issues”, GDN Working Paper Series,
Vol. 43.
Ongsritrakul, P. and Soonthornphisaj, N. (2003), Apply Decision Tree and Support
Vector Regression to Predict the Gold Price: Proceedings of the International Joint
Conference on Neural Networks 2003, Doubletree Hotel, Jantzen Beach, Portland,
Oregon, July 20-24, 2003 / co-sponsored by the International Neural Network
Society, the IEEE Neural Networks Society. Vol. 1, IEEE, Piscataway, N. J.
Piatetsky, G. (2014) KDnuggets Methodology Poll, [Online], Available:
http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-
methodology.html [10 Mar 2016].
Quinlan, J.R. (1990). Learning logical definitions from relations. Machine Learning, 5,
239–266. doi:10.1007/BF00117105.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, California:
Morgan Kaufmann.
Quinlan, R., 2004. Data mining tools See5 and C5.0.
‘Standard & Poor's 500 Index - S&P 500’ (2016) Investopedia. Available at:
http://www.investopedia.com/terms/s/sp500.asp (Accessed: 22 June 2016)
Wirth, R. and Hipp, J., 2000, April. CRISP-DM: Towards a standard process model for
data mining. In Proceedings of the 4th international conference on the practical
applications of knowledge discovery and data mining (pp. 29-39).
Zhang, G. (2003), “Time series forecasting using a hybrid ARIMA and neural network
model”, Neurocomputing, Vol. 50, pp. 159–175.
Appendix
Table A.1 Dataset Variables Showing Statistical Characterisation (First Variable is Target)
SN Variable Description Mean Median Std. Dev Skew Kurtosis Min Max Missing Target Corr.
1 LOCADY Comdty LME COPPER SPOT ($) OFF 120.27 84.64 90.72 2.01 2.97 45.65 447.59 0 1
2 USCRWTIC Index
WTI Cushing Crude Oil Spot Px 0.35 1.03 8.51 -0.41 3.29 -39.50 39.09 161 0.08
3 spgsci index S&P GSCI Index Spot CME 0.37 0.58 4.81 -0.43 2.95 -26.13 17.35 1 0.08
4 spgsin index S&P GSCI Industrial Metals Index Spt 0.32 0.50 5.53 -0.56 3.78 -28.03 19.10 85 0.07
5 MEPRCUPR Index
FOB Copper Pricing Usd per lb 0.27 2.79 9.92 -1.43 3.05 -34.90 16.56 457 0.15
6 spgsic Index S&P GSCI Copper Inx Spot 0.41 0.53 6.35 -0.63 4.62 -35.80 22.50 85 0.09
7 spgsintr index S&P GSCI Ind Met Tot Ret 0.66 0.76 5.73 -0.28 3.73 -27.73 22.25 85 0.07
8 SPGSICP Index S&P GSCI Copper Exc Tot 0.46 0.53 6.56 -0.42 4.10 -35.69 23.08 85 0.12
9 SPGSPMP Index S&P GSCI PREC METAL ER 0.19 -0.22 5.63 1.12 9.18 -23.73 38.74 37 0.09
10 SPGSAMP Index
S&P GSCI All Metals Index Excess Return 0.52 1.38 4.36 -0.63 -0.59 -8.51 6.76 479 0.29
11 SPGSICTR Index S&P GSCI Copper Tot Ret 0.92 1.10 6.53 -0.43 4.25 -35.60 23.45 85 0.10
12 SPGSESP Index S&P GSCI Enhanc ER 0.71 1.51 5.29 -1.23 3.58 -25.79 11.96 301 0.02
13 SPGSHG Index S&P GSCI North American Copper Index Spot 0.48 0.94 6.97 -0.73 4.76 -35.87 22.80 301 0.10
14 SPGSAM Index S&P GSCI All Metals Index Spot 0.68 1.55 4.35 -0.61 -0.61 -8.25 6.91 479 0.28
15 SPGSAP Index
S&P GSCI All Metals Capped Commodity 35/20 Index Spot 0.94 2.03 4.96 -0.68 -0.70 -9.54 7.66 479 0.25
16 LSCA Index LME COPPER TOTAL 288785.58 204775.00 220004.19 1.01 0.24 11663.00 965427.00 0 -0.09
17 COMXCOPR Comdty
Comex Copper Inventory Data 0.31 -0.07 23.46 0.78 5.92 -90.47 132.25 273 -0.03
18 SHFCCOPD index
Shanghai Futures Exchange Copper Deliverable Stocks 0.50 0.11 26.16 0.25 0.12 -64.75 69.15 397 0.01
19 lfca Index LME CNCL WRNT COPPER TOT 23441.12 15782.00 21573.94 1.91 4.38 914.00 125044.00 333 -0.23
20 CNIVCOPP Index
China Import Commodity Unwrought Copper & Copper Products 242.39 223.86 85.03 0.98 0.54 73.77 508.94 372 0.45
21 CNIVCOPA Index
China Import Commodity Volume - Unwrought Copper & Copper Alloy 0.49 2.71 21.87 -0.70 0.92 -76.85 42.07 418 -0.06
22 CNIVCOPR Index
China Import Commodity Copper Products 82.11 83.72 14.99 -0.58 0.30 30.88 109.26 384 -0.19
23 CHIVCORE Index
China Import Commodity Value - Copper Ores & Concentrates 2.47 0.93 23.55 -0.07 0.68 -76.46 70.82 420 0.04
24 CHIVCOPR Index
China Import Commodity Value - Copper Products 540.73 592.49 148.73 -0.46 -0.96 199.48 799.67 414 0.87
25 CHIVSCPR Index
China Import Commodity Value - Scrap Copper 2.36 1.55 25.95 0.03 1.11 -78.36 84.64 418 0.00
26 CNMDCCCD Index
Implied % of China Construction Copper Demand 99.86 99.86 0.03 -0.13 0.02 99.79 99.92 447 -0.63
27 CNMDCRCY Index
YTD China Refined Copper Apparent Consumption mt 549433.00 586140.00 139795.62 -0.21 -0.75 291710.91 859489.00 449 -0.01
28 MEPRMCAS Index
Global Mined Copper Production - Asia 220839.63 220836.00 25328.95 -0.22 0.16 148894.00 276330.00 387 0.37
29 MEPRMCME Index
Global Mined Copper Production - Middle East 3.25 0.00 21.15 6.01 43.37 -38.44 172.38 388 0.15
30 MEPRMCNA Index
Global Mined Copper Production - North America 173274.17 173216.00 12156.25 -0.35 -0.24 139116.00 196851.00 387 0.07
31 MEPRMCSA Index
Global Mined Copper Production - South & Central America 558747.36 565338.50 52318.34 -0.22 -0.24 423371.00 677839.00 387 0.52
32 MEPRRCAF Index
Global Refined Copper Production - Africa 0.91 0.93 8.34 0.08 0.45 -21.41 28.22 388 -0.02
33 MEPRRCAS Index
Global Refined Copper Production - Asia 0.57 0.68 3.46 -0.04 1.76 -10.53 11.00 388 0.01
34 MEPRRCEU Index
Global Refined Copper Production - Europe 0.09 -0.16 2.13 0.38 0.75 -5.72 6.42 388 0.07
35 MEPRRCME Index
Global Refined Copper Production - Middle East -0.17 0.00 20.08 -0.43 15.16 -112.74 94.29 388 0.05
36 MEPRRCNA Index
Global Refined Copper Production - North America -0.25 0.22 5.82 -0.16 0.12 -17.58 13.56 388 0.03
37 MEPRRCOC Index
Global Refined Copper Production - Oceania 39070.62 39667.00 4570.40 -0.68 1.52 24333.00 48000.00 387 -0.34
38 MEPRRCSA Index
Global Refined Copper Production - South & Central America 0.04 -1.08 6.40 0.20 -0.32 -14.71 15.47 388 0.00
39 MHMCWC Index
Mongolian Production of Major Commodities Copper with Concentrate 47.26 43.60 39.24 10.63 115.13 29.50 471.00 385 0.07
40 SAMPCPPM Index
South Africa Mining Production Volume Index 2005=100 Copper NSA MoM 1.92 -1.19 22.91 2.32 12.24 -59.44 171.21 121 -0.03
41 ZMCMCPPM Index
Zambia Copper production MoM 1.77 -1.63 12.66 0.79 0.38 -23.99 35.28 457 -0.08
42 MEPRCDAF Index
Global Refined Copper Demand - Africa 17798.19 17021.50 6142.36 0.80 0.31 5330.00 35438.00 387 0.51
43 MEPRCDAS Index
Global Refined Copper Demand - Asia 0.59 0.34 7.84 0.77 3.09 -16.74 37.11 388 -0.02
44 MEPRCDEU Index
Global Refined Copper Demand - Europe -0.14 -1.31 11.88 0.31 -0.02 -24.43 35.05 388 0.00
45 MEPRCDNA Index
Global Refined Copper Demand - North America -0.34 -1.36 9.18 0.58 0.24 -17.82 27.88 388 0.00
46 MEPRCDOC Index
Global Refined Copper Demand - Oceania 12685.01 13100.00 4719.73 0.03 0.14 918.00 25925.00 387 -0.24
47 MEPRCDSA Index
Global Refined Copper Demand - South & Central America 46193.16 46323.00 6679.45 -0.16 0.50 26999.00 65619.00 387 0.65
48 MEPRCOCI Index
Known Copper Ore & Concentrate Inventories 134180.80 134522.50 23125.93 0.00 -0.60 76306.00 184598.00 387 -0.50
49 SPWIICP Index S&P World Commodity Copper - Grade A Index ER 1.09 1.87 7.54 -0.91 4.92 -35.94 23.48 360 0.07
50 SPWIICTR Index
S&P World Commodity Copper - Grade A Index TR 1.28 1.98 7.53 -0.91 5.04 -35.85 23.86 360 0.07
51 SPWIIC Index S&P World Commodity Copper - Grade A Index 1.01 2.04 7.43 -0.95 5.20 -36.04 22.90 360 0.06
52 ZMCMCOPP Index Zambia Copper Prices 324.57 342.75 82.51 -0.75 -0.23 139.05 447.59 456 1.00
53 MEPRCCOW Index
Inventory Statistics- COMEX Copper st - On Warrants - % of Total Inventory 19560.37 97.54 130577.01 6.78 46.00 85.41 895497.56 459 0.02
54 MEPRCCCW Index
Inventory Statistics-COMEX Copper st - Cancelled Warrants - % of Total Inventory 10086.25 2.81 67628.21 6.78 46.00 0.02 463750.10 459 0.02
55 CNIVCORE Index
China Import Commodity Volume - Copper Ore & Concentrate 0.83 2.87 23.48 -0.42 1.31 -74.72 63.60 421 -0.03
56 CNMDCRCA Index
China Refined Copper Apparent Consumption mt 549433.00 586140.00 139795.62 -0.21 -0.75 291710.91 859489.00 449 -0.01
57 MEPRCICW Index
LME Copper Inventories mt - On/Cancelled Warrants 2061.47 2116.02 182.33 -0.68 -0.84 1710.69 2270.59 495 0.31
58 MEPRCUCW Index
Inventory Statistics- LME Copper mt - Cancelled Warrants - % of Total Inventory 6.91 6.51 4.62 1.15 1.48 0.33 21.45 432 0.12
59 MEPRCUOW Index
Inventory Statistics- LME Copper mt - On Warrants - % of Total Inventory 93.09 93.49 4.62 -1.15 1.48 78.55 99.67 432 -0.12
60 MEPRCTOI Index
LME Copper Total Open Interest Number of Contracts 666745.43 658012.77 53580.25 0.67 0.00 573964.47 794079.85 459 0.16
61 MEPRCUWP Index
Copper Wire Pricing USd per mt 36677.64 38598.84 8006.53 -0.82 -0.20 16995.76 46641.64 459 0.91
62 CEI1CNCL Index
CFTC CEI High-Grade Copper Non-Commercial Long Contracts/Futures Only 20321.47 15781.00 12962.72 1.13 0.72 1095.00 63843.00 276 0.41
63 CEI1CNCS Index
CFTC CEI High-Grade Copper Non-Commercial Short Contracts/Futures Only 15815.44 14455.00 10290.18 0.40 -1.05 930.00 39368.00 276 0.52
64 CEI1CCOL Index
CFTC CEI High-Grade Copper Commercial Long Contracts/Futures Only 0.42 0.47 14.57 0.09 3.14 -64.37 50.32 277 0.00
65 CEI1CTLL Index
CFTC CEI High-Grade Copper Total Long Contracts/Futures Only 0.62 0.92 10.25 0.27 0.80 -26.43 43.56 277 -0.03
66 vix index Chicago Board Options Exchange SPX Volatility Index 20.58 19.26 7.98 1.81 5.59 10.82 62.64 240 -0.01
67 BDIY Index Baltic Dry Index 2118.73 1471.00 1791.67 2.64 7.65 572.00 10844.00 180 0.60
68 CRB RIND Index
Commodity Research Bureau BLS/US Spot Raw Industrials 0.17 0.08 2.45 -0.64 4.14 -13.30 7.23 137 0.13
69 CPURNSA Index
US CPI Urban Consumers NSA 0.36 0.32 0.37 -0.17 3.65 -1.93 1.79 1 -0.13
70 EUITEMU Index
Eurostat Industrial Production Eurozone Industry Ex Construction SA 0.10 0.12 1.08 -0.53 1.86 -4.15 3.34 181 0.02
71 IP Index US Industrial Production 2007=100 SA 0.19 0.24 0.76 -1.14 4.94 -4.21 2.38 1 -0.06
72 CHVAIOY Index
China Value Added of Industry YoY 13.48 13.50 5.25 -0.92 6.90 -21.10 29.40 241 0.20
73 EUR CURNCY EURUSD Spot Exchange Rate - Price of 1 EUR in USD -0.01 -0.11 2.52 -0.01 0.14 -7.96 6.67 61 0.03
74 USTWBROA INDEX
US Trade Weighted Broad Dollar January 1997=100 0.23 0.24 1.32 0.21 1.11 -4.17 6.43 37 -0.14
75 SPX Index S&P 500 Index 0.53 0.77 3.79 -0.99 3.87 -22.81 11.35 1 -0.02
76 INDU Index Dow Jones Industrial Average 0.55 0.81 3.75 -0.82 2.70 -19.15 10.12 1 -0.01
77 US0003M Index BBA LIBOR USD 3 Month -0.85 -0.08 8.49 -1.64 10.69 -57.71 38.62 180 0.02
78 CLP Curncy USDCLP Spot Exchange Rate - Price of 1 USD in CLP 0.49 0.28 2.60 1.50 8.29 -7.49 16.29 177 -0.15
79 PEN Curncy USDPEN Spot Exchange Rate - Price of 1 USD in PEN 2.94 3.02 0.52 -0.78 -0.10 1.29 3.62 271 -0.03
80 CNFREXP$ Index China Export Trade 1.50 2.03 20.03 -2.14 10.67 -121.46 54.94 241 0.01
81 CNFRIMP$ Index China Import Trade 1.46 3.01 24.39 -1.95 9.92 -142.76 70.40 241 0.00
82 FDTR Index Federal Funds Target Rate US 5.95 5.50 3.63 0.76 1.10 0.25 20.00 12 -0.42
83 USGG10YR Index
US Generic Govt 10 Year Yield -0.28 -0.44 4.80 -0.99 9.22 -37.62 17.77 1 -0.06
84 1636659 Index
IMF Euro Area Industrial Production SA by Reporting Country 0.06 0.15 1.11 -0.98 2.34 -4.17 2.36 337 0.05
85 djushg Index
Dow Jones US Household Goods & Home Construction Index 0.55 0.83 4.07 -1.35 5.21 -21.97 11.35 265 -0.04
86 AAUKY US Equity Anglo American PLC 0.84 1.15 10.78 -0.82 2.13 -46.37 28.70 352 -0.09
87 BHP UN Equity BHP Billiton Ltd 0.92 0.80 8.88 -0.41 1.58 -39.73 26.57 208 -0.03
88 XSRAF US Equity Xstrata PLC 1.31 2.87 14.98 -1.71 5.30 -62.16 29.14 406 -0.18
89 RIO US Equity Rio Tinto PLC 0.74 0.57 10.31 -1.14 6.02 -61.99 27.33 245 -0.05
90 FCX US Equity Freeport-McMoRan Copper & Gold Inc 0.64 1.45 14.03 -0.80 2.68 -67.11 34.22 306 -0.05
91 SCO US Equity Southern Copper Corp 1.24 -0.34 12.38 -0.10 0.41 -37.16 34.80 312 -0.05
92 SBCRP Index Citigroup BIG Corporate 0.75 0.75 2.01 0.27 4.77 -7.75 11.48 120 -0.09
93 SBBIG Index Citigroup BIG Bond 0.71 0.74 1.60 0.71 5.36 -6.01 10.63 120 -0.08
94 SBWBL Index Citigroup WorldBIG Local Currency 0.39 0.49 0.82 -0.06 0.44 -2.11 3.25 348 -0.04
95 SBCI Index Citigroup BIG Industrial 0.76 0.84 1.98 0.08 5.27 -9.84 11.22 120 -0.07
96 SBGT Index Citigroup Treas Local Currency 0.68 0.65 1.61 0.34 2.27 -5.23 9.00 120 -0.06
97 SBEB13 Index Citigroup EuroBIG 1 to 3 Year 0.31 0.30 0.40 0.16 -0.31 -0.63 1.38 348 -0.11
98 SBWBINL Index
Citigroup WorldBIG Industrial Local Currency 0.51 0.64 1.38 -1.20 7.22 -7.58 4.76 348 -0.05
99 LMAHDS03 LME Comdty LME ALUMINUM 3MO ($) 0.13 -0.36 6.05 -0.16 1.13 -24.59 17.78 209 -0.03
100 LMAADS03 LME Comdty LME ALUM ALY 3MO ($) 0.29 0.22 5.59 -1.00 7.21 -35.40 19.52 273 -0.06
101 LA1 Comdty Generic 1st 'LA' Future 0.17 -0.36 5.82 0.06 0.42 -17.67 15.66 330 -0.03
102 LA3 Comdty Generic 3rd 'LA' Future 0.18 -0.15 5.67 0.00 0.53 -17.26 15.10 330 -0.03
103 LA6 Comdty Generic 6th 'LA' Future 0.20 -0.11 5.38 -0.03 0.88 -16.86 14.48 330 -0.03
104 LA12 Comdty Generic 12th 'LA' Future 0.22 -0.16 4.99 -0.09 1.34 -15.93 13.60 330 -0.01
105 US.MONEY.M2 FED Index
United States Money Supply M2 0.56 0.53 0.39 0.91 3.20 -0.40 2.74 2 -0.14
106 US.MONEY.M1 FED Index
United States Money Supply M1 0.47 0.45 0.74 1.65 12.17 -3.25 5.99 2 0.04
107 FEDL01 Index Federal Funds Effective Rate US -0.94 0.00 10.11 -3.31 27.27 -91.11 38.30 2 -0.08
108 US.CCONF CNFB Index
United States Consumer Confidence -0.08 -0.05 8.40 -0.26 5.77 -45.90 41.66 89 -0.05
109 PPI INDX Index US PPI By Processing Stage Finished Goods Total SA 0.32 0.29 0.61 -0.08 4.89 -3.08 3.46 2 0.05
110 US.HHSPNR BEA Index
United States Consumer Spending (Real) 0.23 0.23 0.39 0.38 4.17 -1.07 2.38 300 -0.22
111 NHSPSTOT Index
US New Privately Owned Housing Units Started by Structure Total SAAR -0.12 -0.17 7.84 -0.08 0.99 -30.67 25.67 2 -0.06
112 USGG5YR Index US Generic Govt 5 Year Yield -0.42 -0.47 7.64 -0.13 3.90 -39.01 31.35 2 -0.09
113 USGG2YR Index US Generic Govt 2 Year Yield -0.74 -0.42 10.52 -0.31 5.69 -57.74 53.77 77 -0.11
114 lp1 Comdty Generic 1st 'LP' Future 0.65 1.46 7.16 -0.77 4.97 -35.92 23.05 330 0.09
115 lp2 Comdty Generic 2nd 'LP' Future 0.70 1.34 7.11 -0.79 5.17 -36.11 22.94 331 0.09
116 lp3 Comdty Generic 3rd 'LP' Future 0.71 1.24 7.06 -0.79 5.16 -35.86 22.80 331 0.09
117 lp4 Comdty Generic 4th 'LP' Future 0.72 1.17 7.01 -0.79 5.19 -35.55 22.60 331 0.09
118 lp5 Comdty Generic 5th 'LP' Future 0.72 1.07 6.95 -0.81 5.26 -35.30 22.54 331 0.09
119 lp6 Comdty Generic 6th 'LP' Future 0.73 1.03 6.89 -0.82 5.30 -35.05 22.51 331 0.09
120 lp7 Comdty Generic 7th 'LP' Future 0.73 1.25 6.83 -0.85 5.35 -34.79 22.48 331 0.09
121 lp8 Comdty Generic 8th 'LP' Future 0.74 1.34 6.77 -0.86 5.40 -34.53 22.46 331 0.10
122 lp9 Comdty Generic 9th 'LP' Future 0.74 1.28 6.72 -0.88 5.44 -34.28 22.49 331 0.10
123 lp10 Comdty Generic 10th 'LP' Future 0.74 1.14 6.66 -0.89 5.49 -34.01 22.53 331 0.10
124 lp11 Comdty Generic 11th 'LP' Future 0.75 1.12 6.61 -0.91 5.53 -33.76 22.59 331 0.10
125 lp12 Comdty Generic 12th 'LP' Future 0.75 1.13 6.57 -0.92 5.58 -33.51 22.65 331 0.10
126 lp13 Comdty Generic 13th 'LP' Future 0.75 1.09 6.52 -0.93 5.62 -33.26 22.74 331 0.10
127 lp14 Comdty Generic 14th 'LP' Future 0.76 1.09 6.48 -0.94 5.65 -33.02 22.84 331 0.11
128 lp15 Comdty Generic 15th 'LP' Future 0.77 1.09 6.47 -0.94 5.62 -32.78 22.94 332 0.11
129 lp16 Comdty Generic 16th 'LP' Future 0.76 1.16 6.43 -0.95 5.56 -32.55 23.05 331 0.11
130 lp17 Comdty Generic 17th 'LP' Future 0.76 1.04 6.37 -0.95 5.72 -32.32 23.18 331 0.11
131 lp18 Comdty Generic 18th 'LP' Future 0.77 1.03 6.34 -0.95 5.75 -32.09 23.31 331 0.11
132 lp19 Comdty Generic 19th 'LP' Future 0.98 1.59 6.48 -1.03 5.62 -31.86 23.44 344 0.09
133 lp20 Comdty Generic 20th 'LP' Future 0.98 1.56 6.45 -1.04 5.66 -31.63 23.57 344 0.09
134 lp21 Comdty Generic 21st 'LP' Future 0.98 1.53 6.42 -1.03 5.69 -31.40 23.74 344 0.09
135 lp22 Comdty Generic 22nd 'LP' Future 0.98 1.49 6.39 -1.03 5.72 -31.17 23.91 344 0.09
136 lp23 Comdty Generic 23rd 'LP' Future 0.98 1.46 6.36 -1.03 5.75 -30.95 24.07 344 0.10
137 lp24 Comdty Generic 24th 'LP' Future 0.98 1.54 6.34 -1.03 5.77 -30.73 24.21 344 0.10
138 lp30 Comdty Generic 30th 'LP' Future 1.45 2.65 7.07 -1.18 4.88 -29.42 25.09 393 0.04
139 lp40 Comdty Generic 40th 'LP' Future 1.43 2.20 6.85 -1.16 5.18 -27.33 26.03 393 0.06
140 lp50 Comdty Generic 50th 'LP' Future 1.41 1.96 6.72 -1.16 5.39 -25.47 26.31 393 0.08
141 lp60 Comdty Generic 60th 'LP' Future 1.39 1.83 6.67 -1.17 5.41 -26.46 25.86 393 0.10
142 lp70 Comdty Generic 70th 'LP' Future 0.57 1.88 7.22 -1.27 1.42 -19.21 9.88 465 0.23
Figure A.1: SPSS Modeler Stream Design (CHAID Decision Tree, Scenario 3)
Figure A.2: SPSS Modeler Stream Design (ARIMA)