Applied Mathematical Sciences, Vol. 8, 2014, no. 62, 3051-3062
HIKARI Ltd, www.m-hikari.com
http://dx.doi.org/10.12988/ams.2014.44270
A Hybrid GMDH and Box-Jenkins Models in Time Series Forecasting
Ani Shabri
Department of Mathematics
Faculty of Science
University Technology of Malaysia
81310 Skudai, Johor, Malaysia
Ruhaidah Samsudin
Department of Software Engineering
Faculty of Computing
University Technology of Malaysia
81310 Skudai, Johor, Malaysia
Copyright © 2014 Ani Shabri and Ruhaidah Samsudin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

The group method of data handling (GMDH) technique and the Box-Jenkins method are two well-known mathematical modeling approaches for time series forecasting. In this paper, we introduce a hybrid model that combines the GMDH method with the Box-Jenkins method to model time series data. The Box-Jenkins method is used to determine the useful input variables of the GMDH method, and the GMDH method then performs the time series forecasting. The lynx series, which records the number of lynx trapped per year, is used in this study to demonstrate the effectiveness of the forecasting model. The results obtained by the proposed GMDH model were compared with those of the Box-Jenkins and artificial neural network (ANN) models. The comparison shows that the GMDH model performs better than the two other models in terms of mean absolute error (MAE) and mean squared error (MSE). It also indicates that GMDH is a promising technique in time series forecasting.
Keywords: Group method of data handling technique, time series forecasting, neural networks, Box-Jenkins method
1. Introduction
The most comprehensive of the popular and widely known statistical models used over the last four decades for time series forecasting is the Box-Jenkins method. However, the Box-Jenkins model is a class of linear models and thus can only capture the linear features of a time series [1], whereas many time series are strongly nonlinear and chaotic.
More advanced nonlinear methods such as neural networks have frequently been applied to nonlinear and chaotic time series modeling in recent years [2, 3, 4, 5, 6, 7]. The artificial neural network (ANN) provides an attractive alternative tool for forecasting researchers and practitioners and has demonstrated its nonlinear modeling capability in time series forecasting.
One related model family is the group method of data handling (GMDH) algorithm, first developed by [8]. This model has been used successfully to deal with uncertain, linear or nonlinear systems in a wide range of disciplines such as engineering, science, economics, medical diagnostics, signal processing and control systems [9, 10, 11].
Improving forecasting accuracy, especially for time series, is an important yet often difficult task facing many decision makers in a wide range of areas. Combining several models, or using hybrid models, can be an effective way to improve forecasting performance. Several studies have suggested hybrid models, such as combinations of the ARIMA and ANN models [1, 12, 13, 14], the GMDH and ANN models [14], GMDH and differential evolution [15], and GMDH and LSSVM [16]. More recently, a new class of neural network combining the ANN model with the Box-Jenkins (BJ) approach was explored for modeling time series [17, 18, 19, 20]. The BJ approach was used to determine the most important variables to serve as nodes in the input layer, and the ANN was then used to model the time series data. The results showed that the hybrid model can be an effective way to improve predictions when the variables of the ANN input layer are chosen based on the BJ approach rather than on traditional methods.
In this paper, a new hybrid GMDH-type algorithm is proposed by combining the GMDH model with the BJ approach to model time series data. The BJ approach is used to generate the most useful variables from the time series under study as nodes in the input layer. A GMDH model is then used to model the data generated by the Box-Jenkins model and to predict future values of the time series. To verify the applicability of this approach, the Canadian lynx data set is used in this study.
2. Forecasting Methodology
This section presents the ARIMA, ANN and GMDH models used for modeling time series. These models were chosen because they have been widely and successfully used in time series forecasting.
2.1 Box-Jenkins Approach
The Box-Jenkins method, introduced by [21], has been one of the most popular approaches to time series analysis and prediction. The general ARIMA model is represented in the following way:

\phi_p(B)(1 - B)^d y_t = \theta_q(B) a_t   (1)
where \phi_p(B) and \theta_q(B) are polynomials of order p and q, respectively, and d is the number of regular differences. The random errors a_t are assumed to be independently and identically distributed with a mean of zero and a constant variance \sigma^2. The Box-Jenkins methodology is divided into four steps: identification, estimation, diagnostic checking and forecasting. In the identification step, a transformation is often needed to make the time series stationary. The next step is choosing a tentative model by matching both the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the stationary series. Once a tentative model is identified, the parameters of the model are estimated. The last step of model building is the diagnostic checking of model adequacy, essentially checking whether the model assumptions about the errors a_t are satisfied. The process is repeated until a satisfactory model is selected. The forecasting model is then used to compute fitted and forecast values.
2.2 The Neural Network Forecasting Model
Artificial neural networks (ANN), which serve as flexible computational frameworks, have been extensively studied and have gained much popularity in many application areas, including science, psychology and engineering. The single-hidden-layer feed-forward network is the most widely used ANN and is well suited to modeling and forecasting time series. In a feed-forward ANN, the neurons are arranged in layers. The first layer is the input layer, where the data are introduced to the network; the second layer is the hidden layer, where the data are processed; and the last layer is the output layer, where the results for a given input are produced. The structure of a feed-forward ANN is shown in Fig. 1.
Fig. 1 Architecture of a three-layer feed-forward back-propagation ANN
The relationship between the input observations (y_{t-1}, y_{t-2}, ..., y_{t-p}) and the output value y_t, assuming a linear output neuron, is given by

y_t = g\left( b_0 + \sum_{j=1}^{q} w_j f\left( b_j + \sum_{i=1}^{p} w_{ij} y_{t-i} \right) \right)   (2)

where b_j (j = 0, 1, 2, ..., q) is the bias on the jth unit, w_{ij} (i = 0, 1, 2, ..., p; j = 0, 1, 2, ..., q) and w_j are connection weights, and f and g are the hidden- and output-layer activation functions, respectively [22]. Several optimization algorithms can be used to train the ANN. Among the available training algorithms, back-propagation has been the most popular and most widely used [23]. In a back-propagation network, the weights and bias values are initially chosen at random and then adjusted by a training process. The goal of the training algorithm is to minimize the global error.
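The training procedure can be sketched as follows: a single-hidden-layer network in the shape of Eq. (2), with tanh hidden units and a linear output, trained by full-batch gradient descent (the back-propagation idea). The network sizes, learning rate and toy data are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 2, 5                              # input lags, hidden neurons
W = 0.5 * rng.standard_normal((q, p))    # input-to-hidden weights w_ij
b = np.zeros(q)                          # hidden biases b_j
v = 0.5 * rng.standard_normal(q)         # hidden-to-output weights w_j
b0 = 0.0                                 # output bias

# Toy target: y_t = 0.6*y_{t-1} - 0.3*y_{t-2} over random lag pairs
X = rng.standard_normal((200, p))
y = 0.6 * X[:, 0] - 0.3 * X[:, 1]

lr = 0.05
for _ in range(2000):
    h = np.tanh(X @ W.T + b)             # hidden activations f(...)
    out = h @ v + b0                     # linear output g(...)
    err = out - y
    # Backpropagate the (half) mean-squared-error gradient
    gv = h.T @ err / len(y)
    gb0 = err.mean()
    gh = np.outer(err, v) * (1 - h ** 2)
    gW = gh.T @ X / len(y)
    gb = gh.mean(axis=0)
    v -= lr * gv; b0 -= lr * gb0; W -= lr * gW; b -= lr * gb

mse = float(np.mean((np.tanh(X @ W.T + b) @ v + b0 - y) ** 2))
```

After training, the fitted network's MSE on this toy problem is small, illustrating that the random initial weights are "fixed by the results of a training process" as described above.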
2.3 The Group Method of Data Handling
The GMDH method was originally formulated to fit high-order regression polynomials, especially for modeling and classification problems. The general connection between the input and output variables can be expressed by a complicated polynomial series in the form of the Volterra series, known as the Kolmogorov-Gabor polynomial [8]:
y = a_0 + \sum_{i=1}^{M} a_i x_i + \sum_{i=1}^{M} \sum_{j=1}^{M} a_{ij} x_i x_j + \sum_{i=1}^{M} \sum_{j=1}^{M} \sum_{k=1}^{M} a_{ijk} x_i x_j x_k + \cdots   (3)
where (x_i, x_j, x_k, ...) is the vector of input variables, M is the number of inputs, and (a_0, a_i, a_{ij}, a_{ijk}, ...) is the vector of coefficients. However, for most applications the quadratic form, called the partial description (PD), of only two variables is used:
y = G(x_i, x_j) = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2   (4)
to predict the output. The input variables are set to (x_1, x_2, x_3, ..., x_M) and the output to {y}. The coefficients a_i for i = 0, 1, ..., 5 are determined using the least squares method. The design procedure of the GMDH consists of the following steps.
Step 1: Select the input variables X = {x_1, x_2, ..., x_M}, where M is the total number of inputs. The data are separated into training and testing data sets. The training data set is used to construct a GMDH model and the testing data set is used to evaluate the estimated GMDH model.
Step 2: Construct L = M(M - 1)/2 new variables Z = {z_1, z_2, ..., z_L} in the training data set for all pairs of independent variables and choose a PD for the GMDH. The conventional GMDH is developed using the polynomial

z_l = G(x_i, x_j) = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2,  l = 1, 2, ..., L   (5)
as the PD. In this study, a PD structure based on a radial basis function (RBF) of the polynomial is proposed for constructing the GMDH. The RBF model takes the form

z_l = e^{-g_l^2},  where g_l = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2   (6)
Step 3: Estimate the coefficients of the PDs. The coefficient vectors of the PDs are determined using the least squares method.

Step 4: Determine the new input variables for the next layer. There are several selection criteria for identifying the input variables of the next layer. In our study, we used two criteria. Under the first criterion, the single best neuron z' out of the L neurons is identified according to the mean squared error (MSE) on the testing data set, such that
\mathrm{MSE}_k = \frac{1}{n_T} \sum_{i=1}^{n_T} (y_i - z_{i,k})^2,  k = 1, 2, ..., L   (7)

where n_T is the number of observations in the testing data set. If the smallest MSE of z' is less than a threshold, the process terminates; otherwise the new input variables are set to (x_1, x_2, x_3, ..., x_M, z').
Under the second criterion, the least effective variables are eliminated: the columns of X = {x_1, x_2, ..., x_k} are replaced by those columns of Z = {z_1, z_2, ..., z_k} that best estimate the dependent variable y in the testing data set. This is captured by the assignment

x_1 = z_1, x_2 = z_2, ..., x_k = z_k   (8)

where k is the total number of retained new input variables.
Step 5: Check the stopping criterion. The lowest value of the selection criterion obtained for the GMDH model at the current layer is compared with the smallest value obtained at the previous layer. If an improvement is achieved, steps 1 to 5 are repeated; otherwise, the iterations terminate and a realization of the network is complete. Once the final layer has been determined, only the node with the best performance is selected as the output node; the remaining nodes in that layer are discarded. Finally, the GMDH model is obtained.
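Under the stated assumptions (quadratic PDs of Eq. (5) fitted by least squares, testing-set MSE of Eq. (7) as the selection criterion, and the second criterion of replacing the inputs with the best few new variables), the layer-building loop can be sketched as follows; the function names, `keep` count and toy target are ours, not the paper's:

```python
import numpy as np
from itertools import combinations

def pd_features(xi, xj):
    """Quadratic partial-description regressors of Eq. (5)."""
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi ** 2, xj ** 2])

def gmdh_layer(Xtr, ytr, Xte, yte, keep=4):
    """Build one GMDH layer: fit a PD for every input pair on the training
    set, score it on the testing set, keep the best few as new variables."""
    cands = []
    for i, j in combinations(range(Xtr.shape[1]), 2):
        A = pd_features(Xtr[:, i], Xtr[:, j])
        coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)   # Step 3
        zte = pd_features(Xte[:, i], Xte[:, j]) @ coef
        mse = np.mean((yte - zte) ** 2)                   # criterion of Eq. (7)
        cands.append((mse, A @ coef, zte))
    cands.sort(key=lambda c: c[0])
    best = cands[:keep]                                   # Step 4, 2nd criterion
    return (np.column_stack([c[1] for c in best]),
            np.column_stack([c[2] for c in best]), best[0][0])

# Usage on a toy nonlinear target
rng = np.random.default_rng(2)
Xtr, Xte = rng.standard_normal((120, 4)), rng.standard_normal((60, 4))
f = lambda X: X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2
ytr, yte = f(Xtr), f(Xte)

mse_prev = np.inf
for layer in range(3):            # Step 5: stop when the criterion stops improving
    Xtr, Xte, mse = gmdh_layer(Xtr, ytr, Xte, yte)
    if mse >= mse_prev:
        break
    mse_prev = mse
```

On this toy target, no single first-layer PD can capture both the x_1 x_2 and x_3^2 terms, but the second layer, which combines first-layer outputs, drives the testing MSE close to zero — the self-organizing behavior the steps above describe.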
3. Time series prediction by ANN and GMDH
In the classical approach to the time series forecasting problem, the number of input nodes of a nonlinear model such as the ANN or GMDH is equal to the number of lagged variables (y_{t-1}, y_{t-2}, ..., y_{t-p}), where p is the number of chosen lags. The output y_t, the predicted value of the time series, is defined as

y_t = f(y_{t-1}, y_{t-2}, ..., y_{t-p})   (9)
However, there is currently no systematic way to determine the optimal number of lags p. An alternative way to determine the number of nodes in the input layer is based on the Box-Jenkins model. Unlike the previous method, where the number of lags p is chosen on an ad hoc basis or by traditional methods, the lagged variables obtained from the Box-Jenkins analysis are the most important variables to use as input nodes in the input layer of the ANN or GMDH model. In our proposed model, a time series model based on the Box-Jenkins methodology is considered as a nonlinear function of several past observations and random errors, as follows:
y_t = f[(y_{t-1}, y_{t-2}, ..., y_{t-p}), (a_{t-1}, a_{t-2}, ..., a_{t-q})]   (10)

where f is a nonlinear function determined by the ANN or GMDH.
4. Empirical Results
In this section, we illustrate the hybrid GMDH-type algorithm and show its performance in forecasting the Canadian lynx data. The lynx series contains the number of lynx trapped per year in the Mackenzie River district of Northern Canada. The data set has 114 observations, corresponding to the period 1821 to 1934, and has been extensively analyzed in the time series literature with a focus on nonlinear modeling; it is one of the most frequently used time series. The data are plotted in Fig. 2, which shows a periodicity of approximately 10 years. The series appears stationary in the mean but not in the variance. The lynx series has been studied by many researchers: the first time series analysis was carried out by [24], and more recently [25] fit an AR(2) model to the logged data, while [26], [1], and [20] found the best-fitting model to be an AR(12).
Fig.2 Canadian Lynx data series (1821-1934)
The authors used the R package to apply the Box-Jenkins technique. Using the Box-Jenkins technique on the lynx time series, two models, AR(2) and AR(12), were considered, and their statistical results are compared in Table 1 based on mean squared error (MSE), the Akaike Information Criterion (AIC) and the Schwarz Information Criterion (SIC). We used the logarithm with base 10, which makes the lynx data more symmetrical.
Table 1: Comparison of ARIMA models' statistical results

Model    MSE      AIC      SIC
AR(2)    0.0179   -1.7072  -1.6550
AR(12)   0.0238   -1.3834  -1.0708

Note: Boldface in the original indicates the best statistical results.
Table 1 shows that the lowest MSE, AIC and SIC statistics of 0.0179, -1.7072 and -1.6550, respectively, were observed for AR(2). Hence, according to these performance indices, AR(2) is selected as the appropriate ARIMA model for the lynx series. The identified AR(2) model takes the following form:

y_t = 1.0652 + 1.3693 y_{t-1} - 0.7385 y_{t-2} + a_t   (11)
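To make the mechanics of Eq. (11) concrete, the snippet below produces a one-step-ahead forecast on the log10 scale and back-transforms it to a trap count. The two lagged counts are made-up illustrative values, not observations from the lynx series:

```python
from math import log10

# Coefficients of the fitted AR(2) in Eq. (11), applied on the log10 scale
phi0, phi1, phi2 = 1.0652, 1.3693, -0.7385
y_lag1, y_lag2 = log10(2657), log10(3396)      # hypothetical last two counts
y_hat = phi0 + phi1 * y_lag1 + phi2 * y_lag2   # forecast of log10(lynx)
count_hat = 10 ** y_hat                        # back-transformed forecast
```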
In designing the ANN and GMDH models, one must determine the number of input nodes and the number of layers. The selection of the number of inputs, which corresponds to the number of variables, plays an important role in many successful applications of ANN and GMDH models.
To keep the ANN and GMDH models simple and to reduce the computational burden, only the lagged variables obtained from the Box-Jenkins analysis are used in the input layer. In this study, based on the Box-Jenkins linear model of Eq. (11), the time series is considered as a nonlinear function of several past observations:

y_t = f(y_{t-1}, y_{t-2})   (12)

where f is a nonlinear function determined by the ANN and GMDH models. The nodes in the input layer consist of the lagged variables y_{t-1} and y_{t-2} obtained from the Box-Jenkins analysis.
The ANN model was implemented with the neural network toolbox in MATLAB. The hidden nodes use the hyperbolic tangent sigmoid transfer function and the output layer uses the linear function, because prediction performance is best with these transfer functions. The network was trained for 5000 epochs using the conjugate gradient descent back-propagation algorithm with a learning rate of 0.001 and a momentum coefficient of 0.9. The optimal number of neurons in the hidden layer was identified using several practical guidelines, including I/2 [27], 2I [28] and 2I+1 [29], where I is the number of inputs. In this study, a trial-and-error method was performed to optimize the number of neurons in the hidden layer. The best neural network architecture found for the lynx series consists of 2 input, 5 hidden and 1 output neurons (2x5x1).
The GMDH model works by building successive layers with complex connections created from second-order polynomial and exponential functions. The first layer is created by computing regressions of the input variables, and the second layer by computing regressions of the first layer's outputs. Only the best nodes are chosen at each layer, and this process continues until a pre-specified selection criterion is met. The optimal number of neurons in the hidden layer of the GMDH model was identified by a trial-and-error procedure, varying the number of hidden neurons from 1 to 10 for each model. The best model structure was determined according to the performance index (MSE).
Table 2 shows the performance results of the ARIMA, ANN and GMDH approaches based on mean squared error (MSE), mean absolute error (MAE) and the squared correlation coefficient (R2).
Table 2: Comparison of ARIMA, ANN and GMDH

Model         MSE      MAE      R2
AR(2)         0.0179   0.1166   0.8959
ANN (2x5x1)   0.0107   0.0684   0.9351
GMDH          0.0074   0.0624   0.9589

Note: Boldface in the original indicates the best statistical results.
From Table 2, taking the MSE, MAE and R2 as performance indicators, the experimental results clearly demonstrate that the GMDH outperforms the other models. Fig. 3 shows the actual and forecasted values.
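The three evaluation measures used above can be computed directly from the observed and predicted series. A small sketch with made-up numbers, taking R2 as the squared Pearson correlation (our reading of the paper's "correlation coefficient"):

```python
import numpy as np

def metrics(obs, pred):
    """MSE, MAE and squared correlation (R^2) between observed and predicted."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mse = float(np.mean((obs - pred) ** 2))
    mae = float(np.mean(np.abs(obs - pred)))
    r2 = float(np.corrcoef(obs, pred)[0, 1] ** 2)
    return mse, mae, r2

# Usage with illustrative log10-scale values (not the lynx test set)
mse, mae, r2 = metrics([2.5, 3.0, 3.5, 3.2], [2.6, 2.9, 3.4, 3.3])
```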
Table 3 shows the performance of our proposed model and of other models studied in the previous literature. To measure forecasting performance, MSE and MAE are employed as performance indicators. The experimental results show that the proposed model offers encouraging advantages and good performance.
Table 3: Comparison of the performance of the proposed model with those of other forecasting models

Model                          MSE       MAE
Zhang's ARIMA model            0.02049   0.1123
Zhang's ANN model              0.02046   0.1121
Zhang's Hybrid model           0.01723   0.10397
Khashei & Bijari's ANN model   0.01361   0.08963
Kajitani's SETAR model         0.01400   -
Kajitani's FNN model           0.0090    -
Aladag's Hybrid model          0.0090    -
Proposed model                 0.0074    0.0624

Note: Boldface in the original indicates the best statistical results.
Fig. 3: Comparison between observed and predicted values for the GMDH, ARIMA and ANN models for the lynx time series (testing phase)
5. Conclusion
One of the major developments in ANN research over the last decade is model combining, or hybrid models. In this paper we proposed a hybrid GMDH model that combines the Box-Jenkins time series model with the GMDH model to forecast time series data. The GMDH model, in conjunction with the Box-Jenkins approach, was demonstrated on the lynx data: the Box-Jenkins approach is applied to improve the performance of the GMDH in time series forecasting. The empirical results indicate that the proposed method yields better results than the other methods, and this approach presents a superior and reliable alternative to the ANN, ARIMA and hybrid models studied by other researchers.
Acknowledgements. This research was funded by the Ministry of Science, Technology and Innovation (MOSTI), Malaysia under Vot 4F399. Thanks are also given to the Universiti Teknologi Malaysia.
References
[1] G.P. Zhang, Time Series Forecasting Using a Hybrid ARIMA and
Neural Network Model, Neurocomputing, 50 (2003), 159 - 175.
[2] D.S.K. Karunasinghe and S.Y. Liong, Chaotic time series
prediction with a global model: Artificial neural network, Journal
of Hydrology, 323 (2006), 92 - 105.
[3] I. Rojas, O. Valenzuela, F. Rojas, A. Guillen, L. J.
Herrera, H. Pomares, L. Marquez, M. Pasadas, Soft-computing
techniques and ARMA model for time series prediction,
Neurocomputing, 71(4 – 6) (2008), 519 - 537.
[4] F. Camastra and A. Colla, Neural short-term prediction based on dynamics reconstruction, ACM, 9(1) (1999), 45-52.
[5] M. Han and M. Wang, Analysis and modeling of multivariate
chaotic time series based on neural network, Expert Systems with
Applications, 2(36) (2009), 1280-1290.
[6] A. Abraham and B. Nath, A neuro-fuzzy approach for modeling
electricity demand in Victoria, Applied Soft Computing, 1(2)
(2001), 127–138.
[7] K.A. Oliveira, I. Vannucci, E.C. Silva, Using artificial neural networks to forecast chaotic time series, Physica A: Statistical Mechanics and its Applications, 284(1-4) (2000), 393-404.
[8] A.G. Ivakhnenko, Polynomial theory of complex systems, IEEE Trans. Syst., Man Cybern., SMC-1, 1 (1971), 364-378.
[9] H. Tamura and T. Kondo, Heuristics free group method of data handling algorithm of generating optimal partial polynomials with application to air pollution prediction, International Journal of Systems Science, 11 (1980), 1095-1111.
[10] A.G. Ivakhnenko and G.A. Ivakhnenko, A review of problems solved by algorithms of the GMDH, Pattern Recognition and Image Analysis, 5(4) (1995), 527-535.
[11] M.S. Voss and X. Feng, A new methodology for emergent
system identification using particle swarm optimization (PSO) and
the group method data handling (GMDH). GECCO
2002, (2002), 1227-1232.
[12] A. Jain and A. Kumar, An evaluation of artificial neural
network technique for the determination of infiltration model
parameters, Applied Soft Computing, 6 (2006), 272–282.
[13] C.T. Su, L.I. Tong, C.M. Leou, Combination of time series
and neural network for reliability forecasting modeling, Journal of
Chinese Industrial Engineering, 14 (1997), 419
– 429.
[14] W. Wang, P.V. Gelder, J.K. Vrijling, Improving daily stream
flow forecasts by combining ARMA and ANN models, International
Conference on Innovation Advances and
Implementation of Flood Forecasting Technology, 2005.
[15] G.C. Onwubolu, Design of hybrid differential evolution and
group method of data handling networks for modeling and prediction,
Information Sciences, 178 (2008), 3616-3634.
[16] R. Samsudin, P. Saad, A. Shabri, A hybrid GMDH and least
squares support vector machines in time series forecasting, Neural
Network World, 21(3) (2011), 251-268.
[17] K.Y. Chen and C.H. Wang, A hybrid ARIMA and support vector
machines in forecasting the production values of the machinery
industry in Taiwan, Expert Systems with
Applications, 32 (2007), 254 - 264.
[18] S. BuHamra, N. Smaoui, M. Gabr, The Box-Jenkins analysis and neural networks: prediction and time series modelling, Applied Mathematical Modelling, 27(9) (2003), 805-815.
[19] C. Hamzaçebi, Improving artificial neural networks'
performance in seasonal time series forecasting, Information
Sciences, 178(23) (2008), 4550-4559.
[20] M. Khashei and M. Bijari, An artificial neural network (p, d, q) model for time series forecasting, Expert Systems with Applications, 37(1) (2010), 479-489.
[21] G.E.P. Box and G. Jenkins, Time Series Analysis.
Forecasting and Control, Holden-Day, San Francisco, CA, 1970.
[22] K.K. Lai, L. Yu, S. Wang, W. Huang, Hybridizing exponential smoothing and neural network for financial time series prediction, ICCS 2006, Part IV, LNCS 3994 (2006), 493-500.
[23] H.F. Zou, G.P. Xia, F.T. Yang, H.Y. Wang, An investigation and comparison of artificial neural network and time series models for Chinese food grain price forecasting, Neurocomputing, 70 (2007), 2913-2923.
[24] P.A.P. Moran, The statistical analysis of the sunspot and
lynx cycles. Journal of Animal Ecology, 18 (1953), 115–116.
[25] Y. Kajitani, K.W. Hipel, A.I. McLeod, Forecasting nonlinear time series with feed-forward neural networks: A case study of Canadian lynx data, Journal of Forecasting, 24 (2005), 105-117.
[26] T. Subba Rao and M.M. Gabr, An Introduction to Bispectral
Analysis and Bilinear Time Series Models, Springer, Berlin,
1984.
[27] S. Kang, An Investigation of the Use of Feed forward Neural
Network for Forecasting. Ph.D. Thesis, Kent State University,
1991.
[28] F.S. Wong, Time series forecasting using back propagation
neural network, Neurocomputing, 2 (1991), 147 - 159.
[29] R.P. Lippmann, An introduction to computing with neural
nets. IEEE ASSP Magazine, April, 4-22, 1987.
Received: April 14, 2014