Applied Acoustics 77 (2014) 169–177
Dimensionality reduction via variables selection – Linear and nonlinear approaches with application to vibration-based condition monitoring of planetary gearbox
A. Bartkowiak a,b, R. Zimroz c,⇑

a Institute of Computer Science, University of Wroclaw, Joliot Curie 15, Wroclaw 50-383, Poland
b Wroclaw School of Applied Informatics, Wejherowska 28, Wroclaw 54-239, Poland
c Diagnostics and Vibro-Acoustics Science Laboratory, Wroclaw University of Technology, Plac Teatralny 2, Wroclaw 50-051, Poland

⇑ Corresponding author. Tel.: +48 71 320 68 49. E-mail addresses: [email protected] (A. Bartkowiak), [email protected] (R. Zimroz).
Keywords: Dimensionality reduction; Feature selection; Linear and nonlinear approach; Least squares regression; Lasso; Diagnostics; Planetary gearbox
Abstract
Feature extraction and variable selection are two important issues in monitoring and diagnosing a planetary gearbox. The preparation of data sets for final classification and decision making is usually a multi-stage process. We consider data from two gearboxes, one in a healthy and the other in a faulty state. First, the gathered raw vibration data in the time domain were segmented and transformed to the frequency domain using power spectral density. Next, 15 variables denoting amplitudes of the calculated power spectra were extracted; these variables were further examined with respect to their diagnostic ability. We have applied here a novel hybrid approach: all-subset search using multivariate linear regression (MLR), and variable shrinkage by the least absolute shrinkage and selection operator (Lasso), which is a nonlinear approach. Both methods gave consistent results and yielded subsets with healthy or faulty diagnostic properties.
1. Introduction, statement of the problem
Multidimensional feature spaces are frequently used in many fields of science, including advanced condition monitoring of rotating machinery. Diagnostics of the object's condition usually uses a model of the investigated phenomenon; in the simplest one-dimensional (1D) case this may be a probability density function for healthy and faulty conditions, while in more complex problems, after initial preprocessing of the gathered data, complex mathematical modeling of different kinds is applied [11,25,14,15,1,2,8].
Generally, when analyzing vibration signals from rotating machinery working with installed gearboxes, the methods of analysis fall into three broad categories: (i) time domain analysis; (ii) frequency domain analysis; (iii) simultaneous time–frequency domain analysis. Each domain may use many specific multivariate methods originating from statistics, pattern recognition, artificial neural networks and artificial intelligence; for examples see the invited paper by Jardine et al. [16] with its 271 references, and also two other, somewhat more specific invited papers [18,20], with 122 and 119 references to the mentioned topics.
In the case of a multidimensional data space it is very important to decide how many measured variables should be used to build the model. It is not reasonable to use all of the available variables. Reducing the dimensionality of a data set can be carried out in many ways. A feature set that is optimal (in the sense of classification ability, contained information, etc.) allows one to classify data with reduced computational effort and maximal stability/efficiency of the classification results.
In particular, when monitoring rotating machinery, one may want, on the basis of the recorded data, to learn about the state of the machine, in particular to find out whether it is in a healthy or a faulty state.
The assumed model may be related to a number of sensors, to different physical parameters used for diagnostics, for example temperature, vibration, acoustic signals (including noise or acoustic emissions, etc.), or to a multidimensional representation of a single signal (statistical descriptors of the process, 1D spectral representation, 2D time–frequency representations and others) [3,24,10,11,25,14,15,1,2,8]. Having too many variables included in the model may not be convenient for the following reasons: some variables may be not relevant to the problem at hand and may contain a large amount of redundant information; taking them into the analysis may introduce noise and unexplained fluctuations of the output. Also, when using some more complicated nonlinear equations, the necessary parameters may be extremely difficult to estimate. In other words, redundant and irrelevant variables may cause considerable impediment to the performed analysis. Therefore, before
starting the proper analysis, a responsible researcher should find out what kind of data will be analyzed. The first question should be about the intrinsic dimensionality of the data. The second question should be whether the number of variables at hand might be somehow reduced without losing the overall information hidden in the data. These are difficult questions and need expert guidance.
Such guidance may be obtained from a special issue of the Journal of Machine Learning Research, in particular from the first paper of that issue, authored by Guyon and Elisseeff [12]. The contributions in the mentioned issue consider such topics as: providing a better definition of the objective function, feature construction, feature selection, feature ranking, efficient search methods, and feature validity assessment methods.
When considering dependencies among variables, we may consider either linear or non-linear dependencies. The same concerns prediction models. A comprehensive introduction to non-linear methods serving for dimension reduction may be found in the paper by Yin [26] (containing 62 bibliographical references). The author discusses various non-linear projections, such as nonlinear principal component analysis (PCA, obtained via self-associative neural networks), kernel PCA, principal manifolds, isomaps, local linear embedding, Laplacian eigenmaps, spectral clustering, and principal curves and surfaces. He emphasizes the importance of visualization of the data by topology preserving maps, like the Visualization-induced Self Organizing Map (ViSOM).
A more detailed elaboration on variable selection may be found in the book by Clarke et al. [7]. Chapter 10 of this book is devoted to 'Variable Selection'. The authors describe there, in more than 100 pages, such topics as linear regression, subset selection, and some classical and recently developed information criteria (Akaike's AIC, Bayesian BIC, deviance information DIC). They discuss how to choose the proper criterion and assess the appropriate model. Apart from the model selection methods, they also consider some shrinkage methods penalizing the risk: ridge regression, nonnegative garrotte, least absolute shrinkage and selection operator (Lasso), elastic net, least angle regression, shrinkage methods for support vector machines, and Bayesian variable selection. Computational issues of the methods are also discussed and compared.
Some comparative reviews on dimensionality reduction methods may be found in van der Maaten et al. [23] (a Technical Report of Tilburg University with 149 bibliographic references) and Parviainen [17] (a Ph.D. dissertation with 218 bibliographic references, where a taxonomy of the existing methods is built). Recently, Pietila and Lim have published a review of intelligent systems approaches to investigating sound quality [19]. They state that the most common models used today are Multiple Linear Regression and the non-linear Artificial Neural Network. They go into shortcomings that are associated with both the current regression and neural network approaches. They mention the robust approach as a new thought for improving the current state-of-the-art technology.
Generally there are many methods, and their success depends on the data and the problem to be solved. The first crucial step is data acquisition and extracting from the data variables (traits) for further analysis. A number of approaches can be used to obtain variables for further elaboration. The most popular are: plain statistical variables of the vibration time series (treated as a record of an unknown process), spectral representation of the vibration time series, or other advanced multidimensional representations. In condition monitoring, the preparation of data sets for final classification and decision making is usually a multilevel (multi-step) process.
After recording the experimental data, the first step of the analysis is a kind of preprocessing of the raw vibration data by averaging, de-noising and segmenting them. Often the feature extraction process is carried out by transformation of the time series to another domain (frequency, time–frequency, wavelet coefficient matrix, etc.). After that, selection of particular components (for example the mesh frequency) or aggregation of groups of components (energy in a band) is done. Final feature set preparation is aimed at minimizing the redundancy of the data, which results in reducing the computational effort at the classification phase (both in the training and the testing phase). There is no univocal and clear answer as to which representation of the raw vibration signal is better for condition monitoring. This may be the reason that some authors use not one, but several of them. To give a few examples of dealing with the problem: for reducing the dimensionality of the data in general, many authors use principal component analysis (PCA), independent component analysis (ICA), isomap, local linear embedding, kernel PCA, curvilinear component analysis, a simple genetic algorithm, an adaptive genetic algorithm with combinatorial optimization, and others. These methods serve generally for reduction of the dimensionality of the recorded data with the intention of obtaining a subspace representation in which the fault classes are more easily discriminable.
In the second stage, having the variables for the analysis fixed, a prediction algorithm, usually a neural network like radial basis functions (RBF) or a support vector machine (SVM), is applied to perform the monitoring. Some authors also use hybrid models combining multiple feature selection to obtain the input variables for the second stage. It is also possible to use a combined approach by analyzing frequency spectra obtained by Fourier analysis in time (to monitor changes in spectra along the lifetime and fault development; it can be seen as a kind of time–frequency analysis). Often wavelet decomposition is used as a preprocessor: wavelets are used before calculating and comparing frequency spectra to decompose the raw time series into a set of sub-signals with simpler frequency structure. Such an analysis was performed by Eftekharnejad et al. [10] when considering shaft cracks in a helical gearbox.
In this paper we will focus on the selection of the most informative variables from 15D data vectors using linear and nonlinear techniques. The basic data originate from the spectral representation of vibration time series measured on two planetary gearboxes mounted in bucket wheel excavators used in a mining company (for more details, see [3,27,29]). A planetary gearbox is really a complex device and it is difficult to deal with. In our machine, the planetary gearbox was part of a complex (multi-stage) gearbox. The purpose of the experiment was to assess the planetary stage as a key element of the system.
After gathering the vibration signals, they were segmented (divided into short sub-signals) and analyzed in the frequency domain using power spectral density to obtain an array of real values – amplitudes of isolated components indicating high energy at some frequencies (planetary gear mesh frequency and its harmonics).
Based on some a priori knowledge related to the machine design, 15 parameters from the vibration spectrum have been extracted. These 15 parameters are expected to describe the technical condition of a planetary gearbox. It should be added that the same method might be used for a distributed form of change of condition, not for the localized one. The feature extraction procedure used here might be interpreted as a time–frequency approach, because for each short segment of signal, frequency domain features were extracted. It could be seen as the calculation of a spectrogram (without overlapping) – the simplest time–frequency representation. For each slice of the spectrogram 15 features are extracted.
It should be emphasized that such a method of feature extraction was already proposed by Bartelmus and Zimroz [3]. From the obtained power spectra, they retained 15 components and called them pp1, pp2, …, pp15, respectively. The distribution of these variables, characterizing the two observed gearboxes – one healthy and one faulty – working with and without load, was also considered in [4,28].
In a previous work the authors [3] used the sum of amplitudes of selected components as an aggregated measure of gearbox condition. Some preliminary investigations on the structure and dimensionality of the data may be found in Bartkowiak and Zimroz [5,4,6] and Zimroz and Bartkowiak [28]. Concerning the dimensionality, Zimroz and Bartkowiak, using PCA, have shown that the intrinsic dimensionality of the 15D data is 2 or 3 [27,29], which allowed for visualization of the data in a 2D and in a 3D plot. It appeared that the 'healthy' and 'faulty' data are nearly perfectly separated (only 5 data vectors – out of a total of 2183 – are assigned to the wrong class). Canonical Discriminant Analysis provided results with similar meaning.
In this paper, a novel strategy of variable selection is proposed. Instead of a (in general lossy) projection of the multidimensional data space onto a new one with lower dimension, it is suggested to select directly the most informative variables. In our opinion, it is better to select and process original data than to create new features, because by projections the physical meaning of the original variables may be lost. In this paper we propose two techniques for the selection of variables from a 15D vector: (i) finding the best subset of the 15 variables by all-subsets search, where we use the multiple linear regression (MLR) method considering the least squares error (LSE); (ii) variable selection using a penalized least-squares criterion with the l1-type penalty called Lasso (least absolute shrinkage and selection operator), which is a nonlinear approach.
Our investigations are performed in the following way: we subdivide the entire data into two parts, the training and the test sample (see Section 2 for details). Using the training sample we find, for k = 1, 2, …, 15, the best subset of size k. Using the test sample we check how many data vectors are misclassified. Applying additionally the Lasso technique, we validate the results obtained by the all-subset search.
The paper is organized as follows. Section 2 presents shortly the data and the scheme of our experiment. The results obtained by the classical LSE method, when using full regression in 15 variables, are shown in Section 3. The all-subset search and its results are presented in Section 4. The Lasso method and its results appear in Section 5. Finally, Section 6 contains the discussion and closing remarks.
2. The data and scheme of the experiment
We use data recorded and described by Bartelmus and Zimroz [3]. The data were recorded from two planetary gearboxes, one being in a healthy and the other in a faulty condition. The vibration signal was cut into time segments. The obtained segments were subjected to the Fourier transform, yielding power spectral densities, wherefrom 15 variables called pp1, …, pp15 were extracted for further analysis. In such a way the authors [3] obtained two data matrices: X1 of size 951 × 15, representing the healthy gearbox, and X2 of size 1232 × 15, representing the faulty gearbox. Each row in these matrices constitutes one data vector containing one instance of the considered variables pp1, …, pp15 (obtained from one segment). Such a data vector will in the following be referred to as a data item.
The state of the machine is defined numerically as the variable Y, with values y = +1 when being in the healthy state and y = −1 when being in the faulty state.
For our regression calculations we have randomly subdivided the entire data contained in the matrices X1 and X2 into two sets: the training set Xtrain and the test set Xtest:
• The training set was obtained by choosing randomly 500 items from the healthy and 500 items from the faulty data vectors. The chosen items were put together into one common matrix of size 1000 × 15, called in the following the Xtrain set. This data matrix will be the basis for evaluating the regression equation (defined in Section 3.1). Before starting the calculations, the Xtrain set was standardized to have means (vector m of size 1 × 15) equal to zero and standard deviations (vector s of size 1 × 15) equal to one. Notice also that the values ytrain, being the target values for the subsequent rows of the Xtrain set, are – by their definition – centered to zero (according to our design, there are 500 ones and 500 minus ones, indicating the 'healthy' and the 'faulty' data vectors).
• The remaining items from the data, that is the data vectors from X1 and X2 not included in Xtrain, were put together into another common matrix of size 1183 × 15, called in the following the Xtest set. This matrix comprises 451 data vectors from the healthy gearbox and 732 data vectors from the faulty gearbox; it will serve for testing the fitness of the regression equation obtained from the Xtrain data. To use it for testing, it is necessary to standardize each data vector (row of the Xtest set) by subtracting the respective mean vector (m) and dividing the result by the respective standard deviations (s) derived from the Xtrain set.
Note that both the Xtrain and the Xtest data sets contain as their first part items (data vectors) belonging to the first class ('healthy' gearbox), and as their second part items belonging to the second class ('faulty' gearbox). This will facilitate visual recognition of the quality of the predictions when simply drawing an index plot of the estimated predicted values ŷ (which may be evaluated for the train and the test data sets). Thus the class distribution of items representing the healthy and the faulty gearboxes is:

Xtrain: 500 'healthy' and next 500 'faulty' items, together 1000 items;
Xtest: 451 'healthy' and next 732 'faulty' items, together 1183 items.
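To make the scheme above concrete, the following sketch shows one possible implementation of the split and standardization in Python/NumPy. It is an illustration under our own assumptions (the function name, the random seed, and the array names X1, X2), not the authors' original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only

def split_and_standardize(X1, X2, n_per_class=500):
    """X1: 951 x 15 'healthy' items, X2: 1232 x 15 'faulty' items (Section 2)."""
    i1, i2 = rng.permutation(len(X1)), rng.permutation(len(X2))
    # training set: 500 random 'healthy' + 500 random 'faulty' items
    Xtrain = np.vstack([X1[i1[:n_per_class]], X2[i2[:n_per_class]]])
    ytrain = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    # test set: all remaining items (451 'healthy' + 732 'faulty')
    Xtest = np.vstack([X1[i1[n_per_class:]], X2[i2[n_per_class:]]])
    ytest = np.concatenate([np.ones(len(X1) - n_per_class),
                            -np.ones(len(X2) - n_per_class)])
    # standardize both sets with the mean m and std s of the TRAIN set only
    m, s = Xtrain.mean(axis=0), Xtrain.std(axis=0)
    return (Xtrain - m) / s, ytrain, (Xtest - m) / s, ytest
```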
3. Classical LSE method establishing full LSE regression with confidence intervals of the regression coefficients
3.1. Full least squares regression
In the following we will use the well known and well documented multivariate linear regression model (MLR) [13,7]. Let X of size n × d denote the observed data matrix, with n denoting the number of rows containing the succeeding data vectors $x_i = (x_{i1}, \ldots, x_{id})$, i = 1, …, n, and d denoting the number of variables. The matrix X contains data recorded in two classes of items: 'healthy' and 'faulty'. We define a predicted variable Y with values stored in the vector y of size n × 1, with elements taking the values +1 and −1 according to the following rule:
$y_i = +1$ if $x_i \in$ class 'healthy', and $y_i = -1$ if $x_i \in$ class 'faulty'.
We scale the vector y in such a way that its mean equals zero. Let X denote the standardized data matrix. The following regression model is considered:
$y = Xb + e.$  (1)
The vector y in the above equation denotes the vector of the dependent variable. In our case it denotes the class membership coded by two values: +1 and −1. The vector $b = (b_1, \ldots, b_d)$ denotes the vector of the regression coefficients expressing the dependence of the variable Y on the d predictors whose values are recorded in the subsequent columns of X. Because of the standardization of the X matrix, and the centering of the y values, the regression equation contains no intercept $b_0$; its estimate when using the LSE method is equal to 0: $\hat{b}_0 = 0$.
Let e of size n × 1 denote the error term; its components are independent and have expected value equal to zero and variance equal to $\sigma^2$.
The Least Squares Error (LSE) method finds estimates of the regression coefficients b by minimizing – with respect to b – the Residual Sum of Squares (RSS), given by the following quadratic form:

$\mathrm{RSS}(b) = (y - Xb)^T (y - Xb).$  (2)
The solution is:

$\hat{b} = (X^T X)^{-1} X^T y.$  (3)

The predicted values of y are then computed as

$\hat{y} = X\hat{b} = X (X^T X)^{-1} X^T y.$  (4)
The variance–covariance matrix of $\hat{b}$, useful for constructing confidence intervals for the individual regression coefficients $b_j$, j = 1, …, d, is

$\mathrm{Var}(\hat{b}) = (X^T X)^{-1} \sigma^2,$  (5)

with the theoretical (population) variance $\sigma^2$ usually estimated as

$\hat{\sigma}^2 = \frac{1}{n - d - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$
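A minimal sketch of Eqs. (3)–(5) in Python/NumPy is given below; it assumes the standardized Xtrain and centered ytrain of Section 2. The 95% bounds use the normal quantile 1.96, which is our assumption: the paper states alpha = 0.05 but not the exact quantile it used.

```python
import numpy as np

def full_lse(X, y, z=1.96):
    """Full LSE regression: coefficient estimates, fitted values, confidence intervals."""
    n, d = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b_hat = XtX_inv @ X.T @ y                         # Eq. (3)
    y_hat = X @ b_hat                                 # Eq. (4)
    sigma2 = np.sum((y - y_hat) ** 2) / (n - d - 1)   # estimate of sigma^2
    se = np.sqrt(np.diag(XtX_inv) * sigma2)           # standard errors from Eq. (5)
    ci = np.column_stack((b_hat - z * se, b_hat + z * se))
    return b_hat, y_hat, ci

# coefficients whose interval contains 0 (cf. Fig. 1) are candidates for removal
```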
3.2. Results from full regression applied to the gearbox data
All calculations in this section were conducted using the standardized data set Xtrain – see Section 2. We calculated the full regression equation (from d = 15 variables). The obtained results allowed us to construct a graph depicting the regression coefficients together with their confidence intervals. The graph is shown in Fig. 1.
The figure shows the amplitudes of the regression coefficients $b_j$, j = 1, …, 15, embraced by their confidence intervals. If for a regression coefficient $b_j$ its confidence interval contains the value 0, then the respective regression coefficient may be equal to zero, which means that the given variable may have a zero impact on the predicted variable Y and as such should not be considered for inclusion into the set of predictors. Looking at Fig. 1, one may notice that there are three such variables: Nos. 5, 10 and 12. One may notice also that variables Nos. 4 and 15 may have values very near to zero.
Fig. 1. Regression coefficients b1, …, b15 with alpha = 0.05 confidence intervals from the full regression equation with d = 15 variables (x-axis: j – no. of the regression coefficient; y-axis: bj ± confidence interval). Notice that variables Nos. 5, 10 and 12 are statistically nonsignificant, because their confidence intervals contain the value 0. Notice also that variables Nos. 4 and 15 may have values very near to zero.
Our next task was to explore and illustrate the diagnostic properties of the full regression equation both for the train and the test data. For all the data vectors x contained in the data sets Xtrain (in the algebraic notation below written as $X_{train}$) or Xtest (algebraic denotation $X_{test}$) we calculated the predicted values of the dependent variable Y as

$\hat{y}_{train} = X_{train}\hat{b}, \quad \hat{y}_{test} = X_{test}\hat{b}.$
The resulting predicted values for the two sets Xtrain and Xtest are shown in Fig. 2. The left panel of the figure shows the results for Xtrain and the right panel for Xtest, respectively. Remember that the composition of the two sets is such that they contain first the data vectors from the 'healthy' gearbox, which should exhibit y = +1, and next the data vectors from the 'faulty' gearbox, which should exhibit y = −1. Thus points with a positive predicted value y indicate assignment to the class 'healthy', and points with a negative y indicate assignment to the class 'faulty'. Looking at Fig. 2 one may state that all 'healthy' data vectors are correctly classified as belonging to that class, and this happens both for items included in the set Xtrain and in Xtest. Concerning the 'faulty' items (data vectors), eleven of them got positive predicted values, and as such would be classified as 'healthy'. Curiously enough, this happens both for the Xtrain and the Xtest data sets.
It is interesting to notice that in both data sets (Xtrain, n = 1000 items, and Xtest, n = 1183 items) there are only 11 wrong class assignments in the data set Xtrain, and similarly also 11 wrong class assignments in the data set Xtest, and this happens using such a simple and not very sophisticated algorithm as ordinary LSE regression. Summarizing this part of the considerations, we may state that the predictions obtained from the regression derived from the training set act similarly when applied to the test set. This is an optimistic fact: we have found a regression equation which is stable and describes in the same way both the training and the testing data.
The results described above were obtained by using the full set of d = 15 variables as predictors. It is obvious from Fig. 1 that some of the variables are statistically nonsignificant and as such could be dropped from the regression set.
To find more parsimonious subsets, step-wise and all-subset methods were developed. The stepwise method proceeds sequentially. Firstly, the best single variable (predictor) is chosen. Next, depending on two declared parameters, p-to-enter and p-to-remove, the currently best variable (significant in terms of p-to-enter) may be added to the regression set. Conversely, some other variable from the regression set, stated as nonsignificant in terms of p-to-remove, may be removed from the actual regression set.
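As an illustration, a greedy forward pass could look as follows. This is a simplification of the classical procedure: it adds variables by residual error instead of by p-to-enter tests and omits the removal step. The helper rmse_of_subset, which uses the 1/(n − J) normalization of Eq. (6) below, is reused in the next section.

```python
import numpy as np

def rmse_of_subset(X, y, cols):
    # LSE fit restricted to the selected columns, then root mean squared error
    Xs = X[:, cols]
    b = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return np.sqrt(np.sum((y - Xs @ b) ** 2) / (len(y) - len(cols)))

def forward_select(X, y, n_vars):
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < n_vars:
        # add the variable that lowers the rmse the most
        best = min(remaining, key=lambda j: rmse_of_subset(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```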
It is also possible to perform a stepwise downward procedure of removing the least significant variable from the regression set. A much better method investigates all possible subsets of the considered d variables and chooses the one that yields the smallest residual error. This method will be shown in the next section.
4. Search for the best subset by performing all-subsets search

4.1. Finding the best subset of length J (J = 1, …, 15) and its root mean square error rmse
With d = 15 variables we have $2^{15} = 32768$ subsets to investigate. The subsets may contain J = 0, 1, 2, …, 15 variables. For each J there are $N_J = \binom{15}{J}$ subsets to investigate, as shown below:

J:    1    2    3     4     5     6     7     8     9    10    11   12   13   14   15
N_J:  15  105  455  1365  3003  5005  6435  6435  5005  3003  1365  455  105   15    1
As the criterion for the 'best' subset among all those of length J, the best one was defined as the one which yields the minimal squared prediction error mse, given as

$\mathrm{mse}(J) = \frac{1}{n - J} \sum_{i=1}^{n} \left( y_i - \hat{y}_i^{(J)} \right)^2,$  (6)
where the minimum is taken over all $N_J$ subsets of length J. Alternatively, we may take as the criterion of fit the square root of mse, given as

$\mathrm{rmse} = \sqrt{\mathrm{mse}}.$  (7)
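Under the same assumptions as the earlier sketches, the exhaustive search can be written directly with itertools, reusing rmse_of_subset from the stepwise sketch; for d = 15 the 32768 least-squares fits are computationally cheap:

```python
from itertools import combinations

def best_subset(X, y, J):
    # examine all N_J = C(15, J) subsets of size J; keep the minimizer of Eq. (6)/(7)
    return min(combinations(range(X.shape[1]), J),
               key=lambda cols: rmse_of_subset(X, y, list(cols)))
```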
Fig. 3. The prediction error rmse for subsets of length J = 0, 1, 2, …, 15 (J = 0 means an empty regression set; x-axis: length of subset J, y-axis: rmse error). It is evident that the decrease in rmse when considering subsets of length J > 7 is really meager.
The rmse tells how much, on average, the predicted value $\hat{y}_i^{(J)}$ differs from its target value $y_i$, i = 1, …, n. The rmse values for the best subset of length J, for J = 2, …, 15, are shown in Fig. 3.
In Fig. 3 one may see that for J > 7 the rmse error is
already verynear to its ultimate lower bound derived from the full
regression in15 variables equal y = 0.3624. Thus – there is not
much left to ex-plain by the additional variables, and say seven
variables would beenough to retain for further analysis. Which
ones? We will show itin next subsection. Meanwhile, having a best
subset of length J, wewould like to turn our attention to other
subsets also with length J,having a larger rmse as the found best
subset. Their rmse is reallydiversified, which is shown in Fig.
4.
One may see in that figure the spread of the rmse-s considered in subsets of length J = 5 and J = 7. In each plot, for a given J, all the subsets were ordered according to their rmse-s. The spread of the rmse-s is large, and it is worth searching for the subset yielding the minimal rmse. One may also notice in Fig. 4 that there are always a few other subsets that yield values of rmse similar to that found for the optimal subset. We will consider the composition of variables included in such subsets in the next subsection.
Fig. 2. Predictions ŷ from the Xtrain group, n = 1000 (left panel), and the Xtest group, n = 1183 (right panel), using the full LSE regression equation with 15 variables. The vertical line in each panel separates the items (that is, data vectors) having the 'healthy' and 'faulty' status. Notice that all 'healthy' items are recognized correctly as 'healthy'. Notice also that the dominant majority of 'faulty' items are recognized correctly as 'faulty', except 11 items in Xtrain and 11 items in Xtest, which are recognized erroneously as 'healthy'.
4.2. Composition of the best subsets

Starting from the remarks made when inspecting Fig. 3, we will find, for J = 5, 6, 7, 8, the ten subsets with the smallest rmse. We will call the 10 found subsets 'quasi-optimal' subsets. We will look at the composition of these quasi-optimal subsets, that is to say, which variables constitute these 10 subsets. This is shown in Table 1.
Table 1 shows which variables are included into the regression set of the found quasi-optimal subsets. The table consists of two parts. The upper part lists, for J = 5, 6, 7, 8, the 10 subsets with rmse-s very near to the optimal one. These 10 subsets are numbered by the index k = 1, …, 10, appearing in ascending order, with k = 1 indicating the subset with the smallest rmse, that is the optimal one.
Fig. 4. The prediction error rmse for all subsets of length J = 5 (left panel; subsets no. 1, …, 3003) and J = 7 (right panel; subsets no. 1, …, 6435), ordered by rmse (y-axis: rmse error). The dashed line indicates the value rmse = 0.3624 obtained when putting all 15 variables into the regression equation.
Table 1
Variables that have entered the top 10 best subsets for J = 5, 6, 7, 8, and their rmse-s.

J = 5                       J = 7
k = 1:  2 3 6 11 13         k = 1:  1 2 3 6 11 13 14
k = 2:  2 3 6 13 14         k = 2:  2 3 6 8 11 13 14
k = 3:  2 6 11 13 14        k = 3:  2 3 6 11 13 14 15
k = 4:  2 6 11 13 15        k = 4:  2 3 5 6 11 13 14
k = 5:  2 6 9 11 13         k = 5:  2 3 6 9 11 13 14
k = 6:  2 6 7 11 13         k = 6:  2 3 6 7 11 13 14
k = 7:  2 6 7 13 14         k = 7:  2 3 4 6 11 13 14
k = 8:  2 6 9 13 14         k = 8:  2 3 6 11 12 13 14
k = 9:  1 2 3 6 11          k = 9:  2 3 6 10 11 13 14
k = 10: 2 3 6 9 13          k = 10: 1 2 3 6 9 11 13

J = 6                       J = 8
k = 1:  2 3 6 11 13 14      k = 1:  1 2 3 6 7 11 13 14
k = 2:  2 3 6 11 13 15      k = 2:  1 2 3 6 11 13 14 15
k = 3:  2 3 6 9 11 13       k = 3:  1 2 3 6 9 11 13 14
k = 4:  1 2 3 6 11 13       k = 4:  1 2 3 4 6 11 13 14
k = 5:  2 3 6 9 13 14       k = 5:  1 2 3 6 8 11 13 14
k = 6:  2 3 6 13 14 15      k = 6:  1 2 3 6 10 11 13 14
k = 7:  1 2 6 7 11 13       k = 7:  1 2 3 6 11 12 13 14
k = 8:  2 6 11 13 14 15     k = 8:  1 2 3 5 6 11 13 14
k = 9:  2 3 6 11 12 13      k = 9:  2 3 6 8 11 13 14 15
k = 10: 2 3 6 7 11 13       k = 10: 2 3 6 8 9 11 13 14

rmse for J = 5, 6, 7, 8 (k = 1, …, 10):
J = 5: 0.391 0.392 0.395 0.395 0.397 0.399 0.401 0.402 0.402 0.402
J = 6: 0.378 0.384 0.385 0.387 0.387 0.388 0.388 0.389 0.389 0.389
J = 7: 0.373 0.375 0.376 0.377 0.377 0.378 0.378 0.378 0.378 0.379
J = 8: 0.369 0.371 0.371 0.371 0.372 0.372 0.372 0.373 0.373 0.373
In the lower part of the table the corresponding rmse-s for the respective subsets are given. One may notice that, indeed, the rmse-s within one group of 10 subsets differ very little. One may notice also the preference for inclusion into the quasi-optimal subsets of the variables Nos. 2, 3, 6, 11, 13, 14.

When taking for each J (J = 1, …, 15) only the one ultimately best subset, we found the following frequencies of the subsequent variables:

No. of variable:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Its frequency:    8  13  10   4   1  13   7   6   4   2  10   0  11  10   4
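This frequency table is a simple tally over the 15 single best subsets; a sketch, reusing the hypothetical best_subset and the standardized arrays from the earlier listings:

```python
from collections import Counter

# tally how often each variable (1-based numbering pp1, ..., pp15) appears
# in the best subset of each length J = 1, ..., 15
freq = Counter()
for J in range(1, 16):
    freq.update(v + 1 for v in best_subset(Xtrain, ytrain, J))
```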
4.3. Assignments to classes 'healthy' and 'faulty' done by the best subsets of length J

How good are the found best subsets at correctly predicting the variable Y, that is, at assigning the correct class label ('healthy' or 'faulty') to a data vector?
Table 2
Number of erroneous predictions by the best subsets. The symbol n12 denotes the number of false assignments of items belonging to class 'healthy'; n21 denotes false assignments of items from class 'faulty' to class 'healthy'. The label 'fraction' denotes the proportion of all erroneous assignments relative to the total number of items in the train/test data, respectively.

        Train (n = 1000)            Test (n = 1183)
J       n12   n21   fraction        n12   n21   fraction
1        0    111   0.1110           0    145   0.1226
2       20      5   0.0250          19     13   0.0270
3        8      1   0.0090           3      1   0.0034
4        5      3   0.0080           2      1   0.0025
5        1      9   0.0100           2      8   0.0085
6        0     12   0.0120           2     11   0.0110
7        0     12   0.0120           2     11   0.0110
8        0     11   0.0110           2     10   0.0101
9        0     11   0.0110           1     11   0.0101
10       0     11   0.0110           1     11   0.0101
11       0     11   0.0110           0     11   0.0093
12       0     11   0.0110           0     11   0.0093
13       0     11   0.0110           0     11   0.0093
14       0     11   0.0110           0     11   0.0093
15       0     11   0.0110           0     11   0.0093
Fig. 5. Growing lasso – full path of variables included into the regression subset (x-axis: no. of variable, 1–15; y-axis: iteration). The algorithm starts from an empty regression set (first line from the bottom) and in each step one variable is added to the regression set. The algorithm ends when all variables are in the regression set or a stop criterion is satisfied.
In Table 2 we show the number of erroneous predictions both when using the Xtrain data and the Xtest data. Looking at the results displayed in Table 2, one may notice that the erroneous classifications start to stabilize for J = 6. When taking only the one best variable as predictor, the fraction of erroneous classifications amounts to ferr = 0.1110 in the Xtrain set (n = 1000), and is equal to ferr = 0.1226 in the test set (n = 1183). When including more variables into the regression set, the decay of the ferr rates is similar in both sets, and starts to stabilize at J = 6 or J = 7 variables included into the regression set. Looking at Table 2 we may infer that the gain from introducing more than J = 7 variables into the regression set is meager, which could also be seen in Fig. 3. For example, when taking a subset with the J > 7 best variables and adding to that subset one additional variable, the improvement equals one more correctly predicted data vector (only one data vector more is placed in the correct class) out of the total 1000 considered data vectors.

The prediction in the Xtest set (with n = 1183 independent data items) yields additionally two correctly classified data vectors. Thus: adding one more variable to the regression set results in classifying correctly one or two more data vectors. One may ask: is it worth it, in this situation, to use a more complex model with one more predictor in the regression equation?
5. Search for a reduced subset using the Lasso method
5.1. Lasso principles
Generally, the Lasso belongs to the so-called shrinkage methods, which retain only a subset of the regression predictors and discard the rest. It yields estimators of the regression coefficients obtained by applying a regularization procedure using the Lasso penalty shown in Eq. (9); this makes the estimates more stable and consistent.
Let X of size n × d denote the observed design matrix. Let y be the centered target variable Y, taking values +1 for items belonging to class 1 ('healthy') and −1 for items belonging to class 2 ('faulty'; see the denotations at the beginning of Section 3). The Lasso method (least absolute shrinkage and selection operator) solves the following restrained LSE problem

$b^{lasso} = \arg\min_{b} \, (y - Xb)^T (y - Xb),$  (8)

where the regression coefficients $b = (b_1, \ldots, b_d)$ are subjected to the constraint called the Lasso penalty:

$\sum_{j=1}^{d} |b_j| \leq t, \quad t > 0.$  (9)
The constant t is a kind of tuning parameter; it decides on the amount of shrinkage of the parameters b. It may also be used in a standardized form given as

$s = t \Big/ \sum_{j=1}^{d} |b_j^{LS}|,$  (10)

where $b_j^{LS}$, j = 1, …, d, is the ordinary least squares (LSE) estimate of the respective regression parameter $b_j$.
The regression Eq. (8) yields estimates of the Lasso regression coefficients $b^{lasso}$ without the intercept $b_0$. The respective estimate of $b_0$ equals $\hat{b}_0 = \bar{y} = 0$ (because both the columns of the data matrix X and the target vector y have means equal to zero). The Lasso approach was originally proposed by Tibshirani [21], who proposed also some algorithms yielding the Lasso coefficients $b^{lasso}$ as the solution of a quadratic programming problem with constraints. A very convenient algorithm for solving the Lasso problem may be obtained by a small modification of the Least Angle Regression (Lar) algorithm by Efron et al. [9], as shown firstly in the original paper introducing the Lar algorithm [9]; see also [13,7] for detailed examples. Generally, the Lasso shrinks some regression coefficients and sets others to zero. Some characteristic stages of the algorithm:
• For t = 0, one obtains all regression coefficients equal to 0, which means an empty regression set of variables.
• If t is chosen larger than $t_M = \sum_{j=1}^{d} |\hat{b}_j^{LS}|$ (where $\hat{b}_j^{LS}$ is the ordinary least squares LSE estimate), then the Lasso estimates are the ordinary LSE estimates.
• Letting t grow from 0 to $t_M$, the constraints given by (9) become successively relaxed and more variables are able to enter the regression set with non-zero regression coefficients. Thus, for d variables, there will be M + 1 time instances $t_0, t_1, \ldots, t_M$, M ⩾ d, where the status of the regression set changes (there may be some loops within, see [9]). The authors of [9] have shown that the Lasso problem has a piecewise linear solution path, with the points {$t_j$} as nodes where the linear dependence may change. The regression coefficients grow linearly between the nodes, with a growth rate depending on the localization of the nodes. Moreover, the same authors have shown that the number of the linear pieces in the solution path equals approximately d (the entire number of predictor variables), and the complexity of getting the whole Lasso path is O(nd²), the same as the cost of computing a single ordinary LSE fit (see the original paper [9] or the description of the Lar algorithm in [13,7]).
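The same path can also be reproduced with scikit-learn's LARS implementation (an alternative to the Matlab code used in Section 5.2 below); a sketch, assuming the standardized Xtrain and centered ytrain of Section 2:

```python
import numpy as np
from sklearn.linear_model import lars_path

# method='lasso' gives the LARS-based Lasso path of Efron et al. [9];
# coefs has one column per node t_0, t_1, ..., t_M of the piecewise linear path
alphas, active, coefs = lars_path(Xtrain, ytrain, method='lasso')

# 'growing lasso' view (cf. Fig. 5): the active set after each node
for step in range(coefs.shape[1]):
    print(step, sorted(np.flatnonzero(coefs[:, step]) + 1))  # 1-based variable numbers
```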
5.2. The Lasso solution for the gearbox data
Now we will investigate how the Lasso method works for our data. The calculations will be carried out again on the normalized data matrices, using the Lars algorithm implemented by Karl Sjöstrand (see: http://www2.imm.dtu.dk/~ksjo/kas/software, accessed 20.08.2012).
The function Lars produced the full path of regression coefficients. To illustrate what happens with the content of the regression set (that is, which variables are in, which out) we have constructed the plot named by us 'growing LASSO', which is shown in Fig. 5.

The algorithm started from an empty regression set, to which in subsequent steps new variables were added. Looking at the graph, one may notice that the variable No. 13 was added first, and variable No. 11 last, making the set of predictors complete. Each row of the graph exhibited in Fig. 5 shows the content of one regression set yielded by the Lars function; variables belonging to the respective regression set are indicated by squares painted in color.
Table 3
Number of erroneous predictions by the Lasso subsets. Notations as in Table 2.

        Train (n = 1000)            Test (n = 1183)
J       n12   n21   fraction        n12   n21   fraction
1        0    111   0.1110           0    145   0.1226
2        1     60   0.0610           0     76   0.0642
3        0     27   0.0270           0     32   0.0270
4        0     23   0.0230           0     23   0.0194
5        0     12   0.0120           0      9   0.0076
6        0     12   0.0120           0      9   0.0076
7        1      9   0.0100           0      3   0.0025
8        1      9   0.0100           0      3   0.0025
9        0     12   0.0120           1      9   0.0085
10       0     11   0.0110           1     10   0.0093
11       0     11   0.0110           1     10   0.0093
12       0     11   0.0110           1     10   0.0093
13       0     11   0.0110           0     11   0.0093
14       0     11   0.0110           0     11   0.0093
15       0     11   0.0110           0     11   0.0093
Fig. 6. Lasso predictions ŷL for the variable Y noted in the train data (left panel) and test data (right panel) when using the reduced subset with 5 variables (No. 2, 6, 9, 11, 14) found when applying the Lasso technique. See Fig. 2 for details of the composition of the Xtrain and Xtest data. There are 12 erroneous assignments in the Xtrain and 9 erroneous assignments in the Xtest data.
For example, row 6 from the bottom indicates a 5-variable subset containing the variables No. 2, 6, 9, 11, 14. This subset does not appear among the 10 best subsets of length 5 found by the all-subset search in Section 4. Nonetheless, it has very good diagnostic properties, as shown below in Table 3 and Fig. 6.
Now let us look at the predicted values for the Xtrain and Xtest data sets when using the regression coefficients yielded by the Lasso method. The numbers of erroneous predictions when using Lasso subsets of length J = 1, …, 15 are shown in Table 3. We show there predictions obtained both from the Xtrain and the Xtest data sets, when using regression formulae developed on the basis of the Xtrain data.
Details for the subset composed of k = 5 variables are shown in Fig. 6. Considering the Xtrain and Xtest data, the number of erroneous predictions from the 5-variable subset equals 12 in the train set, and only 9 in the test set. Formerly it was stated that the full regression set with 15 predictors gave 11 erroneous predictions both for the train and the test data (see Fig. 2). Thus, the number of correct predictions is very similar, and the 5-predictor model seems to be not substantially worse than the full regression model (in the test set it works even better!).
6. Discussion and closing remarks
We have considered two sets of vibration time series, for which power spectra were calculated. The main topic of our research was: how many features (amplitudes of power spectra observed at some fixed frequencies, denoted by us as variables pp1, …, pp15) should be taken into consideration when aiming at the diagnosis of a healthy or a faulty state of the machine.
It seems that so far there has been no systematic investigation of this topic.
Let us say that the spectral representation of a time series allows one to extract features that may be related to the energy of selected bands or the amplitudes of characteristic frequencies (here: planetary gearbox mesh frequency and its harmonics). To our knowledge, the number of such features has not been investigated deeply and was established ad hoc by the researchers. The only paper close to that topic is [27], investigating correlations among spectra coming from two different machines: one being in a healthy and the other in a faulty state.
In the present paper we have intensively investigated the possibility of finding the best subset using multivariate linear regression, examining in detail all possible $2^{15} = 32768$ subsets. We have also considered a faster and more robust method, namely the Lasso (least absolute shrinkage and selection operator), combined with least angle regression (Lar). The Lasso has the advantage of being a robust method; Lar gives a speed of calculation comparable with ordinary least squares regression. The results of both methods are in reasonable agreement and show that the analyzed set of 15 variables characterizing the gearbox data can be reduced to about 7–9 variables. We have used quite large training and test sets (with cardinalities n = 1000 and n = 1183). The results from both these sets are consistent; sometimes the results in the test sets are even a little better than those in the train sets.
The composition (appearance in the subsets) of individual variables is not random: some variables enter the best subsets more frequently than others. The most frequent variables contained in the best subsets found by the LSE method are: 1, 2, 3, 6, 7, 11, 13, 14. What is characteristic: the most frequent variables found in the best subsets cover the whole span 1–15; it does not happen that, say, variables Nos. 1–8 enter the best set more frequently than variables Nos. 9–15. This is in accordance with the preliminary results obtained in [28].
For a fixed cardinality k of the regression subset, the composition of the k-subset is not unique; one may find several compositions with very similar fitness criteria (rmse) for the analyzed data. This is a property of the analyzed variables, not stated explicitly before.
When experimenting manually using stepwise methods, we observed that subsets composed from the first 8 variables are slightly better (R-square ≈ 0.82) than subsets composed from the variables Nos. 8–15 (R-square ≈ 0.77).
Summarizing our research presented here: we have applied classical methods based on multivariate linear regression using ordinary least squares error, and a combination of two modern methods, the Lasso and the Lar, making the evaluations robust and quick (the Lasso puts the l1 penalty on the least squares error, and Lar enables calculations with speed comparable to the LSE method). Both methods yielded similar results. The research has an immediate practical implication: the number of variables recorded and used for analysis may be considerably reduced.
At the end we would like to mention that dimensionality reduction methods have an extensive literature which has appeared in recent years (see the bibliographic references cited in Section 1). This indicates that reduction of dimensionality is nowadays a hot topic in data analysis. To perform reduction of dimensionality, authors frequently use quite complicated and computer-intensive methods. Our aim in the presented research was simplicity, having in mind the principle formulated by William of Occam (quoted after M. Tipping [22]): "pluralitas non est ponenda sine necessitate", which translates as "entities should not be multiplied unnecessarily". In the machine learning community this means "models should be not more complex than is sufficient to explain the data". The models presented in the paper are conceptually simple and, as shown, are applicable to real industrial data.
Acknowledgements
This paper is in part financially supported by the Polish State Committee for Scientific Research 2010–2013 as research Project No. N504 147838.
References
[1] Barszcz T, Bielecki A, Romaniuk T. Application of probabilistic neural networks for detection of mechanical faults in electric motors. Przeglad Elektrotechniczny 2009;85(8):37–41.
[2] Barszcz T, Bielecka M, Bielecki A, Wójcik M. Wind turbines states classification by a fuzzy-ART neural network with a stereographic projection as a signal normalization. Lect Notes Comput Sci (LNCS) 2011;6594(Part 2):225–34.
[3] Bartelmus W, Zimroz R. A new feature for monitoring the condition of gearboxes in non-stationary operating systems. Mech Syst Signal Process 2009;23(5):1528–34.
[4] Bartkowiak A, Zimroz R. Outliers analysis and one class classification approach for planetary gearbox diagnosis. J Phys: Conf Ser 2011;305(1) [art. no. 012031].
[5] Bartkowiak A, Zimroz R. Curvilinear dimensionality reduction of data for gearbox condition monitoring. Przeglad Elektrotechniczny 2012;88(10B):268–71.
[6] Bartkowiak A, Zimroz R. Data dimension reduction and visualization with application to multidimensional gearbox diagnostics data: comparison of several methods. Diffus Defect Data Pt B: Solid State Phenom 2012;180:177–84.
[7] Clarke B, Fokoué E, Zhang HH. Principles and theory for data mining and machine learning. Springer Series in Statistics; 2009.
[8] Cocconcelli M, Bassi L, Secchi C, Fantuzzi C, Rubini R. An algorithm to diagnose ball bearing faults in servomotors running arbitrary motion profiles. Mech Syst Signal Process 2012;27(1):667–82.
[9] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat 2004;32(2):407–99.
[10] Eftekharnejad B, Adali A, Mba D. Shaft crack diagnosis in a gearbox. Appl Acoust 2012 [in press]. http://dx.doi.org/10.1016/j.apacoust.2012.02.004.
[11] Gryllias KC, Antoniadis IA. A support vector machine approach based on physical model training for rolling element bearing fault detection in industrial environments. Eng Appl Artif Intell 2012;25(2):326–44.
[12] Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
[13] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference and prediction. New York: Springer; 2010.
[14] Heyns T, Godsill SJ, De Villiers JP, Heyns PS. Statistical gear health analysis which is robust to fluctuating loads and operating speeds. Mech Syst Signal Process 2012;27(1):651–66.
[15] Herzog MA, Marwala T, Heyns PS. Machine and component residual life estimation through the application of neural networks. Reliab Eng Syst Safety 2009;94(2):479–89.
[16] Jardine AKS, Lin D, Banjevic D. A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mech Syst Signal Process 2006;20:1483–510.
[17] Parviainen E. Studies on dimension reduction and feature spaces. Aalto University, School of Science, Dept of Biomedical Engineering and Computational Science. Aalto University Publication Series, Doctoral Dissertations 94/2011.
[18] Peng ZK, Chu FL. Applications of the wavelet transform in machine condition monitoring and fault diagnostics: a review with bibliography. Mech Syst Signal Process 2004;18:199–221.
[19] Pietila G, Lim TC. Intelligent systems approaches to product sound quality evaluations – a review. Appl Acoust 2012;73:987–1002.
[20] Samuel PD, Pines DJ. A review of vibration-based techniques for helicopter transmission diagnostics. J Sound Vib 2005;282:475–508.
[21] Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Statist Soc B 1996;58(1):267–88.
[22] Tipping ME. Bayesian inference: an introduction to principles and practice in machine learning. In: Bousquet O, von Luxburg U, Rätsch G, editors. Advanced lectures on machine learning. Springer; 2004. p. 41–62.
[23] van der Maaten L, Postma E, van den Herik J. Dimensionality reduction: a comparative review. TiCC TR 2009-005, p. 1–36. Tilburg University; 2009.
[24] Urbanek J, Barszcz T, Zimroz R, Antoni J. Application of averaged instantaneous power spectrum for diagnostics of machinery operating under non-stationary operational conditions. Measurement: J Int Meas Confed 2012;45(7):1782–91.
[25] Yiakopoulos CT, Gryllias KC, Antoniadis IA. Rolling element bearing fault detection in industrial environments based on a K-means clustering approach. Exp Syst Appl 2011;38(3):2888–911.
[26] Yin H. Nonlinear principal manifolds – adaptive hybrid learning approaches. In: Corchado E, et al., editors. HAIS 2008, LNAI 5271. Springer; 2008. p. 15–29.
[27] Zimroz R, Bartkowiak A. Investigation on spectral structure of gearbox vibration signals by principal component analysis for condition monitoring purposes. J Phys: Conf Ser 2011;305(1) [art. no. 012075].
[28] Zimroz R, Bartkowiak A. Multidimensional data analysis for condition monitoring: features selection and data classification. In: CM2012–MFPT2012, BINDT, 11–14 June, London. Electronic Proceedings; 2012. p. 1–12 [art. no. 402].
[29] Zimroz R, Bartkowiak A. Two simple multivariate procedures for monitoring planetary gearboxes in non-stationary operating conditions. Mech Syst Signal Process 2013;38(1):237–47. http://dx.doi.org/10.1016/j.ymssp.2012.03.022.