Applied Acoustics 77 (2014) 169–177


Dimensionality reduction via variables selection – Linear and nonlinear approaches with application to vibration-based condition monitoring of planetary gearbox

0003-682X/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.apacoust.2013.06.017

⇑ Corresponding author. Tel.: +48 71 320 68 49. E-mail addresses: [email protected] (A. Bartkowiak), [email protected] (R. Zimroz).

A. Bartkowiak a,b, R. Zimroz c,⇑
a Institute of Computer Science, University of Wroclaw, Joliot Curie 15, Wroclaw 50-383, Poland
b Wroclaw School of Applied Informatics, Wejherowska 28, Wroclaw 54-239, Poland
c Diagnostics and Vibro-Acoustics Science Laboratory, Wroclaw University of Technology, Plac Teatralny 2, Wroclaw 50-051, Poland

Article info

Keywords: Dimensionality reduction; Feature selection; Linear and nonlinear approach; Least square regression; Lasso; Diagnostics; Planetary gearbox

Abstract

Feature extraction and variable selection are two important issues in monitoring and diagnosing a planetary gearbox. The preparation of data sets for final classification and decision making is usually a multi-stage process. We consider data from two gearboxes, one in a healthy and the other in a faulty state. First, the gathered raw vibration data in the time domain have been segmented and transformed to the frequency domain using power spectral density. Next, 15 variables denoting amplitudes of the calculated power spectra were extracted; these variables were further examined with respect to their diagnostic ability. We have applied here a novel hybrid approach: an all-subsets search using multivariate linear regression (MLR), and variable shrinkage by the least absolute selection and shrinkage operator (Lasso), which constitutes a non-linear approach. Both methods gave consistent results and yielded subsets with healthy or faulty diagnostic properties.

© 2013 Elsevier Ltd. All rights reserved.

    1. Introduction, statement of the problem

Multidimensional feature spaces are frequently used in many fields of science, including advanced condition monitoring of observed rotating machinery. Diagnostics of the object's condition usually uses a model for the investigated phenomenon; in the simplest one-dimensional (1D) case this may be a probability density function for healthy and faulty conditions, while in more complex problems, after initial preprocessing of the gathered data, complex mathematical modeling of various kinds is applied [11,25,14,15,1,2,8].

Generally, when analyzing vibration signals from rotating machinery working with installed gearboxes, the methods of analysis fall into three broad categories: (i) time domain analysis; (ii) frequency domain analysis; (iii) simultaneous time–frequency domain analysis. Each domain may use many specific multivariate methods originating from statistics, pattern recognition, artificial neural networks and artificial intelligence; for examples see the invited paper by Jardine et al. [16] with its 271 references, and also two other, somewhat more specific, invited papers [18,20], with 122 and 119 references to the mentioned topics.

In the case of a multidimensional data space it is very important to decide how many measured variables should be used to build the model. It is not reasonable to use all of the available variables, and reducing the dimensionality of a data set can be carried out in many ways. An optimal feature set (in the sense of classification ability, contained information, etc.) allows one to classify the data with minimal computational effort and maximal stability/efficiency of the classification results.

In particular, when monitoring rotating machinery, one may want – on the basis of the recorded data – to learn about the state of the machine, in particular to find out whether it is in a healthy or a faulty state.

The assumed model may be related to the number of sensors, to different physical parameters used for diagnostics, for example temperature, vibration, or acoustic signals (including noise or acoustic emissions, etc.), or to a multidimensional representation of a single signal (statistical descriptors of the process, 1D spectral representation, 2D time–frequency representations and others) [3,24,10,11,25,14,15,1,2,8]. Having too many variables included in the model may not be convenient for the following reasons: some variables may not be relevant to the problem at hand and may contain a large amount of redundant information; taking them into the analysis may introduce noise and unexplained fluctuations of the output. Also, when using more complicated nonlinear equations, the necessary parameters may be extremely difficult to estimate. In other words, redundant and irrelevant variables may cause considerable impediment to the performed analysis.



Therefore, before starting the proper analysis, a responsible researcher should find out what kind of data will be analyzed. The first question should be about the intrinsic dimensionality of the data. The second question should be whether the number of variables at hand might somehow be reduced without losing the overall information hidden in the data. These are difficult questions and they need expert guidance.

Such guidance may be obtained from a special issue of the Journal of Machine Learning Research, in particular from the first paper of that issue, authored by Guyon and Elisseeff [12]. The contributions in the mentioned issue consider such topics as: providing a better definition of the objective function, feature construction, feature selection, feature ranking, efficient search methods, and feature validity assessment methods.

When considering dependencies among variables, we may consider either linear or non-linear dependencies. The same concerns prediction models. A comprehensive introduction to non-linear methods serving for dimension reduction may be found in the paper by Yin [26] (containing 62 bibliographical references). The author discusses various non-linear projections, such as nonlinear principal component analysis (PCA) obtained via self-associative neural networks, kernel PCA, principal manifolds, isomaps, local linear embedding, Laplacian eigenmaps, spectral clustering, and principal curves and surfaces. He emphasizes the importance of visualization of the data by topology preserving maps, like the Visualization-induced Self Organizing Map (ViSOM).

A more detailed elaboration on variable selection may be found in the book by Clarke et al. [7]. Chapter 10 of this book is devoted to 'Variable Selection'. The authors describe there, in more than 100 pages, such topics as linear regression, subset selection, and some classical and recently developed information criteria (Akaike's AIC, Bayesian BIC, deviance information DIC). They discuss how to choose the proper criterion and assess the appropriate model. Apart from the model selection methods, they also consider some shrinkage methods penalizing the risk: ridge regression, the nonnegative garrotte, the least absolute selection and shrinkage operator (Lasso), the elastic net, least angle regression, shrinkage methods for support vector machines, and Bayesian variable selection. Computational issues of the methods are also discussed and compared.

Some comparative reviews on dimensionality reduction methods may be found in van der Maaten et al. [23] (a Technical Report of Tilburg University with 149 bibliographic references) and Parviainen [17] (a Ph.D. dissertation with 218 bibliographic references, where a taxonomy of the existing methods is built). Recently, Pietila and Lim have published a review of intelligent systems approaches to investigating sound quality [19]. They state that the most common models used today are Multiple Linear Regression and the non-linear Artificial Neural Network. They go into the shortcomings associated with both the current regression and neural network approaches, and mention the robust approach as a new idea for improving the current state-of-the-art technology.

Generally there are many methods, and their success depends on the data and the problem to be solved. The first crucial step is data acquisition and the extraction from the data of variables (traits) for further analysis. A number of approaches can be used to obtain variables for further elaboration. The most popular are: plain statistical variables of the vibration time series (treated as a record of an unknown process), spectral representation of the vibration time series, or other advanced multidimensional representations. In condition monitoring, the preparation of data sets for final classification and decision making is usually a multilevel (multi-step) process.

After recording the experimental data, the first step of the analysis is a kind of preprocessing of the raw vibration data by averaging, de-noising and segmenting them. Often the feature extraction process is carried out by transformation of the time series to another domain (frequency, time–frequency, wavelet coefficient matrix, etc.). After that, selection of particular components (for example the mesh frequency) or aggregation of a group of components (energy in a band) is done. Final feature set preparation is aimed at minimizing the redundancy of the data, which results in reduced computational effort at the classification phase (both in the training and the testing phase). There is no univocal and clear answer as to which representation of the raw vibration signal is better for condition monitoring. This may be the reason that some authors use not one, but several of them. To give a few examples of dealing with the problem: for reducing the dimensionality of the data, many authors use principal component analysis (PCA), independent component analysis (ICA), isomap, local linear embedding, kernel PCA, curvilinear component analysis, simple genetic algorithms, adaptive genetic algorithms with combinatorial optimization, and others. These methods serve generally for reduction of the dimensionality of the recorded data with the intention of obtaining a subspace representation in which the fault classes are more easily discriminable.

In the second stage, having the variables for the analysis fixed, a prediction algorithm, usually a neural network such as radial basis functions (RBF) or support vector machines (SVM), is applied to perform the monitoring. Some authors also use hybrid models combining multiple feature selection to obtain the input variables for the second stage. It is also possible to use a combined approach by analyzing frequency spectra obtained by Fourier analysis in time (to monitor changes in spectra along the lifetime and fault development; this can be seen as a kind of time–frequency analysis). Often wavelet decomposition is used as a preprocessor: wavelets are used, before calculating and comparing frequency spectra, to decompose the raw time series into a set of sub-signals with simpler frequency structure. Such an analysis was performed by Eftekharnejad et al. [10] when considering shaft cracks in a helical gearbox.

In this paper we will focus on selecting the most informative variables from 15D data vectors using linear and nonlinear techniques. The basic data originate from the spectral representation of vibration time series measured on two planetary gearboxes mounted in bucket wheel excavators used in a mining company (for more details, see [3,27,29]). A planetary gearbox is really a complex device and it is difficult to deal with. In our machine, the planetary gearbox was part of a complex (multi-stage) gearbox. The purpose of the experiment was to assess the planetary stage as a key element of the system.

After gathering the vibration signals, they were segmented (divided into short sub-signals) and analyzed in the frequency domain using power spectral density to obtain an array of real values – amplitudes of isolated components indicating high energy at some frequencies (the planetary gear mesh frequency and its harmonics).

Based on some a priori knowledge related to the machine design, 15 parameters from the vibration spectrum have been extracted. These 15 parameters are expected to describe the technical condition of a planetary gearbox. It should be added that the same method might be used for a distributed form of change of condition, not for the localized one. The feature extraction procedure used here might be interpreted as a time–frequency approach, because for each short segment of the signal, frequency domain features were extracted. It could be seen as the calculation of a spectrogram (without overlapping) – the simplest time–frequency representation. For each slice of the spectrogram 15 features are extracted.

It should be emphasized that such a method of feature extraction was proposed already by Bartelmus and Zimroz [3]. From the obtained power spectra, they have retained 15 components and have called them pp1, pp2, ..., pp15 accordingly. The distribution of these variables, characterizing the observed two gearboxes – one healthy and one faulty – working with and without load, was also considered in [4,28].

In a previous work the authors [3] used the sum of amplitudes of selected components as an aggregated measure of gearbox condition. Some preliminary investigations on the structure and dimensionality of the data may be found in Bartkowiak and Zimroz [5,4,6] and Zimroz and Bartkowiak [28]. Concerning the dimensionality, Zimroz and Bartkowiak, using PCA, have shown that the intrinsic dimensionality of the 15D data is 2 or 3 [27,29], which allowed for visualization of the data in a 2D and in a 3D plot. It appeared that the 'healthy' and 'faulty' data are nearly perfectly separated (only 5 data vectors – out of a total of 2183 – are assigned to the wrong class). Canonical Discriminant Analysis provided results of similar meaning.

In this paper, a novel strategy of variable selection is proposed. Instead of a (in general lossy) projection of the multidimensional data space onto a new one with lower dimension, it is suggested to select directly the most informative variables. In our opinion, it is better to select and process original data than to create new features, because through projections the physical meaning of the original variables may be lost. In this paper we propose two techniques for the selection of variables from a 15D vector: (i) finding the best subset of the 15 variables by an all-subsets search – we will use here the multiple linear regression (MLR) method considering the least squares error (LSE); (ii) variable selection using a penalized least-squares criterion with the l1-type penalty called Lasso (least absolute shrinkage and selection operator) – which is a nonlinear approach.

Our investigations are performed in the following way: We subdivide the entire data into two parts: the training and the test sample (see Section 2 for details). Using the training sample we find, for k = 1, 2, ..., 15, the best subset of size k. Using the test sample we check how many data vectors are misclassified. Applying additionally the Lasso technique, we validate the results obtained by the all-subsets search.

The paper is organized as follows. Section 2 briefly presents the data and the scheme of our experiment. The results obtained by the classical LSE method, when using the full regression in 15 variables, are shown in Section 3. The all-subsets search and its results are presented in Section 4. The Lasso method and its results appear in Section 5. Finally, Section 6 contains the discussion and closing remarks.

    2. The data and scheme of the experiment

We use data recorded and described by Bartelmus and Zimroz [3]. The data were recorded from two planetary gearboxes, one being in healthy and the other in faulty condition. The vibration signal was cut into time segments. The obtained segments were subjected to Fourier transform yielding power spectral densities, wherefrom 15 variables called pp1, ..., pp15 were extracted for further analysis. In such a way the authors [3] obtained two data matrices: X1 of size 951 × 15 representing the healthy gearbox, and X2 of size 1232 × 15 representing the faulty gearbox. Each row in these matrices constitutes one data vector containing one instance of the considered variables pp1, ..., pp15 (obtained from one segment). Such a data vector will in the following be referred to as a data item.

The state of the machine is defined numerically as the variable Y with values y = +1 when being in the healthy state and y = −1 when being in the faulty state.

For our regression calculations we have randomly subdivided the matrices X1 and X2 (that is, the entire data) into two sets, called the training set Xtrain and the test set Xtest:

• The training set was obtained by choosing randomly 500 items from the healthy and 500 items from the faulty data vectors. The chosen items were put together into one common matrix of size 1000 × 15, called in the following the Xtrain set. This data matrix will be the basis for evaluating the regression equation (defined in Section 3.1). Before starting the calculations, the Xtrain set was standardized to have means (vector m of size 1 × 15) equal to zero and standard deviations (vector s of size 1 × 15) equal to one. Notice also that the values ytrain, being the target values for subsequent rows of the Xtrain set, are – by design – centered to zero (according to our design, there are 500 ones and 500 minus ones, indicating the 'healthy' and the 'faulty' data vectors).
• The remaining items from the data, that is the data vectors both from X1 and X2 not included into the Xtrain set, were put together into another common matrix of size 1183 × 15, called in the following the Xtest set. This matrix comprises 451 data vectors from the healthy gearbox and 732 data vectors from the faulty gearbox; it will serve for testing the fitness of the regression equation obtained from the Xtrain data. To use it for testing, it is necessary to standardize each data vector (row of the Xtest set) by subtracting the respective mean vector (m) and dividing the result by the respective standard deviations (s) derived from the Xtrain set.

Note that both the Xtrain and Xtest data sets contain as their first part the items (data vectors) belonging to the first class ('healthy' gearbox), and as their second part the items belonging to the second class ('faulty' gearbox). This will make it easy to recognize visually the quality of the predictions when simply drawing an index plot of the estimated predicted values ŷ (which may be evaluated for both the train and the test data sets). Thus the class distribution of items representing the healthy and the faulty gearboxes is:

Xtrain: 500 'healthy' and next 500 'faulty' items, together 1000 items;
Xtest: 451 'healthy' and next 732 'faulty' items, together 1183 items.
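A minimal NumPy sketch of this preparation step might look as follows; the array names X1 and X2, the fixed random seed, and the helper name make_train_test are our own illustrative choices, not part of the original study.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed; the paper used an unspecified random split

def make_train_test(X1, X2, n_per_class=500):
    """Build Xtrain/Xtest as in Section 2, assuming X1 (healthy, 951 x 15)
    and X2 (faulty, 1232 x 15) are NumPy arrays of extracted features."""
    idx1, idx2 = rng.permutation(len(X1)), rng.permutation(len(X2))
    Xtrain = np.vstack([X1[idx1[:n_per_class]], X2[idx2[:n_per_class]]])
    ytrain = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    Xtest = np.vstack([X1[idx1[n_per_class:]], X2[idx2[n_per_class:]]])
    ytest = np.concatenate([np.ones(len(X1) - n_per_class),
                            -np.ones(len(X2) - n_per_class)])
    m, s = Xtrain.mean(axis=0), Xtrain.std(axis=0)   # statistics from the train set only
    Xtrain = (Xtrain - m) / s
    Xtest = (Xtest - m) / s                          # test data scaled with the train m, s
    return Xtrain, ytrain, Xtest, ytest
```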

3. Classical LSE method establishing full LSE regression with confidence intervals of the regression coefficients

    3.1. Full least squares regression

In the following we will use the well-known and well-documented multivariate linear regression (MLR) model [13,7]. Let X of size n × d denote the observed data matrix, with n denoting the number of rows containing the succeeding data vectors $x_i = (x_{i1}, \ldots, x_{id})$, i = 1, ..., n, and d denoting the number of variables. The matrix X contains data recorded in two classes of items: 'healthy' and 'faulty'. We define a predicted variable Y with values stored in the vector y of size n × 1, with elements taking values +1 and −1 according to the following rule:

$y_i = +1$ if $x_i \in$ class 'healthy', and $y_i = -1$ if $x_i \in$ class 'faulty'.

We scale the vector y in such a way that its mean equals zero. Let X denote the standardized matrix X. The following regression model is considered:

$$y = Xb + e. \quad (1)$$

The vector y in the above equation denotes the vector of the dependent variable. In our case it denotes the class membership, coded by two values: +1 and −1. The vector $b = (b_1, \ldots, b_d)$ denotes the vector of the regression coefficients expressing the dependence of the variable Y on the d predictors whose values are recorded in subsequent columns of X. Because of the standardization of the X matrix and the centering of the y values, the regression equation contains no intercept $b_0$; its estimate, when using the LSE method, is equal to 0: $\hat{b}_0 = 0$.


Let e of size n × 1 denote the error term; its components are independent and have expected value equal to zero and variance equal to $\sigma^2$.

The Least Squares Error (LSE) method finds estimates of the regression coefficients b by minimizing – with respect to b – the Residual Sum of Squares (RSS) given by the following quadratic form:

$$RSS(b) = (y - Xb)^T (y - Xb). \quad (2)$$

The solution is:

$$\hat{b} = (X^T X)^{-1} X^T y. \quad (3)$$

The predicted values of y are then computed as

$$\hat{y} = X\hat{b} = X (X^T X)^{-1} X^T y. \quad (4)$$

The variance–covariance matrix of $\hat{b}$, useful for constructing confidence intervals for the individual regression coefficients $b_j$, j = 1, ..., d, is

$$Var(\hat{b}) = (X^T X)^{-1} \sigma^2, \quad (5)$$

with the theoretical (population) variance $\sigma^2$ usually estimated as

$$\hat{\sigma}^2 = \frac{1}{n - d - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
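A short NumPy/SciPy sketch of Eqs. (3)–(5) on a standardized design matrix follows. The use of a Student-t quantile for the confidence limits is our assumption about how such intervals are usually formed; the paper does not spell out this detail.

```python
import numpy as np
from scipy import stats

def full_lse_with_ci(X, y, alpha=0.05):
    """Fit y = X b by ordinary least squares and return the coefficients
    with (1 - alpha) confidence intervals, following Eqs. (3)-(5)."""
    n, d = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b_hat = XtX_inv @ X.T @ y                              # Eq. (3)
    y_hat = X @ b_hat                                      # Eq. (4)
    sigma2_hat = np.sum((y - y_hat) ** 2) / (n - d - 1)    # residual variance estimate
    se = np.sqrt(np.diag(XtX_inv) * sigma2_hat)            # Eq. (5), standard errors
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - d - 1)      # assumed t-based interval
    ci = np.column_stack([b_hat - t_crit * se, b_hat + t_crit * se])
    return b_hat, ci
```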

    3.2. Results from full regression applied to the gearbox data

All calculations in this section were conducted using the standardized data set Xtrain – see Section 2. We calculated the full regression equation (from d = 15 variables). The obtained results allowed us to construct a graph depicting the regression coefficients together with their confidence intervals. The graph is shown in Fig. 1.

Fig. 1. Regression coefficients b_1, ..., b_15 with alpha = 0.05 confidence intervals from the full regression equation with d = 15 variables. Notice that variables Nos. 5, 10 and 12 are statistically nonsignificant, because their confidence intervals contain the value 0. Notice also that variables Nos. 4 and 15 may have values very near to zero.

The figure shows the amplitudes of the regression coefficients b_j, j = 1, ..., 15, embraced by their confidence intervals. If, for a regression coefficient b_j, its confidence interval contains the value 0, then the respective regression coefficient may be equal to zero, which means that the given variable may have a zero impact on the predicted variable Y and as such should not be considered for inclusion into the set of predictors. Looking at Fig. 1, one may notice that there are three such variables: Nos. 5, 10, and 12. One may notice also that variables Nos. 4 and 15 may have values very near to zero.

Our next task was to explore and illustrate the diagnostic properties of the full regression equation both for the train and the test data. For all the data vectors x contained in the data sets Xtrain (in the algebraic notation below written as $X_{train}$) and Xtest (algebraic notation $X_{test}$) we calculated the predicted values of the dependent variable Y as

$$\hat{y}_{train} = X_{train}\hat{b}, \quad \text{and} \quad \hat{y}_{test} = X_{test}\hat{b}.$$

The resulting predicted values for the two sets Xtrain and Xtest are shown in Fig. 2. The left panel of the figure shows results for Xtrain and the right panel for Xtest, respectively. Remember that the composition of the two sets is such that they first contain the data vectors from the 'healthy' gearbox, which should exhibit y = +1, and next the data vectors from the 'faulty' gearbox, which should exhibit y = −1. Thus points with a positive predicted value indicate assignment to class 'healthy', and points with a negative value indicate assignment to class 'faulty'. Looking at Fig. 2 one may state that all 'healthy' data vectors are correctly classified as belonging to that class, and this happens both for items included into the set Xtrain and Xtest. As concerns the 'faulty' items (data vectors), eleven of them got positive predicted values and as such would be classified as 'healthy'. Curiously enough, this happens both for the Xtrain and Xtest data sets.

It is interesting to notice that in both data sets (Xtrain, n = 1000 items, and Xtest, n = 1183 items) there are only 11 wrong class assignments in the data set Xtrain, and similarly also 11 wrong class assignments in the data set Xtest, and this happens when using such a simple and not very sophisticated algorithm as ordinary LSE regression. Summarizing this part of the considerations, we may state that predictions obtained from the regression derived from the training set act similarly when applied to the test set. This is an optimistic fact: we have found a regression equation which is stable and describes both the training and the testing data in the same way.
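Counting the two kinds of errors by the sign of the predicted value can be sketched as below; count_errors is a hypothetical helper, with n12 and n21 defined as in Table 2 further on ('healthy' items predicted 'faulty', and vice versa).

```python
import numpy as np

def count_errors(X, y, b_hat):
    """Classify by the sign of y_hat = X b_hat and count both error types."""
    y_hat = X @ b_hat
    n12 = int(np.sum((y > 0) & (y_hat < 0)))   # 'healthy' items predicted 'faulty'
    n21 = int(np.sum((y < 0) & (y_hat > 0)))   # 'faulty' items predicted 'healthy'
    return n12, n21, (n12 + n21) / len(y)      # counts and overall error fraction
```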

The results described above were obtained by using the full set of d = 15 variables as predictors. It is obvious from Fig. 1 that some of the variables are statistically nonsignificant and as such could be dropped from the regression set.

To find more parsimonious subsets, step-wise and all-subsets methods were developed. The stepwise method proceeds sequentially. Firstly, the best single variable (predictor) is chosen. Next, depending on two declared parameters, p-to-enter and p-to-remove, the currently best variable (significant in terms of p-to-enter) may be added to the regression set. Conversely, some other variable from the regression set, found to be nonsignificant in terms of p-to-remove, may be removed from the actual regression set.

It is also possible to perform a stepwise downward procedure of removing the least significant variable from the regression set. A much better method investigates all possible subsets of the considered d variables and chooses the one that yields the smallest residual error. This method will be shown in the next section.

    4. Search for the best subset by performing all-subsets search

4.1. Finding the best subset of length J (J = 1, ..., 15) and its root mean square error rmse

With d = 15 variables we have $2^{15} = 32768$ subsets to investigate. The subsets may contain J = 0, 1, 2, ..., 15 variables.

For each J there are $N_J = \binom{15}{J}$ subsets to investigate, as shown below:

J:     1    2    3     4     5     6     7     8     9    10    11   12   13  14  15
N_J:  15  105  455  1365  3003  5005  6435  6435  5005  3003  1365  455  105  15   1


As the criterion for the 'best' subset among all those of length J, we take the minimal mean squared prediction error mse, given as

$$mse(J) = \frac{1}{n - J} \sum_{i=1}^{n} \left( y_i - \hat{y}_i^{(J)} \right)^2, \quad (6)$$

where the minimum is taken over all $N_J$ subsets of length J. Alternatively, we may take as the criterion of fit the square root of mse, given as

$$rmse = \sqrt{mse}. \quad (7)$$
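An exhaustive search of this kind can be sketched in a few lines of NumPy; the helper name best_subset and the use of numpy.linalg.lstsq are our own illustrative choices, and for J around 7 or 8 the loop already visits several thousand subsets, as listed above.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, J):
    """Exhaustive search over all C(d, J) subsets of J predictors,
    returning the subset minimising the mse of Eq. (6) on the training data."""
    n, d = X.shape
    best = (np.inf, None)
    for cols in combinations(range(d), J):
        Xs = X[:, cols]
        b, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # LSE fit on the subset
        mse = np.sum((y - Xs @ b) ** 2) / (n - J)    # Eq. (6)
        if mse < best[0]:
            best = (mse, cols)
    return best   # (mse, tuple of 0-based column indices); rmse = sqrt(mse)
```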

Fig. 3. The prediction error rmse for subsets of length J = 0, 1, 2, ..., 15 (J = 0 means an empty regression set). It is evident that the downfall in rmse, when considering subsets of length J > 7, is really meager.

The rmse tells how much, on average, the predicted value $\hat{y}_i^{(J)}$ differs from its target value $y_i$, i = 1, ..., n. The rmse values for the best subset of length J, for J = 2, ..., 15, are shown in Fig. 3.

In Fig. 3 one may see that for J > 7 the rmse error is already very near to its ultimate lower bound of 0.3624, derived from the full regression in 15 variables. Thus there is not much left to explain by the additional variables, and, say, seven variables would be enough to retain for further analysis. Which ones? We will show it in the next subsection. Meanwhile, having the best subset of length J, we would like to turn our attention to the other subsets of length J having a larger rmse than the found best subset. Their rmse is really diversified, which is shown in Fig. 4.

One may see in that figure the spread of the rmse-s considered in subsets of length J = 5 and J = 7. In each plot, for a given J, all the subsets were ordered according to their rmse-s. The spread of the rmse-s is large and it is worth searching for the subset yielding the minimal rmse. One may also notice in Fig. 4 that there are always another few subsets that yield values of rmse similar to that of the optimal subset. We will consider the composition of variables included in such subsets in the next subsection.

Fig. 2. Predictions ŷ from the Xtrain group, n = 1000 (left panel), and the Xtest group, n = 1183 (right panel), using the full LSE regression equation with 15 variables. The vertical line in each panel separates the items (that is, data vectors) having the 'healthy' and 'faulty' status. Notice that all 'healthy' items are recognized correctly as 'healthy'. Notice also that the dominant majority of 'faulty' items are recognized correctly as 'faulty', except 11 items in Xtrain and 11 items in Xtest, which are recognized erroneously as 'healthy'.

Fig. 4. The prediction error rmse for all subsets of length J = 5 (left) and J = 7 (right). The dashed line indicates the value rmse = 0.3624 obtained when putting all 15 variables into the regression equation.

4.2. Composition of the best subsets

Starting from the remarks made when inspecting Fig. 3, we will find for J = 5, 6, 7, 8 the ten subsets with the smallest rmse. We will call the 10 found subsets 'quasi-optimal' subsets. We will look at the composition of these quasi-optimal subsets, that is to say which variables constitute these 10 subsets. This is shown in Table 1.

Table 1 shows which variables are included into the regression set of the found quasi-optimal subsets. The table consists of two parts. The upper part lists for J = 5, 6, 7, 8 the 10 subsets with rmse-s very near to the optimal one. These 10 subsets are numbered by the index k = 1, ..., 10, appearing in ascending order, with k = 1 indicating the subset with the smallest rmse, that is the optimal one.

Table 1
Variables that have entered the top 10 best subsets for J = 5, 6, 7, 8, and their rmse-s.

J = 5                        J = 7
k = 1:  2 3 6 11 13          k = 1:  1 2 3 6 11 13 14
k = 2:  2 3 6 13 14          k = 2:  2 3 6 8 11 13 14
k = 3:  2 6 11 13 14         k = 3:  2 3 6 11 13 14 15
k = 4:  2 6 11 13 15         k = 4:  2 3 5 6 11 13 14
k = 5:  2 6 9 11 13          k = 5:  2 3 6 9 11 13 14
k = 6:  2 6 7 11 13          k = 6:  2 3 6 7 11 13 14
k = 7:  2 6 7 13 14          k = 7:  2 3 4 6 11 13 14
k = 8:  2 6 9 13 14          k = 8:  2 3 6 11 12 13 14
k = 9:  1 2 3 6 11           k = 9:  2 3 6 10 11 13 14
k = 10: 2 3 6 9 13           k = 10: 1 2 3 6 9 11 13

J = 6                        J = 8
k = 1:  2 3 6 11 13 14       k = 1:  1 2 3 6 7 11 13 14
k = 2:  2 3 6 11 13 15       k = 2:  1 2 3 6 11 13 14 15
k = 3:  2 3 6 9 11 13        k = 3:  1 2 3 6 9 11 13 14
k = 4:  1 2 3 6 11 13        k = 4:  1 2 3 4 6 11 13 14
k = 5:  2 3 6 9 13 14        k = 5:  1 2 3 6 8 11 13 14
k = 6:  2 3 6 13 14 15       k = 6:  1 2 3 6 10 11 13 14
k = 7:  1 2 6 7 11 13        k = 7:  1 2 3 6 11 12 13 14
k = 8:  2 6 11 13 14 15      k = 8:  1 2 3 5 6 11 13 14
k = 9:  2 3 6 11 12 13       k = 9:  2 3 6 8 11 13 14 15
k = 10: 2 3 6 7 11 13        k = 10: 2 3 6 8 9 11 13 14

rmse for J = 5, 6, 7, 8 (k = 1, ..., 10):
J = 5: 0.391 0.392 0.395 0.395 0.397 0.399 0.401 0.402 0.402 0.402
J = 6: 0.378 0.384 0.385 0.387 0.387 0.388 0.388 0.389 0.389 0.389
J = 7: 0.373 0.375 0.376 0.377 0.377 0.378 0.378 0.378 0.378 0.379
J = 8: 0.369 0.371 0.371 0.371 0.372 0.372 0.372 0.373 0.373 0.373

In the lower part of the table the corresponding rmse-s for the respective subsets are given. One may notice that indeed the rmse-s within one group of 10 subsets differ very little. One may also notice the preference for inclusion into the quasi-optimal subsets of the variables Nos. 2, 3, 6, 11, 13, 14.

When taking for each J (J = 1, ..., 15) only the one ultimately best subset, we found the following frequencies of the subsequent variables:

No. of variable:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Its frequency:    8  13  10   4   1  13   7   6   4   2  10   0  11  10   4
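A small sketch of how such a frequency count can be produced from the per-J best subsets; the list best_subsets and the helper name are hypothetical, and the variable numbers are assumed to be 1-based as in the text (i.e., one more than the 0-based indices returned by best_subset above).

```python
from collections import Counter

def variable_frequencies(best_subsets, d=15):
    """best_subsets: one best subset (an iterable of 1-based variable numbers) per J.
    Returns, for each variable 1..d, in how many of these subsets it appears."""
    counts = Counter(v for subset in best_subsets for v in subset)
    return [counts.get(v, 0) for v in range(1, d + 1)]
```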

4.3. Assignments to classes 'healthy' and 'faulty' done by the best subsets of length J

How good are the found best subsets in correct prediction of the variable Y, that is, in assigning the correct class label ('healthy' or 'faulty') to a data vector? In Table 2 we show the number of erroneous predictions both when using the Xtrain data and the Xtest data. Looking at the results displayed in Table 2 one may notice that the erroneous classifications start to stabilize for J = 6. When taking only one best variable as predictor, the fraction of erroneous classifications amounts to ferr = 0.0110 in the Xtrain set (n = 1000), and is equal to ferr = 0.0101 in the test set (n = 1183). When including more variables into the regression set, the decay of the ferr rates is similar in both sets, and starts to stabilize for J = 6 or J = 7 variables included into the regression set. Looking at Table 2 we may infer that the gain from introducing more than J = 7 variables into the regression set is meager, which could also be seen in Fig. 3. For example, when taking a subset with the J > 7 best variables and adding to that subset one additional variable, the improvement of the correct predictions equals one data vector more (placing only one data vector more in the correct class) – out of the total of 1000 considered data vectors.

Table 2
Number of erroneous predictions by the best subsets. Symbol n12 denotes the number of false assignments of items belonging to class 'healthy' (to class 'faulty'); n21 denotes false assignments of items from class 'faulty' to class 'healthy'. Label 'fraction' denotes the proportion of all erroneous assignments compared to the total number of items in the train/test data, respectively.

        Train (n = 1000)            Test (n = 1183)
        n12   n21   fraction        n12   n21   fraction
J = 1     0   111   0.1110           0   145   0.1226
J = 2    20     5   0.0250          19    13   0.0270
J = 3     8     1   0.0090           3     1   0.0034
J = 4     5     3   0.0080           2     1   0.0025
J = 5     1     9   0.0100           2     8   0.0085
J = 6     0    12   0.0120           2    11   0.0110
J = 7     0    12   0.0120           2    11   0.0110
J = 8     0    11   0.0110           2    10   0.0101
J = 9     0    11   0.0110           1    11   0.0101
J = 10    0    11   0.0110           1    11   0.0101
J = 11    0    11   0.0110           0    11   0.0093
J = 12    0    11   0.0110           0    11   0.0093
J = 13    0    11   0.0110           0    11   0.0093
J = 14    0    11   0.0110           0    11   0.0093
J = 15    0    11   0.0110           0    11   0.0093

The prediction in the Xtest set (with n = 1183 independent data items) additionally yields two more correctly classified data vectors. Thus: adding one more variable into the regression set results in classifying one or two more data vectors correctly. One may ask: is it worth it in this situation to use a more complex model with one more predictor in the regression equation?

    5. Search for a reduced subset using the Lasso method

    5.1. Lasso principles

Generally, the Lasso belongs to the so-called shrinkage methods, which retain only a subset of the regression predictors and discard the rest. It yields estimators of the regression coefficients obtained by applying a regularization procedure using the Lasso penalty shown in Eq. (9); this makes the estimates more stable and consistent.

Let X of size N × d denote the observed design matrix. Let y be the centered target variable Y taking values +1 for items belonging to class 1 ('healthy') and −1 for items belonging to class 2 ('faulty'; see the denotations at the beginning of Section 3). The Lasso method (Least Absolute Shrinkage and Selection Operator) solves the following restrained LSE problem

$$b^{lasso} = \arg\min_{b} \, (y - Xb)^T (y - Xb), \quad (8)$$

where the regression coefficients $b = (b_1, \ldots, b_d)$ are subjected to the constraint called the Lasso penalty:

$$\sum_{j=1}^{d} |b_j| \le t, \quad t > 0. \quad (9)$$

The constant t is a kind of tuning parameter; it decides on the amount of shrinkage of the parameters b. It may also be used in a standardized form given as

$$s = t \Big/ \sum_{j=1}^{d} |b_j^{LS}|, \quad (10)$$

where $b_j^{LS}$, j = 1, ..., d, is the ordinary least squares error (LSE) estimate of the respective regression parameter $b_j$.

The regression Eq. (8) yields estimates of the Lasso regression coefficients $b^{lasso}$ without the intercept $b_0$. The respective estimate of $b_0$ is equal to $\hat{b}_0 = \bar{y} = 0$ (because both the columns of the data matrix X and the target vector y have means equal to zero). The Lasso approach was originally proposed by Tibshirani [21], who also proposed some algorithms yielding the Lasso coefficients $b^{lasso}$ as the solution of a quadratic programming problem with constraints. A very convenient algorithm for solving the Lasso problem may be obtained by a small modification of the Least Angle Regression (Lar) algorithm by Efron et al. [9], as shown first in the original paper introducing the Lar algorithm [9]; see also [13,7] for detailed examples. Generally, the Lasso shrinks some regression coefficients and sets others to zero. Some characteristic stages of the algorithm:

• For t = 0, one obtains all regression coefficients equal to 0, which means an empty regression set of variables.
• If t is chosen larger than $t_M = \sum_{j=1}^{d} |\hat{b}_j^{LS}|$ (where $\hat{b}_j^{LS}$ is the ordinary least squares LSE estimate), then the Lasso estimates are the ordinary LSE estimates.
• Letting t grow from 0 to $t_M$, the restraints given by (9) become successively relaxed and more variables are able to enter the regression set with non-zero regression coefficients. Thus, for d variables, there will be M + 1 time instances, $t_0, t_1, \ldots, t_M$, $M \approx d$, where the status of the regression set changes (there may be some loops within, see [9]). The authors [9] have shown that the lasso problem has a piecewise linear solution path, with the points {t_j} as nodes where the linear dependence may change. The regression coefficients grow linearly between the nodes, with a growth rate depending on the localization of the nodes. Moreover, the same authors have shown that the number of linear pieces in the solution path equals approximately d (the entire number of predictor variables) and that the complexity of getting the whole Lasso path is O(nd²), the same as the cost of computing a single ordinary LSE fit (see the original paper [9] or the description of the Lar algorithm in [13,7]).
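The whole piecewise-linear Lasso path can be obtained, for example, with the LARS-based routine available in scikit-learn. This is only a sketch under the assumption that scikit-learn is available (the paper itself used Karl Sjöstrand's Matlab implementation) and that Xtrain, ytrain are the standardized arrays from Section 2.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Compute the full Lasso solution path with a LARS-type algorithm (cf. [9]).
# coefs has shape (d, n_steps): one column of regression coefficients per
# node t_j of the piecewise-linear path; zeros mark variables not yet included.
alphas, active, coefs = lars_path(Xtrain, ytrain, method='lasso')

in_set = np.abs(coefs) > 0      # boolean membership of each variable at each step
print(in_set.sum(axis=0))       # size of the regression set at each node
```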

    5.2. The Lasso solution for the gearbox data

Now we will investigate how the Lasso method works for our data. The calculations will be carried out again on the normalized data matrices, using the Lars algorithm implemented by Karl Sjöstrand (see: http://www2.imm.dtu.dk/~ksjo/kas/software, accessed 20.08.2012).

The function Lars produced the full path of regression coefficients. To illustrate what happens with the content of the regression set (that is, which variables are in and which are out) we have constructed the plot named by us the growing LASSO, which is shown in Fig. 5.

Fig. 5. Growing lasso – full path of variables included into the regression subset. The algorithm starts from an empty regression set (first line from the bottom) and in each step one variable is added to the regression set. The algorithm ends when all variables are in the regression set or a stop criterion is satisfied.

The algorithm started from an empty regression set, to which new variables were added in subsequent steps. Looking at the graph, one may notice that variable No. 13 was added first, and variable No. 11 last, making the set of predictors complete. Each row of the graph exhibited in Fig. 5 shows the content of one regression set yielded by the Lars function; variables being in the respective regression set are indicated by quadrants painted in color. For example, row 6 from the bottom indicates a 5-variable subset containing the variables No. 2, 6, 9, 11, 14. This subset does not appear among the 10 best subsets of length 5 found by the all-subsets search in Section 4. Nonetheless, it has very good diagnostic properties, as shown below in Table 3 and Fig. 6.

Now let us look at the predicted values of the Xtrain and Xtest data sets when using the regression coefficients yielded by the Lasso method. The numbers of erroneous predictions when using Lasso subsets of length J = 1, ..., 15 are shown in Table 3. We show there predictions obtained both from the Xtrain and Xtest data sets, when using regression formulae developed on the basis of the Xtrain data.

Table 3
Number of erroneous predictions by the Lasso subsets. Notations as in Table 2.

        Train (n = 1000)            Test (n = 1183)
        n12   n21   fraction        n12   n21   fraction
J = 1     0   111   0.1110           0   145   0.1226
J = 2     1    60   0.0610           0    76   0.0642
J = 3     0    27   0.0270           0    32   0.0270
J = 4     0    23   0.0230           0    23   0.0194
J = 5     0    12   0.0120           0     9   0.0076
J = 6     0    12   0.0120           0     9   0.0076
J = 7     1     9   0.0100           0     3   0.0025
J = 8     1     9   0.0100           0     3   0.0025
J = 9     0    12   0.0120           1     9   0.0085
J = 10    0    11   0.0110           1    10   0.0093
J = 11    0    11   0.0110           1    10   0.0093
J = 12    0    11   0.0110           1    10   0.0093
J = 13    0    11   0.0110           0    11   0.0093
J = 14    0    11   0.0110           0    11   0.0093
J = 15    0    11   0.0110           0    11   0.0093

Details for the subset composed of k = 5 variables are shown in Fig. 6. Considering the Xtrain and Xtest data, the number of erroneous predictions from the 5-variable subset in the train set equals 12, and in the test set only 9. It was stated earlier that the full regression set with 15 predictors gave 11 erroneous predictions both for the train and the test data (see Fig. 2). Thus the number of correct predictions is very similar, and the 5-predictor model seems to be not substantially worse than the full regression model (in the test set it works even better!).

Fig. 6. Lasso predictions, y_L, for the variable Y in the train data (left panel) and the test data (right panel) when using the reduced subset with 5 variables (No. 2, 6, 9, 11, 14) found when applying the Lasso technique. See Fig. 2 for details of the composition of the Xtrain and Xtest data. There are 12 erroneous assignments in the Xtrain and 9 erroneous assignments in the Xtest data.
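The 'growing Lasso' picture of Fig. 5 is essentially the order in which coefficients become non-zero along the path. A small sketch of recovering that order from the coefs matrix returned by lars_path above (entry_order is our own hypothetical helper, returning 1-based variable numbers):

```python
import numpy as np

def entry_order(coefs):
    """Order in which variables enter the Lasso regression set, derived from
    the (d x n_steps) coefficient matrix of the solution path."""
    first_step = np.full(coefs.shape[0], np.inf)
    for j in range(coefs.shape[0]):
        nz = np.nonzero(np.abs(coefs[j]) > 0)[0]   # steps where variable j is active
        if nz.size:
            first_step[j] = nz[0]
    return [int(j) + 1 for j in np.argsort(first_step) if np.isfinite(first_step[j])]
```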

    6. Discussion and closing remarks

We have considered two sets of vibration time series, for which power spectra were calculated. The main topic of our research was: how many features (amplitudes of the power spectra observed at some fixed frequencies, denoted by us as variables pp1, ..., pp15) should be taken into consideration when aiming at the diagnosis of a healthy or a faulty state of the machine.

It seems that so far there has been no systematic investigation of this topic.

Let us say that the spectral representation of a time series allows one to extract features that may be related to the energy of selected bands or to the amplitudes of characteristic frequencies (here: the planetary gearbox mesh frequency and its harmonics). To our knowledge, the number of such features has not been investigated deeply and was established ad hoc by researchers. The only paper close to that topic is [27], investigating correlations among spectra coming from two different machines: one being in a healthy and the other in a faulty state.




In this paper we have intensively investigated the possibility of finding the best subset using multivariate linear regression, examining in detail all possible $2^{15} = 32768$ subsets. We have also considered a faster and more robust method, namely the Lasso (least absolute shrinkage and selection operator), combined with least angle regression (Lar). The Lasso has the advantage of being a robust method; Lar gives a speed of calculation comparable with ordinary least squares regression. The results of both methods are in reasonable agreement and show that the analyzed set of 15 variables characterizing the gearbox data can be reduced to about 7–9 variables. We have used quite large training and test sets (with cardinalities n = 1000 and n = 1183). The results from both these sets are consistent; sometimes the results in the test sets are even a little better than those in the train sets.

The composition (appearance in the subsets) of individual variables is not random: some variables enter the best subsets more frequently than others. The most frequent variables contained in the best subsets found by the LSE method are: 1, 2, 3, 6, 7, 11, 13, 14. What is characteristic: the most frequent variables found in the best subsets cover the whole span 1–15; it does not happen that, say, variables Nos. 1–8 enter the best set more frequently than variables Nos. 9–15. This is in accordance with preliminary results obtained in [28].

For a fixed cardinality k of the regression subset, the composition of the k-subset is not unique; one may find several compositions with very similar fitness criteria (rmse) for the analyzed data. This is a property of the analyzed variables, not stated explicitly before.

When experimenting manually using stepwise methods, we observed that subsets composed from the first 8 variables are slightly better (R-square ≈ 0.82) than subsets composed from the variables Nos. 8–15 (R-square ≈ 0.77).

Summarizing our research presented here: We have applied classical methods based on multivariate linear regression using the ordinary least squares error, and a combination of two modern methods, the Lasso and the Lar, making the evaluations robust and quick (the Lasso puts the l1 penalty on the least squares error, and Lar enables calculations with a speed comparable to the LSE method). Both approaches yielded similar results. The research has an immediate practical implication: the number of variables recorded and used for the analysis may be considerably reduced.

At the end we would like to mention that dimensionality reduction methods have an extensive literature which has appeared in recent years (see the bibliographic references cited in Section 1). This indicates that reduction of dimensionality is nowadays a hot topic in data analysis. To perform reduction of dimensionality, authors frequently use quite complicated and computer-intensive methods. Our aim in the presented research was simplicity, having in mind the principle formulated by William of Occam (quoted after M. Tipping [22]): "pluralitas non est ponenda sine necessitate", which translates as "entities should not be multiplied unnecessarily". In the machine learning community this means that models should not be more complex than is sufficient to explain the data. The models presented in the paper are conceptually simple and – as shown – are applicable to real industrial data.

    Acknowledgements

This paper is in part financially supported by the Polish State Committee for Scientific Research 2010–2013 as research Project No. N504 147838.

    References

[1] Barszcz T, Bielecki A, Romaniuk T. Application of probabilistic neural networks for detection of mechanical faults in electric motors. Przeglad Elektrotechniczny 2009;85(8):37–41.
[2] Barszcz T, Bielecka M, Bielecki A, Wójcik M. Wind turbines states classification by a fuzzy-ART neural network with a stereographic projection as a signal normalization. Lect Notes Comput Sci (LNCS) 2011;6594(Part 2):225–34.
[3] Bartelmus W, Zimroz R. A new feature for monitoring the condition of gearboxes in nonstationary operating systems. Mech Syst Signal Process 2009;23(5):1528–34.
[4] Bartkowiak A, Zimroz R. Outliers analysis and one class classification approach for planetary gearbox diagnosis. J Phys: Conf Ser 2011;305(1) [art. no. 012031].
[5] Bartkowiak A, Zimroz R. Curvilinear dimensionality reduction of data for gearbox condition monitoring. Przeglad Elektrotechniczny 2012;88(10B):268–71.
[6] Bartkowiak A, Zimroz R. Data dimension reduction and visualization with application to multidimensional gearbox diagnostics data: comparison of several methods. Diffus Defect Data Pt. B: Solid State Phenom 2012;180:177–84.
[7] Clarke B, Fokoué E, Zhang HH. Principles and theory for data mining and machine learning. Springer Series in Statistics; 2009.
[8] Cocconcelli M, Bassi L, Secchi C, Fantuzzi C, Rubini R. An algorithm to diagnose ball bearing faults in servomotors running arbitrary motion profiles. Mech Syst Signal Process 2012;27(1):667–82.
[9] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat 2004;32(2):407–99.
[10] Eftekharnejad B, Adali A, Mba D. Shaft crack diagnosis in a gearbox. Appl Acoust 2012 [in press]. http://dx.doi.org/10.1016/j.apacoust.2012.02.004.
[11] Gryllias KC, Antoniadis IA. A support vector machine approach based on physical model training for rolling element bearing fault detection in industrial environments. Eng Appl Artif Intell 2012;25(2):326–44.
[12] Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
[13] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference and prediction. New York: Springer; 2010.
[14] Heyns T, Godsill SJ, De Villiers JP, Heyns PS. Statistical gear health analysis which is robust to fluctuating loads and operating speeds. Mech Syst Signal Process 2012;27(1):651–66.
[15] Herzog MA, Marwala T, Heyns PS. Machine and component residual life estimation through the application of neural networks. Reliab Eng Syst Safety 2009;94(2):479–89.
[16] Jardine AKS, Lin D, Banjevic D. A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mech Syst Signal Process 2006;20:1483–510.
[17] Parviainen E. Studies on dimension reduction and feature spaces. Aalto University, School of Science, Dept. of Biomedical Engineering and Computational Science. Aalto University Publication Series, Doctoral Dissertations 94/2011.
[18] Peng ZK, Chu FL. Applications of the wavelet transform in machine condition monitoring and fault diagnostics: a review with bibliography. Mech Syst Signal Process 2004;18:199–221.
[19] Pietila G, Lim TC. Intelligent systems approaches to product sound quality evaluations – a review. Appl Acoust 2012;73:897–1002.
[20] Samuel PD, Pines DJ. A review of vibration-based techniques for helicopter transmission diagnostics. J Sound Vib 2005;282:475–508.
[21] Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Statist Soc B 1996;58(1):267–88.
[22] Tipping ME. Bayesian inference: an introduction to principles and practice in machine learning. In: Bousquet O, von Luxburg U, Rätsch G, editors. Advanced lectures on machine learning. Springer; 2004. p. 41–62.
[23] van der Maaten L, Postma E, van den Herik J. Dimensionality reduction: a comparative review. TiCC TR 2009-005, p. 1–36. Tilburg University, The Netherlands; 2009.
[24] Urbanek J, Barszcz T, Zimroz R, Antoni J. Application of averaged instantaneous power spectrum for diagnostics of machinery operating under non-stationary operational conditions. Measurement: J Int Meas Confed 2012;45(7):1782–91.
[25] Yiakopoulos CT, Gryllias KC, Antoniadis IA. Rolling element bearing fault detection in industrial environments based on a K-means clustering approach. Exp Syst Appl 2011;38(3):2888–911.
[26] Yin H. Nonlinear principal manifolds – adaptive hybrid learning approaches. In: Corchado E, et al., editors. HAIS 2008, LNAI 5271. Springer; 2008. p. 15–29.
[27] Zimroz R, Bartkowiak A. Investigation on spectral structure of gearbox vibration signals by principal component analysis for condition monitoring purposes. J Phys: Conf Ser 2011;305(1) [art. no. 012075].
[28] Zimroz R, Bartkowiak A. Multidimensional data analysis for condition monitoring: features selection and data classification. In: CM2012–MFPT2012, BINDT, 11–14 June 2012, London. Electronic Proceedings; 2012. p. 1–12 [art. no. 402].
[29] Zimroz R, Bartkowiak A. Two simple multivariate procedures for monitoring planetary gearboxes in non-stationary operating conditions. Mech Syst Signal Process 2013;38(1):237–47. http://dx.doi.org/10.1016/j.ymssp.2012.03.022.

