
DEGREE PROJECT IN MECHANICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Gearbox fault detection, based on Machine Learning of multiple sensors

ARMANDS KRUMINS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT



Degree Project TRITA-ITM-EX 2021:515

Skadedetektering av växellådor baserat på maskininlärning från flera sensorer

Armands Krumins

Approved: 2021-09-07
Examiner: Ulf Olofsson
Supervisor: Ulf Olofsson
Commissioner: KTH
Contact person: Edwin Bergstedt

Sammanfattning

The increasing demand for higher efficiency and lower environmental impact from gearboxes, used among other things in cars and wind turbines, has created a need for more advanced technical solutions to fulfil these requirements. One such technical solution is condition monitoring, which can extend the life cycle of gearboxes, saving resources and time. Today, condition monitoring using machine learning supports the shift from reactive to predictive actions, addressing and predicting even minor faults before they become significant.

The purpose of this degree project is to develop a methodology that can be used to predict fatigue damage, before it grows too large, during tests in the gear test rig (FZG rig) at the KTH Department of Machine Design. In this work, the existing standard sensors on the rig for temperature, rotational speed and torque were used. Measurement data from tests of four different gear types were analysed: two different materials (wrought steel or powder metal), each with two different surface finishes (ground or superfinished).

After literature studies on gear surface fatigue (pitting), condition indicators for these faults and machine learning, a statistical analysis was performed to see how the measurement data behave during testing and to enable comparison with the machine learning results. Two machine learning models, Decision Tree and Support Vector Machine, were selected and trained in two combinations: either with the root mean square (RMS) only, or also with crest factor, standard deviation and kurtosis.

In total, 64 models were trained, 32 for all tests and another 32 to investigate two specific tests that had a longer test duration before damage occurred. New condition indicators such as standard deviation and signal-to-noise ratio were calculated to obtain more nuanced trends for monitoring the gear behaviour than the surface profile changes of the gears alone. After a comparison with the results of the statistical analysis and previously performed surface profile measurements, it was concluded that the new indicators can indicate changes in the gear's behaviour before the first surface damage is detected by tooth profile measurement.

Keywords: gear, machine learning, condition monitoring, surface fatigue


Master of Science Thesis TRITA-ITM-EX 2021:515

Gearbox fault detection, based on Machine Learning of multiple sensors

Armands Krumins

Approved: 2021-09-07
Examiner: Ulf Olofsson
Supervisor: Ulf Olofsson
Commissioner: KTH
Contact person: Edwin Bergstedt

Abstract

The increasing demand for higher efficiency and lower environmental impact of transmissions used in the automotive and wind energy industries has created a need for more advanced technical solutions to fulfil those requirements. Condition monitoring plays an important role in the transmission life cycle, saving resources and time. Recently, condition monitoring using machine learning has shifted from reactive to proactive action, predicting minor faults before they become significant.

This thesis intends to develop a methodology that can be used to predict faults like pitting initiation before they propagate, in the FZG test rig available at the KTH Machine Design department. Standard sensor measurements already available, such as temperature, rotation speed and torque, are used in this project. Four kinds of gears were used, two made of wrought steel and two of powder metal steel, each with a ground or superfinished surface.

After a literature review on pitting fatigue, condition indicators for these failures and machine learning, a statistical analysis was performed to see how the transmission behaves during testing and to provide reference material for comparison with the machine learning results. Two machine learning models, Decision Tree and Support Vector Machine, were selected and trained in two combinations: either with Root Mean Square only, or with Crest Factor, Standard Deviation and Kurtosis in addition.

As a result, 64 models were trained, 32 for all tests and another 32 to investigate two particular tests with a longer pitting propagation period. New condition indicators such as Standard Deviation and Signal-to-Noise Ratio were calculated to obtain more nuanced trends than using a single measurement to monitor the gearbox behaviour. After comparing with the results from the statistical analysis and previously performed tooth profile measurements, it was concluded that the new indicators can indicate the change in gearbox operation before the first pitting initiation is detected using tooth profile measurement.

Keywords: condition monitoring, gear, machine learning, surface fatigue


FOREWORD

I would like to thank my supervisor Ulf Olofsson, who offered this thesis project, for guidance throughout the whole thesis process and for consulting on various technical challenges. I would also like to thank advisors Edwin Bergstedt and Ellen Bergseth for discussions during the thesis writing process, the provided materials and the useful feedback.

Armands Krumins

Stockholm, August 2021


NOMENCLATURE

Notations

Symbol Description

t time (s)

τ torque (Nm)

N rotation speed (min^-1)

T temperature (°C)

Abbreviations

CF Crest Factor

CI Condition Indicator

DT Decision Tree

FZG Forschungsstelle für Zahnräder und Getriebebau

LS Load Stage

MAE Mean Absolute Error

ML Machine Learning

PID Proportional Integral Derivative controller

PMGR Powder Metal, Ground surface

PMSF Powder Metal, Superfinished surface

R² Coefficient of Determination

RMS Root Mean Square

RMSE Root Mean Square Error

SNR Signal-to-noise ratio

STD Standard Deviation

SVM Support Vector Machine

WGR Wrought Steel, Ground surface

WSF Wrought Steel, Superfinished surface


TABLE OF CONTENTS

SAMMANFATTNING (SWEDISH)

ABSTRACT

FOREWORD

NOMENCLATURE

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Background
1.2 Purpose
1.3 Delimitations
1.4 Research questions
1.5 Method

2 FRAME OF REFERENCE
2.1 Pitting fatigue phenomena
2.2 Failures and faults
2.3 Condition monitoring in gearboxes
2.4 The basics of Machine Learning
2.5 Test rig
2.5.1 Transmission
2.5.2 Sensors
2.6 The used data

3 IMPLEMENTATION
3.1 First view on data
3.2 Statistical analysis of the data
3.3 Machine learning method selection
3.4 Data preparation for machine learning

4 RESULTS
4.1 Statistical analysis results
4.2 Machine learning trained models
4.3 Condition Indicator calculations

5 DISCUSSION AND CONCLUSIONS
5.1 Discussion
5.2 Conclusions

6 RECOMMENDATIONS AND FUTURE WORK
6.1 Recommendation
6.2 Future work

7 REFERENCES

APPENDIX A: SUPPLEMENTARY INFORMATION


1 INTRODUCTION

1.1 Background

Gearboxes are used in various industries, from automotive to manufacturing and energy production, to increase or reduce drivetrain speed and torque. In the energy sector, especially the wind energy industry, the quality and maintenance of the gearbox play a significant role in the performance of the whole wind turbine. The reason is the cost, work and energy put into installation and maintenance, an argument that often causes scepticism about wind energy as a sustainable energy source (1). To mitigate these doubts and help society reach a consensus, it is important to show that wind energy can be developed to be as material and energy efficient as possible.

Therefore, over the last two decades there has been great interest and development in predictive maintenance, a field that has largely evolved alongside and as a complement to the wind turbine industry. The latest trend is the transformation to Industry 4.0, in which online data are used to monitor and predict conditions instead of offline or on-site measurements (2). In many research studies the most commonly used data source is transmission vibration (3), measured at multiple locations. However, such vibrations appear only once the damage has already started to propagate, when it is too late to plan maintenance work ahead. Alternative approaches are therefore possible, such as using transmission error, oil temperature or oil contamination as parameters that indicate changes in the transmission operating conditions.

This master thesis project was offered by Professor Ulf Olofsson as a research-based project at KTH, with the goal of using pre-recorded measurements from spur gear transmission tests. Since there are signs that damage forms on the gear teeth during tests with increasing load, the question arises whether it is possible to predict the damage before it becomes so large that the entire transmission fails to operate. The proposed methodology is machine learning, in which multiple data sources are used to predict the outcome by learning from previous cases with similar conditions. It will be investigated whether this methodology can be used in this situation and whether the measurements can be made in a way that is better suited to it.

1.2 Purpose

The purpose of this thesis project is to continue the long-term research at the Machine Design Department into gear contact characteristics, in this case the pitting phenomenon in the FZG test rig. The novelty of the project is to predict the phenomenon before it becomes so severe that the operation of the transmission is interrupted and servicing is required, causing undesirable downtime. This can be done by analysing pre-recorded measurements to find patterns that reveal changes in the gear contact behaviour.

The approach selected for this project is to train Machine Learning (further in the text ML) algorithms on the measurements mentioned above, so that when an algorithm recognises a pattern associated with an approaching failure it can estimate when the failure may happen. This proactive approach supports more precise decisions on doing the maintenance work just in time. However, it requires enough measurements to train algorithms as advanced as the task demands. This problem is discussed further in the research question section.


1.3 Delimitations

The work should fit within 20 weeks, so of the many failure modes possible in gear contact, such as pitting, spalling, scuffing and wear, only pitting is analysed in this thesis project. The project relies on pre-recorded measurements from the test rig due to limited access to campus during the spring semester 2021. They were recorded during (4) with gears manufactured from different materials, powder metal and wrought steel. Each material has two variants, the first with a ground and the second with a superfinished surface. The damage after each Load Stage was documented by measuring the tooth profile and comparing it to the nominal profile; the result is the deviation from the nominal profile as a function of roll angle.

At the moment there is no goal to use the output of this project anywhere other than for the FZG test rig at KTH; as mentioned before, this knowledge would contribute to transmission research in the long term.

1.4 Research questions

After background research on gearbox industry requirements and discussions with the supervisors, the following research questions were defined:

1. Is it possible to detect surface fatigue, and especially pitting, using the existing sensor system?

2. How early can the pitting fatigue be detected?

3. Which measured parameters contribute the most to predicting surface fatigue?

4. How can the test rig be upgraded to achieve earlier or more precise pitting detection?

1.5 Method

The method used in this thesis project consists of the following tasks:

1. Literature review of previously published studies – before going deeper into the thesis topic, it is important to understand the context of the problem and how often problems like premature pitting happen in transmissions used in industry. It also has to be clarified where to go during the implementation phase, so examples of condition monitoring should be reviewed, including measurement sources, their sensors, and the amount and quality of the measured data.

2. Background study on machine learning – although the main idea of machine learning is simple, namely to use given data to find the patterns they form, the field grows wider every year. Therefore, it is necessary to define how much of it should be used, to keep the machine learning model that can answer the research questions as simple as possible.

3. Measurement evaluation – first a visual evaluation, in which the pre-recorded values, electric signals in the range 0 to 10 V multiplied by the corresponding scale, are plotted in the time domain. When each test run is plotted as a sub-plot of a main plot, the runs can be compared visually to evaluate whether the curve shapes change during the whole test with a certain gear set.

4. Basic statistical analysis of the measurements – parameters such as root mean square, standard deviation, crest factor and kurtosis are calculated over the entire test to see what kind of changes occur in the transmission behaviour.

5. Selection of an appropriate method to train algorithms with the given data – an appropriate tool and method are selected, for example clustering, fitting, pattern recognition or something else.

6. Model training – the prepared data are passed to the machine learning algorithms. The models are trained with different ML methods to see whether any of them is preferable to the others.

7. Use of the trained models – when a model is trained, its practical application should be defined. Indicators such as standard deviation and signal-to-noise ratio can be extracted from it and used to evaluate the behaviour of the transmission.

8. Evaluation of results – comparison between the statistical analysis, the tooth profile measurements from (5) and the machine learning outcome, with conclusions about the whole process and its output and how usable it is in cases other than the FZG test rig.

9. Suggestions for improvement – based on findings from the background study of gearbox condition monitoring.

10. Finalization of the report – read through the whole report and update it with the latest findings.


2 FRAME OF REFERENCE

2.1 Pitting fatigue phenomena

During operation, gears can experience two kinds of failure: surface damage or tooth breakage. Since the focus of this thesis project is on detecting possible damage as early as possible, and tooth breakage means the end of the transmission life, only surface damage is considered here. Pitting in particular is selected as the parameter to monitor because it is one of the earliest-appearing failure modes. However, it is difficult to spot its origin visually in the beginning, because the phenomenon consists of multiple stages: initial, micro and macro pitting (in some sources called spalling) (6).

Surface fatigue of spur and helical gears primarily appears in the form of pitting near the pitch line. From Hertz theory it is known that at the pitch line, where rolling action dominates, the shear stresses are greatest below the surface. This leads to subsurface crack initiation. The crack may originate at material inclusions that contribute to the stress concentration (7).

In the area above the pitch line, where rolling combined with slip takes place, the tangential traction load can significantly affect the value and location of the stress concentration. With a large traction load, the maximum stress concentration shifts to the surface and may initiate surface cracking. Contributing factors to surface crack initiation include intensified stresses around large asperity contacts or surface defects. Since the pinion teeth are in contact more frequently than the wheel teeth, they suffer more severe pitting. At the addendum, slip induces a positive surface stress, whereas at the dedendum flank the stress is oriented in the opposite direction. In the latter case a tensile stress is induced, which pulls surface cracks open; lubricant is then pressed into them, accelerating crack growth (8).

When cracks start to propagate, they form small, shallow pits in the micrometre range, resulting in grey patches. This stage can start right after the run-in phase, when gear surface irregularities are worn off. Finally, when the propagation of cracks results in particles breaking off and leaving pits in the flanks, macro pitting starts to propagate. It usually leads to increased vibrations in transmission operation (8).

In (5) the pitting damage was investigated by applying progressively higher loads in the FZG test rig. The indicator was the deviation from the nominal gear profile, measured after each Load Stage. This is more accurate than simply weighing the gear, which gives a broad impression of the phenomenon rather than a local one. The results in that paper show that the pitting propagates further after each load; it forms at the middle of the tooth, where the pure rolling point is, and right above the gear root, where the contact ends. The results of this thesis project and that paper are compared further in chapter 5.

2.2 Failures and faults

Before going deeper into condition monitoring, it is important to distinguish between different conditions. The concept of failure is fundamental to reliability. Failure is always related to a required function, and the function is often specified together with a performance requirement. The failure occurs when the function cannot be performed or has a performance that falls outside the performance requirement. It may develop gradually or suddenly. It can be revealed on demand, when it has already appeared; during a functional test, to check whether it is approaching; and finally by monitoring or diagnostics. A fault is the state of an item characterized by the inability to perform a required function. While a failure is an event that occurs at a specific point in time, a fault is a state that lasts for a shorter or longer period. An error is present when the performance of a function deviates from the target performance (the theoretically correct performance) but still satisfies the performance requirement. An error will often, but not always, develop into a failure (9).

2.3 Condition monitoring in gearboxes

Predictive maintenance is becoming increasingly interesting, and even necessary, for many industries to be cost effective. It can consist of three main parts: measurement recording, processing of the measurements, and their use in machine learning algorithms (2); these steps are described further in this chapter. So far, wind turbines are the most frequently represented application in predictive maintenance research articles, which can to some extent be explained by the exponential worldwide growth of the wind power industry over the last two decades (1). In the wind turbine industry, maintenance also cannot be done spontaneously, as it is time and resource consuming due to the large scale, compared to the maintenance of passenger or commercial vehicles. The large amounts of data collected by industrial systems contain information about processes, events and alarms that occur along an industrial production line.

Moreover, when processed and analyzed, these data can bring out valuable information and knowledge about the manufacturing process and system dynamics. By applying data-driven analytic approaches, it is possible to find interpretable results for strategic decision-making, providing advantages such as maintenance cost reduction, machine fault reduction, fewer repair stops, reduced spare parts inventory, increased spare part life, increased production, improved operator safety, repair verification and overall profit (10).

Techniques for maintenance policies can be categorized into the following main classifications:

1. Run to Failure (R2F): also known as corrective maintenance or unplanned maintenance. It is the simplest of the maintenance techniques and is performed only when the equipment has failed. It may lead to high equipment downtime and a high risk of secondary faults, and can thus create a very large number of defective products in production.

2. Preventive Maintenance (PvM): also known as scheduled maintenance or time-based maintenance (TBM). PvM refers to periodically performed maintenance based on a planned schedule in order to anticipate failures. It sometimes leads to unnecessary maintenance, which increases operating costs. The main aim is to improve the efficiency of the equipment by minimizing failures in production.

3. Condition-based Maintenance (CBM): this method is based on constant monitoring of the machine, the equipment or their process health, so that maintenance is carried out only when it is actually necessary. Maintenance actions are taken only after one or more degradation conditions of the process have been observed. CBM usually cannot be planned in advance.

4. Predictive Maintenance (PdM): also known as statistical-based maintenance. Maintenance is scheduled only when needed. It is based on continuous monitoring of the equipment or machine, like CBM, and utilizes prediction tools to determine when such maintenance actions are necessary, so the maintenance can be scheduled. Furthermore, it allows failure detection at an early stage based on historical data by utilizing prediction tools such as machine learning methods, integrity factors (such as visual aspects, coloration different from the original, wear), statistical inference approaches, and engineering techniques (11).

The gearbox is one of the most important components in industrial processes. Its health and safety are vital to the reliable operation and improved efficiency of the relevant facilities in the whole system. However, gearboxes generally work in harsh operating environments, which may accelerate their degradation. Consequently, they are subject to different defect types such as gear fatigue cracks, gear pitting, bearing defects, bent shafts, etc. Gearbox defects may even cause failure of the whole system, leading to significant economic losses, costly downtime, and even catastrophic damage. Thus, fault diagnosis and prognosis of gearboxes are of great importance to achieve a high degree of availability, reliability, and operational safety (12).

Condition monitoring systems deal with various types of input data, for instance vibration,

acoustic emission, temperature, oil debris analysis etc. Systems based on vibration analysis,

acoustic emission and oil debris are the most common and are very well established in industry.

Systems based on acoustic emission have a more obvious application for bearing monitoring

than for gearing monitoring. However, some applications for gearbox condition monitoring have

been introduced. Acoustic emission (AE) is usually defined as transient elastic waves generated

from a rapid release of strain energy caused by a deformation or by damage within or on the

surface of the material.

Oil debris analysis is a very reliable method for detecting gearing damage in the early stages and allows estimation of the wear level. During gearbox operation the contacting surfaces of gearwheels and bearings are gradually abraded. Small pieces of material break off from the contact surfaces and are carried away by the oil lubricating the gearwheels and bearings. By detecting the number and size of particles in the oil, gear pitting damage can be identified at an early stage, when it is still unidentifiable by vibration analysis. Oil debris sensors are usually based on a magnetic or an optical principle. Magnetic sensors measure the change in magnetic field caused by metal particles in a monitored sample of oil.

A disadvantage of oil debris analysis is that it does not localize the failure in complicated gearboxes. The oil used in the oil debris monitoring system should not dissolve the metal particles or spread a metal film onto the gearbox housing (13).

The gearbox lubrication oil temperature values are important from a condition monitoring perspective, as the most common failure modes in the gearbox will potentially manifest themselves as a deviation in these measurements. Hence, normal-behaviour models for the gearbox bearing and lubrication oil temperatures are utilized to achieve condition monitoring of the gearbox (14).

There are two ways to record measurements: directly or indirectly. Indirect sensing parameters are less accurate in indicating the condition of gearbox components, but the rugged sensor design makes them more suitable for practical applications. Direct sensing techniques, on the other hand, measure actual quantities of the gearbox component conditions and have a high degree of accuracy. Due to practical limitations during normal gearbox operation, direct sensing techniques are commonly used for offline measurement or as laboratory techniques (12).

In general, the more data are available for condition monitoring, the better, because the training and decision-making process can then be more nuanced. There is also always the option to sort out data that give an inadequate result, caused for example by a test interruption due to maintenance work, a malfunctioning sensor or mistakes during the experiment, which could otherwise cause false alarms.

For example, in (14) four wind turbines are inspected, and the 10-min-average Supervisory Control and Data Acquisition (SCADA) data are used for monitoring purposes. Hence, in 24 h there are at most 144 measurements. The results of the anomaly detection are presented as an average over 12-h periods. To increase confidence in the prediction, and in line with the missing-data filtering approach, it is ensured that at least 1 h of data is available for an output from the anomaly detection to be considered. In cases where sufficient data are not available, the previous valid output is copied, and an indication of missing data is presented in the output.

To get a more accurate answer after the visual evaluation of the measurements, Condition Indicators (CI) are used, which describe the measurements from a statistical perspective. They can be either time-based or frequency-based. The time-based indicators can be divided further into Raw Signal, Time Synchronous Average Signal, Residual Signal, Difference Signal, and Band-Pass Mesh Signal (15). The raw signal sub-group includes parameters such as Root Mean Square (RMS), Crest Factor (CF), Energy Ratio (ER) and Kurtosis.

RMS is the simplest method for detecting and measuring defects in the time domain. It is good for tracking the general noise level and is also very useful for detecting unbalanced rotating elements. The root mean square, also called the quadratic mean, is a statistical measure of the magnitude of a varying quantity. RMS was initially developed to describe the temperature of a resistor subjected to a sine-wave alternating current. It can be described by the equation:

RMS = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2}    (1)

Where:

x – the original sampled time signal

N – the number of samples

i – the sample index

CF gives better measurements than RMS for detecting defects in rotating machinery (15). It can be defined as the ratio of the positive peak value of the input signal to the RMS level. The value of the crest factor is affected by the number of peaks in the time-series signal. In normal operation the crest factor is typically between 2 and 6, and any value higher than 6 is usually related to machinery problems. A signal with a small number of high-amplitude peaks can generate a larger crest factor value, as the numerator increases (high-amplitude peaks) while the denominator decreases (few peaks mean a lower RMS). It is a normalized measure of the amplitude of the signal and increases even when only a small number of high-amplitude peaks occur, such as in a signal resulting from local tooth damage. The equation for CF is:

CF = \frac{\text{Peak level}}{\text{RMS}}    (2)

ER is a useful technique for detecting heavy uniform wear. It can be defined as the ratio of the RMS of the difference signal d to the RMS of the regular meshing component signal r. The energy in the difference signal d is compared to the energy in the regular component signal r; the theory behind this technique is that energy moves from the regular signal to the difference signal (15). This parameter is more suitable for detecting severe damage, as it depends on a large difference in acceleration that causes a significant change of energy. It can be described by the equation:

ER = \frac{RMS_d}{RMS_r}    (3)

The next indicator is Kurtosis, a parameter that provides a measure of the size of the tails of a distribution. It can be used as an indicator of major peaks in a signal and is defined as the fourth normalized moment of the signal. It is a useful measure of how peaky the signal is. As a gear wears or breaks, this feature should signal the fault due to the increased level of vibration. Simply put, kurtosis is a statistical measure of the number and amplitude of peaks in a signal: when there are more peaks in a signal, the kurtosis value becomes larger. A signal of Gaussian noise has a kurtosis value close to 3, and a gearbox in good condition is associated with a Gaussian distribution and therefore has a kurtosis of around 3. It should be noted that researchers often subtract 3 from the calculated value, ending up with a value near zero for a healthy gearbox (15). If more than one tooth is defective, the data distribution becomes flat and the kurtosis value decreases (13). The kurtosis equation is given by:

Kurtosis = \frac{N \sum_{i=1}^{N} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{N} (x_i - \bar{x})^2 \right)^2}    (4)

Where:

x – the signal

x̄ – the mean value of the signal

i – the index of data points in the time record

N – the total number of data points in the time record
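As an illustration of how these time-domain indicators can be computed in practice, the following MATLAB sketch evaluates equations (1), (2) and (4) for a generic signal vector; the signal x is only a placeholder and the sketch is not taken from the thesis.

% Minimal MATLAB sketch (illustrative only): time-domain condition
% indicators for a signal vector x, following equations (1), (2) and (4).
x = randn(1000,1);                                   % placeholder signal
rmsVal  = sqrt(mean(x.^2));                          % RMS, eq. (1)
cfVal   = max(x) / rmsVal;                           % Crest Factor, eq. (2): positive peak over RMS
stdVal  = std(x);                                    % Standard Deviation
kurtVal = length(x)*sum((x - mean(x)).^4) / sum((x - mean(x)).^2)^2;  % Kurtosis, eq. (4), about 3 for Gaussian noise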

2.4 The basics of Machine Learning

Nowadays ML has become one of the leading fields in technology, driving Industry 4.0 and changing how many issues are managed, such as self-driving cars and the treatment of diseases. ML is a branch of artificial intelligence (AI) focused on building applications that learn from data and improve their accuracy over time without being explicitly programmed to do so. In data science, an algorithm is a sequence of statistical processing steps. In ML, algorithms are "trained" to find patterns and features in massive amounts of data in order to make decisions and predictions based on new data. The better the algorithm, the more accurate the decisions and predictions become as it processes more data. As big data keeps getting bigger, as computing becomes more powerful and affordable, and as data scientists keep developing more capable algorithms, ML will drive greater and greater efficiency in personal and working lives (16).

ML algorithms are categorized into three different types: supervised, unsupervised, and reinforcement learning. The categories are depicted in figure 1. In this chapter the first two are discussed, as they are the most used in various industries and are the simplest ones.

Figure 1. Classifications within Machine Learning techniques (11)


Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. As input data are fed into the model, it adjusts its weights until the model has been fitted appropriately. Supervised learning helps organizations solve a variety of real-world problems at scale, such as classifying spam into a separate folder from the inbox (16).

Supervised learning uses a training set to teach models to yield the desired output. This training

dataset includes inputs and correct outputs, which allow the model to learn over time. The

algorithm measures its accuracy through the loss function, adjusting until the error has been

sufficiently minimized.

Supervised learning can be separated into two types of problems: classification and regression.

• Classification uses an algorithm to accurately assign test data into specific categories. It recognizes specific entities within the dataset and attempts to draw conclusions about how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest.

• Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as sales revenue for a given business. Linear regression, logistic regression, and polynomial regression are popular regression algorithms (16).

Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled

datasets. These algorithms discover hidden patterns or data groupings without the need for

human intervention. Its ability to discover similarities and differences in information makes it the

ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and

image recognition.

Unsupervised learning models are utilized for three main tasks—clustering, association, and

dimensionality reduction. Clustering is a data mining technique which groups unlabeled data

based on their similarities or differences. Clustering algorithms are used to process raw,

unclassified data objects into groups represented by structures or patterns in the information (16).

In (10) it is stated that the most employed ML algorithm in industry is Random Forest (RF) - 33%,

followed by neural network-based methods (i.e. ANN - Artificial NN, CNN - Convolutional NN,

LSTM - Long short-term memory network, and deep learning) - 27%, Support Vector Machine

(SVM) - 25%, and k-means - 13%.

ANNs are intelligent computational techniques inspired by biological neurons (10). An ANN is composed of several processing units (nodes or neurons) with relatively simple operation. These units are usually connected by communication channels that have an associated weight, and they operate only on their local data, which arrive through their connections. The intelligent behaviour of ANNs comes from the interactions between the processing units of the network. ANNs are one of the most common and widely applied ML algorithms, and they have been proposed for many industrial applications, including soft and predictive control (10). Their main advantages include: no expert knowledge is needed to make decisions, since they are based only on historical data (like the k-means model); even if the data are inconsistent, they do not suffer degradation (i.e., ANNs are robust); and once an accurate ANN has been built for a particular application, it can be employed in real time without having to change its architecture at every update. However, some disadvantages of ANNs are:

• networks can reach conclusions that contradict the rules and theories established by the applications,

• training an ANN can be time-consuming,

• they are "black box" methods (that is, it is impossible to know why the ANN model has reached an output prediction),

• a huge data set is needed for an ANN to learn correctly.

SVM is a well-known ML technique which is widely used for both classification and regression analysis due to its high accuracy (11). SVM is defined as a statistical learning concept with an adaptive computational learning method; the SVM learning algorithm is presented in Figure 2. The SVM learning technique maps input vectors nonlinearly into a high-dimensional feature space.

SVM is a set of supervised learning methods that perform regression analysis and pattern recognition. Initially, SVMs were non-probabilistic binary classifiers, but now they are also employed in multi-class problems. In this case, SVM creates n-dimensional hyperplanes that divide the data ideally into n groups/classes.

Figure 2. Support Vector Machine algorithm visualization (11)

A Decision Tree is a network structure composed primarily of nodes and branches, where the nodes comprise root nodes, intermediate nodes and leaf nodes. The intermediate nodes are used to represent a feature, and the leaf nodes are used to represent a class label. DT classifiers have gained considerable popularity in a number of areas, such as character identification, medical diagnosis, and voice recognition. More notably, the DT model can decompose a complicated decision-making mechanism into a series of simplified decisions by recursively splitting the covariate space into subspaces, thereby offering a solution that lends itself to interpretation (11).

Random Forest creates a "forest" (ensemble) of multiple randomized decision trees and aggregates their predictions by simple averaging. RF is a supervised learning algorithm for both classification and regression tasks. Although an RF is a collection of decision trees, there are differences that need to be emphasized: while decision trees generate rules and nodes from the calculation of information gain and the Gini index, RFs generate decision trees randomly. Additionally, while deep decision trees may suffer from overfitting, RFs avoid overfitting in most cases, because they work with random subsets of features and build smaller trees from such subsets.

RF is the most used and compared ML method in PdM applications, the main motivations being that decision trees allow a large number of observations to be part of the forecast and, in some scenarios, RFs can reduce variance and increase generalization. However, the RF method also has some drawbacks: for example, it is complex and takes more computational time compared to other ML algorithms (10).


2.5 Test rig

2.5.1 Transmission

The test rig (further referred to as the FZG test rig) that is used as the data source for this master thesis project is a back-to-back gear rig used to run a gear transmission with different loads. The measured result is the condition of the gear teeth and their surfaces, looking for patterns of failure modes like pitting, scuffing, sliding wear, erosion and tooth breakage. It was designed by Forschungsstelle für Zahnräder und Getriebebau (Gear Research Centre) at the Technical University of Munich, which is where the abbreviation FZG comes from. It contains two gearboxes, slave and test, each filled with 1.5 liters of transmission oil.

The working principle is that when the motor is turned on, it accelerates the transmission until it reaches a constant speed; since there is a closed power circulation loop, the motor afterwards only compensates for the power losses in the transmission. To make this layout work, a pre-load is required, which in this case is applied with a loading clutch connected to a lever arm. Weights are added at the end of the lever arm, generating a reaction torque that the motor has to overcome to maintain a constant rotation speed. The layout is shown in figure 3 and the technical specifications are summarized in table 1.

The test process is standardized and summarized in ISO 14635-1:2000. First, the transmission oil is heated to 80 °C, then the motor is turned on and rotates at 2250 rpm. The test starts with a run-in phase, which lasts for 1.3 × 10^5 contacts of the pinion at the lowest load. The loads used in this process are compiled in table 2. After the run-in phase, each load from Load Stage 3 to Load Stage 10 (further in the text simply LS) is applied, running the test for 2.1 × 10^6 contacts of the pinion. After each test with its load, the tooth profile is measured to evaluate how much material has been removed from the profile as a function of the gear roll angle. If no signs of pitting are noticed up to the final load, the test is repeated with the same load and the same number of cycles until pitting appears.

In (4) the tooth profile was measured using a Taylor Hobson stylus instrument mounted on a fixture (see fig. 4), so that the gear can remain in place. It measures the profile from the root to the tip of the tooth.

Figure 3. FZG test rig layout (4)


Figure 4. Gear tooth profile measurement (18)

Table 1. FZG test rig technical specifications (17)

Axle distance a       91.5 mm
Module mn             2 .. 5 mm
Face width b          10 .. 40 mm
Load torque T         0 .. 800 Nm
Rotation speed n      10 .. 3000 min^-1

Table 2. Test loads

LS            3      4      5      6       7       8       9       10
Torque, Nm    35.3   60.8   94.1   135.3   183.4   239.3   302.0   372.7

2.5.2 Sensors

The FZG test rig available at KTH is equipped with several sensors that measure the following eight parameters:

• Motor rotation speed

• Input torque

• Output torque

• Temperature in slave gearbox

• Temperature in test gearbox

• Temperature at coolant injection

• Ambient temperature

• Flow rate of coolant



The sensor positions in this pitting setup are the same as in the gear efficiency setup; these locations are shown in figure 5. Rotation speed and torque are measured at the motor. Although the available information is limited, it is known that the temperature is regulated and measured using a proportional-integral-derivative (PID) controller, which uses a feedback loop to either increase or decrease the temperature depending on the previous sample's measurement.

Figure 5. FZG test rig sensor locations (17)
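Since no details of the controller implementation are documented, the MATLAB sketch below only illustrates the PID idea with placeholder gains and a toy first-order thermal model; it is not the rig's actual controller.

% Illustrative discrete PID temperature loop (assumed gains and plant, 1 Hz sampling).
Tset = 80; Kp = 5; Ki = 0.1; Kd = 0.5; dt = 1;
T = 25; integ = 0; prevErr = Tset - T;
for k = 1:3600
    err   = Tset - T;                                 % deviation from the 80 °C set point
    integ = integ + err*dt;
    deriv = (err - prevErr)/dt;
    u     = max(0, Kp*err + Ki*integ + Kd*deriv);     % heater command, cannot be negative
    T     = T + dt*(0.001*u - 0.0005*(T - 25));       % toy first-order oil temperature model
    prevErr = err;
end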

2.6 The used data

The data used in this project are all stored as .mat files. Each test, for example PMGR29, has its own folder in which each LS is represented as a separate file. Each LS file contains all the measurements and a separate vector of scales, which are multiplied with the particular signals to obtain the real measurement values. To make it easier to work with the files, all LSes were merged into one 3D array. For each load stage its size is 58230 x 8, the former number representing the samples recorded every second and the latter the total number of measurement channels. The signals are recorded as 0-10 V voltages and later multiplied by a scaling factor to obtain the real values, with a sampling rate of one sample per second.
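A possible way to assemble such an array is sketched below in MATLAB; the folder layout follows the description above, but the variable names inside the .mat files ('signals' and 'scales') are assumptions and may differ from the actual files.

% Sketch: merge the Load Stage files of one test into a samples x channels x LS array.
lsFiles = dir(fullfile('PMGR29', '*.mat'));           % one .mat file per Load Stage (assumed layout)
nLS  = numel(lsFiles);
data = zeros(58230, 8, nLS);                          % 58230 samples, 8 channels, nLS stages
for k = 1:nLS
    s = load(fullfile(lsFiles(k).folder, lsFiles(k).name));
    data(:,:,k) = s.signals .* s.scales;              % 0-10 V signals times per-channel scale
end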


3 IMPLEMENTATION

3.1 First view on data

To understand how the given measurements can be used for predictive maintenance, it is useful to look at them in graph form. Of all the parameters, the rotation speed, input torque and temperature in the test gearbox are evaluated further, because the output torque is directly related to the input torque and does not show any additional differences, and all other parameters do not show any trends. To give a better understanding of the measurements, figures of temperature, speed and torque from the WGR34 test are presented in this chapter, as it has the most sub-tests and shows the most indications of defects. The figures of the remaining tests can be found in appendix A, as they would take up a lot of space in the report.

In total, four kinds of gears are tested: powder metal gears with ground and superfinished surfaces, further called PMGR and PMSF, and wrought steel gears, also with ground and superfinished surfaces, namely WGR and WSF. Gears of each kind are tested twice, resulting in eight tests. Each test has a unique ID number to differentiate it from the others; for example, WGR33 and WGR34 are separate tests. The data of each gear test are plotted in MATLAB.
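The plotting itself can be done with a simple loop over the merged array, roughly as in the sketch below; the channel index used for the temperature and the mapping of array index to Load Stage number are assumptions and should be adapted to the actual channel order.

% Sketch: one sub-plot per Load Stage for visual comparison of a chosen channel.
% 'data' is the samples x channels x LS array from the earlier sketch.
tempCh = 5;                                           % assumed index of the test-gearbox temperature
nLS = size(data, 3);
figure;
for k = 1:nLS
    subplot(ceil(nLS/2), 2, k);
    plot(data(:, tempCh, k));
    xlabel('Time (s)'); ylabel('T (\circC)');
    title(sprintf('Load Stage %d', k + 2));           % assumes LS3 is the first stored stage
end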

Rotation speed: The WGR33 run-in phase has some irregularities; after that, nothing stands out visually, and the fluctuation amplitude is from 1495 to 1505 rpm during all tests. In test WGR34, however, the situation is different: there are larger periodic irregularities during the run-in sub-test, roughly every 500 seconds (around 8 minutes). Sub-tests from LS3 to LS6 are stable with an amplitude similar to the WGR33 test, but more fluctuations with periodic striking spikes start to appear from LS7 until LS10. When LS10 is repeated, the amplitude of the oscillations stays constant in the range from 1490 to 1510 rpm. Interestingly, from the LS10v9 test onward the amplitude decreases to the level it had in the LS3 test.

The WSF35 test starts with long periodic fluctuations; after that there is nothing notable, with the fluctuation amplitude between 1490 and 1510 rpm as periodic spikes. The same trend occurs in the WSF38 test.

In the PMGR29 test, the run-in phase is relatively stable compared to the wrought steel gear tests, with a very narrow amplitude between 1497 and 1503 rpm. In the rest of the sub-tests the amplitude is stable, without periodic spikes, and is between 1495 and 1505 rpm. A similar trend holds for the PMGR30 test; however, in the LS10 sub-test, after around 9.7 hours, the oscillations start to increase significantly, with some outstanding peaks such as 1460 rpm around 12.5 hours after the start. At the end of this sub-test the amplitude is between 1470 and 1520 rpm. This happened because gear teeth broke off and the operation of the transmission failed.

In the last two tests, with powder metal, superfinished gears (PMSF31 & PMSF32), the trend is the same as for the ground surface gears, except for the episode with broken teeth.

Input torque: In WGR33, the run-in phase shows a trend similar to the speed: fluctuations with a long period. From LS3 to LS6, the torque fluctuates between around 75 and 190 Nm. The fluctuations increase from LS7, with a range from 100 to 240 Nm, finally topping out at LS10v3 with a range from 120 to 300 Nm, and the trend remains the same until the last run, LS10v10. In test WGR34, however, the observations are quite different. The fluctuation periods during the run-in phase are shorter and the torque fluctuation amplitudes are much higher: from LS3 to LS5 they are between 20 and 240 Nm, and a few peaks even fall below zero. In the following sub-tests the lower bound of the amplitude remains similar, but the upper bound increases steadily up to 440 Nm in the LS10 load case. After that, the amplitude remains the same until LS10v8, after which the input torque amplitude decreases radically; in LS10v9 it is between around 110 and 280 Nm, with periodic peaks up to 300 Nm, and this trend remains until LS10v11.


In WSF35, during the run-in phase the torque fluctuates periodically from -10 to 440 Nm; later the trend remains similar, from -50 to 440 Nm, with the upper bound gradually increasing to 500 Nm during the test. A similar trend is seen in the WSF38 test; however, in LS8 there is one outlying lowest value of -513 Nm, which most probably is a random error, as all other measurements fit within their range.

The PMGR29 test shows similar trends: a run-in phase with periodic fluctuations, after which the torque gradually increases from an amplitude between 80 and 170 Nm in LS3 to between 130 and 280 Nm in LS9, with some periodic peaks up to 310 Nm. The same applies to test PMGR30, but in LS10, as mentioned in the speed section, the gear teeth broke off, so the motor was disrupted and worked at full load, which is easily noticeable in the last figure.

Also in the PMSF31 test the torque increase is noticeable, from an amplitude between 80 and 180 Nm in LS3 to between 120 and 240 Nm in LS8. During all tests there are also periodic peaks, usually 20 to 30 Nm beyond the amplitude bounds in both directions. In LS9, around 10.5 h after the beginning of the run, the torque started to increase faster than before, reaching an amplitude between 110 and 320 Nm two hours later.

The last test, PMSF32, runs smoothly, with a run-in phase similar to the PMSF31 test; after that the torque amplitude increases from between 70 and 180 Nm in LS3 to between 130 and 270 Nm at the end of the test. Also in this test there are some periodic peaks whose values are approximately 30 Nm beyond the previously mentioned amplitude bounds.

Temperature in the test gearbox: All tests start with larger fluctuations, which can be explained by the heating up of the transmission oil; it takes time to stabilize it at 80 °C. The temperature fluctuations occur because the temperature rises due to the gear contact, so the PID controller has to shut down the heater, and when the temperature falls below 80 °C it switches on again. In the WGR33 test the temperature fluctuations increase until LS10, when the temperature rises up to 84 °C. A similar trend holds for the rest of the measurements until the pitting is detected; LS10 has some peaks that stand out in the first 4 hours. Meanwhile, in the WGR34 test the fluctuations around the nominal temperature of 80 °C increase from LS3 to LS7 and then decrease in the following two runs. The temperature then increases parabolically and the trend remains the same until the end of the test.

Both WSF35 and WSF38 show a similar trend, with fluctuations increasing until LS7 and then decreasing; but in LS10, 4 hours after the beginning of the run, the temperature increases steadily, reaching 81 and 82 °C at the end of the run for the former and latter test respectively.

The same trend appears with the powder metal gears: in the first four sub-tests the fluctuations increase, then decrease, and finally the temperature starts to rise; in PMGR29 the LS10 run shows some waviness with long periods. Meanwhile, in the PMGR30 test LS10 shows a dramatic temperature increase, starting already from 89 °C and reaching 94 °C after 10 hours, after which the test rig was turned off for around 20 minutes and the temperature stabilized around 80 °C. The PMSF31 and PMSF32 tests show trends and results similar to PMGR29, although the temperature in the last Load Stage increases more in both tests: up to 92 °C in the former and 87.5 °C in the latter.


Figure 6. Rotation speed in slave gearbox

Figure 7. Temperature in test gearbox


Figure 8. Input torque in test gearbox.

3.2 Statistical analysis of the data

To get a better understanding of the results and to have a reference for comparison with the machine learning results, a statistical analysis is conducted. The parameters used in this analysis are motor rotation speed, input torque and test gearbox temperature, for the same reasons mentioned in the previous sub-chapter. The features selected for the analysis are root mean square (RMS), standard deviation (STD), crest factor (CF) and kurtosis. RMS helps to see the general picture, minimizing noise contributions, and points out transitions in speed or torque during the test. STD describes how far each measurement is from the mean value of the whole Load Stage, CF is useful for detecting early local damage on a tooth by revealing periodic peaks, and kurtosis indicates the severity of the damage, for example when its value decreases.

Each of those parameters is calculated for every LS, so the trends can be seen more clearly where the raw data leave room for doubt. The run-in phase is neglected because it has a shorter duration and does not describe regular gear behaviour. Some temperature measurements have to be filtered out, because every time the next load is applied the temperature first drops during the shutdown and then shoots up at the beginning until the PID controller reaches the desired equilibrium at 80 °C. The same applies to the end of each sub-test, as the data also include the period when the test rig is shut down and its oil is cooling down from the last temperature.

At the beginning and end of each LS, all parameters need time to increase and reach steady state, or to decrease when the test rig is shut down. Data are also recorded during this time, so the first 3500 measurement samples, i.e. 3500 seconds, are neglected, and at the end the last 2030 samples are neglected; as a result, a 55000-sample or 15.28 h long sub-test is used for the analysis. In the PMGR29 test, the LS8 temperature reaches its steady state later than 1000 seconds, so a larger number, 2000 samples, was neglected from the beginning of all Load Stages in this test, so that each sub-test is evaluated equally. In the WGR33 test, LS5, and in the WGR34 test, LS10v4, LS10v5 and LS10v7 were run with a 2 Hz sampling rate, so by default their datasets were twice as long as the others. Also, sub-test LS5 in the WGR33 test was interrupted in the middle, so the temperature decrease seen in the plot was modified and replaced with the overall mean temperature value of this Load Stage, so that it does not show a false peak in the measurements. LS3 in the WSF38 test was run for a shorter time (47000 seconds); to make it compatible with the other Load Stages for the statistical analysis and the upcoming machine learning, it was resampled to the same length as the other Load Stages, yet a shorter range was used for the calculations, because after resampling the decrease of all parameters starts earlier and, as mentioned before, the transition from transmission operation to the complete end of the run (the transmission is shut down and the measurement process is stopped) becomes longer due to the changed scale.
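The trimming and per-Load-Stage feature extraction described above could look roughly like the MATLAB sketch below; the trimmed sample counts follow the text, the special handling of resampled or interrupted runs is omitted, and the 'data' array and channel index are the assumed ones from the earlier sketches.

% Sketch: trim each Load Stage and compute the four features used in the analysis.
nStart = 3500;  nEnd = 2030;                  % samples dropped at the start and end (see text)
tempCh = 5;  nLS = size(data, 3);             % assumed channel index; data from the earlier sketch
feats = zeros(nLS, 4);                        % columns: RMS, STD, CF, kurtosis
for k = 1:nLS
    x = data(nStart+1:end-nEnd, tempCh, k);   % trimmed signal of one channel and Load Stage
    m = mean(x);
    feats(k,1) = sqrt(mean(x.^2));                            % RMS
    feats(k,2) = std(x);                                      % standard deviation
    feats(k,3) = max(x) / feats(k,1);                         % crest factor
    feats(k,4) = numel(x)*sum((x-m).^4) / sum((x-m).^2)^2;    % kurtosis, eq. (4)
end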

3.3 Machine learning method selection

For machine learning the same parameters and features are used, so both sets of results can be compared. The Machine Learning and Deep Learning Toolbox in MATLAB was selected for training the data. In this thesis supervised learning is used, as there already are visible trends in the raw data, especially in the temperature data. Of the classification and regression strategies, the latter is selected, because the measurements are given as time-dependent continuous values and can be used to learn where the trend may go.

The data are passed to the application when a new session is started (see fig. 9). From the dataset, predictors and responses are selected; in this case the former are the measurement signals and the latter are the data labels. The next step is validation selection, which is necessary to protect the model from overfitting. Cross-validation partitions the data set into a number of folds selected with the slider control. This method gives a good estimate of the predictive accuracy of the final model trained on the full data set. The method requires multiple fits, but makes efficient use of all the data, so it works well for small data sets. Holdout validation sets aside a percentage of the data as a validation set, selected with the slider control. The app trains a model on the training set and assesses its performance with the validation set. The model used for validation is based on only a portion of the data, so holdout validation is appropriate only for large data sets. The final model is trained using the full data set (19). Here holdout validation is used due to its compatibility with a large data set. The division between training and testing data is 70:30, as recommended in (20).

After that, the interface offers multiple possible model types, such as linear regression models, regression trees, support vector machines, Gaussian process regression models and ensembles of trees. Of these, the Decision Tree and the Support Vector Machine are selected, as they were the most often mentioned in the literature review and the most investigated in connection with damage detection in rotating machines. There are options such as fine, medium and coarse decision trees and linear, quadratic, cubic and Gaussian support vector machines. The former differ in minimum leaf size, which varies from 4 to 36, and the latter differ in kernel function type. The kernel introduces an additional dimension that lets the model operate in a high-dimensional, implicit feature space without computing the coordinates of the data in that space, simply by computing the inner products between the data points in the feature space. Of the Decision Tree variants the coarse one is selected, as it is the simplest, and from the Support Vector Machine assortment the cubic one is selected, to see whether there is any gain from using a more sophisticated method than a decision tree. Support Vector Machine parameters such as kernel scale, box constraint and epsilon are set to automatic.
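A rough command-line equivalent of these Regression Learner settings is sketched below, assuming the predictor matrix X and label vector y prepared in the next section; the split and hyperparameters mirror the choices made in the app, while kernel scale, box constraint and epsilon, which the app sets to automatic, are left at their defaults here:

```matlab
% 70:30 hold-out split of the observations
cv     = cvpartition(size(X,1), 'HoldOut', 0.30);
Xtrain = X(training(cv),:);  ytrain = y(training(cv));
Xtest  = X(test(cv),:);      ytest  = y(test(cv));

% Coarse regression tree (large minimum leaf size, as in the app's coarse tree)
treeMdl = fitrtree(Xtrain, ytrain, 'MinLeafSize', 36);

% Cubic SVM regression (polynomial kernel of order 3)
svmMdl = fitrsvm(Xtrain, ytrain, 'KernelFunction', 'polynomial', ...
                 'PolynomialOrder', 3, 'Standardize', true);

% Predictions on the held-out 30 %
yPredTree = predict(treeMdl, Xtest);
yPredSvm  = predict(svmMdl,  Xtest);
```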


Figure 9. Data selection for training session

3.4 Data preparation for machine learning

The total length of a test is very long; for example, PMGR29 contains 7 used LSes, each of them 58230 samples long, which gives 407610 samples. Training with such a large amount of data on a budget notebook computer with a 1,8 GHz processor and 8 GB of Random Access Memory would be very time- and energy-consuming. To reduce the burden on the computer in this project, the length of the data is reduced 100 times. First, as in the statistical analysis, the run-in phase is neglected, and the beginning and end of each LS are neglected in order not to spoil the trend in the remaining data. All selected ranges of samples are merged into one vector representing the whole test. When the beginning and end of each LS are neglected, 52500 samples remain; they are then sliced into 525 parts, each containing 100 samples. For each part, the previously mentioned statistical parameters are calculated, and finally the new calculated values from each LS are merged together. The choice of 100 samples was made because a round scaling factor makes it easier to see the difference between the raw and the reduced data.

To see whether it makes sense to make the machine learning model more sophisticated, the inputs are passed to it in two versions: first, with RMS only; second, with RMS, STD, CF and kurtosis. The Regression Learner requires labeled data, so an additional vector with values from 0 to 1, with an increment defined by the dataset length, is added to each dataset of measurements. These labels describe how far into the test each measurement lies. To make the measurements comparable, the scale is neglected, so they are all signals in the range from 0 to 10 V again. In any case, the Regression Learner application normalizes the data to a scale between 0 and 1 over the whole range before it is used for learning.
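The preparation step can be summarized by the following sketch, assuming the trimmed Load Stage signals have already been concatenated into one vector signal and reusing the hypothetical lsFeatures helper from section 3.2:

```matlab
winLen = 100;                                % samples per slice (100x reduction)
nWin   = floor(numel(signal) / winLen);      % e.g. 52500 samples give 525 slices
X      = zeros(nWin, 4);                     % RMS, STD, CF and kurtosis per slice

for k = 1:nWin
    seg    = signal((k-1)*winLen+1 : k*winLen);
    f      = lsFeatures(seg);
    X(k,:) = [f.rms, f.std, f.cf, f.kurt];
end

% Response labels: how far into the test each slice lies, from 0 to 1
y = linspace(0, 1, nWin)';

% The RMS-only training variant simply keeps the first column
Xrms = X(:,1);
```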


4 RESULTS

4.1 Statistical analysis results

The results of the statistical analysis are plotted in figures 10 to 21, for each of the selected parameters and CIs, namely CF, kurtosis, RMS and STD of speed, test gearbox oil temperature and input torque. There is an additional cropped view of the results in the range from the first to the eighth sub-test, as some results lie very close together in a narrow range and are easier to distinguish there.

Regarding the speed, the CF remains relatively stable in all tests during all LSes. Both WSF gears stand above the other tests, perhaps due to the noisier signal that was also noticed in chapter 3.1. It increases slightly in the WGR34 test after LS10, when micropitting and mild damage are only starting to form. The last value of PMGR30 is the highest, which can be explained by the massive gear failure already described earlier.

The kurtoses in general also do not show significant change; for all measurements they lie between 2,75 and 2,85 in the beginning. Relatively larger changes happen in the WGR and WSF tests, while all powder metal gears demonstrate much smoother behavior during all LSes. The RMS values do not vary much; they are slightly lower in the WSF35 test, and the WGR34 test values later drop even lower until LS10v6, increasing again afterwards. A similar trend appears in the STD graph, where all but the WSF35, WSF38 and WGR34 gears demonstrate increasing values during the test, although the values of both WSF gears tend to decrease after LS7.

In the case of the temperature, the results stand out more: in all cases CF gradually increases from 1,004 at LS3 to 1,008 at LS7, after which it decreases and later, in all tests except the WGR tests, increases rapidly, which can be explained by the significant temperature increase when macropitting starts to propagate. The WGR tests show a similar trend, but the CF increase during the repeated LS10 tests is smaller than in the last LSes of the other tests. Later it remains relatively stable, as the temperature in those tests is already higher than the steady temperature of 80 °C and does not change much upwards or downwards.

The kurtoses before the appearance of pitting tend to vary between 2,2 and 2,7. Unlike the CF, they can go upwards or downwards when the pitting starts to propagate. For example, both WSF kurtoses tend to go downwards starting from LS8, while at the same time the PMGR30 kurtosis increases sharply to 6 and then rapidly decreases to 1,4. A similar scenario happens in both WGR tests, where the kurtosis reaches its maximum at LS10v2 and LS10v3 in the WGR34 and WGR33 tests respectively. Later they start to decrease down to 4 at LS10v7, and after that they increase once again. The RMS values in all tests tend to decrease slightly after LS4, become stable at LS5 and increase significantly at the last LS of each test, except for the WGR tests. In those tests the RMS value increases, but not as much as in the other tests, and it remains relatively stable during the remaining LSes, although still higher than the nominal steady-state temperature.

The CF values of the torque show a picture very similar to the speed case, except that here, in addition to the outstanding WSF tests, the values in the WGR34 test are also higher than in the other tests. While in the majority of tests they tend to start at around 1,7 and decrease to 1,6 before pitting, the WGR values start from 2,2 and slightly increase during the whole test up to 2,4, whereas both WSF gears start with a CF of around 3 that decreases significantly down to 2,4 at the last LS. In contrast, the kurtoses show less significant changes, but also here the WSF and WGR34 tests keep their distance from the other tests; they vary between 2,1 and 2,5, while the other tests tend to be around 3. There is a clear trend in all tests for the RMS value, which increases until the last LS, or until LS10 in the WGR case, where the same torque was applied multiple times until severe pitting damage was reached. The only feature that stands out again is the distance between the WSF gears and the others, which was also noticed in the raw measurements. The same can be noticed in the STD graph, where the rest of the results lie between 20 and 30 Nm, while the WSF35 and WSF38 gears have values increasing from 120 and 110 Nm to 140 and 145 Nm at LS7, later decreasing very close to the starting values. As in the CF plot, the WGR34 test again lies somewhere between the WSF tests and the remaining tests; its STD starts at 45 Nm and peaks at LS10v4 with a value of 100 Nm, remaining the same for the rest of the test.

Figure 10. Speed crest factors

Figure 11. Speed kurtoses


Figure 12. Speed RMS

Figure 13. Speed STD

Figure 14. Temperature CF


Figure 15. Temperature kurtoses

Figure 16. Temperature RMS

Figure 17. Temperature STD


Figure 18. Torque CF

Figure 19. Torque kurtoses

Figure 20. Torque RMS


Figure 21. Torque STD

4.2 Machine learning trained models

The results of the ML application are presented in figures 22 to 37; each of them includes four graphs with results from models trained with the DT and the cubic SVM. Both are trained either with RMS only or with all parameters, including CF, STD and kurtosis. The blue line represents the true values, the yellow dots the predicted values and the red lines the error, in other words the distance between the true and the predicted value. To make it easier to follow the results, additional markers from the first to the last LS are added to the plots. The Root Mean Square Error (RMSE), Coefficient of Determination (R2) and Mean Absolute Error (MAE) are calculated for each test and compared in bar charts. As both the WGR33 and WGR34 tests were longer and the damage in those gears propagated much more slowly than in the other tests, additional training models from LS10v2 in WGR33 and LS10v3 in WGR34 were created to see the difference during the damage propagation.
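For reference, the three performance measures compared in the bar charts can be computed from the true and predicted responses as in the following sketch (the variable names ytest and yPred are illustrative):

```matlab
res  = ytest - yPred;                                     % residuals
rmse = sqrt(mean(res.^2));                                % Root Mean Square Error
mae  = mean(abs(res));                                    % Mean Absolute Error
r2   = 1 - sum(res.^2) / sum((ytest - mean(ytest)).^2);   % Coefficient of Determination
```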

The PMGR29 results with RMS only show stable predictions before the severe damage at LS9. The prediction becomes slightly more accurate at LS8, where the error decreases, but in the second half of LS8 the model tends to underestimate the response, as the predicted values lie below the blue line before LS9. Both the SVM and the DT show very similar behavior during the last LS, when the prediction of the condition is more accurate than during the previous LSes. The situation in the results with all inputs is different: in the DT case the results show something like a step trend, perhaps due to the torque that is increased after every LS. However, the model also produces some especially outstanding predictions in both the positive and the negative direction; at LS4 there are some signal predictions with a value of 0,5, while the true values are between 0,15 and 0,25. At LS7 the ML algorithms underestimate the true response of 0,55 to 0,7, predicting values even down to 0,2. The prediction at the final LS is nearly identical to the one in the model trained only with RMS. In the case with all inputs and the cubic SVM, there are smaller underestimations and overestimations of the true response, but the final LS is vaguer than in the other PMGR test results. What draws attention is that in the models trained with all inputs the prediction error decreases at LS5 and LS6, whereas in the models trained only with RMS it is similar to the other LSes. As discussed in chapter 3.1, the temperature response changes in those LSes were more intense; perhaps this can be linked to the situation where there is a larger difference between adjacent measurements, making it easier to tell them apart.

In the PMGR30 case, where at the last LS the gear was totally damaged, namely multiple teeth were broken off, the results are quite ambiguous. In the models trained with RMS only, the prediction up to LS9 is relatively even, yet some random values periodically stand out, as in the PMGR29 test, also with the DT. After LS8 there is a similar trend in both tests, when the damage starts to propagate significantly until the gear is destroyed in the middle of LS10. Interestingly, both ML methods in that situation tend to underestimate the real response thoroughly: the real value is 0,9, but the predicted value can be as low as 0,1. The behavior of the DT model trained with all inputs is quite similar to the one demonstrated in the PMGR29 test, with some predictions standing out. Compared to both simpler models, this one also tends to underestimate the total failure of the gear, although the error is much lower, predicting a signal response of 0,5 when the true response is 0,95. This means that at failure the more sophisticated model predicts with around 2 times higher precision than the model trained only with RMS. However, the situation with the SVM model trained with all inputs is not as pleasing: there are two severe random errors in the model when the gear was destroyed. The model predicts signal responses of 25 and 22, which is physically impossible, so this model is not usable.

The PMSF31 results, when the algorithms are trained only with RMS, demonstrate very similar behavior; the major difference is at the last LS, when the gear rapidly becomes damaged. The DT model demonstrates a nearly perfect prediction, whereas the SVM model is less accurate, even with one random underestimation, with a true response of 0,95 and a predicted one of 0,8. This situation demonstrates the difference between the two methodologies very well: there was an obvious temperature increase, hence a larger difference between adjacent data points, and it is easier for the DT to be certain about its decision. The SVM benefits from a more certain data trend too, yet in this situation it cannot be as accurate as the DT, because it uses a hyperplane to separate the data points and introduces an additional dimension to project the data, so perhaps it is too clumsy for this situation. When both models are trained with all available inputs, the prediction error generally tends to decrease; in the DT case at LS9 it demonstrates exactly the same prediction as the simpler model does, and the SVM model is quite stable during the whole test, but at the beginning of LS6 there is one random error with a predicted value of 1,7 that could cause a false alarm. At the end of LS9, when the damage has propagated, the prediction is less accurate than in the simpler SVM model.

The results from the PMSF32 test, trained with RMS, are similar to the PMSF31 results, although they tend to be less symmetrical at LS6. The failure predictions are also similar to the previously mentioned test. In contrast to the PMSF31 DT case with all inputs, here the errors are larger from LS6 to LS8, but the failure prediction is as accurate as in the former. The SVM model demonstrates similar errors, yet the data trends have a characteristic hyperbolic shape through the test, as a cubic kernel function is used.

The WSF35 test was already described in chapter 3.1 as very noisy; the torque signal had a very wide bandwidth, and it was problematic to notice any kind of trend. Passing the RMS to the DT and the SVM, the result in both cases has relatively large errors compared with the powder metal gear tests. They become lower in both cases at LS6 and LS7, although in the model trained with the DT this is less noticeable. The SVM model starts to underestimate the signal response earlier than the DT model; the former starts to show it at the end of LS7, while the latter does so in the middle of LS9. The DT model also has more periodic predictions with large errors, especially at the beginning and the end. Nevertheless, in the DT model it is easier to notice where the gear failure starts, as the true and predicted values match nearly perfectly, like in the previously discussed results. Both models trained with all inputs demonstrate a similar picture, except that the DT model has smaller errors in the middle and larger ones at the beginning and at the end just before the failure. The SVM model looks relatively stable, except for the random error in LS4, giving a predicted response of 3,5 that is totally outside the normal range of predicted values.

The same trend continues in the WSF38 results trained with RMS: the DT signal during the test tends to have more densely distributed errors, whereas in the SVM model they vary less and on a smaller scale. The failure is predicted more accurately by the DT model than by the SVM model, where a rather parabolic trend gives no clear indication that something has been damaged rapidly. The models trained with all inputs are also similar to the WSF35 results, yet in the DT case there are three significant response underestimations, where the true response is 0,9 but the predicted one is 0,6. The SVM model has some negative response estimations; at the beginning of LS7 one prediction is -0,2. Compared to the simpler SVM model, this one brings less smooth predictions, and at the last LS, when the failure occurs, the prediction is even vaguer.

The final gear set is the WGR set; during the test these gears did not demonstrate significant damage, so the highest load was applied multiple times until damage finally started to propagate. The WGR33 test results were trained in the following configurations: up to LS10v2, LS10v4, LS10v6, LS10v8 and LS10v10, whereas WGR34 was trained up to LS10v3, LS10v5, LS10v7, LS10v9 and LS10v11.

The WGR33 model results with measurements up to LS10v2 demonstrate a quite smooth and symmetric distribution of predicted values around the line of true values, especially the SVM trained with RMS only and with all inputs. LS10 and LS10v2 look very similar in all results, showing that something has happened in the gearbox behavior. The dataset up to LS10v4 shows different behavior: while the predictions until LS10 are very similar to the ones in the LS10v2 model, in all models there is a strong overreaction at LS10, followed by similar yet smaller patterns at the next LSes. The predicted response at LS10 jumps right up to the value 1, while in reality it is just over 0,6. While both DT models show a similar shape, the SVM model trained with RMS only demonstrates a smoother prediction transition than the model trained with all inputs. In the next step, namely up to LS10v6, the predictions up to LS10 remain similar, and the previously mentioned overreaction decreases to 0,85 in the SVM and 0,9 in the DT models. Later, starting from LS10v5, the prediction error increases again, but this time in the opposite direction, and that happens in all models. The models with data up to LS10v8 again demonstrate similar predictions until LS10 and then overestimate the response once more, but this time, especially in the SVM models, the area between the predicted values and the true value line increases, and the predicted responses form a parabolic shape. Finally, in the models with all LSes, significant damage can be detected only in the DT models, where the predicted and true responses have much smaller differences than in the models trained with the SVM. The SVM models even tend to overestimate the true value at the end.

The WGR34 model results with measurements up to LS10v3, unlike all WGR33 models, demonstrate weaker symmetry, especially when trained with RMS only. Both DT models, starting from LS10, tend to demonstrate better accuracy than the SVM models, but at the beginning of LS10v3 there is a sharp underestimation in all cases. In this situation the SVM model with all inputs has the lowest error; however, in the previous LS10v2 it makes a false overestimation of the response with a value close to 1,4. Next, in the models with data up to LS10v5, the SVM model with all inputs makes an unreasonable underestimation down to -1,6 at LS8, although in the previous step everything worked fine at this LS. Looking at the rest of the models, it is clear that they become less accurate from LS10 onwards, compared to the previous step. They all have a rapid underestimation at the start of LS10v3 and LS10v5. Then, in the models with data up to LS10v7, a similar trend continues, but this time at the last two LSes the DT model shows more accurate predictions than the SVM models. Moreover, the SVM model with all inputs again makes a strange underestimation, but this time at a different location, the middle of LS10v2. In the next step, LS10v9, the same underestimation appears at the same location. At LS10v8 and LS10v9 the DT starts to demonstrate larger errors, but still not as large as the SVM does. Finally, when the full datasets are trained, the error at LS10v10 and LS10v11 increases significantly, especially in the models trained with the SVM.


Figure 22. PMGR29 ML results

Figure 23. PMGR30 ML results

Figure 24. PMSF31 ML results


Figure 25. PMSF32 ML results

Figure 26. WSF35 ML results

Figure 27. WSF38 ML results


Figure 28. WGR33 up to LS10v2 ML results

Figure 29. WGR33 up to LS10v4 ML results

Figure 30. WGR33 up to LS10v6 ML results


Figure 31. WGR33 up to LS10v8 ML results

Figure 32. WGR33 ML results

Figure 33. WGR34 up to LS10v3 ML results


Figure 34. WGR34 up to LS10v5 ML results

Figure 35. WGR34 up to LS10v7 ML results

Figure 36. WGR34 up to LS10v9 ML results


Figure 37. WGR34 ML results

In figures 38 to 43 the performance parameters are presented; the results of the repeated WGR tests are presented in separate figures. As the PMGR30 test, trained with the SVM and all inputs, had the severe errors presented before, it also gives large values in the bar charts, so it will not be discussed further. To make the charts easier to read, the y axis was limited, so the rest of the tests are comparable, while this one is mentioned separately.

In the PM gear tests the RMSE tends to be approximately the same, around 0,1. However, it is easy to notice the trend that it decreases with more input parameters; the model trained with all inputs and the DT method has the lowest error in all cases. A similar trend is visible in the wrought steel tests, yet the errors are around 2 times higher than in the PM cases, reaching 0,26 in the WSF38 test trained with RMS only and the SVM. Perhaps this can be linked to the noisier signal response. Although the WGR tests give relatively similar errors, they differ from each other in a different way: the error tends to be lower in the models trained with the DT. A similar trend appears in the MAE chart, although here the previously mentioned PMGR30 model shows a relatively low absolute error compared to the other PMGR30 training combinations.

The R2 values tend to be from 0,8 to 0,92 in the PM tests, i.e. the accuracy of the prediction is from 80 to 92%. The difference between the training combinations is not very big, but the DT with all inputs again demonstrates a slight advantage. Meanwhile, in the WSF tests the data vary more, so the coefficients are lower: around 0,5 for WSF35 trained with RMS only and the DT, for WSF35 trained with all inputs and the SVM, and for WSF38 trained with RMS only and the DT. Better results, with coefficients around 0,65, are given by WSF35 trained with all parameters and the DT, and by WSF38 trained with all parameters and either the DT or the SVM. The WSF tests trained with RMS only and the SVM give the worst results: 0,38 for WSF35 and 0,18 for WSF38. The WGR tests are not far behind the PM tests, as they demonstrate quite good R2 values from 0,7 to 0,86, keeping in mind that these tests were quite noisy too. Here, too, the difference between the training methodologies remains the same: the DT models give better results, especially in the WGR34 test.

Looking at the results of the additional WGR33 models, it can be concluded that the further the test progresses, the larger the error becomes. There is no significant difference between the results of the models up to LS10v2 and up to LS10v4, but the error starts to increase slightly in the subsequent models. The opposite trend happens in the WGR34 test, where the error decreases from 0,155 in WGR34 up to LS10v3, trained with RMS only and the SVM, to 0,08 in WGR34 up to LS10v9, trained with all parameters and the DT. The difference between the training combinations remains similar to the WGR tests with data from the entire test: the WGR33 models trained with RMS only and the SVM tend to be less accurate than the models trained with all inputs and the DT. The only exception is WGR33 up to LS8 trained with RMS only and the DT, which gives a lower error than the other training combinations. In the WGR34 tests the DT models return lower errors in all cases. A similar trend appears in the MAE chart. The R2 values demonstrate very low variation in both the WGR33 and WGR34 tests; in the former they slightly decrease from 0,9 to 0,8 and in the latter they increase from 0,72 to 0,91. The relation between the different training combinations here is similar to the one discussed for the RMSE figure.

Figure 38. RMSE of main tests

Figure 39. MAE of main tests


Figure 40. R2 of main tests

Figure 41. RMSE of additional tests


Figure 42. MAE of additional tests

Figure 43. R2 of additional tests

4.3 Condition indicator calculations

To give the ML results a practical application, it was decided to calculate new condition indicators (CI), such as STD and the Signal to Noise Ratio (SNR), from the trained models. STD is a simple way to detect changes in transmission behavior, whereas SNR compares the desired signal to the background noise. It can be calculated with the following formula:

SNR = µ / σ (5)

where:
µ – mean value of the signal
σ – standard deviation of the signal

In figures 44 to 51 the mean value, STD and SNR of each gear test are presented, using each machine learning methodology. These parameters are calculated as moving values over a time window of 30 minutes. This window was selected during the supervision meetings with the intention of not having to be present at the test rig all the time, so the indicators can be checked occasionally to decide whether the rig should be shut down or not.
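A sketch of this post-processing is given below; it assumes the model prediction is stored in a vector yPred with one value per second, so that the 30-minute window corresponds to 1800 samples (with the 100-fold reduction of section 3.4 the window length shrinks accordingly):

```matlab
win    = 30*60;                  % 30 min window, assuming one sample per second
muMov  = movmean(yPred, win);    % moving mean of the predicted response
stdMov = movstd(yPred,  win);    % moving STD, the first new condition indicator
snrMov = muMov ./ stdMov;        % SNR = mu/sigma, equation (5)
```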

The PMSF gear tests, trained with both the DT and the SVM, demonstrate a similar STD trend: it starts at around 0,05, then increases up to LS4, decreases until LS8 and demonstrates totally different behavior at LS9, when the damage starts to propagate. The only difference is between the input variants: the model trained with RMS only forms a curved arc, whereas the model with all inputs shows a significant STD decrease at LS6 in the PMSF31 test and at LS5 in the PMSF32 test. The DT and SVM models have fundamentally different SNR results in terms of shape and absolute values, although they peak at approximately the same time. The SVM models show an SNR increase in the middle of the test, but when the wider picture is evaluated, it is clear that there is a bigger increase at the end; the SVM models should probably be evaluated over a longer perspective than the DT models, which suddenly demonstrate a severe value increase.

The difference between the input variants as well as between the ML methodologies in the PMGR tests is not as big as in the PMSF tests, and they also form an arc-like curve that returns to its starting value. After that, the STD decreases nearly to zero; in the PMGR30 test it then increases significantly, when there was severe damage with gear teeth torn off. The only exception is the model with all inputs and the SVM, which does not show this occurrence, which could lead to missing the failure. Regarding the SNR, there is a difference in the PMGR29 test trained with the SVM between the variant with all inputs and the one with RMS only: the latter seems to be more sensitive, as there is a rapid peak close to the beginning of LS9, while the model with all inputs demonstrates only a gradual response increase. As in the situation with the PMSF gears, the SNR otherwise behaves in a similar manner: the response rapidly increases at LS9, and the values of the models trained with RMS only and with all inputs overlap, so there is no difference between them there.

The WSF gear tests, in contrast to all PM gears, demonstrate different response behavior. There is a difference between the DT models trained with all inputs and with RMS only: the response of the former tends to form a downward arc during the test, whereas the latter forms an upward one. Interestingly, they end up at the same place before the last LS, then decrease nearly to zero and then give only a tiny response when the damage started to form. Both kinds of inputs in the SVM models behave in a more similar manner, especially in the WSF38 case. They start at approximately the same value, increase during LS7 and LS8, then decrease and finally increase once again in the last LS. The only difference is that the model with all inputs showed three peaks, in the middle of LS5 and LS8 and at the end of the test, so it is more sensitive to noise. The SNR values in the DT models demonstrate behavior similar to the previously discussed models, and the SVM models also peak in the last LS, yet in the WSF35 test the model trained with all inputs demonstrates a more rapid increase from LS5 to LS7 than the model trained with RMS only.

The WGR tests turn out to be the most difficult to evaluate, as the damage in them cannot be pinned to one exact LS. The STD response in both tests is very similar between the two input variants. In both the WGR33 DT and SVM models it tends to form an upward arc, then decreases back to the start value of around 0,02 at LS10, then there is a steep increase up to 0,15 at LS10v2, after which there is a decrease in both models at LS10v3. The response tends to increase gradually in the DT model, whereas in the SVM model it remains relatively unchanged, even at the end, when more significant damage started to form. Both the WGR34 DT and SVM models form a very similar trend and shape, increasing until LS10, followed by a dip for two LSes until the value increases significantly at LS10v2 and then decreases until the last LS, at which it reaches the highest point of the whole test. The SNR values in the WGR33 DT models respond only at the last LS, missing the mild damage that was forming during the previous LSes; in the WGR34 test the first response comes at LS6. Meanwhile, the SVM models for both tests are noisier, especially for the WGR33 test, which is difficult to evaluate, especially from LS10 onwards. In the WGR34 model the RMS-only data tend to peak gradually around LS10v7 and LS10v8, while the model trained with all inputs forms an arc-like upward curve in the same region; both responses behave in the same way at the last LS.


Figure 44. PMSF indicators, using DT

Figure 45. PMSF indicators, using SVM

Figure 46. PMGR indicators, using DT


Figure 47. PMGR indicators, using SVM

Figure 48. WSF indicators, using DT

Figure 49. WSF indicators, using SVM


Figure 50. WGR indicators, using DT

Figure 51. WGR indicators, using SVM


5 DISCUSSION AND CONCLUSIONS

5.1 Discussion

In total, 32 ML models were created during this thesis, four combinations per test. In parallel, the statistical parameters of each input parameter at each LS were calculated as a reference point for validating them. The statistical analysis shows that the speed makes the smallest contribution to condition indication, because the most important factors, CF and kurtosis, showed nearly no difference during the test.

The temperature demonstrated a mild increase of those factors in the middle of the test, increasing significantly at the end; this trend may be connected to a more intense response of the PID controller, although so far there is no certain explanation of why it happened. One possibility is mild wear of asperities on the gear teeth; such asperities are more common on wrought steel surfaces, yet the super-finished gears also demonstrated this behavior. The temperature gave more useful feedback in the WGR gear tests, when the highest load was applied multiple times until noteworthy damage started to form.

The torque is measured indirectly and depends on the resistance in the gear contact, so it provided the most descriptive data of all the measurements. As mentioned in Chapter 2, CF indicates local damage, and in most of the measurements it started to increase slightly from LS4. When it decreases, the damage becomes more evenly spread between the teeth, and that kind of behavior was demonstrated by the PM gears. However, the wrought steel gears did not follow the same trend; the CF of the WSF gears even decreased continuously up to the end of the test. The torque measurement is connected to the speed measurement, which is also reflected in the STD plot, where they demonstrate the same trend.

The ML models of the PM gear tests were simpler to evaluate than those of the wrought steel gears. This can be linked to the fact that the former gears in all cases experienced severe damage, especially in the PMGR30 test, while in the latter case, especially WGR, many teeth remained undamaged even after the last repeated LS. The training combinations in the PM tests show no significant difference in accuracy; however, the DT models give a more accurate response once the damage has started to propagate. It was noticed that often, when the model with all inputs has a low error in general, a few random errors spoil the whole picture; this is one of the biggest drawbacks when more data are passed to ML model training, although in the case of the DT it can often still deliver good overall performance.

The WGR gear tests, when all their data were used for training, did not demonstrate severe differences, although the DT models, trained with RMS only and with all parameters, gave slightly higher accuracy than the SVM models. When these tests were investigated further by creating models with the data after every two LSes, there were no big differences between the combinations, but in the WGR33 case the most accurate prediction was in the model trained with data up to LS10v2; after that, the predictions became less accurate. Perhaps this was the LS from which the damage slowly started to form, as the temperature response also changed then, as can be seen in the statistical analysis. When the WGR34 test data were trained in the same way, the highest accuracy was in the model with data up to LS10v3, especially when trained with the DT; later the accuracy decreased and increased once again in the model with data up to LS10v7, when significant damage started to form. The WSF gear tests contain a lot of noise coming from the torque measurements, yet the damage at the last LS is easier to notice with the DT, as it was in the case of the PM gears.

The new CIs extracted from the ML models can be useful from a practical perspective, as they demonstrate clear trends in the STD and SNR plots. In both PMGR tests the model trained with RMS gives a more stable trend than the model trained with all parameters. Regardless of whether the model is trained with the DT or the SVM, they show significant changes already at LS8; the SNR response of the model trained with the SVM is more detailed than that of the DT model, but it can sometimes overreact. The situation with both PMSF tests is very similar, where the STD plot indicates significant changes at LS9. Regarding the WGR tests, there is no big difference between any combination in the STD plots, but the SNR values, especially in the WGR33 test, calculated from the SVM models, are hard to interpret, as no clear changes are visible; here the SNR values from the DT models give sparser but clearer information. In the WSF condition indicators it is more difficult to spot the trends; perhaps the clearest are the ones extracted from the DT models, which demonstrate a clear decrease of the STD in the middle of LS10.

The tooth profile measurements in (5) show that in the PMGR test significant damage starts to form at LS8, close to the tooth root; in the PMSF test it happens at LS9 at the same location; in the WSF tests only mild damage formed at LS10; and in the WGR test early damage at the tooth root starts to form at LS9, with micropitting at the pure rolling point at LS10v6, ending with macropitting at LS10v10. The CIs in the PMGR tests also demonstrate a change in value at LS8, and the same applies to the PMSF tests, where the STD and SNR changed significantly at LS9, and to the WSF tests. Although at the time of writing it is unclear whether the profile measurements were made on the WGR33 or the WGR34 gear, the indicators in both WGR tests change their values significantly after LS9, but only the STD values calculated from the WGR34 DT models show a rapid decrease at LS10v6, when the micropitting started to form.

If the CIs are compared with the statistical analysis, the most remarkable similarity is with the temperature parameters: the CF and STD of the temperature formed a trend similar to the CIs of the PM and WSF gear tests, going upwards at LS4, peaking at LS6 and decreasing at LS8. Meanwhile, the CIs of the WGR tests show a trend similar to the temperature kurtosis and RMS, which increased after LS10.

After comparing all combinations of ML training strategies, it can be concluded that using multiple input parameters very often gave a more precise signal response, but it came at the expense of random errors that could cause false alarms. The SVM method demonstrated the best performance in the wrought steel gear tests, especially WSF, where the measurements were the noisiest, but even then it gave a less accurate response at the end of the test than the DT method. The same happened in the WGR tests, especially when they were evaluated further. In all PM gear tests the DT method consistently demonstrated higher accuracy. Although both methods have their own advantages in particular situations, the last and one of the most important factors is the time consumed. SVM is the more time-consuming method and can overcomplicate a stable test, even more so when additional inputs are introduced. Thus, the most useful of the tested combinations is the DT trained with RMS.

5.2 Conclusions

After obtaining the results, the research questions defined earlier can be answered:

1. Is it possible to detect fatigue using the existing sensor system?
Yes, it is possible. After comparison with reference data such as the tooth profile measurements and the statistical analysis, it can be stated that the CIs show a change in the transmission's behavior when damage starts to propagate.

2. How early can the pitting fatigue be detected?
In the PM gear case it can be stated that at the moment when the STD decreases back to the value it had at the start of the test, the damage will start to propagate very soon, so it is possible to act proactively. The most useful finding was the WGR34 STD indicator, which decreased immediately at LS10v6, when micropitting started to form at the pure rolling zone. This means that in this case it is possible to react early enough, before the micropitting turns into macropitting.

3. Which measured parameters give the most useful contribution to predicting the fatigue?
The temperature gives the most useful contribution due to its clear changes closer to the end of the test. However, it is more useful to pass at least one more measurement to let the algorithms make a more nuanced decision; in this project that would be the input torque. The rotation speed did not bring as much useful information as the other two measurements used in training, and the speed is indirectly related to the torque, so they repeat each other, as was noticeable in their kurtosis and STD plots.

4. How can the test rig be upgraded to get earlier or more precise pitting detection?
By introducing additional measurements; a higher sampling rate would also help to get more refined data from the existing sensors.


6 RECOMMENDATIONS AND FUTURE WORK

6.1 Recommendations

An increased sampling rate could give data with higher resolution; meanwhile, the economic factor should be kept in mind, finding the balance between the complexity and the expense of the model. Due to time limitations it was not done in this project, but for a better comparison the ML models should also be trained with temperature or torque only, to get an even better understanding of whether they contribute more to the models together or not.

For better decision making, more data sources would be desirable. Among the most widely used are vibration sensors, discussed in (5), (20) and (21). In (21) the vibrations were linked to the transmission error (TE) and noise emissions. The noise and vibrations were each measured with 3 microphones and 3 accelerometers. The test gearbox and microphones were shielded from ambient noise by a box made of sound-absorbing material, as initial measurements showed that the noise from the electric motor was louder than the gear noise, at least at low RPM. Accelerometer 1 registers vibrations in the axial direction; accelerometer 2 registers vibrations in a radial direction, at an angle corresponding to the direction of the gear mesh contact force; and accelerometer 3 registers vibrations at a right angle to the direction of accelerometer 2.

6.2 Future work

The first piece of work that could be done is to implement the developed methodology in practical use, writing software that checks the CI values and their trends every 30 minutes. Once done, the software should be tested in working conditions.

From a technical perspective, this fault detection could also be done by classification, sorting the gears into damaged or undamaged. Due to time limitations this was not done here, because it would require more time to prepare the data for classification.

As this thesis project was done with a limited amount of information and knowledge about AI, closer cooperation with a student with a Computer Science background should be established; perhaps the next master thesis connected to this topic should be done in cooperation with another department, for example the Department of Computer Science at KTH.


7 REFERENCES

1. The impact of wind energy on wildlife and environment. Peiser, B. s.l. : The Global Warming

Policy Foundation, 2019. ISBN 978-1-9160700-1-1.

2. A systematic literature review of machine learning methods applied to predictive maintenance. Thyago P. Carvalho, Fabrízzio A. A. M. N. Soares, Roberto Vita, Roberto da P. Francisco. s.l. : Elsevier, 2019, Vol. 137. 106024.

3. A Review of Gearbox Condition Monitoring Based on vibration Analysis. Yahya I. Sharaf-

Eldeen, Abdulrahman S. Sait. Melbourne, FL : s.n., 2011, Vol. 5.

4. Bergstedt, Edwin. A Comparative Investigation of Gear Performance Between Wrought and

Sintered Powder Metallurgical Steel. Stockholm : KTH Royal Institute of Technology, 2021.

ISBN 978-91-7873-821-2.

5. A quantitatively distributed wear-measurement method for spur gears during micro-pitting

and pitting tests. Jiachun Lin, Chen Teng, Edwin Bergstedt, Hanxiao Li, Zhaoyao Shi, Ulf

Olofsson. s.l. : Elsevier, 2021, Vol. 157. 0301-679X.

6. Beek, Anton van. Advanced engineering design. Delft : TU Delft, 2009. p. 35. ISBN-10

9081040618.

7. —. Advanced engineering design. Delft : TU Delft, 2009. p. 37. ISBN-10 9081040618.

8. —. Advanced engineering design. Delft : TU Delft, 2009. p. 38. ISBN-10 9081040618.

9. Lundteigen, Mary Ann and Rausand, Marvin. Failures and Failure Classification.

Trondheim : NTNU.

10. A systematic literature review of machine learning methods applied to predictive maintenance. Thyago P. Carvalho, Fabrízzio A. A. M. N. Soares, Roberto Vita, Roberto da P. Francisco, João P. Basto, Symone G. S. Alcalá. Computers & Industrial Engineering, s.l. : Elsevier, 2019, Vol. 137.

11. Machine Learning in Predictive Maintenance towards Sustainable Smart Manufacturing in

Industry 4.0. Zeki Murate Cinar, Abubakar Abdussalam Nuhu, Qasim Zeeshan, Orhan Korhan,

Mohammed Asmael, Babak Safaei. 19, s.l. : MDPI, 2020, Vol. 12.

12. Virtual sensing for gearbox condition monitoring based on extreme learning machine.

Jinjiang Wang, Yinghao Zheng, Lixiang Duan, Junyao Xie, Laibin Zhang. 2, Beijing : JVE

Journals, 2016, Vol. 19.

13. Condition Indicators for Gearbox Condition Monitoring Systems. P. Večeř, M. Kreidl, R.

Šmíd. Prague : Czech Technical University in Prague, 2005.

14. An artificial neural network-based condition monitoring method for wind turbines, with

application to the monitoring of the gearbox. P. Bangalore, S. Letzgus, D. Karlsson and M.

Patriksson. s.l. : Wiley Online Library, 2017, Vol. 20.

15. A Review of Gearbox Condition Monitoring Based on vibration Analysis Techniques

Diagnostics and Prognostics. Sharaf-Eldeen, Yahya I. s.l. : Springer, 2011, Vol. 5. ISSN 2191-

5644.

16. Machine Learning. [Online] IBM, 15 July 2020. [Cited: 19 April 2021.]

https://www.ibm.com/cloud/learn/machine-learning.

17. Lehrstuhl für Maschinenelemente. [Online] Technische Universität München. [Cited: 23

April 2021.] https://www.mw.tum.de/fzg/forschung/ausstattung/.

18. Gear micropitting initiation of ground and superfinished gears: Wrought versus pressed and

sintered steel. Edwin Bergstedt, Jiachun Lin, Michael Andersson, Ellen Bergseth, Ulf Olofsson.

s.l. : Tribology International, 2021, Vol. 160. 0301-679X.

19. Select Data and Validation for Regression Problem. [Online] Matlab. [Cited: 4 June 2021.]

https://www.mathworks.com/help/stats/select-data-and-validation-for-regression-problem.html.


20. Wind turbine gearbox failure and remaining useful life prediction using machine learning

techniques. James Carroll, Sofia Koukoura, Alasdair Mcdonald, Anastasis Charalambous. s.l. :

Wiley Online Library, 2018, Vol. 22.

21. Åkerblom, M and Pärssinen, M. A study of gear noise and vibration. Stockholm : KTH,

2002. ISSN 1400-1179 ; 2002:8.


APPENDIX A: FIGURES OF RAW DATA

Figure 49. Temperature in test gearbox.

Figure 50. Temperature in test gearbox.


Figure 51. Temperature in test gearbox.

Figure 52. Temperature in test gearbox.


Figure 53. Temperature in test gearbox.

Figure 54. Temperature in test gearbox.


Figure 55. Temperature in test gearbox.

Figure 56. Rotation speed in test gearbox.


Figure 57. Rotation speed in test gearbox.

Figure 58. Rotation speed in test gearbox.


Figure 59. Rotation speed in test gearbox.

Figure 60. Rotation speed in test gearbox.


Figure 61. Rotation speed in test gearbox.

Figure 62. Rotation speed in test gearbox.


Figure 63. Input torque in test gearbox.

Figure 64. Input torque in test gearbox.


Figure 65. Input torque in test gearbox.

Figure 66. Input torque in test gearbox.


Figure 67. Input torque in test gearbox.

Figure 68. Input torque in test gearbox.


Figure 69. Input torque in test gearbox.